Code Optimizations for Parallelization of Programs using Data Dependence Identifier

In a parallelizing compiler, code transformations help to reduce data dependencies and identify parallelism in code. In our earlier paper, we proposed a model called Data Dependence Identifier (DDI), in which a program P is represented as a graph G_P. Using G_P, we can identify the data dependencies in a program and also perform transformations such as dead code elimination and constant propagation. In this paper, we present polynomial-time algorithms for the loop invariant code motion, live range analysis, node splitting, and loop fusion transformations using DDI.

Keywords—Automatic parallelization; parallelizing compilers; code optimizations; data dependence; loop invariant code motion; node splitting; live range analysis; loop fusion


I. INTRODUCTION
Multicore processors have largely replaced single core processors; as a result, general purpose computers have become parallel systems. This change has posed many challenges to the software community in effectively utilizing the new hardware. Though the multiprocessing capability of operating systems improves the overall throughput of such hardware, the performance of serial programs remains the same even on multicore systems. To enhance the performance of a serial program on a multicore system, the instructions in the serial code have to be broken into groups such that each group can be run in parallel. One way to accomplish this task is manual conversion, which is a tedious job. Another way is to use a tool that converts a serial program to a parallel one.
Automating the process of serial to parallel conversion is called Automatic Parallelization, and a compiler that can perform automatic parallelization is typically referred to as a Parallelizing Compiler. The general process of serial to parallel program conversion has three steps: 1) perform code transformations in order to expose potential parallelism; 2) check for data dependencies in the code; 3) generate parallel code.
Two instructions I1 and I2 in a program are said to be data dependent if both instructions access the same memory location. The presence of data dependencies makes parallel execution of those instructions impossible. Code transformations help to eliminate some of the data dependencies, thereby creating scope to detect potential parallelism.
In our earlier paper [1], we proposed a model called Data Dependence Identifier (DDI), which can identify data dependencies among scalars, arrays, and pointers in a program. We also discussed how code optimizations like dead code elimination and constant propagation can be performed using DDI. In this paper, we discuss how the loop invariant code motion, live range analysis, node splitting, and loop fusion optimizations are performed using our model DDI.

II. RELATED WORKS
A compiler converts source code to an Intermediate Representation (IR) to perform code optimizations. The IR may differ from compiler to compiler. Generally, in traditional compilers for uniprocessor systems, instructions in the source code are represented in three address code format and as a Directed Acyclic Graph (DAG), and code optimizations are performed on this intermediate representation [3].
The Intermediate Representation is crucial for a parallelizing compiler. Here, we briefly discuss some parallelizing compilers and their Intermediate Representations.
• SUIF (Stanford University Intermediate Format): SUIF is a source to source parallelizing compiler that takes C or FORTRAN serial code as input and produces parallelized code to be run on a multiprocessor machine. SUIF's intermediate representation is a language-independent abstract syntax tree. Data flow analysis, data dependence analysis, scalar and array privatization, and reduction variable analysis are performed using the IR [4], [5], [6].
• Cetus: Cetus converts a serial program written in C to a parallel C program, to be run on a multicore system, by inserting OpenMP annotations. Cetus's intermediate representation is a hierarchical tree-based structure implemented in Java. The Cetus IR includes a set of iterators that traverse the IR to get the required information about loops, conditional statements, etc. For data dependence analysis, the GCD Test [7] and the Range Test [8] are used to identify data dependencies in arrays. Transformation techniques like scalar and array privatization, induction variable substitution, and reduction variable recognition are performed using the IR to eliminate some of the dependencies [9].
• Pluto: Pluto is a source to source compiler that transforms a serial C program to OpenMP C [10]. The Intermediate Representation of Pluto is based on the polyhedral model. Dependence analysis, loop transformations for parallelism, and optimization of data locality are performed using the IR [11]. Optimizations based on the polyhedral model are integrated into compilers like GCC and LLVM. The state of the art in Pluto includes loop fusion transformation using Fusion Conflict Graphs (FCG) [12] and verified code generation [13].
• The Intel compiler [16] automatically identifies the loops that can be parallelized and partitions the data accordingly.
Using our proposed model DDI, we have shown how data dependence analysis can be performed. We broaden the scope of our model by showing how the loop invariant code motion, live range analysis, node splitting, and loop fusion transformations can be applied using DDI.

III. DATA DEPENDENCE IDENTIFIER
In this section, we briefly discuss our model Data Dependence Identifier (DDI), which we proposed in our earlier paper [1]. The main objective of the DDI model is to represent a program as a graph in order to identify its data dependencies. Though many graphical representations of programs exist [14], [15], our representation takes a completely different perspective: we consider the variables in the program as nodes, and the edges between these variables are drawn based on the mode of access of the variables from memory. For this purpose, we have categorized the instructions in a program and parameterized the program as discussed in Sections A and B.

A. Categorization of Instructions based on Memory Accessibility
We categorize the instructions in a program broadly into Memory Access Instructions (MAI) and Non Memory Access Instructions (NMAI) based on the way they access memory. MAI instructions access memory to perform the required operation; arithmetic and conditional instructions fall under this category. NMAI instructions do not access memory at all; instructions like jump and break come under this category.
MA instructions are further classified into three categories: MA-READ, MA-WRITE, and MA-READWRITE. MA-READWRITE (MARW) instructions access memory for both read and write operations. For example, in the arithmetic instruction 'c = a + b', data is read from memory locations a and b and written to memory location c. MA-READ (MAR) instructions perform only a read operation and no write operation. For example, in the conditional instruction 'if (a > b)', data is only read from memory locations a and b, and the output is not written to any variable. Generally, in these instructions data is read from memory and sent to other Hardware Units (HU) in the computer system, such as the processor or output devices. MA-WRITE (MAW) instructions perform only a write operation and no read operation. For example, in the assignment instruction 'a = 5', a constant value is written to memory location a. Here we assume that 'a = 5' means the constant 5 is read from the programmer (PR) and written to the location a.
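The categorization above can be sketched as a small classifier. This is a hypothetical illustration, not part of the DDI implementation; the encoding of an instruction as a read set and a write set is our assumption.

```python
# Hypothetical sketch: classifying an instruction into the categories above,
# assuming each instruction is encoded as the sets of variables it reads
# and writes (constants from the programmer PR are handled separately).

def classify(read_set, write_set):
    """Return the memory-access category of an instruction."""
    if read_set and write_set:
        return "MARW"  # e.g. c = a + b : reads a, b and writes c
    if read_set:
        return "MAR"   # e.g. if (a > b) : reads a, b, writes nothing
    if write_set:
        return "MAW"   # e.g. a = 5 : writes a (the constant comes from PR)
    return "NMAI"      # e.g. jump, break : no memory access

print(classify({"a", "b"}, {"c"}))  # MARW
```

NMAI instructions fall out naturally as the case with empty read and write sets.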

B. Parameterization of Program
A program P is parameterized with I, V, HU, and PR, where:
• I represents the set of instructions in the program P.
• V represents the set of variables in the program P; for example, in the instruction 'c = a + b', data is read from variables a, b and the output is written to c, so a, b, c ∈ V.
• HU represents the set of hardware units i.e. input devices, output devices, processor and any other hardware unit in the computer system.
• P R is the set of constant values initialized in the program P by the programmer.
Therefore, we write P as P(I, V ∪ {HU, PR}, MAI).

C. Directed Graph Representation of a Program
In the DDI model, we represent a program as a graph. Here, we discuss how a program P is transformed into an equivalent directed graph called the graph of P, written G_P.
All instructions in a given program P are indexed sequentially with the positive integers 1, 2, ..., n: the first instruction in the program is indexed 1, the second instruction 2, and so on. For i_n ∈ I, we call the index of i_n = n.
• For every ordered pair of sets (R, W) ∈ MAI, we include the edges {(r, w) | ∀r ∈ R, w ∈ W}.
• Every edge in G_P is labeled with elements from the label set S, which contains the indices of the instructions in I. The labeling L : E → {1, 2, ..., n} is given by L((r, w)) = k if index([R, W]) = k, where (r, w) ∈ E, r ∈ R, w ∈ W.
We use the notation (., .) to represent the edges of the graph, and [., .] to indicate the pair of sets R, W representing a memory access instruction.
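The construction of G_P described above can be sketched as follows. This is a minimal sketch under our own encoding assumptions: each memory access instruction is modeled as a (read set, write set) pair, and the function name build_gp is illustrative.

```python
from collections import defaultdict

# Hypothetical sketch of constructing G_P: for the k-th memory access
# instruction, modeled as a pair of sets [R, W], add every edge (r, w)
# with r in R and w in W, labeled with the instruction index k.

def build_gp(instructions):
    """instructions: list of (read_set, write_set) pairs, indexed from 1."""
    edges = defaultdict(list)  # (r, w) -> list of instruction labels
    for k, (R, W) in enumerate(instructions, start=1):
        for r in R:
            for w in W:
                edges[(r, w)].append(k)
    return edges

# Program:  1: a = 5   2: b = 7   3: c = a + b
gp = build_gp([({"PR"}, {"a"}), ({"PR"}, {"b"}), ({"a", "b"}, {"c"})])
print(gp[("a", "c")])   # [3]
print(gp[("PR", "a")])  # [1]
```

Each edge thus records which instruction caused the read-to-write flow, which is the information the later transformations query.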
Loop representation in DDI: The statements within a loop are denoted i.k, where i.k represents instruction i when the loop is executed for the k-th time.

Nested loop representation in DDI:
Consider nested loops L1 and L2, where L1 is the outer loop and L2 is the inner loop. The statements within the nested loop are denoted m.x.p, where m.x.p represents instruction m when executed in the x-th iteration of loop L1 and the p-th iteration of loop L2.
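The labeling scheme for loop and nested-loop statements can be sketched as follows; the function names are illustrative, not part of the DDI model.

```python
# Hypothetical sketch of the loop labeling scheme: instruction i in
# iteration k is labeled "i.k", and instruction m in iteration x of the
# outer loop and iteration p of the inner loop is labeled "m.x.p".

def loop_labels(instr, iterations):
    """Labels of one loop statement across all iterations."""
    return [f"{instr}.{k}" for k in range(1, iterations + 1)]

def nested_labels(instr, outer_iters, inner_iters):
    """Labels of a nested-loop statement across all iteration pairs."""
    return [f"{instr}.{x}.{p}"
            for x in range(1, outer_iters + 1)
            for p in range(1, inner_iters + 1)]

print(loop_labels(5, 2))       # ['5.1', '5.2']
print(nested_labels(7, 2, 2))  # ['7.1.1', '7.1.2', '7.2.1', '7.2.2']
```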

IV. APPLICATIONS OF DDI
In our earlier paper [1], we showed how compiler optimizations like constant propagation, dead code elimination, and induction variable detection can be performed using our DDI model. In this section, we discuss how the loop invariant code motion, live range analysis, node splitting, and loop fusion optimizations can be performed using our DDI model.

A. Loop Invariant Code Motion
Loop invariant code is a statement inside a loop whose computed value does not change across the iterations 1, 2, ..., m of the loop, where m is the number of iterations of the loop. Such a statement can be moved outside the loop.

Let G_l be the sub-graph of G_P whose nodes pertain to the variables accessed inside the loop. Suppose a loop statement writes to a node u and there is no edge, labeled within the loop, into any source node of u:
=⇒ u ∈ W implies that there exists u_k ∈ R such that the value of u_k is written to u, and there is no change in u_k throughout the loop.
=⇒ By the algorithm, G_l will have the edge (u_k, u) and there will be no edge of the form (v, u_k), where u, u_k ∈ N.G_l, since the nodes of G_l pertain to the variables inside the loop. Hence the statement is loop invariant.

In example 4(a), the Loop Statements are LS = {5, 6, 7}. For node x, L((t, x)) = [5.1, 5.2] and L((PR, x)) = [5.1, 5.2]; there exist no other edges to node x with label 5. The only sources of input to node x with label 5 are nodes t and PR. As PR is a constant value, the only variable input is node t. The incoming edge to node t is L((PR, t)) = 1, and 1 ∉ LS. Therefore, we conclude that statement 5 is loop invariant code.
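The invariance test discussed above can be sketched directly on the edge encoding. This is a minimal sketch, assuming edges are labeled with sets of plain statement indices (iteration suffixes such as 5.1, 5.2 are collapsed to the statement index for simplicity); the function name is illustrative.

```python
# Hypothetical sketch of the loop-invariance test: a loop statement stmt
# writing a node is invariant when every source feeding that node under
# label stmt, other than the constant source PR, is never written inside
# the loop.

def is_loop_invariant(edges, stmt, loop_stmts):
    """edges: dict (src, dst) -> set of statement indices."""
    sources = {u for (u, v), labels in edges.items() if stmt in labels}
    for u in sources - {"PR"}:
        # u must not be written to by any statement of the loop
        for (_, dst), labels in edges.items():
            if dst == u and labels & loop_stmts:
                return False
    return True

# 1: t = 3 (outside loop)   5 (in loop): x = t + 1
edges = {("PR", "t"): {1}, ("t", "x"): {5}, ("PR", "x"): {5}}
print(is_loop_invariant(edges, 5, {5, 6, 7}))  # True
```

If t were also written inside the loop, the test would return False, mirroring the LS membership check in the example above.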

B. Live Range Analysis
For parallelizing a program, the statements in the program have to be grouped in such a way that the groups can be executed in parallel and give the same output as sequential execution. One way to accomplish this task is to use the live range information of the variables in the program. We define the live range of a variable in a program as follows.
Definition IV.1. A variable u is said to be live in instruction i_k of a program P if a Read or Write operation is performed on u in i_k.
Consider the program in example 5: y is live in statement 2 and not live in statements 1, 3, 4.
Definition IV.2. The Live Range Analysis of a program P is a description of whether each variable is live or not in each instruction of P.
In example 4(a), x is live in statements 1, 2, 3, 4; y is live in statement 2; k is live in statement 1; a is live in statement 4. This information, presented for every variable and every statement, is usually referred to as the live range analysis of P. Now, we propose a method to compute the live ranges of variables in a program using our DDI model.
Theorem 2. Given a program P and the corresponding graph G_P, if a node u ∈ G_P has an incoming or outgoing edge with label i_k, then variable u is live in statement i_k of program P.
Proof: Suppose u is live in instruction i_k.
=⇒ By Definition IV.1, either a Read or a Write operation is performed on u in i_k.
=⇒ By Algorithm 1, there will be an outgoing edge with label i_k from u (if a Read operation is performed on u in i_k) or an incoming edge with label i_k to u (if a Write operation is performed on u in i_k).
=⇒ There is an edge incident on u with label i_k.
Based on the above theorem, we propose an algorithm to compute live range of variables in a program.
In line 2 of Algorithm 3, A is initialized as an empty two-dimensional array.
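Theorem 2 yields a direct computation: collect, for each variable node, the labels of all incident edges. The following is a minimal sketch, not the paper's Algorithm 3 itself; the edge encoding follows the earlier sketches, with iteration suffixes omitted.

```python
from collections import defaultdict

# Hypothetical sketch of live range computation via Theorem 2: a variable
# is live in every statement whose label appears on an edge incident to
# its node. PR and HU are not program variables, so they are dropped.

def live_ranges(edges):
    """edges: dict (src, dst) -> set of statement labels."""
    live = defaultdict(set)
    for (u, v), labels in edges.items():
        live[u] |= labels
        live[v] |= labels
    live.pop("PR", None)
    live.pop("HU", None)
    return dict(live)

# Simplified encoding of a four-statement program
edges = {("PR", "k"): {1}, ("k", "x"): {1}, ("x", "y"): {2}, ("x", "a"): {4}}
print(sorted(live_ranges(edges)["x"]))  # [1, 2, 4]
```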

C. Node Splitting
If a variable is live throughout the program, there exists data dependence among the statements. The data dependence has to be broken in order to group the statements such that each group can execute in parallel. One approach to breaking a data dependence cycle is Node Splitting. Node splitting creates a copy of a node (a duplicate node) in the graph and divides the edges between the two nodes to produce an analogous graph. This transformation limits the live range of a variable to a section of the code, hence producing code more amenable to parallelization. Consider the program in example 4: variable x is live in instructions {1, 2, 3, 4}; after splitting x into x and n, x is live in instructions {1, 2} and n in {3, 4}.
Definition IV.3. A node u ∈ V.G_P is said to be a splitting node if the sub-graph that involves u can be split into two sub-graphs G_P1 and G_P2 such that the combined functionality of the programs P1 and P2 is equivalent to the functionality of P.
Theorem 3. Let P be a program and G_P the graph that corresponds to P. A node u ∈ V.G_P is a splitting node of G_P if and only if there exists a sub-graph G_P^u involving the node u of the form shown in Fig. 1.

Proof:
Part I:
Hypothesis: u is a splitting node of G_P.
Claim: There exists a sub-graph G_P^u of G_P as shown in Fig. 1.
By Definition IV.3, G_P^u can be split into two sub-graphs G_P1^u and G_P2^u such that the functionality of P1 and P2 is equivalent to that of P.
=⇒ There exists a program P in which variable u is used more than once (t_1 times) for writing and more than once (t_2 times) for reading, such that t_1 ≥ t_2.
Without losing any generality, assume t_1 = t_2 = t.
=⇒ u is used t times for writing and t times for reading.
=⇒ Again, without loss of generality, for every write to u there is a corresponding read from u.
=⇒ Since u is a splitting node, we have a sequence of t blocks in P such that in each block, u is written first and then read.
=⇒ A snippet of P that involves u consists of this sequence of t blocks, and the corresponding G_P^u is as shown in Fig. 1; hence the claim.

Part II proof:
Hypothesis: In P, there exists a sub-graph G_P^u as shown in Fig. 1.
Claim: u is a splitting node.
By the hypothesis, P has a sequence of t blocks, and in each block a value is read from u after a value is written to u.
=⇒ G_P^u can be split into sub-graphs G_Pi^u, i = 1, 2, ..., t. In each P_i, a value is written to u first and then u is read.
=⇒ From G_P^u, we infer that in P a value is written first and then read next.
=⇒ By observation, the total functionality of the snippets P_i (i = 1 to t) is the same as the functionality of P; the functionality of the other statements (which do not involve u) in P remains as such in the P_i as well.
=⇒ We have a node u in G_P^u which can be split into a sequence of sub-graphs G_Pi^u, i = 1, 2, ..., t, such that the total functionality of P_i, i = 1, 2, ..., t, is equivalent to the functionality of P.
=⇒ u is a splitting node of G_P.
Corollary 3.1. Let P be a program and G_P the graph that corresponds to P. P is parallelizable if and only if G_P has at least one splitting node.
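At source level, node splitting amounts to renaming the later occurrences of the split variable, which is what shrinks its live range. The following is a hypothetical sketch under our (read set, write set) statement encoding; the split point and fresh name are chosen by the caller.

```python
# Hypothetical sketch of node splitting as variable renaming: occurrences
# of var in statements after split_at are renamed to a fresh variable,
# limiting var's live range to the first section of the code.

def split_variable(stmts, var, split_at, fresh):
    """Rename var to fresh in all statements with index > split_at."""
    out = []
    for idx, (R, W) in enumerate(stmts, start=1):
        if idx > split_at:
            R = {fresh if v == var else v for v in R}
            W = {fresh if v == var else v for v in W}
        out.append((R, W))
    return out

# x live in statements 1..4; after splitting at statement 2,
# 'n' takes over in statements 3..4 (as in example 4)
stmts = [({"PR"}, {"x"}), ({"x"}, {"y"}), ({"PR"}, {"x"}), ({"x"}, {"a"})]
print(split_variable(stmts, "x", 2, "n"))
```

After the split, x is incident only to labels 1 and 2 and n only to labels 3 and 4, matching the live ranges {1, 2} and {3, 4} discussed above.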

D. Loop Fusion
Loop fusion is a technique in which two loops are merged, or fused, to form a single loop. Generally, a loop iterates through the same set of instructions to perform a task. Two loops L1 and L2 can be fused if the number of iterations and the terminating conditions of both loops match and the semantics of the code remain intact after merging. Fusing loops reduces the number of loops present in a program, thereby mitigating the overhead involved in parallelizing many loops.
Definition IV.4. Loop Fusion is a technique by which the statements of multiple loops are merged into a single loop such that the semantics of the code remain intact.
Let L1 be the first loop and L2 the second loop in sequence. Then L1 and L2 can be fused if the following conditions are satisfied:
• Loops L1 and L2 have the same looping conditions and iterate the same number of times.
• The dependencies that exist between the statements of L1 and L2 do not change the semantics of the code.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 10, 2021

So, the feasibility of fusion depends on the dependencies that exist between the loops. Hence, we first discuss the different dependencies that can exist between two loops. Two loops L1 and L2 are said to be data dependent if a dependence exists between any statement of L1 and any statement of L2. Let statements S_i ∈ L1 and S_j ∈ L2. The following dependencies may exist between S_i and S_j.
Definition IV.5. No dependence: L1 and L2 are said to have no dependence if the statements S_i and S_j do not access any common memory location.
Definition IV.6. Flow dependence: If memory location M is accessed for a 'Write' operation in statement S_i and the same location M is accessed for a 'Read' in statement S_j, then flow dependence exists between statements S_i and S_j.
In example 6 (case i), a 'Write' operation is performed on an index location of array A in the first loop, and the same index location of A is 'Read' in the second loop. So, flow dependence exists between loops L1 and L2.
Definition IV.7. Anti dependence: If memory location M is accessed for a 'Read' operation in statement S_i and the same location M is accessed for a 'Write' in statement S_j, then anti dependence exists between statements S_i and S_j.
In example 7 (case i), a 'Read' operation is performed on an index location of array x in the first loop and a 'Write' operation on the same index location of x in the second loop, so anti dependence exists between loops L1 and L2.
Definition IV.8. Loop carried forward dependence: If memory location M is accessed by an iteration of statement S_i and the same location M is accessed in a later iteration of statement S_j, then loop carried forward dependence exists between statements S_i and S_j.
In example 8, the value of A[1] computed in the first iteration of the first loop is read in the second iteration of the second loop, showing the existence of loop carried forward dependence between L1 and L2.
Definition IV.9. Loop carried backward dependence: If memory location M is accessed in an iteration of statement S_j and the same location M is accessed in later iterations of statement S_i, then loop carried backward dependence exists between statements S_i and S_j.
In example 9, the value of A[2] computed in the second iteration of the first loop is read in the first iteration of the second loop, showing the presence of loop carried backward dependence.
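The distinction between the two loop-carried dependences reduces to comparing the iteration numbers on the write and the subsequent read. A minimal sketch (function name illustrative):

```python
# Hypothetical sketch: with a write in iteration write_iter of the first
# loop and a read of the same location in iteration read_iter of the
# second loop, the dependence is forward when read_iter >= write_iter
# (fusion preserves order) and backward when read_iter < write_iter
# (the fused loop would read before the write happens).

def carried_dependence(write_iter, read_iter):
    if read_iter >= write_iter:
        return "forward"
    return "backward"

print(carried_dependence(1, 2))  # forward  (example 8: A[1] written in 5.1, read in 9.2)
print(carried_dependence(2, 1))  # backward (example 9: A[2] written in 4.2, read in 8.1)
```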

Identification of data dependencies using DDI and feasibility of fusion
So far we have discussed the dependencies that can exist between two loops. Here, we discuss the types of dependence between L1 and L2 that do not affect the fusion of L1 and L2. We propose four theorems with which we can identify, using our DDI, the type of dependence that exists between L1 and L2 and the feasibility of fusing them.
Theorem 4. Let G_P be the corresponding graph of P, with G_L1 and G_L2 the sub-graphs of G_P that correspond to loops L1 and L2 respectively. L1 and L2 are said to have no dependence if there exists no edge (u, v) for any u ∈ G_L1 and v ∈ G_L2. Such loops L1 and L2 can be fused.

Proof: Assume that there exists an edge (u, v) with u ∈ G_L1 and v ∈ G_L2.
=⇒ An edge (u, v) means that a value is Read from variable u in some statement i_rj (say) and Written to variable v. As v ∈ G_L2, v is accessed by some statement i_sk (say) in L2.
=⇒ By Algorithm 1, there exists a statement i_rj in L1 which Reads a value from variable u, and that value is Written to a variable v accessed in statement i_sk of L2.
=⇒ Thus, if there exists an edge (u, v) with label i_rj, then there exists a statement that accesses the variables u and v of both loops, i.e., a dependence between L1 and L2.
Conversely, when there are no common nodes between L1 and L2 and no edges between the nodes of L1 and L2, no dependence exists between the two loops, in which case merging of the loops is possible.
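The Theorem 4 check is a disjointness test on the two sub-graphs. A minimal sketch, with the edge encoding assumed from the earlier sketches:

```python
# Hypothetical sketch of the Theorem 4 check: L1 and L2 have no dependence
# (and can be fused) when they share no nodes and no edge connects a node
# of G_L1 to a node of G_L2 in either direction.

def no_dependence(edges, nodes_l1, nodes_l2):
    """edges: dict keyed by (src, dst); nodes_l*: sets of variable nodes."""
    if nodes_l1 & nodes_l2:
        return False
    for (u, v) in edges:
        if (u in nodes_l1 and v in nodes_l2) or (u in nodes_l2 and v in nodes_l1):
            return False
    return True

# Two loops touching disjoint variables: fusable
edges = {("PR", "a"): {1}, ("a", "b"): {2}, ("PR", "c"): {5}, ("c", "d"): {6}}
print(no_dependence(edges, {"a", "b"}, {"c", "d"}))  # True
```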
Theorem 5. Given a program P with loop L1 having statements {i_r1, i_r2, ..., i_rn} and loop L2 having statements {i_s1, i_s2, ..., i_sm}, let G_P be the corresponding graph of P, and G_L1 and G_L2 the sub-graphs of G_P that correspond to L1 and L2 respectively. Let there exist edges L((u, v)) = i_rj.n and L((v, w)) = i_sk.m such that i_rj ∈ L1 and i_sk ∈ L2. Then:
1) L1 and L2 are said to have flow dependence if i_sk > i_rj.
2) L1 and L2 can be fused if m ≥ n and i_sk > i_rj.
3) L1 and L2 cannot be fused if n > m.
Proof: Let G_L1 and G_L2 be the sub-graphs of G_P that correspond to loops L1 and L2 in P, and let there exist edges L((u, v)) = i_rj.n and L((v, w)) = i_sk.m in G_P.
Hypothesis 1: There is a flow dependence if i_sk > i_rj.
=⇒ An incoming edge with label i_rj.n to node v means a value is Written to v in the n-th iteration of instruction i_rj; an outgoing edge with label i_sk.m from node v means a value is Read from v in the m-th iteration of instruction i_sk.
=⇒ The condition i_sk > i_rj means that first a value is Written to v in the n-th iteration of instruction i_rj and then Read from v in the m-th iteration of instruction i_sk.
=⇒ By Definition IV.6, flow dependence exists between statements i_rj and i_sk, since the Read operation follows the Write operation.
Hypothesis 2: L1 and L2 can be fused if m ≥ n and i_sk > i_rj.
=⇒ As m ≥ n and i_sk > i_rj, even after fusing, the Read operation on variable v in i_sk still follows the Write in i_rj, so the semantics of the code are unchanged.
Hypothesis 3: L1 and L2 cannot be fused if n > m.
=⇒ If L1 and L2 are fused, then as n > m, variable v is read in the m-th iteration of instruction i_sk even before v is written in the n-th iteration of instruction i_rj, i.e., the older value of v is read, not the updated one. Therefore, merging loops L1 and L2 changes the semantics of the code.
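Conditions 2) and 3) of Theorem 5 reduce to comparing the statement and iteration numbers carried on the two edges. A minimal sketch (function name illustrative):

```python
# Hypothetical sketch of the Theorem 5 feasibility test for flow
# dependence: with L((u, v)) = i_rj.n (v written in L1) and
# L((v, w)) = i_sk.m (v read in L2), fusion preserves semantics
# exactly when m >= n, since the read still follows the write.

def flow_fusable(write_stmt, write_iter, read_stmt, read_iter):
    assert read_stmt > write_stmt, "flow dependence: write in L1 precedes read in L2"
    return read_iter >= write_iter  # m >= n keeps read-after-write order

print(flow_fusable(4, 1, 8, 2))  # True:  written in 4.1, read in 8.2
print(flow_fusable(4, 2, 8, 1))  # False: 8.1 would read before 4.2 writes
```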
Theorem 6. Given a program P with loop L1 having statements {i_r1, i_r2, ..., i_rn} and loop L2 having statements {i_s1, i_s2, ..., i_sm}, let G_P be the corresponding graph of P, and G_L1 and G_L2 the sub-graphs of G_P that correspond to L1 and L2 respectively. Let there exist edges L((u, v)) = i_rj.n and L((w, u)) = i_sk.m such that i_rj ∈ L1 and i_sk ∈ L2. Then:
1) L1 and L2 are said to have anti dependence if i_sk > i_rj.
2) L1 and L2 can be fused if m ≥ n and i_sk > i_rj.
3) L1 and L2 cannot be fused if n > m, i.e., if the outgoing edge from u (the Read in L1) carries an iteration number greater than that of the incoming edge to u (the Write in L2).
Proof: Let G_L1 and G_L2 be the sub-graphs of G_P that correspond to loops L1 and L2 in P, and let there exist edges L((u, v)) = i_rj.n and L((w, u)) = i_sk.m in G_P.
Hypothesis 1: There is an anti dependence if i_sk > i_rj.
=⇒ An outgoing edge with label i_rj.n from node u means variable u is Read in the n-th iteration of instruction i_rj. An incoming edge with label i_sk.m from node w to node u means variable u is Written in the m-th iteration of instruction i_sk.
=⇒ The condition i_sk > i_rj means that first a value is Read from u in the n-th iteration of instruction i_rj and then Written from w to u in the m-th iteration of instruction i_sk; by Definition IV.7, anti dependence exists between i_rj and i_sk.
Hypothesis 2: L1 and L2 can be fused if m ≥ n and i_sk > i_rj.
=⇒ As m ≥ n and i_sk > i_rj, even after fusing, the Write operation on variable u in i_sk still follows the Read in i_rj, so the semantics of the code are unchanged.
=⇒ By definition IV.4, statements of loops L1 and L2 can be fused if the semantics of the code is intact.
Hypothesis 3: L1 and L2 cannot be fused if n > m.
=⇒ Before fusing, in loop L1 a Read operation is performed on variable u in the n-th iteration of instruction i_rj, and in loop L2 variable u is Written in the m-th iteration of instruction i_sk.
=⇒ If L1 and L2 are fused, then as n > m, variable u is Written in the m-th iteration of instruction i_sk even before u is Read in the n-th iteration of instruction i_rj, i.e., a new value is written to u even before the older value is Read.
=⇒ As the semantics of the code change, loop fusion is not possible if n > m.
If anti dependence exists between loops L1 and L2, i.e., if an instruction in L1 accesses a memory location M for Read and the same location is accessed by an instruction in L2 for Write, merging of the loops is possible if M is still accessed for Read first and then for Write after fusing. In example 7 (case i), L((x[2], A[4])) = 4.2 and L((B[2], x[2])) = 8.2 say that x[2] is read in iteration 2 of instruction 4 and written in iteration 2 of instruction 8. As the value is Read in the first loop and Written in the second loop in the same iteration, merging the loops does not change the semantics of the code. Therefore, edges L((u, v)) = n.i and L((w, u)) = m.j, where n ∈ L1, m ∈ L2, and j ≥ i, represent an anti dependence for which merging is possible.
If anti dependence exists between loops L1 and L2, merging of the loops is not possible if the order in which a memory location M is accessed, for 'Read' in L1 and then for 'Write' in L2, is not preserved after fusing. Example 7 (case ii) shows an anti dependence where memory location x is accessed for 'Write' many times in the first loop and for 'Read' in the second loop; merging of the loops is not possible.
If loop carried forward dependence exists between loops L1 and L2, merging of the loops is possible. In example 8, L((x, A[1])) = 5.1 and L((A[1], B[2])) = 9.2 say that memory location A[1] is written in iteration 1 of instruction 5 and read in iteration 2 of instruction 9. As a value computed in iteration i of the first loop is accessed in iteration j of the second loop, where j ≥ i, merging the loops does not change the semantics of the code.

If loop carried backward dependence exists between loops L1 and L2, merging of the loops is not possible. If a memory location is accessed by an iteration of a statement S_i in loop L1 and the same location is accessed by earlier iterations of statement S_j in loop L2, then when the loops are merged, S_j in L2 will access the memory location first and then S_i, which changes the order of execution. As the semantics of the code would change, fusing the loops is not possible when loop carried backward dependence is present between L1 and L2.
In example 9, L((x, A[2])) = 4.2 and L((A[2], B[1])) = 8.1 say that memory location A[2] is written in iteration 2 of instruction 4 and read in iteration 1 of instruction 8, i.e., A[2] is read even before it is updated. As a value computed in iteration i of the first loop is accessed in iteration j of the second loop, where j < i, merging the loops can change the semantics of the code. Thus, we conclude that fusion of two loops L1 and L2 is possible even though dependencies such as flow dependence (case i), loop carried forward dependence, and anti dependence (case i) exist between the statements of the loops.

V. CONCLUSION AND FUTURE WORK
In this paper, we have shown how to perform various optimizations, namely loop invariant code motion, live range analysis, node splitting, and loop fusion, through a graphical representation of the program called the Data Dependence Identifier (DDI). For each optimization, we have investigated the condition that has to be satisfied by the DDI (the graphical representation of the program P) so that the optimization can be performed, leading to an effective parallelization of P.
All the optimizations discussed are justified and validated conceptually with a sequence of rigorous theorems. These theoretical proofs also establish the correctness of the proposed algorithms, with which one can easily perform the optimizations of a program.
Salient Features: Though there are many graphical representations for a program, our representation, referred to as DDI, is unique in the sense that the variables of P are used as nodes and the edges between the nodes reflect the nature of access (read/write) of the variables from memory.
Thus, the salient features of our work are:
• a novel graphical representation of a program.
• performing almost all the standard optimizations with one model, DDI.
Future Work: The optimization procedures are the main components of the parallelization process. With our DDI model, in this paper we have only established the various optimization procedures. Validating the optimization procedures alone on benchmark programs would not yield significant insight, just as the performance of the individual components of a machine may not reveal the performance of the machine built from those components. For this reason, experimental validation of a full DDI-based parallelizer is proposed as future work, to be taken up separately.
Further, one can investigate extending DDI from an optimizer to a parallelizer. The extension of DDI to a full-fledged parallelizer and the empirical comparison of a DDI-based parallelizer with contemporary parallelizers are two major directions worth considering as future work following the present paper.