Efficient Interprocedural Array Data-Flow Analysis for Automatic Program Parallelization

Junjie Gu, Member, IEEE Computer Society, and Zhiyuan Li, Member, IEEE Computer Society

Abstract: Since sequential languages such as Fortran and C are more machine-independent than current parallel languages, it is highly desirable to develop powerful parallelization tools which can generate parallel codes, automatically or semiautomatically, targeting different parallel architectures. Array data-flow analysis is known to be crucial to the success of automatic parallelization. Such an analysis should be performed interprocedurally and symbolically, and it often needs to handle the predicates represented by IF conditions. Unfortunately, such a powerful program analysis can be extremely time-consuming if not carefully designed. How to enhance the efficiency of this analysis to a practical level remains an issue largely untouched to date. This paper presents techniques for efficient interprocedural array data-flow analysis and documents experimental results of its implementation in a research parallelizing compiler. Our techniques are based on guarded array regions, and the resulting tool runs faster, by one or two orders of magnitude, than other similarly powerful tools.

Index Terms: Parallelizing compiler, array data-flow analysis, interprocedural analysis, symbolic analysis.


1 INTRODUCTION

PROGRAM execution speed has always been a fundamental concern for computation-intensive applications. To exceed the execution speed provided by state-of-the-art uniprocessor machines, programs need to take advantage of parallel computers. Over the past several decades, much effort has been invested in the efficient use of parallel architectures. In order to exploit the parallelism inherent in computational solutions, progress has been made in the areas of parallel languages, parallel libraries, and parallelizing compilers. This paper addresses the issue of automatic parallelization of practical programs, particularly those written in imperative languages such as Fortran and C.

Compared to current parallel languages, sequential languages such as Fortran 77 and C are more machine-independent. Hence, it is highly desirable to develop powerful automatic parallelization tools which can generate parallel codes targeting different parallel architectures. It remains to be seen how far automatic parallelization can go. Nevertheless, much progress has been made recently in the understanding of its future directions. One important finding by many is the critical role of array data-flow analysis [10], [17], [20], [32], [33], [37], [38], [42]. This aggressive program analysis not only can support array privatization [29], [33], [43], which removes spurious data dependences and thereby enables loop parallelization, but it can also support compiler techniques for memory performance enhancement and efficient message-passing deployment.

Few existing tools, however, are capable of interprocedural array data-flow analysis. Furthermore, no previous studies have paid much attention to the issue of the efficiency of such analysis. Quite understandably, rapid prototyping tools, such as SUIF [23] and Polaris [4], do not emphasize compilation efficiency and they tend to run slowly. On the other hand, we also believe it to be important to demonstrate that aggressive interprocedural analysis can be performed efficiently. Such efficiency is important for the development of large programs, especially when intensive program modification, recompilation, and retesting are conducted. Taking an hour or longer to compile a program, for example, would be highly undesirable for such programming tasks.

In this paper, we present techniques used in the Panorama parallelizing compiler [35] to enhance the efficiency of interprocedural array data-flow analysis without compromising its capabilities in practice. We focus on the kind of array data-flow analysis useful for array privatization and loop parallelization. These are important transformations which can benefit program performance on various parallel machines. We make the following key contributions in this paper:

. We present a general framework to summarize and to propagate array regions and their access conditions, which enables array privatization and loop parallelization for Fortran-like programs which contain nonrecursive calls, symbolic expressions in array subscripts and loop bounds, and IF conditions that may directly affect array privatizability and loop parallelizability.

. We show a hierarchical approach to predicate handling, which reduces the time complexity of analyzing the predicates which control different execution paths.


. J. Gu is with Sun Microsystems, Inc., UMPK16-303, 901 San Antonio Road, Palo Alto, Calif. 94303. E-mail: [email protected].

. Z. Li is with the Department of Computer Sciences, Purdue University, West Lafayette, IN 47907. E-mail: [email protected].

Manuscript received 22 July 1998; revised 21 Feb. 1999; accepted 12 May 1999. Recommended for acceptance by M. Jazayeri. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 110086.

0098-5589/00/$10.00 © 2000 IEEE


. We present experimental results to show that reducing unnecessary set difference operations contributes significantly to the speed of the array data-flow analysis.

. We measure the analysis speed of Panorama when applied to application programs in the Perfect benchmark suite [3], a suite that is well known to be difficult to parallelize automatically. As a way to show the quality of the parallelized code, we also report the speedups of the programs parallelized by Panorama and executed on an SGI Challenge multiprocessor. The results show that Panorama runs faster, by one or two orders of magnitude, than other known tools of similar capabilities.

We note that, in order to achieve program speedup, additional program transformations often need to be performed in addition to array data-flow analysis, such as reduction-loop recognition, loop permutation, loop fusion, advanced induction variable substitution, and so on. Such techniques have been discussed elsewhere, and some of them have been implemented both in Polaris [16] and, more recently, in Panorama. The techniques which are already implemented consume only an insignificant portion of the total analysis and transformation time, since array data-flow analysis is the most time-consuming part. Hence, we do not discuss their details in this paper.

The rest of the paper is organized as follows: In Section 2, we present background materials for interprocedural array data-flow analysis and its use for array privatization and loop parallelization. We point out the main factors in such an analysis which can potentially slow down the compiler drastically. In Section 3, we present a framework for interprocedural array data-flow analysis based on guarded array regions. In Section 4, we discuss several implementation issues. We also briefly discuss how array data-flow analysis can be performed on programs with recursive procedures and dynamic arrays. In Section 5, we discuss the effectiveness and the efficiency of our analysis. Experimental results are reported to show the parallelization capabilities of Panorama and its high time efficiency. We compare related work in Section 6 and conclude in Section 7.

2 BACKGROUND

In this section, we briefly review the idea of array privatization and give reasons why an aggressive interprocedural array data-flow analysis is needed for this important program transformation.

2.1 Array Privatization

If a variable is modified in different iterations of a loop, writing conflicts result when the iterations are executed by multiple processors. Quite often, array elements written in one iteration of a DO loop are used in the same iteration before being overwritten in the next iteration. Such arrays usually serve as a temporary working space within an iteration, and the array values in different iterations are unrelated. Array privatization is a technique that creates a distinct copy of an array for each processor, such that the storage conflict can be eliminated without violating program semantics. Parallelism in the program is increased. Data access time may also be reduced, since privatized variables can be allocated to local memories. Fig. 1 shows a simple example where the DOALL loop after transformation is to be executed in parallel. Note that the value of A(1) is copied from outside of the DOALL loop, since A(1) is not written within the DOALL loop. If the values written to A(k) in the original DO loop are live at the end of the loop nest, i.e., the values will be used by statements after the loop nest, additional statements must be inserted in the DOALL loop which, in the last loop iteration, will copy the values of A1(k) to A(k). In this example, we assume A(k) are dead after the loop nest, hence the absence of the copy-out statements.

Fig. 1. Simple example of array privatization.
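Fig. 1 itself is not reproduced in this transcript. As a rough stand-in, the following C/OpenMP sketch (ours, not from the paper; the array names, bounds, and computations are invented) shows the same pattern: the work array is written before it is read in every iteration, so giving each thread a private copy removes the storage conflict, and the one upwardly exposed element is copied in explicitly.

#include <stddef.h>
#define N 100

void privatized(double b[N][N], const double a_in[N]) {
    double a[N];                       /* private work array, like A1 in Fig. 1 */
    int i, k;
#pragma omp parallel for private(a, k)
    for (i = 0; i < N; i++) {
        a[0] = a_in[0];                /* copy-in: a[0] is never written below,
                                          mirroring A(1) in the paper's text   */
        for (k = 1; k < N; k++)
            a[k] = 2.0 * b[i][k];      /* written before any use below */
        for (k = 0; k < N; k++)
            b[i][k] = a[k] + a[N-1-k]; /* uses values of this iteration only */
    }
}

Because a[] is dead after the loop (as assumed for A(k) above), no copy-out statements are needed.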

Practical cases of array privatization can be much more complex than the example in Fig. 1. The benefit of such transformation, on the other hand, can be significant. Early experiments with manually performed program transformations showed that, without array privatization, program execution speed on an Alliant FX/80 machine with eight vector processors would be slowed down by a factor of five for the programs MDG, OCEAN, TRACK, and TRFD in the well-known Perfect benchmark suite [15]. Recent experiments with automatically transformed codes running on an SGI Challenge multiprocessor show even more striking effects of array privatization on a number of Perfect benchmark programs [16].

2.2 Data Dependence Analysis vs. Array Data-Flow Analysis

Conventional data dependence analysis is the predecessor of all current work on array data-flow analysis. In his pioneering work, Kuck defines flow dependence, antidependence, and output dependence [26]. While the latter two are due to multiassignments to the same variable in imperative languages, the flow dependence is defined between two statements, one of which reads the value written by the other. Thus, the original definition of flow dependence is precisely a reaching-definition relation. Nonetheless, early compiler techniques were not able to compute array reaching definitions and, therefore, for a long time, flow dependence has been conservatively computed by asserting that one statement depends on another if the former may execute after the latter and both may access the same memory location. Thus, the analysis of all three kinds of data dependences reduces to the problem of memory disambiguation, which is insufficient for array privatization.

Array data-flow analysis refers to computing the flow of values for array elements. For the purpose of array privatization and loop parallelization, the parallelizing compiler needs to establish the fact that, as in the case in Fig. 1, no array values are written in one iteration but used in another.

2.3 Interprocedural Analysis

In order to increase the granularity of parallel tasks and, hence, the benefit of parallel execution, it is important to parallelize loops at outer levels. Unfortunately, such outer-level loops often contain procedure calls. A traditional method to deal with such loops is in-lining, which substitutes procedure calls with the bodies of the called procedures. Illinois' Polaris [4], for example, uses this method. Unfortunately, many important compiler transformations increase their consumed time and storage quadratically, or at even higher rates, with the number of operations within individual procedures. Hence, there is a severe limit on the feasible scope of in-lining. It is widely recognized that, for large-scale applications, a better alternative is often to perform interprocedural summary analysis instead of in-lining. Interprocedural data dependence analysis has been discussed extensively [21], [24], [31], [40]. In recent years, we have seen increased efforts on array data-flow analysis [10], [17], [20], [32], [33], [37], [38], [42]. However, few tools are capable of interprocedural array data-flow analysis without in-lining [10], [20], [23].

2.4 Complications of Array Data-Flow Analysis

In reality, a parallelizing compiler not only needs to analyze the effects of procedure calls, but it may also need to analyze relations among symbolic expressions and among branching conditions.

The examples in Fig. 2 illustrate such cases. In these three examples, privatizing the array A makes it possible to parallelize the I loops. Fig. 2a shows a simplified loop from the MDG program (routine interf) [3]. It is a difficult example which requires a certain kind of inference between IF conditions. Although both A and B are privatizable, we will discuss A only, as B is a simple case. Suppose that the condition kc .NE. 0 is false and, as a result, the last loop K within loop I gets executed and A(6 : 9) gets used. We want to determine whether A(6 : 9) may use values written in previous iterations of loop I. The condition kc .NE. 0 being false implies that, within the same iteration of I, the statement kc = kc + 1 is not executed. Thus, B(K) .GT. cut2 is false for all K = 1, ..., 9 of the first DO loop K. This fact further implies that B(K+4) .GT. cut2 is false for K = 2, ..., 5 of the second DO loop K, which ensures that A(6 : 9) gets written before its use in the same iteration of I. Therefore, A is privatizable in loop I.

Fig. 2b illustrates a simplified version of a segment of the ARC2D program (routine filerx) [3]. The condition .NOT. p is invariant for DO loop I. As a result, if A(jmax) is not modified in one iteration, thus exposing its use, then A(jmax) is not modified in any iteration. Therefore, A(jmax) never uses any value written in previous iterations of I. Moreover, it is easy to see that the use of A(jlow : jup) is not upward exposed. Hence, A is privatizable and loop I is a parallel loop. In this example, the IF condition being loop-invariant makes sure that there is no loop-carried flow dependence. Otherwise, whether a loop-carried flow dependence exists in Fig. 2b depends upon the IF condition.

Fig. 2c shows a simplified version of a segment of the OCEAN program (routine ocean) [3]. Interprocedural analysis is needed for this case. In order to privatize A in the I loop, the compiler must recognize the fact that, if a call to out in the I loop does use A(1 : m), then the call to in in the same iteration must modify A(1 : m), so that the use of A(j) must take the value defined in the same iteration of I. This requires checking whether the condition x > SIZE in subroutine out can infer the condition x > SIZE in subroutine in. For all three examples above, it is necessary to perform symbolic manipulation. Previous and current work suggests that the handling of conditionals, symbolic analysis, and interprocedural analysis should be provided in a powerful compiler.

Fig. 2. More complex examples of privatizable arrays.

Because array data-flow analysis must be performed over a large scope to deal with the whole set of subroutines in a program, the algorithms for information propagation and for symbolic manipulation must be carefully designed. Otherwise, this analysis will simply be too time-consuming for practical compilers. To handle these issues simultaneously, we have designed a framework, which is described next.

3 ARRAY DATA-FLOW ANALYSIS BASED ON GUARDED ARRAY REGIONS

In traditional frameworks for data-flow analysis, at each meet point of a control flow graph, data-flow information from different control branches is merged under a meet operator. Such merged information typically does not distinguish information from different branches. The meet operator can therefore be said to be path-insensitive. As illustrated in the last section, path-sensitive array data-flow information can be critical to the success of array privatization and, hence, loop parallelization. In this section, we present our path-sensitive analysis that uses conditional summary sets to capture the effect of IF conditions on array accesses. We call the conditional summary sets guarded array regions (GARs).

3.1 Guarded Array Regions

Our basic unit of array reference representation is a regular array region.

Definition. A regular array region of array A is denoted by A(r1, r2, ..., rm), where m is the dimension of A and each ri, i = 1, ..., m, is a range of the form (l : u : s), with l, u, s being symbolic expressions. The triple (l : u : s) represents all values from l to u with step s; it is denoted simply by (l) if l = u and by (l : u) if s = 1. An empty array region is represented by ∅, and an unknown array region is represented by the special value unknown.

The regular array region defined above is more restrictive than the original regular section proposed by Callahan and Kennedy [6]. The regular array region does not contain any interdimensional relationship. This makes set operations simpler. However, a diagonal or a triangular section of an array cannot be represented exactly. For instance, for an array A(1 : n, 1 : n), a diagonal A(i, i), i = 1, ..., n, and a triangular section A(i, j), i = 1, ..., n, j ≤ i, are approximated by the same regular array region: A(1 : n, 1 : n).

Regular array regions cover the most frequent cases in real programs, and they seem to have an advantage in efficiency when dealing with the common cases. The guards in GARs (defined below) can be used to describe more complex array sections, although their primary use is to describe the control conditions under which regular array regions are accessed.

Definition. A guarded array region (GAR) is a tuple (P, R) which contains a regular array region R and a guard P, where P is a predicate that specifies the condition under which R is accessed. A guard whose predicate cannot be written explicitly is marked as unknown. If both P and R are unknown, we say that the GAR (P, R) is unknown. Similarly, if either P is False or R is ∅, we say that (P, R) is ∅.
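To make the two definitions concrete, here is one plausible C layout for ranges, regions, and GARs; the field names and the Expr placeholder for symbolic expressions are our own illustration, not Panorama's actual data structures.

#include <stdbool.h>

typedef struct Expr Expr;          /* symbolic integer expression (opaque here) */

typedef struct {                   /* range (l : u : s); any item may be unknown */
    Expr *l, *u, *s;
    bool  unknown;                 /* the whole range is unknown */
} Range;

#define MAXDIM 7                   /* Fortran 77 arrays have at most 7 dimensions */

typedef struct {                   /* regular array region A(r1, ..., rm) */
    int   array_id;
    int   m;                       /* dimensionality of A */
    Range r[MAXDIM];
    bool  empty, unknown;
} Region;

typedef struct {                   /* guarded array region (P, R) */
    Expr  *guard;                  /* predicate P */
    bool   guard_unknown;          /* "guard cannot be written explicitly" */
    Region region;
} GAR;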

In order to preserve as much precision as possible, we try to avoid marking a whole array region as unknown. If a multidimensional array region has only one dimension that is truly unknown, then only that dimension is marked as unknown. Also, if only one item in a range tuple (l : u : s), say u, is unknown, then we write the tuple as (l : unknown : s).

Let a program segment, n, be a piece of code with a unique entry point and a unique exit point. We use the results of set operations on GARs to summarize two essential pieces of array reference information for n, which are listed below:

. UE(n): the set of array elements which are upwardly exposed in n, i.e., those elements that are used in n and take values defined outside n.

. MOD(n): the set of array elements written within n.

In addition, the following sets, which are also represented by GARs, are used to describe array references in a DO loop l with its body denoted by b:

. UEi(b): the set of array elements used in an arbitrary iteration i of DO loop l that are upwardly exposed to the entry of the loop body b.

. UEi(l): the subset of array elements in UEi(b) which are further upwardly exposed to the entry of the DO loop l.

. MODi(b): the set of array elements written in loop body b for an arbitrary iteration i of DO loop l. Where no confusion results, this may simply be denoted as MODi.

. MODi(l): the same as MODi(b).

. MOD<i(b): the set of array elements written in all of the iterations prior to an arbitrary iteration i of DO loop l. Where no confusion results, this may simply be denoted as MOD<i.

. MOD<i(l): the same as MOD<i(b).

. MOD>i(b): the set of array elements written in all of the iterations following an arbitrary iteration i of DO loop l. Where no confusion results, this may simply be denoted as MOD>i.

. MOD>i(l): the same as MOD>i(b).

Take Fig. 2c for example. For loop J of subroutine in, UEj is empty and MODj equals (True, B(j)). Therefore, MOD<j is (1 < j, B(1 : j-1 : 1)) and MOD>j is (j < mm, B(j+1 : mm : 1)). The MOD for loop J is (1 ≤ mm, A(1 : mm : 1)) and, hence, the MOD of subroutine in is (x > SIZE ∧ 1 ≤ mm, A(1 : mm : 1)). Similarly, UEj for loop J of subroutine out is (True, B(j)) and the UE for the same loop is (1 ≤ mm, A(1 : mm : 1)). Lastly, the UE of subroutine out is (x > SIZE ∧ 1 ≤ mm, A(1 : mm : 1)).

Our data-flow analysis requires three kinds of operations on GARs: union, intersection, and difference. These operations are in turn based on union, intersection, and difference operations on regular array regions, as well as on logical operations on predicates. Next, we will first discuss the operations on array regions, then those on GARs.

3.2 Operations on Regular Array Regions

As the operands of the region operations must belong to the same array, we will drop the array name from the array region notation hereafter whenever there is no confusion. Given two regular array regions R1 = A(r1_1, r1_2, ..., r1_m) and R2 = A(r2_1, r2_2, ..., r2_m), where m is the dimension of array A, we define the following operations:

. R1 ∩ R2: For the sake of simplicity of presentation, here we assume steps of 1 and leave the discussion of other step values to Section 4. Let r1_i = (l1_i : u1_i : 1) and r2_i = (l2_i : u2_i : 1), i = 1, ..., m, and let Di = r1_i ∩ r2_i, so that Di = (max(l1_i, l2_i) : min(u1_i, u2_i) : 1). Then R1 ∩ R2 equals ∅ if Di = ∅ for some i, and (D1, D2, ..., Dm) otherwise. Note that we do not keep max and min operators in a regular array region. Therefore, when the relationship of the symbolic expressions cannot be determined even after a demand-driven symbolic analysis is conducted, we mark the intersection as unknown.

. R1 ∪ R2: Since these regions are symbolic, care must be taken to prevent false regions from being created by union operations. For example, knowing R1 = (m : p : 1) and R2 = (p+1 : n : 1), we have R1 ∪ R2 = (m : n : 1) if and only if both R1 and R2 are valid. This can be guaranteed nicely by imposing validity predicates into guards, as we did in [20]. In doing so, the union of two regular regions can be computed without concern for the validity of the two regions. Since this introduces additional predicate operations that we try to avoid, we usually keep the union of two regions without merging them until they, like constant regions, are known to be valid.

. R1 − R2: For an m-dimensional array, the result of the difference operation is generally 2m regular regions, if each range difference results in two new ranges. This representation can be quite complex for large m; however, it is useful to describe the general formulas of the set difference operation. Suppose R2 ⊆ R1 (otherwise, use R1 − R2 = R1 − R1 ∩ R2). We first define R1(k) and R2(k), k = 1, ..., m, as the last k ranges within R1 and R2, respectively. According to this definition, we have R1(m) = (r1_1, r1_2, r1_3, ..., r1_m) and R2(m) = (r2_1, r2_2, r2_3, ..., r2_m), while R1(m-1) = (r1_2, r1_3, ..., r1_m) and R2(m-1) = (r2_2, r2_3, ..., r2_m). The computation of R1 − R2 is given recursively by the following formula:

R1(m) − R2(m) =
  (r1_1 − r2_1)                                                    if m = 1
  (r1_1 − r2_1, r1_2, ..., r1_m) ∪ (r2_1, R1(m-1) − R2(m-1))       if m > 1

The following are some examples of difference operations:

. (1 : 100) − (2 : 100) = (1)

. (1 : 100, 1 : 100) − (3 : 99, 2 : 100)
  = ((1 : 100) − (3 : 99), (1 : 100)) ∪ ((3 : 99), (1 : 100) − (2 : 100))
  = (((1 : 2) ∪ (100)), (1 : 100)) ∪ ((3 : 99), (1))
  = (1 : 2, 1 : 100) ∪ (100, 1 : 100) ∪ (3 : 99, 1)

In order to avoid splitting regions due to difference operations, we routinely defer solving difference operations, using a new data structure called a GARWD to temporarily represent the difference results. As we shall show later, using GARWDs keeps the summary computation both efficient and exact. GARWDs are defined in the following subsection.

3.3 Operations on GARs and GARWDs

Given two GARs, T1 = (P1, R1) and T2 = (P2, R2), we have the following:

. T1 ∩ T2 = (P1 ∧ P2, R1 ∩ R2).

. T1 ∪ T2: The most frequent cases in union operations are of two kinds: if P1 = P2, the union becomes (P1, R1 ∪ R2); if R1 = R2, the result is (P1 ∨ P2, R1). If the two array regions cannot be safely combined due to unknown symbolic terms, we keep the two GARs in a list without merging them.

. T1 − T2 = (P1 ∧ P2, R1 − R2) ∪ (P1 ∧ ¬P2, R1). As discussed previously, R1 − R2 may consist of multiple array regions, making the actual result of T1 − T2 potentially complex. However, as we shall explain via an example, difference operations can often be canceled by intersection and union operations. Therefore, we do not solve the difference T1 − T2 unless the result is a single GAR, or until the last moment when the actual result must be solved in order to finish data dependence tests or array privatizability tests. When the difference is not yet solved by the above formula, it is represented by a GARWD.

Definition. A GAR with a difference list (GARWD) is a set defined by two components: a source GAR and a difference list. The source GAR is an ordinary GAR as defined above, while the difference list is a list of GARs. The GARWD denotes all the members of the source GAR which are not in any GAR on the difference list. It is written as {source GAR, <difference list>}.
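In code, deferral might look like the following sketch (ours, not Panorama's): instead of expanding T1 − T2 into (P1 ∧ P2, R1 − R2) ∪ (P1 ∧ ¬P2, R1), the subtrahend is simply appended to the difference list.

#include <stdlib.h>

typedef struct GAR GAR;                 /* as sketched in Section 3.1 */
typedef struct GARList { GAR *g; struct GARList *next; } GARList;

typedef struct {                        /* {source GAR, <difference list>} */
    GAR     *source;
    GARList *diff;
} GARWD;

/* Defer T1 − T2: record T2 on the difference list instead of solving the
 * difference; later intersections and unions often cancel it outright. */
GARWD gar_sub_deferred(GAR *t1, GARList *old_diff, GAR *t2) {
    GARList *cell = malloc(sizeof *cell);
    cell->g = t2;
    cell->next = old_diff;
    GARWD w = { t1, cell };
    return w;
}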


The following examples show how to use the above formulas:

. (T, (1 : 100)) ∩ (p, (2 : 101)) = (p, (2 : 100))

. (T, (1 : 50)) ∪ (T, (51 : 100)) = (T, (1 : 100));
  (T, (1 : 100)) ∪ (p, (1 : 100)) = (T, (1 : 100))

. (p, (2 : 99)) − (T, (1 : 100)) = (p, (2 : 99) − (1 : 100)) = (p, ∅) = ∅

. (T, (1 : 100)) − (T, (2 : 99)) = {(T, (1 : 100)), <(T, (2 : 99))>}, which is a GARWD. Note that, if we cannot further postpone solving of the above difference, we can solve it to (T, (1 : 100) − (2 : 99)) = (T, (1) ∪ (100)) = (T, (1)) ∪ (T, (100)).

3.3.1 GARWD Operations

Operations between two GARWDs, and between a GARWD and a GAR, can be easily derived from the above. For example, consider a GARWD gwd = {g1, <g2>} and a GAR g. The result of subtracting g from gwd is the following:

1. {g3, <g2>}, if (g1 − g) = g3, or
2. {g1, <g2>}, if (g − g2) = ∅, or
3. {g1, <g2, g>}, otherwise,

where g3 is a single GAR. The first formula is applied if the result of (g1 − g) is exactly a single GAR g3. Because g1 and g may be symbolic, the difference result may not be a single GAR; hence, we have the third formula. Similarly, the intersection of gwd and g is:

1. {g4, <g2>}, if (g1 ∩ g) = g4, or
2. ∅, if (g − g2) = ∅, or
3. unknown, otherwise,

where g4 is also a single GAR.

The union of two GARWDs is usually kept in a list, but it can be merged in some cases. Some concrete examples are given below to illustrate the operations on GARWDs:

. {(T, (1 : 100)), <(T, (n : m))>} − (T, (2 : 100))
  = {((T, (1 : 100)) − (T, (2 : 100))), <(T, (n : m))>}
  = {(T, (1)), <(T, (n : m))>}

. {(T, (1 : 100)), <(T, (n : m))>} ∩ (p, (101 : 200))
  = {((T, (1 : 100)) ∩ (p, (101 : 200))), <(T, (n : m))>}
  = {∅, <(T, (n : m))>} = ∅

. {(T, (1 : 100)), <(T, (n : m))>} ∪ {(T, (1 : 100)), <>}
  = {(T, (1 : 100)), <>}

Fig. 3 is an example showing the advantage of using GARWDs. The right-hand side is the summary result for the body of the outer loop, where the subscript i in UEi and in MODi indicates that these two sets belong to an arbitrary iteration i. UEi is represented by a GARWD. For simplicity, we omit the guards whose values are true in the example. To recognize array A as privatizable, we need to prove that no loop-carried data flow exists. The set of all mods within those iterations prior to iteration i, denoted by MOD<i, is equal to MODi. In theory, MOD<i = ∅ if i = 1, which nonetheless does not invalidate the analysis. Since both GARs in the MOD<i list are in the difference list of the GARWD for UEi, it is obvious that the intersection of MOD<i and UEi is empty and that, therefore, array A is privatizable. We implement this by assigning each GAR a unique region number, shown in parentheses in Fig. 3, which makes the intersection a simple integer operation.

Fig. 3. An example of GARWDs.
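The region-number shortcut can be pictured as below (our sketch of the mechanism, not Panorama's code): when a GAR's number already appears on the GARWD's difference list, the intersection is empty by construction, and no region or guard arithmetic is needed.

typedef struct { int region_no; /* guard, region, ... */ } GAR;
typedef struct { const GAR *src; const GAR *diff[8]; int ndiff; } GARWD;

/* Returns 1 when (gwd ∩ g) is provably empty because g was already
 * subtracted from gwd; 0 means "fall back to the full region test". */
int isect_trivially_empty(const GARWD *gwd, const GAR *g) {
    for (int k = 0; k < gwd->ndiff; k++)
        if (gwd->diff[k]->region_no == g->region_no)
            return 1;                  /* integer comparison only */
    return 0;
}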

As shown above, our difference operations, which are used during the calculation of UE sets, do not result in the loss of information. This helps to improve the effectiveness of our analysis. On the other hand, intersection operations may result in unknown values due to intersections of sets containing unknown symbolic terms. A demand-driven symbolic evaluator is invoked to determine the symbolic values or the relationship between symbolic terms. If the intersection result cannot be determined by the symbolic evaluator, it is marked as unknown.

In our array data-flow framework based on GARs, intersection operations are performed only at the last step, when our analyzer tries to conduct dependence tests and array privatization tests; at that point, a conservative assumption must be made if an intersection result is marked as unknown. The intersection operations, however, are not involved in the propagation of the MOD and UE sets and, therefore, they do not affect the accuracy of those sets.

3.4 Computing UE and MOD Sets

The UE and MOD information is propagated backward from the end to the beginning of a routine or a program segment. Through each routine, these two sets are summarized in one pass and the results are saved. The summary algorithm is invoked on demand for a particular routine, so it will not summarize a routine unless necessary. Parameter mapping and array reshaping are done when the propagation crosses routine boundaries.

To facilitate interprocedural propagation of the summary information, we adopt a hierarchical supergraph (HSG) to represent the control flow of the entire program. The HSG augments the supergraph proposed by Myers [36] by introducing a hierarchy among nested loops and procedure calls. An HSG contains three kinds of nodes: basic block nodes, loop nodes, and call nodes. A DO loop is represented by a loop node, which is a compound node whose internal flow subgraph describes the control flow of the loop body. A procedure call site is represented by a call node which has an outgoing edge pointing to the entry node of the flow subgraph of the called procedure and an incoming edge from the unique exit node of the called procedure. Due to the nested structures of DO loops and routines, a hierarchy of control flow is derived among the HSG nodes, with the flow subgraph at the highest level representing the main program. The HSG resembles the HSCG used by the PIPS project for parallel task scheduling [25]. Fig. 4 shows an example of the HSG. Note that the flow subgraph of a routine is never duplicated for different calls to the same routine, unless multiple versions of the called routine are created by the compiler to enhance its potential parallelism. More details about the HSG and its implementation can be found in [18], [20].

Fig. 4. Example of the HSG.
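A minimal rendering of the three node kinds (our sketch; the real HSG carries much more bookkeeping) could be:

typedef enum { HSG_BASIC_BLOCK, HSG_LOOP, HSG_CALL } HsgKind;

typedef struct HsgNode {
    HsgKind          kind;
    struct HsgNode **succ;          /* flow edges at this nesting level */
    int              nsucc;
    struct HsgNode  *body_entry;    /* HSG_LOOP: entry of the loop-body subgraph */
    struct Routine  *callee;        /* HSG_CALL: the called routine; its subgraph
                                       is shared, never duplicated per call site */
} HsgNode;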

During the propagation of the array data-flow information, we use MOD_IN(n) to represent the array elements that are modified in nodes which are forwardly reachable from n, at the same or lower HSG level as n, and we use UE_IN(n) to represent the array elements whose values are imported to n and are used in the nodes forwardly reachable from n. Suppose a DO loop l, with its body denoted by b, is represented by a loop node N, and the flow subgraph of b has the entry node n. We have UEi(b) equal to UE_IN(n) and UE(N) equal to the expansion of UEi(b) (see below). Similarly, we have MODi(b) equal to MOD_IN(n) and MOD(N) equal to the expansion of MODi(b). The MOD and MOD_IN sets are represented by a list of GARs, while the UE and UE_IN sets are represented by a list of GARWDs.

Fig. 5a and Fig. 5b show how the MOD_IN and UE_IN sets are propagated, in the direction opposite to the control flow, through a basic block S and through a flow subgraph for an IF statement with then-branch S1 and else-branch S2, respectively. During the propagation, variables appearing in certain summary sets may be modified by assignment statements; in that case, their right-hand side expressions substitute for the variables. For simplicity, such variable substitutions are not shown in Fig. 5. Fig. 5b shows that, when summary sets are propagated to IF branches, the IF conditions are put into the guards on each branch, indicated by the function padd() in the figure.

Fig. 5. Computing summary sets for basic control flow components.
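Fig. 5 itself is not reproduced in this transcript. In the notation of this section, the equations it depicts plausibly read as follows (our reconstruction; MOD_IN_out and UE_IN_out denote the sets arriving from below the node):

MOD_IN(S) = MOD(S) ∪ MOD_IN_out
UE_IN(S) = UE(S) ∪ (UE_IN_out − MOD(S))

and, for IF (c) THEN S1 ELSE S2, with padd(c, ·) conjoining c to every guard in a set:

MOD_IN = padd(c, MOD_IN(S1)) ∪ padd(¬c, MOD_IN(S2))
UE_IN = padd(c, UE_IN(S1)) ∪ padd(¬c, UE_IN(S2))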

The whole summary process is quite straightforward, except that the computation of UE sets for loops needs further analysis to support summary expansion, as illustrated by Fig. 6.

Fig. 6. Expansion of loop summaries.

Given a DO loop with index I, I ∈ (l : u : s), suppose UEi and MODi are already computed for an arbitrary iteration i. We want to calculate the UE and MOD sets for the entire I loop, following the formulas below:

MOD = Σ_{i ∈ (l:u:s)} MODi

UE = Σ_{i ∈ (l:u:s)} (UEi − MOD<i)

MOD<i = Σ_{j ∈ (l:u:s), j < i} MODj,   MOD<1 = ∅

The summation Σ above is also called an expansion or projection, denoted by proj() in Fig. 6, which is used to eliminate i from the summary sets. The UE calculation given above takes two steps. The first step computes (UEi − MOD<i), which represents the set of array elements which are used in iteration i and are exposed to the outside of the whole I loop. The second step projects the result of the first step against the domain of i, i.e., the range (l : u : s), to remove i. The expansion of a list of GARs or a list of GARWDs consists of the expansion of each GAR and each GARWD in the list. Since a detailed discussion of expansion would be tedious, we provide only a guideline in this paper (see the Appendix).
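As a small worked instance of these formulas (our own toy example, echoing Fig. 1): suppose iteration i of DO I = 1, n first writes A(2 : m) and then reads A(1 : m). Then MODi = (True, A(2 : m)) and UEi = (True, A(1)), so MOD<i = (1 < i, A(2 : m)). Since A(1) never meets A(2 : m), UEi − MOD<i = (True, A(1)), and projecting over i ∈ (1 : n : 1) gives UE = (1 ≤ n, A(1)) and MOD = (1 ≤ n, A(2 : m)). Only A(1) is upwardly exposed for the whole loop, exactly the element that Fig. 1 copies into the private array.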

4 IMPLEMENTATION CONSIDERATIONS AND EXTENSIONS

4.1 Symbolic Analysis

Symbolic analysis handles expressions which involve unknown symbolic terms. It is widely used in symbolic evaluation or abstract interpretation to discover program properties such as the values of expressions, relationships between symbolic expressions, etc. Symbolic analysis requires the ability to represent and manipulate unknown symbolic terms. Among several expression representations, a normal form is often used [7], [9], [22]. The advantage of a normal form is that it gives the same representation for congruent expressions. In addition, the symbolic expressions encountered in array data-flow analysis and dependence analysis are mostly integer polynomials, and operations on integer polynomials, such as the comparison of two polynomials, are straightforward. Therefore, we adopt integer polynomials as our representation for symbolic expressions. Our normal form, which is essentially a sum of products, is given below:

e = Σ_{i=1..N} ti · Ii + t0                    (1)

where each Ii is an index variable and ti is a term which is given by (2) below:

ti = Σ_{j=1..Mi} pj,   i = 1, ..., N           (2)

pj = cj · Π_{k=1..Lj} xjk,   j = 1, ..., Mi    (3)

where pj is a product, cj is an integer constant (possibly an integer fraction), xjk is an integer variable but not an index variable, N is the nesting depth of the loop containing e, Mi is the number of products in ti, and Lj is the number of variables in pj.

Take the program segments in Fig. 7 as examples. For subroutine SUB1, the MOD set of statement S1 contains a single GAR: (True, A(N1+I1+I2)). The MOD set of DO loop I2 contains (True, A(N1+I1+1 : N1+I1+100)). The MOD set of DO loop I1 contains (True, A(N1+2 : N1+200)). Lastly, the MOD set of the whole subroutine contains (True, A(N2*N3+N4+2 : N2*N3+N4+200)). For subroutine SUB2, the MOD set of statement S2 contains a single GAR: (True, A(I1)). The MOD set of DO loop I1 contains (N1 > 1, A(1 : N1-1)). The MOD set of the IF statement contains (N1 > N6 ∧ N1 > 1, A(1 : N1-1)). Lastly, the MOD set of the whole subroutine contains (N2*N3+N4*N5 > N6 ∧ N2*N3+N4*N5 > 1, A(1 : N2*N3+N4*N5-1)).

Fig. 7. Examples of symbolic expressions in guarded array regions.

All expressions e, ti, and pj above are sorted according to a unique integer key assigned to each variable. Since both Mi and Lj control the complexity of a polynomial, they are chosen as our design parameters. As an example of using Mi and Lj to control the complexity of expressions, e will be a linear (affine) expression if Mi is limited to 1 and Lj to zero. By controlling the complexity of the expression representations, we can properly control the time complexity of manipulating symbolic expressions.

Symbolic operations such as additions, subtractions, multiplications, and divisions by an integer constant are provided as library functions. In addition, a simple demand-driven symbolic evaluation scheme is implemented. It propagates an expression upward along the control flow graph until the value of the expression is known or a predefined propagation limit is reached.
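A plausible C shape for this normal form, with the caps on Mi and Lj made explicit, is sketched below; the constants and field names are our own invention, not Panorama's.

/* Normal form (1)-(3): e = t1*I1 + ... + tN*IN + t0, where each term is a
 * sum of at most MAXPROD products, and each product multiplies an integer
 * coefficient by at most MAXVAR non-index variables.  Exceeding a cap
 * degrades the expression to "unknown", bounding the analysis time. */
#define MAXVAR  4
#define MAXPROD 8
#define MAXNEST 10

typedef struct {
    long coeff;                  /* cj (a rational in the full design) */
    int  var_key[MAXVAR];        /* sorted unique keys of xj1 .. xjLj */
    int  nvar;                   /* Lj */
} Product;

typedef struct {
    Product prod[MAXPROD];       /* p1 .. pMi, kept sorted by variable keys */
    int     nprod;               /* Mi */
} Term;

typedef struct {
    Term t0;                     /* the index-free term t0 */
    Term t[MAXNEST];             /* coefficients ti of index variables Ii */
    int  nindex;                 /* N: nesting depth of the loop containing e */
    int  unknown;                /* set when a cap would be exceeded */
} Poly;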

4.2 Range Operations

In this subsection, we give a detailed discussion of range operations for various step values. To describe the range operations, we use the functions min(e1, e2) and max(e1, e2) in the following. However, these functions must eventually be solved; otherwise, unknown is usually returned as the result. Given two ranges r1 = (l1 : u1 : s1) and r2 = (l2 : u2 : s2):

1. If s1 = s2 = 1:

. r1 ∩ r2 = (max(l1, l2) ≤ min(u1, u2), (max(l1, l2) : min(u1, u2) : s1)).

. Assuming r2 ⊆ r1 (otherwise, use r1 − r2 = r1 − r1 ∩ r2), we have

r1 − r2 = (l1 ≤ max(l1, l2) − s1, (l1 : max(l1, l2) − s1 : s1)) ∪ (min(u1, u2) + s1 ≤ u1, (min(u1, u2) + s1 : u1 : s1)),

where max(l1, l2) = l2 and min(u1, u2) = u2 because r2 ⊆ r1.

. Union operation: If (l2 > u1 + s1) or (l1 > u2 + s2), then r1 ∪ r2 cannot be combined into one range. Otherwise, r1 ∪ r2 = (True, (min(l1, l2) : max(u1, u2) : s1)), assuming that r1 and r2 are both valid. If it is unknown at this moment whether both are valid, we do not combine them.

2. If s1 = s2 = c > 1, where c is a known constant value, we do the following: If (l1 − l2) is divisible by c, then we use the formulas in case 1 to compute the intersection, difference, and union. Otherwise, r1 ∩ r2 = ∅ and r1 − r2 = r1. The union r1 ∪ r2 usually cannot be combined into one range and must be maintained as a list of ranges. For the special case where |l1 − l2| = |u1 − u2| = 1 and s1 = s2 = 2, we have r1 ∪ r2 = (min(l1, l2) : max(u1, u2) : 1).

3. If s1 = s2 and l1 = l2, which may be symbolic expressions, then we use the formulas in case 1 to perform the intersection, difference, and union.

4. If s1 is divisible by s2, we check to see if r2 covers r1. If so, we have r1 ∩ r2 = r1, r1 − r2 = ∅, and r1 ∪ r2 = r2.

5. In all other cases, the result of the intersection is marked as unknown, the difference is kept in a difference list at the level of the GARWDs, and the union remains a list of the two ranges.
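Case 2 is the one that needs arithmetic beyond min/max; for known constant bounds it reduces to a divisibility check, as in this sketch of ours:

#include <stdlib.h>

typedef struct { long l, u, s; } Range;      /* constant-bound (l : u : s) */

/* Intersection for s1 = s2 = c > 1 (case 2): the ranges can only meet when
 * their lower bounds are congruent modulo c; then case 1's min/max rule
 * applies with the step carried along.  Returns 0 when r1 ∩ r2 = ∅. */
int intersect_stepped(Range r1, Range r2, Range *out) {
    long c = r1.s;                           /* caller ensures r1.s == r2.s > 1 */
    if (labs(r1.l - r2.l) % c != 0)
        return 0;                            /* disjoint lattices: empty */
    out->l = r1.l > r2.l ? r1.l : r2.l;
    out->u = r1.u < r2.u ? r1.u : r2.u;
    out->s = c;
    return out->l <= out->u;                 /* may still be empty */
}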

4.3 Extensions to Recursive Calls and Dynamic Arrays

Programming languages such as Fortran 90 and C permit recursive procedure calls and dynamically allocated data structures. In this subsection, we briefly discuss how array data-flow analysis can be performed in the presence of recursive calls and dynamic arrays.

Recursive calls can be treated in array data-flow analysis essentially the same way as in array data dependence analysis [30]. A recursive procedure calls itself either directly or indirectly, which forms cycles in the call graph of the whole program. A proper order must be established for the traversal of the call graph. First, all Maximum Strongly Connected Components (MSCs) must be identified in the call graph. Each MSC is then reduced to a single condensed node, so that the call graph is reduced to an acyclic graph. Array data flow is then analyzed by traversing the reduced graph in reverse topological order. When a condensed node (i.e., an MSC) is visited, a proper order is established among all members of the MSC for an iterative traversal. For each member procedure, the sets of modified and used array regions, with guards, that are visible to its callers must be summarized, respectively, by iterating over the calling cycles. If the MSC is a simple cycle, which is a common case in practical programs, the compiler can determine whether the visible array regions of each member procedure grow through the recursion or not after analyzing that procedure twice. If a region grows in a certain array dimension during recursive calls, then a conservative estimate should be made for that dimension; in the worst case, the range of modification or use in that array dimension can be marked as unknown. A more complex MSC requires a more complex traversal order [30].
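Schematically, the two-round strategy for a simple cycle could be written as below; this is our own sketch, with the helper routines left abstract (they are not Panorama functions):

typedef struct Routine Routine;
typedef struct { void *mod, *ue; } Summary;  /* caller-visible guarded regions */

Summary empty_summary(void);
Summary summarize_body(Routine *p, Summary assumed_self);  /* one pass over p */
int     summary_equal(Summary a, Summary b);
Summary widen_growing_dims(Summary a, Summary b);  /* growing dims -> unknown */

/* Simple-cycle MSC: analyze p twice; if its visible regions still grow when
 * its own summary is fed back, widen the growing dimensions to unknown
 * rather than iterating further. */
Summary summarize_recursive(Routine *p) {
    Summary s1 = summarize_body(p, empty_summary());
    Summary s2 = summarize_body(p, s1);
    return summary_equal(s1, s2) ? s2 : widen_growing_dims(s1, s2);
}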

Dynamically allocated arrays can be summarized essentially the same way as static arrays. The main difference is that, during the backward propagation of array regions (with guards) through the control flow graph, i.e., the HSG in this paper, if the current node contains a statement that allocates a dynamic array, then all UE sets and MOD sets for that array are killed beyond this node.

The discussion above is based on the assumption that no true aliasing exists in each procedure, i.e., references to different variable names must access different memory locations if either reference is a write. This assumption is true for Fortran 90 and Fortran 77 programs, but it may be false for C programs. Before performing array data-flow analysis on C programs, alias analysis must first be performed. Alias analysis has been studied extensively in the recent literature [8], [11], [14], [27], [28], [39], [44], [45].

5 EFFECTIVENESS AND EFFICIENCY

In this section, we first discuss how GARs are used for array privatization and loop parallelization. We then present experimental results to show the effectiveness and the efficiency of the array data-flow analysis.

5.1 Array Privatization and Loop Parallelization

An array A is a privatization candidate in a loop L if its elements are overwritten in different iterations of L (see [29]). Such candidacy can be established by examining the summary MODi set: if the intersection of MODi and MOD<i is nonempty, then A is a candidate. A privatization candidate is privatizable if there exist no loop-carried flow dependences in L. For an array A in a loop L with index I, if MOD<i ∩ UEi = ∅, then there exists no flow dependence carried by loop L.

Let us look at Fig. 2c again. UEi = ∅ and MOD<i = (x > SIZE ∧ 1 < m ∧ 1 < i, A(1 : m : 1)). Hence, MOD<i ∩ UEi = MOD<i ∩ ∅ = ∅, so A is privatizable within loop I. As another example, let us look at Fig. 2b. Since MODi is not loop-variant, we have MOD<i = MODi. Hence, MODi ∩ MOD<i is not empty and array A is a privatization candidate. Furthermore,

UEi ∩ MOD<i
= {(¬p, (jmax)), <(T, (jlow : jup))>} ∩ ((T, (jlow : jup)) ∪ (p, (jmax)))
= {(¬p, (jmax)), <(T, (jlow : jup))>} ∩ (T, (jlow : jup)) ∪ {(¬p, (jmax)), <(T, (jlow : jup))>} ∩ (p, (jmax))
= {(¬p, (jmax)), <(T, (jlow : jup))>} ∩ (T, (jlow : jup)) ∪ {((¬p, (jmax)) ∩ (p, (jmax))), <(T, (jlow : jup))>}
= {(¬p, (jmax)), <(T, (jlow : jup))>} ∩ (T, (jlow : jup))
= ∅,

where the second union term vanishes because ¬p ∧ p = False. The last operation above can be easily done because the GAR (T, (jlow : jup)) is in the difference list. Therefore, UEi ∩ MOD<i is empty. This guarantees that array A is privatizable.

As we explained in Section 2.1, copy-in and copy-out statements sometimes need to be inserted in order to preserve program correctness. The general rules are: 1) upwardly exposed array elements must be copied in; and 2) live array elements must be copied out. We have already discussed the determination of upwardly exposed array elements. We currently perform the conservative liveness analysis proposed in [29].

The essence of loop parallelization is to prove the absence of loop-carried dependences. For a given DO loop L with index I, the existence of the different types of loop-carried dependences can be detected in the following order (see the sketch after this list):

. Loop-carried flow dependences: They exist if and only if UEi ∩ MOD<i ≠ ∅.

. Loop-carried output dependences: They exist if and only if MODi ∩ (MOD<i ∪ MOD>i) ≠ ∅.

. Loop-carried antidependences: Suppose we have already determined that there exist no loop-carried output dependences; then loop-carried antidependences exist if and only if UEi ∩ MOD>i ≠ ∅. (If loop-carried antidependences were to be considered separately, then UEi in the above formula should be replaced by DEi, where DEi stands for the downwardly exposed use set of iteration i.)
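In code, the three tests order naturally as below; this is our sketch, where region_isect_empty stands for the guarded-region intersection test, answering conservatively when symbolic terms remain unknown:

typedef struct Set Set;                /* a list of GARs / GARWDs */
int region_isect_empty(const Set *a, const Set *b);  /* 1 = provably empty */

int has_carried_flow(const Set *ue_i, const Set *mod_lt) {
    return !region_isect_empty(ue_i, mod_lt);        /* UEi ∩ MOD<i ≠ ∅ */
}

int has_carried_output(const Set *mod_i, const Set *mod_lt, const Set *mod_gt) {
    return !region_isect_empty(mod_i, mod_lt) ||     /* MODi ∩ (MOD<i ∪ MOD>i) */
           !region_isect_empty(mod_i, mod_gt);
}

/* valid only once output dependences have been ruled out */
int has_carried_anti(const Set *ue_i, const Set *mod_gt) {
    return !region_isect_empty(ue_i, mod_gt);        /* UEi ∩ MOD>i ≠ ∅ */
}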

Take output dependences for example. In Fig. 7a, MODi of DO loop I2 contains a single GAR: (True, A(N1+I1+i)). MOD<i contains (i > 1, A(N1+I1+1 : N1+I1+i-1)) and MOD>i contains (i < 100, A(N1+I1+i+1 : N1+I1+100)). Loop-carried output dependences do not exist for DO loop I2 because MODi ∩ (MOD<i ∪ MOD>i) = ∅. In contrast, for DO loop I1, MODi contains (True, A(N1+i+1 : N1+i+100)) and MOD<i contains (i > 1, A(N1+2 : N1+i+99)). Loop-carried output dependences exist for DO loop I1 because MODi ∩ MOD<i ≠ ∅. Note that if an array is privatized, then no loop-carried output dependences exist between the write references to private copies of the same array.


5.2 Experimental Results

We have implemented our array data-flow analysis in a prototype parallelizing compiler, Panorama, which is a multiple-pass, source-to-source Fortran program analyzer [35]. It roughly consists of the phases of parsing, building the hierarchical supergraph (HSG) and the interprocedural scalar UD/DU chains [1], performing conventional data dependence tests, array data-flow analysis, and other advanced analyses, and generating parallel code.

Table 1 shows the Fortran loops in the Perfect benchmark suite which should be parallelizable after array privatization and after necessary transformations such as induction variable substitution, parallel reduction, and event synchronization placement. The table also marks which loops require symbolic analysis, predicate analysis, and interprocedural analysis, respectively. (The details of the privatizable arrays in these loops can be found in [18].)

TABLE 1. Parallelizable Loops in the Perfect Benchmark Suite and the Required Privatization Techniques. (SA: Symbolic Analysis. PA: Predicate Analysis. IA: Interprocedural Analysis.)

Columns 4 and 5 mark those loops that can be parallelized by Polaris (Version 1.5) and by Panorama, respectively. Only one loop (interf/1000) is parallelized by Polaris but not by Panorama, because one of the privatizable arrays is not recognized as such. Privatizing this array would require the implementation of a special pattern matching which is not done in Panorama. On the other hand, Panorama parallelizes several loops that cannot be parallelized by Polaris.

Table 2 compares the speedups of the programs selected from Table 1, parallelized by Polaris and by Panorama, respectively. Only those programs parallelizable by either or both of the tools are selected. The speedup numbers are computed by dividing the real execution time of the sequential codes by the real execution time of the parallelized codes, running on an SGI Challenge multiprocessor with four 196 MHz R10000 CPUs and 1,024 MB of memory. On average, the speedups are comparable between the Polaris-parallelized codes and the Panorama-parallelized codes. Note that the speedup numbers could be further improved by a number of recently discovered memory-efficiency enhancement techniques; these techniques are not implemented in the versions of Polaris and Panorama used for this experiment.

TABLE 2. Speedup Comparison between Polaris and Panorama (with four R10000 CPUs).

Table 3 shows the wall-clock time spent on the main parts of Panorama. In Table 3, "Parsing time" is the time to parse the program once, although Panorama currently parses a program three times: the first time for constructing the call graph and for rearranging the parsing order of the source files, the second time for interprocedural analysis, and the last time for code generation.

TABLE 3. Analysis Time (in Seconds) Distribution. (Timing is measured on SGI Indy workstations with a 134 MHz MIPS R4600 CPU and 64 MB of memory.)

The column "HSG & DOALL Checking" gives the time taken to build the HSG and the UD/DU chains and to perform conventional DOALL checking. The column "Array Summary" refers to our array data-flow analysis, which is applied only to loops whose parallelizability cannot be determined by the conventional DOALL tests. Fig. 8 shows the percentage of time spent by the array data-flow analysis and by the rest of Panorama. Even though the time percentage of array data-flow analysis is high (about 38 percent on average), the total execution time is small (31 seconds maximum). To get a perspective on the overhead of our interprocedural analysis, the last column, marked "f77 -O," shows the time spent by the f77 compiler with option -O to compile the corresponding Fortran program into sequential machine code.

Table 4 lists the analysis time of Polaris alongside that of Panorama, where the Panorama time includes all three parsing passes, instead of just one as in Table 3. It is difficult to provide an absolutely fair comparison, so these two sets of numbers are listed together simply to provide a perspective. The timing of Polaris (Version 1.5) is measured without the passes after array privatization and dependence tests. We do not list the timing results of SUIF because SUIF's current public version does not perform array data-flow analysis and no such timing results are publicly available. Both Panorama and Polaris are compiled by the GNU gcc/g++ compiler at the -O optimization level. The time was measured by gettimeofday() and is elapsed wall-clock time. When using an SGI Challenge machine, which has a large memory, the time gap between Polaris and Panorama is reduced. This is probably because Polaris is written in C++ and has a huge executable image: the size of its executable image is about 14 MB, while Panorama, written in C, has an executable image of 1.1 MB. Even with a memory size as large as 1 GB, Panorama is still faster than Polaris by one or two orders of magnitude.

5.3 Summary vs. In-Lining

We believe that several design choices contribute to the efficiency of Panorama. In the next subsections, we present some of these choices.

The foremost reason seems to be that Panorama computes interprocedural summaries without in-lining the routine bodies, as Polaris does. If a subroutine is called in several places in the program, in-lining causes the subroutine body to be analyzed several times, while Panorama only needs to summarize each subroutine once. The summary result is later mapped to different call sites. Moreover, for data dependence tests involving call statements, Panorama uses the summarized array region information, while Polaris performs data dependence tests between every pair of array references in the loop body after in-lining. Since the time complexity of data dependence testing is $O(n^2)$, where $n$ is the number of individual references being tested, in-lining can significantly increase the time for dependence testing. In our experiments with Polaris, we limit the number of in-lined executable statements to 50, a default value used by Polaris. Even with this modest number, data dependence tests still account for about 30 percent of the total time.
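A back-of-the-envelope calculation makes the gap concrete. The C sketch below, with hypothetical reference counts of our own choosing, contrasts the number of pairwise dependence tests after in-lining with the number when each call site is represented by one summarized region:

/* A minimal sketch (not Panorama code) contrasting pairwise dependence
   tests with in-lining versus with call-site summaries.
   All parameter values below are hypothetical. */
#include <stdio.h>

/* Pairwise tests among n references: n*(n-1)/2. */
static long pairs(long n) { return n * (n - 1) / 2; }

int main(void) {
    long loop_refs   = 10;  /* array references written directly in the loop body */
    long call_sites  = 4;   /* calls to the same subroutine inside the loop       */
    long callee_refs = 25;  /* array references inside the subroutine body        */

    /* In-lining replicates the callee's references at every call site,
       and every replicated reference participates in pairwise testing. */
    long inlined = pairs(loop_refs + call_sites * callee_refs);

    /* With summarization, each call site contributes one summarized
       region (a MOD/UE set), and the callee is summarized only once. */
    long summarized = pairs(loop_refs + call_sites);

    printf("tests after in-lining: %ld\n", inlined);     /* 5995 */
    printf("tests with summaries:  %ld\n", summarized);  /* 91   */
    return 0;
}

Even for these small counts, in-lining multiplies the number of pairwise tests by roughly 65.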

We believe that another important reason for Panorama's efficiency is its efficient computation and propagation of the summary sets. Two design issues are particularly noteworthy, namely, the handling of predicates and the difference set operations. Next, we discuss these issues in more detail.

5.4 Efficient Handling of Predicates

General predicate operations are expensive, so compilers often do not perform them. In fact, the majority of the predicate handling required for our array data-flow analysis involves simple operations, such as checking whether two predicates are identical, whether they are loop-independent, and whether they contain indices and affect the shapes or sizes of array regions. These operations can be implemented rather efficiently.

A canonical normal form is used to represent the predicates, since pattern-matching under a normal form is easier than under arbitrary forms. Both the conjunctive normal form (CNF) and the disjunctive normal form (DNF) have been widely used in program analysis [7], [9]. These cited works show that negation operations are expensive with both CNF and DNF, a fact also confirmed by our previous experiments using CNF [20]. Negation operations occur not only due to ELSE branches, but also due to GAR and GARWD operations elsewhere. Hence, we design a new normal form such that negation operations can often be avoided.


TABLE 2
Speedup Comparison between Polaris and Panorama
(with four R10000 CPUs).

TABLE 3
Analysis Time (in Seconds) Distribution
Timing is measured on SGI Indy workstations with a 134 MHz MIPS R4600 CPU and 64 MB memory.



We use a hierarchical approach to predicate handling. A predicate is represented by a high-level predicate tree, $PT(V, E, r)$, where $V$ is the set of nodes, $E$ is the set of edges, and $r$ is the root of $PT$. The internal nodes of $V$ are NAND operators, except for the root, which is an AND operator. The leaf nodes are divided into regular leaf nodes and negative leaf nodes. A regular leaf node represents a predicate such as an IF condition, while a negative leaf node represents the negation of a predicate. Theoretically, this representation is not a normal form, because two identical predicates may have different predicate trees, which may render pattern-matching unsuccessful. We believe, however, that such cases are rare and that they happen only when the program is extremely complicated. Fig. 9 shows a PT.

Each leaf (regular or negative) is a token which represents a basic predicate, such as an IF condition or a DO condition in the program.


Fig. 8. Time percentage of array data-flow summary.

TABLE 4
Elapsed Analysis Time (in Seconds)
1. SGI Challenge with 1,024 MB memory and 196 MHz R10000 CPU. 2. SGI Indy with 134 MHz MIPS R4600 CPU and 64 MB memory. 3. "*" means Polaris takes longer than four hours.


At this level, we keep a basic predicate as a unit and do not split it. The predicate operations are based only on these tokens and do not check the details within the basic predicates; negation of a predicate tree is simple this way. A NAND operation, shown in Fig. 10, may either increase or decrease a predicate tree by one level, according to the shape of the tree. If there is only one regular leaf node (or one negative leaf node) in the tree, the regular leaf node is simply changed to a negative leaf node (or vice versa). AND and OR operations are also easily handled, as shown in Fig. 10. We use a unique token for each basic predicate so that simple and common cases can be handled without checking the contents of the predicates. The content of each predicate is represented in CNF and is examined only when necessary, as illustrated by the sketch below.
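To make the token-level operations concrete, the following C sketch shows one possible realization of the predicate tree; for brevity it composes everything from NAND nodes (folding the AND root into NAND form), and the types and function names are our own illustration rather than Panorama's data structures.

/* A minimal sketch of the token-level predicate tree: leaves are opaque
   tokens for basic predicates, internal nodes are NAND operators, and
   negation never inspects predicate contents. */
#include <stdlib.h>

enum kind { LEAF, NEG_LEAF, NAND };

struct pt {
    enum kind k;
    int token;          /* unique token of a basic predicate (leaves) */
    struct pt **kids;   /* operands (NAND nodes) */
    int nkids;
};

static struct pt *mk_leaf(enum kind k, int token) {
    struct pt *p = calloc(1, sizeof *p);
    p->k = k; p->token = token;
    return p;
}

static struct pt *mk_nand(struct pt **kids, int n) {
    struct pt *p = calloc(1, sizeof *p);
    p->k = NAND; p->kids = kids; p->nkids = n;
    return p;
}

/* Negation: a lone leaf just flips polarity; otherwise one NAND level
   is added or removed, since NAND with a single operand is NOT. */
static struct pt *negate(struct pt *p) {
    if (p->k == LEAF)     { p->k = NEG_LEAF; return p; }
    if (p->k == NEG_LEAF) { p->k = LEAF;     return p; }
    if (p->nkids == 1)    return p->kids[0];          /* NOT(NOT x) = x */
    struct pt **kid = malloc(sizeof *kid);
    kid[0] = p;
    return mk_nand(kid, 1);                           /* NOT x = NAND(x) */
}

/* x AND y = NOT(NAND(x, y));  x OR y = NAND(NOT x, NOT y). */
static struct pt *and2(struct pt *x, struct pt *y) {
    struct pt **kids = malloc(2 * sizeof *kids);
    kids[0] = x; kids[1] = y;
    return negate(mk_nand(kids, 2));
}

static struct pt *or2(struct pt *x, struct pt *y) {
    struct pt **kids = malloc(2 * sizeof *kids);
    kids[0] = negate(x); kids[1] = negate(y);
    return mk_nand(kids, 2);
}

int main(void) {
    struct pt *p = mk_leaf(LEAF, 1), *q = mk_leaf(LEAF, 2);
    struct pt *g = or2(and2(p, q), mk_leaf(NEG_LEAF, 3)); /* (p AND q) OR NOT r */
    return g == NULL;  /* build only; traversal omitted for brevity */
}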

Table 5 lists several key parameters: the total number of arrays summarized, the average length of a MOD set (column "Ave # GARs"), the average length of a UE set (column "Ave # GARWDs"), and some data concerning difference and predicate operations. The total number of arrays summarized is the sum of the numbers of arrays summarized in each loop nest; an array that appears in two disjoint loop nests is counted twice. Since the time for set operations is proportional to the square of the length of the MOD and UE lists, it is important that these lists be short. It is encouraging to see that they are indeed short in the benchmark application programs.


Fig. 9. High level representation of predicates.

Fig. 10. Predicate operations.



Columns 7 and 8 (marked "High" and "Low") in Table 5 show that over 95 percent of the total predicate operations are high-level ones, where a negation or a binary predicate operation on two basic predicates is counted as one operation. These numbers depend on the strategy used to handle the predicates. Currently, we defer the checking of predicate contents until the last step; as a result, only a few low-level predicate operations are needed. Our results show that this strategy works well for array privatization, since almost all privatizable arrays in our tested programs can be recognized. Some cases, such as those that must handle guards containing loop indices, do need low-level predicate operations, and the hierarchical representation scheme serves these cases well.
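The deferral strategy can be pictured as a two-level comparison. In the hypothetical C sketch below, predicates are compared by token first, and the CNF contents are examined only as a fallback (a simple string comparison stands in for a real CNF equivalence test):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct pred { int token; const char *cnf; };

/* Low-level (costly) check; a real one would test logical equivalence. */
static bool cnf_equivalent(const char *a, const char *b) {
    return strcmp(a, b) == 0;
}

static bool same_predicate(const struct pred *a, const struct pred *b) {
    if (a->token == b->token) return true;  /* high-level: token match */
    return cnf_equivalent(a->cnf, b->cnf);  /* low-level: examine contents */
}

int main(void) {
    struct pred p = { 7, "(K .GT. 0)" }, q = { 7, "(K .GT. 0)" };
    printf("%d\n", same_predicate(&p, &q));  /* 1: resolved at token level */
    return 0;
}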

5.5 Reducing Unnecessary Difference Operations

We do not solve the difference $T_1 - T_2$ using the general formula presented in Section 2 unless the result is a single GAR. When the difference cannot be simplified to a single GAR, it is represented by a GARWD instead of by a union of GARs, as that formula implies. This strategy postpones the expensive and complex difference operations until they are absolutely necessary, and it avoids propagating a relatively complex list of GARs. For example, let a GARWD $G_1$ be $\{[1:m];\ \langle [k:n], [2:n_1] \rangle\}$ and let $G_2$ be $[1:m]$. We have $G_1 - G_2 = \emptyset$, and the two difference operations represented in $G_1$ are reduced (i.e., there is no need to perform them). In Table 5, the total number of difference operations and the total number of reduced difference operations are shown in columns 5 and 6, respectively. Although difference operations are reduced by only about nine percent on average, the reduction is dramatic for some programs: by one third for MDG and by half for MG3D.

Let us use the example in Fig. 2b to further illustrate the significance of delayed difference operations. A simplified control flow graph of the body of the outer loop is shown in Fig. 11. Suppose that each node has been summarized and that the summary results are as follows:

$$\begin{aligned} MOD(1) &= (T, [j_{low} : j_{up}]), & UE(1) &= \emptyset \\ MOD(2) &= \emptyset, & UE(2) &= \emptyset \\ MOD(3) &= (T, [j_{max}]), & UE(3) &= \emptyset \\ MOD(4) &= \emptyset, & UE(4) &= (T, [j_{low} : j_{up}]) \cup (T, [j_{max}]) \end{aligned}$$

Following the description given in Section 3.4, we propagate the summary sets of each node in the following steps to get the summary sets for the body of the outer loop.

1. $MOD\_IN(p_4) = MOD(4) = \emptyset$, and $UE\_IN(p_4) = UE(4) = (T, [j_{low} : j_{up}]) \cup (T, [j_{max}])$.

2. $MOD\_IN(p_3) = MOD(3) \cup MOD\_IN(p_4) = (T, [j_{max}])$, and $UE\_IN(p_3) = UE(3) \cup (UE\_IN(p_4) - MOD(3)) = \{(T, [j_{low} : j_{up}]);\ \langle (T, [j_{max}]) \rangle\}$. This difference operation is kept in the GARWD and will be reduced at Step 4.

3. $MOD\_IN(p_2) = (p, [j_{max}])$, and $UE\_IN(p_2) = \{(p, [j_{low} : j_{up}]);\ \langle (p, [j_{max}]) \rangle\} \cup (\bar{p}, [j_{low} : j_{up}]) \cup (\bar{p}, [j_{max}])$.


Fig. 11. The HSG of the body of the outer loop for Fig. 2b.

TABLE 5
Measurement of Key Parameters


In the above, $p$ is inserted into the guards of the GARs propagated through the TRUE edge, and $\bar{p}$ is inserted into the guards propagated through the FALSE edge.

4. $MOD\_IN(p_1) = (T, [j_{low} : j_{up}]) \cup (p, [j_{max}])$, and $UE\_IN(p_1) = UE\_IN(p_2) - MOD(1) = \{(\bar{p}, [j_{max}]);\ \langle (T, [j_{low} : j_{up}]) \rangle\}$. At this step, the computation of $UE\_IN(p_1)$ removes one difference operation, because $\{(p, [j_{low} : j_{up}]);\ \langle (p, [j_{max}]) \rangle\} - (T, [j_{low} : j_{up}])$ is equal to $\emptyset$. In other words, there is no need to perform the difference operation represented by the GARWD $\{(p, [j_{low} : j_{up}]);\ \langle (p, [j_{max}]) \rangle\}$. An advantage of the GARWD representation is that a difference can be postponed rather than always performed. Without a GARWD, the difference operation at Step 2 would always have to be performed, which is unnecessary and thus increases execution time.

Therefore, the summary sets of the body of the outer loop (DO I) should be:
$$MOD_i = MOD\_IN(p_1) = (T, [j_{low} : j_{up}]) \cup (p, [j_{max}])$$
$$UE_i = UE\_IN(p_1) = \{(\bar{p}, [j_{max}]);\ \langle (T, [j_{low} : j_{up}]) \rangle\}.$$

To determine if array A is privatizable, we need to prove that there exists no loop-carried flow dependence for A. We first calculate $MOD^{<i}$, the set of array elements written in iterations prior to iteration $i$, which here gives $MOD^{<i} = MOD_i$. The intersection of $MOD^{<i}$ and $UE_i$ is computed as two intersections, each formed from one mod component of $MOD^{<i}$ and $UE_i$, respectively. The first mod, $(T, [j_{low} : j_{up}])$, appears in the difference list of $UE_i$ and, thus, the result is obviously empty. Similarly, the intersection of $(\bar{p}, [j_{max}])$ and the second mod, $(p, [j_{max}])$, is empty because their guards are contradictory. Because the intersection of $MOD^{<i}$ and $UE_i$ is empty, array A is privatizable. In both intersections, we avoid performing the difference operation in $UE_i$ and, therefore, improve efficiency.
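These reductions can be made concrete with a small data structure. The following C sketch is our own toy model, with guard tokens and constant integer ranges standing in for Panorama's symbolic guards and regions: a difference is postponed on the GARWD's difference list unless it trivially yields a single GAR, and an intersection test can report emptiness from the difference list or from contradictory guards without ever performing the postponed differences.

/* A toy model of guarded array regions with difference lists (GARWDs).
   Guards are integer tokens (sign flip means negation, GTRUE means T);
   regions are constant ranges [lo:hi].  All names are illustrative. */
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

#define GTRUE INT_MAX

struct gar { int guard; int lo, hi; };       /* (guard, [lo:hi]) */

#define MAX_DIFF 4
struct garwd {                               /* {source GAR; <diff list>} */
    struct gar src;
    struct gar diff[MAX_DIFF];
    int ndiff;
};

static bool empty(struct garwd g) { return g.src.lo > g.src.hi; }

/* a covers b: a's guard is no stronger than b's and a's range contains b's. */
static bool covers(struct gar a, struct gar b) {
    return (a.guard == GTRUE || a.guard == b.guard)
        && a.lo <= b.lo && b.hi <= a.hi;
}

/* Difference g - t: solve it only when the result is a single GAR
   (empty, or g unchanged); otherwise postpone t on the diff list. */
static struct garwd diff(struct garwd g, struct gar t) {
    if (covers(t, g.src)) {                  /* whole source removed:     */
        g.src.lo = 1; g.src.hi = 0;          /* result is the empty GAR;  */
        g.ndiff = 0;                         /* postponed diffs reduced   */
    } else if (t.hi < g.src.lo || g.src.hi < t.lo || t.guard == -g.src.guard) {
        /* disjoint range or contradictory guard: g unchanged */
    } else if (g.ndiff < MAX_DIFF) {
        g.diff[g.ndiff++] = t;               /* postponed */
    }
    return g;
}

/* Does a MOD region m intersect the GARWD ue?  The diff list lets us
   answer "empty" without performing the postponed differences. */
static bool intersects(struct garwd ue, struct gar m) {
    if (empty(ue)) return false;
    if (m.guard == -ue.src.guard) return false;        /* p vs. not-p */
    for (int i = 0; i < ue.ndiff; i++)
        if (covers(ue.diff[i], m)) return false;       /* removed part */
    return !(m.hi < ue.src.lo || ue.src.hi < m.lo);
}

int main(void) {
    /* UE_i = {(not-p, [5:5]); <(T, [1:4])>}; mods (T,[1:4]) and (p,[5:5]). */
    struct garwd ue = { { -2, 5, 5 }, { { GTRUE, 1, 4 } }, 1 };
    struct gar mod1 = { GTRUE, 1, 4 }, mod2 = { 2, 5, 5 };
    printf("%d %d\n", intersects(ue, mod1), intersects(ue, mod2)); /* 0 0 */

    /* Postponing: {(T,[1:8]); <>} - (p,[3:4]) keeps (p,[3:4]) on the list. */
    struct garwd g = { { GTRUE, 1, 8 }, { { 0, 0, 0 } }, 0 };
    g = diff(g, (struct gar){ 2, 3, 4 });
    printf("postponed diffs: %d\n", g.ndiff);          /* 1 */
    return 0;
}

The two printed zeros correspond to the two empty intersections in the privatization argument above.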

6 RELATED WORK

There exist a number of approaches to array data-flow analysis. As far as we know, no work has particularly addressed the efficiency issue or presented efficiency data. One school of thought attempts to gather flow information for each array element and to acquire an exact array data-flow analysis. This is usually done by solving a system of equalities and inequalities. Feautrier [17] calculates the source function to indicate detailed flow information. Maydan et al. [33], [34] simplify Feautrier's method by using a Last-Write-Tree (LWT). Duesterwald et al. [12] compute the dependence distance for each reaching definition within a loop. Pugh and Wonnacott [37] use a set of constraints to describe array data-flow problems and solve them basically by Fourier-Motzkin variable elimination. Maslov [32], as well as Pugh and Wonnacott [37], also extend the previous work in this category by handling certain IF conditions. Generally, these approaches are intraprocedural and do not seem easily extensible interprocedurally. The other group analyzes sets of array elements instead of individual array elements. Early work uses regular sections [6], [24], convex regions [40], [41], data access descriptors [2], etc., to summarize the MOD/USE sets of array accesses; these are not array data-flow analyses. Recently, array data-flow analyses based on such sets have been proposed (Gross and Steenkiste [19], Rosene [38], Li [29], Tu and Padua [43], Creusillet and Irigoin [10], and Hall et al. [21]). Of these, ours is the only one using conditional regions (GARs), even though some do handle IF conditions using other approaches. Although the second group does not provide as many details about reaching definitions as the first group, it handles complex program constructs better and can be easily performed interprocedurally.

Array data-flow summary, as a part of the second group mentioned above, has been a focus in the parallelizing compiler area. The most essential information in an array data-flow summary is the upwardly exposed use set. These summary approaches can be compared in two aspects: set representation and path sensitivity. For set representation, convex regions are highest in precision, but they are also expensive because of their complex representation. Bounded regular sections (or regular sections) have the simplest representation and, thus, are the least expensive. Early work tried to use a single regular section or a single convex region to summarize one array. Obviously, a single set can potentially lose information, and it may be ineffective in some cases. Tu and Padua [43] and Creusillet and Irigoin [10] seem to use a single regular section and a single convex region, respectively. Hall et al. [21] use a list of convex regions to summarize all the references of an array. It is unclear if this representation is more precise than the list of regular sections upon which our approach is based.

Regarding path sensitivity, the commonality of these previous methods is that they do not distinguish the summary sets of different control flow paths. Therefore, these methods are called path-insensitive and have been shown to be inadequate for real programs. Our approach, as far as we know, is the only path-sensitive array data-flow summary approach in the parallelizing compiler area. It distinguishes summary information from different paths by putting IF conditions into guards. Some other approaches do handle IF conditions, but not in the context of array data-flow summary.

7 CONCLUSION

In this paper, we have presented an array data-flow analysis which handles interprocedural, symbolic, and predicate analyses all together. The analysis is shown via experiments to be quite effective for program parallelization. Important design decisions are made such that the analysis can be performed efficiently. Our hierarchical predicate-handling scheme turns out to serve very well: many predicate operations can be performed at high levels, avoiding expensive low-level operations. The new data structure, the GARWD (i.e., a guarded array region with a difference list), reduces expensive set-difference operations by up to 50 percent for a few programs, although the reduction is unimpressive for other programs. Another important finding is that the MOD lists and the UE lists can be kept rather short, thus reducing set operation time.



As far as we know, this is the first time the efficiency issue has been addressed and data presented for such a powerful analysis. We believe it is important to continue exploring the efficiency issue because, unless interprocedural array data-flow analysis can be performed reasonably fast, its adoption in the real programming world is unlikely. With continued advances in parallelizing compiler techniques, we hope that fully or partially automatic parallelization will provide a viable methodology for machine-independent parallel programming.

APPENDIX

EXPANSION OF LOOP SUMMARIES

In the following, we present a guideline for computing the expansion of loop summaries introduced in Section 3.4. First, for a GAR $Q$, $proj(Q)$ is obtained by the following steps:

1. If $i$ appears in the guard of a GAR, we remove the predicate components which involve $i$ from the guard and use those components to derive a new domain of $i$. Suppose that $i$ in the guard can be solved and represented as $i \in [l' : u']$. The new domain of $i$ becomes
$$\left[ \left\lceil \frac{\max(l', l) - l}{s} \right\rceil s + l \ :\ \left\lfloor \frac{\min(u', u) - l}{s} \right\rfloor s + l \ :\ s \right],$$
which simplifies to $[\max(l', l) : \min(u', u)]$ for $s = 1$. For example, given $i \in [2 : 100 : 2]$ and the GAR $(5 \le i,\ A(i))$, we remove the relational expression $5 \le i$ from the guard and form the new domain of $i$: $[\lceil (\max(5, 2) - 2)/2 \rceil \cdot 2 + 2 : 100 : 2] = [6 : 100 : 2]$. Hence, the projection is completed by expanding $(T, A(i))$, $i \in [6 : 100 : 2]$, whose result is $(T, A(6 : 100 : 2))$. (A small sketch of this computation follows the list.)

2. Suppose that $i$ appears in only one dimension of $Q$. If the result of substituting $l \le i \le u$, or the new bounds on $i$ obtained above, into the old range triple in that dimension can still be represented by a range triple $[l'' : u'' : s'']$, then we replace the old range triple by $[l'' : u'' : s'']$. For example, the range triple $(i : i : 1)$ becomes $(l : u : 1)$.

3. If, in the above, the result of substituting $l \le i \le u$ into the old range can no longer be represented by a range, or if $i$ appears in more than one dimension of $Q$, then these dimensions are marked as unknown. A tighter approximation is possible for special cases, but we will not discuss it in this paper.
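Step 1 is straightforward to compute for positive strides. The C sketch below (our own illustration) intersects the loop domain [l:u:s] with the bounds [l':u'] solved from the guard and realigns the result to the stride; it reproduces the [2:100:2] example.

/* Intersecting the loop domain [l:u:s] with guard bounds [lp:up],
   keeping the result aligned to the stride (positive strides only). */
#include <stdio.h>

/* Ceiling division for nonnegative numerators. */
static int cdiv(int a, int b) { return (a + b - 1) / b; }

struct triple { int lo, hi, step; };

static struct triple project(int l, int u, int s, int lp, int up) {
    struct triple t;
    int lo = l > lp ? l : lp;   /* max(l', l) */
    int hi = u < up ? u : up;   /* min(u', u) */
    t.lo = cdiv(lo - l, s) * s + l;   /* round up to a stride position   */
    t.hi = ((hi - l) / s) * s + l;    /* round down to a stride position */
    t.step = s;
    return t;
}

int main(void) {
    /* i in [2:100:2] with guard 5 <= i, i.e., bounds [5:100] */
    struct triple t = project(2, 100, 2, 5, 100);
    printf("[%d : %d : %d]\n", t.lo, t.hi, t.step);  /* prints [6 : 100 : 2] */
    return 0;
}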

For the expansion of a GARWD, we have the following:

1. For a GARWD, if its difference list and its source GAR cannot be expanded separately, then we must solve the difference list first, invoking the symbolic evaluator if necessary. If the difference list cannot be solved, the expansion result is marked as unknown.

2. The computation of $(UE_i - MOD^{<i})$ and its expansion can be done without first expanding $MOD_i$ to $MOD^{<i}$. Instead, $(UE_i - MOD^{<i})$ is evaluated to $UE_{i'}$ with a new index variable $i'$. Consider a special case in which $UE_i = \{A(i+n);\ \langle\rangle\}$ and $MOD_i = A(i+m)$. We can formulate:
$$(UE_i - MOD^{<i}),\ i \in [l : u] = \begin{cases} UE_{i'},\ i' \in [l : l + (m - n) - 1] & \text{if } (m - n) > 0 \\ UE_{i'},\ i' \in [l : u] & \text{if } (m - n) \le 0. \end{cases}$$
As a concrete example, suppose we have $i \in [2 : 99]$, $MOD_i = (T, A(i+1))$, and $UE_i = (T, A(i))$, which satisfies $(m - n) > 0$ in the above. The set $(UE_i - MOD^{<i})$, with $i \in [2 : 99]$, should equal the set $UE_{i'}$, $i' \in [2 : 2]$. Suppose, however, that $MOD_i$ is $(T, A(i-1))$. The case of $(m - n) \le 0$ applies instead, and the set $(UE_i - MOD^{<i})$, with $i \in [2 : 99]$, equals $UE_{i'}$ with $i' \in [2 : 99]$. In this paper, we leave out the general discussion of the shortcut computation illustrated above; a brute-force check of the first example appears below.
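As a sanity check on the first example, the following small C program (our own illustration) enumerates the loop and reports exactly the upwardly exposed reads, confirming that only A(2) survives, i.e., $i' \in [2 : 2]$:

/* Brute-force check of the shortcut: for UE_i = {A(i)} and
   MOD_i = {A(i+1)}, i in [2:99], an element A(j) is upwardly exposed
   iff some iteration reads j before any earlier iteration wrote it. */
#include <stdbool.h>
#include <stdio.h>

int main(void) {
    const int l = 2, u = 99, m = 1, n = 0;  /* MOD_i = A(i+m), UE_i = A(i+n) */
    printf("exposed reads:");
    for (int i = l; i <= u; i++) {
        int read = i + n;
        bool written_before = false;
        for (int k = l; k < i; k++)          /* iterations prior to i */
            if (k + m == read) { written_before = true; break; }
        if (!written_before) printf(" A(%d)", read);
    }
    printf("\n");   /* prints: exposed reads: A(2) */
    return 0;
}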

ACKNOWLEDGMENTS

This paper is based in part on work previously presented in the Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 157-167, 1997. This work was supported in part by NSF grants CCR-950254 and MIP-9610379 and by the Purdue Research Foundation.

REFERENCES

[1] A.V. Aho, R. Sethi, and J.D. Ullman, Compilers: Principles, Techniques, and Tools. Reading, Mass.: Addison-Wesley, 1986.
[2] V. Balasundaram, "A Mechanism for Keeping Useful Internal Information in Parallel Programming Tools: The Data Access Descriptor," J. Parallel and Distributed Computing, vol. 9, pp. 154-170, 1990.
[3] M. Berry, D. Chen, P. Koss, D. Kuck, L. Pointer, S. Lo, Y. Pang, R. Roloff, A. Sameh, E. Clementi, S. Chin, D. Schneider, G. Fox, P. Messina, D. Walker, C. Hsiung, J. Schwarzmeier, K. Lue, S. Orszag, F. Seidl, O. Johnson, G. Swanson, R. Goodrum, and J. Martin, "The Perfect Club Benchmarks: Effective Performance Evaluation of Supercomputers," The Int'l J. Supercomputer Applications, vol. 3, no. 3, pp. 5-40, Fall 1989.
[4] W. Blume, R. Doallo, R. Eigenmann, J. Grout, J. Hoeflinger, T. Lawrence, J. Lee, D. Padua, Y. Paek, B. Pottenger, L. Rauchwerger, and P. Tu, "Parallel Programming with Polaris," Computer, vol. 28, no. 11, pp. 78-82, Nov. 1996.
[5] W. Blume and R. Eigenmann, "Symbolic Analysis Techniques Needed for the Effective Parallelization of Perfect Benchmarks," Technical Report, Dept. of Computer Science, Univ. of Illinois, 1994.
[6] D. Callahan and K. Kennedy, "Analysis of Interprocedural Side Effects in a Parallel Programming Environment," Proc. ACM SIGPLAN '86 Symp. Compiler Construction, pp. 162-175, June 1986.
[7] T.E. Cheatham Jr., G.H. Holloway, and J.A. Townley, "Symbolic Evaluation and the Analysis of Programs," IEEE Trans. Software Eng., vol. 5, no. 4, pp. 402-417, July 1979.
[8] J.-D. Choi, M. Burke, and P. Carini, "Efficient Flow-Sensitive Interprocedural Computation of Pointer-Induced Aliases and Side Effects," Proc. 20th Ann. ACM Symp. Principles of Programming Languages, pp. 232-245, Jan. 1993.
[9] L.A. Clarke and D.J. Richardson, "Applications of Symbolic Evaluation," J. Systems and Software, vol. 5, no. 1, pp. 15-35, 1985.



[10] B. Creusillet and F. Irigoin, "Interprocedural Array Region Analyses," Int'l J. Parallel Programming, vol. 24, no. 6, pp. 513-546, Dec. 1996.
[11] A. Deutsch, "Interprocedural May-Alias Analysis for Pointers: Beyond k-Limiting," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, pp. 230-241, June 1994.
[12] E. Duesterwald, R. Gupta, and M.L. Soffa, "A Practical Data-Flow Framework for Array Reference Analysis and Its Use in Optimizations," Proc. ACM SIGPLAN '93 Conf. Programming Language Design and Implementation, pp. 68-77, June 1993.
[13] R. Eigenmann, J. Hoeflinger, and D. Padua, "On the Automatic Parallelization of the Perfect Benchmarks," Technical Report TR 1392, Center for Supercomputing Research and Development, Univ. of Illinois, Urbana-Champaign, Nov. 1994.
[14] M. Emami, R. Ghiya, and L.J. Hendren, "Context-Sensitive Interprocedural Points-to Analysis in the Presence of Function Pointers," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, pp. 242-256, 1994.
[15] R. Eigenmann, J. Hoeflinger, Z. Li, and D. Padua, "Experience in the Automatic Parallelization of Four Perfect-Benchmark Programs," Proc. Fourth Workshop Languages and Compilers for Parallel Computing, Aug. 1991.
[16] R. Eigenmann, J. Hoeflinger, and D. Padua, "On the Automatic Parallelization of the Perfect Benchmarks," IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 1, pp. 5-23, Jan. 1998.
[17] P. Feautrier, "Dataflow Analysis of Array and Scalar References," Int'l J. Parallel Programming, vol. 20, no. 1, pp. 23-53, Feb. 1991.
[18] J. Gu, "Interprocedural Array Data-Flow Analysis," doctoral dissertation, Dept. of Computer Science and Eng., Univ. of Minnesota, Dec. 1997.
[19] T. Gross and P. Steenkiste, "Structured Data-Flow Analysis for Arrays and Its Use in an Optimizing Compiler," Software Practice and Experience, vol. 20, no. 2, pp. 133-155, Feb. 1990.
[20] J. Gu, Z. Li, and G. Lee, "Symbolic Array Dataflow Analysis for Array Privatization and Program Parallelization," Proc. Supercomputing, Dec. 1995.
[21] M.W. Hall, B.R. Murphy, S.P. Amarasinghe, S.-W. Liao, and M.S. Lam, "Interprocedural Analysis for Parallelization," Proc. Eighth Workshop Languages and Compilers for Parallel Computing, pp. 61-80, Aug. 1995.
[22] M.R. Haghighat and C.D. Polychronopoulos, "Symbolic Dependence Analysis for Parallelizing Compilers," CSRD Report No. 1355, Center for Supercomputing Research and Development, Univ. of Illinois, 1994.
[23] M.W. Hall, J.M. Anderson, S.P. Amarasinghe, B.R. Murphy, S.-W. Liao, E. Bugnion, and M.S. Lam, "Maximizing Multiprocessor Performance with the SUIF Compiler," Computer, vol. 28, no. 11, pp. 84-89, Nov. 1996.
[24] P. Havlak and K. Kennedy, "An Implementation of Interprocedural Bounded Regular Section Analysis," IEEE Trans. Parallel and Distributed Systems, vol. 2, no. 3, pp. 350-360, 1991.
[25] F. Irigoin, P. Jouvelot, and R. Triolet, "Semantical Interprocedural Parallelization: An Overview of the PIPS Project," Proc. ACM Int'l Conf. Supercomputing, pp. 244-251, 1991.
[26] D.J. Kuck, The Structure of Computers and Computations, vol. 1. John Wiley & Sons, 1978.
[27] W. Landi and B.G. Ryder, "A Safe Approximate Algorithm for Interprocedural Pointer Aliasing," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, pp. 235-248, June 1992.
[28] W. Landi, B.G. Ryder, and S. Zhang, "Interprocedural Modification Side Effect Analysis with Pointer Aliasing," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, pp. 56-67, June 1993.
[29] Z. Li, "Array Privatization for Parallel Execution of Loops," Proc. ACM Int'l Conf. Supercomputing, pp. 313-322, July 1992.
[30] Z. Li and P.-C. Yew, "Interprocedural Analysis for Parallel Computing," Proc. 1988 Int'l Conf. Parallel Processing, pp. 221-228, Aug. 1988.
[31] Z. Li and P.-C. Yew, "Program Parallelization with Interprocedural Analysis," J. Supercomputing, vol. 2, no. 2, pp. 225-244, Oct. 1988.
[32] V. Maslov, "Lazy Array Data-Flow Dependence Analysis," Proc. 21st Ann. ACM Symp. Principles of Programming Languages, pp. 311-325, Jan. 1994.
[33] D.E. Maydan, S.P. Amarasinghe, and M.S. Lam, "Array Data-Flow Analysis and Its Use in Array Privatization," Proc. 20th ACM Symp. Principles of Programming Languages, pp. 2-15, Jan. 1993.
[34] D.E. Maydan, "Accurate Analysis of Array References," PhD thesis, Stanford Univ., Oct. 1992.
[35] T. Nguyen, J. Gu, and Z. Li, "An Interprocedural Parallelizing Compiler and Its Support for Memory Hierarchy Research," Proc. Eighth Int'l Workshop Languages and Compilers for Parallel Computing, pp. 96-110, Aug. 1995.
[36] E.W. Myers, "A Precise Interprocedural Data-Flow Algorithm," Proc. Eighth Ann. ACM Symp. Principles of Programming Languages, pp. 219-230, Jan. 1981.
[37] W. Pugh and D. Wonnacott, "An Exact Method for Analysis of Value-Based Array Data Dependences," Proc. Sixth Ann. Workshop Programming Languages and Compilers for Parallel Computing, Aug. 1993.
[38] C. Rosene, "Incremental Dependence Analysis," Technical Report CRPC-TR90044, PhD thesis, Computer Science Dept., Rice Univ., Mar. 1990.
[39] E. Ruf, "Context-Insensitive Alias Analysis Reconsidered," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, pp. 13-31, June 1995.
[40] R. Triolet, F. Irigoin, and P. Feautrier, "Direct Parallelization of CALL Statements," Proc. ACM SIGPLAN '86 Symp. Compiler Construction, pp. 176-185, July 1986.
[41] R. Triolet, "Interprocedural Analysis for Program Restructuring with Parafrase," CSRD Report No. 538, Center for Supercomputing Research and Development, Univ. of Illinois, Urbana-Champaign, Dec. 1985.
[42] P. Tu and D. Padua, "Gated SSA-Based Demand-Driven Symbolic Analysis for Parallelizing Compilers," Proc. Int'l Conf. Supercomputing, pp. 414-423, July 1995.
[43] P. Tu and D. Padua, "Automatic Array Privatization," Proc. Sixth Workshop Languages and Compilers for Parallel Computing, pp. 500-521, Aug. 1993.
[44] R.P. Wilson and M.S. Lam, "Efficient Context-Sensitive Pointer Analysis for C Programs," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, pp. 1-12, June 1995.
[45] S. Zhang, B.G. Ryder, and W. Landi, "Program Decomposition for Pointer Aliasing: A Step towards Practical Analyses," Proc. Fourth Symp. Foundations of Software Eng., Oct. 1996.

Junjie Gu received his PhD degree in 1997 from the Department of Computer Science and Engineering, University of Minnesota. He is a senior software engineer at Sun Microsystems, Inc., which he joined in 1997. He was a research associate from 1986 to 1992 at the Institute of Computing Technology, Chinese Academy of Sciences, People's Republic of China. He is a member of the IEEE Computer Society. His research interest is in programming languages and compilers.

Zhiyuan Li received his PhD degree in 1989 from the Department of Computer Science, University of Illinois. He is an associate professor in the Department of Computer Sciences at Purdue University, which he joined in 1997. He was an assistant professor in the Department of Computer Science at the University of Minnesota from 1991 to 1997. He was formerly a senior software engineer at the Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, from 1990 to 1991. He taught in the Department of Computer Science at York University from 1989 to 1990. He is a member of the IEEE Computer Society. His research is in the area of compilers and system software for high-performance computers.


