Compiling High Performance Fortran

Optimizing Compilers for Modern Architectures

Allen and Kennedy, Chapter 14

Overview

• Motivation for HPF

• Overview of compiling HPF programs

• Basic Loop Compilation for HPF

• Optimizations for compiling HPF

• Results and Summary

Motivation for HPF

• Distributed-memory machines require “message passing” to communicate data between processors

• Approach 1: Use MPI calls in Fortran/C code

[Figure: scalable distributed-memory multiprocessor]
• Consider the following sum reduction

PROGRAM SUM
REAL A(10000)
READ (9) A
SUM = 0.0
DO I = 1, 10000
   SUM = SUM + A(I)
ENDDO
PRINT SUM
END

• The MPI implementation:

PROGRAM SUM
REAL A(100), BUFF(100)
IF (PID == 0) THEN
   DO IP = 0, 99
      READ (9) BUFF(1:100)
      IF (IP == 0) THEN
         A(1:100) = BUFF(1:100)
      ELSE
         SEND(IP, BUFF, 100)
      ENDIF
   ENDDO
ELSE
   RECV(0, A, 100)
ENDIF
! actual sum reduction code here
IF (PID == 0) SEND(1, SUM, 1)
IF (PID > 0) THEN
   RECV(PID-1, T, 1)
   SUM = SUM + T
   IF (PID < 99) THEN
      SEND(PID+1, SUM, 1)
   ELSE
      SEND(0, SUM, 1)
   ENDIF
ENDIF
IF (PID == 0) THEN
   RECV(99, SUM, 1)   ! receive the final sum from the last processor
   PRINT SUM
ENDIF
END
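Here SEND and RECV are the deck's pseudocode primitives, not a real library. As a rough sketch of one possible realization on top of MPI (my assumption; the slides do not fix an implementation), they map onto blocking point-to-point calls:

SUBROUTINE SEND(P, X, N)
   INCLUDE 'mpif.h'
   INTEGER P, N, IERR
   REAL X(N)
   ! blocking send of N reals to processor P (tag 0)
   CALL MPI_SEND(X, N, MPI_REAL, P, 0, MPI_COMM_WORLD, IERR)
END

SUBROUTINE RECV(P, X, N)
   INCLUDE 'mpif.h'
   INTEGER P, N, IERR
   REAL X(N)
   ! blocking receive of N reals from processor P (tag 0)
   CALL MPI_RECV(X, N, MPI_REAL, P, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, IERR)
END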

• Disadvantages of the MPI approach:
– The user has to rewrite the program in SPMD [Single Program Multiple Data] form
– The user has to manage data movement [send and receive], data placement, and synchronization
– Too messy and not easy to master

• Approach 2: Use HPF
– HPF is an extended version of Fortran 90
– HPF has Fortran 90 features plus a few directives

• Directives
– Tell how data is laid out in processor memories in the parallel machine configuration, e.g. !HPF$ DISTRIBUTE A(BLOCK)
– Assist in identifying parallelism, e.g. !HPF$ INDEPENDENT
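As a concrete sketch (mine, not from the slides) of how these directives annotate otherwise ordinary Fortran 90, assuming a loop whose iterations the programmer knows to be independent:

REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK)        ! split A into contiguous blocks across processors
!HPF$ ALIGN B(I) WITH A(I)       ! keep B(I) on the same processor as A(I)
!HPF$ INDEPENDENT
DO I = 1, 10000                  ! iterations asserted to be independent
   A(I) = B(I) * 2.0
ENDDO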

• The same sum reduction code

PROGRAM SUM
REAL A(10000)
READ (9) A
SUM = 0.0
DO I = 1, 10000
   SUM = SUM + A(I)
ENDDO
PRINT SUM
END

• When written in HPF...

PROGRAM SUM
REAL A(10000)
!HPF$ DISTRIBUTE A(BLOCK)
READ (9) A
SUM = 0.0
DO I = 1, 10000
   SUM = SUM + A(I)
ENDDO
PRINT SUM
END

• Minimum modification

• Easy to write

• Now the compiler has to do more work

• Advantages of HPF:
– The user only needs to write a few easy directives, not the whole program in SPMD form
– The user does not need to manage data movement [send and receive] or synchronization
– Simple and easy to master

HPF Compilation Overview

• Running example:

REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
   DO I = 2, 10000
S1:   A(I) = B(I-1) + C
   ENDDO
   DO I = 1, 10000
S2:   B(I) = A(I)
   ENDDO
ENDDO

• Dependence Analysis
– Used for communication analysis; the fact used is that no dependence is carried by the I loops

• Distribution Analysis

• Computation Partitioning
– Partition so as to distribute the work of the I loops
• After partitioning and communication placement, each processor's node program looks like:

REAL A(1:100), B(0:100)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
I1: IF (PID /= 99) SEND(PID+1, B(100), 1)
I2: IF (PID /= 0) THEN
       RECV(PID-1, B(0), 1)
       A(1) = B(0) + C
    ENDIF
    DO I = 2, 100
S1:    A(I) = B(I-1) + C
    ENDDO
    DO I = 1, 100
S2:    B(I) = A(I)
    ENDDO
ENDDO

• Communication Analysis and Placement
– Communication is required for B(0) on each J iteration; B(0) is a shadow region
• Optimization then moves the receive past the first loop, so the communication overlaps the S1 computation:

REAL A(1:100), B(0:100)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
I1: IF (PID /= 99) SEND(PID+1, B(100), 1)
    DO I = 2, 100
S1:    A(I) = B(I-1) + C
    ENDDO
I2: IF (PID /= 0) THEN
       RECV(PID-1, B(0), 1)
       A(1) = B(0) + C
    ENDIF
    DO I = 1, 100
S2:    B(I) = A(I)
    ENDDO
ENDDO

• Optimization
– Aggregation
– Overlapping communication and computation
– Recognition of reductions
Basic Loop Compilation

• Distribution propagation and analysis
– Analyze what distribution holds for a given array at a given point in the program
– Difficult due to:
   – REALIGN and REDISTRIBUTE directives
   – Distributions of formal parameters inherited from the calling procedure
– Use “Reaching Decompositions” data-flow analysis and its interprocedural version
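A sketch (mine, not from the slides) of why a data-flow analysis is needed: with HPF's DYNAMIC attribute, an array's distribution can change at run time, so different decompositions may reach the same reference. The logical flag COND below is hypothetical:

REAL A(10000)
!HPF$ DYNAMIC A
!HPF$ DISTRIBUTE A(BLOCK)
...
IF (COND) THEN
!HPF$ REDISTRIBUTE A(CYCLIC)
ENDIF
! Both the BLOCK and the CYCLIC decomposition of A can reach this
! reference, so the compiler must merge the possibilities here.
A(1) = A(1) + 1.0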

• For simplicity, assume a single distribution for an array at all points in a subprogram

• Define μA(i) = (ρA(i), δA(i)) = (p, j): element i of A lives on processor p at local index j

• For example, suppose an array A of size N is block-distributed over p processors. Then the block size and the ownership functions are:

BA = ceiling(N/p)
ρA(i) = ceiling(i/BA) − 1
δA(i) = ((i−1) mod BA) + 1
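These functions are simple enough to sketch as integer code (my illustration), using the identity ceiling(i/B) − 1 = (i−1)/B under integer division with positive operands:

! BLOCK-distribution ownership and local offset (0-based processors,
! 1-based global and local indices). N = extent, NP = processor count.
INTEGER FUNCTION RHO(I, N, NP)
   INTEGER I, N, NP, BA
   BA = (N + NP - 1) / NP      ! block size B_A = ceiling(N/NP)
   RHO = (I - 1) / BA          ! owning processor
END
INTEGER FUNCTION DELTA(I, N, NP)
   INTEGER I, N, NP, BA
   BA = (N + NP - 1) / NP
   DELTA = MOD(I - 1, BA) + 1  ! local index on that processor
END
! e.g. N = 10000, NP = 100: RHO(100,...) = 0, RHO(101,...) = 1, DELTA(101,...) = 1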

• Iteration Partitioning
– Dividing the work among processors (computation partitioning)
– Determine which iterations of a loop will be executed on which processor
– Owner-computes rule

REAL A(10000)
!HPF$ DISTRIBUTE A(BLOCK)
DO I = 1, 10000
   A(I) = A(I) + C
ENDDO

• Iteration I is executed on the owner of A(I)

• With 100 processors: the first 100 iterations run on processor 0, the next 100 on processor 1, and so on

Iteration Partitioning

• When a loop contains multiple statements or a recurrence, choose a partitioning reference A(α(I))

• The processor responsible for performing the computation of iteration I is θL(I) = ρA(α(I))

• The set of iterations executed on processor p is {I | 1 ≤ I ≤ N; θL(I) = p} = α−1(ρA−1({p})) ∩ [1..N]
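In the simplest case (a sketch of mine, reusing RHO from above and taking the partitioning reference to be A(I), so α(I) = I), the owner-computes rule can be read as a guard on every iteration; the rest of this section is about compiling the guard away into local loop bounds:

DO I = 1, 10000
   ! theta_L(I) = rho_A(alpha(I)); run the iteration only on the owner
   IF (RHO(I, 10000, 100) == PID) THEN
      A(I) = A(I) + C
   ENDIF
ENDDO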

• Have to map the global loop index to a local loop index: ΔL : global loop index → local loop index

• The smallest value in α−1(ρA−1({p})) maps to 1

REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO I = 1, N
   A(I+1) = B(I) + C
ENDDO

• Map the global iteration space I to the local iteration space i as follows:

ρA−1({p}) = [100p+1 : 100p+100]
α−1(ρA−1({p})) = [100p : 100p+99]
ΔL(I,p) = I − min(α−1(ρA−1({p}))) + 1 = I − 100p + 1

• Adjust array subscripts for local iterations: B(β(I)) → B(γ(i)), where γ(i) = δB(β(ΔL−1(i,p)))

• For this example:

ΔL−1(i,PID) = i + 100*PID − 1
δB(k) = k − 100*PID
γ(i) = i + 100*PID − 1 − 100*PID = i − 1
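A tiny runnable check of this derivation (my sketch; block size 100, sample processor PID = 3):

PROGRAM CHECK
   INTEGER i, PID, IG
   PID = 3
   DO i = 1, 5
      IG = i + 100*PID - 1      ! Delta_L inverse: the global iteration I
      ! beta(I) = I and delta_B(I) = I - 100*PID, so gamma(i) = i - 1:
      PRINT *, i, IG - 100*PID
   ENDDO
END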

• For interior processors the code becomes:

DO i = 1, 100
   A(i) = B(i-1) + C
ENDDO

• Adjust the lower bound for the first processor and the upper bound for the last processor to handle the boundary conditions:

lo = 1
IF (PID == 0) lo = 2
hi = 100
IF (PID == CEIL((N+1)/100)-1) hi = MOD(N,100) + 1
DO i = lo, hi
   A(i) = B(i-1) + C
ENDDO

Communication Generation

• For our example, no communication is required for iterations in α−1(ρA−1({p})) ∩ β−1(ρB−1({p})) ∩ [1..N]

• Iterations which require receiving data are (α−1(ρA−1({p})) − β−1(ρB−1({p}))) ∩ [1..N]

• Iterations which require sending data are (β−1(ρB−1({p})) − α−1(ρA−1({p}))) ∩ [1..N]

REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
...
DO I = 1, N
   A(I+1) = B(I) + C
ENDDO

• Receive required for iterations in [100p:100p]

• Send required for iterations in [100p+100:100p+100]

• No communication required for iterations in [100p+1:100p+99]

• After inserting the receive:

lo = 1
IF (PID == 0) lo = 2
hi = 100
IF (PID == CEIL((N+1)/100)-1) hi = MOD(N,100) + 1
DO i = lo, hi
   IF (i == 1 && PID /= 0) RECV(PID-1, B(0), 1)
   A(i) = B(i-1) + C
ENDDO

• The send must happen in the 101st iteration:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
DO i = lo, hi+1
   IF (i == 1 && PID /= 0) RECV(PID-1, B(0), 1)
   IF (i <= hi) THEN
      A(i) = B(i-1) + C
   ENDIF
   IF (i == hi+1 && PID /= lastP) SEND(PID+1, B(100), 1)
ENDDO

• Move the SEND outside the loop:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
   DO i = lo, hi
      IF (i == 1 && PID /= 0) RECV(PID-1, B(0), 1)
      A(i) = B(i-1) + C
   ENDDO
   IF (PID /= lastP) SEND(PID+1, B(100), 1)
ENDIF

• Move the receive outside the loop by peeling the first iteration:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
   IF (lo == 1 && PID /= 0) THEN
      RECV(PID-1, B(0), 1)
      A(1) = B(0) + C
   ENDIF
   ! lo = MAX(lo,1+1) == 2
   DO i = 2, hi
      A(i) = B(i-1) + C
   ENDDO
   IF (PID /= lastP) SEND(PID+1, B(100), 1)
ENDIF

• Do the send as early as possible: moving the SEND ahead of the receive lets every processor ship its boundary value before waiting on its neighbor:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
   IF (PID /= lastP) SEND(PID+1, B(100), 1)
   IF (lo == 1 && PID /= 0) THEN
      RECV(PID-1, B(0), 1)
      A(1) = B(0) + C
   ENDIF
   DO i = 2, hi
      A(i) = B(i-1) + C
   ENDDO
ENDIF

• When is such rearrangement legal?

• Receive: a copy from a global to a local location

• Send: a copy from a local to a global location

IF (PID <= lastP) THEN
S1: IF (lo == 1 && PID /= 0) THEN
       B(0) = Bg(0) ! RECV
       A(1) = B(0) + C
    ENDIF
    DO i = 2, hi
       A(i) = B(i-1) + C
    ENDDO
S2: IF (PID /= lastP) Bg(100) = B(100) ! SEND
ENDIF

• The rearrangement is legal because there is no chain of dependences from S1 to S2

• But consider:

REAL A(10000), B(10000)

!HPF$ DISTRIBUTE A(BLOCK)

...

DO I = 1, N
   A(I+1) = A(I) + C
ENDDO

• This would be rewritten as:

IF (PID <= lastP) THEN
S1: IF (lo == 1 && PID /= 0) THEN
       A(0) = Ag(0) ! RECV
       A(1) = A(0) + C
    ENDIF
    DO i = 2, hi
       A(i) = A(i-1) + C
    ENDDO
S2: IF (PID /= lastP) Ag(100) = A(100) ! SEND
ENDIF

• Here the rearrangement would not be correct: A(100), the value sent at S2, is computed by the loop, which in turn uses the value received at S1, so a chain of dependences runs from S1 to S2

Communication Vectorization

REAL A(10000,100), B(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)
DO J = 1, M
   DO I = 1, N
      A(I+1,J) = B(I,J) + C
   ENDDO
ENDDO

• Basic loop compilation gives:

DO J = 1, M
   lo = 1
   IF (PID == 0) lo = 2
   hi = 100
   lastP = CEIL((N+1)/100) - 1
   IF (PID == lastP) hi = MOD(N,100) + 1
   IF (PID <= lastP) THEN
      IF (PID /= lastP) SEND(PID+1, B(100,J), 1)
      IF (lo == 1) THEN
         RECV(PID-1, B(0,J), 1)
         A(1,J) = B(0,J) + C
      ENDIF
      DO i = 2, hi
         A(i,J) = B(i-1,J) + C
      ENDDO
   ENDIF
ENDDO

• Distribute the J loop:
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
   DO J = 1, M
      IF (PID /= lastP) SEND(PID+1, B(100,J), 1)
   ENDDO
   DO J = 1, M
      IF (lo == 1) THEN
         RECV(PID-1, B(0,J), 1)
         A(1,J) = B(0,J) + C
      ENDIF
   ENDDO
   DO J = 1, M
      DO i = 2, hi
         A(i,J) = B(i-1,J) + C
      ENDDO
   ENDDO
ENDIF

• Vectorize the communication:
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
   IF (lo == 1) THEN
      RECV(PID-1, B(0,1:M), M)
      DO J = 1, M
         A(1,J) = B(0,J) + C
      ENDDO
   ENDIF
   DO J = 1, M
      DO i = 2, hi
         A(i,J) = B(i-1,J) + C
      ENDDO
   ENDDO
   IF (PID /= lastP) SEND(PID+1, B(100,1:M), M)
ENDIF

• With the sends and receives rewritten as copies, the legality of this transformation can be checked on ordinary dependences:

DO J = 1, M
   lo = 1
   IF (PID == 0) lo = 2
   hi = 100
   lastP = CEIL((N+1)/100) - 1
   IF (PID == lastP) hi = MOD(N,100) + 1
   IF (PID <= lastP) THEN
S1:   IF (PID /= lastP) Bg(100,J) = B(100,J)
      IF (lo == 1) THEN
S2:      B(0,J) = Bg(0,J)
S3:      A(1,J) = B(0,J) + C
      ENDIF
      DO i = 2, hi
S4:      A(i,J) = B(i-1,J) + C
      ENDDO
   ENDIF
ENDDO

• Communication statements resulting from an inner loop can be vectorized with respect to an outer loop if they are not involved in a recurrence carried by the outer loop

• Consider:

REAL A(10000,100), B(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)

DO J = 1, M
   DO I = 1, N
      A(I+1,J) = A(I,J) + B(I,J)
   ENDDO
ENDDO

• Can sends be done before the receives?

• Can communication be vectorized?

REAL A(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*)
DO J = 1, M
   DO I = 1, N
      A(I+1,J+1) = A(I,J) + C
   ENDDO
ENDDO

• Can sends be done before the receives?

• Can communication be fully vectorized?

Overlapping Communication and Computation

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
S0: IF (PID /= lastP) SEND(PID+1, B(100), 1)
S1: IF (lo == 1 && PID /= 0) THEN
       RECV(PID-1, B(0), 1)
       A(1) = B(0) + C
    ENDIF
L1: DO i = 2, hi
       A(i) = B(i-1) + C
    ENDDO
ENDIF

• Moving the receive (S1) past the independent loop (L1) overlaps the communication with the computation:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
S0: IF (PID /= lastP) SEND(PID+1, B(100), 1)
L1: DO i = 2, hi
       A(i) = B(i-1) + C
    ENDDO
S1: IF (lo == 1 && PID /= 0) THEN
       RECV(PID-1, B(0), 1)
       A(1) = B(0) + C
    ENDIF
ENDIF

Pipelining

REAL A(10000,100)

!HPF$ DISTRIBUTE A(BLOCK,*)
DO J = 1, M
   DO I = 1, N
      A(I+1,J) = A(I,J) + C
   ENDDO
ENDDO

• Initial code generation for the I loop gives:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
   DO J = 1, M
      IF (lo == 1) THEN
         RECV(PID-1, A(0,J), 1)
         A(1,J) = A(0,J) + C
      ENDIF
      DO i = 2, hi
         A(i,J) = A(i-1,J) + C
      ENDDO
      IF (PID /= lastP) SEND(PID+1, A(100,J), 1)
   ENDDO
ENDIF

• The communication can be vectorized, but that gives up the pipelined parallelism

[Figure: pipelined parallelism with communication]

[Figure: pipelined parallelism with communication overhead]

Pipelining: Blocking

• Block the J loop by a factor K and vectorize the communication within each block; larger K means fewer, larger messages, while smaller K starts downstream processors sooner:
...

IF (PID <= lastP) THEN
   DO J = 1, M, K
      IF (lo == 1) THEN
         RECV(PID-1, A(0,J:J+K-1), K)
         DO j = J, J+K-1
            A(1,j) = A(0,j) + C
         ENDDO
      ENDIF
      DO j = J, J+K-1
         DO i = 2, hi
            A(i,j) = A(i-1,j) + C
         ENDDO
      ENDDO
      IF (PID /= lastP) SEND(PID+1, A(100,J:J+K-1), K)
   ENDDO
ENDIF

Other Optimizations

• Alignment and replication

• Identification of common recurrences

• Storage management
– Minimize the temporary storage used for communication
– The space taken for temporary storage should be at most equal to the space taken by the arrays themselves

• Interprocedural optimizations

Results

[Figure: experimental results]
Summary

• HPF is easy to code, but hard to compile

• Steps required to compile HPF programs:
– Basic loop compilation
– Communication generation
– Optimizations:
   – Communication vectorization
   – Overlapping communication with computation
   – Pipelining
