Compiling High Performance Fortran

Optimizing Compilers for Modern Architectures

Allen and Kennedy, Chapter 14

Overview

• Motivation for HPF

• Overview of compiling HPF programs

• Basic Loop Compilation for HPF

• Optimizations for compiling HPF

• Results and Summary

Motivation for HPF

• Distributed-memory machines require “message passing” to communicate data between processors

• Approach 1: Use MPI calls in Fortran/C code

[Figure: scalable distributed-memory multiprocessor]
• Consider the following sum reduction

PROGRAM SUM
REAL A(10000)
READ (9) A
SUM = 0.0
DO I = 1, 10000
   SUM = SUM + A(I)
ENDDO
PRINT SUM
END

• The MPI implementation:

PROGRAM SUM
REAL A(100), BUFF(100)
IF (PID == 0) THEN
   DO IP = 0, 99
      READ (9) BUFF(1:100)
      IF (IP == 0) THEN
         A(1:100) = BUFF(1:100)
      ELSE
         SEND(IP, BUFF, 100)
      ENDIF
   ENDDO
ELSE
   RECV(0, A, 100)
ENDIF
! actual sum reduction code here
IF (PID == 0) SEND(1, SUM, 1)
IF (PID > 0) THEN
   RECV(PID-1, T, 1)
   SUM = SUM + T
   IF (PID < 99) THEN
      SEND(PID+1, SUM, 1)
   ELSE
      SEND(0, SUM, 1)
   ENDIF
ENDIF
IF (PID == 0) THEN
   RECV(99, SUM, 1)   ! receive the final sum from the last processor
   PRINT SUM
ENDIF
END
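Here SEND and RECV are the deck's pseudocode primitives, not a real library. As a rough sketch of one possible realization on top of MPI (my assumption; the slides do not fix an implementation), they map onto blocking point-to-point calls:

SUBROUTINE SEND(P, X, N)
   INCLUDE 'mpif.h'
   INTEGER P, N, IERR
   REAL X(N)
   ! blocking send of N reals to processor P (tag 0)
   CALL MPI_SEND(X, N, MPI_REAL, P, 0, MPI_COMM_WORLD, IERR)
END

SUBROUTINE RECV(P, X, N)
   INCLUDE 'mpif.h'
   INTEGER P, N, IERR
   REAL X(N)
   ! blocking receive of N reals from processor P (tag 0)
   CALL MPI_RECV(X, N, MPI_REAL, P, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, IERR)
END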

• Disadvantages of the MPI approach:
– The user has to rewrite the program in SPMD [Single Program Multiple Data] form
– The user has to manage data movement [send and receive], data placement, and synchronization
– Too messy and not easy to master

• Approach 2: Use HPF
– HPF is an extended version of Fortran 90
– HPF has Fortran 90 features plus a few directives

• Directives
– Tell how data is laid out in processor memories in the parallel machine configuration, e.g. !HPF$ DISTRIBUTE A(BLOCK)
– Assist in identifying parallelism, e.g. !HPF$ INDEPENDENT
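As a concrete sketch (mine, not from the slides) of how these directives annotate otherwise ordinary Fortran 90, assuming a loop whose iterations the programmer knows to be independent:

REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK)        ! split A into contiguous blocks across processors
!HPF$ ALIGN B(I) WITH A(I)       ! keep B(I) on the same processor as A(I)
!HPF$ INDEPENDENT
DO I = 1, 10000                  ! iterations asserted to be independent
   A(I) = B(I) * 2.0
ENDDO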

• The same sum reduction code

PROGRAM SUM
REAL A(10000)
READ (9) A
SUM = 0.0
DO I = 1, 10000
   SUM = SUM + A(I)
ENDDO
PRINT SUM
END

• When written in HPF...

PROGRAM SUM
REAL A(10000)
!HPF$ DISTRIBUTE A(BLOCK)
READ (9) A
SUM = 0.0
DO I = 1, 10000
   SUM = SUM + A(I)
ENDDO
PRINT SUM
END

• Minimum modification

• Easy to write

• Now the compiler has to do more work

• Advantages of HPF:
– The user only needs to write a few easy directives, not the whole program in SPMD form
– The user does not need to manage data movement [send and receive] or synchronization
– Simple and easy to master

HPF Compilation Overview

• Running example:

REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
   DO I = 2, 10000
S1:   A(I) = B(I-1) + C
   ENDDO
   DO I = 1, 10000
S2:   B(I) = A(I)
   ENDDO
ENDDO

• Dependence Analysis
– Used for communication analysis; the fact used is that no dependence is carried by the I loops

• Distribution Analysis

• Computation Partitioning
– Partition so as to distribute the work of the I loops
• After partitioning and communication placement, each processor's node program looks like:

REAL A(1:100), B(0:100)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
I1: IF (PID /= 99) SEND(PID+1, B(100), 1)
I2: IF (PID /= 0) THEN
       RECV(PID-1, B(0), 1)
       A(1) = B(0) + C
    ENDIF
    DO I = 2, 100
S1:    A(I) = B(I-1) + C
    ENDDO
    DO I = 1, 100
S2:    B(I) = A(I)
    ENDDO
ENDDO

• Communication Analysis and Placement
– Communication is required for B(0) on each J iteration; B(0) is a shadow region
• Optimization then moves the receive past the first loop, so the communication overlaps the S1 computation:

REAL A(1:100), B(0:100)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO J = 1, 10000
I1: IF (PID /= 99) SEND(PID+1, B(100), 1)
    DO I = 2, 100
S1:    A(I) = B(I-1) + C
    ENDDO
I2: IF (PID /= 0) THEN
       RECV(PID-1, B(0), 1)
       A(1) = B(0) + C
    ENDIF
    DO I = 1, 100
S2:    B(I) = A(I)
    ENDDO
ENDDO

• Optimization
– Aggregation
– Overlapping communication and computation
– Recognition of reductions
Basic Loop Compilation

• Distribution propagation and analysis
– Analyze what distribution holds for a given array at a given point in the program
– Difficult due to:
   – REALIGN and REDISTRIBUTE directives
   – Distributions of formal parameters inherited from the calling procedure
– Use “Reaching Decompositions” data-flow analysis and its interprocedural version
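A sketch (mine, not from the slides) of why a data-flow analysis is needed: with HPF's DYNAMIC attribute, an array's distribution can change at run time, so different decompositions may reach the same reference. The logical flag COND below is hypothetical:

REAL A(10000)
!HPF$ DYNAMIC A
!HPF$ DISTRIBUTE A(BLOCK)
...
IF (COND) THEN
!HPF$ REDISTRIBUTE A(CYCLIC)
ENDIF
! Both the BLOCK and the CYCLIC decomposition of A can reach this
! reference, so the compiler must merge the possibilities here.
A(1) = A(1) + 1.0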

• For simplicity, assume a single distribution for an array at all points in a subprogram

• Define μA(i) = (ρA(i), δA(i)) = (p, j): element i of A lives on processor p at local index j

• For example, suppose an array A of size N is block-distributed over p processors. Then the block size and the ownership functions are:

BA = ceiling(N/p)
ρA(i) = ceiling(i/BA) − 1
δA(i) = ((i−1) mod BA) + 1
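These functions are simple enough to sketch as integer code (my illustration), using the identity ceiling(i/B) − 1 = (i−1)/B under integer division with positive operands:

! BLOCK-distribution ownership and local offset (0-based processors,
! 1-based global and local indices). N = extent, NP = processor count.
INTEGER FUNCTION RHO(I, N, NP)
   INTEGER I, N, NP, BA
   BA = (N + NP - 1) / NP      ! block size B_A = ceiling(N/NP)
   RHO = (I - 1) / BA          ! owning processor
END
INTEGER FUNCTION DELTA(I, N, NP)
   INTEGER I, N, NP, BA
   BA = (N + NP - 1) / NP
   DELTA = MOD(I - 1, BA) + 1  ! local index on that processor
END
! e.g. N = 10000, NP = 100: RHO(100,...) = 0, RHO(101,...) = 1, DELTA(101,...) = 1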

• Iteration Partitioning
– Dividing the work among processors (computation partitioning)
– Determine which iterations of a loop will be executed on which processor
– Owner-computes rule

REAL A(10000)
!HPF$ DISTRIBUTE A(BLOCK)
DO I = 1, 10000
   A(I) = A(I) + C
ENDDO

• Iteration I is executed on the owner of A(I)

• With 100 processors: the first 100 iterations run on processor 0, the next 100 on processor 1, and so on

Iteration Partitioning

• When a loop contains multiple statements or a recurrence, choose a partitioning reference A(α(I))

• The processor responsible for performing the computation of iteration I is θL(I) = ρA(α(I))

• The set of iterations executed on processor p is {I | 1 ≤ I ≤ N; θL(I) = p} = α−1(ρA−1({p})) ∩ [1..N]
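In the simplest case (a sketch of mine, reusing RHO from above and taking the partitioning reference to be A(I), so α(I) = I), the owner-computes rule can be read as a guard on every iteration; the rest of this section is about compiling the guard away into local loop bounds:

DO I = 1, 10000
   ! theta_L(I) = rho_A(alpha(I)); run the iteration only on the owner
   IF (RHO(I, 10000, 100) == PID) THEN
      A(I) = A(I) + C
   ENDIF
ENDDO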

• Have to map the global loop index to a local loop index: ΔL : global loop index → local loop index

• The smallest value in α−1(ρA−1({p})) maps to 1

REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
DO I = 1, N
   A(I+1) = B(I) + C
ENDDO

• Map the global iteration space I to the local iteration space i as follows:

ρA−1({p}) = [100p+1 : 100p+100]
α−1(ρA−1({p})) = [100p : 100p+99]
ΔL(I,p) = I − min(α−1(ρA−1({p}))) + 1 = I − 100p + 1

• Adjust array subscripts for local iterations: B(β(I)) → B(γ(i)), where γ(i) = δB(β(ΔL−1(i,p)))

• For this example:

ΔL−1(i,PID) = i + 100*PID − 1
δB(k) = k − 100*PID
γ(i) = i + 100*PID − 1 − 100*PID = i − 1
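A tiny runnable check of this derivation (my sketch; block size 100, sample processor PID = 3):

PROGRAM CHECK
   INTEGER i, PID, IG
   PID = 3
   DO i = 1, 5
      IG = i + 100*PID - 1      ! Delta_L inverse: the global iteration I
      ! beta(I) = I and delta_B(I) = I - 100*PID, so gamma(i) = i - 1:
      PRINT *, i, IG - 100*PID
   ENDDO
END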

• For interior processors the code becomes:

DO i = 1, 100
   A(i) = B(i-1) + C
ENDDO

• Adjust the lower bound for the first processor and the upper bound for the last processor to handle the boundary conditions:

lo = 1
IF (PID == 0) lo = 2
hi = 100
IF (PID == CEIL((N+1)/100)-1) hi = MOD(N,100) + 1
DO i = lo, hi
   A(i) = B(i-1) + C
ENDDO

Communication Generation

• For our example, no communication is required for iterations in α−1(ρA−1({p})) ∩ β−1(ρB−1({p})) ∩ [1..N]

• Iterations which require receiving data are (α−1(ρA−1({p})) − β−1(ρB−1({p}))) ∩ [1..N]

• Iterations which require sending data are (β−1(ρB−1({p})) − α−1(ρA−1({p}))) ∩ [1..N]

REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
...
DO I = 1, N
   A(I+1) = B(I) + C
ENDDO

• Receive required for iterations in [100p:100p]

• Send required for iterations in [100p+100:100p+100]

• No communication required for iterations in [100p+1:100p+99]

• After inserting the receive:

lo = 1
IF (PID == 0) lo = 2
hi = 100
IF (PID == CEIL((N+1)/100)-1) hi = MOD(N,100) + 1
DO i = lo, hi
   IF (i == 1 && PID /= 0) RECV(PID-1, B(0), 1)
   A(i) = B(i-1) + C
ENDDO

• The send must happen in the 101st iteration:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
DO i = lo, hi+1
   IF (i == 1 && PID /= 0) RECV(PID-1, B(0), 1)
   IF (i <= hi) THEN
      A(i) = B(i-1) + C
   ENDIF
   IF (i == hi+1 && PID /= lastP) SEND(PID+1, B(100), 1)
ENDDO

• Move the SEND outside the loop:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
   DO i = lo, hi
      IF (i == 1 && PID /= 0) RECV(PID-1, B(0), 1)
      A(i) = B(i-1) + C
   ENDDO
   IF (PID /= lastP) SEND(PID+1, B(100), 1)
ENDIF

• Move the receive outside the loop by peeling the first iteration:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
   IF (lo == 1 && PID /= 0) THEN
      RECV(PID-1, B(0), 1)
      A(1) = B(0) + C
   ENDIF
   ! lo = MAX(lo,1+1) == 2
   DO i = 2, hi
      A(i) = B(i-1) + C
   ENDDO
   IF (PID /= lastP) SEND(PID+1, B(100), 1)
ENDIF

• Do the send as early as possible: moving the SEND ahead of the receive lets every processor ship its boundary value before waiting on its neighbor:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
   IF (PID /= lastP) SEND(PID+1, B(100), 1)
   IF (lo == 1 && PID /= 0) THEN
      RECV(PID-1, B(0), 1)
      A(1) = B(0) + C
   ENDIF
   DO i = 2, hi
      A(i) = B(i-1) + C
   ENDDO
ENDIF

• When is such rearrangement legal?

• Receive: a copy from a global to a local location

• Send: a copy from a local to a global location

IF (PID <= lastP) THEN
S1: IF (lo == 1 && PID /= 0) THEN
       B(0) = Bg(0) ! RECV
       A(1) = B(0) + C
    ENDIF
    DO i = 2, hi
       A(i) = B(i-1) + C
    ENDDO
S2: IF (PID /= lastP) Bg(100) = B(100) ! SEND
ENDIF

• The rearrangement is legal because there is no chain of dependences from S1 to S2

• But consider:

REAL A(10000), B(10000)

!HPF$ DISTRIBUTE A(BLOCK)

...

DO I = 1, N
   A(I+1) = A(I) + C
ENDDO

• This would be rewritten as:

IF (PID <= lastP) THEN
S1: IF (lo == 1 && PID /= 0) THEN
       A(0) = Ag(0) ! RECV
       A(1) = A(0) + C
    ENDIF
    DO i = 2, hi
       A(i) = A(i-1) + C
    ENDDO
S2: IF (PID /= lastP) Ag(100) = A(100) ! SEND
ENDIF

• Here the rearrangement would not be correct: A(100), the value sent at S2, is computed by the loop, which in turn uses the value received at S1, so a chain of dependences runs from S1 to S2

Communication Vectorization

REAL A(10000,100), B(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)
DO J = 1, M
   DO I = 1, N
      A(I+1,J) = B(I,J) + C
   ENDDO
ENDDO

• Basic loop compilation gives:

DO J = 1, M
   lo = 1
   IF (PID == 0) lo = 2
   hi = 100
   lastP = CEIL((N+1)/100) - 1
   IF (PID == lastP) hi = MOD(N,100) + 1
   IF (PID <= lastP) THEN
      IF (PID /= lastP) SEND(PID+1, B(100,J), 1)
      IF (lo == 1) THEN
         RECV(PID-1, B(0,J), 1)
         A(1,J) = B(0,J) + C
      ENDIF
      DO i = 2, hi
         A(i,J) = B(i-1,J) + C
      ENDDO
   ENDIF
ENDDO

• Distribute the J loop:
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
   DO J = 1, M
      IF (PID /= lastP) SEND(PID+1, B(100,J), 1)
   ENDDO
   DO J = 1, M
      IF (lo == 1) THEN
         RECV(PID-1, B(0,J), 1)
         A(1,J) = B(0,J) + C
      ENDIF
   ENDDO
   DO J = 1, M
      DO i = 2, hi
         A(i,J) = B(i-1,J) + C
      ENDDO
   ENDDO
ENDIF

• Vectorize the communication:
lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
   IF (lo == 1) THEN
      RECV(PID-1, B(0,1:M), M)
      DO J = 1, M
         A(1,J) = B(0,J) + C
      ENDDO
   ENDIF
   DO J = 1, M
      DO i = 2, hi
         A(i,J) = B(i-1,J) + C
      ENDDO
   ENDDO
   IF (PID /= lastP) SEND(PID+1, B(100,1:M), M)
ENDIF

• With the sends and receives rewritten as copies, the legality of this transformation can be checked on ordinary dependences:

DO J = 1, M
   lo = 1
   IF (PID == 0) lo = 2
   hi = 100
   lastP = CEIL((N+1)/100) - 1
   IF (PID == lastP) hi = MOD(N,100) + 1
   IF (PID <= lastP) THEN
S1:   IF (PID /= lastP) Bg(100,J) = B(100,J)
      IF (lo == 1) THEN
S2:      B(0,J) = Bg(0,J)
S3:      A(1,J) = B(0,J) + C
      ENDIF
      DO i = 2, hi
S4:      A(i,J) = B(i-1,J) + C
      ENDDO
   ENDIF
ENDDO

• Communication statements resulting from an inner loop can be vectorized with respect to an outer loop if they are not involved in a recurrence carried by the outer loop

• Consider:

REAL A(10000,100), B(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)

DO J = 1, M
   DO I = 1, N
      A(I+1,J) = A(I,J) + B(I,J)
   ENDDO
ENDDO

• Can sends be done before the receives?

• Can communication be vectorized?

REAL A(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*)
DO J = 1, M
   DO I = 1, N
      A(I+1,J+1) = A(I,J) + C
   ENDDO
ENDDO

• Can sends be done before the receives?

• Can communication be fully vectorized?

Overlapping Communication and Computation

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
S0: IF (PID /= lastP) SEND(PID+1, B(100), 1)
S1: IF (lo == 1 && PID /= 0) THEN
       RECV(PID-1, B(0), 1)
       A(1) = B(0) + C
    ENDIF
L1: DO i = 2, hi
       A(i) = B(i-1) + C
    ENDDO
ENDIF

• Moving the receive (S1) past the independent loop (L1) overlaps the communication with the computation:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
S0: IF (PID /= lastP) SEND(PID+1, B(100), 1)
L1: DO i = 2, hi
       A(i) = B(i-1) + C
    ENDDO
S1: IF (lo == 1 && PID /= 0) THEN
       RECV(PID-1, B(0), 1)
       A(1) = B(0) + C
    ENDIF
ENDIF

Pipelining

REAL A(10000,100)

!HPF$ DISTRIBUTE A(BLOCK,*)
DO J = 1, M
   DO I = 1, N
      A(I+1,J) = A(I,J) + C
   ENDDO
ENDDO

• Initial code generation for the I loop gives:

lo = 1
IF (PID == 0) lo = 2
hi = 100
lastP = CEIL((N+1)/100) - 1
IF (PID == lastP) hi = MOD(N,100) + 1
IF (PID <= lastP) THEN
   DO J = 1, M
      IF (lo == 1) THEN
         RECV(PID-1, A(0,J), 1)
         A(1,J) = A(0,J) + C
      ENDIF
      DO i = 2, hi
         A(i,J) = A(i-1,J) + C
      ENDDO
      IF (PID /= lastP) SEND(PID+1, A(100,J), 1)
   ENDDO
ENDIF

• The communication can be vectorized, but that gives up the pipelined parallelism

[Figure: pipelined parallelism with communication]

[Figure: pipelined parallelism with communication overhead]

Pipelining: Blocking

• Block the J loop by a factor K and vectorize the communication within each block; larger K means fewer, larger messages, while smaller K starts downstream processors sooner:
...

IF (PID <= lastP) THEN
   DO J = 1, M, K
      IF (lo == 1) THEN
         RECV(PID-1, A(0,J:J+K-1), K)
         DO j = J, J+K-1
            A(1,j) = A(0,j) + C
         ENDDO
      ENDIF
      DO j = J, J+K-1
         DO i = 2, hi
            A(i,j) = A(i-1,j) + C
         ENDDO
      ENDDO
      IF (PID /= lastP) SEND(PID+1, A(100,J:J+K-1), K)
   ENDDO
ENDIF

Other Optimizations

• Alignment and replication

• Identification of common recurrences

• Storage management
– Minimize the temporary storage used for communication
– The space taken for temporary storage should be at most equal to the space taken by the arrays themselves

• Interprocedural optimizations

Results

[Figure: experimental results]
Summary

• HPF is easy to code, but hard to compile

• Steps required to compile HPF programs:
– Basic loop compilation
– Communication generation
– Optimizations:
   – Communication vectorization
   – Overlapping communication with computation
   – Pipelining
