HTA's — PROGRAMMING FOR PARALLELISM AND LOCALITY
Presentation by Roman Frigg, Nov 30

Paper written at UIUC(1), Universidade da Coruna(2) and IBM T.J. Watson Research Center(3) by Ganesh Bikshandi(1), Jia Guo, Daniel Hoeflinger(1), Gheorghe Almasi(3), Basilio B. Fraguela(2), María J. Garzarán(1), David Padua(1) and Christoph von Praun(3)

Paper published at PPoPP, March 2006
PROGRAMMING TODAY'S SYSTEMS (slide 2)
‣ Goals: SCALABILITY, PORTABILITY, PRODUCTIVITY
‣ Reaching them means managing parallelism and locality
‣ Both are exposed through one abstraction: HTAs
CLASSIFICATION (slide 3)
‣ Libraries: HTA, MPI/PVM, POOMA, POET
‣ Languages: HPF, ZPL; GAS languages: X10, CAF, Titanium, UPC
‣ HTA: a library for Matlab & C++
‣ Single threaded, global view
TALK OVERVIEW (slide 4)
1 INTRO
2 HOW HTAs WORK
3 HTA OPERATIONS & APPLICATIONS
4 EVALUATION
5 CONCLUSIONS
RECURSIVE TILING (slide 5, HOW HTAs WORK)
Hierarchically Tiled Array: the outermost level of tiles is distributed across processors; the inner levels of tiles are local.
image source: paper
CONSTRUCT HTA FROM 6x6 MATRIX (slide 6, HOW HTAs WORK)

T1 = hta(M, {[1 3 5], [1 3 5]})
‣ tiles the 6x6 matrix M into a 3x3 grid of 2x2 tiles; each vector lists the first row/column of every tile

T2 = hta(T1, {[1 2], [1 3]}, [2 2])
‣ adds an outer 2x2 level of tiles on top of T1 and distributes it onto a 2x2 processor mesh (P1 P2 / P3 P4)
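To make the construction concrete, here is a minimal Python sketch of what the partition vectors do. `partition` is a hypothetical helper, not the HTA library's API: it splits a matrix at the given 1-based start indices, mimicking `hta(M, {[1 3 5], [1 3 5]})`, and because it only relies on lengths and slicing it can be applied again to the resulting grid of tiles to build the outer level, as `T2 = hta(T1, ...)` does (the `[2 2]` processor mapping is not modelled here).

```python
def partition(M, row_starts, col_starts):
    """Split M into a grid of tiles. row_starts/col_starts are the
    1-based indices where each tile begins, as in hta(M, {[1 3 5], [1 3 5]})."""
    def bounds(starts, extent):
        s = [i - 1 for i in starts] + [extent]   # 0-based half-open ranges
        return list(zip(s, s[1:]))
    return [[[row[c0:c1] for row in M[r0:r1]]
             for (c0, c1) in bounds(col_starts, len(M[0]))]
            for (r0, r1) in bounds(row_starts, len(M))]

# A 6x6 matrix tiled as on the slide: 3x3 grid of 2x2 tiles.
M = [[6 * i + j for j in range(6)] for i in range(6)]
T1 = partition(M, [1, 3, 5], [1, 3, 5])
# Second level, as in T2 = hta(T1, {[1 2], [1 3]}, [2 2]): a 2x2 outer grid.
T2 = partition(T1, [1, 2], [1, 3])
```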
HTA ACCESS (slide 7, HOW HTAs WORK)
‣ Flattened access, by global indices: C(1:2,3:6)
‣ Hierarchical access: C{2,1}{1,2}(2,2) — {}-indexing selects tiles level by level, ()-indexing selects scalars within a tile
‣ Both views address the same storage: C(6,4) and C{2,1}(2,4) name the same element
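The equivalence of the two views is plain index arithmetic. Below is a sketch for the simplest case of one level of uniform tiles; `flat_to_tiled` is an illustrative helper, not the library's API, and the real HTA library also handles non-uniform and nested tilings.

```python
def flat_to_tiled(r, c, th, tw):
    """Map 1-based flat indices (r, c) to 1-based tile indices plus
    1-based offsets inside the tile, for uniform th x tw tiles.
    Shows how a flat reference like C(6,4) and a tile-level reference
    like C{i,j}(p,q) can name the same scalar."""
    return ((r - 1) // th + 1, (c - 1) // tw + 1,
            (r - 1) % th + 1, (c - 1) % tw + 1)

# With the 2x2 tiles of T1 from slide 6, C(6,4) lives in tile {3,2} at (2,2).
print(flat_to_tiled(6, 4, 2, 2))
```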
ASSIGNMENTS & BINARY OPERATORS (slides 8-9, HOW HTAs WORK)
Valid operand combinations: Scalar ⊕ Array ⊕ HTA.
Valid operation? — quiz examples:
‣ 4x4 HTA * 2x3 Array
‣ 4x4 HTA * 3x2 Array
‣ 4x4 HTA * Scalar
‣ 4x4 HTA * 4x4 HTA
‣ 4x4 HTA = 4x4 HTA
An operation is valid when the operand conforms: a scalar always does, an array must match the shape of the tiles, and an HTA must have a matching structure.
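The three operand kinds can be sketched as one dispatch on the operand's nesting depth. This is an illustrative model of the rules on the slide, not the library's implementation: an HTA is modelled as a 2-D grid of 2-D tiles, and `hta_binop` is a hypothetical helper name.

```python
import numbers

def _depth(x):
    d = 0
    while not isinstance(x, numbers.Number):
        x, d = x[0], d + 1
    return d

def hta_binop(H, other, op):
    """Elementwise binary op on an HTA (2-D grid of 2-D tiles).
    Valid operands: a scalar, an array conforming to every tile,
    or an HTA with the same tiling."""
    d = _depth(other)
    if d == 0:   # scalar: broadcast to every element of every tile
        return [[[[op(v, other) for v in r] for r in t]
                 for t in row] for row in H]
    if d == 2:   # plain array: must match each tile's shape
        for row in H:
            for t in row:
                if len(t) != len(other) or len(t[0]) != len(other[0]):
                    raise ValueError("array does not conform to the tiles")
        return [[[[op(a, b) for a, b in zip(ra, rb)]
                  for ra, rb in zip(t, other)] for t in row] for row in H]
    if d == 4:   # HTA: same topology, combine tile by tile
        return [[[[op(a, b) for a, b in zip(ra, rb)]
                  for ra, rb in zip(t1, t2)]
                 for t1, t2 in zip(rH, rO)] for rH, rO in zip(H, other)]
    raise ValueError("operand must be a scalar, an array, or an HTA")
```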
TALK OVERVIEW (slide 10)
1 INTRO
2 HOW HTAs WORK
3 HTA OPERATIONS & APPLICATIONS
4 EVALUATION
5 CONCLUSIONS
TWO KINDS OF OPERATIONS (slide 11, HTA OPERATIONS & APPLICATIONS)
‣ Communication operations — move tiles between processors (P1-P4): assignments, repmat, circshift, permute
‣ Global computations — apply a function to every tile in parallel, f(x)/g(x) on each tile: parHTA(@g(x), H)
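The global-computation side can be sketched as a parallel map over the tile grid. In this illustrative model (`par_hta` is a hypothetical stand-in, not the library's API) threads stand in for the processors, each of which applies g to the tiles it owns.

```python
from concurrent.futures import ThreadPoolExecutor

def par_hta(g, H):
    """Sketch of parHTA(@g, H): apply g to every tile of the HTA in
    parallel and return an HTA with the same grid shape."""
    m, n = len(H), len(H[0])
    with ThreadPoolExecutor() as ex:
        flat = list(ex.map(g, (H[i][j] for i in range(m) for j in range(n))))
    return [[flat[i * n + j] for j in range(n)] for i in range(m)]
```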
CANNON'S ALGORITHM (slide 12, HTA OPERATIONS & APPLICATIONS)
A, B, C are m x m tiled HTAs.

function C = cannon(A,B,C)
  % Initialization: skew row i of A left by i-1, column i of B up by i-1
  for i=2:m
    A{i,:} = circshift(A{i,:}, [0, -(i-1)]);
    B{:,i} = circshift(B{:,i}, [-(i-1), 0]);
  end
  % Iteration: multiply aligned tiles, then shift A left and B up by one
  for k=1:m
    C = C + A * B;
    A = circshift(A, [0, -1]);
    B = circshift(B, [-1, 0]);
  end
end
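The tile-level algorithm above can be checked with a small sequential Python sketch. Tiles are modelled as nested lists and the circular shifts are done with index arithmetic; `cannon`, `to_tiles` and `from_tiles` are illustrative helpers, not the HTA API.

```python
def tile_matmul(x, y):
    n = len(x)
    return [[sum(x[r][k] * y[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]

def tile_add(x, y):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(x, y)]

def cannon(A, B):
    """Cannon's algorithm on an m x m grid of square tiles, mirroring the
    HTA code: an initial skew (row i of A shifted left by i, column j of B
    shifted up by j), then m multiply/shift rounds."""
    m, n = len(A), len(A[0][0])
    A = [[A[i][(j + i) % m] for j in range(m)] for i in range(m)]  # skew rows
    B = [[B[(i + j) % m][j] for j in range(m)] for i in range(m)]  # skew cols
    C = [[[[0] * n for _ in range(n)] for _ in range(m)] for _ in range(m)]
    for _ in range(m):
        C = [[tile_add(C[i][j], tile_matmul(A[i][j], B[i][j]))
              for j in range(m)] for i in range(m)]
        A = [[A[i][(j + 1) % m] for j in range(m)] for i in range(m)]  # left
        B = [[B[(i + 1) % m][j] for j in range(m)] for i in range(m)]  # up
    return C

def to_tiles(M, t):
    m = len(M) // t
    return [[[row[j * t:(j + 1) * t] for row in M[i * t:(i + 1) * t]]
             for j in range(m)] for i in range(m)]

def from_tiles(T):
    out = []
    for tile_row in T:
        for r in range(len(tile_row[0])):
            out.append([v for tile in tile_row for v in tile[r]])
    return out
```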
Table 1. Execution times in seconds for some of the applications in the NAS benchmarks for Fortran+MPI versus MATLAB+HTA. The execution time for 1 processor corresponds to the serial application in Fortran or MATLAB, without MPI or HTAs.

…computation not involving distributed HTA operations. Since all data are replicated, the behavior in each processor is exactly the same as what the behavior of the client would be, except that no communication is necessary to use data from the main thread in operations on distributed HTAs. On invocation of a method on a distributed HTA, each processor applies the corresponding operation to the tiles of the HTA it owns.

The incorporation of HTAs in MATLAB produced an explicitly parallel programming extension of MATLAB that integrates seamlessly with the language. Most other parallel MATLAB extensions either make use of extraneous primitives (MultiMATLAB [24]) or do not allow explicit parallel programming (Matlab*P [17]). Also, the incorporation of HTAs gives MATLAB a mechanism to access and operate on tiles that is much more powerful than that provided by their native … . The main disadvantage of the implementation is that the large overhead of interpreted MATLAB limits the efficiency of many applications. The three main sources of this overhead are:

Excessive creation of temporary variables. MATLAB creates temporaries to hold the partial results of expressions, which significantly slows down the programs.
Frequent replication of data. MATLAB passes parameters by value, and assignment statements replicate the data.
Interpretation of instructions. The overhead resulting from the interpretation of instructions is more pronounced when the computation relies mainly on scalar operations.

Table 1 presents the execution times for the Fortran+MPI and our MATLAB+HTA implementations of most of the NAS benchmarks. The table shows the execution times in seconds when the applications execute on a cluster of up to 128 processors. Each processor is a 3.2 GHz Intel Xeon connected through Gigabit Ethernet. For the NAS benchmarks we used version 3.1 and compiled them with the Intel ifort compiler, version 8.1, and flag -O3. For MATLAB we used version 7.0.1 (R14). Finally, for MPI we used LAM/MPI [6].

The execution time for 1 processor corresponds to the serial execution of the pure Fortran or MATLAB code without MPI or HTAs. The results in Table 1 correspond to the class C input for EP and CG, and class B for MG, FT, and LU.

As can be seen in the table, in the case of EP and FT the parallel MATLAB code takes advantage of parallelism, leading to execution times of the same magnitude as those of the Fortran+MPI code. In the case of CG our parallel MATLAB does reasonably well, although not as well as the Fortran+MPI version, which obtains super-linear speedups when the number of processors is 64 or smaller. However, for MG and LU the performance of the sequential MATLAB implementation was slow and, in the case of MG, the parallel MATLAB does not improve upon the serial Fortran version. Similarly, for BT (not shown) the serial MATLAB version runs so slowly that even the parallel version is not comparable with its sequential Fortran counterpart. Overall, for EP, FT, and CG, where the sequential MATLAB version runs 1 to 5 times slower than the Fortran version, the parallel MATLAB implementation does reasonably well, improving upon the serial Fortran version. In these cases it could be said that parallelism at least compensates for the interpretation overhead. For 128 processors the parallel MATLAB obtains speedups of 30.9, 8.8, and 29.3 over the sequential Fortran counterpart for EP, FT, and CG, respectively.

4.2 C++
In the C++ implementation, HTAs are represented as composite objects with methods to operate on both distributed and non-distributed HTAs. As in the case of MATLAB, MPI is used for communication and, while the programming model is single threaded, HTA C++ programs execute in SPMD form. To facilitate programming, our C++ implementation enforces an allocation/deallocation policy through reference counting as follows: (1) HTAs are allocated through factory methods on the heap. The methods return a handle which is assigned to a (stack-allocated) variable. (2) All accesses to the HTA occur through this handle, which is itself small in size and typically passed by value across procedure boundaries. (3) Once all handles to an HTA disappear from the stack, the HTA and its related structures are automatically deleted from memory. This design permits sharing of sub-trees among HTAs and also precludes deallocation errors. Moreover, the temporary arrays that are, for instance, created during the partial evaluation of expressions are handled through this mechanism and deleted automatically as early as possible.

Performance is one of the main goals of our C++ implementation. Methods were optimized and, whenever possible, specialized for specific cases. Also, the user is given control over the memory layout of non-distributed HTAs; in MATLAB the layout was in the hands of the system and the user had no way of influencing it. Finally, to enable efficient access to scalar components of HTAs, the implementation was organized to guarantee that hot methods were inlined. This last strategy enabled the codes written using the library to have performance similar to that of traditional (non-HTA) implementations. For example, the code in Figure 13 represents the multiplication of two two-dimensional arrays recursively tiled. The code is similar to the MATLAB code shown in Figure 8.

The code in Figure 13 shows the declaration of the HTAs …, …, and …. The function … is the factory method that creates the HTAs. It takes as input the complete tiling information for each HTA: the number of tiles in each dimension …, the tile size …, and the memory layout (…, … or …). The function … is recursive. When the input
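The handle/reference-counting scheme described for the C++ implementation can be sketched with `std::shared_ptr`. The names below (`HtaImpl`, `Hta::alloc`, `handles`) are hypothetical, chosen only to illustrate the three-step policy in the text; they are not the library's real API.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Heap-allocated body shared by all handles to the same HTA.
struct HtaImpl {
    std::vector<double> data;   // flat storage standing in for the tiles
    explicit HtaImpl(std::size_t n) : data(n, 0.0) {}
};

class Hta {
    std::shared_ptr<HtaImpl> impl_;   // reference-counted ownership
    explicit Hta(std::shared_ptr<HtaImpl> p) : impl_(std::move(p)) {}
public:
    // (1) Factory method: allocates on the heap, returns a handle.
    static Hta alloc(std::size_t n) {
        return Hta(std::make_shared<HtaImpl>(n));
    }
    // (2) All accesses go through the handle; copying it is cheap.
    double& at(std::size_t i) { return impl_->data.at(i); }
    long handles() const { return impl_.use_count(); }
    // (3) When the last handle leaves the stack, shared_ptr's destructor
    //     deletes the HtaImpl automatically.
};
```

The same mechanism covers the sharing of sub-trees among HTAs: a sub-HTA handle would simply hold another `shared_ptr` into the tree, keeping it alive as long as any handle refers to it.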
NASA ADVANCED SUPERCOMPUTING BENCHMARK (slide 16, EVALUATION)
image source: paper
Table 1 layout: Nprocs, then for each benchmark — EP (Class C), FT (Class B), CG (Class C), MG (Class B), LU (Class B) — a Fortran+MPI column and a Matlab+HTA column of execution times.