HTA's — PROGRAMMING FOR PARALLELISM AND LOCALITY
Presentation by Roman Frigg, Nov 30

Paper written at UIUC(1), Universidade da Coruna(2) and IBM T.J. Watson Research Center(3) by Ganesh Bikshandi(1), Jia Guo, Daniel Hoeflinger(1), Gheorghe Almasi(3), Basilio B. Fraguela(2), María J. Garzarán(1), David Padua(1) and Christoph von Praun(3)

Paper published at PPoPP, March 2006
PROGRAMMING TODAY'S SYSTEMS (slide 2)
‣ Goals: SCALABILITY, PORTABILITY, PRODUCTIVITY
‣ Reaching them means managing parallelism and locality
‣ Both are exposed through one abstraction: HTAs
CLASSIFICATION (slide 3)
‣ Libraries: HTA, MPI/PVM, POOMA, POET
‣ Languages: HPF, ZPL; GAS languages: X10, CAF, Titanium, UPC
‣ HTA: a library for Matlab & C++
‣ Single threaded, global view
TALK OVERVIEW (slide 4)
1 INTRO
2 HOW HTAs WORK
3 HTA OPERATIONS & APPLICATIONS
4 EVALUATION
5 CONCLUSIONS
RECURSIVE TILING (slide 5, HOW HTAs WORK)
Hierarchically Tiled Array: the outermost level of tiles is distributed across processors; the inner levels of tiles are local.
image source: paper
CONSTRUCT HTA FROM 6x6 MATRIX (slide 6, HOW HTAs WORK)

T1 = hta(M, {[1 3 5], [1 3 5]})
‣ tiles the 6x6 matrix M into a 3x3 grid of 2x2 tiles; each vector lists the first row/column of every tile

T2 = hta(T1, {[1 2], [1 3]}, [2 2])
‣ adds an outer 2x2 level of tiles on top of T1 and distributes it onto a 2x2 processor mesh (P1 P2 / P3 P4)
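To make the construction concrete, here is a minimal Python sketch of what the partition vectors do. `partition` is a hypothetical helper, not the HTA library's API: it splits a matrix at the given 1-based start indices, mimicking `hta(M, {[1 3 5], [1 3 5]})`, and because it only relies on lengths and slicing it can be applied again to the resulting grid of tiles to build the outer level, as `T2 = hta(T1, ...)` does (the `[2 2]` processor mapping is not modelled here).

```python
def partition(M, row_starts, col_starts):
    """Split M into a grid of tiles. row_starts/col_starts are the
    1-based indices where each tile begins, as in hta(M, {[1 3 5], [1 3 5]})."""
    def bounds(starts, extent):
        s = [i - 1 for i in starts] + [extent]   # 0-based half-open ranges
        return list(zip(s, s[1:]))
    return [[[row[c0:c1] for row in M[r0:r1]]
             for (c0, c1) in bounds(col_starts, len(M[0]))]
            for (r0, r1) in bounds(row_starts, len(M))]

# A 6x6 matrix tiled as on the slide: 3x3 grid of 2x2 tiles.
M = [[6 * i + j for j in range(6)] for i in range(6)]
T1 = partition(M, [1, 3, 5], [1, 3, 5])
# Second level, as in T2 = hta(T1, {[1 2], [1 3]}, [2 2]): a 2x2 outer grid.
T2 = partition(T1, [1, 2], [1, 3])
```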
HTA ACCESS (slide 7, HOW HTAs WORK)
‣ Flattened access, by global indices: C(1:2,3:6)
‣ Hierarchical access: C{2,1}{1,2}(2,2) — {}-indexing selects tiles level by level, ()-indexing selects scalars within a tile
‣ Both views address the same storage: C(6,4) and C{2,1}(2,4) name the same element
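The equivalence of the two views is plain index arithmetic. Below is a sketch for the simplest case of one level of uniform tiles; `flat_to_tiled` is an illustrative helper, not the library's API, and the real HTA library also handles non-uniform and nested tilings.

```python
def flat_to_tiled(r, c, th, tw):
    """Map 1-based flat indices (r, c) to 1-based tile indices plus
    1-based offsets inside the tile, for uniform th x tw tiles.
    Shows how a flat reference like C(6,4) and a tile-level reference
    like C{i,j}(p,q) can name the same scalar."""
    return ((r - 1) // th + 1, (c - 1) // tw + 1,
            (r - 1) % th + 1, (c - 1) % tw + 1)

# With the 2x2 tiles of T1 from slide 6, C(6,4) lives in tile {3,2} at (2,2).
print(flat_to_tiled(6, 4, 2, 2))
```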
ASSIGNMENTS & BINARY OPERATORS (slides 8-9, HOW HTAs WORK)
Valid operand combinations: Scalar ⊕ Array ⊕ HTA.
Valid operation? — quiz examples:
‣ 4x4 HTA * 2x3 Array
‣ 4x4 HTA * 3x2 Array
‣ 4x4 HTA * Scalar
‣ 4x4 HTA * 4x4 HTA
‣ 4x4 HTA = 4x4 HTA
An operation is valid when the operand conforms: a scalar always does, an array must match the shape of the tiles, and an HTA must have a matching structure.
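The three operand kinds can be sketched as one dispatch on the operand's nesting depth. This is an illustrative model of the rules on the slide, not the library's implementation: an HTA is modelled as a 2-D grid of 2-D tiles, and `hta_binop` is a hypothetical helper name.

```python
import numbers

def _depth(x):
    d = 0
    while not isinstance(x, numbers.Number):
        x, d = x[0], d + 1
    return d

def hta_binop(H, other, op):
    """Elementwise binary op on an HTA (2-D grid of 2-D tiles).
    Valid operands: a scalar, an array conforming to every tile,
    or an HTA with the same tiling."""
    d = _depth(other)
    if d == 0:   # scalar: broadcast to every element of every tile
        return [[[[op(v, other) for v in r] for r in t]
                 for t in row] for row in H]
    if d == 2:   # plain array: must match each tile's shape
        for row in H:
            for t in row:
                if len(t) != len(other) or len(t[0]) != len(other[0]):
                    raise ValueError("array does not conform to the tiles")
        return [[[[op(a, b) for a, b in zip(ra, rb)]
                  for ra, rb in zip(t, other)] for t in row] for row in H]
    if d == 4:   # HTA: same topology, combine tile by tile
        return [[[[op(a, b) for a, b in zip(ra, rb)]
                  for ra, rb in zip(t1, t2)]
                 for t1, t2 in zip(rH, rO)] for rH, rO in zip(H, other)]
    raise ValueError("operand must be a scalar, an array, or an HTA")
```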
TALK OVERVIEW (slide 10)
1 INTRO
2 HOW HTAs WORK
3 HTA OPERATIONS & APPLICATIONS
4 EVALUATION
5 CONCLUSIONS
TWO KINDS OF OPERATIONS (slide 11, HTA OPERATIONS & APPLICATIONS)
‣ Communication operations — move tiles between processors (P1-P4): assignments, repmat, circshift, permute
‣ Global computations — apply a function to every tile in parallel, f(x)/g(x) on each tile: parHTA(@g(x), H)
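The global-computation side can be sketched as a parallel map over the tile grid. In this illustrative model (`par_hta` is a hypothetical stand-in, not the library's API) threads stand in for the processors, each of which applies g to the tiles it owns.

```python
from concurrent.futures import ThreadPoolExecutor

def par_hta(g, H):
    """Sketch of parHTA(@g, H): apply g to every tile of the HTA in
    parallel and return an HTA with the same grid shape."""
    m, n = len(H), len(H[0])
    with ThreadPoolExecutor() as ex:
        flat = list(ex.map(g, (H[i][j] for i in range(m) for j in range(n))))
    return [[flat[i * n + j] for j in range(n)] for i in range(m)]
```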
CANNON'S ALGORITHM (slide 12, HTA OPERATIONS & APPLICATIONS)
A, B, C are m x m tiled HTAs.

function C = cannon(A,B,C)
  % Initialization: skew row i of A left by i-1, column i of B up by i-1
  for i=2:m
    A{i,:} = circshift(A{i,:}, [0, -(i-1)]);
    B{:,i} = circshift(B{:,i}, [-(i-1), 0]);
  end
  % Iteration: multiply aligned tiles, then shift A left and B up by one
  for k=1:m
    C = C + A * B;
    A = circshift(A, [0, -1]);
    B = circshift(B, [-1, 0]);
  end
end
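The tile-level algorithm above can be checked with a small sequential Python sketch. Tiles are modelled as nested lists and the circular shifts are done with index arithmetic; `cannon`, `to_tiles` and `from_tiles` are illustrative helpers, not the HTA API.

```python
def tile_matmul(x, y):
    n = len(x)
    return [[sum(x[r][k] * y[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]

def tile_add(x, y):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(x, y)]

def cannon(A, B):
    """Cannon's algorithm on an m x m grid of square tiles, mirroring the
    HTA code: an initial skew (row i of A shifted left by i, column j of B
    shifted up by j), then m multiply/shift rounds."""
    m, n = len(A), len(A[0][0])
    A = [[A[i][(j + i) % m] for j in range(m)] for i in range(m)]  # skew rows
    B = [[B[(i + j) % m][j] for j in range(m)] for i in range(m)]  # skew cols
    C = [[[[0] * n for _ in range(n)] for _ in range(m)] for _ in range(m)]
    for _ in range(m):
        C = [[tile_add(C[i][j], tile_matmul(A[i][j], B[i][j]))
              for j in range(m)] for i in range(m)]
        A = [[A[i][(j + 1) % m] for j in range(m)] for i in range(m)]  # left
        B = [[B[(i + 1) % m][j] for j in range(m)] for i in range(m)]  # up
    return C

def to_tiles(M, t):
    m = len(M) // t
    return [[[row[j * t:(j + 1) * t] for row in M[i * t:(i + 1) * t]]
             for j in range(m)] for i in range(m)]

def from_tiles(T):
    out = []
    for tile_row in T:
        for r in range(len(tile_row[0])):
            out.append([v for tile in tile_row for v in tile[r]])
    return out
```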
Table 1. Execution times in seconds for some of the applications in the NAS benchmarks for Fortran+MPI versus MATLAB+HTA. The execution time for 1 processor corresponds to the serial application in Fortran or MATLAB, without MPI or HTAs.

…computation not involving distributed HTA operations. Since all data are replicated, the behavior in each processor is exactly the same as what the behavior of the client would be, except that no communication is necessary to use data from the main thread in operations on distributed HTAs. On invocation of a method on a distributed HTA, each processor applies the corresponding operation to the tiles of the HTA it owns.

The incorporation of HTAs in MATLAB produced an explicitly parallel programming extension of MATLAB that integrates seamlessly with the language. Most other parallel MATLAB extensions either make use of extraneous primitives (MultiMATLAB [24]) or do not allow explicit parallel programming (Matlab*P [17]). Also, the incorporation of HTAs gives MATLAB a mechanism to access and operate on tiles that is much more powerful than that provided by their native … . The main disadvantage of the implementation is that the large overhead of interpreted MATLAB limits the efficiency of many applications. The three main sources of this overhead are:

Excessive creation of temporary variables. MATLAB creates temporaries to hold the partial results of expressions, which significantly slows down the programs.
Frequent replication of data. MATLAB passes parameters by value, and assignment statements replicate the data.
Interpretation of instructions. The overhead resulting from the interpretation of instructions is more pronounced when the computation relies mainly on scalar operations.

Table 1 presents the execution times for the Fortran+MPI and our MATLAB+HTA implementations of most of the NAS benchmarks. The table shows the execution times in seconds when the applications execute on a cluster of up to 128 processors. Each processor is a 3.2 GHz Intel Xeon connected through Gigabit Ethernet. For the NAS benchmarks we used version 3.1 and compiled them with the Intel ifort compiler, version 8.1, and flag -O3. For MATLAB we used version 7.0.1 (R14). Finally, for MPI we used LAM/MPI [6].

The execution time for 1 processor corresponds to the serial execution of the pure Fortran or MATLAB code without MPI or HTAs. The results in Table 1 correspond to the class C input for EP and CG, and class B for MG, FT, and LU.

As can be seen in the table, in the case of EP and FT the parallel MATLAB code takes advantage of parallelism, leading to execution times of the same magnitude as those of the Fortran+MPI code. In the case of CG our parallel MATLAB does reasonably well, although not as well as the Fortran+MPI version, which obtains super-linear speedups when the number of processors is 64 or smaller. However, for MG and LU the performance of the sequential MATLAB implementation was slow and, in the case of MG, the parallel MATLAB does not improve upon the serial Fortran version. Similarly, for BT (not shown) the serial MATLAB version runs so slowly that even the parallel version is not comparable with its sequential Fortran counterpart. Overall, for EP, FT, and CG, where the sequential MATLAB version runs 1 to 5 times slower than the Fortran version, the parallel MATLAB implementation does reasonably well, improving upon the serial Fortran version. In these cases it could be said that parallelism at least compensates for the interpretation overhead. For 128 processors the parallel MATLAB obtains speedups of 30.9, 8.8, and 29.3 over the sequential Fortran counterpart for EP, FT, and CG, respectively.

4.2 C++
In the C++ implementation, HTAs are represented as composite objects with methods to operate on both distributed and non-distributed HTAs. As in the case of MATLAB, MPI is used for communication and, while the programming model is single threaded, HTA C++ programs execute in SPMD form. To facilitate programming, our C++ implementation enforces an allocation/deallocation policy through reference counting as follows: (1) HTAs are allocated through factory methods on the heap. The methods return a handle which is assigned to a (stack-allocated) variable. (2) All accesses to the HTA occur through this handle, which is itself small in size and typically passed by value across procedure boundaries. (3) Once all handles to an HTA disappear from the stack, the HTA and its related structures are automatically deleted from memory. This design permits sharing of sub-trees among HTAs and also precludes deallocation errors. Moreover, the temporary arrays that are, for instance, created during the partial evaluation of expressions are handled through this mechanism and deleted automatically as early as possible.

Performance is one of the main goals of our C++ implementation. Methods were optimized and, whenever possible, specialized for specific cases. Also, the user is given control over the memory layout of non-distributed HTAs; in MATLAB the layout was in the hands of the system and the user had no way of influencing it. Finally, to enable efficient access to scalar components of HTAs, the implementation was organized to guarantee that hot methods were inlined. This last strategy enabled the codes written using the library to have performance similar to that of traditional (non-HTA) implementations. For example, the code in Figure 13 represents the multiplication of two two-dimensional arrays recursively tiled. The code is similar to the MATLAB code shown in Figure 8.

The code in Figure 13 shows the declaration of the HTAs …, …, and …. The function … is the factory method that creates the HTAs. It takes as input the complete tiling information for each HTA: the number of tiles in each dimension …, the tile size …, and the memory layout (…, … or …). The function … is recursive. When the input
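The handle/reference-counting scheme described for the C++ implementation can be sketched with `std::shared_ptr`. The names below (`HtaImpl`, `Hta::alloc`, `handles`) are hypothetical, chosen only to illustrate the three-step policy in the text; they are not the library's real API.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Heap-allocated body shared by all handles to the same HTA.
struct HtaImpl {
    std::vector<double> data;   // flat storage standing in for the tiles
    explicit HtaImpl(std::size_t n) : data(n, 0.0) {}
};

class Hta {
    std::shared_ptr<HtaImpl> impl_;   // reference-counted ownership
    explicit Hta(std::shared_ptr<HtaImpl> p) : impl_(std::move(p)) {}
public:
    // (1) Factory method: allocates on the heap, returns a handle.
    static Hta alloc(std::size_t n) {
        return Hta(std::make_shared<HtaImpl>(n));
    }
    // (2) All accesses go through the handle; copying it is cheap.
    double& at(std::size_t i) { return impl_->data.at(i); }
    long handles() const { return impl_.use_count(); }
    // (3) When the last handle leaves the stack, shared_ptr's destructor
    //     deletes the HtaImpl automatically.
};
```

The same mechanism covers the sharing of sub-trees among HTAs: a sub-HTA handle would simply hold another `shared_ptr` into the tree, keeping it alive as long as any handle refers to it.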
NASA ADVANCED SUPERCOMPUTING BENCHMARK (slide 16, EVALUATION)
image source: paper
Table 1 layout: Nprocs, then for each benchmark — EP (Class C), FT (Class B), CG (Class C), MG (Class B), LU (Class B) — a Fortran+MPI column and a Matlab+HTA column of execution times.