ParCo 2005

Auto-optimization of linear algebra parallel routines: the Cholesky factorization

Luis-Pedro García
Servicio de Apoyo a la Investigación Tecnológica
Universidad Politécnica de Cartagena, Spain
[email protected]

Javier Cuenca
Departamento de Ingeniería y Tecnología de Computadores
Universidad de Murcia, Spain
[email protected]

Domingo Giménez
Departamento de Informática y Sistemas
Universidad de Murcia, Spain
[email protected]
Introduction

Our goal: to obtain linear algebra parallel routines with auto-optimization capacity.
The approach: model the behavior of the algorithm.
This work: improve the model for the communication costs when:
- The routine uses different types of MPI communication mechanisms
- The system has more than one interconnection network
- The communication parameters vary with the volume of the communication
Introduction

Theoretical and experimental study of the algorithm. AP selection.
In linear algebra parallel routines, typical AP and SP are: b, p = r x c and the basic library; k1, k2, k3, ts and tw.
An analytical model of the execution time: T(n) = f(n,AP,SP).
The n x n matrix is mapped through a block cyclic 2-D distribution onto a two-dimensional mesh of p = r x c processes (in ScaLAPACK style).
Figure 1. Work distribution in the first three steps, with n/b = 6 and p = 2 x 3: (a) first step, (b) second step, (c) third step.
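The block cyclic 2-D mapping described above can be made concrete with a minimal sketch (not the authors' code; the function names are illustrative): block (i, j) of the n/b x n/b block grid is assigned to process (i mod r, j mod c) of the r x c mesh.

```python
# Block cyclic 2-D distribution, ScaLAPACK style (illustrative sketch).

def owner(block_row, block_col, r, c):
    """Process mesh coordinates (pr, pc) that own block (block_row, block_col)."""
    return (block_row % r, block_col % c)

def blocks_of(pr, pc, n_blocks, r, c):
    """All blocks of an n_blocks x n_blocks block grid mapped to process (pr, pc)."""
    return [(i, j)
            for i in range(n_blocks)
            for j in range(n_blocks)
            if owner(i, j, r, c) == (pr, pc)]
```

With n/b = 6 and p = 2 x 3 as in Figure 1, each process owns 6 of the 36 blocks; for example, process (0, 0) owns the blocks in even block-rows and block-columns 0 and 3.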
The general model: t(n) = f(n,AP,SP)

Problem size:
- n: matrix size
Algorithmic parameters (AP):
- b: block size
- p = r x c: processes
System parameters (SP), with SP = g(n,AP):
- k(n,b,p): k2,potf2, k3,trsm, k3,gemm and k3,syrk, the costs of basic arithmetic operations
- ts(p): start-up time
- tws(n,p), twd(n,p): word-sending times for the different types of communications
Communication cost: tcom(n,p) = ts(p) + n tw(n,p)
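The communication term of the model can be sketched as follows. This is a minimal illustration, not the authors' code: the word-sending time is chosen piecewise by message size (as in Table 5, whose p = 2 values for P4net are used below), and the start-up time passed in the usage example is a placeholder, not a measured SP.

```python
# Sketch of tcom(n, p) = ts(p) + n * tw(n, p), where tw depends on the
# communication volume and on the type of MPI data type used.

def word_sending_time(n_words, tw_table):
    """tw(n): piecewise-constant word-sending time selected by message size.
    tw_table is a list of (max_words, tw) pairs, the last one open-ended."""
    for max_words, tw in tw_table:
        if n_words <= max_words:
            return tw
    return tw_table[-1][1]

def t_com(n_words, t_s, tw_table):
    """Broadcast cost model: start-up time plus per-word cost times volume."""
    return t_s + n_words * word_sending_time(n_words, tw_table)
```

For example, with the P4net p = 2 values of tws (in µsec): `t_com(2000, 38.0, [(1500, 0.61), (2048, 0.77), (float("inf"), 0.84)])`, where 38.0 is a placeholder start-up time.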
Systems:
- A network of four Intel Pentium 4 nodes (P4net) with a FastEthernet switch, enabling parallel communications between them. The MPI library used is MPICH.
- A network of four HP AlphaServer quad-processor nodes (HPC160) using Shared Memory (HPC160smp), MemoryChannel (HPC160mc) and both (HPC160smp-mc) for the communications between processes. An MPI library optimized for Shared Memory and for MemoryChannel has been used.
Experimental Results

How to estimate the arithmetic SPs:
- With routines performing some basic operation (dgemm, dsyrk, dtrsm) with the same data access scheme used in the algorithm.
How to estimate the communication SPs:
- With routines that communicate rows or columns in the logical mesh of processes:
  - With a broadcast for MPI derived data types between processes in the same column
  - With a broadcast for MPI predefined data types between processes in the same row
In both cases the experiments are repeated several times to obtain an average value.
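The estimation of an arithmetic SP can be sketched as below: time the basic operation several times, average, and divide by the operation count. This is an illustrative harness (not the authors' code) using NumPy's matrix product as a stand-in for dgemm, with the standard 2mnk flop count.

```python
# Sketch: estimate k3,gemm as average time per flop of an m x k by k x n
# matrix product, repeated several times to obtain an average value.
import time
import numpy as np

def estimate_k3_gemm(m, n, k, repeats=5):
    a = np.random.rand(m, k)
    b = np.random.rand(k, n)
    elapsed = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        _ = a @ b                       # the timed basic operation
        elapsed.append(time.perf_counter() - t0)
    # cost per basic arithmetic operation: mean time / (2*m*n*k flops)
    return (sum(elapsed) / repeats) / (2 * m * n * k)
```

The same scheme applies to dsyrk and dtrsm, using the data access pattern of the algorithm itself so that the measured constant reflects cache behavior at the chosen block size.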
Experimental Results

Lowest execution time with the optimized version of BLAS and LAPACK for Pentium 4 and for Alpha.

Table 1. Values of arithmetic system parameters (in µsec) in Pentium 4 with BLASopt

  Block size |    32    |    64    |   128    |   256
  k3,dgemm   | 0,001862 | 0,000937 | 0,000572 | 0,000467
  k3,dsyrk   | 0,003492 | 0,001484 | 0,001228 | 0,000762
  k3,dtrsm   | 0,011719 | 0,006527 | 0,003785 | 0,002325

Table 2. Values of arithmetic system parameters (in µsec) in Alpha with CXML

  Block size |    32    |    64    |   128    |   256
  k3,dgemm   | 0,000824 | 0,000658 | 0,000610 | 0,000580
  k3,dsyrk   | 0,001628 | 0,001164 | 0,000807 | 0,000688
  k3,dtrsm   | 0,001617 | 0,001110 | 0,000841 | 0,000706
Experimental Results

But other SPs can depend on n and b, for example k2,potf2:

Table 3. Values of k2,potf2 (in µsec) in Pentium 4 with BLASopt

  b \ n |   512  |  1024  |  2048
    32  | 0,0045 | 0,0054 | 0,0067
    64  | 0,0034 | 0,046  | 0,0049
   128  | 0,0063 | 0,0077 | 0,0076
   256  | 0,0086 | 0,0103 | 0,0100

Table 4. Values of k2,potf2 (in µsec) in Alpha with CXML

  b \ n |  1024  |  2048  |  4096
    32  | 0,0028 | 0,0147 | 0,0101
    64  | 0,0024 | 0,0082 | 0,0034
   128  | 0,0033 | 0,0052 | 0,0025
   256  | 0,0027 | 0,0040 | 0,0023
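Since k2,potf2 varies with both n and b, one way to use these measurements inside the model SP = g(n,AP) is to store them as a table keyed by (b, n) and take the entry for the nearest measured pair. The lookup strategy below is an illustrative sketch, not the authors' method; the values are those of Table 3 (Pentium 4, in µsec).

```python
# Table 3 entries: k2,potf2 (in microseconds) on Pentium 4 with BLASopt,
# keyed by (b, n). Decimal commas converted to points.
K2_POTF2 = {(32, 512): 0.0045, (32, 1024): 0.0054, (32, 2048): 0.0067,
            (64, 512): 0.0034, (64, 1024): 0.046,  (64, 2048): 0.0049,
            (128, 512): 0.0063, (128, 1024): 0.0077, (128, 2048): 0.0076,
            (256, 512): 0.0086, (256, 1024): 0.0103, (256, 2048): 0.0100}

def k2_potf2(b, n, table=K2_POTF2):
    """Return the entry for the measured (b, n) pair closest in relative terms."""
    key = min(table, key=lambda bn: abs(bn[0] - b) / b + abs(bn[1] - n) / n)
    return table[key]
```

A fitted function of (n, b) could replace the nearest-entry lookup; the point is only that the SP is evaluated per problem size rather than treated as a constant.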
Experimental Results

Communication system parameters. Broadcast cost for MPI predefined data type, tws.

Table 5. Values of tws (in µsec) in P4net

  p \ Message size | 1500 | 2048 | > 4000
         2         | 0,61 | 0,77 |  0,84
         4         | 1,22 | 1,45 |  1,68

Table 6. Values of tws (in µsec) in HPC160

  p | Shared Memory | MemoryChannel
  2 |     0,011     |     0,072
  4 |     0,025     |     0,14
Experimental Results

Communication system parameters. Word-sending time of a broadcast for MPI derived data type, twd.
Experimental Results

Parameters selection in P4net.

Table 9. Parameters selection for the Cholesky factorization in P4net
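The parameter selection step can be sketched as a search over the candidate APs: evaluate the analytical model t(n, b, r, c) for each block size and mesh shape and keep the minimum. The model function below is a stand-in passed as an argument; the real one combines the arithmetic and communication SPs described earlier.

```python
# Illustrative AP selection: minimize the modeled execution time over
# block sizes b and mesh shapes r x c with r * c <= p_total processes.

def select_parameters(n, p_total, block_sizes, model):
    """Return (b, r, c) minimizing model(n, b, r, c)."""
    best = None
    for b in block_sizes:
        for r in range(1, p_total + 1):
            for c in range(1, p_total // r + 1):
                t = model(n, b, r, c)
                if best is None or t < best[0]:
                    best = (t, b, r, c)
    return best[1:]
```

Because the model is cheap to evaluate, this exhaustive search runs at installation or execution time without timing the routine itself for every candidate configuration.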
Experimental Results

Parameters selection in HPC160.

Table 10. Parameters selection for the Cholesky factorization in HPC160 with shared memory (HPC160smp), MemoryChannel (HPC160mc) and both (HPC160smp-mc)
Conclusions

The method has been applied successfully to the Cholesky factorization and can be applied to other linear algebra routines.
It is necessary to use different costs for the different types of MPI communication mechanisms, and to use different costs for the communication parameters in systems with more than one interconnection network.
It is necessary to decide the optimal allocation of processes per node according to the speed of the interconnection networks (hybrid systems).