ParCo 2005

Auto-optimization of linear algebra parallel routines: the Cholesky factorization

Luis-Pedro García
Servicio de Apoyo a la Investigación Tecnológica
Universidad Politécnica de Cartagena, Spain
[email protected]

Javier Cuenca
Departamento de Ingeniería y Tecnología de Computadores
Universidad de Murcia, Spain
[email protected]

Domingo Giménez
Departamento de Informática y Sistemas
Universidad de Murcia, Spain
[email protected]
Introduction

Our goal: to obtain linear algebra parallel routines with auto-optimization capacity.
The approach: model the behavior of the algorithm.
This work: improve the model for the communication costs when:
- The routine uses different types of MPI communication mechanisms
- The system has more than one interconnection network
- The communication parameters vary with the volume of the communication
Introduction

Theoretical and experimental study of the algorithm. AP selection.
In linear algebra parallel routines, typical AP and SP are: b, p = r x c and the basic library; k1, k2, k3, ts and tw.
An analytical model of the execution time: T(n) = f(n,AP,SP).
The n x n matrix is mapped through a block cyclic 2-D distribution onto a two-dimensional mesh of p = r x c processes (in ScaLAPACK style).
Figure 1. Work distribution in the first three steps, with n/b = 6 and p = 2 x 3: (a) first step, (b) second step, (c) third step.
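The block cyclic 2-D mapping described above can be made concrete with a minimal sketch (not the authors' code; the function names are illustrative): block (i, j) of the n/b x n/b block grid is assigned to process (i mod r, j mod c) of the r x c mesh.

```python
# Block cyclic 2-D distribution, ScaLAPACK style (illustrative sketch).

def owner(block_row, block_col, r, c):
    """Process mesh coordinates (pr, pc) that own block (block_row, block_col)."""
    return (block_row % r, block_col % c)

def blocks_of(pr, pc, n_blocks, r, c):
    """All blocks of an n_blocks x n_blocks block grid mapped to process (pr, pc)."""
    return [(i, j)
            for i in range(n_blocks)
            for j in range(n_blocks)
            if owner(i, j, r, c) == (pr, pc)]
```

With n/b = 6 and p = 2 x 3 as in Figure 1, each process owns 6 of the 36 blocks; for example, process (0, 0) owns the blocks in even block-rows and block-columns 0 and 3.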
The general model: t(n) = f(n,AP,SP)

Problem size:
- n: matrix size
Algorithmic parameters (AP):
- b: block size
- p = r x c: processes
System parameters (SP), with SP = g(n,AP):
- k(n,b,p): k2,potf2, k3,trsm, k3,gemm and k3,syrk, the costs of basic arithmetic operations
- ts(p): start-up time
- tws(n,p), twd(n,p): word-sending times for the different types of communications
Communication cost: tcom(n,p) = ts(p) + n tw(n,p)
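The communication term of the model can be sketched as follows. This is a minimal illustration, not the authors' code: the word-sending time is chosen piecewise by message size (as in Table 5, whose p = 2 values for P4net are used below), and the start-up time passed in the usage example is a placeholder, not a measured SP.

```python
# Sketch of tcom(n, p) = ts(p) + n * tw(n, p), where tw depends on the
# communication volume and on the type of MPI data type used.

def word_sending_time(n_words, tw_table):
    """tw(n): piecewise-constant word-sending time selected by message size.
    tw_table is a list of (max_words, tw) pairs, the last one open-ended."""
    for max_words, tw in tw_table:
        if n_words <= max_words:
            return tw
    return tw_table[-1][1]

def t_com(n_words, t_s, tw_table):
    """Broadcast cost model: start-up time plus per-word cost times volume."""
    return t_s + n_words * word_sending_time(n_words, tw_table)
```

For example, with the P4net p = 2 values of tws (in µsec): `t_com(2000, 38.0, [(1500, 0.61), (2048, 0.77), (float("inf"), 0.84)])`, where 38.0 is a placeholder start-up time.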
Systems:
- A network of four Intel Pentium 4 nodes (P4net) with a FastEthernet switch, enabling parallel communications between them. The MPI library used is MPICH.
- A network of four HP AlphaServer quad-processor nodes (HPC160) using Shared Memory (HPC160smp), MemoryChannel (HPC160mc) and both (HPC160smp-mc) for the communications between processes. An MPI library optimized for Shared Memory and for MemoryChannel has been used.
Experimental Results

How to estimate the arithmetic SPs:
- With routines performing some basic operation (dgemm, dsyrk, dtrsm) with the same data access scheme used in the algorithm.
How to estimate the communication SPs:
- With routines that communicate rows or columns in the logical mesh of processes:
  - With a broadcast for MPI derived data types between processes in the same column
  - With a broadcast for MPI predefined data types between processes in the same row
In both cases the experiments are repeated several times to obtain an average value.
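The estimation of an arithmetic SP can be sketched as below: time the basic operation several times, average, and divide by the operation count. This is an illustrative harness (not the authors' code) using NumPy's matrix product as a stand-in for dgemm, with the standard 2mnk flop count.

```python
# Sketch: estimate k3,gemm as average time per flop of an m x k by k x n
# matrix product, repeated several times to obtain an average value.
import time
import numpy as np

def estimate_k3_gemm(m, n, k, repeats=5):
    a = np.random.rand(m, k)
    b = np.random.rand(k, n)
    elapsed = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        _ = a @ b                       # the timed basic operation
        elapsed.append(time.perf_counter() - t0)
    # cost per basic arithmetic operation: mean time / (2*m*n*k flops)
    return (sum(elapsed) / repeats) / (2 * m * n * k)
```

The same scheme applies to dsyrk and dtrsm, using the data access pattern of the algorithm itself so that the measured constant reflects cache behavior at the chosen block size.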
Experimental Results

Lowest execution time with the optimized version of BLAS and LAPACK for Pentium 4 and for Alpha.

Table 1. Values of arithmetic system parameters (in µsec) in Pentium 4 with BLASopt

  Block size |    32    |    64    |   128    |   256
  k3,dgemm   | 0,001862 | 0,000937 | 0,000572 | 0,000467
  k3,dsyrk   | 0,003492 | 0,001484 | 0,001228 | 0,000762
  k3,dtrsm   | 0,011719 | 0,006527 | 0,003785 | 0,002325

Table 2. Values of arithmetic system parameters (in µsec) in Alpha with CXML

  Block size |    32    |    64    |   128    |   256
  k3,dgemm   | 0,000824 | 0,000658 | 0,000610 | 0,000580
  k3,dsyrk   | 0,001628 | 0,001164 | 0,000807 | 0,000688
  k3,dtrsm   | 0,001617 | 0,001110 | 0,000841 | 0,000706
Experimental Results

But other SPs can depend on n and b, for example k2,potf2:

Table 3. Values of k2,potf2 (in µsec) in Pentium 4 with BLASopt

  b \ n |   512  |  1024  |  2048
    32  | 0,0045 | 0,0054 | 0,0067
    64  | 0,0034 | 0,046  | 0,0049
   128  | 0,0063 | 0,0077 | 0,0076
   256  | 0,0086 | 0,0103 | 0,0100

Table 4. Values of k2,potf2 (in µsec) in Alpha with CXML

  b \ n |  1024  |  2048  |  4096
    32  | 0,0028 | 0,0147 | 0,0101
    64  | 0,0024 | 0,0082 | 0,0034
   128  | 0,0033 | 0,0052 | 0,0025
   256  | 0,0027 | 0,0040 | 0,0023
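Since k2,potf2 varies with both n and b, one way to use these measurements inside the model SP = g(n,AP) is to store them as a table keyed by (b, n) and take the entry for the nearest measured pair. The lookup strategy below is an illustrative sketch, not the authors' method; the values are those of Table 3 (Pentium 4, in µsec).

```python
# Table 3 entries: k2,potf2 (in microseconds) on Pentium 4 with BLASopt,
# keyed by (b, n). Decimal commas converted to points.
K2_POTF2 = {(32, 512): 0.0045, (32, 1024): 0.0054, (32, 2048): 0.0067,
            (64, 512): 0.0034, (64, 1024): 0.046,  (64, 2048): 0.0049,
            (128, 512): 0.0063, (128, 1024): 0.0077, (128, 2048): 0.0076,
            (256, 512): 0.0086, (256, 1024): 0.0103, (256, 2048): 0.0100}

def k2_potf2(b, n, table=K2_POTF2):
    """Return the entry for the measured (b, n) pair closest in relative terms."""
    key = min(table, key=lambda bn: abs(bn[0] - b) / b + abs(bn[1] - n) / n)
    return table[key]
```

A fitted function of (n, b) could replace the nearest-entry lookup; the point is only that the SP is evaluated per problem size rather than treated as a constant.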
Experimental Results

Communication system parameters. Broadcast cost for MPI predefined data type, tws.

Table 5. Values of tws (in µsec) in P4net

  p \ Message size | 1500 | 2048 | > 4000
         2         | 0,61 | 0,77 |  0,84
         4         | 1,22 | 1,45 |  1,68

Table 6. Values of tws (in µsec) in HPC160

  p | Shared Memory | MemoryChannel
  2 |     0,011     |     0,072
  4 |     0,025     |     0,14
Experimental Results

Communication system parameters. Word-sending time of a broadcast for MPI derived data type, twd.
Experimental Results

Parameters selection in P4net.

Table 9. Parameters selection for the Cholesky factorization in P4net
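The parameter selection step can be sketched as a search over the candidate APs: evaluate the analytical model t(n, b, r, c) for each block size and mesh shape and keep the minimum. The model function below is a stand-in passed as an argument; the real one combines the arithmetic and communication SPs described earlier.

```python
# Illustrative AP selection: minimize the modeled execution time over
# block sizes b and mesh shapes r x c with r * c <= p_total processes.

def select_parameters(n, p_total, block_sizes, model):
    """Return (b, r, c) minimizing model(n, b, r, c)."""
    best = None
    for b in block_sizes:
        for r in range(1, p_total + 1):
            for c in range(1, p_total // r + 1):
                t = model(n, b, r, c)
                if best is None or t < best[0]:
                    best = (t, b, r, c)
    return best[1:]
```

Because the model is cheap to evaluate, this exhaustive search runs at installation or execution time without timing the routine itself for every candidate configuration.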
Experimental Results

Parameters selection in HPC160.

Table 10. Parameters selection for the Cholesky factorization in HPC160 with shared memory (HPC160smp), MemoryChannel (HPC160mc) and both (HPC160smp-mc)
Conclusions

The method has been applied successfully to the Cholesky factorization and can be applied to other linear algebra routines.
It is necessary to use different costs for the different types of MPI communication mechanisms, and to use different costs for the communication parameters in systems with more than one interconnection network.
It is necessary to decide the optimal allocation of processes per node according to the speed of the interconnection networks (hybrid systems).