Performance modeling of the HPCG benchmark

Vladimir Marjanovic, Jose Gracia and Colin W. Glass

16.11.2014

V. Marjanovic PMBS14

HPC Systems

• Ranking supercomputers

• Two approaches:

– single application (kernel): HPL, HPCG

– many applications (kernels): NAS benchmarks, HPC Challenge, etc.

HPL benchmark and TOP500

• HPL is de facto the most important benchmark for ranking supercomputers

• TOP500 has used HPL since 1993 (first version 1979)

• GFLOP/s is the metric

• REPRESENTATIVITY is an issue!

HPCG History

• First version in September 2013

• Conjugate Gradients solver

• MG preconditioner (from version 2.0 onwards)

• Aims at high representativity for real world applications

HPCG

• MPI and MPI/OpenMP, std lib

• Input: (nx, ny, nz) per MPI process

• Metric: GFLOP/s

• Official run > 3600 sec

• Computational complexity O(n^3), communication complexity O(n^2)

• MemoryUsage = C1 + C2 * n^3

Pseudo-code and % of routines for large problem size

for ( i = 0; i < 50 && normr > err; i++ ){
    MG( A, r, z );              /* multigrid preconditioner applied to r */
    DDOT( r, z, rtz );          /* local dot product */
    Allreduce( rtz );           /* global sum over all processes */
    if( i > 1 )
        beta = rtz/rtzold;
    WAXPBY( z, beta, p );       /* p = z + beta*p */
    ExchangeHalos( A, p );
    SpMV( A, p, Ap );
    DDOT( p, Ap, pAp );
    Allreduce( pAp );
    alpha = rtz/pAp;
    WAXPBY( x, alpha, p );      /* x = x + alpha*p */
    WAXPBY( r, -alpha, Ap );    /* r = r - alpha*Ap */
    DDOT( r, r, normr );
    Allreduce( normr );
    normr = sqrt( normr );
}
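
The loop above is a preconditioned Conjugate Gradients iteration. A minimal single-process sketch in Python may make the data flow easier to follow. Several things here are assumptions made for illustration and are not part of HPCG: the matrix is a 1D Laplacian rather than the 27-point stencil, a Jacobi step (`precondition`) stands in for the MG preconditioner, and because there is only one process, Allreduce and ExchangeHalos disappear.

```python
import math

def spmv(p):
    """y = A p for the 1D stencil [-1, 2, -1] (illustrative stand-in matrix)."""
    n = len(p)
    y = []
    for i in range(n):
        left = p[i - 1] if i > 0 else 0.0
        right = p[i + 1] if i < n - 1 else 0.0
        y.append(2.0 * p[i] - left - right)
    return y

def precondition(r):
    """Jacobi step z = D^-1 r; HPCG applies the MG routine at this point."""
    return [ri / 2.0 for ri in r]

def ddot(a, b):
    """Dot product; in the MPI version an Allreduce would follow."""
    return sum(ai * bi for ai, bi in zip(a, b))

def pcg(b, max_iter=50, err=1e-10):
    """Preconditioned CG for A x = b, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)                      # residual r = b - A x with x = 0
    p = [0.0] * len(b)
    rtz_old = 1.0
    normr = math.sqrt(ddot(r, r))
    it = 0
    while it < max_iter and normr > err:
        z = precondition(r)
        rtz = ddot(r, z)
        if it == 0:
            p = list(z)
        else:
            beta = rtz / rtz_old
            p = [zi + beta * pi for zi, pi in zip(z, p)]      # WAXPBY
        Ap = spmv(p)
        alpha = rtz / ddot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]         # WAXPBY
        r = [ri - alpha * api for ri, api in zip(r, Ap)]      # WAXPBY
        normr = math.sqrt(ddot(r, r))
        rtz_old = rtz
        it += 1
    return x, normr, it
```

Calling `pcg(spmv(x_true))` for a known vector `x_true` should return an approximation of `x_true`, which is a quick way to check the sketch.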

Pseudo-code: Computation and communication routines

/* MG routine */
if( depth < 3 ){
    ExchangeHalos( );
    SYMGS( );               /* pre-smoothing */
    ExchangeHalos( );
    SpMV( );
    MG( depth + 1 );        /* recurse on the next coarser level */
    ExchangeHalos( );
    SYMGS( );               /* post-smoothing */
}else{
    ExchangeHalos( );
    SYMGS( );               /* coarsest level */
}

• Computation routines:

– SYMGS

– SpMV

– WAXPBY

– DDOT

• Communication routines:

– Allreduce

– ExchangeHalos

Computation: Memory/Compute bound – Byte/FLOP

[Diagram labels: core, cache, memory, memory BW]

benchmark   kernel        Byte/FLOP
HPL         DGEMM         12/n
HPCG        SpMV, SYMGS   > 4

• Modern hardware ≈ 0.3 Byte/FLOP

e.g. E2680v3 has 0.14 Byte/FLOP

• HPCG kernels are memory bound on modern hardware
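
The memory-bound argument can be made concrete by comparing the Byte/FLOP a kernel needs with the Byte/FLOP the hardware delivers. The sketch below is an illustration, not part of the talk: the function name `attainable_fraction_of_peak` and the value `n = 1000` for DGEMM are assumptions, while the Byte/FLOP figures are the ones quoted above.

```python
def attainable_fraction_of_peak(machine_byte_per_flop, kernel_byte_per_flop):
    """Upper bound on the fraction of peak FLOP/s a kernel can reach.

    A kernel that needs more bytes per FLOP than the machine delivers is
    memory bound and limited to the ratio of the two balances.
    """
    return min(1.0, machine_byte_per_flop / kernel_byte_per_flop)

# 4 Byte/FLOP is the lower bound quoted for SpMV and SYMGS.
hpcg_kernel = 4.0
print(attainable_fraction_of_peak(0.3, hpcg_kernel))     # generic modern hardware
print(attainable_fraction_of_peak(0.14, hpcg_kernel))    # E2680v3

# DGEMM needs 12/n Byte/FLOP, so for large n it is compute bound.
n = 1000  # assumed value, for illustration only
print(attainable_fraction_of_peak(0.14, 12.0 / n))
```

Under these numbers the HPCG kernels can use only a few percent of peak, whereas DGEMM is not limited by memory bandwidth at all.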

Computational routines

SpMV & SYMGS

for (i = 0; i < n*n*n; i++)
    for (j = 0; j < 27; j++)
        a[i] += b[i][j] * c[index[i][j]];

WAXPBY & DDOT

for (i = 0; i < n*n*n; i++)
    a[i] = alpha * b[i] + beta * c[i];

• SpMV and SYMGS have the same computational behavior

• WAXPBY and DDOT are 1D loops

$$\mathrm{executionCompRoutine}(\mathrm{sec}) = \frac{\mathrm{MemoryUsage}(\mathrm{Byte})}{BW_{\mathrm{eff}}(\mathrm{Byte/sec})}$$
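
A small sketch of this model for a memory-bound routine follows. The traffic estimate in `waxpby_memory_usage` (three vectors of n^3 double-precision values) is an assumption made for illustration; the slides do not state how MemoryUsage is counted per routine. The bandwidth value is the STREAM figure reported for Platform A later in the deck.

```python
def comp_routine_time(memory_usage_bytes, bw_eff_bytes_per_sec):
    """executionCompRoutine(sec) = MemoryUsage(Byte) / BW_eff(Byte/sec)."""
    return memory_usage_bytes / bw_eff_bytes_per_sec

def waxpby_memory_usage(n, bytes_per_value=8):
    """Assumed traffic for WAXPBY: two vectors read, one written, n^3 values each."""
    return 3 * bytes_per_value * n ** 3

bw_eff = 4705e6          # Byte/sec, STREAM result of Platform A
t = comp_routine_time(waxpby_memory_usage(104), bw_eff)
print(f"modeled WAXPBY time for n = 104: {t * 1e3:.2f} ms")
```

Because the model is a plain ratio, doubling the effective bandwidth halves the predicted time, which is why the routine is insensitive to FLOP/s.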

Communication: MPI_Allreduce

• Hypercube algorithm: O(log(N))

• HPCG calls MPI_Allreduce three times per iteration

• Message size = 8 Byte

• k different latency levels: within socket, within node, within blade, within cabinet, etc.

$$\mathrm{executionAllreduce}(\mathrm{sec}) = \sum_{i=1}^{k} \mathrm{latency}_i \, \bigl(\log(M_i) - \log(M_{i-1})\bigr)$$

[Figure: hypercube algorithm]
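
The latency-level sum can be sketched in a few lines. The slide does not define M_i or the base of the logarithm, so this sketch assumes that M_i is the cumulative number of processes spanned up to level i, that M_0 = 1, and that the logarithm is base 2 as in a hypercube exchange. All latencies and process counts in `levels` are hypothetical.

```python
import math

def allreduce_time(levels):
    """Sum latency_i * (log2(M_i) - log2(M_{i-1})) over the latency levels.

    levels: list of (latency_seconds, processes_spanned), ordered from the
    innermost level outwards; processes_spanned is cumulative.
    """
    total = 0.0
    m_prev = 1  # assumed M_0: a single process
    for latency, m in levels:
        total += latency * (math.log2(m) - math.log2(m_prev))
        m_prev = m
    return total

# Hypothetical machine: socket, node, cabinet, whole system.
levels = [
    (0.5e-6, 8),
    (1.0e-6, 16),
    (2.0e-6, 1024),
    (4.0e-6, 65536),
]
per_call = allreduce_time(levels)
per_iteration = 3 * per_call   # three MPI_Allreduce calls per iteration
```

With these made-up values one call costs 38.5 μs, and most of it comes from the outermost levels, where both the latency and the number of hypercube steps are largest.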

Communication: ExchangeHalos

• Exchange halos with neighbors

• Maximum number of neighbors is 26

• For large problem size one process exchanges up to 1 MB

$$\mathrm{HaloSize}(\mathrm{Byte}) = \bigl(2\,(nx \cdot ny + nx \cdot nz + ny \cdot nz) + 4\,(nx + ny + nz) + 8\bigr)\ (\mathrm{Byte})$$

$$\mathrm{executionHaloEx}(\mathrm{sec}) = \frac{\mathrm{HaloSize}(\mathrm{Byte})}{IC\_BW_{\mathrm{eff}}(\mathrm{Byte/sec})} + \mathrm{overhead}(\mathrm{MPIcalls})$$

[Figure: communication pattern]
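
The halo terms correspond to the 6 faces, 12 edges and 8 corners of the local box, which adds up to the 26 neighbors. A short sketch, assuming one double-precision value (8 Byte) per halo point; that factor, the function names, and the interconnect bandwidth and overhead values in the example are assumptions for illustration.

```python
def halo_points(nx, ny, nz):
    """Points in a one-layer halo: 6 faces + 12 edges + 8 corners."""
    return 2 * (nx * ny + nx * nz + ny * nz) + 4 * (nx + ny + nz) + 8

def halo_exchange_time(nx, ny, nz, ic_bw_eff, overhead, bytes_per_point=8):
    """executionHaloEx(sec) = HaloSize / IC_BW_eff + overhead(MPI calls)."""
    halo_bytes = bytes_per_point * halo_points(nx, ny, nz)
    return halo_bytes / ic_bw_eff + overhead

# A 144^3 local problem: 126152 points, about 1.0 MB at 8 Byte per point.
print(8 * halo_points(144, 144, 144))

# Hypothetical interconnect: 5 GB/s effective, 2 us overhead per neighbor.
print(halo_exchange_time(144, 144, 144, ic_bw_eff=5e9, overhead=26 * 2e-6))
```

For a 1 x 1 x 1 box the count is 26, one point per neighbor, which matches the maximum number of neighbors stated above.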

Whole application

• Combine routines and sum over execution times

• Execution time is modeled and FLOP are computed, giving FLOP/s

$$\mathrm{executionMG} = \mathrm{HaloEx}(depth{=}3) + \mathrm{SYMGS}(depth{=}3) + \sum_{depth=0}^{2} \bigl(2\,\mathrm{SYMGS}(depth) + \mathrm{SpMV}(depth) + 3\,\mathrm{HaloEx}(depth)\bigr)$$

$$\mathrm{executionHPCG} = \mathrm{MG} + \mathrm{SpMV}(depth{=}0) + \mathrm{HaloEx}(depth{=}0) + 3\,(\mathrm{DDOT} + \mathrm{Allreduce} + \mathrm{WAXPBY})$$
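
Combining the routine models is a matter of summing per-level times. The sketch below mirrors the routine counts of the pseudo-code: per level with depth < 3 there are two SYMGS, one SpMV and three halo exchanges, and on the coarsest level one of each of SYMGS and the halo exchange. The function names and the FLOP count passed to `gflops` are placeholders for illustration.

```python
def execution_mg(symgs, spmv, halo_ex):
    """Modeled time of one MG call.

    symgs, spmv, halo_ex: per-level times in seconds, indexed by depth 0..3.
    """
    coarsest = halo_ex[3] + symgs[3]
    finer = sum(2 * symgs[d] + spmv[d] + 3 * halo_ex[d] for d in range(3))
    return coarsest + finer

def execution_hpcg(symgs, spmv, halo_ex, ddot, allreduce, waxpby):
    """Modeled time of one CG iteration including the MG preconditioner."""
    return (execution_mg(symgs, spmv, halo_ex)
            + spmv[0] + halo_ex[0]
            + 3 * (ddot + allreduce + waxpby))

def gflops(flop_per_iteration, seconds_per_iteration):
    """Convert a FLOP count and a modeled time into GFLOP/s."""
    return flop_per_iteration / seconds_per_iteration / 1e9
```

Setting every routine time to 1 gives 20 for the MG call and 31 for the full iteration, which is a convenient way to check that the routine counts match the pseudo-code.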

Platforms Software Characterization

                Platform A   Platform B   Platform C
Stream (MB/s)   4705         1700         3430
Pingpong (μs)   2-4          2-90         4-240

• Small problem size: HPCG avoids memory bandwidth bottleneck

• Large problem size: HPCG performance is proportional to STREAM benchmark

[Figure: HPCG: GFLOP/s vs. problem size, single node]

Accuracy of the model

• Accuracy ±2%

• 93600 cores machine

Official run: 39114 GFLOP/s

Model: 39319 GFLOP/s
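
The deviation for this run can be checked directly from the two numbers above. A minimal sketch; the helper name `relative_error` is illustrative.

```python
def relative_error(model, measured):
    """Signed relative deviation of the model from the measurement."""
    return (model - measured) / measured

official_gflops = 39114.0
model_gflops = 39319.0
err = relative_error(model_gflops, official_gflops)
print(f"model deviation: {err * 100:.2f} %")
```

The model overestimates the official result by roughly half a percent, well inside the stated ±2% band.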

Performance prediction

• For current hardware communication cost is 3%

• Extrapolation to 1 billion core machines

Conclusions

• HPCG model shows high accuracy (±2%)

• Arbitrary problem size → single property dominates

• Information content of the full system benchmark equals STREAM benchmark on a single node