Page 1
Towards Automatic HBM Allocation using LLVM:
A Case Study with Knights Landing
Dounia Khaldi and Barbara Chapman
Institute for Advanced Computational Science
Stony Brook University, Stony Brook, NY
The Third Workshop on the LLVM Compiler Infrastructure in HPC
Salt Lake City, Utah, November 14, 2016
LLVM logo is copyrighted by Apple Inc.
Page 2
Outline
• Introduction and Motivation
• Methodology:
  – Bandwidth-Critical Data Analysis (BCDA)
  – HBM Allocation Transformation
• Experimental Results using CG benchmark
• Conclusion and Future Work
Page 3
Introduction: Exploring Memory Hierarchy
Memory hierarchy, with data movement between levels:
• Registers
• Caches: Level 1, Level 2
• Main Memory: DDR (NUMA)
• Scratchpad Memory (MCDRAM from Intel, HBM2 from NVIDIA, SPM of DSPs from TI)
• NVM (3D XPoint from Micron and Intel)

• New kinds of memory in new architectures
• Which data elements have to reside on these memories?
• High performance using HBM, with lower power requirements compared to DDR
• 3D XPoint offers 1,000 times the performance of today's SSDs
Page 4
KNL Architecture as a Case Study
[Figure: KNL architecture diagram, with bandwidth labels of 490 GB/s (MCDRAM) and 90 GB/s (DDR)]
Page 5
MCDRAM Configuration Modes
[Figure: MCDRAM configuration modes. Extracted from http://colfaxresearch.com/knl-mcdram/]
Page 6
Programming KNL MCDRAM: Flat Mode
• hbwmalloc library
• Intel memkind library
  – C, C++: memkind_malloc()
  – Fortran:
    • !DIR$ ATTRIBUTES FASTMEM :: object
    • Since Intel Fortran 16.0 compiler
• AutoHBW library
  – Threshold size: AUTO_HBW_SIZE
• numactl command

float *fv; fv = (float *) malloc(sizeof(float)*n);
→ HBM allocation:
float *fv; fv = (float *) hbw_malloc(sizeof(float)*n);
Page 7
Related Work
Level          | Work                                    | Example
API level      | Legion, Sequoia, RDDs, Adios            | persistent
               | OpenMP 5.0?                             | Current proposal: #pragma omp allocate, with memory spaces, allocators and traits
Vendors        | Low-level libraries from Intel and Cray | Cray: #pragma memory(bandwidth)
Compiler level | Compiler transformations                | Loop nests
Tools          | VTune                                   | Collect bandwidth profiles (dynamic); bandwidth information, and then what?
Page 8
• We use LLVM, a widespread SSA-based compilation infrastructure for sequential and parallel languages
• Decide when it is beneficial to allocate data in the HBM for sequential and OpenMP code
• Case study: the HBM, called MCDRAM, of Knights Landing (KNL)
Page 9
Motivation: Impact of MCDRAM on OpenMP 3D 7-point Stencil
[Figure: execution time (sec) vs. grid size : timesteps (512^3:1, 512^3:5, 512^3:10, 1024^3:1, 1024^3:5, 1024^3:10); series: ICC:OMP:DDR, ICC:OMP:HBM, LLVM:OMP:DDR, LLVM:OMP:HBM]
• Setup: 1-node machine with one Intel(R) Xeon Phi(TM) [email protected]
• ICC 16.0.3 and LLVM 3.8.1, with -O3
• DDR vs. HBM execution time of the OpenMP version of the 3D 7-point Stencil
• hbw_set_policy(HBW_POLICY_BIND);
Page 10
What to allocate into HBM?
for (cgit = 1; cgit <= cgitmax; cgit++) {
  ...
  #pragma omp for
  for (j = 0; j < lastrow - firstrow + 1; j++) {
    suml = 0.0;
    for (k = rowstr[j]; k < rowstr[j+1]; k++) {
      suml = suml + a[k]*p[colidx[k]];
    }
    q[j] = suml;
  }
  #pragma omp for reduction(+:d)
  for (j = 0; j < lastcol - firstcol + 1; j++) {
    d = d + p[j] * q[j];
  }
  ...
}
• Code snippet (NAS-NPB CG benchmark)
• Different types of memory accesses
• Several matrix and vector multiplications and additions
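The two loops in the snippet can be isolated as a minimal, sequential C sketch. The function names csr_spmv and dot are hypothetical and are not taken from the benchmark; the OpenMP pragmas are omitted so the sketch compiles without OpenMP support. It shows the two kinds of access the slide contrasts: a[k] and colidx[k] are read with unit stride, whereas p[colidx[k]] is an indirect, data-dependent read.

```c
#include <assert.h>
#include <stddef.h>

/* Sparse matrix-vector product q = A * p with A stored in CSR form.
 * a[k] and colidx[k] are read with unit stride (regular accesses),
 * while p[colidx[k]] is an indirect, data-dependent read (irregular). */
static void csr_spmv(int nrows, const int *rowstr, const int *colidx,
                     const double *a, const double *p, double *q)
{
    assert(nrows >= 0);
    for (int j = 0; j < nrows; j++) {
        double suml = 0.0;
        for (int k = rowstr[j]; k < rowstr[j + 1]; k++) {
            suml += a[k] * p[colidx[k]];
        }
        q[j] = suml;
    }
}

/* Dot product d = p . q: both operands are streamed with unit stride. */
static double dot(int n, const double *p, const double *q)
{
    double d = 0.0;
    for (int j = 0; j < n; j++) {
        d += p[j] * q[j];
    }
    return d;
}
```

In this sketch the same array p is read irregularly in csr_spmv and regularly in dot, which is why the access pattern has to be judged over all references to an array rather than per loop.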
Page 11
Bandwidth-Critical Data (1)
[Figure: execution time (sec) vs. grid size : timesteps (512^3:1, 512^3:5, 512^3:10, 1024^3:1, 1024^3:5, 1024^3:10); series: ICC:Seq:DDR, ICC:Seq:HBM, LLVM:Seq:DDR, LLVM:Seq:HBM]
[Figure: Mops/s for different versions of CG (CLASS C): DDR(all), HBM(all); series: ICC:Seq, LLVM:Seq]
• Many wires into MCDRAM → simultaneous access is needed
Page 12
Bandwidth-Critical Data (2)
• Predictable memory access patterns → application is bandwidth-bound
• Many wires into MCDRAM → simultaneous access is needed
• Library solutions → not portable
• API level: might be a burden → Compiler + runtime solution
Page 13
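Methodology: Bandwidth-Critical Data Analysis (BCDA)
$$R = R(v)$$
$$P(R) = \mathit{bandwidth}(R) \sum_{r \in R} \mathit{cost}(r)\,\mathit{workshare}(r)$$
$$\mathit{cost}(r) = \begin{cases} 2 & \text{if } r \text{ is a store operation} \\ 1 & \text{otherwise} \end{cases}$$
$$\mathit{workshare}(r) = \begin{cases} 0 & \text{if } r \text{ is individual} \\ 1 & \text{if } r \text{ is simultaneous} \end{cases}$$
$$\mathit{bandwidth}(R) = \begin{cases} 1 & \text{if } \forall r \in R,\ r \text{ is regular} \\ 0 & \text{otherwise} \end{cases}$$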
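The priority function can be written out as a minimal C sketch. The mem_ref record and its field names are hypothetical, introduced only for illustration; in the actual analysis these properties would be derived from the LLVM IR rather than stored in a struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One memory reference r to a variable v (hypothetical record layout). */
typedef struct {
    bool is_store;      /* store operation: cost 2, otherwise cost 1 */
    bool simultaneous;  /* performed in an OpenMP work-sharing region */
    bool regular;       /* regular (not indirect) access pattern */
} mem_ref;

static int cost(const mem_ref *r)      { return r->is_store ? 2 : 1; }
static int workshare(const mem_ref *r) { return r->simultaneous ? 1 : 0; }

/* bandwidth(R) = 1 only if every reference in R is regular. */
static int bandwidth(const mem_ref *refs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!refs[i].regular) {
            return 0;
        }
    }
    return 1;
}

/* P(R) = bandwidth(R) * sum over r in R of cost(r) * workshare(r). */
static int priority(const mem_ref *refs, size_t n)
{
    assert(refs != NULL || n == 0);
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += cost(&refs[i]) * workshare(&refs[i]);
    }
    return bandwidth(refs, n) * sum;
}
```

Because bandwidth(R) multiplies the whole sum, a single irregular reference drives the priority of the variable to zero, regardless of how many simultaneous accesses it has.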
Page 14
BCDA: Interprocedural Memory Operations Count
• LLVM IR is in SSA form
  – One definition → multiple uses
  – Allows for Def-Use and Use-Def chain analysis
• Interprocedural Memory Operations Count()
  – __kmpc_fork_call
  – Number of memory operations in the generated LLVM IR (load, store and getelementptr)
$$R = R(v)$$
$$P(R) = \mathit{bandwidth}(R) \sum_{r \in R} \mathit{cost}(r)\,\mathit{workshare}(r)$$
Page 15
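BCDA: Data Reuse Cost
• Function cost assigns a weight to reference operations
$$P(R) = \mathit{bandwidth}(R) \sum_{r \in R} \mathit{cost}(r)\,\mathit{workshare}(r)$$
$$\mathit{cost}(r) = \begin{cases} 2 & \text{if } r \text{ is a store operation} \\ 1 & \text{otherwise} \end{cases}$$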
Page 16
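BCDA: Individual vs. Simultaneous Access
• OpenMP as a case study
• Function workshare detects whether an access r is performed in an OpenMP work-sharing region
$$P(R) = \mathit{bandwidth}(R) \sum_{r \in R} \mathit{cost}(r)\,\mathit{workshare}(r)$$
$$\mathit{workshare}(r) = \begin{cases} 0 & \text{if } r \text{ is individual} \\ 1 & \text{if } r \text{ is simultaneous} \end{cases}$$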
Page 17
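BCDA: Regular vs. Irregular Access Pattern
• Function bandwidth: latency-bound vs. bandwidth-bound
• Indirect accesses: index arguments of the getelementptr instruction
$$P(R) = \mathit{bandwidth}(R) \sum_{r \in R} \mathit{cost}(r)\,\mathit{workshare}(r)$$
$$\mathit{bandwidth}(R) = \begin{cases} 1 & \text{if } \forall r \in R,\ r \text{ is regular} \\ 0 & \text{otherwise} \end{cases}$$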
Page 18
Methodology: Allocation Transformation
compiler-rt runtime library:
#include <assert.h>
#include <stdlib.h>
#if defined (HAVE_HBWMALLOC_H)
# include <hbwmalloc.h>
void *memkind_alloc(size_t size) {
  /* hbw_check_available() returns 0 when HBM is available */
  int avail = hbw_check_available();
  void *a;
  hbw_set_policy(HBW_POLICY_PREFERRED);
  if (avail == 0) {
    a = hbw_malloc(size);
    assert(a != NULL);
  } else {
    a = malloc(size);
  }
  return a;
}
#else
void *memkind_alloc(size_t size) {
  void *a = malloc(size);
  return a;
}
#endif

C source:
int *a = malloc(sizeof(int) * n);
LLVM IR before the transformation:
%call3 = call i8* @malloc(i64 %mul)
%6 = bitcast i8* %call3 to i32*
store i32* %6, i32** @a, align 8
LLVM IR after the transformation:
%call31 = call i8* @memkind_alloc(i64 %mul)
%6 = bitcast i8* %call31 to i32*
store i32* %6, i32** @a, align 8
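One practical consequence of redirecting malloc is that memory obtained from hbw_malloc() should be released with hbw_free() rather than free(). The slides do not show how deallocation is handled, so the following is only a sketch under that assumption: the hbm_block record and the two helper functions are hypothetical names, and the record simply remembers which allocator produced the pointer. Without HAVE_HBWMALLOC_H defined, the sketch degrades to plain malloc/free.

```c
#include <assert.h>
#include <stdlib.h>

#if defined(HAVE_HBWMALLOC_H)
#include <hbwmalloc.h>
#endif

/* Hypothetical record that remembers which allocator produced the pointer. */
typedef struct {
    void *ptr;
    int   in_hbm;  /* 1 if ptr came from hbw_malloc(), 0 if from malloc() */
} hbm_block;

static hbm_block hbm_block_alloc(size_t size)
{
    hbm_block b = { NULL, 0 };
#if defined(HAVE_HBWMALLOC_H)
    if (hbw_check_available() == 0) {   /* 0 means HBM is available */
        b.ptr = hbw_malloc(size);
        b.in_hbm = (b.ptr != NULL);
    }
#endif
    if (b.ptr == NULL) {
        b.ptr = malloc(size);           /* fall back to DDR */
    }
    return b;
}

static void hbm_block_free(hbm_block *b)
{
    assert(b != NULL);
#if defined(HAVE_HBWMALLOC_H)
    if (b->in_hbm) {
        hbw_free(b->ptr);               /* HBM memory goes back via hbw_free() */
        b->ptr = NULL;
        b->in_hbm = 0;
        return;
    }
#endif
    free(b->ptr);
    b->ptr = NULL;
}
```

Unlike the wrapper on the slide, this sketch falls back to DDR when the HBM allocation fails instead of asserting; that is a design choice of the illustration, not the behavior of the presented transformation.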
Page 19
Experimental Results: Critical Data Analysis Results for the CG Benchmark
FP Array | cost | workshare    | bandwidth | P(FP Array)
r        | 46   | All parallel | regular   | 46
q        | 21   | All parallel | regular   | 21
a        | 17   | All parallel | regular   | 17
x        | 16   | All parallel | regular   | 16
p        | 29   | All parallel | irregular | 0
z        | 21   | All parallel | irregular | 0
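As a worked illustration of how the last column follows from the others, take one regular and one irregular array from the table. This assumes the cost column already holds the sum of cost times workshare over all references to the array, which is an interpretation based on every access being listed as parallel.

```latex
\begin{align*}
P(R(\texttt{r})) &= \mathit{bandwidth}(R(\texttt{r})) \cdot 46 = 1 \cdot 46 = 46 \\
P(R(\texttt{p})) &= \mathit{bandwidth}(R(\texttt{p})) \cdot 29 = 0 \cdot 29 = 0
\end{align*}
```

Array p has the second-highest cost, yet its irregular accesses zero out its priority, so it is ranked below every regular array.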
Page 20
Performance Results
[Figure: DDR vs. HBM-array-allocation performance of the OpenMP version of CG. Mops/s for different versions of CG (CLASS C): DDR(All), HBM(All), HBM(z), HBM(p), HBM(x), HBM(A), HBM(A,q,r), HBM(A,q,r,x); series: ICC:OMP, LLVM:OMP]
• Setup: 1-node machine with one Intel(R) Xeon Phi(TM) [email protected]
• LLVM 3.9, sporting Clang 3.9
• Results using:
  • Conjugate Gradient (CG) benchmark (NAS Parallel suite)
• 2.29x performance improvement using LLVM and 2.33x using ICC
Page 21
Conclusion and Future Work
• HBM management from a compiler point of view
  – Decide when it is beneficial to allocate data in the HBM for sequential and OpenMP code
  – Case study: HBM (MCDRAM) of Knights Landing (KNL)
  – 2.29x performance improvement using the LLVM compiler and 2.33x using the Intel compiler, compared to the DDR version of CG
• Future Work:
  – Improve the accuracy of our priority function
  – Implement more precise analyses regarding irregular accesses and instruction counts for recursive functions and nested loops
  – Use of AutoHBW to add size as an additional metric