
Towards Automatic HBM Allocation using LLVM:

A Case Study with Knights Landing

Dounia Khaldi and Barbara Chapman
Institute for Advanced Computational Science

Stony Brook University, Stony Brook, NY

The Third Workshop on the LLVM Compiler Infrastructure in HPC
Salt Lake City, Utah, November 14, 2016

LLVM logo is copyrighted by Apple Inc.

Outline

•  Introduction and Motivation
•  Methodology:
    –  Bandwidth-Critical Data Analysis (BCDA)
    –  HBM Allocation Transformation
•  Experimental Results using CG benchmark
•  Conclusion and Future Work

Introduction: Exploring Memory Hierarchy

[Figure: memory hierarchy, with data movement between levels]

Registers
Caches: Level 1, Level 2
Main Memory: DDR (NUMA)
Scratchpad Memory (MCDRAM from Intel, HBM2 from NVIDIA, SPM of DSPs from TI)
NVM (3D XPoint from Micron and Intel)

•  New kinds of memory in new architectures
•  Which data elements have to reside in these memories?
•  High performance using HBM, with lower power requirements compared to DDR
•  3D XPoint offers 1,000 times the performance of today's SSDs

KNL Architecture as a Case Study

[Figure: KNL architecture, with memory bandwidth labels of 490 GB/s and 90 GB/s]

MCDRAM Configuration Modes

[Figure: MCDRAM configuration modes. Extracted from http://colfaxresearch.com/knl-mcdram/]

Programming KNL MCDRAM: Flat Mode

•  hbwmalloc library
•  Intel memkind library
    –  C, C++: memkind_malloc()
    –  Fortran:
        •  !DIR$ ATTRIBUTES FASTMEM :: object
        •  Since Intel Fortran 16.0 compiler
•  AutoHBW library
    –  Threshold size: AUTO_HBW_SIZE
•  numactl command

HBM allocation (before and after):

    float *fv; fv = (float *) malloc(sizeof(float)*n);

    float *fv; fv = (float *) hbw_malloc(sizeof(float)*n);

Related Work

Level | Work | Example
API level | Legion, Sequoia, RDDs, Adios | persistent
OpenMP 5.0? | Current proposal: #pragma omp allocate | With memory spaces, allocators and traits
Vendors | Low-level libraries from Intel and Cray | Cray: #pragma memory(bandwidth)
Compiler level | Compiler transformations | Loop nests
Tools | VTune | Collect bandwidth profiles: dynamic; bandwidth information, and then what?


•  We use LLVM, a widespread SSA-based compilation infrastructure for sequential and parallel languages
•  Decide when it is beneficial to allocate data in the HBM for sequential and OpenMP code
•  Case study: the HBM, called MCDRAM, of Knights Landing (KNL)

Motivation: Impact of MCDRAM on OpenMP 3D 7-point Stencil

[Figure: execution time (sec) versus grid size : timesteps (512^3:1, 512^3:5, 512^3:10, 1024^3:1, 1024^3:5, 1024^3:10) for ICC:OMP:DDR, ICC:OMP:HBM, LLVM:OMP:DDR and LLVM:OMP:HBM]

•  Setup: 1-node machine with one Intel(R) Xeon Phi(TM) [email protected]
•  ICC 16.0.3 and LLVM 3.8.1, with -O3
•  DDR vs. HBM execution time of the OpenMP version of the 3D 7-point stencil
•  hbw_set_policy(HBW_POLICY_BIND);

What to allocate into HBM?

    for (cgit = 1; cgit <= cgitmax; cgit++) {
        ...
        #pragma omp for
        for (j = 0; j < lastrow - firstrow + 1; j++) {
            suml = 0.0;
            for (k = rowstr[j]; k < rowstr[j+1]; k++) {
                suml = suml + a[k]*p[colidx[k]];
            }
            q[j] = suml;
        }
        #pragma omp for reduction (+:d)
        for (j = 0; j < lastcol - firstcol + 1; j++) {
            d = d + p[j] * q[j];
        }
        ...
    }

•  Snippet code (NAS-NPB CG benchmark)
•  Different types of memory accesses
•  Several matrix and vector multiplications and additions
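The two loops of the snippet can be tried in isolation. The following is a minimal sequential sketch, assuming the matrix is stored in compressed sparse row (CSR) form with zero-based indices, which is what the `rowstr`/`colidx`/`a` arrays suggest. The function names `spmv_csr` and `dot` are illustrative and do not come from the benchmark. It makes the two kinds of access visible: `a`, `colidx` and `q` are walked with consecutive indices, while `p` is read through `colidx[k]`.

```c
#include <stddef.h>

/* q = A * p for a CSR matrix: a holds the nonzeros, colidx their column
 * indices, and rowstr[j] .. rowstr[j+1]-1 is the index range of row j. */
void spmv_csr(size_t nrows, const int *rowstr, const int *colidx,
              const double *a, const double *p, double *q)
{
    for (size_t j = 0; j < nrows; j++) {
        double suml = 0.0;
        for (int k = rowstr[j]; k < rowstr[j + 1]; k++) {
            /* a[k] and colidx[k] are consecutive (regular) accesses;
             * p[colidx[k]] is an indirect (irregular) access. */
            suml += a[k] * p[colidx[k]];
        }
        q[j] = suml;
    }
}

/* d = p . q, the reduction performed by the second loop. */
double dot(size_t n, const double *p, const double *q)
{
    double d = 0.0;
    for (size_t j = 0; j < n; j++)
        d += p[j] * q[j];
    return d;
}
```

For the 2x2 matrix with rows (2, 0) and (1, 3), the CSR arrays are `rowstr = {0, 1, 3}`, `colidx = {0, 0, 1}` and `a = {2, 1, 3}`.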

Bandwidth-Critical Data (1)

[Figure: execution time (sec) of sequential runs versus grid size : timesteps (512^3:1, 512^3:5, 512^3:10, 1024^3:1, 1024^3:5, 1024^3:10) for ICC:Seq:DDR, ICC:Seq:HBM, LLVM:Seq:DDR and LLVM:Seq:HBM]

[Figure: Mops/s of different versions of CG (CLASS C), DDR(all) versus HBM(all), for ICC:Seq and LLVM:Seq]

•  Many wires into MCDRAM → simultaneous access is needed

Bandwidth-Critical Data (2)

•  Predictable memory access patterns → application is bandwidth-bound
•  Many wires into MCDRAM → simultaneous access is needed
•  Library solutions → not portable
•  API level: might be a burden → Compiler + runtime solution

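Methodology: Bandwidth-Critical Data Analysis (BCDA)

For a variable v, let R = R(v) be the set of memory references r to v. The priority of v is

$$P(R) = \mathrm{bandwidth}(R) \sum_{r \in R} \mathrm{cost}(r)\,\mathrm{workshare}(r)$$

$$\mathrm{cost}(r) = \begin{cases} 2 & \text{if } r \text{ is a store operation} \\ 1 & \text{otherwise} \end{cases}$$

$$\mathrm{workshare}(r) = \begin{cases} 0 & \text{if } r \text{ is individual} \\ 1 & \text{if } r \text{ is simultaneous} \end{cases}$$

$$\mathrm{bandwidth}(R) = \begin{cases} 1 & \text{if } \forall r \in R,\ r \text{ is regular} \\ 0 & \text{otherwise} \end{cases}$$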
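The priority function can be written down directly from its three component functions. The following is a minimal sketch, assuming each reference has already been classified by the analysis. The `struct mem_ref` record and its field names are illustrative assumptions, not part of the presented implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* One memory reference r to a variable v (field names are illustrative). */
struct mem_ref {
    bool is_store;      /* cost(r) = 2 for a store, 1 otherwise */
    bool simultaneous;  /* workshare(r) = 1 for a simultaneous access */
    bool regular;       /* false for an indirect access such as p[colidx[k]] */
};

static int cost(const struct mem_ref *r) { return r->is_store ? 2 : 1; }
static int workshare(const struct mem_ref *r) { return r->simultaneous ? 1 : 0; }

/* bandwidth(R) = 1 only if every reference in R is regular. */
static int bandwidth(const struct mem_ref *refs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!refs[i].regular)
            return 0;
    return 1;
}

/* P(R) = bandwidth(R) * sum over r in R of cost(r) * workshare(r). */
int priority(const struct mem_ref *refs, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += cost(&refs[i]) * workshare(&refs[i]);
    return bandwidth(refs, n) * sum;
}
```

A single irregular reference zeroes the whole priority, and references outside a simultaneous access contribute nothing to the sum.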

BCDA: Interprocedural Memory Operations Count

•  LLVM IR is in SSA form
    –  One definition → multiple uses
    –  Allows for Def-Use and Use-Def chain analysis
•  Interprocedural Memory Operations Count()
    –  __kmpc_fork_call
    –  Number of memory operations in the generated LLVM IR (load, store and getelementptr)

$$R = R(v), \qquad \sum_{r \in R}$$
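The idea of counting memory operations along def-use chains can be illustrated without LLVM. The following is a toy sketch, not the LLVM pass: the `struct inst` representation, the `MAX_VALUES` bound and the function name `count_mem_ops` are assumptions made for illustration, and the interprocedural step through `__kmpc_fork_call` is left out. Starting from one value, it counts the load, store and getelementptr instructions that operate on that value or on an address derived from it.

```c
#include <stddef.h>

enum opcode { OP_LOAD, OP_STORE, OP_GEP, OP_OTHER };

/* A toy SSA instruction: `def` is the value it defines (-1 if none) and
 * `addr` is the pointer operand it reads, writes, or indexes (-1 if none). */
struct inst {
    enum opcode op;
    int def;
    int addr;
};

#define MAX_VALUES 64

/* Count load, store and getelementptr instructions whose pointer operand
 * is `root` or an address derived from it through getelementptr.
 * Instructions are assumed to be listed with definitions before uses,
 * and value ids to lie in [0, MAX_VALUES). */
int count_mem_ops(const struct inst *code, size_t n, int root)
{
    int derived[MAX_VALUES] = {0};
    int count = 0;

    if (root < 0 || root >= MAX_VALUES)
        return 0;
    derived[root] = 1;

    for (size_t i = 0; i < n; i++) {
        const struct inst *in = &code[i];
        if (in->op == OP_OTHER || in->addr < 0 || in->addr >= MAX_VALUES)
            continue;
        if (!derived[in->addr])
            continue;
        count++;
        /* Follow the def-use chain: a getelementptr result is still an
         * address inside the same array. */
        if (in->op == OP_GEP && in->def >= 0 && in->def < MAX_VALUES)
            derived[in->def] = 1;
    }
    return count;
}
```

In a real pass the same walk would follow the users of an LLVM value instead of scanning a flat array.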

BCDA: Data Reuse Cost

•  Function cost assigns a weight to reference operations

$$\mathrm{cost}(r) = \begin{cases} 2 & \text{if } r \text{ is a store operation} \\ 1 & \text{otherwise} \end{cases}$$

BCDA: Individual vs. Simultaneous Access

•  OpenMP as a case study
•  Function workshare detects if an access r has been performed in an OpenMP work-sharing region or not

$$\mathrm{workshare}(r) = \begin{cases} 0 & \text{if } r \text{ is individual} \\ 1 & \text{if } r \text{ is simultaneous} \end{cases}$$

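BCDA: Regular vs. Irregular Access Pattern

•  Function bandwidth: latency-bound vs. bandwidth-bound
•  Indirect accesses: index arguments of the getelementptr instruction

$$\mathrm{bandwidth}(R) = \begin{cases} 1 & \text{if } \forall r \in R,\ r \text{ is regular} \\ 0 & \text{otherwise} \end{cases}$$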

Methodology: Allocation Transformation

compiler-rt runtime library:

    #include <assert.h>
    #include <stdlib.h>

    #if defined (HAVE_HBWMALLOC_H)
    # include <hbwmalloc.h>
    /* Allocate from HBM when it is available, otherwise from regular memory. */
    void *memkind_alloc(size_t size) {
        int avail = hbw_check_available();  /* returns 0 when HBM is available */
        void *a;
        hbw_set_policy(HBW_POLICY_PREFERRED);
        if (avail == 0) {
            a = hbw_malloc(size);
            assert(a != NULL);
        } else {
            a = malloc(size);
        }
        return a;
    }
    #else
    /* hbwmalloc.h is not available at build time: fall back to malloc. */
    void *memkind_alloc(size_t size) {
        void *a = malloc(size);
        return a;
    }
    #endif

Source:

    int *a = malloc(sizeof(int) * n);

LLVM IR before the transformation:

    %call3 = call i8* @malloc(i64 %mul)
    %6 = bitcast i8* %call3 to i32*
    store i32* %6, i32** @a, align 8

LLVM IR after the transformation:

    %call31 = call i8* @memkind_alloc(i64 %mul)
    %6 = bitcast i8* %call31 to i32*
    store i32* %6, i32** @a, align 8
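Memory obtained from `hbw_malloc` has to be returned with `hbw_free` rather than `free`, so a rewritten allocation call needs a matching release path. The slides do not show one. The following is a hypothetical counterpart, assuming the same `HAVE_HBWMALLOC_H` guard and the same availability test as `memkind_alloc`. The name `memkind_release` and its return value are illustrative only.

```c
#include <stdlib.h>

#if defined(HAVE_HBWMALLOC_H)
# include <hbwmalloc.h>
#endif

/* Hypothetical counterpart to memkind_alloc: release a block through the
 * same allocator family that produced it. Returns 1 when hbw_free was
 * used and 0 when plain free was used. */
int memkind_release(void *p)
{
#if defined(HAVE_HBWMALLOC_H)
    /* memkind_alloc calls hbw_malloc exactly when this check returns 0. */
    if (hbw_check_available() == 0) {
        hbw_free(p);
        return 1;
    }
#endif
    free(p);
    return 0;
}
```

Built without `HAVE_HBWMALLOC_H`, the function reduces to `free`, mirroring the `#else` branch of the wrapper.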

Experimental Results: Critical Data Analysis Results for the CG Benchmark

FP Array | cost | workshare | bandwidth | P(FP Array)
r | 46 | All parallel | regular | 46
q | 21 | All parallel | regular | 21
a | 17 | All parallel | regular | 17
x | 16 | All parallel | regular | 16
p | 29 | All parallel | irregular | 0
z | 21 | All parallel | irregular | 0
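Two rows of the table can be checked against the priority function by hand. This worked example assumes the cost column is the summed cost of all references to the array, and that "All parallel" means workshare is 1 for every reference. The bound variable is written ρ only to avoid a clash with the array named r.

```latex
\begin{align*}
P(R(\mathtt{r})) &= \mathrm{bandwidth}(R(\mathtt{r}))
    \sum_{\rho \in R(\mathtt{r})} \mathrm{cost}(\rho)\,\mathrm{workshare}(\rho)
    = 1 \times 46 = 46 \\
P(R(\mathtt{p})) &= \mathrm{bandwidth}(R(\mathtt{p}))
    \sum_{\rho \in R(\mathtt{p})} \mathrm{cost}(\rho)\,\mathrm{workshare}(\rho)
    = 0 \times 29 = 0
\end{align*}
```

Array p has the second-highest summed cost, yet its irregular accesses set bandwidth to 0 and so remove it from the HBM candidates.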

Performance Results

[Figure: DDR vs. HBM-array-allocation performance of the OpenMP version of CG. Mops/s for different versions of CG (CLASS C): DDR(All), HBM(All), HBM(z), HBM(p), HBM(x), HBM(A), HBM(A,q,r), HBM(A,q,r,x); series ICC:OMP and LLVM:OMP]

•  Setup: 1-node machine with one Intel(R) Xeon Phi(TM) [email protected]
•  LLVM 3.9, sporting Clang 3.9
•  Results using:
    –  Conjugate Gradient (CG) benchmark (NAS Parallel suite)
•  2.29x performance improvement using LLVM and 2.33x using ICC

Conclusion and Future Work

•  HBM management from a compiler point of view
    –  Decide when it is beneficial to allocate data in the HBM for sequential and OpenMP code
    –  Case study: HBM (MCDRAM) of Knights Landing (KNL)
    –  2.29x performance improvement using LLVM compiler and 2.33x using Intel compiler compared to the DDR version of CG
•  Future Work:
    –  Improve the accuracy of our priority function
    –  Implement more precise analyses regarding irregular accesses and instruction counts for recursive functions and nested loops
    –  Use of AutoHBW to add size as an additional metric
