Helen He NUG Meeting, 10/08/2015 Nested OpenMP

Nested OpenMP - National Energy Research Scientific ... · • Hopper: NERSC Cray XE6 ... Courtesy of Robert Hager ... • Nested OpenMP can be used to enable multiple threads for

Apr 25, 2018


Helen He
NUG Meeting, 10/08/2015

Nested OpenMP


OpenMP Execution Model

•  Fork and Join Model
–  Master thread forks new threads at the beginning of parallel regions.
–  Multiple threads share work in parallel.
–  Threads join at the end of the parallel regions.


Hopper/Edison Compute Nodes


•  Hopper: NERSC Cray XE6, 6,384 nodes, 153,216 cores.
•  4 NUMA domains per node, 6 cores per NUMA domain.
•  Edison: NERSC Cray XC30, 5,576 nodes, 133,824 cores.
•  2 NUMA domains per node, 12 cores per NUMA domain. 2 hardware threads per core.
•  Memory bandwidth is non-homogeneous among NUMA domains.


MPI Process Affinity: aprun “-S” Option

•  Process affinity, or CPU pinning, binds an MPI process to a CPU or a range of CPUs on the node.
•  It is important to spread MPI ranks evenly onto different NUMA nodes.
•  Use the “-S” option for Hopper/Edison.


[Figure: GTC Hybrid MPI/OpenMP on Hopper, 24,576 cores. Run time (sec, axis 0 to 1400; lower is better) versus MPI tasks * OpenMP threads (24576*1, 12288*2, 8192*3, 4096*6, 2048*12), for runs with -S -ss and with no -S -ss. Labels: aprun -n 4 -S 1 -d 6; aprun -n 4 -d 6; -S 2 -d 3.]
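The -S and -d values in the aprun labels of the figure above follow from simple arithmetic on the node layout. The following is a minimal sketch, assuming the MPI ranks on a node divide evenly across NUMA domains and cores. The helper name `aprun_flags` is hypothetical (it is not part of aprun), and it only prints the flags rather than launching a job.

```shell
#!/bin/sh
# Hypothetical helper: derive the aprun -S (ranks per NUMA domain) and
# -d (cores per rank) values from the node layout.
# Usage: aprun_flags RANKS_PER_NODE NUMA_DOMAINS_PER_NODE CORES_PER_NODE
aprun_flags() (
    ranks=$1; domains=$2; cores=$3
    # Assumption: ranks spread evenly over NUMA domains and cores.
    if [ $((ranks % domains)) -ne 0 ] || [ $((cores % ranks)) -ne 0 ]; then
        echo "ranks per node must be a multiple of the NUMA domain count and divide the core count" >&2
        exit 1
    fi
    echo "-S $((ranks / domains)) -d $((cores / ranks))"
)

# Hopper node: 4 NUMA domains, 4 x 6 = 24 cores.
aprun_flags 4 4 24   # 4 ranks per node
aprun_flags 8 4 24   # 8 ranks per node
```

With the Hopper layout this reproduces the two flag combinations shown in the figure labels, "-S 1 -d 6" and "-S 2 -d 3".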


Thread Affinity: aprun “-cc” Option

•  Thread affinity: forces each process or thread to run on a specific subset of processors, to take advantage of local process state.
•  Thread locality is important since it impacts both memory and intra-node performance.
•  On Hopper/Edison:
–  The default option is “-cc cpu” (use it for non-Intel compilers), which binds each PE to a CPU within the assigned NUMA node.
–  Pay attention to the Intel compiler, which uses an extra thread.
•  Use “-cc none” if 1 MPI process per node
•  Use “-cc numa_node” (Hopper) or “-cc depth” (Edison) if multiple MPI processes per node
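The selection rules in this list can be written down as a small decision function. This is only a sketch of the guidance on this slide: the function name `cc_option` and its arguments are hypothetical, and it prints the option instead of calling aprun.

```shell
#!/bin/sh
# Hypothetical helper: pick the aprun -cc value suggested on this slide.
# Usage: cc_option COMPILER SYSTEM MPI_RANKS_PER_NODE
cc_option() (
    compiler=$1; system=$2; ranks=$3
    if [ "$compiler" != "intel" ]; then
        echo "-cc cpu"          # default binding, fine for non-Intel compilers
    elif [ "$ranks" -eq 1 ]; then
        echo "-cc none"         # Intel, 1 MPI process per node
    else
        case $system in         # Intel, multiple MPI processes per node
            hopper) echo "-cc numa_node" ;;
            edison) echo "-cc depth" ;;
            *) echo "unknown system: $system" >&2; exit 1 ;;
        esac
    fi
)

cc_option cray edison 2
cc_option intel hopper 1
cc_option intel hopper 4
cc_option intel edison 2
```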


NERSC KNC Testbed: Babbage


•  NERSC Intel Xeon Phi Knights Corner (KNC) testbed
•  45 compute nodes, each has:
–  Host node: 2 Intel Xeon Sandybridge processors, 8 cores each.
–  2 MIC cards, each with 60 native cores and 4 hardware threads per core.
–  MIC cards attached to host nodes via PCI-express.
–  8 GB memory on each MIC card
•  Recommend using at least 2 threads per core to hide the latency of in-order execution.

To best prepare codes on Babbage for Cori:
•  Use “native” mode on KNC to mimic KNL, which means ignore the host and run completely on the KNC cards.
•  Encourage exploring single-node optimization for thread scaling and vectorization on KNC cards with problem sizes that can fit.
•  “Symmetric” and “Offload” modes on KNC and “OpenMP 4.0 target” work, but are not our promoted usage models for Babbage.


Babbage MIC Card


Babbage: NERSC Intel Xeon Phi testbed, 45 nodes. 2 MIC cards per node.
•  1 NUMA domain per MIC card: 60 physical cores, 240 logical cores. OpenMP threading potential to 240-way.
•  KMP_AFFINITY, KMP_PLACE_THREADS, OMP_PLACES, OMP_PROC_BIND for thread affinity control
•  I_MPI_PIN_DOMAIN for MPI/OpenMP process and thread affinity control.


Full OpenMP 4.0 Support in Compilers

•  GNU compiler
–  From 4.9.0 for C/C++
–  From gcc/4.9.1 for Fortran
•  Intel compiler
–  From intel/15.0: supports most features in OpenMP 4.0
–  From intel/16.0: full support
•  Cray compiler
–  From cce/8.4.0


Thread Affinity Control in OpenMP 4.0

•  OMP_PLACES: a list of places that threads can be pinned on
–  threads: Each place corresponds to a single hardware thread on the target machine.
–  cores: Each place corresponds to a single core (having one or more hardware threads) on the target machine.
–  sockets: Each place corresponds to a single socket (consisting of one or more cores) on the target machine.
–  A list with explicit place values, such as:
•  “{0,1,2,3},{4,5,6,7},{8,9,10,11},{12,13,14,15}”
•  “{0:4},{4:4},{8:4},{12:4}”
•  OMP_PROC_BIND
–  spread: Bind threads as evenly distributed (spread) as possible
–  close: Bind threads close to the master thread while still distributing threads for load balancing, wrap around once each place receives one thread
–  master: Bind threads to the same place as the master thread
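The two explicit place lists above describe the same places: each `{lower:length}` interval stands for `length` consecutive IDs starting at `lower`. The sketch below expands the interval form into the explicit form to make that equivalence concrete. The function name `expand_places` is hypothetical, and it assumes every place is written as a single `{lower:length}` interval (no strides, no mixed notation).

```shell
#!/bin/sh
# Hypothetical helper: expand "{lower:length}" places into explicit ID lists.
# Assumption: every place in the input uses the {lower:length} form.
expand_places() (
    out=""
    # Drop "{" and turn each "}," or "}" into a space: "0:4 4:4 ..."
    for place in $(printf '%s' "$1" | tr -d '{' | sed 's/},*/ /g'); do
        start=${place%%:*}      # text before the colon
        len=${place##*:}        # text after the colon
        ids=""
        i=0
        while [ "$i" -lt "$len" ]; do
            ids="${ids:+$ids,}$((start + i))"
            i=$((i + 1))
        done
        out="${out:+$out,}{$ids}"
    done
    printf '%s\n' "$out"
)

expand_places '{0:4},{4:4},{8:4},{12:4}'
```

Either form can then be assigned to OMP_PLACES; the interval form is simply shorter to write for regular layouts.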


Nested OpenMP Thread Affinity Illustration


setenv OMP_PLACES threads
setenv OMP_NUM_THREADS 4,4
setenv OMP_PROC_BIND spread,close

[Figure: nested thread placement illustration, with labels “spread” and “close”.]


Sample Nested OpenMP Code

#include <omp.h>
#include <stdio.h>

/* One thread of each team reports the size of that team. */
void report_num_threads(int level)
{
    #pragma omp single
    {
        printf("Level %d: number of threads in the team: %d\n",
               level, omp_get_num_threads());
    }
}

int main()
{
    omp_set_dynamic(0);   /* keep the requested team sizes fixed */
    #pragma omp parallel num_threads(2)
    {
        report_num_threads(1);
        #pragma omp parallel num_threads(2)
        {
            report_num_threads(2);
            #pragma omp parallel num_threads(2)
            {
                report_num_threads(3);
            }
        }
    }
    return(0);
}

% a.out
Level 1: number of threads in the team: 2
Level 2: number of threads in the team: 1
Level 3: number of threads in the team: 1
Level 2: number of threads in the team: 1
Level 3: number of threads in the team: 1

% setenv OMP_NESTED TRUE
% a.out
Level 1: number of threads in the team: 2
Level 2: number of threads in the team: 2
Level 2: number of threads in the team: 2
Level 3: number of threads in the team: 2
Level 3: number of threads in the team: 2
Level 3: number of threads in the team: 2
Level 3: number of threads in the team: 2

Level 0: P0
Level 1: P0 P1
Level 2: P0 P2; P1 P3
Level 3: P0 P4; P2 P5; P1 P6; P3 P7


When to Use Nested OpenMP

•  Some application teams are exploring nested OpenMP to allow more fine-grained thread parallelism.
–  MPI/Hybrid not using node fully packed
–  Top level OpenMP loop does not use all available threads
–  Multiple levels of OpenMP loops are not easily collapsed
–  Certain computationally intensive kernels could use more threads
–  MKL can use extra cores with nested OpenMP


Process and Thread Affinity in Nested OpenMP

•  Achieving best process and thread affinity is crucial in getting good performance with nested OpenMP, yet it is not straightforward to do so.
•  A combination of OpenMP environment variables and run time flags is needed for different compilers and different batch schedulers on different systems.

Example: Use Intel compiler with Torque/Moab on Edison:
setenv OMP_NESTED true
setenv OMP_NUM_THREADS 4,3
setenv OMP_PROC_BIND spread,close
aprun -n 2 -S 1 -d 12 -cc numa_node ./nested.intel.edison


Edison: Run Time Environment Variables

•  setenv OMP_NESTED true
–  Default is false for most compilers
•  setenv OMP_MAX_ACTIVE_LEVELS 2
–  The default was 1 for CCE prior to cce/8.4.0
•  setenv OMP_NUM_THREADS 4,3
•  setenv OMP_PROC_BIND spread,close
•  setenv KMP_HOT_TEAMS 1
–  Intel only env. Default is false
•  setenv KMP_HOT_TEAMS_MAX_LEVELS 2
–  Intel only env. Allows nested level OpenMP threads to stay alive instead of being destroyed and created again, to reduce thread creation overhead.
•  aprun -n 2 -S 1 -d 12 -cc numa_node ./nested.intel.edison
–  Use -d for the total number of threads (product of num_threads from all levels). Use -cc numa_node to allow threads to migrate within the NUMA node, so that they are not affected by Intel’s extra manager thread.


Babbage: Run Time Environment Variables

•  Set I_MPI_PIN_DOMAIN=auto to get basic MPI process affinity
•  Do not set KMP_AFFINITY, otherwise OMP_PROC_BIND will be ignored.
•  Use OMP_PLACES=threads (default) instead of sockets
•  setenv OMP_NESTED true
•  setenv OMP_NUM_THREADS 4,3
•  setenv OMP_PROC_BIND spread,close
•  setenv KMP_HOT_TEAMS 1
•  setenv KMP_HOT_TEAMS_MAX_LEVELS 2
•  mpirun.mic -n 2 -host bc1109-mic0 ./xthi-nested.mic | sort
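The settings above can be collected in one place so that a stale KMP_AFFINITY from an earlier job cannot silently override OMP_PROC_BIND. The sketch below uses sh/bash `export` syntax rather than the csh `setenv` of the slide. The function name `setup_babbage_nested` is hypothetical, and the `compact` value is only there to simulate a leftover setting.

```shell
#!/bin/sh
# Hypothetical helper: apply the Babbage nested-OpenMP settings listed above.
setup_babbage_nested() {
    # KMP_AFFINITY would cause OMP_PROC_BIND to be ignored, so clear it.
    unset KMP_AFFINITY
    export I_MPI_PIN_DOMAIN=auto
    export OMP_PLACES=threads
    export OMP_NESTED=true
    export OMP_NUM_THREADS=4,3
    export OMP_PROC_BIND=spread,close
    export KMP_HOT_TEAMS=1
    export KMP_HOT_TEAMS_MAX_LEVELS=2
}

# Simulate a leftover setting from an earlier job, then apply the settings.
KMP_AFFINITY=compact
export KMP_AFFINITY
setup_babbage_nested

# Show what a subsequently launched program would inherit.
env | grep -E '^(I_MPI|OMP|KMP)_' | sort
```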


XGC1: Nested OpenMP

•  Always make sure to use best thread affinity. Avoid using threads across NUMA domains.
•  Currently:
export OMP_NUM_THREADS=6,4
export OMP_PROC_BIND=spread,close
export OMP_NESTED=TRUE
export OMP_STACKSIZE=8000000
aprun -n 200 -N 2 -S 1 -j 2 -cc numa_node ./xgca
•  Is a bit slower than (work ongoing):
export OMP_NUM_THREADS=24
export OMP_NESTED=TRUE
export OMP_STACKSIZE=8000000
aprun -n 200 -d 24 -N 2 -S 1 -j 2 -cc numa_node ./xgca
•  Will try:
export KMP_HOT_TEAMS=1
export KMP_HOT_TEAMS_MAX_LEVELS=2
•  Use the num_threads clause in source codes to set threads for nested regions. For most other non-nested regions, use the OMP_NUM_THREADS env for simplicity and flexibility.

Courtesy of Robert Hager, PPPL and NESAP XGC1 team.


Use Multiple Threads in MKL

•  By default, in OpenMP parallel regions, only 1 thread will be used for MKL calls.
–  MKL_DYNAMIC is true by default
•  Nested OpenMP can be used to enable multiple threads for MKL calls. Treat MKL as a nested inner OpenMP region.
•  Sample settings
export OMP_NESTED=true
export OMP_PLACES=cores
export OMP_PROC_BIND=close
export OMP_NUM_THREADS=6,4
export MKL_DYNAMIC=false
export KMP_HOT_TEAMS=1
export KMP_HOT_TEAMS_MAX_LEVELS=2


NWChem: OpenMP “Reduce” Algorithm


•  Plane wave Lagrange multiplier
–  Many matrix multiplications of complex numbers, C = A x B
–  Smaller matrix products: FFM, typical size 100 x 10,000 x 100
–  Original threading scaling with MKL not satisfactory
•  OpenMP “Reduce” or “Block” algorithm
–  Distribute work on A and B along the k dimension
–  A thread puts its contribution in a buffer of size m x n
–  Buffers reduced to produce C
–  OMP teams of threads

[Figure: FFM]

Courtesy of Mathias Jacquelin, LBNL


NWChem: OpenMP “Reduce” Algorithm

•  Better for smaller inner dimensions, i.e. for FFMs
•  Multiple FFMs can be done concurrently in different thread pools
•  Threading enables us to use all 240 hardware threads
•  Best “Reduce”: 10 MPI, 6 teams of 4 threads (nested OpenMP with MKL)

[Figure: comparison of MKL (1 MPI, 240 threads) with best “Reduce” (10 MPI, 6 teams of 4 threads).]

Courtesy of Mathias Jacquelin, LBNL
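As a quick arithmetic check, both configurations named on this slide occupy the same 240 hardware threads of a MIC card (60 cores x 4 hardware threads). The helper name `threads_in_use` below is hypothetical and is only used for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: hardware threads occupied by a configuration.
# Usage: threads_in_use MPI_RANKS TEAMS_PER_RANK THREADS_PER_TEAM
threads_in_use() {
    echo $(( $1 * $2 * $3 ))
}

hw_threads=$((60 * 4))   # one MIC card: 60 cores, 4 hardware threads each

echo "MKL, 1 MPI x 240 threads:        $(threads_in_use 1 1 240) of $hw_threads"
echo "Reduce, 10 MPI x 6 teams x 4 threads: $(threads_in_use 10 6 4) of $hw_threads"
```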


FFT3D on KNC, Ng=64^3

Courtesy of Jeongnim Kim, Intel


Nested OpenMP on NERSC Systems

•  Please see detailed example settings on the “Nested OpenMP” web page:
–  Run on Edison and Babbage
–  With Intel and Cray compilers
–  Use Torque/Moab and SLURM batch schedulers
–  https://www.nersc.gov/users/computational-systems/edison/running-jobs/using-openmp-with-mpi/nested-openmp/


Thank you.
