Babbage: the Intel Many Core (MIC) Testbed System at NERSC

Helen He
NERSC User Services Group
July 10, 2014



First Message

•  Babbage can help you prepare for Cori regarding thread scalability (hybrid MPI/OpenMP implementation) and vectorization.
•  Performance on Babbage will be significantly worse than on Cori. However, using Babbage can expose bottlenecks and weaknesses in your code for improvement.

Outline

•  Knights Corner (KNC) architecture and programming considerations
•  System configurations and programming environment
•  How to compile and run
•  Examples of tuning kernels and applications

Basic Terminologies

•  MIC: Intel Many Integrated Core architecture
•  Xeon: Intel processors. Product names include Nehalem, Westmere, Sandy Bridge (SNB), etc.
•  Xeon Phi: Intel's marketing name for the MIC architecture.
   –  Some code names are Knights Ferry (KNF), Knights Corner (KNC), and Knights Landing (KNL).
•  Knights Corner (KNC): first-generation product of the MIC Xeon Phi implementation
   –  Coprocessors connected to the host via PCIe
   –  Validate programming models
   –  Prepare for next-generation production hardware

Babbage Nodes

•  1 login node: bint01
•  45 compute nodes, each with:
   –  Host node: 2 Intel Xeon Sandy Bridge EP processors, 8 cores each, 2.6 GHz, AVX 256-bit. Peak performance 166 GFlops per processor
   –  2 MIC cards (5110P), each with 60 native cores connected by a high-speed bidirectional ring, 1053 MHz, 4 hardware threads per core
   –  Peak performance 1 TFlops per card
   –  8 GB GDDR5 memory, peak memory bandwidth 320 GB/sec
   –  512-bit SIMD instructions. A vector register holds 16 SP or 8 DP floating-point numbers.
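The peak figures quoted for the host and the MIC cards follow from cores × clock × floating-point operations per cycle. The sketch below reproduces that arithmetic; the function names are illustrative, and the per-cycle counts are assumptions spelled out in the comments: 4 double-precision AVX lanes with separate add and multiply units on Sandy Bridge, and 8 double-precision lanes with fused multiply-add on KNC.

```c
#include <assert.h>

/* Number of double-precision (64-bit) lanes in one SIMD register. */
int dp_lanes(int simd_bits)
{
    assert(simd_bits > 0 && simd_bits % 64 == 0);
    return simd_bits / 64;
}

/*
 * Theoretical peak in GFlops:
 *   cores * clock (GHz) * DP lanes * operations per lane per cycle.
 * ops_per_lane is 2 when a core can complete one multiply and one add
 * per lane each cycle (separate units or fused multiply-add), else 1.
 * Assumption: one vector instruction stream per core at full clock.
 */
double peak_gflops(int cores, double ghz, int simd_bits, int ops_per_lane)
{
    assert(cores > 0 && ghz > 0.0 && ops_per_lane > 0);
    return cores * ghz * dp_lanes(simd_bits) * ops_per_lane;
}
```

Under these assumptions, `peak_gflops(8, 2.6, 256, 2)` gives about 166.4 for one host processor and `peak_gflops(60, 1.053, 512, 2)` gives about 1010.9 for one MIC card, consistent with the 166.4 GFlops and 1011 GFlops peak figures listed for the Babbage compute nodes.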

Babbage KNC vs. Cori KNL

•  Similarities
   –  Many integrated cores on a node (>60), 4 hardware threads per core
   –  MPI+OpenMP as the main programming model
   –  512-bit vector length for SIMD
•  Significant improvements in KNL
   –  Self-hosted architecture (not a coprocessor!)
   –  3X the single-thread performance of KNC
   –  Deeper out-of-order execution
   –  High-bandwidth on-package memory
   –  Improved vectorization capabilities
   –  And more…

Even with these differences, Babbage can still be helpful in preparing applications for Cori. (Note: Edison can be used as well; this will be covered in a future presentation.)

Programming Considerations

•  Use "native" mode on KNC to mimic KNL: ignore the host and run completely on the KNC cards.
•  We encourage single-node exploration on KNC cards with problem sizes that fit.
•  Simple to port from multi-core CPU architectures, but hard to achieve high performance.
•  Need to exploit a high degree of loop-level parallelism via threading and SIMD vectorization.
•  Available threading models: OpenMP, pthreads, etc.

User Friendly Test System

•  Configured with ease of use in mind
   –  All production file systems are mounted
   –  System SSH configuration allows smooth password-less access to the host nodes and MIC cards
   –  Modules are created and loaded by default to set up the Intel compiler and MPI libraries
   –  Batch scheduler installed for allocating nodes
   –  MIC cards have access to system libraries, which allows multiple versions of software to co-exist
      •  No need to copy system libraries or binaries to each MIC card manually as pre-steps for running jobs
   –  Scripts and wrappers created to further simplify job launching commands
   –  User environment very similar to other production systems

Intel Linux Studio XE Package

•  Intel C, C++ and Fortran compilers
•  Intel MKL: Math Kernel Library
•  Intel Integrated Performance Primitives (IPP): performance libraries
•  Intel Trace Analyzer and Collector: MPI communications profiling and analysis
•  Intel VTune Amplifier XE: advanced threading and performance profiler
•  Intel Inspector XE: memory and threading debugger
•  Intel Advisor XE: threading prototyping tool
•  Intel Threading Building Blocks (TBB) and Intel Cilk Plus: parallel programming models

Available Software

Loaded by default:

   -bash-4.1$ module list
   Currently Loaded Modulefiles:
     1) modules     3) torque/4.2.6   5) intel/14.0.0   7) usg-default-modules/1.1
     2) nsg/1.2.0   4) moab/7.2.6     6) impi/4.1.1

Modules available:

   -bash-4.1$ module avail
   <omit system software modules…>
   ---------------------- /usr/common/usg/Modules/modulefiles ----------------------
   advisor/4.300519                    impi/4.1.0              szip/host-2.1
   allineatools/4.2.1-36484(default)   impi/4.1.1(default)     szip/mic-2.1(default)
   fftw/3.3.4-host                     inspector/2013.304368   totalview/8T.12.0-1(default)
   fftw/3.3.4-mic(default)             intel/13.0.1            usg-default-modules/1.0
   hdf5/host-1.8.10-p1                 intel/13.1.2            usg-default-modules/1.1(default)
   hdf5/host-1.8.13                    intel/14.0.0(default)   vtune/2013.update16(default)
   hdf5/mic-1.8.10-p1                  intel/14.0.3            zlib/host-1.2.7
   hdf5/mic-1.8.13(default)            itac/8.1.3              zlib/host-1.2.8
   hdf5-parallel/host-1.8.10-p1        netcdf/host-4.1.3       zlib/mic-1.2.7
   hdf5-parallel/host-1.8.13           netcdf/host-4.3.2       zlib/mic-1.2.8(default)
   hdf5-parallel/mic-1.8.10-p1         netcdf/mic-4.1.3
   hdf5-parallel/mic-1.8.13(default)   netcdf/mic-4.3.2(default)

How to Compile on Babbage

•  Only the Intel compiler and Intel MPI are supported.
•  Compile on the login node "bint01" directly to build an executable to run on the host or on the MIC cards.
•  You can also compile on a host node. Do not "ssh bcxxxx" directly from "bint01"; instead, use "qsub -I -l nodes=1" to get a node allocated to you.
•  Use "ifort", "icc", or "icpc" to compile serial Fortran, C, or C++ codes.
•  Use "mpiifort", "mpiicc", or "mpiicpc" to compile parallel Fortran, C, or C++ MPI codes. (NOT mpif90, mpicc, or mpiCC)
•  Use the "-openmp" flag for OpenMP codes.
•  Use the "-mmic" flag to build an executable to run on the MIC cards.
•  Example:
   Build a binary for host:
   % mpiicc -openmp -o xthi.host xthi.c
   Build a binary for MIC:
   % mpiicc -mmic -openmp -o xthi.mic xthi.c

Spectrum of Programming Models

                   Host Only       Offload         Symmetric         Native
                                                   (Host and MIC)    (MIC only)
Xeon (Host)        Program foo     Program foo     Program foo       --
                     call bar()      call bar()      call bar()
                   End             End             End
Xeon Phi (MIC)     --              bar()           Program foo       Program foo
                                                     call bar()        call bar()
                                                   End               End

•  Knights Landing (KNL) will be in self-hosted mode, which eliminates the host and the need to communicate between host and MIC.
•  We encourage users to focus on optimizing in Native mode and to explore on-node scaling on a single KNC card.

How to Run on Host

   bint01% qsub -I -l nodes=2
   <wait for a session>
   % cd $PBS_O_WORKDIR
   % cat $PBS_NODEFILE
   bc1012
   bc1011
   % get_hostfile
   % cat hostfile.$PBS_JOBID
   bc1012-ib
   bc1011-ib
   % export OMP_NUM_THREADS=4
   % mpirun -n 2 -hostfile hostfile.$PBS_JOBID -ppn 1 ./xthi.host
   Hello from rank 0, thread 0, on bc1012. (core affinity = 0-15)
   Hello from rank 0, thread 2, on bc1012. (core affinity = 0-15)
   …
   Hello from rank 1, thread 3, on bc1011. (core affinity = 0-15)
   Hello from rank 1, thread 0, on bc1011. (core affinity = 0-15)

•  Useful for comparing performance with running natively on the MIC cards.

How to Run on MIC Cards Natively

   bint01% qsub -I -l nodes=2
   <wait for a session>
   % cd $PBS_O_WORKDIR
   % cat $PBS_NODEFILE
   bc1012
   bc1011
   % get_micfile
   % cat micfile.$PBS_JOBID
   bc1011-mic0
   bc1011-mic1
   bc1010-mic0
   bc1010-mic1
   % export OMP_NUM_THREADS=12
   % export KMP_AFFINITY=balanced
   % mpirun.mic -n 4 -hostfile micfile.$PBS_JOBID -ppn 1 ./xthi.mic | sort
   Hello from rank 0, thread 0, on bc1011-mic0. (core affinity = 1)
   Hello from rank 0, thread 1, on bc1011-mic0. (core affinity = 5)
   Hello from rank 0, thread 10, on bc1011-mic0. (core affinity = 41)
   Hello from rank 0, thread 11, on bc1011-mic0. (core affinity = 45)
   …
   Hello from rank 3, thread 6, on bc1010-mic1. (core affinity = 25)
   Hello from rank 3, thread 7, on bc1010-mic1. (core affinity = 29)
   Hello from rank 3, thread 8, on bc1010-mic1. (core affinity = 33)
   Hello from rank 3, thread 9, on bc1010-mic1. (core affinity = 37)

Thread Affinity: KMP_AFFINITY

•  none: default option on host
•  compact: default option on MIC. Bind threads as close to each other as possible.
•  scatter: bind threads as far apart as possible.
•  balanced: only available on MIC. Spread to each core first, then keep consecutive thread numbers close to each other on different HTs of the same core.
•  explicit: example: setenv KMP_AFFINITY "explicit,granularity=fine,proclist=[1:236:1]"
•  New env on coprocessors: KMP_PLACE_THREADS, for exact thread placement

Placement of 6 threads on a node with 3 cores and 4 hardware threads (HT) per core:

compact
   Node      Core 1               Core 2               Core 3
             HT1  HT2  HT3  HT4   HT1  HT2  HT3  HT4   HT1  HT2  HT3  HT4
   Thread    0    1    2    3     4    5

scatter
   Node      Core 1               Core 2               Core 3
             HT1  HT2  HT3  HT4   HT1  HT2  HT3  HT4   HT1  HT2  HT3  HT4
   Thread    0    3               1    4               2    5

balanced
   Node      Core 1               Core 2               Core 3
             HT1  HT2  HT3  HT4   HT1  HT2  HT3  HT4   HT1  HT2  HT3  HT4
   Thread    0    1               2    3               4    5
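The three placement policies can be expressed as a small mapping from thread number to (core, hardware thread). The following is a simplified model for illustration only, not the Intel OpenMP runtime's implementation: the names `affinity_t`, `slot_t`, and `place_thread` are hypothetical, indices are 0-based, and the way `balanced` spreads a remainder when threads do not divide evenly among cores is an assumption.

```c
#include <assert.h>

typedef enum { AFF_COMPACT, AFF_SCATTER, AFF_BALANCED } affinity_t;

typedef struct {
    int core; /* 0-based core index */
    int ht;   /* 0-based hardware-thread index within the core */
} slot_t;

/* Where does thread t (of nthreads) land on ncores cores with nht HTs each? */
slot_t place_thread(affinity_t mode, int t, int nthreads, int ncores, int nht)
{
    slot_t s = {0, 0};

    assert(ncores > 0 && nht > 0);
    assert(nthreads > 0 && nthreads <= ncores * nht);
    assert(t >= 0 && t < nthreads);

    switch (mode) {
    case AFF_COMPACT:
        /* Fill every HT of a core before moving to the next core. */
        s.core = t / nht;
        s.ht = t % nht;
        break;
    case AFF_SCATTER:
        /* Round-robin over cores; neighbours in thread number sit far apart. */
        s.core = t % ncores;
        s.ht = t / ncores;
        break;
    case AFF_BALANCED: {
        /* Even load per core, consecutive thread numbers on the same core. */
        int base = nthreads / ncores;  /* minimum threads per core */
        int extra = nthreads % ncores; /* first 'extra' cores take one more */
        int cut = extra * (base + 1);  /* threads living on those cores */
        if (t < cut) {
            s.core = t / (base + 1);
            s.ht = t % (base + 1);
        } else {
            s.core = extra + (t - cut) / base;
            s.ht = (t - cut) % base;
        }
        break;
    }
    }
    return s;
}
```

For the 3-core, 4-HT, 6-thread case shown in the tables, thread 4 lands on the second core under `compact` and `scatter` but on the third core under `balanced`, which is the difference the tables illustrate.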

MPI Process Affinity: I_MPI_PIN_DOMAIN

•  Map CPUs into non-overlapping domains
   –  1 MPI process per domain
   –  OpenMP threads pinned inside each domain
•  I_MPI_PIN_DOMAIN=<size>[:<layout>]
   <size> =
      omp        adjust to OMP_NUM_THREADS
      auto       #CPUs / #MPI procs
      <n>        a number
   <layout> =
      platform   according to BIOS numbering
      compact    close to each other
      scatter    far away from each other

3D Stencil Diffusion Algorithm

[Figure: 3D Stencil Diffusion Speedup. x-axis: base, omp, omp+vect, peel, tiled; y-axis: speedup (0 to 500); series: compact, scatter, balanced]

[Figure: 3D Stencil Diffusion on Host and MIC. x-axis: base, omp, omp+vect, peel, tiled; y-axis: MFlops/sec (0 to 120000); series: host, mic]

--  On host: use 16 threads, KMP_AFFINITY=scatter.
--  On MIC: tested with different numbers of threads (60, 120, 180, 236, 240), combined with various KMP_AFFINITY options.
--  The best speedup on MIC is obtained with 180 threads and scatter affinity.
--  Runs faster on host with the base option and the OpenMP-only option.
--  Faster on MIC when vectorization is introduced with OpenMP.
--  OpenMP and vectorization both play significant roles on MIC.
--  More advanced loop optimization techniques (loop peeling and tiling) can improve performance further.
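The slides do not show the stencil kernel itself. As a rough illustration of what the "omp+vect" variant combines, here is a generic 7-point diffusion sweep, written under the assumption that the benchmark has this common form; `diffuse_step`, `IDX`, and the coefficient names `cc` and `cn` are hypothetical. Threading is applied to the outer loops, and the unit-stride inner loop is left for the compiler to vectorize.

```c
#include <assert.h>
#include <stddef.h>

/* Linear index of point (x, y, z) in an nx * ny * nz grid, x fastest. */
#define IDX(x, y, z, nx, ny) \
    ((size_t)(z) * (nx) * (ny) + (size_t)(y) * (nx) + (x))

/*
 * One sweep of a 7-point stencil over the interior points:
 *   out = cc * centre + cn * (sum of the six face neighbours).
 * Boundary points of 'out' are not written.
 * 'restrict' tells the compiler that in and out do not overlap.
 */
void diffuse_step(const float *restrict in, float *restrict out,
                  int nx, int ny, int nz, float cc, float cn)
{
    size_t row, plane;

    assert(in && out && nx > 2 && ny > 2 && nz > 2);
    row = (size_t)nx;
    plane = (size_t)nx * (size_t)ny;

#ifdef _OPENMP
#pragma omp parallel for collapse(2)
#endif
    for (int z = 1; z < nz - 1; z++) {
        for (int y = 1; y < ny - 1; y++) {
            /* Unit-stride loop: the candidate for SIMD vectorization. */
            for (int x = 1; x < nx - 1; x++) {
                size_t c = IDX(x, y, z, nx, ny);
                out[c] = cc * in[c]
                       + cn * (in[c - 1] + in[c + 1]
                             + in[c - row] + in[c + row]
                             + in[c - plane] + in[c + plane]);
            }
        }
    }
}
```

Without an OpenMP compiler flag the pragma is skipped and the sweep runs serially, which corresponds to the "base" configuration rather than "omp".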

STREAM

[Figure: Stream Triad. x-axis: number of OpenMP threads (15, 30, 60, 120, 180, 240); y-axis: GB/sec (0 to 180); series: no-vec, basic, prefetch, cache-evict, streaming-stores, opt]

--  Best rate is 162 GB/sec with 60 OpenMP threads on 1 MIC card.
--  60% improvement with vectorization.
--  Software prefetch helps significantly (35% improvement).
--  Intel reports a best performance of 174 GB/sec on Xeon Phi 7100P, 61 cores.
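For reference, the Triad kernel in STREAM is `a[i] = b[i] + q * c[i]`, and the benchmark counts 24 bytes of memory traffic per iteration (two 8-byte loads and one 8-byte store). The sketch below shows the kernel and the rate calculation; the function names are illustrative, and timing is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* STREAM Triad: a[i] = b[i] + q * c[i] for arrays that do not overlap. */
void triad(double *restrict a, const double *restrict b,
           const double *restrict c, double q, size_t n)
{
    assert(a && b && c);
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}

/*
 * Sustained rate in GB/sec for one Triad pass over n elements that took
 * 'seconds': three arrays of 8-byte doubles are streamed per iteration.
 */
double triad_gbytes_per_sec(size_t n, double seconds)
{
    assert(seconds > 0.0);
    return 3.0 * sizeof(double) * (double)n / seconds / 1.0e9;
}
```

As a worked number, a pass over 64,000,000 elements that completes in 0.01 seconds corresponds to 153.6 GB/sec.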

Tuning Lessons Learned

•  Some code restructuring and algorithm modifications are needed to take advantage of the KNC architecture.
•  Some applications won't be able to fit into memory with pure MPI due to the small memory size on KNC cards.
•  It is essential to add OpenMP at as high a level as possible to exploit loop-level parallelism, and to make sure large, innermost, computationally intensive loops are vectorized.
•  Explore the scalability of the OpenMP implementation.
•  Try various MPI and OpenMP affinity options.
•  Special compiler options on KNC also help.
•  Memory alignment is important.
•  Optimizations targeted for KNC can help performance on other architectures: Xeon, KNL.

Summary

•  Performance on Babbage does not represent what performance will be on Cori.
•  Babbage can help you prepare for Cori regarding thread scalability (hybrid MPI/OpenMP implementation) and vectorization.
•  Contact [email protected] with questions.

Further Information

•  Babbage web page:
   –  https://www.nersc.gov/users/computational-systems/testbeds/babbage
•  Intel Xeon Phi Coprocessor Developer Zone:
   –  http://software.intel.com/mic-developer
•  Programming and Compiling for Intel MIC Architecture
   –  http://software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture
•  Optimizing Memory Bandwidth on Stream Triad
   –  http://software.intel.com/en-us/articles/optimizing-memory-bandwidth-on-stream-triad
•  Interoperability with OpenMP API
   –  http://software.intel.com/sites/products/documentation/hpc/ics/impi/41/win/Reference_Manual/Interoperability_with_OpenMP.htm
•  Intel Cluster Studio XE 2013
   –  http://software.intel.com/en-us/intel-cluster-studio-xe/
•  Intel Xeon Phi Coprocessor High-Performance Programming. Jim Jeffers and James Reinders. Published by Elsevier Inc., 2013.

Acknowledgement

•  Babbage system support team, especially Nick Cardo, for configuring the system and solving many mysteries and issues.
•  NERSC Application Readiness Team for testing, providing ideas, and reporting problems on the system.


Thank you.


Extra Slides


Babbage Compute Nodes

•  45 nodes: bc09xx, bc10xx, bc11xx, etc.
•  Host node: 2 Intel Xeon Sandy Bridge E5-2670 processors
   –  Each processor has 8 cores, with 2 hardware threads (HT not enabled), 2.6 GHz, peak performance 166.4 GFlops
   –  128 GB memory per node
   –  Memory bandwidth 51.2 GB/sec
   –  AVX 32-byte aligned: AVX on host, 256-bit SIMD
•  2 MIC cards (5110P, bc09xx-mic0, bc09xx-mic1), each with:
   –  60 native cores, connected by a high-speed bidirectional ring, clock speed 1053 MHz, Error Correcting Code (ECC) enabled
   –  4 hardware threads per core
   –  8 GB GDDR5 memory, effective speed 5 GT/s, peak memory bandwidth 320 GB/sec
   –  L1 cache per core: 32 KB 8-way associative data and instruction cache
   –  L2 cache per core: 512 KB 8-way associative inclusive cache with hardware prefetcher
   –  Peak performance 1011 GFlops
   –  Vector unit
      •  512-bit SIMD instructions
      •  32 512-bit registers, each holding 8 DP or 16 SP floating-point numbers…

Memory Alignment

•  Always align at 64-byte boundaries to ensure data can be loaded from memory to cache optimally
   –  20% performance penalty without memory alignment for DGEMM (matrix size 6000x6000)
•  Fortran: compile with the "-align array64byte" option to align all static array data to 64-byte memory address boundaries
•  C/C++: declare variables as
   –  static: float var[100] __attribute__((aligned(64)));
   –  dynamic: _mm_malloc(nbytes, 64), released with _mm_free()
•  More options with compiler directives
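A minimal sketch of both C cases follows. It keeps the `__attribute__((aligned(64)))` form for the static array (a GCC/Intel compiler extension) and, as a substitution, uses the C11 standard function `aligned_alloc` for the dynamic case instead of an Intel-specific allocator. The helper names `is_aligned`, `alloc_aligned_doubles`, and `static_table` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64 /* alignment in bytes recommended for KNC */

/* Static data: ask the compiler to place the array on a 64-byte boundary. */
static float table[100] __attribute__((aligned(CACHE_LINE)));

float *static_table(void)
{
    return table;
}

/* True when p is a multiple of 'alignment' bytes. */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment > 0);
    return ((uintptr_t)p % alignment) == 0;
}

/*
 * Dynamic data: C11 aligned_alloc expects the size to be a multiple of the
 * alignment, so round the request up to whole cache lines.
 * Returns NULL on failure; release the buffer with free().
 */
double *alloc_aligned_doubles(size_t n)
{
    size_t bytes, padded;

    assert(n > 0);
    bytes = n * sizeof(double);
    padded = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    return aligned_alloc(CACHE_LINE, padded);
}
```

A caller would typically check `is_aligned(buf, 64)` once after allocation and then pass the buffer to the compute kernels; memory obtained this way is released with `free()`.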

SIMD and Vectorization

•  Vectorization: the process of transforming a scalar instruction (SISD) into a vector instruction (SIMD)
•  To tell the compiler to ignore potential dependencies and vectorize anyway:
   –  Fortran directive: !DIR$ SIMD
   –  C/C++ directive: #pragma simd
•  Example: a, b, c are pointers; the compiler does not know they are independent

Not vectorized (possible aliasing between a, b, and c):
   for (i=0; i<n; i++)
      a[i] = b[i] + c[i];

Not vectorized (true dependence on the previous iteration):
   for (i=1; i<n; i++)
      a[i] = b[i] + a[i-1];

Vectorized:
   #pragma simd
   for (i=0; i<n; i++)
      a[i] = b[i] + c[i];
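An alternative to the directive is to state the independence in the function signature. The sketch below uses the C99 `restrict` qualifier for that purpose, which is a different technique from `#pragma simd`; the function names are illustrative. The second function shows the loop-carried dependence that neither approach can safely remove.

```c
#include <assert.h>
#include <stddef.h>

/*
 * 'restrict' promises the compiler that a, b, and c never overlap, so every
 * iteration is independent and the loop is a candidate for vectorization.
 */
void add_arrays(float *restrict a, const float *restrict b,
                const float *restrict c, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/*
 * Each iteration reads the element written by the previous one, so the
 * iterations must run in order. Forcing vectorization here would change
 * the result. a[0] is an input; the loop starts at index 1.
 */
void running_sum(float *a, const float *b, size_t n)
{
    assert(a && b);
    for (size_t i = 1; i < n; i++)
        a[i] = b[i] + a[i - 1];
}
```

Whether a given loop was actually vectorized is best confirmed from the compiler's vectorization report rather than assumed from the source.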

Wrapper Script for mpirun on MIC Cards

•  In all modulefiles:
   –  Set $LD_LIBRARY_PATH for libraries needed on host
   –  Set $MIC_LD_LIBRARY_PATH for libraries needed on MIC card
•  % cat mpirun.mic
   #!/bin/sh
   mpirun -env LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH $@
•  Sample execution line:
   % mpirun.mic -n 4 -hostfile micfile.$PBS_JOBID -ppn 2 ./xthi.mic

-DMIC Trick for Configure on MIC Card

•  Sometimes when installing software libraries on MIC cards, a test program needs to be run. Because the build is cross-compiled, the test program will fail.
•  The trick is to define "-DMIC" in the compiler settings such as CC, CXX, FC, etc. used by "configure":
   export CC="icc -DMIC", …
•  Replace all "-DMIC" in the Makefiles with "-mmic", then compile and build.
   files=$(find ./* -name Makefile)
   perl -p -i -e 's/-DMIC/-mmic/g' $files

Sample Batch Script

   #PBS -q regular
   #PBS -l nodes=2:mics=2
   #PBS -l walltime=02:00:00
   #PBS -V

   cd $PBS_O_WORKDIR
   export OMP_NUM_THREADS=60
   export KMP_AFFINITY=balanced
   mpirun.mic -n 4 -hostfile $PBS_MICFILE -ppn 1 ./myexe.mic

   (# Sometimes the full path to the executable is needed; otherwise you may see a "no such file or directory" error.)
   -------------------------------------
   # Can use a custom host list with the -host option:
   % mpirun.mic -n 4 -host bc1013-mic0,bc1012-mic1 -ppn 2 ./myexe.mic
   # Can pass env with the -env option:
   % mpirun.mic -n 16 -host bc1011-mic1 -env OMP_NUM_THREADS 2 -env KMP_AFFINITY balanced ./myexe.mic

Thread Affinity: KMP_PLACE_THREADS

•  New setting on coprocessors only. In addition to KMP_AFFINITY, it can set exact but still generic thread placement.
•  KMP_PLACE_THREADS=<n>Cx<m>T,<o>O
   –  <n> Cores times <m> Threads with <o> of cores Offset
   –  e.g. 40Cx3T,1O means using 40 cores, and 3 threads (HT2,3,4) per core
•  OS runs on logical proc 0, which lives on physical core 59
   –  OS procs on core 59: 0, 237, 238, 239.
   –  Avoid using proc 0, i.e., use max_threads=236 on Babbage.
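The thread counts implied by these settings are simple products. The sketch below parses a string of the form shown above and computes them; it is an illustration, not the OpenMP runtime's parser. The names `place_t`, `parse_place_threads`, `total_threads`, and `max_user_threads` are hypothetical, and treating the offset as optional is an assumption of this sketch.

```c
#include <assert.h>
#include <stdio.h>

typedef struct {
    int cores;            /* <n>: number of cores to use */
    int threads_per_core; /* <m>: threads on each of those cores */
    int offset;           /* <o>: offset value, stored without interpretation */
} place_t;

/* Parse "<n>Cx<m>T,<o>O". Returns 1 on success, 0 on malformed input. */
int parse_place_threads(const char *s, place_t *out)
{
    int n = 0, m = 0, o = 0;
    int fields;

    assert(s && out);
    fields = sscanf(s, "%dCx%dT,%dO", &n, &m, &o);
    if (fields < 2 || n <= 0 || m <= 0 || o < 0)
        return 0;
    out->cores = n;
    out->threads_per_core = m;
    out->offset = (fields == 3) ? o : 0;
    return 1;
}

/* Number of OpenMP threads the setting provides room for. */
int total_threads(const place_t *p)
{
    assert(p);
    return p->cores * p->threads_per_core;
}

/* Largest thread count that leaves 'reserved_cores' whole cores to the OS. */
int max_user_threads(int cores, int ht_per_core, int reserved_cores)
{
    assert(ht_per_core > 0 && reserved_cores >= 0 && cores > reserved_cores);
    return (cores - reserved_cores) * ht_per_core;
}
```

With the example value `40Cx3T,1O` this yields 120 threads, and `max_user_threads(60, 4, 1)` reproduces the 236-thread limit recommended above for a 60-core card with the OS on one core.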


Synthetic Benchmark Summary (Intel MKL) (5110P)


STREAM Compiler Options

•  -no-vec:
   –  -O3 -mmic -openmp -no-vec -DSTREAM_ARRAY_SIZE=64000000
•  Base:
   –  -O3 -mmic -openmp -DSTREAM_ARRAY_SIZE=64000000
•  Base plus -opt-prefetch-distance=64,8
   –  Software prefetch 64 cache lines ahead for L2 cache
   –  Software prefetch 8 cache lines ahead for L1 cache
•  Base plus -opt-streaming-cache-evict=0
   –  Turn off all cache line evicts
•  Base plus -opt-streaming-stores always
   –  Enable generation of streaming stores under the assumption that the application is memory bound
•  Opt:
   –  Use all above flags (except -no-vec)
•  Huge pages no longer need to be requested, since they are now the default in MPSS

WRF (Weather Research and Forecasting Model)

•  novec: -O3 -novec -w -ftz -fno-alias -FR -convert big_endian -openmp -fpp -auto
•  base: -mmic -O3 -w -openmp -FR -convert big_endian -align array64byte -vec-report6
•  precision: -fimf-precision=low -fimf-domain-exclusion=15 -fp-model fast=1 -no-prec-div -no-prec-sqrt
•  MIC: -opt-assume-safe-padding -opt-streaming-stores always -opt-streaming-cache-evict=0 -mP2OPT_hlo_pref_use_outer_strategy=F

[Figure: WRF em_real on MIC. x-axis: number of OpenMP threads (15, 30, 60, 120, 180, 240); y-axis: time in sec (0 to 250); series: novec, base, base+MIC, base+precision, all]

•  Best time is 39.26 sec with 180 OpenMP threads.
•  15.1% improvement with all optimization flags compared with base options.

Steps to Optimize BerkeleyGW

[Figure: run time by code revision; lower is better]

1.  Refactor to create a hierarchical set of loops to be parallelized via MPI, OpenMP, and vectorization, and to improve memory locality.
2.  Add OpenMP at as high a level as possible.
3.  Make sure large, innermost, flop-intensive loops are vectorized.

After optimization, 4 early Intel Xeon Phi cards with MPI/OpenMP are ~1.5X faster than 32 cores of Intel Sandy Bridge on the test problem.*

* Eliminate spurious logic, some code restructuring and simplification, and other optimization.

Courtesy of Jack Deslippe