Whitepaper
NVIDIA's Next Generation CUDA(TM) Compute Architecture: Kepler(TM) GK110
The Fastest, Most Efficient HPC Architecture Ever Built
V1.0

NVIDIA Kepler GK110 Architecture Whitepaper

Dec 18, 2015


  • Table of Contents

    Kepler GK110 - The Next Generation GPU Computing Architecture
    Kepler GK110 - Extreme Performance, Extreme Efficiency
        Dynamic Parallelism
        Hyper-Q
        Grid Management Unit
        NVIDIA GPUDirect
    An Overview of the GK110 Kepler Architecture
        Performance per Watt
        Streaming Multiprocessor (SMX) Architecture
            SMX Processing Core Architecture
            Quad Warp Scheduler
            New ISA Encoding: 255 Registers per Thread
            Shuffle Instruction
            Atomic Operations
            Texture Improvements
        Kepler Memory Subsystem - L1, L2, ECC
            64KB Configurable Shared Memory and L1 Cache
            48KB Read-Only Data Cache
            Improved L2 Cache
            Memory Protection Support
        Dynamic Parallelism
        Hyper-Q
        Grid Management Unit - Efficiently Keeping the GPU Utilized
        NVIDIA GPUDirect
    Conclusion
    Appendix A - Quick Refresher on CUDA
        CUDA Hardware Execution

  • Kepler GK110 - The Next Generation GPU Computing Architecture

    As the demand for high performance parallel computing increases across many areas of science, medicine, engineering, and finance, NVIDIA continues to innovate and meet that demand with extraordinarily powerful GPU computing architectures. NVIDIA's existing Fermi GPUs have already redefined and accelerated High Performance Computing (HPC) capabilities in areas such as seismic processing, biochemistry simulations, weather and climate modeling, signal processing, computational finance, computer aided engineering, computational fluid dynamics, and data analysis.

    NVIDIA's new Kepler GK110 GPU raises the parallel computing bar considerably and will help solve the world's most difficult computing problems. By offering much higher processing power than the prior GPU generation and by providing new methods to optimize and increase parallel workload execution on the GPU, Kepler GK110 simplifies creation of parallel programs and will further revolutionize high performance computing.

  • Kepler GK110 - Extreme Performance, Extreme Efficiency

    Comprising 7.1 billion transistors, Kepler GK110 is not only the fastest, but also the most architecturally complex microprocessor ever built. Adding many new innovative features focused on compute performance, GK110 was designed to be a parallel processing powerhouse for Tesla and the HPC market. Kepler GK110 will provide over 1 TFlop of double precision throughput with greater than 80% DGEMM efficiency versus 60-65% on the prior Fermi architecture. In addition to greatly improved performance, the Kepler architecture offers a huge leap forward in power efficiency, delivering up to 3x the performance per watt of Fermi.

    Kepler GK110 Die Photo

  • The following new features in Kepler GK110 enable increased GPU utilization, simplify parallel program design, and aid in the deployment of GPUs across the spectrum of compute environments ranging from personal workstations to supercomputers:

    Dynamic Parallelism adds the capability for the GPU to generate new work for itself, synchronize on results, and control the scheduling of that work via dedicated, accelerated hardware paths, all without involving the CPU. By providing the flexibility to adapt to the amount and form of parallelism through the course of a program's execution, programmers can expose more varied kinds of parallel work and make the most efficient use of the GPU as a computation evolves. This capability allows less structured, more complex tasks to run easily and effectively, enabling larger portions of an application to run entirely on the GPU. In addition, programs are easier to create, and the CPU is freed for other tasks.
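As a rough illustration of the launch pattern Dynamic Parallelism enables, the sketch below (kernel names and sizes are invented for illustration) shows a parent kernel launching a child grid directly from the device and waiting on its results, with no CPU round-trip. It assumes CUDA 5.0-era tooling and compute capability 3.5, compiled with relocatable device code (`nvcc -arch=sm_35 -rdc=true`):

```cuda
#include <cuda_runtime.h>

// Hypothetical child kernel: one thread per element.
__global__ void childKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;                    // placeholder work
}

// Parent kernel: decides, on the GPU, how much child work to launch.
__global__ void parentKernel(float *data, int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        // Device-side launch -- no CPU involvement.
        childKernel<<<blocks, threads>>>(data, n);
        // Device-side synchronization on the child grid's results
        // (the CUDA 5.x idiom for waiting on child work).
        cudaDeviceSynchronize();
    }
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    parentKernel<<<1, 1>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

In a real application the parent would typically inspect intermediate results before choosing the child grid's dimensions; that data-dependent sizing is the point of the feature.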

    Hyper-Q - Hyper-Q enables multiple CPU cores to launch work on a single GPU simultaneously, thereby dramatically increasing GPU utilization and significantly reducing CPU idle times. Hyper-Q increases the total number of connections (work queues) between the host and the GK110 GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi). Hyper-Q is a flexible solution that allows separate connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting achieved GPU utilization, can see a dramatic performance increase without changing any existing code.
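The "without changing any existing code" point is worth illustrating: ordinary multi-stream host code such as the sketch below (the kernel and counts are invented) serializes through Fermi's single work queue, but runs concurrently on GK110 because each stream gets its own hardware queue:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: a small launch that occupies only part of the GPU.
__global__ void busyKernel(int iters)
{
    volatile float x = (float)threadIdx.x;
    for (int i = 0; i < iters; ++i)
        x = x * 0.999f + 1.0f;              // busy work
}

int main()
{
    const int nStreams = 32;                // GK110 offers 32 hardware work queues
    cudaStream_t streams[nStreams];
    for (int i = 0; i < nStreams; ++i)
        cudaStreamCreate(&streams[i]);

    // Each launch is too small to fill the machine by itself; with
    // Hyper-Q the 32 streams can execute concurrently and fill it.
    for (int i = 0; i < nStreams; ++i)
        busyKernel<<<1, 64, 0, streams[i]>>>(1 << 16);

    cudaDeviceSynchronize();
    for (int i = 0; i < nStreams; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}
```

On Fermi-class hardware this same source compiles and runs, but the streams are multiplexed into one queue; the speedup on GK110 comes entirely from the hardware.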

    Grid Management Unit - Enabling Dynamic Parallelism requires an advanced, flexible grid management and dispatch control system. The new GK110 Grid Management Unit (GMU) manages and prioritizes grids to be executed on the GPU. The GMU can pause the dispatch of new grids and queue pending and suspended grids until they are ready to execute, providing the flexibility to enable powerful runtimes, such as Dynamic Parallelism. The GMU ensures both CPU- and GPU-generated workloads are properly managed and dispatched.

    NVIDIA GPUDirect - NVIDIA GPUDirect is a capability that enables GPUs within a single computer, or GPUs in different servers located across a network, to directly exchange data without needing to go to CPU/system memory. The RDMA feature in GPUDirect allows third-party devices such as SSDs, NICs, and IB adapters to directly access memory on multiple GPUs within the same system, significantly decreasing the latency of MPI send and receive messages to/from GPU memory. It also reduces demands on system memory bandwidth and frees the GPU DMA engines for use by other CUDA tasks. Kepler GK110 also supports other GPUDirect features including Peer-to-Peer and GPUDirect for Video.

  • An Overview of the GK110 Kepler Architecture

    Kepler GK110 was built first and foremost for Tesla, and its goal was to be the highest performing parallel computing microprocessor in the world. GK110 not only greatly exceeds the raw compute horsepower delivered by Fermi, but it does so efficiently, consuming significantly less power and generating much less heat output.

    A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers. Different products will use different configurations of GK110. For example, some products may deploy 13 or 14 SMXs.

    Key features of the architecture that will be discussed below in more depth include:

    - The new SMX processor architecture
    - An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation
    - Hardware support throughout the design to enable new programming model capabilities

    Kepler GK110 Full chip block diagram

  • Kepler GK110 supports the new CUDA Compute Capability 3.5. (For a brief overview of CUDA see Appendix A - Quick Refresher on CUDA.) The following table compares parameters of different Compute Capabilities for Fermi and Kepler GPU architectures:

                                                  FERMI    FERMI    KEPLER   KEPLER
                                                  GF100    GF104    GK104    GK110
    Compute Capability                            2.0      2.1      3.0      3.5
    Threads / Warp                                32       32       32       32
    Max Warps / Multiprocessor                    48       48       64       64
    Max Threads / Multiprocessor                  1536     1536     2048     2048
    Max Thread Blocks / Multiprocessor            8        8        16       16
    32-bit Registers / Multiprocessor             32768    32768    65536    65536
    Max Registers / Thread                        63       63       63       255
    Max Threads / Thread Block                    1024     1024     1024     1024
    Shared Memory Size Configurations (bytes)     16K      16K      16K      16K
                                                  48K      48K      32K      32K
                                                                    48K      48K
    Max X Grid Dimension                          2^16-1   2^16-1   2^32-1   2^32-1
    Hyper-Q                                       No       No       No       Yes
    Dynamic Parallelism                           No       No       No       Yes

    Compute Capability of Fermi and Kepler GPUs

    Performance per Watt

    A principal design goal for the Kepler architecture was improving power efficiency. When designing Kepler, NVIDIA engineers applied everything learned from Fermi to better optimize the Kepler architecture for highly efficient operation. TSMC's 28nm manufacturing process plays an important role in lowering power consumption, but many GPU architecture modifications were required to further reduce power consumption while maintaining great performance. Every hardware unit in Kepler was designed and scrubbed to provide outstanding performance per watt. The best example of great perf/watt is seen in the design of Kepler GK110's new Streaming Multiprocessor (SMX), which is similar in many respects to the SMX unit recently introduced in Kepler GK104, but includes substantially more double precision units for compute algorithms.

  • Streaming Multiprocessor (SMX) Architecture

    Kepler GK110's new SMX introduces several architectural innovations that make it not only the most powerful multiprocessor we've built, but also the most programmable and power-efficient.

    SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).

  • SMX Processing Core Architecture

    Each of the Kepler GK110 SMX units features 192 single-precision CUDA cores, and each core has fully pipelined floating-point and integer arithmetic logic units. Kepler retains the full IEEE 754-2008 compliant single- and double-precision arithmetic introduced in Fermi, including the fused multiply-add (FMA) operation. One of the design goals for the Kepler GK110 SMX was to significantly increase the GPU's delivered double precision performance, since double precision arithmetic is at the heart of many HPC applications. Kepler GK110's SMX also retains the special function units (SFUs) for fast approximate transcendental operations as in previous-generation GPUs, providing 8x the number of SFUs of the Fermi GF110 SM.

    Similar to GK104 SMX units, the cores within the new GK110 SMX units use the primary GPU clock rather than the 2x shader clock. Recall the 2x shader clock was introduced in the G80 Tesla-architecture GPU and used in all subsequent Tesla- and Fermi-architecture GPUs. Running execution units at a higher clock rate allows a chip to achieve a given target throughput with fewer copies of the execution units, which is essentially an area optimization, but the clocking logic for the faster cores is more power-hungry. For Kepler, our priority was performance per watt. While we made many optimizations that benefitted both area and power, we chose to optimize for power even at the expense of some added area cost, with a larger number of processing cores running at the lower, less power-hungry GPU clock.

    Quad Warp Scheduler

    The SMX schedules threads in groups of 32 parallel threads called warps. Each SMX features four warp schedulers and eight instruction dispatch units, allowing four warps to be issued and executed concurrently. Kepler's quad warp scheduler selects four warps, and two independent instructions per warp can be dispatched each cycle. Unlike Fermi, which did not permit double precision instructions to be paired with other instructions, Kepler GK110 allows double precision instructions to be paired with other instructions.

  • Each Kepler SMX contains 4 Warp Schedulers, each with dual Instruction Dispatch Units. A single Warp Scheduler Unit is shown above.

    We also looked for opportunities to optimize the power in the SMX warp scheduler logic. For example, both Kepler and Fermi schedulers contain similar hardware units to handle the scheduling function, including:

    a) Register scoreboarding for long latency operations (texture and load)
    b) Inter-warp scheduling decisions (e.g., pick the best warp to go next among eligible candidates)
    c) Thread block level scheduling (e.g., the GigaThread engine)

    However, Fermi's scheduler also contains a complex hardware stage to prevent data hazards in the math datapath itself. A multi-port register scoreboard keeps track of any registers that are not yet ready with valid data, and a dependency checker block analyzes register usage across a multitude of fully decoded warp instructions against the scoreboard, to determine which are eligible to issue. For Kepler, we recognized that this information is deterministic (the math pipeline latencies are not variable), and therefore it is possible for the compiler to determine up front when instructions will be ready to issue, and provide this information in the instruction itself. This allowed us to replace several complex and power-expensive blocks with a simple hardware block that extracts the predetermined latency information and uses it to mask out warps from eligibility at the inter-warp scheduler stage.

  • New ISA Encoding: 255 Registers per Thread

    The number of registers that can be accessed by a thread has been quadrupled in GK110, allowing each thread access to up to 255 registers. Codes that exhibit high register pressure or spilling behavior in Fermi may see substantial speedups as a result of the increased available per-thread register count. A compelling example can be seen in the QUDA library for performing lattice QCD (quantum chromodynamics) calculations using CUDA. QUDA fp64-based algorithms see performance increases up to 5.3x due to the ability to use many more registers per thread and experiencing fewer spills to local memory.

    Shuffle Instruction

    To further improve performance, Kepler implements a new Shuffle instruction, which allows threads within a warp to share data. Previously, sharing data between threads within a warp required separate store and load operations to pass the data through shared memory. With the Shuffle instruction, threads within a warp can read values from other threads in the warp in just about any imaginable permutation. Shuffle supports arbitrary indexed references - i.e., any thread reads from any other thread. Useful shuffle subsets, including next-thread (offset up or down by a fixed amount) and XOR "butterfly" style permutations among the threads in a warp, are also available as CUDA intrinsics.

    Shuffle offers a performance advantage over shared memory, in that a store-and-load operation is carried out in a single step. Shuffle also can reduce the amount of shared memory needed per thread block, since data exchanged at the warp level never needs to be placed in shared memory. In the case of FFT, which requires data sharing within a warp, a 6% performance gain can be seen just by using Shuffle.

    This example shows some of the variations possible using the new Shuffle instruction in Kepler.

    Atomic Operations

    Atomic memory operations are important in parallel programming, allowing concurrent threads to correctly perform read-modify-write operations on shared data structures. Atomic operations such as add, min, max, and compare-and-swap are atomic in the sense that the read, modify, and write operations are performed without interruption by other threads. Atomic memory operations are widely used for parallel sorting, reduction operations, and building data structures in parallel without locks that serialize thread execution.
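A warp-level reduction is a natural place where the Shuffle instruction and atomic operations described in this section combine. The sketch below (kernel names invented for illustration) sums an array using the Kepler-era `__shfl_down` intrinsic (later toolkits renamed it `__shfl_down_sync`) instead of shared memory, then issues one atomic per warp:

```cuda
// Warp sum via shuffle: each step halves the number of active lanes.
__inline__ __device__ float warpReduceSum(float val)
{
    // Offset shuffles exchange registers directly between lanes,
    // replacing a shared-memory store + __syncthreads() + load.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down(val, offset);
    return val;                         // lane 0 holds the warp's total
}

__global__ void sumKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;
    v = warpReduceSum(v);
    if ((threadIdx.x & 31) == 0)        // one atomicAdd per warp of 32
        atomicAdd(out, v);
}
```

Because the warp's partial sums never touch shared memory, this pattern frees that capacity for other data, exactly the trade-off described above.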

  • Throughput of global memory atomic operations on Kepler GK110 is substantially improved compared to the Fermi generation. Atomic operation throughput to a common global memory address is improved by 9x to one operation per clock. Atomic operation throughput to independent global addresses is also significantly accelerated, and logic to handle address conflicts has been made more efficient. Atomic operations can often be processed at rates similar to global load operations. This speed increase makes atomics fast enough to use frequently within kernel inner loops, eliminating the separate reduction passes that were previously required by some algorithms to consolidate results.

    Kepler GK110 also expands the native support for 64-bit atomic operations in global memory. In addition to atomicAdd, atomicCAS, and atomicExch (which were also supported by Fermi and Kepler GK104), GK110 supports the following:

    atomicMin
    atomicMax
    atomicAnd
    atomicOr
    atomicXor

    Other atomic operations which are not supported natively (for example, 64-bit floating point atomics) may be emulated using the compare-and-swap (CAS) instruction.

    Texture Improvements

    The GPU's dedicated hardware Texture units are a valuable resource for compute programs with a need to sample or filter image data. The texture throughput in Kepler is significantly increased compared to Fermi - each SMX unit contains 16 texture filtering units, a 4x increase vs. the Fermi GF110 SM.

    In addition, Kepler changes the way texture state is managed. In the Fermi generation, for the GPU to reference a texture, it had to be assigned a "slot" in a fixed-size binding table prior to grid launch. The number of slots in that table ultimately limits how many unique textures a program can read from at run time. Ultimately, a program was limited to accessing only 128 simultaneous textures in Fermi. With bindless textures in Kepler, the additional step of using slots isn't necessary: texture state is now saved as an object in memory and the hardware fetches these state objects on demand, making binding tables obsolete. This effectively eliminates any limits on the number of unique textures that can be referenced by a compute program. Instead, programs can map textures at any time and pass texture handles around as they would any other pointer.
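As noted above, atomics without native support, such as a 64-bit floating-point add, can be emulated with atomicCAS. The device function below is the well-known CAS-loop pattern from the CUDA programming guide, shown here as a sketch:

```cuda
// Emulate atomic add on a double using 64-bit atomicCAS.
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *addr = (unsigned long long int *)address;
    unsigned long long int old = *addr, assumed;
    do {
        assumed = old;
        // Reinterpret the current bits as a double, add, and attempt
        // to swap the result in; atomicCAS returns the prior value.
        old = atomicCAS(addr, assumed,
                        __double_as_longlong(val +
                            __longlong_as_double(assumed)));
    } while (assumed != old);           // retry if another thread won the race
    return __longlong_as_double(old);
}
```

Under contention every retry costs another round trip through the memory system, which is why the native 64-bit integer atomics listed above are preferable whenever the data type allows.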

  • Kepler Memory Subsystem - L1, L2, ECC

    Kepler's memory hierarchy is organized similarly to Fermi. The Kepler architecture supports a unified memory request path for loads and stores, with an L1 cache per SMX multiprocessor. Kepler GK110 also enables compiler-directed use of an additional new cache for read-only data, as described below.

    64KB Configurable Shared Memory and L1 Cache

    In the Kepler GK110 architecture, as in the previous generation Fermi architecture, each SMX has 64KB of on-chip memory that can be configured as 48KB of shared memory with 16KB of L1 cache, or as 16KB of shared memory with 48KB of L1 cache. Kepler now allows for additional flexibility in configuring the allocation of shared memory and L1 cache by permitting a 32KB/32KB split between shared memory and L1 cache. To support the increased throughput of each SMX unit, the shared memory bandwidth for 64b and larger load operations is also doubled compared to the Fermi SM, to 256B per core clock.

    48KB Read-Only Data Cache

    In addition to the L1 cache, Kepler introduces a 48KB cache for data that is known to be read-only for the duration of the function. In the Fermi generation, this cache was accessible only by the Texture unit. Expert programmers often found it advantageous to load data through this path explicitly by mapping their data as textures, but this approach had many limitations.

  • In Kepler, in addition to significantly increasing the capacity of this cache along with the texture horsepower increase, we decided to make the cache directly accessible to the SM for general load operations. Use of the read-only path is beneficial because it takes both load and working set footprint off of the Shared/L1 cache path. In addition, the Read-Only Data Cache's higher tag bandwidth supports full speed unaligned memory access patterns among other scenarios. Use of this path is managed automatically by the compiler - access to any variable or data structure that is known to be constant through programmer use of the C99-standard "const __restrict" keyword will be tagged by the compiler to be loaded through the Read-Only Data Cache.

    Improved L2 Cache

    The Kepler GK110 GPU features 1536KB of dedicated L2 cache memory, double the amount of L2 available in the Fermi architecture. The L2 cache is the primary point of data unification between the SMX units, servicing all load, store, and texture requests and providing efficient, high speed data sharing across the GPU. The L2 cache on Kepler offers up to 2x the bandwidth per clock available in Fermi. Algorithms for which data addresses are not known beforehand, such as physics solvers, ray tracing, and sparse matrix multiplication, especially benefit from the cache hierarchy. Filter and convolution kernels that require multiple SMs to read the same data also benefit.

    Memory Protection Support

    Like Fermi, Kepler's register files, shared memories, L1 cache, L2 cache, and DRAM memory are protected by a Single-Error Correct Double-Error Detect (SECDED) ECC code. In addition, the Read-Only Data Cache supports single-error correction through a parity check; in the event of a parity error, the cache unit automatically invalidates the failed line, forcing a read of the correct data from L2.

    ECC checkbit fetches from DRAM necessarily consume some amount of DRAM bandwidth, which results in a performance difference between ECC-enabled and ECC-disabled operation, especially on memory bandwidth-sensitive applications. Kepler GK110 implements several optimizations to ECC checkbit fetch handling based on Fermi experience. As a result, the ECC on-vs-off performance delta has been reduced by an average of 66%, as measured across our internal compute application test suite.
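In source code the read-only path hint looks like the minimal sketch below (the kernel and names are invented for illustration): qualifying the input pointer with `const` and `__restrict__` tells the compiler the data is read-only and non-aliased for the kernel's duration, making it eligible for the Read-Only Data Cache on compute capability 3.5:

```cuda
// 'in' is promised to be read-only and not to alias 'out', so the
// compiler may issue its loads through the 48KB Read-Only Data Cache,
// keeping the working set off the Shared/L1 path.
__global__ void scaleKernel(float *__restrict__ out,
                            const float *__restrict__ in,
                            float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = factor * in[i];
}
```

Without the qualifiers the compiler must assume `in` and `out` may alias and route both through the general load path, so the qualifiers cost nothing and can only help.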

    Dynamic Parallelism

    In a hybrid CPU-GPU system, enabling a larger amount of parallel code in an application to run efficiently and entirely within the GPU improves scalability and performance as GPUs increase in perf/watt. To accelerate these additional parallel portions of the application, GPUs must support more varied types of parallel workloads. Dynamic Parallelism is a new feature introduced with Kepler GK110 that allows the GPU to generate new work for itself, synchronize on results, and control the scheduling of that work via dedicated, accelerated hardware paths, all without involving the CPU.

  • Fermi was very good at processing large parallel data structures when the scale and parameters of the problem were known at kernel launch time. All work was launched from the host CPU, would run to completion, and return a result back to the CPU. The result would then be used as part of the final solution, or would be analyzed by the CPU, which would then send additional requests back to the GPU for additional processing.

    In Kepler GK110, any kernel can launch another kernel, and can create the necessary streams and events and manage the dependencies needed to process additional work without the need for host CPU interaction. This architectural innovation makes it easier for developers to create and optimize recursive and data-dependent execution patterns, and allows more of a program to be run directly on the GPU. The system CPU can then be freed up for additional tasks, or the system could be configured with a less powerful CPU to carry out the same workload.

    Dynamic Parallelism allows more parallel code in an application to be launched directly by the GPU onto itself (right side of image) rather than requiring CPU intervention (left side of image).

    Dynamic Parallelism allows more varieties of parallel algorithms to be implemented on the GPU, including nested loops with differing amounts of parallelism, parallel teams of serial control-task threads, or simple serial control code offloaded to the GPU in order to promote data locality with the parallel portion of the application. Because a kernel has the ability to launch additional workloads based on intermediate, on-GPU results, programmers can now intelligently load balance work to focus the bulk of their resources on the areas of the problem that either require the most processing power or are most relevant to the solution.

  • One example would be dynamically setting up a grid for a numerical simulation - typically grid cells are focused in regions of greatest change, requiring an expensive pre-processing pass through the data. Alternatively, a uniformly coarse grid could be used to prevent wasted GPU resources, or a uniformly fine grid could be used to ensure all the features are captured, but these options risk missing simulation features or over-spending compute resources on regions of less interest.

    With Dynamic Parallelism, the grid resolution can be determined dynamically at runtime in a data-dependent manner. Starting with a coarse grid, the simulation can "zoom in" on areas of interest while avoiding unnecessary calculation in areas with little change. Though this could be accomplished using a sequence of CPU-launched kernels, it would be far simpler to allow the GPU to refine the grid itself by analyzing the data and launching additional work as part of a single simulation kernel, eliminating interruption of the CPU and data transfers between the CPU and GPU.

    Image attribution: Charles Reid

    The above example illustrates the benefits of using a dynamically sized grid in a numerical simulation. To meet peak precision requirements, a fixed-resolution simulation must run at an excessively fine resolution across the entire simulation domain, whereas a multi-resolution grid applies the correct simulation resolution to each area based on local variation.

  • Hyper-Q

    One of the challenges in the past has been keeping the GPU supplied with an optimally scheduled load of work from multiple streams. The Fermi architecture supported 16-way concurrency of kernel launches from separate streams, but ultimately the streams were all multiplexed into the same hardware work queue. This allowed for false intra-stream dependencies, requiring dependent kernels within one stream to complete before additional kernels in a separate stream could be executed. While this could be alleviated to some extent through the use of a breadth-first launch order, as program complexity increases, this can become more and more difficult to manage efficiently.

    Kepler GK110 improves on this functionality with the new Hyper-Q feature. Hyper-Q increases the total number of connections (work queues) between the host and the CUDA Work Distributor (CWD) logic in the GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi). Hyper-Q is a flexible solution that allows connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting GPU utilization, can see up to a 32x performance increase without changing any existing code.

    Hyper-Q permits more simultaneous connections between CPU and GPU.

    Each CUDA stream is managed within its own hardware work queue, inter-stream dependencies are optimized, and operations in one stream will no longer block other streams, enabling streams to execute concurrently without needing to specifically tailor the launch order to eliminate possible false dependencies.

  • Hyper-Q offers significant benefits for use in MPI-based parallel computer systems. Legacy MPI-based algorithms were often created to run on multi-core CPU systems, with the amount of work assigned to each MPI process scaled accordingly. This can lead to a single MPI process having insufficient work to fully occupy the GPU. While it has always been possible for multiple MPI processes to share a GPU, these processes could become bottlenecked by false dependencies. Hyper-Q removes those false dependencies, dramatically increasing the efficiency of GPU sharing across MPI processes.

    Hyper-Q working with CUDA Streams: In the Fermi model shown on the left, only (C,P) & (R,X) can run concurrently due to intra-stream dependencies caused by the single hardware work queue. The Kepler Hyper-Q model allows all streams to run concurrently using separate work queues.

  • Grid Management Unit - Efficiently Keeping the GPU Utilized

    New features in Kepler GK110, such as the ability for CUDA kernels to launch work directly on the GPU with Dynamic Parallelism, required that the CPU-to-GPU workflow in Kepler offer increased functionality over the Fermi design. On Fermi, a grid of thread blocks would be launched by the CPU and would always run to completion, creating a simple unidirectional flow of work from the host to the SMs via the CUDA Work Distributor (CWD) unit. Kepler GK110 was designed to improve the CPU-to-GPU workflow by allowing the GPU to efficiently manage both CPU- and CUDA-created workloads.

    We discussed the ability of the Kepler GK110 GPU to allow kernels to launch work directly on the GPU, and it's important to understand the changes made in the Kepler GK110 architecture to facilitate these new functions. In Kepler, a grid can be launched from the CPU just as was the case with Fermi, however new grids can also be created programmatically by CUDA within the Kepler SMX unit. To manage both CUDA-created and host-originated grids, a new Grid Management Unit (GMU) was introduced in Kepler GK110. This control unit manages and prioritizes grids that are passed into the CWD to be sent to the SMX units for execution.

    The CWD in Kepler holds grids that are ready to dispatch, and it is able to dispatch 32 active grids, which is double the capacity of the Fermi CWD. The Kepler CWD communicates with the GMU via a bidirectional link that allows the GMU to pause the dispatch of new grids and to hold pending and suspended grids until needed. The GMU also has a direct connection to the Kepler SMX units to permit grids that launch additional work on the GPU via Dynamic Parallelism to send the new work back to the GMU to be prioritized and dispatched. If the kernel that dispatched the additional workload pauses, the GMU will hold it inactive until the dependent work has completed.

  • The redesigned Kepler HOST-to-GPU workflow shows the new Grid Management Unit, which allows it to manage the actively dispatching grids, pause dispatch, and hold pending and suspended grids.

    NVIDIA GPUDirect

    When working with a large amount of data, increasing the data throughput and reducing latency is vital to increasing compute performance. Kepler GK110 supports the RDMA feature in NVIDIA GPUDirect, which is designed to improve performance by allowing direct access to GPU memory by third-party devices such as IB adapters, NICs, and SSDs. When using CUDA 5.0, GPUDirect provides the following important features:

    - Direct memory access (DMA) between NIC and GPU without the need for CPU-side data buffering
    - Significantly improved MPI_Send / MPI_Recv efficiency between GPU and other nodes in a network
    - Eliminates CPU bandwidth and latency bottlenecks
    - Works with a variety of 3rd-party network, capture, and storage devices

  • Applications like reverse time migration (used in seismic imaging for oil & gas exploration) distribute the large imaging data across several GPUs. Hundreds of GPUs must collaborate to crunch the data, often communicating intermediate results. GPUDirect enables much higher aggregate bandwidth for this GPU-to-GPU communication scenario within a server and across servers with the P2P and RDMA features.

    Kepler GK110 also supports other GPUDirect features such as Peer-to-Peer and GPUDirect for Video.

    GPUDirect RDMA allows direct access to GPU memory from 3rd-party devices such as network adapters, which translates into direct transfers between GPUs across nodes as well.

    Conclusion

    With the launch of Fermi in 2010, NVIDIA ushered in a new era in the high performance computing (HPC) industry based on a hybrid computing model where CPUs and GPUs work together to solve computationally intensive workloads. Now, with the new Kepler GK110 GPU, NVIDIA again raises the bar for the HPC industry. Kepler GK110 was designed from the ground up to maximize computational performance and throughput computing with outstanding power efficiency. The architecture has many new innovations such as SMX, Dynamic Parallelism, and Hyper-Q that make hybrid computing dramatically faster, easier to program, and applicable to a broader set of applications. Kepler GK110 GPUs will be used in numerous systems ranging from workstations to supercomputers to address the most daunting challenges in HPC.

  • Appendix A - Quick Refresher on CUDA

    CUDA is a combination hardware/software platform that enables NVIDIA GPUs to execute programs written with C, C++, Fortran, and other languages. A CUDA program invokes parallel functions called kernels that execute across many parallel threads. The programmer or compiler organizes these threads into thread blocks and grids of thread blocks, as shown in Figure 1. Each thread within a thread block executes an instance of the kernel. Each thread also has thread and block IDs within its thread block and grid, a program counter, registers, per-thread private memory, inputs, and output results.

    A thread block is a set of concurrently executing threads that can cooperate among themselves through barrier synchronization and shared memory. A thread block has a block ID within its grid. A grid is an array of thread blocks that execute the same kernel, read inputs from global memory, write results to global memory, and synchronize between dependent kernel calls. In the CUDA parallel programming model, each thread has a per-thread private memory space used for register spills, function calls, and C automatic array variables. Each thread block has a per-block shared memory space used for inter-thread communication, data sharing, and result sharing in parallel algorithms. Grids of thread blocks share results in Global Memory space after kernel-wide global synchronization.

  • Figure 1: CUDA hierarchy of threads, blocks, and grids, with corresponding per-thread private, per-block shared, and per-application global memory spaces.
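The thread/block/grid hierarchy described above maps onto source code as in this minimal sketch (kernel and variable names are invented for illustration): each thread combines its block and thread IDs into a unique index and handles one element, while the launch configuration sizes the grid to cover the data:

```cuda
// One thread per element; the grid of blocks covers the whole array.
__global__ void addVectors(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global index
    if (i < n)                                      // guard the tail block
        c[i] = a[i] + b[i];
}

// Host-side launch: ceil(n/256) thread blocks of 256 threads each.
// addVectors<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```

The per-thread locals here (`i`) live in registers or per-thread private memory; a `__shared__` declaration inside the kernel would instead allocate from the per-block shared memory space described above.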

    CUDA Hardware Execution

    CUDA's hierarchy of threads maps to a hierarchy of processors on the GPU; a GPU executes one or more kernel grids; a streaming multiprocessor (SM on Fermi / SMX on Kepler) executes one or more thread blocks; and CUDA cores and other execution units in the SMX execute thread instructions. The SMX executes threads in groups of 32 threads called warps. While programmers can generally ignore warp execution for functional correctness and focus on programming individual scalar threads, they can greatly improve performance by having threads in a warp execute the same code path and access memory with nearby addresses.
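The "nearby addresses" advice can be made concrete with a sketch (kernel names invented): indexing by `blockIdx.x * blockDim.x + threadIdx.x` makes the 32 threads of a warp touch 32 adjacent words, which the hardware can service as a few wide transactions, while a strided index scatters the warp's accesses across many memory segments:

```cuda
// Coalesced: consecutive threads in a warp read consecutive addresses,
// so each warp's loads fall in a small number of contiguous segments.
__global__ void copyCoalesced(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];
}

// Strided: consecutive threads are 'stride' elements apart, so one
// warp's loads scatter across many segments and waste bandwidth.
__global__ void copyStrided(float *dst, const float *src, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        dst[i] = src[i];
}
```

Both kernels are functionally correct, which is the point of the paragraph above: warp behavior only shows up as a (often large) performance difference.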

  • Notice

    ALL INFORMATION PROVIDED IN THIS WHITEPAPER, INCLUDING COMMENTARY, OPINION, NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

    Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all information previously supplied. NVIDIA Corporation products are not authorized for use as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

    Trademarks

    NVIDIA, the NVIDIA logo, CUDA, FERMI, KEPLER and GeForce are trademarks or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

    Copyright (c) 2012 NVIDIA Corporation. All rights reserved.