Whitepaper
NVIDIA's Next Generation CUDA(TM) Compute Architecture: Kepler(TM) GK110
The Fastest, Most Efficient HPC Architecture Ever Built
V1.0

NVIDIA Kepler GK110 Architecture Whitepaper

Dec 18, 2015


  • Table of Contents

    Kepler GK110 - The Next Generation GPU Computing Architecture
    Kepler GK110 - Extreme Performance, Extreme Efficiency
        Dynamic Parallelism
        Hyper-Q
        Grid Management Unit
        NVIDIA GPUDirect
    An Overview of the GK110 Kepler Architecture
        Performance per Watt
        Streaming Multiprocessor (SMX) Architecture
            SMX Processing Core Architecture
            Quad Warp Scheduler
            New ISA Encoding: 255 Registers per Thread
            Shuffle Instruction
            Atomic Operations
            Texture Improvements
        Kepler Memory Subsystem - L1, L2, ECC
            64KB Configurable Shared Memory and L1 Cache
            48KB Read-Only Data Cache
            Improved L2 Cache
            Memory Protection Support
        Dynamic Parallelism
        Hyper-Q
        Grid Management Unit - Efficiently Keeping the GPU Utilized
        NVIDIA GPUDirect
    Conclusion
    Appendix A - Quick Refresher on CUDA
        CUDA Hardware Execution

  • Kepler GK110 - The Next Generation GPU Computing Architecture

    As the demand for high performance parallel computing increases across many areas of science, medicine, engineering, and finance, NVIDIA continues to innovate and meet that demand with extraordinarily powerful GPU computing architectures. NVIDIA's existing Fermi GPUs have already redefined and accelerated High Performance Computing (HPC) capabilities in areas such as seismic processing, biochemistry simulations, weather and climate modeling, signal processing, computational finance, computer aided engineering, computational fluid dynamics, and data analysis.

    NVIDIA's new Kepler GK110 GPU raises the parallel computing bar considerably and will help solve the world's most difficult computing problems. By offering much higher processing power than the prior GPU generation and by providing new methods to optimize and increase parallel workload execution on the GPU, Kepler GK110 simplifies creation of parallel programs and will further revolutionize high performance computing.

  • Kepler GK110 - Extreme Performance, Extreme Efficiency

    Comprising 7.1 billion transistors, Kepler GK110 is not only the fastest, but also the most architecturally complex microprocessor ever built. Adding many new innovative features focused on compute performance, GK110 was designed to be a parallel processing powerhouse for Tesla and the HPC market. Kepler GK110 will provide over 1 TFlop of double precision throughput with greater than 80% DGEMM efficiency versus 60-65% on the prior Fermi architecture. In addition to greatly improved performance, the Kepler architecture offers a huge leap forward in power efficiency, delivering up to 3x the performance per watt of Fermi.

    Kepler GK110 Die Photo

  • The following new features in Kepler GK110 enable increased GPU utilization, simplify parallel program design, and aid in the deployment of GPUs across the spectrum of compute environments ranging from personal workstations to supercomputers:

    Dynamic Parallelism adds the capability for the GPU to generate new work for itself, synchronize on results, and control the scheduling of that work via dedicated, accelerated hardware paths, all without involving the CPU. By providing the flexibility to adapt to the amount and form of parallelism through the course of a program's execution, programmers can expose more varied kinds of parallel work and make the most efficient use of the GPU as a computation evolves. This capability allows less structured, more complex tasks to run easily and effectively, enabling larger portions of an application to run entirely on the GPU. In addition, programs are easier to create, and the CPU is freed for other tasks.
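As a rough illustration of the launch pattern Dynamic Parallelism enables, the sketch below (kernel names and sizes are invented for illustration) shows a parent kernel launching a child grid directly from the device and waiting on its results, with no CPU round-trip. It assumes CUDA 5.0-era tooling and compute capability 3.5, compiled with relocatable device code (`nvcc -arch=sm_35 -rdc=true`):

```cuda
#include <cuda_runtime.h>

// Hypothetical child kernel: one thread per element.
__global__ void childKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;                    // placeholder work
}

// Parent kernel: decides, on the GPU, how much child work to launch.
__global__ void parentKernel(float *data, int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        // Device-side launch -- no CPU involvement.
        childKernel<<<blocks, threads>>>(data, n);
        // Device-side synchronization on the child grid's results
        // (the CUDA 5.x idiom for waiting on child work).
        cudaDeviceSynchronize();
    }
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    parentKernel<<<1, 1>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

In a real application the parent would typically inspect intermediate results before choosing the child grid's dimensions; that data-dependent sizing is the point of the feature.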

    Hyper-Q - Hyper-Q enables multiple CPU cores to launch work on a single GPU simultaneously, thereby dramatically increasing GPU utilization and significantly reducing CPU idle times. Hyper-Q increases the total number of connections (work queues) between the host and the GK110 GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi). Hyper-Q is a flexible solution that allows separate connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting achieved GPU utilization, can see a dramatic performance increase without changing any existing code.
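The "without changing any existing code" point is worth illustrating: ordinary multi-stream host code such as the sketch below (the kernel and counts are invented) serializes through Fermi's single work queue, but runs concurrently on GK110 because each stream gets its own hardware queue:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: a small launch that occupies only part of the GPU.
__global__ void busyKernel(int iters)
{
    volatile float x = (float)threadIdx.x;
    for (int i = 0; i < iters; ++i)
        x = x * 0.999f + 1.0f;              // busy work
}

int main()
{
    const int nStreams = 32;                // GK110 offers 32 hardware work queues
    cudaStream_t streams[nStreams];
    for (int i = 0; i < nStreams; ++i)
        cudaStreamCreate(&streams[i]);

    // Each launch is too small to fill the machine by itself; with
    // Hyper-Q the 32 streams can execute concurrently and fill it.
    for (int i = 0; i < nStreams; ++i)
        busyKernel<<<1, 64, 0, streams[i]>>>(1 << 16);

    cudaDeviceSynchronize();
    for (int i = 0; i < nStreams; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}
```

On Fermi-class hardware this same source compiles and runs, but the streams are multiplexed into one queue; the speedup on GK110 comes entirely from the hardware.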

    Grid Management Unit - Enabling Dynamic Parallelism requires an advanced, flexible grid management and dispatch control system. The new GK110 Grid Management Unit (GMU) manages and prioritizes grids to be executed on the GPU. The GMU can pause the dispatch of new grids and queue pending and suspended grids until they are ready to execute, providing the flexibility to enable powerful runtimes, such as Dynamic Parallelism. The GMU ensures both CPU- and GPU-generated workloads are properly managed and dispatched.

    NVIDIA GPUDirect - NVIDIA GPUDirect is a capability that enables GPUs within a single computer, or GPUs in different servers located across a network, to directly exchange data without needing to go to CPU/system memory. The RDMA feature in GPUDirect allows third-party devices such as SSDs, NICs, and IB adapters to directly access memory on multiple GPUs within the same system, significantly decreasing the latency of MPI send and receive messages to/from GPU memory. It also reduces demands on system memory bandwidth and frees the GPU DMA engines for use by other CUDA tasks. Kepler GK110 also supports other GPUDirect features including Peer-to-Peer and GPUDirect for Video.

  • An Overview of the GK110 Kepler Architecture

    Kepler GK110 was built first and foremost for Tesla, and its goal was to be the highest performing parallel computing microprocessor in the world. GK110 not only greatly exceeds the raw compute horsepower delivered by Fermi, but it does so efficiently, consuming significantly less power and generating much less heat output.

    A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers. Different products will use different configurations of GK110. For example, some products may deploy 13 or 14 SMXs.

    Key features of the architecture that will be discussed below in more depth include:

    - The new SMX processor architecture
    - An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation
    - Hardware support throughout the design to enable new programming model capabilities

    Kepler GK110 Full chip block diagram

  • Kepler GK110 supports the new CUDA Compute Capability 3.5. (For a brief overview of CUDA see Appendix A - Quick Refresher on CUDA.) The following table compares parameters of different Compute Capabilities for Fermi and Kepler GPU architectures:

                                                  FERMI    FERMI    KEPLER   KEPLER
                                                  GF100    GF104    GK104    GK110
    Compute Capability                            2.0      2.1      3.0      3.5
    Threads / Warp                                32       32       32       32
    Max Warps / Multiprocessor                    48       48       64       64
    Max Threads / Multiprocessor                  1536     1536     2048     2048
    Max Thread Blocks / Multiprocessor            8        8        16       16
    32-bit Registers / Multiprocessor             32768    32768    65536    65536
    Max Registers / Thread                        63       63       63       255
    Max Threads / Thread Block                    1024     1024     1024     1024
    Shared Memory Size Configurations (bytes)     16K      16K      16K      16K
                                                  48K      48K      32K      32K
                                                                    48K      48K
    Max X Grid Dimension                          2^16-1   2^16-1   2^32-1   2^32-1
    Hyper-Q                                       No       No       No       Yes
    Dynamic Parallelism                           No       No       No       Yes

    Compute Capability of Fermi and Kepler GPUs

    Performance per Watt

    A principal design goal for the Kepler architecture was improving power efficiency. When designing Kepler, NVIDIA engineers applied everything learned from Fermi to better optimize the Kepler architecture for highly efficient operation. TSMC's 28nm manufacturing process plays an important role in lowering power consumption, but many GPU architecture modifications were required to further reduce power consumption while maintaining great performance. Every hardware unit in Kepler was designed and scrubbed to provide outstanding performance per watt. The best example of great perf/watt is seen in the design of Kepler GK110's new Streaming Multiprocessor (SMX), which is similar in many respects to the SMX unit recently introduced in Kepler GK104, but includes substantially more double precision units for compute algorithms.

  • Streaming Multiprocessor (SMX) Architecture

    Kepler GK110's new SMX introduces several architectural innovations that make it not only the most powerful multiprocessor we've built, but also the most programmable and power-efficient.

    SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).

  • SMX Processing Core Architecture

    Each of the Kepler GK110 SMX units features 192 single-precision CUDA cores, and each core has fully pipelined floating-point and integer arithmetic logic units. Kepler retains the full IEEE 754-2008 compliant single- and double-precision arithmetic introduced in Fermi, including the fused multiply-add (FMA) operation. One of the design goals for the Kepler GK110 SMX was to significantly increase the GPU's delivered double precision performance, since double precision arithmetic is at the heart of many HPC applications. Kepler GK110's SMX also retains the special function units (SFUs) for fast approximate transcendental operations as in previous-generation GPUs, providing 8x the number of SFUs of the Fermi GF110 SM.

    Similar to GK104 SMX units, the cores within the new GK110 SMX units use the primary GPU clock rather than the 2x shader clock. Recall the 2x shader clock was introduced in the G80 Tesla-architecture GPU and used in all subsequent Tesla- and Fermi-architecture GPUs. Running execution units at a higher clock rate allows a chip to achieve a given target throughput with fewer copies of the execution units, which is essentially an area optimization, but the clocking logic for the faster cores is more power-hungry. For Kepler, our priority was performance per watt. While we made many optimizations that benefitted both area and power, we chose to optimize for power even at the expense of some added area cost, with a larger number of processing cores running at the lower, less power-hungry GPU clock.

    Quad Warp Scheduler

    The SMX schedules threads in groups of 32 parallel threads called warps. Each SMX features four warp schedulers and eight instruction dispatch units, allowing four warps to be issued and executed concurrently. Kepler's quad warp scheduler selects four warps, and two independent instructions per warp can be dispatched each cycle. Unlike Fermi, which did not permit double precision instructions to be paired with other instructions, Kepler GK110 allows double precision instructions to be paired with other instructions.

  • Each Kepler SMX contains 4 Warp Schedulers, each with dual Instruction Dispatch Units. A single Warp Scheduler Unit is shown above.

    We also looked for opportunities to optimize the power in the SMX warp scheduler logic. For example, both Kepler and Fermi schedulers contain similar hardware units to handle the scheduling function, including:

    a) Register scoreboarding for long latency operations (texture and load)
    b) Inter-warp scheduling decisions (e.g., pick the best warp to go next among eligible candidates)
    c) Thread block level scheduling (e.g., the GigaThread engine)

    However, Fermi's scheduler also contains a complex hardware stage to prevent data hazards in the math datapath itself. A multi-port register scoreboard keeps track of any registers that are not yet ready with valid data, and a dependency checker block analyzes register usage across a multitude of fully decoded warp instructions against the scoreboard, to determine which are eligible to issue. For Kepler, we recognized that this information is deterministic (the math pipeline latencies are not variable), and therefore it is possible for the compiler to determine up front when instructions will be ready to issue, and provide this information in the instruction itself. This allowed us to replace several complex and power-expensive blocks with a simple hardware block that extracts the predetermined latency information and uses it to mask out warps from eligibility at the inter-warp scheduler stage.

  • New ISA Encoding: 255 Registers per Thread

    The number of registers that can be accessed by a thread has been quadrupled in GK110, allowing each thread access to up to 255 registers. Codes that exhibit high register pressure or spilling behavior in Fermi may see substantial speedups as a result of the increased available per-thread register count. A compelling example can be seen in the QUDA library for performing lattice QCD (quantum chromodynamics) calculations using CUDA. QUDA fp64-based algorithms see performance increases up to 5.3x due to the ability to use many more registers per thread and experiencing fewer spills to local memory.

    Shuffle Instruction

    To further improve performance, Kepler implements a new Shuffle instruction, which allows threads within a warp to share data. Previously, sharing data between threads within a warp required separate store and load operations to pass the data through shared memory. With the Shuffle instruction, threads within a warp can read values from other threads in the warp in just about any imaginable permutation. Shuffle supports arbitrary indexed references - i.e., any thread reads from any other thread. Useful shuffle subsets, including next-thread (offset up or down by a fixed amount) and XOR "butterfly" style permutations among the threads in a warp, are also available as CUDA intrinsics.

    Shuffle offers a performance advantage over shared memory, in that a store-and-load operation is carried out in a single step. Shuffle also can reduce the amount of shared memory needed per thread block, since data exchanged at the warp level never needs to be placed in shared memory. In the case of FFT, which requires data sharing within a warp, a 6% performance gain can be seen just by using Shuffle.

    This example shows some of the variations possible using the new Shuffle instruction in Kepler.

    Atomic Operations

    Atomic memory operations are important in parallel programming, allowing concurrent threads to correctly perform read-modify-write operations on shared data structures. Atomic operations such as add, min, max, and compare-and-swap are atomic in the sense that the read, modify, and write operations are performed without interruption by other threads. Atomic memory operations are widely used for parallel sorting, reduction operations, and building data structures in parallel without locks that serialize thread execution.
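A warp-level reduction is a natural place where the Shuffle instruction and atomic operations described in this section combine. The sketch below (kernel names invented for illustration) sums an array using the Kepler-era `__shfl_down` intrinsic (later toolkits renamed it `__shfl_down_sync`) instead of shared memory, then issues one atomic per warp:

```cuda
// Warp sum via shuffle: each step halves the number of active lanes.
__inline__ __device__ float warpReduceSum(float val)
{
    // Offset shuffles exchange registers directly between lanes,
    // replacing a shared-memory store + __syncthreads() + load.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down(val, offset);
    return val;                         // lane 0 holds the warp's total
}

__global__ void sumKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;
    v = warpReduceSum(v);
    if ((threadIdx.x & 31) == 0)        // one atomicAdd per warp of 32
        atomicAdd(out, v);
}
```

Because the warp's partial sums never touch shared memory, this pattern frees that capacity for other data, exactly the trade-off described above.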

  • Throughput of global memory atomic operations on Kepler GK110 is substantially improved compared to the Fermi generation. Atomic operation throughput to a common global memory address is improved by 9x to one operation per clock. Atomic operation throughput to independent global addresses is also significantly accelerated, and logic to handle address conflicts has been made more efficient. Atomic operations can often be processed at rates similar to global load operations. This speed increase makes atomics fast enough to use frequently within kernel inner loops, eliminating the separate reduction passes that were previously required by some algorithms to consolidate results.

    Kepler GK110 also expands the native support for 64-bit atomic operations in global memory. In addition to atomicAdd, atomicCAS, and atomicExch (which were also supported by Fermi and Kepler GK104), GK110 supports the following:

    atomicMin
    atomicMax
    atomicAnd
    atomicOr
    atomicXor

    Other atomic operations which are not supported natively (for example, 64-bit floating point atomics) may be emulated using the compare-and-swap (CAS) instruction.

    Texture Improvements

    The GPU's dedicated hardware Texture units are a valuable resource for compute programs with a need to sample or filter image data. The texture throughput in Kepler is significantly increased compared to Fermi - each SMX unit contains 16 texture filtering units, a 4x increase vs. the Fermi GF110 SM.

    In addition, Kepler changes the way texture state is managed. In the Fermi generation, for the GPU to reference a texture, it had to be assigned a "slot" in a fixed-size binding table prior to grid launch. The number of slots in that table ultimately limits how many unique textures a program can read from at run time. Ultimately, a program was limited to accessing only 128 simultaneous textures in Fermi. With bindless textures in Kepler, the additional step of using slots isn't necessary: texture state is now saved as an object in memory and the hardware fetches these state objects on demand, making binding tables obsolete. This effectively eliminates any limits on the number of unique textures that can be referenced by a compute program. Instead, programs can map textures at any time and pass texture handles around as they would any other pointer.
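As noted above, atomics without native support, such as a 64-bit floating-point add, can be emulated with atomicCAS. The device function below is the well-known CAS-loop pattern from the CUDA programming guide, shown here as a sketch:

```cuda
// Emulate atomic add on a double using 64-bit atomicCAS.
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *addr = (unsigned long long int *)address;
    unsigned long long int old = *addr, assumed;
    do {
        assumed = old;
        // Reinterpret the current bits as a double, add, and attempt
        // to swap the result in; atomicCAS returns the prior value.
        old = atomicCAS(addr, assumed,
                        __double_as_longlong(val +
                            __longlong_as_double(assumed)));
    } while (assumed != old);           // retry if another thread won the race
    return __longlong_as_double(old);
}
```

Under contention every retry costs another round trip through the memory system, which is why the native 64-bit integer atomics listed above are preferable whenever the data type allows.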

  • Kepler Memory Subsystem - L1, L2, ECC

    Kepler's memory hierarchy is organized similarly to Fermi. The Kepler architecture supports a unified memory request path for loads and stores, with an L1 cache per SMX multiprocessor. Kepler GK110 also enables compiler-directed use of an additional new cache for read-only data, as described below.

    64KB Configurable Shared Memory and L1 Cache

    In the Kepler GK110 architecture, as in the previous generation Fermi architecture, each SMX has 64KB of on-chip memory that can be configured as 48KB of shared memory with 16KB of L1 cache, or as 16KB of shared memory with 48KB of L1 cache. Kepler now allows for additional flexibility in configuring the allocation of shared memory and L1 cache by permitting a 32KB/32KB split between shared memory and L1 cache. To support the increased throughput of each SMX unit, the shared memory bandwidth for 64b and larger load operations is also doubled compared to the Fermi SM, to 256B per core clock.

    48KB Read-Only Data Cache

    In addition to the L1 cache, Kepler introduces a 48KB cache for data that is known to be read-only for the duration of the function. In the Fermi generation, this cache was accessible only by the Texture unit. Expert programmers often found it advantageous to load data through this path explicitly by mapping their data as textures, but this approach had many limitations.

  • In Kepler, in addition to significantly increasing the capacity of this cache along with the texture horsepower increase, we decided to make the cache directly accessible to the SM for general load operations. Use of the read-only path is beneficial because it takes both load and working set footprint off of the Shared/L1 cache path. In addition, the Read-Only Data Cache's higher tag bandwidth supports full speed unaligned memory access patterns among other scenarios. Use of this path is managed automatically by the compiler - access to any variable or data structure that is known to be constant through programmer use of the C99-standard "const __restrict" keyword will be tagged by the compiler to be loaded through the Read-Only Data Cache.

    Improved L2 Cache

    The Kepler GK110 GPU features 1536KB of dedicated L2 cache memory, double the amount of L2 available in the Fermi architecture. The L2 cache is the primary point of data unification between the SMX units, servicing all load, store, and texture requests and providing efficient, high speed data sharing across the GPU. The L2 cache on Kepler offers up to 2x the bandwidth per clock available in Fermi. Algorithms for which data addresses are not known beforehand, such as physics solvers, ray tracing, and sparse matrix multiplication, especially benefit from the cache hierarchy. Filter and convolution kernels that require multiple SMs to read the same data also benefit.

    Memory Protection Support

    Like Fermi, Kepler's register files, shared memories, L1 cache, L2 cache, and DRAM memory are protected by a Single-Error Correct Double-Error Detect (SECDED) ECC code. In addition, the Read-Only Data Cache supports single-error correction through a parity check; in the event of a parity error, the cache unit automatically invalidates the failed line, forcing a read of the correct data from L2.

    ECC checkbit fetches from DRAM necessarily consume some amount of DRAM bandwidth, which results in a performance difference between ECC-enabled and ECC-disabled operation, especially on memory bandwidth-sensitive applications. Kepler GK110 implements several optimizations to ECC checkbit fetch handling based on Fermi experience. As a result, the ECC on-vs-off performance delta has been reduced by an average of 66%, as measured across our internal compute application test suite.
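In source code the read-only path hint looks like the minimal sketch below (the kernel and names are invented for illustration): qualifying the input pointer with `const` and `__restrict__` tells the compiler the data is read-only and non-aliased for the kernel's duration, making it eligible for the Read-Only Data Cache on compute capability 3.5:

```cuda
// 'in' is promised to be read-only and not to alias 'out', so the
// compiler may issue its loads through the 48KB Read-Only Data Cache,
// keeping the working set off the Shared/L1 path.
__global__ void scaleKernel(float *__restrict__ out,
                            const float *__restrict__ in,
                            float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = factor * in[i];
}
```

Without the qualifiers the compiler must assume `in` and `out` may alias and route both through the general load path, so the qualifiers cost nothing and can only help.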

    Dynamic Parallelism

    In a hybrid CPU-GPU system, enabling a larger amount of parallel code in an application to run efficiently and entirely within the GPU improves scalability and performance as GPUs increase in perf/watt. To accelerate these additional parallel portions of the application, GPUs must support more varied types of parallel workloads. Dynamic Parallelism is a new feature introduced with Kepler GK110 that allows the GPU to generate new work for itself, synchronize on results, and control the scheduling of that work via dedicated, accelerated hardware paths, all without involving the CPU.

  • Fermi was very good at processing large parallel data structures when the scale and parameters of the problem were known at kernel launch time. All work was launched from the host CPU, would run to completion, and return a result back to the CPU. The result would then be used as part of the final solution, or would be analyzed by the CPU, which would then send additional requests back to the GPU for additional processing.

    In Kepler GK110, any kernel can launch another kernel, and can create the necessary streams and events and manage the dependencies needed to process additional work without the need for host CPU interaction. This architectural innovation makes it easier for developers to create and optimize recursive and data-dependent execution patterns, and allows more of a program to be run directly on the GPU. The system CPU can then be freed up for additional tasks, or the system could be configured with a less powerful CPU to carry out the same workload.

    Dynamic Parallelism allows more parallel code in an application to be launched directly by the GPU onto itself (right side of image) rather than requiring CPU intervention (left side of image).

    Dynamic Parallelism allows more varieties of parallel algorithms to be implemented on the GPU, including nested loops with differing amounts of parallelism, parallel teams of serial control-task threads, or simple serial control code offloaded to the GPU in order to promote data locality with the parallel portion of the application. Because a kernel has the ability to launch additional workloads based on intermediate, on-GPU results, programmers can now intelligently load balance work to focus the bulk of their resources on the areas of the problem that either require the most processing power or are most relevant to the solution.

  • One example would be dynamically setting up a grid for a numerical simulation - typically grid cells are focused in regions of greatest change, requiring an expensive pre-processing pass through the data. Alternatively, a uniformly coarse grid could be used to prevent wasted GPU resources, or a uniformly fine grid could be used to ensure all the features are captured, but these options risk missing simulation features or over-spending compute resources on regions of less interest.

    With Dynamic Parallelism, the grid resolution can be determined dynamically at runtime in a data-dependent manner. Starting with a coarse grid, the simulation can "zoom in" on areas of interest while avoiding unnecessary calculation in areas with little change. Though this could be accomplished using a sequence of CPU-launched kernels, it would be far simpler to allow the GPU to refine the grid itself by analyzing the data and launching additional work as part of a single simulation kernel, eliminating interruption of the CPU and data transfers between the CPU and GPU.

    Image attribution: Charles Reid

    The above example illustrates the benefits of using a dynamically sized grid in a numerical simulation. To meet peak precision requirements, a fixed-resolution simulation must run at an excessively fine resolution across the entire simulation domain, whereas a multi-resolution grid applies the correct simulation resolution to each area based on local variation.

  • Hyper-Q

    One of the challenges in the past has been keeping the GPU supplied with an optimally scheduled load of work from multiple streams. The Fermi architecture supported 16-way concurrency of kernel launches from separate streams, but ultimately the streams were all multiplexed into the same hardware work queue. This allowed for false intra-stream dependencies, requiring dependent kernels within one stream to complete before additional kernels in a separate stream could be executed. While this could be alleviated to some extent through the use of a breadth-first launch order, as program complexity increases, this can become more and more difficult to manage efficiently.

    Kepler GK110 improves on this functionality with the new Hyper-Q feature. Hyper-Q increases the total number of connections (work queues) between the host and the CUDA Work Distributor (CWD) logic in the GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi). Hyper-Q is a flexible solution that allows connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting GPU utilization, can see up to a 32x performance increase without changing any existing code.

    Hyper-Q permits more simultaneous connections between CPU and GPU.

    Each CUDA stream is managed within its own hardware work queue, inter-stream dependencies are optimized, and operations in one stream will no longer block other streams, enabling streams to execute concurrently without needing to specifically tailor the launch order to eliminate possible false dependencies.

  • Hyper-Q offers significant benefits for use in MPI-based parallel computer systems. Legacy MPI-based algorithms were often created to run on multi-core CPU systems, with the amount of work assigned to each MPI process scaled accordingly. This can lead to a single MPI process having insufficient work to fully occupy the GPU. While it has always been possible for multiple MPI processes to share a GPU, these processes could become bottlenecked by false dependencies. Hyper-Q removes those false dependencies, dramatically increasing the efficiency of GPU sharing across MPI processes.

    Hyper-Q working with CUDA Streams: In the Fermi model shown on the left, only (C,P) & (R,X) can run concurrently due to intra-stream dependencies caused by the single hardware work queue. The Kepler Hyper-Q model allows all streams to run concurrently using separate work queues.

  • Grid Management Unit - Efficiently Keeping the GPU Utilized

    New features in Kepler GK110, such as the ability for CUDA kernels to launch work directly on the GPU with Dynamic Parallelism, required that the CPU-to-GPU workflow in Kepler offer increased functionality over the Fermi design. On Fermi, a grid of thread blocks would be launched by the CPU and would always run to completion, creating a simple unidirectional flow of work from the host to the SMs via the CUDA Work Distributor (CWD) unit. Kepler GK110 was designed to improve the CPU-to-GPU workflow by allowing the GPU to efficiently manage both CPU- and CUDA-created workloads.

    We discussed the ability of the Kepler GK110 GPU to allow kernels to launch work directly on the GPU, and it's important to understand the changes made in the Kepler GK110 architecture to facilitate these new functions. In Kepler, a grid can be launched from the CPU just as was the case with Fermi, however new grids can also be created programmatically by CUDA within the Kepler SMX unit. To manage both CUDA-created and host-originated grids, a new Grid Management Unit (GMU) was introduced in Kepler GK110. This control unit manages and prioritizes grids that are passed into the CWD to be sent to the SMX units for execution.

    The CWD in Kepler holds grids that are ready to dispatch, and it is able to dispatch 32 active grids, which is double the capacity of the Fermi CWD. The Kepler CWD communicates with the GMU via a bidirectional link that allows the GMU to pause the dispatch of new grids and to hold pending and suspended grids until needed. The GMU also has a direct connection to the Kepler SMX units to permit grids that launch additional work on the GPU via Dynamic Parallelism to send the new work back to the GMU to be prioritized and dispatched. If the kernel that dispatched the additional workload pauses, the GMU will hold it inactive until the dependent work has completed.

  • The redesigned Kepler HOST-to-GPU workflow shows the new Grid Management Unit, which allows it to manage the actively dispatching grids, pause dispatch, and hold pending and suspended grids.

    NVIDIA GPUDirect

    When working with a large amount of data, increasing the data throughput and reducing latency is vital to increasing compute performance. Kepler GK110 supports the RDMA feature in NVIDIA GPUDirect, which is designed to improve performance by allowing direct access to GPU memory by third-party devices such as IB adapters, NICs, and SSDs. When using CUDA 5.0, GPUDirect provides the following important features:

    - Direct memory access (DMA) between NIC and GPU without the need for CPU-side data buffering
    - Significantly improved MPI_Send / MPI_Recv efficiency between GPU and other nodes in a network
    - Eliminates CPU bandwidth and latency bottlenecks
    - Works with a variety of 3rd-party network, capture, and storage devices

  • Applications like reverse time migration (used in seismic imaging for oil & gas exploration) distribute the large imaging data across several GPUs. Hundreds of GPUs must collaborate to crunch the data, often communicating intermediate results. GPUDirect enables much higher aggregate bandwidth for this GPU-to-GPU communication scenario within a server and across servers with the P2P and RDMA features.

    Kepler GK110 also supports other GPUDirect features such as Peer-to-Peer and GPUDirect for Video.

    GPUDirect RDMA allows direct access to GPU memory from 3rd-party devices such as network adapters, which translates into direct transfers between GPUs across nodes as well.

    Conclusion

    With the launch of Fermi in 2010, NVIDIA ushered in a new era in the high performance computing (HPC) industry based on a hybrid computing model where CPUs and GPUs work together to solve computationally intensive workloads. Now, with the new Kepler GK110 GPU, NVIDIA again raises the bar for the HPC industry. Kepler GK110 was designed from the ground up to maximize computational performance and throughput computing with outstanding power efficiency. The architecture has many new innovations such as SMX, Dynamic Parallelism, and Hyper-Q that make hybrid computing dramatically faster, easier to program, and applicable to a broader set of applications. Kepler GK110 GPUs will be used in numerous systems ranging from workstations to supercomputers to address the most daunting challenges in HPC.

  • Appendix A - Quick Refresher on CUDA

    CUDA is a combination hardware/software platform that enables NVIDIA GPUs to execute programs written with C, C++, Fortran, and other languages. A CUDA program invokes parallel functions called kernels that execute across many parallel threads. The programmer or compiler organizes these threads into thread blocks and grids of thread blocks, as shown in Figure 1. Each thread within a thread block executes an instance of the kernel. Each thread also has thread and block IDs within its thread block and grid, a program counter, registers, per-thread private memory, inputs, and output results.

    A thread block is a set of concurrently executing threads that can cooperate among themselves through barrier synchronization and shared memory. A thread block has a block ID within its grid. A grid is an array of thread blocks that execute the same kernel, read inputs from global memory, write results to global memory, and synchronize between dependent kernel calls. In the CUDA parallel programming model, each thread has a per-thread private memory space used for register spills, function calls, and C automatic array variables. Each thread block has a per-block shared memory space used for inter-thread communication, data sharing, and result sharing in parallel algorithms. Grids of thread blocks share results in Global Memory space after kernel-wide global synchronization.

  • Figure 1: CUDA hierarchy of threads, blocks, and grids, with corresponding per-thread private, per-block shared, and per-application global memory spaces.
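The thread/block/grid hierarchy described above maps onto source code as in this minimal sketch (kernel and variable names are invented for illustration): each thread combines its block and thread IDs into a unique index and handles one element, while the launch configuration sizes the grid to cover the data:

```cuda
// One thread per element; the grid of blocks covers the whole array.
__global__ void addVectors(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global index
    if (i < n)                                      // guard the tail block
        c[i] = a[i] + b[i];
}

// Host-side launch: ceil(n/256) thread blocks of 256 threads each.
// addVectors<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```

The per-thread locals here (`i`) live in registers or per-thread private memory; a `__shared__` declaration inside the kernel would instead allocate from the per-block shared memory space described above.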

    CUDA Hardware Execution

    CUDA's hierarchy of threads maps to a hierarchy of processors on the GPU; a GPU executes one or more kernel grids; a streaming multiprocessor (SM on Fermi / SMX on Kepler) executes one or more thread blocks; and CUDA cores and other execution units in the SMX execute thread instructions. The SMX executes threads in groups of 32 threads called warps. While programmers can generally ignore warp execution for functional correctness and focus on programming individual scalar threads, they can greatly improve performance by having threads in a warp execute the same code path and access memory with nearby addresses.
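The "nearby addresses" advice can be made concrete with a sketch (kernel names invented): indexing by `blockIdx.x * blockDim.x + threadIdx.x` makes the 32 threads of a warp touch 32 adjacent words, which the hardware can service as a few wide transactions, while a strided index scatters the warp's accesses across many memory segments:

```cuda
// Coalesced: consecutive threads in a warp read consecutive addresses,
// so each warp's loads fall in a small number of contiguous segments.
__global__ void copyCoalesced(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];
}

// Strided: consecutive threads are 'stride' elements apart, so one
// warp's loads scatter across many segments and waste bandwidth.
__global__ void copyStrided(float *dst, const float *src, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        dst[i] = src[i];
}
```

Both kernels are functionally correct, which is the point of the paragraph above: warp behavior only shows up as a (often large) performance difference.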

  • Notice

    ALL INFORMATION PROVIDED IN THIS WHITEPAPER, INCLUDING COMMENTARY, OPINION, NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

    Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all information previously supplied. NVIDIA Corporation products are not authorized for use as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

    Trademarks

    NVIDIA, the NVIDIA logo, CUDA, FERMI, KEPLER and GeForce are trademarks or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

    Copyright (c) 2012 NVIDIA Corporation. All rights reserved.