Profiling & Tuning Applications CUDA Course István Reguly
Profiling & Tuning Applications - People · • CUDA 7.0 nvcc profiling_lecture.cu -O2 -arch=sm_35 -I. –lineinfo –DIT=0. ... inner_product Dot product of two sequences merge Merge

Aug 04, 2018

LamPhuc
Profiling & Tuning Applications

CUDA Course, István Reguly

Introduction

• Why is my application running slow?
  • Work it out on paper
  • Instrument code
  • Profile it
• NVIDIA Visual Profiler
  • Works with CUDA, needs some tweaks to work with OpenCL
• nvprof – command line tool, can be used with MPI applications

Identifying Performance Limiters

• CPU: setup, data movement
• GPU: bandwidth, compute or latency limited
• Number of instructions for every byte moved
  • Algorithmic analysis gives a good estimate
  • Actual code is likely different
    • Instructions for loop control, pointer math, etc.
    • Memory access patterns
• How to find out?
  • Use the profiler (quick, but approximate)
  • Use source code modification (takes more work)

Analysis with Source Code Modification

• Time memory-only and math-only versions
  • Not so easy for kernels with data-dependent control flow
  • Good to estimate time spent on accessing memory or executing instructions
• Shows whether kernel is memory or compute bound
• Put an “if” statement depending on kernel argument around math/mem instructions
• Use dynamic shared memory to get the same occupancy

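Analysis with Source Code Modification

Original kernel:

__global__ void kernel(float *a) {
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  float my_a;
  my_a = a[idx];
  for (int i = 0; i < 100; i++)
    my_a = sinf(my_a + i * 3.14f);
  a[idx] = my_a;
}

Instrumented kernel (bit 0 of prof enables the memory accesses, bit 1 enables the math):

__global__ void kernel(float *a, int prof) {
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  float my_a = 0.0f;  // defined value when the load is skipped
  if (prof & 1) my_a = a[idx];
  if (prof & 2)
    for (int i = 0; i < 100; i++)
      my_a = sinf(my_a + i * 3.14f);
  if (prof & 1) a[idx] = my_a;
}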
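The memory-only, math-only and full timings can be collected from the host with CUDA events. The following is a minimal sketch, not the lecture's own harness: the helper names `time_variant` and `report_variants`, the block size of 256 and the warm-up launch are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Instrumented kernel in the style of the slide: bit 0 of `prof` enables the
// global-memory accesses, bit 1 enables the math loop.
__global__ void kernel(float *a, int prof) {
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  float my_a = 0.0f;
  if (prof & 1) my_a = a[idx];
  if (prof & 2)
    for (int i = 0; i < 100; i++) my_a = sinf(my_a + i * 3.14f);
  if (prof & 1) a[idx] = my_a;
}

// Hypothetical helper: time (in ms) of one launch with the given mask.
// n is assumed to be a multiple of the block size (256).
float time_variant(float *d_a, int n, int prof) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  kernel<<<n / 256, 256>>>(d_a, prof);  // warm-up launch, not timed
  cudaEventRecord(start);
  kernel<<<n / 256, 256>>>(d_a, prof);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);  // wait until the timed launch has finished
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}

// Hypothetical driver: prints the three timings for an n-element array.
void report_variants(int n) {
  float *d_a = 0;
  cudaMalloc(&d_a, n * sizeof(float));
  cudaMemset(d_a, 0, n * sizeof(float));
  printf("mem-only : %.3f ms\n", time_variant(d_a, n, 1));
  printf("math-only: %.3f ms\n", time_variant(d_a, n, 2));
  printf("full     : %.3f ms\n", time_variant(d_a, n, 3));
  cudaFree(d_a);
}
```

Calling `report_variants(1 << 24)` from `main` prints the three times. If the full time is close to the larger of the other two, memory and math overlap well; if it approaches their sum, latency is exposed.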

Example scenarios

[Figure: run time of the mem-only, math-only and full kernel versions in four scenarios]

• Memory-bound: good overlap between mem and math. Latency is not a problem.
• Math-bound: good overlap between mem and math.
• Well balanced: good overlap between mem and math.
• Mem and latency bound: poor overlap, latency is a problem.

NVIDIA Visual Profiler

• Collects metrics and events during execution
• Calls to the CUDA API
• Overall application:
  • Memory transfers
  • Kernel launches
• Kernels
  • Occupancy
  • Computation efficiency
  • Memory bandwidth efficiency
• Source-level profiling
• Requires deterministic execution!

Meet the test setup

• 2D Gaussian blur with a 5x5 stencil
• 4096^2 grid

// filter: 5x5 weights, declared elsewhere (declaration not shown on the slide)
__global__ void stencil_v0(float *input, float *output,
                           int sizex, int sizey) {
  const int x = blockIdx.x*blockDim.x + threadIdx.x + 2;
  const int y = blockIdx.y*blockDim.y + threadIdx.y + 2;
  if ((x >= sizex-2) || (y >= sizey-2)) return;
  float accum = 0.0f;
  // 5x5 stencil: offsets -2..2 inclusive in both directions
  for (int i = -2; i <= 2; i++) {
    for (int j = -2; j <= 2; j++) {
      // row stride written as sizey; equal to sizex on the square test grid
      accum += filter[i+2][j+2]*input[sizey*(y+j) + (x+i)];
    }
  }
  output[sizey*y+x] = accum/273.0f;
}

Meet the test setup

• NVIDIA K40
  • GK110B
  • SM 3.5
  • ECC on
  • Graphics clocks at 745 MHz, memory clocks at 3004 MHz
• CUDA 7.0
  nvcc profiling_lecture.cu -O2 -arch=sm_35 -I. -lineinfo -DIT=0

Interactive demo of tuning process

Launch a profiling session

First look

[Screenshot: NVVP main window with the Timeline, Summary, Guide and Analysis results panes labelled]

The Timeline

[Screenshot: timeline view with the host-side API calls, MemCpy and Compute rows labelled]

Analysis

[Screenshot: the Guided and Unguided analysis modes]

Examine Individual Kernels

Lists all kernels sorted by total execution time: the higher the rank, the higher the impact of optimisation on overall performance.

Initial unoptimised (v0)   8.122 ms

Utilisation – Warp Issue Efficiency & Pipe Utilisation

Both below 60% -> Latency! Let’s investigate

Most of it is memory ops

Latency analysis

Memory throttle -> perform BW analysis

Memory Bandwidth analysis

L1 cache not used…

Investigate further… (Unguided)

Global memory load efficiency 53.3%
L2 hit rate 96.7%

6-8 transactions per access – something is wrong with how we access memory

Iteration 1 – turn on L1

Quick & easy step: turn on the L1 cache by using -Xptxas -dlcm=ca

Initial unoptimised (v0)   8.122 ms
Enable L1                  6.57 ms

Memory unit is utilized, but Global Load efficiency became even worse: 20.5%

Cache line utilization

[Figure: an 8 x 8 thread block laid over 32-byte transaction units]

• Unit of transaction: 32 bytes (8 floats)
• L1 cache disabled: -> 32B transactions
• Min 4, max 8 transactions

Cache line utilization

[Figure: an 8 x 8 thread block laid over 128-byte cache lines]

• Unit of transaction: 128 bytes (32 floats)
• L1 cache enabled: -> 128B transactions -> 4*32B to L2
• Min 16, max 32 transactions
• Each time a transaction requires more than one 128B cache line: re-issue

Cache line utilization

[Figure: a 32 x 2 thread block laid over 128-byte cache lines]

• Unit of transaction: 128 bytes (32 floats)
• L1 cache enabled: -> 128B transactions -> 4*32B to L2
• Min 4, max 8 transactions
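The minimum and maximum transaction counts on the cache line utilization slides can be reproduced with a small host-side model. This is a sketch under stated assumptions: the function name `segments_per_warp` is made up, the input array is assumed to start on a 128-byte boundary, the row pitch is 4096 floats, and a warp is taken to be the first 32 threads of a block numbered x-fastest. It only counts distinct segments; it does not model the hardware replay logic.

```cuda
#include <set>

// Host-side model: number of distinct memory segments of `seg_bytes` bytes
// touched when the first warp of block (bx, by) executes the stencil load
// input[pitch*(y+j) + (x+i)] for one stencil offset (i, j).
int segments_per_warp(int block_x, int block_y, int bx, int by,
                      int i, int j, int pitch, int seg_bytes) {
  std::set<long long> segments;
  for (int t = 0; t < 32; t++) {
    int tx = t % block_x;  // thread coordinates inside the block
    int ty = t / block_x;
    long long x = (long long)bx * block_x + tx + 2 + i;
    long long y = (long long)by * block_y + ty + 2 + j;
    long long byte_addr = (y * pitch + x) * (long long)sizeof(float);
    segments.insert(byte_addr / seg_bytes);
  }
  return (int)segments.size();
}
```

With 8x8 blocks and 32-byte units, `segments_per_warp(8, 8, 0, 0, -2, 0, 4096, 32)` gives 4 (aligned offset) and the same call with `i = 0` gives 8. With 128-byte lines, an 8x8 block touches 4 to 8 lines (16 to 32 32-byte transfers to L2), while a 32x2 block touches 1 or 2 lines (4 to 8 transfers), which matches the figures quoted on the slides.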

Iteration 2 – 32x2 blocks

Initial unoptimised (v0)   8.122 ms
Enable L1                  6.57 ms
Blocksize                  3.4 ms

• Memory utilization decreased 10%
• Performance almost doubles
• Global Load Efficiency 50.8%

Key takeaway

• Latency/Bandwidth bound
• Inefficient use of memory system and bandwidth
• Symptoms:
  • Lots of transactions per request (low load efficiency)
• Goal:
  • Use the whole cache line
  • Improve memory access patterns (coalescing)
• What to do:
  • Align data, change block size, change data layout
  • Use shared memory/shuffles to load efficiently

Latency analysis

Latency analysis

Latency analysis

Increase the block size so more warps can be active at the same time.

Kepler: max 16 blocks per SM, max 2048 threads per SM
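The effect of block size on occupancy follows from the two Kepler limits just quoted. The sketch below is host-side arithmetic only; the function name is hypothetical, and register and shared-memory limits are deliberately left out.

```cuda
#include <algorithm>

// Theoretical occupancy from the block-count and thread-count limits alone
// (defaults are the Kepler figures from the slide: 16 blocks, 2048 threads
// per SM). Register and shared-memory limits are ignored in this sketch.
double occupancy_from_block_size(int threads_per_block,
                                 int max_blocks_per_sm = 16,
                                 int max_threads_per_sm = 2048) {
  int blocks_by_threads = max_threads_per_sm / threads_per_block;
  int resident_blocks = std::min(max_blocks_per_sm, blocks_by_threads);
  return double(resident_blocks * threads_per_block) / max_threads_per_sm;
}
```

A 32x2 block has 64 threads, so 16 resident blocks give 1024 threads and `occupancy_from_block_size(64)` returns 0.5. A 32x4 block has 128 threads, 16 blocks fill all 2048 thread slots, and the function returns 1.0. For a real kernel, the runtime call `cudaOccupancyMaxActiveBlocksPerMultiprocessor` also accounts for registers and shared memory.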

Occupancy – using all “slots”

Increase block size to 32x4

[Figure: a scheduler issuing instructions from Warps 1-4 in turn; while one warp waits out the latency of its instruction, the next instruction is issued from another warp. Illustrative only, reality is a bit more complex…]

Iteration 3 – 32x4 blocks

Initial unoptimised (v0)   8.122 ms
Enable L1                  6.57 ms
Blocksize                  3.4 ms
Blocksize 2                2.36 ms

Up 10%
Full occupancy

Key takeaway

• Latency bound – low occupancy
• Unused cycles, exposed latency
• Symptoms:
  • High execution/memory dependency, low occupancy
• Goal:
  • Better utilise cycles by: having more warps
• What to do:
  • Determine occupancy limiter (registers, block size, shared memory) and vary it

Improving memory bandwidth

• L1 is fast, but a bit wasteful (128B loads)
  • 8 transactions on average (minimum would be 4)
• Load/Store pipe stressed
  • Any way to reduce the load?
• Texture cache
  • Dedicated pipeline
  • 32 byte loads
  • const __restrict__ *
  • __ldg()

Iteration 4 – texture cache

Initial unoptimised (v0)   8.122 ms
Blocksize 2                2.36 ms
Texture cache              1.53 ms

Key takeaway

• Bandwidth bound – Load/Store Unit
• LSU overutilised
• Symptoms:
  • LSU pipe utilisation high, others low
• Goal:
  • Better spread the load between other pipes: use TEX
• What to do:
  • Read read-only data through the texture cache
  • const __restrict__ or __ldg()
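The two ways of reading through the texture (read-only) cache named above look as follows on a toy kernel. This is an illustrative sketch rather than the lecture's stencil: the kernel and driver names are made up, and the code assumes compilation for compute capability 3.5 or later (for example `-arch=sm_35`), since `__ldg()` is not available below that.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Read-only input marked const __restrict__, which allows the compiler to
// route the load through the read-only data path.
__global__ void scale_restrict(const float *__restrict__ in,
                               float *__restrict__ out, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) out[idx] = 2.0f * in[idx];
}

// Same operation with the read-only load requested explicitly via __ldg().
__global__ void scale_ldg(const float *in, float *out, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) out[idx] = 2.0f * __ldg(&in[idx]);
}

// Hypothetical driver: runs both kernels and checks the results (n >= 1).
bool ldg_variants_agree(int n) {
  std::vector<float> h(n), ra(n), rb(n);
  for (int k = 0; k < n; k++) h[k] = float(k);
  float *d_in, *d_a, *d_b;
  cudaMalloc(&d_in, n * sizeof(float));
  cudaMalloc(&d_a, n * sizeof(float));
  cudaMalloc(&d_b, n * sizeof(float));
  cudaMemcpy(d_in, &h[0], n * sizeof(float), cudaMemcpyHostToDevice);
  int blocks = (n + 255) / 256;
  scale_restrict<<<blocks, 256>>>(d_in, d_a, n);
  scale_ldg<<<blocks, 256>>>(d_in, d_b, n);
  cudaMemcpy(&ra[0], d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(&rb[0], d_b, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_a);
  cudaFree(d_b);
  for (int k = 0; k < n; k++)
    if (ra[k] != 2.0f * h[k] || rb[k] != 2.0f * h[k]) return false;
  return true;
}
```

Whether the `const __restrict__` version actually uses the texture path is the compiler's decision; `__ldg()` makes the request explicit. The profiler's pipe utilisation view shows which path was taken.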

Compute analysis

Instruction mix

[Figure: instruction mix chart with the Load/store category labelled]

• Compute utilization could be higher (~78%)
• Lots of integer & memory instructions, fewer FP
• Integer ops have lower throughput than FP
• Try to amortize the cost: increase compute per byte

Instruction Level Parallelism

• Remember, GPU is in-order:
  • Second instruction cannot be issued before first
  • But it can be issued before the first finishes – if there is no dependency
• Applies to memory instructions too – latency much higher (counts towards stall reasons)

[Figure: issue timeline for Warp 1. With a dependency (a=b+c; d=a+e) the next instruction waits for the latency of the first; without one (a=b+c; d=e+f) it is issued straight after.]

Instruction Level Parallelism

Single accumulator:

for (j=0;j<2;j++)
  acc+=filter[j]*input[x+j];

unrolls to:

tmp=input[x+0]
acc += filter[0]*tmp
tmp=input[x+1]
acc += filter[1]*tmp

#pragma unroll can help ILP

Create two accumulators, or… process 2 points per thread:

for (j=0;j<2;j++) {
  acc0+=filter[j]*input[x+j];
  acc1+=filter[j]*input[x+j+1];
}

unrolls to:

tmp=input[x+0]
acc0 += filter[0]*tmp
tmp=input[x+1]
acc0 += filter[1]*tmp
tmp=input[x+0+1]
acc1 += filter[0]*tmp
tmp=input[x+1+1]
acc1 += filter[1]*tmp

Bonus: data re-use (register caching)
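A complete kernel built around the two-accumulator loop might look like the sketch below. It is a 1D, two-tap toy rather than the lecture's 2D stencil; the kernel name, the driver, the filter values and the requirement that the output length is even are all assumptions made for illustration.

```cuda
#include <vector>
#include <cuda_runtime.h>

__constant__ float filter[2];

// Each thread produces two adjacent outputs with independent accumulators,
// so the two chains of multiply-adds do not depend on each other.
// input[x+1] is needed by both outputs, so the compiler can keep it in a
// register instead of loading it twice.
__global__ void conv2_two_points(const float *input, float *output,
                                 int n_out) {
  int x = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
  if (x + 1 >= n_out) return;  // n_out is assumed to be even
  float acc0 = 0.0f, acc1 = 0.0f;
#pragma unroll
  for (int j = 0; j < 2; j++) {
    acc0 += filter[j] * input[x + j];
    acc1 += filter[j] * input[x + j + 1];
  }
  output[x] = acc0;
  output[x + 1] = acc1;
}

// Hypothetical driver: compares the kernel with a CPU loop. The input has
// n_out + 1 elements; all products are exactly representable in float.
bool two_point_kernel_matches_cpu(int n_out) {
  const float h_filter[2] = {0.25f, 0.75f};
  cudaMemcpyToSymbol(filter, h_filter, sizeof(h_filter));
  std::vector<float> in(n_out + 1), out(n_out, 0.0f);
  for (int k = 0; k <= n_out; k++) in[k] = float(k % 16);
  float *d_in, *d_out;
  cudaMalloc(&d_in, in.size() * sizeof(float));
  cudaMalloc(&d_out, out.size() * sizeof(float));
  cudaMemcpy(d_in, &in[0], in.size() * sizeof(float), cudaMemcpyHostToDevice);
  int threads = n_out / 2;  // one thread per pair of outputs
  conv2_two_points<<<(threads + 127) / 128, 128>>>(d_in, d_out, n_out);
  cudaMemcpy(&out[0], d_out, out.size() * sizeof(float),
             cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
  for (int x = 0; x < n_out; x++)
    if (out[x] != 0.25f * in[x] + 0.75f * in[x + 1]) return false;
  return true;
}
```

The same pattern extends to four outputs per thread by adding `acc2` and `acc3`; the number of threads launched shrinks accordingly.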

Iteration 5 – 2 points per thread

Initial unoptimised (v0)   8.122 ms
Texture cache              1.53 ms
2 points                   1.07 ms

[Figure label: Load/store]

Key takeaway

• Latency bound – low instruction level parallelism
• Unused cycles, exposed latency
• Symptoms:
  • High execution dependency, one “pipe” saturated
• Goal:
  • Better utilise cycles by: increasing parallel work per thread
• What to do:
  • Increase ILP by having more independent work, e.g. more than 1 output value per thread
  • #pragma unroll

Iteration 6 – 4 points per thread

Initial unoptimised (v0)   8.122 ms
2 points                   1.07 ms
4 points                   0.95 ms

168 GB/s device BW

[Figure label: Load/store]

Checklist

• cudaDeviceSynchronize()
  • Most API calls (e.g. kernel launch) are asynchronous
  • Overhead when launching kernels
  • Get rid of cudaDeviceSynchronize() to hide this latency
  • Timing: events or callbacks (CUDA 5.0+)
• Cache config 16/48 or 48/16 kB L1/shared (default is 48k shared!) on Kepler
  • cudaDeviceSetCacheConfig
  • cudaFuncSetCacheConfig
  • Check if shared memory usage is a limiting factor
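Both checklist items can be combined in a few lines: set the cache preference for a kernel, then time a batch of back-to-back launches with events instead of synchronising after each one. The kernel `touch` and the helper `time_batch_ms` are hypothetical names used only for this sketch.

```cuda
#include <cuda_runtime.h>

// Trivial kernel used only to have something to launch.
__global__ void touch(float *a, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) a[idx] += 1.0f;
}

// Hypothetical helper: launches `reps` kernels back to back, with no
// cudaDeviceSynchronize() in between, and times the whole batch in ms.
float time_batch_ms(float *d_a, int n, int reps) {
  // Per-kernel preference for a larger L1 and smaller shared memory.
  // cudaDeviceSetCacheConfig(cudaFuncCachePreferL1) would set it device-wide.
  cudaFuncSetCacheConfig(touch, cudaFuncCachePreferL1);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  for (int r = 0; r < reps; r++)
    touch<<<(n + 255) / 256, 256>>>(d_a, n);  // asynchronous launches
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);  // single wait, after the last launch
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}
```

Because the launches are queued without host-side waits, the launch overhead of one kernel overlaps with the execution of the previous one.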

Checklist

• Occupancy
  • Max 1536 threads or 8 blocks per SM on Fermi (2048/16 for Kepler, 2048/32 for Maxwell)
  • Limited amount of registers and shared memory
    • Max 255 registers/thread, rest is spilled to local memory (which resides in device memory)
    • You can explicitly limit it (-maxrregcount=xx)
    • 48kB/16kB shared/L1: don’t forget to set it
  • Visual Profiler tells you what is the limiting factor
  • In some cases though, it is faster if you don’t maximise it (see Volkov paper) -> Autotuning!

Verbose compile

• Add -Xptxas=-v
• Check profiler figures for best occupancy

ptxas info: Compiling entry function '_Z10fem_kernelPiS_' for 'sm_20'
ptxas info: Function properties for _Z10fem_kernelPiS_
    856 bytes stack frame, 980 bytes spill stores, 1040 bytes spill loads
ptxas info: Used 63 registers, 96 bytes cmem[0]

Checklist

• Precision mix (e.g. 1.0 vs 1.0f) – cuobjdump
  • F2F.F64.F32 (6* the cost of a multiply)
  • IEEE standard: always convert to higher precision
  • Integer multiplications are now expensive (6*)
• cudaMemcpy
  • Introduces explicit synchronisation, high latency
  • Is it necessary?
    • May be cheaper to launch a kernel which immediately exits
  • Could it be asynchronous? (Pin the memory!)
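The precision-mix item comes down to the type of the literal. A small sketch, with made-up kernel names, showing that a `float` multiplied by an unsuffixed literal is a `double` expression while the `f` suffix keeps it in single precision:

```cuda
#include <type_traits>

// 1.5 is a double literal, so a[idx] is converted to double, multiplied in
// double precision and converted back to float on assignment.
__global__ void scale_mixed(float *a, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) a[idx] = a[idx] * 1.5;
}

// 1.5f is a float literal, so the multiply stays in single precision.
__global__ void scale_single(float *a, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) a[idx] = a[idx] * 1.5f;
}

// The language rule behind it, checked at compile time.
static_assert(std::is_same<decltype(1.0f * 1.5), double>::value,
              "float * double literal is a double expression");
static_assert(std::is_same<decltype(1.0f * 1.5f), float>::value,
              "float * float literal stays float");
```

Whether the compiler really emits conversion instructions for `scale_mixed` depends on its optimiser; disassembling the binary with `cuobjdump -sass` and looking for `F2F` instructions settles it for a given build.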

Auto-tuning

• Several parameters that affect performance
  • Block size
  • Amount of work per block
  • Application specific
• Which combination performs the best?
• Auto-tuning with Flamingo
  • #define/read the sizes, recompile/rerun combinations
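The "rerun combinations" idea can be sketched as a brute-force loop over candidate block shapes. This is not Flamingo; it is a hand-rolled illustration in which the helper names are invented, the filter holds a commonly used integer 5x5 Gaussian approximation whose weights sum to 273 (an assumption, the lecture's weights are not shown), all 25 taps are applied and the row stride is taken to be `sizex`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__constant__ float filter[5][5];

__global__ void stencil(const float *input, float *output,
                        int sizex, int sizey) {
  const int x = blockIdx.x * blockDim.x + threadIdx.x + 2;
  const int y = blockIdx.y * blockDim.y + threadIdx.y + 2;
  if ((x >= sizex - 2) || (y >= sizey - 2)) return;
  float accum = 0.0f;
  for (int i = -2; i <= 2; i++)
    for (int j = -2; j <= 2; j++)
      accum += filter[i + 2][j + 2] * input[sizex * (y + j) + (x + i)];
  output[sizex * y + x] = accum / 273.0f;
}

// Times one launch configuration in ms (one warm-up launch, one timed).
float time_config(const float *d_in, float *d_out, int sizex, int sizey,
                  dim3 block) {
  int bx = (int)block.x, by = (int)block.y;
  dim3 grid((sizex - 4 + bx - 1) / bx, (sizey - 4 + by - 1) / by);
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  stencil<<<grid, block>>>(d_in, d_out, sizex, sizey);
  cudaEventRecord(start);
  stencil<<<grid, block>>>(d_in, d_out, sizex, sizey);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}

// Tries every candidate shape and returns the index of the fastest one.
int best_block_shape(int sizex, int sizey, const dim3 *candidates, int count) {
  const float h_filter[5][5] = {{1, 4, 7, 4, 1},
                                {4, 16, 26, 16, 4},
                                {7, 26, 41, 26, 7},
                                {4, 16, 26, 16, 4},
                                {1, 4, 7, 4, 1}};
  cudaMemcpyToSymbol(filter, h_filter, sizeof(h_filter));
  size_t bytes = (size_t)sizex * sizey * sizeof(float);
  float *d_in, *d_out;
  cudaMalloc(&d_in, bytes);
  cudaMalloc(&d_out, bytes);
  cudaMemset(d_in, 0, bytes);
  int best = 0;
  float best_ms = 0.0f;
  for (int c = 0; c < count; c++) {
    float ms = time_config(d_in, d_out, sizex, sizey, candidates[c]);
    printf("%ux%u: %.3f ms\n", candidates[c].x, candidates[c].y, ms);
    if (c == 0 || ms < best_ms) { best = c; best_ms = ms; }
  }
  cudaFree(d_in);
  cudaFree(d_out);
  return best;
}
```

Passing the shapes `dim3(8, 8)`, `dim3(32, 2)` and `dim3(32, 4)` with a 4096 x 4096 grid mirrors the block sizes tried in the iterations of the lecture. A single timed launch is noisy, so a real tuner would average over several runs.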

Auto-tuning Case Study

• Thread cooperation on sparse matrix-vector product
  • Multiple threads doing partial dot product on the row
  • Reduction in shared memory
• Auto-tune for different matrices
  • Difficult to predict caching behavior
  • Develop a heuristic for cooperation vs. average row length
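One possible shape of such a cooperative kernel is sketched below for a matrix in CSR format. It is not the code from the case study: the CSR layout, the names `spmv_coop`, `to_device` and `spmv_matches_cpu`, and the tridiagonal test matrix are assumptions. `COOP` (a power of two, at most 32, dividing `BLOCK`) is the number of threads that share one row, which is the parameter an auto-tuner would vary.

```cuda
#include <vector>
#include <cuda_runtime.h>

// y = A*x for a CSR matrix; COOP consecutive threads cooperate on each row.
template <int COOP, int BLOCK>
__global__ void spmv_coop(int rows, const int *row_ptr, const int *col_idx,
                          const float *val, const float *x, float *y) {
  __shared__ float partial[BLOCK];
  int tid = threadIdx.x;
  int lane = tid % COOP;                        // position within the team
  int row = (blockIdx.x * BLOCK + tid) / COOP;  // row handled by this team
  float sum = 0.0f;
  if (row < rows)
    for (int k = row_ptr[row] + lane; k < row_ptr[row + 1]; k += COOP)
      sum += val[k] * x[col_idx[k]];            // partial dot product
  partial[tid] = sum;
  __syncthreads();
  // Tree reduction in shared memory within each team of COOP threads.
  for (int s = COOP / 2; s > 0; s /= 2) {
    if (lane < s) partial[tid] += partial[tid + s];
    __syncthreads();
  }
  if (lane == 0 && row < rows) y[row] = partial[tid];
}

// Hypothetical helper: copies a non-empty host vector to new device memory.
template <typename T>
T *to_device(const std::vector<T> &h) {
  T *d = 0;
  cudaMalloc(&d, h.size() * sizeof(T));
  cudaMemcpy(d, &h[0], h.size() * sizeof(T), cudaMemcpyHostToDevice);
  return d;
}

// Builds a tridiagonal (1, 2, 1) test matrix, runs the kernel and compares
// with a CPU loop. All values are small integers, so float sums are exact.
template <int COOP>
bool spmv_matches_cpu(int rows) {
  const int BLOCK = 128;
  std::vector<int> row_ptr(1, 0), col_idx;
  std::vector<float> val, x(rows), y_ref(rows, 0.0f), y(rows, -1.0f);
  for (int r = 0; r < rows; r++) {
    x[r] = float(r % 7);
    for (int c = r - 1; c <= r + 1; c++) {
      if (c < 0 || c >= rows) continue;
      col_idx.push_back(c);
      val.push_back(c == r ? 2.0f : 1.0f);
    }
    row_ptr.push_back((int)col_idx.size());
  }
  for (int r = 0; r < rows; r++)
    for (int k = row_ptr[r]; k < row_ptr[r + 1]; k++)
      y_ref[r] += val[k] * x[col_idx[k]];

  int *d_row_ptr = to_device(row_ptr), *d_col_idx = to_device(col_idx);
  float *d_val = to_device(val), *d_x = to_device(x), *d_y = to_device(y);
  int blocks = (rows * COOP + BLOCK - 1) / BLOCK;
  spmv_coop<COOP, BLOCK><<<blocks, BLOCK>>>(rows, d_row_ptr, d_col_idx,
                                            d_val, d_x, d_y);
  cudaMemcpy(&y[0], d_y, rows * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_row_ptr); cudaFree(d_col_idx);
  cudaFree(d_val); cudaFree(d_x); cudaFree(d_y);
  for (int r = 0; r < rows; r++)
    if (y[r] != y_ref[r]) return false;
  return true;
}
```

With three nonzeros per row, `COOP = 1` or `2` wastes few threads, while `COOP = 32` leaves most of each team idle; for matrices with long rows the balance tips the other way, which is why the best value is tuned per matrix.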

Auto-tuning Case Study

[Figure: run time (seconds, log scale from 10^-4 to 10^-2) against number of cooperating threads (1, 2, 4, 8, 16, 32) for the matrices atmosmodd (2.63), crankseg_2 (14.89), shallow_water1 (2.00), webbase-1M (1.76) and cant (8.01)]

Conclusions

• Iterative approach to improving a code’s performance
  • Identify hotspot
  • Find performance limiter, understand why it’s an issue
  • Improve your code
  • Repeat
• Managed to achieve an 8.5x speedup
• Shown how NVVP guides us and helps understand what the code does
• There is more it can show…

References: C. Angerer, J. Demouth, “CUDA Optimization with NVIDIA Nsight Eclipse Edition”, GTC 2015

Rapid code development with Thrust

Thrust

• Open High-Level Parallel Algorithms Library
• Parallel Analog of the C++ Standard Template Library (STL)
  • Vector containers
  • Algorithms
• Comes with the toolkit
• Productive way to use CUDA

Example

Productivity

• Containers
  • host_vector
  • device_vector
• Memory management
  • Allocation, deallocation
  • Transfers
• Algorithm selection
  • Location is implicit
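A short sketch of these points in code: the containers handle allocation and transfers, and the algorithm runs wherever its iterators point. The function name and the test data are illustrative only.

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

// Sorts n integers on the device and reports whether the result is ordered.
bool sort_on_device(int n) {
  thrust::host_vector<int> h_vec(n);
  for (int k = 0; k < n; k++) h_vec[k] = (n - k) % 1000;  // arbitrary data

  // Assignment allocates device memory and copies host -> device.
  thrust::device_vector<int> d_vec = h_vec;

  // Device iterators, so the sort runs on the GPU: location is implicit.
  thrust::sort(d_vec.begin(), d_vec.end());

  // Copy device -> host; device memory is freed when d_vec goes out of scope.
  thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

  for (int k = 1; k < n; k++)
    if (h_vec[k - 1] > h_vec[k]) return false;
  return true;
}
```

There is no explicit `cudaMalloc`, `cudaMemcpy` or `cudaFree` anywhere in the function.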

Productivity

• Large set of algorithms
  • ~100 functions
  • CPU, GPU
• Flexible
  • C++ templates
  • User-defined types
  • User-defined operators

Algorithm       Description
reduce          Sum of a sequence
find            First position of a value in a sequence
mismatch        First position where two sequences differ
count           Number of instances of a value
inner_product   Dot product of two sequences
merge           Merge two sorted sequences
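Three of the algorithms from the table, applied to device vectors. The struct `Summary` and the function `summarize` are names invented for this sketch.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>
#include <thrust/count.h>
#include <thrust/inner_product.h>

struct Summary {
  int sum;   // result of reduce
  int ones;  // result of count
  int dot;   // result of inner_product
};

Summary summarize(int n) {
  thrust::device_vector<int> a(n), b(n, 1);  // b is filled with ones
  thrust::sequence(a.begin(), a.end());      // a = 0, 1, ..., n-1

  Summary s;
  s.sum = thrust::reduce(a.begin(), a.end());          // 0 + 1 + ... + (n-1)
  s.ones = (int)thrust::count(b.begin(), b.end(), 1);  // how many ones in b
  s.dot = thrust::inner_product(a.begin(), a.end(), b.begin(), 0);  // a . b
  return s;
}
```

For `n = 10` the sum and the dot product are both 45 and the count is 10.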

Portability

• Implementations
  • CUDA C/C++
  • Threading Building Blocks
  • OpenMP
  • Interoperable with anything CUDA based
• Recompile
• Mix backends

nvcc -DTHRUST_DEVICE_SYSTEM=THRUST_HOST_SYSTEM_OMP

Interoperability

• Thrust containers and raw pointers
  • Use container in CUDA kernel

thrust::device_vector<int> d_vec(...);
cuda_kernel<<<N, 128>>>(some_argument_d,
                        thrust::raw_pointer_cast(&d_vec[0]));

  • Use a device pointer in thrust algorithms (not a vector though, no begin(), end(), resize() etc.)

int *dev_ptr;
cudaMalloc((void**)&dev_ptr, 100*sizeof(int));
thrust::device_ptr<int> dev_ptr_thrust(dev_ptr);
thrust::fill(dev_ptr_thrust, dev_ptr_thrust+100, 0);

Thrust

• Constantly evolving
• Reliable – comes with the toolkit, tested every day with unit tests
• Performance – specialised implementations for different hardware
• Extensible – allocators, back-ends, etc.

Thrust documentation: http://thrust.github.io/doc/modules.html