IAP09 CUDA@MIT / 6.963 Supercomputing on your desktop: Programming the next generation of cheap and massively parallel hardware using CUDA Lecture 07 CUDA Advanced #2 - Nicolas Pinto (MIT) Friday, January 23, 2009
Page 1

IAP09 CUDA@MIT / 6.963

Supercomputing on your desktop: Programming the next generation of cheap

and massively parallel hardware using CUDA

Lecture 07

CUDA Advanced #2

Nicolas Pinto (MIT)

Page 2

During this course, we’ll try to “ ” and use existing material ;-)

adapted for 6.963

Page 3

Today, yey!!

Page 4

Wanna Play with The Big Guys?

Page 5

Here are the keys to High-Performance in CUDA

Page 6

To optimize or not to optimize

Hoare said (and Knuth restated)

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.”

~3% of the time we really should worry about small efficiencies

(Every 33rd code line)

Applied Mathematics 23/53, slide by Johan Seland

Warning!


Page 8

Strategy

Memory Optimizations

Execution Optimizations

IAP09 CUDA@MIT / 6.963

Page 9

CUDA Performance Strategies

IAP09 CUDA@MIT / 6.963

Page 10

Optimization goals

We should strive to reach GPU performance

We must know the GPU performance: vendor specifications, synthetic benchmarks

Choose a performance metric: memory bandwidth or GFLOPS?

Use clock() to measure

Experiment and profile!

Applied Mathematics 25/53

Strategy

slide by Johan Seland

Page 11

© NVIDIA Corporation 2006 3

Programming Model

A kernel is executed as a grid of thread blocks

A thread block is a batch of threads that can cooperate with each other by:

Sharing data through shared memory

Synchronizing their execution

Threads from different blocks cannot cooperate

[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; Grid 1 is a 3x2 array of Blocks (0,0)-(2,1), and Block (1,1) of Grid 2 expands into a 5x3 array of Threads (0,0)-(4,2)]

Threading

Page 12

© NVIDIA Corporation 2008 10

Data Movement in a CUDA Program

Host Memory → Device Memory → [Shared Memory] → COMPUTATION → [Shared Memory] → Device Memory → Host Memory

Memory

Page 13

Optimize Algorithms for the GPU

Maximize independent parallelism

Maximize arithmetic intensity (math/bandwidth)

Sometimes it’s better to recompute than to cache: GPU spends its transistors on ALUs, not memory

Do more computation on the GPU to avoid costly data transfers: even low-parallelism computations can sometimes be faster than transferring back and forth to host

Perf

Page 14

Optimize Memory Coherence

Coalesced vs. Non-coalesced = order of magnitude

Global/Local device memory

Optimize for spatial locality in cached texture memory

In shared memory, avoid high-degree bank conflicts

Perf

Page 15

Take Advantage of Shared Memory

Hundreds of times faster than global memory

Threads can cooperate via shared memory

Use one / a few threads to load / compute data shared by all threads

Use it to avoid non-coalesced access

Stage loads and stores in shared memory to re-order non-coalesceable addressing

Matrix transpose example later

Perf

Page 16

Use Parallelism Efficiently

Partition your computation to keep the GPU multiprocessors equally busy

Many threads, many thread blocks

Keep resource usage low enough to support multiple active thread blocks per multiprocessor

Registers, shared memory

Perf

Page 17

Page 18

Memory Optimizations

IAP09 CUDA@MIT / 6.963

Page 19

Memory optimizations

Optimizing memory transfers

Coalescing global memory accesses

Using shared memory effectively

Memory

Page 20

Data Transfers

Device memory to host memory bandwidth much lower than device memory to device bandwidth

4 GB/s peak (PCI-e x16) vs. 80 GB/s peak (Quadro FX 5600)

8 GB/s for PCI-e 2.0

Minimize transfers

Intermediate data structures can be allocated, operated on, and deallocated without ever copying them to host memory

Group transfers

One large transfer much better than many small ones

Memory

Page 21

Page-Locked Memory Transfers

cudaMallocHost() allows allocation of page-locked host memory

Enables highest cudaMemcpy performance: 3.2 GB/s common on PCI-express (x16), ~4 GB/s measured on nForce 680i motherboards (overclocked PCI-e)

See the “bandwidthTest” CUDA SDK sample

Use with caution! Allocating too much page-locked memory can reduce overall system performance

Test your systems and apps to learn their limits

Memory

Page 22

Global Memory Reads/Writes

Highest-latency instructions: 400-600 clock cycles

Likely to be performance bottleneck

Optimizations can greatly increase performance

Coalescing: up to 10x speedup

Latency hiding: up to 2.5x speedup

gmem

Friday, January 23, 2009

Page 23: Programming the next generation of cheap parallel hardware ... · IAP09 A@MIT / 6.963 Supercomputing on your desktop: Programming the next generation of cheap and massively parallel

Accessing global memory

4 cycles to issue a memory fetch

but 400-600 cycles of latency: the equivalent of 100 MADs

Likely to be a performance bottleneck

Order-of-magnitude speedups possible:

Coalesce memory access

Use shared memory to re-order non-coalesced addressing

Applied Mathematics 32/53, slide by Johan Seland

gmem

Page 24

Coalescing

A coordinated read by a half-warp (16 threads)

A contiguous region of global memory:

64 bytes - each thread reads a word: int, float, ...

128 bytes - each thread reads a double-word: int2, float2, ...

256 bytes - each thread reads a quad-word: int4, float4, ...

Additional restrictions on G8X/G9X architecture:

Starting address for a region must be a multiple of region size

The kth thread in a half-warp must access the kth element in a block being read

Exception: not all threads must be participating

Predicated access, divergence within a half-warp

gmem

Page 25

Coalesced Access: Reading floats

[Figure: threads t0-t15 of a half-warp reading consecutive 4-byte addresses 128-188]

Some Threads Do Not Participate

All threads participate

gmem

Page 26

Uncoalesced Access: Reading floats

Permuted Access by Threads

Misaligned Starting Address (not a multiple of 64)

[Figure: threads t0-t15 with out-of-order or shifted addresses in the 128-188 range]

gmem

Page 27

Coalescing: Timing Results

Experiment on G80:

Kernel: read a float, increment, write back

3M floats (12MB)

Times averaged over 10K runs

12K blocks x 256 threads:

356µs - coalesced

357µs - coalesced, some threads don’t participate

3,494µs - permuted/misaligned thread access

gmem

Page 28

Coalescing: Structures of size ≠ 4, 8, 16 bytes

Use a Structure of Arrays (SoA) instead of Array of Structures (AoS)

If SoA is not viable:

Force structure alignment: __align(X), where X = 4, 8, or 16

Use SMEM to achieve coalescing

Point structure: x y z

AoS: xyz xyz xyz

SoA: xxx yyy zzz

gmem

Page 29

Coalescing: Summary

Coalescing greatly improves throughput

Critical to memory-bound kernels

Reading structures of size other than 4, 8, or 16 bytes will break coalescing:

Prefer Structures of Arrays over AoS

If SoA is not viable, read/write through SMEM

Additional resources: Aligned Types SDK sample

gmem

Page 30

Parallel Memory Architecture

In a parallel machine, many threads access memory

Therefore, memory is divided into banks

Essential to achieve high bandwidth

Each bank can service one address per cycle

A memory can service as many simultaneous accesses as it has banks

Multiple simultaneous accesses to a bank result in a bank conflict

Conflicting accesses are serialized

[Figure: shared memory divided into Bank 0 ... Bank 15]

smem

Page 31

Bank Addressing Examples

No Bank Conflicts: Linear addressing, stride == 1

No Bank Conflicts: Random 1:1 Permutation

[Figure: two mappings of Threads 0-15 onto Banks 0-15, one thread per bank in both cases]

smem

Page 32

Bank Addressing Examples

2-way Bank Conflicts: Linear addressing, stride == 2

8-way Bank Conflicts: Linear addressing, stride == 8

[Figure: with stride 2, pairs of threads share each even bank; with stride 8, threads pile up x8 on banks 0 and 8]

smem

Page 33

How addresses map to banks on G80

Bandwidth of each bank is 32 bits per 2 clock cycles

Successive 32-bit words are assigned to successive banks

G80 has 16 banks

So bank = address % 16

Same as the size of a half-warp: no bank conflicts between different half-warps, only within a single half-warp

smem

Page 34

Shared memory bank conflicts

Shared memory is as fast as registers if there are no bank conflicts

The fast case:

If all threads of a half-warp access different banks, there is no bank conflict

If all threads of a half-warp read the identical address, there is no bank conflict (broadcast)

The slow case:

Bank Conflict: multiple threads in the same half-warp access the same bank

Must serialize the accesses

Cost = max # of simultaneous accesses to a single bank

smem

Page 35

Use the right kind of memory

Constant memory: quite small, ~20K. As fast as register access if all threads in a warp access the same location

Texture memory: spatially cached, optimized for 2D locality. Neighboring threads should read neighboring addresses; no need to think about coalescing

Constraint: these memories can only be updated from the CPU

Applied Mathematics 31/53, slide by Johan Seland

Strategy

Page 36

Memory optimizations roundup

CUDA memory handling is complex, and I have not covered all topics...

Using memory correctly can lead to huge speedups. At least CUDA exposes the memory hierarchy, unlike CPUs

Get your algorithm up and running first, then optimize

Use shared memory to let threads cooperate

Be wary of “data ownership”: a thread does not have to read/write the data it calculates

Applied Mathematics 41/53

Strategy

slide by Johan Seland

Page 37

Conflicts, Coalescing, Warps... I hate growing up.

Page 38

Optimization Example: Matrix Transpose

Example

Page 39

Matrix transpose

SDK Sample (“transpose”)

Illustrates:

Coalescing

Avoiding SMEM bank conflicts

Speedups for even small matrices

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

1 5 9 13

2 6 10 14

3 7 11 15

4 8 12 16

Example

Page 40

Uncoalesced transpose

__global__ void transpose_naive(float *odata, float *idata, int width, int height)

{

unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;

unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;

if (xIndex < width && yIndex < height)

{

unsigned int index_in = xIndex + width * yIndex;

unsigned int index_out = yIndex + height * xIndex;

$)%.%/0")'12$3.4 = 0)%.%/0")'120"4;

}

}


Example

Page 41

Uncoalesced transpose

Reads input from GMEM

[Figure: input tile elements read row by row - stride = 1, coalesced]

Write output to GMEM

[Figure: output tile elements written column by column - stride = 16, uncoalesced]

Example

Page 42

Coalesced Transpose

Assumption: matrix is partitioned into square tiles

Thread block (bx, by):

Read the (bx, by) input tile, store into SMEM

Write the SMEM data to (by, bx) output tile

Transpose the indexing into SMEM

Thread (tx, ty):

Reads element (tx, ty) from input tile

Writes element (tx, ty) into output tile

Coalescing is achieved if:

Block/tile dimensions are multiples of 16

Example

Page 43

Coalesced Transpose

Writes to SMEM / Reads from GMEM

[Figure: tile elements accessed row-major on both sides]

Writes to GMEM / Reads from SMEM

[Figure: the transposition happens via the SMEM indexing, so GMEM reads and writes both stay stride-1]

Example

Page 44

SMEM Optimization

Threads read SMEM with stride = 16

Bank conflicts

Reads from SMEM

[Figure: reading a matrix column with stride 16 lands every element in the same bank]

Solution

Allocate an “extra” column

Read stride = 17

Threads read from consecutive banks

[Figure: with stride 17, successive column elements fall into successive banks]

Example


Page 46

Coalesced transpose

__global__ void transpose(float *odata, float *idata, int width, int height)

{

__shared__ float block[(BLOCK_DIM+1)*BLOCK_DIM];

unsigned int xBlock = blockDim.x * blockIdx.x;

unsigned int yBlock = blockDim.y * blockIdx.y;

unsigned int xIndex = xBlock + threadIdx.x;

unsigned int yIndex = yBlock + threadIdx.y;

unsigned int index_out, index_transpose;

if (xIndex < width && yIndex < height)

{

unsigned int index_in = width * yIndex + xIndex;

unsigned int index_block = threadIdx.y * (BLOCK_DIM+1) + threadIdx.x;

block[index_block] = idata[index_in];

index_transpose = threadIdx.x * (BLOCK_DIM+1) + threadIdx.y;

index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;

}

__syncthreads();

if (xIndex < width && yIndex < height)

odata[index_out] = block[index_transpose];

}


Example

Page 47

Coalesced transpose: Source code

__global__ void transpose( float *out, float *in, int width, int height ) {

__shared__ float block[BLOCK_DIM*BLOCK_DIM];

unsigned int xBlock = blockDim.x * blockIdx.x;
unsigned int yBlock = blockDim.y * blockIdx.y;

unsigned int xIndex = xBlock + threadIdx.x;
unsigned int yIndex = yBlock + threadIdx.y;

unsigned int index_out, index_transpose;

if ( xIndex < width && yIndex < height ) {
unsigned int index_in = width * yIndex + xIndex;
unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;

block[index_block] = in[index_in];

index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;
index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
}
__syncthreads();

if ( xIndex < width && yIndex < height ) {
out[index_out] = block[index_transpose];
}
}

Applied Mathematics 39/53

Example

slide by Johan Seland

Friday, January 23, 2009

Page 48: Programming the next generation of cheap parallel hardware ... · IAP09 A@MIT / 6.963 Supercomputing on your desktop: Programming the next generation of cheap and massively parallel

Coalesced transpose: Source code

__global__ voidtranspose( float *out, float *in, int w, int h ) {

__shared__ float block[BLOCK_DIM*BLOCK_DIM];

unsigned int xBlock = blockDim.x * blockIdx.x;unsigned int yBlock = blockDim.y * blockIdx.y;

unsigned int xIndex = xBlock + threadIdx.x;unsigned int yIndex = yBlock + threadIdx.y;

unsigned int index_out, index_transpose;

if ( xIndex < width && yIndex < height ) {unsigned int index_in = width * yIndex + xIndex;unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;

block[index_block] = in[index_in];

index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;

}__synchthreads();

if ( xIndex < width && yIndex < height ) {out[index_out] = block[index_transpose];

}}

Allocate shared memory.

Applied Mathematics 39/53slide by Johan Seland

Example

Friday, January 23, 2009

Page 49: Programming the next generation of cheap parallel hardware ... · IAP09 A@MIT / 6.963 Supercomputing on your desktop: Programming the next generation of cheap and massively parallel

Coalesced transpose: Source code

__global__ voidtranspose( float *out, float *in, int w, int h ) {

__shared__ float block[BLOCK_DIM*BLOCK_DIM];

unsigned int xBlock = blockDim.x * blockIdx.x;unsigned int yBlock = blockDim.y * blockIdx.y;

unsigned int xIndex = xBlock + threadIdx.x;unsigned int yIndex = yBlock + threadIdx.y;

unsigned int index_out, index_transpose;

if ( xIndex < width && yIndex < height ) {unsigned int index_in = width * yIndex + xIndex;unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;

block[index_block] = in[index_in];

index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;

}__synchthreads();

if ( xIndex < width && yIndex < height ) {out[index_out] = block[index_transpose];

}}

Allocate shared memory.

Set up indexing

Applied Mathematics 39/53slide by Johan Seland

Example

Friday, January 23, 2009

Page 50: Programming the next generation of cheap parallel hardware ... · IAP09 A@MIT / 6.963 Supercomputing on your desktop: Programming the next generation of cheap and massively parallel

Coalesced transpose: Source code

__global__ voidtranspose( float *out, float *in, int w, int h ) {

__shared__ float block[BLOCK_DIM*BLOCK_DIM];

unsigned int xBlock = blockDim.x * blockIdx.x;unsigned int yBlock = blockDim.y * blockIdx.y;

unsigned int xIndex = xBlock + threadIdx.x;unsigned int yIndex = yBlock + threadIdx.y;

unsigned int index_out, index_transpose;

if ( xIndex < width && yIndex < height ) {unsigned int index_in = width * yIndex + xIndex;unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;

block[index_block] = in[index_in];

index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;

}__synchthreads();

if ( xIndex < width && yIndex < height ) {out[index_out] = block[index_transpose];

}}

Allocate shared memory.

Set up indexing

Check that we are withindomain, calculate moreindices

Applied Mathematics 39/53slide by Johan Seland

Example

Friday, January 23, 2009

Page 51: Programming the next generation of cheap parallel hardware ... · IAP09 A@MIT / 6.963 Supercomputing on your desktop: Programming the next generation of cheap and massively parallel

Coalesced transpose: Source code

__global__ voidtranspose( float *out, float *in, int w, int h ) {

__shared__ float block[BLOCK_DIM*BLOCK_DIM];

unsigned int xBlock = blockDim.x * blockIdx.x;unsigned int yBlock = blockDim.y * blockIdx.y;

unsigned int xIndex = xBlock + threadIdx.x;unsigned int yIndex = yBlock + threadIdx.y;

unsigned int index_out, index_transpose;

if ( xIndex < width && yIndex < height ) {unsigned int index_in = width * yIndex + xIndex;unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;

block[index_block] = in[index_in];

index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;

}__synchthreads();

if ( xIndex < width && yIndex < height ) {out[index_out] = block[index_transpose];

}}

Allocate shared memory.

Set up indexing

Check that we are withindomain, calculate moreindices

Write to shared memory.

Applied Mathematics 39/53slide by Johan Seland

Example

Friday, January 23, 2009

Page 52: Programming the next generation of cheap parallel hardware ... · IAP09 A@MIT / 6.963 Supercomputing on your desktop: Programming the next generation of cheap and massively parallel

Coalesced transpose: Source code

__global__ voidtranspose( float *out, float *in, int w, int h ) {

__shared__ float block[BLOCK_DIM*BLOCK_DIM];

unsigned int xBlock = blockDim.x * blockIdx.x;unsigned int yBlock = blockDim.y * blockIdx.y;

unsigned int xIndex = xBlock + threadIdx.x;unsigned int yIndex = yBlock + threadIdx.y;

unsigned int index_out, index_transpose;

if ( xIndex < width && yIndex < height ) {unsigned int index_in = width * yIndex + xIndex;unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;

block[index_block] = in[index_in];

index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;

}__synchthreads();

if ( xIndex < width && yIndex < height ) {out[index_out] = block[index_transpose];

}}

Allocate shared memory.

Set up indexing

Check that we are withindomain, calculate moreindices

Write to shared memory.

Calculate output indices.

Applied Mathematics 39/53slide by Johan Seland

Example

Friday, January 23, 2009

Page 53: Programming the next generation of cheap parallel hardware ... · IAP09 A@MIT / 6.963 Supercomputing on your desktop: Programming the next generation of cheap and massively parallel

Coalesced transpose: Source code

__global__ void
transpose( float *out, float *in, int width, int height )
{
    // Allocate shared memory.
    __shared__ float block[BLOCK_DIM*BLOCK_DIM];

    // Set up indexing.
    unsigned int xBlock = blockDim.x * blockIdx.x;
    unsigned int yBlock = blockDim.y * blockIdx.y;
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    // Check that we are within the domain, calculate more indices.
    if ( xIndex < width && yIndex < height ) {
        unsigned int index_in    = width * yIndex + xIndex;
        unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;

        // Write to shared memory.
        block[index_block] = in[index_in];

        // Calculate output indices.
        index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;
        index_out       = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
    }

    // Synchronize. NB: outside the if-clause.
    __syncthreads();

    if ( xIndex < width && yIndex < height ) {
        // Write to global memory. Different index.
        out[index_out] = block[index_transpose];
    }
}

Applied Mathematics 39/53
slide by Johan Seland
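Not from the slide: a CPU reference for the same transpose, handy for validating the kernel's output while experimenting with BLOCK_DIM (row-major layout, matching the kernel's index_in/index_out arithmetic; the helper name is my own):

```cpp
#include <cassert>
#include <vector>

// CPU reference for the transpose above: 'in' is height rows by
// width columns (row-major), 'out' is its width-by-height transpose,
// so out[height*x + y] == in[width*y + x].
std::vector<float> transposeRef(const std::vector<float>& in,
                                int width, int height) {
    std::vector<float> out(in.size());
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            out[height * x + y] = in[width * y + x];
    return out;
}
```

Comparing this element by element against the device result gives a quick correctness harness.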

Example

Friday, January 23, 2009


Transpose timings

Was it worth the trouble?

Grid size      Coalesced   Non-coalesced   Speedup
128 x 128      0.011 ms    0.022 ms        2.0x
512 x 512      0.07 ms     0.33 ms         4.5x
1024 x 1024    0.30 ms     1.92 ms         6.4x
1024 x 2048    0.79 ms     6.6 ms          8.4x

For me, this is a clear yes.

Applied Mathematics 40/53
slide by Johan Seland

Example


Execution Optimizations

IAP09 CUDA@MIT / 6.963


Know the arithmetic cost of operations

4 clock cycles:
Floating point: add, multiply, fused multiply-add
Integer add, bitwise operations, compare, min, max

16 clock cycles:
reciprocal, reciprocal square root, log(x), 32-bit integer multiplication

32 clock cycles:
sin(x), cos(x) and exp(x)

36 clock cycles:
Floating point division (24-bit version in 20 cycles)

Particularly costly:
Integer division, modulo
Remedy: Replace with shifting whenever possible

Double precision (when available) will perform at half the speed
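The shifting remedy applies when the divisor is a known power of two; a small illustration (helper names are my own):

```cpp
#include <cassert>

// For a power-of-two divisor n = 1 << k and unsigned x:
//   x / n == x >> k    and    x % n == x & (n - 1)
unsigned divPow2(unsigned x, unsigned k) { return x >> k; }
unsigned modPow2(unsigned x, unsigned k) { return x & ((1u << k) - 1u); }
```

Compilers often do this automatically for unsigned operands with constant divisors, but being explicit avoids relying on it.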

Applied Mathematics 28/53
slide by Johan Seland

Exec


Occupancy

Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy

Occupancy = Number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently

Limited by resource usage:

Registers

Shared memory
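Occupancy as defined here (resident warps over the maximum) is simple host-side arithmetic; a sketch with assumed G80-era limits (8192 registers and 16 KB shared memory per multiprocessor, 24 resident warps max), not a real CUDA API:

```cpp
#include <algorithm>
#include <cassert>

// Occupancy = resident warps / max warps per multiprocessor.
// Device limits below are assumptions (G80-era values).
double occupancy(int regsPerThread, int smemPerBlock, int threadsPerBlock) {
    const int kRegs = 8192, kSmem = 16384, kMaxWarps = 24, kWarpSize = 32;
    // Blocks that fit, limited by registers and by shared memory.
    int blocksByRegs = kRegs / (regsPerThread * threadsPerBlock);
    int blocksBySmem = (smemPerBlock > 0) ? kSmem / smemPerBlock : blocksByRegs;
    int blocks = std::min(blocksByRegs, blocksBySmem);
    int warpsPerBlock = (threadsPerBlock + kWarpSize - 1) / kWarpSize;
    int warps = std::min(blocks * warpsPerBlock, kMaxWarps);
    return static_cast<double>(warps) / kMaxWarps;
}
```

A return of 0 means the block does not fit at all (the launch would fail).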

Exec


Grid/Block Size Heuristics

# of blocks > # of multiprocessors

So all multiprocessors have at least one block to execute

# of blocks / # of multiprocessors > 2

Multiple blocks can run concurrently in a multiprocessor

Blocks that aren't waiting at a __syncthreads() keep the hardware busy

Subject to resource availability: registers, shared memory

# of blocks > 100 to scale to future devices

Blocks executed in pipeline fashion

1000 blocks per grid will scale across multiple generations
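In practice the "# of blocks" is derived from the problem size with a ceiling division, so the grid covers every element (helper name is illustrative):

```cpp
#include <cassert>

// Smallest block count such that gridSize * threadsPerBlock >= n.
int gridSize(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
```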

Exec


Register Dependency

Read-after-write register dependency

Instruction's result can be read ~24 cycles later

Scenarios:

CUDA:                     PTX:

x = y + 5;                add.f32       $f3, $f1, $f2
z = x + 3;                add.f32       $f5, $f3, $f4

s_data[0] += 3;           ld.shared.f32 $f3, [$r31+0]
                          add.f32       $f3, $f3, $f4

To completely hide the latency: Run at least 192 threads (6 warps) per multiprocessor

At least 25% occupancy

Threads do not have to belong to the same thread block

Exec


Register Pressure

Hide latency by using more threads per SM

Limiting Factors:

Number of registers per kernel

8192 per SM, partitioned among concurrent threads

Amount of shared memory

16KB per SM, partitioned among concurrent threadblocks

Check .cubin file for # registers / kernel

Use -maxrregcount=N flag to NVCC

N = desired maximum registers / kernel

At some point "spilling" into LMEM may occur

Reduces performance: LMEM is slow

Check .cubin file for LMEM usage

Exec


Determining resource usage

Use the "--ptxas-options=-v" option to nvcc

Or compile the kernel code with the -cubin flag to determine register usage.

Open the .cubin file with a text editor and look for the "code" section.

architecture {sm_10}

abiversion {0}

modname {cubin}

code {

name = BlackScholesGPU

lmem = 0

smem = 68

reg = 20

bar = 0

bincode {

0xa0004205 0x04200780 0x40024c09 0x00200780

per thread local memory

per thread block shared memory

per thread registers

Exec


CUDA Occupancy Calculator

Exec


Optimizing threads per block

Choose threads per block as a multiple of warp size

Avoid wasting computation on under-populated warps

More threads per block == better memory latency hiding

But, more threads per block == fewer registers per thread

Kernel invocations can fail if too many registers are used

Heuristics

Minimum: 64 threads per block

Only if multiple concurrent blocks

192 or 256 threads a better choice

Usually still enough regs to compile and invoke successfully

This all depends on your computation, so experiment!

Exec


Occupancy != Performance

Increasing occupancy does not necessarily increase performance

BUT…

Low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels

(It all comes down to arithmetic intensity and available parallelism)

Exec


Parameterize Your Application

Parameterization helps adaptation to different GPUs

GPUs vary in many ways:

# of multiprocessors

Memory bandwidth

Shared memory size

Register file size

Threads per block

You can even make apps self-tuning (like FFTW and ATLAS)

"Experiment" mode discovers and saves optimal configuration

Exec


Loop unrolling

Sometimes we know some kernel parameters at compile time:
# of loop iterations
Degrees of polynomials
Number of data elements

If we could "tell" this to the compiler, it can unroll loops and optimize register usage

We need to be generic
Avoid code duplication, sizes unknown at compile time

Templates to the rescue
The same trick can be used for regular C++ sources

Applied Mathematics 43/53
slide by Johan Seland

Exec


Example: de Casteljau algorithm

A standard algorithm for evaluating polynomials in Bernstein form

Recursively defined:

f(x) = b^d_{0,0}

b^k_{i,j} = x \, b^{k-1}_{i+1,j} + (1 - x) \, b^{k-1}_{i,j+1}

b^0_{i,j} are the coefficients

[Figure: de Casteljau triangle with f(x) = b^d_{0,0} at the apex, b^{d-1}_{1,0} and b^{d-1}_{0,1} below it, then b^{d-2}_{2,0}, b^{d-2}_{1,1}, b^{d-2}_{0,2}; edges weighted x and 1 - x]

Applied Mathematics 44/53
slide by Johan Seland

Exec


Implementation

The de Casteljau algorithm is usually implemented as nested for-loops

Coefficients are overwritten for each iteration

float deCasteljau( float *c, float x, int d )
{
    for ( uint i = 1; i <= d; ++i ) {
        for ( uint j = 0; j <= d - i; ++j )
            c[j] = (1.0f - x)*c[j] + x*c[j+1];
    }
    return c[0];
}

[Figure: same triangle as before, with f(x) = c^d_{0,0} at the apex, c^{d-1}_{1,0} and c^{d-1}_{0,1} below, then c^{d-2}_{2,0}, c^{d-2}_{1,1}, c^{d-2}_{0,2}; edges weighted x and 1 - x]

Applied Mathematics 45/53
slide by Johan Seland
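The slide's routine, tidied into a self-contained form that can be checked on the CPU (uint written as plain int; the arithmetic is unchanged):

```cpp
#include <cassert>

// de Casteljau evaluation of a degree-d polynomial in Bernstein form.
// Coefficients c[0..d] are overwritten during evaluation.
float deCasteljau(float* c, float x, int d) {
    for (int i = 1; i <= d; ++i)
        for (int j = 0; j <= d - i; ++j)
            c[j] = (1.0f - x) * c[j] + x * c[j + 1];
    return c[0];
}
```

Two handy sanity checks: all-ones coefficients give f(x) = 1 (partition of unity), and coefficients {0, 1} at degree 1 give f(x) = x.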

Exec


Template loop unrolling

We make d a template parameter

template<int d>
float deCasteljau( float *c, float x )
{
    for ( uint i = 1; i <= d; ++i ) {
        for ( uint j = 0; j <= d - i; ++j )
            c[j] = (1.0f - x)*c[j] + x*c[j+1];
    }
    return c[0];
}

Kernel is called as

switch ( d ) {
case 1:
    deCasteljau<1><<<dimGrid, dimBlock>>>( c, x ); break;
case 2:
    deCasteljau<2><<<dimGrid, dimBlock>>>( c, x ); break;
..
case MAXD:
    deCasteljau<MAXD><<<dimGrid, dimBlock>>>( c, x ); break;
}

Applied Mathematics 46/53
slide by Johan Seland
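The same dispatch works in host-only C++, which makes the unrolling easy to test: the kernel launch syntax becomes a plain call, and only two cases are spelled out here (the MAXD enumeration is elided as in the slide):

```cpp
#include <cassert>

// Degree as a template parameter, so the compiler can fully unroll
// both loops and keep the coefficients in registers.
template <int d>
float deCasteljau(float* c, float x) {
    for (int i = 1; i <= d; ++i)
        for (int j = 0; j <= d - i; ++j)
            c[j] = (1.0f - x) * c[j] + x * c[j + 1];
    return c[0];
}

// Runtime-to-compile-time dispatch, mirroring the slide's switch.
float dispatch(float* c, float x, int d) {
    switch (d) {
        case 1: return deCasteljau<1>(c, x);
        case 2: return deCasteljau<2>(c, x);
        default: return 0.0f;  // real code would enumerate up to MAXD
    }
}
```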

Exec


Results

For the de Casteljau algorithm we see a relatively small speedup

~1.2x (20%...)

Very easy to implement

Can lead to long compile times

Conclusion:

Probably worth it near end of development cycle

Applied Mathematics 47/53
slide by Johan Seland

Exec


Conclusion

Understand CUDA performance characteristics

Memory coalescing

Divergent branching

Bank conflicts

Latency hiding

Use peak performance metrics to guide optimization

Understand parallel algorithm complexity theory

Know how to identify type of bottleneck

e.g. memory, core computation, or instruction overhead

Optimize your algorithm, then unroll loops

Use template parameters to generate optimal code

Exec


Use CUDA Visual Profiler

Helps measure and find potential performance problems

GPU and CPU timing for all kernel invocations and memcpys

Time stamps

Access to hardware performance counters

Profiling


Signals

Events are tracked with hardware counters on signals in the chip:

timestamp

gld_incoherent

gld_coherent

gst_incoherent

gst_coherent

local_load

local_store

branch

divergent_branch

instructions : instruction count

warp_serialize : thread warps that serialize on address conflicts to shared or constant memory

cta_launched : executed thread blocks

Global memory loads/stores are coalesced (coherent) or non-coalesced (incoherent)

Total branches and divergent branches taken by threads

Local loads/stores

Profiling


Interpreting profiler counters

Values represent events within a thread warp

Only targets one multiprocessor

Values will not correspond to the total number of warps launched for a particular kernel.

Launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work.

Values are best used to identify relative performance differences between unoptimized and optimized code

In other words, try to reduce the magnitudes of gld/gst_incoherent, divergent_branch, and warp_serialize

Profiling


Performance for 4M element reduction

                                           Time (2^22 ints)  Bandwidth     Step     Cumulative
Kernel 1: interleaved addressing
          with divergent branching         8.054 ms          2.083 GB/s
Kernel 2: interleaved addressing
          with bank conflicts              3.456 ms          4.854 GB/s    2.33x    2.33x
Kernel 3: sequential addressing            1.722 ms          9.741 GB/s    2.01x    4.68x
Kernel 4: first add during global load     0.965 ms          17.377 GB/s   1.78x    8.34x
Kernel 5: unroll last warp                 0.536 ms          31.289 GB/s   1.8x     15.01x
Kernel 6: completely unrolled              0.381 ms          43.996 GB/s   1.41x    21.16x
Kernel 7: multiple elements per thread     0.268 ms          62.671 GB/s   1.42x    30.04x

Kernel 7 on 32M elements: 73 GB/s!
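The bandwidth column is just bytes moved divided by elapsed time; checking kernel 7's row (2^22 ints of 4 bytes, read once, in 0.268 ms) lands at about 62.6 GB/s, in line with the table:

```cpp
#include <cassert>
#include <cmath>

// Effective bandwidth in GB/s from bytes moved and seconds elapsed.
double bandwidthGBs(double bytes, double seconds) {
    return bytes / seconds / 1e9;
}
```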

Example


Build your own!

Friday, January 23, 2009


© 2008 NVIDIA Corporation.
slide by David Kirk

Thank you!


Back Pocket Slides

slide by David Cox


Misc

IAP09 CUDA@MIT / 6.963


M02: High Performance Computing with CUDA

Tesla C1060 Computing Processor

Processor: 1x Tesla T10P
Core GHz: 1.33 GHz
Form factor: Full ATX, 4.736" (H) x 10.5" (L), dual slot wide
On-board memory: 4 GB
System I/O: PCIe x16 gen2
Memory I/O: 512-bit, 800 MHz DDR; 102 GB/s peak bandwidth
Display outputs: None
Typical power: 160 W
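The 102 GB/s peak follows from the memory interface itself: a 512-bit bus moving data twice per 800 MHz clock (DDR). A quick check of that arithmetic:

```cpp
#include <cassert>
#include <cmath>

// Peak bandwidth in GB/s: bus width in bytes, times clock rate,
// times 2 transfers per clock (double data rate).
double peakGBs(int busBits, double clockHz) {
    return (busBits / 8.0) * clockHz * 2.0 / 1e9;
}
```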


Tesla S1070 1U System

Processors: 4 x Tesla T10P
Core GHz: 1.5 GHz
Form factor: 1U for an EIA 19" 4-post rack
Total 1U system memory: 16 GB (4.0 GB per GPU)
System I/O: 2 PCIe x16
Memory I/O per processor: 512-bit, 800 MHz GDDR; 102 GB/s peak bandwidth
Display outputs: None
Typical power: 700 W
Chassis dimensions: 1.73" H x 17.5" W x 28.5" D


Double Precision Floating Point

Precision: NVIDIA GPU IEEE 754; SSE2 IEEE 754; Cell SPE IEEE 754
Rounding modes for FADD and FMUL: NVIDIA GPU all 4 IEEE (round to nearest, zero, inf, -inf); SSE2 all 4 IEEE; Cell SPE round to zero/truncate only
Denormal handling: NVIDIA GPU full speed; SSE2 supported, costs 1000's of cycles; Cell SPE flush to zero
NaN support: NVIDIA GPU yes; SSE2 yes; Cell SPE no
Overflow and Infinity support: NVIDIA GPU yes; SSE2 yes; Cell SPE no infinity, clamps to max norm
Flags: NVIDIA GPU no; SSE2 yes; Cell SPE some
FMA: NVIDIA GPU yes; SSE2 no; Cell SPE yes
Square root: NVIDIA GPU software with low-latency FMA-based convergence; SSE2 hardware; Cell SPE software only
Division: NVIDIA GPU software with low-latency FMA-based convergence; SSE2 hardware; Cell SPE software only
Reciprocal estimate accuracy: NVIDIA GPU 24 bit; SSE2 12 bit; Cell SPE 12 bit
Reciprocal sqrt estimate accuracy: NVIDIA GPU 23 bit; SSE2 12 bit; Cell SPE 12 bit
log2(x) and 2^x estimates accuracy: NVIDIA GPU 23 bit; SSE2 no; Cell SPE no
