Accelerating Proton Computed Tomography with GPUs
Thomas D. Uram, Argonne Leadership Computing Facility
Michael E. Papka, Argonne Leadership Computing Facility, Northern Illinois University
Nicholas T. Karonis, Northern Illinois University, Argonne National Laboratory

Slides: on-demand.gputechconf.com/gtc/2015/presentation/S5497-Thomas-Uram.pdf

Jun 19, 2019



Argonne Leadership Computing Facility - Thomas D. Uram ([email protected])

Overview
‣ Proton computed tomography (pCT) is an alternative to X-ray based CAT scans, which promises several medical benefits at the cost of being significantly more computationally expensive
‣ We designed a 60-node GPU cluster to meet the computational challenge

‣ Computed tomography
‣ Benefits of proton computed tomography
‣ Computational problem description
‣ CPU/GPU performance comparison


What is Computed Tomography?
‣ CAT (or CT) scans are well-known
‣ CAT == "computerized axial tomography"
‣ CAT scans are used to reconstruct the density distribution within a volume, typically used in medical imaging
‣ CAT scans are conducted with photons (X-rays)

‣ What is Proton Computed Tomography?
  • A reconstruction technique similar to X-ray computed tomography, conducted with protons instead of photons


Why Proton Computed Tomography?
‣ 13 million people are diagnosed with cancer each year worldwide
‣ 2.6 million of them are candidates for proton therapy treatment
‣ Proton therapy involves depositing protons at precise locations within a tumor site where they irradiate the target tissue
‣ The protons emit lower radiation as they travel through the body until they reach the target, where they emit a burst of radiation (the Bragg peak)
  • Healthy tissue beyond the tumor site receives nominally no radiation
‣ It is crucially important to precisely identify the tumor site
  • To ensure that cancerous tissue is destroyed
  • To avoid damaging healthy tissue surrounding the tumor, especially in sensitive areas
‣ Proton therapy treatment planning is currently performed using X-ray imaging
  • Photons and protons interact with intermediate material differently
  • Conversion between photon/proton modalities involves a systematic range error of 3-5%

Image source: Wikipedia


Proton computed tomography
‣ Our goal is to reconstruct volume of adult human head in under 10 minutes
‣ Protons directed through two frontal planes, the target volume, two backing planes, and finally a calorimeter
‣ Measures position and angle of incidence of protons at entry and exit, and the energy loss

Final System (in black): 4 tracking planes with XY Si detectors; calorimeter with 64 end-on CsI crystals
Planned Scaled Prototype (in red): 4 planes of XY Si detectors (2 X-SSDs and 2 Y-SSDs per plane); 8 CsI crystal bars
Calorimeter: Each bar corresponds to a 5cm x 5cm CsI crystal, read out by a photodiode
Tracking Plane: Each large square corresponds to one double-sided or two single-sided 9cm x 9cm SSDs


Problem Description
‣ Proton source, detector planes, and calorimeter mounted on rotating gantry, as in familiar X-ray CT configurations
‣ Data collected over a full rotation of the gantry, 180 samples (every 2 degrees)
‣ Initial detector designed to image a human head (nominally 25cm cube)
‣ Based on the physics, and so that each voxel is sufficiently represented in the resulting system matrix, we estimate that reconstruction requires a volume of 256x256x36 (2,359,296 ≈ 2.4M) voxels and 2 billion protons in total
‣ For each proton, we track 11 values:
  • [x,y,z] at entry
  • [x,y,z] at exit
  • angle at entry and exit
  • input and output energy
  • gantry rotation angle
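As a rough illustration of the per-proton record listed above, the following sketch lays out the 11 tracked values as a NumPy structured dtype and derives the problem sizes quoted on the slide. The field names and the choice of 32-bit floats are assumptions made for illustration; the slides do not specify a storage format.

```python
import numpy as np

# Hypothetical field names; float32 storage is an assumption, not from the slides.
PROTON_DTYPE = np.dtype([
    ("entry_xyz", np.float32, (3,)),   # [x, y, z] at entry
    ("exit_xyz", np.float32, (3,)),    # [x, y, z] at exit
    ("entry_angle", np.float32),       # angle at entry
    ("exit_angle", np.float32),        # angle at exit
    ("energy_in", np.float32),         # input energy
    ("energy_out", np.float32),        # output energy
    ("gantry_angle", np.float32),      # gantry rotation angle
])

# Number of scalar values stored per proton.
values_per_proton = PROTON_DTYPE.itemsize // np.dtype(np.float32).itemsize

# Reconstruction volume and raw input size for the full problem.
n_voxels = 256 * 256 * 36
n_protons = 2_000_000_000
raw_input_gb = n_protons * PROTON_DTYPE.itemsize / 1e9

print(values_per_proton)  # 11
print(n_voxels)           # 2359296
print(raw_input_gb)       # 88.0
```

Under these assumptions the raw measurements alone occupy about 88 GB, before any path or matrix data is generated.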


Baseline execution times
‣ Began with serial code that took more than 7 hours to process 131M protons
‣ Parallelized with MPI to use multiple CPUs
‣ Established baseline execution times

Phase                      Execution time (seconds)
Setup                      128.2
Most Likely Path (MLP)     1278.5
Linear solver (CARP)       664.9
Overall execution time     2072.0

1 billion protons, 60 nodes, CPU only


MLP (Most Likely Path)
‣ In contrast with X-ray computed tomography in which the particles traverse the volume in straight lines, in pCT the protons are scattered by the material as they travel through the volume
‣ MLP computes the path integral of the protons through the material based on their known entry and exit locations and angles and the energy loss
‣ The proton paths are discretized as the voxels touched while traversing the volume
‣ Path integral calculations are independent and parallelize at the level of protons (but inherently sequential within each path)
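The discretization step above can be sketched as follows. This is not the MLP formalism itself: a cubic Hermite curve through the measured entry and exit points and directions stands in for the most likely path, purely to show how a curved track is turned into an ordered list of touched voxels. The function names, the 2D grid, and the fixed sampling density are all assumptions for illustration.

```python
import numpy as np

def hermite_path(p0, d0, p1, d1, n_samples=512):
    """Sample a cubic Hermite curve from p0 (direction d0) to p1 (direction d1).

    Stand-in for a most-likely-path estimate; not the MLP formula.
    """
    p0, d0, p1, d1 = (np.asarray(v, dtype=float) for v in (p0, d0, p1, d1))
    chord = np.linalg.norm(p1 - p0)
    # Scale unit directions by the chord length to get tangent vectors.
    m0 = d0 / np.linalg.norm(d0) * chord
    m1 = d1 / np.linalg.norm(d1) * chord
    t = np.linspace(0.0, 1.0, n_samples)[:, None]
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1

def touched_voxels(points, grid_shape, voxel_size):
    """Map sampled points to voxel indices, in traversal order.

    Points outside the grid are dropped and consecutive repeats are merged.
    Sampling can miss voxels that the curve only clips at a corner.
    """
    idx = np.floor(points / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx = idx[inside]
    keep = np.ones(len(idx), dtype=bool)
    keep[1:] = np.any(idx[1:] != idx[:-1], axis=1)
    return idx[keep]

# A proton entering at the left edge heading slightly upward and
# leaving at the right edge heading slightly downward.
points = hermite_path((0.5, 3.5), (1.0, 0.3), (7.5, 3.5), (1.0, -0.3))
path = touched_voxels(points, grid_shape=(8, 8), voxel_size=1.0)
print(len(path), "voxels touched")
```

Each call depends only on one proton's measurements, which matches the slide's point that the work parallelizes across protons while each individual path is walked sequentially.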


Linear solver (CARP)
‣ The result of MLP is a system of equations relating each proton's touched voxels to the relative stopping power (roughly, the energy loss)
‣ We began the project with a CPU implementation of the row-action based sparse iterative solver CARP (component averaged row projections)
‣ CARP decomposes the matrix into row blocks, one block per processor, and iterates to satisfactory convergence:
  • Performs a Jacobi-like iteration sequentially through the rows to produce a per-block solution vector
  • Averages the per-block solution vectors (in component-wise fashion)
  • Redistributes the solution vector x to all processors
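The three CARP steps above can be sketched in a few lines of NumPy. This is a single-process, dense-matrix illustration rather than the project's MPI implementation. The inner sweep uses Kaczmarz-style row projections, which is an assumption about what the slide's "Jacobi-like iteration" refers to; the function names, the relaxation parameter, and the small test system are hypothetical.

```python
import numpy as np

def kaczmarz_sweep(A, b, x, relax=1.0):
    """One sequential pass of row projections over the rows of A, starting from x."""
    x = x.copy()
    for a_i, b_i in zip(A, b):
        norm_sq = a_i @ a_i
        if norm_sq > 0.0:
            x += relax * (b_i - a_i @ x) / norm_sq * a_i
    return x

def carp(A, b, n_blocks, n_iters=200, relax=1.0):
    """Component-averaged row projections on a dense matrix (illustrative only)."""
    row_blocks = np.array_split(np.arange(A.shape[0]), n_blocks)
    # touches[k, j] is True when row block k has a nonzero in column j.
    touches = np.array([(A[rows] != 0).any(axis=0) for rows in row_blocks])
    counts = np.maximum(touches.sum(axis=0), 1)
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        # Step 1: each block sweeps its own rows independently (parallel in practice).
        block_x = np.array([kaczmarz_sweep(A[rows], b[rows], x, relax)
                            for rows in row_blocks])
        # Step 2: average each component over the blocks that touch it.
        summed = np.where(touches, block_x, 0.0).sum(axis=0)
        # Step 3: the averaged vector becomes the shared x for the next iteration.
        x = np.where(touches.any(axis=0), summed / counts, x)
    return x

# Small consistent system: each row sums two unknowns.
A = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1]], dtype=float)
x_true = np.array([1.0, 2.0, 3.0, 4.0])
b = A @ x_true
x = carp(A, b, n_blocks=3, n_iters=2000)
print(np.round(x, 6))
```

In a distributed setting, steps 2 and 3 would correspond to a reduction and broadcast across processors, which is where the communication cost of the method sits.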


Hardware: Gaea GPU cluster at Northern Illinois University
‣ 60 compute nodes
‣ Node configuration
  • 2x Intel X5650 CPUs (12 cores per node)
  • 2x NVIDIA M2070 GPUs
  • 72GB RAM
  • QDR InfiniBand


Data decomposition
‣ 2.1B protons / 60 nodes ≈ 35M protons per node
‣ 2 GPUs -> 17M protons per GPU
‣ The maximum voxels per proton is ~364
‣ 17M protons x 364 voxels x 4 bytes/voxel = 25GB data per GPU
  • Larger than available M2070 GPU memory of 6GB
‣ High watermark memory requirement on cluster is 3TB (aggregate)
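The sizing arithmetic above can be checked with a short calculation. The variable names are illustrative, and treating 6GB as 6e9 bytes is an assumption; the batch count is a lower bound that ignores any other data resident on the GPU.

```python
import math

total_protons = 2.1e9
nodes = 60
gpus_per_node = 2
max_voxels_per_proton = 364
bytes_per_voxel_entry = 4
gpu_memory_bytes = 6e9  # M2070; assumes 6GB means 6e9 bytes

protons_per_node = total_protons / nodes             # 35 million
protons_per_gpu = protons_per_node / gpus_per_node   # 17.5 million; the slide rounds to 17M

# Padded path storage per GPU, using the slide's rounded 17M figure.
bytes_per_gpu = 17e6 * max_voxels_per_proton * bytes_per_voxel_entry

# Lower bound on the number of batches needed to stream this through one GPU.
min_batches = math.ceil(bytes_per_gpu / gpu_memory_bytes)

print(protons_per_node)                   # 35000000.0
print(round(bytes_per_gpu / 1e9, 2))      # 24.75
print(min_batches)                        # 5
```

The path data per GPU is roughly four times the device memory, which is why the implementation has to stage protons in batches.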


MLP (Most Likely Path) CUDA implementation
‣ MLP involves calculating path integral of the protons
‣ Initial implementation assigns a thread per proton
‣ Per-GPU proton data is larger than GPU memory on M2070
‣ Stage batches of protons to GPU
‣ MLP was ported to the GPU, with multiple variants
  • gpu struct: Direct port of CPU-based code using structured proton/voxel data
  • gpu flat memory: Flat memory space with per-proton padded voxel arrays
  • gpu flat memory + overlap: Streaming computation to overlap compute and host-device transfers
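A host-side sketch of the "flat memory" layout and the batching described above, written with NumPy rather than CUDA. Every proton gets a fixed-width row of voxel indices padded with a sentinel, so a thread handling proton i would read or write row i without following pointers. The function names, the -1 padding value, and the int32 index type are assumptions for illustration, not details taken from the slides.

```python
import numpy as np

def pack_flat(paths, max_voxels, pad=-1):
    """Pack ragged per-proton voxel-index lists into one padded 2D array plus lengths."""
    flat = np.full((len(paths), max_voxels), pad, dtype=np.int32)
    lengths = np.zeros(len(paths), dtype=np.int32)
    for i, path in enumerate(paths):
        if len(path) > max_voxels:
            raise ValueError("path longer than the padded row width")
        flat[i, :len(path)] = path
        lengths[i] = len(path)
    return flat, lengths

def batches(n_protons, batch_size):
    """Yield (start, stop) index ranges that cover all protons in fixed-size batches."""
    for start in range(0, n_protons, batch_size):
        yield start, min(start + batch_size, n_protons)

# Three protons with paths of different lengths, padded to width 4.
paths = [[5, 6, 7], [2], [9, 10]]
flat, lengths = pack_flat(paths, max_voxels=4)
print(flat.tolist())     # [[5, 6, 7, -1], [2, -1, -1, -1], [9, 10, -1, -1]]
print(lengths.tolist())  # [3, 1, 2]

# Each batch is a contiguous slice, so it can be copied to the device in one transfer.
for start, stop in batches(len(paths), batch_size=2):
    print(start, stop, flat[start:stop].nbytes)
```

The padding wastes memory on short paths but keeps every batch contiguous, which is also what makes it practical to overlap the transfer of one batch with computation on another.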


MLP (Most Likely Path) CUDA implementation (26M protons, 2 GPUs)

Implementation               Execution time (seconds)   Speedup
cpu                          598.7                      -
gpu_struct                   77.6                       7.7x
gpu_flat_memory              55.5                       10.8x
gpu_flat_memory + overlap    53.0                       11.3x


Linear solver (CARP) CUDA implementation (26M protons, 2 GPUs)
‣ CARP ported directly from CPU code
‣ Per-node row-block data larger than GPU memory; batch process
‣ Further subdivide per-node row-block into row-blocks per streaming multiprocessor
‣ Limited speedup in GPU implementation, because:
  • row-action based solver constrains parallel granularity
  • scattered memory accesses constrain performance, as is typical of sparse matrix operations

Implementation    Execution time (seconds)   Speedup
cpu               161.0                      -
gpu               139.3                      1.16x


Performance at scale

Phase                      Execution time (seconds)
Setup                      22.3
Most Likely Path (MLP)     151.0
Linear solver (CARP)       265.5
Overall execution time     438.8

Initial goal was to complete in <600s (10 mins)

2 billion protons, 60 nodes, 12 CPU cores/node, 2 GPUs/node


Further work: CARP Hybrid CPU/GPU
‣ Assign row blocks to CPU and GPU simultaneously
‣ Weighted work distribution based on initial performance measurements

Implementation    Execution time (seconds)   Speedup
cpu               161.0                      -
gpu               139.3                      1.16x
hybrid            102.3                      1.57x

2 billion protons, 60 nodes, 12 cores/node, 2 GPUs/node
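One way to picture the weighted work distribution is to split rows in proportion to each device's measured throughput. The sketch below uses the cpu and gpu timings from the table; the function name and the row count are hypothetical, and the idealized time assumes perfect overlap with no averaging, communication, or transfer overhead.

```python
def split_rows(n_rows, cpu_seconds, gpu_seconds):
    """Split rows in proportion to throughput (1 / time to process the same work)."""
    cpu_rate = 1.0 / cpu_seconds
    gpu_rate = 1.0 / gpu_seconds
    gpu_rows = round(n_rows * gpu_rate / (cpu_rate + gpu_rate))
    return n_rows - gpu_rows, gpu_rows

cpu_s, gpu_s = 161.0, 139.3  # timings from the table above
cpu_rows, gpu_rows = split_rows(1_000_000, cpu_s, gpu_s)

# Idealized time if both devices worked concurrently on perfectly balanced shares.
ideal_hybrid_s = 1.0 / (1.0 / cpu_s + 1.0 / gpu_s)

print(cpu_rows, gpu_rows)
print(round(ideal_hybrid_s, 1))
```

Under these assumptions the GPU takes slightly more than half of the rows, and the idealized bound of about 75 seconds sits below the measured 102.3 seconds, as expected for an estimate that ignores all coordination costs.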


Future work
‣ Integrate alternative linear solvers to improve performance (AmgX, cuSPARSE, PETSc)
‣ Consider alternate data decompositions to improve cache locality
  • volume slab per streaming multiprocessor
  • volume wedge per streaming multiprocessor
‣ Measure performance on next-generation GPUs
  • K80 for greater performance
  • Jetson/TK1 for greater performance/watt
‣ Experiment with GPU cloud platforms (Amazon cloud)


Acknowledgements
Nicholas T. Karonis, Northern Illinois University (NIU) and Argonne National Laboratory (ANL)
Michael E. Papka, NIU and ANL
Caesar Ordoñez, NIU
Eric Olson, ANL
Kirk Duffin, NIU
Venkat Vishwanath, ANL

US Department of Defense contract number W81XWH-10-1-0170 sponsored this work.