Evaluation of FPGAs resurgence for hardware acceleration applied to computed tomography
3D Tomography back-projection parallelization on
FPGAs using OpenCL
Presented by: Maxime MARTELLI, 1st-year PhD student
L2S, SATIE, TSA
2017 GPU Winter School, Grenoble, FR
CONTEXT
The end of Moore's law has been announced for 2021
Architecture-Algorithm Adequacy:
- Granular hardware specialization
- Processors will offload specific processing to a suited architecture
Multiplication of software-oriented FPGA design tools
HYPOTHESIS: The idea
Does the progress of HLS tools mean a resurgence of FPGAs for computed tomography?
With the rise of Accelerator-as-a-Service (AaaS), what is the future landscape for FPGAs?
Summary
I. What is OpenCL?
II. Why use HLS on FPGAs?
III. Use case highlight
IV. OpenCL memory model
V. Custom implementations
VI. Conclusion and perspectives
I. WHAT IS OPENCL?
• Open, royalty-free standard for parallel, compute-intensive application development
• Initiated by Apple, specification maintained by the Khronos group
• Supports multiple device classes: CPUs, GPUs, DSPs, Cell, etc.
• First released in December 2008
• Specification currently at version 2.0
• SDKs and tools are provided by compliant device vendors
OpenCL basics
• Proprietary technology for GPGPU programming from Nvidia
• Not just API and tools, but name for the whole architecture
• Targets Nvidia hardware and GPUs only
• First SDK released February 2007
• SDK and tools available for 32- and 64-bit Windows, Linux and Mac OS
• Tools and SDK are available for free from Nvidia.
CUDA basics
Basics compared
                         CUDA                            OpenCL
What it is               HW architecture, programming    Open API and language
                         language, API, SDK and tools    specification
Proprietary or open      Proprietary                     Open and royalty-free
When introduced          Q4 2006                         Q4 2008
SDK vendor               Nvidia                          Implementation vendors
Free SDK                 Yes                             Depends on vendor
Heterogeneous device     No, just Nvidia GPUs            Yes (Apple, Nvidia, AMD,
support                                                  IBM, Intel, …)
OpenCL Memory Architecture
CUDA Memory Architecture
OpenCL Execution model
II. WHY USE HLS ON FPGAS ?
Field Programmable Gate Array (FPGA)
Programmable Switch Fabric (Source: Intel)
CPU instruction mapping
(Source: Intel)
CPU execution path (1)
(Source: Intel)
CPU execution path (2)
(Source: Intel)
CPU vs FPGA execution
(Source: Intel)
• Custom data-path that matches your algorithms
• Uses exactly what you need (operations, data width, memory configuration, …)
• Timing closure and reduced power consumption
• Much easier programming than VHDL
Advantages of FPGA HLS
III. USE CASE HIGHLIGHT
Brief history
In 2004, FPGAs were widely used in tomography
For the past 10 years, GPUs have dominated the field
With the evolution of HLS tools, a new interest in FPGAs is emerging
3D Computed Tomography Projection
Back-projection algorithm
Memory-bound algorithm: memory accesses per computation ≈ 3.5
Density calculation:
d(c) = \int \mathrm{sino}\big(u(\varphi, c),\, v(\varphi, c),\, \varphi\big)\, w(\varphi, c)\, d\varphi
Input:  α[dimϕ], β[dimϕ], sinogram[dimU × dimV × dimϕ]
Output: volume[dimX, dimY, dimZ]

For z = 0 to dimZ − 1
  For y = 0 to dimY − 1
    For x = 0 to dimX − 1
      voxelsum = 0
      For ϕ = 0 to dimϕ − 1
        Calculate (U, V) from α[ϕ] and β[ϕ]
        voxelsum += sinogram[U, V, ϕ]
      volume[x, y, z] = voxelsum
Massively parallel: 256³ voxels, 256 angle variations
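The loop nest above can be sketched as plain C. This is an illustrative reference, not the deck's actual kernel: dimensions are scaled down, and the (U, V) detector-coordinate computation from α[ϕ] and β[ϕ] is replaced by a placeholder mapping.

```c
/* Illustrative CPU reference of the back-projection loop nest.
 * Tiny dimensions; the (U, V) computation is a placeholder standing
 * in for the real alpha/beta projection geometry. */
enum { DIM_X = 4, DIM_Y = 4, DIM_Z = 4, DIM_PHI = 8, DIM_U = 4, DIM_V = 4 };

static void backproject(const float sino[DIM_PHI * DIM_V * DIM_U],
                        float volume[DIM_Z * DIM_Y * DIM_X])
{
    for (int z = 0; z < DIM_Z; z++)
        for (int y = 0; y < DIM_Y; y++)
            for (int x = 0; x < DIM_X; x++) {
                float voxelsum = 0.0f;
                for (int phi = 0; phi < DIM_PHI; phi++) {
                    /* placeholder for: calculate (U, V) from alpha[phi], beta[phi] */
                    int U = x % DIM_U;
                    int V = y % DIM_V;
                    voxelsum += sino[(phi * DIM_V + V) * DIM_U + U];
                }
                volume[(z * DIM_Y + y) * DIM_X + x] = voxelsum;
            }
}
```

Every voxel is independent of the others, which is what makes the outer three loops massively parallel; only the inner ϕ loop carries the accumulation.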
Back-projection results on FPGA
Main contributions
01 Benchmark the different memory structures
02 Implement algorithm-focused optimizations
03 Assess OpenCL code optimization for FPGA
IV. OPENCL MEMORY MODEL
Memory structure latency on an Altera Cyclone V
Mean latency (cycles): Global 240 | Constant 10 | Local 15 | Private 3
Measurement is tricky because of the LSU's embedded cache, hence a custom benchmark based on random reads
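One common way to build such a random-read benchmark is a pointer chase over a shuffled permutation: each load depends on the previous one, so caches and prefetchers cannot hide the latency. This is a generic sketch of the technique, not the deck's actual OpenCL benchmark; `N` and the function names are illustrative.

```c
#include <stdlib.h>

/* Pointer-chase micro-benchmark: a single N-cycle through a shuffled
 * permutation serializes all loads, exposing raw memory latency. */
#define N 1024

static void build_chain(int next[N])
{
    int perm[N];
    for (int i = 0; i < N; i++) perm[i] = i;
    for (int i = N - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        int j = rand() % (i + 1);
        int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (int i = 0; i < N; i++)                /* perm[i] -> perm[i+1] cycle */
        next[perm[i]] = perm[(i + 1) % N];
}

static int chase(const int next[N], int steps)
{
    int idx = 0;
    for (int s = 0; s < steps; s++)
        idx = next[s >= 0 ? idx : 0];          /* dependent loads, no ILP */
    return idx;
}
```

Timing `chase` over many steps and dividing by the step count gives the mean latency per access; on an FPGA the same chain would live in global, constant, local or private memory to compare the four.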
V. CUSTOM IMPLEMENTATIONS
OpenCL work-group enqueueing mechanisms:
- Data parallelism: NDRange
- Task parallelism: Single Work-Item (SWI)
Experiment setup: DE1-SoC
- Dual-core ARM Cortex-A9 processor and FPGA fabric within an Altera Cyclone V
- 1 GB of DDR3 memory
- Max FPGA frequency: 205 MHz
- Intel FPGA SDK for OpenCL 16.0
Implementation 1: Shift-Register Pattern (TP: task parallelism)
Input:  α[dimϕ], β[dimϕ], sinogram[dimU × dimV × dimϕ]
Output: volume[dimX, dimY, dimZ]

For ϕ = 0 to dimϕ − 1
  SRP[ϕ] = (α[ϕ], β[ϕ])              ◁ SRP initialization
For z = 0 to dimZ − 1
  For y = 0 to dimY − 1
    For x = 0 to dimX − 1
      voxelsum = 0
      #pragma unroll                  ◁ Task parallelism
      For ϕ = 0 to dimϕ − 1
        SRP[dimϕ − 1] = SRP[0]        ◁ SRP implementation
        For i = 0 to dimϕ − 2
          SRP[i] = SRP[i + 1]
        Calculate (U, V) from α[ϕ] and β[ϕ]
        voxelsum += sinogram[U, V, ϕ]
      volume[x, y, z] = voxelsum
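The rotation at the heart of the shift-register pattern can be sketched in plain C as follows. `N_ANGLES` and the `angle_t` pair are illustrative names; on the FPGA the inner loop is fully unrolled (`#pragma unroll`), so the HLS compiler maps the array to a hardware shift register where every element moves in the same clock cycle.

```c
/* Sketch of the shift-register pattern (SRP): the (alpha, beta) pairs
 * rotate through a fixed-size array each iteration. Sequential C here;
 * unrolled on the FPGA, all assignments happen in one cycle. */
#define N_ANGLES 8

typedef struct { float alpha, beta; } angle_t;

/* Rotate the register once: element 0 moves to the tail. */
static void srp_rotate(angle_t srp[N_ANGLES])
{
    angle_t head = srp[0];
    for (int i = 0; i < N_ANGLES - 1; i++)   /* fully unrolled on FPGA */
        srp[i] = srp[i + 1];
    srp[N_ANGLES - 1] = head;
}
```

After dimϕ rotations the register is back in its initial state, so each voxel's ϕ loop always reads its parameters from the register head instead of issuing a fresh global load.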
Implementation 2: Memory pre-fetching (DP)

Input:  const α[dimϕ], const β[dimϕ], sinogram[dimU × dimV × dimϕ]
Output: volume[dimX, dimY, dimZ]
Local int local_sinogram[Xoff × Yoff]

/* Recovery of work-item characteristics (x, y, z) */
voxelsum = 0
For ϕ = 0 to dimϕ − 1
  /* Calculate Un, Vn coordinates */
  /* Dispatch min, max coordinate computation between local work-items */
  barrier(CLK_LOCAL_MEM_FENCE)
  /* Global sinogram fetching by local work-items */
  barrier(CLK_LOCAL_MEM_FENCE)
  voxelsum += local_sinogram[localU, localV]
volume[x, y, z] = voxelsum
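A scaled-down C analogue of the pre-fetching phase might look like this. The `__local` buffer and the two barriers are modeled by a sequential copy-then-read over a plain array; all names and tile sizes are illustrative, not taken from the deck's kernel.

```c
/* Sketch of memory pre-fetching: copy the sinogram tile a work-group
 * needs into fast local memory, then serve all reads from that copy.
 * In OpenCL the copy is split across work-items and fenced by
 * barrier(CLK_LOCAL_MEM_FENCE); here it is simply sequential. */
#define TILE_U 4
#define TILE_V 4

static float prefetch_sum(const float *global_sino, int base_u, int base_v,
                          int dim_u)
{
    float local_sino[TILE_U * TILE_V];   /* stand-in for __local memory */

    /* phase 1: cooperative fetch from global memory */
    for (int v = 0; v < TILE_V; v++)
        for (int u = 0; u < TILE_U; u++)
            local_sino[v * TILE_U + u] =
                global_sino[(base_v + v) * dim_u + (base_u + u)];

    /* barrier(CLK_LOCAL_MEM_FENCE) would separate the phases here */

    /* phase 2: accumulate from local memory only */
    float sum = 0.0f;
    for (int i = 0; i < TILE_U * TILE_V; i++)
        sum += local_sino[i];
    return sum;
}
```

The point of the pattern is that each global sinogram element is fetched once per work-group instead of once per work-item, trading a little local memory for far fewer high-latency global accesses.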
Kernels implementations on Cyclone V SoC

Raw execution time (s):
SWI+Naive 222.9 | SWI+SRP 67.5 | ND+Naive 32.26 | ND+2CU 16.9 | ND+MF 31.3 | ND+Backbone 30.8

Key Points
- ND+2CU: linear extrapolation model verification
- ND+Backbone: irreducible logic utilization
- ND+MF uses less logic than naïve NDRange
- SWI+SRP uses less logic and is faster than naïve SWI
Logic utilization (%):
SWI+Naive 49 | SWI+SRP 36 | ND+Naive 55 | ND+2CU 96 | ND+MF 40 | ND+Backbone 21
Normalized execution time (s, raw time × logic utilization):
SWI+Naive 109.2 | SWI+SRP 24.3 | ND+Naive 17.7 | ND+2CU 16.2 | ND+MF 12.5 | ND+Backbone 6.47

Speedup from SWI+Naïve to ND+MF: 8.74
Matching VHDL FPGA implementations for ND+MF
Computation rate: 137 M"voxel"/s
Kernel frequencies (MHz): SWI+Naive 68 | SWI+SRP 112 | ND+Naive 140 | ND+2CU 140 | ND+MF 140 | ND+Backbone 140
GPU vs FPGA with OpenCL

Device                  Execution time (ms)   Power (W)   Energy (mWh)
Titan X Pascal (GPU)    12                    250         0.83
Jetson TX2 (GPU)        94                    15          0.39
Arria 10 (FPGA)         991                   2.27        0.63

- An embedded GPU is more energy efficient
- Algorithm inadequacy implies longer FPGA execution time
- Low FPGA power consumption
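As a sanity check on the numbers above, the energy figures follow directly from average power × execution time (1 mWh = 3.6 J, and W × ms = mJ, hence the division by 3600). A one-line helper:

```c
/* Energy in mWh from average power (W) and execution time (ms).
 * 1 mWh = 3.6 J, so power_w * time_ms gives mJ and /3600 gives mWh. */
static double energy_mwh(double power_w, double time_ms)
{
    return power_w * time_ms / 3600.0;
}
```

For example 250 W over 12 ms gives ≈ 0.83 mWh, consistent with the Titan X Pascal figures, and 2.27 W over 991 ms gives ≈ 0.63 mWh for the Arria 10, which is why the low-power FPGA still loses on energy to the Jetson here: its run is roughly 10× longer.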
VI. CONCLUSION AND PERSPECTIVES
CONCLUSION
- The Intel SDK guarantees one "voxel" computation per clock cycle
- Achieved a speedup of 8.74 with little hardware knowledge
- FPGAs still fall short of embedded GPUs (performance and power) for this family of CT algorithms

FPGA (2009) = FPGA OpenCL
Efficient tool for software developers
FPGA < Embedded GPU
PERSPECTIVES
Room for improvement by reducing kernel footprint or increasing kernel frequency
Many algorithms, like radar clutter computation, are well adapted to FPGAs' strengths
Old algorithms not fit for GPUs can re-emerge
Is computed tomography an adapted use case for FPGAs?
Next?
- Bigger card
- Xilinx SDx evaluation
- New adapted algorithm
THANK YOU
Any questions or comments are welcome!
FPGA key numbers
2015 global market: $6.36 billion (market shares: Intel, Xilinx, others)
In 2016, FPGAs outgrew the overall semiconductor market (6.9% vs 1.5% average annual growth)
The market is expected to reach $10 billion by 2024
Xilinx remains the leading FPGA manufacturer