Evaluation of FPGAs resurgence for hardware acceleration applied to computed tomography
3D Tomography back-projection parallelization on
FPGAs using OpenCL
Presented by: Maxime MARTELLI, 1st-year PhD student
L2S, SATIE, TSA
2017 GPU Winter School, Grenoble, FR
CONTEXT
The end of Moore's law has been announced for 2021
Architecture-Algorithm Adequacy:
- Granular hardware specialization
- Processors will offload specific processing to a suited architecture
Multiplication of software-oriented FPGA design tools
HYPOTHESIS: The idea
Does the progress of HLS tools mean a resurgence of FPGAs for computed tomography?
With the rise of Accelerator-as-a-Service (AaaS), what is the future landscape for FPGAs?
Summary
I. What is OpenCL?
II. Why use HLS on FPGAs?
III. Use case highlight
IV. OpenCL memory model
V. Custom implementations
VI. Conclusion and perspectives
I. WHAT IS OPENCL?
• Open, royalty-free standard for parallel, compute-intensive application development
• Initiated by Apple, specification maintained by the Khronos group
• Supports multiple device classes: CPUs, GPUs, DSPs, Cell, etc.
• First released in December 2008
• Specification currently at version 2.0
• SDKs and tools are provided by compliant device vendors
OpenCL basics
• Proprietary technology for GPGPU programming from Nvidia
• Not just API and tools, but name for the whole architecture
• Targets Nvidia hardware and GPUs only
• First SDK released February 2007
• SDK and tools available for 32- and 64-bit Windows, Linux and Mac OS
• Tools and SDK are available for free from Nvidia.
CUDA basics
Basics compared
                         CUDA                            OpenCL
What it is               HW architecture, programming    Open API and language
                         language, API, SDK and tools    specification
Proprietary or open      Proprietary                     Open and royalty-free
When introduced          Q4 2006                         Q4 2008
SDK vendor               Nvidia                          Implementation vendors
Free SDK                 Yes                             Depends on vendor
Heterogeneous device     No, just Nvidia GPUs            Yes (Apple, Nvidia, AMD,
support                                                  IBM, Intel, …)
OpenCL Memory Architecture
CUDA Memory Architecture
OpenCL Execution model
II. WHY USE HLS ON FPGAS ?
Field Programmable Gate Array (FPGA)
Programmable Switch Fabric (Source: Intel)
CPU instruction mapping
(Source: Intel)
CPU execution path (1)
(Source: Intel)
CPU execution path (2)
(Source: Intel)
CPU vs FPGA execution
(Source: Intel)
• Custom data-path that matches your algorithms
• Uses exactly what you need (operations, data width, memory configuration, …)
• Timing closure and reduced power consumption
• Much easier programming than VHDL
Advantages of FPGA HLS
III. USE CASE HIGHLIGHT
Brief history
In 2004, FPGAs were widely used in tomography
For the past 10 years, GPUs have dominated the field
With the evolution of HLS tools, a new interest in FPGAs is emerging
3D Computed Tomography Projection
Back-projection algorithm
Memory-bound algorithm: memory accesses per computation ≈ 3.5
Density calculation:
d(c) = \int \mathrm{sino}\big(u(\varphi, c),\, v(\varphi, c),\, \varphi\big)\, w(\varphi, c)\, d\varphi
Input:  α[dimϕ], β[dimϕ], sinogram[dimU × dimV × dimϕ]
Output: volume[dimX, dimY, dimZ]

For z = 0 to dimZ − 1
  For y = 0 to dimY − 1
    For x = 0 to dimX − 1
      voxelsum = 0
      For ϕ = 0 to dimϕ − 1
        Calculate (U, V) from α[ϕ] and β[ϕ]
        voxelsum += sinogram[U, V, ϕ]
      volume[x, y, z] = voxelsum
Massively parallel: 256³ voxels, 256 angle variations
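The loop nest above can be sketched as plain C. This is an illustrative reference, not the deck's actual kernel: dimensions are scaled down, and the (U, V) detector-coordinate computation from α[ϕ] and β[ϕ] is replaced by a placeholder mapping.

```c
/* Illustrative CPU reference of the back-projection loop nest.
 * Tiny dimensions; the (U, V) computation is a placeholder standing
 * in for the real alpha/beta projection geometry. */
enum { DIM_X = 4, DIM_Y = 4, DIM_Z = 4, DIM_PHI = 8, DIM_U = 4, DIM_V = 4 };

static void backproject(const float sino[DIM_PHI * DIM_V * DIM_U],
                        float volume[DIM_Z * DIM_Y * DIM_X])
{
    for (int z = 0; z < DIM_Z; z++)
        for (int y = 0; y < DIM_Y; y++)
            for (int x = 0; x < DIM_X; x++) {
                float voxelsum = 0.0f;
                for (int phi = 0; phi < DIM_PHI; phi++) {
                    /* placeholder for: calculate (U, V) from alpha[phi], beta[phi] */
                    int U = x % DIM_U;
                    int V = y % DIM_V;
                    voxelsum += sino[(phi * DIM_V + V) * DIM_U + U];
                }
                volume[(z * DIM_Y + y) * DIM_X + x] = voxelsum;
            }
}
```

Every voxel is independent of the others, which is what makes the outer three loops massively parallel; only the inner ϕ loop carries the accumulation.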
Back-projection results on FPGA
Main contributions
01 Benchmark the different memory structures
02 Implement algorithm-focused optimizations
03 Assess OpenCL code optimization for FPGA
IV. OPENCL MEMORY MODEL
Memory structure latency on an Altera Cyclone V
Mean latency (cycles): Global 240 | Constant 10 | Local 15 | Private 3
Measurement is tricky because of the LSU's embedded cache, hence a custom benchmark based on random reads
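One common way to build such a random-read benchmark is a pointer chase over a shuffled permutation: each load depends on the previous one, so caches and prefetchers cannot hide the latency. This is a generic sketch of the technique, not the deck's actual OpenCL benchmark; `N` and the function names are illustrative.

```c
#include <stdlib.h>

/* Pointer-chase micro-benchmark: a single N-cycle through a shuffled
 * permutation serializes all loads, exposing raw memory latency. */
#define N 1024

static void build_chain(int next[N])
{
    int perm[N];
    for (int i = 0; i < N; i++) perm[i] = i;
    for (int i = N - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        int j = rand() % (i + 1);
        int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (int i = 0; i < N; i++)                /* perm[i] -> perm[i+1] cycle */
        next[perm[i]] = perm[(i + 1) % N];
}

static int chase(const int next[N], int steps)
{
    int idx = 0;
    for (int s = 0; s < steps; s++)
        idx = next[s >= 0 ? idx : 0];          /* dependent loads, no ILP */
    return idx;
}
```

Timing `chase` over many steps and dividing by the step count gives the mean latency per access; on an FPGA the same chain would live in global, constant, local or private memory to compare the four.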
V. CUSTOM IMPLEMENTATIONS
OpenCL work-group enqueueing mechanisms:
- Data parallelism: NDRange
- Task parallelism: Single Work-Item (SWI)
Experiment setup: DE1-SoC
- Dual-core ARM Cortex-A9 processor and FPGA fabric within an Altera Cyclone V
- 1 GB of DDR3 memory
- Max FPGA frequency: 205 MHz
- Intel FPGA SDK for OpenCL 16.0
Implementation 1: Shift-Register Pattern (TP: task parallelism)
Input:  α[dimϕ], β[dimϕ], sinogram[dimU × dimV × dimϕ]
Output: volume[dimX, dimY, dimZ]

For ϕ = 0 to dimϕ − 1
  SRP[ϕ] = (α[ϕ], β[ϕ])              ◁ SRP initialization
For z = 0 to dimZ − 1
  For y = 0 to dimY − 1
    For x = 0 to dimX − 1
      voxelsum = 0
      #pragma unroll                  ◁ Task parallelism
      For ϕ = 0 to dimϕ − 1
        SRP[dimϕ − 1] = SRP[0]        ◁ SRP implementation
        For i = 0 to dimϕ − 2
          SRP[i] = SRP[i + 1]
        Calculate (U, V) from α[ϕ] and β[ϕ]
        voxelsum += sinogram[U, V, ϕ]
      volume[x, y, z] = voxelsum
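The rotation at the heart of the shift-register pattern can be sketched in plain C as follows. `N_ANGLES` and the `angle_t` pair are illustrative names; on the FPGA the inner loop is fully unrolled (`#pragma unroll`), so the HLS compiler maps the array to a hardware shift register where every element moves in the same clock cycle.

```c
/* Sketch of the shift-register pattern (SRP): the (alpha, beta) pairs
 * rotate through a fixed-size array each iteration. Sequential C here;
 * unrolled on the FPGA, all assignments happen in one cycle. */
#define N_ANGLES 8

typedef struct { float alpha, beta; } angle_t;

/* Rotate the register once: element 0 moves to the tail. */
static void srp_rotate(angle_t srp[N_ANGLES])
{
    angle_t head = srp[0];
    for (int i = 0; i < N_ANGLES - 1; i++)   /* fully unrolled on FPGA */
        srp[i] = srp[i + 1];
    srp[N_ANGLES - 1] = head;
}
```

After dimϕ rotations the register is back in its initial state, so each voxel's ϕ loop always reads its parameters from the register head instead of issuing a fresh global load.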
Implementation 2: Memory pre-fetching (DP)

Input:  const α[dimϕ], const β[dimϕ], sinogram[dimU × dimV × dimϕ]
Output: volume[dimX, dimY, dimZ]
Local int local_sinogram[Xoff × Yoff]

/* Recovery of work-item characteristics (x, y, z) */
voxelsum = 0
For ϕ = 0 to dimϕ − 1
  /* Calculate Un, Vn coordinates */
  /* Dispatch min, max coordinate computation between local work-items */
  barrier(CLK_LOCAL_MEM_FENCE)
  /* Global sinogram fetching by local work-items */
  barrier(CLK_LOCAL_MEM_FENCE)
  voxelsum += local_sinogram[localU, localV]
volume[x, y, z] = voxelsum
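A scaled-down C analogue of the pre-fetching phase might look like this. The `__local` buffer and the two barriers are modeled by a sequential copy-then-read over a plain array; all names and tile sizes are illustrative, not taken from the deck's kernel.

```c
/* Sketch of memory pre-fetching: copy the sinogram tile a work-group
 * needs into fast local memory, then serve all reads from that copy.
 * In OpenCL the copy is split across work-items and fenced by
 * barrier(CLK_LOCAL_MEM_FENCE); here it is simply sequential. */
#define TILE_U 4
#define TILE_V 4

static float prefetch_sum(const float *global_sino, int base_u, int base_v,
                          int dim_u)
{
    float local_sino[TILE_U * TILE_V];   /* stand-in for __local memory */

    /* phase 1: cooperative fetch from global memory */
    for (int v = 0; v < TILE_V; v++)
        for (int u = 0; u < TILE_U; u++)
            local_sino[v * TILE_U + u] =
                global_sino[(base_v + v) * dim_u + (base_u + u)];

    /* barrier(CLK_LOCAL_MEM_FENCE) would separate the phases here */

    /* phase 2: accumulate from local memory only */
    float sum = 0.0f;
    for (int i = 0; i < TILE_U * TILE_V; i++)
        sum += local_sino[i];
    return sum;
}
```

The point of the pattern is that each global sinogram element is fetched once per work-group instead of once per work-item, trading a little local memory for far fewer high-latency global accesses.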
Kernels implementations on Cyclone V SoC

Raw execution time (s):
SWI+Naive 222.9 | SWI+SRP 67.5 | ND+Naive 32.26 | ND+2CU 16.9 | ND+MF 31.3 | ND+Backbone 30.8

Key Points
- ND+2CU: linear extrapolation model verification
- ND+Backbone: irreducible logic utilization
- ND+MF uses less logic than naïve NDRange
- SWI+SRP uses less logic and is faster than naïve SWI
Logic utilization (%):
SWI+Naive 49 | SWI+SRP 36 | ND+Naive 55 | ND+2CU 96 | ND+MF 40 | ND+Backbone 21
Normalized execution time (s, raw time × logic utilization):
SWI+Naive 109.2 | SWI+SRP 24.3 | ND+Naive 17.7 | ND+2CU 16.2 | ND+MF 12.5 | ND+Backbone 6.47

Speedup from SWI+Naïve to ND+MF: 8.74
Matching VHDL FPGA implementations for ND+MF
Computation rate: 137 M"voxel"/s
Kernel frequencies (MHz): SWI+Naive 68 | SWI+SRP 112 | ND+Naive 140 | ND+2CU 140 | ND+MF 140 | ND+Backbone 140
GPU vs FPGA with OpenCL

Device                  Execution time (ms)   Power (W)   Energy (mWh)
Titan X Pascal (GPU)    12                    250         0.83
Jetson TX2 (GPU)        94                    15          0.39
Arria 10 (FPGA)         991                   2.27        0.63

- An embedded GPU is more energy efficient
- Algorithm inadequacy implies longer FPGA execution time
- Low FPGA power consumption
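As a sanity check on the numbers above, the energy figures follow directly from average power × execution time (1 mWh = 3.6 J, and W × ms = mJ, hence the division by 3600). A one-line helper:

```c
/* Energy in mWh from average power (W) and execution time (ms).
 * 1 mWh = 3.6 J, so power_w * time_ms gives mJ and /3600 gives mWh. */
static double energy_mwh(double power_w, double time_ms)
{
    return power_w * time_ms / 3600.0;
}
```

For example 250 W over 12 ms gives ≈ 0.83 mWh, consistent with the Titan X Pascal figures, and 2.27 W over 991 ms gives ≈ 0.63 mWh for the Arria 10, which is why the low-power FPGA still loses on energy to the Jetson here: its run is roughly 10× longer.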
VI. CONCLUSION AND PERSPECTIVES
CONCLUSION
- The Intel SDK guarantees one "voxel" computation per clock cycle
- Achieved a speedup of 8.74 with little hardware knowledge
- FPGAs still fall short of embedded GPUs (performance and power) for this family of CT algorithms

FPGA (2009) = FPGA OpenCL
Efficient tool for software developers
FPGA < Embedded GPU
PERSPECTIVES
Room for improvement by reducing kernel footprint or increasing kernel frequency
Many algorithms, like radar clutter computation, are well adapted to FPGAs' strengths
Old algorithms not fit for GPUs can re-emerge
Is computed tomography an adapted use case for FPGAs?
Next?
- Bigger card
- Xilinx SDx evaluation
- New adapted algorithm
THANK YOU
Any questions or comments are welcome!
FPGA key numbers
2015 global market: $6.36 billion (market shares: Intel, Xilinx, others)
In 2016, FPGAs outgrew the overall semiconductor market (6.9% vs 1.5% average annual growth)
The market is expected to reach $10 billion by 2024
Xilinx remains the leading FPGA manufacturer