Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs
Fisnik Kraja, Alin Murarasu, Georg Acher, Arndt Bode
Chair of Computer Architecture, Technische Universität München, Germany
2012 IEEE Aerospace Conference, 3-10 March 2012, Big Sky, Montana
The main points
• The motivation statement
• Description of the SAR 2DFMFI application
• Description of the benchmarked architecture
• Results of sequential optimizations and thread parallelization on the CPU
• Porting SAR Image Reconstruction to CUDA
• Comparison of CPU and GPU results
• Summary and conclusions
2/24/2012 2
Motivation
• On-board space-based processing should be increased
• Future space applications have high performance requirements
– HRWS SAR: 1 Tera FLOPS, 603.1 Gbit/s throughput
• Heterogeneous (CPU+GPU) architectures might be the solution
• Novel accelerator designs integrate CPUs and graphics processing modules in one chip
SAR Image Reconstruction
Synthetic data is fed to the SAR Sensor Processing (SSP) pipeline. The reconstructed SAR image is obtained by applying 2D Fourier Matched Filtering and Interpolation (2DFMFI).
SSP Processing Step | Computation Type | Execution Time in % | Size & Layout
1. Filter the echoed signal | 1d_Fw_FFT | 1.1 | [mc x n]
2. Transposition is needed | – | 0.3 | [n x mc]
3. Signal compression along slow-time | CEXP, MAC | 1.1 | [n x mc]
4. Narrow-bandwidth polar format reconstruction along slow-time | 1d_Fw_FFT | 0.5 | [n x mc]
5. Zero-pad the spatial frequency domain's compressed signal | – | 0.4 | [n x mc]
6. Transform back the zero-padded spatial spectrum | 1d_Bw_FFT | 5.2 | [n x m]
7. Slow-time decompression | CEXP, MAC | 2.3 | [n x m]
8. Digitally-spotlighted SAR signal spectrum | 1d_Fw_FFT | 5.2 | [n x m]
9. Generate the Doppler domain representation of the reference signal's complex conjugate | CEXP, MAC | 3.4 | [n x m]
10. Circumvent edge processing effects | 2D-FFT_shift | 0.4 | [n x m]
11. 2D interpolation from a wedge to a rectangular area: input [n x m] -> output [nx x m] | MAC, Sin, Cos | 69 | [nx x m]
12. Transform from the Doppler domain image into a spatial domain image: IFFT [nx x m] -> Transpose -> FFT [m x nx] | 1d_Bw_FFT | 10 | [m x nx]
13. Transform into a viewable image | CABS | 1.1 | [m x nx]
The Benchmarked Architecture
• The dual-socket ccNUMA node:
– 2 Intel Nehalem CPUs, 4 cores @ 2.13 GHz each
– 2 x 6 GB = 12 GB shared memory
– 32 nm process
– Board TDP = 120 W
• Input/Output Controller with PCI Express 2.0 (up to 36 lanes)
• 2 accelerators with NVIDIA Tesla C2070 GPUs, each with:
– 14 Streaming Multiprocessors
– 448 scalar cores @ 1.15 GHz
• Thread blocks are mapped to SMs in warps (32 threads) that receive the same instruction (SIMD)
• Shared memory among the threads in one block
• Exploiting the locality of the algorithms ensures performance
• Branches impact the efficiency of SIMD units
• The limited amount of GPU memory brings the need for slow PCIe communications
Porting SAR Application to CUDA
• 2D data tiling for loops:
– Tile elements are computed by a block of threads
– Thread (tx, ty) in block (bx, by) calculates:
• row (by*TILE_DIM+ty) and
• column (bx*TILE_DIM+tx)
of the data set
– Tiling increases the number of active blocks, raising the level of occupancy
– On the Tesla C2070 device: max 1024 threads per block, so TILE_DIM=32 (32x32=1024)
CUDA Implementation Discussions
• The CUFFT library provides a simple interface for computing parallel FFTs
– Batch execution for multiple 1-dimensional transforms
– Drawback: the memory needed on the host side increases with:
• the size of the transform
• the number of configured transforms in the batch
• Operations missing in CUDA:
– Library functions like cexp() and cabs()
– Atomic operations on floating-point variables
• Transcendental instructions execute efficiently on Special Function Units (SFUs):
– sine
– cosine
– square root
Performance Results
• CPU vs GPU:
– Better performance on the GPU
– Better power efficiency on the CPU
• Small scale vs large scale:
– For small-scale images (SCALE < 20), the data set fits completely in GPU memory
– For large-scale images (SCALE > 30), the data set does not fit in GPU memory
[Speedup chart comparing CPU S, CPU 8, CPU 16, and GPU; y-axis: speedup, 0 to 12]
• Programming heterogeneous systems is impacted by:
– Data dependencies
– Scheduling algorithms
– System resources
• Frequent transfers between CPU and GPU should be avoided
• Profiling is needed to identify the parts of the code that will benefit from executing on the GPU
• In our case, it was decided to execute only the Interpolation Loop (70% of the total execution time) on the GPU, in order to avoid transfers in steps like:
– FFT_SHIFT
– Transposition
Using Multiple GPU Devices
• OpenMP + CUDA: one OpenMP thread per device
– Separate GPU context
• Each thread independently calls:
– Memory management functions
– CUDA kernels
• 2 approaches:
– The same image is reconstructed by 2 GPUs
• Bottlenecks in the QPI (remote accesses) and PCIe links
– Separate images are reconstructed on 2 separate GPUs (pipelined version)
• Reduced CPU <-> GPU data transfers
Summary and Conclusions
• Porting the SAR application to CUDA requires knowledge of the underlying hardware and of the CUDA paradigm.
• For the SAR application, GPUs offer better performance than CPUs
– But CPUs are more power efficient
• Heterogeneous computing improves performance, but the Performance/Watt ratio is impacted by the number of CPU <-> GPU transfers.
• Static scheduling of CUDA kernels offers no flexibility in heterogeneous computing environments
• When using multiple GPU devices, it is very important to reduce the number of CPU <-> GPU and GPU <-> GPU transfers.
Thank You!
Questions?
Fisnik Kraja, Chair of Computer Architecture