Signal Processing on GPUs for Radio Telescopes
John W. Romein
Netherlands Institute for Radio Astronomy (ASTRON), Dwingeloo, the Netherlands
GTC'13, March 18-21, 2013
Overview
radio telescopes
motivation
processing pipelines
signal-processing algorithms: filter, correlator, beam forming, etc.
performance
The LOFAR Radio Telescope
largest low-frequency telescope
no dishes
distributed sensor network
~85,000 receivers
LOFAR: A Software Telescope
all-sky view antennas ➜ new science opportunities
different observation modes require flexibility
digitally steered
concurrent observations
supercomputer
real time
standard imaging
pulsar survey
known pulsar
epoch of re-ionization
transients
ultra-high energy particles
...
LOFAR SuperTerp
What's Next? The Square Kilometre Array
world-wide effort
unprecedented size
3,000 dishes + aperture array
South Africa + Australia
build/maintain/operate current telescopes (LOFAR, WSRT)
research for SKA
Motivation
1) bring GPU technology to LOFAR
2) do accelerator research for the SKA
ASTRON/IBM Dome
P1 algorithms & machines
P2 nano-photonics
P3 access patterns
P4 microservers
P5 accelerators
P6 compressed sampling
P7 real-time communication
research technologies to develop the SKA
green computing
nano-photonics
data & streaming
Accelerator Research
accelerators for (radio-astronomical) signal processing
GPUs, Xeon Phi, BG/Q, FPGAs, ...
fundamental understanding of accelerators
which properties make architecture (in)efficient?
I/O-compute balance?
energy efficiency?
programmability?
architecture-(in)dependent optimizations?
devise new algorithms?
generic approach to program accelerators, or ad-hoc?
applications:
1) LOFAR
2) SKA
LOFAR Data Processing
developing new GPU-based system
Correlator Processing Overview
More Detail
complex piece of software!
several pipelines
Implement GPU Kernels
killing two birds
1) develop new LOFAR correlator
2) code base for accelerator research
Why We Use OpenCL
OpenCL advantages:
vendor independent
GPU: runtime compilation
GPU: float8, float16, swizzling
CPU: nice C++ interface (exceptions)
OpenCL disadvantages:
poor library support (e.g., FFT)
cannot use all GPU features (e.g., GPUdirect)
compensate Δt
change phase
different Δt ➜ different beam
many beams from same input
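The idea of compensating a delay Δt by changing the phase can be sketched as follows. This is a minimal illustration, not the talk's kernel: the function name `beam_weights`, the sign convention of the exponent, and the frequency and delay values are all assumptions chosen for the example.

```python
import cmath
import math


def beam_weights(delays_s, freq_hz):
    """Return one complex phase weight per station for a single beam.

    delays_s: per-station delay in seconds to compensate (hypothetical input).
    freq_hz:  centre frequency of the subband being processed.
    The sign convention exp(+2*pi*i*f*dt) is an assumption.
    """
    return [cmath.exp(2j * math.pi * freq_hz * dt) for dt in delays_s]


# Two beams formed from the same three stations: only the delays differ.
freq = 150e6  # Hz, illustrative value only
beam_a = beam_weights([0.0, 1e-9, 2e-9], freq)
beam_b = beam_weights([0.0, -1e-9, -2e-9], freq)
```

Each weight has unit magnitude, so it rotates the phase of a station's sample without changing its amplitude. Using a different set of delays on the same input yields a different beam.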
Coherent Beam Forming Implementation
$CV_{beam} = \sum_{stat} W_{beam,stat} \cdot S_{stat}$
global memory ➜ local memory
each thread:
1 beam, 1 polarization
station-dependent weights in registers
2 passes of 24 stations
☺ 48 stations, 128 beams ➜ 14.2 FLOPs / byte
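The weighted sum over stations on this slide can be written out in a few lines. The sketch below shows only the arithmetic for one time sample and one polarization. It does not model the local-memory staging, the register-held weights, or the two passes of 24 stations. The name `form_beams` and the sample values are hypothetical.

```python
def form_beams(weights, samples):
    """Compute CV[beam] = sum over stations of W[beam][stat] * S[stat].

    weights: list of per-beam lists of complex station weights.
    samples: list of complex station samples for one time step.
    """
    return [
        sum(w * s for w, s in zip(beam_weights_row, samples))
        for beam_weights_row in weights
    ]


# Three stations, two beams (illustrative values only).
samples = [1 + 0j, 0 + 1j, -1 + 0j]
weights = [
    [1, 1, 1],      # beam 0: no phase rotation
    [1, -1j, -1],   # beam 1: weights that align the three samples
]
cv = form_beams(weights, samples)
```

With these values, beam 1 adds the three station samples in phase, while beam 0 largely cancels them. The same input therefore produces a different result per beam.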
Coherent Beam Forming Performance
[Figure: two plots against #beams (0 to 128), each comparing FirePro S10000 and Tesla K10: throughput in TFLOPS (axis 0 to 2.5) and bandwidth in GB/s (axis 0 to 400)]
48 stations
sawtooth caused by unused threads
Ultra-High Energy Particle (UHEP) Pipeline
UHEP Pipeline
store raw antenna voltages in 1.3s circular buffer
create ~50 beams on moon
detects $10^{15}$‒$10^{20.5}$ eV particle collisions
trigger in one/few adjacent beams: freeze/dump captured antenna voltages within 1.3s ???
highly experimental pipeline
Image Courtesy L. Viatour
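The circular buffer with freeze-on-trigger described on this slide can be sketched as below. This is an illustration under stated assumptions, not the LOFAR implementation: the class name `VoltageRingBuffer` and its methods are hypothetical, and the capacity is left as a parameter because the slide gives only the 1.3 s duration, not the sample rate.

```python
class VoltageRingBuffer:
    """Fixed-size circular buffer that stops accepting data once frozen."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = [0] * capacity
        self.count = 0        # total number of samples written so far
        self.frozen = False

    def push(self, sample):
        """Store a sample, overwriting the oldest one. Returns False if frozen."""
        if self.frozen:
            return False
        self.data[self.count % self.capacity] = sample
        self.count += 1
        return True

    def freeze(self):
        """Called on a trigger: keep the current contents for dumping."""
        self.frozen = True

    def dump(self):
        """Return the buffered samples ordered from oldest to newest."""
        n = min(self.count, self.capacity)
        start = self.count - n
        return [self.data[i % self.capacity] for i in range(start, self.count)]


buf = VoltageRingBuffer(capacity=4)
for sample in range(6):
    buf.push(sample)
buf.freeze()
accepted_after_freeze = buf.push(99)
captured = buf.dump()
```

The trigger has to arrive before the samples of interest are overwritten, which is why the buffer length bounds the allowed trigger latency.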
UHEP Pipeline
create ~50 beams on moon
inverse filter for high time resolution trigger
peak detection
anti-coincidence check
freeze raw antenna data buffer
max 1.3 s latency!
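The peak-detection and anti-coincidence steps can be sketched as follows. This is a simplified illustration, not the pipeline's actual trigger logic. It assumes that beams with consecutive indices are adjacent on the sky (a 1-D simplification), and the names `beams_with_peak` and `should_trigger`, the threshold, and `max_adjacent` are hypothetical.

```python
def beams_with_peak(beam_power, threshold):
    """Return the indices of beams whose maximum power exceeds the threshold."""
    return [b for b, series in enumerate(beam_power) if max(series) > threshold]


def should_trigger(beam_power, threshold, max_adjacent=2):
    """Trigger only if a peak shows up in one or a few adjacent beams.

    A peak seen in many beams at once is rejected (anti-coincidence),
    as is a set of hits in beams that are not neighbours.
    """
    hits = beams_with_peak(beam_power, threshold)
    if not hits or len(hits) > max_adjacent:
        return False
    # hits is sorted, so consecutive indices span exactly len(hits) - 1.
    return hits[-1] - hits[0] == len(hits) - 1


quiet = [[1, 1], [1, 2], [1, 1]]
single_beam = [[1, 9], [1, 1], [1, 1]]
all_beams = [[9, 9], [9, 9], [9, 9]]
far_apart = [[1, 9], [1, 1], [9, 1]]
```

With a threshold of 5, only `single_beam` produces a trigger. The other three cases are rejected because they contain no peak, a peak in every beam, or peaks in non-adjacent beams.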
UHEP Performance Breakdown
48 stations, 64 beams
[Figure: per-kernel bar charts (Beam Forming, Transpose, Inverse FIR, Inverse FFT, Trigger) comparing Tesla K10 and FirePro S10000: TFLOP/s (axis 0 to 3) and GB/s (axis 0 to 400)]
UHEP Performance Breakdown
3.93 ms data, 48 stations, 488 subbands
most time spent in beam former
I/O overlap
latency ≪ 100 ms
[Figure: stacked bars of execution time in ms (axis 0 to 60) for Tesla K10 and FirePro S10000; components: Input (Beam Former Weights), Input (Samples), ?, Trigger, Inv. FFT, Inv. FIR, Transpose, Beam Forming]
A Wild Idea
OpenCL on Top of CUDA Driver API
fool CUDA RTS
CPU: implement our own “platform” (ICD)
OpenCL library calls ➜ CUDA Driver API calls
limited subset (proof of concept)
GPU: use OpenCL ➜ PTX compiler (clc/clang/llvm)
☺ efficient
☹ does not support full language
can use:
☺ visual profiler
☺ cuFFT, GPUdirect
Future Work
To Do (LOFAR Correlator)
GPU kernels: dedispersion, flagging
pulsar pipelines
CPU code:
work distribution
network reordering (FDR IB)
>240 Gb/s
optimizations
monitoring & control
...
To Do (SKA Research)
Xeon Phi
OpenCL on FPGA
energy efficiency
fully understand all results
Conclusions
many signal-processing GPU kernels
new LOFAR correlator
research for SKA
OpenCL: vendor independent, elegant, but poor NVIDIA support
FirePro S10000 faster than Tesla K10, but immature driver
high efficiency on most important kernels
Thanks
support: Intel, NVIDIA
grants from Dutch national & province governments
LOFAR GPU Correlator team: Alexander van Amesfoort, Wouter Klijn, Marcel