Signal Processing on GPUs for Radio Telescopes

John W. Romein
Netherlands Institute for Radio Astronomy (ASTRON), Dwingeloo, the Netherlands

GTC'13, March 18-21, 2013


Overview

radio telescopes
motivation
processing pipelines
signal-processing algorithms
  filter, correlator, beam forming, etc.
performance


The LOFAR Radio Telescope

largest low-frequency telescope

no dishes

distributed sensor network

~85,000 receivers


LOFAR: A Software Telescope

all-sky view antennas ➜ new science opportunities
different observation modes require flexibility
digitally steered
concurrent observations
supercomputer
real time

standard imaging

pulsar survey

known pulsar

epoch of re-ionization

transients

ultra-high energy particles

...


LOFAR SuperTerp


What's Next? The Square Kilometre Array

world-wide effort
unprecedented size
  3,000 dishes + aperture array
  South Africa + Australia

2016‒2019: 10% SKA
2020‒2023: 100% SKA

exascale computing; petascale I/O
  O(10^4)‒O(10^5) > LOFAR

many challenges!


Our Efforts

build/maintain/operate current telescopes (LOFAR, WSRT)
research for SKA


Motivation

1) bring GPU technology to LOFAR

2) do accelerator research for the SKA


ASTRON/IBM Dome

P1 algorithms & machines

P2 nano-photonics

P3 access patterns

P4 microservers

P5 accelerators

P6 compressed sampling

P7 real-time communication

research technologies to develop the SKA
  green computing
  nano-photonics
  data & streaming


Accelerator Research

accelerators for (radio-astronomical) signal processing
  GPUs, Xeon Phi, BG/Q, FPGAs, ...

fundamental understanding of accelerators
  which properties make architecture (in)efficient?
  I/O-compute balance?
  energy efficiency?
  programmability?
  architecture-(in)dependent optimizations?
  devise new algorithms?
  generic approach to program accelerators, or ad-hoc?

applications:
  1) LOFAR
  2) SKA


LOFAR Data Processing

developing new GPU-based system


Correlator Processing Overview


More Detail

complex piece of software!
several pipelines


Implement GPU Kernels

killing two birds

1) develop new LOFAR correlator
2) code base for accelerator research


Why We Use OpenCL

OpenCL advantages                       disadvantages
vendor independent                      poor library support (e.g., FFT)
GPU: runtime compilation                cannot use all GPU features (e.g., GPUdirect)
GPU: float8, float16, swizzling
CPU: nice C++ interface (exceptions)

float2 input_samples[NR_STATIONS][NR_CHANNELS][NR_TIMES][NR_POLARIZATIONS];

float16 sums, weights;
…
sums += weights.s3456789ABCDEF012 * sample;
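The swizzle `weights.s3456789ABCDEF012` selects the 16 components of `weights` in rotated order: component 3 comes first and component 2 last. The following host-side sketch emulates that selection. It assumes a plain `std::array<float, 16>` as a stand-in for OpenCL's `float16` and assumes that `sample` is also a 16-component vector; the names `Float16`, `rotateLeft3`, and `accumulate` are illustrative and do not come from the slides.

```cpp
#include <array>
#include <cstddef>

// Stand-in for OpenCL's float16 vector type (assumption for illustration).
using Float16 = std::array<float, 16>;

// Emulates the swizzle v.s3456789ABCDEF012: result[i] = v[(i + 3) % 16].
Float16 rotateLeft3(const Float16 &v)
{
  Float16 result{};
  for (std::size_t i = 0; i < result.size(); ++i)
    result[i] = v[(i + 3) % v.size()];
  return result;
}

// Emulates "sums += weights.s3456789ABCDEF012 * sample", component by component.
void accumulate(Float16 &sums, const Float16 &weights, const Float16 &sample)
{
  const Float16 rotated = rotateLeft3(weights);
  for (std::size_t i = 0; i < sums.size(); ++i)
    sums[i] += rotated[i] * sample[i];
}
```

On a GPU the swizzle is expressed in the source as a single vector operand, which is what makes the OpenCL notation attractive here.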


Why We Are Disappointed

❑ NVIDIA dropped OpenCL support in development tools

❑ visual profiler


GPUs

                          NVIDIA Tesla K10   AMD FirePro S10000   Intel Xeon Phi
FPUs                      3072               3584                 976
SP TFLOPS                 4.58               5.91                 2.14
memory bandwidth (GB/s)   320 (ECC off)      480                  352
power (W)                 225                375                  300

FirePro S10000:
  driver/VBIOS/hardware (?) issues
  cannot always overlap compute‒PCIe I/O

Xeon Phi:
  immature results


From Feasibility Study to Production Software

concentrated mostly on GPU kernels
CPU parts: now/later


Implemented Kernels

most important:
  filter
  correlator
  beam former


Processing Pipelines

imaging mode
  sky images

pulsar modes
  in development

UHEP mode
  detect Ultra-High Energy Particles
  experimental


Imaging (Correlator) Pipeline


four GPU kernels


Poly-Phase Filter (PPF) Bank

splits frequency band into channels like prism

trades time for frequency resolution


FIR filter + FFT


1) Finite Impulse Response (FIR) Filter

history & weights (in registers)

no physical shift

reorder output for next kernel
  uncoalesced writes

many FMAs

operational intensity = 6.4 FLOPs / byte
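"No physical shift" can be read as keeping the sample history in a circular buffer, so that each new sample overwrites the oldest one instead of moving all others. The sketch below shows that idea for a single channel on the host. The tap count `NR_TAPS`, the class name `FirFilter`, and the choice of real weights with complex samples are assumptions for illustration; the slides do not give these details, and the GPU kernel keeps history and weights in registers rather than in arrays.

```cpp
#include <array>
#include <complex>
#include <cstddef>

// Hypothetical number of filter taps; the slides do not state the value used.
constexpr std::size_t NR_TAPS = 16;

// FIR filter for one channel with a circular history buffer:
// each new sample overwrites the oldest one, so nothing is shifted.
class FirFilter
{
public:
  explicit FirFilter(const std::array<float, NR_TAPS> &filterWeights)
  : weights(filterWeights) {}

  std::complex<float> processSample(std::complex<float> sample)
  {
    history[newest] = sample;

    std::complex<float> sum(0.0f, 0.0f);
    // weights[0] applies to the newest sample, weights[NR_TAPS - 1] to the oldest.
    for (std::size_t tap = 0; tap < NR_TAPS; ++tap)
      sum += weights[tap] * history[(newest + NR_TAPS - tap) % NR_TAPS];

    newest = (newest + 1) % NR_TAPS;
    return sum;
  }

private:
  std::array<float, NR_TAPS> weights;
  std::array<std::complex<float>, NR_TAPS> history{};
  std::size_t newest = 0; // position where the next sample is written
};
```

Feeding a unit impulse followed by zeros returns the weights one by one, which is a quick way to check the indexing.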


FIR Filter Performance

[Figure: two panels showing TFLOPS and GB/s versus #stations (0 to 80) for FirePro S10000 and Tesla K10]

K10: drop @41 stations ➜ small #threads


2) FFT

1D complex ➜ complex
typically: 64‒256
tweaked “Apple” FFT library
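For reference, a 1D complex-to-complex transform of this kind can be sketched as a textbook recursive radix-2 FFT. This is not the tweaked "Apple" FFT library mentioned above, only a minimal host-side illustration; the function name `fft`, the use of double precision, and the forward-transform sign convention are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Recursive radix-2 decimation-in-time FFT (forward, 1D complex to complex).
// The size must be a power of two, such as the 64 to 256 points mentioned.
std::vector<Complex> fft(const std::vector<Complex> &input)
{
  const std::size_t n = input.size();
  assert(n > 0 && (n & (n - 1)) == 0);

  if (n == 1)
    return input;

  // Split into even- and odd-indexed samples.
  std::vector<Complex> even(n / 2), odd(n / 2);
  for (std::size_t i = 0; i < n / 2; ++i) {
    even[i] = input[2 * i];
    odd[i]  = input[2 * i + 1];
  }

  const std::vector<Complex> evenOut = fft(even);
  const std::vector<Complex> oddOut  = fft(odd);
  const double pi = std::acos(-1.0);
  std::vector<Complex> output(n);

  // Butterfly: combine the two half-size transforms.
  for (std::size_t k = 0; k < n / 2; ++k) {
    const Complex twiddle = std::polar(1.0, -2.0 * pi * double(k) / double(n));
    output[k]         = evenOut[k] + twiddle * oddOut[k];
    output[k + n / 2] = evenOut[k] - twiddle * oddOut[k];
  }
  return output;
}
```

A production kernel would work in place and in single precision; the recursion here only serves readability.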


2) FFT Performance

[Figure: two panels showing TFLOPS and GB/s versus #stations (0 to 80) for FirePro S10000 and Tesla K10]

memory I/O bound


3) Phase & BandPass Correction

combined kernel:
  clock correction
  delay compensation
  bandpass correction
  transpose

reduces #device memory accesses


3a) Clock Correction

stations: shared clock
corrects cable length errors
merge with next step (phase delay)


3b) Delay Compensation (a.k.a. Tracking)

track observed source
delay telescope data

delay varies due to earth rotation
  shift samples
  remainder: rotate phase (= cmul)

0.75 FLOPs / byte
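The split into "shift samples" and "rotate phase" can be sketched as follows: the delay is rounded to a whole number of samples, and the residual is applied as one complex multiplication per sample. The struct and function names are hypothetical, and the sign convention exp(-2πi·f·Δt) is an assumption; the slides do not state which convention the pipeline uses.

```cpp
#include <cmath>
#include <complex>

// Result of splitting a delay into whole samples plus a residual.
struct DelaySplit
{
  long   wholeSamples;    // applied by shifting the sample stream
  double residualSeconds; // applied by rotating the phase
};

DelaySplit splitDelay(double delaySeconds, double samplePeriodSeconds)
{
  const long whole = std::lround(delaySeconds / samplePeriodSeconds);
  return { whole, delaySeconds - double(whole) * samplePeriodSeconds };
}

// Phase rotation for the residual delay at a given channel frequency.
// The sign convention is an assumption, see the text above.
std::complex<float> phaseFactor(double frequencyHz, double residualSeconds)
{
  const double pi = std::acos(-1.0);
  const double phase = -2.0 * pi * frequencyHz * residualSeconds;
  return std::complex<float>(float(std::cos(phase)), float(std::sin(phase)));
}

// The correction itself is one complex multiplication (cmul) per sample.
std::complex<float> compensate(std::complex<float> sample, std::complex<float> factor)
{
  return sample * factor;
}
```

Because the phase factor has unit magnitude, the correction changes only the phase of a sample, never its power.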


3c) BandPass Correction

[Figure: power (0.0 to 1.0) versus frequency channel (0 to 256)]

powers in channels unequal
  artifact from station processing

multiply by channel-dependent weight

0.125 FLOPs / byte


Phase & BandPass Correction

[Figure: two panels showing TFLOPS and GB/s versus #stations (0 to 80) for FirePro S10000 and Tesla K10]

☹ 0.75 FLOPs / byte
memory I/O bound
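One common way to see why 0.75 FLOPs / byte implies a memory-bound kernel is the roofline bound: attainable performance is the smaller of the peak compute rate and memory bandwidth times operational intensity. The slides do not name this model; the helper `attainableGflops` below is a hypothetical illustration that uses the peak figures from the GPU comparison table.

```cpp
#include <algorithm>

// Roofline bound: a kernel cannot exceed the peak compute rate, nor the
// memory bandwidth multiplied by its operational intensity (FLOPs per byte).
double attainableGflops(double peakGflops, double bandwidthGBs, double flopsPerByte)
{
  return std::min(peakGflops, bandwidthGBs * flopsPerByte);
}
```

With the Tesla K10 figures (4,580 GFLOPS, 320 GB/s) this gives at most 240 GFLOPS at 0.75 FLOPs / byte, and with the FirePro S10000 figures (5,910 GFLOPS, 480 GB/s) at most 360 GFLOPS, both far below peak.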


4) Correlator Kernel

multiply samples from each station pair

integrate ~1s
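A minimal host-side sketch of "multiply samples from each station pair, integrate" is given below. It handles one channel and one polarization only, whereas the real correlator handles two polarizations per station. Multiplying by the complex conjugate of the second station's sample and the triangular output layout are conventional choices assumed here, not taken from the slides; `correlate` is a hypothetical name.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using Sample = std::complex<float>;

// samples[station][time]: one channel, one polarization (a simplification).
// Returns the visibilities of all station pairs with s1 >= s2, including
// autocorrelations, in triangle order: index = s1 * (s1 + 1) / 2 + s2.
std::vector<Sample> correlate(const std::vector<std::vector<Sample>> &samples)
{
  const std::size_t nrStations = samples.size();
  std::vector<Sample> visibilities(nrStations * (nrStations + 1) / 2);

  for (std::size_t s1 = 0; s1 < nrStations; ++s1)
    for (std::size_t s2 = 0; s2 <= s1; ++s2) {
      Sample sum(0.0f, 0.0f);
      // Integrate the product over the whole time span.
      for (std::size_t t = 0; t < samples[s1].size(); ++t)
        sum += samples[s1][t] * std::conj(samples[s2][t]);
      visibilities[s1 * (s1 + 1) / 2 + s2] = sum;
    }
  return visibilities;
}
```

Only the lower triangle is computed because the pair (s2, s1) is the complex conjugate of (s1, s2).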


4) Correlation Triangle



4) Correlator Implementation

global memory ➜ local memory
1 thread per station pair (dual pol)

1 FLOP / byte (local mem)


4) Optimized Correlator Implementation

increase register reuse
1 thread: 2x2 stations (dual pol)

2 FLOPs / byte (local mem)
also: 3x3, 4x4


Correlator Register Usage

3x3 on K10: excessive spilling

max. registers/thread
  Tesla K10        63
  FirePro S10000   256

#accumulator registers
  1x1   8
  2x2   32
  3x3   72
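The accumulator counts follow from simple bookkeeping, sketched below under one assumption: each station pair needs 4 complex accumulators (one per polarization product), and each complex value occupies 2 registers. That interpretation is not spelled out on the slide, but it reproduces the 8, 32, and 72 listed. The function names are hypothetical.

```cpp
// Accumulator registers for a thread that correlates an n x n block of
// stations: n*n station pairs, 4 polarization products per pair (dual
// polarization), and 2 floats (real, imaginary) per complex accumulator.
constexpr unsigned nrAccumulatorRegisters(unsigned n)
{
  return n * n * 4 * 2;
}

// True if the accumulators alone fit within the per-thread register limit.
constexpr bool accumulatorsFit(unsigned n, unsigned maxRegistersPerThread)
{
  return nrAccumulatorRegisters(n) <= maxRegistersPerThread;
}
```

With 63 registers per thread, the 72 accumulators of a 3x3 block cannot stay in registers on the Tesla K10, which is consistent with the spilling noted above; 256 registers on the FirePro S10000 leave room.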


Correlator #Threads

cannot use all threads in last warp
too few threads ➜ low occupancy
too many threads ➜ multiple passes
  multiple thread blocks

[Figure: #threads (0 to 1024) versus #stations (0 to 80) for 1x1, 2x2, and 3x3]

max #threads
  Tesla K10        1,024
  FirePro S10000   256


1x1 vs. 2x2 and 3x3 (FirePro S10000)

[Figure: two panels showing TFLOPS and GB/s versus #stations (0 to 80) for 1x1, 2x2, and 3x3]

use whichever performs best


Correlator Performance

[Figure: two panels showing TFLOPS and GB/s versus #stations (0 to 80) for FirePro S10000 and Tesla K10]

compute bound
S10000: multiple passes


Combined Pipeline

#pragma omp parallel num_threads(2 * nrGPUs)
{
  cl::CommandQueue queue(...);
  cl::Buffer input(...), output(...), ...;

  #pragma omp for schedule(dynamic)
  for (int subband = 0; subband < nrSubbands; subband ++) {
    receive(input, subband);
    queue.enqueueWriteBuffer(input, ...);
    queue.enqueueNDRangeKernel(FIR_filter, ...);
    queue.enqueueNDRangeKernel(FFT, ...);
    queue.enqueueNDRangeKernel(delay_bandpass, ...);
    queue.enqueueNDRangeKernel(correlator, ...);
    queue.enqueueReadBuffer(output, ...);
    send(output, subband);
  }
}

2 queues/GPU: overlap I/O & computations

1 host thread/queue: easy model


Correlator Pipeline: Tesla K10 Performance Breakdown

[Figure: #GPUs (0 to 10) versus #stations (1 to 80), broken down into FIR filter, FFT, Phase & BandPass Correction, and Correlate]

most challenging observation mode
  8-bit samples

488 subbands (95.3 MHz)

correlations most expensive

need ~11 Tesla K10s


Large #Stations

➜ different approach

         #stations   #correlations   #threads     #thread blocks (K10)
LOFAR    ~80         ~12,960         ~820         1
SKA      ≤3,000      ≤18,000,000     ≤1,125,000   ≤1,000
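The #correlations and #threads columns can be reproduced with the counting sketched below. It assumes that autocorrelations are included, that each station pair yields 4 polarization products, and that each thread handles a 2x2 block of stations; these assumptions are inferred from the numbers rather than stated on the slide, and the function names are hypothetical.

```cpp
#include <cstdint>

// Station pairs including autocorrelations: N * (N + 1) / 2.
constexpr std::uint64_t nrStationPairs(std::uint64_t nrStations)
{
  return nrStations * (nrStations + 1) / 2;
}

// Each pair yields 4 polarization products (dual-polarization stations).
constexpr std::uint64_t nrCorrelations(std::uint64_t nrStations)
{
  return 4 * nrStationPairs(nrStations);
}

// One thread per 2x2 block of stations: count pairs of 2-station groups.
constexpr std::uint64_t nrThreads2x2(std::uint64_t nrStations)
{
  return nrStationPairs((nrStations + 1) / 2);
}
```

For 80 stations this yields 12,960 correlations and 820 threads; for 3,000 stations it yields 18,006,000 correlations and 1,125,750 threads, in line with the rounded SKA figures.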


Large #Stations: Another Strategy

32x32 stations (16x16 threads)

2x2 stations (dual pol): 1 thread

192x192 stations (squares + rectangles)

read samples 32+32 stations

☺ 32 FLOPs / byte
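The 32 FLOPs / byte figure can be reproduced with the rough count sketched below, which is an interpretation rather than the author's stated derivation. It assumes 8-byte `float2` samples, 8 FLOPs per complex multiply-add, 4 polarization products per station pair, and that only sample loads are counted. Under those assumptions a block of n x n stations reads n + n stations and achieves n FLOPs / byte, which also matches the 1 and 2 FLOPs / byte quoted for the 1x1 and 2x2 cases from local memory.

```cpp
// FLOPs per byte loaded when one work unit correlates an n x n block of
// dual-polarization stations, per time sample (assumptions in the text).
constexpr double flopsPerByte(unsigned n)
{
  // FLOPs: n*n station pairs * 4 polarization products * 8 FLOPs each.
  // Bytes: (n + n) stations * 2 polarizations * 8 bytes per sample.
  return (n * n * 4 * 8.0) / ((n + n) * 2 * 8.0);
}
```

The ratio grows linearly with the block size, which is why reading 32 + 32 stations per thread block pays off for large station counts.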


Large #Stations: Correlator Performance

[Figure: two panels showing TFLOPS (0 to 4) and GB/s (0 to 400) versus #stations (0 to 896) for FirePro S10000 and Tesla K10]

to do: Tesla K20X possibly faster


Pulsar Pipelines


1) search unknown pulsars

2) observe known pulsar

many narrow beams

credit: Jason Hessels


in development


Coherent Beam Forming

$CV_{beam} = \sum_{stat} W_{beam,stat} \cdot S_{stat}$

add stations
  > sensitivity

weighted
  compensate Δt
  change phase

different Δt ➜ different beam

many beams from same input
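The weighted sum over stations can be sketched on the host as below, for one channel, one polarization, and one time sample. The container layout and the name `formBeams` are assumptions for illustration; the GPU kernel described in the slides stages samples in local memory and keeps weights in registers.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<float>;

// weights[beam][station] and samples[station] for one channel, one
// polarization, one time sample. Returns one complex voltage per beam.
std::vector<Complex> formBeams(const std::vector<std::vector<Complex>> &weights,
                               const std::vector<Complex> &samples)
{
  std::vector<Complex> beams(weights.size(), Complex(0.0f, 0.0f));

  for (std::size_t beam = 0; beam < weights.size(); ++beam)
    for (std::size_t stat = 0; stat < samples.size(); ++stat)
      beams[beam] += weights[beam][stat] * samples[stat];

  return beams;
}
```

Every beam reuses the same station samples with a different set of weights, which is what allows many beams to be formed from the same input.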


Coherent Beam Forming Implementation

$CV_{beam} = \sum_{stat} W_{beam,stat} \cdot S_{stat}$

global memory ➜ local memory
each thread:
  1 beam, 1 polarization
  station-dependent weights in registers

2 passes of 24 stations

☺ 48 stations, 128 beams ➜ 14.2 FLOPs / byte


Coherent Beam Forming Performance

[Figure: two panels showing TFLOPS and GB/s versus #beams (0 to 128) for FirePro S10000 and Tesla K10]

48 stations
sawtooth caused by unused threads


Ultra-High Energy Particle (UHEP) Pipeline


store raw antenna voltages in 1.3s circular buffer

create ~50 beams on moon

detects 10^15‒10^20.5 eV particle collisions

trigger in one/few adjacent beams: freeze/dump captured antenna voltages within 1.3s ???

highly experimental pipeline

Image Courtesy L. Viatour


UHEP Pipeline

create ~50 beams on moon
inverse filter for high time resolution

trigger
  peak detection
  anti-coincidence check

freeze raw antenna data buffer
  max 1.3s latency!
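The slides do not detail the trigger logic. As a purely illustrative sketch of an anti-coincidence check, the function below accepts an event only when the beams whose peak power exceeds a threshold form one short contiguous run, and rejects events seen in many beams at once. The name `acceptTrigger`, its parameters, and the treatment of neighbouring beam indices as adjacent on the sky are all assumptions.

```cpp
#include <cstddef>
#include <vector>

// peakPower[beam]: highest power seen in each beam within one time window.
// Accept only if the beams above threshold form one contiguous run of at
// most maxAdjacentBeams beams. Adjacent indices stand in for adjacent
// beams on the sky (a simplifying assumption).
bool acceptTrigger(const std::vector<float> &peakPower, float threshold,
                   std::size_t maxAdjacentBeams)
{
  std::size_t first = 0, last = 0, count = 0;

  for (std::size_t beam = 0; beam < peakPower.size(); ++beam)
    if (peakPower[beam] > threshold) {
      if (count == 0)
        first = beam;
      last = beam;
      ++count;
    }

  if (count == 0 || count > maxAdjacentBeams)
    return false;

  // Contiguous run: the span from first to last contains exactly count beams.
  return last - first + 1 == count;
}
```

An accepted trigger would then lead to freezing the raw antenna data buffer within the 1.3 s window.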


UHEP Performance Breakdown

48 stations, 64 beams

[Figure: two bar charts showing TFLOP/s and GB/s per kernel (Beam Forming, Transpose, Inverse FIR, Inverse FFT, Trigger) for Tesla K10 and FirePro S10000]


3.93 ms data, 48 stations, 488 subbands

most time spent in beam former

I/O overlap

latency ≪ 100 ms

[Figure: stacked bars of time in ms (0 to 60) for Tesla K10 and FirePro S10000, broken down into Beam Forming, Transpose, Inv. FIR, Inv. FFT, Trigger, Input (Samples), and Input (Beam Former Weights)]


A Wild Idea


OpenCL on Top of CUDA Driver API

fool CUDA RTS

CPU: implement our own “platform” (ICD)
  OpenCL library calls ➜ CUDA Driver API calls
  limited subset (proof of concept)

GPU: use OpenCL ➜ PTX compiler (clc/clang/llvm)
  ☺ efficient
  ☹ does not support full language

can use:
  ☺ visual profiler
  ☺ cuFFT, GPUdirect



Future Work


To Do (LOFAR Correlator)

GPU kernels: dedispersion, flagging
pulsar pipelines
CPU code:
  work distribution
  network reordering (FDR IB)
    >240 Gb/s
  optimizations
  monitoring & control
  ...


To Do (SKA Research)

Xeon Phi
OpenCL on FPGA
energy efficiency
fully understand all results


Conclusions

many signal-processing GPU kernels
  new LOFAR correlator
  research for SKA

OpenCL: vendor independent, elegant, but poor support (NVIDIA)
FirePro S10000 faster than Tesla K10, but immature driver
high efficiency on most important kernels


Thanks

support: Intel, NVIDIA
grants from Dutch national & provincial governments
LOFAR GPU Correlator team: Alexander van Amesfoort, Wouter Klijn, Marcel Loose, Jan David Mol