GTC'12, May 14-17, 2012

Signal Processing on GPUs for Radio Telescopes

John W. Romein

Netherlands Institute for Radio Astronomy (ASTRON), Dwingeloo, the Netherlands

Overview

radio telescopes
six radio telescope algorithms on GPUs

part 1: real-time processing of telescope data
  1) FIR filter
  2) FFT
  3) delay compensation
  4) bandpass correction
  5) correlator

part 2: creation of sky images
  6) gridding (new GPU algorithm!)

Intro: Radio Telescopes

LOFAR Radio Telescope

largest low-frequency telescope
distributed sensor network

~85,000 sensors

LOFAR: A Software Telescope

different observation modes require flexibility
  standard imaging
  pulsar survey
  known pulsar
  epoch of reionization
  transients
  ultra-high energy particles
  …

need supercomputer
real time

LOFAR Data Processing

Blue Gene/P supercomputer

Square Kilometre Array

future radio telescope
huge processing requirements

telescope          TFLOPS
LOFAR (2012)       ~30
SKA 10% (2016)     ~30,000
Full SKA (2020)    ~1,000,000

Part 1: Real-Time Processing of Telescope Data

Rationale

2005: LOFAR needed supercomputer
2012: can GPUs do this work?

Blue Gene/P Algorithms on GPUs

BG/P software complex
  several processing pipelines

try imaging pipeline on GPU
  computational kernels only

other pipelines + control software: later

CUDA or OpenCL?

OpenCL advantages
  vendor independent
  runtime compilation: easier programming (parameters constant)

    float2 samples[NR_STATIONS][NR_CHANNELS][NR_TIMES][NR_POLARIZATIONS];

OpenCL disadvantages
  less mature, e.g., poor support for FFTs
  cannot use all GPU features

go for OpenCL
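
The runtime-compilation advantage can be illustrated with a small host-side sketch. It is written in Python purely for illustration (the talk's kernels are OpenCL); the helper name `build_options` and the parameter values are assumptions, not taken from the slides. `-D name=value` is the standard OpenCL build option for predefining a macro, so dimensions such as NR_STATIONS become constants when the kernel source is compiled at run time.

```python
def build_options(params):
    """Turn observation parameters into OpenCL '-D' build options.

    With these defines in place, a kernel declaration such as
    float2 samples[NR_STATIONS][NR_CHANNELS][NR_TIMES][NR_POLARIZATIONS]
    has constant dimensions at kernel compile time.
    """
    return " ".join(f"-D{name}={value}" for name, value in params.items())


# Hypothetical parameter values, chosen only for illustration.
options = build_options({
    "NR_STATIONS": 77,
    "NR_CHANNELS": 256,
    "NR_TIMES": 768,
    "NR_POLARIZATIONS": 2,
})
print(options)
```

The resulting string would be passed as the options argument when building the OpenCL program for one particular observation.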

Poly-Phase Filter (PPF) bank

splits frequency band into channels
like prism

time resolution ➜ freq. resolution

FIR filter + FFT

1) Finite Impulse Response (FIR) Filter

history & weights (in registers)
  no physical shift

many FMAs

operational intensity = 32 ops / 5 bytes
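
A minimal Python sketch of an FIR filter that keeps its history without physically shifting it. The ring-buffer index is only a stand-in for the register-resident history mentioned on the slide; the function name and the tap count used in the example are assumptions.

```python
def fir_filter(samples, weights):
    """FIR filter whose history lives in a ring buffer (no physical shift)."""
    nr_taps = len(weights)
    history = [0.0] * nr_taps      # last nr_taps inputs, overwritten in place
    newest = 0                     # slot that receives the next sample
    output = []
    for sample in samples:
        history[newest] = sample
        acc = 0.0
        # weights[k] multiplies the sample that arrived k steps ago
        for k in range(nr_taps):
            acc += weights[k] * history[(newest - k) % nr_taps]  # one FMA on a GPU
        output.append(acc)
        newest = (newest + 1) % nr_taps
    return output


# Feeding an impulse returns the weights themselves.
impulse_response = fir_filter([1.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0])
```

Each output costs one multiply-add per tap, which is why the kernel consists almost entirely of FMAs.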

Performance Measurements

maximum foreseen LOFAR load
  ≤ 77 stations
  488 subbands @ 195 kHz
  dual pol
  2x8 bits/sample
  ≤ 240 Gb/s

GTX 580, GTX 680, HD 6970, HD 7970
  need Tesla quality for real use
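
The 240 Gb/s bound can be checked with a short calculation. This sketch assumes each subband delivers one complex sample per polarization at the 195 kHz subband rate, and that "2x8 bits" means 8 bits each for the real and imaginary parts; the function name is hypothetical.

```python
def input_rate_gbps(nr_stations, nr_subbands, subband_hz, nr_pols, bits_per_sample):
    """Aggregate input data rate in Gb/s."""
    bits_per_second = (nr_stations * nr_subbands * subband_hz
                       * nr_pols * bits_per_sample)
    return bits_per_second / 1e9


rate = input_rate_gbps(77, 488, 195e3, 2, 2 * 8)
print(f"{rate:.1f} Gb/s")
```

Under these assumptions the product stays just below the 240 Gb/s quoted on the slide.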

FIR Filter Performance

GTX 580 performs best
restricted by memory bandwidth

2) FFT

1D
complex ➜ complex
16-256 points
tweaked “Apple” FFT library

64 work items: 1 FFT
256 work items: 4 FFTs

FFT Performance

N=256
tweaked library
5 n log(n)
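
The "5 n log(n)" figure is the usual convention for counting the floating-point operations of a complex FFT. The sketch below assumes a base-2 logarithm; the function names and the throughput value in the example are hypothetical.

```python
import math


def fft_flops(n):
    """Conventional FLOP count for one n-point complex FFT."""
    return 5 * n * math.log2(n)


def fft_gflops(n, ffts_per_second):
    """Convert a measured FFT throughput into GFLOPS using that convention."""
    return fft_flops(n) * ffts_per_second / 1e9


flops_256 = fft_flops(256)             # 5 * 256 * 8
example = fft_gflops(256, 1_000_000)   # one million 256-point FFTs per second
```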

Clock Correction

corrects cable length errors
merge with next step (phase delay)

3) Delay Compensation (a.k.a. Tracking)

track observed source
  delay telescope data
  delay changes due to earth rotation

shift samples
  remainder: rotate phase (= cmul)

18 FLOPs / 32 bytes
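
A sketch of the two-stage delay compensation described on this slide: a whole-sample shift, followed by a phase rotation for the remaining fraction of a sample. The function names, the sign convention of the rotation, and the example values are assumptions made for illustration.

```python
import cmath
import math


def split_delay(delay_s, sample_period_s):
    """Split a delay into a whole-sample shift and a residual in seconds."""
    shift = math.floor(delay_s / sample_period_s)
    residual = delay_s - shift * sample_period_s
    return shift, residual


def phase_factor(residual_s, channel_hz):
    """Complex factor that delays a channel at channel_hz by residual_s."""
    return cmath.exp(-2j * math.pi * channel_hz * residual_s)


def compensate(sample, residual_s, channel_hz):
    """Apply the residual delay with a single complex multiplication."""
    return sample * phase_factor(residual_s, channel_hz)


shift, residual = split_delay(2.5, 1.0)    # 2 whole samples plus half a sample
rotated = compensate(1 + 0j, 0.25, 1.0)    # a quarter period rotates by -90 degrees
```

The shift is applied by reading from a different offset in the input, so only the residual costs arithmetic.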

4) BandPass Correction

powers in channels unequal
  artifact from station processing

multiply by channel-dependent weights

1 FLOP / 8 bytes

Transpose

reorder data for next step (correlator)
through local memory

see talk S0514

Combined Kernel

combine:
  delay compensation
  bandpass correction
  transpose

reduces global memory accesses

18 FLOPs / 32 bytes

Delay / Band Pass Performance

poor operational intensity
156 GB/s!

5) Correlator

see previous talk (S0347)
multiply samples from each pair of stations
integrate ~1s
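
A minimal sketch of the correlation step for a single channel and a single polarization. The data layout, the function name, and the inclusion of autocorrelations are assumptions; the real kernel also handles both polarizations and many channels.

```python
def correlate(samples):
    """samples[station][time] -> {(i, j): sum over time of s_i * conj(s_j)}."""
    nr_stations = len(samples)
    visibilities = {}
    for i in range(nr_stations):
        for j in range(i + 1):       # each station pair once, autocorrelations included
            acc = 0j
            for s_i, s_j in zip(samples[i], samples[j]):
                acc += s_i * s_j.conjugate()   # integrate over the time samples
            visibilities[(i, j)] = acc
    return visibilities


vis = correlate([[1 + 1j, 2 + 0j],
                 [0 + 1j, 1 - 1j]])
```

The number of pairs grows quadratically with the number of stations, which is why this step dominates the pipeline.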

Correlator Implementation

global memory ➜ local memory

1 thread: 2x2 stations (dual pol)
  4 float4 loads ➜ 64 FMAs
  32 accumulator registers
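
One way to account for the numbers on this slide, assuming each station contributes one float4 (two polarizations, each a complex value) and each complex multiply-accumulate costs four real FMAs. The function name is hypothetical.

```python
def tile_costs(tile_a=2, tile_b=2, nr_pols=2):
    """Per-time-step cost of correlating a tile_a x tile_b block of stations."""
    loads = tile_a + tile_b                          # one float4 per station
    products = tile_a * tile_b * nr_pols * nr_pols   # complex multiply-accumulates
    fmas = 4 * products                              # 4 real FMAs per complex MAC
    accumulators = 2 * products                      # real and imaginary part each
    return loads, fmas, accumulators


loads, fmas, accumulators = tile_costs()
```

Larger tiles would raise the FMA-to-load ratio further, at the price of more accumulator registers per thread.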

Correlator #Threads

HD 6970 / HD 7970 need multiple passes!

[Figure: #threads (0-1024) versus #stations (20-77)]

max #threads
GTX 580    1024
GTX 680    1024
HD 6970    256
HD 7970    256

Correlator Performance

HD 7970: multiple passes
register usage ➜ low occupancy

Combined Pipeline

full pipeline
2 host threads
  own queue, own buffers
  overlap I/O & computations
  easy model!

[Figure: timeline of the two host threads, each issuing H➜D, FIR, FFT, D&B, Correlate, D➜H]

Overall Performance Imaging Pipeline

#GPUs needed for LOFAR
GTX 680 (marginally) fastest
  ~13 GPUs
HD 7970 real improvement over HD 6970

Performance Breakdown GTX 580

dominated by correlator
correlator: compute bound
others: memory I/O bound
PCIe I/O overlapped

Performance Breakdown GTX 680

~20% faster than GTX 580

Performance Breakdown HD 7970

multiple passes correlator visible
poor overlap I/O

Performance Breakdown HD 6970

≤ 2.7x slower

Are GPUs Efficient?

Blue Gene/P: better compute-I/O balance & integrated network

few tens of GPUs as powerful as 2 BG/P racks

% of FPU peak performance
                   GTX 680   Blue Gene/P
FIR filter         ~21%      85%
FFT                ~17%      44%
Delay / BandPass   ~2.6%     26%
Correlator         ~35%      96%

Feasible?

imaging pipeline ~13 GTX 680s (≈ 8 Tesla K10)

+ RFI detection? other pipelines

? 240 Gb/s FDR InfiniBand transpose

Future Optimizations

combine more kernels
  fewer passes over global memory
  FFT: difficult
    invoke FFT from GPU kernel, not CPU

Conclusions Part 1

OpenCL ok
  FFT support = minimal

GTX 680 (Kepler) marginally faster than HD 7970 (GCN)

<35% of FPU peak: memory I/O bottleneck

heavy use of FMA instructions

LOFAR imaging pipeline on GPUs = feasible

Part 2: Creation of Sky Images

Context

after observation:
  remove RFI
  calibrate
  create sky image

calibration/imaging loop possibly repeated

Creating a Sky Image

convolve correlations and add to grid
2D FFT ➜ sky image

Gridding

convolve correlation and add to grid
for all correlations

[Figure: corr, conv (~100x100), grid (~4096x4096)]
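
The basic gridding operation can be sketched as follows. This is a direct, unoptimized version that adds every product straight to the grid; the function name, the integer top-left positions, and the tiny sizes in the example are assumptions for illustration only.

```python
def grid_correlations(correlations, conv, grid_size):
    """correlations: list of (u, v, value), with (u, v) the integer top-left
    grid position at which the convolution matrix is placed."""
    grid = [[0j] * grid_size for _ in range(grid_size)]
    size = len(conv)
    for u, v, value in correlations:
        for y in range(size):
            for x in range(size):
                # every product is a read-modify-write on the grid in memory
                grid[v + y][u + x] += value * conv[y][x]
    return grid


small_grid = grid_correlations([(0, 0, 1 + 0j), (1, 1, 2j)],
                               [[1, 2], [3, 4]], 3)
```

With a ~100x100 convolution matrix, each correlation triggers on the order of ten thousand grid updates in this formulation.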

Two Problems

1. lots of FLOPS

2. add to memory: slow!

Two Solutions

1. lots of FLOPS ➜ use GPUs

2. add to memory: slow! ➜ avoid

This Is A Hard Problem

literature: 4 other GPU gridders
estimated perf. on GTX680
  compensated faster hardware
  bandwidth difference + 50%

[Figure: performance of gridders 1)-4) versus conv. matrix size (16x16 to 256x256); axes: GFLOPS (0-400) and giga-pixel-updates-per-second (0-50)]

1) MWA (Edgar et al. [CPC'11])
   search correlations
2) Cell BE (Varbanescu [PhD, '10])
   local store
3) van Amesfoort et al. [CF'09]
   private grid per block ➜ very small grids
4) Humphreys & Cornwell [SKA memo 132, '11]
   adds directly to grid in memory

~3% of FPU peak performance!
SKA: exascale

W-Projection Gridding

correlation has associated (u,v,w) coords
(u,v) not exact grid points
use different convolution matrices
  choose most appropriate one
  depends on frac(u), frac(v), w

[Figure: conv. matrix placed on the grid at (int(u), int(v))]
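
A sketch of how a convolution matrix might be selected for one correlation. The linear binning of w into planes, the use of |w|, the `w_max` parameter, and the function name are all assumptions; the slide only states that the choice depends on frac(u), frac(v), and w, and that the matrix is placed at (int(u), int(v)).

```python
import math


def select_conv_matrix(u, v, w, oversampling, nr_w_planes, w_max):
    """Return the grid position and the index of the convolution matrix."""
    grid_u, grid_v = math.floor(u), math.floor(v)    # int(u), int(v)
    over_u = int((u - grid_u) * oversampling)        # from frac(u)
    over_v = int((v - grid_v) * oversampling)        # from frac(v)
    # assumed: equally spaced w-planes between 0 and w_max
    w_plane = min(int(abs(w) / w_max * nr_w_planes), nr_w_planes - 1)
    return (grid_u, grid_v), (w_plane, over_v, over_u)


position, matrix_index = select_conv_matrix(10.5, 20.25, 50.0,
                                            oversampling=8,
                                            nr_w_planes=128,
                                            w_max=100.0)
```

With 8x8 oversampling and 128 W-planes, as in the test setup later in the talk, this scheme would address 8 x 8 x 128 precomputed matrices.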

Where Is The Data?

grid: device memory
conv. matrices: texture
correlations + (u,v,w) coords: shared (local) memory

Placement Movement

per baseline: (u,v,w) changes slowly
grid locality

[Figure: conv. matrix placement on the grid, moving along f and t]

Use Locality

reduce #memory accesses
X: one thread
  accumulate additions in register
  until conv. matrix slides off

But How ???

1 thread / grid point
  which correlations contribute?
  severe load imbalance

An Unintuitive Approach

conceptual blocks of conv. matrix size

1 thread monitors all X
at any time: 1 X covers conv. matrix!!!

thread computes current:
  X grid point
  X conv. matrix entry

(u,v) coords change

(u,v) coords change more

(atomically) adds data if switching to another X

#threads = block size
too many threads ➜ do in parts
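
The work distribution can be simulated sequentially as a reading of these slides, not as the author's actual kernel. Each (tx, ty) iteration stands for one GPU thread, the local variable `acc` stands for its register, and the `+=` on the grid stands for the atomic add. The function name and the integer top-left positions are assumptions.

```python
def grid_with_register_accumulation(correlations, conv, grid_size):
    """correlations: list of (u, v, value) with integer top-left positions.
    Returns the grid and the number of grid-point memory updates."""
    size = len(conv)
    grid = [[0j] * grid_size for _ in range(grid_size)]
    updates = 0
    for ty in range(size):
        for tx in range(size):            # one "thread" per conv. matrix position
            acc = 0j                      # register accumulator
            current = None                # grid point this thread is accumulating
            for u, v, value in correlations:
                # the only grid point under the conv. matrix that belongs
                # to this thread (same residue modulo the matrix size)
                x = u + (tx - u) % size
                y = v + (ty - v) % size
                if current is not None and current != (x, y):
                    grid[current[1]][current[0]] += acc   # atomic add on a GPU
                    updates += 1
                    acc = 0j
                current = (x, y)
                acc += value * conv[y - v][x - u]
            if current is not None:       # flush the last accumulated value
                grid[current[1]][current[0]] += acc
                updates += 1
    return grid, updates
```

As long as consecutive correlations move the convolution matrix by only a few grid points, most threads keep the same grid point and write nothing to memory; a thread flushes only when its grid point slides out from under the matrix.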

(Dis)Advantages

☹ overhead

☺ < 1% grid-point memory updates

Performance Measurements

Performance Tests Setup

(u,v,w) from real LOFAR observation (6 hour)

#stations            44
#channels            16
integration time     10 s
observation time     6 h
conv. matrix size    ≤ 256x256
oversampling         8x8
#W-planes            128
grid size            2048x2048

GTX 680 Performance (CUDA)

75.1-95.6 Gpixels/s
  25% of peak FPU
  overhead index computations

most additions in registers
  0.23%-0.55% ➜ atomic add
  = 26% of total run time!

occupancy: 0.694-0.952
texture hit rate: >0.872

[Figure: GTX 680 (CUDA) performance versus conv. matrix size (16x16 to 256x256); axes: GFLOPS (0-1000) and giga-pixel-updates-per-second (0-120)]

GTX 680 Performance (OpenCL)

OpenCL slower than CUDA
no atomic FP add!
  use atomic cmpxchg
V1.1: no 1D images (added in V1.2)
  2D image: slower
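
The compare-and-exchange workaround for the missing atomic floating-point add can be sketched as below. This is a single-threaded Python emulation of the retry loop; `Cell` and the helper names are hypothetical, and on the GPU the `cmpxchg` step would be OpenCL's `atomic_cmpxchg` applied to the 32-bit integer representation of the float.

```python
import struct


def float_to_bits(x):
    """Bit pattern of x as a 32-bit float."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(b):
    return struct.unpack("<f", struct.pack("<I", b))[0]


class Cell:
    """A 32-bit memory word with a compare-and-exchange primitive."""

    def __init__(self, value=0.0):
        self.bits = float_to_bits(value)

    def cmpxchg(self, expected, desired):
        """Store desired if the word equals expected; return the old word."""
        old = self.bits
        if old == expected:
            self.bits = desired
        return old


def atomic_add_float(cell, value):
    """Add value to the float in cell; retry if another writer got in between."""
    while True:
        old = cell.bits
        new = float_to_bits(bits_to_float(old) + value)
        if cell.cmpxchg(old, new) == old:
            return bits_to_float(old)


cell = Cell(1.5)
previous = atomic_add_float(cell, 2.25)
```

Every contended update repeats the read, add, and exchange, which is consistent with the slide's observation that the OpenCL version is slower than the CUDA one.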

[Figure: GTX 680 (CUDA) and GTX 680 (OpenCL) performance versus conv. matrix size (16x16 to 256x256); axes: GFLOPS (0-1000) and giga-pixel-updates-per-second (0-120)]

HD 7970 Performance (OpenCL)

medium & large conv. size: outperforms GTX 680
  ~25% > bandwidth, FPU, power

small conv. size: poor computation-I/O overlap
  map host memory into device

[Figure: GTX 680 (CUDA), GTX 680 (OpenCL), and HD 7970 performance versus conv. matrix size (16x16 to 256x256); axes: GFLOPS (0-1000) and giga-pixel-updates-per-second (0-120)]

2 x Xeon E5-2680 Performance (C++/AVX)

C++ & AVX vector intrinsics
adds directly to grid
  relies on L1 cache
  works well on CPU
  insufficient cache for GPUs

48-79% of peak FPU

[Figure: GTX 680 (CUDA), GTX 680 (OpenCL), HD 7970, and 2 x E5-2680 performance versus conv. matrix size (16x16 to 256x256); axes: GFLOPS (0-1000) and giga-pixel-updates-per-second (0-120)]

Multi-GPU Scaling

eight Nvidia GTX 580s
131,072 threads!
scales well

[Figure: GFLOPS (0-5000) versus nr. GPUs (0-8) for conv. matrix sizes 256x256, 64x64, and 16x16]

Green Computing

up to 1.94 GFLOP/W (with previous gen hardware!)

[Figure: power consumption (kW, 0-2.5) and power efficiency (GFLOP/W, 0-2) versus nr. GPUs (0-8) for conv. matrix sizes 256x256, 64x64, and 16x16]

Compared To Other GPU Gridders

1) MWA (Edgar et al. [CPC'11])
2) Cell BE (Varbanescu [PhD, '10])
3) van Amesfoort et al. [CF'09]
4) Humphreys & Cornwell [SKA memo 132, '11]

new method ~10x faster

[Figure: new method and gridders 1)-4) versus conv. matrix size (16x16 to 256x256); axes: GFLOPS (0-800) and giga-pixel-updates-per-second (0-100)]

See Also

An Efficient Work-Distribution Strategy for Gridding Radio-Telescope Data on GPUs, John W. Romein, ACM International Conference on Supercomputing (ICS'12), June 25-29, 2012, Venice, Italy

Future Work

LOFAR gridder
combine with A-projection
  time-dependent conv. function ➜ compute on GPU

Conclusions Part 2

efficient GPU gridding algorithm
minimizes memory accesses
OpenCL lacks atomic floating-point add
~10x faster than other gridders
scales well on 8 GPUs
energy efficient