Transcript
Page 1:

Massive Parallel LDPC Decoding on GPU

Gabriel Falcão, Leonel Sousa, Vitor Silva

Univ. of Coimbra and T. Univ. of Lisbon, Portugal

Page 2:

MOTIVATION

LDPC decoding:
- Intensive computation
- Irregular accesses to memory

LDPC decoding using dedicated VLSI hardware:
- Low area, low power consumption
- High throughputs (Mbps) and low latency
- Fixed-point arithmetic

LDPC decoding on GPUs:
- GPU processing horsepower readily available
- CUDA programming interface
- Medium to high throughputs (Mbps)
- Floating-point arithmetic
- A software-based, flexible solution!

Page 3:

OUTLINE

- Motivation
- LDPC codes
- Bit Node processing (BN)
- Check Node processing (CN)
- GPUs
- CUDA interface
- Experimental results
- Conclusions and future work

Page 4:

LDPC CODES

Advantages:
- Linear block codes
- Perform close to Shannon limit capacity
- High throughputs (Mbps)
- Very low Bit Error Rate (BER)

Disadvantages:
- Good performance implies large H matrices
- Computationally intensive operations
- Large amounts of hardware; dedicated VLSI solutions are expensive

Bottom line: why not use the horsepower available on GPUs, instead of developing expensive VLSI?

Page 5:

LDPC CODES

Parity check matrix defines the LDPC code

Tanner Graph represents connections between BNs and CNs

[Figure: Tanner graph connecting bit nodes BN0..BN5 to check nodes CN0..CN2, as defined by the parity-check matrix H below]

$$ H = \begin{pmatrix} 1 & 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 & 0 & 0 \end{pmatrix} $$
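For illustration only, the short host-side sketch below derives the Tanner-graph connectivity from this example H: the set N(m) of bit nodes checked by CNm and the set M(n) of check nodes connected to BNn. The names (h, N_of, M_of) are placeholders, not the authors' data structures.

// Host-side sketch (illustrative names): build the Tanner-graph adjacency.
// N(m) = bit nodes checked by check node m, M(n) = check nodes touching bit node n.
#include <cstdio>
#include <vector>

int main(void) {
    const int M = 3, N = 6;                 // example code: 3 check nodes, 6 bit nodes
    const int h[3][6] = { {1,1,0,0,1,0},
                          {0,1,1,0,0,1},
                          {1,0,1,1,0,0} };

    std::vector<std::vector<int>> N_of(M);  // N(m): columns n with h[m][n] == 1
    std::vector<std::vector<int>> M_of(N);  // M(n): rows    m with h[m][n] == 1
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n)
            if (h[m][n]) { N_of[m].push_back(n); M_of[n].push_back(m); }

    for (int m = 0; m < M; ++m) {
        printf("N(%d) = { ", m);
        for (int n : N_of[m]) printf("BN%d ", n);
        printf("}\n");
    }
    return 0;
}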

Page 6:

LDPC DECODER

BNs and CNs exchange messages (i.e., probabilities), allowing a reliable decision on each bit value

r_mn: message sent from CN_m to BN_n
q_nm: message sent from BN_n to CN_m

[Figure: Tanner graph with bit nodes BN0..BN5 and check nodes CN0..CN2 exchanging r_mn and q_nm messages]

Page 7:

CHECK NODE PROCESSING - CN

1. Calculates message going from CNm to BNn:

[Figure: check node CNm receives messages q_im, q_jm, q_km from bit nodes BNi, BNj, BNk and sends r_mn to BNn]

$$ r_{mn}^{(i)}(0) = \frac{1}{2} + \frac{1}{2} \prod_{n' \in N(m)\setminus n} \left(1 - 2\, q_{n'm}^{(i)}(1)\right) $$
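A minimal sequential sketch of this check-node update in the probability domain of the formula above; the function name, the per-edge array q1, and the excludeIdx parameter are assumptions made for illustration, not the authors' kernel code.

// Sketch of the CN update r_mn(0) = 1/2 + 1/2 * prod_{n' in N(m)\n} (1 - 2*q_{n'm}(1)).
// q1[] holds q_{n'm}(1) for every bit node n' in N(m); the message toward
// bit node n excludes that node's own contribution.
__host__ __device__ float checkNodeMessage0(const float *q1, int degree, int excludeIdx) {
    float prod = 1.0f;
    for (int k = 0; k < degree; ++k)
        if (k != excludeIdx)
            prod *= (1.0f - 2.0f * q1[k]);
    return 0.5f + 0.5f * prod;           // r_mn(0); r_mn(1) = 1 - r_mn(0)
}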

Page 8:

BIT NODE PROCESSING – BN

2. Calculates the message sent from BNn to CNm including channel information Pn:

3. Then computes the a posteriori pseudo-probabilities and performs hard decoding:

$$ q_{nm}^{(i)}(0) = k_{nm}\,(1 - p_n) \prod_{m' \in M(n)\setminus m} r_{m'n}^{(i)}(0) $$

$$ Q_n^{(i)}(0) = k_n\,(1 - p_n) \prod_{m \in M(n)} r_{mn}^{(i)}(0), \qquad \hat{c}_n = \begin{cases} 1 & \text{if } Q_n^{(i)}(1) > 0.5 \\ 0 & \text{otherwise} \end{cases} $$

[Figure: bit node BNn receives messages r_in, r_jn, r_kn from check nodes CNi, CNj, CNk, combines them with the channel information P_n, and sends q_nm to CNm]
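The corresponding bit-node step as a sequential sketch with assumed names; here k_nm and k_n are taken to be the normalization constants that make each pair of probabilities sum to one.

// Sketch of the BN update and hard decision for one bit node n.
// r0[]/r1[] hold r_{m'n}(0) and r_{m'n}(1) for every check node m' in M(n).
__host__ __device__ void bitNodeUpdate(const float *r0, const float *r1, int degree,
                                       float pn, int excludeIdx,
                                       float *q0_out, float *q1_out) {
    float q0 = 1.0f - pn, q1 = pn;
    for (int k = 0; k < degree; ++k) {
        if (k == excludeIdx) continue;     // exclude CNm itself when computing q_nm
        q0 *= r0[k];
        q1 *= r1[k];
    }
    float knm = 1.0f / (q0 + q1);          // normalization constant k_nm
    *q0_out = knm * q0;
    *q1_out = knm * q1;
}

// A posteriori pseudo-probability and hard decision: include ALL m in M(n).
__host__ __device__ int hardDecision(const float *r0, const float *r1, int degree, float pn) {
    float Q0 = 1.0f - pn, Q1 = pn;
    for (int k = 0; k < degree; ++k) { Q0 *= r0[k]; Q1 *= r1[k]; }
    float kn = 1.0f / (Q0 + Q1);           // normalization constant k_n
    return (kn * Q1 > 0.5f) ? 1 : 0;       // hard-decoded bit c_hat_n
}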

Page 9:

INTENSIVE COMPUTING

"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?"-- Seymore Cray

Page 10:

GRAPHICS PROCESSING UNITS (GPUs)

- Raw compute power increasing rapidly
- Many-core architecture
- Can be programmed outside the graphics framework, exposing parallelism
- Multi-threaded architecture programmed through CUDA
- Growing interest in general-purpose processing (GPP) on GPUs
  - Hard to program; needs an efficient interface
- The GPU wins when arithmetic intensity is maximized...
- The GPU loses with memory accesses!

Page 11:

SUM PRODUCT ALGORITHM (SPA)

Kernel 1 - Computes the messages sent from CNm to BNn (the probability of BNn being 0 or 1)

Kernel 1 - Horizontal Processing:

$$ r_{mn}^{(i)}(0) = \frac{1}{2} + \frac{1}{2} \prod_{n' \in N(m)\setminus n} \left(1 - 2\, q_{n'm}^{(i)}(1)\right), \qquad r_{mn}^{(i)}(1) = 1 - r_{mn}^{(i)}(0) $$

Kernel 2 - Vertical Processing:

$$ q_{nm}^{(i)}(0) = k_{nm}\,(1 - p_n) \prod_{m' \in M(n)\setminus m} r_{m'n}^{(i)}(0), \qquad q_{nm}^{(i)}(1) = k_{nm}\, p_n \prod_{m' \in M(n)\setminus m} r_{m'n}^{(i)}(1) $$

Kernel 2 – Computes the messages from BNn to CNm
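One possible thread organization for Kernel 1, sketched below under assumptions not stated on the slides (one thread per nonzero of H, edges grouped per check node in a flat array indexed by rowStart); the actual kernels use the compact HBN/HCN structures introduced on the following slides.

// Illustrative Kernel 1: each thread computes one r_mn message.
// rowStart[m]..rowStart[m+1]-1 index the edges (nonzeros) of check node m in a
// flat array; q1[e] holds q(1) for edge e, r0[e]/r1[e] receive the result.
__global__ void kernel1_horizontal(const float *q1, float *r0, float *r1,
                                   const int *rowStart, int numCheckNodes) {
    int m = blockIdx.y;                          // check node handled by this block row
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= numCheckNodes) return;
    int first = rowStart[m], last = rowStart[m + 1];
    int e = first + t;                           // edge (m, n) handled by this thread
    if (e >= last) return;

    float prod = 1.0f;
    for (int k = first; k < last; ++k)           // product over N(m) \ n
        if (k != e) prod *= (1.0f - 2.0f * q1[k]);
    r0[e] = 0.5f + 0.5f * prod;
    r1[e] = 1.0f - r0[e];
}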

Page 12:

COMPACT DATA STRUCTURES – H MATRIX

H mapped into compact HBN and HCN data structures

$$ H = \begin{pmatrix} 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \end{pmatrix} \quad \text{(8 bit nodes checked by 4 check-node equations)} $$

[Figure: the HBN compact data structure for the example H, with one entry per nonzero element (r0,0, r0,1, r0,2, r1,3, r1,4, r1,5, r2,0, r2,3, r2,6, r3,1, r3,4, r3,7) packed into words (word 1, word 2, ..., word n); each entry carries the addressing information used by the circular addressing mechanism below]

for all CNm do:                    // rows of H
    for all BNn do:                // columns of H
        if Hmn == 1 then
            // next BN (circularly) checked by the same CNm:
            pnext = j : Hmj == 1, with j in {n+1, ..., n+N-1} (mod N)
            HBN[m][n] = pnext
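As a concrete illustration of the pseudocode above, this host-side sketch computes, for every nonzero H[m][n] of the example 4x8 matrix, the column of the next nonzero in the same row, wrapping around circularly. The names (h, hbn_next) and the dense scan are illustrative assumptions, not the authors' implementation.

// Host-side sketch (illustrative names): for every nonzero H[m][n], store the
// column of the next nonzero entry in row m, wrapping around circularly.
#include <cstdio>

#define M 4   // check nodes (rows of H)
#define N 8   // bit nodes (columns of H)

static const int h[M][N] = {
    {1,1,1,0,0,0,0,0},
    {0,0,0,1,1,1,0,0},
    {1,0,0,1,0,0,1,0},
    {0,1,0,0,1,0,0,1},
};

int main(void) {
    int hbn_next[M][N];                              // next-column index per edge
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            if (!h[m][n]) continue;
            for (int step = 1; step <= N; ++step) {  // circular search in row m
                int j = (n + step) % N;
                if (h[m][j]) { hbn_next[m][n] = j; break; }
            }
            printf("edge (%d,%d) -> next BN %d\n", m, n, hbn_next[m][n]);
        }
    }
    return 0;
}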

Page 13:

COMPUTING KERNELS ON THE GPU

A novel multi-threaded SPA computing approach: the SPA is performed iteratively by several kernels on the GPU

Flow control and execution management of the kernels are handled through the CUDA programming interface

[Figure: per-iteration data flow - Kernel 1 consumes the p/q data and the HBN/HCN structures and produces the r messages; Kernel 2 consumes the r messages and produces updated q messages; both kernels run as grids of threads, and consecutive kernel launches are separated by synchronization points]
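A minimal host-side sketch of this control flow, assuming hypothetical kernel names (checkNodeKernel, bitNodeKernel) and device buffers; launches issued to the same CUDA stream execute in order, which realizes the synchronization point between the two processing phases.

// Hypothetical sketch of the host control loop (names and signatures assumed,
// not the authors' code): each SPA iteration launches the two kernels in order.
__global__ void checkNodeKernel(const float *q, float *r,
                                const int *hbn, const int *hcn);
__global__ void bitNodeKernel(const float *r, const float *p, float *q,
                              const int *hbn, const int *hcn);

void decode(float *d_p, float *d_q, float *d_r,
            int *d_hbn, int *d_hcn, dim3 grid, dim3 block, int iterations) {
    for (int it = 0; it < iterations; ++it) {
        // Kernel 1: horizontal processing (CN -> BN messages)
        checkNodeKernel<<<grid, block>>>(d_q, d_r, d_hbn, d_hcn);
        // Kernel 2: vertical processing (BN -> CN messages)
        bitNodeKernel<<<grid, block>>>(d_r, d_p, d_q, d_hbn, d_hcn);
        // Launches on the same stream are implicitly ordered, so Kernel 2 only
        // starts after Kernel 1 has finished (the synchronization point).
    }
    cudaDeviceSynchronize();   // wait for the final iteration before reading results
}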

Page 14:

CUDA INTERFACE FOR GPGPU

C-based programming interface for NVIDIA's 8-series GPUs and subsequent generations

CUDA enables efficient use of their massive parallelism:
- Multi-threading hides latency problems
- Allows transparent programming
- Slow global memory, fast shared memory access
- Avoid non-coalesced memory accesses
- Significant speedups, depending on the algorithm
- Hard challenge: irregular memory access patterns!

Page 15:

MULTI-THREAD COMPUTING APPROACH

Multi-thread strategy and architecture

[Figure: CUDA multi-thread architecture - the GPU executes a grid of blocks (0,0)..(X,Y); each block contains threads with private registers and local memory, shares a per-block shared memory, and all blocks access the global memory]

Page 16:

MULTI-THREAD COMPUTING APPROACH

The circular addressing mechanism allows the degree of parallelism to be increased

[Figure: thread-to-message mapping for Kernel 1 with circular addressing - inside block (0,0), thread (0,0) processes r0,0, (1,0) r0,1, (2,0) r0,2, (0,1) r1,3, (1,1) r1,4, (2,1) r1,5, (0,2) r2,0, (1,2) r2,3, (2,2) r2,6, while threads (3,0), (3,1), (3,2) map to a dummy rNULL element; every block of the grid is padded the same way]
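The rNULL padding in the figure can be expressed as a simple guard inside the kernel: threads mapped past the last edge of a row are redirected to a single dummy slot, so every block keeps the same shape. This is a hedged sketch with assumed names (edgeOfThread, nullSlot), not the authors' kernel.

// Sketch of the rNULL padding idea (assumed names): every block has the same
// number of threads; threads mapped to positions where the row has no more
// edges write to a single dummy rNULL slot instead of branching away.
__global__ void kernel1_padded(float *r, const int *edgeOfThread,
                               int threadsPerRow, int numRows, int nullSlot) {
    int tx = blockIdx.x * blockDim.x + threadIdx.x;   // position inside the row
    int ty = blockIdx.y * blockDim.y + threadIdx.y;   // check-node (row) index
    if (ty >= numRows || tx >= threadsPerRow) return;

    int e = edgeOfThread[ty * threadsPerRow + tx];    // -1 marks a padding thread
    int dst = (e >= 0) ? e : nullSlot;                // padding threads hit rNULL
    r[dst] = 0.5f;   // placeholder value; the real kernel writes the computed r_mn
}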

Page 17:

MULTI-THREAD COMPUTING APPROACH

[Figure: thread-to-message mapping for Kernel 2 - threads (A)..(F) process q_{n,m0}, q_{n,m1}, q_{n,m2} and q_{n',m0}, q_{n',m1}, q_{n',m2}; blocks (0,0)..(X,Y) of the grid are padded with dummy qNULL elements]

Page 18:

EXPERIMENTAL RESULTS

Decoding times, CPU vs. GPU:

Matrix size    25 iterations      50 iterations      100 iterations
               CPU      GPU       CPU      GPU       CPU      GPU
512x1024       3.5      0.2       6.9      0.4       13.9     0.8
2448x4896      16.7     0.8       33.3     1.6       66.5     3.1
2000x4000      21.0     1.1       41.9     2.2       84.0     4.2

Main conclusions ( … obtained from the matrices we considered using CUDA):

• Much faster processing than on top-notch CPUs
• Supports floating-point operations
• Achieves medium to large throughputs
• BUT MOST DEFINITELY NOT AS GREAT AS WE HOPED!

Page 19:

CONCLUSIONS AND FUTURE WORK

- A GPGPU approach for LDPC decoding
- New compact data structures to represent the H matrix
- A multi-threaded algorithm for LDPC decoding

Significant speedups achieved with the CUDA programming interface: up to 22x

GPUs allow a software-based, scalable, and low-cost solution

- Trading task parallelism for data parallelism
- Adoption/generalization of the proposed approach (algorithms and data structures) for irregular processing in graphs

Page 20:

CONCLUSIONS

Gabriel Falcão, [email protected]

University of Coimbra

Technical University of Lisbon

Portugal