Transcript
Page 1:

Massive Parallel LDPC Decoding on GPU

Gabriel Falcão, Leonel Sousa, Vitor Silva

Univ. of Coimbra and T. Univ. of Lisbon, Portugal

Page 2:

MOTIVATION

LDPC decoding:
- Intensive computation
- Irregular accesses to memory

LDPC decoding using dedicated VLSI hardware:
- Low area, low power consumption
- High throughputs (Mbps) and low latency
- Fixed-point arithmetic

LDPC decoding on GPUs:
- GPU processing horsepower readily available
- CUDA programming interface
- Medium to high throughputs (Mbps)
- Floating-point arithmetic
- A software-based, flexible solution!

Page 3:

OUTLINE

- Motivation
- LDPC codes
- Bit Node processing (BN)
- Check Node processing (CN)
- GPUs
- CUDA interface
- Experimental results
- Conclusions and future work

Page 4:

LDPC CODES

Advantages:
- Linear block codes
- Perform close to Shannon limit capacity
- High throughputs (Mbps)
- Very low Bit Error Rate (BER)

Disadvantages:
- Good performance implies large H matrices
- Computationally intensive operations
- Large amounts of hardware; dedicated VLSI solutions are expensive

Bottom line: why not use the horsepower available on GPUs, instead of developing expensive VLSI?

Page 5:

LDPC CODES

Parity check matrix defines the LDPC code

Tanner Graph represents connections between BNs and CNs

[Figure: Tanner graph connecting bit nodes BN0..BN5 to check nodes CN0..CN2, as defined by the parity-check matrix H below]

$$ H = \begin{pmatrix} 1 & 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 & 0 & 0 \end{pmatrix} $$
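For illustration only, the short host-side sketch below derives the Tanner-graph connectivity from this example H: the set N(m) of bit nodes checked by CNm and the set M(n) of check nodes connected to BNn. The names (h, N_of, M_of) are placeholders, not the authors' data structures.

// Host-side sketch (illustrative names): build the Tanner-graph adjacency.
// N(m) = bit nodes checked by check node m, M(n) = check nodes touching bit node n.
#include <cstdio>
#include <vector>

int main(void) {
    const int M = 3, N = 6;                 // example code: 3 check nodes, 6 bit nodes
    const int h[3][6] = { {1,1,0,0,1,0},
                          {0,1,1,0,0,1},
                          {1,0,1,1,0,0} };

    std::vector<std::vector<int>> N_of(M);  // N(m): columns n with h[m][n] == 1
    std::vector<std::vector<int>> M_of(N);  // M(n): rows    m with h[m][n] == 1
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n)
            if (h[m][n]) { N_of[m].push_back(n); M_of[n].push_back(m); }

    for (int m = 0; m < M; ++m) {
        printf("N(%d) = { ", m);
        for (int n : N_of[m]) printf("BN%d ", n);
        printf("}\n");
    }
    return 0;
}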

Page 6:

LDPC DECODER

BNs and CNs exchange messages (i.e., probabilities), allowing a reliable decision on each bit value

r_mn: message sent from CN_m to BN_n
q_nm: message sent from BN_n to CN_m

[Figure: Tanner graph with bit nodes BN0..BN5 and check nodes CN0..CN2 exchanging r_mn and q_nm messages]

Page 7:

CHECK NODE PROCESSING - CN

1. Calculates message going from CNm to BNn:

[Figure: check node CNm receives messages q_im, q_jm, q_km from bit nodes BNi, BNj, BNk and sends r_mn to BNn]

$$ r_{mn}^{(i)}(0) = \frac{1}{2} + \frac{1}{2} \prod_{n' \in N(m)\setminus n} \left(1 - 2\, q_{n'm}^{(i)}(1)\right) $$
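A minimal sequential sketch of this check-node update in the probability domain of the formula above; the function name, the per-edge array q1, and the excludeIdx parameter are assumptions made for illustration, not the authors' kernel code.

// Sketch of the CN update r_mn(0) = 1/2 + 1/2 * prod_{n' in N(m)\n} (1 - 2*q_{n'm}(1)).
// q1[] holds q_{n'm}(1) for every bit node n' in N(m); the message toward
// bit node n excludes that node's own contribution.
__host__ __device__ float checkNodeMessage0(const float *q1, int degree, int excludeIdx) {
    float prod = 1.0f;
    for (int k = 0; k < degree; ++k)
        if (k != excludeIdx)
            prod *= (1.0f - 2.0f * q1[k]);
    return 0.5f + 0.5f * prod;           // r_mn(0); r_mn(1) = 1 - r_mn(0)
}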

Page 8:

BIT NODE PROCESSING – BN

2. Calculates the message sent from BNn to CNm including channel information Pn:

3. Then computes the a posteriori pseudo-probabilities and performs hard decoding:

$$ q_{nm}^{(i)}(0) = k_{nm}\,(1 - p_n) \prod_{m' \in M(n)\setminus m} r_{m'n}^{(i)}(0) $$

$$ Q_n^{(i)}(0) = k_n\,(1 - p_n) \prod_{m \in M(n)} r_{mn}^{(i)}(0), \qquad \hat{c}_n = \begin{cases} 1 & \text{if } Q_n^{(i)}(1) > 0.5 \\ 0 & \text{otherwise} \end{cases} $$

[Figure: bit node BNn receives messages r_in, r_jn, r_kn from check nodes CNi, CNj, CNk, combines them with the channel information P_n, and sends q_nm to CNm]
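The corresponding bit-node step as a sequential sketch with assumed names; here k_nm and k_n are taken to be the normalization constants that make each pair of probabilities sum to one.

// Sketch of the BN update and hard decision for one bit node n.
// r0[]/r1[] hold r_{m'n}(0) and r_{m'n}(1) for every check node m' in M(n).
__host__ __device__ void bitNodeUpdate(const float *r0, const float *r1, int degree,
                                       float pn, int excludeIdx,
                                       float *q0_out, float *q1_out) {
    float q0 = 1.0f - pn, q1 = pn;
    for (int k = 0; k < degree; ++k) {
        if (k == excludeIdx) continue;     // exclude CNm itself when computing q_nm
        q0 *= r0[k];
        q1 *= r1[k];
    }
    float knm = 1.0f / (q0 + q1);          // normalization constant k_nm
    *q0_out = knm * q0;
    *q1_out = knm * q1;
}

// A posteriori pseudo-probability and hard decision: include ALL m in M(n).
__host__ __device__ int hardDecision(const float *r0, const float *r1, int degree, float pn) {
    float Q0 = 1.0f - pn, Q1 = pn;
    for (int k = 0; k < degree; ++k) { Q0 *= r0[k]; Q1 *= r1[k]; }
    float kn = 1.0f / (Q0 + Q1);           // normalization constant k_n
    return (kn * Q1 > 0.5f) ? 1 : 0;       // hard-decoded bit c_hat_n
}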

Page 9:

INTENSIVE COMPUTING

"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?"-- Seymore Cray

Page 10:

GRAPHICS PROCESSING UNITS (GPUs)

- Raw compute power increasing rapidly
- Many-core architecture
- Can be programmed outside the graphics framework, exposing parallelism
- Multi-threaded architecture programmed through CUDA
- Growing interest in general-purpose processing (GPP) on GPUs
  - Hard to program; needs an efficient interface
- The GPU wins when arithmetic intensity is maximized...
- The GPU loses with memory accesses!

Page 11:

SUM PRODUCT ALGORITHM (SPA)

Kernel 1 - Computes the messages sent from CNm to BNn (the probability of BNn being 0 or 1)

Kernel 1 - Horizontal Processing:

$$ r_{mn}^{(i)}(0) = \frac{1}{2} + \frac{1}{2} \prod_{n' \in N(m)\setminus n} \left(1 - 2\, q_{n'm}^{(i)}(1)\right), \qquad r_{mn}^{(i)}(1) = 1 - r_{mn}^{(i)}(0) $$

Kernel 2 - Vertical Processing:

$$ q_{nm}^{(i)}(0) = k_{nm}\,(1 - p_n) \prod_{m' \in M(n)\setminus m} r_{m'n}^{(i)}(0), \qquad q_{nm}^{(i)}(1) = k_{nm}\, p_n \prod_{m' \in M(n)\setminus m} r_{m'n}^{(i)}(1) $$

Kernel 2 – Computes the messages from BNn to CNm
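One possible thread organization for Kernel 1, sketched below under assumptions not stated on the slides (one thread per nonzero of H, edges grouped per check node in a flat array indexed by rowStart); the actual kernels use the compact HBN/HCN structures introduced on the following slides.

// Illustrative Kernel 1: each thread computes one r_mn message.
// rowStart[m]..rowStart[m+1]-1 index the edges (nonzeros) of check node m in a
// flat array; q1[e] holds q(1) for edge e, r0[e]/r1[e] receive the result.
__global__ void kernel1_horizontal(const float *q1, float *r0, float *r1,
                                   const int *rowStart, int numCheckNodes) {
    int m = blockIdx.y;                          // check node handled by this block row
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= numCheckNodes) return;
    int first = rowStart[m], last = rowStart[m + 1];
    int e = first + t;                           // edge (m, n) handled by this thread
    if (e >= last) return;

    float prod = 1.0f;
    for (int k = first; k < last; ++k)           // product over N(m) \ n
        if (k != e) prod *= (1.0f - 2.0f * q1[k]);
    r0[e] = 0.5f + 0.5f * prod;
    r1[e] = 1.0f - r0[e];
}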

Page 12:

COMPACT DATA STRUCTURES – H MATRIX

H mapped into compact HBN and HCN data structures

$$ H = \begin{pmatrix} 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \end{pmatrix} \quad \text{(8 bit nodes checked by 4 check-node equations)} $$

[Figure: the HBN compact data structure for the example H, with one entry per nonzero element (r0,0, r0,1, r0,2, r1,3, r1,4, r1,5, r2,0, r2,3, r2,6, r3,1, r3,4, r3,7) packed into words (word 1, word 2, ..., word n); each entry carries the addressing information used by the circular addressing mechanism below]

for all CNm do:                    // rows of H
    for all BNn do:                // columns of H
        if Hmn == 1 then
            // next BN (circularly) checked by the same CNm:
            pnext = j : Hmj == 1, with j in {n+1, ..., n+N-1} (mod N)
            HBN[m][n] = pnext
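As a concrete illustration of the pseudocode above, this host-side sketch computes, for every nonzero H[m][n] of the example 4x8 matrix, the column of the next nonzero in the same row, wrapping around circularly. The names (h, hbn_next) and the dense scan are illustrative assumptions, not the authors' implementation.

// Host-side sketch (illustrative names): for every nonzero H[m][n], store the
// column of the next nonzero entry in row m, wrapping around circularly.
#include <cstdio>

#define M 4   // check nodes (rows of H)
#define N 8   // bit nodes (columns of H)

static const int h[M][N] = {
    {1,1,1,0,0,0,0,0},
    {0,0,0,1,1,1,0,0},
    {1,0,0,1,0,0,1,0},
    {0,1,0,0,1,0,0,1},
};

int main(void) {
    int hbn_next[M][N];                              // next-column index per edge
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            if (!h[m][n]) continue;
            for (int step = 1; step <= N; ++step) {  // circular search in row m
                int j = (n + step) % N;
                if (h[m][j]) { hbn_next[m][n] = j; break; }
            }
            printf("edge (%d,%d) -> next BN %d\n", m, n, hbn_next[m][n]);
        }
    }
    return 0;
}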

Page 13:

COMPUTING KERNELS ON THE GPU

A novel multi-threaded SPA computing approach: the SPA is performed iteratively by several kernels on the GPU

Flow control and execution management of the kernels are handled through the CUDA programming interface

[Figure: per-iteration data flow - Kernel 1 consumes the p/q data and the HBN/HCN structures and produces the r messages; Kernel 2 consumes the r messages and produces updated q messages; both kernels run as grids of threads, and consecutive kernel launches are separated by synchronization points]
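A minimal host-side sketch of this control flow, assuming hypothetical kernel names (checkNodeKernel, bitNodeKernel) and device buffers; launches issued to the same CUDA stream execute in order, which realizes the synchronization point between the two processing phases.

// Hypothetical sketch of the host control loop (names and signatures assumed,
// not the authors' code): each SPA iteration launches the two kernels in order.
__global__ void checkNodeKernel(const float *q, float *r,
                                const int *hbn, const int *hcn);
__global__ void bitNodeKernel(const float *r, const float *p, float *q,
                              const int *hbn, const int *hcn);

void decode(float *d_p, float *d_q, float *d_r,
            int *d_hbn, int *d_hcn, dim3 grid, dim3 block, int iterations) {
    for (int it = 0; it < iterations; ++it) {
        // Kernel 1: horizontal processing (CN -> BN messages)
        checkNodeKernel<<<grid, block>>>(d_q, d_r, d_hbn, d_hcn);
        // Kernel 2: vertical processing (BN -> CN messages)
        bitNodeKernel<<<grid, block>>>(d_r, d_p, d_q, d_hbn, d_hcn);
        // Launches on the same stream are implicitly ordered, so Kernel 2 only
        // starts after Kernel 1 has finished (the synchronization point).
    }
    cudaDeviceSynchronize();   // wait for the final iteration before reading results
}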

Page 14:

CUDA INTERFACE FOR GPGPU

C-based programming interface for NVIDIA's 8-series GPUs and subsequent generations

CUDA enables efficient use of their massive parallelism:
- Multi-threading hides latency problems
- Allows transparent programming
- Slow global memory, fast shared memory access
- Avoid non-coalesced memory accesses
- Significant speedups, depending on the algorithm
- Hard challenge: irregular memory access patterns!

Page 15:

MULTI-THREAD COMPUTING APPROACH

Multi-thread strategy and architecture

[Figure: CUDA multi-thread architecture - the GPU executes a grid of blocks (0,0)..(X,Y); each block contains threads with private registers and local memory, shares a per-block shared memory, and all blocks access the global memory]

Page 16:

MULTI-THREAD COMPUTING APPROACH

The circular addressing mechanism allows the degree of parallelism to be increased

[Figure: thread-to-message mapping for Kernel 1 with circular addressing - inside block (0,0), thread (0,0) processes r0,0, (1,0) r0,1, (2,0) r0,2, (0,1) r1,3, (1,1) r1,4, (2,1) r1,5, (0,2) r2,0, (1,2) r2,3, (2,2) r2,6, while threads (3,0), (3,1), (3,2) map to a dummy rNULL element; every block of the grid is padded the same way]
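The rNULL padding in the figure can be expressed as a simple guard inside the kernel: threads mapped past the last edge of a row are redirected to a single dummy slot, so every block keeps the same shape. This is a hedged sketch with assumed names (edgeOfThread, nullSlot), not the authors' kernel.

// Sketch of the rNULL padding idea (assumed names): every block has the same
// number of threads; threads mapped to positions where the row has no more
// edges write to a single dummy rNULL slot instead of branching away.
__global__ void kernel1_padded(float *r, const int *edgeOfThread,
                               int threadsPerRow, int numRows, int nullSlot) {
    int tx = blockIdx.x * blockDim.x + threadIdx.x;   // position inside the row
    int ty = blockIdx.y * blockDim.y + threadIdx.y;   // check-node (row) index
    if (ty >= numRows || tx >= threadsPerRow) return;

    int e = edgeOfThread[ty * threadsPerRow + tx];    // -1 marks a padding thread
    int dst = (e >= 0) ? e : nullSlot;                // padding threads hit rNULL
    r[dst] = 0.5f;   // placeholder value; the real kernel writes the computed r_mn
}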

Page 17:

MULTI-THREAD COMPUTING APPROACH

[Figure: thread-to-message mapping for Kernel 2 - threads (A)..(F) process q_{n,m0}, q_{n,m1}, q_{n,m2} and q_{n',m0}, q_{n',m1}, q_{n',m2}; blocks (0,0)..(X,Y) of the grid are padded with dummy qNULL elements]

Page 18:

EXPERIMENTAL RESULTS

Decoding times, CPU vs. GPU:

Matrix size    25 iterations      50 iterations      100 iterations
               CPU      GPU       CPU      GPU       CPU      GPU
512x1024       3.5      0.2       6.9      0.4       13.9     0.8
2448x4896      16.7     0.8       33.3     1.6       66.5     3.1
2000x4000      21.0     1.1       41.9     2.2       84.0     4.2

Main conclusions ( … obtained from the matrices we considered using CUDA):

• Much faster processing than on top-notch CPUs
• Supports floating-point operations
• Achieves medium to large throughputs
• BUT MOST DEFINITELY NOT AS GREAT AS WE HOPED!

Page 19:

CONCLUSIONS AND FUTURE WORK

- A GPGPU approach for LDPC decoding
- New compact data structures to represent the H matrix
- A multi-threaded algorithm for LDPC decoding

Significant speedups achieved with the CUDA programming interface: up to 22x

GPUs allow a software-based, scalable, and low-cost solution

- Trading task parallelism for data parallelism
- Adoption/generalization of the proposed approach (algorithms and data structures) for irregular processing in graphs

Page 20:

CONCLUSIONS

Gabriel Falcão, [email protected]

University of Coimbra

Technical University of Lisbon

Portugal