Page 1: Parallel LDPC Decoding on GPUs Using a Stream-Based Computing Approach

Falcão G, Yamagiwa S, Silva V et al. Parallel LDPC decoding on GPUs using a stream-based computing approach. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 24(5): 913–924 Sept. 2009

Parallel LDPC Decoding on GPUs Using a Stream-Based Computing Approach

Gabriel Falcão1, Student Member, IEEE, Shinichi Yamagiwa2, Member, IEEE, Vitor Silva1 and Leonel Sousa2,3, Member, ACM, Senior Member, IEEE

1Department of Electrical and Computer Engineering, University of Coimbra, Instituto de Telecomunicações, Polo II - Universidade de Coimbra, 3030-290 Coimbra, Portugal

2INESC-ID, Technical University of Lisbon, Rua Alves Redol n.9, 1000-029 Lisboa, Portugal
3Department of Electrical and Computer Engineering, IST, Technical University of Lisbon, Rua Alves Redol n.9,

1000-029 Lisboa, Portugal

E-mail: {gff, vitor}@co.it.pt; {yama, las}@inesc-id.pt

Received July 8, 2008; revised May 20, 2009.

Abstract Low-Density Parity-Check (LDPC) codes are powerful error correcting codes adopted by recent communication standards. LDPC decoders are based on belief propagation algorithms, which make use of a Tanner graph and very intensive message-passing computation, and usually require hardware-based dedicated solutions. With the exponential increase of the computational power of commodity graphics processing units (GPUs), new opportunities have arisen to develop general purpose processing on GPUs. This paper proposes the use of GPUs for implementing flexible and programmable LDPC decoders. A new stream-based approach is proposed, based on compact data structures to represent the Tanner graph. It is shown that such a challenging application for stream-based computing, because of irregular memory access patterns, memory bandwidth and recursive flow control constraints, can be efficiently implemented on GPUs. The proposal was experimentally evaluated by programming LDPC decoders on GPUs using the Caravela platform, a generic interface tool for managing the kernels' execution regardless of the GPU manufacturer and operating system. Moreover, to comparatively assess the obtained results, we have also implemented LDPC decoders on general purpose processors with Streaming Single Instruction Multiple Data (SIMD) Extensions. Experimental results show that the solution proposed here efficiently decodes several codewords simultaneously, reducing the processing time by one order of magnitude.

Keywords data-parallel computing, graphics processing unit (GPU), Caravela, low-density parity-check (LDPC) code, error correcting code

1 Introduction

Low-Density Parity-Check (LDPC) codes were originally proposed by Robert Gallager in 1962[1] and rediscovered by MacKay and Neal in 1996[2]. They have been used in recent digital communication systems, such as DVB-S2, WiMAX and other emerging standards. LDPCs are linear (n, k) block codes[3] defined by sparse binary parity-check matrices H with (n − k) rows and n columns. They are usually represented by bipartite or Tanner[4] graphs, formed by Bit Nodes (BNs) and Check Nodes (CNs) linked by bidirectional edges. LDPC decoding requires the propagation of messages between connected nodes, as indicated by the Tanner graph. It is based on the computationally intensive Sum-Product Algorithm (SPA), also called belief propagation.

This family of decoders presents computational challenges due to the irregularity of the algorithm, which operates over sparse matrices, or over linked lists representing the irregular interconnection network between BNs and CNs according to the Tanner graph description[5]. It requires complex control flow, such as nested loops representing recursive computation. Therefore, the only solutions available for real-time processing are hardware-based Application Specific Integrated Circuits (ASICs) that usually adopt integer arithmetic[6]. But hardware only provides non-flexible and non-scalable dedicated solutions[7−9] that involve long development times and expensive non-recurring engineering. More flexible solutions for LDPC decoding using specialized Digital Signal Processors have recently been proposed[10].

In recent years, multi-core architectures have

Regular Paper
This work was partially supported by the Portuguese Foundation for Science and Technology, through the FEDER program, and also under Grant No. SFRH/BD/37495/2007.


914 J. Comput. Sci. & Technol., Sept. 2009, Vol.24, No.5

evolved from dual or quad-core to tera-scale systems, supporting multi-threading, a powerful technique to hide memory latency, while at the same time providing larger SIMD units for vector processing[11]. Programmed under the stream-based model, recent GPUs are multicore architectures that can also be used for general purpose processing (GPGPU)[12], yielding a high level of performance in commodity products[13−14]. The literature contains publications about GPGPU applications that include i) numerical computations, such as dense and sparse matrix multiplications[15−16], ii) computer graphics algorithms, such as those in ray tracing processing[17], iii) demanding simulations applied to physics, such as fluid mechanics solvers[18], and iv) database and data mining operations[19−20]. At the programming level, Buck et al. propose extensions to the C language known as Brook[21], which facilitate the programming of general purpose computation on GPUs. Brook supports data-parallel constructs and enables the use of GPUs as streaming co-processors. However, to apply GPUs to general purpose processing, there is still the need to manage and control the GPU's operations. Among the programming tools and environments developed for GPGPU are the Compute Unified Device Architecture (CUDA) from NVIDIA[22], and the Caravela platform[23−24]. While CUDA is a very effective solution for improving efficiency, it targets only Tesla-based NVIDIA GPUs; the Caravela tool, in contrast, is a general programming interface, based on a stream-based computing model, that can use any GPU as a co-processor. Caravela does not directly interact with the GPU hardware, but rather communicates with the GPU driver, which makes it a generic and powerful programming interface tool that operates independently of the operating system and GPU manufacturer. The main purpose of Caravela is to make it possible to develop and test parallel algorithms for GPUs, and not to compete performance-wise with commercial dedicated and optimized programming tools like CUDA. The execution unit of the Caravela platform is defined as a flow-model and can be programmed in DirectX[25] or OpenGL[26].

This paper proposes a novel approach for stream-based LDPC decoding based on the computationally intensive SPA. It exploits data-level parallelism according to the stream-based computing model. An efficient parallel algorithm was developed for LDPC decoding on GPUs and programmed using the Caravela programming interface and tools.

Experimental results show that the proposed algorithm can run significantly faster on GPUs than on modern general purpose processors. Efficient solutions were developed in order to compare an LDPC decoder executing on a CPU against a novel approach on a GPU that conveniently exploits the parallelism of stream-based architectures by simultaneously decoding several codewords. Although this paper focuses on LDPC decoding, because this is the operation that demands the most computational power, it is also possible to implement LDPC encoding on GPUs.

The main contributions of this paper, which implements an efficient parallel LDPC decoder on GPUs, are: i) the development of novel data structures for LDPC decoding that support stream-based computing; this new approach uses compact data structures different from the conventional compressed row storage and compressed column storage formats[16], where connections between the nodes of the Tanner graph are represented using circular addressing, which facilitates simultaneous access to different data elements as required by SPA processing; ii) the introduction of the new concept of multi-codeword decoding (the parallel architecture of the GPU allows several codewords to be decoded simultaneously); iii) an architecture that represents a programmable solution, as opposed to VLSI-dedicated LDPC decoders (new trends show that the number of cores on a processor is rising, and computational performance should increase over the next few years); iv) the use of floating-point arithmetic with 32-bit data precision (single precision), which produces a lower Bit Error Rate (BER) compared with the typical 5- to 6-bit data precision and fixed-point arithmetic used in VLSI-based solutions.

This paper is organized as follows. Section 2 analyzes the Sum-Product Algorithm (SPA) used for LDPC decoding and the respective data dependencies. A new algorithm and data structures suitable for stream-based LDPC decoding are proposed in Section 3. Section 4 describes the GPU architecture and the Caravela interface programming tool, while Section 5 contains the experimental evaluation, comparing execution times on GPUs and general purpose CPUs. Section 6 concludes the paper.

2 Sum-Product Algorithm for LDPC Decoding

Considering a set of bits, or codeword, that we wish to transmit over a noisy channel, the theory of graphs applied to error correcting codes has fostered codes with performance extremely close to the Shannon limit[27]. The certainty of an information bit can be spread over several bits of a codeword, allowing, in certain circumstances, the correct codeword to be recovered on the decoder side in the presence of noise.

2.1 Sum-Product Algorithm

In a graph representing a linear block error correcting code, reasoning algorithms exploit


probabilistic relationships between nodes imposed by parity-check equations. The SPA belongs to this category of algorithms. It finds a set of maximum a posteriori probabilities (MAP)[28], which allows the most likely transmitted codeword to be inferred.

Given an (n, k) binary LDPC code, we assume BPSK modulation, which maps a codeword c = (c_0, c_1, c_2, ..., c_{n−1}) into a sequence x = (x_0, x_1, x_2, ..., x_{n−1}), according to x_i = (−1)^c_i. Then, x is transmitted through an Additive White Gaussian Noise (AWGN) channel, producing a received sequence y = (y_0, y_1, y_2, ..., y_{n−1}) with y_i = x_i + n_i, where n_i represents AWGN with zero mean and variance σ².
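As a concrete illustration, the modulation and channel model above can be sketched as follows. The prior expression p_n = 1/(1 + e^(2y_n/σ²)) used to seed Algorithm 1 is the standard BPSK/AWGN formula, which the text assumes but does not write out; function names are illustrative, not from the paper.

```python
import math
import random

def bpsk_modulate(codeword):
    """Map bits c_i to symbols x_i = (-1)^c_i (bit 0 -> +1, bit 1 -> -1)."""
    return [(-1) ** c for c in codeword]

def awgn_channel(x, sigma, rng=random):
    """y_i = x_i + n_i, with n_i ~ N(0, sigma^2)."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]

def channel_priors(y, sigma):
    """p_n = P(c_n = 1 | y_n) for BPSK over AWGN (standard expression)."""
    return [1.0 / (1.0 + math.exp(2.0 * yn / sigma ** 2)) for yn in y]

x = bpsk_modulate([0, 1, 1, 0])        # [1, -1, -1, 1]
y = awgn_channel(x, sigma=0.8)
p = channel_priors(y, sigma=0.8)       # priors feeding the SPA initialization
```

A strongly positive y_n yields p_n near 0 (bit almost certainly 0), and a strongly negative y_n yields p_n near 1.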

Algorithm 1. SPA

1: {Initialization} p_n = p(y_n = 1); q^(0)_nm(0) = 1 − p_n; q^(0)_nm(1) = p_n;
2: while (ĉH^T ≠ 0 ∧ i < I) do {ĉ: decoded word; I: max. no. of iterations}
3:   {For all node pairs (BN_n, CN_m), corresponding to H_mn = 1 in the parity check matrix H of the code, do:}
4:   {Compute the message sent from CN m to BN n, which indicates the probability of BN n being 0 or 1:}
     (Kernel 1: Horizontal Processing)

       r^(i)_mn(0) = 1/2 + 1/2 ∏_{n′∈N(m)\n} (1 − 2 q^(i−1)_n′m(1)),    (1)

       r^(i)_mn(1) = 1 − r^(i)_mn(0),    (2)

     {where the product term in (1) is the π(·) function, and N(m)\n represents the BNs connected to CN m, excluding BN n.}
5:   {Compute the message from BN n to CN m:}
     (Kernel 2: Vertical Processing)

       q^(i)_nm(0) = k_nm (1 − p_n) ∏_{m′∈M(n)\m} r^(i)_m′n(0),    (3)

       q^(i)_nm(1) = k_nm p_n ∏_{m′∈M(n)\m} r^(i)_m′n(1),    (4)

     {where the product term in (3) is the λ(·) function, the k_nm are chosen to ensure q^(i)_nm(0) + q^(i)_nm(1) = 1, and M(n)\m is the set of CNs connected to BN n, excluding CN m.}
6:   {Compute the a posteriori pseudo-probabilities:}

       Q^(i)_n(0) = k_n (1 − p_n) ∏_{m∈M(n)} r^(i)_mn(0),

       Q^(i)_n(1) = k_n p_n ∏_{m∈M(n)} r^(i)_mn(1),

     {where the k_n are chosen to guarantee Q^(i)_n(0) + Q^(i)_n(1) = 1.}
7:   {Perform hard decoding:} for all n,

       ĉ^(i)_n = 1 if Q^(i)_n(1) > 0.5, and ĉ^(i)_n = 0 if Q^(i)_n(1) < 0.5.    (5)

8: end while

Fig.1. Example of the Tanner graph and some messages being exchanged between CN_m and BN_n nodes.

The SPA applied to LDPC decoding is illustrated in Algorithm 1. It is mainly described by two different intensive processing blocks, horizontal and vertical, defined by (1)∼(2) and (3)∼(4), respectively.

(1) and (2) update the messages from CN m to BN n, considering accesses to H on a row-major basis (horizontal processing), indicating the probability of BN n being 0 or 1. Similarly, the latter pair, (3) and (4), computes the q^(i)_nm messages sent from BN n to CN m, assuming accesses to H on a column-major basis (vertical processing). Finally, (5) performs the hard decoding at the end of an iteration. The iterative procedure is stopped if the decoded word ĉ verifies all parity-check equations (ĉH^T = 0), or if the maximum number of iterations (I) is reached.

Fig.1 shows an example for a 4 × 8 H matrix representing 8 BNs and 4 CNs. BN0, BN1 and BN2 are updated by CN0, as indicated in the first row of H. From the second until the last row it can be seen that the subsequent BNs are updated by the CNs connected to them. For each iteration and for every BN, the corresponding q^(i−1)_nm data is read and the r^(i)_mn messages are updated in iteration i according to (1) and (2).
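The two kernels and the stopping rule can be sketched in Python for a small code. This is a minimal sketch, not the paper's GPU implementation: the matrix below is reconstructed from the connections listed in Table 2, variable names are illustrative, and messages are kept in dictionaries rather than textures.

```python
# 4x8 parity-check matrix consistent with the connections listed in Table 2 / Fig.1.
H = [[1, 1, 1, 0, 0, 0, 0, 0],
     [0, 0, 0, 1, 1, 1, 0, 0],
     [1, 0, 0, 1, 0, 0, 1, 0],
     [0, 1, 0, 0, 1, 0, 0, 1]]

def spa_decode(H, p, max_iter=10):
    """Sum-Product decoding per Algorithm 1; p[n] = P(c_n = 1 | y_n)."""
    M, N = len(H), len(H[0])
    rows = [[n for n in range(N) if H[m][n]] for m in range(M)]   # N(m)
    cols = [[m for m in range(M) if H[m][n]] for n in range(N)]   # M(n)
    q1 = {(m, n): p[n] for m in range(M) for n in rows[m]}        # q_nm(1)
    r1 = {}
    for _ in range(max_iter):
        # Kernel 1, horizontal, eqs. (1)-(2):
        # r_mn(0) = 1/2 + 1/2 * prod_{n' in N(m)\n} (1 - 2*q_n'm(1))
        for m in range(M):
            for n in rows[m]:
                prod = 1.0
                for n2 in rows[m]:
                    if n2 != n:
                        prod *= 1.0 - 2.0 * q1[(m, n2)]
                r1[(m, n)] = 1.0 - (0.5 + 0.5 * prod)
        # Kernel 2, vertical, eqs. (3)-(4); k_nm normalizes the pair to sum to 1.
        for n in range(N):
            for m in cols[n]:
                a0, a1 = 1.0 - p[n], p[n]
                for m2 in cols[n]:
                    if m2 != m:
                        a0 *= 1.0 - r1[(m2, n)]
                        a1 *= r1[(m2, n)]
                q1[(m, n)] = a1 / (a0 + a1)
        # A posteriori pseudo-probabilities and hard decision, eq. (5).
        c = []
        for n in range(N):
            a0, a1 = 1.0 - p[n], p[n]
            for m in cols[n]:
                a0 *= 1.0 - r1[(m, n)]
                a1 *= r1[(m, n)]
            c.append(1 if a1 / (a0 + a1) > 0.5 else 0)
        if all(sum(H[m][n] * c[n] for n in range(N)) % 2 == 0 for m in range(M)):
            break                                                 # c H^T = 0
    return c
```

With priors leaning towards the valid codeword (1, 1, 0, 1, 0, 1, 0, 1), the decoder converges to it and stops as soon as the syndrome is zero.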

Table 1. Number of Arithmetic Operations Involved in the Update of Messages for the Horizontal and Vertical Processing Steps per Iteration, Using the SPA for LDPC Decoding

SPA                      Additions (+)    Multiplications (∗)
Horizontal processing    v(v + 1)M        2v(v − 1)M
Vertical processing      tN               2t²N

Given an H matrix with M rows (CNs) and N columns (BNs), a mean row weight v and a mean column weight t, with v, t > 2, Table 1 gives the computational complexity in terms of the number of floating-point add and multiply operations required for both the horizontal and vertical processing steps in the SPA LDPC decoding algorithm. Depending on the application and on the channel conditions (typically, the most important is the Signal-to-Noise Ratio), LDPC decoding can imply a substantial number of arithmetic operations per second, which justifies the investigation of new parallelization strategies.
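As a sanity check, the expressions in Table 1 can be evaluated directly. The code parameters below (M = 512, N = 1024, v = 6, t = 3) are hypothetical, chosen only to illustrate the per-iteration operation counts; they are not taken from the paper's experiments.

```python
def spa_op_counts(M, N, v, t):
    """Per-iteration add/multiply counts from Table 1
    (M check nodes, N bit nodes, mean row weight v, mean column weight t)."""
    horizontal = {"adds": v * (v + 1) * M, "muls": 2 * v * (v - 1) * M}
    vertical = {"adds": t * N, "muls": 2 * t * t * N}
    return horizontal, vertical

# Hypothetical regular code: 512 check nodes, 1024 bit nodes, v = 6, t = 3.
h_ops, v_ops = spa_op_counts(512, 1024, 6, 3)
# h_ops: 21504 adds, 30720 muls; v_ops: 3072 adds, 18432 muls per iteration.
```

Multiplying these counts by the iteration count and the codeword rate gives the operations-per-second figure that motivates parallelization.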

2.2 Parallelizing Message Computations

The Mv messages in the left column of Table 2, for the example in Fig.1, show no data dependency constraints in the message updating procedure for the horizontal step (the π(·) function can be found in Algorithm 1, while m^i_Z0→W0 means message m circulating from node Z0 to node W0 during iteration i). These operations can be parallelized by adopting a convenient scheduling that supports the updating of different messages for different vertices, simultaneously for different parts of the graph. The flooding schedule[3] algorithm adopted in this work guarantees that no CN is updated before all BNs conclude their updating procedure, and vice versa. The messages sent by BNs are all updated together before CN messages start being updated. In each iteration, all data used for computing a new message was obtained in the previous iteration. This principle is fundamental when developing a parallel SPA LDPC decoder to suit a parallel architecture (e.g., a GPU), as described in Section 3. A similar conclusion can be drawn when analyzing the vertical processing in the right-hand column of Table 2 (the λ(·) function can be found in Algorithm 1). In spite of the irregular memory access pattern, the processing of the tN new messages can also be parallelized here.

Table 2. SPA Parallelization of Message Computations for the Example in Fig.1

Horizontal Kernel                                      Vertical Kernel
m^i_CN0→BN0 = π(m^(i−1)_BN1→CN0, m^(i−1)_BN2→CN0)      m^i_BN0→CN0 = λ(p0, m^(i−1)_CN2→BN0)
m^i_CN0→BN1 = π(m^(i−1)_BN0→CN0, m^(i−1)_BN2→CN0)      m^i_BN0→CN2 = λ(p0, m^(i−1)_CN0→BN0)
m^i_CN0→BN2 = π(m^(i−1)_BN0→CN0, m^(i−1)_BN1→CN0)      m^i_BN1→CN0 = λ(p1, m^(i−1)_CN3→BN1)
m^i_CN1→BN3 = π(m^(i−1)_BN4→CN1, m^(i−1)_BN5→CN1)      m^i_BN1→CN3 = λ(p1, m^(i−1)_CN0→BN1)
m^i_CN1→BN4 = π(m^(i−1)_BN3→CN1, m^(i−1)_BN5→CN1)      m^i_BN2→CN0 = λ(p2)
m^i_CN1→BN5 = π(m^(i−1)_BN3→CN1, m^(i−1)_BN4→CN1)      m^i_BN3→CN1 = λ(p3, m^(i−1)_CN2→BN3)
m^i_CN2→BN0 = π(m^(i−1)_BN3→CN2, m^(i−1)_BN6→CN2)      m^i_BN3→CN2 = λ(p3, m^(i−1)_CN1→BN3)
m^i_CN2→BN3 = π(m^(i−1)_BN0→CN2, m^(i−1)_BN6→CN2)      m^i_BN4→CN1 = λ(p4, m^(i−1)_CN3→BN4)
m^i_CN2→BN6 = π(m^(i−1)_BN0→CN2, m^(i−1)_BN3→CN2)      m^i_BN4→CN3 = λ(p4, m^(i−1)_CN1→BN4)
m^i_CN3→BN1 = π(m^(i−1)_BN4→CN3, m^(i−1)_BN7→CN3)      m^i_BN5→CN1 = λ(p5)
m^i_CN3→BN4 = π(m^(i−1)_BN1→CN3, m^(i−1)_BN7→CN3)      m^i_BN6→CN2 = λ(p6)
m^i_CN3→BN7 = π(m^(i−1)_BN1→CN3, m^(i−1)_BN4→CN3)      m^i_BN7→CN3 = λ(p7)


To illustrate these concepts for the horizontal processing, the first row of H in Fig.2(a) shows that the three messages associated with the first CN equation can be updated in parallel without any kind of conflict between nodes (maintaining data consistency). The other messages on the left side of Table 2 show that the same principle applies to the other CN equations in the example illustrated in Fig.1. Again, a similar conclusion can be drawn for the vertical processing. Messages m^i_BN0→CN0 and m^i_BN0→CN2 (in the right column of Table 2) represent the update of the two messages associated with BN0, as defined by the first column of H. Fig.2(b) shows that data dependencies also support parallel operations in this case.

3 Stream-Based LDPC Decoding

(1)∼(4) in Algorithm 1 are the most intensive calculations in the SPA. To take advantage of the very high processing performance of GPUs to compute them, efficient data structures adapted to stream computing are necessary. A stream-based LDPC decoder needs different computation and memory access patterns in consecutive kernels to update BNs and CNs, respectively. In order to support the execution of kernels 1 and 2 (representing horizontal and vertical processing in Algorithm 1) on the GPU, we propose two stream-based data structures, H_BN and H_CN, to represent the H matrix. These structures require significantly less memory and are suitable for stream computing of both regular and irregular codes.

3.1 Mapping the Tanner Graph into Data Streams

Let us use the example in Fig.1 to illustrate the transformation performed on H to produce the compact stream data structures. H_BN codes information about the edge connections used in each parity check equation (horizontal processing). This data structure is generated by scanning the H matrix in row-major order and by sequentially mapping only the BN edges associated with non-null elements in H used by a single CN equation (in the same row). Algorithm 2 details this procedure. In step 5, it can be seen that all edges associated with the same CN are collected and stored in consecutive positions inside H_BN. The addressing in each row of H becomes circular. The pixel element corresponding to the last non-null element of each row points to the first element of this row, implementing a circular list that is used to update all the π(·) messages. The circular addressing allows the introduction of a high level

Algorithm 2. Generating Compact H_BN from the Original H Matrix

1: {Read a binary M × N matrix H}
2: for all CN m (rows in H_mn) do
3:   for all BN n (columns in H_mn) do
4:     if H_mn == 1 then
5:       ptr_next = j : H_mj == 1, with n + 1 ≤ j < (n + N) mod N;
         {Find circularly the right neighbor on the current row}
6:       H_BN = ptr_next;
         {Store ptr_next into the H_BN structure, using a square texture of dimension D × D, with D = ⌈√(∑_{m=1..M} ∑_{n=1..N} H_mn)⌉}
7:     end if
8:   end for
9: end for
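Algorithm 2's circular right-neighbor search can be sketched as follows. A dictionary keyed by (row, column) stands in for the paper's 2D pixel texture; packing the entries into a D × D texture is omitted, and the function name is illustrative.

```python
def build_h_bn(H):
    """Compact H_BN per Algorithm 2: for each nonzero H[m][n], store the column
    index of the circularly-next nonzero entry in row m (the 'right neighbor')."""
    M, N = len(H), len(H[0])
    h_bn = {}
    for m in range(M):
        for n in range(N):
            if H[m][n] == 1:
                # Scan columns n+1, n+2, ..., wrapping around modulo N.
                for off in range(1, N + 1):
                    j = (n + off) % N
                    if H[m][j] == 1:
                        h_bn[(m, n)] = j
                        break
    return h_bn

# 4x8 example from Fig.1: row 0 forms the circular list BN0 -> BN1 -> BN2 -> BN0.
H = [[1, 1, 1, 0, 0, 0, 0, 0],
     [0, 0, 0, 1, 1, 1, 0, 0],
     [1, 0, 0, 1, 0, 0, 1, 0],
     [0, 1, 0, 0, 1, 0, 0, 1]]
h_bn = build_h_bn(H)
```

Note how the last nonzero of each row wraps back to the first, e.g. edge (0, 2) points to column 0.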

Fig.2. Memory accesses defined by the Tanner graph for the example shown in Fig.1. (a) For horizontal neighbors. (b) For vertical neighbors. Messages being read/written are the non-zero elements emphasized in colors.


of parallelism. In the limit, for a multi-processor platform, a different pixel processor can be allocated to every single edge or π(·) message.

Each element of the data structure, here represented by a pixel texture, records the address of the next entry pointer and the corresponding value of r_mn. Although the pixel elements in Fig.3 are represented by their row and column addresses, the structures can easily be vectorized by convenient 1D or 2D reshaping according to the target stream-based architecture they apply to. The 3D representation shows that the same matrix information can be used to simultaneously decode several codewords, by applying SIMD processing, for example.

In the upper left corner of Fig.3, it can be seen that the pixel processor allocated to compute the message m^i_CN0→BN0 (identified as message r_{0,0}) depends on the messages m^(i−1)_BN1→CN0 and m^(i−1)_BN2→CN0 coming from BN1 and BN2. This is equivalent to saying that to update BN0 (upper left pixel), we have to read the information from BN1 (BN0 holds the address of BN1) and BN2 (BN1 holds the address of BN2) circularly, and then update BN0 (BN2 knows the address of BN0). This mechanism is used to update all the other BNs in parallel.
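The circular update just described can be sketched for the first row of the example: starting at edge (0, 0), the pointers are followed until they wrap back, accumulating the π(·) product of (1). The small h_bn and q1 dictionaries below are illustrative stand-ins for the textures.

```python
def pi_product(h_bn, q1, m, n):
    """Follow the circular H_BN list from edge (m, n), multiplying in (1 - 2q)
    for every other bit node of row m, and return r_mn(0) as in eq. (1)."""
    prod = 1.0
    j = h_bn[(m, n)]
    while j != n:                       # stop when the pointer wraps back to n
        prod *= 1.0 - 2.0 * q1[(m, j)]
        j = h_bn[(m, j)]
    return 0.5 + 0.5 * prod

# Row 0 of the Fig.1 example: circular list BN0 -> BN1 -> BN2 -> BN0.
h_bn = {(0, 0): 1, (0, 1): 2, (0, 2): 0}
q1 = {(0, 0): 0.9, (0, 1): 0.9, (0, 2): 0.1}
r00 = pi_product(h_bn, q1, 0, 0)        # message r_{0,0}, sent from CN0 to BN0
```

Here r_{0,0}(0) = 1/2 + 1/2 (1 − 1.8)(1 − 0.2) = 0.18, matching a direct evaluation of (1).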

For the vertical processing, H_CN is a sequential representation of the edges associated with the non-null elements in H connecting every BN to all its neighboring CNs (in the same column). This data structure is generated by scanning the H matrix in column-major order. Once again, the access between adjacent elements is circular, as described in Algorithm 3 and illustrated in Fig.4 for the H matrix given in Fig.1. In this case, a careful construction of the 2D addresses in H_CN is

Algorithm 3. Generating Compact H_CN from the Original H Matrix and H_BN

1: {Read a binary M × N matrix H}
2: for all BN n (columns in H_mn) do
3:   for all CN m (rows in H_mn) do
4:     if H_mn == 1 then
5:       ptr_tmp = i : H_in == 1, with m + 1 ≤ i < (m + M) mod M;
         {Find circularly the neighbor below on the current column}
6:       ptr_next = search(H_BN, ptr_tmp, n);
         {Find in H_BN the pixel with indices (ptr_tmp, n)}
7:       H_CN = ptr_next;
         {Store ptr_next into the H_CN structure, with addresses compatible with H_BN, using a square texture of dimension D × D, with D = ⌈√(∑_{m=1..M} ∑_{n=1..N} H_mn)⌉}
8:     end if
9:   end for
10: end for
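Algorithm 3's column-wise counterpart can be sketched in the same style. For simplicity this sketch stores the row index of the circular neighbor below directly, rather than the search(·) lookup of the corresponding H_BN pixel performed in step 6; the function name and dictionary layout are illustrative.

```python
def build_h_cn(H):
    """Compact H_CN per Algorithm 3 (simplified): for each nonzero H[m][n],
    store the row index of the circularly-next nonzero entry in column n
    (the neighbor below). Each entry sits at the same (m, n) position as in
    H_BN, so the two kernels can alternate over the same set of graph edges."""
    M, N = len(H), len(H[0])
    h_cn = {}
    for n in range(N):
        for m in range(M):
            if H[m][n] == 1:
                # Scan rows m+1, m+2, ..., wrapping around modulo M.
                for off in range(1, M + 1):
                    i = (m + off) % M
                    if H[i][n] == 1:
                        h_cn[(m, n)] = i
                        break
    return h_cn

# 4x8 example from Fig.1; column 0 links CN0 -> CN2 -> CN0 circularly.
H = [[1, 1, 1, 0, 0, 0, 0, 0],
     [0, 0, 0, 1, 1, 1, 0, 0],
     [1, 0, 0, 1, 0, 0, 1, 0],
     [0, 1, 0, 0, 1, 0, 0, 1]]
h_cn = build_h_cn(H)
```

A column with a single nonzero entry (e.g. BN2) points back to itself after one full wrap, which matches the λ(p2) entries of Table 2 that depend only on the channel prior.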

Fig.3. H_BN structure. A 2D texture representing bit node edges with circular addressing for the example in Fig.1. The pixel processors' entry points are also shown.


Fig.4. H_CN structure. A 2D texture representing check node edges with circular addressing for the example in Fig.1. The pixel processors' entry points are also shown.

required, because every pixel texture representing a graph edge must be in exactly the same position as it is in H_BN. This meticulous positioning of the pixel elements in H_CN allows the processing to be performed alternately by both kernels, using the same input textures. Step 6 shows that ptr_next is placed in the same pixel texture (or n, m edge) that it occupies in H_BN.

Fig.4 describes how the H_CN data structure is organized for the example in Fig.1, under kernel 2. The message m^i_BN0→CN0 (identified as message q_{0,0}) is a function of p_0 and m^(i−1)_CN2→BN0, and should update the upper left pixel representing CN0, which holds the address of CN2. This is another way of saying that CN2 updates CN0, and vice versa. This mechanism works in the same way for all the other CNs in the grid.

4 LDPC Decoding on GPUs with Caravela

On a GPU, the color data is written into the frame buffer, which outputs it to the screen as depicted in Fig.5. Vertex/pixel processors compute four floating-point values (XYZW for vertex, ARGB for pixel) in parallel. Moreover, the coloring operation in the pixel processor is also parallelized, because the output colors are generated independently as data streams, and each element of a stream is also independently processed. Recent GPUs therefore include several pixel processor cores that generate output colors concurrently. These processors perform SIMD computation on four data units, and also concurrent calculations for the resulting output data streams.

Fig.5. Processing steps for graphics rendering on a GPU.

In recent GPUs, vertex and pixel processors are programmable. These processors are usually programmed for graphics purposes. It is very important that the programs run fast, so that complex frames can be generated in real time. GPUs have dedicated floating-point processing pipelines in these processors, enabling them to achieve realistic graphics scenes with high resolution in real time, and GPGPU applications can make use of such high performance processors. However, the rasterizer is composed of fixed hardware, and its output data cannot be programmed. Moreover, in almost


all GPUs the output data from the rasterizer is just sent to the pixel processor and cannot be fetched by the CPU. Thus, only the computing power of the pixel processor is used in traditional GPUs for GPGPU applications, because of both its programmability and its flexibility for I/O data control①. These processors can be programmed in standard languages such as the DirectX Assembly Language, the High Level Shader Language (HLSL)[25] and the OpenGL Shading Language[26]. The programs are called shader programs.

4.1 Caravela Platform

For GPGPU, programmers need specific knowledge to control the GPU hardware via a graphics runtime environment. Moreover, there are different runtime environments, depending on the GPU vendor and the programming language. This is an overhead for programmers, who have to concentrate their best efforts on implementing efficient parallel algorithms in a shader program. To solve this disparity in programming GPU-based applications, the Caravela platform[23] has been implemented for GPGPU, and is publicly available at the web site[24].

The execution unit of the Caravela platform is based on the flow-model. As Fig.6 shows, the flow-model is composed of input/output data streams, constant parameter inputs and a pixel shader program, or kernel, which fetches the input data streams and processes them to generate the output data streams. The application program in Caravela is executed as stream-based computation, like a dataflow processor. However, the input data streams of the flow-model can be accessed randomly, because the input data streams are just memory buffers for the program that uses the data. On the other hand, the output data streams are sequences of data elements. The designation "pixel" for a unit of the I/O buffer is used because the pixel processor processes input data for every pixel color. A flow-model unit defines the number of pixels for the I/O data streams, the number of constant parameters, the data type of the I/O data streams, the pixel shader program and the requirements for the targeted GPU. To give portability to the flow-model, these items are packed into an eXtensible Markup Language (XML) file. This mechanism allows the usage of a flow-model unit located on a remote computer, just by fetching the XML file.

The Caravela platform mainly consists of a library that supports an Application Programming Interface for GPGPU. The Caravela library has adopted the following definitions of the processing units: Machine is a host machine, Adapter is a video adapter that includes one or more GPUs and, finally, Shader is a GPU. An application needs to map a flow-model into a shader before executing the mapped flow-model.

Fig.6. Structure of the flow-model.

The Caravela runtime operates as a resource manager for flow-models. By using the Caravela library functions, programmers can easily implement target applications in the framework of flow-models, by just mapping flow-models into shader(s). Therefore, programmers do not need to know much about graphics runtime environment details or GPU architectures, which means that the Caravela library can become an effective solution to tackle the problem of differences between graphical environments[24].

The execution of the flow-model covers both non-recursive and recursive applications. Caravela optimizes buffer management, particularly for OpenGL, where an extension to the Caravela library was implemented[29], allowing the efficient reutilization of output buffers as input data in future iterations. This optimization does not add overhead in computation time, as it simply swaps data pointers; it does not move blocks of data.

4.2 SPA Implementation Based on the Flow-Model

We developed a flow-model to support the LDPC decoder based on the Caravela tools, which also utilizes the efficient mechanisms provided for recursive computation. The synchronous data flow graph in Fig.7 represents the implemented stream-based LDPC decoder.

①There are some exceptions, namely the most recent NVIDIA GPU cards, where unified shaders can be allocated dynamically.


Constants k1 and k2 represent the matrix sizes. According to Algorithm 1, for iteration 0 kernel 1 receives as inputs the data stream p0, a constant k1 and the stream HBN. The output stream r0 is then produced and becomes one of the input data streams of kernel 2. The other inputs of this kernel are HCN and constant k2. The processing iterates alternately over kernel 1 and kernel 2 until the last kernel 2 produces the final output stream qi−1 for iteration i−1.

Fig.7. Synchronous data flow graph for a stream-based LDPC decoder: the pair kernel 1 and kernel 2 is repeated i times for an LDPC decoder executing i iterations.

Fig.8. Organization of the LDPC decoder flow-model.

Fig.8 graphically represents the corresponding flow-model unit containing a shader program that supports the stream-based computation of both kernels, where the input and output data streams are 2D textures. In the first iteration, input data stream 0 represents the channel probabilities. The first output stream is produced by performing kernel 1. After the first execution, this stream directly feeds the input of the next flow-model unit, which executes kernel 2. Data streams can be multiplexed through a simple and efficient swapping mechanism[29]. The output data stream can be fed back as an input stream of the next flow-model unit execution, and the process is repeated for each iteration. At the end, the last output stream conveys the decoded codeword.

5 Performance Evaluation

The proposed algorithm was programmed on recent CPUs and GPUs in order to evaluate the performance of the described stream-based LDPC decoder. Moreover, we also optimized the CPU program by hand, using the second generation of Streaming SIMD Extensions (SSE2) of the IA-32 instruction set. The relative performance of the CPU- and GPU-based approaches is compared for different workloads (i.e., H matrices with distinct characteristics).

The experimental setup is presented in Table 3. It includes a recent 8800 GTX GPU from NVIDIA, with stream processors (SPs) running at 1.35 GHz, and a modern Core 2 Duo processor from Intel at 2.4 GHz. The LDPC decoders are programmed on the CPU using the C language, with version 8.0 of the Microsoft Visual Studio 2005 C/C++ compiler and the -O2 full optimization for speed, and on the GPU using version 2.0 of the OpenGL Shading Language and the Caravela library.

Table 3. Experimental Setup

                    CPU                  GPU
Platform            Intel Core 2 Duo     NVIDIA 8800 GTX
Clock frequency     2.4 GHz              1.35 GHz (per SP)
Memory              1 GB                 768 MB
Language            C                    OpenGL (GLSL)

The experiments were carried out on both platforms using five matrices of different sizes and with varying numbers of edges. They are represented by matrices A to E, shown in Table 4. Their properties were chosen to approximately simulate the computational workload of LDPC codes with typical sizes ranging from small to medium and large (all sizes covered) and used in recent communication standards. These matrices are representative of a class of good codes and were downloaded from David J.C. MacKay's website[30].

The decoder addresses only non-null elements in the H matrix by using structures to represent the edges similar to the ones described in Section 3. The LDPC decoder on the x86 CPU is based on efficient linked-list data structures, and both CPU and GPU solutions use single precision floating-point arithmetic.

②These are the pixels showing empty coordinates (×,×) in Figs. 3 and 4, imposed by the GPU Caravela interface that only supports 2D data textures with square dimensions D × D, where D is a power of 2.


Table 4. Matrices Under Test

Matrix    Size           Edges     Edges/Row    Texture Dim.    Unused Pixel Textures②
A          111 × 999      2 997       27          64 × 64           1 099
B          408 × 816      4 080       10          64 × 64              16
C          212 × 1908     7 632       36         128 × 128          8 752
D         2448 × 4896    14 688        6         128 × 128          1 696
E         2000 × 4000    16 000        8         128 × 128            384

The pixel processors of the GPU were used with data inputted as textures and the output data stream assuming the usual place of the pixel color components' output in a graphical application. However, all the details are hidden from the programmer by using the Caravela interface tool. To compute the SPA on the GPU, the input H matrix is placed into an input stream and kernel 1 and kernel 2 are processed in pixel processors, according to Fig.7. The proposed compact representation also allows the reduction of data transfers between the host memory (RAM) and the device VRAM, which is a very important aspect in achieving high performance with GPUs.

5.1 Experiments and Results

The purpose of Fig.9 is to assess the relative performance of the GPU and the CPU. The speedups show that, for the best case, obtained with matrix C, the GPU is nearly 8 times faster than the CPU. On average, for the matrices under test, execution is 3.5 times faster on the GPU when performing 50 iterations. For 100 iterations the average speedup rises to 4.3.

Fig.9. Global speedup comparison between CPU- and GPU-based versions.

For a given Tanner graph, it is possible to decode 4z codewords in parallel, with z ∈ N, which allows the direct application of SIMD processing based on the use of arithmetic packed instructions. In the present case, this optimization is made possible by performing the same arithmetic operation to decode four codewords simultaneously. Using the 128-bit XMM registers (xmm0–xmm7) of Intel CPUs, four floating-point elements are packed and operated on together in a single instruction. The experimental results in Fig.10 show the processing times for the GPU and the CPU hand-optimized with SSE2 instructions. The GPU needs significantly shorter decoding times to complete the processing. The speedup shown in Fig.9 increases as the number of edges being processed increases, but does not depend only on it. Comparing matrix A with matrix B, it can be seen that even though the latter has fewer unused pixel textures, which represent no edges in the Tanner graph, the former performs faster because it has more edges per row (27 against 10). This is explained by the fact that GPUs perform better for algorithms demanding intensive computation. If we compare matrix A with matrix C, the latter achieves a speedup approximately 33% better, which is consistent with the fact that it has around 33% more edges per row (36 against 27). Finally, matrix E has a better speedup than matrix D, depicted next to it, because the former has 8 edges per row while the latter has 6, and, at the same time, there is less dummy processing on unused pixel textures.

Fig.10. Decoding processing times for an 8800 GTX GPU from NVIDIA vs. an Intel CPU using SSE2.

The experimental results in Fig.10 also show that the CPU performance achieved when SSE2 instructions are used for LDPC decoding starts degrading beyond a certain dimension of H, mainly due to cache misses. On the other hand, when analyzing the GPU response, it is possible to conclude that the GPU performs better for large matrices.

All in all, the GPU-based approach shows higher speedups for the LDPC decoding algorithm with intensive computation on huge quantities of data, due to its parallelism features and impressive processing power.


Even using SSE2 instructions on the CPU, the GPU provides significantly shorter execution times, as depicted in Fig.10. The gain exists for 50 or more iterations in all tested matrices.

5.2 Discussion

In order to achieve real-time processing, LDPC decoders usually have to be implemented in hardware. Reconfigurable FPGA architectures or ASICs usually implement LDPC decoding algorithms based on integer arithmetic[6]. The nature of LDPC codes demands huge workloads and a complex routing mechanism to support the message passing procedure between adjacent nodes. Some interesting solutions in the literature tackle such problems quite efficiently. Quaglio et al. propose a solution[5] for the irregular network connecting BNs and CNs according to the Tanner graph. A complete coder/decoder low-power solution based on VLSI is presented in [7], while reconfigurable solutions based on FPGAs are proposed in [8–9]. However, these implementations have reduced flexibility and incur high non-recurring engineering costs.

The massive dissemination of low-cost commodity programmable parallel devices such as GPUs has allowed us to develop a new flexible solution to the LDPC decoding problem. Furthermore, this solution supports floating-point arithmetic, which can provide a lower BER compared with dedicated hardware architectures.

Although recent GPUs make it possible to obtain medium throughputs for real-time LDPC decoding, a significant increase of GPU performance can be expected in the next few years, as more cores are being placed on a single device. The throughputs of the next generation of GPU-based LDPC decoders are likely to rise significantly.

6 Conclusions

This paper proposes a novel LDPC decoding approach suitable for the stream-based computing model, using GPU computational power to replace the conventional hardware solution. To pursue this goal, we developed compact and efficient stream-based data structures for the I/O data streams that fit the Tanner graph representation of an LDPC code. The Sum-Product Algorithm used for LDPC decoding was first tuned manually and programmed on CPUs using the second generation of Streaming SIMD Extensions of the IA-32 instruction set. The algorithm was also written in the OpenGL shading language, after which we applied the flow-model in the Caravela platform to program the LDPC decoder on GPUs and perform a relative performance evaluation. The experimental results obtained for the GPU-based LDPC decoder allow us to state that the proposed stream-based LDPC decoder approach leads to significant speedups, close to one order of magnitude, relative to the processing time on modern general purpose processors.

References

[1] Gallager R G. Low-density parity-check codes. IRE Transactions on Information Theory, 1962, 8(1): 21–28.

[2] Mackay D J C, Neal R M. Near Shannon limit performance of low density parity check codes. IEE Electronics Letters, 1996, 32(18): 1645–1646.

[3] Lin S, Costello D J. Error Control Coding. 2nd Ed., Prentice Hall, 2004.

[4] Tanner R. A recursive approach to low complexity codes. IEEE Transactions on Information Theory, 1981, 27(5): 533–547.

[5] Quaglio F, Vacca F, Castellano C, Tarable A, Masera G. Interconnection framework for high-throughput, flexible LDPC decoders. In Proc. Design, Automation and Test in Europe (DATE 2006), Munich, Germany, March 6–10, 2006, pp.124–129.

[6] Ping L, Leung W K. Decoding low density parity check codes with finite quantization bits. IEEE Communications Letters, 2000, 4(2): 62–64.

[7] Zhang T, Parhi K. Joint (3, k)-regular LDPC code and decoder/encoder design. IEEE Transactions on Signal Processing, 2004, 52(4): 1065–1079.

[8] Verdier F, Declercq D. A low-cost parallel scalable FPGA architecture for regular and irregular LDPC decoding. IEEE Transactions on Communications, 2006, 54(7): 1215–1223.

[9] Falcao G, Gomes M, Goncalves J, Faia P, Silva V. HDL library of processing units for an automatic LDPC decoder design. In Proc. IEEE Ph.D. Research in Microelectronics and Electronics (PRIME), Otranto, Italy, June 11–16, 2006, pp.349–352.

[10] Gomes M, Silva V, Neves C, Marques R. Serial LDPC decoding on a SIMD DSP using horizontal-scheduling. In Proc. 14th European Signal Processing Conference (EUSIPCO 2006), Florence, Italy, Sept. 4–8, 2006.

[11] Ghuloum A, Sprangle E, Fang J, Wu G, Zhou X. Ct: A flexible parallel programming model for tera-scale architectures. Intel, 2007, pp.1–21.

[12] Owens J D, Luebke D, Govindaraju N, Harris M, Kruger J, Lefohn A E, Purcell T J. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 2007, 26(1): 80–113.

[13] Goodnight N, Wang R, Humphreys G. Computation on programmable graphics hardware. IEEE Computer Graphics and Applications, 2005, 25(5): 12–15.

[14] Fok K L, Wong T T, Wong M L. Evolutionary computing on consumer graphics hardware. IEEE Intelligent Systems, 2007, 22(2): 69–78.

[15] Kruger J, Westermann R. Linear algebra operators for GPU implementation of numerical algorithms. ACM Transactions on Graphics, 2003, 22(3): 908–916.

[16] Bolz J, Farmer I, Grinspun E, Schroder P. Sparse matrix solvers on the GPU: Conjugate gradients and multigrid. ACM Transactions on Graphics, 2003, 22(3): 917–924.

[17] Purcell T J, Buck I, Mark W R, Hanrahan P. Ray tracing on programmable graphics hardware. ACM Transactions on Graphics, 2002, 21(3): 703–712.

[18] Harris M. Fast Fluid Dynamics Simulation on the GPU. GPU Gems, Fernando R. (ed.), Addison Wesley, 2004.

[19] Govindaraju N K, Lloyd B, Wang W, Lin M, Manocha D. Fast computation of database operations using graphics processors. In Proc. the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France, June 13–18, 2004, pp.215–226.

[20] Govindaraju N K, Raghuvanshi N, Manocha D. Fast and approximate stream mining of quantiles and frequencies using graphics processors. In Proc. the 2005 ACM SIGMOD International Conference on Management of Data, Baltimore, USA, June 14–16, 2005, pp.611–622.

[21] Buck I, Foley T, Horn D, Sugerman J, Fatahalian K, Houston M, Hanrahan P. Brook for GPUs: Stream computing on graphics hardware. ACM Transactions on Graphics, 2004, 23(3): 777–786.

[22] CUDA. Aug. 2007, http://developer.nvidia.com/object/cuda.html.

[23] Yamagiwa S, Sousa L. Caravela: A novel stream-based distributed computing environment. IEEE Computer, 2007, 40(5): 70–77.

[24] Caravela. April 2007, http://www.caravela-gpu.org.

[25] DirectX. April 2007, http://www.microsoft.com/directx.

[26] Kessenich J, Baldwin D, Rost R. The OpenGL shading language. Technical Report, 3Dlabs, Inc. Ltd.

[27] Chung S, Forney G, Richardson T, Urbanke R. On the design of low-density parity-check codes within 0.0045 dB of the Shannon limit. IEEE Communications Letters, 2001, 5(2): 58–60.

[28] Wicker S B, Kim S. Fundamentals of Codes, Graphs, and Iterative Decoding. Kluwer Academic Publishers, 2003.

[29] Yamagiwa S, Sousa L, Antao D. Data buffering optimization methods toward a uniformed programming interface for GPU-based applications. In Proc. Int. Conf. Computing Frontiers, Ischia, Italy, May 7–9, 2007, pp.205–212.

[30] Encyclopedia of Sparse Graph Codes. April 2007, http://www.inference.phy.cam.ac.uk/mackay/codes/data.html.

Gabriel Falcao is a researcher at the Instituto de Telecomunicacoes, Coimbra, Portugal. His research interests span the areas of digital signal processing, VLSI, parallel architectures and high performance computing. He received his M.Sc. degree in electrical and computer engineering from the Faculty of Engineering of the University of Porto (FEUP), Portugal, in 2002. He is currently a teaching assistant at the Department of Electrical and Computer Engineering, Faculty of Sciences and Technology of the University of Coimbra (FCTUC), Portugal, where he is a Ph.D. candidate. He is also a student member of IEEE.

Shinichi Yamagiwa is a researcher at INESC-ID, Lisbon. His research interests include parallel and distributed computing, especially using GPU resources, and both network hardware and software for cluster computers. Yamagiwa received his Ph.D. degree in engineering from the University of Tsukuba, Japan. He is a member of IEEE.

Vitor Silva received the Graduation diploma and the Ph.D. degree in electrical engineering from the University of Coimbra, Portugal, in 1984 and 1996, respectively. He is currently an assistant professor at the Department of Electrical and Computer Engineering, University of Coimbra, where he lectures digital signal processing and information and coding theory. His research focuses on signal processing, image and video compression and coding theory, carried out mainly at the Instituto de Telecomunicacoes, Coimbra, Portugal. He has published over 90 papers and has successfully supervised several post-graduate theses.

Leonel Sousa received the Ph.D. degree in electrical and computer engineering from IST at the Technical University of Lisbon, Portugal, in 1996. He is currently an associate professor of the Electrical and Computer Engineering Department at IST and a senior researcher at INESC-ID. His research interests include VLSI architectures, and parallel and distributed computing. He has contributed to more than 150 papers in journals and international conferences. He is currently a member of HiPEAC and an associate editor of the EURASIP Journal on Embedded Systems, and is also a senior member of IEEE and a member of ACM.