Fast Design Space Exploration using Vivado HLS: Non-Binary LDPC Decoders

João Andrade*, Nithin George†, Kimon Karras‡, David Novo†, Vítor Silva*, Paolo Ienne†, Gabriel Falcão*
* Instituto de Telecomunicações, Dept. of Electrical and Computer Engineering, Univ. of Coimbra, Portugal
† École Polytechnique Fédérale de Lausanne (EPFL), School of Computer and Communication Sciences, Switzerland
‡ Xilinx Research Labs, Dublin, Ireland

Introduction: Non-Binary LDPC Decoders on FPGAs

- We explore a complex error-correction signal-processing algorithm: non-binary LDPC decoding (FFT-SPA).

Fig. 1: Non-binary LDPC factor graph example and message-passing algorithm (variable nodes VN1-VN6, check nodes CN1-CN3; messages m_vc(x) and m_cv(x) are cyclically permuted, taken through the Walsh-Hadamard transform at the check nodes, and depermuted).

- We use a high-level synthesis tool to design an LDPC decoder FPGA accelerator.
- Vivado HLS allows:
  - fast design-space exploration via directive-based optimizations;
  - C/C++ code as input for generating an FPGA accelerator.

Proposed LDPC Decoder Accelerator

Fig. 2: Non-binary LDPC decoder base solution block diagram (the vn_proc, cn_proc, fwht, permute, and depermute kernels are nested-loop structures that fetch data from DRAM into BRAMs, compute, and store the results back).

- LDPC decoder characteristics:
  - three dimensions of computation:
    - N×d_v (= M×d_c) probability mass functions (pmfs);
    - 2^m probabilities per pmf, where 2^m is the Galois field dimension;
    - d_v (resp. d_c) pmfs per node;
  - each dimension is defined over a computation loop.
- Applied LDPC computation:
  - fast Walsh-Hadamard transform (fwht);
  - Hadamard products (vn_proc/cn_proc);
  - cyclic permutations ((de)permute).
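The fwht kernel transforms each 2^m-point pmf with a fast Walsh-Hadamard transform. As a sketch of the loop structure this kind of kernel exposes to the HLS directives (illustrative C for a single pmf, not the accelerator's actual source, which also loops over all pmfs), an in-place transform might look like:

```c
#include <stddef.h>

/* Illustrative in-place fast Walsh-Hadamard transform of one pmf with
 * n = 2^m entries (n a power of two); a sketch, not the decoder's HLS code. */
static void fwht(float pmf[], size_t n)
{
    for (size_t len = 1; len < n; len <<= 1) {        /* log2(n) stages   */
        for (size_t i = 0; i < n; i += len << 1) {    /* butterfly groups */
            for (size_t j = i; j < i + len; j++) {    /* butterflies      */
                float a = pmf[j];
                float b = pmf[j + len];
                pmf[j]       = a + b;
                pmf[j + len] = a - b;
            }
        }
    }
}
```

Each loop level is a dimension where a directive (unrolling, pipelining) can be applied, which is why the poster insists that every exploited parallelism dimension gets its own loop.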
- Under-the-hood transformations:
  - 3 different nested-loop structures:
    - cn_proc/vn_proc: 3 loops, triple-nested;
    - depermute/permute: 2 loops, double-nested;
    - fwht: 5 loops, triple-nested;
  - no computation is performed directly on DRAM data → high bandwidth is available, but access latency is high;
  - data is moved into BRAM for computation in a prologue and back to DRAM in an epilogue.

High-Level Architecture

Fig. 3: High-level architecture (two board DRAM banks behind a memory interface, AXI4 interconnects, and K HLS IP cores with local BRAMs) and die shot with 3 decoders placed and routed.

- Vivado HLS exports an accelerator design as an IP-XACT package without external I/O, a clock interface, or AXI4 data movers:
  - 1 DRAM and AXI-M controller per SODIMM (2 in total);
  - 1 port on the AXI-M controllers per instantiated accelerator (K in total).

Proposed Accelerator Optimizations

Table 2: Optimizations carried out for each solution.

  Solution             I   II   III   IV   V   VI
  Unrolling                X    X          X   X
  Pipelining                    X              X
  Array partitioning                 X    X   X

(Pipelining a loop in Vivado HLS also fully unrolls the loops nested beneath it, which is why solutions III and VI are marked as unrolled as well.)

- We combined the following optimizations into the 6 tested solutions:
  - loop unrolling (II, V);
  - loop pipelining (III, VI);
  - array partitioning (IV, V, VI).
- In some cases, the optimization directives cannot be applied until the code has been refactored.
- Every dimension along which parallelism is exploited must be defined in its own loop; otherwise unrolling and pipelining become unmanageable:
  - in fact, some optimization configurations do not complete C-synthesis.
- Pipelining targets an initiation interval of II=1.
- Unrolling is always complete (full) unrolling.
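To illustrate how the directives of Table 2 attach to the kernel loops, here is a hypothetical vn_proc-style Hadamard product of d_v incoming pmfs (the function name, PMF_LEN, and the loop structure are illustrative, not the paper's source; the #pragma HLS lines are genuine Vivado HLS directives that a standard C compiler simply ignores, so the sketch still runs on a CPU):

```c
#include <stddef.h>

#define PMF_LEN 4 /* 2^m probabilities per pmf: GF(2^2) here, illustrative */

/* Hypothetical vn_proc-style kernel: element-wise (Hadamard) product of
 * the d_v pmfs entering one variable node. */
void vn_proc(const float in[][PMF_LEN], float out[PMF_LEN], size_t d_v)
{
#pragma HLS ARRAY_PARTITION variable=out complete /* solutions IV, V, VI */
    for (size_t k = 0; k < PMF_LEN; k++)
        out[k] = 1.0f;

    for (size_t n = 0; n < d_v; n++) {         /* pmfs per node          */
#pragma HLS PIPELINE II=1                      /* solutions III, VI      */
        for (size_t k = 0; k < PMF_LEN; k++) { /* probabilities per pmf  */
#pragma HLS UNROLL                             /* solutions II, V        */
            out[k] *= in[n][k];
        }
    }
}
```

Pipelining the outer loop at II=1 fully unrolls the inner loop, and complete partitioning of `out` supplies the parallel memory ports the unrolled accesses then need, matching the poster's point that extra bandwidth only pays off when ALUs consume the data.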
Experimental Results: Latency vs. LUT Utilization

Fig. 4: Latency (10^0 to 10^6 cycles, log scale) and clock frequency of operation (160-260 MHz) of each LDPC accelerator solution I-VI, for the vn_proc, cn_proc, permute, depermute, and fwht kernels over GF({2^2, 2^3, 2^4}).

- Applying the different optimizations yields a set of Pareto points with tradeoffs in frequency and LUT utilization:
  - providing more memory ports (higher bandwidth) is useful only if ALUs consume the data;
  - clock frequency varies widely across the solutions (160-260 MHz);
  - pipelining gives diminishing returns in latency reduction (depermute/permute) for increasing Galois field dimensions.

Comparison with RTL-based Decoders

Table 1: Decoding throughput, FPGA utilization, and frequency of operation.

  Decoder                 m   K    LUT [%]      FF [%]   BRAM [%]   DSP [%]   Thr. [Mbit/s]   Clk [MHz]
  This work               2   1    14           7        0.5        0.5       1.17            250
                          2   14   80           35       6          6         14.54           219
                          3   1    21           9        0.9        0.9       0.95            250
                          3   6    81           34       5          5         4.81            210
                          4   1    30           13       2          2         0.66            216
                          4   3    73           32       5          5         1.85            201
  Emden @ ISTC'10         2   -    N/A          -        -          -         33.16           100
                          4   -    -            -        -          -         13.22           -
                          8   -    -            -        -          -         1.56            -
  Zhang @ TCAS-I'11       4   1    48 (Slices)  -        41         N/A       9.3             N/A
  Boutillon @ TCAS-I'13   6   -    19           6        1          N/A       2.95            61
  Scheiber @ ICECS'13     -   1    14 (Slices)  -        21         N/A       13.4            122
  Andrade @ ICASSP'14     8   -    85 (LEs)     -        62         7         1.1             163

Fig. 5: Pareto and non-Pareto optimization points, measured in latency (μs, log scale) vs. LUT utilization (%), for GF(4), GF(8), and GF(16); moving along the Pareto front trades a larger circuit at the same latency against the same circuit size at lower latency.

- LUT utilization grows with the Galois field dimension.
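The Pareto fronts of Fig. 5 separate dominated design points from optimal ones in the (LUT utilization, latency) plane. A small, self-contained sketch of that selection (the helper and the sample points below are made up for illustration, not the paper's measurements):

```c
#include <stddef.h>

/* One design-space-exploration point: LUT utilization (%) and latency (us). */
struct point { double lut; double latency; };

/* Marks each Pareto-optimal point in keep[] (1 = kept): a point is dominated
 * when another point is no worse in both dimensions and strictly better in
 * one.  Returns the number of points on the front.  Illustrative only. */
static size_t pareto_front(const struct point p[], size_t n, int keep[])
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        keep[i] = 1;
        for (size_t j = 0; j < n && keep[i]; j++)
            if (j != i &&
                p[j].lut <= p[i].lut && p[j].latency <= p[i].latency &&
                (p[j].lut < p[i].lut || p[j].latency < p[i].latency))
                keep[i] = 0; /* p[j] dominates p[i] */
        kept += keep[i];
    }
    return kept;
}
```

On the poster's data this is exactly the filter that discards, e.g., a solution that spends more LUTs without reducing latency.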
- The observed Pareto points clearly illustrate the diminishing returns in the latency-for-LUTs tradeoff.
- We can settle on the most optimized solution (VI) and increase the number K of LDPC decoder accelerators instantiated in the high-level architecture.
- RTL-based circuits still achieve higher performance, but we get quite close even though HLS is used:
  - approximately 50% of the decoding throughput;
  - though only with several (K) decoders instantiated.

Conclusions

- We show that, by combining the right optimizations, we can reach within 50% of RTL-based LDPC decoders.
- The programming language is the same, but the programming model is different:
  - code refactoring is still required;
  - the parallelism dimensions to be exploited must be exposed in proper loop structures.
- By instantiating the accelerators in a suitable high-level architecture, we can fit multiple accelerators on the FPGA, further raising the level of parallelism.

23rd IEEE FCCM, May 3-5, Vancouver, BC, Canada
This work was supported by the Portuguese Fundação para a Ciência e a Tecnologia (FCT) under grants UID/EEA/50008/2013 and SFRH/BD/78238/2011.