Fast Design Space Exploration using Vivado HLS: Non-Binary LDPC Decoders

João Andrade*, Nithin George†, Kimon Karras‡, David Novo†, Vítor Silva*, Paolo Ienne†, Gabriel Falcão*
* Instituto de Telecomunicações, Dept. of Electrical and Computer Engineering, Univ. of Coimbra, Portugal
† École Polytechnique Fédérale de Lausanne (EPFL), School of Computer and Communication Sciences, Switzerland
‡ Xilinx Research Labs, Dublin, Ireland

Introduction: Non-Binary LDPC Decoders on FPGAs

- We explore a complex error-correction signal-processing algorithm: non-binary LDPC decoding (FFT-SPA).

Fig. 1: Non-binary LDPC factor graph example and message-passing algorithm (variable nodes VN1-VN6, check nodes CN1-CN3; messages m_vc(x) and m_cv(x) are cyclically permuted, taken through the Walsh-Hadamard transform at the check nodes, and depermuted).

- We use a high-level synthesis tool to design an LDPC decoder FPGA accelerator.
- Vivado HLS allows:
  - fast design-space exploration via directive-based optimizations;
  - C/C++ code as input for generating an FPGA accelerator.

Proposed LDPC Decoder Accelerator

Fig. 2: Non-binary LDPC decoder base solution block diagram (the vn_proc, cn_proc, fwht, permute, and depermute kernels are nested-loop structures that fetch data from DRAM into BRAMs, compute, and store the results back).

- LDPC decoder characteristics:
  - three dimensions of computation:
    - N×d_v (= M×d_c) probability mass functions (pmfs);
    - 2^m probabilities per pmf, where 2^m is the Galois field dimension;
    - d_v (resp. d_c) pmfs per node;
  - each dimension is defined over a computation loop.
- Applied LDPC computation:
  - fast Walsh-Hadamard transform (fwht);
  - Hadamard products (vn_proc/cn_proc);
  - cyclic permutations ((de)permute).
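The fwht kernel transforms each 2^m-point pmf with a fast Walsh-Hadamard transform. As a sketch of the loop structure this kind of kernel exposes to the HLS directives (illustrative C for a single pmf, not the accelerator's actual source, which also loops over all pmfs), an in-place transform might look like:

```c
#include <stddef.h>

/* Illustrative in-place fast Walsh-Hadamard transform of one pmf with
 * n = 2^m entries (n a power of two); a sketch, not the decoder's HLS code. */
static void fwht(float pmf[], size_t n)
{
    for (size_t len = 1; len < n; len <<= 1) {        /* log2(n) stages   */
        for (size_t i = 0; i < n; i += len << 1) {    /* butterfly groups */
            for (size_t j = i; j < i + len; j++) {    /* butterflies      */
                float a = pmf[j];
                float b = pmf[j + len];
                pmf[j]       = a + b;
                pmf[j + len] = a - b;
            }
        }
    }
}
```

Each loop level is a dimension where a directive (unrolling, pipelining) can be applied, which is why the poster insists that every exploited parallelism dimension gets its own loop.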
- Under-the-hood transformations:
  - 3 different nested-loop structures:
    - cn_proc/vn_proc: 3 loops, triple-nested;
    - depermute/permute: 2 loops, double-nested;
    - fwht: 5 loops, triple-nested;
  - no computation is performed directly on DRAM data → high bandwidth is available, but access latency is high;
  - data is moved into BRAM for computation in a prologue and back to DRAM in an epilogue.

High-Level Architecture

Fig. 3: High-level architecture (two board DRAM banks behind a memory interface, AXI4 interconnects, and K HLS IP cores with local BRAMs) and die shot with 3 decoders placed and routed.

- Vivado HLS exports an accelerator design as an IP-XACT package without external I/O, a clock interface, or AXI4 data movers:
  - 1 DRAM and AXI-M controller per SODIMM (2 in total);
  - 1 port on the AXI-M controllers per instantiated accelerator (K in total).

Proposed Accelerator Optimizations

Table 2: Optimizations carried out for each solution.

  Solution             I   II   III   IV   V   VI
  Unrolling                X    X          X   X
  Pipelining                    X              X
  Array partitioning                 X    X   X

(Pipelining a loop in Vivado HLS also fully unrolls the loops nested beneath it, which is why solutions III and VI are marked as unrolled as well.)

- We combined the following optimizations into the 6 tested solutions:
  - loop unrolling (II, V);
  - loop pipelining (III, VI);
  - array partitioning (IV, V, VI).
- In some cases, the optimization directives cannot be applied until the code has been refactored.
- Every dimension along which parallelism is exploited must be defined in its own loop; otherwise unrolling and pipelining become unmanageable:
  - in fact, some optimization configurations do not complete C-synthesis.
- Pipelining targets an initiation interval of II=1.
- Unrolling is always complete (full) unrolling.
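To illustrate how the directives of Table 2 attach to the kernel loops, here is a hypothetical vn_proc-style Hadamard product of d_v incoming pmfs (the function name, PMF_LEN, and the loop structure are illustrative, not the paper's source; the #pragma HLS lines are genuine Vivado HLS directives that a standard C compiler simply ignores, so the sketch still runs on a CPU):

```c
#include <stddef.h>

#define PMF_LEN 4 /* 2^m probabilities per pmf: GF(2^2) here, illustrative */

/* Hypothetical vn_proc-style kernel: element-wise (Hadamard) product of
 * the d_v pmfs entering one variable node. */
void vn_proc(const float in[][PMF_LEN], float out[PMF_LEN], size_t d_v)
{
#pragma HLS ARRAY_PARTITION variable=out complete /* solutions IV, V, VI */
    for (size_t k = 0; k < PMF_LEN; k++)
        out[k] = 1.0f;

    for (size_t n = 0; n < d_v; n++) {         /* pmfs per node          */
#pragma HLS PIPELINE II=1                      /* solutions III, VI      */
        for (size_t k = 0; k < PMF_LEN; k++) { /* probabilities per pmf  */
#pragma HLS UNROLL                             /* solutions II, V        */
            out[k] *= in[n][k];
        }
    }
}
```

Pipelining the outer loop at II=1 fully unrolls the inner loop, and complete partitioning of `out` supplies the parallel memory ports the unrolled accesses then need, matching the poster's point that extra bandwidth only pays off when ALUs consume the data.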
Experimental Results: Latency vs. LUT Utilization

Fig. 4: Latency (10^0 to 10^6 cycles, log scale) and clock frequency of operation (160-260 MHz) of each LDPC accelerator solution I-VI, for the vn_proc, cn_proc, permute, depermute, and fwht kernels over GF({2^2, 2^3, 2^4}).

- Applying the different optimizations yields a set of Pareto points with tradeoffs in frequency and LUT utilization:
  - providing more memory ports (higher bandwidth) is useful only if ALUs consume the data;
  - clock frequency varies widely across the solutions (160-260 MHz);
  - pipelining gives diminishing returns in latency reduction (depermute/permute) for increasing Galois field dimensions.

Comparison with RTL-based Decoders

Table 1: Decoding throughput, FPGA utilization, and frequency of operation.

  Decoder                 m   K    LUT [%]      FF [%]   BRAM [%]   DSP [%]   Thr. [Mbit/s]   Clk [MHz]
  This work               2   1    14           7        0.5        0.5       1.17            250
                          2   14   80           35       6          6         14.54           219
                          3   1    21           9        0.9        0.9       0.95            250
                          3   6    81           34       5          5         4.81            210
                          4   1    30           13       2          2         0.66            216
                          4   3    73           32       5          5         1.85            201
  Emden @ ISTC'10         2   -    N/A          -        -          -         33.16           100
                          4   -    -            -        -          -         13.22           -
                          8   -    -            -        -          -         1.56            -
  Zhang @ TCAS-I'11       4   1    48 (Slices)  -        41         N/A       9.3             N/A
  Boutillon @ TCAS-I'13   6   -    19           6        1          N/A       2.95            61
  Scheiber @ ICECS'13     -   1    14 (Slices)  -        21         N/A       13.4            122
  Andrade @ ICASSP'14     8   -    85 (LEs)     -        62         7         1.1             163

Fig. 5: Pareto and non-Pareto optimization points, measured in latency (μs, log scale) vs. LUT utilization (%), for GF(4), GF(8), and GF(16); moving along the Pareto front trades a larger circuit at the same latency against the same circuit size at lower latency.

- LUT utilization grows with the Galois field dimension.
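The Pareto fronts of Fig. 5 separate dominated design points from optimal ones in the (LUT utilization, latency) plane. A small, self-contained sketch of that selection (the helper and the sample points below are made up for illustration, not the paper's measurements):

```c
#include <stddef.h>

/* One design-space-exploration point: LUT utilization (%) and latency (us). */
struct point { double lut; double latency; };

/* Marks each Pareto-optimal point in keep[] (1 = kept): a point is dominated
 * when another point is no worse in both dimensions and strictly better in
 * one.  Returns the number of points on the front.  Illustrative only. */
static size_t pareto_front(const struct point p[], size_t n, int keep[])
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        keep[i] = 1;
        for (size_t j = 0; j < n && keep[i]; j++)
            if (j != i &&
                p[j].lut <= p[i].lut && p[j].latency <= p[i].latency &&
                (p[j].lut < p[i].lut || p[j].latency < p[i].latency))
                keep[i] = 0; /* p[j] dominates p[i] */
        kept += keep[i];
    }
    return kept;
}
```

On the poster's data this is exactly the filter that discards, e.g., a solution that spends more LUTs without reducing latency.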
- The observed Pareto points clearly illustrate the diminishing returns in the latency-for-LUTs tradeoff.
- We can settle on the most optimized solution (VI) and increase the number K of LDPC decoder accelerators instantiated in the high-level architecture.
- RTL-based circuits still achieve higher performance, but we get quite close even though HLS is used:
  - approximately 50% of the decoding throughput;
  - though only with several (K) decoders instantiated.

Conclusions

- We show that, by combining the right optimizations, we can reach within 50% of RTL-based LDPC decoders.
- The programming language is the same, but the programming model is different:
  - code refactoring is still required;
  - the parallelism dimensions to be exploited must be exposed in proper loop structures.
- By instantiating the accelerators in a suitable high-level architecture, we can fit multiple accelerators on the FPGA, further raising the level of parallelism.

23rd IEEE FCCM, May 3-5, Vancouver, BC, Canada
This work was supported by the Portuguese Fundação para a Ciência e a Tecnologia (FCT) under grants UID/EEA/50008/2013 and SFRH/BD/78238/2011.