Architectural Improvement for Field Programmable Counter Array: Enabling Efficient Synthesis of Fast Compressor Trees on FPGA
Alessandro Cevrero 1,2, Panagiotis Athanasopoulos 1,2, Hadi Parandeh-Afshar 2, Paolo Ienne 2, Yusuf Leblebici 1, Ajay K. Verma 2, Philip Brisk 2, Frank K. Gurkaynak 1
16th ACM/SIGDA International Symposium on FPGAs, Monterey, California, USA, February 26, 2008
Motivation and Contribution
Goal: Improve FPGA performance for arithmetic circuits.
Field Programmable Counter Array (FPCA) [Brisk et al., DAC 2007]:
Programmable IP core to accelerate compressor trees
Hybrid FPGA/FPCA device
Contributions:
Completely new FPCA architecture
Reduced routing delay
More flexibility and better mapping
Simplified integration process
FPGA Commentary
Logic cells with dedicated addition circuitry and fast carry chains
Support for ternary addition [Altera Stratix II/III, Xilinx Virtex-5]
Parallel accumulation uses adder trees
ASIC designers use compressor trees!
Compressor tree synthesis on FPGAs via GPC mapping [Parandeh-Afshar et al., ASP-DAC 2008, DATE 2008]
Faster than ternary adder trees
IP cores
DSP48, BlockRAM, etc. [Xilinx, Altera]
FP cores [Beauchamp et al., TVLSI 2008]
Mismatches in bitwidth limit gains [Kuon and Rose, FPGA 2006, TCAD 2007]
Methodology and Solution
1. Transform circuit to merge disparate addition and multiplication operations to expose compressor trees
FPCA synthesis heuristic:
Map columns of input bits onto FPCA
Minimize the height of the compressor tree
Avoid vertical configurations when possible
[Figure: Multi-FPCA configurations, horizontal and vertical, with routing delay marked]
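To make "minimize the height of the compressor tree" concrete, here is a minimal sketch of column-wise compression. It is not the FPCA synthesis heuristic from the talk: it assumes only 3:2 counters (full adders) and a simple greedy layering, whereas the FPCA uses counters of multiple sizes. The function names compress_stage and tree_height are hypothetical.

```python
def compress_stage(heights):
    """Apply one layer of 3:2 counters to every column.

    heights[i] is the number of bits of weight 2**i. Each 3:2 counter
    consumes three bits from column i and produces one sum bit in
    column i and one carry bit in column i + 1.
    """
    out = [0] * (len(heights) + 1)
    for i, h in enumerate(heights):
        counters, leftover = divmod(h, 3)
        out[i] += counters + leftover  # sum bits plus bits passed through
        out[i + 1] += counters         # carry bits move to the next column
    while len(out) > 1 and out[-1] == 0:
        out.pop()                      # drop an unused top column
    return out


def tree_height(heights, target=2):
    """Count the layers needed until no column holds more than `target` bits."""
    stages = 0
    while max(heights) > target:
        heights = compress_stage(heights)
        stages += 1
    return stages, heights


# Summing four 4-bit operands: four bits in each of four columns.
stages, final = tree_height([4, 4, 4, 4])
print(stages, final)  # 2 [2, 1, 2, 2, 2]
```

Once every column holds at most two bits, the two remaining rows go to a carry-propagate adder (CPA). Fewer layers means a shorter critical path, which is why the heuristic targets tree height.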
CSlice Synthesis
CSlice V2.0 rank-3 with 16 input bits per CSlice
90nm Artisan standard cell library
CSlice            Rank-1   Rank-2   Rank-3
Area [µm²]        1240     2347     2770
Delay [ns]        0.40     0.71     0.73
CPA delay [ns]    0.04     0.05     0.07
FPCA Synthesis:
Rank-3 CSlices used in experiments
8 CSlices per FPCA
Similar to dimensions of a DSP block in current FPGAs
Simplifies integration process
DFFs store configuration bitstream
Semi-custom design
Standard cells are predominant
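As a rough sanity check on the FPCA size, the figures above can be combined arithmetically. This sketch assumes the table's area is for a single CSlice and that CSlice areas simply add. It ignores the configuration DFFs and any wiring between CSlices, so the area is a lower bound at best.

```python
# Figures quoted on the slide.
INPUT_BITS_PER_CSLICE = 16
CSLICES_PER_FPCA = 8
RANK3_AREA_UM2 = 2770

# Assumption: per-CSlice figures scale linearly with the CSlice count.
fpca_input_bits = INPUT_BITS_PER_CSLICE * CSLICES_PER_FPCA
cslice_area_um2 = RANK3_AREA_UM2 * CSLICES_PER_FPCA

print(fpca_input_bits)  # 128
print(cslice_area_um2)  # 22160
```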
FPCA Delay Extraction
Methodology:
Each FPCA instance is replaced with an F* instance (same I/O)
Extract the delay between F* instances
Combine these delays with the combinational delay extracted for the FPCA
[Figure: SUM blocks with input pins and output pins]
Define a pre-placed soft IP core: F*
Same dimensions and I/O as FPCA
Map onto Stratix II FPGA
Extract critical path delay
Replace all sum operations with F*
Map compressor tree onto FPCA
Configuration DFF values set to constant values; not optimized
Measure critical path delay
For each compressor tree in the circuit:
Subtract delay of F*
Add FPCA delay
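The final substitution step can be sketched as a small function. This is an illustration under stated assumptions, not the authors' tooling: the function name, its parameters, and every number in the example are hypothetical, and it assumes all listed compressor trees lie on the same path.

```python
def estimate_fpca_path_delay(measured_path_ns, trees_on_path):
    """Swap the F* placeholder delay for the FPCA delay on one path.

    measured_path_ns: critical path delay of the design mapped with F* cores.
    trees_on_path: (f_star_delay_ns, fpca_delay_ns) for each compressor
    tree on that path.
    """
    estimate = measured_path_ns
    for f_star_ns, fpca_ns in trees_on_path:
        estimate = estimate - f_star_ns + fpca_ns
    return estimate


# Hypothetical delays, chosen only to exercise the function.
example = estimate_fpca_path_delay(10.0, [(3.0, 1.0), (2.5, 1.5)])
print(example)  # 7.0
```

Routing delay between instances stays in the measured figure, since F* has the same dimensions and I/O as the FPCA. Only the block's internal delay is exchanged.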
[Figure: F* instances and the corresponding FPCA instances]
Experimental Results
Comparison
GPC mapping [Parandeh-Afshar et al., ASP-DAC 2008]
FPCA mapping (6 FPCAs per device)
[Figure: FPCA Speedup Over GPC Mapping. Bar chart with series "GPC Mapping" and "FPCA", vertical axis from 0 to 3; labeled values 2.40x and 1.60x]
Conclusion
Future Work
New FPCA architecture
Hardwired connections between counters
Counters of multiple sizes organized into CSlices
Carry chains between CSlices
Avg./Max. speedups of 1.60x/2.40x compared to GPC mapping