Architectural Improvement for Field Programmable Counter Array: Enabling Efficient Synthesis of Fast Compressor Trees on FPGA


Alessandro Cevrero 1,2
Panagiotis Athanasopoulos 1,2
Hadi Parandeh-Afshar 2
Paolo Ienne 2
Yusuf Leblebici 1
Ajay K. Verma 2
Philip Brisk 2
Frank K. Gurkaynak 1

1 2

16th ACM/SIGDA International Symposium on FPGAs

Monterey, California, USA, February 26, 2008

Motivation and Contribution

Goal: Improve FPGA performance for arithmetic circuits.

Field Programmable Counter Array (FPCA):

[Brisk et al., DAC 2007]

• Programmable IP core to accelerate compressor trees

• Hybrid FPGA/FPCA device

Contributions:

• Completely new FPCA architecture

• Reduced routing delay

• More flexibility and better mapping

• Simplified integration process


FPGA Commentary

Logic cells with dedicated addition circuitry and fast carry chains

• Support for ternary addition [Altera Stratix II/III, Xilinx Virtex-5]

• Parallel accumulation uses adder trees

ASIC designers use compressor trees!

• Compressor tree synthesis on FPGAs via GPC mapping [Parandeh-Afshar et al., ASP-DAC 2008, DATE 2008]

• Faster than ternary adder trees

IP Cores

• DSP48, BlockRAM, etc. [Xilinx, Altera]

• FP cores [Beauchamp et al., TVLSI 2008]

• Mismatches in bitwidth limit gains [Kuon and Rose, FPGA 2006, TCAD 2007]
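To make the contrast between adder trees and compressor trees concrete, the following is a minimal Python sketch of carry-save reduction. It is an illustration only, not the GPC mapping of the cited papers: the function names `csa` and `compressor_tree_sum` are hypothetical, operands are assumed to be non-negative integers, and only 3:2 compression is modelled. The point it shows is that each reduction level works bitwise without propagating carries, so a single carry-propagate addition is needed at the very end.

```python
def csa(a, b, c):
    """3:2 carry-save step: three operands in, two out, no carry propagation."""
    sum_bits = a ^ b ^ c                              # per-column sum bit
    carry_bits = ((a & b) | (a & c) | (b & c)) << 1   # per-column carry, moved one column left
    return sum_bits, carry_bits


def compressor_tree_sum(operands):
    """Reduce many operands to two with 3:2 steps, then add once (the CPA)."""
    ops = list(operands)
    while len(ops) > 2:
        grouped = len(ops) - len(ops) % 3   # number of operands that form full groups of three
        nxt = []
        for i in range(0, grouped, 3):
            nxt.extend(csa(ops[i], ops[i + 1], ops[i + 2]))
        nxt.extend(ops[grouped:])           # leftover operands pass to the next level
        ops = nxt
    return sum(ops)                         # final carry-propagate addition


print(compressor_tree_sum([3, 5, 7, 9, 11]))
```

An adder tree, by comparison, would perform a full carry-propagate addition at every level of the tree.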

Methodology and Solution

1. Transform circuit to merge disparate addition and multiplication operations to expose compressor trees

• [Verma and Ienne, ICCAD 2004]

2. Synthesize compressor tree onto FPCA

• [Brisk et al., DAC 2007]

3. Map everything else onto traditional FPGA

• Standard approach

4. Integrate FPGA+FPCA onto same die

• Ongoing research at EPFL
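Step 1 above can be pictured with a small Python sketch. It is not the transformation of [Verma and Ienne, ICCAD 2004]; it only illustrates, under simplifying assumptions (unsigned operands that fit in `width` bits, hypothetical function names), how the partial products of a multiplication and the bits of a separate addend fall into the same weighted columns, so one compressor tree can sum them all.

```python
from collections import defaultdict


def merged_columns(a, b, c, width):
    """Collect the bits of a*b + c into columns; column i has weight 2**i."""
    cols = defaultdict(list)
    for i in range(width):
        for j in range(width):
            # Partial-product bit a_i AND b_j has weight 2**(i + j).
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    for k in range(width):
        # The addend's bits join the same columns instead of a separate adder.
        cols[k].append((c >> k) & 1)
    return cols


def column_value(cols):
    """Weighted sum of all bits, i.e. what the compressor tree must compute."""
    return sum(sum(bits) << i for i, bits in cols.items())


cols = merged_columns(5, 6, 3, width=4)
print(column_value(cols))                         # same value as 5 * 6 + 3
print({i: len(bits) for i, bits in sorted(cols.items())})  # column heights
```

The column heights printed at the end are the input that a compressor tree synthesis step would work from.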

FPCA: programmable compressor tree

Previous Work

Initial FPCA architecture [Brisk et al., DAC 2007]

Routing network delay

• Performance bottleneck

Poor area utilization

• Many resources unused

• Large counters implement the functionality of smaller counters

“Pitch matching” problem

• FPCA routing channels must align with FPGA routing channels

• Leads to unnecessarily large counters

Recurring Patterns in Compressor Tree Synthesis

[Figure: a column of 15 bits is reduced to 4, 3, and then 2 bits by 15:4, 4:3, and 3:2 counters, followed by a CPA]

New FPCA architecture:

Counter Slice (CSlice)

• Compress one column at a time

• Propagate carry bits to neighboring CSlices

Eliminates FPGA-style routing network

• No routing delay between counters

• Pitch matching problem disappears
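The column-at-a-time idea can be sketched as a behavioural Python model. This is an assumption-laden illustration, not the CSlice circuit: the names `counter_stage` and `cslice_chain` are hypothetical, each input column is assumed to hold at most 15 bits, and configuration logic is ignored. Each stage counts the 1-bits in a column and sends bit k of that count to column i+k, which is how carries reach neighboring slices and why the 15:4, 4:3, 3:2 pattern recurs.

```python
def counter_stage(columns, out_bits):
    """One m:out_bits counter per column; bit k of each count moves to column i+k."""
    result = [[] for _ in range(len(columns) + out_bits - 1)]
    for i, bits in enumerate(columns):
        count = sum(bits)
        assert count < (1 << out_bits), "column too tall for this counter"
        for k in range(out_bits):
            result[i + k].append((count >> k) & 1)
    return result


def cslice_chain(columns):
    """Run the 15:4 -> 4:3 -> 3:2 chain, then a final carry-propagate addition."""
    heights = []
    cols = columns
    for out_bits in (4, 3, 2):              # 15:4, then 4:3, then 3:2
        cols = counter_stage(cols, out_bits)
        heights.append(max(len(bits) for bits in cols))
    rows = [0, 0]                           # at most two bits per column remain
    for i, bits in enumerate(cols):
        for r, bit in enumerate(bits):
            rows[r] |= bit << i
    return rows[0] + rows[1], heights       # CPA result and max column height per stage


total, heights = cslice_chain([[1] * 15 for _ in range(4)])
print(total, heights)
```

For four columns of fifteen 1-bits each, this returns 225 with stage heights [4, 3, 2], matching the 15, 4, 3, 2 pattern of the previous slide.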


CSlice Architecture

[Figure: FPCA v2.0 built from adjacent CSlices (S_i, S_i+1, S_i+2, S_i+3), each containing 15:4, 4:3, and 3:2 counters and a CPA; figure labels: Configurable GPC, Area Utilization]

FPCA V2.0 Mapping Heuristic

FPCA synthesis heuristic:

• Map columns of input bits onto FPCA

• Minimize the height of the compressor tree

• Avoid vertical configurations, when possible

[Figure: Multi-FPCA configurations, horizontal and vertical; figure label: Routing Delay]

CSlice Synthesis

CSlice V2.0 rank-3 with 16 input bits per CSlice

90nm Artisan standard cell library

CSlice           Rank-1   Rank-2   Rank-3
Area [µm²]       1240     2347     2770
Delay [ns]       0.40     0.71     0.73
CPA delay [ns]   0.04     0.05     0.07

FPCA Synthesis:

Rank-3 CSlices used in experiments

8 CSlices per FPCA

Similar to dimensions of a DSP block in current FPGAs

Simplifies integration process

DFFs store configuration bitstream

Semi-custom design

Standard cells are predominant


FPCA Delay Extraction

Methodology:

• Each FPCA instance is replaced with an F* instance (same I/O)

• Extract the delay between F* instances

• Combine these delays with the combinational delay extracted for the FPCA

[Figure: SUM operations between input pins and output pins, each replaced by an F* instance and then by an FPCA]

Define a pre-placed soft IP core: F*

• Same dimensions and I/O as FPCA

• Map onto Stratix II FPGA

• Extract critical path delay

• Replace all sum operations with F*

Map compressor tree onto FPCA

• Configuration DFF values set to constant values; not optimized

• Measure critical path delay

For each compressor tree in the circuit

• Subtract delay of F*

• Add FPCA delay
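The final subtract-and-add step amounts to simple arithmetic, sketched below in Python. The function name `hybrid_path_delay` and all delay values in the usage line are hypothetical placeholders, not measurements from the slides; the sketch also assumes every listed compressor tree lies on the critical path that was extracted with F* in place.

```python
def hybrid_path_delay(fpga_path_ns, trees):
    """Estimate the FPGA+FPCA critical path from a path measured with F* blocks.

    fpga_path_ns: critical path delay of the design mapped with F* instances.
    trees: one (fstar_delay_ns, fpca_delay_ns) pair per compressor tree on that path.
    """
    delay = fpga_path_ns
    for fstar_ns, fpca_ns in trees:
        delay -= fstar_ns   # remove the soft placeholder's own delay
        delay += fpca_ns    # substitute the delay of the synthesized FPCA
    return delay


# Hypothetical numbers, for illustration only.
print(hybrid_path_delay(10.0, [(2.0, 0.8)]))
```

Routing delay between F* instances stays in `fpga_path_ns`, which is why the placeholder must have the same dimensions and I/O as the FPCA.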


Experimental Results

Comparison

• GPC Mapping [Parandeh-Afshar et al., ASP-DAC 2008]

• FPCA mapping (6 FPCAs per device)

[Figure: FPCA Speedup Over GPC Mapping; bar chart with vertical axis from 0 to 3, bars for GPC Mapping and FPCA, with 1.60x and 2.40x marked]

Conclusion

New FPCA architecture

• Hardwired connections between counters

• Counters of multiple sizes organized into CSlices

• Carry chains between CSlices

Avg./Max. speedups of 1.60x/2.40x compared to GPC mapping

Future Work

Add pipeline registers to FPCA

• Increase latency, increase clock frequency, throughput

Demonstrator chip taped out in October 2007

• Returned from the foundry in January 2008; PCBs ready next week

• Measure power consumption, clock frequency, I/O interface, etc.

[Figure: Demonstrator Chip]
