Mapping CMPs to Xilinx FPGAs

Jan Gray, Architect, Office of the CTO, Microsoft (fpgacpu.org, fpga-cpu list)

1/20/05



Outline

Why am I here? (1)
FPGA CMPs: a brief personal history (1)
A methodology for great quality of results (7)
Mapping a scalar RISC PE to an FPGA (5)
CMP and RAMP comments (4)


Why Am I Here?

End of rapid clock frequency scaling – parallelism imperative – we get it…
The vast design space of 1-billion-transistor SoCs, ~2010
  Enables dozens of cores, integration, 100s of GFLOPS, but how can the millions of real-world developers (not grad students) exploit it?
  Particularly a challenge in ‘client personal computing’ settings
~2010, industry must deliver loveable (mainstream, evolutionary) concurrency programming models
RAMP promises rapid iteration → rapid innovation in tools and architectural support for loveable concurrency models
[How] can we study mainstream commercial workloads, tools, and platforms on RAMP CMPs?


My Journey into FPGA CMPs

Inspired by comp.arch, many Hot Chips conferences, H&P
91: Freidin’s RISC4005: 20 MHz |4| 16-bit RISC in 4005
95: jr32: 33÷2 MHz |4|, 32-bit RISC + SoC in 4010
98: XSOC/xr16: 40 MHz |3|, 16-bit RISC in 260 LCs in S10
  Lcc, sims, SoC, Circuit Cellar series, fpgacpu.org, fpga-cpu list
00: Altera NIOS
01: gr1040: 200 LUT + 1 BRAM, 80 MHz |2|; 60 in V600E [P]
01: Xilinx MicroBlaze (125 MHz |3| in 2Vxxx)
04: 10 MicroBlaze in 2V2000 via EDK 6.3i
04: 24 multithreaded ‘MB’ in 4VLX25 [PD]
06: 24 ‘PowerPC’ in 4VLX25 [PD], 200 MHz |4|, 133 MHz |2|
[P] = PAR/TRCE only; [D] = PAR/TRCE of datapath only; |x| = x-stage pipeline, no FP


A Methodology for Great Quality of Results: It’s Essential for CMPs!

It’s the golden age of FPGA development
  Was: timing whack-a-mole, synthesis pushing on a rope
  Now: good fast tools, fast computers, better fabrics
But: 2-10x better delay×area by tailoring ISA, HW/SW partitioning, datapath, pipeline, tech mapping, floorplanning to the FPGA
  Prefer forty 200 MHz processors per die to twenty 100 MHz ones
  Example: always @(posedge clk) q <= add ? q + a : b;
    Hand tech mapped and floorplanned: 1 LUT/bit
    Synthesis: 2 LUT/bit, +0.5 ns delay
  5X faster place and route → rapid (methodical) experiments
Efficiency of ASIC CPU models on FPGAs?
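The 1 LUT/bit figure for the example above is plausible because each bit slice needs only four inputs (add, q, a, b). The slide does not spell out the hand mapping, so the Python sketch below models one possible mapping under stated assumptions: the 4-input LUT computes the carry-propagate signal, MULT_AND gates the carry-generate input, and MUXCY/XORCY form the carry chain. The function names are illustrative and not from the talk.

```python
def addmux_bit(add, q, a, b, cin):
    """One bit slice: a 4-input LUT plus MULT_AND, MUXCY, and XORCY."""
    p = (q ^ a) if add else b   # LUT4(add, q, a, b): propagate when adding, else pass b
    di = a & add                # MULT_AND: generate input, forced to 0 when add = 0
    cout = cin if p else di     # MUXCY: pass the carry along or inject di
    s = p ^ cin                 # XORCY: sum bit
    return s, cout


def addmux(add, q, a, b, width):
    """Bit-serial model of the carry chain for q <= add ? q + a : b."""
    carry, out = 0, 0           # carry-in of bit 0 is tied to 0
    for i in range(width):
        s, carry = addmux_bit(add, (q >> i) & 1, (a >> i) & 1, (b >> i) & 1, carry)
        out |= s << i
    return out


def reference(add, q, a, b, width):
    """Behavioral reference, truncated to the register width."""
    mask = (1 << width) - 1
    return ((q + a) & mask) if add else (b & mask)
```

When add is 0, every generate input is 0 and the carry-in is 0, so no carry ever enters the chain and the XORCY outputs are simply b. This is why the mux costs no extra LUT in this model.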


The Art of High Performance FPGA Design / How to Hack FPGAs like Ray Andraka

Great FPGA designers have The Knowledge
Best practices for great datapath QoR:
  Choose the datapath’s technology mapping, pipeline regime, and floorplan, and then write the HDL
  Bottom-up experiments
  Where it matters, use technology mapping tricks
  Build up libraries of optimal datapath elements
  Floorplan datapaths via Relationally Placed Macros (RPMs)
    VHDL (generate + attributes) or Python + Verilog
  Synthesize 95% of control unit – life is too short
  Careful timing constraints, grok TRCE reports
  Tune architecture and implementation together
    Sweat the muxes
  To iterate is divine


The Knowledge in a Nutshell

The LUT and its DFF
Tech mapping opts to quash mux LUTs
The BRAM
The DSP48
(The DCM)


The LUT and its D-FF

4-input LUT
Ripple-carry adder
  MUXCY and XORCY: ~2.5 ns 32-bit adder
MULTAND
  P[i] = B[i] ? (P[i-1] + (A<<i)) : P[i-1];
Mux cascades – MUXF5 etc.
16x1-bit LUT-RAM; sp, dp
SRL16 – 16-bit tapped shift reg
D-Flip-Flops
  Clock enable, synchronous reset, system reset regime
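The MULTAND recurrence above is a shift-and-add multiply, with one adder row per bit of B. The following Python sketch models one row at the bit level. It assumes that MULT_AND forms the partial-product bit, the LUT forms the propagate signal, and MUXCY/XORCY do the addition. The names multand_row and shift_add_multiply are hypothetical, chosen only for this illustration.

```python
def multand_row(p_prev, a_shifted, b_bit, width):
    """One row: P[i] = B[i] ? (P[i-1] + (A << i)) : P[i-1]."""
    carry, out = 0, 0
    for j in range(width):
        pj = (p_prev >> j) & 1
        x = ((a_shifted >> j) & 1) & b_bit   # MULT_AND: partial-product bit
        prop = pj ^ x                        # LUT: carry propagate
        out |= (prop ^ carry) << j           # XORCY: sum bit
        carry = carry if prop else x         # MUXCY: propagate or generate
    return out


def shift_add_multiply(a, b, n):
    """Multiply two n-bit unsigned values using n rows of width 2n."""
    p = 0                                    # P[-1] = 0
    for i in range(n):
        p = multand_row(p, a << i, (b >> i) & 1, 2 * n)
    return p
```

When B[i] is 0, every partial-product bit is 0, so the row passes P[i-1] through unchanged. The recurrence therefore needs no separate mux.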


1 LUT/bit Technology Mapping Opts

ADDSUB: o <= add ? (a+b) : (a-b);
MUX2K: o <= k ? sel : (sel ? a : b);
MULTAND + carry-chain:
  ADDMUX: o <= add ? (a+b) : c;
  MUXADD: o <= addb ? (a+b) : (a+c);
  ALU: o <= s1 ? (s2 ? (a+b) : (a-b)) : (s2 ? (a&b) : (a^b));
Fast carry-chain-logic
  EQV: o[i/2] <= a[i+1:i] == b[i+1:i]
  EQZ: o[i/4] <= a[i+3:i] == 0
  C, V conditions
Other cheap mux ideas
  4-1 MUX using 2 2-1 MUXES and a MUXF5
  LUT-RAM / SRL16 is a 16-1 MUX
  4-input OR of 4 clearable registers
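The EQV and EQZ entries above use the carry chain as a wide AND gate. Each 4-input LUT checks a small group of bits, and the chain passes a 1 through only if every LUT agrees. The Python sketch below models that idea. It assumes the chain's carry-in is tied to 1 and each MUXCY data input is tied to 0. The helper names are illustrative.

```python
def carry_chain_and(lut_outputs):
    """AND of all LUT outputs, computed as a MUXCY chain."""
    carry = 1                        # carry-in of the first stage tied to 1
    for o in lut_outputs:
        carry = carry if o else 0    # MUXCY: propagate, or force 0 on a mismatch
    return carry


def eqv(a, b, width):
    """a == b, with each LUT comparing 2 bits of a against 2 bits of b."""
    luts = [int(((a >> i) & 0b11) == ((b >> i) & 0b11)) for i in range(0, width, 2)]
    return carry_chain_and(luts)


def eqz(a, width):
    """a == 0, with each LUT testing 4 bits of a."""
    luts = [int(((a >> i) & 0xF) == 0) for i in range(0, width, 4)]
    return carry_chain_and(luts)
```

In this model a 32-bit equality compare takes 16 LUTs and a 32-bit zero detect takes 8, with the combining done in the carry chain rather than in further LUT levels.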


BRAM

18 Kb dual port synch SRAM
  Up to dual x32+x4 D/Q
0 cycle: tAS ~0.5 ns; tCO ~2.1 ns
  BRAM … adder … BRAM 6 ns Virtex-4
Opt. 1 cycle: DO*_REG: 0.9 ns
  400+ MHz
Byte write enables
FIFOs for ser/des rate matching
The Myriad Uses of BRAMs on fpgacpu.org
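As a rough check on the "BRAM … adder … BRAM 6 ns" figure, the quoted delays can be summed in a few lines of Python. This is only a sketch. It assumes the ~2.5 ns 32-bit adder figure from the LUT slide applies to the same path, and it treats whatever is left of the 6 ns as routing, which the slide does not state.

```python
# Figures quoted on the slides (approximate).
T_CO_NS = 2.1    # BRAM clock-to-out
T_ADD_NS = 2.5   # 32-bit ripple-carry adder (assumed to apply here)
T_AS_NS = 0.5    # BRAM address setup
CYCLE_NS = 6.0   # quoted BRAM -> adder -> BRAM cycle

logic_ns = T_CO_NS + T_ADD_NS + T_AS_NS
remaining_ns = CYCLE_NS - logic_ns   # assumption: this is the routing allowance
fmax_mhz = 1000.0 / CYCLE_NS

print(f"logic {logic_ns:.1f} ns, remaining {remaining_ns:.1f} ns, {fmax_mhz:.0f} MHz")
```

Under these assumptions the logic takes about 5.1 ns, which leaves a little under 1 ns for everything else on the path. That points to floorplanning, since a tight path like this has little routing slack.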


MULT/DSP48

Dozens to hundreds in V-4
Pipeline at 400+ MHz
Faster adders than the fabric
Basis for interesting fast simulated FPUs


QoR Examples

xr16 core
  ISA codesigned with datapath
  Elide 1 result forwarding mux, compensate in SW
  Map result mux, shifter to TBUFs
gr1040 core – 200 LUTs + 1 BRAM
  2 stage pipeline – elide all result forwarding muxes
  BRAM for instructions and data
  Use 1 LUT/bit ‘ALU’ – delete OR operator – 67% smaller
  Use ADDMUX – faster, 30% smaller
  C, V, branch, and i-cache tag check in carry-chain-logic


Mapping a Scalar RISC PE to an FPGA

Instruction cache, data cache
  Cache lines – 1+ BRAMs
  Cache tags – LUT RAM or BRAM
  Read first mode for write-back caches
Register file
  Single or dual ported LUT RAM
  Multicontext reg files in BRAM
ALU
  Tech mapping tricks; DSP48?
Result forwarding muxes
Multithreading – MicroUnity, HEP
  Deep pipelines OK
    Clock pipeline faster than operand regs → ALU → forwarding → operand regs recurrence
  LUT RAM PCs, SPRs, PSWs; BRAM reg files
  But probably too much pressure on tiny i-caches and d-caches?
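The "multicontext reg files in BRAM" idea comes down to address arithmetic. An 18 Kb block configured 32 data bits wide is 512 words deep, so it can hold several threads' register sets side by side. The Python sketch below assumes 32 registers per context, which the slide does not specify. The class and constant names are hypothetical.

```python
BRAM_DATA_BITS = 32      # x32 data (+ x4 parity, unused here)
BRAM_WORDS = 512         # depth of an 18 Kb block at this width
REGS_PER_CONTEXT = 32    # assumption: a 32-register RISC
NUM_CONTEXTS = BRAM_WORDS // REGS_PER_CONTEXT


class MulticontextRegFile:
    """Behavioral model: one BRAM holding every context's registers."""

    def __init__(self):
        self.mem = [0] * BRAM_WORDS

    @staticmethod
    def address(context, reg):
        """BRAM address = {context, reg}: context in the high bits, register in the low bits."""
        if not (0 <= context < NUM_CONTEXTS and 0 <= reg < REGS_PER_CONTEXT):
            raise ValueError("context or register index out of range")
        return context * REGS_PER_CONTEXT + reg

    def read(self, context, reg):
        return self.mem[self.address(context, reg)]

    def write(self, context, reg, value):
        self.mem[self.address(context, reg)] = value & ((1 << BRAM_DATA_BITS) - 1)
```

With these numbers a single block holds 16 contexts. A real design still has to supply two read ports and one write port per cycle, for example with a second BRAM or by time-multiplexing the ports, and this model leaves that out.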


Simple Is Beautiful

Simpler is smaller
Smaller is cheaper
  More PEs per part
Smaller can be faster
  Interconnect is slow, so the less, the better
  Easier to optimize (retiming, floorplanning, technology mapping)
Smaller is more power frugal
Simpler is easier to verify
Move complexity out of the ISA, trap into software, or use dynamic translation to the simpler ISA (WCED?)


“Jan’s Razor”

In a chip multiprocessor design, strive to leave out all but the minimal kernel set of features from each processing element, so as to maximize processing elements per die.

Small clusters of cores share mul/div, barrel shift, FPU, TLB, even d-cache port


Silly Example: 70 ‘PowerPC-lite’ datapaths in a 2VP70


Which ISAs for RAMP PEs?

Best fit in an FPGA fabric (== austerity)
  MicroBlaze, MIPS, SPARC, PowerPC, x86
  x86 + PC periphs via dynamic translation?
Extant soft cores: MB, SPARC
2VP/4VFX + EDK (CoreConnect *) bonus
  MB, PowerPC
Commercial workloads and tools
  PowerPC!


PE Figures of Merit

Area: #[LUTs, BRAMs, DSPs, DCMs]
Frequency, power, floorplanned? (fast PAR)
Simplicity / ease of modification
  Some experiments will augment base CPU ISAs
Facilities
  Validation
  Debug support
  Tools integration
Workloads
IP Rights


Speculation: How to Experiment Upon Commercial X86 Workloads on RAMP

X86 HW seems too complex for an area- and time-efficient large-n FPGA CMP
  386, x64, v8086, x87, MMX, SSE2/3, SMM, hypervisor exts, …
Don’t underestimate complexity of rest of system components / cores
Build a ‘PowerPC’ CMP, run a port of the Virtual PC for Mac x86 dynamic translation engine upon it, run apps on that
Save/restore PC workloads to VHD images
(When you have many cores, you don’t mind if your simulator spends a few on dyn translation)


Other Thoughts

Compose optimized building blocks into synthesized (floorplanned?) system architectures
Synplify Pro has a great RTL viewer
MicroBlaze is an excellent, Type B core
EDK is a great framework
  Can plug in HW and SW components, bus masters and slaves, new CPU cores and OS and periphs, BSPs
  EDK ships with a broad complement of cores
  Don’t reinvent all that!
  EDK vs. RDL?
QinetiQ (?) FPU IP


Comments? Thanks.
