1/20/05 CMPs on FPGAs 1 Mapping CMPs to Xilinx FPGAs Jan Gray Architect, Office of the CTO, Microsoft (fpgacpu.org, fpga-cpu list)

Dec 21, 2015


1/20/05 CMPs on FPGAs 1

Mapping CMPs to Xilinx FPGAs

Jan Gray
Architect, Office of the CTO, Microsoft

(fpgacpu.org, fpga-cpu list)


Outline

Why am I here? (1)
FPGA CMPs: a brief personal history (1)
A methodology for great quality of results (7)
Mapping a scalar RISC PE to an FPGA (5)
CMP and RAMP Comments (4)


Why Am I Here?
End of rapid clock freq scaling – parallelism imperative – we get it…
The vast design space of 1 B trans. SoCs ~2010
  Enables dozens of cores, integration, 100s of GFLOPS, but how can the millions of real world developers (not grad students) exploit it?
  Particularly a challenge in ‘client personal computing’ settings
~2010, industry must deliver loveable (mainstream, evolutionary) concurrency programming models
RAMP promises rapid iteration → rapid innovation in tools and architectural support for loveable concurrency models
[How] can we study mainstream commercial workloads, tools, and platforms on RAMP CMPs?


My Journey into FPGA CMPs
Inspired by comp.arch, many Hot Chips conferences, H&P
91: Freidin’s RISC4005: 20 MHz |4| 16-bit RISC in 4005
95: jr32: 33÷2 MHz |4|, 32-bit RISC + SoC in 4010
98: XSOC/xr16: 40 MHz |3|, 16-bit RISC in 260 LCs in S10
  Lcc, sims, SoC, Circuit Cellar series, fpgacpu.org, fpga-cpu list
00: Altera NIOS
01: gr1040: 200 LUT+1 BRAM, 80 MHz |2|; 60 in V600E [P]
01: Xilinx MicroBlaze (125 MHz |3| in 2Vxxx)
04: 10 MicroBlaze in 2V2000 via EDK 6.3i
04: 24 multithreaded-’MB’ in 4VLX25 [PD]
06: 24 ‘PowerPC’ in 4VLX25 [PD], 200 MHz |4|, 133 MHz |2|
[P] = PAR/TRCE only; [D] = PAR/TRCE of datapath only; |x| = x-stage pipeline, no FP


A Methodology for Great Quality of Results: It’s Essential for CMPs!
It’s the golden age of FPGA development
  Was: timing whack-a-mole, synthesis pushing on a rope
  Now: good fast tools, fast computers, better fabrics
But: 2-10x better delay×area by tailoring ISA, HW/SW partitioning, datapath, pipeline, tech mapping, floorplanning to the FPGA
  Prefer 40 200 MHz processors/die to 20 100 MHz ones
Example: always @(posedge clk) q <= add ? q + a : b;
  Hand tech mapped and floorplanned: 1 LUT/bit
  Synthesis: 2 LUT/bit, +0.5 ns delay
5X faster place and route → rapid (methodical) expts
Efficiency of ASIC CPU models on FPGAs?
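
To make the 1 LUT/bit claim for the `q <= add ? q + a : b` example concrete, here is a bit-level Python model of one plausible hand mapping. It assumes a 4-input LUT per bit plus the slice's MULT_AND, MUXCY, and XORCY carry logic. The function names and the exact LUT equation are illustrative assumptions, not taken from the slides.

```python
def addmux_bit(add, q, a, b, cin):
    """One bit slice: a single 4-input LUT plus MULT_AND, MUXCY, and XORCY."""
    lut = (q ^ a) if add else b   # the one LUT: propagate term when adding, b when loading
    di = add & q                  # MULT_AND: carry generate term, forced to 0 when loading
    cout = cin if lut else di     # MUXCY: propagate the carry in, or generate from di
    return lut ^ cin, cout        # XORCY forms the result bit


def addmux(add, q, a, b, width):
    """Next value of q for the assumed mapping, built by rippling bit slices."""
    cin, result = 0, 0
    for i in range(width):
        s, cin = addmux_bit(add, (q >> i) & 1, (a >> i) & 1, (b >> i) & 1, cin)
        result |= s << i
    return result


def reference(add, q, a, b, width):
    """Behavioral meaning of: q <= add ? q + a : b (truncated to width bits)."""
    mask = (1 << width) - 1
    return ((q + a) if add else b) & mask
```

When `add` is 0 the generate term is 0 and the chain starts at 0, so every carry stays 0 and the XORCY passes `b` through unchanged. This is why the mux costs no extra LUT in this sketch.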


The Art of High Performance FPGA Design / How to Hack FPGAs like Ray Andraka
Great FPGA designers have The Knowledge
Best practices for great datapath QoR:
  Choose the datapath’s technology mapping, pipeline regime, and floorplan, and then write the HDL
    Bottom up experiments
    Where it matters, use technology mapping tricks
    Build up libraries of optimal datapath elements
    Floorplan datapaths via Relationally Placed Macros (RPMs)
      VHDL (generate + attributes) or Python + Verilog
  Synthesize 95% of control unit – life is too short
  Careful timing constraints, grok TRCE reports
  Tune architecture and implementation together
    Sweat the muxes
  To iterate is divine
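
As a rough illustration of the "Python + Verilog" flow for floorplanned datapaths, the sketch below computes a LUT4 INIT value from a Boolean function and emits one RLOC-annotated LUT4 instance per bit. The pin assignment, net names (`a`, `b`, `c`, `p`, `add`), the two-LUTs-per-slice packing, and the helper names are all assumptions made for illustration. The carry primitives and the set attributes that a complete RPM needs are omitted.

```python
def lut4_init(fn):
    """Pack a 4-input Boolean function into a 16-bit LUT INIT value.

    Bit k of INIT is the LUT output when {I3, I2, I1, I0} == k.
    """
    init = 0
    for k in range(16):
        i0, i1, i2, i3 = k & 1, (k >> 1) & 1, (k >> 2) & 1, (k >> 3) & 1
        init |= (fn(i0, i1, i2, i3) & 1) << k
    return init


def addmux_lut(a, b, c, add):
    # Hypothetical pin assignment: I0=a, I1=b, I2=c, I3=add.
    return (a ^ b) if add else c


def emit_lut_column(width, x=0):
    """Emit one RLOC-annotated LUT4 per bit, two bits per slice (assumed)."""
    init = lut4_init(addmux_lut)
    lines = []
    for i in range(width):
        lines.append(
            f'(* RLOC = "X{x}Y{i // 2}" *) '
            f"LUT4 #(.INIT(16'h{init:04X})) lut{i} "
            f"(.O(p[{i}]), .I0(a[{i}]), .I1(b[{i}]), .I2(c[{i}]), .I3(add));"
        )
    return "\n".join(lines)
```

Calling `print(emit_lut_column(32))` would produce the LUT column for a 32-bit datapath element. Deriving INIT from a plain Python function keeps the truth table and the generated HDL in one place, which suits the bottom-up experiments listed above.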


The Knowledge in a Nutshell

The LUT and its DFF
  Tech mapping opts to quash mux LUTs
The BRAM
The DSP48
(The DCM)


The LUT and its D-FF
4-input LUT
Ripple-carry adder
  MUXCY and XORCY: ~2.5 ns 32-bit adder
MULTAND
  P[i] = B[i] ? (P[i-1] + (A<<i)) : P[i-1];
Mux cascades – MUXF5 etc.
16x1-bit LUT-RAM; sp, dp
SRL16 – 16-bit tapped shift reg
D-Flip-Flops
  Clock enable, synchronous reset, system reset regime
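
The MULTAND recurrence above is ordinary shift-and-add multiplication. A minimal Python sketch, assuming unsigned operands and P[-1] = 0 (the function name is illustrative), shows that iterating it over the bits of B yields A×B:

```python
def shift_add_multiply(a, b, width):
    """Apply P[i] = B[i] ? (P[i-1] + (A << i)) : P[i-1] for i = 0 .. width-1."""
    p = 0  # assumed initial value P[-1]
    for i in range(width):
        # Add the shifted multiplicand only when bit i of b is set.
        p = p + (a << i) if (b >> i) & 1 else p
    return p
```

In the fabric, each step is a conditional add. The dedicated AND gate feeds the carry chain, so a row of this recurrence can fit in one LUT per bit without spending separate LUTs on the partial-product ANDs.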


1 LUT/bit Technology Mapping Opts
ADDSUB: o <= add ? (a+b) : (a-b);
MUX2K: o <= k ? sel : (sel ? a : b);
MULTAND + carry-chain:
  ADDMUX: o <= add ? (a+b) : c;
  MUXADD: o <= addb ? (a+b) : (a+c);
  ALU: o <= s1 ? (s2 ? (a+b) : (a-b)) : (s2 ? (a&b) : (a^b));
Fast carry-chain-logic
  EQV: o[i/2] <= a[i+1:i] == b[i+1:i]
  EQZ: o[i/4] <= a[i+3:i] == 0
  C, V conditions
Other cheap mux ideas
  4-1 MUX using 2 2-1 MUXES and a MUXF5
  LUT-RAM / SRL16 is a 16-1 MUX
  4-input OR of 4 clearable registers
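
The EQV and EQZ items can be read as wide comparisons folded into the carry chain: each 4-input LUT checks a small group of bits, and the MUXCY chain ANDs the per-LUT results. The Python model below is a sketch under that reading; the function names and the chain seeded with 1 are assumptions.

```python
def eq_carry_chain(a, b, width):
    """a == b using one LUT per 2 bits of each operand, ANDed by the carry chain."""
    carry = 1  # chain input: "equal so far"
    for i in range(0, width, 2):
        lut = ((a >> i) & 3) == ((b >> i) & 3)  # 4 LUT inputs: a[i+1:i], b[i+1:i]
        carry = carry if lut else 0             # MUXCY: pass the carry or force 0
    return carry


def eqz_carry_chain(a, width):
    """a == 0 using one LUT per 4 bits, ANDed by the carry chain."""
    carry = 1
    for i in range(0, width, 4):
        lut = ((a >> i) & 0xF) == 0             # 4 LUT inputs: a[i+3:i]
        carry = carry if lut else 0
    return carry
```

Under these assumptions a 32-bit equality test costs 16 LUTs and a 32-bit zero test costs 8, with the combining logic carried by the dedicated chain instead of a tree of LUTs.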


BRAM
18 Kb dual port synch SRAM
  Up to dual x32+x4 D/Q
0 cycle: tAS ~0.5 ns; tCO ~2.1 ns
  BRAM … adder … BRAM 6 ns Virtex-4
Opt. 1 cycle: DO*_REG: 0.9 ns 400+ MHz
Byte write enables
FIFOs for ser/des rate matching
The Myriad Uses of BRAMs on fpgacpu.org
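
One way to read the 6 ns BRAM-to-adder-to-BRAM figure is as a timing budget. The arithmetic below is a sketch. It assumes the ~2.5 ns 32-bit adder quoted on the LUT slide applies to the same path, and it treats whatever remains as routing; the slides do not give that breakdown.

```python
# Work in integer picoseconds to avoid floating-point rounding noise.
T_CO_PS = 2100    # BRAM clock-to-out, "tCO ~2.1 ns"
T_AS_PS = 500     # BRAM address setup, "tAS ~0.5 ns"
T_ADD_PS = 2500   # "~2.5 ns 32-bit adder" (assumed to apply to this path)
CYCLE_PS = 6000   # "BRAM ... adder ... BRAM 6 ns"

logic_ps = T_CO_PS + T_ADD_PS + T_AS_PS   # delay through BRAM, adder, and setup
routing_ps = CYCLE_PS - logic_ps          # what is left for interconnect
fmax_mhz = 1_000_000 / CYCLE_PS           # clock rate implied by a 6 ns cycle

print(logic_ps, routing_ps, round(fmax_mhz, 1))
```

Under these assumptions the logic accounts for 5.1 ns, leaving about 0.9 ns for routing, and a 6 ns cycle corresponds to roughly 167 MHz.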


MULT/DSP48

Dozens to hundreds in V-4
Pipeline at 400+ MHz
Faster adders than the fabric
Basis for interesting fast simulated FPUs


QoR Examples
xr16 core
  ISA codesigned with datapath
  Elide 1 result forwarding mux, compensate in SW
  Map result mux, shifter to TBUFs
gr1040 core – 200 LUTs + 1 BRAM
  2 stage pipeline – elide all result forwarding muxes
  BRAM for instructions and data
  Use 1 LUT/bit ‘ALU’ – delete OR operator – 67% smaller
  Use ADDMUX – faster, 30% smaller
  C, V, branch, and i-cache tag check in carry-chain-logic


Mapping a Scalar RISC PE to an FPGA
Instruction cache, data cache
  Cache lines – 1+ BRAMs
  Cache tags – LUT RAM or BRAM
  Read first mode for write-back caches
Register file
  Single or dual ported LUT RAM
  Multicontext reg files in BRAM
ALU
  Tech mapping tricks; DSP48?
Result forwarding muxes
Multithreading – MicroUnity, HEP
  Deep pipelines OK
  Clock pipeline faster than the operand regs → ALU → forwarding → operand regs recurrence
  LUT RAM PCs, SPRs, PSWs; BRAM reg files
  But probably too much pressure on tiny i-caches and d-caches?
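
A multicontext register file in BRAM can be pictured as one block RAM indexed by context number and register number together. The sketch below assumes a 32-register ISA and the 512-deep x32+x4 BRAM configuration; the names and the address layout are illustrative, not from the slides.

```python
BRAM_WORDS = 512         # 18 Kb block RAM in its 512 x 36 (32 data + 4 parity) shape
REGS_PER_CONTEXT = 32    # assumed: a RISC ISA with 32 general registers

# Number of hardware thread contexts whose registers fit in one BRAM.
MAX_CONTEXTS = BRAM_WORDS // REGS_PER_CONTEXT


def regfile_address(context, reg):
    """BRAM address of register `reg` in thread context `context`."""
    if not 0 <= context < MAX_CONTEXTS:
        raise ValueError("context out of range")
    if not 0 <= reg < REGS_PER_CONTEXT:
        raise ValueError("register out of range")
    # Context number forms the high address bits, register number the low bits.
    return context * REGS_PER_CONTEXT + reg
```

With these numbers a single BRAM holds 16 contexts. The two BRAM ports still limit how many operands can be read or written per cycle, so a real design would have to decide how reads and the write-back share them.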


Simple Is Beautiful
Simpler is smaller
Smaller is cheaper
  More PEs per part
Smaller can be faster
  Interconnect is slow, so the less, the better
  Easier to optimize (retiming, floorplanning, technology mapping)
Smaller is more power frugal
Simpler is easier to verify
Move complexity out of the ISA, trap into software, or use dynamic translation to the simpler ISA (WCED?)


“Jan’s Razor”

In a chip multiprocessor design, strive to leave out all but the minimal kernel set of features from each processing element, so as to maximize processing elements per die.

Small clusters of cores share mul/div, barrel shift, FPU, TLB, even d-cache port


Silly Example: 70 ‘PowerPC-lite’ datapaths in a 2VP70


Which ISAs for RAMP PEs?

Best fit in an FPGA fabric (==austerity)
  MicroBlaze, MIPS, SPARC, PowerPC, x86
  x86 + PC periphs via dynamic translation?
Extant soft cores: MB, SPARC
2VP/4VFX + EDK (CoreConnect *) bonus
  MB, PowerPC
Commercial workloads and tools
  PowerPC!


PE Figures of Merit
Area: #[LUTs, BRAMs, DSPs, DCMs]
Frequency, power, floorplanned? (fast PAR)
Simplicity / ease of modification
  Some experiments will augment base CPU ISAs
Facilities
  Validation
  Debug support
  Tools integration
  Workloads
  IP Rights


Speculation: How to Experiment Upon Commercial X86 Workloads on RAMP
X86 HW seems too complex for area and time efficient large-n FPGA CMP
  386, x64, v8086, x87, MMX, SSE2/3, SMM, hypervisor exts, …
  Don’t underestimate complexity of rest of system components / cores
Build a ‘PowerPC’ CMP, run a port of the Virtual PC for Mac x86 dynamic translation engine upon, run apps on that
  Save/restore PC workloads to VHD images
  (When you have many cores, you don’t mind if your simulator spends a few on dyn translation)


Other Thoughts
Compose optimized building blocks into synthesized (floorplanned?) system architectures
Synplify Pro has a great RTL viewer
MicroBlaze is an excellent, Type B core
EDK is a great framework
  Can plug in HW and SW components, bus masters and slaves, new CPU cores and OS and periphs, BSPs
  EDK ships with a broad complement of cores
  Don’t reinvent all that!
  EDK vs. RDL?
QinetiQ (?) FPU IP


Comments? Thanks.