1/20/05 CMPs on FPGAs
Mapping CMPs to Xilinx FPGAs
Jan Gray
Architect, Office of the CTO, Microsoft
(fpgacpu.org, fpga-cpu list)
Dec 21, 2015
Outline
  Why am I here? (1)
  FPGA CMPs: a brief personal history (1)
  A methodology for great quality of results (7)
  Mapping a scalar RISC PE to an FPGA (5)
  CMP and RAMP comments (4)
Why Am I Here?
  End of rapid clock frequency scaling: parallelism imperative (we get it…)
  The vast design space of 1-billion-transistor SoCs, ~2010
    Enables dozens of cores, integration, 100s of GFLOPS, but how can the millions of real-world developers (not grad students) exploit it?
    A particular challenge in ‘client personal computing’ settings
  ~2010: industry must deliver loveable (mainstream, evolutionary) concurrency programming models
  RAMP promises rapid iteration → rapid innovation in tools and architectural support for loveable concurrency models
  [How] can we study mainstream commercial workloads, tools, and platforms on RAMP CMPs?
My Journey into FPGA CMPs
  Inspired by comp.arch, many Hot Chips conferences, H&P
  91: Freidin’s RISC4005: 20 MHz |4| 16-bit RISC in 4005
  95: jr32: 33÷2 MHz |4|, 32-bit RISC + SoC in 4010
  98: XSOC/xr16: 40 MHz |3|, 16-bit RISC in 260 LCs in S10
    Lcc, sims, SoC, Circuit Cellar series, fpgacpu.org, fpga-cpu list
  00: Altera NIOS
  01: gr1040: 200 LUT + 1 BRAM, 80 MHz |2|; 60 in V600E [P]
  01: Xilinx MicroBlaze (125 MHz |3| in 2Vxxx)
  04: 10 MicroBlaze in 2V2000 via EDK 6.3i
  04: 24 multithreaded-’MB’ in 4VLX25 [PD]
  06: 24 ‘PowerPC’ in 4VLX25 [PD], 200 MHz |4|, 133 MHz |2|
  Key: [P] = PAR/TRCE only; [D] = PAR/TRCE of datapath only; |x| = x-stage pipeline, no FP
A Methodology for Great Quality of Results: It’s Essential for CMPs!
  It’s the golden age of FPGA development
    Was: timing whack-a-mole, synthesis pushing on a rope
    Now: good fast tools, fast computers, better fabrics
  But: 2-10x better delay×area by tailoring ISA, HW/SW partitioning, datapath, pipeline, tech mapping, floorplanning to the FPGA
    Prefer 40 200 MHz processors/die to 20 100 MHz ones
    Example: always @(posedge clk) q <= add ? q + a : b;
      Hand tech mapped and floorplanned: 1 LUT/bit
      Synthesis: 2 LUT/bit, +0.5 ns delay
  5X faster place and route → rapid (methodical) experiments
  Efficiency of ASIC CPU models on FPGAs?
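A rough arithmetic check of the "40 at 200 MHz versus 20 at 100 MHz" preference, as a small Python sketch. Aggregate core-MHz is used here as a crude throughput proxy; that metric and the name `aggregate_core_mhz` are assumptions for illustration, not something the slides define.

```python
def aggregate_core_mhz(cores: int, mhz: float) -> float:
    """Crude throughput proxy: number of cores times clock frequency."""
    return cores * mhz


tuned = aggregate_core_mhz(40, 200)     # the preferred design point
baseline = aggregate_core_mhz(20, 100)  # the design point to avoid
gain = tuned / baseline                 # half the area and half the delay

print(f"{tuned:.0f} vs {baseline:.0f} core-MHz per die: {gain:.0f}x")
```

Halving both area and delay per PE gives a 4x gain on this proxy, which sits inside the 2-10x delay×area range quoted above.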
The Art of High Performance FPGA Design / How to Hack FPGAs like Ray Andraka
  Great FPGA designers have The Knowledge
  Best practices for great datapath QoR:
    Choose the datapath’s technology mapping, pipeline regime, and floorplan, and then write the HDL
    Bottom-up experiments
    Where it matters, use technology mapping tricks
    Build up libraries of optimal datapath elements
    Floorplan datapaths via Relationally Placed Macros (RPMs)
      VHDL (generate + attributes) or Python + Verilog
    Synthesize 95% of control unit (life is too short)
    Careful timing constraints, grok TRCE reports
    Tune architecture and implementation together
    Sweat the muxes
    To iterate is divine
The Knowledge in a Nutshell
  The LUT and its DFF
  Tech mapping opts to quash mux LUTs
  The BRAM
  The DSP48
  (The DCM)
The LUT and its D-FF
  4-input LUT
  Ripple-carry adder
    MUXCY and XORCY: ~2.5 ns 32-bit adder
  MULTAND
    P[i] = B[i] ? (P[i-1] + (A<<i)) : P[i-1];
  Mux cascades: MUXF5 etc.
  16x1-bit LUT-RAM; sp, dp
  SRL16: 16-bit tapped shift reg
  D-Flip-Flops
    Clock enable, synchronous reset, system reset regime
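The MULTAND recurrence above is a conditional shift-and-add: row i either adds A shifted left by i, or passes the running sum through unchanged. A minimal Python sketch of that arithmetic follows. It models only the recurrence, not the LUT, MULT_AND, and carry-chain wiring; the function name and the `width` parameter are hypothetical.

```python
def shift_add_multiply(a: int, b: int, width: int = 16) -> int:
    """Unsigned multiply using P[i] = B[i] ? (P[i-1] + (A << i)) : P[i-1]."""
    p = 0  # P[-1], the initial partial product
    for i in range(width):
        b_i = (b >> i) & 1                 # multiplier bit that gates row i
        p = p + (a << i) if b_i else p     # one conditional adder row
    return p


print(shift_add_multiply(13, 11))
```

Each loop iteration corresponds to one adder row in a combinational array multiplier, so a `width`-bit multiplier bit vector needs `width` rows.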
1 LUT/bit Technology Mapping Opts
  ADDSUB: o <= add ? (a+b) : (a-b);
  MUX2K: o <= k ? sel : (sel ? a : b);
  MULTAND + carry-chain:
    ADDMUX: o <= add ? (a+b) : c;
    MUXADD: o <= addb ? (a+b) : (a+c);
  ALU: o <= s1 ? (s2 ? (a+b) : (a-b)) : (s2 ? (a&b) : (a^b));
  Fast carry-chain-logic
    EQV: o[i/2] <= a[i+1:i] == b[i+1:i]
    EQZ: o[i/4] <= a[i+3:i] == 0
    C, V conditions
  Other cheap mux ideas
    4-1 MUX using 2 2-1 MUXES and a MUXF5
    LUT-RAM / SRL16 is a 16-1 MUX
    4-input OR of 4 clearable registers
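To see why ADDMUX can fit in one LUT per bit, here is a bit-level Python model of one possible mapping onto a 4-input LUT, MULT_AND, MUXCY, and XORCY. The slides give only the expression `o <= add ? (a+b) : c`; the particular LUT equation below is a reconstruction that happens to work, not necessarily the author's, and the function names are hypothetical.

```python
def addmux_bitslice(add: int, a: int, b: int, c: int, cin: int):
    """One bit of o = add ? (a + b) : c. Returns (sum bit, carry out)."""
    p = (a ^ b) if add else c    # 4-input LUT of (add, a, b, c)
    di = a & add                 # MULT_AND output, feeds the MUXCY data input
    o = p ^ cin                  # XORCY
    cout = cin if p else di      # MUXCY: propagate the carry or inject di
    return o, cout


def addmux(add: int, a: int, b: int, c: int, width: int = 8) -> int:
    """Ripple the bit slices together; the result wraps modulo 2**width."""
    carry = 0
    result = 0
    for i in range(width):
        bit, carry = addmux_bitslice(
            add, (a >> i) & 1, (b >> i) & 1, (c >> i) & 1, carry)
        result |= bit << i
    return result


print(addmux(1, 100, 27, 55), addmux(0, 100, 27, 55))
```

With `add` low, the MULT_AND forces every injected carry to 0, so the chain stays at 0 and each XORCY simply passes `c` through. With `add` high, the slice behaves as an ordinary ripple-carry adder bit.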
BRAM
  18 Kb dual port synch SRAM
  Up to dual x32+x4 D/Q
  0 cycle: tAS ~0.5 ns; tCO ~2.1 ns
    BRAM … adder … BRAM 6 ns Virtex-4
  Opt. 1 cycle: DO*_REG: 0.9 ns 400+ MHz
  Byte write enables
  FIFOs for ser/des rate matching
  The Myriad Uses of BRAMs on fpgacpu.org
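A back-of-the-envelope budget for the BRAM-to-adder-to-BRAM path, as a Python sketch. It assumes the ~2.5 ns 32-bit adder figure from the LUT slide applies to the same path, and it attributes whatever remains of the 6 ns to routing and other overheads. Both are assumptions made here for illustration, not statements from the slides.

```python
T_CO_NS = 2.1     # BRAM clock-to-out, from the slide
T_ADDER_NS = 2.5  # assumption: the ~2.5 ns 32-bit adder from the LUT slide
T_AS_NS = 0.5     # BRAM address setup, from the slide
T_PATH_NS = 6.0   # quoted BRAM -> adder -> BRAM cycle time on Virtex-4

logic_ns = T_CO_NS + T_ADDER_NS + T_AS_NS  # delay in the named elements
other_ns = T_PATH_NS - logic_ns            # left over for routing etc.
fmax_mhz = 1000.0 / T_PATH_NS              # clock rate for a 6 ns cycle

print(f"logic {logic_ns:.1f} ns, other {other_ns:.1f} ns, {fmax_mhz:.0f} MHz")
```

Under these assumptions a 6 ns cycle corresponds to roughly 167 MHz, with most of the budget spent in the BRAM and the adder rather than in routing.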
MULT/DSP48
  Dozens to hundreds in V-4
  Pipeline at 400+ MHz
  Faster adders than the fabric
  Basis for interesting fast simulated FPUs
QoR Examples
  xr16 core
    ISA codesigned with datapath
    Elide 1 result forwarding mux, compensate in SW
    Map result mux, shifter to TBUFs
  gr1040 core: 200 LUTs + 1 BRAM
    2 stage pipeline: elide all result forwarding muxes
    BRAM for instructions and data
    Use 1 LUT/bit ‘ALU’ (delete OR operator): 67% smaller
    Use ADDMUX: faster, 30% smaller
    C, V, branch, and i-cache tag check in carry-chain-logic
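The i-cache tag check in carry-chain logic can be pictured with the EQV pattern from the technology mapping slide: each 4-input LUT compares two bits of each operand, and the MUXCY chain ANDs the per-LUT results. The Python model below is a sketch under that reading; the function name is hypothetical and `width` is assumed to be even.

```python
def carry_chain_equal(a: int, b: int, width: int = 32) -> int:
    """Return 1 if the low `width` bits of a and b match, else 0."""
    carry = 1  # constant 1 fed into the bottom of the carry chain
    for i in range(0, width, 2):
        # One 4-input LUT: two bits of a compared against two bits of b.
        lut = int(((a >> i) & 3) == ((b >> i) & 3))
        # MUXCY: pass the incoming carry on a match, force 0 on a mismatch.
        carry = carry if lut else 0
    return carry


print(carry_chain_equal(0xABCD, 0xABCD, 16), carry_chain_equal(0xABCD, 0xABCF, 16))
```

In this model a `width`-bit compare costs `width / 2` LUTs, and the match signal emerges from the top of the carry chain rather than from a tree of LUTs.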
Mapping a Scalar RISC PE to an FPGA
  Instruction cache, data cache
    Cache lines: 1+ BRAMs
    Cache tags: LUT RAM or BRAM
    Read first mode for write-back caches
  Register file
    Single or dual ported LUT RAM
    Multicontext reg files in BRAM
  ALU
    Tech mapping tricks; DSP48?
  Result forwarding muxes
  Multithreading (MicroUnity, HEP): deep pipelines OK
    Clock pipeline faster than operand regs → ALU → forwarding → operand regs recurrence
    LUT RAM PCs, SPRs, PSWs; BRAM reg files
    But probably too much pressure on tiny i-caches and d-caches?
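To make the "cache lines in 1+ BRAMs" point concrete, the Python sketch below splits an address for a direct-mapped cache whose data array is a single 18 Kb BRAM used as 512 x 32 bits. The 16-byte line size, the 32-bit address width, and all names are assumptions chosen for illustration; the slides do not specify a cache organization.

```python
from typing import NamedTuple

BRAM_WORDS = 512   # one 18 Kb BRAM as 512 x (32 data + 4 parity) bits
WORD_BYTES = 4
LINE_BYTES = 16    # assumption: 4-word cache lines
ADDR_BITS = 32     # assumption: 32-bit byte addresses

CACHE_BYTES = BRAM_WORDS * WORD_BYTES             # 2 KB of cached data
NUM_LINES = CACHE_BYTES // LINE_BYTES             # lines held in the BRAM
OFFSET_BITS = LINE_BYTES.bit_length() - 1         # byte-in-line bits
INDEX_BITS = NUM_LINES.bit_length() - 1           # line-select bits
TAG_BITS = ADDR_BITS - INDEX_BITS - OFFSET_BITS   # bits kept in the tag RAM


class CacheAddress(NamedTuple):
    tag: int        # compared against the stored tag on each access
    index: int      # selects the cache line and its tag entry
    offset: int     # byte within the line
    bram_word: int  # word address presented to the data BRAM


def split_address(addr: int) -> CacheAddress:
    """Split a byte address into tag, line index, offset, and BRAM word."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    bram_word = (addr // WORD_BYTES) & (BRAM_WORDS - 1)
    return CacheAddress(tag, index, offset, bram_word)


print(NUM_LINES, TAG_BITS, split_address(0x12345678))
```

With these assumed parameters the tag store needs 128 entries of 21 bits plus a valid bit, which is small enough for either of the tag options listed above.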
Simple Is Beautiful
  Simpler is smaller
  Smaller is cheaper
    More PEs per part
  Smaller can be faster
    Interconnect is slow, so the less, the better
    Easier to optimize (retiming, floorplanning, technology mapping)
  Smaller is more power frugal
  Simpler is easier to verify
  Move complexity out of the ISA, trap into software, or use dynamic translation to the simpler ISA (WCED?)
“Jan’s Razor”
In a chip multiprocessor design, strive to leave out all but the minimal kernel set of features from each processing element, so as to maximize processing elements per die.
Small clusters of cores share mul/div, barrel shift, FPU, TLB, even d-cache port
Which ISAs for RAMP PEs?
  Best fit in an FPGA fabric (== austerity)
    MicroBlaze, MIPS, SPARC, PowerPC, x86
    x86 + PC periphs via dynamic translation?
  Extant soft cores:
    MB, SPARC
  2VP/4VFX + EDK (CoreConnect *) bonus
    MB, PowerPC
  Commercial workloads and tools
    PowerPC!
PE Figures of Merit
  Area: #[LUTs, BRAMs, DSPs, DCMs]
  Frequency, power, floorplanned? (fast PAR)
  Simplicity / ease of modification
    Some experiments will augment base CPU ISAs
  Facilities
    Validation
    Debug support
    Tools integration
    Workloads
  IP Rights
Speculation: How to Experiment Upon Commercial X86 Workloads on RAMP
  X86 HW seems too complex for area and time efficient large-n FPGA CMP
    386, x64, v8086, x87, MMX, SSE2/3, SMM, hypervisor exts, …
    Don’t underestimate complexity of rest of system components / cores
  Build a ‘PowerPC’ CMP, run a port of the Virtual PC for Mac x86 dynamic translation engine upon it, run apps on that
    Save/restore PC workloads to VHD images
    (When you have many cores, you don’t mind if your simulator spends a few on dyn translation)
Other Thoughts
  Compose optimized building blocks into synthesized (floorplanned?) system architectures
  Synplify Pro has a great RTL viewer
  MicroBlaze is an excellent, Type B core
  EDK is a great framework
    Can plug in HW and SW components, bus masters and slaves, new CPU cores and OS and periphs, BSPs
    EDK ships with a broad complement of cores
    Don’t reinvent all that!
  EDK vs. RDL?
  QinetiQ (?) FPU IP