Serial Code Accelerators for Heterogeneous Multi-core Processor with 3D Memory. Philip Jacob, Thesis Defense, July 26, 2010. Committee members: John F. McDonald.

Dec 22, 2015

Transcript
  • Slide 1
  • Serial Code Accelerators for Heterogeneous Multi-core Processor with 3D Memory. Philip Jacob. Thesis Defense, July 26, 2010. Committee members: John F. McDonald, Tong Zhang, Paul Schoch, Christopher D. Carothers.
  • Slide 2
  • Outline: Need for serial code accelerator (clock race, multi-core CMOS, Amdahl's law); Alternate technologies (SiGe/FinFET etc., ECL/I2L); Architectural studies (HCRU CPI, multi-core 3D memory); Processor core and 3D memory (FPGA core model, chip designs); Thermal analysis; Conclusion & future research.
  • Slide 4
  • Motivation for High Clock Rate CPU: HPCS. Faster processing nodes to execute MPI code using SiGe HBTs. Improve packet handling to reduce communication latency. Ref 1: http://www.nas.nasa.gov/About/Projects/Columbia/columbia.html
  • Slide 5
  • Previous decade: the clock race suggested the need for 3D memory (figure label: Memory Wall). Ref 2: Hennessy and Patterson, Computer Architecture: A Quantitative Approach.
  • Slide 6
  • The Clock Race for CMOS Has Ended. 6 clock doubling times = 64 GHz! Ref 3: Wilfried Haensch, 2008 IBM TAPO meeting.
  • Slide 7
  • CMOS Repeater Crisis: Wires Don't Scale Well. The number of repeaters is exploding as a power of 10 per 33% shrink. Mx resistance is increasing with technology scaling. High resistance requires increased repeater counts. Power consumption increases because buffers are leaky and account for >50% of logic leakage. Designers are forced to reduce or hold the clock rate. Figure label: Chip Integration Technology Challenges. Ref 4: Ruchir Puri, IBM, 2007 Sematech/ACM, "Thermal and Design Issues in 3D ICs".
  • Slide 8
  • Result: Multi-cores in CMOS. Dual core to quad core to 50-core generation. Chips shown: dual core, quad core, 50-core Knights Corner cloud computing chip.
  • Slide 9
  • Is adding more cores the right solution? Amdahl's 1967 Figure of Merit (FOM) estimates the speedup of an overall system when only part of the system is improved, here by speeding up the parallel code with n cores. Ref 5: Gene Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities", AFIPS Conference, 1967.
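The figure of merit on this slide is commonly written as speedup = 1 / ((1 - p) + p / n), where p is the fraction of run time that can be parallelized and n is the number of cores. The sketch below evaluates that standard form; the function name and the example fraction of 0.9 are illustrative assumptions, not values from the slides.

```python
def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Overall speedup when only the parallel fraction is spread over n_cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is untouched; only the parallel part shrinks with n_cores.
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


if __name__ == "__main__":
    # Hypothetical workload: 90% of the run time is parallelizable.
    for n in (2, 4, 50):
        print(f"{n:>2} cores -> speedup {amdahl_speedup(0.9, n):.2f}")
```

Under this model the speedup can never exceed 1 / (1 - p), which is why the serial fraction dominates as the core count grows.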
  • Slide 10
  • Speed Up the Serial Code
  • Slide 11
  • Heterogeneous multi-core system with MCUs and a single HCRU for serial code. Block diagram: one HCRU and eight MCUs (MCU0 to MCU7). Turn off the high clock rate processor during parallel operation to save power. Integration could be either on the same chip or through a silicon carrier.
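One way to reason about this arrangement is a simple extension of Amdahl's law in which the serial fraction runs on the HCRU at some speedup s relative to an MCU, and the parallel fraction runs on the n MCUs. This model is an assumption added for illustration and is not taken from the slides; the serial speedup of 4 and the parallel fraction of 0.9 are hypothetical, while the count of eight MCUs matches the block diagram.

```python
def heterogeneous_speedup(parallel_fraction: float,
                          n_mcus: int,
                          hcru_speedup: float) -> float:
    """Speedup over one MCU when serial code runs on a faster HCRU.

    Assumed model: serial time shrinks by hcru_speedup, parallel time by n_mcus.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_mcus < 1 or hcru_speedup <= 0.0:
        raise ValueError("n_mcus must be >= 1 and hcru_speedup must be > 0")
    serial_time = (1.0 - parallel_fraction) / hcru_speedup
    parallel_time = parallel_fraction / n_mcus
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    p = 0.9  # hypothetical parallel fraction
    baseline = heterogeneous_speedup(p, n_mcus=8, hcru_speedup=1.0)
    with_hcru = heterogeneous_speedup(p, n_mcus=8, hcru_speedup=4.0)
    print(f"8 MCUs only:       {baseline:.2f}x")
    print(f"8 MCUs + 4x HCRU:  {with_hcru:.2f}x")
```

The model ignores the cost of handing work between the HCRU and the MCUs, so it should be read as an upper bound.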
  • Slide 13
  • Alternate technologies: SiGe HBT, strained Si, FinFETs.
  • Slide 14
  • SiGe HBT: vertical device. Three regions of operation: off, forward active, and saturation. The current equations are exponential, making these devices better drivers of wires.
  • Slide 15
  • Doping profile to form the heterojunction: Ge in the base region reduces the potential barrier to injection of electrons from the emitter into the base. The drift field accelerates the electrons. This results in increased Ic and reduced base transit time. Ref 6: Cressler, "On the potential of SiGe HBTs for extreme environment electronics", Proceedings of the IEEE, Sept 2005.
  • Slide 16
  • Scaling in SiGe HBTs. FOM: cutoff frequency. Solomon-Tang scaling rule: circuit delay scales with emitter size; shrink the emitter for constant total current, so collector current density goes up; supply voltage and swing voltage are constant. Nodes shown: 180nm, 130nm, 90nm. Ref 6: Cressler, "On the potential of SiGe HBTs for extreme environment electronics", Proceedings of the IEEE, Sept 2005.
  • Slide 17
  • Emitter Coupled Logic design: current-steering circuits; differential inputs/outputs; low voltage swings; taller trees for more complex gates, but higher static power consumption. Circuit shown: NAND gate.
  • Slide 18
  • D Flip Flop. Figure labels: latch, cross-coupled inverters.
  • Slide 19
  • Low power in bipolar: I2L (Integrated Injection Logic). Gates shown: INV, NAND, NOR. Vcc = 1V. Signal levels: low = 0.2V, high = 0.7V.
  • Slide 20
  • NPN-only IIL. Circuit labels: VEE, VCC, in, out. 1.1V power supply, 4.4ps rise time, 300mV swing. In collaboration with Tuhin, Srikumar. Ref 7: J.H. Pugsley and C.B. Silio, Proceedings of the 8th International Symposium on Multiple-Valued Logic, pp. 21-31, 1978.
  • Slide 21
  • ITRS Roadmap for CMOS Microprocessor Power
  • Slide 22
  • Apple-sponsored Exponential PowerPC: 0.7M Hitachi Si bipolars; 0.3 µm × 1.0 µm emitter; 20 GHz fT, 1995; 2.0M 0.5 µm FETs; die size 15mm × 10mm; metal pitch 2 µm; ~80 Watts; 0.75~0.85 GHz (last tapeout); mixed ECL (500mV swing) and CML (250mV swing); main power supply was 3.5V (most contemporary designs would use 2.5V).
  • Slide 24
  • High Level Architecture
  • Slide 25
  • CPI vs. clock vs. bus width. Cache structure: unified L0 (1KB), unified L1 (16KB), a huge L2. CPI = 7.82. Trace-driven simulator: Dinero. Cache access time: CACTI.
  • Slide 26
  • Access time improvement in BiCMOS over CMOS for the L1 cache (16K cache). Components: 1. decoder data, 2. word line, 3. sense amp data, 4. comparator, 5. mux, 6. sel inverter, 7. output driver. CMOS access time = 0.718ns; BiCMOS access time = 0.431ns. Ref 8: CACTI 4.2, 5.0, http://quid.hpl.hp.com:9081/cacti/detailed.y?new
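The two access times quoted on this slide can be turned into a ratio and a cycle count with a few lines of arithmetic. The sketch below uses the slide's 0.718ns and 0.431ns figures; the 10 GHz clock used for the cycle count is a hypothetical value chosen only for illustration.

```python
import math

CMOS_ACCESS_NS = 0.718    # from the slide
BICMOS_ACCESS_NS = 0.431  # from the slide


def improvement(cmos_ns: float, bicmos_ns: float) -> tuple[float, float]:
    """Return (speedup ratio, percent reduction in access time)."""
    speedup = cmos_ns / bicmos_ns
    reduction_pct = 100.0 * (cmos_ns - bicmos_ns) / cmos_ns
    return speedup, reduction_pct


def access_cycles(access_ns: float, clock_ghz: float) -> int:
    """Whole clock cycles needed to cover one access at the given clock."""
    # ns * GHz gives cycles; the small epsilon guards against float round-up.
    return math.ceil(access_ns * clock_ghz - 1e-9)


if __name__ == "__main__":
    speedup, reduction = improvement(CMOS_ACCESS_NS, BICMOS_ACCESS_NS)
    print(f"BiCMOS is {speedup:.2f}x faster ({reduction:.1f}% lower access time)")
    clock_ghz = 10.0  # hypothetical clock rate, not from the slides
    print("CMOS cycles:  ", access_cycles(CMOS_ACCESS_NS, clock_ghz))
    print("BiCMOS cycles:", access_cycles(BICMOS_ACCESS_NS, clock_ghz))
```

At a high clock rate the difference matters mostly through the number of whole cycles an L1 hit occupies, which is why the cycle count is shown alongside the raw ratio.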
  • Slide 27
  • SimpleScalar execution-driven simulator. 3D cache with wide bandwidth. Ref 9: www.simplescalar.com
  • Slide 28
  • Reducing CPI for HCRU: SimpleScalar simulator, 3-level cache, SPECint benchmarks. CPI around 2.5 to 3.
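CPI feeds directly into run time through the standard relation time = instructions × CPI / clock rate. The sketch below applies it to the CPI values quoted in this transcript (7.82 on Slide 25, and 2.5 to 3 here). The instruction count and clock rate are hypothetical, and the comparison assumes the two CPI figures come from comparable workloads, which the slides do not state.

```python
def execution_time_s(instruction_count: int, cpi: float, clock_hz: float) -> float:
    """Run time in seconds from the standard CPU performance equation."""
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    return instruction_count * cpi / clock_hz


if __name__ == "__main__":
    instructions = 1_000_000_000  # hypothetical program length
    clock_hz = 16e9               # hypothetical clock rate
    for cpi in (7.82, 3.0, 2.5):  # CPI values quoted on the slides
        t_ms = 1e3 * execution_time_s(instructions, cpi, clock_hz)
        print(f"CPI {cpi:>4}: {t_ms:.1f} ms")
```

At a fixed clock rate, run time scales linearly with CPI, so lowering CPI is worth as much as an equal relative increase in clock rate.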
  • Slide 29
  • 3D processor-memory stack solution: multi-tier, multi-bank. Higher bandwidth through 3D vias translates to a multi-port cache that accesses multiple banks or tiers simultaneously. Good for multi-cores, where bus arbitration can be avoided.
  • Slide 30
  • Multiprocessor simulator: RSIM. Symmetric multiprocessor simulator adapted for 3D memory over multi-core. Ref 10: RSIM, http://rsim.cs.uiuc.edu/rsim/
  • Slide 31
  • Multi-core processor RSIM results: FFT benchmark.
  • Slide 33
  • 7-stage pipelined processor core. Block diagram labels: Instruction Decode; Register File Stage 1; Register File Stage 2; Operand preparation; ALU; Post Ex/Write Back Queue; Update Remote PC; L0 i-cache + Remote Program Counter; L0 d-cache; Pipeline controller (FSM); Instruction queue; Core Test input (instruction sequence generator). Signal labels: Pipeline stage control signals; Signals to FSM; Data Bus; External signals & traps; ALU feed forward; Output Scan Chain; Data; Reg File.
  • Slide 34
  • Dual-ported 8HP register file: 2 read ports / 1 write port, size = 8 words. Read port A operation at 18.4 GHz (measured). Ref 11: Okan Erdogan, PhD Thesis, 2008.
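The port arrangement on this slide (two read ports, one write port, eight words) can be modeled behaviorally in a few lines. This is only a software sketch: the 32-bit word width and the read-before-write ordering within a cycle are assumptions, and the class and method names are hypothetical.

```python
from typing import Optional


class RegisterFile:
    """Behavioral model: 2 read ports, 1 write port, 8 words (per the slide)."""

    def __init__(self, num_words: int = 8, word_bits: int = 32) -> None:
        # word_bits is an assumption; the slide does not give the word width.
        self._mask = (1 << word_bits) - 1
        self._words = [0] * num_words

    def cycle(self, read_addr_a: int, read_addr_b: int,
              write_addr: Optional[int] = None,
              write_data: int = 0) -> tuple[int, int]:
        """One clock cycle: two reads, then an optional write.

        Assumed ordering: reads return the values held before this
        cycle's write is applied.
        """
        out_a = self._words[read_addr_a]
        out_b = self._words[read_addr_b]
        if write_addr is not None:
            self._words[write_addr] = write_data & self._mask
        return out_a, out_b


if __name__ == "__main__":
    rf = RegisterFile()
    print(rf.cycle(0, 1, write_addr=3, write_data=0xABCD))  # reads old values
    print(rf.cycle(3, 3))                                   # reads the new value
```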
  • Slide 35
  • CLA carry chain test structure. Measured waveform of the 8HP adder test chip: 26.67 GHz. Ref 12: Paul Belemjian, PhD Thesis, 2008.
  • Slide 36
  • Operand Preparation block. ALU select encoding:
      S2 S1 S0   ALU
      L  L  L    CLEAR
      L  L  H    B MINUS A
      L  H  L    A MINUS B
      L  H  H    A PLUS B
      H  L  L    A xor B
      H  L  H    A + B
      H  H  L    AB
      H  H  H    PRESET
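The select encoding on this slide amounts to a three-bit lookup. The sketch below decodes the S2/S1/S0 levels into the operation names exactly as they appear on the slide; the function and table names are hypothetical. Because "A PLUS B" is listed separately, "A + B" and "AB" may denote logical OR and AND, but that reading is an assumption and the code does not rely on it.

```python
# Operation names copied verbatim from the slide, keyed by S2 S1 S0 levels.
ALU_SELECT = {
    "LLL": "CLEAR",
    "LLH": "B MINUS A",
    "LHL": "A MINUS B",
    "LHH": "A PLUS B",
    "HLL": "A xor B",
    "HLH": "A + B",
    "HHL": "AB",
    "HHH": "PRESET",
}


def decode_alu_select(s2: str, s1: str, s0: str) -> str:
    """Map three select levels ('L' or 'H') to the slide's operation name."""
    key = f"{s2}{s1}{s0}".upper()
    if len(key) != 3 or any(level not in "LH" for level in key):
        raise ValueError("each select line must be 'L' or 'H'")
    return ALU_SELECT[key]


if __name__ == "__main__":
    for code in sorted(ALU_SELECT):
        print(code, "->", decode_alu_select(*code))
```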
  • Slide 37
  • Pipeline Controller FSM chip. Diagram labels: CLOCK, SET, HLT, STALL_CACHE, STALL_BR, UNSTALL_CACHE; X, Y, Z; Pipe Clear; FSM States; Data I/p counter; Test output; STAGE 1; STAGE 2; Pipe control signal.
  • Slide 38
  • 3D FDSOI CMOS Process - MITLL. Ref 13: MIT LL process documentation.
  • Slide 39
  • 3D cache floor plan & microphotograph. Labels: Way0, Way1, Way2, Way3, TAG ARRAY, 3D Via, Controller. In collaboration with Aamir Zia.
  • Slide 40
  • Measured results of the 3D memory chip: measured waveform of alternating read after write from Tier 1 at 500MHz clock; measured waveform with a string of consecutive 0s from Tier 3.
  • Slide 41
  • Floor planning (5mm × 5mm). Blocks: i-cache (reg file) 5w; Inst Q (4 words) 1.4w; Inst Decoder 1w; FSM (pipeline ctrl) 1w; Reg File 5w; Op. Prep 1w; Adder 2.5w; Write/store queue 1.4w; SERDES 2.5w; four L0 d-cache (reg file) blocks at 5w each; Test Inst generator; i-cache (reg file) 5w; L1 CACHE.
  • Slide 42
  • Thermal studies of processor floor plan using COMSOL: 335K. The substrate is too thick, so the heat does not spread into the bottom sink. Deep trench isolation in the SiGe HBT prevents lateral heat spreading. In collaboration with Okan Erdogan.
  • Slide 43
  • Use of diamond heat spreaders. Silicon thinning to 50 µm, and bonding to 50 µm diamond. View at the diamond-Cu boundary for a 50 µm diamond layer under the CPU with one tier of 3D memory. Ref 14: J.C. Sung et al., "Semiconductor on Diamond (SOD) for System on Chip (SoC) Architectures", VMIC Conference, Sept. 2006, pp. 35-38.
  • Slide 44
  • Thermal studies with processor and 3D memory: 313K. Wafer thinning; diamond substrate; Cu heat spreading interface layers.
  • Slide 46
  • Milestones. Fall 2004-2005: preliminary study of 3D architecture. 2005-2006: DQE, IEEE D&T paper accepted, processor design on FPGA, MS degree. 2006-2007: processor redesign on FPGA, multi-core processor evaluations, completion of course work, candidacy. 2007-2008: chip implementation, testing blocks; operand preparation blocks and pipeline controller implementation in 8HP SiGe. 2009-2010: Amdahl's law and heterogeneous core integration, thesis defense.
  • Slide 47
  • Publications:
      "Mitigating Memory wall effects in High clock rate and Multi-core CMOS 3D ICs - Processor Memory Stacks", Philip Jacob, Aamir Zia, Mike Chu, Jin Woo Kim, Russell Kraft, John F. McDonald, and Kerry Bernstein, Proceedings of the IEEE, 3D IC special issue, Vol. 97, No. 1, Jan 2009, pp 108-122.
      "Predicting the Performance of a 3D Processor-Memory Chip Stack", Philip Jacob, Okan Erdogan, Aamir Zia, Paul M. Belemjian, Russell Kraft and John F. McDonald, IEEE Design and Test, Nov-Dec 2005, pp 540-547. (cited 14 times)
      "A Three-Dimensional L2 cache with Ultra-Wide Data Bus for 3D Processor-Memory Integration", Aamir Zia, Philip Jacob, Russell P. Kraft and John F. McDonald, Transactions in VLSI, IEEE, Vol. 18, No. 6, June 2010, pp 967-977.
      "A 40Gs/s Time Interleaved ADC using SiGe BiCMOS technology", Michael Chu, Philip Jacob, Jin-Woo Kim, Mitchell LeRoy, Russell Kraft, John F. McDonald, JSSC, IEEE, Vol. 45, No. 2, Feb 2010, pp 380-390.
      "A Reconfigurable 40 GHz BiCMOS Uniform Delay Crossbar Switch for Broadband and Wide Tuning Range Narrowband Applications", Jin-woo Kim, Michael Chu, Philip Jacob, Aamir Zia, Russell Kraft, John F. McDonald, IET Circuits, Devices and Systems. [Accepted]
  • Slide 48
  • Conclusion & future research goals: need for a fast core; possible alternative technologies, especially SiGe; chip designs in 3D memory and SiGe for the processor core; thermal analysis using COMSOL; heterogeneous core integration with 3D memory is the way forward! IIL logic for low power operations; serial code/parallel code separation.
  • Slide 49
  • Questions? Thank you for your attention.