Digital Integrated Circuits: A Design Perspective
System on a Chip Design
Application Specific Integrated Circuits: Introduction
Jun-Dong Cho
SungKyunKwan Univ.
Dept. of ECE, Vada Lab.
http://vada.skku.ac.kr
Contents
• Why ASIC?
• Introduction to System On Chip Design
• Hardware and Software Co-design
• Low Power ASIC Designs
Why ASIC: Design productivity grows!
• Complexity increase 40 % per year
• Design productivity increase 15 % per year
Integration of PCB on single die
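The two growth rates above compound into a widening gap between what can be fabricated and what can be designed. A minimal sketch of that arithmetic follows; the year horizons and the function name `design_gap` are illustrative assumptions, not part of the slides.

```python
# Growth rates quoted on the slide; everything else is an illustrative assumption.
COMPLEXITY_GROWTH = 0.40     # chip complexity grows 40 % per year
PRODUCTIVITY_GROWTH = 0.15   # design productivity grows 15 % per year


def design_gap(years: int) -> float:
    """Ratio of complexity to productivity after `years`, both normalized to 1.0 at year 0."""
    complexity = (1.0 + COMPLEXITY_GROWTH) ** years
    productivity = (1.0 + PRODUCTIVITY_GROWTH) ** years
    return complexity / productivity


if __name__ == "__main__":
    for y in (1, 5, 10):
        print(f"after {y:2d} years the gap is {design_gap(y):.2f}x")
```

Under these rates the gap grows by roughly 22 % per year, which is one way to read the case for reuse and higher-level design made later in the deck.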
Silicon in 2010
Die Area: 2.5 x 2.5 cm
Voltage: 0.6 V
Technology: 0.07 µm

                  Density (Gbits/cm²)   Access Time (ns)
DRAM              8.5                   10
DRAM (Logic)      2.5                   10
SRAM (Cache)      0.3                   1.5

                  Density (Mgates/cm²)  Max. Ave. Power (W/cm²)  Clock Rate (GHz)
Custom            25                    54                       3
Std. Cell         10                    27                       1.5
Gate Array        5                     18                       1
Single-Mask GA    2.5                   12.5                     0.7
FPGA              0.4                   4.5                      0.25
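To get a feel for these projections, the per-area figures can be scaled to the quoted 2.5 x 2.5 cm die. The sketch below assumes, purely for illustration, that the whole die is filled with a single logic style; the name `full_die_budget` is hypothetical.

```python
DIE_SIDE_CM = 2.5  # die is 2.5 cm x 2.5 cm per the slide

# (density in Mgates/cm^2, max. average power in W/cm^2), copied from the figures above
LOGIC_STYLES = {
    "Custom": (25.0, 54.0),
    "Std. Cell": (10.0, 27.0),
    "Gate Array": (5.0, 18.0),
    "Single-Mask GA": (2.5, 12.5),
    "FPGA": (0.4, 4.5),
}


def full_die_budget(style: str) -> tuple[float, float]:
    """Return (Mgates, watts) if the entire die used one logic style (an assumption)."""
    density, power_density = LOGIC_STYLES[style]
    area_cm2 = DIE_SIDE_CM ** 2
    return density * area_cm2, power_density * area_cm2


if __name__ == "__main__":
    for name in LOGIC_STYLES:
        mgates, watts = full_die_budget(name)
        print(f"{name:15s} {mgates:8.2f} Mgates {watts:8.2f} W")
```

A fully custom die would come out at about 156 Mgates and several hundred watts, which hints at why power dominates the later part of the deck.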
ASIC Principles
• Value-added ASIC for huge volume opportunities; standard parts for quick time to market applications
• Economics of Design
  - Fast Prototyping, Low Volume
  - Custom Design, Labor Intensive, High Volume
• CAD Tools Needed to Achieve the Design Strategies
  - System-level design: Concept to VHDL/C
  - Physical design: VHDL/C to silicon, Timing closure (Monterey, Magma, Synopsys, Cadence, Avant!)
• Design Strategies: Hierarchy; Regularity; Modularity; Locality
ASIC Design Strategies
• Design is a continuous tradeoff to achieve performance specs with adequate results in all the other parameters.
• Performance Specs - function, timing, speed, power
• Size of Die - manufacturing cost
• Time to Design - engineering cost and schedule
• Ease of Test Generation & Testability - engineering cost, manufacturing cost, schedule
ASIC Flow
Structured ASIC Designs
• Hierarchy: Subdivide the design into many levels of sub-modules
• Regularity: Subdivide to max number of similar sub-modules at each level
• Modularity: Define sub-modules unambiguously with well defined interfaces
• Locality: Max local connections, keeping critical paths within module boundaries
ASIC Design Options
• Programmable Logic
• Programmable Interconnect
• Reprogrammable Gate Arrays
• Sea of Gates & Gate Array Design
• Standard Cell Design
• Full Custom Mask Design
• Symbolic Layout
• Process Migration - Retargeting Designs
ASIC Design Methodologies

                      Custom      Cell-based   Prediffused   Prewired
Density               Very High   High         High          Medium - Low
Performance           Very High   High         High          Medium - Low
Flexibility           Very High   High         Medium        Low
Design time           Very Long   Short        Short         Very Short
Manufacturing time    Medium      Medium       Short         Very Short
Cost - low volume     Very High   High         High          Low
Cost - high volume    Low         Low          Low           High
Why SOC?
• SOC specs are coming from system engineers rather than RTL descriptions
• SOC will bridge the gap between hardware/software and their implementation in novel, energy-efficient silicon architecture.
• In SOC design, chips are assembled at IP block level (design reusable) and IP interfaces rather than gate level
CMOS density now allows complete System-on-a-chip Solutions
[Figure: single-chip system block diagram. Blocks: Analog, A/D, digital down-conversion, demodulation and sync, Viterbi equalizer, de-interleaver & decoder, RPE-LTP speech decoder, speech quality enhancement, voice recognition, protocol control, phone book, keypad interface, DMA, S/P, DSP core, µP core, RAM & ROM, dedicated logic. Also like to add: FPGA, reconfigurable interconnect.]
Source: Brodersen, ICASSP '98
How do we design these chips?
Possible Single-Chip Radio Architectures
Software Radio
• GOAL: Simplify System Design Process
• Seek architectures which are flexible such that hardware and protocols can be designed independently
• APPROACH: Minimize the use of dedicated logic
Universal Radio
• GOAL: Maximize Bandwidth Efficiency and Battery Life
• Seek architectures which perform complex algorithms very fast with minimal energy
• APPROACH: Minimize the use of programmable logic
Why is SOC design so scary?
60 GHz SiGe Transceiver for Wireless LAN Applications
• A low power 30 GHz LNA is designed as the front end of the receiver.
• Wideband and high gain response is realized by a 2-stage design using a stagger-tuned technique.
• The simulated performance predicts a forward gain of |S21| > 20 dB over a 6 GHz range with an input match of |S11| < -30 dB and output match of |S22| < -10 dB.
• The mixer consists of a single balanced Gilbert cell.
• A fully-integrated differential 25 GHz VCO is used, in conjunction with the mixer, to downconvert the RF input to a 5 GHz IF.
30 GHz receiver layout consisting of the LNA, mixer and VCO
Wideband CMOS LC VCO
• A 1.8 GHz wideband LC VCO implemented in 0.18 µm bulk CMOS has been successfully designed, fabricated, and measured.
• This VCO utilizes a 4-bit array of switched capacitors and a small accumulation-mode varactor to achieve a measured tuning range exceeding 2:1 (73%) and a worst-case tuning sensitivity of 270 MHz/V.
• The amplitude reference level is programmable by means of a 3-bit DAC.
VCOs die photograph
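The two ways of quoting the tuning range (a frequency ratio and a percentage) can be cross-checked. The sketch below assumes the percentage is the tuning span divided by the center frequency, a common convention that the slide itself does not state; both function names are hypothetical.

```python
def tuning_range_percent(f_min: float, f_max: float) -> float:
    """Tuning span as a percentage of the center frequency (assumed definition)."""
    return 200.0 * (f_max - f_min) / (f_max + f_min)


def ratio_from_percent(percent: float) -> float:
    """Invert the definition above: return f_max / f_min for a given percentage."""
    t = percent / 100.0
    return (2.0 + t) / (2.0 - t)


if __name__ == "__main__":
    # An exact 2:1 range corresponds to about 66.7 % under this definition.
    print(f"2:1 range -> {tuning_range_percent(1.0, 2.0):.1f} %")
    # 73 % therefore corresponds to a ratio slightly above 2:1.
    print(f"73 % range -> {ratio_from_percent(73.0):.2f}:1")
```

Under that assumed definition the two figures agree: 73 % maps to a ratio of about 2.15:1, which does exceed 2:1.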
A High Level View of an Industry Standard Design Flow
• Every step can loop to every other step
• Each step can take hours or days for a 100,000 line description
• HDL description contains no physical information
• Different engineers handle the front-end and back-end design
[Figure: flowchart. Front-End: HDL Entry, Synthesis. Back-End: Floor-plan, Place & Route, Physical Verification (DRC & LVS). Each step is followed by a "good?" check before the flow reaches "done".]
source: Hitachi, Prof. R. W. Brodersen
Problems with this flow:
How have semiconductor companies made this flow work?
A More Accurate Picture of the Standard Flow
• Architecture: Partition the chip into functional units and generate bit-true test vectors to specify the behavior of each unit
  TOOLS: Matlab, C, SPW, (VCC)
  FREEZE the test vectors
• Front-End: Enter HDL code which matches the test vectors
  TOOLS: HDL Simulators, Design Compiler
  FREEZE the HDL code
• Back-End: Create a floor-plan and tweak the tools until a successful mask layout is created
  TOOLS: Design Compiler, Floor-planners, Placers, Routers, Clock-tree generators, Physical Verification
[Figure: timeline. Architecture: 10 months; Front-End: 10 months; Back-End: 2 months; Fabrication: 2 months]
Source: IBM Semiconductor, Prof. R. Newton
How can we improve this flow?
Common Fabric for IP Blocks
• Soft IP blocks are portable, but not as predictable as hard IP.
• Hard IP blocks are very predictable since a specific physical implementation can be characterized, but are hard to port since they are often tied to a specific process.
• Common fabric is required for both portability and predictability.
• Wide availability: Cell Based Array, a metal programmable architecture that provides the performance of a standard cell and is optimized for synthesis.
Four main applications
• Set-top box: Mobile multimedia system, base station for the home local-area network.
• Digital PCTV: concurrent use of TV, 3D graphics, and Internet services
• Set-top box LAN service: Wireless home-networks, multi-user wireless LAN
• Navigation system: steer and control traffic and/or goods-transportation
CMPR is a multipurpose program that can be used for displaying diffraction data, manual- & auto-indexing, peak fitting and other
PC-Multimedia Applications
Types of System-on-a-Chip Designs
Physical gap
• Timing closure problem: layout-driven logic and RT-level synthesis
• Energy efficiency requires locality of computation and storage: match for stream-based data processing of speech, images, and multimedia-system packets.
• Next generation SOC designers must bridge the architectural gap between system specification and energy-efficient IP-based architectures, while CAE vendors and IP providers will bridge the physical gap.
Circular Y-Chart
SOC Co-Design Challenges
• Current systems are complex and heterogeneous
• Contain many different types of components
• Half of the chip can be filled with 200 low-power, RISC-like processors (ASIP) interconnected by field-programmable buses, embedded in 20 Mbytes of distributed DRAM and flash memory; the other half: ASIC
• Computational power will not result from multi-GHz clocking but from parallelism, with clocks below 200 MHz.
• This will greatly simplify the design for correct timing, testability, and signal integrity.
Bridging the architectural gap
• One-M gate reconfigurable, one-M gate hardwired logic.
• 50 GIPS for programmable components or 500 GIPS for dedicated hardware
• Product reliability: design at a level far above the RT level, with reuse factors in excess of 100
• Trade-off: 100 MOPS/watt (microprocessor) versus 100 GOPS/watt (hardwired)
• Reconfigurable computing with a large number of computing nodes and a very restricted instruction set (Pleiades)
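The trade-off quoted above is easier to compare as energy per operation, since operations per second per watt is the same quantity as operations per joule. A minimal sketch, with the helper name `energy_per_op_joules` chosen only for illustration:

```python
# Efficiency figures quoted on the slide, in operations per second per watt.
MICROPROCESSOR_OPS_PER_WATT = 100e6   # 100 MOPS/W
HARDWIRED_OPS_PER_WATT = 100e9        # 100 GOPS/W


def energy_per_op_joules(ops_per_second_per_watt: float) -> float:
    """(ops/s)/W equals ops/J, so the energy of one operation is the reciprocal."""
    return 1.0 / ops_per_second_per_watt


if __name__ == "__main__":
    e_cpu = energy_per_op_joules(MICROPROCESSOR_OPS_PER_WATT)
    e_hw = energy_per_op_joules(HARDWIRED_OPS_PER_WATT)
    print(f"microprocessor: {e_cpu * 1e9:.1f} nJ per operation")
    print(f"hardwired:      {e_hw * 1e12:.1f} pJ per operation")
    print(f"ratio:          {e_cpu / e_hw:.0f}x")
```

The three orders of magnitude between the two ends is the space that reconfigurable approaches such as Pleiades aim to occupy.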
Why Lower Power
• Portable systems
  - long battery life
  - light weight
  - small form factor
• IC priority list
  - power dissipation
  - cost
  - performance
• Technology direction
  - Reduced voltage/power designs based on mature high performance IC technology, high integration to minimize size, cost, power, and speed
Microprocessor Power Dissipation
[Figure: power (W, 0 to 50) versus year (1980 to 2000) for i286, i386 DX 16, i486 DX 25, i486 DX 50, i486 DX2 66, i486 DX4 100, P5 66, P6 166, P II 300, P III 500, P-PC601 50, P-PC604 133, P-PC750 400, Alpha 21064 200, Alpha 21164, Alpha 21264]
Levels for Low Power Design

Level            Techniques
System           Hardware-software partitioning, Power down
Algorithm        Complexity, Concurrency, Locality, Regularity, Data representation
Architecture     Parallelism, Pipelining, Signal correlations, Instruction set selection, Data rep.
Circuit/Logic    Sizing, Logic Style, Logic Design
Technology       Threshold Reduction, Scaling, Advanced packaging, SOI

Possible Power Savings at Different Design Levels

Level of Abstraction    Expected Saving
Algorithm               10 - 100 times
Architecture            10 - 90%
Logic Level             20 - 40%
Layout Level            10 - 30%
Device Level            10 - 30%
Power-hungry Applications
• Signal Compression: HDTV Standard, ADPCM, Vector Quantization, H.263, 2-D motion estimation, MPEG-2 storage management
• Digital Communications: Shaping Filters, Equalizers, Viterbi decoders, Reed-Solomon decoders
New Computing Platforms
• SOC power efficiency more than 10 GOPS/W
  - Higher On Chip System Integration: COTS: 100 W, SOC: 10 W (inter-chip capacitive loads, I/O buffers)
  - Speed & Performance: shorter interconnection, fewer drivers, faster devices, more efficient processing architectures
• Mixed signal systems
• Reuse of IP blocks
• Multiprocessor, configurable computing
• Domain-specific, combined memory-logic
$P = k\,C\,F\,V^{2}$
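Reading the expression above as the usual CMOS switching-power relation, P = k C F V^2, the quadratic dependence on supply voltage can be illustrated numerically. The slide does not define its symbols, so the sketch assumes k is a switching-activity factor, C the switched capacitance, F the clock frequency, and V the supply voltage; the numeric values are hypothetical.

```python
def dynamic_power(k: float, c_farads: float, f_hertz: float, v_volts: float) -> float:
    """Switching power P = k * C * F * V^2 (symbol meanings are assumptions)."""
    return k * c_farads * f_hertz * v_volts ** 2


if __name__ == "__main__":
    # Hypothetical operating point: 20 % activity, 1 nF switched capacitance, 100 MHz.
    k, c, f = 0.2, 1e-9, 100e6
    for v in (3.3, 1.8, 0.6):
        print(f"V = {v:3.1f} V -> P = {dynamic_power(k, c, f, v) * 1e3:7.2f} mW")
```

With everything else fixed, dropping the supply from 3.3 V to 0.6 V cuts switching power by a factor of about 30.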
Low Power Design Flow I
[Figure: flowchart. Blocks: System Level Specification; Function Partitioning and HW/SW Allocation; System-Level Power Analysis; Behavioral Description; Software Functions; Processor Selection; Power-driven Behavioral Transformation; Behavioral-Level Power Analysis; Power Conscious Behavioral Description; High-Level Synthesis and Optimization; RT-Level Power Analysis; Software Optimization; Software-Level Power Analysis; To RT-Level Design]
Low Power Design Flow II
[Figure: flowchart. Blocks: RT-level Description; RTL mapping; RTL Library; RTL Macrocells; Data-path; Controller; Processor; Control and Steering Logic; Memory; High-Level Synthesis and Optimization; Logic Synthesis and Optimization; Standard cell Library; Gate-level Description; Gate-Level Power Analysis; Switch-level Description; Switch-Level Power Analysis]
Three Factors affecting Energy
  - Reducing waste by Hardware Simplification: redundant h/w extraction, Locality of reference, Demand-driven / Data-driven computation, Application-specific processing, Preservation of data correlations, Distributed processing
  - All in one Approach (SOC): I/O pin and buffer reduction
  - Voltage Reducible Hardware
    - 2-D pipelining (systolic arrays)
    - SIMD: Parallel Processing: useful for data w/ parallel structure
    - VLIW: Approach - flexible
IBM's PowerPC Lower Power Architecture
• Optimum Supply Voltage through Hardware Parallel, Pipelining, Parallel instruction execution
  - 603e executes five instructions in parallel (IU, FPU, BPU, LSU, SRU)
  - FPU is pipelined so a multiply-add instruction can be issued every clock cycle
  - Low power 3.3-volt design
• Use small complex instruction with smaller instruction length
  - IBM's PowerPC 603e is RISC
• Superscalar: CPI < 1
  - 603e issues as many as three instructions per cycle
• Low Power Management
  - 603e provides four software controllable power-saving modes.
• Copper Processor with SOI
• IBM's Blue Logic ASIC: New design reduces power by a factor of 10
Power-Down Techniques
Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work
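The point about energy per operation can be made concrete with a first-order model in which each operation switches a fixed effective capacitance, so its energy is C V^2 regardless of clock rate. This is a simplified sketch: the capacitance, operation count, and the assumption of one operation per clock cycle are all hypothetical.

```python
def task_energy_joules(n_ops: int, c_eff_farads: float, v_volts: float) -> float:
    """Energy to finish n_ops operations; the clock rate does not appear in this model."""
    return n_ops * c_eff_farads * v_volts ** 2


def task_runtime_seconds(n_ops: int, f_hertz: float) -> float:
    """Time to finish n_ops operations at one operation per clock cycle (an assumption)."""
    return n_ops / f_hertz


if __name__ == "__main__":
    n_ops, c_eff = 1_000_000, 1e-9  # hypothetical workload and capacitance
    cases = [
        ("full speed, full voltage", 100e6, 3.3),
        ("half clock, same voltage", 50e6, 3.3),
        ("half clock, half voltage", 50e6, 1.65),
    ]
    for label, f, v in cases:
        e = task_energy_joules(n_ops, c_eff, v)
        t = task_runtime_seconds(n_ops, f)
        print(f"{label:26s} time = {t * 1e3:5.1f} ms  energy = {e * 1e3:6.3f} mJ")
```

In this model, slowing the clock alone only stretches the same energy over a longer time, while halving the voltage as well cuts the energy for the fixed workload to a quarter.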
Implementing Digital Systems
H/W and S/W Co-design
Three Co-Design Approaches
• IFIP International Conference FORTE/PSTV'98, Nov. '98: N.S. Voros et al., "Hardware-software co-design of embedded systems using multiple formalisms for application development"
• ASIP co-design: builds a specific programmable processor for an application, and translates the application into software code. H/w and s/w partitioning includes the instruction set design.
• H/w s/w synchronous system co-design: s/w processor as a master controller, and a set of h/w accelerators as co-processors. Vulcan, Codes, Tosca, Cosyma
• H/w s/w for distributed systems: mapping of a set of communicating processes onto a set of interconnected processors. Behavioral decomposition, process allocation and communication transformation. Coware (powerful), Siera (reuse), Ptolemy (DSP)
Mixing H/W and S/W
• Argument: Mixed hardware/software systems represent the best of both worlds.
  - High performance, flexibility, design reuse, etc.
• Counterpoint: From a design standpoint, it is the worst of both worlds
  - Simulation: Problems of verification and test become harder
  - Interface: Too many tools, too many interactions, too much heterogeneity
  - Hardware/software partitioning is "AI-complete"!
  - (MIT, Stanford: by analogy with "NP-complete") A term used to describe problems in artificial intelligence, to indicate that the solution presupposes a solution to the "strong AI problem" (that is, the synthesis of a human-level intelligence). A problem that is AI-complete is just too hard.
Low power partitioning approach
• Different HW resources are invoked according to the instruction executed at a specific point in time
• During the execution of the add op., the ALU and register are used, but the Multiplier is in idle state.
• Non-active resources will still consume energy since the corresponding circuits continue to switch
• Calculate the wasted energy
• Add application-specific cores and run them partially
• Whenever one core is performing, all the other cores are shut down
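The "calculate the wasted energy" step above can be sketched as a walk over an instruction trace that charges every resource the current instruction does not need. The resource names, per-cycle energies, and the instruction-to-resource map are hypothetical, and the model assumes an idle resource burns as much as an active one, which is a deliberate simplification.

```python
# Hypothetical per-cycle energy (arbitrary units) of each resource while it is clocked.
RESOURCE_ENERGY = {"alu": 2.0, "multiplier": 5.0, "register": 1.0}

# Hypothetical map from instruction to the resources it actually needs.
USES = {
    "add": {"alu", "register"},
    "mul": {"multiplier", "register"},
    "mov": {"register"},
}


def wasted_energy(trace: list[str]) -> float:
    """Sum the energy burned by resources that are idle during each instruction."""
    wasted = 0.0
    for instr in trace:
        idle = set(RESOURCE_ENERGY) - USES[instr]
        wasted += sum(RESOURCE_ENERGY[r] for r in idle)
    return wasted


def total_energy(trace: list[str]) -> float:
    """Energy if every resource switches on every cycle (no shut-down)."""
    return len(trace) * sum(RESOURCE_ENERGY.values())


if __name__ == "__main__":
    trace = ["add", "add", "mul", "mov"]
    w, t = wasted_energy(trace), total_energy(trace)
    print(f"wasted {w} of {t} units ({100.0 * w / t:.1f} %)")
```

The wasted share is an upper bound on what shutting down unused cores could recover in this toy model.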
ASIP (Application Specific Instruction Processors) Design
• Given a set of applications, determine the micro architecture of the ASIP (i.e., configuration of functional units in datapaths, instruction set)
• To accurately evaluate the performance of a processor on a given application, one needs to compile the application program onto the processor datapath and simulate the object code.
• The micro architecture of the processor is a design parameter!
ASIP Design Flow
Cross-Disciplinary nature
• Software for low power: loop transformation leads to much higher temporal and spatial locality of data.
• Code size becomes an important objective
• Software will eventually become a part of the chip
• Behavior-platform-compiler codesign: codesigned with C++ or JAVA, describing their h/w and s/w implementation.
• Multidisciplinary system thinking is required for future designs (e.g., Eindhoven Embedded Systems Institute http://www.eesi.tue.nl/english)
VLSI Signal Processing Design Methodology
• pipelining, parallel processing, retiming, folding, unfolding, look-ahead, relaxed look-ahead, and approximate filtering
• bit-serial, bit-parallel and digit-serial architectures, carry save architecture
• redundant and residue systems
• Viterbi decoder, motion compensation, 2D-filtering, and data transmission systems
Low Power DSP
DO-LOOP Dominant
  VSELP Vocoder : 83.4 %
  2D 8x8 DCT : 98.3 %
  LPC computation : 98.0 %
DO-LOOP Power Minimization ==> DSP Power Minimization
VSELP : Vector Sum Excited Linear Prediction
LPC : Linear Prediction Coding
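The claim that loop power minimization is effectively DSP power minimization follows from an Amdahl-style estimate. The sketch below assumes the percentages above are the share of total power spent inside DO-LOOPs (the slide does not say whether they measure power or execution time), and the 50 % loop saving is a hypothetical input.

```python
# Loop shares quoted on the slide, interpreted here as fractions of total power (an assumption).
LOOP_SHARE = {
    "VSELP Vocoder": 0.834,
    "2D 8x8 DCT": 0.983,
    "LPC computation": 0.980,
}


def remaining_power(loop_share: float, loop_reduction: float) -> float:
    """Fraction of the original power left after cutting loop power by loop_reduction (0 to 1)."""
    return (1.0 - loop_share) + loop_share * (1.0 - loop_reduction)


if __name__ == "__main__":
    reduction = 0.5  # hypothetical: loop power is halved
    for name, share in LOOP_SHARE.items():
        left = remaining_power(share, reduction)
        print(f"{name:16s} total power falls to {100.0 * left:.1f} % of the original")
```

Because the loop share is so high, almost all of any saving inside the loops carries through to the whole program.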
Deep-Submicron Design Flows
• Rapid evaluation of complex designs for area and performance
• Timing convergence via estimated routing parasitics
• In-place timing repair without resynthesis
• Shorter design intervals, minimum iterations
• Block-level design and place and route
• Localized changes without disturbance
• Integration of complex projects and design reuse
SOC CAD Companies
• Avant! www.avanticorp.com
• Cadence www.cadence.com
• Duet Tech www.duettech.com
• Escalade www.escalade.com
• Logic visions www.logicvision.com
• Mentor Graphics www.mentor.com
• Palmchip www.palmchip.com
• Sonic www.sonicsinc.com
• Summit Design www.summit-design.com
• Synopsys www.synopsys.com
• Topdown design solutions www.topdown.com
• Xynetix Design Systems www.xynetix.com
• Zuken-Redac www.redac.co.uk
Design Technology for Low Power Radio Systems
http://bwrc.eecs.berkeley.edu
Rhett Davis
Dept. of EECS
Univ. of Calif. Berkeley
Domain of Interest
• Highly integrated system-on-a-chip solutions (SOCs)
• Wireless communications with associated processing, e.g. multimedia processing, compression, switching, etc.
• Primary computation is high complexity dataflow with a relatively small amount of control
Why Systems-on-a-Chip - SOC ?
• State-of-the-Art CMOS is easily able to implement complete systems (or what was on a board before)
  - A microprocessor core is only 1-2 mm² (1-2 % of the area of a $4 chip)
  - Portability (size) is critical to meet the cost, power and size requirements of future wireless systems
  - Chips will be required to support the complete application (wireless internet, multimedia)
  - Dedicated stand-alone computation is replacing general purpose processors as the semiconductor industry driver
Cellular Phones: An example
[Figure: cellular phone block diagram. Blocks: Analog Baseband, Digital Baseband (DSP + MCU), Power Management, Small Signal RF, Power RF]

Digital Cellular Market (Phones Shipped)
Year     1996   1997   1998   1999   2000
Units    48M    86M    162M   260M   435M
(Courtesy Mike McMahon, Texas Instruments)
Cellular Phone Baseband SOC
[Figure: baseband chip with blocks MCU, Gates, Analog, ROM, DSP, RAM]
2000+ phones on each 8” wafer @ .15 Leff
1 Million Baseband Chips per Day!!!
(Courtesy Mike McMahon, Texas Instruments)
Wireless System Design Issues
It is now possible to use CMOS to integrate all digital radio functions, but what is the “best” architectural way to use CMOS?
Computation rates for wireless systems will easily range up to 100’s of GOPS in signal processing
– What’s keeping us from achieving this in silicon?
– What can we do about it?
Computational Efficiency Metrics
Definition: MOPS
– Millions of algorithmically defined arithmetic operations per second (e.g. multiply, add, shift); a GP processor needs several instructions per “useful” operation
Figures of merit
– MOPS/mW: energy efficiency (battery life)
– MOPS/mm²: area efficiency (cost)
Optimization of these “efficiencies” is the basic goal, assuming functionality is met
Energy-Efficiency of Architectures
[Figure: energy efficiency in MOPS/mW (or MIPS/mW), log scale from 0.1 to 1000, versus flexibility (coverage)]
– Embedded processors (microprocessor): .1-1 MIPS/mW
– ASIPs, DSPs (DSP): 1-10 MIPS/mW
– Reconfigurable processor/logic, reconfiguration (???): potential of 10-100 MOPS/mW
– Dedicated HW (direct mapped): 100-1000 MOPS/mW
Software Processors: Energy Trends
Primary means of performance increase of software processors has been by increasing clock rate
Decreasing energy efficiency: $E \propto C V_{DD}^2$
[Figure: clock frequency (MHz, 0 to 300) versus year (1991 to 1996) for i386, i386C-33, i486C-33, 486-66, DX4 100, PP-66, PP-100, PP-133, PP166, PPro-150, PPro200, A21064A, A21164-300, MIPS R4400, MIPS R5000, MIPS R10000, SuperSparc2-90, UltraSparc-167, PPC 601-80, PPC603e-100, PPC 604-120, HP PA7200, HP PA8000]
Software Processors: Area Trends
Increasing clock rate results in a memory bottleneck – addressed by bringing memory on-chip
Area is increasingly dominated by memory – degrading MOPS/mm²
DSP processor with 1 multiplier (25 mm²) versus a 16x16 multiplier (.05 mm²)
Why time multiplex to save area if the overhead is much greater than the area saved?
Parallelism is the answer, but …
Not by putting Von Neumann processors in parallel and programming with a sequential language
– Attempts to do this have failed over and over again…
– The parallel computer compiler problem is very difficult
Not by trying to capture parallelism at the instruction level
– Superscalar, VLIW, etc. are very inefficient
– Hardware can’t figure out the parallelism from a sequential language either
The problem is the initial sequential description (e.g. C), which is poorly matched to highly parallel applications
What is really happening…
Starting with a parallel algorithmic description
Re-entering it using a sequential description

    for (i = 0; i < num; i++) {
        a = a * c[i];
        b[i] = sin(a * pi) + cos(a * pi);
    }
    Outfil = b[i] * indata;

Then try to rediscover the parallelism
We take this path so that we can use an architecture that is orders of magnitude less efficient in energy and area ??????
What can a fully parallel CMOS solution potentially do?
In .25 micron a multiplier requires .05 mm² and 7 pJ per operation at 1 V. Adders and registers are about 10 times smaller and 10 times lower energy
Let's implement a 50 mm², .25 micron chip using adders, registers and multipliers
We can have 2000 adders/registers and 200 multipliers in less than 1/2 of the chip; also assume 1/3 of power goes into clocks
25 MHz clock (1 volt) gives ~50 Gops at 100 mW
500 MOPS/mW and 1000 MOPS/mm²
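The arithmetic behind this estimate can be retraced with a short script. This is only a sketch: the unit areas, energies, unit counts, clock rate and clock-power share are the slide's figures, while the assumption that every unit completes one operation per cycle, and all function and variable names, are introduced here for illustration.

```python
# Rough check of the fully parallel 0.25 micron estimate.
# Assumption (not stated on the slide): every multiplier and every
# adder/register performs one operation per clock cycle.

MULT_AREA_MM2 = 0.05        # multiplier area
MULT_ENERGY_PJ = 7.0        # multiplier energy per operation at 1 V
SMALL_UNIT_FACTOR = 10.0    # adders/registers: 10x smaller, 10x lower energy

N_MULT = 200
N_ADD_REG = 2000
CHIP_AREA_MM2 = 50.0
CLOCK_HZ = 25e6
CLOCK_POWER_SHARE = 1.0 / 3.0   # fraction of total power spent on clocks


def estimate():
    """Return area, throughput and efficiency figures for the example chip."""
    datapath_area = (N_MULT * MULT_AREA_MM2
                     + N_ADD_REG * MULT_AREA_MM2 / SMALL_UNIT_FACTOR)
    energy_per_cycle_pj = (N_MULT * MULT_ENERGY_PJ
                           + N_ADD_REG * MULT_ENERGY_PJ / SMALL_UNIT_FACTOR)
    ops_per_second = (N_MULT + N_ADD_REG) * CLOCK_HZ
    datapath_power_mw = energy_per_cycle_pj * 1e-12 * CLOCK_HZ * 1e3
    # Clocks take 1/3 of the total, so the datapath accounts for 2/3 of it.
    total_power_mw = datapath_power_mw / (1.0 - CLOCK_POWER_SHARE)
    mops = ops_per_second / 1e6
    return {
        "datapath_area_mm2": datapath_area,
        "gops": ops_per_second / 1e9,
        "total_power_mw": total_power_mw,
        "mops_per_mw": mops / total_power_mw,
        "mops_per_mm2": mops / CHIP_AREA_MM2,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions the datapath fills 20 mm² (less than half of the 50 mm² chip), and the throughput, power and efficiency figures land close to the rounded values quoted on the slide.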
Start with a parallel description of the algorithm…
Then directly map into hardware …
[Figure: chip layout with Mult1, Mult2, Mac1, Mac2, S reg, X reg, Add/Sub/Shift]
Results in fully parallel solutions
Energy columns: 64-point FFT energy per transform (nJ); 16-state Viterbi decoder energy per decoded bit (nJ)
Area columns: 64-point FFT transforms per second per unit area (Trans/ms/mm²); 16-state Viterbi decoder decode rate per unit area (kb/s/mm²)

                          FFT energy   Viterbi energy   FFT rate/area     Viterbi rate/area
                          (nJ)         (nJ)             (Trans/ms/mm²)    (kb/s/mm²)
Direct-Mapped Hardware    1.78         0.022            2,200             200,000
FPGA                      683          5.5              1.8               100
Low-Power DSP             436          19.6             4.3               50
High-Performance DSP      1700         108              10                150

(numbers taken from vendor-published benchmarks)
Orders of magnitude lower efficiency even for an optimized processor architecture
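To make "orders of magnitude" concrete, the energy columns of the table can be divided by the direct-mapped figures. The values below are copied from the table; the helper name `energy_penalty` is made up for this sketch.

```python
# Energy figures (nJ) copied from the table above.
FFT_ENERGY_NJ = {
    "Direct-Mapped Hardware": 1.78,
    "FPGA": 683.0,
    "Low-Power DSP": 436.0,
    "High-Performance DSP": 1700.0,
}
VITERBI_ENERGY_NJ = {
    "Direct-Mapped Hardware": 0.022,
    "FPGA": 5.5,
    "Low-Power DSP": 19.6,
    "High-Performance DSP": 108.0,
}


def energy_penalty(table, baseline="Direct-Mapped Hardware"):
    """Return how many times more energy each architecture uses than the baseline."""
    base = table[baseline]
    return {name: value / base for name, value in table.items() if name != baseline}


if __name__ == "__main__":
    for label, table in (("64-point FFT", FFT_ENERGY_NJ),
                         ("16-state Viterbi", VITERBI_ENERGY_NJ)):
        for name, ratio in energy_penalty(table).items():
            print(f"{label}: {name} uses about {ratio:.0f}x the energy")
```

Every programmable option comes out at more than two hundred times the energy of the direct-mapped implementation for both kernels.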
Reasons software solutions seem attractive
(1) Believed to reduce time-to-system-implementation
(2) Provides flexibility
(3) Locks the customers into an architecture they can’t change
(4) Difficulty in getting dedicated SOC chips designed
Are these good reasons???
(1) Believed to reduce time-to-system implementation
Software decreases time to get first prototype, but time to fully verified system is much longer (hardware is often ready but software still needs to be done)
Limitations of the software prototype often set the ultimate limit of the system performance
Software solutions can be shipped with bugs, not a real option for SOC
(2) Need flexibility
Software is not always flexible
– Can be hard to verify
Flexibility does not imply software programmability
– Domain specific design can have multiple modules, coefficients and local state control (the factor of 100 in efficiency) to address a range of applications
– Reconfiguration of interconnect can achieve flexibility with high levels of efficiency
Flexibility without software
[Figure: Energy per Transform (J, 10^-10 to 10^-3) vs. FFT size (10^1 to 10^3), showing the lower limit and curves for function-specific reconfigurable hardware, data-path reconfigurable processor, FPGA, low-power DSP and high-performance DSP]
[Figure: Transforms per Second per mm² (Trans/s/mm², 10^3 to 10^8) vs. FFT size (10^1 to 10^3), for function-specific reconfigurable hardware, data-path reconfigurable processor, FPGA, low-power DSP and high-performance DSP]
* All results are scaled to 0.18 µm
Reasons software solutions seem attractive
(1) Believed to reduce time-to-system implementation
(2) Provides flexibility
(3) Locks the customers into an architecture they can’t change
(4) Difficulty in getting dedicated SOC chips designed
Standard DSP-ASIC Design Flow
Problems:
– Three translations of design data
– Requirements for re-verification at each stage
– Uncontrolled looping when pipeline stalls
Prohibitively long design time for direct mapped architectures
[Figure: design flow]
– Stages and their outputs: Algorithm Design (Floating-Point Simulation), System/Architecture Design (Fixed-Point Simulation), Hardware/Front-End Design (RTL Code), Physical/Back-End Design (Mask Layout)
– Description styles shown alongside: Sequential; Mixed Sequential & Structural; Integer only, Structural w/ Sequential Leaf-cells; Single-wire Connectivity w/ Timing Constraints
Direct Mapping Design Flow
Encourages iterations of layout
Controls looping
Reduces the flow to a single phase
Depends on fast automation
[Figure: Algorithm/System Simulation, Front-End (RTL Libraries) and Back-End (Floorplan) feed an Automated Flow that produces Mask Layout and Performance Estimates]
Déjà vu???
An automated style of design with parameterized modules processed through foundries is just the reincarnation of good ole Silicon Compilation of >10 years ago
What happened?
– A decline of research into design methodologies
– A single dominant flow has resulted: the Verilog-Synopsys-Standard Cell
– Lack of tool flows to support alternative styles of design
– Research community lost access to technology and moved to highly sub-optimal processor and FPGA solutions
Capturing Design Decisions
Categories:
– Function: basic input-output behavior
– Signal: physical signals and types
– Circuit: transistors
– Floorplan: physical positions
How to get layout and performance estimates in a day?
[Figure: floorplan with MAC, reg. file, add, shift, reg. file]
Simplified View of the Flow
New Software:
– Generation of netlists from a dataflow graph
– Merging of floorplan from last iteration
– Automatic routing and performance analysis
– Automation of flow as a dependency graph (UNIX MAKE program)
[Figure: flow graph with inputs dataflow graph, floorplan and macro library; steps elaborate, netlist, merge, autoLayout, route; output layout]
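The idea of running the flow as a dependency graph, in the manner of UNIX make, can be sketched in a few lines. The node names come from the flow figure, but the edges between them are an assumption made for this illustration (the slide text does not spell them out), and `steps_to_rerun` is a hypothetical helper, not part of the tool described here.

```python
from graphlib import TopologicalSorter

# Each step maps to the items it depends on.
# ASSUMED edges: the slide names the nodes but not their exact ordering.
DEPENDS_ON = {
    "elaborate": {"dataflow_graph", "macro_library"},
    "netlist": {"elaborate"},
    "merge": {"netlist", "floorplan"},
    "autoLayout": {"merge"},
    "route": {"autoLayout"},
    "layout": {"route"},
}


def build_order():
    """Return every node in an order that respects the dependencies."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())


def steps_to_rerun(changed_input):
    """Return the steps made stale by a changed input, in execution order."""
    stale = {changed_input}
    rerun = []
    for node in build_order():
        if stale & DEPENDS_ON.get(node, set()):
            stale.add(node)
            rerun.append(node)
    return rerun


if __name__ == "__main__":
    print(steps_to_rerun("floorplan"))
    print(steps_to_rerun("dataflow_graph"))
```

With these assumed edges, editing only the floorplan skips elaboration and netlisting, which is the kind of saving that makes fast layout iterations practical.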
Why Simulink?
Simulink is an easy sell to algorithm developers
Closely integrated with popular system design tool Matlab
Successfully models digital and analog circuits
[Figure: Time-Multiplexed FIR Filter block diagram with input X, SRAM (D, A, WEN, Q), TAP_COEF, CONTROL (addr, wen, reset_acc), MAC (A, B, RESET, Z) and output Y]
Modeling Datapath Logic
Discrete-Time (cycle accurate)
Fixed-Point Types (bit true)
Completely specify function and signal decisions
No need for RTL
[Figure: Multiply / Accumulate block with inputs A, B and RESET, blocks MULT, ADD, REG, CONST 0 and MUX, fixed-point signal types S12 and S18, and output Z]
Modeling Control Logic
Extended finite state-machine editor
Co-simulation with dataflow graph
New Software: Stateflow-VHDL translator
No need for RTL
[Figure: state chart "Address Generator / MAC Reset"]
– init: entry: addr=0; wen=1;
– incr: during: addr++; reset_acc=0;
– restart: entry: addr=0; wen=0; reset_acc=1;
– transition guard: [addr==15]
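A behavioral sketch may help show how the address generator and the single MAC cooperate in the time-multiplexed FIR filter. The signal names (addr, wen, reset_acc) and the 16-step address count follow the state chart; the circular-buffer addressing of the SRAM and both function names are assumptions made for this example.

```python
N_TAPS = 16  # addr counts 0..15, matching the [addr==15] guard


def time_multiplexed_fir(samples, coefs):
    """Compute FIR outputs with one MAC reused over N_TAPS cycles per sample."""
    assert len(coefs) == N_TAPS
    sram = [0] * N_TAPS          # delay line kept in the SRAM block
    write_ptr = 0
    outputs = []
    for x in samples:
        sram[write_ptr] = x      # wen=1: store the new input sample
        addr = 0
        reset_acc = 1            # restart: first MAC cycle clears the accumulator
        acc = 0
        while True:
            tap = sram[(write_ptr - addr) % N_TAPS]   # assumed circular addressing
            product = coefs[addr] * tap
            acc = product if reset_acc else acc + product
            reset_acc = 0        # incr: keep accumulating
            if addr == N_TAPS - 1:
                break
            addr += 1
        outputs.append(acc)
        write_ptr = (write_ptr + 1) % N_TAPS
    return outputs


def direct_fir(samples, coefs):
    """Reference model: plain convolution sum, one output per input sample."""
    return [
        sum(coefs[k] * samples[n - k] for k in range(len(coefs)) if n - k >= 0)
        for n in range(len(samples))
    ]


if __name__ == "__main__":
    coefs = list(range(1, N_TAPS + 1))
    impulse = [1] + [0] * (N_TAPS - 1)
    print(time_multiplexed_fir(impulse, coefs))
```

Comparing the cycle-by-cycle model against a plain convolution is a small-scale version of the cross-check simulations the flow relies on.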
Specifying Circuit Decisions
Macro choices embedded in dataflow graph
Cross-check simulations required
[Figure: Time-Multiplexed FIR Filter block diagram (SRAM, TAP_COEF, CONTROL, MAC), with each block implemented by one of: Stateflow-VHDL translator, RTL code, data-path generator code, or custom module (black box)]
Hierarchy Hardened Progressively
Macro characterization saved for fast estimates
Each level of hierarchy becomes a new hard macro
Higher levels of hierarchy are adjusted
When top level of hierarchy is hardened, the design is done
[Figure: loop between the System-Level Design Environment and the Hard Macro Characterization Libraries: estimate performance (power, area, delay); layout and characterize new hard macro]
Capturing Floorplan Decisions
Commercial physical design tools used
Instance names in floorplan match dataflow graph
Placements merged on each iteration
Manhattan distance can be used for parasitic estimates
[Figure: floorplan of the Parallel Pipelined FIR Filter]
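As a rough illustration of a Manhattan-distance parasitic estimate, the sketch below turns two placed pin positions into a wire length and a lumped capacitance. The function names and the default capacitance per micron are placeholders chosen for this example, not values from the slides or from any process.

```python
def manhattan_length_um(pin_a, pin_b):
    """Wire length estimate between two (x, y) positions given in microns."""
    return abs(pin_a[0] - pin_b[0]) + abs(pin_a[1] - pin_b[1])


def lumped_wire_cap_ff(pin_a, pin_b, cap_per_um_ff=0.2):
    """Lumped capacitance estimate in fF.

    cap_per_um_ff is a PLACEHOLDER value; a real flow would take it
    from the technology data for the routing layer in use.
    """
    return manhattan_length_um(pin_a, pin_b) * cap_per_um_ff


if __name__ == "__main__":
    driver = (0.0, 0.0)
    load = (800.0, 200.0)
    print(manhattan_length_um(driver, load), "um")
    print(lumped_wire_cap_ff(driver, load), "fF")
```

An estimate like this needs only the floorplan positions, so it can be refreshed on every iteration before any routing exists.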
Reduced Impact of Interconnect
Long wires can be modeled as lumped capacitances
[Figure: ratio of wire delay to FO4 inverter delay (0 to 0.5) versus VDD (0.4 to 1.8 V) in 0.18 µm, for a 1 mm and a 5 mm M6 wire]
Race-Immune Clock Tree Synthesis
Hierarchical clock tree synthesis
Race condition to satisfy: $t_{skew(max)} < t_{clk-Q(min)} - t_{hold(max)}$
Race margin = 580 ps (0.18 µm, VDD = 1 V)
Demonstrated on a 600k transistor design
Example clock tree:
– Stages: 22
– Sinks: 7650
– Skew: 320 ps
– Clock power: 2.8 mW
– Logic power: 21 mW
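The race-immunity condition is a simple inequality, so it can be checked directly. In the sketch below the 320 ps skew and the 580 ps margin are the slide's numbers; the individual clk-Q and hold values (600 ps and 20 ps) are hypothetical, picked only so that their difference reproduces the quoted margin.

```python
def race_margin_ps(t_clk_q_min_ps, t_hold_max_ps):
    """Largest clock skew a register-to-register path tolerates without a race."""
    return t_clk_q_min_ps - t_hold_max_ps


def is_race_immune(t_skew_max_ps, t_clk_q_min_ps, t_hold_max_ps):
    """True when t_skew(max) < t_clk-Q(min) - t_hold(max)."""
    return t_skew_max_ps < race_margin_ps(t_clk_q_min_ps, t_hold_max_ps)


if __name__ == "__main__":
    # HYPOTHETICAL register timing chosen to give the 580 ps margin.
    t_clk_q_min_ps, t_hold_max_ps = 600, 20
    print("margin:", race_margin_ps(t_clk_q_min_ps, t_hold_max_ps), "ps")
    print("320 ps skew safe:", is_race_immune(320, t_clk_q_min_ps, t_hold_max_ps))
```

With a 580 ps margin, the example tree's 320 ps skew leaves 260 ps of slack.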
Example 1: Macro Hardening
parallel pipelined FIR filter
– area in 0.25 µm: 1.4 mm²
– power @ 25 MHz (1 V, PowerMill): 13.0 mW
– critical path delay (1 V, PathMill): 18.0 ns
– cells: 21 k
– transistors: 240 k
– execution time: 3 hours (elaborate / route), 9 hours (characterization)
– disk space: 180 MB (elaborate / route), 1.5 GB (characterization)
Most time/disk space spent on extraction and power simulation
Example 2: Test Chip
– 300k transistors
– 0.25 µm
– 1.0 V
– 25 MHz
– 6.8 mm²
– 14 mW
– 2 phase clock
– 3 layers of P&R hierarchy
Parallel Pipelined FIR Filter (8X decimation filter for 12-bit 200 MHz
TDMA Baseband Receiver
– 600k transistors
– 0.18 µm
– 1.0 V
– 25 MHz
– 1.1 mm²
– 21 mW
– single phase clock
– 5 clock domains
– 2 layers of P&R hierarchy
[Figure: chip layout with blocks carrier detection, frequency estimation, rotate & correlate, control]
Conclusions
Direct-Mapped hardware is the most efficient use of silicon
Direct-Mapped hardware can be easier to design and verify than embedded hardware/software systems
Don’t translate design data, refine it
Design with dataflow graphs, not sequential code
Design flow automation speeds up design space exploration
Embedded Processor Architectures and (Re)Configurable Computing
Vandana Prabhu
Professor Jan M. Rabaey
Jan 10, 2000
Pico Radio Architecture
[Figure: blocks Reconfigurable DataPath, FPGA, Embedded uP, Dedicated FSM, Dedicated DSP]
Reconfigurable Computing: Merging Efficiency and Versatility
Spatially programmed connection of processing elements.
“Hardware” customized to specifics of problem.
Direct map of problem specific dataflow, control.
Circuits “adapted” as problem requirements change.
Matching Computation and Architecture
Two models of computation: communicating processes + data-flow
Two architectural models: sequential control + data-driven
[Figure: convolution mapped onto two AddressGen, Memory and MAC chains driven by a Control Processor; labels L, C, G]
Implementation Fabrics for Data Processing
[Figure: Digital Baseband Receiver block diagram with Data In, Acquisition and Timing Recovery, Adaptive Pilot Correlator, Channel Coefficient Estimates (C0 … CL-1), Signal Update Blocks, Adaptive Data Correlator, Sk, Data Out]
300 million multiplications/sec, 357 million add-sub’s/sec

                 Adaptive Pilot Correlator    Digital Baseband Receiver
                 Power       Area             Power       Area
DSP              460 mW      1089 mm²         1500 mW     3600 mm²
Direct Mapped    3 mW        1.3 mm²          10 mW       5 mm²
Pleiades         18.49 mW    5.44 mm²         62.33 mW    21.34 mm²

16 Mmacs/mW!
Software Methodology Flow
[Figure: flow from Algorithms through Kernel Detection, Estimation/Exploration, Partitioning, Software Compilation and Reconfig. Hardware Mapping, to Interface Code Generation. Supporting items: Power & Timing Estimation of Various Kernel Implementations; Area & PDA Models; Premapped Kernels; Accelerator proc & Timing Constraints; Xforms for low power; Behavioral Kernels; Executable Intermediate Form; Interconnect Optimization; Reconfig HW]
(Marlene Wan)
Maia: Reconfigurable Baseband Processor for Wireless
• 0.25um tech: 4.5mm x 6mm
• 1.2 Million transistors
• 40 MHz at 1V
• 1 mW VCELP voice coder
• Hardware
• 1 ARM-8
• 8 SRAMs & 8 AGPs
• 2 MACs
• 2 ALUs
• 2 In-Ports and 2 Out-Ports
• 14x8 FPGA
Implementation Fabrics for Protocols
A protocol = Extended FSM
[Figure: Intercom TDMA MAC with Memory, buffers and Slot_Set_Tbl (2x16); signals addr, slot_set<31:0>, Slot_no<5:0>, Slotstart, Pktend, RACHreq, RACHakn, W_ENA, R_ENA; state labels idle, update, write, read, slotset, RACH]

          ASIC          FPGA          ARM8
Power     0.26 mW       2.1 mW        114 mW
Energy    10.2 pJ/op    81.4 pJ/op    n*457 pJ/op
Ratio     1             8             >> 400

ASIC: 1 V, 0.25 µm CMOS process
FPGA: 1.5 V, 0.25 µm CMOS low-energy FPGA
ARM8: 1 V, 25 MHz processor; n = 13,000
Idea: Exploit model of computation: concurrent finite state machines, communicating through message passing
Low-Power FPGA
Low Energy Embedded FPGA (Varghese George)
Test chip
– 8x8 CLB array
– 5 in - 3 out CLB
– 3-level interconnect hierarchy
– 4 mm² in 0.25 µm ST CMOS
– 0.8 and 1.5 V supply
Simulation Results
– 125 MHz Toggle Frequency
– 50 MHz 8-bit adder
– energy 70 times lower than comparable Xilinx
An Energy-Efficient µP System
• Dynamic Voltage Scaling (Trevor Pering & Tom Burd)
Integrated dc-dc converter
Lower speed, lower voltage, lower energy
[Figure: µProc. speed over time. Before: full speed followed by idle. After: lower constant speed]
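The benefit of dynamic voltage scaling follows from the energy per operation growing with C·VDD². The sketch below compares running a task at full speed and then idling against stretching it over the whole deadline at a lower voltage. It assumes, as a simplification, that the maximum clock frequency scales linearly with VDD and that idle energy is zero; all numbers and names are hypothetical.

```python
def task_energy_j(cycles, c_eff_farads, vdd_volts):
    """Switching energy of a task: cycles * C_eff * VDD^2."""
    return cycles * c_eff_farads * vdd_volts ** 2


def scaled_vdd(v_max_volts, f_max_hz, f_required_hz):
    """Supply needed for a lower clock rate.

    ASSUMPTION: maximum frequency is proportional to VDD, which ignores
    threshold-voltage effects in a real process.
    """
    return v_max_volts * f_required_hz / f_max_hz


if __name__ == "__main__":
    # HYPOTHETICAL workload: 1M cycles, 1 nF effective switched capacitance.
    cycles, c_eff = 1_000_000, 1e-9
    v_max, f_max = 2.0, 100e6
    f_needed = 50e6   # the deadline allows half the peak clock rate

    before = task_energy_j(cycles, c_eff, v_max)   # run fast, then idle
    after = task_energy_j(cycles, c_eff, scaled_vdd(v_max, f_max, f_needed))
    print(f"before: {before * 1e3:.2f} mJ, after: {after * 1e3:.2f} mJ")
```

Under the linear frequency assumption, halving the clock rate halves the supply voltage and cuts the task energy to a quarter, even though the cycle count is unchanged.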
Xtensa Configurable Processor
Xtensa (Tensilica, Inc) for embedded CPU
– Configurability allows designer to keep “minimal” hardware overhead
– ISA (compatible with 32 bit RISC) can be extended for software optimizations
– Fully synthesizable
– Complete HW/SW suite
VCC modeling for exploration
– Requires mapping of “fuzzy” instructions of VCC processor model to real ISA
– Requires multiple models depending on memory configuration
– ISS simulation to validate accuracy of model
(Vandana Prabhu)
Microprocessor Optimizations for Network Protocols
Implements Transport layer on configurable processor
– TDMA control and channel usage management
Upper layer of protocol is dominated by processor control flow
– Memory routines, Branches, Procedure calls
Artifacts of code generation tools are significant
– Excessively modular code introduces procedure calls
– Uses dynamic memory allocation
Configurable processor
– Increased size of register file
– Customized instructions help datapath but not control
[Figure: total execution time, with the memory routines share broken down into calloc, memcpy and other]
Efficient implementation at code generation and architecture levels!
(Kevin Camera & Tim Tuan)
Implementation Methodology for Reconfigurable Wireless Protocol
Changing granularity within protocol stack requires estimation tool for energy-efficient implementation
Software exploration on processors
– Exploring Xtensa’s TIE
Hardware exploration on FPGA platforms
– Optimal FPGA architecture
– Alternately “Reconfigurable FSM” analogous to Pleiades approach for datapath kernels
(Suetfei Li & Tim Tuan)
TCI - A First Generation PicoNode
[Figure: Tensilica Embedded Proc. and Memory Sub-system (Programmable Protocol Stack) and Configurable Logic (Physical Layer, Baseband Processing), connected by a Sonics Backplane]
The System-on-a-Chip Nightmare
The “Board-on-a-Chip” Approach
[Figure: DMA, CPU, DSP, Mem Ctrl., MPEG, C, I, O blocks tied together by a System Bus, a Bridge, a Peripheral Bus, Control Wires and Custom Interfaces]
Courtesy of Sonics, Inc
The Communications Perspective
Communications-based Design
Example: “The Silicon Backplane” (Sonics, Inc)
[Figure: DMA, CPU, DSP, MPEG, C, MEM, I, O cores attached through Open Core Protocol(TM) interfaces and SiliconBackplane Agents(TM); guaranteed bandwidth arbitration]
(Mike Sheets)
Summary
Design for low energy impacts all stages of the design process: the earlier the better
Energy reduction requires clear communication and computation abstractions
Efficient and abstract modeling of energy at behavior and architecture level is crucial
Efficient hardware implementation of protocol stack
Beat the SoC monster!
Targeting Tiled Architectures in Design Exploration
Lilian Bossuet (1), Wayne Burleson (2), Guy Gogniat (1), Vikas Anand (2), Andrew Laffely (2), Jean-Luc Philippe (1)
(1) LESTER Lab, Université de Bretagne Sud, Lorient, France
{lilian.bossuet, guy.gogniat, jean-luc.philippe}@univ-ubs.fr
(2) Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, USA
{burleson, vanand, alaffely}@ecs.umass.edu
Design Space Exploration: Motivations
Design solutions for new telecommunication and multimedia applications targeting embedded systems
Optimization and reduction of SoC power consumption
Increase computing performance
– Increase parallelism
– Increase speed
Be flexible
– Take into account run-time reconfiguration
– Targeting multi-granularity (heterogeneous) architectures
Design Space Exploration: Flow
[Figure: design space exploration flow. Specifications go from Algorithmic to Architectural to RTL as the abstraction level falls from high to low and the performance accuracy rises from low to high. The functional description of the design space covers archi1 to archi12; a first run (generic synthesis or estimations) leaves a physical model of Archi2 and Archi10; a second run (dedicated tool) leaves an accurate model of Archi2]
Progressive design space reduction:
– iterative exploration
– refinement of architecture model
– increase of performance estimation accuracy
One level of abstraction for one level of estimation accuracy
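Progressive design space reduction can be pictured as a sequence of filters, each using a more accurate (and more expensive) estimator on fewer candidates. The sketch below is only an illustration: the cost numbers are invented so that the outcome mirrors the figure (twelve candidates, then Archi2 and Archi10, then Archi2), and the function name `explore` is hypothetical.

```python
def explore(candidates, stages):
    """Apply successive (estimator, keep_count) stages.

    Each estimator maps a candidate to a cost (lower is better).
    Returns the list of survivors after each stage, best first.
    """
    history = []
    survivors = list(candidates)
    for estimate, keep in stages:
        survivors = sorted(survivors, key=estimate)[:keep]
        history.append(survivors)
    return history


# TOY DATA, invented for illustration only.
ARCHS = [f"archi{i}" for i in range(1, 13)]
COARSE_COST = {name: 2.0 + 0.1 * i for i, name in enumerate(ARCHS, start=1)}
COARSE_COST.update({"archi2": 1.2, "archi10": 1.0})   # fast, rough estimate
ACCURATE_COST = {"archi2": 1.1, "archi10": 1.4}       # slow, detailed estimate

if __name__ == "__main__":
    runs = explore(ARCHS, [(COARSE_COST.__getitem__, 2),
                           (ACCURATE_COST.__getitem__, 1)])
    print("first run keeps:", runs[0])
    print("second run keeps:", runs[1])
```

The detailed estimator only needs to exist for candidates that survive the first pass, which is what keeps the overall exploration fast.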
Reconfigurable Architectures
Bridging the flexibility gap between ASICs and microprocessors [Hartenstein DATE 2001]
Energy efficient and a solution to low power programmable DSP [Rabaey ICASSP 1997, FPL 2000]
Run Time Reconfigurable [Compton & Hauck 1999]
=> A key ingredient for future silicon platforms [Schaumont et al. DAC 2001]
Design Space of Reconfigurable Architecture
RECONFIGURABLE ARCHITECTURES (R-SOC)
FINE GRAIN (FPGA)
– Island Topology: Xilinx Virtex, Xilinx Spartan, Atmel AT40K, Lattice ispXPGA
– Hierarchical Topology: Altera Stratix, Altera Apex, Altera Cyclone
MULTI GRANULARITY (Heterogeneous)
– Processor + Coprocessor, Coarse Grain Coprocessor: Chameleon, REMARC, Morphosys
– Processor + Coprocessor, Fine Grain Coprocessor: Pleiades, Garp, FIPSOC, Triscend E5, Triscend A7, Xilinx Virtex-II Pro, Altera Excalibur, Atmel FPSIC
– Tile-Based Architecture: aSoC, E-FPFA
COARSE GRAIN (Systolic)
– Linear Topology: Systolic Ring, RaPiD, PipeRench
– Hierarchical Topology: DART, FPFA
– Mesh Topology: RAW, CHESS, MATRIX, KressArray, Systolix Pulsedsp
A Target Architecture: aSoC
Adaptive System-on-a-Chip (aSoC)
Tiled architecture containing many heterogeneous processing cores (RISC, DSP, FPGA, Motion Estimation, Viterbi Decoder)
Mesh communication network controlled with statically determined communication schedule
A scalable architecture.
FPGA in System-on-a-Chip
Fast Time-To-Market
Post-Fabrication Customization
– Broaden application domain
– Run-time Reconfiguration
– Bug Fixes
– Upgrades
10x-100x Worse:
– Area
– Performance
– Power
Mark L. Chang [email protected]
[Figure: array of tiles with heterogeneous cores (FPGA, uProc, MUL)]
aSoC ArchitectureaSoC Architecture
Point-to-point Point-to-point connectionsconnections
ctrl
South Core
West
North
East
Communication Communication InterfaceInterface
aSoC Communications Interface
[Figure: communication interface block diagram. Labels: Core, Coreports, Decoder, Local Frequency & Voltage, Instruction Memory (example entry: "North to South & East"), PC, Controller, Local Config., and an interface crossbar with North, South, East, and West inputs and outputs.]
Interface Crossbar
– inter-tile transfer
– tile to core transfer
Interconnect/Instruction Memory
– contains instructions to configure the interface crossbar (cycle-by-cycle)
Interface Controller
– selects the instruction
Coreports
– data interface and storage for transfers with the tile IP core
Dynamic Voltage and Frequency Selection
– Dynamic Power Management
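The way the instruction memory, controller, and crossbar work together can be pictured with a small simulation. This is only a sketch under assumptions that the slides do not state: the port names, the dictionary encoding of a crossbar instruction, and the wrap-around program counter are illustrative choices, not the actual aSoC instruction format.

```python
from typing import Dict, List, Optional

PORTS = ("north", "south", "east", "west", "core")  # assumed port names

# One crossbar instruction: output port -> input port that drives it this cycle.
Instruction = Dict[str, str]


class CommInterface:
    """Toy model of one tile's communication interface (hypothetical encoding)."""

    def __init__(self, program: List[Instruction]):
        if not program:
            raise ValueError("the instruction memory must hold at least one entry")
        for instr in program:
            for out_port, in_port in instr.items():
                if out_port not in PORTS or in_port not in PORTS:
                    raise ValueError(f"unknown port in {instr}")
        self.program = program  # interconnect/instruction memory
        self.pc = 0             # program counter held by the interface controller

    def step(self, inputs: Dict[str, Optional[int]]) -> Dict[str, Optional[int]]:
        """Apply the crossbar configuration selected for the current cycle."""
        instr = self.program[self.pc]
        outputs = {out: inputs.get(src) for out, src in instr.items()}
        # Assumption: the static schedule repeats once its last entry is reached.
        self.pc = (self.pc + 1) % len(self.program)
        return outputs


# Two-cycle schedule: cycle 0 forwards north to south and west to the core,
# cycle 1 sends the core's result out on the east port.
iface = CommInterface([
    {"south": "north", "core": "west"},
    {"east": "core"},
])
cycle0 = iface.step({"north": 7, "west": 3})
cycle1 = iface.step({"core": 21})
print(cycle0, cycle1)
```

Because the schedule is fixed before run time, the controller only has to advance a counter; no routing decision is taken while data is moving.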
aSoC Exploration ...
– Type of tiles
– Number of each type of tile
– Placement of the tiles
– Internal architecture of reconfigurable tiles (FPGA core)
– Communication scheduling
Design Space Exploration: Goals
Goal: Rapid exploration of various architectural solutions to be implemented on heterogeneous reconfigurable architectures (aSoC) in order to select the most efficient architecture for one or several applications
– Takes place before architectural synthesis (algorithmic specification in a high-level abstraction language)
– Estimations are based on a functional architecture model (generic, technology-independent)
– Iterative exploration flow to progressively refine the architecture definition, from a coarse model to a dedicated model
Design Exploration Flow Targeting Tiled Architecture
[Figure: exploration flow. A C specification passes through the C to HCDFG parser, producing the HCDFG graphs of the application (Application App1 with Functions F1 and F2). Together with the model of the aSoC architectures (aSoC A1 with Tiles T1 and T2), these feed Application Analysis and then Tile Exploration. The tile exploration results go to the aSoC Builder and Static Communication Scheduling, giving the final model of the aSoC architecture, which is evaluated by aSoC Analysis. The figure also carries the label "HF Model".]
Results of the tile exploration step:
Function  Tile  Performance
F1        T1    T11, C11, Occ11
F1        T2    T21, C21, Occ21
F2        T1    T12, C12, Occ12
F2        T2    T22, C22, Occ22
Application Analysis
Use of algorithmic metrics and dedicated scheduling algorithms to highlight the target architectures
Algorithmic metrics:
– Characterize the application orientation
  • Processing
  • Memory
  • Control
– Characterize the application's potential parallelism
  • Processing
  • Memory
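The slides do not define the metrics themselves, so the following is only an illustrative sketch of what an orientation and a parallelism metric could look like. The `OpCounts` record, the share-of-total definition, and the operations-per-critical-path-step ratio are all assumptions made for this example, not the authors' formulas.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class OpCounts:
    """Operation counts extracted from one function (hypothetical record)."""
    processing: int  # arithmetic and logic operations
    memory: int      # loads and stores
    control: int     # tests and branches


def orientation(counts: OpCounts) -> Dict[str, float]:
    """Share of each operation class in the total (assumed metric)."""
    total = counts.processing + counts.memory + counts.control
    if total == 0:
        raise ValueError("function contains no operations")
    return {
        "processing": counts.processing / total,
        "memory": counts.memory / total,
        "control": counts.control / total,
    }


def potential_parallelism(op_count: int, critical_path: int) -> float:
    """Average operations per step when the critical path sets the schedule length."""
    if critical_path <= 0:
        raise ValueError("critical path must be positive")
    return op_count / critical_path


filter_like = OpCounts(processing=60, memory=30, control=10)
shares = orientation(filter_like)
dominant = max(shares, key=shares.get)
print(dominant, shares, potential_parallelism(60, 15))
```

A function dominated by processing with high potential parallelism would point toward a reconfigurable or DSP tile, while a control-dominated one would suit a RISC tile.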
Tile Exploration: 3 steps
Projection:
– Link between necessary resources (application) and available resources (tile)
– Use of an allocation algorithm based on reducing communication costs
Composition:
– Takes the function scheduling into account to estimate additional resources (registers, muxes, …)
Estimation:
– Performance interval computation (lower and upper bounds)
– Speed/resource utilization/power characterization
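The estimation step reports an interval rather than a single figure. One generic way to obtain such an interval is sketched below; it uses textbook scheduling bounds (critical path and resource limit for the lower bound, fully sequential execution for the upper bound) and assumes single-cycle operations. The function name and the dictionary inputs are hypothetical, and the bounds used by the authors' tool may differ.

```python
import math
from typing import Dict, Tuple


def cycle_bounds(ops: Dict[str, int],
                 units: Dict[str, int],
                 critical_path: int) -> Tuple[int, int]:
    """Return (lower, upper) cycle bounds for one function on one tile.

    ops[kind] is the number of operations of that kind in the function,
    units[kind] the number of matching functional units on the tile.
    Every operation is assumed to take one cycle.
    """
    for kind, n in ops.items():
        if n > 0 and units.get(kind, 0) == 0:
            raise ValueError(f"tile has no unit for '{kind}'")
    # Resource bound: the most heavily loaded unit type limits the schedule.
    resource_bound = max(
        (math.ceil(n / units[kind]) for kind, n in ops.items() if n > 0),
        default=0,
    )
    lower = max(critical_path, resource_bound)
    # Upper bound: every operation executed one after another.
    upper = max(sum(ops.values()), lower)
    return lower, upper


lower, upper = cycle_bounds({"mul": 8, "add": 12}, {"mul": 2, "add": 4}, critical_path=5)
print(lower, upper)
```

A narrow interval means the coarse model is already informative; a wide one signals that the tile definition needs another refinement pass.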
aSoC Builder
Environment: AppMapper
Partition and assignment
– based on Run Time Estimation
Compilation
– Communication Scheduling
– Core compilation
Generate tiles configuration
– Communications instructions
– Bitstreams (for reconfigurable tile)
– RISC instructions
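Assignment driven by run-time estimates can be illustrated with a simple greedy heuristic. The slides do not describe AppMapper's actual algorithm, so this sketch swaps in a generic earliest-finish-time rule; the `assign_functions` name, the nested-dictionary input, and the independence of the functions are assumptions.

```python
from typing import Dict, Tuple


def assign_functions(runtime: Dict[str, Dict[str, float]]
                     ) -> Tuple[Dict[str, str], Dict[str, float]]:
    """Greedily map each function to the tile where it would finish earliest.

    runtime[f][t] is the estimated run time of function f on tile t.
    Functions are treated as independent (no precedence constraints).
    """
    tiles = sorted({t for per_tile in runtime.values() for t in per_tile})
    busy_until = {t: 0.0 for t in tiles}
    mapping: Dict[str, str] = {}
    # Place the most demanding functions first (longest best-case run time).
    order = sorted(runtime, key=lambda f: min(runtime[f].values()), reverse=True)
    for f in order:
        # Ties are broken by tile name to keep the result deterministic.
        best = min(runtime[f], key=lambda t: (busy_until[t] + runtime[f][t], t))
        busy_until[best] += runtime[f][best]
        mapping[f] = best
    return mapping, busy_until


estimates = {
    "F1": {"T1": 40, "T2": 25},
    "F2": {"T1": 30, "T2": 35},
    "F3": {"T1": 10, "T2": 20},
}
mapping, busy = assign_functions(estimates)
print(mapping, max(busy.values()))
```

The input table has the same shape as the tile exploration results (one estimate per function and tile), which is why that step has to run before the builder.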
aSoC Analysis
Use the results of previous steps
– Function scheduling
– Tile allocation
– Communication scheduling
Complete estimation of the proposed solution
– Global execution time
– Global power consumption
– Total area
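How per-tile results might be combined into the three global figures is sketched below. The model is deliberately simplified and is an assumption of this example: all tiles share one clock, a tile draws its stated power only while computing, and the `TileEstimate` fields are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TileEstimate:
    name: str
    busy_cycles: int   # cycles the tile spends computing
    comm_cycles: int   # cycles spent waiting on scheduled transfers
    power_mw: float    # average power while computing
    area_mm2: float


def analyse(tiles: List[TileEstimate], clock_mhz: float) -> Dict[str, float]:
    """Combine per-tile estimates into global figures (simplified model)."""
    if not tiles:
        raise ValueError("no tiles to analyse")
    # The slowest tile determines the global execution time.
    total_cycles = max(t.busy_cycles + t.comm_cycles for t in tiles)
    if total_cycles == 0:
        raise ValueError("nothing was executed")
    exec_time_us = total_cycles / clock_mhz
    # mW * us = nJ; idle tiles are assumed to draw no power.
    energy_nj = sum(t.power_mw * t.busy_cycles / clock_mhz for t in tiles)
    return {
        "exec_time_us": exec_time_us,
        "avg_power_mw": energy_nj / exec_time_us,
        "area_mm2": sum(t.area_mm2 for t in tiles),
    }


report = analyse(
    [TileEstimate("T1", 800, 200, 50.0, 2.0),
     TileEstimate("T2", 500, 100, 20.0, 1.5)],
    clock_mhz=100.0,
)
print(report)
```

Since aSoC cores may run at different clock rates and voltages, a fuller model would carry a clock and a supply voltage per tile instead of the single `clock_mhz` used here.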
Power-Aware System on a Chip
A. Laffely, J. Liang, R. Tessier, C. A. Moritz, W. Burleson
University of Massachusetts Amherst
Boston Area Architecture Conference
30 Jan 2003
{alaffely, jliang, tessier, moritz, burleson}@ecs.umass.edu
This material is based upon work supported by the National Science Foundation under Grant No. 9988238. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Adaptive System-on-a-Chip
Tiled architecture with mesh interconnect
– Point to point communication pipeline
Allows for heterogeneous cores
– Differing sizes, clock rates, voltages
Low-overhead core interface
– On-chip bus substitute for streaming applications
Based on static scheduling
– Fast and predictable
[Figure: tile array with Proc, FPGA, and Multiplier cores. Each tile combines a core with a communication interface (ctrl and North, South, East, West ports).]
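Static scheduling on a pipelined mesh amounts to reserving each link for a given cycle before the application runs. The sketch below shows one simple way to do that; dimension-ordered (x then y) routing, one hop per cycle, and the greedy earliest-slot rule are assumptions chosen for the example, not the scheduler actually used for aSoC.

```python
from typing import Dict, List, Set, Tuple

Coord = Tuple[int, int]     # (x, y) position of a tile in the mesh
Link = Tuple[Coord, Coord]  # directed link between neighbouring tiles


def xy_route(src: Coord, dst: Coord) -> List[Link]:
    """Dimension-ordered route: travel along x first, then along y."""
    links: List[Link] = []
    x, y = src
    while x != dst[0]:
        nxt = (x + (1 if dst[0] > x else -1), y)
        links.append(((x, y), nxt))
        x = nxt[0]
    while y != dst[1]:
        nxt = (x, y + (1 if dst[1] > y else -1))
        links.append(((x, y), nxt))
        y = nxt[1]
    return links


def schedule(transfers: List[Tuple[Coord, Coord]]) -> List[int]:
    """Give each transfer the earliest start cycle with no link conflict.

    A word advances one hop per cycle, so hop i uses its link at cycle start + i.
    """
    reserved: Dict[Link, Set[int]] = {}
    starts: List[int] = []
    for src, dst in transfers:
        path = xy_route(src, dst)
        start = 0
        while any((start + i) in reserved.get(link, set())
                  for i, link in enumerate(path)):
            start += 1
        for i, link in enumerate(path):
            reserved.setdefault(link, set()).add(start + i)
        starts.append(start)
    return starts


starts = schedule([((0, 0), (2, 0)), ((0, 0), (1, 1)), ((1, 0), (2, 1))])
print(starts)
```

Each reservation corresponds to one entry in a tile's interconnect instruction memory, which is what makes the resulting communication latency predictable.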
aSoC Implementation
[Figure: tile layout with dimension labels 3000 and 2500 (units not shown).]
0.18 µm technology, full custom
Some Results
9 and 16 core systems tested for IIR, MPEG encoding and image processing applications
– ~2x the performance of the CoreConnect bus (burst and hierarchical)
– ~1.5x the performance of an oblivious routing network¹ (dynamic routing)
– Max speedup is 5x
1. W. Dally and H. Aoki, "Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels", IEEE Transactions on Parallel and Distributed Systems, April 1993