Digital Integrated Circuits A Design Perspective System on a Chip Design.

Application Specific Integrated Circuits: Introduction

Jun-Dong Cho
SungKyunKwan Univ.
Dept. of ECE, Vada Lab.
http://vada.skku.ac.kr

Contents

• Why ASIC?
• Introduction to System On Chip Design
• Hardware and Software Co-design
• Low Power ASIC Designs

Why ASIC: Design productivity grows!

• Complexity increases 40% per year
• Design productivity increases 15% per year
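The two growth rates above compound, so the gap between what can be built and what can be designed widens every year. A minimal sketch of that arithmetic, assuming both quantities are normalised to 1.0 in year 0 (the function name and the chosen years are illustrative, not from the slides):

```python
def gap_after(years, complexity_growth=0.40, productivity_growth=0.15):
    """Ratio of design complexity to design productivity after `years`,
    with both normalised to 1.0 in year 0."""
    complexity = (1.0 + complexity_growth) ** years
    productivity = (1.0 + productivity_growth) ** years
    return complexity / productivity


if __name__ == "__main__":
    for y in (1, 5, 10):
        print(f"year {y:2d}: complexity/productivity = {gap_after(y):.2f}x")
```

Under these rates the ratio grows by a factor of 1.40 / 1.15 each year, which is one way to read the motivation for reuse-based ASIC and SOC design.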

Integration of PCB on single die

Silicon in 2010

Die Area: 2.5 x 2.5 cm
Voltage: 0.6 V
Technology: 0.07 µm

                  Density        Access Time
                  (Gbits/cm2)    (ns)
DRAM              8.5            10
DRAM (Logic)      2.5            10
SRAM (Cache)      0.3            1.5

                  Density        Max. Ave. Power   Clock Rate
                  (Mgates/cm2)   (W/cm2)           (GHz)
Custom            25             54                3
Std. Cell         10             27                1.5
Gate Array        5              18                1
Single-Mask GA    2.5            12.5              0.7
FPGA              0.4            4.5               0.25
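To get a feel for these densities, they can be multiplied by the 2.5 x 2.5 cm die area. The sketch below assumes, purely for illustration, that the whole die is filled with a single memory type or logic style, so the results are upper bounds rather than figures from the slides.

```python
DIE_AREA_CM2 = 2.5 * 2.5  # 2.5 x 2.5 cm die

# Densities in Gbits/cm2, taken from the memory table.
MEMORY_GBITS_PER_CM2 = {"DRAM": 8.5, "DRAM (Logic)": 2.5, "SRAM (Cache)": 0.3}

# (Mgates/cm2, W/cm2), taken from the logic table.
LOGIC = {
    "Custom": (25, 54),
    "Std. Cell": (10, 27),
    "Gate Array": (5, 18),
    "Single-Mask GA": (2.5, 12.5),
    "FPGA": (0.4, 4.5),
}


def full_die_memory_gbits(kind):
    """Capacity if the entire die held one memory type (illustrative bound)."""
    return MEMORY_GBITS_PER_CM2[kind] * DIE_AREA_CM2


def full_die_logic(style):
    """(Mgates, watts) if the entire die used one logic style (illustrative bound)."""
    density, power = LOGIC[style]
    return density * DIE_AREA_CM2, power * DIE_AREA_CM2


if __name__ == "__main__":
    for kind in MEMORY_GBITS_PER_CM2:
        print(f"{kind:15s} {full_die_memory_gbits(kind):8.3f} Gbits")
    for style in LOGIC:
        mgates, watts = full_die_logic(style)
        print(f"{style:15s} {mgates:8.2f} Mgates {watts:8.2f} W")
```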

ASIC Principles

• Value-added ASIC for huge volume opportunities; standard parts for quick time to market applications
• Economics of Design
  – Fast Prototyping, Low Volume
  – Custom Design, Labor Intensive, High Volume
• CAD Tools Needed to Achieve the Design Strategies
  – System-level design: Concept to VHDL/C
  – Physical design: VHDL/C to silicon, Timing closure (Monterey, Magma, Synopsys, Cadence, Avant!)
• Design Strategies: Hierarchy; Regularity; Modularity; Locality

ASIC Design Strategies

• Design is a continuous tradeoff to achieve performance specs with adequate results in all the other parameters.
• Performance Specs - function, timing, speed, power
• Size of Die - manufacturing cost
• Time to Design - engineering cost and schedule
• Ease of Test Generation & Testability - engineering cost, manufacturing cost, schedule

ASIC Flow

Structured ASIC Designs

• Hierarchy: Subdivide the design into many levels of sub-modules
• Regularity: Subdivide to max number of similar sub-modules at each level
• Modularity: Define sub-modules unambiguously & well defined interfaces
• Locality: Max local connections, keeping critical paths within module boundaries

ASIC Design Options

• Programmable Logic
• Programmable Interconnect
• Reprogrammable Gate Arrays
• Sea of Gates & Gate Array Design
• Standard Cell Design
• Full Custom Mask Design
• Symbolic Layout
• Process Migration - Retargeting Designs

ASIC Design Methodologies

                      Custom      Cell-based   Prediffused   Prewired
Density               Very High   High         High          Medium - Low
Performance           Very High   High         High          Medium - Low
Flexibility           Very High   High         Medium        Low
Design time           Very Long   Short        Short         Very Short
Manufacturing time    Medium      Medium       Short         Very Short
Cost - low volume     Very High   High         High          Low
Cost - high volume    Low         Low          Low           High

Why SOC?

• SOC specs are coming from system engineers rather than RTL descriptions
• SOC will bridge the gap between hardware/software and their implementation in novel, energy-efficient silicon architecture.
• In SOC design, chips are assembled at IP block level (design reusable) and IP interfaces rather than gate level

CMOS density now allows complete System-on-a-chip Solutions

[Figure: single-chip system block diagram. Functions: A/D, digital down-conversion, demodulation and sync, Viterbi equalizer, de-interleave & decoder, RPE-LTP speech decoder, speech quality enhancement, voice recognition, phone book, keypad interface, protocol control, DMA, S/P. Implementation blocks: DSP core, µP core, RAM & ROM, dedicated logic, analog. Also like to add: FPGA, reconfigurable interconnect.]

Source: Brodersen, ICASSP '98

How do we design these chips?

Possible Single-Chip Radio Architectures

Software Radio
• GOAL: Simplify System Design Process
• Seek architectures which are flexible such that hardware and protocols can be designed independently
• APPROACH: Minimize the use of dedicated logic

Universal Radio
• GOAL: Maximize Bandwidth Efficiency and Battery Life
• Seek architectures which perform complex algorithms very fast with minimal energy
• APPROACH: Minimize the use of programmable logic

Why is SOC design so scary?

60 GHz SiGe Transceiver for Wireless LAN Applications

• A low power 30 GHz LNA is designed as the front end of the receiver.
• Wideband and high gain response is realized by a 2-stage design using a stagger-tuned technique.
• The simulated performance predicts a forward gain of |S21| > 20 dB over a 6 GHz range with an input match of |S11| < -30 dB and output match of |S22| < -10 dB.
• The mixer consists of a single balanced Gilbert cell.
• A fully-integrated differential 25 GHz VCO is used, in conjunction with the mixer, to downconvert the RF input to a 5 GHz IF.

[Figure: 30 GHz receiver layout consisting of the LNA, mixer and VCO]

Wideband CMOS LC VCO

• A 1.8 GHz wideband LC VCO implemented in 0.18 µm bulk CMOS has been successfully designed, fabricated, and measured.
• This VCO utilizes a 4-bit array of switched capacitors and a small accumulation-mode varactor to achieve a measured tuning range exceeding 2:1 (73%) and a worst-case tuning sensitivity of 270 MHz/V.
• The amplitude reference level is programmable by means of a 3-bit DAC.

[Figure: VCO die photograph]
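The "2:1 (73%)" figure can be cross-checked with a short calculation. The sketch below assumes the common definition of fractional tuning range, (f_max - f_min) divided by the centre frequency; the slide does not state which definition it uses, so treat this as one plausible reading. It also uses the ideal LC tank relation f = 1 / (2π√(LC)), under which the frequency ratio is the square root of the capacitance ratio.

```python
def tuning_range_percent(f_min, f_max):
    """Fractional tuning range relative to the centre frequency, in percent."""
    f_center = (f_max + f_min) / 2.0
    return 100.0 * (f_max - f_min) / f_center


def ratio_from_percent(percent):
    """Invert the definition above: f_max / f_min for a given tuning range."""
    x = percent / 100.0
    return (2.0 + x) / (2.0 - x)


def capacitance_ratio(freq_ratio):
    """For an ideal LC tank f is proportional to 1/sqrt(C), so
    C_max / C_min = (f_max / f_min) ** 2 at fixed inductance."""
    return freq_ratio ** 2


if __name__ == "__main__":
    r = ratio_from_percent(73.0)
    print(f"73% tuning range -> f_max/f_min = {r:.3f}")
    print(f"required C_max/C_min (ideal tank) = {capacitance_ratio(r):.2f}")
    print(f"exactly 2:1 -> {tuning_range_percent(1.0, 2.0):.1f}%")
```

Under this definition an exact 2:1 ratio corresponds to about 66.7%, so 73% is consistent with a range "exceeding 2:1".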

A High Level View of an Industry Standard Design Flow

• Every step can loop to every other step
• Each step can take hours or days for a 100,000 line description
• HDL description contains no physical information
• Different engineers handle the front-end and back-end design

[Flow diagram: HDL Entry, good?, Synthesis, good?, Floor-plan and Place & Route, good?, Physical Verification (DRC & LVS), good?, done]

source: Hitachi, Prof. R. W. Brodersen

Problems with this flow:

How have semiconductor companies made this flow work?

[Figure labels: Front-End, Back-End]

A More Accurate Picture of the Standard Flow

• Architecture: Partition the chip into functional units and generate bit-true test vectors to specify the behavior of each unit
  TOOLS: Matlab, C, SPW, (VCC)
  FREEZE the test vectors
• Front-End: Enter HDL code which matches the test vectors
  TOOLS: HDL Simulators, Design Compiler
  FREEZE the HDL code
• Back-End: Create a floor-plan and tweak the tools until a successful mask layout is created
  TOOLS: Design Compiler, Floor-planners, Placers, Routers, Clock-tree generators, Physical Verification

Timeline: Architecture 10 months; Front-End 10 months; Back-End 2 months; Fabrication 2 months

Source: IBM Semiconductor, Prof. R. Newton
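To make "bit-true test vectors" concrete, here is a minimal sketch of a golden model that generates them. The functional unit (a 16-bit saturating adder), the word length, and all function names are hypothetical choices for illustration; the slides do not specify any particular unit or tool format.

```python
import random

WIDTH = 16  # assumed word length of the hypothetical unit
MAX_VAL = (1 << (WIDTH - 1)) - 1
MIN_VAL = -(1 << (WIDTH - 1))


def saturate(x):
    """Clamp an integer to the signed WIDTH-bit range."""
    return max(MIN_VAL, min(MAX_VAL, x))


def sat_add(a, b):
    """Bit-true model of a saturating adder (hypothetical functional unit)."""
    return saturate(a + b)


def make_vectors(n, seed=0):
    """Return n (a, b, expected) triples; a fixed seed keeps them reproducible,
    which matters once the vectors are frozen."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(n):
        a = rng.randint(MIN_VAL, MAX_VAL)
        b = rng.randint(MIN_VAL, MAX_VAL)
        vectors.append((a, b, sat_add(a, b)))
    return vectors


def to_hex(x):
    """Two's-complement hex string, a form an HDL testbench could read."""
    return format(x & ((1 << WIDTH) - 1), f"0{WIDTH // 4}x")


if __name__ == "__main__":
    for a, b, y in make_vectors(4):
        print(to_hex(a), to_hex(b), to_hex(y))
```

The point of the integer-only model is that the HDL implementation must match every output bit exactly, so the frozen vectors act as the specification handed from the architecture phase to the front-end.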

How can we improve this flow?

Common Fabric for IP Blocks

• Soft IP blocks are portable, but not as predictable as hard IP.
• Hard IP blocks are very predictable since a specific physical implementation can be characterized, but are hard to port since they are often tied to a specific process.
• Common fabric is required for both portability and predictability.
• Wide availability: Cell Based Array, metal programmable architecture that provides the performance of a standard cell and is optimized for synthesis.

Four main applications

• Set-top box: Mobile multimedia system, base station for the home local-area network.
• Digital PCTV: concurrent use of TV, 3D graphics, and Internet services
• Set-top box LAN service: Wireless home-networks, multi-user wireless LAN
• Navigation system: steer and control traffic and/or goods-transportation

CMPR is a multipurpose program that can be used for displaying diffraction data, manual- & auto-indexing, peak fitting and other

http://www.ncnr.nist.gov/programs/crystallography/software/cmpr/cmprdoc.html

PC-Multimedia Applications

Types of System-on-a-Chip Designs

Physical gap

• Timing closure problem: layout-driven logic and RT-level synthesis
• Energy efficiency requires locality of computation and storage: match for stream-based data processing of speech, images, and multimedia-system packets.
• Next generation SOC designers must bridge the architectural gap b/w system specification and energy-efficient IP-based architectures, while CAE vendors and IP providers will bridge the physical gap.

Circular Y-Chart

SOC Co-Design Challenges

• Current systems are complex and heterogeneous
• Contain many different types of components
• Half of the chip can be filled with 200 low-power, RISC-like processors (ASIP) interconnected by field-programmable buses, embedded in 20 Mbytes of distributed DRAM and flash memory. Another half: ASIC
• Computational power will not result from multi-GHz clocking but from parallelism, with clock rates below 200 MHz.
• This will greatly simplify the design for correct timing, testability, and signal integrity.
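The parallelism argument can be checked with simple arithmetic. The sketch below assumes each processor retires one instruction per cycle, which is an illustrative assumption and not something the slides state.

```python
def aggregate_gips(num_processors, clock_mhz, ipc=1.0):
    """Peak aggregate instruction rate in GIPS (10^9 instructions per second)."""
    return num_processors * clock_mhz * 1e6 * ipc / 1e9


def clock_needed_mhz(target_gips, num_processors, ipc=1.0):
    """Clock rate each processor would need to reach target_gips in aggregate."""
    return target_gips * 1e9 / (num_processors * ipc) / 1e6


if __name__ == "__main__":
    total = aggregate_gips(200, 200)
    print(f"200 processors at 200 MHz: {total:.0f} GIPS peak")
    print(f"one processor would need {clock_needed_mhz(total, 1):.0f} MHz")
```

Under that assumption, 200 processors at 200 MHz reach a peak of 40 GIPS, a rate a single core could only match with a 40 GHz clock.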

Bridging the architectural gap

• One-M gate reconfigurable, one-M gate hardwired logic.
• 50 GIPS for programmable components or 500 GIPS for dedicated hardware
• Product reliability: design at a level far above the RT level, with reuse factors in excess of 100
• Trade-off: 100 MOPs/watt (microprocessor) vs. 100 GOPs/watt (hardwired). Reconfigurable computing with a large number of computing nodes and a very restricted instruction set (Pleiades)
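The MOPs/watt figures translate directly into energy per operation, since one watt is one joule per second. A minimal sketch of the conversion (the function name is illustrative):

```python
def energy_per_op_joules(ops_per_second_per_watt):
    """1 W = 1 J/s, so (ops/s)/W equals ops/J; invert it to get J/op."""
    return 1.0 / ops_per_second_per_watt


microprocessor = energy_per_op_joules(100e6)  # 100 MOPs/watt
hardwired = energy_per_op_joules(100e9)       # 100 GOPs/watt

if __name__ == "__main__":
    print(f"microprocessor: {microprocessor * 1e9:.0f} nJ per operation")
    print(f"hardwired:      {hardwired * 1e12:.0f} pJ per operation")
    print(f"ratio:          {microprocessor / hardwired:.0f}x")
```

That is 10 nJ per operation against 10 pJ per operation, a factor of 1000, which is the gap that restricted-instruction-set reconfigurable fabrics aim to narrow.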

Why Lower Power

• Portable systems
  – long battery life
  – light weight
  – small form factor
• IC priority list
  – power dissipation
  – cost
  – performance
• Technology direction
  – Reduced voltage/power designs based on mature high performance IC technology, high integration to minimize size, cost, power, and speed

Microprocessor Power Dissipation

[Figure: power (W, 0 to 50) versus year (1980 to 2000) for i286, i386 DX 16, i486 DX 25, i486 DX 50, i486 DX2 66, i486 DX4 100, P5 66, P6 166, P II 300, P III 500, P-PC601 50, P-PC604 133, P-PC750 400, Alpha 21064 200, Alpha 21164, Alpha 21264]

Levels for Low Power Design

Level            Techniques
System           Hardware-software partitioning, Power down
Algorithm        Complexity, Concurrency, Locality, Regularity, Data representation
Architecture     Parallelism, Pipelining, Signal correlations, Instruction set selection, Data rep.
Circuit/Logic    Sizing, Logic Style, Logic Design
Technology       Threshold Reduction, Scaling, Advanced packaging, SOI

Possible Power Savings at Different Design Levels

Level of Abstraction    Expected Saving
Algorithm               10 - 100 times
Architecture            10 - 90%
Logic Level             20 - 40%
Layout Level            10 - 30%
Device Level            10 - 30%

Power-hungry Applications

• Signal Compression: HDTV Standard, ADPCM, Vector Quantization, H.263, 2-D motion estimation, MPEG-2 storage management
• Digital Communications: Shaping Filters, Equalizers, Viterbi decoders, Reed-Solomon decoders

New Computing Platforms

• SOC power efficiency more than 10 GOPs/W
  – Higher On Chip System Integration: COTS: 100 W, SOC: 10 W (inter-chip capacitive loads, I/O buffers)
  – Speed & Performance: shorter interconnection, fewer drivers, faster devices, more efficient processing architectures
• Mixed signal systems
• Reuse of IP blocks
• Multiprocessor, configurable computing
• Domain-specific, combined memory-logic

$P = k C F V^2$
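The power relation above (power proportional to k, C, F, and the square of V) can be explored numerically. In the sketch below, all parameter values are hypothetical and k is treated as a dimensionless factor; only the ratios between the cases are meaningful.

```python
def dynamic_power(k, c_farads, f_hz, v_volts):
    """Dynamic power P = k * C * F * V^2, in watts."""
    return k * c_farads * f_hz * v_volts ** 2


# Hypothetical operating point, chosen only to illustrate the scaling.
K, C, F, V = 0.5, 1e-9, 100e6, 3.3

baseline = dynamic_power(K, C, F, V)
half_voltage = dynamic_power(K, C, F, V / 2)
half_frequency = dynamic_power(K, C, F / 2, V)

if __name__ == "__main__":
    print(f"baseline:       {baseline:.4f} W")
    print(f"half voltage:   {half_voltage:.4f} W ({baseline / half_voltage:.1f}x lower)")
    print(f"half frequency: {half_frequency:.4f} W ({baseline / half_frequency:.1f}x lower)")
```

Halving the supply voltage cuts power by a factor of four, while halving the frequency only cuts it by two, which is why supply voltage is the dominant lever in low power design.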

Low Power Design Flow I

[Flow diagram. Blocks: System-Level Specification; Function Partitioning and HW/SW Allocation; System-Level Power Analysis; Behavioral Description; Software Functions; Processor Selection; Power-driven Behavioral Transformation; Behavioral-Level Power Analysis; Power Conscious Behavioral Description; High-Level Synthesis and Optimization; RT-Level Power Analysis; Software Optimization; Software-Level Power Analysis; To RT-Level Design]

Low Power Design Flow II

[Flow diagram. Blocks: RT-level Description; RTL mapping; RTL Library; RTL Macrocells; Data-path; Controller; Processor; Control and Steering Logic; Memory; Logic Synthesis and Optimization; Standard cell Library; Gate-level; Gate-Level Power Analysis; High-Level Synthesis and Optimization; Switch-level; Switch-Level Power Analysis]

Three Factors affecting Energy

• Reducing waste by Hardware Simplification: redundant h/w extraction, Locality of reference, Demand-driven / Data-driven computation, Application-specific processing, Preservation of data correlations, Distributed processing
• All in one Approach (SOC): I/O pin and buffer reduction
• Voltage Reducible Hardwares
  – 2-D pipelining (systolic arrays)
  – SIMD: Parallel Processing: useful for data w/ parallel structure
  – VLIW: Approach - flexible

IBM's PowerPC Lower Power Architecture

• Optimum Supply Voltage through Hardware Parallel, Pipelining, Parallel instruction execution
  – 603e executes five instructions in parallel (IU, FPU, BPU, LSU, SRU)
  – FPU is pipelined so a multiply-add instruction can be issued every clock cycle
  – Low power 3.3-volt design
• Use small complex instruction with smaller instruction length
  – IBM's PowerPC 603e is RISC
• Superscalar: CPI < 1
  – 603e issues as many as three instructions per cycle
• Low Power Management
  – 603e provides four software controllable power-saving modes.
• Copper Processor with SOI
• IBM's Blue Logic ASIC: New design reduces power by a factor of 10

Power-Down Techniques

Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work
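The distinction between slowing the clock alone and lowering the voltage with it can be shown with a small model. This sketch assumes switching energy per operation of k * C * V^2 and ignores leakage; every numeric value is hypothetical.

```python
def energy_and_time(num_ops, f_hz, v_volts, k=0.5, c_farads=1e-9):
    """Return (energy in joules, run time in seconds) for a fixed amount of work.
    Power is k*C*F*V^2 and run time is num_ops/F, so the clock frequency
    cancels out of the energy: E = num_ops * k * C * V^2."""
    power = k * c_farads * f_hz * v_volts ** 2
    time = num_ops / f_hz
    return power * time, time


WORK = 1_000_000  # hypothetical number of operations

full_speed = energy_and_time(WORK, 100e6, 3.3)
slow_clock = energy_and_time(WORK, 50e6, 3.3)       # clock halved, same voltage
slow_and_low = energy_and_time(WORK, 50e6, 1.65)    # clock and voltage halved

if __name__ == "__main__":
    for name, (energy, time) in [
        ("full speed", full_speed),
        ("slow clock only", slow_clock),
        ("slow clock + low V", slow_and_low),
    ]:
        print(f"{name:20s} energy = {energy * 1e3:.3f} mJ, time = {time * 1e3:.1f} ms")
```

In this model, halving only the clock doubles the run time and leaves the energy for the task unchanged, whereas halving the voltage as well reduces that energy by a factor of four.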

Implementing Digital Systems

H/W and S/W Co-design

Three Co-Design Approaches

Source: N.S. Voros et al., "Hardware-software co-design of embedded systems using multiple formalisms for application development", IFIP International Conference FORTE/PSTV'98, Nov. '98

• ASIP co-design: builds a specific programmable processor for an application, and translates the application into software code. H/w and s/w partitioning includes the instruction set design.
• H/w s/w synchronous system co-design: s/w processor as a master controller, and a set of h/w accelerators as co-processors. Vulcan, Codes, Tosca, Cosyma
• H/w s/w for distributed systems: mapping of a set of communicating processes onto a set of interconnected processors. Behavioral decomposition, process allocation and communication transformation. Coware (powerful), Siera (reuse), Ptolemy (DSP)

Mixing H/W and S/W

• Argument: Mixed hardware/software systems represent the best of both worlds.
  – High performance, flexibility, design reuse, etc.
• Counterpoint: From a design standpoint, it is the worst of both worlds
  – Simulation: Problems of verification and test become harder
  – Interface: Too many tools, too many interactions, too much heterogeneity
  – Hardware/software partitioning is "AI-complete"!
  – (MIT, Stanford: by analogy with "NP-complete") A term used to describe problems in artificial intelligence, to indicate that the solution presupposes a solution to the "strong AI problem" (that is, the synthesis of a human-level intelligence). A problem that is AI-complete is just too hard.

http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi?NP-complete
http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi?artificial+intelligence
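A toy version of the partitioning problem shows why it scales badly. The sketch below exhaustively tries every hardware/software assignment for a handful of tasks and keeps the lowest-energy one that fits an area budget. The task list, the energy and area numbers, and the cost model are all hypothetical, and this is not the method used by any of the tools named in these slides.

```python
from itertools import product

# Hypothetical tasks: (name, software energy, hardware energy, hardware area)
TASKS = [
    ("viterbi", 9.0, 1.0, 4),
    ("fir", 6.0, 1.5, 3),
    ("control", 1.0, 0.8, 5),
    ("dct", 8.0, 2.0, 4),
]


def best_partition(tasks, area_budget):
    """Try all 2^n assignments; return (energy, set of task names in hardware)."""
    best = None
    for choice in product((False, True), repeat=len(tasks)):
        area = sum(t[3] for t, in_hw in zip(tasks, choice) if in_hw)
        if area > area_budget:
            continue  # this assignment does not fit on the chip
        energy = sum(t[2] if in_hw else t[1] for t, in_hw in zip(tasks, choice))
        hw = {t[0] for t, in_hw in zip(tasks, choice) if in_hw}
        if best is None or energy < best[0]:
            best = (energy, hw)
    return best


if __name__ == "__main__":
    energy, hw = best_partition(TASKS, area_budget=8)
    print(f"hardware: {sorted(hw)}, total energy: {energy}")
```

The search space doubles with every added task, and a realistic cost model would also have to account for communication, timing, and tool interactions, which is the heart of the counterpoint above.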

Low power partitioning approach

• Different HW resources are invoked according to the instruction executed at a specific point in time
• During the execution of the add op., ALU and register are used, but Multiplier is in idle state.
• Non-active resources will still consume energy since the corresponding circuits continue to switch
• Calculate wasted energy
• Adding application specific core and partial running
• Whenever one core is performing, all the other cores are shut down
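The "calculate wasted energy" step can be sketched as a walk over an instruction trace. The resources, per-cycle energies, and opcode-to-resource map below are hypothetical, and the model assumes one cycle per instruction with no shutdown of idle resources.

```python
# Hypothetical per-cycle energy (arbitrary units) burned by each resource
# whenever it is clocked, whether or not the instruction needs it.
ENERGY_PER_CYCLE = {"alu": 2.0, "multiplier": 5.0, "register_file": 1.0}

# Hypothetical map from opcode to the resources it actually uses.
USES = {
    "add": {"alu", "register_file"},
    "mul": {"multiplier", "register_file"},
    "nop": set(),
}


def wasted_energy(trace):
    """Energy spent in resources the executing instruction does not use."""
    wasted = 0.0
    for op in trace:
        for resource, energy in ENERGY_PER_CYCLE.items():
            if resource not in USES[op]:
                wasted += energy
    return wasted


def useful_energy(trace):
    """Energy spent in resources the executing instruction does use."""
    return sum(ENERGY_PER_CYCLE[r] for op in trace for r in USES[op])


if __name__ == "__main__":
    trace = ["add", "add", "mul", "add"]
    print("useful:", useful_energy(trace), "wasted:", wasted_energy(trace))
```

In this toy trace the idle multiplier dominates the waste during add instructions, which is the kind of observation that motivates moving a function into its own core and shutting the others down.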

ASIP (Application Specific Instruction Processors) Design

• Given a set of applications, determine micro architecture of ASIP (i.e., configuration of functional units in datapaths, instruction set)
• To accurately evaluate performance of processor on a given application need to compile the application program onto the processor datapath and simulate object code.
• The micro architecture of the processor is a design parameter!

ASIP Design Flow

Cross-Disciplinary nature

• Software for low power: loop transformation leads to much higher temporal and spatial locality of data.
• Code size becomes an important objective: software will eventually become a part of the chip
• Behavior-platform-compiler codesign: codesigned with C++ or JAVA, describing their h/w and s/w implementation.
• Multidisciplinary system thinking is required for future designs (e.g., Eindhoven Embedded Systems Institute http://www.eesi.tue.nl/english)

VLSI Signal Processing Design Methodology

• pipelining, parallel processing, retiming, folding, unfolding, look-ahead, relaxed look-ahead, and approximate filtering
• bit-serial, bit-parallel and digit-serial architectures, carry save architecture
• redundant and residue systems
• Viterbi decoder, motion compensation, 2D-filtering, and data transmission systems

Low Power DSP

DO-LOOP Dominant

VSELP Vocoder   : 83.4 %
2D 8x8 DCT      : 98.3 %
LPC computation : 98.0 %

DO-LOOP Power Minimization ==> DSP Power Minimization

VSELP: Vector Sum Excited Linear Prediction
LPC: Linear Prediction Coding
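The implication that minimizing DO-LOOP power minimizes DSP power follows an Amdahl-style argument. The sketch below assumes the percentages above are the share of total power spent inside DO-loops, which the slide does not state explicitly, and uses a hypothetical reduction factor for the loop part.

```python
# Assumed interpretation: fraction of total power spent inside DO-loops.
LOOP_SHARE = {
    "VSELP Vocoder": 0.834,
    "2D 8x8 DCT": 0.983,
    "LPC computation": 0.980,
}


def remaining_power(loop_share, loop_reduction):
    """Fraction of the original power left when only the loop part is reduced
    by the factor `loop_reduction` (2.0 means loops use half as much)."""
    return (1.0 - loop_share) + loop_share / loop_reduction


if __name__ == "__main__":
    for name, share in LOOP_SHARE.items():
        left = remaining_power(share, loop_reduction=2.0)
        print(f"{name:16s} loops halved -> {left * 100:.1f}% of original power")
```

Because the loop share is so high, almost all of any saving made inside the loops carries through to the whole program.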

Deep-Submicron Design Flows

• Rapid evaluation of complex designs for area and performance
• Timing convergence via estimated routing parasitics
• In-place timing repair without resynthesis
• Shorter design intervals, minimum iterations
• Block-level design and place and route
• Localized changes without disturbance
• Integration of complex projects and design reuse

SOC CAD Companies

• Avant! www.avanticorp.com
• Cadence www.cadence.com
• Duet Tech www.duettech.com
• Escalade www.escalade.com
• Logic visions www.logicvision.com
• Mentor Graphics www.mentor.com
• Palmchip www.palmchip.com
• Sonic www.sonicsinc.com
• Summit Design www.summit-design.com
• Synopsys www.synopsys.com
• Topdown design solutions www.topdown.com
• Xynetix Design Systems www.xynetix.com
• Zuken-Redac www.redac.co.uk

Design Technology for Low Power Radio Systems

http://bwrc.eecs.berkeley.edu

Rhett Davis
Dept. of EECS
Univ. of Calif. Berkeley

Domain of Interest

• Highly integrated system-on-a-chip solutions (SOCs)
• Wireless communications with associated processing, e.g. multimedia processing, compression, switching, etc.
• Primary computation is high complexity dataflow with a relatively small amount of control

Why Systems-on-a-Chip - SOC ?

• State-of-the-Art CMOS is easily able to implement complete systems (or what was on a board before)
  – A microprocessor core is only 1-2 mm² (1-2 % of the area of a $4 chip)
  – Portability (size) is critical to meet the cost, power and size requirements of future wireless systems
  – Chips will be required to support the complete application (wireless internet, multimedia)
  – Dedicated stand-alone computation is replacing general purpose processors as the semiconductor industry driver

Cellular Phones: An example

[Figure: cellular phone block diagram with Analog Baseband, Digital Baseband (DSP + MCU), Power Management, Small Signal RF, and Power RF blocks]

Digital Cellular Market (Phones Shipped)

Year   1996  1997  1998  1999  2000
Units  48M   86M   162M  260M  435M

(Courtesy Mike McMahon, Texas Instruments)

Cellular Phone Baseband SOC

[Figure: baseband SOC die with MCU, Gates, Analog, ROM, DSP, and RAM blocks]

2000+ phones on each 8” wafer @ .15 Leff

1 Million Baseband Chips per Day!!!
(Courtesy Mike McMahon, Texas Instruments)

Wireless System Design Issues

It is now possible to use CMOS to integrate all digital radio functions – but what is the “best” architectural way to use CMOS???
Computation rates for wireless systems will easily range up to 100’s of GOPS in signal processing
– What’s keeping us from achieving this in silicon?
– What can we do about it?

Computational Efficiency Metrics

Definition: MOPS
– Millions of algorithmically defined arithmetic operations (e.g. multiply, add, shift) – in a GP processor several instructions per “useful” operation
Figures of merit
– MOPS/mW - Energy efficiency (battery life)
– MOPS/mm² - Area efficiency (cost)
Optimization of these “efficiencies” is the basic goal assuming functionality is met
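The two figures of merit above can be computed directly from throughput, power, and area. The following is a minimal sketch; the function names and the example design (2 Gops at 40 mW on 4 mm²) are hypothetical and only illustrate the unit conversions.

```python
def mops_per_mw(ops_per_second: float, power_watts: float) -> float:
    """Energy efficiency: millions of operations per second per milliwatt."""
    return (ops_per_second / 1e6) / (power_watts * 1e3)


def mops_per_mm2(ops_per_second: float, area_mm2: float) -> float:
    """Area efficiency: millions of operations per second per square millimetre."""
    return (ops_per_second / 1e6) / area_mm2


def energy_per_op_pj(efficiency_mops_per_mw: float) -> float:
    """1 MOPS/mW equals 1e9 operations per joule, i.e. 1000 pJ per operation."""
    return 1000.0 / efficiency_mops_per_mw


# Hypothetical design: 2 Gops at 40 mW on 4 mm^2 of silicon.
example_energy_eff = mops_per_mw(2e9, 0.040)
example_area_eff = mops_per_mm2(2e9, 4.0)
```

Because MOPS/mW is just the reciprocal of energy per operation, `energy_per_op_pj` converts an efficiency figure into picojoules per operation.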

Energy-Efficiency of Architectures

[Figure: energy efficiency in MOPS/mW (or MIPS/mW), log scale from 0.1 to 1000, versus flexibility (coverage)]

Embedded processors, microprocessors: .1-1 MIPS/mW
ASIPs, DSPs: 1-10 MIPS/mW
Reconfigurable processor/logic: reconfiguration (???), potential of 10-100 MOPS/mW
Dedicated HW, direct mapped: 100-1000 MOPS/mW

Software Processors: Energy Trends

Primary means of performance increase of software processors has been by increasing clock rate

Decreasing Energy Efficiency

[Figure: clock frequency (MHz, 0 to 300) versus year (1991 to 1996) for i386, i386C-33, i486C-33, 486-66, DX4 100, PP-66, PP-100, PP-133, PP166, PPro-150, PPro200, A21064A, A21164-300, MIPS R4400, MIPS R5000, MIPS R10000, SuperSparc2-90, UltraSparc-167, PPC 601-80, PPC603e-100, PPC 604-120, HP PA7200, HP PA8000]

$E \propto C\,V_{DD}^{2}$

Software Processors: Area Trends

DSP processor with 1 multiplier (25 mm²)
16x16 multiplier (.05 mm²)
Why time multiplex to save area if the overhead is much greater than the area saved????

Increasing clock rate results in a memory bottleneck – addressed by bringing memory on-chip
Area is increasingly dominated by memory – degrading MOPS/mm²

Parallelism is the answer, but …

Not by putting Von Neumann processors in parallel and programming with a sequential language
– Attempts to do this have failed over and over again…
– The parallel computer compiler problem is very difficult
Not by trying to capture parallelism at the instruction level
– Superscalar, VLIW, etc… are very inefficient
– Hardware can’t figure out the parallelism from a sequential language either
The problem is the initial sequential description (e.g. C) which is poorly matched to highly parallel applications

What is really happening…

for (i = 0; i < num; i++) {
    a = a * c[i];
    b[i] = sin(a * pi) + cos(a * pi);
}
Outfil = b[i] * indata;

Starting with a parallel algorithmic description
Re-entering it using a sequential description
Then try to rediscover the parallelism

We take this path so that we can use an architecture that is orders of magnitude less efficient in energy and area ??????

What can a fully parallel CMOS solution potentially do?

In .25 micron a multiplier requires .05 mm² and 7 pJ per operation at 1 V. Adders and registers are about 10 times smaller and 10 times lower energy
Let's implement a 50 mm², .25 micron chip using adders, registers and multipliers
We can have 2000 adders/registers and 200 multipliers in less than 1/2 of the chip, also assume 1/3 of power goes into clocks
25 MHz clock (1 volt) gives ~50 Gops at 100 mW
500 MOPS/mW and 1000 MOPS/mm²
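The arithmetic behind these round numbers can be checked with a short script. This is a sketch under two assumptions that the slide does not state explicitly: every adder/register and multiplier performs one operation per clock cycle, and "1/3 of power goes into clocks" means the clock network takes one third of the total power. The constant names are illustrative.

```python
# Figures quoted on the slide (0.25 micron, 1 V).
MULT_AREA_MM2 = 0.05
MULT_ENERGY_PJ = 7.0
ADD_REG_AREA_MM2 = MULT_AREA_MM2 / 10    # "about 10 times smaller"
ADD_REG_ENERGY_PJ = MULT_ENERGY_PJ / 10  # "10 times lower energy"
N_MULT = 200
N_ADD_REG = 2000
CLOCK_HZ = 25e6
CHIP_AREA_MM2 = 50.0
CLOCK_POWER_FRACTION = 1 / 3             # assumed to be a fraction of total power

# Area used by the arithmetic units (should be under half of the chip).
datapath_area_mm2 = N_MULT * MULT_AREA_MM2 + N_ADD_REG * ADD_REG_AREA_MM2

# Throughput, assuming one operation per unit per cycle.
ops_per_second = (N_MULT + N_ADD_REG) * CLOCK_HZ

# Power: datapath energy per cycle times clock rate, then add the clock share.
energy_per_cycle_pj = N_MULT * MULT_ENERGY_PJ + N_ADD_REG * ADD_REG_ENERGY_PJ
datapath_power_w = energy_per_cycle_pj * 1e-12 * CLOCK_HZ
total_power_w = datapath_power_w / (1 - CLOCK_POWER_FRACTION)

mops_per_mw = (ops_per_second / 1e6) / (total_power_w * 1e3)
mops_per_mm2 = (ops_per_second / 1e6) / CHIP_AREA_MM2
```

Under these assumptions the units occupy 20 mm², throughput is 55 Gops, total power is about 105 mW, and the efficiencies come out near 520 MOPS/mW and 1100 MOPS/mm², consistent with the rounded figures above.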

Start with a parallel description of the algorithm…
Then directly map into hardware …

[Figure: direct-mapped layout with Mult1, Mult2, Mac1, Mac2, S reg, X reg, and Add,Sub,Shift blocks]

Results in fully parallel solutions

Energy columns: 64-point FFT, energy per transform (nJ); 16-state Viterbi decoder, energy per decoded bit (nJ)
Area columns: 64-point FFT, transforms per second per unit area (Trans/ms/mm²); 16-state Viterbi decoder, decode rate per unit area (kb/s/mm²)

                         FFT energy   Viterbi energy   FFT rate/area    Viterbi rate/area
                         (nJ)         (nJ)             (Trans/ms/mm²)   (kb/s/mm²)
Direct-Mapped Hardware   1.78         0.022            2,200            200,000
FPGA                     683          5.5              1.8              100
Low-Power DSP            436          19.6             4.3              50
High-Performance DSP     1700         108              10               150

(numbers taken from vendor-published benchmarks)

Orders of magnitude lower efficiency even for an optimized processor architecture
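To make "orders of magnitude" concrete, the benchmark figures above can be normalized to the direct-mapped row. The sketch below uses only the numbers quoted on the slide; the dictionary layout and function name are illustrative.

```python
# (FFT energy nJ, Viterbi energy nJ, FFT Trans/ms/mm^2, Viterbi kb/s/mm^2)
BENCHMARKS = {
    "Direct-Mapped Hardware": (1.78, 0.022, 2200.0, 200000.0),
    "FPGA": (683.0, 5.5, 1.8, 100.0),
    "Low-Power DSP": (436.0, 19.6, 4.3, 50.0),
    "High-Performance DSP": (1700.0, 108.0, 10.0, 150.0),
}


def penalty_vs_direct_mapped(name: str) -> tuple:
    """Return how many times worse `name` is than direct-mapped hardware.

    Energy columns: other / direct (more energy is worse).
    Area columns: direct / other (less throughput per mm^2 is worse).
    """
    ref = BENCHMARKS["Direct-Mapped Hardware"]
    row = BENCHMARKS[name]
    return (row[0] / ref[0], row[1] / ref[1], ref[2] / row[2], ref[3] / row[3])


for arch in ("FPGA", "Low-Power DSP", "High-Performance DSP"):
    fft_e, vit_e, fft_a, vit_a = penalty_vs_direct_mapped(arch)
    print(f"{arch}: FFT energy x{fft_e:.0f}, Viterbi energy x{vit_e:.0f}, "
          f"FFT area x{fft_a:.0f}, Viterbi area x{vit_a:.0f}")
```

Every penalty in the table is at least a factor of a few hundred, which is the gap the slide is pointing at.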

Reasons software solutions seem attractive

(1) Believed to reduce time-to-system-implementation
(2) Provides flexibility
(3) Locks the customers into an architecture they can’t change
(4) Difficulty in getting dedicated SOC chips designed

Are these good reasons???

(1) Believed to reduce time-to-system implementation

Software decreases time to get first prototype, but time to fully verified system is much longer (hardware is often ready but software still needs to be done)
Limitations of software prototype often sets the ultimate limit of the system performance
Software solutions can be shipped with bugs, not a real option for SOC

(2) Need flexibility

Software is not always flexible
– Can be hard to verify
Flexibility does not imply software programmability
– Domain specific design can have multiple modules, coefficients and local state control (the factor of 100 in efficiency) to address a range of applications
– Reconfiguration of interconnect can achieve flexibility with high levels of efficiency

Flexibility without software

[Figure: Energy per Transform vs. FFT size. X axis: FFT size, 10^1 to 10^3. Y axis: energy per transform (J), 10^-10 to 10^-3. Curves: lower limit, function-specific reconfigurable hardware, data-path reconfigurable processor, FPGA, low-power DSP, high-performance DSP]

[Figure: Transforms per Second per mm² vs. FFT size. X axis: FFT size, 10^1 to 10^3. Y axis: (transforms per second)/(silicon area) (Trans/s/mm²), 10^3 to 10^8. Curves: function-specific reconfigurable hardware, data-path reconfigurable processor, FPGA, low-power DSP, high-performance DSP]

* All results are scaled to 0.18 µm

Reasons software solutions seem attractive

(1) Believed to reduce time-to-system implementation
(2) Provides flexibility
(3) Locks the customers into an architecture they can’t change
(4) Difficulty in getting dedicated SOC chips designed

Standard DSP-ASIC Design Flow

Problems:
Three translations of design data
Requirements for re-verification at each stage
Uncontrolled looping when pipeline stalls

Prohibitively Long Design Time for Direct Mapped Architectures

[Figure: design flow stages and their outputs]
Algorithm Design: Floating-Point Simulation
System/Architecture Design: Fixed-Point Simulation
Hardware/Front-End Design: RTL Code
Physical/Back-End Design: Mask Layout

Design data descriptions along the flow: Sequential; Mixed Sequential & Structural; Integer only, Structural w/ Sequential Leaf-cells; Single-wire Connectivity w/ Timing Constraints

Direct Mapping Design Flow

Encourages iterations of layout
Controls looping
Reduces the flow to a single phase
Depends on fast automation

[Figure: Algorithm/System Simulation, Front-End (RTL Libraries), and Back-End (Floorplan) feed an Automated Flow that produces Mask Layout and Performance Estimates]

Déjà vu???

An automated style of design with parameterized modules processed through foundries is just the reincarnation of good ole Silicon Compilation of >10 years ago
What happened?
– A decline of research into design methodologies
– A single dominant flow has resulted - the Verilog-Synopsys-Standard Cell
– Lack of tool flows to support alternative styles of design
– Research community lost access to technology – moved to highly sub-optimal processor and FPGA solutions

Capturing Design Decisions

Categories:
Function - basic input-output behavior
Signal - physical signals and types
Circuit - transistors
Floorplan - physical positions

How to get layout and performance estimates in a day?

[Figure: example floorplan with MAC, reg. file, add, shift, and reg. file blocks]

Simplified View of the Flow

New Software:
Generation of netlists from a dataflow graph
Merging of floorplan from last iteration
Automatic routing and performance analysis
Automation of flow as a dependency graph (UNIX MAKE program)

[Figure: flow with inputs dataflow graph, floorplan, and macro library; steps elaborate, merge, autoLayout, and route; intermediate netlist; output layout]
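Treating the flow as a dependency graph means each step reruns only when one of its inputs changes, which is what make does with file timestamps. The sketch below illustrates the idea with Python's standard `graphlib` module (Python 3.9+). The exact wiring between the steps is an assumption, since the figure's edges did not survive extraction, and the target names "merged floorplan" and "placed design" are hypothetical.

```python
from graphlib import TopologicalSorter

# target -> set of things it depends on (assumed wiring, for illustration only)
FLOW = {
    "netlist": {"dataflow graph", "macro library"},  # produced by "elaborate"
    "merged floorplan": {"netlist", "floorplan"},    # produced by "merge"
    "placed design": {"merged floorplan"},           # produced by "autoLayout"
    "layout": {"placed design"},                     # produced by "route"
}


def build_order(flow):
    """Return every node with dependencies listed before their dependents."""
    return list(TopologicalSorter(flow).static_order())


def stale_targets(flow, changed):
    """Return the targets that must be rebuilt after the `changed` inputs change."""
    dirty = set(changed)
    rebuilt = []
    for node in TopologicalSorter(flow).static_order():
        # A target is stale if any of its dependencies is dirty.
        if node in flow and dirty & flow[node]:
            dirty.add(node)
            rebuilt.append(node)
    return rebuilt


# Editing only the floorplan skips the "elaborate" step entirely.
print(stale_targets(FLOW, {"floorplan"}))
```

With this wiring, a floorplan change reruns merge, autoLayout, and route but reuses the existing netlist, while a dataflow graph change reruns everything.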

Why Simulink?

Simulink is an easy sell to algorithm developers
Closely integrated with popular system design tool Matlab
Successfully models digital and analog circuits

[Figure: Time-Multiplexed FIR Filter. Inputs X (port 1) and TAP_COEF (port 2), output Y (port 1). Blocks: SRAM (D, A, WEN, Q), CONTROL (addr, wen, reset_acc), MAC (A, B, RESET, Z)]

Modeling Datapath Logic

Discrete-Time (cycle accurate)
Fixed-Point Types (bit true)
Completely specify function and signal decisions
No need for RTL

[Figure: Multiply / Accumulate. Inputs A (port 1), B (port 2), RESET (port 3); output Z (port 1). Blocks: MULT, ADD, MUX, REG (Z^-1), CONST 0; signal types S12 and S18]
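A cycle-accurate, bit-true model of such a multiply/accumulate block can be written in a few lines. This is a minimal sketch, not the Simulink model itself: it assumes an 18-bit signed accumulator with two's-complement wraparound and no truncation of the product, and the names `wrap_signed` and `Mac` are hypothetical.

```python
def wrap_signed(value: int, bits: int) -> int:
    """Wrap an integer into a signed two's-complement range of `bits` bits."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


class Mac:
    """Multiply/accumulate with synchronous reset, one call to step() per clock."""

    def __init__(self, acc_bits: int = 18):
        self.acc_bits = acc_bits  # assumed accumulator width
        self.acc = 0              # contents of the register

    def step(self, a: int, b: int, reset: bool) -> int:
        # The MUX feeds the constant 0 into the adder when RESET is asserted,
        # otherwise it feeds back the registered accumulator value.
        base = 0 if reset else self.acc
        self.acc = wrap_signed(base + a * b, self.acc_bits)
        return self.acc


mac = Mac()
first = mac.step(3, 4, reset=True)    # starts a new sum with 3*4
second = mac.step(2, 5, reset=False)  # accumulates 2*5 on top
```

Because every value is an integer of a stated width, the same input sequence produces the same bits in simulation and in hardware, which is the sense in which the dataflow model replaces RTL.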

Modeling Control Logic

Extended finite state-machine editor
Co-simulation with dataflow graph
New Software: Stateflow-VHDL translator
No need for RTL

[Figure: Address Generator / MAC Reset state chart]
init – entry: addr=0; wen=1;
incr – during: addr++; reset_acc=0;
restart – entry: addr=0; wen=0; reset_acc=1;
Transition guard: [addr==15]
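The state chart can be mimicked by a small extended finite state machine. The state names, actions, and the `[addr==15]` guard come from the slide; the transition wiring (init to incr, incr to restart when the guard holds, restart back to incr) and the initial value of `reset_acc` are assumptions, because the arrows of the chart were lost.

```python
class AddressGenerator:
    """Extended FSM sketch: states init, incr, restart; one step() per clock."""

    def __init__(self):
        # Entry actions of "init": addr=0; wen=1. reset_acc=0 is an assumption.
        self.state = "init"
        self.addr, self.wen, self.reset_acc = 0, 1, 0

    def step(self):
        if self.state == "init":
            self.state = "incr"                      # assumed unconditional
        elif self.state == "incr":
            if self.addr == 15:                      # guard [addr==15]
                self.state = "restart"
                # Entry actions of "restart".
                self.addr, self.wen, self.reset_acc = 0, 0, 1
            else:
                # "during" actions of "incr".
                self.addr += 1
                self.reset_acc = 0
        else:  # restart
            self.state = "incr"                      # assumed unconditional
        return self.addr, self.wen, self.reset_acc


gen = AddressGenerator()
trace = [gen.step() for _ in range(19)]
```

In this reading, the first pass writes 16 addresses with `wen=1`, and each later pass reads them back with `wen=0` after pulsing `reset_acc` to clear the MAC.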

Specifying Circuit Decisions

Macro choices embedded in dataflow graph
Cross-check simulations required

[Figure: Time-Multiplexed FIR Filter with SRAM, CONTROL, and MAC blocks, annotated with macro choices: Stateflow-VHDL translator, RTL Code or Data-path Generator Code or Custom Module, Black Box]

Hierarchy Hardened Progressively

Macro characterization saved for fast estimates
Each level of hierarchy becomes a new hard macro
Higher levels of hierarchy are adjusted
When top level of hierarchy is hardened, the design is done

[Figure: System-Level Design Environment estimates performance (power, area, delay) from Hard Macro Characterization Libraries; layout and characterize each new hard macro]

Capturing Floorplan Decisions

Commercial physical design tools used
Instance names in floorplan match dataflow graph
Placements merged on each iteration
Manhattan distance can be used for parasitic estimates

Parallel Pipelined FIR Filter
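A Manhattan-distance parasitic estimate needs only the floorplan coordinates of the connected instances. The sketch below shows the idea; the capacitance per millimetre (200 fF/mm) is a placeholder value chosen for illustration, not a figure from the slides, and the function names are hypothetical.

```python
def manhattan_mm(a: tuple, b: tuple) -> float:
    """Manhattan (rectilinear) distance between two (x, y) points in mm."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def wire_capacitance_ff(src: tuple, dst: tuple, cap_per_mm_ff: float = 200.0) -> float:
    """Estimate wire capacitance as length times capacitance per unit length.

    cap_per_mm_ff is a hypothetical technology constant.
    """
    return manhattan_mm(src, dst) * cap_per_mm_ff


# Two macros placed 1.0 mm apart in x and 0.5 mm apart in y.
estimate = wire_capacitance_ff((0.0, 0.0), (1.0, 0.5))
```

Since routed wires in a Manhattan layout run horizontally and vertically, this distance is a lower bound on the routed length between two pins.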

Reduced Impact of Interconnect

Long wires can be modeled as lumped capacitances

[Figure: ratio of wire delay to FO4 inverter delay (0 to 0.5) versus VDD (0.4 to 1.8 V) in 0.18 µm, for a 1 mm M6 wire and a 5 mm M6 wire]

...

Race-Immune Clock Tree Synthesis

Race margin = 580 ps
0.18 µm, VDD = 1 V

$t_{\mathrm{skew(max)}} < t_{\mathrm{clk\text{-}Q(min)}} - t_{\mathrm{hold(max)}}$

Demonstrated on a 600k transistor design

Example Clock Tree
Stages: 22
Sinks: 7650
Skew: 320 ps
Clock Power: 2.8 mW
Logic Power: 21 mW

Hierarchical Clock Tree Synthesis
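The race-immunity condition is a single inequality on three timing parameters, so it is easy to check mechanically. In the sketch below, the 320 ps skew is the example clock tree's figure; the split of the 580 ps race margin into a 700 ps minimum clock-to-Q delay and a 120 ps maximum hold time is hypothetical, as is the reading of "race margin" as the right-hand side of the inequality.

```python
def skew_slack_ps(t_skew_max: float, t_clk_q_min: float, t_hold_max: float) -> float:
    """Slack of the condition t_skew(max) < t_clk-Q(min) - t_hold(max), in ps."""
    return (t_clk_q_min - t_hold_max) - t_skew_max


def is_race_immune(t_skew_max: float, t_clk_q_min: float, t_hold_max: float) -> bool:
    """True when the worst-case skew cannot cause a hold-time race."""
    return skew_slack_ps(t_skew_max, t_clk_q_min, t_hold_max) > 0


# Skew from the example clock tree; the other two values are assumed.
ok = is_race_immune(t_skew_max=320.0, t_clk_q_min=700.0, t_hold_max=120.0)
```

With these values the slack is 260 ps, so the example tree would satisfy the condition with room to spare.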

Example 1: Macro Hardening

parallel pipelined FIR filter
area in 0.25 µm: 1.4 mm²
power @ 25 MHz (1 V, PowerMill): 13.0 mW
critical path delay (1 V, PathMill): 18.0 ns
cells: 21 k
transistors: 240 k
execution time: 3 hours (elaborate / route), 9 hours (characterization)
disk space: 180 MB (elaborate / route), 1.5 GB (characterization)

Most time/disk space spent on extraction and power simulation

Example 2: Test Chip

300k transistors
0.25 µm
1.0 V
25 MHz
6.8 mm²
14 mW
2 phase clock
3 layers of P&R hierarchy

Parallel Pipelined FIR Filter(8X decimation filter for 12-bit 200 MHz

TDMA Baseband Receiver

600k transistors
0.18 µm
1.0 V
25 MHz
1.1 mm²
21 mW
single phase clock
5 clock domains
2 layers of P&R hierarchy

[Figure: chip layout with carrier detection, frequency estimation, rotate & correlate, and control blocks]

Conclusions

Direct-Mapped hardware is the most efficient use of silicon
Direct-Mapped hardware can be easier to design and verify than embedded hardware/software systems
Don’t translate design data, refine it
Design with dataflow graphs, not sequential code
Design flow automation speeds up design space exploration

Embedded Processor Architectures and (Re)Configurable Computing

Vandana Prabhu
Professor Jan M. Rabaey

Jan 10, 2000

Pico Radio Architecture

[Figure: Reconfigurable DataPath, FPGA, Embedded uP, Dedicated FSM, and Dedicated DSP blocks]

Reconfigurable Computing: Merging Efficiency and Versatility

“Hardware” customized to specifics of problem.

Direct map of problem specific dataflow, control.

Circuits “adapted” as problem requirements change.

Spatially programmed connection of processing elements.

Matching Computation and Architecture

Two models of computation: communicating processes + data-flow
Two architectural models: sequential control + data-driven

[Figure: convolution mapped onto a Control Processor plus two AddressGen, Memory, and MAC chains; remaining labels L, C, G]

Implementation Fabrics for Data Processing

[Figure: Digital Baseband Receiver block diagram. Data In, Acquisition and Timing Recovery, Adaptive Pilot Correlator, Adaptive Data Correlator, Signal Update Blocks, Channel Coefficient Estimates (C0 … CL-1), Sk, Data Out]

300 million multiplications/sec
357 million add-sub’s/sec

Comparison columns: Adaptive Pilot Correlator, Digital Baseband Receiver
DSP: Power: 460 mW; Area: 1089 mm²; Power: 1500 mW
Direct Mapped: Area: 1.3 mm²
Pleiades: Power: 18.49 mW; Power: 62.33 mW

16 Mmacs/mW!

Software Methodology Flow

[Figure: flow from Algorithms through Kernel Detection, Estimation/Exploration, and Partitioning to Software Compilation and Reconfig. Hardware Mapping, then Interface Code Generation. Other labels: Power & Timing Estimation of Various Kernel Implementations, Area & PDA Models, Premapped Kernels, Accelerator, proc & Timing Constraints, Xform’s for low power, Behavioral Kernels, Executable Intermediate Form, Interconnect Optimization, Reconfig HW]

(Marlene Wan)

Maia: Reconfigurable Baseband Processor for Wireless

• 0.25um tech: 4.5mm x 6mm

• 1.2 Million transistors

• 40 MHz at 1V

• 1 mW VCELP voice coder

• Hardware

• 1 ARM-8

• 8 SRAMs & 8 AGPs

• 2 MACs

• 2 ALUs

• 2 In-Ports and 2 Out-Ports

• 14x8 FPGA

Implementation Fabrics for Protocols

[Figure: protocol datapath with BUF, Memory, and Slot_Set_Tbl 2x16 blocks; signals addr, slot_set<31:0>, Slot_no<5:0>, Slotstart, Pktend, RACHreq, RACHakn, W_ENA, R_ENA; state chart with states idle, update, write, read, slotset, RACH]

A protocol = Extended FSM

Intercom TDMA MAC

         ASIC         FPGA         ARM8
Power    0.26 mW      2.1 mW       114 mW
Energy   10.2 pJ/op   81.4 pJ/op   n*457 pJ/op

ASIC: 1 V, 0.25 µm CMOS process
FPGA: 1.5 V, 0.25 µm CMOS low-energy FPGA
ARM8: 1 V, 25 MHz processor; n = 13,000
Ratio: 1 - 8 - >> 400

Idea: Exploit model of computation: concurrent finite state machines, communicating through message passing
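The quoted 1 - 8 - >> 400 ratio can be reproduced from the table. The sketch below uses only the slide's numbers; reading n as the number of processor instructions executed per protocol operation is an assumption, since the slide does not define it.

```python
POWER_MW = {"ASIC": 0.26, "FPGA": 2.1, "ARM8": 114.0}
ENERGY_PJ_PER_OP = {"ASIC": 10.2, "FPGA": 81.4}
ARM8_PJ_PER_INSTRUCTION = 457.0
N = 13_000  # assumed: instructions per protocol operation on the ARM8

# Power of each implementation relative to the ASIC.
power_ratio = {name: p / POWER_MW["ASIC"] for name, p in POWER_MW.items()}

# Energy per operation relative to the ASIC.
energy_ratio = {
    "ASIC": 1.0,
    "FPGA": ENERGY_PJ_PER_OP["FPGA"] / ENERGY_PJ_PER_OP["ASIC"],
    "ARM8": N * ARM8_PJ_PER_INSTRUCTION / ENERGY_PJ_PER_OP["ASIC"],
}

for name in ("ASIC", "FPGA", "ARM8"):
    print(f"{name}: power x{power_ratio[name]:.0f}, energy/op x{energy_ratio[name]:.0f}")
```

The power figures give roughly 1 : 8 : 440, matching the slide's ratio; under the assumed meaning of n, the energy-per-operation gap for the processor is far larger still.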

Low-Power FPGA

Low Energy Embedded FPGA (Varghese George)
Test chip
– 8x8 CLB array
– 5 in - 3 out CLB
– 3-level interconnect hierarchy
– 4 mm² in 0.25 µm ST CMOS
– 0.8 and 1.5 V supply
Simulation Results
– 125 MHz Toggle Frequency
– 50 MHz 8-bit adder
– energy 70 times lower than comparable Xilinx

An Energy-Efficient µP System

• Dynamic Voltage Scaling (Trevor Pering & Tom Burd)
Integrated dc-dc converter

[Figure: µProc. speed over time, before and after dynamic voltage scaling. Before: full speed followed by idle. After: lower speed, lower voltage, lower energy]
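Why finishing a task just in time beats racing to idle follows from the quadratic dependence of switching energy on supply voltage. The sketch below assumes energy per cycle of C·VDD², zero energy while idle, and a clock frequency that scales linearly with VDD; the last two are simplifications, and the function names are hypothetical.

```python
def task_energy(cycles: float, cap_per_cycle_f: float, vdd: float) -> float:
    """Switching energy in joules for a fixed number of cycles at supply vdd."""
    return cycles * cap_per_cycle_f * vdd ** 2


def scaled_vdd(vdd_max: float, speed_fraction: float) -> float:
    """Supply needed to run at a fraction of full speed (assumed linear f-V model)."""
    return vdd_max * speed_fraction


CYCLES = 1e6
CAP = 1e-12  # hypothetical switched capacitance per cycle, in farads

# Before: run at full speed, then idle for the second half of the deadline.
energy_before = task_energy(CYCLES, CAP, vdd=1.0)

# After: use the whole deadline at half speed and a correspondingly lower supply.
energy_after = task_energy(CYCLES, CAP, vdd=scaled_vdd(1.0, 0.5))
```

The cycle count is the same in both cases, so halving the voltage cuts the task energy by a factor of four under this model; a real processor gains less because frequency does not scale linearly all the way down.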

Xtensa Configurable Processor

Xtensa (Tensilica, Inc) for embedded CPU
– Configurability allows designer to keep “minimal” hardware overhead
– ISA (compatible with 32 bit RISC) can be extended for software optimizations
– Fully synthesizable
– Complete HW/SW suite
VCC modeling for exploration
– Requires mapping of “fuzzy” instructions of VCC processor model to real ISA
– Requires multiple models depending on memory configuration
– ISS simulation to validate accuracy of model

(Vandana Prabhu)

Microprocessor Optimizations for Network Protocols

Implements Transport layer on configurable processor
– TDMA control and channel usage management
Upper layer of protocol is dominated by processor control flow
– Memory routines, Branches, Procedure calls
Artifacts of code generation tools are significant
– Excessively modular code introduces procedure calls
– Uses dynamic memory allocation
Configurable processor
– Increased size of register file
– Customized instructions help datapath but not control

(Kevin Camera & Tim Tuan)

[Figure: Total Execution Time breakdown; Memory Routines: calloc, memcpy, other]

Efficient implementation at code generation and architecture levels!

Implementation Methodology for Reconfigurable Wireless Protocol

Changing granularity within protocol stack requires estimation tool for energy-efficient implementation
Software exploration on processors
– Exploring Xtensa’s TIE
Hardware exploration on FPGA platforms
– Optimal FPGA architecture
– Alternately “Reconfigurable FSM” analogous to Pleiades approach for datapath kernels

(Suetfei Li & Tim Tuan)

TCI - A First Generation PicoNode

[Figure: Tensilica Embedded Proc. (Programmable Protocol Stack), Memory Sub-system, and Configurable Logic (Physical Layer, Baseband Processing), connected by a Sonics Backplane]

The System-on-a-Chip Nightmare

Bridge

DMA CPU DSP

MemCtrl.

MPEG

CI O O

System Bus

PeripheralBus

Control Wires

Custom Interfaces

The “Board-on-a-Chip”Approach

Courtesy of Sonics, Inc

The Communications Perspective

[Figure: DSP, MPEG, CPU, DMA, C, MEM, I, and O cores attached to a shared backplane through Open Core Protocol™ interfaces and Silicon Backplane Agents™]

Example: “The Silicon Backplane” (Sonics, Inc)

Communications-based Design
Guaranteed Bandwidth
Arbitration

(Mike Sheets)

Summary

Design for low-energy impacts all stages of the design process: the earlier the better
Energy reduction requires clear communication and computation abstractions
Efficient and abstract modeling of energy at behavior and architecture level is crucial
Efficient hardware implementation of protocol stack
Beat the SoC monster!

Targeting Tiled Architectures in Design Exploration

Lilian Bossuet¹, Wayne Burleson², Guy Gogniat¹, Vikas Anand², Andrew Laffely², Jean-Luc Philippe¹

¹ LESTER Lab
Université de Bretagne Sud
Lorient, France
{lilian.bossuet, guy.gogniat, jean-luc.philippe}@univ-ubs.fr

² Department of Electrical and Computer Engineering
University of Massachusetts, Amherst, USA
{burleson, vanand, alaffely}@ecs.umass.edu

Design Space Exploration: Motivations

Design solutions for new telecommunication and multimedia applications targeting embedded systems
Optimization and reduction of SoC power consumption
Increase computing performance
– Increase parallelism
– Increase speed
Be flexible
– Take into account run-time reconfiguration
– Targeting multi-granularity (heterogeneous) architectures

Design Space Exploration: Flow

[Figure: design space exploration flow. Inputs: Algorithmic Specification and functional description of design space (archi1 to archi12). First run: generic synthesis or estimations. Second run: dedicated tool. Labels: Architectural Specification, RTL Specification, Physical Model of Archi2 and Archi10, Accurate Model Archi2. Axes: abstraction level (high to low) and performance accuracy (low to high)]

Progressive design space reduction:
– iterative exploration
– refinement of architecture model
– increase of performance estimation accuracy
One level of abstraction for one level of estimation accuracy

Reconfigurable Architectures

Bridging the flexibility gap between ASICs and microprocessor [Hartenstein DATE 2001]
Energy efficient and solution to low power programmable DSP [Rabaey ICASSP 1997, FPL 2000]
Run Time Reconfigurable [Compton & Hauck 1999]
=> A key ingredient for future silicon platforms [Schaumont et al. DAC 2001]

Design Space of Reconfigurable Architecture

RECONFIGURABLE ARCHITECTURES (R-SOC)

FINE GRAIN (FPGA)
– Island Topology: Xilinx Virtex, Xilinx Spartan, Atmel AT40K, Lattice ispXPGA
– Hierarchical Topology: Altera Stratix, Altera Apex, Altera Cyclone

MULTI GRANULARITY (Heterogeneous)
– Processor + Coprocessor, Coarse Grain Coprocessor: Chameleon, REMARC, Morphosys
– Processor + Coprocessor, Fine Grain Coprocessor: Pleiades, Garp, FIPSOC, Triscend E5, Triscend A7, Xilinx Virtex-II Pro, Altera Excalibur, Atmel FPSIC
– Tile-Based Architecture: aSoC, E-FPFA

COARSE GRAIN (Systolic)
– Linear Topology: Systolic Ring, RaPiD, PipeRench
– Hierarchical Topology: DART, FPFA
– Mesh Topology: RAW, CHESS, MATRIX, KressArray, Systolix Pulsedsp

A Target Architecture: aSoC

Adaptive System-on-a-Chip (aSoC)

Tiled architecture containing many heterogeneous processing cores (RISC, DSP, FPGA, Motion Estimation, Viterbi Decoder)

Mesh communication network controlled with a statically determined communication schedule

A scalable architecture.
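As an illustration of what a statically determined communication schedule on a mesh could look like, here is a minimal sketch. It assumes dimension-order (X then Y) routing, one hop per cycle, and a greedy earliest-start policy; the slides do not name the routing or scheduling algorithm, and `xy_route` and `schedule_streams` are hypothetical names.

```python
from itertools import count

def xy_route(src, dst):
    """Return the directed links (tile, tile) along an X-then-Y path."""
    (x, y), (dx, dy) = src, dst
    links = []
    while x != dx:
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links

def schedule_streams(streams):
    """Greedy static scheduling (assumed policy): each stream gets the
    earliest start cycle at which every hop finds its link free."""
    busy = set()       # (link, cycle) pairs already reserved
    schedule = {}
    for name, src, dst in streams:
        route = xy_route(src, dst)
        for start in count():
            slots = [(link, start + i) for i, link in enumerate(route)]
            if not any(slot in busy for slot in slots):
                busy.update(slots)
                schedule[name] = (start, route)
                break
    return schedule

streams = [
    ("a", (0, 0), (2, 0)),
    ("b", (0, 0), (1, 1)),   # shares its first link with stream "a"
    ("c", (2, 1), (0, 1)),
]
sched = schedule_streams(streams)
for name, (start, route) in sched.items():
    print(name, "starts at cycle", start, "and takes", len(route), "hops")
```

Because every link reservation is fixed before run time, no arbitration is needed while the application runs; stream "b" is simply delayed by one cycle so that it never meets stream "a" on the shared link.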

FPGA in System-on-a-Chip

Fast Time-To-Market

Post-Fabrication Customization
– Broaden application domain
– Run-time Reconfiguration
– Bug Fixes
– Upgrades

10x-100x Worse:
– Area
– Performance
– Power

Mark L. Chang

aSoC Architecture

[Figure: array of tiles with heterogeneous cores (FPGA, uProc, MUL) joined by point-to-point connections. Each tile pairs a core with a communication interface (ctrl; North, South, East, West ports).]

aSoC Communications Interface

[Figure: block diagram of the communication interface. Labels: Core, Coreports, Decoder, Local Frequency & Voltage, Local Config., Instruction Memory, PC, Controller, and a crossbar with North, South, East, West inputs and outputs (example setting: North to South & East).]

Interface Crossbar
– inter-tile transfer
– tile to core transfer

Interconnect/Instruction Memory
– contains instructions to configure the interface crossbar (cycle-by-cycle)

Interface Controller
– selects the instruction

Coreports
– data interface and storage for transfers with the tile IP core

Dynamic Voltage and Frequency Selection
– Dynamic Power Management
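The cycle-by-cycle configuration of the interface crossbar can be pictured with a small simulation. This is only a sketch under assumptions: each instruction is modelled as a mapping from output port to input port, the program counter wraps around so the schedule repeats, and `run_interface` is a hypothetical name, not part of the aSoC tools.

```python
def run_interface(instruction_memory, inputs_per_cycle):
    """Step a program counter through the instruction memory; each
    instruction configures the crossbar (output <- input) for one cycle."""
    pc = 0
    trace = []
    for inputs in inputs_per_cycle:
        config = instruction_memory[pc]
        # Drive every configured output from its selected input port.
        outputs = {out: inputs.get(src) for out, src in config.items()}
        trace.append(outputs)
        pc = (pc + 1) % len(instruction_memory)   # assumed: schedule repeats
    return trace

program = [
    {"south": "north", "east": "north"},  # North to South & East
    {"core": "west"},                     # inter-tile data delivered to the core
    {"east": "core"},                     # core data sent to the east neighbour
]
inputs = [
    {"north": 7},
    {"west": 3},
    {"core": 9},
    {"north": 5},   # fourth cycle reuses instruction 0
]
trace = run_interface(program, inputs)
for cycle, outputs in enumerate(trace):
    print("cycle", cycle, outputs)
```

The controller here does nothing but select the next instruction, which is what keeps the interface small and its timing predictable.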

[Figure: tile detail showing the core, ctrl, and the interface crossbar with North, South, East, West ports.]

aSoC Exploration ...

Type of tiles

Number of each type of tile

Placement of the tiles

Internal architecture of reconfigurable tiles (FPGA core)

Communication scheduling
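To get a feel for how quickly the first three dimensions above (tile types, number of each type, placement) grow, the following sketch counts design points for a small mesh. The three tile types are a hypothetical subset, and internal tile architecture and communication scheduling are left out of the count.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import factorial

TILE_TYPES = ("RISC", "DSP", "FPGA")   # hypothetical subset of core types

def tile_mixes(num_tiles, tile_types=TILE_TYPES):
    """Every way to choose how many tiles of each type the chip contains."""
    return [Counter(mix)
            for mix in combinations_with_replacement(tile_types, num_tiles)]

def placements(mix):
    """Distinct ways to place a given mix on the mesh positions
    (multinomial coefficient: n! / (n1! * n2! * ...))."""
    total = factorial(sum(mix.values()))
    for n in mix.values():
        total //= factorial(n)
    return total

mixes = tile_mixes(4)                          # a 2x2 mesh
total_points = sum(placements(m) for m in mixes)
print(len(mixes), "tile mixes,", total_points, "placed configurations")
```

Even this toy case gives 15 mixes and 81 placed configurations, and the count grows exponentially with the number of tiles, which is why the exploration needs fast estimates rather than full synthesis of each candidate.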

Design Space Exploration: Goals

Goal: Rapid exploration of various architectural solutions to be implemented on heterogeneous reconfigurable architectures (aSoC) in order to select the most efficient architecture for one or several applications

Takes place before architectural synthesis (algorithmic specification in a high-level abstraction language)

Estimations are based on a functional architecture model (generic, technology-independent)

Iterative exploration flow to progressively refine the architecture definition, from a coarse model to a dedicated model

Design Exploration Flow Targeting Tiled Architecture

[Figure: design exploration flow. Boxes, in order of appearance: C SPECIFICATION; C to HCDFG parser; HCDFG graphs of the application (Application App1: Function F1, Function F2); model of the aSOC architectures (aSOC A1: Tile T1, Tile T2); Application Analysis; Tile Exploration; results of the tile exploration step; aSOC Builder; Static Communication Scheduling; final model of the aSOC architecture; aSOC Analysis. Other labels in the figure: THF Model, HF Model, F1, F2, T1, T2.]

Results of the tile exploration step:

Function  Tile  Performance
F1        T1    T11, C11, Occ11
F1        T2    T21, C21, Occ21
F2        T1    T12, C12, Occ12
F2        T2    T22, C22, Occ22


Application Analysis

Use of algorithmic metrics and dedicated scheduling algorithms to highlight the target architectures

Algorithmic metrics:
– Characterize the application orientation
  • Processing
  • Memory
  • Control
– Characterize the application potential parallelism
  • Processing
  • Memory
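The slide does not give formulas for these metrics, so the sketch below uses assumed, simplified definitions: orientation is the share of operations of each kind in a function's graph, and potential parallelism is the operation count divided by the critical path length. Both function names are hypothetical.

```python
def orientation_metrics(op_counts):
    """Share of processing, memory and control operations (assumed definition)."""
    total = sum(op_counts.values())
    if total == 0:
        raise ValueError("function contains no operations")
    return {kind: n / total for kind, n in op_counts.items()}

def potential_parallelism(num_ops, critical_path_length):
    """Average number of operations available per step (assumed definition)."""
    return num_ops / critical_path_length

# Hypothetical operation counts extracted from one function's graph.
counts = {"processing": 60, "memory": 30, "control": 10}
metrics = orientation_metrics(counts)
print(metrics)
print("processing parallelism:", potential_parallelism(60, 15))
```

Under these definitions a function with a high processing share and high processing parallelism would point toward a tile with many arithmetic units, while a control-dominated function would point toward a RISC tile.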


Tile Exploration: with 3 steps

Projection:
– Link between necessary resources (application) and available resources (tile)
– Use of an allocation algorithm based on communication cost reduction

Composition:
– Takes the function scheduling into account to estimate additional resources (register, mux, …)

Estimation:
– performance interval computation (lower and upper bounds)
– speed/resource utilization/power characterization
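A rough sketch of the projection and estimation steps follows, under stated assumptions: every operation takes one cycle, the lower bound is the larger of the critical path and the tightest resource limit, and the upper bound is fully sequential execution. The composition step (registers, multiplexers) is not modelled, and the function names and numbers are hypothetical.

```python
from math import ceil

def project(required_ops, tile_units):
    """Projection: link each operation type the function needs to the
    units the tile offers; return None if a type is unsupported."""
    if any(tile_units.get(op, 0) == 0 for op in required_ops):
        return None
    return {op: tile_units[op] for op in required_ops}

def performance_interval(required_ops, tile_units, critical_path):
    """Estimation: (lower, upper) bounds on execution cycles,
    assuming unit latency for every operation."""
    alloc = project(required_ops, tile_units)
    if alloc is None:
        return None
    resource_bound = max(
        (ceil(n / alloc[op]) for op, n in required_ops.items()), default=0
    )
    lower = max(critical_path, resource_bound)
    upper = sum(required_ops.values())   # everything serialized
    return lower, upper

f1 = {"mul": 8, "add": 12}               # hypothetical operation counts
dsp_tile = {"mul": 2, "add": 4}          # hypothetical unit counts
add_only_tile = {"add": 4}

print(performance_interval(f1, dsp_tile, critical_path=5))       # (5, 20)
print(performance_interval(f1, add_only_tile, critical_path=5))  # None
```

An interval rather than a single number reflects how little is known this early in the flow; later iterations can narrow it as the architecture model becomes more dedicated.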


aSoC Builder

Environment: AppMapper

Partition and assignment
– based on Run Time Estimation

Compilation
– Communication Scheduling
– Core compilation

Generate tile configuration
– Communication instructions
– Bitstreams (for reconfigurable tile)
– RISC instructions
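How AppMapper partitions and assigns functions is not described on the slide. As a stand-in, the sketch below uses a simple greedy heuristic driven by run-time estimates: the most expensive functions are placed first, each on the tile where it would finish earliest. The table values and the name `assign_functions` are hypothetical.

```python
def assign_functions(est_time):
    """est_time[f][t] = estimated run time of function f on tile t.
    Returns (assignment, load) using a greedy heuristic (an assumption,
    not AppMapper's actual algorithm)."""
    load = {}
    assignment = {}
    # Place the functions with the largest best-case run time first.
    order = sorted(est_time, key=lambda f: min(est_time[f].values()),
                   reverse=True)
    for f in order:
        # Pick the tile on which this function would finish earliest.
        best = min(est_time[f], key=lambda t: load.get(t, 0) + est_time[f][t])
        assignment[f] = best
        load[best] = load.get(best, 0) + est_time[f][best]
    return assignment, load

est_time = {
    "F1": {"T1": 10, "T2": 4},
    "F2": {"T1": 6, "T2": 5},
    "F3": {"T1": 3, "T2": 8},
}
assignment, load = assign_functions(est_time)
print(assignment)
print("busiest tile time:", max(load.values()))
```

The resulting assignment is the input that communication scheduling and core compilation would then work from.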


aSoC Analysis

Use the results of previous steps
– Function scheduling
– Tile allocation
– Communication scheduling

Complete estimation of the proposed solution
– Global execution time
– Global power consumption
– Total area
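The global figures can be thought of as an aggregation over the per-tile results. The sketch below assumes that tiles run in parallel, that communication time is not overlapped with computation, and that power and area simply add up; the slide does not state the actual model, and all names and values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TileEstimate:
    busy_time: float   # total run time of the functions mapped to the tile
    power: float       # estimated power of the tile
    area: float        # estimated area of the tile

def analyse(tiles, comm_time):
    """Combine per-tile estimates into global figures (assumed model)."""
    return {
        "execution_time": max(t.busy_time for t in tiles.values()) + comm_time,
        "power": sum(t.power for t in tiles.values()),
        "area": sum(t.area for t in tiles.values()),
    }

tiles = {
    "T1": TileEstimate(busy_time=3, power=0.5, area=1.0),
    "T2": TileEstimate(busy_time=9, power=1.5, area=2.0),
}
result = analyse(tiles, comm_time=2)
print(result)
```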

Power-Aware System on a Chip

A. Laffely, J. Liang, R. Tessier, C. A. Moritz, W. Burleson

University of Massachusetts Amherst

Boston Area Architecture Conference, 30 Jan 2003

{alaffely, jliang, tessier, moritz, burleson}@ecs.umass.edu

This material is based upon work supported by the National Science Foundation under Grant No. 9988238. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Adaptive System-on-a-Chip

Tiled architecture with mesh interconnect
– Point-to-point communication pipeline

Allows for heterogeneous cores
– Differing sizes, clock rates, voltages

Low-overhead core interface
– On-chip bus substitute for streaming applications

Based on static scheduling
– Fast and predictable

[Figure: aSoC tile array with Proc, FPGA and Multiplier cores. Each tile contains a core and a communication interface (ctrl; North, South, East, West ports).]

aSoC Implementation

[Figure: chip layout with dimension labels 2500 and 3000 (units not given in the extracted text).]

0.18 µm technology, full custom

Some Results

9 and 16 core systems tested for IIR, MPEG encoding and image processing applications
– ~2x the performance compared to the CoreConnect bus (burst and hierarchical)
– ~1.5x the performance of an oblivious routing network [1] (dynamic routing)
– Max speedup is 5x

1. W. Dally and H. Aoki, "Deadlock-free Adaptive Routing in Multicomputer Networks Using Virtual Channels", IEEE Transactions on Parallel and Distributed Systems, April 1993
