Low Power System Level Design Methodologies Jun-Dong Cho SungKyunKwan Univ. Dept. of ECE, Vada Lab. .

Low Power System Level Design Methodologies

Jun-Dong Cho, SungKyunKwan Univ.

Dept. of ECE, Vada Lab. http://vada.skku.ac.kr

VLSI Algorithmic Design Automation Lab.

2

Contents

• Introduction to System Level Design
• Hardware and Software Co-design
• Re-configurable Processors
• Other Low Power System Level Designs


3

Introduction to SOC

• SOC will bridge the gap between software and its implementation in novel, energy-efficient silicon architectures.

• Chips are assembled at the level of IP blocks and IP interfaces rather than at the gate level (design reuse).

• SOC specs come from ICT system engineers rather than from RTL descriptions.


4

Four main applications
• Set-top box: mobile multimedia system, base station for the home local-area network
• Digital PCTV: concurrent use of TV, 3D graphics, and Internet services
• Set-top box LAN service: wireless home networks, multi-user wireless LAN
• Navigation system: steer and control traffic and/or goods transportation


5

Types of System-on-a-Chip Designs


6

Silicon in 2010
Die area: 2.5 x 2.5 cm    Voltage: 0.6 V    Technology: 0.07 µm

Memory           Density (Gbits/cm2)   Access Time (ns)
DRAM             8.5                   10
DRAM (Logic)     2.5                   10
SRAM (Cache)     0.3                   1.5

Logic            Density (Mgates/cm2)   Max. Ave. Power (W/cm2)   Clock Rate (GHz)
Custom           25                     54                        3
Std. Cell        10                     27                        1.5
Gate Array       5                      18                        1
Single-Mask GA   2.5                    12.5                      0.7
FPGA             0.4                    4.5                       0.25


7

Why Lower Power
• Portable systems: long battery life, light weight, small form factor
• IC priority list: power dissipation, cost, performance
• Technology direction: reduced voltage/power designs based on mature high-performance IC technology, with high integration to improve size, cost, power, and speed


8

[Figure: Microprocessor Power Dissipation. Power (W, 0 to 50) versus year (1980 to 2000) for i286, i386 DX 16, i486 DX 25, i486 DX 50, i486 DX2 66, i486 DX4 100, P5 66, P6 166, P II 300, P III 500, P-PC601 50, P-PC604 133, P-PC750 400, Alpha 21064 200, Alpha 21164, and Alpha 21264.]


9

Three Factors Affecting Energy
– Hardware simplification: redundant h/w extraction
– All-in-one approach (SOC): I/O pin reduction
– Voltage-reducible hardware:
  - 2-D pipelining (systolic arrays)
  - SIMD (Single Instruction stream, Multiple Data stream) parallel processing: useful for data with a parallel structure
  - VLIW (Very Long Instruction Word) approach: flexible
  - MIMD (Multiple Instruction streams, Multiple Data streams)


10

New Computing Platforms

• SOC power efficiency: more than 10 GOPS/W
• Higher on-chip system integration: COTS: 100 W, SOAC: 10 W (savings come from inter-chip capacitive loads and I/O buffers)
• Speed and performance: shorter interconnections, fewer drivers, faster devices, more efficient processing architectures
• Mixed-signal systems
• Reuse of IP blocks
• Multiprocessor, configurable computing
• Domain-specific, combined memory-logic

$$P = kCFV^{2}$$


11

Physical gap
• Timing closure problem: layout-driven logic and RT-level synthesis
• Energy efficiency requires locality of computation and storage: a good match for stream-based data processing of speech, images, and multimedia-system packets.
• Next-generation SOC designers must bridge the architectural gap between system specification and energy-efficient IP-based architectures, while CAE vendors and IP providers will bridge the physical gap.


12

Levels for Low Power Design

Level            Techniques
System           Hardware-software partitioning, Power down
Algorithm        Complexity, Concurrency, Locality, Regularity, Data representation
Architecture     Parallelism, Pipelining, Signal correlations, Instruction set selection, Data rep.
Circuit/Logic    Sizing, Logic Style, Logic Design
Technology       Threshold Reduction, Scaling, Advanced packaging, SOI

Possible Power Savings at Different Design Levels

Level of Abstraction    Expected Saving
Algorithm               10 - 100 times
Architecture            10 - 90%
Logic Level             20 - 40%
Layout Level            10 - 30%
Device Level            10 - 30%


13

Low Power Design Flow I

[Flowchart. Blocks: System-Level Specification; Function Partitioning and HW/SW Allocation; System-Level Power Analysis; Behavioral Description; Software Functions; Processor Selection; Power-driven Behavioral Transformation; Behavioral-Level Power Analysis; Power-Conscious Behavioral Description; High-Level Synthesis and Optimization; RT-Level Power Analysis; Software Optimization; Software-Level Power Analysis; output to RT-Level Design.]


14

Low Power Design Flow II

[Flowchart. Blocks: RT-level Description; RTL mapping; RTL Library; RTL Macrocells (Data-path, Controller, Processor, Control and Steering Logic, Memory); Logic Synthesis and Optimization; Standard cell Library; Gate-Level Power Analysis; Gate-level Description; High-Level Synthesis and Optimization; Switch-Level Power Analysis; Switch-level Description.]


15

Reducing Waste
• Locality of reference
• Demand-driven / data-driven computation
• Application-specific processing
• Preservation of data correlations
• Distributed processing


16

Eliminating Redundant Computations


17

Power-hungry Applications
• Signal compression: HDTV standard, ADPCM, vector quantization, H.263, 2-D motion estimation, MPEG-2 storage management
• Digital communications: shaping filters, equalizers, Viterbi decoders, Reed-Solomon decoders


18

IBM’s PowerPC Lower Power Architecture
• Optimum supply voltage through hardware parallelism, pipelining, and parallel instruction execution
  - The 603e executes five instructions in parallel (IU, FPU, BPU, LSU, SRU)
  - The FPU is pipelined, so a multiply-add instruction can be issued every clock cycle
  - Low-power 3.3-volt design
• Use small complex instructions with a shorter instruction length
  - IBM’s PowerPC 603e is RISC
• Superscalar: CPI < 1
  - The 603e issues as many as three instructions per cycle
• Low power management
  - The 603e provides four software-controllable power-saving modes
• IBM’s Blue Logic ASIC: new design reduces power by a factor of 10


19

Power-Down Techniques

◆ Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work


20

Voltage vs Delay

• Use variable voltage scaling or scheduling for real-time processing.
• Use architecture optimization to compensate for slower operation, e.g., parallel processing and pipelining to increase concurrency and shorten the critical path.


21

Low Voltage Main Memories


22

Why Copper Processor?
• Motivation: aluminum resists the flow of electricity as wires are made thinner and narrower.
• Performance: 40% speed-up
• Cost: 30% less expensive
• Power: less power from batteries
• Chip size: 60% smaller than an aluminum chip


23

Silicon-on-Insulator
How does SOI reduce capacitance?
• Junction capacitance is eliminated: an insulating layer (similar to glass) is placed between the impurities and the silicon substrate.
• Result: high performance, low power, low soft-error rate


24

Design Challenges
• Current systems are complex and heterogeneous; they contain many different types of components.
• Half of the chip can be filled with 200 low-power, RISC-like processors (ASIP) interconnected by field-programmable buses, embedded in 20 Mbytes of distributed DRAM and flash memory.
• Computational power will not result from multi-GHz clocking but from parallelism, with clocks below 200 MHz. This will greatly simplify the design for correct timing, testability, and signal integrity.


25

Application-Specific Instruction Processor
• Processor architecture tailored not just for an application domain (e.g., DSP, microcontrollers), but for specific sets of applications (e.g., audio, engine control)
• ASIP characteristics
  - Greater design cost (processor + compiler)
  + Higher performance and lower power than commercial cores, more flexibility than ASIC


26

ASIP Design
• Given a set of applications, determine the micro-architecture of the ASIP (i.e., configuration of functional units in datapaths, instruction set).
• To accurately evaluate the performance of the processor on a given application, the application program must be compiled onto the processor datapath and the object code simulated.
• However, the micro-architecture of the processor is a design parameter!


27

ASIP Design Flow


28

Compiler Optimizations
• Machine-independent optimizations
  - Parallelizing transformations
  - Common subexpression elimination
  - Constant propagation
  - Strength reduction
  - Loop-invariant code motion
• Machine-dependent optimizations
  - Loop unrolling and software pipelining
  - Static allocation (non-recursive procedure calls)
  - Storage layout (arrays, scalars)
  - Optimization of mode-setting instructions
  - Instruction selection, scheduling, and register allocation


29

Loop unrolling

The technique of loop unrolling replicates the body of a loop some number of times (unrolling factor u) and then iterates by step u instead of step 1. This transformation reduces the loop overhead, increases the instruction parallelism, and improves register, data cache, or TLB locality.

Loop overhead is cut in half because two iterations are performed in each iteration. If array elements are assigned to registers, register locality is improved because A(i) and A(i +1) are used twice in the loop body. Instruction parallelism is increased because the second assignment can be performed while the results of the first are being stored and the loop variables are being updated.

Original loop:

    for i = 2 to N - 1
        A(i) = A(i) + A(i - 1) * A(i + 1)

Unrolled by a factor of 2:

    for i = 2 to N - 2 step 2
        A(i)     = A(i)     + A(i - 1) * A(i + 1)
        A(i + 1) = A(i + 1) + A(i)     * A(i + 2)
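The unrolling above can be checked with a small sketch. This is an illustrative Python version, not code from the slides: the function names `rolled` and `unrolled_by_2` are hypothetical, indices are 0-based, and a clean-up step is added for the case where the number of interior iterations is odd (the slide's loop bound implicitly assumes it is even).

```python
def rolled(a):
    """Original loop: A(i) = A(i) + A(i-1) * A(i+1) over the interior elements."""
    a = list(a)
    n = len(a)
    for i in range(1, n - 1):
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a


def unrolled_by_2(a):
    """Same computation with the body replicated twice and a step of 2."""
    a = list(a)
    n = len(a)
    i = 1
    # Each pass performs two original iterations, halving the loop overhead.
    while i + 1 < n - 1:
        a[i] = a[i] + a[i - 1] * a[i + 1]
        a[i + 1] = a[i + 1] + a[i] * a[i + 2]
        i += 2
    # Clean-up iteration when the interior trip count is odd.
    if i < n - 1:
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7]
    print(rolled(data) == unrolled_by_2(data))
```

Both versions perform the same sequence of updates, so the results match element by element; only the number of loop-control operations changes.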


30

Loop Unrolling (IIR filter example)

Loop unrolling localizes the data to reduce the activity at the inputs of the functional units; here, two output samples are computed in parallel from two input samples. Starting from the first-order IIR filter $Y_n = X_n + A\,Y_{n-1}$, unrolling once gives

$$Y_{n-1} = X_{n-1} + A\,Y_{n-2}$$
$$Y_n = X_n + A\,Y_{n-1} = X_n + A\,(X_{n-1} + A\,Y_{n-2})$$

Neither the capacitance switched nor the voltage is altered. However, loop unrolling enables several other transformations (distributivity, constant propagation, and pipelining). After distributivity and constant propagation,

$$Y_{n-1} = X_{n-1} + A\,Y_{n-2}$$
$$Y_n = X_n + A\,X_{n-1} + A^{2}\,Y_{n-2}$$

The transformation yields a critical path of 3, so the voltage can be dropped.
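As a quick sanity check of the unrolled recurrence, the following Python sketch compares the direct first-order IIR filter with the two-samples-per-iteration form after distributivity and constant propagation. The function names and the zero initial state are assumptions made for illustration, and the input length is assumed to be even.

```python
def iir_direct(x, a, y_init=0):
    """Direct form: Y[n] = X[n] + A * Y[n-1]."""
    y, prev = [], y_init
    for xn in x:
        prev = xn + a * prev
        y.append(prev)
    return y


def iir_unrolled(x, a, y_init=0):
    """Two outputs per iteration; both depend only on Y[n-2]."""
    a_squared = a * a  # constant propagation: A^2 is computed once
    y, y_nm2 = [], y_init
    for n in range(1, len(x), 2):
        y_nm1 = x[n - 1] + a * y_nm2
        y_n = x[n] + a * x[n - 1] + a_squared * y_nm2
        y.extend([y_nm1, y_n])
        y_nm2 = y_n
    return y


if __name__ == "__main__":
    samples = [1, 2, 3, 4]
    print(iir_direct(samples, 3), iir_unrolled(samples, 3))
```

Because `y_n` no longer waits for `y_nm1`, the two outputs can be evaluated in parallel, which is what shortens the critical path per pair of samples.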


31

Loop Unrolling for Low Power


32



33



34

Implementing Digital Systems


35

Configurability
• One-M gate reconfigurable logic, one-M gate hardwired logic
• Reduces design risk, for which NRE costs will become dominant
• 50 GIPS for programmable components or 500 GIPS for dedicated hardware
• 1 V supply, with power in the watt range


36

Bridging the architectural gap
• Product reliability: design at a level far above the RT level, with reuse factors in excess of 100
• Trade-off: 100 MOPS/watt (microprocessor) versus 100 GOPS/watt (hardwired)
• Reconfigurable computing with a large number of computing nodes and a very restricted instruction set (Pleiades)


37

Cross-Disciplinary nature
• Software for low power: loop transformation leads to much higher temporal and spatial locality of data.
• Code size becomes an important objective.
• Software will eventually become a part of the chip.
• Behavior-platform-compiler codesign: codesigned with C++ or Java, describing both the h/w and s/w implementation.
• Multidisciplinary system thinking is required for future designs (e.g., Eindhoven Embedded Systems Institute, http://www.eesi.tue.nl/english)


38

Low Power DSP
• Most of the execution time is spent in DO-LOOPs:
  - VSELP vocoder: 83.4%
  - 2D 8x8 DCT: 98.3%
  - LPC computation: 98.0%
• Power minimization of DO-LOOPs ==> power minimization of the DSP

VSELP: Vector Sum Excited Linear Prediction
LPC: Linear Prediction Coding


39

VLSI Signal Processing Design Methodology
• Pipelining, parallel processing, retiming, folding, unfolding, look-ahead, relaxed look-ahead, and approximate filtering
• Bit-serial, bit-parallel, and digit-serial architectures; carry-save architecture
• Redundant and residue systems
• Viterbi decoder, motion compensation, 2D filtering, and data transmission systems


40

Common Fabric for IP Blocks
• Soft IP blocks are portable, but not as predictable as hard IP.
• Hard IP blocks are very predictable, since a specific physical implementation can be characterized, but they are hard to port because they are often tied to a specific process.
• A common fabric is required for both portability and predictability.
• Wide availability: Cell Based Array, a metal-programmable architecture that provides the performance of a standard cell and is optimized for synthesis.


41

H/W and S/W Co-design


42

Mixing H/W and S/W
• Argument: mixed hardware/software systems represent the best of both worlds: high performance, flexibility, design reuse, etc.
• Counterpoint: from a design standpoint, it is the worst of both worlds.
  - Simulation: problems of verification and test become harder
  - Interface: too many tools, too many interactions, too much heterogeneity
  - Hardware/software partitioning is “AI-complete”!


43

Partitioning
• Performance requirements
  - Some functions are easier to implement in hardware: blocks that are used repeatedly, and blocks with a parallel structure
• Modifiability
  - Blocks implemented in software are easy to modify
• Implementation cost
  - Blocks implemented in hardware can be shared
• Scheduling
  - The blocks partitioned into HW and SW are scheduled so that they meet the given constraints
  - SW operations must be scheduled sequentially
  - SW and HW can be scheduled concurrently as long as there are no data or control dependencies


44

Low power partitioning approach

• Different HW resources are invoked according to the instruction executed at a specific point in time.
  - During the execution of an add operation, the ALU and registers are used, but the multiplier is idle.
• Non-active resources still consume energy, since the corresponding circuits continue to switch.
• Approach:
  - Calculate the wasted energy
  - Add application-specific cores and run only part of the system at a time
  - Whenever one core is running, all the other cores are shut down
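The idea of calculating the wasted energy can be sketched as follows. This is a minimal illustration under stated assumptions: the instruction-to-resource table `USES`, the per-cycle energy figures in `ENERGY_PER_CYCLE`, and the `idle_fraction` parameter (the share of a resource's active energy it still burns while idle) are all hypothetical values, not data from the slides.

```python
# Hypothetical mapping from instruction type to the resources it activates.
USES = {
    "add": {"alu", "regfile"},
    "mul": {"multiplier", "regfile"},
    "load": {"lsu", "regfile"},
}

# Hypothetical energy per active cycle for each resource (arbitrary units).
ENERGY_PER_CYCLE = {"alu": 1.0, "multiplier": 4.0, "regfile": 0.5, "lsu": 2.0}


def utilization_and_waste(trace, idle_fraction=0.5):
    """Return per-resource utilization rates and the energy wasted by idle resources.

    Assumes one instruction per cycle and that an idle resource still
    consumes idle_fraction of its active energy because its inputs keep switching.
    """
    cycles = len(trace)
    active = {resource: 0 for resource in ENERGY_PER_CYCLE}
    for op in trace:
        for resource in USES[op]:
            active[resource] += 1
    utilization = {r: active[r] / cycles for r in active}
    wasted = sum(
        (cycles - active[r]) * ENERGY_PER_CYCLE[r] * idle_fraction for r in active
    )
    return utilization, wasted


if __name__ == "__main__":
    rates, waste = utilization_and_waste(["add", "add", "mul", "add"])
    print(rates, waste)
```

A low utilization rate for an expensive resource (the multiplier in this trace) marks it as a candidate for a separate core that can be shut down when not in use.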


45

Effective Resource Utilization

[Figure: data-flow graph with adders (+), delay elements (D), and numbered operations, shown before and after retiming, together with the cycle-by-cycle assignment of operations to multipliers and adders for each version.]

Retiming can reduce interconnect capacitance.


46

Partitioning Process

- Derive a graph G (operations and connections)
- Decompose G into a set of clusters (cluster: set of operations)
- Calculate bus-traffic energy
- Pre-select clusters with constraints
- Set the number of resources
- List scheduling
- Test the utilization rate (ASIC or µP)
  - the utilization rate of the µP is provided by a SW estimation tool


47

Design Flow

[Flowchart. Blocks: Application; Divide application into clusters; List schedule; Compute utilization rate (ASIC); Select cluster; Compute utilization rate (µP); Core energy estimation; HW synthesis; Evaluate.]

- Max. 94% energy saving, and in most cases even reduced execution time
- 16k cell overhead


48

Interface
• Why an interface block is needed
  - It carries data between the hardware and software blocks
  - Only an efficient interface block can reduce the overhead between the HW and SW blocks
• Interface methods
  - Shared memory
  - FIFO
  - Handshaking protocol


49

Logical Bus Architecture
• System bus signals
  - address, data, and control signals
  - the address space consists of the memory space and the I/O space
  - memory space: memory of the SW component
  - I/O space: ports within SW and registers in other HW
• Port signals
  - specialized signals capable of directly interfacing between SW and HW components
• Interrupt signals
  - raised when SW and HW components have completed an operation, or when an error condition is detected


50

Co-Simulation
• Why co-simulation is needed
  - Simulating the HW part and the SW part together makes it possible to predict the results of the assembled system
  - System performance can be predicted, so the system can be redesigned to meet the given spec before synthesis
  - It predicts the characteristics of each sub-block for HW/SW partitioning
• Co-simulation tools: Ptolemy, COSSAP, POLIS


51

Hardware/Software Co-Design Flow

[Flowchart. Blocks: Analysis of Constraints & Requirements; System Specification; Hardware & Software Partitioning; Hardware Description; Software Description; Interface Synthesis; Hardware Synthesis & Configuration; Software Generation & Parameterization; Configuration Modules; Hardware Components; HW/SW Interface; Software Modules; HW/SW Integration & Cosimulation; Integrated System; System Evaluation; Design Verification.]


52

Partitioning Example: CDMA Searcher

[Figure: HW/SW partitioning search for the CDMA searcher. Functional blocks: PN-code generator; synchronous accumulation stage (real, imaginary); energy computation stage (real, imaginary); compare/select stage; asynchronous accumulation stage; compare/select stage. Candidate implementations: PN-code generation; synchronous accumulator (SW); synchronous accumulator 1 (HW); synchronous accumulator 2 (HW); energy estimate (SW); energy estimate (HW); asynchronous accumulator (SW); asynchronous accumulator (HW); comparator (SW); comparator with precomputation (HW). Candidates are evaluated by cost (speed, area, power) until the goal is reached.]


53

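Approach

[Figure: searcher data flow. The products RX_I·TX_I, RX_Q·TX_Q, RX_I·TX_Q, and RX_Q·(−TX_I) feed the synchronous accumulation stage, followed by the energy computation stage and the asynchronous accumulation stage; a control signal generator coordinates the blocks.]

Synchronous accumulation stage:

$$O_I = (RX_I \cdot TX_I) + (RX_Q \cdot TX_Q), \qquad O_Q = (RX_I \cdot TX_Q) + (RX_Q \cdot (-TX_I))$$
$$Y_I = \sum O_I, \qquad Y_Q = \sum O_Q$$

Energy computation stage:

$$Z = \max(Y_I^{2},\, Y_Q^{2})$$

Asynchronous accumulation stage:

$$\sum Z$$

Decision flow: select the max value, compare with $\theta_1$, compare with $\theta_2$; the outcomes are Search Done or Search_Slew.

- Software-oriented design
- Dark blocks: hardware
- Interface: control signal generator
- Partitioned in terms of speed cost
- Reasons to move a block from SW to HW: 1. implementation speed, 2. parallel architecture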
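The three stages above can be sketched in software as follows. This is a minimal Python illustration of the equations on the slide, not the project's implementation: the function name `searcher_energy`, the list-based sample format, and the `coherent_len` parameter (number of chips summed per synchronous accumulation) are assumptions, and the threshold comparison is left out.

```python
def searcher_energy(rx_i, rx_q, tx_i, tx_q, coherent_len):
    """Accumulated energy metric for one PN-code hypothesis.

    For each window of coherent_len chips:
      Y_I = sum(RX_I*TX_I + RX_Q*TX_Q), Y_Q = sum(RX_I*TX_Q - RX_Q*TX_I)
      Z   = max(Y_I^2, Y_Q^2)
    The Z values of all complete windows are then summed.
    """
    total = 0
    for start in range(0, len(rx_i) - coherent_len + 1, coherent_len):
        y_i = 0
        y_q = 0
        # Synchronous accumulation stage
        for n in range(start, start + coherent_len):
            y_i += rx_i[n] * tx_i[n] + rx_q[n] * tx_q[n]
            y_q += rx_i[n] * tx_q[n] - rx_q[n] * tx_i[n]
        # Energy computation stage, then asynchronous accumulation
        total += max(y_i * y_i, y_q * y_q)
    return total


if __name__ == "__main__":
    pn_i = [1, -1, 1, 1, -1, -1, 1, -1]
    pn_q = [1, 1, -1, 1, -1, 1, 1, -1]
    # A received signal perfectly aligned with the local PN code.
    print(searcher_energy(pn_i, pn_q, pn_i, pn_q, coherent_len=4))
```

With a perfectly aligned ±1 code, every chip contributes 2 to Y_I and 0 to Y_Q, so each 4-chip window yields Z = 64.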


54

Result

Configuration                     Cycles   Ratio    Area (gates)
Full SW                           266               -
Full HW                           -                 9008
(1) Synchronous accumulator       138      48.1     + 872
(2) Computing energy              265      4.4      + 3096
(1) & (2)                         137      48.5     + 3968
(2) & (3) Comparator              265      4.4      + 3155
(1) & (3)                         138      48.1     + 931


55

Flexibility vs. Energy-Efficiency

• There is a trade-off between efficiency and flexibility: programmable designs incur significant performance and power penalties compared to ASICs.
• For parallel signal-processing algorithms, significant power savings can be achieved by executing the dominant computational kernels of a given class of applications with common features on dedicated, optimized processing elements with minimum energy overhead.
• Programmability requires generalized computation, storage, and communication systems, which can be used to implement different kinds of algorithms.
• Domain-specific processors trade some of the flexibility of a general-purpose programmable device for higher levels of energy efficiency, while maintaining the flexibility to handle a variety of algorithms.


56

Hybrid Architecture Template (Pleiades)
Arthur Abnous and Jan Rabaey

Pleiades does much better on the energy scale than the TI DSPs, because DSPs are general-purpose and their instruction execution involves a great deal of overhead. Pleiades can create dedicated hardware structures tuned to the task at hand and executes operations with a small energy overhead.


57

Application Domains: Ultra-Low-Power Domain-Specific Multimedia Processors

• CELP-based speech coding
  - LPC analysis and synthesis
  - Codebook search
  - Lag computation
• DCT-based video compression and decompression
  - DCT and inverse DCT
  - Motion estimation and compensation
  - Huffman coding and decoding
• Baseband processing for digital radios
  - Demodulation, channel equalization
  - Timing recovery, error correction


58

The Re-configurable Terminal


59

Satellite Processors


60

Elements of Energy- Efficiency


61

Multi-Processor Implementation


62

Switching Activity Reduction

(a) Average activity in a multiplier as a function of the constant value

(b) Parallel and serial implementations of an adder tree


63

Communication Network


64

Distributed Data- Driven Control

Execution of a hardware module is triggered by the arrival of tokens. When there are no tokens to be processed at a given module, no switching activity occurs in that module.


65

Implementation of Handshaking


66

Design Methodology


67

Low Power Circuit Techniques
• Reduced-swing interconnect (communication network, memories, programmable logic modules)
• On-chip dc-dc conversion + multiple supply voltages
• Locally synchronous, globally asynchronous
• Automatic power-down
• Optimized libraries (0.6 µm CMOS + Cadence/Synopsys design flow)


68

VSELP Synthesis Filter Mapped onto Satellite Processors


69

Mappings of VSELP Kernel

The most energy efficient CELP-based speech algorithm - dissipates 36 mW ( Vdd = 1.8V, 0.5 um CMOS) - requires 23.4 MOPS

Proposed VSELP speech coder - 0.6 um CMOS - dissipates under 5 mW


70

IIR Mapping


71

IIR Comparison


72

FFT Mapping


73

FFT Comparison


74

Reconfiguration for Power Saving in Real-Time Motion Estimation, S.R. Park, UMASS


75

Motion Estimation


76

Block Matching Algorithm


77

Configurable H/W Paradigms


78

Programmable Logic Modules


79

Why Hardware for Motion Estimation?
• Most computationally demanding part of video encoding
• Example: CCIR 601 format
  - 720 by 576 pixels
  - 16 by 16 macro block (n = 16)
  - 32 by 32 search area (p = 8)
  - 25 Hz frame rate (f_frame = 25)
  - 9 Giga operations/sec are needed for the full-search block matching algorithm
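The 9 GOPS figure can be reproduced with a short calculation. The sketch below is illustrative and rests on two assumptions not stated on the slide: each candidate position costs three operations per pixel (subtract, absolute value, accumulate), and a displacement range of ±p gives (2p + 1)² candidate positions per macro block. The function name is hypothetical.

```python
def full_search_ops_per_second(width, height, n, p, frame_rate, ops_per_pixel=3):
    """Operation count for full-search block matching.

    blocks per frame      = (width / n) * (height / n)
    candidates per block  = (2p + 1)^2
    operations per match  = n^2 * ops_per_pixel
    """
    blocks_per_frame = (width // n) * (height // n)
    candidates = (2 * p + 1) ** 2
    ops_per_candidate = n * n * ops_per_pixel
    return blocks_per_frame * candidates * ops_per_candidate * frame_rate


if __name__ == "__main__":
    ops = full_search_ops_per_second(720, 576, n=16, p=8, frame_rate=25)
    print(f"{ops / 1e9:.2f} GOPS")
```

Under these assumptions the CCIR 601 parameters give roughly 9 × 10⁹ operations per second, in line with the number quoted above.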


80

Why Reconfiguration in Motion Estimation?

Adjusting the search area at frame-rate according to the changing characteristics of video sequences

Reducing Power Consumption by avoiding unnecessary computation

Motion Vector Distributions


81

Architecture for Motion Estimation
From P. Pirsch et al., VLSI Architectures for Video Compression, Proc. of the IEEE, 1995


82

Re-configurable Architecture for ME


83

Power Estimation in Reconfigurable Architecture


84

Power vs Search area


85

Resource Reuse in FPGAs


86

Motion Estimation


87

Motion Estimation (low power)

[Equations: power of the two datapath variants, P_a and P_b, expressed in terms of the adder power P_add and the absolute-value unit power P_abs.]

Therefore, the power reduction factor is 11%.


88

References

[1] A. Abnous and J. Rabaey, “Ultra-Low-Power Domain-Specific Multimedia Processors”, Proceedings of the IEEE VLSI Signal Processing Workshop, San Francisco, Oct. 1996.

[2] Digital Semiconductor, Digital Semiconductor SA-110 Microprocessor Technical Reference Manual, Digital Equipment Corporation, 1996.

[3] TMS320C5x General-Purpose Application User’s Guides, Literature Number SPRU164, TI, 1997.

[4] T. Anderson, The TMS320C2xx Sum-of-Products Methodology, Technical Application Report SPRA068, TI, 1996.

[5] M. Tsai, IIR Filter Design on the TMS320C54x DSP, Technical Application Report SPRA079, TI, 1996.

[6] Ftp://ftp.ti.com/pub/tms320bbs/c5xxfiles/54xffts.exe, ‘C54x Software Support Files, TI.

[7] C. Turner, Calculation of TMS320LS54x Power Dissipation, Technical Application Report SPRA164, TI, 1997.

[8] C. Turner, Calculation of TMS320LS54x Power Dissipation, Technical Application Report SPRA088, TI, 1996.

[9] E. Kusse, Personal communication, 1996.

[10] J. Rabaey et al., “Fast Prototyping of Data Path Intensive Architecture”, IEEE Design & Test Magazine, Vol. 8, No. 2, pp. 40-51, 1991.

[11] J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor”, IEEE Journal of Solid-State Circuits, Vol. 31, No. 11, pp. 1703-1714, Nov. 1996.

[12] A. Fischman and P. Rowland, Designing Low-Power Applications with TMS320LC54x, Technical Application Report SPRA281, TI, 1997.

[13] Daniel D. Gajski, Nikil D. Dutt, Allen C-H Wu, Steve Y-L Lin, “High-level synthesis, Introduction to chip and system design,” Kluwer Academic Publishers, 1992.

[14] Duncan A. Buell, Jeffrey M. Arnold, Walter J. Kleinfelder, “Splash2, FPGAs in Custom Computing Machine,” IEEE Computer Society Press, Los Alamitos, California.

[15] Jonathan Babb, Russell Tessier, Mathew Dahl, Silvina Zimi Hanono, David M. Hoki, and Anant Agarwal, “Logic emulation with virtual wires,” IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, vol. 16, no. 6, June 1997.

[16] M. Vasilko, Djamel Ait-Boudaoud, “Architectural synthesis techniques for dynamically Reconfigurable logic,” Field Programmable Logic: Smart Applications, New Paradigms and Compilers, Proceedings of 6th Int. Workshop on Field Programmable Logic and Applications, FPL 96, Darmstadt, Germany, Sept. 23-25, 1996.


89

References

[17] Patrick Lysaght, Gordon McGregor and Jonathan Stockwood, “Configuration Controller Synthesis for Dynamically Reconfigurable Systems,” IEE Colloquium on Hardware Software Cosynthesis for Reconfigurable Systems, 1996.

[18] M. Vasilko, Djamel Ait-Boudaoud, “Scheduling for dynamically Reconfigurable FPGAs,” Proceedings of International Workshop on Logic and Architecture Synthesis, pp. 328-336, IFIP TC10 WG10.5, Dec. 18-19, 1995.

[19] Doug Smith, Dinesh Bhatia, “RACE: Reconfigurable and Adaptive Computing Environment,” Field Programmable Logic: Smart Applications, New Paradigms and Compilers, Proceedings of 6th Int. Workshop on Field Programmable Logic and Applications, FPL 96, Darmstadt, Germany, Sept. 23-25, 1996. See http://www.ececs.uc.edu/ ~ dal.

[20] Xilinx Netlist Format (XNF) Specification, Version 6.1, June 1, 1995.

[21] Xilinx XABEL reference manual.


90

DIGLOG multiplier

$$C_{mult}(n) = 253\,n^{2}, \qquad C_{add}(n) = 214\,n, \qquad \text{where } n = \text{word length in bits}$$

$$A = 2^{j} + A_R, \qquad B = 2^{k} + B_R$$

$$A \cdot B = (2^{j} + A_R)(2^{k} + B_R) = 2^{j+k} + 2^{j} B_R + 2^{k} A_R + A_R B_R$$

                       1st Iter   2nd Iter   3rd Iter
Worst-case error       -25%       -6%        -1.6%
Prob. of Error < 1%    10%        70%        99.8%

With an 8 by 8 multiplier, the exact result can be obtained in at most seven iteration steps (worst case).
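The iterative scheme can be illustrated with a short Python sketch. It assumes the usual DIGLOG decomposition, in which each operand is split into its leading power of two plus a residue, each iteration adds the three shift-based terms, and the neglected residue product is handed to the next iteration. The function name `diglog_multiply` and the `max_iterations` parameter are hypothetical, and how iterations are counted here may differ from the slide's convention.

```python
def diglog_multiply(a, b, max_iterations=None):
    """Approximate a * b for non-negative integers using shifts and adds only.

    Each iteration writes a = 2^j + a_r and b = 2^k + b_r, adds
    2^(j+k) + 2^j * b_r + 2^k * a_r, and continues with the residue
    product a_r * b_r. Stopping early leaves a negative error.
    """
    result = 0
    iterations = 0
    while a > 0 and b > 0:
        if max_iterations is not None and iterations >= max_iterations:
            break
        j = a.bit_length() - 1  # position of the leading one of a
        k = b.bit_length() - 1  # position of the leading one of b
        a_r = a - (1 << j)
        b_r = b - (1 << k)
        result += (1 << (j + k)) + (b_r << j) + (a_r << k)
        a, b = a_r, b_r
        iterations += 1
    return result


if __name__ == "__main__":
    exact = 200 * 123
    for steps in (1, 2, 3):
        approx = diglog_multiply(200, 123, steps)
        print(steps, approx, f"{100 * (approx - exact) / exact:.2f}%")
```

Since each residue is less than half of the value it came from, the relative error after one iteration stays above -25%, which matches the worst-case entry in the table.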


91

Low Power CDMA Searcher Project
• Project title: Low-power design of an IS-95-based DS/CDMA system using co-design techniques
• Development period: 1999.3.1 - 2000.2.28 (12 months)
• Purpose and method: RTL-level low-power design and implementation of the searcher engine of an MSM (Mobile Station Modem) chip for use in CDMA handsets. Operating frequency: 12.5 MHz
• Low-power design using a data flow graph with rescheduling, pre-computation, strength reduction, and a synchronous accumulator; area and power are reduced by up to 67.68% and 41.35%, respectively.
• H/W and S/W co-design techniques applied
• San Kim and Jun-Dong Cho, “Low Power CDMA Searcher”, CAD and VLSI Workshop, May 1999.
• Inki Hwang, San Kim and Jun-Dong Cho, “CDMA Searcher Co-Design”, ASIC Workshop, Sep. 1999.


92

Voltage Scaling

• Merely changing a processor's clock frequency is not an effective technique for reducing energy consumption. Reducing the clock frequency reduces the power consumed by a processor; however, it does not reduce the energy required to perform a given task.

• Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work.
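A short derivation makes the point explicit. It uses the power relation $P = kCFV^{2}$ from the earlier slide and introduces one assumed symbol, $N$, for the number of clock cycles a task needs, so the execution time is $T = N/F$:

```latex
E = P \cdot T = k C F V^{2} \cdot \frac{N}{F} = k C N V^{2}
```

The clock frequency cancels, so slowing the clock alone leaves the energy per task unchanged, whereas the supply voltage enters quadratically: running the same $N$ cycles at half the voltage needs about one quarter of the energy under this first-order model.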


93

OS: Voltage Scaling


94

OS: Voltage Scheduling


95

Multiple Supply Voltages
Filter Example


96

Scale Supply Voltage with fCLK


97

Adaptive Power Supply Voltages


98

Different Voltage Schedules

[Figure: energy consumption (proportional to Vdd^2) versus time (0 to 25 sec) for three voltage schedules under the same timing constraint of 25 sec.]

Schedule   Supply voltage and workload                                     Energy
(A)        5.0 V: 1000 Mcycles at 50 MHz                                   40 J
(B)        5.0 V: 750 Mcycles at 50 MHz; 2.5 V: 250 Mcycles at 25 MHz      32.5 J
(C)        4.0 V: 1000 Mcycles at 40 MHz                                   25 J
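The three energy figures can be reproduced with a small sketch that assumes energy is proportional to Vdd² times the number of cycles, calibrated so that schedule (A) gives 40 J. The function name and the tuple layout `(vdd_volts, mcycles, mhz)` are illustrative choices, not part of the slides.

```python
def schedule_energy_and_time(segments, reference=(5.0, 1000.0, 40.0)):
    """Return (energy in J, time in s) for a list of (vdd_volts, mcycles, mhz) segments.

    Energy model: E is proportional to Vdd^2 * cycles, scaled so that the
    reference (5.0 V, 1000 Mcycles) costs 40 J. Time is cycles / frequency.
    """
    ref_v, ref_mcycles, ref_joules = reference
    weighted = sum(v * v * mcycles for v, mcycles, _ in segments)
    energy = ref_joules * weighted / (ref_v * ref_v * ref_mcycles)
    seconds = sum(mcycles / mhz for _, mcycles, mhz in segments)
    return energy, seconds


if __name__ == "__main__":
    schedules = {
        "A": [(5.0, 1000.0, 50.0)],
        "B": [(5.0, 750.0, 50.0), (2.5, 250.0, 25.0)],
        "C": [(4.0, 1000.0, 40.0)],
    }
    for name, segments in schedules.items():
        print(name, schedule_energy_and_time(segments))
```

Under this model schedule (C) comes out at 25.6 J, which the slide rounds to 25 J. Schedules (B) and (C) both finish exactly at the 25-second constraint, but running at a constant lower voltage costs less than switching between a high and a low voltage.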


99

Data Driven Signal Processing

The basic idea of averaging: two samples are buffered and their workloads are averaged.

The averaged workload is then used as the effective workload to drive the power supply.

Using a ping-pong buffering scheme, data samples $I_{n+2}$ and $I_{n+3}$ are being buffered while $I_n$ and $I_{n+1}$ are being processed.
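Why averaging helps can be seen in a toy model. The sketch below assumes, purely for illustration, that the workload w of a sample is normalized to the range 0 to 1, that the clock frequency and the supply voltage both scale linearly with the workload they are driven by, and that energy is proportional to cycles times V². None of these modeling choices or function names come from the slides.

```python
def energy_per_sample_scaling(workloads):
    """Each sample sets its own voltage: cycles ~ w and V ~ w, so E ~ w * w^2."""
    return sum(w * w ** 2 for w in workloads)


def energy_with_pair_averaging(workloads):
    """Pairs of samples are buffered and run at the voltage of their average workload.

    The pair still needs w0 + w1 cycles in total, but both samples run at
    V ~ (w0 + w1) / 2. The number of samples is assumed to be even.
    """
    total = 0.0
    for k in range(0, len(workloads) - 1, 2):
        w0, w1 = workloads[k], workloads[k + 1]
        average = (w0 + w1) / 2
        total += (w0 + w1) * average ** 2
    return total


if __name__ == "__main__":
    loads = [0.2, 1.0, 0.5, 0.5]
    print(energy_per_sample_scaling(loads), energy_with_pair_averaging(loads))
```

In this model the averaged schedule never costs more than per-sample scaling, and it costs strictly less whenever the two workloads in a pair differ, because the heavy sample no longer forces a high-voltage burst.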


100

Example of Buffering


101

SOC CAD Companies
• Avant! www.avanticorp.com
• Cadence www.cadence.com
• Duet Tech www.duettech.com
• Escalade www.escalade.com
• Logic Vision www.logicvision.com
• Mentor Graphics www.mentor.com
• Palmchip www.palmchip.com
• Sonics www.sonicsinc.com
• Summit Design www.summit-design.com
• Synopsys www.synopsys.com
• Topdown Design Solutions www.topdown.com
• Xynetix Design Systems www.xynetix.com
• Zuken-Redac www.redac.co.uk


102

Viterbi decoder project
▶ Project title: Research on a low-power decoding algorithm for convolutional encoders
▶ Development period: 1999.02.22 - 11.30 (about 9 months)
▶ Purpose and method: research and development of proprietary technology for low-power channel coding devices included in IMT-2000
▶ CODEC main specifications:
  - Code rate: R = 1/2, 1/3, 1/4, k = 9
  - Decoding method: trace-back Viterbi decoder using soft decision


103

Viterbi decoder project
▶ Published papers

1. Asia Pacific Conference on ASIC’99
In this paper, we present the use of the consensus term and a clocking control signal in the ACSU for a low-power Viterbi decoder. A 20% reduction in area and a 30% reduction in power consumption are obtained based on the low-power ACSU architecture [1]. Applying our proposed glitch-reduction techniques to [1] reduces power consumption by an additional 7% at the cost of a 3% increase in area.

2. International Conference on VLSI and CAD’99
In this paper, we propose a new low-power algorithm for the trace-back unit of a systolic-array Viterbi decoder [2]. Reusing the already-generated trace-back routes reduces the number of trace-back operations and increases the area of the spurious switching activity region. The switching activity during the trace-back operation is therefore further reduced by using gated clocks. Our results show, on average, a 40% reduction in power with the same latency, but a 23% increase in area, compared with the trace-back unit in [2]. We used Synopsys Design Compiler and measured power consumption with Synopsys DesignPower.


104

Viterbi decoder project
▶ References

1. B C. Y. Tsui, R. S. K. Cheng and C. Ling, “Using Transformation to Reduce Power Consumption of IS-95 CDMA Receiver”, International Symposium on Low Power Electronics and Design, 1999.

2. T. K. Truong, A. M. T. Shih, I. S. Reed, E. H. Satorius, “A VLSI Design for a Trace-back Viterbi Decoder”, IEEE Trans. Communication, vol. 40, no. 3, Mar. 1992.
