Low Power System Level Design Methodologies Jun-Dong Cho SungKyunKwan Univ. Dept. of ECE, Vada Lab. http://vada.skku.ac.kr
Dec 13, 2015
VLSI Algorithmic Design Automation Lab.
Contents
• Introduction to System Level Design
• Hardware and Software Co-design
• Re-configurable Processors
• Other Low Power System Level Designs
Introduction to SOC
• SOC will bridge the gap between software and its implementation in novel, energy-efficient silicon architectures.
• Chips are assembled at the level of IP blocks and IP interfaces rather than at gate level (design reuse).
• SOC specifications come from ICT system engineers rather than from RTL descriptions.
Four main applications
• Set-top box: mobile multimedia system, base station for the home local-area network.
• Digital PCTV: concurrent use of TV, 3D graphics, and Internet services.
• Set-top box LAN service: wireless home networks, multi-user wireless LAN.
• Navigation system: steering and control of traffic and/or goods transportation.
Types of System-on-a-Chip Designs
Silicon in 2010
Die area: 2.5 x 2.5 cm; voltage: 0.6 V; technology: 0.07 µm

Type            Density (Gbits/cm2)   Access time (ns)
DRAM            8.5                   10
DRAM (Logic)    2.5                   10
SRAM (Cache)    0.3                   1.5

Type            Density (Mgates/cm2)  Max. ave. power (W/cm2)  Clock rate (GHz)
Custom          25                    54                       3
Std. Cell       10                    27                       1.5
Gate Array      5                     18                       1
Single-Mask GA  2.5                   12.5                     0.7
FPGA            0.4                   4.5                      0.25
Why Lower Power
• Portable systems: long battery life, light weight, small form factor
• IC priority list: power dissipation, cost, performance
• Technology direction: reduced voltage/power designs based on mature high-performance IC technology; high integration to optimize size, cost, power, and speed
Microprocessor Power Dissipation
[Figure: power dissipation (W, 0 to 50) versus year (1980 to 2000). Labeled processors: i286, i386 DX 16, i486 DX 25, i486 DX 50, i486 DX2 66, i486 DX4 100, P5 66, P6 166, P II 300, P III 500, P-PC601 50, P-PC604 133, P-PC750 400, Alpha 21064 200, Alpha 21164, Alpha 21264.]
Three Factors affecting Energy
- Hardware simplification: extraction of redundant hardware
- All-in-one approach (SOC): I/O pin reduction
- Voltage-reducible hardware
  • 2-D pipelining (systolic arrays)
  • SIMD (Single Instruction stream, Multiple Data stream) parallel processing: useful for data with a parallel structure
  • VLIW (Very Long Instruction Word) approach: flexible
  • MIMD (Multiple Instruction streams, Multiple Data streams)
New Computing Platforms
• SOC power efficiency: more than 10 GOPS/W
• Higher on-chip system integration: COTS 100 W versus SOAC 10 W (inter-chip capacitive loads, I/O buffers)
• Speed and performance: shorter interconnections, fewer drivers, faster devices, more efficient processing architectures
• Mixed-signal systems
• Reuse of IP blocks
• Multiprocessor, configurable computing
• Domain-specific, combined memory-logic
$P = k\,C\,F\,V^{2}$
Physical gap
• Timing closure problem: layout-driven logic and RT-level synthesis.
• Energy efficiency requires locality of computation and storage: a good match for stream-based data processing of speech, images, and multimedia-system packets.
• Next-generation SOC designers must bridge the architectural gap between system specification and energy-efficient IP-based architectures, while CAE vendors and IP providers will bridge the physical gap.
Levels for Low Power Design

Level          Techniques
System         Hardware-software partitioning, power down
Algorithm      Complexity, concurrency, locality, regularity, data representation
Architecture   Parallelism, pipelining, signal correlations, instruction set selection, data representation
Circuit/Logic  Sizing, logic style, logic design
Technology     Threshold reduction, scaling, advanced packaging, SOI

Possible Power Savings at Different Design Levels

Level of abstraction  Expected saving
Algorithm             10 - 100 times
Architecture          10 - 90%
Logic level           20 - 40%
Layout level          10 - 30%
Device level          10 - 30%
Low Power Design Flow I
[Figure: design-flow diagram. System-level specification; function partitioning and HW/SW allocation with system-level power analysis. Hardware path: behavioral description, power-driven behavioral transformation with behavioral-level power analysis, power-conscious behavioral description, high-level synthesis and optimization with RT-level power analysis. Software path: software functions, processor selection, software optimization with software-level power analysis. The flow continues to RT-level design.]
Low Power Design Flow II
[Figure: design-flow diagram continuing from the RT level. RT-level description; high-level synthesis and optimization; RTL mapping using an RTL library of RTL macrocells (data-path, controller, processor, control and steering logic, memory); logic synthesis and optimization using a standard cell library; gate-level description with gate-level power analysis; switch-level description with switch-level power analysis.]
Reducing Waste
• Locality of reference
• Demand-driven / data-driven computation
• Application-specific processing
• Preservation of data correlations
• Distributed processing
Eliminating Redundant Computations
Power-hungry Applications
• Signal compression: HDTV standard, ADPCM, vector quantization, H.263, 2-D motion estimation, MPEG-2 storage management
• Digital communications: shaping filters, equalizers, Viterbi decoders, Reed-Solomon decoders
IBM's PowerPC Lower Power Architecture
• Optimum supply voltage through hardware parallelism, pipelining, and parallel instruction execution
  - The 603e executes five instructions in parallel (IU, FPU, BPU, LSU, SRU)
  - The FPU is pipelined, so a multiply-add instruction can be issued every clock cycle
  - Low-power 3.3-volt design
• Use small, less complex instructions with a shorter instruction length
  - IBM's PowerPC 603e is a RISC processor
• Superscalar: CPI < 1
  - The 603e issues as many as three instructions per cycle
• Low-power management
  - The 603e provides four software-controllable power-saving modes
• IBM's Blue Logic ASIC: the new design reduces power by a factor of 10
Power-Down Techniques
◆ Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work
Voltage vs Delay
• Use variable voltage scaling or voltage scheduling for real-time processing.
• Use architecture optimization to compensate for the slower operation at low voltage, e.g., parallel processing and pipelining to increase concurrency and shorten the critical path.
Low Voltage Main Memories
Why Copper Processor?
• Motivation: aluminum increasingly resists the flow of electricity as wires are made thinner and narrower.
• Performance: 40% speed-up
• Cost: 30% less expensive
• Power: draws less power from batteries
• Chip size: 60% smaller than an aluminum chip
Silicon-on-Insulator
How does SOI reduce capacitance?
• Junction capacitance is eliminated: an insulator (similar to glass) is placed between the impurities and the silicon substrate.
• Result: high performance, low power, low soft-error rate.
Design Challenges
• Current systems are complex and heterogeneous and contain many different types of components.
• Half of the chip can be filled with 200 low-power, RISC-like processors (ASIPs) interconnected by field-programmable buses and embedded in 20 Mbytes of distributed DRAM and flash memory.
• Computational power will not result from multi-GHz clocking but from parallelism, with clocks below 200 MHz. This will greatly simplify the design for correct timing, testability, and signal integrity.
Application-Specific Instruction Processor
• Processor architecture tailored not just for an application domain (e.g., DSP, microcontrollers), but for specific sets of applications (e.g., audio, engine control)
• ASIP characteristics
  - Drawback: greater design cost (processor + compiler)
  - Benefit: higher performance and lower power than commercial cores, more flexibility than an ASIC
ASIP Design
• Given a set of applications, determine the microarchitecture of the ASIP (i.e., the configuration of functional units in the datapaths and the instruction set).
• To accurately evaluate the performance of the processor on a given application, the application program must be compiled onto the processor datapath and the object code simulated.
• However, the microarchitecture of the processor is itself a design parameter!
ASIP Design Flow
Compiler Optimizations
• Machine-independent optimizations
  - Parallelizing transformations
  - Common subexpression elimination
  - Constant propagation
  - Strength reduction
  - Loop-invariant code motion
• Machine-dependent optimizations
  - Loop unrolling and software pipelining
  - Static allocation (non-recursive procedure calls)
  - Storage layout (arrays, scalars)
  - Optimization of mode-setting instructions
  - Instruction selection, scheduling, and register allocation
Loop unrolling
The technique of loop unrolling replicates the body of a loop some number of times (the unrolling factor u) and then iterates by step u instead of step 1. This transformation reduces the loop overhead, increases instruction parallelism, and improves register, data cache, or TLB locality.
Original loop:
for i = 2 to N - 1
  A(i) = A(i) + A(i - 1) * A(i + 1)
Unrolled by a factor of 2:
for i = 2 to N - 2 step 2
  A(i) = A(i) + A(i - 1) * A(i + 1)
  A(i + 1) = A(i + 1) + A(i) * A(i + 2)
Loop overhead is cut in half because two of the original iterations are performed in each iteration of the unrolled loop. If array elements are assigned to registers, register locality is improved because A(i) and A(i + 1) are used twice in the loop body. Instruction parallelism is increased because the second assignment can be performed while the results of the first are being stored and the loop variables are being updated.
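The unrolling transformation described on this slide can be sketched in Python as follows. This is a minimal illustration, not the deck's own code: the function names are hypothetical, indices are 0-based, and the epilogue for an odd trip count is an addition that the slide's loop does not show.

```python
def rolled(a):
    """Original loop: for i = 2 to N-1 (1-based), A(i) = A(i) + A(i-1) * A(i+1)."""
    a = list(a)
    n = len(a)
    # 1-based i = 2..N-1 corresponds to 0-based i = 1..n-2.
    for i in range(1, n - 1):
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a


def unrolled_by_2(a):
    """Same computation with unrolling factor u = 2."""
    a = list(a)
    n = len(a)
    i = 1
    # Two original iterations per pass, so the loop test runs half as often.
    while i + 1 < n - 1:
        a[i] = a[i] + a[i - 1] * a[i + 1]
        a[i + 1] = a[i + 1] + a[i] * a[i + 2]
        i += 2
    # Epilogue (assumption, not on the slide): one leftover iteration
    # when the number of original iterations is odd.
    if i < n - 1:
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a


print(unrolled_by_2([1, 2, 3, 4, 5]) == rolled([1, 2, 3, 4, 5]))
```

Both functions visit the elements in the same order, so the results are identical; only the amount of loop control per element changes.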
Loop Unrolling (IIR filter example)
Loop unrolling localizes the data to reduce the activity at the inputs of the functional units, or computes two output samples in parallel from two input samples:
$Y_{n-1} = X_{n-1} + A\,Y_{n-2}$
$Y_n = X_n + A\,Y_{n-1} = X_n + A\,(X_{n-1} + A\,Y_{n-2})$
Neither the capacitance switched nor the voltage is altered. However, loop unrolling enables several other transformations (distributivity, constant propagation, and pipelining). After distributivity and constant propagation:
$Y_{n-1} = X_{n-1} + A\,Y_{n-2}$
$Y_n = X_n + A\,X_{n-1} + A^{2}\,Y_{n-2}$
The transformation yields a critical path of 3, so the voltage can be dropped.
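The first-order IIR recurrence of this slide, Y(n) = X(n) + A·Y(n-1), and its unrolled form after distributivity and constant propagation can be checked against each other with a short Python sketch. The function names and the `y_init` parameter are hypothetical; the sketch only demonstrates functional equivalence and says nothing about critical path or voltage.

```python
from fractions import Fraction


def iir_direct(x, a, y_init=0):
    """Y(n) = X(n) + A * Y(n-1), one output per iteration."""
    out = []
    y_prev = y_init
    for x_n in x:
        y_prev = x_n + a * y_prev
        out.append(y_prev)
    return out


def iir_unrolled(x, a, y_init=0):
    """Two outputs per iteration, both computed from Y(n-2)."""
    a_squared = a * a  # constant propagation: A^2 is computed once
    out = []
    y_nm2 = y_init
    for k in range(0, len(x) - 1, 2):
        x_nm1, x_n = x[k], x[k + 1]
        y_nm1 = x_nm1 + a * y_nm2
        # Distributivity: X(n) + A*(X(n-1) + A*Y(n-2)) expanded.
        y_n = x_n + a * x_nm1 + a_squared * y_nm2
        out.extend([y_nm1, y_n])
        y_nm2 = y_n
    if len(x) % 2:  # leftover sample when the input length is odd
        out.append(x[-1] + a * y_nm2)
    return out


samples = [Fraction(v) for v in (1, 2, 3, 4, 5)]
print(iir_unrolled(samples, Fraction(1, 2)) == iir_direct(samples, Fraction(1, 2)))
```

In the unrolled form, Y(n) no longer waits for Y(n-1), which is what allows the two outputs to be computed in parallel.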
Loop Unrolling for Low Power
Implementing Digital Systems
Configurability
• One million gates of reconfigurable logic, one million gates of hardwired logic.
• Reduces design risk as NRE costs become dominant.
• 50 GIPS for programmable components or 500 GIPS for dedicated hardware.
• 1 V supply, with power in the watt range.
Bridging the architectural gap
• Product reliability: design at a level far above the RT level, with reuse factors in excess of 100.
• Trade-off: 100 MOPS/W (microprocessor) versus 100 GOPS/W (hardwired).
• Reconfigurable computing with a large number of computing nodes and a very restricted instruction set (Pleiades).
Cross-Disciplinary nature
• Software for low power: loop transformations lead to much higher temporal and spatial locality of data.
• Code size becomes an important objective.
• Software will eventually become a part of the chip.
• Behavior-platform-compiler codesign: systems are codesigned in C++ or Java, describing both their hardware and software implementation.
• Multidisciplinary system thinking is required for future designs (e.g., Eindhoven Embedded Systems Institute, http://www.eesi.tue.nl/english).
Low Power DSP
• Most of the execution time is spent in DO-loops
  - VSELP vocoder: 83.4%
  - 2D 8x8 DCT: 98.3%
  - LPC computation: 98.0%
• Power minimization of the DO-loops ==> power minimization of the DSP
VSELP: Vector Sum Excited Linear Prediction; LPC: Linear Prediction Coding
VLSI Signal Processing Design Methodology
• Pipelining, parallel processing, retiming, folding, unfolding, look-ahead, relaxed look-ahead, and approximate filtering
• Bit-serial, bit-parallel, and digit-serial architectures; carry-save architecture
• Redundant and residue number systems
• Viterbi decoder, motion compensation, 2D filtering, and data transmission systems
Common Fabric for IP Blocks
• Soft IP blocks are portable, but not as predictable as hard IP.
• Hard IP blocks are very predictable since a specific physical implementation can be characterized, but they are hard to port since they are often tied to a specific process.
• A common fabric is required for both portability and predictability.
• Wide availability: Cell Based Array, a metal-programmable architecture that provides the performance of standard cells and is optimized for synthesis.
H/W and S/W Co-design
Mixing H/W and S/W
• Argument: mixed hardware/software systems represent the best of both worlds: high performance, flexibility, design reuse, etc.
• Counterpoint: from a design standpoint, they are the worst of both worlds.
  - Simulation: verification and test become harder
  - Interface: too many tools, too many interactions, too much heterogeneity
  - Hardware/software partitioning is "AI-complete"!
Partitioning
• Performance requirements
  - Some functions are easier to implement in hardware: blocks that are used repeatedly, and blocks with a parallel structure
• Modifiability
  - Blocks implemented in software are easy to modify
• Implementation cost
  - Blocks implemented in hardware can be shared
• Scheduling
  - Schedule the blocks partitioned into HW and SW so that the given constraints are met
  - SW operations must be scheduled sequentially
  - SW and HW can be scheduled concurrently as long as there are no data or control dependencies
Low power partitioning approach
• Different HW resources are invoked according to the instruction executed at a specific point in time.
• During the execution of an add operation, the ALU and registers are used, but the multiplier is idle.
• Idle resources still consume energy because their circuits continue to switch.
• Calculate the wasted energy.
• Add application-specific cores and run only the core that is needed.
• Whenever one core is executing, all the other cores are shut down.
Effective Resource Utilization
[Figure: data-flow graph of multiplications and additions before and after retiming, with the per-cycle schedule of multiplier and adder operations for each version.]
Retiming can reduce interconnect capacitance.
Partitioning Process
- Derive a graph G of operations and connections
- Decompose G into a set of clusters (a cluster is a set of operations)
- Calculate the bus-traffic energy
- Pre-select clusters subject to the constraints
- Set the number of resources
- Perform list scheduling
- Test the utilization rate (ASIC or µP); the utilization rate of the µP is provided by a SW estimation tool
Design Flow
[Figure: flow chart. Application; divide application into clusters; list schedule; compute utilization rate (ASIC); select cluster; compute utilization rate (µP); core energy estimation; HW synthesis; evaluate.]
- Up to 94% energy saving, and in most cases even reduced execution time
- 16k-cell overhead
Interface
• Need for an interface block
  - Transfers data between hardware and software blocks
  - Only an efficient interface block can reduce the overhead between HW and SW blocks
• Interface methods
  - Shared memory
  - FIFO
  - Handshaking protocol
Logical Bus Architecture
• System bus signals
  - Address, data, and control signals
  - The address space consists of the memory space and the I/O space
  - Memory space: memory of the SW component
  - I/O space: ports within SW and registers in other HW
• Port signals
  - Specialized signals capable of directly interfacing between SW and HW components
• Interrupt signals
  - Raised when SW and HW components have completed an operation, or when an error condition is detected
Co-Simulation
• Need for co-simulation
  - Simulating the HW part and the SW part together makes it possible to predict the results of the composed system
  - Predicting system performance allows the system to be redesigned to meet the given specification before synthesis
  - It predicts the characteristics of each sub-block for HW/SW partitioning
• Co-simulation tools: Ptolemy, COSSAP, POLIS
Hardware/Software Co-Design Flow
[Figure: flow chart. Analysis of constraints and requirements; system specification; hardware and software partitioning; hardware description and software description; hardware synthesis and configuration, interface synthesis, software generation and parameterization; configuration modules, hardware components, HW/SW interface, software modules; HW/SW integration and cosimulation; integrated system; system evaluation and design verification.]
Partitioning Example: CDMA Searcher
[Figure: CDMA searcher block diagram (PN-code generator; synchronous accumulation stages, real and imaginary; energy calculation stages, real and imaginary; comparison and selection stages; asynchronous accumulation stage) and the partitioning search over SW and HW candidates for each block: PN-code generation, synchronous accumulator (SW, HW 1, HW 2), energy estimate (SW, HW), asynchronous accumulator (SW, HW), and comparator (SW, or HW with precomputation), evaluated by cost (speed, area, power) toward the goal.]
Approach
[Figure: searcher data flow from the inputs RX_I, RX_Q, TX_I, TX_Q through the synchronous accumulation stage, the energy calculation stage (select the maximum), and the asynchronous accumulation stage, followed by comparisons with thresholds θ1 and θ2 leading to "Search Done" or "Search_Slew", with a control signal generator.]
$O_I = RX_I \cdot TX_I + RX_Q \cdot TX_Q$
$O_Q = RX_I \cdot TX_Q + RX_Q \cdot (-TX_I)$
$Y_I = \sum O_I, \quad Y_Q = \sum O_Q$
$Z = \max(Y_I^{2}, Y_Q^{2})$, accumulated as $\sum Z$
- Software-oriented design
- Dark blocks: hardware
- Interface: control signal generator
- Partitioned in terms of speed and cost
- Reasons to move a block from SW to HW: 1. implementation speed; 2. parallel architecture
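The arithmetic of the searcher stages on this slide (O_I, O_Q, Y_I, Y_Q, Z, and the sum of Z) can be sketched in Python. This is an illustrative sketch only: the function name, the `sync_len` window length, and the list-of-pairs data layout are assumptions, and the threshold comparisons with θ1 and θ2 are left out because their exact placement is not recoverable from the slide.

```python
def searcher_energy(rx, tx, sync_len):
    """Return (z_values, total) for equal-length lists of (I, Q) samples.

    rx, tx   : received samples and local PN samples as (I, Q) pairs
    sync_len : samples per synchronous accumulation window (hypothetical)
    """
    z_values = []
    for start in range(0, len(rx) - sync_len + 1, sync_len):
        y_i = 0
        y_q = 0
        window = zip(rx[start:start + sync_len], tx[start:start + sync_len])
        for (rx_i, rx_q), (tx_i, tx_q) in window:
            y_i += rx_i * tx_i + rx_q * tx_q      # O_I, synchronously accumulated
            y_q += rx_i * tx_q + rx_q * (-tx_i)   # O_Q, synchronously accumulated
        z_values.append(max(y_i * y_i, y_q * y_q))  # Z = max(Y_I^2, Y_Q^2)
    return z_values, sum(z_values)                  # asynchronous accumulation


chips = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
print(searcher_energy(chips, chips, sync_len=2))
```

When the received samples match the local code, all of the correlation energy lands in Y_I and Y_Q stays at zero, which is the case the example input exercises.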
Result

Configuration                  Cycle   Ratio   Area (gates)
Full SW                        266             -
Full HW                        -               9008
Synchronous accumulator (1)    138     48.1    + 872
Computing energy (2)           265     4.4     + 3096
(1) & (2)                      137     48.5    + 3968
(2) & Comparator (3)           265     4.4     + 3155
(1) & (3)                      138     48.1    + 931
Flexibility vs. Energy-Efficiency
• There is a trade-off between efficiency and flexibility: programmable designs incur significant performance and power penalties compared to ASICs.
• For parallel signal-processing algorithms, significant power savings can be achieved by executing the dominant computational kernels of a given class of applications with common features on dedicated, optimized processing elements with minimum energy overhead.
• Programmability requires generalized computation, storage, and communication systems that can be used to implement different kinds of algorithms.
• Domain-specific processors retain enough of the flexibility of a general-purpose programmable device to handle a variety of algorithms, while achieving higher levels of energy efficiency.
Hybrid Architecture Template (Pleiades), Arthur Abnous and Jan Rabaey
Pleiades does much better on the energy scale than the TI DSPs, because DSPs are general-purpose and their instruction execution involves a great deal of overhead. Pleiades can create dedicated hardware structures tuned to the task at hand and executes operations with a small energy overhead.
Application Domains: Ultra-Low-Power Domain-Specific Multimedia Processors
• CELP-based speech coding
  - LPC analysis and synthesis
  - Codebook search
  - Lag computation
• DCT-based video compression and decompression
  - DCT and inverse DCT
  - Motion estimation and compensation
  - Huffman coding and decoding
• Baseband processing for digital radios
  - Demodulation, channel equalization
  - Timing recovery, error correction
The Re-configurable Terminal
Satellite Processors
Elements of Energy-Efficiency
Multi-Processor Implementation
Switching Activity Reduction
(a) Average activity in a multiplier as a function of the constant value
(b) Parallel and serial implementations of an adder tree
Communication Network
Distributed Data-Driven Control
Execution of a hardware module is triggered by the arrival of tokens. When there are no tokens to be processed at a given module, no switching activity occurs in that module.
Implementation of Handshaking
Design Methodology
Low Power Circuit Techniques
• Reduced-swing interconnect (communication network, memories, programmable logic modules)
• On-chip dc-dc conversion + multiple supply voltages
• Locally synchronous, globally asynchronous
• Automatic power-down
• Optimized libraries (0.6 µm CMOS + Cadence/Synopsys design flow)
VSELP Synthesis Filter Mapped onto Satellite Processors
Mappings of VSELP Kernel
• The most energy-efficient CELP-based speech algorithm
  - dissipates 36 mW (Vdd = 1.8 V, 0.5 µm CMOS)
  - requires 23.4 MOPS
• Proposed VSELP speech coder
  - 0.6 µm CMOS
  - dissipates under 5 mW
IIR Mapping
IIR Comparison
FFT Mapping
FFT Comparison
Reconfiguration for Power Saving in Real-Time Motion Estimation, S. R. Park, UMASS
Motion Estimation
Block Matching Algorithm
Configurable H/W Paradigms
Programmable Logic Modules
Why Hardware for Motion Estimation?
• Most computationally demanding part of video encoding
• Example: CCIR 601 format
  - 720 by 576 pixels
  - 16 by 16 macroblock (n = 16)
  - 32 by 32 search area (p = 8)
  - 25 Hz frame rate (f_frame = 25)
  - About 9 giga-operations per second are needed for the full-search block matching algorithm
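The 9 giga-operations per second figure can be reproduced with a short calculation. The sketch below is an assumption-laden estimate, not the deck's own derivation: it assumes (2p + 1)^2 candidate positions per macroblock and three operations per pixel per candidate (subtract, absolute value, accumulate); the function and parameter names are hypothetical.

```python
def full_search_ops_per_second(width, height, p, frame_rate, ops_per_pixel=3):
    """Estimate the operation rate of full-search block matching.

    Every pixel of every macroblock is compared at (2p + 1)^2 candidate
    displacements, so the macroblock size n cancels out of the total.
    """
    candidates = (2 * p + 1) ** 2
    return ops_per_pixel * candidates * width * height * frame_rate


ops = full_search_ops_per_second(width=720, height=576, p=8, frame_rate=25)
print(f"{ops / 1e9:.2f} GOPS")
```

Because the cost grows with (2p + 1)^2, shrinking the search range is the most direct lever on the computation, which is the motivation for adapting the search area.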
Why Reconfiguration in Motion Estimation?
• Adjust the search area at frame rate according to the changing characteristics of video sequences
• Reduce power consumption by avoiding unnecessary computation
Motion Vector Distributions
Architecture for Motion Estimation
From P. Pirsch et al., "VLSI Architectures for Video Compression", Proc. of the IEEE, 1995
Re-configurable Architecture for ME
Power Estimation in Reconfigurable Architecture
Power vs Search area
Resource Reuse in FPGAs
Motion Estimation
Motion Estimation (low power)
[Equation: the power of implementations (a) and (b), P_a and P_b, expressed in terms of the adder power P_add and the absolute-value unit power P_abs.]
Therefore, the power reduction factor is 11%.
DIGILOG multiplier
$C_{mult}(n) = 253\,n^{2}, \quad C_{add}(n) = 214\,n$, where $n$ is the word length in bits
$A = 2^{j} + A_R, \quad B = 2^{k} + B_R$
$A \cdot B = (2^{j} + A_R)(2^{k} + B_R) = 2^{j+k} + 2^{j} B_R + 2^{k} A_R + A_R B_R$

                      1st Iter   2nd Iter   3rd Iter
Worst-case error      -25%       -6%        -1.6%
Prob. of error < 1%   10%        70%        99.8%

With an 8 by 8 multiplier, the exact result can be obtained in a maximum of seven iteration steps (worst case).
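The iterative decomposition on this slide can be sketched in Python. This is a minimal sketch under stated assumptions: each iteration adds the three shift-only terms 2^(j+k) + 2^j·B_R + 2^k·A_R and defers the residual product A_R·B_R to the next iteration; the function name and the way iterations are counted are choices made here and may differ from the counting used on the slide.

```python
def digilog_multiply(a, b, iterations):
    """Approximate a * b for non-negative integers using shifts and adds only."""
    result = 0
    for _ in range(iterations):
        if a == 0 or b == 0:
            break  # no residual product left: the result is exact
        j = a.bit_length() - 1       # position of the leading one of A
        k = b.bit_length() - 1       # position of the leading one of B
        a_r = a - (1 << j)           # A_R = A - 2^j
        b_r = b - (1 << k)           # B_R = B - 2^k
        result += (1 << (j + k)) + (b_r << j) + (a_r << k)
        a, b = a_r, b_r              # remaining error term is A_R * B_R
    return result


print(digilog_multiply(200, 100, 1), digilog_multiply(200, 100, 2), 200 * 100)
```

Stopping early always underestimates the product, since the dropped term A_R·B_R is non-negative; after one iteration the relative error stays below 25% in magnitude, consistent with the worst-case figure in the table.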
Low Power CDMA Searcher Project
• Project title: Low-power design of an IS-95-based DS/CDMA system using co-design techniques
• Development period: 1999.3.1 to 2000.2.28 (12 months)
• Purpose and method: RTL-level low-power design and implementation of the searcher engine of an MSM (Mobile Station Modem) chip for use in CDMA handsets. Operating frequency: 12.5 MHz
• Low-power design using a data flow graph with rescheduling, pre-computation, strength reduction, and a synchronous accumulator; area and power were reduced by up to 67.68% and 41.35%, respectively
• H/W and S/W co-design techniques applied
• San Kim and Jun-Dong Cho, "Low Power CDMA Searcher", CAD and VLSI Workshop, May 1999.
• Inki Hwang, San Kim and Jun-Dong Cho, "CDMA Searcher Co-Design", ASIC Workshop, Sep. 1999.
Voltage Scaling
• Merely changing a processor's clock frequency is not an effective technique for reducing energy consumption. Reducing the clock frequency reduces the power consumed by the processor, but it does not reduce the energy required to perform a given task.
• Lowering the voltage along with the clock actually alters the energy per operation of the microprocessor, reducing the energy required to perform a fixed amount of work.
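A short worked derivation may make the distinction between power and energy concrete. It assumes the usual dynamic-power model, power proportional to capacitance, clock frequency, and the square of the supply voltage, and a task that needs a fixed number of clock cycles $N$; the symbol $N$ and the neglect of leakage are assumptions introduced here for illustration.

```latex
\begin{align*}
P &= k\,C\,V^{2} f            && \text{dynamic power at clock frequency } f \\
t &= \frac{N}{f}              && \text{time to execute } N \text{ cycles} \\
E &= P\,t = k\,C\,V^{2} f \cdot \frac{N}{f} = k\,C\,V^{2} N
\end{align*}
```

The frequency cancels out of the energy, so slowing the clock alone only stretches the same energy over a longer time. Under this model, only a lower supply voltage reduces the energy per task: halving the voltage cuts it by a factor of four.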
OS: Voltage Scaling
OS: Voltage Scheduling
Multiple Supply Voltages: Filter Example
Scale Supply Voltage with fCLK
Adaptive Power Supply Voltages
Different Voltage Schedules
[Figure: three plots of energy consumption (proportional to Vdd^2) versus time (0 to 25 s), with the timing constraint marked.]

Schedule  Segments                                                        Energy
(A)       1000 Mcycles at 50 MHz, Vdd = 5.0 V                             40 J
(B)       750 Mcycles at 50 MHz, 5.0 V, then 250 Mcycles at 25 MHz, 2.5 V  32.5 J
(C)       1000 Mcycles at 40 MHz, 4.0 V                                   25 J
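The three schedules can be compared numerically with a small Python sketch. It assumes that energy is proportional to the number of cycles times Vdd squared and calibrates the proportionality constant from schedule (A), which the slide gives as 40 J; the constant name, the function name, and the tuple layout are hypothetical.

```python
# Joules per Mcycle per volt^2, calibrated so that schedule (A) gives 40 J.
# This calibration is an assumption made for illustration.
JOULES_PER_MCYCLE_V2 = 40.0 / (1000 * 5.0 ** 2)


def evaluate(schedule):
    """Return (seconds, joules) for a list of (mcycles, mhz, vdd) segments."""
    seconds = sum(mcycles / mhz for mcycles, mhz, _ in schedule)
    joules = sum(JOULES_PER_MCYCLE_V2 * mcycles * vdd ** 2
                 for mcycles, _, vdd in schedule)
    return seconds, joules


schedule_a = [(1000, 50, 5.0)]
schedule_b = [(750, 50, 5.0), (250, 25, 2.5)]
schedule_c = [(1000, 40, 4.0)]

for name, schedule in (("A", schedule_a), ("B", schedule_b), ("C", schedule_c)):
    seconds, joules = evaluate(schedule)
    print(f"({name}) {seconds:.0f} s, {joules:.1f} J")
```

Under this model, schedule (C) runs every cycle at one moderate voltage and uses the least energy while finishing at the same time as (B); its computed value comes out slightly above the 25 J shown on the slide, which is presumably rounded.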
Data Driven Signal Processing
The basic idea of averaging: two samples are buffered and their workloads are averaged.
The averaged workload is then used as the effective workload to drive the power supply.
Using a ping-pong buffering scheme, data samples I(n+2) and I(n+3) are being buffered while I(n) and I(n+1) are being processed.
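The benefit of averaging workloads over buffered pairs can be illustrated with a toy model in Python. The model is an assumption made here, not taken from the slide: workloads are normalized to the range 0 to 1, the supply voltage is taken to be proportional to the effective workload, and energy is proportional to work times voltage squared. The function names are hypothetical, and an even number of samples is assumed.

```python
def pairwise_average(workloads):
    """Average the workloads of each buffered pair of samples."""
    return [(workloads[i] + workloads[i + 1]) / 2
            for i in range(0, len(workloads) - 1, 2)]


def relative_energy(workloads, averaged):
    """Relative energy, with voltage assumed proportional to effective workload."""
    if averaged:
        # Each pair carries total work 2 * w_avg and runs at voltage ~ w_avg.
        return sum(2 * w * w ** 2 for w in pairwise_average(workloads))
    # Each sample carries work w and runs at its own voltage ~ w.
    return sum(w * w ** 2 for w in workloads)


loads = [0.2, 1.0]
print(relative_energy(loads, averaged=False), relative_energy(loads, averaged=True))
```

Because energy grows faster than linearly with the workload in this model, running both samples at the averaged voltage costs less than running one sample fast and one slow, at the price of the buffering latency.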
Example of Buffering
SOC CAD Companies
• Avant! www.avanticorp.com
• Cadence www.cadence.com
• Duet Tech www.duettech.com
• Escalade www.escalade.com
• Logic visions www.logicvision.com
• Mentor Graphics www.mentor.com
• Palmchip www.palmchip.com
• Sonic www.sonicsinc.com
• Summit Design www.summit-design.com
• Synopsys www.synopsys.com
• Topdown design solutions www.topdown.com
• Xynetix Design Systems www.xynetix.com
• Zuken-Redac www.redac.co.uk
Viterbi decoder project
▶ Project title: Research on a low-power decoding algorithm for convolutional encoders
▶ Development period: 1999.02.22 to 11.30 (about 9 months)
▶ Purpose and method: research and development of proprietary technology for reducing the power of the channel coding unit included in IMT-2000
▶ Main CODEC specifications:
 - Code rate: R = 1/2, 1/3, 1/4, k = 9
 - Decoding method: trace-back Viterbi decoder using soft decision
Viterbi decoder project
▶ Published papers
1. Asia Pacific Conference on ASIC '99: In this paper, we have presented the use of the consensus term and a clocking control signal in the ACSU for a low-power Viterbi decoder. A 20% reduction in area and a 30% reduction in power consumption are obtained based on the low-power ACSU architecture [1]. Applying our proposed glitch reduction techniques to [1], power consumption is reduced by an additional 7% at the cost of a 3% increase in area.
2. International Conference on VLSI and CAD '99: In this paper, we propose a new low-power algorithm for the trace-back unit of a systolic-array Viterbi decoder [2]. Reusing the already-generated trace-back routes reduces the number of trace-back operations and enlarges the region in which spurious switching activity can be suppressed. The switching activity during the trace-back operation was therefore further reduced by using gated clocks. Our results showed on average a 40% reduction in power with the same latency, but a 23% increase in area, compared with the trace-back unit in [2]. We used Synopsys Design Compiler and measured power consumption with Synopsys DesignPower.
Viterbi decoder project
▶ References
1. C. Y. Tsui, R. S. K. Cheng and C. Ling, "Using Transformation to Reduce Power Consumption of IS-95 CDMA Receiver", International Symposium on Low Power Electronics and Design, 1999.
2. T. K. Truong, A. M. T. Shih, I. S. Reed, E. H. Satorius, "A VLSI Design for a Trace-back Viterbi Decoder", IEEE Trans. Communication, vol. 40, no. 3, Mar. 1992.