Low Power IP Design Methodology for Rapid Development of DSP Intensive SOC Platforms
T. Arslan, A.T. Erdogan, S. Masupe, C. Chun-Fu, D. Thompson
Contents
• Introduction to power consumption
• Introduction to Main Concepts
• Low Power Design Methodology
• IP implementations
• Results and conclusions
Power Consumption in CMOS-Based DSP Systems
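[Figure: CMOS circuit with supply Vdd, input Vin, output Vout and load capacitance C_L, showing the currents I_sc and I_dy; waveforms of Vin, Vout, I_sc and I_dy versus t]
$I_{dd} = I_{sc} + I_{dy}$
$P_{ave} = k \cdot C \cdot V_{dd}^{2} \cdot f + I_{sc} \cdot V_{dd} + I_{l} \cdot V_{dd}$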
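As a rough numerical illustration of the average-power expression on this slide (P_ave = k.C.Vdd^2.f + I_sc.Vdd + I_l.Vdd), the sketch below evaluates the three terms separately. The function name, parameter names and all numeric values are hypothetical and chosen only to show that the switching term scales with the square of Vdd.

```python
def average_power(k, c_load, vdd, freq, i_sc=0.0, i_leak=0.0):
    """Return (switching, short_circuit, leakage, total) power in watts.

    k      : switching activity factor (dimensionless)
    c_load : load capacitance C in farads
    vdd    : supply voltage in volts
    freq   : clock frequency in hertz
    i_sc   : average short-circuit current in amperes
    i_leak : leakage current I_l in amperes
    """
    switching = k * c_load * vdd ** 2 * freq   # k.C.Vdd^2.f
    short_circuit = i_sc * vdd                 # I_sc.Vdd
    leakage = i_leak * vdd                     # I_l.Vdd
    return switching, short_circuit, leakage, switching + short_circuit + leakage


if __name__ == "__main__":
    # Hypothetical operating points: same circuit at 3.3 V and at 1.65 V.
    for vdd in (3.3, 1.65):
        sw, sc, lk, total = average_power(0.5, 10e-12, vdd, 50e6, 1e-6, 1e-9)
        print(f"Vdd={vdd} V: switching={sw:.3e} W, total={total:.3e} W")
```

Halving Vdd quarters the switching term, while reducing the effective capacitance C* = k.C lowers it linearly; the methods on the following slides target the second route.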
Power Reduction Methods
Reduce C* = k.C
Reduce Vdd
• Supply Voltage Reduction
• Clock Gating
Disadvantage:
• Added design effort
Common Approaches to Low Power Design
Systematic Low Power Design Approach
Exploit correlations and redundancies within an algorithm, then map the result to hardware.
[Figure: framework block diagram; labels: DSP Algorithm Library, Performance Criteria, Ordering algorithm, Block, Segmentation, etc., Data representation, Multiplier SC, Bus SC, CAD, Verilog/VHDL, Synthesis, Component Library, Netlist]
Systematic Design Implementation Framework
Rapid Design and IP-Based Integration Platforms
[Figure: integration platform diagram; labels: P, IPx, IPy, ..., Developed IPs]
Parameterisation Options
Synthesis (Buildgate)
System Design (Verilog)
Verification (Behavioural Simulation)
Technology-Specific Netlist
Verification (Gate-Level Simulation)
Verification (Post-Layout Simulation)
Floorplanning, Placement & Routing (Silicon Ensemble)
I/O Pads Placement
Tape-out Verification (Dracula DRC/ERC/LVS)
System Specifications
Layout
Design Flow for Filter IPs
FIR Filter Implementation
Typical Single Multiplier DSP Processor Architecture
[Figure: block diagram; labels: ADC input x(n), Data memory, Coefficient memory, Data bus, Coefficient bus, Multiplier, Adder, Output register, Multiplier-accumulator (MAC), Control, DAC output y(n)]
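The single-multiplier architecture computes each output sample as a sequence of multiply-accumulate steps, fetching one data word and one coefficient word per step. A minimal behavioural sketch follows; the function and variable names are illustrative and not taken from the slides.

```python
def fir_single_mac(samples, coeffs):
    """Direct-form FIR, one multiply-accumulate per coefficient per sample."""
    n_taps = len(coeffs)
    data_mem = [0] * n_taps            # data memory, most recent sample first
    outputs = []
    for x in samples:
        data_mem = [x] + data_mem[:-1]  # shift the new ADC sample in
        acc = 0                         # output register cleared for each y(n)
        for k in range(n_taps):
            # One coefficient fetch and one data fetch feed the multiplier,
            # and the adder accumulates the product.
            acc += coeffs[k] * data_mem[k]
        outputs.append(acc)
    return outputs
```

In this arrangement both multiplier inputs change on every step, so both buses and the multiplier switch heavily; this is the baseline that the later IPs modify.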
Transpose Direct Form (TDF) FIR Structure
[Figure: signal flow graph; input x(n) multiplied by h(0), h(1), h(2), ..., h(N-1); z^-1 delays between stages; stages stage0, stage1, ..., stageN-1; partial results PCV0(n), PCV1(n), PCV2(n), ..., PCVN-1(n); output y(n)]
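A behavioural sketch of the transpose direct form may help here. It assumes that PCVk(n) denotes the partial sum produced by stage k at sample n, with PCV0(n) = y(n); that mapping is an interpretation of the figure labels, and the names below are illustrative.

```python
def fir_tdf(samples, coeffs):
    """Transpose direct form FIR.

    Stage k computes PCVk(n) = h(k) * x(n) + PCV(k+1)(n-1); the last stage
    has no incoming partial sum, and the output is y(n) = PCV0(n).
    """
    n_taps = len(coeffs)
    pcv = [0] * (n_taps + 1)            # pcv[n_taps] is a constant zero
    outputs = []
    for x in samples:
        # The same data sample x(n) is used by every multiplication.
        new_pcv = [coeffs[k] * x + pcv[k + 1] for k in range(n_taps)]
        new_pcv.append(0)
        pcv = new_pcv
        outputs.append(pcv[0])
    return outputs
```

Because x(n) is held at one multiplier input for all N products of a sample period, only the coefficient input changes, and the products for a given sample can be computed in any order.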
[Figure: block diagram; labels: ADC input x(n), Data bus I, Data bus II, Coefficient bus, Coefficient memory, PCVM, Multiplier, Adder, Control, DAC output y(n)]
Modified DSP Processor Architecture for TDF FIR Filter Implementation
An Example SFG for IP2
Coefficient Memory Configuration with Coefficient Ordering
Order coefficients such that adjacent coefficients are highly correlated.
[Figure: flow: Filter Specifications -> Filter Design (Matlab) -> Coefficient Set -> Coefficient Ordering (C Routine) -> Ordered Coefficient Set -> Memory Configuration (C Routine) -> Coefficient Words]
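The slides do not spell out the ordering routine itself. As one possible illustration, the sketch below uses a greedy nearest-neighbour heuristic that repeatedly picks the unused coefficient with the smallest Hamming distance to the previous one. The heuristic, the function names and the default word width are assumptions, not the authors' C routine, and a greedy pass does not guarantee the minimum total.

```python
def hamming(a, b, width=16):
    """Number of differing bits between two width-bit two's complement words."""
    mask = (1 << width) - 1
    return bin((a ^ b) & mask).count("1")


def order_coefficients(coeffs, width=16):
    """Return an index order in which adjacent coefficients differ in few bits."""
    if not coeffs:
        return []
    remaining = list(range(len(coeffs)))
    order = [remaining.pop(0)]          # start from h(0)
    while remaining:
        last = coeffs[order[-1]]
        nxt = min(remaining, key=lambda idx: hamming(last, coeffs[idx], width))
        remaining.remove(nxt)
        order.append(nxt)
    return order


def total_switching(coeffs, order, width=16):
    """Total bit transitions on the coefficient bus for a given fetch order."""
    return sum(hamming(coeffs[a], coeffs[b], width)
               for a, b in zip(order, order[1:]))
```

The returned indices are what a memory configuration step would need in order to record which partial sum each reordered coefficient contributes to.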
Coefficient Word:
h(k) | PCVMA | SF
SF: Shift Flag (SF = 1: shift, SF = 0: no shift)
PCVMA: Pre-Calculated Value Memory Address
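To make the coefficient word layout concrete, the following sketch packs and unpacks the three fields. The field widths (16-bit coefficient, 7-bit PCVMA) and the bit positions (h(k) in the most significant bits, SF in the least significant bit) are hypothetical, since the slides give only the field order.

```python
COEFF_BITS = 16   # hypothetical width of h(k)
ADDR_BITS = 7     # hypothetical width of PCVMA


def pack_word(h, pcvma, sf):
    """Pack h(k), PCVMA and SF into one integer coefficient word."""
    h_field = h & ((1 << COEFF_BITS) - 1)          # two's complement field
    addr_field = pcvma & ((1 << ADDR_BITS) - 1)
    return (h_field << (ADDR_BITS + 1)) | (addr_field << 1) | (sf & 1)


def unpack_word(word):
    """Split a coefficient word back into (h, pcvma, sf)."""
    sf = word & 1
    pcvma = (word >> 1) & ((1 << ADDR_BITS) - 1)
    h_field = (word >> (ADDR_BITS + 1)) & ((1 << COEFF_BITS) - 1)
    if h_field & (1 << (COEFF_BITS - 1)):          # sign-extend negative values
        h_field -= 1 << COEFF_BITS
    return h_field, pcvma, sf
```

In hardware the same decomposition amounts to slicing fixed bit ranges of the memory output, so it costs wiring rather than logic.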
Coefficient Word Decomposition (Verilog Code)
An Example SFG for IP3
Memory Operations (Verilog Code)
Software Implementation Example for IP3
Power Evaluation
Filter Specifications
Lowpass filter specifications
Filter # | Passband (kHz) | Stopband (kHz) | Passband ripple (dB) | Stopband attenuation (dB) | Window function | Filter length
1 | 0 - 1.5 | 2 - 4 | 0.1 | 50 | Hamming | 53
2 | 0 - 1.2 | 1.7 - 5 | 0.01 | 40 | Kaiser | 71
3 | 0 - 3.375 | 5.625 - 10 | 0.002 | 90 | - | 42
4 | 0 - 1 | 1.5 - 5 | 0.0135 | 56 | - | 61
5 | 0 - 1.5 | 2 - 4 | 0.1 | 50 | Blackman | 89
Bandpass filter specifications
Filter # | Lower stopband (kHz) | Passband (kHz) | Upper stopband (kHz) | Passband ripple (dB) | Stopband attenuation (dB) | Window function | Filter length
1 | 0 - 0.1 | 0.15 - 0.25 | 0.3 - 0.5 | 0.1 | 60 | Kaiser | 73
2 | 0 - 0.45 | 0.9 - 1.1 | 1.55 - 7.5 | 0.8 | 30 | - | 34
3 | 0 - 5 | 8 - 12 | 15 - 44.14 | 0.00868 | 60 | Kaiser | 54
4 | 0 - 1 | 2 - 3.5 | 4.25 - 5 | 0.13 | 56.4 | - | 32
5 | 0 - 0.1 | 1.375 - 3.625 | 4 - 5 | 0.1 | 68.4 | - | 80
[Figure: bar chart of switched capacitance (pF), scale 0 to 5000, for IP1, IP2 and IP3; series: PCVM bus, coefficient bus, data bus, multiplier; annotated reductions: 25%, 54%]
Power Reductions Achieved (wordlength = 16 bit)
An example of a 6-tap FIR filter with block size of 3
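One way to read the block-processing idea is that each coefficient is fetched once and then reused for every output sample in a block, so the number of coefficient memory accesses falls by roughly the block size. The sketch below is a behavioural illustration under that assumption; it is not the IP4 schedule, and the names are hypothetical.

```python
def fir_block(samples, coeffs, block_size):
    """Block FIR: returns (outputs, number of coefficient fetches)."""
    n_taps = len(coeffs)
    padded = [0] * (n_taps - 1) + list(samples)    # zero initial history
    outputs = [0] * len(samples)
    coeff_fetches = 0
    for start in range(0, len(samples), block_size):
        stop = min(start + block_size, len(samples))
        for k, h in enumerate(coeffs):
            coeff_fetches += 1                     # h(k) read once per block
            for n in range(start, stop):
                # h(k) stays on the multiplier input for the whole block:
                # y(n) += h(k) * x(n - k)
                outputs[n] += h * padded[n + n_taps - 1 - k]
    return outputs, coeff_fetches
```

For a 6-tap filter and six input samples, a block size of 3 needs 12 coefficient fetches in this sketch, against 36 with a block size of 1.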
Power Reductions for IP4 (wordlength = 16 bit)
[Figure: bar chart of switched capacitance (pF), scale 0 to 900, versus block size (1, 2, 4, 8, 16); series: coefficient bus, data bus, multiplier; annotated reductions: 40%, 42%, 50%, 53%]
Reductions in Number of Memory Accesses (%)
[Figure: bar chart of reduction (%), scale 0 to 100, versus block size (2, 4, 8, 16); series: data memory, coefficient memory]
[Figure: block diagram; labels: Coefficient Set, Coefficient Set1, Coefficient Set2, Data Set, Shifter, Multiplier, Adder, Output]
Coefficient Segmentation Algorithm
Example Segmentations
Coefficient Segmentation Algorithm for Two’s Complement Coding
[Flowchart; blocks as extracted:
Begin
H = (h_0, h_1, ..., h_{L-1})
i = 0, k = 0
Decision 1: 2^i >= h_k (otherwise i = i + 1)
Decision 2: 2^i != h_k
Decision 3: h_k <= 0
s_k = 2^(i-1)
s_k = -2^i
s_k = h_k, m_k = 0
m_k = h_k - s_k
k = k + 1
Decision: k > L - 1 (otherwise i = 0)
End]
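The flowchart for two's complement coding can be read as the following per-coefficient routine. This is a sketch of one plausible reading: the branch wiring is inferred from the block labels, the comparison in decision 1 is assumed to use the magnitude of h_k, and the handling of h_k = 0 is an added assumption.

```python
def segment_twos_complement(h):
    """Split h into (s, m) with h = s + m, where |s| is a power of two
    (applied by the shifter) and m is a small non-negative remainder
    (applied by the multiplier)."""
    if h == 0:
        return 0, 0                      # assumption: nothing to segment
    i = 0
    while (1 << i) < abs(h):             # decision 1: smallest i with 2^i >= |h|
        i += 1
    if (1 << i) == abs(h):               # decision 2: h is itself a power of two
        return h, 0
    if h < 0:                            # decision 3
        s = -(1 << i)                    # s = -2^i, so m = h + 2^i >= 0
    else:
        s = 1 << (i - 1)                 # s = 2^(i-1), so m = h - 2^(i-1) >= 0
    return s, h - s


def segment_set(coeffs):
    """Apply the routine to a whole coefficient set H, giving sets S and M."""
    pairs = [segment_twos_complement(h) for h in coeffs]
    return [s for s, _ in pairs], [m for _, m in pairs]
```

Under this reading every remainder m_k is non-negative, so the upper bits of consecutive multiplier operands stay at zero instead of toggling with the sign extension.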
Coefficient Segmentation Algorithm for Sign-Magnitude Coding
[Flowchart; blocks as extracted:
Begin
H = (h_0, h_1, ..., h_{L-1})
i = 0, k = 0
Decision 1: 2^i >= h_k (otherwise i = i + 1)
Decision 2: h_k - 2^i < h_k - 2^(i-1)
m_k = h_k - 2^(i-1), s_k = 2^(i-1)
m_k = h_k - 2^i, s_k = 2^i
Decision 3: h_k < 0
m_k = -m_k, s_k = -s_k
k = k + 1
Decision: k > L - 1 (otherwise i = 0)
End]
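For sign-magnitude coding the flowchart appears to pick the power of two nearest to the coefficient magnitude and then restore the sign. The sketch below follows that reading; the use of magnitudes in decisions 1 and 2, the branch wiring and the treatment of h_k = 0 are assumptions.

```python
def segment_sign_magnitude(h):
    """Split h into (s, m) with h = s + m, where |s| is the power of two
    nearest to |h|; ties go to the lower power, matching the strict '<'."""
    mag = abs(h)
    if mag == 0:
        return 0, 0                      # assumption: nothing to segment
    i = 0
    while (1 << i) < mag:                # decision 1: smallest i with 2^i >= |h|
        i += 1
    upper = 1 << i                       # 2^i
    lower = upper >> 1                   # 2^(i-1), only meaningful when i > 0
    # Decision 2: is 2^i strictly closer to |h| than 2^(i-1)?
    if i == 0 or (upper - mag) < (mag - lower):
        s = upper
    else:
        s = lower
    m = mag - s
    if h < 0:                            # decision 3: restore the sign
        s, m = -s, -m
    return s, m
```

Here m_k may be negative, which costs only a sign bit in sign-magnitude form while keeping its magnitude small.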
Total switching activity of H and M coefficient sets with Two’s Complement Coding
Total switching activity of H and M coefficient sets with Sign-Magnitude Coding
[Figure: block diagram; labels: Control, Coefficient Memory, Data Memory, MSB (coefficient), Multiplier (two’s), Add/Sub, Acc, Output; signal annotations: (two’s complement), (sign magnitude)]
Simplified Filter Architecture for Mixed-Mode Multiplication
[Figure: block diagram; labels: Control, Coefficient Memory, Data Memory, Multiplier (sign), Add, Acc, Sign two’s, Output; signal annotations: (sign magnitude), (sign magnitude)]
Simplified Filter Architecture for Sign-Magnitude Multiplication
[Figure: number of transitions per sample, scale 0 to 45, versus bit position (ticks b0 to b14); series: conventional, segmentation]
Example Switching Activity Distribution with Two’s Complement Coding (N=89, W=16)
[Figure: number of transitions per sample, scale 0 to 50, versus bit position (ticks b0 to b14); series: conventional, segmentation]
Example Switching Activity Distribution with Sign-Magnitude Coding (N=89, W=16)
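Multiplier size | Algorithm | Two’s complement swcap/sample (pF) | Two’s complement reduction (%) | Mixed mode swcap/sample (pF) | Mixed mode reduction (%) | Sign-magnitude swcap/sample (pF) | Sign-magnitude reduction (%)
8-bit | conventional | 497 | - | 294 | - | 162 | -
8-bit | segmentation | 236 | 52.52 | 222 | 24.49 | 81 | 50.00
16-bit | conventional | 3862 | - | 2511 | - | 2173 | -
16-bit | segmentation | 2058 | 46.71 | 1806 | 28.08 | 1452 | 33.18
24-bit | conventional | 14795 | - | 12281 | - | 11458 | -
24-bit | segmentation | 11051 | 25.31 | 10283 | 16.27 | 9367 | 18.25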
Power Reductions Achieved with Coefficient Segmentation
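The reduction percentages in the table are consistent with (conventional - segmentation) / conventional applied to the switched capacitance per sample. A short check using the table's own figures follows; the dictionary layout and names are illustrative.

```python
# Switched capacitance per sample in pF: (conventional, segmentation).
SWCAP_PF = {
    8:  {"twos": (497, 236),     "mixed": (294, 222),     "sign": (162, 81)},
    16: {"twos": (3862, 2058),   "mixed": (2511, 1806),   "sign": (2173, 1452)},
    24: {"twos": (14795, 11051), "mixed": (12281, 10283), "sign": (11458, 9367)},
}


def reduction_percent(conventional, segmentation):
    """Relative saving of segmentation over the conventional scheme, in %."""
    return 100.0 * (conventional - segmentation) / conventional


if __name__ == "__main__":
    for width, row in SWCAP_PF.items():
        for coding, (conv, seg) in row.items():
            print(f"{width}-bit {coding}: {reduction_percent(conv, seg):.2f} %")
```

The relative saving shrinks as the multiplier grows from 8 to 24 bits, although the absolute saving in pF increases.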
[Figure: bar chart of switched capacitance (pF), scale 0 to 4000, versus data representation (twos, mixed, sign); series: conventional, segmentation; annotated reductions: 47%, 35%, 53%, 44%, 62%]
Power Reduction in Multiplier Circuit (wordlength = 16 bit)
[Figure: bar chart of switched capacitance (pF), scale 0 to 4000, versus data representation (twos, mixed, sign); series: multiplier, shifter; annotated reductions: 46%, 35%, 51%, 44%, 61%]
Power Reduction (wordlength = 16 bit)
Power Reduction at Coefficient Bus (wordlength = 16 bit)
[Figure: bar chart of switched capacitance (pF), scale 0 to 400, versus data representation (twos, mixed, sign); series: conventional, segmentation; annotated reductions: 49%, 37%, 54%, 37%, 54%]
DCT Implementation Scheme
2-D DCT Implementation Approach
Simplified Architecture of the DCT Processor
Conventional Programmable FIR Filter Architecture
TDF with Coefficient Ordering Programmable FIR Filter Architecture
Power Reduction (%)
[Figure: IP1 block; ports: tNC, Reset, Load, Clock, Data, Coefficient, Output, Of/Uf]
Top View of IP1
Block Report for IP1
[Figure: IP2 block; ports: tNC, Reset, Load, Clock, Data, Coefficient, Output, Of/Uf]
Top View of IP2
Block Report for IP2
[Figure: IP3 block; ports: tNC, Reset, Load, Clock, Data, Coefficient Word, Output, Of/Uf]
Top View of IP3
Block Report for IP3
[Figure: bar chart of area, scale 0 to 16000, versus wordlength (8-bit, 16-bit, 24-bit); series: IP1, IP2, IP3]
Area Comparison
Top View of IP4
[Figure: IP4 block; ports: tNC, Reset, Load, Clock, Data, Coefficient, Block Size, Output, Of/Uf]
[Figure: IP5 block; ports: tNC, Reset, Load, Clock, Data, Coefficient Word, Output, Of/Uf]
Top View of IP5
Top View of IP6
Case Study: a 34-tap bandpass filter
Area and Power Characteristics for the Example Filter
Conclusions
• A methodology for Low Power Implementation of DSP functions has been presented.
• The methodology has been used to develop a number of IPs.
• Significant reductions in power are reported.
• Power reduction is achieved in the multiplier and system buses.
• Methodology can be used for prototyping other DSP functions.