Arithmetic Building Blocks
Low-Power Design
Lecture 12
18-322 Fall 2002
Textbook: [Chapter 7, Section 4.4]
Outline
- Arithmetic building blocks
  - Adders
  - Multipliers
- Low-power design
  - Reducing power consumption
  - Data-path/Control circuitry
A Generic Digital Processor
[Figure: block diagram of a generic digital processor with MEMORY, DATAPATH, CONTROL, and INPUT-OUTPUT blocks]
Our focus today. Also, main emphasis in the project!
The Binary Adder
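$S = A \oplus B \oplus C_i$
$\;\;\; = A\bar{B}\bar{C_i} + \bar{A}B\bar{C_i} + \bar{A}\bar{B}C_i + ABC_i$
$C_o = AB + BC_i + AC_i$
[Figure: full adder block with inputs A, B, and Cin and outputs Sum and Cout]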
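The sum and carry equations can be checked against plain integer addition. A minimal sketch; the function names are illustrative and not part of the lecture:

```python
from itertools import product

def full_adder(a: int, b: int, ci: int) -> tuple[int, int]:
    """Return (sum, carry_out) for one-bit inputs a, b, ci."""
    s = a ^ b ^ ci                          # S = A xor B xor Ci
    co = (a & b) | (b & ci) | (a & ci)      # Co = AB + BCi + ACi
    return s, co

def full_adder_truth_table():
    """List every input combination together with its (S, Co) outputs."""
    return [(a, b, ci, *full_adder(a, b, ci))
            for a, b, ci in product((0, 1), repeat=3)]

if __name__ == "__main__":
    for row in full_adder_truth_table():
        print(row)
```

For every row, S + 2·Co equals A + B + Ci, which is the defining property of a full adder.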
Static CMOS Full Adder: Is this a great implementation?
[Figure: static CMOS full adder schematic with inputs A, B, Ci, internal node X, and outputs S and Co; 28 transistors]
$C_o = AB + (A + B)C_i$
$S = ABC_i + \bar{C_o}(A + B + C_i)$
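The factored expressions used by this implementation reuse the complemented carry inside the sum term. The sketch below checks by exhaustive enumeration that they agree with the canonical full-adder equations; the helper names are assumptions made for illustration.

```python
from itertools import product

def cmos_full_adder(a, b, ci):
    """Factored form: Co = AB + (A+B)Ci, S = ABCi + not(Co)(A+B+Ci)."""
    co = (a & b) | ((a | b) & ci)
    co_bar = 1 - co                          # complemented carry
    s = (a & b & ci) | (co_bar & (a | b | ci))
    return s, co

def matches_reference():
    """True if the factored form equals S = A^B^Ci, Co = AB + BCi + ACi."""
    return all(
        cmos_full_adder(a, b, ci) == (a ^ b ^ ci, (a & b) | (b & ci) | (a & ci))
        for a, b, ci in product((0, 1), repeat=3)
    )
```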
The Ripple-Carry Adder
[Figure: four full adders (FA) in a chain. Bit k takes Ak, Bk, and Ci,k and produces Sk and Co,k; each carry-out feeds the next carry-in, e.g. Co,0 = Ci,1]
Worst case delay linear with the number of bits
$t_{adder} \approx (N-1)\,t_{carry} + t_{sum}$
$t_d = O(N)$
Goal: Make the fastest possible carry path circuit
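A bit-level sketch of the ripple-carry structure and its delay estimate. The helper names and the little-endian bit-list representation are choices made here for illustration only.

```python
def ripple_carry_add(a_bits, b_bits, c_in=0):
    """Add two little-endian bit lists; return (sum_bits, carry_out)."""
    sum_bits, carry = [], c_in
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (b & carry) | (a & carry)   # carry ripples to the next bit
    return sum_bits, carry

def to_bits(value, n):
    """Little-endian list of the n low bits of value."""
    return [(value >> k) & 1 for k in range(n)]

def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << k for k, bit in enumerate(bits))

def ripple_delay(n, t_carry, t_sum):
    """Worst-case delay: (N - 1) * t_carry + t_sum."""
    return (n - 1) * t_carry + t_sum
```

Because the carry must pass through every stage in the worst case, `ripple_delay` grows linearly with `n`, which is why the carry path is the part worth optimizing.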
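Inversion Property
[Figure: a full adder (FA) with inputs A, B, Ci and outputs S, Co, shown next to the same cell with all inputs and outputs inverted; the two are equivalent]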
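The inversion property says that inverting all three inputs of a full adder inverts both outputs. A short exhaustive check, written as a sketch with illustrative names:

```python
from itertools import product

def full_adder(a, b, ci):
    """Return (S, Co) for one-bit inputs."""
    return a ^ b ^ ci, (a & b) | (b & ci) | (a & ci)

def inversion_property_holds():
    """Check FA(not A, not B, not Ci) == (not S, not Co) for all inputs."""
    for a, b, ci in product((0, 1), repeat=3):
        s, co = full_adder(a, b, ci)
        s_inv, co_inv = full_adder(1 - a, 1 - b, 1 - ci)
        if (s_inv, co_inv) != (1 - s, 1 - co):
            return False
    return True
```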
Minimize Critical Path by Reducing the Number of Inverting Stages
[Figure: four-bit ripple-carry adder built from alternating Even Cell and Odd Cell FA' stages, with inputs A0..A3, B0..B3, Ci,0, sums S0..S3, and carries Co,0..Co,3]
Exploit Inversion Property
Note: Needs 2 different types of cells
Express Sum and Carry using P, G, D
Define 3 new variables which ONLY depend on A, B:
Generate: $G = AB$
Propagate: $P = A \oplus B$
Delete: $D = \bar{A}\,\bar{B}$
Can also derive expressions for S and Co based on D and P
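A sketch of the three signals and of a full adder rebuilt from them. The carry uses Co = G + P·Ci as given on the mirror-adder slide; S = P ⊕ Ci follows directly from S = A ⊕ B ⊕ Ci. Function names are illustrative.

```python
from itertools import product

def pgd(a, b):
    """Return (propagate, generate, delete) for one bit position."""
    g = a & b
    p = a ^ b
    d = (1 - a) & (1 - b)
    return p, g, d

def adder_from_pg(a, b, ci):
    """Full adder expressed through P and G only."""
    p, g, _ = pgd(a, b)
    return p ^ ci, g | (p & ci)     # S = P xor Ci, Co = G + P*Ci

def exactly_one_signal_active():
    """For any A, B exactly one of P, G, D is 1."""
    return all(sum(pgd(a, b)) == 1 for a, b in product((0, 1), repeat=2))
```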
A better structure: the Mirror Adder
[Figure: mirror adder schematic, 24 transistors, with Kill, Generate, "0"-Propagate, and "1"-Propagate paths, signals P, G, D, and outputs S and Co]
$C_o = G + P\,C_i$
Carry-Bypass Adder
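[Figure: two versions of a four-bit adder block. Top: four FA cells with inputs P0 G0, P1 G1, P2 G2, P3 G3 and carries Ci,0, Co,0, Co,1, Co,2, Co,3. Bottom: the same chain followed by a multiplexer, controlled by BP, that selects between the rippled Co,3 and Ci,0]
$BP = P_0 P_1 P_2 P_3$
Idea: if (P0 and P1 and P2 and P3) = 1, then Co,3 = Ci,0; otherwise the carry is killed or generated inside the block.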
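A behavioral sketch of one bypass block. It models only the logic, not the timing; the names are hypothetical. The two checks express the idea on the slide: with all propagate signals high the rippled carry equals the incoming carry, and otherwise the block's carry-out does not depend on the incoming carry at all.

```python
from itertools import product

def block_carry_ripple(a_bits, b_bits, c_in):
    """Carry out of a block computed by rippling through every bit."""
    carry = c_in
    for a, b in zip(a_bits, b_bits):
        carry = (a & b) | ((a ^ b) & carry)      # Co = G + P*Ci
    return carry

def block_carry_bypass(a_bits, b_bits, c_in):
    """Carry out with a bypass multiplexer controlled by BP."""
    bp = all(a ^ b for a, b in zip(a_bits, b_bits))
    if bp:
        return c_in                               # fast path: Co,3 = Ci,0
    return block_carry_ripple(a_bits, b_bits, c_in)

def bypass_is_correct(m=4):
    """Compare bypass and ripple results for all m-bit operand pairs."""
    for bits in product((0, 1), repeat=2 * m):
        a_bits, b_bits = bits[:m], bits[m:]
        for c_in in (0, 1):
            if block_carry_bypass(a_bits, b_bits, c_in) != \
               block_carry_ripple(a_bits, b_bits, c_in):
                return False
    return True
```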
Carry-Bypass Adder (cont.)
[Figure: 16-bit carry-bypass adder (N bits in total) split into four M-bit blocks (bits 0-3, 4-7, 8-11, 12-15); each block has Setup, Carry Propagation, and Sum stages, and Ci,0 enters the first block]
Note: the worst-case delay of the topological path is much higher than that of the true critical path!
Carry Ripple vs. Carry Bypass
[Figure: propagation delay tp versus number of bits N for the ripple adder and the bypass adder; the curves cross at N of about 4..8]
For small values of N, RCA is actually faster!
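A rough way to see the crossover is to compare first-order delay estimates. The ripple formula is the one given earlier in the lecture. The bypass formula below is the usual textbook estimate and is not stated on these slides, and all unit delays are hypothetical, so the exact crossover point will differ from the plotted one.

```python
def ripple_delay(n, t_carry=1.0, t_sum=1.0):
    """(N - 1) * t_carry + t_sum."""
    return (n - 1) * t_carry + t_sum

def bypass_delay(n, m=4, t_setup=1.0, t_carry=1.0, t_bypass=1.0, t_sum=1.0):
    """Assumed estimate: setup, ripple through the first block, one bypass
    per remaining block, ripple inside the last block, then the sum."""
    if n % m:
        raise ValueError("n must be a multiple of the block size m")
    return (t_setup + m * t_carry + (n // m - 1) * t_bypass
            + (m - 1) * t_carry + t_sum)

if __name__ == "__main__":
    for n in (4, 8, 16, 32, 64):
        print(n, ripple_delay(n), bypass_delay(n))
```

The fixed overhead of the bypass structure dominates for small N, while its smaller slope wins for large N.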
Carry-Select Adder
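[Figure: one block of a carry-select adder. A Setup stage produces P,G; a "0" Carry Propagation chain and a "1" Carry Propagation chain run in parallel; a Multiplexer driven by Co,k-1 selects the carry vector and Co,k+3; Sum Generation follows]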
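A behavioral sketch of the carry-select idea: each block computes its result for both possible incoming carries, and the real carry only selects between them. The 16-bit width and 4-bit blocks mirror the lecture's example; the function names are illustrative.

```python
def ripple_block(a_bits, b_bits, c_in):
    """Ripple through one block; return (sum_bits, carry_out)."""
    sums, carry = [], c_in
    for a, b in zip(a_bits, b_bits):
        sums.append(a ^ b ^ carry)
        carry = (a & b) | ((a ^ b) & carry)
    return sums, carry

def carry_select_add(a, b, n=16, m=4, c_in=0):
    """Add two n-bit integers using m-bit carry-select blocks."""
    a_bits = [(a >> k) & 1 for k in range(n)]
    b_bits = [(b >> k) & 1 for k in range(n)]
    result, carry = 0, c_in
    for start in range(0, n, m):
        blk_a, blk_b = a_bits[start:start + m], b_bits[start:start + m]
        option0 = ripple_block(blk_a, blk_b, 0)   # "0" carry propagation
        option1 = ripple_block(blk_a, blk_b, 1)   # "1" carry propagation
        sums, carry = option1 if carry else option0   # multiplexer
        for k, bit in enumerate(sums):
            result |= bit << (start + k)
    return result, carry
```

In hardware both options are evaluated concurrently, so only the multiplexer delay is added per block once the first block has settled.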
Carry Select Adder: Critical Path
[Figure: 16-bit carry-select adder built from four blocks (bits 0-3, 4-7, 8-11, 12-15), each with Setup, "0" Carry, "1" Carry, Multiplexer, and Sum Generation stages; block carries Ci,0, Co,3, Co,7, Co,11, Co,15 and sums S0-3, S4-7, S8-11, S12-15]
Adder Delays - Comparison
[Figure: propagation delay tp (0 to 50) versus number of bits N (0 to 60) for the ripple adder and the linear select adder]
Look Ahead - Basic Idea
[Figure: N bit slices with inputs A0,B0, A1,B1, ..., AN-1,BN-1; slice k is labeled with Ci,k and Pk]
$C_{o,k} = G_k + P_k C_{o,k-1}$
$C_{o,k} = G_k + P_k (G_{k-1} + P_{k-1}(\dots + P_1 (G_0 + P_0 C_{i,0})))$
The look-ahead structure is useful for small values of N (N < 4)
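The expanded expression computes every carry directly from the G and P signals and Ci,0, without waiting for the previous carry. A sketch that evaluates the flattened sum-of-products form and a rippled reference for comparison; names are illustrative.

```python
def lookahead_carries(a_bits, b_bits, c_in):
    """Carry out of every bit, each from the fully expanded expression."""
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a ^ b for a, b in zip(a_bits, b_bits)]
    carries = []
    for k in range(len(g)):
        # Co,k = Gk + Pk*Gk-1 + Pk*Pk-1*Gk-2 + ... + Pk*...*P0*Ci,0
        prefix, c = 1, 0
        for j in range(k, -1, -1):
            c |= prefix & g[j]
            prefix &= p[j]
        c |= prefix & c_in
        carries.append(c)
    return carries

def ripple_carries(a_bits, b_bits, c_in):
    """Reference: carries from the recursive form Co,k = Gk + Pk*Co,k-1."""
    carries, carry = [], c_in
    for a, b in zip(a_bits, b_bits):
        carry = (a & b) | ((a ^ b) & carry)
        carries.append(carry)
    return carries
```

The number of product terms, and the fan-in of each gate, grows with k, which is one way to see why the flat structure is only attractive for small N.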
Design as a Trade-Off
[Figure: two plots versus number of bits N (0 to 20) for the look-ahead, select, bypass, manchester, mirror, and static adders: propagation delay tp (0 to 80 nsec) and area (0 to 0.4 mm^2)]
The Binary Multiplication
              1 0 1 0 1 0    multiplicand
          ×       1 0 1 1    multiplier
              -----------
              1 0 1 0 1 0
            1 0 1 0 1 0      partial products (each corresponds
          0 0 0 0 0 0        to an AND operation)
      + 1 0 1 0 1 0
        -----------------
        1 1 1 0 0 1 1 1 0    result
- Multiplications are expensive and slow operations
- Performance is often dominated by the speed at which multiplications can be executed
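The worked example can be reproduced with a shift-and-add sketch: each partial product is the multiplicand ANDed with one multiplier bit and shifted to that bit's weight. Function names are illustrative.

```python
def partial_products(multiplicand, multiplier, n_bits):
    """One shifted partial product per multiplier bit (LSB first)."""
    return [
        (multiplicand if (multiplier >> k) & 1 else 0) << k
        for k in range(n_bits)
    ]

def shift_add_multiply(multiplicand, multiplier, n_bits):
    """Sum of all partial products."""
    return sum(partial_products(multiplicand, multiplier, n_bits))

if __name__ == "__main__":
    product_value = shift_add_multiply(0b101010, 0b1011, 4)
    print(bin(product_value))
```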
The Array Multiplier
[Figure: 4x4 array multiplier. Each row ANDs X0..X3 with one multiplier bit Y0..Y3; three rows of adders (HA FA FA HA, then FA FA FA HA twice) produce the product bits Z0..Z7]
Very efficient layout!
The MxN Array Multiplier: Critical Path
[Figure: array of HA and FA cells with Critical Path 1, Critical Path 2, and a shared segment (Critical Path 1 & 2) highlighted; annotation: "Optimize this!"]
Optimization is very difficult since several critical paths exist!
Carry-Save Multiplier
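[Figure: carry-save multiplier. Rows of HA and FA cells pass their carries down to the next row instead of along the same row; a Vector Merging Adder at the bottom produces the final result]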
Optimization becomes easier (unique critical path)!
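A word-level sketch of the carry-save idea: partial products are accumulated as a separate sum vector and carry vector, with no carry propagation inside the array, and a single final addition plays the role of the vector merging adder. Names are illustrative, and Python integers stand in for the bit vectors.

```python
def carry_save_add(x, y, z):
    """3:2 compression: return (sum_vector, carry_vector) with s + c == x + y + z."""
    s = x ^ y ^ z
    c = ((x & y) | (y & z) | (x & z)) << 1   # carries move one position up
    return s, c

def carry_save_multiply(multiplicand, multiplier, n_bits):
    """Accumulate partial products in carry-save form, then merge once."""
    sum_vec, carry_vec = 0, 0
    for k in range(n_bits):
        pp = (multiplicand if (multiplier >> k) & 1 else 0) << k
        sum_vec, carry_vec = carry_save_add(sum_vec, carry_vec, pp)
    return sum_vec + carry_vec               # vector merging adder
```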
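Wallace-Tree Multiplier
[Figure: two ways of adding six equal-weight bits y0..y5 with four full adders (FA): a linear chain and a tree. Carries Ci-1 come in from the previous column and carries Ci go out to the next; the outputs are S and C]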
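The benefit of the tree arrangement can be sketched at word level: the same number of 3:2 compressors is used, but they are arranged in fewer levels. This is an illustration with hypothetical names, operating on whole integers rather than on the single bit column shown in the figure.

```python
def carry_save_add(x, y, z):
    """3:2 compression: s + c == x + y + z."""
    return x ^ y ^ z, ((x & y) | (y & z) | (x & z)) << 1

def wallace_reduce(operands):
    """Reduce operands to two rows; return (rows, levels, adders_used)."""
    rows, levels, adders = list(operands), 0, 0
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3      # rows that form complete groups of 3
        next_rows = []
        for i in range(0, full, 3):
            s, c = carry_save_add(*rows[i:i + 3])
            next_rows += [s, c]
            adders += 1
        next_rows += rows[full:]              # leftovers pass to the next level
        rows, levels = next_rows, levels + 1
    return rows, levels, adders
```

For six operands this uses four compressors in three levels; chaining the same four compressors one after another would take four levels.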
Multipliers: Summary
- Optimization goals different compared to the binary adder
- Once again: identify the critical path
- Other possible techniques
  - Data encoding (Booth)
  - Pipelining
  - Logarithmic versus linear (Wallace-tree multiplier)
Outline
- Arithmetic building blocks
  - Adders
  - Multipliers
- Low-power design
  - Reducing power consumption
  - Data-path/Control circuitry
How about POWER? Ways to reduce power consumption
- Load capacitance (CL)
  - Roughly proportional to the chip area
- Switching activity (avg. number of transitions/cycle)
  - Very data dependent
  - A big portion due to glitches (real-delay)
- Clock frequency (f)
  - Lowering only f decreases average power, but total energy is the same and throughput is worse
- Voltage supply (VDD)
  - Biggest impact
[Figure: NORMALIZED DELAY (1.00 to 7.50) versus Vdd (2.00 to 6.00 volts) for an adder (SPICE), a microcoded DSP chip, a multiplier, an adder, a ring oscillator, and a clock generator; 2.0 µm technology]
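A first-order sketch of the four knobs, assuming the usual dynamic power model P = activity × CL × VDD² × f. The parameter names and the numeric values in the usage note are illustrative, not taken from the lecture.

```python
def dynamic_power(c_load, v_dd, freq, activity=1.0):
    """Average switching power: activity * C * VDD^2 * f."""
    return activity * c_load * v_dd ** 2 * freq

def energy_per_operation(c_load, v_dd, activity=1.0):
    """Energy per clock cycle; independent of the clock frequency."""
    return activity * c_load * v_dd ** 2
```

With `c_load=2.0` and `v_dd=5.0`, halving `freq` from 4.0 to 2.0 halves the power while `energy_per_operation` is unchanged; halving `v_dd` instead cuts both by a factor of four, which is why the supply voltage has the biggest impact.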
Using parallelism (1)
$P_{ref} = C_{ref} V_{DD}^2 f_{ref}$
Assume: tp = 25 ns (worst-case, all modules) at VDD = 5 V
Using parallelism (2)
- Cpar = 2.15 C (extra routing needed)
- fpar = f/2 (tp,new = ½ (50) ns => VDD ~ 2.9 V; VDD,par = 0.58 VDD)
- $P_{par} = C_{par} V_{DD,par}^2 f_{par} = 0.36\,P_{ref}$
Using pipelining
- Cpipe = 1.15 C
- Delay decreases 2 times (VDD,pipe = 0.58 VDD)
- Ppipe = 0.39 Pref
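The two power ratios follow from P = C·VDD²·f with each quantity expressed relative to the reference design. A quick arithmetic check; the function name is illustrative, and the pipelined case assumes the clock frequency stays at the reference value.

```python
def relative_power(c_ratio, v_ratio, f_ratio):
    """Power relative to the reference: (C/Cref) * (V/Vref)^2 * (f/fref)."""
    return c_ratio * v_ratio ** 2 * f_ratio

# Parallel datapath: 2.15x capacitance, 0.58x supply, half the clock rate.
parallel = relative_power(2.15, 0.58, 0.5)

# Pipelined datapath: 1.15x capacitance, 0.58x supply, same clock rate (assumed).
pipelined = relative_power(1.15, 0.58, 1.0)

if __name__ == "__main__":
    print(round(parallel, 2), round(pipelined, 2))
```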
Chain vs. balanced design
- Question for you:
  - Which of the two designs is more energy efficient?
    - Assume:
      - Zero-delay model
      - All inputs have a signal probability of 0.5
    - Hint: Calculate p0→1 for W, X and F
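One way to work through the hint, as a sketch under loudly stated assumptions: the figure is not reproduced here, so the code assumes both designs build a four-input AND from two-input AND gates, with W and X as internal nodes and F as the output, and with independent inputs. For a node with probability p of being 1, p0→1 = (1 - p)·p.

```python
from fractions import Fraction

def and_gate(p_a, p_b):
    """Probability that a 2-input AND of independent inputs outputs 1."""
    return p_a * p_b

def p_0_to_1(p_one):
    """Probability of a 0 -> 1 transition between two consecutive cycles."""
    return (1 - p_one) * p_one

def chain_activity(p=Fraction(1, 2)):
    """Assumed chain: W = A*B, X = W*C, F = X*D."""
    w = and_gate(p, p)
    x = and_gate(w, p)
    f = and_gate(x, p)
    return [p_0_to_1(node) for node in (w, x, f)]

def balanced_activity(p=Fraction(1, 2)):
    """Assumed balanced tree: W = A*B, X = C*D, F = W*X."""
    w = and_gate(p, p)
    x = and_gate(p, p)
    f = and_gate(w, x)
    return [p_0_to_1(node) for node in (w, x, f)]
```

Summing each list gives the total switching activity per design under the zero-delay model; glitching, which that model ignores, can change the comparison in a real circuit.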
Low energy gates: transistor sizing
- Use the smallest transistors that satisfy the delay constraints
  - Increasing transistor size improves the speed but it also increases power dissipation (since the load capacitances increase)
    - Slack time: difference between required time and arrival time of a signal at a gate output
      - Positive slack: size down
      - Negative slack: size up
- Make gates that toggle more frequently smaller
- Slope engineering to reduce short circuit currents
Low energy gate netlists: pin ordering
- Better to postpone the introduction of signals with a high transition rate (signals with signal probability close to 0.5)
Power×Delay trade-off for various adders
[Figure: power versus delay for the Ripple Carry Adder, Manchester Carry Chain, Constant width Carry Skip, Variable width Carry Skip, Brent & Kung, Carry Save, Carry Look Ahead, and ELM (sort of CLA)]
Use a design that is fast enough and consumes the least power!
Control circuits
- State encoding has a big impact on the power efficiency
- Energy driven -> try to minimize number of bit transitions in the state register
  - Fewer transitions in state register
  - Fewer transitions propagated to combinational logic
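To illustrate the effect of state encoding, the sketch below counts state-register bit flips for a hypothetical eight-state machine that steps through its states in a fixed cycle, once with binary codes and once with Gray codes. The machine and the names are assumptions made for the example.

```python
def hamming(x, y):
    """Number of bit positions in which x and y differ."""
    return bin(x ^ y).count("1")

def transitions_per_cycle(codes):
    """Total register bit flips for one pass through a cyclic state sequence."""
    return sum(
        hamming(codes[i], codes[(i + 1) % len(codes)])
        for i in range(len(codes))
    )

binary_codes = list(range(8))                   # 000, 001, 010, ...
gray_codes = [s ^ (s >> 1) for s in range(8)]   # 000, 001, 011, 010, ...

if __name__ == "__main__":
    print(transitions_per_cycle(binary_codes), transitions_per_cycle(gray_codes))
```

For this particular sequence the Gray assignment flips exactly one bit per step. For a general state graph the best assignment depends on how often each transition is taken.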
Bus encoding
- Reduces number of bit toggles on the bus
- Different flavors
  - Bus-invert coding
    - Uses an extra bus line, invert, on a K-bit bus:
      - if the number of transitions is < K/2, invert = 0 and the symbol is transmitted as is
      - if the number of transitions is > K/2, invert = 1 and the symbol is transmitted in a complemented form
  - Low-weight coding
    - Uses transition signaling instead of level signaling
[Figure: bus invert coding, with an Encoder driving the Bus and a Decoder at the receiving end]
Source: M.Stan et al., 1994
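A minimal sketch of bus-invert coding as described above. Several details are assumptions made for the example: the bus starts at all zeros, only the K data lines are counted (not the invert line itself), and a tie at exactly K/2 is sent without inversion.

```python
def bus_invert_encode(words, width):
    """Return a list of (bus_value, invert) pairs for a sequence of words."""
    mask = (1 << width) - 1
    prev_bus = 0                       # assumed initial bus state
    encoded = []
    for word in words:
        toggles = bin((word ^ prev_bus) & mask).count("1")
        invert = 1 if toggles > width / 2 else 0
        bus = (~word & mask) if invert else word
        encoded.append((bus, invert))
        prev_bus = bus
    return encoded

def bus_invert_decode(encoded, width):
    """Recover the original words from (bus_value, invert) pairs."""
    mask = (1 << width) - 1
    return [(~bus & mask) if invert else bus for bus, invert in encoded]
```

With this rule no transfer toggles more than K/2 data lines, at the cost of one extra wire and the encoder and decoder logic.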