Arithmetic Building Blocks Low-Power Design

Lecture 12, 18-322 Fall 2002

Textbook: [Chapter 7, Section 4.4]

Outline

- Arithmetic building blocks
  - Adders
  - Multipliers
- Low-power design
  - Reducing power consumption
  - Data-path/Control circuitry

A Generic Digital Processor

[Figure: block diagram with MEMORY, DATAPATH, CONTROL, and INPUT-OUTPUT blocks]

Our focus today. Also, main emphasis in the project!

The Binary Adder

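$S = A \oplus B \oplus C_i = A\bar{B}\bar{C_i} + \bar{A}B\bar{C_i} + \bar{A}\bar{B}C_i + ABC_i$

$C_o = AB + BC_i + AC_i$

[Figure: full-adder block with inputs A, B, Cin and outputs Sum, Cout]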
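
As a quick sanity check on the two equations above, the following Python sketch evaluates them for all eight input combinations and compares the result with ordinary integer addition. The function name `full_adder` is an illustrative choice, not something defined in the lecture.

```python
from itertools import product

def full_adder(a, b, ci):
    """Return (S, Co) for one-bit inputs a, b and carry-in ci."""
    s = a ^ b ^ ci                        # S  = A xor B xor Ci
    co = (a & b) | (b & ci) | (a & ci)    # Co = AB + BCi + ACi
    return s, co

# Check every input combination against ordinary integer addition.
for a, b, ci in product((0, 1), repeat=3):
    s, co = full_adder(a, b, ci)
    assert 2 * co + s == a + b + ci
    print(a, b, ci, "->", s, co)
```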

Static CMOS Full Adder: Is this a great implementation?

[Figure: static CMOS full-adder schematic with inputs A, B, Ci, internal node X, and outputs S and Co]

28 transistors

$C_o = AB + (A+B)C_i$

$S = ABC_i + \bar{C_o}(A+B+C_i)$
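
The factored expressions realized by this gate are logically equivalent to the textbook sum and carry. A minimal Python sketch of that check is shown here; it models only the Boolean function, not the transistor network, and the function names are illustrative.

```python
from itertools import product

def cmos_full_adder(a, b, ci):
    """Full adder in the factored form used by the 28-transistor gate."""
    co = (a & b) | ((a | b) & ci)               # Co = AB + (A+B)Ci
    not_co = 1 - co                             # complement of a single bit
    s = (a & b & ci) | (not_co & (a | b | ci))  # S = ABCi + not(Co)(A+B+Ci)
    return s, co

def matches_integer_addition():
    """True if 2*Co + S equals A + B + Ci for all eight input combinations."""
    return all(2 * co + s == a + b + ci
               for a, b, ci in product((0, 1), repeat=3)
               for s, co in [cmos_full_adder(a, b, ci)])

print(matches_integer_addition())
```

Reusing the complemented carry to build the sum is what saves transistors, at the cost of putting the carry gate in series with the sum gate.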

The Ripple-Carry Adder

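[Figure: four full adders (FA) in a chain; stage k takes Ak, Bk and produces Sk; the carry-out Co,k of each stage is the carry-in of the next (Co,0 = Ci,1), starting from Ci,0 and ending at Co,3]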

Worst-case delay is linear in the number of bits:

$t_{adder} \approx (N-1)\,t_{carry} + t_{sum}$

$t_d = O(N)$

Goal: Make the fastest possible carry path circuit
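
A behavioral sketch of the ripple-carry structure, in Python, makes the serial carry dependency explicit: each loop iteration needs the carry produced by the previous one. Bit lists are LSB first; the helper names `to_bits` and `from_bits` are assumptions made for this example.

```python
def full_adder(a, b, ci):
    """One-bit full adder: returns (sum, carry_out)."""
    return a ^ b ^ ci, (a & b) | (b & ci) | (a & ci)

def ripple_carry_add(a_bits, b_bits, ci=0):
    """Add two equal-length bit lists (LSB first); returns (sum_bits, carry_out)."""
    s_bits, carry = [], ci
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)   # carry ripples to the next stage
        s_bits.append(s)
    return s_bits, carry

def to_bits(x, n):
    """Integer to n-bit list, LSB first."""
    return [(x >> k) & 1 for k in range(n)]

def from_bits(bits):
    """Bit list (LSB first) back to an integer."""
    return sum(bit << k for k, bit in enumerate(bits))

s_bits, co = ripple_carry_add(to_bits(11, 4), to_bits(6, 4))
print(from_bits(s_bits), co)   # 11 + 6 = 17 = 16 + 1, so this prints: 1 1
```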

Inversion Property

[Figure: a full adder (FA) with all inputs and outputs inverted is equivalent to the original full adder]
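
The inversion property says that complementing all three inputs of a full adder complements both outputs. A short Python sketch, with an illustrative function name, checks this exhaustively:

```python
from itertools import product

def full_adder(a, b, ci):
    """One-bit full adder: returns (sum, carry_out)."""
    return a ^ b ^ ci, (a & b) | (b & ci) | (a & ci)

def inversion_property_holds():
    """True if FA(not A, not B, not Ci) == (not S, not Co) for all inputs."""
    for a, b, ci in product((0, 1), repeat=3):
        s, co = full_adder(a, b, ci)
        s_inv, co_inv = full_adder(1 - a, 1 - b, 1 - ci)
        if (s_inv, co_inv) != (1 - s, 1 - co):
            return False
    return True

print(inversion_property_holds())
```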

Minimize Critical Path by Reducing the Number of Inverting Stages

[Figure: 4-bit ripple-carry adder built from FA' cells, alternating even cells and odd cells, with inputs Ak, Bk, sums Sk, and carries Ci,0, Co,0 through Co,3]

Exploit Inversion Property

Note: Needs 2 different types of cells

Express Sum and Carry using P, G, D

Define 3 new variables which depend ONLY on A and B:

- Generate (G) = $AB$
- Propagate (P) = $A \oplus B$
- Delete (D) = $\bar{A}\bar{B}$

Can also derive expressions for S and Co based on D and P
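
One way to write those expressions is $C_o = G + PC_i$ and $S = P \oplus C_i$, or equivalently $\bar{C_o} = D + P\bar{C_i}$ in terms of D and P. The Python sketch below checks these forms against integer addition; the function names are illustrative assumptions.

```python
from itertools import product

def pgd(a, b):
    """Return (P, G, D) for one bit position; they depend only on a and b."""
    g = a & b                 # generate: a carry is created
    p = a ^ b                 # propagate: the incoming carry passes through
    d = (1 - a) & (1 - b)     # delete: the carry is killed
    return p, g, d

def sum_carry(p, g, ci):
    """S and Co expressed with G and P."""
    return p ^ ci, g | (p & ci)

def carry_from_d_and_p(p, d, ci):
    """Co expressed with D and P: not(Co) = D + P * not(Ci)."""
    return 1 - (d | (p & (1 - ci)))

for a, b, ci in product((0, 1), repeat=3):
    p, g, d = pgd(a, b)
    assert p + g + d == 1                      # exactly one of P, G, D is active
    s, co = sum_carry(p, g, ci)
    assert 2 * co + s == a + b + ci
    assert carry_from_d_and_p(p, d, ci) == co
```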

A better structure: the Mirror Adder

[Figure: mirror adder schematic with inputs A, B, Ci and outputs S, Co; transistor groups labeled Kill, Generate, "0"-Propagate, and "1"-Propagate]

24 transistors

$C_o = G + PC_i$

Carry-Bypass Adder

[Figure: a 4-bit block of FA cells with inputs P0 G0 through P3 G3 and carries Ci,0, Co,0, Co,1, Co,2, Co,3, shown first as a plain ripple chain and then with a multiplexer, controlled by BP = P0P1P2P3, that selects the block carry-out]

Idea: If (P0 and P1 and P2 and P3 = 1) then Co,3 = Ci,0, else "kill" or "generate".
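
The bypass idea can be sketched behaviorally in Python. The multiplexer never changes the logical result, since the rippled carry equals Ci,0 whenever all propagate signals are 1; it only shortens the path the carry takes. The function name and the returned `bp` flag are assumptions for this illustration.

```python
def bypass_block(a_bits, b_bits, ci):
    """4-bit (or any width) block with carry bypass; bit lists are LSB first.

    Returns (sum_bits, carry_out, bp), where bp = P0 P1 P2 P3.
    """
    p = [a ^ b for a, b in zip(a_bits, b_bits)]
    g = [a & b for a, b in zip(a_bits, b_bits)]
    s_bits, carry = [], ci
    for pk, gk in zip(p, g):
        s_bits.append(pk ^ carry)        # S_k = P_k xor C_i,k
        carry = gk | (pk & carry)        # C_o,k = G_k + P_k C_i,k
    bp = int(all(p))                     # block propagate
    co = ci if bp else carry             # multiplexer: bypass or rippled carry
    return s_bits, co, bp

# All-propagate case: 0101 + 1010 with carry-in 1 passes the carry straight through.
print(bypass_block([1, 0, 1, 0], [0, 1, 0, 1], 1))
```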

Carry-Bypass Adder (cont.)

[Figure: 16-bit carry-bypass adder as four 4-bit stages (Bit 0-3, Bit 4-7, Bit 8-11, Bit 12-15), each with Setup, Carry Propagation, and Sum blocks; Ci,0 enters the first stage]

Note: the topological path worst-case delay is much higher than the true critical path!

(N bits in total, split into bypass stages of M bits each)

Carry Ripple vs. Carry Bypass

[Plot: propagation delay tp versus number of bits N for the ripple adder and the bypass adder; the curves cross at N of about 4..8]

For small values of N, RCA is actually faster!
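
The crossover can be reproduced with a first-order delay model. The sketch below is an assumption-laden illustration: the ripple formula is the one given earlier in the lecture, the bypass formula is a common textbook-style estimate that is not stated in these slides, and all unit delays (`t_setup`, `t_carry`, `t_bypass`, `t_sum`) are hypothetical values of 1.0, so the exact crossover point will differ from the plot.

```python
def ripple_delay(n, t_carry=1.0, t_sum=1.0):
    """Ripple-carry estimate from the lecture: (N - 1) t_carry + t_sum."""
    return (n - 1) * t_carry + t_sum

def bypass_delay(n, m=4, t_setup=1.0, t_carry=1.0, t_bypass=1.0, t_sum=1.0):
    """Assumed first-order estimate for an N-bit adder with M-bit bypass stages.

    Ripple through the first stage, bypass the middle stages,
    ripple through the last stage, then form the final sum.
    """
    if n % m != 0:
        raise ValueError("n must be a multiple of the stage width m")
    stages = n // m
    return (t_setup + m * t_carry + (stages - 1) * t_bypass
            + (m - 1) * t_carry + t_sum)

for n in (4, 8, 16, 32, 64):
    print(n, ripple_delay(n), bypass_delay(n))
```

With these made-up unit delays the ripple adder wins at 4 and 8 bits and loses from 16 bits on, which matches the qualitative message of the slide.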

Carry-Select Adder

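[Figure: one carry-select stage; Setup produces P,G, a "0" Carry Propagation path and a "1" Carry Propagation path run in parallel, a Multiplexer driven by Co,k-1 selects the carry vector and Co,k+3, followed by Sum Generation]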

Carry Select Adder: Critical Path

[Figure: 16-bit carry-select adder as four 4-bit stages (Bit 0-3, Bit 4-7, Bit 8-11, Bit 12-15); each stage has Setup, "0" Carry, "1" Carry, Multiplexer, and Sum Generation blocks; the stages produce S0-3, S4-7, S8-11, S12-15, and the carry chain runs Ci,0, Co,3, Co,7, Co,11, Co,15]
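
A behavioral Python sketch of the carry-select scheme follows. Each stage computes its result for both possible carry-ins, and the incoming carry only drives the selection. In hardware the two computations run in parallel; here they are simply evaluated one after the other. Function names and the default stage width of 4 are assumptions for the example.

```python
def ripple_block(a_bits, b_bits, ci):
    """Ripple through one stage; bit lists are LSB first."""
    s_bits, carry = [], ci
    for a, b in zip(a_bits, b_bits):
        p, g = a ^ b, a & b
        s_bits.append(p ^ carry)
        carry = g | (p & carry)
    return s_bits, carry

def carry_select_add(a_bits, b_bits, ci=0, block=4):
    """Carry-select addition over equal-length bit lists (LSB first)."""
    s_bits, carry = [], ci
    for start in range(0, len(a_bits), block):
        a_blk = a_bits[start:start + block]
        b_blk = b_bits[start:start + block]
        s0, c0 = ripple_block(a_blk, b_blk, 0)   # "0" carry path
        s1, c1 = ripple_block(a_blk, b_blk, 1)   # "1" carry path
        s_sel, carry = (s1, c1) if carry else (s0, c0)   # multiplexer
        s_bits.extend(s_sel)
    return s_bits, carry

def to_bits(x, n):
    return [(x >> k) & 1 for k in range(n)]

def from_bits(bits):
    return sum(bit << k for k, bit in enumerate(bits))

s_bits, co = carry_select_add(to_bits(40000, 16), to_bits(30000, 16))
print(from_bits(s_bits), co)   # 70000 = 65536 + 4464, so this prints: 4464 1
```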

Adder Delays - Comparison

[Plot: propagation delay tp (0 to 50) versus number of bits N (0 to 60) for the ripple adder and the linear select adder]

Look Ahead - Basic Idea

[Figure: bit slices A0,B0 through AN-1,BN-1, each with carry-in Ci,k and propagate signal Pk]

$C_{o,k} = G_k + P_k C_{o,k-1}$

$C_{o,k} = G_k + P_k(G_{k-1} + P_{k-1}(\cdots + P_1(G_0 + P_0 C_{i,0})))$

The look-ahead structure is useful for small values of N (N<4)
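
To see why the expanded form grows quickly with N, the Python sketch below computes every carry directly from the P and G signals and Ci,0 (no carry depends on another carry) and compares the result with the recursive form. The function names are illustrative assumptions.

```python
from itertools import product

def carries_recursive(p, g, ci0):
    """C_o,k = G_k + P_k C_o,k-1, evaluated stage by stage."""
    out, c = [], ci0
    for pk, gk in zip(p, g):
        c = gk | (pk & c)
        out.append(c)
    return out

def carries_lookahead(p, g, ci0):
    """Fully expanded form: each carry is a flat OR of AND terms."""
    def all_of(bits):
        return int(all(bits))            # AND of a list; empty list gives 1
    out = []
    for k in range(len(p)):
        c = ci0 & all_of(p[0:k + 1])     # term P_k ... P_0 C_i,0
        for j in range(k + 1):
            c |= g[j] & all_of(p[j + 1:k + 1])   # term P_k ... P_j+1 G_j
        out.append(c)
    return out

# Exhaustive comparison for a 4-bit adder.
for bits in product((0, 1), repeat=9):
    a, b, ci0 = bits[0:4], bits[4:8], bits[8]
    p = [x ^ y for x, y in zip(a, b)]
    g = [x & y for x, y in zip(a, b)]
    assert carries_lookahead(p, g, ci0) == carries_recursive(p, g, ci0)
```

The carry for bit k needs k + 2 product terms, the largest with k + 2 inputs, which is the fan-in growth that limits the flat structure to small N.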

Design as a Trade-Off

[Plots: propagation delay tp (nsec, 0 to 80) and area (mm², 0 to 0.4) versus number of bits N (0 to 20) for the look-ahead, select, bypass, manchester, mirror, and static adders]

The Binary Multiplication

          1 0 1 0 1 0      multiplicand
  ×           1 0 1 1      multiplier
    -----------------
          1 0 1 0 1 0
        1 0 1 0 1 0        partial products (each corresponds to an AND operation)
      0 0 0 0 0 0
  + 1 0 1 0 1 0
    -----------------
    1 1 1 0 0 1 1 1 0

- Multiplications are expensive and slow operations
- Performance is often dominated by the speed at which multiplications can be executed
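
The worked example above can be reproduced with a small shift-and-add sketch in Python. Each partial product is the multiplicand ANDed with one multiplier bit and shifted to that bit's position; the function name is an illustrative assumption.

```python
def shift_add_multiply(multiplicand, multiplier):
    """Return (partial_products, product) for non-negative integers."""
    partials = []
    k = 0
    while multiplier >> k:
        bit = (multiplier >> k) & 1
        # AND with one multiplier bit, then shift into position k.
        partials.append((multiplicand if bit else 0) << k)
        k += 1
    return partials, sum(partials)

partials, product = shift_add_multiply(0b101010, 0b1011)
for pp in partials:
    print(format(pp, "09b"))
print(format(product, "09b"))   # 42 * 11 = 462, so this prints: 111001110
```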

The Array Multiplier

[Figure: 4x4 array multiplier; rows of HA and FA cells add the partial products X0..X3 · Yk for Y0 through Y3 and produce the result bits Z0 through Z7]

Very efficient layout!

The MxN Array Multiplier: Critical Path

[Figure: array of HA and FA cells with Critical Path 1 and Critical Path 2 highlighted]

Optimization is very difficult since there exist several critical paths!

Optimize this!

Carry-Save Multiplier

[Figure: carry-save multiplier; an array of HA and FA cells followed by a Vector Merging Adder]

Optimization becomes easier (unique critical path)!
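
A word-level Python sketch of the carry-save idea: three operands are reduced to a sum vector and a carry vector without propagating any carry sideways, and only the final vector-merging addition performs a real carry-propagate add. This models the arithmetic, not the cell array in the figure, and the function names are assumptions.

```python
def carry_save_step(x, y, z):
    """Reduce three operands to (sum_vector, carry_vector); x + y + z is preserved."""
    s = x ^ y ^ z                               # per-bit sum, no carry propagation
    c = ((x & y) | (y & z) | (x & z)) << 1      # per-bit carry, moved up one position
    return s, c

def carry_save_multiply(x, y):
    """Multiply non-negative integers by carry-save accumulation of partial products."""
    s, c = 0, 0
    k = 0
    while y >> k:
        partial = (x if (y >> k) & 1 else 0) << k
        s, c = carry_save_step(s, c, partial)
        k += 1
    return s + c                                # vector-merging adder

print(carry_save_multiply(0b101010, 0b1011))    # prints: 462
```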

Wallace-Tree Multiplier

[Figure: two arrangements of four FA cells that add the bits y0 through y5, with carry inputs Ci-1, carry outputs Ci, and outputs C and S; the first connects the adders in a chain, the second in a tree]

Multipliers: Summary

- Optimization goals are different compared to the binary adder
- Once again: identify the critical path
- Other possible techniques
  - Data encoding (Booth)
  - Pipelining
  - Logarithmic versus linear (Wallace tree multiplier)

Outline

- Arithmetic building blocks (covered)
  - Adders
  - Multipliers
- Low-power design
  - Reducing power consumption
  - Data-path/Control circuitry

How about POWER? Ways to reduce power consumption

- Load capacitance (CL)
  - Roughly proportional to the chip area
- Switching activity (avg. number of transitions/cycle)
  - Very data dependent
  - A big portion is due to glitches (real-delay)
- Clock frequency (f)
  - Lowering only f decreases average power, but total energy is the same and throughput is worse

[Plot: normalized delay (1.00 to 7.50) versus Vdd (2.00 to 6.00 volts) in a 2.0 µm technology, with curves for an adder (SPICE), a microcoded DSP chip, a multiplier, an adder, a ring oscillator, and a clock generator]

- Voltage supply (VDD): biggest impact
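
The frequency and supply-voltage points can be made concrete with the dynamic power relation $P = C V_{DD}^2 f$. The Python sketch below is illustrative only: the capacitance, voltage, frequency, and cycle count are made-up numbers, and the `activity` factor is an added assumption for the switching-activity term.

```python
import math

def dynamic_power(c_load, vdd, freq, activity=1.0):
    """Average dynamic power: activity * C * VDD^2 * f."""
    return activity * c_load * vdd ** 2 * freq

def task_energy(c_load, vdd, freq, cycles, activity=1.0):
    """Energy for a task that needs a fixed number of clock cycles."""
    run_time = cycles / freq
    return dynamic_power(c_load, vdd, freq, activity) * run_time

C, VDD, F, CYCLES = 1e-9, 5.0, 40e6, 1_000_000   # hypothetical values

# Halving f alone halves the power but leaves the energy per task unchanged.
print(dynamic_power(C, VDD, F), dynamic_power(C, VDD, F / 2))
print(task_energy(C, VDD, F, CYCLES), task_energy(C, VDD, F / 2, CYCLES))

# Halving VDD cuts both power and energy by a factor of four.
print(dynamic_power(C, VDD / 2, F), task_energy(C, VDD / 2, F, CYCLES))
```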

Using parallelism (1)

$P_{ref} = C_{ref} V_{DD}^2 f_{ref}$

Assume: tp = 25ns (worst-case, all modules) at VDD = 5V

Using parallelism (2)

- $C_{par} = 2.15\,C$ (extra routing needed)
- $f_{par} = f/2$ ($t_{p,new}$ = ½ (50) ns => $V_{DD} \approx 2.9$ V; $V_{DD,par} = 0.58\,V_{DD}$)
- $P_{par} = C_{par} V_{DD,par}^2 f_{par} = 0.36\,P_{ref}$

Using pipelining

- $C_{pipe} = 1.15\,C$
- Delay decreases 2 times ($V_{DD,pipe} = 0.58\,V_{DD}$)
- $P_{pipe} = 0.39\,P_{ref}$
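
The parallel and pipelined power figures both follow from scaling each factor of $P = C V_{DD}^2 f$ relative to the reference design. A minimal Python sketch of the arithmetic is given here; the function name is illustrative, and the pipelined design is assumed to keep the reference clock frequency.

```python
def relative_power(c_ratio, vdd_ratio, f_ratio):
    """Power relative to the reference: (C/Cref) * (VDD/VDDref)^2 * (f/fref)."""
    return c_ratio * vdd_ratio ** 2 * f_ratio

# Parallel: 2.15x capacitance, 0.58x supply voltage, half the clock frequency.
p_par = relative_power(2.15, 0.58, 0.5)

# Pipelined: 1.15x capacitance, 0.58x supply voltage, same clock frequency (assumed).
p_pipe = relative_power(1.15, 0.58, 1.0)

print(round(p_par, 2), round(p_pipe, 2))   # prints: 0.36 0.39
```

The supply ratio itself is 2.9 V / 5 V = 0.58, so most of the saving comes from the quadratic voltage term rather than from the frequency or capacitance terms.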

Chain vs. balanced design

- Question for you: which of the two designs is more energy efficient?
  - Assume:
    - Zero-delay model
    - All inputs have a signal probability of 0.5
  - Hint: calculate $p_{0 \to 1}$ for W, X and F
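
The hint can be worked through numerically. The two circuits are not reproduced in this text, so the Python sketch below assumes a 4-input AND function built from 2-input AND gates, once as a chain (W = AB, X = WC, F = XD) and once as a balanced tree (W = AB, X = CD, F = WX). That gate choice and the assumption of independent inputs are mine; only the node names W, X, F come from the slide.

```python
from fractions import Fraction

def and_output_probability(p_a, p_b):
    """Probability that a 2-input AND output is 1, for independent inputs."""
    return p_a * p_b

def p_0_to_1(p_one):
    """Probability of a 0 -> 1 transition under the zero-delay model."""
    return (1 - p_one) * p_one

def chain_activity(p_in=Fraction(1, 2)):
    w = and_output_probability(p_in, p_in)
    x = and_output_probability(w, p_in)
    f = and_output_probability(x, p_in)
    return sum(p_0_to_1(p) for p in (w, x, f))

def tree_activity(p_in=Fraction(1, 2)):
    w = and_output_probability(p_in, p_in)
    x = and_output_probability(p_in, p_in)
    f = and_output_probability(w, x)
    return sum(p_0_to_1(p) for p in (w, x, f))

print(chain_activity(), tree_activity())   # prints: 91/256 111/256
```

Under these assumptions the chain has the lower total switching activity. With real gate delays the chain is more prone to glitching, so the zero-delay answer is not the whole story.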

Low energy gates: transistor sizing

- Use the smallest transistors that satisfy the delay constraints
  - Increasing transistor size improves the speed but it also increases power dissipation (since the load capacitance increases)
    - Slack time: difference between required time and arrival time of a signal at a gate output
      - Positive slack: size down
      - Negative slack: size up
- Make gates that toggle more frequently smaller
- Slope engineering to reduce short circuit currents

Low energy gate netlists: pin ordering

- Better to postpone the introduction of signals with a high transition rate (signals with signal probability close to 0.5)

Power×Delay trade-off for various adders

[Figure: power and delay comparison of the following adders]

- Ripple Carry Adder
- Manchester Carry Chain
- Constant width Carry Skip
- Variable width Carry Skip
- Brent & Kung
- Carry Save
- Carry Look Ahead
- ELM (sort of CLA)

Use a design that is fast enough and consumes the least power!

Control circuits

- State encoding has a big impact on the power efficiency
- Energy driven -> try to minimize number of bit transitions in the state register
  - Fewer transitions in state register
  - Fewer transitions propagated to combinational logic
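
As a small illustration of how much the encoding matters, the Python sketch below counts state-register bit transitions for an 8-state machine that cycles through its states in order, once with plain binary codes and once with Gray codes. The cyclic state sequence and the choice of Gray code are assumptions made for this example; the slides do not name a specific encoding.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def transitions_per_cycle(codes):
    """Total bit flips when stepping once around the cyclic state sequence."""
    n = len(codes)
    return sum(hamming(codes[i], codes[(i + 1) % n]) for i in range(n))

binary_codes = list(range(8))                    # 000, 001, 010, ...
gray_codes = [i ^ (i >> 1) for i in range(8)]    # 000, 001, 011, 010, ...

print(transitions_per_cycle(binary_codes), transitions_per_cycle(gray_codes))  # prints: 14 8
```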

Bus encoding

[Figure: Encoder -> Bus -> Decoder]

- Reduces number of bit toggles on the bus
- Different flavors
  - Bus-invert coding
    - Uses an extra bus line, invert:
      - if the number of transitions is < K/2, invert = 0 and the symbol is transmitted as is
      - if the number of transitions is > K/2, invert = 1 and the symbol is transmitted in a complemented form
  - Low-weight coding
    - Uses transition signaling instead of level signaling

Bus invert coding

Source: M. Stan et al., 1994
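
A minimal Python sketch of bus-invert coding on a K-bit bus is given below. It rests on several assumptions that the slides do not spell out: transitions are counted on the data lines only, the bus starts at all zeros, and a tie (exactly K/2 transitions) sends the word unchanged. Function names are illustrative.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def bus_invert_encode(words, k):
    """Encode a word sequence; returns a list of (bus_value, invert) pairs."""
    mask = (1 << k) - 1
    bus = 0                                # assumed initial bus state
    encoded = []
    for word in words:
        if hamming(bus, word) > k / 2:     # too many lines would toggle
            bus, invert = ~word & mask, 1  # send the complement instead
        else:
            bus, invert = word, 0          # send as is (ties included)
        encoded.append((bus, invert))
    return encoded

def bus_invert_decode(encoded, k):
    """Recover the original words from (bus_value, invert) pairs."""
    mask = (1 << k) - 1
    return [(~value & mask) if invert else value for value, invert in encoded]

words = [0x00, 0xFF, 0x0F, 0xF0]
encoded = bus_invert_encode(words, 8)
print(encoded)
print(bus_invert_decode(encoded, 8) == words)
```

With this rule no more than K/2 data lines toggle per transfer, at the price of one extra line that may itself toggle.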
