Buffer Insertion

Interconnect Optimizations

A scaling primer

• Ideal process scaling:
– Device geometries shrink by s (= 0.7x)
  • Device delay shrinks by s
– Wire geometries shrink by s
  • R/µm: $\rho/(ws \cdot hs) = r/s^2$
  • Cc/µm: $\epsilon\,(hs)/(Ss) = C_c$
  • C/µm: similar
  • R/µm doubles; C/µm and Cc/µm are unchanged

[Figure: scaled device (S, G, D terminals) and wires of width w, height h, length l, and spacing S]

Interconnect role

• Short (local) interconnect
– Used to connect nearby cells
– Minimize wire C, i.e., use short min-width wires
• Medium to long-distance (global) interconnect
– Size wires to trade off area vs. delay
– Increasing width: capacitance increases, resistance decreases
– Need to find an acceptable tradeoff: the wire sizing problem
• “Fat” wires
– Thicker cross-sections in higher metal layers
– Useful for reducing delays for global wires
– Inductance issues, sharing of limited resource

Cross-Section of A Chip

Block scaling

• Block area often stays the same
– # cells, # nets doubles
– Wiring histogram shape invariant
• Global interconnect lengths don’t shrink
• Local interconnect lengths shrink by s

Interconnect delay scaling

• Delay of a wire of length l: $\tau_{int} = (rl)(cl) = rcl^2$ (first order)
• Local interconnects: $\tau_{int} = (r/s^2)(c)(ls)^2 = rcl^2$
– Local interconnect delay unchanged (compare to faster devices)
• Global interconnects: $\tau_{int} = (r/s^2)(c)(l)^2 = (rcl^2)/s^2$
– Global interconnect delay doubles: unsustainable!
• Interconnect delay increasingly more dominant

Buffer Insertion For Delay Reduction

Analysis of Simple RC Circuit

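$$R\,i(t) + v(t) = v_T(t)$$

$$i(t) = \frac{d\,(C\,v(t))}{dt} = C\,\frac{dv(t)}{dt}$$

$$RC\,\frac{dv(t)}{dt} + v(t) = v_T(t)$$

Here v(t) is the state variable and v_T(t) is the input waveform.

[Figure: voltage source v_T(t) driving capacitor C through resistor R, with current i(t) and capacitor voltage v(t)]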

Analysis of Simple RC Circuit

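Step-input response:

$$RC\,\frac{dv(t)}{dt} + v(t) = v_0\,u(t)$$

$$v(t) = K e^{-t/RC} + v_0\,u(t)$$

Match initial state:

$$v(0) = 0 \;\Rightarrow\; K + v_0\,u(0) = 0 \;\Rightarrow\; K = -v_0$$

Output response for step-input:

$$v(t) = v_0\,(1 - e^{-t/RC})\,u(t)$$

[Figure: step input v_0 u(t) and output response v_0(1 - e^{-t/RC})u(t)]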

Delays of Simple RC Circuit

• $v(t) = v_0(1 - e^{-t/RC})$: waveform under step input $v_0 u(t)$
• $v(t) = 0.5v_0 \Rightarrow t = 0.69RC$
– i.e., delay = 0.69RC (50% delay)
• $v(t) = 0.1v_0 \Rightarrow t = 0.1RC$
• $v(t) = 0.9v_0 \Rightarrow t = 2.3RC$
– i.e., rise time = 2.2RC (if defined as time from 10% to 90% of Vdd)
• Commonly used metric: $T_D = RC$ (= Elmore delay)
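The threshold times above follow from inverting the step response. A minimal sketch that recomputes them in units of RC (the function name `time_to_fraction` is illustrative, not from the slides):

```python
import math

def time_to_fraction(fraction, rc=1.0):
    """Time at which v(t) = v0 * (1 - exp(-t/RC)) reaches fraction * v0."""
    return -rc * math.log(1.0 - fraction)

t50 = time_to_fraction(0.5)   # 50% delay
t10 = time_to_fraction(0.1)
t90 = time_to_fraction(0.9)
rise_time = t90 - t10         # 10%-to-90% rise time

print(f"t50={t50:.3f} RC, t10={t10:.3f} RC, t90={t90:.3f} RC, rise={rise_time:.3f} RC")
```

The slide values 0.69RC, 0.1RC, 2.3RC, and 2.2RC are these numbers rounded.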

Elmore Delay


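• Driver is modeled as R
• Driver intrinsic gate delay t(B)
• Delay = $\sum_{\text{all } R_i} \; \sum_{\text{all } C_j \text{ downstream from } R_i} R_i C_j$
• Elmore delay at n2 = R(B)*(C1+C2) + R(w)*C2
• Elmore delay at n1 = R(B)*(C1+C2)

[Figure: driver B with resistance R(B) driving node n1 (capacitance C1), followed by wire resistance R(w) to node n2 (capacitance C2)]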
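The double sum can be sketched for the simplest case, an RC chain with no branching. This is an illustrative sketch, not a general tree implementation; the numeric values for R(B), R(w), C1, and C2 are made up.

```python
def elmore_delays(resistors, caps):
    """Elmore delay at each node of an RC chain.

    resistors[i] connects node i-1 (or the source, for i = 0) to node i;
    caps[i] is the capacitance at node i.
    """
    delays = []
    total = 0.0
    for i in range(len(caps)):
        downstream = sum(caps[i:])        # all Cj downstream from Ri
        total += resistors[i] * downstream
        delays.append(total)
    return delays

# Hypothetical values: R(B) = 2, R(w) = 3, C1 = 1, C2 = 4
d_n1, d_n2 = elmore_delays([2.0, 3.0], [1.0, 4.0])
print(d_n1, d_n2)
```

With these values, d_n1 equals R(B)*(C1+C2) and d_n2 adds R(w)*C2 on top of it, matching the two expressions on the slide.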

Elmore Delay

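• For a uniform wire of length x driving load C: delay = $rx\,(cx/2 + C)$
• No matter how the wire is lumped, the Elmore delay is the same
• c: unit wire capacitance
• r: unit wire resistance

[Figure: uniform wire of length x driving load capacitance C]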
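The lumping claim can be checked numerically. The sketch below assumes each segment is a π-model (half of the segment capacitance at each end, the segment resistance in between) and compares the result with the closed form rx(cx/2 + C); the function name and test values are illustrative.

```python
def lumped_wire_delay(r, c, x, load, segments):
    """Elmore delay of a uniform wire split into equal pi-segments."""
    seg_r = r * x / segments
    seg_c = c * x / segments
    delay = 0.0
    for k in range(segments):
        # Capacitance downstream of segment k's resistor: the far-end half
        # of its own cap, the full caps of all later segments, and the load.
        downstream = seg_c / 2 + seg_c * (segments - 1 - k) + load
        delay += seg_r * downstream
    return delay

closed_form = 1.0 * 3.0 * (2.0 * 3.0 / 2 + 4.0)   # r*x*(c*x/2 + C)
for n in (1, 2, 10, 100):
    print(n, lumped_wire_delay(1.0, 2.0, 3.0, 4.0, n), closed_form)
```

Every segment count gives the same delay, up to floating-point rounding.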

Delay for Buffer

[Figure: buffer b between nodes u and v driving load C; the buffer is characterized by its input capacitance C(b), driver resistance R, and intrinsic buffer delay]

Buffers Reduce Wire Delay

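[Figure: a wire of length x driven through resistance R into load C, and the same wire split by a buffer into two halves of length x/2, each half modeled with resistance rx/2 and capacitance cx/4 at both ends]

$t_{unbuf} = R(cx + C) + rx(cx/2 + C)$

$t_{buf} = 2R(cx/2 + C) + rx(cx/4 + C) + t_b$

$t_{buf} - t_{unbuf} = RC + t_b - rcx^2/4$

[Figure: ∆t = t_buf - t_unbuf plotted against wire length x]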
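The delay difference turns negative once the wire is long enough. A small sketch of the three expressions, assuming (as the formulas do) that the buffer has the same resistance R and input capacitance C as the driver and load; `break_even_length` is a derived helper, not something stated on the slide.

```python
import math

def t_unbuf(R, C, r, c, x):
    return R * (c * x + C) + r * x * (c * x / 2 + C)

def t_buf(R, C, r, c, x, tb):
    # One buffer at the midpoint, with the same R and input cap C (assumption).
    return 2 * R * (c * x / 2 + C) + r * x * (c * x / 4 + C) + tb

def break_even_length(R, C, r, c, tb):
    """Wire length where R*C + tb - r*c*x^2/4 = 0; longer wires gain from the buffer."""
    return 2 * math.sqrt((R * C + tb) / (r * c))

R, C, r, c, tb = 1.0, 1.0, 1.0, 1.0, 1.0   # illustrative unit values
x0 = break_even_length(R, C, r, c, tb)
for x in (1.0, x0, 4.0):
    print(x, t_buf(R, C, r, c, x, tb) - t_unbuf(R, C, r, c, x))
```

Below x0 the buffer adds delay, and above it the buffer reduces delay.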

Combinational Logic Delay

Combinational logic delay <= clock period

[Figure: primary input, register, combinational logic, register, primary output; both registers are driven by the clock]

Buffered global interconnects: Intuition

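• Unbuffered wire of length l: interconnect delay = $rcl^2$
• With buffers splitting the wire into segments $l_1, l_2, \ldots, l_n$: interconnect delay = $\sum_i rc\,l_i^2 < rc\,l^2$ (where $l = \sum_j l_j$), since $\sum_j l_j^2 < (\sum_j l_j)^2$
• (Of course, account for buffer delay also)

[Figure: wire of length l divided by buffers into segments l1, l2, l3, ..., ln]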

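Optimal inter-buffer length

• First order (lumped parasitic, Elmore delay) analysis
• Assume N identical buffers with equal inter-buffer length l (L = Nl)

$$T = N\left[R_d\,(C_g + cl) + rl\,(C_g + cl/2)\right] = L\left[\frac{rcl}{2} + (rC_g + R_d c) + \frac{1}{l} R_d C_g\right]$$

• For minimum delay, $\frac{dT}{dl} = 0$:

$$L\left[\frac{rc}{2} - \frac{R_d C_g}{l_{opt}^2}\right] = 0 \;\Rightarrow\; l_{opt} = \sqrt{\frac{2 R_d C_g}{rc}}$$

• Rd: on-resistance of inverter
• Cg: gate input capacitance
• r, c: resistance, capacitance per micron

[Figure: wire of total length L divided into equal segments of length l by identical buffers]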
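A quick numerical check of the optimum, as a sketch with made-up unit values (Rd = 2, Cg = 1, r = c = 1). It evaluates the total delay T at the closed-form optimal length and at lengths on either side of it.

```python
import math

def total_delay(L, l, Rd, Cg, r, c):
    """T = (L/l) * [Rd*(Cg + c*l) + r*l*(Cg + c*l/2)]."""
    return (L / l) * (Rd * (Cg + c * l) + r * l * (Cg + c * l / 2))

def optimal_length(Rd, Cg, r, c):
    """Inter-buffer length that minimizes T: sqrt(2*Rd*Cg / (r*c))."""
    return math.sqrt(2 * Rd * Cg / (r * c))

Rd, Cg, r, c, L = 2.0, 1.0, 1.0, 1.0, 8.0   # illustrative values
l_opt = optimal_length(Rd, Cg, r, c)
for l in (l_opt / 2, l_opt, 2 * l_opt):
    print(l, total_delay(L, l, Rd, Cg, r, c))
```

In practice L/l must be an integer number of stages, so the continuous optimum is only a guide.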

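Optimal interconnect delay

• Substituting $l_{opt}$ back into the interconnect delay expression:

$$T_{opt} = L\left[\frac{rc\,l_{opt}}{2} + (rC_g + R_d c) + \frac{1}{l_{opt}} R_d C_g\right] = L\left[\frac{rc}{2}\sqrt{\frac{2 R_d C_g}{rc}} + (rC_g + R_d c) + R_d C_g \sqrt{\frac{rc}{2 R_d C_g}}\right]$$

$$T_{opt} = L\left(\sqrt{2 R_d C_g\, rc} + rC_g + R_d c\right)$$

• Delay grows linearly with L (instead of quadratically)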
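To make the linear-versus-quadratic contrast concrete, the sketch below compares the optimally buffered delay with the Elmore delay of a single unbuffered wire of the same length. The parameter values are illustrative only, and the unbuffered model (one driver Rd, one load Cg) is an assumption made for the comparison.

```python
import math

def buffered_delay(L, Rd, Cg, r, c):
    """Optimally buffered delay: L * (sqrt(2*Rd*Cg*r*c) + r*Cg + Rd*c)."""
    return L * (math.sqrt(2 * Rd * Cg * r * c) + r * Cg + Rd * c)

def unbuffered_delay(L, Rd, Cg, r, c):
    """Single driver and single load on the whole wire (Elmore delay)."""
    return Rd * (Cg + c * L) + r * L * (Cg + c * L / 2)

for L in (10.0, 20.0, 40.0):
    print(L, buffered_delay(L, 2.0, 1.0, 1.0, 1.0), unbuffered_delay(L, 2.0, 1.0, 1.0, 1.0))
```

Doubling L doubles the buffered delay, while the unbuffered delay grows by more than a factor of two.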

Total buffer count

• Ever-increasing fractions of total cell count will be buffers
– 70% in 32nm

[Figure: % of cells used to buffer nets (y-axis 0 to 80) at the 90nm, 65nm, 45nm, and 32nm nodes; series: clk-buf, buf, tot-buf]

ITRS projections

[Figure: relative delay (log scale, 0.1 to 100) vs. feature size (250, 180, 130, 90, 65, 45, 32 nm); series: gate delay (fanout 4), local interconnect (M1,2), global interconnect with repeaters, global interconnect without repeaters]

Source: ITRS, 2003

Buffers Improve Slack

Before buffering (slackmin = -50):
– RAT = 300, Delay = 350, Slack = -50
– RAT = 700, Delay = 600, Slack = 100

After buffering (slackmin = 50):
– RAT = 300, Delay = 250, Slack = 50
– RAT = 700, Delay = 400, Slack = 300

Decouple capacitive load from critical path

RAT = Required Arrival Time
Slack = RAT - Delay
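The slack numbers above can be reproduced directly from the definition. In this sketch each sink is a (RAT, delay) pair; grouping the four sinks into a "before" and an "after" case is an interpretation of the slide.

```python
def min_slack(sinks):
    """Worst slack over all sinks, where slack = RAT - delay."""
    return min(rat - delay for rat, delay in sinks)

before = [(300, 350), (700, 600)]   # (RAT, delay) without the buffer
after = [(300, 250), (700, 400)]    # (RAT, delay) with the buffer

print(min_slack(before), min_slack(after))
```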

Timing Driven Buffering Problem Formulation

• Given
– A Steiner tree
– RAT at each sink
– A buffer type
– RC parameters
– Candidate buffer locations
• Find a buffer insertion solution such that the slack at the driver is maximized

Candidate Buffering Solutions

Candidate Solution Characteristics

• Each candidate solution is associated with
– vi: a node
– ci: downstream capacitance
– qi: RAT
• If vi is a sink, ci is the sink capacitance
• Otherwise, vi is an internal node

Van Ginneken’s Algorithm

Candidate solutions are propagated toward the source

Dynamic Programming

Solution Propagation: Add Wire

• $c_2 = c_1 + cx$
• $q_2 = q_1 - rcx^2/2 - rxc_1$
• r: wire resistance per unit length
• c: wire capacitance per unit length

[Figure: wire of length x from (v1, c1, q1) to (v2, c2, q2)]
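The wire update can be written as a one-line transformation on a (c, q) pair. A minimal sketch (the node label is dropped and the function name is illustrative), checked against the values used in the worked example in these slides:

```python
def add_wire(cand, x, r, c):
    """Propagate a (c1, q1) candidate upstream across a wire of length x."""
    c1, q1 = cand
    return (c1 + c * x, q1 - r * c * x * x / 2 - r * x * c1)

# Sink candidate (1, 20) across a length-2 wire with r = c = 1
print(add_wire((1, 20), x=2, r=1, c=1))
```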

Solution Propagation: Insert Buffer

• $c_{1b} = C_b$
• $q_{1b} = q_1 - R_b c_1 - t_b$
• Cb: buffer input capacitance
• Rb: buffer output resistance
• tb: buffer intrinsic delay

[Figure: buffer inserted at v1, turning (v1, c1, q1) into (v1, c1b, q1b)]
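As a sketch, buffer insertion replaces the downstream load with the buffer input capacitance and charges the buffer delay against the RAT. The function name is illustrative, and the numbers come from the worked example in these slides.

```python
def insert_buffer(cand, Rb, Cb, tb):
    """Candidate seen upstream of a buffer that drives (c1, q1)."""
    c1, q1 = cand
    return (Cb, q1 - Rb * c1 - tb)

# Candidate (3, 16) with Rb = Cb = tb = 1
print(insert_buffer((3, 16), Rb=1, Cb=1, tb=1))
```

The unbuffered candidate is kept alongside the new one, so each candidate buffer location adds to the list rather than replacing it.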

Solution Propagation: Merge

• $c_{merge} = c_l + c_r$
• $q_{merge} = \min(q_l, q_r)$

[Figure: left candidate (v, cl, ql) and right candidate (v, cr, qr) merged at branch node v]

Solution Propagation: Add Driver

• $q_{0d} = q_0 - R_d c_0 = \text{slack}_{min}$
• Rd: driver resistance
• Pick solution with max slackmin

[Figure: driver added at the source, turning (v0, c0, q0) into (v0, c0d, q0d)]
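At the source, each surviving candidate is charged the driver delay and the best one is selected. A minimal sketch with illustrative function names, using the two final candidates from the worked example in these slides:

```python
def slack_at_driver(cand, Rd):
    """Slack seen at the driver for a (c0, q0) candidate."""
    c0, q0 = cand
    return q0 - Rd * c0

def best_candidate(cands, Rd):
    """Candidate with the maximum slack at the driver."""
    return max(cands, key=lambda cand: slack_at_driver(cand, Rd))

cands = [(5, 8), (3, 8)]
best = best_candidate(cands, Rd=1)
print(best, slack_at_driver(best, Rd=1))
```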

Example of Solution Propagation

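• r = 1, c = 1
• Rb = 1, Cb = 1, tb = 1
• Rd = 1
• Two wire segments of length 2 each: v1 to v2, then v2 to v3 (the driver)

Start at the sink: (v1, 1, 20)
• Add wire: (v2, 3, 16)
• Insert buffer at v2: (v2, 1, 12)
• Add wire to each candidate: (v2, 3, 16) becomes (v3, 5, 8); (v2, 1, 12) becomes (v3, 3, 8)
• Add driver: (v3, 5, 8) gives slack = 3; (v3, 3, 8) gives slack = 5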

Example of Merging

[Figure: left candidates and right candidates combined into merged candidates]

Solution Pruning

• Two candidate solutions
– (v, c1, q1)
– (v, c2, q2)
• Solution 1 is inferior if
– c1 > c2: larger load
– and q1 < q2: tighter timing

Pruning When Insert Buffer

All candidates created by inserting a buffer at a node have the same load cap Cb, so only the one with max q is kept.

Generating Candidates

[Figure: candidate generation, steps (1) to (3)]

From Dr. Charles Alpert

Pruning Candidates

[Figure: step (3) with candidates (a) and (b), then step (4) after pruning]

Both (a) and (b) “look” the same to the source. Throw out the one with the worst slack.

Candidate Example Continued

[Figure: steps (4) and (5), then step (5) after pruning]

At the driver, compute which candidate maximizes slack. The result is optimal.

Merging Branches

[Figure: right candidates and left candidates meeting at a branch point]

Pruning Merged Branches

[Figure: merged candidates with the critical branch marked, shown with pruning]

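Van Ginneken Example

Candidates are written as (C, q).

• Sink: (20,400)
• Add wire (C=10, d=150): (20,400) becomes (30,250)
• Insert buffer (C=5, d=30): (30,250) becomes (5,220)
• Add wire (C=15): (30,250) becomes (45,50) with d=200; (5,220) becomes (20,100) with d=120
• Insert buffer (C=5): (45,50) becomes (5,0) with d=50; (20,100) becomes (5,70) with d=30
• Candidates at this point: (45,50), (5,0), (20,100), (5,70)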

Van Ginneken Example Cont’d

• (5,0) is inferior to (5,70). (45,50) is inferior to (20,100).
• Remaining candidates: (20,100), (5,70)
• Add wire (C=10): (20,100) becomes (30,10); (5,70) becomes (15,-10)
• Pick the solution with the largest slack, then follow the arrows back to recover the solution

Basic Data Structure

(c1, q1) (c2, q2) (c3, q3)

Sorted list such that
• c1 < c2 < c3 (worse load cap)
• If there are no inferior candidates, q1 < q2 < q3 (better timing)

Prune Solution List

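(c1, q1) (c2, q2) (c3, q3) (c4, q4), in order of increasing c

[Flowchart: walk the list in order of increasing c, comparing the q of the last kept candidate with the q of the next candidate]
• q1 < q2? If no, prune 2 and test q1 < q3 next. If yes, keep 2 and test q2 < q3 next.
• The same test repeats for candidates 3 and 4: whenever the next candidate's q is not larger than the last kept q, it is pruned (prune 3, prune 4).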
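The flowchart amounts to a single pass over the list in order of increasing c. The sketch below makes two assumptions that go slightly beyond the slide: it sorts first (the scan itself is linear when the list is already sorted), and it treats a candidate with equal c and lower q as inferior, which is what the Van Ginneken example does with (5,0) and (5,70).

```python
def prune(cands):
    """Drop inferior (c, q) candidates: larger-or-equal c with no better q."""
    kept = []
    # Increasing c; for equal c, the highest q comes first.
    for c, q in sorted(cands, key=lambda s: (s[0], -s[1])):
        if not kept or q > kept[-1][1]:
            kept.append((c, q))
    return kept

# Candidates from the Van Ginneken example before pruning
print(prune([(45, 50), (5, 0), (20, 100), (5, 70)]))
```

After pruning, both c and q are strictly increasing along the list, which is the invariant the sorted data structure relies on.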

Pruning In Merging

Left candidates: (cl1, ql1), (cl2, ql2), (cl3, ql3)

Right candidates: (cr1, qr1), (cr2, qr2)

Assume ql1 < ql2 < qr1 < ql3 < qr2

Merged candidates:
• (cl1+cr1, ql1)
• (cl2+cr1, ql2)
• (cl3+cr1, qr1)
• (cl3+cr2, ql3)
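The merged list above is what a two-pointer walk over the two sorted, already-pruned lists produces: pair the current left and right candidates, then advance whichever side has the smaller q, since pairing it with any later candidate from the other side only adds capacitance without improving q. This is a sketch with made-up numeric values chosen to satisfy ql1 < ql2 < qr1 < ql3 < qr2.

```python
def merge_branches(left, right):
    """Merge two pruned (c, q) lists, each sorted by increasing c and q."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        (cl, ql), (cr, qr) = left[i], right[j]
        merged.append((cl + cr, min(ql, qr)))
        # Advance the side that limits q (both sides on a tie).
        if ql <= qr:
            i += 1
        if qr <= ql:
            j += 1
    return merged

left = [(1, 10), (2, 20), (3, 40)]    # hypothetical (cl, ql) values
right = [(1, 30), (2, 50)]            # hypothetical (cr, qr) values
print(merge_branches(left, right))
```

The output has at most len(left) + len(right) - 1 entries, which is why merging is additive rather than multiplicative.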

Van Ginneken Complexity

• Generate candidates from sinks to source

• Quadratic runtime
– Adding a wire does not change #candidates
– Adding a buffer adds only one new candidate
– Merging branches is additive, not multiplicative
– Linear time solution list pruning

• Optimal for Elmore delay model
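Putting the steps together for the simplest topology, a single path with no branches, gives the following sketch. It is not a full tree implementation (merging is omitted), the function names are illustrative, and buffers are assumed to be allowed after every wire segment except the last one, which ends at the driver.

```python
def prune(cands):
    """Keep only non-inferior (c, q) candidates, sorted by increasing c."""
    kept = []
    for c, q in sorted(cands, key=lambda s: (s[0], -s[1])):
        if not kept or q > kept[-1][1]:
            kept.append((c, q))
    return kept

def van_ginneken_path(sink_c, sink_q, wire_lengths, r, c, Rb, Cb, tb, Rd):
    """Best slack at the driver; wire_lengths run from the sink to the source."""
    cands = [(sink_c, sink_q)]
    for k, x in enumerate(wire_lengths):
        # Add wire: the number of candidates does not change.
        cands = [(ci + c * x, qi - r * c * x * x / 2 - r * x * ci)
                 for ci, qi in cands]
        if k < len(wire_lengths) - 1:
            # Insert buffer: all buffered candidates share load Cb,
            # so only the one with max q is added.
            cands.append((Cb, max(qi - Rb * ci - tb for ci, qi in cands)))
        cands = prune(cands)
    # Add driver and pick the maximum slack.
    return max(qi - Rd * ci for ci, qi in cands)

# Two length-2 segments with r = c = Rb = Cb = tb = Rd = 1, sink (1, 20)
print(van_ginneken_path(1, 20, [2, 2], r=1, c=1, Rb=1, Cb=1, tb=1, Rd=1))
```

Each position adds at most one candidate, so the list length is linear in the number of positions and the total work is quadratic, in line with the bullets above.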

Multiple Buffer Types

• r = 1, c = 1
• Rb1 = 1, Cb1 = 1, tb1 = 1
• Rb2 = 0.5, Cb2 = 2, tb2 = 0.5
• Rd = 1

Start at the sink: (v1, 1, 20)
• Add wire (length 2): (v2, 3, 16)
• Insert buffer type 1 at v2: (v2, 1, 12)
• Insert buffer type 2 at v2: (v2, 2, 14)
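With several buffer types, each type contributes its own best candidate at a buffer location. A minimal sketch, with buffer types given as (Rb, Cb, tb) tuples; the function name is illustrative.

```python
def insert_buffers(cands, buffer_types):
    """One new (c, q) candidate per buffer type, each with that type's best q."""
    new_cands = []
    for Rb, Cb, tb in buffer_types:
        best_q = max(q - Rb * c - tb for c, q in cands)
        new_cands.append((Cb, best_q))
    return new_cands

types = [(1, 1, 1), (0.5, 2, 0.5)]    # (Rb, Cb, tb) for buffer types 1 and 2
print(insert_buffers([(3, 16)], types))
```

Each buffer location can now add up to one candidate per type, so the candidate lists grow with the number of types as well as the number of locations.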
