Interconnect Optimizations
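A scaling primer
• Ideal process scaling:
– Device geometries shrink by $s$ ($= 0.7\times$)
  • Device delay shrinks by $s$
– Wire geometries shrink by $s$
  • $R$ per unit length: $\rho/(ws \cdot hs) = r/s^2$
  • $C_c$ per unit length: $\epsilon(hs)/(Ss) = C_c$
  • $C$ per unit length: similar
  • $R$ per unit length doubles, $C$ and $C_c$ per unit length unchanged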
[Figure: transistor with source S, gate G and drain D; wire geometry with width w, height h, length l and spacing S]
Interconnect role
• Short (local) interconnect
– Used to connect nearby cells
– Minimize wire C, i.e., use short min-width wires
• Medium to long-distance (global) interconnect
– Size wires to trade off area vs. delay
– Increasing width ⇒ capacitance increases, resistance decreases ⇒ need to find an acceptable tradeoff: the wire sizing problem
• “Fat” wires
– Thicker cross-sections in higher metal layers
– Useful for reducing delays for global wires
– Inductance issues, sharing of limited resource
Cross-Section of A Chip
Block scaling
• Block area often stays same
– # cells, # nets doubles
– Wiring histogram shape invariant
• Global interconnect lengths don’t shrink
• Local interconnect lengths shrink by s
Interconnect delay scaling
• Delay of a wire of length $l$: $\tau_{int} = (rl)(cl) = rcl^2$ (first order)
• Local interconnects: $\tau_{int} : (r/s^2)(c)(ls)^2 = rcl^2$
– Local interconnect delay unchanged (compare to faster devices)
• Global interconnects: $\tau_{int} : (r/s^2)(c)(l)^2 = (rcl^2)/s^2$
– Global interconnect delay doubles – unsustainable!
• Interconnect delay is increasingly dominant
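The scaling argument can be checked numerically. This is a minimal Python sketch, assuming normalized values r = c = l = 1 (illustrative, not from the slides) and s = 0.7; the helper name `wire_delay` is hypothetical.

```python
def wire_delay(r, c, length):
    """First-order wire delay r*c*l^2, as in the slide."""
    return r * c * length ** 2

s = 0.7                      # ideal scaling factor per generation
r, c, l = 1.0, 1.0, 1.0      # normalized values, assumed for illustration

base = wire_delay(r, c, l)
local_scaled = wire_delay(r / s**2, c, l * s)   # local wire: length shrinks by s
global_scaled = wire_delay(r / s**2, c, l)      # global wire: length unchanged

print("local ratio: ", local_scaled / base)     # close to 1
print("global ratio:", global_scaled / base)    # close to 1/s^2, about 2
```

With s = 0.7, 1/s² is about 2, which is where the doubling of global wire delay comes from.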
Buffer Insertion For Delay Reduction
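Analysis of Simple RC Circuit
[Figure: voltage source $v_T(t)$ driving a series resistor $R$ and capacitor $C$; $i(t)$ is the loop current and $v(t)$ the capacitor voltage]
$$R\,i(t) + v(t) = v_T(t)$$
$$i(t) = \frac{d(Cv(t))}{dt} = C\frac{dv(t)}{dt}$$
$$RC\frac{dv(t)}{dt} + v(t) = v_T(t)$$
• $v(t)$: state variable
• $v_T(t)$: input waveform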
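Analysis of Simple RC Circuit
• Step-input response:
$$RC\frac{dv(t)}{dt} + v(t) = v_0 u(t)$$
$$v(t) = Ke^{-t/RC} + v_0 u(t)$$
• Match initial state:
$$v(0) = 0 \;\Rightarrow\; K + v_0 = 0$$
• Output response for step-input:
$$v(t) = v_0(1 - e^{-t/RC})u(t)$$
[Figure: input step $v_0 u(t)$ and output waveform $v_0(1 - e^{-t/RC})u(t)$]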
Delays of Simple RC Circuit
• $v(t) = v_0(1 - e^{-t/RC})$: waveform under step input $v_0 u(t)$
• $v(t) = 0.5v_0 \Rightarrow t = 0.69RC$
– i.e., delay = $0.69RC$ (50% delay)
• $v(t) = 0.1v_0 \Rightarrow t = 0.1RC$
• $v(t) = 0.9v_0 \Rightarrow t = 2.3RC$
– i.e., rise time = $2.2RC$ (if defined as time from 10% to 90% of Vdd)
• Commonly used metric: $T_D = RC$ (= Elmore delay)
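The threshold times follow from inverting the step response: $t = -RC\ln(1 - f)$ for a fraction $f$ of $v_0$. A small Python sketch, assuming RC is normalized to 1; the function name `time_to_fraction` is hypothetical.

```python
import math

def time_to_fraction(fraction, rc):
    """Time for v(t) = v0 * (1 - exp(-t/RC)) to reach fraction * v0."""
    return -rc * math.log(1.0 - fraction)

rc = 1.0                              # normalized time constant (assumption)
t10 = time_to_fraction(0.1, rc)       # about 0.1 RC
t50 = time_to_fraction(0.5, rc)       # about 0.69 RC, the 50% delay
t90 = time_to_fraction(0.9, rc)       # about 2.3 RC
rise = t90 - t10                      # about 2.2 RC, 10% to 90% rise time

print(round(t10, 3), round(t50, 3), round(t90, 3), round(rise, 3))
```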
Elmore Delay
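• Driver is modeled as R
• Driver intrinsic gate delay t(B)
• Delay = $\sum_{\text{all } R_i} \; \sum_{\text{all } C_j \text{ downstream from } R_i} R_i C_j$
• Elmore delay at n2 = R(B)·(C1+C2) + R(w)·C2
• Elmore delay at n1 = R(B)·(C1+C2)
[Figure: driver B with resistance R(B) driving node n1 (capacitance C1), followed by wire resistance R(w) to node n2 (capacitance C2)]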
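The double sum can be evaluated on any RC tree by first accumulating the capacitance below each node and then adding R times downstream C along the path from the root. The sketch below is one way to do this in Python; the function name, the dictionary layout and the numeric values are assumptions for illustration, and the intrinsic delay t(B) is left out.

```python
def elmore_delays(parent, res, cap):
    """Elmore delay from the root to every node of an RC tree.

    parent[n]: parent of node n (None for the root)
    res[n]:    resistance of the branch from parent[n] to n (0 for the root)
    cap[n]:    capacitance lumped at node n
    """
    def depth(n):
        d = 0
        while parent[n] is not None:
            n, d = parent[n], d + 1
        return d

    order = sorted(cap, key=depth)           # root first, leaves last
    down = dict(cap)                         # capacitance at or below each node
    for n in reversed(order):                # accumulate children into parents
        if parent[n] is not None:
            down[parent[n]] += down[n]

    delay = {}
    for n in order:                          # parents are visited before children
        upstream = delay[parent[n]] if parent[n] is not None else 0.0
        delay[n] = upstream + res[n] * down[n]
    return delay

# Two-node circuit from the slide, with illustrative values (not from the source).
RB, RW, C1, C2 = 2.0, 3.0, 1.0, 4.0
parent = {"B": None, "n1": "B", "n2": "n1"}
res = {"B": 0.0, "n1": RB, "n2": RW}
cap = {"B": 0.0, "n1": C1, "n2": C2}
delays = elmore_delays(parent, res, cap)
print(delays["n1"], delays["n2"])
```

For these values the result at n1 equals R(B)·(C1+C2) and the result at n2 adds R(w)·C2, matching the two closed-form expressions.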
Elmore Delay
• For a uniform wire
– Unit wire capacitance c
– Unit wire resistance r
• No matter how the wire is lumped, the Elmore delay is the same
[Figure: uniform wire of length x driving load capacitance C]
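The invariance can be checked by splitting the wire into n equal pieces and summing R times downstream C. This is a hedged Python sketch that assumes each piece is a π-model (half of its capacitance at each end); the function name and the numeric values are illustrative. The result should equal rx(cx/2 + C), the same wire term that appears in the unbuffered delay expression later in these slides.

```python
def lumped_wire_delay(r, c, x, load, n):
    """Elmore delay of a wire of length x split into n equal pi-segments."""
    seg_r = r * x / n
    seg_c = c * x / n
    total = 0.0
    for k in range(1, n + 1):
        # Capacitance downstream of the k-th resistor: far-end half of
        # segment k, all of segments k+1..n, and the load.
        downstream = seg_c / 2 + (n - k) * seg_c + load
        total += seg_r * downstream
    return total

r, c, x, load = 0.5, 0.2, 10.0, 3.0          # illustrative values
closed_form = r * x * (c * x / 2 + load)     # rx(cx/2 + C)
for n in (1, 2, 10, 100):
    print(n, lumped_wire_delay(r, c, x, load, n), closed_form)
```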
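Delay for Buffer
• Buffer model parameters:
– Intrinsic buffer delay
– Driver resistance R
– Input capacitance C(b)
[Figure: buffer from node u to node v driving load C, and its equivalent model with input capacitance C(b) at u and driver resistance R]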
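Buffers Reduce Wire Delay
[Figure: driver R and load C connected by a wire of length x; in the buffered case a buffer splits the wire into two segments of length x/2, each modeled with resistance rx/2 and capacitance cx/4 at both ends]
• $t_{unbuf} = R(cx + C) + rx(cx/2 + C)$
• $t_{buf} = 2R(cx/2 + C) + rx(cx/4 + C) + t_b$
• $t_{buf} - t_{unbuf} = RC + t_b - rcx^2/4$
[Figure: ∆t plotted against x]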
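Setting the difference RC + tb – rcx²/4 to zero gives the wire length beyond which one mid-point buffer helps. A minimal Python sketch of the three expressions, assuming (as the formulas do) that the buffer has the same R and C as the driver and load; function names and numeric values are illustrative.

```python
import math

def t_unbuf(R, C, r, c, x):
    """Unbuffered delay: R(cx + C) + rx(cx/2 + C)."""
    return R * (c * x + C) + r * x * (c * x / 2 + C)

def t_buf(R, C, r, c, x, tb):
    """Delay with one buffer at x/2: 2R(cx/2 + C) + rx(cx/4 + C) + tb."""
    return 2 * R * (c * x / 2 + C) + r * x * (c * x / 4 + C) + tb

def break_even_length(R, C, r, c, tb):
    """Length where t_buf = t_unbuf, from RC + tb - r*c*x^2/4 = 0."""
    return 2 * math.sqrt((R * C + tb) / (r * c))

R = C = r = c = tb = 1.0                       # illustrative unit values
x_star = break_even_length(R, C, r, c, tb)     # 2*sqrt(2)
for x in (1.0, x_star, 4.0):
    print(x, t_buf(R, C, r, c, x, tb) - t_unbuf(R, C, r, c, x))
```

Below the break-even length the difference is positive (the buffer hurts); above it the difference is negative.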
Combinational Logic Delay
• Combinational logic delay <= clock period
[Figure: combinational logic between two clocked registers, with primary input and primary output]
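Buffered global interconnects: Intuition
• Interconnect delay = $rcl^2$
[Figure: wire of length $l$ divided by buffers into segments $l_1, l_2, l_3, \ldots, l_n$]
• Now, interconnect delay = $\sum_j rc\,l_j^2 < rcl^2$ (where $l = \sum_j l_j$), since $\sum_j l_j^2 < \left(\sum_j l_j\right)^2$
• (Of course, account for buffer delay also)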
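A quick numeric illustration of the inequality, as a Python sketch with made-up segment lengths and r = c = 1; buffer delay is ignored here, as in the first-order intuition.

```python
def unbuffered_delay(r, c, length):
    """First-order delay of one long wire: r*c*l^2."""
    return r * c * length ** 2

def segmented_delay(r, c, segments):
    """Sum of r*c*l_j^2 over the buffered segments (buffer delay ignored)."""
    return sum(r * c * lj ** 2 for lj in segments)

segments = [2.0, 3.0, 1.0, 4.0]               # illustrative lengths, sum = 10
whole = unbuffered_delay(1.0, 1.0, sum(segments))
split = segmented_delay(1.0, 1.0, segments)
print(whole, split)
```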
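Optimal inter-buffer length
• First order (lumped parasitic, Elmore delay) analysis
• Assume N identical buffers with equal inter-buffer length $l$ ($L = Nl$)
[Figure: line of total length $L$ divided into buffered segments of length $l$]
$$T = N\left[R_d(C_g + cl) + rl\left(C_g + \frac{cl}{2}\right)\right] = L\left[\frac{rcl}{2} + (rC_g + R_d c) + \frac{1}{l}R_d C_g\right]$$
• For minimum delay, $\frac{dT}{dl} = 0$:
$$L\left[\frac{rc}{2} - \frac{R_d C_g}{l_{opt}^2}\right] = 0 \;\Rightarrow\; l_{opt} = \sqrt{\frac{2R_d C_g}{rc}}$$
• $R_d$: on resistance of inverter
• $C_g$: gate input capacitance
• $r, c$: resistance, capacitance per micron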
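Optimal interconnect delay
• Substituting $l_{opt}$ back into the interconnect delay expression:
$$T_{opt} = L\left[\frac{rc\,l_{opt}}{2} + (rC_g + R_d c) + \frac{1}{l_{opt}}R_d C_g\right] = L\left[\frac{rc}{2}\sqrt{\frac{2R_dC_g}{rc}} + (rC_g + R_dc) + R_dC_g\sqrt{\frac{rc}{2R_dC_g}}\right]$$
$$T_{opt} = L\left[\sqrt{2R_dC_g rc} + rC_g + R_dc\right]$$
• Delay grows linearly with L (instead of quadratically)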
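The closed forms can be sanity-checked numerically. The Python sketch below assumes the stage delay Rd(Cg + cl) + rl(Cg + cl/2) with N = L/l treated as a continuous quantity; function names and parameter values are illustrative, not from the slides.

```python
import math

def total_delay(L, l, Rd, Cg, r, c):
    """T = N * [Rd*(Cg + c*l) + r*l*(Cg + c*l/2)] with N = L / l."""
    n = L / l
    return n * (Rd * (Cg + c * l) + r * l * (Cg + c * l / 2))

def optimal_length(Rd, Cg, r, c):
    """l_opt = sqrt(2*Rd*Cg / (r*c))."""
    return math.sqrt(2 * Rd * Cg / (r * c))

def optimal_delay(L, Rd, Cg, r, c):
    """T_opt = L * (sqrt(2*Rd*Cg*r*c) + r*Cg + Rd*c)."""
    return L * (math.sqrt(2 * Rd * Cg * r * c) + r * Cg + Rd * c)

Rd, Cg, r, c, L = 2.0, 1.0, 1.0, 1.0, 20.0    # illustrative values
l_opt = optimal_length(Rd, Cg, r, c)
print(l_opt, total_delay(L, l_opt, Rd, Cg, r, c), optimal_delay(L, Rd, Cg, r, c))
print(total_delay(L, 1.0, Rd, Cg, r, c), total_delay(L, 4.0, Rd, Cg, r, c))
```

Doubling L doubles `optimal_delay`, whereas the unbuffered first-order delay rcL² would quadruple.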
Total buffer count
• Ever-increasing fractions of total cell count will be buffers
– 70% in 32nm
[Figure: % cells used to buffer nets (0 to 80) at 90nm, 65nm, 45nm and 32nm; series: clk-buf, buf, tot-buf. Source: ITRS, 2003]
ITRS projections
[Figure: relative delay (log scale, 0.1 to 100) vs. feature size (250, 180, 130, 90, 65, 45, 32 nm); series: gate delay (fanout 4), local interconnect (M1,2), global interconnect with repeaters, global interconnect without repeaters. Source: ITRS, 2003]
Buffers Improve Slack
• RAT = Required Arrival Time
• Slack = RAT - Delay
• Before buffering: slackmin = -50
– RAT = 300, Delay = 350, Slack = -50
– RAT = 700, Delay = 600, Slack = 100
• After buffering: slackmin = 50
– RAT = 300, Delay = 250, Slack = 50
– RAT = 700, Delay = 400, Slack = 300
• Decouple capacitive load from critical path
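The slack numbers in this example can be reproduced directly from Slack = RAT - Delay. A minimal Python sketch; the function name `min_slack` and the grouping of the four sinks into a "before" and an "after" case are assumptions based on the two slackmin values.

```python
def min_slack(sinks):
    """sinks: list of (RAT, delay) pairs; returns the smallest RAT - delay."""
    return min(rat - delay for rat, delay in sinks)

before = [(300, 350), (700, 600)]   # assumed: values without the buffer
after = [(300, 250), (700, 400)]    # assumed: values with the buffer

print(min_slack(before), min_slack(after))
```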
Timing Driven Buffering Problem Formulation
• Given
– A Steiner tree
– RAT at each sink
– A buffer type
– RC parameters
– Candidate buffer locations
• Find buffer insertion solution such that the slack at the driver is maximized
Candidate Buffering Solutions
Candidate Solution Characteristics
• Each candidate solution is associated with
– vi: a node
– ci: downstream capacitance
– qi: RAT
[Figure: two cases. vi is a sink: ci is the sink capacitance. v is an internal node.]
Van Ginneken’s Algorithm
Candidate solutions are propagated toward the source
Dynamic Programming
Solution Propagation: Add Wire
• c2 = c1 + cx
• q2 = q1 – rcx²/2 – rxc1
• r: wire resistance per unit length
• c: wire capacitance per unit length
[Figure: wire of length x from (v2, c2, q2) to (v1, c1, q1)]
Solution Propagation: Insert Buffer
• c1b = Cb
• q1b = q1 – Rbc1 – tb
• Cb: buffer input capacitance
• Rb: buffer output resistance
• tb: buffer intrinsic delay
[Figure: buffer inserted at v1, turning (v1, c1, q1) into (v1, c1b, q1b)]
Solution Propagation: Merge
• cmerge = cl + cr
• qmerge = min(ql , qr)
(v, cl , ql) (v, cr , qr)
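At a branch point the loads add and the tighter required time wins. A minimal Python sketch for one pair of candidates, written as (c, q) tuples; the function name and the numbers are illustrative assumptions.

```python
def merge(left, right):
    """Merge one left and one right candidate at a branch node."""
    cl, ql = left
    cr, qr = right
    return (cl + cr, min(ql, qr))

merged = merge((3, 16), (2, 10))   # illustrative candidates
print(merged)
```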
Solution Propagation: Add Driver
• q0d = q0 – Rdc0 = slackmin
• Rd: driver resistance
• Pick solution with max slackmin
(v0, c0, q0)(v0, c0d, q0d)
Example of Solution Propagation
• r = 1, c = 1
• Rb = 1, Cb = 1, tb = 1
• Rd = 1
• Sink candidate: (v1, 1, 20)
• Add wire (length 2): (v2, 3, 16)
• Insert buffer at v2: (v2, 1, 12)
• Add wire (length 2) to each: (v3, 5, 8) unbuffered, (v3, 3, 8) buffered
• Add driver: slack = 8 – 1·5 = 3 unbuffered, slack = 8 – 1·3 = 5 buffered
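The numbers in this example can be reproduced by applying the add-wire, insert-buffer and add-driver rules to a list of (c, q) candidates. The Python sketch below assumes two wire segments of length 2 and a single buffer site at v2; function names are hypothetical, and no pruning is applied.

```python
R_WIRE, C_WIRE = 1, 1          # r, c per unit length
RB, CB, TB = 1, 1, 1           # buffer resistance, input cap, intrinsic delay
RD = 1                         # driver resistance

def add_wire(cands, x):
    """c2 = c1 + c*x, q2 = q1 - r*c*x^2/2 - r*x*c1 for every candidate."""
    return [(c + C_WIRE * x, q - R_WIRE * C_WIRE * x * x / 2 - R_WIRE * x * c)
            for c, q in cands]

def with_buffer_option(cands):
    """Keep the unbuffered candidates and add a buffered copy of each."""
    return cands + [(CB, q - RB * c - TB) for c, q in cands]

def driver_slacks(cands):
    """Slack at the driver: q - Rd*c for every candidate."""
    return [q - RD * c for c, q in cands]

at_v1 = [(1, 20)]                                # sink
at_v2 = with_buffer_option(add_wire(at_v1, 2))   # (3, 16) and (1, 12)
at_v3 = add_wire(at_v2, 2)                       # (5, 8) and (3, 8)
slacks = driver_slacks(at_v3)                    # 3 and 5
print(at_v2, at_v3, slacks, max(slacks))
```

The buffered candidate wins with slack 5, since it presents a smaller load to the driver for the same required time.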
Example of Merging
Left candidates
Right candidates
Merged candidates
Solution Pruning
• Two candidate solutions
– (v, c1, q1)
– (v, c2, q2)
• Solution 1 is inferior if
– c1 > c2 : larger load
– and q1 < q2 : tighter timing
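The inferiority test can be written as a small predicate on (c, q) pairs. This Python sketch is an assumption in one respect: it uses non-strict comparisons, so a candidate with equal load and worse q also counts as inferior, which is how the later Van Ginneken example treats (5,0) against (5,70).

```python
def is_inferior(cand1, cand2):
    """True if cand1 = (c1, q1) is no better than cand2 in load and timing."""
    c1, q1 = cand1
    c2, q2 = cand2
    # Non-strict comparison (assumption); identical candidates are not pruned.
    return c1 >= c2 and q1 <= q2 and (c1, q1) != (c2, q2)

print(is_inferior((45, 50), (20, 100)))   # larger load and tighter timing
print(is_inferior((5, 0), (5, 70)))       # same load, tighter timing
print(is_inferior((20, 100), (5, 70)))    # larger load but better timing
```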
Pruning When Insert Buffer
All buffered candidates at a node have the same load cap Cb, so only the one with max q is kept
Generating Candidates
[Figure: candidate generation, steps (1) to (3)]
From Dr. Charles Alpert
Pruning Candidates
[Figure: step (3) with candidates (a) and (b), leading to step (4)]
• Both (a) and (b) “look” the same to the source. Throw out the one with the worst slack
Candidate Example Continued
[Figure: steps (4) and (5)]
Candidate Example Continued
[Figure: step (5) after pruning]
• At driver, compute which candidate maximizes slack. Result is optimal.
Merging Branches
Right Candidates
Left Candidates
Pruning Merged Branches
Critical
With pruning
Van Ginneken Example
• Candidates are written as (C, q); the sink candidate is (20,400)
• Wire, C=10, d=150: (20,400) → (30,250)
• Buffer, C=5, d=30: (30,250) → (5,220)
• Wire, C=15: (30,250) → (45,50) with d=200; (5,220) → (20,100) with d=120
• Buffer, C=5: (45,50) → (5,0) with d=50; (20,100) → (5,70) with d=30
• Candidates: (45,50), (5,0), (20,100), (5,70)
Van Ginneken Example Cont’d
• (5,0) is inferior to (5,70). (45,50) is inferior to (20,100)
• Remaining candidates: (20,100), (5,70)
• Wire, C=10: (20,100) → (30,10); (5,70) → (15,-10)
• Pick solution with largest slack, follow arrows to get solution
Basic Data Structure
(c1, q1) (c2, q2) (c3, q3)
• Sorted list such that c1 < c2 < c3
• If there are no inferior candidates, q1 < q2 < q3
• Moving right in the list: worse load cap, better timing
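Prune Solution List
(c1, q1) (c2, q2) (c3, q3) (c4, q4), in order of increasing c
[Flowchart: compare q values of successive candidates. q1 < q2? If no, prune 2 and test q1 < q3; if yes, test q2 < q3. The scan continues in the same way through q3 < q4: a candidate whose q is not larger than that of the last kept candidate is pruned (prune 3, prune 4).]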
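The flowchart amounts to a single pass over the list in order of increasing c, keeping a candidate only if its q beats the last kept one. A hedged Python sketch: the function name is hypothetical, and the initial sort is added only so the example accepts unsorted input (on an already sorted list the scan alone is linear).

```python
def prune_sorted(cands):
    """Return non-inferior (c, q) candidates with c and q both increasing."""
    # Among equal loads, put the best q first so the rest are dropped.
    ordered = sorted(cands, key=lambda cq: (cq[0], -cq[1]))
    kept = []
    for c, q in ordered:
        if not kept or q > kept[-1][1]:
            kept.append((c, q))      # more load is only worth it for better q
    return kept

# The four candidates from the Van Ginneken example.
pruned = prune_sorted([(45, 50), (5, 0), (20, 100), (5, 70)])
print(pruned)
```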
Pruning In Merging
• Left candidates: (cl1, ql1), (cl2, ql2), (cl3, ql3)
• Right candidates: (cr1, qr1), (cr2, qr2)
• ql1 < ql2 < qr1 < ql3 < qr2
• Merged candidates: (cl1+cr1, ql1), (cl2+cr1, ql2), (cl3+cr1, qr1), (cl3+cr2, ql3)
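Because both lists are sorted with c and q increasing, the merged list can be built with two pointers, always advancing the side that currently limits q. The Python sketch below is one way to realize this; the function name and the numeric candidates are assumptions chosen to satisfy ql1 < ql2 < qr1 < ql3 < qr2.

```python
def merge_lists(left, right):
    """Merge two pruned candidate lists, each sorted by increasing c and q."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        cl, ql = left[i]
        cr, qr = right[j]
        merged.append((cl + cr, min(ql, qr)))
        # Advance the side that limits q; moving the other side on would
        # only add load without improving q.
        if ql <= qr:
            i += 1
        if qr <= ql:
            j += 1
    return merged

left = [(1, 10), (2, 20), (4, 40)]     # (cl1, ql1), (cl2, ql2), (cl3, ql3)
right = [(3, 30), (6, 50)]             # (cr1, qr1), (cr2, qr2)
merged = merge_lists(left, right)
print(merged)
```

The output has at most len(left) + len(right) - 1 entries, which is why merging is additive rather than multiplicative.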
Van Ginneken Complexity
• Generate candidates from sinks to source
• Quadratic runtime
– Adding a wire does not change #candidates
– Adding a buffer adds only one new candidate
– Merging branches additive, not multiplicative
– Linear time solution list pruning
• Optimal for Elmore delay model
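To make the candidate-count argument concrete, here is a hedged Python sketch of the dynamic program for the simplest case: a single source-to-sink wire with one buffer type and a buffer site between consecutive segments. It is an illustration under those assumptions, not the full tree algorithm (no merging); function names are hypothetical, and a brute-force enumeration is included only to check the result.

```python
from itertools import product

def van_ginneken_path(seg_lengths, sink_c, sink_q, r, c, Rb, Cb, tb, Rd):
    """Max driver slack for one wire; seg_lengths are ordered sink to source."""
    cands = [(sink_c, sink_q)]
    for k, x in enumerate(seg_lengths):
        # Add wire: the number of candidates is unchanged.
        cands = [(ci + c * x, qi - r * c * x * x / 2 - r * x * ci)
                 for ci, qi in cands]
        if k < len(seg_lengths) - 1:
            # Insert buffer: all buffered candidates share load Cb, keep the best.
            cands.append((Cb, max(qi - Rb * ci - tb for ci, qi in cands)))
        # Prune: sort by load, keep only candidates with strictly better q.
        cands.sort(key=lambda cq: (cq[0], -cq[1]))
        kept = []
        for ci, qi in cands:
            if not kept or qi > kept[-1][1]:
                kept.append((ci, qi))
        cands = kept
    return max(qi - Rd * ci for ci, qi in cands)

def brute_force(seg_lengths, sink_c, sink_q, r, c, Rb, Cb, tb, Rd):
    """Try every subset of buffer sites; exponential, for checking only."""
    n_sites = len(seg_lengths) - 1
    best = None
    for choice in product([False, True], repeat=n_sites):
        ci, qi = sink_c, sink_q
        for k, x in enumerate(seg_lengths):
            qi = qi - r * c * x * x / 2 - r * x * ci
            ci = ci + c * x
            if k < n_sites and choice[k]:
                qi, ci = qi - Rb * ci - tb, Cb
        slack = qi - Rd * ci
        best = slack if best is None else max(best, slack)
    return best

print(van_ginneken_path([2, 2], 1, 20, 1, 1, 1, 1, 1, 1))
```

The call shown uses the parameters of the earlier solution-propagation example (two segments of length 2, all RC parameters equal to 1), so it should report the same best slack as that example.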
Multiple Buffer Types
• r = 1, c = 1
• Rb1 = 1, Cb1 = 1, tb1 = 1
• Rb2 = 0.5, Cb2 = 2, tb2 = 0.5
• Rd = 1
• Sink candidate: (v1, 1, 20)
• Add wire (length 2): (v2, 3, 16)
• Insert buffer type 1 at v2: (v2, 1, 12)
• Insert buffer type 2 at v2: (v2, 2, 14)
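With several buffer types, each type yields its own buffered candidate from the same unbuffered one. A minimal Python sketch reproducing the two buffered candidates at v2; the function name and the (Rb, Cb, tb) tuple layout are assumptions.

```python
def buffered_candidates(candidate, buffer_types):
    """One new (c, q) candidate per buffer type: (Cb, q - Rb*c - tb)."""
    c1, q1 = candidate
    return [(Cb, q1 - Rb * c1 - tb) for Rb, Cb, tb in buffer_types]

buffer_types = [(1, 1, 1), (0.5, 2, 0.5)]        # (Rb, Cb, tb) for types 1 and 2
result = buffered_candidates((3, 16), buffer_types)
print(result)
```

Neither buffered candidate dominates the other here: type 2 gives better timing but presents a larger load, so both stay in the list.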