INVITED PAPER

Techniques for Fast Physical Synthesis

Fast, efficient buffer design, logic transformations, and clustering components for placement are some of the techniques being used to reduce design turnaround for large, complex chips.

By Charles J. Alpert, Fellow IEEE, Shrirang K. Karandikar, Zhuo Li, Member IEEE, Gi-Joon Nam, Member IEEE, Stephen T. Quay, Haoxing Ren, Member IEEE, C. N. Sze, Member IEEE, Paul G. Villarrubia, and Mehmet C. Yildiz, Member IEEE

ABSTRACT | The traditional purpose of physical synthesis is to perform timing closure, i.e., to create a placed design that meets its timing specifications while also satisfying electrical, routability, and signal integrity constraints. In modern design flows, physical synthesis tools hardly ever achieve this goal in their first iteration. The design team must iterate by studying the output of a physical synthesis run and then massaging the input, e.g., by changing the floorplan, timing assertions, pin locations, or logic structures, in the hope of achieving a better solution on the next iteration. The complexity of physical synthesis means that systems can take days to run on designs with multiple millions of placeable objects, which severely hurts design productivity. This paper discusses some newer techniques deployed within IBM's physical synthesis tool, PDS [1], that significantly improve throughput. In particular, we focus on some of the biggest contributors to runtime (placement, legalization, buffering, and electrical correction) and present techniques that generate significant turnaround-time improvements.

KEYWORDS | Circuit optimization; circuit synthesis; CMOS integrated circuits; design automation

I. INTRODUCTION

Physical synthesis has emerged as a critical component of modern design methodologies. The primary purpose of physical synthesis is to perform timing closure.
Several technology generations ago, when wire delay was insignificant, synthesis provided an accurate picture of the timing of a design. Technology scaling, however, has caused wire delay to increase steadily relative to gate delay. Consequently, a design that meets timing requirements in synthesis will likely fail to close once its physical footprint is realized, due to the wire delays. The purpose of physical synthesis is to place the design, recognize the delays and signal integrity issues introduced by the wiring, and fix the resulting problems. It may also need to locally resynthesize pieces of the design that no longer meet timing constraints. That new logic must then be placed again, which causes iterations between synthesis and placement until, hopefully, the design closes on timing.

Unfortunately, more often than not, the design will not close on timing without manual designer intervention. Perhaps the designer needs to modify the floorplan or restructure certain sets of paths. This causes the designer to iterate between manual design work and automatic physical synthesis. The turnaround time of the physical design stage therefore depends critically on the efficiency (and quality) of the physical synthesis system. On large multimillion-gate ASIC parts, physical synthesis can take several days to complete, even on the best hardware available. This trend is only getting worse, as designs seem to scale faster than the hardware used to optimize them improves. While hierarchical or system-on-a-chip (SoC) methodologies can be used to handle the large complexities, performing timing closure on a flat part is always preferable if at all possible [2], since it avoids the complexities of hierarchical design.

Of course, there are many newer challenges that the physical synthesis system needs to handle besides traditional timing closure. Some examples include lowering power using a

Manuscript received March 8, 2006; revised October 20, 2006. C. J. Alpert, S. K. Karandikar, Z. Li, G.-J. Nam, and C. N. Sze are with the IBM Austin Research Laboratory, Austin, TX 78758 USA (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]). S. T. Quay, H. Ren, P. G. Villarrubia, and M. C. Yildiz are with the IBM Corporation, Austin, TX 78758 USA (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). Digital Object Identifier: 10.1109/JPROC.2006.890096

Vol. 95, No. 3, March 2007 | Proceedings of the IEEE 573
0018-9219/$25.00 © 2007 IEEE
objects and nets, the utilization of the designs, and the wirelength improvement and speedup over flat placement. Let α be the ratio of the number of cells to the target number of clusters. With clustering ratio α = 2, hierarchical placement is on average twice as fast as flat placement while obtaining a slight 0.92% improvement in wirelength. With a more aggressive clustering ratio of α = 10, hierarchical placement is about five times faster than flat placement, with a slight 3% degradation in wirelength. Different values of α can be used to trade off speed and quality. Overall, we demonstrate that careful clustering and unclustering strategies can yield a hierarchical placement that is significantly faster than flat placement with comparable solution quality.
III. TECHNIQUES FOR FAST TIMING-DRIVEN BUFFER INSERTION
For timing-critical nets, buffer insertion must be deployed frequently to improve delay, whether to handle nets with large fanout or long wires, or to isolate noncritical sinks from critical ones. For example, Fig. 6(a) shows a three-pin net with poor timing, in which the small squares are potential buffer insertion locations. Proper buffer insertion, as shown in Fig. 6(b), improves the timing to the most critical sink by 200 ps. The bottom sink is not critical, so only a decoupling buffer is required for that subpath.
The buffering algorithms in PDS are based on the classic dynamic programming paradigm [24], because the algorithm is provably optimal for a given tree topology (such as [39], [40]). However, it will frequently insert many additional buffers to obtain a negligible improvement in performance, so the algorithm must also manage the tradeoff between buffering resources and delay [41]. Doing so changes the algorithm's complexity from polynomial to pseudopolynomial and in practice adds an order of magnitude to the runtime. The result is an extremely effective algorithm for timing-driven buffer trees, though the algorithm's inefficiency is problematic.
Thus, it is essential to make this core optimization as fast as possible. This section explores tricks for tweaking the classic algorithm to obtain significant performance improvements without losing solution quality. These techniques can be easily integrated with the classic buffer insertion framework while also considering slew, noise, and capacitance constraints [42], [43]. Used in conjunction, these techniques can lead to more than a factor-of-ten performance improvement versus traditional dynamic programming.
A. Overview of the Classic Buffering Algorithm

For a given Steiner tree with a set of buffer locations (namely, the internal nodes), buffer insertion inserts buffers at some subset of legal locations such that the required arrival time (RAT) at the source is maximized. In the dynamic programming framework, candidate solutions are generated and propagated from the sinks toward the
Table 1 Comparisons of Hierarchical Analytic Top-Down Placement Against Flat Placement in Wirelengths and Runtimes
Fig. 6. An example of how buffer insertion can improve timing to critical sinks. (a) A net without buffers inserted. (b) Proper buffer
insertion improves timing.
source. Each candidate solution γ is associated with an internal node in the tree and is characterized by a 3-tuple (q, c, w). The value q represents the required arrival time; c is the downstream load capacitance; and w is the cost summation for the buffer insertion decisions.

Initially, a single candidate (q, c, w) is assigned to each sink, where q is the sink RAT, c is the load capacitance, and w = 0. When the candidate solutions are propagated from a node to its parent, all three terms are updated accordingly. At an internal node, a new candidate is generated by inserting a buffer. At each Steiner node, the two sets of solutions from the children are merged. Finally, at the source, the solutions with maximum q are selected.
The candidate solutions at each node are organized as an array of linked lists. The solutions in each list of the array share the same buffer cost value w = 0, 1, 2, .... During the algorithm, inferior solutions are pruned. A solution is defined as inferior (or redundant) if there exists another solution that is at least as good in slack, capacitance, and buffer cost. More precisely, for two candidate solutions γ1 = (q1, c1, w1) and γ2 = (q2, c2, w2), γ2 dominates γ1 if q2 ≥ q1, c2 ≤ c1, and w2 ≤ w1. In such a case, we say γ1 is redundant and may be pruned. After pruning, every list with the same cost is sorted in terms of q and c.
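As a sketch, dominance pruning within one cost bucket reduces to keeping only the (q, c) pairs on the staircase where RAT strictly increases with load; the helper below assumes plain tuples and arbitrary units, purely for illustration.

```python
def prune_dominated(cands):
    """Keep only non-dominated (q, c) candidates of one cost bucket.

    A candidate is redundant if another one has RAT at least as large
    and load no larger.  After pruning, sorting by c also sorts by q.
    """
    # Sort by load ascending; on equal load, put the larger RAT first.
    ordered = sorted(cands, key=lambda qc: (qc[1], -qc[0]))
    kept = []
    best_q = float("-inf")
    for q, c in ordered:
        if q > best_q:  # strictly better RAT than every lighter candidate
            kept.append((q, c))
            best_q = q
    return kept
```

After this pass the surviving list is sorted in both q and c, which is the invariant the later pruning techniques rely on.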
A buffer library is a set of buffers and inverters, each associated with a driving resistance, input capacitance, intrinsic delay, and buffer cost. During optimization, we wish to control the total buffer resources so that the design is not over-buffered for marginal timing improvement. While total buffer area can be used, to first order the number of buffers provides a reasonably good approximation of buffer resource utilization. Indeed, we use the number of buffers, since it allows a much more efficient baseline van Ginneken implementation. Note that the techniques presented in this paper can be applied to any buffer resource model, such as total buffer area or power.
At the end of the algorithm, a set of solutions with different cost-RAT tradeoffs is obtained. Each solution gives the maximum RAT achievable under the corresponding cost bound. In practice, we choose neither the solution with maximum RAT at the source nor the one with minimum total buffer cost. Usually, we would like to pick a solution in the middle, such that the solution with one more buffer brings only marginal timing gain. In PDS, we use the "10 ps rule" (though the value can of course be modified depending on the frequency target). With the final solutions sorted by the source's RAT value, we start from the solution with maximum RAT and compare it with the second solution (which usually has one buffer less). If the difference in RAT is more than 10 ps, we pick the first solution. Otherwise, we drop it (since for less than 10 ps of timing improvement, it is not worth an extra buffer) and continue by comparing the second and third solutions. Of course, instead of 10 ps, any time threshold can be used when applying the rule to different nets.
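A minimal sketch of this selection rule, assuming a hypothetical list of solutions sorted with the maximum-RAT solution first; the `Solution` record and units are illustrative, not PDS data structures.

```python
from collections import namedtuple

# Hypothetical record: source RAT (ps) and buffer count of one solution.
Solution = namedtuple("Solution", "rat cost")

def pick_solution(solutions, threshold_ps=10.0):
    """Pick a cost/RAT tradeoff point using the '10 ps rule'.

    `solutions` is sorted best (largest) RAT first; each later entry
    typically uses one buffer fewer.  A solution is kept only when the
    next one gives up more than `threshold_ps` of RAT.
    """
    i = 0
    while i + 1 < len(solutions):
        if solutions[i].rat - solutions[i + 1].rat > threshold_ps:
            break  # the extra buffer buys a real timing gain; keep it
        i += 1     # less than the threshold: not worth an extra buffer
    return solutions[i]
```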
B. Preslack Pruning

During the algorithm, a candidate solution is pruned only if there is another solution that is superior in terms of capacitance, slack, and cost. This pruning is based on the information at the current node being processed. However, all solutions at this node must be propagated further upstream toward the source. This means the load seen at this node must be driven by some minimal amount of upstream wire or gate resistance. By anticipating the upstream resistance ahead of time, one can prune out more potentially inferior solutions earlier rather than later, which reduces the total number of candidates generated. More specifically, assume that each candidate must be driven by an upstream resistance of at least Rmin. Pruning based on this anticipated upstream resistance is called prebuffer slack pruning.
Prebuffer Slack Pruning (PSP): For two non-redundant solutions (q1, c1, w) and (q2, c2, w), where q1 < q2 and c1 < c2, if (q2 - q1)/(c2 - c1) ≤ Rmin, then (q2, c2, w) is pruned.
The PSP technique was first proposed in [44]. Using an appropriate value of Rmin guarantees that optimality is not lost [44], [45]. However, what if we are willing to sacrifice optimality for a faster solution by using a resistance R that is larger than Rmin? In practice, we observe that a value somewhat larger than Rmin does not hurt solution quality.
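As a sketch, the pruning rule can be applied in one pass over a candidate list that is already sorted so that q and c both ascend (the invariant maintained by dominance pruning); the list values and r_min below are hypothetical.

```python
def prebuffer_slack_prune(cands, r_min):
    """Prune candidates that r_min of upstream resistance will doom.

    `cands` holds (q, c) pairs of one cost bucket, sorted so q and c
    both ascend.  If the RAT gain relative to the last kept candidate
    is at most r_min times the extra load, the larger-load candidate is
    pruned: after propagating through resistance R >= r_min, its slack
    q - R*c can never beat the lighter candidate's.
    """
    if not cands:
        return []
    kept = [cands[0]]
    for q, c in cands[1:]:
        q0, c0 = kept[-1]
        if (q - q0) > r_min * (c - c0):  # enough gain to justify the load
            kept.append((q, c))
    return kept
```

The comparison is written with a multiplication rather than a division, so equal-capacitance entries need no special casing.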
We performed buffer insertion experiments on 1000 high-capacitance industrial nets, varying the value of R used for preslack pruning. The slack and CPU time as percentages of the no-preslack-pruning baseline are shown in Fig. 7. Observe that the slack degrades slowly as a function of resistance, while the CPU time decrease is fairly sharp. For example, R = 120 Ω is the value of Rmin for which preslack pruning is still optimal. However, one can get a 50% speedup for less than 5% slack degradation with a larger value of R = 600 Ω. These results indicate that using PSP can bring a large speedup in classic buffering for a fairly small degradation in solution quality.
C. Squeeze Pruning

The basic data structure of van Ginneken style algorithms is a sorted list of non-dominated candidate solutions. Both the pruning in the van Ginneken style algorithm and prebuffer slack pruning are performed by comparing two neighboring candidate solutions at a time. However, more potentially inferior solutions can be pruned by comparing three neighboring candidate solutions simultaneously. For three solutions in the sorted list, the middle one may be pruned according to squeeze pruning, defined as follows.
Squeeze Pruning: For every three candidate solutions (q1, c1, w), (q2, c2, w), (q3, c3, w), where q1 < q2 < q3 and c1 < c2 < c3, if (q2 - q1)/(c2 - c1) < (q3 - q2)/(c3 - c2), then (q2, c2, w) is pruned.
For a two-pin net, consider the case in which the algorithm proceeds to a buffer location and there are three sorted candidate solutions with the same cost, corresponding to the first three candidate solutions in Fig. 8(a). According to the rationale behind prebuffer slack pruning, the q-c slope between two neighboring candidate solutions indicates the potential for the candidate solution with smaller c to prune out the other one; a small slope implies a high potential. For example, (q1, c1, w) has a high potential to prune out (q2, c2, w) if (q2 - q1)/(c2 - c1) is small. If the slope between the first and second candidate solutions is smaller than the slope between the second and third, then the middle candidate solution is always dominated by either the first or the third candidate solution. Squeeze pruning therefore preserves optimality for a two-pin net. After squeeze pruning, the solution curve in the (q, c) plane is concave, as shown in Fig. 8(b).
For a multisink net, squeeze pruning does not guarantee optimality, since each candidate solution may merge with different candidate solutions from the other branch, and the middle candidate solution in Fig. 8(a) may offer smaller capacitance to candidate solutions in the other branch. Squeeze pruning may thus prune a candidate solution that would have yielded less total capacitance after merging. However, despite the loss of guaranteed optimality, squeeze pruning causes no degradation in solution quality most of the time and is overall a fairly safe pruning technique.
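Squeeze pruning can be implemented as a single stack-based sweep over the sorted list, much like a concave-hull scan; the cross-multiplied comparison below avoids divisions. The tuple representation is an assumption for illustration.

```python
def squeeze_prune(cands):
    """Squeeze-prune a sorted, non-dominated (q, c) list of equal cost.

    The middle of three neighbors is dropped when the slope to its left
    neighbor is smaller than the slope to its right neighbor; the
    surviving curve is concave in the (q, c) plane.
    """
    kept = []
    for q, c in cands:
        while len(kept) >= 2:
            (q1, c1), (q2, c2) = kept[-2], kept[-1]
            # (q2-q1)/(c2-c1) < (q-q2)/(c-c2), cross-multiplied.
            if (q2 - q1) * (c - c2) < (q - q2) * (c2 - c1):
                kept.pop()  # the middle candidate is squeezed out
            else:
                break
        kept.append((q, c))
    return kept
```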
D. Library Lookup

The size of the buffer library is an important factor in determining runtime. Modern designs may have hundreds of buffers and inverters to choose from. The theoretical complexity of van Ginneken style buffer insertion is quadratic in the library size, though in practice it appears to be linear. To avoid the slowdown from large libraries, we take advantage of buffer library pruning [46] to select a small yet effective set of buffers from all those that may be used. We now discuss a more effective technique, library lookup.
During van Ginneken style buffer insertion, every buffer in the library is examined at each buffer location. If there are n candidate solutions at an internal node before buffer insertion and the library consists of m buffers, then mn tentative solutions are evaluated. For example, in Fig. 9(a), all eight buffers are considered for all n candidate solutions. However, many of these candidate solutions are clearly not worth considering. We seek to avoid generating poor candidate solutions in the first place, rather than adding m buffered candidate solutions for each
Fig. 8. Squeeze pruning example. (a) The solution curve in the (q, c) plane before squeeze pruning. (b) The solution curve after squeeze pruning.
Fig. 7. The speedup and solution-quality sacrifice of aggressive preslack pruning for 1000 nets, as a function of R.
unbuffered candidate solution. We propose to consider each candidate solution in turn. For each candidate solution with capacitance ci, we look up the non-inverting buffer and the inverting buffer that yield the best delay from two tables precomputed before optimization begins. In Fig. 9(b), the capacitance ci results in selecting buffer B3 from the non-inverting table and inverter I2 from the inverting table.

All 2n tentative new buffered candidate solutions can be divided into two groups: one group includes the n candidate solutions with an inverting buffer just inserted, and the other includes the n candidate solutions with a non-inverting buffer just inserted. We choose only the candidate solution that yields the maximum slack from each group, so finally only two candidate solutions are inserted into the original candidate solution lists. Since the number of tentative new buffered solutions is reduced from mn to 2n, a speedup is achieved. Also, since only two new candidate solutions are inserted instead of m, the total number of candidate solutions is reduced. This is similar to having a buffer library of size two, except that the buffer type may change depending on the downstream load.
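One way to realize the lookup, as a sketch: precompute the best buffer for each load value on a capacitance grid (a linear delay model d = d_intrinsic + r_drive * load is assumed here for illustration), then answer queries by binary search. A real implementation would build one such table for non-inverting buffers and one for inverters.

```python
import bisect

def build_lookup(buffers, cap_grid):
    """Precompute, for each load bin, the name of the least-delay buffer.

    Each buffer is (name, r_drive, d_intrinsic); the delay of driving a
    load c is modeled as d_intrinsic + r_drive * c.
    """
    table = []
    for cap in cap_grid:
        best = min(buffers, key=lambda b: b[2] + b[1] * cap)
        table.append(best[0])
    return table

def lookup_best(table, cap_grid, load):
    """O(log n) lookup of the best buffer for a given downstream load."""
    i = min(bisect.bisect_left(cap_grid, load), len(cap_grid) - 1)
    return table[i]
```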
E. Results and Summary

Table 2 shows the impact of the three speedup techniques, preslack pruning (PSP), squeeze pruning (SqP), and library lookup (LL), versus the classic algorithm (baseline). The results are averaged over 5000 high-capacitance nets from an ASIC chip. The second column shows the total slack improvement (over all 5000 nets) after buffer insertion, and the third column gives the total CPU time. Overall, the three techniques resulted in a 20X speedup, with just 3% degradation in solution quality.
Buffer insertion is a core optimization for fixing timing-critical paths. When optimizing tens of thousands of nets, some optimality can be sacrificed in order to obtain acceptable runtime. Note that at the end of physical synthesis, one could try reapplying buffer insertion without these speedups (while also using more accurate delay models) to the handful of remaining critical nets. This is still much more efficient than applying full-blown, high-accuracy buffer insertion to the entire design.

This work in essence summarizes our philosophy for fast physical synthesis: do the optimization well and as fast as possible, even if a little optimality is sacrificed. At the end, if the design is close to timing closure, slower and more accurate techniques can always be employed to further refine the design.
IV. FAST ELECTRICAL CORRECTION

The previous section discussed fast buffering for critical path optimization. Our focus now turns toward using buffers and gate sizing for electrical correction. As discussed in the first section, electrical correction is becoming an increasingly costly phase of physical synthesis. High wire resistance and sharp required slew rates (for either noise or performance) mean that potentially millions of buffers must be inserted and millions of gates must be repowered simply to have an electrically correct design. Critical path optimization techniques rely on the correct operation of the timing analyzer; however, any timer, even a sophisticated one, only works correctly if the design it is given is in a reasonable electrical state. For example, if capacitive loads are outside the range for which a gate model has been characterized, the timer will give results that do not reflect the true performance of the gate. Further, if one can quickly make the timing results look decent, much less work is left for the subsequent, slower critical path optimizations.
This section focuses on how to quickly perform electrical correction, i.e., fix capacitance and slew violations [20]. Further, it is crucial that this phase require minimal
Fig. 9. Library lookup example. B1 to B4 are non-inverting buffers. I1 to I4 are inverting buffers. (a) van Ginneken style buffer insertion. (b) Library lookup.
Table 2. Simulation Results for a Full Library Consisting of 24 Buffers. Baseline Gives the Results of the Algorithm of Lillis et al. [47]. PSP Shows the Results of the Aggressive Prebuffer Slack Pruning Technique. SqP Stands for Our Squeeze Pruning Technique. LL Is the Library Lookup Technique.
area overhead, thereby reducing unnecessary power consumption and silicon real estate. The need for reducing area usage is obvious for area-constrained designs. However, even in designs where total area may not be at a premium, local regions may be congested. Further, in delay-constrained designs, the area saved can be used by subsequent optimizations to improve the performance of critical regions.
A. Types of Electrical Violations

Timing analyzers utilize precharacterized models for gate delays and slews. Each gate is characterized with a maximum capacitive load that it can drive and a maximum input slew rate, and the operation of the timer is valid within these ranges. If these conditions are violated, timers usually extrapolate to obtain "best guess" values. However, values calculated in this manner may be inaccurate. This leads to the limits that define electrical violations. There are two "rules" that a design has to pass for it to be electrically clean, as follows.
• Slew Limits: These rules define the maximum slews permissible on all nets of the design. If the slew (defined here as the 10%-90% rise or fall time of a signal, though other definitions can be used as well) at the input of a logic gate is too large, the gate may not switch at the target speed, or may not switch at all, leading to incorrect operation.
• Capacitance Limits: These define the maximum effective capacitance that a gate or an input pin can drive. A large capacitance on the output of a gate directly affects its switching speed and power dissipation. Additionally, gates are typically characterized for a limited range of output capacitance, and delay calculation during design can be incorrect if the output capacitance is greater than the maximum value.
Violations of these rules (referred to as slew violations and capacitance violations) are together called electrical violations. These limits are principally determined during gate characterization, but designers may choose to tighten the constraints further. High-performance designs, such as microprocessors, typically have much tighter slew limits than ASICs.
B. Causes of Violations

Fig. 10 shows the main causes of slew violations and how they may be fixed. Consider a net having source gate A and sink gate B. The capacitive load seen by gate A is the sum of the interconnect capacitance of the net and the input capacitance of gate B. Assume that a signal with slew s1 is applied at the input of gate A. Due to the load that it has to drive, the slew s2 at the output of gate A may be larger than s1. Thus, one cause of degradation is the source gate not being capable of driving the load at its output. Next, even if the slew s2 at the output of A is within the specified limits, it can degrade as the signal traverses the net to the sink. Thus, at the sink, the signal could have an even larger slew s3. This is the second contributor to slew degradation.
There are two main methods of fixing slew violations, as shown in Fig. 10. First, the source gate of the net can be sized up, so that the new gate can drive the load present. While this may fix violations on the net in question, the obvious disadvantage is that the problem has been moved to the input of the source gate, whose input nets now see larger sink capacitances. However, this may or may not create violations on the input nets.

Second, keeping the source at its original size, buffers can be inserted on the net in question. These isolate the load capacitance of the sink and repower the signal on the net, so that slews are within the specified limits. Unlike resizing, this method does not affect the electrical state of any other nets, but the area overhead can be much higher. Additionally, the time required to determine where best to insert buffers is much greater than the time required to resize a gate.
The causes of capacitance violations are similar to those
of slew violations: sink and interconnect capacitance both
Fig. 10. Causes of slew violations, and different methods of fixing them. (a) Slew degradation due to gate and interconnect.
(b) Fixing slew violation by sizing source. (c) Fixing slew violation by buffering.
contribute to the existence of a violation. The fixes, too, are similar, using resizing and buffering. However, it is important to note that it is possible to have capacitance violations on a net that has no slew violations, and vice versa. Therefore, both capacitance and slew violations have to be considered individually.

The simplest way to perform electrical correction is via a sequential approach: first try resizing gates to fix violations, being careful not to oversize them; for those nets that cannot be fixed with resizing, invoke a buffer insertion algorithm. This may require a second pass of resizing in order to properly size the newly inserted buffers and inverters.
The most important drawback of this approach is that the sizing and buffering used to fix violations are applied sequentially, with no communication or, indeed, knowledge of each other's capabilities. Thus, a pass of resizing or buffering tries to fix the violations that it sees and assumes that the other will be able to handle the violations that it cannot fix. For example, when resizing is applied to a net to fix a slew violation on a sink, it may decide that buffering is the better solution, for a variety of reasons, and leave the net alone. However, in the next pass, when the net is passed to the buffer insertion routine, there may be conditions, such as blockages, that prohibit the insertion of buffers. Subsequent passes of resizing and buffering with different settings are then needed to overcome this situation, and there is no guarantee that any of these passes will fix the existing violation.
C. An Integrated Approach

Alternatively, we propose a framework that tightly integrates the selection of the two optimizations, allowing the correct optimization to be applied in a single pass over the design. This integrated approach selectively applies the resizing and buffering optimizations on a net-by-net basis. Nets are selected in topological order, from outputs to inputs, and on each net the following operations are carried out.
• If there are no violations on the net, the source (driving) gate is sized down as much as possible without introducing new violations.
• If slew violations exist on the net, the source gate is sized up as necessary to fix the violations.
• If the previous step (resizing to fix violations) does not succeed, the net is buffered.
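The per-net policy above can be sketched as follows; the `net` record and the three callables are hypothetical hooks standing in for the real sizing and buffering engines, not PDS APIs.

```python
def correct_net(net, downsize, upsize_to_fix, buffer_net):
    """Apply the integrated correction policy to a single net.

    `net` is a hypothetical record; `upsize_to_fix` returns True when
    resizing alone removed the violation.
    """
    if not net["violations"]:
        downsize(net)        # recover area and lighten the input nets
        return "downsized"
    if upsize_to_fix(net):   # cheap, local fix first
        return "resized"
    buffer_net(net)          # last resort: buffering, as aggressive as needed
    return "buffered"
```

Calling this on every net in output-to-input topological order gives the single-pass flow; any side effect of a resize lands only on nets not yet visited.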
The rationale for this approach is as follows. First, nets are processed in output-to-input order; any side effect of resizing gates only impacts the input nets, which are yet to be processed. Sizing a gate up to remove a violation on its output has a detrimental effect on its input nets; this is handled by processing nets in the correct order.

Second, sizing gates down when possible has two benefits: area is recovered when gates are larger than necessary, and reducing the load on input nets potentially removes violations that may exist, or reduces their severity. The area salvaged in this step is better used for improving delay on critical paths of the circuit. Of course, this step can be skipped if the design has already been optimized for delay.
Finally, if resizing cannot fix a violation, buffering is used to fix the net. Since buffering is the last resort, this optimization can be as aggressive as required, which is used to our advantage as shown later. This order (resizing followed by buffering) is also advantageous from a runtime standpoint, since buffering a net is much slower than simply sizing the source gate.

The approach to gate sizing is straightforward. Given an input slew rate and output load, we iterate through all available sizes and select the smallest gate size that can deliver the required output slew. Buffering is based on the algorithm described in the next section, which selects the minimum-area solution satisfying the electrical constraints. For runtime considerations, a coarse buffer library is often used for buffer insertion; the lack of granularity in the buffer library makes it worthwhile to resize the inserted buffers afterward. Of course, a more fine-grained library can be used, at the cost of extra runtime.
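The size selection reduces to a linear scan over the characterized sizes; in the sketch below, the slew-model callable and its signature are assumptions made purely for illustration.

```python
def pick_smallest_size(sizes, input_slew, load, slew_limit, out_slew):
    """Return the smallest gate size whose output slew meets the limit.

    `sizes` is ordered smallest first; `out_slew(size, input_slew, load)`
    stands in for the precharacterized slew model of the gate.
    """
    for size in sizes:
        if out_slew(size, input_slew, load) <= slew_limit:
            return size
    return None  # no size suffices; the caller falls back to buffering
```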
To decide whether a gate meets its required slew
target, we adopt the model of Kashyap et al. [48] because
of its simplicity. It is actually the slew equivalent of the Elmore delay model, but does not suffer as severely from inaccuracies caused by resistive shielding.
The slew model can be explained using a generic
example: a path p from node v_i (upstream) to v_j
(downstream) in a buffered tree. There is a buffer (or the
driver) b_u at v_i, and there is no buffer between v_i and v_j.
The slew rate s(v_j) at v_j depends on both the output slew
s_{bu,out}(v_i) at buffer b_u and the slew degradation sw(p) along path p (or wire slew), and is given by [48]

s(v_j) = sqrt( s_{bu,out}(v_i)^2 + sw(p)^2 )    (4)

where the wire slew degradation is

sw(p) = ln 9 · D(p)    (5)

and D(p) is the Elmore delay of path p.
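As a small numeric sketch of this slew model (assuming, per [48], that wire slew degradation is ln 9 times the Elmore delay and that driver and wire slews combine in root-sum-square fashion; all values hypothetical):

```python
# Toy illustration of the Kashyap et al. [48] slew model (assumed form:
# wire slew degradation = ln 9 * Elmore delay, combined root-sum-square
# with the driver's output slew). Units and values are hypothetical.
import math

def wire_slew(elmore_delay):
    # Slew degradation along an unbuffered path.
    return math.log(9.0) * elmore_delay

def sink_slew(driver_output_slew, elmore_delay):
    # Root-sum-square combination of driver slew and wire degradation.
    return math.sqrt(driver_output_slew**2 + wire_slew(elmore_delay)**2)
```

With zero wire delay the sink slew reduces to the driver's output slew, which matches the intuition that degradation accumulates only along unbuffered wire.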
The basic framework presented above is flexible, and
lends itself to multiple refinements as follows. Once a net is buffered, the integrated framework allows for a quick
sizing of the newly added buffers. The buffering algorithm
can therefore be used with a small library of buffers.
Existing inverter trees can be ripped up and reinserted as
required, keeping in mind signal polarity constraints on
the sinks. If buffering does not fix a net, the cause of the
failure can be analyzed on the fly, and different algorithms,
e.g., for blockage avoidance, can be used. Finally, if area is
Alpert et al.: Techniques for Fast Physical Synthesis
586 Proceedings of the IEEE | Vol. 95, No. 3, March 2007
at a premium, both resizing and buffering can be applied to every net, and the solution with the lowest cost can be
selected.
D. Electrical Correction Summary
The integrated framework allows PDS to efficiently
perform electrical correction. However, in our initial
implementation, we found that 80%–90% of the runtime
takes place in the van Ginneken style buffer insertion algorithm, even with the speedups discussed above. For
electrical correction, using a buffer insertion algorithm
which optimizes for delay is wasteful, since the purpose of
this stage is to simply produce an electrically correct
design. This motivates a new buffer insertion formulation
specifically for electrical correction that is discussed in the
next section.
V. FAST TIMERLESS BUFFERING
The efficiency of electrical correction directly depends on
the efficiency of the buffering algorithm. While Section III
shows how one can speed up performance driven
buffering, it still suffers from the fact that three constraints
must be handled at once: area, slew, and delay. In elec-
trical correction, one can afford to ignore the last objective, delay. The assumption is that if a tree buffered by
electrical correction subsequently becomes part of a
critical path, it can always be ripped up and rebuffered
by the critical path optimizations while taking into account
the most up to date timing analysis. In general, we find
that only a relatively small percentage of nets (e.g., 5%)
need to be rebuffered. Thus, this section proposes a
simpler buffer formulation that ignores delay constraints in order to achieve a more runtime- and area-efficient
result.
The key observation that motivates this approach is that
traditional buffer insertion requires pruning based on
three components: capacitance, slack (or delay), and area
(or power). Because a candidate has to be inferior in all
three categories to be pruned, the list of possible candidates can grow quite large. However, to perform electrical correction, the optimal delay solution is not required
and instead one wishes to fix electrical violations with
minimum area. By using only two instead of three
categories for pruning, one can obtain a much more efficient solution (that is actually linear time in the case of a
single buffer).
A. Problem Formulation
For electrical correction, we seek the minimum area
(or cost) buffering solution such that slew constraints are
satisfied. Since one does not need to know the required arrival
time at the sinks, it can be performed independently of timing
analysis, hence the term, timerless buffering. While this
new formulation is actually NP-complete, some highly
efficient and practical algorithms can be utilized.
The input to the timerless buffering problem includes a routing tree T = (V, E), where V = {s_0} ∪ V_s ∪ V_n, and
E ⊆ V × V. Vertex s_0 is the source vertex, V_s is the set of
sink vertices, and V_n is the set of internal vertices. Each sink
vertex s ∈ V_s is associated with sink capacitance c_s. Each
edge e ∈ E is associated with lumped resistance R_e and
capacitance c_e. A buffer library B contains different types
of buffers. Each type of buffer b has a cost w_b, which can be
measured by area or any other metric, depending on the optimization objective. Without loss of generality, we
assume that the driver at source s_0 is also in B. A function
f : V_n → 2^B specifies the types of buffers allowed at each
internal vertex.
The output slew of a buffer, such as b_u at v_i, depends on
the input slew at this buffer and the load capacitance seen
from the output of the buffer. For a fixed input slew, the
output slew of buffer b at vertex v is then given by

s_{b,out}(v) = R_b · c(v) + K_b    (6)

where c(v) is the downstream capacitance at v, and R_b and K_b
are empirical fitting parameters. This is similar to
empirically derived K-factor equations [50]. We call R_b
the slew resistance and K_b the intrinsic slew of buffer b.
A buffer assignment Γ is a mapping Γ : V_n → B ∪ {b̄}, where b̄ denotes that no buffer is inserted. The cost of a
solution Γ is w(Γ) = Σ_{b∈Γ} w_b. With the above notation,
the basic timerless buffering problem can be formulated
as follows.

Timerless Buffering Problem: Given a Steiner tree T = (V, E) and a buffer library B, compute a buffer assignment
Γ such that the total cost w(Γ) is minimized, subject to the
input slew at each buffer or sink being no greater than a
constant τ.
B. A Timerless Buffering Algorithm
In the dynamic programming framework, a set of
candidate solutions are propagated from the sinks toward the source along the given tree. Each solution α is characterized by a three-tuple (c, w, s), where c denotes the
downstream capacitance at the current node, w denotes
the cost of the solution, and s is the accumulated slew
degradation sw defined in (5). At a sink node, the corresponding solution has c equal to the sink capacitance,
w = 0, and s = 0. The solution propagation is accomplished by the following operations.

Consider propagating solutions from a node v to its
parent node u through edge e = (u, v). A solution α_v at v becomes solution α_u at u, which can be computed as
c(α_u) = c(α_v) + c_e, w(α_u) = w(α_v), and s(α_u) = s(α_v) + ln 9 · D_e, where D_e = R_e((c_e/2) + c(α_v)).
In addition to keeping the unbuffered solution α_u, a
buffer b_i can be inserted at u to generate a buffered
solution α_{u,buf}, which can then be computed as c(α_{u,buf}) = c_{b_i}, w(α_{u,buf}) = w(α_v) + w_{b_i},
and s(α_{u,buf}) = 0.
When two sets of solutions are propagated through the left
child branch and right child branch to reach a branching
node, they are merged. Denote the left-branch solution set
and the right-branch solution set by Θ_l and Θ_r, respectively. For each solution α_l ∈ Θ_l and each solution α_r ∈ Θ_r,
the corresponding merged solution α' can be obtained according to c(α') = c(α_l) + c(α_r), w(α') = w(α_l) + w(α_r),
and s(α') = max{s(α_l), s(α_r)}. To ensure that the worst
case in the two branches still satisfies the slew constraint, we
take the maximum slew degradation for the merged
solution.
For any two solutions α_1, α_2 at the same node, α_1
dominates α_2 if c(α_1) ≤ c(α_2), w(α_1) ≤ w(α_2), and s(α_1) ≤ s(α_2). Whenever a solution becomes dominated, it is
pruned from the solution set without further propagation. A solution α can also be pruned when it is infeasible, i.e.,
either its accumulated slew degradation s(α) or the slew
rate of any downstream buffer in α is greater than the slew
constraint τ.
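The candidate propagation and pruning described above can be sketched as follows. This is a simplified single-branch illustration, not the PDS implementation; the buffer parameters are hypothetical and the slew-degradation formula follows the wire propagation rule given in the text.

```python
# Sketch of timerless candidate propagation along one unbuffered chain
# of wire segments, with (cap, cost, slew) tuples and dominance /
# infeasibility pruning. Buffer parameters are hypothetical.
import math

LN9 = math.log(9.0)

def propagate_wire(cands, Re, Ce, tau):
    """Add a wire segment upstream of all candidates; drop infeasible ones."""
    out = []
    for c, w, s in cands:
        s2 = s + LN9 * Re * (Ce / 2.0 + c)  # accumulated slew degradation
        if s2 <= tau:                       # infeasible candidates pruned
            out.append((c + Ce, w, s2))
    return out

def insert_buffers(cands, buffers):
    """Optionally add a buffer: slew resets to 0, cap becomes input cap."""
    best = {}
    for c, w, s in cands:
        for b in buffers:
            # Only the cheapest buffered candidate per buffer type survives,
            # matching the "one new solution per buffer" observation below.
            cand = (b["cin"], w + b["cost"], 0.0)
            if b["name"] not in best or cand[1] < best[b["name"]][1]:
                best[b["name"]] = cand
    return cands + list(best.values())

def prune_dominated(cands):
    """Remove candidates dominated in all of (cap, cost, slew)."""
    return [a for a in cands
            if not any(b != a and all(b[i] <= a[i] for i in range(3))
                       for b in cands)]
```

A usage round trip: start from a sink candidate, walk a wire segment, then offer a buffered alternative; neither candidate dominates the other, so both are kept for further propagation.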
When a buffer b_i is inserted into a solution α, s(α) is set
to zero and c(α) is set to c(b_i). This means that inserting
one buffer may bring only one new solution, namely, the one
with the smallest w. However, in minimum cost timing buffering, a buffer insertion may result in many non-dominated (q, c, w) tuples with the same c value, where q denotes the required arrival time (RAT).
Consequently, in timerless buffering, at each buffer
position along a single branch, at most |B| new solutions
can be generated through buffer insertion, since c and s are
the same after inserting each buffer. In contrast, buffer
insertion in the same situation may introduce many new solutions in timing buffering. This sheds light on why
timerless buffering can be computed much more efficiently.
Another important fact is that the slew constraint is in
some sense close to a length constraint. In timerless buffering, solutions soon become infeasible if no buffer is
added, and thus many solutions that are propagated only
through wire insertion are removed quickly. An extreme
case demonstrating this point: in standard timing buffering,
given a loose timing constraint, solutions with no buffer
inserted can always survive until being pruned at the
driver. This may not happen in timerless buffering: such solutions soon
become infeasible as long as the slew constraint is not too
loose.
Due to these special characteristics of the timerless
buffering problem, a linear time optimal algorithm for
buffering with a single buffer type is possible. In timing buffering, it is not known how to design a polynomial time algorithm for this case. From these facts, the basic differences between these two somewhat related buffering
problems are clear.
C. Results and Summary
Table 3 compares timerless buffering to timing-driven
buffering for 1000 high capacitance nets from an ASIC
design, for slew constraints ranging from 0.4 to 2.0 ns. A
library of 48 buffers was used. The experiment shows that timerless buffering does result in a consistent degradation in slack, which is not surprising since it does
not utilize timing information. Because timerless buffering minimizes area in its objective function, it is more
efficient in buffering area and the number of buffers
used. The area savings tend to increase as the slew
constraint is relaxed. Finally, the CPU time advantage is
clear, as speedups of 25 to over 100 are observed. The timing-driven buffering used here does utilize preslack
pruning and squeeze pruning, but not library lookup.
Obviously the latter technique would reduce the advantage somewhat.
Since electrical correction can result in millions of
buffers being inserted, one needs to do this as fast as
possible. Even with the speedups in Section III, a delay
driven technique is not suitable for this task. Instead, using a timerless formulation that seeks to minimize area proves
significantly faster and actually uses less area.
Ultimately, one needs a large bag of buffering solutions depending on where one is in the physical synthesis
flow. For early electrical correction, a faster timerless
algorithm is appropriate. For critical path optimization, a
van Ginneken style algorithm is needed. However, one
often may need to pay attention to the blockages or placement and routing congestion that may exist in the
design. The next section shows a framework for dealing
with any of these layout characteristics.
Table 3 Comparison of Timerless Buffering With Timing-Driven Buffering
VI. LAYOUT-AWARE FAST AND FLEXIBLE BUFFER TREES
Given a Steiner tree, we can insert buffers for critical path optimization using timing-driven buffering or electrical
correction using timerless buffering. The quality of the
results strongly depends on the Steiner tree used, and
so we use a buffer-aware tree construction as described
in [39]. However, this construction ignores the blockages
and congestion present in the layout. Ignoring this can
potentially cause several design headaches.
A. Types of Layout Issues
For example, Fig. 11(a) illustrates the ''alley'' problem,
in which space is limited between two large fixed blocks.
The space between blocks is highly desirable, since routes
that cross the blockages have their only potential insertion space
in the alley. Fig. 11(b) shows the buffer ''pile-up'' phenomenon. Several nets may desire buffers to be inserted in
the black congested region, yet since there is no space for
buffers there, the buffers are inserted as close to the
boundary as possible. As more nets are optimized, these buffers pile up and spiral out further from their ideal
locations. This could be alleviated by only allowing buffers
from critical path optimization (not electrical correction)
to use these scarce resources.
As technology continues to scale, the optimum distance
between consecutive buffers continues to decrease. In
hierarchical design, this means allocating spaces within
macro blocks for buffering of global nets. An example is shown in Fig. 12(a). The space for buffers is potentially
limited, so non-critical nets should be routed around the
blocks while critical ones can use the holes. Long non-
critical nets still require buffers to fix slew and/or ca-
pacitance violations. In addition, these nets could be
critical, but have a wide range of possible buffering
solutions that may bring them into the non-critical group.
In the figure, the top net is non-critical and requires three buffers, while the bottom net is critical and needs only two
by exploiting holes punched in the block.
Even without holes in blocks, designs may have pockets
of low density for which inserting buffers is preferred, as
shown in Fig. 12(b). In the figure, the Steiner route is
located in the low density part of the chip, which makes
the buffers inserted along the route also use low density regions. Fig. 12(c) shows an example where one may be
willing to insert buffers in high density regions if a net is
critical. The 2-buffer route above the block yields faster
delays than the 4-buffer route below the block that is better
suited for noncritical nets. Finally, Fig. 12(d) shows
routing congestion between two blocks; the preferred
buffered route avoids this congestion without sacrificing
timing.

There are some buffering approaches that attack a
subset of these types of problems by simultaneously integrating the layout environment, building a Steiner tree, and
inserting buffers (e.g., [51], [52]), but doing too much work at once
inherently makes these algorithms too inefficient for this
application. Instead, we propose the following flow:
• Step 1: construct a fast timing-driven Steiner tree
(e.g., [39]) that is ignorant of the environment.
• Step 2: reroute the Steiner tree to preserve its topology while navigating environmental constraints.
• Step 3: insert buffers via the algorithms in
Section III or V.
This section focuses on solving the problem in Step 2.
B. Rerouting Algorithm Overview
To reroute the tree, the design area is divided into
tiles, as in global routing, and the placement and
routing density characteristics are stored for each tile. The algorithm
takes the existing Steiner tree and breaks it into disjoint
2-paths, i.e., paths which start and end with either the
source, a sink, or a Steiner point, such that every internal
node has degree two. For example, the nets shown in
Fig. 13(a) and (b) both decompose into three 2-paths.
Finally, each 2-path is rerouted in turn to minimize cost,

Fig. 11. Buffer insertion can potentially: (a) fill up constrained ''alleys''
and (b) cause buffer ''pile-ups.''
Fig. 12. Some environmental based constraints include: (a) holes
in large blocks; (b) navigating large blocks and dense regions;
(c) distinguishing between critical and noncritical preferred
routes; and (d) avoiding routing congestion.
starting from the sinks and ending at the source. The new
Steiner tree is assembled from the new 2-path routes.
Essentially, the algorithm is performing maze routing for
each subsection of the tree. The two key components of
achieving a good result are plate expansion, which allows
the Steiner points to migrate, and deriving the right maze routing cost function.
If a Steiner point is in a congested region, it needs to
migrate from its original location. One could consider
allowing it to move anywhere in the layout, but since the
original Steiner layout was presumably ''good,'' we restrict
it to move only within a specified ''plate'' region. This is
one key for enabling the algorithm to be efficient. The
plate needs to be large enough to enable the Steiner point to migrate to a less congested tile.
During maze rerouting, one considers routing to any
tile in the plate instead of just the original tile.
Fig. 13(a) shows a routing tree after Step 1. The striped
tile is the Steiner point, and the shaded region shows a
5 × 5 plate centered at the original Steiner point.
Fig. 13(b) shows a Steiner tree that might result after re-
routing. The Steiner point has moved to a different location within the plate; where it ends up depends on the
cost function that is optimized. The dotted region shows
the potential search space for the rerouting of the 2-path
from the Steiner point to the source. In this case, the
bounding box containing the two endpoints was expanded
by one tile.
C. Maze Routing Cost Function for Electrical Correction
Each tile is assigned a cost that should reflect potentially
inserting a buffer in and/or routing through the tile. Let
e(t) ≤ 1 be the environmental cost of using tile t, where
e(t) = 0 if the tile is totally void of any resource utilization,
while e(t) = 1 represents a fully utilized tile. As an
example, for placement congestion, let d(t) be the
placement density (cell area divided by total area available)
of tile t, and let r(t) be its routability (used tracks divided by total tracks available). Then one could use

e(t) = α · d(t)^2 + (1 − α) · r(t)^2    (7)

where 0 ≤ α ≤ 1 trades off between routing and placement cost.
For fixing electrical violations, one wants the net to
avoid high cost tiles, while still making an attempt to
minimize wirelength. For this case, consider

cost(t) = 1 + e(t).    (8)

This cost function implies that a fully utilized tile has
cost twice that of a tile that uses no resources. The constant
of one can be viewed as a ''delay component.'' Let the cost of a path equal the sum of the costs of all tiles in the path, and
initially assign all sinks zero cost. We wish to
minimize the cost of the entire tree being constructed. For
a tile t that corresponds to a Steiner point, with subtree
children L and R, the cost of the tree rooted at t is
cost(t) = cost(L) + cost(R).
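The electrical-correction tile cost can be illustrated in a few lines; the density and routability values below are hypothetical, and the blending weight alpha follows (7).

```python
# Toy illustration of the electrical-correction tile cost: e(t) blends
# placement density d(t) and routing utilization r(t) as in (7), and a
# tile's maze cost is 1 + e(t) as in (8). All numbers are hypothetical.

def env_cost(d, r, alpha=0.5):
    # e(t) = alpha * d(t)^2 + (1 - alpha) * r(t)^2
    return alpha * d * d + (1.0 - alpha) * r * r

def tile_cost(d, r, alpha=0.5):
    # cost(t) = 1 + e(t)
    return 1.0 + env_cost(d, r, alpha)

# A fully utilized tile costs exactly twice an empty one:
print(tile_cost(0.0, 0.0))  # prints: 1.0
print(tile_cost(1.0, 1.0))  # prints: 2.0
```

The quadratic terms make nearly full tiles much more expensive than half-full ones, which steers routes toward empty regions without forbidding congested tiles outright.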
D. Maze Routing Cost Function for Critical Path Optimization
For critical nets, the cost impact of the environment is
relatively immaterial. We seek the absolute best possible
slack, but still need the route to avoid regions where buffers cannot be inserted at all. When a net is optimally
buffered (assuming no obstacles), its delay is a linear
function of its length [53]. Of course, this solution must be
realizable. To minimize delay, we simply minimize the
number of tiles to the most critical sink. Thus, the cost for
a tile is just cost(t) = 1 (there is no e(t) term). When
merging branches, one wants to choose the branch with
worst slack, so the merged cost cost(t) is max(cost(L), cost(R)). To initialize the slack, a notion of
which sink is critical is needed. Since our cost function
basically counts tiles as delay, the required arrival time
(RAT) must be converted to tiles. Let DpT be the minimum
delay per tile achievable on an optimally buffered line. For
a sink s, the cost(s) is initialized to −RAT(s)/DpT. The
more critical a sink, the higher its initial cost. The
objective is to minimize cost at the source.

Fig. 14(a) shows one of several possible solutions for
rerouting the net in Fig. 13 using this cost function, where
s2 is considered two tiles more critical than s1. Note that it
achieves a shortest path to s2. Contrast that with the
electrical correction cost function shown in Fig. 14(b), in
which the ''blob'' represents an area of high cost. In this
case, the route avoids the congested area even though it
means the route to the critical sink is much longer.
Fig. 13. Example of a three-pin net: (a) before and (b) after rerouting.
The shaded square region is the ‘‘plate’’ and the dotted region
is the solution search space for the final 2-path.
E. General Cost Function
The previous cost functions can generate extreme
behavior; however, one can trade off between the two cost
functions. Let 0 ≤ K ≤ 1 be the tradeoff parameter,
where K = 1 corresponds to electrical correction and
K = 0 corresponds to a critical net. The cost function for
tile t is then

cost(t) = 1 + K · e(t).    (9)
For critical nets, merging branches is a maximization
function, while it is an additive function for non-critical
nets. These ideas can be combined to yield

cost(t) = max(cost(L), cost(R)) + K · min(cost(L), cost(R)).    (10)

Finally, the sink initialization formula becomes

cost(s) = (K − 1) · RAT(s)/DpT.    (11)
Thus, K trades off the cost function, the merging
operation, and sink initialization. In practice, we use K = 1 for electrical correction and subsequently smaller values,
down to K = 0.1, for critical path optimization.
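A small sketch of the K-parameterized rules in (9)-(11) follows; the numeric values are toys, not data from the paper.

```python
# Sketch of the K-parameterized maze routing cost: K=1 behaves like
# electrical correction, K=0 like a critical net. Values hypothetical.

def tile_cost(e_t, K):
    # (9): environment matters in proportion to K.
    return 1.0 + K * e_t

def merge_cost(cost_L, cost_R, K):
    # (10): pure max for critical nets (K=0), pure sum for K=1.
    return max(cost_L, cost_R) + K * min(cost_L, cost_R)

def sink_init(rat, dpt, K):
    # (11): more critical sinks (smaller RAT) start with higher cost.
    return (K - 1.0) * rat / dpt

# The two extremes recover the earlier cost functions:
assert merge_cost(3.0, 5.0, 0.0) == 5.0  # critical: max only
assert merge_cost(3.0, 5.0, 1.0) == 8.0  # electrical: additive
```

Note that at K = 0, sink_init reduces to −RAT/DpT and merge_cost to a pure max, matching the critical-path rules of the previous subsection.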
F. Slew Threshold Constraint
As described, the maze routing cost functions do not
guarantee that slew constraints will be satisfied. Let T be the
maximum number of tiles that can be driven by a buffer
before the slew constraint is violated. If the route goes over
more than T consecutive blocked tiles, there will be an
unavoidable slew violation when buffering. Hence, during maze routing we track the number of consecutive blocked
tiles and forbid it from exceeding T by not performing
node expansion once this threshold is reached. This guarantees that the resulting Steiner tree will have
sufficient area for buffers so that slew violations can be
fixed by subsequent dynamic programming.
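The consecutive-blocked-tile check can be sketched as follows; the boolean route encoding is a hypothetical simplification of the tile data.

```python
# Sketch of the slew-threshold check: reject a route whose run of
# consecutive blocked tiles exceeds T, since no buffer could then be
# inserted to fix the resulting slew violation. Encoding hypothetical.

def feasible_route(blocked, T):
    """blocked: booleans along the route (True = no buffer space here)."""
    run = 0
    for tile_blocked in blocked:
        run = run + 1 if tile_blocked else 0
        if run > T:
            return False  # maze routing would stop node expansion here
    return True

assert feasible_route([True, True, False, True], T=2)
assert not feasible_route([True, True, True], T=2)
```

In the actual maze router this test is applied incrementally during node expansion rather than on a finished route, so infeasible paths are never generated in the first place.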
G. Example and Summary
The effect of rerouting can be shown by the example in
Fig. 15, which displays the placement density map for a
given 7-pin net of an industrial design. The source is marked with a white x, while sinks are marked with dark
squares. The white dots are potential buffer insertion
locations, and the diamonds are the inserted buffers. The
route on the left is the solution with K = 1.0, while the one
on the right is the solution for K = 0.1. Observe that the left
route totally avoids the large blockage, which ultimately
leads to a 4134 ps slack improvement over the unbuffered
solution. However, when K = 0.1, the route successfully finds the prime real estate (the holes inside the block)
and places buffers in them where it deems appropriate.
This improves the slack by 4646 ps. A simple parameter
setting of the cost function thus yields a different Steiner route
that can recognize layout constraints depending on the
particular phase of physical synthesis.
Optimizations that ignore the layout can cause severe
headaches for timing closure and routability. The maze rerouting technique proposed in this section is general
enough to handle any kind of layout configuration,
whether blockages, regions packed with dense cells, or
routing congestion. One does not need to deploy this
throughout physical synthesis, though. Instead one could
wait for the ''mess'' and then clean it up. For example, PDS
has a phase to identify all buffers in routing-congested
regions, rip up those buffers, then reroute them using this maze routing strategy. This clean-up-the-mess strategy
enables more efficient overall optimization than trying to
always preemptively avoid the mess. The next section
explains how a different kind of legalization algorithm is
Fig. 15. Illustration of the different routes obtained with the general
maze routing cost function for a layout containing a large block
with punched out holes. (a) A routed net with K = 1.0.
(b) The same net with K = 0.1.
Fig. 14. Examples of the (a) critical and (b) non-critical net cost
functions. The shaded area represents a region of high cost.
more effective at cleaning up messes made from synthesis operations.
VII. DIFFUSION-BASED PLACEMENT TECHNIQUES FOR LEGALIZATION
During electrical correction and critical path optimization,
some gates may be resized while new ones are inserted
into the design. PDS does not assign a location right away, but rather assigns a preferred location that may overlap
existing cells. Periodically, legalization needs to run to
snap these cells from overlapping to legal locations. If one
waits too long between legalization invocations, cells may
end up quite far from their preferred location which may
severely hurt timing. This section discusses a new
legalization paradigm called diffusion that was first
described in [28]. Diffusion tries to avoid this behavior by keeping the relative ordering of the cells intact.
Of course, there are other methods that can also
achieve legalization without moving any one cell too far
away. Brenner et al. [54] describe a network flow algo-
rithm that superimposes a flow network on top of grid bins
and then flows cells from overly dense bins to bins that are
under capacity. More recently, Luo et al. superimpose a
Delaunay triangulation on top of the cells and use this structure to enforce relative order while achieving local
density targets. Techniques for local cell movement, swap-
ping and shifting to improve placement quality after legal-
ization can be found in [55], [56].
During optimization, local regions can become overfull
at which point synthesis, buffering, and repowering opti-
mizations may become handcuffed if they are forbidden to
add to the area in an already full bin. The main advantage of diffusion is that it can allow the optimizations to
proceed anyway, knowing that cells will not be moved too
far away from their intended locations. Further, diffusion
can be implemented to run in just a few minutes, even on
designs with millions of gates.
Diffusion is a well-understood physical process that
moves elements from a state with non-zero potential energy to a state of equilibrium. The process can be modeled by breaking down the movements into several small finite
time steps, then moving each element the distance it would
be expected to move during that time step. Our legalization
approach follows this model; it moves each cell a small
amount in a given time step according to its local density
gradient. The more time steps the process is run, the closer
the placement gets toward achieving equilibrium.
Assume that a placement is close to legal if all that is required to legalize the placement is to snap cells to rows
or perhaps perform minor cell sliding in order to fit the
cells. Also, assume the chip layout is divided into small,
equally sized bins which can fit around 5–15 cells. Let dmax
be the maximum allowed density of a bin, where com-
monly dmax ¼ 1. The placement is considered close to legal
if the area density of every bin is less than or equal to dmax.
For all bins with density greater than dmax, cells must bemigrated out of those bins into less dense ones. The goal of
legalization is to reduce the density of each bin to no more
than d_max while avoiding moving these cells far from their
original locations, and also to preserve the ordering induced by the original placement. Once each bin satisfies its
density requirement d_{j,k} ≤ d_max, a legal placement solution
can generally be easily achieved (since each bin is
guaranteed sufficient space), e.g., through local slide and spiral optimization.
A. The Diffusion Process
Diffusion is driven by the concentration gradient,
which is the slope and steepness of the concentration difference at a given point. The increase in concentration in a
cross section of unit area with time is simply the difference
of the material flow into the cross section and the material
flow out of it. Diffusion reaches equilibrium when the
material concentration is evenly distributed.

Mathematically, the relationship of material concentration with time and space can be described using the
following partial differential equation:
∂d_{x,y}(t)/∂t = ∇² d_{x,y}(t)    (12)
where d_{x,y}(t) is the material concentration at position
(x, y) at time t. Equation (12) states that the speed of
density change is linear with respect to its second-order
gradient over the density space. In the context of placement, cells will move quicker when their local density neighborhood has a steeper gradient.

When the region for diffusion is fixed (as in
placement), the boundary conditions are defined as
∇d_{x_b,y_b}(t) = 0 for coordinates (x_b, y_b) on the chip boundary. We also define coordinates over fixed blocks in the
same way, in order to prevent cells from diffusing on top of
fixed blocks. This forces cells to diffuse around the blocks.
In diffusion, a cell migrates from an initial location to its final equilibrium location via a non-direct route. This
route can be captured by a velocity function that gives the
velocity of a cell at every location in the circuit for a given
time t. This velocity at a certain position and time is
determined by the local density gradient and the density
itself. Intuitively, a sharp density gradient causes cells to
move faster. For every potential (x, y) location, define a
2-D velocity field v_{x,y} = (v^H_{x,y}, v^V_{x,y}) of diffusion at time t as follows:

v^H_{x,y}(t) = −(∂d_{x,y}(t)/∂x) / d_{x,y}(t)
v^V_{x,y}(t) = −(∂d_{x,y}(t)/∂y) / d_{x,y}(t).    (13)
Given this equation, and a starting location (x(0), y(0)) for a particular element, one can find the new location
(x(t), y(t)) for the element at time t by integrating the
velocity field

x(t) = x(0) + ∫_0^t v^H_{x(t'),y(t')}(t') dt'
y(t) = y(0) + ∫_0^t v^V_{x(t'),y(t')}(t') dt'.    (14)
Equations (12)–(14) are sufficient to simulate the
diffusion process. Given any particular element, one can
now find the new location of the molecule at any point
in time t. To apply this paradigm to placement, one
needs to migrate from this continuous space to a discrete one, since cells have various rectangular sizes and the
placement image itself is discrete. The next section
presents a technique to simulate diffusion specifically for
placement.
B. Diffusion Based Placement
One can discretize continuous coordinates by dividing
the placement area into equal sized bins indexed by (j, k). Assume the coordinate system is scaled so that the width
and height of each bin is one. Then location (x, y) lies inside bin (j, k) = (⌊x⌋, ⌊y⌋). We can also discretize continuous time t as nΔt, where Δt is the size of the discrete
time step.

Instead of the continuous density d_{x,y}, we can now
describe diffusion in the context of the density d_{j,k} of bin
(j, k). The initial density d_{j,k}(0) of each bin (j, k) can be
defined as d_{j,k}(0) = Σ_i A_i, where A_i is the overlapping area of
cell i and bin (j, k).

For simplicity, assume that if a fixed block overlaps a
bin, it overlaps the bin completely. In these cases, the
bin density is defined to be one, though boundary conditions prevent cells from diffusing on top of fixed
blocks.
Assume that the density d_{j,k}(n) has already been computed for time n. Now one needs to find how the density
changes and how cells move for the next time step
n + 1. We use the Forward Time Centered Space (FTCS)
scheme [57] to discretize (12). The new bin density is
given by

d_{j,k}(n+1) = d_{j,k}(n)
  + (Δt/2) [ d_{j+1,k}(n) + d_{j−1,k}(n) − 2 d_{j,k}(n) ]
  + (Δt/2) [ d_{j,k+1}(n) + d_{j,k−1}(n) − 2 d_{j,k}(n) ].    (15)

The new density of a bin at time n + 1 depends only on its density and the density of its four neighbor bins. Note that
one does not actually use the cell locations at time n + 1 to
compute the density.
Just as (12) can be discretized to compute placement bin density, (13) can be discretized to compute the velocity of the cells inside the bins. For now, assume that each cell in a bin is assigned the same velocity, the velocity of the bin, given by

v^H_{j,k}(n) = -[d_{j+1,k}(n) - d_{j-1,k}(n)] / [2 d_{j,k}(n)]
v^V_{j,k}(n) = -[d_{j,k+1}(n) - d_{j,k-1}(n)] / [2 d_{j,k}(n)].   (16)
The horizontal (vertical) velocity is proportional to the difference in density of the two neighboring horizontal (vertical) bins.
To make sure that fixed cells and bins outside the boundary do not move, we enforce v^V = 0 at a horizontal boundary and v^H = 0 at a vertical boundary.
Assuming that each cell in a bin has the same velocity fails to distinguish between the relative locations of cells within a bin. Further, two cells that are right next to each other but in different bins can be assigned very different velocities, which could change their relative ordering. Since the goal of placement migration is to preserve the integrity of the original placement, this behavior cannot be permitted. To remedy it, we apply velocity interpolation to generate a horizontal (vertical) velocity v^H_{x,y} (v^V_{x,y}) for a given location (x, y). The interpolation looks at the four closest bins for each cell and interpolates from the velocities assigned to each of those bins, generating a unique velocity vector for a cell at location (x, y).

Finally, since the velocity of each cell can be determined at time n = t/Δt, one can compute its new placement via a discretized form of (14). Suppose at time step n a cell has location (x(n), y(n)). Its location at the next time step is given by

x(n + 1) = x(n) + v^H_{x(n),y(n)} · Δt
y(n + 1) = y(n) + v^V_{x(n),y(n)} · Δt.   (17)
An example is shown in Fig. 16 in which a cell takes
nine discrete time steps. Observe how the cell never
overlaps a blockage and also how the magnitude of its
movements becomes smaller toward the tail end of its path.
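The remaining machinery can be sketched as follows, again illustratively rather than as the paper's implementation: per-bin velocities from (16), bilinear interpolation of those velocities over the four closest bins (taking bin centers at half-integer coordinates, which is an assumption), and the Euler step of (17). The sketch assumes strictly positive bin densities.

```python
import numpy as np

def bin_velocities(d):
    """Per-bin velocities of Eq. (16); assumes all bin densities d > 0.
    Boundary bins get zero velocity so nothing drifts off the image."""
    vH = np.zeros_like(d)
    vV = np.zeros_like(d)
    vH[1:-1, :] = -(d[2:, :] - d[:-2, :]) / (2 * d[1:-1, :])
    vV[:, 1:-1] = -(d[:, 2:] - d[:, :-2]) / (2 * d[:, 1:-1])
    return vH, vV

def interpolated(v, x, y):
    """Bilinearly interpolate a per-bin quantity at continuous (x, y)
    from the four closest bins, so that adjacent cells in different
    bins receive similar velocities."""
    j0 = int(np.clip(np.floor(x - 0.5), 0, v.shape[0] - 2))
    k0 = int(np.clip(np.floor(y - 0.5), 0, v.shape[1] - 2))
    fx = float(np.clip(x - 0.5 - j0, 0.0, 1.0))
    fy = float(np.clip(y - 0.5 - k0, 0.0, 1.0))
    return ((1 - fx) * (1 - fy) * v[j0, k0] + fx * (1 - fy) * v[j0 + 1, k0]
            + (1 - fx) * fy * v[j0, k0 + 1] + fx * fy * v[j0 + 1, k0 + 1])

def move_cell(x, y, vH, vV, dt=0.1):
    """Euler step of Eq. (17) using the interpolated velocity at (x, y)."""
    return x + interpolated(vH, x, y) * dt, y + interpolated(vV, x, y) * dt
```

With a uniform density field all velocities vanish and cells stay put; raising the density of one bin pushes nearby cells away from it, which is exactly the spreading behavior diffusion relies on.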
Alpert et al.: Techniques for Fast Physical Synthesis
Vol. 95, No. 3, March 2007 | Proceedings of the IEEE 593

C. Making It Work
Since the diffusion process reaches equilibrium when each bin has the same density, we can expect the final density after diffusion to be the average density Σ_{j,k} d_{j,k}/N. This can cause unnecessary spreading, even if every bin's density is well below d_max. This additional spreading degrades the placement quality of results.
Essentially, we would like to run diffusion only for the regions that require it, perhaps for legalization or to remove routing congestion, while leaving the rest of the design (which may be in very good shape) alone. The idea of local diffusion is to run diffusion only on cells in a window around bins that violate the target density constraint. Local diffusion also has the advantages of less work per iteration and faster convergence.
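One way to realize local diffusion is to mark the violating bins, dilate that set by a small window, and then run the diffusion updates only on the marked region. The sketch below shows just the bin-selection step; the `halo` window-size parameter is an illustrative assumption, not a value from the paper.

```python
import numpy as np

def local_diffusion_windows(d, d_max, halo=2):
    """Return a boolean mask of the bins on which local diffusion should
    run: every bin whose density exceeds the target d_max, plus a window
    of `halo` bins around each violator. Bins outside the mask are left
    untouched, preserving well-placed regions of the design."""
    mask = d > d_max
    active = np.zeros_like(mask)
    J, K = d.shape
    # dilate the violation mask by `halo` bins in each direction
    for j, k in zip(*np.nonzero(mask)):
        active[max(0, j - halo):min(J, j + halo + 1),
               max(0, k - halo):min(K, k + halo + 1)] = True
    return active
```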
Although we use (15) to compute bin densities during diffusion, the computed densities are not exactly the same as the real placement densities. The mathematics of the diffusion process [(15), (16), and (17)] assumes a continuous distribution of equal-sized particles. The real standard-cell distribution does not always satisfy this condition, because cells are not equally distributed inside a bin and because cells have different sizes. Periodically, one should update the density based on the real cell placement when the error exceeds a certain threshold, and then restart the diffusion algorithm from the new placement map.
D. Diffusion Summary
Fig. 17 shows an example of diffusion-based legalization in a region surrounded by other placed cells and fixed blocks. The top-left figure shows an initial illegal placement in which the colored regions represent areas of cell overlap. The top-right figure shows what happens when traditional legalization is invoked. Observe how the integrity of the regions is no longer preserved as the colored cells mix; some cells move quite far away from their neighbors in the original illegal placement. Finally, the bottom figure shows the result of diffusion-based legalization, in which the continuity of the colored regions is relatively well preserved. This example illustrates that diffusion performs a smooth spreading, which is less disruptive to the state of the design.
To see how effective diffusion-based legalization can be in a physical synthesis engine, we ran PDS physical synthesis optimization on seven ASIC testcases without legalizing at all during the run. This results in a large number of overlaps caused by physical synthesis. For comparison, we ran a greedy and a flow-based legalizer and measured the best results obtained by those approaches [28]. Compared to these traditional approaches, diffusion averages about 4% improvement in the total wirelength of the design. Further, the timing of the worst slack path is 48% better on average, and the overall number of negative paths is 36% better. The improvement can be observed for all seven designs.
The ability of diffusion to minimize timing degradation, to smoothly spread out the placement, and to attack local hotspots of either placement or routing congestion makes it a powerful technique for physical synthesis. For starters, one can afford to run legalization less often, since diffusion is less likely to significantly disrupt the state of the design.
VIII. CONCLUSION
A. Impact of the Stages of Physical Synthesis
This paper discussed various techniques to achieve fast physical synthesis, which may be applied in all phases of physical synthesis. Recall that the four main phases considered in this paper are:
1) initial placement and optimization;
2) timing-driven placement and optimization;
3) timing-driven detailed placement;
4) optimization techniques.
One need not apply all the techniques in performing design closure, and frequently designers mix and match the pieces depending upon their needs. For example, the first phase is especially useful during the floorplanning process. The designer may wish to find the locations of large blocks and also restrict the movement of key logic. Through placement and optimization, the designer can reasonably evaluate the quality of the floorplan. If the designer is happy with this result, he or she may skip all the way to the last technique to push down the timing on any remaining critical paths.
In general, the timing after performing the first step
will be far from achieving closure, e.g., the cycle time may
be double what is required by the design specifications.
Performing timing-driven placement and optimization
generally helps significantly and results in many fewer
negative paths. The third stage generally does not help timing but may improve wiring by anywhere from 2% to 5%, and this can make a huge difference in achieving a routable design.

Fig. 16. An example cell movement from diffusion.
Finally, unless the design is for some reason "easy," the last stage of optimization is critical for actually achieving timing closure. Designers exploit this stage the most during their iterations as they tweak the design. If only minor changes are required, going back to global placement would be far too disruptive and could potentially put the design in a completely different state. The ability to iterate and perform in-place synthesis is critical in garnering the last bit of performance out of the design. However, if the timing of the design is in really bad shape, optimization alone will not be able to close on timing. The designer must go back and iterate on the floorplan and global placement steps.
B. Future Directions
Physical synthesis is a runtime-intensive, complex system that requires the integration and cooperation of several types of algorithms and functions. Exacerbating the turnaround-time problem, design sizes will likely soon move from millions to tens of millions of placeable objects. There are numerous research directions in the timing-closure space that we believe are worth pursuing to achieve both faster runtime and higher quality of results. In general, achieving better quality can also be a great way to achieve a faster system, as the back-end optimization could have far fewer negative paths to work on. Some promising research directions include the following.
1) Better net weighting for timing-driven placement. For example, consider two critical paths A and B, both of which are equally critical, but A spends 80% of its delay traversing fixed blocks and 20% in moveable logic, while B spends 20% and 80% in fixed and moveable logic, respectively. In this case, A does not have much room for error, as placement needs to fix the 20% of its logic that can be fixed, while B offers placement considerably more opportunity to straighten out the 80% of its logic that it can affect. Thus, net weighting should give more priority to nets in path A than in path B. There are numerous other scenarios that can be studied and modeled to improve net weighting.

Fig. 17. Diffusion-based legalization example.
2) Removing a global placement. In the flow described, placement is run twice. If clever net weighting and crude placement estimation are used, it may be possible to significantly improve runtime by skipping a placement step altogether while still retaining solution quality.
3) Latch pipeline placement. As designs require multiple cycles to get from one side of the chip to the other, placement needs to recognize that latches must be placed in such a way as to guarantee that one can get from one latch to another within the given cycle time. For example, assume latch A drives latch B, which drives latch C, and A is fixed on the left side while C is fixed on the right. If B is placed too close to A, then the path from B to C becomes critical. If one applies a higher net weight to the connection from B to C, then B may be moved too close to C, and the A-to-B path becomes critical. One has to teach placement to find an appropriate balance, and it is unlikely that net weighting alone can achieve this kind of result.
4) "Do no harm" detailed placement. Detailed placement is a powerful technique for improving wirelength but typically does not improve timing. In fact, it is risky to run it late in the fourth stage of the flow because it may worsen paths that were already carefully optimized. The idea of "do no harm" detailed placement [58] is to recognize and forbid moves that degrade the timing, while accepting only moves that improve wirelength and timing.
5) Force-directed placement. As discussed earlier, force-directed placement is emerging as a promising technique both in terms of quality ([7], mPL [8], mFAR [9]) and speed [10]. This technique also has the advantage of stability, in that small changes to net weights likely will not create entirely different global placements. Its spreading ability (like that of diffusion) makes it appealing for handling incremental netlist changes.
6) Parallelism. As designs truly become large, they can potentially be partitioned into smaller physical pieces that do not require an inordinate amount of cross-partition communication. One can then apply physical synthesis on each piece relatively independently. While this approach seems simple enough, it is fraught with choices, any of which could lead to a significantly degraded solution. One must be careful with the partition pin assignment, the buffering strategy, and the timing contracts between partitions.
7) Complex transforms. Transforms that perform multiple operations simultaneously could potentially have a big impact on timing. For example, consider a cell B on the left side of the chip connected to cells A and C on the right side. Clearly B wants to be near A and C, but if the nets connected to B have already been buffered, those buffers act as anchors which keep B from moving to the right. One needs to rip up the buffer trees, then consider moving B, then put the buffer trees back in to evaluate whether the move was worthwhile. Another example is simultaneous buffering and cloning.
This list is just a sampling of possible research directions. As design technology scales to 65 nm and below, the problem of timing closure will continue to evolve into the even more complex problem of design closure. Design closure requires that accurate modeling of the clock tree network and routing be incorporated earlier and earlier in the physical synthesis pipeline to take into account their effects on timing and signal integrity. The need to meet a global power constraint, e.g., by incorporating multithreshold logic gates and voltage islands, also becomes more critical. One must pay attention to how physical design choices impact manufacturability. Requiring physical synthesis to meet and incorporate these additional constraints only further exacerbates the runtime issue. Therefore, research that discovers more efficient techniques for core physical synthesis optimizations, such as placement, buffering, legalization, repowering, incremental timing, routing, and clock tree synthesis, will continue to be of high value.
Acknowledgment
The PDS physical synthesis system has had many
contributors over the years. The authors sincerely thank
everyone who has helped, both with driving the work
presented here and with overall contributions to IBM’s PDS
tool. These contributors include Lakshmi Reddy, Ruchir
Puri, David Kung, Leon Stok, Charles Bivona, Louise
Trevillian, Michael Kazda, Pooja Kotecha, Nate Heiter, Erik Kusko, Mike Dotson, Carl Hagen, Zahi Kurzum,
Gopal Gandham, Stephen Quay, Tuhin Mahmud, Jiang
Hu, Milos Hrkic, Kristian Zoerhoff, William Dougherty,
Brian Wilson, Bryon Wirtz, Tony Drumm, Elaine D’Souza,
Shyam Ramji, Alex Suess, Jose Neves, Veena Puresan,
Arjen Mets, Andrew Sullivan, Jim Curtain, David Geiger,
Tsz-mei Ko, and Pete Osler.
REFERENCES
[1] L. Trevillyan, D. Kung, R. Puri, L. N. Reddy, and M. A. Kazda, "An integrated environment for technology closure of deep-submicron IC designs," IEEE Des. Test Comput., vol. 21, no. 1, pp. 14–22, Jan.–Feb. 2004.
[2] P. G. Villarrubia, "Physical design tools for hierarchy," in Proc. ACM Int. Symp. Physical Design, 2005.
[3] P. Saxena, N. Menezes, P. Cocchini, and D. A. Kirkpatrick, "Repeater scaling and its impact on CAD," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 23, no. 4, pp. 451–463, Apr. 2004.
[4] J. Cong, Z. D. Kong, and T. Pan, "Buffer block planning for interconnect planning and prediction," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 6, pp. 929–937, Dec. 2001.
[5] C. J. Alpert, J. Hu, S. S. Sapatnekar, and P. G. Villarrubia, "A practical methodology for early buffer and wire resource allocation," in Proc. Design Automation Conf., 2001.
[6] G.-J. Nam, C. J. Alpert, P. G. Villarrubia, B. Winter, and M. Yildiz, "The ISPD2005 placement contest and benchmark suite," in Proc. ACM Int. Symp. Physical Design, 2005, pp. 216–220.
[7] A. B. Kahng and Q. Wang, "Implementation and extensibility of an analytic placer," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 5, pp. 734–747, May 2005.
[8] T. Chan, J. Cong, T. Kong, J. Shinnerl, and K. Sze, "An enhanced multilevel algorithm for circuit placement," in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, 2003, pp. 299–305.
[9] B. Hu and M. M. Sadowska, "Fine granularity clustering-based placement," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 23, no. 4, pp. 527–536, Apr. 2004.
[10] N. Viswanathan and C.-N. Chu, "FastPlace: Efficient analytical placement using cell shifting, iterative local refinement and a hybrid net model," in Proc. ACM Int. Symp. Physical Design, 2004, pp. 26–33.
[11] B. Halpin, C. Y. R. Chen, and N. Sehgal, "Timing driven placement using physical net constraints," in Proc. IEEE/ACM Design Automation Conf., 2001, pp. 780–783.
[12] R.-S. Tsay and J. Koehl, "An analytic net weighting approach for performance optimization in circuit placement," in Proc. IEEE/ACM Design Automation Conf., 1991, pp. 620–625.
[13] X. Yang, B.-K. Choi, and M. Sarrafzadeh, "Timing-driven placement using design hierarchy guided constraint generation," in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, 2002, pp. 177–180.
[14] K. Rajagopal, T. Shaked, Y. Parasuram, T. Cao, A. Chowdhary, and B. Halpin, "Timing driven force directed placement with physical net constraints," in Proc. Int. Symp. Physical Design, Apr. 2003, pp. 60–66.
[15] H. Ren, D. Z. Pan, and D. Kung, "Sensitivity guided net weighting for placement driven synthesis," in Proc. Int. Symp. Physical Design, Apr. 2004, pp. 10–17.
[16] T. Kong, "A novel net weighting algorithm for timing-driven placement," in Proc. Int. Conf. Computer-Aided Design, 2002, pp. 172–176.
[17] D. Brand, R. F. Damiano, L. P. P. P. van Ginneken, and A. D. Drumm, "In the driver's seat of BooleDozer," in Proc. Int. Conf. Computer Design (ICCD), 1994, pp. 518–521.
[18] L. Stok, D. S. Kung, D. Brand, A. D. Drumm, L. N. Reddy, N. Hieter, D. J. Geiger, H. H. Chao, P. J. Osler, and A. J. Sullivan, "BooleDozer: Logic synthesis for ASICs," IBM J. Res. Dev., vol. 40, no. 4, pp. 407–430, 1996.
[19] W. Donath, P. Kudva, L. Stok, P. Villarrubia, L. Reddy, A. Sullivan, and K. Chakraborty, "Transformational placement and synthesis," in Proc. Design, Automation and Test in Europe, Mar. 2000.
[20] S. K. Karandikar, C. J. Alpert, M. C. Yildiz, P. G. Villarrubia, S. T. Quay, and T. Mahmud, "Fast electrical correction using resizing and buffering," in Proc. Asia and South Pacific Design Automation Conf., 2007.
[21] P. J. Osler, "Placement driven synthesis case studies on two sets of two chips: Hierarchical and flat," in Proc. ACM Int. Symp. Physical Design, 2004, pp. 190–197.
[22] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar, "Multilevel hypergraph partitioning: Application in VLSI domain," in Proc. ACM/IEEE Design Automation Conf., 1997, pp. 526–529.
[23] G.-J. Nam, S. Reda, C. Alpert, P. Villarrubia, and A. Kahng, "A fast hierarchical quadratic placement algorithm," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 25, no. 4, Apr. 2006.
[24] L. P. P. P. van Ginneken, "Buffer placement in distributed RC-tree networks for minimal Elmore delay," in Proc. IEEE Int. Symp. Circuits and Systems, May 1990, pp. 865–868.
[25] Z. Li, C. N. Sze, C. J. Alpert, J. Hu, and W. Shi, "Making fast buffer insertion even faster via approximation techniques," in Proc. Asia and South Pacific Design Automation Conf., 2005, pp. 13–18.
[26] S. Hu, C. J. Alpert, J. Hu, S. K. Karandikar, Z. Li, W. Shi, and C. N. Sze, "Fast algorithms for slew constrained minimum cost buffering," in Proc. ACM/IEEE Design Automation Conf., 2006, pp. 308–313.
[27] C. J. Alpert, M. Hrkic, J. Hu, and S. T. Quay, "Fast and flexible buffer trees that navigate the physical layout environment," in Proc. ACM/IEEE Design Automation Conf., 2004, pp. 24–29.
[28] H. Ren, D. Z. Pan, C. J. Alpert, and P. Villarrubia, "Diffusion-based placement migration," in Proc. Design Automation Conf., 2005, pp. 515–520.
[29] W.-J. Sun and C. Sechen, "Efficient and effective placement for very large circuits," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 14, no. 5, pp. 349–359, May 1995.
[30] C. J. Alpert, J.-H. Huang, and A. B. Kahng, "Multilevel circuit partitioning," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 17, no. 8, pp. 655–667, Aug. 1998.
[31] A. E. Caldwell, A. B. Kahng, and I. L. Markov, "Can recursive bisection alone produce routable placements?" in Proc. Design Automation Conf., 2000, pp. 477–482.
[32] A. Agnihotri, M. C. Yildiz, A. Khatkhate, A. Mathur, S. Ono, and P. H. Madden, "Fractional cut: Improved recursive bisection placement," in Proc. Int. Conf. Computer-Aided Design, 2003, pp. 307–310.
[33] M. Wang, X. Yang, and M. Sarrafzadeh, "Dragon2000: Standard-cell placement tool for large industry circuits," in Proc. Int. Conf. Computer-Aided Design, 2000, pp. 260–263.
[34] H. Eisenmann and F. M. Johannes, "Generic global placement and floorplanning," in Proc. ACM/IEEE Design Automation Conf., 1998, pp. 269–274.
[35] P. Spindler and F. M. Johannes, "Fast and robust quadratic placement combined with an exact linear net model," presented at the IEEE/ACM Int. Conf. Computer-Aided Design, San Jose, CA, 2006.
[36] J. Vygen, "Algorithms for large-scale flat placement," in Proc. ACM/IEEE Design Automation Conf., 1997, pp. 746–751.
[37] D.-H. Huang and A. B. Kahng, "Partitioning based standard cell global placement with an exact objective," in Proc. ACM Int. Symp. Physical Design, 1997, pp. 18–25.
[38] C. J. Alpert and A. B. Kahng, "Recent developments in netlist partitioning: A survey," Integr. VLSI J., vol. 19, pp. 1–81, 1995.
[39] C. J. Alpert, G. Gandham, M. Hrkic, J. Hu, A. B. Kahng, J. Lillis, B. Liu, S. T. Quay, S. S. Sapatnekar, and A. J. Sullivan, "Buffered Steiner trees for difficult instances," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 21, no. 1, pp. 3–14, Jan. 2002.
[40] J. Cong, A. Kahng, and K. Leung, "Efficient algorithm for the minimum shortest path Steiner arborescence problem with application to VLSI physical design," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 17, no. 1, pp. 24–38, Jan. 1998.
[41] J. Lillis, C. K. Cheng, and T. Y. Lin, "Optimal wire sizing and buffer insertion for low power and a generalized delay model," IEEE J. Solid-State Circuits, vol. 31, no. 3, pp. 437–447, Mar. 1996.
[42] C. J. Alpert, A. Devgan, and S. T. Quay, "Buffer insertion for noise and delay optimization," in Proc. ACM/IEEE Design Automation Conf., 1998, pp. 362–367.
[43] C. J. Alpert, A. Devgan, and S. T. Quay, "Buffer insertion with accurate gate and interconnect delay computation," in Proc. ACM/IEEE Design Automation Conf., 1999, pp. 479–484.
[44] W. Shi and Z. Li, "An O(n log n) time algorithm for optimal buffer insertion," in Proc. IEEE/ACM Design Automation Conf., 2003, pp. 580–585.
[45] W. Shi, Z. Li, and C. J. Alpert, "Complexity analysis and speedup techniques for optimal buffer insertion with minimum cost," in Proc. Asia and South Pacific Design Automation Conf., 2004, pp. 609–614.
[46] C. J. Alpert, R. G. Gandham, J. L. Neves, and S. T. Quay, "Buffer library selection," in Proc. Int. Conf. Computer Design (ICCD), 2000, pp. 221–226.
[47] J. Lillis, C. K. Cheng, and T.-T. Y. Lin, "Optimal wire sizing and buffer insertion for low power and a generalized delay model," IEEE J. Solid-State Circuits, vol. 31, no. 3, pp. 437–447, Mar. 1996.
[48] C. Kashyap, C. Alpert, F. Liu, and A. Devgan, "Closed form expressions for extending step delay and slew metrics to ramp inputs," in Proc. Int. Symp. Physical Design (ISPD), 2003, pp. 24–31.
[49] H. Bakoglu, Circuits, Interconnects, and Packaging for VLSI. Reading, MA: Addison-Wesley, 1990.
[50] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design. Reading, MA: Addison-Wesley, 1993, pp. 221–223.
[51] M. Hrkic and J. Lillis, "S-tree: A technique for buffered routing tree synthesis," in Proc. ACM/IEEE Design Automation Conf., 2002, pp. 578–583.
[52] X. Tang, R. Tian, H. Xiang, and D. F. Wong, "A new algorithm for routing tree construction with buffer insertion and wire sizing under obstacle constraints," in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, 2001, pp. 49–56.
[53] C. J. Alpert, J. Hu, S. S. Sapatnekar, and C. N. Sze, "Accurate estimation of global buffer delay within a floorplan," in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, 2004, pp. 706–711.
[54] U. Brenner, A. Pauli, and J. Vygen, "Almost optimum placement legalization by minimum cost flow and dynamic programming," in Proc. Int. Symp. Physical Design, 2004, pp. 2–9.
[55] S. W. Hur and J. Lillis, "Mongrel: Hybrid techniques for standard cell placement," in Proc. Int. Conf. Computer-Aided Design, 2000, pp. 165–170.
[56] A. B. Kahng, P. Tucker, and A. Zelikovsky, "Optimization of linear placements for wirelength minimization with free sites," in Proc. Asia and South Pacific Design Automation Conf., 1999, pp. 18–21.
[57] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C++. Cambridge, U.K.: Cambridge Univ. Press, 2002.
[58] H. Ren, D. Pan, C. Alpert, G.-J. Nam, and P. G. Villarrubia, "Hippocrates: First-do-no-harm detailed placement," presented at the Asia and South Pacific Design Automation Conf., Yokohama, Japan, 2007.
ABOUT THE AUTHORS
Charles J. Alpert (Fellow, IEEE) received the B.S.
degree in math and computational sciences and
the B.A. degree in history from Stanford Univer-
sity, Stanford, CA, in 1991 and the Ph.D. degree
in computer science from the University of
California, Los Angeles (UCLA), in 1996.
He currently works as a Research Staff Member
at the IBM Austin Research Laboratory, Austin, TX,
where he serves as the technical lead for the
design tools group. He has over 80 conference and
journal publications. His research centers upon innovation in physical
synthesis optimization.
Dr. Alpert has thrice received the Best Paper Award from the ACM/
IEEE Design Automation Conference. He has served as the general chair
and the technical program chair for the Tau Workshop on Timing Issues
in the Specification and Synthesis of Digital Systems and the Interna-
tional Symposium on Physical Design. He also serves as an Associate
Editor of IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN. For his work in
mentoring SRC funded research, he received the Mahboob Khan Mentor
Award in 2001.
Shrirang K. Karandikar received the B.E. degree
from the University of Pune, Pune, India, in 1994,
the M.S. degree from Clarkson University,
Potsdam, NY, in 1996, and the Ph.D. degree from
the University of Minnesota, Minneapolis, in 2004.
He worked with Intel’s Logic and Validation
Technology group from 1997 to 1999, and is
currently a Research Staff Member at the IBM Austin
Research Laboratory. His current interests are in
the areas of logic synthesis and physical design of
VLSI systems.
Zhuo Li (Member, IEEE) received the B.S. and M.S.
degrees in electrical engineering from Xi’an
Jiaotong University, Xi’an, China, and the Ph.D.
degree in computer engineering from Texas A&M
University, College Station, in 1998, 2001, and
2005, respectively.
From 2005 to 2006, he was with Pextra
Corporation, College Station as a Cofounder and
Senior Technical Staff working on VLSI extraction
tools development. He is currently with IBM
Austin Research Laboratory, Austin, TX. His research interests include