INVITED PAPER

Techniques for Fast Physical Synthesis

Fast, efficient buffer design, logic transformations, and clustering components for placement are some of the techniques being used to reduce design turnaround for large, complex chips.

By Charles J. Alpert, Fellow IEEE, Shrirang K. Karandikar, Zhuo Li, Member IEEE, Gi-Joon Nam, Member IEEE, Stephen T. Quay, Haoxing Ren, Member IEEE, C. N. Sze, Member IEEE, Paul G. Villarrubia, and Mehmet C. Yildiz, Member IEEE

ABSTRACT | The traditional purpose of physical synthesis is to perform timing closure, i.e., to create a placed design that meets its timing specifications while also satisfying electrical, routability, and signal integrity constraints. In modern design flows, physical synthesis tools hardly ever achieve this goal in their first iteration. The design team must iterate by studying the output of a physical synthesis run and then massaging the input, e.g., by changing the floorplan, timing assertions, pin locations, or logic structures, in the hope of achieving a better solution on the next iteration. The complexity of physical synthesis means that systems can take days to run on designs with multiple millions of placeable objects, which severely hurts design productivity. This paper discusses some newer techniques deployed within IBM's physical synthesis tool, PDS [1], that significantly improve throughput. In particular, we focus on some of the biggest contributors to runtime (placement, legalization, buffering, and electrical correction) and present techniques that generate significant turnaround-time improvements.

KEYWORDS | Circuit optimization; circuit synthesis; CMOS integrated circuits; design automation

I. INTRODUCTION

Physical synthesis has emerged as a critical component of modern design methodologies. The primary purpose of physical synthesis is to perform timing closure.
Several technology generations ago, when wire delay was insignificant, synthesis provided an accurate picture of the timing of a design. Technology scaling, however, has caused wire delay to increase steadily relative to gate delay. Consequently, a design that meets timing requirements in synthesis will likely fail to close once its physical footprint is realized, due to the wire delays. The purpose of physical synthesis is to place the design, recognize the delays and signal integrity issues introduced by the wiring, and fix the resulting problems. It may also need to locally resynthesize pieces of the design that no longer meet timing constraints. That new logic must then be placed again, which causes iterations between synthesis and placement until, hopefully, the design closes on timing.

Unfortunately, more often than not, the design will not close on timing without manual designer intervention. Perhaps the designer needs to modify the floorplan or restructure certain sets of paths. This causes the designer to iterate between manual design work and automatic physical synthesis. The turnaround time of the physical design stage therefore depends critically on the efficiency (and quality) of the physical synthesis system. On large multimillion-gate ASIC parts, physical synthesis can take several days to complete, even on the best hardware available. This trend is only getting worse, as designs seem to scale faster than the hardware used to optimize them improves. While hierarchical or system-on-a-chip (SoC) methodologies can be used to handle the large complexities, performing timing closure on a flat part is always preferable if at all possible [2], since it avoids the complexities of hierarchical design.

Of course, there are many newer challenges that the physical synthesis system needs to handle besides traditional timing closure. Some examples include lowering power using a

Manuscript received March 8, 2006; revised October 20, 2006. C. J. Alpert, S. K. Karandikar, Z. Li, G.-J. Nam, and C. N. Sze are with the IBM Austin Research Laboratory, Austin, TX 78758 USA (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]). S. T. Quay, H. Ren, P. G. Villarrubia, and M. C. Yildiz are with the IBM Corporation, Austin, TX 78758 USA (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). Digital Object Identifier: 10.1109/JPROC.2006.890096

Vol. 95, No. 3, March 2007 | Proceedings of the IEEE 573
0018-9219/$25.00 © 2007 IEEE
objects and nets, the utilization of the designs, and the wirelength improvement and speedup over flat placement. Let α be the ratio of the number of cells to the target number of clusters. With clustering ratio α = 2, hierarchical placement is on average twice as fast as flat placement while obtaining a slight 0.92% improvement in wirelength. With a more aggressive clustering ratio of α = 10, hierarchical placement is about five times faster than flat placement, with a slight 3% degradation in wirelength. Different values of α can be used to trade off speed and quality. Overall, we demonstrate that careful clustering and unclustering strategies can yield a hierarchical placement that is significantly faster than flat placement with comparable solution quality.
III. TECHNIQUES FOR FAST TIMING-DRIVEN BUFFER INSERTION
For timing-critical nets, buffer insertion must be deployed frequently to improve delay, whether to handle nets with large fanout or long wires, or to isolate noncritical sinks from critical ones. For example, Fig. 6(a) shows a three-pin net with poor timing, in which the small squares are potential buffer insertion locations. Proper buffer insertion, as shown in Fig. 6(b), improves the timing to the most critical sink by 200 ps. The bottom sink is not critical, so only a decoupling buffer is required for that subpath.
The buffering algorithms in PDS are based on the classic dynamic programming paradigm [24], because the algorithm is provably optimal for a given tree topology (such as [39], [40]). However, it will frequently insert many additional buffers to obtain a negligible improvement in performance, so the algorithm must also manage the tradeoff between buffering resources and delay [41]. Doing so changes the algorithm's complexity from polynomial to pseudopolynomial and in practice adds an order of magnitude to the runtime. The result is an extremely effective algorithm for timing-driven buffer trees, though the algorithm's inefficiency is problematic.
Thus, it is essential to make this core optimization as fast as possible. This section explores tricks for tweaking the classic algorithm to obtain significant performance improvements without losing solution quality. These techniques can be easily integrated with the classic buffer insertion framework while also considering slew, noise, and capacitance constraints [42], [43]. Used in conjunction, these techniques can lead to more than a factor-of-ten performance improvement versus traditional dynamic programming.
A. Overview of the Classic Buffering Algorithm

For a given Steiner tree with a set of buffer locations (namely, the internal nodes), buffer insertion inserts buffers at some subset of legal locations such that the required arrival time (RAT) at the source is maximized. In the dynamic programming framework, candidate solutions are generated and propagated from the sinks toward the
Table 1 Comparisons of Hierarchical Analytic Top-Down Placement Against Flat Placement in Wirelengths and Runtimes
Fig. 6. An example of how buffer insertion can improve timing to critical sinks. (a) A net without buffers inserted. (b) Proper buffer
insertion improves timing.
source. Each candidate solution γ is associated with an internal node in the tree and is characterized by a 3-tuple (q, c, w). The value q represents the required arrival time; c is the downstream load capacitance; and w is the cost summation for the buffer insertion decisions.

Initially, a single candidate (q, c, w) is assigned to each sink, where q is the sink RAT, c is the load capacitance, and w = 0. When the candidate solutions are propagated from a node to its parent, all three terms are updated accordingly. At an internal node, a new candidate is generated by inserting a buffer. At each Steiner node, the two sets of solutions from the children are merged. Finally, at the source, the solutions with maximum q are selected.
The candidate solutions at each node are organized as an array of linked lists. The solutions in each list of the array share the same buffer cost value w = 0, 1, 2, .... During the algorithm, inferior solutions are pruned. A solution is defined as inferior (or redundant) if there exists another solution that is at least as good in slack, capacitance, and buffer cost. More precisely, for two candidate solutions γ1 = (q1, c1, w1) and γ2 = (q2, c2, w2), γ2 dominates γ1 if q2 ≥ q1, c2 ≤ c1, and w2 ≤ w1. In such a case, we say γ1 is redundant and may be pruned. After pruning, every list with the same cost is sorted in terms of q and c.
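As a sketch, dominance pruning within one cost bucket reduces to keeping only the (q, c) pairs on the staircase where RAT strictly increases with load; the helper below assumes plain tuples and arbitrary units, purely for illustration.

```python
def prune_dominated(cands):
    """Keep only non-dominated (q, c) candidates of one cost bucket.

    A candidate is redundant if another one has RAT at least as large
    and load no larger.  After pruning, sorting by c also sorts by q.
    """
    # Sort by load ascending; on equal load, put the larger RAT first.
    ordered = sorted(cands, key=lambda qc: (qc[1], -qc[0]))
    kept = []
    best_q = float("-inf")
    for q, c in ordered:
        if q > best_q:  # strictly better RAT than every lighter candidate
            kept.append((q, c))
            best_q = q
    return kept
```

After this pass the surviving list is sorted in both q and c, which is the invariant the later pruning techniques rely on.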
A buffer library is a set of buffers and inverters, each associated with a driving resistance, input capacitance, intrinsic delay, and buffer cost. During optimization, we wish to control the total buffer resources so that the design is not over-buffered for marginal timing improvement. While total buffer area can be used, to first order the number of buffers provides a reasonably good approximation of buffer resource utilization. Indeed, we use the number of buffers, since it allows a much more efficient baseline van Ginneken implementation. Note that the techniques presented in this paper can be applied to any buffer resource model, such as total buffer area or power.
At the end of the algorithm, a set of solutions with different cost-RAT tradeoffs is obtained. Each solution gives the maximum RAT achievable under the corresponding cost bound. In practice, we choose neither the solution with maximum RAT at the source nor the one with minimum total buffer cost. Usually, we would like to pick a solution in the middle, such that the solution with one more buffer brings only marginal timing gain. In PDS, we use the "10 ps rule" (though the value can of course be modified depending on the frequency target). With the final solutions sorted by the source's RAT value, we start from the solution with maximum RAT and compare it with the second solution (which usually has one buffer less). If the difference in RAT is more than 10 ps, we pick the first solution. Otherwise, we drop it (since for less than 10 ps of timing improvement, it is not worth an extra buffer) and continue by comparing the second and third solutions. Of course, instead of 10 ps, any time threshold can be used when applying the rule to different nets.
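A minimal sketch of this selection rule, assuming a hypothetical list of solutions sorted with the maximum-RAT solution first; the `Solution` record and units are illustrative, not PDS data structures.

```python
from collections import namedtuple

# Hypothetical record: source RAT (ps) and buffer count of one solution.
Solution = namedtuple("Solution", "rat cost")

def pick_solution(solutions, threshold_ps=10.0):
    """Pick a cost/RAT tradeoff point using the '10 ps rule'.

    `solutions` is sorted best (largest) RAT first; each later entry
    typically uses one buffer fewer.  A solution is kept only when the
    next one gives up more than `threshold_ps` of RAT.
    """
    i = 0
    while i + 1 < len(solutions):
        if solutions[i].rat - solutions[i + 1].rat > threshold_ps:
            break  # the extra buffer buys a real timing gain; keep it
        i += 1     # less than the threshold: not worth an extra buffer
    return solutions[i]
```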
B. Preslack Pruning

During the algorithm, a candidate solution is pruned only if there is another solution that is superior in terms of capacitance, slack, and cost. This pruning is based on the information at the current node being processed. However, all solutions at this node must be propagated further upstream toward the source. This means the load seen at this node must be driven by some minimal amount of upstream wire or gate resistance. By anticipating the upstream resistance ahead of time, one can prune out more potentially inferior solutions earlier rather than later, which reduces the total number of candidates generated. More specifically, assume that each candidate must be driven by an upstream resistance of at least Rmin. Pruning based on this anticipated upstream resistance is called prebuffer slack pruning.
Prebuffer Slack Pruning (PSP): For two non-redundant solutions (q1, c1, w) and (q2, c2, w), where q1 < q2 and c1 < c2, if (q2 - q1)/(c2 - c1) ≤ Rmin, then (q2, c2, w) is pruned.
The PSP technique was first proposed in [44]. Using an appropriate value of Rmin guarantees that optimality is not lost [44], [45]. However, what if we are willing to sacrifice optimality for a faster solution by using a resistance R that is larger than Rmin? In practice, we observe that a value somewhat larger than Rmin does not hurt solution quality.
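As a sketch, the pruning rule can be applied in one pass over a candidate list that is already sorted so that q and c both ascend (the invariant maintained by dominance pruning); the list values and r_min below are hypothetical.

```python
def prebuffer_slack_prune(cands, r_min):
    """Prune candidates that r_min of upstream resistance will doom.

    `cands` holds (q, c) pairs of one cost bucket, sorted so q and c
    both ascend.  If the RAT gain relative to the last kept candidate
    is at most r_min times the extra load, the larger-load candidate is
    pruned: after propagating through resistance R >= r_min, its slack
    q - R*c can never beat the lighter candidate's.
    """
    if not cands:
        return []
    kept = [cands[0]]
    for q, c in cands[1:]:
        q0, c0 = kept[-1]
        if (q - q0) > r_min * (c - c0):  # enough gain to justify the load
            kept.append((q, c))
    return kept
```

The comparison is written with a multiplication rather than a division, so equal-capacitance entries need no special casing.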
We performed buffer insertion experiments on 1000 high-capacitance industrial nets, varying the value of R used for preslack pruning. The slack and CPU time as percentages of the no-preslack-pruning baseline are shown in Fig. 7. Observe that the slack degrades slowly as a function of resistance, while the CPU time decrease is fairly sharp. For example, R = 120 Ω is the value of Rmin for which preslack pruning is still optimal. However, one can get a 50% speedup for less than 5% slack degradation with a larger value of R = 600 Ω. These results indicate that using PSP can bring a large speedup in classic buffering for a fairly small degradation in solution quality.
C. Squeeze Pruning

The basic data structure of van Ginneken style algorithms is a sorted list of non-dominated candidate solutions. Both the pruning in the van Ginneken style algorithm and prebuffer slack pruning are performed by comparing two neighboring candidate solutions at a time. However, more potentially inferior solutions can be pruned by comparing three neighboring candidate solutions simultaneously. For three solutions in the sorted list, the middle one may be pruned according to squeeze pruning, defined as follows.
Squeeze Pruning: For every three candidate solutions (q1, c1, w), (q2, c2, w), (q3, c3, w), where q1 < q2 < q3 and c1 < c2 < c3, if (q2 - q1)/(c2 - c1) < (q3 - q2)/(c3 - c2), then (q2, c2, w) is pruned.
For a two-pin net, consider the case in which the algorithm proceeds to a buffer location and there are three sorted candidate solutions with the same cost, corresponding to the first three candidate solutions in Fig. 8(a). According to the rationale behind prebuffer slack pruning, the q-c slope between two neighboring candidate solutions indicates the potential for the candidate solution with smaller c to prune out the other one; a small slope implies a high potential. For example, (q1, c1, w) has a high potential to prune out (q2, c2, w) if (q2 - q1)/(c2 - c1) is small. If the slope between the first and second candidate solutions is smaller than the slope between the second and third, then the middle candidate solution is always dominated by either the first or the third candidate solution. Squeeze pruning therefore preserves optimality for a two-pin net. After squeeze pruning, the solution curve in the (q, c) plane is concave, as shown in Fig. 8(b).
For a multisink net, squeeze pruning does not guarantee optimality, since each candidate solution may merge with different candidate solutions from the other branch, and the middle candidate solution in Fig. 8(a) may offer smaller capacitance to candidate solutions in the other branch. Squeeze pruning may thus prune a candidate solution that would have yielded less total capacitance after merging. However, despite the loss of guaranteed optimality, squeeze pruning causes no degradation in solution quality most of the time and is overall a fairly safe pruning technique.
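Squeeze pruning can be implemented as a single stack-based sweep over the sorted list, much like a concave-hull scan; the cross-multiplied comparison below avoids divisions. The tuple representation is an assumption for illustration.

```python
def squeeze_prune(cands):
    """Squeeze-prune a sorted, non-dominated (q, c) list of equal cost.

    The middle of three neighbors is dropped when the slope to its left
    neighbor is smaller than the slope to its right neighbor; the
    surviving curve is concave in the (q, c) plane.
    """
    kept = []
    for q, c in cands:
        while len(kept) >= 2:
            (q1, c1), (q2, c2) = kept[-2], kept[-1]
            # (q2-q1)/(c2-c1) < (q-q2)/(c-c2), cross-multiplied.
            if (q2 - q1) * (c - c2) < (q - q2) * (c2 - c1):
                kept.pop()  # the middle candidate is squeezed out
            else:
                break
        kept.append((q, c))
    return kept
```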
D. Library Lookup

The size of the buffer library is an important factor in determining runtime. Modern designs may have hundreds of buffers and inverters to choose from. The theoretical complexity of van Ginneken style buffer insertion is quadratic in the library size, though in practice it appears to be linear. To avoid the slowdown from large libraries, we take advantage of buffer library pruning [46] to select a small yet effective set of buffers from all those that may be used. We now discuss a more effective technique, library lookup.
During van Ginneken style buffer insertion, every buffer in the library is examined at each buffer location. If there are n candidate solutions at an internal node before buffer insertion and the library consists of m buffers, then mn tentative solutions are evaluated. For example, in Fig. 9(a), all eight buffers are considered for all n candidate solutions. However, many of these candidate solutions are clearly not worth considering. We seek to avoid generating poor candidate solutions in the first place, rather than adding m buffered candidate solutions for each
Fig. 8. Squeeze pruning example. (a) The solution curve in the (q, c) plane before squeeze pruning. (b) The solution curve after squeeze pruning.
Fig. 7. The speedup and solution-quality sacrifice of aggressive preslack pruning for 1000 nets, as a function of R.
unbuffered candidate solution. We propose to consider each candidate solution in turn. For each candidate solution with capacitance ci, we look up the non-inverting buffer and the inverting buffer that yield the best delay from two tables precomputed before optimization begins. In Fig. 9(b), the capacitance ci results in selecting buffer B3 from the non-inverting table and inverter I2 from the inverting table.

All 2n tentative new buffered candidate solutions can be divided into two groups: one group includes the n candidate solutions with an inverting buffer just inserted, and the other includes the n candidate solutions with a non-inverting buffer just inserted. We choose only the candidate solution that yields the maximum slack from each group, so finally only two candidate solutions are inserted into the original candidate solution lists. Since the number of tentative new buffered solutions is reduced from mn to 2n, a speedup is achieved. Also, since only two new candidate solutions are inserted instead of m, the total number of candidate solutions is reduced. This is similar to having a buffer library of size two, except that the buffer type may change depending on the downstream load.
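One way to realize the lookup, as a sketch: precompute the best buffer for each load value on a capacitance grid (a linear delay model d = d_intrinsic + r_drive * load is assumed here for illustration), then answer queries by binary search. A real implementation would build one such table for non-inverting buffers and one for inverters.

```python
import bisect

def build_lookup(buffers, cap_grid):
    """Precompute, for each load bin, the name of the least-delay buffer.

    Each buffer is (name, r_drive, d_intrinsic); the delay of driving a
    load c is modeled as d_intrinsic + r_drive * c.
    """
    table = []
    for cap in cap_grid:
        best = min(buffers, key=lambda b: b[2] + b[1] * cap)
        table.append(best[0])
    return table

def lookup_best(table, cap_grid, load):
    """O(log n) lookup of the best buffer for a given downstream load."""
    i = min(bisect.bisect_left(cap_grid, load), len(cap_grid) - 1)
    return table[i]
```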
E. Results and Summary

Table 2 shows the impact of the three speedup techniques, preslack pruning (PSP), squeeze pruning (SqP), and library lookup (LL), versus the classic algorithm (baseline). The results are averaged over 5000 high-capacitance nets from an ASIC chip. The second column shows the total slack improvement (over all 5000 nets) after buffer insertion, and the third column gives the total CPU time. Overall, the three techniques resulted in a 20X speedup, with just 3% degradation in solution quality.
Buffer insertion is a core optimization for fixing timing-critical paths. When optimizing tens of thousands of nets, some optimality can be sacrificed in order to obtain acceptable runtime. Note that at the end of physical synthesis, one could try reapplying buffer insertion without these speedups (while also using more accurate delay models) to the handful of remaining critical nets. This is still much more efficient than applying full-blown, high-accuracy buffer insertion to the entire design.

This work in essence summarizes our philosophy for fast physical synthesis: do the optimization well and as fast as possible, even if a little optimality is sacrificed. At the end, if the design is close to timing closure, slower and more accurate techniques can always be employed to further refine the design.
IV. FAST ELECTRICAL CORRECTION

The previous section discussed fast buffering for critical path optimization. Our focus now turns toward using buffers and gate sizing for electrical correction. As discussed in the first section, electrical correction is becoming an increasingly costly phase of physical synthesis. High wire resistance and sharp required slew rates (for either noise or performance) mean that potentially millions of buffers must be inserted and millions of gates must be repowered simply to have an electrically correct design. Critical path optimization techniques rely on the correct operation of the timing analyzer; however, any timer, even a sophisticated one, only works correctly if the design it is given is in a reasonable electrical state. For example, if capacitive loads are outside the range for which a gate model has been characterized, the timer will give results that do not reflect the true performance of the gate. Further, if one can quickly make the timing results look decent, much less work is left for the subsequent, slower critical path optimizations.
This section focuses on how to quickly perform electrical correction, i.e., fix capacitance and slew violations [20]. Further, it is crucial that this phase require minimal
Fig. 9. Library lookup example. B1 to B4 are non-inverting buffers. I1 to I4 are inverting buffers. (a) van Ginneken style buffer insertion. (b) Library lookup.
Table 2. Simulation Results for a Full Library Consisting of 24 Buffers. Baseline Gives the Results of the Algorithm of Lillis et al. [47]. PSP Shows the Results of the Aggressive Prebuffer Slack Pruning Technique. SqP Stands for Our Squeeze Pruning Technique. LL Is the Library Lookup Technique.
area overhead, thereby reducing unnecessary power consumption and silicon real estate. The need for reducing area usage is obvious for area-constrained designs. However, even in designs where total area may not be at a premium, local regions may be congested. Further, in delay-constrained designs, the area saved can be used by subsequent optimizations to improve the performance of critical regions.
A. Types of Electrical Violations

Timing analyzers utilize precharacterized models for gate delays and slews. Each gate is characterized with a maximum capacitive load that it can drive and a maximum input slew rate, and the operation of the timer is valid within these ranges. If these conditions are violated, timers usually extrapolate to obtain "best guess" values. However, values calculated in this manner may be inaccurate. This leads to the limits that define electrical violations. There are two "rules" that a design has to pass for it to be electrically clean, as follows.
• Slew Limits: These rules define the maximum slews permissible on all nets of the design. If the slew (defined here as the 10%-90% rise or fall time of a signal, though other definitions can be used as well) at the input of a logic gate is too large, the gate may not switch at the target speed, or may not switch at all, leading to incorrect operation.
• Capacitance Limits: These define the maximum effective capacitance that a gate or an input pin can drive. A large capacitance on the output of a gate directly affects its switching speed and power dissipation. Additionally, gates are typically characterized for a limited range of output capacitance, and delay calculation during design can be incorrect if the output capacitance is greater than the maximum value.
Violations of these rules (referred to as slew violations and capacitance violations) are together called electrical violations. These limits are principally determined during gate characterization, but designers may choose to tighten the constraints further. High-performance designs, such as microprocessors, typically have much tighter slew limits than ASICs.
B. Causes of Violations

Fig. 10 shows the main causes of slew violations and how they may be fixed. Consider a net having source gate A and sink gate B. The capacitive load seen by gate A is the sum of the interconnect capacitance of the net and the input capacitance of gate B. Assume that a signal with slew s1 is applied at the input of gate A. Due to the load that it has to drive, the slew s2 at the output of gate A may be larger than s1. Thus, one cause of degradation is the source gate not being capable of driving the load at its output. Next, even if the slew s2 at the output of A is within the specified limits, it can degrade as the signal traverses the net to the sink. Thus, at the sink, the signal could have an even larger slew s3. This is the second contributor to slew degradation.
There are two main methods of fixing slew violations, as shown in Fig. 10. First, the source gate of the net can be sized up, so that the new gate can drive the load present. While this may fix violations on the net in question, the obvious disadvantage is that the problem has been moved to the input of the source gate, whose input nets now see larger sink capacitances. However, this may or may not create violations on the input nets.

Second, keeping the source at its original size, buffers can be inserted on the net in question. These isolate the load capacitance of the sink and repower the signal on the net, so that slews are within the specified limits. Unlike resizing, this method does not affect the electrical state of any other nets, but the area overhead can be much higher. Additionally, the time required to determine where best to insert buffers is much greater than the time required to resize a gate.
The causes of capacitance violations are similar to those
of slew violations: sink and interconnect capacitance both
Fig. 10. Causes of slew violations, and different methods of fixing them. (a) Slew degradation due to gate and interconnect.
(b) Fixing slew violation by sizing source. (c) Fixing slew violation by buffering.
contribute to the existence of a violation. The fixes, too, are similar, using resizing and buffering. However, it is important to note that it is possible to have capacitance violations on a net that has no slew violations, and vice versa. Therefore, both capacitance and slew violations have to be considered individually.

The simplest way to perform electrical correction is via a sequential approach: first try resizing gates to fix violations, being careful not to oversize them; for those nets that cannot be fixed with resizing, invoke a buffer insertion algorithm. This may require a second pass of resizing in order to properly size the newly inserted buffers and inverters.
The most important drawback of this approach is that the sizing and buffering used to fix violations are applied sequentially, with no communication or, indeed, knowledge of each other's capabilities. Thus, a pass of resizing or buffering tries to fix the violations that it sees and assumes that the other will be able to handle the violations that it cannot fix. For example, when resizing is applied to a net to fix a slew violation on a sink, it may decide that buffering is the better solution, for a variety of reasons, and leave the net alone. However, in the next pass, when the net is passed to the buffer insertion routine, there may be conditions, such as blockages, that prohibit the insertion of buffers. Subsequent passes of resizing and buffering with different settings are then needed to overcome this situation, and there is no guarantee that any of these passes will fix the existing violation.
C. An Integrated Approach

Alternatively, we propose a framework that tightly integrates the selection of the two optimizations, allowing the correct optimization to be applied in a single pass over the design. This integrated approach selectively applies the resizing and buffering optimizations on a net-by-net basis. Nets are selected in topological order, from outputs to inputs, and on each net the following operations are carried out.
• If there are no violations on the net, the source (driving) gate is sized down as much as possible without introducing new violations.
• If slew violations exist on the net, the source gate is sized up as necessary to fix the violations.
• If the previous step (resizing to fix violations) does not succeed, the net is buffered.
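The per-net policy above can be sketched as follows; the `net` record and the three callables are hypothetical hooks standing in for the real sizing and buffering engines, not PDS APIs.

```python
def correct_net(net, downsize, upsize_to_fix, buffer_net):
    """Apply the integrated correction policy to a single net.

    `net` is a hypothetical record; `upsize_to_fix` returns True when
    resizing alone removed the violation.
    """
    if not net["violations"]:
        downsize(net)        # recover area and lighten the input nets
        return "downsized"
    if upsize_to_fix(net):   # cheap, local fix first
        return "resized"
    buffer_net(net)          # last resort: buffering, as aggressive as needed
    return "buffered"
```

Calling this on every net in output-to-input topological order gives the single-pass flow; any side effect of a resize lands only on nets not yet visited.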
The rationale for this approach is as follows. First, nets are processed in output-to-input order; any side effect of resizing gates only impacts the input nets, which are yet to be processed. Sizing a gate up to remove a violation on its output has a detrimental effect on its input nets; this is handled by processing nets in the correct order.

Second, sizing gates down when possible has two benefits: area is recovered when gates are larger than necessary, and reducing the load on input nets potentially removes violations that may exist, or reduces their severity. The area salvaged in this step is better used for improving delay on critical paths of the circuit. Of course, this step can be skipped if the design has already been optimized for delay.
Finally, if resizing cannot fix a violation, buffering is used to fix the net. Since buffering is the last resort, this optimization can be as aggressive as required, which is used to our advantage as shown later. This order (resizing followed by buffering) is also advantageous from a runtime standpoint, since buffering a net is much slower than simply sizing the source gate.

The approach to gate sizing is straightforward. Given an input slew rate and output load, we iterate through all available sizes and select the smallest gate size that can deliver the required output slew. Buffering is based on the algorithm described in the next section, which selects the minimum-area solution satisfying the electrical constraints. For runtime considerations, a coarse buffer library is often used for buffer insertion; the lack of granularity in the buffer library makes it worthwhile to resize the inserted buffers afterward. Of course, a more fine-grained library can be used, at the cost of extra runtime.
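The size selection reduces to a linear scan over the characterized sizes; in the sketch below, the slew-model callable and its signature are assumptions made purely for illustration.

```python
def pick_smallest_size(sizes, input_slew, load, slew_limit, out_slew):
    """Return the smallest gate size whose output slew meets the limit.

    `sizes` is ordered smallest first; `out_slew(size, input_slew, load)`
    stands in for the precharacterized slew model of the gate.
    """
    for size in sizes:
        if out_slew(size, input_slew, load) <= slew_limit:
            return size
    return None  # no size suffices; the caller falls back to buffering
```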
To decide whether a gate meets its required slew
target, we adopt the model of Kashyap et al. [48] because
of its simplicity. It is actually the slew equivalent of the Elmore delay model, but does not suffer as severely from inaccuracies caused by resistive shielding.
The slew model can be explained using a generic
example: a path p from node v_i (upstream) to v_j
(downstream) in a buffered tree. There is a buffer (or the
driver) b_u at v_i, and there is no buffer between v_i and v_j.
The slew rate s(v_j) at v_j depends on both the output slew
s_{bu,out}(v_i) at buffer b_u and the slew degradation sw(p) along path p (or wire slew), and is given by [48]

s(v_j) = sqrt( s_{bu,out}(v_i)^2 + sw(p)^2 )    (4)

where the wire slew degradation is

sw(p) = ln 9 · D(p)    (5)

and D(p) is the Elmore delay of path p.
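As a small numeric sketch of this slew model (assuming, per [48], that wire slew degradation is ln 9 times the Elmore delay and that driver and wire slews combine in root-sum-square fashion; all values hypothetical):

```python
# Toy illustration of the Kashyap et al. [48] slew model (assumed form:
# wire slew degradation = ln 9 * Elmore delay, combined root-sum-square
# with the driver's output slew). Units and values are hypothetical.
import math

def wire_slew(elmore_delay):
    # Slew degradation along an unbuffered path.
    return math.log(9.0) * elmore_delay

def sink_slew(driver_output_slew, elmore_delay):
    # Root-sum-square combination of driver slew and wire degradation.
    return math.sqrt(driver_output_slew**2 + wire_slew(elmore_delay)**2)
```

With zero wire delay the sink slew reduces to the driver's output slew, which matches the intuition that degradation accumulates only along unbuffered wire.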
The basic framework presented above is flexible, and
lends itself to multiple refinements as follows. Once a net is buffered, the integrated framework allows for a quick
sizing of the newly added buffers. The buffering algorithm
can therefore be used with a small library of buffers.
Existing inverter trees can be ripped up and reinserted as
required, keeping in mind signal polarity constraints on
the sinks. If buffering does not fix a net, the cause of the
failure can be analyzed on the fly, and different algorithms,
e.g., for blockage avoidance, can be used. Finally, if area is
Alpert et al.: Techniques for Fast Physical Synthesis
586 Proceedings of the IEEE | Vol. 95, No. 3, March 2007
at a premium, both resizing and buffering can be applied to every net, and the solution with the lowest cost can be
selected.
D. Electrical Correction Summary
The integrated framework allows PDS to efficiently
perform electrical correction. However, in our initial
implementation, we found that 80%–90% of the runtime
takes place in the van Ginneken style buffer insertion algorithm, even with the speedups discussed above. For
electrical correction, using a buffer insertion algorithm
which optimizes for delay is wasteful, since the purpose of
this stage is to simply produce an electrically correct
design. This motivates a new buffer insertion formulation
specifically for electrical correction that is discussed in the
next section.
V. FAST TIMERLESS BUFFERING
The efficiency of electrical correction directly depends on
the efficiency of the buffering algorithm. While Section III
shows how one can speed up performance driven
buffering, it still suffers from the fact that three constraints
must be handled at once: area, slew, and delay. In elec-
trical correction, one can afford to ignore the last objective, delay. The assumption is that if a tree buffered by
electrical correction subsequently becomes part of a
critical path, it can always be ripped up and rebuffered
by the critical path optimizations while taking into account
the most up to date timing analysis. In general, we find
that only a relatively small percentage of nets (e.g., 5%)
need to be rebuffered. Thus, this section proposes a
simpler buffer formulation that ignores delay constraints in order to achieve a more runtime- and area-efficient
result.
The key observation that motivates this approach is that
traditional buffer insertion requires pruning based on
three components: capacitance, slack (or delay), and area
(or power). Because a candidate has to be inferior in all
three categories to be pruned, the list of possible candidates can grow quite large. However, to perform electrical correction, the optimal delay solution is not required
and instead one wishes to fix electrical violations with
minimum area. By using only two instead of three
categories for pruning, one can obtain a much more efficient solution (that is actually linear time in the case of a
single buffer).
A. Problem Formulation
For electrical correction, we seek the minimum area
(or cost) buffering solution such that slew constraints are
satisfied. Since one does not need to know the required arrival
time at the sinks, it can be performed independently of timing
analysis, hence the term, timerless buffering. While this
new formulation is actually NP-complete, some highly
efficient and practical algorithms can be utilized.
The input to the timerless buffering problem includes a routing tree T = (V, E), where V = {s_0} ∪ V_s ∪ V_n, and
E ⊆ V × V. Vertex s_0 is the source vertex, V_s is the set of
sink vertices, and V_n is the set of internal vertices. Each sink
vertex s ∈ V_s is associated with sink capacitance c_s. Each
edge e ∈ E is associated with lumped resistance R_e and
capacitance c_e. A buffer library B contains different types
of buffers. Each type of buffer b has a cost w_b, which can be
measured by area or any other metric, depending on the optimization objective. Without loss of generality, we
assume that the driver at source s_0 is also in B. A function
f : V_n → 2^B specifies the types of buffers allowed at each
internal vertex.
The output slew of a buffer, such as b_u at v_i, depends on
the input slew at this buffer and the load capacitance seen
from the output of the buffer. For a fixed input slew, the
output slew of buffer b at vertex v is then given by

s_{b,out}(v) = R_b · c(v) + K_b    (6)

where c(v) is the downstream capacitance at v, and R_b and K_b
are empirical fitting parameters. This is similar to
empirically derived K-factor equations [50]. We call R_b
the slew resistance and K_b the intrinsic slew of buffer b.
A buffer assignment Γ is a mapping Γ : V_n → B ∪ {b̄}, where b̄ denotes that no buffer is inserted. The cost of a
solution Γ is w(Γ) = Σ_{b∈Γ} w_b. With the above notation,
the basic timerless buffering problem can be formulated
as follows.

Timerless Buffering Problem: Given a Steiner tree T = (V, E) and a buffer library B, compute a buffer assignment
Γ such that the total cost w(Γ) is minimized, subject to the
input slew at each buffer or sink being no greater than a
constant τ.
B. A Timerless Buffering Algorithm
In the dynamic programming framework, a set of
candidate solutions are propagated from the sinks toward the source along the given tree. Each solution α is characterized by a three-tuple (c, w, s), where c denotes the
downstream capacitance at the current node, w denotes
the cost of the solution, and s is the accumulated slew
degradation sw defined in (5). At a sink node, the corresponding solution has c equal to the sink capacitance,
w = 0, and s = 0. The solution propagation is accomplished by the following operations.

Consider propagating solutions from a node v to its
parent node u through edge e = (u, v). A solution α_v at v becomes solution α_u at u, which can be computed as
c(α_u) = c(α_v) + c_e, w(α_u) = w(α_v), and s(α_u) = s(α_v) + ln 9 · D_e, where D_e = R_e((c_e/2) + c(α_v)).
In addition to keeping the unbuffered solution α_u, a
buffer b_i can be inserted at u to generate a buffered
solution α_{u,buf}, which can then be computed as c(α_{u,buf}) = c_{b_i}, w(α_{u,buf}) = w(α_v) + w_{b_i},
and s(α_{u,buf}) = 0.
When two sets of solutions are propagated through the left
child branch and right child branch to reach a branching
node, they are merged. Denote the left-branch solution set
and the right-branch solution set by Θ_l and Θ_r, respectively. For each solution α_l ∈ Θ_l and each solution α_r ∈ Θ_r,
the corresponding merged solution α' can be obtained according to c(α') = c(α_l) + c(α_r), w(α') = w(α_l) + w(α_r),
and s(α') = max{s(α_l), s(α_r)}. To ensure that the worst
case in the two branches still satisfies the slew constraint, we
take the maximum slew degradation for the merged
solution.
For any two solutions α_1, α_2 at the same node, α_1
dominates α_2 if c(α_1) ≤ c(α_2), w(α_1) ≤ w(α_2), and s(α_1) ≤ s(α_2). Whenever a solution becomes dominated, it is
pruned from the solution set without further propagation. A solution α can also be pruned when it is infeasible, i.e.,
either its accumulated slew degradation s(α) or the slew
rate of any downstream buffer in α is greater than the slew
constraint τ.
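The candidate propagation and pruning described above can be sketched as follows. This is a simplified single-branch illustration, not the PDS implementation; the buffer parameters are hypothetical and the slew-degradation formula follows the wire propagation rule given in the text.

```python
# Sketch of timerless candidate propagation along one unbuffered chain
# of wire segments, with (cap, cost, slew) tuples and dominance /
# infeasibility pruning. Buffer parameters are hypothetical.
import math

LN9 = math.log(9.0)

def propagate_wire(cands, Re, Ce, tau):
    """Add a wire segment upstream of all candidates; drop infeasible ones."""
    out = []
    for c, w, s in cands:
        s2 = s + LN9 * Re * (Ce / 2.0 + c)  # accumulated slew degradation
        if s2 <= tau:                       # infeasible candidates pruned
            out.append((c + Ce, w, s2))
    return out

def insert_buffers(cands, buffers):
    """Optionally add a buffer: slew resets to 0, cap becomes input cap."""
    best = {}
    for c, w, s in cands:
        for b in buffers:
            # Only the cheapest buffered candidate per buffer type survives,
            # matching the "one new solution per buffer" observation below.
            cand = (b["cin"], w + b["cost"], 0.0)
            if b["name"] not in best or cand[1] < best[b["name"]][1]:
                best[b["name"]] = cand
    return cands + list(best.values())

def prune_dominated(cands):
    """Remove candidates dominated in all of (cap, cost, slew)."""
    return [a for a in cands
            if not any(b != a and all(b[i] <= a[i] for i in range(3))
                       for b in cands)]
```

A usage round trip: start from a sink candidate, walk a wire segment, then offer a buffered alternative; neither candidate dominates the other, so both are kept for further propagation.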
When a buffer b_i is inserted into a solution α, s(α) is set
to zero and c(α) is set to c(b_i). This means that inserting
one buffer may bring only one new solution, namely, the one
with the smallest w. However, in minimum cost timing buffering, a buffer insertion may result in many non-dominated (q, c, w) tuples with the same c value, where q denotes the required arrival time (RAT).
Consequently, in timerless buffering, at each buffer
position along a single branch, at most |B| new solutions
can be generated through buffer insertion, since c and s are
the same after inserting each buffer. In contrast, buffer
insertion in the same situation may introduce many new solutions in timing buffering. This sheds light on why
timerless buffering can be computed much more efficiently.
Another important fact is that the slew constraint is in
some sense close to a length constraint. In timerless buffering, solutions soon become infeasible if no buffer is
added, and thus many solutions that are propagated only
through wire insertion are removed quickly. An extreme
case demonstrating this point: in standard timing buffering,
given a loose timing constraint, solutions with no buffer
inserted can always survive until being pruned at the
driver. This may not happen in timerless buffering: such solutions soon
become infeasible as long as the slew constraint is not too
loose.
Due to these special characteristics of the timerless
buffering problem, a linear time optimal algorithm for
buffering with a single buffer type is possible. In timing buffering, it is not known how to design a polynomial time algorithm for this case. From these facts, the basic differences between these two somewhat related buffering
problems are clear.
C. Results and Summary
Table 3 compares timerless buffering to timing-driven
buffering for 1000 high capacitance nets from an ASIC
design, for slew constraints ranging from 0.4 to 2.0 ns. A
library of 48 buffers was used. The experiment shows that timerless buffering does result in a consistent degradation in slack, which is not surprising since it does
not utilize timing information. Because timerless buffering minimizes area in its objective function, it is more
efficient in buffering area and the number of buffers
used. The area savings tend to increase as the slew
constraint is relaxed. Finally, the CPU time advantage is
clear, as speedups of 25 to over 100 are observed. The timing-driven buffering used here does utilize preslack
pruning and squeeze pruning, but not library lookup.
Obviously the latter technique would reduce the advantage somewhat.
Since electrical correction can result in millions of
buffers being inserted, one needs to do this as fast as
possible. Even with the speedups in Section III, a delay
driven technique is not suitable for this task. Instead, using a timerless formulation that seeks to minimize area proves
significantly faster and actually uses less area.
Ultimately, one needs a large bag of buffering solutions depending on where one is in the physical synthesis
flow. For early electrical correction, a faster timerless
algorithm is appropriate. For critical path optimization, a
van Ginneken style algorithm is needed. However, one
often may need to pay attention to the blockages or placement and routing congestion that may exist in the
design. The next section shows a framework for dealing
with any of these layout characteristics.
Table 3 Comparison of Timerless Buffering With Timing-Driven Buffering
VI. LAYOUT-AWARE FAST AND FLEXIBLE BUFFER TREES
Given a Steiner tree, we can insert buffers for critical path optimization using timing-driven buffering or electrical
correction using timerless buffering. The quality of the
results strongly depends on the Steiner tree used, and
so we use a buffer-aware tree construction as described
in [39]. However, this construction ignores the blockages
and congestion present in the layout. Ignoring this can
potentially cause several design headaches.
A. Types of Layout Issues
For example, Fig. 11(a) illustrates the ''alley'' problem,
in which space is limited between two large fixed blocks.
The space between blocks is highly desirable, since routes
that cross the blockages have their only potential insertion space
in the alley. Fig. 11(b) shows the buffer ''pile-up'' phenomenon. Several nets may desire buffers to be inserted in
the black congested region, yet since there is no space for
buffers there, the buffers are inserted as close to the
boundary as possible. As more nets are optimized, these buffers pile up and spiral out further from their ideal
locations. This could be alleviated by only allowing buffers
from critical path optimization (not electrical correction)
to use these scarce resources.
As technology continues to scale, the optimum distance
between consecutive buffers continues to decrease. In
hierarchical design, this means allocating spaces within
macro blocks for buffering of global nets. An example is shown in Fig. 12(a). The space for buffers is potentially
limited, so non-critical nets should be routed around the
blocks while critical ones can use the holes. Long non-
critical nets still require buffers to fix slew and/or ca-
pacitance violations. In addition, these nets could be
critical, but have a wide range of possible buffering
solutions that may bring them into the non-critical group.
In the figure, the top net is non-critical and requires three buffers, while the bottom net is critical and needs only two
by exploiting holes punched in the block.
Even without holes in blocks, designs may have pockets
of low density for which inserting buffers is preferred, as
shown in Fig. 12(b). In the figure, the Steiner route is
located in the low density part of the chip, which makes
the buffers inserted along the route also use low density regions. Fig. 12(c) shows an example where one may be
willing to insert buffers in high density regions if a net is
critical. The 2-buffer route above the block yields faster
delays than the 4-buffer route below the block that is better
suited for noncritical nets. Finally, Fig. 12(d) shows
routing congestion between two blocks; the preferred
buffered route avoids this congestion without sacrificing
timing.

There are some buffering approaches that attack a
subset of these types of problems by simultaneously integrating the layout environment, building a Steiner tree, and
inserting buffers (e.g., [51], [52]), but doing too much work at once
inherently makes these algorithms too inefficient for this
application. Instead, we propose the following flow:
• Step 1: construct a fast timing-driven Steiner tree
(e.g., [39]) that is ignorant of the environment.
• Step 2: reroute the Steiner tree to preserve its topology while navigating environmental constraints.
• Step 3: insert buffers via the algorithms in
Section III or V.
This section focuses on solving the problem in Step 2.
B. Rerouting Algorithm Overview
To reroute the tree, the design area is divided into
tiles, as in global routing, and the placement and
routing density characteristics are stored for each tile. The algorithm
takes the existing Steiner tree and breaks it into disjoint
2-paths, i.e., paths which start and end with either the
source, a sink, or a Steiner point, such that every internal
node has degree two. For example, the nets shown in
Fig. 13(a) and (b) both decompose into three 2-paths.
Finally, each 2-path is rerouted in turn to minimize cost,

Fig. 11. Buffer insertion can potentially: (a) fill up constrained ''alleys''
and (b) cause buffer ''pile-ups.''
Fig. 12. Some environmental based constraints include: (a) holes
in large blocks; (b) navigating large blocks and dense regions;
(c) distinguishing between critical and noncritical preferred
routes; and (d) avoiding routing congestion.
starting from the sinks and ending at the source. The new
Steiner tree is assembled from the new 2-path routes.
Essentially, the algorithm is performing maze routing for
each subsection of the tree. The two key components of
achieving a good result are plate expansion, which allows
the Steiner points to migrate, and deriving the right maze routing cost function.
If a Steiner point is in a congested region, it needs to
migrate from its original location. One could consider
allowing it to move anywhere in the layout, but since the
original Steiner layout was presumably ''good,'' we restrict
it to move only within a specified ''plate'' region. This is
one key for enabling the algorithm to be efficient. The
plate needs to be large enough to enable the Steiner point to migrate to a less congested tile.
During maze rerouting, one considers routing to any
tile in the plate instead of just the original tile.
Fig. 13(a) shows a routing tree after Step 1. The striped
tile is the Steiner point, and the shaded region shows a
5 × 5 plate centered at the original Steiner point.
Fig. 13(b) shows a Steiner tree that might result after re-
routing. The Steiner point has moved to a different location within the plate; where it ends up depends on the
cost function that is optimized. The dotted region shows
the potential search space for the rerouting of the 2-path
from the Steiner point to the source. In this case, the
bounding box containing the two endpoints was expanded
by one tile.
C. Maze Routing Cost Function for Electrical Correction
Each tile is assigned a cost that should reflect potentially
inserting a buffer in and/or routing through the tile. Let
e(t) ≤ 1 be the environmental cost of using tile t, where
e(t) = 0 if the tile is totally void of any resource utilization,
while e(t) = 1 represents a fully utilized tile. As an
example, for placement congestion, let d(t) be the
placement density (cell area divided by total area available)
of tile t, and let r(t) be its routability (used tracks divided by total tracks available). Then one could use

e(t) = α · d(t)^2 + (1 − α) · r(t)^2    (7)

where 0 ≤ α ≤ 1 trades off between routing and placement cost.
For fixing electrical violations, one wants the net to
avoid high cost tiles, while still making an attempt to
minimize wirelength. For this case, consider

cost(t) = 1 + e(t).    (8)

This cost function implies that a fully utilized tile has
cost twice that of a tile that uses no resources. The constant
of one can be viewed as a ''delay component.'' Let the cost of a path equal the sum of the costs of all tiles in the path, and
initially assign all sinks zero cost. We wish to
minimize the cost of the entire tree being constructed. For
a tile t that corresponds to a Steiner point, with subtree
children L and R, the cost of the tree rooted at t is
cost(t) = cost(L) + cost(R).
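The electrical-correction tile cost can be illustrated in a few lines; the density and routability values below are hypothetical, and the blending weight alpha follows (7).

```python
# Toy illustration of the electrical-correction tile cost: e(t) blends
# placement density d(t) and routing utilization r(t) as in (7), and a
# tile's maze cost is 1 + e(t) as in (8). All numbers are hypothetical.

def env_cost(d, r, alpha=0.5):
    # e(t) = alpha * d(t)^2 + (1 - alpha) * r(t)^2
    return alpha * d * d + (1.0 - alpha) * r * r

def tile_cost(d, r, alpha=0.5):
    # cost(t) = 1 + e(t)
    return 1.0 + env_cost(d, r, alpha)

# A fully utilized tile costs exactly twice an empty one:
print(tile_cost(0.0, 0.0))  # prints: 1.0
print(tile_cost(1.0, 1.0))  # prints: 2.0
```

The quadratic terms make nearly full tiles much more expensive than half-full ones, which steers routes toward empty regions without forbidding congested tiles outright.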
D. Maze Routing Cost Function for Critical Path Optimization
For critical nets, the cost impact of the environment is
relatively immaterial. We seek the absolute best possible
slack, but still need the route to avoid regions where buffers cannot be inserted at all. When a net is optimally
buffered (assuming no obstacles), its delay is a linear
function of its length [53]. Of course, this solution must be
realizable. To minimize delay, we simply minimize the
number of tiles to the most critical sink. Thus, the cost for
a tile is just cost(t) = 1 (there is no e(t) term). When
merging branches, one wants to choose the branch with
worst slack, so the merged cost cost(t) is max(cost(L), cost(R)). To initialize the slack, a notion of
which sink is critical is needed. Since our cost function
basically counts tiles as delay, the required arrival time
(RAT) must be converted to tiles. Let DpT be the minimum
delay per tile achievable on an optimally buffered line. For
a sink s, the cost(s) is initialized to −RAT(s)/DpT. The
more critical a sink, the higher its initial cost. The
objective is to minimize cost at the source.

Fig. 14(a) shows one of several possible solutions for
rerouting the net in Fig. 13 using this cost function, where
s2 is considered two tiles more critical than s1. Note that it
achieves a shortest path to s2. Contrast that with the
electrical correction cost function shown in Fig. 14(b), in
which the ''blob'' represents an area of high cost. In this
case, the route avoids the congested area even though it
means the route to the critical sink is much longer.
Fig. 13. Example of a three-pin net: (a) before and (b) after rerouting.
The shaded square region is the ‘‘plate’’ and the dotted region
is the solution search space for the final 2-path.
E. General Cost Function
The previous cost functions can generate extreme
behavior; however, one can trade off between the two cost
functions. Let 0 ≤ K ≤ 1 be the tradeoff parameter,
where K = 1 corresponds to electrical correction and
K = 0 corresponds to a critical net. The cost function for
tile t is then

cost(t) = 1 + K · e(t).    (9)
For critical nets, merging branches is a maximization
function, while it is an additive function for non-critical
nets. These ideas can be combined to yield

cost(t) = max(cost(L), cost(R)) + K · min(cost(L), cost(R)).    (10)

Finally, the sink initialization formula becomes

cost(s) = (K − 1) · RAT(s)/DpT.    (11)
Thus, K trades off the cost function, the merging
operation, and sink initialization. In practice, we use K = 1 for electrical correction and subsequently smaller values,
down to K = 0.1, for critical path optimization.
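A small sketch of the K-parameterized rules in (9)-(11) follows; the numeric values are toys, not data from the paper.

```python
# Sketch of the K-parameterized maze routing cost: K=1 behaves like
# electrical correction, K=0 like a critical net. Values hypothetical.

def tile_cost(e_t, K):
    # (9): environment matters in proportion to K.
    return 1.0 + K * e_t

def merge_cost(cost_L, cost_R, K):
    # (10): pure max for critical nets (K=0), pure sum for K=1.
    return max(cost_L, cost_R) + K * min(cost_L, cost_R)

def sink_init(rat, dpt, K):
    # (11): more critical sinks (smaller RAT) start with higher cost.
    return (K - 1.0) * rat / dpt

# The two extremes recover the earlier cost functions:
assert merge_cost(3.0, 5.0, 0.0) == 5.0  # critical: max only
assert merge_cost(3.0, 5.0, 1.0) == 8.0  # electrical: additive
```

Note that at K = 0, sink_init reduces to −RAT/DpT and merge_cost to a pure max, matching the critical-path rules of the previous subsection.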
F. Slew Threshold Constraint
As described, the maze routing cost functions do not
guarantee that slew constraints will be satisfied. Let T be the
maximum number of tiles that can be driven by a buffer
before the slew constraint is violated. If the route goes over
more than T consecutive blocked tiles, there will be an
unavoidable slew violation when buffering. Hence, during maze routing we track the number of consecutive blocked
tiles and forbid it from exceeding T by not performing
node expansion once this threshold is reached. This guarantees that the resulting Steiner tree will have
sufficient area for buffers so that slew violations can be
fixed by subsequent dynamic programming.
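The consecutive-blocked-tile check can be sketched as follows; the boolean route encoding is a hypothetical simplification of the tile data.

```python
# Sketch of the slew-threshold check: reject a route whose run of
# consecutive blocked tiles exceeds T, since no buffer could then be
# inserted to fix the resulting slew violation. Encoding hypothetical.

def feasible_route(blocked, T):
    """blocked: booleans along the route (True = no buffer space here)."""
    run = 0
    for tile_blocked in blocked:
        run = run + 1 if tile_blocked else 0
        if run > T:
            return False  # maze routing would stop node expansion here
    return True

assert feasible_route([True, True, False, True], T=2)
assert not feasible_route([True, True, True], T=2)
```

In the actual maze router this test is applied incrementally during node expansion rather than on a finished route, so infeasible paths are never generated in the first place.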
G. Example and Summary
The effect of rerouting can be shown by the example in
Fig. 15, which displays the placement density map for a
given 7-pin net of an industrial design. The source is marked with a white x, while sinks are marked with dark
squares. The white dots are potential buffer insertion
locations, and the diamonds are the inserted buffers. The
route on the left is the solution with K = 1.0, while the one
on the right is the solution for K = 0.1. Observe that the left
route totally avoids the large blockage, which ultimately
leads to a 4134 ps slack improvement over the unbuffered
solution. However, when K = 0.1, the route successfully finds the prime real estate (the holes inside the block)
and places buffers in them where it deems appropriate.
This improves the slack by 4646 ps. A simple parameter
setting of the cost function thus yields a different Steiner route
that can recognize layout constraints depending on the
particular phase of physical synthesis.
Optimizations that ignore the layout can cause severe
headaches for timing closure and routability. The maze rerouting technique proposed in this section is general
enough to handle any kind of layout configuration,
whether blockages, regions packed with dense cells, or
routing congestion. One does not need to deploy this
throughout physical synthesis, though. Instead one could
wait for the ''mess'' and then clean it up. For example, PDS
has a phase to identify all buffers in routing-congested
regions, rip up those buffers, then reroute them using this maze routing strategy. This clean-up-the-mess strategy
enables more efficient overall optimization than trying to
always preemptively avoid the mess. The next section
explains how a different kind of legalization algorithm is
Fig. 15. Illustration of the different routes obtained with the general
maze routing cost function for a layout containing a large block
with punched out holes. (a) A routed net with K = 1.0.
(b) The same net with K = 0.1.
Fig. 14. Examples of the (a) critical and (b) non-critical net cost
functions. The shaded area represents a region of high cost.
more effective at cleaning up messes made from synthesis operations.
VII. DIFFUSION-BASED PLACEMENT TECHNIQUES FOR LEGALIZATION
During electrical correction and critical path optimization,
some gates may be resized while new ones are inserted
into the design. PDS does not assign a location right away, but rather assigns a preferred location that may overlap
existing cells. Periodically, legalization needs to run to
snap these cells from overlapping to legal locations. If one
waits too long between legalization invocations, cells may
end up quite far from their preferred location which may
severely hurt timing. This section discusses a new
legalization paradigm called diffusion that was first
described in [28]. Diffusion tries to avoid this behavior by keeping the relative ordering of the cells intact.
Of course, there are other methods that can also
achieve legalization without moving any one cell too far
away. Brenner et al. [54] describe a network flow algo-
rithm that superimposes a flow network on top of grid bins
and then flows cells from overly dense bins to bins that are
under capacity. More recently, Luo et al. superimpose a
Delaunay triangulation on top of the cells and use this structure to enforce relative order while achieving local
density targets. Techniques for local cell movement, swap-
ping and shifting to improve placement quality after legal-
ization can be found in [55], [56].
During optimization, local regions can become overfull
at which point synthesis, buffering, and repowering opti-
mizations may become handcuffed if they are forbidden to
add to the area in an already full bin. The main advantage of diffusion is that it can allow the optimizations to
proceed anyway, knowing that cells will not be moved too
far away from their intended locations. Further, diffusion
can be implemented to run in just a few minutes, even on
designs with millions of gates.
Diffusion is a well-understood physical process that
moves elements from a state with non-zero potential energy to a state of equilibrium. The process can be modeled by breaking down the movements into several small finite
time steps, then moving each element the distance it would
be expected to move during that time step. Our legalization
approach follows this model; it moves each cell a small
amount in a given time step according to its local density
gradient. The more time steps the process is run, the closer
the placement gets toward achieving equilibrium.
Assume that a placement is close to legal if all that is required to legalize the placement is to snap cells to rows
or perhaps perform minor cell sliding in order to fit the
cells. Also, assume the chip layout is divided into small,
equally sized bins which can fit around 5–15 cells. Let dmax
be the maximum allowed density of a bin, where com-
monly dmax ¼ 1. The placement is considered close to legal
if the area density of every bin is less than or equal to dmax.
For all bins with density greater than dmax, cells must bemigrated out of those bins into less dense ones. The goal of
legalization is to reduce the density of each bin to no more
than d_max while avoiding moving these cells far from their
original locations, and also to preserve the ordering induced by the original placement. Once each bin satisfies its
density requirement d_{j,k} ≤ d_max, a legal placement solution
can generally be easily achieved (since each bin is
guaranteed sufficient space), e.g., through local slide and spiral optimization.
A. The Diffusion Process
Diffusion is driven by the concentration gradient,
which is the slope and steepness of the concentration difference at a given point. The increase in concentration in a
cross section of unit area with time is simply the difference
of the material flow into the cross section and the material
flow out of it. Diffusion reaches equilibrium when the
material concentration is evenly distributed.

Mathematically, the relationship of material concentration with time and space can be described using the
following partial differential equation:
∂d_{x,y}(t)/∂t = ∇² d_{x,y}(t)    (12)
where d_{x,y}(t) is the material concentration at position
(x, y) at time t. Equation (12) states that the speed of
density change is linear with respect to its second-order
gradient over the density space. In the context of placement, cells will move quicker when their local density neighborhood has a steeper gradient.

When the region for diffusion is fixed (as in
placement), the boundary conditions are defined as
∇d_{x_b,y_b}(t) = 0 for coordinates (x_b, y_b) on the chip boundary. We also define coordinates over fixed blocks in the
same way, in order to prevent cells from diffusing on top of
fixed blocks. This forces cells to diffuse around the blocks.
In diffusion, a cell migrates from an initial location to its final equilibrium location via a non-direct route. This
route can be captured by a velocity function that gives the
velocity of a cell at every location in the circuit for a given
time t. This velocity at a certain position and time is
determined by the local density gradient and the density
itself. Intuitively, a sharp density gradient causes cells to
move faster. For every potential (x, y) location, define a
2-D velocity field v_{x,y} = (v^H_{x,y}, v^V_{x,y}) of diffusion at time t as follows:

v^H_{x,y}(t) = −(∂d_{x,y}(t)/∂x) / d_{x,y}(t)
v^V_{x,y}(t) = −(∂d_{x,y}(t)/∂y) / d_{x,y}(t).    (13)
Given this equation, and a starting location (x(0), y(0)) for a particular element, one can find the new location
(x(t), y(t)) for the element at time t by integrating the
velocity field

x(t) = x(0) + ∫_0^t v^H_{x(t'),y(t')}(t') dt'
y(t) = y(0) + ∫_0^t v^V_{x(t'),y(t')}(t') dt'.    (14)
Equations (12)–(14) are sufficient to simulate the
diffusion process. Given any particular element, one can
now find the new location of the molecule at any point
in time t. To apply this paradigm to placement, one
needs to migrate from this continuous space to a discrete one, since cells have various rectangular sizes and the
placement image itself is discrete. The next section
presents a technique to simulate diffusion specifically for
placement.
B. Diffusion Based Placement
One can discretize continuous coordinates by dividing
the placement area into equal sized bins indexed by (j, k). Assume the coordinate system is scaled so that the width
and height of each bin is one. Then location (x, y) lies inside bin (j, k) = (⌊x⌋, ⌊y⌋). We can also discretize continuous time t as nΔt, where Δt is the size of the discrete
time step.

Instead of the continuous density d_{x,y}, we can now
describe diffusion in the context of the density d_{j,k} of bin
(j, k). The initial density d_{j,k}(0) of each bin (j, k) can be
defined as d_{j,k}(0) = Σ_i A_i, where A_i is the overlapping area of
cell i and bin (j, k).

For simplicity, assume that if a fixed block overlaps a
bin, it overlaps the bin completely. In these cases, the
bin density is defined to be one, though boundary conditions prevent cells from diffusing on top of fixed
blocks.
Assume that the density d_{j,k}(n) has already been computed for time n. Now one needs to find how the density
changes and how cells move for the next time step
n + 1. We use the Forward Time Centered Space (FTCS)
scheme [57] to discretize (12). The new bin density is
given by

d_{j,k}(n+1) = d_{j,k}(n)
  + (Δt/2) [ d_{j+1,k}(n) + d_{j−1,k}(n) − 2 d_{j,k}(n) ]
  + (Δt/2) [ d_{j,k+1}(n) + d_{j,k−1}(n) − 2 d_{j,k}(n) ].    (15)

The new density of a bin at time n + 1 depends only on its density and the density of its four neighbor bins. Note that
one does not actually use the cell locations at time n + 1 to
compute the density.
Just as (12) can be discretized to compute placement bin density, (13) can be discretized to compute the velocity of the cells inside the bins. For now, assume that each cell in a bin is assigned the same velocity, the velocity of the bin, given by

v^H_{j,k}(n) = -[d_{j+1,k}(n) - d_{j-1,k}(n)] / [2 d_{j,k}(n)]
v^V_{j,k}(n) = -[d_{j,k+1}(n) - d_{j,k-1}(n)] / [2 d_{j,k}(n)].   (16)
The horizontal (vertical) velocity is proportional to the difference in density of the two neighboring horizontal (vertical) bins.
To make sure that fixed cells and bins outside the boundary do not move, we enforce v^V = 0 at a horizontal boundary and v^H = 0 at a vertical boundary.
Assuming that each cell in a bin has the same velocity fails to distinguish between the relative locations of cells within a bin. Further, two cells that are right next to each other but in different bins can be assigned very different velocities, which could change their relative ordering. Since the goal of placement migration is to preserve the integrity of the original placement, this behavior cannot be permitted. To remedy it, we apply velocity interpolation to generate a horizontal (vertical) velocity v^H_{x,y} (v^V_{x,y}) for a given location (x, y). The interpolation looks at the four closest bins for each cell and interpolates from the velocities assigned to each of those bins, generating a unique velocity vector for a cell at location (x, y).

Finally, since the velocity of each cell can be determined at time n = t/Δt, one can compute its new placement via a discretized form of (14). Suppose at time step n a cell has location (x(n), y(n)). Its location at the next time step is given by

x(n + 1) = x(n) + v^H_{x(n),y(n)} · Δt
y(n + 1) = y(n) + v^V_{x(n),y(n)} · Δt.   (17)
An example is shown in Fig. 16 in which a cell takes
nine discrete time steps. Observe how the cell never
overlaps a blockage and also how the magnitude of its
movements becomes smaller toward the tail end of its path.
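The remaining machinery can be sketched as follows, again illustratively rather than as the paper's implementation: per-bin velocities from (16), bilinear interpolation of those velocities over the four closest bins (taking bin centers at half-integer coordinates, which is an assumption), and the Euler step of (17). The sketch assumes strictly positive bin densities.

```python
import numpy as np

def bin_velocities(d):
    """Per-bin velocities of Eq. (16); assumes all bin densities d > 0.
    Boundary bins get zero velocity so nothing drifts off the image."""
    vH = np.zeros_like(d)
    vV = np.zeros_like(d)
    vH[1:-1, :] = -(d[2:, :] - d[:-2, :]) / (2 * d[1:-1, :])
    vV[:, 1:-1] = -(d[:, 2:] - d[:, :-2]) / (2 * d[:, 1:-1])
    return vH, vV

def interpolated(v, x, y):
    """Bilinearly interpolate a per-bin quantity at continuous (x, y)
    from the four closest bins, so that adjacent cells in different
    bins receive similar velocities."""
    j0 = int(np.clip(np.floor(x - 0.5), 0, v.shape[0] - 2))
    k0 = int(np.clip(np.floor(y - 0.5), 0, v.shape[1] - 2))
    fx = float(np.clip(x - 0.5 - j0, 0.0, 1.0))
    fy = float(np.clip(y - 0.5 - k0, 0.0, 1.0))
    return ((1 - fx) * (1 - fy) * v[j0, k0] + fx * (1 - fy) * v[j0 + 1, k0]
            + (1 - fx) * fy * v[j0, k0 + 1] + fx * fy * v[j0 + 1, k0 + 1])

def move_cell(x, y, vH, vV, dt=0.1):
    """Euler step of Eq. (17) using the interpolated velocity at (x, y)."""
    return x + interpolated(vH, x, y) * dt, y + interpolated(vV, x, y) * dt
```

With a uniform density field all velocities vanish and cells stay put; raising the density of one bin pushes nearby cells away from it, which is exactly the spreading behavior diffusion relies on.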
Alpert et al.: Techniques for Fast Physical Synthesis
Vol. 95, No. 3, March 2007 | Proceedings of the IEEE 593

C. Making It Work
Since the diffusion process reaches equilibrium when each bin has the same density, we can expect the final density after diffusion to be the average density Σ_{j,k} d_{j,k}/N. This can cause unnecessary spreading, even if every bin's density is well below d_max. This additional spreading degrades the placement quality of results.
Essentially, we would like to run diffusion only for the regions that require it, perhaps for legalization or to remove routing congestion, while leaving the rest of the design (which may be in very good shape) alone. The idea of local diffusion is to run diffusion only on cells in a window around bins that violate the target density constraint. Local diffusion also has the advantages of less work per iteration and faster convergence.
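One way to realize local diffusion is to mark the violating bins, dilate that set by a small window, and then run the diffusion updates only on the marked region. The sketch below shows just the bin-selection step; the `halo` window-size parameter is an illustrative assumption, not a value from the paper.

```python
import numpy as np

def local_diffusion_windows(d, d_max, halo=2):
    """Return a boolean mask of the bins on which local diffusion should
    run: every bin whose density exceeds the target d_max, plus a window
    of `halo` bins around each violator. Bins outside the mask are left
    untouched, preserving well-placed regions of the design."""
    mask = d > d_max
    active = np.zeros_like(mask)
    J, K = d.shape
    # dilate the violation mask by `halo` bins in each direction
    for j, k in zip(*np.nonzero(mask)):
        active[max(0, j - halo):min(J, j + halo + 1),
               max(0, k - halo):min(K, k + halo + 1)] = True
    return active
```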
Although we use (15) to compute bin densities during diffusion, the computed densities are not exactly the same as the real placement densities. The mathematics of the diffusion process [(15), (16), and (17)] assumes a continuous distribution of equal-sized particles. The real standard-cell distribution does not always satisfy this condition, because cells are not equally distributed inside a bin and because cells have different sizes. Periodically, one should update the density based on the real cell placement when the error exceeds a certain threshold, and then restart the diffusion algorithm from the new placement map.
D. Diffusion Summary
Fig. 17 shows an example of diffusion-based legalization in a region surrounded by other placed cells and fixed blocks. The top-left figure shows an initial illegal placement in which the colored regions represent areas of cell overlap. The top-right figure shows what happens when traditional legalization is invoked. Observe how the integrity of the regions is no longer preserved as the colored cells mix; some cells move quite far away from their neighbors in the original illegal placement. Finally, the bottom figure shows the result of diffusion-based legalization, in which the continuity of the colored regions is relatively well preserved. This example illustrates that diffusion performs a smooth spreading, which is less disruptive to the state of the design.
To see how effective diffusion-based legalization can be in a physical synthesis engine, we ran PDS physical synthesis optimization on seven ASIC testcases without legalizing at all during the run. This results in a large number of overlaps caused by physical synthesis. For comparison, we ran a greedy and a flow-based legalizer and measured the best results obtained by those approaches [28]. Compared to these traditional approaches, diffusion averages about 4% improvement in the total wirelength of the design. Further, the timing of the worst slack path is 48% better on average, and the overall number of negative paths is 36% better. The improvement can be observed for all seven designs.
The ability of diffusion to minimize timing degradation, to smoothly spread out the placement, and to attack local hotspots of either placement or routing congestion makes it a powerful technique for physical synthesis. For starters, one can afford to run legalization less often, since diffusion is less likely to significantly disrupt the state of the design.
VIII. CONCLUSION
A. Impact of the Stages of Physical Synthesis
This paper discussed various techniques to achieve fast physical synthesis, which may be applied in all phases of physical synthesis. Recall that the four main phases considered in this paper are:
1) initial placement and optimization;
2) timing-driven placement and optimization;
3) timing-driven detailed placement;
4) optimization techniques.
One need not apply all the techniques in performing design closure, and frequently designers mix and match the pieces depending upon their needs. For example, the first phase is especially useful during the floorplanning process. The designer may wish to find the locations of large blocks and also restrict the movement of key logic. Through placement and optimization, the designer can reasonably evaluate the quality of the floorplan. If the designer is happy with this result, he or she may skip all the way to the last technique to push down the timing on any remaining critical paths.
In general, the timing after performing the first step
will be far from achieving closure, e.g., the cycle time may
be double what is required by the design specifications.
Performing timing-driven placement and optimization
generally helps significantly and results in many fewer
negative paths. The third stage generally does not help timing but may improve wiring by anywhere from 2% to 5%, and this can make a huge difference in achieving a routable design.

Fig. 16. An example cell movement from diffusion.
Finally, unless the design is for some reason "easy," the last stage of optimization is critical for actually achieving timing closure. Designers exploit this stage the most during their iterations as they tweak the design. If only minor changes are required, going back to global placement would be far too disruptive and could potentially put the design in a completely different state. The ability to iterate and perform in-place synthesis is critical in garnering the last bit of performance out of the design. However, if the timing of the design is in really bad shape, optimization alone will not be able to close on timing. The designer must go back and iterate on the floorplan and global placement steps.
B. Future Directions
Physical synthesis is a runtime-intensive, complex system that requires the integration and cooperation of several types of algorithms and functions. Exacerbating the turnaround-time problem, design sizes will likely soon move from millions to tens of millions of placeable objects. There are numerous research directions in the timing-closure space that we believe are worth pursuing to achieve both faster runtime and higher quality of results. In general, achieving better quality can also be a great way to achieve a faster system, as the back-end optimization could have far fewer negative paths to work on. Some promising research directions include the following.
1) Better net weighting for timing-driven placement. For example, consider two critical paths A and B, both of which are equally critical, but A spends 80% of its delay traversing fixed blocks and 20% in moveable logic, while B spends 20% and 80% in fixed and moveable logic, respectively. In this case, A does not have much room for error, as placement needs to fix the 20% of its logic that can be fixed, while B offers placement considerably more opportunity to straighten out the 80% of its logic that it can affect. Thus, net weighting should give more priority to nets in path A than in path B. There are numerous other scenarios that can be studied and modeled to improve net weighting.

Fig. 17. Diffusion-based legalization example.
2) Removing a global placement. In the flow described, placement is run twice. If clever net weighting and crude placement estimation are used, it may be possible to significantly improve runtime by skipping a placement step altogether while still retaining solution quality.
3) Latch pipeline placement. As designs require multiple cycles to get from one side of the chip to the other, placement needs to recognize that latches must be placed in such a way as to guarantee that one can get from one latch to another within the given cycle time. For example, assume latch A drives latch B, which drives latch C, and A is fixed on the left side while C is fixed on the right. If B is placed too close to A, then the path from B to C becomes critical. If one applies a higher net weight to the connection from B to C, then B may be moved too close to C, and the A-to-B path becomes critical. One has to teach placement to find an appropriate balance, and it is unlikely that net weighting alone can achieve this kind of result.
4) "Do no harm" detailed placement. Detailed placement is a powerful technique for improving wirelength but typically does not improve timing. In fact, it is risky to run it late in the fourth stage of the flow because it may worsen paths that were already carefully optimized. The idea of "do no harm" detailed placement [58] is to recognize and forbid moves that degrade the timing, while accepting only moves that improve wirelength and timing.
5) Force-directed placement. As discussed earlier, force-directed placement is emerging as a promising technique both in terms of quality ([7], mPL [8], mFAR [9]) and speed [10]. This technique also has the advantage of stability, in that small changes to net weights likely will not create entirely different global placements. Its spreading ability (like that of diffusion) makes it appealing for handling incremental netlist changes.
6) Parallelism. As designs truly become large, they can potentially be partitioned into smaller physical pieces that do not require an inordinate amount of cross-partition communication. One can then apply physical synthesis on each piece relatively independently. While this approach seems simple enough, it is fraught with choices, any of which could lead to a significantly degraded solution. One must be careful with the partition pin assignment, the buffering strategy, and the timing contracts between partitions.
7) Complex transforms. Transforms that perform multiple operations simultaneously could potentially have a big impact on timing. For example, consider a cell B on the left side of the chip connected to cells A and C on the right side. Clearly B wants to be near A and C, but if the nets connected to B have already been buffered, those buffers act as anchors which keep B from moving to the right. One needs to rip up the buffer trees, then consider moving B, then put the buffer trees back in to evaluate whether the move was worthwhile. Another example is simultaneous buffering and cloning.
This list is just a sampling of possible research directions. As design technology scales to 65 nm and below, the problem of timing closure will continue to evolve into the even more complex problem of design closure. Design closure requires that accurate modeling of the clock tree network and routing be incorporated earlier and earlier in the physical synthesis pipeline to take into account their effects on timing and signal integrity. The need to meet a global power constraint, e.g., by incorporating multithreshold logic gates and voltage islands, also becomes more critical. One must pay attention to how physical design choices impact manufacturability. Requiring physical synthesis to meet and incorporate these additional constraints only further exacerbates the runtime issue. Therefore, research that discovers more efficient techniques for core physical synthesis optimizations, such as placement, buffering, legalization, repowering, incremental timing, routing, and clock tree synthesis, will continue to be of high value.
Acknowledgment
The PDS physical synthesis system has had many
contributors over the years. The authors sincerely thank
everyone who has helped, both with driving the work
presented here and with overall contributions to IBM’s PDS
tool. These contributors include Lakshmi Reddy, Ruchir
Puri, David Kung, Leon Stok, Charles Bivona, Louise
Trevillian, Michael Kazda, Pooja Kotecha, Nate Heiter, Erik Kusko, Mike Dotson, Carl Hagen, Zahi Kurzum,
Gopal Gandham, Stephen Quay, Tuhin Mahmud, Jiang
Hu, Milos Hrkic, Kristian Zoerhoff, William Dougherty,
Brian Wilson, Bryon Wirtz, Tony Drumm, Elaine D’Souza,
Shyam Ramji, Alex Suess, Jose Neves, Veena Puresan,
Arjen Mets, Andrew Sullivan, Jim Curtain, David Geiger,
Tsz-mei Ko, and Pete Osler.
REFERENCES
[1] L. Trevillyan, D. Kung, R. Puri, L. N. Reddy, and M. A. Kazda, "An integrated environment for technology closure of deep-submicron IC designs," IEEE Des. Test Comput., vol. 21, no. 1, pp. 14–22, Jan.–Feb. 2004.
[2] P. G. Villarrubia, "Physical design tools for hierarchy," in Proc. ACM Int. Symp. Physical Design, 2005.
[3] P. Saxena, N. Menezes, P. Cocchini, and D. A. Kirkpatrick, "Repeater scaling and its impact on CAD," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 23, no. 4, pp. 451–463, Apr. 2004.
[4] J. Cong, Z. D. Kong, and T. Pan, "Buffer block planning for interconnect planning and prediction," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 6, pp. 929–937, Dec. 2001.
[5] C. J. Alpert, J. Hu, S. S. Sapatnekar, and P. G. Villarrubia, "A practical methodology for early buffer and wire resource allocation," in Proc. Design Automation Conf., 2001.
[6] G.-J. Nam, C. J. Alpert, P. G. Villarrubia, B. Winter, and M. Yildiz, "The ISPD2005 placement contest and benchmark suite," in Proc. ACM Int. Symp. Physical Design, 2005, pp. 216–220.
[7] A. B. Kahng and Q. Wang, "Implementation and extensibility of an analytic placer," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 5, pp. 734–747, May 2005.
[8] T. Chan, J. Cong, T. Kong, J. Shinnerl, and K. Sze, "An enhanced multilevel algorithm for circuit placement," in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, 2003, pp. 299–305.
[9] B. Hu and M. M. Sadowska, "Fine granularity clustering-based placement," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 23, no. 4, pp. 527–536, Apr. 2004.
[10] N. Viswanathan and C.-N. Chu, "FastPlace: Efficient analytical placement using cell shifting, iterative local refinement and a hybrid net model," in Proc. ACM Int. Symp. Physical Design, 2004, pp. 26–33.
[11] B. Halpin, C. Y. R. Chen, and N. Sehgal, "Timing driven placement using physical net constraints," in Proc. IEEE/ACM Design Automation Conf., 2001, pp. 780–783.
[12] R.-S. Tsay and J. Koehl, "An analytic net weighting approach for performance optimization in circuit placement," in Proc. IEEE/ACM Design Automation Conf., 1991, pp. 620–625.
[13] X. Yang, B.-K. Choi, and M. Sarrafzadeh, "Timing-driven placement using design hierarchy guided constraint generation," in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, 2002, pp. 177–180.
[14] K. Rajagopal, T. Shaked, Y. Parasuram, T. Cao, A. Chowdhary, and B. Halpin, "Timing driven force directed placement with physical net constraints," in Proc. Int. Symp. Physical Design, Apr. 2003, pp. 60–66.
[15] H. Ren, D. Z. Pan, and D. Kung, "Sensitivity guided net weighting for placement driven synthesis," in Proc. Int. Symp. Physical Design, Apr. 2004, pp. 10–17.
[16] T. Kong, "A novel net weighting algorithm for timing-driven placement," in Proc. Int. Conf. Computer-Aided Design, 2002, pp. 172–176.
[17] D. Brand, R. F. Damiano, L. P. P. P. van Ginneken, and A. D. Drumm, "In the driver's seat of BooleDozer," in Proc. Int. Conf. Computer Design (ICCD), 1994, pp. 518–521.
[18] L. Stok, D. S. Kung, D. Brand, A. D. Drumm, L. N. Reddy, N. Hieter, D. J. Geiger, H. H. Chao, P. J. Osler, and A. J. Sullivan, "BooleDozer: Logic synthesis for ASICs," IBM J. Res. Dev., vol. 40, no. 4, pp. 407–430, 1996.
[19] W. Donath, P. Kudva, L. Stok, P. Villarrubia, L. Reddy, A. Sullivan, and K. Chakraborty, "Transformational placement and synthesis," in Proc. Design, Automation and Test in Europe, Mar. 2000.
[20] S. K. Karandikar, C. J. Alpert, M. C. Yildiz, P. G. Villarrubia, S. T. Quay, and T. Mahmud, "Fast electrical correction using resizing and buffering," in Proc. Asia and South Pacific Design Automation Conf., 2007.
[21] P. J. Osler, "Placement driven synthesis case studies on two sets of two chips: Hierarchical and flat," in Proc. ACM Int. Symp. Physical Design, 2004, pp. 190–197.
[22] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar, "Multilevel hypergraph partitioning: Application in VLSI domain," in Proc. ACM/IEEE Design Automation Conf., 1997, pp. 526–529.
[23] G.-J. Nam, S. Reda, C. Alpert, P. Villarrubia, and A. Kahng, "A fast hierarchical quadratic placement algorithm," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 25, no. 4, Apr. 2006.
[24] L. P. P. P. van Ginneken, "Buffer placement in distributed RC-tree networks for minimal Elmore delay," in Proc. IEEE Int. Symp. Circuits and Systems, May 1990, pp. 865–868.
[25] Z. Li, C. N. Sze, C. J. Alpert, J. Hu, and W. Shi, "Making fast buffer insertion even faster via approximation techniques," in Proc. Asia and South Pacific Design Automation Conf., 2005, pp. 13–18.
[26] S. Hu, C. J. Alpert, J. Hu, S. K. Karandikar, Z. Li, W. Shi, and C. N. Sze, "Fast algorithms for slew constrained minimum cost buffering," in Proc. ACM/IEEE Design Automation Conf., 2006, pp. 308–313.
[27] C. J. Alpert, M. Hrkic, J. Hu, and S. T. Quay, "Fast and flexible buffer trees that navigate the physical layout environment," in Proc. ACM/IEEE Design Automation Conf., 2004, pp. 24–29.
[28] H. Ren, D. Z. Pan, C. J. Alpert, and P. Villarrubia, "Diffusion-based placement migration," in Proc. Design Automation Conf., 2005, pp. 515–520.
[29] W.-J. Sun and C. Sechen, "Efficient and effective placement for very large circuits," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 14, no. 5, pp. 349–359, May 1995.
[30] C. J. Alpert, J.-H. Huang, and A. B. Kahng, "Multilevel circuit partitioning," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 17, no. 8, pp. 655–667, Aug. 1998.
[31] A. E. Caldwell, A. B. Kahng, and I. L. Markov, "Can recursive bisection alone produce routable placements?" in Proc. Design Automation Conf., 2000, pp. 477–482.
[32] A. Agnihotri, M. C. Yildiz, A. Khatkhate, A. Mathur, S. Ono, and P. H. Madden, "Fractional cut: Improved recursive bisection placement," in Proc. Int. Conf. Computer-Aided Design, 2003, pp. 307–310.
[33] M. Wang, X. Yang, and M. Sarrafzadeh, "Dragon2000: Standard-cell placement tool for large industry circuits," in Proc. Int. Conf. Computer-Aided Design, 2000, pp. 260–263.
[34] H. Eisenmann and F. M. Johannes, "Generic global placement and floorplanning," in Proc. ACM/IEEE Design Automation Conf., 1998, pp. 269–274.
[35] P. Spindler and F. M. Johannes, "Fast and robust quadratic placement combined with an exact linear net model," presented at the IEEE/ACM Int. Conf. Computer-Aided Design, San Jose, CA, 2006.
[36] J. Vygen, "Algorithms for large-scale flat placement," in Proc. ACM/IEEE Design Automation Conf., 1997, pp. 746–751.
[37] D.-H. Huang and A. B. Kahng, "Partitioning based standard cell global placement with an exact objective," in Proc. ACM Int. Symp. Physical Design, 1997, pp. 18–25.
[38] C. J. Alpert and A. B. Kahng, "Recent developments in netlist partitioning: A survey," Integr. VLSI J., vol. 19, pp. 1–81, 1995.
[39] C. J. Alpert, G. Gandham, M. Hrkic, J. Hu, A. B. Kahng, J. Lillis, B. Liu, S. T. Quay, S. S. Sapatnekar, and A. J. Sullivan, "Buffered Steiner trees for difficult instances," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 21, no. 1, pp. 3–14, Jan. 2002.
[40] J. Cong, A. Kahng, and K. Leung, "Efficient algorithm for the minimum shortest path Steiner arborescence problem with application to VLSI physical design," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 17, no. 1, pp. 24–38, Jan. 1998.
[41] J. Lillis, C. K. Cheng, and T. Y. Lin, "Optimal wire sizing and buffer insertion for low power and a generalized delay model," IEEE J. Solid-State Circuits, vol. 31, no. 3, pp. 437–447, Mar. 1996.
[42] C. J. Alpert, A. Devgan, and S. T. Quay, "Buffer insertion for noise and delay optimization," in Proc. ACM/IEEE Design Automation Conf., 1998, pp. 362–367.
[43] C. J. Alpert, A. Devgan, and S. T. Quay, "Buffer insertion with accurate gate and interconnect delay computation," in Proc. ACM/IEEE Design Automation Conf., 1999, pp. 479–484.
[44] W. Shi and Z. Li, "An O(n log n) time algorithm for optimal buffer insertion," in Proc. IEEE/ACM Design Automation Conf., 2003, pp. 580–585.
[45] W. Shi, Z. Li, and C. J. Alpert, "Complexity analysis and speedup techniques for optimal buffer insertion with minimum cost," in Proc. Asia and South Pacific Design Automation Conf., 2004, pp. 609–614.
[46] C. J. Alpert, R. G. Gandham, J. L. Neves, and S. T. Quay, "Buffer library selection," in Proc. Int. Conf. Computer Design (ICCD), 2000, pp. 221–226.
[47] J. Lillis, C. K. Cheng, and T.-T. Y. Lin, "Optimal wire sizing and buffer insertion for low power and a generalized delay model," IEEE J. Solid-State Circuits, vol. 31, no. 3, pp. 437–447, Mar. 1996.
[48] C. Kashyap, C. Alpert, F. Liu, and A. Devgan, "Closed form expressions for extending step delay and slew metrics to ramp inputs," in Proc. Int. Symp. Physical Design (ISPD), 2003, pp. 24–31.
[49] H. Bakoglu, Circuits, Interconnects, and Packaging for VLSI. Reading, MA: Addison-Wesley, 1990.
[50] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design. Reading, MA: Addison-Wesley, 1993, pp. 221–223.
[51] M. Hrkic and J. Lillis, "S-tree: A technique for buffered routing tree synthesis," in Proc. ACM/IEEE Design Automation Conf., 2002, pp. 578–583.
[52] X. Tang, R. Tian, H. Xiang, and D. F. Wong, "A new algorithm for routing tree construction with buffer insertion and wire sizing under obstacle constraints," in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, 2001, pp. 49–56.
[53] C. J. Alpert, J. Hu, S. S. Sapatnekar, and C. N. Sze, "Accurate estimation of global buffer delay within a floorplan," in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, 2004, pp. 706–711.
[54] U. Brenner, A. Pauli, and J. Vygen, "Almost optimum placement legalization by minimum cost flow and dynamic programming," in Proc. Int. Symp. Physical Design, 2004, pp. 2–9.
[55] S. W. Hur and J. Lillis, "Mongrel: Hybrid techniques for standard cell placement," in Proc. Int. Conf. Computer-Aided Design, 2000, pp. 165–170.
[56] A. B. Kahng, P. Tucker, and A. Zelikovsky, "Optimization of linear placements for wirelength minimization with free sites," in Proc. Asia and South Pacific Design Automation Conf., 1999, pp. 18–21.
[57] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C++. Cambridge, U.K.: Cambridge Univ. Press, 2002.
[58] H. Ren, D. Pan, C. Alpert, G.-J. Nam, and P. G. Villarrubia, "Hippocrates: First-do-no-harm detailed placement," presented at the Asia and South Pacific Design Automation Conf., Yokohama, Japan, 2007.
ABOUT THE AUTHORS
Charles J. Alpert (Fellow, IEEE) received the B.S.
degree in math and computational sciences and
the B.A. degree in history from Stanford Univer-
sity, Stanford, CA, in 1991 and the Ph.D. degree
in computer science from the University of
California, Los Angeles (UCLA), in 1996.
He currently works as a Research Staff Member
at the IBM Austin Research Laboratory, Austin, TX,
where he serves as the technical lead for the
design tools group. He has over 80 conference and
journal publications. His research centers upon innovation in physical
synthesis optimization.
Dr. Alpert has thrice received the Best Paper Award from the ACM/
IEEE Design Automation Conference. He has served as the general chair
and the technical program chair for the Tau Workshop on Timing Issues
in the Specification and Synthesis of Digital Systems and the Interna-
tional Symposium on Physical Design. He also serves as an Associate
Editor of IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN. For his work in
mentoring SRC funded research, he received the Mahboob Khan Mentor
Award in 2001.
Shrirang K. Karandikar received the B.E. degree
from the University of Pune, Pune, India, in 1994,
the M.S. degree from Clarkson University,
Potsdam, NY, in 1996, and the Ph.D. degree from
the University of Minnesota, Minneapolis, in 2004.
He worked with Intel’s Logic and Validation
Technology group from 1997 to 1999, and is
currently a Research Staff Member at the IBM Austin
Research Laboratory. His current interests are in
the areas of logic synthesis and physical design of
VLSI systems.
Zhuo Li (Member, IEEE) received the B.S. and M.S.
degrees in electrical engineering from Xi’an
Jiaotong University, Xi’an, China, and the Ph.D.
degree in computer engineering from Texas A&M
University, College Station, in 1998, 2001, and
2005, respectively.
From 2005 to 2006, he was with Pextra
Corporation, College Station as a Cofounder and
Senior Technical Staff working on VLSI extraction
tools development. He is currently with IBM
Austin Research Laboratory, Austin, TX. His research interests include