IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED
CIRCUITS AND SYSTEMS, VOL. 37, NO. 4, APRIL 2018 845
Detailed-Placement-Enabled Dynamic Power Optimization of
Multitier Gate-Level Monolithic 3-D ICs
Sheng-En David Lin, Student Member, IEEE, and
Dae Hyun Kim, Member, IEEE
Abstract—Monolithic 3-D integration is expected to
provide a significantly higher degree of device density than
through-silicon-via-based 3-D integration due mainly to its
nano-scale intertier connections. By stacking more than two device
layers (multitier) within a 3-D chip, further wirelength reduction
could be achieved, which can lead to additional performance and
power benefits. In this paper, we propose a detailed placement
algorithm called nonuniform-scaling-based placement to optimize the
dynamic power consumption of multitier gate-level monolithic 3-D
ICs. We also introduce delay- and length-based timing constraints
to prevent potential degradation of the performance metric
during placement. Under the same timing constraints, our
algorithm reduces dynamic power consumption more effectively than
the uniform-scaling-based placement algorithm by 2% to 14%.
Index Terms—3-D IC, monolithic, multitier,
power optimization.
I. INTRODUCTION
MONOLITHIC 3-D integration is emerging as an improved way for 3-D
stacking to increase device density further [1]. A monolithic
interlayer via (MIV) used for an intertier electrical connection is
much smaller than a through-silicon via (TSV) as shown in Fig. 1,
so monolithic 3-D integration is almost free from area and
capacitance overhead, whereas TSV-based 3-D integration suffers
from area and capacitance overhead [2], [3]. Thus, monolithic 3-D
integration is expected to enable the highest degree of
wirelength reduction, performance improvement, power reduction,
and intertier bandwidth improvement.
The wirelength reduction of monolithic 3-D integration could be
converted into both power and performance benefits. For example,
Panth et al. [4] used a design methodology that uniformly scales the
cell locations of a high-quality 2-D placement result by a constant
ratio $1/\sqrt{2}$ to generate a two-tier monolithic 3-D placement.
The location scaling generates
Manuscript received October 7, 2016; revised February 8, 2017
and May 12, 2017; accepted June 28, 2017. Date of publication July
19, 2017; date of current version March 29, 2018. This work was
supported in part by the New Faculty Seed Grant 125679-002 funded by
Washington State University and in part by the DARPA Young Faculty
Award D16AP00119 funded by the Defense Advanced Research Projects
Agency. This paper was recommended by Associate Editor Y. P. Liu.
(Corresponding author: Dae Hyun Kim.)
The authors are with the School of Electrical Engineering and
Computer Science, Washington State University, Pullman, WA 99164 USA
(e-mail: [email protected]; [email protected]).
Color versions of one or more of the figures in this paper are
available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCAD.2017.2729401
Fig. 1. Multitier monolithic 3-D integration.
overlaps among the cells, so the authors use a
partitioning algorithm to distribute the cells into different tiers
to remove the overlaps. This methodology, namely a
uniform-scaling-based placement algorithm (USBP), scales down the
length of each net by almost the same ratio as the constant
scaling ratio, thereby reducing wirelength and dynamic power
consumption. However, the reduced wirelength cannot be
directly converted into a higher clock frequency if gate delay and
pin capacitance dominate the critical path delay or the design has a
potential power-density problem that will lead to a thermal problem.
In this case, we can reduce the dynamic power consumption further by
converting the total delay gain obtained from wirelength reduction
into power saving. In this paper, we propose a 3-D detailed
placement algorithm, namely a nonuniform-scaling-based placement
algorithm (NUSBP), that provides more dynamic power reduction than
the USBP algorithm under the same operating frequency.
It is expected that multitier 3-D integration will provide more
benefits than two-tier 3-D integration [5], and some monolithic 3-D
integration technologies can fabricate multiple device layers in a
single package [6]. Thus, we also generate multitier monolithic
3-D placement results using the USBP and NUSBP algorithms and
compare the quality of the algorithms for multitier monolithic 3-D
ICs.
0278-0070 © 2017 IEEE. Personal use is permitted, but
republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html
for more information.
Our contributions in this paper are as follows.
1) We develop a 3-D detailed placement algorithm that
optimizes dynamic power consumption of monolithic 3-D ICs more
effectively than the USBP.
2) We present theoretical background on the minimization of
dynamic power consumption in monolithic 3-D ICs.
3) We develop length-based and delay-based timing constraint
algorithms to prevent aggressive cell movements that could degrade
the quality of the monolithic 3-D ICs.
4) We apply the uniform-scaling-based and
nonuniform-scaling-based 3-D placement algorithms to multitier
monolithic 3-D IC design and present their results with
detailed analyses.
The rest of this paper is organized as follows. We review the
previous work on the design of monolithic 3-D ICs in Section II. In
Section III, we present theoretical background on dynamic power
consumption and timing constraints. In Section IV, we propose a
dynamic power optimization algorithm in detail. Then, we present
and analyze simulation results in Section V. Finally, we conclude in
Section VI.
II. PRELIMINARIES AND RELATED WORK
In this section, we briefly review three monolithic 3-D IC design
methodologies presented in the literature and discuss the
uniform-scaling-based 3-D placement algorithm.
A. Design Methodologies for Monolithic 3-D ICs
Monolithic 3-D ICs can be designed at several different design
levels. The most fine-grained design style is the transistor-level
monolithic integration (TMI) proposed in [7]. In TMI, the nMOS and
pMOS transistors of each standard cell are placed in different tiers,
e.g., the nMOS and pMOS transistors are placed in the top and
bottom tiers, respectively. In this case, MIVs are used for both
intracell and intercell 3-D connections. TMI reduces the footprint
area of each standard cell almost by half, but it overuses MIVs for
intracell 3-D routing, which increases routing complexity and can
lead to unroutable designs. Block-level monolithic integration
proposed in [8] is another monolithic 3-D IC design methodology in
which each 2-D functional block is designed with 2-D standard cells
and all the blocks are placed in 3-D using a 3-D floorplanner.
Thus, nMOS and pMOS transistors are placed in both bottom and top
tiers and MIVs are inserted into whitespace between the blocks.
Gate-level monolithic integration (GMI) proposed in [7] places 2-D
standard cells in 3-D. GMI can reuse existing 2-D standard cells and
timing/power libraries. In addition, a design methodology using 2-D
placement tools was proposed in [4] for the design of gate-level
monolithic 3-D ICs and achieved almost 20% wirelength reduction and
16% power reduction. Thus, GMI is a promising design methodology
for monolithic 3-D IC design with respect to the design effort
and the quality (wirelength, timing, and power) of 3-D ICs.
B. Uniform-Scaling-Based 3-D Global Placement
The 3-D global placement algorithm presented in [4] works as
follows. First, the authors determine a downscaling ratio $s$ based
on the ratio between the width ($w_{\text{2-D}}$) of the 2-D layout
and the width ($w_{\text{3-D}}$) of a target 3-D layout of the design
($s = w_{\text{3-D}}/w_{\text{2-D}}$).$^1$ Then, they shrink the size
of each cell in a given standard cell library by the scaling ratio $s$
and place the cells in 2-D using a commercial tool. By changing the
library set from the downscaled one to the original one after the
placement, the authors obtain a layout in which the cells overlap with
each other. The overlaps are removed by partitioning, which also
automatically converts the 2-D layout into a 3-D layout. The
whole process of downscaling the cell size, placing cells
in 2-D, and restoring the original cell size is very similar to
placing the cells first with the original standard cell library
and then scaling the locations of the cells uniformly by the
same downscaling ratio $s$. Thus, we call this approach USBP.
TABLE I: VARIABLES USED IN THIS PAPER
USBP reduces the length of each net almost by the downscaling
ratio $s$, so the dynamic power consumption and the delay of each net
are also reduced. However, the reduced net delay cannot be directly
converted into a higher clock frequency if the delay of the critical
path is primarily due to gate delay and pin capacitance. Thus, USBP
can easily reduce the dynamic power consumption, but cannot
guarantee that it can increase the clock frequency. However, we can
convert the increased timing margin into further dynamic power
reduction. In this paper, therefore, we propose an algorithm to
convert the reduced net delay in noncritical paths into power
reduction by a detailed placement algorithm, which we call NUSBP.
III. DYNAMIC POWER REDUCTION IN GATE-LEVEL MONOLITHIC 3-D ICS
In this section, we analyze dynamic power consumption in
monolithic 3-D ICs and investigate how we can reduce dynamic power
consumption further. We also discuss timing constraints we take into
account during dynamic power optimization. Table I shows the
variables used in this paper and their meanings.
A. Power Reduction by Uniform Scaling
Dynamic power consumption is estimated by the following
well-known formula:
$$P_{int} = \sum_{i \in N} \alpha_i \cdot f_{clk} \cdot \left( C_{w,i} + C_{p,i} \right) \cdot V_{DD}^2 \tag{1}$$
where $N$ is the set of all the nets in the design and we are
breaking down the capacitance into two capacitive components,
wire capacitance and input pin capacitance of each net.
$^1$Assuming both the 2-D and 3-D layouts have the same total
silicon area, $w_{\text{3-D}}$ is $w_{\text{2-D}}/\sqrt{N_T}$.
TABLE II: IDEAL BENEFITS OBTAINED BY MONOLITHIC
3-D INTEGRATION AND USBP
Assuming the 2-D layout and the target 3-D layout have the same
total silicon area, the scaling factor that the USBP algorithm
uses becomes $1/\sqrt{N_T}$. Thus, the USBP algorithm ideally
scales the length of each wire by a factor of $1/\sqrt{N_T}$, and
this reduction is converted into delay and power reduction. Table II
shows the ideal benefits we can obtain by USBP.
Since the switching activity of each net, clock frequency, and
the supply voltage are constants, we can reduce the dynamic power
consumption by reducing the wire capacitance and/or the input pin
capacitance as shown in (1). Reducing wire capacitance requires
wirelength reduction, routing layer reassignment, wire spreading,
and so on. Reducing input pin capacitance requires gate sizing
(downsizing in most cases).
B. Conversion of Delay Benefit Into Power Reduction
As shown in Table II, the USBP algorithm reduces both net delay
and dynamic power consumption by wirelength reduction. As
explained in Section II-B, however, increasing the clock frequency
in monolithic 3-D ICs is not always possible or desirable. In this
case, we can adjust the cell locations to convert the delay benefit
into further power reduction as shown below.
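Fig. 2 shows an example in which three cells are connected
through two nets. Assuming that the switching activities of
Net 1 and Net 2 in the figure are $\alpha_1$ and $\alpha_2$,
respectively, the power consumption before uniform scaling is
$$P_{before} = f_{clk} \cdot V_{DD}^2 \cdot \left( \alpha_1 \left( C_{w,1} + C_{p,1} \right) + \alpha_2 \left( C_{w,2} + C_{p,2} \right) \right) \tag{2}$$
and the power consumption after uniform scaling is
$$P_{after} = f_{clk} \cdot V_{DD}^2 \cdot \left( \alpha_1 \left( \frac{C_{w,1}}{\sqrt{N_T}} + C_{p,1} \right) + \alpha_2 \left( \frac{C_{w,2}}{\sqrt{N_T}} + C_{p,2} \right) \right). \tag{3}$$
Thus, the power benefit ($\Delta P = P_{before} - P_{after}$)
obtainable from USBP is
$$\Delta P = f_{clk} \cdot V_{DD}^2 \cdot \left( \alpha_1 \cdot C_{w,1} + \alpha_2 \cdot C_{w,2} \right) \cdot \left( 1 - \frac{1}{\sqrt{N_T}} \right). \tag{4}$$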
However, we can reduce the power consumption further by moving
the cells. For instance, if $\alpha_1$ is greater than $\alpha_2$,
moving Cell 2 closer to Cell 1 along Net 1 will reduce the
power consumption. Suppose Cell 2 is moved toward Cell 1 by
$x$ (μm) after the uniform scaling ($x > 0$). Then, the power
consumption after the movement is
$$P'_{after} = f_{clk} \cdot V_{DD}^2 \cdot \left( \alpha_1 \left( \frac{C_{w,1}}{\sqrt{N_T}} - c_u \cdot x + C_{p,1} \right) + \alpha_2 \left( \frac{C_{w,2}}{\sqrt{N_T}} + c_u \cdot x + C_{p,2} \right) \right). \tag{5}$$
Fig. 2. USBP and NUSBP. (a) Before uniform scaling. (b) After
uniform scaling (scaling factor: $\sqrt{N_T}$). (c) After nonuniform
scaling.
Then, the new power benefit ($\Delta P' = P_{before} - P'_{after}$)
becomes
$$\Delta P' = f_{clk} \cdot V_{DD}^2 \cdot \left( \alpha_1 \cdot C_{w,1} + \alpha_2 \cdot C_{w,2} \right) \cdot \left( 1 - \frac{1}{\sqrt{N_T}} \right) + f_{clk} \cdot V_{DD}^2 \cdot c_u \cdot x \cdot (\alpha_1 - \alpha_2) \tag{6}$$
where $c_u$ is the capacitance per micrometer for the nets. The
second term in (6) is positive because we assume that $\alpha_1$ is
greater than $\alpha_2$. Thus, the power benefit goes up further by
moving Cell 2 closer to Cell 1 in this case.
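The relation between (2), (5), and (6) can be checked numerically.
The Python sketch below computes the benefit once directly as
$P_{before} - P'_{after}$ and once with the closed form (6). All
parameter values, the dictionary `args`, and the function names are
hypothetical and serve only as an illustration.

```python
import math

def p_before(a1, a2, cw1, cw2, cp1, cp2, f_clk, v_dd):
    """Eq. (2): power of the two nets in the 2-D layout."""
    return f_clk * v_dd ** 2 * (a1 * (cw1 + cp1) + a2 * (cw2 + cp2))

def p_after_move(a1, a2, cw1, cw2, cp1, cp2, f_clk, v_dd, n_tiers, c_u, x):
    """Eq. (5): power after uniform scaling and moving Cell 2 by x toward Cell 1."""
    s = 1.0 / math.sqrt(n_tiers)
    return f_clk * v_dd ** 2 * (a1 * (cw1 * s - c_u * x + cp1)
                                + a2 * (cw2 * s + c_u * x + cp2))

def benefit_eq6(a1, a2, cw1, cw2, f_clk, v_dd, n_tiers, c_u, x):
    """Eq. (6): closed-form power benefit; x = 0 gives Eq. (4)."""
    s = 1.0 / math.sqrt(n_tiers)
    return f_clk * v_dd ** 2 * ((a1 * cw1 + a2 * cw2) * (1 - s)
                                + c_u * x * (a1 - a2))

# Hypothetical values: capacitances in farads, c_u in F/um, x in um.
args = dict(a1=0.3, a2=0.1, cw1=10e-15, cw2=10e-15, f_clk=1e9, v_dd=1.0)
direct = (p_before(cp1=1e-15, cp2=1e-15, **args)
          - p_after_move(cp1=1e-15, cp2=1e-15, n_tiers=2,
                         c_u=0.2e-15, x=5.0, **args))
closed = benefit_eq6(n_tiers=2, c_u=0.2e-15, x=5.0, **args)
uniform_only = benefit_eq6(n_tiers=2, c_u=0.2e-15, x=0.0, **args)
```

Here `direct` and `closed` agree up to rounding, and `closed` exceeds
`uniform_only` because $\alpha_1 > \alpha_2$ in the assumed values.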
This post-scaling adjustment of cell locations can be performed
in three different ways. First, the location of each cell is scaled
with its own scaling ratio as follows:
$$(x_i, y_i) \rightarrow \left( \left( \frac{1}{\sqrt{N_T}} + s_{x,i} \right) \cdot x_i, \left( \frac{1}{\sqrt{N_T}} + s_{y,i} \right) \cdot y_i \right) \tag{7}$$
where $(x_i, y_i)$ is the location of Cell $i$, and $s_{x,i}$ and
$s_{y,i}$ are small variations for the x- and y-coordinate scaling
factors for Cell $i$, respectively. Second, the locations of all the
cells are uniformly scaled down with a constant scaling ratio
($1/\sqrt{N_T}$) and the locations are slightly adjusted as follows:
$$(x_i, y_i) \rightarrow \left( \frac{x_i}{\sqrt{N_T}}, \frac{y_i}{\sqrt{N_T}} \right) \rightarrow \left( \frac{x_i}{\sqrt{N_T}} + \delta_{x,i}, \frac{y_i}{\sqrt{N_T}} + \delta_{y,i} \right) \tag{8}$$
where $\delta_{x,i}$ and $\delta_{y,i}$ are small, post-scaling
displacements for Cell $i$. Third, the location of each cell is
adjusted and then the locations of all the cells are uniformly scaled
down by a constant scaling ratio ($1/\sqrt{N_T}$) as follows:
$$(x_i, y_i) \rightarrow \left( x_i + \delta'_{x,i}, y_i + \delta'_{y,i} \right) \rightarrow \left( \frac{x_i + \delta'_{x,i}}{\sqrt{N_T}}, \frac{y_i + \delta'_{y,i}}{\sqrt{N_T}} \right) \tag{9}$$
where $\delta'_{x,i}$ and $\delta'_{y,i}$ are small, prescaling
displacements for Cell $i$. All of these approaches can produce the
same result for suitably chosen variations, but we use the third
approach in this paper and call it NUSBP.
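The equivalence of the three formulations can be sketched in a few
lines of Python. The sketch assumes the correspondences
$\delta = \delta'/\sqrt{N_T}$ and $s = \delta/x$ (for a nonzero
coordinate), which follow from comparing (7), (8), and (9). The
function names and numbers are hypothetical.

```python
import math

def per_cell_scaling(x, y, sx, sy, n_tiers):
    """Eq. (7): each coordinate gets its own scaling ratio."""
    s = 1.0 / math.sqrt(n_tiers)
    return ((s + sx) * x, (s + sy) * y)

def scale_then_shift(x, y, dx, dy, n_tiers):
    """Eq. (8): uniform scaling followed by a post-scaling displacement."""
    s = 1.0 / math.sqrt(n_tiers)
    return (x * s + dx, y * s + dy)

def shift_then_scale(x, y, dxp, dyp, n_tiers):
    """Eq. (9): prescaling displacement followed by uniform scaling (NUSBP)."""
    s = 1.0 / math.sqrt(n_tiers)
    return ((x + dxp) * s, (y + dyp) * s)

# Hypothetical cell at (100, 60) um in a four-tier design.
n_tiers, x, y = 4, 100.0, 60.0
dxp, dyp = -8.0, 4.0                       # prescaling displacement
dx, dy = dxp / math.sqrt(n_tiers), dyp / math.sqrt(n_tiers)
sx, sy = dx / x, dy / y                    # requires nonzero x and y

third = shift_then_scale(x, y, dxp, dyp, n_tiers)
second = scale_then_shift(x, y, dx, dy, n_tiers)
first = per_cell_scaling(x, y, sx, sy, n_tiers)
```

All three calls return the same location, (46, 32) up to rounding, for
these assumed values.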
Although NUSBP reduces the power consumption further, we should
take two important constraints, timing and density constraints, into
account in the computation of $\delta'_{x,i}$ and $\delta'_{y,i}$.
The next section shows how we take the timing constraint
into account and Section IV-D explains how we handle the
density constraint.
C. Ideal Nonuniform Scaling Under Timing Constraints
In Fig. 2, let $d_{1,3}$ and $d''_{1,3}$ be the Elmore delays from
the output of Cell 1 to the input of Cell 3 for the 2-D case and after
nonuniform scaling, respectively. Then, $d_{1,3}$ is expressed as
follows:
$$d_{1,3} = R_1 \left( C_{w,1} + C_{p,2} \right) + R_{w,1} C_{p,2} + \frac{R_{w,1} C_{w,1}}{2} + R_2 \left( C_{w,2} + C_{p,3} \right) + R_{w,2} C_{p,3} + \frac{R_{w,2} C_{w,2}}{2} \tag{10}$$
where $R_1$ and $R_2$ are the drive resistances of Cell 1 and Cell 2,
respectively. If the distance between Cell 1 and Cell 2 in Fig. 2
becomes $S_1$ (μm) and that between Cell 2 and Cell 3 becomes
$S_2$ (μm) after nonuniform scaling, the difference between $d_{1,3}$
and $d''_{1,3}$ becomes
$$\Delta d'_{1,3} = \left( R_1 C_{w,1} + R_{w,1} C_{p,2} \right) \left( 1 - \frac{S_1}{L_1} \right) + \frac{R_{w,1} C_{w,1}}{2} \left( 1 - \frac{S_1^2}{L_1^2} \right) + \left( R_2 C_{w,2} + R_{w,2} C_{p,3} \right) \left( 1 - \frac{S_2}{L_2} \right) + \frac{R_{w,2} C_{w,2}}{2} \left( 1 - \frac{S_2^2}{L_2^2} \right). \tag{11}$$
Setting $\Delta d'_{1,3}$ to zero and solving it with the constraint
$S_1 + S_2 = (L_1 + L_2)/\sqrt{N_T}$ gives us the ranges of $S_1$ and
$S_2$ that do not degrade the delay from Cell 1 to Cell 3.
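One way to explore the feasible range numerically is sketched below
in Python. It evaluates (11) on a grid of $S_1$ values with
$S_2 = (L_1 + L_2)/\sqrt{N_T} - S_1$ and keeps the values for which
the delay gain is non-negative. The grid sampling is an illustrative
substitute for solving the equation in closed form, and the function
names and all electrical values are hypothetical.

```python
import math

def delta_delay(s1, s2, l1, l2, r1, r2, rw1, rw2, cw1, cw2, cp2, cp3):
    """Eq. (11): delay gain from Cell 1 to Cell 3 (2-D delay minus new delay)."""
    return ((r1 * cw1 + rw1 * cp2) * (1 - s1 / l1)
            + rw1 * cw1 / 2 * (1 - s1 ** 2 / l1 ** 2)
            + (r2 * cw2 + rw2 * cp3) * (1 - s2 / l2)
            + rw2 * cw2 / 2 * (1 - s2 ** 2 / l2 ** 2))

def feasible_s1_range(params, n_tiers, steps=1000):
    """Sample S1 in [0, total]; return the (min, max) S1 with non-negative gain."""
    total = (params["l1"] + params["l2"]) / math.sqrt(n_tiers)
    feasible = []
    for k in range(steps + 1):
        s1 = total * k / steps
        if delta_delay(s1, total - s1, **params) >= 0:
            feasible.append(s1)
    return (feasible[0], feasible[-1]) if feasible else None

# Hypothetical values: lengths in um, resistances in ohms, capacitances in farads.
params = dict(l1=100.0, l2=100.0, r1=1e3, r2=5e3, rw1=100.0, rw2=100.0,
              cw1=20e-15, cw2=20e-15, cp2=1e-15, cp3=1e-15)
lo, hi = feasible_s1_range(params, n_tiers=2)
```

Because (11) is concave in $S_1$ under the sum constraint, the
feasible values form a single interval, so reporting its two end
points is sufficient in this sketch.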
D. Delay-Based Timing Constraint
Applying the timing constraint presented in the previous section
to a target cell requires the routing topologies of all the nets
connected to the cell to obtain the net lengths. In this paper,
however, we use the half-perimeter wirelength (HPWL) to estimate the
length of each net for the following reasons. First, it might be too
costly to construct a routing topology for each net at this step.
Second, even if constructing a routing topology is not
time-consuming, we do not need to constrain the range of the target
locations of a target cell along a fixed routing topology at this
step. Thus, we modify the timing constraint from the
Elmore-delay-based constraint to the HPWL-based constraint for
Cell 2 in Fig. 2 as follows:
$$\left( \mathrm{HPWL}'_1 \right)^2 + \left( \mathrm{HPWL}'_2 \right)^2 \le \mathrm{HPWL}_1^2 + \mathrm{HPWL}_2^2 \tag{12}$$
where $\mathrm{HPWL}_i$ is the HPWL of net $i$, $\mathrm{HPWL}'_i$ is
the HPWL of net $i$ after moving the target cell and scaling, and
Net 1 and Net 2 are one of the input nets and one of the output
nets connected to the target cell, respectively. We call this a
delay-based timing constraint.
The delay-based timing constraint preserves the total delay of
two adjacent nets connected through a target cell. However, applying
this constraint is time-consuming because it should also consider
all the combinations of the input and output nets associated with
each target cell. For instance, suppose a target cell has $m$ input
pins and $n$ output pins. Each pin is connected to a net and the
driver of each net drives multiple cells. If we move the target
cell, the length of each net may change (depending on the location
of the target cell), so we should check the timing of all the cells
connected to the net. If the average number of cells connected to a
net is $r$ and the average input and output pin counts of a cell are
$m$ and $n$, respectively, moving a target cell requires application
of the delay-based timing constraint to $r \cdot m + (r - 1) \cdot n$
cells. The complexity of applying (12) to a cell is $O(mn)$, so
the complexity of applying the delay-based timing constraint
to moving a cell is $O(rmn(m + n))$. Since this is too
time-consuming, we developed a length-based timing constraint,
which is presented in the next section.
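A minimal Python sketch of the check in (12) is given below, applied
to every (input net, output net) pair of one cell, which is the
source of the $O(mn)$ cost per cell. The function names and the
representation of a net as an (old HPWL, new HPWL) tuple are
assumptions made for illustration.

```python
def satisfies_delay_constraint(hpwl1_new, hpwl2_new, hpwl1_old, hpwl2_old):
    """Eq. (12): the sum of squared HPWLs of a net pair must not increase."""
    return hpwl1_new ** 2 + hpwl2_new ** 2 <= hpwl1_old ** 2 + hpwl2_old ** 2

def check_all_pairs(in_nets, out_nets):
    """Apply Eq. (12) to every (input, output) net pair of one cell.

    Each net is an (old_hpwl, new_hpwl) tuple, where new_hpwl is the HPWL
    after moving the target cell and scaling. This takes O(m * n) checks.
    """
    return all(satisfies_delay_constraint(i_new, o_new, i_old, o_old)
               for i_old, i_new in in_nets
               for o_old, o_new in out_nets)

# Hypothetical HPWL values in um.
ok = check_all_pairs([(50.0, 30.0)], [(10.0, 40.0)])   # 900 + 1600 <= 2500 + 100
bad = check_all_pairs([(50.0, 30.0)], [(10.0, 45.0)])  # 900 + 2025 >  2500 + 100
```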
E. Length-Based Timing Constraint
Suppose the original HPWL of net $i$ is $\mathrm{HPWL}_i$ and that of
the net after nonuniform scaling is $\mathrm{HPWL}'_i$. Then, the
following constraint strictly preserves the delay of the net after
scaling:
$$\mathrm{HPWL}'_i \le \mathrm{HPWL}_i + \delta_i \tag{13}$$
where $\delta_i$ is a relaxation factor for net $i$ and is
empirically determined for a given process technology.
The new HPWL of net $i$ after nonuniform scaling is
$$\mathrm{HPWL}'_i = \frac{\mathrm{HPWL}_i + \Delta\mathrm{HPWL}_i}{\sqrt{N_T}} \tag{14}$$
so the substitution of (14) into inequality (13) gives
$$\Delta\mathrm{HPWL}_i \le \left( \sqrt{N_T} - 1 \right) \mathrm{HPWL}_i + \sqrt{N_T}\, \delta_i \tag{15}$$
which is a new length-based timing constraint with relaxation for
each net. Since the delay is proportional to the square of
the length of a net, a decreasing function such as the following
can be used for $\delta_i$:
$$\delta_i = \begin{cases} k \ (\mu\text{m}) & \text{if } \mathrm{HPWL}_i \le t \ (\mu\text{m}) \\ \dfrac{k}{\mathrm{HPWL}_i} + \dfrac{t - 1}{t} \cdot k \ (\mu\text{m}) & \text{if } \mathrm{HPWL}_i > t \ (\mu\text{m}) \end{cases} \tag{16}$$
where $t$ is a sufficiently small wirelength, such as 5 μm, and b and
$k$ are constants tuned for a given process technology by exhaustive
delay simulations. We call this a length-based timing constraint
with relaxation. If $\delta_i$ is zero, it is called a length-based
timing constraint.
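The relaxation factor in (16) and the resulting bound in (15) can be
sketched in Python as follows. The function names and the values of
$k$ and $t$ are hypothetical; in the text they are tuned per process
technology.

```python
import math

def relaxation(hpwl, k, t):
    """Eq. (16): relaxation factor delta_i in um for a net of length hpwl (um)."""
    if hpwl <= t:
        return k
    return k / hpwl + (t - 1) / t * k

def max_delta_hpwl(hpwl, n_tiers, k=0.0, t=5.0):
    """Eq. (15): largest prescaling HPWL increase allowed for one net.

    k = 0 gives the length-based constraint without relaxation.
    """
    root = math.sqrt(n_tiers)
    return (root - 1) * hpwl + root * relaxation(hpwl, k, t)

# Hypothetical net of 100 um in a four-tier design.
strict = max_delta_hpwl(100.0, n_tiers=4)              # no relaxation
relaxed = max_delta_hpwl(100.0, n_tiers=4, k=2.0, t=5.0)
```

Note that (16) is continuous at $\mathrm{HPWL}_i = t$: both branches
evaluate to $k$ there.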
IV. DYNAMIC POWER OPTIMIZATION ALGORITHMS
In this section, we present our algorithms for the minimization
of dynamic power consumption in gate-level monolithic 3-D ICs.
A. Overall Algorithm
Fig. 3 shows the overall design flow and the step in the green
box shows the proposed algorithm. For a given 2-D placement
result, the NUSBP algorithm adjusts the cell locations in the
layout and then uniformly scales the locations by $1/\sqrt{N_T}$ to
generate an $N_T$-tier monolithic 3-D IC layout. The objective is
to minimize the dynamic power consumption estimated by the following
formula:
$$P = f_{clk} \cdot V_{DD}^2 \cdot \sum_{i \in N} \left( \alpha_i \cdot \mathrm{HPWL}_i \right) \tag{17}$$
while satisfying either the delay- or length-based timing
constraints shown in Section III. The cell location adjustment is
sequentially applied to either each cell or a set of cells
(clusters) until there is no more noticeable improvement in the
power consumption. The following sections describe how to
find optimal locations for each cell or cluster and how to
integrate the timing and density constraints into the optimization
algorithm.
Fig. 3. Our 3-D IC design flow. The USBP skips the dynamic power
optimization step.
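The activity-weighted wirelength in (17) is straightforward to
evaluate from a placement. The Python sketch below assumes that each
cell is a point at its placement location and that a net is a list of
cell names; the data structures and names are hypothetical and are
not taken from the paper.

```python
import math

def hpwl(cells, placement):
    """Half-perimeter wirelength of one net from the cell locations."""
    xs = [placement[c][0] for c in cells]
    ys = [placement[c][1] for c in cells]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def weighted_wirelength(nets, activity, placement):
    """Sum of alpha_i * HPWL_i; Eq. (17) multiplies this by f_clk * V_DD^2."""
    return sum(activity[name] * hpwl(cells, placement)
               for name, cells in nets.items())

def uniform_scale(placement, n_tiers):
    """Scale every location by 1/sqrt(N_T), as in the final step of the flow."""
    s = 1.0 / math.sqrt(n_tiers)
    return {c: (x * s, y * s) for c, (x, y) in placement.items()}

# Hypothetical three-cell example, coordinates in um.
placement = {"A": (0.0, 0.0), "B": (30.0, 40.0), "C": (10.0, 100.0)}
nets = {"n1": ["A", "B"], "n2": ["B", "C"]}
activity = {"n1": 0.5, "n2": 0.1}

cost_2d = weighted_wirelength(nets, activity, placement)
cost_3d = weighted_wirelength(nets, activity, uniform_scale(placement, 4))
```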
B. Finding Optimal Locations
For each cell in a given 2-D placement result, we find an optimal
location that can minimize the sum of the dynamic power of all the
nets connected to the cell. The idea is to move the cell in a
direction that reduces the sum of the dynamic power.
The following theorem helps find optimal locations that minimize
the dynamic power consumption for a cell.
Theorem 1: For Cell A connected to $k$ nets ($n_1, \ldots, n_k$),
construct two bounding boxes, one ($B_{q,1}$) without Cell A and the
other ($B_{q,2}$) with Cell A, for $n_q$ ($1 \le q \le k$). Let $B_A$
be the set of all the bounding boxes,
$B_A = \{B_{1,1}, B_{1,2}, B_{2,1}, \ldots, B_{k,2}\}$, and
$EP_A$ be the set of all extremal points (four end points) of all the
bounding boxes in $B_A$. Let $T_A$ be the set of all
intersection points of all pairs of the bounding boxes in $B_A$.
Then: 1) the current location of Cell A is optimal or 2) there
exists at least one optimal point in $T_A \cup EP_A$ that minimizes
the sum of the dynamic power of all the nets connected to Cell A.
Proof: The objective function we minimize is
$\lambda = \sum_{i \in N} \alpha_i \cdot \mathrm{HPWL}_i$.
$\lambda$ is piecewise linear, so the optimal points minimizing
$\lambda$ exist: 1) inside or on the boundary of some rectangles;
2) on some intervals (segments); or 3) on some extremal points (the
endpoints of some intervals or rectangles) or intersection points.
Since rectangles and intervals include their endpoints, at least one
of the extremal points or the intersection points is optimal.
To apply the above theorem to finding optimal locations for a
given target cell, we use the following observations.
1) $\lambda = \sum_{i \in N} \alpha_i \cdot (\mathrm{HPWL}_{x,i} + \mathrm{HPWL}_{y,i})$,
where $\mathrm{HPWL}_{x,i}$ and $\mathrm{HPWL}_{y,i}$ are the x- and
y-components of the HPWL of net $i$.
2) $\mathrm{HPWL}_{x,i}$ and $\mathrm{HPWL}_{y,i}$ are independent of
each other, so we can optimize each of them separately (so we focus
only on the x-coordinates from this point).
3) Suppose net $i$ connects $k$ cells, $\{C_1, C_2, \ldots, C_k\}$,
and the x-coordinate of $C_j$ is $x_j$. If $C_t$ is the target cell to
move, $x_{min,i}$ and $x_{max,i}$ are defined as the minimum and the
maximum of $x_j$ ($j = 1, \ldots, k$, $j \ne t$), respectively. Then,
$\mathrm{HPWL}_{x,i}$ linearly decreases as the target cell is moved
in the positive direction from $-\infty$, stays constant between
$x_{min,i}$ and $x_{max,i}$, and then linearly increases as the cell
is moved toward $\infty$.
4) The x-coordinate of each intersection point in $T_A$ is an
extremal point of a rectangle in $B_A$.
Algorithm 1: Find an Optimal x-Coordinate for a Given Cell
Input: A given target cell A and a netlist N.
Output: An optimal x-coordinate of A.
    K = {n | n ∈ N, A ∈ n};
    B = {BoundingBox(n) = (x1, y1, x2, y2) | n ∈ K};
    P = {x1(b), x2(b) | b ∈ B} ∪ x(A);  // x(A) is the x-coordinate of A.
    Array X = Sort(P);
    l = 1, r = |X|;
    while l < r do
        m = (l + r)/2;
        CostL = Cost(X[l]);
        CostM = Cost(X[m]);
        if CostL ≤ CostM then
            r = m;
        else
            CostT = Cost(X[m − 1]);
            if CostT ≤ CostM then
                r = m − 1;
            else
                l = m;
            end if
        end if
    end while
    Return X[l];
In summary, the x-coordinate of an optimal point minimizing
$\lambda$ for a given cell is always an extremal point of one of the
bounding boxes in $B_A$ constructed for the cell. In addition,
if we move the cell from $-\infty$ to $\infty$, $\lambda$ linearly
decreases, then stays constant, then linearly increases. Thus,
instead of enumerating all the extremal and intersection points, we
first construct $B_A$, extract the x-coordinates of the left and right
extremal points of each rectangle in $B_A$, and sort the
coordinates in increasing order. Then, we find an optimal
coordinate by binary search. Algorithm 1 describes how to
find an optimal x-coordinate for a given target cell using the
binary search algorithm. The Cost(x) function for a given
x-coordinate $x$ computes the dynamic power consumption when the
cell is moved to $x$. The algorithm uses the observation that the
cost function is convex.
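The search over sorted candidate coordinates can be sketched in
Python as follows. This is not a line-by-line transcription of
Algorithm 1: it uses a common binary-search variant that compares the
costs of two adjacent candidates, which relies on the same convexity
observation. The data structures (a net as a list of cell names, a
placement as a dictionary of points) and all names are assumptions
made for illustration.

```python
def optimal_x(target, nets, activity, placement):
    """Return an x-coordinate for `target` minimizing sum(alpha * HPWL_x)."""
    # x-coordinates of the other cells on each net connected to the target.
    others = {}
    for name, cells in nets.items():
        if target in cells:
            others[name] = [placement[c][0] for c in cells if c != target]

    def cost(x):
        total = 0.0
        for name, xs in others.items():
            span = xs + [x]
            total += activity[name] * (max(span) - min(span))
        return total

    # Candidates: left/right bounding-box edges and the current location.
    candidates = {placement[target][0]}
    for xs in others.values():
        if xs:
            candidates.update((min(xs), max(xs)))
    X = sorted(candidates)

    lo, hi = 0, len(X) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if cost(X[mid]) <= cost(X[mid + 1]):
            hi = mid          # cost is non-decreasing to the right of mid
        else:
            lo = mid + 1      # cost is still decreasing at mid
    return X[lo]

# Hypothetical example in the spirit of Fig. 4: Net 1 = {A, C1, C2, C3},
# Net 2 = {A, C4, C5}; coordinates in um.
placement = {"A": (7, 0), "C1": (10, 0), "C2": (20, 0), "C3": (30, 0),
             "C4": (0, 0), "C5": (5, 0)}
nets = {"net1": ["A", "C1", "C2", "C3"], "net2": ["A", "C4", "C5"]}

x_high_net1 = optimal_x("A", nets, {"net1": 0.5, "net2": 0.1}, placement)
x_high_net2 = optimal_x("A", nets, {"net1": 0.1, "net2": 0.5}, placement)
```

In this assumed layout the target is pulled to the nearest edge of the
bounding box of whichever net has the higher switching activity.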
Fig. 4 shows an example. In the figure, Cell A is connected to
Net 1 ({A, C1, C2, C3}) and Net 2 ({A, C4, C5}). If the switching
activity $\alpha_1$ of Net 1 is greater than $\alpha_2$ of Net 2,
the optimal location for Cell A is $(x_1, [y_5, y_4])$, where
$[y_5, y_4]$ is the range from $y_5$ to $y_4$, i.e., any value greater
than or equal to $y_5$ and less than or equal to $y_4$. If the
x-coordinate $a_x$ of Cell A is greater than $x_1$, but less than
$x_2$, the HPWL of
Fig. 4. Two nets and their bounding boxes. Net 1 = {A, C1, C2, C3}.
Net 2 = {A, C4, C5}.
Net 1 is minimal, but the HPWL of Net 2 increases. If $a_x$ is less
than $x_1$, the HPWL of Net 1 increases, but that of Net 2 decreases.
Since $\alpha_1$ is greater than $\alpha_2$, the total power
consumption increases. We can easily verify that $(x_1, [y_5, y_4])$
is the optimal location for Cell A in this way. If $\alpha_2$ is
greater than $\alpha_1$, however, the optimal location for Cell A is
$(x_5, [y_5, y_4])$.
Notice that the proposed algorithm finds an optimal location
for each cell. To find optimal locations for all the cells in the
netlist, we sequentially choose a cell and move it. Thus, the order
of the cells we choose might affect the quality of the solution. In
this paper, we sort all the nets in decreasing order of the
switching activity and start from the cells connected to the
highest-activity net.
C. Timing Constraints
Optimal locations of some cells found by Theorem 1 might violate
the timing constraints explained in Section III. Thus, before we
move a cell to its optimal location, we check whether the move will
violate the timing constraints or not. If it violates the
constraints, we move the cell to the farthest location satisfying
the timing constraints from its current location, along and inside
the segment connecting the current and the optimal location.
If the timing constraint is the length-based timing constraint,
we compute the maximum value of $\Delta\mathrm{HPWL}$ satisfying
inequality (13) for each net connected to the target cell and
choose the smallest value among them. The computation of
$\Delta\mathrm{HPWL}$ is performed as follows. We first construct the
bounding box of net $i$ connected to the target cell and divide the
plane into nine regions as shown in Fig. 5. The target cell X is in
the center region (R5) and the dotted line shows the bounding box
of net $i$. Depending on the optimal location (red circles), we
split the line segment connecting the current and optimal locations
into multiple segments, one per region (we call the line
segment an optimal segment). Then, we compute $\Delta\mathrm{HPWL}$ in
each region separately. For example, if the target location is in
R5, $\Delta\mathrm{HPWL}$ is 0 even if we move the target cell between
its current and the optimal locations. If the target location is in
R2, however, $\Delta\mathrm{HPWL}$ is 0 if the target cell is moved
inside R5, but $\Delta y$ if it is moved in R2, where $\Delta y$ is
the distance between the new location of the target cell in R2 and
the y-coordinate of the upper horizontal line of the bounding box
of net $i$.
Starting from the region of the optimal location, we compute
$\Delta\mathrm{HPWL}$ satisfying inequality (13). If we find it, it is
the optimal location satisfying the length-based timing constraint
for net $i$. If we do not find it, however, we move to the
next farthest region along the optimal segment and compute
$\Delta\mathrm{HPWL}$ again. We repeat this process until we find
$\Delta\mathrm{HPWL}$ satisfying the inequality. Notice that the
maximum number of regions we should consider for each net is three,
which occurs when the optimal location is in R1, R3, R7, or R9 in
Fig. 5.
Fig. 5. Consideration of length-based timing constraints.
Fig. 6. Delay-based timing constraints.
If the timing constraint is the delay-based timing constraint, we
should consider not only each pair of the two adjacent nets
connected to the target cell, but also the cells adjacent to the
target cell. Fig. 6 shows an example. Suppose Cell 1 in the
figure is the target cell. If Cell 1 is moved, the lengths of
Net 1, Net 2, and Net 3 are changed. In this case, we apply
the delay-based timing constraint to the following net pairs,
(Net 1, Net 2) and (Net 1, Net 3). However, the changes of
the lengths of Net 1, Net 2, and Net 3 also affect the delay
values from Net 4 to Net 2, from Net 5 to Net 2, and from
Net 1 to Net 7. Thus, we should also apply the delay-based
timing constraint to the following net pairs, (Net 4, Net 2),
(Net 5, Net 2), and (Net 1, Net 7).
To apply the delay-based timing constraint to a pair of two nets,
Net 1 and Net 2, we first split the plane into at most 25
regions$^2$ and compute $\Delta\mathrm{HPWL}_1$ and
$\Delta\mathrm{HPWL}_2$ in each region that the optimal segment passes
through. Starting from the region of the optimal location, we compute
$\Delta\mathrm{HPWL}_1$ and $\Delta\mathrm{HPWL}_2$ minimizing
$\lambda$ and satisfying inequality (12). If we do not find a value
satisfying the inequality in the region, we proceed to the next
farthest region on the optimal segment in a similar way to the
length-based timing constraint case.
$^2$Four x- and four y-coordinates from the eight extremal points
of the two bounding boxes split the plane into 25 regions.
Fig. 7. Illustration of the clustering technique. The red nets
are high-activity nets.
D. Density Constraints
As mentioned in Section III, moving a cell to its optimal
location might increase the density of the layout area
around the optimal location. Thus, we need to control the layout
density efficiently during optimization. In this paper, we
predetermine a bin size, obtain the maximum bin density in a
given layout, and limit the density of each bin to be at most
the maximum bin density. If moving a cell to its optimal location
violates the density constraint of the bin, we find the next
farthest bin that does not violate the density constraint along
the optimal segment.
We satisfy the timing and density constraints for each move by
considering both at the same time. Thus, we guarantee that we never
violate the timing and density constraints during/after
the optimization.
E. Clustering
A problem we found in moving a cell individually to its optimal
location is that we cannot move any cell connected to a
high-activity net in some cases. For example, moving Cell A, Cell B,
and Cell C toward Cell D in Fig. 7 will reduce the dynamic power
consumption, but moving the three cells one by one is prohibited
because moving each one of them increases the dynamic power
consumption or leads to no power benefit. Thus, we cluster the cells
connected to high-activity nets and move the cells simultaneously to
reduce dynamic power consumption further in addition to moving each
cell individually. However, we only cluster the cells connected to a
net whose HPWL is less than a predetermined threshold value ($K$).
In fact, there exist some uncertainties in our optimization
methodology because our wirelength and power computation are based
on the HPWL. Thus, if $K$ is large, the uncertainty goes up and the
final power value does not accurately match our power computation.
If $K$ is too small, however, there exist only a few clusters that
can be moved by the clustering technique. Therefore, we empirically
determined $K$ based on simulations. Similar to the cell-based
optimization, the solution quality of the clustering-based
optimization depends on the order of the clusters we move. In this
paper, we sort the nets in decreasing order of their switching
activities and start from the highest-activity net. For each selected
net, we cluster the cells connected to the net into a single
supercell and apply the cell-based optimization algorithm to the
supercell. After we move the supercell, we flatten the supercell.
F. Complexity Analysis
The cell-based optimization finds an optimal location for each cell
and moves the cell to that location. Suppose a target cell is
connected to at most n nets, each of which connects at most c cells.
Then, the complexity of finding the two bounding boxes (one with the
cell included and the other without the cell) for each net is O(c),
and that of finding all bounding boxes of the target cell is O(cn).
The runtime of the linear sweep from −∞ to ∞ over the x-coordinates
of the bounding boxes is O(n). Thus, the complexity of finding an
optimal location for each target cell is O(cn), which is practically
O(1) because c and n are bounded for most of the cells and nets. For
an optimal location found by the above process, finding an optimal
location that satisfies the timing and density constraints takes a
constant amount of time, so the complexity of moving all the cells to
their optimal locations is O(C), where C is the total number of
cells. The cluster-based optimization also moves one cluster for each
net, so the complexity of finding an optimal location for a cluster
is also practically O(1). Thus, the complexity of moving all the
clusters to their optimal locations is O(N), where N is the total
number of nets. We iterate the cell- and cluster-based optimizations
only a few times, so the overall complexity of the NUSBP algorithm is
O(N + C).
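To make the sweep over the bounding-box x-coordinates concrete, the following Python sketch finds an x-position that minimizes the activity-weighted x-extent of the nets of one target cell. It is an illustration under stated assumptions, not the exact procedure of Section IV: each net is assumed to be given as the x-range `[lo, hi]` of its bounding box without the target cell plus its activity, and the function names `best_x` and `cost` are hypothetical. The sketch also sorts the breakpoints, so it runs in O(n log n) rather than O(n). The y-direction can be handled in the same way because the HPWL is separable in x and y.

```python
def cost(x, boxes):
    """Activity-weighted x-extent of all nets if the cell sits at x."""
    return sum(a * (max(hi, x) - min(lo, x)) for lo, hi, a in boxes)

def best_x(boxes):
    """Sweep the breakpoints left to right and stop where the slope turns >= 0.

    boxes: list of (lo, hi, alpha), the x-range of each net's bounding box
    without the target cell and the net's switching activity.
    """
    events = []
    for lo, hi, alpha in boxes:
        events.append((lo, alpha))   # left of lo the net pulls with slope -alpha
        events.append((hi, alpha))   # right of hi the net pushes with slope +alpha
    events.sort()
    slope = -sum(alpha for _, _, alpha in boxes)  # slope of cost at x = -infinity
    for x, alpha in events:
        slope += alpha
        if slope >= 0:               # piecewise-linear convex cost stops decreasing
            return x
    raise ValueError("no nets given")

boxes = [(0.0, 2.0, 0.9), (5.0, 7.0, 0.1)]
x_opt = best_x(boxes)
```

In this toy case the high-activity net dominates, so the cell is placed at the edge of that net's bounding box rather than between the two nets.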
V. SIMULATION RESULTS
In this section, we present our simulation results and detailed
analysis.
A. 3-D IC Design Flow and Simulation Setup
For NUSBP, we iterate the cell- and cluster-based optimization
multiple times until the power reduction saturates in the dynamic
power optimization step. We use the Nangate [9] 45 nm library for the
standard cell library, Synopsys Design Compiler for synthesis, and
Cadence Encounter for 2-D placement and legalization. We also use
Cadence Encounter to obtain the switching activity of each net by
propagating a constant activity at the primary pins. We use hMetis
[10] for the k-way partitioning to design k-tier monolithic 3-D ICs.
To obtain area-balanced placement results, we split a given layout
into a grid and sequentially apply hMetis to each bin of size 5r by
5r, where r is the height of a standard cell row. The bin size for
the density check is 20 µm by 20 µm. None of the results we obtained
by NUSBP in this section violated the delay and density constraints.
B. Comparison of Dynamic Power Consumption in Two-Tier Monolithic
3-D ICs
Table III shows the wirelength (∑HPWL) and dynamic power consumption
(∑α·HPWL) of 2-D and two-tier monolithic 3-D ICs designed by USBP
[denoted by 2-tier uniform (2TU)] and NUSBP [denoted by 2-tier
nonuniform (2TNU)] with different timing constraints. As the table
shows, the USBP algorithm reduces the dynamic power consumption by
roughly 29% compared to the 2-D placement result, and the NUSBP
algorithm reduces the dynamic power consumption by 31% to 35%, 32% to
37%, and 33% to 39% for the delay-based (-D), length-based (-L), and
length-based with relaxation (-LR) timing constraints, respectively,
compared to the 2-D placement result. In addition, the NUSBP
algorithm consistently outperforms the USBP algorithm for all the
benchmarks by 5% to 14%, 4% to 11%, and 3% to 8% for the -D, -L, and
-LR cases, respectively.

TABLE III
COMPARISON OF 2-D, k-TIER UNIFORM-SCALING-BASED, AND k-TIER
NONUNIFORM-SCALING-BASED PLACEMENT RESULTS USING DIFFERENT TIMING
CONSTRAINTS (-D: DELAY-BASED, -L: LENGTH-BASED, -LR: LENGTH-BASED
WITH RELAXATION). THE VALUES IN PARENTHESES SHOW THE RATIO BETWEEN
THE 3-D DESIGNS AND THE 2-D DESIGNS. FP IS THE FOOTPRINT AREA AND RT
IS THE RUNTIME
For a more detailed analysis, we show the difference between the
power consumption of 2TU and 2TNU-L for each net in Fig. 8(a). We
group the nets into switching-activity bins of width 0.001, compute
the sum of the dynamic power of the nets in each bin for 2TU and
2TNU-L, and plot the differences. In the figure, we observe that the
power reduction comes primarily from the high-activity nets, i.e.,
many high-activity (α > 0.8) nets in 2TNU-L are shorter than in 2TU.
Thus, we reduce the dynamic power consumption by shortening the
high-activity nets. However, some low-activity nets (α ∼ 0.5) in
2TNU-L have higher power consumption than those in 2TU. This is
unavoidable because the further power reduction in NUSBP comes from
making high-activity nets shorter and low-activity nets longer.
Fig. 8(b) shows the difference between the HPWL of 2TU and 2TNU-L for
each net. In the figure, we observe that the trend of the wirelength
difference is similar to the trend of the power reduction shown in
Fig. 8(a). In other words, the high-activity nets are shortened at
the cost of elongating the low-activity nets.
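The per-bin comparison behind Fig. 8(a) can be sketched in Python as follows. This is only an illustration: the function name `power_diff_by_activity_bin`, the input layout (one tuple of activity and the two HPWL values per net), and the sample numbers are assumptions made for the sketch, and power is represented by the α·HPWL proxy used in this paper. A positive value means that the nets in that bin consume less power in 2TNU-L than in 2TU.

```python
import math
from collections import defaultdict

def power_diff_by_activity_bin(nets, bin_width=0.001):
    """Sum alpha * (HPWL_2TU - HPWL_2TNU) over the nets of each activity bin.

    nets: iterable of (alpha, hpwl_2tu, hpwl_2tnu) tuples.
    Returns {bin_index: power difference}, where a bin covers
    [bin_index * bin_width, (bin_index + 1) * bin_width).
    """
    diff = defaultdict(float)
    for alpha, hpwl_u, hpwl_nu in nets:
        # The small tolerance guards against floating-point rounding at bin edges.
        b = math.floor(alpha / bin_width + 1e-9)
        diff[b] += alpha * (hpwl_u - hpwl_nu)
    return dict(diff)

# Made-up sample: two high-activity nets get shorter, one low-activity net longer.
sample = [(0.9005, 10.0, 6.0), (0.9007, 8.0, 5.0), (0.5004, 4.0, 7.0)]
diffs = power_diff_by_activity_bin(sample)
```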
Regarding the timing constraints, the length-based timing constraint
reduces the power consumption more effectively than the delay-based
timing constraint, as shown in Table III, mainly because the
delay-based timing constraint is tighter than the length-based timing
constraint. The following analysis shows the reason. Without loss of
generality, suppose α1 is greater than α2 in Fig. 2. The original
lengths of Net 1 and Net 2 are L1 and L2, respectively. Since α1 is
greater than α2, we move Cell 2 toward Cell 1 by δ. In the worst
case, the length of Net 1 after nonuniform scaling becomes
(L1 − δ)/√NT and that of Net 2 becomes (L2 + δ)/√NT. Substituting
these new lengths into inequality (12) gives the following:
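$$\left(\frac{L_1 - \delta}{\sqrt{N_T}}\right)^2 + \left(\frac{L_2 + \delta}{\sqrt{N_T}}\right)^2 \le L_1^2 + L_2^2 \tag{18}$$

Solving this inequality for δ yields

$$\delta_{\mathrm{MAX},d} = \frac{L_1 - L_2 + \sqrt{(L_1 - L_2)^2 + 2(N_T - 1)\left(L_1^2 + L_2^2\right)}}{2} \tag{19}$$

which is the maximum value of δ satisfying the delay-based timing
constraint. On the other hand, substituting the length of Net 2 into
inequality (12) gives the following:

$$\delta_{\mathrm{MAX},l} = \left(\sqrt{N_T} - 1\right) L_2 \tag{20}$$

which is the maximum value of δ satisfying the length-based timing
constraint.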
If L1 is much longer than L2, δMAX,d is greater than δMAX,l, so the
delay-based timing constraint is looser than the length-based timing
constraint. If L1 is much shorter than L2, however, δMAX,l is greater
than δMAX,d, so the length-based timing constraint is looser.
Although the length-based timing constraint led to lower power
consumption than the delay-based timing constraint in the benchmarks
we used, the delay-based timing constraint could lead to lower power
consumption if the former case (L1 is much longer than L2) dominates
the design.
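The two regimes can be checked numerically by evaluating (19) and (20) directly. The Python sketch below does only that; the function names and the sample lengths are chosen for illustration and carry no physical units.

```python
import math

def delta_max_delay(l1, l2, nt):
    """Maximum shift allowed by the delay-based constraint, equation (19)."""
    root = math.sqrt((l1 - l2) ** 2 + 2 * (nt - 1) * (l1 ** 2 + l2 ** 2))
    return (l1 - l2 + root) / 2

def delta_max_length(l2, nt):
    """Maximum shift allowed by the length-based constraint, equation (20)."""
    return (math.sqrt(nt) - 1) * l2

# Compare the two bounds for a long Net 1 and for a short Net 1 (two tiers).
for l1, l2 in [(10.0, 1.0), (1.0, 10.0)]:
    d = delta_max_delay(l1, l2, nt=2)
    l = delta_max_length(l2, nt=2)
    looser = "delay-based" if d > l else "length-based"
    print(f"L1={l1}, L2={l2}: dMAX,d={d:.3f}, dMAX,l={l:.3f}, looser: {looser}")
```

With L1 = 10 and L2 = 1 the delay-based bound is the larger one, and with the lengths swapped the length-based bound is larger, in agreement with the discussion above.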
Fig. 8. Comparison of (a) dynamic power consumption and (b) HPWL
between 2TNU-L and 2TU. The x-axis is the net activity. Benchmark:
LDPC.
C. Comparison of Dynamic Power Consumption in Multitier Monolithic
3-D ICs
Table III also shows that the multitier monolithic 3-D ICs reduce the
dynamic power consumption more effectively than the two-tier
monolithic 3-D ICs. The USBP algorithm reduces the dynamic power
consumption of the 2-D layout by 29%, 42%, and 50% in the two-,
three-, and four-tier designs, respectively. In addition, the NUSBP
algorithm outperforms the USBP algorithm by 2% to 14%, 2% to 13%, and
2% to 13% for the two-, three-, and four-tier designs, respectively.
Although the dynamic power consumption decreases monotonically as the
tier count goes up, the incremental reduction also shrinks. Thus, the
dynamic power reduction will eventually saturate even if more tiers
are stacked. This is due to the saturation in the wirelength
reduction, as shown in the same table and in previous work [5]. Since
the wirelength after uniform scaling is proportional to the scaling
ratio (1/√NT), the wirelength reduction saturates, which translates
into the saturation of the dynamic power reduction. However, the
NUSBP algorithm still consistently outperforms the USBP algorithm in
the multitier monolithic 3-D designs.
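The saturation follows directly from the scaling ratio. As a simple illustration, the Python sketch below evaluates the idealized reduction 1 − 1/√NT for increasing tier counts; the function name is hypothetical, and the model assumes that every net length scales exactly by 1/√NT, which ignores all placement and partitioning effects. For NT = 2, 3, and 4 this idealized model gives about 29%, 42%, and 50%, which agrees with the USBP reductions quoted above.

```python
import math

def uniform_scaling_reduction(num_tiers):
    """Idealized fractional length reduction, 1 - 1/sqrt(NT)."""
    return 1.0 - 1.0 / math.sqrt(num_tiers)

previous = 0.0
for nt in range(1, 7):
    reduction = uniform_scaling_reduction(nt)
    # The gain from adding one more tier shrinks as NT grows.
    print(f"NT={nt}: reduction={reduction:.1%}, "
          f"gain over NT-1={reduction - previous:.1%}")
    previous = reduction
```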
Regarding the runtime, applying the delay-based timing constraint
takes the longest runtime to optimize the designs because the
delay-based timing constraint is applied not only to the pairs of the
nets connected to a target cell, but also to some pairs of the nets
connected to the cells adjacent to the target cell. Thus, applying
the delay-based timing constraint requires more computation time than
the other timing constraints. On the other hand, the length-based
timing constraint with relaxation requires more computation than the
length-based timing constraint because the former is looser than the
latter, so the former searches more candidate points than the latter.

Fig. 9. Variation of the dynamic power consumption of the two-tier
LDPC design with cell- and cluster-based optimization interleaving
(L: cell-based, and C: cluster-based).
D. Effect of Interleaving Cell- and Cluster-Based Optimizations
If we apply only the cell- or cluster-based optimization repeatedly,
the dynamic power consumption decreases at the beginning, but
saturates eventually. Therefore, as explained in Section IV-E, we
alternate between the cell- and cluster-based optimizations to escape
from local minima and reduce the dynamic power consumption further.
In the NUSBP design flow, we run m iterations of the cell-based
optimization, then run n iterations of the cluster-based
optimization, and repeat this cycle until the difference between the
dynamic power consumption values before and after the optimization is
less than a predetermined number. We set both m and n to 10 in our
simulation.
Fig. 9 shows the variation of the dynamic power consumption of the
two-tier LDPC design, where L and C denote the cell- and
cluster-based optimizations, respectively. As the figure shows, the
dynamic power reduction saturates after four to six iterations in
each optimization mode. Once the first cell-based optimization
saturates, the cluster-based optimization reduces the dynamic power
consumption further by moving multiple cells connected to
high-activity nets at the same time. The dynamic power reduction in
the cluster-based optimization, however, also saturates. The
cluster-based optimization perturbs the placement, so switching back
to the cell-based optimization helps reduce the power consumption
again. As the number of iterations increases, the power consumption
eventually saturates as shown in Fig. 9.
E. Impact of the Density Control
As explained in Section IV-D, we control the density of each bin in
the layout during both the cell- and cluster-based dynamic power
optimization. Since it is not straightforward to estimate the
routability of a design at the placement stage, many placement papers
use some variants of density estimation and control schemes to
alleviate potential routing congestion. Fig. 10 shows the bin density
distribution maps for the DES benchmark without and with the density
constraint and control. The density constraint for each bin in the
design is set to 0.8. As shown in the figure, optimizing the design
without the density control violates the density constraint, and the
utilization of some of the bins is greater than 1.0. With the density
control, however, no bin violates the density constraint.

Fig. 10. Density distribution of the DES benchmark. (a) Without
density constraints and (b) with density constraints.
VI. CONCLUSION
In this paper, we proposed an NUSBP algorithm for dynamic power
optimization in multitier gate-level monolithic 3-D ICs. For each
cell, the algorithm finds an optimal location that minimizes the sum
of the dynamic power consumption of the nets connected to the cell
without violating the timing and density constraints. The simulation
results show that the algorithm outperforms the USBP algorithm by 2%
to 14% for two- to four-tier designs.
REFERENCES
[1] P. Batude et al., “Advances in 3D CMOS sequential integration,”
in Proc. IEEE Int. Electron Devices Meeting, Baltimore, MD, USA,
Sep. 2009, pp. 1–4.
[2] D. Henry et al., “Via first technology development based on high
aspect ratio trenches filled with doped polysilicon,” in Proc. IEEE
Electron. Compon. Technol. Conf., Reno, NV, USA, May 2007,
pp. 830–835.
[3] J. U. Knickerbocker et al., “Three-dimensional silicon
integration,” IBM J. Res. Develop., vol. 52, no. 6, pp. 553–569,
Nov. 2008.
[4] S. Panth, K. Samadi, Y. Du, and S. K. Lim, “Design and CAD
methodologies for low power gate-level monolithic 3D ICs,” in Proc.
Int. Symp. Low Power Electron. Design, Aug. 2014, pp. 171–176.
[5] D. H. Kim, S. Mukhopadhyay, and S. K. Lim, “TSV-aware
interconnect distribution models for prediction of delay and power
consumption of 3-D stacked ICs,” IEEE Trans. Comput.-Aided Design
Integr. Circuits Syst., vol. 33, no. 9, pp. 1384–1395, Sep. 2014.
[6] S. Bobba et al., “CELONCEL: Effective design technique for 3-D
monolithic integration targeting high performance integrated
circuits,” in Proc. Asia South Pac. Design Autom. Conf., Yokohama,
Japan, Jan. 2011, pp. 336–343.
[7] Y.-J. Lee and S. K. Lim, “Ultrahigh density logic designs using
monolithic 3-D integration,” IEEE Trans. Comput.-Aided Design Integr.
Circuits Syst., vol. 32, no. 12, pp. 1892–1905, Dec. 2013.
[8] S. Panth, K. Samadi, Y. Du, and S. K. Lim, “High-density
integration of functional modules using monolithic 3D-IC technology,”
in Proc. Asia South Pac. Design Autom. Conf., Yokohama, Japan,
Jan. 2013, pp. 681–686.
[9] (2011). Nangate 45nm Open Cell Library, Nangate, Santa Clara, CA,
USA. [Online]. Available: http://www.nangate.com
[10] G. Karypis and V. Kumar. hMETIS, A Hypergraph Partitioning
Package Version 1.5.3. Accessed on Nov. 15, 2014. [Online].
Available: http://glaros.dtc.umn.edu/gkhome/metis/hmetis/download
Sheng-En David Lin (S’16) received the B.S. degree in electrical
engineering from Washington State University, Pullman, WA, USA, in
2014, where he is currently pursuing the Ph.D. degree with the
Department of Electrical Engineering and Computer Science.
His current research interests include modeling for very large scale
integration (VLSI) circuits and systems and algorithms for VLSI CAD
automation with current focus on designing of monolithic 3-D ICs.

Dae Hyun Kim (S’08–M’12) received the B.S. degree in electrical
engineering from Seoul National University, Seoul, South Korea, in
2002, and the M.S. and Ph.D. degrees in electrical and computer
engineering from the Georgia Institute of Technology, Atlanta, GA,
USA, in 2007 and 2012, respectively.
He is an Assistant Professor with the School of Electrical
Engineering and Computer Science, Washington State University,
Pullman, WA, USA. He researched on physical layout optimization with
Cadence Design Systems, Inc., San Jose, CA, USA, from 2012 to 2014.
His current research interests include electronic design automation
and computer-aided design for very large scale integration (VLSI),
high-performance and/or low-power VLSI and computer systems, and 3-D
integrated circuits and systems.
Dr. Kim was a recipient of the Cadence Excellence in Innovation Award
in 2014, the Defense Advanced Research Projects Agency Young Faculty
Award in 2016, and the EECS Early Career Award from the School of
Electrical Engineering and Computer Science at Washington State
University in 2017.