
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 37, NO. 4, APRIL 2018 845

Detailed-Placement-Enabled Dynamic Power Optimization of Multitier Gate-Level Monolithic 3-D ICs

Sheng-En David Lin, Student Member, IEEE, and Dae Hyun Kim, Member, IEEE

Abstract—Monolithic 3-D integration is expected to provide a significantly higher degree of device density than through-silicon-via-based 3-D integration, mainly due to its nanoscale intertier connections. By stacking more than two device layers (multitier) within a 3-D chip, further wirelength reduction could be achieved, which can lead to additional performance and power benefits. In this paper, we propose a detailed placement algorithm called nonuniform-scaling-based placement to optimize the dynamic power consumption of multitier gate-level monolithic 3-D ICs. We also introduce delay- and length-based timing constraints to prevent potential degradation of the performance metric during placement. Under the same timing constraints, our algorithm reduces dynamic power consumption more effectively than the uniform-scaling-based placement algorithm by 2% to 14%.

Index Terms—3-D IC, monolithic, multitier, power optimization.

I. INTRODUCTION

MONOLITHIC 3-D integration is emerging as an improved way for 3-D stacking to increase device density further [1]. A monolithic interlayer via (MIV) used for an intertier electrical connection is much smaller than a through-silicon via (TSV) as shown in Fig. 1, so monolithic 3-D integration is almost free from area and capacitance overhead, whereas TSV-based 3-D integration suffers from area and capacitance overhead [2], [3]. Thus, monolithic 3-D integration is expected to enable the highest degree of wirelength reduction, performance improvement, power reduction, and intertier bandwidth improvement.

The wirelength reduction of monolithic 3-D integration could be converted into both power and performance benefits. For example, Panth et al. [4] used a design methodology that uniformly scales the cell locations of a high-quality 2-D placement result by a constant ratio $1/\sqrt{2}$ to generate two-tier monolithic 3-D placement. The location scaling generates

Manuscript received October 7, 2016; revised February 8, 2017 and May 12, 2017; accepted June 28, 2017. Date of publication July 19, 2017; date of current version March 29, 2018. This work was supported in part by the New Faculty Seed Grant 125679-002 funded by Washington State University and in part by the DARPA Young Faculty Award D16AP00119 funded by the Defense Advanced Research Projects Agency. This paper was recommended by Associate Editor Y. P. Liu. (Corresponding author: Dae Hyun Kim.)

The authors are with the School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99164 USA (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCAD.2017.2729401

Fig. 1. Multitier monolithic 3-D integration.

overlaps among the cells, so the authors use a partitioning algorithm to distribute the cells into different tiers to remove the overlaps. This methodology, namely a uniform-scaling-based placement algorithm (USBP), scales down the length of each net by almost the same ratio as the constant scaling ratio, thereby reducing wirelength and dynamic power consumption. However, the reduced wirelength cannot be directly converted into a higher clock frequency if gate delay and pin capacitance dominate the critical path delay or the design has a potential power-density problem that will lead to a thermal problem. In this case, we can reduce the dynamic power consumption further by converting the total delay gain obtained from wirelength reduction into power saving. In this paper, we propose a 3-D detailed placement algorithm, namely a nonuniform-scaling-based placement algorithm (NUSBP), that provides more dynamic power reduction than the USBP algorithm under the same operating frequency.

It is expected that multitier 3-D integration will provide more benefits than two-tier 3-D integration [5], and some monolithic 3-D integration technologies can fabricate multiple device layers in a single package [6]. Thus, we also generate multitier monolithic 3-D placement results using the USBP and NUSBP algorithms and compare the quality of the algorithms for multitier monolithic 3-D ICs.

0278-0070 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


Our contributions in this paper are as follows.

1) We develop a 3-D detailed placement algorithm that optimizes dynamic power consumption of monolithic 3-D ICs more effectively than the USBP.

2) We present theoretical background on the minimization of dynamic power consumption in monolithic 3-D ICs.

3) We develop length-based and delay-based timing constraint algorithms to prevent aggressive cell movements that could degrade the quality of the monolithic 3-D ICs.

4) We apply the uniform-scaling-based and nonuniform-scaling-based 3-D placement algorithms to multitier monolithic 3-D IC design and present their results with detailed analyses.

The rest of this paper is organized as follows. We review the previous work on the design of monolithic 3-D ICs in Section II. In Section III, we present theoretical background on dynamic power consumption and timing constraints. In Section IV, we propose a dynamic power optimization algorithm in detail. Then, we present and analyze simulation results in Section V. Finally, we conclude in Section VI.

II. PRELIMINARIES AND RELATED WORK

In this section, we briefly review three monolithic 3-D IC design methodologies presented in the literature and discuss the uniform-scaling-based 3-D placement algorithm.

A. Design Methodologies for Monolithic 3-D ICs

Monolithic 3-D ICs can be designed at several different design levels. The most fine-grained design style is the transistor-level monolithic integration (TMI) proposed in [7]. In TMI, the nMOS and pMOS transistors of each standard cell are placed in different tiers, e.g., the nMOS and pMOS transistors are placed in the top and bottom tiers, respectively. In this case, MIVs are used for both intracell and intercell 3-D connections. TMI reduces the footprint area of each standard cell almost by half, but it overuses MIVs for intracell 3-D routing, which increases routing complexity and can lead to unroutable designs. Block-level monolithic integration proposed in [8] is another monolithic 3-D IC design methodology in which each 2-D functional block is designed with 2-D standard cells and all the blocks are placed in 3-D using a 3-D floorplanner. Thus, nMOS and pMOS transistors are placed in both bottom and top tiers and MIVs are inserted into whitespace between the blocks. Gate-level monolithic integration (GMI) proposed in [7] places 2-D standard cells in 3-D. GMI can reuse existing 2-D standard cells and timing/power libraries. In addition, a design methodology using 2-D placement tools was proposed in [4] for the design of gate-level monolithic 3-D ICs and achieved almost 20% wirelength reduction and 16% power reduction. Thus, GMI is a promising design methodology for monolithic 3-D IC design with respect to the design effort and the quality (wirelength, timing, and power) of 3-D ICs.

B. Uniform-Scaling-Based 3-D Global Placement

The 3-D global placement algorithm presented in [4] works as follows. First, the authors determine a downscaling ratio $s$ based on the ratio between the width ($w_{\text{2-D}}$) of the 2-D layout and the width ($w_{\text{3-D}}$) of a target 3-D layout of the design ($s = w_{\text{3-D}}/w_{\text{2-D}}$).¹ Then, they shrink the size of each cell in a given standard cell library by the scaling ratio $s$ and place the cells in 2-D using a commercial tool. By changing the library set from the downscaled one to the original one after the placement, the authors obtain a layout in which the cells overlap with each other. The overlaps are removed by partitioning, which also automatically converts the 2-D layout into a 3-D layout. The whole process of downscaling the cell size, placing cells in 2-D, and restoring the original cell size is very similar to placing the cells first with the original standard cell library and then scaling the locations of the cells uniformly by the same downscaling ratio $s$. Thus, we call this approach USBP.

TABLE I
VARIABLES USED IN THIS PAPER

USBP reduces the length of each net almost by the downscaling ratio s, so the dynamic power consumption and the delay of each net are also reduced. However, the reduced net delay cannot be directly converted into higher clock frequency if the delay of the critical path is primarily due to gate delay and pin capacitance. Thus, USBP can easily reduce the dynamic power consumption, but cannot guarantee that it can increase the clock frequency. However, we can convert the increased timing margin into further dynamic power reduction. In this paper, therefore, we propose an algorithm to convert the reduced net delay in noncritical paths into power reduction by a detailed placement algorithm, which we call NUSBP.

III. DYNAMIC POWER REDUCTION IN GATE-LEVEL MONOLITHIC 3-D ICS

In this section, we analyze dynamic power consumption in monolithic 3-D ICs and investigate how we can reduce dynamic power consumption further. We also discuss timing constraints we take into account during dynamic power optimization. Table I shows the variables used in this paper and their meanings.

A. Power Reduction by Uniform Scaling

Dynamic power consumption is estimated by the following well-known formula:

$$P_{int} = \sum_{i \in N} \alpha_i \cdot f_{clk} \cdot \left(C_{w,i} + C_{p,i}\right) \cdot V_{DD}^2 \tag{1}$$

where $N$ is the set of all the nets in the design, and the capacitance of each net is broken down into two components: the wire capacitance $C_{w,i}$ and the input pin capacitance $C_{p,i}$.
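As a minimal sketch of how (1) can be evaluated, the following Python function sums the per-net contributions. The `Net` record, its field names, and the numeric values are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Net:
    alpha: float   # switching activity of the net
    c_wire: float  # wire capacitance C_w in farads
    c_pin: float   # total input pin capacitance C_p in farads


def dynamic_power(nets, f_clk, vdd):
    """Evaluate (1): sum over nets of alpha * f_clk * (C_w + C_p) * VDD^2."""
    return sum(n.alpha * f_clk * (n.c_wire + n.c_pin) * vdd ** 2 for n in nets)


# Two made-up nets evaluated at 1 GHz and 1.0 V.
example_nets = [Net(0.2, 10e-15, 2e-15), Net(0.1, 5e-15, 1e-15)]
p_example = dynamic_power(example_nets, f_clk=1e9, vdd=1.0)
```

Because the clock frequency and supply voltage are common factors, only the activity-weighted capacitance sum changes when cells are moved.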

¹Assuming both the 2-D and 3-D layouts have the same total silicon area, $w_{\text{3-D}}$ is $w_{\text{2-D}}/\sqrt{N_T}$.


TABLE II
IDEAL BENEFITS OBTAINED BY MONOLITHIC 3-D INTEGRATION AND USBP

Assuming the 2-D layout and the target 3-D layout have the same total silicon area, the scaling factor that the USBP algorithm uses becomes $1/\sqrt{N_T}$. Thus, the USBP algorithm ideally reduces the length of each wire to $1/\sqrt{N_T}$ of its original value, which is converted into delay and power reduction. Table II shows the ideal benefits we can obtain by USBP.

Since the switching activity of each net, the clock frequency, and the supply voltage are constants, we can reduce the dynamic power consumption by reducing the wire capacitance and/or the input pin capacitance as shown in (1). Reducing wire capacitance requires wirelength reduction, routing layer reassignment, wire spreading, and so on. Reducing input pin capacitance requires gate sizing (downsizing in most cases).

B. Conversion of Delay Benefit Into Power Reduction

As shown in Table II, the USBP algorithm reduces both net delay and dynamic power consumption by wirelength reduction. As explained in Section II-B, however, increasing the clock frequency in monolithic 3-D ICs is not always possible or desirable. In this case, we can adjust the cell locations to convert the delay benefit into further power reduction as shown below.

Fig. 2 shows an example in which three cells are connected through two nets. Assuming that the switching activities of Net 1 and Net 2 in the figure are $\alpha_1$ and $\alpha_2$, respectively, the power consumption before uniform scaling is

$$P_{before} = f_{clk} \cdot V_{DD}^2 \cdot \left(\alpha_1 \left(C_{w,1} + C_{p,1}\right) + \alpha_2 \left(C_{w,2} + C_{p,2}\right)\right) \tag{2}$$

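and the power consumption after uniform scaling is

$$P_{after} = f_{clk} \cdot V_{DD}^2 \cdot \left(\alpha_1 \left(\frac{C_{w,1}}{\sqrt{N_T}} + C_{p,1}\right) + \alpha_2 \left(\frac{C_{w,2}}{\sqrt{N_T}} + C_{p,2}\right)\right). \tag{3}$$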

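Thus, the power benefit ($\Delta P = P_{before} - P_{after}$) obtainable from USBP is

$$\Delta P = f_{clk} \cdot V_{DD}^2 \cdot \left(\alpha_1 \cdot C_{w,1} + \alpha_2 \cdot C_{w,2}\right) \cdot \left(1 - \frac{1}{\sqrt{N_T}}\right). \tag{4}$$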

However, we can reduce the power consumption further by moving the cells. For instance, if $\alpha_1$ is greater than $\alpha_2$, moving Cell 2 closer to Cell 1 along Net 1 will reduce the power consumption. Suppose Cell 2 is moved toward Cell 1 by $x$ (µm) after the uniform scaling ($x > 0$). Then, the power consumption after the movement is

$$P'_{after} = f_{clk} \cdot V_{DD}^2 \cdot \left(\alpha_1 \left(\frac{C_{w,1}}{\sqrt{N_T}} - c_u \cdot x + C_{p,1}\right) + \alpha_2 \left(\frac{C_{w,2}}{\sqrt{N_T}} + c_u \cdot x + C_{p,2}\right)\right). \tag{5}$$

Fig. 2. USBP and NUSBP. (a) Before uniform scaling. (b) After uniform scaling (scaling factor: $\sqrt{N_T}$). (c) After nonuniform scaling.

Then, the new power benefit ($\Delta P' = P_{before} - P'_{after}$) becomes

$$\Delta P' = f_{clk} \cdot V_{DD}^2 \cdot \left(\alpha_1 \cdot C_{w,1} + \alpha_2 \cdot C_{w,2}\right) \cdot \left(1 - \frac{1}{\sqrt{N_T}}\right) + f_{clk} \cdot V_{DD}^2 \cdot c_u \cdot x \cdot (\alpha_1 - \alpha_2) \tag{6}$$

where $c_u$ is the capacitance per micrometer for the nets. The second term in (6) is positive because we assume that $\alpha_1$ is greater than $\alpha_2$. Thus, the power benefit goes up further by moving Cell 2 closer to Cell 1 in this case.
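The relation between (2) through (6) can be checked numerically. The short Python sketch below uses made-up values for the activities, capacitances, and displacement (all of them assumptions for illustration) and confirms that the extra benefit from moving Cell 2 equals $f_{clk} \cdot V_{DD}^2 \cdot c_u \cdot x \cdot (\alpha_1 - \alpha_2)$.

```python
import math


def power(f_clk, vdd, nets):
    """nets: list of (alpha, c_wire, c_pin) tuples, capacitances in farads."""
    return f_clk * vdd ** 2 * sum(a * (cw + cp) for a, cw, cp in nets)


# Illustrative values only (not from the paper).
f_clk, vdd, n_t = 1e9, 1.0, 2
a1, a2 = 0.3, 0.1                      # switching activities, a1 > a2
cw1, cw2 = 20e-15, 20e-15              # wire capacitances before scaling
cp1, cp2 = 2e-15, 2e-15                # input pin capacitances
c_u, x = 0.2e-15, 5.0                  # F/um and um
s = 1 / math.sqrt(n_t)                 # uniform scaling ratio

p_before = power(f_clk, vdd, [(a1, cw1, cp1), (a2, cw2, cp2)])             # (2)
p_after = power(f_clk, vdd, [(a1, cw1 * s, cp1), (a2, cw2 * s, cp2)])      # (3)
p_moved = power(f_clk, vdd, [(a1, cw1 * s - c_u * x, cp1),
                             (a2, cw2 * s + c_u * x, cp2)])                # (5)

delta_p = p_before - p_after           # (4)
delta_p_moved = p_before - p_moved     # (6)
extra = f_clk * vdd ** 2 * c_u * x * (a1 - a2)
```

With these numbers the additional term is positive, so the nonuniform move saves more power than uniform scaling alone; swapping `a1` and `a2` would make it negative.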

This post-scaling adjustment of cell locations can be performed in three different ways. First, the location of each cell is scaled with its own scaling ratio as follows:

$$(x_i, y_i) \rightarrow \left(\left(\frac{1}{\sqrt{N_T}} + s_{x,i}\right) \cdot x_i, \left(\frac{1}{\sqrt{N_T}} + s_{y,i}\right) \cdot y_i\right) \tag{7}$$

where $(x_i, y_i)$ is the location of Cell $i$, and $s_{x,i}$ and $s_{y,i}$ are small variations of the x- and y-coordinate scaling factors for Cell $i$, respectively. Second, the locations of all the cells are uniformly scaled down with a constant scaling ratio ($1/\sqrt{N_T}$) and the locations are slightly adjusted as follows:

$$(x_i, y_i) \rightarrow \left(\frac{x_i}{\sqrt{N_T}}, \frac{y_i}{\sqrt{N_T}}\right) \rightarrow \left(\frac{x_i}{\sqrt{N_T}} + \delta_{x,i}, \frac{y_i}{\sqrt{N_T}} + \delta_{y,i}\right) \tag{8}$$

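where $\delta_{x,i}$ and $\delta_{y,i}$ are small post-scaling displacements for Cell $i$. Third, the location of each cell is adjusted and then the locations of all the cells are uniformly scaled down by a constant scaling ratio ($1/\sqrt{N_T}$) as follows:

$$(x_i, y_i) \rightarrow \left(x_i + \delta'_{x,i}, y_i + \delta'_{y,i}\right) \rightarrow \left(\frac{x_i + \delta'_{x,i}}{\sqrt{N_T}}, \frac{y_i + \delta'_{y,i}}{\sqrt{N_T}}\right) \tag{9}$$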

where $\delta'_{x,i}$ and $\delta'_{y,i}$ are small prescaling displacements for Cell $i$. All of these approaches produce the same result, but we use the third approach in this paper and call it NUSBP.

Although NUSBP reduces the power consumption further, we should take two important constraints, timing and density constraints, into account in the computation of $\delta'_{x,i}$ and $\delta'_{y,i}$. The next section shows how we take the timing constraint into account and Section IV-D explains how we handle the density constraint.
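To illustrate why the second and third approaches can reach the same final location, the sketch below implements (8) and (9) for a single cell. The function names are hypothetical; the point is only that a prescaling displacement $\delta'$ corresponds to a post-scaling displacement $\delta = \delta'/\sqrt{N_T}$.

```python
import math


def scale_then_shift(x, y, n_t, dx, dy):
    """Approach (8): uniform scaling followed by a post-scaling displacement."""
    s = 1 / math.sqrt(n_t)
    return (x * s + dx, y * s + dy)


def shift_then_scale(x, y, n_t, dx_pre, dy_pre):
    """Approach (9): prescaling displacement followed by uniform scaling."""
    s = 1 / math.sqrt(n_t)
    return ((x + dx_pre) * s, (y + dy_pre) * s)


# Example: a cell at (100, 40) um in a 4-tier design, moved by (6, -2) um before scaling.
n_t = 4
pre = shift_then_scale(100.0, 40.0, n_t, 6.0, -2.0)
post = scale_then_shift(100.0, 40.0, n_t, 6.0 / math.sqrt(n_t), -2.0 / math.sqrt(n_t))
```

Working with prescaling displacements means the adjustment can be computed directly on the given 2-D placement, with a single uniform scaling step at the end.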


C. Ideal Nonuniform Scaling Under Timing Constraints

In Fig. 2, let $d_{1,3}$ and $d''_{1,3}$ be the Elmore delays from the output of Cell 1 to the input of Cell 3 for the 2-D case and after nonuniform scaling, respectively. Then, $d_{1,3}$ is expressed as follows:

$$d_{1,3} = R_1 \left(C_{w,1} + C_{p,2}\right) + R_{w,1} C_{p,2} + \frac{R_{w,1} C_{w,1}}{2} + R_2 \left(C_{w,2} + C_{p,3}\right) + R_{w,2} C_{p,3} + \frac{R_{w,2} C_{w,2}}{2} \tag{10}$$

where $R_1$ and $R_2$ are the drive resistances of Cell 1 and Cell 2, respectively. If the distance between Cell 1 and Cell 2 in Fig. 2 becomes $S_1$ (µm) and that between Cell 2 and Cell 3 becomes $S_2$ (µm) after nonuniform scaling, the difference between $d_{1,3}$ and $d''_{1,3}$ becomes

$$\Delta d'_{1,3} = \left(R_1 C_{w,1} + R_{w,1} C_{p,2}\right)\left(1 - \frac{S_1}{L_1}\right) + \frac{R_{w,1} C_{w,1}}{2}\left(1 - \frac{S_1^2}{L_1^2}\right) + \left(R_2 C_{w,2} + R_{w,2} C_{p,3}\right)\left(1 - \frac{S_2}{L_2}\right) + \frac{R_{w,2} C_{w,2}}{2}\left(1 - \frac{S_2^2}{L_2^2}\right). \tag{11}$$

Setting $\Delta d'_{1,3}$ to zero and solving it with the constraint $S_1 + S_2 = (L_1 + L_2)/\sqrt{N_T}$ gives us the ranges of $S_1$ and $S_2$ that do not degrade the delay from Cell 1 to Cell 3.
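A small numerical sketch can make the feasible range concrete. The code below evaluates (10) and (11) and sweeps $S_1$ under $S_1 + S_2 = (L_1 + L_2)/\sqrt{N_T}$, keeping the values for which the delay gain stays non-negative. All resistance, capacitance, and length values are made-up assumptions; a sweep is used here instead of solving the quadratic in closed form.

```python
import math


def elmore(r1, r2, rw1, rw2, cw1, cw2, cp2, cp3):
    """Delay (10) from the output of Cell 1 to the input of Cell 3."""
    return (r1 * (cw1 + cp2) + rw1 * cp2 + rw1 * cw1 / 2
            + r2 * (cw2 + cp3) + rw2 * cp3 + rw2 * cw2 / 2)


def delay_gain(s1, s2, l1, l2, r1, r2, rw1, rw2, cw1, cw2, cp2, cp3):
    """Delay difference (11) when the net lengths change from l1, l2 to s1, s2."""
    k1, k2 = s1 / l1, s2 / l2
    return ((r1 * cw1 + rw1 * cp2) * (1 - k1) + rw1 * cw1 / 2 * (1 - k1 ** 2)
            + (r2 * cw2 + rw2 * cp3) * (1 - k2) + rw2 * cw2 / 2 * (1 - k2 ** 2))


# Illustrative parameters: Cell 1 is a weak driver (large r1), Cell 2 a strong one.
params = dict(r1=10e3, r2=1e3, rw1=100.0, rw2=100.0,
              cw1=20e-15, cw2=20e-15, cp2=2e-15, cp3=2e-15)
l1 = l2 = 100.0                        # original net lengths in um
n_t = 2
total = (l1 + l2) / math.sqrt(n_t)     # S1 + S2 after scaling

# Keep the sampled S1 values whose delay gain is non-negative.
feasible = [s1 for s1 in (total * i / 100 for i in range(1, 100))
            if delay_gain(s1, total - s1, l1, l2, **params) >= 0]
```

In this example the net driven by the weak cell cannot grow much beyond its original length before the path delay degrades, which bounds how far Cell 2 may move away from Cell 1.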

D. Delay-Based Timing Constraint

Applying the timing constraint presented in the previous section to a target cell requires the routing topologies of all the nets connected to the cell to obtain the net lengths. In this paper, however, we use the half-perimeter wirelength (HPWL) to estimate the length of each net for the following reasons. First, it might be too costly to construct a routing topology for each net at this step. Second, even if constructing a routing topology is not time-consuming, we do not need to constrain the range of the target locations of a target cell along a fixed routing topology at this step. Thus, we modify the timing constraint from the Elmore-delay-based constraint to the HPWL-based constraint for Cell 2 in Fig. 2 as follows:

$$\mathrm{HPWL}_1'^{\,2} + \mathrm{HPWL}_2'^{\,2} \le \mathrm{HPWL}_1^2 + \mathrm{HPWL}_2^2 \tag{12}$$

where $\mathrm{HPWL}_i$ is the HPWL of net $i$, $\mathrm{HPWL}'_i$ is the HPWL of net $i$ after moving the target cell and scaling, and Net 1 and Net 2 are one of the input nets and one of the output nets connected to the target cell, respectively. We call this a delay-based timing constraint.
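The check in (12) has to hold for every combination of an input net and an output net of the target cell. A minimal Python sketch of that pairwise test follows; the function names and the `(old, new)` tuple layout are assumptions made for illustration.

```python
from itertools import product


def pair_ok(h_in_old, h_out_old, h_in_new, h_out_new):
    """Check (12) for one (input net, output net) pair of the target cell."""
    return h_in_new ** 2 + h_out_new ** 2 <= h_in_old ** 2 + h_out_old ** 2


def move_ok(inputs, outputs):
    """inputs/outputs: lists of (old_hpwl, new_hpwl), one entry per net.

    The move is accepted only if every input/output combination satisfies (12).
    """
    return all(pair_ok(i_old, o_old, i_new, o_new)
               for (i_old, i_new), (o_old, o_new) in product(inputs, outputs))
```

For a cell with one input net shrinking from 10 to 8 µm and one output net growing from 10 to 11 µm, `move_ok([(10, 8)], [(10, 11)])` accepts the move because 64 + 121 does not exceed 200.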

The delay-based timing constraint preserves the total delay of two adjacent nets connected through a target cell. However, applying this constraint is time-consuming because it must consider all the combinations of the input and output nets associated with each target cell. For instance, suppose a target cell has m input pins and n output pins. Each pin is connected to a net and the driver of each net drives multiple cells. If we move the target cell, the length of each net may change (depending on the location of the target cell), so we should check the timing of all the cells connected to the net. If the average number of cells connected to a net is r and the average input and output pin counts of a cell are m and n, respectively, moving a target cell requires application of the delay-based timing constraint to r · m + (r − 1) · n cells. The complexity of applying (12) to a cell is O(mn), so the complexity of applying the delay-based timing constraint to moving a cell is O(rmn(m + n)). Since this is too time-consuming, we developed a length-based timing constraint, which is presented in the next section.

E. Length-Based Timing Constraint

Suppose the original HPWL of net $i$ is $\mathrm{HPWL}_i$ and that of the net after nonuniform scaling is $\mathrm{HPWL}'_i$. Then, the following constraint limits the increase in the delay of the net after scaling:

$$\mathrm{HPWL}'_i \le \mathrm{HPWL}_i + \delta_i \tag{13}$$

where $\delta_i$ is a relaxation factor for net $i$ that is empirically determined for a given process technology.

The new HPWL of net $i$ after nonuniform scaling is

$$\mathrm{HPWL}'_i = \frac{\mathrm{HPWL}_i + \Delta \mathrm{HPWL}_i}{\sqrt{N_T}} \tag{14}$$

so the substitution of (14) into inequality (13) gives

$$\Delta \mathrm{HPWL}_i \le \left(\sqrt{N_T} - 1\right) \mathrm{HPWL}_i + \sqrt{N_T}\, \delta_i \tag{15}$$

which is a new length-based timing constraint with relaxation for each net. Since the delay is proportional to the square of the length of a net, a decreasing function such as the following can be used for $\delta_i$:

$$\delta_i = \begin{cases} k \ (\mu\text{m}) & \text{if } \mathrm{HPWL}_i \le t \ (\mu\text{m}) \\ \dfrac{k}{\mathrm{HPWL}_i} + \dfrac{t - 1}{t} \cdot k \ (\mu\text{m}) & \text{if } \mathrm{HPWL}_i > t \ (\mu\text{m}) \end{cases} \tag{16}$$

where $t$ is a sufficiently small wirelength, such as 5 µm, and b and $k$ are constants tuned for a given process technology by exhaustive delay simulations. We call this a length-based timing constraint with relaxation. If $\delta_i$ is zero, it is called a length-based timing constraint.
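The bound in (15) and a relaxation of the kind described in (16) can be sketched in a few lines of Python. The piecewise form used here ($\delta = k$ for short nets, $k/\mathrm{HPWL} + ((t-1)/t) \cdot k$ otherwise) is an assumed reading of (16), and the values of `k` and `t` are placeholders rather than tuned constants.

```python
import math


def relaxation(hpwl, k, t):
    """Relaxation delta_i: constant k for short nets, decreasing for longer nets."""
    if hpwl <= t:
        return k
    return k / hpwl + (t - 1) / t * k


def max_delta_hpwl(hpwl, n_t, delta=0.0):
    """Largest prescaling HPWL increase allowed by (15)."""
    root = math.sqrt(n_t)
    return (root - 1) * hpwl + root * delta


# Example: a 100 um net in a 4-tier design, with and without relaxation.
strict_budget = max_delta_hpwl(100.0, n_t=4)
relaxed_budget = max_delta_hpwl(100.0, n_t=4, delta=relaxation(100.0, k=2.0, t=5.0))
```

With `delta=0` the budget is exactly the slack created by uniform scaling: a net may grow by up to $(\sqrt{N_T} - 1)$ times its length before scaling and still end up no longer than it was in 2-D.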

IV. DYNAMIC POWER OPTIMIZATION ALGORITHMS

In this section, we present our algorithms for the minimization of dynamic power consumption in gate-level monolithic 3-D ICs.

A. Overall Algorithm

Fig. 3 shows the overall design flow and the step in the green box shows the proposed algorithm. For a given 2-D placement result, the NUSBP algorithm adjusts the cell locations in the layout and then uniformly scales the locations by $1/\sqrt{N_T}$ to generate an $N_T$-tier monolithic 3-D IC layout. The objective is to minimize the dynamic power consumption estimated by the following formula:

$$P = f_{clk} \cdot V_{DD}^2 \cdot \sum_{i \in N} \left(\alpha_i \cdot \mathrm{HPWL}_i\right) \tag{17}$$


Fig. 3. Our 3-D IC design flow. The USBP skips the dynamic power optimization step.

while satisfying either the delay- or length-based timing constraints shown in Section III. The cell location adjustment is sequentially applied to either each cell or a set of cells (clusters) until there is no more noticeable improvement in the power consumption. The following sections describe how to find optimal locations for each cell or cluster and how to integrate the timing and density constraints into the optimization algorithm.
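As a minimal sketch of the objective in (17), the following Python code computes the HPWL of each net from cell coordinates and sums the activity-weighted lengths. The data layout (a dictionary of cell positions and a list of `(alpha, cells)` nets) is an assumption for illustration, and each cell is treated as a point.

```python
def hpwl(points):
    """Half-perimeter wirelength of the bounding box of a list of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def objective(nets, positions, f_clk=1.0, vdd=1.0):
    """Evaluate (17); nets is a list of (alpha, [cell names])."""
    return f_clk * vdd ** 2 * sum(
        alpha * hpwl([positions[c] for c in cells]) for alpha, cells in nets)


# Toy netlist with three cells and two nets.
positions = {"A": (0.0, 0.0), "B": (3.0, 4.0), "C": (10.0, 0.0)}
nets = [(0.5, ["A", "B"]), (0.1, ["B", "C"])]
cost = objective(nets, positions)
```

Since $f_{clk}$ and $V_{DD}$ are constant during placement, the optimization effectively compares values of $\sum \alpha_i \cdot \mathrm{HPWL}_i$.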

B. Finding Optimal Locations

For each cell in a given 2-D placement result, we find an optimal location that minimizes the sum of the dynamic power of all the nets connected to the cell. The idea is to move the cell in a direction that reduces the sum of the dynamic power.

The following theorem helps find optimal locations that minimize the dynamic power consumption for a cell.

Theorem 1: For Cell A connected to k nets (n1, . . . , nk), construct two bounding boxes, one (Bq,1) without Cell A and the other (Bq,2) with Cell A, for nq (1 ≤ q ≤ k). Let BA be the set of all the bounding boxes, BA = {B1,1, B1,2, B2,1, . . . , Bk,2}, and EPA be the set of all extremal points (four end points) of all the bounding boxes in BA. Let TA be the set of all intersection points of all pairs of the bounding boxes in BA. Then: 1) the current location of Cell A is optimal or 2) there exists at least one optimal point in TA ∪ EPA that minimizes the sum of the dynamic power of all the nets connected to Cell A.

Proof: The objective function we minimize is $\lambda = \sum_{i \in N} \alpha_i \cdot \mathrm{HPWL}_i$. $\lambda$ is piecewise linear, so the optimal points minimizing $\lambda$ exist: 1) inside or on the boundary of some rectangles; 2) on some intervals (segments); or 3) on some extremal points (the endpoints of some intervals or rectangles) or intersection points. Since rectangles and intervals include their endpoints, at least one of the extremal points or the intersection points is optimal.

To apply the above theorem to finding optimal locations for a given target cell, we use the following observations.

1) $\lambda = \sum_{i \in N} \alpha_i \cdot (\mathrm{HPWL}_{x,i} + \mathrm{HPWL}_{y,i})$, where $\mathrm{HPWL}_{x,i}$ and $\mathrm{HPWL}_{y,i}$ are the x- and y-components of the HPWL of net $i$.

2) $\mathrm{HPWL}_{x,i}$ and $\mathrm{HPWL}_{y,i}$ are independent of each other, so we can optimize each of them separately (so we focus only on the x-coordinates from this point).

3) Suppose net $i$ connects $k$ cells, $\{C_1, C_2, \ldots, C_k\}$, and the x-coordinate of $C_j$ is $x_j$. If $C_t$ is the target cell to move, $x_{min,i}$ and $x_{max,i}$ are defined as the minimum and the maximum of $x_j$ ($j = 1, \ldots, k$, $j \neq t$), respectively. Then, $\mathrm{HPWL}_{x,i}$ linearly decreases as the target cell is moved in the positive direction from $-\infty$, stays constant between $x_{min,i}$ and $x_{max,i}$, and then linearly increases as the cell is moved toward $\infty$.

4) The x-coordinate of each intersection point in TA is an extremal point of a rectangle in BA.

Algorithm 1: Find an Optimal x-Coordinate for a Given Cell

Input: A given target cell A and a netlist N.
Output: An optimal x-coordinate of A.

    K = {n | n ∈ N, A ∈ n};
    B = {BoundingBox(n) = (x1, y1, x2, y2) | n ∈ K};
    P = {x1(b), x2(b) | b ∈ B} ∪ x(A);    // x(A) is the x-coordinate of A.
    Array X = Sort(P);
    l = 1, r = |X|;
    while l < r do
        m = (l + r)/2;
        CostL = Cost(X[l]);
        CostM = Cost(X[m]);
        if CostL ≤ CostM then
            r = m;
        else
            CostT = Cost(X[m − 1]);
            if CostT ≤ CostM then
                r = m − 1;
            else
                l = m;
    end while
    Return X[l];

In summary, the x-coordinate of an optimal point minimizing λ for a given cell is always an extremal point of one of the bounding boxes in BA constructed for the cell. In addition, if we move the cell from −∞ to ∞, λ linearly decreases, then stays constant, then linearly increases. Thus, instead of enumerating all the extremal and intersection points, we first construct BA, extract the x-coordinates of the left and right extremal points of each rectangle in BA, and sort the coordinates in increasing order. Then, we find an optimal coordinate by binary search. Algorithm 1 describes how to find an optimal x-coordinate for a given target cell using the binary search algorithm. The Cost(x) function for a given x-coordinate x computes the dynamic power consumption when the cell is moved to x. The algorithm uses the observation that the cost function is convex.
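The search over sorted candidate coordinates can be sketched in Python as follows. This is an illustrative variant rather than a transcription of Algorithm 1: it compares the cost at two neighboring candidates to decide which half to keep, which is one common way to search a convex sequence. The data layout (each net given as its activity and the x-coordinates of its other cells) is an assumption.

```python
def hpwl_x(other_xs, x):
    """x-component of a net's HPWL when the target cell sits at x."""
    return max(max(other_xs), x) - min(min(other_xs), x)


def cost(nets, x):
    """Activity-weighted x-HPWL of all nets connected to the target cell."""
    return sum(alpha * hpwl_x(xs, x) for alpha, xs in nets)


def optimal_x(nets, current_x):
    """nets: list of (alpha, x-coordinates of the other cells on the net)."""
    # Candidates: current location plus left/right edges of each net's bounding box.
    cands = sorted({current_x}
                   | {min(xs) for _, xs in nets}
                   | {max(xs) for _, xs in nets})
    lo, hi = 0, len(cands) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        # Convexity: if the cost does not rise to the left of mid + 1,
        # a minimum lies at index mid or before it.
        if cost(nets, cands[mid]) <= cost(nets, cands[mid + 1]):
            hi = mid
        else:
            lo = mid + 1
    return cands[lo]


# Net 1 is more active than Net 2, so the cell is pulled to Net 1's bounding box.
best = optimal_x([(0.5, [10.0, 20.0, 30.0]), (0.1, [2.0, 6.0])], current_x=0.0)
```

The same routine can be applied to the y-coordinates, since the two components of the HPWL are optimized independently.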

Fig. 4 shows an example. In the figure, Cell A is connected to Net 1 ({A, C1, C2, C3}) and Net 2 ({A, C4, C5}). If the switching activity α1 of Net 1 is greater than α2 of Net 2, the optimal location for Cell A is (x1, [y5, y4]), where [y5, y4] is the range from y5 to y4, i.e., any value greater than or equal to y5 and less than or equal to y4. If the x-coordinate ax of Cell A is greater than x1, but less than x2, the HPWL of Net 1 is minimal, but the HPWL of Net 2 increases. If ax is less than x1, the HPWL of Net 1 increases, but that of Net 2 decreases. Since α1 is greater than α2, the total power consumption increases. We can easily verify that (x1, [y5, y4]) is the optimal location for Cell A in this way. If α2 is greater than α1, however, the optimal location for Cell A is (x5, [y5, y4]).

Fig. 4. Two nets and their bounding boxes. Net 1 = {A, C1, C2, C3}. Net 2 = {A, C4, C5}.

Notice that the proposed algorithm finds an optimal location for each cell. To find optimal locations for all the cells in the netlist, we sequentially choose a cell and move it. Thus, the order of the cells we choose might affect the quality of the solution. In this paper, we sort all the nets in decreasing order of the switching activity and start from the cells connected to the highest-activity net.

C. Timing Constraints

Optimal locations of some cells found by Theorem 1 might violate the timing constraints explained in Section III. Thus, before we move a cell to its optimal location, we check whether the move will violate the timing constraints or not. If it violates the constraints, we move the cell to the farthest location satisfying the timing constraints from its current location, along and inside the segment connecting the current and the optimal location.

If the timing constraint is the length-based timing constraint, we compute the maximum value of ΔHPWL satisfying inequality (13) for each net connected to the target cell and choose the smallest value among them. The computation of ΔHPWL is performed as follows. We first construct the bounding box of net i connected to the target cell and divide the plane into nine regions as shown in Fig. 5. The target cell X is in the center region (R5) and the dotted line shows the bounding box of net i. Depending on the optimal location (red circles), we split the line segment connecting the current and optimal locations into multiple segments, one per region (we call the line segment an optimal segment). Then, we compute ΔHPWL in each region separately. For example, if the target location is in R5, ΔHPWL is 0 even if we move the target cell between its current and the optimal locations. If the target location is in R2, however, ΔHPWL is 0 if the target cell is moved inside R5, but Δy if it is moved in R2, where Δy is the distance between the new location of the target cell in R2 and the y-coordinate of the upper horizontal line of the bounding box of net i.

Fig. 5. Consideration of length-based timing constraints.

Fig. 6. Delay-based timing constraints.

Starting from the region of the optimal location, we compute ΔHPWL satisfying inequality (13). If we find it, it is the optimal location satisfying the length-based timing constraint for net i. If we do not find it, however, we move to the next farthest region along the optimal segment and compute ΔHPWL again. We repeat this process until we find ΔHPWL satisfying the inequality. Notice that the maximum number of regions we should consider for each net is three, which occurs when the optimal location is in R1, R3, R7, or R9 in Fig. 5.
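The idea of stopping at the farthest point that still respects a ΔHPWL budget can be sketched as follows. Note that this sketch swaps the region-by-region computation described above for a plain bisection along the optimal segment, which is simpler to write down but is not the procedure used in the paper. It assumes the current location lies inside the net's bounding box and that the budget is non-negative.

```python
def delta_hpwl(bbox, p):
    """HPWL increase when a cell of the net sits at p outside bbox = (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = bbox
    x, y = p
    dx = max(0.0, x - xmax, xmin - x)
    dy = max(0.0, y - ymax, ymin - y)
    return dx + dy


def farthest_feasible(bbox, cur, opt, budget, iters=50):
    """Farthest point on the segment cur -> opt whose delta_hpwl stays within budget."""
    def point(t):
        return (cur[0] + t * (opt[0] - cur[0]), cur[1] + t * (opt[1] - cur[1]))

    if delta_hpwl(bbox, opt) <= budget:
        return opt
    lo, hi = 0.0, 1.0   # t = 0 is feasible because cur is inside bbox
    for _ in range(iters):
        mid = (lo + hi) / 2
        if delta_hpwl(bbox, point(mid)) <= budget:
            lo = mid
        else:
            hi = mid
    return point(lo)


# Optimal location in the region above the box (R2 in Fig. 5), budget of 4 um.
stop = farthest_feasible((0.0, 0.0, 10.0, 10.0), (5.0, 5.0), (5.0, 25.0), budget=4.0)
```

Bisection works here because ΔHPWL never decreases as the cell moves away from the bounding box along a straight segment. For a cell with several nets, the same routine would be applied per net and the most restrictive stopping point kept.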

If the timing constraint is the delay-based timing constraint, we should consider not only each pair of the two adjacent nets connected to the target cell, but also the cells adjacent to the target cell. Fig. 6 shows an example. Suppose Cell 1 in the figure is the target cell. If Cell 1 is moved, the lengths of Net 1, Net 2, and Net 3 are changed. In this case, we apply the delay-based timing constraint to the following net pairs: (Net 1, Net 2) and (Net 1, Net 3). However, the changes of the lengths of Net 1, Net 2, and Net 3 also affect the delay values from Net 4 to Net 2, from Net 5 to Net 2, and from Net 1 to Net 7. Thus, we should also apply the delay-based timing constraint to the following net pairs: (Net 4, Net 2), (Net 5, Net 2), and (Net 1, Net 7).

To apply the delay-based timing constraint to a pair of two nets, Net 1 and Net 2, we first split the plane into at most 25 regions² and compute ΔHPWL1 and ΔHPWL2 in each region that the optimal segment passes through. Starting from the region of the optimal location, we compute ΔHPWL1 and ΔHPWL2 minimizing λ and satisfying inequality (12). If we do not find a value satisfying the inequality in the region, we proceed to the next farthest region on the optimal segment in a similar way to the length-based timing constraint case.

²Four x- and four y-coordinates from the eight extremal points of the two bounding boxes split the plane into 25 regions.


Fig. 7. Illustration of the clustering technique. The red nets are high-activity nets.

D. Density Constraints

As mentioned in Section III, moving a cell to its optimal location might increase the density of the layout area around the optimal location. Thus, we need to control the layout density efficiently during optimization. In this paper, we predetermine a bin size, obtain the maximum bin density in a given layout, and limit the density of each bin to be at most the maximum bin density. If moving a cell to its optimal location violates the density constraint of the bin, we find the next farthest bin that does not violate the density constraint along the optimal segment.

We satisfy the timing and density constraints for each move by considering both at the same time. Thus, we guarantee that we never violate the timing and density constraints during or after the optimization.
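A bin-based density check of this kind can be sketched as below. The sketch makes simplifying assumptions that are not from the paper: each cell is assigned to a single bin by its reference point, and the cell being moved is not removed from its source bin before the test.

```python
from collections import defaultdict


def bin_of(x, y, bin_size):
    """Index of the square bin that contains the point (x, y)."""
    return (int(x // bin_size), int(y // bin_size))


def bin_usage(cells, bin_size):
    """cells: list of (x, y, area). Returns the occupied cell area per bin."""
    usage = defaultdict(float)
    for x, y, area in cells:
        usage[bin_of(x, y, bin_size)] += area
    return usage


def max_bin_density(usage, bin_size):
    """Maximum bin density of the given layout, used as the density limit."""
    return max(usage.values()) / bin_size ** 2


def move_allowed(usage, cell_area, target_xy, bin_size, limit):
    """True if placing the cell at target_xy keeps its bin at or below the limit."""
    b = bin_of(target_xy[0], target_xy[1], bin_size)
    return (usage.get(b, 0.0) + cell_area) / bin_size ** 2 <= limit


# Toy layout with 20 um x 20 um bins.
cells = [(1.0, 1.0, 50.0), (2.0, 2.0, 50.0), (25.0, 5.0, 30.0)]
usage = bin_usage(cells, bin_size=20.0)
limit = max_bin_density(usage, bin_size=20.0)
```

When a move is rejected, the search would step back along the optimal segment to the next bin and repeat the test, combined with the timing check for the same candidate location.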

E. Clustering

A problem we found in moving a cell individually to its optimal location is that we cannot move any cell connected to a high-activity net in some cases. For example, moving Cell A, Cell B, and Cell C toward Cell D in Fig. 7 will reduce the dynamic power consumption, but moving the three cells one by one is prohibited because moving each one of them increases the dynamic power consumption or leads to no power benefit. Thus, in addition to moving each cell individually, we cluster the cells connected to high-activity nets and move the cells simultaneously to reduce dynamic power consumption further. However, we only cluster the cells connected to a net whose HPWL is less than a predetermined threshold value (K). In fact, there exist some uncertainties in our optimization methodology because our wirelength and power computation are based on the HPWL. Thus, if K is large, the uncertainty goes up and the final power value does not accurately match our power computation. If K is too small, however, there exist only a few clusters that can be moved by the clustering technique. Therefore, we empirically determined K based on simulations. Similar to the cell-based optimization, the solution quality of the clustering-based optimization depends on the order of the clusters we move. In this paper, we sort the nets in decreasing order of their switching activities and start from the highest-activity net. For each selected net, we cluster the cells connected to the net into a single super cell and apply the cell-based optimization algorithm to the super cell. After we move the super cell, we flatten the super cell.
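The cluster selection and the rigid move of a super cell can be sketched as follows. The helper names, the data layout, and the toy coordinates are assumptions for illustration; the displacement itself would come from the cell-based optimization applied to the super cell.

```python
def hpwl(points):
    """Half-perimeter wirelength of the bounding box of a list of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def select_clusters(nets, positions, k_threshold):
    """Visit nets from highest to lowest activity and keep those shorter than K.

    nets: list of (alpha, [cell names]). Each kept net becomes one cluster.
    """
    clusters = []
    for alpha, cells in sorted(nets, key=lambda n: n[0], reverse=True):
        if hpwl([positions[c] for c in cells]) < k_threshold:
            clusters.append(list(cells))
    return clusters


def move_cluster(positions, cluster, dx, dy):
    """Translate all cells of a super cell by the same offset and return flat positions."""
    moved = dict(positions)
    for c in cluster:
        x, y = positions[c]
        moved[c] = (x + dx, y + dy)
    return moved


positions = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (0.0, 1.0),
             "D": (10.0, 0.0), "E": (50.0, 50.0)}
nets = [(0.1, ["D", "E"]), (0.9, ["A", "B", "C"]), (0.5, ["C", "D"])]
clusters = select_clusters(nets, positions, k_threshold=5.0)
moved = move_cluster(positions, clusters[0], dx=3.0, dy=0.0)
```

Moving the cluster as a rigid body leaves the HPWL of the clustered net unchanged, so only the nets leaving the cluster contribute to the change in the objective.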

F. Complexity Analysis

The cell-based optimization finds an optimal location for each cell and moves the cell to the location. Suppose a target cell is connected to at most n nets, each of which connects at most c cells. Then, the complexity of finding two bounding boxes (one with the cell included and the other without the cell) for each net is O(c) and that of finding all bounding boxes of the target cell is O(cn). The runtime of the linear sweep from −∞ to ∞ over the x-coordinates of the bounding boxes is O(n). Thus, the complexity of finding an optimal location for each target cell is O(cn), which is practically O(1) because c and n are bounded for most of the cells and nets. For an optimal location found by the above process, finding an optimal location that satisfies the timing and density constraints takes a constant amount of time, so the complexity of moving all the cells to their optimal locations is O(C), where C is the total number of cells. The cluster-based optimization also moves a cluster for each net, so practically the complexity of finding an optimal location for a cluster is also O(1). Thus, the complexity of moving all the clusters to their optimal locations is O(N), where N is the total number of nets. We iterate the cell- and cluster-based optimizations only a few times, so the overall complexity of the NUSBP algorithm is O(N + C).

V. SIMULATION RESULTS

In this section, we present our simulation results and detailed analysis.

A. 3-D IC Design Flow and Simulation Setup

For NUSBP, we iterate the cell- and cluster-based optimizations multiple times until the power reduction saturates in the dynamic power optimization step. We use the Nangate [9] 45-nm library for the standard cell library, Synopsys Design Compiler for synthesis, and Cadence Encounter for 2-D placement and legalization. We also use Cadence Encounter to obtain the switching activity of each net by propagating a constant activity at the primary pins. We use hMetis [10] for the k-way partitioning to design k-tier monolithic 3-D ICs. To obtain area-balanced placement results, we split a given layout into a grid and sequentially apply hMetis to each bin of size 5r by 5r, where r is the height of a standard cell row. The bin size for the density check is 20 µm by 20 µm. None of the results we obtained by NUSBP in this section violated the delay and density constraints.

B. Comparison of Dynamic Power Consumption in Two-Tier Monolithic 3-D ICs

Table III shows the wirelength (∑HPWL) and dynamic power consumption (∑α·HPWL) of the 2-D and two-tier monolithic 3-D ICs designed by USBP [denoted by 2-tier uniform (2TU)] and NUSBP [denoted by 2-tier nonuniform (2TNU)] with different timing constraints. As the table shows, the USBP algorithm reduces the dynamic power consumption by roughly 29% compared to the 2-D placement result and the NUSBP algorithm reduces the dynamic power consumption by 31% to 35%, 32% to 37%, and 33% to 39% for the delay-based (-D), length-based (-L), and length-based with relaxation (-LR) timing constraints, respectively, compared to the 2-D placement result. In addition, the NUSBP algorithm consistently outperforms the USBP algorithm for all the benchmarks by 5% to 14%, 4% to 11%, and 3% to 8% for the -D, -L, and -LR cases, respectively.

TABLE III
COMPARISON OF 2-D, k-TIER UNIFORM-SCALING-BASED, AND k-TIER NONUNIFORM-SCALING-BASED PLACEMENT RESULTS USING DIFFERENT TIMING CONSTRAINTS (-D: DELAY-BASED, -L: LENGTH-BASED, -LR: LENGTH-BASED WITH RELAXATION). THE VALUES IN PARENTHESES SHOW THE RATIO BETWEEN THE 3-D DESIGNS AND THE 2-D DESIGNS. FP IS THE FOOTPRINT AREA AND RT IS THE RUNTIME

For a more detailed analysis, we show the difference between the power consumption of 2TU and 2TNU-L for each net in Fig. 8(a). We group the nets into switching-activity bins of width 0.001, compute the sum of the dynamic power of the nets in each bin for 2TU and 2TNU-L, and plot the differences. In the figure, we observe that the power reduction comes primarily from the high-activity nets, i.e., many high-activity (α > 0.8) nets in 2TNU-L are shorter than in 2TU. Thus, we reduce the dynamic power consumption by shortening the high-activity nets. However, some low-activity nets (α ∼ 0.5) in 2TNU-L have higher power consumption than those in 2TU. This is unavoidable because the additional power reduction in NUSBP comes from making high-activity nets shorter and low-activity nets longer. Fig. 8(b) shows the difference between the HPWL of 2TU and 2TNU-L for each net. In the figure, we observe that the trend of the wirelength difference is similar to the trend of the power reduction shown in Fig. 8(a). In other words, the high-activity nets are shortened at the cost of elongated low-activity nets.
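The binning used for Fig. 8(a) can be reproduced with a few lines of code. The sketch below assumes that each placement is available as a list of (activity, HPWL) pairs in the same net order and that the dynamic power of a net is modeled as α·HPWL, as in Table III. The function name and the toy numbers are illustrative only and are not measured data.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

NetSample = Tuple[float, float]  # (switching activity, HPWL)


def power_diff_by_activity(
    nets_a: List[NetSample], nets_b: List[NetSample], bin_width: float = 0.001
) -> Dict[int, float]:
    """Sum alpha * (HPWL_a - HPWL_b) over the nets in each activity bin.

    A positive value means placement b consumes less power in that bin.
    """
    diff: Dict[int, float] = defaultdict(float)
    for (alpha, len_a), (_, len_b) in zip(nets_a, nets_b):
        # The small epsilon guards against floating-point error at bin edges.
        bin_index = math.floor(alpha / bin_width + 1e-9)
        diff[bin_index] += alpha * (len_a - len_b)
    return dict(diff)


# Toy data (assumed): two high-activity nets get shorter, one low-activity net longer.
nets_2tu = [(0.9001, 10.0), (0.9005, 8.0), (0.5002, 4.0)]
nets_2tnu = [(0.9001, 6.0), (0.9005, 5.0), (0.5002, 5.0)]
diff = power_diff_by_activity(nets_2tu, nets_2tnu)
```

With these numbers the bin starting at α = 0.900 shows a net saving, while the bin starting at α = 0.500 shows a small penalty, mirroring the trend described for Fig. 8(a).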

Regarding the timing constraints, the length-based timing constraint reduces the power consumption more effectively than the delay-based timing constraint as shown in Table III. This is mainly because the delay-based timing constraint is tighter than the length-based timing constraint, as the following analysis shows. Without loss of generality, suppose α1 is greater than α2 in Fig. 2. The original lengths of Net 1 and Net 2 are L1 and L2, respectively. Since α1 is greater than α2, we move Cell 2 toward Cell 1 by δ. In the worst case, the length of Net 1 after nonuniform scaling becomes (L1 − δ)/√NT and that of Net 2 becomes (L2 + δ)/√NT. Substituting these new lengths into inequality (12) gives the following:

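$$\left(\frac{L_1 - \delta}{\sqrt{N_T}}\right)^2 + \left(\frac{L_2 + \delta}{\sqrt{N_T}}\right)^2 \le L_1^2 + L_2^2 \tag{18}$$

Solving this inequality for δ yields

$$\delta_{\mathrm{MAX},d} = \frac{L_1 - L_2 + \sqrt{(L_1 - L_2)^2 + 2(N_T - 1)\left(L_1^2 + L_2^2\right)}}{2} \tag{19}$$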

which is the maximum value of δ satisfying the delay-based timing constraint. On the other hand, substituting the length of Net 2 into inequality (12) gives the following:

$$\delta_{\mathrm{MAX},l} = \left(\sqrt{N_T} - 1\right) L_2 \tag{20}$$

which is the maximum value of δ satisfying the length-based timing constraint.

If L1 is much longer than L2, δMAX,d is greater than δMAX,l, so the delay-based timing constraint is looser than the length-based timing constraint. If L1 is much shorter than L2, however, δMAX,l is greater than δMAX,d, so the length-based timing constraint is looser. Although the length-based timing constraint led to lower power consumption than the delay-based timing constraint in the benchmarks we used, the delay-based timing constraint could lead to lower power consumption if the former case (L1 is much longer than L2) dominates the design.
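The two regimes can be checked numerically by evaluating (19) and (20) directly. The sketch below is only a worked example: the function names and the sample lengths (in arbitrary units) are assumptions chosen for illustration.

```python
import math


def delta_max_delay(l1: float, l2: float, nt: int) -> float:
    """Largest delta allowed by the delay-based constraint, Eq. (19)."""
    disc = (l1 - l2) ** 2 + 2 * (nt - 1) * (l1 ** 2 + l2 ** 2)
    return (l1 - l2 + math.sqrt(disc)) / 2


def delta_max_length(l2: float, nt: int) -> float:
    """Largest delta allowed by the length-based constraint, Eq. (20)."""
    return (math.sqrt(nt) - 1) * l2


NT = 2  # two-tier design

# Case 1: L1 much longer than L2, so the delay-based bound is the looser one.
d1, l1_bound = delta_max_delay(100.0, 10.0, NT), delta_max_length(10.0, NT)

# Case 2: L1 much shorter than L2, so the length-based bound is the looser one.
d2, l2_bound = delta_max_delay(10.0, 100.0, NT), delta_max_length(100.0, NT)

print(f"L1=100, L2=10:  delay-based {d1:.1f}, length-based {l1_bound:.1f}")
print(f"L1=10,  L2=100: delay-based {d2:.1f}, length-based {l2_bound:.1f}")
```

For NT = 2 the first case gives a delay-based bound more than an order of magnitude larger than the length-based bound, whereas in the second case the length-based bound is slightly larger, which agrees with the discussion above.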


Fig. 8. Comparison of (a) dynamic power consumption and (b) HPWL between 2TNU-L and 2TU. The x-axis is the net activity. Benchmark: LDPC.

C. Comparison of Dynamic Power Consumption in Multitier Monolithic 3-D ICs

Table III also shows that the multitier monolithic 3-D ICs reduce the dynamic power consumption more effectively than the two-tier monolithic 3-D ICs. The USBP algorithm outperforms the 2-D layout by 29%, 42%, and 50% in the two-, three-, and four-tier designs, respectively. In addition, the NUSBP algorithm outperforms the USBP algorithm by 2% to 14%, 2% to 13%, and 2% to 13% for the two-, three-, and four-tier designs, respectively. Although the dynamic power consumption decreases monotonically as the tier count goes up, the incremental reduction also shrinks. Thus, the dynamic power reduction will eventually saturate even if more tiers are stacked. This is due to the saturation in the wirelength reduction as shown in the same table and in previous work [5]. Since the wirelength is proportional to the scaling ratio (1/√NT), the wirelength reduction saturates, which translates into the saturation of the dynamic power reduction. However, the NUSBP algorithm still consistently outperforms the USBP algorithm in the multitier monolithic 3-D designs.
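The saturation follows directly from the scaling ratio. As a quick check, the sketch below computes the ideal reduction 1 − 1/√NT, assuming that every net shrinks by exactly the uniform scaling ratio. The function name is hypothetical.

```python
import math


def uniform_scaling_reduction(nt: int) -> float:
    """Ideal wirelength (and alpha * HPWL) reduction for an nt-tier design."""
    return 1.0 - 1.0 / math.sqrt(nt)


previous = 0.0
for nt in (2, 3, 4):
    reduction = uniform_scaling_reduction(nt)
    gain = reduction - previous  # extra benefit of adding this tier
    print(f"NT={nt}: reduction {100 * reduction:.0f}%, gain {100 * gain:.0f} points")
    previous = reduction
```

Under this assumption the reductions are about 29%, 42%, and 50% for two, three, and four tiers, which agrees with the USBP values quoted above, and each additional tier contributes fewer percentage points than the previous one.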

Regarding the runtime, applying the delay-based timing constraint takes the longest runtime to optimize the designs. This is because the delay-based timing constraint is applied not only to the pairs of the nets connected to a target cell, but also to some pairs of the nets connected to the cells adjacent to the target cell. Thus, applying the delay-based timing constraint requires more computation time than the other timing constraints. On the other hand, the length-based timing constraint with relaxation requires more computations than the length-based timing constraint because the former is tighter than the latter, so the latter searches more optimal points than the former.

Fig. 9. Variation of the dynamic power consumption of the two-tier LDPC design with cell- and cluster-based optimization interleaving (L: cell-based, and C: cluster-based).

D. Effect of Interleaving Cell- and Cluster-Based Optimizations

If we apply only the cell- or cluster-based optimization repeatedly, the dynamic power consumption decreases at the beginning, but saturates eventually. As explained in Section IV-E, therefore, we alternate between the cell- and cluster-based optimizations to escape from local minima and reduce the dynamic power consumption further. In the NUSBP design flow, we run m iterations of the cell-based optimization, then run n iterations of the cluster-based optimization, and repeat this sequence until the difference between the dynamic power consumption values before and after the sequence is less than a predetermined threshold. We set both m and n to 10 in our simulation.
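The interleaving schedule described above can be sketched as a simple control loop. The code below is an assumption-laden illustration: the placement is abstracted as an opaque `state`, the two optimization passes and the power evaluation are passed in as callables, and the threshold `eps` and the toy passes are invented for the example and do not come from the paper.

```python
from typing import Callable, TypeVar

S = TypeVar("S")


def interleave(
    state: S,
    cell_pass: Callable[[S], S],
    cluster_pass: Callable[[S], S],
    power: Callable[[S], float],
    m: int = 10,
    n: int = 10,
    eps: float = 1e-3,
) -> S:
    """Run m cell-based and n cluster-based iterations until the gain is below eps."""
    while True:
        before = power(state)
        for _ in range(m):
            state = cell_pass(state)
        for _ in range(n):
            state = cluster_pass(state)
        if before - power(state) < eps:
            return state


# Toy model (assumed): the state is the power itself and both passes decay
# geometrically toward a floor of 50, mimicking the saturation in Fig. 9.
def toy_cell_pass(p: float) -> float:
    return 50.0 + (p - 50.0) * 0.90


def toy_cluster_pass(p: float) -> float:
    return 50.0 + (p - 50.0) * 0.95


final_power = interleave(100.0, toy_cell_pass, toy_cluster_pass, power=lambda p: p)
```

Passing the passes in as callables keeps the stopping rule separate from the optimization details, so the same loop can be used with different values of m, n, and the threshold.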

Fig. 9 shows the variation of the dynamic power consumption of the two-tier LDPC design, where L and C denote the cell- and cluster-based optimizations, respectively. As the figure shows, the dynamic power reduction saturates after four to six iterations in each optimization mode. Once the first cell-based optimization saturates, the cluster-based optimization reduces the dynamic power consumption further by moving multiple cells connected to high-activity nets at the same time. The dynamic power reduction in the cluster-based optimization, however, also saturates. The cluster-based optimization perturbs the placement, so switching back to the cell-based optimization helps reduce the power consumption again. As the number of iterations increases, the power consumption eventually saturates as shown in Fig. 9.

E. Impact of the Density Control

As explained in Section IV-D, we control the density of each bin in the layout during both the cell- and cluster-based dynamic power optimization. Since it is not straightforward to estimate the routability of a design at the placement stage, many placement papers use some variants of density estimation and control schemes to effectively alleviate potential routing congestion. Fig. 10 shows the bin density distribution maps for the DES benchmark without and with the density constraint and control. The density constraint for each bin in the design is set to 0.8. As shown in the figure, optimizing the design without density control leaves bins that exceed this limit, and the utilization of some of the bins is greater than 1.0. With the density control, however, no bin violates the density constraint.

Fig. 10. Density distribution of the DES benchmark. (a) Without density constraints. (b) With density constraints.
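A bin-density check of the kind described above can be sketched as follows. This is a simplified illustration, not the density model of Section IV-D: each cell's full area is assigned to the bin that contains its center, whereas a production placer would split the area of cells that straddle bin boundaries. The bin size of 20 µm and the limit of 0.8 come from the text, and the cell list and function names are assumed.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Cell = Tuple[float, float, float, float]  # (x, y, width, height), lower-left origin, in µm
Bin = Tuple[int, int]


def bin_densities(cells: List[Cell], bin_size: float = 20.0) -> Dict[Bin, float]:
    """Cell area per bin divided by the bin area (center-based assignment)."""
    area: Dict[Bin, float] = defaultdict(float)
    for x, y, w, h in cells:
        cx, cy = x + w / 2, y + h / 2
        area[(int(cx // bin_size), int(cy // bin_size))] += w * h
    return {b: a / (bin_size * bin_size) for b, a in area.items()}


def violating_bins(densities: Dict[Bin, float], limit: float = 0.8) -> List[Bin]:
    """Bins whose utilization exceeds the density constraint."""
    return sorted(b for b, d in densities.items() if d > limit)


# Toy layout (assumed): two large cells crowd bin (0, 0), one small cell sits in bin (1, 0).
cells = [(0.0, 0.0, 10.0, 20.0), (10.0, 0.0, 10.0, 15.0), (25.0, 5.0, 5.0, 10.0)]
densities = bin_densities(cells)
violations = violating_bins(densities)
```

During optimization, a candidate move would be rejected whenever it pushes the destination bin into the list returned by `violating_bins`.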

VI. CONCLUSION

In this paper, we proposed an NUSBP algorithm for dynamic power optimization in multitier gate-level monolithic 3-D ICs. For each cell, the algorithm finds an optimal location that minimizes the sum of the dynamic power consumption of the nets connected to the cell without violating the timing and density constraints. The simulation results show that the algorithm outperforms the USBP algorithm by 2% to 14% for two- to four-tier designs.

REFERENCES

[1] P. Batude et al., “Advances in 3D CMOS sequential integration,” in Proc. IEEE Int. Electron Devices Meeting, Baltimore, MD, USA, Sep. 2009, pp. 1–4.

[2] D. Henry et al., “Via first technology development based on high aspect ratio trenches filled with doped polysilicon,” in Proc. IEEE Electron. Compon. Technol. Conf., Reno, NV, USA, May 2007, pp. 830–835.

[3] J. U. Knickerbocker et al., “Three-dimensional silicon integration,” IBM J. Res. Develop., vol. 52, no. 6, pp. 553–569, Nov. 2008.

[4] S. Panth, K. Samadi, Y. Du, and S. K. Lim, “Design and CAD methodologies for low power gate-level monolithic 3D ICs,” in Proc. Int. Symp. Low Power Electron. Design, Aug. 2014, pp. 171–176.

[5] D. H. Kim, S. Mukhopadhyay, and S. K. Lim, “TSV-aware interconnect distribution models for prediction of delay and power consumption of 3-D stacked ICs,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 33, no. 9, pp. 1384–1395, Sep. 2014.

[6] S. Bobba et al., “CELONCEL: Effective design technique for 3-D monolithic integration targeting high performance integrated circuits,” in Proc. Asia South Pac. Design Autom. Conf., Yokohama, Japan, Jan. 2011, pp. 336–343.

[7] Y.-J. Lee and S. K. Lim, “Ultrahigh density logic designs using monolithic 3-D integration,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 32, no. 12, pp. 1892–1905, Dec. 2013.

[8] S. Panth, K. Samadi, Y. Du, and S. K. Lim, “High-density integration of functional modules using monolithic 3D-IC technology,” in Proc. Asia South Pac. Design Autom. Conf., Yokohama, Japan, Jan. 2013, pp. 681–686.

[9] (2011). Nangate 45nm Open Cell Library, Nangate, Santa Clara, CA, USA. [Online]. Available: http://www.nangate.com

[10] G. Karypis and V. Kumar. hMETIS, A Hypergraph Partitioning Package Version 1.5.3. Accessed on Nov. 15, 2014. [Online]. Available: http://glaros.dtc.umn.edu/gkhome/metis/hmetis/download

Sheng-En David Lin (S’16) received the B.S. degree in electrical engineering from Washington State University, Pullman, WA, USA, in 2014, where he is currently pursuing the Ph.D. degree with the Department of Electrical Engineering and Computer Science.

His current research interests include modeling for very large scale integration (VLSI) circuits and systems and algorithms for VLSI CAD automation, with a current focus on the design of monolithic 3-D ICs.

Dae Hyun Kim (S’08–M’12) received the B.S. degree in electrical engineering from Seoul National University, Seoul, South Korea, in 2002, and the M.S. and Ph.D. degrees in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 2007 and 2012, respectively.

He is an Assistant Professor with the School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, USA. He conducted research on physical layout optimization with Cadence Design Systems, Inc., San Jose, CA, USA, from 2012 to 2014. His current research interests include electronic design automation and computer-aided design for very large scale integration (VLSI), high-performance and/or low-power VLSI and computer systems, and 3-D integrated circuits and systems.

Dr. Kim was a recipient of the Cadence Excellence in Innovation Award in 2014, the Defense Advanced Research Projects Agency Young Faculty Award in 2016, and the EECS Early Career Award from the School of Electrical Engineering and Computer Science at Washington State University in 2017.
