Top Banner
Parallel Standard Cell Placement on a Cluster of Workstations Faris H. Khundakjie, Patrick H. Madden Nael B. Abu-Ghazaleh and Mehmet Can Yildiz State University of New York at Binghamton Computer Science Department [email protected] Abstract In this paper we report experiences on a parallel implementation of a standard cell placement algorithm on a cluster of myrinet connected PCs. The proposed algorithm is based on a recently developed placement tool (called Feng Shui) that extends recursive bisection placement to incorporate global aspects of the design. This is achieved using an efficient and novel optimization called iterative deletion. We investigate several algorithmic and system-level optimizations. Contrary to previous attempts at parallelizing placement algorithms, initial experimental results show significant performance improvement with small reduction in the placement quality. Furthermore, the reduction in the placement quality does not increase with the number of processors. 1 Introduction With advances in VLSI fabrication technology, the size of circuits of interests has become extremely large and is continuously expanding. Physical design automation tools are needed to aid in design and layout of such circuits. However, the size of the circuits presents a similar challenge to the design automation tools: they must be able to provide good quality layouts with acceptable run times. In this paper, we consider VLSI standard cell placement – an important and difficult problem in physical design automation. The placement impacts circuit areas and wire delays profoundly; a poor placement may prevent a circuit from operating at an acceptable speed, or make too large for the design floor-plan. Furthermore, if the placement algorithm has high complexity, we will be unable to obtain a placement in an acceptable amount of time. This in turn may force us to sacrifice the placement quality to achieve faster placement time. Parallel processing offers the promise of increasing the performance and capacity of placement tools, enabling them to provide better solutions in faster times. The emergence and commercial success of clustering technologies is perhaps the most exciting development yet in the field of parallel processing: it finally allows scalable cost-effective parallel processing machines to be built [1, 3, 29]. Clusters approach the performance of custom parallel machines by using high-performance Local/System area networking technologies and standards (such as Myrinet [5], SCI [10, This research was supported in part by NSF under awards CCR-9988222 and EIA-991099 1
18

Parallel Standard Cell Placement on a Cluster of Workstations

Dec 12, 2022

Download

Documents

Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Parallel Standard Cell Placement on a Cluster of Workstations

ParallelStandardCell Placementona Clusterof Workstations�

FarisH. Khundakjie, PatrickH. MaddenNaelB. Abu-Ghazalehand MehmetCanYildiz

StateUniversityof New York atBinghamtonComputerScienceDepartment

[email protected]

Abstract

In this paperwe report experienceson a parallel implementationof a standardcell placementalgorithm on aclusterof myrinetconnectedPCs. Theproposedalgorithmis basedon a recentlydevelopedplacementtool (calledFeng Shui) thatextendsrecursive bisectionplacementto incorporateglobal aspectsof thedesign.This is achievedusinganefficientandnovel optimizationcallediterativedeletion.Weinvestigateseveralalgorithmicandsystem-leveloptimizations.Contraryto previousattemptsat parallelizingplacementalgorithms,initial experimentalresultsshowsignificantperformanceimprovementwith small reductionin the placementquality. Furthermore,the reductionintheplacementquality doesnot increasewith thenumberof processors.

1 Intr oduction

With advancesin VLSI fabricationtechnology, the size of circuits of interestshasbecomeextremely large and is

continuouslyexpanding. Physicaldesignautomationtools areneededto aid in designand layout of suchcircuits.

However, the sizeof the circuits presentsa similar challengeto the designautomationtools: they mustbe ableto

provide goodquality layoutswith acceptablerun times. In this paper, we considerVLSI standardcell placement

– an importantanddifficult problemin physicaldesignautomation.The placementimpactscircuit areasandwire

delaysprofoundly;a poorplacementmaypreventa circuit from operatingat anacceptablespeed,or make too large

for thedesignfloor-plan. Furthermore,if the placementalgorithmhashigh complexity, we will beunableto obtain

a placementin anacceptableamountof time. This in turn mayforceusto sacrificetheplacementquality to achieve

fasterplacementtime.

Parallel processingoffers the promiseof increasingthe performanceandcapacityof placementtools, enabling

themto providebettersolutionsin fastertimes.Theemergenceandcommercialsuccessof clusteringtechnologiesis

perhapsthemostexciting developmentyet in thefield of parallelprocessing:it finally allows scalablecost-effective

parallelprocessingmachinesto be built [1, 3, 29]. Clustersapproachthe performanceof customparallelmachines

by usinghigh-performanceLocal/Systemareanetworking technologiesandstandards(suchasMyrinet [5], SCI [10,

�This researchwassupportedin partby NSFunderawardsCCR-9988222andEIA-991099

1

Page 2: Parallel Standard Cell Placement on a Cluster of Workstations

17] andothers[4, 36, 37]) andlow-overheaduser-level spacecommunicationprotocols(suchastheBasicInterface

for Parallelism(BIP) [15], Illinois Fast Messages(IFM) [28] and others[11, 27]). Becausethey usecommodity

components,clustersareaffordable,scalableandeasyto build [32].

This paperpresentsour experiencesin parallelizinga state-of-the-artplacementtool on a clusterof myrinetcon-

nectedworkstations.The algorithmusesstandarditerative partitioningbut augmentsthat with an iterative deletion

stepthat incorporatesglobalaspectsof thepartition in additionto the local optimizationsobtainedthroughthe itera-

tivebisectionstep.

Thecontributionsof this paperarethe following: (i) the initial sequentialalgorithmis state-of-the-artboth from

a run-timecomplexity andsolutionquality perspectives;(ii) theparallelimplementationprovidessignificantspeedup

while maintainingveryhighqualitysolutionsregardlessof thenumberof processorsused.In contrast,existingparallel

implementationsgenerallysuffer significantdegradationin solutionquality; (iii) mostpublishedsolutionsarespecific

to an architectureandarenot directly portableto a clusterenvironment. To our knowledge,this is the first study

that investigatesthis importantproblemon a modernclusterenvironment;and(iv) we investigateseveralalgorithmic

optimizationsas well as system-level tradeoffs in the implementation. This includesa study of the effect of the

Myrinet network [5] runningthe BIP messagepassinglibrary relative to using100 Mbit/secswitchedEthernetfor

communication.

Theremainderof this paperis organizedasfollows. Section2 theplacementproblemandrelatedwork. Section3

discussesthesequentialplacementalgorithmin moredetail.Section4 presentsthedetailsof theparallelimplementa-

tion. Section5 presentstheexperimentalsetupandresults.Finally, Section6 presentssomeconcludingremarks.

2 Overview and RelatedWork

Placementalgorithmshave beenextensively studied;objectives suchas areaminimization, wire length mini-

mization,andtiming optimizationarecommon. For all but trivial problemsizes,we have only heuristicmethods;

optimizationis usuallyperformedon only a small subsetof the circuit at any given time. If we perturba subsetof

circuit elements,we canoptimizea placementfrom a local perspective,but have no guaranteethatour modifications

areappropriatefrom a global perspective.

We considerthe problemof standardcell placement:the objective is to placerectilinearcircuit elements(cells)

into oneor morehorizontalrows,minimizing total wire length. Therearea numberof establishedapproachesto the

placementproblem.For a comprehensive survey, the readeris referredto the following paper[33]. ForceDir ected

or LP basedapproachesrepeatedlysolve systemsof equations,determiningcell locationsiteratively (for example,

2

Page 3: Parallel Standard Cell Placement on a Cluster of Workstations

L R

First partitioning

SecondPartitioning

ThirdPartitioning

Figure1: Orderof bisectionsin a top-down recursive approach.Thepartitioningof L influencesthesolutionfor R,andviceversa.

[13]). This approachis popular in commercialplacementtools. Simulated Annealing basedapproachesobtain

cell placementsby swappingpositionsof cells randomly, guidedby a probabilisticacceptancefunction. A number

of currentcommercialplacementenginesutilize this approach;efficient cost estimatesallow the considerationof

large numbersof intermediatestates. A well known exampleof this approachis TimberWolf[35]. Partitioning

basedapproachesdeterminecell locationsby recursively dividing aninitial area(region)with successivebisectionsor

quadrisections.This approachhasbecomemoreattractiverecently;advancesin partitioningresearchhaveprovideda

numberof fastalgorithmswhich produceextremelygoodresults.This is theapproachusedin this paper.

Breuer[6] utilized repeatedgraphbisectionsto obtaina circuit placement;this approachis shown in Figure 1.

The bisectionsdivide the circuit netlist into a hierarchyof cells, with the resultinghierarchyroughly mappinginto

a rectilineargrid. Dunlop andKernighan[12] extendedthis approach,throughthe useof an improved partitioning

method[20]. Moving beyondsimplebisections,SuarisandKedem[34] exploredtheuseof quadrisection(a four way

partitioning). Huangand Kahng[16] also apply quadrisection,utilizing a multi-level clusteringbasedpartitioning

algorithm,andconsideringminimumspanningtreelengths,ratherthanthesimplemin-cutmetric.

Dunlop andKernighan[12] also introduceterminal propagation. Whenpartitioninga region, we canexpecta

numberof connectionsto berequiredto cellsor padsoutsideof theregion. Terminalpropagationprovidesa simple

methodto insertfixed“dummy” vertices,sothatthepartitioningconsiderstheseexternalconnections(Figure2). With

terminalpropagation,thepartitioningsof regionsbecomeinterdependent;if we begin with two regions,L andR, and

partition L first, this impactsthe optimal solutionfor R. Partitioning R first might result in a differentsolution,and

neitherof thesemightbegloballyoptimal,evenif theindividualpartitioningswere.Toaddresstheorderdependenceof

thepartitioning,both[34] and[16] employ repeatedpartitioningateachlevel. We mightwish to partitionL, followed

by R, andthenpartitionL a second time. Repeatedpartitioningsdo not,however, changea localoptimizationprocess

into a globalone.

Becauseof thecomputationalcomplexity of placement,therehasbeensignificantinterestin parallelizingplace-

3

Page 4: Parallel Standard Cell Placement on a Cluster of Workstations

L1

L2

R2

R1

p

c2

c1

n1

Figure2: Terminalpropagationasperformedby DunlopandKernighan.Whenrecursively partitioninga netlist,wecaninsertdummy terminals to influencethepartitioner. If a netspansmorethanoneregion, the locationof dummyterminalscanimprovetheplacementquality.

mentalgorithmssincethe1980’s(seefor example[8, 9, 18, 21, 22, 23, 24, 26]; agoodsummaryis availablehere[2]).

Thesestudieshaveexperimentedwith parallelizingmostof theplacementalgorithmsonavarietyof parallelarchitec-

tures.Mostof thesestudiesarequiteold andarespecificto thearchitecturesthatwereusedto testthem;it is difficult

to compareperformancedirectly with them. In addition,the objective functionsthat the differentstudiesoptimize

aregenerallydifferent.Furthermore,researchin placementsuffersfrom thelack of uniform metricsfor reportingthe

results– wire lengthsarereported,but the resultscanvary by a multiplicative factordependingon the underlying

assumptions(suchascell spacingandthemethodof measuringwire length).Thismakesit difficult to fairly compare

the algorithms,even whenthe samemodelsareused.However, while direct comparisonis difficult, comparisonof

relativemeasuresbetweenthesequentialandparallelversionsof eachalgorithm(suchasspeedup,andquality degra-

dation)arestill possible. In the following paragraphwe will overview someof the mostrecentof theseworks. In

Section5 they arealsoreviewedaswe compareour resultsto them.

Banerjee’s researchgrouphasdonethe mostextensive work in the areaof parallelplacementalgorithms. For

example,within the ProperPLACE CAD tool they discussa parallel placementalgorithm basedon simulatedan-

nealing[21]. In contrastto otherparallelplacementimplementations,they work with an abstractparallelmachine

modelallowing the implementationto be directly portedacrossdifferentarchitectures.They studysharedmemory

anddistributedmemoryimplementations.Thespeedupacrossdifferentimplementationswerereportedon theISCAS

benchmarks.Koideet al presenta parallelimplementationof their POPINStiming drivenplacementtool [23]. They

usea bisectionapproachsimilar to our own, with non-linearprogrammingusedin a secondphaseoptimization.The

parallel implementationusessharedmemory. In this study, the drop of quality in the parallel implementationwas

significant(anaverageof 14%).

4

Page 5: Parallel Standard Cell Placement on a Cluster of Workstations

Split each region intoa pair of smaller regions,using Iterative Deletion

Partition eachpair of regions withhMetis

Largest regionremaining smallerthan threshold?

Branch-and-Bound optimizationof each row

Begin with a singleregion, comprising theentire placement area.

N

Y

Figure3: Flowchartof theplacementapproach.Thepartitioningandbranch-and-boundimprovementstepsarewellknown. New to theapproachis thepre-processingperformedby iterative deletion.

3 Algorithm Formulation

Thispaperconsidersparallelizinga placementtool (calledFeng Shui [38]) developedby two of theauthors.Feng

Shi adaptsa recentlypresentedpartitioningapproachfor k-waypartitioning.It differsfrom thatsolutionin thatit uses

a techniquecallediterative deletion [25] to allow someglobal issuesto becapturedresultingin improvedplacement

quality at significantly lower cost thank-way partitioning. Feng Shui integratesa variantof k-way partitioningap-

proachinto a traditional framework. The generalflow of our placementtool is asshown in Figure3. Whenfaced

with large numbersof regionsto bisect,iterative deletion is appliedfirst (to obtaina goodquality global solution),

followedby repartitioningof regionswith a traditionalbisectionapproach(to improve solutionquality from a local

perspective). Theremainderof this sectiondescribesthesestepsin moredetail.

3.1 Bisection

Theframework for theplacementtool is atextbookimplementationof theapproachof DunlopandKernighan[12]. The

circuit is repeatedlydividedby eitherhorizontalor verticalcut lines.A recentmulti-level clusteringbasedpartitioning

algorithmhMetis[19], version1.5.3 is used. At eachpartitioning,we attemptto obtaina nearlyexact bisectionif

cutting vertically. If the cut line is horizontal(splitting a numberof rows), the rows aresplit asevenly aspossible.

If the region beingbisectedcontainsan odd numberof standardcell rows, fixed andweighteddummyverticesare

addedto allow a nearlyexactbisectionto bemappedinto theavailablespace.Thepartitioningobjective is min-cut;

thehMetispartitionerattemptsto minimizethenumberof cuthyperedges.

5

Page 6: Parallel Standard Cell Placement on a Cluster of Workstations

The recursive bisectionprocessdividesplacementregionsinto progressively smallerareas,ultimately assigning

eachcell to a singlerow, but possiblyhaving severalcells remainingwithin a region. To establishpositionsfor each

cell, thecellsareorderedby region locationwithin eachrow; thecellsarepackedtogetherwithout spacesor overlap.

Thepositionsof cellswhich werewithin thesameregion will bearbitraryat this point; they werenot orderedby the

partitioningprocess.To optimizethesepositions,we applybranch-and-boundreordering,modifying thepositionsof

a smallsetof consecutivecellsin asinglerow.

Feng Shui allowsthespecificationof a“window size,” controllingthenumberof cellsinvolvedin any branch-and-

boundoptimization. This window passesover eachcell row (in order),traveling alongeachrow at stepsof half the

window size.At eachstep,theoptimalorderfor cells foundunderthewindow is determined.Thenumberof passes

over theplacement,andthesizeof thewindow, arebothparameterswhich canbecontrolledby theuser. In practice,

we find thatwindow sizesof 6 to 8 cells,and4 passesof improvement,aresufficient for goodoverall performance.

Increasingthewindow sizemayimpactrun timessubstantially(asthecomplexity of thebranch-and-boundprocedure

is O�w! � worstcase,wherew is thesizeof theoptimizationwindow). In [7], a numberof waysto implementbranch-

and-boundreorderingsefficiently wereexplored.

3.2 k-Way Partitioning

Thefocusof ourwork hasbeenontheglobal aspectsof theplacementproblem.With partitioning,wecanoptimizethe

numberof edgescut within a region effectively, but have no way of knowing if this local optimizationis appropriate

from a global perspective. Similarly, our branch-and-boundreorderingis alsoa local optimization.

A carefulexaminationof placementby recursive bisectionrevealsa numberof instanceswhereglobalobjectives

may be lost. The examplein Figure4 shows a simplecasewherelocal optimizationis insufficient; thereare four

regionsto bisect,eachwith two cells. If theproblemis approachedasa seriesof independentbisections,a numberof

configurationswhich arebothstableandsuboptimalcanbeencountered.Thesub-optimalityof theglobalsolutionis

not relatedto thequality of thebisectionsof eachregion; simply improving thebisectionalgorithmwill not improve

theglobalconfiguration.

As we progressthroughtheplacementprocess,thenumberof regionsincreases,doublingrepeatedly. If thereare

k regionsthatareto besplit (obtaining2k new regions),thetraditionalapproachis iterative,bisectinga singleregion

at a time. Instead,Feng Shui attemptsbisectionof all regionsat the sametime, obtaininga solutionthat is of good

quality globally. Themethodusedto performthis massivebisectionis basedon partitioningby iterativedeletion[25].

6

Page 7: Parallel Standard Cell Placement on a Cluster of Workstations

c4

c1

c2

c5

c3

c6

c7

c8

pad1

pad2

Netlistc

1c

2

c3

c4

c5

c6

c7

c8

pad1

pad2

c2

c4

c6

c8

pad1

pad2

c1

c3

c5

c7

Stable Placement Optimal Placement

R2

R3

R1

R4

Figure4: Giventhenetlistabove,we mayassignpairsof cellsto theregionsshown. If we partitionor applybranch-and-boundreorderingto thefour regions,we maybeunableto find anoptimalsolution.If we determinetheorderingin region R1 first, we arrive at a stableandsuboptimalsolution. Repeatedlocal optimization(throughrepartitioning)will fail to find thegloballyoptimalsolution.

3.3 NewFormulation – Iterati veDeletion

To captureglobalobjectiveseffectively, a variantof multi-way partitioning is used(ratherthanasa seriesof biparti-

tions). We partitionall regionssimultaneously, with the intermediatestateof eachregion influencingtheothers.We

areconcernedwith partitioningverylargenumbersof regions,andourcostobjectiveis wire lengthratherthanmin-cut.

Theproblemwe consideris given a set of regions (with physical constraints) and a set of elements mapped to these

regions, bisect all regions to minimize the resulting bounding-box wire length. Solutionof this problemoptimizesthe

circuit from a global perspective. Multi-way partitioninghasprovenquitechallenging[31]; for traditionalobjectives

suchasmin-cut, thegreatestsuccesshasbeenobtainedwith recursivepartitioning.

In [25], hypergraphpartitioningwasconsidered.A new methodbasedon iterative deletion waspresented;in this

approach,verticesareduplicated,with oneinstanceof eachvertex beingassignedto a partition. Redundantelements

areremovedoneat a time until no duplicatesremain.While theapproachwasrelatively simple,it provedeffective in

someareaswheretraditionalmethodshaddifficulty. For bipartitioning,cut sizesfrom a singlelinear-time passwere

comparableto many passesof a traditionalFM[14] algorithm.Multi-waycutsizesweresuperiorto adirectflat multi-

waypartitioningalgorithm[31]. For problemswith avarietyof hyperedgeweights,acombinationof iterativedeletion

andFM partitioningprovedsubstantiallymoreeffectivethanFM partitioningalone.Theapproachis computationally

attractive: with integerhyperedgeweights,it maybeimplementedin O�n � time.

Our variationof the iterative deletionapproachfor placementoperatesin the following manner. Eachcell in a

region is assignedto both subregions;if thereis morethana singleinstanceof acell, it is consideredto beredundant.

Werepeatedlyremoveredundantcellsfrom subregionswhichhavehighutilization,andselectthehighestcostcell for

removal.

In theexisting implementationof theplacementengine,thecell costis evaluatedbasedon thecenterof massfor

7

Page 8: Parallel Standard Cell Placement on a Cluster of Workstations

c1

c2

c1

c2 pad

n1

n2

c1

c1

c2 pad

n1

n2

c1 c

2 pad

n1

n2

Initial configuration,with duplicates forboth cells

On instance of has beenremoved, resulting in a changein the center of mass for somenets, and also new cell costs.

A second redundantinstance is removed.

Figure5: Two instancesof eachcell in thenetlistaregenerated,andassignedto bothsubregionsof any region. Cellcostsarebasedon thecenterof massfor thevertices;high costcellsareremovedoneat a time from any region of theentire placement problem. In this way, thepartitioningis performedon a globalbasis,ratherthanwith only a pair ofsubregionsat a time.

thecomponentnets.For eachnetni, thecenterof massfor this netis theaverageX andY locationof thecellswhich

it connects.Thecostof any cell ci is thesumof thedistancesbetweenthecell andthecenterof massof eachnet to

which the cell is connected.In this way, a cell which is far from thecenterof massof eachnet to which thecell is

connectedhashigh cost. Eachregion hasa numberof cellsassignedto it, andanavailablecapacity;redundantcells

(thoseintroducedby iterativedeletion)with highcostsareremovedfrom theregionwhichhasthehighestratioof cell

areato capacity.

Heapsareusedto maintaintheorderingof cellswithin any givenregion andtheorderingof regions. In this way,

maintenanceandcell selectionarebothatworstO�log n � for eachcell removed.As thenumberof redundantelements

to beremovedin any passis n, eachpassis O�nlog n � . Regionsizesdecreaseby a factorof 2 with eachpass,resulting

in a logarithmicnumberof passesrequired.Thus,theiterative deletion portionof thealgorithmis atworstO�nlog2n � .

To illustratethe iterative deletionprocess,we presentFigure5. In this figure,we duplicatecellsc1, in region R1,

andcell c2 in region R2, andassumethata netconnectsc1 to c2, andthata secondnetconnectsc2 to a pad.

Theorderof cell deletionsin this figure is asfollows. Note thatothercell deletionsmaybeinterspersedwith the

following; we focusonly on thesecellsto clarify theprocess.

� Thecenterof massfor thenetsconnectedto cell c2 is closestto thepad;we removetheinstanceof c2 which is

furthestfrom this location(asthis is theinstancewhich hashighestcost).

� Thecenterof massfor netsconnectedto cell c2 is recalculated,andthis is propagatedto theothercells.

� Netn1 now hastwo instancesof c1 andoneinstanceof c2 connectedto it: thecenterof massfor thisnetchanges,

influencingthecostfor eachinstanceof c1.

� An instanceof c1 is removed.

8

Page 9: Parallel Standard Cell Placement on a Cluster of Workstations

4 Parallel Cell Placement

Theoptimizedsequentialimplementationof theplacementalgorithmdiscussedabovewasinstrumentedto isolate

theportionsthataremostcomputationallydemanding.Thebulk of theexecutiontime wasspentin partitioningand

reorderingphases– iterativedeletionupdatescontributedvery little overheadaftereachpartitioningpass.Theparallel

implementationandoptimizationsto it (both thosewe have alreadyimplementedandonesthatareplanned)will be

discussedin this section.TheimplementationwasperformedusingMPICH runningon top of theBasicInterfacefor

Parallelism(BIP) [30] on a myrinetconnectedPCclusterrunningLinux.

4.1 Parallel Partitioning

In order to maintainthe global optimizationachieved by iterative deletion,parallel processingis restrictedto the

growing list of regions to be partitionedin eachpass. In the first passthe main region is partitionedby a master

processinto two. In thesecondpasstheresultingtwo regionsarepartitionedinto four in parallelby themasteranda

slaveandsoonuntil theprocessorlimit is reached.As thelist grows,regionsabovea certainthresholdaredistributed

evenlyby themasteramongall processes.An exampleis shown in Figure6.

Tp/n

Step 1

Step 2

Step 3

Send Regions OutTu

Tp

Step log(n)2

Figure6: ParallelPartitioning

Initially, we implementedthedatadistributionusingamessageperwork unit (i.e. region). Thus,themasterwalks

the list andsendsa messagefor eachregion to the slavesin a round-robin fashion.We optimizedthis implementa-

tion by generatinga messageper eachbatchof work units usingMPI’s vector scatter-gather. Obviously, the latter

9

Page 10: Parallel Standard Cell Placement on a Cluster of Workstations

outperformsthefirst becauseit increasesthecomputationgranularityby requiringthemasterto communicatefewer

messageswith largersize.Loadbalanceis maintainedby theaforementionedevendistribution of regionsandby the

factthatthepartitioningengineproducesbalancedregioncutsaswell.

BecauseVLSI CAD applicationsrequirelargememoryspace,it is necessaryto utilize memoryefficiently. In order

to doso,unicastmaster-to-slaveandbroadcastcommunicationmessagesareusedto instructslavesto allocateonly the

spacerequiredfor storingandprocessingtheirworkloadin theroundrobinandvectorscatter-gatherimplementations

respectively. Suchmessagesincurslight overheadthatis negligible for non-trivial placementproblems.

An approximateanalysisof theparallelpartitioningfollows. Considera region of areaR to bepartitionedinto N

regionsjustbelow thethreshold.Assumingalinearruntimecomplexity of thepartitioningengineandahomogeneous

distribution of circuit elements,if it takestime Tp to partitionR, thenpartitioninga regionof areaRn requiresTp

n . The

progressionof region partitioningis modeledby thetreein Figure6. A sequentialimplementationperformsn steps,

with steptime Tpn , at eachpassto partition n regionsinto 2n regionswith a total time of Tp. The numberof passes

is log2N, andso the total amountof partitioningtime is Tp � log2N. On the otherhand,usingP processors,the first

log2P � 1 passeshave n �� P andeachtakes Tpn to complete(i.e. a maximumof oneregion perprocessor)while the

resthaven p andeachtakesceil(n/p)x Tp/n to complete(i.e. ceil(n/p)regionsperprocessor).Thetotal time for the

first groupof passesis:

log2P

∑i � 0

Tp

2i (1)

andthetotal time for therestis:log2N � 1

∑i � log2P 1

Tp

P� Tp

P � � log2N � log2P � 1� (2)

For machineswith smallP, overawide rangeof problemsize,N is muchlargerthanP (i.e. thenumberof passeswith

n �� p is muchlessthanthosewith n P), andif log2N � log2P � 1, thenthe total time is givenby thesecond

formula andis approximatedby Tpp � log2N. The bestcasespeedup(sequential/paralleltime) is hencelinear in the

numberof processors.

Costupdateandcommunicationtimescouldbeincorporatedby time Tu aftereachpass.Thetotalof thisoverhead

is thegeometricseriesandis givenby:

log2N � 1

∑i � 0

2i � Tu � Tu ��2 � log2N � � 1��

2 � 1� � �N � 1� � Tu (3)

SinceTu is largerin theparallelimplementationdueto communicationtime,theactualspeedupcannotbeexpected

to belinear.

10

Page 11: Parallel Standard Cell Placement on a Cluster of Workstations

4.2 Parallel Reordering

��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������

��������������������������������������� � � � � � � � � � � � � � � � � � � !�!�!�!�!�!�!!�!�!�!�!�!�!!�!�!�!�!�!�!"�"�"�"�"�""�"�"�"�"�""�"�"�"�"�"#�#�#�#�#�#�##�#�#�#�#�#�##�#�#�#�#�#�#$�$�$�$�$�$$�$�$�$�$�$$�$�$�$�$�$%�%�%�%�%�%�%%�%�%�%�%�%�%%�%�%�%�%�%�%&�&�&�&�&�&�&&�&�&�&�&�&�&&�&�&�&�&�&�&'�'�'�''�'�'�''�'�'�'(�(�(�((�(�(�((�(�(�()�)�)�)�)�))�)�)�)�)�))�)�)�)�)�)*�*�*�*�*�**�*�*�*�*�**�*�*�*�*�*

+�++�++�++�++�+,�,,�,,�,,�,,�,

-�-�-�-�--�-�-�-�--�-�-�-�-.�.�.�.�..�.�.�.�..�.�.�.�. /�/�/�//�/�/�//�/�/�/0�0�0�00�0�0�00�0�0�01�1�1�1�1�1�11�1�1�1�1�1�11�1�1�1�1�1�11�1�1�1�1�1�11�1�1�1�1�1�12�2�2�2�2�2�22�2�2�2�2�2�22�2�2�2�2�2�22�2�2�2�2�2�22�2�2�2�2�2�2

3�3�3�33�3�3�33�3�3�34�4�4�44�4�4�44�4�4�45�5�55�5�55�5�55�5�55�5�56�66�66�66�66�6

7�7�7�7�7�7�77�7�7�7�7�7�77�7�7�7�7�7�77�7�7�7�7�7�77�7�7�7�7�7�78�8�8�8�8�8�88�8�8�8�8�8�88�8�8�8�8�8�88�8�8�8�8�8�88�8�8�8�8�8�8

9�9�9�99�9�9�99�9�9�9:�:�:�::�:�:�::�:�:�: ;�;�;;�;�;;�;�;;�;�;;�;�;<�<�<<�<�<<�<�<<�<�<<�<�<=�=�=�==�=�=�==�=�=�==�=�=�==�=�=�=>�>�>�>>�>�>�>>�>�>�>>�>�>�>>�>�>�>

?�?�?�??�?�?�??�?�?�?@�@�@�@@�@�@�@@�@�@�@A�A�A�A�AA�A�A�A�AA�A�A�A�AB�B�B�B�BB�B�B�B�BB�B�B�B�BP0Master Slave

P1Slave

Odd Pass Windows

P2

Data Set

Even Pass Windows

Individual cells

Figure7: ParallelReorderingPasses

Branchandboundcell reorderingis a secondphaseoptimizationthat is performedafter the placementstepto

improve thesolutionquality. A reorderingwindow of a prespecifiedwidth in numberof cells is slid acrosstheeach

row in stepsof half the window size. All combinationsof reorderingsof the cells within the currentwindow are

consideredandthe bestwindow kept. Thus,the smallestwork unit is the block of consecutive cells definedby the

reorderingwindow. The larger the window size the larger the block and the large the numberof cells considered

for reorderingat a time, but the smallerthe numberof blockssincerows areof fixed length. All cellsexcepthalf a

window’s worth at thebeginningof eachrow areconsideredfor reorderingtwice. To keepthis key for quality in the

parallelimplementation,only non-overlappingblocksarereorderedin parallel.Thus,eachrow undergoestwo passes

of reorderingwhereawindow now movesatstepsof full size.Thefirst passconsidersevenblocksonly andthesecond

considersoddblocksonly or viceversa.

The sameprinciplesfor partitioning load balanceandcommunicationare followed here;blocksaredistributed

evenlyamongall processorsusingvectorscatter-gatherasshown in Figure7. Becausethereis oftena largenumberof

rows in a placement,it is crucial to optimizememoryreallocation;allocatedmemoryis not releasedunlessthespace

requiredto holdblocksof anew row is largerthantheavailablespace.

5 Experiments

Theparallelimplementationof theplacementtool wasevaluatedfor executiontime andquality of solutionon a

11

Page 12: Parallel Standard Cell Placement on a Cluster of Workstations

Benchmark Rows Cells Netsfract 6 149 163struct 21 1952 1920primary1 17 833 904primary2 22 3014 3029biomed 44 6514 7052industry2 69 12637 13419industry3 52 15433 21967avqsmall 79 21918 30038avqlarge 83 25178 33298golem3 117 100312 217362

Table1: Theplacementbenchmarksconsidered.Thefirst four benchmarksaretheISCASBenchmarkcircuits. Thenumberof rowsusedwasdeterminedby theTimberWolf placementandroutingtools.

Benchmark Wire Length RunTime(sec)Sequential Parallel(ratio) Sequential 2 Processors 4 Processors 8 Processors

fract 66242 69498.4(1.05) 2.6 7 7 6.9struct 775636 797647.6(1.03) 41.9 30.5 21.5 16.6primary1 1053258 1064441(1.01) 18.6 16.9 13.2 11.8primary2 3747715 3855615.4(1.03) 80.2 57.1 39.2 30.1biomed 3382200 3514514(1.04) 162 111.8 79.4 64.5industry2 15629154 16508315(1.06) 359.4 227.2 155.5 117.8industry3 45088174 47757192.8(1.06) 538.6 338.5 218.1 157avqsmall 6041243 6626165.6(1.1) 925.1 746.8 598.9 566.3avqlarge 6344469 7053757.8(1.11) 1006 788.2 645.5 603.9golem3 88959019 92830917(1.04) 3235.9 2191.2 1243.1 918.1

Table2: Wire LengthsandRunTimesfor theParallelImplementationUsingScatter/Gatheron Myrinet

clusterof eight550MHzPentiumIII workstations.Theclusteris interconnectedusingaMyrinet network; theMyrinet

LAN cardsusea LANai 7 33MHz processorand33MHz, 32-Bit PCI bus. The machinesare also connectedvia

switched100Mbit/secEthernet.In [38], Feng Shui’sperformancewascomparedagainstcommerciallyavailabletools

andfound to producebetterresultswith smallerrun times. In this paper, we only focuson the performanceof the

parallelversion.Unlesswe indicateotherwise,all resultsaretheaverageof fiverunswith fivedifferentseeds.Table1

shows thebenchmarksusedin theexperiments.Thesebenchmarksspana wide rangeof circuits,up to andincluding

thelargestavailablepublicdomaincircuits.

Table2 shows theexecutiontimesof theparallelalgorithm(usingscatter/gather).Theseresultswereabout30%

fasterthanour initial implementationwith roundrobin region distribution. Thequality of theparallelversionis not

affectedby thenumberof processors.Thedegradationin quality is dueto thefact that theregion processingis done

in parallelandin phases.This is in contrastto thesequentialversionwheretheregionsareupdatedsequentially, with

every subsequentregion usingtheupdatedvaluesof theregionsprocessedbeforeit. We notethatthesmallestmodel

(fract) suffersa slowdown in the parallelimplementation;its sizemakesthe communicationoverheaddominatethe

executiontime. Note thateventhoughthetableshows theresultsfor 2, 4 and8 processorsonly, thealgorithmis not

12

Page 13: Parallel Standard Cell Placement on a Cluster of Workstations

restrictedto powerof two processorconfigurations.

Theresultsuseareorderwindow sizeof 6. As wewill show later, awindow sizeof 6 wasfoundto providethebest

qualityto executiontimepoint(biggerwindow sizetookmuchlongerto executeandprovidedmarginalimprovement).

Thedropin qualitywassmallfor mostbenchmarks(lower than6%). Theexceptionwasavqlargeandavqsmallwhich

sufferedaround10%reductionin solutionquality. Thesetwo benchmarksalsodid not achievegoodspeedupdespite

their large size. We arecurrently looking morecloselyat their behavior to try to gain insight into their behavior;

otherstudiesreportdifficultieswith thesebenchmarks[23]. Thereductionin quality is significantlysmallerthanthat

reportedin otherparallelplacementstudies.For example,in onestudythequalitydegradationof theparallelsolution

reached30% (with an averageof 14%) [23] on a 4 processorstudy. In anotherstudy, the quality dropalsoreached

over30%with anaverageof around20%. Both studiesonly reportresultson a subsetof thebenchmarksusedin this

study.

1

1.5

2

2.5

3

3.5

4

1 2 3 4 5 6 7 8 9

Spe

edupC

Number of Processer

Speedup for the BIP Experiments

Golem3AvqlargeIndustry3Industry2

Biomed

Figure8: Speedupfor theMyrinet/BIP Experiments

Figure8 shows the speedupobtainedby the parallelversionon selectedbenchmarks.We notethat the speedup

generallyincreaseswith thesizeof themodel;thelargerthemodelthelargerthegranularityof thecomputation.The

only exceptionis avqlarge– despitebeingthesecondlargestmodel,it benefitsleastof the5 modelsshown.

Table3 shows the run times for selectedbenchmarksusingethernetcommunication.The quality resultswere

identicalto the Myrinet version. Clearly, the performanceis significantlyworsethanthe Myrinet version. This can

moreclearlybeseenin Figure9.

13

Page 14: Parallel Standard Cell Placement on a Cluster of Workstations

Benchmark RunTime(sec)Sequential 2 Processors 4 Processors 8 Processors

industry2 359.4 283 229.7 212.6industry3 538.6 402.9 309.1 263.9avqsmall 925.1 894.6 816 772.9avqlarge 1006 946.2 872.6 842.3golem3 3235.9 2328 1983 1635

Table3: Wire LengthsandRunTimesfor theParallelImplementationUsingScatter/Gatheron Ethernet

1

1.5

2

2.5

3

3.5

4

1 2 3 4 5 6 7 8 9

Spe

edupC

Number of Processer

Speedup Comparison between BIP and Ethernet

Golem3-bipAvqlarge-bipIndustry2-bip

Golem3-ethernetAvgqlarge-ethernetIndustry2-ethernet

Figure9: SpeedupComparisonMyrinet/BIPvs. Ethernet

Iterativedeletionprovedusefulin thesequentialversionof thetool,yieldingafew percentimprovementonaverage

with asmallincreasein executiontime. Weattemptedto verify whetheriterativedeletionwill havethesameeffecton

theparallelversion.Table4 presentstheresultsof thisstudy. While iterativedeletionresultedin 0.7%improvementon

average,it did performworsefor anumberof benchmarks.Mostnotably, thetwo problematicbenchmarksin termsof

speedupandqualitydegradation(avqsmallandavqlarge)sufferedthebiggestdegradationfrom usingiterativedeletion.

Without thosetwo, theimprovementbecomes1.4%.

We alsoinvestigatedthe effect of the sizeof the reorderwindow (from the default 6 to 8). The executiontime

rosesharplyfor all benchmarks,but theefficiency of theparallelismfor thereorderportionof thealgorithmraisedthe

overall speedup(ascanbeseenin Table5). We notehowever thatthegainin quality is marginal;we would bebetter

off to executewith window size6 sequentiallyratherthanwindow size8 – the gain from thehigherwindow sizeis

smallerthanthelossdueto parallelism.

14

Page 15: Parallel Standard Cell Placement on a Cluster of Workstations

Benchmark Wire Length Improvement(%)Without ID With ID

fract 68502 69498.4 -1.4struct 804123 797647.6 0.8primary1 1084639 1064441 1.9primary2 3914142 3855615.4 1.5biomed 3506130 3514514 -0.2industry2 16488752 16508315 -0.1industry3 48973501 47757192.8 2.5avqsmall 6479309 6626165.6 -2.3avqlarge 6947883 7053757.8 -1.5golem3 96020944 92830917 3.3

Table4: Effectof IterativeDeletionon theParallelExecution

Benchmark Window Size6 8

Time Speedup Wire Length Time Speedup Wire Lengthstruct 16.7 2.5 789777 34.3 3.32 782683primary2 30.2 2.65 3863183 77.8 3.69 3852182biomed 64.3 2.52 3494020 110 3.22 3472616industry2 117.9 3.04 16444527 301.1 4.1 16380765avqsmall 561.1 1.64 6670242 749.7 2.88 6603005

Table5: Effectof ReordingWindow Size– 8 Processors

Theseresultsrepresentwhatis very mucha work-in-progress.We expectto furtherrefinetheexisting implemen-

tation in the following ways. Currently, the region list is relayedbackto the masterat the endof every partitioning

stepto updatethe dependencies,beforeresendingthe resultsbackout to the slaves. This causesa lot of redundant

communicationaswell asa sequentialwalk of thelist. Thenext stepis to allow a parallelwalk of thelist andupdate

of theregionsby doinganall to all exchange,insteadof the“reduction”backto themasterprocess.Fromthequality

perspective, we areworking on maintainingthesequentialdependenciesusinga “wavefrontpipeline” scheme.This

formulationshouldproducethesamequalityasthesequentialversion,althoughthespeedupwill notbeashighasthe

currentformulation.

6 Concluding Remarks

With the exponentialgrowth in the sizeof circuits underfabrication,efficient physicaldesigntools areneeded.

Parallel processingoffers the promiseof increasingthe performanceandcapacityof thesetools. The emergenceof

clustersascost-effective scalablehigh-performancecomputingplatformbringsthe long overduepromiseof parallel

processingto themainstream.Theinvestigationof parallelizationof physicaldesigntoolsusingclustersis timely.

In this paper, we presentedexperienceswith parallelizinga state-of-the-artplacementtool on a clusterof work-

15

Page 16: Parallel Standard Cell Placement on a Cluster of Workstations

stations.Thesequentialtool provideswire lengthscomparableto thoseof a well known commercialtool, andresults

reportedfor thetool Capo indicatethatplacementsarenotdifficult to route.Thesequentialimplementationis efficient;

thegrowth in run timesis nearlylinearwith thesizeof thecircuit. Thetool usesa novel iterative deletionapproach

to allow the considerationof global objectivesfrom within a traditionaltop-down placementframework. For more

detailspleasereferto thefollowing paper[38].

Contraryto othereffortsatparallelplacement,wewereableto obtainsignificantimprovementin performancewith

minimal degradationin thequalityof thelayout.Thedegradationin thequality of thesolutiondoesnot increasewith

thedegreeof parallelism.Weexploredseveralalgorithmicandsystemoptimizationsandevaluatedtheimplementation

ona myrinetaswell asaswitchedEthernetnetwork. We arestill in theprocessof optimizingtheimplementationand

arehopefulof achieving higherspeedupin time for thefinal versionof this paper.

Weexpectto continueto refinethis tool bothfrom animplementationandfunctionalityperspectives.For example,

timing drivenplacementisasignificantconcernfor moderndesign.Wearecurrentlyworkingwith anindustryresearch

groupto evaluatethe performanceof our approachon large designsunderrealisticdelayrules. We notethat delay

optimizationis perhapsmoreof a globalphenomenathanwire lengthminimization: meetingtiming objectivesmay

requiremodificationsin many areasof aplacement,andreductionsin delayfor somenetsmayrequireincreaseddelay

in others.

References

[1] T. Anderson,D. Culler, andD. Patterson.The casefor NOW (network of workstations). IEEE Micro, 15(1),February1995.

[2] P. Banerjee.Parallel Algorithms for VLSI Computer-Aided Design. PrenticeHall, EnglewoodCliffs, New Jersey,1994.(Chapter3).

[3] D. Becker, T. Sterling,D. Savarese,J.Dorband,U. Ranawak,andC. Packer. BEOWULF: A parallelworkstationfor scientificcomputation.In International Conference on Parallel Processing, 1995.

[4] M. Blumrich,R.Alpert,Y. Chen,D. Clark,S.Damianakis,C.Dubnicki,E.Felten,L. Iftode,K. Li, M. Martonosi,andR. Shillner. Designchoicesin the shrimpsystem:An empiricalstudy. In Proceedings of the 25th AnnualACM/IEEE International Symposium on Computer Architecture, June1998.

[5] N. Boden,D. Cohen,andW.-K. Su. Myrinet: A gigabit-per-secondlocal areanetwork. IEEE Micro, 15(1),February1995.

[6] M. A. Breuer. A classof min-cutplacementalgorithms.In Proc. Design Automation Conf, pages284–290,1997.

[7] A. E. Caldwell, A. B. Kahng,and I. L. Markov. Optimal partitionersandend-caseplacersfor standardcelllayout. In Proc. Int. Symp. on Physical Design, pages90–96,1999.

[8] A. CasottoandA. Sangiovanni-Vincentelli. Placementof standardcellsusingsimulatedannealingon thecon-nectionmachine.In Proceedings of the International Conference on Computer-Aided Design, pages350–353,November1987.

16

Page 17: Parallel Standard Cell Placement on a Cluster of Workstations

[9] F. Darema,S. Kirkpatrick, andV. Norton. Parallelalgorithmsfor chip placementby simulatedannealing.IBMJournal of Research and Development, May 1987.

[10] Dolphin Internconnects.The Dolphin SCI interconnect– whitepaper, February1996. Availablefrom DFEGEIHKJLGLIMINGOQPSRUTIVQPXWYTQP[ZG\FL HF]I^Y_`^badc`edf LdgbhIi DjakEYedlFmnHbe`f P HF]I^ .

[11] C. Dubnicki,A. Bilas,Y. Chen,S.Damianakis,andK. Li. VMMC-2: Efficient supportfor reliable,connection-orientedcommunication.In Hot Interconnects V, August1997.

[12] A. E. DunlopandB. W. Kernighan.A procedurefor placementof standard-cellVLSI circuits. IEEE Trans. onComputer-Aided Design of Integrated Circuits andSystems, CAD-4(1):92–98,January1985.

[13] H. EisenmannandF. M. Johannes.Genericglobal placementandfloorplanning. In Proc. Design AutomationConf, pages269–274,1998.

[14] CharlesM. FiducciaandR. M. Mattheyses. A linear-time heuristicfor improving network partitions. In Pro-ceedings of the 19th IEEE Design Automation Conference, pages175–181,1982.

[15] P. Geoffray, L. Prylli, andBernardTourancheau.BIP-SMP:High performancemessagepassingoveraclusterofcommoditySMPs.In Proceedings of Supercomputing (SC99), November1999.

[16] D. J.-H.HuangandA. B. Kahng. Partitioningbasedstandardcell globalplacementwith anexactobjective. InProc. Int. Symp. on Physical Design, pages18–25,1997.

[17] M. Ibel, K. Schauser, C. Scheiman,and M. Weis. High performanceclustercomputingusing SCI. In HotInterconnects V, August1997.

[18] R. JayaramanandR. Rutenbar. Floorplanningby annealingon a hypercubemultiprocessor. In Proceedings ofthe International Conference on Computer-Aided Design, pages346–349,November1987.

[19] G. Karypis,R. Aggarwal, V. Kumar, andS. Shekhar. Multilevel hypergraphpartitioning: Application in VLSIdomain.In Proc. Design Automation Conf, pages526–529,1997.

[20] Brian W. KernighanandS. Lin. An efficient heuristicprocedurefor partitioninggraphs.Bell System TechnicalJournal, 49:291–307,1970.

[21] S. Kim, B. Ramkumar, J. Chandy, S. Parkes,andP. Banerjee.ProperPLACE: a portableparallelalgorithmforstandardcell placement.In Proceedings of the 8th International Parallel Processing Symposium (IPPS’94), April1994.

[22] R. Kling andP. Banerjee.ConcurrentESP:a placementalgorithmfor executionon distributedprocessors.InProceedings of the International Conference on Computer-Aided Design, pages354–357,November1987.

[23] T. Koide,M. Ono,S.Wakabayashi,andY. Nishimaru.Par-POPINS:a timing-drivenparallelplacementmethodwith the elmoredelay model for row basedVLSIs. In Proceedings of the Design Automation Conference(DAC’97), 1997.

[24] S. Kravitz and R. Rutenbar. Placementby simulatedannealingon a multiprocessor. IEEE Transactions onComputer Aided Design of Integrated Circuits and Systems, 6(4):534–549,June1987.

[25] P. H. Madden.Partitioningby iterativedeletion.In Proc. Int. Symp. on Physical Design, pages83–89,1999.

[26] S.MohanandP. Mazumder. Wolverines:Standardcell placementon anetwork of workstations.IEEE Transac-tions on Computer Aided Design of Integrated Circuits and Systems, 12(9):1312–1326,September1993.

[27] M-VIA: Virtual interfacearchitecturefor linux, 2001. DGEGEIHKJ LILnoGoGoKPqp e`fsrGt PXuYvnwbL fYeFrdeGmnfstxD LnyIgFzGLdw a{m L .[28] S.Pakin, M. Lauria,andA. Chien. High performancemessagingon workstations:Illinois FastMessages(FM)

for Myrinet. In Proceedings of Supercomputing (SC’95), 1995.

[29] G. Pfister. In Search of Clusters, 2nd Ed. PrenticeHall, 1998.

17

Page 18: Parallel Standard Cell Placement on a Cluster of Workstations

[30] L. Prylli. BIP messagesusermanual,1998. Available at DFEGE`HKJ LGL c{DGH|tdm Pq}Gp a wFh cd~ vdp�R�P ^`f LU�j� l hd� m pI} mFc La p ]Ge`� P DFE � c .[31] L. A. Sanchis. Multiple-way network partitioningwith differentcost functions. IEEE Trans. on Computers,

42(22):1500–1504,1993.

[32] ScalableComputingLab. SCL clustercookbook:Building your own clustersfor parallelcomputation,1998.DFEGE`HKJ LGLnoIoGoKP rGtdc P m � eYr`cIm{� P�uFvdwbL l`f vI� ebtUEsr L`� c } rUEYenf �Fv`v`� � vIvd� .

[33] K. ShahookarandP. Mazumder. Vlsi cell placementtechniques.ACM Computing Surveys, 23(2):143–220,June1991.

[34] P. R. SuarisandG. Kedem.An algorithmfor quadrisectionandits applicationto standardcell placement.IEEETrans. on Circuits and Systems, 35(3):394–303,1988.

[35] W.-J. SunandC. Sechen.Efficient andeffective placementfor very largecircuits. IEEE Trans. on Computer-Aided Design of Integrated Circuits andSystems, 14(3):349–359,1995.

[36] Virtual interfacearchitecture(VIA) specification,2001. DFEGE`HKJ LGLnoIoGoKPXw a{m`fstUD P�v f u .

[37] M. Welsh,A. Basu,andT. von Eicken. Atm andfastethernetnetwork interfacesfor user-level communication.In Proceedings of the Third High-Performance Computer Architecture Conference (HPCA’97), February1997.

[38] M. Yeldiz andP. H. Madden.Globalobjectivesfor standardcell placement.In Design Automation Conference(DAC’2000), 2000.

18