Parallel Standard Cell Placement on a Cluster of Workstations

Faris H. Khundakjie, Patrick H. Madden, Nael B. Abu-Ghazaleh and Mehmet Can Yildiz
State University of New York at Binghamton
Computer Science Department
[email protected]

Abstract

In this paper we report experiences on a parallel implementation of a standard cell placement algorithm on a cluster of Myrinet-connected PCs. The proposed algorithm is based on a recently developed placement tool (called Feng Shui) that extends recursive bisection placement to incorporate global aspects of the design. This is achieved using an efficient and novel optimization called iterative deletion. We investigate several algorithmic and system-level optimizations. Contrary to previous attempts at parallelizing placement algorithms, initial experimental results show significant performance improvement with a small reduction in the placement quality. Furthermore, the reduction in the placement quality does not increase with the number of processors.

1 Introduction

With advances in VLSI fabrication technology, the size of circuits of interest has become extremely large and is continuously expanding. Physical design automation tools are needed to aid in the design and layout of such circuits. However, the size of the circuits presents a similar challenge to the design automation tools: they must be able to provide good quality layouts with acceptable run times. In this paper, we consider VLSI standard cell placement – an important and difficult problem in physical design automation. The placement impacts circuit area and wire delays profoundly; a poor placement may prevent a circuit from operating at an acceptable speed, or make it too large for the design floor-plan. Furthermore, if the placement algorithm has high complexity, we will be unable to obtain a placement in an acceptable amount of time. This in turn may force us to sacrifice placement quality to achieve faster placement time.
Parallel processing offers the promise of increasing the performance and capacity of placement tools, enabling them to provide better solutions in less time. The emergence and commercial success of clustering technologies is perhaps the most exciting development yet in the field of parallel processing: it finally allows scalable, cost-effective parallel processing machines to be built [1, 3, 29]. Clusters approach the performance of custom parallel machines by using high-performance local/system area networking technologies and standards (such as Myrinet [5], SCI [10,

This research was supported in part by NSF under awards CCR-9988222 and EIA-991099.
Figure 2: Terminal propagation as performed by Dunlop and Kernighan. When recursively partitioning a netlist, we can insert dummy terminals to influence the partitioner. If a net spans more than one region, the location of dummy terminals can improve the placement quality.
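The idea in the Figure 2 caption can be illustrated with a minimal sketch. It assumes rectangular regions and places each dummy terminal at the point of the region nearest to the external pin; this projection rule, and the names `Region`, `dummy_terminal` and `propagate_terminals`, are illustrative assumptions rather than the exact procedure of Dunlop and Kernighan [12].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Axis-aligned rectangular placement region (hypothetical helper type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x, y):
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def dummy_terminal(region, x, y):
    """Project an external pin at (x, y) onto the nearest point of the region.

    Assumption: nearest-point projection stands in for the real rule.
    """
    return (min(max(x, region.x_min), region.x_max),
            min(max(y, region.y_min), region.y_max))


def propagate_terminals(region, net_pins):
    """Return one dummy terminal for every pin of the net outside the region."""
    return [dummy_terminal(region, x, y)
            for (x, y) in net_pins
            if not region.contains(x, y)]


if __name__ == "__main__":
    r = Region(0, 0, 10, 10)
    # One pin inside the region, two outside it.
    print(propagate_terminals(r, [(5, 5), (15, 3), (-2, 12)]))
```

A partitioner that treats these dummy terminals as fixed vertices is then biased toward placing the net's cells on the side of the region closest to the rest of the net.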
[Flowchart: Begin with a single region, comprising the entire placement area -> Split each region into a pair of smaller regions, using iterative deletion -> Partition each pair of regions with hMetis -> Largest region remaining smaller than threshold? (N: split again; Y: continue) -> Branch-and-bound optimization of each row.]

Figure 3: Flowchart of the placement approach. The partitioning and branch-and-bound improvement steps are well known. New to the approach is the pre-processing performed by iterative deletion.
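The control flow of Figure 3 can be sketched as a simple loop. This is only an outline under stated assumptions: the iterative deletion and hMetis steps are merged into a single caller-supplied `split_region` hook, the stand-in `halve` simply cuts a list in two, and the final branch-and-bound row optimization is omitted. None of these names come from Feng Shui itself.

```python
def place(cells, threshold, split_region):
    """Split regions repeatedly until the largest holds fewer than `threshold` cells.

    `split_region` is a hypothetical hook that stands in for the
    iterative deletion and hMetis steps; it must return two smaller regions.
    """
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    regions = [list(cells)]  # begin with one region holding every cell
    while max(len(r) for r in regions) >= threshold:
        next_regions = []
        for region in regions:
            if len(region) > 1:
                left, right = split_region(region)
                next_regions.extend([left, right])
            else:
                next_regions.append(region)  # a single cell cannot be split
        regions = next_regions
    return regions  # row-level branch-and-bound would run on these


def halve(region):
    """Stand-in splitter (assumption): cut the cell list in the middle."""
    mid = len(region) // 2
    return region[:mid], region[mid:]


if __name__ == "__main__":
    for region in place(range(10), threshold=3, split_region=halve):
        print(region)
```

Keeping the splitter as a parameter mirrors the flowchart: the loop and the termination test stay the same whichever partitioning technique is plugged in.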
3 Algorithm Formulation
This paper considers parallelizing a placement tool (called Feng Shui [38]) developed by two of the authors. Feng Shui adapts a recently presented partitioning approach for k-way partitioning. It differs from that solution in that it uses a technique called iterative deletion [25] to allow some global issues to be captured, resulting in improved placement
Figure 4: Given the netlist above, we may assign pairs of cells to the regions shown. If we partition or apply branch-and-bound reordering to the four regions, we may be unable to find an optimal solution. If we determine the ordering in region R1 first, we arrive at a stable and suboptimal solution. Repeated local optimization (through repartitioning) will fail to find the globally optimal solution.
3.3 New Formulation – Iterative Deletion
To capture global objectives effectively, a variant of multi-way partitioning is used (rather than a series of bipartitions). We partition all regions simultaneously, with the intermediate state of each region influencing the others. We
[Figure panels: One instance has been removed, resulting in a change in the center of mass for some nets, and also new cell costs. A second redundant instance is removed.]

Figure 5: Two instances of each cell in the netlist are generated, and assigned to both subregions of any region. Cell costs are based on the center of mass for the vertices; high cost cells are removed one at a time from any region of the entire placement problem. In this way, the partitioning is performed on a global basis, rather than with only a pair of subregions at a time.
the component nets. For each net n_i, the center of mass for this net is the average X and Y location of the cells which it connects. The cost of any cell c_i is the sum of the distances between the cell and the center of mass of each net to which the cell is connected. In this way, a cell which is far from the center of mass of each net to which the cell is connected has high cost. Each region has a number of cells assigned to it, and an available capacity; redundant cells
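The cost definition above, together with the one-at-a-time removal described in Figure 5, can be sketched as follows. This is a simplified illustration, not the Feng Shui implementation: the distance metric is assumed to be Manhattan (the text only says "distances"), region capacities are ignored, and the data layout and function names are hypothetical.

```python
from collections import Counter


def manhattan(p, q):
    """Assumed distance metric; the text does not specify one."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def instance_costs(instances, nets):
    """Cost of each cell instance: summed distance to the centers of its nets.

    instances: {(cell, region): (x, y)}, nets: {net_name: [cell, ...]}.
    """
    centers = {}
    for net, cells in nets.items():
        pts = [pos for (cell, _), pos in instances.items() if cell in cells]
        if pts:  # center of mass = average X and Y of the connected instances
            centers[net] = (sum(x for x, _ in pts) / len(pts),
                            sum(y for _, y in pts) / len(pts))
    return {key: sum(manhattan(pos, centers[net])
                     for net, cells in nets.items() if key[0] in cells)
            for key, pos in instances.items()}


def iterative_deletion(instances, nets):
    """Remove the costliest redundant instance until each cell has one left.

    Assumption: capacities are ignored; ties go to the first instance found.
    """
    remaining = dict(instances)
    while True:
        counts = Counter(cell for cell, _ in remaining)
        redundant = [k for k in remaining if counts[k[0]] > 1]
        if not redundant:
            return remaining
        costs = instance_costs(remaining, nets)  # recomputed after each removal
        del remaining[max(redundant, key=costs.__getitem__)]


if __name__ == "__main__":
    # Pad p is fixed in region L; cells a and b start with one instance per region.
    instances = {("p", "L"): (0, 0),
                 ("a", "L"): (0, 0), ("a", "R"): (10, 0),
                 ("b", "L"): (0, 0), ("b", "R"): (10, 0)}
    nets = {"n1": ["p", "a"], "n2": ["a", "b"]}
    print(sorted(iterative_deletion(instances, nets)))
```

In this small example the fixed pad pulls cell a toward region L, and once a's instance in R is removed the shifted center of mass of net n2 pulls b to L as well, which is the global interaction between regions that the formulation is meant to capture.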
group to evaluate the performance of our approach on large designs under realistic delay rules. We note that delay optimization is perhaps more of a global phenomenon than wire length minimization: meeting timing objectives may require modifications in many areas of a placement, and reductions in delay for some nets may require increased delay in others.
References
[1] T. Anderson, D. Culler, and D. Patterson. The case for NOW (network of workstations). IEEE Micro, 15(1), February 1995.
[2] P. Banerjee. Parallel Algorithms for VLSI Computer-Aided Design. Prentice Hall, Englewood Cliffs, New Jersey, 1994. (Chapter 3).
[3] D. Becker, T. Sterling, D. Savarese, J. Dorband, U. Ranawake, and C. Packer. BEOWULF: A parallel workstation for scientific computation. In International Conference on Parallel Processing, 1995.
[4] M. Blumrich, R. Alpert, Y. Chen, D. Clark, S. Damianakis, C. Dubnicki, E. Felten, L. Iftode, K. Li, M. Martonosi, and R. Shillner. Design choices in the SHRIMP system: An empirical study. In Proceedings of the 25th Annual ACM/IEEE International Symposium on Computer Architecture, June 1998.
[5] N. Boden, D. Cohen, and W.-K. Su. Myrinet: A gigabit-per-second local area network. IEEE Micro, 15(1), February 1995.
[6] M. A. Breuer. A class of min-cut placement algorithms. In Proc. Design Automation Conf, pages 284–290, 1977.
[7] A. E. Caldwell, A. B. Kahng, and I. L. Markov. Optimal partitioners and end-case placers for standard cell layout. In Proc. Int. Symp. on Physical Design, pages 90–96, 1999.
[8] A. Casotto and A. Sangiovanni-Vincentelli. Placement of standard cells using simulated annealing on the Connection Machine. In Proceedings of the International Conference on Computer-Aided Design, pages 350–353, November 1987.
[9] F. Darema, S. Kirkpatrick, and V. Norton. Parallel algorithms for chip placement by simulated annealing. IBM Journal of Research and Development, May 1987.
[11] C. Dubnicki, A. Bilas, Y. Chen, S. Damianakis, and K. Li. VMMC-2: Efficient support for reliable, connection-oriented communication. In Hot Interconnects V, August 1997.
[12] A. E. Dunlop and B. W. Kernighan. A procedure for placement of standard-cell VLSI circuits. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, CAD-4(1):92–98, January 1985.
[13] H. Eisenmann and F. M. Johannes. Generic global placement and floorplanning. In Proc. Design Automation Conf, pages 269–274, 1998.
[14] Charles M. Fiduccia and R. M. Mattheyses. A linear-time heuristic for improving network partitions. In Proceedings of the 19th IEEE Design Automation Conference, pages 175–181, 1982.
[15] P. Geoffray, L. Prylli, and Bernard Tourancheau. BIP-SMP: High performance message passing over a cluster of commodity SMPs. In Proceedings of Supercomputing (SC99), November 1999.
[16] D. J.-H. Huang and A. B. Kahng. Partitioning based standard cell global placement with an exact objective. In Proc. Int. Symp. on Physical Design, pages 18–25, 1997.
[17] M. Ibel, K. Schauser, C. Scheiman, and M. Weis. High performance cluster computing using SCI. In Hot Interconnects V, August 1997.
[18] R. Jayaraman and R. Rutenbar. Floorplanning by annealing on a hypercube multiprocessor. In Proceedings of the International Conference on Computer-Aided Design, pages 346–349, November 1987.
[19] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar. Multilevel hypergraph partitioning: Application in VLSI domain. In Proc. Design Automation Conf, pages 526–529, 1997.
[20] Brian W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, 49:291–307, 1970.
[21] S. Kim, B. Ramkumar, J. Chandy, S. Parkes, and P. Banerjee. ProperPLACE: a portable parallel algorithm for standard cell placement. In Proceedings of the 8th International Parallel Processing Symposium (IPPS’94), April 1994.
[22] R. Kling and P. Banerjee. Concurrent ESP: a placement algorithm for execution on distributed processors. In Proceedings of the International Conference on Computer-Aided Design, pages 354–357, November 1987.
[23] T. Koide, M. Ono, S. Wakabayashi, and Y. Nishimaru. Par-POPINS: a timing-driven parallel placement method with the Elmore delay model for row based VLSIs. In Proceedings of the Design Automation Conference (DAC’97), 1997.
[24] S. Kravitz and R. Rutenbar. Placement by simulated annealing on a multiprocessor. IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 6(4):534–549, June 1987.
[25] P. H. Madden. Partitioning by iterative deletion. In Proc. Int. Symp. on Physical Design, pages 83–89, 1999.
[26] S. Mohan and P. Mazumder. Wolverines: Standard cell placement on a network of workstations. IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 12(9):1312–1326, September 1993.
[27] M-VIA: Virtual interface architecture for Linux, 2001.
[28] S. Pakin, M. Lauria, and A. Chien. High performance messaging on workstations: Illinois Fast Messages (FM) for Myrinet. In Proceedings of Supercomputing (SC’95), 1995.
[29] G. Pfister. In Search of Clusters, 2nd Ed. Prentice Hall, 1998.
[30] L. Prylli. BIP messages user manual, 1998.
[31] L. A. Sanchis. Multiple-way network partitioning with different cost functions. IEEE Trans. on Computers, 42(12):1500–1504, 1993.
[32] Scalable Computing Lab. SCL cluster cookbook: Building your own clusters for parallel computation, 1998.
[33] K. Shahookar and P. Mazumder. VLSI cell placement techniques. ACM Computing Surveys, 23(2):143–220, June 1991.
[34] P. R. Suaris and G. Kedem. An algorithm for quadrisection and its application to standard cell placement. IEEE Trans. on Circuits and Systems, 35(3):294–303, 1988.
[35] W.-J. Sun and C. Sechen. Efficient and effective placement for very large circuits. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 14(3):349–359, 1995.
[36] Virtual interface architecture (VIA) specification, 2001.
[37] M. Welsh, A. Basu, and T. von Eicken. ATM and fast Ethernet network interfaces for user-level communication. In Proceedings of the Third High-Performance Computer Architecture Conference (HPCA’97), February 1997.
[38] M. Yildiz and P. H. Madden. Global objectives for standard cell placement. In Design Automation Conference (DAC’2000), 2000.