Register Optimization and Scheduling for Real-Time Digital Signal Processing Architectures

Francis Depuydt

20th October 1993

Acknowledgements

As I sit down and try to come up with a few catchy acknowledgements, I realize that the work in this PhD thesis has been done by a team of people. This is one of the most important things I learned from my promotors Prof. Hugo De Man and Prof. Francky Catthoor, and I wish to sincerely thank them for this.

Since this work smells like team spirit, let me first express my gratitude towards the members of that team: Dr. Gert Goossens (for introducing me to the wonderful world of high-level synthesis, and also for the in-depth technical discussions and the careful reading of all my prose), Dr. Dirk "Captain Dirk" Lanneer (for his enthusiasm for compilers, C++, politics and family life), Johan Van Praet (for being my closest neighbor in the office and bearing my head-banging and foot-stomping), Werner Geurts (for the many discussions on integer programming, graph theory and guitars), Koen "Fluff" Schoofs and Mark Pauwels (for showing me the importance of a working system), Paolo Petroni (for his tidy integration work and his utter calmness during "demo week") and Karl Van Rompaey (for his excellent work on the Smart scheduler). Thanks, dudes!

I also wish to thank Dr. Jef van Meerbergen from Philips for assisting in starting up this work.

This Ph.D. work has been carried out within the scope of the ESPRIT-2260 (SPRITE) project. The financial support from the European Community is gratefully acknowledged.

I also would like to thank the members of the jury (Prof. Hugo De Man, Prof. Francky Catthoor, Prof. P. Verbaeten, Prof. J. Jess, Prof. R. Lauwereins, Prof. J. Smit and Dr. Gert Goossens) for their time and effort.

And all of this would of course not have been possible without the moral support of Lientje, friends, parents and other tribesmen. Thanks.

Francis Depuydt
October 1993

To my parents

Contents

0.1 Abbreviations
0.2 Symbols

1 Introduction
  1.1 The design of single-chip VLSI systems
    1.1.1 Single-chip VLSI architectures for real-time signal processing
    1.1.2 Computer-Aided Design (CAD) of VLSI systems
    1.1.3 Typical requirements for high-level synthesis
  1.2 The importance of scheduling for register usage
  1.3 The Cathedral-2nd synthesis system
    1.3.1 Architecture components and related synthesis tasks
    1.3.2 The Cathedral-2nd synthesis script
    1.3.3 Register optimization in the Cathedral-2nd context
  1.4 Requirements for the scheduler
  1.5 Overview of the thesis

2 Approaches to register optimization
  2.1 Literature overview
    2.1.1 Register optimization in software compiler technology
    2.1.2 Register optimization in high-level synthesis
    2.1.3 Conclusions from the literature survey
  2.2 A new approach
  2.3 Motivations for the new approach
  2.4 Summary

3 The signal flow graph model
  3.1 Nodes and edges in the HDSFG
    3.1.1 Nodes
    3.1.2 Edges
    3.1.3 The timing model
  3.2 Hierarchy
    3.2.1 Modeling loops in the HDSFG
    3.2.2 Conditions in the HDSFG
    3.2.3 The HDSFG model of the SYNDR example
  3.3 Scheduling the HDSFG
  3.4 Maximum register cost estimations
    3.4.1 Estimation vs. exact calculation
    3.4.2 The hierarchical retiming model
    3.4.3 Different maximum register cost estimations
    3.4.4 Algorithmic complexity of hierarchical retiming
  3.5 Summary

4 Cut reduction
  4.1 The principles of cut reduction
  4.2 Branch-and-bound search
  4.3 The basic moves
  4.4 The cost function
  4.5 Branching heuristics
  4.6 Pruning the search space
  4.7 Cut reduction termination
  4.8 Experiments with cut reduction
  4.9 Comparison with the literature
  4.10 Hierarchical cut reduction
  4.11 Summary

5 Clustering
  5.1 Clusters and maximum register cost
  5.2 The goal of clustering
  5.3 The basic clustering technique
    5.3.1 Clustering vs. partitioning
    5.3.2 Greedy clustering
    5.3.3 Distance metrics
    5.3.4 Minimum distance selection
    5.3.5 Cluster merging
    5.3.6 Distance recomputation
    5.3.7 The clustering process
    5.3.8 Algorithmic complexity of clustering
    5.3.9 Clustering techniques in the literature
  5.4 Extensions for control flow hierarchy
    5.4.1 Clustering of conditional operations
    5.4.2 Offset register cost
  5.5 Clustering and cut reduction
  5.6 Summary

6 Integer Programming Scheduling
  6.1 The importance of software pipelining
    6.1.1 The concept of software pipelining
    6.1.2 Software pipelining in multi-processor scheduling
    6.1.3 Software pipelining in high-level synthesis
  6.2 ILP scheduling in the literature
  6.3 Motivations for a new ILP model
  6.4 The ILP scheduling model
    6.4.1 Terminology
    6.4.2 Set constraints
    6.4.3 Formulation of the schedule length
    6.4.4 Modeling scheduled clusters for macronode scheduling
    6.4.5 Loop hierarchy
  6.5 Resource constraints
    6.5.1 An example
    6.5.2 Linear-time constraint generation
    6.5.3 The different resource constraints
    6.5.4 Software pipelining and conditions
  6.6 Timing and retiming constraints
    6.6.1 Software pipelining constraints
    6.6.2 Timing constraints
  6.7 The interconnection model
  6.8 Theoretical complexity analysis
  6.9 Cost functions
    6.9.1 The cost function for cluster scheduling
    6.9.2 The cost function of macronode scheduling
  6.10 Experiments
    6.10.1 ILP solution techniques in the literature
    6.10.2 Experiments with ILP scheduling
    6.10.3 Exploitation of set constraints
    6.10.4 Experiments with ILP software pipelining
    6.10.5 Experiments with an integrated scheduling environment
  6.11 Summary

7 Experiments
  7.1 The use of register constraints
  7.2 Trading off register cost vs. throughput
  7.3 Summary

8 Conclusions
  8.1 Motivation
  8.2 Contributions
  8.3 Future work

A Edge cost corrections

B The cut reduction algorithm

C Min flow
  C.1 The hierarchical network model
  C.2 Min flow in networks
  C.3 Handling conditions
  C.4 Hierarchical min flow
  C.5 Fanout edges
  C.6 Example
  C.7 Algorithmic complexity
  C.8 Comparison with hierarchical retiming
  C.9 Conclusion

D The clustering algorithm

E The scheduling-retiming relation

F Derivation of tight timing constraints
  F.1 The timing constraint formulation
  F.2 The formulation of the d1,2 ≠ F1,2^j test
  F.3 Summary

G Nederlandse samenvatting (Dutch summary)
  G.1 Introduction
    G.1.1 The design of single-chip digital systems
    G.1.2 The relation between scheduling and register usage
    G.1.3 The Cathedral-2nd synthesis system
  G.2 Register usage optimization
    G.2.1 Register optimization in software compiler technology
    G.2.2 Register optimization in high-level synthesis
    G.2.3 A new approach
  G.3 The signal flow graph model
  G.4 Cut reduction
  G.5 Clustering
  G.6 Scheduling by ILP
  G.7 Experiments
  G.8 Conclusion

Glossary

0.1 Abbreviations

DSP    digital signal processing
RSP    real-time signal processing
GSM    Global System for Mobile Communication
ASIC   application-specific integrated circuit
VLIW   very large instruction word
RTL    register transfer level
HLS    high-level synthesis
EXU    execution unit
HLMM   high-level memory management
HLDM   high-level data-path mapping
LLM    low-level mapping
HDSFG  hierarchical decorated signal flow graph
RT     register transfer
CMB    condition merge block
Tron   Tool for Register OptimizatioN
Ilps   Integer Linear Programming Scheduler

0.2 Symbols

oi         HDSFG operation (node) number i
e1,2       edge between o1 and o2
c1,2       edge cost array
ti         start time of operation oi
li,a       latency of port a of operation oi
Ti         throughput of operation oi
la,b       latency of edge between ports a and b
δ1,2       minimum time delay between o1 and o2
δ'1,2      maximum time delay between o1 and o2
d1,2       length of the delay line between o1 and o2
d1,2^init  initial edge weight
W1,2       window of the delay line between o1 and o2
W1,2^max   maximum window of the delay line d1,2
pi         time step of oi in a schedule
ri         retiming of operation oi
rmax       maximum retiming over all operations
PESi       parallel edge set for register file i
|PESi|     value of the parallel edge set PESi
MPESi      maximum parallel edge set for register file i
R(PESi)    register cost of a PESi
GMC        global maximum register cost array
BMC        maximum bridging edges register cost array
CMC        constrained maximum register cost array
C          register cost constraint array
D1,2       D-distance between two clusters C1 and C2
E1,2       E-distance between two clusters C1 and C2
|D|        normal value of D-distance
|E|        normal value of E-distance
C          the number of time steps in the schedule
xi,j,k     integer programming scheduling variable
Sj,k       maximum clique of the resource conflict graph
oC         dummy sink operation
αi         earliest time step of operation oi
ωi         latest time step of operation oi
F1,2^j     fixed delay line length for timing constraint
λ1,2^j     extra integer variable for timing constraint
y1,2^j     extra binary variable for timing constraint
z1,2^j     extra binary variable for timing constraint
M          large, fixed value for timing constraint
Ak1,k2     cost of a connection between EXUs k1 and k2
S          source node of a network
T          sink node of a network
f          flow array
b          lower flow bound array
F          network flow array

Chapter 1

Introduction

The large market for integrated digital real-time signal processing systems and the requirement for low power consumption (e.g. for portability reasons) justify the design of single-chip VLSI systems. To reduce the time-to-market of these complex designs, powerful computer-aided design (CAD) software is being developed. A very important aspect of these CAD systems is the ability to optimize the area and/or throughput of the chip. The optimization of one specific component of the chip area, the area taken by the registers, and its relation with scheduling is discussed in this thesis. The techniques for register optimization that are proposed in this thesis are applied in the context of high-level synthesis, although they can be applied in other contexts as well (e.g. code generation for programmable architectures).

This introductory chapter is structured as follows: a brief introduction to the design of single-chip VLSI systems and the different kinds of CAD support is given in Section 1.1. The problem of register optimization, and its relation with scheduling, is discussed in Section 1.2. In Section 1.3, an overview of a prototype high-level synthesis system (Cathedral-2nd) is presented. All experiments with register optimization in this thesis have been carried out in the context of that system. Since register optimization is closely related to scheduling, some specific requirements for the scheduling task are formulated in Section 1.4. Finally, the outline of this thesis is summarized in Section 1.5.

1.1 The design of single-chip VLSI systems

The CAD techniques discussed in this work are mainly intended to be used for the design of real-time digital signal processing (RSP) applications (the term DSP is mostly used for filter applications, and is therefore avoided here). An overview of several single-chip VLSI architectures for real-time signal processing is given in Section 1.1.1. The different design stages are supported by

different kinds of CAD systems. A rough classification of these CAD systems is presented in Section 1.1.2. One of the design stages that increasingly gets CAD support is high-level synthesis. The requirements for computer-aided high-level synthesis are discussed in Section 1.1.3.

1.1.1 Single-chip VLSI architectures for real-time signal processing

There is a relatively large market for integrated real-time digital signal processing systems, situated in the domain of consumer electronics: examples are ISDN (Integrated Services Digital Network), HDTV (High-Definition Television), GSM (Global System for Mobile Communication), CD (Compact Disc), CD-I (Interactive CD), virtual reality and multi-media. Because of the high production volume of these systems and the requirement for low power consumption (portability), single-chip solutions are proposed [Goossens 92] [Paulin 92] [Gupta 92].

The real-time signal processing applications can be classified into different application domains, according to their throughput (or the rate at which they take in new samples). Previous research has shown that each RSP application domain requires a different VLSI architectural style for an efficient realization [Catthoor 88] [Catthoor 92]. An overview of the most common architectural styles is given in the following paragraphs (this classification is valid for the currently available technologies, with clock frequencies of 10 to 50 MHz).

Low- to medium-throughput RSP

For low- to medium-throughput RSP (sampling rates between 1 kHz and 1 MHz), as in low-end telecom, audio and speech, several architectural alternatives exist:

- A programmable general-purpose DSP processor, like the TMS320Cxx family from Texas Instruments or the Motorola M56000 series, could be used to implement a low- to medium-throughput application. A good survey of today's most common state-of-the-art general-purpose DSP processors is presented in [Lee 88] [Lee 89a]. The main advantage of these devices is their flexibility, however at the expense of power consumption and inefficiency in the matching of operations to hardware.

- By tuning the architecture to the application, and thus making it application-specific (ASIC), the power consumption and silicon area can be reduced, and critical parts can be sped up to meet the overall throughput requirement. Typically, for low- to medium-throughput RSP applications, a highly multiplexed, microcoded VLIW processor architecture

is used [Catthoor 88] [Lanneer 93a] [Man 88] [Vanhoof 90] [Rimey 89] [Hartmann 92] [Haroun 88]. Such architectures have the advantage of an optimized mapping of operations to hardware and an optimized power consumption, however at the expense of flexibility.

Irregular high-throughput RSP

Irregular high-throughput RSP (1 to 50 MHz sampling rates), as in front-end telecom, medium-level image and video applications, requires more parallelism and less hardware multiplexing, although the control flow and the data dependencies are still irregular. A lowly-multiplexed cooperating data path ASIC architectural style [Note 91] [Lippens 91] [Chu 89] is well-suited for this application domain.

Heterogeneous architecture

A heterogeneous architectural style can be conceived by using a mix of the above two architectural alternatives: a reusable processor core (a DSP core) is combined with accelerator data paths (irregular RSP architecture style) for the most time-critical computations [Goossens 92] [Goossens 93]. This architectural style offers flexibility, hardware efficiency and power reduction for power-critical operations.

Regular high-throughput RSP

For the regular high-throughput applications, like front-end video, imaging and radar, a regular array ASIC architecture is employed [Rosseel 91]. Such architectures efficiently exploit the regularity of the applications.

The CAD techniques in this thesis are intended to be used in the low- to medium-throughput domain.

1.1.2 Computer-Aided Design (CAD) of VLSI systems

To reduce the time-to-market, powerful CAD software is being developed to support the different levels of the design. At least four different levels of abstraction can be distinguished in a design:

1. Silicon assembly: generating the physical layout of the chip from a given high-level structural description. It involves logic synthesis, module generation, placement and routing, etc. CAD systems for silicon assembly are commercially available (Mentor, Cadence, Compass, to name but a few).

2. RTL synthesis: the translation of a register transfer description with a fixed timing into a structural description, which can be taken as input for silicon assembly. RTL synthesis involves controller synthesis, sequential logic synthesis, and data-path synthesis. CAD systems for RTL

synthesis are now becoming available, some commercially (e.g. Synopsys, Cadence, Racal-Redac, Asyl-III), others in-house (e.g. Oasis, Callas).

3. The next higher step in design abstraction is high-level synthesis: the synthesis (from scratch) of an application-specific architecture from a description of the behavior of a real-time signal processing algorithm. Again, several subtasks need to be performed: memory management, scheduling, data-path allocation, code generation, etc. High-level synthesis usually assumes a characterized module library for arithmetic, logical and storage components. Many prototype CAD systems exist, but most of them are still in research (a notable exception is the DSP-Station, a commercial high-level synthesis system from EDC that is based for a large part on the Cathedral experience [Man 90]). One such prototype system is the Cathedral-2nd high-level synthesis system. It will be described in Section 1.3, to provide the context of the work presented in this thesis.

4. The highest level of VLSI design abstraction (so far) is the system level. Issues to be dealt with in system-level design are hardware/software partitioning, high-level communication protocols, design style selection, specification and optimization of system control, etc. The development of CAD for system-level VLSI design is still in an early research stage [Kalavade 93] [Gupta 92].

1.1.3 Typical requirements for high-level synthesis

The main context of the work presented in this thesis is high-level synthesis. Therefore, some typical requirements for high-level synthesis (HLS) to support the design of real-time signal processing applications are outlined:

- HLS must be able to perform area minimization under fixed throughput constraints. For real-time signal processing, only the worst-case execution time of the algorithm matters. This execution time must be within a certain specification for most applications, for which the chip area must be as small as possible. One specific aspect of this area, the number of registers, is of central importance in this thesis. The minimization of the power consumption of the chip can also be seen as an aspect of area minimization.

- HLS must be able to take hardware constraints into account, e.g. a fixed number of functional units, a fixed number of available registers, a fixed instruction set, etc. Imposing hardware constraints is an important way of user interaction with the system.

- For the low- to medium-throughput domain, HLS has to be able to deal with large applications (thousands of operations), containing nested loops and conditions.

- The real-time aspect: the worst-case execution time of the signal processing algorithm implemented on the chip must satisfy the throughput specifications [Catthoor 92].

- Most RSP algorithms use finite word lengths: the correct bit-true behavior must be implemented [Pauwels 90] [Pauwels 92] [Schoofs 93].

- Quite a few medium-throughput RSP applications use large signal arrays [Verbauwhede 91] [Vanhoof 91].

- The HLS system must be user-friendly: the designer should be provided with relevant feedback, design statistics, hints, estimations and checks.

As for the quality of the design, two types of high-level synthesis tools can be distinguished. On the one hand, there are the fast prototyping tools that allow for a quick exploration of the design space, sacrificing accuracy for design exploration speed [Rabaey 91] [Knapp 91]. On the other hand, there are the optimization tools. Here, powerful optimizations are available at the cost of extra CPU time. It is clear that a designer needs both kinds of tools. Once a rough idea about the VLSI architecture has been conceived by quick design space exploration, the design can be further optimized over a much smaller design space. Since, typically, only a few iterations over this latter optimization stage are required, quite some CPU time can be spent in order to get optimal or near-optimal results.

1.2 The importance of scheduling for register usage

One of the requirements of high-level synthesis is area optimization (see Section 1.1.3). In this thesis, the optimization of the register cost, as a component of the total chip area, is the main issue. In high-level synthesis, the area taken by registers is significant and must be minimized: for instance, the size of a block of 16 registers (16 bits wide) is comparable with the size of a 16-bit arithmetic and logic unit (ALU). The register cost is optimized in this thesis by allowing the designer to put constraints on the available number of registers. This is compatible with the requirements of software compilation, where it is essential

    a = f1(...);        a = f1(...);
    c = f2(...);        b = f3(a,...);
    b = f3(a,...);      c = f2(...);
    d = f4(c,...);      d = f4(c,...);

         (a)                 (b)

Figure 1.1: The impact of code ordering on register usage

to generate code for a fixed number of available registers. For instance, floating-point DSP processors typically have 16 to 32 general-purpose registers, and fixed-point DSP processors only have a few special-purpose registers.

Many authors in the field of software compilation have acknowledged that scheduling (or instruction ordering) has a large impact on the register usage [Aho 88] [Hennessy 90] [Hendren 92]. Some optimizing software compilers perform code rearrangement [Aho 88] to optimize the register usage.

The relation between register usage and scheduling, or instruction ordering, is illustrated by means of a simple example in figure 1.1, where two alternative C code orderings are compared. The instruction ordering in figure 1.1(a) locally requires one register more than the instruction ordering of figure 1.1(b): in (a), both a and c must be kept alive simultaneously, while in (b) each value is consumed immediately after it is produced.

The relation between scheduling and register cost is examined in this thesis, by proposing solutions for the following problem:

Problem 1.1 "Find a schedule that satisfies constraints on the available number of registers, and requires a minimized total number of clock cycles."

1.3 The Cathedral-2nd synthesis system

In the previous section, it has been pointed out that the architectural style depends on the target application domain [Lanneer 91] [Catthoor 88] [Man 88]. In order to make high-level synthesis feasible for realistic applications, the design methodology should also depend on the target application domain and the architectural style. This principle is called targeted compilation [Man 90]. The Cathedral-2nd system, for instance, is targeted towards low- to medium-throughput real-time signal processing applications [Lanneer 91] [Lanneer 90] [Goossens 92]. The techniques proposed in this thesis have been implemented and demonstrated in the context of the Cathedral-2nd synthesis system; therefore an overview of that system is given in this section.

In addition to the general characteristics of real-time signal processing in Section 1.1.3, some important characteristics of low- to medium-throughput real-time signal processing are the following:

- Most medium-throughput RSP applications exhibit an irregular, complex mix of iterated pieces of code (loops) and the conditional execution of operations [Depuydt 91].

- The hardware operators are highly multiplexed, i.e. they are re-used for a large number of operations [Catthoor 92] [Lanneer 91].

The target architectural style of Cathedral-2nd is the highly multiplexed microcoded VLIW processor style (see Section 1.1.1). This architectural style is schematically shown in figure 1.2. The components of the architecture in figure 1.2 and the related synthesis tasks are discussed in Section 1.3.1. The ordering of these tasks into a synthesis "script" is explained in Section 1.3.2. A few examples of register optimization in the context of the Cathedral-2nd system are given in Section 1.3.3.

1.3.1 Architecture components and related synthesis tasks

In the paragraphs below, the most important architecture components are described. Their special features have to be utilized as efficiently as possible in the design. Therefore, some dedicated synthesis tasks are described as well.

Execution units

One of the most prominent components of the highly-multiplexed microcoded VLIW architecture is the flexible execution unit data path, or EXU (labeled "1" in figure 1.2). Depending on the throughput specification, an EXU can be pipelined, and several EXUs can execute operations in parallel. The EXU composition is matched to the application [Lanneer 91], e.g. EXU1 in figure 1.2 is designed to perform the division of two integer values as a series of additions and shift operations. The designer can manually define the composition of the EXUs based on statistics on the occurrence of certain sequences of operations [Praet 92]. Once the EXUs are defined, the chaining task assigns groups (chains) of operations to the EXU types that can execute them in one single register transfer operation [Lanneer 91].

Controller

Another important architecture element is the pipelined multi-branch controller (labeled "2" in figure 1.2): it contains the instruction sequence to be executed as microcode, and it allows for multi-way branching ("jumping") and branch condition storage [Catthoor 88]. The controller specification is generated by the scheduling task [Goossens 90] [Rompaey 92], and

Figure 1.2: Highly-multiplexed microcoded VLIW processor architecture
[The figure shows a microcoded processor controller (branch logic, program counter and incrementer, microcode ROM, instruction and status registers) driving EXUs (EXU1 with S/P and P/S shift registers, adder/subtractor, shifter and mux; EXU2) with local controllers and register files (RF), connected by global busses, feedback busses and tristate drivers.]


the controller generation task tries to come up with a minimum-size controller for a given performance.

EXU interconnection  The communication of signals between two different EXUs takes place over global busses (labeled "4" in figure 1.2). These busses are synthesized during the interconnection definition task. A large number of global busses can lead to intricate floor-plans. Therefore, busses should be re-used as much as possible [Vanhoof 90], and the bus merging task takes care of that. Feedback busses (labeled "5" in figure 1.2) are local busses from the output of an EXU to the input of the same EXU. The EXU instance assignment task [Lanneer 93b] assigns communicating operations as much as possible to the same EXU instance, maximizing the use of the feedback busses. Routers (labeled "6" in figure 1.2) can be added to the interconnection network to change the signal alignment for bit-true behavior [Schoofs 93]. This is done during the hardware alignment task.

Storage units  Scalar signals (and sometimes short signal arrays) are stored in dual-port register files (labeled "7" in figure 1.2). Register files, or register arrays, are addressed directly by the microcode. They are hard-wired to the EXU inputs to provide sufficient register access bandwidth to keep the EXUs busy. The assignment of signals to register files is done by the data routing task [Lanneer 93a], and the assignment of these signals to a specific address in the register file is done during register assignment [Goossens 86]. Multi-dimensional signals (signal arrays) are stored in RAM, which can be seen as an EXU. The number and the size of these RAMs, as well as the number of ports accessing them and the address sequencing, are determined by the high-level memory management task [Swaaij 92] [Vanhoof 91] [Franssen 92].

1.3.2 The Cathedral-2nd synthesis script

Ideally, all the synthesis tasks mentioned above would be performed simultaneously in one big optimization problem.
However, due to the huge number of possible combinations of design decisions, this is hardly ever feasible for real-life examples. Therefore, the synthesis tasks are ordered by their potential influence on the design, and the synthesis task with the largest impact is performed first. Such an ordering of design tasks is called a script [Lanneer 93a] [Goossens 93] [Catthoor 92]. The Cathedral-2nd synthesis script is shown in figure 1.3, and consists of the following large steps:

1. High-Level Memory Management (HLMM): the assignment of multi-dimensional signals (signal arrays) to memories, and the determination of the number, size and access bandwidth of these memories (see Section 1.3.1).


[Figure 1.3: The Cathedral-2nd synthesis script. The script runs from high-level memory management, over high-level data path mapping and the LLM script (refinement and chaining, EXU instance assignment, data routing, interconnect definition, bus merging, alignment optimization, microcode generation, with scheduling invoked at several points), to RTL synthesis and silicon assembly.]

2. High-Level Data path Mapping (HLDM): the allocation and composition of a number of EXU data paths (see Section 1.3.1).

3. Low-Level Mapping (LLM): the assignment of operations and signals to hardware (see further).

4. RTL synthesis (see Section 1.1.2).

5. Silicon assembly (see Section 1.1.2).

The LLM subscript (see figure 1.3) is discussed in somewhat more detail below, because the work described in this thesis is illustrated in the context of low-level mapping. For a given behavioral description and a given set of EXU operators (defined by the HLDM task), low-level mapping determines the following bindings (or mappings) [Lanneer 93a]:

- The binding of operations to resource types and the grouping of operations in chains of operations (refinement and chaining).

- The binding of chains of operations (register transfers) to specific EXU instances (EXU instance assignment).


- The binding of the signals (that are not already mapped during HLMM) to storage (data routing).

- The binding of register transfers and type-changing operations to interconnection and dedicated bit-routing networks (interconnect definition and hardware alignment).

- The binding of signals to addresses and ports of register files (register assignment).

- The binding of operations to time steps (scheduling).

The output of LLM is a symbolic microcode description and a completely specified architecture description (architecture net-list), which are passed to RTL synthesis and silicon assembly. At several levels in the design trajectory, the performance of the design is evaluated by scheduling the design (see figure 1.3). This scheduling tries to minimize the total number of clock cycles by assigning each operation to a time step such that the resource constraints, imposed by the partially completed design, are obeyed. Note that, because the scheduler is called several times in the LLM script, it must be able to deal with several kinds of hardware constraints (see Section 1.4).

1.3.3 Register optimization in the Cathedral-2nd context

The following example illustrates register optimization in the Cathedral-2nd context: the design of a channel decoder for GSM [Busschaert 91], without taking register cost into account, takes 68 registers in 25 dual-port register files. 86 bits of the 103-bit microcode word are required to address these register files. There are a number of reasons for this large number of registers:

- The Cathedral-2nd system uses a large number of register files, required to obtain a large communication bandwidth (see Section 1.3.1).

- The scheduling is performed without taking register cost into account [Depuydt 90] [Depuydt 91] [Vanhoof 90].

- An inefficient data routing (see the script in figure 1.3) is responsible for duplication of signals and long signal lifetimes in expensive registers.
A solution for this problem is proposed in [Lanneer 93a].

Although there are a number of causes for the large number of registers, only the relation between register cost and scheduling is investigated in this thesis. The number of register files, and their interconnection with the EXUs,


is dictated by the architectural style, but can in principle be changed by the designer.

If the register cost is optimized during scheduling, results as in table 1.1 can be obtained. The table shows the reduction (in %) of the register cost, the microcode word length and the total number of clock cycles for a number of real-life RSP applications. This reduction is measured w.r.t. designs that are not optimized for register cost.

  design    registers   microcode   cycles
  SYNDR        14           9        -41
  LPC          33          25        -22
  CHED91       16          16          5
  SPEECH       32          15         -8

  Table 1.1: Typical results of register optimization through scheduling (reduction in %)

In most cases, scheduling for an optimized register cost will cost more clock cycles (e.g. 22 % for the LPC design in table 1.1). However, for some designs schedules could be found with a smaller register cost and fewer clock cycles (e.g. the GSM channel decoder CHED91), because of the use of more powerful scheduling techniques (like integer programming scheduling, see Chapter 6) in combination with register optimization techniques.

Some relevant characteristics of the RSP applications in table 1.1 are summarized in table 1.2. The second column of the table gives the total number of operations V in the design. These operations are distributed over a number of hierarchical levels (loops); the number of hierarchical levels L is given in the third column. The column N gives the number of register files in the data path, the function column describes the functionality of the application, and the source column gives the industrial source of the application.

  design     V     L    N   function                 source
  SYNDR      31    2   11   syndrome generation      Philips
  LPC       179   10    9   lin. predictive coding   Philips
  CHED91    370    8   21   channel decoding         Alcatel
  SPEECH    590   22   11   speech recognition       Lernout & Hauspie

  Table 1.2: Some relevant demonstrator characteristics

The demonstrator applications of table 1.2 are used further on in this thesis.
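The register costs behind figures like those of table 1.1 follow from counting simultaneously live signals: a signal occupies a register from its production until its last consumption, and the register cost of a schedule is the peak count over all time steps. A minimal sketch of that computation (the lifetimes below are hypothetical, not taken from the table 1.1 designs):

```python
def register_cost(lifetimes, n_steps):
    """Peak number of simultaneously live signals over all time steps.

    lifetimes: dict mapping a signal name to (birth, death) time steps;
    a signal occupies a register from its production (birth) up to and
    including its last consumption (death).
    """
    live = [0] * n_steps
    for birth, death in lifetimes.values():
        for t in range(birth, death + 1):
            live[t] += 1
    return max(live)

# Hypothetical 4-step schedule with three signals:
lifetimes = {"a": (0, 1), "b": (0, 2), "c": (2, 3)}
print(register_cost(lifetimes, 4))  # two signals live at once -> 2
```

A scheduler that balances register usage tries to lower exactly this peak, by moving productions closer to consumptions.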


1.4 Requirements for the scheduler

Since Problem 1.1 is solved during the execution of the scheduling task, a number of important requirements for the scheduler are summarized in this section.

Architecture models  The scheduler must be able to deal with several different architecture models. An architecture model is defined by the following parameters:

1. The set of available EXUs.
2. The set of different EXUs to which each operation can be bound.
3. The number of register files, and their connections (e.g. hardwired to EXU inputs).
4. The number of ports for each register file.
5. The size of each register file.
6. The interconnection network.

If only parameters 1 and 2 are fixed, e.g. after chaining in the Cathedral-2nd context (see the LLM script in Section 1.3.2), a central register file architecture with unlimited register access bandwidth is used as the model for the scheduler (figure 1.4(a)). Note that the interconnection network is a full crossbar. If parameters 1 to 5 are fixed, e.g. after data routing in the LLM script, a distributed dual-port register file architecture model is used (figure 1.4(b)). And if all parameters are fixed, e.g. after interconnection definition, or when using fixed architectures like DSP cores, the architecture model of figure 1.4(c) is used. Note that the architecture model is by no means restricted to a style where the EXUs have register files at their inputs.

Design complexity  The scheduler must support complex designs in the medium-throughput RSP application domain. This means that there is probably going to be a large number of operations, organized in a multi-level hierarchy of nested loops and nested conditions. For instance, the loop hierarchy of the CHED91 design (see table 1.2) is shown in figure 1.5. The numbers in figure 1.5 indicate the ranges of the loop counters.
Furthermore, the operations in this kind of application typically have a large scheduling freedom (there are many possible time steps at which an operation can be scheduled). Finally, the scheduler must be able to explore global scheduling techniques, like loop reorganization (software pipelining), to meet the throughput specification of the application (see Section 6.1).
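Scheduling freedom is commonly measured as the mobility of an operation: the number of time steps between its ASAP (as soon as possible) and ALAP (as late as possible) positions. A small sketch of that computation for unit-delay operations (the four-operation chain is a made-up example, not taken from CHED91):

```python
def mobility(ops, deps, budget):
    """ASAP/ALAP analysis for unit-delay operations.

    ops: operation names in topological order; deps: (producer,
    consumer) precedence edges; budget: total number of time steps.
    Returns {op: (asap, alap)}; the scheduling freedom of an op is
    alap - asap + 1 candidate time steps.
    """
    preds = {o: [] for o in ops}
    succs = {o: [] for o in ops}
    for p, c in deps:
        preds[c].append(p)
        succs[p].append(c)
    asap, alap = {}, {}
    for o in ops:                       # forward pass: earliest start
        asap[o] = max((asap[p] + 1 for p in preds[o]), default=0)
    for o in reversed(ops):             # backward pass: latest start
        alap[o] = min((alap[s] - 1 for s in succs[o]), default=budget - 1)
    return {o: (asap[o], alap[o]) for o in ops}

ops = ["ld", "mul", "add", "st"]
deps = [("ld", "mul"), ("mul", "add"), ("add", "st")]
# With a budget of 6 steps, every operation on this 4-step critical
# chain has 6 - 4 + 1 = 3 candidate time steps.
print(mobility(ops, deps, 6))
```

The larger this freedom, the larger the search space a scheduler (and any register optimization coupled to it) has to cope with.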


[Figure 1.4: Different architecture models to be supported. (a) A central register file with unlimited access bandwidth, feeding the EXUs (ADD1, ADD2, MULT, a constant ROM) through a full crossbar. (b) Distributed dual-port register files at the EXU inputs, with a full crossbar interconnect. (c) The same distributed register files with a completely fixed interconnection network.]


[Figure 1.5: The loop nesting hierarchy of the CHED91 demonstrator application. Nested for-loops (index, p, p2, i, kdi, t, j, i2) below top and main, with counter ranges 0..455, 0..147, 148..266, 1..15, 0..1, 0..5, 0..243 and 0..7.]

User interaction  Besides supporting all kinds of hardware constraints, the scheduler should also be able to trade off scheduler optimality for speed (fast prototyping). This trade-off can for instance be made if a set of basic scheduling techniques (optimal integer programming scheduling, list scheduling, etc.) is provided, from which the designer can choose.

1.5 Overview of the thesis

In the next chapter (Chapter 2), a survey of the literature on register optimization in conjunction with scheduling is given, and a proposal for a new solution is formulated and motivated. A formal definition of the design graph model for register optimization and scheduling is given in Chapter 3. This chapter also contains a new technique for the estimation of the maximum register cost of a design. This latter technique is at the basis of the cut reduction technique for register cost reduction, proposed in Chapter 4. To reduce the algorithmic complexity of cut reduction, clustering techniques are described in Chapter 5. In Chapter 6, special attention is paid to integer programming scheduling, and a new method for optimal software pipelining is presented. Some results of experiments with a prototype CAD tool on real-life applications are given in Chapter 7. Conclusions and directions for future work can be found in Chapter 8.

The new contributions of this thesis are the following:

1. A new methodology for register optimization is proposed in Chapter 2: the constraints on the available number of registers will be satisfied during a scheduling preprocessing step, such that the scheduler is relieved from these register constraints. Two efficient techniques for scheduling preprocessing are described in Chapters 4 and 5 respectively.

2. A new technique for estimating the maximum register cost of a design, based on retiming, is proposed in Chapter 3. It has a low polynomial algorithmic complexity, and it is intensively used during scheduling preprocessing.

3. A new integer programming scheduling formulation is proposed in Chapter 6. It combines scheduling, software pipelining and delay line optimization for repetitive signal flow graphs.


Chapter 2

Approaches to register optimization

The importance of register optimization has been acknowledged in the literature. Contributions to register optimization have been published in the high-level synthesis domain as well as in the software compilation domain. In this chapter, a survey is presented of the literature on register allocation, register assignment and the relation between register optimization and scheduling. A new approach to register optimization, in a strong relation to scheduling, is proposed and motivated at the end of this chapter.

The survey of the literature is presented in Section 2.1. The outline of the approach to register optimization, as proposed in this thesis, is given in Section 2.2. Section 2.3 contains the motivation for this new approach as an alternative to the techniques in the literature.

2.1 Literature overview

This section contains an overview of the techniques for register optimization that have been published in the literature. The problem of register optimization can be stated in two ways:

1. In software compilation technology, the issue is the generation of code that uses the available number of registers as efficiently as possible. Literature from this domain is surveyed in Section 2.1.1.

2. In high-level synthesis, one of the design goals is to use as few registers as possible, to reduce the area and the power consumption of the chip. Literature from this domain is surveyed in Section 2.1.2.

Some observations with respect to the literature are formulated in Section 2.1.3.


Because the concepts of register allocation and register assignment are frequently employed in the literature, they are defined below [Aho 88]:

Definition 2.1 (Register allocation)  Register allocation is the selection of the set of signals that will reside in registers at each machine cycle.

Definition 2.2 (Register assignment)  Register assignment is the selection of the specific register that a signal will reside in.

Only the assignment of signals to registers is considered here. The assignment of (multi-dimensional) signals to memory (RAM) is discussed in [Vanhoof 91] [Swaaij 92] [Verbauwhede 91] [Lippens 91].

2.1.1 Register optimization in software compiler technology

Software compilers generate machine code for a specific, fixed architecture, in most cases with a fixed instruction set. The traditional approach to code generation consists of the following steps [Aho 88] [Landskov 80]:

1. Instruction ordering
2. Register allocation and assignment
3. Code generation/compaction

These steps are discussed in more detail below.

1. Traditional instruction ordering techniques  In most compilers for general-purpose microprocessors (e.g. Intel 80x86), the code is generated per basic block [Aho 88]:

Definition 2.3 (Basic block)  A basic block is a sequence of consecutive operations in which the flow of control enters at the beginning and leaves at the end, without halt or possibility of branching except at the end.

For operations ordered as expression trees, linear-complexity code generation algorithms have been published [Aho 88] [Aho 76]. These algorithms find the optimal ordering of the operations in the expression tree for minimal register usage. However, if the basic block contains common subexpressions, optimal code generation gets tougher, and partitioning techniques have been proposed to partition the DAG of the basic block into trees, on which the linear-time algorithms can operate [Aho 77] [Aho 88].
In any case, these code ordering algorithms cannot cope with instruction parallelism as found e.g. in VLIW architectures: more complex ordering techniques are required that can exploit this parallelism, e.g. compaction or scheduling (see below).
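The register-need labelling underlying these linear-time tree algorithms (often referred to as Sethi-Ullman numbering) can be sketched as follows. This is a schematic illustration for binary expression trees whose leaves each need one register, not the exact algorithm of [Aho 76]:

```python
def registers_needed(node):
    """Sethi-Ullman labelling: minimum number of registers needed to
    evaluate a binary expression tree when the costlier subtree is
    evaluated first.

    node: a leaf (string) or a tuple (op, left, right).
    """
    if isinstance(node, str):       # leaf: one register holds the value
        return 1
    _, left, right = node
    l, r = registers_needed(left), registers_needed(right)
    # Unequal needs: evaluate the costlier side first and keep its
    # result in one register while the other side is evaluated.
    # Equal needs: one extra register is unavoidable.
    return max(l, r) if l != r else l + 1

# (a + b) * (c - d): each operand subtree needs 2 registers,
# so the full tree needs 3.
expr = ("*", ("+", "a", "b"), ("-", "c", "d"))
print(registers_needed(expr))  # -> 3
```

Evaluating the subtrees in the order suggested by these labels yields the minimal-register ordering for a tree; as noted above, DAGs with common subexpressions break this optimality.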


2. Traditional register allocation  For a given code ordering, traditional software compilers perform register allocation per basic block. Typically, graph coloring techniques are used for this purpose, extended with heuristics for generating spill code [Aho 88] [Chaitin 82] [Hendren 92]. The register assignment follows directly from that coloring. A good overview of the literature on register allocation in the software compilation domain is presented in [Mueller 93]. In some optimizing compilers, a global register allocation is done first, based on simple heuristics: temporary variables of an expression evaluation, results of common subexpressions and loop counters are stored in registers [Aho 88].

3. Code compaction  Super-scalar and VLIW architectures require more powerful code generation techniques, because they can execute several operations in parallel. In the software compilation domain, the task of instruction ordering and grouping operations together in the same machine cycle is called code compaction. In the domain of high-level synthesis, this task is called scheduling. Since code compaction is done after register allocation, the compaction has to ensure that the available number of registers is respected. Several kinds of code compaction techniques can be distinguished:

- Local compaction: the compaction of straight-line code segments in basic blocks is called local compaction. A survey of techniques for local compaction is given in [Landskov 80]. Software pipelining (see Chapter 6) can be seen as a technique for the local compaction of loops.

- Global compaction: a more efficient compaction can be obtained by global compaction, i.e. compaction across basic blocks. A good example of global compaction is trace scheduling [Fisher 81]: a code "trace" (a thread of code that can cross basic block boundaries, along a possible path of control flow) is compacted by allowing the motion of code across the basic block boundaries (which makes the compaction "global").
However, trace scheduling does not take a fixed number of available registers into account, so that an additional register allocation phase is needed. Furthermore, trace scheduling cannot directly be applied to real-time signal processing: the traces have to be selected such that the worst-case execution time is minimized, not the average execution time as in [Fisher 81]. On the other hand, the global compaction idea (with the implicit code motion across basic block boundaries) is gaining impetus in high-level synthesis [Potasman 90] [Vanhoof 93].

4. Register allocation while scheduling  An interesting alternative for code generation on super-scalar and VLIW architectures is performing instruction ordering, register allocation and code compaction [1] simultaneously

[1] Instruction ordering and compaction will be called "scheduling" further on.


[Rimey 89] [Hartmann 92]. While deciding on the scheduling and the register allocation, the best "route" of a signal (from its origin to its destination) is also selected, taking the restrictions of the architecture into account (e.g. a fixed number of registers, a fixed interconnect). Hence, this task is sometimes called "data routing". The two most important techniques for data routing in the context of software compilation are the following:

- Lazy data routing. In [Rimey 89], the spill paths of the unscheduled signal lifetimes are considered during scheduling, to make sure that the design does not use more than the available number of registers. This "lazy" checking of the spill paths requires solving a maximum flow problem for each operation that is scheduled. The lazy data routing and scheduling technique is used for code generation on pipelined and highly irregular data paths.

- Data routing in the CBC compiler. A similar technique is proposed in [Hartmann 92]: during list scheduling, an exhaustive evaluation of the possible data routes is done. This requires checking for deadlocks during scheduling: scheduling an operation that writes a signal into the last available register, thereby blocking the scheduling of all other operations, must be avoided. This checking involves a rather sophisticated deadlock detection algorithm. The technique is used for code generation on a fixed, single-bus architecture (CBC architecture).

Observations.  It is clear that both methods for simultaneous scheduling and register allocation go further than the traditional register allocation methods. However, the decisions on register allocation are still local in nature, because there is no back-tracking on them. As a result, the order in which these decisions are made has a large effect on the quality of the result.
The best order is very difficult to find in these scheduling algorithms, especially in the beginning, when only a few operations have been scheduled.

2.1.2 Register optimization in high-level synthesis

The most important difference between register optimization in software compiler technology and in high-level synthesis is the fact that, in high-level synthesis, there is in general no constraint on the number of available registers. Therefore, most register optimization techniques in high-level synthesis focus on the minimization of the number of registers. A number of these techniques are surveyed in the following paragraphs.

1. Register file assignment before scheduling  Some high-level synthesis systems use an architectural model where register files are hardwired to the


inputs of the functional units, like e.g. Cathedral-2nd (see figure 1.2) and Hyper [Rabaey 90]. The goal of these register files is to obtain a large communication bandwidth between the functional units [Catthoor 88] [Vanhoof 92] [Chu 89]. A default register file assignment follows directly from the resource assignment of the operations: the operands are assigned to the register files at the EXU inputs. In most of these high-level synthesis systems, no alternatives to this default register file assignment are considered. However, in the Cathedral-2nd system, a flexible data routing technique is proposed: registers are used efficiently by carefully "routing" the signals, considering several alternatives for transferring a signal from its origin to its destination [Lanneer 93a]. The technique proposed in [Lanneer 93a] starts from an initial schedule that is obtained without taking register access constraints into account. The register cost and the interconnection cost are then minimized by evaluating several possible routing schemes for each signal, taking into account a user-specified clock cycle budget for the design. For instance, a signal with a long lifetime is a good candidate for temporary storage in RAM, thus saving a register during that lifetime. However, this flexible data routing technique is not able to do register allocation for a fixed architecture (with a fixed number of registers).

2. Register usage minimization during scheduling  Register optimization is often considered part of the global area minimization goal of the scheduler in a high-level synthesis system. The scheduler typically tries to balance the use of the registers over time. For instance, a schedule that requires 15 registers to store the signals in a few time steps, and at most 6 registers in all other time steps, is a highly unbalanced schedule w.r.t. register usage.
Several mechanisms for balancing the register utilization have been published:

- In [Paulin 89] and [Verhaegh 91], the probability that a signal is going to require storage is calculated for each time step. This information is represented by means of distribution diagrams, and updated during scheduling (for more details, see Section 3.4.1).

- In the Hyper system [Rabaey 90] [Potkonjak 92], a list scheduling technique (driven by a resource urgency priority function) tries to balance the register utilization. A solution with a smaller register cost gets selected during a global, probabilistic search. This can become quite CPU-intensive for complex examples with a huge search space.

- Another technique for register optimization (although approximative) is the minimization of the sum of the lifetimes during e.g. ILP scheduling [Hwang 91].


- Finally, an alternative approach for balancing the register cost during list scheduling is proposed in [Rompaey 92]: the production of a signal is scheduled as close as possible to its consumption, if this has a positive effect on the register cost. The effect on the register cost is modeled by means of a demand priority component in the priority function of the list scheduler.

Observations.  A lot of research effort has been spent on balancing the register utilization during scheduling. However, none of these techniques is able to take a fixed number of available registers into account, mainly because this was not an issue in traditional high-level synthesis. However, when using DSP-core modules in a heterogeneous architecture (see Section 1.1.1), the number of available registers is limited and this must be taken into account.

3. Register assignment after scheduling  In many high-level synthesis systems, the signals are assigned to registers after the number of operators has been decided by the scheduler and the operations have been mapped onto the operators. The schedule is fixed, and determines which registers can be shared by which signals. An overview of some techniques for register assignment after scheduling is presented below.

- The register assignment problem. The problem of register assignment [2] is classified under the NP-complete problems [Garey 79] [Sethi 75], and can be reformulated as a graph coloring problem. This is illustrated by means of an example in figure 2.1: three signals (a, b and c) have to be assigned to a minimum number of registers. The lifetimes of the three signals are shown in figure 2.1(a): for instance, signal b is alive from time step [3] 0 to time step 2. The lifetimes can be represented by means of a circular arc graph (figure 2.1(b)) by mapping the time steps of the schedule to sections of a circle: the same signal b is now represented by an arc from section 0 to section 2.
Finding the minimum number of registers to store the signals is equivalent to finding a minimum set of colors such that each arc gets one color, and two arcs that overlap (e.g. a and b in figure 2.1(b)) get different colors. This coloring problem is known as the circular arc coloring problem [Golumbic 80], and it is NP-complete. The circular arc model can be translated into an undirected interference graph, where a node represents a lifetime, and where there is an undirected edge between two nodes if the corresponding lifetimes overlap (figure 2.1(c)). The circular arc coloring problem is now translated into a graph coloring problem, which is of course also NP-complete.

[2] See Definition 2.2.
[3] By some historical convention, time steps in a schedule are numbered from 0.
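This construction can be sketched directly: circular-arc lifetimes on an n-step cycle, an overlap test, and a coloring of the resulting interference graph where each color is one register. The arcs below loosely follow figure 2.1 (with signal a wrapping around the loop boundary); the greedy first-fit coloring is an illustrative heuristic, not one of the published algorithms:

```python
def arcs_overlap(x, y, n):
    """Do two circular arcs on an n-step cycle share a time step?
    An arc (s, e) covers the steps s, s+1, ..., e modulo n."""
    def steps(s, e):
        return {(s + i) % n for i in range((e - s) % n + 1)}
    return bool(steps(*x) & steps(*y))

def color_lifetimes(arcs, n):
    """Greedy first-fit coloring of the interference graph:
    overlapping lifetimes get different colors (registers)."""
    colors = {}
    for sig in arcs:
        used = {colors[o] for o in colors
                if arcs_overlap(arcs[sig], arcs[o], n)}
        colors[sig] = min(c for c in range(len(arcs)) if c not in used)
    return colors

# Three signals on a 4-step cycle: a wraps around (steps 3, 0, 1),
# b covers steps 0..2, c covers steps 2..3.  All three pairwise
# overlap, so three registers are needed.
arcs = {"a": (3, 1), "b": (0, 2), "c": (2, 3)}
print(color_lifetimes(arcs, 4))
```

Exhaustively checking all pairs of arcs makes this quadratic in the number of signals; it only serves to make the circular-arc model concrete.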


[Figure 2.1: The register assignment problem. (a) The lifetimes of signals a, b and c over time steps 0 to 3; (b) the same lifetimes as arcs on a circle (circular arc graph); (c) the undirected interference graph of a, b and c; (d) the cyclic lifetime of a cut into two pieces, a and a'.]

- Register assignment techniques. There are mainly two approaches to solving a register assignment problem: exact coloring of the interference graph, and finding efficient heuristic algorithms to color the interference graph, possibly sacrificing optimality. The first approach is not practical for large register assignment problems. Therefore, most systems adhere to the second approach. In the Hyper system [Potkonjak 92], an efficient, polynomial-time coloring heuristic is used. In the Caddy system [Kramer 90], the interference graph is constructed and colored for each time step separately (while minimizing the interconnect by means of a separate conflict graph), thus reducing the complexity of the register assignment problem. In the Spaid system [Haroun 88], the lifetimes that cross the loop boundary for consumption in the next iteration are "cut" in two, such that the circular arc graph reduces to an overlap graph, which can be colored in polynomial time [Golumbic 80] (e.g. with a left-edge algorithm [Kurdahi 87] [Goossens 89a] [Stok 92]). In figure 2.1(d), the lifetime of signal a has been cut into two pieces, a and a'.

- Extensions for cyclic graphs. As a result of solving the simplified register assignment problem, the two parts of a "cut" cyclic lifetime are sometimes assigned to different registers. To realize this assignment, an extra register transfer is required between these registers [Hendren 92] [Stok 92]. To minimize the number of extra register transfers required,


a re-coloring technique is proposed in [Stok 92]: by means of a multi-commodity network flow algorithm, lifetimes like e.g. a and a' in figure 2.1(d) get the same color (and thus the same register). If re-coloring is not possible, either an extra register or an extra register transfer is added. However, extensive experimentation has revealed that re-coloring almost never fails [Stok 92] [Hendren 92]: the number of registers obtained after solving the simplified register assignment problem is almost always the same as would have been obtained by solving the original register assignment problem. In other words, register assignment can be solved very efficiently in polynomial time, without sacrificing optimality in most cases. This experimental fact will be used in Chapter 3 to justify a technique for maximum register cost estimation. For completeness, an interesting alternative to re-coloring is mentioned: in [Goossens 89a] and [Hendren 92], a preprocessing technique is proposed that fills the "gaps" in the cyclic lifetimes (like e.g. the gap from time step 1 to time step 2 in the cyclic lifetime of a in figure 2.1(a)) with other, short lifetimes, and assigns all these signals to the same register. An extension of this technique for conditional register assignment is described in [Seynhaeve 87].

4. Register merging  After register assignment in high-level synthesis, the interconnect and multiplexer cost can still be decreased by merging registers into register files. The registers in the same register file share the same input and output ports. Several techniques have been published for register merging [Bergamaschi 91] [Haroun 88] [Rouzeyre 91] [Ahmad 91] [Balakrishnan 88] [Breternitz 91].
Some of these techniques perform register assignment and register merging simultaneously.

2.1.3 Conclusions from the literature survey

It is clear from the literature that register optimization is regarded as very important, both in the field of software compilation and in the field of high-level synthesis. Powerful techniques for register allocation and assignment, as well as for register cost balancing during scheduling, have been published. However, if the number of available registers is fixed, as in software compilation, in designs with a heterogeneous architectural style (see Section 1.1.1), or simply to allow user interaction, the interdependence of register allocation and scheduling is only considered by a few authors [Rimey 89] [Hartmann 92], situated especially in the domain of code generation for VLIW architectures. They propose to take local decisions on register allocation during scheduling.
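As background for the left edge algorithm referenced in the survey above, a minimal sketch in Python (illustrative only, not code from this thesis): it assumes acyclic lifetimes given as (start, end) intervals, i.e. the overlap-graph case obtained after "cutting" cyclic lifetimes.

```python
def left_edge_assign(lifetimes):
    """Assign each lifetime (start, end) to a register index.

    Lifetimes are processed by increasing start time ("left edge");
    each one is placed in the first register whose previous lifetime
    has already ended, allocating a new register only when necessary.
    """
    order = sorted(range(len(lifetimes)), key=lambda i: lifetimes[i][0])
    reg_of = {}          # lifetime index -> register index
    busy_until = []      # per register: end time of its latest lifetime
    for i in order:
        start, end = lifetimes[i]
        for r, free_at in enumerate(busy_until):
            if free_at <= start:          # register r is free again
                reg_of[i] = r
                busy_until[r] = end
                break
        else:                             # no free register: allocate one
            reg_of[i] = len(busy_until)
            busy_until.append(end)
    return reg_of, len(busy_until)

# Three overlapping lifetimes need only two registers:
reg_of, num_regs = left_edge_assign([(0, 2), (1, 3), (2, 4)])
assert num_regs == 2
```

For interval (overlap) graphs this greedy assignment uses the minimum number of registers, which is why the "cut" of cyclic lifetimes described above is attractive.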


[Figure 2.2: The scheduling scripts. Script (a): preprocessing by cut reduction, followed by scheduling; script (b): preprocessing by clustering and cut reduction, followed by cluster scheduling and macronode scheduling. Both scripts are applied at each hierarchical level, under register cost constraints and other resource constraints.]

2.2 A new approach

A new approach for register optimization is proposed in this section. It considers the interaction of register assignment and allocation with scheduling, and combines them into a scheduling script.

Starting point. Designers of real-time signal processing applications using CAD systems that do not perform register optimization do the register optimization manually. It has been observed that, mainly, two manual interventions are done:

1. Pieces of code (individual operations, or functions) are manually sequentialized with respect to one another.

2. Signals are manually spilled to RAM (i.e. temporarily stored in RAM).

Adding this functionality directly to the scheduler would lead to quite a complex scheduler, or a scheduler that takes local decisions on the register allocation (see above). Therefore, a scheduling script(4) is proposed in this dissertation, such that the task of satisfying the register constraints is performed before scheduling. As a result of this preprocessing, the scheduler is relieved from taking a fixed number of available registers into account. The proposed script is schematically presented in figure 2.2 (the difference between the scripts in figure 2.2(a) and (b) is explained further on). The first step in the scheduling script is the scheduling preprocessing, which reduces the register requirements of the design such that they satisfy the constraints. There is a tight interaction between the preprocessing and the second step in the script (scheduling).

(4) A script is a sequence of tasks.
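The overall shape of such a script can be sketched as a thin driver that walks the loop nesting hierarchy bottom-up, preprocesses each loop and then hands it to an arbitrary scheduler. This is a hedged illustration only; the class layout and the `preprocess` and `schedule` callables are invented here, not taken from the actual system.

```python
class Loop:
    """A loop with its signal flow graph and nested loops (hypothetical)."""
    def __init__(self, name, sfg=None, nested=()):
        self.name, self.sfg, self.nested_loops = name, sfg, list(nested)

def run_script(loop, preprocess, schedule):
    """Execute the scheduling script hierarchically, innermost loops first.

    `preprocess` stands for the scheduling preprocessing (cut reduction,
    optionally preceded by clustering); `schedule` can be any scheduler,
    since the register constraints are already satisfied afterwards.
    Returns the order in which loops were treated.
    """
    order = []
    for nested in loop.nested_loops:
        order += run_script(nested, preprocess, schedule)
    preprocess(loop)
    schedule(loop)
    order.append(loop.name)
    return order

# For a j-loop nested in an i-loop, the j-loop is treated first:
inner = Loop("j")
outer = Loop("i", nested=[inner])
assert run_script(outer, lambda l: None, lambda l: None) == ["j", "i"]
```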


[Figure 2.3: A simple example of scheduling preprocessing. Panel (a): a signal flow graph of multiplications, additions and register copy operations, with the constraint that register file "rega" holds 1 register; the signals x9, x15, x16 and x18 are assigned to rega. Panel (b): the target architecture with a MULT and an ALU, buffered by register file rega.]

Scheduling preprocessing. The basic principles of scheduling preprocessing are illustrated by means of a simple example. Consider the signal flow graph(5) in figure 2.3(a). The large nodes correspond to operations, such as additions (+), multiplications (*) and register copy operations (cp), and the small (black and white) nodes correspond to signals that are produced and consumed by the operations. Suppose that the signal flow graph in figure 2.3(a) has to be scheduled on the architecture in figure 2.3(b), such that register file "rega" only contains 1 register. The signals that have already been assigned to that register file (x9, x15, x16 and x18) are indicated by the black dots. The different steps of the scheduling script are illustrated for the example of figure 2.3 in figure 2.4. Preprocessing of the signal flow graph in figure 2.3(a) can do two things (figure 2.4(a)):

1. It can add sequence to the signal flow graph: e.g. by adding the dashed arrow in figure 2.4(a), such that its destination is forced to be scheduled after its origin, certain lifetimes, or groups of lifetimes, cannot overlap during scheduling. A technique for sequentializing a design, called cut reduction, is proposed in Chapter 4.

2. It can identify groups of operations that "belong together in time" (the dashed contours in figure 2.4(a)). The groups (clusters) are scheduled separately. As a result, the lifetimes of the signals they communicate are kept short. A technique for clustering is proposed in Chapter 5.

(5) The precise semantics of a signal flow graph are described in Chapter 3.


[Figure 2.4: Illustration of the scheduling script (drawing omitted). Panel (a): the preprocessed signal flow graph of figure 2.3(a), with an added sequence edge and dashed cluster contours; panel (b): the clusters scheduled separately; panel (c): the combined schedule over time steps 0 to 7, with the signals x9, x15, x16 and x18.]


The second preprocessing technique (clustering) is optional: it is not strictly required to obtain a solution, but it speeds up the cut reduction (see Chapters 4 and 5). By means of clustering and cut reduction, a solution that satisfies the constraints on the number of available registers is guaranteed. For instance, it can be verified in figure 2.4(a) that the constraint on the size of register file "rega" is satisfied: the preprocessed graph of figure 2.4(a) cannot be scheduled such that it requires more than one register in register file "rega".

The actual scheduling. Once the design is preprocessed such that the required number of registers can no longer exceed the available number of registers, it can be scheduled without worrying about registers. There are two different situations, depending on which type of preprocessing is performed. If only cut reduction is done as preprocessing (figure 2.2(a)), the scheduling task consists of scheduling a modified signal flow graph where typically a number of sequence edges have been added. However, if cut reduction is preceded by a clustering phase, the scheduling task is also split in two phases (figure 2.2(b)). These two scheduling phases are illustrated in figure 2.4: in a first scheduling stage, the clusters of operations identified by clustering are scheduled separately (figure 2.4(b)). This first scheduling stage is called cluster scheduling. The cluster schedules are then combined to form the actual schedule in a second scheduling stage (figure 2.4(c)). Since this second scheduling stage has to take scheduled clusters (called "macronodes") into account, it is called macronode scheduling(6). The scheduling techniques used for these two scheduling tasks can be chosen by the designer. For instance, the designer might want to schedule some parts of the design optimally, with an integer programming scheduler, while other, non-critical parts can be scheduled with a fast list scheduler.
In Chapter 7 of this thesis, an integer programming scheduling technique is proposed for both scheduling stages.

Hierarchy. Since design hierarchy (nested loops and conditions) has to be supported by the scheduler, the scheduling script is executed hierarchically (see figure 2.2(a) and (b)). Preprocessing and scheduling are performed for each loop in the loop nesting hierarchy of the design, starting with the innermost loop. For instance, for the hierarchy of the CHED91 design in figure 1.5, hierarchical scheduling treats the loops in the following order: i2 → j → t → kdi → i → p2 → p → index → main → top. This does not severely affect the optimality of the result, because the controllers and the scheduling techniques that are used in general do not allow the simultaneous execution of two loops. Hence, all resource constraints (except registers) are the same for each loop. The motivation for this single thread of control is that a single processor of a

(6) For a formal definition of macronode scheduling, see Definition 6.1 in Chapter 6.


multi-processor system is handled separately from the other processors. The partitioning into processors is assumed to be performed before scheduling. Furthermore, the innermost loops are mostly the time-critical ones, and should be scheduled as densely as possible (see Section 6.1). Finally, flattening the design hierarchy would lead to very complex scheduling problems.

2.3 Motivations for the new approach

The main motivations for this new approach are summarized below:

• The most important motivation for this work is that it allows constraints on the number of available registers to be taken into account. This is important both for the interaction with a synthesis system and for the scheduling of an application on a fixed architecture (e.g. code generation for programmable VLIW architectures).

• There is a strong relation between the ordering of the operations and the number of registers that are required to store the signals. In this thesis, it is proposed to do register optimization as a global scheduling preprocessing, because doing it during scheduling leads to complex scheduling techniques that take local decisions on the register allocation (see Section 2.1.1), and doing it after scheduling does not take the relation between scheduling and register cost into account. There is still a close interaction with scheduling, because the preprocessing mainly focuses on the sequentialization of the signal flow graph.

• Finally, scheduling preprocessing is independent from the scheduler: the designer has the freedom to choose an appropriate scheduler. This fits very well in a hierarchical scheduling approach, which is required anyhow for complex medium-throughput applications.

2.4 Summary

Register optimization is regarded as very important, both in the field of software compilation and in the field of high-level synthesis. Powerful techniques for register allocation, register assignment and register cost balancing during scheduling have been published.
However, if the number of available registers is fixed, the interdependence of register allocation and scheduling is only considered by a few authors, especially in the domain of code generation for VLIW architectures. They propose scheduling algorithms that take local decisions on register allocation while scheduling.


A new approach for register optimization is proposed in this thesis. It performs a global register optimization by preprocessing the design before scheduling, mainly by sequentializing it. As a result, the constraints on the available number of registers are satisfied before scheduling, such that any scheduler can be used for the actual scheduling.


Chapter 3

The signal flow graph model

The behavioral representation of the design is almost always modeled as a control/data flow graph, where the nodes are the operations and the edges data dependences or control flow between the nodes. In this chapter, a Hierarchical Decorated Signal Flow Graph (HDSFG) model is developed that suits all the needs of register optimization and scheduling. It is based on the DSFG model described in [Lanneer 91]. This does, however, not restrict the general validity of the HDSFG model for scheduling problems. A new technique for the estimation of the maximum register cost of a design is proposed, using the HDSFG as design model.

In Section 3.1, the nodes and edges of the HDSFG model are defined. The model is completed for condition and loop hierarchy in Section 3.2. The scheduling of a HDSFG is defined in Section 3.3, and the estimation of the maximum register cost of a HDSFG is discussed in Section 3.4.

3.1 Nodes and edges in the HDSFG

The basic entities of the HDSFG model, nodes and edges, are described in Section 3.1.1 and Section 3.1.2. The timing relations between the nodes of the HDSFG are annotated on the edges. The timing model for these relations is described in Section 3.1.3.

3.1.1 Nodes

The nodes in the HDSFG model are the operations of the behavioral description of the design. The inputs and outputs of an operation are mapped on the input and output ports of a HDSFG node. The annotation of the bindings of the operations and its inputs/outputs (binding to EXU instance, EXU ports, control modes, etc.) is done by "decorating" each node with attributes


[Lanneer 91]. For instance, RAM accesses are modelled as read and write operations which are bound to a RAM EXU. This binding information is required for the detection of resource conflicts between operation nodes in the HDSFG:

Definition 3.1 (Resource conflict) Two nodes have a resource conflict if the operations they represent are incompatible when executed on the same hardware resource at the same time(1).

Most schedulers require that the HDSFG consists of register transfers, although this is not strictly required for the register optimization techniques in this thesis:

Definition 3.2 (Register transfer RT) A register transfer (RT) is an operation that reads stored inputs and whose outputs are stored as well.

These RTs are obtained by chaining together several operations that can execute in the same clock cycle [Lanneer 93b] [Hwang 91] [Pangrle 86] [Chu 92] [Camposano 90]. For schedulers that can chain operations while scheduling them [Gebotys 92] [Hwang 91], the HDSFG nodes are not restricted to be RTs.

3.1.2 Edges

Between nodes of the HDSFG, directed edges are defined. A directed edge starts at a port of a node (the source node) and ends at another port (of the destination node). There are two kinds of edges: data edges and sequence edges.

Definition 3.3 (Data edge) A data edge e1,2 is an edge that represents the transfer of a signal between two operation ports (and thus between two operations o1 and o2): the source node of the data edge produces the signal and the destination node of the data edge consumes the signal.

A data edge is represented as in figure 3.1(a): by means of a signal symbol (the small circle for signal a in figure 3.1(a)). A signal can be consumed by several operations (figure 3.1(b)). This is sometimes represented by means of two edges for the same physical signal, starting at the same output port and going to two different input ports(2) (figure 3.1(c)).
Such edges are called fanout edges:

Definition 3.4 (Fanout edges) Fanout edges are data edges that correspond to the same signal and originate at the same operation port.

(1) For a more formal definition, the reader is referred to [Lanneer 93a].
(2) Note that the source and destination port symbols will not be drawn in the figures if there is no doubt about the connectivity of the edge.


[Figure 3.1: Data and sequence edges. Panel (a): a data edge with a signal symbol a between two operations; panel (b): a signal consumed by several operations; panel (c): the same signal drawn as two fanout edges; panel (d): a sequence edge.]

The edges e1,2 and e1,3 in figure 3.1(c) represent the same signal. The data edges in the figures in this thesis are represented using the style of figure 3.1(a) and (b), or the style of figure 3.1(c), depending on the context.

A signal represented by a data edge can be assigned to at most one storage (register, register file, RAM memory, etc.). The addresses of the signals that are assigned to register files do not have to be known, since they can easily be computed after scheduling, during register assignment (see Section 2.1.2). On the other hand, the addresses of the signals assigned to RAM memory are computed during high-level memory management (see Section 1.3) and are thus known.

Definition 3.5 (Sequence edge) A sequence edge e1,2 is an edge that represents a timing relation between the source port and the destination port of the edge. It puts constraints on the time delay between an event at the source port and an event at the destination port.

This timing relation can for instance be a timing constraint specified by the designer or derived from an I/O or memory protocol, e.g. a fixed number of clock cycles between I/O operations. However, it can easily be extended to also model timing constraints in terms of e.g. nanoseconds. A formal definition of the timing model in terms of clock cycles is given in Section 3.1.3. Sequence edges are represented as in figure 3.1(d). All edges have a cost:

Definition 3.6 (Edge cost array) The edge cost array c1,2 for an edge e1,2 is an array of N non-negative real numbers, where N is the number of register files in the architecture, and where c1,2[i] = 1 if the edge corresponds to a signal that is assigned to register file i, and c1,2[i] = 0 otherwise.

The edge cost array is used for calculating the maximum register cost of an HDSFG (see Section 3.4).
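Definition 3.6 translates directly into code. The sketch below is an illustration, not thesis code; summing the cost arrays element-wise over a set of edges (for instance, the edges crossing a cut through the graph) counts, per register file, how many of the crossing signals are assigned to it.

```python
def edge_cost(num_register_files, assigned_file=None):
    """Edge cost array c of Definition 3.6: c[i] = 1 if the edge's
    signal is assigned to register file i, and c[i] = 0 otherwise."""
    c = [0.0] * num_register_files
    if assigned_file is not None:
        c[assigned_file] = 1.0
    return c

def summed_cost(edge_costs):
    """Element-wise sum of edge cost arrays: per register file, the
    number of signals among the given edges assigned to that file."""
    return [sum(column) for column in zip(*edge_costs)]

# Three edges in an architecture with two register files: two signals
# assigned to file 0, and one signal not assigned to any register file.
costs = [edge_cost(2, 0), edge_cost(2, 0), edge_cost(2)]
assert summed_cost(costs) == [2.0, 0.0]
```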


[Figure 3.2: Illustration of port and edge latencies. Panel (a): a pipelined EXU with a multiplier (MULT), an adder (ADD), register files (RF) at the inputs and a pipeline register (PIPE); panel (b): two chained multiply-accumulate operations annotated with port latencies, edge latencies and throughputs; panel (c): the same computation represented as two MACC operations.]

3.1.3 The timing model

The nodes and the edges of the HDSFG satisfy a timing model that is characterized by a number of parameters, defined in the following paragraphs.

A first timing parameter is the start time of an operation:

Definition 3.7 (Start time) The start time ti of an operation oi is the first clock cycle during which oi starts executing (i.e. occupies a hardware resource, for computational operations like additions, multiplications, casts, etc.).

The start time of each operation is determined by the scheduler. For example, the start time of an operation could be clock cycle 23,546.

Each operation has a number of input and output ports. Each of these ports has a port latency as timing parameter:

Definition 3.8 (Operation port latency) The latency li,a of port a of operation oi is the number of clock cycles between the event at port a, i.e. reading or writing the corresponding signal, and the start time ti.

The port latencies are determined by the assignment of the operation to an operator type. The example in figure 3.2 illustrates Definition 3.8. In figure 3.2(b), a sequence of two multiply-accumulate operations (e.g. two adjacent taps of a digital filter) is mapped on the EXU of figure 3.2(a). The EXU has three inputs, buffered by register files (RF), and one output which is fed back to the third input. Moreover, the EXU of figure 3.2(a) is pipelined between the multiplier and the adder. The outputs of the multiplication and addition operations both have latency l*,out = l+,out = 0, since they are produced in the same clock cycle as the consumption of the inputs (figure 3.2(b)). However, if each multiplication-addition pair is represented in the HDSFG as


a "MACC" operation (figure 3.2(c)), then the output latency of a MACC operation is lMACC,out = 1, and also the latency of the third input of the MACC operation is lMACC,in3 = 1.

Another timing parameter for operation nodes is the throughput(3):

Definition 3.9 (Throughput) The throughput Ti of an operation oi is the number of clock cycles after which a new operation can start executing on the hardware resource occupied by oi.

In the example of figure 3.2, the operations in figure 3.2(b) have throughput T* = T+ = 1. The high-level MACC operations in figure 3.2(c) also have throughput TMACC = 1, since a new MACC operation can start each clock cycle.

The edges in the HDSFG also have a timing parameter, called the edge latency:

Definition 3.10 (Edge latency) The edge latency la,b of the edge e between the ports a and b is the number of clock cycles between the events at ports a and b (read or write events). The edge can be a data or a sequence edge. If e is a data edge corresponding to a signal that is produced at port a, stored, and consumed at port b, its edge latency is at least 1. The edge latency la,b is an integer in the interval [lmin_a,b, lmax_a,b].

The size of the interval [lmin_a,b, lmax_a,b] not only depends on the nature of the edge, but can also be limited by the topology of the HDSFG and the global timing constraints. The actual value of la,b is a result of the scheduling task. In the example of figure 3.2, edge ea,b in figure 3.2(b) has edge latency la,b = 1 (or: la,b ∈ [1, 1]), because the signal at output a is written in a pipeline register. Since a pipeline register is a latch, its value has to be read in the next clock cycle, or else it gets overwritten. The edge ec,d in figure 3.2(b) and (c) has edge latency lc,d ≥ 1 (or: lc,d ∈ [1, ∞)), because the signal at output c is written to a register file, where it can be stored for more than one clock cycle.

By means of these timing parameters, timing constraints between operation nodes of the HDSFG can be formulated.
For instance, a minimum timing constraint is formulated as follows(4) (figure 3.3):

    t2 ≥ t1 + δ1,2                                            (3.1)

with:

    δ1,2 = max over all edges ea,b of (l1,a + lmin_a,b − l2,b)    (3.2)

where a is a port of operation o1 and b a port of o2. For instance, in figure 3.2(c), δ1,2 = 1 + 1 − 1 = 1, so that t2 ≥ t1 + 1.

(3) Throughput has the dimension of a number of clock cycles (instead of the inverse) in the context of this thesis. Some authors use the term "data introduction interval" instead.
(4) A completely similar derivation holds for maximum timing constraints.
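Equation (3.2) is a simple maximization over the edges between the two operations; a minimal sketch (the tuple layout is a hypothetical data representation, not thesis code):

```python
def min_start_distance(edges):
    """delta_{1,2} of equation (3.2). Each edge is given as a tuple
    (l1_a, lmin_ab, l2_b): source port latency, minimum edge latency
    and destination port latency. The result gives t2 >= t1 + delta."""
    return max(l1_a + lmin_ab - l2_b for (l1_a, lmin_ab, l2_b) in edges)

# The MACC example of figure 3.2(c): l_{MACC1,c} = 1, lmin_{c,d} = 1 and
# l_{MACC2,d} = 1, giving delta = 1 + 1 - 1 = 1, so t2 >= t1 + 1.
assert min_start_distance([(1, 1, 1)]) == 1
```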


[Figure 3.3: The minimum timing constraint. A time axis with operations o1 and o2 starting at t1 and t2, the port latencies l1,a and l2,b, the latency of edge ea,b, and the resulting minimum delay δ1,2.]

3.2 Hierarchy

The definition of a basic HDSFG node in the previous section is now extended to also include nested HDSFGs. In this way, the HDSFG becomes fully hierarchical:

Definition 3.11 (HDSFG) A hierarchical decorated signal flow graph is a directed cyclic graph where the nodes are operations from the behavioral input, or nested hierarchical levels (again HDSFGs). The nodes of the HDSFG are attributed ("decorated") with additional synthesis information (e.g. the bindings to resources). The edges of the HDSFG represent data flow, or control flow (i.e. sequence).

The implications of loops and delays on the HDSFG are discussed in Section 3.2.1. The way conditions are represented in the HDSFG is described in Section 3.2.2. Both the modeling of nested loops and the modeling of conditions are illustrated by means of the syndrome generator example SYNDR (see table 1.2) in Section 3.2.3.

3.2.1 Modeling loops in the HDSFG

The behavioral description of a real-time signal processing application almost always contains loops, to describe an iterated computation (e.g. FOR-loops or WHILE-loops). These loops are interpreted procedurally by the schedulers in the low-level mapping phase of the synthesis script:

Definition 3.12 (Procedural loop) A procedural loop is the iterated execution of a signal flow graph (called the "loop body"), where the iteration sequence is fixed.

It is the task of high-level memory management [Swaaij 92] [Franssen 92] [Verbauwhede 91] to determine the best iteration sequence and nesting order


of loops. In the Cathedral-2nd synthesis script, high-level memory management is performed before low-level mapping (see figure 1.3). Even stronger: there are good motivations to do this in any design script oriented towards real-time signal processing [Swaaij 92] [Franssen 92].

[Figure 3.4: Delay operations and delay lines. Panel (a): a delay operation @ turning signal a into a@1; panel (b): chained delay operations producing a@d; panels (c) and (d): delay lines modeled as weights d on data and sequence edges.]

Delay operations and delay lines. As a consequence of the procedural interpretation of loops, there can be data and sequence edges between different iterations of the same loop. A delay operation is inserted where a data or a sequence edge runs to the next loop iteration:

Definition 3.13 (Delay operation) A delay operation w.r.t. a procedural loop is an operation with zero throughput, and which has one input port and one output port. Both port latencies are zero. If a data edge is connected to the input port, another data edge is connected to the output port, making the signal of the input data edge available in the next loop iteration. If a sequence edge is connected to the input port, another sequence edge is connected to the output port, modeling a sequence constraint between operations in different loop iterations.

A delay operation is sometimes represented as in figure 3.4(a): the signal a is delayed to the next loop iteration, where it is called a@1. Successive delay operations can be chained (figure 3.4(b)):

Definition 3.14 (Delay line) A delay line is a set of d consecutive delay operations. d is the length of the delay line and d ≥ 0.

In the HDSFG model, delay lines are not modeled as nodes, but as weights on the edges (figure 3.4(c)). The delay operations "@" are often omitted in the figures and represented by black dots on the edges, or by "looping" edges as in figure 3.4(c).
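Modeling a delay line as an edge weight (Definitions 3.14 and 3.15) can be sketched as follows; the class layout is hypothetical and for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Edge:
    src: str
    dst: str
    weight: int = 0   # d: length of the delay line carried by this edge

def consumer_iteration(edge, producer_iteration):
    """A signal produced in iteration i on an edge of weight d is
    consumed in iteration i + d of the procedural loop."""
    return producer_iteration + edge.weight

e = Edge("o1", "o2", weight=1)        # signal a, consumed as a@1
assert consumer_iteration(e, 0) == 1  # iteration 0's value is read in iteration 1
```

A weight of 0 is the ordinary intra-iteration data edge; larger weights express the inter-iteration dependences that delay operations introduce.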


[Figure 3.5: Initialization and termination of delay lines. Panel (a): a delay operation with the initial value of a computed before the loop and the last value of a used after it; panel (b): entry and exit nodes for the initial and last values; panel (c): an entry-exit node pair for a loop invariant signal.]

Definition 3.15 (Edge weight) The weight d1,2 of an edge e1,2 is the length of the delay line on e1,2, where d1,2 is an integer and d1,2 ≥ 0.

Note that sequence edges can also carry delay lines (figure 3.4(d)), e.g. for timing relations between HDSFG nodes in different loop iterations (see Definition 3.13).

In the very first iteration of the loop, there is no delayed signal from the previous iteration available. Therefore, the initial value of a delay operation has to be defined (figure 3.5(a)):

Definition 3.16 (Initial value) The initial value of a delayed signal a is the very first value at the output of the delay operation that delays a, upon entering its corresponding loop.

It is clear that this initial value has to be computed before entering the loop, as shown in figure 3.5(a). Similarly, the last value of a delayed signal is defined as follows:

Definition 3.17 (Last value) The last value of a delayed signal a is the value of that signal in the last iteration of the loop in which it is nested.

This last value of a signal is typically used somewhere after the loop (figure 3.5(a)).

Delay line windows. A delay over 1 loop iteration can be realized with a single register, whereas delays over several iterations require more hardware: several registers, or a FIFO, or RAM, etcetera [Vanhoof 92] [Vanhoof 91]


[Figure 3.6: Constraining the delay line window. Panel (a): a data edge from o1 to o2 carrying signal a and its delayed versions a@1 and a@2, with an inverse sequence edge of weight d2,1 = 2; panel (b): the three possible loop organizations of the data edge, with dashed cuts through the graph.]

[Lanneer 93a]. The actual mapping of a delay line to storage is one of the tasks of data routing (see Section 1.3). However, the limit on the size of the storage available for storing a delay line can be modeled as a constraint in the HDSFG:

Definition 3.18 (Window) The window W of a delay line is the number of storage fields required to store the delay line. If d is the length of the delay line, then W ≥ d.

The window of a delay line can be constrained in a very straightforward way, by adding an inverse sequence edge with the appropriate weight. For example, by adding a sequence edge from o2 to o1 with weight d2,1 = 2 in figure 3.6(a), the window of the delay line on edge e1,2 is never going to exceed 2 (i.e. W1,2 ≤ 2). This is illustrated in figure 3.6(b), where the three possible loop organizations of the data edge in figure 3.6(a) are shown (e.g. after scheduling). In none of the three cases does the storage requirement of the delay line on e1,2 exceed 2 fields (the dashed lines indicate cuts through the graph). This mechanism for window constraining is used in the register optimization techniques of the following chapters.

Entry and exit nodes. Initial values and last values of delayed signals "enter" and "exit" the loop. These inputs and outputs are modeled with entry and exit nodes in the HDSFG of the loop (the squares in figure 3.5(b) and (c)):

Definition 3.19 (Entry/exit node) An entry (exit) node is a dummy node with zero throughput and zero port latency, modeling the entry (exit) of a signal into (out of) the loop.

Each entry (exit) node corresponds to a different signal being imported (exported). Besides initial values of signals, other signals that typically enter loops are loop invariant signals (e.g. constants) and signal arrays that have


been written elsewhere. Loop invariant signals are alive during the whole loop, and are therefore modeled with a separate entry-exit node pair (figure 3.5(c)).

Edge decomposition. A minor complication of the hierarchy in the HDSFG is the fact that edges representing the entry or exit of signals have to be decomposed when crossing a loop boundary (see e.g. figure 3.5(b)). The problem with this decomposition is the distribution of the edge latency. A pragmatic solution to this problem is to assign the non-zero edge latency to those parts of the decomposed edge that are part of the outermost hierarchical levels, since the deeper hierarchical levels are more time-critical. Moreover, if this deeper hierarchical level has already been scheduled, a part of the edge latency can be taken up by its schedule.

3.2.2 Conditions in the HDSFG

Most real-time signal processing applications in the low to medium throughput domain involve complex decision making. This decision making is implemented by means of conditions (e.g. IF-statements) in the input description.

Mutual exclusiveness. Some operations in the HDSFG are only executed if a certain condition is true. These operations are called conditional operations, and they can be grouped in blocks:

Definition 3.20 (Conditional block) A conditional block is a HDSFG whose operation nodes all execute under the same condition.

Some of these conditions are mutually exclusive:

Definition 3.21 (Mutual exclusiveness) Mutually exclusive operations or blocks are operations or blocks whose conditions never hold at the same time (the logical AND of the conditions is always FALSE).

Checking the mutual exclusiveness of two conditions is generally an NP-complete problem [Papadimitriou 82]. However, the designer can provide information on exclusiveness in the input description (e.g.
by using the ELSE keyword), such that efficient exclusivity computations can be performed [Seynhaeve 87] [Rompaey 92].

The condition merge block. Signals defined by conditional operations are only defined under that specific condition. However, the same signal can be defined by several mutually exclusive operations (e.g. the signal a in figure 3.7(a)). In that case, it can be consumed outside of the conditional block. Therefore, entry and exit nodes can also be used in conditional blocks. Mutually exclusive conditional blocks are grouped as follows:


[Figure 3.7: Conditional signal definitions. Panel (a): a signal a defined by mutually exclusive conditional operations ("if cond"); panel (b): the mutually exclusive conditional blocks grouped in a condition merge block (CMB); panel (c): a flattened representation in which the edges carry the conditions under which they hold.]

Definition 3.22 (CMB) A condition merge block (CMB) is a hierarchical HDSFG node that groups a complete set of mutually exclusive conditional HDSFGs.

This abstraction of conditional code provides an extra level of hierarchy in the HDSFG, which can be efficiently exploited for maximum register cost estimation (see Section 3.4) and clustering (see Section 5.4.1). The extra levels of hierarchy do not remove much of the global optimality, because they are only used during scheduling preprocessing, which only aims at constraining the scheduling. However, the CMB abstraction is not always used: the scheduler (see Chapter 6) and the cut reduction technique (see Chapter 4) use flattened HDSFG models, where the only hierarchy is the loop nesting hierarchy. This allows a more global view. In a flattened HDSFG, besides the nodes, all edges are also attributed with the condition under which they hold (figure 3.7(c)).

3.2.3 The HDSFG model of the SYNDR example

As an illustration, the HDSFG for the syndrome generator example SYNDR (see table 1.2) is constructed. The original signal flow graph of SYNDR is shown in figure 3.8. This example contains two nested loops and conditions, as indicated in the figure. The operation symbols in the signal flow graph are explained in table 3.1. Some of the register copy operations in figure 3.8 are used to write the initial value of a delayed signal in its appropriate register. It is assumed that the operands of these copy operations are constants that are stored in some register file. Other register copy operations (like the register copy of signal sig1_12 to sig1_12_v1) are a consequence of the fact that the register files that hold the operands of an operation are hardwired to the inputs of the execution units. The architecture used for the SYNDR example is shown in figure 3.9(a), and the binding of the signals to the register files is shown in

Figure 3.8: The signal flow graph of the SYNDR example (two nested loops, for i and for j, with the mutually exclusive conditional branches "if ntmp22" and "if not ntmp22").

    symbol   operation
    +<       increment and compare with constant
    <        compare with constant
    @        delay
    cp       register copy
    ca       type cast
    R        read from RAM
    W        write to RAM
    +        addition
    *        multiplication
    neg      negation

Table 3.1: The operations of the SYNDR example

figure 3.9(b). The bus network in figure 3.9(a) is assumed to be of unlimited bandwidth.

The HDSFG constructed from the signal flow graph of figure 3.8 is shown in figure 3.10. The HDSFG of the i-loop (figure 3.10(a)) contains three delay lines of length 1. Note the entry (exit) nodes for the initial (last) values of the delayed signals. Furthermore, it has one hierarchical node for the nested j-loop. The HDSFG of the j-loop is shown in figure 3.10(b). The HDSFG in figure 3.10(b) also has a nested CMB node, abstracting the conditional blocks of the j-loop. The HDSFGs for the CMB node are shown in figure 3.10(c): the two HDSFGs of the two mutually exclusive conditional blocks. Note also that the hierarchical HDSFG nodes in figure 3.10(a) and (b) totally conform with the definition of a HDSFG node. They are multi-input, multi-output operation nodes, and their port latencies and throughput all equal the number of clock cycles that the nested HDSFG node takes to execute. This information is of course only available after scheduling.

3.3 Scheduling the HDSFG

Since the application domain is real-time signal processing, only the worst-case throughput of the design is relevant [Goossens 90] [Catthoor 92] [Lanneer 93a]. For data-dependent loop iterators (e.g. the WHILE construct), an upper bound on the number of iterations is required from the designer. Then, for each HDSFG, a static schedule [Parhi 89] can be defined:

Figure 3.9: The architecture and signal bindings for the SYNDR example. (a) the architecture: execution units alu and mult, the RAMs ramsyn, rams, ramt and raml, and the register files reg1 to reg14, connected by a bus network; (b) the binding of the carrier signals to the register files.

Figure 3.10: HDSFG for the SYNDR example. (a) the HDSFG of the i-loop; (b) the HDSFG of the nested j-loop with its CMB node; (c) the two mutually exclusive conditional HDSFGs of the CMB node.

Definition 3.23 (Static schedule) A static schedule is a mapping of the HDSFG operations to the time axis. This mapping, as well as the resource assignment of the operations, is the same for each loop iteration, whatever conditions may hold during that iteration.

Since the schedule length is independent of the conditions that hold, all conditional branches in the control flow take an equal number of clock cycles. Short conditional branches are completed with no-operation (NOP) instructions [Vanhoof 93].

As motivated in Section 1.2, scheduling in the context of this dissertation is static scheduling under global resource constraints (and extra sequence constraints). The goal of this scheduling is always the minimization of the total number of clock cycles to execute the application. A schedule is basically nothing else than a list of time steps and a set of operations to be executed at each time step (the term "time step" is used by most authors, though "c-step", "control-step" and "potential" are also used):

Definition 3.24 (Time step) The time step p_i of operation o_i is the time step of the schedule where o_i is executed. The first time step of a schedule is always labeled "0".

3.4 Maximum register cost estimations

An important function of the HDSFG model is to support the estimation of the maximum register cost of a design. This maximum register cost estimation is at the basis of the techniques for register optimization in Chapters 4 and 5. It is therefore worthwhile to spend a section on the discussion of these estimation techniques.

The pros and cons of estimation versus the exact calculation of the maximum register cost are discussed in Section 3.4.1. Next, a new technique, hierarchical retiming, is proposed for maximum register cost estimation. The retiming model is presented in Section 3.4.2. The different estimations used in this dissertation are described in Section 3.4.3. Finally, Section 3.4.4 contains comments on the algorithmic complexity of maximum register cost estimation.

3.4.1 Estimation vs. exact calculation

The problem of finding a schedule that satisfies constraints on the available number of registers is solved in this dissertation by satisfying the register cost constraints before scheduling (see Sections 2.2 and 2.3 for an overview and a

motivation of this approach). Therefore, the maximum (in the sense of worst-case over all possible schedules) register cost represents an important design characteristic. Techniques to reduce this maximum register cost are presented in Chapters 4 and 5.

The exact maximum register cost  Suppose a graph G(E, V) is constructed, where V is the set of nodes and E is the set of undirected edges. Each node of V represents a data edge of the HDSFG (suppose that there is never more than one fanout edge at each operation port, see Definition 3.4). There is an edge between two nodes of V if the data edges corresponding to these two nodes represent two different, overlapping signal lives (for un-scheduled designs, two signals have a possible lifetime overlap if their productions and consumptions can be scheduled such that the signals are indeed simultaneously alive). For instance, G(E, V) for the scheduled HDSFG in figure 2.1(a) is shown in figure 2.1(c): e.g. signals a and b are alive simultaneously at the end of the first time step of the schedule, hence there is an edge between the nodes a and b in figure 2.1(c). The exact maximum register cost can be found by computing the size of the maximum clique (see Definition 6.6) in G(E, V) [Tseng 86] [Grass 90]. The size of the maximum clique in figure 2.1(b) is 3, such that 3 registers are required to store the signals in the schedule of figure 2.1(a). Finding the max-clique in a graph is an NP-complete problem [Garey 79] [Papadimitriou 82]. Since the graph G(E, V) can get quite large and since the maximum register cost is evaluated a large number of times (see Chapters 4 and 5), finding the maximum clique is not practical. Therefore, a polynomial time alternative is presented below for the estimation of the maximum register cost.

Register cost estimation techniques in the literature  In [Paulin 89], [Verhaegh 91] and [Lanneer 93a], register cost estimation is done by means of probabilistic distribution graphs. The register cost is estimated in order to balance the register utilization in time. Due to its probabilistic nature however, this technique is not suited for the calculation of the maximum register cost. This is illustrated in figure 3.11: the distribution probabilities of the operations in figure 3.11(a) are shown in figure 3.11(b) for a maximum schedule length of 6 time steps. For instance, there is a 0.25 probability that operation o1 will be scheduled in time step 0. The distribution diagram for the lifetimes of the signals is shown in figure 3.11(c): e.g. a probability of 0.6 for signal a in time step 3 means that there is a 60% chance that a is going to require storage in time step 3. The accumulation of all these probabilities is shown in figure 3.11(d): according to this diagram, the estimated register cost in e.g. time step 1 is 1.07. However, the maximum register cost in time step 1 is 3. An
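The exact calculation described above can be prototyped in a few lines. The sketch below uses a brute-force search, so it is only usable on small interference graphs; the signal names and overlap pairs are illustrative, not taken from figure 2.1. It builds the adjacency of G(E, V) from a list of lifetime overlaps and returns the maximum clique size:

```python
from itertools import combinations

def max_clique_size(vertices, edges):
    # adjacency sets of the undirected interference graph G(E, V)
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    # try decreasing clique sizes; the first fully connected subset wins
    for k in range(len(vertices), 0, -1):
        for cand in combinations(vertices, k):
            if all(b in adj[a] for a, b in combinations(cand, 2)):
                return k
    return 0

# signals a, b, c are pairwise simultaneously alive; d only overlaps c
overlaps = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
print(max_clique_size(["a", "b", "c", "d"], overlaps))  # 3 registers needed
```

Because the exact problem is NP-complete, this enumeration degrades quickly with graph size, which is precisely why the polynomial-time estimation of the following sections is preferred.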

Figure 3.11: Register cost estimation with distribution diagrams. (a) a small data-flow graph of five operations; (b) the scheduling probabilities of the operations over 6 time steps; (c) the lifetime distribution diagrams of the signals a, b, c and d; (d) the accumulated register cost estimate per time step.

alternative technique for register cost estimation is reported in [Kurdahi 87] and [Mlinar 91]: the maximum register cost is estimated by calculating the minimum flow in a network. This however requires that the signal flow graph is acyclic (e.g. by removing the edges with weight d_{1,2} >= 1). In this dissertation, a new technique for cyclic graphs is presented (see further in this section).

The maximum number of live signals  A good estimation of the maximum number of registers can be obtained by calculating the maximum number of live signals (at any time step). Theoretically, this is an under-estimation of the maximum register cost: e.g. in figure 2.1(a), at most 2 signals are alive simultaneously, but 3 registers are required. However, Stok [Stok 92] and Hendren [Hendren 92] conclude from a large number of experiments that the number of registers required is seldom larger than the maximum number of live signals. In some special cases, they reported, some extra register transfers are required to split a lifetime in two separate lifetimes (see Section 2.1.2). The addition of these extra transfers can very well be done by the data routing task (see Section 1.3.1). In other words, calculating the maximum number of live signals is a very tight underestimation of the actual maximum register cost (the same conclusion was made in the context of register assignment in Section 2.1.2).

3.4.2 The hierarchical retiming model

A new contribution of this thesis is the calculation of the maximum number of live signals in polynomial time by means of the retiming technique [Leiserson 83]. This is shown in the following paragraphs.

1. Retiming variables  At each hierarchical level of the HDSFG, a retiming model is constructed as follows. Each node gets an integer retiming variable r. Initially, all retiming variables have value r = 0. If a (partial) scheduling of the HDSFG has already been performed, the HDSFG nodes that are scheduled at the same time step share the same retiming variable (see also Section 5.5).

2. Nested block model  Since the goal is to model the maximum number of live signals, the maximum set of live signals of nested blocks should also be modeled. This can be done by modeling nested blocks (figure 3.12(a)) by means of a node pair {NB, NB'} and an extra edge, as in figure 3.12(b). The extra edge e_{NB,NB'} gets a zero initial weight (d_{NB,NB'} = 0) and a cost c_{NB,NB'} that represents the maximum number of live signals (in each register file) for the nested block; this is an extension of the definition of edge cost (Definition 3.6).

Figure 3.12: Retiming model for nested blocks. (a) a nested block; (b) its model: the node pair {NB, NB'} with the extra edge e_{NB,NB'} of initial weight d_{NB,NB'} = 0 and cost c_{NB,NB'}.

Figure 3.13: The retiming model. (a) a delay line d_{1,2} between two nodes, with the exit node pinned at r_EXIT = 0 and initial weight M on the edges into it; (b) the window of the delay line constrained by a reverse sequence edge of weight W_{1,2} - d_{1,2}.

3. Hierarchical retiming  Retiming can be applied in a hierarchical context (figure 3.13(a)):

Definition 3.25 (Hierarchical retiming) Hierarchical retiming is a transformation of the set {d_{1,2}^{init}} (the initial edge weights of all edges in the HDSFG) to the set {d_{1,2}}, with

    d_{1,2} = d_{1,2}^{init} + r_2 - r_1 \geq 0    (3.3)

and such that the following function is maximized (for a fixed i):

    \sum_{e_{1,2}} d_{1,2} \cdot c_{1,2}[i]    (3.4)

The retiming is first performed at the innermost nested loop level of the HDSFG hierarchy and from there expands towards the higher loop levels.
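For small graphs, the maximization of Definition 3.25 can be checked by exhaustive search instead of linear programming. The sketch below is a brute-force stand-in, not the LP formulation of Section 3.4.4; the edge tuples and the r_max bound are assumptions for the example. It enumerates bounded retiming values, keeps only assignments for which every retimed weight (3.3) stays non-negative, and maximizes the goal function (3.4):

```python
from itertools import product

def retime_max_live(nodes, edges, exit_node, r_max=3):
    """edges: (src, dst, d_init, c) with c the edge cost c[i] for one
    register file i. Returns (max value of (3.4), retiming values r)."""
    free = [n for n in nodes if n != exit_node]
    best_val, best_r = -1, None
    for rs in product(range(r_max + 1), repeat=len(free)):
        r = dict(zip(free, rs))
        r[exit_node] = 0                      # exit nodes are pinned to r = 0
        d = {(s, t): d0 + r[t] - r[s] for s, t, d0, _ in edges}
        if any(w < 0 for w in d.values()):    # constraint (3.3) violated
            continue
        val = sum(d[(s, t)] * c for s, t, d0, c in edges)
        if val > best_val:
            best_val, best_r = val, dict(r)
    return best_val, best_r

# a three-operation chain feeding an exit node; the exit edge carries the
# delay pool (weight M = 3, cost 0) and the reverse edges model windows of 1
nodes = ["in", "a", "b", "exit"]
edges = [("in", "a", 0, 1), ("a", "b", 0, 1), ("b", "exit", 3, 0),
         ("a", "in", 1, 0), ("b", "a", 1, 0)]   # window constraints W = 1
print(retime_max_live(nodes, edges, "exit")[0])  # 2 live signals at most
```

With the window constraints in place, at most two of the chain's signals can be pushed across the loop boundary at once, so the maximized goal function evaluates to 2.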

Figure 3.14: Retiming for maximum register cost. (a) a graph with one delayed signal (d = 1) between input and output; (b) the retimed graph, in which the maximum number of live signals is 3.

Note that the window W_{1,2} of the delay line d_{1,2} can be constrained to W_{1,2}^{max} by adding a sequence edge from o_2 to o_1 with weight W_{1,2}^{max} - d_{1,2}^{init} (figure 3.13(b)), as explained in Section 3.2.1. The following theorem explains the cost function (3.4):

Theorem 3.1 If not more than one edge with c_{1,2}[i] > 0 corresponds to the same signal, the maximum number of live signals in register file i at any time step of any schedule of the HDSFG can be obtained by retiming the HDSFG according to Definition 3.25, where the maximum number of live signals is given by cost function (3.4).

Proof:
Let |S_{i,j,k}| be the number of live signals in register file i at time step j of schedule k of the HDSFG. The maximum number of live signals S over all possible schedules of the HDSFG is:

    S = \max_k \{ \max_j |S_{i,j,k}| \}    (3.5)

Suppose that this maximum number of live signals occurs at time step j_1 of schedule k_1. If the HDSFG is assumed to repeat a number of times (as is always the case in real-time signal processing), schedule k_1 can always be rotated such that time step j_1 becomes the last time step of the schedule, by changing the time steps p_o of each operation o as follows:

    p'_o = (p_o + C - j_1) \bmod C    (3.6)

where C is the length of the schedule (i.e. the number of time steps). After time step j_1 has been rotated such that it has become the last time step of


the schedule, all signals that are alive during time step j_1 are delayed to the next iteration of the loop (see Definition 3.13). This means that this maximum number of live signals can also be found by retiming the HDSFG such that a maximum number of data edges cross the loop boundary, i.e. have a weight d_{1,2} > 0. This can be achieved by maximizing \sum_{e_{1,2}} d_{1,2} \cdot c_{1,2}[i] during retiming, on the condition that not more than one edge with c_{1,2}[i] > 0 corresponds to the same signal. □

The application of Theorem 3.1 is illustrated in figure 3.14: the HDSFG in figure 3.14(a) requires at least 1 register for the delayed signal (suppose that all signals are stored in the same register file i and that the maximum window for each delay line is 1). Hierarchical retiming of the HDSFG in figure 3.14(a) yields the HDSFG of figure 3.14(b), where the maximum number of live signals is 3.

It is clear that the condition of Theorem 3.1 does not hold if there are fanout edges (see Definition 3.4). However, the edge cost of fanout edges can be corrected such that hierarchical retiming still gives a very good estimation of the maximum number of live signals (see Section 3.4.3).

4. The role of the exit nodes in retiming  To make retiming possible, extra delay has to be able to move into the retiming model of the HDSFG, as in figure 3.14(b): delay can get into the HDSFG through the output (exit) nodes. If the retiming variable of the exit nodes is set to value r = 0 (figure 3.13(a)), it can be seen that all other retiming values are non-negative, since delay moves from output to input during retiming. The advantage of non-negative retiming values is that they can be computed with linear programming techniques (see Section 3.4.4). Furthermore, the initial weight of the edges that connect to the exit nodes is M (figure 3.13(a)).

5. The value of M  The default value of M is M = ∞, to provide an infinite pool of delay for retiming. Hierarchical retiming implicitly takes software pipelining [Goossens 89b] [Lam 88] [Parhi 91] of the HDSFG into account (the register cost of a design can severely increase by software pipelining, see Chapter 6). The windows of the delay lines on the edges should be constrained. E.g. in figure 3.15(a), edge e_{3,5} has a maximum window of 3. These maximum windows are set by the designer (but can also be minimized during software pipelining, see Chapter 6). However, most signals in a typical HDSFG are scalar and thus have a fixed maximum window of W^{max} = 1. If software pipelining is not going to be performed, the loop iterations are sequentialized (figure 3.15(b)). Finally, a limited software pipelining can be modeled by allowing the user to set the value of M, as in figure 3.15(c).

Figure 3.15: The weight on the output edges. (a) M = 9999 models an infinite delay pool (unrestricted software pipelining); (b) the loop iterations sequentialized (no software pipelining); (c) M = 2 models a limited software pipelining.

6. Retiming formulation  The hierarchical retiming problem is formulated as a linear program. The formulation consists of one inequality (3.3) for each edge, the extra r = 0 equation for the exit nodes, and a goal function. The retiming can be solved by means of existing mathematical linear programming techniques [Schrijver 86] [Papadimitriou 82] [Schrage 89]. These will compute at least one set of integer values for the retiming variables r, because the constraint matrix of the problem is totally unimodular (see Section 3.4.4).

7. Retiming of conditional HDSFGs  The retiming of Definition 3.25 only holds for unconditional HDSFGs. Conditions could be taken into account in the linear programming formulation of the retiming. This however would destroy the total unimodularity of the linear program. This cannot be afforded, because retiming is executed very frequently. Two different alternative approaches for conditional HDSFGs are proposed:

1. Approach 1: the conditional blocks are encapsulated in CMB blocks and the retiming of each conditional block is calculated separately. The worst-case maximum register cost is modeled at the next higher level, by means of the extra edge e_{CMB,CMB'} (see figure 3.12(b)). The advantage of the CMB abstraction is the reduction of a large retiming problem into a number of smaller retiming problems, without sacrificing optimality. However, this is only true if software pipelining of the enclosing loop is not taken into account.

2. Approach 2: if software pipelining is going to be performed on the enclosing loop during scheduling, the CMB hierarchy must be flattened. The reason for this is that the separate retiming of each conditional block is no longer valid if software pipelining is possible. Then, retiming has to be done for each possible condition. Afterwards, the worst-case maximum register cost over all conditions is taken.

3.4.3 Different maximum register cost estimations

In this section, the modeling of the number of live signals is explained in detail. Because three different kinds of maximum register cost estimations are needed in the chapters further on, they are formally defined in this section.

1. Parallel edge sets  Bearing in mind the definitions of edge weight and costs, and retiming (Definitions 3.15, 3.6 and 3.25), the following concepts can be defined:

Definition 3.26 (Parallel edge set PES) A parallel edge set for register file i (PES_i) is a set of edges e_{1,2} with d_{1,2} > 0 and c_{1,2}[i] > 0, which represents a set of signals that are simultaneously alive in register file i.

Definition 3.27 (Value of a PES) The value of a parallel edge set, denoted as |PES_i|, is defined as

    |PES_i| = \sum_{e_{1,2} \in PES_i} d_{1,2} \cdot c_{1,2}[i]    (3.7)

According to Definition 3.25, hierarchical retiming finds the parallel edge set with the maximal value:

Definition 3.28 (Maximum parallel edge set MPES) A maximum parallel edge set for register file i (MPES_i) is a PES_i with a maximum value |PES_i| over all possible retimings of the HDSFG.

MPES_i represents the maximum number of live signals, by Theorem 3.1 (note that this only holds if each operation has no more than one fanout edge, see further). In other words, finding a MPES_i is the goal of hierarchical retiming.

It must also be noted that parallel and maximum parallel edge sets in general do not correspond to cuts and maxcuts. This can be shown by means of a simple counter-example. Suppose that a HDSFG contains two nodes, o_1 and o_2. Suppose also that there is an edge e_{1,2} with weight d_{1,2} = 0 between nodes o_1 and o_2, and an edge e_{2,1} with weight d_{2,1} = 1 in the other direction. In other words, o_1, o_2, e_{1,2} and e_{2,1} form a cycle in the HDSFG. According to the definitions of parallel and maximum parallel edge sets, only one of the two edges of the cycle can be part of a PES or a MPES. However, the definition of a maxcut is as follows (taken from [Papadimitriou 82]):

Definition 3.29 (Maxcut) A maxcut of a graph G(E, V) is a partition of the nodes V into two sets V_1 and V_2 such that there are a maximum number of edges of E between V_1 and V_2.

The nodes in the example can be partitioned such that o_1 belongs to one partition and o_2 to the other. In that case, the maxcut consists of the two edges, e_{1,2} and e_{2,1}, and not of one edge, as is the case for a maximum parallel edge set. However, if the HDSFG is acyclic, a PES_i corresponds to a cut through the HDSFG and a MPES_i to a maxcut (although only correct for acyclic HDSFGs, the terms "cut" and "maxcut" are sometimes used in this text instead of PES_i and MPES_i).
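Equation (3.7) is a direct sum over the retimed graph. A minimal sketch; the edge tuples and the two-register-file cost arrays are invented for the example:

```python
def pes_value(edges, i):
    # |PES_i| per (3.7): only edges with d > 0 and c[i] > 0 contribute
    return sum(d * c[i] for d, c in edges if d > 0 and c[i] > 0)

# each edge: (retimed weight d, cost array c over two register files)
edges = [(1, [1, 0]), (2, [1, 0]), (0, [1, 1]), (1, [0, 1])]
print(pes_value(edges, 0))  # 3
print(pes_value(edges, 1))  # 1
```

Note that the third edge is excluded for both register files: its retimed weight is 0, so it does not cross the loop boundary and carries no live signal.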

2. The register cost of parallel edge sets  Once a maximum parallel edge set has been obtained, its register cost is defined as follows:

Definition 3.30 (Register cost of a PES) The register cost of a PES_i, denoted as R(PES_i), is the number of different signals represented by its edges.

The register cost R(PES_i) of a parallel edge set is not always the same as its value |PES_i|. Both can differ if there are different fanout edges for the same signal. Therefore, corrections on the edge cost c_{1,2} will have to be introduced in order to model the number of different signals (and thus the number of required registers) as closely as possible by means of |PES_i| (see further). Note also that R(PES_i) consists of a contribution of the current hierarchical level and possibly contributions of several nested blocks. These contributions are computed separately, starting with the innermost hierarchical level, as stipulated in the definition of hierarchical retiming.

3. The edge cost  The cost of an edge is used for modeling the number of different signals in the goal function (3.4) of the retiming. Two edges that represent the same signal (fanout edges) should only contribute for one signal in the retiming goal function. Therefore, the edge cost c needs a correction in the case of fanout edges. This is accomplished as follows:

1. Only one of the fanout edges gets cost c[i] = 1, and all the others get c[i] = 0. This approach is correct if data-flow analysis shows that all other fanout edges are "covered" by that one edge: in figure 3.16(a), the edges with zero cost cannot be "cut" without also cutting the edges with the unit cost. Therefore, their contribution doesn't need to be modeled in the retiming cost function.

2. If data-flow analysis cannot remove redundant edges, the cost is distributed equally over the F fanout edges: each edge gets cost c[i] = 1/F. An MPES_i that contains all the fanout edges will then only count 1 signal, as illustrated in figure 3.16(b). An alternative approach is illustrated in figure 3.16(c): a dummy node is added (the shaded node in the figure) and only the edge to the dummy node gets cost c[i] = 1. However, experience has shown that the first alternative is a more powerful heuristic than the second one.

It can now also be seen that (only when there are no multiple fanout edges in the MPES_i) the following is true, as required by Theorem 3.1:

    \sum_{e_{1,2}} d_{1,2} \cdot c_{1,2}[i] = R(MPES_i)    (3.8)
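The two fanout corrections can be sketched as one helper; the edge names are placeholders, and `covered=True` corresponds to the first, data-flow based alternative:

```python
def fanout_costs(fanout_edges, covered=False):
    F = len(fanout_edges)
    if covered:
        # one covering edge gets cost 1, the covered edges get 0
        return {e: (1.0 if k == 0 else 0.0)
                for k, e in enumerate(fanout_edges)}
    # otherwise distribute the cost: each of the F edges gets 1/F
    return {e: 1.0 / F for e in fanout_edges}

print(fanout_costs(["e1", "e2"], covered=True))  # {'e1': 1.0, 'e2': 0.0}
print(fanout_costs(["e1", "e2", "e3"]))          # each edge gets 1/3
```

Either way, a MPES_i containing the whole fanout bundle contributes (close to) one signal to the goal function, which is what equation (3.8) requires.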

Figure 3.16: The edge costs for retiming. (a) only the covering fanout edge gets cost c[i] = 1 and the covered edges get c[i] = 0 (initial-value edges get c[i] = 1, last-value edges c[i] = 0); (b) the cost distributed equally (c[i] = 0.5) over the fanout edges; (c) the dummy-node alternative, with cost c[i] = 1 only on the edge to the dummy node.

The edge cost array for the nested block model is (see also figure 3.12):

    c_{NB,NB'}[i] = R(MPES_i^{NB})    (3.9)

4. Maximum register cost arrays  If N is the number of register files in the architecture, all maximum register cost estimations are represented as arrays with N integer components. Three different maximum register cost estimations are used in this dissertation: GMC, BMC and CMC. GMC is the main maximum register cost estimation used for scheduling preprocessing (see Section 2.2). The use of BMC and CMC will become clear in Chapter 5. The different maximum register cost estimations are defined as follows:

Definition 3.31 (GMC) The global maximum register cost array GMC at a certain level of the HDSFG hierarchy has as i-th component:

    GMC[i] = R(MPES_i)    (3.10)

where MPES_i is derived by hierarchical retiming of the HDSFG at the given hierarchical level, in which for each nested block:

    c_{NB,NB'} = GMC_{NB}    (3.11)
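The innermost-out recursion of Definition 3.31 can be sketched independently of the retiming itself. Below, `retime_level` is a stand-in callback (here just the largest single contribution, purely to exercise the recursion); a real implementation would run the hierarchical retiming of Section 3.4.2 on the level's edges, with each child's GMC installed as the cost of its e_{NB,NB'} edge:

```python
def gmc(block, retime_level):
    """block: {"edges": [(src, dst, d_init, c), ...], "children": [...]}.
    Children are solved first; their GMC values enter the parent level."""
    child_costs = [gmc(child, retime_level) for child in block["children"]]
    return retime_level(block["edges"], child_costs)

def demo_retime(edges, child_costs):
    # stand-in for R(MPES_i) of one flat level, NOT a real retiming
    return max([c for _, _, _, c in edges] + child_costs, default=0)

inner = {"edges": [("a", "b", 0, 2)], "children": []}
top = {"edges": [("x", "y", 1, 1)], "children": [inner]}
print(gmc(top, demo_retime))  # 2
```

The point of the sketch is the traversal order: each level is solved only after all of its nested blocks, exactly as stipulated in the definition of hierarchical retiming.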

The GMC at the top level of the design represents the maximum register cost for the whole design. As an example, the calculation of GMC[1] (for register file "reg1") for the j-loop of the SYNDR example (see table 1.2 and figures 3.8 and 3.9) is illustrated in figure 3.17: hierarchical retiming of the original HDSFG in figure 3.17(a), without software pipelining (i.e. with a sequence edge between all outputs and all inputs, not drawn for simplicity of the figure), gives a MPES_1 with register cost 4 in figure 3.17(b).

Another kind of maximum register cost estimates the maximum number of edges that "bridge" a nested block or a subgraph, because these edges represent signals that can be alive during the execution of the nested block:

Definition 3.32 (Bridging edge) A bridging edge of a nested block NB is an edge e for which there exists a PES_i such that both e ∈ PES_i and e_{NB,NB'} ∈ PES_i, where e_{NB,NB'} is the edge of the nested block model for NB.

There is a different set of bridging edges for each register file i. Only the set of bridging edges with the largest register cost (for each register file) is relevant for estimating the maximum bridging edges register cost:

Definition 3.33 (BMC) The maximum bridging edges register cost array (BMC) for nested block NB at a certain level of the HDSFG hierarchy has as i-th component:

    BMC[i] = R(MPES_i)    (3.12)

where MPES_i is derived by hierarchical retiming of the HDSFG at the given hierarchical level, in which for the nested block NB:

    c_{NB,NB'} = 0    (3.13)
    r_{NB'} - r_{NB} = 1    (3.14)

Constraint (3.13) assures that only the register cost of the bridging edges is taken into account. Constraint (3.14) forces the edges of the MPES_i to be parallel with NB (or, in other words, forces the "maxcut" to go through NB), as required by the definition of a bridging edge. As an illustration, BMC[1] (for register file "reg1") of the j-loop of the SYNDR red-thread example (see figures 3.8 and 3.9) is shown in figure 3.18. The thick edges are the edges of the MPES_1 set (a different style for drawing delay lines is used in figure 3.18).

The third kind of maximum register cost is somewhat similar to the BMC, but here the signals inside the nested block are counted as well:

Definition 3.34 (CMC) The constrained maximum register cost array CMC through nested block NB at a certain level of the HDSFG hierarchy has as i-th component:

    CMC[i] = R(MPES_i)    (3.15)

Figure 3.17: GMC[1] calculation for the j-loop of the SYNDR example. (a) the original HDSFG of the j-loop with the edge costs for register file reg1; (b) the retimed HDSFG, in which the MPES_1 has register cost 4.

Figure 3.18: Calculation of BMC[1] for the j-loop of the SYNDR example (the thick edges are the edges of the MPES_1 set).

where MPES_i is derived by hierarchical retiming of the HDSFG at the given hierarchical level, in which for the nested block NB:

    c_{NB,NB'} = GMC_{NB}    (3.16)
    r_{NB'} - r_{NB} = 1    (3.17)

Constraint (3.16) assures that the register cost of the nested block NB is taken into account. As in the definition of BMC, constraint (3.17) forces the maxcut to cut NB.

For reasons of uniformity, the constraints on the register file sizes are represented as an array as well:

Definition 3.35 The register cost constraint array C is an array whose i-th component is the constraint on the size of register file i.

5. Edge cost corrections  Besides the corrections on the edge costs c for fanout edges (see above), the edge costs are also corrected if edges into (out of) nested blocks are part of the MPES_i of that nested block. Again, the rationale for this is to avoid counting these edges more than once. The reader is referred to Appendix A for more details.

6. Algebra with maximum register cost arrays  Since the components of the maximum register cost arrays are computed independently from one another (to obtain the worst-case size of each register file), it is rather unlikely that a design actually requires GMC[i] fields in register file i and GMC[j] fields in register file j. Therefore, the total maximum register cost is most likely not the sum of GMC's components. Besides, the total register cost is irrelevant in most designs considered in this thesis: only the register costs of the individual register files must satisfy the constraints, not the total number of registers. The following algebraic operations can be performed on register cost arrays MC:

- Addition (and, similarly, subtraction):

    MC_1 + MC_2 = MC_{sum} \Leftrightarrow MC_{sum}[i] = MC_1[i] + MC_2[i]    (3.18)

- Maximum of a set of register cost arrays:

    \max\{MC_1, MC_2\} = MC_1 \Leftrightarrow |MC_1| > |MC_2|    (3.19)

- The component-wise maximum of a set of register cost arrays is a new array:

    \oplus\max\{MC_1, MC_2\} = MC_{\oplus} \Leftrightarrow MC_{\oplus}[i] = \max\{MC_1[i], MC_2[i]\}    (3.20)

Page 71: Register Optimization and Scheduling for Real-Time Digital ...

62 CHAPTER 3. THE SIGNAL FLOW GRAPH MODEL� Comparison of two register cost arrays:MC1 �MC2 , 8i :MC1[i] �MC2[i] (3:21)3.4.4 Algorithmic complexity of hierarchical retimingThe retiming problem is formulated and solved as a linear program, becauseall retiming variables are non-negative (see Section 3.4.2). The constraint ma-trix of the linear program is a node-arc incidence matrix, because it consistsof inequalities of the type (3.3). This kind of matrices is totally unimodu-lar [Papadimitriou 82]: all optimal vertices of the polytope are integer. Asa consequence, hierarchical retiming is solvable in polynomial time. Or, itcan be solved with extremal point methods, like Simplex [Papadimitriou 82][Schrijver 86]. These extremal point algorithms have worst-case exponentialcomplexity, but in practice, they exhibit a low polynomial complexity.As pointed out in [Kurdahi 87] and [Depuydt 91], the maximum registercost can also be estimated by means of minimum ow techniques, if theHDSFG is acyclic. A cyclic HDSFG can be made acyclic by removing alledges with weight d1;2 > 0. The reader is referred to Appendix C for a de-scription of a hierarchical min ow algorithm. The disadvantage of the min owapproach is that the additional software pipelining freedom of the scheduler[Renfors 81] [Parhi 91] [Goossens 89b] is not taken into account, which canlead to a substantial underestimation of the maximum register cost.Figure 3.19 shows the average CPU times for maximum register cost es-timation versus the size of the HDSFGs to be retimed. Two techniques arecompared: retiming with the revised Simplex method [Schrage 89] and min- ow. Each average CPU time was calculated over some tens of measurements,on acyclic HDSFGs (to be able to compare with min ow). Retiming exhibitsa saturating (logarithmic) experimental complexity, and min ow a linear ex-perimental complexity. 
The offset of the CPU times for retiming is due to the writing and reading of files to/from a commercial linear programming package (Lindo [Schrage 89]). The extra price paid for the generality of retiming (it handles cyclic graphs) appears very reasonable compared to the minflow CPU times.

3.5 Summary

In this chapter, a hierarchical signal flow graph (HDSFG) has been defined. A timing model is imposed on the nodes and edges of the HDSFG. Nested loops and conditions are modeled by means of the hierarchy in the HDSFG. This hierarchy will be efficiently exploited by the register optimization and scheduling techniques presented in the following chapters.


[Figure 3.19: Average CPU times for maximum register cost estimation on a DEC5000. Plot of average CPU time (sec) versus graph size (nr. of nodes), comparing retiming and minflow.]


Furthermore, attention has been paid to the modeling of delay lines: because of a generalized delay model for sequence edges, the memory cost of these delay lines can be constrained.

Finally, as a basic utility using the HDSFG model, a new estimation technique for the maximum register cost of a design has been proposed. It is based on retiming, and has a low polynomial algorithmic complexity in practice. The technique handles cyclic graphs and can take software pipelining into account.


Chapter 4

Cut reduction

A first technique for solving the problem of finding a schedule that satisfies constraints on the available number of registers is presented in this chapter. The technique is based on the following principle: the HDSFG is transformed before scheduling, such that the maximum number of registers required is not larger than the available number of registers. This transformation preprocesses the HDSFG such that it can afterwards be scheduled without taking the available number of registers into account: any schedule of the preprocessed HDSFG will satisfy the register constraints.

The preprocessing technique presented here directly reduces the maximum number of parallel data edges. Since a set of parallel data edges is similar to a cut through the HDSFG (but not exactly the same thing; see Section 3.4.3), the technique is called cut reduction.

This chapter is organized as follows. The principles of cut reduction are summarized in Section 4.1. The basic principles of a branch-and-bound search strategy for cut reduction are explained in Section 4.2. The basic moves in the search space are discussed in more detail in Section 4.3. The cost function for the branch-and-bound search is discussed in Section 4.4. The detailed mechanisms of the search strategy (branching and bounding) are explained in Sections 4.5 and 4.6. The need for a termination phase for checking the exact maximum register cost of the design is discussed in Section 4.7, and some experimental results are analyzed in Section 4.8. In Section 4.9, the proposed technique is compared with some other techniques for register optimization in the literature. Finally, an extension for hierarchical cut reduction is discussed in Section 4.10.

4.1 The principles of cut reduction

The principles of cut reduction are illustrated in figure 4.1. If it is assumed that


[Figure 4.1: An example of cut reduction, in three panels (a)-(c), with signals a-g and sequence edges S1-S4.]

all signals in figure 4.1(a) are stored in the same central register bank, and that the HDSFG in figure 4.1(a) is only executed once, then a maximum parallel edge set (see Definition 3.28) is MPES = {a, c, d, f}. This edge set has a register cost R(MPES) = 4, meaning that a schedule of the HDSFG in figure 4.1(a) can cost four registers in the worst case. This worst-case cost can be reduced to three registers by sequentializing edges d and f. This is done by adding the sequence edge S1 (figure 4.1(b)): two possible MPES ("maxcuts") are now {a, c, d} and {a, e, f}. The maximum register cost of the HDSFG in figure 4.1(b) cannot be further decreased by sequentialization (i.e. adding sequence edges). However, by e.g. performing spilling [Aho 88] [Chaitin 82], the maximum register cost can be brought to 2 (figure 4.1(c)): by writing the signal on edge a to RAM and reading it back later (signal a'), and sequentializing the read and write operations with respect to the other operations (sequence edges S2 and S3), the worst-case register cost is now reduced to 2.

Cut reduction can formally be defined by means of the terminology of Chapter 3:

Definition 4.1 (Cut reduction)  A cut reduction is a transformation of the HDSFG that reduces GMC such that GMC <= C.

There are several cut reduction transformations possible: the most important ones are discussed in Section 4.3.
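On small graphs, a maximum parallel edge set and its register cost can be found by brute force. The sketch below uses an assumed toy DAG (not the graph of figure 4.1) and a simplified lifetime model in which a signal dies exactly when its consumer fires and all edges are delay-free; the thesis instead estimates the MPES by hierarchical retiming (Section 3.4):

```python
from itertools import combinations

# Signal -> (producer node, consumer node); an assumed toy DAG.
edges = {
    "x": (1, 4),  # a long-lived signal spanning the whole graph
    "y": (1, 2),
    "z": (2, 3),
    "w": (3, 4),
}

succ = {}
for p, c in edges.values():
    succ.setdefault(p, []).append(c)

def reaches(src, dst):
    # True if dst is reachable from src via zero or more edges.
    seen, stack = set(), [src]
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n in seen:
            continue
        seen.add(n)
        stack.extend(succ.get(n, []))
    return False

def parallel(e1, e2):
    # Two signals can be live simultaneously unless one's consumer
    # precedes (or is) the other's producer.
    (p1, c1), (p2, c2) = edges[e1], edges[e2]
    return not (reaches(c1, p2) or reaches(c2, p1))

# Brute-force the maximum parallel edge set (fine for tiny graphs).
mpes = max(
    (s for r in range(len(edges) + 1) for s in combinations(edges, r)
     if all(parallel(a, b) for a, b in combinations(s, 2))),
    key=len)
assert len(mpes) == 2 and "x" in mpes  # worst case: 2 registers, e.g. {x, y}
```

With this model, a schedule can be forced to need fewer registers only by adding precedence (sequence edges) or by spilling, which is exactly the role of the basic moves in Section 4.3.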


[Figure 4.2: Branch-and-bound search space (partial solutions labeled 1-12 with their costs; legend: S = solution, P = pruned, E = equal solution).]

4.2 Branch-and-bound search

According to Definition 4.1, cut reduction has to come up with a transformation of the HDSFG such that:

    for all i: GMC[i] <= C[i]        (4.1)

This transformation consists of basic moves in the search space, like adding a sequence edge or spilling a signal to RAM (see Section 4.3). This set of basic moves has to be such that Equation (4.1) is satisfied for all register files i, and such that some cost function is minimized. This cost function could for instance be the length of the critical path in the HDSFG (see Section 4.4). In other words, cut reduction is a constrained search problem: not only does a cost function have to be minimized, but constraints also have to be satisfied.

1. Partial solutions  A global branch-and-bound search strategy [Papadimitriou 82] is proposed. The search process is schematically illustrated in figure 4.2: each basic move causes the search to proceed from one partial solution (the squares in figure 4.2) to another partial solution. The search starts with the initial HDSFG in the root of the search tree (the partial solution labeled 1 in figure 4.2). In each partial solution, the GMC is computed and compared with C.

Definition 4.2 (Partial solution)  A partial solution of cut reduction is a transformation of the original HDSFG for which GMC > C.

From one partial solution, new (partial) solutions are generated by performing one basic move at a time (i.e. branching): e.g. from partial solution 2 in


figure 4.2, three basic moves can be performed. Each one of these basic moves generates a new (partial) solution (3, 6 and 7 in figure 4.2). Basic moves are performed until the search is forced to backtrack: for instance, if for one reason or another (see further) no branching is done from partial solution 4 in figure 4.2, the search "tracks back" by undoing the last basic move until partial solution 3 is reached, where not all branches have been explored yet. In this way, the search tree is traversed in a depth-first fashion [Papadimitriou 82]. The traversal order in figure 4.2 is the same as the order of the partial solution labels.

2. Feasible solutions  A partial solution becomes a feasible solution if its GMC <= C.

Definition 4.3 (Feasible solution)  A feasible solution of cut reduction is a transformation of the HDSFG that satisfies the register constraints: GMC <= C.

3. Pruning the search tree  From a feasible solution, no further branching is necessary (in other words, the search tree is pruned). For instance, suppose that solutions 4 and 5 in figure 4.2 are feasible solutions (labeled with an "S"): the sub-trees of solutions 4 and 5 are pruned for the branch-and-bound search. The reason for this is the monotonous cost function: with each basic move, the cost of the partial solution stays the same or increases (see figure 4.2). Having a monotonous cost function is crucial for branch-and-bound (see Section 4.4), because it provides the ability to prune a partial solution: if a partial solution has a cost that is larger than the cost of the best feasible solution so far, there is no point in further branching. For example, the search tree in figure 4.2 is pruned in the partial solutions 6, 7 and 8 (label "P").

4. Optimal solutions  The feasible solution for which the cost is minimal is called the optimal solution:

Definition 4.4 (Optimal solution)  An optimal solution of cut reduction is a transformation of the original HDSFG such that its GMC <= C and its cost is minimal.

4.3 The basic moves

At each partial solution, a number of moves is considered to reduce the "maxcut", i.e. to reduce GMC. The basic HDSFG transformation moves that are considered for branching in a partial solution of the search are the following:


[Figure 4.3: Basic move 1 (BM1): adding a sequence edge pair, panels (a)-(c).]

[Figure 4.4: Basic move 2 (BM2): spilling a signal a to RAM (write W, read R, address, spilled signal a').]

1. BM1: Sequentialization of two edges on the cut by adding a sequence edge pair. This is illustrated in figure 4.3: suppose that edges e_{1,2} and e_{3,4} both belong to the MPESi (i.e. the "maxcut") at a certain partial solution of the search. To fully sequentialize the edges e_{1,2} and e_{3,4} in figure 4.3(a), two sequence edges need to be added to the HDSFG. Only one of them gets a weight of 1; the other gets zero weight. Thus, there are two possible ways of adding the sequence edge pair (figure 4.3(b) and (c)).

2. BM2: Spilling a signal (figure 4.4): the write and read RAM accesses need to be generated, as well as the address. This address can be stored in the microcode instruction word, or loaded from read-only memory (ROM).

3. BM3: Decrementing the maximum window of delay lines (figure 4.5). Note that the maximum window W_{1,2}^max must of course remain larger than 0 (see Definition 3.18).

Other basic moves are possible [Lanneer 93a], but are not considered here.
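The depth-first branch-and-bound search of Section 4.2 over these basic moves can be sketched as follows. This is a schematic sketch, not the Tron implementation: the GMC estimator, the cost function and the toy moves in the demo are all stubs introduced for illustration.

```python
def branch_and_bound(state, moves_of, gmc_of, cost_of, C):
    # state: a (transformed) HDSFG, opaque here
    # moves_of: basic moves applicable in a partial solution (Section 4.3)
    # gmc_of: maximum register cost array estimator (Section 3.4)
    # cost_of: monotonous cost function alpha1*C1 + alpha2*C2 (Section 4.4)
    # C: register cost constraint array
    best = {"cost": float("inf"), "state": None}
    seen = set()  # prune equal solutions (Definition 4.6)

    def search(s, applied):
        key = frozenset(applied)  # the order of the moves is irrelevant
        if key in seen:
            return
        seen.add(key)
        if cost_of(s) >= best["cost"]:
            return  # bound: monotonicity makes this pruning safe
        if all(g <= c for g, c in zip(gmc_of(s), C)):
            best.update(cost=cost_of(s), state=s)
            return  # feasible solution: prune the subtree
        for move in moves_of(s):  # heuristically ordered (Section 4.5)
            search(move(s), applied | {move.__name__})
            # move(s) returns a new state, so backtracking is implicit

    search(state, frozenset())
    return best

# Toy demo: states are frozensets of added sequence edges; each added edge
# lowers a fake GMC by one register and lengthens the schedule by one cycle.
def make_move(name):
    def move(s):
        return s | {name}
    move.__name__ = name
    return move

def moves_of(s):
    return [make_move(e) for e in ("S1", "S2", "S3") if e not in s]

result = branch_and_bound(
    frozenset(), moves_of,
    gmc_of=lambda s: [4 - len(s)],  # stub estimator
    cost_of=len,                    # stub: schedule-length impact only
    C=[2])
assert result["cost"] == 2  # two sequence edges suffice
```

The feasibility test, the bound test and the equal-solution set correspond directly to Sections 4.2, 4.4 and 4.6.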


[Figure 4.5: Basic move 3 (BM3): decrementing a delay line window, reducing the maximum window W_{1,2}^max by 1.]

The effect of a basic move on GMC  Performing a basic move rarely makes the cut through the HDSFG larger, as shown by the following theorem:

Theorem 4.1  GMC[i] cannot increase by performing basic moves BM1 or BM3.

Proof: The value of the i-th component of GMC equals the register cost R(MPESi) of the maximum parallel edge set MPESi in the HDSFG (Definition 3.31). It has to be shown that R(MPESi) does not increase by applying basic moves BM1 or BM3.

- By applying basic move BM1, a sequence edge pair {p, q} is added to the HDSFG. Suppose that, after adding {p, q}, the MPESi is extended with edge e to form a new edge set MPES'i, such that R(MPES'i) > R(MPESi). Before adding {p, q}, edge e was not part of MPESi, and therefore was fully sequential with one of the edges of MPESi (else, e would have been part of MPESi as well). In other words, e was on the same directed cycle in the HDSFG as one of the edges of MPESi. Note that one of the edges of this directed cycle can be a sequence edge. Furthermore, this directed cycle has an accumulated edge weight (accumulated delay) of 1 (labeled "UDC" in figure 4.6), to establish the full sequence between e and one of the edges of MPESi. If e were part of MPES'i after adding the sequence edges {p, q}, UDC in figure 4.6 would no longer be a unit-delay cycle. This is not possible, because it would mean that adding sequence edges changes the behavior of the HDSFG; adding sequence edges only imposes timing constraints on the HDSFG. Therefore, by contradiction, R(MPES'i) cannot exceed R(MPESi).

- Furthermore, it is trivial that decrementing a maximum window (basic move BM3) cannot increase R(MPESi), for any i.

Corollary 4.1  Reducing the value of GMC[i] by means of basic moves BM1 and BM3 cannot increase the value of another component GMC[j].
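Theorem 4.1 can also be checked empirically on a toy instance. The sketch below uses a simplified, delay-free lifetime model (a signal dies exactly when its consumer fires); the graph and the added sequence edge are assumptions made for illustration, not the thesis's retiming formulation:

```python
from itertools import combinations

def max_parallel(data_edges, prec_edges):
    # Size of a maximum set of pairwise-parallel data edges, under a
    # simplified delay-free model: e1 and e2 conflict when e1's consumer
    # precedes (or is) e2's producer, or vice versa.
    succ = {}
    for u, v in prec_edges:
        succ.setdefault(u, []).append(v)

    def reaches(a, b):
        seen, stack = set(), [a]
        while stack:
            n = stack.pop()
            if n == b:
                return True
            if n in seen:
                continue
            seen.add(n)
            stack.extend(succ.get(n, []))
        return False

    def par(e1, e2):
        (p1, c1), (p2, c2) = e1, e2
        return not (reaches(c1, p2) or reaches(c2, p1))

    return max(len(s) for r in range(len(data_edges) + 1)
               for s in combinations(data_edges, r)
               if all(par(a, b) for a, b in combinations(s, 2)))

# Two assumed parallel data edges: 1 -> 3 and 2 -> 4.
data = [(1, 3), (2, 4)]
before = max_parallel(data, data)
# BM1: a zero-weight sequence edge from node 3 to node 2 (the weight-1
# partner edge only matters across loop iterations and is omitted here).
after = max_parallel(data, data + [(3, 2)])
assert (before, after) == (2, 1)  # the maxcut never grows; here it shrinks
```

Adding precedence can only remove pairs from the "parallel" relation, which is the intuition behind the theorem.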


[Figure 4.6: Sequence of two edges on a unit-delay cycle (UDC) through the MPESi.]

The effect of spilling on GMC  Theorem 4.1 and Corollary 4.1 show that most basic moves cannot increase, and actually reduce, the maximum register costs of the register files. Only the maximum register cost of the data and address input register files of the spill RAM can increase if basic move BM2 (spilling) is performed. This can require extra sequentialization of the read and write operations introduced by spilling.

Back-tracking  To be able to reach the optimal solution with respect to the cost function, undoing basic moves is required. This is a general principle called back-tracking. To reduce the amount of back-tracking during the search, the order in which the possible basic moves at a partial solution are performed (branching) is crucial. Special heuristics for this purpose are discussed in Section 4.5.

4.4 The cost function

Each time a basic move is performed and the search reaches a new partial solution, a cost function is evaluated (see the numbers in the squares in figure 4.2). The cost function for cut reduction has two components, C1 and C2:

1. C1 is the impact of a basic move on the schedule length.
2. C2 is the extra hardware cost of a basic move (e.g. an address in the spill RAM, or extra interconnection).

The actual cost function is a weighted sum of these components: alpha1*C1 + alpha2*C2. The values of the weights alpha1 and alpha2 are set by the designer. The two cost function components are elaborated below.

Impact of a basic move on the schedule length  The schedule length at a partial solution can be calculated, by actually scheduling the HDSFG, or it can


be estimated, by e.g. topological sorting (leveling) of the HDSFG. Although an estimation by leveling is implemented at the moment, a fast, linear time-complexity list scheduling algorithm [Rompaey 92] [Vanhoof 93] could be used for more accurate cost calculations.

Extra hardware cost  The extra hardware cost is only meaningful when mapping a design on an incompletely specified datapath, i.e. a datapath where for instance the interconnection between the EXUs has not yet been fixed. In that case, spilling a value to RAM requires a spill RAM, a spill path to and from that RAM, and the necessary addressing. Efficient ways of estimating this extra hardware cost are presented in [Lanneer 93a].

Monotonicity of the cost function  As motivated in Section 4.2, it is very important for the pruning of the search space to have a monotonous cost function:

Definition 4.5 (Monotonicity)  The cost function is monotonous if it does not decrease by applying a basic move.

The following theorem can be proven trivially by inspection of the basic moves BM1, BM2 and BM3 in Section 4.3:

Theorem 4.2  The cost function alpha1*C1 + alpha2*C2 is monotonous.

4.5 Branching heuristics

At each partial solution, the new GMC is calculated. If GMC > C, some components of GMC have a value larger than the constraint on the available number of registers. The set of possible basic moves to reduce the register cost is determined. Performing a basic move causes a branch to a new partial solution in the branch-and-bound solution space. The order in which these branches are issued is crucial for the run time of the branch-and-bound search. Ideally, the best branch is chosen at each partial solution, and the optimum solution is found without even having to back-track. It is however not possible to know which branch is the best one, because the optimum is unknown. Therefore, the only thing that can be done is to propose a number of heuristics that try to determine which branch will most probably lead to the optimum.
By means of the branching heuristics proposed below, the set of possible branches (basic moves) is ordered:

1. As a result of Corollary 4.1, reducing a component of the GMC can implicitly reduce another GMC component. Therefore, it is advantageous to start reducing the component of GMC that has to be reduced the most.


[Figure 4.7: Heuristic 3: the best candidate for spilling.]

2. If spilling (basic move BM2) is expensive and thus has a large user-defined weight in the cost function, it is advantageous to start issuing basic moves BM1 and BM3 before doing a spilling basic move, or vice versa.

3. Spilling the signals with the longest lifetimes (determined by topological sorting of the HDSFG) first has the least effect on the spilling cost, because of the larger time slots in which to schedule the RAM access operations. Furthermore, more freedom is left for sequentializing the lifetimes of the addresses in the RAM address registers. For instance, edge a in figure 4.7 is the best candidate for spilling.

4. If the edge that has to be sequentialized w.r.t. another edge is part of a chain (e.g. edges a, b and c in figure 4.8), only sequence edges to/from the begin or end nodes of the chain have to be examined. For example, only nodes o1 and o2 in figure 4.8 are possible origins (or destinations) for sequence edges.

More intricate heuristics can be found, which require a more complex analysis. The above heuristics, however, have proven to lead to efficient pruning of the branch-and-bound search space (see Section 4.8).

4.6 Pruning the search space

Due to the monotonicity of the cost function (see Theorem 4.2), efficient pruning of the branch-and-bound search space can be done: the sub-trees of the search space at some (partial) solutions can be ignored because they certainly


[Figure 4.8: Heuristic 4: chains of edges.]

do not contain the optimum solution. The cases in which pruning can be performed are summarized in the following paragraphs.

Upper bounds on the cost  A feasible solution is found when the constraints on the available number of registers are met. In figure 4.2, feasible solutions are labeled with an "S". The cost of a feasible solution is an upper bound on the cost of the optimum solution. Since the cost function is monotonous, there is no point in branching any further from a partial solution whose cost is larger than or equal to the current upper bound on the cost. E.g. in figure 4.2, partial solutions 6, 7 and 8 are pruned for this reason. It is clear that the tighter the upper bounds on the cost, the more efficient the pruning. Therefore, the branch ordering heuristics of Section 4.5 play an important role in finding tight upper bounds on the cost as soon as possible during the search.

Pruning equal solutions  It makes no sense to perform the same series of basic moves twice, but in a different order: the net result would be exactly the same. Therefore, partial solutions are identified by the set of basic moves needed to reach them, and the search is pruned in a partial solution that is equal to another one:

Definition 4.6 (Equal solutions)  Two (partial) solutions are equal if they have been obtained with exactly the same set of basic moves.

4.7 Cut reduction termination

The GMC in a partial solution is calculated by means of the hierarchical retiming technique of Section 3.4. This technique computes a tight underestimation of the maximum register cost, instead of the real maximum register cost:

1. Hierarchical retiming calculates the maximum number of live signals, MPESi. However, the required number of registers to store these signals can be larger than the register cost R(MPESi) (which is defined as the number of different signals in MPESi, see Definition 3.30), as explained in Section 3.4.1.

2. The cost function for hierarchical retiming is no longer exact for fanout edges, as explained in Section 3.4.2.

The main reason for using hierarchical retiming is its low, polynomial algorithmic complexity. This makes it feasible for intensive use during the branch-and-bound search (see the discussion in Section 3.4.1).

However, the final solution of the search must be checked: the exact maximum register cost must satisfy the constraints (see Definition 4.1). It was explained in Section 3.4.1 that the exact maximum register cost can only be calculated by means of maximum clique techniques. Therefore, once a solution is found using the estimation of the maximum register cost, the search is continued with the exact maximum register cost calculation. Due to the high quality of the estimation, this typically requires the examination of at most a few extra partial solutions.

4.8 Experiments with cut reduction

The branch-and-bound algorithm (see Algorithm B.1 in Appendix B) computes the optimal set of basic moves to be performed on the HDSFG such that the constraints on the available number of registers are satisfied and the cost function is minimized. In other words, if there exists a transformation of the HDSFG such that the transformed HDSFG satisfies the register constraints, then Algorithm B.1 will find this transformation. Note that, if the cost function only uses an estimation of the schedule length, the optimum schedule in terms of time steps is not guaranteed to be found.
(The optimal schedule could be guaranteed by using an optimal scheduling technique for the evaluation of the cost function, which is hardly practical.) However, the fast list scheduler Smart [Goossens 90] [Rompaey 92] does a good job of estimating the minimal schedule length; this has been verified for medium-throughput applications by comparing Smart schedules with optimal schedules, obtained with integer linear programming.

In table 4.1, some results of experiments with branch-and-bound cut reduction are given. Different HDSFGs were extracted from the different examples of table 1.2, and different register constraints were applied. The first column


    Reduction | HDSFG size | BAB size | opt (%) | CPU time (s)
        1     |    105     |    37    |  8.11   |     80
        4     |     16     |   421*   |  2.14   |    367
        4     |     16     |   756    |  1.19   |    458
        5     |     40     |   480    |  8.75   |     85
        7     |     65     |   395    |  3.54   |    419
        8     |     26     |   550    |  5.64   |     62
        9     |    105     |  1392    |  1.94   |    700
       11     |     56     |  1055*   |  2.56   |    219
       13     |     28     |   953    |  3.58   |    117

    *: BM2 not considered for branching

Table 4.1: Results of branch-and-bound cut reduction

in table 4.1 is the total number of registers by which the GMC of the original HDSFG exceeds the constraints C. The second column in the table indicates the number of nodes in the HDSFG. The third column shows how many partial solutions have been traversed during the branch-and-bound search. The fourth column indicates after which part of this total number of partial solutions the optimum solution was found. Cut reduction is implemented in a prototype CAD tool called Tron. The last column gives the cut reduction CPU times of Tron on a DEC5000 (in seconds).

Another example is a real-life linear predictive coding (LPC) application (see table 1.2). In figure 4.9, the register cost of a schedule with register constraints is compared with the register cost of a schedule without register constraints, and the maximum register cost of the design. The register cost is shown for each register file in the architecture (X-axis in figure 4.9). The register constraints are indicated by the thick arrows. For instance, only 2 registers are available in register file "alu/rega". The register cost imposed by the constraints is the minimum achievable register cost for the LPC design (without spilling). Note that by applying severe register constraints (like the ones in figure 4.9), the design gets more sequentialized. In the case of the LPC example, the total number of clock cycles increases by some 23 % (from 10,791 to 13,332). Cut reduction transformed the HDSFG to within the register constraints by only using basic move BM1, in a few tens of minutes on a DEC5000.
The scheduling was done with an integer programming scheduler (Ilps) (see Chapter 6), and no software pipelining was performed. The HDSFG of the top level of the LPC loop hierarchy after cut reduction with the constraints of figure 4.9 is shown in figure 4.10. Most operations in fig. 4.10 are shift, negate and RAM access operations. The register file


[Figure 4.9 (chart): per-register-file costs (alu/rega, alu/regb, mult/rega, mult/regb, ram1/dareg, ram1/adreg, acu1/rega, acu2/rega, acu3/rega) comparing the maximum register cost, scheduling without register optimization (10,791 cycles), scheduling after cut reduction (13,332 cycles), and the register constraints.]

Figure 4.9: Register costs for the LPC design

assignment of the signals is indicated as labels on the edges: for instance, the operand of operation 30 is stored in one of the input register files of the ALU ("alu/rega") and the output of that operation is fed back to the other input register file of the same ALU ("alu/regb"). The nested loops (the shaded circles) have been preprocessed and scheduled a priori, and their register cost is taken into account. The sequence edges (the dashed edges in figure 4.10) were added during a branch-and-bound search at the top HDSFG level over 900 partial solutions in 148 CPU seconds. Note also the backward sequence edges with a weight of 1, to ensure full sequence even in the case of software pipelining. The best solution was encountered after having searched 4 % of the search space.

Cut reduction complexity  The first conclusion that can be drawn from the results in table 4.1 is that branch-and-bound cut reduction gets complex for large reductions in large HDSFGs. Although the LPC design of figure 4.9 was entirely preprocessed to a minimum register cost by means of cut reduction in reasonable times, the algorithm will run into CPU time problems for large designs with large blocks in the HDSFG hierarchy (hundreds of nodes). Another factor that influences the cut reduction run times is the set of allowed basic moves. If, for instance, basic move BM2 is not allowed during branching, the search space becomes substantially smaller (see the results labeled "*" in table 4.1).


[Figure 4.10 (diagram): nested loop nodes (for k0, for k2, for k6, for k8) and numbered operation nodes, with register file assignments (alu/rega, alu/regb, mult/regb, sreg, ram1/dareg, ram1/adreg) as edge labels.]

Figure 4.10: The top-level HDSFG of the LPC design after cut reduction.


Efficiency of branching and pruning  It can also be observed that the optimum solution is always found very early: the column opt in table 4.1 indicates that the first 10 % of the traversed search space always contains the optimum. The rest of the CPU time is used to prove that there are no better solutions. This demonstrates the efficiency of the branching heuristics of Section 4.5. Therefore, a possibility for cutting down the large CPU times by an order of magnitude is to limit the search space: there is a large chance that the optimum solution does not belong to the unexplored part of the search space.

4.9 Comparison with the literature

Other, or similar, techniques to directly reduce the required number of registers in a design have been published in the literature. A discussion of these techniques is presented in the following paragraphs.

In [Grass 90], a branch-and-bound technique is used to reduce all kinds of resource conflicts, including register conflicts. However, instead of reducing the register cost of a maximum parallel edge set (MPES), the register cost of a maximum clique in the edge conflict graph is reduced. The maximum clique represents the exact maximum register cost (see Section 3.4.1), but calculating it has an exponential algorithmic complexity [Garey 79]. The cut reduction technique proposed in this chapter uses the maximum register cost estimation presented in Chapter 3, which can be calculated in polynomial time. Only when an optimum has been found using the estimation is the exact maximum register cost computed, and reduced if necessary. This leads to much faster run times for the cut reduction. Furthermore, the technique proposed in [Grass 90] does not consider delay lines, software pipelining and design hierarchy, nor does it consider the possibility of spilling signals to RAM. Finally, branching heuristics like the ones in Section 4.5 are not exploited in [Grass 90].

The data routing technique in [Lanneer 93a] determines storage schemes (i.e.
routes to and from registers) for the signals in the design that are not assigned to RAM. For a given clock cycle budget, the hardware cost of these "data routes" and their effect on the schedule length are estimated and minimized while choosing the routes. The data routing technique as described in [Lanneer 93a] is not able to satisfy constraints on the available number of registers, and is therefore complementary to the work in this chapter. However, when extended with the cut reduction techniques presented here, data routing could be able to satisfy register constraints. This is a topic for future investigation.


4.10 Hierarchical cut reduction

If the condition and loop hierarchy of the design is represented as explained in Section 3.2, the strategy for hierarchical cut reduction is very much the same as the one for the calculation of the global GMC of a design: i.e. starting at the innermost hierarchical level. Cut reduction is first performed at the innermost levels, and the resulting GMC, after transformation of the HDSFG, is then passed to the next hierarchical level.

This approach fits very well in the global scheduling strategy outlined in Section 2.2. In the scheduling context of this thesis, notice that the GMC that is passed to the next higher level in the hierarchy is the actual register cost of the schedule of the nested block: indeed, a nested block is preprocessed and scheduled before moving on to the next hierarchical level. However, hierarchical cut reduction can also be used in other scheduling contexts, since it is a completely general technique.

4.11 Summary

Cut reduction has been proposed as a preprocessing technique for register constraint satisfaction: after preprocessing the HDSFG, any schedule will satisfy the constraints on the available number of registers. A branch-and-bound search strategy has been proposed for cut reduction, with as main basic moves the adding of sequence edges and the spilling of signals to memory.

Although efficient heuristics prune the search space to a large extent, without sacrificing optimality, the branch-and-bound algorithm still suffers from large CPU times if GMC >> C. Alternative heuristic pruning techniques (sacrificing some optimality) are proposed in Chapter 5.


Chapter 5

Clustering

In the previous chapter, a technique for register optimization based on graph transformation (cut reduction) was proposed. For instance, sequence edges are added to the HDSFG until the maximum register cost is within the constraints. A branch-and-bound search strategy was proposed for finding the best transformation of the HDSFG. Unfortunately, the search space gets very large for large designs with severe register constraints. Therefore, another technique for register optimization is proposed in this chapter. It is used as a preprocessing step for cut reduction, to reduce its search space.

The register optimization technique in this chapter is based on the following idea: nodes that communicate data are best kept together in the schedule, in order to keep the lifetimes of the communicated data as short as possible. This is illustrated in figure 5.1. Suppose that the HDSFG of figure 5.1(a) is scheduled as in figure 5.1(b). This schedule requires 3 registers to store signals a, b and d simultaneously. The lifetime of signal d overlaps the lifetimes of a and b in figure 5.1(b), but the production of signal d could be delayed

[Figure 5.1: Illustration of the clustering concept]


to a later time step. "Keeping communicating nodes together" leads to the concept of clusters of HDSFG nodes, as illustrated in figure 5.1(c): by keeping nodes 3, 5 and 6 together in the same cluster (and the other nodes in another cluster) and scheduling the HDSFG hierarchically (i.e. scheduling the clusters separately), the lifetime of signal d (and hence also the lifetime of signal e) can be made non-overlapping with the lifetimes of a and b. This is shown in the schedule of figure 5.1(d), where the schedules of the clusters are also indicated (the rectangular shapes). Note that the sequence edge that is added between the two clusters ensures the disjointness of the lifetime of d with the lifetimes of a and b. (The source port of this sequence edge is one of the input ports of node 4, the destination port is the output port of node 3, and the edge latency l4,3 = 0; as a result, nodes 3 and 4 can be scheduled at the same time step without violating the timing constraint on the sequence edge.) This shows that cut reduction is still required after clustering. However, only the sequentialization of 2 clusters needs to be considered, instead of the sequentialization of 5 nodes in the un-clustered case.

The concepts that have been illustrated intuitively above are founded more theoretically in the sections below. A formal definition of the concept cluster is given in Section 5.1. The goal of identifying clusters in a HDSFG is discussed in Section 5.2, and some basic clustering techniques are proposed in Section 5.3. Since treating loops and conditions is indispensable for most applications, extensions of clustering for control flow hierarchy are discussed in Section 5.4.
Finally, the integration of clustering with cut reduction is analyzed in Section 5.5.

5.1 Clusters and maximum register cost

A group of "communicating" operations is formally defined as follows:

Definition 5.1 (Cluster) A cluster is a convex subgraph of the HDSFG.

where a convex subgraph is defined as:

Definition 5.2 (Convex subgraph) A subgraph G of the HDSFG is convex if each directed path that starts in a node of G, contains one or more nodes outside of G and ends in a node of G, has an accumulated edge weight ∑ d ≥ 1.

For instance, in figure 5.2, a subgraph with a delay-free directed path to itself (the path via signals x12 and x14) is not convex, and thus not a cluster. As a consequence of Definitions 5.1 and 5.2, the following corollaries hold:

Corollary 5.1 A cluster can always be scheduled independently from the rest of the HDSFG.
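The convexity condition of Definition 5.2 can be checked mechanically on a small graph representation. The sketch below assumes non-negative integer edge weights (delays), so a path has accumulated weight 0 exactly when every edge on it has weight 0; the node and edge names are invented for illustration.

```python
from collections import deque

def is_convex(g_nodes, edges):
    """Definition 5.2 (sketch): G is non-convex iff some delay-free directed
    path leaves G, runs through outside nodes only, and re-enters G.
    `edges` maps (u, v) pairs to their delay weight."""
    g = set(g_nodes)
    # Outside nodes reached from G over a zero-delay edge.
    frontier = deque(v for (u, v), d in edges.items()
                     if u in g and v not in g and d == 0)
    seen = set(frontier)
    while frontier:
        u = frontier.popleft()
        for (a, b), d in edges.items():
            if a == u and d == 0:
                if b in g:
                    return False  # a delay-free path returned to G
                if b not in seen:
                    seen.add(b)
                    frontier.append(b)
    return True

# A zero-delay detour n1 -> x -> n2 makes {n1, n2} non-convex; putting one
# delay on the re-entering edge (cf. the path via x12/x14 in figure 5.2)
# restores convexity.
detour = {("n1", "x"): 0, ("x", "n2"): 0, ("n1", "n2"): 0}
print(is_convex({"n1", "n2"}, detour))                      # False
print(is_convex({"n1", "n2"}, {**detour, ("x", "n2"): 1}))  # True
```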


[Figure 5.2: Convex and non-convex subgraphs]

Corollary 5.2 Two clusters can always be fully sequentialized (e.g. by means of cut reduction).

These properties of clusters will be used further on.

A cluster contains nodes, but also edges: all edges incident to nodes of the cluster are edges of that cluster. In other words, an edge can belong to several clusters.

A graph that is completely covered by clusters has interesting properties:

Definition 5.3 (Complete cluster cover) A graph is completely covered by a set of clusters S iff each node of the graph belongs to exactly one cluster of S.

Theorem 5.1 If an un-scheduled HDSFG is completely covered by a cluster set S, then

    GMC = max*_S {CMC}    (5.1)

where max* denotes the component-wise maximum, taken here over the CMC arrays of the clusters of S (see Definitions 3.31 and 3.34 for the definitions of GMC and CMC).

Proof: The maximum parallel edge set for a register file i (MPES_i), with register cost GMC[i], contains an edge (or edges) of at least one cluster of S, since the HDSFG is completely covered by S. Suppose the subset of clusters that contains the edges of the MPES_i is S_MPESi. From the definition of CMC, it follows that GMC[i] equals the largest CMC[i] of the clusters in S_MPESi.


Moreover, from the definition of GMC it follows that there can be no cluster in S with a larger CMC[i] than the clusters in S_MPESi. The same reasoning holds for all i; therefore, equation (5.1) is proven. □

Theorem 5.1 is illustrated in figure 5.2: if it is assumed that all signals are stored in the same central register bank, the value of the one and only component of the CMC for the cluster in figure 5.2 is 7. The data edges that correspond to this CMC are indicated with the shaded dots.

5.2 The goal of clustering

The goal of the register optimization techniques in this dissertation is satisfying the constraints on the available number of registers by preprocessing the design before scheduling it. The same goal is pursued when clustering a HDSFG: to bring GMC within the constraints C. By means of Theorem 5.1, this goal can be formulated more precisely:

Definition 5.4 (Clustering) Clustering is finding a complete cluster cover S of the HDSFG such that

    max*_S {CMC} ≤ C    (5.2)

and the number of clusters in S is minimal.

By minimizing the number of clusters, the amount of sequentialization to be done by cut reduction afterwards is reduced (i.e. the cut reduction search space is reduced). Since clustering is used as a preprocessing step for cut reduction, decreasing the cut reduction complexity is its only goal (see also Section 5.5).

5.3 The basic clustering technique

In this section, the use of a greedy clustering technique is motivated and compared with other clustering techniques. In Section 5.3.1, clustering is compared with partitioning. The basics of greedy, hierarchical clustering are explained in Section 5.3.2. Hierarchical clustering is based on the use of distance metrics. These metrics are discussed in Section 5.3.3. Metric clustering selects the minimum distance between the clusters in order to know which clusters to merge next (Section 5.3.4).
After merging clusters (Section 5.3.5), the metric is recomputed for the sake of accuracy, as explained in Section 5.3.6. The course of the clustering process is analyzed in Section 5.3.7. The algorithmic complexity of clustering is discussed in Section 5.3.8, and finally, a comparison with other clustering techniques in the literature is given in Section 5.3.9.


[Figure 5.3: Hierarchical clustering]

5.3.1 Clustering vs. partitioning

To obtain a cluster as defined in Definition 5.1, two possible actions can be taken: a pair of clusters can be merged into a new cluster (merging larger sets than pairs is not considered here; this does not influence the result, only the number of merging actions needed to reach it), or a cluster can be split into two clusters. The first alternative is called clustering, the second one partitioning. In fact, the two problems are equivalent: in the case of clustering, the search starts from an initial cluster cover where each node of the HDSFG is part of a different cluster, and in the case of partitioning, the initial state is the one where all HDSFG nodes are part of the same cluster. The clustering approach is chosen here, because an interesting greedy heuristic can be used.

5.3.2 Greedy clustering

As a naive solution for the clustering problem, one could try to exhaustively enumerate all possible cluster covers of the HDSFG and evaluate which one satisfies Definition 5.4. It is clear that the number of possible cluster covers is huge for most HDSFGs. Therefore, a greedy alternative is proposed. In [Johnson 67], a hierarchical clustering strategy is presented that is based on the definition of a metric (distance) between two clusters and the merging of clusters at a minimum distance from each other. The concept of hierarchical clustering is illustrated in figure 5.3. Suppose that there are 5 clusters at the beginning of the clustering (clusters A to E). The distance between each pair of clusters is calculated. The smallest of these distances is shown on the Y-axis in figure 5.3: the distance between clusters A and B is minimal. Therefore, A and B are merged into a new cluster {A,B}. When all distances are then updated, the distance between C and D is minimal and they get merged next.
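The merge loop of [Johnson 67] can be written down compactly. This is an illustrative skeleton only: the distance function below is a toy stand-in for the D- and E-metrics defined later, and the loop runs all the way to a single cluster instead of stopping at the constraint-based criterion of Definition 5.4.

```python
import itertools

def hierarchical_clustering(items, distance):
    # Agglomerative clustering: repeatedly merge the pair of clusters at
    # minimum distance, and record each merge.
    clusters = [frozenset([x]) for x in items]
    merges = []
    while len(clusters) > 1:
        a, b = min(itertools.combinations(clusters, 2),
                   key=lambda pair: distance(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
        merges.append(a | b)
    return merges

# Toy metric: clusters of points on a line; distance is the smallest gap.
def gap(c1, c2):
    return min(abs(x - y) for x in c1 for y in c2)

for merged in hierarchical_clustering([0, 1, 10, 11, 12], gap):
    print(sorted(merged))
```

The nearby points are merged first, and the two far-apart groups are only joined in the final step, mirroring the bottom-up trace of figure 5.3.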


In the next clustering step, the clusters {C,D} and E are merged, and so on. The clustering ultimately stops when there is only one cluster left. Note that this clustering approach is greedy because the result depends on the order in which clusters are merged. The distance metric only provides a locally valid guideline for a good order of cluster merging. Some metrics and their properties are discussed below.

5.3.3 Distance metrics

The distance between two clusters is calculated by temporarily merging the clusters. Next, the GMC of the new cluster is calculated. This can be done by scheduling it and performing register assignment for the schedule, on condition that this schedule is fixed as long as the cluster is retained. This is very important: clusters can be pre-scheduled such that the HDSFG is partially scheduled when it gets to the final scheduling stage. For example, the calculation of the GMC of a cluster is illustrated in figure 5.4.

[Figure 5.4: Calculating the GMC of a cluster; the example cluster has GMC = 3]

If, by temporarily merging two clusters, the GMC of the new cluster is such that GMC > C, then the two clusters should not be merged. This is ensured by setting the distance between them to ∞. The same holds if the union of the two clusters is not a convex subgraph (see Definition 5.2). On the other hand, the following lemma allows the temporarily merged clusters C1 and C2 to be merged a priori:

Lemma 5.1 Let C′ = C1 ∪ C2, and let GMC′, GMC1 and GMC2 be the maximum register cost arrays of resp. C′, C1 and C2. If


max*{GMC1, GMC2} is the component-wise maximum of two register cost arrays (see Definition (3.20)), and if

    GMC′ = max*{GMC1, GMC2}    (5.3)

then merging C1 and C2 cannot increase the global GMC of the HDSFG.

Proof: It is sufficient to prove that, for each register file i, the HDSFG's maximum parallel edge set MPES_i after merging C1 and C2 cannot be larger than before:

- Merging C1 and C2 does not increase the local MPES_i, since GMC′ = max*{GMC1, GMC2}.
- Because merging C1 and C2 is the only action that takes place in the HDSFG, the global MPES_i of the HDSFG cannot increase either. □

The application of Lemma 5.1 is illustrated in figure 5.5.

[Figure 5.5: Illustration of automatic merging]

It is assumed that all signals are stored in the same register file. Merging clusters C1 and C2 has no global influence: a solution with 2 registers can still be found. However, if


C2 and C3 are merged, a solution with 2 registers can no longer be obtained, because the new cluster C2 ∪ C3 can be scheduled such that it requires 3 registers. The "automatic" merging of Lemma 5.1 is ensured by setting the distance between the corresponding clusters to 0.

After having checked whether the distance between two clusters can be set to ∞ or 0, the distance is actually calculated. There are two (extreme) approaches to distance calculation:

1. Calculating the CMC of the new cluster only. A first metric estimates the local effect of merging two clusters on the maximum register cost:

   Definition 5.5 (D-metric) The D-metric between two clusters C1 and C2 is evaluated as follows:

       D1,2 = CMC′    (5.4)

   where C′ = C1 ∪ C2 and CMC′ is the constrained maximum register cost array for C′.

2. Recalculating the CMC of all clusters. The following metric exactly calculates the effect of merging two clusters on the GMC of the HDSFG:

   Definition 5.6 (E-metric) If S′ is the cluster set after merging C1 and C2, the E-metric is evaluated as follows:

       E1,2 = max*_{S′} {CMC} = GMC_{S′}    (5.5)

   where GMC_{S′} is the global maximum register cost array of the HDSFG after merging C1 and C2. The E-metric in fact performs a kind of look-ahead on the global effect of merging two clusters.

5.3.4 Minimum distance selection

After calculating the distances between the clusters, a number of D or E arrays have to be evaluated to select the best cluster pairs for merging. The components of these distance arrays are the maximum register sizes of the register files (see Definitions 5.5 and 5.6). A scalar value can be computed for each distance:

Definition 5.7 (Normal value) The normal value |D| of a distance D is a weighted sum of its components:

    |D| = ∑_{i=1..N} w[i]·D[i]    (5.6)


where

    w[i] = 1 / (C[i] − D[i] + 1)  if D[i] ≤ C[i]
    w[i] = ∞                      otherwise        (5.7)

The same holds for the E-metric.

The distances with the smallest normal value indicate the cluster pairs that are merged. The goal of the weights in (5.6) is to fill the register files as gradually as possible while allowing more signal parallelism during clustering. The user can overwrite these weights to steer the clustering differently.

5.3.5 Cluster merging

After the calculation of the distances, the minimum distances are obtained by comparing the normal values (Definition 5.7) of the distances. Usually, several distances are minimal at the same time. For instance, in figure 5.6(a), suppose that cluster pairs {CA, CB}, {CA, CC}, {CB, CD}, {CC, CD} and {CD, CE} are candidates for merging because their distances are minimal: D1 = D2 = D3 = D4 = D5. However, the cluster pairs that are merged must be disjoint: the calculation of e.g. D1 in figure 5.6(a) only takes the merging of CA with CB into account, and not the merging of CA, CB and, for instance, CC.

[Figure 5.6: An example of cluster merging]

Cluster merging goal. To minimize the number of clustering stages, as many disjoint cluster pairs as possible should be merged. To solve this problem, a conflict graph is set up where the nodes are the candidate cluster pairs for merging (figure 5.6(b)). An edge connects two nodes if the cluster pairs are not disjoint (i.e. if they share a cluster). The problem of finding a maximum


number of clusters to merge is now transformed into a maximum independent set problem (an independent set is a set of nodes of which no two nodes are connected by an edge) in the conflict graph. For instance, the cluster pairs {CA, CC} and {CB, CD} form a maximum independent set in figure 5.6(b).

Solving the maximum independent set problem. This problem is NP-complete [Garey 79], and solving it exactly can be very complex if there are many candidate cluster pairs. Therefore, a fast, linear coloring [Aho 88] is performed, minimizing the number of colors. The largest set of equally colored nodes is taken as an approximation of the maximum independent set: e.g. in figure 5.6(c), this is the set of nodes colored with the color "2" (or the color "1"). This set can be smaller than the maximum independent set if the heuristic coloring as in [Aho 88] is used. However, optimality is not essential here: the main consequence of sub-optimality is that the clustering will have to do a few more iterations (merging steps).

An alternative approach to cluster merging could be the merging of one pair of clusters per iteration. The advantage of this approach is that the coloring problem (see figure 5.6) does not need to be solved. However, more clustering steps are required.

5.3.6 Distance recomputation

If C1, C2 and C3 are clusters, the following inequality is called the ultra-metric inequality [Johnson 67]:

    D1,3 ≤ max*{D1,2, D2,3}    (5.8)

It has been shown in [Johnson 67] that, if a metric satisfies the ultra-metric inequality, the distances do not have to be recomputed during clustering. New distances can simply be derived from the old distances by taking the minimum, maximum or average of the old distances.

On the other hand, if the metric does not satisfy the ultra-metric inequality, one could still cluster without recomputing the distances. In that case, however, an error is propagated as the clustering proceeds. Almost all hierarchical clustering techniques in the high-level synthesis literature proceed in that way, propagating an error (see Section 5.3.9).
Unfortunately, the D- and E-metric are strongly non-ultrametric, because merging two clusters heavily influences some other distances. It is exactly this influence that has to be known for the clustering in this thesis, so the distances have to be recomputed for accuracy reasons. However, this recomputation of all distances only takes place at an interval of e.g. every 5 or 10 clustering steps (see also Section 5.3.8), to save CPU time.
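The recomputation interval can be sketched as follows. Assumptions in this sketch: the metric is a plain function of two clusters, and between exact recomputations the distance to a freshly merged cluster is estimated as the maximum of the two old distances (one of the standard combination rules from [Johnson 67]); for the non-ultrametric D- and E-metrics this is only an estimate, which is exactly why the periodic exact pass is needed.

```python
import itertools

def agglomerate(items, distance, recompute_every=5):
    clusters = [frozenset([x]) for x in items]
    # Pairwise distance table, keyed by the (unordered) cluster pair.
    dist = {frozenset((a, b)): distance(a, b)
            for a, b in itertools.combinations(clusters, 2)}
    step = 0
    while len(clusters) > 1:
        a, b = min(itertools.combinations(clusters, 2),
                   key=lambda p: dist[frozenset(p)])
        merged = a | b
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        step += 1
        if step % recompute_every == 0:
            # Exact (expensive) recomputation of the whole table.
            dist = {frozenset((x, y)): distance(x, y)
                    for x, y in itertools.combinations(clusters, 2)}
        else:
            # Cheap update: estimate the distance to the merged cluster
            # from the two old distances (exact only for ultrametrics).
            for c in clusters:
                if c != merged:
                    dist[frozenset((c, merged))] = max(dist[frozenset((c, a))],
                                                       dist[frozenset((c, b))])
    return clusters[0]

def gap(c1, c2):
    return min(abs(x - y) for x in c1 for y in c2)

print(agglomerate([0, 1, 10, 11], gap, recompute_every=2))
```

Setting `recompute_every=1` gives the fully accurate (and most expensive) variant; larger intervals trade accuracy between the exact passes for CPU time, as discussed above.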


[Figure 5.7: Schematic clustering process]

5.3.7 The clustering process

The clustering proceeds iteratively, performing cluster merging during each clustering iteration until no more clusters can be merged because all distances are infinite (or until there is only one cluster left). The complete clustering algorithm can be found in Appendix D. A typical clustering process is schematically outlined in figure 5.7. At the start of the clustering, there are n clusters (n is the number of nodes in the HDSFG), and GMC > C. This maximum register cost is exactly the same as in the case where all nodes are clustered into one single cluster. In between, the maximum register cost can only be smaller than (or equal to) the maximum register cost at the start; hence the U-shape of the trace in figure 5.7. Note also that partitioning techniques would start searching from the opposite direction to clustering in figure 5.7. It is possible that partitioning requires fewer iterations to reach the optimal solution, however at the expense of more CPU time per iteration.

The clustering can be successful (i.e. such that GMC ≤ C at the end) or unsuccessful (i.e. not able to find a cluster cover such that GMC ≤ C).

Successful clustering. During clustering, the GMC decreases (trace 1 in figure 5.7) and at some point (e.g. when there are b clusters left) GMC ≤ C. However, as long as GMC ≤ C, the clustering proceeds to minimize the number of clusters (see Definition 5.4), until at some point (e.g. when there are a clusters left) the maximum register cost is about to exceed C again. At this point, all distances are ∞, and the clustering stops. This is an important characteristic of the clustering strategy used in this work: it stops when the constraints are violated again, after a feasible solution has been found.
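The stop criterion of the successful trace can be phrased as a small scan over the cost trace of figure 5.7. The trace values below are invented for illustration; `cost_trace[k]` stands for the GMC after k merge steps.

```python
def best_feasible_cover(cost_trace, constraint):
    """Figure 5.7 sketch: keep merging while the cost stays within the
    constraint, and return the feasible point with the fewest clusters
    (the point just before the constraint is violated again), or None
    if the trace never becomes feasible (unsuccessful clustering)."""
    best = None
    for merges, cost in enumerate(cost_trace):
        if cost <= constraint:
            best = merges          # fewest clusters so far
        elif best is not None:
            break                  # constraint violated again: stop
    return best

# U-shaped trace as in figure 5.7: 6 -> 4 -> 3 -> 2 -> 3 -> 5, constraint 3.
print(best_feasible_cover([6, 4, 3, 2, 3, 5], 3))  # prints 4
```

The scan returns 4 merge steps here: the cost of 3 at that point still satisfies the constraint, while one step later the cost of 5 exceeds it, so the clustering stops.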


Unsuccessful clustering. Unfortunately, a clustering such as trace 1 in figure 5.7 is not guaranteed. Sometimes (e.g. in the case of a very small number of available registers), clustering is not able to come up with a cluster cover that satisfies the register cost constraints. This is illustrated by trace 2 in figure 5.7: the clustering stops without even having reached an intermediate step that satisfies the constraints. In this case, the cut reduction that follows the clustering has to complete the job, e.g. by spilling signals to RAM. Still, it is useful to do some clustering to reduce the complexity of the cut reduction.

5.3.8 Algorithmic complexity of clustering

Let n be the number of nodes in the HDSFG, |S| the number of clusters and N the number of register files (and thus the number of components in a register cost array). One distance calculation requires N computations of a CMC[i] component (with practical complexity O(log n), see Section 3.4.4) if the D-metric is used, and N·|S| computations if the E-metric is used. There are |S|(|S|−1)/2 distances to be calculated at each clustering step. Therefore, at each step of the clustering algorithm, the practical algorithmic complexity of the computation is O(|S|²·log n) for the D-metric, and O(|S|³·log n) for the E-metric. There are at most n clustering steps, such that with |S| = n as a worst case, the theoretical worst-case clustering complexity is O(n³ log n) if the D-metric is used, and O(n⁴ log n) if the E-metric is used.

A significant speedup can be obtained by not recalculating all distances at each step of the clustering algorithm. The clustering step interval at which all distances are recalculated can be set by the user. Figure 5.8 shows the normalized CPU times (i.e. average CPU times per register file) for hierarchical clustering as a function of n. Each clustering was completed from n clusters down to 1 cluster.
The interval for global distance recomputation was set to 5. Lemma 5.1 was not applied. It can be observed from the plot in figure 5.8 that clustering can still take quite some CPU time, although its algorithmic complexity is polynomial. The application of Lemma 5.1 is very important: experiments have shown that, typically, the number of clusters to start from can be reduced by some 10 to 20% (or even more if a kind of "tolerance" is built in). This has a substantial effect on the clustering CPU times.

5.3.9 Clustering techniques in the literature

Clustering is used on several occasions in the literature. A small but representative number of examples is given in the following paragraphs.

An interesting use of hierarchical clustering is the assignment of operations to data-paths, taking interconnection and layout into account. For instance, in the Bud-Daa system [McFarland 90], the metric between the operations of


[Figure 5.8: Normalized CPU times for clustering (normalized CPU time in seconds versus number of nodes, on log-log axes)]


the behavioral description is a function of the common functionality, the degree of interconnection and the potential parallelism. Operations are assigned to the same hardware if possible, to reduce the interconnection. By means of hierarchical clustering, a complete clustering tree is calculated (i.e. clustering up to one cluster, see figure 5.3). At each stage of the clustering, each cluster is assigned to a different resource. After having completed the clustering, the best clustering stage is selected by making "horizontal cuts" through the clustering tree. Distances are not recalculated (the metric is assumed to be ultra-metric, see Section 5.3.2, equation (5.8)), and the clustering proceeds very fast. However, errors get introduced during clustering.

A similar clustering technique is used in the Aparty system [Lagnese 91]. It is used for architectural partitioning at the system level. Several metrics are defined, and the user can switch to another metric during clustering, to steer the clustering (multistage clustering). Again, a complete clustering tree is built and distances are not recalculated. Another variant of this method is published in [Scheichenzuber 90], where the probabilities of the execution of conditional operations are taken into account.

Finally, a hierarchical clustering technique for merging registers into register files while minimizing the interconnection and port cost is reported in [Rouzeyre 91]. Here, the metric is a measure of the advantage of merging registers into register files. The compatibility of two registers for merging is determined by solving a bipartite matching problem.

The metrics used in the literature are most of the time abstract, because they represent a weighted sum of different costs. The metrics in this work, however, represent only one specific cost: registers. Therefore, the D- and E-metrics defined above allow for an accurate stop criterion (when reaching the register constraints), because they have a physical meaning.
In other words, the complete clustering tree does not have to be calculated. Furthermore, there is no loss in accuracy, because the distances are recalculated. This of course leads to a more complex clustering technique.

5.4 Extensions for control flow hierarchy

Real-time signal processing applications in the low- to medium-throughput domain make extensive use of nested loops and nested conditions. In Section 5.4.1, solutions for the clustering of conditional operations are discussed. The effect of HDSFG condition and loop hierarchy on the register constraints for clustering is explained in Section 5.4.2.


5.4.1 Clustering of conditional operations

A first way of dealing with conditional operations during clustering is flattening the condition hierarchy: each cluster then consists of operations with all kinds of conditions. For instance, the HDSFG in figure 5.9(a) contains a number of clusters. The clustering then proceeds as in figure 5.9(b), where a cluster consists of conditional and unconditional operations. In this case, a cluster has a different GMC and CMC for each different condition. Also, a distance D or E has to be calculated for each possible condition. The worst-case distance is then representative for the metric between two clusters. It is clear that this approach can lead to a large number of distance calculations if there are many different conditions. However, it is the most general one.

Calculating the distances for all possible conditions can be avoided by using a hierarchical approach: a conditional cluster cannot be merged with an unconditional one unless the CMB block it is part of is completely clustered. E.g. in figure 5.9(a), the CMB block of the exclusive conditional blocks is completely clustered, since each conditional block is covered by 1 cluster. Therefore, the exclusive conditional blocks can be represented by an unconditional node at the next higher level, and clustering can proceed unconditionally, as in figure 5.9(c), which leads to the same result in this case. This hierarchical approach can lead to slightly more clusters, because the merging of conditional and unconditional clusters is delayed. As a result, the cut reduction task will be somewhat more complex if there are many large, nested conditional blocks. The advantages outweigh this small drawback, however.

5.4.2 Offset register cost

Clustering is performed at each level of the HDSFG hierarchy.
If the CMB hierarchy for the clustering of conditions is taken into account (see above), then the HDSFG hierarchy also includes the conditional blocks. Each nested block can have bridging edges (see Definition 3.32): edges that (can) run across the nested block and represent signals that can be alive during the execution of the operations of the nested block. These signals will require storage in the worst case. This bridging-edge register cost is an offset for the register cost constraints for clustering and cut reduction in the nested block. The bridging edges' maximum register cost for a nested block NB is represented by the BMC for NB (see Definition 3.33).

For instance, if there are 16 registers available, and at most 4 signals are alive during the execution of some nested loop, then there are, in the worst case, only 12 registers available for that nested loop. This, however, is a very pessimistic approach: it is much better to anticipate that less than


[Figure 5.9: Clustering of conditional operations]


4 signals will be overlapping the nested block, because of the clustering and cut reduction at the next higher level. Therefore, the following more realistic BMC calculations are proposed:

1. The BMC register cost array is calculated by minimizing the amount of delay during retiming (see Chapter 3). As an illustration, the example in figure 3.1 has a worst-case bridging-edge cost in register file reg1 of BMC[reg1] = 2, and a minimum cost BMC[reg1] = 1.

2. If spilling is allowed, it could be reasonable to assume that any signal that lives during the execution of e.g. a nested loop is stored in RAM during the execution of that loop. In that case, BMC = 0.

The choice between these two approaches is left to the user in the current implementation. Note that the offset register cost also has to be taken into account during cut reduction.

5.5 Clustering and cut reduction

As indicated in figure 5.7, clustering does not always succeed in reducing the maximum register cost such that GMC ≤ C.

An example. In the example in figure 5.10, it is assumed that all signals are stored in the same register file. In figure 5.10(a), the result of clustering is shown: three clusters have been identified (C1, C2 and C3) and scheduled. The shaded profiles are the register cost profiles of these cluster schedules. The maximum register cost of the clustered HDSFG in figure 5.10(a) is 4 (e.g. if the clusters are scheduled in parallel), which exceeds the constraint of 3. The only way in which the register constraint can be satisfied is by adding a sequence edge as in figure 5.10(b). This is done by means of the cut reduction techniques presented in Chapter 4.
Note that clustering has reduced the search space for cut reduction: the GMC of each cluster is within the register cost constraints, and the only basic moves to be explored by cut reduction are adding sequence edges and spilling signals between the clusters.

Clustering is not strictly required to obtain a design that satisfies the register constraints. However, for large designs (with large HDSFGs), the cut reduction CPU times are impractical without clustering (see table 4.1). Therefore, in most real-life designs, the HDSFG is clustered before passing it on to cut reduction: for instance, the HDSFGs of some large hierarchical blocks in the CHED91, LPC and SPEECH examples (see table 1.2) have to be clustered in order to make the cut reduction feasible. The total preprocessing CPU time (clustering and cut reduction) is discussed in Chapter 7 for some real-life designs.
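The effect of the sequence edge in figure 5.10 can be mimicked on register cost profiles. A toy sketch (profile values invented): when two scheduled clusters run in parallel, their register usages add per time step; a sequence edge makes their lifetimes disjoint, so only the larger profile peak counts.

```python
def parallel_cost(p1, p2):
    # Clusters scheduled in parallel: register usages add per time step.
    n = max(len(p1), len(p2))
    pad = lambda p: p + [0] * (n - len(p))
    return max(a + b for a, b in zip(pad(p1), pad(p2)))

def sequential_cost(p1, p2):
    # A sequence edge between the clusters: their usages never overlap,
    # so the cost is simply the worst single-cluster peak.
    return max(p1 + p2)

# Register cost profiles of two scheduled clusters (values invented):
c1, c2 = [1, 2, 2, 1], [2, 2, 1]
print(parallel_cost(c1, c2), sequential_cost(c1, c2))  # prints: 4 2
```

With a constraint of 3, the parallel cost of 4 violates it while the sequentialized cost of 2 does not, which is exactly the situation cut reduction resolves in figure 5.10(b).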


[Figure 5.10: An example of the integration of clustering and cut reduction]


Model for cut reduction. The cut reduction of figure 5.10(b) is obtained by modeling the clustered and partially scheduled graph as in figure 5.10(c): HDSFG nodes that are scheduled at the same time step are modeled as one node in the reduced cut reduction model, because no sequence edge can be added between them anyhow. The reduced cut reduction model of the clustered HDSFG in figure 5.10(a) is shown in figure 5.10(c).

5.6 Summary

In this chapter, hierarchical clustering has been proposed as an alternative technique for register optimization. An efficient, greedy algorithm based on distance metrics has been presented. The metric models the register cost of the design, and this provides a stop criterion for the clustering process.

However, since clustering cannot always guarantee a solution within the constraints, it is only effective when used as a preprocessing step for cut reduction. Experiments have shown that in this way, clustering and cut reduction can be applied to large designs (see Chapter 7).


Chapter 6

Integer Programming Scheduling

The preprocessing techniques of Chapters 4 and 5 have relieved the scheduler from taking constraints on the available number of registers into account. After preprocessing with clustering and cut reduction, any schedule satisfies the register constraints. In this chapter, a scheduling technique based on integer linear programming (ILP) is discussed; the abbreviation "ILP" is used for "integer linear programming" throughout this text. It allows software pipelining and delay line optimization to be performed optimally during scheduling. The scheduling technique supports cyclic (repetitive) signal flow graphs and control flow hierarchy. Furthermore, operations can be assigned to resources such that the interconnection between the functional units is optimized. Finally, a new and general method for modeling conditionally exclusive operations is proposed as well.

Clusters in the HDSFG are supported by the scheduling technique in this chapter. Although clustering is not strictly required (cut reduction can perform the register optimization task alone, see Section 5.5), the presence of scheduled clusters can decrease the complexity of the scheduling problem to a large extent, making the ILP approach feasible for large realistic problems. The scheduling problem to be solved when the HDSFG has been clustered is called a macronode scheduling problem (see the scheduling script in figure 2.2(b)):

Definition 6.1 (Macronode scheduling) Macronode scheduling is scheduling an HDSFG that contains clusters that are scheduled a priori.

The goal of macronode scheduling in this dissertation is the minimization of the total number of clock cycles required to execute the application, under fixed resource constraints (but no register constraints). Macronode scheduling is somewhat similar to code compaction in software compilers for VLIW architectures (see Section 2.1.1). The code to be compacted here is a set of scheduled clusters. The fact that the clusters can be scheduled a priori is very specific to the scheduling approach proposed in this thesis:

Definition 6.2 (Hierarchical scheduling) Hierarchical scheduling is partitioning a large scheduling problem into a set of smaller scheduling problems that are solved independently.

[Figure 6.1: Hierarchical scheduling — (a) a set of scheduled clusters after cluster scheduling, and (b) the macronode schedule in which the clusters overlap (shaded areas).]

The two steps in the hierarchical scheduling approach are cluster scheduling and macronode scheduling. This is illustrated in figure 6.1: after cluster scheduling, the HDSFG consists of a set of scheduled clusters (figure 6.1(a)), and macronode scheduling determines the relative timing of these clusters (figure 6.1(b)), allowing them to overlap in the final schedule (the cluster overlaps are indicated by the shaded areas in figure 6.1(b)). It should be mentioned that the two scheduling phases are highly interdependent, and that optimality can no longer be guaranteed for hierarchical scheduling, even if techniques like ILP are used. As observed above, clustering should be avoided if global optimality is strictly required.

This chapter is structured as follows. The importance of software pipelining is discussed in Section 6.1. Motivations for using ILP and an overview of the literature are given in Sections 6.2 and 6.3. The actual ILP scheduling model is presented in Section 6.4. The resource constraints of the ILP model are discussed in more detail in Section 6.5, and the timing constraints in Section 6.6. The modeling of the interconnection network is discussed in Section 6.7. The complexity of the ILP problem is analyzed in Section 6.8. The cost functions of the different ILP scheduling phases are discussed in Section 6.9. Experiments and CPU times for solving the ILP problems are given in Section 6.10.
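The division of labor between the two phases described above can be illustrated with a toy composition step: each cluster keeps its internal (a priori) schedule, and macronode scheduling only picks a time offset per cluster, letting clusters overlap. The cluster names, schedules and offsets below are hypothetical, and real macronode scheduling would of course also resolve resource conflicts inside the overlap:

```python
def compose(cluster_schedules, offsets):
    """Shift each pre-scheduled cluster by its macronode offset and merge
    the results into one flat schedule; clusters may overlap in time."""
    final = {}
    for name, sched in cluster_schedules.items():
        for op, step in sched.items():
            final[op] = step + offsets[name]
    return final

# Cluster A is 3 steps long, cluster B is 2 steps long.
clusters = {"A": {"a0": 0, "a1": 1, "a2": 2}, "B": {"b0": 0, "b1": 1}}
sched = compose(clusters, {"A": 0, "B": 2})   # B overlaps A's last step
makespan = max(sched.values()) + 1
assert makespan == 4                          # 4 steps instead of 5 back-to-back
```

Only the offsets (here 0 and 2) remain free once the clusters are scheduled, which is exactly why macronode scheduling has far fewer degrees of freedom than the original flat problem.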

Page 111: Register Optimization and Scheduling for Real-Time Digital ...

6.1 The importance of software pipelining

In this section, the importance of software pipelining is illustrated. The concept of software pipelining is defined in Section 6.1.1. A lot of work has already been done on software pipelining. This work can be classified into two contexts: the multi-processor scheduling context and the high-level synthesis context. A literature overview of both domains is given in Sections 6.1.2 and 6.1.3.

6.1.1 The concept of software pipelining

Definition 6.3 (Software pipelining) The software pipelining of a repetitive HDSFG (e.g. a loop) is increasing its throughput (or decreasing its total execution time) by executing operations from different iterations of the repetitive HDSFG in parallel.

Software pipelining is a powerful optimization, needed in almost all real-time signal processing applications to meet the timing specification. It was first described in the context of filter synthesis in [Fettweis 76]. In real-time signal processing, typically the throughput of some time-critical inner loop is optimized by means of software pipelining. For example, reducing the schedule length of the innermost loop (the i2-loop) of the channel decoder design CHED91 (see figure 1.5) by 1 clock cycle yields a total gain of 11,712 clock cycles.

All the models and techniques that are presented in this thesis (the calculation of maximal parallel edge sets, clustering, cut reduction and scheduling) take the possibility of software pipelining into account.

6.1.2 Software pipelining in multi-processor scheduling

A lot of work has been done on software pipelining in the past ten years, especially in the domain of software compilers for multiprocessor systems. It is impossible to give a detailed overview of all contributions; only the most frequently cited ones are discussed here.

Most authors in this domain have acknowledged the importance of the data dependencies between successive loop iterations.
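These inter-iteration dependencies bound the achievable throughput: a dependence cycle whose operations take T time units and which crosses d iteration boundaries cannot be iterated faster than ceil(T/d). A minimal sketch of this lower bound (the cycle list is hypothetical input, not data from a real design):

```python
from math import ceil

def iteration_period_bound(cycles):
    """Lower bound on the iteration period of a cyclic data-flow graph:
    the maximum over all dependence cycles of total operation delay
    divided by the number of loop-carried delays in the cycle."""
    return max(ceil(time / delays) for time, delays in cycles)

# Hypothetical graph with two dependence cycles:
# 4 time units of work per 2 delays, and 5 time units per 1 delay.
assert iteration_period_bound([(4, 2), (5, 1)]) == 5
```

No amount of added hardware can beat this bound, which is why the literature discussed next centers on schedules that attain it.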
This has led to the derivation of a theoretical lower bound on the iteration period that can be achieved [Renfors 81], called the iteration period bound. A schedule that achieves this minimum iteration period is called rate-optimal. Several techniques for rate-optimal scheduling have been proposed:

Cyclo-static scheduling One of the first published was the cyclo-static scheduling of digital filters, where the assignment of operations to processors


can also be periodic [Schwartz 85] [Schwartz 86]. The scheduling and assignment are exhaustive, and constraints on the inter-processor communication can be taken into account.

Optimal loop unrolling In [Parhi 89], it is shown that there is an optimal loop unrolling factor such that, when applied, the data flow graph can be scheduled rate-optimally on a minimal number of processors. It should however be noted that unfolding a loop increases the code size, which can be a problem if the code is stored on-chip.

Extension of existing scheduling techniques An extension of list scheduling for software pipelining was first reported in [Lam 88]. The assignment of a variable to several registers, depending on the iteration, is proposed as a technique for further increasing the throughput. In [Groot 92], a scheduling technique based on the concept of "range charts" is proposed. Rate-optimal schedules can be achieved without unfolding, but the required number of processors can be sub-optimal. Constraints on the storage of variables are not taken into account. ILP scheduling is extended for software pipelining in a multiprocessor context in [Lucke 93]: the assignment to processors is done in a separate phase, and cyclo-static schedules can be obtained. However, it is implicitly assumed that there are no delay lines of length larger than one.

Architectural support Also worth mentioning are the proposals for hardware support for software pipelining on VLIW architectures [Rau 82] [Su 90]. Typically, a bank of register queues takes care of the data that are exchanged between different loop iterations on different processors. Of course, there is an important hardware cost involved with these approaches.

6.1.3 Software pipelining in high-level synthesis

More recently, software pipelining is being used in high-level synthesis as well. The main difference with the software compilation domain is that the processors no longer have the same functionality, and that the number of processors is often minimized. The techniques can be roughly classified into functional pipelining and cyclic software pipelining. Only cyclic software pipelining takes the cyclic nature of data flow graphs into account.

Functional pipelining One of the first techniques for functional pipelining in high-level synthesis was proposed in [Girczyc 87]: a loop body is partitioned into functional stages and these stages are executed in parallel. This "loop winding" is done before scheduling. A similar approach was published in [Park 88].


In that work, the trade-off between throughput and hardware cost is thoroughly discussed. The force-directed scheduling approach [Paulin 89] can also be extended for functional pipelining. ILP formulations for functional pipelining have been presented as well [Chen 91].

Cyclic software pipelining The main disadvantage of functional pipelining is that it does not take data dependencies between successive loop iterations into account. Loops with an iteration period bound larger than one cannot be functionally pipelined without taking that iteration period bound into account. A heuristic, iterative technique for the software pipelining of cyclic signal flow graphs (called "loop folding") was proposed in [Goossens 89b]: the operations of the last time step of the schedule are moved to the next iteration to try to construct a new, shorter schedule. This approach cannot guarantee optimality, but has proven to perform quite well, with small CPU times. A very similar approach, called "rotation scheduling", is proposed in [Chao 93]: the retiming formalism is used to formally specify software pipelining. Yet another heuristic technique is proposed in [Lee 92], where the critical cycles of the signal flow graph are scheduled in a first phase, followed by a second phase where resource conflicts with other operations are resolved by rescheduling some operations in a greedy way. Another approach is proposed in [Potkonjak 90]: here, software pipelining (or retiming) is performed as a graph transformation move for a better resource utilization, as part of a global search strategy. The retiming is not performed during scheduling either.

In this chapter, another technique for the software pipelining of cyclic signal flow graphs is proposed. It performs the retiming of the HDSFG optimally during scheduling, which differs from the techniques in [Goossens 89b] and [Potkonjak 90]. The unfolding of a loop, if any, is assumed to be done before scheduling.

6.2 ILP scheduling in the literature

One of the first ILP scheduling formulations in the context of high-level synthesis was published in [Hwang 91] and [Lee 89b]. It has a broad functionality: chaining operations together in one clock cycle, support of multi-cycle operations, functional unit pipelining and handling of conditions. A limited form of register optimization can be done by minimizing the sum of the lifetimes. However, the resource constraints for conditionally exclusive operations require one extra 0/1 variable for each pair of exclusive operations. This becomes very costly. A more general model for conditional resource constraints, without requiring extra variables, is described in Section 6.5.

In [Gebotys 91], the same scheduling problem is solved. However, use is made of very tight constraints. These constraints are derived by means of the theory of integer facets [Nehmhauser 88] and lead to optimal solutions in much smaller CPU times. The same tight constraint formulation is used and extended for software pipelining in this dissertation.

Most other published ILP models are directly derived from the above two. Furthermore, in most publications, the examples that demonstrate the correctness and performance of an ILP model rarely exceed a complexity of a few tens of operations. There are indeed practical bounds to the size of the applications that can be scheduled with ILP. However, the hierarchical scheduling approach proposed in this thesis allows much larger designs to be scheduled with ILP, at the expense of some optimality (see Section 6.10).

6.3 Motivations for a new ILP model

The main motivation for using ILP to model and solve the scheduling problem is its optimality: the optimal schedule with respect to some cost function can be found. As already mentioned in Section 6.1, optimality is needed for the time-critical parts of the design. There has been a lot of previous work on software pipelining (see Section 6.1) and quite a few ILP scheduling models have been published (see Section 6.2). The new contributions of the ILP scheduling model proposed in this chapter are the following:

1. Integration of optimal software pipelining with optimal scheduling in an integer linear programming (ILP) model.

2. A new timing model that supports cyclic signal flow graphs and delay line optimization, by combining scheduling and retiming.

3. An efficient timing constraint is proposed, based on a formulation in the literature [Gebotys 91], but extended for software pipelining.

4. A general model for generating conditional resource constraints is proposed.

The scheduling problem is modeled in the following canonical form:

    minimize    c·x
    subject to  A·x ≤ b        (6.1)

where x is the array of ILP variables, c·x is a linear cost function, A is the constraint matrix and b is the right-hand side vector. The constraints in (6.1) are linear inequalities. It is well known that ILP problems belong to the class of NP-complete problems [Papadimitriou 82] [Garey 79] [Schrijver 86]


[Nehmhauser 88]. It is also well known that the complexity of ILP largely depends on the formulation of the constraints [Nehmhauser 88]. Special attention will be paid to this in Section 6.6, when formulating the timing constraints.

Finally, it has to be stressed again that when a macronode scheduling problem is solved with ILP, global optimality is not guaranteed: it is only guaranteed w.r.t. the given set of scheduled clusters. In other words, an HDSFG with another set of clusters can yield a schedule with a different cost. This interdependency between cluster scheduling and macronode scheduling is discussed in more detail in Section 6.9.1. Note that the use of clusters is not strictly required (see Section 5.5), and thus a globally optimal solution can always be obtained in theory. However, this is not realistic for large scheduling problems (see Section 6.8).

6.4 The ILP scheduling model

Almost any scheduling, allocation or assignment problem (or a combination thereof) can be modeled as an ILP. However, this kind of generality is not needed here: the most general scheduling problem solved is the scheduling and assignment of a number of operations to a number of time steps and a given number of functional units (EXUs). This scheduling problem is for instance relevant in the context of the Cathedral-2nd script (see Section 1.3.2), but it is quite useful in many practical contexts. As a matter of fact, it is possible that the assignment of operations to functional units has already been performed by another tool, so the scheduler must also be able to take a fixed assignment into account.

The constraints of the scheduling model (the A and b matrices in equation (6.1)) are presented and explained in the next sections. First, the terminology used in the constraints is defined in Section 6.4.1. The constraints in the ILP model are the following:

1. Set constraints ensure that each operation is assigned to a time step (see Section 6.4.2).

2. Resource constraints model the resource conflicts (see Definition 3.1). The formulation used here is based on the formulation in [Hwang 91] [Lee 89b] [Gebotys 91], but is extended with a general method for handling conditions. This new contribution is the subject of a separate section (Section 6.5).

3. Timing constraints model the timing and the software pipelining of an HDSFG. They are based on the formulation in [Gebotys 91], and extended for software pipelining. This new contribution is also the subject of a separate section (Section 6.6).

4. The schedule length is modeled such that it can be minimized (see Section 6.4.3).

5. The interconnection constraints model the interconnection between the execution units. They are also discussed in a separate section (Section 6.7).

The ILP model also supports hierarchical scheduling (see Definition 6.2): the modeling of scheduled clusters (macronodes) is explained in Section 6.4.4, and the modeling of loop hierarchy is discussed in Section 6.4.5.

6.4.1 Terminology

The terminology used in the ILP constraints is summarized in this section. Some symbols have already been defined in Chapter 3, and are repeated here for the reader's convenience. Other symbols have not been defined yet, but will be clarified in the following sections.

- e_data^{i,j} is a data edge from node o_i to node o_j.
- t_i is the start time of operation o_i (see Definition 3.7).
- C is the number of time steps in the schedule.
- p_i is the time step at which operation o_i is scheduled (see Definition 3.24). Note that 0 ≤ p_i ≤ C − 1.
- x_{i,j,k} = 1 if operation o_i is scheduled at time step j and assigned to resource k; otherwise, x_{i,j,k} = 0.
- r_i is the retiming of operation o_i (see Section 3.4.2).
- c_{m,n} = 1 if there is a connection from the output of resource m to the input of resource n; otherwise, c_{m,n} = 0. The connection is assumed to have a sufficient bit-width.
- t_{i,j} = 1 if data edge e^{i,j} corresponds to a data transfer between different resources.
- δ_{i,j} is the minimum (maximum) difference between the start times of operations o_i and o_j (see equation (3.1)).
- d_init^{i,j} is the initial edge weight, before scheduling, of edge e^{i,j} (see Definition 3.15).


- d_{i,j} is the actual weight, after scheduling, of edge e^{i,j}.
- T_i is the throughput of operation o_i (see Definition 3.9).
- S_{j,k} is a maximum clique of the resource conflict graph of resource k at time step j (see Section 6.5).
- W_max^{i,j} is the maximum window for the signal corresponding to edge e^{i,j} (see Definition 3.18).
- o_C is a dummy sink operation.
- r_max is the maximum retiming over all operations: r_max = max_i {r_i}.
- α_i is the earliest time step where operation o_i can be scheduled: p_i ≥ α_i.
- ω_i is the latest time step where operation o_i can be scheduled: p_i ≤ ω_i.

The variables of the ILP problem are the following: x_{i,j,k}, r_i, W_{i,j}, c_{m,n}, t_{i,j} and r_max. Depending on the scheduling problem to be solved, some of these variables may be fixed a priori. For instance, if the interconnection between the functional units is fixed, the c_{m,n} variables are fixed.

The scheduling and assignment freedom of an operation o_i is represented by means of the x_{i,j,k} variables. The same formulation is also used in [Gebotys 91]. However, in most other published ILP scheduling methods, scheduling is solved independently from assignment, and x_{i,j} variables are used [Hwang 91] [Lee 89b] [Chen 91] [Lucke 93]. The modeling by means of x_{i,j,k} variables is more general, though, and solving scheduling and assignment simultaneously can lead to better results. It should however be noted that performing scheduling and assignment "in one shot" for large applications can lead to impractical CPU times. Therefore, the k index in the x_{i,j,k} variables can always be fixed a priori by performing assignment independently.

The scheduling freedom interval Each operation o_i is scheduled in the interval [α_i, ω_i]. This interval should be as small as possible without limiting the scheduling freedom of o_i: the number of x_{i,j,k} variables (and thus the ILP complexity) depends on the sizes of the scheduling freedom intervals. If software pipelining is not allowed, α_i and ω_i are determined by topological sorting, or some variant that takes resource constraints into account. If, on the other hand, software pipelining is allowed, the easiest approach is to set [α_i, ω_i] = [0, C − 1], where C is the length of the schedule. However, more refined techniques for determining the schedule range of operations in repetitive signal flow graphs have been proposed [Groot 92] [Lucke 93]: if the time step of a reference node is fixed, α_i can be determined for most other nodes and ω_i can be derived for all nodes that belong to the same directed cycle in the HDSFG as the


reference node. These methods however do not guarantee that all nodes have a scheduling range smaller than or equal to C, and the choice of a "good" reference node is a non-trivial task.

6.4.2 Set constraints

Since each operation has to be scheduled and assigned, the following constraint is imposed on the set of x_{i,j,k} variables of each operation o_i:

    ∀ o_i :  Σ_{j=α_i}^{ω_i} Σ_k x_{i,j,k} = 1        (6.2)

6.4.3 Formulation of the schedule length

If the HDSFG is cyclic, or if there are edges with initial non-zero weight, the schedule length C has to be fixed in order to be able to generate the ILP constraints (see further). However, when scheduling clusters, C can be minimized. Therefore, a dummy sink node o_C (with scheduling variables x_{C,j_C}) is introduced such that all nodes o_i of the cluster HDSFG that do not have an outgoing edge to another cluster node are connected to o_C by means of a sequence edge e^{i,C}. The time step of o_C, which equals the schedule length, is then expressed as:

    C = Σ_{j_C=α_C}^{ω_C} j_C · x_{C,j_C}        (6.3)

The same technique is also used in [Lee 89b] and [Hwang 91].

6.4.4 Modeling scheduled clusters for macronode scheduling

As defined in Definition 6.1, the HDSFG of a macronode scheduling problem contains scheduled clusters. Due to this a priori scheduling, the scheduling freedom left for macronode scheduling is reduced: the timing of the operations that belong to the same cluster is fixed w.r.t. some reference. Let this reference be the first time step of the schedule of the cluster. In the example of figure 5.10, the reference time steps are the time steps of P1, P2 and P6. In that example, only the 3 time steps of P1, P2 and P6 have to be determined, instead of 11 time steps in the original scheduling problem.

A similar notion of "anchor points" is used in [Ku 90]. There, the timing of the other operations is related to the timing of the "anchor points", but the relative timing is not fixed, as is the case in macronode scheduling.

The relation between the time step of an operation o_i and the reference time step of the cluster to which it belongs can be described formally. If o_Ci is a


dummy node that represents the first time step of the schedule of cluster C_i, and if p_i is the time step of that schedule where o_i is scheduled, then the x_{i,j,k} variables for o_i are substituted by the scheduling variables of o_Ci, as follows:

    x_{i,j,k}  →  x_{Ci,(j−p_i) mod C,k}        (6.4)

Each x_{i,j,k} variable in each constraint must be substituted as in (6.4) in order to turn an ordinary scheduling problem into a macronode scheduling problem. Note that timing constraints between nodes of the same cluster must not be formulated, because they have already been taken into account during the cluster scheduling.

[Figure 6.2: The model for nested loop hierarchy — (a) an HDSFG with a nested i-loop, (b) the nested loop modeled as a zero-throughput node, and (c) the macro-expanded schedule.]

6.4.5 Loop hierarchy

Scheduling is performed at each loop level, starting with the innermost loop. For example, in the HDSFG of figure 6.2(a), the i-loop is scheduled first. Suppose that the schedule of the i-loop has 8 time steps. To save x_{i,j,k} variables for the HDSFG nodes at the top level, the nested i-loop is modeled as a node with zero throughput (figure 6.2(b)). If the nested loop had been modeled by a node with a throughput of 8, all nodes at the top level would have required 8 more x_{i,j,k} variables. By using fewer x_{i,j,k} variables, the resulting ILP scheduling problem becomes less complex. After scheduling, a macro expansion is required to obtain the actual schedule (figure 6.2(c)).

6.5 Resource constraints

Resource constraints are formulated to prevent operations from making use of the same resource at the same time. The main complication with resource constraints is conditionally exclusive operations: these can be scheduled on the


same resource at the same time step. Techniques for dealing with conditional resource constraints have been published [Lee 89b] [Hwang 91]: most of them require extra binary variables. A new technique for generating resource constraints for conditional operations, which does not exhibit this overhead, is proposed in this section.

An example of the formulation of resource constraints is given in Section 6.5.1. The generation of resource constraints with a linear time complexity is discussed in Section 6.5.2. The actual formulation of the resource constraints for functional units, busses and ports is given in Section 6.5.3. Finally, some remarks on the influence of software pipelining on conditional exclusiveness are given in Section 6.5.4.

[Figure 6.3: Example of conditional resource constraints — (a) a nested conditional code fragment with operations o1 ... o8 under conditions c1 ... c4, and (b) the corresponding resource conflict graph.]

6.5.1 An example

For each time step, the number of operations executed on resource k cannot be larger than 1. Therefore, for each time step j, the operations that can be scheduled at time step j and assigned to resource k are collected (e.g. operations o1 ... o8 in figure 6.3(a)). The resource conflicts (see Definition 3.1) between these operations can be modeled by means of a resource conflict graph, as in figure 6.3(b):

Definition 6.4 (Resource conflict graph) The resource conflict graph for resource k at time step j is an undirected graph G(V,E) where V is the set of operations that can occupy resource k at time step j, and E a set of edges


between the nodes of V. There is an edge between two nodes of V if they have a resource conflict.

For instance, in figure 6.3(b), operations o1 and o2 have a resource conflict for resource k, and thus cannot be scheduled at the same time step j. This can be modeled by means of the following constraint:

    x_{1,j,k} + x_{2,j,k} ≤ 1        (6.5)

6.5.2 Linear-time constraint generation

The resource conflict graph is used to generate the resource constraints. A first alternative is to generate a constraint like (6.5) for each edge in the resource conflict graph. In the case of the example in figure 6.3, 18 such constraints would be generated. Another (and better) alternative can be applied if the following theorem holds:

Theorem 6.1 If the resource conflict graph G for resource k at time step j is such that any two operations of V that are not connected with an edge are conditionally exclusive, then G is a perfect graph.

Proof: It is shown that a resource conflict graph G that satisfies the condition of the theorem is a triangulated graph, by checking its definition [Golumbic 80]: a triangulated graph is a graph that does not contain chordless cycles C_n with n > 3. Furthermore, it is shown in [Golumbic 80] that triangulated graphs are perfect graphs. This proof shows that there can be no C4 subgraph (by contraposition), and generalizes for n > 4. A C4 subgraph is shown in figure 6.4(a): the nodes are connected such that they form a "cycle" of 4 nodes.

Suppose that G contains a C4 subgraph. Consider two edges of the C4 that share a node. These edges can only correspond to two exclusive conditions, as in figure 6.4(a): the two edges adjacent to the node with condition c1 correspond to e.g. conditions c1·c2 and c1·¬c2. The on-sets of these two exclusive conditions are subsets of the on-set of the condition of the shared node (figure 6.4(b)), and furthermore they are disjoint. For the same reason, the on-set of the condition c3 must contain the on-sets of the exclusive conditions as well, and must at the same time be disjoint from the on-set of c1. It can be seen from figure 6.4(b) that this is not possible. Therefore, a C4 subgraph in G is not possible. Note that this only holds if two nodes of G that are not connected with an edge correspond to mutually exclusive operations. The same reasoning holds for subgraphs C_n with n > 4. □
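The contraposition step of the proof can be checked mechanically on small graphs. The sketch below only tests the C4 case of the triangulated-graph definition (not the general C_n case, and not the constructive recognition algorithm from [Golumbic 80]); the adjacency sets are hypothetical:

```python
from itertools import combinations

def has_chordless_c4(adj):
    """Search for a chordless 4-cycle in an undirected graph given as
    a dict mapping each node to its set of neighbors."""
    for quad in combinations(adj, 4):
        a, b, c, d = quad
        # the three distinct ways to arrange four nodes in a cycle
        for w, x, y, z in ((a, b, c, d), (a, b, d, c), (a, c, b, d)):
            cycle = (x in adj[w] and y in adj[x] and
                     z in adj[y] and w in adj[z])
            chords = y in adj[w] or z in adj[x]
            if cycle and not chords:
                return True
    return False

square = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}   # a plain C4
assert has_chordless_c4(square)
square[1].add(3); square[3].add(1)                      # add a chord
assert not has_chordless_c4(square)
```

As the theorem predicts, a conflict graph built from genuinely exclusive conditions never triggers the first branch.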


[Figure 6.4: Illustration of the proof of the perfect conflict graph theorem — (a) a C4 subgraph labeled with conditions c1, c1·c2, c1·¬c2 and c3, and (b) the corresponding on-sets.]

Non-exclusive, non-conflicting operations An HDSFG does not always comply with the condition of Theorem 6.1: e.g. there can be an addition a + b and an addition a + b with the generation of an overflow flag. These operations do not have a resource conflict, and are not conditionally exclusive. To satisfy the condition of the theorem, these two operations should be merged into a multi-output operation. This limits the scheduling freedom of the two operations, since they are forced to be scheduled at the same time step, which can have a suboptimal effect on e.g. the interconnection cost. If this effect is not wanted, resource constraints like (6.5) should be used.

Chromatic numbers and maxcliques A few concepts from graph theory must first be defined in order to appreciate the consequences of the perfect graph property:

Definition 6.5 (Chromatic number) The chromatic number CHR(G) of an undirected graph G is the minimum number of colors with which the graph can be colored, such that two nodes connected by an edge get a different color.

Definition 6.6 (Maxclique) A maxclique (or maximum clique) of an undirected graph G is a subgraph of G with maximum size M(G) (i.e. number of nodes), such that each pair of nodes in the maxclique is connected with an edge.

Lemma 6.1 The following relation holds for an undirected graph G:

    M(G) ≤ CHR(G)        (6.6)


[Figure 6.5: Illustration of chromatic numbers and maxcliques (a) for a non-perfect graph, and (b) for a perfect graph.]

Lemma 6.1 is proven in [Golumbic 80]. As an illustration, consider the resource conflict graph G in figure 6.5(a): its chromatic number CHR(G) = 3 (the colors are indicated with the different shades) and M(G) = 2. If all operations in the resource conflict graph of figure 6.5(a) are scheduled at the same time step, 3 resources are needed, as indicated by CHR(G), and not 2 as indicated by M(G). In other words, the chromatic number of the resource conflict graph has to be constrained: constraints have to be generated that ensure that the chromatic number of the subgraph of G that is going to be scheduled at time step j on the same resource does not exceed 1.

The perfect graph property Generating ILP constraints on CHR(G) is difficult, because CHR(G) can in general only be computed by means of constructive algorithms. However, an interesting property of perfect graphs can be used here (proven in [Golumbic 80]):

Lemma 6.2 If G is a perfect graph, then:

    M(G) = CHR(G)        (6.7)

For example, CHR(G) = M(G) = 3 for the perfect graph in figure 6.5(b). In other words, if the resource conflict graph is perfect, it is sufficient to generate constraints for all maxcliques. Finding all maxcliques is in general an NP-complete problem [Garey 79], but can be done in linear time for perfect graphs [Golumbic 80]. This is the second alternative for resource constraint generation: generate a resource constraint for each maxclique in the resource conflict graph. As an illustration, the resource constraints for the example in figure 6.3 are:

    x_{1,j,k} + x_{2,j,k} + x_{3,j,k} + x_{4,j,k} + x_{8,j,k} ≤ 1
    x_{1,j,k} + x_{2,j,k} + x_{5,j,k} + x_{8,j,k} ≤ 1
    x_{1,j,k} + x_{6,j,k} + x_{7,j,k} + x_{8,j,k} ≤ 1
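The one-constraint-per-maxclique idea can be sketched with a generic maximal-clique enumeration (Bron–Kerbosch, not the linear-time perfect-graph algorithm from [Golumbic 80]). The edge set below is reconstructed from the three constraints of the figure 6.3 example, which is an assumption about the exact conflict graph:

```python
def maximal_cliques(adj):
    """Bron–Kerbosch enumeration of all maximal cliques (no pivoting)."""
    found = []
    def bk(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        for v in list(p):
            bk(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    bk(set(), set(adj), set())
    return found

# Conflict graph rebuilt from the three maxclique constraints above.
cliques = [{1, 2, 3, 4, 8}, {1, 2, 5, 8}, {1, 6, 7, 8}]
adj = {v: set() for v in range(1, 9)}
for q in cliques:
    for v in q:
        adj[v] |= q - {v}

found = maximal_cliques(adj)
assert sorted(map(sorted, found)) == sorted(map(sorted, cliques))
for q in found:   # one resource constraint per maxclique
    print(" + ".join(f"x{i},j,k" for i in sorted(q)) + " <= 1")
```

Enumerating maximal cliques is exponential in general, which is why the perfect-graph property (and the linear-time enumeration it enables) matters for this constraint generation scheme.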


Figure 6.6: Illustration of port constraints for EXU assignment

Evaluation   This second alternative for resource constraints is better than the first one (e.g. equation 6.5), because the constraints relate more variables. Furthermore, no extra variables are needed, whereas the approach of [Lee 89b] would have required 18 extra binary variables for the example in figure 6.3.

6.5.3 The different resource constraints

The only resources considered here are EXUs, busses and memory ports or register file ports. Constraints for registers have been proposed in [Gebotys 91], but only for acyclic graphs and without taking software pipelining into account. Here, however, the register constraints have been satisfied before scheduling (see Chapters 4 and 5).

1. EXUs   For each EXU k, for each time step j and for each maxclique S_{j,k} in the resource conflict graph, the following constraint is generated:

    \forall j, \forall k, \forall S_{j,k}: \sum_{o_i \in S_{j,k}} \sum_{p=0}^{T_i - 1} x_{i,(j-p) \bmod C,k} \le 1    (6.8)

This constraint covers both pipelined (T_i = 1) and multi-cycle (T_i > 1) operations. If the HDSFG is cyclic, C has to be known. Otherwise, the "mod C" can be dropped from the j-index expression. Also note that register port constraints (e.g. because of the hardwiring of register files to the EXU inputs) are not taken into account when the EXU assignment is done during scheduling. In figure 6.6 for instance, the operations in time steps 1 and 2 can be assigned to the same EXU (i.e. EXU1), even though writing their operands in time step 0 could require that two different signals are written to the same register file at the same time. Two write ports are needed to do that. These kinds of relations between the assignments in different time steps are explicitly modeled in [Goossens 90]. However, including these relations in the ILP model is very complex and requires a lot of binary variables. Register (file) port constraints are therefore ignored if the EXU assignment is not yet fixed.
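Constraint (6.8) can be generated mechanically once the maxcliques are known: one x-variable term per operation in the clique and per cycle during which that operation occupies the EXU. A small sketch, with hypothetical clique contents, durations and EXU names:

```python
C = 4                    # schedule length, needed for the mod C wrap-around
T = {1: 2, 2: 1, 3: 1}   # operation durations T_i (op 1 is multi-cycle)

def exu_constraint_lhs(clique_ops, j, k):
    """Terms of the left-hand side of (6.8): one x_{i,(j-p) mod C,k} term
    per operation o_i in the maxclique S_{j,k} and per busy cycle p."""
    return [(i, (j - p) % C, k) for i in clique_ops for p in range(T[i])]

# Maxclique {o1, o2} on EXU "exu1" at time step 0: multi-cycle op 1 also
# occupies the EXU one cycle earlier, which wraps to step C-1 = 3 in a
# cyclic schedule.
print(exu_constraint_lhs([1, 2], 0, "exu1"))
# [(1, 0, 'exu1'), (1, 3, 'exu1'), (2, 0, 'exu1')]
```

The sum of the listed x-variables is then constrained to be at most 1.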
Besides, the routing and the storage of signals is done by a separate tool (the data routing tool) in the Cathedral-2nd script (see Section 1.3.2), such that register access constraints do not limit the scheduling and especially the software pipelining freedom. An optimal software pipelining and EXU assignment are much more important than register access conflicts at that design level.

2. Ports and busses   If the signals in the HDSFG have been assigned to register files, port constraints can be specified for each register file port, for each time step and for each maxclique in the resource conflict graph G for that port. Note that G contains operations that read from port k, and can themselves be assigned to EXU k'. For a read port k, the constraint is:

    \forall j, \forall k, \forall S_{j,k}: \sum_{o_i \in S_{j,k}} \sum_{k'} x_{i,j,k'} \le 1    (6.9)

For a write port or a bus k, the output port latency l_i (see Definition 3.8) of the writing operation has to be taken into account:

    \forall j, \forall k, \forall S_{j,k}: \sum_{o_i \in S_{j,k}} \sum_{k'} x_{i,(j-l_i) \bmod C,k'} \le 1    (6.10)

6.5.4 Software pipelining and conditions

In the resource conflict graph (see Definition 6.4), there is no edge between each pair of operations {o_i, o_j} that is mutually exclusive. However, o_i and o_j can be part of a loop, and the conditions under which o_i and o_j execute can be re-evaluated in each iteration of the loop. In that case, o_i and o_j are only mutually exclusive if their retiming is the same:

    r_i = r_j    (6.11)

Unfortunately, the values of r_i and r_j are not known when generating the resource constraints. There are two alternative approaches:

1. Condition (6.11) could be explicitly stated for each pair of mutually exclusive operations in the resource conflict graph.

2. The previous approach can severely restrict the software pipelining freedom. An alternative approach is to have two scheduling passes. In a first pass, the HDSFG is scheduled without the extra conditions (6.11). This can give rise to resource conflicts in the schedule. To resolve these resource conflicts, a second scheduling pass is done with a fixed retiming r_i for each operation, such that the correct resource constraints can be generated.
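In the second pass, condition (6.11) determines which conditionally exclusive pairs must now receive a conflict edge: only pairs retimed into different iterations can actually collide. A tiny sketch with hypothetical operations and retiming values:

```python
# Retimings fixed by the first scheduling pass (hypothetical values).
r = {"o1": 0, "o2": 1, "o3": 0}

# Pairs that are conditionally exclusive in the source code.
exclusive_pairs = [("o1", "o2"), ("o1", "o3")]

# Condition (6.11): a pair stays mutually exclusive only if r_i == r_j;
# otherwise a resource conflict edge is generated for the second pass.
conflict_edges = [(a, b) for a, b in exclusive_pairs if r[a] != r[b]]
print(conflict_edges)   # [('o1', 'o2')]
```

Here o1 and o3 keep the same retiming and may still share a resource in the same time step, while o1 and o2 now belong to different iterations and conflict.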


Figure 6.7: The retiming model

6.6 Timing and retiming constraints

In this section, the constraints related to the timing and the retiming of the HDSFG are discussed. Constraints modeling the software pipelining are discussed in Section 6.6.1, and the actual timing constraints are discussed in Section 6.6.2.

6.6.1 Software pipelining constraints

The software pipelining during scheduling can be modeled with the retiming model of Section 3.4.2. The maximum window of a delay line is constrained by means of a backwards sequence edge (figure 6.7). The delays in the HDSFG can be redistributed as long as (3.3) is satisfied, as repeated here for reading convenience:

    \forall e_{1,2}: d_{1,2} = d_{1,2}^{init} + r_2 - r_1 \ge 0

However, if the minimum timing constraint \delta_{1,2} on the edge e_{1,2} between o_1 and o_2 is larger than the schedule length C, e_{1,2} will correspond to a delay line of at least length 1. Therefore, (3.3) is reformulated as follows:

    \forall e_{1,2}: d_{1,2} = d_{1,2}^{init} + r_2 - r_1 \ge \lfloor \delta_{1,2} / C \rfloor    (6.12)

To be able to minimize the maximum retiming (see Section 6.9.2), the following constraint is added:

    r_i \le r_{max}    (6.13)

If the HDSFG to be scheduled is a loop L, where B_L is the first value of the loop iterator, E_L its last value and S_L its step (all known at compile time), then the upper bound of r_{max} is:

    r_{max} \le \frac{E_L - B_L}{S_L}    (6.14)
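Constraint (6.12) is a simple per-edge check once a candidate retiming is proposed. The sketch below verifies it for a small, hypothetical cyclic HDSFG (edge data and retiming values are illustrative only):

```python
from math import floor

# Hypothetical HDSFG edges: (src, dst, d_init, delta_min).
edges = [("o1", "o2", 0, 1), ("o2", "o3", 1, 5), ("o3", "o1", 2, 1)]
C = 2   # candidate schedule length

def retiming_feasible(r, edges, C):
    """Check (6.12) on every edge: d_init + r2 - r1 >= floor(delta/C)."""
    return all(d_init + r[v] - r[u] >= floor(delta / C)
               for (u, v, d_init, delta) in edges)

# The edge o2 -> o3 has delta = 5 > C, so it must carry at least
# floor(5/2) = 2 delays after retiming.
print(retiming_feasible({"o1": 0, "o2": 0, "o3": 1}, edges, C))   # True
print(retiming_feasible({"o1": 0, "o2": 1, "o3": 1}, edges, C))   # False
```

In an actual ILP run the r_i are decision variables and (6.12) is a linear row per edge; this check only illustrates what the row enforces.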


Figure 6.8: The projection of a timing relation

6.6.2 Timing constraints

Each edge between two nodes o_1 and o_2 of the HDSFG imposes constraints on the relative timing of these two nodes. The detailed timing model of the HDSFG is described in Chapter 3 (Section 3.1.3). The only timing relation of interest here is:

    t_2 = t_1 + \Delta(t), \quad \Delta(t) \in [\delta_{1,2}, \delta'_{1,2}]    (6.15)

\delta_{1,2} is the minimum timing constraint, and \delta'_{1,2} is the maximum timing constraint. For most edges in typical HDSFGs of medium-throughput applications, \delta_{1,2} = 1 and \delta'_{1,2} = \infty. Therefore, in the following paragraphs, the ILP formulation of a minimum timing constraint is derived. The derivation of the maximum timing constraint formulation is completely analogous.

Projected timing relations   If the nodes o_1 and o_2 are part of the same loop, timing relation (6.15) should be expressed in terms of the time steps of the schedule for that loop, since it is the time steps p_i of the HDSFG nodes that are determined by the scheduler, and not the absolute start times t_i. In other words, all start times t_i are projected onto one iteration of the loop. This is illustrated in figure 6.8: an edge between nodes o_1 and o_2 has a weight d_{1,2}, such that if operation o_1 is executed in the first iteration of the loop, operation o_2 is executed somewhere during iteration 1 + d_{1,2} (figure 6.8(a)). Of course, during each loop iteration an o_1 and an o_2 are executed; only the output of o_1 is delayed over d_{1,2} iterations before it is consumed by o_2 (figure 6.8(b)). If t_1 is taken as a reference, such that t_1 = p_1, then:

    t_2 = p_2 + d_{1,2} \cdot C    (6.16)
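With t_1 = p_1 as the reference, relation (6.16) is just an integer division: an absolute start time splits into a time step within the loop body and an iteration distance. A numeric sketch (the schedule length C is hypothetical):

```python
# Projection of an absolute start time t onto one loop iteration (eq. 6.16):
# t = p + d * C, with p the time step and d the iteration distance.
C = 4   # hypothetical schedule length of the loop body

def project(t):
    """Split an absolute start time into (time step, iteration distance)."""
    return t % C, t // C

p2, d12 = project(10)
print(p2, d12)   # o2 starts at time step 2, delayed over 2 iterations
```

So an edge consumed 10 cycles after the reference in a 4-cycle loop corresponds to a delay weight d_{1,2} = 2 and a projected time step p_2 = 2.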


Figure 6.9: Illustration of scheduling-retiming interaction

Hence, the projected timing relation between p_1 and p_2 is:

    p_2 = p_1 + \Delta(t) - (d_{1,2}^{init} + r_2 - r_1) \cdot C    (6.17)

where d_{1,2} is written in function of the retiming variables r_1 and r_2. For a minimum timing constraint, (6.17) becomes:

    p_2 \ge p_1 + \delta_{1,2} - (d_{1,2}^{init} + r_2 - r_1) \cdot C    (6.18)

The projection idea was first published in [Goossens 90], and is extended here for software pipelining. Most important is the fact that edges between different loop iterations are modeled. They are crucial for correctly modeling software pipelining, and a lot of authors in the high-level synthesis field do not take them into account (see Section 6.1.2).

The scheduling-retiming interaction   There is a relation between scheduling and retiming. An illustration of this relation is given in figure 6.9. The minimal d_{1,2} in figure 6.9(a) is \lfloor 5/2 \rfloor = 2. However, this is only true if o_1 is scheduled at time step 0: scheduling o_1 at time step 1 forces the retiming to increase the weight of edge e_{1,2} from 2 to 3 (figure 6.9(b)). This correction on the lower bound on d_{1,2} in (6.12) is formulated as follows:

    \forall e_{1,2}: d_{1,2}^{init} + r_2 - r_1 \ge \lfloor \frac{\delta_{1,2}}{C} \rfloor + \sum_{j_1 > \lfloor \delta_{1,2}/C \rfloor \cdot C - \delta_{1,2} + C - 1} x_{1,j_1,k_1}    (6.19)

The derivation of (6.19) is given in Appendix E. In the example of figure 6.9, d_{1,2} \ge \lfloor 5/2 \rfloor + 1 if p_1 > \lfloor 5/2 \rfloor \cdot 2 - 5 + 2 - 1 = 0.

Timing constraint formulations   The timing constraint formulation in [Lee 89b] can be extended to model software pipelining:


    \forall e_{1,2}: \sum_{k_2} \sum_{j_2=\sigma_2}^{\omega_2} j_2 \cdot x_{2,j_2,k_2} - \sum_{k_1} \sum_{j_1=\sigma_1}^{\omega_1} j_1 \cdot x_{1,j_1,k_1} + C \cdot r_2 - C \cdot r_1 \ge \delta_{1,2} - d_{1,2}^{init} \cdot C    (6.20)

However, experience shows that there exist better formulations than (6.20). For acyclic HDSFGs, such a formulation has been published in [Gebotys 91]. That particular formulation is extended here for cyclic HDSFGs and automatic software pipelining: the detailed derivation of the extended timing constraint can be found in Appendix F. Only the constraints themselves are reproduced here:

    \forall e_{1,2}, \forall j \in [\sigma_1, \omega_1]:
    \sum_{j \le j_1 \le \omega_1} \sum_{k_1} x_{1,j_1,k_1} + \sum_{\sigma_2 \le j_2 < j + \delta_{1,2} - F_{1,2,j} \cdot C} \sum_{k_2} x_{2,j_2,k_2} \le 1 + \alpha_{1,2,j}    (6.21)

which is only valid if:

    F_{1,2,j} = \lceil (j + \delta_{1,2} - (C-1)) / C \rceil = \lfloor (j - 1 + \delta_{1,2}) / C \rfloor    (6.22)

and where

    \alpha_{1,2,j} = \begin{cases} d_{1,2} & \text{if } F_{1,2,j} = 0 \\ F_{1,2,j} - d_{1,2} & \text{if } F_{1,2,j} = W_{1,2}^{max} \\ 1 & \text{if } F_{1,2,j} > W_{1,2}^{max} \end{cases}    (6.23)

else, if 0 < F_{1,2,j} < W_{1,2}^{max}, \alpha_{1,2,j} is such that:

    M \cdot (y_{1,2,j} - 1) + 1 \le d_{1,2} - F_{1,2,j} \le M \cdot y_{1,2,j}
    M \cdot (z_{1,2,j} - 1) + 1 \le F_{1,2,j} - d_{1,2} \le M \cdot z_{1,2,j}
    \alpha_{1,2,j} = y_{1,2,j} + z_{1,2,j}    (6.24)

The F_{1,2,j} value in (6.22) is computed from j, which is given for each constraint. However, the \alpha_{1,2,j}, y_{1,2,j} and z_{1,2,j} variables are extra variables (\alpha_{1,2,j} can be substituted by y_{1,2,j} + z_{1,2,j}). They are required for the modeling of data precedences where the signal window W_{1,2}^{max} can be larger than F_{1,2,j} and F_{1,2,j} > 0. There are at most 4 extra y_{1,2,j} and z_{1,2,j} variables for each edge e_{1,2}. As explained in Appendix F, M should be large enough and is fixed by the user.

Experiments have shown that using formulation (6.21)-(6.24) yields much better CPU times for solving the ILP problem than using formulation (6.20). The reason for this is that formulation (6.21)-(6.24) mostly adds 1 and -1


entries in the constraint matrix A, such that the constraints have much better chances of coinciding with integer facets of the ILP polytope [Nehmhauser 88] [Gebotys 91], which results in faster convergence of Simplex-based ILP algorithms.

6.7 The interconnection model

The interconnection network that connects the EXUs is modeled as a point-to-point network: there can be a connection between each output and each input of the EXUs, and the presence of register files is ignored. This model can be justified by looking at the Cathedral-2nd low-level mapping script in Section 1.3.2. The only scheduling phase that can optimize the interconnection is the one at the EXU assignment step in the script. At that moment of the design script, the data routing and the storage of the signals is unknown, and thus cannot be taken into account. Two alternatives for modeling the interconnection are discussed in the following paragraphs:

Alternative 1   The degree of freedom that influences the interconnection cost is the assignment of operations to EXUs. This interconnection cost can be modeled by means of the c_{k_1,k_2} variables (see Section 6.4.1 for their definition):

    \forall k_1, \forall k_2, k_1 \ne k_2, \forall e_{1,2}^{data}: \sum_{j_1=\sigma_1}^{\omega_1} x_{1,j_1,k_1} + \sum_{j_2=\sigma_2}^{\omega_2} x_{2,j_2,k_2} - c_{k_1,k_2} \le 1    (6.25)

In words, if o_1 is assigned to k_1 and o_2 to k_2, then a connection between k_1 and k_2 is needed: c_{k_1,k_2} = 1. It should also be mentioned that constraint (6.25) can be used to model a fixed interconnection (important for fixed programmable architectures and user interaction): the appropriate c_{k_1,k_2} variable just needs to be set to 0 or 1. Furthermore, some c_{k_1,k_2} variables are fixed when the innermost loops have been scheduled, according to the scheduling strategy outlined in Section 2.2. Deepest nested loops indeed have priority because of their impact on the total clock cycle count.
However, since no backtracking on the EXU assignment in deeper nested loops is possible, the EXU assignment might be sub-optimal.

Alternative 2   An alternative way to model the interconnection network is to model the interconnection cost by counting the number of data transfers between different EXUs. The more of such data transfers, the smaller the chance that global busses can be merged afterwards (during bus merging, see Section 1.3.1) without extra clock cycle cost. Therefore, the number of such data transfers should


be minimized. In fact, this approach is also followed in the EXU assignment techniques described in [Lanneer 93b]. A variable t_{1,2} is introduced for each data transfer that could be a transfer between different EXUs:

    \forall e_{1,2}^{data}, \forall k_1, \forall k_2, k_1 \ne k_2: \sum_{j_1=\sigma_1}^{\omega_1} x_{1,j_1,k_1} + \sum_{j_2=\sigma_2}^{\omega_2} x_{2,j_2,k_2} - t_{1,2} \le 1    (6.26)

Note that a fixed interconnection network can also be imposed by means of these constraints.

Discussion   Modeling the interconnection cost with constraint (6.26) costs more ILP variables than using the model of constraint (6.25). In any case, the two approaches merge EXU assignment with scheduling, two highly interdependent tasks, at the cost of extra CPU time. The optimization of the EXU assignment is done through the minimization of the interconnection cost (see further, Sections 6.9.1 and 6.9.2).

6.8 Theoretical complexity analysis

The complexity of an ILP problem is directly related to the number of variables [Nehmhauser 88] [Papadimitriou 82]. The variables of the above formulation are the x_{i,j,k}, r_i, W_{1,2}^{max}, y_{1,2,j}, z_{1,2,j}, c_{k_1,k_2} (or t_{1,2}) and r_{max} variables. Their number V can be expressed as:

    V = abNCK + N + K(K-1) + 4E + 1 \simeq O(NCK)    (6.27)

where N is the number of nodes in the HDSFG, C is the number of time steps, K is the number of EXU resources, E is the number of edges that need extra y_{1,2,j} and z_{1,2,j} variables for their timing constraints, a is the percentage of the N nodes that has assignment freedom, and b is the percentage of the K EXUs that these nodes with assignment freedom can be assigned to. For a typical medium-throughput design (a channel decoder for mobile radio), with a = 0.3, b = 0.2, K = 12, E = 5, the theoretical number of ILP variables is plotted against N and C in figure 6.10.

Practice has shown that ILP scheduling problems with V \le 600 variables can be solved in reasonable times (i.e. tens of minutes, using Lindo [Schrage 89]).
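Formula (6.27) with the channel decoder parameters quoted above shows how quickly the 600-variable practical limit is approached:

```python
# Number of ILP variables (eq. 6.27), with the parameters quoted in the
# text for the medium-throughput channel decoder: a=0.3, b=0.2, K=12, E=5.
def num_ilp_vars(N, C, K=12, E=5, a=0.3, b=0.2):
    return a * b * N * C * K + N + K * (K - 1) + 4 * E + 1

# A 10-operation loop over 10 time steps stays comfortably solvable;
# 25 operations over 20 time steps is already close to the limit.
print(round(num_ilp_vars(10, 10)), round(num_ilp_vars(25, 20)))  # 235 538
```

This matches the observation below that small inner loops are the typical candidates for optimal ILP scheduling.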
The plot in figure 6.10 shows that small inner loops are typically good candidates for optimal ILP scheduling. Note that the plot in figure 6.10 is only valid for the "normal" scheduling problem, not for macronode scheduling. For macronode scheduling, the allowed number of nodes and the number of time steps can be much larger than in the plot. However, the result is no longer guaranteed to be optimal, as explained in Section 6.3.


Figure 6.10: The theoretical number of ILP variables vs. the number of nodes and the number of time steps for typical medium-throughput applications.

6.9 Cost functions

In the scope of this thesis, two scheduling problems can be solved with ILP scheduling: cluster scheduling and macronode scheduling. The cost function for cluster scheduling is discussed in Section 6.9.1 and the one for macronode scheduling in Section 6.9.2.

6.9.1 The cost function for cluster scheduling

The basic philosophy of clustering is to reduce the mobilities of the operations inside a cluster, and thus also to reduce the lifetimes of the signals inside a cluster (see Chapter 5). This can be achieved in two ways:

1. By minimizing the schedule length of a cluster.

2. By minimizing the sum of the lifetimes in a cluster.

Since the HDSFG of a cluster can be considered to be acyclic (this follows from the definition of a cluster, see Definition 5.1), the modeling of software pipelining is not necessary. Therefore, all r_i and all d_{1,2}^{init} can be set to 0 for cluster scheduling.


The cost function for ILP cluster scheduling is formulated as follows:

    COST = \gamma_1 \sum_{j_C=\sigma_C}^{\omega_C} j_C \cdot x_{C,j_C}
         + \gamma_2 \sum_{e_{1,2}^{data}} \left( \sum_{j_2=\sigma_2}^{\omega_2} \sum_{k_2} j_2 \cdot x_{2,j_2,k_2} - \sum_{j_1=\sigma_1}^{\omega_1} \sum_{k_1} j_1 \cdot x_{1,j_1,k_1} \right)
         + \gamma_3 \sum_{k_1 \ne k_2} A_{k_1,k_2} \cdot c_{k_1,k_2}
         + \gamma_4 \sum_{e_{1,2}^{data}} t_{1,2}    (6.28)

where the first term models the schedule length, the second term the sum of the signal lifetimes, the third term the interconnection cost and the fourth term the number of bus transfers between different functional units. Furthermore, A_{k_1,k_2} is the cost of a connection between EXUs k_1 and k_2. Note that the sum of the lifetimes is represented as the sum over all data edges in the cluster. Fanout edges (see Definition 3.4) are not taken into account separately: the single lifetime they represent is then accounted for more than once. However, minimizing the sum of twice a lifetime also minimizes the lifetime itself.

Experiments with the cost function of cluster scheduling have shown that it is not always clear whether to minimize the number of time steps or the sum of the lifetimes. However, the best average results are obtained by minimizing a weighted sum of both, where the schedule length has a weight that is a few orders of magnitude larger than the weight of the sum of the lifetimes. Furthermore, note that either \gamma_3 or \gamma_4 should be zero (see Section 6.7).

6.9.2 The cost function of macronode scheduling

The goal of macronode scheduling can be a combination of any of the following optimizations:

1. Minimization of the schedule length C.

2. Minimization of the interconnect if EXU assignment is not yet fixed.

3. Minimization of the delay line windows W_{1,2} by minimizing the maximum windows W_{1,2}^{max}.

4. Minimization of the code for the filling and flushing of the software pipeline (see [Goossens 89b]) by minimizing r_{max}.


For the macronode scheduling of innermost loops, the HDSFG is cyclic and software pipelining is wanted. Therefore, C has to be fixed in order to be able to generate the constraints. An iterative reduction of C has been implemented: starting from the value of C as obtained by list scheduling, C is decremented and an ILP scheduling problem is solved, until no solution can be found. The last feasible solution is then the optimal solution (w.r.t. the given scheduled clusters, in the case of macronode scheduling). Other alternatives than iterative reduction could be envisaged as well: binary search, or incrementing C from its theoretical lower bound (as in [Groot 92]).

The cost function for macronode scheduling can then be formulated as follows:

    COST = \gamma_1 \sum_{k_1 \ne k_2} A_{k_1,k_2} \cdot c_{k_1,k_2}
         + \gamma_2 \sum_{e_{1,2}^{data}} t_{1,2}
         + \gamma_3 \sum_{e_{1,2}^{data}} \beta_{1,2} \cdot W_{1,2}^{max}
         + \gamma_4 \cdot r_{max}    (6.29)

where the first term models the interconnection cost, the second term the number of bus transfers between different functional units, the third term the sizes of the delay lines, and the fourth term the maximum retiming. Furthermore, A_{k_1,k_2} is the cost of a connection between EXUs k_1 and k_2 (see also equation 6.28), and \beta_{1,2} is a user-defined weight for the minimization of delay line W_{1,2}. Again, the user has to specify the coefficients \gamma_1 to \gamma_4 (where either \gamma_1 or \gamma_2 is zero).

6.10 Experiments

The ILP formulation as described above has been implemented in a tool called Ilps. The tool generates the constraints, calls an ILP solver and constructs the schedule. The ILP solver that was used here is Lindo [Schrage 89]. It performs a branch-and-bound search for the optimal integer solution, using a revised Simplex algorithm to solve the linear program at each branch. A brief overview of the literature on ILP techniques is given in Section 6.10.1. The results of the experiments with Ilps (without software pipelining) and an interesting pruning heuristic are presented in Section 6.10.2.
Some brief comments on the exploitation of set constraints are given in Section 6.10.3, and some experiments with software pipelining are described in Section 6.10.4. Finally, a few comments on the integration of the ILP scheduler with other scheduling techniques are given in Section 6.10.5.


6.10.1 ILP solution techniques in the literature

Linear and integer programming are extensively used in operations research [Nehmhauser 88] [Schrijver 86] [Papadimitriou 82], and a complete overview of this domain is absolutely out of the scope of this thesis. Only a few notes are collected in this section.

The ILP problem as formulated above can be solved by means of the Gomory cutting plane technique [Schrijver 86] [Papadimitriou 82]. However, this technique quickly runs into CPU time problems when the number of variables exceeds 100. Another technique that was tried out is the implicit enumeration technique [Geoffrion 67]. Implicit enumeration is a clever way of enumerating all feasible solutions while pruning the search space. Again, experiments with more than 100 variables took impractical CPU times. In the end, a branch-and-bound method was applied [Papadimitriou 82] [Schrage 89], and this method proved to be successful (for results, see Section 6.10.2).

A lot of research is still going on in the domain of operations research. An excellent overview paper is [Glover 86], where recent evolutions in artificial intelligence are linked with operations research. In [Crowder 83], the combination of problem preprocessing, cutting planes and branch-and-bound techniques is shown to permit the optimization of very large sparse zero-one problems. In [Hafer 91], set constraints are exploited during the branch-and-bound search (see Section 6.10.3).

Recent experiments show that formulating and solving optimization problems with an ILP approach should certainly not be discarded as impractical, especially with ever faster machines and ever larger available memories.

6.10.2 Experiments with ILP scheduling

In this section, some statistics are given on experiments with ILP scheduling without software pipelining and with a fixed EXU assignment.
In figure 6.11, the number of ILP constraints is plotted versus the number of ILP variables. It is clear that a typical scheduling problem has a lot more constraints than variables. In figure 6.12, the CPU time (in seconds) is plotted against the number of variables (the black squares). Some ILP problems are already hard to solve for 250 variables. The same exponential behavior can be seen in the plot of figure 6.13, where the CPU time (in seconds) is plotted against the number of operations in the HDSFG.

However, the CPU times can be reduced substantially by stopping the branch-and-bound search at the first integer solution that is found. This can be achieved by setting the Lindo parameter called "IPTOL" to 1.0. The effect on the optimality is shown in figure 6.14: a number of experiments was done with both IPTOL set to 0.0 (i.e. full optimization) and to 1.0. The distribution of the deviation from the optimum (in %) for the solution with


Figure 6.11: Nr. of ILP constraints vs. nr. of ILP variables

Figure 6.12: CPU time (in seconds) versus the number of ILP variables (IPTOL = 0.0 vs. IPTOL = 1.0)

Figure 6.13: CPU time (in seconds) vs. number of operations (IPTOL = 0.0 vs. IPTOL = 1.0)

Figure 6.14: Distribution of deviation from optimum for IPTOL=1.0


IPTOL=1.0 is shown. The first solution is almost always the optimal solution, and the rest of the CPU time is spent in proving it. The radical CPU time improvement of stopping the search at the first solution is shown by the white dots in figures 6.12 and 6.13.

6.10.3 Exploitation of set constraints

Almost all variables in the ILP formulation are subject to a set constraint of the kind \sum_{i=1}^{n} x_{i,j,k} = 1. These set constraints can be exploited during the branch-and-bound search for solving the ILP [Hafer 91]: instead of examining 2^n combinations in the worst case, only n combinations need to be examined. For example, take the set constraint x_1 + x_2 + x_3 = 1: if x_1 = 1, then automatically x_2 = x_3 = 0. The same reasoning holds for set constraints with inequality relations. In the experiments above, set constraints were not taken into account because Lindo (anno 1993) does not support them. However, experiments with other commercial packages that do support set constraints have revealed factors of 3 in CPU time improvement for some scheduling benchmarks.

6.10.4 Experiments with ILP software pipelining

Figure 6.15: Block diagram of RGB filter

A first example illustrates the functionality of the software pipelining in Ilps on an RGB filter application for video (figure 6.15). There are two delay lines in the design (the buffers for input1 and input2). The functional unit (FU)


assignment of the operations has been fixed before scheduling, and is indicated by the dashed boxes (3 different functional units are used). Each operation takes 1 clock cycle. The filter can be software pipelined such that two new outputs (output1 and output2) are calculated every 4 clock cycles (whereas the critical path takes 6 clock cycles). If the size of the delay lines is not considered, operations o1, o5, o9 and o10 could for instance be moved to the next iteration. However, the effect would be that both delay lines require 1 more register. If on the other hand only operations o9 and o10 are moved to the next iteration, both delay lines keep their original length of 16 after software pipelining. The software pipelined schedule was obtained in 1.20 sec on a DEC3100.

The practical applicability of Ilps is also illustrated in table 6.1, where Cnsp is the schedule length without and Csp with software pipelining, CPU the CPU time of scheduling with software pipelining in seconds on a DEC3100, V the number of variables in the integer program and rmax the maximum retiming.

    design    Cnsp   Csp   CPU (s)     V   rmax
    k0scp        7     5       24     91      1
    k2scp        7     6        9     85      1
    k5scp        5     2      0.2     31      2
    k6scp        5     4      0.8     41      1
    syndr1      12     9      139    221      1
    syndr2       9     7       45    153      1
    iscp         5     4        4     66      1
    pscp         5     5        8    115      0
    p2scp        3     3      0.4     29      0
    i2scp        4     4        5    106      0

Table 6.1: Results for some loops of three RSP applications

The first section of the table gives the results for some loops of the LPC application, the second section for two versions of a syndrome generation algorithm and the third section for some loops of a channel decoder application for GSM mobile radio (CHED91). The maximum windows for all signals are constrained to 1.

Csp is the minimal (optimal) schedule length that could be obtained under the given resource constraints. The total number of clock cycles (10,947) of the LPC application could be reduced by 2,810 cycles.
The channel decoder design, however, cannot be improved significantly by software pipelining, due to the resource constraints and the HDSFG topology (the minimum schedule length is lower bounded by the length of some critical cycles in the graph).


    example    LS        LS + ILPS   gain (%)
    CHED91     60,983    58,281      4
               58,310    58,206      0.1
    CHED93     132,577   120,362     9
    LPC        10,962    10,791      2
               11,428    11,244      2
    SPEECH     42,510    37,044      13
               37,032    35,109      5
               37,010    34,318      7

Table 6.2: Result of combined ILP scheduling (ILPS) and list scheduling (LS)

6.10.5 Experiments with an integrated scheduling environment

It is clear from figures 6.12 and 6.13 that not all scheduling problems can be solved with ILP scheduling. Therefore, the ILP scheduler is integrated with a list scheduler into a scheduling environment: if the number of ILP variables exceeds a user-defined threshold (e.g. 600), the list scheduler is automatically selected. This holds both for cluster scheduling and for macronode scheduling.

As for the optimality of the scheduling: the scheduling environment guarantees an optimal solution if there are no clusters and there are fewer than e.g. 600 ILP variables. Near-optimality is achieved if both the cluster scheduling problem and the macronode scheduling problem have fewer than e.g. 600 ILP variables, such that ILP scheduling can be used for both scheduling tasks.

In table 6.2, the total number of clock cycles for pure list scheduling (LS) and for combined ILP scheduling and list scheduling (LS + ILPS) are compared. The threshold on the number of ILP variables for switching to list scheduling was 600 variables. The results are from different runs of the scheduling environment with different register constraints (the CHED93 design is a variant of the CHED91 design). Note that no software pipelining was performed for the examples in table 6.2, because the integration of software pipelining is not fully implemented yet. As the table shows, ILP scheduling can give up to 13% better results when used for straight-line code optimization (compared with list scheduling). However, ILP scheduling is still best used for the optimal software pipelining of time-critical loops.


6.11 Summary

Several scheduling techniques can be used for the separate scheduling tasks of the hierarchical scheduling strategy. An integer linear programming (ILP) scheduling technique is proposed in this chapter. Although based on previously published models, the new contributions of the ILP model in this chapter are:

- Integration of optimal software pipelining with optimal scheduling in an integer linear programming (ILP) model.

- A new timing model that supports cyclic signal flow graphs and delay line optimization, by combining scheduling and retiming.

- An efficient timing constraint formulation, based on a formulation in the literature [Gebotys 91], but extended for software pipelining.

- A general model for generating conditional resource constraints.

Though ILP scheduling has an exponential run time complexity, experiments show that it can be used in the context of hierarchical scheduling and beyond.


Chapter 7

Experiments

The efficiency of the proposed techniques for register optimization is demonstrated by means of some relevant examples in this chapter. These experiments demonstrate the global functionality of the register optimization. The main issues here are the specification and the satisfaction of the register constraints, the global trade-off of the number of registers versus the total number of clock cycles, and the total CPU times. For detailed experiments with the basic techniques (i.e. cut reduction, clustering and integer programming scheduling), the reader is referred to Chapters 4, 5 and 6 respectively.

The register optimization techniques that are proposed in this thesis are implemented in a prototype tool, called Tron. The integer programming scheduler Ilps (see Chapter 6) has also been integrated in Tron. All experiments in this chapter were performed on a DEC5000/125 machine.

The use of constraints on the available number of registers is illustrated in Section 7.1. Some more global results of the trade-off between register cost and throughput are discussed in Section 7.2.

7.1 The use of register constraints

The effect of register constraints is first demonstrated on the syndrome generator example SYNDR (see table 1.2). The schedule obtained by Tron for a constrained size of 4 for register file "reg1" (see figure 3.9) is shown in figure 7.1. The bold solid edges in figure 7.1 correspond to signals that are stored in register file "reg1". The signals snr01, snr02 and accu14_v1 are stored in "reg1" all of the time. Therefore, the minimum register requirement for register file "reg1" is 4: three registers for the above three signals and one register for any other signal that has to be stored in "reg1". The constraint of 4 is satisfied by means of clustering and cut reduction. At the level of the j-loop, 4 clusters are identified. Starting from that cluster covering, cut reduction added two sequence edges to the HDSFG (the bold dashed edges in figure 7.1). No

[Figure 7.1: Schedule of SYNDR for a maximum size of 4 for register file "reg1" (cycles 0-17, covering the i-loop, the j-loop and the branches on ntmp22; signals stored in "reg1", among them snr01, snr02 and accu14_v1, are drawn as bold solid edges)]

preprocessing was required at the i-loop level. The scheduling preprocessing and the actual scheduling took 39.1 CPU seconds, and the design takes 9,681 clock cycles.

If the "reg1" register file of the SYNDR design is constrained to a size of 3, at least one signal assigned to "reg1" has to be stored somewhere else. In the schedule of figure 7.2, the signal snr02 is spilled to RAM. This requires an extra read and write operation (the shaded R and W operations in figure 7.2). The spilling of a signal to RAM is decided automatically by the cut reduction technique. The names of the signals are not shown in figure 7.2 for the sake of clarity. Five clusters were found for these register constraints, and they are sequentialized by means of the bold dashed edges. Because more sequentialization is needed to satisfy the tighter constraint of 3 instead of 4, the schedule of figure 7.2 corresponds to a total number of 13,665 clock cycles, whereas the schedule of figure 7.1 takes 9,681 clock cycles. The scheduling preprocessing and the actual scheduling to obtain the schedule in figure 7.2 took 307.2 CPU seconds. Note that again no preprocessing is required at the i-loop level.

The use of constraints on the available number of registers is also demonstrated for the LPC example (see table 1.2) in figure 7.3. The diagrams in figure 7.3 show the different register files of the data-path along the X-axis, and the size of these register files along the Y-axis. The number of clock cycles for each solution is shown in the upper right corner of each diagram. The profile in the lightest shade (figure 7.3(a)) represents the register file sizes after scheduling without taking register cost or register constraints into account. The list scheduler Smart [Goossens 90] [Rompaey 92] was used to produce these results. On the other hand, the darkest shaded profiles represent the register cost after preprocessing and scheduling (figure 7.3(b), (c) and (d)). As explained in the previous chapters, the register optimization is performed during the scheduling preprocessing, such that constraints on the sizes of the register files are satisfied. These constraints are represented by the bold arrows. The results of a number of different runs of the Tron tool, with different register constraints, are shown. The following conclusions can be made:

1. By applying "weak" register constraints (as in figure 7.3(b)), better solutions (with about the same number of clock cycles but fewer registers) can be obtained w.r.t. a design without register constraints.

2. "Severe" register constraints can cost a lot of clock cycles: the solution in figure 7.3(d) has only 1 register fewer than the solution in figure 7.3(c), at the cost of more than 2,000 clock cycles.
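The register requirement reasoned about above can be checked mechanically: for a fixed schedule, the number of registers needed in a file equals the maximum number of signals that are simultaneously alive (the maximum cut over the signal lifetimes). A minimal sketch of such a check, with illustrative lifetimes that are not taken from the actual SYNDR schedule:

```python
# Sketch: register requirement of a schedule = peak number of signals
# simultaneously alive (the "maximum cut" over the lifetimes).
# The lifetimes below are illustrative only, not the real SYNDR ones.

def max_register_cut(lifetimes):
    """lifetimes: dict signal -> (birth_cycle, death_cycle), inclusive.
    Returns the peak number of simultaneously live signals."""
    events = []
    for birth, death in lifetimes.values():
        events.append((birth, +1))      # signal becomes live
        events.append((death + 1, -1))  # signal frees its register
    live = peak = 0
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak

schedule = {
    "snr01":     (0, 17),  # alive during the whole loop body
    "snr02":     (0, 17),
    "accu14_v1": (0, 17),
    "tmp17_v1":  (2, 5),
    "tmp18_v1":  (4, 6),   # overlaps tmp17_v1 in cycles 4-5
}
print(max_register_cut(schedule))  # → 5
```

Serializing tmp18_v1 after tmp17_v1 with a sequence edge, as cut reduction does, would remove the overlap and bring the peak back down to 4.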

[Figure 7.2: Schedule of SYNDR for a maximum size of 3 for register file "reg1" (cycles 0-21; the spilled signal snr02 causes the extra shaded R and W operations)]

[Figure 7.3: Several sets of register constraints for the LPC example. Each panel shows the sizes (1 to 5) of the register files alu/rega, alu/regb, mult/rega, mult/regb, ram1/dareg, ram1/adreg, acu1/rega, acu2/rega and acu3/rega; the resulting cycle counts are (a) 10,791, (b) 10,786, (c) 11,244 and (d) 13,332 cycles.]

[Figure 7.4: Different schedules for the LPC example (total nr. of clock cycles vs. nr. of registers, for LS and LS + ILPS)]

7.2 Trading off register cost vs. throughput

By changing the register constraints on the design, the design space can be explored. In this section, a set of experiments is described that analyze the trade-off between the total number of clock cycles and the total number of registers in a design. The different solutions that were obtained during these experiments are shown for two examples, LPC and SPEECH (see table 1.2), in figures 7.4 and 7.5. Note that not the total number of registers, but the sizes of the individual register files are subject to constraints. The schedules for the LPC example were all obtained in a few tens of CPU minutes, and the SPEECH schedules were obtained in about one hour of CPU time. (Note that these CPU times are for a prototype tool, where ease of coding rather than execution time was considered during the implementation.) The figures also show the solutions obtained by using integer programming scheduling (Ilps) in an integrated scheduling environment (see Section 6.10.5). As expected, this indeed leads to better global results. The following conclusions can be made from the experiments in figures 7.4 and 7.5:

1. By changing the size constraints on the register files, a trade-off can be made between the total number of registers (and its corresponding controller cost) and the total number of clock cycles. The largest register

[Figure 7.5: Different schedules for the SPEECH example (total nr. of clock cycles vs. nr. of registers, for LS and LS + ILPS)]

cost reductions achieved by only changing the schedule are shown for all demonstrator examples in Table 1.1.

2. Integer programming scheduling as part of an integrated scheduling environment leads to better global results.

Note also that no software pipelining and no spilling were performed to obtain the results in figures 7.4 and 7.5. Integer programming scheduling performs even better if it is used for the software pipelining of time-critical parts of the design.

7.3 Summary

The scheduling of a design under constraints on the sizes of the different register files, by using the techniques that are proposed in this thesis, is proven to be feasible. Trade-offs between the total number of registers and the total number of clock cycles can be explored by changing the register constraints. Register cost reductions of up to 30 % can be achieved by only changing the schedule (and thus not using e.g. spilling). Furthermore, integer programming scheduling can be used for scheduling parts of the design. This leads to solutions with a smaller number of cycles.
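The exploration loop behind such experiments can be sketched as a sweep over register bounds, keeping only the non-dominated (registers, cycles) points. In the sketch below, `schedule_under_bound` is a hypothetical stand-in for one Tron preprocessing-plus-scheduling run, and all its numbers are made up for illustration:

```python
# Sketch of design-space exploration: sweep the register bound, schedule
# once per bound, and keep the Pareto frontier of (registers, cycles).
# schedule_under_bound is a made-up stand-in for a real scheduler run.

def schedule_under_bound(max_regs):
    """Pretend scheduler: tighter register bounds cost more cycles."""
    cycles = {10: 13500, 12: 12100, 14: 11000, 16: 10400, 18: 10400}
    return cycles[max_regs]

def pareto_frontier(points):
    """Keep the (regs, cycles) points no other point beats on both axes."""
    frontier = []
    for regs, cycles in sorted(points):
        # with regs increasing, a point survives only if it improves cycles
        if not frontier or cycles < frontier[-1][1]:
            frontier.append((regs, cycles))
    return frontier

points = [(r, schedule_under_bound(r)) for r in (10, 12, 14, 16, 18)]
print(pareto_frontier(points))
# → [(10, 13500), (12, 12100), (14, 11000), (16, 10400)]
```

The 18-register run is dominated by the 16-register one (same cycle count, more registers) and is dropped from the frontier.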

Chapter 8

Conclusions

The problem of scheduling the signal flow graph of a design, such that its execution takes as few clock cycles as possible and such that it does not take more than a fixed number of available registers, is solved in this thesis. This problem is a relevant one, and it is mainly encountered in two application fields: in software compilation, where the target architecture is fixed, and in application-specific high-level synthesis, where the register cost and the controller cost can be reduced by imposing constraints on the size of the register files.

8.1 Motivation

Much has already been published on register optimization and its relation with scheduling, both in the field of high-level synthesis and in the field of software compilation. In high-level synthesis, the register cost is typically optimized during scheduling [Paulin 89] [Verhaegh 91] [Potkonjak 92] [Hwang 91] [Rompaey 92]. However, most techniques do not take constraints on the available number of registers into account, and if they do, only local register optimization decisions are made [Rimey 89] [Hartmann 92]. On the other hand, in software compilation, register optimization is typically done for a fixed code ordering [Aho 88] [Chaitin 82] [Hendren 92] [Mueller 93], such that the relation between the code order and the register cost is not exploited. If more registers are required than available, some signals are moved (spilled) to RAM.

Therefore, a new methodology for register optimization is proposed in this thesis: register optimization during scheduling preprocessing such that, during the actual scheduling, no register constraints have to be taken into account. This allows for a global view for optimization decisions, and an interaction with scheduling by adding extra constraints for the scheduler. The user can explore the scheduling solution space by changing the register constraints. Furthermore, any scheduling technique can be used in the proposed methodology.
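The preprocessing idea can be illustrated with a toy: whenever more signals can be live together than the register bound allows, serialize two overlapping producers with a sequence edge so that their lifetimes no longer overlap. The greedy loop below only conveys this intuition; it is NOT the branch-and-bound cut reduction algorithm of Chapter 4, and its lifetime windows are invented for the example:

```python
# Toy: add sequence edges until at most `bound` lifetimes overlap.
# Greedy illustration of the preprocessing idea, not the thesis algorithm.

def max_overlap(windows):
    """windows: signal -> (birth, death) cycles, inclusive.
    Peak number of signals whose lifetimes overlap."""
    events = sorted(e for b, d in windows.values()
                    for e in ((b, +1), (d + 1, -1)))
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak

def reduce_cut(windows, bound):
    """Add sequence edges (a, b) until at most `bound` signals overlap.
    Assumes bound >= 1; mutates the windows it serializes."""
    edges = []
    names = sorted(windows)
    while max_overlap(windows) > bound:
        for a in names:            # find one overlapping pair ...
            for b in names:
                ba, da = windows[a]
                bb, db = windows[b]
                if a != b and ba <= db and bb <= da:
                    # ... and push b's whole lifetime after a's death
                    windows[b] = (da + 1, da + 1 + (db - bb))
                    edges.append((a, b))
                    break
            else:
                continue
            break
    return edges

print(reduce_cut({"s1": (0, 4), "s2": (1, 5), "s3": (2, 6)}, 2))
# → [('s1', 's2')]
```

After the transformation the scheduler only has to respect the ordinary precedence and sequence edges; the register bound holds by construction.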

The methodology does not guarantee the absolute minimum number of clock cycles, because scheduling and register optimization are not done simultaneously. However, practice has shown that optimality can seldom be achieved in practical CPU times. On the other hand, the methodology in this thesis gives solutions with a high quality in reasonable CPU times.

8.2 Contributions

The new contributions of this thesis are the following:

1. A new methodology for register optimization has been proposed [Depuydt 91] [Lanneer 91] [Goossens 93]. The constraints on the available number of registers are satisfied during a scheduling preprocessing step, such that the scheduler is relieved from these register constraints. This methodology forms a link between high-level synthesis and code generation for architectures with a fixed number of available registers.

2. Efficient techniques for scheduling preprocessing have been described.

(a) Cut reduction is a technique that reduces the maximum register cost of a signal flow graph by transforming it: adding sequence edges, spilling signals to RAM and/or changing the maximum lengths of delay lines. A branch-and-bound algorithm steers the transformation [Depuydt 93a].

(b) The complexity of cut reduction is reduced by first performing a clustering on the signal flow graph [Depuydt 90] [Depuydt 91]. Clustering reduces the scheduling freedom of the operations and the overlapping possibilities of the signal lifetimes. Several tuned metrics have been developed for steering the clustering process.

3. A new technique for estimating the maximum register cost of a design, based on retiming, is proposed [Depuydt 92]. It has a low polynomial algorithmic complexity, and it is intensively used during clustering and cut reduction.

4. A new integer programming scheduling formulation is proposed that combines scheduling, software pipelining and delay line optimization for repetitive signal flow graphs [Depuydt 93b]. Its feasibility is proven on time-critical parts of real-life examples: these are the parts of the design that a designer wants to be scheduled optimally. A new, general formulation for resource constraints between conditional operations has been proposed as well.
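For reference, the core of a time-indexed integer programming scheduling model can be written down compactly. The formulation below is the generic textbook form, with binary variables $x_{i,t} = 1$ iff operation $i$ starts in cycle $t$; it omits the software pipelining and delay-line extensions of Chapter 6, and the symbols $E$ (precedence edges) and $O_r$ (operations bound to resource $r$) are notational assumptions:

```latex
\begin{align*}
\min\; & \sum_{t} t\,x_{\text{sink},t}
  && \text{(completion time of the sink operation)}\\
\text{s.t.}\; & \sum_{t} x_{i,t} = 1
  && \forall i \quad \text{(every operation starts exactly once)}\\
& \sum_{t} t\,x_{j,t} \;\geq\; \sum_{t} t\,x_{i,t} + d_i
  && \forall (i,j) \in E \quad \text{(precedence; $d_i$ = latency of $i$)}\\
& \sum_{i \in O_r}\;\sum_{s=t-d_i+1}^{t} x_{i,s} \;\leq\; 1
  && \forall r,\ \forall t \quad \text{(resource $r$ runs one operation at a time)}
\end{align*}
```

The last constraint family counts, per cycle $t$, every operation on resource $r$ that would still be executing in that cycle; pipelined resources relax it to the issue cycle only.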

All the techniques allow the exploitation of control flow hierarchy (nested loops and conditions) and the software pipelining of loops, and support multi-cycle operations and pipelined operations.

The methodology has been demonstrated on a few real-life examples. It was shown that register constraints can be satisfied. Furthermore, trade-offs between the register cost and the total number of clock cycles can be made by changing the constraints on the available number of registers (in the register files). The worst-case register cost can be reduced by up to 30 % by only changing the schedule. If the register cost has to be reduced more than that, other data routing alternatives for the signals have to be explored: the spilling of signals to memory is supported by cut reduction. However, a more complete set of data routing alternatives is discussed in [Lanneer 93a].

8.3 Future work

Future consumer electronics and telecommunications systems will more and more make use of a mix of standard DSP core components and application-specific hardware [Paulin 92]. As a result, the design problem is a hardware-software co-design problem, where techniques that apply both to high-level synthesis and to code generation are of key importance. Besides the restrictions on the available number of registers (discussed in this thesis), a number of extra architectural characteristics have to be taken into account, like encoding restrictions, conditional two-way branching, unconditional branching, etc. [Praet 93] [Kifli 93].

During the last ten years, a lot of contributions to scheduling and register optimization have been published in the literature. One might observe that scheduling, especially in high-level synthesis, is becoming a mature domain. Therefore, several scheduling techniques should be combined in an integrated scheduling environment that can serve a broad range of user requirements.

Register optimization could be merged with interconnect optimization (as described in [Lanneer 93a]) during the clustering step. This will require a modification in the definition of the metrics used for the hierarchical clustering.

The cut reduction technique can be merged with the data routing techniques described in [Lanneer 93a]. This will lead to a data routing technique that is able to take a fixed number of registers into account. Better cut reduction pruning techniques might have to be investigated in that case.

A larger degree of user interaction should also be provided. Currently, the user can specify constraints on the available number of registers. However, manual graph transformations for register optimization (e.g. adding sequence edges, or defining clusters of operations) allow the user to directly influence the quality of the solution, especially if these manipulations are supported in

a user-friendly way.

Finally, some effort can still be spent on reducing the complexity of the integer programming scheduling problems. In particular, problem features like set constraints should be taken into account by the ILP solvers. This can dramatically reduce the CPU times.

Bibliography

[Ahmad 91] I. Ahmad and C. Y. R. Chen. Post-processor for data path synthesis using multiport memories. In Proc. of the Int. Conf. on Comp.-Aided Design, Santa Clara, pages 276-279, November 1991.

[Aho 76] A. V. Aho and S. C. Johnson. Optimal code generation for expression trees. Journal of the ACM, 23(3):488-501, July 1976.

[Aho 77] A. V. Aho, S. C. Johnson, and J. D. Ullman. Code generation for expressions with common subexpressions. Journal of the ACM, 24(1):146-160, January 1977.

[Aho 88] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques and Tools. Addison Wesley, 1988.

[Balakrishnan 88] M. Balakrishnan, A. K. Majumdar, D. K. Banerji, J. G. Linders, and J. C. Majithia. Allocation of multiport memories in data path synthesis. IEEE Trans. on Computer-Aided Design, 7(4):536-540, April 1988.

[Bergamaschi 91] R. A. Bergamaschi, R. Camposano, and M. Payer. Data-path synthesis using path analysis. In Proc. of the DAC91, pages 591-596, June 1991.

[Breternitz 91] M. Breternitz and J. P. Shen. Implementation optimization techniques for architecture synthesis of application-specific processors. In Proc. of the 24th Ann. Workshop on Microprogramming and Microarchitecture, Albuquerque, New Mexico, pages 114-123, November 1991.

[Busschaert 91] H. J. Busschaert, P. P. Reusens, L. Dartois, and L. Desperben. A power efficient channel coder/decoder chip for GSM terminals. In Proc. IEEE Custom Integrated Circuits Conference, 1991.

[Camposano 90] R. Camposano and R. A. Bergamaschi. Redesign using state splitting. In Proceedings of the EDAC, Glasgow, pages 157-161, March 1990.

[Catthoor 88] F. Catthoor, J. Rabaey, G. Goossens, J. L. Van Meerbergen, R. Jain, H. De Man, and J. Vandewalle. Architectural strategies for an application-specific synchronous multiprocessor environment. IEEE Transactions on Acoustics, Speech and Signal Processing, 36(2):265-283, February 1988.

[Catthoor 92] F. Catthoor. Design methodologies for application-specific signal processing architectures. Tutorial presentation at EUSIPCO 92, 1992.

[Chaitin 82] G. J. Chaitin. Register allocation and spilling via graph coloring. In ACM SIGPLAN Conf. on Programming Language Design and Impl., volume 17, pages 98-105, 1982.

[Chao 93] L.-F. Chao, A. LaPaugh, and E. Sha. Rotation scheduling: A loop pipelining algorithm. In Proc. of the 30th ACM/IEEE Design Automation Conference, Dallas (Texas), pages 566-572, June 1993.

[Chen 91] C. Y. R. Chen and M. Z. Moricz. Data path scheduling for two-level pipelining. In Proceedings of the 28th DAC, San Francisco, pages 603-606, 1991.

[Chu 89] C.-M. Chu, M. Potkonjak, M. Thaler, and J. Rabaey. HYPER: An interactive synthesis environment for high performance real time applications. In Proceedings of ICCD, pages 432-435, 1989.

[Chu 92] C.-M. Chu and J. Rabaey. Hardware selection and clustering in the HYPER synthesis system. In Proc. of the EDAC92, Brussels, pages 176-180, February 1992.

[Crowder 83] H. Crowder and E. L. Johnson. Solving large-scale zero-one linear programming problems. Operations Research, 31(5):803-834, September 1983.

[Depuydt 90] F. Depuydt, G. Goossens, J. van Meerbergen, F. Catthoor, and H. De Man. Scheduling of large signal flow graphs based on metric graph clustering. In Proceedings of IFIP90, Paris. Elsevier Publishers, 1990.

[Depuydt 91] F. Depuydt, G. Goossens, and H. De Man. Clustering techniques for register optimization during scheduling preprocessing. In Proc. IEEE Int. Conf. Comp. Aided Design, Santa Clara CA, pages 280-283, November 1991.

[Depuydt 92] F. Depuydt. Retiming for maximum register cost calculation in hierarchical cyclic control-data flow graphs. SPRITE Report C2nd.d/IMEC/Y4m12/1, IMEC Lab, December 1992.

[Depuydt 93a] F. Depuydt. Cut reduction: a new approach to scheduling under register constraints. Technical report, IMEC Lab, October 1993. Submitted for publication.

[Depuydt 93b] F. Depuydt. Optimal scheduling and software pipelining of repetitive signal flow graphs with delay line optimization. Technical report, IMEC Lab, September 1993. Submitted for publication.

[Even 79] S. Even. Graph Algorithms. Computer Science Press, 1979.

[Fettweis 76] A. Fettweis. Realizability of digital filter networks. Archiv für Elektronik und Übertragungstechnik, 30:90-96, 1976.

[Fisher 81] J. A. Fisher. Trace scheduling: A technique for global microcode compaction. IEEE Transactions on Computers, C-30(7):478-490, July 1981.

[Franssen 92] F. H. M. Franssen, M. F. X. B. van Swaaij, F. Catthoor, and H. De Man. Modeling piece-wise linear and data dependent signal indexing for multi-dimensional signal processing. In Proc. of the High-Level Synthesis Workshop, Laguna Niguel CA, pages 245-255, November 1992.

[Garey 79] M. R. Garey and D. S. Johnson. Computers and Intractability. A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, New York NY, USA, 1979.

[Gebotys 91] C. H. Gebotys and M. I. Elmasry. Simultaneous scheduling and allocation for cost constrained optimal architectural synthesis. In Proceedings of the 28th ACM/IEEE DAC, pages 2-7, June 1991.

[Gebotys 92] C. H. Gebotys. Optimal scheduling and allocation of embedded VLSI chips. In Proc. of the 29th ACM/IEEE Design Automation Conference, pages 116-119, 1992.

[Geoffrion 67] A. M. Geoffrion. Integer programming by implicit enumeration and Balas' method. SIAM Review, 9(2), April 1967.

[Girczyc 87] E. F. Girczyc. Loop winding - a data flow approach to functional pipelining. In Proc. IEEE Int. Symp. on Circuits and Systems (ISCAS87), Philadelphia, pages 382-385, May 1987.

[Glover 86] F. Glover. Future paths for integer programming and links to artificial intelligence. Computers and Operations Research, 13(5):533-549, 1986.

[Golumbic 80] M. C. Golumbic. Algorithmic Graph Theory and Perfect Graphs. Academic Press, Inc., 1980.

[Goossens 86] G. Goossens. Register-assignment and variable-lifetime analysis. Technical report, IMEC Laboratory, September 1986. Report nr. ESPRIT97/IMEC/9.86/D/d(4)/2.

[Goossens 89a] G. Goossens. Optimization techniques for automated synthesis of application-specific signal processing architectures. PhD thesis, Katholieke Universiteit Leuven, June 1989.

[Goossens 89b] G. Goossens, J. Vandewalle, and H. De Man. Loop optimization in register-transfer scheduling for DSP-systems. In Proceedings of the 26th ACM/IEEE Design Automation Conference, June 1989.

[Goossens 90] G. Goossens, J. Rabaey, J. Vandewalle, and H. De Man. An efficient microcode compiler for application specific DSP processors. IEEE Transactions on Computer-Aided Design, 9(9):925-937, September 1990.

[Goossens 92] G. Goossens, F. Catthoor, D. Lanneer, and H. De Man. Integration of signal processing systems on heterogeneous IC architectures. In Proc. of the High Level Synthesis Workshop, Dana Point Resort CA, pages 16-26, November 1992.

[Goossens 93] G. Goossens, D. Lanneer, M. Pauwels, F. Depuydt, K. Schoofs, A. Kifli, M. Cornero, P. Petroni, and H. De Man. Integration of medium-throughput signal processing algorithms on flexible instruction-set architectures. Accepted for publication in: Journal of VLSI Signal Processing (Special issue on synthesis for real-time digital signal processing), Fall 1993, 1993.

[Grass 90] W. Grass. A branch-and-bound method for optimal transformation of data flow graphs for observing hardware constraints. In Proceedings of the EDAC, Glasgow, pages 73-77, March 1990.

[Groot 92] S. M. Heemstra de Groot, S. H. Gerez, and O. E. Herrmann. Range-chart-guided iterative data-flow graph scheduling. IEEE Trans. on Circuits and Systems-I: Fund. Theory and Applications, 39(5):351-364, May 1992.

[Gupta 92] R. K. Gupta and G. De Micheli. System-level synthesis using re-programmable components. In Proc. of the EDAC92, Brussels, pages 2-7, March 1992.

[Hafer 91] L. Hafer. Constraint improvements for MILP-based hardware synthesis. In Proc. of the DAC91, pages 14-19, June 1991.

[Haroun 88] B. S. Haroun and M. I. Elmasry. Automatic synthesis of a multi-bus architecture for DSP. In Proceedings of the ICCAD88, pages 44-47, November 1988.

[Hartmann 92] R. Hartmann. Combined scheduling and data routing for programmable ASIC systems. In Proc. of the EDAC92, Brussels, pages 486-490, February 1992.

[Hendren 92] L. J. Hendren, G. R. Rao, E. R. Altman, and C. Mukerji. A register allocation framework based on hierarchical cyclic interval graphs. Private communication, 1992.

[Hennessy 90] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers Inc., 1990.

[Hwang 91] C.-T. Hwang, J.-H. Lee, and Y.-C. Hsu. A formal approach to the scheduling problem in high level synthesis. IEEE Transactions on Computer Aided Design, 10(4):464-475, April 1991.

[Johnson 67] S. C. Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241-254, September 1967.

[Kalavade 93] A. Kalavade and E. A. Lee. A hardware-software codesign methodology for DSP applications. IEEE Design & Test of Computers, pages 16-28, September 1993.

[Kifli 93] A. Kifli, F. Depuydt, D. Lanneer, and G. Goossens. Control operations definition and scheduling model. Technical report, IMEC Laboratory, July 1993.

[Knapp 91] D. W. Knapp and A. C. Parker. The ADAM design planning engine. IEEE Transactions on Computer Aided Design, 10(7):829-846, July 1991.

[Kramer 90] H. Kramer and W. Rosenstiel. System synthesis using behavioral descriptions. In Proceedings of the EDAC, Glasgow, pages 277-282, March 1990.

[Ku 90] D. Ku and G. De Micheli. Relative scheduling under timing constraints. In Proceedings of the 27th DAC, Orlando, Florida, pages 59-64, June 1990.

[Kurdahi 87] F. J. Kurdahi. Area Estimation of VLSI Circuits. PhD thesis, UCLA, August 1987.

[Lagnese 91] E. D. Lagnese and D. E. Thomas. Architectural partitioning for system level synthesis of integrated circuits. IEEE Transactions on Computer-Aided Design, 10(7):847-860, July 1991.

[Lam 88] M. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proc. of the SIGPLAN88 Conf. on Programming Language Design and Implementation, pages 318-328, 1988.

[Landskov 80] D. Landskov, S. Davidson, B. Shriver, and P. W. Mallett. Local microcode compaction techniques. Computing Surveys, 12(3):261-294, September 1980.

[Lanneer 90] D. Lanneer, F. Catthoor, G. Goossens, M. Pauwels, J. Van Meerbergen, and H. De Man. Open-ended system for high-level synthesis of flexible processors. In Proceedings of the European Conference on Design Automation, Glasgow, Scotland, pages 272-276, March 1990.

[Lanneer 91] D. Lanneer, S. Note, F. Depuydt, M. Pauwels, F. Catthoor, G. Goossens, and H. De Man. Architectural synthesis for medium and high throughput signal processing with the new Cathedral environment, chapter 2, pages 27-54. High-level VLSI Synthesis. Kluwer, Boston, 1991. Ed: R. Camposano and W. Wolf.

[Lanneer 93a] D. Lanneer. Design Models and Data-Path Mapping for Signal Processing Architectures. PhD thesis, Katholieke Universiteit Leuven, 1993.

[Lanneer 93b] D. Lanneer, M. Cornero, G. Goossens, and H. De Man. An assignment technique for incompletely specified data-paths. In Proc. of the Eur. Conf. on Design Automation (EDAC), Paris (France), pages 284-288, 1993.

[Lee 88] E. A. Lee. Programmable DSP architectures: Part 1. IEEE ASSP Magazine, pages 4-19, October 1988.

[Lee 89a] E. A. Lee. Programmable DSP architectures: Part 2. IEEE ASSP Magazine, pages 4-14, January 1989.

[Lee 89b] J.-H. Lee, Y.-C. Hsu, and Y.-L. Lin. A new integer linear programming formulation for the scheduling problem in data path synthesis. In Proceedings of Papers of the ICCAD89, pages 20-23, November 1989.

[Lee 92] T.-F. Lee, A. Wu, D. Gajski, and Y.-L. Lin. An effective methodology for functional pipelining. In Proc. of the Int. Conf. on Computer Aided Design (ICCAD), Santa Clara, CA (USA), pages 230-233, 1992.

[Leiserson 83] C. E. Leiserson, F. M. Rose, and J. B. Saxe. Optimizing synchronous circuitry by retiming. In Proc. of the 3rd Caltech Conference on VLSI, Pasadena CA, pages 87-116, March 1983.

[Lippens 91] P. E. R. Lippens, J. L. van Meerbergen, A. van der Werf, and W. F. J. Verhaegh. PHIDEO: A silicon compiler for high speed algorithms. In Proceedings of the European Design Automation Conference, Amsterdam, pages 436-441, 1991.

[Lucke 93] L. E. Lucke and K. K. Parhi. Generalized ILP scheduling and allocation for high-level DSP synthesis. In IEEE Custom Integrated Circuits Conference, 1993.

[Man 88] H. De Man, J. Rabaey, J. Vanhoof, G. Goossens, P. Six, and L. Claesen. Cathedral 2: a computer-aided synthesis system for digital signal processing VLSI systems. Computer-Aided Engineering Journal, pages 55-66, April 1988.

[Man 90] H. De Man, F. Catthoor, G. Goossens, J. Vanhoof, J. Van Meerbergen, S. Note, and J. Huisken. Architecture-driven synthesis techniques for VLSI implementation of DSP algorithms. Proceedings of the IEEE, 78(2), 1990.

[McFarland 90] M. C. McFarland and T. J. Kowalski. Incorporating bottom-up design into hardware synthesis. IEEE Transactions on Computer-Aided Design, 9(9):938-950, September 1990.

[Mlinar 91] M. J. Mlinar. Control Path/Data Path Tradeoffs in VLSI Design. CEng techn. report 91-16, Dept. of El. Eng. - Systems, USC, June 1991.

[Mueller 93] F. Mueller. Register allocation by graph coloring: A review. Technical report, Florida State University, 1993.

[Nehmhauser 88] G. L. Nehmhauser and L. A. Wolsey. Integer and Combinatorial Optimization. John Wiley and Sons, 1988.

[Note 91] S. Note, W. Geurts, F. Catthoor, and H. De Man. Cathedral-3: Architecture-driven high-level synthesis for high throughput DSP applications. In Proceedings of the Design Automation Conference, San Francisco, June 1991.

[Pangrle 86] B. M. Pangrle and D. D. Gajski. State synthesis and connectivity binding for microarchitecture compilation. In Proc. of the Int. Conf. on Computer-Aided Design (ICCAD), pages 210-212, 1986.

[Papadimitriou 82] C. H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice Hall, 1982.

[Parhi 89] K. K. Parhi and D. G. Messerschmitt. Rate-optimal fully-static multiprocessor scheduling of data-flow signal processing programs. In Proceedings of ISCAS, pages 1923-1928, February 1989.

[Parhi 91] K. K. Parhi and D. G. Messerschmitt. Static rate-optimal scheduling of iterative data-flow programs via optimum unfolding. IEEE Trans. on Computers, 40(2):178-195, February 1991.

[Park 88] N. Park and A. C. Parker. Sehwa: A software package for synthesis of pipelines from behavioral specifications. IEEE Transactions on Computer-Aided Design, 7(3):356-370, March 1988.

[Paulin 89] P. G. Paulin and J. P. Knight. Force-directed scheduling for the behavioral synthesis of ASICs. IEEE Transactions on Computer-Aided Design, 8(6):661-679, June 1989.

[Paulin 92] P. G. Paulin, C. Liem, T. C. May, and S. Sutarwala. DSP design tool requirements for the nineties: An industrial perspective. Technical report, Bell-Northern Research, 1992. Accepted for publication in: Journal of VLSI Signal Processing (Fall '93).

[Pauwels 90] M. Pauwels, F. Catthoor, K. Schoofs, M. Masschelein, and H. De Man. A Dynamic Range Compressor Architecture for Audio, used as a Test-Vehicle for Type-Handling in the Cathedral-2nd Synthesis System, pages 1555-1558. Signal Processing V. Elsevier Science Publishers, 1990. Edited by: L. Torres, E. Masgrau and M. A. Lagunas.

[Pauwels 92] M. Pauwels, D. Lanneer, F. Catthoor, G. Goossens, and H. De Man. Models for bit-true simulation and high-level synthesis of DSP applications. In Second Great Lakes Symposium on VLSI, Kalamazoo, MI, USA, February 1992.

[Potasman 90] R. Potasman, J. Lis, A. Nicolau, and D. Gajski. Percolation based synthesis. In Proceedings of the 27th ACM/IEEE Design Automation Conference, pages 444-449, June 1990.

[Potkonjak 90] M. Potkonjak and J. Rabaey. Retiming for scheduling. In Proc. of the VLSI Signal Processing Workshop, 1990.

[Potkonjak 92] M. Potkonjak and J. M. Rabaey. Scheduling algorithms for hierarchical data control flow graphs. Intl. Journal of Circuit Theory and Applications, 20:217-233, 1992.

[Praet 92] J. Van Praet. Data path definition in Cathedral-2nd. Technical report, IMEC Laboratory, June 1992.

[Praet 93] J. Van Praet. Classification of fixed-point DSP cores in relation to code generation. Technical report, IMEC Laboratory, June 1993. IMEC internal report.

[Rabaey 90] J. Rabaey and M. Potkonjak. Resource driven synthesis in the HYPER system. In Proceedings of the ISCAS, New Orleans, pages 2592-2595. University of California, 1990.

[Rabaey 91] J. M. Rabaey, C. Chu, P. Hoang, and M. Potkonjak. Fast prototyping of datapath-intensive architectures. IEEE Design and Test of Computers, pages 40-51, June 1991.

[Rau 82] B. R. Rau, C. D. Glasser, and R. L. Picard. Efficient code generation for horizontal architectures: Compiler techniques and architectural support. In Proc. of the 9th Ann. Symp. on Computer Architectures, Austin, Texas, pages 131-139, April 1982.

[Renfors 81] M. Renfors and Y. Neuvo. The maximum sampling rate of digital filters under hardware speed constraints. IEEE Trans. on Circuits and Systems, CAS-28(3):196-202, March 1981.

[Rimey 89] K. E. Rimey. A Compiler for Application-Specific Signal Processors. PhD thesis, University of California at Berkeley, 1989.

[Rompaey 92] K. Van Rompaey, I. Bolsens, and H. De Man. Just in time scheduling. In Proc. of the IEEE Intl. Conf. on Computer Design: VLSI in Computers and Processors, Cambridge, Massachusetts, pages 295-300, October 1992.

[Rosseel 91] J. Rosseel, M. van Swaaij, F. Catthoor, and H. De Man. Affine transformations for multi-dimensional signal processing on ASIC regular arrays. In Proc. of the Eur. Design Automation Conf. (EDAC), Amsterdam, pages 442-446, 1991.

[Rouzeyre 91] B. Rouzeyre and G. Sagnes. Memory area minimization by hierarchical clustering in high level synthesis. In Fifth Intl. Workshop on High Level Synthesis, Bühlerhöhe, pages 29-36, March 1991.


[Scheichenzuber 90] J. Scheichenzuber, W. Grass, U. Lauther, and S. März. Global hardware synthesis from behavioral dataflow descriptions. In Proceedings of the DAC, Orlando, Florida, pages 456–461, June 1990.

[Schoofs 93] K. Schoofs, G. Goossens, and H. De Man. Bit-alignment in hardware allocation for multiplexed DSP architectures. In Proc. of the Eur. Conf. on Design Automation (EDAC), Paris (France), pages 289–293, 1993.

[Schrage 89] L. Schrage. Linear, Integer and Quadratic Programming with LINDO. The Scientific Press, 1989.

[Schrijver 86] A. Schrijver. Theory of Linear and Integer Programming. Wiley, 1986.

[Schwartz 85] D. A. Schwartz and T. P. Barnwell III. Cyclo-static multiprocessor scheduling for the optimal realization of shift-invariant flow graphs. In IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 1384–1387, 1985.

[Schwartz 86] D. A. Schwartz and T. P. Barnwell III. Cyclo-Static Solutions: Optimal Multiprocessor Realizations of Recursive Algorithms, chapter 11. IEEE Press, 1986.

[Sethi 75] R. Sethi. Complete register allocation problems. SIAM Journal of Computing, 4(3):226–248, September 1975.

[Seynhaeve 87] D. Seynhaeve. Racecar. Technical Report DOC. CD90.I040.0, PHILIPS/IMEC, February 1987.

[Stok 92] L. Stok and J. A. G. Jess. Foreground memory management in data path synthesis. Int. Journal of Circuit Theory and Applications, 20:235–255, 1992.

[Su 90] B. Su, J. Wang, Z. Tang, W. Zhao, and Y. Wu. A software pipelining based VLIW architecture and optimizing compiler. In Proc. of the 23rd Annual Workshop on Microprogramming and Microarchitecture, Orlando, Florida, pages 17–27, November 1990.

[Swaaij 92] M. F. X. B. van Swaaij, F. H. M. Franssen, F. V. M. Catthoor, and H. J. De Man. Modeling data flow and control flow for high level memory management. In Proc. of the EDAC92, Brussels, pages 8–13, February 1992.


[Tseng 86] C.-J. Tseng and D. P. Siewiorek. Automated synthesis of data paths in digital systems. IEEE Trans. on Computer-Aided Design, 5(3):379–395, July 1986.

[Vanhoof 90] J. Vanhoof, I. Bolsens, S. De Troch, E. Blokken, and H. De Man. Evaluation of high-level design decisions using the Cathedral-II silicon compiler to prototype a DSP ASIC. In Proceedings of IFIP90, 1990.

[Vanhoof 91] J. Vanhoof, I. Bolsens, and H. De Man. Compiling multi-dimensional data streams into distributed DSP ASIC memory. In Proc. IEEE Int. Conf. on Comp. Aided Design, Santa Clara CA, pages 272–275, November 1991.

[Vanhoof 92] J. Vanhoof. Architecture Synthesis for Application-Specific Medium-Throughput Digital Signal Processing Chips. PhD thesis, K.U. Leuven, 1992.

[Vanhoof 93] J. Vanhoof, K. Van Rompaey, I. Bolsens, G. Goossens, and H. De Man. High-Level Synthesis for Real-Time Digital Signal Processing. Kluwer Academic Publishers, 1993.

[Verbauwhede 91] I. Verbauwhede, F. Catthoor, J. Vandewalle, and H. De Man. In-place memory management of algebraic algorithms on application specific IC's. Journal of VLSI Signal Processing, 3:193–200, 1991.

[Verhaegh 91] W. F. J. Verhaegh, E. H. L. Aarts, J. H. M. Korst, and P. E. R. Lippens. Improved force-directed scheduling. In Proceedings of the European Design Automation Conference, pages 430–435, 1991.


Appendix A

Edge cost corrections

In Section 3.4.3, it is explained that the edge cost c_{1,2} is corrected if edge e_{1,2} is a fanout edge. However, the edge cost is also corrected if edges into (or out of) nested blocks are part of the MPES_i of that nested block. The rationale behind this is to avoid counting the signals corresponding with these edges more than once during hierarchical retiming. The kind of correction depends on whether a GMC or a CMC is calculated. Figure A.1 summarizes the possible cases:

[Figure A.1: Possible cases for nested model edge cost, for nested blocks A and B with r(B) - r(A) = 1. (a) GMC case: the three fanout edges keep cost c = 0.33 each, and the nested-block edge gets c = GMC - 0.66. (b) CMC case: the fanout edge costs are set to c = 0, and the nested-block edge keeps c = GMC.]

1. If at least one of the fanout edges is part of the nested block's MPES, and a GMC is calculated, the cost of the edge e_{NB,NB'} of the nested block retiming model is modified as shown in figure A.1(a). As a result, when the nested block and another fanout edge are part of a maximal parallel edge set, the accumulated cost is not larger than 1.

2. If at least one of the fanout edges is part of the nested block's MPES, and a CMC is calculated, the costs of the fanout edges are set to 0, because the edges of the nested block are going to be accounted for anyway (figure A.1(b)).


Appendix B

The cut reduction algorithm

In Chapter 4, a register optimization technique, based on transforming the HDSFG until its GMC is within the register constraints, is explained. The technique is called cut reduction, and is implemented as a branch-and-bound (BAB) search:

Algorithm B.1 (BAB cut reduction)

cut_reduction() {
  1. if (bound()) return;
  2. calculate GMC:
     - if GMC <= C for the parent solution: exact GMC calculation by means of maximum cliques;
     - else: estimation of GMC by means of hierarchical retiming;
  3. if (GMC <= C): best solution = this solution;
  4. else: branch();
}


bound() {
  1. if (equal to another partial solution) return YES;
  2. calculate cost;
  3. if (cost >= cost of best solution) return YES;
  4. return NO;
}

branch() {
  1. Determine all possible basic moves.
  2. Order them with the ordering heuristics into the array BM[].
  3. For all k:
     (a) Perform basic move BM[k].
     (b) cut_reduction();
     (c) Undo basic move BM[k].
}
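The control flow of Algorithm B.1 can be sketched in Python on a deliberately simplified model (an assumption for illustration, not the thesis implementation): each hypothetical basic move i carries a fixed cost and a fixed GMC reduction, whereas in the real cut reduction both depend on the transformed HDSFG. The bound() pruning appears as the duplicate-partial-solution check and the best-cost cutoff.

```python
def cut_reduction(moves, gmc0, C):
    """Branch-and-bound search in the spirit of Algorithm B.1.

    Toy model (assumption): applying basic move i = (cost_i, reduction_i)
    lowers the estimated GMC by reduction_i at a cost of cost_i; we search
    for the cheapest move set bringing the GMC within the constraint C.
    """
    best = {"cost": float("inf"), "moves": None}
    seen = set()

    def search(applied, cost, gmc):
        key = frozenset(applied)
        if key in seen:                 # bound(): same partial solution seen
            return
        seen.add(key)
        if cost >= best["cost"]:        # bound(): dominated by best solution
            return
        if gmc <= C:                    # GMC within the constraint: record it
            best["cost"], best["moves"] = cost, set(applied)
            return
        # branch(): order the remaining basic moves heuristically
        # (here: largest GMC reduction first)
        for i in sorted((i for i in range(len(moves)) if i not in applied),
                        key=lambda i: -moves[i][1]):
            applied.add(i)                                   # perform move
            search(applied, cost + moves[i][0], gmc - moves[i][1])
            applied.remove(i)                                # undo move

    search(set(), 0, gmc0)
    return best["cost"], best["moves"]
```

For instance, with moves [(3, 2), (2, 1), (2, 1)], an initial GMC of 5 and C = 3, the single move of cost 3 beats the pair of cost-2 moves.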


Appendix C

Minflow for Maximum Register Cost Calculation

Although the hierarchical retiming technique in Chapter 3 is the most general technique for estimating the maximum register cost of a design, an alternative technique is proposed here. It is based on minflow, and can only be applied to acyclic HDSFGs (1).

This appendix is structured as follows. The hierarchical network model to accommodate the minflow calculations is described in Section C.1. The minflow problem is discussed in Section C.2. The handling of conditions is described in Section C.3, and the hierarchical minflow algorithm is given in Section C.4. A few details concerning fanout edges are discussed in Section C.5. The application of hierarchical minflow is illustrated for the SYNDR example in Section C.6. The algorithmic complexity of hierarchical minflow is discussed in Section C.7, and a comparison with the hierarchical retiming technique of Chapter 3 is made in Section C.8.

C.1 The hierarchical network model

The HDSFG can be transformed into a hierarchical network by removing all edges with non-zero weight d_{1,2}, thus making the HDSFG acyclic. Where edges are removed, entry and exit nodes (2) are added to the HDSFG. Furthermore, a source node S and a sink node T are added such that all nodes p without incoming edges are connected with S by means of a sequence edge e_{S,p}, and all nodes q without outgoing edges are connected with T by means of a sequence edge e_{q,T}. As an illustration, the transformation of the i-loop of the SYNDR example (see figure 3.8) is shown in figure C.1.

(1) Note that an HDSFG can always be made acyclic by removing the edges with non-zero weights (i.e. the "looping" edges).
(2) See Definition 3.19.


[Figure C.1: Hierarchical network for the i-loop of the SYNDR example. The network runs from source S to sink T through the operation nodes (+<, W, neg, cp) and the nested j-loop node, which is attributed with its network flow F = GMC; the edges carry lower-bound labels b = 0 or b[i] = 1 (e.g. b[1] = 1, b[6] = 1, b[7] = 1, b[11] = 1) for the signals snr01, snr01_v1, snr01_v2, snr01_v3d, snr01d_v1, snr01d_v2, accu14_v1, accu14d_v1_iva, snr02_iva and return_v1.]


The edges of the network are attributed with the following arrays (3):

Definition C.1 (Flow array f) The flow array f of an edge is an array of integers, where f[i] is the value of the flow corresponding with register file i.

Definition C.2 (Lower flow bound array b) The lower flow bound array b is an array of integers such that f[i] >= b[i] for all i. The b array is unique for each edge, and depends on the design: b[i] = 1 if the signal corresponding to the edge is stored in register file i.

The network is also attributed with a flow array:

Definition C.3 (Network flow F) The network flow F is an array of integers where F[i] is the value of the flow through the network, i.e. leaving S and entering T, corresponding with register file i.

Nested blocks in the HDSFG are represented by single nodes in the network. These nodes are attributed with the flow F of the network for the nested HDSFG. The use of lower flow bound arrays and nested block flow arrays is also illustrated in figure C.1: for instance, the data edge corresponding to signal "snr01d_v2" has a b array with only one non-zero element (b[7] = 1), since the signal "snr01d_v2" is stored in register file number 7 (see figure 3.9). Furthermore, the node for the nested j-loop is attributed with the flow array of the j-loop.

C.2 Minflow in networks

The hierarchical network described above is a directed acyclic graph (DAG) [Even 79]. The following theorem has been proven by Kurdahi [Kurdahi 87]:

Theorem C.1 The maximum number of registers for a design represented by a DAG equals the value of a maxcut through that DAG.

Even [Even 79] has shown that the maxcut through a DAG can be computed by performing a minflow:

Definition C.4 (Minflow) The minflow problem consists of determining the flow f[i] for each edge such that the network flow F[i] is minimal, and such that the incoming flow equals the outgoing flow for each network node.

Finding the minflow in a network is a problem of polynomial complexity, because it is the same as finding the maxflow from the sink to the source [Even 79]. The minflow problem can for instance be solved by means of Dinic's algorithm [Even 79] [Papadimitriou 82], which has cubic algorithmic complexity.

(3) Again, all arrays have N components, where N is the number of register files in the architecture (see also Chapter 3).


C.3 Handling conditions

The minflow algorithm is only valid for unconditional networks: all edges must be valid paths for flow under the same conditions. If some nodes or edges of the network are conditional, there should be a minflow calculation for each possible condition at that particular level of the HDSFG. During this conditional minflow, only edges whose condition is compatible with the condition of the minflow can have a non-zero b[i]. This method can get cumbersome if there are many conditional branches in the design.

Therefore, an alternative approach is the nesting of exclusive condition blocks in CMB nodes (4). These CMB blocks are then extra levels of hierarchy in the HDSFG. The flow array of a CMB block is obtained by computing the component-wise maximum (5) of the flow arrays F_1 and F_2 of the exclusive condition blocks:

    F_{CMB} = \overline{\max}\{F_1, F_2\}                         (C.1)

C.4 Hierarchical minflow

By means of the algorithm below, the maximum register cost of a design can be calculated:

Algorithm C.1 (Hierarchical minflow)
For all levels of the hierarchical network, starting at the innermost level, calculate F[i] (for all i: 1 <= i <= N) as follows:

1. Saturate the network with initial flow: while there is still an edge e such that f_e[i] < b_e[i], trace a directed path S -> T through e and increment f[i] for all edges on that path. Whenever there is a choice during tracing this path, take the edge with the smallest f[i].

2. Flow propagation from nested levels: if the flow into the nested block, f_{NB}[i], is smaller than the network flow F_{NB}[i], trace a directed path S -> T through the node NB and increase the flow f[i] of all edges on that path with F_{NB}[i] - f_{NB}[i]. Again, whenever there is a choice during tracing this path, take the edge with the smallest f[i].

3. Construct the maxflow from T to S, e.g. with Dinic's algorithm.

(4) See Definition 3.22.
(5) See Definition 3.20.
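A minimal Python sketch of steps 1 and 3 of Algorithm C.1 for a single register file is given below (an illustrative simplification, not the thesis implementation: the smallest-f[i] path heuristic of steps 1 and 2 is omitted, and simple BFS augmentation replaces Dinic's algorithm). On the hypothetical toy network at the end, greedy saturation initially routes 2 units of flow, and the backward T -> S augmentation cancels the surplus down to the true minflow of 1.

```python
from collections import deque

def minflow(edges, S="S", T="T"):
    """Minimum S->T flow meeting 0/1 lower bounds on a DAG.

    edges: dict mapping (u, v) -> lower flow bound b[(u, v)].
    Assumes every node lies on at least one S -> T path.
    """
    nodes = {x for e in edges for x in e}
    succ = {u: [] for u in nodes}
    pred = {u: [] for u in nodes}
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
    f = dict.fromkeys(edges, 0)

    def path(x, goal):
        """Some directed path x -> goal (plain DFS; the graph is acyclic)."""
        if x == goal:
            return [x]
        for y in succ[x]:
            p = path(y, goal)
            if p is not None:
                return [x] + p
        return None

    # Step 1: saturate the network until every lower bound is met.
    for (u, v), b in edges.items():
        while f[(u, v)] < b:
            route = path(S, u) + path(v, T)     # S -> u, edge (u,v), v -> T
            for a, c in zip(route, route[1:]):
                f[(a, c)] += 1

    # Step 3: cancel surplus flow with augmenting paths T -> S in the
    # residual graph: traverse (u, v) backwards while f > b (decreasing f),
    # or forwards (increasing f; capacities are unbounded).
    while True:
        parent, q = {T: None}, deque([T])
        while q and S not in parent:
            x = q.popleft()
            for u in pred[x]:
                if u not in parent and f[(u, x)] > edges[(u, x)]:
                    parent[u] = (x, (u, x), -1)
                    q.append(u)
            for w in succ[x]:
                if w not in parent:
                    parent[w] = (x, (x, w), +1)
                    q.append(w)
        if S not in parent:
            break
        x = S
        while parent[x] is not None:            # apply one unit of augmentation
            x, e, sign = parent[x]
            f[e] += sign
    return sum(f[(S, v)] for v in succ[S]), f

# Hypothetical toy network: saturation routes S-a-T and S-a-b-c-T (flow 2);
# the backward step reroutes one unit through a-b, giving the minflow F = 1.
demo = {("S", "a"): 1, ("a", "T"): 0, ("a", "b"): 0,
        ("S", "b"): 0, ("b", "c"): 1, ("c", "T"): 0}
```

The backward step is exactly what makes the naive saturation order harmless: any surplus it creates can be cancelled afterwards.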


As a result of Theorem C.1, Algorithm C.1 calculates the global maximum register cost of an acyclic HDSFG. The heuristic for edge selection during path tracing (taking the edge with minimal f[i]) reduces the complexity of the maxflow in step 3: the number of flow paths whose flow can be further reduced by a maxflow from T to S, and the amount of reduction, is decreased by this heuristic.

In the global register optimization strategy, as proposed in Chapter 2, an HDSFG level is preprocessed and scheduled before moving on to the next hierarchy level (see also figure 2.2). Therefore, if hierarchical minflow is used during global register optimization, a nested level is always scheduled, and its F[i] is not the maximum register cost but the actual register cost of the nested level. This actual register cost is obtained by performing register assignment on the scheduled HDSFG (see Section 2.1.2).

C.5 Fanout edges

Modeling the number of different signals being "cut" during minflow requires special care in the case of fanout edges (i.e. data edges that represent the same signal) (6). These fanout edges should not give rise to different flow paths being counted more than once: e.g. in figure C.2(a), the two pairs of fanout edges give rise to a minflow of value 3, whereas it should have value 2 since only two different signals are involved.

A solution for this problem was proposed in [Mlinar 91]: the HDSFG is transformed as in figure C.2(b), by inserting dummy nodes into the HDSFG. However, this transformation still does not yield a good result, since the minflow now has a value of 1 (instead of 2). The technique of [Mlinar 91] thus calculates an underestimation of the maximum register cost.

[Figure C.2: The fanout problem for minflow. Four variants (a)-(d) of a small network with operation nodes 1-5 and fanout edges; the labels show lower bounds b = 0 / b = 1, flows f = 1 and f = 2, and the condition alap_1 < alap_2.]

(6) The same problem occurs in hierarchical retiming, see Section 3.4.2.


The following strategy provides a solution for the above problems: only one of the fanout edges gets b[i] = 1. Which edge to choose can be determined with data-flow analysis (figure C.2(c)) or with topological sorting of the HDSFG (figure C.2(d)). Although this approach cannot be proven to lead to optimal results in all cases, it performs better than the one proposed in [Mlinar 91]: for example, the minflow in figure C.2(d) has the correct value of 2.

C.6 Example

As an illustration of the hierarchical minflow algorithm, the calculation of the maximum register cost for register file "reg1" of the SYNDR example (see figs. 3.8 and 3.9) is shown in figure C.3. Since Algorithm C.1 starts executing at the innermost level, the minflows of the two exclusive conditional blocks in the j-loop of the example are shown in figure C.3(a) and (b): the edges with non-zero flow are labeled with the value of f[reg1], and the thick edges are the ones with b[reg1] = 1. Since the worst-case flow of the two conditional blocks is 1, the flow array of the CMB in figure C.3(c) has value F_{CMB}[reg1] = 1. The minflow at the j-loop level for register file "reg1" has value 4. This minflow is propagated at the i-loop level in figure C.3(d). Together with another flow path of value 1, this gives rise to a minflow of value 5 at the top level of the design. Therefore, the maximum number of registers in register file "reg1" is 5.

C.7 Algorithmic complexity

The algorithmic complexity of steps 1 and 2 in Algorithm C.1 is O(v.e), where v is the number of nodes and e the number of edges in the HDSFG. Since e is of the same order as v in the sparse directed graphs considered here, this is the same as O(v^2). The complexity of step 3 of Algorithm C.1 is O(v^3).

Because of the edge selection heuristic during path tracing, it has been observed that the initial flow almost always equals the exact minflow if the flows are small. This is also shown in figure C.4: the average flow reduction obtained by step 3 of Algorithm C.1 is plotted against the value of the initial flow. The plot reveals that for small initial flows, the backwards maxflow step in Algorithm C.1 only gives an additional improvement in a very small number of cases. If this step is not executed, the complexity of Algorithm C.1 reduces to O(v^2).
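One concrete reading of the "topological sorting" option from Section C.5 can be sketched in Python. The thesis leaves the choice rule open between data-flow analysis and topological sorting, so the rule used here (give the single b[i] = 1 to the edge reaching the topologically last consumer, so that one cut counts the shared signal once) as well as the helper name and data layout are assumptions.

```python
from graphlib import TopologicalSorter

def fanout_lower_bounds(preds, src, sinks):
    """Assign b = 1 to exactly one edge of a fanout set (Section C.5).

    preds: the DAG as node -> iterable of predecessors;
    src: the node producing the shared signal; sinks: its consumers.
    Returns {(src, sink): b} with a single 1 entry.
    """
    order = list(TopologicalSorter(preds).static_order())
    rank = {n: i for i, n in enumerate(order)}
    last = max(sinks, key=lambda s: rank[s])    # topologically last consumer
    return {(src, s): int(s == last) for s in sinks}
```

For a chain n1 -> n2 -> n3 where n1's result also feeds n3 directly, the edge (n1, n3) gets the lower bound, because n3 necessarily follows n2 in any topological order.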


[Figure C.3: Calculation of GMC[reg1] for the SYNDR example by means of hierarchical minflow: (a) and (b) the minflows of the two exclusive conditional blocks, each with F[reg1] = 1; (c) the j-loop level containing the CMB node, with F[reg1] = 4; (d) the i-loop (top) level, where the propagated flow of 4 together with another flow path of 1 gives F[reg1] = 5.]


[Figure C.4: Average flow reduction by the T -> S maxflow, plotted against the initial flow (initial flow on the x-axis, average reduction from 0.0 to 1.5 on the y-axis).]

C.8 Comparison with hierarchical retiming

A first observation is that minflow is not suited for cyclic graphs, because the HDSFG has to be transformed into an acyclic network. As a result, software pipelining cannot be taken into account. Furthermore, delay lines and their sizes (7) are not modeled, because cyclic edges are "cut" at the loop boundary. For instance, the signals "snr01_v1" and "snr01d_v1" are stored in the same physical register because they are part of the same delay line. Such constraints are not taken into account by minflow.

On the other hand, if the HDSFG can be treated in an acyclic way (e.g. because software pipelining and the modeling of delay lines are not important), minflow is a faster alternative than retiming (see figure 3.19).

C.9 Conclusion

In this appendix, an alternative technique for maximum register cost estimation is proposed. It is based on minflow, but extended for hierarchical HDSFGs: hierarchical minflow. Although not generally applicable to cyclic graphs, hierarchical minflow proves to be faster than retiming for acyclic HDSFGs.

(7) See Section 3.2.1.


Appendix D

The clustering algorithm

Algorithm D.1 (Clustering)

clustering() {
  1. iteration = 0;
  2. while (cluster()) continue;
}

cluster() {
  1. iteration++;
  2. if (all distances have to be recalculated): for all cluster pairs {C1, C2}, do distance(C1, C2);
  3. else: for all cluster pairs {C1, C2} where C1 or C2 was created during the previous clustering step, do distance(C1, C2);
  4. if all distances are infinite, return NO;
  5. merge();
  6. return YES;
}


distance(C1, C2) {
  1. temporarily merge C1 and C2;
  2. if {C1, C2} is non-convex, then D_{1,2} = infinity; return;
  3. compute GMC_{1,2}:
     - if GMC_{1,2} > C, then D_{1,2} = infinity; return;
     - if GMC_{1,2} = max{GMC_1, GMC_2} (component-wise), then D_{1,2} = 0; return;
  4. compute D_{1,2} (or E_{1,2});
}

merge() {
  1. Calculate the minimal |D_{1,2}| (or |E_{1,2}|): |D|_min (or |E|_min);
  2. Collect all cluster pairs {C1, C2} such that |D_{1,2}| = |D|_min (or |E_{1,2}| = |E|_min);
  3. Construct the cluster pair conflict graph G;
  4. Color G (with e.g. a linear coloring algorithm);
  5. S is the largest set of cluster pairs with the same color;
  6. Merge all cluster pairs in S;
}
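A minimal Python rendering of the clustering loop is sketched below (an illustrative simplification, not the thesis implementation: it merges one closest feasible pair per iteration and omits the incremental distance recalculation and the conflict-graph coloring that let Algorithm D.1 merge several pairs at once). math.inf plays the role of the infinite distance marking an infeasible merge.

```python
import math

def clustering(items, distance):
    """Iteratively merge the closest feasible pair of clusters.

    distance(c1, c2) returns math.inf when merging c1 and c2 is
    infeasible (e.g. the merged cluster would violate the register
    constraint, as in step 3 of distance() in Algorithm D.1).
    """
    clusters = [frozenset([x]) for x in items]
    while True:
        best, pair = math.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i], clusters[j])
                if d < best:
                    best, pair = d, (i, j)
        if pair is None:                 # all distances infinite: stop
            return clusters
        i, j = pair                      # merge the closest feasible pair
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
```

As a usage example, clustering points on a line with a hypothetical "at most two members per cluster" feasibility rule groups {0, 1} and {10, 11} and leaves 20 alone.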


Appendix E

The scheduling-retiming relation

In this appendix, the relation between scheduling and retiming, and the impact of this relation on the lower bound on the edge weight d_{1,2}, is formally derived.

From the minimum timing constraint (6.18) and the definition of p_i, the following holds:

    p_1 + \delta_{1,2} - d_{1,2} \cdot C \le C - 1                     (E.1)

or:

    d_{1,2} \ge \lceil (\delta_{1,2} + p_1 - (C-1)) / C \rceil         (E.2)

Let

    X = \delta_{1,2} / C                                               (E.3)
    Y = (p_1 - (C-1)) / C                                              (E.4)

where -1 < Y \le 0. Then the two lower bounds on d_{1,2}, (6.12) and (E.2), can be formulated as follows:

    d_{1,2} \ge \lfloor X \rfloor                                      (E.5)
    d_{1,2} \ge \lceil X + Y \rceil                                    (E.6)

The second lower bound is tighter than the first one if Y is larger than a certain threshold value. This value can be found by means of figure E.1: the lower bound on d_{1,2} equals \lfloor X \rfloor + 1 if Y > \lfloor X \rfloor - X, or if:

    p_1 > \lfloor \delta_{1,2} / C \rfloor \cdot C - \delta_{1,2} + C - 1    (E.7)

This is illustrated in figure 6.9: the first lower bound on d_{1,2} is \lfloor 5/2 \rfloor = 2. A tighter lower bound is imposed if p_1 > 2 \cdot 2 - 5 + 2 - 1 = 0: d_{1,2} \ge 3.


[Figure E.1: The lower bounds on the edge weight d_{1,2}: the positions of X and X + Y on the positive real line.]

Both lower bounds (E.5) and (E.6) can be combined into one lower bound formula:

    \forall e_{1,2} : \quad d_{1,2}^{init} + r_2 - r_1 \;\ge\; \lfloor \delta_{1,2}/C \rfloor \;+\; \sum_{j_1 > \lfloor \delta_{1,2}/C \rfloor \cdot C - \delta_{1,2} + C - 1} x_{1,j_1,k_1}    (E.8)
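The threshold (E.7) can be checked numerically with a small Python helper (the function names are ours): the combined lower bound always equals the floor bound (E.5), raised by exactly one when p_1 exceeds the threshold.

```python
import math

def d12_lower_bound(delta, C, p1):
    """Combined lower bound on d_{1,2}: the larger of (E.5) and (E.6)."""
    return max(delta // C, math.ceil((delta + p1 - (C - 1)) / C))

def threshold(delta, C):
    """Right-hand side of (E.7): p1 above this makes (E.6) the tighter bound."""
    return (delta // C) * C - delta + C - 1
```

For the example of figure 6.9 (delta_{1,2} = 5, C = 2), the threshold is 0, so any p_1 > 0 raises the bound from 2 to 3; the bound never grows by more than one.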


Appendix F

Derivation of tight timing constraints

This appendix contains the derivation of the tight timing constraints for ILP scheduling with automatic software pipelining. They are based on the timing constraint in [Gebotys 91], but extended for cyclic graphs and software pipelining.

F.1 The timing constraint formulation

The timing constraint in [Gebotys 91] is the following:

    \forall e_{1,2}, \forall j \in [\alpha_1, \omega_1] :
    \sum_{j \le j_1 \le \omega_1} \sum_{k_1} x_{1,j_1,k_1} + \sum_{\alpha_2 \le j_2 < j + \delta_{1,2}} \sum_{k_2} x_{2,j_2,k_2} \le 1    (F.1)

This constraint expresses that it is not possible that o_1 is scheduled at time step j or later and that o_2 is scheduled earlier than j + \delta_{1,2}, which is the expression of a minimum timing constraint. Timing constraint (F.1) can be extended for our scheduling problem (including software pipelining) as follows:

    \forall e_{1,2}, \forall j \in [\alpha_1, \omega_1] :
    \sum_{j \le j_1 \le \omega_1} \sum_{k_1} x_{1,j_1,k_1} + \sum_{\alpha_2 \le j_2 < j + \delta_{1,2} - d_{1,2} \cdot C} \sum_{k_2} x_{2,j_2,k_2} \le 1    (F.2)

where d_{1,2} = d_{1,2}^{init} + r_2 - r_1.

The \sum_{j_2} summation in constraint (F.2) is only meaningful if:

    1 \le j + \delta_{1,2} - d_{1,2} \cdot C \le C - 1                  (F.3)


because otherwise it contains no terms, in which case (F.2) is trivially satisfied because of the set constraint (6.2). This observation is used to eliminate the dependency of the summation boundary on r_1 and r_2. (F.3) can also be written as:

    \lceil (j + \delta_{1,2} - (C-1)) / C \rceil \le d_{1,2} \le \lfloor (j - 1 + \delta_{1,2}) / C \rfloor    (F.4)

In other words, timing constraint (F.2) can only be written with variable-independent summation boundaries if d_{1,2} satisfies certain conditions. Let:

    A = (j + \delta_{1,2}) / C                                          (F.5)
    B = (C - 1) / C                                                     (F.6)
    D = 1 / C                                                           (F.7)

where A, B and D are positive real numbers and B, D < 1. Then, (F.4) can be written as:

    \lceil A - B \rceil \le d_{1,2} \le \lfloor A - D \rfloor           (F.9)

[Figure F.1: The possible conditions on d_{1,2}: the positions of A, A - B and A - D on the positive real line, in the two cases (a) and (b).]

The possible relations of \lceil A - B \rceil and \lfloor A - D \rfloor are shown in figure F.1(a) and (b). There are two possible cases:

1. \lceil A - B \rceil > \lfloor A - D \rfloor
2. \lceil A - B \rceil = \lfloor A - D \rfloor

Only the second case satisfies (F.9), such that (F.9) transforms into:

    d_{1,2} = \lceil (j + \delta_{1,2} - (C-1)) / C \rceil = \lfloor (j - 1 + \delta_{1,2}) / C \rfloor = F_{1,2,j}    (F.10)


[Figure F.2: Illustration of the (d_{1,2} \ne F_{1,2,j}) test: two operations o_1 and o_2 connected by an edge with \delta_{1,2} = 5 and C = 2; (a) o_1 scheduled at time step j = 0, (b) o_1 scheduled at time step j = 1.]

In other words, the condition on the validity of (F.2) is that d_{1,2} has the value given by (F.10). This value is called F_{1,2,j}, such that the timing constraint formulation now becomes:

    \forall e_{1,2}, \forall j \in [\alpha_1, \omega_1] :
    \sum_{j \le j_1 \le \omega_1} \sum_{k_1} x_{1,j_1,k_1} + \sum_{\alpha_2 \le j_2 < j + \delta_{1,2} - F_{1,2,j} \cdot C} \sum_{k_2} x_{2,j_2,k_2} \le 1 + (d_{1,2} \ne F_{1,2,j})    (F.11)

which is only valid if:

    F_{1,2,j} = \lceil (j + \delta_{1,2} - (C-1)) / C \rceil = \lfloor (j - 1 + \delta_{1,2}) / C \rfloor    (F.12)

The term added to the right-hand side of (F.11) is a boolean that has a value of 1 if d_{1,2} \ne F_{1,2,j}. In that case, the timing constraint is made redundant by setting the right-hand side to 2. As an illustration, the timing constraints are generated for the example in figure F.2: for j = 0 (figure F.2(a)), condition (F.12) is satisfied: F_{1,2,j} = \lceil (0+5-1)/2 \rceil = \lfloor (0-1+5)/2 \rfloor = 2. Therefore, the following timing constraint is generated:

    \sum_{0 \le j_1 \le 1} \sum_{k_1} x_{1,j_1,k_1} + \sum_{0 \le j_2 < 1} x_{2,j_2,k_2} \le 1 + (d_{1,2} \ne 2)

Indeed, it can be seen in figure F.2(a) that if d_{1,2} = 2, o_2 cannot be scheduled at time step 0 if o_1 is scheduled at time step 0. For j = 1, however (figure F.2(b)), condition (F.12) is not satisfied: \lceil (1+5-1)/2 \rceil \ne \lfloor (1-1+5)/2 \rfloor. Therefore, no timing constraint is generated for j = 1. This is correct, because if o_1 is scheduled at time step 1, the retiming lower bound constraint (6.19) forces d_{1,2} \ge 3, such that o_2 can be scheduled at time step 0 or at time step 1.
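The validity condition (F.12) is easy to evaluate mechanically. The sketch below (the helper name is ours) returns F_{1,2,j} when the ceiling and floor coincide and None otherwise, reproducing the j = 0 and j = 1 cases of figure F.2.

```python
import math

def F12j(j, delta, C):
    """F_{1,2,j} per (F.10) when condition (F.12) holds, else None
    (no timing constraint is generated for that j)."""
    lo = math.ceil((j + delta - (C - 1)) / C)   # ceiling side of (F.12)
    hi = (j - 1 + delta) // C                   # floor side of (F.12)
    return lo if lo == hi else None
```

For delta_{1,2} = 5 and C = 2 this gives F = 2 at j = 0 and no constraint at j = 1, and over the whole window the valid F values take at most two distinct values, consistent with the remark in Section F.2.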


F.2 The formulation of the (d_{1,2} \ne F_{1,2,j}) test

The only thing left to do is the formulation of the test in the right-hand side of (F.11). An extra binary variable \beta_{1,2,j} is introduced:

    \beta_{1,2,j} = 0 if d_{1,2} = F_{1,2,j}, and \beta_{1,2,j} = 1 otherwise.    (F.13)

There is a \beta_{1,2,j} variable for each F_{1,2,j}, and since d_{1,2} can only match one F_{1,2,j}, the \beta_{1,2,j} variables have to satisfy the following set constraint:

    \sum_{F_{1,2,j}} \beta_{1,2,j} \ge \#\{F_{1,2,j}\} - 1              (F.14)

where \#\{F_{1,2,j}\} is the number of different F_{1,2,j} values over all time steps j.

The definition of \beta_{1,2,j} is implemented by means of the following set of constraints:

    M \cdot (y_{1,2,j} - 1) + 1 \le d_{1,2} - F_{1,2,j} \le M \cdot y_{1,2,j}
    M \cdot (z_{1,2,j} - 1) + 1 \le F_{1,2,j} - d_{1,2} \le M \cdot z_{1,2,j}
    \beta_{1,2,j} = y_{1,2,j} + z_{1,2,j}                               (F.15)

where y_{1,2,j} and z_{1,2,j} are binary integer variables and M is a large integer such that:

    M > \max_{d_{1,2}} | d_{1,2} - F_{1,2,j} |                          (F.16)

The first constraint in (F.15) sets y_{1,2,j} = 1 if d_{1,2} > F_{1,2,j}, and y_{1,2,j} = 0 otherwise. The second constraint in (F.15) sets z_{1,2,j} = 1 if F_{1,2,j} > d_{1,2}, and z_{1,2,j} = 0 otherwise.

By inspection of condition (F.12), it can be seen that there are at most 2 possible values of F_{1,2,j}. Therefore, at most 4 variables y_{1,2,j} and z_{1,2,j} are required for an edge e_{1,2}.

If F_{1,2,j} = 0, the use of a \beta_{1,2,j} variable is not necessary: the (d_{1,2} \ne F_{1,2,j}) term can be substituted by d_{1,2}. Furthermore, if the maximum window W_{1,2}^{max} of the signal corresponding to e_{1,2} is fixed such that W_{1,2}^{max} = F_{1,2,j}, then the (d_{1,2} \ne F_{1,2,j}) term is substituted by F_{1,2,j} - d_{1,2}. Finally, if W_{1,2}^{max} < F_{1,2,j}, constraint (F.11) is always satisfied and thus redundant.


F.3 Summary

The formulation of a minimum timing constraint is thus as follows:

    \forall e_{1,2}, \forall j \in [\alpha_1, \omega_1] :
    \sum_{j \le j_1 \le \omega_1} \sum_{k_1} x_{1,j_1,k_1} + \sum_{\alpha_2 \le j_2 < j + \delta_{1,2} - F_{1,2,j} \cdot C} \sum_{k_2} x_{2,j_2,k_2} \le 1 + \beta_{1,2,j}    (F.17)

which is only valid if:

    F_{1,2,j} = \lceil (j + \delta_{1,2} - (C-1)) / C \rceil = \lfloor (j - 1 + \delta_{1,2}) / C \rfloor    (F.18)

and where

    \beta_{1,2,j} = d_{1,2}                 if F_{1,2,j} = 0
    \beta_{1,2,j} = F_{1,2,j} - d_{1,2}     if F_{1,2,j} = W_{1,2}^{max}
    \beta_{1,2,j} = 1                       if F_{1,2,j} > W_{1,2}^{max}    (F.19)

Else, if 0 < F_{1,2,j} < W_{1,2}^{max}, \beta_{1,2,j} is such that:

    M \cdot (y_{1,2,j} - 1) + 1 \le d_{1,2} - F_{1,2,j} \le M \cdot y_{1,2,j}
    M \cdot (z_{1,2,j} - 1) + 1 \le F_{1,2,j} - d_{1,2} \le M \cdot z_{1,2,j}
    \beta_{1,2,j} = y_{1,2,j} + z_{1,2,j}                               (F.20)
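The big-M construction (F.20) can be sanity-checked exhaustively in Python over toy ranges (the function name is ours; M is assumed to satisfy (F.16)): for every (d, F) pair exactly one (y, z) assignment is feasible, and it makes y + z the indicator of d \ne F.

```python
def beta(d, F, M):
    """Solve the big-M constraints of (F.20) for the binaries (y, z);
    the unique feasible pair yields beta = y + z = [d != F]."""
    feasible = [(y, z) for y in (0, 1) for z in (0, 1)
                if M * (y - 1) + 1 <= d - F <= M * y
                and M * (z - 1) + 1 <= F - d <= M * z]
    assert len(feasible) == 1, "M too small for (F.16)"
    y, z = feasible[0]
    return y + z
```

This mirrors the argument in Section F.2: y is forced to 1 exactly when d exceeds F, z exactly when F exceeds d, and both are forced to 0 when they are equal.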


Appendix G

Nederlandse samenvatting (1)

G.1 Introduction

The broad market for digital signal processing integrated circuits and the requirement of low power consumption (due to, for example, the portability of the systems) justify integrating signal processing applications on a single chip. To shorten the design time of these complex systems, powerful computer-aided design methods and software are being developed. A very important requirement for this design software is that it must be able to optimize the area and/or the throughput of the chip. This thesis centers on the optimization of one specific component of the chip area: the area occupied by the registers. The thesis shows that there is a strong relation between the number of registers needed to store the operands and results of operations, and the ordering of these operations in time.

G.1.1 The design of digital systems on a single chip

The application domain of the techniques proposed in this thesis is real-time digital signal processing. There proves to be a large market for real-time digital signal processing, which is a segment of the consumer electronics market: ISDN (Integrated Systems Digital Network), HDTV (High-Definition Television), GSM (Global System for Mobile Communication), CD (Compact Disc), virtual reality and multimedia are but a few examples. Because of the large production volumes and the portability requirements (low power consumption), these systems are more and more

(1) Dutch summary (translated to English here).


integrated on a single chip.

Real-time digital signal processing systems can be classified according to their throughput. Each of these classes uses a different architecture style to arrive at an efficient on-chip realization. The techniques in this thesis mainly apply to real-time digital signal processing systems with a low to medium throughput (sampling frequencies from 1 kHz to 1 MHz). The following architecture styles are suitable for signal processing with a low to medium throughput:

- General-purpose DSP processors (such as the Texas Instruments TMS320Cxx DSP processor family and the Motorola M56000 series) can be used [Lee 88] [Lee 89a]. The big advantage of these processors is their flexibility (programmability), although this usually comes at the expense of power consumption and of efficiency in executing some operations on the available hardware.

- By using an application-specific architecture (ASIC), the power consumption and the area of the chip can be reduced. With an ASIC architecture it is also possible to speed up certain time-critical operations so as to increase the global throughput of the chip. For applications with a low to medium throughput, a so-called microcoded VLIW (2) processor architecture is often used.

- Both of the above architecture styles can also be mixed in a heterogeneous architecture style: a reusable DSP processor module is then combined with application-specific data paths for the most time-critical operations [Goossens 93]. This architecture style offers the advantages of flexibility, hardware efficiency and optimized power consumption.

G.1.2 The relation between scheduling and register usage

The impact of the ordering of operations in time (scheduling) on register usage is illustrated by means of a simple example in figure 1.1, where two alternative orderings of the same piece of C code are compared. The ordering in figure 1.1(a) locally uses one register more than the ordering in figure 1.1(b). Many authors in the field of

(2) VLIW stands for "Very Large Instruction Word".


178 BIJLAGE G. NEDERLANDSE SAMENVATTINGsoftwarecompilatie erkennen dit verband tussen tijdsordening en registerge-bruik [Aho 88] [Hennessy 90] [Hendren 92]. Sommige hoog-performante com-pilers voeren zelfs een herordening van de code door om het registergebruikte optimaliseren [Aho 88]. Ook in hoog-niveau synthese wordt de impact vantijdsordening op het registergebruik in rekening gebracht: meestal wordt hetregistergebruik tijdens de tijdsordening geminimaliseerd.In dit proefschrift wordt het verband tussen tijdsordening en registerge-bruik bestudeerd door een oplossing voor te stellen voor het volgende probleem:Construeer een tijdsordening zodat het registergebruik voldoet aan beperkingenop het beschikbaar aantal registers, en zodat het totaal aantal klokcycli wordtgeminimaliseerd.G.1.3 Het Cathedral-2nd synthesesysteemDe technieken voor registeroptimalisatie die in dit proefschrift worden voor-gesteld, worden op dit moment toegepast in het kader van een bestaand sys-teem voor de hoog-niveau synthese van re�ele-tijdstoepassingen met een lagetot middelgrote doorvoersnelheid (Cathedral-2nd) [Lanneer 93a]. De ar-chitectuurstijl in Cathedral-2nd is een microgecodeerde VLIW-processorarchitectuurstijl (zie �guur 1.2).De hoog-niveau synthese voor deze architectuurstijl wordt opgedeeld ineen aantal synthesetaken. De synthesetaak met de grootste impact op hetontwerp wordt eerst uitgevoerd. Zo worden bijvoorbeeld de toewijzing vanmultidimensionele signalen aan RAM-geheugens en de samenstelling van deEXU-datapaden eerst behandeld (zie �guur 1.3). Daarna volgt de gedetail-leerde toewijzing van bewerkingen (of kettingen van bewerkingen) aan EXU-datapaden, signalen aan registerbestanden, signaal-transfers aan interconnec-tienetwerken. De kwaliteit van elk van deze toewijzingen kan gecontroleerdworden door een tijdsordening uit te voeren, zelfs al is het ontwerp nog nietvolledig. 
If the scheduling reveals that the total number of clock cycles is too large, a new iteration over the design can be started.

G.2 Optimization of register usage

The importance of optimizing register usage is well established in the literature. Contributions in this area have been made in the domain of software compilation technology (where respecting a fixed number of available registers is important), and in the domain of high-level synthesis (where minimizing the number of registers is one of the design criteria).
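The interaction between operation ordering and register usage can be made concrete with a small, hypothetical sketch (not the example of Figure 1.1): counting the peak number of simultaneously live values for two legal orderings of the same straight-line computation. All operation names are illustrative; a value is assumed to occupy a register from its production until its last consumption.

```python
def peak_live(schedule, uses):
    """Peak number of simultaneously live values for one linear ordering.

    schedule: operation names in execution order (each produces a value
    named after itself); uses: dict mapping each operation to the values
    it consumes.  A value is live from its production to its last use.
    """
    last_use = {}
    for step, op in enumerate(schedule):
        for value in uses[op]:
            last_use[value] = step
    live, peak = set(), 0
    for step, op in enumerate(schedule):
        live.add(op)                      # the result becomes live
        peak = max(peak, len(live))
        live -= {v for v in uses[op] if last_use[v] == step}
    return peak

# Hypothetical data flow: c12 = p1 + p2, c123 = c12 + p3.
uses = {"p1": [], "p2": [], "p3": [],
        "c12": ["p1", "p2"], "c123": ["c12", "p3"]}

lazy  = ["p1", "p2", "p3", "c12", "c123"]   # produce everything first
eager = ["p1", "p2", "c12", "p3", "c123"]   # consume as soon as possible
print(peak_live(lazy, uses), peak_live(eager, uses))   # 4 3
```

The eager ordering consumes p1 and p2 as soon as possible and therefore needs one register fewer at its peak, which is the kind of difference compared in Figure 1.1.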


G.2.1 Register optimization in software compilation technology

Compilers that generate code for general-purpose microprocessors usually work per "basic block" (a code segment without branch instructions): algorithms exist that minimize register usage by carefully choosing the order of the operations. These algorithms only give good results, however, if the operations are organized in a tree structure and if they cannot be executed concurrently [Aho 88] [Aho 76].

Once the order of the elementary operations is fixed, signals are assigned to registers. This is usually done by means of graph coloring algorithms [Chaitin 82] [Hendren 92] [Mueller 93]. These algorithms also determine which signals are temporarily stored in RAM memory.

Recently, a few interesting alternatives have been formulated for code generation and register assignment in parallel (VLIW) architectures [Rimey 89] [Hartmann 92]: the scheduling of the operations and the assignment of the operands to registers are performed simultaneously. The advantage is that the interaction between scheduling and register usage can be assessed well. The disadvantage, however, is that this interaction is only considered locally.

G.2.2 Register optimization in high-level synthesis

In high-level synthesis it is mainly during scheduling that register usage is distributed (balanced) as evenly as possible over time. Several mechanisms for balancing register usage have been published [Paulin 89] [Verhaegh 91] [Rabaey 90] [Potkonjak 92] [Hwang 91] [Rompaey 92]. None of these techniques is able to take a bound on the number of registers used into account.

G.2.3 A new approach

In this thesis a new approach to the optimization of register usage is proposed.
In the new method, a global view of the interaction between scheduling and register usage is maintained. The method is implemented as a "script" (a sequence of different tasks, see Figure 2.2). The script consists of two major steps. The first step is a preprocessing of the scheduling, in which the register cost over all possible orderings is reduced until it satisfies the constraints. This mainly involves adding extra ordering constraints between individual operations, which implies a strong interaction with the second step in the script (the scheduling itself). The result of this preprocessing is that the constraints on the available number of registers no longer need to be taken into account during scheduling.

The principle of preprocessing is illustrated with an example in Figure 2.4:

1. Extra ordering between operations is added, as indicated by the thick dashed arrow in Figure 2.4(a).

2. The lifetimes of signals can also be kept short by keeping their productions and consumptions close together in time. This is achieved by grouping operations, as indicated by the dotted contours in Figure 2.4(a).

As a result of the preprocessing, the signals marked with a black dot in Figure 2.4(a) occupy only one register after scheduling.

The scheduling itself consists of two distinct phases: in a first phase, each group of operations (as determined during preprocessing) is scheduled separately (see Figure 2.4(b)). In a second phase, the scheduled groups of operations are combined into a final schedule.
A technique for this second scheduling phase, based on integer linear programming (ILP), is presented in Chapter 6 of this thesis.

The motivations for the new approach are the following:

- In contrast with most techniques in high-level synthesis, this approach is able to take constraints on the available number of registers into account.

- During the preprocessing of the scheduling, a global view of the effect of the register optimization is maintained, in contrast with most techniques from the literature.

- The preprocessing is independent of the technique chosen for the final scheduling: the designer has control over the run time (and thus the performance) of the scheduling.

G.3 The signal flow graph model

The signal flow graph model of a real-time signal processing application consists of nodes and edges. The nodes are the individual operations of the behavioral description (such as additions and multiplications) or hierarchically nested loops, which are in turn described by a signal flow graph. The edges of the graph represent relations between the nodes: the transfer of a signal (data edge) or an ordering relation (sequence edge).

Loop hierarchy

Repeated computations ("loops") are common in the behavioral descriptions of signal processing applications. Signal buffers are used to indicate that a signal is consumed in a subsequent loop iteration. These signal buffers are modeled as weights on the edges of the signal flow graph (see Figure 3.4).

Condition hierarchy

Conditional operations that take place under the same condition are grouped in a single block. Blocks with exclusive conditions (conditions that are never valid together) are grouped in turn (see Figure 3.7). This yields advantages in terms of complexity (and thus run time) during the preprocessing of the scheduling.

Estimation of the maximum register cost

In order to find a schedule that does not use more than an available number of registers by manipulating the signal flow graph before its scheduling, the maximum register cost is an important criterion. The maximum register cost can be computed by determining the size of the "maximum clique" in the signal conflict graph [Tseng 86] [Grass 90] [Even 79]. Since solving this problem requires exponential run time, however, an alternative method is proposed in this thesis: the maximum register cost is approximated very well by the maximum number of simultaneously live signals [Stok 92] [Hendren 92]. Theorem 3.1 shows that this number can be computed with the retiming algorithm [Leiserson 83].
The latter algorithm experimentally exhibits a low, polynomial run time, which makes it suitable for intensive use during the preprocessing of the scheduling.

G.4 Cut reduction

A first technique for preprocessing a signal flow graph such that a given available number of registers is not exceeded reduces the maximum number of parallel data edges in the graph. A set of parallel data edges is called a "cut", hence the name "cut reduction". The principle of cut reduction is illustrated in Figure 4.1: by adding a sequence edge, the maximum cut with a register cost of 4 registers in Figure 4.1(a) can be reduced to a maximum cut with a cost of 3 registers (Figure 4.1(b)). Register usage can be reduced even further by temporarily storing signals in RAM memory (see Figure 4.1(c)): this alternative is also considered by cut reduction.
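The effect cut reduction aims at can be reproduced on a toy graph: adding a sequence edge removes the worst orderings from the search space, and thereby lowers the maximum register cost over all remaining schedules. The sketch below is illustrative only (brute force over all orderings, hypothetical operation names); the thesis itself relies on flow computations and retiming for this.

```python
from itertools import permutations

def max_register_cost(ops, data, extra_seq=()):
    """Worst-case peak register pressure over all legal orderings.

    data: dict op -> list of consumed values (each op produces one value
    named after itself); extra_seq: additional (before, after) sequence
    edges.  Brute force over permutations -- only viable for toy graphs.
    """
    prec = [(u, v) for v in data for u in data[v]] + list(extra_seq)
    consumers = {u: [v for v in data if u in data[v]] for u in data}
    worst = 0
    for order in permutations(ops):
        pos = {op: i for i, op in enumerate(order)}
        if any(pos[u] > pos[v] for u, v in prec):
            continue                       # violates a data or sequence edge
        last = {u: max(pos[v] for v in cs) for u, cs in consumers.items() if cs}
        live, peak = set(), 0
        for i, op in enumerate(order):
            live.add(op)
            peak = max(peak, len(live))
            live -= {u for u in data[op] if last[u] == i}
        worst = max(worst, peak)
    return worst

# Hypothetical graph: t = x * y, r = t + z.
data = {"x": [], "y": [], "z": [], "t": ["x", "y"], "r": ["t", "z"]}
ops = list(data)
print(max_register_cost(ops, data))                   # 4
print(max_register_cost(ops, data, [("t", "z")]))     # 3 (z sequenced after t)
```

In this toy case the bound over all remaining schedules drops from 4 to 3 registers, analogous to the step from Figure 4.1(a) to Figure 4.1(b).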


Branch and bound

Cut reduction is implemented as a branch-and-bound search strategy: elementary transformations (such as adding a sequence edge, or assigning a signal to RAM memory) are performed, and a cost function is evaluated after each of them. The cost function reflects the impact of a given transformation on the execution time of the signal flow graph. A valid solution is reached when the maximum register cost no longer exceeds the available number of registers. The best solution is a valid solution with minimal cost.

Selection heuristics

The order in which the elementary transformations are performed has a large impact on the total run time of the branch-and-bound algorithm. Therefore, a number of selection heuristics are worked out in the thesis.

Final phase

For each intermediate solution during the branch-and-bound search, the maximum cut and the maximum register cost are computed by means of the approximation technique mentioned above. It is, however, the exact maximum register cost that must satisfy the constraints. A solution of cut reduction must therefore be verified by computing its exact maximum register cost. Thanks to the quality of the approximation, this rarely requires additional cut reduction steps.

Experiments

The cut reduction technique was tried out on a number of signal flow graphs of various sizes and with various constraints on the register cost. These experiments show that the optimal solution is found very quickly, after which most of the run time is spent verifying that solution. This indicates that the selection heuristics work well. It is also observed that the run times for large signal flow graphs with tight constraints on the register cost are long.
Therefore, Chapter 5 of this thesis proposes a complementary technique, which is used to reduce the run times of cut reduction.

G.5 Clustering

An alternative to cut reduction starts from the principle that operations that communicate signals are best kept together in time, in order to keep the lifetimes of those signals as short as possible. This principle is illustrated in Figure 5.1: by defining groups of operations ("clusters") and computing a schedule for each cluster separately, the overlap between the lifetimes of signals (i.e., the register cost) can be reduced. It is also important to keep the number of clusters as small as possible, since this also reduces the run times of the cut reduction (which follows clustering) the most.

Hierarchical clustering

To construct the smallest possible number of clusters, a hierarchical clustering algorithm is used [Johnson 67]. This algorithm makes use of a distance measure between clusters: the clusters that are closest to each other are the first candidates to be merged into a new cluster. The hierarchical clustering method consists of the following steps:

1. The distance is computed between every pair of clusters (whose union satisfies the definition of a cluster). This distance reflects the effect of a potential union of the two clusters on the global maximum register cost of the signal flow graph. A few alternative distance measures are formally defined in Section 5.3.3 of the thesis.

2. After computing all distances, the minimum distance is determined (see Definition 5.7).

3. The clusters that are at this minimum distance from each other can be merged to form a new cluster. Here, the candidate cluster pairs must be disjoint: using a graph coloring, a maximum number of disjoint cluster pairs is selected for merging.

4. After merging a number of cluster pairs, distances are computed again (see step 1). To save run time, not all distances are recomputed during every iteration of the clustering process. At the same time, this introduces an error.
This error is reduced by recomputing all distances from time to time (with a period specified by the designer).

Integration with cut reduction

The clustering method cannot guarantee that a set of clusters is found such that the maximum register usage of the signal flow graph stays within the constraints. Therefore, the preprocessing usually has to be continued with cut reduction. The presence of clusters, however, strongly limits the run time of this cut reduction, since instead of the order of individual operations, only the order of fixed clusters of operations has to be determined.
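A stripped-down version of the merging loop may clarify the mechanics. The distance metrics of Section 5.3.3 are based on register cost; as a crude stand-in, this sketch treats clusters that exchange many signals as "close", and it merges only one closest pair per iteration (the thesis merges several disjoint pairs at once via graph coloring). All names and the stop criterion are illustrative.

```python
def cluster(ops, comm, target):
    """Greedy agglomerative clustering of operations.

    comm[(a, b)]: number of signals exchanged between operations a and b;
    clusters exchanging the most signals are considered closest (a crude
    stand-in for the register-cost based distances of Section 5.3.3).
    Stops at `target` clusters or when no pair communicates anymore.
    """
    clusters = [frozenset([op]) for op in ops]

    def traffic(c1, c2):
        return sum(comm.get((a, b), 0) + comm.get((b, a), 0)
                   for a in c1 for b in c2)

    while len(clusters) > target:
        best, i, j = max((traffic(c1, c2), i, j)
                         for i, c1 in enumerate(clusters)
                         for j, c2 in enumerate(clusters) if i < j)
        if best == 0:
            break                          # remaining clusters share no signals
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical graph: a and b exchange two signals, c and d exchange one.
groups = cluster("abcd", {("a", "b"): 2, ("c", "d"): 1}, target=2)
print(sorted(sorted(c) for c in groups))   # [['a', 'b'], ['c', 'd']]
```

The strongly communicating pairs end up in the same cluster, so their signals can be kept short-lived when each cluster is scheduled separately.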


G.6 Scheduling by integer linear programming

The script for preprocessing and scheduling as presented in Figure 2.2 contains two scheduling phases (the scheduling of the individual clusters and the global scheduling). In the thesis this is called hierarchical scheduling. A scheduling method based on integer linear programming (ILP) is proposed that can be used for both scheduling phases. The proposed method has the following properties:

- Optimal scheduling, integrated with optimal loop folding.
- Efficient generation of the linear inequalities that model conditional operations.
- Efficient formulation of the linear inequalities for the ordering constraints between operations.
- Minimization of signal buffers during scheduling.
- There is a limit on the size of the signal flow graphs that can be handled with ILP scheduling.

These properties are briefly discussed in the following paragraphs.

Optimal loop folding

Loop folding is a powerful optimization that is used in almost all real-time signal processing applications to speed up the time-critical loops [Parhi 91] [Girczyc 87] [Goossens 89b] [Groot 92] [Potkonjak 90] [Chen 91]. It is therefore worth investing run time in loop folding, and thus computing it with ILP as well.

Modeling conflicts between conditional operations

For every time step and for every functional unit, a linear inequality must be formulated that forbids two operations from taking place on the same unit at the same time. The thesis proposes a general method for this, taking into account that there may be conditionally exclusive operations.
This method has the advantage that it does not increase the complexity of the ILP problem, in contrast with most methods in the literature [Hwang 91] [Lee 89b].

Ordering constraints

The formulation of an ordering constraint in [Gebotys 91] is extended in the thesis to cover the extra freedom resulting from loop folding. The result is a more efficient formulation than the trivial extension of the formulation in [Lee 89b].
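To make the time-step formulation tangible: conceptually, each operation o gets binary variables x[o][t] with sum over t of x[o][t] = 1, precedence constraints relate the chosen time steps, and per time step at most as many operations may run as there are functional units. The sketch below enumerates assignments by brute force instead of calling an ILP solver such as Lindo, and it models neither loop folding nor conditional exclusiveness; all names are illustrative.

```python
from itertools import product

def schedule(ops, prec, n_units, horizon):
    """Minimise the makespan of a time-step assignment.

    Equivalent to the ILP view with binaries x[op][t]: each op gets one
    time step, t[u] < t[v] for every precedence edge (u, v), and at most
    n_units ops share a step.  Exhaustive search stands in for the solver.
    """
    best = None
    for steps in product(range(horizon), repeat=len(ops)):
        t = dict(zip(ops, steps))
        if any(t[u] >= t[v] for u, v in prec):
            continue                      # precedence violated
        if any(sum(t[o] == s for o in ops) > n_units for s in range(horizon)):
            continue                      # resource conflict in step s
        if best is None or max(steps) < max(best.values()):
            best = t
    return best

# Hypothetical graph: c needs a and b, d needs c; two identical units.
t = schedule("abcd", [("a", "c"), ("b", "c"), ("c", "d")], n_units=2, horizon=4)
print(max(t.values()))   # 2  (a, b in step 0; c in step 1; d in step 2)
```

A real ILP solver explores this space with branch-and-bound and cutting planes instead of enumeration, which is what makes the 600-variable limit reported below meaningful.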


Minimization of signal buffers

An additional advantage of the proposed method is that the lengths of the signal buffers can be optimized during scheduling. This is important in the high-level synthesis of signal processing applications and is usually left out of consideration in the literature.

The run time of ILP scheduling

Since ILP scheduling is an NP-complete problem, executing it requires exponential run time. In other words, there are limits to the applicability of ILP for scheduling. This limit was established experimentally at 600 integer variables (for the commercial package Lindo). For scheduling problems with more than 600 integer variables, another scheduling technique is selected automatically, for example list scheduling [Goossens 90] [Rompaey 92].

G.7 Experiments

The techniques for scheduling preprocessing and the ILP scheduling were implemented in a program (Tron). The overall performance of the methodology was verified on a number of signal processing applications from practice.

Using constraints on register usage

The effect of cut reduction and clustering on the signal flow graph was examined in detail for a syndrome generation application for error correction (see Figure 7.1 and Figure 7.2 for the schedules under two different register constraints). The tighter the register constraints, the more extra ordering has to be added to the design and the more clock cycles the total execution takes. The use of different register constraints, and the impact on the number of clock cycles, is also illustrated with a coding application for speech recognition (Figure 7.3).

Exploration of the search space

By varying the register constraints on a design, part of the scheduling search space can be explored.
These search spaces are plotted for two different applications in Figure 7.4 and Figure 7.5. Experiments show that merely by influencing the scheduling of a signal flow graph (and thus without temporarily storing signals in RAM memory), the register cost can be varied by some 30%. Furthermore, the use of ILP techniques for scheduling yields shorter execution times (in clock cycles).


G.8 Conclusion

Much has been published on register optimization and its relation to scheduling, both in the context of software compilation and in the context of high-level synthesis. In high-level synthesis, the register cost is usually optimized during scheduling [Paulin 89] [Verhaegh 91] [Potkonjak 92] [Hwang 91] [Rompaey 92]. Most of these techniques do not take constraints on the available number of registers into account, and the few techniques that do make only local decisions [Rimey 89] [Hartmann 92]. In software compilation, on the other hand, register optimization is usually performed for a fixed code order [Aho 88] [Chaitin 82] [Hendren 92] [Mueller 93]. If more registers are needed than are available, a number of signals are temporarily stored in RAM memory. The relation between scheduling and register usage is not considered here.

Therefore, this thesis has proposed a new methodology for scheduling and register optimization: the register optimization takes place during the preprocessing of the scheduling, so that the constraints on the available number of registers do not have to be taken into account during the actual scheduling. During the preprocessing, a global view of the optimizations is maintained, and there is a strong link with scheduling through the addition of extra ordering constraints. By varying the constraints on the available number of registers, the designer can explore different schedules. The methodology also allows any scheduling technique to be used.

The proposed method cannot guarantee the optimality of the solution (in terms of the total number of clock cycles), since the register optimization and the scheduling are not performed simultaneously. In practice, however, optimality usually turns out to be a utopia that cannot be reached within reasonable run times.
The method proposed in this thesis, on the other hand, delivers solutions of high quality, and does so within practical run times.

The thesis contains the following contributions:

- A new method for scheduling and register optimization, based on preprocessing, is proposed [Depuydt 91].

- Efficient techniques for preprocessing are presented:
  - Cut reduction lowers the register cost by adding extra ordering constraints between the operations in a signal flow graph, or by temporarily storing signals in RAM memory [Depuydt 93a].
  - Clustering reduces the possible overlap of the signals and the scheduling freedom of the operations that produce and consume these signals [Depuydt 90] [Depuydt 91].

- For efficiently estimating the maximum register cost of a design, a new technique was proposed, based on the retiming of a signal flow graph [Depuydt 92].

- A new formalism for scheduling by means of integer linear programming was described [Depuydt 93b]. The formalism explicitly takes loop folding into account and allows the optimization of the lengths of the signal buffers.

The method was implemented and applied to several applications of practical importance. These showed, for example, that the register cost can be improved by up to 30% (compared with designs that neglect the register cost) merely by adding extra ordering constraints during preprocessing.

In the near future, consumer electronics and telecommunications will increasingly use a combination of standard DSP components and application-specific hardware [Paulin 92] [Goossens 93]. Design therefore increasingly becomes hardware-software co-design, in which techniques that are applicable both in high-level synthesis and in software compilation are central. This thesis has highlighted the constraint on the number of available registers. Other constraints must be taken into account as well, however, such as instruction encoding constraints [Praet 93] and branch constraints [Kifli 93].


Index

accelerator data path, 3
alignment, 9, 11
application specific, 2-3
architecture, 2-3
  cooperating data-path, 3
  heterogeneous, 3
  regular array, 3
  VLIW, 102
    microcoded, 2, 7
ASIC, see application specific
back-tracking, 68, 71
basic block, 18
basic move, 70
bit-true, 5, 9
BMC, see maximum bridging edges register cost
branch-and-bound, 67, 75, 79, 125, 126, 157
  basic move, 68
  heuristics, 72-73, 79
  pruning, 68, 74, 79
bus, 9, 120
  merging, 9, 120
Cathedral-2nd, 6, 105, 120
CBC compiler, 20
chaining, 7, 10
chromatic number, 112
cluster, 82, 108
  pair, 89
clustering, 28, 81-97, 132, 140, 167-168
  cluster merging, 86, 89, 90
    a priori, 86, 88
  complete cover, 83
  complexity, 92
  greedy, 85
  hierarchical, 85, 92
  multistage, 93
  process, 91
  stop criterion, 94
CMB, see condition merge block
CMC, see constrained maximum register cost
code motion, 19
compaction, 19, 100
condition merge block, 40, 53, 94, 162, 164
conditions, 40-41, 53, 94, 115, 162
  block, 40
  exclusive, 40, 111
  flattening, 94
constraint
  bus, 115
  hardware, 4
  interconnection, 106
  port, 115
  register, 114, 132-134, 137
  register access, 115
  register cost, 61
  resource, 105, 109-115
  retiming, 116
  set, 105, 108, 125, 126
  timing, 35, 105, 116-119, 171-174
    projection, 117
consumer electronics, 2
controller
  generation, 9
  multibranch, 7
cut reduction, 28, 65-80, 82, 84, 92, 96, 132, 140, 157-158
  hierarchical, 80
  reduced model, 97
data routing, 9, 11, 20, 21, 79, 115, 141
data-path mapping, 10
delay line, 37
  window, 39, 53, 69, 107
delay operation, 37
Dinic's, 161
distance, 88
  error, 90, 93
  metric, 85-88, 92
  minimum, 88
  normal value, 88
  recomputation, 90, 92
  ultra-metric, 90, 93
DSP core, 3, 13, 141
edge, 32-33
  bridging, 58, 96
  cost, 33, 56
    correction, 61, 155
  data, 32
    decomposition, 40
  fanout, 32, 52, 56, 123, 155, 163
  latency, 35
  sequence, 33, 67, 69, 77, 82, 97
  weight, 38, 106
    lower bound, 169
entry node, 39
execution unit, 7, 10
  instance assignment, 9, 10, 114
exit node, 39, 52
EXU, see execution unit
flow, 161
  lower bound, 161
  network flow, 161
  reduction, 164
functional pipelining, 102
general-purpose DSP processor, 2
GMC, see global maximum register cost
Gomory cutting plane, 125
graph
  circular arc, 22
  coloring, 90
  control-data flow, 31
  convex, 82, 86
  interference, 22
  overlap, 24
  perfect, 111, 113
  resource conflict, 107, 110, 113
  signal flow, 31, 36
  triangulated, 111
GSM, 2, 11, 12
hardware-software co-design, 141
HDSFG, see signal flow graph
hierarchy, 28
HLDM, see data-path mapping
HLMM, see memory management
ILPS, 124, 132
implicit enumeration, 125
initial value, 38
instruction ordering, 6, 18
integer facet, 104, 119
interconnection
  cost, 120
  definition, 9, 11
iteration period bound, 101
last value, 38
left edge algorithm, 24
leveling, see topological sorting
linear programming, 62
LLM, see low-level mapping
loop, 36-39, 94-96
  folding, 103
  hierarchy, 13, 28, 109
  procedural, 36
  unfolding, 103
  unrolling, 102
  winding, 103
loop invariant signals, 39
low-level mapping, 10, 120
maxcut, 55, 66, 161
maxflow, 161, 164
maximum clique, 47, 75, 79, 107, 112, 113, 157
maximum independent set, 90
memory management, 9, 33
microcode, 7
  symbolic, 11
minflow, 47, 62, 63, 159, 161, 164
  hierarchical, 62, 162-163
MPES, see maximum parallel edge set
network, 159-161
parallel edge set, 55, 56
  maximum, 55, 66
partitioning, 85, 91
PES, see parallel edge set
port latency, 34
power, 2, 4
rate-optimal, 101
real-time, 5
  DSP, 1, 2, 12
refinement, 10
register
  allocation, 18, 19
  assignment, 9, 11, 18, 22-25, 33
  cost, 12
    estimation, 47
  file, 9
    assignment, 20
  maximum cost
    array, 57
    bridging edges, 58, 96
    constrained, 61, 83
    estimation, 46, 55, 140
    exact, 47
    global, 58, 66, 68
  merging, 25
  offset cost, 96
  optimization, 11
  transfer, 10, 32
  usage, 6
resource conflict, 32
retiming, 51, 52, 63, 103, 118, 140, 169
  formulation, 53
  hierarchical, 49-53, 62, 75, 157, 166
  model, 49
  variable, 106
router, 9
RSP, see real-time DSP
RT, see register transfer
scheduling, 6, 7, 11, 19, 44-46, 72, 118, 169
  balancing, 21-22
  cluster, 28, 100, 123
  cyclo-static, 102
  force-directed, 103
  freedom, 107
  hierarchical, 100
  integer programming, 99-131, 137, 138, 140
  integrated environment, 137
  list, 102, 130
  macronode, 28, 99, 100, 105, 108, 109, 121, 124
  multi-processor, 101
  preprocessing, 26-28, 134
  rotation, 103
  script, 25
  static, 44
  trace, 19
script, 9
silicon assembly, 3, 10
Simplex, 62, 119
  revised, 63, 125
single-chip, 2
software compilation, 5
software pipelining, 52, 53, 101-103, 115, 116, 118, 128, 171
spilling, 19, 66, 67, 69, 71-73, 92, 96, 97, 134, 141
start time, 34, 106
synthesis
  high-level, 4, 20, 102
  RTL, 3, 10
  system level, 4
  tasks, 7-9
targeted compilation, 6
throughput, 35, 107
time step, 46, 106, 117
topological sorting, 72, 107
total unimodularity, 53, 62
Tron, 76, 132
user interaction, 4, 15
very large instruction word, 2
VLIW, see very large instruction word