Code Reordering of Decision Support Systems for optimized Instruction Fetch

Alex Ramírez, Josep-L. Larriba-Pey, Carlos Navarro, Xavi Serrano, Josep Torrellas*, Mateo Valero
Computer Architecture Department (UPC), Universitat Politècnica de Catalunya
Jordi Girona 1-3, 08034 Barcelona
* University of Illinois at Urbana-Champaign, USA.

Abstract

Instruction fetch bandwidth is feared to be a major limiting factor to the performance of future wide-issue aggressive superscalars. Consequently, it is crucial to develop techniques to increase the number of useful instructions per cycle provided to the processor. Unfortunately, most of the past work in this area has largely focused on engineering workloads rather than on the more challenging, badly-behaved popular commercial workloads.

In this paper, we focus on Database applications running Decision Support workloads. We characterize the locality patterns of database kernel code and find frequently executed paths. Using this information, we propose an algorithm to lay out the basic blocks of the database kernel for improved I-fetch. Finally, we evaluate the scheme via simulations.

Our results show a miss reduction of 60-98% for realistic I-cache sizes and a doubling of the number of instructions executed between taken branches. As a consequence, we increase the fetch bandwidth provided by an aggressive sequential fetch unit from 5.8 for the original code to 10.6 using our proposed layout. Our software scheme combines well with hardware schemes like a Trace Cache, providing up to 12.1 instructions per cycle, suggesting that commercial workloads may be amenable to the aggressive I-fetch of future superscalars.

1 Introduction

Future wide-issue superscalars are expected to demand a high instruction bandwidth to satisfy their execution requirements. This will put pressure on the fetch unit and has raised concerns that instruction fetch bandwidth may be a major limiting factor to the performance of aggressive processors. Consequently, it is crucial to develop techniques to increase the number of useful instructions per cycle provided to the processor.

The number of useful instructions per cycle provided by the fetch unit is broadly determined by three factors: the branch prediction accuracy, the cache hit rate, and the number of instructions provided by the fetch unit for each access. Clearly, many things can go wrong. Branch mispredictions cause the fetch engine to provide wrong-path instructions to the processor. Instruction cache misses stall the fetch engine, interrupting the supply of instructions to the processor. Finally, the execution of non-contiguous basic blocks prevents the fetch unit from providing a full width of instructions.

Much work has been done in the past to address these problems. Branch effects have been addressed with techniques to improve the branch prediction accuracy [12] and to predict multiple branches per cycle [22, 28]. Instruction cache misses have been addressed with software and hardware techniques. Software solutions include code reordering based on procedure placement [8, 7] or basic block mapping, either procedure oriented [18] or using a global scope [9, 24]. Hardware solutions include set associative caches, hardware prefetching, victim caches and other techniques. Finally, the number of instructions provided by the fetch unit each cycle can also be improved with software or hardware techniques. Software solutions include trace scheduling [5] and superblock scheduling [10]. Hardware solutions include branch address caches [28], collapsing buffers [3] and trace caches [6, 21].

While all these techniques have vastly improved the performance of superscalar I-fetch units, they have been largely focused and evaluated on engineering workloads. Unfortunately, there is growing evidence that popular commercial workloads provide a more challenging environment for aggressive instruction fetching.

Indeed, recent studies of database workload performance on current processors have given useful insight [1, 14, 15, 16, 20, 25]. These studies show that commercial workloads do not behave like other scientific and engineering codes. They execute fewer loops and have many procedure calls. This leads to large instruction footprints. The analysis, however, is not detailed enough to understand how to optimize them for improved I-fetch engine performance.

The work in this paper focuses on this issue. We proceed in three steps. First, we characterize the locality patterns of database kernel code and find frequently executed paths.

The database kernel used is PostgreSQL [23]. Our data shows that there is significant locality and that the execution patterns are quite deterministic.

Second, we use this information to propose an algorithm to reorder the layout of the basic blocks in the database kernel for improved I-fetch. Finally, we evaluate our scheme via simulations. Our results show a miss reduction of 60-98% for realistic instruction cache sizes and a doubling of the number of instructions executed between taken branches to over 22. As a consequence, a 16 instruction wide sequential fetch unit using a perfect branch predictor increases the fetch bandwidth from 5.6 to 10.6 instructions per cycle when using our proposed code layout.

The software scheme that we propose combines well with hardware schemes like a Trace Cache. The fetch bandwidth for a 256 entry Trace Cache improves from 8.6 to 12.1 when combined with our software approach. This suggests that commercial workloads may be amenable to the aggressive instruction fetch of future superscalars.

This paper is structured as follows. In Section 2, we give a detailed account of the internals of a database management system and compare PostgreSQL to it. In Sections 4 and 5, we characterize the miss rate of the different modules of PostgreSQL and analyze the locality and determinism of the database execution. In Section 6, we describe the basic block reordering method that we propose. In Section 7, we give details on related work. In Section 8, we evaluate the performance of our method and compare it to other hardware and software techniques. Finally, in Section 9, we conclude and present guidelines for future work.

2 Structure of a Database Management System

Database Management Systems are organized in different software modules. Those modules correspond to different functionalities: running queries, maintaining the Database tables, or using statistics on the Database data, among others. Our interest focuses on the modules that take charge of running relational queries, which are the most time consuming part of a RDBMS.

In order to run a relational query, it is necessary to perform a number of steps, as shown in Figure 1. The query is specified by the user in a declarative language that determines what the user wants to know about the Database data. Nowadays, the Structured Query Language (SQL) is the standard declarative language for relational Databases [4]. The SQL query is translated into an execution plan that will be processed by the Query Execution kernel. The query execution plan has the form of a tree, with nodes representing the different operations.

The task of the Parsing-optimization kernel is to check the grammatical correctness of the SQL expression and to generate the best execution plan for the given query on a specific computer and Database.


Figure 1: Steps required for the execution of an SQL query and all the RDBMS modules involved.

While the importance of the Parsing-optimization module is paramount to generate a plan that executes fastest on a specific computer, the time employed to run it can be considered small compared to the total time spent in executing the query.

2.1 The Query Execution Kernel of a RDBMS

The Executor of a RDBMS (Figure 1) contains the routines that implement basic operations like Sequential Scan, Index Scan, Nested-Loop Join, Hash Join, Merge Join, Sort, Aggregate and Group. It also contains the routines that schedule the execution of those basic operations as described by the execution plan.

Scan operations take tuples from one table and generate a new table, selecting those tuples that fulfill some conditions on an attribute or set of attributes.

Join operations take two tables and produce a result table. For every pair of tuples, each from a different table, the Join checks if one or more attributes satisfy a given condition. The result table contains tuples combining the data from all pairs of tuples that satisfy the condition.

SELECT customer.name, customer.address, order.totalprice
FROM customer, order
WHERE customer.custkey = order.custkey
  AND customer.acctbal > 0
  AND order.orderdate > '1-Sept-1998'

(The associated execution tree applies a Scan with order.orderdate > Sept 1st, 1998 to the Order table and a Scan with customer.acctbal > 0 to the Customer table, and joins the results on customer.custkey = order.custkey.)

Figure 2: SQL query example and its associated query execution tree.

In the example of Figure 2, the leftmost Scan will select the tuples from the Order table with orderdate > Sept 1st, 1998, and the rightmost Scan will select the tuples from the Customer table with customer.acctbal > 0. Then, the Join operator will select the pairs of tuples with the same custkey value.

The Sort operation orders the tuples of a table based on the value of one or more attributes. The Group operation generates a table with one tuple for each group of tuples in the original table that have a common value in the grouping attribute. The Aggregate operation performs some arithmetic manipulation (like addition or counting) on the tuples grouped by the Group operation and gives a single result.

DBMSs are built with a modular structure to isolate the different semantic data levels. Thus, the modules of a DBMS must communicate by means of data structures that act as intermediate buffers. Just below the Executor are the lower modules of the DBMS: the Access Methods, the Buffer Manager and the Storage Manager. This modular structure hides the different semantic data levels from the Executor. Next, we describe those semantic data levels and the communication data structures for the lower modules of the DBMS.

The tables of a Database are stored as files following a given logic structure. The Storage Manager is responsible both for managing those files and for accessing them to provide file blocks to the Buffer Manager. The Buffer Manager is responsible for managing the blocks stored in memory, similarly to the way the OS Virtual Memory Manager does. The Buffer Manager provides memory blocks to the Access Methods module.

The Access Methods of a RDBMS provide tuples to the Executor module. Depending on the organization of each table, the Access Methods will traverse the required index structures and plain database tables stored in the blocks managed by the Buffer Manager. Each DBMS will implement different Access Methods for its own index and database table structures.

2.2 PostgreSQL

PostgreSQL is a public domain, client/server database developed at the University of California-Berkeley. PostgreSQL is an evolution of the Postgres database [23], which has led to commercial products, now being commercialized by Informix under the name Illustra or being distributed for free as part of the Debian 2.0 "hamm" Linux distribution. PostgreSQL runs on many different platforms; it is quite popular and well tuned for a system developed in academia.

PostgreSQL has a client/server structure and comprises three types of processes: clients, backends and a postmaster. Each client communicates with one backend that is created by the postmaster the first time the client queries the database.

The postmaster is also in charge of the initialization of the database.

The structure of the server backend of PostgreSQL corresponds to that of the general DBMS described before. The query execution kernel has the modular structure shown in Figure 1.

The execution of a query in PostgreSQL is performed in a pipelined fashion. This means that each operation passes its result tuples to the parent operation in the execution plan as soon as they are generated, instead of processing its whole input and generating the full result set. This explains the lack of loops and the long code sequences found in the PostgreSQL and other DBMS kernels [16].

To illustrate this execution model, we will use the example query in Figure 2 (a short Python sketch of this model is given at the end of Section 2). The Executor builds the result table by iteratively asking the Join operation for a result tuple, one at a time. To obtain a result tuple, the Join operation needs tuples from both scans. The Join operation requests a tuple from the leftmost Scan, which selects the first valid tuple from the Order table. Then, the Join operation requests a tuple from the rightmost Scan, which selects a tuple from the Customer table. Once the Join has a tuple from each scan, it checks the Join condition on the specified attributes and passes the result tuple to the query Executor. As new result tuples are requested by the query Executor, the Join operation will keep requesting tuples from the rightmost Scan until table Customer is finished. For each tuple of table Order, a full traversal of table Customer will be done, tuple by tuple, as requested by the Join operation.

The Scan operations call the Access Methods to get new tuples from Database tables. The Access Methods will access tuples in the pages provided by the Buffer Manager, which, in turn, requests the pages from disk through the Storage Manager.

2.3 DSS Workloads and TPC-D

In this paper we use the Transaction Processing Performance Council benchmark for Decision Support Systems (TPC-D) [26] as the workload for our experiments. DSS workloads imply large data sets and complex, read-only queries that access a significant portion of this data.

TPC-D has been described in the recent literature on the topic [1, 25] and for this reason we do not give a detailed account of it. At a glance, the TPC-D benchmark defines a database consisting of 8 tables, and a set of 17 read-only queries and 2 update queries. It is worth noting that the TPC-D benchmark is just a data set and the queries on this data; it is not an executable. The tables in the database are generated randomly, and their size is determined by a Scale Factor; a scale factor of 1 corresponds to a 1GB database. There are no restrictions regarding the indices that can be used for the database.
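To make the pipelined, tuple-at-a-time execution described in Section 2.2 concrete, here is a minimal iterator-style sketch in Python. It only illustrates the control flow discussed above (the Join repeatedly pulling tuples from its child Scans); the function names, the in-memory tables and the predicates are hypothetical and do not correspond to PostgreSQL's actual C routines.

# Minimal sketch of pipelined (iterator-style) query execution.
# Each operator yields result tuples one at a time to its parent,
# so no operator materializes its full result set.

def seq_scan(table, predicate):
    """Sequential Scan: yield the tuples of 'table' that satisfy 'predicate'."""
    for tup in table:
        if predicate(tup):
            yield tup

def nested_loop_join(outer, inner_factory, condition):
    """Nested-Loop Join: for each outer tuple, rescan the inner input."""
    for o in outer:
        for i in inner_factory():          # full traversal of the inner table
            if condition(o, i):
                yield {**o, **i}           # combine the two tuples

# Hypothetical in-memory tables standing in for Order and Customer.
order = [{"custkey": 1, "orderdate": "1998-10-01", "totalprice": 100.0},
         {"custkey": 2, "orderdate": "1998-08-01", "totalprice": 50.0}]
customer = [{"custkey": 1, "name": "ACME", "acctbal": 10.0},
            {"custkey": 2, "name": "Foo", "acctbal": -5.0}]

plan = nested_loop_join(
    seq_scan(order, lambda t: t["orderdate"] > "1998-09-01"),
    lambda: seq_scan(customer, lambda t: t["acctbal"] > 0),
    lambda o, c: o["custkey"] == c["custkey"])

for result in plan:                         # the Executor pulls tuples one by one
    print(result)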

3 Experimental Setup

We set up the Alpha version of the PostgreSQL 6.3.2 database on Digital Unix v4.0, compiled with the cc compiler and the -O3 optimization flags. A TPC-D database is generated with a scale factor of 0.1 (100MB of data). With the generated data we build two separate databases, one having Btree indices and the other having Hash indices. Both databases have unique indices on all tables for the primary key attributes (those attributes that identify a tuple) and multiple entry indices for the foreign key attributes (those attributes which reference tuples on other tables). More indices could be created to accelerate the query execution, but we were more interested in the database behavior than in the performance of the benchmark.

We run the TPC-D queries on the PostgreSQL database management system assuming the data is disk resident, not memory resident. We do not perform any warm up to load the database in memory, so TPC-D data must be loaded for every new query. The set of queries used to obtain the profile information and the ones used to evaluate the performance of our method are shown in Table 1. The Training set is executed on the Btree indexed database only, and the Test set is executed on both the Btree and the Hash indexed databases.

Query         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
Training Set  -  -  X  X  X  X  -  -  X  -  -  -  -  -  X  -  -
Test Set      -  X  X  X  -  X  -  -  -  -  X  X  X  X  X  -  X

Table 1: Queries used to obtain the profile information and to test our proposed algorithm.

ATOM is a static binary translator that allows the instrumentation of an executable program. We used ATOM to analyze the instruction reference stream and to simulate several cache organizations and code layouts.

4 Workload Analysis

To gain insight on the internal workings of PostgreSQL, we identified the entry points to the different modules of the database. We instrument the database and insert code before and after each of those functions to count the number of instructions executed and the i-cache misses of each database module. The numbers obtained are reflected in the bottom row of Table 2 and show results for a direct mapped 32KB i-cache.
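The per-module counts reported in Table 2 can be thought of as a trace-driven attribution: every executed instruction is charged to the currently active module, and a cache model decides whether it also counts as a miss. The sketch below illustrates the idea on a generic address trace with a direct mapped i-cache; it is a simplified stand-in for the ATOM instrumentation, and the trace format and module names are hypothetical.

# Sketch: attribute executed instructions and i-cache misses to database modules.
# Assumes a trace of (module_name, instruction_address) pairs; the real setup
# derives the module from instrumented entry/exit points, not from the trace.

from collections import defaultdict

LINE_SIZE = 32            # bytes per cache line
CACHE_SIZE = 32 * 1024    # 32KB direct mapped i-cache
NUM_LINES = CACHE_SIZE // LINE_SIZE

def attribute(trace):
    tags = [None] * NUM_LINES                 # one tag per direct mapped line
    instr = defaultdict(int)
    misses = defaultdict(int)
    for module, addr in trace:
        instr[module] += 1
        line = addr // LINE_SIZE
        index = line % NUM_LINES
        if tags[index] != line:               # miss: the line is not resident
            misses[module] += 1
            tags[index] = line
    return instr, misses

# Tiny hypothetical trace: two modules touching conflicting addresses.
trace = [("Executor", 0x1000), ("Executor", 0x1004),
         ("Access Methods", 0x1000 + CACHE_SIZE),   # conflicts with 0x1000
         ("Executor", 0x1000)]
i, m = attribute(trace)
print(dict(i), dict(m))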

For a sample execution of read-only queries on the Btree indexed database, a total of 169.5 billion instructions were executed, and 4.8 million cache misses were registered (a 2.8% base miss rate). As could be expected, most of the instructions belong to the query execution kernel. Less than 1% of the instructions belong to the user interface and the query optimizer levels, while 63% of the instructions belong to the Executor module. Nevertheless, the Access Methods and Buffer Manager account for 35% of the instructions, reaching 70% for some queries [19].

Looking at the i-cache misses, we observe that while the Executor is responsible for 63% of the executed instructions, only 53% of the misses correspond to that module. Meanwhile, the Access Methods gather 26% of the misses, for only 15% of the executed instructions. This is due to the fact that the Executor concentrates the few loops present in the database code, while the Access Methods are sequential functions, with few or no loops, and consist of many different functions referenced from several places which tend to conflict in the cache, replacing each other.

We were interested in learning which Executor operations were responsible for these instructions in the lower levels of the database. By modifying our instrumentation, we also counted the number of instructions and i-cache misses of each operation and of the lower level functions called by those operations. We obtained the two dimensional matrix shown in Table 2. Dashes mean that no instructions were executed for that segment, while zeros represent an insignificant fraction of the total number of instructions.

             Other    Parser   Optimizer  Executor   Access     Buffer     Storage  Total
Other        0.2/0.0  0.0/0.0  0.7/0.0    0.1/0.1    0.1/0.1    0.0/0.0    0.0/0.0  1.2/0.3
Hash Join    -        -        -          0.0/0.0    -          -          -        0.0/0.0
Hash         -        -        -          0.0/0.0    -          -          -        0.0/0.0
Aggregate    -        -        -          0.2/0.2    0.0/0.0    0.0/0.0    0.0/0.0  0.2/0.2
Group        -        -        -          0.5/0.5    -          -          -        0.5/0.5
Sort         -        -        -          1.0/0.3    -          -          -        1.0/0.3
Merge Join   -        -        -          0.5/1.6    0.0/0.0    0.0/0.0    0.0/0.0  0.5/1.7
Nest Loop    -        -        -          1.4/1.4    0.2/0.4    0.0/0.0    0.0/0.0  1.6/1.8
Index scan   -        -        -          9.4/20.3   11.2/18.4  13.3/13.0  0.9/3.0  34.8/54.6
Seq. scan    -        -        -          2.8/3.1    3.7/7.3    6.2/3.8    0.0/0.0  12.7/14.3
Result       -        -        -          0.6/0.9    0.0/0.0    0.1/0.0    0.0/0.0  0.8/0.9
Qualify      -        -        -          46.8/25.3  0.0/0.0    0.0/0.0    0.0/0.0  46.8/25.3
Total        0.2/0.0  0.0/0.0  0.7/0.0    63.4/53.8  15.2/26.2  19.7/16.9  0.9/3.0  100.0/100.0

Table 2: Percentage of the total number of dynamic instructions/i-cache misses for each database level and each executor operation, for a sample run of read-only queries on the Btree indexed database. Misses are for a direct mapped 32KB cache.

The most important operations are the Qualify operation and the Index and Sequential scans. The Qualify operation is responsible for checking whether a tuple satisfies a set of conditions or not, isolating all other Executor operations from the complexity of data types and operators. The Scan operations are responsible for most of the data access in the query execution.

Indeed, almost all the references to the Access Methods and the Buffer Manager belong to the Scan operations. The Sequential scan makes heavier use of the Qualify operation, as it must check all tuples in the scanned table, while the Index scan needs to check fewer tuples, because only those tuples that satisfy the index condition will be accessed.

The Index scan is responsible for 54% of the misses for only 34% of the instruction references. Most of the misses due to the Index scan are found in the Executor and the Access Methods modules. The Index scan operation accounts for so many misses due to its irregular behavior, using many different Access Methods routines to access both indices and database tables. Meanwhile, the Sequential scan has fewer misses, because it only accesses database tables, using fewer Access Methods routines.

On the other hand, the Qualify operation is responsible for 25% of the misses, while it gathers as much as 46% of the executed instructions. Its heavy use, and the repeated evaluation of the same conditions across a given query, make it easy for the cache to hold its working set.

We conclude that the Executor module, and the Qualify and Scan operations in particular, concentrate most of the executed instructions. Also, the Access Methods and Buffer Manager modules must be taken into account, as they concentrate a large percentage of the total i-cache misses.

5 Analysis of the Instruction Reference Patterns

Next, we examined the instruction reference patterns of the database focusing on locality issues, by counting the number of times each basic block is executed and recording all basic block transitions.

Based on the importance of the Qualify and Scan operations, and the large number of misses attributed to the Access Methods and Buffer Manager modules (see Section 4), we select a subset of the TPC-D queries based on the scan operations used by each query. The selected queries are the Training set shown in Table 1. This set also includes queries with and without an extensive use of the Aggregate, Group and Sort operations, because these operations need all their children's results before they can execute. This stops the normal pipelined execution of queries in the PostgreSQL database, and implies the storage of large temporary results. Furthermore, these operations store and access the temporary data without going through the Access Methods, which makes them somewhat unique.

Instrumenting the database and running the Training set, we obtained a directed graph with weighted edges. There is one node for each basic block. An edge connects two basic blocks p and q if q is executed after p. The weight of an edge pq is equal to the total number of times q has been executed immediately after p. The number of times a basic block has been executed can be obtained by adding the weights of all its outgoing edges. All unexecuted basic blocks and transitions are pruned from the graph.
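A minimal sketch of this profile data structure, assuming the instrumentation emits the sequence of basic block identifiers in execution order (the trace format and block names are hypothetical):

# Build the weighted basic block transition graph from a dynamic trace.
# graph[p][q] = number of times basic block q executed immediately after p.

from collections import defaultdict

def build_transition_graph(bb_trace):
    graph = defaultdict(lambda: defaultdict(int))
    for prev, curr in zip(bb_trace, bb_trace[1:]):
        graph[prev][curr] += 1
    return graph

def node_weight(graph, bb):
    # Execution count of a basic block = sum of its outgoing edge weights.
    return sum(graph[bb].values())

# Example: a short synthetic trace of basic block ids.
trace = ["A1", "A2", "C1", "C2", "A1", "A2", "C1", "C3"]
g = build_transition_graph(trace)
print(node_weight(g, "A2"))        # -> 2
print(dict(g["C1"]))               # -> {'C2': 1, 'C3': 1}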

5.1 Reference locality

The data in Table 3 illustrates an important characteristic of the database code. Only 12.7% of the static instructions were referenced during an execution of the Training set, which means that the database contains large sections of code which are rarely accessed. This is because the database must be able to manipulate many different data types and perform many special case operations that happen very infrequently, or can be found only in very special setups, but must be included for completeness.

              Static    Dynamic   Percent
Procedures     6,813     1,340    19.7%
Basic blocks  127,426   15,415    12.1%
Instructions  593,884   75,183    12.7%

Table 3: Static and dynamic footprint for the database. Numbers include the dynamically linked libraries.

This characteristic can be graphically observed in Figure 3. The graph shows the accumulated number of references to each basic block. In the Figure we can see the uneven distribution of references. It is clear that only a small part of the code is accessed, and that some parts are more popular than others.


Figure 3: Accumulated number of references to each basic block for an execution of the Training set.

The peaks present in Figure 3 correspond to the data access routines, the condition evaluation and the comparison operations. This large concentration of references was to be expected given the data obtained from Tables 2 and 3.

By ordering the basic blocks in descending order of references and accumulating them, we obtain Figure 4. This plots the percentage of the total number of basic block references as a function of the number of basic blocks. We observe that with only 1000 basic blocks (which represent 0.7% of the static basic blocks, 6.5% of the dynamic ones) we accumulate 90% of the references. With 2500 basic blocks (the maximum number present in the Figure) we capture 99.26% of the references.
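The curve in Figure 4 is simply the cumulative share of references obtained after sorting basic blocks by execution count. A minimal sketch, taking the per-block execution counts (the node weights of the transition graph) as input:

# Cumulative fraction of basic block references covered by the N hottest blocks.

def coverage_curve(exec_counts):
    """exec_counts: dict mapping basic block id -> execution count."""
    counts = sorted(exec_counts.values(), reverse=True)
    total = sum(counts)
    covered, curve = 0, []
    for c in counts:
        covered += c
        curve.append(100.0 * covered / total)
    return curve

curve = coverage_curve({"bb0": 900, "bb1": 60, "bb2": 30, "bb3": 10})
print(curve)        # -> [90.0, 96.0, 99.0, 100.0]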


Figure 4: Percentage of total basic block references for a given number of basic blocks.

This large concentration of the basic block references implies a large potential for exploiting locality. To further explore temporal locality, we counted the number of instructions that were executed between two consecutive invocations of a basic block.

Figure 5 shows the distribution for 495 basic blocks included in the most popular routines of the code, which capture 73% of all the basic block references. The basic blocks in this set have a probability of 33% of being re-executed in less than 250 instructions, and as much as a 19% probability of being referenced twice in less than 100 instructions. As we have shown, the database has relatively few loops, and the most popular routines are called from many different places, but outside of loops. Between two consecutive invocations of one of these popular routines, the database may execute enough code to displace this routine from the cache before it is reused.

This shows that there is substantial temporal locality to be exploited. As we will show, our method keeps the most frequently executed segments of code in a reserved area of the cache, so that they will not conflict with other code, saving many conflict misses.


Figure 5: Temporal locality analysis for a selection of 495 basic blocks included in the most popular routines of the code. These basic blocks represent 73% of the total basic block references.
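The measurement behind Figure 5 amounts to a reuse-distance count over the instruction trace: for each invocation of a tracked basic block, record how many instructions have executed since its previous invocation and bucket the result. A sketch, assuming a hypothetical trace of (basic_block_id, block_size_in_instructions) pairs:

# Histogram of instructions executed between consecutive invocations of a basic block.

from collections import Counter

BUCKETS = [25, 100, 250, 500, 1000, 2000, 5000]   # upper bounds, as in Figure 5

def reuse_histogram(trace, tracked):
    """trace: iterable of (bb_id, instr_count); tracked: set of basic blocks of interest."""
    last_seen = {}                   # bb_id -> instruction count at last invocation
    executed = 0                     # running total of executed instructions
    hist = Counter()
    for bb, size in trace:
        if bb in tracked:
            if bb in last_seen:
                distance = executed - last_seen[bb]
                label = next((f"<={b}" for b in BUCKETS if distance <= b), ">5000")
                hist[label] += 1
            last_seen[bb] = executed
        executed += size
    return hist

trace = [("q", 10), ("x", 40), ("q", 10), ("y", 300), ("q", 10)]
print(reuse_histogram(trace, {"q"}))   # -> Counter({'<=100': 1, '<=500': 1})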

5.2 Execution determinism

Next, we study how deterministic the sequences of executed basic blocks are, independently of how far apart in the code these basic blocks lie, because we can always move them closer in memory and expose more spatial locality in the form of long sequences of instructions that will be executed consecutively.

We classify basic blocks into one of four kinds according to how they affect the program flow. Fall-through basic blocks do not end with a branch instruction, so execution always continues on the next basic block. Branch basic blocks end with a conditional or unconditional branch, so after a Branch basic block a maximum of two different basic blocks may be found. Subroutine call basic blocks end with a subroutine invocation, and may have many successors. Return basic blocks end with a subroutine return, and may have many possible successors, as a subroutine may be referenced from several places.

BB Type            Static number  Static %  Dynamic number  Dynamic %
Branch                   54,026     42.4%      4.0 billion     50.2%
Fall-through             31,120     24.4%      1.8 billion     22.4%
Subroutine call          10,228      8.0%      1.1 billion     13.7%
Subroutine return        32,052     25.2%      1.1 billion     13.7%

Table 4: Number of basic blocks of each type, both static and dynamic.

Figure 6 gives some evidence of the determinism with which database code is executed. The figure shows the percentage of outgoing arcs that have a given probability of being followed, given that the basic block they leave is executed. Only the conditional branches are shown here, as fall-through basic blocks, subroutine calls and subroutine returns are predictable. The graph has been cut at 5%, but the rightmost peak reaches 55.6%.
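The arc probabilities summarized in Figure 6 follow directly from the weighted transition graph of Section 5: each outgoing edge weight is divided by the weight of its source node. A sketch, reusing the dictionary representation from the earlier graph example:

# Outgoing-arc probabilities and a simple determinism summary for branch blocks.

def arc_probabilities(graph):
    """graph[p][q] = transition count; yields (p, q, probability that p is followed by q)."""
    for p, successors in graph.items():
        total = sum(successors.values())
        for q, count in successors.items():
            yield p, q, count / total

def fraction_deterministic(graph, threshold=0.9):
    """Fraction of arcs whose probability of being followed is at least 'threshold'."""
    probs = [prob for _, _, prob in arc_probabilities(graph)]
    return sum(prob >= threshold for prob in probs) / len(probs)

g = {"A1": {"A2": 95, "A3": 5},     # A1 almost always falls through to A2
     "A2": {"C1": 50, "C2": 50}}    # A2 is an unbiased conditional branch
print(fraction_deterministic(g))    # -> 0.25 (1 of 4 arcs has probability >= 0.9)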


Figure 6: Percentage of outgoing arcs with a given probability of being followed.

Most arcs have a very high, or very low, probability of being followed after the basic block is executed.

Over 59% of the arcs have a probability greater than or equal to 0.9 of being followed. Furthermore, looking at the numbers in Table 4, fall-through basic blocks and subroutine calls and returns comprise 50% of the dynamic basic blocks. Also, 59% of the branch basic blocks (30% of the total basic blocks) tend to behave in a fixed way. Overall, 80% of the basic block transitions are predictable, which means that the executed sequences of basic blocks are fairly deterministic.

Our method intends to make those sequences explicit, and will map the basic blocks so that they lie in contiguous memory locations, increasing the number of instructions executed between two taken branches. If the sequences are mapped this way, the instruction cache will work like a software trace cache, providing the processor with more useful instructions for each cache access. There is enough determinism to do statically what a hardware trace cache does dynamically, with no additional hardware cost.

Our conclusions for this section are that there is a large concentration of references in a small set of very popular functions. There is substantial temporal locality to be exploited, as the most popular routines are referenced every few instructions. The execution paths are also quite deterministic, allowing us to exploit spatial locality by mapping basic blocks executed sequentially in consecutive memory locations.

6 Method description

To improve the fetch bandwidth, we start by building sequences of basic blocks, placing in consecutive memory positions those basic blocks executed sequentially, maximizing the number of instructions executed between taken branches.

To reduce the miss rate, we map the most frequently executed sequences of code in a reserved area of the cache. This way, the most popular routines will always be present in the cache, as they will not be displaced by any other code, saving many conflict misses. We also map popular sequences of code close to other equally popular ones, reducing interference among them.

6.1 Sequence building

To build our sequences of basic blocks, we begin by identifying seeds, the starting basic blocks of those sequences. The seed selection is described in Section 6.2 below.

Using the weighted graph we obtained for the locality study in Section 5, and starting from the selected seeds, we implement a greedy algorithm to build our sequences of basic blocks. Given a basic block, the algorithm follows the most frequently executed path out of it. This implies visiting a subroutine called by the basic block, or following the control transfer out of it with the highest probability of being used.

With this, we often end up including the basic blocks of a called function between the basic blocks of the caller. This sort of function inlining further increases the sequentiality of code by crossing the procedure boundary and mapping sequences of basic blocks spanning several functions.

For this algorithm we use two parameters called the Exec Threshold (ExecThresh) and the Branch Threshold (BranchThresh). The weight of a node divided by the total weight of all nodes gives a ratio that we compare to ExecThresh; the weight of an outgoing edge divided by the weight of the node it leaves gives a ratio that we compare to BranchThresh. The sequence building algorithm stops when all the successor basic blocks have been visited, or they have a weight lower than the ExecThresh, or all the outgoing arcs have a weight less than the BranchThresh. In that case, we start again from the next acceptable basic block. Once all basic blocks reachable from the given seed have been included in the sequence, we proceed to the next seed.

Our algorithm contains a loop that repeatedly selects a pair of user defined values for the ExecThresh and BranchThresh and generates the resulting sequences. The basic blocks included in previous passes are pruned from the newly formed sequences. By iteratively selecting less and less restrictive values for the thresholds, we build our sequences in decreasing frequency of execution.
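The sketch below outlines one possible implementation of this greedy pass in Python. It is a simplified reading of the algorithm described above (and illustrated in Figure 7), not the authors' actual tool: from each seed it follows the most likely unplaced successor whose node weight passes ExecThresh and whose transition probability passes BranchThresh, and restarts from the pending alternatives when a path dies out. For simplicity, ExecThresh is compared against absolute node weights rather than against the fraction of the total weight.

# Greedy construction of basic block sequences from seeds.
# graph[p][q] = transition count; weights and thresholds follow Section 6.1.

def build_sequence(graph, seed, placed, exec_thresh, branch_thresh):
    weight = lambda bb: sum(graph.get(bb, {}).values())
    sequence, pending = [], [seed]
    while pending:
        bb = pending.pop(0)                     # next acceptable starting point
        while bb is not None and bb not in placed and weight(bb) >= exec_thresh:
            sequence.append(bb)
            placed.add(bb)
            succs = sorted(graph.get(bb, {}).items(), key=lambda e: -e[1])
            # The most likely transition is followed; the alternatives are kept
            # as later starting points, in decreasing probability.
            good = [q for q, cnt in succs
                    if cnt / weight(bb) >= branch_thresh and q not in placed]
            bb = good[0] if good else None
            pending.extend(good[1:])
    return sequence

def build_layout(graph, seeds, threshold_passes):
    """One list of sequences per (exec_thresh, branch_thresh) pass, most restrictive first."""
    placed, passes = set(), []
    for exec_thresh, branch_thresh in threshold_passes:
        passes.append([build_sequence(graph, s, placed, exec_thresh, branch_thresh)
                       for s in seeds if s not in placed])
    return passes

g = {"A1": {"A2": 10}, "A2": {"C1": 9, "B1": 1}, "C1": {"A3": 9}, "A3": {}}
print(build_layout(g, ["A1"], [(2, 0.4)]))   # -> [[['A1', 'A2', 'C1']]]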


Figure 7: Example of a basic block weighted graph with the node weights and branch probabilities.

Figure 7 shows a synthetic example of a weighted graph including three routines. Using an ExecThresh of 3 and a BranchThresh of 0.4, we start from seed A1, going next to A2.

Then we discard the call to B1 due to the BranchThresh, and we proceed to C1. C1 may have a larger weight than A2, as it may be called from several other places. From C1, we take the most likely path to C2, but we note that the branch to C3 also passes the BranchThresh. From C2 we follow to C3, from where we discard C5 due to the BranchThresh, even though C5 has a high weight due to its inclusion in a tight loop. At C4 we find a subroutine return, and the sequence terminates. Then we take the second option from C1, to C3, but we observe that it has already been included in the sequence. Execution returns to A3, from where we take the way to A4, noting that A5 is a valid target. From A4 we go to A7, and then to A8, where we find a subroutine return. Now we go back to A5, and we stop again, as A7 is already included in the sequence. Now we take A6, but we discard it as it does not pass the ExecThresh. The resulting sequence of basic blocks would be A1, A2, C1, C2, C3, C4, A3, A4, A7, A8, A5. C5 could not be reached due to the BranchThresh, and A6 was discarded due to the ExecThresh.

Note that we often end up including the basic blocks of a called function between the basic blocks of the caller. This sort of function inlining further increases the sequentiality of code by crossing the procedure boundary and mapping sequences of basic blocks spanning several functions.

6.2 Seed selection

How we select the seeds, or starting basic blocks of our sequences, determines the amount of spatial locality exposed by our method, and the temporal locality of the sequences built from them.

We first tried an automatic seed selection (auto selection). The list of seeds was obtained by ordering the entry points of all functions in decreasing frequency of execution. This tries to expose the maximum temporal locality, as the sequences built will start on the most frequently referenced functions. On the other hand, the most referenced functions will be included first. This way, when a less popular function references one of them, the sequence will not include those popular basic blocks, as they have already been visited. By including few functions among the basic blocks of the caller, we lose a lot of potential for spatial locality.

The second seed selection (ops selection) was based on our knowledge of the database structure and the analysis in Section 4. We selected as seeds the entry points of all the Executor operations, especially the Qualify and Scan operations. This selection should expose the spatial locality that the automatic selection failed to expose, as the most frequently referenced functions will be inlined in a longer sequence of basic blocks. However, some important basic blocks may be left out, as they will be unreachable from the selected seeds, or some intermediate basic block will not pass the Exec or Branch thresholds.

Also, the sequences built this way will have lower temporal locality, as they include less frequently referenced basic blocks surrounding the most popular ones.

6.3 Sequence Mapping

Figure 8 shows the mapping scheme we developed. We define a logical array of caches, equal in size and address alignment to the physical cache. The sequences found in the first pass of the algorithm described in Section 6.1 are mapped from the start of the logical cache array. Then, we place the rest of the sequences sequentially, one pass at a time, keeping the area used by the sequences of the first pass free of code in all logical caches. This way, the first sequences will not be replaced from the cache by any other code, and so will be free of self-interference. We call this area the Conflict Free Area (CFA).

The size of this CFA is determined by the Exec and Branch Thresholds used for the first pass of our sequence building algorithm. As we will show, the size of the CFA is the most important factor for the performance of our method.
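A sketch of this mapping step, assuming the sequences are given as lists of (basic_block, size_in_bytes) pairs produced by the sequence-building passes, with the first pass first. It only assigns starting offsets in the laid-out binary; relocating the code and fixing up branch targets is outside its scope, and blocks are assumed not to straddle the end of a logical cache.

# Map sequences into a logical array of caches, reserving a Conflict Free Area
# (the cache region occupied by the first-pass sequences) in every logical cache.

def map_sequences(passes, cache_size):
    """passes[i] = list of sequences from pass i; each sequence is a list of
    (basic_block, size) pairs. Returns {basic_block: address} and the CFA size."""
    addresses, offset = {}, 0
    # First pass: sequences are laid out from the start of the first logical cache.
    for seq in passes[0]:
        for bb, size in seq:
            addresses[bb] = offset
            offset += size
    cfa_size = offset
    # Remaining passes: fill each logical cache, skipping the CFA region
    # so the first-pass sequences are never replaced by this code.
    for seq in (s for p in passes[1:] for s in p):
        for bb, size in seq:
            if offset % cache_size < cfa_size:              # inside the reserved area
                offset += cfa_size - (offset % cache_size)  # skip to the end of the CFA
            addresses[bb] = offset
            offset += size
    return addresses, cfa_size

passes = [[[("hot1", 64), ("hot2", 32)]],            # first pass -> the CFA
          [[("warm1", 48)], [("warm2", 16)]]]
addr, cfa = map_sequences(passes, cache_size=256)
print(cfa, addr)   # -> 96 {'hot1': 0, 'hot2': 64, 'warm1': 96, 'warm2': 144}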


Figure 8: Sequence mapping into a direct mapped cache.

When all the sequences have been mapped in the cache, we map all the remaining basic blocks in order, this time filling the entire address space. This rarely executed code is expected not to produce many conflicts with the sequences placed in the CFA.

Torrellas et al. built their CFA by pulling the most referenced basic blocks out of their sequences and mapping them in the reserved area. This effectively breaks the code sequence, as there will be many jumps in and out of the CFA. To keep the program execution as linear as possible, we chose to map the whole set of sequences found in the first pass of the algorithm described in Section 6.1. This will reduce the percentage of references captured by the CFA, as some basic blocks mapped there will not be as popular as some others left out, but we expect the increased sequentiality of the code to offer additional benefits to future aggressive processors that will increase performance.

Also, the dynamic footprint of the database is not as fixed as that of the Operating System. Small changes in the database setup might result in heavy use of supposedly unreferenced code that we mapped at addresses conflicting with the CFA, and that may result in a lot of conflict misses. We examined the possibility of keeping the CFA truly free of interference and found that this option improves the results for setups other than the one used to obtain the profile information [19].

Sequence mapping for set associative caches works the same way, but the caches in the logical array should be considered to be of size CacheSize/SetAssoc. Also, the first sequences, the ones mapped in the CFA, should be distributed evenly among the first SetAssoc passes. The rest of the sequences, and the basic blocks not placed in sequences, are mapped as in the direct mapped case.

7 Related Work

The software approach to the i-cache miss rate problem has been the use of code reordering algorithms. Hwu and Chang [9] use function inline expansion, and group into traces those basic blocks which tend to execute in sequence as observed in a profile of the code. Then, they map these traces in the cache so that the functions which are executed close to each other are placed in the same page. Our approach protects the most popular sequences in a reserved area of the cache, and we evaluate the impact of our reordering on the fetch bandwidth obtained when applied to a database kernel.

Pettis & Hansen [18] propose a profile based technique to reorder the procedures in a program, and the basic blocks within each procedure. Their aim is to minimize the conflicts between the most frequently used functions, placing functions which reference each other close in memory. They also reorder the basic blocks in a procedure, moving unused basic blocks to the bottom of the function code, even splitting the procedures in two and moving away the unused basic blocks. Their algorithm did not consider the target cache information, and their basic block reordering was limited to the basic blocks within a function body.

Torrellas et al. [24] designed a basic block reordering algorithm for Operating System code, running on a very conservative vector processor. They map the code in the form of sequences of basic blocks spanning several functions, and keep a section of the cache address space reserved for the most frequently referenced basic blocks.

Gloy et al. [7] extend the Pettis & Hansen placement algorithm at the procedure level to consider the temporal relationship between procedures, in addition to the target cache information and the size of each procedure. Hashemi et al. [8] and Kalamaitianos et al. [13] use a cache line coloring algorithm inspired by the register coloring technique to map procedures so that the resulting number of conflicts is minimized.

Their algorithm is based on either a dynamic profile of the code or on static estimations based on heuristics.

For more aggressive processors, the number of instructions provided to the processor becomes an important issue. On the software side, there are techniques like trace scheduling [5] and superblock scheduling [10], which map groups of basic blocks next to each other in the cache, avoiding the need for extra hardware to fetch them and enlarging the scope of the compiler to schedule instructions. Both use static branch prediction to identify sequences of basic blocks.

On the hardware side, techniques like the Branch Address Cache [28], the Collapsing Buffer [3] and the Trace Cache [21, 6] approach the problem of fetching multiple, non-contiguous basic blocks each cycle. Both the Branch Address Cache and the Collapsing Buffer access non-consecutive cache lines from an interleaved i-cache each cycle and then merge the required instructions from each accessed line.

The Trace Cache does not require fetching non-consecutive basic blocks from the i-cache. This is done by storing the dynamically constructed sequences of basic blocks in a special purpose cache. It does not replace the conventional instruction cache or the fetch hardware around it. If a fetch request corresponds to one of the sequences stored in the Trace Cache, this sequence is passed to the decode unit, effectively fetching multiple basic blocks in a single cycle. On a Trace Cache miss, fetching proceeds from the conventional i-cache and a new sequence is created and stored in the Trace Cache. Some works combine the software and the hardware approach. Patel et al. [17] identify branches with a fixed behavior and avoid making predictions on them, increasing the potential of the Trace Cache.

8 Method Evaluation

As we said before, our method is based on profile information. The selected queries for the Training set are shown in Table 1. In order to evaluate our technique, we used a different set of queries for the Test set, also shown in Table 1. The Test set consists of 10 different queries which are executed on two different database setups. Due to the large number of simulations needed to fully evaluate our method, we restricted the Test set to queries which could be simulated in a reasonable time. For both the Training and the Test sets, all queries were run to completion.

Small changes in the code layout can lead to dramatic changes in the performance of the resulting executable. We totaled the results for all the queries in the Test set, and then calculated the result for each database (Btree and Hash indexed). This way we obtain a weighted average for all the simulated queries and each separate database. The results in this section correspond to the average result for both databases.

We compared the results of both our automatic code layout (auto layout) and the experience based layout (ops layout) with those obtained by the method proposed by Pettis & Hansen (P&H layout) and the work of Torrellas et al. (Torr layout). For all the proposed layouts, as well as for the original compiled code, we evaluated both the i-cache miss rate and the instruction fetch bandwidth that could be provided. We also evaluated the i-cache miss rate and the fetch bandwidth provided by a 2-way set associative cache and by the addition of a 16-line fully-associative victim cache.

8.1 Simulation setup

We did not generate a new executable with the proposed code layouts. Instead, we generated a new address for each basic block, feeding the simulators with this new address instead of the original PC. The code was not modified in any way, so all basic blocks have the same size for all the examined code layouts.

The fetch unit used in our simulations corresponds to the SEQ.3 fetch unit described in [21]. This fetch unit is able to fetch up to 16 instructions from 2 consecutive cache lines. These 16 instructions must belong to a maximum of 3 different basic blocks, that is, they may contain a maximum of 3 branches. Summarizing, the fetch unit accesses two consecutive cache lines, and provides the instructions from the fetch address up to the first taken branch, or up to the maximum of three branches, or 16 instructions, whichever comes first.

We were interested in the performance limit of the examined code layouts, not the actual performance using a given branch predictor. For this reason, we assumed a perfect branch predictor, as well as a perfect branch target buffer, able to predict even indirect branches, and an infinite return stack. That is, all branches and jumps were correctly predicted and their destination was always known. No wrong path instructions were fetched.

We also compared our results with those of the basic Trace Cache described in [21]. We simulated two direct mapped Trace Caches of 64 and 256 entries of 16 instructions (4KB and 16KB). The Trace Cache was simulated with the same perfect branch prediction scheme as the sequential fetch unit.

All kinds of branch instructions were counted against the 3 branch limit, including unconditional branches and subroutine calls and returns. But, as an advantage to the Trace Cache, we did not break the sequence of instructions when we found an indirect branch, as was done in [6, 21]. Also, in case of a Trace Cache miss, the new instruction traces were built in a single cycle using future knowledge, instead of waiting for the instructions to be fetched and executed. This way, on a Trace Cache miss, the missed trace was available the next fetch cycle.
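As an illustration of the SEQ.3 fetch limits described above, the sketch below counts how many instructions a single fetch access would deliver from a stream of basic blocks, assuming perfect branch prediction and an i-cache hit; the block descriptions are hypothetical.

# How many instructions a SEQ.3-style sequential fetch unit delivers per access:
# up to 16 instructions, at most 3 branches, stopping at the first taken branch.

FETCH_WIDTH = 16
MAX_BRANCHES = 3

def instructions_per_access(blocks):
    """blocks: list of (block_length, ends_in_branch, branch_taken) in fetch order,
    assumed to lie in the two consecutive cache lines read by the fetch unit."""
    fetched = branches = 0
    for length, ends_in_branch, taken in blocks:
        room = FETCH_WIDTH - fetched
        if length >= room:                 # the fetch width is exhausted inside this block
            return FETCH_WIDTH
        fetched += length
        if ends_in_branch:
            branches += 1
            if taken or branches == MAX_BRANCHES:
                return fetched             # stop at a taken branch or the 3rd branch
    return fetched

# Three sequential (not-taken) branches end the access after 12 instructions...
print(instructions_per_access([(4, True, False)] * 4))                    # -> 12
# ...while a layout with no taken branch in range uses the full width.
print(instructions_per_access([(10, True, False), (10, False, False)]))   # -> 16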

8.2 Instruction cache miss rate

In the following, we study the effect of the different parameters of the described method on the i-cache miss rate. We analyze the impact of keeping the CFA truly free of interference and of the different seed selections, and study the effect of the cache associativity and line size.

8.2.1 Free area vs. Shared area

We first examined the effect of keeping the CFA free of interference versus allowing seldom executed code to be mapped at conflicting addresses.


Figure 9: Comparison of a truly Conflict Free Area and a busy or shared CFA.

Figure 9 shows the relative number of misses for each cache size and CFA size, that is, the percentage of misses left after reordering the code. Dashed lines plot the results for the free CFA, and solid lines correspond to the shared CFA. The miss rate reduction is 100% minus the relative miss rate, so lower lines represent better reductions.

We consider separately the results for the Btree and the Hash indexed databases. In most cases, the free CFA obtains better results, especially for the Hash indexed database, as more of the code mapped at positions conflicting with the CFA is referenced.

8.2.2 Seed selection

Next we analyzed the performance of each of the seed selections proposed. In addition to the miss rate improvement, we measured the percentage of the basic block references that were satisfied by the CFA, and compared the number to the ideal curve obtained in Figure 4.

Figure 10 shows the percentage of basic block references that were satisfied by each CFA setup as a function of the number of basic blocks included in the CFA.


Figure 10: Percentage of the total number of basic block references for a given number of basic blocks and different seed selections. The points in the graph correspond to the different CFA sizes in basic blocks.

As expected, the auto selection does better than the ops selection, but the difference becomes negligible as the CFA grows above 16KB. Not surprisingly, the percentage of references satisfied by the CFA decreases for setups other than the one used for the profile (as is the case for the Hash database), especially for the ops selection.

Cache size/CFA size

0

20

40

60

Rel

ativ

e nu

mbe

r of

mis

ses

(%)

auto-Hash indexed DBauto-Btree indexed DBops-Hash indexed DBops-Btree indexed DB

Figure 11: Relative number of misses for the different seed selections and each Cache size/CFA size.

In Figure 11 we compare the relative miss rates for all the seed selections we tried. The auto selection does better for the smaller CFAs (under 8KB), as the ops selection fails to include some important basic blocks, especially in the Hash case, where some of the basic blocks included in the CFA were not used in favor of some other code. For a larger CFA, the ops selection includes all the important basic blocks, and the added spatial locality starts to make a difference.

8.2.3 CFA size

Intuitively, an increase in the CFA size causes positive and negative effects. On the one hand, a larger CFA shields more routines from self-interference and, as a result, eliminates misses in those routines.

On the other hand, however, less area is left for the rest of the routines, which will suffer more conflict misses. Once the CFA size reaches a certain value, the second effect will dominate. Also, once the CFA is able to satisfy most of the references, there is little point in further increasing it, as we would be taking away cache area that may be better used to avoid conflict misses in other code. Unfortunately, different queries and database setups prefer a different CFA. Obviously, the CFA will be harmful to a setup if it contains code that is not used by that setup.

In Figure 11 we can observe that the larger CFA setups do not offer the best results, even though more references are satisfied by the CFA. For example, in Figure 10 we can see that a 6KB CFA captures over 60% of the references, but a 2KB CFA offers better results for an 8KB cache, as the space left is needed to avoid conflicts in the rest of the code. As an example of the second limit to the CFA size, we observe that even in the 64KB cache, the best result is obtained with a 16KB CFA.

8.2.4 Cache geometry

To this point we have examined the effect of the CFA setup for direct mapped caches with 32 byte lines. Next we examine the effect of the line size and the cache associativity.


Figure 12: Relative number of misses for different cache line sizes. Results are for the ops seed selection and the Btree indexed database.

In Figure 12 we can see that longer lines allow us to better exploit the added spatial locality of the ops seed selection, as we get better results with longer cache lines. This added benefit gets lost as caches get bigger, and the effect of the cache line size is almost null for a 64KB cache.

Figure 13 shows that even for associative caches, we obtain important reductions in the instruction miss rate. When using an associative cache, there are fewer conflict misses to eliminate, as the cache associativity avoids many of them, but the CFA still proves useful, obtaining reductions over 95% for 64KB caches.


Figure 13: Relative number of misses obtained for direct mapped and 2-way associative caches.

8.2.5 Comparison with other techniques

Finally, we compared the miss ratio of our proposed code layouts with some hardware and software solutions proposed in the literature.

Table 5 shows the miss rate for the sequential fetch unit and the different examined setups. The numbers given are in terms of i-cache misses per instruction executed, expressed as a percentage.

                        Code layout                   Cache           Trace cache
i-cache/CFA   orig   P&H    Torr   auto   ops    2-way  victim   4KB    16KB   16KB+ops
8/2           6.5    3.0    2.3    2.2    2.1    6.1    5.6      15.5   7.6    1.8
8/4           -      -      2.9    4.2    2.9    -      -        -      -      1.6
8/6           -      -      3.1    2.3    5.2    -      -        -      -      2.1
16/4          4.0    1.1    0.9    0.8    0.7    2.6    3.4      7.4    3.8    0.5
16/8          -      -      0.7    0.8    0.6    -      -        -      -      0.5
16/12         -      -      0.8    0.8    1.0    -      -        -      -      0.7
32/4          2.7    0.3    0.2    0.3    0.2    1.2    1.6      4.2    2.0    0.1
32/8          -      -      0.2    0.4    0.2    -      -        -      -      0.2
32/16         -      -      0.3    0.2    0.1    -      -        -      -      0.1
32/24         -      -      0.2    0.3    0.2    -      -        -      -      0.1
64/8          1.4    0.09   0.05   0.07   0.04   0.3    0.4      1.6    0.8    0.03
64/16         -      -      0.14   0.08   0.05   -      -        -      -      0.03
64/24         -      -      0.02   0.03   0.03   -      -        -      -      0.03

Table 5: Instruction cache miss rate for the different i-cache and CFA sizes examined.

Our proposed layouts obtain results similar to the Torr layout, and always outperform the P&H layout. We must note that the best miss rate for each cache size is always obtained using the ops layout, except for the 64KB cache, where the difference with the Torr layout is minimal. All the code layouts examined obtained better results than both the 2-way associative cache and the victim cache.

The 4KB Trace Cache did not improve the miss rate of the original code, but it is worth noting that it could only satisfy 18.6% of the fetch requests. Meanwhile, the 16KB Trace Cache could satisfy 47.9%, effectively reducing the i-cache miss rate. The combination of the ops layout and the 16KB Trace Cache obtained the best results in all cases.

8.3 Fetch bandwidth

Table 6 shows the percentage of taken branches, the average number of instructions between branches, and the average number of instructions between taken branches for all the examined code layouts. It shows that for the original code layout, the fetch bandwidth of the sequential unit is limited to under 9 instructions per cycle, while our ops layout increases this limit to 22 instructions.

Code Layout   % taken   Avg BB length   Avg Seq length
orig          66%       5.85            8.9
P&H           42%       -               14.1
Torr          41-61%    -               9.6-14.4
auto          36%       -               16.2
ops           26%       -               22.4

Table 6: Branch and basic block statistics for all code layouts considered.

The method proposed by Torrellas et al. shows a variable behavior. The percentage of taken branches (and thus the number of instructions between taken branches) varies as a function of the CFA size. The larger the CFA, the more basic blocks included in it. These basic blocks have been pulled out of their sequences to be included in the CFA, causing the sequential execution to break, because we have to jump in and out of the CFA to reach them.

The increased number of consecutive instructions obtained by all the examined layouts comes from the reordering of basic blocks to map in consecutive memory positions those basic blocks executed sequentially. The auto and ops layouts increase this effect by crossing the procedure boundaries, and mapping the basic blocks of a called function between the basic blocks of the caller.

Table 7 shows the number of instructions per cycle provided by each setup. The Ideal line shows the average number of instructions provided for each access to the fetch unit, while the rest of the table takes into account the number of cycles taken by each fetch access.

The average access time to the fetch unit considers that hits on the i-cache or the Trace Cache take a single cycle, and misses on the i-cache take an average of 5 cycles. The i-cache is assumed to be able to serve multiple overlapped misses, so if both line accesses from the fetch unit miss, the miss penalty is still 5 cycles.
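Under these assumptions, the per-cycle numbers of Table 7 can be approximated by dividing the ideal instructions per fetch access by the average number of cycles per access (1 cycle on a hit, about 5 on a miss). The toy calculation below only illustrates this simplified model; the per-access miss probability used is an assumed figure, not data from the paper.

# Rough model of delivered fetch bandwidth (instructions per cycle):
# ideal instructions per fetch access, divided by the average cycles per access.

def delivered_bandwidth(ideal_instr_per_access, miss_prob_per_access,
                        hit_cycles=1, miss_cycles=5):
    avg_cycles = (1 - miss_prob_per_access) * hit_cycles + miss_prob_per_access * miss_cycles
    return ideal_instr_per_access / avg_cycles

# Example with assumed values: 10.7 instructions per access (the ops "Ideal" row)
# and a hypothetical 2% of fetch accesses missing in the i-cache.
print(round(delivered_bandwidth(10.7, 0.02), 1))   # -> 9.9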


                     Code layout                         Cache            Trace Cache
i-cache/CFA   orig    P&H      Torr   auto    ops     2-way  victim     4KB   16KB  16KB+ops
Ideal          7.6    9.6   8.5-9.9    9.9   10.7       7.6     7.6     8.7   10.3    12.2
8/2            3.1    5.2       5.6    6.0    6.2       3.2     3.4     3.5    5.1     8.4
8/4              -      -       5.0    5.3    6.6         -       -       -      -     8.7
8/6              -      -       4.9    5.8    5.6         -       -       -      -     8.1
16/4           4.0    7.3       7.4    8.1    8.8       4.8     4.3     4.5    6.2    10.3
16/8             -      -       7.4    8.1    9.0         -       -       -      -    10.4
16/12            -      -       7.3    7.9    8.1         -       -       -      -    10.2
32/4           4.7    8.8       8.9    9.2   10.0       5.9     5.5     5.2    7.2    11.5
32/8             -      -       8.4    8.8   10.1         -       -       -      -    11.5
32/16            -      -       8.0    9.3   10.3         -       -       -      -    11.8
32/24            -      -       8.2    9.2   10.1         -       -       -      -    11.6
64/8           5.8    9.3       8.8    9.8   10.6       7.1     7.0     6.6    8.6    12.0
64/16            -      -       8.4    9.7   10.5         -       -       -      -    12.1
64/24            -      -       8.5    9.8   10.6         -       -       -      -    12.1

Table 7: Fetch bandwidth in instructions per cycle. The i-cache miss penalty is 5 cycles.

Looking at the Ideal fetch bandwidth provided by each technique, we observe that the P&H layout is very close to the auto reordering and the 16KB Trace Cache. The ops reordering has more potential bandwidth than any other technique except its combination with the 16KB Trace Cache. Once again, the Torr layout shows a behavior that varies with the CFA size, obtaining results similar to the other basic block reordering layouts only for the smaller CFA sizes, but still doing better than the original code for the larger ones. The fetch bandwidth improvement offered by the 2-way associative and victim caches comes exclusively from the reduction in the average access time, which is not enough to compensate for the lack of sequentiality in the original code.

Once the average number of cycles taken by each fetch request is considered, both the auto and the ops layouts become clearly better than any other technique. The Trace Cache could not remember all the executed sequences and had to resort to sequential fetching 52.1% of the time (81.4% for the 4KB Trace Cache), and on those accesses the fetch unit could provide few instructions due to the lack of sequentiality in the original code. Our proposed layout could use the whole memory space as a software trace cache to capture the most frequently executed sequences of code, and could provide more instructions per cycle using the statically stored traces. When the Trace Cache is used in conjunction with the ops layout, the fetch engine could provide more instructions per cycle even on a Trace Cache miss, due both to the increased sequentiality of the code and to the reduced i-cache miss rate.

It is worth noting that a lower miss rate alone does not imply a higher fetch bandwidth, as can be observed by comparing the results of the P&H and the Torr layouts for the larger caches, or the results for the 2-way associative cache and the 16KB Trace Cache.


In order to improve the fetch bandwidth, both the i-cache miss rate and the sequentiality of the code must be taken into account.

9 Conclusions and Future Work

Instruction fetch bandwidth is feared to be a major limiting factor to the performance of future wide-issue aggressive superscalars.

In this paper we focus on database applications running Decision Support workloads. We characterize the locality patterns of the database kernel code and find the frequently executed paths. Using this information, we propose an algorithm to lay out the basic blocks of the database kernel for improved fetch bandwidth. This is achieved both by a reduction of the i-cache miss rate and by an increase in the number of instructions executed between taken branches.

Our results show a miss reduction of 60-98% for realistic i-cache sizes, obtaining miss rates under 0.05% with a 64KB direct mapped cache. The proposed code layout also increases the number of instructions executed between taken branches from 8.9 for the original code to 22.4. With this, a 16-instruction-wide fetch unit could provide 10.6 instructions per cycle using our proposed code layout.

Improving only the i-cache miss rate or only the potential fetch bandwidth is not enough. Both factors must be taken into account to obtain optimal results.

A Trace Cache alone could not hold all the executed sequences, while our technique used the whole memory space as a Software Trace Cache to statically store the most frequently executed traces. As a consequence, while the Trace Cache alone could only provide 8.6 instructions per cycle, the combination of the software and hardware trace caches increased the result to 12.1.

We have shown that large first-level i-caches can capture the working set of large applications like a DBMS. It is worth studying whether the controlled use of code expanding techniques like function inlining and code replication can increase the potential fetch bandwidth provided by a sequential fetch unit while keeping the miss rate under control.

In the near future we plan to extend the proposed algorithm to automate the selection of the thresholds and the seeds while obtaining results closer to the knowledge-based selection. We will also examine the effect of our technique on the IPC for a wider range of applications, such as OLTP workloads and the SPEC benchmarks.


10 Acknowledgments

The authors from the Universitat Politecnica de Catalunya are supported by CICYT grant TIC-0511-98. Josep Lluis Larriba-Pey and Josep Torrellas are supported by Generalitat de Catalunya grant ACI-002. Josep Lluis Larriba-Pey, Josep Torrellas and Mateo Valero are supported by the Commission for Cultural, Educational and Scientific Exchange between the United States of America and Spain.

References

[1] Luiz André Barroso, Kourosh Gharachorloo and Edouard Bugnion, Memory System Characterization of Commercial Workloads, Proceedings of the 25th Annual Intl. Symposium on Computer Architecture, pp 3-14, June 1998.

[2] William Y. Chen, Pohua P. Chang, Thomas M. Conte and Wen-mei Hwu, The Effect of Code Expanding Optimizations on Instruction Cache Design, IEEE Transactions on Computers, Vol. 42(9):1045-1057, September 1993.

[3] T. Conte, K. Menezes, P. Mills and B. Patel, Optimization of Instruction Fetch Mechanisms for High Issue Rates, Proceedings of the 22nd Annual Intl. Symposium on Computer Architecture, pp 333-344, June 1995.

[4] R. Elmasri and S. Navathe, Fundamentals of Database Systems (2nd edition), Benjamin Cummings, 1994.

[5] J. A. Fisher, Trace Scheduling: A Technique for Global Microcode Compaction, IEEE Transactions on Computers, C-30(7):478-490, July 1981.

[6] D. H. Friendly, S. J. Patel and Y. N. Patt, Alternative Fetch and Issue Techniques from the Trace Cache Mechanism, Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, December 1997.

[7] Nikolas Gloy, Trevor Blackwell, Michael D. Smith and Brad Calder, Procedure Placement Using Temporal Ordering Information, Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, pp 303-313, December 1997.

[8] Amir H. Hashemi, David R. Kaeli and Brad Calder, Efficient Procedure Mapping Using Cache Line Coloring, Proc. ACM SIGPLAN'97 Conf. on Programming Language Design and Implementation, pp 171-182, June 1997.


[9] Wen-mei Hwu and Pohua P. Chang, Achieving High Instruction Cache Performance with an Optimizing Compiler, Proceedings of the 16th Annual Intl. Symposium on Computer Architecture, pp 242-251, June 1989.

[10] W. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann, R. G. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm and D. M. Lavery, The Superblock: An Effective Technique for VLIW and Superscalar Compilation, The Journal of Supercomputing, 7(9-50), 1993.

[11] N. J. Jouppi, Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers, Proceedings of the 17th Annual Intl. Symposium on Computer Architecture, pp 364-373, June 1990.

[12] T. Juan, S. Sanjeevan and J. J. Navarro, Dynamic History-Length Fitting: A Third Level of Adaptivity for Branch Prediction, Proceedings of the 25th Annual Intl. Symposium on Computer Architecture, pp 155-166, June 1998.

[13] John Kalamatianos and David R. Kaeli, Temporal-Based Procedure Reordering for Improved Instruction Cache Performance, Proceedings of the 4th Intl. Conference on High Performance Computer Architecture, February 1998.

[14] Kimberly Keeton, David A. Patterson, Yong Qiang He, Roger C. Raphael and Walter E. Baker, Performance Characterization of a Quad Pentium Pro SMP Using OLTP Workloads, Proceedings of the 25th Annual Intl. Symposium on Computer Architecture, pp 15-26, June 1998.

[15] Jack L. Lo, Luiz André Barroso, Susan J. Eggers, Kourosh Gharachorloo, Henry M. Levy and Sujay S. Parekh, An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors, Proceedings of the 25th Annual Intl. Symposium on Computer Architecture, pp 39-50, June 1998.

[16] A. M. Maynard, C. M. Donnelly and B. R. Olszewski, Contrasting Characteristics and Cache Performance of Technical and Multi-User Commercial Workloads, Proceedings of the 6th Intl. Conference on Architectural Support for Programming Languages and Operating Systems, pp 145-156, October 1994.

[17] Sanjay Jeram Patel, Marius Evers and Yale N. Patt, Improving Trace Cache Effectiveness with Branch Promotion and Trace Packing, Proceedings of the 25th Annual Intl. Symposium on Computer Architecture, pp 262-271, June 1998.

[18] Karl Pettis and Robert C. Hansen, Profile Guided Code Positioning, Proc. ACM SIGPLAN'90 Conf. on Programming Language Design and Implementation, pp 16-27, June 1990.


[19] Alex Ramirez et al., Characterization of Instruction Cache Behavior of PostgreSQL Running the TPC-D Workload, Research Report DAC-98-51.

[20] Parthasarathy Ranganathan, Kourosh Gharachorloo, Sarita V. Adve and Luiz André Barroso, Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors, Proceedings of the 8th Intl. Conference on Architectural Support for Programming Languages and Operating Systems, October 1998.

[21] E. Rotenberg, S. Bennett and J. E. Smith, Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching, Proceedings of the 29th Annual ACM/IEEE International Symposium on Microarchitecture, pp 24-34, December 1996.

[22] A. Seznec, S. Jourdan, P. Sainrat and P. Michaud, Multiple-Block Ahead Branch Predictors, Proceedings of the 7th Intl. Conference on Architectural Support for Programming Languages and Operating Systems, October 1996.

[23] M. Stonebraker and G. Kemnitz, The POSTGRES Next Generation Database Management System, Communications of the ACM, October 1991.

[24] Josep Torrellas, Chun Xia and Russell Daigle, Optimizing Instruction Cache Performance for Operating System Intensive Workloads, Proceedings of the 1st Intl. Conference on High Performance Computer Architecture, pp 360-369, January 1995.

[25] Pedro Trancoso, Josep Ll. Larriba-Pey, Zheng Zhang and Josep Torrellas, The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors, Proceedings of the 3rd Intl. Conference on High Performance Computer Architecture, February 1997.

[26] Transaction Processing Performance Council (TPC), TPC Benchmark D (Decision Support), Standard Specification, Revision 1.2.3, TPC, 1993-1997.

[27] T.-Y. Yeh and Y. N. Patt, Two-Level Adaptive Branch Prediction, Proceedings of the 24th Annual ACM/IEEE Intl. Symposium on Microarchitecture, pp 51-61, 1991.

[28] T.-Y. Yeh, D. T. Marr and Y. N. Patt, Increasing the Instruction Fetch Rate via Multiple Branch Prediction and a Branch Address Cache, 7th Intl. Conference on Supercomputing, pp 67-76, July 1993.