
Kilo-instruction Processors

Adrián Cristal, Daniel Ortega, Josep Llosa, and Mateo Valero

Departamento de Arquitectura de Computadores, Universidad Politécnica de Cataluña,

Barcelona, Spain, {adrian,dortega,josepll,mateo}@ac.upc.es

Abstract. Due to the difference between processor speed and memory speed, memory has steadily appeared further away from the processor in terms of cycles. Superscalar out-of-order processors cope with these increasing latencies by having more in-flight instructions from which to extract ILP. With upcoming latencies of 500 cycles and more, this will eventually lead to what we have called Kilo-Instruction Processors, which will have to handle thousands of in-flight instructions. Managing such a large number of in-flight instructions implies a microarchitectural change in the way the re-order buffer, the instruction queues and the physical registers are handled, since simply up-sizing these resources is technologically unfeasible. In this paper we present a survey of several techniques which try to solve the problems caused by thousands of in-flight instructions.

1 Introduction and Motivation

The ever-increasing gap between processor and memory speed is steadily increasing memory latencies (in cycles) with each new processor generation. This increasing difference, sometimes referred to as the memory wall effect [33], affects performance by stalling the processor pipeline while waiting for memory accesses. These scenarios are very common in superscalar out-of-order microprocessors with current cache hierarchies.

In these processors, when a load instruction misses in the L2 cache, the latency of the overall operation will eventually make this instruction the oldest in the processor. Since instructions commit in order, this instruction will disallow the retirement of newer instructions, which will entirely fill the re-order buffer (ROB) of the processor and halt the fetching of new instructions.

When this happens, performance suffers greatly. The number of independent instructions available to issue is limited due to the lack of new instructions. Eventually, all instructions independent of the missing load will have executed and issuing will stop. On the left side of Fig. 1 we can see the IPC for a 4-way processor with different latencies to main memory (100, 500, and 1000 cycles), assuming a 128-entry ROB (the architectural parameters are described in Section 5). For a current 100 cycle latency, the loss due to the lack of in-flight instructions is not enormous, but for a 1000 cycle latency it certainly is. Notice that the slowdown observed in this figure from a 100 cycle to a 1000 cycle latency is almost a factor of 3.
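As a rough back-of-the-envelope illustration of why this stall dominates (our own sketch, not a model from the paper): while the oldest instruction is an outstanding L2 miss, nothing can retire, so at most one ROB's worth of instructions completes per miss latency.

```python
# Toy bound, assuming a single isolated L2 miss at the head of a full ROB:
# no instruction can commit until the miss returns, so at most ROB_SIZE
# instructions retire during the whole miss latency.
ROB_SIZE = 128        # entries (baseline above)
WIDTH = 4             # peak commit width (instructions/cycle)

for latency in (100, 500, 1000):                 # cycles to main memory
    ipc_bound = min(WIDTH, ROB_SIZE / latency)   # IPC while the miss is outstanding
    print(f"{latency:4d}-cycle miss -> IPC bound during the miss: {ipc_bound:.2f}")
```

With a 128-entry ROB the bound drops from about 1.3 at 100 cycles to about 0.13 at 1000 cycles, consistent with the slowdown visible in Fig. 1.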

A. Veidenbaum et al. (Eds.): ISHPC 2003, LNCS 2858, pp. 10–25, 2003. © Springer-Verlag Berlin Heidelberg 2003


Fig. 1. Effect on IPC of unbounded resources for SPEC2000 fp applications. [Figure omitted: IPC (0–4) versus resource size (64 to 4096 entries) for memory latencies of 100, 500 and 1000 cycles; series: Limited ROB, Limited Regs, Limited Queue, Perfect L2.]

A well-known microarchitectural solution to cope with this problem is to increase the maximum number of in-flight instructions that the processor can handle. More instructions mean that the halting will start later, thus hurting overall performance less, or not happen at all. Besides, increasing the availability of instructions to issue increases the probability of finding independent instructions, thus also increasing IPC [11].

In order to increase the number of in-flight instructions we must increase the capacity of several resources that directly depend on that number, the most important ones being the ROB, the instruction queues and the physical registers. We will start our discussion by describing the problems of the re-order buffer (ROB). This structure keeps the state of each instruction from the moment it is prepared for execution (normally just after renaming) until it reaches the commit stage and updates the architectural state of the machine. In the case of a 1000 cycle latency to memory and a 4-way processor, and assuming we would like not to halt the processor, this would mean at least a four-thousand-entry ROB. Several authors have pointed out the criticality of this resource [22]. Simply up-sizing this structure is not technologically feasible, which motivates us to look for other types of solutions. Unfortunately, this is not our only problem.
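Spelling out that sizing (a Little's-law style estimate of our own; the four-thousand figure above follows directly):

```python
# In-flight capacity needed to cover a memory latency without stalling fetch:
# roughly issue width x latency, assuming the front end can keep up.
def rob_entries_needed(width: int, latency_cycles: int) -> int:
    return width * latency_cycles

print(rob_entries_needed(4, 1000))   # -> 4000 entries, as mentioned above
```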

In-flight instructions do not only live in the ROB. After the decode stage they also get inserted into Instruction Queues (IQ), where they wait until execution. Normally the instruction queues are divided depending on the type of the instruction, which leads to a more clustered and effective microarchitecture. Unfortunately, the handling of these IQs is even more critical than that of the ROB. Each cycle, the tags of the destination registers of instructions that have finished must be compared against each of the source registers of each instruction in the IQ. This process, called wakeup logic, determines which instructions are ready to be executed. Another process, called selection logic, determines which ready instructions are selected and effectively issued to their respective functional units, and which remain in the IQ.
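To make the wakeup/select terminology concrete, here is a minimal software sketch of the generic mechanism just described (our own illustration; real designs implement this with CAM broadcasts and arbitration logic, not loops):

```python
# Minimal sketch of wakeup and selection logic over an issue queue.
from dataclasses import dataclass, field

@dataclass
class IQEntry:
    opcode: str
    dest: int                                    # destination physical register tag
    sources: set = field(default_factory=set)    # source tags still pending

class IssueQueue:
    def __init__(self, issue_width: int = 4):
        self.entries: list[IQEntry] = []
        self.issue_width = issue_width

    def insert(self, entry: IQEntry) -> None:
        self.entries.append(entry)

    def wakeup(self, finished_tags: set) -> None:
        # Broadcast the destination tags of completing instructions and clear
        # the matching pending source operands (a CAM match in hardware).
        for e in self.entries:
            e.sources -= finished_tags

    def select(self) -> list:
        # Pick up to issue_width ready entries (no pending sources), oldest
        # first, and remove them from the queue.
        ready = [e for e in self.entries if not e.sources][: self.issue_width]
        for e in ready:
            self.entries.remove(e)
        return ready
```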

Having more in-flight instructions means having bigger and more complex IQs. This is especially true in our missing-load scenario, since many of the instructions will depend on the missing load and will consume entries of their respective queues for a long time. Palacharla et al. [23] showed how the wakeup and selection logic of instruction queues is on the critical path and may eventually determine the processor cycle time. Several authors have presented techniques which take into account the complexity problem of IQs and handle it in innovative and intelligent ways [3, 29]. Lately, [14] and [2] have proposed techniques that tackle the problem of large IQs in processors with a high number of in-flight instructions.

Another resource whose size depends on the number of in-flight instructions is the number of physical registers [8]. Just after being fetched, instructions go through a process of decoding and renaming. Renaming is the process of changing the logical names of the registers used by instructions into the names of physical registers [30]. Each destination logical register is translated into a new physical register, and source registers are translated to whatever register was assigned to the destination register of their producer. This exposes more ILP, since false dependencies^1 and certain hazards such as write after write (WAW) are handled in an elegant way.

The life of a physical register starts in the decode stage of the instruction that defines the value that the register will contain. The resource can only be freed when all dependent instructions have already read the value. Since determining this particular moment in time is very difficult, microarchitectures conservatively free the register when the following instruction that defines the same logical register becomes the oldest, since by then all instructions dependent on the prior definition of that logical register must have read their value. When a missing load stalls the commit of instructions, register lifetimes get extended, thus increasing the need for physical registers.
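A minimal sketch of renaming with this conservative release rule, under our own simplified assumptions (a single free list, release triggered when the redefining instruction retires):

```python
# Sketch of register renaming plus conservative register release.
class RenameMap:
    def __init__(self, n_logical: int, n_physical: int):
        self.map = list(range(n_logical))                  # logical -> physical
        self.free = list(range(n_logical, n_physical))     # free physical registers

    def rename(self, dest_logical: int, src_logicals: list):
        srcs = [self.map[s] for s in src_logicals]         # read current mappings
        prev = self.map[dest_logical]                      # mapping to free later
        new = self.free.pop(0)                             # allocate a new physical reg
        self.map[dest_logical] = new
        return new, srcs, prev

    def redefiner_retired(self, prev_physical: int) -> None:
        # Conservative release: only now is it safe to recycle the old mapping,
        # because every consumer of the previous definition has retired before us.
        self.free.append(prev_physical)
```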

Unfortunately, increasing the number of physical registers is neither simple nor inexpensive. Several studies [1, 24] show that the register file is on the critical path, and others [31] state that the register bank accounts for 15% to 20% of the overall power consumption.

At this point of our analysis it must be clear that simply up-sizing critical structures is not a feasible solution to tolerate memory latency. Nevertheless, we must not disdain the performance benefits that this increase brings. In Fig. 1 we can see the IPC obtained in our machine with a 1000 cycle latency to main memory, relative to the amount of resources. The rightmost bar of each group of four presents the IPC obtained assuming a perfect L2 cache, while the other three bars show the IPC when one of the three resources (ROB size, IQ size and number of physical registers) is limited to the amount shown on the X axis while the other two are assumed unbounded. The first conclusion we may draw from these results is how tightly coupled these three resources are.

^1 False dependencies occur when two instructions define the same logical register. In microarchitectures without renaming, both instructions get to use the same resource, and the second instruction may not start until the first one and all its dependents have finished. Renaming removes this constraint by assigning different physical registers to both dependence chains.


Unbounding any two of them does not increase performance, since the remaining one becomes the bottleneck. Solutions to any of these three hurdles, although in many cases presented separately, will always need solutions to the other two in order to become effective.

From these numbers it is clear that the ratio between the number of physical registers, the number of IQ entries and the number of ROB entries should not be one. Since not all instructions define registers, and not all instructions in the ROB are actually using an IQ entry, the necessary amount of these two resources should be smaller than that of the ROB. This can be seen in the figure by noticing that the bar where the ROB is limited to the resource size is the worst of the three.

The most important conclusion that we can draw from Fig. 1, which motivates our present discussion, is the fact that increasing the number of in-flight instructions (which is achieved by augmenting the necessary resources) effectively allows the processor to tolerate the latency of main memory, as expected. The results of Fig. 1 show how, for a 1000 cycle latency, 4096 entries in the ROB achieve performance close to the perfect L2 behavior.

The scenarios and the results explained in this paper are not exclusive to numerical applications, but also apply to integer applications. Unfortunately, integer applications also suffer from other bottlenecks, such as branch mis-speculation and pointer chasing problems, which are out of the scope of this paper. Discussion of these problems can be found in [13, 28].

Having reached this point in our discussion, it seems clear that although increasing the number of in-flight instructions clearly delivers outstanding performance for numerical applications under future memory latencies, it is not feasible to up-size resources to achieve this goal. We need a different approach to this problem [4], and this approach is motivated by the results of Figs. 2 and 3.

In these figures we can see the breakdown of resource usage on FP applications. Fig. 2 shows the relation between the number of in-flight instructions and the necessary entries in the FP instruction queue for a 4-way

Fig. 2. Breakdown of FP Queue usage on FP applications. [Figure omitted: FP queue entries (0–600) versus number of in-flight instructions (SpecFP) at the 10%, 25%, 50%, 75% and 90% time percentiles; series: Blocked-Long, Blocked-Short.]

Fig. 3. Breakdown of FP Registers usage on FP applications. [Figure omitted: FP registers (0–1400) versus number of in-flight instructions (SpecFP) at the same percentiles; series: Dead, Blocked-Long, Blocked-Short, Live.]

processor with 2048 entries in the ROB and a memory latency of 500 cycles. The x-axis is not linear. Instead, we have also computed on average how much of the total time the number of in-flight instructions behaved in a particular way. For example, we can see that on average 25% of the time the number of in-flight instructions was less than 1382, on average 50% of the time this number was less than 1607, and so on. The motivating part of the figure is the fact that we have not only computed the number of to-be-issued instructions with respect to the number of in-flight instructions, but rather we have divided the to-be-issued instructions into two different types: those depending on instructions which are going to be ready shortly, and those depending on instructions with very long latency. Of course, long latency instructions are those that have missed in L2, or any instruction dependent on them. As can be seen from Fig. 2, the amount of dependents on long latency instructions, tagged Blocked-Long in the figure, is much bigger than that of the soon-to-be-ready ones. The motivation behind our present discussion is obvious: why consume critical resources for long latency instructions? It seems that a small number of entries in the FP queue would be able to handle the soon-to-be-ready instructions.

Something similar can be read from Fig. 3. In it, the x-axis follows the same behavior, indicating the number of in-flight instructions and the percentile distribution over time. The y-axis accounts for the number of physical registers necessary to cope with that amount of in-flight instructions. Again, we have classified physical register usage depending on how soon the registers will be used. Live registers are those that presently hold computed values. Blocked-short registers do not yet have a value but depend on soon-to-be-ready instructions, and therefore their value is presumably going to be available soon. Blocked-long registers are those that will eventually have a value but will remain for a long time without one; these are the registers associated with long latency instructions and their dependents. Dead registers are those whose values have been computed and already consumed by all their dependent instructions.
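The classification can be phrased as a simple predicate; the sketch below is our own reading of the definitions above (Live, Dead, Blocked-short, Blocked-long), not the instrumentation actually used for Fig. 3:

```python
# Classify a physical register according to the categories used in Fig. 3.
def classify_register(has_value: bool, pending_readers: int,
                      producer_is_long_latency: bool) -> str:
    if has_value:
        return "Live" if pending_readers > 0 else "Dead"
    # Value not produced yet: distinguish by the latency of the producer chain.
    return "Blocked-long" if producer_is_long_latency else "Blocked-short"

# Example: a register whose producer depends on an L2-missing load.
print(classify_register(has_value=False, pending_readers=3,
                        producer_is_long_latency=True))   # -> "Blocked-long"
```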


Here again we can see something motivating. At any particular point, Dead registers are not needed anymore^2 and Blocked-long registers will only be needed at a future time. Therefore the number of registers actually needed at any point is much smaller, staying below 300 for nearly 2K in-flight instructions.

In this paper we shall present a survey of different techniques that address the problems of the ROB, the IQs and the physical registers. We will present summaries of our work and the work of others, and the philosophy behind them. In Section 2 we address the problem of achieving very big ROBs. Next, in Section 3, we describe certain mechanisms that try to virtually enlarge the IQs at little cost. In Section 4 we present a survey of the different solutions proposed in the field of registers, some of which have been specifically tailored to the case of Kilo-Instruction Processors. We present brief results for a combination of these mechanisms in Section 5 and conclude the paper in Section 6.

2 Managing the Commit of Kilo-instruction Processors

The main purpose of the ROB is to enforce in-order commit, which is fundamental for microarchitectures with precise interrupts. The ROB allows the processor to recover the state after each specific instruction, and this means not only the state of the registers but also the memory state, since stores only update memory when they commit. Fortunately, recoveries are rare when compared to the total number of instructions. Recoveries may happen due to two causes: exceptions or mis-speculations. The former are highly unpredictable: a page miss, an illegal division, etc. The latter may only happen on speculated instructions, such as branches or memory operations incorrectly disambiguated^3. Why not just try to honor the famous motto Improve the common case [11] and execute normal instructions as if they would never except nor be part of any mis-speculation?

This motivation is the one followed in [6], although initial work on checkpointing was done by [12]. As far as we know, [6] is the first approach to substitute the need for a large ROB with a small set of microarchitectural checkpoints.

Once instructions finish their execution, the processor can free the resources that they are not going to need anymore. The processor does not need to wait for preceding instructions to finish. Of course, this relies heavily on the improbability of needing to return to this specific moment in the state of the machine. The processor is acting as if the state of the machine changed with every group of executed instructions, instead of after each instruction as the Architecture mandates.^2

^2 In reality they are needed, since precise interrupts must still be taken care of. Nevertheless, we will speak freely, using the verb need to mean that they are not to be treated with the same urgency, i.e. they are not needed shall be understood as they are not required to be accessed every cycle.

^3 We recognise that many other types of speculation exist and have been published in the literature. Nevertheless, the two most common types are the ones we shall discuss, and the philosophy can easily be extended to cover other cases.


Nevertheless, the functionality for returning to a precise state of the processor is still available. If any kind of error (i.e. mis-speculation, exception, . . .) happens, the processor needs only to return to the checkpoint previous to the error and proceed from there, either inserting checkpoints after each instruction, in a step-by-step way, or, if it remembered the excepting instruction, by taking a checkpoint at it. This re-execution of extra instructions is detrimental to performance. The total amount of extra re-execution can be minimised if the heuristics for taking checkpoints in the first place are driven towards the most commonly mis-speculated instructions.

As instructions de-allocate their resources when they finish, rather than waiting for all preceding instructions to finish, this mechanism has been called Out-of-Order Commit [5]. The name is somewhat misleading, since the architectural state of the machine is not modified out of order at all. Only when the whole set of instructions belonging to a checkpoint has finished, and the checkpoint is the oldest pending one, is the state of the machine modified. The checkpoints are thus committed in order, respecting ISA semantics and interrupt preciseness. The authors name this coupling of an out-of-order commit mechanism for instructions and an in-order commit mechanism for checkpoints a Hierarchical Commit mechanism.
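A minimal sketch of this hierarchical commit idea, under our own simplifying assumptions (each instruction belongs to exactly one checkpoint, and resource release is only implied):

```python
# Instructions "commit" (free resources) out of order; checkpoints commit in order.
class Checkpoint:
    def __init__(self, cp_id: int, n_instructions: int):
        self.cp_id = cp_id
        self.pending = n_instructions          # instructions not yet finished

class HierarchicalCommit:
    def __init__(self):
        self.checkpoints: list = []            # oldest checkpoint first

    def take_checkpoint(self, cp: Checkpoint) -> None:
        self.checkpoints.append(cp)

    def instruction_finished(self, cp_id: int) -> None:
        # Out-of-order release: the instruction's resources can be freed now;
        # only the architectural state waits for its checkpoint.
        for cp in self.checkpoints:
            if cp.cp_id == cp_id:
                cp.pending -= 1
        self._retire_checkpoints()

    def _retire_checkpoints(self) -> None:
        # In-order commit of checkpoints: architectural state (registers and
        # memory, via stores) only advances when the oldest checkpoint is done.
        while self.checkpoints and self.checkpoints[0].pending == 0:
            self.checkpoints.pop(0)            # state advances to this point
```

A misprediction or exception discards the younger checkpoints and restarts execution from the most recent surviving one, which is why checkpoint placement matters.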

In [5, 6] different heuristics are given for taking checkpoints. In [6] the authors analyse the impact of using loads as checkpoint heads. Later, in [5], branches are shown to be good places for taking checkpoints. This seems intuitive, since the largest number of mis-speculations happens at branches. Nevertheless, taking a checkpoint at every branch is neither necessary nor efficient. Many branches resolve very early, so adjacent checkpoints can be merged to minimise the number of checkpoint entries and increase the average number of instructions per checkpoint. Another proposed solution is to delay the taking of the checkpoint at a branch with some sort of simple ROB structure which holds the last n instructions. This pseudo-ROB structure would hold the state for the n newest instructions and decrease the need for so many checkpoints. Both papers have pointed out that the analysis and determination of where to take checkpoints is not a simple task, since it must combine the correct use of resources with the minimisation of recovery procedures. Further analysis should be expected in this field. In Section 5 we will discuss the sensitivity of this mechanism to certain parameters.

3 Managing the Issue of Kilo-instruction Processors

Probably the simplest classification of instructions in Kilo-Instruction Processors is to consider that an instruction may be either slow or fast (cf. Fig. 2). Slow instructions are those that depend on long latency instructions, such as loads that miss in the L2 cache or long floating point operations. Issuing an instruction is not simple at all, and treating these slow instructions with unknown latencies increases the complexity for the whole set of instructions. Several studies have addressed this particular problem [3, 9, 20, 23, 29], and more specifically three of them [2, 5, 14] stand out with proposals on how to modify the instruction queues in order to tolerate the problems caused by bigger latencies.


[14] presents a mechanism for detecting long latency operations, namely loads that miss in L2. Instructions dependent on such a load are then moved into a Waiting Instruction Buffer (WIB), where they reside until the completion of the particular load. In order to determine whether an instruction depends on the load or not, the wakeup logic of the window is used to propagate the dependency. The selection logic is used to select the dependent instructions and send them into the WIB, which acts as a functional unit. This mechanism is very clever, but we believe that using the same logic for two purposes puts more pressure on a logic which is on the critical path anyway [23]. Besides, the WIB contains all instructions in the ROB, which the authors consider not to be on the critical path. This way they supposedly simplify the tracking of which missing load instructions each of the WIB instructions depends on. [14] has been a pioneer in this field, among the first to propose solutions, but we believe that certain decisions in how the mechanism behaves will not scale. The WIB structure is simpler than the IQ it replaces, but still very complex to scale to the necessities of 10K in-flight instruction processors. Besides, the tracking of dependencies on the different loads is done orthogonally for all the outstanding loads, which we believe may be another cause of not scaling correctly.
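A highly simplified sketch of the WIB idea (our own illustration; the actual proposal in [14] reuses the existing wakeup/select logic to identify dependents rather than explicit load tags):

```python
# Park instructions dependent on an L2-missing load; release them when it completes.
from collections import defaultdict

class WaitingInstructionBuffer:
    def __init__(self):
        self.waiting = defaultdict(list)   # missing-load tag -> parked instructions

    def park(self, load_tag: int, instruction) -> None:
        # Called when a dependent of a missing load is identified at issue time.
        self.waiting[load_tag].append(instruction)

    def load_completed(self, load_tag: int) -> list:
        # The load returned from memory: its dependents go back to the issue queue.
        return self.waiting.pop(load_tag, [])
```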

In [2] the authors propose a hierarchical scheduling windows (HSW) mechanism. This mechanism has two scheduling windows, one big and slow and another one smaller but faster. Each of them has its own scheduler, which schedules instructions independently. This distribution forces the hardware to duplicate the number of functional units. Besides, the management of register accesses is rather complex due to this clusterisation. The important thing about this mechanism is that all instructions are decoded and initially sent to the slow instruction window. Long latency instructions and their dependents are considered more critical and are then prioritised by moving them into the small and fast one. The rationale behind this is that instructions on the critical path should finish as soon as possible; the rest of the instructions can pay the penalty imposed by the slower and bigger instruction window. Notice how this mechanism is conceptually opposed to the one presented in [14], although both try to achieve the same goal.
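The sketch below captures only the promotion policy of such a two-level window (our own simplification; the real proposal in [2] duplicates functional units and schedules both windows independently):

```python
# Toy two-level scheduling window: everything starts in the large, slow window;
# latency-critical instructions are promoted to a small, fast window.
class TwoLevelWindow:
    def __init__(self, fast_capacity: int = 16):
        self.slow: list = []     # large, slow window (all instructions start here)
        self.fast: list = []     # small, fast window (critical instructions)
        self.fast_capacity = fast_capacity

    def dispatch(self, instruction) -> None:
        self.slow.append(instruction)

    def promote_critical(self, is_critical) -> None:
        # Move instructions flagged as critical (e.g. a long-latency chain and
        # its dependents) into the fast window, as space allows.
        for ins in [i for i in self.slow if is_critical(i)]:
            if len(self.fast) < self.fast_capacity:
                self.slow.remove(ins)
                self.fast.append(ins)
```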

Another reference that proposes a mechanism to handle the complexity and necessities of instruction queues for Kilo-Instruction Processors is [5]. This mechanism is conceptually nearer to that of [14] than to the one presented in [2]. It differs from it, however, in its complexity: it has been devised to scale further than the one presented in [14]. First of all, in order to simplify the detection of long latency loads and their dependents, it uses a pseudo-ROB structure which tracks the last n instructions. When leaving this structure (which can easily be coupled with the newest part of the ROB if the microarchitecture has one), instructions are analysed to see whether they are a missing load or depend on one. This dependency tracking is easier and does not need wakeup or selection logic. If the instruction is a slow one, it is moved from the IQ^4 into a Slow Lane Instruction Queue (SLIQ), where it will wait until the missing load returns from main memory.

^4 Microarchitecturally it would be simpler to invalidate entries in the IQ and to insert into the SLIQ from the pseudo-ROB, but for clarity of explanation the instructions are assumed to move from the IQ to the SLIQ.

In order to manage several missing loads, instead of adding more complexity to the SLIQ, the authors propose to treat instructions linearly. Each load has a starting point in this FIFO-like structure, and once it finishes it starts to move instructions from the SLIQ to the small and fast IQ. In our simulations this is done at a maximum pace of 4 instructions per cycle, with a starting penalty of 4 cycles. The instructions are re-introduced into the fast IQ when their source operands no longer depend on a long latency instruction. However, it could happen that a second long latency load is resolved while the instructions dependent on the first one are still being processed. Two different situations can arise. First, if the new load is younger than the instructions being processed, it will be found by the wakening mechanism; after that, the mechanism can continue, placing the instructions dependent on either of the two loads in the instruction queues. Second, if the new load is older than the instructions being processed, it will not be found by the wakening process, so a new wakening process must be started for the second load. Extra penalty in the re-insertion does not seem to affect performance, which leaves a margin for the consequences of the potential complexity of this SLIQ [5]. Finally, an additional advantage of the pseudo-ROB is that it reduces the misprediction penalty of branch instructions. Since our out-of-order commit mechanism removes the ROB, mispredicted branches force the processor to return to the previous checkpoint, which is potentially some hundred instructions behind. The information contained in the pseudo-ROB allows the processor to recover from branch mispredictions without needing to use a far checkpoint, whenever the branch instruction is still stored in the pseudo-ROB.
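The following sketch illustrates the SLIQ draining policy under our own simplifying assumptions (explicit per-instruction sets of pending loads instead of the linear walk from each load's starting point described above; the 4-per-cycle pace mirrors the simulation parameter quoted):

```python
# Slow instructions wait in a FIFO; after a missing load returns, its ready
# dependents drain back into the fast IQ a few per cycle.
from dataclasses import dataclass, field

@dataclass
class SlowInstruction:
    name: str
    pending_loads: set = field(default_factory=set)  # missing loads it still waits on

class SlowLaneIQ:
    def __init__(self, drain_rate: int = 4):
        self.fifo: list = []          # slow instructions, program order, oldest first
        self.drain_rate = drain_rate

    def insert_slow(self, ins: SlowInstruction) -> None:
        self.fifo.append(ins)

    def load_completed(self, load_tag: int) -> None:
        # A missing load returned from memory: remove it from its dependents.
        for ins in self.fifo:
            ins.pending_loads.discard(load_tag)

    def drain_cycle(self, fast_iq: list) -> None:
        # One cycle of draining: re-insert up to drain_rate instructions whose
        # sources no longer depend on any long-latency load into the fast IQ.
        moved, remaining = 0, []
        for ins in self.fifo:
            if not ins.pending_loads and moved < self.drain_rate:
                fast_iq.append(ins)
                moved += 1
            else:
                remaining.append(ins)
        self.fifo = remaining
```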

4 Managing the Dataflow Necessities of Kilo-instruction Processors

Decreasing the lifetime of registers has been a hot topic for over 10 years now. As explained in Section 1, the motivation behind all this research is to diminish the impact that the register file and the renaming logic have on performance, which is very important as has been stated in several papers [1, 21, 23, 27, 32]. In many cases the authors propose to microarchitecturally modify the register file structure so that it scales better for current or future processor generations [7, 16, 19, 24, 25]. In other cases, the authors propose to modify the renaming strategy to limit the number of live registers at any moment and therefore decrease the long-term necessities of the processor. All this can be accomplished thanks to the classification of the different types of registers shown in Fig. 3.

A good example of this last type of research is presented in [21]. This paper is, to our knowledge, the first to propose recycling registers as soon as they have been fully and irreversibly superseded by the renaming logic, as opposed to waiting for instruction retirement. In order to be able to do this, the microarchitecture keeps a per-register count of the consumers of the register that have not yet executed. When a particular register has been unmapped, i.e. the logical register has been mapped to another physical one, and all its consumers have read the particular value, the mechanism is able to reuse this particular register for other purposes.
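The bookkeeping can be sketched as follows (our own illustration of the counting scheme, not the implementation in [21]):

```python
# Early register recycling: free a physical register as soon as its logical
# register has been re-mapped and all its renamed consumers have read it.
class PhysRegState:
    def __init__(self):
        self.pending_readers = 0     # renamed consumers that have not read it yet
        self.unmapped = False        # logical register already mapped elsewhere?

    def maybe_free(self, free_list: list, preg: int) -> None:
        if self.unmapped and self.pending_readers == 0:
            free_list.append(preg)   # early recycle, before the redefiner commits

# Renaming a consumer increments pending_readers; a consumer reading the value
# decrements it and calls maybe_free; renaming a new producer of the same
# logical register sets unmapped on the old mapping and calls maybe_free.
```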

Solutions for delaying register allocation have also been presented in the literature prior to the research on Kilo-Instruction Processors, such as [10, 32]. These proposals require deadlock avoidance mechanisms. They also incur a complexity penalty for Kilo-Instruction Processors^5 which makes them unsuitable for us. In the case that a particular instruction fails to secure a destination register and needs to be re-executed, two alternatives appear: either it must be kept in the IQ throughout its execution, or the IQ must be augmented to admit re-insertion of instructions.

More recent papers have partially addressed the problem of de-allocating resources in Kilo-Instruction Processors. One of them is [19], where the authors propose Cherry, a hybrid checkpoint/ROB-based mechanism that allows early recycling of multiple resources. They use the ROB to support speculative execution and precise interrupts, and state checkpointing for precise exceptions. They identify the instructions not subject to mis-speculation and apply early recycling of load/store queue entries and registers. This mechanism does not go beyond unresolved branches or speculative loads, which limits its scope.

Other papers have made proposals to recycle registers based on compiler support and/or special ISA instructions, most notably [15–17]. We find these mechanisms very interesting, but in this paper we have limited our scope to microarchitectural solutions which do not need to change the ISA or the compiler.

A new proposal specifically targeted at Kilo-Instruction Processors is the one presented in [18], called Ephemeral Registers. This is an aggressive register recycling mechanism which combines delayed register allocation and early register recycling in the context of Kilo-Instruction Processors.

The mechanism has been tailored to support fast recovery from mis-speculations while supporting precise exceptions and interrupts. This is accomplished by combining these mechanisms with checkpointing, which allows the processor to deallocate resources non-conservatively. This is especially important in Kilo-Instruction Processors, since these processors rely heavily on speculation mechanisms to extract parallelism. This paper is also the first proposal that integrates both a mechanism for delayed register allocation and early register recycling, and analyses the synergy between them. With respect to [10], they opt for assigning resources during the issue stage rather than in the writeback stage, since this simplifies the design. The mechanism tries to secure a register within a few cycles after issue. During these cycles, the instruction keeps its IQ entry. If successful, the IQ entry is released; if unsuccessful, the instruction is squashed and, since it retains its entry in the IQ, can be retried later.
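A minimal sketch of the issue-time allocation-with-retry policy described above, under our own simplified assumptions (a plain free list, no virtual tags):

```python
# Delayed register allocation: try to secure a physical destination register at
# issue; keep the IQ entry until the allocation succeeds, retry otherwise.
def try_issue(instruction, free_regs: list, issue_queue: list) -> bool:
    if free_regs:
        instruction.dest_phys = free_regs.pop()   # allocation delayed until issue
        issue_queue.remove(instruction)           # safe to release the IQ entry now
        return True
    # No register could be secured: squash this attempt; the instruction keeps
    # its IQ entry and will be retried in a later cycle.
    return False
```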

^5 Of course the authors' solutions were not intended for Kilo-Instruction Processors, and they perfectly fit the problems addressed in their respective papers.


5 Experimental Results

In this section we shall present some experimental results for some of the techniques presented in the previous sections. The benchmark suite used for all the experiments is SPEC2000fp, averaging over all the applications in the set. All benchmarks have been simulated for 300 million representative instructions, where representativeness has been determined following [26]. The simulator targets a superscalar out-of-order microarchitecture which fetches, issues and commits a maximum of 4 instructions per cycle. The branch predictor assumed is a 16K-history gshare with a 10 cycle misprediction penalty. The memory hierarchy has separate instruction and data first-level caches (L1), each 32 KB, 4-way set associative, with a 2 cycle hit latency. The second-level cache (L2) is unified for data and instructions; its size is 512 KB with a 10 cycle hit latency. Notice that in this description of the architecture we have intentionally omitted the most common parameters: the sizes of the IQs, the size of the re-order buffer, the latency to main memory and the number of physical registers. Since our experiments deal exactly with these parameters, each figure details the exact assumptions of the results presented.

In Fig. 4 we can see the IPC obtained with different numbers of checkpoints using the technique presented in [5] and explained in Section 2. The Limit bar shows the IPC obtained when considering a ROB of 4096 entries, which is microarchitecturally unfeasible. This limit helps us show how near we are to a pseudo-perfect mechanism which does not take complexity into account. Having only 4 checkpoints produces just a 20% slowdown. If we increase to 8 checkpoints this slowdown decreases to only 9%, and from 32 checkpoints onwards the slowdown is just 6%. This 6% slowdown with 256 times fewer entries is clearly spectacular.

Figure 5 presents results for the technique called Slow Lane Instruction Queuing, presented in [5] and explained in Section 3. The experiments show the IPC obtained on average by all the SPEC2000fp applications when executed under different architectural configurations. The three bars of each group show the IPC when the IQ size ranges from 32 to 128 entries.

Fig. 4. Sensitivity of our Hierarchical Commit mechanism to the number of available checkpoints, with an IQ size of 2048 entries and 2048 physical registers. [Figure omitted: IPC (0–3.5) for the Limit configuration and for 4, 8, 16, 32, 64 and 128 checkpoints.]

Fig. 5. IPC with respect to the memory latency, the size of the SLIQ and the IQ size, with 16 checkpoints and 4096 physical registers. [Figure omitted: IPC (0–3.5) for SLIQ sizes of 512, 1024 and 2048 at memory latencies of 100, 500 and 1000 cycles; bars: IQ sizes 32, 64 and 128; the Limit 4096 level is marked for each latency.]

It is remarkable that in each group of bars the 64-entry bar achieves nearly the same IPC as the 128-entry one. This is because, with this mechanism, 64 entries are enough to extract sufficient parallelism without forcing the processor to stall. It can also be seen that, in order to tolerate higher latencies to main memory, the SLIQ must be bigger.

Figure 6 presents a combination of the three different techniques from the previous figures, i.e. SLIQ, Out-of-Order Commit and Ephemeral Registers. These three techniques were coupled together in a simulator which assumed a 64-entry pseudo-ROB and IQs of 128 entries. The figure is divided into three zones, each comprising the results for one of the memory latencies used throughout this paper (100, 500 and 1000 cycles). Each zone is composed of three groups of two bars each. Each group assumes an increasing amount of Virtual Registers. The two bars of each group represent the IPC with 256 or 512 physical registers. Besides, each zone has two lines which represent the performance obtained by the baseline (with 128 ROB entries) and the performance obtained by a Limit microarchitecture where all the resources have been up-sized with no constraints. Of course this Limit microarchitecture is unfeasible, but it helps us understand how well our mechanisms are behaving, since it acts as an upper bound.

Figure 6 shows some interesting points. First of all, it allows us to see how the combination of these three orthogonal techniques works. Another conclusion is that there is still room for improvement at 1000 cycles or more. We can notice how the mechanisms nearly saturate at 500 cycles: for this latency, having 1024 virtual tags is nearly the same as having 2048, while in the 1000 cycle zone we can see that the growth trend is far from saturating. This reinforces our belief that Kilo-Instruction Processors with more in-flight instructions will still yield benefits.


Fig. 6. IPC results of the combination of mechanisms (SLIQ, Out-of-Order Commit, Ephemeral Registers) with respect to the amount of Virtual Registers, the memory latency and the number of physical registers. [Figure omitted: IPC (0–3.5) for 512, 1024 and 2048 virtual tags at memory latencies of 100, 500 and 1000 cycles; bars: 256 and 512 physical registers; the Baseline 128 and Limit 4096 levels are marked for each latency.]

6 Conclusions

The ever-increasing difference between processor and memory speed is making the latter appear farther away with each new processor generation. Future microarchitectures are expected to see latencies of 500 cycles to main memory or even more. In order to overcome this hurdle, more in-flight instructions will be needed to maintain ILP in the event of missing load instructions. Although recent history has shown how an increase in the number of in-flight instructions helps tolerate this increasing latency, it has also pointed out the limitations that simply up-sizing the critical resources has on cycle time and thus on performance.

The resources that directly depend on the number of in-flight instructions are the re-order buffer, the instruction queues and the number of physical registers available. In this paper we have shown that these critical resources are under-utilised, and we have presented a survey of different techniques that accomplish the task of allowing thousands of in-flight instructions at reasonable cost. We have also shown that this pursuit is beneficial for performance and that it is attainable in the near future. Finally, we have shown how a subset of all the techniques described act together, forming a synergy.

Acknowledgments

The authors would like to thank Sriram Vajapeyam and Oliver Santana for their help during the writing of this paper. This work has been supported by the Ministry of Science and Technology of Spain, under contract TIC-2001-0995-C02-01, and by the CEPBA.


References

1. R. Balasubramonian, S. Dwarkadas, and D. Albonesi. Dynamically allocating processor resources between nearby and distant ILP. In Proceedings of the 28th Annual International Symposium on Computer Architecture, pages 26–37. ACM Press, 2001.

2. E. Brekelbaum, J. Rupley, C. Wilkerson, and B. Black. Hierarchical scheduling windows. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, pages 27–36. IEEE Computer Society Press, 2002.

3. M.D. Brown, J. Stark, and Y.N. Patt. Select-free instruction scheduling logic. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, pages 204–213. IEEE Computer Society, 2001.

4. A. Cristal, J.F. Martínez, J. Llosa, and M. Valero. A case for resource-conscious out-of-order processors. Technical Report UPC-DAC-2003-45, Universidad Politécnica de Cataluña, Department of Computer Architecture, July 2003.

5. A. Cristal, D. Ortega, J.F. Martínez, J. Llosa, and M. Valero. Out-of-order commit processors. Technical Report UPC-DAC-2003-44, Universidad Politécnica de Cataluña, Department of Computer Architecture, July 2003.

6. A. Cristal, M. Valero, A. González, and J. Llosa. Large virtual ROBs by processor checkpointing. Technical Report UPC-DAC-2002-39, Universidad Politécnica de Cataluña, Department of Computer Architecture, July 2002.

7. J-L. Cruz, A. González, M. Valero, and N.P. Topham. Multiple-banked register file architectures. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 316–325. ACM Press, 2000.

8. K.I. Farkas, P. Chow, N.P. Jouppi, and Z. Vranesic. Memory-system design considerations for dynamically-scheduled processors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 133–143. ACM Press, 1997.

9. D. Folegnani and A. González. Energy-effective issue logic. In Proceedings of the 28th Annual International Symposium on Computer Architecture, pages 230–239, Göteborg, Sweden, June 30–July 4, 2001. IEEE Computer Society and ACM SIGARCH. Computer Architecture News, 29(2), May 2001.

10. A. González, J. González, and M. Valero. Virtual-physical registers. In IEEE International Symposium on High-Performance Computer Architecture, February 1998.

11. J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantitative Approach. Second Edition. Morgan Kaufmann Publishers, San Francisco, 1996.

12. W.M. Hwu and Y.N. Patt. Checkpoint repair for out-of-order execution machines. In Proceedings of the 14th Annual International Symposium on Computer Architecture, pages 18–26. ACM Press, 1987.

13. N.P. Jouppi and P. Ranganathan. The relative importance of memory latency, bandwidth, and branch limits to performance. In Workshop on Mixing Logic and DRAM: Chips that Compute and Remember. ACM Press, 1997.

14. A.R. Lebeck, J. Koppanalil, T. Li, J. Patwardhan, and E. Rotenberg. A large, fast instruction window for tolerating cache misses. In Proceedings of the 29th Annual International Symposium on Computer Architecture, pages 59–70. IEEE Computer Society, 2002.

15. J. Lo, S. Parekh, S. Eggers, H. Levy, and D. Tullsen. Software-directed register deallocation for simultaneous multithreaded processors. Technical Report TR-97-12-01, University of Washington, Department of Computer Science and Engineering, 1997.

16. L.A. Lozano and G.R. Gao. Exploiting short-lived variables in superscalar processors. In Proceedings of the 28th Annual International Symposium on Microarchitecture. IEEE Computer Society Press, November 1995.

17. M.M. Martin, A. Roth, and C.N. Fischer. Exploiting dead value information. In Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society Press, December 1997.

18. J.F. Martínez, A. Cristal, M. Valero, and J. Llosa. Ephemeral registers. Technical Report CSL-TR-2003-1035, Cornell Computer Systems Lab, 2003.

19. J.F. Martínez, J. Renau, M.C. Huang, M. Prvulovic, and J. Torrellas. Cherry: checkpointed early resource recycling in out-of-order microprocessors. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, pages 3–14. IEEE Computer Society Press, 2002.

20. E. Morancho, J.M. Llabería, and A. Olivé. Recovery mechanism for latency misprediction. Technical Report UPC-DAC-2001-37, Universidad Politécnica de Cataluña, Department of Computer Architecture, November 2001.

21. M. Moudgill, K. Pingali, and S. Vassiliadis. Register renaming and dynamic speculation: an alternative approach. In Proceedings of the 26th Annual International Symposium on Microarchitecture, pages 202–213. IEEE Computer Society Press, 1993.

22. O. Mutlu, J. Stark, C. Wilkerson, and Y.N. Patt. Runahead execution: An alternative to very large instruction windows for out-of-order processors. In Proceedings of the Ninth International Symposium on High-Performance Computer Architecture, Anaheim, California, February 2003. IEEE Computer Society TCCA.

23. S. Palacharla, N.P. Jouppi, and J.E. Smith. Complexity-effective superscalar processors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 206–218. ACM Press, 1997.

24. I. Park, M. Powell, and T. Vijaykumar. Reducing register ports for higher speed and lower energy. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, pages 171–182. IEEE Computer Society Press, 2002.

25. A. Seznec, E. Toullec, and O. Rochecouste. Register write specialization register read specialization: a path to complexity-effective wide-issue superscalar processors. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, pages 383–394. IEEE Computer Society Press, 2002.

26. T. Sherwood, E. Perelman, and B. Calder. Basic block distribution analysis to find periodic behavior and simulation points in applications. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pages 3–14, September 2001.

27. D. Sima. The design space of register renaming techniques. IEEE Micro, 20(5):70–83. IEEE Computer Society, September 1999.

28. K. Skadron, P.A. Ahuja, M. Martonosi, and D.W. Clark. Branch prediction, instruction-window size, and cache size: Performance trade-offs and simulation techniques. IEEE Transactions on Computers, pages 1260–1281, 1999.

29. J. Stark, M.D. Brown, and Y.N. Patt. On pipelining dynamic instruction scheduling logic. In Proceedings of the 33rd Annual International Symposium on Microarchitecture, pages 57–66, Monterey, California, December 2000. IEEE Computer Society TC-MICRO and ACM SIGMICRO.

30. R.M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM Journal of Research and Development, January 1967.

31. J. Tseng and K. Asanovic. Energy-efficient register access. In XIII Symposium on Integrated Circuits and System Design, September 2000.

32. S. Wallace and N. Bagherzadeh. A scalable register file architecture for dynamically scheduled processors. In Proceedings: Parallel Architectures and Compilation Techniques, October 1996.

33. W.A. Wulf and S.A. McKee. Hitting the memory wall: Implications of the obvious. Computer Architecture News, pages 20–24, 1995.