
PROCESSOR POWER REDUCTION

Wing-Shan (Emily) Chan

[email protected]

May 2004

Supervisor: Annie Guo

[email protected]

Table of Contents

Abstract

1. Introduction

2. Review of Prior Work

2.1 Prior Related Work

2.1.1 Bai et al. [1]

2.1.2 Maro et al. [4]

2.1.3 Bahar et al. [5]

3. Proposal

3.1 Introduction

3.2 Design

3.2.1 Floating Point and Integer Clusters

3.2.2 Ready and Non-Ready FIFOs

3.3 Implementation

3.3.1 Performance Monitors

3.3.2 Power Estimates and Tools

3.4 Alternate proposal

3.5 Schedule

4. Conclusion

5. References

Abstract

Power dissipation has become a critical concern in the design of modern computer architectures. In this paper, I examine a number of previous studies on processor power-saving techniques and analyze the limitations of each. Based on these studies, I propose a solution towards the end of this paper.

1. Introduction

For years, research focused almost exclusively on maximizing processor performance. Such techniques include pipelining, superscalar architecture, caches, branch prediction, and various Instruction Set Architectures. With these ideas implemented, processor performance has risen dramatically. Only in the past decade, however, have researchers recognized the importance of addressing power consumption in the design phase of a processor. There are two main reasons for this change: (1) portable digital devices with embedded high-end microprocessors are becoming popular, yet limited battery life remains their major bottleneck; and (2) current high-end microprocessors are believed to be approaching the limits of conventional air-cooling techniques. Moreover, some cooling methods may themselves become power-hungry in order to keep a system at the right temperature. As a consequence, power has become an essential consideration for both portable and non-portable system designs.

An important observation [1] is that applications vary in their degrees of instruction-level parallelism (ILP), branch behavior and memory access patterns. Consequently, the available datapath resources are not fully utilized by all applications, or even at all times within a single application. Wall shows that the degree of ILP within a single application varies by a factor of three [2]. It is therefore important for a system to adapt to changes in resource usage and to adjust its configuration to current needs. Motivated by this observation, previous studies have targeted different components of a processor, such as caches, branch predictors and functional units, all sharing the ultimate goal of reducing processor power consumption. The most recent work has emphasized the contribution of the issue logic to total power dissipation in out-of-order superscalar processors. A typical example is the Alpha 21264 processor, in which the issue logic is responsible for 46% of the total power according to [3].

In my proposed scheme, I divide some of the components of the pipeline organization into two clusters, mainly those that lie in the middle of the pipeline. One cluster is dedicated to Floating Point instructions and the other to Integer instructions. Each cluster consists of an issue queue, a copy of the register file, the corresponding functional units and hardware performance monitors. The two clusters share the data cache, the instruction fetch/rename unit and the commit unit. I then divide the issue queue in each cluster into two parts: a ready-instruction queue and a non-ready-instruction queue, where a non-ready instruction is one with at least one operand pending. Each part is further partitioned into several sets (FIFOs). Only the head of each FIFO is visible to the request and selection/arbitration logic, so instructions within a FIFO issue in order.

My contribution is to show the potential for power saving through dynamically reconfiguring the issue logic as well as the functional units. Based on feedback from the hardware performance monitors, I dynamically modify the size and number of the existing FIFOs. Note that each cluster can be reconfigured independently of the other, and the same independence holds between the ready and non-ready queues. In addition, I also modify the number of available functional units on the fly, again guided by the hardware performance monitors.

The rest of the paper is organized as follows. Section 2 discusses previous approaches. Section 3 explains my proposed scheme in detail and presents my implementation schedule. Section 4 offers conclusions.

2. Review of Prior Work

Different researchers have attacked the problem at different levels; I briefly present some of the more interesting solutions here. Olson et al. [6] suggested placing two completely separate processor cores side by side, allowing the processor to operate in one of two modes: a high-performance mode and a low-power mode. However, a significant amount of time is required to switch between modes, since switching involves transferring the contents of one processor core to the other. Yuan et al. [7] presented a middleware framework coordinating processor and power resource management (PPRM), which dynamically adjusts processor speed and power consumption according to the system workload, the processor status and the available power. Yuan et al., however, addressed this solution only for multimedia applications.

Olson et al. approached the problem at a relatively low level, while Yuan et al. tackled it from a very high level. In this study, I plan to solve the problem at the architecture level: my proposal is to modify the processor architecture dynamically at run time to achieve power savings. In the next section, I discuss a number of prior related studies and analyze how effective each is at saving power.

2.1 Prior Related Work

2.1.1 Bai et al. [1]

Bai et al. [1] proposed two schemes for implementing a dynamically reconfigurable mixed in-order/out-of-order issue queue for power-aware processors. In both schemes the issue queue is partitioned into several smaller sets (FIFOs), and only the heads of these FIFOs are visible to the request and selection/arbitration logic. This visibility property forces each FIFO to issue instructions in order, which in turn simplifies the request and selection/arbitration logic and reduces its power consumption.
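The head-only visibility property can be sketched as follows. This is a minimal illustration, not Bai et al.'s actual implementation; the function name and data layout are my own.

```python
# Minimal sketch of a FIFO-partitioned issue queue: the select logic
# examines only the head of each FIFO, so instructions within a FIFO
# issue strictly in order.
from collections import deque

def select_for_issue(fifos, issue_width):
    """Pick up to issue_width instructions, looking only at FIFO heads.

    fifos: list of deques of (tag, ready) pairs.
    A non-ready head blocks everything behind it in its FIFO.
    """
    issued = []
    for fifo in fifos:
        if len(issued) == issue_width:
            break
        if fifo and fifo[0][1]:       # head exists and is ready
            tag, _ = fifo.popleft()   # only the head is ever visible
            issued.append(tag)
    return issued
```

Note how a ready instruction queued behind a non-ready head cannot issue; this is exactly the in-order restriction that keeps the selection/arbitration logic simple.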

In Scheme#1, a FIFO is completely disabled when feedback from the hardware performance monitors indicates that the FIFOs are underutilized. The monitors used here were implemented by Maro et al. in [4]; they consist mainly of simple counters and comparators, so their power consumption can be neglected [5]. The figure below shows an example of the possible operating modes of the issue queue under Scheme#1.

Figure 1 An example showing possible modes of the processor for Scheme#1. Graph redrawn based on [1].

One major drawback of this approach is that shrinking the total size of the issue queue limits the exposure of ILP. This is especially disadvantageous for Floating Point benchmarks, which exhibit a large degree of ILP; for this reason, Floating Point benchmarks were not tested under Scheme#1 in the paper. Across the Integer benchmarks, Scheme#1 achieves at best an average 27.6% power saving with only 3.7% performance degradation compared to the starting configuration (i.e. a base case that keeps the starting configuration throughout the run time of an application).

In Scheme#2, the size of the issue queue remains the same at all times, ensuring maximum exposure of ILP. Both the size and the number of FIFOs are modified simultaneously, which makes the issue queue appear smaller to the request and selection/arbitration logic. If the hardware performance monitors indicate that the processor is struggling to achieve high performance, the size of the FIFOs is decreased while their number is increased. The following figure presents an example of the issue queue configurations in different operating modes:

Figure 2 An example showing possible modes for the processor under Scheme#2. Graph redrawn based on [1].

Experimental results show that Scheme#2 is a rather effective power-saving mechanism: its best average case achieves a 27.3% power saving with only 2.7% performance degradation. This time, Floating Point benchmarks were also tested.
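The invariant behind Scheme#2 can be sketched as a simple reconfiguration step. This is a hypothetical illustration; the total capacity and the doubling/halving policy are my own placeholders, not values from [1].

```python
# Hypothetical sketch of Scheme#2-style reconfiguration: total issue-queue
# capacity is held constant while the number and size of the FIFOs trade
# off against each other.
TOTAL_ENTRIES = 64  # illustrative total, fixed at all times under Scheme#2

def reconfigure(num_fifos, fifo_size, need_performance):
    """Return a new (num_fifos, fifo_size) with the product held constant.

    More, smaller FIFOs expose more heads to the select logic (higher
    performance); fewer, larger FIFOs make the queue appear smaller to
    the select logic and save power.
    """
    assert num_fifos * fifo_size == TOTAL_ENTRIES
    if need_performance and fifo_size > 1:
        return num_fifos * 2, fifo_size // 2
    if not need_performance and num_fifos > 1:
        return num_fifos // 2, fifo_size * 2
    return num_fifos, fifo_size
```

Because the product never changes, ILP exposure is preserved even in the low-power configurations, which is precisely what distinguishes Scheme#2 from Scheme#1.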

Although this scheme works relatively well for both Integer and Floating Point benchmarks, some enhancements could give better control over different types of applications. For instance, the processor could handle the two instruction types separately, giving it greater flexibility to adapt to changes in the resource needs of each type.

In addition, further power savings can be achieved by restricting the broadcast of a just-computed result to non-ready instructions only. Applications with a large degree of ILP benefit the most from this restriction.

2.1.2 Maro et al. [4]

Maro et al. [4] implemented a hardware performance monitoring mechanism whose feedback determines whether to disable part of the Integer and/or Floating Point pipelines at run time in order to save power. The basic multi-pipelined processor is shown below:

Figure 3 Pipeline Organization of the processor. Graph taken from [4]

For each cluster, a maximum of four instructions can be issued per cycle.

A number of low-power operating modes are defined in the work:

Figure 4 Possible Operating Modes for the processor. Table taken from [4]

Both entering and exiting these modes depends on feedback from the hardware performance monitors, while exiting also depends on trigger events such as data/instruction cache misses and floating point activity.

This approach provides greater flexibility in handling instructions according to their type. Nevertheless, shrinking the overall size of the issue queue when entering some of the operating modes limits the exposure of ILP; Scheme#1 described in 2.1.1 has the same negative effect, as both alter the total issue queue size at run time. Moreover, the select and wake-up logic has no way to distinguish between ready and non-ready instructions. The system is therefore power-inefficient in that the selection and wake-up signals of every entry in the issue queue must be updated each cycle, even for instructions that are not ready to issue.

2.1.3 Bahar et al. [5]

Bahar et al. [5] proposed a technique called Pipeline Balancing (PLB), which allows a cluster, or part of a cluster, of functional units to be disabled by varying the issue width. The pipeline organization of the 8-wide issue processor is shown below:

Figure 5 Pipeline Organization of the processor. Graph taken from [5].

Bahar et al. implemented PLB with two possible issue widths in low-power mode: 4-wide and 6-wide. In the 6-wide configuration, two Floating Point functional units are disabled, resulting in an unbalanced machine; in the 4-wide configuration, an entire cluster of functional units is disabled (the shaded area in the figure above). Two basic triggers are used in both modes: issue IPC (IIPC) and Floating Point IPC (FPIPC), both measured by hardware performance monitors [5]. More power is saved in the 4-wide configuration, but the performance penalty of spuriously entering 4-wide mode can be large. An extra trigger, mode history, prevents spurious entry: no transition into 4-wide mode is allowed unless the conditions for the two basic triggers are satisfied for two consecutive sampling windows.

The state machine for enabling/disabling PLB is shown below; EC4w, DC4w, EC6w and DC6w are the enabling and disabling conditions for the 4-wide and 6-wide modes respectively:

Figure 6 State machine of the processor. Graph taken from [5].

The conditions are listed as follows:

Figure 7 Entering and exiting conditions for each state of the processor. Table taken from [5].

It is important that these threshold values allow the system to respond to changes in a program's needs effectively and efficiently. For example, a program with a burst of floating point instructions during part of its execution will suffer if the processor fails to restore normal mode (rebalancing the structure) promptly; the performance loss stems mainly from the unbalanced datapath structure.
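The mode-history rule can be sketched as a small controller. This is an illustrative abstraction: the IIPC/FPIPC threshold tests of [5] are collapsed into the booleans `want_4wide` and `want_6wide`, and the class name is my own; the point shown is that 4-wide mode is entered only after its conditions hold for two consecutive sampling windows.

```python
# Illustrative sketch of PLB's mode-history trigger: spurious single-window
# requests for 4-wide mode are filtered out; two consecutive windows are
# required before the deeper low-power mode is entered.
class PLBController:
    def __init__(self):
        self.mode = "8-wide"
        self.windows_wanting_4wide = 0   # consecutive-window counter

    def sample_window(self, want_4wide, want_6wide):
        self.windows_wanting_4wide = (
            self.windows_wanting_4wide + 1 if want_4wide else 0)
        if self.windows_wanting_4wide >= 2:
            self.mode = "4-wide"         # mode-history condition satisfied
        elif want_6wide:
            self.mode = "6-wide"
        else:
            self.mode = "8-wide"
        return self.mode
```

A single window favouring 4-wide leaves the machine at full width; only a second consecutive window triggers the transition, which bounds the performance penalty of a spurious entry.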

One major drawback of this approach is, again, the inability to distinguish between ready and non-ready instructions (as in the previous approach), resulting in unnecessary power waste in the selection and wake-up logic.

Moreover, this scheme only allows Floating Point functional units to be disabled, limiting the system's flexibility in adapting to different types of applications and therefore not maximizing the potential power saving.

3. Proposal

3.1 Introduction

Bai et al. [1] state that a good design strategy should be flexible enough to dynamically reconfigure available resources according to a program's needs. This statement serves as the golden rule throughout the design of my strategy: my proposed scheme provides as much flexibility as possible for the system to adapt to changes in a program's needs, so that it reacts effectively and efficiently toward the ultimate goal of saving processor power.

3.2 Design

The fundamental configuration of my proposed scheme is as follows:

Parameter                                Configuration
Issue queue size                         128 entries
Machine width                            8-wide fetch, issue, commit (4-wide per cluster)
Functional units                         8 Integer FUs, 4 Floating Point FUs, 4 memory ports
Issue queue size per cluster             64 entries
Ready queue size per cluster             8 entries
Non-ready queue size per cluster         56 entries

Figure 8 Configuration of my proposed processor

Note that only the relevant parameters are shown at this stage; a more detailed description will be provided in the next report. The values given are estimates and are subject to change as implementation issues arise.

The basic idea of my proposal is to further divide the FIFO architecture of [1] into smaller components according to the type and status of the instructions.

3.2.1 Floating Point and Integer Clusters

I divide the middle part of the pipeline organization in [1] into two clusters: one dedicated to Integer operations and the other to Floating Point operations. Each cluster has its own issue queue (FIFOs), reorder buffer (ROB), copy of the register file, hardware performance monitors and corresponding functional units. Both clusters share the instruction fetch/rename unit, the data cache and the commit unit. Note that this is a preliminary design; the architecture may change as issues arise during implementation. The figure below shows the proposed pipeline structure:

Figure 9 Pipeline Organization of my proposed scheme

The major advantage of this design over the original in [1] is greater flexibility when handling different types of benchmarks. Integer benchmarks, for example, contain hardly any Floating Point instructions, so the Floating Point functional units would consume power without contributing to performance. With this modified structure, the system has considerably more control over the power consumption of various types of applications.

3.2.2 Ready and Non-Ready FIFOs

I further divide the issue queue in each cluster into two components: (1) a ready-instruction queue and (2) a non-ready-instruction queue. The purpose of this division is to restrict the broadcast of a just-computed result to only those instructions that are non-ready. This yields extra power savings, especially for applications exhibiting high ILP. The figure below shows the structure of the proposed issue queue within a cluster:

Figure 10 Internal architecture of an issue queue in a cluster

At this stage, I plan to allow the two components to be reconfigured independently of each other, according to feedback from the hardware performance monitors [1, 4] and other trigger events [5]. The main reasons are to simplify the implementation and to provide more flexibility in dynamically reconfiguring the issue queue; further investigation may examine how different combinations of component configurations affect system performance.

An important property is that the total number of entries across the two component queues remains constant at all times and for all applications. This avoids the negative effect of limiting ILP exposure by shrinking the issue queue.
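The restricted broadcast can be sketched as follows. This is an illustrative sketch with hypothetical names, not the final design: a just-computed result tag is compared only against the non-ready queue, so ready instructions are never re-examined.

```python
# Sketch of a restricted result broadcast: only non-ready instructions
# observe the broadcast; instructions whose last pending operand arrives
# migrate to the ready queue, so the combined occupancy of the two
# queues stays constant.
def broadcast_result(result_tag, ready_q, nonready_q):
    """Wake non-ready instructions waiting on result_tag.

    Entries are (tag, pending) pairs, where pending is the set of
    operand tags the instruction is still waiting for.
    """
    still_waiting = []
    for tag, pending in nonready_q:
        pending.discard(result_tag)
        (still_waiting if pending else ready_q).append((tag, pending))
    nonready_q[:] = still_waiting
```

Since only non-ready entries are compared against the broadcast tag, the wake-up comparators for ready instructions never toggle, which is where the extra power saving comes from.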

In addition to the above modifications, I also propose to monitor the usage of the functional units: more power can be saved by disabling functional units that are not being utilized. This, too, is done independently in each cluster; for example, disabling a Floating Point functional unit does not affect the operation of the Integer cluster.

3.3 Implementation

3.3.1 Performance Monitors

As in [1, 4, 5], the system is reconfigured according to feedback from hardware performance monitors. As stated before, these monitors consist mainly of simple counters and comparators, so their power consumption can be neglected [5]. At this stage, the cycle window [1] is set to either 512 or 1024 cycles; I will further investigate the feasibility and effects of using different cycle window sizes for different monitors, which would let the system respond more flexibly to feedback from each monitor. I implement the following hardware performance monitors:

Monitoring IPC for each cluster separately:

If either the Floating Point issue IPC or the Integer issue IPC is low during the current cycle window, the ILP in the program may be low, and the cluster may be switched to a lower-power mode.

Monitoring Ready Instructions:

If the occupancy of the ready queue of a cluster is high, the ILP in the application may be high, and the ready queue may be switched to a higher-performance mode.

Detecting Variations in IPC:

A significant difference between the issue and commit rates of a cluster may indicate a high branch misprediction rate. The number of FIFOs in the cluster can then be reduced to restrict the issue rate and indirectly limit the number of mispredicted instructions issued.

Performance Degradation:

If the IPC drop between two consecutive sampling windows exceeds a threshold within a cluster, the cluster is restored to a higher-performance mode.

Issue Queue Usage:

Low occupancy in both components (the ready and non-ready queues) may indicate that the issue queue is underutilized and the number of FIFOs can be reduced.

Functional Unit Usage [4]:

A simple shift register provides a relatively inexpensive way to monitor whether the program is underutilizing the available resources over time. In each cycle in which the percentage of busy functional units falls below a certain threshold, a 1 is shifted into the register. If, at any cycle, the number of 1s in the register exceeds a predefined threshold, the program is deemed to be underutilizing its resources and some functional units can be disabled. The underlying assumption is that if recent history indicates underutilization, few resources will be needed in the following cycles as well.
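The shift-register monitor described above can be sketched as follows. The register length and both thresholds are illustrative placeholders to be tuned experimentally, and the class name is my own; the mechanism follows the description from [4].

```python
# Sketch of the functional-unit usage monitor: a shift register records,
# cycle by cycle, whether utilization was below a "busy" threshold; too
# many 1-bits in recent history flags underutilization.
from collections import deque

class FUUsageMonitor:
    def __init__(self, history_len=16, busy_frac=0.5, ones_needed=12):
        self.history = deque([0] * history_len, maxlen=history_len)
        self.busy_frac = busy_frac       # utilization threshold per cycle
        self.ones_needed = ones_needed   # 1-bits implying underutilization

    def tick(self, busy_fus, total_fus):
        """Each cycle: shift in a 1 when utilization is below threshold."""
        self.history.append(1 if busy_fus / total_fus < self.busy_frac else 0)

    def underutilized(self):
        """True when recent history suggests some FUs can be disabled."""
        return sum(self.history) > self.ones_needed
```

The fixed-length deque plays the role of the hardware shift register: each new bit pushed in discards the oldest, so the decision always reflects only recent history.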

Issue attempts [4]:

For each ready instruction, the number of issue attempts made before a functional unit becomes available is counted: whenever an instruction is prevented from issuing due to a lack of resources, its counter is incremented. When the total count over all ready (still unissued) instructions reaches a certain threshold, the cluster should restore some of its functional units.
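The issue-attempt counting can be sketched in two small steps. The function names and the threshold value are illustrative placeholders; the mechanism follows the description from [4].

```python
# Sketch of the issue-attempt monitor: each ready-but-unissued instruction
# accumulates failed issue attempts; a large total signals that disabled
# functional units should be restored.
def record_denied_issues(counters, denied_tags):
    """Increment the counter of every instruction denied a functional unit."""
    for tag in denied_tags:
        counters[tag] = counters.get(tag, 0) + 1

def should_restore_fus(counters, threshold=32):
    """True when accumulated failed attempts justify re-enabling FUs."""
    return sum(counters.values()) >= threshold
```

Summing over all still-waiting instructions, rather than checking any single counter, makes the monitor sensitive to both one badly starved instruction and many mildly starved ones.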

The threshold values for the monitoring techniques listed above are not specified in this report, since experiments are needed to determine appropriate values. The list may be modified as implementation issues arise.

3.3.2 Power Estimates and Tools

I estimate the total power savings of my processor based on the power estimates for the Alpha 21264 processor in [3]. More details on how the power values are estimated will be provided in the next report.

The simulator used in this study is derived from the SimpleScalar tool suite. Modifications are required to better model my processor; one is to split SimpleScalar's Register Update Unit (RUU) into a reorder buffer (ROB) and an issue queue (IQ) for each cluster [1].

If possible, I may also run the experiments using the Wattch tool, an extension to SimpleScalar that includes power estimation.

The benchmarks used in this study are the same as those in [1], allowing me to compare the performance of my processor with the one implemented there.

3.4 Alternate proposal

I also propose a very similar alternate approach. It adopts the same architecture as described above, except that Integer and Floating Point instructions are no longer handled separately: each cluster contains both Integer and Floating Point functional units. This scheme will help me analyze how separating the Integer and Floating Point clusters impacts overall performance. Because of the similarity, the details of this alternate approach are not covered here; more information will be provided in the final report.

3.5 Schedule

4. Conclusion

With rapidly rising awareness of the importance of addressing power in the processor design phase, much research has been carried out, presenting many ways to save power while minimizing the impact on overall performance. Building on these studies, I propose a strategy focused mainly on the issue logic and on the usage of functional units. The aim of my study is to show that dynamically reconfiguring the internal structure of the processor according to different sources of feedback yields savings in power consumption, and that the proposed processor maximizes its flexibility in responding to different programs' needs, thereby fulfilling the golden rule stated in [1].

5. References

[1] Y. Bai and R. I. Bahar. A dynamically reconfigurable mixed in-order/out-of-order issue queue for power-aware microprocessors. Division of Engineering, Brown University.

[2] D. W. Wall. Limits of instruction-level parallelism. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), November 1991.

[3] K. Wilcox and S. Manne. Alpha processors: A history of power issues and a look to the future. In Cool-Chips Tutorial, November 1999. Held in conjunction with the 32nd International Symposium on Microarchitecture.

[4] R. Maro, Y. Bai, and R. I. Bahar. Dynamically reconfiguring processor resources to reduce power consumption in high-performance processors. In Workshop on Power-Aware Computer Systems, November 2000. Held in conjunction with the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).

[5] R. I. Bahar and S. Manne. Power and energy reduction via pipeline balancing. In Proceedings of the 28th International Symposium on Computer Architecture, July 2001.

[6] E. Olson and A. Menard. Issue logic and power/performance tradeoffs. Laboratory for Computer Science, Massachusetts Institute of Technology.

[7] W. Yuan and K. Nahrstedt. A middleware framework coordinating processor/power resource management for multimedia applications. Department of Computer Science, University of Illinois at Urbana-Champaign.

DESIGN PHASE (3 WEEKS)

Eliminate any uncertainty in design through research

Modify design when necessary

Consolidate design including determining parameter values for processor architecture

Document any changes with reasons

IMPLEMENTATION PHASE STAGE 1 (9 WEEKS)

Modify the out-of-order simulator in SIMPLESCALAR according to the design documents

Modify design when necessary

Document any issues arisen and solution proposed

Document any changes with reasons

Verify correctness in functioning of the processor

IMPLEMENTATION PHASE STAGE 2 (3 WEEKS)

Implement the hardware performance monitors

Add or modify existing monitors in design documents when necessary

Test and determine the initial threshold values

Document any changes with reasons

Document any issue arisen and solution proposed

OPTIONAL PHASE

Implement the alternate approach proposed in the design document

Implement the proposed processor using the Wattch Tool

Investigate the effect of changing the issue width while reconfiguring the functional units

REPORT PREPARATION PHASE (1 WEEK)

Finalize experiments

Comment and analyze the results

Document the analysis and comments made

TESTING PHASE (6 WEEKS)

Run the simulator using different combinations of performance monitors (with different threshold values) for each benchmark

Modify the processor architecture when necessary

Investigate if design fails and try to solve the problem(s) encountered if possible

Document any changes with reasons

Document any issue arisen and solutions proposed

Document reasons for failure and/or solutions proposed

FINAL DOCUMENTATION PHASE (2 WEEKS)

Summarize all the documents throughout the entire project

Write up the final report

Prepare for final presentation
