INSTRUCTION SCHEDULING FOR VLIW PROCESSORS UNDER VARIATION SCENARIO

A thesis submitted in partial fulfilment of the requirements for the degree of
Master of Science (by Research) in Computer Science and Engineering

by

Mujadiya Nayan Vasantbhai
Roll No: 200605011
nayan [email protected]

Center for VLSI and Embedded System Technologies
INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY
GACHIBOWLI, HYDERABAD, A.P., INDIA - 500 032.
NOVEMBER 2010
Table 7. Latency map for each component (in cycles).
on/off, we provide a sleep signal for each integer functional unit. In the second
technique, a high-latency FU (a process-variation-affected FU) is turned on based
on the IPC value. Functional unit 'turn-on' and 'turn-off' instructions (which control
the supply gating) are inserted at the beginning and end of the associated loop,
considering its IPC and the available integer functional units.
It should be noted that our techniques can work in conjunction with any
other performance-oriented scheduler, such as basic block/trace scheduling [42, 43],
superblock scheduling [44], and hyperblock scheduling [45]. We now elaborate on our
proposed techniques, considering the variation map (as shown in Table 1) and
the latency table (as shown in Table 7), in the following subsections.
Cycle  FU[0]  FU[1]  FU[2]  FU[3]  FU[4]  FU[5]  FU[8]  FU[9]
0                    A1     A2            A3
1                    A4     A5            A6     L1     L2
2                    A7     A8            C1
3                    C2     A9            A10    L3
4                                                L4
5
6                    M1     M2            M3
7
8
9                                                S1     S2
Table 8. VLIW schedule after applying ‘turn-off’.
Loop      Cycle  FU[0]  FU[1]  FU[2]  FU[3]  FU[4]  FU[5]  FU[8]  FU[9]
1         0                    A1     A2            A3
(IPC=3)   1                    M1     M2            A6
          ...
          5                                                S      S
          ...
...
k         0      A4     A5     A1     A2     A6     A3
(IPC=6)   1
          ...
          7                                                S      S
Table 9. Scheduling tables for different basic blocks in different loops after applying 'on-demand turn-on'.
4.1.1 Turn-off
VLIW compilers perform all translation and scheduling at compile time, so
we can use information from the variation map to schedule instructions only
on clean FUs. In the turn-off technique, we turn off high-latency FUs and use only
clean FUs for scheduling. We also turn off the unused clean FUs so that the leakage
power of both unused and variation-affected functional units is greatly reduced.
To decide which unused clean FUs to turn off, we use the IPC information, and
priority is given to the FUs which consume the most leakage.
By considering the variation map (as shown in Table 1) and the latency table
(as shown in Table 7), a schedule obtained with the 'turn-off' technique is shown in
Table 8. Since functional units FU0, FU1, and FU4 have high latency,
instructions A1 to A10 are scheduled only on clean FUs (i.e., FU2, FU3, and FU5).
Similarly, multiply instructions M1, M2, and M3 are scheduled on FU2, FU3, and FU5,
respectively. It can be observed that compare instructions C1 and C2 are
scheduled on FU5 and FU2 instead of FU2 and FU3, respectively. With the turn-off
technique, apart from improving performance, we also reduce the
leakage energy of the FUs compared to the worst case.
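To make the turn-off policy concrete, the following minimal Python sketch greedily schedules independent operations on clean FUs only. This is an illustrative sketch, not the Elcor implementation; the variation map, FU indices, and latency values are assumed for the example.

```python
from collections import defaultdict

# Hypothetical variation map: FU index -> extra cycles due to process variation.
# FUs with zero extra latency are "clean"; the others are turned off entirely.
VARIATION_MAP = {0: 2, 1: 1, 2: 0, 3: 0, 4: 1, 5: 0}

def turn_off_schedule(ops, variation_map, op_latency):
    """Greedy cycle-by-cycle scheduling of independent ops on clean FUs only."""
    clean_fus = sorted(fu for fu, extra in variation_map.items() if extra == 0)
    busy_until = {fu: 0 for fu in clean_fus}   # cycle at which each FU frees up
    schedule = defaultdict(list)               # cycle -> [(fu, op), ...]
    cycle = 0
    pending = list(ops)
    while pending:
        for fu in clean_fus:
            if not pending:
                break
            if busy_until[fu] <= cycle:
                op = pending.pop(0)
                schedule[cycle].append((fu, op))
                # Latency keyed on the operation class: 'A'dd, 'M'ultiply, 'C'ompare.
                busy_until[fu] = cycle + op_latency[op[0]]
        cycle += 1
    return dict(schedule)

latency = {"A": 1, "C": 1, "M": 3}
sched = turn_off_schedule(["A1", "A2", "A3", "A4", "A5", "A6"],
                          VARIATION_MAP, latency)
# Only the clean FUs (2, 3, 5) ever appear in the resulting schedule.
```

With three clean FUs, the six adds above occupy two cycles, mirroring how A1-A6 fill FU2, FU3, and FU5 in Table 8.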
4.1.2 On-demand turn-on
It is observed in [12] that during some parts of a VLIW program execution the
maximum number of operations that can be executed per cycle is much smaller
than the number of available IFUs. Motivated by this observation, a technique of
IPC (instructions issued per cycle) tuning at loop-level granularity is proposed in
[12]. The basic idea of this technique is to find a suitable IPC for a given loop,
select that many integer functional units for re-scheduling operations, and turn
off the remaining integer functional units to reduce leakage power. Similarly,
in our on-demand turn-on technique, we compute a suitable IPC for
a given loop and, based on the IPC value, turn on high-latency FUs if required.
By default, instructions are scheduled only on clean FUs, and high-latency FUs are
turned off along with the unused clean FUs. Whenever the loop IPC is greater than
the number of clean FUs available, only the required number of high-latency FUs are
activated, giving priority to those process-variation-affected FUs which have lower
latency and consume less leakage.
For example, consider Table 9: for loop '1', the IPC is 3, so the high-latency
FUs are not needed; this low IPC can be satisfied by scheduling instructions only
on clean FUs. On the other hand, in loop k, the IPC is found to be 6, so there is
a need to turn on high-latency FUs.
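The FU-selection rule of on-demand turn-on can be sketched as follows. This is a hypothetical Python sketch rather than the actual compiler code; the leakage figures and variation map are made-up illustrative values.

```python
def select_fus(ipc, variation_map, leakage):
    """Pick the FUs to keep on for a loop: clean FUs first; if the loop IPC
    exceeds their count, activate just enough high-latency FUs, preferring
    those with the least extra latency and, on ties, the least leakage."""
    clean = sorted(fu for fu, extra in variation_map.items() if extra == 0)
    if ipc <= len(clean):
        # Low IPC: clean FUs suffice; the rest stay turned off.
        return clean[:ipc]
    # Activate only the shortfall, best (low extra latency, low leakage) first.
    affected = [fu for fu, extra in variation_map.items() if extra > 0]
    affected.sort(key=lambda fu: (variation_map[fu], leakage[fu]))
    return sorted(clean + affected[:ipc - len(clean)])

# Hypothetical per-FU leakage figures (arbitrary units) and variation map.
leak = {0: 5.0, 1: 3.0, 2: 2.0, 3: 2.0, 4: 4.0, 5: 2.0}
vmap = {0: 2, 1: 1, 2: 0, 3: 0, 4: 1, 5: 0}

low = select_fus(3, vmap, leak)   # IPC 3: clean FUs 2, 3, 5 only
high = select_fus(5, vmap, leak)  # IPC 5: additionally turn on FU1 and FU4
```

For IPC 5, the sketch activates FU1 and FU4 (1 extra cycle each) before FU0 (2 extra cycles), matching the stated priority for lower-latency, lower-leakage affected FUs.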
A limitation of these two techniques is that they require recompilation of the
sources for every target processor, since the latency of the FUs varies between
instances of the target architecture. However, additional run-time hardware
techniques can be used which store the latency information of the FUs in the BIOS
of the system and load it at boot time [25]. Section 4.2
provides a detailed analysis of these two techniques.
4.2 Experimental results
To evaluate the proposed techniques, we implement and simulate them
within the Trimaran [37] framework (see Figure 5 in Section 3). We modify the
instruction scheduling part of Elcor to incorporate our changes. We configure
the simulator to model an Itanium-like [39] architecture. Table 3 gives the
default simulation parameters used in our experiments, namely the CPU/memory
configuration parameters. Table 4 lists the array-intensive benchmark
codes used in our experiments.
We study two variation maps, one with 20% variation (as shown in Table 1)
and another with 40% variation in transistor parameters. To study the effectiveness
of our techniques with these variation maps, we compare them with the best case,
the IPC technique [46], the PV-IPC technique, and the worst case.
In the best case, all the components in all the IFUs are clean (there is no
variation), so they have nominal latency (as shown in Table 3). In the worst case,
on the other hand, all the components of each IFU take their corresponding worst-case
latencies. In the IPC technique, proposed in [46], instructions are scheduled based on
the IPC value; for example, if the IPC is 3, then only the first 3 IFUs (FU0-FU2) are
used. In this technique, all the IFUs are assumed to be clean. PV-IPC implements
the IPC technique considering the effect of variations in IFUs: when the IPC is 3,
the first three IFUs are used even if they have high latency (are variation
affected). In the figures, "PV-IPC: 20% variation", "Turn-off: 20% variation" and
"On-demand turn-on: 20% variation" indicate the cases of applying 'PV-IPC',
'turn-off' and 'on-demand turn-on' techniques to the 20% variation map, respectively.
Similarly, we label the different techniques for the 40% variation map.
Figure 7. Benchmark-wise IPC for different techniques.
Figure 7 shows the benchmark-wise IPC values for all the above techniques,
considering six FUs. We can observe that both the 'turn-off' and 'on-demand turn-on'
techniques perform better than the worst-case scenario. For benchmarks like "bmcm",
"mxm" and "tsf", the 'turn-off' technique incurs at most 0.4% performance
degradation w.r.t. the best case. For all other benchmarks, because of their high
resource requirements, 'turn-off' suffers performance degradation w.r.t. the best case.
With the 'on-demand turn-on' technique, we incur an average performance
loss of 1.1% compared to the best case for 20% variation; for 40% variation, the
loss is 3.6%. For the "wss" benchmark, because of its high resource requirement,
we can observe a drastic change in the IPC values for the 'turn-off' and 'on-demand
turn-on' techniques. Comparing the IPC and PV-IPC techniques, we can observe the
effect of variations on the IPC value: because functional units FU0 and FU1 are
affected and possess high latency, the average IPC value for the PV-IPC technique
is 6.0% less than that of the simple IPC technique. It can also be noted that, for
obvious reasons, the performance of our proposed techniques decreases as variation
increases.
Figure 8. Leakage energy savings of all the IFUs for different techniques.
Figure 8 shows the benchmark-wise leakage energy savings obtained for the
different techniques w.r.t. the worst case. First of all, the leakage savings for the
best case w.r.t. the worst case are small because it uses all the FUs. For the IPC
technique, the savings w.r.t. the worst case are higher than those of the best case
because unused FUs are turned off based on the IPC value. For the PV-IPC
technique, because of variations, the savings in leakage energy are less than those
of the IPC technique. With our 'turn-off' technique, because all the variation-affected
FUs are turned off, we obtain higher leakage energy savings than the best-case,
IPC, and PV-IPC techniques. With the 'on-demand turn-on' technique, because the
affected FUs are turned on based on the IPC value, the leakage energy savings are
lower than those of the 'turn-off' technique (on average 14.5% less). However, for
the benchmarks "apsi", "bmcm", "mxm", "tsf" and "vpenta", the 'turn-off' and
'on-demand turn-on' techniques achieve almost the same leakage energy savings
because of their low IPC (Figure 7). For the "wss" benchmark, the savings of the
'turn-off' technique are much higher than those of the other techniques because
3 FUs (process-variation-affected FUs) are turned off even though the IPC value
is 4.35 (Figure 7).
Figure 9. Peak temperature of IFUs for different techniques.
Figure 9 shows the benchmark-wise peak temperatures for all the techniques.
For the IPC technique, the peak temperature is higher than that of the best case
because more instructions are scheduled on the initial FUs. With PV-IPC, as the
variation-affected FUs are used, we see a drastic increase in the peak temperature
w.r.t. the worst case: on average, the peak temperature for the PV-IPC technique
is 11.3°C higher. Our 'turn-off' technique achieves an average peak temperature
reduction of 17.5°C w.r.t. the worst case because the variation-affected FUs are
completely turned off. Similarly, the 'on-demand turn-on' technique achieves a
10.0°C reduction in average peak temperature w.r.t. the worst case. In general,
we observe that the peak temperature increases with increasing variation.
Figure 10. Average change in IPC for different techniques, for 6 and 4 IFUs with 20% and 40% variation.
Figures 10, 11 and 12 show the results of a sensitivity analysis with 4 IFUs;
for easy comparison, we also show the results for 6 IFUs. Figure 10 shows the
impact of our techniques on the IPC value: the average performance degradation
w.r.t. the best case and the average performance improvement w.r.t. the worst
case, over all the benchmarks. For the IPC technique we observe a 3% degradation
w.r.t. the best case; this degradation further increases for PV-IPC. For our
'turn-off' and 'on-demand turn-on' techniques, the degradation is 14% and 1%,
respectively, w.r.t. the best case when 20% variation is considered, and it grows
with increasing variation. Similarly, we observe 20% and 39% improvement w.r.t.
the worst case for 20% variation, which reduces as variation increases. The
improvement is negative for "Turn-off: 40% variation" because 3 out of the 4 FUs
are variation affected.
Figure 11. Average leakage energy savings for different techniques, for 6 and 4 IFUs with 20% and 40% variation.
Figure 11 shows the average leakage energy savings over all the benchmarks.
We can observe that more leakage energy is saved with 'turn-off' than with the
'on-demand turn-on', IPC, and PV-IPC techniques. On average, 82% and 52%
savings are obtained with 'turn-off' and 'on-demand turn-on', respectively, w.r.t.
the best case; w.r.t. the worst case, the savings are 87% and 65%, respectively.
As variation increases, the leakage energy savings decrease.
Figure 12. Average peak temperature reduction for different techniques, for 6 and 4 IFUs with 20% and 40% variation.
Figure 12 shows the average peak temperature reduction over all the benchmarks.
We can observe that the average peak temperature of the IPC technique lies
between those of the best and worst cases. With PV-IPC, as instructions are
scheduled on variation-affected FUs, we see a drastic increase in the average peak
temperature. The 'turn-off' and 'on-demand turn-on' techniques achieve 12.8%
and 5% reductions in average peak temperature compared to the best case, and
17% and 10% reductions compared to the worst case, for 6 and 4 FUs with 20%
variation. From Figures 10, 11 and 12 we can observe that as variation increases,
IPC degradation increases, leakage energy savings decrease, and the average peak
temperature increases.
4.3 Conclusion
We have presented two compile-time techniques, namely 'turn-off' and 'on-demand
turn-on', to handle non-uniform latency IFUs and reduce the performance
penalty. Apart from achieving nearly the same performance as IFUs without
variability, we also achieve nearly a 76.5% reduction in leakage energy consumption
along with a 13.3% reduction in the peak temperature of the IFUs as compared to
the worst case.
CHAPTER 5
Mobility-list-scheduling
To achieve high performance, VLIW processors use multiple functional units.
By exploiting the available instruction-level parallelism in programs, compilers
schedule operations on the different functional units of VLIW processors.
List-scheduling [43] is commonly used for scheduling operations in VLIW processors
to achieve high performance. However, list-scheduling always tends to schedule
operations on the first freely available functional unit [46]. As long as functional
units of the same kind have the same latency, list-scheduling gives good performance.
But functional units of the same kind may have different latencies; this scenario
can arise in advanced process technologies due to process variation [26], [47]. In
such a situation, list-scheduling may not yield good performance. In order to work
with non-uniform latency functional units, we propose a new scheduling algorithm,
namely mobility-list-scheduling, a modified version of the list-scheduling algorithm
that uses mobility [13] information to schedule operations onto non-uniform latency
FUs.
5.1 Motivation
In this chapter, we assume a VLIW processor with six integer functional units
(IFUs), where each IFU can take either the nominal latency (type-0), 1 cycle extra
latency compared to the nominal latency (type-1), or 2 cycles extra latency (type-2).
An IFU with k cycles extra latency means that instructions scheduled on that IFU
take k extra cycles compared to the nominal latency, for all operations. For n IFUs
with m possible latency types, we have C(n+m-1, m-1) different latency pattern
sets with a total of m^n latency patterns, where a latency pattern set defines the
total number of IFUs of each latency type, while a latency pattern determines the
latency type of each IFU. In other words, a latency pattern set A is defined as
A = {i_0, i_1, ..., i_{m-1} | sum_{k=0}^{m-1} i_k = n}, where i_k is the total
number of IFUs with type-k latency and n is the total number of IFUs, while a
latency pattern p is defined as p = l(IFU_0) l(IFU_1) ... l(IFU_{n-1}), where
l(IFU_k) is the latency type of IFU_k.

Figure 13. Total execution cycles for benchmarks "apsi" and "bmcm" for all possible latency pattern scenarios after applying the list-scheduling algorithm.
So, for 6 IFUs and 3 possible latency types, we have 28 latency pattern
sets with a total of 729 (= 3^6) different latency patterns. Considering a compiler
which is aware of these latency types, Figure 13 shows the number of execution
cycles for all 729 latency patterns for the "apsi" and "bmcm" benchmarks after
applying the list-scheduling algorithm. From the figure, it is clear that the
total number of execution cycles depends on the latency pattern. For a better
understanding, Figure 14 shows the values for 9 benchmarks with all possible
latency patterns of the latency pattern set A = {4, 1, 1}. We can observe that for
latency pattern p1 = 210000, all benchmarks take more execution cycles than for
latency pattern p2 = 000012. From this observation, we can conclude that the
position of the high-latency IFUs plays an important role in determining
performance.
Figure 14. Benchmark-wise execution cycles after applying the list-scheduling algorithm for all latency patterns of the latency pattern set {4,1,1}.
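The pattern counts above can be checked by brute-force enumeration. The following Python sketch enumerates all latency patterns for n = 6 and m = 3 and verifies the figures of 28 pattern sets and 729 patterns, as well as the size of the {4, 1, 1} set used in Figure 14:

```python
from itertools import product
from math import comb

n, m = 6, 3  # six IFUs, three latency types (type-0, type-1, type-2)

# A latency pattern assigns a type to each IFU: m**n patterns in total.
patterns = list(product(range(m), repeat=n))
assert len(patterns) == 3 ** 6 == 729

# A pattern set only counts the IFUs per type, i.e. a multiset of types:
# C(n+m-1, m-1) distinct sets by stars-and-bars.
pattern_sets = {tuple(sorted(p)) for p in patterns}
assert len(pattern_sets) == comb(n + m - 1, m - 1) == 28

# The set A = {4, 1, 1} (four type-0, one type-1, one type-2 IFUs) contains
# 6!/4! = 30 distinct patterns, including 210000 and 000012 from Figure 14.
a411 = [p for p in patterns if sorted(p) == [0, 0, 0, 0, 1, 2]]
assert len(a411) == 30
assert (2, 1, 0, 0, 0, 0) in a411 and (0, 0, 0, 0, 1, 2) in a411
```

Deduplicating by the sorted tuple collapses each pattern to its pattern set, which is why the multiset count matches the binomial formula.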
Figure 15. Dependency graph (Gn) for Basic Block (BBi).
5.2 Working with non-uniform latency functional units
Figure 15 shows a simple dependency graph Gn with n nodes as operations
for a basic block BBi. Gn consists of 10 Add (Ai, i ∈ {1, · · · , 10}) operations, 2
Compare (C1 and C2) operations, and 3 Multiply operations (M1, M2, and M3).
Cycle  IFU[0]  IFU[1]  IFU[2]  IFU[3]  IFU[4]  IFU[5]
1      A1      A2      C1      A3      M1      A4
2      M2      A5      A6      A7              C2
3              A10
4              M3      A8
5              A9
6

Table 10. VLIW schedule after applying the list-scheduling algorithm for latency pattern 000000.
Cycle  IFU[0]  IFU[1]  IFU[2]  IFU[3]  IFU[4]  IFU[5]
1      A1      A2      C1      A3      M1      A4
2      M2      A5      A6      A7
3              A10
4              M3      C2
5      A9              A8
6

Table 11. VLIW schedule after applying the list-scheduling algorithm for latency pattern 000012.
For illustrative purposes, we consider only these operations. We assume
that each IFU can perform Add, Multiply, and Compare operations; however,
in a particular cycle only one operation can be performed by an IFU. In
advanced process technologies, because of process variation, functional units of
the same kind may have different latencies [26], [47]. In this chapter, as described
in Section 5.1, we assume that each IFU belongs to one of three latency types
(type-0, type-1, or type-2). We also assume that a type-0 IFU takes 1 cycle to
perform an Add or a Compare operation and 3 cycles for a Multiply operation.
We tabulate the VLIW scheduling information for a basic block with rows
indicating the execution cycles and columns indicating the functional units.
Table 10 shows a VLIW schedule for latency pattern 000000 (that is, all the IFUs
take nominal latency) obtained by giving Gn (Figure 15) as input to the
list-scheduling algorithm [43].
Cycle  IFU[0]  IFU[1]  IFU[2]  IFU[3]  IFU[4]  IFU[5]
1      A1      A2      C1      A3      M1      A4
2                      M2      A6              A7
3              C2
4      A5                      A8
5              A9
6
7      A10
8
9
10     M3
11
12
13
14

Table 12. VLIW schedule after applying the list-scheduling algorithm for latency pattern 210000.
Considering a compiler which is aware of these latency types, Table 11 shows a
schedule obtained by using the list-scheduling algorithm on Gn for latency pattern
000012. Note that the schedule lengths (i.e., the number of rows in a schedule
table) are the same for latency patterns 000000 and 000012, as the list-scheduling
algorithm tends to schedule instructions on the first freely available IFU and,
for the latter pattern, all the high-latency IFUs are towards the end. When the
latency pattern 210000 is considered, however, we can see from Table 12 that most
of the instructions are scheduled on IFU0, which is a type-2 latency IFU. This
results in an increased schedule length (14 cycles) and hence a performance loss
compared to the schedules given in Tables 10 and 11 (6 cycles in each case). To
overcome this problem with the list-scheduling algorithm, we present a modified
list-scheduling algorithm, namely mobility-list-scheduling, which uses the mobility
[13] information of each operation to schedule the operation on a particular IFU.
MOBILITY-LIST(Gn(V,E), a, m) {
    Compute mobility for all the operations and form mobility classes;
    l = 1;
    repeat {
        for each mobility class k = 0, 1, ..., t {
            Determine candidate operations U_{l,k};
            Sort the operations of U_{l,k} in ascending order of their latency;
            j = 0;
            repeat {
                Determine unfinished operations T_{l,j};
                Select the first S_k subset of U_{l,k} operations, such that |S_k| + |T_{l,j}| <= a_j;
                Schedule the S_k operations on IFUs with type-j latency at step l
                    by setting t_i = l, for all i : v_i in S_k;
                U_{l,k} = U_{l,k} - S_k;
                j = j + 1;
            } until (U_{l,k} is empty or j == m);
        }
        l = l + 1;
    } until (v_n is scheduled);
    return(t);
}
Figure 16. Mobility-list-scheduling algorithm.
5.2.1 Mobility-list-scheduling
The mobility of an operation is the difference between the start times computed
by the As-Late-As-Possible (ALAP) and As-Soon-As-Possible (ASAP) algorithms
[13]. An operation with zero mobility has to be bound to an IFU with type-0
latency and executed at its earliest start time in order to avoid a performance
penalty. On the other hand, a k-mobility operation, k > 0, can be bound to an
IFU with type-m latency, where m <= k, so that its execution can be postponed
by k - m steps. Operations with zero mobility are called critical operations. In
Figure 15, operations A1, A2, C1, A5, A6, A10, and M3 are critical operations.
In order not to delay these critical operations, whenever possible the
mobility-list-scheduling algorithm (as shown in Figure 16) avoids scheduling them
on IFUs with type-m latency, m > 0. In general, for scheduling k-mobility
operations, the mobility-list-scheduling algorithm always gives preference to type-j
latency IFUs, where j <= k. If such IFUs are not available, the algorithm chooses
the next best IFU, which incurs a minimal performance penalty.
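The mobility computation underlying the algorithm can be sketched as follows. This Python sketch computes ASAP and ALAP start times on a toy dependency chain; the graph fragment and latencies are illustrative assumptions, not the full Gn of Figure 15.

```python
def mobility(ops, deps, latency):
    """deps: op -> list of predecessor ops. Returns op -> (ALAP - ASAP) start."""
    # ASAP start: an op starts once all its predecessors have finished.
    asap = {}
    def start_asap(op):
        if op not in asap:
            asap[op] = max((start_asap(p) + latency[p] for p in deps[op]),
                           default=0)
        return asap[op]
    for op in ops:
        start_asap(op)
    # Critical-path length fixes the latest allowed finish time.
    length = max(asap[op] + latency[op] for op in ops)
    succs = {op: [] for op in ops}
    for op in ops:
        for p in deps[op]:
            succs[p].append(op)
    # ALAP start: an op must finish before its earliest successor's ALAP start.
    alap = {}
    def start_alap(op):
        if op not in alap:
            alap[op] = min((start_alap(s) for s in succs[op]),
                           default=length) - latency[op]
        return alap[op]
    for op in ops:
        start_alap(op)
    return {op: alap[op] - asap[op] for op in ops}

# Hypothetical fragment: A1 -> A5 -> A10 -> M3 forms the critical chain, while
# M1 has slack because no later operation depends on it.
lat = {"A1": 1, "A5": 1, "A10": 1, "M3": 3, "M1": 3}
deps = {"A1": [], "A5": ["A1"], "A10": ["A5"], "M3": ["A10"], "M1": []}
mob = mobility(list(lat), deps, lat)
# Chain operations have zero mobility (critical); M1 has mobility 3.
```

Zero-mobility operations are exactly those the algorithm must place on type-0 IFUs, while M1's slack of 3 steps is what lets it absorb the extra latency of a type-1 or type-2 IFU.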
Cycle  IFU[0]  IFU[1]  IFU[2]  IFU[3]  IFU[4]  IFU[5]
1      M1      M2      A1      A2      C1      A3
2                      A5      A6      A7      A4
3                      A10     C2
4                      M3      A8
5                              A9
6

Table 13. VLIW schedule after applying the mobility-list-scheduling algorithm for latency pattern 210000.
The inputs to the mobility-list-scheduling algorithm (as shown in Figure 16) are
a dependency graph Gn, a latency pattern set A = {a_0, a_1, ..., a_{m-1}}, and
the number of latency types, m. The algorithm selects the set of all operations
that can be executed in each schedule step. In each schedule step, it selects
operations in increasing order of their mobility. U_{l,k} is the set of all eligible
k-mobility operations that are ready to execute in schedule step l. The operations
of U_{l,k} are sorted in ascending order of their execution latency so that
low-latency operations are preferred over high-latency operations. The algorithm
then checks for free IFUs in increasing order of their latency types. T_{l,j} is the
set of operations scheduled on IFUs with type-j latency that started earlier and
whose execution has not finished by step l. The number of IFUs with type-j
latency is denoted by a_j. The inner repeat loop in the algorithm explores all
IFUs, from low latency to high latency, to schedule the operations of U_{l,k}.
Table 13 shows a schedule table obtained by applying our algorithm to Gn
with latency pattern 210000. We choose the latency pattern 210000 as it gives the
worst performance when conventional list-scheduling is applied (see Table 12).
From Table 13 and Table 14, it is clear that, as A1, A2 and C1 are 0-mobility
operations, they are scheduled on the type-0 latency IFUs 2-4. Though A3 is a
1-mobility operation, as there is a free type-0 latency IFU, it is scheduled on IFU
5. As both M1 and M2 are 2-mobility operations, they are scheduled on the
available type-1 and type-2 IFUs. Though the start time of A4 is 1, because of its
4-mobility, its execution is postponed to step 2, giving preference to lower-mobility
operations in schedule step 1. In this way, the algorithm completes the schedule
with a schedule length of 6 cycles, thus improving the performance compared to
the conventional case (Table 12).
Our experimental evaluation shows that the mobility-list-scheduling technique
achieves on average a 20.7% performance improvement compared to conventional
list-scheduling when non-uniform latency IFUs are considered.
CHAPTER 6
Conclusion and future work
Due to process variation, components like adders, multipliers, etc., of different
integer functional units (IFUs) in VLIW processors may operate at various speeds,
resulting in non-uniform latency IFUs, which can cause performance loss. We have
presented two compile-time techniques, namely 'turn-off' and 'on-demand turn-on',
to handle these non-uniform latency IFUs and reduce the performance penalty.
Apart from achieving nearly the same performance as IFUs without variability,
we also achieve nearly a 76.5% reduction in leakage energy consumption along
with a 13.3% reduction in the peak temperature of the IFUs as compared to the
worst case.
The conventional list-scheduling algorithm schedules instructions on the first
freely available IFU, which results in significant performance loss in cases where
the first free IFU is of type-1 or type-2 and critical instructions are scheduled on
it. We proposed the mobility-list-scheduling algorithm, a modified version of the
list-scheduling algorithm that uses mobility information to schedule operations
onto non-uniform latency IFUs. Our experimental evaluation shows that
mobility-list-scheduling achieves on average a 20.7% performance improvement
compared to conventional list-scheduling when non-uniform latency IFUs are
considered.
As future work, one can explore compile-time techniques which can work with
non-uniform latency clustered VLIW architectures.
List of publications related to the thesis
[a] Nayan V. Mujadiya, "Instruction scheduling for VLIW processors under variation scenario," in International Symposium on Systems, Architectures, Modeling, and Simulation, July 2009, pp. 33-40.
[b] Nayan V. Mujadiya and M. Mutyam, "Instruction Scheduling on Variable Latency Functional Units of VLIW Processors" (to be communicated).
References
[1] A. Datta and et. al., “Speed binning aware design methodology to improve profitunder process variations,” in Asia and South Pacific Design Automation Conference,Sept. 2004, pp. 712–717.
[2] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De, “Parame-ter variations and impact on circuits and microarchitecture,” in DAC ’03: Proceed-ings of the 40th conference on Design automation. New York, NY, USA: ACM,2003, pp. 338–342.
[3] K. A. Bowman, A. R. Alameldeen, S. T. Srinivasan, and C. B. Wilkerson, “Impactof die-to-die and within-die parameter variations on the throughput distributionof multi-core processors,” in ISLPED ’07: Proceedings of the 2007 internationalsymposium on Low power electronics and design. New York, NY, USA: ACM,2007, pp. 50–55.
[4] O. S. Unsal and et. al., “Impact of parameter variations on circuits and microarchi-tecture,” in IEEE Micro, Nov. 2006, pp. 30–39.
[5] S. Nassif, “Modeling and analysis of manufacturing variations,” Custom IntegratedCircuits, 2001, IEEE Conference on., pp. 223–228, 2001.
[6] S. Nassif, “Within-chip variability analysis,” Electron Devices Meeting, 1998. IEDM’98 Technical Digest., International, pp. 283–286, Dec 1998.
[7] K. Bowman, S. Duvall, and J. Meindl, “Impact of die-to-die and within-die pa-rameter fluctuations on the maximum clock frequency distribution for gigascaleintegration,” Solid-State Circuits, IEEE Journal of, vol. 37, no. 2, pp. 183–190, Feb2002.
[8] P. S. Zuchowski, P. A. Habitz, J. D. Hayes, and J. H. Oppold, “Process and environ-mental variation impacts on asic timing,” in ICCAD ’04: Proceedings of the 2004IEEE/ACM International conference on Computer-aided design. Washington, DC,USA: IEEE Computer Society, 2004, pp. 336–342.
[9] T. Rahal-Arabi and et. al., “Design and validation of the pentium 3 and pentium 4processors power delivery,” in IEEE Symposium on VLSI Circuits, 2002, pp. 220–223.
[10] D. Brooks and M. Martonosi, “Dynamic thermal management for high-performancemicroprocessors,” in High-Performance Computer Architecture, 2001, pp. 171–182.
[11] A. Abdollahi and et. al., “Leakage current reduction in cmos vlsi circuits by inputvector control,” in IEEE Transactions on VLSI Systems, 2004, pp. 140–154.
[12] H. S. Kim, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, “Adapting instruc-tion level parallelism for optimizing leakage in vliw architectures,” SIGPLAN Not.,vol. 38, no. 7, pp. 275–283, 2003.
45
[13] G. D. Micheli, “Synthesis and optimization of digital circuits,” McGraw-Hill, 1994.
[14] J. W. Tschanz, J. T. Kao, S. G. Narendra, R. Nair, D. A. Antoniadis, A. P. Ch,S. Member, and V. De, “Adaptive body bias for reducing impacts of die-to-die andwithin-die parameter variations on microprocessor frequency and leakage,” in IEEEJournal Of Solid-State Circuits, 2002, pp. 1396–1402.
[15] S. Narendra, A. Keshavarzi, B. Bloechel, S. Borkar, and V. De, “Forward bodybias for microprocessors in 130-nm technology generation and beyond,” Solid-StateCircuits, IEEE Journal of, vol. 38, no. 5, pp. 696–701, May 2003.
[16] A. Datta, S. Bhunia, J. H. Choi, S. Mukhopadhyay, and K. Roy, “Speed binningaware design methodology to improve profit under parameter variations,” in ASP-DAC ’06: Proceedings of the 2006 conference on Asia South Pacific design automa-tion. Piscataway, NJ, USA: IEEE Press, 2006, pp. 712–717.
[17] A. Agarwal, B. Paul, S. Mukhopadhyay, and K. Roy, “Process variation in embeddedmemories: failure analysis and variation aware architecture,” Solid-State Circuits,IEEE Journal of, vol. 40, no. 9, pp. 1804–1814, Sept. 2005.
[18] Q. Chen, H. Mahmoodi, S. Bhunia, and K. Roy, “Modeling and testing of sramfor new failure mechanisms due to process variations in nanoscale cmos,” in VTS’05: Proceedings of the 23rd IEEE VLSI Test Symposium. Washington, DC, USA:IEEE Computer Society, 2005, pp. 292–297.
[19] H. Chang and S. S. Sapatnekar, “Full-chip analysis of leakage power under processvariations, including spatial correlations,” in DAC ’05: Proceedings of the 42ndannual conference on Design automation. New York, NY, USA: ACM, 2005, pp.523–528.
[20] M. Mutyam and V. Narayanan, “Working with process variation aware caches,” inDATE ’07: Proceedings of the conference on Design, automation and test in Europe.San Jose, CA, USA: EDA Consortium, 2007, pp. 1152–1157.
[21] M. A. Hussain and M. Mutyam, “Block remap with turnoff: a variation-tolerantcache design technique,” in ASP-DAC ’08: Proceedings of the 2008 conference onAsia and South Pacific design automation. Los Alamitos, CA, USA: IEEE Com-puter Society Press, 2008, pp. 783–788.
[22] S. Ozdemir, D. Sinha, G. Memik, J. Adams, and H. Zhou, “Yield-aware cache archi-tectures,” in MICRO 39: Proceedings of the 39th Annual IEEE/ACM InternationalSymposium on Microarchitecture. Washington, DC, USA: IEEE Computer Society,2006, pp. 15–25.
[23] F. Wang, C. Nicopoulos, X. Wu, Y. Xie, and N. Vijaykrishnan, “Variation-awaretask allocation and scheduling for mpsoc,” in ICCAD ’07: Proceedings of the 2007IEEE/ACM international conference on Computer-aided design. Piscataway, NJ,USA: IEEE Press, 2007, pp. 598–603.
46
[24] L. Huang and Q. Xu, “Performance yield-driven task allocation and schedulingfor mpsocs under process variation,” in DAC ’10: Proceedings of the 47th DesignAutomation Conference. New York, NY, USA: ACM, 2010, pp. 326–331.
[25] P. Raghavan, J. Ayala, D. Atienza, F. Catthoor, G. De Micheli, and M. Lopez-Vallejo, “Reduction of register file delay due to process variability in vliw embeddedprocessors,” Circuits and Systems, 2007. ISCAS 2007. IEEE International Sympo-sium on, pp. 121–124, May 2007.
[26] X. Liang and D. Brooks, “Mitigating the impact of process variations on proces-sor register files and execution units,” in MICRO 39: Proceedings of the 39th An-nual IEEE/ACM International Symposium on Microarchitecture. Washington, DC,USA: IEEE Computer Society, 2006, pp. 504–514.
[27] B. F. Romanescu, M. E. Bauer, S. Ozev, and D. J. Sorin, “Reducing the impact of intra-core process variability with criticality-based resource allocation and prefetching,” in CF ’08: Proceedings of the 2008 conference on Computing frontiers. New York, NY, USA: ACM, 2008, pp. 129–138.
[28] T. Sato and S. Watanabe, “Instruction scheduling for variation-originated variable latencies,” Quality Electronic Design, International Symposium on, vol. 0, pp. 361–364, 2008.
[29] D. Kannan, A. Shrivastava, S. Bhardwaj, and S. Vrudhula, “Power reduction of functional units considering temperature and process variations,” in VLSID ’08: Proceedings of the 21st International Conference on VLSI Design. Washington, DC, USA: IEEE Computer Society, 2008, pp. 533–539.
[30] D. Kannan, A. Shrivastava, V. Mohan, S. Bhardwaj, and S. Vrudhula, “Temperature and process variations aware power gating of functional units,” in VLSID ’08: Proceedings of the 21st International Conference on VLSI Design. Washington, DC, USA: IEEE Computer Society, 2008, pp. 515–520.
[31] X. Liang, R. Canal, G.-Y. Wei, and D. Brooks, “Process variation tolerant 3t1d-based cache architectures,” Microarchitecture, 2007. MICRO 2007. 40th Annual IEEE/ACM International Symposium on, pp. 15–26, Dec. 2007.
[33] A. Das, S. Ozdemir, G. Memik, and A. Choudhary, “Evaluating voltage islands in cmps under process variations,” Computer Design, 2007. ICCD 2007. 25th International Conference on, pp. 129–136, Oct. 2007.
[34] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2008, ISBN 3-900051-07-0. [Online]. Available: http://www.R-project.org
[35] N. Kim, T. Austin, D. Baauw, T. Mudge, K. Flautner, J. Hu, M. Irwin, M. Kandemir, and V. Narayanan, “Leakage current: Moore’s law meets static power,” Computer, vol. 36, no. 12, pp. 68–75, Dec. 2003.
[36] Y.-F. Tsai, A. Ankadi, N. Vijaykrishnan, M. Irwin, and T. Theocharides, “Chippower: an architecture-level leakage simulator,” SOC Conference, 2004. Proceedings. IEEE International, pp. 395–398, Sept. 2004.
[37] L. N. Chakrapani, J. Gyllenhaal, W.-M. W. Hwu, S. A. Mahlke, K. V. Palem, and R. M. Rabbah, “Trimaran: An infrastructure for research in instruction-level parallelism,” in Lecture Notes in Computer Science, 2004, p. 2005.
[38] K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan, “Temperature-aware microarchitecture,” SIGARCH Comput. Archit. News, vol. 31, no. 2, pp. 2–13, 2003.
[39] J. Huck, D. Morris, J. Ross, A. Knies, H. Mulder, and R. Zahir, “Introducing the ia-64 architecture,” IEEE Micro, vol. 20, no. 5, pp. 12–23, 2000.
[40] Y.-F. Tsai, D. E. Duarte, N. Vijaykrishnan, and M. J. Irwin, “Characterization and modeling of run-time techniques for leakage power reduction,” IEEE Trans. Very Large Scale Integr. Syst., vol. 12, no. 11, pp. 1221–1232, 2004.
[41] N. V. Mujadiya, “Instruction scheduling for vliw processors under variation scenario,” in SAMOS’09: Proceedings of the 9th international conference on Systems, architectures, modeling and simulation. Piscataway, NJ, USA: IEEE Press, 2009, pp. 33–40.
[42] J. Fisher, “Trace scheduling: A technique for global microcode compaction,” Computers, IEEE Transactions on, vol. C-30, no. 7, pp. 478–490, July 1981.
[43] S. Muchnick, Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.
[44] W.-M. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann, R. G. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, and D. M. Lavery, “The superblock: an effective technique for vliw and superscalar compilation,” J. Supercomput., vol. 7, no. 1-2, pp. 229–248, 1993.
[45] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann, “Effective compiler support for predicated execution using the hyperblock,” SIGMICRO Newsl., vol. 23, no. 1-2, pp. 45–54, 1992.
[46] M. Mutyam, F. Li, V. Narayanan, M. Kandemir, and M. J. Irwin, “Compiler-directed thermal management for vliw functional units,” SIGPLAN Not., vol. 41, no. 7, pp. 163–172, 2006.
[47] E. Chun, Z. Chishti, and T. N. Vijaykumar, “Shapeshifter: Dynamically changing pipeline width and speed to address process variations,” in MICRO 41: Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture. Washington, DC, USA: IEEE Computer Society, 2008, pp. 411–422.