CARLOS HENRIQUE ANDRADE COSTA
DYNAMIC METHODOLOGY FOR OPTIMIZATION EFFECTIVENESS EVALUATION AND VALUE
LOCALITY EXPLOITATION
Thesis presented to the Escola Politécnica da Universidade de São Paulo
for the degree of Doctor of Engineering.
São Paulo
2012
CARLOS HENRIQUE ANDRADE COSTA
DYNAMIC METHODOLOGY FOR OPTIMIZATION EFFECTIVENESS EVALUATION AND VALUE
LOCALITY EXPLOITATION
Thesis presented to the Escola Politécnica da Universidade de São Paulo
for the degree of Doctor of Engineering.
Concentration Area:
Digital Systems
Advisor:
Prof. Dr. Paulo S. L. M. Barreto
São Paulo
2012
To my parents, who taught me. To my wife, who supported me. To my friends, who believed in me.
Acknowledgements
I would like first to thank my advisor, Prof. Dr. Paulo S. L. M. Barreto, who gave me the
opportunity and support to pursue my research interests with independence, and my collaborators
and, in a sense, informal advisors, Dr. José E. Moreira (IBM) and Prof. Dr. David Padua
(University of Illinois at Urbana-Champaign), with whom I worked during the part of this
work conducted at the IBM T. J. Watson Research Center, whose suggestions motivated me to
pursue the core ideas in this thesis, and whose guidance, support, and encouragement have been
invaluable. Without them all, this thesis would not have been possible.
I am grateful to various people at the IBM T. J. Watson Research Center and the University
of São Paulo, places where I had the opportunity and the environment to conduct research
and to collaborate with amazingly talented people. At IBM, I would like to thank Dr. Robert
Wisnieski for being such a good manager during my time with the Exascale System Software
Group and for making this joint research possible. I thank José Brunheroto, with whom I
worked developing the full-system simulator (Mambo) for BlueGene/Q, for sharing his
vast knowledge of the simulator's internals, for helping me understand how this tool
could be used in the development of the present work, and for the long and always interesting
hours of conversation that shed new light on a wide spectrum of subjects. I also thank
the people I worked with, directly or indirectly, who helped make my period
at T. J. Watson truly enjoyable: Roberto Gioiosa (now with Pacific Northwest National Laboratory),
Alejandro Rico-Carro (now with Barcelona Supercomputing Center, Spain), Jamin
Naghmouchi (now with T.U. Braunschweig, Germany), Daniele P. Scarpazza (now with D. E.
Shaw Research), Chen-Yong Cher, George Almasi, and many others.
At the University of São Paulo, my deepest gratitude goes to Prof. Tereza C. M. B. Carvalho,
who believed in me and gave me a life-changing opportunity. Her constant willingness to help and
guide helped me become a better researcher and person. I also thank all my talented fellow graduate
students and researchers, Charles Miers (now with Universidade do Estado de Santa Catarina),
Rony R. Sakuragui (now with Scopus Tecnologia), Fernando Redigolo, Marcelo C. Amaral, and
Diego S. Gallo, with whom I worked or collaborated in multiple research projects, and the
many others who inspired me, even if indirectly. I thank Guilherme C. Januário in particular,
a multi-talented graduate student, who made valuable suggestions for optimizing the code
written in this work.
Finally, I’m grateful to my parents that provided me the amazing environment to grow
and pursue my true interests and that taught me everything I know that really matters in life.
Most of all I would like to thank my wife, Emanuele P. N. Costa, whose unending patience,
love, and support through all the long working hours required for this work have made this
possible. Thanks for making life worth living!
Resumo
Software performance depends on the multiple code optimizations performed by modern compilers
to remove redundant computation. The identification of redundant computation is, in general,
undecidable at compile time, which prevents obtaining an ideal reference case for measuring the
unexploited potential of remaining redundancy removal and for evaluating the effectiveness of
code optimization. This work presents a set of methods for analyzing code optimization
effectiveness by observing the complete set of dynamically executed instructions and memory
references over an entire program execution. This is done by developing a dynamic value
numbering algorithm and applying it as instructions are executed. This method reduces
interprocedural analysis to the analysis of one large basic block and detects redundant memory
and scalar operations that are visible only at run time. In this way, the work extends
instruction-reuse analysis and provides both a more accurate approximation of the upper bound
of exploitable optimization within a program and a reference point for evaluating the
effectiveness of an optimization. The method also provides a clear view of unexploited
redundancy hotspots and a measure of value locality within the complete execution of a program.
A framework that implements the method and integrates it with a full-system simulator based on
the 64-bit Power ISA (version 2.06) is developed. A case study presents the results of applying
this method to executables of a representative benchmark (SPECInt2006) built at each
optimization level of the GNU C/C++ compiler. The proposed analysis yields a practical
evaluation of code optimization effectiveness that reveals a significant amount of remaining
unexploited redundancies, even when the highest available optimization level is used. Sources
of inefficiency are identified through the evaluation of hotspots and value locality. This
information proves useful for compiler and application tuning. The work also presents an
efficient mechanism for exploiting hardware support in redundancy elimination.
Keywords: code optimization, dynamic analysis, value numbering, optimization
effectiveness
Abstract
Software performance relies on multiple optimization techniques applied by modern compilers to
remove redundant computation. The identification of redundant computation is in general undecid-
able at compile-time and prevents one from obtaining an ideal reference for the measurement of the
remaining unexploited potential of redundancy removal and for the evaluation of code optimization
effectiveness. This work presents a methodology for optimization effectiveness analysis by observing
the complete dynamic stream of executed instructions and memory references in the whole program
execution, and by developing and applying a dynamic value numbering algorithm as instructions are
executed. This method reduces the interprocedural analysis to the analysis of a large basic block and
detects redundant memory and scalar operations that are visible only at run-time. This way, the work
extends the instruction-reuse analysis and provides both a more accurate approximation of the upper
bound of exploitable optimization in the program and a reference point to evaluate optimization effec-
tiveness. The method also generates a clear picture of unexploited redundancy hotspots and a measure
of value locality in the whole application execution. A framework that implements the method and
integrates it with a full-system simulator based on the 64-bit Power ISA (version 2.06) is developed. A case
study presents the results of applying this method to executables of a representative benchmark
(SPECInt2006) generated at the various optimization levels of the GNU C/C++ compiler. The proposed
analysis yields a practical evaluation that reveals a significant amount of remaining unexploited
redundancies present even when the highest optimization level available is used. Sources of inefficiency
are identified with an evaluation of hotspots and value locality, information that is useful for compiler
and application tuning. The thesis also shows an efficient mechanism to exploit hardware
support for redundancy elimination.
Keywords: code optimization, dynamic analysis, value numbering, optimization effectiveness
Original code:
  c = a + b
  d = a
  e = b
  f = d + e
  d = x

After value numbering (f3 = d1 + e2 matches c3 = a1 + b2):
  c3 = a1 + b2
  d1 = a1
  e2 = b2
  f3 = d1 + e2
  d4 = x4

After CSE:
  c3 = a1 + b2
  d1 = a1
  e2 = b2
  f3 = c3
  d4 = x4

After DCE (d1 = a1 is dead):
  c3 = a1 + b2
  e2 = b2
  f3 = c3
  d4 = x4

Fig. 6: Local Value Numbering (LVN) operation: common subexpression elimination and dead-code elimination.
2.3 Value Numbering
2.3.1 Local Value Numbering (LVN)
Local Value Numbering is a technique applied at compile time that recognizes redundancy
in a basic block among expressions that are lexically different but certain to compute the
same value. Traditionally, this is achieved by assigning symbolic names (value numbers) to
expressions. If the value numbers of the operands of two expressions and the operators applied
by the expressions are identical, then the expressions receive the same value number and are
certain to produce the same results. The analysis supports three different optimizations:
common subexpression elimination (CSE), copy propagation (CP), and dead-code elimination
(DCE) (CLICK, 1995; GULWANI; NECULA, 2007). The method works by progressing through each
statement in sequential order. Each new variable is assigned a distinct value number. For any
assignment expression, an existing value number on the right-hand side is assigned to the
left-hand side (see Figure 6). A new value number is created and propagated to both sides when
a new variable, i.e., one that has no existing value number, is found. This procedure
implements copy propagation. Each unary or binary expression is analyzed against a history
table of value numbers; when a match is found, the algorithm replaces the right-hand side
with the variable associated with the matching value number, which is equivalent to common
subexpression elimination. If the algorithm identifies that two different value numbers are
assigned to the same variable and that no intervening use of that variable exists, then the
first assignment is marked as dead and removed from the instruction list. This implements
dead-code elimination.
The operation of the algorithm is illustrated by the steps shown in Figure 6. The algorithm
starts with the original code. In the next step, all variables have been assigned value
numbers. Common subexpressions are then identified and eliminated: the binary expressions
c3 = a1 + b2 and f3 = d1 + e2 have the same value-number pattern (vn(3) = vn(1) + vn(2)),
hence f3 = d1 + e2 is replaced with f3 = c3. Dead-code elimination follows. In this case two
assignment expressions, d1 = a1 and d4 = x4, associate different value numbers with the same
variable d, and there is no intervening use of d. The first assignment expression, d1 = a1,
is marked dead and removed from the list of instructions.
Hash-based value numbering is one family of LVN methods, originally described in full
in (COCKE, 1969). It is the most widely used pessimistic algorithm operating over basic
blocks and has proven a broadly implemented and useful technique. The problem of determining
whether two computations are equivalent is, in general, undecidable (RAMALINGAM, 1994).
The approach taken by value numbering algorithms guarantees that any two expressions
identified as equivalent produce identical values on all possible executions; this is commonly
referred to as a conservative solution. Algorithm 2.1 formalizes the steps of an implementation
of LVN. The algorithm can be implemented with a worst-case complexity of O(N²), but most LVN
analyses complete in O(kN) time, with k ≪ N (BRIGGS; COOPER; SIMPSON, 1997).
The construction of the DAG representing the basic block is closely related to the operation
of a value numbering algorithm, except that nodes in the DAG are reused instead of inserting
new nodes with the same value. This operation corresponds to deleting later computations of
equivalent values and replacing them with previously computed ones. Presenting the
intermediate code in DAG form helps to illustrate the operation of the technique. Figure 7
shows the application of Algorithm 2.1 to the
Algorithm 2.1 Local Value Numbering (LVN) algorithm over basic blocks
1: T ← empty
2: N ← 0
3: for all quadruples a ← b ⊕ c in the block do
4:   if (b ↦ k) ∈ T for some k then
5:     nb ← k
6:   else
7:     N ← N + 1
8:     nb ← N
9:     put b ↦ nb into T
10:  end if
11:  if (c ↦ k) ∈ T for some k then
12:    nc ← k
13:  else
14:    N ← N + 1
15:    nc ← N
16:    put c ↦ nc into T
17:  end if
18:  if ((nb ⊕ nc) ↦ m) ∈ T for some m then
19:    put a ↦ m into T
20:    mark this quadruple a ← b ⊕ c as a common subexpression
21:  else
22:    N ← N + 1
23:    put (nb ⊕ nc) ↦ N into T
24:    put a ↦ N into T
25:  end if
26: end for
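As an executable illustration, Algorithm 2.1 can be sketched in Python. This is a simplified model, not the thesis implementation: quadruples are tuples, the table T is a dictionary, and copies (op "=") are handled as in Figure 6; all names are illustrative.

```python
def local_value_numbering(block):
    """Simplified LVN over a basic block of quadruples.

    Each quadruple is (dest, op, src1, src2); op "=" denotes a copy.
    Returns the indices of quadruples recognized as common subexpressions.
    """
    table = {}          # variable names and (op, vn1, vn2) keys -> value numbers
    next_vn = 0
    redundant = []

    def vn(operand):
        nonlocal next_vn
        if operand not in table:           # new variable: fresh value number
            next_vn += 1
            table[operand] = next_vn
        return table[operand]

    for i, (dest, op, b, c) in enumerate(block):
        if op == "=":                      # copy: propagate the value number
            table[dest] = vn(b)
            continue
        key = (op, vn(b), vn(c))
        if key in table:                   # same operator, same operand numbers
            table[dest] = table[key]
            redundant.append(i)            # mark as a common subexpression
        else:
            next_vn += 1
            table[key] = next_vn
            table[dest] = next_vn
    return redundant

# The basic block of Figure 6:
block = [
    ("c", "+", "a", "b"),
    ("d", "=", "a", None),
    ("e", "=", "b", None),
    ("f", "+", "d", "e"),   # same value numbers as c = a + b
    ("d", "=", "x", None),
]
print(local_value_numbering(block))  # -> [3]
```

Copy propagation through the table is what lets f = d + e hash to the same key as c = a + b, exactly as in the figure.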
[Figure 7 shows the stepwise construction of a DAG under local value numbering for the
expression sequences: (a) g = x + y; (b) g = x + y; h = u - v; i = x + y; x = u - v;
(c) the same four expressions followed by u = g + h; v = i + x; w = u + v. Repeated
expressions reuse existing nodes instead of creating new ones.]

Fig. 7: Steps to build a DAG applying local value numbering.
code in the top left in DAG form. The figure shows the expressions scanned and the resulting
graph in sequence. Each node holds a content or expression, and the assigned value number is
annotated at the top of the node. Variables are annotated with arrows pointing to the content
they hold as the expressions are scanned. For example, in the second block the redefinition of
x requires the arrow pointing to node x0 to be removed and redirected to the node holding the
result of u - v. Similar reuses are seen in the third block.
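The node-reuse rule can be made concrete with a small Python sketch; the representation (a dictionary keyed by (operator, operand-node) tuples) is an illustrative choice, not the thesis implementation.

```python
def build_dag(exprs):
    """Build a DAG for a basic block with node reuse.

    exprs is a list of (dest, op, src1, src2). current maps each variable
    to the node currently holding its value, mirroring the moving arrows
    in Figure 7. Returns the node table and the final variable map.
    """
    nodes = {}    # ("leaf", name) or (op, node1, node2) -> node id
    current = {}  # variable name -> node id currently holding its value

    def node_of(var):
        if var not in current:                  # first use: create a leaf
            current[var] = nodes.setdefault(("leaf", var), len(nodes))
        return current[var]

    for dest, op, a, b in exprs:
        key = (op, node_of(a), node_of(b))
        if key not in nodes:                    # new value: create a node,
            nodes[key] = len(nodes)             # otherwise reuse the old one
        current[dest] = nodes[key]              # re-point dest's arrow
    return nodes, current

nodes, cur = build_dag([
    ("g", "+", "x", "y"),
    ("h", "-", "u", "v"),
    ("i", "+", "x", "y"),   # reuses the node built for g = x + y
    ("x", "-", "u", "v"),   # re-points x's arrow to the u - v node
])
print(cur["g"] == cur["i"], cur["h"] == cur["x"])  # -> True True
```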
2.3.2 Global Value Numbering (GVN)
The value numbering algorithm as originally proposed handles only basic blocks. An extension
applicable to extended basic blocks was proposed by Auslander and Hopkins and implemented in
the IBM PL.8 compiler (AUSLANDER; HOPKINS, 2004). A general approach operating over the
program as a whole is known as Global Value Numbering (GVN). GVN is the analysis used to
remove redundant computations that compute the same static value in a program, considering
multiple procedures. The roots of the GVN approach lie in the work of Alpern, Rosen, Wegman,
and Zadeck (ALPERN; WEGMAN; ZADECK, 1988b), whose main contribution was the description of
how expressions can be partitioned into congruence classes.
The first contribution toward a global value numbering targeting the whole program, including
multiple procedures, was the optimistic algorithm proposed by Reif and Lewis (REIF; LEWIS,
1986). The algorithm, in spite of having good asymptotic complexity, is hard to implement.
Alpern, Wegman and Zadeck (ALPERN; WEGMAN; ZADECK, 1988b) suggested an algorithm that used
Hopcroft's finite-state minimization to determine the congruences, including a subscripting
scheme called the φ-function that captures the semantics of conditionals and loops. The use
of the φ-function in the description of a program led to the development of the SSA form.
This approach was extended by Yang, Horwitz and Reps (YANG; HORWITZ; REPS, 1989). A
combination of the SSA form with the Program Dependence Graph (PDG) was proposed by Ballance,
Maccabe and Ottenstein (OTTENSTEIN; BALLANCE; MACCABE, 1990). It is called Gated Single
Assignment (GSA) and is arguably a better subscripting scheme than the one proposed by
Alpern, Wegman and Zadeck.
(a) Intermediate code:

  i ← 1
  j ← 1
  if i mod 2 = 0
    i ← i + 1
    j ← j + 1
  else
    i ← i + 3
    j ← j + 3
  if j > n

(b) Corresponding minimal SSA form:

  Entry: n ← val; i1 ← 1; j1 ← 1
  i3 ← φ2(i1, i2)
  j3 ← φ2(j1, j2)
  i3 mod 2 = 0
    then: i4 ← i3 + 1; j4 ← j3 + 1
    else: i5 ← i3 + 3; j5 ← j3 + 3
  i2 ← φ5(i4, i5)
  j2 ← φ5(j4, j5)
  j2 > n
  Exit

Fig. 8: Minimal SSA form from an intermediate code.
At the core of the GVN technique is the discovery of so-called variable congruence. Two
variables are congruent when the computations that define them have identical operators and
their corresponding operands are congruent. In order to apply the GVN technique, the
procedure needs to be translated into minimal SSA form. Such manipulation can be done using
the dominance frontier. The SSA form can then be used to produce the so-called Value Graph
(VG). The value graph is a labeled directed graph whose nodes are labeled with operators,
functions and constants. The value graph edges connect each operator or function to its
operands. The edges are labeled with natural numbers
[Figure 9 shows the value graph: nodes labeled with the operators (φ2, φ5, +, mod, =, >),
the constants and n, with edges numbered by operand position.]

Fig. 9: Value Graph.
indicating the operand position with respect to a given operator. Figure 9 shows the value
graph for the intermediate code in Figure 8(a) and its translation into the minimal SSA form
of Figure 8(b).
From the value graph, variable congruence is defined as the maximal relation on the graph
such that two nodes are congruent if either they are the same node or they have the same
operators and their operands are congruent. The equivalence of two variables is then defined
at a point P if they are congruent and their defining assignments dominate P (MUCHNICK, 1997).
The Alpern-Wegman-Zadeck algorithm makes the optimistic assumption that a large set of
expression classes is congruent and refines the grouping by splitting the congruence classes
until a fixed point is reached. A hash table-based approach for GVN was introduced in
(BRIGGS; COOPER; SIMPSON, 1997). The role of the hash table is to associate an expression
with a value number. Taylor Simpson's SCCVN algorithm for optimistic value numbering
discovers value-based identities; it is arguably the strongest global technique for the
detection of redundant scalar values and has been implemented in a number of compilers
(SIMPSON, 1996).
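A minimal Python sketch can convey the hash-based idea over SSA names: because each SSA name is defined exactly once, a single global table suffices, and two names are equivalent when their defining expressions receive the same value number. This is an illustration only; real implementations additionally handle commutativity, constants and φ-functions.

```python
def hash_based_gvn(ssa_program):
    """Hash-based value numbering over SSA statements.

    ssa_program is a list of (dest, op, operands); each dest is a unique
    SSA name. Returns a map from names and expression keys to value numbers.
    """
    vn = {}        # SSA name, constant, or expression key -> value number
    counter = 0

    def number(operand):
        nonlocal counter
        if operand not in vn:         # constants and undefined inputs
            counter += 1
            vn[operand] = counter
        return vn[operand]

    for dest, op, operands in ssa_program:
        key = (op,) + tuple(number(o) for o in operands)
        if key not in vn:             # first time this expression is seen
            counter += 1
            vn[key] = counter
        vn[dest] = vn[key]            # dest inherits the expression's number
    return vn

prog = [
    ("i4", "+", ("i3", 1)),
    ("j4", "+", ("j3", 1)),
    ("k4", "+", ("i3", 1)),   # same expression as i4's definition
]
v = hash_based_gvn(prog)
print(v["i4"] == v["k4"], v["i4"] == v["j4"])  # -> True False
```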
2.3.3 Partial Redundancy Elimination (PRE)
Another large group of optimization techniques can be gathered under the Partial Redundancy
Elimination (PRE) umbrella. The technique was first proposed in (MOREL; RENVOISE, 1979).
The basic idea is to find computations that are redundant on some, but not all, paths
(LO et al., 1998). The process can be viewed as follows: given an expression e that at some
point p is redundant on some subset of the paths that reach p, the transformation inserts
evaluations of e on the paths where it was absent, making the evaluation at p redundant on
all paths (XU, 2003). The conclusion frequently found in the literature comparing GVN and
PRE can be summarized as follows: on the one hand, PRE finds lexical congruences instead of
value congruences, but covers partial redundancies; on the other hand, GVN finds value
congruences but can remove only full redundancies (VANDRUNEN, 2004).
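The insertion step can be mimicked in Python on a toy diamond-shaped flow. This is purely illustrative: a compiler performs the transformation on the flow graph, not on source, and the function names are invented for the example.

```python
def before(cond, a, b):
    # a + b is partially redundant: computed on the cond path,
    # then recomputed unconditionally after the merge point.
    if cond:
        x = a + b
    else:
        x = 0
    y = a + b          # redundant whenever cond was taken
    return x, y

def after(cond, a, b):
    # PRE inserts an evaluation on the other path, so the expression
    # is available on all paths and the late computation becomes a reuse.
    if cond:
        t = a + b
        x = t
    else:
        t = a + b      # inserted evaluation
        x = 0
    y = t              # now fully redundant: replaced by a reuse of t
    return x, y

print(before(True, 2, 3) == after(True, 2, 3))  # -> True
```

On the cond path the late `a + b` disappears entirely; on the other path the work merely moves earlier, which is why PRE insertions are profitable only when the redundant paths are taken often enough.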
Attempts to integrate both approaches have appeared in the literature more recently. The
mixed approaches proposed in (BODÍK; ANIK, 1998) and later in (VANDRUNEN, 2004) are
significant examples. However, to date such approaches have not been implemented in
commercial compilers, especially due to their complexity and memory demands.
2.4 Register Allocation
Register allocation attempts to maximize the execution speed of programs by keeping as many
variables as possible in registers instead of memory. When there are more live variables than
available registers, variables must be spilled from registers, i.e., written to memory and
reloaded at a later time when they are needed. An efficient allocation should avoid spilling
invariant values, since such a spill is potentially redundant and can be folded by propagating
the invariant value to the subsequent instructions that use it as a source operand.
Register allocation is usually performed at the end of global optimization, when the final
structure of the code is ready and all registers to be used are known. The register allocation
procedure attempts to map the registers in such a way as to minimize the number of memory
references. Register allocation affects performance by lowering the instruction count and
potentially reducing the execution time per instruction by changing memory operands to
register operands. Code size is reduced by these improvements, leading to other secondary
improvements. Register allocation depends on liveness analysis in order to decide which
variables to spill.
2.4.1 Variable Liveness Analysis
Liveness analysis is a data-flow analysis technique used to find the variables that may
potentially be read before their next write. A variable in this state is called live.
Liveness analysis is at the core of the register allocation task: register allocation
determines which values should reside in machine registers at each point of the execution,
and register assignment is usually preceded by a liveness analysis.
A liveness analysis of a basic block S can be expressed by the following equations:

  L_in[S] = G[S] ∪ (L_out[S] − U[S])
  L_out[final] = ∅
  L_out[S] = ⋃ { L_in[p] : p ∈ succ[S] }
  G[d : y ← f(x1, · · · , xn)] = {x1, · · · , xn}
  U[d : y ← f(x1, · · · , xn)] = {y},

where G[S] is the set of variables used in S before any assignment and U[S] is the set of
variables assigned a value in S before any use. The sets L_in and L_out are, respectively,
the set of variables that are live at the beginning of the block and the set of variables
that are live at the end of the block. Making the written variables dead and the read
variables live is handled by the transfer function f (KHEDKER; SANYAL; KARKARE, 2009).
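Assuming per-block G and U sets, the equations can be solved by fixed-point iteration. The sketch below is illustrative, not the thesis implementation; block names and the tiny CFG are invented for the example.

```python
def liveness(blocks, succ):
    """Iterative liveness analysis over a CFG.

    blocks maps a block name to (gen, kill): gen = variables read before
    any write (G in the equations above), kill = variables written before
    any read (U). succ maps a block to its successors. Returns live-in sets.
    """
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:                        # iterate to a fixed point
        changed = False
        for b, (gen, kill) in blocks.items():
            out = set().union(*(live_in[s] for s in succ[b]))
            new_in = gen | (out - kill)   # L_in = G ∪ (L_out − U)
            if new_in != live_in[b] or out != live_out[b]:
                live_in[b], live_out[b] = new_in, out
                changed = True
    return live_in

# y ← f(x): gen = {x}, kill = {y}
blocks = {
    "B1": ({"a"}, {"x"}),        # x ← a + 1
    "B2": ({"x"}, {"y"}),        # y ← x * 2
    "B3": ({"y"}, set()),        # return y
}
succ = {"B1": ["B2"], "B2": ["B3"], "B3": []}
print(liveness(blocks, succ)["B1"])   # -> {'a'}
```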
(a) x ← 2; y ← 4; w ← x + y; z ← x + 1; x ← z * 2

(b) s1 ← 2; s2 ← 4; s3 ← s1 + s2; s4 ← s1 + 1; s5 ← s1 * s2; s6 ← s4 * 2

(c) r1 ← 2; r2 ← 4; r3 ← r1 + r2; r3 ← r1 + 1; r1 ← r1 * r2; r2 ← r3 * 2

Fig. 10: Steps to perform register allocation with K-coloring.
[Figure 11 shows the interference graph over the symbolic registers s1-s6 and the real
registers r1-r3.]

Fig. 11: Example of interference graph.
2.4.2 Register Allocation by Graph-Coloring
One of the most effective and widely used approaches to register allocation is the technique
known as register allocation by graph coloring. The application of the general graph-coloring
problem (SAATY; KAINEN, 1977) to register allocation was first proposed in (COCKE, 1969).
The first implementation was obtained only ten years later in (CHAITIN, 2004). The approach
was first adapted for the PL.8 compiler for the IBM 801 RISC system. Most modern compilers
implement this allocator or one of its derivations.
Global register allocation by graph coloring comprises two major procedures: the construction
of the so-called Interference Graph (IG) and the K-coloring procedure. The interference graph
is created so that every vertex represents a unique variable in the program. Interference
edges connect pairs of vertices that are live at the same time. Pairs of vertices involved in
move instructions are connected by preference edges. Register allocation can then be done by
K-coloring the interference graph. The K-coloring problem applied to register allocation can
be seen, in a simplified way, as assigning necessarily different colors to two vertices
sharing an interference edge and assigning, when possible, the same color to vertices sharing
a preference edge. Hardcoded register assignment is represented by precoloring some vertices.
The K-coloring problem is known to be NP-complete (BRIGGS; COOPER; TORCZON, 1994). Over the
years, algorithms trading off the quality and performance of the produced code have been
proposed. Global register allocation by graph coloring can be summarized in the following
steps (MUCHNICK, 1997):
1. In the phase that precedes register allocation, allocate objects that can be assigned to
registers r1, r2, · · · , rn to distinct symbolic registers, using as many registers as
necessary to hold the objects;
2. Determine the object sets that should be candidates for allocation;
3. Generate the interference graph, in which each node represents an allocatable object or a
real register of the target machine. The arcs represent interferences, where two allocatable
objects interfere if they are simultaneously live, and an object and a register interfere if
the object cannot or should not be allocated to that register;
4. Color the nodes of the interference graph with K colors, where K is the number of
available registers. Two adjacent nodes must have different colors;
5. Allocate each object to the register that has the same color.

Build → Simplify → Potential spill → Select → Actual spill

Fig. 12: Graph coloring heuristic.
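A simplify/select loop in the Chaitin style can sketch the coloring steps in Python. This is a reduced model: no spill handling, coalescing or preference edges, and the example graph's edges are assumed for illustration, since Figure 11 does not list them.

```python
def color_graph(interference, k):
    """Chaitin-style K-coloring: repeatedly remove a node with fewer than
    k neighbours (simplify), then color nodes in reverse removal order
    (select). Returns node -> color, or None if some node would spill.
    """
    graph = {n: set(neigh) for n, neigh in interference.items()}
    stack = []
    while graph:
        # simplify: a node with < k neighbours is always colorable
        trivial = next((n for n in graph if len(graph[n]) < k), None)
        if trivial is None:
            return None                    # potential spill
        stack.append((trivial, graph.pop(trivial)))
        for neigh in graph.values():
            neigh.discard(trivial)
    colors = {}
    while stack:                           # select, in reverse removal order
        node, neighbours = stack.pop()
        used = {colors[n] for n in neighbours if n in colors}
        colors[node] = next(c for c in range(k) if c not in used)
    return colors

# Hypothetical interference edges over four symbolic registers:
ig = {
    "s1": {"s2", "s3", "s4"},
    "s2": {"s1", "s3"},
    "s3": {"s1", "s2"},
    "s4": {"s1"},
}
print(color_graph(ig, 3))
```

Every node removed during simplify had fewer than k colored neighbours when it is put back, so a free color always exists in the select phase.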
The intermediate code in Figure 10 shows the steps of the register allocation process with
K-coloring, and the corresponding interference graph is shown in Figure 11. An efficient
approach to implementing register allocation with graph coloring was proposed by Briggs and
Cooper in (BRIGGS; COOPER; TORCZON, 1994) and is widely used in modern compilers. A commonly
used heuristic flow for graph coloring is shown in Figure 12.
Recent developments show that the interference graphs of programs in Static Single Assignment
(SSA) form are chordal (HACK; GRUND; GOOS, 2006). This result is important because chordal
graphs can be colored in polynomial time. Most known register allocation models can be
adapted to run on SSA-form programs (TORCZON; COOPER, 2011). Register allocators can benefit
from the chordality of SSA-form programs in three main ways: (i) lower register pressure;
(ii) separation between spilling and register assignment; (iii) simpler register assignment
algorithms. Spill-free register allocation has a polynomial-time solution for SSA-form
programs, but it is NP-complete for programs in general. One point that must be emphasized is
that these two problems are not equivalent. Any program can be converted into SSA form via a
polynomial-time transformation (CYTRON et al., 1991). However, a register assignment for an
SSA-form program cannot be converted back to an optimal register assignment of the original
program in polynomial time unless P=NP (TORCZON; COOPER, 2011).
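The practical consequence of chordality can be seen with a greedy sketch: coloring a chordal graph in the reverse of a perfect elimination ordering never needs more colors than the largest clique. The graph and order below are illustrative assumptions, not taken from the thesis.

```python
def greedy_color(graph, order):
    """Greedy coloring along a given vertex order: each vertex takes the
    smallest color unused by its already-colored neighbours. On a chordal
    graph visited in the reverse of a perfect elimination order this is
    optimal, which is why SSA-form interference graphs can be colored in
    polynomial time.
    """
    colors = {}
    for v in order:
        used = {colors[u] for u in graph[v] if u in colors}
        colors[v] = next(c for c in range(len(graph)) if c not in used)
    return colors

# A chordal graph: triangle a-b-c with a pendant d attached to c.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
coloring = greedy_color(graph, ["a", "b", "c", "d"])
print(len(set(coloring.values())))  # -> 3 (the largest clique has 3 nodes)
```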
The next chapter discusses the limitations of static analysis and approaches to identifying
and avoiding redundant computation with run-time knowledge.
3 Dynamic Identification of Redundant Computation
The traditional approach to code optimization is the compile-time optimizer, and modern
optimizing compilers rely strongly on static analysis to identify and remove redundant
computation. It is common practice to evaluate a compiler's optimizations by reporting the
amount of such computation successfully identified and removed. Unfortunately, there are
difficulties that limit the effectiveness of compile-time optimization. Procedure boundaries,
profitability-based trade-offs, and heavy interaction among multiple optimizations and the
system architecture inhibit the effectiveness of many optimizations. Significant benefits
have been shown to be gained from optimizing across procedure boundaries. However, finding
and exploiting interprocedural opportunities proves challenging. Interprocedural analysis
implementing aggressive function inlining can remove many procedure boundaries entirely, but
comes at the cost of increased code size and can greatly increase cache misses.
Interprocedural data-flow analysis techniques have also been used. Evidence, however,
suggests that such methods are not worth the additional complexity they create in the
compiler, given their limited impact on effectiveness. The questions that typically arise
regarding optimized code are how effective the compiler was in finding and eliminating
redundant computation and whether there are unexploited optimization opportunities that can
only be detected at run time.
Dynamic identification of redundant computation detects redundancies along an execution path
using run-time data. Some redundancies identified dynamically cannot be detected statically
in a practical way, or are related to a specific or unexpected execution path. In any case,
dynamic redundancies can elucidate shortcomings in various parts of a compiler. For example,
common subexpression elimination (CSE) should eliminate redundant loads, dead-store
elimination should remove unnecessary stores, and a register allocator would not need to
store an unmodified spilled value. A register spill while another register holds a dead value
indicates that the register allocation unnecessarily spilled a value along at least one path.

[Figure 13 depicts run-time redundancy as encompassing both fully static and partially
static redundancy.]

Fig. 13: Relationship between redundancy categories.
The occurrence and detection of redundancy can be categorized as shown in Figure 13 (XU,
2003). Fully static redundancies are those that can be safely removed through static
analysis. Partially static redundancies are instructions that can be avoided statically
through code motion. Run-time redundancies are the instances that are only visible at run
time.
This chapter discusses the limitations of compile-time optimization and the main reasons why
statically optimized code can still exhibit significant opportunities for optimization. It
discusses approaches to identifying dynamic redundant computation as a way to provide
information on unexploited opportunities, as well as existing support for exploiting run-time
data in optimization.
3.1 Limitations of Static Analysis
Exact addresses are often unavailable at compile time, and control flows conditional on the
program's input are unknown until run time (COOPER; XU, 2003). Static analysis must
conservatively speculate on the program's dynamic behavior; it typically generalizes
invariant behavior, which is then applied to all dynamic instances. A representative example
is register promotion (COOPER; LU, 1997), which identifies memory operations that always
access the same
(a) original code:
  if ( ) then
    Z = X * Y
  else
    ...
  endif
  if ( )
    W = X * Y
  else
    ...
  endif

(b) redundancy removal:
  if ( ) then
    T = Z = X * Y
  else
    T = X * Y
  endif
  if ( )
    W = T
  else
    ...
  endif

(c) strength reduction:
  if ( ) then
    T = Z = X * Y
  else
    T = (Y=1)? X : X * Y
  endif
  if ( )
    W = T
  else
    ...
  endif

Fig. 14: Example of profile-guided transformation.
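Panel (c) of Figure 14 can be mimicked in Python (illustrative names only; the guard pays off when the profile shows that Y = 1 dominates):

```python
def mul_original(x, y):
    return x * y

def mul_strength_reduced(x, y):
    # Profile data says y == 1 is the common case: skip the multiply then.
    return x if y == 1 else x * y

# The transformation is semantics-preserving for every input:
print(all(mul_original(x, y) == mul_strength_reduced(x, y)
          for x in range(-3, 4) for y in range(-3, 4)))  # -> True
```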
memory location and uses a register to hold copies of the memory content. However, the
approach fails to detect possible differences and redundancies existing among dynamic
instances. In addition, the advent and increased use of shared libraries, dynamic class
loading, and run-time binding complicate the compiler's ability to analyze and optimize
programs. For good performance, many versions of the compiled application may be needed to
take into account the performance implications of subtle architectural differences between
compatible processors (even processors in the same family).
Profile-guided optimization has helped to identify partially redundant computations that cannot be conservatively distinguished at compile time but, with run-time knowledge, can be optimistically eliminated. Dynamic limit studies attempt to evaluate the effectiveness of the optimizations deployed and to measure the unexploited potential. Hardware-based support for redundancy elimination identifies run-time redundancy instances and/or helps the compiler exploit them, for example through reuse buffers and support for speculative and predicated execution.
3.2 Profile Guided Optimization (PGO)
If the compiler understands the relative execution frequencies of the various parts of the
program, it can use that information to improve the program’s performance. Profile Guided
Optimization (PGO) has helped to tune the optimization process by collecting information about resource utilization, such as register allocation and instruction counts, at run time (LIN et al., 2003; LIU et al., 2007). Figure 14 shows an example of redundancy elimination opportunities that can be exploited through profile data. In static analysis, the compiler cannot determine the value of variable y during execution. Profiling, however, indicates that the value of y in the first multiplication is very frequently 1, so the multiplication can frequently be optimized away. The transformation shown in the figure performs this optimization by executing the multiplication conditionally. The transformation implies extra instructions to check the value of y, and is only worthwhile depending on how frequently variable y holds the value 1 rather than some other value; such information determines whether the transformation should be applied. Profiles can also play an important role in optimizations such as global code placement or inline substitution. The major approaches to gather profile data are:
• Instrumented executable: the compiler generates code to count specific events, such
as procedure entries and exits or taken branches. The data is written to an external file
at run-time and processed offline by another tool;
• Timer interrupts: approach that interrupts program execution at frequent, regular in-
tervals. The tool constructs a histogram of program counter locations where the inter-
rupts occurred. Post-processing constructs a profile from the histogram;
• Performance counters: many processors offer some form of hardware counters to
record hardware events, such as total cycles, cache misses, or taken branches. If coun-
ters are available, the run-time system can use them to construct profile-like data.
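As a minimal illustration of the instrumented-executable approach above, the following Python sketch counts procedure entries; the decorator and event names are hypothetical, and a real compiler inserts the counter updates directly into generated code rather than wrapping functions.

```python
from collections import Counter

events = Counter()  # event name -> count, dumped to a file after the run

def instrumented(name):
    """Hypothetical instrumentation hook: count each procedure entry."""
    def wrap(fn):
        def inner(*args, **kwargs):
            events[name] += 1  # the compiler-inserted counter update
            return fn(*args, **kwargs)
        return inner
    return wrap

@instrumented("hot_path")
def hot_path(x):
    return x + 1

for i in range(5):
    hot_path(i)
```

After the run, `events` plays the role of the external profile file that an offline tool would process.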
The benefit of PGO lies in overcoming static optimization limitations by exploiting dynamic information that cannot be safely inferred statically. However, the approach relies on good profiling information, which typically implies collecting and handling a representation of an actual run of the program. Common types of profile information are control flow profiles, value profiles, and memory profiles.
[Figure: a control flow graph with entry S, nodes a through h and k, and exit E, annotated with observed path frequencies for paths such as (abdfgk), (acefhk), (abdfhk), (acefgk), (abefgk), (abefhk); one hot path occurs 50 times while the others occur 10 times each.]

Fig. 15: An example of control flow profiles.
[Figure: the code fragment l1: load R3, 0(R4); l2: R2 ← R3 & 0xff, with per-(instruction, register) profiles listing (value, frequency) pairs such as (0xb8d003400, 10)…, (0, 1000), and (0, 1000), (0x890, 200), …, (0x2900, 100) for the pairs (l1,R3), (l1,R2), and (l2,R3).]

Fig. 16: An example of value profiles.
In control flow profiling, a trace of the execution path taken by the program is generated by counting the number of visits to the various basic blocks in the program's control flow graph (CFG). Real-world applications generate large and sometimes unmanageable traces. Whole Program Paths (WPP), representing a whole execution, have been made possible through profile compression, as proposed in (LARUS, 1999) and (ZHANG; GUPTA, 2001). An example of a control flow trace
is shown in Figure 15.
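A toy sketch of path counting in the spirit of a compressed whole-program path: identical paths are counted rather than stored. The block labels loosely follow Fig. 15, and the control conditions are invented.

```python
from collections import Counter

path_counts = Counter()  # path string -> execution count

def record_run(branch1, branch2):
    # One traversal S -> ... -> E; block labels loosely follow Fig. 15.
    path = ["S", "a" if branch1 else "c", "b" if branch1 else "e",
            "f", "g" if branch2 else "h", "k", "E"]
    path_counts["".join(path)] += 1

for _ in range(50):
    record_run(True, True)   # the hot path dominates
record_run(False, False)
```

Counting whole paths instead of individual edges is what lets a WPP distinguish correlated branches while keeping the trace compact.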
Value profiles identify the specific values encountered as operands of an instruction and their frequencies. The use of a top-n-values table (TNV) has been proposed (CALDER; FELLER; EUSTACE, 1997) to cope with the fact that not all values appearing in a practical execution can be collected; replacement policies, such as Least Frequently Used (LFU), are typically used to populate the table. Approaches to lower the overhead generated by value
[Figure: a loop over i that switches on flag, assigning xa=*pa, xb=*pb, xc=*pc, or xd=*pd per iteration and updating pa=buf[i], with declarations int flag; int *pa, *pb, *pc, *pd; int buf[2000]; int xa, xb, xc, xd; and an address profile of link weights: (A(xa),A(pa)) 5000; (A(xb),A(pb)) 20; (A(xc),A(pc)) 2; (A(xd),A(pd)) 10; …; (A(pa),A(buf)) 2000.]

Fig. 17: An example of memory profiles.
collection have been proposed (WATTERSON; DEBRAY, 2001). Value profiles are of particular interest for value-specialization optimizations such as constant folding, strength reduction, and motion of nearly invariant code. An example of a value profile is shown in Figure 16.
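The TNV table described above can be sketched in a few lines of Python; the table size and the recorded values are assumptions for illustration, and real implementations bound the counter widths as well.

```python
class TopNValues:
    """Hypothetical top-n-values (TNV) table: keep the n most frequent
    values seen at an instruction operand, evicting via LFU."""

    def __init__(self, n=4):
        self.n = n
        self.freq = {}  # value -> observed frequency

    def record(self, value):
        if value in self.freq:
            self.freq[value] += 1
        elif len(self.freq) < self.n:
            self.freq[value] = 1
        else:
            victim = min(self.freq, key=self.freq.get)  # LFU replacement
            del self.freq[victim]
            self.freq[value] = 1

tnv = TopNValues(n=2)
for v in [0, 0, 0, 1, 2]:
    tnv.record(v)
# 0 remains the dominant entry; the rare value 1 was evicted for 2.
```

LFU keeps the dominant values resident even when many distinct values stream past, which is exactly the property value specialization needs.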
Address profiles are collected as a stream of memory addresses, typically in compressed form (CHILIMBI, 2001), and are used to improve the performance of the memory hierarchy. Hot address streams, subsequences of addresses that are encountered very frequently, are used to guide placement transformations. An example of an address profile is shown in Figure 17.
The typical scheme used for profiling is shown in Figure 18. Collecting profiles implies an overhead, since they are gathered by executing an instrumented version of the program. Instrumentation depends on the profile being collected and typically targets specific parts of the code. The approach has helped obtain better results when guiding several static optimizations. However, due to its limited scope, it does not provide much information about the potential that remains unexploited in optimized code.
[Figure: the compiler emits an instrumented program; executing it with a representative input produces profile data; a profile-guided optimizing compiler then consumes the program and the profile data to emit optimized code.]

Fig. 18: Profile-guided optimizing compiler.
3.3 Dynamic Limit Studies
The ultimate optimization effectiveness analysis would indicate how far the deployed optimizations are from the ideal case. Since the identification of redundant computation is in general undecidable (RAMALINGAM, 1994), such an analysis is not feasible. Dynamic limit studies instead target the total redundancy exhibited at run time, yielding an upper limit on exploitable redundancy elimination. Limit studies are related to the occurrence of instruction repetition and to the likelihood that an instruction repeatedly produces the same known result. This effect, known as value locality, is primarily related to the fact that real-world programs, run-time environments, and operating systems incur severe performance penalties because they are general by design (LIPASTI; WILKERSON; SHEN, 1996). They are implemented to handle contingencies, exceptional conditions, and erroneous inputs, all of which occur relatively rarely in real life. Even code that is aggressively optimized by modern, state-of-the-art compilers exhibits such instruction repetition and value locality.
Studies based on empirical observations have helped explain why instruction repetition occurs (LIPASTI; WILKERSON; SHEN, 1996). A simulator that derives the ideal performance of an algorithm for removing heap-based loads is presented in (DIWAN; MCKINLEY; MOSS, 1998); the ideal performance is used to determine which alias analysis is near-optimal for load removal while still not too expensive. A compiler auditor tool, which analyzes the program trace to discover limitations and bugs in the compiler, is presented in (LARUS; CHANDRA, 1993). A load-reuse profiling technique based on a limit study, with the primary goal of giving load-reuse hints to the processor, is discussed in (REINMAN et al., 1998). One of the most relevant works is the load-reuse limit study presented as a method to detect dynamic redundancy for memory
operations and to evaluate the effectiveness of register promotion in (BODÍK; GUPTA; SOFFA, 1999). Instruction removal is obtained by a lexical load-reuse analysis, in which only loads with identical names or identical syntax-tree structure (record fields) can be detected as equivalent. The method revealed a significant occurrence of redundant load instructions (55% on average) in the SPEC95 benchmarks.
3.4 Hardware support for Redundancy Elimination
A number of prior proposals have exploited hardware support to detect and avoid redundant operations and to exploit dynamic redundancy. Most existing methods stem from two major approaches: Dynamic Instruction Reuse (DIR) and Data Value Prediction (DVP). The method proposed in this work differs from previous techniques in many aspects. Unlike most of them, it discovers redundancies arising from the interaction between arithmetic and memory values across the multiple procedures of the program, without separating the tasks of arithmetic and memory redundancy detection. Load-reuse analysis provides an evaluation only for dynamic memory operations. Moreover, the known methods either use dynamic profiling to estimate dynamic redundancy or trace redundancies in a representative segment. Dynamic instruction reuse and value prediction focus on specific forms of dynamic redundancy and are prone to report redundancies between instructions that merely happen to handle the same contents.
To the author's knowledge, there is no whole-application method that detects memory and arithmetic dynamic instructions that use the same operands as in a previous execution, a similarity that becomes visible only at run time because their addresses match. Such an approach would ultimately yield a practical and more precise approximation to the upper bound of exploitable redundancies, as well as a reference point for optimization effectiveness evaluation based on the remaining unexploited redundancy potential. The information produced by this approach would also give a better picture of the hotspots to be targeted, and would be valuable for extracting the most from available reuse buffers and value prediction tables. Benefits are achieved in terms of a lower total number of executed instructions, implying performance gains and a potential decrease in power consumption.
The approach consists of a method based on observing the stream of memory references and executed instructions. The objective is to find an approximation to all instruction redundancies visible at run time, using the executed instructions as input. To detect redundancies, a new take on the local value numbering (LVN) algorithm, which discovers redundancies and folds constants, was developed. Inspired by dynamic redundancy detection methods, the algorithm is redesigned to be applied over a stream of instruction executions and is, for this reason, inherently interprocedural: the method reduces the difficulties encountered in value numbering of extended basic blocks (ALPERN; WEGMAN; ZADECK, 1988a) to the analysis of one large basic block. The method handles each memory operation as a variable assignment, and each instruction that performs an arithmetic operation as a usual operation in local value numbering. It relies on mapping the memory addresses held in registers into identification numbers (value numbers), assigned through value numbering. These numbers are used to find redundant computation, and their dynamics prevent operations that merely happen to handle the same content from being taken as redundant, since they would not hold the same value number. The goal is to produce a redundancy report that identifies each instruction redundantly executed in the specific execution path.
Inefficiencies found through this evaluation do not necessarily mean that the compiler could generate better code. For example, consider loop-invariant code in which the calculated expression is a single instruction; a compiler might (justifiably) recalculate the expression to avoid keeping a register occupied throughout the loop. The method allows evaluating, against an upper-bound limit, the effectiveness of the optimizations with which a program was compiled. Comparing programs generated with different and potentially interfering optimizations makes it possible to measure the overall effectiveness of removal methods in terms of the total instruction redundancy exhibited at run time. By correlating redundancy detection with the instruction's PC, an evaluation of value locality for the whole execution trace is also achieved. It is worth noticing that the proposed method relies on the availability of an instrumentation framework to intercept and evaluate the instructions and memory references being executed.

The following sections propose a method based on a dynamic extension of the value numbering algorithm that uses run-time information to identify dynamic instances of redundant memory and arithmetic operations.
4.2 Dynamic Value Numbering Algorithm (DVN)
The goal of a classical value numbering algorithm is to recognize redundancy among expressions that are lexically different but certain to compute the same value. Traditionally, this is achieved by assigning symbolic names (value numbers) to expressions. If the value numbers of the operands of two expressions are identical, and the operators applied by the expressions are identical, then the expressions receive the same value number and are certain to produce the same result. A dynamic version of the value numbering algorithm detects instances of redundant instructions by analyzing a stream of executed instructions, obtained by running an executable generated by a compiler for a given code, architecture, and input. The goal of such a dynamic extension is to detect instruction reuse based on information about the executed instructions, the registers involved, and the memory references. Since the equivalences are based on the sources of the values, not the literal values themselves, the technique does not report spurious redundancies when two instructions merely manipulate the same bit pattern. The approach extends local value numbering and makes an interprocedural analysis possible, for the execution trace is seen as one large basic block. The method is comprehensive in the sense that it treats scalar, array, and pointer-based loads uniformly, and it offers a way to determine the run-time optimization benefit.
In a dynamic value numbering algorithm, variable assignment is captured through references to memory. For the first occurrence of each memory access instruction, a new value number is assigned to the accessed address, just as the classical algorithm does for variables in source code. By tracking the value number associated with the content of each memory address, it is possible to associate the operands of an arithmetic instruction with the value number list and check whether the computation was previously performed. As in the classical algorithm, the dynamic one tests, for each arithmetic instruction, whether the computation over the value numbers involved was previously performed. This method captures redundancies in arithmetic operations and provides an upper bound for dynamic load and store reuse.
The proposed dynamic value numbering algorithm has as input a stream of executed instructions and produces as output a redundancy report, with the redundant instructions identified and statistics of redundancy occurrence. The algorithm relies on three key data structures. The first is an array relating each register number to that register's associated value number, as illustrated in Figure 25, where n is the number of registers available in the architecture.

[Figure: the register–value number array, mapping r0 → #1, r1 → #2, r9 → NULL, …, r(n-1) → NULL.]

Fig. 25: Registers and value numbers mapping.
In addition, two tables holding the value numbers and the memory accesses are used. The algorithm also relies on four key procedures, one for each of the following sets of instructions: load, store, arithmetic, and others. The load set includes all variants of load instructions in the architecture's instruction set. The store set includes instructions that store a register's content into memory. The arithmetic set comprises all arithmetic instructions, divided into two subsets, commutative and non-commutative. The remaining set, others, contains all instructions not in any of the other sets; they are classified as unsupported. The algorithm requires instruction decoding and classification into the four sets, as well as the physical address accessed by each memory instruction. It then handles each instruction according to its set, so that it becomes able to detect in the instruction stream whether a given instance had been previously performed. This identification is made by means of the procedures shown in Algorithms 4.1, 4.2, and 4.3.
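The classification step that the procedures assume can be sketched as follows; the mnemonics are hypothetical PowerPC-style examples, not the actual tables used in this work.

```python
# Hypothetical opcode tables; a real implementation enumerates the
# target ISA's full load, store, and arithmetic variants.
LOAD_SET = {"lwz", "ld", "lbz"}
STORE_SET = {"stw", "std", "stb"}
ARITH_SET = {"add", "mul", "subf"}

def classify(opcode):
    """Map a decoded opcode into one of the four DVN instruction sets."""
    if opcode in LOAD_SET:
        return "load"
    if opcode in STORE_SET:
        return "store"
    if opcode in ARITH_SET:
        return "arithmetic"
    return "other"  # branches, compares, cache management, ...

assert classify("lwz") == "load"
assert classify("bne") == "other"
```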
4.2.1 Redundant Memory Operation
The approach classifies redundant memory operations into the following sets:

1. Fully-redundant load: an instruction that loads a value already held in the target register;

2. Redundant load: a load that, the last time it accessed this address, fetched the same value;

3. Redundant store: an instruction that writes a value to a memory location already holding the identical value.
Fully-redundant loads and redundant stores are instructions that can be skipped entirely with no impact on program behavior. Redundant load detection finds multiple load instructions that always refer to the same memory location x with no intervening (killing) store to that address. To do so, the algorithm determines whether a store instruction on a path from the redundant instruction's program counter PCr to the previous occurrence of the instruction PCn modifies x. There may have been intervening accesses to other addresses; in this case, the same content will be fetched, a redundancy that is not necessarily captured by value prediction approaches.
4.2.1.1 Redundant Load
Algorithm 4.1 shows the procedure for handling load instructions. It requires the instruction's details, such as the physical address that data is being loaded from and the target register. The procedure works with the three main data structures: RegVn, the register–value number array; VN, the table holding the value numbers; and AT, the access history.

The first steps consist of obtaining the instruction opcode OP, the physical address RA, and the target register RT. Next, the tuple T is set with the instruction opcode OP and physical address RA. The algorithm then searches for T in the VN table, checking whether a value number had already been assigned to T, which would indicate that the address was accessed by a previous equivalent memory operation. If T is not in the VN table, it is tested whether a previous memory operation is known for RA. If RA was the target of a store instruction, the operation refers to content that was spilled to memory and is now being reloaded: the value number i from the previous access is obtained and used to update RegVn, the access details are added to AT, and the instruction is marked as a spill. Otherwise, as no value number is known for RA, this first access of its type is used to update the data structures: the greatest value number i in the VN table is obtained and incremented, the new value is mapped to T in the VN table, and the occurrence of an access to RA, along with the access set (LOAD SET), is added to the addresses table AT. If T is found in VN, the previous access to RA is retrieved from AT. If the access is in the LOAD SET, then
Algorithm 4.1 Algorithm to detect redundant load operation in an execution stream.
Require: instruction, physical address
 1: RA ← get physical address
 2: RT ← get target register
 3: Set tuple T ← ⟨OP, RA⟩
 4: Search for T in VN
 5: if T ∉ VN then
 6:   Get type of previous access to RA in AT
 7:   if Previous access ∈ STORE SET then
 8:     Get value number i of previous access
 9:     Add ⟨T, i⟩ into VN
10:     RegVn[RT] ← i
11:     Add ⟨RA, RegVn[RT], LOAD SET⟩ to AT
12:     Mark instruction as spill
13:   else
14:     Get greatest value number i in VN
15:     Add ⟨T, i+1⟩ into VN
16:     RegVn[RT] ← i+1
17:     Add ⟨RA, RegVn[RT], LOAD SET⟩ into AT
18:   end if
19: else
20:   Get type of previous access to RA in AT
21:   if Previous access ∈ LOAD SET then
22:     Get value number i of previous access
23:     if RegVn[RT] = i then
24:       Mark instruction as fully-redundant
25:     else
26:       RegVn[RT] ← i
27:       Mark instruction as redundant load
28:     end if
29:   else
30:     Add ⟨RA, RegVn[RT], LOAD SET⟩ to AT
31:     RegVn[RT] ← i
32:     Mark instruction as spill
33:   end if
34: end if
the memory content has not been changed since the previous operation. The value number i associated with the previous access is compared to the value number in RegVn[RT]. If they are the same, the content being loaded is already in the target register, and the instruction is marked as fully redundant; otherwise, the value number i is used to update RegVn[RT] and the instruction is marked as a redundant load. If the previous access is not in the LOAD SET, the operation refers to the load of a previously spilled value: the previous access changed the memory content and, therefore, the related value number. In this case, the value number i previously assigned is used to update RegVn, the access details are added to AT, and the instruction is marked as a spill.
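The walkthrough above can be condensed into a small sketch. This is a hypothetical, simplified Python rendering of Algorithm 4.1 (dictionary-based tables, invented mnemonics and tags), not the actual implementation.

```python
reg_vn = {}        # RegVn: register number -> value number
vn_table = {}      # VN: (opcode, address) -> value number
access_table = {}  # AT: address -> (value number, "LOAD" or "STORE")

def handle_load(op, addr, rt):
    """Classify one load instruction, mutating the three tables."""
    key = (op, addr)
    if key not in vn_table:
        prev = access_table.get(addr)
        if prev is not None and prev[1] == "STORE":
            # Reload of content previously spilled to memory.
            vn_table[key] = prev[0]
            reg_vn[rt] = prev[0]
            access_table[addr] = (prev[0], "LOAD")
            return "spill"
        # First access of this kind: assign a fresh value number.
        i = max(vn_table.values(), default=0) + 1
        vn_table[key] = i
        reg_vn[rt] = i
        access_table[addr] = (i, "LOAD")
        return "new"
    prev_vn, prev_kind = access_table[addr]
    if prev_kind == "LOAD":
        if reg_vn.get(rt) == prev_vn:
            return "fully-redundant"   # value is already in the register
        reg_vn[rt] = prev_vn
        return "redundant-load"
    reg_vn[rt] = prev_vn               # reload of a spilled value
    access_table[addr] = (prev_vn, "LOAD")
    return "spill"

assert handle_load("lwz", 0x1000, 3) == "new"
assert handle_load("lwz", 0x1000, 3) == "fully-redundant"
```

Loading the same address into a different register would instead be tagged a redundant load, matching the classification in Section 4.2.1.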
4.2.1.2 Redundant Store
Algorithm 4.2 Algorithm to detect redundant store operation in an execution stream.
Require: instruction, physical memory reference
 1: WA ← get physical address
 2: RS ← get source register
 3: Set tuple T ← ⟨OP, WA⟩
 4: if RegVn[RS] ≠ null then
 5:   Get value number i and instruction type TYPE
 6:   of previous access to WA in AT
 7:   if TYPE ∈ STORE SET then
 8:     if i ≠ RegVn[RS] then
 9:       Add ⟨T, i⟩ into VN
10:       Add ⟨WA, RegVn[RS], STORE SET⟩ into AT
11:     else
12:       Mark instruction as redundant
13:     end if
14:   else
15:     Add ⟨T, RegVn[RS]⟩ into VN
16:     Add ⟨WA, RegVn[RS], STORE SET⟩ to AT
17:   end if
18: else
19:   Get greatest value number i in VN
20:   Add ⟨T, i+1⟩ into VN
21:   RegVn[RS] ← i+1
22:   Add ⟨WA, RegVn[RS], STORE SET⟩ to AT
23: end if
Algorithm 4.2 details the handling of store instructions; it is similar to the load case. The first steps obtain the instruction opcode OP, the physical address WA, and the source register RS. The tuple T is set with the instruction opcode OP and physical address WA. Next, it is tested whether RegVn[RS] is mapped to a value number related to the result of a previous instruction. If so, the value number i and the instruction type TYPE of the previous equivalent access to WA are obtained from the AT table. If the previous access to WA is in the STORE SET, the instruction is potentially an equivalent occurrence of a previously executed instruction; to check this, it is tested whether i matches the value number in RegVn[RS]. If not, the pair T and i is added into VN, and the WA access details are added into AT. Otherwise, if the value number i matches the one in RegVn[RS], the operation had already been performed and the content in WA has not been changed, so the instruction is marked as redundant. If the previous access is not in the STORE SET, the operation writes to memory the result associated with the value number in RegVn[RS]; the pair T and RegVn[RS] is added into VN, and the access details are added to the AT table. Finally, there is the case where no value number mapping exists in RegVn[RS]. This indicates that a new content is being stored and a new value number should be associated: the greatest value number i in VN is obtained, incremented, paired with T, and added to the VN table; RegVn[RS] is mapped to i+1; and the access details are added to the AT table.
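The store case can be sketched in the same style; again this is a hypothetical, simplified Python rendering of Algorithm 4.2, with dictionary tables and invented mnemonics.

```python
reg_vn = {}        # RegVn: register number -> value number
vn_table = {}      # VN: (opcode, address) -> value number
access_table = {}  # AT: address -> (value number, "LOAD" or "STORE")

def handle_store(op, addr, rs):
    """Classify one store instruction, mutating the three tables."""
    key = (op, addr)
    i = reg_vn.get(rs)
    if i is None:
        # No value number known for the source register: mint one.
        i = max(vn_table.values(), default=0) + 1
        reg_vn[rs] = i
        vn_table[key] = i
        access_table[addr] = (i, "STORE")
        return "store"
    prev = access_table.get(addr)
    if prev is not None and prev[1] == "STORE" and prev[0] == i:
        # The location already holds exactly this content.
        return "redundant-store"
    vn_table[key] = i
    access_table[addr] = (i, "STORE")
    return "store"

assert handle_store("stw", 0x2000, 5) == "store"            # fresh content
assert handle_store("stw", 0x2000, 5) == "redundant-store"  # same content
```

A redundant store is one that can be skipped outright, since the memory location already holds the value number being written.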
4.2.2 Redundant Common-subexpression
Algorithm 4.3 depicts the procedure for handling instructions of the arithmetic set. It has the same inputs and outputs as the previous procedures. It is worth noticing that the arithmetic operations addressed have two mathematical operands, and that one of these operands might be encoded in the opcode itself; in that case, just one register operand is involved. The algorithm's first step is to get the operand registers. Value i is set with the proper RegVn element's value, as is j, if necessary.

The tuple T, which serves the purpose of identifying redundancies, is then built: if there is only one register operand, it is arranged from the instruction opcode OP, the involved register operand's value number, and the non-register operand, which is a constant. The constant is replaced by the other register operand's value number if both the register and
Algorithm 4.3 Algorithm to detect redundant scalar operation in an execution stream.
Require: instruction
 1: RT ← get target register
 2: RA ← get operand A register number
 3: RB ← get operand B register number
 4: i ← RegVn[RA]
 5: j ← RegVn[RB]
 6: if i ≠ null and j ≠ null then
 7:   Get instruction opcode OP
 8:   if OP ∈ commutative set then
 9:     if i ≤ j then
10:       Set tuple T ← ⟨OP, j, i⟩
11:     else
12:       Set tuple T ← ⟨OP, i, j⟩
13:     end if
14:   else
15:     Set tuple T ← ⟨OP, i, j⟩
16:   end if
17:   Search T in VN
18:   if T ∉ VN then
19:     Get greatest value number k in VN
20:     Add ⟨T, k+1⟩ in VN
21:   else
22:     Get value m associated to T in VN
23:     RegVn[RT] ← m
24:     Mark instruction as redundant
25:   end if
26: else
27:   RegVn[RT] ← null
28: end if
its associated value number exist.

Next, it is tested whether a value number is known for all the register operands. If so, the next step tests whether the instruction's opcode OP belongs to the commutative set. If it does, the operands of tuple T are rearranged into a canonical order; otherwise, tuple T keeps the original operand order. This ensures that redundant commutative instructions are detected as such even if the operand order does not match. An occurrence of the tuple T is then searched for in VN. If it is not found, the operation had not been previously performed; in this case, the greatest value number k in VN is obtained and incremented, and the pair of T and the new value number is added to VN. An occurrence of T in VN, however, indicates that the same operation had been previously performed with the same operands and that the result is known; in this case, the value number m associated with the occurrence of T in VN is obtained, RegVn[RT] is mapped to the value number m, and the instruction is marked as redundant. Finally, there is the case where a value number associated with at least one operand is not known; it is then not possible to detect whether the operation corresponds to a previous occurrence, and RegVn[RT] is updated to reflect this.
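The arithmetic case, including the canonicalization of commutative operands, can be sketched as follows; this is a hypothetical, simplified Python rendering of Algorithm 4.3, with an invented commutative set and a fresh-value-number counter.

```python
reg_vn = {}    # RegVn: register number -> value number (None if unknown)
vn_table = {}  # VN: (opcode, vn_a, vn_b) -> value number of the result
COMMUTATIVE = {"add", "mul"}  # assumed set of commutative opcodes
next_vn = 0

def fresh_vn():
    global next_vn
    next_vn += 1
    return next_vn

def handle_arith(op, rt, ra, rb):
    """Classify one two-operand arithmetic instruction."""
    i, j = reg_vn.get(ra), reg_vn.get(rb)
    if i is None or j is None:
        reg_vn[rt] = None            # operand untracked: give up on rt
        return "unknown"
    # Canonical operand order makes add r1,r2 and add r2,r1 hash alike.
    operands = tuple(sorted((i, j))) if op in COMMUTATIVE else (i, j)
    key = (op,) + operands
    if key in vn_table:
        reg_vn[rt] = vn_table[key]   # the result is already known
        return "redundant"
    vn_table[key] = reg_vn[rt] = fresh_vn()
    return "new"

reg_vn[1], reg_vn[2] = fresh_vn(), fresh_vn()  # operands defined earlier
assert handle_arith("add", 3, 1, 2) == "new"
assert handle_arith("add", 4, 2, 1) == "redundant"  # commutative match
```

A non-commutative opcode with swapped operands would hash to a different key and be reported as new, as intended.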
4.2.3 Unsupported Instructions
The proposed algorithm is intended to operate on a complete, whole-application instruction trace. As such, there are instructions to be processed that are not in any of the defined sets, namely the load, store, and arithmetic sets. These are instructions such as branches, compares, data cache management, and others. The interaction between such instructions and the value number table depends on the specific instruction set used. An implementation of the algorithm has to cover all instructions that potentially interfere with the mapping between registers and value numbers. Instructions that create equivalent mappings should properly update the RegVn array. Specialized, architecture-dependent instructions that perform intrinsic arithmetic operations need to be mapped to one of the sets. Instructions for which such a mapping is not possible should clear RegVn in a pessimistic fashion.
Figure 26 illustrates the resulting VN and AT tables when the method is applied to the execution stream on its left, which clearly contains redundant instruction executions. The process of building the VN table can be seen as the generation of a Directed Acyclic Graph (DAG), where nodes represent instructions and are annotated, for each instance, with the associated value number (on top of the node), the memory address that holds the instruction's resulting content (arrow), and the register associated with the value number (arrow), when known. Figure 27(c) shows the resulting DAG after applying the method to the instruction stream in Figure 26.
[Figure: three snapshots of the DAG built by the algorithm, with add and subf nodes annotated with value numbers 1 through 8, registers r0 and r9, and memory addresses 0x*270 through 0x*28C: (a) DAG after the first 3 instructions; (b) DAG after the first 12 instructions; (c) DAG for the whole execution trace.]

Fig. 27: Steps building a DAG with dynamic value numbering algorithm.
4.3 Value-based Hotspot and h-index
The output of the dynamic value numbering algorithm consists of the identification of redundant computations. By binding redundancy occurrences to specific instances through the instruction's program counter, the algorithm identifies redundancy hotspots. The ratio between the total amount of redundancy detected per program counter (PC) and the number of unique value numbers found per PC provides a hotness index (h-index) for each redundant instruction. An instruction with a high redundancy count and a small number of unique value numbers (different possible outputs for the instruction) is said to be hot. The identification of redundant instructions over the whole program execution measures value locality and indicates hotspots for optimization; the h-index provides a measure of the profitability of each occurrence.
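Under these definitions, the h-index bookkeeping can be sketched as follows; the table names and the sample PC and value numbers are hypothetical.

```python
from collections import Counter, defaultdict

redundant_pcs = Counter()         # PC -> number of redundant executions
values_per_pc = defaultdict(set)  # PC -> distinct value numbers observed

def record_redundancy(pc, vn):
    redundant_pcs[pc] += 1
    values_per_pc[pc].add(vn)

def h_index(pc):
    """Hotness: redundancy count divided by distinct outputs at this PC."""
    return redundant_pcs[pc] / len(values_per_pc[pc])

# A PC redundant 6 times with only 2 distinct value numbers is hot.
for vn in [7, 7, 7, 9, 9, 9]:
    record_redundancy(0x4000, vn)
assert h_index(0x4000) == 3.0
```

Few distinct outputs and many redundant executions push the h-index up, flagging instructions that a reuse buffer or value predictor would profit from most.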
Algorithm 4.4 shows how hotspot identification is handled in dynamic value numbering. It takes as input an instruction marked as redundant and requires the instruction's program counter PC and the assigned value number VN. The algorithm uses two tables, one for redundant program counters and one for pairs of program counters and value numbers. Figure 28 illustrates the algorithm's operation over an arbitrary execution stream.
4.4 Dynamic Value Numbering and Unnecessary Spill
The dynamic value numbering algorithm is able to detect code spill, a behavior known to be unavoidable because of the limited number of registers available. An additional step can be added to the methodology to verify unnecessary spills. An unnecessary spill occurs when an unused register is available between the spill and the reload. The approach adapts the method proposed in (LARUS; CHANDRA, 1993) to operate with dynamic value numbering, determining, from the spills identified, how many could have been avoided. The algorithm operates on the three sets of instructions defined in the dynamic value numbering algorithm; Algorithm 4.5 is used to detect unnecessary spills. The field Access records when a register was last read or modified, and the field Spill records when
Algorithm 4.4 Algorithm to identify hotspots and generate the h-index.
Require: instruction, opcode (OP), assigned value number (VN)
 1: RP ← ∅
 2: RPV ← ∅
 3: Get redundant instruction opcode PC
 4: for all Instruction marked as redundant do
 5:   if PC ∈ RP then
 6:     Get counter i of PC
 7:     i ← i + 1
 8:     Update counter i for PC in RP
 9:     Set pair P ← 〈PC, VN〉
10:     if P ∈ RPV then
11:       Get counter j of P
12:       j ← j + 1
13:       Update counter j for P
14:     else
15:       Add P to RPV
16:     end if
17:   else
18:     Add PC to RP
19:     Set P ← 〈PC, VN〉
20:     Add P to RPV
21:   end if
22: end for
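The bookkeeping in Algorithm 4.4 can be transcribed directly into Python. The sketch below is illustrative only: the event-stream input format (an iterable of (PC, VN) pairs) and the final h-index computation are my own assumptions, not the thesis implementation.

```python
from collections import Counter, defaultdict

def identify_hotspots(redundant_events):
    """Count redundancy per PC and per <PC, VN> pair, then derive the
    h-index for each PC.

    `redundant_events` is an iterable of (pc, vn) tuples, one per
    instruction instance already marked as redundant (hypothetical
    input format).
    """
    rp = Counter()    # RP: redundancy count per program counter
    rpv = Counter()   # RPV: count per <PC, VN> pair
    for pc, vn in redundant_events:
        rp[pc] += 1
        rpv[(pc, vn)] += 1
    # h-index: total redundancy at a PC divided by the number of
    # unique value numbers observed at that PC.
    unique_vns = defaultdict(set)
    for (pc, vn) in rpv:
        unique_vns[pc].add(vn)
    return {pc: rp[pc] / len(unique_vns[pc]) for pc in rp}

# A PC that repeats a single value many times gets a high h-index;
# a PC that produces a different value each time stays at 1.0.
events = [(0x40, 7)] * 6 + [(0x80, 1), (0x80, 2), (0x80, 3)]
h = identify_hotspots(events)
```

Here the instruction at PC 0x40 is "hot" (six redundant instances, one distinct result), while PC 0x80 is not.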
it was spilled to the stack frame. When a spilled value is reloaded into register s, any register
r whose access time precedes its spill time has been unused since its spill and might have held
the value instead of forcing a spill. Register r may be live, in which case the allocator's choice
of register s is defensible. However, if the value in r is subsequently redefined before being
used (i.e., it is dead), the allocator spilled the wrong variable. The method aims to provide
support for dead store elimination (DSE). A representation of the algorithm's operation is
shown in Figure 29.
Fig. 28: Operation of the algorithm to identify hotspots.
Algorithm 4.5 Algorithm to detect unnecessary spill.
Require: instruction type TP, instruction counter PC
 1: if TP ∈ LOAD SET then
 2:   Get target register RT
 3:   Access[RT] ← instruction counter
 4:   if TP is marked as spill then
 5:     for all Register i do
 6:       if Access[i] < Spill[i] then
 7:         SpillCandidate[i] ← SpillCandidate[i] ∪ RT
 8:       end if
 9:     end for
10:   end if
11:   if SpillCandidate[RT] ≠ ∅ then
12:     Mark instruction as unnecessary spill
13:   end if
14:   SpillCandidate[RT] ← ∅
15: end if
16: if TP ∈ STORE SET then
17:   Get target register RT
18:   Spill[RT] ← instruction counter
19: end if
20: if TP ∈ ARITHMETIC SET then
21:   Get all registers REG involved
22:   for all REG do
23:     Access[REG] ← instruction counter
24:   end for
25:   if SpillCandidate[RT] ≠ ∅ then
26:     Mark instruction as unnecessary spill
27:   end if
28:   SpillCandidate[RT] ← ∅
29: end if
The identification of unnecessary spills as proposed in Algorithm 4.5 is, once again, related
to an upper bound. The algorithm flags as an unnecessary spill the situation in which a register
r is spilled at PC1 and, at a later point PC2 in the execution, another register is defined while r
goes unused between PC1 and PC2. This detection is strictly valid only if r is actually dead
along all paths after PC1. Another caveat is that many registers can be dead during the spill,
so the identification is useful mainly as a reference number for potential opportunities missed
by the compiler.
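The core test, reading the Access and Spill timestamps on each reload, can be sketched as follows. This is a simplified model under assumed conventions: the trace format (counter, kind, register, spill flag) and the decision to flag the reload itself are my own choices, and the per-register candidate sets of Algorithm 4.5 are collapsed into a single check.

```python
def detect_unnecessary_spills(trace):
    """Replay a simplified instruction trace and flag reloads for which
    some register sat idle since its own spill (so that spill was
    potentially avoidable).

    `trace` items are (counter, kind, reg, is_spill) tuples, where
    `kind` is 'load', 'store', or 'arith' (hypothetical format).
    """
    access = {}        # last counter at which each register was touched
    spill = {}         # counter at which each register was spilled
    unnecessary = []
    for counter, kind, reg, is_spill in trace:
        if kind == 'load':
            access[reg] = counter
            if is_spill:
                # Reload: any register not accessed since it was spilled
                # could have kept its value in place of a spill.
                idle = [r for r in spill if access.get(r, -1) < spill[r]]
                if idle:
                    unnecessary.append(counter)
        elif kind == 'store' and is_spill:
            spill[reg] = counter
        elif kind == 'arith':
            access[reg] = counter
    return unnecessary

# Hypothetical trace: r1 is spilled at counter 1, never touched again,
# and a reload happens at counter 3 -- the spill was avoidable.
trace = [(1, 'store', 'r1', True),
         (2, 'arith', 'r2', False),
         (3, 'load', 'r3', True)]
flagged = detect_unnecessary_spills(trace)
```

As in the thesis' caveat, the check is an upper bound: an "idle" register may still be live on another path, so flagged reloads are candidates, not proofs of a wrong allocator decision.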
Fig. 29: Representation of the algorithm to detect unnecessary spill.
4.5 Discussion

The Dynamic Value Numbering (DVN) methodology as designed yields the detection of
instruction reuse and value locality. An effectiveness analysis based on DVN returns a report
with redundancy-weighted program counters, which estimates the upper bound for optimization
and indicates the most profitable missed opportunities. The report can be useful in several
ways, as discussed below.
On the compiler side, the detection of reuse is profitable even when register promotion is
prevented (due to aliasing or lack of registers). When promotion is unsafe due to interfering
stores, the redundant load can be replaced with a data-speculative load, which works as a
register reference when the kill did not occur, but as a load when it did. When registers are not
available, instruction reuse information can be exploited using software cache control (BODÍK;
GUPTA; SOFFA, 1999). In addition, by directing which loaded values remain in the cache and
which bypass it, the compiler can improve on the suboptimal hardware cache replacement
strategy (BODÍK; GUPTA; SOFFA, 1999). Profile-guided optimization can also benefit, since the
methodology indicates hotspots by binding program counters to occurrences of redundancy.
Although these sequences are suboptimal for only one path, the use of representative input
sets ensures that an expected general behavior is evaluated. The information can be used to
audit, and potentially tune, the compiler in suboptimal situations, or it can feed a
hardware-based mechanism. In hardware support for redundancy removal, DVN can be used
as a mechanism to populate a reuse buffer. Another approach is related to value prediction:
value locality is measured through the hotspot identification, and the h-index can be exploited
as a measure of predictability when speculating on an instruction's expected value.
These ideas are explored in more detail in the next chapter through the implementation
of the Dynamic Value Numbering. The implementation of the algorithms is integrated with
a full-system simulator and used as a framework to evaluate optimization effectiveness and
hotspot identification over whole application executions.
Fig. 36: Redundancy evaluation scheme with validation and reuse support.
applications. The approach is to evaluate the effectiveness of the multiple levels of optimization
applied by a commercial compiler. The value locality analysis is presented, along with a
discussion of how this approach can be used in mechanisms to exploit optimization potential
that remains unexploited.
6 Case Study

The number of redundant instructions identified by applying the method can be used
as a reference point in measuring how effective the optimizations applied to a code were.
This chapter presents a case study that evaluates the GNU C/C++ compiler when optimizing
executable codes of some of the applications available in the SPEC CPU2006 benchmark
suite. The goal is to measure redundancies that can be detected using the proposed method,
but that are not detected by GCC. For the study, GCC version 4.3.2 was fully ported to the
instruction set used and integrated with the simulator's toolchain. Applications of SPECInt
2006 were obtained as source code and compiled with the ported compiler using the available
toolchain.
The data provided proves valuable in multiple ways. The redundancy report produced
by the framework yields a reference point to compare the multiple optimization levels
available in the compiler. The redundancy histogram paints a picture of hotspots, pointing
to sources of inefficiency. An example of how such hotspots can be used to correlate
suboptimal sequences of generated code with source-level constructs is discussed. The
example shows the benefits of using the approach for auditing problems in the compiler or
for improving the source code.
On the other hand, the study also explores the opportunities related to hardware support
for redundancy elimination. The hotness index (h-index), along with the value locality report,
indicates the most profitable opportunities for value prediction and reuse. In order to
investigate the potential of instruction reuse, the study first shows the number of instructions
that can be successfully skipped when all the results of identified redundant instructions can
be retrieved from the value number hash table, which is theoretically unlimited. Then, taking
a realistic approach, it discusses how much of this potential can be exploited when a limited
number of entries is allowed in the value number hash table. In this case, the Dynamic Value
Numbering algorithm is used to populate an instruction reuse buffer. The study shows
results for numbers of entries comparable to those seen in the literature, feasible in practice,
or already commercially implemented. The gains achieved, given in terms of instruction
count reduction, are shown and compared to other available schemes. The chapter also includes
a discussion on exploiting value locality for value prediction, on what the accuracy
level would be when predicting the top n most repeated values, and on how these numbers
compare to those seen in the literature.
6.1 Evaluation Target
6.1.1 GNU Compiler Collection (GCC)
The GNU Compiler Collection (GCC) started as a modest C compiler and has grown
over the last 20 years to become one of the most popular compilers available. The compiler
now supports several languages (C, C++, Objective-C, Objective-C++, Java, Ada, Fortran 95,
etc.) and about 30 different architectures (STALLMAN, 2009). GCC has vast support for code
optimization and implements several analyses. Since version 4, GCC makes extensive use of
SSA form, and most high-level optimizations are applied on SSA.
GCC's optimization level is set through the -O switch. Without any optimization, the
result is the fastest compile time, but absolutely no attempt is made to optimize the code.
Programs are larger and slower than their optimized versions. Debugging an unoptimized
compiled program is comparatively easy, since there is a straightforward relation between each
source statement and a block of generated code. When debugging, stopping the execution flow,
assigning values to any variable, or changing the content of the program counter register are
all possible. With optimization enabled, the compiler attempts to improve performance
at the expense of compilation time and, possibly, of the ability to debug the program.
The main optimization options are -O0, -O1, -O2, and -O3. No optimization is performed if
-O0 is specified or if no optimization option is given. The specific set of optimizations changes
from release to release and from language to language. Optimization levels for C are described
in Table 1.
Tab. 1: GCC optimization levels.
-O0  No optimization (the default); generates unoptimized code but has the fastest compilation time.
-O1  Compilation takes a little more time, and a lot more memory for large functions; debugging behaves as expected.
-O2  Full optimization; generates highly optimized code and has slower compilation time.
-O3  Full optimization as in -O2, yet slower; also uses more aggressive automatic inlining of subprograms within a unit (for interprocedural analysis) and attempts to vectorize loops.
Testing has been performed extensively at all optimization levels, and reported bugs are
uncommon. No correlation has been verified between reliability and optimization level:
some bugs are reported only when optimization is on, just as others are reported only when
no optimization is set.
6.1.2 SPEC CPU2006 Benchmark Suite
Standard Performance Evaluation Corporation (SPEC) benchmarks are widely used to
evaluate the performance of computer systems for the last few decades. The SPEC CPU
benchmarks are widely used in both industry and academia to evaluate CPU, memory and
compiler. SPEC announced on August, 2006 the SPEC CPU 2006 to replace CPU2000. The
new suite is much larger than the previous, and exercises aspects of CPUs, memory systems,
and compilers, especially C++ compilers. The suite has 7 C++ applications, including one
with half million lines of C++ code. Fortran and C are also well represented. SPEC CPU
6.1 Evaluation Target 98
benchmarks are derived from real life applications, rather than using artificial loop kernels or
synthetic benchmarks. Technical details regarding benchmark behavior and profiles have been
appearing in several publications, such as in (HENNING, 2006). A summary of the application
in SPEC CPU 2006 is shown in Table 2. Detailed description of each application used in this
study is shown in Appendix A.
Tab. 2: SPECInt 2006 application set.
400.perlbench (C) — Programming language: derived from Perl V5.8.7.
401.bzip2 (C) — Compression: Julian Seward's bzip2 version 1.0.3, modified to do most work in memory, rather than doing I/O.
403.gcc (C) — C compiler: based on gcc version 3.2; generates code for Opteron.
429.mcf (C) — Combinatorial optimization: vehicle scheduling; uses a network simplex algorithm (also used in commercial products) to schedule public transport.
445.gobmk (C) — Artificial intelligence (Go): plays the game of Go, a simply described but deeply complex game.
456.hmmer (C) — Gene sequence search: protein sequence analysis using profile hidden Markov models (profile HMMs).
458.sjeng (C) — Artificial intelligence (chess): a highly ranked chess program that also plays several chess variants.
462.libquantum (C) — Physics / quantum computing: simulates a quantum computer running Shor's polynomial-time factorization algorithm.
464.h264ref (C) — Video compression: a reference implementation of H.264/AVC; encodes a video stream using 2 parameter sets. The H.264/AVC standard is expected to replace MPEG2.
471.omnetpp (C++) — Discrete event simulation: uses the OMNeT++ discrete event simulator to model a large Ethernet campus network.
473.astar (C++) — Path-finding algorithms: pathfinding library for 2D maps, including the well-known A* algorithm.
483.xalancbmk (C++) — XML processing: a modified version of Xalan-C++, which transforms XML documents into other document types.
The current version of the benchmark suite, v1.1, released in June 2008, was obtained
as source code. The applications were compiled with the ported GCC (v4.3.2) compiler,
using versions of the configuration files provided with the suite, adapted to the simulator's
environment. Three executables were obtained for each application by varying the
optimization level among -O0, -O2, and -O3. The -O1 level is omitted, both because the
higher levels include the optimizations it deploys and for the sake of analyzing multiple
executables in reasonable time. This study focuses on the integer component of the
benchmark suite, since the fixed-point facility is the one supported in the framework
developed. Additionally, since the aim is an exhaustive evaluation of effectiveness and value
locality over whole application executions, this work also had to deal with subsetting and
input set reduction in order to make the intended analysis possible. The details of the
approach used follow.
6.1.2.1 Subsetting SPEC

SPEC CPU2006 has been frequently used in simulators for pre-silicon design analysis.
Partial use of benchmark suites by researchers is mainly related to simulation time
constraints, compiler difficulties, and library or system call issues. The run-time and memory
requirements of SPEC CPU benchmark programs have increased significantly to keep pace
with advances in technology. This means that, for studies that use cycle-accurate simulators,
it is virtually impossible to simulate all programs and input sets in a reasonable amount of
time. Structured and conscious subsetting of SPEC has been proposed as a way to obtain the
same or equivalent information from a smaller subset of representative programs.
The Euclidean distance between the benchmarks is used as a measure of dissimilarity
in (PHANSALKAR; JOSHI; JOHN, 2007). The single-linkage distance is computed to create a
dendrogram. The proposed dendrogram guides the selection of a subset when requirements
are imposed. If the simulation budget limits the set to just six benchmarks, then drawing a
vertical line at a linkage distance of 4, as shown in Figure 37 (extracted from (PHANSALKAR;
JOSHI; JOHN, 2007)), gives a subset of six benchmarks (k = 6) for SPECInt 2006. Drawing
a line at a point close to 4.5 yields a subset of four benchmarks (k = 4). Table 3 shows the
resulting subsets of the CINT2006 suite.
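The clustering step referenced above can be sketched in a few lines: compute Euclidean distances between benchmark feature vectors and agglomerate with single linkage until k clusters remain. The feature vectors below are made up for illustration; the real study clusters PCA-reduced performance-counter data.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(points, k):
    """Agglomerate until k clusters remain. Cluster-to-cluster distance
    is the minimum pairwise member distance (single linkage)."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]   # merge the two closest clusters
        del clusters[j]
    return clusters

# Hypothetical 2-D "benchmark characteristics": two tight groups,
# so cutting at k = 2 separates them cleanly.
pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
groups = single_linkage(pts, 2)
```

Choosing k corresponds to choosing where the vertical cut line is drawn on the dendrogram; a representative benchmark is then picked from each cluster.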
Fig. 37: Dendrogram showing similarity between CINT2006 programs.
Tab. 3: Representative subsets for SPECInt 2006.
Set of four programs: 400.perlbench, 462.libquantum, 473.astar, 483.xalancbmk.
Set of six programs: 400.perlbench, 471.omnetpp, 429.mcf, 462.libquantum, 473.astar, 483.xalancbmk.
For the evaluation, most of the SPECInt applications were successfully ported; the seven
applications in Table 4 are the ones used. From the six-application subset, the only one
missing is 483.xalancbmk, due to compilation difficulties.
6.1.2.2 Reduced Input Set

A SPEC CPU2006 application execution using the reference input data set represents a
few trillion instructions on average. This order of magnitude is above the number of
instructions that any known system simulator can handle within a reasonable time. In
addition, the total number of instructions in the execution stream of each application, which
translates into the size of the data structures handled by the dynamic value numbering
implementation, is a major constraint for the analysis. It has been shown, as in
(KLEINOSOWSKI; LILJA, 2002), how to make SPEC applications more suitable for system
simulators: the technique consists of using subsets of the reference input data sets in such a
way that the main application behavior and instruction distribution are preserved. These
ideas were applied to the input data sets of the SPECInt CPU2006 applications used in this
work, in order to make the analysis feasible. Table 4 shows the applications that were
successfully ported and the input sets used to obtain execution streams on the order of
hundreds of millions of instructions (see Appendix A for details on each application and its
input).
Fig. 42: Redundant arithmetic instructions per type (scalar, immediate, logical), normalized by instruction count of non-optimized code.
Tab. 8: Statistics of redundant arithmetic instruction detection, classified into non-redundant or redundant for each subset: scalar, immediate, and logical.
While optimization is not able to eliminate all the clusters (as would be expected from
the effectiveness results, which show that high redundancy rates are detected even in
optimized code), it does affect the shape of the clusters. Both height and width are lowered in
all applications, with a more accentuated difference between -O0 and either of the higher
levels. Cluster shapes for -O2 and -O3 are mostly unaltered. This shows that a redundant
construct at the executable code level is still present in every optimized code, but with fewer
instructions at higher optimization levels. The conclusion is that the optimizations deployed
are not able to eliminate the very core of redundancy: a group of few instructions that remain
responsible for producing the spikes in the value locality graph.
6.3.1.3 H-index

The h-index is plotted for each application and optimization level (in orange in all cases).
The area covered by the h-index graph indicates the expected number of instructions that
could be eliminated by reusing only one of the various contents each specific instruction
produces. An h-index that completely covers the redundancies of the corresponding program
counter indicates that a single value is seen in every occurrence of redundancy at that
program counter. The results vary per application. There are examples of highly profitable
clusters, as in benchmarks 400, 429, 458, and 462. In these cases, the main clusters (or the
only one, in the case of 458) are significantly covered by the h-index graph, indicating that
thousands of redundancy instances produce just a couple of different values. Other cases,
401, 445, and 475, show less profitable clusters, but considerable opportunities to exploit
single-value occurrences are still observed.
Redundancy clusters reveal opportunities for optimization. The numbers show that the
performance of all applications, measured in the total number of instructions executed, could
be significantly improved if the results of a few instructions were reused. This work presents
three approaches that can benefit from such knowledge: instruction reuse,
compiler/application auditing, and value prediction. The first approach is the deployment of
constrained dynamic value numbering as a mechanism to populate an instruction reuse buffer,
measuring the upper limit of redundancy elimination that can be exploited when the method
is used with a limited table. The second approach shows how the report relating redundancy
occurrences to program counters can be used to inspect the application's and the compiler's
work for missed optimization opportunities. Third, the h-index can be explored for value
prediction.
6.4 Instruction Reuse

As discussed in Section 3.4.1, instruction reuse is based on a method to populate a reuse
buffer and apply its information when a redundancy is detected. The operation of the Dynamic
Value Numbering methodology, especially considering the validation component, is closely
related to known mechanisms used in reuse buffers. However, for the effectiveness analysis
an unlimited value numbering table was allowed. This section presents the results of applying
the Dynamic Value Numbering algorithm as an instruction reuse method, and the exploitable
part of the theoretical upper limit found in the effectiveness analysis when a limited reuse
table is used.

The approach consists of limiting the total number of entries allowed in the value number
table and applying a replacement policy when the limit is reached. The implementation is the
one discussed in Section 5.3.1: the method is applied at the fetch stage and, upon redundancy
detection, the content of the target register is replaced and the execution entirely skipped.
Based on feasible reuse buffer sizes, three buffer sizes were selected: 256, 512, and 1024
entries.
The ability to successfully reuse an instruction based on constrained value numbering
depends on the number of non-redundant executed instructions between a first
(non-redundant) occurrence of an instruction (i.e., program counter) and its redundant
instances, and on the replacement policy.
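The constrained-table operation just described can be modeled as a small cache: look up a result by program counter and operand values, skip execution on a hit, and evict in FIFO order when the table is full. The class and function names below are mine, and this is a toy software model of the mechanism, not the simulator implementation.

```python
class ReuseBuffer:
    """Toy instruction-reuse buffer keyed by (PC, operand values)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}   # (pc, operands) -> cached result

    def lookup(self, pc, operands):
        return self.entries.get((pc, operands))

    def insert(self, pc, operands, result):
        if len(self.entries) >= self.capacity:
            # FIFO replacement: evict the oldest entry (dicts preserve
            # insertion order in Python 3.7+).
            self.entries.pop(next(iter(self.entries)))
        self.entries[(pc, operands)] = result

def execute(pc, operands, buf, compute):
    """Return (result, reused): skip `compute` on a buffer hit."""
    cached = buf.lookup(pc, operands)
    if cached is not None:
        return cached, True          # redundancy: execution skipped
    result = compute(*operands)
    buf.insert(pc, operands, result)
    return result, False

# Same PC and operands twice: the second execution is skipped.
buf = ReuseBuffer(2)
first = execute(0x10, (2, 3), buf, lambda a, b: a + b)
second = execute(0x10, (2, 3), buf, lambda a, b: a + b)
```

With a bounded capacity, a redundant instance is captured only if its entry has not been evicted since the first occurrence, which is exactly why the distance between occurrences and the replacement policy govern the achievable share of the upper limit.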
Statistics collected for the application set, for the different buffer sizes and a First In, First
Out (FIFO) replacement policy, are shown in Table 11 and graphically in Figure 51. The
proportion of the redundancy upper bound (unlimited table) captured with each size is also
presented.
Tab. 11: Redundant instruction (all sets) detection with constrained dynamic value numbering, showing the proportion of the amount detected compared to the upper limit.
Figure 52 and Table 12 show the number of redundant load instructions detected for each
buffer size compared to the upper limit; Figure 53 and Table 13 do the same for arithmetic
instructions.
6.4.1 Discussion

The statistics for redundancy detection under constrained operation of the dynamic value
numbering methodology also vary per application, but reveal that a significant share of the
redundancy upper limit can be realistically exploited. An average of ≈47% of the redundancy
upper limit, or ≈15% of total instructions, can be reused with a reuse buffer of 1024 entries.
The results also show that even with a 256-entry buffer, an average of ≈27% of the upper limit

Fig. 51: Redundant instruction (all sets) detection with constrained dynamic value numbering, showing the proportion of the amount detected compared to the upper limit.

Fig. 52: Redundant load instruction detection with constrained dynamic value numbering, showing the proportion of the amount detected compared to the upper limit.

Tab. 12: Redundant load instruction detection with constrained dynamic value numbering, showing the proportion of the amount detected compared to the upper limit.

Fig. 53: Redundant arithmetic instruction detection with constrained dynamic value numbering, showing the proportion of the amount detected compared to the upper limit.

Tab. 13: Redundant arithmetic instruction detection with constrained dynamic value numbering, showing the proportion of the amount detected compared to the upper limit.
behavior. Finally, case 475 shows a new instance of redundancy detected in strncpy in the C
standard library. The behavior is similar to case 401, again indicating an inefficiency that
goes beyond the compiler's scope.

The analysis of the examples shows that the methodology covers a wide range of situations.
Redundancies related to memory operations are observed as scalar ones. Sources of
inefficiency were detected both in the application's own constructs and in included and/or
linked libraries. These sources do not point to a fixable problem in the compiler or in the
application; rather, they bring valuable knowledge for understanding where inefficiencies
come from. The methodology also opens new possibilities for investigating inefficiencies
related to poorly written libraries.
The following section discusses the value of this knowledge for predicting the values used
by an instruction.
6.7 Value Prediction

The value locality clusters, along with the h-index, indicate instructions that can be
speculatively executed with the aid of highly frequent known values. The advantage of doing
so depends on the ability to predict an instruction's result. The h-index of an instruction is
the ratio between its total redundancy count and the number of unique value numbers it
generates. This ratio points to the expected number of instructions that could be avoided if
these unique results were used whenever they are available before an instruction executes.
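For a single-value predictor, the best achievable accuracy at a given program counter is simply the frequency of its most common result over all of its redundant instances. A sketch, with an assumed input format (a map from PC to the list of observed results):

```python
from collections import Counter

def best_single_value_accuracy(results_per_pc):
    """For each PC, the accuracy of always predicting its most frequent
    observed result: count(mode) / total occurrences."""
    accuracy = {}
    for pc, results in results_per_pc.items():
        counts = Counter(results)
        most_common_count = counts.most_common(1)[0][1]
        accuracy[pc] = most_common_count / len(results)
    return accuracy

# A PC yielding 7 on four of five instances is 80% predictable with a
# single-value predictor; one alternating between two values is 50%.
acc = best_single_value_accuracy({0x40: [7, 7, 7, 7, 9],
                                  0x80: [1, 2]})
```

This makes concrete why an instruction with at least two distinct value numbers caps single-value prediction accuracy at 50% when the values are equally frequent, as reported for the top-128 instructions in the study.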
The h-index can be used to evaluate the profitability of speculative value prediction. In
this case, it is used to identify values to be kept in a value prediction table. A common way to
evaluate value prediction is through prediction accuracy. Figure 57 shows, for the top-128
most frequently redundant instructions of optimized (-O3) codes, the average prediction
accuracy when only one of the known values is used in speculation.

The 128 program counters depicted in each subfigure are ordered from the most to the
least redundant. Table 15 gives details on the prediction accuracy: the share of the top-128
redundant instructions in the total redundancy count; the average, median, and standard
deviation of the prediction accuracy; and the total number of redundant instructions that
would be successfully predicted if one of the known results of the top-128 most redundant
instructions were used speculatively. The table shows that a relatively small number (128 in
this analysis) of static instructions (≈3.5% of all the program counters that execute
redundantly at least once) is responsible for more than 50% of redundant occurrences in
cases 400, 429, 458, and 462. A significant share is also observed for the other applications.
This information is in line with the value locality graphs obtained. The prediction accuracy
analysis shows that accuracy is limited to 50%, since at least two value numbers (different
results) are found for each of the top-128, with an average of 20%. Taking into account the
share in total redundancy, an average of 14% of
Fig. 57: Prediction accuracy for top-128 most redundant instructions.
6.8 Discussion on the Replicability of the Study
Tab. 15: Prediction accuracy for the top-128 most redundant instructions.
Columns: % of total redundant static instructions; % of total redundancy count; prediction accuracy (average, median, standard deviation); % prediction rate over all redundancy.
provements (e.g., reducing the instruction count by 0.5% may yield a 10% performance
improvement).
The identified hotspots allowed auditing the executables produced by GCC. The analysis
pointed out no clear error or omission by GCC. Rather, it shows that, in some cases, linked
libraries are the source of redundancies, or that one is dealing with a scenario where inherent
limitations leave an otherwise obvious optimization unrealized. The results are valuable, for
they identify the constructs ultimately responsible for redundancies and motivate new studies
on how these instances could be prioritized in a scheme that degrades the performance of some
selected parts in exchange for eliminating considerably redundant clusters. A comprehensive
cost analysis would be necessary in this case.
The methodology was also used to investigate using the identified hotspots to predict
specific instruction results. The approach was mainly motivated by the value of the h-index,
which shows that most of the redundant instructions produce a very limited number of known
results. The obtained results show limited benefit in predicting the top-128 most redundant
instructions, but indicate that a considerable number of redundancies can be predicted with
high accuracy (78% on average). The mechanism to exploit these opportunities is tightly
bound to the ability to pass the compiler a profile that indicates, based on the predicted
accuracy, the most profitable instructions to speculate on.
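Such a profile could, for instance, rank program counters by the product of their redundancy share and predicted accuracy. The selection rule, threshold, and field layout below are assumptions for illustration, not the thesis' mechanism:

```python
def select_for_speculation(profile, min_accuracy=0.5):
    """profile: {pc: (share_of_total_redundancy, predicted_accuracy)}.
    Return the PCs worth speculating on, ranked by expected payoff
    (share * accuracy), keeping only sufficiently accurate candidates."""
    ranked = sorted(
        ((share * acc, pc) for pc, (share, acc) in profile.items()
         if acc >= min_accuracy),
        reverse=True)
    return [pc for _, pc in ranked]

# hypothetical profile for three program counters
profile = {
    0x400a10: (0.30, 0.9),   # large share, accurate: good candidate
    0x400b20: (0.20, 0.4),   # too inaccurate: filtered out
    0x400c30: (0.10, 0.8),   # smaller share, still worthwhile
}
picks = select_for_speculation(profile)
```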
The following section summarizes the author's publications that contain the main contributions of this work and outlines future work related to each of them.
7.1 Summary of contributions and Future Work
The development of a first version of the dynamic value numbering algorithm and the
method's implementation, along with the results of the effectiveness evaluation for the selected
SPECInt applications, were presented by the author in (COSTA et al., 2012). The paper focused
on exposing the amount of unexploited opportunities as a limit study. As future work, the
author envisions porting and evaluating new benchmarks, including the floating-point
component of the SPEC CPU2006 benchmark suite. The effectiveness evaluation is also
suitable for comparing different optimized executables produced by the same compiler.
Unfortunately, the evaluation did not consider integration with the profile-guided optimization
tool available in GCC, a limitation due to the toolchain available in the simulator. However,
integrating the implementation with a binary instrumentation tool (Intel's Pin, for instance)
would allow studying the ability of GCC's profiling to eliminate the remaining redundancies.
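The core idea of dynamic value numbering can be sketched as follows. This is a minimal toy model, not the thesis' simulator-based implementation: value numbers name known results, and a dynamic instruction whose (opcode, operand value numbers) key was seen before is counted as redundant.

```python
class DynamicValueNumbering:
    """Toy dynamic value numbering over an instruction stream."""

    def __init__(self):
        self.table = {}     # (opcode, operand VNs) -> result value number
        self.reg_vn = {}    # register name -> current value number
        self.next_vn = 0
        self.redundant = 0  # dynamic instances with an already-known result

    def _fresh(self):
        self.next_vn += 1
        return self.next_vn

    def _value_of(self, reg):
        # unseen registers get a fresh value number on first use
        if reg not in self.reg_vn:
            self.reg_vn[reg] = self._fresh()
        return self.reg_vn[reg]

    def execute(self, opcode, srcs, dst):
        key = (opcode, tuple(self._value_of(r) for r in srcs))
        if key in self.table:
            self.redundant += 1            # result is already known
        else:
            self.table[key] = self._fresh()
        self.reg_vn[dst] = self.table[key]

dvn = DynamicValueNumbering()
dvn.execute("add", ["r1", "r2"], "r3")
dvn.execute("add", ["r1", "r2"], "r4")   # same opcode and inputs: redundant
```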
A second version of the dynamic value numbering algorithm, including the ability to
produce a value locality report and improved support for detecting redundant memory
operations (unnecessary spills, fully-redundant operations), is presented by the author in
(COSTA et al., a). The article presents the results of redundancy cluster identification and the
approaches to exploit value locality. It introduces the instruction reuse scheme based on
dynamic value numbering and discusses compiler auditing and support for value prediction.
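In this spirit, a fully-redundant load is one that fetches a value the analysis already associates with its address. A minimal sketch over a toy trace follows; the tuple format and function name are assumptions for illustration, not the article's implementation:

```python
def find_redundant_loads(trace):
    """trace: list of ("store", addr, value) / ("load", addr, value) events.
    A load is flagged as fully redundant when the address is already
    known to hold exactly that value."""
    known = {}         # addr -> last value stored or loaded there
    redundant = []     # indices of redundant loads in the trace
    for i, (op, addr, value) in enumerate(trace):
        if op == "store":
            known[addr] = value
        elif op == "load":
            if known.get(addr) == value:
                redundant.append(i)   # value already known: reusable
            known[addr] = value
    return redundant

trace = [("store", 0x10, 5), ("load", 0x10, 5), ("load", 0x10, 5)]
idxs = find_redundant_loads(trace)
```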
An accurate evaluation of the performance impact of eliminating the redundancies found
with the presented methodology is envisioned as future work. In the work in progress
(COSTA et al., b), this thesis' author develops a cycle-accurate analysis to identify the
performance benefits of avoiding redundant instructions through speculative execution. The
model involves a realistic evaluation of the latency of such a mechanism. The work is evolving
to include support for floating-point instructions, an extension of the implementation presented
in this thesis, and compares the gains obtained with those found in the literature.
As future work, the author also envisions extending the value prediction approach based
on dynamic value numbering to include the relative frequency of unique value numbers. In
this case, the total number of instances of a given instruction would be confronted with the
number of redundancies and with the frequency of each unique result. The study builds on the
value locality analysis and on compiler auditing that evaluates how compile-time
transformations could avoid the redundancies found at run time. The analysis would start from
a study of the ideal number of registers needed to eliminate the redundancies found, and of
how modifications to register allocation would enable further optimizations. Finally, the study
of dynamic languages and of instructions executed, for instance, by a virtual machine is also
seen as a future extension. In this case, the binary instrumentation approach would be more
suitable, in order to avoid the difficulties of porting a virtual machine to a simulation
environment. In any case, integrating the implementation described in this thesis would allow
investigating the effectiveness of the final executed code.
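Confronting the redundancy count with the frequency of each unique result yields, among other things, an upper bound on single-value prediction accuracy: the relative frequency of the most common result. A small sketch of that bound, with an illustrative result stream:

```python
from collections import Counter

def best_single_value_hit_rate(results):
    """Upper bound on the accuracy of speculating with one stored value:
    the relative frequency of the most frequent unique result."""
    counts = Counter(results)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(results)

# toy stream: the most frequent result (4) covers half the instances
rate = best_single_value_hit_rate([4, 4, 4, 7, 7, 9])
```

A frequency-aware predictor would prioritize instructions where this bound, weighted by the instruction's share of total redundancy, is highest.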
References
ADVE, V. The next generation of compilers. In: Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization. Washington, DC, USA: IEEE Computer Society, 2009. (CGO '09). ISBN 978-0-7695-3576-0.
ALLEN, F. E. The History of Language Processor Technology in IBM. IBM Journal of Research and Development, v. 25, n. 5, p. 535–548, Sept. 1981. ISSN 0018-8646.
ALPERN, B. et al. The Jikes research virtual machine project: building an open-source research community. IBM Syst. J., IBM Corp., Riverton, NJ, USA, v. 44, n. 2, p. 399–417, Jan. 2005. ISSN 0018-8670.
ALPERN, B.; WEGMAN, M. N.; ZADECK, F. K. Detecting equality of variables in programs. In: Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages. New York, NY, USA: ACM, 1988. (POPL '88), p. 1–11. ISBN 0-89791-252-7.
AUSLANDER, M.; HOPKINS, M. An overview of the PL.8 compiler. SIGPLAN Not., ACM, New York, NY, USA, v. 39, p. 38–48, April 2004. ISSN 0362-1340.
AUSTIN, T.; LARSON, E.; ERNST, D. SimpleScalar: an infrastructure for computer system modeling. Computer, v. 35, n. 2, p. 59–67, Feb. 2002. ISSN 0018-9162.
BACHEGA, L. R. et al. The BlueGene/L pseudo cycle-accurate simulator. In: Proceedings of the 2004 IEEE International Symposium on Performance Analysis of Systems and Software. Washington, DC, USA: IEEE Computer Society, 2004. (ISPASS '04), p. 36–44. ISBN 0-7803-8385-0. Available at: <http://dl.acm.org/citation.cfm?id=1153925.1154586>.
BODÍK, R.; ANIK, S. Path-sensitive value-flow analysis. In: Proceedings of the 25th ACM SIGPLAN-SIGACT symposium on Principles of programming languages. New York, NY, USA: ACM, 1998. (POPL '98), p. 237–251. ISBN 0-89791-979-3.
BODÍK, R.; GUPTA, R.; SOFFA, M. L. Load-reuse analysis: design and evaluation. In: Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation. New York, NY, USA: ACM, 1999. (PLDI '99), p. 64–76. ISBN 1-58113-094-5.
BODÍK, R.; GUPTA, R.; SOFFA, M. L. Complete removal of redundant expressions. SIGPLAN Not., ACM, New York, NY, USA, v. 39, n. 4, p. 596–611, Apr. 2004. ISSN 0362-1340.
BRIGGS, P.; COOPER, K. D.; SIMPSON, L. T. Value numbering. Softw. Pract. Exper., John Wiley & Sons, Inc., New York, NY, USA, v. 27, p. 701–724, June 1997. ISSN 0038-0644.
BRIGGS, P.; COOPER, K. D.; TORCZON, L. Improvements to graph coloring register allocation. ACM Transactions on Programming Languages and Systems, v. 16, p. 428–455, 1994.
BRUENING, D.; GARNETT, T.; AMARASINGHE, S. An infrastructure for adaptive dynamic optimization. In: Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization. Washington, DC, USA: IEEE Computer Society, 2003. (CGO '03), p. 265–275. ISBN 0-7695-1913-X. Available at: <http://dl.acm.org/citation.cfm?id=776261.776290>.
CALDER, B.; FELLER, P.; EUSTACE, A. Value profiling. In: Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture. Washington, DC, USA: IEEE Computer Society, 1997. (MICRO 30), p. 259–269. ISBN 0-8186-7977-8. Available at: <http://dl.acm.org/citation.cfm?id=266800.266825>.
CEZE, L. et al. Full Circle: Simulating Linux clusters on Linux clusters. In: Proceedings of the Fourth LCI International Conference on Linux Clusters: The HPC Revolution 2003. [S.l.: s.n.], 2003.
CHAITIN, G. Register allocation and spilling via graph coloring. SIGPLAN Not., ACM, New York, NY, USA, v. 39, p. 66–74, April 2004. ISSN 0362-1340.
CHILIMBI, T. M. Efficient representations and abstractions for quantifying and exploiting data reference locality. In: Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation. New York, NY, USA: ACM, 2001. (PLDI '01), p. 191–202. ISBN 1-58113-414-2. Available at: <http://doi.acm.org/10.1145/378795.378840>.
CITRON, D.; FEITELSON, D. G. The Organization of Lookup Tables in Instruction Memoization. Hebrew University of Jerusalem (Technical Report), 2000. Available at: <leibniz.cs.huji.ac.il/tr/306.ps>.
CLICK, C. Global code motion/global value numbering. In: Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation. New York, NY, USA: ACM, 1995. (PLDI '95), p. 246–257. ISBN 0-89791-697-2. Available at: <http://doi.acm.org/10.1145/207110.207154>.
COCKE, J. Programming languages and their compilers: Preliminary notes. [S.l.]: Courant Institute of Mathematical Sciences, New York University, 1969. ISBN B0007F4UOA.
COOPER, K.; ECKHARDT, J.; KENNEDY, K. Redundancy elimination revisited. In: Proceedings of the 17th international conference on Parallel architectures and compilation techniques. New York, NY, USA: ACM, 2008. (PACT '08), p. 12–21. ISBN 978-1-60558-282-5.
COOPER, K. D.; LU, J. Register promotion in C programs. In: Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation. New York, NY, USA: ACM, 1997. (PLDI '97), p. 308–319. ISBN 0-89791-907-6. Available at: <http://doi.acm.org/10.1145/258915.258943>.
COOPER, K. D.; XU, L. An efficient static analysis algorithm to detect redundant memory operations. SIGPLAN Not., ACM, New York, NY, USA, v. 38, p. 97–107, June 2002. ISSN 0362-1340.
COOPER, K. D.; XU, L. Memory redundancy elimination to improve application energy efficiency. In: Proceedings of the 16th Int'l Workshop on Languages and Compilers for Parallel Computing (LCPC'03). [S.l.: s.n.], 2003. p. 288–305.
COSTA, C. H. A. et al. Exploiting Value Locality with a Dynamic Methodology for Optimization Effectiveness Evaluation. [Submission].
COSTA, C. H. A. et al. Performance Improvement with Instruction Reuse based on Dynamic Value Numbering. [In production].
COSTA, C. H. A. et al. Dynamic Method to Evaluate Code Optimization Effectiveness. In: Proceedings of the 15th International Workshop on Software and Compilers for Embedded Systems. New York, NY, USA: ACM, 2012. (SCOPES '12), p. 62–71. ISBN 978-1-4503-1336-0. Available at: <http://doi.acm.org/10.1145/2236576.2236583>.
CYTRON, R. et al. Efficiently computing static single assignment form and the control dependence graph. ACM Trans. Program. Lang. Syst., ACM, New York, NY, USA, v. 13, n. 4, p. 451–490, Oct. 1991. ISSN 0164-0925. Available at: <http://doi.acm.org/10.1145/115372.115320>.
DIWAN, A.; MCKINLEY, K. S.; MOSS, J. E. B. Type-based alias analysis. In: Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation. New York, NY, USA: ACM, 1998. (PLDI '98), p. 106–117. ISBN 0-89791-987-4. Available at: <http://doi.acm.org/10.1145/277650.277670>.
FRANKLIN, M. The Multiscalar Architecture. PhD thesis — University of Wisconsin, Madison, WI, USA, 1993.
GABBAY, F.; MENDELSON, A. Can program profiling support value prediction? In: Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture. Washington, DC, USA: IEEE Computer Society, 1997. (MICRO 30), p. 270–280. ISBN 0-8186-7977-8. Available at: <http://dl.acm.org/citation.cfm?id=266800.266826>.
GELLERT, A.; FLOREA, A.; VINTAN, L. Exploiting selective instruction reuse and value prediction in a superscalar architecture. J. Syst. Archit., Elsevier North-Holland, Inc., New York, NY, USA, v. 55, n. 3, p. 188–195, Mar. 2009. ISSN 1383-7621. Available at: <http://dx.doi.org/10.1016/j.sysarc.2008.11.002>.
GHANDOUR, W. J.; AKKARY, H.; MASRI, W. Leveraging strength-based dynamic information flow analysis to enhance data value prediction. ACM Trans. Archit. Code Optim., ACM, New York, NY, USA, v. 9, n. 1, p. 1:1–1:33, Mar. 2012. ISSN 1544-3566. Available at: <http://doi.acm.org/10.1145/2133382.2133383>.
GOLANDER, A.; WEISS, S. Reexecution and Selective Reuse in Checkpoint Processors. In: STENSTROM, P. (Ed.). Transactions on High-Performance Embedded Architectures and Compilers II. Berlin, Heidelberg: Springer-Verlag, 2009. p. 242–268. ISBN 978-3-642-00903-7.
GOUGH, B. J. An Introduction to GCC. Bristol, United Kingdom: Network Theory Limited, 2005. ISBN 978-0-9541617-9-8.
GULWANI, S.; NECULA, G. C. A polynomial-time algorithm for global value numbering. Sci. Comput. Program., Elsevier North-Holland, Inc., Amsterdam, The Netherlands, v. 64, n. 1, p. 97–114, 2007. ISSN 0167-6423. Available at: <http://dx.doi.org/10.1016/j.scico.2006.03.005>.
GUPTA, R.; BERSON, D. A.; FANG, J. Z. Path profile guided partial redundancy elimination using speculation. In: Proceedings of the 1998 International Conference on Computer Languages. Washington, DC, USA: IEEE Computer Society, 1998. p. 230–. ISBN 0-8186-8454-2.
HACK, S.; GRUND, D.; GOOS, G. Register allocation for programs in SSA-Form. In: Proceedings of the 15th international conference on Compiler Construction. Berlin, Heidelberg: Springer-Verlag, 2006. (CC'06), p. 247–262. ISBN 3-540-33050-X, 978-3-540-33050-9. Available at: <http://dx.doi.org/10.1007/11688839_20>.
HARING, R. et al. The IBM Blue Gene/Q Compute Chip. Micro, IEEE, v. 32, n. 2, p. 48–60, March-April 2012. ISSN 0272-1732.
HENNING, J. L. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News, ACM, New York, NY, USA, v. 34, p. 1–17, 2006. ISSN 0163-5964.
HERROD, S. A. Using Complete Machine Simulation to Understand Computer System Behavior. Stanford, CA, USA: Stanford University (Technical Report), 1998. Available at: <ftp://reports.stanford.edu/pub/cstr/reports/cs/tr/98/1603/CS-TR-98-1603.pdf>.
HIRSCHBERG, D. S.; LELEWER, D. A. Efficient decoding of prefix codes. Communications of the ACM, v. 33, p. 449–459, 1990.
IBM. Power ISA Version 2.06 Revision B. 2010. Available at: <https://www.power.org/resources/downloads/PowerISA_V2.06B_V2_PUBLIC.pdf>.
JENKINS, R. A hash function for hash table lookup. 2006. Available at: <http://burtleburtle.net/bob/hash/doobs.html>.
KENNEDY, K.; ALLEN, J. R. Optimizing compilers for modern architectures: a dependence-based approach. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2002. ISBN 1-55860-286-0.
KHEDKER, U.; SANYAL, A.; KARKARE, B. Data Flow Analysis: Theory and Practice. 1st. ed. Boca Raton, FL, USA: CRC Press, Inc., 2009. ISBN 0849328802, 9780849328800.
KLEINOSOWSKI, A. J.; LILJA, D. J. MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research. IEEE Computer Architecture Letters, v. 1, p. 7–7, 2002.
KNUTH, D. E. The art of computer programming, volume 3: (2nd ed.) sorting and searching. Redwood City, CA, USA: Addison Wesley Longman Publishing Co., Inc., 1998. ISBN 0-201-89685-0.
KOES, D.; GOLDSTEIN, S. C. An Analysis of Graph Coloring Register Allocation. Pittsburgh, PA, USA: Carnegie Mellon University (Technical Report), 2006. Available at: <http://www.cs.cmu.edu/ seth/papers/koes-tr06.pdf>.
KOTZMANN, T. et al. Design of the Java HotSpotTM client compiler for Java 6. ACM Trans. Archit. Code Optim., ACM, New York, NY, USA, v. 5, n. 1, p. 7:1–7:32, May 2008. ISSN 1544-3566.
LARUS, J. R. Whole program paths. SIGPLAN Not., ACM, New York, NY, USA, v. 34, n. 5, p. 259–269, 1999. ISSN 0362-1340. Available at: <http://doi.acm.org/10.1145/301631.301678>.
LARUS, J. R.; CHANDRA, S. Using Tracing and Dynamic Slicing to Tune Compilers. Madison, WI, USA: University of Wisconsin (Technical Report), 1993. Available at: <http://128.105.2.28/pub/techreports/1993/TR1174.pdf>.
LATTNER, C.; ADVE, V. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In: Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization. Washington, DC, USA: IEEE Computer Society, 2004. (CGO '04), p. 75–. ISBN 0-7695-2102-9.
LEE, K.; BENAISSA, Z.; RODRIGUEZ, J. A dynamic tool for finding redundant computations in native code. In: Proceedings of the 2008 international workshop on dynamic analysis: held in conjunction with the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2008). New York, NY, USA: ACM, 2008. (WODA '08), p. 15–21. ISBN 978-1-60558-054-8. Available at: <http://doi.acm.org/10.1145/1401827.1401831>.
LEPAK, K. M.; LIPASTI, M. H. On the value locality of store instructions. In: Proceedings of the 27th annual international symposium on Computer architecture. New York, NY, USA: ACM, 2000. (ISCA '00), p. 182–191. ISBN 1-58113-232-8. Available at: <http://doi.acm.org/10.1145/339647.339678>.
LIN, J. et al. Speculative register promotion using advanced load address table (ALAT). In: Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization. Washington, DC, USA: IEEE Computer Society, 2003. (CGO '03), p. 125–134. ISBN 0-7695-1913-X.
LIPASTI, M. H.; WILKERSON, C. B.; SHEN, J. P. Value locality and load value prediction. In: Proceedings of the seventh international conference on Architectural support for programming languages and operating systems. New York, NY, USA: ACM, 1996. (ASPLOS-VII), p. 138–147. ISBN 0-89791-767-7.
LIU, Y. et al. An online profile guided optimization approach for speculative parallel threading. In: Advances in Computer Systems Architecture. [S.l.]: Springer Berlin / Heidelberg, 2007. (Lecture Notes in Computer Science, v. 4697), p. 28–39. ISBN 978-3-540-74308-8.
LO, R. et al. Register promotion by sparse partial redundancy elimination of loads and stores. SIGPLAN Not., ACM, New York, NY, USA, v. 33, p. 26–37, May 1998. ISSN 0362-1340.
LOBEL, A. Vehicle scheduling in public transit and Lagrangean pricing. Management Sci., v. 44, p. 1637–1649, 1998.
MAGNUSSON, P. S. et al. Simics: A full system simulation platform. Computer, IEEE Computer Society Press, Los Alamitos, CA, USA, v. 35, n. 2, p. 50–58, 2002. ISSN 0018-9162. Available at: <http://dx.doi.org/10.1109/2.982916>.
MARKOFF, J. The iPad in Your Hand: As Fast as a Supercomputer of Yore. New York Times, 2011. Available at: <http://bits.blogs.nytimes.com/2011/05/09/the-ipad-in-your-hand-as-fast-as-a-supercomputer-of-yore/>.
MCKENZIE, B. Modern compiler implementation in ML: Basic techniques by Andrew W. Appel, Cambridge University Press, 1997, ISBN 0521587751. J. Funct. Program., Cambridge University Press, New York, NY, USA, v. 9, p. 105–111, January 1999. ISSN 0956-7968.
MOLINA, C.; GONZALEZ, A.; TUBELLA, J. Dynamic removal of redundant computations. In: Proceedings of the 13th international conference on Supercomputing. New York, NY, USA: ACM, 1999. (ICS '99), p. 474–481. ISBN 1-58113-164-X. Available at: <http://doi.acm.org/10.1145/305138.305239>.
MOREIRA, J. et al. Designing a highly-scalable operating system: the Blue Gene/L story. In: Proceedings of the 2006 ACM/IEEE conference on Supercomputing. New York, NY, USA: ACM, 2006. (SC '06). ISBN 0-7695-2700-0. Available at: <http://doi.acm.org/10.1145/1188455.1188578>.
MOREL, E.; RENVOISE, C. Global optimization by suppression of partial redundancies. Commun. ACM, ACM, New York, NY, USA, v. 22, n. 2, p. 96–103, Feb. 1979. ISSN 0001-0782. Available at: <http://doi.acm.org/10.1145/359060.359069>.
MUCHNICK, S. S. Advanced compiler design and implementation. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1997. ISBN 1-55860-320-4.
NAKRA, T.; GUPTA, R.; SOFFA, M. L. Global context-based value prediction. In: Proceedings of the 5th International Symposium on High Performance Computer Architecture. Washington, DC, USA: IEEE Computer Society, 1999. (HPCA '99), p. 4–. ISBN 0-7695-0004-8. Available at: <http://dl.acm.org/citation.cfm?id=520549.822759>.
OTTENSTEIN, K. J.; BALLANCE, R. A.; MACCABE, A. B. The program dependence web: a representation supporting control-, data-, and demand-driven interpretation of imperative languages. SIGPLAN Not., ACM, New York, NY, USA, v. 25, p. 257–271, June 1990. ISSN 0362-1340.
PETERSON, J. L. et al. Application of full-system simulation in exploratory system design and development. IBM Journal of Research and Development, v. 50, n. 2.3, p. 321–332, March 2006. ISSN 0018-8646.
PHANSALKAR, A.; JOSHI, A.; JOHN, L. K. Analysis of redundancy and application balance in the SPEC CPU2006 benchmark suite. In: Proceedings of the 34th annual international symposium on Computer architecture. New York, NY, USA: ACM, 2007. (ISCA '07), p. 412–423. ISBN 978-1-59593-706-3. Available at: <http://doi.acm.org/10.1145/1250662.1250713>.
RAMALINGAM, G. The undecidability of aliasing. ACM Trans. Program. Lang. Syst., ACM, New York, NY, USA, v. 16, p. 1467–1471, September 1994. ISSN 0164-0925.
REIF, J. H.; LEWIS, H. R. Efficient symbolic analysis of programs. J. Comput. Syst. Sci., Academic Press, Inc., Orlando, FL, USA, v. 32, p. 280–314, June 1986. ISSN 0022-0000.
REINMAN, G. et al. Profile Guided Load Marking for Memory Renaming. San Diego, CA, USA: University of California San Diego (Technical Report), 1998. Available at: <http://www.cse.hcmut.edu.vn/ anhvu/teaching/2008/ACA/articles/29.pdf>.
RYCHLIK, B. et al. Efficacy and performance impact of value prediction. In: Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques. Washington, DC, USA: IEEE Computer Society, 1998. (PACT '98), p. 148–. ISBN 0-8186-8591-3. Available at: <http://dl.acm.org/citation.cfm?id=522344.825671>.
SAATY, T. L.; KAINEN, P. C. The four-color problem: assaults and conquest. New York: McGraw-Hill International Book Co., 1977. ix, 217 p. ISBN 0070543828.
SAZEIDES, Y.; SMITH, J. E. Implementations of Context Based Value Predictors. Madison, WI, USA: University of Wisconsin (Technical Report), 1997. Available at: <http://www.lems.brown.edu/ iris/en291s9-04/papers/Context-value-pred.pdf>.
SAZEIDES, Y.; SMITH, J. E. The predictability of data values. In: Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture. Washington, DC, USA: IEEE Computer Society, 1997. (MICRO 30), p. 248–258. ISBN 0-8186-7977-8. Available at: <http://dl.acm.org/citation.cfm?id=266800.266824>.
SHAFI, H. et al. Design and validation of a performance and power simulator for PowerPC systems. IBM J. Res. Dev., IBM Corp., Riverton, NJ, USA, v. 47, p. 641–651, September 2003. ISSN 0018-8646. Available at: <http://dx.doi.org/10.1147/rd.475.0641>.
SHOR, P. W. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM J. Comput., Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, v. 26, n. 5, p. 1484–1509, Oct. 1997. ISSN 0097-5397. Available at: <http://dx.doi.org/10.1137/S0097539795293172>.
SIMPSON, L. T. Value-driven redundancy elimination. PhD thesis — Rice University, Houston, TX, USA, 1996.
SODANI, A.; SOHI, G. S. Dynamic instruction reuse. In: Proceedings of the 24th annual international symposium on Computer architecture. New York, NY, USA: ACM, 1997. (ISCA '97), p. 194–205. ISBN 0-89791-901-7.
SODANI, A.; SOHI, G. S. An empirical analysis of instruction repetition. In: Proceedings of the eighth international conference on Architectural support for programming languages and operating systems. New York, NY, USA: ACM, 1998. (ASPLOS-VIII), p. 35–45. ISBN 1-58113-107-0. Available at: <http://doi.acm.org/10.1145/291069.291016>.
SODANI, A.; SOHI, G. S. Understanding the differences between value prediction and instruction reuse. In: Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture. Los Alamitos, CA, USA: IEEE Computer Society Press, 1998. (MICRO 31), p. 205–215. ISBN 1-58113-016-3. Available at: <http://dl.acm.org/citation.cfm?id=290940.290983>.
SRIKANT, Y. N.; SHANKAR, P. The Compiler Design Handbook: Optimizations and Machine Code Generation, Second Edition. 2nd. ed. Boca Raton, FL, USA: CRC Press, Inc., 2007. ISBN 142004382X, 9781420043822.
STALLMAN, R. M. Using The Gnu Compiler Collection: A GNU Manual For GCC Version 4.3.3. Paramount, CA: CreateSpace, 2009. ISBN 144141276X, 9781441412768.
SURENDRA, G.; BANERJEE, S.; NANDY, S. K. Instruction reuse in SPEC, media and packet processing benchmarks. J. Embedded Comput., IOS Press, Amsterdam, The Netherlands, v. 2, n. 1, p. 15–34, Jan. 2006. ISSN 1740-4460. Available at: <http://dl.acm.org/citation.cfm?id=1370986.1370989>.
THOMAS, R. et al. Improving branch prediction by dynamic dataflow-based identification of correlated branches from a large global history. In: Proceedings of the 30th annual international symposium on Computer architecture. New York, NY, USA: ACM, 2003. (ISCA '03), p. 314–323. ISBN 0-7695-1945-8. Available at: <http://doi.acm.org/10.1145/859618.859655>.
TORCZON, L.; COOPER, K. Engineering A Compiler. 2nd. ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011. ISBN 012088478X.
TSENG, H.-W.; TULLSEN, D. M. Data-triggered threads: Eliminating redundant computation. High-Performance Computer Architecture, International Symposium on, IEEE Computer Society, Los Alamitos, CA, USA, p. 181–192, 2011.
USA. Designing a Digital Future: Federally Funded Research and Development in Networking and Information Technology. President's Council of Advisors on Science and Technology, Office of Science and Technology Policy (PCAST), 2010. Available at: <http://www.whitehouse.gov/sites/default/files/microsites/ostp/pcast-nitrd-report-2010.pdf>.
VANDRUNEN, T. Partial Redundancy Elimination for Global Value Numbering. PhD thesis — Purdue University, West Lafayette, IN, USA, 2004.
WANG, K.; FRANKLIN, M. Highly accurate data value prediction using hybrid predictors. In: Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture. Washington, DC, USA: IEEE Computer Society, 1997. (MICRO 30), p. 281–290. ISBN 0-8186-7977-8. Available at: <http://dl.acm.org/citation.cfm?id=266800.266827>.
WATTERSON, S.; DEBRAY, S. Goal-directed value profiling. In: Proceedings of the 10th International Conference on Compiler Construction. London, UK: Springer-Verlag, 2001. (CC '01), p. 319–333. ISBN 3-540-41861-X. Available at: <http://dl.acm.org/citation.cfm?id=647477.760386>.
XU, L. Program redundancy analysis and optimization to improve memory performance. PhD thesis — Rice University, Houston, TX, USA, 2003.
YANG, J.; GUPTA, R. Load redundancy removal through instruction reuse. In: Proceedings of the 2000 International Conference on Parallel Processing. Washington, DC, USA: IEEE Computer Society, 2000. (ICPP '00), p. 61–. ISBN 0-7695-0768-9. Available at: <http://dl.acm.org/citation.cfm?id=850941.852902>.
YANG, W.; HORWITZ, S.; REPS, T. Detecting Program Components With Equivalent Behaviors. Madison, WI, USA: University of Wisconsin (Technical Report), 1989. Available at: <http://digital.library.wisc.edu/1793/59110>.
ZHANG, Y.; GUPTA, R. Timestamped whole program path representation and its applications. In: Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation. New York, NY, USA: ACM, 2001. (PLDI '01), p. 180–190. ISBN 1-58113-414-2. Available at: <http://doi.acm.org/10.1145/378795.378835>.
APPENDIX A -- SPEC CPU2006 Benchmark Suite
In this appendix, for the sake of completeness, detailed information on the selected
components from SPEC CPU2006 Integer is provided. Details were extracted from the
descriptions provided by the Standard Performance Evaluation Corporation (HENNING, 2006).
A.1 400.perlbench
Authors: Larry Wall et al.
General Category: Programming language
Description: 400.perlbench is a cut-down version of Perl v5.8.7, the popular scripting
language. SPEC's version of Perl has had most of its OS-specific features removed. In addition
to the core Perl interpreter, several third-party modules are used:
• SpamAssassin v2.61;
• Digest-MD5 v2.33;
• HTML-Parser v3.35;
• MHonArc v2.6.8;
• IO-stringy v1.205;
• MailTools v1.60;
• TimeDate v1.16.
Sources for all of the freely-available components used in 400.perlbench can be found on
the distribution media in the original.src directory.
Input: The reference workload for 400.perlbench consists of three scripts:
1. The primary component of the workload is the open-source spam-checking software
SpamAssassin. SpamAssassin is used to score a couple of known corpora of both spam
and ham (non-spam), as well as a sampling of mail generated from a set of random
components. SpamAssassin has been heavily patched to avoid doing file I/O, and does
not use Bayesian filtering;
2. Another component is the popular freeware email-to-HTML converter MHonArc.
Email messages are generated randomly and converted to HTML. In addition to
MHonArc, which was lightly patched to avoid file I/O, this component also uses several
standard modules from CPAN (the Comprehensive Perl Archive Network);
3. The third script in the reference workload (which also uses the mail generator for
convenience) exercises a slightly modified version of the specdiff script, which is
part of the CPU2006 tool suite.
The training workload is similar, but not identical, to the reference workload from
CPU2000. The test workload consists of the non-system-specific parts of the actual Perl 5.8.7
test harness.
Output: In the case of the mail-based benchmarks, a line with salient characteristics
(number of header lines, number of body lines, etc) is output for each message generated.
During processing, MD5 hashes of the contents of output "files" (in memory) are computed
and output. For SpamAssassin, the message’s score and the rules that it triggered are also
output.
Programming Language: ANSI C
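The validation scheme described above, hashing in-memory "files" rather than writing them to disk, can be sketched as follows. This is an illustrative sketch only (the buffer contents are hypothetical stand-ins, not the benchmark's actual output), not 400.perlbench's code:

```python
import hashlib

# Hypothetical in-memory "file" produced by a benchmark component;
# the benchmark hashes such buffers instead of performing file I/O.
output_buffer = b"Subject: test message\nbody line 1\nbody line 2\n"

# The 32-hex-character MD5 checksum is what gets emitted for validation.
digest = hashlib.md5(output_buffer).hexdigest()
print(digest)
```

Comparing a short checksum instead of the full output keeps validation cheap while still detecting any divergence in the generated messages.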
A.2 401.bzip2
Author: Julian Seward
General Category: Compression
Description: 401.bzip2 is based on Julian Seward’s bzip2 version 1.0.3. The only differ-
ence between bzip2 1.0.3 and 401.bzip2 is that SPEC’s version of bzip2 performs no file I/O
other than reading the input. All compression and decompression happens entirely in memory.
This is to help isolate the work done to only the CPU and memory subsystem.
Input: 401.bzip2’s reference workload has six components: two small JPEG images, a
program binary, some program source code in a tar file, an HTML file, and a "combined"
file, which is representative of an archive that contains both highly compressible and not
very compressible files. Each input set is compressed and decompressed at three different
blocking factors ("compression levels"), with the end result of the process being compared to
the original data after each decompression step.
Output: The output files provide a brief outline of what the benchmark is doing as it runs.
Output sizes for each compression and decompression are printed to facilitate validation, and
the results of decompression are compared with the input data to ensure that they match.
Programming Language: ANSI C
References: (HIRSCHBERG; LELEWER, 1990)
A.3 429.mcf
Author: Andreas Lobel; SPEC Project Leader: Reinhold Weicker
General Category: Combinatorial optimization / Single-depot vehicle scheduling
Description: 429.mcf is derived from MCF, a program used for single-depot vehicle
scheduling in public mass transportation.
The program is designed for the solution of single-depot vehicle scheduling problems
arising in public transportation planning. It considers a single depot and a homogeneous vehicle fleet. Based
on a line plan and service frequencies, so-called timetabled trips with fixed departure/arrival
locations and times are derived. Each of these timetabled trips has to be serviced by exactly
one vehicle. The links between these trips are called dead-head trips. In addition, there
are pull-out and pull-in trips for leaving and entering the depot. Cost coefficients are given
for all dead-head, pull-out, and pull-in trips. The task is to schedule all timetabled trips
into so-called blocks such that the number of vehicles required is as small as possible and,
as a secondary objective, the operational cost among all minimal-fleet solutions is minimized.
For the considered single-depot case, the problem can be formulated as a large-scale
minimum-cost flow problem, solved with a network simplex algorithm accelerated with column generation.
The network simplex algorithm is a specialized version of the well-known simplex algorithm
for network flow problems. The linear algebra of the general algorithm is replaced by simple
network operations, such as finding cycles or modifying spanning trees, that can be performed
very quickly. The main work of the network simplex implementation is pointer and
integer arithmetic. In the transition from 181.mcf (CPU2000) to 429.mcf (CPU2006), new
inputs were defined for test, train, and ref, with the goal of longer execution times. The heap
data size, and with it the overall memory footprint, increased accordingly. Most of the source
code was not changed, but several type definitions were changed by the author:
• Whenever possible, long-typed attributes of struct node and struct arc are replaced by
32-bit integers, for example when used as a boolean type. Pointers remain unaffected and
map to 32 or 64 bits, depending on the compilation model, to ensure compatibility with
64-bit systems;
• To reduce cache misses and accelerate program performance somewhat, the elements
of struct node and struct arc, respectively, are rearranged.
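The effect of narrowing fields to 32-bit integers and rearranging them can be illustrated with ctypes, which follows the platform's native C alignment rules. The field names below are hypothetical, not those of 429.mcf's actual struct arc:

```python
import ctypes

class ArcPadded(ctypes.Structure):
    # A 32-bit field placed before an 8-byte pointer forces the compiler
    # to insert alignment padding around it.
    _fields_ = [("ident", ctypes.c_int32),
                ("tail",  ctypes.c_void_p),
                ("cost",  ctypes.c_int32)]

class ArcPacked(ctypes.Structure):
    # Pointer first, 32-bit fields grouped together: less padding,
    # so more arcs fit per cache line.
    _fields_ = [("tail",  ctypes.c_void_p),
                ("ident", ctypes.c_int32),
                ("cost",  ctypes.c_int32)]

# On a typical 64-bit system this prints 24 and 16.
print(ctypes.sizeof(ArcPadded), ctypes.sizeof(ArcPacked))
```

A smaller, better-packed struct reduces the memory footprint and cache-miss rate, which is exactly the motivation given for the author's changes.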
Input: The input file contains: the number of timetabled and dead-head trips; for each
timetabled trip its starting and ending time; for each dead-head trip its starting and ending
timetabled trip and its cost.
Worst-case execution time is pseudo-polynomial in the number of timetabled and dead-head
trips and in the magnitude of the maximal cost coefficient. The expected execution time,
however, is of the order of a low-order polynomial.
Memory Requirements: 429.mcf requires about 860 and 1700 megabytes for a 32-bit and a
64-bit data model, respectively.
Output: The benchmark writes log information, a checksum, and output values describ-
ing an optimal schedule.
References: (LOBEL, 1998).
A.4 445.gobmk
Authors: Man Lung Li et al.
General Category: Artificial intelligence - game playing.
Description: The program plays Go and executes a set of commands to analyze Go
positions.
Input: Most input is in "SmartGo Format" (.sgf), a widely used de facto standard rep-
resentation of Go games. A typical test involves reading in a game to a certain point, then
executing a command to analyze the position.
Output: Typically an ASCII description of a sequence of Go moves.
Programming Language: C
A.5 458.sjeng
Authors: Gian-Carlo Pascutto, Vincent Diepeveen
General Category: Artificial Intelligence (game tree search & pattern recognition)
Description: 458.sjeng is based on Sjeng 11.2, which is a program that plays chess and
several chess variants, such as drop-chess (similar to Shogi), and losing chess.
It attempts to find the best move via a combination of alpha-beta or priority proof number
tree searches, advanced move ordering, positional evaluation and heuristic forward pruning.
Practically, it will explore the tree of variations resulting from a given position to a given base
depth, extending interesting variations but discarding doubtful or irrelevant ones. From this
tree the optimal line of play for both players ("principal variation") is determined, as well as
a score reflecting the balance of power between the two.
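The alpha-beta pruning at the heart of this search can be sketched in negamax form. This is a generic textbook sketch, not Sjeng's implementation; the `evaluate` and `children` callbacks are hypothetical game-specific hooks:

```python
def alphabeta(node, depth, alpha, beta, evaluate, children):
    """Minimal alpha-beta search in negamax form. `evaluate` scores a
    position from the perspective of the player to move; `children`
    yields the successor positions. Both are game-specific callbacks."""
    moves = children(node)
    if depth == 0 or not moves:
        return evaluate(node)
    best = float("-inf")
    for child in moves:
        # Negate and swap the window: the child is scored from the
        # opponent's point of view.
        score = -alphabeta(child, depth - 1, -beta, -alpha, evaluate, children)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # cutoff: the opponent will never allow this line
    return best
```

Move ordering, extensions, and forward pruning, as described above, all serve to make these cutoffs trigger earlier, shrinking the explored tree.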
The SPEC version is an enhanced version of the free Sjeng 11.2 program, modified to be
more portable and more accurately reflect the workload of current professional programs.
Input: 458.sjeng’s input consists of a text file containing alternating entries: a chess position
in the standard Forsyth-Edwards Notation (FEN) and the depth to which that position should
be analyzed, in half-moves (ply depth). The SPEC reference input consists of 9 positions
belonging to various phases of the game.
Output: 458.sjeng’s output consists, per position, of some side information (textual display
of the chessboard, phase of the game, parameters used, ...) followed by the output from
the tree-searching module as it progresses. This is formatted as: attained depth in half-moves
(plies); score for the player to move, in equivalents of 1 pawn; number of positions
investigated; and the optimal line of play ("principal variation").
Programming Language: ANSI C
A.6 462.libquantum
Authors: Bjorn Butscher, Hendrik Weimer
General Category: Physics / Quantum Computing
Description: libquantum is a library for the simulation of a quantum computer. Quantum
computers are based on the principles of quantum mechanics and can solve certain computa-
tionally hard tasks in polynomial time.
In 1994, Peter Shor discovered a polynomial-time algorithm for the factorization of numbers,
a problem of particular interest for cryptanalysis, as the widely used RSA cryptosystem
depends on the assumption that prime factorization cannot be solved in polynomial time. An
implementation of Shor’s factorization algorithm is included in libquantum.
Libquantum provides a structure for representing a quantum register and some elementary
gates. Measurements can be used to extract information from the system. Additionally,
libquantum offers the simulation of decoherence, the most important obstacle in building
practical quantum computers. It is thus possible not only to simulate any quantum algorithm,
but also to develop quantum error-correction algorithms. Since libquantum allows new gates
to be added, it can easily be extended to fit ongoing research; for example, it has been
deployed to analyze quantum cryptography.
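The number-theoretic reduction behind Shor's algorithm can be sketched classically. Here the period of a^x mod n is found by brute force; that search is precisely the step a quantum computer replaces with polynomial-time phase estimation. This is a textbook sketch, not libquantum's implementation:

```python
from math import gcd

def shor_classical(n, a):
    """Classical sketch of Shor's reduction: find the period r of
    a^x mod n, then derive factors from gcd(a^(r//2) +/- 1, n)."""
    g = gcd(a, n)
    if g != 1:
        return g, n // g  # the chosen base already shares a factor
    # Brute-force period search (the quantum computer's job).
    r, x = 1, a % n
    while x != 1:
        x = (x * a) % n
        r += 1
    if r % 2:
        return None  # odd period: retry with another base
    y = pow(a, r // 2, n)
    if y == n - 1:
        return None  # trivial square root: retry with another base
    p = gcd(y - 1, n)
    return p, n // p

print(shor_classical(15, 7))  # the textbook case n=15, a=7 yields (3, 5)
```

The benchmark's command-line base parameter, mentioned below, corresponds to the choice of `a` here; a bad choice forces a retry.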
Input: The benchmark program expects the number to be factorized as a command-
line parameter. An additional parameter can be supplied to specify a base for the modular
exponentiation part of Shor’s algorithm.
Output: The program gives a brief explanation on what it is doing and the factors of the
input number if the factorization was successful.
Programming Language: ISO/IEC 9899:1999 ("C99")
References: (SHOR, 1997)
A.7 471.omnetpp
Author: Andras Varga, Omnest Global, Inc.
General Category: Discrete Event Simulation
Description: Simulation of a large Ethernet network, based on the OMNeT++ discrete
event simulation system, using an Ethernet model which is publicly available. For the refer-
ence workload, the simulated network models a large Ethernet campus backbone, with several
smaller LANs of various sizes hanging off each backbone switch. It contains about 8000 com-