CARLOS HENRIQUE ANDRADE COSTA
DYNAMIC METHODOLOGY FOR OPTIMIZATION EFFECTIVENESS EVALUATION AND VALUE
LOCALITY EXPLOITATION
Thesis presented to the Escola Politécnica da Universidade de São Paulo
for the degree of Doctor of Engineering.
São Paulo
2012
CARLOS HENRIQUE ANDRADE COSTA
DYNAMIC METHODOLOGY FOR OPTIMIZATION EFFECTIVENESS EVALUATION AND VALUE
LOCALITY EXPLOITATION
Thesis presented to the Escola Politécnica da Universidade de São Paulo
for the degree of Doctor of Engineering.
Concentration Area:
Digital Systems
Advisor:
Prof. Dr. Paulo S. L. M. Barreto
São Paulo
2012
To my parents, who taught me. To my wife, who supported me. To my friends, who believed in me.
Acknowledgements
I would like first to thank my advisor, Prof. Dr. Paulo S. L. M. Barreto, who gave me the
opportunity and support to pursue my research interests with independence, and my collaborators
and, in a sense, informal advisors, Dr. José E. Moreira (IBM) and Prof. Dr. David Padua
(University of Illinois at Urbana-Champaign), with whom I worked during the part of this
work conducted at the IBM T. J. Watson Research Center, whose suggestions motivated me to
pursue the core ideas in this thesis, and whose guidance, support, and encouragement have been
invaluable. Without them all, this thesis would not have been possible.
I am grateful to various people at the IBM T. J. Watson Research Center and the University
of São Paulo, places where I had the opportunity and the environment to conduct research
and to collaborate with amazingly talented people. At IBM, I would like to thank Dr. Robert
Wisnieski for being such a good manager during my time with the Exascale System Software
Group and for making this joint research possible. I thank José Brunheroto, with whom I
worked developing the full-system simulator (Mambo) for BlueGene/Q, for sharing his
vast knowledge of the simulator's internals, for helping me understand how this tool
could be used in the development of the present work, and for the long and always interesting
hours of conversation that shed new light on a wide spectrum of subjects. I also thank
the people I worked with, directly or indirectly, who helped make my period
at T. J. Watson truly enjoyable: Roberto Gioiosa (now with Pacific Northwest National Laboratory),
Alejandro Rico-Carro (now with Barcelona Supercomputing Center, Spain), Jamin
Naghmouchi (now with T.U. Braunschweig, Germany), Daniele P. Scarpazza (now with D. E.
Shaw Research), Chen-Yong Cher, George Almasi, and many others.
At the University of São Paulo, my deepest gratitude goes to Prof. Tereza C. M. B. Carvalho,
who believed in me and gave me a life-changing opportunity. Her constant willingness to help and
guide helped me become a better researcher and person. I also thank all my talented fellow graduate
students and researchers, Charles Miers (now with Universidade do Estado de Santa Catarina),
Rony R. Sakuragui (now with Scopus Tecnologia), Fernando Redigolo, Marcelo C. Amaral, and
Diego S. Gallo, with whom I worked or collaborated in multiple research projects, and the
many others who inspired me, even if indirectly. I thank Guilherme C. Januário in particular,
a multi-talented graduate student, who made valuable suggestions for optimizing the code
written in this work.
Finally, I’m grateful to my parents that provided me the amazing environment to grow
and pursue my true interests and that taught me everything I know that really matters in life.
Most of all I would like to thank my wife, Emanuele P. N. Costa, whose unending patience,
love, and support through all the long working hours required for this work have made this
possible. Thanks for making life worth living!
Resumo
Software performance depends on the multiple code optimizations performed by modern compilers
to remove redundant computation. The identification of redundant computation is, in general,
undecidable at compile time, which prevents obtaining an ideal reference case for measuring the
unexploited potential of remaining redundancy removal and for evaluating the effectiveness of
code optimization. This work presents a set of methods for analyzing code optimization
effectiveness by observing the complete set of dynamically executed instructions and memory
references over an entire program execution. This is done by developing a dynamic value
numbering algorithm and applying it as instructions are executed. This method reduces
interprocedural analysis to the analysis of one large basic block and detects redundant memory
and scalar operations that are visible only at run time. In this way, the work extends
instruction-reuse analysis and provides both a more accurate approximation of the upper bound
of exploitable optimization within a program and a reference point for evaluating the
effectiveness of an optimization. The method also provides a clear view of unexploited
redundancy hotspots and a measure of value locality within the complete execution of a program.
A framework that implements the method and integrates it with a full-system simulator based on
the 64-bit Power ISA (version 2.06) is developed. A case study presents the results of applying
this method to executables of a representative benchmark (SPECInt2006) built at each
optimization level of the GNU C/C++ compiler. The proposed analysis yields a practical
evaluation of code optimization effectiveness that reveals a significant amount of remaining
unexploited redundancies, even when the highest available optimization level is used. Sources
of inefficiency are identified through the evaluation of hotspots and value locality. This
information proves useful for compiler and application tuning. The work also presents an
efficient mechanism for exploiting hardware support in redundancy elimination.
Keywords: code optimization, dynamic analysis, value numbering, optimization
effectiveness
Abstract
Software performance relies on multiple optimization techniques applied by modern compilers to
remove redundant computation. The identification of redundant computation is in general undecid-
able at compile-time and prevents one from obtaining an ideal reference for the measurement of the
remaining unexploited potential of redundancy removal and for the evaluation of code optimization
effectiveness. This work presents a methodology for optimization effectiveness analysis by observing
the complete dynamic stream of executed instructions and memory references in the whole program
execution, and by developing and applying a dynamic value numbering algorithm as instructions are
executed. This method reduces the interprocedural analysis to the analysis of a large basic block and
detects redundant memory and scalar operations that are visible only at run-time. This way, the work
extends the instruction-reuse analysis and provides both a more accurate approximation of the upper
bound of exploitable optimization in the program and a reference point to evaluate optimization effec-
tiveness. The method also generates a clear picture of unexploited redundancy hotspots and a measure
of value locality in the whole application execution. A framework that implements the method and
integrates it with a full-system simulator based on the 64-bit Power ISA (version 2.06) is developed. A case
study presents the results of applying this method to executables of a representative benchmark
(SPECInt2006) generated at the various optimization levels of the GNU C/C++ compiler. The proposed
analysis yields a practical evaluation that reveals a significant amount of remaining unexploited
redundancies present even when the highest optimization level available is used. Sources of inefficiency
are identified with an evaluation of hotspots and value locality, information that is useful for compiler
and application tuning. The thesis also shows an efficient mechanism to exploit hardware
support for redundancy elimination.
Keywords: code optimization, dynamic analysis, value numbering, optimization effectiveness
Original code:
  c = a + b
  d = a
  e = b
  f = d + e
  d = x

After value numbering (f3 = d1 + e2 matches c3 = a1 + b2):
  c3 = a1 + b2
  d1 = a1
  e2 = b2
  f3 = d1 + e2
  d4 = x4

After CSE:
  c3 = a1 + b2
  d1 = a1
  e2 = b2
  f3 = c3
  d4 = x4

After DCE (d1 = a1 is dead):
  c3 = a1 + b2
  e2 = b2
  f3 = c3
  d4 = x4

Fig. 6: Local Value Numbering (LVN) operation: common subexpression elimination and dead-code elimination.
2.3 Value Numbering
2.3.1 Local Value Numbering (LVN)
Local Value Numbering is a technique applied at compile time that recognizes redundancy
in a basic block among expressions that are lexically different but certain to compute the
same value. Traditionally, this is achieved by assigning symbolic names (value numbers) to
expressions. If the value numbers of the operands of two expressions and the operators applied
by the expressions are identical, then the expressions receive the same value number and are
certain to produce the same results. The analysis supports three different optimizations:
common subexpression elimination (CSE), copy propagation (CP), and dead-code elimination
(DCE) (CLICK, 1995; GULWANI; NECULA, 2007). The method works by progressing through each
statement in sequential order. Each new variable is assigned a distinct value number. For any
assignment expression, an existing value number on the right-hand side is assigned to the
left-hand side (see Figure 6). A new value number is created and propagated to both sides when
a new variable, i.e., one that has no existing value number, is found. This procedure
implements copy propagation. Each unary or binary expression is analyzed against a history
table of value numbers; when a match is found, the algorithm replaces the right-hand side
with the variable associated with the matching value number, which is equivalent to common
subexpression elimination. If the algorithm identifies that two different value numbers are
assigned to the same variable and that no intervening use of that variable exists, then the
first assignment is marked as dead and removed from the instruction list. This implements
dead-code elimination.
The operation of the algorithm is illustrated by the steps shown in Figure 6. The algorithm
starts with the original code. In the next step, all variables have been assigned value
numbers. Common subexpressions are then identified and eliminated: the binary expressions
c3 = a1 + b2 and f3 = d1 + e2 have the same value-number pattern (vn(3) = vn(1) + vn(2)),
hence f3 = d1 + e2 is replaced with f3 = c3. Dead-code elimination follows. In this case two
assignment expressions, d1 = a1 and d4 = x4, associate different value numbers with the same
variable d, and there is no intervening use of d. The first assignment expression, d1 = a1,
is marked dead and removed from the list of instructions.
Hash-based value numbering is one family of LVN methods, originally described in full
in (COCKE, 1969). It is the most widely used pessimistic algorithm operating over basic
blocks and has proven a broadly implemented and useful technique. The problem of determining
whether two computations are equivalent is, in general, undecidable (RAMALINGAM, 1994).
The approach taken by value numbering algorithms guarantees that any two expressions
identified as equivalent produce identical values on all possible executions; this is commonly
referred to as a conservative solution. Algorithm 2.1 formalizes the steps of an implementation
of LVN. The algorithm can be implemented with a worst-case complexity of O(N²), but most LVN
analyses complete in O(kN) time, with k ≪ N (BRIGGS; COOPER; SIMPSON, 1997).
The construction of the DAG representing the basic block is closely related to the operation
of a value numbering algorithm, except that nodes in the DAG are reused instead of inserting
new nodes with the same value. This operation corresponds to deleting later computations of
equivalent values and replacing them with previously computed ones. Presenting the
intermediate code in DAG form helps to illustrate the operation of the technique. Figure 7
shows the application of Algorithm 2.1 to the
Algorithm 2.1 Local Value Numbering (LVN) algorithm over basic blocks
1: T ← empty
2: N ← 0
3: for all quadruples a ← b ⊕ c in the block do
4:   if (b ↦ k) ∈ T for some k then
5:     nb ← k
6:   else
7:     N ← N + 1
8:     nb ← N
9:     put b ↦ nb into T
10:  end if
11:  if (c ↦ k) ∈ T for some k then
12:    nc ← k
13:  else
14:    N ← N + 1
15:    nc ← N
16:    put c ↦ nc into T
17:  end if
18:  if ((nb ⊕ nc) ↦ m) ∈ T for some m then
19:    put a ↦ m into T
20:    mark this quadruple a ← b ⊕ c as a common subexpression
21:  else
22:    N ← N + 1
23:    put (nb ⊕ nc) ↦ N into T
24:    put a ↦ N into T
25:  end if
26: end for
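As an executable illustration, Algorithm 2.1 can be sketched in Python. This is a simplified model, not the thesis implementation: quadruples are tuples, the table T is a dictionary, and copies (op "=") are handled as in Figure 6; all names are illustrative.

```python
def local_value_numbering(block):
    """Simplified LVN over a basic block of quadruples.

    Each quadruple is (dest, op, src1, src2); op "=" denotes a copy.
    Returns the indices of quadruples recognized as common subexpressions.
    """
    table = {}          # variable names and (op, vn1, vn2) keys -> value numbers
    next_vn = 0
    redundant = []

    def vn(operand):
        nonlocal next_vn
        if operand not in table:           # new variable: fresh value number
            next_vn += 1
            table[operand] = next_vn
        return table[operand]

    for i, (dest, op, b, c) in enumerate(block):
        if op == "=":                      # copy: propagate the value number
            table[dest] = vn(b)
            continue
        key = (op, vn(b), vn(c))
        if key in table:                   # same operator, same operand numbers
            table[dest] = table[key]
            redundant.append(i)            # mark as a common subexpression
        else:
            next_vn += 1
            table[key] = next_vn
            table[dest] = next_vn
    return redundant

# The basic block of Figure 6:
block = [
    ("c", "+", "a", "b"),
    ("d", "=", "a", None),
    ("e", "=", "b", None),
    ("f", "+", "d", "e"),   # same value numbers as c = a + b
    ("d", "=", "x", None),
]
print(local_value_numbering(block))  # -> [3]
```

Copy propagation through the table is what lets f = d + e hash to the same key as c = a + b, exactly as in the figure.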
[Figure 7 shows the stepwise construction of a DAG under local value numbering for the
expression sequences: (a) g = x + y; (b) g = x + y; h = u - v; i = x + y; x = u - v;
(c) the same four expressions followed by u = g + h; v = i + x; w = u + v. Repeated
expressions reuse existing nodes instead of creating new ones.]

Fig. 7: Steps to build a DAG applying local value numbering.
code in the top left in DAG form. The figure shows the expressions scanned and the resulting
graph in sequence. Each node holds a content or expression, and the assigned value number is
annotated at the top of the node. Variables are annotated with arrows pointing to the content
they hold as the expressions are scanned. For example, in the second block the redefinition of
x requires the arrow pointing to node x0 to be removed and redirected to the node holding the
result of u - v. Similar reuses are seen in the third block.
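The node-reuse rule can be made concrete with a small Python sketch; the representation (a dictionary keyed by (operator, operand-node) tuples) is an illustrative choice, not the thesis implementation.

```python
def build_dag(exprs):
    """Build a DAG for a basic block with node reuse.

    exprs is a list of (dest, op, src1, src2). current maps each variable
    to the node currently holding its value, mirroring the moving arrows
    in Figure 7. Returns the node table and the final variable map.
    """
    nodes = {}    # ("leaf", name) or (op, node1, node2) -> node id
    current = {}  # variable name -> node id currently holding its value

    def node_of(var):
        if var not in current:                  # first use: create a leaf
            current[var] = nodes.setdefault(("leaf", var), len(nodes))
        return current[var]

    for dest, op, a, b in exprs:
        key = (op, node_of(a), node_of(b))
        if key not in nodes:                    # new value: create a node,
            nodes[key] = len(nodes)             # otherwise reuse the old one
        current[dest] = nodes[key]              # re-point dest's arrow
    return nodes, current

nodes, cur = build_dag([
    ("g", "+", "x", "y"),
    ("h", "-", "u", "v"),
    ("i", "+", "x", "y"),   # reuses the node built for g = x + y
    ("x", "-", "u", "v"),   # re-points x's arrow to the u - v node
])
print(cur["g"] == cur["i"], cur["h"] == cur["x"])  # -> True True
```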
2.3.2 Global Value Numbering (GVN)
The value numbering algorithm as originally proposed handles only basic blocks. An extension
applicable to extended basic blocks was proposed by Auslander and Hopkins and implemented in
the IBM PL.8 compiler (AUSLANDER; HOPKINS, 2004). A general approach operating over the
program as a whole is known as Global Value Numbering (GVN). GVN is the analysis used to
remove redundant computations that compute the same static value in a program, considering
multiple procedures. The roots of the GVN approach lie in the work of Alpern, Rosen, Wegman,
and Zadeck (ALPERN; WEGMAN; ZADECK, 1988b), whose main contribution was the description of
how expressions can be partitioned into congruence classes.
The first contribution toward a global value numbering targeting the whole program, including
multiple procedures, was the optimistic algorithm proposed by Reif and Lewis (REIF; LEWIS,
1986). The algorithm, in spite of having good asymptotic complexity, is hard to implement.
Alpern, Wegman and Zadeck (ALPERN; WEGMAN; ZADECK, 1988b) suggested an algorithm that used
Hopcroft's finite-state minimization to determine the congruences, including a subscripting
scheme called the φ-function that captures the semantics of conditionals and loops. The use
of the φ-function in the description of a program led to the development of the SSA form.
This approach was extended by Yang, Horwitz and Reps (YANG; HORWITZ; REPS, 1989). A
combination of the SSA form with the Program Dependence Graph (PDG) was proposed by Ballance,
Maccabe and Ottenstein (OTTENSTEIN; BALLANCE; MACCABE, 1990). It is called Gated Single
Assignment (GSA) and is arguably a better subscripting scheme than the one proposed by
Alpern, Wegman and Zadeck.
(a) Intermediate code:

  i ← 1
  j ← 1
  if i mod 2 = 0
    i ← i + 1
    j ← j + 1
  else
    i ← i + 3
    j ← j + 3
  if j > n

(b) Corresponding minimal SSA form:

  Entry: n ← val; i1 ← 1; j1 ← 1
  i3 ← φ2(i1, i2)
  j3 ← φ2(j1, j2)
  i3 mod 2 = 0
    then: i4 ← i3 + 1; j4 ← j3 + 1
    else: i5 ← i3 + 3; j5 ← j3 + 3
  i2 ← φ5(i4, i5)
  j2 ← φ5(j4, j5)
  j2 > n
  Exit

Fig. 8: Minimal SSA form from an intermediate code.
At the core of the GVN technique is the discovery of so-called variable congruence. Two
variables are congruent when the computations that define them have identical operators and
their corresponding operands are congruent. In order to apply the GVN technique, the
procedure needs to be translated into minimal SSA form. Such manipulation can be done using
the dominance frontier. The SSA form can then be used to produce the so-called Value Graph
(VG). The value graph is a labeled directed graph whose nodes are labeled with operators,
functions and constants. The value graph edges connect each operator or function to its
operands. The edges are labeled with natural numbers
[Figure 9 shows the value graph: nodes labeled with the operators (φ2, φ5, +, mod, =, >),
the constants and n, with edges numbered by operand position.]

Fig. 9: Value Graph.
indicating the operand position with respect to a given operator. Figure 9 shows the value
graph for the intermediate code in Figure 8(a) and its translation into the minimal SSA form
of Figure 8(b).
From the value graph, variable congruence is defined as the maximal relation on the graph
such that two nodes are congruent if either they are the same node or they have the same
operators and their operands are congruent. The equivalence of two variables is then defined
at a point P if they are congruent and their defining assignments dominate P (MUCHNICK, 1997).
The Alpern-Wegman-Zadeck algorithm makes the optimistic assumption that a large set of
expression classes is congruent and refines the grouping by splitting the congruence classes
until a fixed point is reached. A hash table-based approach for GVN was introduced in
(BRIGGS; COOPER; SIMPSON, 1997). The role of the hash table is to associate an expression
with a value number. Taylor Simpson's SCCVN algorithm for optimistic value numbering
discovers value-based identities; it is arguably the strongest global technique for the
detection of redundant scalar values and has been implemented in a number of compilers
(SIMPSON, 1996).
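A minimal Python sketch can convey the hash-based idea over SSA names: because each SSA name is defined exactly once, a single global table suffices, and two names are equivalent when their defining expressions receive the same value number. This is an illustration only; real implementations additionally handle commutativity, constants and φ-functions.

```python
def hash_based_gvn(ssa_program):
    """Hash-based value numbering over SSA statements.

    ssa_program is a list of (dest, op, operands); each dest is a unique
    SSA name. Returns a map from names and expression keys to value numbers.
    """
    vn = {}        # SSA name, constant, or expression key -> value number
    counter = 0

    def number(operand):
        nonlocal counter
        if operand not in vn:         # constants and undefined inputs
            counter += 1
            vn[operand] = counter
        return vn[operand]

    for dest, op, operands in ssa_program:
        key = (op,) + tuple(number(o) for o in operands)
        if key not in vn:             # first time this expression is seen
            counter += 1
            vn[key] = counter
        vn[dest] = vn[key]            # dest inherits the expression's number
    return vn

prog = [
    ("i4", "+", ("i3", 1)),
    ("j4", "+", ("j3", 1)),
    ("k4", "+", ("i3", 1)),   # same expression as i4's definition
]
v = hash_based_gvn(prog)
print(v["i4"] == v["k4"], v["i4"] == v["j4"])  # -> True False
```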
2.3.3 Partial Redundancy Elimination (PRE)
Another large group of optimization techniques can be gathered under the Partial Redundancy
Elimination (PRE) umbrella. The technique was first proposed in (MOREL; RENVOISE, 1979).
The basic idea is to find computations that are redundant on some, but not all, paths
(LO et al., 1998). The process can be viewed as follows: given an expression e that at some
point p is redundant on some subset of the paths that reach p, the transformation inserts
evaluations of e on the paths where it was absent, making the evaluation at p redundant on
all paths (XU, 2003). The conclusion frequently found in the literature comparing GVN and
PRE can be summarized as follows: on the one hand, PRE finds lexical congruences instead of
value congruences, but covers partial redundancies; on the other hand, GVN finds value
congruences but can remove only full redundancies (VANDRUNEN, 2004).
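The insertion step can be mimicked in Python on a toy diamond-shaped flow. This is purely illustrative: a compiler performs the transformation on the flow graph, not on source, and the function names are invented for the example.

```python
def before(cond, a, b):
    # a + b is partially redundant: computed on the cond path,
    # then recomputed unconditionally after the merge point.
    if cond:
        x = a + b
    else:
        x = 0
    y = a + b          # redundant whenever cond was taken
    return x, y

def after(cond, a, b):
    # PRE inserts an evaluation on the other path, so the expression
    # is available on all paths and the late computation becomes a reuse.
    if cond:
        t = a + b
        x = t
    else:
        t = a + b      # inserted evaluation
        x = 0
    y = t              # now fully redundant: replaced by a reuse of t
    return x, y

print(before(True, 2, 3) == after(True, 2, 3))  # -> True
```

On the cond path the late `a + b` disappears entirely; on the other path the work merely moves earlier, which is why PRE insertions are profitable only when the redundant paths are taken often enough.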
Attempts to integrate both approaches have appeared in the literature more recently. The
mixed approaches proposed in (BODÍK; ANIK, 1998) and later in (VANDRUNEN, 2004) are
significant examples. However, to date such approaches have not been implemented in
commercial compilers, especially due to their complexity and memory demands.
2.4 Register Allocation
Register allocation attempts to maximize the execution speed of programs by keeping as many
variables as possible in registers instead of memory. When there are more live variables than
available registers, variables must be spilled from registers, i.e., written to memory and
reloaded at a later time when they are needed. An efficient allocation should avoid spilling
invariant values, since such a spill is potentially redundant and can be folded by propagating
the invariant value to the subsequent instructions that use it as a source operand.
Register allocation is usually performed at the end of global optimization, when the final
structure of the code is ready and all registers to be used are known. The register allocation
procedure attempts to map the registers in such a way as to minimize the number of memory
references. Register allocation affects performance by lowering the instruction count and
potentially reducing the execution time per instruction by changing memory operands to
register operands. Code size is reduced by these improvements, leading to other secondary
improvements. Register allocation depends on liveness analysis in order to decide which
variables to spill.
2.4.1 Variable Liveness Analysis
Liveness analysis is a data-flow analysis technique used to find the variables that may
potentially be read before their next write. A variable in this state is called live.
Liveness analysis is at the core of the register allocation task: register allocation
determines which values should reside in machine registers at each point of the execution,
and register assignment is usually preceded by a liveness analysis.
A liveness analysis of a basic block S can be expressed by the following equations:

  L_in[S] = G[S] ∪ (L_out[S] − U[S])
  L_out[final] = ∅
  L_out[S] = ⋃ { L_in[p] : p ∈ succ[S] }
  G[d : y ← f(x1, · · · , xn)] = {x1, · · · , xn}
  U[d : y ← f(x1, · · · , xn)] = {y},

where G[S] is the set of variables used in S before any assignment and U[S] is the set of
variables assigned a value in S before any use. The sets L_in and L_out are, respectively,
the set of variables that are live at the beginning of the block and the set of variables
that are live at the end of the block. Making the written variables dead and the read
variables live is handled by the transfer function f (KHEDKER; SANYAL; KARKARE, 2009).
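Assuming per-block G and U sets, the equations can be solved by fixed-point iteration. The sketch below is illustrative, not the thesis implementation; block names and the tiny CFG are invented for the example.

```python
def liveness(blocks, succ):
    """Iterative liveness analysis over a CFG.

    blocks maps a block name to (gen, kill): gen = variables read before
    any write (G in the equations above), kill = variables written before
    any read (U). succ maps a block to its successors. Returns live-in sets.
    """
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:                        # iterate to a fixed point
        changed = False
        for b, (gen, kill) in blocks.items():
            out = set().union(*(live_in[s] for s in succ[b]))
            new_in = gen | (out - kill)   # L_in = G ∪ (L_out − U)
            if new_in != live_in[b] or out != live_out[b]:
                live_in[b], live_out[b] = new_in, out
                changed = True
    return live_in

# y ← f(x): gen = {x}, kill = {y}
blocks = {
    "B1": ({"a"}, {"x"}),        # x ← a + 1
    "B2": ({"x"}, {"y"}),        # y ← x * 2
    "B3": ({"y"}, set()),        # return y
}
succ = {"B1": ["B2"], "B2": ["B3"], "B3": []}
print(liveness(blocks, succ)["B1"])   # -> {'a'}
```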
(a) x ← 2; y ← 4; w ← x + y; z ← x + 1; x ← z * 2

(b) s1 ← 2; s2 ← 4; s3 ← s1 + s2; s4 ← s1 + 1; s5 ← s1 * s2; s6 ← s4 * 2

(c) r1 ← 2; r2 ← 4; r3 ← r1 + r2; r3 ← r1 + 1; r1 ← r1 * r2; r2 ← r3 * 2

Fig. 10: Steps to perform register allocation with K-coloring.
[Figure 11 shows the interference graph over the symbolic registers s1-s6 and the real
registers r1-r3.]

Fig. 11: Example of interference graph.
2.4.2 Register Allocation by Graph-Coloring
One of the most effective and widely used approaches to register allocation is the technique
known as register allocation by graph coloring. The application of the general graph-coloring
problem (SAATY; KAINEN, 1977) to register allocation was first proposed in (COCKE, 1969).
The first implementation was obtained only ten years later in (CHAITIN, 2004). The approach
was first adapted for the PL.8 compiler for the IBM 801 RISC system. Most modern compilers
implement this allocator or one of its derivations.
Global register allocation by graph coloring comprises two major procedures: the construction
of the so-called Interference Graph (IG) and the K-coloring procedure. The interference graph
is created so that every vertex represents a unique variable in the program. Interference
edges connect pairs of vertices that are live at the same time. Pairs of vertices involved in
move instructions are connected by preference edges. Register allocation can then be done by
K-coloring the interference graph. The K-coloring problem applied to register allocation can
be seen, in a simplified way, as assigning necessarily different colors to two vertices
sharing an interference edge and assigning, when possible, the same color to vertices sharing
a preference edge. Hardcoded register assignment is represented by precoloring some vertices.
The K-coloring problem is known to be NP-complete (BRIGGS; COOPER; TORCZON, 1994). Over the
years, algorithms trading off the quality and performance of the produced code have been
proposed. Global register allocation by graph coloring can be summarized in the following
steps (MUCHNICK, 1997):
1. In the phase that precedes register allocation, allocate objects that can be assigned to
registers r1, r2, · · · , rn to distinct symbolic registers, using as many registers as
necessary to hold the objects;
2. Determine the object sets that should be candidates for allocation;
3. Generate the interference graph, in which each node represents an allocatable object or a
real register of the target machine. The arcs represent interferences, where two allocatable
objects interfere if they are simultaneously live, and an object and a register interfere if
the object cannot or should not be allocated to that register;
4. Color the nodes of the interference graph with K colors, where K is the number of
available registers. Two adjacent nodes must have different colors;
5. Allocate each object to the register that has the same color.

Build → Simplify → Potential spill → Select → Actual spill

Fig. 12: Graph coloring heuristic.
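A simplify/select loop in the Chaitin style can sketch the coloring steps in Python. This is a reduced model: no spill handling, coalescing or preference edges, and the example graph's edges are assumed for illustration, since Figure 11 does not list them.

```python
def color_graph(interference, k):
    """Chaitin-style K-coloring: repeatedly remove a node with fewer than
    k neighbours (simplify), then color nodes in reverse removal order
    (select). Returns node -> color, or None if some node would spill.
    """
    graph = {n: set(neigh) for n, neigh in interference.items()}
    stack = []
    while graph:
        # simplify: a node with < k neighbours is always colorable
        trivial = next((n for n in graph if len(graph[n]) < k), None)
        if trivial is None:
            return None                    # potential spill
        stack.append((trivial, graph.pop(trivial)))
        for neigh in graph.values():
            neigh.discard(trivial)
    colors = {}
    while stack:                           # select, in reverse removal order
        node, neighbours = stack.pop()
        used = {colors[n] for n in neighbours if n in colors}
        colors[node] = next(c for c in range(k) if c not in used)
    return colors

# Hypothetical interference edges over four symbolic registers:
ig = {
    "s1": {"s2", "s3", "s4"},
    "s2": {"s1", "s3"},
    "s3": {"s1", "s2"},
    "s4": {"s1"},
}
print(color_graph(ig, 3))
```

Every node removed during simplify had fewer than k colored neighbours when it is put back, so a free color always exists in the select phase.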
The intermediate code in Figure 10 shows the steps of the register allocation process with
K-coloring, and the corresponding interference graph is shown in Figure 11. An efficient
approach to implementing register allocation with graph coloring was proposed by Briggs and
Cooper in (BRIGGS; COOPER; TORCZON, 1994) and is widely used in modern compilers. A commonly
used heuristic flow for graph coloring is shown in Figure 12.
Recent developments show that the interference graphs of programs in Static Single Assignment
(SSA) form are chordal (HACK; GRUND; GOOS, 2006). This result is important because chordal
graphs can be colored in polynomial time. Most known register allocation models can be
adapted to run on SSA-form programs (TORCZON; COOPER, 2011). Register allocators can benefit
from the chordality of SSA-form programs in three main ways: (i) lower register pressure;
(ii) separation between spilling and register assignment; (iii) simpler register assignment
algorithms. Spill-free register allocation has a polynomial-time solution for SSA-form
programs, but it is NP-complete for programs in general. One point that must be emphasized is
that these two problems are not equivalent. Any program can be converted into SSA form via a
polynomial-time transformation (CYTRON et al., 1991). However, a register assignment for an
SSA-form program cannot be converted back to an optimal register assignment of the original
program in polynomial time unless P=NP (TORCZON; COOPER, 2011).
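The practical consequence of chordality can be seen with a greedy sketch: coloring a chordal graph in the reverse of a perfect elimination ordering never needs more colors than the largest clique. The graph and order below are illustrative assumptions, not taken from the thesis.

```python
def greedy_color(graph, order):
    """Greedy coloring along a given vertex order: each vertex takes the
    smallest color unused by its already-colored neighbours. On a chordal
    graph visited in the reverse of a perfect elimination order this is
    optimal, which is why SSA-form interference graphs can be colored in
    polynomial time.
    """
    colors = {}
    for v in order:
        used = {colors[u] for u in graph[v] if u in colors}
        colors[v] = next(c for c in range(len(graph)) if c not in used)
    return colors

# A chordal graph: triangle a-b-c with a pendant d attached to c.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
coloring = greedy_color(graph, ["a", "b", "c", "d"])
print(len(set(coloring.values())))  # -> 3 (the largest clique has 3 nodes)
```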
The next chapter discusses the limitations of static analysis and approaches to identifying
and avoiding redundant computation with run-time knowledge.
3 Dynamic Identification of Redundant Computation
The traditional approach to code optimization is the compile-time optimizer, and modern
optimizing compilers rely strongly on static analysis to identify and remove redundant
computation. It is common practice to evaluate a compiler's optimizations by reporting the
amount of such computation successfully identified and removed. Unfortunately, there are
difficulties that limit the effectiveness of compile-time optimization. Procedure boundaries,
profitability-based trade-offs, and heavy interaction among multiple optimizations and the
system architecture inhibit the effectiveness of many optimizations. Significant benefits
have been shown to be gained from optimizing across procedure boundaries. However, finding
and exploiting interprocedural opportunities proves challenging. Interprocedural analysis
implementing aggressive function inlining can remove many procedure boundaries entirely, but
comes at the cost of increased code size and can greatly increase cache misses.
Interprocedural data-flow analysis techniques have also been used. Evidence, however,
suggests that such methods are not worth the additional complexity they create in the
compiler, given their limited impact on effectiveness. The questions that typically arise
regarding optimized code are how effective the compiler was in finding and eliminating
redundant computation and whether there are unexploited optimization opportunities that can
only be detected at run time.
Dynamic identification of redundant computation detects redundancies along an execution path
using run-time data. Some redundancies identified dynamically cannot be detected statically
in a practical way, or are related to a specific or unexpected execution path. In any case,
dynamic redundancies can elucidate shortcomings in various parts of a compiler. For example,
common subexpression elimination (CSE) should eliminate redundant loads, dead-store
elimination should remove unnecessary stores, and a register allocator would not need to
store an unmodified spilled value. A register spill while another register holds a dead value
indicates that the register allocation unnecessarily spilled a value along at least one path.

[Figure 13 depicts run-time redundancy as encompassing both fully static and partially
static redundancy.]

Fig. 13: Relationship between redundancy categories.
The occurrence and detection of redundancy can be categorized as shown in Figure 13 (XU,
2003). Fully static redundancies are those that can be safely removed through static
analysis. Partially static redundancies are instructions that can be avoided statically
through code motion. Run-time redundancies are the instances that are only visible at run
time.
This chapter discusses the limitations of compile-time optimization and the main reasons why
statically optimized code can still exhibit significant opportunities for optimization. It
discusses approaches to identifying dynamic redundant computation as a way to provide
information on unexploited opportunities, as well as existing support for exploiting run-time
data in optimization.
3.1 Limitations of Static Analysis
Exact addresses are often unavailable at compile time, and control flows conditional on the
program's input are unknown until run time (COOPER; XU, 2003). Static analysis must
conservatively speculate on the program's dynamic behavior; it typically generalizes
invariant behavior, which is then applied to all dynamic instances. A representative example
is register promotion (COOPER; LU, 1997), which identifies memory operations that always
access the same
(a) original code:
  if ( ) then
    Z = X * Y
  else
    ...
  endif
  if ( )
    W = X * Y
  else
    ...
  endif

(b) redundancy removal:
  if ( ) then
    T = Z = X * Y
  else
    T = X * Y
  endif
  if ( )
    W = T
  else
    ...
  endif

(c) strength reduction:
  if ( ) then
    T = Z = X * Y
  else
    T = (Y=1)? X : X * Y
  endif
  if ( )
    W = T
  else
    ...
  endif

Fig. 14: Example of profile-guided transformation.
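Panel (c) of Figure 14 can be mimicked in Python (illustrative names only; the guard pays off when the profile shows that Y = 1 dominates):

```python
def mul_original(x, y):
    return x * y

def mul_strength_reduced(x, y):
    # Profile data says y == 1 is the common case: skip the multiply then.
    return x if y == 1 else x * y

# The transformation is semantics-preserving for every input:
print(all(mul_original(x, y) == mul_strength_reduced(x, y)
          for x in range(-3, 4) for y in range(-3, 4)))  # -> True
```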
memory location and uses a register to hold copies of the memory content. However, the
approach fails to detect possible differences and redundancies existing among dynamic
instances. In addition, the advent and increased use of shared libraries, dynamic class
loading, and run-time binding complicate the compiler's ability to analyze and optimize
programs. For good performance, many versions of the compiled application may be needed to
take into account the performance implications of subtle architectural differences between
compatible processors (even processors in the same family).
Profile-guided optimization has helped to identify partially redundant computations that cannot be conservatively distinguished at compile time but, with run-time knowledge, can be optimistically eliminated. Dynamic limit studies attempt to evaluate the effectiveness of the optimizations deployed and to measure the unexploited potential. Hardware-based support for redundancy elimination identifies run-time redundancy instances and/or helps the compiler exploit them, for example through reuse buffers and support for speculative and predicated execution.
3.2 Profile Guided Optimization (PGO)
If the compiler understands the relative execution frequencies of the various parts of the
program, it can use that information to improve the program’s performance. Profile Guided
Optimization (PGO) has helped to tune the optimization process by collecting information about resource utilization, such as register allocation and instruction counts, at run time (LIN et al., 2003; LIU et al., 2007). Figure 14 shows an example of redundancy elimination opportunities that can be exploited through profile data. In static analysis, the compiler cannot determine the value of variable y during execution. Profiling, however, indicates that the value of y in the first multiplication is very frequently 1, so the multiplication can frequently be optimized away. The transformation shown in the figure performs this optimization by executing the multiplication conditionally. The transformation implies extra instructions to check the value of y, and is only worthwhile depending on how frequently variable y holds the value 1 rather than some other value; such information determines whether the transformation should be applied. Profiles can also play an important role in optimizations such as global code placement or inline substitution. The major approaches to gather profile data are:
• Instrumented executable: the compiler generates code to count specific events, such
as procedure entries and exits or taken branches. The data is written to an external file
at run-time and processed offline by another tool;
• Timer interrupts: approach that interrupts program execution at frequent, regular in-
tervals. The tool constructs a histogram of program counter locations where the inter-
rupts occurred. Post-processing constructs a profile from the histogram;
• Performance counters: many processors offer some form of hardware counters to
record hardware events, such as total cycles, cache misses, or taken branches. If coun-
ters are available, the run-time system can use them to construct profile-like data.
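As a minimal illustration of the instrumented-executable approach above, the following Python sketch counts procedure entries; the decorator and event names are hypothetical, and a real compiler inserts the counter updates directly into generated code rather than wrapping functions.

```python
from collections import Counter

events = Counter()  # event name -> count, dumped to a file after the run

def instrumented(name):
    """Hypothetical instrumentation hook: count each procedure entry."""
    def wrap(fn):
        def inner(*args, **kwargs):
            events[name] += 1  # the compiler-inserted counter update
            return fn(*args, **kwargs)
        return inner
    return wrap

@instrumented("hot_path")
def hot_path(x):
    return x + 1

for i in range(5):
    hot_path(i)
```

After the run, `events` plays the role of the external profile file that an offline tool would process.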
The benefit of PGO lies in overcoming static optimization limitations by exploiting dynamic information that cannot be safely inferred statically. However, the approach relies on good profiling information, which typically implies collecting and handling a representation of an actual run of the program. Common types of profile information are control flow profiles, value profiles, and memory profiles.
[Figure: a control flow graph with entry S, nodes a through h and k, and exit E, annotated with observed path frequencies for paths such as (abdfgk), (acefhk), (abdfhk), (acefgk), (abefgk), (abefhk); one hot path occurs 50 times while the others occur 10 times each.]

Fig. 15: An example of control flow profiles.
[Figure: the code fragment l1: load R3, 0(R4); l2: R2 ← R3 & 0xff, with per-(instruction, register) profiles listing (value, frequency) pairs such as (0xb8d003400, 10)…, (0, 1000), and (0, 1000), (0x890, 200), …, (0x2900, 100) for the pairs (l1,R3), (l1,R2), and (l2,R3).]

Fig. 16: An example of value profiles.
In control flow profiling, a trace of the execution path taken by the program is generated by counting the number of visits to the various basic blocks in the program's control flow graph (CFG). Real-world applications generate large and sometimes unmanageable traces. Whole Program Paths (WPP), representing a whole execution, have been made possible through profile compression, as proposed in (LARUS, 1999) and (ZHANG; GUPTA, 2001). An example of a control flow trace
is shown in Figure 15.
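A toy sketch of path counting in the spirit of a compressed whole-program path: identical paths are counted rather than stored. The block labels loosely follow Fig. 15, and the control conditions are invented.

```python
from collections import Counter

path_counts = Counter()  # path string -> execution count

def record_run(branch1, branch2):
    # One traversal S -> ... -> E; block labels loosely follow Fig. 15.
    path = ["S", "a" if branch1 else "c", "b" if branch1 else "e",
            "f", "g" if branch2 else "h", "k", "E"]
    path_counts["".join(path)] += 1

for _ in range(50):
    record_run(True, True)   # the hot path dominates
record_run(False, False)
```

Counting whole paths instead of individual edges is what lets a WPP distinguish correlated branches while keeping the trace compact.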
Value profiles identify the specific values encountered as operands of an instruction and their frequencies. The use of a top-n-values table (TNV) has been proposed (CALDER; FELLER; EUSTACE, 1997) to cope with the fact that not all values appearing in a practical execution can be collected; replacement policies, such as Least Frequently Used (LFU), are typically used to populate the table. Approaches to lower the overhead generated by value
[Figure: a loop over i that switches on flag, assigning xa=*pa, xb=*pb, xc=*pc, or xd=*pd per iteration and updating pa=buf[i], with declarations int flag; int *pa, *pb, *pc, *pd; int buf[2000]; int xa, xb, xc, xd; and an address profile of link weights: (A(xa),A(pa)) 5000; (A(xb),A(pb)) 20; (A(xc),A(pc)) 2; (A(xd),A(pd)) 10; …; (A(pa),A(buf)) 2000.]

Fig. 17: An example of memory profiles.
collection have been proposed (WATTERSON; DEBRAY, 2001). Value profiles are of particular interest for value-specialization optimizations such as constant folding, strength reduction, and motion of nearly invariant code. An example of a value profile is shown in Figure 16.
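The TNV table described above can be sketched in a few lines of Python; the table size and the recorded values are assumptions for illustration, and real implementations bound the counter widths as well.

```python
class TopNValues:
    """Hypothetical top-n-values (TNV) table: keep the n most frequent
    values seen at an instruction operand, evicting via LFU."""

    def __init__(self, n=4):
        self.n = n
        self.freq = {}  # value -> observed frequency

    def record(self, value):
        if value in self.freq:
            self.freq[value] += 1
        elif len(self.freq) < self.n:
            self.freq[value] = 1
        else:
            victim = min(self.freq, key=self.freq.get)  # LFU replacement
            del self.freq[victim]
            self.freq[value] = 1

tnv = TopNValues(n=2)
for v in [0, 0, 0, 1, 2]:
    tnv.record(v)
# 0 remains the dominant entry; the rare value 1 was evicted for 2.
```

LFU keeps the dominant values resident even when many distinct values stream past, which is exactly the property value specialization needs.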
Address profiles are collected as a stream of memory addresses, typically in compressed form (CHILIMBI, 2001), and are used to improve the performance of the memory hierarchy. Hot address streams, subsequences of addresses that are encountered very frequently, are used to guide placement transformations. An example of an address profile is shown in Figure 17.
The typical scheme used for profiling is shown in Figure 18. Collecting profiles implies an overhead, since they are gathered by executing an instrumented version of the program. Instrumentation depends on the profile being collected and typically targets specific parts of the code. The approach has helped obtain better results when guiding several static optimizations. However, due to its limited scope, it does not provide much information about the potential that remains unexploited in optimized code.
[Figure: the compiler emits an instrumented program; executing it with a representative input produces profile data; a profile-guided optimizing compiler then consumes the program and the profile data to emit optimized code.]

Fig. 18: Profile-guided optimizing compiler.
3.3 Dynamic Limit Studies
The ultimate optimization effectiveness analysis would indicate how far the deployed optimizations are from the ideal case. Since the identification of redundant computation is in general undecidable (RAMALINGAM, 1994), such an analysis is not feasible. Dynamic limit studies instead target the total redundancy exhibited at run time, yielding an upper limit on exploitable redundancy elimination. Limit studies are related to the occurrence of instruction repetition and to the likelihood that an instruction repeatedly produces the same known result. This effect, known as value locality, is primarily related to the fact that real-world programs, run-time environments, and operating systems incur severe performance penalties because they are general by design (LIPASTI; WILKERSON; SHEN, 1996). They are implemented to handle contingencies, exceptional conditions, and erroneous inputs, all of which occur relatively rarely in real life. Even code that is aggressively optimized by modern, state-of-the-art compilers exhibits such instruction repetition and value locality.
Studies based on empirical observations have helped explain why instruction repetition occurs (LIPASTI; WILKERSON; SHEN, 1996). A simulator that derives the ideal performance of an algorithm for removing heap-based loads is presented in (DIWAN; MCKINLEY; MOSS, 1998); the ideal performance is used to determine which alias analysis is near-optimal for load removal while still not too expensive. A compiler auditor tool, which analyzes the program trace to discover limitations and bugs in the compiler, is presented in (LARUS; CHANDRA, 1993). A load-reuse profiling technique based on a limit study, with the primary goal of giving load-reuse hints to the processor, is discussed in (REINMAN et al., 1998). One of the most relevant works is the load-reuse limit study presented as a method to detect dynamic redundancy for memory
operations and to evaluate the effectiveness of register promotion in (BODÍK; GUPTA; SOFFA, 1999). Instruction removal is obtained by a lexical load-reuse analysis, in which only loads with identical names or identical syntax-tree structure (record fields) can be detected as equivalent. The method revealed a significant occurrence of redundant load instructions (55% on average) in the SPEC95 benchmarks.
3.4 Hardware support for Redundancy Elimination
A number of prior proposals have exploited hardware support to detect and avoid redundant operations and to exploit dynamic redundancy. Most existing methods stem from two major approaches: Dynamic Instruction Reuse (DIR) and Data Value Prediction (DVP). The method proposed in this work differs from previous techniques in many aspects. Unlike most of them, it discovers redundancies arising from the interaction between arithmetic and memory values across the multiple procedures of the program, without separating the tasks of arithmetic and memory redundancy detection. Load-reuse analysis provides an evaluation only for dynamic memory operations. Moreover, the known methods either use dynamic profiling to estimate dynamic redundancy or trace redundancies in a representative segment. Dynamic instruction reuse and value prediction focus on specific forms of dynamic redundancy and are prone to report redundancies between instructions that merely happen to handle the same contents.
To the author's knowledge, there is no whole-application method that detects memory and arithmetic dynamic instructions that use the same operands as in a previous execution, a similarity that becomes visible only at run time because their addresses match. Such an approach would ultimately yield a practical and more precise approximation to the upper bound of exploitable redundancies, as well as a reference point for optimization effectiveness evaluation based on the remaining unexploited redundancy potential. The information produced by this approach would also give a better picture of the hotspots to be targeted, and would be valuable for extracting the most from available reuse buffers and value prediction tables. Benefits are achieved in terms of a lower total number of executed instructions, implying performance gains and a potential decrease in power consumption.
The approach consists of a method based on observing the stream of memory references and executed instructions. The objective is to find an approximation to all instruction redundancies visible at run time, using the executed instructions as input. To detect redundancies, a new take on the local value numbering (LVN) algorithm, which discovers redundancies and folds constants, was developed. Inspired by dynamic redundancy detection methods, the algorithm is redesigned to be applied over a stream of instruction executions and is, for this reason, inherently interprocedural: the method reduces the difficulties encountered in value numbering of extended basic blocks (ALPERN; WEGMAN; ZADECK, 1988a) to the analysis of one large basic block. The method handles each memory operation as a variable assignment, and each instruction that performs an arithmetic operation as a usual operation in local value numbering. It relies on mapping the memory addresses held in registers into identification numbers (value numbers), assigned through value numbering. These numbers are used to find redundant computation, and their dynamics prevent operations that merely happen to handle the same content from being taken as redundant, since they would not hold the same value number. The goal is to produce a redundancy report that identifies each instruction redundantly executed in the specific execution path.
Inefficiencies found through this evaluation do not necessarily mean that the compiler could generate better code. For example, consider loop-invariant code in which the calculated expression is a single instruction; a compiler might (justifiably) recalculate the expression to avoid keeping a register occupied throughout the loop. The method allows evaluating, against an upper-bound limit, the effectiveness of the optimizations with which a program was compiled. Comparing programs generated with different and potentially interfering optimizations makes it possible to measure the overall effectiveness of removal methods in terms of the total instruction redundancy exhibited at run time. By correlating redundancy detection with the instruction's PC, an evaluation of value locality for the whole execution trace is also achieved. It is worth noticing that the proposed method relies on the availability of an instrumentation framework to intercept and evaluate the instructions and memory references being executed.

The following sections propose a method based on a dynamic extension of the value numbering algorithm that uses run-time information to identify dynamic instances of redundant memory and arithmetic operations.
4.2 Dynamic Value Numbering Algorithm (DVN)
The goal of a classical value numbering algorithm is to recognize redundancy among expressions that are lexically different but certain to compute the same value. Traditionally, this is achieved by assigning symbolic names (value numbers) to expressions. If the value numbers of the operands of two expressions are identical, and the operators applied by the expressions are identical, then the expressions receive the same value number and are certain to produce the same result. A dynamic version of the value numbering algorithm detects instances of redundant instructions by analyzing a stream of executed instructions, obtained by running an executable generated by a compiler for a given code, architecture, and input. The goal of such a dynamic extension is to detect instruction reuse based on information about the executed instructions, the registers involved, and the memory references. Since the equivalences are based on the sources of the values, not the literal values themselves, the technique does not report spurious redundancies when two instructions merely manipulate the same bit pattern. The approach extends local value numbering and makes an interprocedural analysis possible, for the execution trace is seen as one large basic block. The method is comprehensive in the sense that it treats scalar, array, and pointer-based loads uniformly, and it offers a way to determine the run-time optimization benefit.
In a dynamic value numbering algorithm, variable assignment is captured through references to memory. For the first occurrence of each memory access instruction, a new value number is assigned to the accessed address, just as the classical algorithm does for variables in source code. By tracking the value number associated with the content of each memory address, it is possible to associate the operands of an arithmetic instruction with the value number list and check whether the computation was previously performed. As in the classical algorithm, the dynamic one tests, for each arithmetic instruction, whether the computation over the value numbers involved was previously performed. This method captures redundancies in arithmetic operations and provides an upper bound for dynamic load and store reuse.
The proposed dynamic value numbering algorithm has as input a stream of executed instructions and produces as output a redundancy report, with the redundant instructions identified and statistics of redundancy occurrence. The algorithm relies on three key data structures. The first is an array relating each register number to that register's associated value number, as illustrated in Figure 25, where n is the number of registers available in the architecture.

[Figure: the register–value number array, mapping r0 → #1, r1 → #2, r9 → NULL, …, r(n-1) → NULL.]

Fig. 25: Registers and value numbers mapping.
In addition, two tables holding the value numbers and the memory accesses are used. The algorithm also relies on four key procedures, one for each of the following sets of instructions: load, store, arithmetic, and others. The load set includes all variants of load instructions in the architecture's instruction set. The store set includes instructions that store a register's content into memory. The arithmetic set comprises all arithmetic instructions, divided into two subsets, commutative and non-commutative. The remaining set, others, contains all instructions not in any of the other sets; they are classified as unsupported. The algorithm requires instruction decoding and classification into the four sets, as well as the physical address accessed by each memory instruction. It then handles each instruction according to its set, so that it becomes able to detect in the instruction stream whether a given instance had been previously performed. This identification is made by means of the procedures shown in Algorithms 4.1, 4.2, and 4.3.
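The classification step that the procedures assume can be sketched as follows; the mnemonics are hypothetical PowerPC-style examples, not the actual tables used in this work.

```python
# Hypothetical opcode tables; a real implementation enumerates the
# target ISA's full load, store, and arithmetic variants.
LOAD_SET = {"lwz", "ld", "lbz"}
STORE_SET = {"stw", "std", "stb"}
ARITH_SET = {"add", "mul", "subf"}

def classify(opcode):
    """Map a decoded opcode into one of the four DVN instruction sets."""
    if opcode in LOAD_SET:
        return "load"
    if opcode in STORE_SET:
        return "store"
    if opcode in ARITH_SET:
        return "arithmetic"
    return "other"  # branches, compares, cache management, ...

assert classify("lwz") == "load"
assert classify("bne") == "other"
```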
4.2.1 Redundant Memory Operation
The approach classifies redundant memory operations into the following sets:

1. Fully-redundant load: an instruction that loads a value already held in the target register;

2. Redundant load: a load that, the last time it accessed this address, fetched the same value;

3. Redundant store: an instruction that writes a value to a memory location already holding the identical value.
Fully-redundant loads and redundant stores are instructions that can be skipped entirely with no impact on program behavior. Redundant load detection finds multiple load instructions that always refer to the same memory location x with no intervening (killing) store to that address. To do so, the algorithm determines whether a store instruction on a path from the redundant instruction's program counter PCr to the previous occurrence of the instruction PCn modifies x. There may have been intervening accesses to other addresses; in this case, the same content will be fetched, a redundancy that is not necessarily captured by value prediction approaches.
4.2.1.1 Redundant Load
Algorithm 4.1 shows the procedure for handling load instructions. It requires the instruction's details, such as the physical address that data is being loaded from and the target register. The procedure works with the three main data structures: RegVn, the register–value number array; VN, the table holding the value numbers; and AT, the access history.

The first steps consist of obtaining the instruction opcode OP, the physical address RA, and the target register RT. Next, the tuple T is set with the instruction opcode OP and physical address RA. The algorithm then searches for T in the VN table, checking whether a value number had already been assigned to T, which would indicate that the address was accessed by a previous equivalent memory operation. If T is not in the VN table, it is tested whether a previous memory operation is known for RA. If RA was the target of a store instruction, the operation refers to content that was spilled to memory and is now being reloaded: the value number i from the previous access is obtained and used to update RegVn, the access details are added to AT, and the instruction is marked as a spill. Otherwise, as no value number is known for RA, this first access of its type is used to update the data structures: the greatest value number i in the VN table is obtained and incremented, the new value is mapped to T in the VN table, and the occurrence of an access to RA, along with the access set (LOAD SET), is added to the addresses table AT. If T is found in VN, the previous access to RA is retrieved from AT. If the access is in the LOAD SET, then
Algorithm 4.1 Algorithm to detect redundant load operation in an execution stream.
Require: instruction, physical address
 1: RA ← get physical address
 2: RT ← get target register
 3: Set tuple T ← ⟨OP, RA⟩
 4: Search for T in VN
 5: if T ∉ VN then
 6:   Get type of previous access to RA in AT
 7:   if Previous access ∈ STORE SET then
 8:     Get value number i of previous access
 9:     Add ⟨T, i⟩ into VN
10:     RegVn[RT] ← i
11:     Add ⟨RA, RegVn[RT], LOAD SET⟩ to AT
12:     Mark instruction as spill
13:   else
14:     Get greatest value number i in VN
15:     Add ⟨T, i+1⟩ into VN
16:     RegVn[RT] ← i+1
17:     Add ⟨RA, RegVn[RT], LOAD SET⟩ into AT
18:   end if
19: else
20:   Get type of previous access to RA in AT
21:   if Previous access ∈ LOAD SET then
22:     Get value number i of previous access
23:     if RegVn[RT] = i then
24:       Mark instruction as fully-redundant
25:     else
26:       RegVn[RT] ← i
27:       Mark instruction as redundant load
28:     end if
29:   else
30:     Add ⟨RA, RegVn[RT], LOAD SET⟩ to AT
31:     RegVn[RT] ← i
32:     Mark instruction as spill
33:   end if
34: end if
the memory content has not been changed since the previous operation. The value number i associated with the previous access is compared to the value number in RegVn[RT]. If they are the same, the content being loaded is already in the target register, and the instruction is marked as fully redundant; otherwise, the value number i is used to update RegVn[RT] and the instruction is marked as a redundant load. If the previous access is not in the LOAD SET, the operation refers to the load of a previously spilled value: the previous access changed the memory content and, therefore, the related value number. In this case, the value number i previously assigned is used to update RegVn, the access details are added to AT, and the instruction is marked as a spill.
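The walkthrough above can be condensed into a small sketch. This is a hypothetical, simplified Python rendering of Algorithm 4.1 (dictionary-based tables, invented mnemonics and tags), not the actual implementation.

```python
reg_vn = {}        # RegVn: register number -> value number
vn_table = {}      # VN: (opcode, address) -> value number
access_table = {}  # AT: address -> (value number, "LOAD" or "STORE")

def handle_load(op, addr, rt):
    """Classify one load instruction, mutating the three tables."""
    key = (op, addr)
    if key not in vn_table:
        prev = access_table.get(addr)
        if prev is not None and prev[1] == "STORE":
            # Reload of content previously spilled to memory.
            vn_table[key] = prev[0]
            reg_vn[rt] = prev[0]
            access_table[addr] = (prev[0], "LOAD")
            return "spill"
        # First access of this kind: assign a fresh value number.
        i = max(vn_table.values(), default=0) + 1
        vn_table[key] = i
        reg_vn[rt] = i
        access_table[addr] = (i, "LOAD")
        return "new"
    prev_vn, prev_kind = access_table[addr]
    if prev_kind == "LOAD":
        if reg_vn.get(rt) == prev_vn:
            return "fully-redundant"   # value is already in the register
        reg_vn[rt] = prev_vn
        return "redundant-load"
    reg_vn[rt] = prev_vn               # reload of a spilled value
    access_table[addr] = (prev_vn, "LOAD")
    return "spill"

assert handle_load("lwz", 0x1000, 3) == "new"
assert handle_load("lwz", 0x1000, 3) == "fully-redundant"
```

Loading the same address into a different register would instead be tagged a redundant load, matching the classification in Section 4.2.1.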
4.2.1.2 Redundant Store
Algorithm 4.2 Algorithm to detect redundant store operation in an execution stream.
Require: instruction, physical memory reference
 1: WA ← get physical address
 2: RS ← get source register
 3: Set tuple T ← ⟨OP, WA⟩
 4: if RegVn[RS] ≠ null then
 5:   Get value number i and instruction type TYPE
 6:   of previous access to WA in AT
 7:   if TYPE ∈ STORE SET then
 8:     if i ≠ RegVn[RS] then
 9:       Add ⟨T, i⟩ into VN
10:       Add ⟨WA, RegVn[RS], STORE SET⟩ into AT
11:     else
12:       Mark instruction as redundant
13:     end if
14:   else
15:     Add ⟨T, RegVn[RS]⟩ into VN
16:     Add ⟨WA, RegVn[RS], STORE SET⟩ to AT
17:   end if
18: else
19:   Get greatest value number i in VN
20:   Add ⟨T, i+1⟩ into VN
21:   RegVn[RS] ← i+1
22:   Add ⟨WA, RegVn[RS], STORE SET⟩ to AT
23: end if
Algorithm 4.2 details the handling of store instructions; it is similar to the load case. The first steps obtain the instruction opcode OP, the physical address WA, and the source register RS. The tuple T is set with the instruction opcode OP and physical address WA. Next, it is tested whether RegVn[RS] is mapped to a value number related to the result of a previous instruction. If so, the value number i and the instruction type TYPE of the previous equivalent access to WA are obtained from the AT table. If the previous access to WA is in the STORE SET, the instruction is potentially an equivalent occurrence of a previously executed instruction; to check this, it is tested whether i matches the value number in RegVn[RS]. If not, the pair T and i is added into VN, and the WA access details are added into AT. Otherwise, if the value number i matches the one in RegVn[RS], the operation had already been performed and the content in WA has not been changed, so the instruction is marked as redundant. If the previous access is not in the STORE SET, the operation writes to memory the result associated with the value number in RegVn[RS]; the pair T and RegVn[RS] is added into VN, and the access details are added to the AT table. Finally, there is the case where no value number mapping exists in RegVn[RS]. This indicates that a new content is being stored and a new value number should be associated: the greatest value number i in VN is obtained, incremented, paired with T, and added to the VN table; RegVn[RS] is mapped to i+1; and the access details are added to the AT table.
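The store case can be sketched in the same style; again this is a hypothetical, simplified Python rendering of Algorithm 4.2, with dictionary tables and invented mnemonics.

```python
reg_vn = {}        # RegVn: register number -> value number
vn_table = {}      # VN: (opcode, address) -> value number
access_table = {}  # AT: address -> (value number, "LOAD" or "STORE")

def handle_store(op, addr, rs):
    """Classify one store instruction, mutating the three tables."""
    key = (op, addr)
    i = reg_vn.get(rs)
    if i is None:
        # No value number known for the source register: mint one.
        i = max(vn_table.values(), default=0) + 1
        reg_vn[rs] = i
        vn_table[key] = i
        access_table[addr] = (i, "STORE")
        return "store"
    prev = access_table.get(addr)
    if prev is not None and prev[1] == "STORE" and prev[0] == i:
        # The location already holds exactly this content.
        return "redundant-store"
    vn_table[key] = i
    access_table[addr] = (i, "STORE")
    return "store"

assert handle_store("stw", 0x2000, 5) == "store"            # fresh content
assert handle_store("stw", 0x2000, 5) == "redundant-store"  # same content
```

A redundant store is one that can be skipped outright, since the memory location already holds the value number being written.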
4.2.2 Redundant Common-subexpression
Algorithm 4.3 depicts the procedure for handling instructions of the arithmetic set. It has the same inputs and outputs as the previous procedures. It is worth noticing that the arithmetic operations addressed have two mathematical operands, and that one of these operands might be encoded in the opcode itself; in that case, just one register operand is involved. The algorithm's first step is to get the operand registers. Value i is set with the proper RegVn element's value, as is j, if necessary.

The tuple T, which serves the purpose of identifying redundancies, is then built: if there is only one register operand, it is arranged from the instruction opcode OP, the involved register operand's value number, and the non-register operand, which is a constant. The constant is replaced by the other register operand's value number if both the register and
Algorithm 4.3 Algorithm to detect redundant scalar operation in an execution stream.
Require: instruction
 1: RT ← get target register
 2: RA ← get operand A register number
 3: RB ← get operand B register number
 4: i ← RegVn[RA]
 5: j ← RegVn[RB]
 6: if i ≠ null and j ≠ null then
 7:   Get instruction opcode OP
 8:   if OP ∈ commutative set then
 9:     if i ≤ j then
10:       Set tuple T ← ⟨OP, j, i⟩
11:     else
12:       Set tuple T ← ⟨OP, i, j⟩
13:     end if
14:   else
15:     Set tuple T ← ⟨OP, i, j⟩
16:   end if
17:   Search T in VN
18:   if T ∉ VN then
19:     Get greatest value number k in VN
20:     Add ⟨T, k+1⟩ in VN
21:   else
22:     Get value m associated to T in VN
23:     RegVn[RT] ← m
24:     Mark instruction as redundant
25:   end if
26: else
27:   RegVn[RT] ← null
28: end if
its associated value number exist.

Next, it is tested whether a value number is known for all the register operands. If so, the next step tests whether the instruction's opcode OP belongs to the commutative set. If it does, the operands of tuple T are rearranged into a canonical order; otherwise, tuple T keeps the original operand order. This ensures that redundant commutative instructions are detected as such even if the operand order does not match. An occurrence of the tuple T is then searched for in VN. If it is not found, the operation had not been previously performed; in this case, the greatest value number k in VN is obtained and incremented, and the pair of T and the new value number is added to VN. An occurrence of T in VN, however, indicates that the same operation had been previously performed with the same operands and that the result is known; in this case, the value number m associated with the occurrence of T in VN is obtained, RegVn[RT] is mapped to the value number m, and the instruction is marked as redundant. Finally, there is the case where a value number associated with at least one operand is not known; it is then not possible to detect whether the operation corresponds to a previous occurrence, and RegVn[RT] is updated to reflect this.
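The arithmetic case, including the canonicalization of commutative operands, can be sketched as follows; this is a hypothetical, simplified Python rendering of Algorithm 4.3, with an invented commutative set and a fresh-value-number counter.

```python
reg_vn = {}    # RegVn: register number -> value number (None if unknown)
vn_table = {}  # VN: (opcode, vn_a, vn_b) -> value number of the result
COMMUTATIVE = {"add", "mul"}  # assumed set of commutative opcodes
next_vn = 0

def fresh_vn():
    global next_vn
    next_vn += 1
    return next_vn

def handle_arith(op, rt, ra, rb):
    """Classify one two-operand arithmetic instruction."""
    i, j = reg_vn.get(ra), reg_vn.get(rb)
    if i is None or j is None:
        reg_vn[rt] = None            # operand untracked: give up on rt
        return "unknown"
    # Canonical operand order makes add r1,r2 and add r2,r1 hash alike.
    operands = tuple(sorted((i, j))) if op in COMMUTATIVE else (i, j)
    key = (op,) + operands
    if key in vn_table:
        reg_vn[rt] = vn_table[key]   # the result is already known
        return "redundant"
    vn_table[key] = reg_vn[rt] = fresh_vn()
    return "new"

reg_vn[1], reg_vn[2] = fresh_vn(), fresh_vn()  # operands defined earlier
assert handle_arith("add", 3, 1, 2) == "new"
assert handle_arith("add", 4, 2, 1) == "redundant"  # commutative match
```

A non-commutative opcode with swapped operands would hash to a different key and be reported as new, as intended.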
4.2.3 Unsupported Instructions
The proposed algorithm is intended to operate on a complete, whole-application instruction trace. As such, there are instructions to be processed that are not in any of the defined sets, namely the load, store, and arithmetic sets. These are instructions such as branches, compares, data cache management, and others. The interaction between such instructions and the value number table depends on the specific instruction set used. An implementation of the algorithm has to cover all instructions that potentially interfere with the mapping between registers and value numbers. Instructions that create equivalent mappings should properly update the RegVn array. Specialized, architecture-dependent instructions that perform intrinsic arithmetic operations need to be mapped to one of the sets. Instructions for which such a mapping is not possible should clear RegVn in a pessimistic fashion.
Figure 26 illustrates the resulting VN and AT tables when the method is applied to the execution stream on its left, which clearly contains redundant instruction executions. The process of building the VN table can be seen as the generation of a Directed Acyclic Graph (DAG), where nodes represent instructions and are annotated, for each instance, with the associated value number (on top of the node), the memory address that holds the instruction's resulting content (arrow), and the register associated with the value number (arrow), when known. Figure 27(c) shows the resulting DAG after applying the method to the instruction stream in Figure 26.
[Figure: three snapshots of the DAG built by the algorithm, with add and subf nodes annotated with value numbers 1 through 8, registers r0 and r9, and memory addresses 0x*270 through 0x*28C: (a) DAG after the first 3 instructions; (b) DAG after the first 12 instructions; (c) DAG for the whole execution trace.]

Fig. 27: Steps building a DAG with dynamic value numbering algorithm.
4.3 Value-based Hotspot and h-index
The output of the dynamic value numbering algorithm consists of the identification of redundant computations. By binding redundancy occurrences to specific instances through the instruction's program counter, the algorithm identifies redundancy hotspots. The ratio between the total amount of redundancy detected per program counter (PC) and the number of unique value numbers found per PC provides a hotness index (h-index) for each redundant instruction. An instruction with a high redundancy count and a small number of unique value numbers (different possible outputs for the instruction) is said to be hot. The identification of redundant instructions over the whole program execution measures value locality and indicates hotspots for optimization; the h-index provides a measure of the profitability of each occurrence.
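Under these definitions, the h-index bookkeeping can be sketched as follows; the table names and the sample PC and value numbers are hypothetical.

```python
from collections import Counter, defaultdict

redundant_pcs = Counter()         # PC -> number of redundant executions
values_per_pc = defaultdict(set)  # PC -> distinct value numbers observed

def record_redundancy(pc, vn):
    redundant_pcs[pc] += 1
    values_per_pc[pc].add(vn)

def h_index(pc):
    """Hotness: redundancy count divided by distinct outputs at this PC."""
    return redundant_pcs[pc] / len(values_per_pc[pc])

# A PC redundant 6 times with only 2 distinct value numbers is hot.
for vn in [7, 7, 7, 9, 9, 9]:
    record_redundancy(0x4000, vn)
assert h_index(0x4000) == 3.0
```

Few distinct outputs and many redundant executions push the h-index up, flagging instructions that a reuse buffer or value predictor would profit from most.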
Algorithm 4.4 shows how hotspot identification is handled in dynamic value numbering. It takes as input an instruction marked as redundant and requires the instruction's program counter PC and the assigned value number VN. The algorithm uses two tables, one for redundant program counters and one for pairs of program counters and value numbers. Figure 28 illustrates the algorithm's operation over an arbitrary execution stream.
4.4 Dynamic Value Numbering and Unnecessary Spill
The dynamic value numbering algorithm is able to detect code spill, a behavior known to be unavoidable because of the limited number of registers available. An additional step can be added to the methodology to verify unnecessary spills. An unnecessary spill occurs when an unused register is available between the spill and the reload. The approach adapts the method proposed in (LARUS; CHANDRA, 1993) to operate with dynamic value numbering, determining, from the spills identified, how many could have been avoided. The algorithm operates on the three sets of instructions defined in the dynamic value numbering algorithm; Algorithm 4.5 is used to detect unnecessary spills. The field Access records when a register was last read or modified, and the field Spill records when
Algorithm 4.4 Algorithm to identify hotspots and generate the h-index.
Require: instruction, opcode (OP), assigned value number (VN)
 1: RP ← ∅
 2: RPV ← ∅
 3: Get redundant instruction opcode PC
 4: for all Instruction marked as redundant do
 5:   if PC ∈ RP then
 6:     Get counter i of PC
 7:     i ← i + 1
 8:     Update counter i for PC in RP
 9:     Set pair P ← 〈PC, VN〉
10:     if P ∈ RPV then
11:       Get counter j of P
12:       j ← j + 1
13:       Update counter j for P
14:     else
15:       Add P to RPV
16:     end if
17:   else
18:     Add PC to RP
19:     Set P ← 〈PC, VN〉
20:     Add P to RPV
21:   end if
22: end for
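The bookkeeping in Algorithm 4.4 can be transcribed directly into Python. The sketch below is illustrative only: the event-stream input format (an iterable of (PC, VN) pairs) and the final h-index computation are my own assumptions, not the thesis implementation.

```python
from collections import Counter, defaultdict

def identify_hotspots(redundant_events):
    """Count redundancy per PC and per <PC, VN> pair, then derive the
    h-index for each PC.

    `redundant_events` is an iterable of (pc, vn) tuples, one per
    instruction instance already marked as redundant (hypothetical
    input format).
    """
    rp = Counter()    # RP: redundancy count per program counter
    rpv = Counter()   # RPV: count per <PC, VN> pair
    for pc, vn in redundant_events:
        rp[pc] += 1
        rpv[(pc, vn)] += 1
    # h-index: total redundancy at a PC divided by the number of
    # unique value numbers observed at that PC.
    unique_vns = defaultdict(set)
    for (pc, vn) in rpv:
        unique_vns[pc].add(vn)
    return {pc: rp[pc] / len(unique_vns[pc]) for pc in rp}

# A PC that repeats a single value many times gets a high h-index;
# a PC that produces a different value each time stays at 1.0.
events = [(0x40, 7)] * 6 + [(0x80, 1), (0x80, 2), (0x80, 3)]
h = identify_hotspots(events)
```

Here the instruction at PC 0x40 is "hot" (six redundant instances, one distinct result), while PC 0x80 is not.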
it was spilled to the stack frame. When a spilled value is reloaded into register s, any register
r whose access time precedes its spill time has been unused since its spill and might have held
the value instead of forcing a spill. Register r may be live, in which case the allocator's choice
of register s is defensible. However, if the value in r is subsequently redefined before being
used (i.e., it is dead), the allocator spilled the wrong variable. The method aims to provide
support for dead store elimination (DSE). A representation of the algorithm's operation is
shown in Figure 29.
Fig. 28: Operation of the algorithm to identify hotspots.
Algorithm 4.5 Algorithm to detect unnecessary spill.
Require: instruction type TP, instruction counter PC
 1: if TP ∈ LOAD SET then
 2:   Get target register RT
 3:   Access[RT] ← instruction counter
 4:   if TP is marked as spill then
 5:     for all Register i do
 6:       if Access[i] < Spill[i] then
 7:         SpillCandidate[i] ← SpillCandidate[i] ∪ RT
 8:       end if
 9:     end for
10:   end if
11:   if SpillCandidate[RT] ≠ ∅ then
12:     Mark instruction as unnecessary spill
13:   end if
14:   SpillCandidate[RT] ← ∅
15: end if
16: if TP ∈ STORE SET then
17:   Get target register RT
18:   Spill[RT] ← instruction counter
19: end if
20: if TP ∈ ARITHMETIC SET then
21:   Get all registers REG involved
22:   for all REG do
23:     Access[REG] ← instruction counter
24:   end for
25:   if SpillCandidate[RT] ≠ ∅ then
26:     Mark instruction as unnecessary spill
27:   end if
28:   SpillCandidate[RT] ← ∅
29: end if
The identification of unnecessary spills as proposed in Algorithm 4.5 is, once again, related
to an upper bound. The algorithm flags as an unnecessary spill the situation in which a register
r is spilled at PC1 and, at a later point PC2 in the execution, another register is defined while r
goes unused between PC1 and PC2. This detection is strictly valid only if r is actually dead
along all paths after PC1. Another caveat is that many registers can be dead during the spill,
so the identification is useful mainly as a reference number for potential opportunities missed
by the compiler.
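The core test, reading the Access and Spill timestamps on each reload, can be sketched as follows. This is a simplified model under assumed conventions: the trace format (counter, kind, register, spill flag) and the decision to flag the reload itself are my own choices, and the per-register candidate sets of Algorithm 4.5 are collapsed into a single check.

```python
def detect_unnecessary_spills(trace):
    """Replay a simplified instruction trace and flag reloads for which
    some register sat idle since its own spill (so that spill was
    potentially avoidable).

    `trace` items are (counter, kind, reg, is_spill) tuples, where
    `kind` is 'load', 'store', or 'arith' (hypothetical format).
    """
    access = {}        # last counter at which each register was touched
    spill = {}         # counter at which each register was spilled
    unnecessary = []
    for counter, kind, reg, is_spill in trace:
        if kind == 'load':
            access[reg] = counter
            if is_spill:
                # Reload: any register not accessed since it was spilled
                # could have kept its value in place of a spill.
                idle = [r for r in spill if access.get(r, -1) < spill[r]]
                if idle:
                    unnecessary.append(counter)
        elif kind == 'store' and is_spill:
            spill[reg] = counter
        elif kind == 'arith':
            access[reg] = counter
    return unnecessary

# Hypothetical trace: r1 is spilled at counter 1, never touched again,
# and a reload happens at counter 3 -- the spill was avoidable.
trace = [(1, 'store', 'r1', True),
         (2, 'arith', 'r2', False),
         (3, 'load', 'r3', True)]
flagged = detect_unnecessary_spills(trace)
```

As in the thesis' caveat, the check is an upper bound: an "idle" register may still be live on another path, so flagged reloads are candidates, not proofs of a wrong allocator decision.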
Fig. 29: Representation of the algorithm to detect unnecessary spill.
4.5 Discussion

The Dynamic Value Numbering (DVN) methodology as designed yields the detection of
instruction reuse and value locality. An effectiveness analysis based on DVN returns a report
with redundancy-weighted program counters, which estimates the upper bound for optimization
and indicates the most profitable missed opportunities. The report can be useful in several
ways, as discussed below.
On the compiler side, the detection of reuse is profitable even when register promotion is
prevented (due to aliasing or lack of registers). When promotion is unsafe due to interfering
stores, the redundant load can be replaced with a data-speculative load, which works as a
register reference when the kill did not occur, but as a load when it did. When registers are not
available, instruction reuse information can be exploited using software cache control (BODÍK;
GUPTA; SOFFA, 1999). In addition, by directing which loaded values remain in the cache and
which bypass it, the compiler can improve on the suboptimal hardware cache replacement
strategy (BODÍK; GUPTA; SOFFA, 1999). Profile-guided optimization can also benefit, since the
methodology indicates hotspots by binding program counters to occurrences of redundancy.
Although these sequences are suboptimal for only one path, the use of representative input
sets ensures that an expected general behavior is evaluated. The information can be used to
audit, and potentially tune, the compiler in suboptimal situations, or it can feed a
hardware-based mechanism. In hardware support for redundancy removal, DVN can be used
as a mechanism to populate a reuse buffer. Another approach is related to value prediction:
value locality is measured through the hotspot identification, and the h-index can be exploited
as a measure of predictability when speculating on an instruction's expected value.
These ideas are explored in more detail in the next chapter through the implementation
of the Dynamic Value Numbering. The implementation of the algorithms is integrated with
a full-system simulator and used as a framework to evaluate optimization effectiveness and
hotspot identification over whole application executions.
Fig. 36: Redundancy evaluation scheme with validation and reuse support.
applications. The approach is to evaluate the effectiveness of the multiple levels of optimization
applied by a commercial compiler. The value locality analysis is presented, along with a
discussion of how this approach can be used in mechanisms to exploit optimization potential
that remains unexploited.
6 Case Study

The number of redundant instructions identified by applying the method can be used
as a reference point in measuring how effective the optimizations applied to a code were.
This chapter presents a case study that evaluates the GNU C/C++ compiler when optimizing
executable codes of some of the applications available in the SPEC CPU2006 benchmark
suite. The goal is to measure redundancies that can be detected using the proposed method,
but that are not detected by GCC. For the study, GCC version 4.3.2 was fully ported to the
instruction set used and integrated with the simulator's toolchain. Applications of SPECInt
2006 were obtained as source code and compiled with the ported compiler using the available
toolchain.
The data provided proves valuable in multiple ways. The redundancy report produced
by the framework yields a reference point to compare the multiple optimization levels
available in the compiler. The redundancy histogram paints a picture of hotspots, pointing
to sources of inefficiency. An example of how such hotspots can be used to correlate
suboptimal sequences of generated code with source-level constructs is discussed. The
example shows the benefits of using the approach for auditing problems in the compiler or
for improving the source code.
On the other hand, the study also explores the opportunities related to hardware support
for redundancy elimination. The hotness index (h-index), along with the value locality report,
indicates the most profitable opportunities for value prediction and reuse. In order to
investigate the potential of instruction reuse, the study first shows the number of instructions
that can be successfully skipped when all the results of identified redundant instructions can
be retrieved from the value number hash table, which is theoretically unlimited. Then, taking
a realistic approach, it discusses how much of this potential can be exploited when a limited
number of entries is allowed in the value number hash table. In this case, the Dynamic Value
Numbering algorithm is used to populate an instruction reuse buffer. The study shows
results for numbers of entries comparable to those seen in the literature, feasible in practice,
or already commercially implemented. The gains achieved, given in terms of instruction
count reduction, are shown and compared to other available schemes. The chapter also includes
a discussion on exploiting value locality for value prediction, on what the accuracy
level would be when predicting the top n most repeated values, and on how these numbers
compare to those seen in the literature.
6.1 Evaluation Target
6.1.1 GNU Compiler Collection (GCC)
The GNU Compiler Collection (GCC) started as a modest C compiler and has grown
over the last 20 years to become one of the most popular compilers available. The compiler
now supports several languages (C, C++, Objective-C, Objective-C++, Java, Ada, Fortran 95,
etc.) and about 30 different architectures (STALLMAN, 2009). GCC has vast support for code
optimization and implements several analyses. Since version 4, GCC makes extensive use of
SSA form, and most high-level optimizations are applied on SSA.
GCC's optimization level is set through the -O switch. Without any optimization, the
result is the fastest compile time, but absolutely no attempt is made to optimize the code.
Programs are larger and slower than their optimized versions. Debugging an unoptimized
compiled program is comparatively easy, since there is a straightforward relation between each
source statement and a block of generated code. When debugging, stopping the execution flow,
assigning values to any variable, or changing the content of the program counter register are
all possible. With optimization enabled, the compiler attempts to improve performance
at the expense of compilation time and, possibly, of the ability to debug the program.
The main optimization options are -O0, -O1, -O2, and -O3. No optimization is performed if
-O0 is specified or if no optimization option is given. The specific set of optimizations changes
from release to release and from language to language. Optimization levels for C are described
in Table 1.
Tab. 1: GCC optimization levels.
-O0  No optimization (the default); generates unoptimized code but has the fastest compilation time.
-O1  Compilation takes a little more time, and a lot more memory for large functions; debugging behaves as expected.
-O2  Full optimization; generates highly optimized code and has slower compilation time.
-O3  Full optimization as in -O2, yet slower; also uses more aggressive automatic inlining of subprograms within a unit (for interprocedural analysis) and attempts to vectorize loops.
Testing has been performed extensively at all optimization levels, and reported bugs are
uncommon. No correlation has been verified between reliability and optimization level:
some bugs are reported only when optimization is on, just as others are reported only when
no optimization is set.
6.1.2 SPEC CPU2006 Benchmark Suite
Standard Performance Evaluation Corporation (SPEC) benchmarks are widely used to
evaluate the performance of computer systems for the last few decades. The SPEC CPU
benchmarks are widely used in both industry and academia to evaluate CPU, memory and
compiler. SPEC announced on August, 2006 the SPEC CPU 2006 to replace CPU2000. The
new suite is much larger than the previous, and exercises aspects of CPUs, memory systems,
and compilers, especially C++ compilers. The suite has 7 C++ applications, including one
with half million lines of C++ code. Fortran and C are also well represented. SPEC CPU
6.1 Evaluation Target 98
benchmarks are derived from real life applications, rather than using artificial loop kernels or
synthetic benchmarks. Technical details regarding benchmark behavior and profiles have been
appearing in several publications, such as in (HENNING, 2006). A summary of the application
in SPEC CPU 2006 is shown in Table 2. Detailed description of each application used in this
study is shown in Appendix A.
Tab. 2: SPECInt 2006 application set.
400.perlbench (C) — Programming language: derived from Perl V5.8.7.
401.bzip2 (C) — Compression: Julian Seward's bzip2 version 1.0.3, modified to do most work in memory, rather than doing I/O.
403.gcc (C) — C compiler: based on gcc version 3.2; generates code for Opteron.
429.mcf (C) — Combinatorial optimization: vehicle scheduling; uses a network simplex algorithm (also used in commercial products) to schedule public transport.
445.gobmk (C) — Artificial intelligence (Go): plays the game of Go, a simply described but deeply complex game.
456.hmmer (C) — Gene sequence search: protein sequence analysis using profile hidden Markov models (profile HMMs).
458.sjeng (C) — Artificial intelligence (chess): a highly ranked chess program that also plays several chess variants.
462.libquantum (C) — Physics / quantum computing: simulates a quantum computer running Shor's polynomial-time factorization algorithm.
464.h264ref (C) — Video compression: a reference implementation of H.264/AVC; encodes a video stream using 2 parameter sets. The H.264/AVC standard is expected to replace MPEG2.
471.omnetpp (C++) — Discrete event simulation: uses the OMNeT++ discrete event simulator to model a large Ethernet campus network.
473.astar (C++) — Path-finding algorithms: pathfinding library for 2D maps, including the well-known A* algorithm.
483.xalancbmk (C++) — XML processing: a modified version of Xalan-C++, which transforms XML documents into other document types.
The current version of the benchmark suite, v1.1, released in June 2008, was obtained
as source code. The applications were compiled with the ported GCC (v4.3.2) compiler,
using versions of the configuration files provided with the suite, adapted to the simulator's
environment. Three executables were obtained for each application by varying the
optimization level among -O0, -O2, and -O3. The -O1 level is omitted, both because the
higher levels include the optimizations it deploys and for the sake of analyzing multiple
executables in reasonable time. This study focuses on the integer component of the
benchmark suite, since the fixed-point facility is the one supported in the framework
developed. Additionally, since the aim is an exhaustive evaluation of effectiveness and value
locality over whole application executions, this work also had to deal with subsetting and
input set reduction in order to make the intended analysis possible. The details of the
approach used follow.
6.1.2.1 Subsetting SPEC

SPEC CPU2006 has been frequently used in simulators for pre-silicon design analysis.
Partial use of benchmark suites by researchers is mainly related to simulation time
constraints, compiler difficulties, and library or system call issues. The run-time and memory
requirements of SPEC CPU benchmark programs have increased significantly to keep pace
with advances in technology. This means that, for studies that use cycle-accurate simulators,
it is virtually impossible to simulate all programs and input sets in a reasonable amount of
time. Structured and conscious subsetting of SPEC has been proposed as a way to obtain the
same or equivalent information from a smaller subset of representative programs.
The Euclidean distance between the benchmarks is used as a measure of dissimilarity
in (PHANSALKAR; JOSHI; JOHN, 2007). The single-linkage distance is computed to create a
dendrogram. The proposed dendrogram guides the selection of a subset when requirements
are imposed. If the simulation budget limits the set to just six benchmarks, then drawing a
vertical line at a linkage distance of 4, as shown in Figure 37 (extracted from (PHANSALKAR;
JOSHI; JOHN, 2007)), gives a subset of six benchmarks (k = 6) for SPECInt 2006. Drawing
a line at a point close to 4.5 yields a subset of four benchmarks (k = 4). Table 3 shows the
resulting subsets of the CINT2006 suite.
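The clustering step referenced above can be sketched in a few lines: compute Euclidean distances between benchmark feature vectors and agglomerate with single linkage until k clusters remain. The feature vectors below are made up for illustration; the real study clusters PCA-reduced performance-counter data.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(points, k):
    """Agglomerate until k clusters remain. Cluster-to-cluster distance
    is the minimum pairwise member distance (single linkage)."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]   # merge the two closest clusters
        del clusters[j]
    return clusters

# Hypothetical 2-D "benchmark characteristics": two tight groups,
# so cutting at k = 2 separates them cleanly.
pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
groups = single_linkage(pts, 2)
```

Choosing k corresponds to choosing where the vertical cut line is drawn on the dendrogram; a representative benchmark is then picked from each cluster.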
Fig. 37: Dendrogram showing similarity between CINT2006 programs.
Tab. 3: Representative subsets for SPECInt 2006.
Set of four programs: 400.perlbench, 462.libquantum, 473.astar, 483.xalancbmk.
Set of six programs: 400.perlbench, 471.omnetpp, 429.mcf, 462.libquantum, 473.astar, 483.xalancbmk.
For the evaluation, most of the SPECInt applications were successfully ported; the seven
applications in Table 4 are the ones used. From the six-application subset, the only one
missing is 483.xalancbmk, due to compilation difficulties.
6.1.2.2 Reduced Input Set

A SPEC CPU2006 application execution using the reference input data set represents a
few trillion instructions on average. This order of magnitude is above the number of
instructions that any known system simulator can handle within a reasonable time. In
addition, the total number of instructions in the execution stream of each application, which
translates into the size of the data structures handled by the dynamic value numbering
implementation, is a major constraint for the analysis. It has been shown, as in
(KLEINOSOWSKI; LILJA, 2002), how to make SPEC applications more suitable for system
simulators: the technique consists of using subsets of the reference input data sets in such a
way that the main application behavior and instruction distribution are preserved. These
ideas were applied to the input data sets of the SPECInt CPU2006 applications used in this
work, in order to make the analysis feasible. Table 4 shows the applications that were
successfully ported and the input sets used to obtain execution streams on the order of
hundreds of millions of instructions (see Appendix A for details on each application and its
input).
Fig. 42: Redundant arithmetic instructions per type (scalar, immediate, logical), normalized by instruction count of non-optimized code.
Tab. 8: Statistics of redundant arithmetic instruction detection, classified into non-redundant or redundant for each subset: scalar, immediate, and logical.
While optimization is not able to eliminate all the clusters (as would be expected from
the effectiveness results, which show that high redundancy rates are detected even in
optimized code), it does affect the shape of the clusters. Both height and width are lowered in
all applications, with a more accentuated difference between -O0 and either of the higher
levels. Cluster shapes for -O2 and -O3 are mostly unaltered. This shows that a redundant
construct at the executable code level is still present in every optimized code, but with fewer
instructions at higher optimization levels. The conclusion is that the optimizations deployed
are not able to eliminate the very core of redundancy: a group of few instructions that remain
responsible for producing the spikes in the value locality graph.
6.3.1.3 H-index

The h-index is plotted for each application and optimization level (in orange in all cases).
The area covered by the h-index graph indicates the expected number of instructions that
could be eliminated by reusing only one of the various contents each specific instruction
produces. An h-index that completely covers the redundancies of the corresponding program
counter indicates that a single value is seen in every occurrence of redundancy at that
program counter. The results vary per application. There are examples of highly profitable
clusters, as in benchmarks 400, 429, 458, and 462. In these cases, the main clusters (or the
only one, in the case of 458) are significantly covered by the h-index graph, indicating that
thousands of redundancy instances produce just a couple of different values. Other cases,
401, 445, and 475, show less profitable clusters, but considerable opportunities to exploit
single-value occurrences are still observed.
Redundancy clusters reveal opportunities for optimization. The numbers show that the
performance of all applications, measured in the total number of instructions executed, could
be significantly improved if the results of a few instructions were reused. This work presents
three approaches that can benefit from such knowledge: instruction reuse,
compiler/application auditing, and value prediction. The first approach is the deployment of
constrained dynamic value numbering as a mechanism to populate an instruction reuse buffer,
measuring the upper limit of redundancy elimination that can be exploited when the method
is used with a limited table. The second approach shows how the report relating redundancy
occurrences to program counters can be used to inspect the application's and the compiler's
work for missed optimization opportunities. Third, the h-index can be explored for value
prediction.
6.4 Instruction Reuse

As discussed in Section 3.4.1, instruction reuse is based on a method to populate a reuse
buffer and apply its information when a redundancy is detected. The operation of the Dynamic
Value Numbering methodology, especially considering the validation component, is closely
related to known mechanisms used in reuse buffers. However, for the effectiveness analysis
an unlimited value numbering table was allowed. This section presents the results of applying
the Dynamic Value Numbering algorithm as an instruction reuse method, and the exploitable
part of the theoretical upper limit found in the effectiveness analysis when a limited reuse
table is used.

The approach consists of limiting the total number of entries allowed in the value number
table and applying a replacement policy when the limit is reached. The implementation is the
one discussed in Section 5.3.1: the method is applied at the fetch stage and, upon redundancy
detection, the content of the target register is replaced and the execution entirely skipped.
Based on feasible reuse buffer sizes, three buffer sizes were selected: 256, 512, and 1024
entries.
The ability to successfully reuse an instruction based on constrained value numbering
depends on the number of non-redundant executed instructions between a first
(non-redundant) occurrence of an instruction (i.e., program counter) and its redundant
instances, and on the replacement policy.
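The constrained-table operation just described can be modeled as a small cache: look up a result by program counter and operand values, skip execution on a hit, and evict in FIFO order when the table is full. The class and function names below are mine, and this is a toy software model of the mechanism, not the simulator implementation.

```python
class ReuseBuffer:
    """Toy instruction-reuse buffer keyed by (PC, operand values)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}   # (pc, operands) -> cached result

    def lookup(self, pc, operands):
        return self.entries.get((pc, operands))

    def insert(self, pc, operands, result):
        if len(self.entries) >= self.capacity:
            # FIFO replacement: evict the oldest entry (dicts preserve
            # insertion order in Python 3.7+).
            self.entries.pop(next(iter(self.entries)))
        self.entries[(pc, operands)] = result

def execute(pc, operands, buf, compute):
    """Return (result, reused): skip `compute` on a buffer hit."""
    cached = buf.lookup(pc, operands)
    if cached is not None:
        return cached, True          # redundancy: execution skipped
    result = compute(*operands)
    buf.insert(pc, operands, result)
    return result, False

# Same PC and operands twice: the second execution is skipped.
buf = ReuseBuffer(2)
first = execute(0x10, (2, 3), buf, lambda a, b: a + b)
second = execute(0x10, (2, 3), buf, lambda a, b: a + b)
```

With a bounded capacity, a redundant instance is captured only if its entry has not been evicted since the first occurrence, which is exactly why the distance between occurrences and the replacement policy govern the achievable share of the upper limit.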
Statistics collected for the application set, for the different buffer sizes and a First In, First
Out (FIFO) replacement policy, are shown in Table 11 and graphically in Figure 51. The
proportion of the redundancy upper bound (unlimited table) captured with each size is also
presented.
Tab. 11: Redundant instruction (all sets) detection with constrained dynamic value numbering, showing the proportion of the amount detected compared to the upper limit.
Figure 52 and Table 12 show the number of redundant load instructions detected for each
buffer size compared to the upper limit; Figure 53 and Table 13 do the same for arithmetic
instructions.
6.4.1 Discussion

The statistics for redundancy detection under constrained operation of the dynamic value
numbering methodology also vary per application, but reveal that a significant share of the
redundancy upper limit can be realistically exploited. An average of ≈47% of the redundancy
upper limit, or ≈15% of total instructions, can be reused with a reuse buffer of 1024 entries.
The results also show that even with a 256-entry buffer, an average of ≈27% of the upper limit

Fig. 51: Redundant instruction (all sets) detection with constrained dynamic value numbering, showing the proportion of the amount detected compared to the upper limit.

Fig. 52: Redundant load instruction detection with constrained dynamic value numbering, showing the proportion of the amount detected compared to the upper limit.

Tab. 12: Redundant load instruction detection with constrained dynamic value numbering, showing the proportion of the amount detected compared to the upper limit.

Fig. 53: Redundant arithmetic instruction detection with constrained dynamic value numbering, showing the proportion of the amount detected compared to the upper limit.

Tab. 13: Redundant arithmetic instruction detection with constrained dynamic value numbering, showing the proportion of the amount detected compared to the upper limit.
behavior. Finally, case 475 shows a new instance of redundancy detected in strncpy in the C
standard library. The behavior is similar to case 401, again indicating an inefficiency that
goes beyond the compiler's scope.

The analysis of the examples shows that the methodology covers a wide range of situations.
Redundancies related to memory operations are observed as scalar ones. Sources of
inefficiency were detected both in the application's own constructs and in included and/or
linked libraries. These sources do not point to a fixable problem in the compiler or in the
application; rather, they bring valuable knowledge for understanding where inefficiencies
come from. The methodology also opens new possibilities for investigating inefficiencies
related to poorly written libraries.
The following section discusses the value of this knowledge for predicting the values used
by an instruction.
6.7 Value Prediction

The value locality clusters, along with the h-index, indicate instructions that can be
speculatively executed with the aid of highly frequent known values. The advantage of doing
so depends on the ability to predict an instruction's result. The h-index of an instruction is
the ratio between its total redundancy count and the number of unique value numbers it
generates. This ratio points to the expected number of instructions that could be avoided if
these unique results were used whenever they are available before an instruction executes.
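For a single-value predictor, the best achievable accuracy at a given program counter is simply the frequency of its most common result over all of its redundant instances. A sketch, with an assumed input format (a map from PC to the list of observed results):

```python
from collections import Counter

def best_single_value_accuracy(results_per_pc):
    """For each PC, the accuracy of always predicting its most frequent
    observed result: count(mode) / total occurrences."""
    accuracy = {}
    for pc, results in results_per_pc.items():
        counts = Counter(results)
        most_common_count = counts.most_common(1)[0][1]
        accuracy[pc] = most_common_count / len(results)
    return accuracy

# A PC yielding 7 on four of five instances is 80% predictable with a
# single-value predictor; one alternating between two values is 50%.
acc = best_single_value_accuracy({0x40: [7, 7, 7, 7, 9],
                                  0x80: [1, 2]})
```

This makes concrete why an instruction with at least two distinct value numbers caps single-value prediction accuracy at 50% when the values are equally frequent, as reported for the top-128 instructions in the study.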
The h-index can be used to evaluate the profitability of speculative value prediction. In
this case, it is used to identify values to be kept in a value prediction table. A common way to
evaluate value prediction is through prediction accuracy. Figure 57 shows, for the top-128
most frequently redundant instructions of optimized (-O3) codes, the average prediction
accuracy when only one of the known values is used in speculation.

The 128 program counters depicted in each subfigure are ordered from the most to the
least redundant. Table 15 gives details on the prediction accuracy: the share of the top-128
redundant instructions in the total redundancy count; the average, median, and standard
deviation of the prediction accuracy; and the total number of redundant instructions that
would be successfully predicted if one of the known results of the top-128 most redundant
instructions were used speculatively. The table shows that a relatively small number (128 in
this analysis) of static instructions (≈3.5% of all the program counters that execute
redundantly at least once) is responsible for more than 50% of redundant occurrences in
cases 400, 429, 458, and 462. A significant share is also observed for the other applications.
This information is in line with the value locality graphs obtained. The prediction accuracy
analysis shows that accuracy is limited to 50%, since at least two value numbers (different
results) are found for each of the top-128, with an average of 20%. Taking into account the
share in total redundancy, an average of 14% of
Fig. 57: Prediction accuracy for top-128 most redundant instructions.
6.8 Discussion on the Replicability of the Study
Tab. 15: Prediction accuracy for the top-128 most redundant instructions.
Columns: % of total redundant static instructions; % of total redundancy count; prediction accuracy (average, median, standard deviation); % prediction rate over all redundancy.
provements (e.g., reducing the instruction count by 0.5% may yield a 10% performance
improvement).
The identified hotspots allowed auditing the executables produced by GCC. The analysis
pointed out no clear error or omission by GCC. Rather, it shows that, in some cases, linked
libraries are the source of redundancies, or that one is dealing with a scenario where inherent
limitations leave an otherwise obvious optimization unrealized. The results are valuable, for
they identify the constructs ultimately responsible for redundancies and motivate new studies
on how these instances could be prioritized in a scheme that degrades the performance of some
selected parts in exchange for eliminating considerably redundant clusters. A comprehensive
cost analysis would be necessary in this case.
The methodology was also used to investigate using the identified hotspots to predict
specific instruction results. The approach was mainly motivated by the value of the h-index,
which shows that most of the redundant instructions produce a very limited number of known
results. The obtained results show limited benefit in predicting the top-128 most redundant
instructions, but indicate that a considerable number of redundancies can be predicted with
high accuracy (78% on average). The mechanism to exploit these opportunities is tightly
bound to the ability to pass the compiler a profile that indicates, based on the predicted
accuracy, the most profitable instructions to speculate on.
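Such a profile could, for instance, rank program counters by the product of their redundancy share and predicted accuracy. The selection rule, threshold, and field layout below are assumptions for illustration, not the thesis' mechanism:

```python
def select_for_speculation(profile, min_accuracy=0.5):
    """profile: {pc: (share_of_total_redundancy, predicted_accuracy)}.
    Return the PCs worth speculating on, ranked by expected payoff
    (share * accuracy), keeping only sufficiently accurate candidates."""
    ranked = sorted(
        ((share * acc, pc) for pc, (share, acc) in profile.items()
         if acc >= min_accuracy),
        reverse=True)
    return [pc for _, pc in ranked]

# hypothetical profile for three program counters
profile = {
    0x400a10: (0.30, 0.9),   # large share, accurate: good candidate
    0x400b20: (0.20, 0.4),   # too inaccurate: filtered out
    0x400c30: (0.10, 0.8),   # smaller share, still worthwhile
}
picks = select_for_speculation(profile)
```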
The following section summarizes the author's publications that contain the main contributions of this work and outlines future work related to each of them.
7.1 Summary of contributions and Future Work
The development of a first version of the dynamic value numbering algorithm and the
method's implementation, along with the results of the effectiveness evaluation for the selected
SPECInt applications, were presented by the author in (COSTA et al., 2012). The paper focused
on exposing the amount of unexploited opportunities as a limit study. As future work, the
author envisions porting and evaluating new benchmarks, including the floating-point
component of the SPEC CPU2006 benchmark suite. The effectiveness evaluation is also
suitable for comparing different optimized executables produced by the same compiler.
Unfortunately, the evaluation did not consider integration with the profile-guided optimization
tool available in GCC, a limitation due to the toolchain available in the simulator. However,
integrating the implementation with a binary instrumentation tool (Intel's Pin, for instance)
would allow studying the ability of GCC's profiling to eliminate the remaining redundancies.
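The core idea of dynamic value numbering can be sketched as follows. This is a minimal toy model, not the thesis' simulator-based implementation: value numbers name known results, and a dynamic instruction whose (opcode, operand value numbers) key was seen before is counted as redundant.

```python
class DynamicValueNumbering:
    """Toy dynamic value numbering over an instruction stream."""

    def __init__(self):
        self.table = {}     # (opcode, operand VNs) -> result value number
        self.reg_vn = {}    # register name -> current value number
        self.next_vn = 0
        self.redundant = 0  # dynamic instances with an already-known result

    def _fresh(self):
        self.next_vn += 1
        return self.next_vn

    def _value_of(self, reg):
        # unseen registers get a fresh value number on first use
        if reg not in self.reg_vn:
            self.reg_vn[reg] = self._fresh()
        return self.reg_vn[reg]

    def execute(self, opcode, srcs, dst):
        key = (opcode, tuple(self._value_of(r) for r in srcs))
        if key in self.table:
            self.redundant += 1            # result is already known
        else:
            self.table[key] = self._fresh()
        self.reg_vn[dst] = self.table[key]

dvn = DynamicValueNumbering()
dvn.execute("add", ["r1", "r2"], "r3")
dvn.execute("add", ["r1", "r2"], "r4")   # same opcode and inputs: redundant
```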
A second version of the dynamic value numbering algorithm, including the ability to
produce a value locality report and improved support for detecting redundant memory
operations (unnecessary spills, fully-redundant operations), is presented by the author in
(COSTA et al., a). The article presents the results of redundancy cluster identification and the
approaches to exploit value locality. It introduces the instruction reuse scheme based on
dynamic value numbering and discusses compiler auditing and support for value prediction.
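In this spirit, a fully-redundant load is one that fetches a value the analysis already associates with its address. A minimal sketch over a toy trace follows; the tuple format and function name are assumptions for illustration, not the article's implementation:

```python
def find_redundant_loads(trace):
    """trace: list of ("store", addr, value) / ("load", addr, value) events.
    A load is flagged as fully redundant when the address is already
    known to hold exactly that value."""
    known = {}         # addr -> last value stored or loaded there
    redundant = []     # indices of redundant loads in the trace
    for i, (op, addr, value) in enumerate(trace):
        if op == "store":
            known[addr] = value
        elif op == "load":
            if known.get(addr) == value:
                redundant.append(i)   # value already known: reusable
            known[addr] = value
    return redundant

trace = [("store", 0x10, 5), ("load", 0x10, 5), ("load", 0x10, 5)]
idxs = find_redundant_loads(trace)
```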
An accurate evaluation of the performance impact of eliminating the redundancies found
with the presented methodology is envisioned as future work. In the work in progress
(COSTA et al., b), this thesis' author develops a cycle-accurate analysis to identify the
performance benefits of avoiding redundant instructions through speculative execution. The
model involves a realistic evaluation of the latency of such a mechanism. The work is evolving
to include support for floating-point instructions, an extension of the implementation presented
in this thesis, and compares the gains obtained with those found in the literature.
As future work, the author also envisions extending the value prediction approach based
on dynamic value numbering to include the relative frequency of unique value numbers. In
this case, the total number of instances of a given instruction would be confronted with the
number of redundancies and with the frequency of each unique result. The study builds on the
value locality analysis and on compiler auditing that evaluates how compile-time
transformations could avoid the redundancies found at run time. The analysis would start from
a study of the ideal number of registers needed to eliminate the redundancies found, and of
how modifications to register allocation would enable further optimizations. Finally, the study
of dynamic languages and of instructions executed, for instance, by a virtual machine is also
seen as a future extension. In this case, the binary instrumentation approach would be more
suitable, in order to avoid the difficulties of porting a virtual machine to a simulation
environment. In any case, integrating the implementation described in this thesis would allow
investigating the effectiveness of the final executed code.
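Confronting the redundancy count with the frequency of each unique result yields, among other things, an upper bound on single-value prediction accuracy: the relative frequency of the most common result. A small sketch of that bound, with an illustrative result stream:

```python
from collections import Counter

def best_single_value_hit_rate(results):
    """Upper bound on the accuracy of speculating with one stored value:
    the relative frequency of the most frequent unique result."""
    counts = Counter(results)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(results)

# toy stream: the most frequent result (4) covers half the instances
rate = best_single_value_hit_rate([4, 4, 4, 7, 7, 9])
```

A frequency-aware predictor would prioritize instructions where this bound, weighted by the instruction's share of total redundancy, is highest.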
References
ADVE, V. The next generation of compilers. In: Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization. Washington, DC, USA: IEEE Computer Society, 2009. (CGO '09). ISBN 978-0-7695-3576-0.
ALLEN, F. E. The History of Language Processor Technology in IBM. IBM Journal of Research and Development, v. 25, n. 5, p. 535–548, Sept. 1981. ISSN 0018-8646.
ALPERN, B. et al. The Jikes research virtual machine project: building an open-source research community. IBM Syst. J., IBM Corp., Riverton, NJ, USA, v. 44, n. 2, p. 399–417, Jan. 2005. ISSN 0018-8670.
ALPERN, B.; WEGMAN, M. N.; ZADECK, F. K. Detecting equality of variables in programs. In: Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages. New York, NY, USA: ACM, 1988. (POPL '88), p. 1–11. ISBN 0-89791-252-7.
AUSLANDER, M.; HOPKINS, M. An overview of the PL.8 compiler. SIGPLAN Not., ACM, New York, NY, USA, v. 39, p. 38–48, April 2004. ISSN 0362-1340.
AUSTIN, T.; LARSON, E.; ERNST, D. SimpleScalar: an infrastructure for computer system modeling. Computer, v. 35, n. 2, p. 59–67, Feb. 2002. ISSN 0018-9162.
BACHEGA, L. R. et al. The BlueGene/L pseudo cycle-accurate simulator. In: Proceedings of the 2004 IEEE International Symposium on Performance Analysis of Systems and Software. Washington, DC, USA: IEEE Computer Society, 2004. (ISPASS '04), p. 36–44. ISBN 0-7803-8385-0. Available at: <http://dl.acm.org/citation.cfm?id=1153925.1154586>.
BODÍK, R.; ANIK, S. Path-sensitive value-flow analysis. In: Proceedings of the 25th ACM SIGPLAN-SIGACT symposium on Principles of programming languages. New York, NY, USA: ACM, 1998. (POPL '98), p. 237–251. ISBN 0-89791-979-3.
BODÍK, R.; GUPTA, R.; SOFFA, M. L. Load-reuse analysis: design and evaluation. In: Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation. New York, NY, USA: ACM, 1999. (PLDI '99), p. 64–76. ISBN 1-58113-094-5.
BODÍK, R.; GUPTA, R.; SOFFA, M. L. Complete removal of redundant expressions. SIGPLAN Not., ACM, New York, NY, USA, v. 39, n. 4, p. 596–611, Apr. 2004. ISSN 0362-1340.
BRIGGS, P.; COOPER, K. D.; SIMPSON, L. T. Value numbering. Softw. Pract. Exper., John Wiley & Sons, Inc., New York, NY, USA, v. 27, p. 701–724, June 1997. ISSN 0038-0644.
BRIGGS, P.; COOPER, K. D.; TORCZON, L. Improvements to graph coloring register allocation. ACM Transactions on Programming Languages and Systems, v. 16, p. 428–455, 1994.
BRUENING, D.; GARNETT, T.; AMARASINGHE, S. An infrastructure for adaptive dynamic optimization. In: Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization. Washington, DC, USA: IEEE Computer Society, 2003. (CGO '03), p. 265–275. ISBN 0-7695-1913-X. Available at: <http://dl.acm.org/citation.cfm?id=776261.776290>.
CALDER, B.; FELLER, P.; EUSTACE, A. Value profiling. In: Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture. Washington, DC, USA: IEEE Computer Society, 1997. (MICRO 30), p. 259–269. ISBN 0-8186-7977-8. Available at: <http://dl.acm.org/citation.cfm?id=266800.266825>.
CEZE, L. et al. Full Circle: Simulating Linux clusters on Linux clusters. In: Proceedings of the Fourth LCI International Conference on Linux Clusters: The HPC Revolution 2003. [S.l.: s.n.], 2003.
CHAITIN, G. Register allocation and spilling via graph coloring. SIGPLAN Not., ACM, New York, NY, USA, v. 39, p. 66–74, April 2004. ISSN 0362-1340.
CHILIMBI, T. M. Efficient representations and abstractions for quantifying and exploiting data reference locality. In: Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation. New York, NY, USA: ACM, 2001. (PLDI '01), p. 191–202. ISBN 1-58113-414-2. Available at: <http://doi.acm.org/10.1145/378795.378840>.
CITRON, D.; FEITELSON, D. G. The Organization of Lookup Tables in Instruction Memoization. Hebrew University of Jerusalem (Technical Report), 2000. Available at: <leibniz.cs.huji.ac.il/tr/306.ps>.
CLICK, C. Global code motion/global value numbering. In: Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation. New York, NY, USA: ACM, 1995. (PLDI '95), p. 246–257. ISBN 0-89791-697-2. Available at: <http://doi.acm.org/10.1145/207110.207154>.
COCKE, J. Programming languages and their compilers: Preliminary notes. [S.l.]: Courant Institute of Mathematical Sciences, New York University, 1969. ISBN B0007F4UOA.
COOPER, K.; ECKHARDT, J.; KENNEDY, K. Redundancy elimination revisited. In: Proceedings of the 17th international conference on Parallel architectures and compilation techniques. New York, NY, USA: ACM, 2008. (PACT '08), p. 12–21. ISBN 978-1-60558-282-5.
COOPER, K. D.; LU, J. Register promotion in C programs. In: Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation. New York, NY, USA: ACM, 1997. (PLDI '97), p. 308–319. ISBN 0-89791-907-6. Available at: <http://doi.acm.org/10.1145/258915.258943>.
COOPER, K. D.; XU, L. An efficient static analysis algorithm to detect redundant memory operations. SIGPLAN Not., ACM, New York, NY, USA, v. 38, p. 97–107, June 2002. ISSN 0362-1340.
COOPER, K. D.; XU, L. Memory redundancy elimination to improve application energy efficiency. In: Proceedings of the 16th Int'l Workshop on Languages and Compilers for Parallel Computing (LCPC'03). [S.l.: s.n.], 2003. p. 288–305.
COSTA, C. H. A. et al. Exploiting Value Locality with a Dynamic Methodology for Optimization Effectiveness Evaluation. [Submission].
COSTA, C. H. A. et al. Performance Improvement with Instruction Reuse based on Dynamic Value Numbering. [In production].
COSTA, C. H. A. et al. Dynamic Method to Evaluate Code Optimization Effectiveness. In: Proceedings of the 15th International Workshop on Software and Compilers for Embedded Systems. New York, NY, USA: ACM, 2012. (SCOPES '12), p. 62–71. ISBN 978-1-4503-1336-0. Available at: <http://doi.acm.org/10.1145/2236576.2236583>.
CYTRON, R. et al. Efficiently computing static single assignment form and the control dependence graph. ACM Trans. Program. Lang. Syst., ACM, New York, NY, USA, v. 13, n. 4, p. 451–490, Oct. 1991. ISSN 0164-0925. Available at: <http://doi.acm.org/10.1145/115372.115320>.
DIWAN, A.; MCKINLEY, K. S.; MOSS, J. E. B. Type-based alias analysis. In: Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation. New York, NY, USA: ACM, 1998. (PLDI '98), p. 106–117. ISBN 0-89791-987-4. Available at: <http://doi.acm.org/10.1145/277650.277670>.
FRANKLIN, M. The Multiscalar Architecture. PhD thesis — University of Wisconsin, Madison, WI, USA, 1993.
GABBAY, F.; MENDELSON, A. Can program profiling support value prediction? In: Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture. Washington, DC, USA: IEEE Computer Society, 1997. (MICRO 30), p. 270–280. ISBN 0-8186-7977-8. Available at: <http://dl.acm.org/citation.cfm?id=266800.266826>.
GELLERT, A.; FLOREA, A.; VINTAN, L. Exploiting selective instruction reuse and value prediction in a superscalar architecture. J. Syst. Archit., Elsevier North-Holland, Inc., New York, NY, USA, v. 55, n. 3, p. 188–195, Mar. 2009. ISSN 1383-7621. Available at: <http://dx.doi.org/10.1016/j.sysarc.2008.11.002>.
GHANDOUR, W. J.; AKKARY, H.; MASRI, W. Leveraging strength-based dynamic information flow analysis to enhance data value prediction. ACM Trans. Archit. Code Optim., ACM, New York, NY, USA, v. 9, n. 1, p. 1:1–1:33, Mar. 2012. ISSN 1544-3566. Available at: <http://doi.acm.org/10.1145/2133382.2133383>.
GOLANDER, A.; WEISS, S. Reexecution and Selective Reuse in Checkpoint Processors. In: STENSTROM, P. (Ed.). Transactions on High-Performance Embedded Architectures and Compilers II. Berlin, Heidelberg: Springer-Verlag, 2009. p. 242–268. ISBN 978-3-642-00903-7.
GOUGH, B. J. An Introduction to GCC. Bristol, United Kingdom: Network Theory Limited, 2005. ISBN 978-0-9541617-9-8.
GULWANI, S.; NECULA, G. C. A polynomial-time algorithm for global value numbering. Sci. Comput. Program., Elsevier North-Holland, Inc., Amsterdam, The Netherlands, v. 64, n. 1, p. 97–114, 2007. ISSN 0167-6423. Available at: <http://dx.doi.org/10.1016/j.scico.2006.03.005>.
GUPTA, R.; BERSON, D. A.; FANG, J. Z. Path profile guided partial redundancy elimination using speculation. In: Proceedings of the 1998 International Conference on Computer Languages. Washington, DC, USA: IEEE Computer Society, 1998. p. 230–. ISBN 0-8186-8454-2.
HACK, S.; GRUND, D.; GOOS, G. Register allocation for programs in SSA-Form. In: Proceedings of the 15th international conference on Compiler Construction. Berlin, Heidelberg: Springer-Verlag, 2006. (CC'06), p. 247–262. ISBN 3-540-33050-X, 978-3-540-33050-9. Available at: <http://dx.doi.org/10.1007/11688839_20>.
HARING, R. et al. The IBM Blue Gene/Q Compute Chip. Micro, IEEE, v. 32, n. 2, p. 48–60, March-April 2012. ISSN 0272-1732.
HENNING, J. L. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News, ACM, New York, NY, USA, v. 34, p. 1–17, 2006. ISSN 0163-5964.
HERROD, S. A. Using Complete Machine Simulation to Understand Computer System Behavior. Stanford, CA, USA: Stanford University (Technical Report), 1998. Available at: <ftp://reports.stanford.edu/pub/cstr/reports/cs/tr/98/1603/CS-TR-98-1603.pdf>.
HIRSCHBERG, D. S.; LELEWER, D. A. Efficient decoding of prefix codes. Communications of the ACM, v. 33, p. 449–459, 1990.
IBM. Power ISA Version 2.06 Revision B. 2010. Available at: <https://www.power.org/resources/downloads/PowerISA_V2.06B_V2_PUBLIC.pdf>.
JENKINS, R. A hash function for hash table lookup. 2006. Available at: <http://burtleburtle.net/bob/hash/doobs.html>.
KENNEDY, K.; ALLEN, J. R. Optimizing compilers for modern architectures: a dependence-based approach. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2002. ISBN 1-55860-286-0.
KHEDKER, U.; SANYAL, A.; KARKARE, B. Data Flow Analysis: Theory and Practice. 1st. ed. Boca Raton, FL, USA: CRC Press, Inc., 2009. ISBN 0849328802, 9780849328800.
KLEINOSOWSKI, A. J.; LILJA, D. J. MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research. IEEE Computer Architecture Letters, v. 1, p. 7–7, 2002.
KNUTH, D. E. The art of computer programming, volume 3: (2nd ed.) sorting and searching. Redwood City, CA, USA: Addison Wesley Longman Publishing Co., Inc., 1998. ISBN 0-201-89685-0.
KOES, D.; GOLDSTEIN, S. C. An Analysis of Graph Coloring Register Allocation. Pittsburgh, PA, USA: Carnegie Mellon University (Technical Report), 2006. Available at: <http://www.cs.cmu.edu/ seth/papers/koes-tr06.pdf>.
KOTZMANN, T. et al. Design of the Java HotSpotTM client compiler for Java 6. ACM Trans. Archit. Code Optim., ACM, New York, NY, USA, v. 5, n. 1, p. 7:1–7:32, May 2008. ISSN 1544-3566.
LARUS, J. R. Whole program paths. SIGPLAN Not., ACM, New York, NY, USA, v. 34, n. 5, p. 259–269, 1999. ISSN 0362-1340. Available at: <http://doi.acm.org/10.1145/301631.301678>.
LARUS, J. R.; CHANDRA, S. Using Tracing and Dynamic Slicing to Tune Compilers. Madison, WI, USA: University of Wisconsin (Technical Report), 1993. Available at: <http://128.105.2.28/pub/techreports/1993/TR1174.pdf>.
LATTNER, C.; ADVE, V. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In: Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization. Washington, DC, USA: IEEE Computer Society, 2004. (CGO '04), p. 75–. ISBN 0-7695-2102-9.
LEE, K.; BENAISSA, Z.; RODRIGUEZ, J. A dynamic tool for finding redundant computations in native code. In: Proceedings of the 2008 international workshop on dynamic analysis: held in conjunction with the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2008). New York, NY, USA: ACM, 2008. (WODA '08), p. 15–21. ISBN 978-1-60558-054-8. Available at: <http://doi.acm.org/10.1145/1401827.1401831>.
LEPAK, K. M.; LIPASTI, M. H. On the value locality of store instructions. In: Proceedings of the 27th annual international symposium on Computer architecture. New York, NY, USA: ACM, 2000. (ISCA '00), p. 182–191. ISBN 1-58113-232-8. Available at: <http://doi.acm.org/10.1145/339647.339678>.
LIN, J. et al. Speculative register promotion using advanced load address table (ALAT). In: Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization. Washington, DC, USA: IEEE Computer Society, 2003. (CGO '03), p. 125–134. ISBN 0-7695-1913-X.
LIPASTI, M. H.; WILKERSON, C. B.; SHEN, J. P. Value locality and load value prediction. In: Proceedings of the seventh international conference on Architectural support for programming languages and operating systems. New York, NY, USA: ACM, 1996. (ASPLOS-VII), p. 138–147. ISBN 0-89791-767-7.
LIU, Y. et al. An online profile guided optimization approach for speculative parallel threading. In: Advances in Computer Systems Architecture. [S.l.]: Springer Berlin / Heidelberg, 2007. (Lecture Notes in Computer Science, v. 4697), p. 28–39. ISBN 978-3-540-74308-8.
LO, R. et al. Register promotion by sparse partial redundancy elimination of loads and stores. SIGPLAN Not., ACM, New York, NY, USA, v. 33, p. 26–37, May 1998. ISSN 0362-1340.
LOBEL, A. Vehicle scheduling in public transit and Lagrangean pricing. Management Sci., v. 44, p. 1637–1649, 1998.
MAGNUSSON, P. S. et al. Simics: A full system simulation platform. Computer, IEEE Computer Society Press, Los Alamitos, CA, USA, v. 35, n. 2, p. 50–58, 2002. ISSN 0018-9162. Available at: <http://dx.doi.org/10.1109/2.982916>.
MARKOFF, J. The iPad in Your Hand: As Fast as a Supercomputer of Yore. New York Times, 2011. Available at: <http://bits.blogs.nytimes.com/2011/05/09/the-ipad-in-your-hand-as-fast-as-a-supercomputer-of-yore/>.
MCKENZIE, B. Modern compiler implementation in ML: Basic techniques by Andrew W. Appel, Cambridge University Press, 1997, ISBN 0521587751. J. Funct. Program., Cambridge University Press, New York, NY, USA, v. 9, p. 105–111, January 1999. ISSN 0956-7968.
MOLINA, C.; GONZALEZ, A.; TUBELLA, J. Dynamic removal of redundant computations. In: Proceedings of the 13th international conference on Supercomputing. New York, NY, USA: ACM, 1999. (ICS '99), p. 474–481. ISBN 1-58113-164-X. Available at: <http://doi.acm.org/10.1145/305138.305239>.
MOREIRA, J. et al. Designing a highly-scalable operating system: the Blue Gene/L story. In: Proceedings of the 2006 ACM/IEEE conference on Supercomputing. New York, NY, USA: ACM, 2006. (SC '06). ISBN 0-7695-2700-0. Available at: <http://doi.acm.org/10.1145/1188455.1188578>.
MOREL, E.; RENVOISE, C. Global optimization by suppression of partial redundancies. Commun. ACM, ACM, New York, NY, USA, v. 22, n. 2, p. 96–103, Feb. 1979. ISSN 0001-0782. Available at: <http://doi.acm.org/10.1145/359060.359069>.
MUCHNICK, S. S. Advanced compiler design and implementation. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1997. ISBN 1-55860-320-4.
NAKRA, T.; GUPTA, R.; SOFFA, M. L. Global context-based value prediction. In: Proceedings of the 5th International Symposium on High Performance Computer Architecture. Washington, DC, USA: IEEE Computer Society, 1999. (HPCA '99), p. 4–. ISBN 0-7695-0004-8. Available at: <http://dl.acm.org/citation.cfm?id=520549.822759>.
OTTENSTEIN, K. J.; BALLANCE, R. A.; MACCABE, A. B. The program dependence web: a representation supporting control-, data-, and demand-driven interpretation of imperative languages. SIGPLAN Not., ACM, New York, NY, USA, v. 25, p. 257–271, June 1990. ISSN 0362-1340.
PETERSON, J. L. et al. Application of full-system simulation in exploratory system design and development. IBM Journal of Research and Development, v. 50, n. 2.3, p. 321–332, March 2006. ISSN 0018-8646.
PHANSALKAR, A.; JOSHI, A.; JOHN, L. K. Analysis of redundancy and application balance in the SPEC CPU2006 benchmark suite. In: Proceedings of the 34th annual international symposium on Computer architecture. New York, NY, USA: ACM, 2007. (ISCA '07), p. 412–423. ISBN 978-1-59593-706-3. Available at: <http://doi.acm.org/10.1145/1250662.1250713>.
RAMALINGAM, G. The undecidability of aliasing. ACM Trans. Program. Lang. Syst., ACM, New York, NY, USA, v. 16, p. 1467–1471, September 1994. ISSN 0164-0925.
REIF, J. H.; LEWIS, H. R. Efficient symbolic analysis of programs. J. Comput. Syst. Sci., Academic Press, Inc., Orlando, FL, USA, v. 32, p. 280–314, June 1986. ISSN 0022-0000.
REINMAN, G. et al. Profile Guided Load Marking for Memory Renaming. San Diego, CA, USA: University of California San Diego (Technical Report), 1998. Available at: <http://www.cse.hcmut.edu.vn/ anhvu/teaching/2008/ACA/articles/29.pdf>.
RYCHLIK, B. et al. Efficacy and performance impact of value prediction. In: Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques. Washington, DC, USA: IEEE Computer Society, 1998. (PACT '98), p. 148–. ISBN 0-8186-8591-3. Available at: <http://dl.acm.org/citation.cfm?id=522344.825671>.
SAATY, T. L.; KAINEN, P. C. The four-color problem: assaults and conquest. New York: McGraw-Hill International Book Co., 1977. ix, 217 p. ISBN 0070543828.
SAZEIDES, Y.; SMITH, J. E. Implementations of Context Based Value Predictors. Madison, WI, USA: University of Wisconsin (Technical Report), 1997. Available at: <http://www.lems.brown.edu/ iris/en291s9-04/papers/Context-value-pred.pdf>.
SAZEIDES, Y.; SMITH, J. E. The predictability of data values. In: Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture. Washington, DC, USA: IEEE Computer Society, 1997. (MICRO 30), p. 248–258. ISBN 0-8186-7977-8. Available at: <http://dl.acm.org/citation.cfm?id=266800.266824>.
SHAFI, H. et al. Design and validation of a performance and power simulator for PowerPC systems. IBM J. Res. Dev., IBM Corp., Riverton, NJ, USA, v. 47, p. 641–651, September 2003. ISSN 0018-8646. Available at: <http://dx.doi.org/10.1147/rd.475.0641>.
SHOR, P. W. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM J. Comput., Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, v. 26, n. 5, p. 1484–1509, Oct. 1997. ISSN 0097-5397. Available at: <http://dx.doi.org/10.1137/S0097539795293172>.
SIMPSON, L. T. Value-driven redundancy elimination. PhD thesis — Rice University, Houston, TX, USA, 1996.
SODANI, A.; SOHI, G. S. Dynamic instruction reuse. In: Proceedings of the 24th annual international symposium on Computer architecture. New York, NY, USA: ACM, 1997. (ISCA '97), p. 194–205. ISBN 0-89791-901-7.
SODANI, A.; SOHI, G. S. An empirical analysis of instruction repetition. In: Proceedings of the eighth international conference on Architectural support for programming languages and operating systems. New York, NY, USA: ACM, 1998. (ASPLOS-VIII), p. 35–45. ISBN 1-58113-107-0. Available at: <http://doi.acm.org/10.1145/291069.291016>.
SODANI, A.; SOHI, G. S. Understanding the differences between value prediction and instruction reuse. In: Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture. Los Alamitos, CA, USA: IEEE Computer Society Press, 1998. (MICRO 31), p. 205–215. ISBN 1-58113-016-3. Available at: <http://dl.acm.org/citation.cfm?id=290940.290983>.
SRIKANT, Y. N.; SHANKAR, P. The Compiler Design Handbook: Optimizations and Machine Code Generation, Second Edition. 2nd. ed. Boca Raton, FL, USA: CRC Press, Inc., 2007. ISBN 142004382X, 9781420043822.
STALLMAN, R. M. Using The Gnu Compiler Collection: A GNU Manual For GCC Version 4.3.3. Paramount, CA: CreateSpace, 2009. ISBN 144141276X, 9781441412768.
SURENDRA, G.; BANERJEE, S.; NANDY, S. K. Instruction reuse in SPEC, media and packet processing benchmarks. J. Embedded Comput., IOS Press, Amsterdam, The Netherlands, v. 2, n. 1, p. 15–34, Jan. 2006. ISSN 1740-4460. Available at: <http://dl.acm.org/citation.cfm?id=1370986.1370989>.
THOMAS, R. et al. Improving branch prediction by dynamic dataflow-based identification of correlated branches from a large global history. In: Proceedings of the 30th annual international symposium on Computer architecture. New York, NY, USA: ACM, 2003. (ISCA '03), p. 314–323. ISBN 0-7695-1945-8. Available at: <http://doi.acm.org/10.1145/859618.859655>.
TORCZON, L.; COOPER, K. Engineering A Compiler. 2nd. ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011. ISBN 012088478X.
TSENG, H.-W.; TULLSEN, D. M. Data-triggered threads: Eliminating redundant computation. High-Performance Computer Architecture, International Symposium on, IEEE Computer Society, Los Alamitos, CA, USA, p. 181–192, 2011.
USA. Designing a Digital Future: Federally Funded Research and Development in Networking and Information Technology. President's Council of Advisors on Science and Technology, Office of Science and Technology Policy (PCAST), 2010. Available at: <http://www.whitehouse.gov/sites/default/files/microsites/ostp/pcast-nitrd-report-2010.pdf>.
VANDRUNEN, T. Partial Redundancy Elimination for Global Value Numbering. PhD thesis — Purdue University, West Lafayette, IN, USA, 2004.
WANG, K.; FRANKLIN, M. Highly accurate data value prediction using hybrid predictors. In: Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture. Washington, DC, USA: IEEE Computer Society, 1997. (MICRO 30), p. 281–290. ISBN 0-8186-7977-8. Available at: <http://dl.acm.org/citation.cfm?id=266800.266827>.
WATTERSON, S.; DEBRAY, S. Goal-directed value profiling. In: Proceedings of the 10th International Conference on Compiler Construction. London, UK: Springer-Verlag, 2001. (CC '01), p. 319–333. ISBN 3-540-41861-X. Available at: <http://dl.acm.org/citation.cfm?id=647477.760386>.
XU, L. Program redundancy analysis and optimization to improve memory performance. PhD thesis — Rice University, Houston, TX, USA, 2003.
YANG, J.; GUPTA, R. Load redundancy removal through instruction reuse. In: Proceedings of the 2000 International Conference on Parallel Processing. Washington, DC, USA: IEEE Computer Society, 2000. (ICPP '00), p. 61–. ISBN 0-7695-0768-9. Available at: <http://dl.acm.org/citation.cfm?id=850941.852902>.
YANG, W.; HORWITZ, S.; REPS, T. Detecting Program Components With Equivalent Behaviors. Madison, WI, USA: University of Wisconsin (Technical Report), 1989. Available at: <http://digital.library.wisc.edu/1793/59110>.
ZHANG, Y.; GUPTA, R. Timestamped whole program path representation and its applications. In: Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation. New York, NY, USA: ACM, 2001. (PLDI '01), p. 180–190. ISBN 1-58113-414-2. Available at: <http://doi.acm.org/10.1145/378795.378835>.
APPENDIX A -- SPEC CPU2006 Benchmark Suite
In this appendix, for the sake of completeness, detailed information on the selected
components from SPEC CPU2006 Integer is provided. Details were extracted from the
descriptions provided by the Standard Performance Evaluation Corporation (HENNING, 2006).
A.1 400.perlbench
Authors: Larry Wall et al.
General Category: Programming language
Description: 400.perlbench is a cut-down version of Perl v5.8.7, the popular scripting
language. SPEC's version of Perl has had most of its OS-specific features removed. In addition
to the core Perl interpreter, several third-party modules are used:
• SpamAssassin v2.61;
• Digest-MD5 v2.33;
• HTML-Parser v3.35;
• MHonArc v2.6.8;
• IO-stringy v1.205;
• MailTools v1.60;
• TimeDate v1.16.
Sources for all of the freely-available components used in 400.perlbench can be found on
the distribution media in the original.src directory.
Input: The reference workload for 400.perlbench consists of three scripts:
1. The primary component of the workload is the open-source spam-checking software
SpamAssassin. SpamAssassin is used to score a couple of known corpora of both spam
and ham (non-spam), as well as a sampling of mail generated from a set of random
components. SpamAssassin has been heavily patched to avoid doing file I/O, and does
not use Bayesian filtering;
2. Another component is the popular freeware email-to-HTML converter MHonArc.
Email messages are generated randomly and converted to HTML. In addition to
MHonArc, which was lightly patched to avoid file I/O, this component also uses several
standard modules from CPAN (the Comprehensive Perl Archive Network);
3. The third script in the reference workload (which also uses the mail generator for
convenience) exercises a slightly modified version of the specdiff script, which is
part of the CPU2006 tool suite.
The training workload is similar, but not identical, to the reference workload from
CPU2000. The test workload consists of the non-system-specific parts of the actual Perl 5.8.7
test harness.
Output: In the case of the mail-based benchmarks, a line with salient characteristics
(number of header lines, number of body lines, etc) is output for each message generated.
During processing, MD5 hashes of the contents of output "files" (in memory) are computed
and output. For SpamAssassin, the message’s score and the rules that it triggered are also
output.
Programming Language: ANSI C
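The validation scheme described above, hashing in-memory "files" rather than writing them to disk, can be sketched as follows. This is an illustrative sketch only (the buffer contents are hypothetical stand-ins, not the benchmark's actual output), not 400.perlbench's code:

```python
import hashlib

# Hypothetical in-memory "file" produced by a benchmark component;
# the benchmark hashes such buffers instead of performing file I/O.
output_buffer = b"Subject: test message\nbody line 1\nbody line 2\n"

# The 32-hex-character MD5 checksum is what gets emitted for validation.
digest = hashlib.md5(output_buffer).hexdigest()
print(digest)
```

Comparing a short checksum instead of the full output keeps validation cheap while still detecting any divergence in the generated messages.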
A.2 401.bzip2
Author: Julian Seward
General Category: Compression
Description: 401.bzip2 is based on Julian Seward’s bzip2 version 1.0.3. The only differ-
ence between bzip2 1.0.3 and 401.bzip2 is that SPEC’s version of bzip2 performs no file I/O
other than reading the input. All compression and decompression happens entirely in memory.
This is to help isolate the work done to only the CPU and memory subsystem.
Input: 401.bzip2’s reference workload has six components: two small JPEG images, a
program binary, some program source code in a tar file, an HTML file, and a "combined"
file, which is representative of an archive that contains both highly compressible and not
very compressible files. Each input set is compressed and decompressed at three different
blocking factors ("compression levels"), with the end result of the process being compared to
the original data after each decompression step.
Output: The output files provide a brief outline of what the benchmark is doing as it runs.
Output sizes for each compression and decompression are printed to facilitate validation, and
the results of decompression are compared with the input data to ensure that they match.
Programming Language: ANSI C
References: (HIRSCHBERG; LELEWER, 1990)
A.3 429.mcf
Author: Andreas Lobel; SPEC Project Leader: Reinhold Weicker
General Category: Combinatorial optimization / Single-depot vehicle scheduling
Description: 429.mcf is derived from MCF, a program used for single-depot vehicle
scheduling in public mass transportation.
The program is designed for the solution of single-depot vehicle scheduling problems
arising in public transportation planning. It considers a single depot and a homogeneous vehicle fleet. Based
on a line plan and service frequencies, so-called timetabled trips with fixed departure/arrival
locations and times are derived. Each of these timetabled trips has to be serviced by exactly
one vehicle. The links between these trips are called dead-head trips. In addition, there
are pull-out and pull-in trips for leaving and entering the depot. Cost coefficients are given
for all dead-head, pull-out, and pull-in trips. The task is to schedule all timetabled trips
into so-called blocks such that the number of vehicles required is as small as possible and,
as a secondary objective, the operational cost among all minimal-fleet solutions is minimized.
For the considered single-depot case, the problem can be formulated as a large-scale
minimum-cost flow problem, solved with a network simplex algorithm accelerated with column generation.
The network simplex algorithm is a specialized version of the well-known simplex algorithm
for network flow problems. The linear algebra of the general algorithm is replaced by simple
network operations, such as finding cycles or modifying spanning trees, that can be performed
very quickly. The main work of the network simplex implementation is pointer and
integer arithmetic. In the transition from 181.mcf (CPU2000) to 429.mcf (CPU2006), new
inputs were defined for test, train, and ref, with the goal of longer execution times. The heap
data size, and with it the overall memory footprint, increased accordingly. Most of the source
code was not changed, but several type definitions were changed by the author:
• Whenever possible, long-typed attributes of struct node and struct arc are replaced by
32-bit integers, for example when used as a boolean type. Pointers remain unaffected and
map to 32 or 64 bits, depending on the compilation model, to ensure compatibility with
64-bit systems;
• To reduce cache misses and accelerate program performance somewhat, the elements
of struct node and struct arc, respectively, are rearranged.
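The effect of narrowing fields to 32-bit integers and rearranging them can be illustrated with ctypes, which follows the platform's native C alignment rules. The field names below are hypothetical, not those of 429.mcf's actual struct arc:

```python
import ctypes

class ArcPadded(ctypes.Structure):
    # A 32-bit field placed before an 8-byte pointer forces the compiler
    # to insert alignment padding around it.
    _fields_ = [("ident", ctypes.c_int32),
                ("tail",  ctypes.c_void_p),
                ("cost",  ctypes.c_int32)]

class ArcPacked(ctypes.Structure):
    # Pointer first, 32-bit fields grouped together: less padding,
    # so more arcs fit per cache line.
    _fields_ = [("tail",  ctypes.c_void_p),
                ("ident", ctypes.c_int32),
                ("cost",  ctypes.c_int32)]

# On a typical 64-bit system this prints 24 and 16.
print(ctypes.sizeof(ArcPadded), ctypes.sizeof(ArcPacked))
```

A smaller, better-packed struct reduces the memory footprint and cache-miss rate, which is exactly the motivation given for the author's changes.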
Input: The input file contains: the number of timetabled and dead-head trips; for each
timetabled trip its starting and ending time; for each dead-head trip its starting and ending
timetabled trip and its cost.
Worst-case execution time is pseudo-polynomial in the number of timetabled and dead-head
trips and in the magnitude of the maximal cost coefficient. The expected execution time,
however, is of the order of a low-order polynomial.
Memory Requirements: 429.mcf requires about 860 and 1700 megabytes for a 32-bit and a
64-bit data model, respectively.
Output: The benchmark writes log information, a checksum, and output values describ-
ing an optimal schedule.
References: (LOBEL, 1998).
A.4 445.gobmk
Authors: Man Lung Li et al.
General Category: Artificial intelligence - game playing.
Description: The program plays Go and executes a set of commands to analyze Go
positions.
Input: Most input is in "SmartGo Format" (.sgf), a widely used de facto standard rep-
resentation of Go games. A typical test involves reading in a game to a certain point, then
executing a command to analyze the position.
Output: Typically an ASCII description of a sequence of Go moves.
Programming Language: C
A.5 458.sjeng
Authors: Gian-Carlo Pascutto, Vincent Diepeveen
General Category: Artificial Intelligence (game tree search & pattern recognition)
Description: 458.sjeng is based on Sjeng 11.2, which is a program that plays chess and
several chess variants, such as drop-chess (similar to Shogi), and losing chess.
It attempts to find the best move via a combination of alpha-beta or priority proof number
tree searches, advanced move ordering, positional evaluation and heuristic forward pruning.
Practically, it will explore the tree of variations resulting from a given position to a given base
depth, extending interesting variations but discarding doubtful or irrelevant ones. From this
tree the optimal line of play for both players ("principal variation") is determined, as well as
a score reflecting the balance of power between the two.
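The alpha-beta pruning at the heart of this search can be sketched in negamax form. This is a generic textbook sketch, not Sjeng's implementation; the `evaluate` and `children` callbacks are hypothetical game-specific hooks:

```python
def alphabeta(node, depth, alpha, beta, evaluate, children):
    """Minimal alpha-beta search in negamax form. `evaluate` scores a
    position from the perspective of the player to move; `children`
    yields the successor positions. Both are game-specific callbacks."""
    moves = children(node)
    if depth == 0 or not moves:
        return evaluate(node)
    best = float("-inf")
    for child in moves:
        # Negate and swap the window: the child is scored from the
        # opponent's point of view.
        score = -alphabeta(child, depth - 1, -beta, -alpha, evaluate, children)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # cutoff: the opponent will never allow this line
    return best
```

Move ordering, extensions, and forward pruning, as described above, all serve to make these cutoffs trigger earlier, shrinking the explored tree.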
The SPEC version is an enhanced version of the free Sjeng 11.2 program, modified to be
more portable and more accurately reflect the workload of current professional programs.
Input: 458.sjeng’s input consists of a text file containing alternating entries: a chess position
in the standard Forsyth-Edwards Notation (FEN) and the depth to which that position should
be analyzed, in half-moves (ply depth). The SPEC reference input consists of 9 positions
belonging to various phases of the game.
Output: 458.sjeng’s output consists, per position, of some side information (textual display
of the chessboard, phase of the game, parameters used, ...) followed by the output from
the tree-searching module as it progresses. This is formatted as: attained depth in half-moves
(plies); score for the player to move, in equivalents of 1 pawn; number of positions
investigated; and the optimal line of play ("principal variation").
Programming Language: ANSI C
A.6 462.libquantum
Authors: Bjorn Butscher, Hendrik Weimer
General Category: Physics / Quantum Computing
Description: libquantum is a library for the simulation of a quantum computer. Quantum
computers are based on the principles of quantum mechanics and can solve certain computa-
tionally hard tasks in polynomial time.
In 1994, Peter Shor discovered a polynomial-time algorithm for the factorization of numbers,
a problem of particular interest for cryptanalysis, as the widely used RSA cryptosystem
depends on the assumption that prime factorization cannot be solved in polynomial time. An
implementation of Shor’s factorization algorithm is included in libquantum.
Libquantum provides a structure for representing a quantum register and some elementary
gates. Measurements can be used to extract information from the system. Additionally,
libquantum offers the simulation of decoherence, the most important obstacle in building
practical quantum computers. It is thus possible not only to simulate any quantum algorithm,
but also to develop quantum error-correction algorithms. Since libquantum allows new gates
to be added, it can easily be extended to fit ongoing research; for example, it has been
deployed to analyze quantum cryptography.
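The number-theoretic reduction behind Shor's algorithm can be sketched classically. Here the period of a^x mod n is found by brute force; that search is precisely the step a quantum computer replaces with polynomial-time phase estimation. This is a textbook sketch, not libquantum's implementation:

```python
from math import gcd

def shor_classical(n, a):
    """Classical sketch of Shor's reduction: find the period r of
    a^x mod n, then derive factors from gcd(a^(r//2) +/- 1, n)."""
    g = gcd(a, n)
    if g != 1:
        return g, n // g  # the chosen base already shares a factor
    # Brute-force period search (the quantum computer's job).
    r, x = 1, a % n
    while x != 1:
        x = (x * a) % n
        r += 1
    if r % 2:
        return None  # odd period: retry with another base
    y = pow(a, r // 2, n)
    if y == n - 1:
        return None  # trivial square root: retry with another base
    p = gcd(y - 1, n)
    return p, n // p

print(shor_classical(15, 7))  # the textbook case n=15, a=7 yields (3, 5)
```

The benchmark's command-line base parameter, mentioned below, corresponds to the choice of `a` here; a bad choice forces a retry.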
Input: The benchmark program expects the number to be factorized as a command-
line parameter. An additional parameter can be supplied to specify a base for the modular
exponentiation part of Shor’s algorithm.
Output: The program gives a brief explanation on what it is doing and the factors of the
input number if the factorization was successful.
Programming Language: ISO/IEC 9899:1999 ("C99")
References: (SHOR, 1997)
A.7 471.omnetpp
Author: Andras Varga, Omnest Global, Inc.
General Category: Discrete Event Simulation
Description: Simulation of a large Ethernet network, based on the OMNeT++ discrete
event simulation system, using an Ethernet model which is publicly available. For the refer-
ence workload, the simulated network models a large Ethernet campus backbone, with several
smaller LANs of various sizes hanging off each backbone switch. It contains about 8000 com-