1. Dynamic Binary Optimization for Virtualization on Multi-cores
The software market requires applications to run on many
generations of hardware. Even if software vendors tune their
binaries for the most prevalent hardware at release time, the code
rapidly becomes mismatched to new platforms as hardware
implementations evolve. The latest high-performance microprocessors
all provide sophisticated runtime monitoring support that allows
runtime information, such as cache misses and instruction pipeline
stalls, to be collected and used to re-optimize binary code at
runtime to improve overall performance. Such continuous program
re-optimization requires a dynamic compiler that can manipulate
binary code at runtime. In another very important class of
applications, process/system virtualization, a software layer is
set up above the hardware to allow multiple OSs and/or applications
with different instruction-set architectures (ISAs) to run on the
same hardware platform. The main technology supporting
virtualization is binary translation: a runtime compiler that takes
the binary code of those OSs and application programs in their
guest ISAs and translates it into instruction sequences in the ISA
of the underlying hardware platform for execution. Most of these
binary manipulation techniques today incur substantial runtime
overhead. Dynamic optimization of binary code at runtime to improve
overall performance is therefore a core technology that deserves study.
Since the optimizations are performed at runtime, a dynamic binary
optimizer has to be carefully designed so that the overhead of
runtime optimization would not outweigh the performance gain of the
optimized code. We will call an optimizer that produces more
performance gain than overhead an effective optimizer. There are a
number of factors that are crucial to the effectiveness of a
dynamic optimizer. Before discussing them, we first give an
overview of how a general dynamic binary optimizer works. In
general, execution of an application under a dynamic binary
optimizer, as shown in Figure 1, begins with the system executing
(or emulating) and profiling the running program's instruction
stream to track its execution flow. When the system discovers a
significant change in the profile, it tries to find a frequently
executed code sequence (i.e., a hot trace); the sequence is then
analyzed, optimized, and placed in a code cache. Execution then
switches to the optimized code in the code cache.
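This control loop can be sketched in a few lines of Python. The sketch is purely illustrative: the PC-stream model, the hotness threshold, and the trace/optimization stand-ins are hypothetical, not part of any actual optimizer.

```python
# Schematic of a dynamic binary optimizer's main loop: profile the
# instruction stream, identify hot code, "optimize" it, and place it in a
# code cache so that later executions bypass profiling.
from collections import Counter

THRESHOLD = 5  # hypothetical hotness threshold

def find_hot_trace(profile, threshold=THRESHOLD):
    """Stand-in for trace selection: PCs sampled at least `threshold` times."""
    return {pc for pc, n in profile.items() if n >= threshold}

def run_with_optimizer(instruction_stream):
    """Iterate over a PC stream, profile it, and populate a code cache."""
    profile = Counter()
    code_cache = {}                 # entry PC -> "optimized" trace
    for pc in instruction_stream:
        if pc in code_cache:        # execution switches to the code cache
            continue
        profile[pc] += 1            # profiling the running program
        for entry in find_hot_trace(profile):
            # analyze/optimize the hot sequence and place it in the cache
            code_cache.setdefault(entry, f"opt@{entry:#x}")
    return code_cache

# A loop at 0x400000..0x400008 executed many times becomes hot;
# the cold block at 0x400100 never enters the cache.
stream = [0x400000, 0x400004, 0x400008] * 10 + [0x400100]
cache = run_with_optimizer(stream)
```

In a real system the cached entries would be optimized machine code and the dispatch check would be a patched branch rather than a dictionary lookup; the control flow, however, is the same.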
Figure 1. Control flow of a general dynamic binary optimizer

Now, we can identify a number of key factors that have a profound
impact on the effectiveness of a dynamic binary optimizer: (1) the
profiling method used to collect runtime information; (2) the
frequency at which the optimizer is activated, which often depends
on the phase detection method; (3) the detection of frequently
executed code sequences, also referred to as hot code
identification and hot trace generation; (4) the optimizations
performed; and (5) improving optimizations with the help of
compiler annotations. In this project, we propose a light-weight,
sampling-based dynamic binary optimization framework that provides
novel solutions to these important issues. Furthermore, most binary
optimization techniques today target single-core platforms. We plan
to extend such binary optimization techniques to multi-core
platforms. This is a much harder problem, as we need to deal with
multithreaded applications and with many more shared resources on
multi-core platforms. The core technologies we will develop in this
project, if successful, could have a significant impact on the
development of dynamic binary optimization systems and
virtualization systems.
Related Work

Profile-guided optimizations [16][17] provide runtime information
for advanced optimizations [18][19][20][21][22]. Hence, recent
research attempts to extend the idea of branch profiling to value,
cache-miss [23], and data-dependency [24] profiling. However,
collecting a representative profile is difficult for real
applications [25]. Although post-link optimizations
[26][27][28][29][30][31][32] optimize programs based on performance
profiles and reduce the need to recompile, applications can exhibit
different performance characteristics with individual inputs.
Therefore, an application may have to be optimized during
execution, since specific information about its performance cannot
be gathered before the input is given. Dynamic optimization systems
[1][2][3][4][5][6][33][34][35][36] are becoming important because
of the need to customize optimizations for individual inputs,
behavior that changes over time, dynamically linked libraries, and
the micro-architecture. Such dynamic optimization systems typically
manipulate and optimize binary code at runtime. The profiling
methods used by most binary manipulation and optimization systems
can be classified into two categories: Virtual Machine (VM) based
and sampling based. VM-based systems, such as Dynamo [1], DynamoRIO
[2], Mojo [3], and PIN [5], typically instrument code for profiling.
Therefore, accurate runtime data for phase detection strategies,
such as instruction working sets and basic block vectors, can be
collected without problem. However, such systems incur substantial
overhead from profiling, emulation, code-cache management, and the
expensive handling of indirect branches. For example, Pin [5] has
an overhead of 54% for the SPECint2000 benchmarks on IA32 systems,
and DynamoRIO [2] has an overhead of 42% in the same environment.
These are the minimal overheads reported, with no instrumentation
or optimization performed. Unlike VM-based
optimizers, sampling-based optimizers, such as ADORE [4], and
sampling-based profiling tools, such as SimPoint [7], typically do
not instrument code for profiling. Therefore, runtime data for
phase detection strategies cannot be collected with the same
accuracy. Also, sampling-based optimizers do not have complete
control over program execution. They take frequent snapshots of
program execution and thus see only frequently executed code, not
the complete execution path leading to it. However, sampling-based
profiling has lower runtime overhead than VM-based profiling.
Dynamic optimization systems using sampling-based profiling rely on
phase detection to detect changes in the code working set and
changes in performance characteristics that can affect optimization
strategies. Phase detection techniques can be classified into two
categories: Global Phase Detection (GPD) [8] and Local Phase
Detection (LPD) [9][10]. In GPD, program characteristics are
computed by taking into account information from all regions
executed during the profiled interval. Hence, it is sensitive to
the sampling period, interval size, and thresholds used in the
phase detector. LPD can detect phase changes more accurately than
GPD because the scope of phase detection is reduced to a small code
region, such as a basic block, a loop, or a procedure. Commonly
used LPD methods include region-monitoring-based detection [10] and
trace compilation [9]. Table 1 compares these optimization systems.
Details of each optimization system are described below. Note that
all of these optimization systems are for single-core platforms,
except that the ADORE system runs the optimizer and the user
application code on separate cores. Dynamo [1] is a software
dynamic optimization system that is capable of transparently
improving the performance of a native instruction stream as it
executes on the processor. The input native instruction stream to
Dynamo can be dynamically generated (by a JIT for example), or it
can come from the execution of a statically compiled native binary.
Dynamo focuses its efforts on optimization opportunities that tend
to manifest only at runtime. Experimental results demonstrate that
even statically optimized native binaries can be accelerated by
Dynamo. For example, the average performance of -O optimized
SpecInt95 benchmark binaries created by the HP product C compiler
is improved to a level comparable to their -O4 optimized version
running without Dynamo. The performance advantage of Dynamo in this
case is not surprising, because it was compared against
compile-time static optimizations, which usually lack the runtime
information needed to generate high-performance code. Since Dynamo
relies on VM-based profiling and runtime emulation of the program
execution, its runtime overhead can be high. DynamoRIO [2] is a
framework, extended from Dynamo, for implementing dynamic analyses
and optimizations. It provides an interface for building external
modules, or clients, for the DynamoRIO dynamic code modification
system. This interface abstracts away many low-level details of the
DynamoRIO runtime system while exposing a simple yet efficient API.
This is achieved by restricting optimization units to linear
streams of code and using adaptive levels of detail for
representing instructions. The interface is not restricted to
optimization and can be used for instrumentation, profiling,
dynamic translation, etc. DynamoRIO also implements several
optimizations. These improve the performance of some applications
by 12% on average, relative to native execution. Since DynamoRIO is
intended to be an analysis and instrumentation tool, it uses
expensive software-instrumentation-based profiling and an interpreter for
emulation.

                           Dynamo[1] DynamoRIO[2] Mojo[3] ADORE[4] PIN[5] JikesRVM[6]
Sampling-based profiling   no        no           no      yes      no     no
VM-based                   yes       yes          yes     no       yes    yes
Emulation with interpreter yes       yes          no      no       yes    yes
Annotation information     no        no           no      no       no     yes

Optimizations:
Dynamo [1]: 1. hot tracing and linking.
DynamoRIO [2]: 1. constant propagation; 2. dead code removal; 3. call/return matching; 4. stack adjustment; 5. dead code elimination; 6. loop normalization and unrolling; 7. load/store and redundant branch elimination.
Mojo [3]: 1. hot path linking; 2. dropping unconditional jumps; 3. inlining and patching call/return sequences.
ADORE [4]: 1. dynamic register allocation; 2. runtime data cache prefetching; 3. hot traces; 4. unrolling loops.
PIN [5]: 1. persistent code caching.
JikesRVM [6]: 1. adaptive inlining; 2. register allocation and coalescing; 3. tail recursion elimination; 4. code reordering.

Table 1. Comparison of VM-based dynamic optimizers and sampling-based optimizers

Mojo [3] is unlike
dynamic optimizers that have been chiefly targeted towards running
the SPEC benchmarks on scientific workstations. Mojo[3], developed
by Microsoft Research, contends that dynamic optimization
technology is also important to the desktop computing environment
where running large, complex commercial software applications is
commonplace. Mojo implements its optimizations for the x86
architecture. It also supports exception handling and multithreaded
applications on Windows, and reports preliminary performance
measurements. Similar to Dynamo and DynamoRIO, Mojo also employs VM
based profiling. However, it does not rely on the time-consuming
emulation/interpretation of program execution. ADORE [4] is a
light-weight dynamic binary optimization system developed at the
University of Minnesota. It is light-weight because it uses hardware
performance monitoring based sampling for profiling. ADORE uses
dynamic optimization to address cache miss, branch mis-prediction,
and other performance events at runtime. It detects performance
problems of running applications and deploys optimizations to
increase execution efficiency. ADORE's approach includes detecting
performance bottlenecks, generating optimized traces and
redirecting execution from the original code to the dynamically
optimized code. Experimental results show that ADORE speeds up many
of the CPU2000 benchmark programs having large numbers of D-Cache
misses through dynamically deployed cache prefetching. For other
applications that don't benefit from ADORE's runtime optimization,
the average cost is only 2% of execution time. ADORE is a good
example of using existing hardware and software to deploy
speculative optimizations to improve a program's runtime
performance. In this project, we will develop our dynamic binary
optimization system based on ADORE because of the various efficient
and attractive features it provides. PIN [5] is an instrumentation
system developed by Intel. It aims to support easy-to-use,
portable, transparent, and efficient instrumentation.
Instrumentation tools (called Pintools) are written in C/C++ using
Pin's API. Pin follows the model of ATOM, allowing the tool writer
to analyze an application at the instruction level without the need
for detailed knowledge of the underlying instruction set. Pin uses
dynamic compilation to instrument executables while they are
running. For efficiency, Pin uses several techniques, including
inlining, register re-allocation, liveness analysis, and
instruction scheduling to optimize instrumentation. As a result,
Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for
basic-block counting. Pin is publicly available for Linux
platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit
x86), Itanium, and ARM. JikesRVM [6]: Jikes RVM (Research Virtual
Machine) provides a flexible open testbed to prototype virtual
machine technologies and experiment with a large variety of design
alternatives. Jikes RVM can run on various platforms. It implements
virtual machine technologies for dynamic compilation, adaptive
optimization, garbage collection, thread scheduling, and
synchronization. A distinguishing characteristic of Jikes RVM is
that it is implemented in the Java programming language and is
self-hosted, i.e., its Java code runs on itself without requiring a
second virtual machine. JikesRVM uses VM-based profiling and an
interpreter for program emulation.

1. Approach

We propose a
light-weight, sampling based dynamic binary optimization system.
The system diagram, including system components and major data
structures, of our virtualization system proposed in the main
project is depicted in Figure 2. The blocks circled by a dotted
line are the components for the dynamic binary optimization
sub-system proposed in this sub-project. The components include a
Hardware Performance Monitor profiler (HPM Data), Phase Detector,
Hot Trace Generator, and Optimizer.

Figure 2. System diagram of our virtualization system

We first describe the functionality of, and our design decisions
for, each of the components in our optimization system. We also
address the important research issues in each component.

1.1 Light-weight HPM-Based Profiling

We exploit
hardware performance monitors (HPM) in the processor for
light-weight profiling. Since the HPM counts events automatically,
the extra overhead of monitoring program behavior is much lower
than with an instrumentation approach that counts events in
software. We adopt Perfmon2 [11], a standard performance monitoring
interface for Linux, to access the HPM. It provides friendly
interfaces that help the user configure the HPM registers for the
events to be observed. Each Linux thread can open a Perfmon2
monitoring session, in which the user indicates which core or which
thread is to be monitored. Our virtualization system targets
multi-threaded programs and implements each guest thread as a
pthread, so we will create a monitoring session for each pthread
created for a guest thread. Moreover, we will also create a
monitoring session for each core in the host platform.

1.2 Sampling Accumulation Phase Detection

A dynamic optimizer needs to accurately identify periods of
execution when the program must be optimized or re-optimized. The
concept of phase was introduced to
identify periods of execution when certain runtime characteristics
do not change. Phase detection [12][13] identifies these
periods and triggers phase changes between these periods. Thus, an
accurate and reliable phase detection scheme is crucial to runtime
performance. Phase detection is an important component of sampling
based dynamic optimizers. Phase detection, as implemented in
current sampling-based prototype dynamic optimization systems such
as ADORE [8], is called Global Phase Detection (GPD), as program
characteristics are computed by taking into account information
from all regions that executed during the profiled interval. The
problem with GPD is that it may not be able to detect the change
between two phases if they have the same average program counter
value. We propose a new phase detection approach called sampling
accumulation phase detection to solve this problem. For each
sampling interval, we maintain a code blocks vector and an
accumulation vector. Both vectors have the same cardinality. An
element in the code blocks vector is a pair of program counters
indicating the beginning and end of a code block of the program. An
element in the accumulation vector records the number of times a
program counter is located in the corresponding code block of this
vector element. When the HPM data buffer overflows, each program
counter in the HPM Data structure is retrieved and compared with
the values in the code blocks vector to find the code block in
which that program counter is located. Then, the corresponding
element in the accumulation vector is incremented by 1.
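This bookkeeping, together with the distance test over adjacent intervals described next, can be sketched as follows. The code-block boundaries, sample values, and threshold below are hypothetical, chosen only to illustrate the mechanism.

```python
# Sampling-accumulation phase detection (sketch).
# code_blocks: list of (start_pc, end_pc) pairs; one accumulation slot each.

def accumulate(samples, code_blocks):
    """Build the accumulation vector for one sampling interval."""
    vec = [0] * len(code_blocks)
    for pc in samples:                       # PCs retrieved from HPM Data
        for i, (start, end) in enumerate(code_blocks):
            if start <= pc < end:
                vec[i] += 1                  # PC falls in this code block
                break
    return vec

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def phase_changed(prev_vec, cur_vec, threshold):
    """A large distance between adjacent intervals signals a phase change."""
    return manhattan(prev_vec, cur_vec) > threshold

# Two intervals whose samples fall in different (hypothetical) code blocks:
blocks = [(0x1000, 0x2000), (0x2000, 0x3000)]
v1 = accumulate([0x1100, 0x1200, 0x1300], blocks)   # all in block 0
v2 = accumulate([0x2100, 0x2200, 0x2300], blocks)   # all in block 1
```

Note that v1 and v2 could have the same average program counter yet still yield a large Manhattan distance, which is exactly the case that defeats GPD.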
For two adjacent sampling intervals, we can compare the Manhattan
distance or the Euclidean distance of their accumulation vectors.
If the distance is larger than a threshold value, then there is a
phase change.

1.3 Hot Code Identification

Optimizing at runtime can
be expensive and incurs a real performance penalty. Limiting the
scope of optimization reduces this overhead. The scope of dynamic
optimization can be reduced by finding frequently executed code.
Such code exists naturally in programs from loops and recursive
function calls. The general technique for identifying such code is
to maintain a count for each basic block; when a block's count
exceeds a threshold, it is optimized. Sampling-based dynamic
optimizers rely on hardware performance counters to collect this
data. By sampling these counters, program counter samples are
obtained periodically. Using these samples, frequently executed
code can be identified.

1.4 Hot Trace Generation

Optimization
at a basic block level may not be beneficial because the
granularity is too small. Thus, it is desirable to aggregate
multiple basic blocks into a larger code segment (also called
trace). Traces are sequences of basic blocks that form the unit of
optimization for dynamic optimizers. Dynamic optimizers try to
select basic blocks that form loops, so that trace exits are minimized.
The other consideration when building traces is to minimize
analysis time. As traces are units of optimization, the dynamic
optimizer passes these traces to its optimization algorithms. These
algorithms must quickly generate an optimized trace. A sampling
based optimizer such as ours is limited by the fact that it does
not have complete control over execution. We will solve the trace
generation problem with dynamic code analysis and runtime profile
estimation. We may also apply the concept of superblocks [14] and
hyperblocks [15] to help guide our trace generation. Our approach
for trace generation is described as follows. According to the HPM
Data for a sampling interval, we can construct a directed graph
with weighted edges. A vertex indicates an IR basic block and the
weight on an edge represents the frequency of the branch between
two IR basic blocks in this sampling interval. We can generate the
hot traces as follows. First, according to the result of hot code
identification, we delete the vertices that represent IR basic
blocks that are not hot. Next, we delete the edges whose weights
are lower than the threshold value. This step results in a graph
with a number of connected sub-graphs. Each sub-graph represents a
hot trace. Furthermore, the hottest block in a sub-graph is the
entry point of the trace. In the example shown in Figure 3, we have
six IR basic blocks. Blocks A, C and E are the hot blocks and block
A is the hottest block. Let the threshold value for frequent
branches between two basic blocks be 8. According to the algorithm
described above, blocks B, D and F will be removed, and edges with
weight smaller than 8 will be removed. This results in a graph of
three vertices and three edges, which happens to be a loop.
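The pruning steps of this example can be reproduced with a short sketch. Only the block names, the hot set {A, C, E} with A the hottest, and the threshold of 8 come from the text; the edge weights and hotness counts below are invented for illustration.

```python
# Hot trace generation (sketch): prune non-hot vertices and low-weight
# edges; each remaining connected sub-graph is a hot trace whose entry
# point is its hottest block.

def hot_traces(edges, hot_blocks, hotness, threshold):
    """edges: {(src, dst): weight}. Returns a list of (entry, trace_blocks)."""
    # Step 1: drop vertices that are not hot. Step 2: drop cold edges.
    kept = {(s, d) for (s, d), w in edges.items()
            if s in hot_blocks and d in hot_blocks and w >= threshold}
    # Find connected components (undirected view) among surviving vertices.
    verts = {v for e in kept for v in e}
    adj = {v: set() for v in verts}
    for s, d in kept:
        adj[s].add(d)
        adj[d].add(s)
    traces, seen = [], set()
    for v in verts:
        if v in seen:
            continue
        comp, stack = set(), [v]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u])
        seen |= comp
        entry = max(comp, key=lambda b: hotness[b])  # hottest block = entry
        traces.append((entry, comp))
    return traces

# Figure 3's shape with illustrative weights: after pruning, the surviving
# edges form the loop A -> C -> E -> A.
edges = {('A', 'C'): 12, ('C', 'E'): 10, ('E', 'A'): 15,
         ('A', 'B'): 3, ('B', 'D'): 2, ('D', 'F'): 1}
hotness = {'A': 40, 'C': 25, 'E': 20, 'B': 5, 'D': 3, 'F': 2}
traces = hot_traces(edges, {'A', 'C', 'E'}, hotness, threshold=8)
```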
Figure 3. An example of hot trace generation

1.5 Machine-Independent Optimizations

LLVM was originally developed as a research
infrastructure at the University of Illinois at Urbana-Champaign to
investigate dynamic compilation techniques for static and dynamic
programming languages. LLVM can perform its own optimizations
(scalar, interprocedural, profile-driven, and loop optimizations)
and code generation from the intermediate form generated by GCC
front ends. The LLVM code generator is easily re-targetable,
supporting x86, PowerPC, MIPS and various other ISAs. Because of
these attractive features, our dynamic optimization system uses
the LLVM back-end to perform machine-independent optimizations on the
intermediate form (IR). To expose more opportunity for optimization
and to maximize the benefit of LLVM optimization, our dynamic
optimizer will try to aggregate smaller hot code blocks to form a
longer trace using our hot trace generation method. Another
optimization we will consider is optimization for indirect
branches. The original program addresses must be used wherever the
application stores indirect branch targets. These addresses must be
translated into their corresponding code cache addresses in order
to jump to the target code. This translation is usually performed
as a hash table lookup, which may be a source of overhead for a
dynamic optimizer. Instead, we will use the following approach.
With several rounds of execution and profiling, the frequently
occurring branch targets of an indirect branch instruction can be
detected. The optimizer inserts a code sequence at the bottom of
the trace. The code sequence consists of a series of compares and
conditional direct branches for each frequent target. The hash
table lookup is performed only when the comparisons in the code
sequence fail.

1.6 Machine-Dependent Optimizations

IA64 (Itanium)

Itanium provides predicate bits, a mechanism that can turn on or
off the effect of an instruction by setting a bit. The
compiler may generate multiple versions of the binary code for
different data access patterns or different frequencies of the
branches taken. According to profiling, we can set the predicate
bits to choose the appropriate versions of the binary code. We may
also set the predicate bits to turn off some prefetch operations to
reduce the cache miss rate.

X86 (i7)

With profiling, we can collect
information about frequent branches. The x86 ISA has different
kinds of branch instructions, depending on the address offset, and
these have different latencies. In general, a branch instruction
with a shorter offset has lower latency. If one basic block jumps
to another frequently, we should place the two blocks as close
together as possible, so that frequent branches can be replaced by
lower-latency branch instructions. The register addressing mode has
the lowest latency of all the addressing modes in the x86 ISA. We
should keep frequently accessed objects in registers, so that
instructions using other addressing modes can be replaced with ones
using register addressing to improve performance. Since profiling
can identify frequently accessed objects, we can apply this
optimization. Data locality
may improve the efficiency of data cache accesses on the x86
architecture, because the hardware automatically prefetches nearby
memory into the cache. We should put data that are accessed at the
same time in neighboring locations, so that data cache space will
be saved.

1.7 Optimization for Multi-cores

Optimization for
multi-core platforms is a much harder problem than for single-core
as we need to deal with parallel applications and with much more
shared resources on multi-core platforms. One of our optimizations
for multi-cores is to reduce resource contention caused by
concurrent access to the same resource by multiple threads. For
example, if the profiling data shows that two particular threads
constantly compete for the same resource on one core, then one way
to solve the problem is to propagate this information to the
operating system so that the OS scheduler can dispatch the two
threads to different CPUs, or lower the priority of one of the
threads so that they are not executed at the same time. Such
optimization can be done either at user level or at system level.
The user-level approach is based on the assumption that the OS is
capable of taking hints from the hardware monitor through
user-level optimization software. The system-level approach will
require modifying the OS scheduler. Another optimization problem we
would like to investigate is disabling over-aggressive prefetching
to reduce cache miss rate. On single-core platforms, prefetching is
an effective mechanism to overlap computation with data access. On
multi-core platforms, however, prefetching needs to be done
carefully. The caches are usually shared by multiple cores (and
thus multiple threads). Over-aggressive prefetching by one thread
may increase cache miss rates in other threads. One solution is to
disable some of the prefetching instructions. One challenging
research issue is to determine an appropriate set of prefetching
instructions to strike a good balance between the benefit of
prefetching and the penalty of over-prefetching. One possible
solution is to exploit hardware support that provides prefetch
information, such as whether the prefetched data is actually used
or is pushed out of the cache before it is used.

1.8 Interaction with Annotations

Help for phase detection. Annotations can provide
the information of the code boundaries for important procedures and
loops. This information can help us to appropriately define the
code region for each element in the vector that accumulates the
frequencies of execution in the different code regions. Using the
hottest basic blocks to define those code regions may not detect
the phase change if two different phases have similar hottest basic
blocks.

Help for identification of hot code. Annotations can
provide the information about the frequencies of execution for the
basic blocks. This information can help us to calculate the
frequency of execution for each basic block.

Help for hot trace generation. The information about the code boundaries of important
procedures and loops from annotation can help us to find the entry
point for a hot trace or help us group more basic blocks together
for more opportunities of optimization with the LLVM back-end code
generator. For example, according to the algorithm described above,
we may generate two hot traces for two hot loops. However, these
two loops may be the main part of a procedure. In this case, we
should combine the two traces into one hot trace, or even make the
whole procedure one hot trace.

Help for optimization.
Annotation information on functional unit use can guide the
optimizer to dispatch the threads that compete for the same
functional units to different cores. Annotation information on
register use can guide the optimizer to replace memory store with
register store and use low-latency register read to access the
data.

Figure 4. Control flow of our dynamic binary optimization system

Having addressed all the important issues, we can now describe the
control flow of our dynamic binary optimization system (Figure 4).
The Hardware Performance Monitor (HPM) samples hardware
events periodically and writes the sampling data into a kernel
buffer. When the buffer overflows, HPM Data will be produced from
these samples in the buffer. HPM Data contains a timestamp, the
program counter of the instruction executing at the time of
sampling, the number of data cache misses, etc. An HPM buffer
overflow activates the Phase Detector to analyze the HPM Data and
detect whether the behavior of the guest program has changed. If a
phase change is detected, the Optimizer is triggered. The Optimizer
identifies the hot code blocks of the guest program, finds their
corresponding IR code blocks by looking up the address mapping
table between guest binary and host binary, and then chains these
hot IR code blocks together to form hot traces in IR form. Next,
these hot IR traces are fed to the LLVM back-end for optimization
and code generation. The generated code is then passed to the
Optimizer for further optimization.

2. Work Plan

Year 1: The
goal of the first year is to develop a light-weight profiling
mechanism, the phase detector, and hot trace identification. The
work items include:

Profiling with Perfmon2. With Perfmon2, we can monitor runtime
information about an individual thread or an individual core, and
store the runtime data in HPM Data. The data in HPM Data represent
runtime information for a set of samples. The data for a sample
include a program counter, a time stamp, a thread ID, a core ID,
and counters for the last-level cache miss, instructions retired,
and clock cycle events. We will develop the mechanism to retrieve
data from the kernel HPM buffer into HPM Data when the buffer
overflows. We will also implement API(s) to access individual
fields of any individual sample in HPM Data.

Algorithm design and implementation for phase detection. First, we
will construct a number of program phases from the guest program
and the code region in each program phase. A program phase may be a
basic block. However, the number of basic blocks may be too large
for this to be an appropriate choice. We will instead consider the
hottest basic blocks on average to be the program phases.

Algorithm design and implementation for hot code identification.

Year 2: Implementation
of the Optimizer and the Hot Trace Cache. The goal of the second
year is to develop the hot trace generator and implement
machine-dependent optimizations on Itanium2. Work items include:

Design and implement the algorithm for generating hot traces.

Develop the mechanism to interact with the translator developed in
sub-project 2 to perform IR-level machine-independent
optimizations. This requires a method to map a binary hot trace to
LLVM IR form, and an API for passing the IR hot trace to the
translator.

Design and implement the algorithms for machine-dependent
optimizations for Itanium. The optimizations include choosing the
appropriate version of binary code and minimizing cache contention
by turning off some of the prefetch operations.

Year 3: The goal of the third year is to
develop machine-dependent optimizations for x86, optimizations for
multi-cores, and improved optimizations using compiler annotations.
Work items include:

Design and implement the algorithms for machine-dependent
optimizations for x86. The optimizations include generating
low-latency branch instructions, better register usage, and
improved data locality.

Develop optimizations for multi-cores. The main objective is to
reduce contention for shared resources.

Improve phase detection and trace generation with procedure and
loop boundary annotations. Improve machine-dependent optimizations
with register-use annotations. Improve multi-core optimizations
with functional-unit use and data access pattern annotations.

Year 1 (08/01/2010 - 07/31/2011)
Work items: 1. Study hardware monitor mechanisms on multi-cores.
2. Design and implement the related API with the translator to get
the address mapping between guest binary code and host binary code.
3. Develop the phase detection algorithm.
Expected results: getting the data from the HPM; getting the
mapping information from the translator; detecting phase changes
accurately.
Deliverables: the data structure for HPM Data; the phase detector.

Year 2 (08/01/2011 - 07/31/2012)
Work items: 1. Study the micro-architecture and instruction set of
the targeted host machine. 2. Design the API with the translator to
generate optimized code for hot code regions. 3. Develop the
algorithm to form long paths. 4. Develop machine-dependent dynamic
binary optimizations.
Expected results: generating hot traces for the LLVM back-end to
generate optimized code; further optimizing the code generated by
LLVM.
Deliverables: the hot trace generator; the machine-independent
dynamic optimizer with the LLVM back-end code generator; the
machine-dependent dynamic optimizer.

Year 3 (08/01/2012 - 07/31/2013)
Work items: 1. Study the OS's thread scheduler for multi-core
platforms. 2. Design the API with annotations to get the
information for optimization. 3. Develop the new optimization
algorithms with annotation data. 4. Develop techniques to generate
scheduling hints from analysis of HPM data and annotation data, as
well as the technique to pass the hints to the OS scheduler.
Expected results: reducing resource contention on multi-cores;
improving the effectiveness of the dynamic optimizer with
annotation data.
Deliverables: the annotation-enhanced dynamic binary optimizer; the
dynamic binary optimizer for multi-cores.