Dynamic Binary Optimization for Virtualization on Multi-cores

The software market requires applications to run on many generations of hardware. Even if software vendors tune their binaries for the most prevalent hardware at release time, the code rapidly becomes mismatched to new platforms as hardware implementations evolve. The latest high-performance microprocessors provide sophisticated runtime monitoring support that allows runtime information, such as cache misses and instruction pipeline stalls, to be monitored so that binary code can be re-optimized at runtime to improve overall performance. Such continuous program re-optimization requires a dynamic compiler to manipulate binary code at runtime.

In another very important type of application, process/system virtualization, a software layer is set up above the hardware to allow multiple OSs and/or applications with different instruction-set architectures (ISAs) to run on the same hardware platform. The main technology supporting virtualization is binary translation. It is essentially a different kind of runtime compiler that takes the binary code of those OSs and application programs with different ISAs and translates it into a sequence of instructions in the ISA of the underlying hardware platform for execution. Most of these binary manipulation techniques today incur substantial runtime overhead. Dynamic optimization of binary code at runtime to improve overall performance is therefore a core technology that deserves study.

Since the optimizations are performed at runtime, a dynamic binary optimizer has to be carefully designed so that the overhead of runtime optimization does not outweigh the performance gain of the optimized code. We call an optimizer that produces more performance gain than overhead an effective optimizer. There are a number of factors that are crucial to the effectiveness of a dynamic optimizer.
Before discussing them, we first give an overview of how a general dynamic binary optimizer works. In general, execution of an application under a dynamic binary optimizer, as shown in Figure 1, begins with the system executing (or emulating) and profiling a running program's instruction stream to track its execution flow. When the system discovers a significant change in the profile, it tries to find frequently executed code sequences (i.e., hot traces); each sequence is then analyzed, optimized, and placed in a code cache. Execution then switches to the optimized code in the code cache.

Figure 1. Control flow of a general dynamic binary optimizer

Now we can identify a number of key factors that could have a profound impact on the effectiveness of a dynamic binary optimizer: (1) the profiling method used to collect runtime information, (2) the frequency with which the optimizer is activated, which often depends on the phase detection method, (3) the detection of frequently executed code sequences, which is also referred to as hot code identification and hot trace generation, (4) the optimizations performed, and (5) the use of compiler annotations to enable better optimizations. In this project, we propose a light-weight, sampling based dynamic binary optimization framework that provides novel solutions to these important issues.

Furthermore, most of the binary optimization techniques today are for single-core platforms. We plan to extend such binary optimization techniques to multi-core platforms. This is a much harder problem, as we need to deal with multithreaded applications and with many more shared resources on multi-core platforms. The core technologies we will develop in this project, if successful, could have significant impact on the development of dynamic binary optimization systems and virtualization systems.

Related Work

Profile-guided optimizations [16][17] provide runtime information for advanced optimizations [18][19][20][21][22].
Hence, recent research attempts to extend the idea of branch profiling to value, cache miss [23], and data dependency [24] profiling. However, collecting a representative profile is difficult for real applications [25]. Although post-link optimizations [26][27][28][29][30][31][32] optimize programs based on performance profiles and reduce the need to recompile, applications can have different performance characteristics with individual inputs. Therefore, an application may have to be optimized during execution, since the specific information about its performance cannot be gathered before the input is given.

Dynamic optimization systems [1][2][3][4][5][6][33][34][35][36] are becoming important because of the need to customize optimizations for individual inputs, behavior that changes over time, dynamically linked libraries, and the micro-architecture. Such dynamic optimization systems typically manipulate and optimize binary code at runtime. The profiling methods used by most binary manipulation and optimization systems can be classified into two categories: Virtual Machine (VM) based and sampling based.

VM-based systems, such as Dynamo [1], DynamoRIO [2], Mojo [3], and PIN [5], typically instrument code for profiling. Therefore, accurate runtime data for phase detection strategies, such as instruction working sets and basic block vectors, can be collected without difficulty. However, such systems incur substantial overhead from profiling, emulation, code-cache management, and the expensive handling of indirect branches. For example, Pin [5] has an overhead of 54% for SPECint2000 benchmarks on IA32 systems, and DynamoRIO [2] has an overhead of 42% in the same environment. This is the minimal overhead reported, when no instrumentation or optimization is performed.

Unlike VM-based optimizers, sampling based optimizers, such as ADORE [4], and sampling based profiling tools, such as SimPoint [7], typically do not instrument code for profiling.
Therefore, runtime data for phase detection strategies cannot be collected with the same accuracy. Also, sampling based optimizers do not have complete control over program execution. They take frequent snapshots of program execution and thus see only frequently executed code, but not the complete execution path leading to it. However, sampling based profiling has lower runtime overhead than VM-based profiling.

Dynamic optimization systems using sampling based profiling rely on phase detection to detect changes in the code working set and changes in performance characteristics that can affect optimization strategies. Phase detection techniques can be classified into two categories: Global Phase Detection (GPD) [8] and Local Phase Detection (LPD) [9][10]. In GPD, program characteristics are computed by taking into account information from all regions that executed during the profiled interval. Hence, it is sensitive to the sampling period, the interval size, and the thresholds used in the phase detector. LPD can detect phase changes more accurately than GPD because the scope of phase detection is reduced to a small code region, such as a basic block, a loop, or a procedure. Commonly used LPD methods include region monitoring [10] and trace compilation [9].

Table 1 compares these optimization systems. Details of each optimization system are described below. Note that all of these optimization systems are for single-core platforms, except that the ADORE system runs the optimizer and the user application code on separate cores.

Dynamo [1] is a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor. The input native instruction stream to Dynamo can be dynamically generated (by a JIT, for example), or it can come from the execution of a statically compiled native binary. Dynamo focuses its efforts on optimization opportunities that tend to manifest only at runtime.
Experimental results demonstrate that even statically optimized native binaries can be accelerated by Dynamo. For example, the average performance of -O optimized SpecInt95 benchmark binaries created by the HP product C compiler is improved to a level comparable to their -O4 optimized version running without Dynamo. The performance advantage of Dynamo in this case is not surprising, because it was compared with compile-time static optimizations, which usually lack the runtime information needed to generate code with good performance. Since Dynamo relies on VM-based profiling and runtime emulation of the program execution, its runtime overhead could be high.

DynamoRIO [2] is a framework, extended from Dynamo, for implementing dynamic analyses and optimizations. It provides an interface for building external modules, or clients, for the DynamoRIO dynamic code modification system. This interface abstracts away many low-level details of the DynamoRIO runtime system while exposing a simple yet efficient API. This is achieved by restricting optimization units to linear streams of code and using adaptive levels of detail for representing instructions. The interface is not restricted to optimization and can be used for instrumentation, profiling, dynamic translation, etc. DynamoRIO also implements several optimizations. These improve the performance of some applications by 12% on average, relative to native execution. Since DynamoRIO is intended to be an analysis and instrumentation tool, it uses expensive software instrumentation based profiling and an interpreter for emulation.

System         Sampling based  VM based   Emulation with  Annotation
               profiling       profiling  interpreter     information
Dynamo [1]     no              yes        yes             no
DynamoRIO [2]  no              yes        yes             no
Mojo [3]       no              yes        no              no
ADORE [4]      yes             no         no              no
PIN [5]        no              yes        yes             no
JikesRVM [6]   no              yes        yes             yes

Optimizations:
Dynamo [1]: 1. Hot tracing linking
DynamoRIO [2]: 1. Constant propagation; 2. Dead code removal; 3. Call return matching; 4. Stack adjust
Mojo [3]: 1. Hot path linking; 2. Drop unconditional jumps; 3. Call/return sequences inlined; 4. Unrolling loops
ADORE [4]: 1. Dynamic register allocation; 2. Runtime data cache prefetching; 3. Hot trace patching
PIN [5]: 1. Persistent code caching
JikesRVM [6]: 1. Adaptive inlining; 2. Register allocation and coalescing; 3. Tail recursion elimination; 4. Code reordering; 5. Dead code elimination; 6. Loop normalization and unrolling; 7. Load/store and redundant branch elimination

Table 1. Comparison of VM based dynamic optimizers and sampling based optimizers

Unlike most dynamic optimizers, which have chiefly targeted running the SPEC benchmarks on scientific workstations, Mojo [3], developed by Microsoft Research, contends that dynamic optimization technology is also important to the desktop computing environment, where running large, complex commercial software applications is commonplace. Mojo implements its optimizations for the x86 architecture. It also supports exception handling and multithreaded applications on Windows, and its authors report preliminary performance measurements. Similar to Dynamo and DynamoRIO, Mojo employs VM based profiling. However, it does not rely on the time-consuming emulation/interpretation of program execution.

ADORE [4] is a light-weight dynamic binary optimization system developed at the University of Minnesota. It is light-weight because it uses sampling based on hardware performance monitoring for profiling. ADORE uses dynamic optimization to address cache misses, branch mispredictions, and other performance events at runtime. It detects performance problems of running applications and deploys optimizations to increase execution efficiency. ADORE's approach includes detecting performance bottlenecks, generating optimized traces, and redirecting execution from the original code to the dynamically optimized code. Experimental results show that ADORE speeds up many of the CPU2000 benchmark programs that have large numbers of D-cache misses through dynamically deployed cache prefetching.
For other applications that do not benefit from ADORE's runtime optimization, the average cost is only 2% of execution time. ADORE is a good example of using existing hardware and software to deploy speculative optimizations to improve a program's runtime performance. In this project, we will develop our dynamic binary optimization system based on ADORE because of the various efficient and attractive features it provides.

PIN [5] is an instrumentation system developed by Intel. It aims to support easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin's API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling, to optimize instrumentation. As a result, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium, and ARM.

JikesRVM [6]: Jikes RVM (Research Virtual Machine) provides a flexible open testbed to prototype virtual machine technologies and experiment with a large variety of design alternatives. Jikes RVM can run on various platforms. It implements virtual machine technologies for dynamic compilation, adaptive optimization, garbage collection, thread scheduling, and synchronization. A distinguishing characteristic of Jikes RVM is that it is implemented in the Java programming language and is self-hosted, i.e., its Java code runs on itself without requiring a second virtual machine. JikesRVM uses VM based profiling and an interpreter for program emulation.

1. Approach

We propose a light-weight, sampling based dynamic binary optimization system. The system diagram of the virtualization system proposed in the main project, including system components and major data structures, is depicted in Figure 2. The blocks circled by a dotted line are the components of the dynamic binary optimization sub-system proposed in this sub-project. The components include a Hardware Performance Monitor profiler (HPM Data), Phase Detector, Hot Trace Generator, and Optimizer.

Figure 2. System diagram of our virtualization system

We first describe the functionality of, and our design decisions for, each of the components in our optimization system. We also address the important research issues in each component.

1.1 Light-weight HPM-Based Profiling

We exploit the hardware performance monitors (HPM) in the processor for light-weight profiling. Since the HPM counts events automatically, we can expect the extra overhead of monitoring program behavior to be much lower than that of an instrumentation approach that counts events in software. We adopt Perfmon2 [11], a standard performance monitoring interface for Linux, to access the HPM. It provides convenient interfaces that help the user set the HPM registers for the events to be observed. Each Linux thread can run a Perfmon2 monitoring session. In a monitoring session, the user can indicate which core or which thread is to be monitored. In our virtualization system, we target multi-threaded programs and implement each guest thread as a pthread, so we will create a monitoring session for each pthread created for a guest thread. Moreover, we will also create a monitoring session for each core in the host platform.

1.2 Sampling Accumulation Phase Detection

A dynamic optimizer needs to accurately identify the periods of execution when a program must be optimized or re-optimized. The concept of a phase was introduced to identify periods of execution when certain runtime characteristics
do not change. Phase detection [12][13] identifies these periods and signals the phase changes between them. Thus, an accurate and reliable phase detection scheme is crucial to runtime performance. Phase detection is an important component of sampling based dynamic optimizers. Phase detection as implemented in current sampling based prototype dynamic optimization systems such as ADORE [8] is called Global Phase Detection (GPD), because program characteristics are computed by taking into account information from all regions that executed during the profiled interval. The problem with GPD is that it may not be able to detect the change between two phases if they have the same average program counter value.

We propose a new phase detection approach, called sampling accumulation phase detection, to solve this problem. For each sampling interval, we maintain a code blocks vector and an accumulation vector. Both vectors have the same cardinality. An element in the code blocks vector is a pair of program counters indicating the beginning and end of a code block of the program. The corresponding element in the accumulation vector records the number of times a sampled program counter falls within that code block. When the HPM data buffer overflows, each sampled program counter in the data structure HPM Data is retrieved and compared with the values in the code blocks vector to find the code block in which it is located. Then the corresponding element in the accumulation vector is incremented by 1. For two adjacent sampling intervals, we compute the Manhattan distance or the Euclidean distance between their accumulation vectors. If the distance is larger than a threshold value, a phase change has occurred.

1.3 Hot Code Identification

Optimizing at runtime can be expensive and incurs a real performance penalty. Limiting the scope of optimization reduces this overhead.
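As an illustration of the sampling accumulation phase detection proposed in Section 1.2, the following Python sketch builds an accumulation vector from the program counter samples of one interval and compares two adjacent intervals. It is only a minimal sketch under stated assumptions: the function names (accumulate, manhattan, euclidean, phase_changed), the inclusive (start, end) representation of a code block, and the example addresses are all hypothetical and are not taken from ADORE, Perfmon2, or our implementation.

```python
from bisect import bisect_right


def accumulate(code_blocks, pcs):
    """Build the accumulation vector for one sampling interval.

    code_blocks: sorted, non-overlapping (start_pc, end_pc) pairs, end inclusive
                 (an assumed encoding of the code blocks vector).
    pcs:         program counter samples taken during the interval.
    """
    starts = [start for start, _ in code_blocks]
    acc = [0] * len(code_blocks)
    for pc in pcs:
        # Find the last block whose start address is <= pc.
        i = bisect_right(starts, pc) - 1
        # Count the sample only if it really lies inside that block.
        if i >= 0 and pc <= code_blocks[i][1]:
            acc[i] += 1
    return acc


def manhattan(a, b):
    """Manhattan (L1) distance between two accumulation vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Euclidean (L2) distance between two accumulation vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def phase_changed(prev_acc, cur_acc, threshold, distance=manhattan):
    """Report a phase change when the distance exceeds the threshold."""
    return distance(prev_acc, cur_acc) > threshold


if __name__ == "__main__":
    # Hypothetical code blocks and samples, chosen for illustration only.
    blocks = [(0x1000, 0x10FF), (0x2000, 0x20FF), (0x3000, 0x30FF)]
    interval_a = accumulate(blocks, [0x1000, 0x3000])
    interval_b = accumulate(blocks, [0x2000, 0x2000])
    print(interval_a, interval_b, phase_changed(interval_a, interval_b, threshold=3))
```

In the demo, both intervals have the same average program counter value (0x2000), so a detector based on the average alone would see no difference, whereas the accumulation vectors are [1, 0, 1] and [0, 2, 0] and their Manhattan distance is 4. The threshold of 3 is an arbitrary choice for the example; in practice it would have to be tuned together with the sampling period and interval size.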
The scope of dynamic optimization can be reduced by finding frequently executed code. Such code arises naturally in programs from loops and recursive function calls. The general technique for identifying such code is to maintain a count for each basic block; when the count of a basic block exceeds a threshold, the block is optimized. Sampling based dynamic optimizers rely on hardware performance counters to collect this data. By sampling these counters, program counter samples are obtained periodically. Using these samples, frequently executed code can be identified.

1.4 Hot Trace Generation

Optimization at the basic block level may not be beneficial because the granularity is too small. Thus, it is desirable to aggregate multiple basic blocks into a larger code segment (also called a trace). Traces are sequences of basic blocks that form the unit of optimization for dynamic optimizers. Dynamic optimizers try to select basic blocks that form loops, because this minimizes trace exits. The other consideration when building traces is to minimize analysis time. As traces are the units of optimization, the dynamic optimizer passes them to its optimization algorithms, and these algorithms must generate an optimized trace quickly. A sampling based optimizer such as ours is limited by the fact that it does not have complete control over execution. We will solve the trace generation problem by dynamic code analysis and runtime estimation of the profile. We may also apply the concepts of superblocks [14] and hyperblocks [15] to help guide our trace generation.

Our approach to trace generation is as follows. From the HPM Data for a sampling interval, we construct a directed graph with weighted edges. A vertex represents an IR basic block, and the weight on an edge represents the frequency of the branch between two IR basic blocks in this sampling interval. We can generate the hot traces as follows.
First, according to the result of hot code identification, we delete the vertices that represent IR basic blocks that are not hot. Next, we delete the edges whose weights are lower than the threshold value. This step results in a graph with a number of connected sub-graphs. Each sub-graph represents a hot trace. Furthermore, the hottest block in a sub-graph is the entry point of the trace.

In the example shown in Figure 3, we have six IR basic blocks. Blocks A, C, and E are the hot blocks, and block A is the hottest block. Let the threshold value for frequent branches between two basic blocks be 8. According to the algorithm described above, blocks B, D, and F will be removed, and edges with weight smaller than 8 will be removed. This results in a graph of three vertices and three edges, which happens to be a loop.

Figure 3. An example of hot trace generation

1.5 Machine-Independent Optimizations

LLVM was originally developed as a research infrastructure at the University of Illinois at Urbana-Champaign to investigate dynamic compilation techniques for static and dynamic programming languages. LLVM can perform its own optimizations (scalar, interprocedural, profile-driven, and loop optimizations) and code generation from the intermediate form generated by GCC front ends. The LLVM code generator is easily retargetable, supporting x86, PowerPC, MIPS, and various other ISAs. Because of these attractive features, our dynamic optimization system uses the LLVM back-end to perform machine-independent optimizations on the intermediate representation (IR). To expose more opportunities for optimization and to maximize the benefit of LLVM optimization, our dynamic optimizer will try to aggregate smaller hot code blocks to form a longer trace using our hot trace generation method.

Another optimization we will consider is optimization for indirect branches. The original program addresses must be used wherever the application stores indirect branch targets.
These addresses must be translated into their corresponding code cache addresses in order to jump to the target code. This translation is usually performed as a hash table lookup, which may be a source of overhead for a dynamic optimizer. Instead, we will use the following approach. After several rounds of execution and profiling, the frequently occurring targets of an indirect branch instruction can be detected. The optimizer inserts a code sequence at the bottom of the trace. The code sequence consists of a compare and a conditional direct branch for each frequent target. The hash table lookup is performed only when all of the comparisons in the code sequence fail.

1.6 Machine-Dependent Optimizations

IA64 (Itanium)

Itanium provides predicate bits, a mechanism that can turn the effect of an instruction on or off by setting a bit. The compiler may generate multiple versions of the binary code for different data access patterns or different frequencies of taken branches. Based on profiling, we can set the predicate bits to choose the appropriate version of the binary code. We may also set the predicate bits to turn off some prefetch operations to reduce the cache miss rate.

X86 (i7)

With profiling, we can collect information about frequent branches. The x86 ISA has different kinds of branch instructions, depending on the address offset, and these have different latencies. In general, a branch instruction with a shorter offset has lower latency. If one basic block frequently jumps to another, we should place the two blocks as close together as possible, so that frequent branches can be replaced by lower-latency branch instructions. The register addressing mode has the lowest latency of all the addressing modes in the x86 ISA.
We should store frequently accessed objects in registers so that instructions using other addressing modes can be replaced by instructions using the register addressing mode to improve performance. With profiling, we can collect information about frequently accessed objects and therefore apply this optimization. Data locality may improve the efficiency of data cache access on the x86 architecture, because the hardware automatically prefetches neighboring memory into the cache. We should place data that are accessed at the same time in neighboring locations so that data cache space is used more efficiently.

1.7 Optimization for Multi-cores

Optimization for multi-core platforms is a much harder problem than for single-core platforms, as we need to deal with parallel applications and with many more shared resources. One of our optimizations for multi-cores is to reduce the resource contention caused by concurrent access to the same resource by multiple threads. For example, if the profiling data shows that two threads constantly compete for the same resource on one core, one way to solve the problem is to propagate this information to the operating system, so that the OS scheduler can dispatch the two threads to different CPUs or lower the priority of one of the threads so that they are not executed at the same time. Such optimization can be done at either the user level or the system level. The user-level approach is based on the assumption that the OS is capable of taking hints from the hardware monitor through user-level optimization software. The system-level approach will require modifying the OS scheduler.

Another optimization problem we would like to investigate is disabling over-aggressive prefetching to reduce the cache miss rate. On single-core platforms, prefetching is an effective mechanism to overlap computation with data access. On multi-core platforms, however, prefetching needs to be done carefully.
The caches are usually shared by multiple cores (and thus by multiple threads). Over-aggressive prefetching by one thread may cause increased cache miss rates in other threads. One solution is to disable some of the prefetch instructions. A challenging research issue is to determine an appropriate set of prefetch instructions that strikes a good balance between the benefit of prefetching and the penalty of over-prefetching. One possible solution is to use hardware support that provides prefetch information, such as whether the prefetched data is actually used or is evicted from the cache before it is used.

1.8 Interaction with Annotations

Help for phase detection

Annotations can provide the code boundaries of important procedures and loops. This information can help us appropriately define the code region for each element in the vector that accumulates the execution frequencies of the different code regions. Using the hottest basic blocks to define those code regions may fail to detect a phase change if two different phases have similar hottest basic blocks.

Help for identification of hot code

Annotations can provide information about the execution frequencies of basic blocks. This information can help us calculate the frequency of execution for each basic block.

Help for hot trace generation

The code boundaries of important procedures and loops provided by annotations can help us find the entry point of a hot trace, or help us group more basic blocks together to create more opportunities for optimization with the LLVM back-end code generator. For example, according to the algorithm described above, we may generate two hot traces for two hot loops. However, these two loops may form the main part of a procedure. In this case, we should combine the two traces into one hot trace, or even make the whole procedure one hot trace.
Help for optimization

Annotation information on functional unit use can guide the optimizer to dispatch threads that compete for the same functional units to different cores. Annotation information on register use can guide the optimizer to replace memory stores with register stores and to use low-latency register reads to access the data.

Figure 4. Control flow of our dynamic binary optimization system

Having addressed all the important issues, we can now describe the control flow of our dynamic binary optimization system (Figure 4). The Hardware Performance Monitor (HPM) samples hardware events periodically and writes the sampling data into a kernel buffer. When the buffer overflows, HPM Data is produced from the samples in the buffer. HPM Data contains the timestamp, the program counter of the instruction being executed at the time of sampling, the number of data cache misses, etc. The HPM buffer overflow activates the Phase Detector, which analyzes the HPM Data to detect whether the behavior of the guest program has changed. If a phase change is detected, the Optimizer is triggered. The Optimizer identifies the hot code blocks of the guest program, finds their corresponding IR code blocks by looking up the address mapping table between the guest binary and the host binary, and then chains these hot IR code blocks together to form hot traces in IR form. Next, these hot IR traces are fed to the LLVM back-end for optimization and code generation. The generated code is then passed to the Optimizer for further optimization.

2. Work Plan

Year 1: The goal of the first year is to develop the light-weight profiling mechanism, the phase detector, and hot trace identification. The work items include:

Profiling with Perfmon2
With Perfmon2, we can monitor runtime information about an individual thread or an individual core and store the runtime data in HPM Data. The data in HPM Data represent runtime information for a set of samples.
Data for a sample include a program counter, a time stamp, a thread ID, a core ID, and counters for the last-level cache miss event, the instructions retired event, and the clock cycles event. We will develop the mechanism to copy data from the HPM kernel buffer into HPM Data when the buffer overflows. We will also implement API(s) to access individual fields of any individual sample in HPM Data.

Algorithm design and implementation for phase detection
First, we will construct a number of program phases from the guest program and the code region in each program phase. A program phase may be a basic block. However, the number of basic blocks may be too large for this to be an appropriate choice. We will consider the basic blocks that are hottest on average to be the program phases.

Algorithm design and implementation for hot code identification

Year 2: Implementation of Optimizer and Hot Trace Cache
The goal of the second year is to develop the hot trace generator and implement machine-dependent optimizations on Itanium2. Work items include:

Design and implement the algorithm for generating hot traces.
Develop the mechanism to interact with the translator developed in sub-project 2 to perform IR-level machine-independent optimizations. This will require a method to map a binary hot trace to LLVM IR form and an API for passing the IR hot trace to the translator.
Design and implement the algorithms for machine-dependent optimizations for Itanium. The optimizations include choosing the appropriate version of the binary code and minimizing cache contention by turning off some of the prefetch operations.

Year 3: The goal of the third year is to develop machine-dependent optimizations for x86 and optimizations for multi-cores, and to improve optimizations with compiler annotations. Work items include:

Design and implement the algorithms for machine-dependent optimizations for x86. The optimizations include generating low-latency branch instructions, better register usage, and improving data locality.
Develop optimizations for multi-cores. The main objective is to reduce contention for shared resources.
Improve phase detection and trace generation with procedure and loop boundary annotations.
Improve machine-dependent optimizations with register-use annotations.
Improve multi-core optimizations with functional-unit use and data access pattern annotations.

Year 1 (08/01/2010 - 07/31/2011):
1. Study hardware monitoring mechanisms on multi-cores.
2. Design and implement the related API with the translator to get the address mapping between guest binary code and host binary code.
3. Develop the phase detection algorithm.
Year 2 (08/01/2011 - 07/31/2012):
1. Study the microarchitecture and instruction set of the targeted host machine.
2. Design the API with the translator to generate optimized code for hot code regions.
3. Develop the algorithm to form long paths.
4. Develop machine-dependent dynamic binary optimizations.
Year 3 (08/01/2012 - 07/31/2013):
1. Study the OS thread scheduler for multi-core platforms and the host.
2. Design the API with annotation to get the information for optimization.
3. Develop the new optimization algorithms with annotation data.
4. Develop techniques to generate scheduling hints from analysis of HPM data and annotation data, as well as the technique to pass the hints to the OS scheduler.

Year 1: Getting the data from HPM; getting the mapping information from the translator; detecting phase changes accurately.
Year 2: Generating hot traces for the LLVM back-end to generate optimized code; further optimizing the code generated by LLVM.
Year 3: Reducing resource contention on multi-cores; improving the effectiveness of the dynamic optimizer with annotation data.
Year 1: The data structure for HPM Data; phase detector.
Year 2: Hot trace generator; machine-independent dynamic optimizer with LLVM back-end code generator; machine-dependent dynamic optimizer.
Year 3: Annotation-enhanced dynamic binary optimizer; dynamic binary optimizer for multi-cores.
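
The control flow described earlier (periodic HPM sampling into a bounded buffer, phase detection on buffer overflow, and hot trace formation through the address mapping table) can be illustrated with a minimal Python sketch. It is not the proposed implementation. The sample fields follow the list in the work plan, but everything else is an assumption made for illustration: the names Sample, Pipeline, hottest_pcs, jaccard and build_trace, the use of set overlap between consecutive windows as the phase test, the threshold value, and the dictionary that stands in for the address mapping table.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Sample:
    """One HPM sample; the fields follow the list given in the work plan."""
    pc: int                    # program counter at sampling time
    timestamp: int
    thread_id: int
    core_id: int
    llc_misses: int            # last-level cache miss counter
    instructions_retired: int
    clock_cycles: int


def hottest_pcs(hpm_data: List[Sample], top_n: int) -> Set[int]:
    """Return the top_n most frequently sampled program counters."""
    counts = Counter(sample.pc for sample in hpm_data)
    return {pc for pc, _ in counts.most_common(top_n)}


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Overlap of two sets in [0, 1]; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def build_trace(hpm_data: List[Sample], hot: Set[int],
                address_map: Dict[int, str]) -> List[str]:
    """Chain the IR blocks of hot PCs in order of first appearance."""
    trace: List[str] = []
    for sample in hpm_data:
        block = address_map.get(sample.pc)  # hypothetical mapping-table lookup
        if sample.pc in hot and block is not None and block not in trace:
            trace.append(block)
    return trace


class Pipeline:
    """Hypothetical glue: sample buffer -> phase detector -> trace builder."""

    def __init__(self, capacity: int, address_map: Dict[int, str],
                 top_n: int = 2, threshold: float = 0.5) -> None:
        self.capacity = capacity
        self.address_map = address_map
        self.top_n = top_n
        self.threshold = threshold      # assumed cut-off for "same phase"
        self.buffer: List[Sample] = []
        self.prev_hot: Optional[Set[int]] = None
        self.traces: List[List[str]] = []

    def record(self, sample: Sample) -> None:
        """Store one sample; a full buffer plays the role of an overflow."""
        self.buffer.append(sample)
        if len(self.buffer) >= self.capacity:
            hpm_data, self.buffer = self.buffer, []
            self._on_overflow(hpm_data)

    def _on_overflow(self, hpm_data: List[Sample]) -> None:
        """Detect a phase change and, if one occurred, form a hot trace."""
        hot = hottest_pcs(hpm_data, self.top_n)
        # Assumption: the first window always counts as a new phase.
        changed = (self.prev_hot is None
                   or jaccard(hot, self.prev_hot) < self.threshold)
        self.prev_hot = hot
        if changed:
            self.traces.append(build_trace(hpm_data, hot, self.address_map))
```

In this sketch each entry of `Pipeline.traces` stands for one hot IR trace that would be handed to the LLVM back-end. A window whose hot set overlaps strongly with the previous one produces no new trace, which keeps the optimizer idle while the guest program stays in the same phase.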