UNIVERSITY OF CALIFORNIA, SAN DIEGO
Parallel Speedup Estimates for Serial Programs
A dissertation submitted in partial satisfaction of the
requirements for the degree
Doctor of Philosophy
in
Computer Science (Computer Engineering)
by
Donghwan Jeon
Committee in charge:
Professor Michael Taylor, Chair
Professor Chung-Kuan Cheng
Professor Sorin Lerner
Professor Bill Lin
Professor Steven Swanson
Professor Dean Tullsen
2012
Copyright
Donghwan Jeon, 2012
All rights reserved.
The dissertation of Donghwan Jeon is approved, and it
is acceptable in quality and form for publication on
microfilm and electronically:
Chair
University of California, San Diego
2012
DEDICATION
To my mother Misook Chung,
for her endless love and devotion.
EPIGRAPH
There are only two mistakes one can make along the road to truth;
not going all the way, and not starting.
LIST OF TABLES

Table 2.1: Measured Speedup (16-core) and CPA Estimated Speedup
Table 2.2: Impact of Summarization Technique on File Size in NPB
Table 3.1: Overview of Two Platforms - Raw and Multicore
Table 3.2: Estimated Speedup with and without Expressible Self-Parallelism
Table 4.1: Motivation: Vector Shadow Memory Overheads of the Hierarchical Critical Path Analysis (HCPA) Differential Dynamic Analysis
Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
VITA
2001    B.S. in Computer Science and Engineering, Seoul National University, Seoul, Korea

2002-2005    Software Engineer, MDS Technology, Seoul, Korea

2005-2012    Graduate Research Assistant, University of California, San Diego

2008    M.S. in Computer Science (Computer Engineering), University of California, San Diego

2012    Ph.D. in Computer Science (Computer Engineering), University of California, San Diego
PUBLICATIONS
Saturnino Garcia, Donghwan Jeon, Christopher Louie, Michael Bedford Taylor, “The Kremlin Oracle for Sequential Code Parallelization”, IEEE Micro, July/August 2012.

Donghwan Jeon, Saturnino Garcia, Christopher Louie, Michael Bedford Taylor, “Kismet: Parallel Speedup Estimates for Serial Programs”, Proceedings of the ACM Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), October 2011.

Saturnino Garcia, Donghwan Jeon, Christopher Louie, Michael Bedford Taylor, “Kremlin: Rethinking and Rebooting gprof for the Multicore Age”, Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2011.

Donghwan Jeon, Saturnino Garcia, Christopher Louie, Michael Bedford Taylor, “Parkour: Parallel Speedup Estimates for Serial Programs”, USENIX Workshop on Hot Topics in Parallelism (HotPar), May 2011.

Donghwan Jeon, Saturnino Garcia, Christopher Louie, Michael Bedford Taylor, “Kremlin: Like gprof, but for Parallelization”, Principles and Practice of Parallel Programming (PPoPP), February 2011.
Saturnino Garcia, Donghwan Jeon, Christopher Louie, Srivanthi Kota-Venkata, Michael Bedford Taylor, “Bridging the Parallelization Gap: Automating Parallelism Discovery and Planning”, USENIX Workshop on Hot Topics in Parallelism (HotPar), June 2010.

Srivanthi Kota Venkata, Ikkjin Ahn, Donghwan Jeon, Anshuman Gupta, Christopher Louie, Saturnino Garcia, Serge Belongie, Michael Bedford Taylor, “SD-VBS: The San Diego Vision Benchmark Suite”, Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), October 2009.
ABSTRACT OF THE DISSERTATION
Parallel Speedup Estimates for Serial Programs
by
Donghwan Jeon
Doctor of Philosophy in Computer Science (Computer Engineering)
University of California, San Diego, 2012
Professor Michael Taylor, Chair
Software engineers now face the difficult task of parallelizing serial programs
for parallel execution on multicore processors. Parallelization is a complex task
that typically requires considerable time and effort. However, even after extensive
engineering efforts, the resulting speedup is often fundamentally limited due to the
lack of parallelism in the target program or the inability of the target platform to
exploit existing parallelism. Unfortunately, little guidance is available as to how
much benefit parallelization may bring, making it hard for a programmer to
answer this critical question: “Should I parallelize my code?”
In this dissertation, we examine the design and implementation of Kismet, a
tool that creates parallel speedup estimates for unparallelized serial programs. Our
approach differs from previous approaches in that it does not require any changes
or manual analysis of the serial program. This difference allows quick profitability
analysis of a program, helping programmers make informed decisions in the initial
stages of parallelization.
To provide parallel speedup estimates from serial programs, we developed
a dynamic program analysis named hierarchical critical path analysis (HCPA).
HCPA extends a classic technique called critical path analysis to quantify localized
parallelism of each region in a program. Based on the parallelism information from
HCPA, Kismet incorporates key parallelization constraints that can significantly
affect the parallel speedup, providing realistic speedup upper bounds. The results
are compelling. Kismet can significantly improve the accuracy of parallel speedup
estimates relative to prior work based on critical path analysis.
Chapter 1
Introduction
Software engineers currently face the daunting task of parallelizing their
programs to take advantage of multi-core processors. These multi-core processors
provide extensive parallel resources, offering the potential for greatly improved
performance. However, software must be parallelized to exploit the capabilities
of these multi-core processors. This is a radical change for software engineers. Until
recently, most microprocessors had only a single core. Thanks to increasing operating
frequency and microarchitectural innovations fueled by new CMOS process
technologies, these processors enabled exponential performance growth without
any changes in the software. However, since 2005 the power wall and increasing
on-chip wire delay have fundamentally changed the way processors are designed,
bringing the task of parallelization to software engineers.
Unfortunately, parallelization typically relies on an individual engineer's manual
effort rather than automated tools. An automatic parallelizing compiler might
be the ideal solution: it analyzes the serial source code, finds
parallelization opportunities, applies code transformations, and finally emits the
parallel executable of the program. In reality, however, even state-of-the-art
parallelizing compilers miss many parallelization opportunities. Because compilers
have to guarantee the correctness of their output, they do not parallelize code when
they cannot prove correctness, missing potential parallelization opportunities. Fur-
thermore, many programming languages do not explicitly express the parallelism
in code, which makes it even harder for an automated tool to discover and exploit
parallelism.
Manual parallelization typically requires extensive time and effort. To begin
with, thinking in parallel and writing a parallel program is hard for a human.
Parallel programs are also harder to test, modify, debug, and maintain due
to concurrency issues such as race conditions and deadlock. Furthermore, unlike
the well-established serial programming environment, the parallel programming
environment is still evolving, requiring additional learning and training for pro-
grammers.
Even after extensive parallelization efforts, the resulting performance of
refactored serial programs often falls short of optimistic speedups derived from the
available core count. Worse-than-expected performance can be caused by several
factors. First, the program may have an inherently low amount of parallelism—
possibly the result of choosing an algorithm without considering its parallelizability.
Second, the target system may be a poor choice for that program—the result of a
mismatch between the structure of the parallelism in the program and the ability
of the system to efficiently exploit it. Finally, the implementation may be poor—
the result of missed parallelism opportunities or poorly executed parallelization
attempts.
Given the serious investment of engineering effort required and the widely varying
parallel performance, parallelization carries significant risk in software engineering.
Unfortunately, very little tool support is available to help parallelize software,
especially in the early stages of parallelization. For example, if a programmer
decides to parallelize a program with very limited parallelism, the programmer is
likely to waste precious development time and see disappointing results. Had the
programmer known that the program was parallelism-limited, that time could
have been better spent on serial optimization or on substituting the algorithm with
another one that has more parallelism.
In this thesis, we propose the design and implementation of a parallel
speedup estimation tool that mitigates the risk of parallel software engineering.
The tool answers the question “Should I parallelize my code?”, allowing software
developers to understand the potential benefit associated with the cost of migrating
an existing serial implementation into a parallel one. Unlike other parallel perfor-
mance tuning tools that require an already parallelized program, the tool works on
unmodified serial source code, helping the user in the initial stages of paralleliza-
tion. Furthermore, by incorporating real-world parallelization constraints, the tool
can provide parallel speedup estimates that are accurate enough for practical use.
This thesis makes the following key contributions:
• It presents Kismet, a tool that provides parallel speedup estimates from
serial source code and target-specific parallelization constraints. Because
Kismet automatically provides parallel speedup estimates from unmodified
serial code, it requires little user effort compared to other tools that demand
pre-parallelization or user annotations [HLL10, Int].
• It introduces the use of summarization techniques on hierarchical critical path
analysis (HCPA) [GJLT11], a recently proposed dynamic program analysis
employed in Kismet that measures the parallelism of a program. The use of
summarization techniques significantly improves the applicability of HCPA
over its previous implementation, which relied on a compression technique.
• It demonstrates the effectiveness of Kismet with a wide range of benchmarks
on two very different platforms: MIT Raw and AMD multi-core. Using the
parallelism profile from HCPA and a brief description of target-specific
parallelization constraints, Kismet was able to provide parallel speedup estimates
accurate enough to guide the initial stages of parallelization.
• It presents the design and implementation of Skadu, a vector shadow memory
system that dramatically reduces memory and runtime overhead of hierar-
chical memory analysis. The effectiveness of Skadu is shown with HCPA and
memory footprint analysis.
1.1 Existing Tools for Parallelization
Several tools are available to improve the productivity of parallelization. In
this section, we survey existing tools that can help with parallelization and discuss
their merits and limitations.
Parallelizing Compiler A parallelizing compiler would be the ideal solution from
a software engineer's standpoint. Taking serial source code, it
discovers parallelization opportunities, applies the required code transformations, and
finally emits a parallel executable. Unfortunately, the resulting performance is
often disappointing due to missed parallelization opportunities, as shown in Kim
et al. [KKL10b] and Tournavitis et al. [TWFO09]. Those missed opportunities
stem from an automatic parallelizing compiler's limitations in static pointer
analysis and in handling irregular control flow and program input dependence.
Serial Profiler Although performance profilers such as gprof [GKM82] were
developed for serial program optimization rather than parallelization, they provide
useful hotspot information. By focusing only on hotspots, a programmer can make
the parallelization process more productive. Unfortunately, these profilers do not
quantify the parallelism of the target program, so a programmer must manually
estimate the profitability of parallelizing each program region, which is
time-consuming and error-prone.
Critical Path Analysis Based Tools Critical path analysis (CPA) [Kum88] is
a classical program analysis that quantifies the theoretical speedup upper bound.
CPA analyzes dependences in data-flow style execution and reports the speedup
on an ideal target platform where an unlimited number of cores is available and
parallelization overhead does not exist. CPA's strength lies in quantifying
parallelism regardless of the serial implementation of the program: for example,
reordering two independent statements in a program does not change CPA's result.
Unfortunately, the reported number is typically overly optimistic, and
CPA has seen only limited use, primarily in research projects [Kum88, KMC72,
AS92, KBI+09].
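CPA's ideal-machine model can be made concrete with a short sketch. The following illustrative code (not Kismet's implementation; the `Instr` type and `cpaSpeedup` name are our own) replays a dynamic instruction trace, sums the total work, computes the cost-weighted critical path through the dependence graph, and returns their ratio as the CPA speedup upper bound.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A dynamic instruction: a latency (cost) plus the earlier instructions
// it depends on (true dependences only).
struct Instr {
    int cost;
    std::vector<int> deps;  // indices of earlier instructions in the trace
};

// CPA on an ideal machine: work is the sum of all costs; the critical path
// is the longest cost-weighted chain through the dependence graph.
// Instructions are assumed to be listed in execution (topological) order.
double cpaSpeedup(const std::vector<Instr>& trace) {
    long work = 0;
    long cp = 0;
    std::vector<long> finish(trace.size(), 0);
    for (size_t i = 0; i < trace.size(); ++i) {
        long ready = 0;  // operands with no producer are available at time 0
        for (int d : trace[i].deps) ready = std::max(ready, finish[d]);
        finish[i] = ready + trace[i].cost;
        work += trace[i].cost;
        cp = std::max(cp, finish[i]);
    }
    return static_cast<double>(work) / cp;
}
```

For four unit-cost instructions forming two independent two-instruction chains, work is 4 and the critical path is 2, so the reported ideal speedup is 2.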
Dependence Testing Based Tools Dependence testing shares with critical path
analysis the goal of discovering parallelism [Lar93, ZNJ09, KKL10b,
TWFO09]. Typically, a dependence testing tool monitors inter-iteration
dependences at runtime and reports the dependences found in the target program. Based
on the work and the detected dependence patterns, the tool can provide a short list
of promising regions for parallelization. However, dependence testing tools have
two major limitations. First, they have difficulty detecting parallelization
opportunities if the serial implementation does not match a program structure that the
tool supports, such as DOALL loops. Second, they do not quantify the amount of
parallelism, making it hard to reason about the profit from parallelization.
Parallel Speedup Predictor Existing speedup predictors such as Intel Cilkview
analyzer and Intel parallel advisor [HLL10, Int, KKKB12] are designed to help
quick exploration of parallelization. The programmer provides annotated source
code that expresses parallelizable code regions as well as parallelization strategies.
After profiling the target program with the given annotations, these tools provide
the estimated speedup after parallelization. Since making annotations tends to be
easier than applying fully working code transformations, these tools can improve
productivity; still, they require the programmer to have a deep understanding of the
target program. For example, if the software engineer doing the parallelization is
not the original author of the code, which is often the case, these tools do not
help much until the user understands the program well enough to write
reasonable annotations. Another interesting tool is the Intel Cilkview Scalability
Figure 2.5: Uncovering Masked Parallelism. The use of critical path analysis
allows HCPA to uncover parallelism even when masked by a serial implementation.
The code in (a) shows a nested loop operating on a 2D array with cross-iteration
dependences over both loops, making it appear very serial. The iteration
dependence graph in (b) shows that iterations can be grouped into independent sets,
allowing parallel execution if loop skewing and interchange are used as shown in
(c). Techniques relying on dependence testing would overlook this parallelism.
Furthermore, the 2D array in (a) is represented as an array of pointers to arrays,
thwarting a parallelizing compiler's attempt to statically analyze this section of
code.
both loops. Parallelizing this code requires two key analyses. First, it requires
the compiler to recognize that a loop-transformation technique called loop skewing
can be applied, which restructures the loop so that execution traverses the array
“diagonally” (as shown in Figure 2.5c). Second, it requires the compiler to prove,
possibly using shape analysis, that none of the pointers in the first level of the
array point to the same array in the second level; i.e., that there is no aliasing.
Some research compilers have implemented shape-analysis passes that could
potentially decipher that the data structure is equivalent to a 2D array, and
similarly some research compilers are able to automatically infer loop skewing of static
arrays. More generally, unlocking the parallelism latent in sequential programs may
require an arbitrary number of difficult analyses and transformations
to be composed. Because of complexity and runtime issues, modern compilers
are not able to compose all of these heroic tasks simultaneously into one coherent
analysis and transformation framework.
However, using runtime information, HCPA is easily able to identify and
quantify the parallelism latent in the double-loop structure, allowing a
speedup estimation tool to count this important parallelization opportunity. When
parallelizing the code, the programmer can work to iteratively transform it
sufficiently that the compiler or runtime system can take it the rest of the way.
In contrast, weaker dynamic dependence-testing-based frameworks would typically
report no available parallelism because they are not able to see past the existing
structure of the double loop, leading to underestimated parallel speedup.
2.3 HCPA Implementation
Although the concept of HCPA is quite straightforward, as discussed in the
previous section, its use in speedup estimation raises several implementation
issues. In this section, we focus on four major issues in the HCPA
implementation.
• Designing the Region Hierarchy: HCPA captures the self-parallelism of each
region in the region hierarchy. The design of the region hierarchy can
fundamentally impact the result of HCPA. We present a simple but effective region
hierarchy design for speedup estimation.

• Calculating Critical Path Length: Unlike CPA, which tracks the single critical
path length of the whole program, HCPA must simultaneously track a critical
path for each region, which can be very costly. We present an array
of techniques that reduce the overhead of simultaneously finding critical
path lengths in multiple regions.

• Calculating Self-Parallelism: The self-parallelism metric represents the exclusive
parallelism of a region, excluding the parallelism of its children. We
calculate self-parallelism with an effective approximation.

• Summarizing Profiled Information: The number of dynamic regions can be
extremely large, so naively storing all the information could be prohibitive.
We use a context-sensitive summarization technique to represent the profiled
information effectively and efficiently.
2.3.1 Designing the Region Hierarchy
HCPA uses the concept of a region to denote a region of code whose par-
allelism is to be measured from the time that region is entered until the time it
is exited. In order for the self-parallelism metric to work, regions must obey a
proper nesting structure: regions must not partially overlap, but they may nest or
be siblings with the same parent region. Based on this nesting structure, we can
define a dynamic region graph which shows the relationship between parent and
children regions in the dynamic execution of the program.
Although more arbitrary delineations of regions are possible, we use three
types of regions: loop, function, and sequence. These region types are designed
to quantify the parallelism of constructs that users understand well and that
relate directly to the process of parallelization. Loop and function regions directly
map to loops and functions in the program. A sequence region can be any single-
entry piece of code, but we restrict sequences to two important cases: loop bodies
int main() {
    for (i = 1 to N) {
        foo(1); // callsite A
        foo(N); // callsite B
    }
}

void foo(int size) {
    for (i = 1 to size) {
        // loop body
    }
}
(a) Sample Code Fragment
[Figure: a dynamic region tree rooted at main's loop, with N iteration nodes, each containing callsites foo_A and foo_B and their inner loops; context-sensitive summarization collapses it into a summarized region tree of func, loop, and sequence nodes annotated with work, critical path length (cp), average work, average self-parallelism, and self-parallelism (self-p).]
(b) Dynamic Region Tree and Summarized Region Tree
Figure 2.6: Overview of HCPA. HCPA builds a hierarchical region structure
from source code, forming a dynamic region tree at runtime. As each dynamic
region is profiled, HCPA summarizes the profiled data into a summarized region
tree. The summarized tree preserves context-sensitive parallelism information,
exposing more parallelization opportunities.
and self-work sequences. Loop body regions form a child region for each iteration of
a loop region, allowing HCPA to identify loop-level parallelism. Self-work sequence
regions are sequences of code that are contained in non-leaf regions and contain
no function calls or loops. These regions may seem unintuitive, but they separate
different types of parallelism, improving the accuracy of speedup estimation:
self-work sequences factor out the instruction-level parallelism in regions that would
otherwise contain a mix of task-level parallelism (from their other children) and
instruction-level parallelism.
Kismet demarcates region boundaries at static instrumentation time: its
instrumentor inserts calls to the runtime, and these calls form the region tree at
runtime. For example, the source code in Figure 2.6(a) forms the dynamic region
tree shown in Figure 2.6(b).
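To make the mechanism concrete, here is a minimal sketch of how instrumented enter/exit calls could build a dynamic region tree; the names `RegionTree`, `RegionNode`, `enterRegion`, and `exitRegion` are illustrative, not Kismet's actual API.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// A node in the dynamic region tree.
struct RegionNode {
    std::string kind;  // "func", "loop", "iter", ...
    RegionNode* parent;
    std::vector<std::unique_ptr<RegionNode>> children;
};

struct RegionTree {
    RegionNode root{"root", nullptr, {}};
    RegionNode* active = &root;  // innermost active region

    // Instrumented code calls these on region entry/exit; proper nesting of
    // the calls yields the proper nesting structure HCPA requires.
    void enterRegion(const std::string& kind) {
        active->children.push_back(
            std::make_unique<RegionNode>(RegionNode{kind, active, {}}));
        active = active->children.back().get();
    }
    void exitRegion() { active = active->parent; }
};
```

Replaying the execution of Figure 2.6(a) through these calls (main, its loop, N iterations, and two foo calls per iteration) reproduces the dynamic region tree of Figure 2.6(b).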
2.3.2 Calculating Critical Path Length
The main task of CPA, the underlying technique of HCPA, is to find the
single critical path length of the whole program, which can be costly. HCPA, on
the other hand, has to track the critical path length for each region, incurring
much higher memory and runtime overhead.
In order to efficiently find the critical path length, HCPA independently
maintains the timestamp of each operand (i.e., register and memory address) for
each region. A timestamp represents the earliest time at which the operand's
value is available to a consuming instruction. HCPA tracks the maximum timestamp
value used in each region and reports it as the critical path length when the region
exits. All timestamps are logically initialized to zero upon entry, so a reference
to a value produced outside the region is assumed to be available immediately at
the beginning of the region (i.e., at time 0). For each dynamic instruction, HCPA takes
the timestamps of every source operand and control dependence, finds the maximum,
adds the latency of the instruction, and updates the timestamp of the destination
operand.
Although the number of dynamic regions in a program can be very large,
the number of regions that need simultaneous timestamp update is limited to the
 1  void handlerBinary(int dest, int src0, int src1, int cost) {
 2      for (int depth = 0; depth < active_region_depth; depth++) {
 3          // calculate the updated timestamp for dest
 4          Time time_control_dep = getControlDepTime(depth);
 5          Time time_src0 = getRegTime(src0, depth);
 6          Time time_src1 = getRegTime(src1, depth);
 7          Time time_dest = max(time_control_dep, time_src0, time_src1) + cost;
 8          setRegTime(dest, time_dest, depth);
 9
10          // update critical path length and work
11          Region* region = getActiveRegion(depth);
12          region->cp = max(region->cp, time_dest);
13          region->work += cost;
14      }
15  }
Figure 2.7: Runtime Instruction Handler. Upon a dynamic instruction, HCPA
calculates and updates the timestamp of the dest register for all the active regions.
number of active regions. An active region is a region that has been entered
but has not yet exited. In the dynamic region tree, the number of active regions
equals the depth of the current dynamic region. Every active
region independently manages its timestamps, making the update process similar
to a vector operation.
Figure 2.7 shows the simplified runtime code invoked upon a dynamic
instruction that takes two source registers and one destination register. Each
iteration of the loop (line 2) handles one active region. It first reads the
timestamps of the two source registers and the control dependence, and calculates
the new timestamp by adding the cost of the instruction. It then updates the
timestamp of the destination register and, finally, updates the critical path length
of the region if the new timestamp is larger than the current critical path length.
HCPA honors true dependences, including register, memory, and control
dependences. Because we aim to provide a speedup upper bound, HCPA ignores
false dependences and easy-to-break dependences. For example, every for loop
carries an inter-iteration dependence through its loop variable, but that dependence
can easily be broken.
Register Dependence At compile time, the LLVM-based instrumentor efficiently
and accurately analyzes true dependences between registers. HCPA's
instrumentor inserts a function call so that the HCPA runtime uses the dependence
information when it updates timestamps. Dependences that are not true
dependences are filtered out using two main mechanisms. First, our instrumentor
operates on LLVM's SSA-form IR [CFR+91], which eliminates false output (i.e.,
write-after-write) dependences. Second, the instrumentor detects induction and
reduction variables and then breaks the false dependences that result from them.
In order to efficiently manage timestamps, we use a vector shadow register
(VSR). A VSR is an array of vectors where each vector contains the timestamps of
one register for all the active regions; each element of a vector is the timestamp
for one active region. Because every function independently manages its register
space, HCPA allocates a VSR when a function starts and deallocates it when
the function ends. All vectors in a VSR share the same length: the maximum
depth, in the dynamic region tree, of any region that might use the register
associated with the vector, which is easily analyzed at compile time. The number
of registers used in a function, in LLVM's SSA form, determines the number of
vectors in the VSR.
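A minimal sketch of the structure just described, paired with a per-region version of the Figure 2.7 update, might look as follows; the class and function names (`VectorShadowRegisters`, `updateBinary`) are our own, and the statically known vector length is passed in as `maxDepth`.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Time = long;

// Sketch of a vector shadow register (VSR) file: one timestamp vector per
// virtual register, one slot per active region depth.
class VectorShadowRegisters {
    std::vector<std::vector<Time>> slots_;  // indexed [register][depth]
public:
    // Allocated on function entry; timestamps start at 0, so operands
    // defined outside a region appear available at the region's start.
    VectorShadowRegisters(int numRegs, int maxDepth)
        : slots_(numRegs, std::vector<Time>(maxDepth, 0)) {}

    Time get(int reg, int depth) const { return slots_[reg][depth]; }
    void set(int reg, int depth, Time t) { slots_[reg][depth] = t; }
};

// Per-instruction update for one active region, as in Figure 2.7 (control
// dependences omitted here for brevity): dest's timestamp is the maximum of
// its operands' timestamps plus the instruction's cost.
void updateBinary(VectorShadowRegisters& vsr, int depth,
                  int dest, int src0, int src1, Time cost) {
    Time t = std::max(vsr.get(src0, depth), vsr.get(src1, depth)) + cost;
    vsr.set(dest, depth, t);
}
```

Chaining two unit-cost instructions through register 2 produces timestamps 1 and 2 at depth 0 while leaving other depths untouched, mirroring the independent per-region bookkeeping described above.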
Memory Dependence HCPA detects every memory dependence at runtime
and incorporates the dependence information in its critical path length calculation.
Compared to the static pointer analyses used in many parallelizing compilers,
this runtime approach can handle irregular control flows and complicated pointer
operations, enabling more accurate parallelism quantification.
We use vector shadow memory (VSM) to manage timestamps for memory
addresses. Vector shadow memory is a variant of shadow memory, which provides
efficient tagging of information onto the memory address space. Shadow memory is
widely used in dynamic program analyses, from memory analysis [SN05, BZ11] to
computer security [CZYH06, QWL+06, XBS06].
Vector shadow memory shares a similar goal with the vector shadow register:
it efficiently provides independent storage for each dynamic region in a program so
that HCPA can read and update the timestamps associated with each memory address.
As with VSR, VSM consists of vectors where each vector provides storage for all
the active regions.
Although vector shadow memory resembles vector shadow register in its
functionality and structure, it raises serious challenges for practical use. VSM’s
address space is larger than VSR’s register count by several orders of magnitude.
If not carefully managed, VSM will incur prohibitively large memory overhead.
Furthermore, the length of a vector varies in VSM, making it even more difficult
to efficiently allocate and deallocate VSM's memory. As in VSR, a vector's length
is determined by the maximum depth of the regions that access the vector (i.e.,
the address associated with it). Unlike VSR vector lengths, however, it cannot
be determined at compile time by the instrumentor. In Chapter 4, we discuss an
array of techniques that dramatically reduce the memory and runtime overhead
of vector shadow memory.
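A deliberately naive sketch illustrates both the structure and the problem; the class name `VectorShadowMemory` is our own, and this hash-map-with-resizing version stands in for the far more efficient organization Chapter 4 develops.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

using Time = long;
using Addr = unsigned long;

// Naive sketch of a vector shadow memory (VSM): each memory address shadows
// a vector of timestamps, one per active region depth. Because the required
// vector length is unknown at compile time, this version grows vectors on
// demand, which is exactly the memory-overhead problem Chapter 4 addresses.
class VectorShadowMemory {
    std::unordered_map<Addr, std::vector<Time>> shadow_;
public:
    Time get(Addr a, int depth) const {
        auto it = shadow_.find(a);
        if (it == shadow_.end() || depth >= static_cast<int>(it->second.size()))
            return 0;  // untouched addresses are available at time 0
        return it->second[depth];
    }
    void set(Addr a, int depth, Time t) {
        auto& v = shadow_[a];
        if (depth >= static_cast<int>(v.size())) v.resize(depth + 1, 0);
        v[depth] = t;
    }
};
```

Every shadowed address pays for a heap-allocated vector sized to the deepest region that touched it, which is why a carefully engineered layout (Skadu, Chapter 4) is needed in practice.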
Control Dependence Kismet tracks control dependences through the use of
control dependence analysis and a dynamic control dependence stack. Timestamps
for control dependences are pushed onto and popped from the stack whenever a
control-dependent region is entered or exited. A stack entry contains a vector of
timestamps, one for every active region. For a given region, the stack holds
monotonically increasing values from bottom to top, allowing HCPA to include
only the topmost entry in the list of dependences for each instruction.
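The monotonicity argument can be sketched for a single active region; `ControlDepStack` and its methods are illustrative names, not Kismet's API.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Time = long;

// Sketch of a dynamic control-dependence stack for one active region.
// Entering a control-dependent block pushes the controlling branch's
// timestamp. A nested branch cannot resolve before its enclosing branch,
// so pushed values are monotonically non-decreasing from bottom to top,
// and only the top entry ever needs to be read per instruction.
class ControlDepStack {
    std::vector<Time> stack_{0};  // time 0: no control dependence yet
public:
    void enterBranch(Time branchTime) {
        // Clamp to the enclosing entry to preserve monotonicity.
        stack_.push_back(std::max(stack_.back(), branchTime));
    }
    void exitBranch() { stack_.pop_back(); }
    Time current() const { return stack_.back(); }
};
```

In the full system each stack entry is a vector with one timestamp per active region, but the per-region behavior is exactly this scalar stack.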
Although HCPA aims to find the critical path length of each region, it also
profiles other information useful for speedup prediction. For example, HCPA
determines whether all of the children of a non-leaf region are independent and
can therefore be executed in parallel. This information is useful for identifying
DOALL loops, which are both common and easy to exploit on many target platforms.
To store this information, each region calculates a “P bit” using the following equation:
P = ( CP(parent) == MAX(CP(child_1), ..., CP(child_n)) )
where CP(parent) is the critical path length of the parent and CP(child_i) is the
critical path length of the i-th child.
If all children can be executed in parallel, then the parent's critical path
length is simply the longest critical path among all of its children, and the
“P bit” is 1. Chapter 3 discusses how we leverage this in more detail.
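The P-bit equation above translates directly into code; `pBit` is a hypothetical helper name used here for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// P bit: a non-leaf region's children are all independent (DOALL-like)
// exactly when the parent's critical path length equals the longest
// critical path among its children.
bool pBit(long parentCp, const std::vector<long>& childCps) {
    long maxChild = 0;
    for (long cp : childCps) maxChild = std::max(maxChild, cp);
    return parentCp == maxChild;
}
```

If three children with critical paths 10, 7, and 9 run fully in parallel, the parent's critical path is 10 and the P bit is 1; if they serialize, the parent's critical path is their sum (26) and the P bit is 0.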
2.3.3 Self-Parallelism Calculation
When a region exits, HCPA calculates its self-parallelism from the measured
critical path length and its children's information. Since a parent region exits
only after all of its children have finished executing, every child's profiled
information is available when the parent exits.
To determine the self-parallelism of a region R, SP(R), HCPA employs the
following equation:

SP(R) = ( sum_{k=1..n} cp(child(R, k)) ) / cp(R)    if R is a non-leaf
SP(R) = work(R) / cp(R)                             if R is a leaf
Here n is the number of children of R, child(R, k) is the k-th child of R, cp(R) is
the critical path length of R, and work(R) is the amount of work in R.
Intuitively, this equation captures the ratio of execution time between a serial
run and a parallel run of fully parallelized children. By using the fully parallelized
children's execution time, the self-parallelism metric captures the exclusive
parallelism of the parent region. Having no children, a leaf region's self-parallelism
is identical to CPA's total parallelism.
Figure 2.8 demonstrates the calculation of SP in three non-leaf regions, one
totally serial, one partially parallel (DOACROSS), and the other totally parallel
[Figure: three schedules of N child regions, each child with critical path length CP. Serial: children run back-to-back, CP(R) = N * CP, SP(R) = (N * CP) / (N * CP) = 1.0. DOACROSS: successive children overlap by half, CP(R) = (N/2) * CP, SP(R) = (N * CP) / ((N/2) * CP) = 2.0. DOALL: all children run concurrently, CP(R) = CP, SP(R) = (N * CP) / CP = N.]
Figure 2.8: Self-Parallelism Calculation on Regions with Varying Parallelism.
Self-parallelism computes the amount of parallelism in a parent region that is
attributable to that region and not its children. The figure above shows that
Kismet's self-parallelism calculation successfully quantifies parallelism across a
spectrum of loop types, ranging from totally serial to partially parallel
(DOACROSS) to totally parallel (DOALL). The shaded boxes are child regions,
corresponding to separate iterations of the loops. The relative scheduling of child
regions is indicated spatially, with time running from left to right. The
self-parallelism calculation correctly quantifies parallelism in non-loop region
hierarchies as well.
(DOALL). For simplicity, in the example, each iteration's critical path length cp
is the same. For the serial loop, the measured cp(R) will be equal to n * cp and
the computed self-parallelism will be (n * cp) / (n * cp) = 1, which is expected
since serial dependences prevent overlapped execution of regions. For the
DOACROSS loop shown, where half of an iteration can overlap with the next
iteration, cp(R) will be half of the cp(R) for the serial loop. Thus SP(R) is
(n * cp) / ((n/2) * cp) = 2. For the DOALL loop, cp(R) will be equal to cp, so
SP(R) is (n * cp) / cp = n. Although we show three relatively simple cases here,
this method is a good approximation of self-parallelism even with more
sophisticated child region interaction.
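The three cases above can be checked numerically. Below is a minimal sketch of the SP(R) equation, assuming identical per-iteration critical paths as in the example; the function name is illustrative, not Kismet's actual API:

```python
def self_parallelism(cp_region, child_cps=None, work=None):
    """SP(R): sum of children's critical paths over cp(R) for non-leaf
    regions; work(R)/cp(R) for leaf regions."""
    if child_cps:                       # non-leaf region
        return sum(child_cps) / cp_region
    return work / cp_region             # leaf region

cp = 5.0   # per-iteration critical path, identical across iterations
n = 8      # iteration count

assert self_parallelism(n * cp, [cp] * n) == 1.0        # serial loop
assert self_parallelism((n / 2) * cp, [cp] * n) == 2.0  # DOACROSS
assert self_parallelism(cp, [cp] * n) == n              # DOALL
```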
2.3.4 Summarizing Profiled Information
HCPA produces a parallelism profile for each dynamic region that is exe-
cuted. The number of dynamic regions quickly grows as nested loops with many
iterations are executed. This large number of regions poses practical challenges
not only for the size of the profile output but also for the runtime of algorithms
that need to analyze this data. We developed a summarization technique that
effectively reduces the amount of profiled data while preserving context-sensitive
parallelism information.
Our summarization technique combines all dynamic regions that have the
same region context into a single summarized region. Figure 2.6b depicts how the
runtime region tree becomes a summarized region profile. In this method, all loop
iterations collapse to a single node, greatly reducing the number of regions. Each
node calculates weighted averages for self-parallelism, work, and other profiled data
across all dynamic regions corresponding to that node.
Kismet maintains a ‘current’ pointer that tracks the summary node that
corresponds to the current dynamic region. When a new region is entered, Kismet
updates the ‘current’ pointer to one of its child nodes based on statically assigned
callsite ID information. If there is no corresponding node, it creates a new summary
node and updates the ‘current’ pointer. When a region exits, the region’s profiled
information is added to the current node and the pointer returns to the parent
node. This process is similar to the call context tree described in [ABL97] but
modified for HCPA’s region hierarchy.
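The enter/exit protocol around the ‘current’ pointer can be sketched as follows. This is an illustrative Python sketch of the summarization scheme described above (class and field names are hypothetical), accumulating work-weighted averages per summary node:

```python
class SummaryNode:
    """One node per region context; collapses all dynamic instances."""
    def __init__(self, callsite_id):
        self.callsite_id = callsite_id
        self.children = {}        # callsite ID -> SummaryNode
        self.parent = None
        self.count = 0            # dynamic instances summarized here
        self.work = 0.0           # running totals for weighted averages
        self.sp_weighted = 0.0

    @property
    def avg_sp(self):
        return self.sp_weighted / self.work if self.work else 0.0

class Summarizer:
    def __init__(self):
        self.root = SummaryNode(callsite_id=0)
        self.current = self.root  # the 'current' pointer

    def region_enter(self, callsite_id):
        child = self.current.children.get(callsite_id)
        if child is None:                      # first visit: create node
            child = SummaryNode(callsite_id)
            child.parent = self.current
            self.current.children[callsite_id] = child
        self.current = child

    def region_exit(self, work, sp):
        node = self.current
        node.count += 1
        node.work += work
        node.sp_weighted += sp * work          # work-weighted average
        self.current = node.parent             # return to the parent node

s = Summarizer()
for _ in range(3):   # three iterations of one loop body (same context)
    s.region_enter(callsite_id=42)
    s.region_exit(work=100.0, sp=4.0)

node = s.root.children[42]
assert node.count == 3       # all iterations collapse into one node
assert node.avg_sp == 4.0
```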
The example summarized region profile shown in Figure 2.6b contains two
nodes for the same function (foo) from what appears to be the same context.
This corresponds to two separate calls from the same loop. While this increases
the number of nodes in the summarized profile, it allows Kismet to uncover new
parallelism opportunities.
To understand the merit of context-sensitive representation, consider the
code in Figure 2.6b. When the loop in function foo is parallel and N is large, the
parallelism of this loop significantly differs between callsites A and B. Callsite A’s
loop will always have a self-parallelism of 1, providing no benefit to parallelism
and likely causing slowdown due to synchronization overhead. Callsite B’s loop
will have a self-parallelism of N and would likely be a good candidate for parallel
refactoring. Kismet can capitalize on the split contexts, incorporating the speedup
from callsite B into its estimates while ignoring callsite A. The tree structure of
the context-sensitive representation also enables a parallel execution time model
for the case where only a few regions are parallelized; the details of this model
are discussed in Chapter 3.
2.4 Experimental Results
2.4.1 Effectiveness of Self-Parallelism Metric
To demonstrate the merit of the self-parallelism metric over CPA's
total-parallelism, we examined programs in the NAS Parallel Benchmarks (NPB)
suite [BBB+91]. We measured both the self-parallelism and the total-parallelism of
all 1953 regions in NPB and classified them based on the amount of parallelism:
serial (parallelism < 1.1), moderately parallel (1.1 to 2.0), parallel (2.0 to 5.0), or
very parallel (parallelism > 5.0).
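The classification thresholds above map directly to a small bucketing function, sketched here for clarity (the function name is illustrative, not from Kismet):

```python
def classify(parallelism):
    """Bucket a region by measured parallelism, using the thresholds
    from the text: <1.1 serial, 1.1-2.0 moderate, 2.0-5.0 parallel,
    >5.0 very parallel."""
    if parallelism < 1.1:
        return "serial"
    if parallelism <= 2.0:
        return "moderately parallel"
    if parallelism <= 5.0:
        return "parallel"
    return "very parallel"

assert classify(1.0) == "serial"
assert classify(1.5) == "moderately parallel"
assert classify(3.0) == "parallel"
assert classify(64.0) == "very parallel"
```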
Figure 2.9: Region Classification Based On Parallelism. We classified all
1953 regions in the NPB benchmarks based on the amount of parallelism. Switch-
ing from total-parallelism based classification to self-parallelism based classifica-
tion, 6× more regions are classified as being serial. This result highlights self-
parallelism’s ability to localize parallelism, filtering out false positive parallel re-
gions in the speedup prediction.
2.4.2 Effectiveness of the Summarization Techniques
To examine the effectiveness of Kismet's summarization technique, we ran the
NPB and SpecInt2000 benchmarks 1 with two different input sizes ('S' and 'A' for
NPB, 'test' and 'ref' for SpecInt2000) and examined dynamic region counts as
well as output file sizes.
SpecInt2000) and examined dynamic region counts as well as output file sizes.
Table 2.2 shows the results.
The results show that Kismet’s summarization technique scales well with
increasing input sizes and is effective at reducing the output file size. As expected,
the dynamic region count significantly increases when we switch from small input
to larger input – 463X on average. With the larger input sets, dynamic region
1Raw benchmarks have only a single input set.
Table 2.2: Impact of Summarization Technique on File Size in NPB.
Switching from the small (S) to large (L) inputs causes 463× more dynamic regions
to execute on average, but the output file size increases only 1.1× on average, from
77KB to 85 KB. Thus, the summarization technique is very effective in keeping
output file size manageable even with large inputs.
Bench | Dynamic Region Count (Mega Regions) | Output File Size (Kilo Bytes)
ment in the region hierarchy. This partitioning allows efficient sharing of shadow
memory among regions in the same level of the program hierarchy. Partitioning
also enables lightweight garbage collection of stale tags. Skadu utilizes a tag vec-
tor cache to keep frequently used tag vectors in a format that minimizes access
time. This cache allows the user to easily trade increased memory overhead for
improved performance. It serves as a kind of nursery that eliminates the need
to allocate short-lived regions in long-term shadow storage. Skadu reduces the
memory overhead of long-term tag storage by compressing infrequently used tags.
Skadu also introduces two novel techniques for reducing the overhead associated
with validating tags, a requirement stemming from the sharing of shadow memory
between multiple program regions.
4.0.8 Evaluating Skadu
We have implemented two differential dynamic analyses in order to evalu-
ate Skadu’s effectiveness at reducing overheads. First, a memory footprint profiler
tracks the amount of memory used by every function and loop in a program. This
analysis is useful to a range of applications such as scheduling of programs on
a heterogeneous multi-core processor. Second, a parallelism profiler, hierarchical
critical path analysis (HCPA), determines the average amount of parallelism in
every function and loop in the program. While both analyses utilize Skadu’s VSM
infrastructure, they demonstrate opposite ends of the analysis spectrum: the mem-
ory footprint profiler is a relatively lightweight analysis with only 1-bit tags while
the heavyweight HCPA uses 64-bit tags.
Results from our implementations of the memory footprint and parallelism
profilers show that Skadu reduces memory overhead by 14.2× for memory footprint
profiling and by 11.4× for HCPA versus a baseline implementation. We also ex-
amine the effect of vector caching as well as selective compression on both memory
and performance overhead.
Figure 4.1: Traditional Memory Shadowing Organization. The memory
address is used as an index into a two-level page table that contains the metadata
associated with that address. To support 64-bit addresses, a three-level table may
be used.
4.1 Overview and Challenges
In this section we will introduce the challenges facing vectored shadow
memory (VSM). We start by introducing traditional shadow memory organiza-
tion before describing the region-based differential dynamic analysis employed in
this chapter. Finally, we overview the techniques that Skadu uses to create an
efficient VSM framework for the differential dynamic analyses.
Traditional Memory Shadowing Technique In traditional shadow memory
infrastructures, each memory address has an associated shadow memory address.
Each shadow address may contain some metadata (or tag) about the associated
memory address. Figure 4.1 shows a simple shadow memory organization where
shadow memory is accessed via a two-level table: first the segment table, then the tag table.
The size of the tag table entry can range from tiny to large: it is common to see
one bit tags in taint tracking infrastructures while applications such as hierarchical
critical path analysis introduced in Chapter 2 require 64-bit tags. While the details
vary between existing memory shadowing frameworks, they generally follow this
basic organization.
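The two-level organization in Figure 4.1 can be sketched briefly. This is a simplified Python model assuming a 32-bit address split into a segment index and an offset (the split width and class name are illustrative):

```python
SEGMENT_BITS = 16          # illustrative split of a 32-bit address
OFFSET_BITS = 16

class ShadowMemory:
    """Two-level shadow memory: segment table -> tag table -> tag."""
    def __init__(self, default_tag=0):
        self.segments = {}             # lazily allocated segment entries
        self.default_tag = default_tag

    def _split(self, addr):
        return addr >> OFFSET_BITS, addr & ((1 << OFFSET_BITS) - 1)

    def store_tag(self, addr, tag):
        seg, off = self._split(addr)
        table = self.segments.setdefault(seg, {})   # allocate on first touch
        table[off] = tag

    def load_tag(self, addr):
        seg, off = self._split(addr)
        return self.segments.get(seg, {}).get(off, self.default_tag)

sm = ShadowMemory()
sm.store_tag(0x00C0FFEE, tag=1)
assert sm.load_tag(0x00C0FFEE) == 1
assert sm.load_tag(0x00C0FFEF) == 0   # untouched address keeps default tag
```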
Region-Based Differential Analysis Traditional memory shadowing requires
tracking only a single region of the code, usually the whole program (i.e. the main
function). The differential analyses we consider in this chapter require separate
dynamic sub-analyses to be applied to each nested region of the program. We
define a region to be any single-entry piece of code but we will focus on two par-
ticular types of regions: functions and loops. During the execution of a program,
the regions of a program form a natural hierarchy. Figure 4.2 demonstrates this
hierarchy (shown in the form of a region tree) for an example piece of code. Each
node in the region tree is a dynamic region while an edge from A to B indicates
that B is a child (or subregion) of A.
In the region-based differential analysis, each region is profiled indepen-
dently of the others. Conceptually, this means that each region has its own ad-
dress space and therefore requires its own shadow memory address space. A naive
extension of the memory shadowing shown in Figure 4.1 might be to make each
entry in the tag table correspond to an array or list. However, the number of
entries needed for each address depends on the number of dynamic regions in the
program and therefore cannot be determined statically. Neither an array-based
nor a list-based approach is desirable in this situation. The array-based approach
would either need to be radically over-allocated (if statically allocated) to ensure
room for all regions or would need to be dynamically allocated, potentially leading
to prohibitive performance penalties. The list-based approach suffers from the
same drawbacks as the dynamically allocated array approach.
Efficient Management of Multiple Shadow Address Spaces The key in-
sight for efficient shadow address space management is that there is at most one
active region in any given level in the region tree. Skadu takes advantage of this
hierarchical property to minimize the memory overhead associated with multiple
shadow address spaces. As shown in Figure 4.3, all regions in each level of the
region tree are mapped to a single tag. In other words, shadow address space is
shared amongst every region in a level.
int main() {
    foo();
    bar();
    ...
}
void foo() {
    ...
}
void bar() {
    for (i = 0..10)
        x++;
    foo();
}
(a) Code Snippet.
(b) Corresponding Region Tree.
Figure 4.2: Region Hierarchy Overview. The pseudo code in (a) results in
the region tree shown in (b). Each region has an isolated shadow memory address
space, as shown in (b). Skadu introduces several techniques to reduce the overhead
associated with maintaining these separate address spaces.
Sharing of shadow address spaces could potentially lead to one region pol-
luting the address space of another. It is possible to clean the “dirty” tags after
exiting a region but this is likely to incur a significant performance penalty. This
penalty is especially onerous for regions that are entered and exited rapidly, such
as deeply nested loops. This solution also requires additional space overhead for
tracking “dirty” tags.
To avoid the cleaning costs of the naive scheme, we can use a version-based
approach, which attaches metadata to each tag to
indicate which region owns that tag. Before a tag is used, the version is checked
to ensure that the region accessing the tag is its owner. The tag is invalidated if
the version does not match. Unfortunately, this technique requires a significant
amount of space for tracking versions. Skadu introduces several novel techniques
Figure 4.3: Level-based Sharing of Shadow Memory. The hierarchical nature
of regions ensures that any level in the region tree will have at most one active
region. Skadu uses this property to enable reuse of physical shadow memory space
between multiple regions of the same level. This reuse requires that tags be vali-
dated to ensure that stale metadata is not used (e.g. not using region B’s metadata
for region C). Each tag has a version associated with it to determine the region
in which it is valid. Skadu introduces two novel versioning systems that reduce
the memory overhead of versioning from O(n) to O(1).
to minimize this cost; these techniques are described in detail in Section 4.2.
Utilizing Region Hierarchy to Reduce Overhead The definition of a region
ensures that the size of regions monotonically decreases as you go from the root of
the region tree to its leaves. As a result, “deep” regions tend to be much smaller
than “shallow” regions and have a smaller memory footprint. Skadu leverages this
property by introducing two novel memory shadowing features: level tables and
tag vector caching.
Skadu introduces a level table into the basic shadow memory organization
shown in Figure 4.1. This level table acts to partition tags according to the level
in which their associated region resides. Each entry in the level table points to a
tag table that is associated with a given level in the region tree. This organization
allows Skadu to maintain the minimum number of tag tables for each active level,
capturing the main benefit of the dynamic array or list-based approach without
the space overhead associated with those approaches.
The level table organization has the added benefit of enabling efficient
shadow memory garbage collection. Skadu includes a garbage collector that peri-
odically scans level tables to determine which levels are no longer active. Tags in
these inactive levels can be deallocated so as to reduce the memory overhead. The
garbage collector allows lazy deallocation of tags, moving this deallocation away
from the region exits and therefore reducing the performance overhead. Garbage
collection could also be done in parallel with normal analysis, further reducing runtime
overhead.
While level tables help minimize the space overhead associated with the
differential dynamic analysis, they could potentially add to performance overhead
as multiple table traversals are required to access a single address’ tag vector.
Skadu minimizes this impact by introducing a tag vector cache. Skadu's cache not
only keeps the tag vectors for frequently used memory addresses but also uses an
array-based organization that minimizes access time. This cache offers a trade-off
of performance and space: increasing the cache size increases performance.
The user can exploit this trade-off to maximize performance based on the amount
of memory available to them.
One further consequence of deeper regions being smaller than shallower
regions is that the lifetime of tags in those deeper regions is much shorter. Skadu
takes advantage of this by using a novel write-back policy. When a tag vector is
evicted from the cache, the version metadata is first checked to see if any of the
tags are out-of-date and therefore invalid. Invalid tags are simply discarded since
they will never be used again. As a result, Skadu avoids allocating tag tables for
stale tags, thus reducing the space overhead.
The tag vector cache makes the average shadow memory operation fast by
storing frequently used tags in a compact, quickly accessible form. By making
uncached accesses a rare event, Skadu is able to further reduce memory overhead
by compressing tags residing outside of the cache. Other than a small number
of recently used tags, all tags are compressed, minimizing the memory overhead
while only moderately impacting performance. The number of uncompressed tags
outside of the cache offers another trade-off of memory and performance: more
(a) baseline (b) SlimTV (c) BulkTV
Figure 4.4: Space Overhead of SlimTV and BulkTV. Compared to the base-
line where each level requires its own version information, SlimTV shares a single
version value across all levels. BulkTV further reduces the space overhead by sharing a
single version number across a range of memory addresses.
uncompressed tags increase memory overhead but reduce the runtime penalty.
Details of compression as well as the complete overview of Skadu’s shadow memory
architecture will be given in Section 4.3.
4.2 Efficient Tag Validation
Tag validation is an essential operation in Skadu. As described in the
previous section, Skadu uses version information to determine ownership of tags.
Unfortunately, naive version management is scalable neither in memory overhead
nor in performance; it has O(n) space and time complexity where n is the depth of
the deepest region that accesses a specific memory address. This section introduces
two techniques that enable efficient tag validation: Slim Tag Validation (SlimTV)
and Bulk Tag Validation (BulkTV). Figure 4.4 compares the space overhead of
these techniques against a baseline implementation: together they make the space
requirements of tag validation almost negligible and significantly lower the runtime
overhead.
4.2.1 Baseline Implementation
Our baseline implementation features a simple procedure to check tag va-
lidity that is based on a design property of the shadow memory: sharing of shadow
71
memory is limited to regions within the same level of the region tree. The baseline
implementation utilizes this sharing property by assigning a unique ID to each
region in a level. The unique IDs of all active regions are stored in a version vec-
tor associated with a tag vector whenever that memory is updated. This stored
version vector is then used for tag validation on each read: if there is a mismatch
between the ID of the current active region in a level and the ID for that level in
the version vector, then the tag for that level is invalid.
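The baseline scheme can be sketched as follows. This is an illustrative Python model (not Skadu's actual data layout): a full vector of region IDs is stored alongside each tag vector, and each level's tag is valid only if its stored ID matches the currently active region's ID at that level:

```python
def store(shadow, addr, tags, active_ids):
    """Baseline: store the full vector of active region IDs with the tags."""
    shadow[addr] = (list(tags), list(active_ids))

def load_valid_tags(shadow, addr, active_ids):
    """A tag at level i is valid only if the stored ID at level i matches
    the ID of the currently active region at that level."""
    tags, stored_ids = shadow[addr]
    return [t if i < len(stored_ids) and stored_ids[i] == active_ids[i] else None
            for i, t in enumerate(tags)]

shadow = {}
store(shadow, 0x1000, tags=[7, 8, 9], active_ids=[1, 4, 6])
# Levels 0 and 1 are still in the same regions; level 2 has moved on.
assert load_valid_tags(shadow, 0x1000, active_ids=[1, 4, 9]) == [7, 8, None]
```

Note that the stored ID vector grows with region depth, which is exactly the O(n) space overhead the next section eliminates.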
The baseline implementation suffers from the drawback that it requires stor-
ing a version vector for every shadow memory address. This storage requirement
leads to an O(n) space overhead just for tag validation, where n is the depth of
the region. This approach also incurs a large number of memory loads and stores
from reading/writing version vectors, resulting in higher runtime overhead.
4.2.2 Slim Tag Validation (SlimTV)
Skadu introduces a new tag validation technique known as Slim Tag Valida-
tion (SlimTV). SlimTV improves upon the baseline implementation by eliminating
the need to store a version vector with each tag vector; only a single value needs
to be stored. This technique not only reduces the space overhead from O(n) to
O(1) but also eliminates the excessive loads/stores associated with accessing the
stored vector, greatly reducing the runtime overhead.
SlimTV relies on the key insight that unique IDs can be used to create a
total ordering of all regions in the region tree. SlimTV assigns IDs to regions in
the order in which they begin. During the access of a tag vector only the ID of the
most recently entered, active region is stored. The current stored ID is compared
with the vector of IDs associated with the currently active regions whenever the
shadow address is accessed. Active regions with IDs greater than the stored ID
started after that region and are therefore invalid. SlimTV reduces the problem
of tag validation to finding the minimum region level with an invalid tag: active
regions at deeper levels must have started later and therefore are also invalid.
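SlimTV's reduction of tag validation to finding the minimum invalid level can be sketched in a few lines. This Python sketch uses the version values from the example below (function name is illustrative):

```python
def slimtv_valid_levels(stored_version, active_versions):
    """Tags at level i are valid iff active_versions[i] <= stored_version.
    Versions increase with region start order, so the first invalid level
    also invalidates all deeper levels."""
    for i, v in enumerate(active_versions):
        if v > stored_version:
            return i            # levels [0, i) valid; [i, n) invalid
    return len(active_versions)  # all levels valid

# A write stored version 4; on a later read the active vector is [1, 5, 6, 7].
assert slimtv_valid_levels(4, [1, 5, 6, 7]) == 1   # only level 0 survives
```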
Figure 4.5 provides an example of SlimTV’s tag validation. In this example
a memory address is written to in the region with version 4. This single version
72
number is then stored along with the tags for each active region. This same address
is read later in the region with version 7, at which time the active version vector
is 〈1,5,6,7〉. Of the active regions only the first (1) has an ID less than the stored
version (4); starting from region 5, all other regions are invalid.
Theorem 1. Suppose V (R) is the active version vector of region R, the current
region is Rnew, the stored version in shadow memory is v, and the stored tag vector
is T . T [i] is valid if and only if V (Rnew)[i] ≤ v.
Proof. Assume for contradiction that V (Rnew)[i] > v but that its associated tag
is valid. Let Rold be the region that stored both the tag and v. Because T [i]
is valid, V (Rold)[i] == V (Rnew)[i]; therefore V (Rold)[i] > v. However, this is a
contradiction because, by construction, v is the largest value in V (Rold).
4.2.3 Bulk Tag Validation
Skadu’s SlimTV technique reduces the memory overhead to a constant fac-
tor but this may still result in significant memory overhead. For example, when
shadowing every byte of memory, the overhead incurred from an 8-byte version
identifier is 8X. Skadu therefore introduces an additional technique, Bulk Tag Val-
idation (BulkTV), that can reduce the memory overhead to a negligible amount
while additionally reducing the runtime overhead.
BulkTV’s key idea is to amortize the tag validation process’ overhead across
many addresses. BulkTV accomplishes this amortization by using only a single
version number for a page of shadow memory, performing tag validation for all
entries in the page whenever a single address is accessed. The effectiveness of
BulkTV is clearly tied to the size of the page: the bigger the page, the bigger the
benefit. For example, a modest 4KB page leads to a drastic reduction of 4096X
when shadowing every byte.
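BulkTV's amortization can be sketched with a simplified page model. This Python sketch collapses the per-level logic into a whole-page invalidation when the stored page version differs from the current one; the real scheme combines this with SlimTV's per-level truncation, so this is an approximation with illustrative names:

```python
PAGE_SIZE = 4096   # one version number amortized over a 4KB shadow page

class BulkTVPage:
    """One stored version per page; a version mismatch on any access
    invalidates every tag in the page at once."""
    def __init__(self):
        self.version = 0
        self.tags = {}                 # offset -> tag vector

    def _validate(self, current_version):
        if current_version != self.version:
            self.tags.clear()          # bulk-invalidate the whole page
            self.version = current_version

    def write(self, offset, tag_vector, current_version):
        self._validate(current_version)
        self.tags[offset] = tag_vector

    def read(self, offset, current_version):
        self._validate(current_version)
        return self.tags.get(offset)

page = BulkTVPage()
page.write(0, [1, 2, 3], current_version=4)
assert page.read(0, current_version=4) == [1, 2, 3]   # same region: hit
assert page.read(0, current_version=7) is None        # new region: invalidated
```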
BulkTV can also have a significant impact on the runtime overhead of tag
validation. This impact stems from two competing factors. On one hand, there is
additional overhead associated with validating a whole page, especially when only
a fraction of locations in that page will be used. This factor ultimately depends
Figure 4.5: A SlimTV Example. (a) SlimTV exploits the ordering encoded in
the version IDs of dynamic regions in the program. (b) illustrates the tag validation
process. Suppose a memory location is accessed in region 4 and later in region 7.
After the access in region 4, tag[0:2] will be logged with the region's version number
4. When the same address is accessed in region 7, the version stack [1, 5, 6, 7]
is compared against the stored version number 4. From the comparison, SlimTV
detects that only region level 0 started before the previous tags were written.
SlimTV therefore invalidates tag[1:2] and updates the version field in the shadow
memory.
on the locality of memory accesses: higher locality will lead to fewer wasted tag
validations.
On the other hand, BulkTV greatly reduces overhead through a decrease
in the most costly single operation in tag validation: finding the highest valid
level of tags. Finding this highest valid level involves walking through the current
version vector, an operation linear in the size of the vector. BulkTV still requires
n comparisons for each access in the worst case but the average case requires
significantly fewer comparisons. In the best case, only a single comparison is
needed; this occurs when a memory location in the same page is accessed in the
same region. BulkTV performs the comparisons starting at the end of the version
74
vector, meaning that slight differences in version number since the last access will
incur significantly fewer than n comparisons. The reduction is again dependent on
the locality exhibited throughout the program but our results in Section 4.5 show
that only a moderate amount of locality is needed to result in a net reduction in
runtime overhead.
4.3 Vectored Shadow Memory (VSM) Architecture
Traditional shadow memory infrastructures have gone to great lengths to
minimize the runtime overhead of memory shadowing. Runtime overhead contin-
ues to be a serious concern when extending shadow memory to support vectored
tags, but memory overhead is potentially more limiting. Skadu introduces a novel
shadow memory architecture that balances the sometimes conflicting requirements
of low memory and runtime overhead for vectored shadow memory. In this section
we describe this novel architecture.
4.3.1 VSM Architecture Overview
Skadu’s architecture separates fast, short-term shadow memory from space-
efficient, long-term shadow memory. Figure 4.6a shows the interaction between the
architecture’s two main components, TVCache and TVStorage, which correspond
to the split between short-term and long-term tag vector storage. This split allows
Skadu to exploit the characteristic differences in lifetime and tag storage size across
various levels of the region hierarchy.
Skadu initially places tag vectors in the TVCache, evicting them to the
TVStorage only as needed. The TVCache is geared toward fast-access time; sized
appropriately, it minimizes the number of accesses to the slower-access TVStor-
age. The TVStorage is designed for long-term storage and therefore attempts to
minimize the memory overhead. It does this through a level-based storage in-
frastructure that facilitates lightweight garbage collection; this garbage collection
Figure 4.6: Overview of Skadu Shadow Memory Organization. (a) To
exploit the memory footprint and liveness characteristics of hierarchical regions,
Skadu uses a TVCache, reducing memory requirements and improving perfor-
mance. (b) The TVCache is optimized for performance, handling most shadow
memory requests and allowing a memory-efficient organization of the TVStorage.
(c) The TVStorage is optimized for low memory overhead with the addition of a
level table. Paired with BulkTV, this three-level organization enables lightweight
garbage collection.
reduces the runtime overhead of dynamic vector resizing. Skadu compresses tags in
the TVStorage; because the TVStorage is accessed infrequently, the performance
overhead of this compression is minimal. In the following sections, we will describe
the components of Skadu’s architecture in more detail.
4.3.2 Tag Vector Cache (TVCache)
The TVCache stores frequently used tag vectors, making the common case
access time fast. These tag vectors are stored in an array format to further reduce
access time. The TVCache uses SlimTV for low-overhead tag validation but not
BulkTV because “lines” in the TVCache do not contain the spatial locality required
to make BulkTV profitable.
Figure 4.6b shows the structure of the TVCache, slightly simplified by omitting
metadata such as the associated memory address and vector size. Each
cache line contains the version and the tag vector associated with a memory ad-
dress. All cache lines have the same vector size for better performance at the cost of
possibly wasted memory. However, TVCache’s memory requirement is very small
compared to that of TVStorage because of the reduced address space it covers.
The TVCache is direct mapped to reduce access time while still providing good
hit ratios.
TVCache differs from traditional caches in that it not only improves runtime
performance but also reduces memory overhead of long-term tag storage. This
reduction in memory overhead is a result of the efficient write-back policy used by
the TVCache. The TVCache tends to cache tag vectors long enough that short-
lived regions have already exited by the time they are evicted. The TVCache
validates all tags upon an eviction, writing back to the TVStorage only those that
are valid. This process mimics garbage collection, reducing the space used by the
TVStorage.
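The write-back policy can be sketched as follows. This is an illustrative Python sketch (the line layout is hypothetical), combining SlimTV validation with eviction: only the prefix of tags belonging to still-valid levels is written back, and stale tags are dropped:

```python
def evict(line, active_versions):
    """On eviction, keep only tags whose level is still valid under SlimTV;
    invalid (stale) tags are dropped instead of written to the TVStorage."""
    valid = 0
    for v in active_versions:
        if v > line["version"]:       # this level started after the store
            break
        valid += 1
    return line["tags"][:valid]       # write back only the valid prefix

line = {"version": 4, "tags": ["t0", "t1", "t2"]}
# By eviction time, deeper regions (versions 5 and 6) have superseded 4.
assert evict(line, active_versions=[1, 5, 6]) == ["t0"]
```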
4.3.3 Tag Vector Storage (TVStorage)
The TVStorage acts as a long-term backing store for tag vectors evicted
from the TVCache. The TVCache handles most shadow memory accesses, allowing
the TVStorage to focus on reducing memory rather than runtime overhead.
Figure 4.6c shows the structure of the TVStorage. The TVStorage utilizes a
three-level structure that is similar to traditional shadow memory infrastructures
1 but with the novel addition of a level table. The TVStorage groups tags by
their level rather than the address they shadow. This distribution of tags enables
efficient garbage collection, exploiting the fact that tags become invalidated when
1Although not shown, this structure is easily modified to handle 64-bit addressing via an additional table before the segment table, similar to what was proposed in [ZBA10a].
regions–and therefore levels–are exited. The TVStorage also employs BulkTV: all
entries in a tag table share a single version ID, which is located in the level table
next to the tag table pointer.
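The level-table organization with a per-table BulkTV version can be sketched as follows. This is a simplified Python model (page split and class name are illustrative, not Skadu's actual layout):

```python
class TVStorage:
    """Simplified lookup: segment table -> level table -> tag table.
    Each level-table entry pairs a tag-table pointer with one BulkTV
    version shared by every entry in that tag table."""
    def __init__(self):
        self.seg_table = {}

    def store(self, addr, level, tag, version):
        level_table = self.seg_table.setdefault(addr >> 12, {})
        ver, tag_table = level_table.get(level, (None, {}))
        if ver != version:
            tag_table = {}             # whole tag table invalidated at once
        tag_table[addr & 0xFFF] = tag
        level_table[level] = (version, tag_table)

    def load(self, addr, level, version):
        ver, tag_table = self.seg_table.get(addr >> 12, {}).get(level, (None, {}))
        # One version comparison validates (or invalidates) the whole table.
        return tag_table.get(addr & 0xFFF) if ver == version else None

tv = TVStorage()
tv.store(0xABC123, level=2, tag=99, version=7)
assert tv.load(0xABC123, level=2, version=7) == 99
assert tv.load(0xABC123, level=2, version=8) is None   # stale: invalid
```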
The TVStorage organization enables invalidation of a whole tag table with
only a single version number comparison. Skadu maintains a list of free tag tables:
tag invalidation only requires sending off the tag table to be scrubbed and returned
to the free list. This makes garbage collection extremely lightweight. Skadu em-
ploys a simple garbage collector that walks all the level tables in the TVStorage,
invalidating and freeing tag tables as it goes along. This garbage collector allows
Skadu to dynamically adjust the size of an address’ tag vector with little-to-no
performance overhead.
4.3.4 Tag Vector Compression
Skadu’s TVCache-TVStorage organization facilitates the use of compres-
sion without significant overhead. The size of the TVStorage dwarfs that of the
TVCache for all but the smallest programs: the TVCache is designed to be small
enough to handle only the most frequently accessed addresses, leaving the TVStor-
age to store all other addresses. The TVStorage is a good target for compression
because of its large size and relatively infrequent access.
Skadu balances the space savings of compressed tags with the performance
of uncompressed tags: a small list of recently used level tables house uncompressed
tag tables while all other level tables house compressed tag tables. This list of
uncompressed level tables is checked whenever a line is evicted from the TVCache;
if the corresponding level table is not in this list, it is added and one of the existing
level tables is removed according to a simple “clock” eviction algorithm.
The inclusion of uncompressed level tables protects against large perfor-
mance penalties during bursts of high miss rates in the TVCache. These bursts
would otherwise incur decompression costs on top of the already high cost of ac-
cessing the TVStorage. Results show this method to be effective.
4.4 Case Studies
To demonstrate Skadu’s effectiveness, we implemented two dynamic, region-
based analyses that use vector shadow memory: a memory footprint profiler
and hierarchical critical path analysis (HCPA). The first represents a relatively
lightweight application of Skadu whereas the second represents a heavyweight one.
The following subsections describe these two analyses.
4.4.1 Memory Footprint Profiler
The memory footprint profiler tracks the number of memory locations ac-
cessed in each dynamic region and reports the average memory footprint for each
static region. It illuminates a program’s region-specific memory usage, informing
memory optimizations.
Tag Format Each tag is a single bit that tracks whether or not the address has
been touched by a region. This leads to a tag vector of n bits, where n is the depth
of the region accessing the address. The profiler watches for the first touch of an
address (i.e. tag changing from 0 to 1), incrementing a counter associated with
the region when this event happens. This counter is checked when a region exits;
its value then propagates to the statistics associated with the corresponding static
region.
The region hierarchy leads to an inclusivity property for memory footprint
analysis: if a memory address is touched in a region, it must also have been touched
in all its ancestor regions. The footprint profiler exploits this property by compress-
ing the whole tag vector into a single integer. This integer represents the shallowest
level in which the address was not touched. This scalar representation avoids costly
vector operations. This compressed format could lead to increased overhead (for
example, when using an 8-bit integer to represent the vector 〈1,1,0,0〉), but our
results show that this increase is negligible.
Efficiently Measuring Memory Footprint Each memory access triggers a
check to see if the footprint of the active regions needs to be increased. This check
involves three steps: tag validation, footprint update, and tag update. The tag
validation step reads both the stored tag and the version from shadow memory
and uses SlimTV to find the first invalid region level. The footprint update step
finds and updates the range of region levels whose memory footprint should be
incremented. The tag update step updates shadow memory with the new tag and
version for the given address.
We use an algorithm in the footprint update step that reduces the update
cost from O(n) to O(1). A naive algorithm would increment a counter for each
level that needs to be updated, leading to an overhead of O(n) for n updated
regions.
Our algorithm reduces this cost through the use of a 2D array. Each element
of this array, count[maxLevel][minLevel], records the number of new memory
accesses that increment the footprint of levels minLevel through maxLevel. The
footprint update increments only the single element of the array that corresponds
to the min and max levels to update. When a region exits, the profiler calculates
its footprint by summing all values in count[currentLevel][...], propagating
each counter to the parent's row if its minLevel ≤ currentLevel − 1.
Implementation The memory footprint analyzer uses LLVM 2.8 [LA04] to in-
sert function calls into the source code that demarcate region boundaries and
trigger events on memory accesses. These functions are implemented in a runtime
library that is linked in at compile time. The footprint analyzer uses functions and
loops as regions because they are natural, programmer-centric boundaries.
As previously mentioned, the footprint profiler compresses tag vectors into
a single integer, eliminating the need for the TVCache-TVStorage organization; tag
compression is still used by making the uncompressed level table list the first point
of access for all accesses to shadow memory. In place of the TVCache-TVStorage
architecture, we modified the traditional two-level shadow memory organization
shown in Figure 4.1 to support tag validation and a 64-bit address space. Each
segment table and tag table covers 4GB and 64KB of address space, respectively.
Each tag is an 8-bit integer, supporting a region tree of depth 256. This was more
than enough for all benchmarks we examined in our results. The footprint analyzer
supports the use of baseline tag validation, SlimTV, or BulkTV; this allowed us
to examine the overheads associated with each of these techniques.
Tag tables support two separate configurations: one that tags every 4 bytes
of address space and another that tags every 8 bytes of address space. The latter
configuration results in less overhead (one tag byte per 8 bytes of data, or 12.5%) and
is the default configuration when a tag table is created. If the analyzer detects
finer granularity accesses (4-byte), it automatically switches configuration.
4.4.2 Hierarchical Critical Path Analysis
Overview As introduced in Chapter 2, hierarchical critical path analysis (HCPA)
is a dynamic program analysis that computes the self-parallelism of each program
region. Self-parallelism is the parallelism of a region exclusive of the parallelism
of its child regions. HCPA calculates self-parallelism by performing critical path
analysis (CPA) on every region of the program, utilizing the program hierarchy to
determine the relationships of regions. CPA incurs a large amount of overhead as
it requires every operation to be instrumented; this is required to find the critical
path of the program, its longest chain of dependent instructions.
HCPA concurrently calculates CPA on multiple regions, requiring a tag
vector of n 64-bit timestamps for n active regions. The size of each tag makes
memory overhead a severe issue in HCPA, much more so than the memory footprint
profiler. HCPA further exacerbates the memory overhead problem by treating loop
bodies as regions; this is in addition to the function and loop regions seen in the
memory footprint profiler. The addition of loop bodies increases the depth of the
region tree, increasing tag vector sizes and the memory overhead as a result.
HCPA operates on all instructions, not just the loads and stores that were
instrumented in the memory footprint profiler. This increased instrumentation
greatly increases the performance overhead. HCPA does not access shadow mem-
ory on every instruction, though: all non-memory operations utilize a shadow register
file. This shadow register file is much smaller than shadow memory and can there-
fore be optimized for access time rather than space overhead in much the same
way as the TVCache.
HCPA follows a three step procedure for handling loads. First, it accesses
shadow memory to load in the tag vector (the timestamps) for the specified memory
address. Next, it calculates the updated tag vector for the target register based
on three factors: the loaded tag vector, the tag vector of control dependences, and
the estimated cost of a load. Finally, it updates the shadow register file entry for
the target register. The process for a store is similar except that the tag vector is
initially loaded from the shadow register file and finally stored in shadow memory.
Implementation HCPA utilizes all of Skadu’s techniques in order to reduce
both the memory and runtime overhead. Shadow memory operations first access
the TVCache to determine if the target address is available. A TVCache miss
forces a load from and eviction to the TVStorage in the case of a load instruction;
a miss on a store instruction simply requires an eviction to the TVStorage. All tag
tables are compressed, save for those associated with a list of uncompressed level
tables. If the level table associated with an evicted TVCache line is not in this list,
it is added after another level table is evicted and compressed. HCPA uses a tag
table size of 4KB, which is smaller than the 64KB tag tables used by the memory
footprint analyzer. This smaller size reduces the runtime overhead associated with
BulkTV, helping offset the increased runtime from having a variable-size tag vector
in HCPA.
The size of both the TVCache and the uncompressed level table list can be
configured by the user. Increasing the size of either of these tends to reduce the
runtime overhead at the expense of increased memory overhead.
The HCPA code also contains a lightweight garbage collector. This garbage
collector walks all level tables in the TVStorage, using BulkTV to quickly find
invalid tag tables and return them to the list of free tag tables. The garbage
collector is activated when Skadu’s dynamic memory overhead passes a threshold.
Skadu adjusts this threshold based on the memory usage after garbage collection.
This variable threshold avoids hysteresis effects.
Table 4.2: Benchmark Characteristics. We examined 12 benchmarks from
three benchmark suites. These benchmarks display a wide variety of characteristics
including memory usage (2MB to 434MB) and execution time (2 seconds to 2
minutes).
Columns: Suite, Benchmark Name, Mem. Usage, Native Runtime, Region Depth (Footprint, HCPA).
Figure 4.7: Memory Overhead Reduction and Speedup in Footprint Pro-
filer. Skadu reduces the memory expansion factor from the baseline’s 17.8× to
1.25× while maintaining comparable execution time. Numbers on the top repre-
sent the (a) memory expansion factor and (b) slowdown of Skadu’s most aggressive
memory-saving implementation compared to native execution.
and contain many dense, array-based operations. Conversely, SpecInt benchmarks
have more irregular memory access patterns in addition to deeper region hierar-
chies. We used SpecInt and SpecFP’s ’ref’ input set and NPB’s ’A’ input set for
all results.
4.5.1 Memory Footprint Profiler
As mentioned in Section 4.4, the memory footprint profiler uses only SlimTV,
BulkTV, and tag compression. The footprint profiler’s overheads are almost solely
from tag validation, making it a good target to evaluate the impact of this pro-
cess. The results are compared against those in the baseline implementation. This
baseline implementation associates a version vector with every tag vector; the vec-
tor size in this baseline implementation is fixed to the deepest region level in the
program.
Figure 4.7 shows the memory expansion factors and runtime overheads from
the memory footprint profiler. This graph is sorted in order of increasing memory
footprint. The numbers on top of the bars represent the final memory expansion
factor and slowdown compared to the native execution. Skadu shows impressive
reductions in the memory expansion factor of the memory footprint profiler when
combining SlimTV, BulkTV, and compression. Overall, Skadu reduces the mem-
ory expansion factor by 14.2×. Benchmarks with larger memory footprints show
better reductions: 17.5× for the top six benchmarks by memory footprint.
SlimTV effectively reduces the memory expansion factor and improves per-
formance. SlimTV’s main benefits stem from its replacement of the version vector
with a scalar version. These benefits will therefore be more pronounced in pro-
grams with deep region hierarchies. For example, mcf sees the largest reduction
in memory expansion because its region depth (48) is more than twice that of the
closest benchmark (20). SlimTV also speeds up the analysis by a factor of 3.1×
because it eliminates the large number of loads and stores associated with
accessing version vectors.
BulkTV provides additional benefits beyond that of SlimTV. BulkTV re-
duces the memory overhead of tag validation from 7× (a 56-bit version for every
8-bit tag) to nearly zero (one 64-bit version per 64KB tag table). BulkTV is more
effective at reducing memory expansion on programs with large memory foot-
prints: the benefit increases as more tag tables are in use. Figure 4.7 shows this
phenomenon: while the smallest (leftmost) benchmarks see little additional ben-
efit from BulkTV, the remaining benchmarks see significant improvements in the
memory expansion factor. BulkTV also helps improve performance as explained
in Section 4.2. With SlimTV and BulkTV, the geomean memory expansion factor
is only 1.25× while slowdown is a manageable 12.28×.
Tag compression further reduces the memory footprint profiler’s memory
expansion factor. The footprint profiler maintained a list of 256 uncompressed
level tables while the rest were compressed. These 256 tables covered 16MB of
memory address space. This address space coverage meant that benchmarks that
used less than 16MB of memory saw no benefit. Compression is therefore similar
to SlimTV and BulkTV in that it sees larger benefits with larger programs. For
example, mg receives a 13× reduction in memory, making the memory overhead
almost negligible.
Compression’s memory savings come at a cost: increased runtime overhead.
This overhead consists of two components: the compression/decompression algo-
rithms and the eviction algorithm used for the list of uncompressed level tables.
This list uses a “clock” eviction policy [Tan07] that requires an access bit be up-
dated every time an entry in the list is touched. This clocking cost explains the
additional runtime overhead even when compression is not used (e.g. in art).
While a simpler eviction policy may seem desirable (e.g. direct mapped cache),
the higher hit ratio of the clock algorithm more than offsets its maintenance costs.
4.5.2 Hierarchical Critical Path Analysis (HCPA)
Hierarchical critical path analysis is much more costly than the memory
footprint profiler in terms of both memory and performance. HCPA’s baseline
version results in a memory expansion factor of 59.0×, severely limiting its use
outside of supercomputers and other high-memory environments. HCPA utilizes
Skadu’s full array of techniques to rein in its overheads. The results are impres-
As multi-core processors enter mainstream computing, software engineers
are facing a fundamental change with parallelization. We began this dissertation
by discussing why fully automatic parallelization does not work in practice and the
limitations of currently available tools, leading to the need for a new speedup
estimation tool. Kismet is different from existing tools in that it does not require
any pre-parallelized or annotated source code. As our tool requires only
unmodified serial source code, it can help a programmer make
informed decisions in the early stages of parallelization, making the manual
parallelization process more productive.
One of the key factors that limit achievable speedup is the amount of par-
allelism available in the target program. In Chapter 2, we introduced hierarchical
critical path analysis (HCPA) that quantifies the amount of parallelism in each
region. Unlike the original critical path analysis (CPA), which provides only the
theoretical speedup upper bound, HCPA localizes the parallelism of each region
with a new metric called self-parallelism, providing the basis for realistic speedup
estimation. We also discussed an efficient summarization technique that makes
the huge amount of data produced at runtime manageable.
Chapter 3 described Kismet, our speedup estimation tool prototype. Be-
sides parallelism, target-specific parallelization constraints such as expressible par-
allelism type, parallelization overhead, available core count, and memory locality
significantly impact the achievable parallel speedup. Based on the profiled infor-
mation from HCPA and specified parallelization constraints, Kismet finds the par-
allelization strategy with the highest expected speedup. Our experimental results
show that Kismet provided realistic speedup upper bounds on two very different
target platforms: the MIT Raw processor and conventional multi-core processors.
Chapter 4 discussed the design and implementation of vector shadow mem-
ory (VSM). Because HCPA recursively applies CPA, which is already an expen-
sive dynamic analysis, it can incur prohibitively expensive memory and runtime
overhead. We applied several techniques that reduce both memory and runtime
overhead, dramatically reducing the overhead of HCPA to a level where most pro-
grams can be run on conventional machines. We also showed that these techniques
can be applied to other heavyweight memory analyses with a memory footprint
analyzer.
Overall, we have shown that estimating parallel speedup from unmodified
serial source code is a viable means of helping manual parallelization, which typi-
cally requires extensive efforts from a programmer. Our prototype, Kismet, allows
programmers to make informed decisions in the early stages of parallelization by
understanding the potential benefit from parallelization, making parallelization
more productive. We have demonstrated that Kismet provides realistic speedup
upper bounds on two very different target platforms with widely varying paral-
lelization constraints. To broaden Kismet's reach and allow it to evolve with
community contributions, we plan to release it as an open source project.
Bibliography
[ABL97] Glenn Ammons, Thomas Ball, and James R. Larus. Exploiting hardware performance counters with flow and context sensitive profiling. In PLDI ’97: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 85–96, New York, NY, USA, 1997. ACM.
[AMCA+95] V.S. Adve, J. Mellor-Crummey, M. Anderson, J-C. Wang, D. A. Reed, and K. Kennedy. An integrated compilation and performance analysis environment for data parallel programs. In SC ’95: Proceedings of the ACM/IEEE conference on Supercomputing, 1995.
[AS92] Todd Austin and Gurindar S. Sohi. Dynamic dependency analysis of ordinary programs. In ISCA ’92: Proceedings of the International Symposium on Computer Architecture, pages 342–351, 1992.
[BBB+91] D.H. Bailey, E. Barszcz, J.T. Barton, D.S. Browning, R.L. Carter, L. Dagum, R.A. Fatoohi, P.O. Frederickson, T.A. Lasinski, R.S. Schreiber, H.D. Simon, V. Venkatakrishnan, and S.K. Weeratunga. The NAS parallel benchmarks summary and preliminary results. In Supercomputing ’91: Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, pages 158–165, Nov 1991.
[BFL+97] J. Babb, M. Frank, V. Lee, E. Waingold, R. Barua, M. Taylor, J. Kim, S. Devabhaktuni, and A. Agarwal. The Raw benchmark suite: computation structures for general purpose computing. In FCCM ’97: Proceedings of the IEEE Symposium on FPGA-Based Custom Computing Machines, pages 134–, Washington, DC, USA, 1997. IEEE Computer Society.
[BO01] J. Mark Bull and Darragh O’Neill. A microbenchmark suite for OpenMP 2.0. SIGARCH Computer Architecture News, 29:41–48, Dec 2001.
[BRL+08] Bradley J. Barnes, Barry Rountree, David K. Lowenthal, Jaxk Reeves, Bronis de Supinski, and Martin Schulz. A regression-based approach to scalability prediction. In ICS ’08: Proceedings of the International Conference on Supercomputing, pages 368–377, 2008.
[BZ11] D. Bruening and Qin Zhao. Practical memory checking with Dr. Memory. In CGO ’11: International Symposium on Code Generation and Optimization, pages 213–223, 2011.
[CFR+91] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Trans. Program. Lang. Syst., 13(4):451–490, October 1991.
[CMHM10] Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. Single-chip heterogeneous computing: Does the future include custom logic, FPGAs, and GPGPUs? In MICRO ’10: Proceedings of the IEEE/ACM International Symposium on Microarchitecture, pages 225–236, Washington, DC, USA, 2010. IEEE Computer Society.
[CZYH06] W. Cheng, Qin Zhao, Bei Yu, and S. Hiroshige. TaintTrace: Efficient flow tracing with dynamic binary rewriting. In ISCC ’06: Proceedings of the 11th IEEE Symposium on Computers and Communications, pages 749–754, June 2006.
[DRR99] L.A. De Rose and D.A. Reed. SvPablo: A multi-language architecture-independent performance analysis system. In ICPP ’99: International Conference on Parallel Processing, pages 311–318, 1999.
[E. 97] E. Waingold et al. Baring It All to Software: Raw Machines. IEEE Computer, pages 86–93, Sept 1997.
[GJLT11] Saturnino Garcia, Donghwan Jeon, Chris Louie, and Michael Bedford Taylor. Kremlin: Rethinking and rebooting gprof for the multicore age. In PLDI ’11: Proceedings of the Conference on Programming Language Design and Implementation, New York, NY, USA, 2011. ACM.
[GKM82] Susan L. Graham, Peter B. Kessler, and Marshall K. Mckusick. gprof: A call graph execution profiler. In Proceedings of the 1982 SIGPLAN Symposium on Compiler Construction, SIGPLAN ’82, pages 120–126. ACM, 1982.
[GSV+10] N. Goulding, J. Sampson, G. Venkatesh, S. Garcia, J. Auricchio, J. Babb, M.B. Taylor, and S. Swanson. GreenDroid: A Mobile Application Processor for a Future of Dark Silicon. In Hotchips, 2010.
[HLL10] Y. He, C. Leiserson, and W. Leiserson. The Cilkview Scalability Analyzer. In SPAA ’10: Proceedings of the Symposium on Parallelism in Algorithms and Architectures, pages 145–156, 2010.
[HM08] Mark D. Hill and Michael R. Marty. Amdahl’s law in the multicore era. IEEE Computer, 41:33–38, July 2008.
[HPE+06] Kenneth Hoste, Aashish Phansalkar, Lieven Eeckhout, Andy Georges, Lizy K. John, and Koen De Bosschere. Performance prediction based on inherent program similarity. In PACT ’06: Parallel Architectures and Compilation Techniques, 2006.
[Int] Intel. Intel Parallel Advisor 2011.
[KBI+09] Milind Kulkarni, Martin Burtscher, Rajeshkar Inkulu, Keshav Pingali, and Calin Cascaval. How much parallelism is there in irregular applications? In PPoPP ’09: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 3–14, 2009.
[KKKB12] Minjang Kim, Pranith Kumar, Hyesoon Kim, and Bevin Brett. Predicting potential speedup of serial code via lightweight profiling and emulations with memory performance model. In IPDPS ’12: Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012.
[KKL10a] M. Kim, H. Kim, and C.K. Luk. Prospector: A dynamic data-dependence profiler to help parallel programming. In HotPar, 2010.
[KKL10b] Minjang Kim, Hyesoon Kim, and Chi-Keung Luk. SD3: A scalable approach to dynamic data-dependence profiling. In MICRO ’10: Proceedings of the International Symposium on Microarchitecture, pages 535–546, 2010.
[KMC72] D.J. Kuck, Y. Muraoka, and Shyh-Ching Chen. On the number of operations simultaneously executable in FORTRAN-like programs and their resulting speedup. IEEE Transactions on Computers, C-21(12):1293–1310, Dec. 1972.
[KRL+10] Hanjun Kim, Arun Raman, Feng Liu, Jae W. Lee, and David I. August. Scalable speculative parallelization on commodity clusters. In MICRO ’10: Proceedings of the IEEE/ACM International Symposium on Microarchitecture, pages 3–14, 2010.
[KS04] Tejas S. Karkhanis and James E. Smith. A first-order superscalar processor model. In ISCA ’04: Proceedings of the International Symposium on Computer Architecture, pages 338–, Washington, DC, USA, 2004. IEEE Computer Society.
[Kum88] M. Kumar. Measuring parallelism in computation-intensive scientific/engineering applications. IEEE Transactions on Computers, 37(9):1088–1098, Sep 1988.
[LA04] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO ’04: Proceedings of the International Symposium on Code Generation and Optimization, Palo Alto, California, 2004.
[Lar93] J. R. Larus. Loop-level parallelism in numeric and symbolic programs. IEEE Trans. Parallel Distrib. Syst., 4(7):812–826, 1993.
[LBF+98] Walter Lee, Rajeev Barua, Matthew Frank, Devabhaktuni Srikrishna, Jonathan Babb, Vivek Sarkar, and Saman Amarasinghe. Space-time scheduling of instruction-level parallelism on a Raw machine. In ASPLOS ’98: International Conference on Architectural Support for Programming Languages and Operating Systems, pages 46–54, Oct 1998.
[LDB+99] Shih-Wei Liao, Amer Diwan, Robert P. Bosch, Jr., Anwar Ghuloum, and Monica S. Lam. SUIF Explorer: an interactive and interprocedural parallelizer. In PPoPP ’99: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 37–48, New York, NY, USA, 1999. ACM.
[Loh01] Gabriel Loh. A time-stamping algorithm for efficient performance estimation of superscalar processors. In SIGMETRICS, pages 72–81, New York, NY, USA, 2001. ACM.
[LW92] Monica S. Lam and Robert P. Wilson. Limits of control flow on parallelism. In ISCA, pages 46–57, New York, NY, USA, 1992. ACM.
[M. 04] M. B. Taylor et al. Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams. In ISCA ’04: Proceedings of the International Symposium on Computer Architecture, Munich, Germany, Jun 2004.
[MCC+95] Barton P. Miller, Mark D. Callaghan, Jonathan M. Cargille, Jeffrey K. Hollingsworth, R. Bruce Irvin, Karen L. Karavanic, Krishna Kunchithapadam, and Tia Newhall. The Paradyn Parallel Performance Measurement Tool. IEEE Computer, 28(11):37–46, 1995.
[MFH96] Margaret Martonosi, David Felt, and Mark Heinrich. Integrating performance monitoring and communication in parallel computers. In SIGMETRICS, pages 138–147, 1996.
[MSB+05] Milo Martin, Daniel Sorin, Bradford Beckmann, Michael Marty, Min Xu, Alaa R. Alameldeen, Kevin Moore, Mark Hill, and David Wood. Multifacet’s general execution-driven multiprocessor simulator (GEMS) toolset. SIGARCH Comput. Archit. News, 33:92–99, Nov 2005.
[NS07a] N. Nethercote and J. Seward. How to shadow every byte of memory used by a program. In VEE ’07: Proceedings of the International Conference on Virtual Execution Environments, pages 65–74, 2007.
[NS07b] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In PLDI ’07: Proceedings of the Conference on Programming Language Design and Implementation, pages 89–100, New York, NY, USA, 2007. ACM.
[Obe] Markus Oberhumer. LZO Data Compression Library. http://www.oberhumer.com/opensource/lzo/.
[OH00] David Ofelt and John L. Hennessy. Efficient performance prediction for modern microprocessors. In SIGMETRICS, pages 229–239, New York, NY, USA, 2000. ACM.
[PO05] Manohar K. Prabhu and Kunle Olukotun. Exposing speculative thread parallelism in SPEC2000. In PPoPP ’05: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 142–152, New York, NY, USA, 2005. ACM.
[QWL+06] Feng Qin, Cheng Wang, Zhenmin Li, Ho-seop Kim, Yuanyuan Zhou, and Youfeng Wu. LIFT: A low-overhead practical information flow tracking system for detecting security attacks. In MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 135–148, Washington, DC, USA, 2006. IEEE Computer Society.
[RDN93] Lawrence Rauchwerger, Pradeep K. Dubey, and Ravi Nair. Measuring limits of parallelism and characterizing its vulnerability to resource constraints. In MICRO ’93: Proceedings of the International Symposium on Microarchitecture, pages 105–117, 1993.
[ROR+08] Easwaran Raman, Guilherme Ottoni, Arun Raman, Matthew J. Bridges, and David I. August. Parallel-stage decoupled software pipelining. In CGO ’08: Proceedings of the International Symposium on Code Generation and Optimization, pages 114–123, New York, NY, USA, 2008. ACM.
[S. 08] S. Bell et al. TILE64 Processor: A 64-Core SoC with Mesh Interconnect. In ISSCC ’08: IEEE Solid-State Circuits Conference, pages 88–89, 598, 2008.
[SN05] Julian Seward and Nicholas Nethercote. Using Valgrind to detect undefined value errors with bit-precision. In ATEC ’05: Proceedings of the USENIX Annual Technical Conference, pages 2–2, Berkeley, CA, USA, 2005. USENIX Association.
[Tan07] Andrew S. Tanenbaum. Modern Operating Systems. Prentice Hall Press, 3rd edition, 2007.
[Tay07] Michael B. Taylor. Tiled Microprocessors. PhD thesis, Massachusetts Institute of Technology, 2007.
[TGH92] Kevin B. Theobald, Guang R. Gao, and Laurie J. Hendren. On the limits of program parallelism and its smoothability. In MICRO ’92: Proceedings of the International Symposium on Microarchitecture, pages 10–19. IEEE Computer Society Press, 1992.
[TKM+02] M.B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, Jae-Wook Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal. The Raw microprocessor: a computational fabric for software circuits and general-purpose programs. IEEE Micro, 22(2):25–35, Mar/Apr 2002.
[TLAA05] Michael Bedford Taylor, Walter Lee, Saman P. Amarasinghe, and Anant Agarwal. Scalar operand networks. IEEE Transactions on Parallel and Distributed Systems, 16:145–162, Feb 2005.
[TWFO09] Georgios Tournavitis, Zheng Wang, Bjorn Franke, and Michael F. P. O’Boyle. Towards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mapping. In PLDI ’09: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 177–187, 2009.
[Uni] Tsukuba University. NAS Parallel Benchmarks 2.3; OpenMP C. http://www.hpcc.jp/Omni/.
[Wal91] David W. Wall. Limits of instruction-level parallelism. In Proceedings of the Conference on Architectural Support for Programming Languages and Operating Systems, pages 176–188, New York, NY, USA, 1991. ACM.
[WE03] Joel Winstead and David Evans. Towards differential program analysis. In WODA ’03: Workshop on Dynamic Analysis, 2003.
[XBS06] Wei Xu, Sandeep Bhatkar, and R. Sekar. Taint-enhanced policy enforcement: a practical approach to defeat a wide range of attacks. In Proceedings of the 15th USENIX Security Symposium, Berkeley, CA, USA, 2006. USENIX Association.
[ZBA10a] Q. Zhao, D. Bruening, and S. Amarasinghe. Umbra: Efficient and scalable memory shadowing. In CGO ’10: Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, pages 22–31, 2010.
[ZBA10b] Qin Zhao, Derek Bruening, and Saman Amarasinghe. Efficient memory shadowing for 64-bit architectures. In ISMM ’10: Proceedings of the International Symposium on Memory Management, Toronto, Canada, Jun 2010.
[ZCZ10] Jidong Zhai, Wenguang Chen, and Weimin Zheng. Phantom: predicting performance of parallel applications on large-scale parallel machines using a single node. In PPoPP ’10: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 305–314, 2010.
[ZG01] Youtao Zhang and Rajiv Gupta. Timestamped whole program path representation and its applications. In PLDI ’01: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 180–190, 2001.
[ZIM+07] Li Zhao, R. Iyer, J. Moses, R. Illikkal, S. Makineni, and D. Newell. Exploring Large-Scale CMP Architectures Using ManySim. IEEE Micro, 27(4):21–33, July 2007.
[ZL10] David Zier and Ben Lee. Performance evaluation of dynamic speculative multithreading with the Cascadia architecture. TPDS, 21:47–59, Jan 2010.
[ZMLM08] H. Zhong, M. Mehrara, S. Lieberman, and S. Mahlke. Uncovering hidden loop level parallelism in sequential applications. In HPCA ’08: Proceedings of the International Symposium on High Performance Computer Architecture, 2008.
[ZNJ09] X. Zhang, A. Navabi, and S. Jagannathan. Alchemist: A transparent dependence distance profiling infrastructure. In CGO ’09: Proceedings of the International Symposium on Code Generation and Optimization, pages 47–58. IEEE Computer Society, 2009.