Sapienza University of Rome
School of Information Engineering,
Computer Science, and Statistics
Master’s Thesis in
Engineering in Computer Science
Mining Hot Calling Contexts in Small Space
Advisor: Prof. Camil Demetrescu
Student: Daniele Cono D’Elia
A.Y. 2011/2012
Homines, dum docent, discunt.
(Men learn while they teach.)
(Seneca)
Introduction
A crucial problem in the development of computer software is to identify badly designed portions of code that can affect execution performance. An empirical rule [15] states that 90% of the time is spent executing 10% of the code: hence, to remove performance bottlenecks it is necessary to pinpoint and optimize that 10% of the code, since optimizing the remaining 90% would have little impact on overall performance.
Dynamic program analysis encompasses the design and development of tools
for analyzing software by gathering information at execution time. Main appli-
cations include performance profiling, debugging, and program understanding.
Static techniques for performance profiling, which are based on the inspection of the source (or sometimes the object) code of the software and were widely used in the past, are no longer sufficient, mostly because of the massive usage of dynamic libraries and of polymorphism and late binding in object-oriented languages.
Traditional profilers such as gprof [14] and valgrind [24] provide information
about the execution frequency of single routines (vertex profiling) or of caller-
callee pairs (edge profiling). However, these data are sometimes not sufficient to describe the dynamics of execution accurately [26, 30]. Context-sensitive profiling provides, instead, more valuable and finer-grained information for program
understanding, performance analysis, and runtime optimization. Collecting con-
text information in modern object-oriented software is very challenging: source
code is composed of a large number of small routines, and the high frequency
of function calls and returns might result in considerable profiling overhead and
huge amounts of data to be analyzed by the profiler.
The main goal of this thesis is to show how sophisticated techniques from
the data streaming computational model can be used to handle the huge amount
of profiling data gathered at execution time, thus achieving low overhead in the implementation of a context-sensitive profiler. In particular, we adopted and optimized two well-known algorithms for the frequent items problem, Lossy Counting and Space Saving, and then realized two implementations of our context-sensitive profiler. The first implementation is based on Intel Pin and allows us to rapidly profile any C/C++ application, since the instrumentation is performed at runtime. Although Pin is one of the most efficient and widespread instrumentation frameworks, preliminary experimental results showed that the overhead introduced for context-sensitive profiling was large.
For this reason, we developed a much more efficient profiler based on gcc. Although instrumentation is now performed at compile time, the tool is very flexible: changing a symbolic link or an environment variable is sufficient to modify even crucial parameters of the profiler. This tool also allows for partial instrumentation of the source code and outputs only the calling contexts representative of the hot spots to which optimizations must be directed.
The experiments show that our profiler is able to correctly identify all the hottest calling contexts, producing accurate performance metrics. The running time overhead is kept under control using bursting techniques [2, 17, 32], while the peak memory usage is only 1% of that of standard context-sensitive profilers. This
thesis is largely based on, and extends, the paper Mining Hot Calling Contexts in
Small Space [12], written by the author of this dissertation together with Profes-
sors Camil Demetrescu and Irene Finocchi (Sapienza University of Rome), and
published in Proceedings of the 32nd ACM SIGPLAN Conference on Program-
ming Language Design and Implementation, PLDI 2011, San Jose, CA, USA,
June 4-8, 2011. Source code and documentation related to this work are hosted
on Google Code (http://code.google.com/p/hcct/).
Thesis structure. The remaining part of this thesis is structured as follows.
Chapter 1 describes performance profiling methodologies and most used program
instrumentation tools. Chapter 2 describes issues related to context-sensitive pro-
filing and our approach to the problem. Chapter 3 focuses on implementation and
engineering aspects, while Chapter 4 illustrates the outcome of our experimental evaluation.
Chapter 1
Performance profiling methodologies
The development of computer software often leads to programs that are quite
large and typically modular, according to well-known principles from software
engineering. These programs are usually composed of a large number of small routines, written by several developers and updated over time to introduce enhancements and new functionalities, sometimes deviating from the software's initial goals. When dealing with a large program, it becomes important to pinpoint and optimize the pieces of code that dominate the execution time. Indeed, according to the 90-10 empirical rule [15], 90% of the time is spent executing 10% of the code: this rule suggests, in accordance with the well-known Pareto principle, concentrating optimization efforts on this part of the program to achieve remarkable improvements in efficiency and execution time.
The main goal of a performance profiler is to identify which parts of the program should be optimized in order to improve the global execution speed. Profilers have many other applications as well: for instance, optimal memory allocation, intrusion detection, and understanding of program behaviour.
1.1.2 Static and dynamic analysis
Static analysis techniques are based on the inspection of the source code (and sometimes of the object code). The analysis is carried out by an automated tool without actually executing the program. The granularity of analysis tools varies from those that only consider individual statements to those that include the complete source code in their analysis. Static analysis techniques were very successful in the past, especially in procedural programming; nowadays they are used mainly by compilers to apply optimizations to the code while preserving its semantic correctness. This decline is due to the fact that changes in development methodologies over the last years have created the need for a dynamic analysis of program behaviour: polymorphism, late binding, and dynamically linked libraries are emblematic of this need. A static analysis would be inaccurate, since the shortage of statically available information forces the profiler to make conservative assumptions, thus undermining the validity of the analysis in several cases. The development of dynamic analysis tools has posed several challenges also from a methodological point of view, encompassing different areas: building efficient instrumentation techniques that introduce a moderate slowdown on execution time and do not interfere with the internal logic of the analysed program (i.e., do not introduce heisenbugs); designing efficient algorithms to process huge amounts of data with limited time and resources; and dealing with the sometimes obscure optimizations introduced by the compiler in the executable code.
1.1.3 Collection of data
A statistical profiler operates by sampling: it usually probes the target program's program counter on a regular basis using operating system interrupts. The analysis performed by a statistical profiler is the result of a statistical approximation, with a high error probability on infrequent functions (sometimes they are not detected at all). In practice, these profilers can often provide a more accurate analysis than other approaches, since they are not as intrusive to the analyzed program (it is possible to achieve near full speed execution, so they can detect issues that would otherwise be hidden) and do not have as many side effects (such
as on memory caches). Moreover, a statistical profiler can show the approximate amount of time spent in user mode versus kernel mode by every analyzed routine. Several drawbacks due to the usage of operating system interrupts can be overcome by using dedicated hardware to capture the program counter without interfering with the execution of the program under analysis.
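To make the sampling mechanism concrete, the following minimal sketch (not taken from any of the tools discussed in this thesis; all names are illustrative) uses the standard POSIX setitimer/SIGPROF facilities on Linux to interrupt the process at regular intervals of CPU time, which is essentially the mechanism an interrupt-based statistical profiler builds upon.

/* Minimal sketch of interrupt-driven sampling (illustrative only).
 * SIGPROF fires periodically while the process consumes CPU time;
 * a real statistical profiler would read the interrupted program
 * counter from the (platform-specific) signal context and charge a
 * per-routine counter, while here we only count the samples. */
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>

static volatile unsigned long samples = 0;

static void on_sample(int sig) {
    (void)sig;
    samples++;   /* hypothetical: map the saved PC to a routine here */
}

int main(void) {
    struct sigaction sa;
    sa.sa_handler = on_sample;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    /* 10 ms sampling period, a typical choice for this kind of tool. */
    struct itimerval tv = { {0, 10000}, {0, 10000} };
    setitimer(ITIMER_PROF, &tv, NULL);

    for (volatile long i = 0; i < 200000000L; i++)   /* workload to profile */
        ;

    printf("%lu samples collected\n", samples);
    return 0;
}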
An instrumenting profiler modifies the target program with additional instruc-
tions to collect the required information. Instrumenting the program will always
have an impact on the program execution, typically introducing a slowdown and
potentially causing inaccurate results or heisenbugs. However, instrumenting a
program in a careful way can lead to a minimal impact and a very high degree
of accuracy in the results. For this reason, the profiler we implemented relies on
instrumentation rather than on sampling. The impact of instrumentation on a
particular program depends on many factors: the placement of instrumentation
points, the mechanism used to capture information, and the inner structure of
the program to be analyzed. Several instrumentation methodologies are available
nowadays: manual (i.e. performed by the programmer, by adding instructions
to extract profiling information), compiler assisted, binary translation, runtime
instrumentation, and runtime injection.
An event-based profiler is typical of languages such as Java, .NET, Python,
and Ruby. In this case the runtime provides hooks to profilers for trapping events such as method enter and leave, object creation, exceptions, etc. These profilers are usually very flexible: for instance, it is possible to vary both the set of profiled events and the sampling frequency.
1.1.4 Data type
Profilers can also be classified according to the type of information they collect;
then a further distinction can be made according to the different representations
of the gathered information. There are two main categories: profilers that output
counters for instructions/routines and profilers that produce estimations on the
fraction of execution time spent in each routine. Counters are usually presented
in a simple table, while for time measures there could be more than one value
associated with each routine (for instance, time spent in user mode and in kernel
mode).
It is important to stress the fact that counters are not necessarily repre-
sentative of or proportional to the execution time of the functions. Moreover,
the completion time may vary among different invocations of the same function.
When the goal is to estimate accurately the distribution of execution time, a
counter-based profiler is not the right choice. However, counters are very useful
in a variety of scenarios and can be used in several contexts. For instance, the ex-
act number of invocations of a function can be used to verify the correctness of an
algorithm, or to identify algorithms whose complexity is not adequate to the tasks
they have to complete. Moreover, an accurate analysis of counters can provide
a rough estimation of the execution time for parts of code and help to improve
the performance and refactor already implemented algorithms. Finally, counters
can also be used to identify subtle programming errors and which portions of the
code are executed (for instance, to check whether a new implementation of an
abstraction has completely replaced the old one).
Profilers that estimate time measures operate by approximation: measuring
the exact execution time of each instruction or routine would introduce an excessively large slowdown on global execution speed. For this reason, these profilers are
based on sampling techniques as described in the previous paragraph.
1.2 Data aggregation and representation
In this thesis we focus on counter-based profiling, although our approach can be easily extended to other metrics such as execution time or cache misses. Several data structures have been proposed over the years to maintain information about interprocedural control flow1.
The design of a data structure to store profiling information is strictly related to the aggregation level of such information. A vertex profiler stores the number of invocations of each function and distinguishes neither the caller nor the context. An edge profiler instead stores the number of invocations for each ordered pair of caller-callee routines. This terminology derives from the graphical representation
1An intraprocedural data flow analysis operates on a control-flow graph for a single method;
an interprocedural analysis operates across function boundaries.
of the control flow of a program: every function is represented as a
vertex; two vertices A and B are linked by a directed edge A → B if function
A invokes function B; a path is a sequence of vertices that can be traversed by
starting from a node and following existing edges.
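To make the distinction concrete, here is a tiny illustrative sketch (not part of the thesis tools; routine identifiers and array sizes are arbitrary) of the counters a vertex profiler and an edge profiler would maintain for a call sequence in the spirit of Figure 1.1.

/* Illustrative sketch: vertex counters (one per routine) versus edge
 * counters (one per caller-callee pair). Routines are identified by
 * small integer IDs for simplicity. */
#include <stdio.h>

#define MAX_ROUTINES 64

static unsigned long vertex_count[MAX_ROUTINES];               /* vertex profile */
static unsigned long edge_count[MAX_ROUTINES][MAX_ROUTINES];   /* edge profile   */

/* To be invoked on every routine call. */
static void record_call(int caller, int callee) {
    vertex_count[callee]++;
    edge_count[caller][callee]++;
}

int main(void) {
    enum { A, B, C, D };
    /* A call sequence in the spirit of Figure 1.1: D is reached only
     * through B, yet neither profile alone records this fact. */
    record_call(A, B); record_call(B, C); record_call(C, D);
    record_call(A, B); record_call(B, C); record_call(C, D);
    record_call(A, C);

    printf("C invoked %lu times; edge C->D taken %lu times\n",
           vertex_count[C], edge_count[C][D]);
    return 0;
}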
Ball and Larus [4] introduced an intraprocedural path profiling technique to
efficiently determine how many times each acyclic path in a routine executes,
extending the more common edge profiling. Melski and Reps [21] have proposed
interprocedural path profiling to capture both inter- and intra-procedural control
flow; however, their approach suffers from scalability issues since the number of
paths existing across procedure boundaries can be very large.
1.2.1 Calling contexts
A context-sensitive analysis is an interprocedural analysis that takes into account
the calling context when analyzing the target of a function call. A calling context
is a sequence of routine calls that are concurrently active on the run-time stack
and that lead to a program location. The utility of calling context information
was already clear in the 80s. Nowadays, collecting context information in modern object-oriented software is very challenging: since applications are often structured in a large number of small routines, the high frequency of function calls and returns might result in considerable profiling overhead, huge amounts of data to be analyzed by the programmer, and heisenbugs. For this reason, edge profilers are still the most widely used profilers today. On the other hand, both context-sensitive and path profilers provide more valuable information than edge profilers, while at the same time it is still possible to reconstruct an edge profile from this information [5].
We will now define three well-known data structures for counter-based profilers.
1.2.2 Call graph
A call graph (CG) is a succinct representation of function invocations: every
function is represented as a vertex and every directed arc shows a caller-callee
relationship between two functions. The use of call graph profiles was pioneered
in the 80s by gprof [14]. Although this representation is very space- and time-efficient, it can lead to misleading results [26, 30] and to a partial comprehension of the behaviour of the program. For instance, in Figure 1.1 the call graph misses the information that routine D is invoked by C only when C is invoked by B (the calling context A→B→C→D occurs two times, while A→C→D does not appear in the execution). Call graphs are used by edge profilers and are not suited for the representation of context-sensitive information.
Figure 1.1: Call graph, Call tree, Calling Context Tree (from left to right).
1.2.3 Call tree
In a call tree (CT) each node represents a different routine invocation: a new node
is added for each method invocation and then attached to the parent node (i.e.,
the caller method). This yields very accurate context information, but requires
extremely large (possibly unbounded) space.
1.2.4 Calling context tree
The inaccuracy of call graphs and the huge size of call trees have motivated the
introduction [1] of calling context trees (CCT). A CCT is a succinct summary of the call tree, built recursively from the root by merging, for each node, the children representing different invocations of the same method: in other words, CCTs do not distinguish between different invocations of the same routine within the same context. The edge weight of a CCT represents the number of nodes merged into
the target node, and also the calling frequency of the callee by the caller (i.e., the parent node) within the particular calling context of the caller. This representation takes advantage of the tree structure by encoding every calling context as a single node: the context can be easily reconstructed by moving along the path from the root to the target node. While preserving good accuracy, CCTs are typically several orders of magnitude smaller than call trees: the presence of recursion may heavily increase the number of nodes, but this scenario is infrequent in practice.
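As an illustration of how compact this encoding is, the sketch below shows a possible CCT node layout in C together with the two basic operations just described: merging a routine call into the current context and recovering the full context by walking towards the root. This is only a didactic sketch under simplified assumptions (single thread, first-child/next-sibling links, lookup by linear scan), not the data structure engineered in this thesis.

/* Didactic CCT sketch (not the thesis implementation). Each node
 * represents a routine within a specific calling context and stores a
 * counter plus first-child/next-sibling links. */
#include <stdio.h>
#include <stdlib.h>

typedef struct cct_node {
    void            *routine;        /* routine address                       */
    unsigned long    count;          /* how many times this context occurred  */
    struct cct_node *parent;         /* caller context                        */
    struct cct_node *first_child;
    struct cct_node *next_sibling;
} cct_node;

/* On a routine call: reuse the child of the current context associated
 * with this routine, or create it. This is the merge step that makes a
 * CCT compact with respect to a call tree. */
static cct_node *cct_enter(cct_node *curr, void *routine) {
    cct_node *child;
    for (child = curr->first_child; child != NULL; child = child->next_sibling)
        if (child->routine == routine)
            break;
    if (child == NULL) {
        child = calloc(1, sizeof(cct_node));
        child->routine = routine;
        child->parent = curr;
        child->next_sibling = curr->first_child;
        curr->first_child = child;
    }
    child->count++;
    return child;                    /* the new current context */
}

/* The calling context of a node is the path from the node to the root. */
static void cct_print_context(const cct_node *node) {
    for (; node != NULL && node->parent != NULL; node = node->parent)
        printf("%p <- ", node->routine);
    printf("root\n");
}

int main(void) {
    cct_node root = {0};
    cct_node *ctx = &root;
    ctx = cct_enter(ctx, (void *)0x1);   /* call A                   */
    ctx = cct_enter(ctx, (void *)0x2);   /* A calls B                */
    cct_print_context(ctx);              /* context: B <- A <- root  */
    ctx = ctx->parent;                   /* B returns                */
    ctx = ctx->parent;                   /* A returns                */
    return 0;
}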
However, as noticed in previous works [7, 32], even CCTs may be very large
and difficult to analyze in several applications. The exhaustive approach to the
construction of a CCT is based on the instrumentation of each routine call and
return, incurring considerable slowdown even when using efficient instrumenta-
tion mechanisms [1, 32]: these two issues will be addressed and analyzed in more
detail in Chapter 2.
1.3 Program instrumentation tools
What follows is a survey of the most widely used instrumentation tools and frameworks.
1.3.1 gprof
This profiler was introduced in 1982 [14] by a group of researchers at the University of California, Berkeley, as an extension of the older prof Unix tool. It uses a hybrid of instrumentation and sampling. Instrumentation is performed by the compiler (i.e., the -pg option of gcc) and is used to count how many times each procedure of the program is invoked; moreover, the call graph edge related to each invocation is stored as well. The time spent in each routine is estimated by statistical sampling: the program counter of the monitored software is probed at regular intervals using operating system interrupts (e.g., programmed via the profil syscall). The typical sampling period is 0.01 seconds (10 milliseconds). In 2004 the gprof paper appeared on the list of the 50 most influential PLDI papers of all time: it revolutionized the performance analysis field and became a milestone and the tool of choice for many developers.
1.3.2 Dyninst
Dyninst [9] is a multi-platform API for runtime code-patching developed as part
of the Paradyn project at the University of Wisconsin-Madison and University
of Maryland. The goal of this API is to provide a machine independent interface
to permit the creation of tools and applications that use runtime code patch-
ing. Several APIs are available, such as the InstructionAPI for decoding and
representing machine instructions in a platform-independent manner, and the
StackwalkerAPI for walking the run-time stack of a process.
1.3.3 DynamoRIO
DynamoRIO [8] is a runtime code manipulation system: it supports code trans-
formations while the target program executes. Unlike many frameworks, it allows
arbitrary modifications to application instructions (a powerful instruction manip-
ulation library for IA-32 and AMD64 is provided). DynamoRIO’s API abstracts
away the details of the underlying infrastructure and allows the tool builder to
implement efficient dynamic tools for a wide variety of uses, such as program
analysis, profiling, and optimization.
1.3.4 Valgrind
This tool [24] was originally designed to be a free memory debugging tool for
Linux, but has since evolved to become a generic framework for creating dy-
namic analysis tools for memory debugging, memory leak detection, and profil-
ing. Valgrind is essentially a virtual machine using just-in-time (JIT) compilation
techniques: nothing from the original program ever gets run directly on the host
processor. After an initial translation into an intermediate processor-neutral for-
mat, a Valgrind-based tool is free to make any transformation on the code before
Valgrind translates it back into machine code and executes it. Although a considerable amount of performance is lost in these transformations, the intermediate format is much more suitable for instrumentation than the original.
Several tools are included with Valgrind, such as Memcheck for memory usage
inspection and callgrind for the construction of the call graph.
1.3.5 Pin
Pin [18] is an instrumentation system developed by Intel and the University of
Colorado. The goal of this project is to provide easy-to-use, portable, transparent,
and efficient instrumentation. Instrumentation tools (called pintools) are written
in C/C++ using the Pin API, which is designed to be as architecture-independent as possible. Pin uses just-in-time compilation to instrument executables
while they are running; several techniques are used to optimize instrumentation,
such as inlining, register re-allocation, and instruction scheduling. According to the authors, this fully automated approach delivers significantly better instrumentation performance than similar tools such as DynamoRIO and Valgrind. Another
interesting feature of Pin is the possibility to attach to a process, instrument it,
collect profiles, and eventually detach. Pin preserves the original behaviour of
the analyzed application by providing instrumentation transparency: the appli-
cation observes the same addresses and values as it would in a native execution:
this makes the information collected by instrumentation more accurate and rel-
evant. At the highest level of its architecture, Pin consists of a virtual machine,
a code cache, and an instrumentation API; the virtual machine consists of a JIT
compiler, an emulator (for instructions that cannot be executed directly, such as
system calls), and a dispatcher. When an instrumented program is running, three
binary programs are present in memory: the application, Pin, and the pintool.
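To give a flavour of what writing a pintool looks like, here is a minimal sketch in the style of the simple counting examples distributed with Pin: an instrumentation routine decides where analysis calls are placed (here, before every call instruction), and an analysis routine executes at run time. This is only an illustration of the API structure, not the profiler developed in this thesis.

// Minimal pintool sketch (illustrative, in the style of Pin's simple
// counting examples; it is not the thesis profiler).
#include "pin.H"
#include <iostream>

static UINT64 callCount = 0;

// Analysis routine: runs every time an instrumented instruction executes.
VOID CountCall() { callCount++; }

// Instrumentation routine: invoked when Pin's JIT first translates an
// instruction; it decides whether to insert the analysis call.
VOID Instruction(INS ins, VOID *v) {
    if (INS_IsCall(ins))
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)CountCall, IARG_END);
}

// Called when the instrumented application exits.
VOID Fini(INT32 code, VOID *v) {
    std::cerr << "call instructions executed: " << callCount << std::endl;
}

int main(int argc, char *argv[]) {
    if (PIN_Init(argc, argv)) return 1;
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();   // never returns
    return 0;
}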
1.3.6 gcc
The GNU C Compiler [27] includes an experimental and somewhat obscure feature: the -finstrument-functions option. This feature was originally implemented
by Cygnus Solutions and makes the compiler execute two special user-defined
functions at the top and bottom of every function; the only information provided to the user consists of two parameters containing the address of the routine and the call site. These two functions can be used to perform a compiler-assisted
instrumentation of any C/C++ application on Linux, achieving a good degree of
performance while maintaining full control of program execution. Profiling tools
can be implemented as dynamic libraries to be linked to programs compiled with
the above-mentioned option. Clearly, the developer of a tool has to deal manually
with aspects such as program launch and termination, multithreading, etc. - the
only functionality provided by gcc is intercepting routine enter and exit events.
Although this may seem discouraging at first glance, in this thesis we show how it is possible to write an efficient gcc-based profiler that deals with these aspects using a limited number of lines of code.
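For concreteness, the following sketch shows the general shape of a gcc-assisted instrumentation library: gcc emits calls to the two hooks below in every instrumented routine, and the library can be compiled separately (without -finstrument-functions) and linked to the target program. The counting logic is purely illustrative and is not the profiler described later in this thesis.

/* Illustrative gcc-assisted instrumentation library (not the thesis
 * tool). Programs compiled with -finstrument-functions call these two
 * hooks on every routine enter and exit. */
#include <stdio.h>

static unsigned long enter_events = 0;
static unsigned long exit_events  = 0;

/* Invoked at the top of every instrumented routine. */
void __cyg_profile_func_enter(void *this_fn, void *call_site) {
    (void)this_fn; (void)call_site;
    enter_events++;        /* a real profiler would update its CCT here */
}

/* Invoked just before every instrumented routine returns. */
void __cyg_profile_func_exit(void *this_fn, void *call_site) {
    (void)this_fn; (void)call_site;
    exit_events++;
}

/* Dump the counters when the instrumented program terminates. */
__attribute__((destructor))
static void dump_counters(void) {
    fprintf(stderr, "enter events: %lu, exit events: %lu\n",
            enter_events, exit_events);
}

A library of this kind can be built, for instance, as a shared object with -shared -fPIC and linked to a program compiled with -finstrument-functions (and with debugging information, if addresses are to be resolved later with addr2line); file names and the exact build setup are of course up to the tool developer.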
1.4 Conclusions
In this chapter we have provided an overview of general performance profiling techniques, showing the importance of dynamic program analysis tools for pinpointing and optimizing the hot spots that dominate the execution time. Calling Context Trees have been presented as the main data structure for compactly representing context-sensitive profiling information gathered during the execution of a program. Several instrumentation frameworks are nowadays available to developers: in particular, Pin can be used to quickly implement analysis tools on several platforms, while the GNU C Compiler offers a compiler-assisted instrumentation feature to develop highly tuned tools with low overhead.
Chapter 2
Mining hot calling contexts
In this chapter we first discuss some state-of-the-art techniques for context-sensitive profiling. Then we present the frequent items problem in the data streaming computational model and two well-known space-efficient algorithms to solve it. Finally, we introduce our novel space-efficient approach, discussing the main choices behind its design and its relation to other techniques.
2.1 Previous approaches
In the previous chapter we highlighted two relevant issues for context-sensitive profiling: the exhaustive approach to the construction of the CCT introduces considerable slowdown even when using efficient instrumentation mechanisms, and CCTs may be very large and difficult to analyze in several applications. In this section we discuss some of the most relevant approaches that have been proposed in the literature over the past years.
Early approaches. The utility of calling context information was already clear
to the authors of gprof: this tool approximates context-sensitive profiles by associating procedure timings with caller-callee pairs rather than with single procedures. This level of context sensitivity, however, may yield severe inaccuracies [26, 30]. Since exhaustive instrumentation can lead to a large slowdown,
Bernat and Miller [6] proposed to generate path profiles including only methods
of interest in order to reduce overhead. A very popular approach is based on sam-
pling [3, 13, 16, 31]: the run-time stack is periodically sampled and the context is
reconstructed through stack-walking. For call-intensive programs, sample-driven
stack-walking can be significantly (i.e., at least one order of magnitude) faster
than exhaustive instrumentation. This approach, however, can lead to a signif-
icant loss of accuracy with respect to the exhaustive construction of the CCT:
results may be highly inconsistent in different executions, and sampling guaran-
tees neither high coverage [7] nor accuracy of performance metrics [32].
Adaptive bursting. Several works explore the combination of sampling with
bursting [2, 17, 32]. Most recently, Zhuang et al. suggest performing stack-
walking followed by a burst during which the profiler traces every routine enter
and exit event: experimental results show that adaptive bursting can yield much
more accurate results with respect to traditional sample-based stack-walking1.
While static bursting collects a highly accurate profile, it can still cause signif-
icant performance degradation: the authors propose an adaptive mechanism to
dynamically disable profile collection for previously observed calling contexts and
periodically re-enable it according to a re-enablement ratio parameter which re-
flects the trade-off between accuracy and overhead. Their profiler is based on the
Java Virtual Machine Profiler Interface (JVMPI), while experiments have been
performed with a sampling rate parameter of 10ms and a burst length of 0.2 ms.
Reducing space. A few previous works have addressed techniques to reduce
profile data (or at least the amount of data presented to the user) in context
sensitive profiling. Bernat and Miller [6] suggest letting the user choose a subset
of routines to be analyzed. Quite recently, Bond and McKinley [7] introduced
a new approach called probabilistic calling context that continuously maintains
a probabilistically unique value representing the current calling context. The
proposed representation is extremely compact (just a 32-bit value per context)
and, according to the authors, the average overhead introduced in the Java virtual
machine is around 3%. This approach is efficient and accurate enough for tasks
such as residual testing, bug detection, and intrusion detection. However, it
1According to the authors, a low-overhead solution using sampled stack-walking alone is less
than 50% accurate, based on degree of overlap with a complete CCT.
cannot be used for profiling with the purpose of understanding and improving
application performance, where it is crucial to maintain for each context the
sequence of active routine calls along with performance metrics. Moreover, Bond
and McKinley target applications where coverage of both hot and cold contexts
is necessary; this is not the case in performance analysis, where identifying a few
hot contexts is typically sufficient to guide code optimization.
2.2 Frequent items in data streams
In recent years a lot of effort has been put into the design of algorithms able to perform analyses on a near-real-time basis over massive data streams; in these streams input data often arrive at a high rate and cannot be stored due to their
possibly unbounded size [23]. This line of research has been mainly motivated
by networking and database applications, in which streams may be very long
and stream items may also be drawn from a very large universe. Streaming
algorithms are designed to address these requirements, providing approximate
answers in contexts where it is impossible to obtain an exact solution using only
limited space.
These techniques can be adapted in order to analyze profiling information
gathered during a program’s execution. In fact, the stream of routine enter and
exit events collected by a profiler represents a good case of massive data stream:
items belong to a possibly large universe (consider an application composed of
many small routines frequently invoked from different call sites) and the length of
the stream itself is very high. Moreover, in order to keep the introduced slowdown
under control, timing requirements to process each item are strict.
The Frequent Items (a.k.a. heavy hitters) problem [22] has been extensively
studied in data streaming computational models. Given a frequency threshold φ
in [0,1] and a data stream of length N, the problem (in its simplest formulation) is to find all items that appear more than ⌊φN⌋ times, i.e., having frequency greater than ⌊φN⌋. For instance, for φ = 0.01 the problem is to find all items that appear at least 1% of the time. It can be proved that any algorithm able to provide an exact solution requires Ω(N) space [23]. Therefore, research has focused on solving an approximate version of the problem [20, 22]:
Definition 1 (ε-Deficient Frequent Items problem). Given two parameters φ, ε ∈ [0, 1], with ε < φ, return all items with frequency ≥ ⌊φN⌋ and no item with frequency ≤ ⌊(φ − ε)N⌋. The absolute difference between estimated and true frequency is at most εN.
In the approximate solution, false negatives cannot exist: all frequent items are correctly output. Instead, some false positives are allowed; however, their real frequency is guaranteed to be at most εN away from the threshold ⌊φN⌋. Many different algorithms for computing (φ, ε)-heavy hitters have been proposed in the literature in the last ten years. According to extensive experimental studies [19], counter-based algorithms offer better performance and ease of implementation than sketch-based algorithms. Counter-based algorithms track a subset of items
from the input and monitor counts associated with them; sketch-based algorithms
act as general data stream summaries, and can be used for other types of approx-
imate statistical analysis of the data stream, apart from being used to find the
frequent items.
In a preliminary analysis we considered three counter-based algorithms: Space Saving, Sticky Sampling, and Lossy Counting. Space Saving [22] and Lossy Counting [20] are deterministic and use 1/ε and (1/ε)·log(εN) entries, respectively. Sticky Sampling [20] is probabilistic: it fails to produce the correct answer with a minuscule probability, say δ, and uses at most (2/ε)·log(φ⁻¹δ⁻¹) entries in its data structure; in our experiments, Sticky Sampling consistently proved to be less efficient and accurate than its competitors, so we will not mention it any further.
2.2.1 Space Saving
Space Saving [22] monitors a set of M = ⌈1/ε⌉ triples of the form (item, count_i, ε_i), initialized respectively with the first ⌈1/ε⌉ distinct items, their exact counts, and zero. After the initialization phase, when an item q is observed in the stream the update operation works as follows:
• if q is monitored, the corresponding counter is incremented;
• if q is not monitored, the (item, count_i, ε_i) triple with the smallest count is chosen as a victim: its item is replaced with q and its count is incremented. Moreover, ε_i is set to the count value of the victim triple, as it measures the maximum possible estimation error.
The update time is bounded by the dictionary operation of checking whether an item is monitored. Heavy hitters queries are answered by returning the monitored entries such that count > ⌈φN⌉. Space Saving has the following properties [22]:
1. The minimum counter value min is no greater than ⌊εN⌋;
2. For any monitored element e_i, 0 ≤ ε_i ≤ min, that is, f_i ≤ (f_i + ε_i) = count_i ≤ f_i + min. In other words, count_i is overestimated by at most min with respect to the true frequency f_i;
3. Any element whose true frequency is greater than min is monitored;
4. The algorithm uses min(|A|, ⌈1/ε⌉) counters, where |A| is the cardinality of the universe from which the items are drawn. This can be proved to be optimal;
5. Space Saving solves the ε-Deficient Frequent Items problem by reporting all the elements with frequency greater than ⌈φN⌉, such that no reported element has true frequency less than ⌈(φ − ε)N⌉;
6. Items whose true frequency is more than ⌈(φ − ε)N⌉ but less than ⌈φN⌉ could be output: they are the so-called false positives.
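The update rule and the error bookkeeping described above translate almost directly into code. The following sketch is illustrative only: a practical implementation (such as the one developed in this thesis) would use a dictionary to test whether an item is monitored and would locate the minimum counter without a linear scan. Items here are positive integer identifiers, and all names are hypothetical.

/* Array-based Space Saving sketch (illustrative). Items are positive
 * integers; empty slots have item 0 and count 0, so the first M distinct
 * items simply take over empty slots, as the algorithm requires. */
#include <stdio.h>

#define M 1024                      /* number of monitored entries, about 1/epsilon */

typedef struct {
    long          item;             /* monitored item, 0 if the slot is unused */
    unsigned long count;            /* estimated frequency                     */
    unsigned long err;              /* epsilon_i: maximum overestimation       */
} ss_entry;

static ss_entry ss_table[M];        /* zero-initialized */

static void space_saving_update(long q) {
    int i, min_idx = 0;
    for (i = 0; i < M; i++) {
        if (ss_table[i].item == q) {        /* q is already monitored */
            ss_table[i].count++;
            return;
        }
        if (ss_table[i].count < ss_table[min_idx].count)
            min_idx = i;                    /* remember the minimum counter */
    }
    /* q is not monitored: evict the entry with the minimum counter; q
     * inherits its counter (incremented) and records it as its error. */
    ss_table[min_idx].err  = ss_table[min_idx].count;
    ss_table[min_idx].item = q;
    ss_table[min_idx].count++;
}

/* Heavy hitters query: print entries whose count exceeds the threshold
 * (conceptually, the ceiling of phi*N). */
static void space_saving_query(unsigned long threshold) {
    for (int i = 0; i < M; i++)
        if (ss_table[i].count > threshold)
            printf("item %ld: estimated count %lu (error at most %lu)\n",
                   ss_table[i].item, ss_table[i].count, ss_table[i].err);
}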
2.2.2 Lossy Counting
Lossy Counting [20] maintains a set M of (item, count, ∆) triples, where count represents the exact frequency of the item since it was last inserted in M and ∆ is the maximum possible underestimation of count: the algorithm guarantees that at any time the true frequency of a monitored item is ≤ count + ∆. While Space Saving overestimates counters, Lossy Counting takes the opposite approach.
The incoming data stream is conceptually divided into bursts of width ⌈1/ε⌉. During the i-th burst, if the observed item already exists in M, the corresponding count is incremented; otherwise a new entry is inserted with count = 1 and ∆ = i − 1. At the end of the burst, M is pruned by deleting the triples such that
(count + ∆) ≤ i. Heavy hitters queries are answered by returning the entries in M such that count ≥ ⌊(φ − ε)N⌋. Lossy Counting has the following properties [20]:
1. Lossy Counting solves the ε-Deficient Frequent Items problem, by reporting all the elements with frequency greater than φN, such that no reported element has true frequency less than (φ − ε)N;
2. Estimated frequencies are less than the true frequencies by at most εN, i.e., count ≤ f_e ≤ count + εN;
3. If an element e does not appear in M, then f_e ≤ εN;
4. Lossy Counting computes an ε-deficient synopsis using at most (1/ε)·log(εN) entries. However, if we assume that in real-world datasets the elements with very low frequency (at most εN/2) tend to occur more or less uniformly at random, then in this scenario Lossy Counting requires no more than 7/ε entries in the worst case.
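Again, the burst-based update and the pruning step described above can be sketched in a few lines. The fragment below is illustrative only: the monitored triples are kept in a fixed-size array rather than in a dynamic dictionary, items are positive integer identifiers, and all names are hypothetical; it is meant to be driven by the same stream of events as the previous sketch.

/* Lossy Counting sketch (illustrative). The set M of monitored triples
 * is a fixed-size array for brevity; empty slots have item 0. */
#define LC_CAPACITY  4096          /* upper bound on monitored triples        */
#define BURST_WIDTH  1000          /* ceil(1/epsilon), e.g. epsilon = 0.001   */

typedef struct {
    long          item;            /* monitored item, 0 if the slot is unused */
    unsigned long count;           /* occurrences since insertion             */
    unsigned long delta;           /* maximum possible undercount             */
} lc_entry;

static lc_entry      lc_set[LC_CAPACITY];
static unsigned long lc_stream_len = 0;   /* N: items seen so far    */
static unsigned long lc_burst      = 1;   /* current burst index i   */

static void lossy_counting_update(long q) {
    int i, slot = -1, free_slot = -1;
    lc_stream_len++;

    for (i = 0; i < LC_CAPACITY; i++) {
        if (lc_set[i].item == q) { slot = i; break; }
        if (lc_set[i].item == 0 && free_slot < 0) free_slot = i;
    }
    if (slot >= 0) {
        lc_set[slot].count++;                    /* q already monitored */
    } else if (free_slot >= 0) {
        lc_set[free_slot].item  = q;             /* insert with count = 1 */
        lc_set[free_slot].count = 1;
        lc_set[free_slot].delta = lc_burst - 1;  /* delta = i - 1         */
    }

    /* At the end of each burst, prune entries with count + delta <= i;
     * heavy hitters are then the entries with count >= (phi - epsilon)*N. */
    if (lc_stream_len % BURST_WIDTH == 0) {
        for (i = 0; i < LC_CAPACITY; i++)
            if (lc_set[i].item != 0 &&
                lc_set[i].count + lc_set[i].delta <= lc_burst) {
                lc_set[i].item  = 0;
                lc_set[i].count = 0;
                lc_set[i].delta = 0;
            }
        lc_burst++;
    }
}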
2.3 HCCT: Hot Calling Context Tree
As suggested in Section 2.2, the execution trace of routine invocations and terminations can be naturally regarded as a stream of items. Each item is a triple of the form (routine address, call site, event type). Table 2.1 reports essential information about some well-known Linux applications. It is easy to notice that the number of nodes of the call graph (i.e., the number of distinct routines) is small compared to the stream length (i.e., to the number of nodes of the call tree), even for large-scale applications such as gimp, inkscape, and those belonging to the OpenOffice suite. Hence, a non-contextual profiler can maintain a hash table of size proportional to the number of distinct routines, using routine names as hash keys to efficiently update the corresponding metrics.
In the case of contextual profiling, the number of distinct calling contexts (i.e., the number of nodes of the CCT) is too large: hashing would be inefficient, and for some applications it would even be impossible to maintain the tree in main memory. Consider, for instance, oocalc: the optimistic assumption that
The meaning of the environment variables is straightforward. DUMPATH specifies
the path for storing the output file(s); values for EPSILON and PHI are defined as
for the trace analyzer; SINTVL and BLENGTH specify (in nanoseconds) sampling
interval and burst length, respectively.
For each thread a separate output file is produced, containing the tree con-
structed by the profiling tool. File names are chosen according to the syntax
<benchmark>-<PID/TID>.tree, and the tree is dumped with an in-order visit
to ease its reconstruction. A simple tool called analysis is provided under the
folder tools and computes simple statistics about the tree; this tool is a starting
point for the implementation of more complex analysis methods.
Routine addresses are stored in hexadecimal format in order to be resolved
using addr2line. Given an address and an executable, this tool uses the debug-
ging information in the executable to reconstruct which source file name and line
number are associated with the address. As an example, we executed the testing
program for the CCT construction and extracted a random routine address from
5 Debugging information is needed to resolve addresses using the addr2line command.
6 The profiling library will invoke the Pthread library after an initialization phase.
the output files; as the reader can see, information reported by addr2line is very