APPROVED: Song Fu, Major Professor Yan Huang, Committee Member Krishna Kavi, Committee Member Xiaohui Yuan, Committee Member Barrett Bryant, Chair of the Department of
Computer Science and Engineering Costas Tsatsoulis, Dean of the College of
Engineering Mark Wardell, Dean of the Toulouse Graduate
School
AUTONOMIC FAILURE IDENTIFICATION AND DIAGNOSIS FOR BUILDING
DEPENDABLE CLOUD COMPUTING SYSTEMS
Qiang Guan
Dissertation Prepared for the Degree of
DOCTOR OF PHILOSOPHY
UNIVERSITY OF NORTH TEXAS
May 2014
Guan, Qiang. Autonomic Failure Identification and Diagnosis for Building Dependable
Cloud Computing Systems. Doctor of Philosophy (Computer Science), May 2014, 121 pp., 9
tables, 53 figures, bibliography, 112 titles.
The increasingly popular cloud-computing paradigm provides on-demand access to
computing and storage with the appearance of unlimited resources. Users are given access to a
variety of data and software utilities to manage their work. Users rent virtual resources and pay
for only what they use. In spite of the many benefits that cloud computing promises, the lack of
dependability in shared virtualized infrastructures is a major obstacle for its wider adoption,
especially for mission-critical applications.
Virtualization and multi-tenancy increase system complexity and dynamicity. They
introduce new sources of failure degrading the dependability of cloud computing systems. To
assure cloud dependability, in my dissertation research, I develop autonomic failure
identification and diagnosis techniques that are crucial for understanding emergent, cloud-wide
phenomena and self-managing resource burdens for cloud availability and productivity
enhancement. We study the runtime cloud performance data collected from a cloud test-bed as
well as traces from production cloud systems, and we define cloud signatures consisting of the
metrics that are most relevant to failure instances.
We exploit profiled cloud performance data in both the time and frequency domains to
identify anomalous cloud behaviors and leverage cloud metric subspace analysis to automate the
diagnosis of observed failures. We implement a prototype of the anomaly identification system
and conduct experiments on an on-campus cloud computing test-bed and on the Google
datacenter traces. Our experimental results show that the proposed anomaly detection
mechanism achieves 93% detection sensitivity while keeping the false positive rate as low as
6.1%, and that it outperforms the other tested anomaly detection schemes. In addition, the anomaly
detector adapts itself by recursively learning from newly verified detection results to refine future
detection.
Copyright 2014
by
Qiang Guan
ACKNOWLEDGMENTS
This dissertation would have been impossible without the continuous support and supervision of
many people, and I would like to thank them here. I would first like to thank my advisor,
Dr. Song Fu, for his guidance, support, and supervision over the past four years. I am
proud that I will be his first Ph.D. graduate; that is a great honor. I also want to thank Dr. Yan
Huang, Dr. Krishna Kavi, and Dr. Xiaohui Yuan for their comments and suggestions on this work.
I would like to thank Dr. Nathan Debardeleben, Dr. Mike Lang, and Mr. Sean Blanchard from the
Ultrascale System Research Center, New Mexico Consortium, Los Alamos National Laboratory,
for their mentoring and advising. I would also like to thank the department chair, Dr. Barrett Bryant,
the graduate advisor, Dr. Bill Buckles, and Dr. Armin R. Mikler for their guidance and generous help
with my academic career. I am thankful to my friends, Dongyu Ang, K.J. Buckles, Guangchun Cheng,
Chi-Chen Qiu, Song Huang, Tommy Janjusic, Zhi Liu, Husanbir Pannu, Devender Singh, Yanan Tao,
Dr. Shijun Tang, Yiwen Wan, Ziming Zhang, Chengyang Zhang, Shunli Zhao, all the team-mates
of Highland Guerilla, and friends in Highland Baptist Church for their friendship and support.
I would like to thank my parents for their support during the whole journey. I want to give
special thanks to my wife, Dr. Xiaoyi Fang, for her love, patience, understanding, and support through
these days and nights.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION AND MOTIVATION
1.1. Introduction
1.2. Terms and Definitions
1.3. Motivation and Research Tasks
1.3.1. Characterizing System Dependability in Cloud Computing Infrastructures
1.3.2. Metric Dimensionality Reduction for Cloud Anomaly Identification
1.3.3. Soft Errors (SE) and Silent Data Corruption (SDC)
1.4. Contributions
1.4.1. Cloud Dependability Characterization and Analysis
1.4.2. Metric Selection and Extraction for Characterizing Cloud Health
1.4.3. Exploring Time and Frequency Domains of Cloud Performance Data for Accurate Anomaly Detection
1.4.4. Most Relevant Principal Components based Anomaly Identification and Diagnosis
1.4.5. SEFI: A Soft Error Fault Injection Tool for Profiling the Application
a multi-class classification model is employed with each failure type being assigned a class value.
Let xt denote a record of values for the n collected performance metrics m1,m2, . . . ,mn at time t.
The pattern classification problem is to learn a classifier function C that has C(xt) = ft.
I use a directed acyclic graph (DAG) to represent the classification function C. Each node
in the DAG is for a cloud performance metric. An arc between two nodes represents a probability
correlation. Let x = (x1, x2, . . . , xn) be a cloud performance data point described by the n perfor-
mance metrics m1,m2, . . . ,mn, respectively. Using the DAG, we can compute the probability of
data point x by
(1) P(x) = \prod_{i=1}^{n} P(x_i \mid m_j),
where metric mj is the immediate predecessor of metric mi in the DAG. To find the essential
metrics that can characterize the correlation between cloud performance and failure events, we
compute the conditional probability of every metric on failure occurrences, i.e., P (mk|failure),
and select those metrics whose conditional probabilities are greater than a threshold τ . The selected
metrics constitute the cloud fingerprint.
A DAG is automatically built from a set of cloud performance data records, R = {x_1, x_2, ..., x_l}.
For a cloud performance metric m_i, let metric m_p denote a parent of m_i. The probability P(m_i =
m_{ij} | m_p = m_{pk}) is computed and denoted by w_{ijpk}. The DAG building mechanism searches for
the w_{ijpk} values that best model the cloud performance data. In essence, it tries to maximize the
probability
(2) P_w(R) = \prod_{r=1}^{l} P(x_r).
This is done by an iterative process. w_{ijpk} is initialized to random probability values for any i,
j, p, and k. In each iteration, for each cloud performance data record x_r in R, our mechanism
computes

(3) \frac{\partial \ln P_w(R)}{\partial w_{ijpk}} = \sum_{r=1}^{l} \frac{P(m_i = m_{ij}, m_p = m_{pk} \mid x_r)}{w_{ijpk}}.
TABLE 3.1. Description of the injected faults.
Type of Injected Faults Symptom
CPU Fault Infinite loop
Memory Fault Keep allocating the memory space
I/O Fault Keep copying files to the disk
Network Fault Keep sending and receiving packets
Then, the values of w_{ijpk} are updated by

(4) w_{ijpk} = w_{ijpk} + \alpha \frac{\partial \ln P_w(R)}{\partial w_{ijpk}},

where \alpha is a learning rate and \partial \ln P_w(R) / \partial w_{ijpk} is computed from Equation (3). The value of \alpha is
set to a small constant for quick convergence. Before the next iteration starts, the values of w_{ijpk}
are normalized to be between 0 and 1.
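To make the iterative update concrete, the following is a minimal Python sketch of learning the w_{ijpk} weights by gradient ascent over fully observed, discretized performance records. The single-parent structure, the per-(i, p, k) normalization, and all variable names are simplifying assumptions for illustration rather than the dissertation's actual implementation.

import random
from collections import defaultdict

def learn_dag_weights(records, parent, alpha=0.05, iterations=50):
    """Gradient-ascent sketch of Equations (2)-(4): learn w[(i, j, p, k)],
    an estimate of P(m_i = j | parent m_p = k), from discretized records.
    `records` is a list of dicts {metric_index: discretized_value};
    `parent` maps each metric index to a single parent index (assumed)."""
    w = defaultdict(random.random)                     # random initialization
    for _ in range(iterations):
        grad = defaultdict(float)
        for x in records:                              # Equation (3)
            for i, p in parent.items():
                key = (i, x[i], p, x[p])               # the observed configuration
                grad[key] += 1.0 / w[key]              # P(m_i=j, m_p=k | x_r) = 1 here
        for key, g in grad.items():                    # Equation (4)
            w[key] += alpha * g
        totals = defaultdict(float)                    # normalize each (i, p, k) group
        for (i, j, p, k), v in w.items():
            totals[(i, p, k)] += v
        for (i, j, p, k) in list(w):
            w[(i, j, p, k)] /= totals[(i, p, k)]
    return w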
3.4. Cloud Computing Testbed and Performance Profiling
The cloud computing system under test consists of 16 servers. The cloud servers are
equipped with 4 to 8 Intel Xeon or AMD Opteron cores and 2.5 to 16 GB of RAM. I have in-
stalled Xen 3.1.2 hypervisors on the cloud servers. The operating system on a virtual machine
is Linux 2.6.18 as distributed with Xen 3.1.2. Each cloud server hosts up to ten VMs. A VM is
assigned up to two VCPUs, among which the number of active ones depends on the applications. The
amount of memory allocated to a VM is set to 512 MB. I run the RUBiS [14] distributed online
service benchmark and MapReduce [24] jobs as cloud applications on VMs. The applications are
submitted to the cloud testbed through a web based interface. I have developed a fault injection
tool, which is able to inject four major types and 12 sub-types of faults into cloud servers with
adjustable levels of intensity. They mimic faults of the CPU, memory, disk, and network. All four
major types of injected faults are described in Table 3.1.
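For illustration only, the sketch below mimics the four fault symptoms of Table 3.1 with simple Python stressor routines. The durations, sizes, file path, and peer address are hypothetical parameters; the actual tool injects 12 sub-types of faults with tunable intensity.

import os, socket, time

def cpu_fault(seconds=60):
    """CPU fault: spin in a busy loop to consume CPU cycles."""
    end = time.time() + seconds
    while time.time() < end:
        pass

def memory_fault(chunk_mb=64, limit_mb=2048):
    """Memory fault: keep allocating memory until the limit is reached."""
    hog = []
    while len(hog) * chunk_mb < limit_mb:
        hog.append(bytearray(chunk_mb * 1024 * 1024))

def io_fault(path="/tmp/fault_io.bin", copies=100, size_mb=64):
    """I/O fault: keep writing large files to the disk."""
    block = os.urandom(1024 * 1024)
    for i in range(copies):
        with open(f"{path}.{i}", "wb") as f:
            for _ in range(size_mb):
                f.write(block)

def network_fault(peer=("10.0.0.2", 9000), seconds=60):
    """Network fault: keep sending packets to a peer to saturate the link."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"x" * 1400
    end = time.time() + seconds
    while time.time() < end:
        s.sendto(payload, peer)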
I exploit third-party monitoring tools, sysstat [94] to collect runtime performance data in
the hypervisor and virtual machines, and a modified perf [75] to obtain the values of performance
counters from the Xen hypervisor on each server in the cloud testbed. In total, 518 metrics are
FIGURE 3.2. A sampling of cloud performance metrics that are often correlated
with failure occurrences in our experiments. In total, 518 performance metrics are
profiled with 182 metrics for the hypervisor, 182 metrics for virtual machines, and
154 metrics for hardware performance counters (four cores on most of the cloud
servers).
profiled, i.e., 182 for the hypervisor and 182 for virtual machines by sysstat and 154 for perfor-
mance counters by perf, every minute. They cover the statistics of every component of cloud
servers, including the CPU usage, process creation, task switching activity, memory and swap
space utilization, paging and page faults, interrupts, network activity, I/O and data transfer, power
management, and more. Table 3.2 lists and describes a sampling of the performance metrics that
are often correlated with failure occurrences in our experiments. I tested the system from May 22,
2011 to February 18, 2012. In total, about 813.6 GB of performance data were collected and recorded
from the cloud computing testbed in that period of time.
To tackle the big data problem and analyze the cloud dependability efficiently, our cloud
dependability analysis (CDA) system removes those performance metrics that are least relevant to
failure occurrences. First, CDA searches for the metrics that display zero variance. Among all
of the 518 metrics, 112 of them have constant values, which provide no contribution to cloud
dependability analysis. After removing them, 406 non-constant metrics are kept. Then, CDA cal-
culates the correlation between the remaining metrics and the "failure" label (0/1 for normal/failure
classification and multiple classes for different types of failures). CDA removes those metrics whose
correlations with failure occurrences are less than a threshold τ_corr.
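A minimal sketch of this two-step filtering, assuming (purely for illustration) that the profiled data sit in a pandas DataFrame with one column per metric plus a numeric "failure" label column:

import pandas as pd

def filter_metrics(df, label_col="failure", tau_corr=0.1):
    """Drop zero-variance metrics, then drop metrics whose absolute
    correlation with the failure label is below the threshold tau_corr."""
    metrics = df.drop(columns=[label_col])
    nonconstant = metrics.loc[:, metrics.std() > 0]        # remove constant metrics
    corr = nonconstant.corrwith(df[label_col]).abs()       # correlation with failures
    selected = corr[corr >= tau_corr].index
    return df[list(selected) + [label_col]]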
3.5. Impact of Virtualization on Cloud Dependability
This work aims to find out and model the impact of virtualization on system dependability
in cloud computing infrastructures. To this end, our cloud dependability analysis (CDA) system
compares the correlation of various performance metrics with failure occurrences in virtualization
and traditional non-virtualization environments. CDA exploits the DAGs described in Section 3.3
for the analysis and comparison.
To build a failure-metric DAG using a training set from the collected cloud performance
data, CDA sets the root node as “failure” for all types of failure events or a specific type of failures
for finer-grain analysis. Each node, except for the root node, is allowed to have multiple parents.
The maximal number of parents can be configured. For example, in our experiments it is set to
two, which means each metric node in the DAG can have only one more parent in addition to the
root node. Moreover, a continuous metric is discretized to a certain number of bins based on the
nature of the metric.
In this section, I focus on the failures caused by CPU (Section 3.5.1), memory (Section 3.5.2),
disk (Section 3.5.3), network (Section 3.5.4), and all (Section 3.5.5) faults, and model the impact
FIGURE 3.3. Failure-metric DAG for CPU-related failures in the cloud testbed.
FIGURE 3.4. Failure-metric DAG for CPU-related failures in the non-virtualized system.
of virtualization on cloud dependability. I present the DAGs for virtualized and non-virtualized
systems and compare the results. Due to the space limitation, only the top three levels of each
DAG are plotted.
3.5.1. Analysis of CPU-Related Failures
To characterize the cloud dependability under CPU failures, the Coordinators in the CDA
system control the fault injection agents to inject CPU related faults, including randomly changing
one or multiple bits of the outputs of arithmetic or logic operations, continuously using up all CPU
cycles, and more. These faults are injected to one, some, or all of the processor core(s) on a cloud
server. The health monitoring sensors collect the runtime performance data on each cloud server,
pre-process the data, and report them to the Coordinators, which build the failure-metric DAGs
and analyze the system health status of a management domain or the entire cloud.
Figure 3.3 depicts the DAG for CPU related failures in the cloud computing testbed with
virtualization support. For comparison, I also conduct experiments on a traditional distributed
system without virtualization. Figure 3.4 presents the corresponding DAG.
From Figure 3.4, I can see that 13 metrics display strong correlation with the occurrences
FIGURE 3.5. Failure-metric DAG for memory-related failures in the cloud testbed.
FIGURE 3.6. Failure-metric DAG for memory-related failures in the non-virtualized system.
of CPU related failures in the non-virtualized system. Among them, four (i.e., %usr all, %nice all,
%sys all, and %iowait all) are metrics for all processor cores, while the others (i.e., %usr n,
%nice n, %sys n, %iowait n, and %soft n) are for individual cores.
In the cloud computing environment (Figure 3.3), 12 metrics are highly correlated with
the failures. Metric %usr all from the privileged domain, Dom0, is the direct child of the root
node, showing the highest correlation. Among the 12 metrics, 11 are metrics collected from Dom0
(Metrics from user virtual machines, DomU, are located at lower levels of the DAG.) They are %usr,
%sys, and %iowait of all or individual processor cores. %steal all is a new metric that is cor-
related with failure occurrences, compared with Figure 3.4. In addition, a performance counter
metric, DTLB-load-miss, also has a strong dependency with CPU related failures, while other per-
formance counters have higher correlation with performance metrics of either the hypervisor or
virtual machines.
3.5.2. Analysis of Memory-Related Failures
To characterize the cloud dependability under memory related failures, memory faults are
injected by the fault injection agents to cloud servers. This type of faults includes flipping one or
multiple bits of memory to the opposite state, using up all available memory space, and more. The
Coordinators collect the runtime performance data from the health monitoring sensors on cloud
servers, and generate failure-metric DAGs for cloud dependability analysis.
FIGURE 3.7. Failure-Metric DAG for disk-related failures in the cloud testbed.
FIGURE 3.8. Failure-metric DAG for disk-related failures in the non-virtualized system.
Figure 3.5 shows the DAG for memory related failures in the cloud computing testbed. The
result from the non-virtualized system is presented in Figure 3.6.
In Figure 3.6, six metrics display strong correlation with the occurrences of memory related
failures in the non-virtualized system. They are %usr all and %sys all for all processor cores,
%usr n and %iowait n of some individual cores, and %memused, which indicates the memory
utilization.
In the cloud computing environment (Figure 3.5), seven metrics are highly correlated with
the failures. All of the seven metrics come from Dom0. Compared with Figure 3.6, the metric
%usr all is the direct child of the root node in both cases. However, in the cloud computing envi-
ronment, the metric %memused is not a significant identifier of memory related failures. Instead,
%soft n becomes more closely correlated with the occurrences of memory failures.
3.5.3. Analysis of Disk-Related Failures
Disks are also prone to faults [87, 104]. In our experiments, the fault injection agents in-
ject disk faults by blocking certain disk I/O operations or running background micro-benchmark
programs that continuously copy large files to disks to saturate the disk I/O bandwidth. Again, the
FIGURE 3.9. Failure-metric DAG for network-related failures in the cloud testbed.
FIGURE 3.10. Failure-metric DAG for network-related failures in the non-
virtualized system.
Coordinators collect the cloud-wide performance data and analyze the cloud dependability.
Figure 3.7 presents the DAG for failures caused by disk faults in the cloud computing
testbed. Figure 3.8 shows the result from the non-virtualized system. From the two figures, I
observe that more metrics are correlated with the failure occurrences.
In the non-virtualized system (Figure 3.8), 15 metrics highly correlate with the occurrences
of disk related failures. In addition to other CPU metrics, %iowait n and %nice n are directly
affected by disk I/O operations. It is interesting to notice that metrics such as rd sec/s and wr sec/s
are not included in the top correlated metrics. This is because these metrics have a more direct
influence on the values of processor related metrics.
In the cloud computing environment (Figure 3.7), 12 metrics are the top ones that are
correlated with the failures. Among them, the metric %sys all from Dom0 is the direct child
of the root node, which is different from the non-virtualized case. Compared with Figure 3.8,
virtualization has more significant impact on the metrics including %steal n and pgpgout/s for
disk related failures.
3.5.4. Analysis of Network-Related Failures
Networking hardware/software in cloud servers and switches and routers in the core net-
work may fail at runtime [111]. To generate network related failures, the fault injection agents
inject network faults by dropping certain incoming/outgoing network packets, flipping one or
multiple bits of packets to the opposite state, or attempting to use up the network bandwidth by
continuously transferring large files through the network. After the performance data are collected
from cloud servers, failure-metric DAGs are generated to analyze the cloud dependability under
network related failures. Figures 3.9 and 3.10 show the DAGs for the cloud computing testbed and
the non-virtualized system, respectively.
From Figure 3.10, I observe the occurrences of network failures are strongly correlated
with 12 metrics in the non-virtualized environment. Two metrics, %iowait n and %usr all, are the
direct children of the root node. In contrast, 16 metrics are included within the top three levels of
the DAG in Figure 3.9. For the cloud computing testbed, one metric, %usr all, is the direct child of
the root node. Three new metrics profiled from Dom0, fault/s, tcp-tw, and await dev8, are highly
correlated with the occurrences of network failures. They are closely related to the networking opera-
tions, including the number of packets, the number of TCP sockets, and the average processing
time by networking devices. Moreover, two metrics from user virtual machines, DomU, are among
the most significant ones. They are U %usr all and U %steal n, accounting for the time to process
a large number of network packets and to switch between virtual processors.
3.5.5. Analysis of All Types of Failures
In addition to studying individual types of failures, I analyze the cloud dependability under
any type of failures. The goal is to identify a set of metrics that can characterize all types of failures
and to understand the impact of virtualization on the metric selection.
To generate the failure-metric DAGs for this purpose, the Coordinators mix the cloud per-
formance data records together. The label of each record takes one of the two values: 0 or 1
denoting a “normal” or “failure” state. Figures 3.11 and 3.12 depict the DAGs for the cloud com-
puting testbed and the non-virtualized system, respectively. The root nodes represent the generic
failures.
TABLE 3.2. The metrics that are highly correlated with failure occurrences in the
cloud testbed using four-level failure-metric DAGs.
Failure type No. of correlated metrics No. of metrics from Dom0 No. of metrics from DomU
CPU-related failures 45 44 1
Memory-related failures 29 26 2
Disk-related failures 34 25 9
Network-related failure 32 31 1
All failures 25 24 1
FIGURE 3.11. Failure-metric DAG for all types of failures in the cloud testbed.
By comparing these two figures, I can find out the influence of virtualization on the system
dependability. In both cases, processor related metrics are the dominant ones.¹ Certain metrics in
these two DAGs also appear in the DAGs for individual types of failures. For the non-virtualized
case (Figure 3.12), a metric related with memory and disk operations, %vmeff, has a strong depen-
dency with the generic failures. In contrast, a hardware performance counter metric, DTLB-stores,
is highly correlated with failure occurrences in the cloud computing environment as shown in
Figure 3.11. Moreover, in Figure 3.11 and also preceding DAGs for the cloud computing environ-
ment, most of the correlated metrics are associated with Dom0. If more levels of the DAGs are
considered, more metrics from user virtual machines, DomU, correlate with failure occurrences.
However, there is little work on understanding the dependability of cloud computing environments.
As virtualization has been an enabling technology for cloud computing, it is imperative
to investigate the impact of virtualization on the cloud dependability, which is the focus of this
work.

¹Only the first three levels of the DAGs are depicted due to the limited space. When more levels are considered, metrics for other system components are incorporated for dependability analysis.
FIGURE 3.12. Failure-metric DAG for all types of failures in the non-virtualized system.

3.6. Summary
Large-scale and complex cloud computing systems are susceptible to software and hard-
ware failures, which significantly affect the cloud performance and management. It is imperative
to understand the failure behavior in cloud computing infrastructures. In this work, I study the
impact of virtualization, which has become an enabling technology for cloud computing, on the
cloud dependability. I present a cloud dependability analysis (CDA) framework with mechanisms
to characterize failure behavior in virtualized environments. I exploit failure-metric DAGs to an-
alyze the correlation of various cloud performance metrics with failure events in virtualized and
non-virtualized systems. We study multiple types of failures, including CPU-, memory-, disk-,
and network-related failures. By comparing the generated DAGs in the two environments, I gain
insight into the effects of virtualization on the cloud dependability.
CHAPTER 4
A METRIC SELECTION AND EXTRACTION FRAMEWORK FOR DESCRIBING CLOUD
PERFORMANCE ANOMALIES
4.1. Introduction
To characterize cloud behavior, identify anomalous states, and pinpoint the causes of fail-
ures, I need the runtime performance data collected from utility clouds. However, continuous
monitoring and large system scale lead to the overwhelming volume of data collected by health
monitoring tools. The size of system logs from large-scale production systems can easily reach
hundreds and even thousands of terabytes [70, 87]. In addition to the data size, the large number
of metrics that are measured makes the data model extremely complex. Moreover, the existence
of interacting metrics and external environmental factors introduces measurement noise in the col-
lected data. For the collected health-related data, there might be a maximum number of metrics
above which the performance of anomaly detection will degrade rather than improve. High metric
dimensionality will cause low detection accuracy and high computational complexity. However,
there is a lack of systematic approaches to effectively identifying and selecting principal metrics
for anomaly detection.
In this chapter, I present a metric selection framework for online anomaly detection in the
cloud. Among the large number of metrics profiled, I aim at selecting the most essential ones
by applying metric selection and extraction methods. Mutual information is exploited to quantify
the relevance and redundancy among metrics. An incremental search algorithm is proposed to
select metrics by enforcing maximal relevance and minimal redundancy. We apply metric space
combination and separation to extract essential metrics and further reduce the metric dimension.
The remainder of this chapter is organized as follows. Section 4.2 presents the proposed
metric selection framework with three mechanisms. Experimental evaluation and discussion are
described in Section 4.3. Section 4.4 presents the summary.
4.2. Cloud Metric Space Reduction Algorithms
To make anomaly detection tractable and yield high accuracy, we apply dimensionality re-
duction which transforms the collected health-related performance data to a new metric space with
only the most important metrics preserved [38]. I propose two approaches to reducing dimension-
ality: metric selection using mutual information and metric extraction by metric space combination
and separation. Metric selection refers to methods that select the best subset of the original metric set.
The term metric extraction refers to methods that create new metrics based on transformations or
combinations of the original metric set. The data presented in a low-dimensional subspace are
easier to classify into distinct groups, which facilitates anomaly detection.
4.2.1. Metric Selection
The metric selection process can be formalized as follows. Given the input health-related
performance data D including L records of N metrics M = {m_i, i = 1, ..., N} and the classi-
fication variable c, the task is to find, from the N-dimensional measurement space R^N, a subspace of n
metrics, R^n, that optimally characterizes c.
In this section, I present the metric selection algorithm based on mutual information (MI) [21]
as a measure of relevance and redundancy among metrics to select a desirable subset. MI has two
main properties that distinguish it from other selection methods. First, MI has the capability of
measuring any type of relationship between variables, because it does not rely on statistics of any
grade or order. The second property is MI’s invariance under space transformation.
The mutual information of two random variables quantifies the mutual dependence between
them. Let m_i and m_j be two metrics in M. Their mutual information is defined as I(m_i; m_j) =
H(m_i) + H(m_j) - H(m_i, m_j), where H(·) refers to the Shannon entropy [21]. Metrics in the
health-related performance data collected periodically from a cloud computing system usually
take discrete values. The marginal probability p(m_i) of metric m_i and the joint probability mass
function p(m_i, m_j) of two metrics m_i and m_j can be calculated using the collected data. Then, the MI
of m_i and m_j is computed as
(5) I(m_i; m_j) = \sum_{m_i \in M} \sum_{m_j \in M} p(m_i, m_j) \log \frac{p(m_i, m_j)}{p(m_i)\, p(m_j)}.
Intuitively, the MI between two metrics, I(mi;mj), measures the amount of information shared
between mi and mj . Metrics with high co-relevance have high MI. As special cases, I(mi;mi) =
1, while I(mi;mj) = 0 if mi and mj are independent.
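For concreteness, a small sketch of estimating I(m_i; m_j) from paired, discretized samples using the empirical joint and marginal distributions; this is a generic estimator, not the dissertation's exact implementation.

import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(X; Y) in bits from paired discrete samples xs, ys."""
    n = len(xs)
    p_xy = Counter(zip(xs, ys))
    p_x = Counter(xs)
    p_y = Counter(ys)
    mi = 0.0
    for (x, y), c in p_xy.items():
        pxy = c / n
        mi += pxy * math.log2(pxy / ((p_x[x] / n) * (p_y[y] / n)))
    return mi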
The goal of metric selection is to find from the original N metrics a subset S with n metrics
{m_i, i = 1, ..., n}, which jointly have the largest dependency on the class c. This can be accom-
plished by using two criteria in metric selection: maximal relevance and minimal redundancy. I use
the mean value of all MI values between individual metric m_i and class c to define the relevance.
The maximal relevance criterion is specified as
(6) \max \ \mathrm{relevance}(S), \qquad \mathrm{relevance} = \frac{1}{|S|} \sum_{m_i \in S} I(m_i; c),
where |S| is the cardinality of S. By applying Equation (6), irrelevant metrics can be removed.
However, the remaining metrics may have rich redundancy. As a result, the dependency
among these metrics may still be high. When two metrics highly depend on each other, their
class-discriminative capabilities do not change much, if one of them is removed. Therefore, I
additionally apply a minimal redundancy criterion to select independent metrics.
(7) \min \ \mathrm{redundancy}(S), \qquad \mathrm{redundancy} = \frac{1}{|S|^2} \sum_{m_i, m_j \in S} I(m_i; m_j).
I combine the two criteria (6 and 7) together to define the dependency of the selected metrics on
the class, dependency(S). To optimize relevance and redundancy simultaneously, I can use the
following equation.
(8) \max \ \mathrm{dependency}(S), \qquad \mathrm{dependency} = \mathrm{relevance}(S) - \mathrm{redundancy}(S).
The N metrics in the original metric set M define a search space of size 2^N. Finding the optimal
metric subset is NP-hard [5]. To find near-optimal metrics satisfying criterion (8), I apply an
incremental search method. Given S_{k-1}, a metric subset with (k-1) metrics, I try to select the k-th
metric that maximizes dependency(·) from the remaining metrics in (M - S_{k-1}). By including
Equations (6) and (7), the metric search algorithm looks for the k-th metric that optimizes the
following condition.
(9) \max_{m_i \in M - S_{k-1}} \left\{ I(m_i; c) - \frac{1}{k-1} \sum_{m_j \in S_{k-1}} I(m_i; m_j) \right\}.
The metric selection algorithm works as follows.
ALGORITHM 1. Metric selection algorithm
MetricSelection() {
1: Apply the incremental search following Equation (9) to select n metrics sequentially from the original metric set M. The value of n can be preset to a large number. The search process produces n nested metric sets, S_1 ⊂ S_2 ⊂ ... ⊂ S_n.
2: Check these metric sets S_1, ..., S_i, ..., S_n to find the range of i where the cross-validation error err_i has small mean and small variance.
3: Within that range, look for the smallest error err*. The optimal size of the metric subset, n*, equals the smallest i for which S_i has error err*. The corresponding S_{n*} is the selected metric subset.
}
The computational complexity of the incremental search method is O(|S| ·N).
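Under the same assumptions as before, the incremental search of Algorithm 1 can be sketched as below, reusing the mutual_information helper from the earlier sketch; the cross-validation step that picks the final subset size n* (Steps 2-3) is omitted for brevity.

def select_metrics(data, labels, n):
    """Incrementally select n metric names from `data` (dict: name -> list of
    discrete values) by maximizing relevance minus redundancy (Equation (9))."""
    relevance = {m: mutual_information(v, labels) for m, v in data.items()}
    selected = [max(relevance, key=relevance.get)]          # most relevant metric first
    while len(selected) < n:
        best, best_score = None, float("-inf")
        for m in data:
            if m in selected:
                continue
            redundancy = sum(mutual_information(data[m], data[s])
                             for s in selected) / len(selected)
            score = relevance[m] - redundancy                # Equation (9)
            if score > best_score:
                best, best_score = m, score
        selected.append(best)
    return selected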
4.2.2. Metric Space Combination and Separation
The metric extraction process creates new metrics by transformation or combination of the
original metrics. It applies a mapping x' = g(x): R^n → R^{n'} to transform a measurement x in
an n-dimensional space to a point x' in an n'-dimensional space with n' < n. It creates a subset
of new metrics by transformation of the original ones. The information most relevant to anomaly
detection in R^n is preserved. The goal is to reconstruct the health-related cloud performance dataset
to a space of fewer dimensions for more efficient and accurate anomaly identification. I explore
both metric space combination and metric space separation to find the most useful metrics and
reduce the dimension of metric space.
After the metric selection process (Section 4.2.1) is completed, the health-related cloud
performance dataset D contains L records (x_1, x_2, ..., x_L) with n metrics. Metric space combi-
nation transforms the L records from the n-dimensional space to L records (x'_1, x'_2, ..., x'_L) in a new
n'-dimensional space.
Let m_1, m_2, ..., m_n denote the n performance metrics. A measurement x_i in D can be
represented with {x_{j,i}}, the value of the j-th metric m_j of x_i; that is, x_i = [x_{1,i}, x_{2,i}, ..., x_{n,i}]^T.
Then, the cloud performance dataset D is represented by a matrix D = [x_1, x_2, ..., x_L]. To find
the optimal combination of the metric space, I calculate the covariance matrix of D as V = DD^T.
According to [26], in order to minimize the mean-squared error of representing the dataset by
n' orthonormal metrics, the eigenvalues of the covariance matrix V are used. We calculate the
eigenvalues {λ_i} of V and sort them in descending order as λ_1 > λ_2 > ... > λ_n.
The metrics with the largest variance caused by a changing faulty condition are identified by
checking their directions. I utilize this property to combine metrics for efficient anomaly detection.
An iterative algorithm is employed to search for the new combined metrics. The first n′ eigenvalues
that satisfy the following requirement are chosen.
(10) \frac{\sum_{i=1}^{n'} \lambda_i}{\sum_{i=1}^{n} \lambda_i} \geq \tau,
where τ is a threshold and τ ∈ (0, 1). The corresponding n' eigenvectors are the new metrics,
denoted by S' = {m'_i, i = 1, ..., n'}. The eigenvectors for metric space transformation are used
to select the most sensitive and relevant metrics. An iterative algorithm is employed to search for
{e_j}:
ALGORITHM 2. Metric space combination based metric extraction
MetricExtraction1() {
1: n' = the number of essential axes (eigenvectors) to estimate;
2: Compute the covariance matrix S;
3: for j = 1 up to n' do
4:   Initialize eigenvector e_j of size n × 1 randomly;
5:   while (1 - |e_j^T e_j|) > ε do
6:     e_j = S e_j;
7:     e_j = e_j - \sum_{k=1}^{j-1} (e_j^T e_k) e_k;
8:     e_j = e_j / ||e_j||;
9:   end while
10: end for
11: return e;
12: }
In Algorithm 2, Steps 7 and 8 apply the Gram-Schmidt orthogonalization process [44] followed by
normalization. ε is a small constant used to test the convergence of e_j: if
(1 - |e_j^T e_j|) < ε, then e_j has converged; otherwise e_j is updated iteratively.
Algorithm 2 converges quickly. It usually takes only two to five iterations to find an eigen-
vector. The computational complexity of the algorithm is O(n^2 n' + n^2 L), where n is the number
of metrics after metric selection (Section 4.2.1). To determine the value of n', i.e., the number of
essential metrics, a common practice is to first set a threshold for the percentage of total variance
to preserve; n' is then the smallest number of essential metrics that achieves this threshold.
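A compact sketch of the metric space combination step using NumPy's eigendecomposition in place of the iterative search in Algorithm 2; the threshold τ plays the role of Equation (10). This is an illustrative shortcut, not the dissertation's implementation.

import numpy as np

def combine_metrics(D, tau=0.9):
    """D is an n x L matrix (rows = metrics after selection, columns = records).
    Returns the n' leading eigenvectors whose eigenvalues retain a fraction
    tau of the total variance (Equation (10)), and the projected records."""
    V = D @ D.T                                    # covariance-like matrix V = D D^T
    eigvals, eigvecs = np.linalg.eigh(V)           # ascending order
    order = np.argsort(eigvals)[::-1]              # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cum = np.cumsum(eigvals) / np.sum(eigvals)
    n_prime = int(np.searchsorted(cum, tau) + 1)   # smallest n' meeting the threshold
    E = eigvecs[:, :n_prime]                       # new (combined) metrics
    return E, E.T @ D                              # records projected to n' dimensions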
In addition to the metric space combination, I also apply metric extraction approaches
based on metric space separation. They separate desired data from mixed data. They define a set
of new basis vectors for metric space separation. Let A denote the matrix with elements x_{j,i},
let x = [x_1, x_2, ..., x_L]^T, and let e = [e_1, e_2, ..., e_{n'}]^T denote the basis vectors. Then, x = Ae. After
estimating the matrix A from x, I calculate its inverse, denoted by W. Hence, the basis vectors can
be computed by e = Wx.
Before applying the metric extraction algorithm, the anomaly detector performs some pre-
processing on the cloud performance dataset. The mean of the data record vector is subtracted from
each data record so that the records have zero mean. A linear transformation is also applied to the dataset,
which makes its components uncorrelated and gives them unit variance. The goal of metric space sep-
aration is to find an optimal transformation matrix W so that the {e_j} are maximally independent. An
iterative algorithm is employed to search for W and hence the new separated metrics.
The metric extraction algorithm that computes the matrix W works as follows.
ALGORITHM 3. Metric space separation based metric extraction
MetricExtraction2() {
1: Initialize the matrix W = [w_1, w_2, ..., w_{n'}]^T randomly;
2: while (1 - |w_j^T w_j|) > ε for any j = 1, ..., n' do
3:   for p = 1 up to n' do
4:     w_{p+1} = w_{p+1} - \sum_{j=1}^{p} (w_{p+1}^T w_j) w_j;
5:     w_{p+1} = w_{p+1} / (w_{p+1}^T w_{p+1})^{1/2};
6:   end for
7: end while
8: return W;
9: }
In Algorithm 3, ε is a small constant used to test the convergence of W: if
(1 - |w_j^T w_j|) < ε for all j = 1, ..., n', then W has converged; otherwise W is updated iteratively. The
algorithm converges fast.
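Because the text leaves the independence-maximizing update itself abstract, the sketch below fills it in with a FastICA-style fixed-point step; the tanh nonlinearity and the symmetric orthonormalization are my assumptions, and only the centering, whitening, and orthonormalization steps correspond directly to the description and Algorithm 3.

import numpy as np

def separate_metrics(D, n_prime, iters=200, eps=1e-6):
    """Metric space separation sketch. D is an n x L matrix of records.
    Center and whiten the data, then iterate a FastICA-style update (assumed)
    until the rows of W stop changing; return W and the separated components."""
    X = D - D.mean(axis=1, keepdims=True)                # zero-mean records
    d, E = np.linalg.eigh(np.cov(X))
    d = np.maximum(d, 1e-12)
    X = E @ np.diag(1.0 / np.sqrt(d)) @ E.T @ X          # whitening: uncorrelated, unit variance
    W = np.random.rand(n_prime, X.shape[0])
    for _ in range(iters):
        W_old = W.copy()
        G = np.tanh(W @ X)                               # nonlinearity (assumption)
        W = (G @ X.T) / X.shape[1] - np.diag((1 - G**2).mean(axis=1)) @ W
        U, _, Vt = np.linalg.svd(W, full_matrices=False) # symmetric orthonormalization
        W = U @ Vt
        if np.max(np.abs(np.abs(np.sum(W * W_old, axis=1)) - 1)) < eps:
            break
    return W, W @ X                                      # e = W x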
4.3. Performance Evaluation
As a proof of concept, I implement a prototype of the proposed metric selection framework
and evaluate its performance based on the collected performance metrics data from our cloud
TABLE 4.1. Normalized mutual information values for 12 metrics of CPU and
2: On receiving a cloud performance record x_t
3:   if x_t - M_t < τ then
4:     Report the anomaly state
5:   end if
6: On receiving a verified failure or an observed but undetected failure record
7:   MRPCSelect()
8: end while
9: }
6.4. Analysis of Cloud Anomalies
6.4.1. Anomaly Detection and Diagnosis Results
In this section, I study the four types of failures caused by CPU-related faults, memory-
related faults, disk-related faults, and network-related faults. For each failure type, I present the
experimental results on MRPC selection and discuss the root cause analysis on each MRPC.
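As background for the per-failure-type analyses below, the MRPC ranking itself can be sketched as follows: project the profiled records onto principal components and rank the components by the absolute correlation of their time series with the binary fault label. The data layout and the ranking cutoff here are illustrative assumptions.

import numpy as np

def rank_mrpcs(X, fault_label, top_k=4):
    """X: records x metrics matrix; fault_label: 0/1 array, one entry per record.
    Returns (1-based PC index, correlation) pairs for the top_k MRPCs."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T                                   # time series of every PC
    corr = np.array([abs(np.corrcoef(scores[:, i], fault_label)[0, 1])
                     for i in range(scores.shape[1])])
    order = np.argsort(corr)[::-1][:top_k]               # most relevant PCs (MRPCs)
    return [(int(i) + 1, float(corr[i])) for i in order]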
6.4.2. MRPCs and Diagnosis of Memory Related Failures
Figure 6.8(a) shows the correlation between the PCs and the memory related faults. As I have
discussed, the 1st and 2nd principal components do not possess high causal correlation with the
occurrences of failures (only 0.16 and 0.04). This indicates that the memory related failures have little
dependency upon them. On the contrary, the 3rd, 5th, 8th and 31st PCs display high correlation
with the failure records (greater than 0.2), as listed in Table 6.1. Figure 6.4(a) shows that the 3rd PC
clearly distinguishes the failure states from the normal states.
FIGURE 6.4. MRPCs of memory-related failures. (a) The time series of the 3rd principal component (x-axis: time in minutes). (b) The coefficients of the performance metrics to the 3rd principal component; the metric avgrq-sz displays the highest contribution to the MRPC.
Based on the MRPC selection procedure described earlier, the synaptic weights w_{ji} represent the
quantified impact of the original metric space on the anomaly-specific subsets. Considering that these
weights w_{ji} can be either positive or negative, I exploit |w_{ji}| to identify the effect of each
performance metric's contribution to the anomaly-specific subsets. The computed weights are
shown in Figure 6.4(b) for the 3rd principal component, which is selected as the top ranked MRPC
with regard to memory related failures. In addition, one performance metric has a dominant contribution
TABLE 6.1. MRPCs ranked by correlation with faults (for each major type, 25 faults are injected into the testbed).
Fault Type      Rank  Order of PC  Correlation Coefficient to Fault
Memory Fault    1     3            0.3898
Memory Fault    2     5            0.2840
Memory Fault    3     8            0.2522
Memory Fault    4     31           0.2043
I/O Fault       1     5            0.4283
I/O Fault       2     7            0.2961
I/O Fault       3     3            0.2402
CPU Fault       1     35           0.3738
CPU Fault       2     40           0.3424
CPU Fault       3     103          0.2559
Network Fault   1     29           0.3532
Network Fault   2     27           0.2733
Network Fault   3     23           0.2715
to this MRPC, with a weight of 0.65, while several other performance metrics have weights
around 0.1-0.2. By checking the list of performance metrics, I find that the highest weighted metric
is "avgrq-sz dev253-1", which is "the average size (in sectors) of the requests that were issued to
the hard drive device 253-1" [2]. Given that the memory related failures are injected by continuously
allocating memory over a short period, the swap space is put into use after the physical memory is
used up. As a result, this process causes more requests to be issued to the hard disk.
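The root-cause step just described reduces to ranking the metrics by the absolute value of their loading |w_ji| on the selected MRPC; a small illustrative helper (metric names are placeholders):

import numpy as np

def top_contributors(loading, metric_names, k=3):
    """loading: the coefficient vector of one principal component;
    returns the k metrics with the largest absolute weights |w_ji|."""
    idx = np.argsort(np.abs(loading))[::-1][:k]
    return [(metric_names[i], float(loading[i])) for i in idx]

# For the memory-related MRPC described above, the top entry would be the
# metric "avgrq-sz dev253-1", pointing at the extra disk requests caused by swapping.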
6.4.3. MRPCs and Diagnosis of Disk Related Failures
Disk-related faults are injected by continuously issuing a big volume of disk requests to
saturate the I/O bandwidth. The causal correlation with the disk related failures is computed for
FIGURE 6.5. MRPCs of disk-related failures. (a) The time series of the 5th principal component (x-axis: time in minutes). (b) The coefficients of the performance metrics to the 5th principal component; the metric rd-sec/s dev-253 displays the highest contribution to the MRPC.
each principal component, as shown in Figure 6.8(b). The top ranked MRPCs are listed in
Table 6.1. The 5th PC possesses the highest correlation with the disk-related failures, as its causal
correlation is more than 0.42. Analysis of the time series of the 5th principal component, plotted
in Figure 6.5(a), shows that most of the anomalies could be identified by setting a proper thresh-
old. From Figure 6.5(b), the performance metric named "rd-sec/s dev-253", with a coefficient of
0.4423, contributes to the 5th principal component more than the other performance metrics. It refers
FIGURE 6.6. MRPCs of CPU-related failures. (a) The time series of the 35th principal component (x-axis: time in minutes). (b) The coefficients of the performance metrics to the 35th principal component; the metric ldavg displays the highest contribution to the MRPC.
to the number of sectors read from the device, which is an indicator that characterizes the symptom
of I/O related failures.
6.4.4. MRPCs and Diagnosis of CPU Related Failures
CPU-related faults are injected by employing infinite loops that use up all CPU cycles.
Table 6.1 lists the MRPCs with the highest correlation with the CPU-related failures. Figure 6.6(a)
FIGURE 6.7. MRPCs of network-related failures. (a) The time series of the 29th principal component (x-axis: time in minutes). (b) The coefficients of the performance metrics to the 29th principal component; the metric %usr 4 displays the highest contribution to the MRPC.
presents the time series of the 35th principal component. From the figure, I can see that some CPU-
related failures are not easily identifiable, e.g., the failures that occurred around the 1250th and
1500th minutes. Figure 6.6(b) plots the weights of the cloud performance metrics for the 35th principal
component. With the largest weight of 0.3874, "ldavg-15" refers to "the load average calculated as the
average number of runnable or running tasks (R state) and the number of tasks in uninterruptible
sleep (D state) over the past 15 minutes". The second and third largest weights correspond to the
performance metrics "ldavg-5" and "%sys all", respectively. "%sys all" refers to the average CPU
utilization in system mode over all processors. All three performance metrics characterize the
process behavior under failures.
6.4.5. MRPCs and Diagnosis of Network Related Failures
Network-related faults are injected by saturating the network bandwidth by continuously
transferring large files between servers. In cloud computing systems, denial-of-service attacks,
virus infections, and failures of switches and routers may cause this type of anomaly. The MR-
PCs are listed in Table 6.1. The 29th principal component is highly correlated with the network
related failures. The 27th and 23rd principal components are ranked second and third, as
shown in Figure 6.8(d). In Figure 6.7(a), I can see the 29th principal component is sensitive to the
occurrences of network-related failures and the failures are distinguishable from normal states. Fig-
ure 6.7(b) shows that "%usr 4" possesses the highest weight of 0.292. The second and third highest
weights are associated with the performance metrics "%idle 4" and "svctm dev8-0" (i.e., "the average
service time (in milliseconds) for I/O requests that were issued to the device"). Both %usr 4 and
%idle 4 represent the states of processor core 4, which is assigned to the virtual machine where
the network-related faults are injected. Therefore, MRPCs can assist cloud operators not only in
identifying anomalies, but also in localizing faults, even within virtual machines.
6.4.6. The Accuracy of Anomaly Identification
I study the performance of several anomaly detection techniques including our proposed
MRPC-based detection approach. I use receiver operating characteristic (ROC) curves to
present the experimental results. An ROC curve displays the true positive rate (TPR) and the
false positive rate (FPR) of the anomaly detection results. The area under the curve is used to
evaluate the detection performance; a larger area implies higher sensitivity and specificity.
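For reference, a minimal NumPy sketch of producing the ROC points and the area under the curve from anomaly scores and ground-truth labels; it is a generic routine, independent of which detector produced the scores.

import numpy as np

def roc_curve_points(scores, labels):
    """Return (FPR, TPR) arrays by sweeping a threshold over the anomaly scores,
    plus the area under the curve computed with the trapezoidal rule."""
    order = np.argsort(scores)[::-1]
    labels = np.asarray(labels)[order]
    tps = np.cumsum(labels)                 # true positives as the threshold drops
    fps = np.cumsum(1 - labels)             # false positives
    tpr = tps / max(labels.sum(), 1)
    fpr = fps / max((1 - labels).sum(), 1)
    auc = np.trapz(tpr, fpr)
    return fpr, tpr, auc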
I compare the performance of the proposed MRPC-based anomaly detection approach
with four widely used detection algorithms: decision tree, Bayesian network, support vector
machine (SVM), and the 1st principal component (using a Kalman filter to detect anomalies). Our MRPC-
based anomaly detector achieves the best performance, with the true positive rate reaching 91.4%
while keeping the false positive rate as low as 3.7%. By applying only the first principal compo-
nent, the false positive rate is higher than 40% in order to achieve a 90% true positive rate. Among
the other detection algorithms, the Bayesian network is relatively better, reaching a TPR of 74.1% with
a low FPR. The experimental results show that the detector based only on the first principal component
has the worst performance in identifying the performance anomalies.
On average, it takes 6.81 seconds for a control node in the cloud to process the cloud performance
data, select MRPCs, and make an anomaly detection decision.
6.4.7. Experimental Results using Google Datacenter Traces
In addition to the experiments on our cloud computing testbed, I evaluate the performance
of the proposed MRPC-based anomaly detection mechanism by using the performance and event
traces collected from a Google datacenter [80]. The Google datacenter trace is the first publicly
available dataset collected from a large number (about 13,000) of multi-purpose servers over 29
days. In the dataset, multiple task related events are recorded; among them, I focus on the failure
events. In total, 13 resource usage metrics are profiled periodically, which are listed in Table
6.2. The measurement period is typically 5 minutes (300 s), and within each measurement period,
measurements are usually taken at 1-second intervals. By applying the MRPC selection algorithm
presented earlier, I obtain the causal correlation between principal components and
failure events, which is plotted in Figure 6.9. The 13th principal component retains the highest
correlation (i.e., 0.18) with the failures.
The ROC curves in Figure 6.10 show the performance of the proposed anomaly detection
approach and the other four detection algorithms. By exploiting MRPCs, I achieve a TPR of 81.5%
with an FPR of 27%. These results outperform all other tested detection methods by 22.9%-68.7% in TPR
at the same FPR. The performance of the proposed anomaly identification mechanism
is somewhat worse on the Google traces. This is caused by the higher dynamicity and variety of
workloads, more complex interactions among system components, the smaller number of performance
metrics, and incomplete information about failure types. Our anomaly detector still provides valuable
information about failure dynamics, which helps system operators proactively reconfigure
resources and schedule workloads.
TABLE 6.2. Performance metrics in the Google datacenter traces
Index Performance Metrics
1 Number of running tasks
2 CPU rate
3 Canonical memory usage
4 Assigned memory usage
5 Unmapped page cache
6 Total page cache
7 Maximum memory usage
8 Disk I/O time
9 Local disk space usage
10 Maximum CPU rate
11 Maximum disk I/O time
12 Cycles per instruction
13 memory accesses per instruction
The main contribution of this work is that, to the best of our knowledge, it is the first to use
subsets of principal components as the most relevant metrics for different types of failures. Through
the analysis of each failure type, I show that anomalies are highly correlated with specific principal
component subsets. Moreover, MRPCs can be applied to uncover the root causes of failures and to
guide timely maintenance.
6.5. Summary
Modern large-scale and complex cloud computing systems are susceptible to software and
hardware failures, which significantly affect the cloud dependability and performance. In this
chapter, I present an adaptive anomaly identification mechanism in cloud computing systems. I
start by analyzing the correlation of the principal components with failure occurrences, where I
find that the PCs retaining the highest variance cannot effectively characterize the failure events, while
lower order PCs display high correlation with the occurrences of failures. I then propose to exploit
the most relevant principal components (MRPCs) to describe failure events and devise a learning
based approach to identify and diagnose cloud anomalies by leveraging MRPCs. The anomaly
detector adapts itself by recursively learning from these newly verified detection results to refine
future detections. Meanwhile, it exploits the observed but undetected failure records reported by
the cloud operators to identify new types of failures. Experimental results from an on-campus
cloud computing testbed show that the proposed MRPC-based anomaly identification mechanism
can accurately detect failures while achieving a low overhead. Learning from the MRPC subspaces
that relate to each type of failure, I gain knowledge of the root causes of failures.
FIGURE 6.8. Correlation between the principal components and different types of failures. (a) Correlation with memory-related failures. (b) Correlation with disk-related failures. (c) Correlation with CPU-related failures. (d) Correlation with network-related failures.
FIGURE 6.9. Correlation between principal components and failure events using the Google datacenter trace.
FIGURE 6.10. Performance of the proposed MRPC-based anomaly detector compared with four other detection algorithms (decision tree, Bayesian network, support vector machine, and the 1st principal component) on the Google datacenter trace, shown as ROC curves (TPR vs. FPR).
CHAPTER 7
F-SEFI: A FINE-GRAINED SOFT ERROR FAULT INJECTION FRAMEWORK
7.1. Introduction
In order to facilitate the testing of application resilience methods, I present a fine-grained
soft error fault injector named F-SEFI. F-SEFI allows for the targeted injection of soft errors into
instructions belonging to applications of interest and into those applications' individual subroutines. F-
SEFI leverages the QEMU [92] virtual machine (VM) and its hypervisor. QEMU uses the Tiny Code
Generator (TCG) to reference and translate instruction sets between the guest and host architec-
Generation (TCG) to reference and translate instruction sets between the guest and host architec-
tures before the instructions are delivered to the host system for execution. F-SEFI provides the
ability to emulate soft errors and corrupt data at runtime by intercepting instructions and replacing
them with contaminated versions during the TCG translation. With the addition of a binary symbol
table, F-SEFI supports a tunable fine-grained injection strategy where soft errors can be injected
into chosen instructions in specified functions of an application. In addition, F-SEFI allows multi-
ple fault models to mimic upsets in hardware (e.g., a probabilistic model, a single-bit fault model,
and a multiple-bit fault model). Overall, F-SEFI manages the fault injections and the user decides
where, when, and how to inject faults.
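To make these fault models concrete, here is a small Python sketch (for illustration only; F-SEFI itself performs the corruption inside QEMU's TCG translation) of single-bit, multiple-bit, and probabilistic flips applied to a 64-bit value:

import random

def flip_bits(value, positions):
    """Flip the given bit positions of a 64-bit integer value."""
    for p in positions:
        value ^= (1 << p)
    return value & 0xFFFFFFFFFFFFFFFF

def single_bit_fault(value):
    return flip_bits(value, [random.randrange(64)])

def multi_bit_fault(value, n_bits=3):
    return flip_bits(value, random.sample(range(64), n_bits))

def probabilistic_fault(value, p=0.001):
    """With probability p, corrupt the value; otherwise leave it untouched."""
    return single_bit_fault(value) if random.random() < p else value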
I implemented a prototype F-SEFI system and conducted the fault injection campaign on
multiple HPC applications. The experimental results show that the effect of the injected faults is
amplified when the fault propagates to other software components, resulting in a number of silent
data corruptions at multiple sites. F-SEFI provides sufficient instruction-level soft error samples
under different fault models, which helps programmers understand the vulnerabilities of the underlying
HPC applications and further helps in designing resilience strategies to mitigate the impact of SDCs.
The rest of this chapter is organized as follows. Section 7.2 presents the coarse-grained soft
error fault injection (C-SEFI) platform, which requires gdb to manually snoop and inject soft
errors into specific applications. Section 7.3 describes the design goals of the fine-grained soft error
fault injector (F-SEFI) and the capabilities of F-SEFI. Fault definitions and models supported
in F-SEFI are presented in Section 7.3.2. Section 7.3.3 depicts the fault injection mechanism and
FIGURE 7.1. Overview of C-SEFI
the implementation of components of F-SEFI. Cases studies on three widely used benchmarks
are demonstrated in Section 7.3.4. Discussion and conclusion are presented in Section 7.4 and
Section 7.5.
7.2. A Coarse-Grained Soft Error Fault Injection (C-SEFI) Mechanism
C-SEFI’s logic soft error injection operational flow is roughly depicted in Figure 7.1. First,
the guest environment is booted and the application to inject faults into is started. Next, I probe
the guest operating system for information related to the code region of the target application and
notify the VM which code regions to watch. Then the application is released, allowing it to run.
The VM observes the instructions occurring on the machine and augments ones of interest. A more
detailed explanation of these techniques follows.
7.2.1. C-SEFI Startup
Initial startup of C-SEFI begins by simply booting a debug-enabled Linux kernel within a
standard QEMU virtual machine. QEMU allows me to start a gdbserver within the QEMU monitor
such that I can attach to the running Linux kernel with an external gdb instance. This allows me
to set breakpoints and extract kernel data structures from outside the guest operating system as
well as from outside QEMU itself. This is a fairly standard technique used by many Linux kernel
developers. Figure 7.2 depicts the startup phase.
7.2.2. C-SEFI Probe
Once the guest Linux operating system is fully booted and sitting idle, I use the attached
external gdb to set a breakpoint at the end of the sys exec call tree but before an application is
sent to a CPU to be executed. I am currently focused only on ELF binaries and have therefore
set the breakpoint at the end of the load elf binary routine. This is trivial to generalize to other
FIGURE 7.2. SEFI’s startup phase
FIGURE 7.3. C-SEFI’s probe phase
binary formats in future work. With the breakpoint set, I am free to issue a continue command via gdb to
allow the Linux kernel to operate. The application of interest can now be started and will almost
immediately hit the breakpoint and bring the kernel back to a stopped state. By this point in the
exec procedure the kernel has already loaded an application’s text section into physical memory in
a memory region denoted by the start code and end code elements of the task’s mm struct memory
structure. I can now extract the location in memory assigned to our application by the kernel by
walking the task list in the kernel. Starting with the symbol init task, I can find the application of
interest either by comparing a binary name to the task struct’s comm field or by searching for a
known pid which is also contained in the task struct. The physical addresses within the VM of the
application’s text region can now be fed into our fault injection code in the modified QEMU virtual
machine. Currently this is done by hand but I have plans to automate this discovery and transfer
using scripts and hypervisor calls.
Figure 7.3 depicts the probe phase of C-SEFI.
7.2.3. C-SEFI Fault Injection
Once QEMU has the code segment range of the target application, the application is resumed. Next, when any opcode that I am interested in injecting faults into is executed in the guest hardware, QEMU checks the current instruction pointer register (EIP). If that instruction pointer address is within the range of the target application (obtained during the probe phase), QEMU knows that the targeted application is executing this particular instruction. At this point I am able to inject any number of faults with confidence that only the desired application is affected.
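To make the guard concrete, the check can be thought of as the small predicate below; the variable and helper names (target_start, target_end, in_target_app) are illustrative rather than the actual symbols in the modified QEMU.

#include <stdint.h>

/* Text-segment bounds of the target application, filled in during the probe
 * phase via the hypervisor call. */
static uint32_t target_start, target_end;

/* Return nonzero when the guest EIP falls inside the target's code region,
 * i.e. when the instruction being emulated belongs to the target application. */
static int in_target_app(uint32_t eip)
{
        return eip >= target_start && eip < target_end;
}

Only when this predicate holds does an opcode helper go on to corrupt its result; every other process on the guest, including the kernel, sees the unmodified instruction semantics.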
This approach is novel for several reasons. Causing opcodes in emulated machine hardware to produce wrong results is not particularly novel or complex. What is complex is doing it only in applications of interest and not every time that instruction is called. For instance, making every add operation on the machine faulty would neither be interesting nor allow the kernel to
boot. Our technique of pinpointing which instructions are being executed by an application affords
us this capability.
FIGURE 7.4. C-SEFI’s fault injection phase
Figure 7.4 depicts this fault injection phase of the C-SEFI logic plug-in. In the first step
of this phase, QEMU brings in the code segment range obtained in the previous probe phase. This range is passed into QEMU by a new hypervisor call that I added. Next, the gdb breakpoint is removed. The application is then resumed and continues operating as normal. Once the application executes opcodes that I am monitoring, the fault injection code inside of
QEMU can determine if, and how, to insert a simulated soft error in that opcode. Finally, the
application continues to run in this state and I observe and analyze how the injected fault is handled
in the application.
The opcode fault injection code has several capabilities. First, it can flip a bit in the inputs of the operation; this simulates a soft error in the input registers used for the operation. Second, it can flip a bit in the output of the operation; this simulates either a soft error in the actual operation of the logic unit (such as a faulty multiplier) or a soft error in the register where the data value is stored. Currently the bit flipping is random, but it can be seeded to produce errors in a specified bit range. Third, the opcode fault injection can perform more complicated changes to the output of an operation by flipping multiple bits in a pattern consistent with an error in part, but not all, of an opcode's physical circuitry. For example, consider how the output of adding two floating point numbers with different exponents changes if a transient error occurs in one of the numbers while the significands are being aligned so that they can be added. By carefully considering the elements of such an operation, I can alter its output to reflect the different possible incorrect outputs that might occur.
The fault injector also has the ability to let some calls to the opcode go unmodified. It is
possible to cause the faults to occur after a certain number of calls or with some probability. In this way, the fault can occur on every call, which closely emulates permanently damaged hardware, or on a single call only, which emulates a transient soft error.
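A compact way to picture these controls is a small policy record consulted on every monitored call. The structure and function names below are hypothetical and only sketch the behavior described above.

#include <stdint.h>
#include <stdlib.h>

typedef struct {
        uint64_t calls_seen;     /* monitored calls observed so far            */
        uint64_t trigger_call;   /* inject on exactly this call (0 = disabled) */
        double   probability;    /* or inject with this per-call probability   */
        int      bit_lo, bit_hi; /* inclusive bit range eligible for flipping  */
} inject_policy;

/* Return the (possibly corrupted) result of one monitored operation. */
static uint64_t maybe_corrupt(inject_policy *p, uint64_t value)
{
        p->calls_seen++;
        int fire = (p->trigger_call != 0 && p->calls_seen == p->trigger_call) ||
                   (p->probability > 0.0 &&
                    (double)rand() / RAND_MAX < p->probability);
        if (!fire)
                return value;                      /* let this call pass untouched */

        int span = p->bit_hi - p->bit_lo + 1;
        int bit  = p->bit_lo + rand() % span;      /* pick one bit in the range    */
        return value ^ (1ULL << bit);              /* single-bit flip              */
}

With a probability of 1.0 every call is faulty, which approximates permanently damaged hardware, while firing on a single trigger_call models a transient soft error.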
Most importantly, whenever I cause a fault to occur I know precisely what the instruction
pointer was at that time. Using this information I should be able to reference back to the original
source code. An obvious complication is that there is no readily available one-to-one mapping between high-level language source code and the machine code generated by the compiler and assembler. However, if the target application is compiled with debug symbols, I can at the very least recognize which function the application was in when I injected the fault. This, coupled with careful code organization, should make this mapping more feasible.
7.2.4. Performance Evaluation of C-SEFI
To demonstrate C-SEFI's capability to inject errors into specific instructions, I present two simple experiments. For each experiment I modified the translation of the instruction of interest inside QEMU. Once the instruction was called, the modified QEMU checked the current instruction pointer (EIP) to see whether the address was within the range of the target application; if so, a fault could be injected. I performed two experiments in this way, injecting faults into the floating point multiply and floating point add operations.
For this experiment I instrumented the floating point multiply operation, fmul, in QEMU. I created a toy application that evaluates Equation 28 iteratively for 40 iterations. The variable y is initialized to 1.0.

(28) y = y ∗ 0.9
Then, at iteration 10 I injected a single fault into the multiplication operation by flipping a random
bit in the output. Figure 7.5 plots the results of this experiment. The thick solid line represents the fault-free output. The other five lines represent separate executions of the
application with different random faults injected. Each fault introduces a numerical error in the
results which continues through the lifetime of the program.
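For reference, the toy workload is small enough to reproduce in a few lines. The sketch below simply mirrors the description above (the printing is only there to make the divergence visible) and is not the exact program used in the experiment.

#include <stdio.h>

int main(void)
{
        double y = 1.0;                       /* initial value            */
        for (int i = 1; i <= 40; i++) {
                y = y * 0.9;                  /* Equation (28): one fmul  */
                printf("iteration %2d: y = %.17g\n", i, y);
        }
        return 0;
}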
I focus on two areas of interest from this plot in Figures 7.6 and 7.7. In Figure 7.6 the plot is zoomed in on the point where the five faults are injected so that they are easier to see. Figure 7.7 focuses on the final results of the application. In this figure it becomes clear that each fault caused an error that propagated through the application to the final results.
7.3. A Fine-Grained Soft Error Fault Injection (F-SEFI) Framework
In the previous section I discussed the coarse-grained SEFI (C-SEFI), which validates the idea of designing a soft error fault injector with minimal modifications to the environment and the source code. However, C-SEFI is impractical because it requires the user to pause the system and extract application knowledge manually in order to inject faults into a specific application. Moreover, C-SEFI only supports a coarse injection granularity; it is difficult to inject faults into specific subroutines, which limits its capability to profile an application's vulnerability to soft errors. Based on the study of C-SEFI, I propose a fine-grained soft error fault injection platform that not only inherits the key features of C-SEFI but also combines all of the features I desired in a tool meant to study the behavior of applications in the presence of soft errors. The key features of F-SEFI are summarized as follows:
FIGURE 7.5. The multiplication experiment uses the floating point multiply in-
struction where a variable initially is set to 1.0 and is repeatedly multiplied by 0.9.
For five different experiments a random bit was flipped in the output of the multiply
at iteration 10, simulating a soft error in the logic unit or output register.
FIGURE 7.6. Experiments focusing on the injection point. It can be seen that each of the five separately injected faults causes the value of y to change, once radically and the other times only slightly.
FIGURE 7.7. Experiments focusing on the effects on the final solution. It can be seen that the final output of the algorithm differs due to these injected faults.
7.3.1. F-SEFI Design Objectives
Non-intrusion: in designing F-SEFI I was keenly focused on providing fault injection with as little impact on the operating environment as possible. Our approach is non-intrusive in that it requires no modifications to the application source code, compilers, third-party software, or the operating system. It does not require custom hardware and runs entirely in user space, so it can be run
on production supercomputers alongside scientific applications. These constraints are pragmatic at
a production DOE facility and also exclude any possibility of side effects due to intrusive changes.
Additionally, our approach allows other applications to run alongside the application under fault
injection. In particular, this facilitates studies in resilient libraries and helper applications.
Infrastructure Independence: F-SEFI is designed as a module of the QEMU hypervisor and, therefore, benefits from virtualization. Since the hypervisor supports a wide range of platforms, so does our fault injection capability. This enables us to explore hardware that I might not physically have as well as to explore new hardware approaches. For instance, I can implement triple-
modular redundancy (TMR) in certain instructions and generate errors probabilistically to evaluate
classes of applications that might be resilient on such hardware. In addition, since all guest OSs
are isolated, multiple target guest OSs from different architectures can work at the same time with-
out any interference. Faults can then be contained and I can run multiple applications in different
guest OSs and inject faults into them concurrently. Similarly, since F-SEFI can target a specific application, I can inject into multiple applications running within the same guest OS. This can help
reduce the effects of the virtualization overhead by studying multiple applications (or input sets)
concurrently.
Application Knowledge: F-SEFI performs binary injection dynamically without augmenting the source code. Moreover, it adapts to the dynamicity of data objects, covering all static and
dynamic data. This is especially useful for applications that operate on random data or whose fault
characteristics vary when given different input datasets. F-SEFI does not require the memory ac-
cess information of the data objects at runtime. All the injections target the instructions, covering
the opcodes, addresses, and data in registers copied from memory.
Tunable Injection Granularity: F-SEFI supports a tunable injection granularity, allowing it
to inject faults semantically. Faults can target the entire application or focus in on specific func-
tions. Furthermore, the faults can be configured to corrupt specific operands and specific bit ranges.
Particularly with function-level injection, F-SEFI can provide a gprof-like [6] vulnerability profile
which is useful to programmers analyzing coverage vulnerability. While fine-grained tunability op-
erates on the symbol table extracted from an unstripped binary, F-SEFI can still do fault injections
into random locations in the application if the symbol table is not available.
Injection Efficiency: F-SEFI can be configured to inject faults only in specific micro-operations and to stay out of the way of all others. As such, it can be configured to cause only SDCs by flipping bits in mathematical operations. Alternatively, it can be used to explore control corruptions (such as looping and jumps) or crashes (accessing protected memory, etc.). This generality allows a user of
F-SEFI to focus their attention on studying the effects of specific SDC scenarios.
7.3.2. F-SEFI Fault Model
In this work I consider soft errors that occur in the functional units (e.g., ALU and FPU)
of the processor. In order to produce SDCs, I corrupt the micro-operations executed in the ALU
(e.g., XOR) and FPU (e.g., FADD and FMUL) unit(s) by tainting values in the registers during
instruction execution. Fault characteristics can be configured in several ways to comprehensively
study how an application responds.
Faulty Instruction(s): soft errors can be injected into any machine instruction. In this work I study corrupted FADD, FMUL, and XOR instructions, which are summarized in Table 7.1. Since QEMU performs guest-to-host instruction translation, I merely modify this translation process to perform the type of corruption I want to study.

TABLE 7.1. Fault types for injection

Fault Type    Description
FADD          Bit-flip in the floating point addition micro-operation.
FMUL          Bit-flip in the floating point multiplication micro-operation.
XOR           Bit-flip in the xor micro-operation.
Random and Targeted: F-SEFI offers both random (for coarse-grained) and targeted (for
fine-grained) fault injections. Initial development of the tool demonstrated the coarse-grained in-
jection by randomly choosing instructions to corrupt in an application [4]. This technique provides
limited resilience evaluation at the application level. F-SEFI now also has the ability to do targeted
fault injection into specific instructions and functions of an application. This allows a finer-grained
study of the vulnerabilities of an application.
Single and Multiple-Bit Corruption: any number of bits can be corrupted in an instruction
using F-SEFI. This allows for studying how applications would behave in the absence of various forms of error protection, as well as studying faults that cause silent data corruption.
Deterministic and Probabilistic: while injecting faults into instructions, F-SEFI can deter-
ministically flip any bit of the input or output register(s). It can also be configured to apply a
probability function to determine which bits are more vulnerable than others. For example, one
can target the exponent, mantissa, or sign bit(s) of a floating point value.
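One way to picture field-targeted corruption of a double (sign bit 63, exponent bits 62 down to 52, mantissa bits 51 down to 0 in IEEE 754) is the following sketch; flip_double_bit is an illustrative helper, not part of F-SEFI's interface.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Flip a single bit of a double by reinterpreting its 64-bit pattern. */
static double flip_double_bit(double x, int bit)
{
        uint64_t u;
        memcpy(&u, &x, sizeof u);   /* well-defined type punning */
        u ^= (uint64_t)1 << bit;    /* flip the chosen bit       */
        memcpy(&x, &u, sizeof x);
        return x;
}

int main(void)
{
        double v = 3.141592653589793;
        /* Target only the exponent field, bits 52..62. */
        double faulty = flip_double_bit(v, 52 + rand() % 11);
        printf("original %.17g, corrupted %.17g\n", v, faulty);
        return 0;
}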
7.3.3. F-SEFI Fault Injection Mechanisms
F-SEFI leverages extensive open source work on the QEMU processor emulator and virtu-
alizer by interfacing with the hypervisor as a plug-in module. After the QEMU hypervisor starts
a virtual machine image, the F-SEFI broker is loaded dynamically. As instructions are issued by
applications running within a guest OS, F-SEFI intercepts these instructions and potentially corrupts them before sending them on to the host kernel. This interaction is depicted in Figure 7.8. F-SEFI runs entirely in user space and can be run as a "black box" on the command line. This launches the tool, performs the fault injections, tracks how the application responds, and logs all the results back to the host file system. This is particularly useful for batch-mode analysis in campaign studies of application vulnerability. F-SEFI consists of five major components: the profiler, configurator, probe, injector, and tracker, which are shown in Figure 7.9. These are explained in more detail in the next few sections.

FIGURE 7.8. The overall system infrastructure of F-SEFI: guest OSs run on top of the VM hypervisor (QEMU), which hosts the F-SEFI broker and its log, above the host kernel and the hardware.

FIGURE 7.9. The components of the F-SEFI broker: the Profiler collects information about the target instructions; the Configurator configures the Probe and Injector with multiple features; the Probe snoops the EIP before the execution of each guest code block; the Injector uses a bit-flipper to contaminate the target application/function; and the Tracker logs all injection events.
Profiler: as with most dynamic fault injectors, F-SEFI profiles the application to gather in-
formation about it before injecting faults. As described in Section 7.3.2, F-SEFI can target specific
instructions for corruption. It is in this profiling stage that F-SEFI gathers information about how
many occurrences of each instruction there are as well as their relative location within the binary. It
is also in this profiling stage that the function symbol table (FST) is extracted from the unstripped
binary. This allows F-SEFI to understand where the application's functions start and end. Then the Execution Instruction Pointer (EIP) is observed through QEMU to trace where the application is at runtime. Figure 7.10 shows the relevant information from a sample symbol table used in a later case study.

Num:  Value     Size  Type  Bind    Vis      Ndx  Name
 65:  08048130   136  FUNC  GLOBAL  DEFAULT   13  find_nearest_point
 86:  080489a0   143  FUNC  GLOBAL  DEFAULT   13  clusters
101:  080491f0   661  FUNC  GLOBAL  DEFAULT   13  kmeans_clustering
105:  08048a70  1713  FUNC  GLOBAL  DEFAULT   13  main

FIGURE 7.10. A subset of the function symbol table (FST) for the K-means clustering algorithm studied in Section 7.3.4. This is extracted during the profiling stage and used to trace where the application is at runtime for targeted fault injections.
Configurator: the configuration contains all the specifics related to the faults that will be
injected. This includes what application is to be studied, functions to target, the injection granular-
ity, and the instructions to alter. Additionally, probabilities of alteration can be assigned to specific
bit regions where injections are desired.
As an example, in [101], the authors present an application that is highly resilient except to data corruption in the high-order bits. This configuration stage makes it possible to target injections,
for instance, at only the most significant bits in a 64-bit register from the 52nd bit to the 63rd
bit. This is precisely the kind of study that is enabled by F-SEFI. Another example use would
be choosing a probability of corruption related to the neutron flux of the environment where the application will run
(sea level, high altitude terrestrial, aerial, satellite, etc.). The configurator allows a great deal of
flexibility in the way instructions can be targeted. For example, one can skip over N instances of a
target instruction and only then begin injecting faults.
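Gathering those knobs in one place, a configuration record along the following lines captures what the configurator manages; all field names are hypothetical and F-SEFI's actual configuration format may differ.

#include <stdint.h>

typedef struct {
        const char *app_name;        /* application to study, e.g. "kmeans"       */
        const char *function_name;   /* function to target, or NULL for whole app */
        const char *opcode;          /* "FADD", "FMUL", or "XOR"                  */
        uint64_t    skip_instances;  /* let this many matching calls pass first   */
        int         bit_lo, bit_hi;  /* eligible bit range, e.g. 52..63           */
        double      probability;     /* per-call corruption probability           */
} sefi_config;

A record restricted to, say, bits 52 through 63 of FADD results in a single target function would reproduce the kind of high-order-bit study mentioned above.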
Probe: once profiled and configured, the application under analysis is run within the guest
OS. The F-SEFI probe component then dynamically observes the guest instruction stream before
it is sent to the host hardware for execution. This instruction stream is snooped at the block level, where QEMU organizes instructions into blocks to reduce overhead. The probe monitors the Execution Instruction Pointer (EIP), and if it enters the memory region belonging to the target application, the probe switches to instruction-level monitoring. At this more fine-grained level
the probe begins checking micro-operations of each instruction that passes to the host. If the
underlying instruction satisfies the conditions defined in the configuration phase, then the probe
phase activates the injector. The algorithm for the probing process is shown as follows.
ALGORITHM 6. Probing Algorithm
PROBE() {
1: Load Probe configuration
2: Load Injector configuration
3: FOR each translation block (TB) intercepted by F-SEFI DO
4: Extract the process name of the current TB
5: IF the process name == target application THEN
6: Start the F-SEFI Tracker
7: Extract the critical memory region
8: FOR each instruction to execute DO
9: IF the instruction address resides within the critical memory region THEN
10: Start the Injector
11: END IF
12: END FOR
13: END IF
14: END FOR
15: }
Injector: QEMU has translation (helper) functions that describe how to translate each instruction of a guest architecture into an instruction (or series of instructions) on the host architecture. The injector phase
of F-SEFI substitutes the original helper function with a modified one. The new corrupted ver-
sion is controlled by the configuration to conditionally flip bits in the registers used during the
calculation. This translation is entirely transparent to the QEMU hypervisor and allows F-SEFI to
closely emulate faulty hardware without the associated overheads and limitations of hardware fault
injection.
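The substitution can be pictured along the lines of the fragment below, which follows the general shape of QEMU's softfloat-based floating point helpers. The helper name, the injection_enabled flag, and in_target_region() are illustrative only, this is not a drop-in patch, and the exact helper signatures differ across QEMU versions and target architectures.

/* Illustrative replacement for a floating point multiply helper; assumed to
 * live inside QEMU's FPU helper source for the guest architecture.  float64
 * is QEMU softfloat's 64-bit container, so XOR-ing it flips raw bits. */
float64 helper_fmul_injected(CPUX86State *env, float64 a, float64 b)
{
        float64 r = float64_mul(a, b, &env->fp_status);   /* original semantics */

        if (injection_enabled && in_target_region(env->eip))
                r ^= (uint64_t)1 << (rand() % 64);         /* corrupt the result */

        return r;
}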
TABLE 7.2. Benchmarks and target functions for fine-grained fault injection

FFT [95]: Fast Fourier Transform using Radix-2, 3, 4, 5, and 8 FFT routines
    Target functions: fft4b (Radix-4 routine), fft8b (Radix-8 routine)
BMM [78]: Bit Matrix Multiply algorithm from the CPU Suite benchmark
    Target functions: maketable (construct the lookup tables), bmm_update (apply the 64-bit matrix multiply based on the lookup tables)
Kmeans [16]: K-means clustering algorithm from the Rodinia benchmark suite
    Target functions: kmeans_clustering (update the cluster center coordinates), find_nearest_point (update membership)

Tracker: F-SEFI maintains very detailed logs of what happens while monitoring an application, as well as carefully tracking fault injections. When it decides to inject a fault, it reports information about what instruction was being executed and the state of the registers before and after the injection. In this way it is possible to analyze post-mortem how the application behaved when faults occurred.
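As a sketch of the kind of per-injection record the tracker keeps (the exact log format used by F-SEFI is not reproduced here), one entry might carry the fields below.

#include <stdint.h>
#include <stdio.h>

typedef struct {
        uint32_t    eip;            /* guest instruction pointer at injection time */
        const char *opcode;         /* which micro-operation was corrupted         */
        int         bit;            /* which bit was flipped                       */
        uint64_t    before, after;  /* register value before and after the flip    */
} inject_record;

static void log_injection(FILE *log, const inject_record *r)
{
        fprintf(log, "eip=0x%08x op=%s bit=%d before=0x%016llx after=0x%016llx\n",
                r->eip, r->opcode, r->bit,
                (unsigned long long)r->before, (unsigned long long)r->after);
}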
7.3.4. Case Studies
In this section I demonstrate F-SEFI injecting faults into three benchmark applications: fast Fourier transform (FFT), bit matrix multiplication (BMM), and K-means clustering. These experiments
were conducted using the QEMU virtual machine hypervisor. The guest kernel used was Linux
version 2.6.0 running on Ubuntu 9.04. The host specifications are unimportant, as all that is required is that QEMU can run on the host in user space.
Table 7.2 gives specifics about the benchmarks I studied, including the functions targeted for
fault injection. Each benchmark was profiled to determine the number of floating point addition
(FADD), floating point multiplication (FMUL), and exclusive-or (XOR) operations. These results
are shown in Figure 7.11 and are the basis for the instructions that are targeted in the following
experiments.
1-D Fast Fourier Transform (FFT). After profiling the benchmark I chose to target the
fft4b function. This function comprises a large percentage of the FADD and FMUL instructions.
F-SEFI was configured to inject either one or two single-bit errors into the benchmark, targeting randomly selected FADD and FMUL instructions in the fft4b routine. The injection procedure is shown
in Figure 7.12. Selected results from these four experiments are shown in Figure 7.13 and are