APPROVED: Song Fu, Major Professor Yan Huang, Committee Member Krishna Kavi, Committee Member Xiaohui Yuan, Committee Member Barrett Bryant, Chair of the Department of
Computer Science and Engineering Costas Tsatsoulis, Dean of the College of
Engineering Mark Wardell, Dean of the Toulouse Graduate
School
AUTONOMIC FAILURE IDENTIFICATION AND DIAGNOSIS FOR BUILDING
DEPENDABLE CLOUD COMPUTING SYSTEMS
Qiang Guan
Dissertation Prepared for the Degree of
DOCTOR OF PHILOSOPHY
UNIVERSITY OF NORTH TEXAS
May 2014
Guan, Qiang. Autonomic Failure Identification and Diagnosis for Building Dependable
Cloud Computing Systems. Doctor of Philosophy (Computer Science), May 2014, 121 pp., 9
tables, 53 figures, bibliography, 112 titles.
The increasingly popular cloud-computing paradigm provides on-demand access to
computing and storage with the appearance of unlimited resources. Users are given access to a
variety of data and software utilities to manage their work. Users rent virtual resources and pay
for only what they use. In spite of the many benefits that cloud computing promises, the lack of
dependability in shared virtualized infrastructures is a major obstacle for its wider adoption,
especially for mission-critical applications.
Virtualization and multi-tenancy increase system complexity and dynamicity. They
introduce new sources of failure degrading the dependability of cloud computing systems. To
assure cloud dependability, in my dissertation research, I develop autonomic failure
identification and diagnosis techniques that are crucial for understanding emergent, cloud-wide
phenomena and self-managing resource burdens for cloud availability and productivity
enhancement. We study the runtime cloud performance data collected from a cloud test-bed as
well as traces from production cloud systems, and we define cloud signatures consisting of the
metrics that are most relevant to failure instances.
We exploit profiled cloud performance data in both the time and frequency domains to
identify anomalous cloud behaviors and leverage cloud metric subspace analysis to automate the
diagnosis of observed failures. We implement a prototype of the anomaly identification system
and conduct experiments on an on-campus cloud computing test-bed and on the Google
datacenter traces. Our experimental results show that the proposed anomaly detection
mechanism achieves 93% detection sensitivity while keeping the false positive rate as low as
6.1%, and that it outperforms the other tested anomaly detection schemes. In addition, the anomaly
detector adapts itself by recursively learning from newly verified detection results to refine future
detection.
Copyright 2014
by
Qiang Guan
ACKNOWLEDGMENTS
This dissertation would have been impossible without the continuous support and supervision of
many people, and I would like to thank them here. I would first like to thank my advisor,
Dr. Song Fu, for his guidance, support, and supervision over the past four years. I am
proud that I will be his first Ph.D. graduate; that is a great honor. I also want to thank Dr. Yan
Huang, Dr. Krishna Kavi, and Dr. Xiaohui Yuan for their comments and suggestions on this work.
I would like to thank Dr. Nathan Debardeleben, Dr. Mike Lang, and Mr. Sean Blanchard from the
Ultrascale System Research Center, New Mexico Consortium, Los Alamos National Laboratory,
for their mentoring and advising. I would also like to thank the department chair, Dr. Barrett Bryant,
the graduate advisor, Dr. Bill Buckles, and Dr. Armin R. Mikler for their guidance and generous help
with my academic career. I am thankful to my friends, Dongyu Ang, K.J. Buckles, Guangchun Cheng,
Chi-Chen Qiu, Song Huang, Tommy Janjusic, Zhi Liu, Husanbir Pannu, Devender Singh, Yanan Tao,
Dr. Shijun Tang, Yiwen Wan, Ziming Zhang, Chengyang Zhang, Shunli Zhao, all the team-mates
of Highland Guerilla, and friends in Highland Baptist Church for their friendship and support.
I would like to thank my parents for their support during the whole journey. I want to give
special thanks to my wife, Dr. Xiaoyi Fang, for her love, patience, understanding, and support through
these days and nights.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION AND MOTIVATION
1.1. Introduction
1.2. Terms and Definitions
1.3. Motivation and Research Tasks
1.3.1. Characterizing System Dependability in Cloud Computing Infrastructures
1.3.2. Metric Dimensionality Reduction for Cloud Anomaly Identification
1.3.3. Soft Errors (SE) and Silent Data Corruption (SDC)
1.4. Contributions
1.4.1. Cloud Dependability Characterization and Analysis
1.4.2. Metric Selection and Extraction for Characterizing Cloud Health
1.4.3. Exploring Time and Frequency Domains of Cloud Performance Data for Accurate Anomaly Detection
1.4.4. Most Relevant Principal Components based Anomaly Identification and Diagnosis
1.4.5. SEFI: A Soft Error Fault Injection Tool for Profiling the Application
a multi-class classification model is employed with each failure type being assigned a class value.
Let xt denote a record of values for the n collected performance metrics m1,m2, . . . ,mn at time t.
The pattern classification problem is to learn a classifier function C that has C(xt) = ft.
I use a directed acyclic graph (DAG) to represent the classification function C. Each node
in the DAG is for a cloud performance metric. An arc between two nodes represents a probability
correlation. Let x = (x1, x2, . . . , xn) be a cloud performance data point described by the n perfor-
mance metrics m1,m2, . . . ,mn, respectively. Using the DAG, we can compute the probability of
data point x by
(1) P(x) = \prod_{i=1}^{n} P(x_i \mid m_j),
where metric mj is the immediate predecessor of metric mi in the DAG. To find the essential
metrics that can characterize the correlation between cloud performance and failure events, we
compute the conditional probability of every metric on failure occurrences, i.e., P (mk|failure),
and select those metrics whose conditional probabilities are greater than a threshold τ . The selected
metrics constitute the cloud fingerprint.
A DAG is automatically built from a set of cloud performance data records, R = {x_1, x_2, ..., x_l}.
For a cloud performance metric m_i, let metric m_p denote a parent of m_i. The probability P(m_i =
m_{ij} | m_p = m_{pk}) is computed and denoted by w_{ijpk}. The DAG building mechanism searches for
the w_{ijpk} values that best model the cloud performance data. In essence, it tries to maximize the
probability
(2) P_w(R) = \prod_{r=1}^{l} P(x_r).
This is done by an iterative process. w_{ijpk} is initialized to random probability values for any i,
j, p, and k. In each iteration, for each cloud performance data record x_r in R, our mechanism
computes

(3) \frac{\partial \ln P_w(R)}{\partial w_{ijpk}} = \sum_{r=1}^{l} \frac{P(m_i = m_{ij}, m_p = m_{pk} \mid x_r)}{w_{ijpk}}.
TABLE 3.1. Description of the injected faults.
Type of Injected Faults Symptom
CPU Fault Infinite loop
Memory Fault Keep allocating the memory space
I/O Fault Keep copying files to the disk
Network Fault Keep sending and receiving packets
Then, the values of w_{ijpk} are updated by

(4) w_{ijpk} = w_{ijpk} + \alpha \frac{\partial \ln P_w(R)}{\partial w_{ijpk}},

where \alpha is a learning rate and \partial \ln P_w(R) / \partial w_{ijpk} is computed from Equation (3). The value of \alpha is
set to a small constant for quick convergence. Before the next iteration starts, the values of w_{ijpk}
are normalized to be between 0 and 1.
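To make the iterative update concrete, the following is a minimal Python sketch of learning the w_{ijpk} weights by gradient ascent over fully observed, discretized performance records. The single-parent structure, the per-(i, p, k) normalization, and all variable names are simplifying assumptions for illustration rather than the dissertation's actual implementation.

import random
from collections import defaultdict

def learn_dag_weights(records, parent, alpha=0.05, iterations=50):
    """Gradient-ascent sketch of Equations (2)-(4): learn w[(i, j, p, k)],
    an estimate of P(m_i = j | parent m_p = k), from discretized records.
    `records` is a list of dicts {metric_index: discretized_value};
    `parent` maps each metric index to a single parent index (assumed)."""
    w = defaultdict(random.random)                     # random initialization
    for _ in range(iterations):
        grad = defaultdict(float)
        for x in records:                              # Equation (3)
            for i, p in parent.items():
                key = (i, x[i], p, x[p])               # the observed configuration
                grad[key] += 1.0 / w[key]              # P(m_i=j, m_p=k | x_r) = 1 here
        for key, g in grad.items():                    # Equation (4)
            w[key] += alpha * g
        totals = defaultdict(float)                    # normalize each (i, p, k) group
        for (i, j, p, k), v in w.items():
            totals[(i, p, k)] += v
        for (i, j, p, k) in list(w):
            w[(i, j, p, k)] /= totals[(i, p, k)]
    return w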
3.4. Cloud Computing Testbed and Performance Profiling
The cloud computing system under test consists of 16 servers. The cloud servers are
equipped with 4 to 8 Intel Xeon or AMD Opteron cores and 2.5 to 16 GB of RAM. I have in-
stalled Xen 3.1.2 hypervisors on the cloud servers. The operating system on a virtual machine
is Linux 2.6.18 as distributed with Xen 3.1.2. Each cloud server hosts up to ten VMs. A VM is
assigned up to two VCPUs, among which the number of active ones depends on the applications. The
amount of memory allocated to a VM is set to 512 MB. I run the RUBiS [14] distributed online
service benchmark and MapReduce [24] jobs as cloud applications on VMs. The applications are
submitted to the cloud testbed through a web based interface. I have developed a fault injection
tool, which is able to inject four major types and 12 sub-types of faults into cloud servers with
adjustable levels of intensity. They mimic faults of the CPU, memory, disk, and network. All four
major types of injected faults are described in Table 3.1.
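For illustration only, the sketch below mimics the four fault symptoms of Table 3.1 with simple Python stressor routines. The durations, sizes, file path, and peer address are hypothetical parameters; the actual tool injects 12 sub-types of faults with tunable intensity.

import os, socket, time

def cpu_fault(seconds=60):
    """CPU fault: spin in a busy loop to consume CPU cycles."""
    end = time.time() + seconds
    while time.time() < end:
        pass

def memory_fault(chunk_mb=64, limit_mb=2048):
    """Memory fault: keep allocating memory until the limit is reached."""
    hog = []
    while len(hog) * chunk_mb < limit_mb:
        hog.append(bytearray(chunk_mb * 1024 * 1024))

def io_fault(path="/tmp/fault_io.bin", copies=100, size_mb=64):
    """I/O fault: keep writing large files to the disk."""
    block = os.urandom(1024 * 1024)
    for i in range(copies):
        with open(f"{path}.{i}", "wb") as f:
            for _ in range(size_mb):
                f.write(block)

def network_fault(peer=("10.0.0.2", 9000), seconds=60):
    """Network fault: keep sending packets to a peer to saturate the link."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"x" * 1400
    end = time.time() + seconds
    while time.time() < end:
        s.sendto(payload, peer)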
I exploit third-party monitoring tools, sysstat [94] to collect runtime performance data in
the hypervisor and virtual machines, and a modified perf [75] to obtain the values of performance
counters from the Xen hypervisor on each server in the cloud testbed. In total, 518 metrics are
FIGURE 3.2. A sampling of cloud performance metrics that are often correlated
with failure occurrences in our experiments. In total, 518 performance metrics are
profiled with 182 metrics for the hypervisor, 182 metrics for virtual machines, and
154 metrics for hardware performance counters (four cores on most of the cloud
servers).
profiled, i.e., 182 for the hypervisor and 182 for virtual machines by sysstat and 154 for perfor-
mance counters by perf, every minute. They cover the statistics of every component of cloud
servers, including the CPU usage, process creation, task switching activity, memory and swap
space utilization, paging and page faults, interrupts, network activity, I/O and data transfer, power
management, and more. Table 3.2 lists and describes a sampling of the performance metrics that
are often correlated with failure occurrences in our experiments. I tested the system from May 22,
2011 to February 18, 2012. In total, about 813.6 GB of performance data were collected and recorded
from the cloud computing testbed in that period of time.
To tackle the big data problem and analyze the cloud dependability efficiently, our cloud
dependability analysis (CDA) system removes those performance metrics that are least relevant to
failure occurrences. First, CDA searches for the metrics that display zero variance. Among all
of the 518 metrics, 112 of them have constant values, which provide no contribution to cloud
dependability analysis. After removing them, 406 non-constant metrics are kept. Then, CDA cal-
culates the correlation between the remaining metrics and the "failure" label (0/1 for normal/failure
classification and multiple classes for different types of failures). CDA removes those metrics whose
correlations with failure occurrences are less than a threshold τ_corr.
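A minimal sketch of this two-step filtering, assuming (purely for illustration) that the profiled data sit in a pandas DataFrame with one column per metric plus a numeric "failure" label column:

import pandas as pd

def filter_metrics(df, label_col="failure", tau_corr=0.1):
    """Drop zero-variance metrics, then drop metrics whose absolute
    correlation with the failure label is below the threshold tau_corr."""
    metrics = df.drop(columns=[label_col])
    nonconstant = metrics.loc[:, metrics.std() > 0]        # remove constant metrics
    corr = nonconstant.corrwith(df[label_col]).abs()       # correlation with failures
    selected = corr[corr >= tau_corr].index
    return df[list(selected) + [label_col]]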
3.5. Impact of Virtualization on Cloud Dependability
This work aims to find out and model the impact of virtualization on system dependability
in cloud computing infrastructures. To this end, our cloud dependability analysis (CDA) system
compares the correlation of various performance metrics with failure occurrences in virtualization
and traditional non-virtualization environments. CDA exploits the DAGs described in Section 3.3
for the analysis and comparison.
To build a failure-metric DAG using a training set from the collected cloud performance
data, CDA sets the root node as “failure” for all types of failure events or a specific type of failures
for finer-grain analysis. Each node, except for the root node, is allowed to have multiple parents.
The maximal number of parents can be configured. For example, in our experiments it is set to
two, which means each metric node in the DAG can have only one more parent in addition to the
root node. Moreover, a continuous metric is discretized to a certain number of bins based on the
nature of the metric.
In this section, I focus on the failures caused by CPU (Section 3.5.1), memory (Section 3.5.2),
disk (Section 3.5.3), network (Section 3.5.4), and all (Section 3.5.5) faults, and model the impact
FIGURE 3.3. Failure-metric DAG for CPU-related failures in the cloud testbed.
FIGURE 3.4. Failure-metric DAG for CPU-related failures in the non-virtualized system.
of virtualization on cloud dependability. I present the DAGs for virtualized and non-virtualized
systems and compare the results. Due to the space limitation, only the top three levels of each
DAG are plotted.
3.5.1. Analysis of CPU-Related Failures
To characterize the cloud dependability under CPU failures, the Coordinators in the CDA
system control the fault injection agents to inject CPU related faults, including randomly changing
one or multiple bits of the outputs of arithmetic or logic operations, continuously using up all CPU
cycles, and more. These faults are injected to one, some, or all of the processor core(s) on a cloud
server. The health monitoring sensors collect the runtime performance data on each cloud server,
pre-process the data, and report them to the Coordinators, which build the failure-metric DAGs
and analyze the system health status of a management domain or the entire cloud.
Figure 3.3 depicts the DAG for CPU related failures in the cloud computing testbed with
virtualization support. For comparison, I also conduct experiments on a traditional distributed
system without virtualization. Figure 3.4 presents the corresponding DAG.
From Figure 3.4, I can see that 13 metrics display strong correlation with the occurrences
FIGURE 3.5. Failure-metric DAG for memory-related failures in the cloud testbed.
FIGURE 3.6. Failure-metric DAG for memory-related failures in the non-virtualized system.
of CPU related failures in the non-virtualized system. Among them, four (i.e., %usr all, %nice all,
%sys all, and %iowait all) are metrics for all processor cores, while the others (i.e., %usr n,
%nice n, %sys n, %iowait n, and %soft n) are for individual cores.
In the cloud computing environment (Figure 3.3), 12 metrics are highly correlated with
the failures. Metric %usr all from the privileged domain, Dom0, is the direct child of the root
node, showing the highest correlation. Among the 12 metrics, 11 are metrics collected from Dom0
(Metrics from user virtual machines, DomU, are located at lower levels of the DAG.) They are %usr,
%sys, and %iowait of all or individual processor cores. %steal all is a new metric that is cor-
related with failure occurrences, compared with Figure 3.4. In addition, a performance counter
metric, DTLB-load-miss, also has a strong dependency with CPU related failures, while other per-
formance counters have higher correlation with performance metrics of either the hypervisor or
virtual machines.
3.5.2. Analysis of Memory-Related Failures
To characterize the cloud dependability under memory related failures, memory faults are
injected by the fault injection agents to cloud servers. This type of faults includes flipping one or
multiple bits of memory to the opposite state, using up all available memory space, and more. The
Coordinators collect the runtime performance data from the health monitoring sensors on cloud
servers, and generate failure-metric DAGs for cloud dependability analysis.
FIGURE 3.7. Failure-Metric DAG for disk-related failures in the cloud testbed.
FIGURE 3.8. Failure-metric DAG for disk-related failures in the non-virtualized system.
Figure 3.5 shows the DAG for memory related failures in the cloud computing testbed. The
result from the non-virtualized system is presented in Figure 3.6.
In Figure 3.6, six metrics display strong correlation with the occurrences of memory related
failures in the non-virtualized system. They are %usr all and %sys all for all processor cores,
%usr n and %iowait n of some individual cores, and %memused, which indicates the memory
utilization.
In the cloud computing environment (Figure 3.5), seven metrics are highly correlated with
the failures. All of the seven metrics come from Dom0. Compared with Figure 3.6, the metric
%usr all is the direct child of the root node in both cases. However, in the cloud computing envi-
ronment, the metric %memused is not a significant identifier of memory related failures. Instead,
%soft n becomes more closely correlated with the occurrences of memory failures.
3.5.3. Analysis of Disk-Related Failures
Disks are also prone to faults [87, 104]. In our experiments, the fault injection agents in-
ject disk faults by blocking certain disk I/O operations or running background micro-benchmark
programs that continuously copy large files to disks to saturate the disk I/O bandwidth. Again, the
FIGURE 3.9. Failure-metric DAG for network-related failures in the cloud testbed.
FIGURE 3.10. Failure-metric DAG for network-related failures in the non-
virtualized system.
Coordinators collect the cloud-wide performance data and analyze the cloud dependability.
Figure 3.7 presents the DAG for failures caused by disk faults in the cloud computing
testbed. Figure 3.8 shows the result from the non-virtualized system. From the two figures, I
observe that more metrics are correlated with the failure occurrences.
In the non-virtualized system (Figure 3.8), 15 metrics highly correlate with the occurrences
of disk related failures. In addition to other CPU metrics, %iowait n and %nice n are directly
affected by disk I/O operations. It is interesting to notice that metrics such as rd sec/s and wr sec/s
are not included in the top correlated metrics. This is because these metrics have a more direct
influence on the values of processor related metrics.
In the cloud computing environment (Figure 3.7), 12 metrics are the top ones that are
correlated with the failures. Among them, the metric %sys all from Dom0 is the direct child
of the root node, which is different from the non-virtualized case. Compared with Figure 3.8,
virtualization has more significant impact on the metrics including %steal n and pgpgout/s for
disk related failures.
3.5.4. Analysis of Network-Related Failures
Networking hardware/software in cloud servers and switches and routers in the core net-
work may fail at runtime [111]. To generate network related failures, the fault injection agents
inject network faults by dropping certain incoming/outgoing network packets, flipping one or
multiple bits of packets to the opposite state, or attempting to use up the network bandwidth by
continuously transferring large files through the network. After the performance data are collected
from cloud servers, failure-metric DAGs are generated to analyze the cloud dependability under
network related failures. Figures 3.9 and 3.10 show the DAGs for the cloud computing testbed and
the non-virtualized system, respectively.
From Figure 3.10, I observe the occurrences of network failures are strongly correlated
with 12 metrics in the non-virtualized environment. Two metrics, %iowait n and %usr all, are the
direct children of the root node. In contrast, 16 metrics are included within the top three levels of
the DAG in Figure 3.9. For the cloud computing testbed, one metric, %usr all, is the direct child of
the root node. Three new metrics profiled from Dom0, fault/s, tcp-tw, and await dev8, are highly
correlated with the occurrences of network failures. They are closely related to the networking opera-
tions, including the number of packets, the number of TCP sockets, and the average processing
time by networking devices. Moreover, two metrics from user virtual machines, DomU, are among
the most significant ones. They are U %usr all and U %steal n, accounting for the time to process
a large number of network packets and to switch between virtual processors.
3.5.5. Analysis of All Types of Failures
In addition to studying individual types of failures, I analyze the cloud dependability under
any type of failures. The goal is to identify a set of metrics that can characterize all types of failures
and to understand the impact of virtualization on the metric selection.
To generate the failure-metric DAGs for this purpose, the Coordinators mix the cloud per-
formance data records together. The label of each record takes one of the two values: 0 or 1
denoting a “normal” or “failure” state. Figures 3.11 and 3.12 depict the DAGs for the cloud com-
puting testbed and the non-virtualized system, respectively. The root nodes represent the generic
failures.
TABLE 3.2. The metrics that are highly correlated with failure occurrences in the
cloud testbed using four-level failure-metric DAGs.
Failure type No. of correlated metrics No. of metrics from Dom0 No. of metrics from DomU
CPU-related failures 45 44 1
Memory-related failures 29 26 2
Disk-related failures 34 25 9
Network-related failure 32 31 1
All failures 25 24 1
FIGURE 3.11. Failure-metric DAG for all types of failures in the cloud testbed.
By comparing these two figures, I can find out the influence of virtualization on the system
dependability. In both cases, processor related metrics are the dominant ones.¹ Certain metrics in
these two DAGs also appear in the DAGs for individual types of failures. For the non-virtualized
case (Figure 3.12), a metric related with memory and disk operations, %vmeff, has a strong depen-
dency with the generic failures. In contrast, a hardware performance counter metric, DTLB-stores,
is highly correlated with failure occurrences in the cloud computing environment as shown in
Figure 3.11. Moreover, in Figure 3.11 and also preceding DAGs for the cloud computing environ-
ment, most of the correlated metrics are associated with Dom0. If more levels of the DAGs are
considered, more metrics from user virtual machines, DomU, correlate with failure occurrences.
However, there is little work on understanding the dependability of cloud computing environments.
As virtualization has been an enabling technology for cloud computing, it is imperative
to investigate the impact of virtualization on the cloud dependability, which is the focus of this
work.

¹Only the first three levels of the DAGs are depicted due to the limited space. When more levels are considered, metrics for other system components are incorporated for dependability analysis.
FIGURE 3.12. Failure-metric DAG for all types of failures in the non-virtualized system.

3.6. Summary
Large-scale and complex cloud computing systems are susceptible to software and hard-
ware failures, which significantly affect the cloud performance and management. It is imperative
to understand the failure behavior in cloud computing infrastructures. In this work, I study the
impact of virtualization, which has become an enabling technology for cloud computing, on the
cloud dependability. I present a cloud dependability analysis (CDA) framework with mechanisms
to characterize failure behavior in virtualized environments. I exploit failure-metric DAGs to an-
alyze the correlation of various cloud performance metrics with failure events in virtualized and
non-virtualized systems. We study multiple types of failures, including CPU-, memory-, disk-,
and network-related failures. By comparing the generated DAGs in the two environments, I gain
insight into the effects of virtualization on the cloud dependability.
CHAPTER 4
A METRIC SELECTION AND EXTRACTION FRAMEWORK FOR DESCRIBING CLOUD
PERFORMANCE ANOMALIES
4.1. Introduction
To characterize cloud behavior, identify anomalous states, and pinpoint the causes of fail-
ures, I need the runtime performance data collected from utility clouds. However, continuous
monitoring and large system scale lead to the overwhelming volume of data collected by health
monitoring tools. The size of system logs from large-scale production systems can easily reach
hundreds and even thousands of terabytes [70, 87]. In addition to the data size, the large number
of metrics that are measured makes the data model extremely complex. Moreover, the existence
of interacting metrics and external environmental factors introduces measurement noise in the col-
lected data. For the collected health-related data, there might be a maximum number of metrics
above which the performance of anomaly detection will degrade rather than improve. High metric
dimensionality will cause low detection accuracy and high computational complexity. However,
there is a lack of systematic approaches to effectively identifying and selecting principal metrics
for anomaly detection.
In this chapter, I present a metric selection framework for online anomaly detection in the
cloud. Among the large number of metrics profiled, I aim at selecting the most essential ones
by applying metric selection and extraction methods. Mutual information is exploited to quantify
the relevance and redundancy among metrics. An incremental search algorithm is proposed to
select metrics by enforcing maximal relevance and minimal redundancy. We apply metric space
combination and separation to extract essential metrics and further reduce the metric dimension.
The remainder of this chapter is organized as follows. Section 4.2 presents the proposed
metric selection framework with three mechanisms. Experimental evaluation and discussion are
described in Section 4.3. Section 4.4 presents the summary.
4.2. Cloud Metric Space Reduction Algorithms
To make anomaly detection tractable and yield high accuracy, we apply dimensionality re-
duction which transforms the collected health-related performance data to a new metric space with
only the most important metrics preserved [38]. I propose two approaches to reducing dimension-
ality: metric selection using mutual information and metric extraction by metric space combination
and separation. Metric selection refers to methods that select the best subset of the original metric set.
The term metric extraction refers to methods that create new metrics based on transformations or
combinations of the original metric set. The data presented in a low-dimensional subspace are
easier to classify into distinct groups, which facilitates anomaly detection.
4.2.1. Metric Selection
The metric selection process can be formalized as follows. Given the input health-related
performance data D including L records of N metrics M = {m_i, i = 1, ..., N} and the classi-
fication variable c, the task is to find, from the N-dimensional measurement space R^N, a subspace of n
metrics, R^n, that optimally characterizes c.
In this section, I present the metric selection algorithm based on mutual information (MI) [21]
as a measure of relevance and redundancy among metrics to select a desirable subset. MI has two
main properties that distinguish it from other selection methods. First, MI has the capability of
measuring any type of relationship between variables, because it does not rely on statistics of any
grade or order. The second property is MI’s invariance under space transformation.
The mutual information of two random variables quantifies the mutual dependence between
them. Let m_i and m_j be two metrics in M. Their mutual information is defined as I(m_i; m_j) =
H(m_i) + H(m_j) - H(m_i, m_j), where H(·) refers to the Shannon entropy [21]. Metrics in the
health-related performance data collected periodically from a cloud computing system usually
take discrete values. The marginal probability p(m_i) of metric m_i and the joint probability mass
function p(m_i, m_j) of two metrics m_i and m_j can be calculated using the collected data. Then, the MI
of m_i and m_j is computed as
(5) I(m_i; m_j) = \sum_{m_i \in M} \sum_{m_j \in M} p(m_i, m_j) \log \frac{p(m_i, m_j)}{p(m_i)\, p(m_j)}.
Intuitively, the MI between two metrics, I(mi;mj), measures the amount of information shared
between mi and mj . Metrics with high co-relevance have high MI. As special cases, I(mi;mi) =
1, while I(mi;mj) = 0 if mi and mj are independent.
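For concreteness, a small sketch of estimating I(m_i; m_j) from paired, discretized samples using the empirical joint and marginal distributions; this is a generic estimator, not the dissertation's exact implementation.

import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(X; Y) in bits from paired discrete samples xs, ys."""
    n = len(xs)
    p_xy = Counter(zip(xs, ys))
    p_x = Counter(xs)
    p_y = Counter(ys)
    mi = 0.0
    for (x, y), c in p_xy.items():
        pxy = c / n
        mi += pxy * math.log2(pxy / ((p_x[x] / n) * (p_y[y] / n)))
    return mi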
The goal of metric selection is to find from the original N metrics a subset S with n metrics
{m_i, i = 1, ..., n}, which jointly have the largest dependency on the class c. This can be accom-
plished by using two criteria in metric selection: maximal relevance and minimal redundancy. I use
the mean value of all MI values between individual metric m_i and class c to define the relevance.
The maximal relevance criterion is specified as
(6) \max \ \mathrm{relevance}(S), \qquad \mathrm{relevance} = \frac{1}{|S|} \sum_{m_i \in S} I(m_i; c),
where |S| is the cardinality of S. By applying Equation (6), irrelevant metrics can be removed.
However, the remaining metrics may have rich redundancy. As a result, the dependency
among these metrics may still be high. When two metrics highly depend on each other, their
class-discriminative capabilities do not change much, if one of them is removed. Therefore, I
additionally apply a minimal redundancy criterion to select independent metrics.
(7) \min \ \mathrm{redundancy}(S), \qquad \mathrm{redundancy} = \frac{1}{|S|^2} \sum_{m_i, m_j \in S} I(m_i; m_j).
I combine the two criteria (6 and 7) together to define the dependency of the selected metrics on
the class, dependency(S). To optimize relevance and redundancy simultaneously, I can use the
following equation.
(8) \max \ \mathrm{dependency}(S), \qquad \mathrm{dependency} = \mathrm{relevance}(S) - \mathrm{redundancy}(S).
The N metrics in the original metric set M define a search space of size 2^N. Finding the optimal
metric subset is NP-hard [5]. To find near-optimal metrics satisfying criterion (8), I apply an
incremental search method. Given S_{k-1}, a metric subset with (k-1) metrics, I try to select the k-th
metric that maximizes dependency(·) from the remaining metrics in (M - S_{k-1}). By including
Equations (6) and (7), the metric search algorithm looks for the k-th metric that optimizes the
following condition.
(9) \max_{m_i \in M - S_{k-1}} \left\{ I(m_i; c) - \frac{1}{k-1} \sum_{m_j \in S_{k-1}} I(m_i; m_j) \right\}.
The metric selection algorithm works as follows.
ALGORITHM 1. Metric selection algorithm
MetricSelection() {
1: Apply the incremental search following Equation (9) to select n metrics sequentially from the original metric set M. The value of n can be preset to a large number. The search process produces n nested metric sets, S_1 ⊂ S_2 ⊂ ... ⊂ S_n.
2: Check these metric sets S_1, ..., S_i, ..., S_n to find the range of i where the cross-validation error err_i has small mean and small variance.
3: Within that range, look for the smallest error err*. The optimal size of the metric subset, n*, equals the smallest i for which S_i has error err*. The corresponding S_{n*} is the selected metric subset.
}
The computational complexity of the incremental search method is O(|S| ·N).
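Under the same assumptions as before, the incremental search of Algorithm 1 can be sketched as below, reusing the mutual_information helper from the earlier sketch; the cross-validation step that picks the final subset size n* (Steps 2-3) is omitted for brevity.

def select_metrics(data, labels, n):
    """Incrementally select n metric names from `data` (dict: name -> list of
    discrete values) by maximizing relevance minus redundancy (Equation (9))."""
    relevance = {m: mutual_information(v, labels) for m, v in data.items()}
    selected = [max(relevance, key=relevance.get)]          # most relevant metric first
    while len(selected) < n:
        best, best_score = None, float("-inf")
        for m in data:
            if m in selected:
                continue
            redundancy = sum(mutual_information(data[m], data[s])
                             for s in selected) / len(selected)
            score = relevance[m] - redundancy                # Equation (9)
            if score > best_score:
                best, best_score = m, score
        selected.append(best)
    return selected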
4.2.2. Metric Space Combination and Separation
The metric extraction process creates new metrics by transformation or combination of the
original metrics. It applies a mapping x' = g(x): R^n → R^{n'} to transform a measurement x in
an n-dimensional space to a point x' in an n'-dimensional space with n' < n. It creates a subset
of new metrics by transformation of the original ones. The information most relevant to anomaly
detection in R^n is preserved. The goal is to reconstruct the health-related cloud performance dataset
to a space of fewer dimensions for more efficient and accurate anomaly identification. I explore
both metric space combination and metric space separation to find the most useful metrics and
reduce the dimension of metric space.
After the metric selection process (Section 4.2.1) is completed, the health-related cloud
performance dataset D contains L records (x_1, x_2, ..., x_L) with n metrics. Metric space combi-
nation transforms the L records from the n-dimensional space to L records (x'_1, x'_2, ..., x'_L) in a new
n'-dimensional space.
Let m_1, m_2, ..., m_n denote the n performance metrics. A measurement x_i in D can be
represented with {x_{j,i}}, the value of the j-th metric m_j of x_i; that is, x_i = [x_{1,i}, x_{2,i}, ..., x_{n,i}]^T.
Then, the cloud performance dataset D is represented by a matrix D = [x_1, x_2, ..., x_L]. To find
the optimal combination of the metric space, I calculate the covariance matrix of D as V = DD^T.
According to [26], in order to minimize the mean-squared error of representing the dataset by
n' orthonormal metrics, the eigenvalues of the covariance matrix V are used. We calculate the
eigenvalues {λ_i} of V and sort them in descending order as λ_1 > λ_2 > ... > λ_n.
The metrics with the largest variance caused by a changing faulty condition are identified by
checking their directions. I utilize this property to combine metrics for efficient anomaly detection.
An iterative algorithm is employed to search for the new combined metrics. The first n′ eigenvalues
that satisfy the following requirement are chosen.
(10) \frac{\sum_{i=1}^{n'} \lambda_i}{\sum_{i=1}^{n} \lambda_i} \geq \tau,
where τ is a threshold and τ ∈ (0, 1). The corresponding n' eigenvectors are the new metrics,
denoted by S' = {m'_i, i = 1, ..., n'}. The eigenvectors for metric space transformation are used
to select the most sensitive and relevant metrics. An iterative algorithm is employed to search for
{e_j}:
ALGORITHM 2. Metric space combination based metric extraction
MetricExtraction1() {
1: n' = the number of essential axes (eigenvectors) to estimate;
2: Compute the covariance matrix S;
3: for j = 1 up to n' do
4:   Initialize eigenvector e_j of size n × 1 randomly;
5:   while (1 - |e_j^T e_j|) > ε do
6:     e_j = S e_j;
7:     e_j = e_j - \sum_{k=1}^{j-1} (e_j^T e_k) e_k;
8:     e_j = e_j / ||e_j||;
9:   end while
10: end for
11: return e;
12: }
In Algorithm 2, Steps 7 and 8 apply the Gram-Schmidt orthogonalization process [44] followed by
normalization. ε is a small constant used to test the convergence of e_j: if
(1 - |e_j^T e_j|) < ε, then e_j has converged; otherwise e_j is updated iteratively.
Algorithm 2 converges quickly. It usually takes only two to five iterations to find an eigen-
vector. The computational complexity of the algorithm is O(n^2 n' + n^2 L), where n is the number
of metrics after metric selection (Section 4.2.1). To determine the value of n', i.e., the number of
essential metrics, a common practice is to first set a threshold for the percentage of total variance
to preserve; n' is then the smallest number of essential metrics that achieves this threshold.
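A compact sketch of the metric space combination step using NumPy's eigendecomposition in place of the iterative search in Algorithm 2; the threshold τ plays the role of Equation (10). This is an illustrative shortcut, not the dissertation's implementation.

import numpy as np

def combine_metrics(D, tau=0.9):
    """D is an n x L matrix (rows = metrics after selection, columns = records).
    Returns the n' leading eigenvectors whose eigenvalues retain a fraction
    tau of the total variance (Equation (10)), and the projected records."""
    V = D @ D.T                                    # covariance-like matrix V = D D^T
    eigvals, eigvecs = np.linalg.eigh(V)           # ascending order
    order = np.argsort(eigvals)[::-1]              # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cum = np.cumsum(eigvals) / np.sum(eigvals)
    n_prime = int(np.searchsorted(cum, tau) + 1)   # smallest n' meeting the threshold
    E = eigvecs[:, :n_prime]                       # new (combined) metrics
    return E, E.T @ D                              # records projected to n' dimensions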
In addition to the metric space combination, I also apply metric extraction approaches
based on metric space separation. They separate desired data from mixed data. They define a set
of new basis vectors for metric space separation. Let A denote the matrix with elements x_{j,i},
let x = [x_1, x_2, ..., x_L]^T, and let e = [e_1, e_2, ..., e_{n'}]^T denote the basis vectors. Then, x = Ae. After
estimating the matrix A from x, I calculate its inverse, denoted by W. Hence, the basis vectors can
be computed by e = Wx.
Before applying the metric extraction algorithm, the anomaly detector performs some pre-
processing on the cloud performance dataset. The mean of the data record vector is subtracted from
each data record so that the records have zero mean. A linear transformation is also applied to the dataset,
which makes its components uncorrelated and gives them unit variance. The goal of metric space sep-
aration is to find an optimal transformation matrix W so that the {e_j} are maximally independent. An
iterative algorithm is employed to search for W and hence the new separated metrics.
The metric extraction algorithm that computes the matrix W works as follows.
ALGORITHM 3. Metric space separation based metric extraction
MetricExtraction2() {
1: Initialize the matrix W = [w_1, w_2, ..., w_{n'}]^T randomly;
2: while (1 - |w_j^T w_j|) > ε for any j = 1, ..., n' do
3:   for p = 1 up to n' do
4:     w_{p+1} = w_{p+1} - \sum_{j=1}^{p} (w_{p+1}^T w_j) w_j;
5:     w_{p+1} = w_{p+1} / (w_{p+1}^T w_{p+1})^{1/2};
6:   end for
7: end while
8: return W;
9: }
In Algorithm 3, ε is a small constant used to test the convergence of W: if
(1 - |w_j^T w_j|) < ε for all j = 1, ..., n', then W has converged; otherwise W is updated iteratively. The
algorithm converges fast.
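Because the text leaves the independence-maximizing update itself abstract, the sketch below fills it in with a FastICA-style fixed-point step; the tanh nonlinearity and the symmetric orthonormalization are my assumptions, and only the centering, whitening, and orthonormalization steps correspond directly to the description and Algorithm 3.

import numpy as np

def separate_metrics(D, n_prime, iters=200, eps=1e-6):
    """Metric space separation sketch. D is an n x L matrix of records.
    Center and whiten the data, then iterate a FastICA-style update (assumed)
    until the rows of W stop changing; return W and the separated components."""
    X = D - D.mean(axis=1, keepdims=True)                # zero-mean records
    d, E = np.linalg.eigh(np.cov(X))
    d = np.maximum(d, 1e-12)
    X = E @ np.diag(1.0 / np.sqrt(d)) @ E.T @ X          # whitening: uncorrelated, unit variance
    W = np.random.rand(n_prime, X.shape[0])
    for _ in range(iters):
        W_old = W.copy()
        G = np.tanh(W @ X)                               # nonlinearity (assumption)
        W = (G @ X.T) / X.shape[1] - np.diag((1 - G**2).mean(axis=1)) @ W
        U, _, Vt = np.linalg.svd(W, full_matrices=False) # symmetric orthonormalization
        W = U @ Vt
        if np.max(np.abs(np.abs(np.sum(W * W_old, axis=1)) - 1)) < eps:
            break
    return W, W @ X                                      # e = W x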
4.3. Performance Evaluation
As a proof of concept, I implement a prototype of the proposed metric selection framework
and evaluate its performance based on the collected performance metrics data from our cloud
TABLE 4.1. Normalized mutual information values for 12 metrics of CPU and
2: On receiving a cloud performance record x_t
3:   if x_t - M_t < τ then
4:     Report the anomaly state
5:   end if
6: On receiving a verified failure or an observed but undetected failure record
7:   MRPCSelect()
8: end while
9: }
6.4. Analysis of Cloud Anomalies
6.4.1. Anomaly Detection and Diagnosis Results
In this section, I study the four types of failures caused by CPU-related faults, memory-
related faults, disk-related faults, and network-related faults. For each failure type, I present the
experimental results on MRPC selection and discuss the root cause analysis on each MRPC.
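As background for the per-failure-type analyses below, the MRPC ranking itself can be sketched as follows: project the profiled records onto principal components and rank the components by the absolute correlation of their time series with the binary fault label. The data layout and the ranking cutoff here are illustrative assumptions.

import numpy as np

def rank_mrpcs(X, fault_label, top_k=4):
    """X: records x metrics matrix; fault_label: 0/1 array, one entry per record.
    Returns (1-based PC index, correlation) pairs for the top_k MRPCs."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T                                   # time series of every PC
    corr = np.array([abs(np.corrcoef(scores[:, i], fault_label)[0, 1])
                     for i in range(scores.shape[1])])
    order = np.argsort(corr)[::-1][:top_k]               # most relevant PCs (MRPCs)
    return [(int(i) + 1, float(corr[i])) for i in order]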
6.4.2. MRPCs and Diagnosis of Memory Related Failures
Figure 6.8(a) shows the correlation between the PCs and the memory related faults. As I have
discussed, the 1st and 2nd principal components do not possess high causal correlation with the
occurrences of failures (only 0.16 and 0.04). This indicates that the memory related failures have little
dependency upon them. On the contrary, the 3rd, 5th, 8th and 31st PCs display high correlation
with the failure records (greater than 0.2), as listed in Table 6.1. Figure 6.4(a) shows that the 3rd PC
clearly distinguishes the failure states from the normal states.
FIGURE 6.4. MRPCs of memory-related failures. (a) The time series of the 3rd principal component (x-axis: time in minutes). (b) The coefficients of the performance metrics to the 3rd principal component; the metric avgrq-sz displays the highest contribution to the MRPC.
Based on the MRPC selection procedure described earlier, the synaptic weights w_{ji} represent the
quantified impact of the original metric space on the anomaly-specific subsets. Considering that these
weights w_{ji} can be either positive or negative, I exploit |w_{ji}| to identify the effect of each
performance metric's contribution to the anomaly-specific subsets. The computed weights are
shown in Figure 6.4(b) for the 3rd principal component, which is selected as the top ranked MRPC
with regard to memory related failures. In addition, one performance metric has a dominant contribution
TABLE 6.1. MRPCs ranked by correlation with faults (for each major type, 25 faults are injected into the testbed).
Fault Type      Rank  Order of PC  Correlation Coefficient to Fault
Memory Fault    1     3            0.3898
Memory Fault    2     5            0.2840
Memory Fault    3     8            0.2522
Memory Fault    4     31           0.2043
I/O Fault       1     5            0.4283
I/O Fault       2     7            0.2961
I/O Fault       3     3            0.2402
CPU Fault       1     35           0.3738
CPU Fault       2     40           0.3424
CPU Fault       3     103          0.2559
Network Fault   1     29           0.3532
Network Fault   2     27           0.2733
Network Fault   3     23           0.2715
to this MRPC, with a weight of 0.65, while several other performance metrics have weights
around 0.1-0.2. By checking the list of performance metrics, I find that the highest weighted metric
is "avgrq-sz dev253-1", which is "the average size (in sectors) of the requests that were issued to
the hard drive device 253-1" [2]. Given that the memory related failures are injected by continuously
allocating memory over a short period, the swap space is put into use after the physical memory is
used up. As a result, this process causes more requests to be issued to the hard disk.
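The root-cause step just described reduces to ranking the metrics by the absolute value of their loading |w_ji| on the selected MRPC; a small illustrative helper (metric names are placeholders):

import numpy as np

def top_contributors(loading, metric_names, k=3):
    """loading: the coefficient vector of one principal component;
    returns the k metrics with the largest absolute weights |w_ji|."""
    idx = np.argsort(np.abs(loading))[::-1][:k]
    return [(metric_names[i], float(loading[i])) for i in idx]

# For the memory-related MRPC described above, the top entry would be the
# metric "avgrq-sz dev253-1", pointing at the extra disk requests caused by swapping.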
6.4.3. MRPCs and Diagnosis of Disk Related Failures
Disk-related faults are injected by continuously issuing a big volume of disk requests to
saturate the I/O bandwidth. The causal correlation with the disk related failures is computed for
FIGURE 6.5. MRPCs of disk-related failures. (a) The time series of the 5th principal component (x-axis: time in minutes). (b) The coefficients of the performance metrics to the 5th principal component; the metric rd-sec/s dev-253 displays the highest contribution to the MRPC.
each principal component, as shown in Figure 6.8(b). The top ranked MRPCs are listed in
Table 6.1. The 5th PC possesses the highest correlation with the disk-related failures, as its causal
correlation is more than 0.42. Analysis of the time series of the 5th principal component, plotted
in Figure 6.5(a), shows that most of the anomalies could be identified by setting a proper thresh-
old. From Figure 6.5(b), the performance metric named "rd-sec/s dev-253", with a coefficient of
0.4423, contributes to the 5th principal component more than the other performance metrics. It refers
FIGURE 6.6. MRPCs of CPU-related failures. (a) The time series of the 35th principal component (x-axis: time in minutes). (b) The coefficients of the performance metrics to the 35th principal component; the metric ldavg displays the highest contribution to the MRPC.
to the number of sectors read from the device, which is an indicator that characterizes the symptom
of I/O related failures.
6.4.4. MRPCs and Diagnosis of CPU Related Failures
CPU-related faults are injected by employing infinite loops that use up all CPU cycles.
Table 6.1 lists the MRPCs with the highest correlation with the CPU-related failures. Figure 6.6(a)
FIGURE 6.7. MRPCs of network-related failures. (a) The time series of the 29th principal component (x-axis: time in minutes). (b) The coefficients of the performance metrics to the 29th principal component; the metric %usr 4 displays the highest contribution to the MRPC.
presents the time series of the 35th principal component. From the figure, I can see that some CPU-
related failures are not easily identifiable, e.g., the failures that occurred around the 1250th and
1500th minutes. Figure 6.6(b) plots the weights of the cloud performance metrics for the 35th principal
component. With the largest weight of 0.3874, "ldavg-15" refers to "the load average calculated as the
average number of runnable or running tasks (R state) and the number of tasks in uninterruptible
sleep (D state) over the past 15 minutes". The second and third largest weights correspond to the
performance metrics "ldavg-5" and "%sys all", respectively. "%sys all" refers to the average CPU
utilization in system mode over all processors. All three performance metrics characterize the
process behavior under failures.
6.4.5. MRPCs and Diagnosis of Network Related Failures
Network-related faults are injected by saturating the network bandwidth by continuously
transferring large files between servers. In cloud computing systems, denial-of-service attacks,
virus infections, and failures of switches and routers may cause this type of anomaly. The MR-
PCs are listed in Table 6.1. The 29th principal component is highly correlated with the network
related failures. The 27th and 23rd principal components are ranked second and third, as
shown in Figure 6.8(d). In Figure 6.7(a), I can see the 29th principal component is sensitive to the
occurrences of network-related failures and the failures are distinguishable from normal states. Fig-
ure 6.7(b) shows that "%usr 4" possesses the highest weight of 0.292. The second and third highest
weights are associated with the performance metrics "%idle 4" and "svctm dev8-0" (i.e., "the average
service time (in milliseconds) for I/O requests that were issued to the device"). Both %usr 4 and
%idle 4 represent the states of processor core 4, which is assigned to the virtual machine where
the network-related faults are injected. Therefore, MRPCs can assist cloud operators not only in
identifying anomalies, but also in localizing faults, even within virtual machines.
6.4.6. The Accuracy of Anomaly Identification
I study the performance of several anomaly detection techniques including our proposed
MRPC-based detection approach. I use receiver operating characteristic (ROC) curves to
present the experimental results. An ROC curve displays the true positive rate (TPR) and the
false positive rate (FPR) of the anomaly detection results. The area under the curve is used to
evaluate the detection performance; a larger area implies higher sensitivity and specificity.
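For reference, a minimal NumPy sketch of producing the ROC points and the area under the curve from anomaly scores and ground-truth labels; it is a generic routine, independent of which detector produced the scores.

import numpy as np

def roc_curve_points(scores, labels):
    """Return (FPR, TPR) arrays by sweeping a threshold over the anomaly scores,
    plus the area under the curve computed with the trapezoidal rule."""
    order = np.argsort(scores)[::-1]
    labels = np.asarray(labels)[order]
    tps = np.cumsum(labels)                 # true positives as the threshold drops
    fps = np.cumsum(1 - labels)             # false positives
    tpr = tps / max(labels.sum(), 1)
    fpr = fps / max((1 - labels).sum(), 1)
    auc = np.trapz(tpr, fpr)
    return fpr, tpr, auc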
I compare the performance of the proposed MRPC-based anomaly detection approach
with four widely used detection algorithms: decision tree, Bayesian network, support vector
machine (SVM), and the 1st principal component (using a Kalman filter to detect anomalies). Our MRPC-
based anomaly detector achieves the best performance, with the true positive rate reaching 91.4%
while keeping the false positive rate as low as 3.7%. By applying only the first principal compo-
nent, the false positive rate is higher than 40% in order to achieve a 90% true positive rate. Among
the other detection algorithms, the Bayesian network is relatively better, reaching a TPR of 74.1% with
a low FPR. The experimental results show that the detector based only on the first principal component
has the worst performance in identifying the performance anomalies.
On average, it takes 6.81 seconds for a control node in the cloud to process the cloud performance
data, select MRPCs, and make an anomaly detection decision.
6.4.7. Experimental Results using Google Datacenter Traces
In addition to the experiments on our cloud computing testbed, I evaluate the performance
of the proposed MRPC-based anomaly detection mechanism by using the performance and event
traces collected from a Google datacenter [80]. The Google datacenter trace is the first publicly
available dataset collected from a large number (about 13,000) of multi-purpose servers over 29
days. In the dataset, multiple task related events are recorded; among them, I focus on the failure
events. In total, 13 resource usage metrics are profiled periodically, which are listed in Table
6.2. The measurement period is typically 5 minutes (300 s), and within each measurement period,
measurements are usually taken at 1-second intervals. By applying the MRPC selection algorithm
presented earlier, I obtain the causal correlation between principal components and
failure events, which is plotted in Figure 6.9. The 13th principal component retains the highest
correlation (i.e., 0.18) with the failures.
The ROC curves in Figure 6.10 show the performance of the proposed anomaly detection
approach and the other four detection algorithms. By exploiting MRPCs, I achieve a TPR of 81.5%
with an FPR of 27%. These results outperform all other tested detection methods by 22.9%-68.7% in TPR
at the same FPR. The performance of the proposed anomaly identification mechanism
is somewhat worse on the Google traces. This is caused by the higher dynamicity and variety of
workloads, more complex interactions among system components, the smaller number of performance
metrics, and incomplete information about failure types. Our anomaly detector still provides valuable
information about failure dynamics, which helps system operators proactively reconfigure
resources and schedule workloads.
TABLE 6.2. Performance metrics in the Google datacenter traces
Index Performance Metrics
1 Number of running tasks
2 CPU rate
3 Canonical memory usage
4 Assigned memory usage
5 Unmapped page cache
6 Total page cache
7 Maximum memory usage
8 Disk I/O time
9 Local disk space usage
10 Maximum CPU rate
11 Maximum disk I/O time
12 Cycles per instruction
13 memory accesses per instruction
The main contribution of this work is that, to the best of our knowledge, it is the first to use
subsets of principal components as the most relevant metrics for different types of failures. Through
the analysis of each failure type, I show that anomalies are highly correlated with specific principal
component subsets. Moreover, MRPCs can be applied to uncover the root causes of failures and to
guide timely maintenance.
6.5. Summary
Modern large-scale and complex cloud computing systems are susceptible to software and
hardware failures, which significantly affect the cloud dependability and performance. In this
chapter, I present an adaptive anomaly identification mechanism in cloud computing systems. I
start by analyzing the correlation of the principal components with failure occurrences, where I
find that the PCs retaining the highest variance cannot effectively characterize the failure events, while
lower order PCs display high correlation with the occurrences of failures. I then propose to exploit
the most relevant principal components (MRPCs) to describe failure events and devise a learning
based approach to identify and diagnose cloud anomalies by leveraging MRPCs. The anomaly
detector adapts itself by recursively learning from these newly verified detection results to refine
future detections. Meanwhile, it exploits the observed but undetected failure records reported by
the cloud operators to identify new types of failures. Experimental results from an on-campus
cloud computing testbed show that the proposed MRPC-based anomaly identification mechanism
can accurately detect failures while achieving a low overhead. Learning from the MRPC subspaces
that relate to each type of failure, I gain knowledge of the root causes of failures.
FIGURE 6.8. Correlation between the principal components and different types of failures. (a) Correlation with memory-related failures. (b) Correlation with disk-related failures. (c) Correlation with CPU-related failures. (d) Correlation with network-related failures.
FIGURE 6.9. Correlation between principal components and failure events using the Google datacenter trace.
FIGURE 6.10. Performance of the proposed MRPC-based anomaly detector compared with four other detection algorithms (decision tree, Bayesian network, support vector machine, and the 1st principal component) on the Google datacenter trace, shown as ROC curves (TPR vs. FPR).
CHAPTER 7
F-SEFI: A FINE-GRAINED SOFT ERROR FAULT INJECTION FRAMEWORK
7.1. Introduction
In order to facilitate the testing of application resilience methods, I present a fine-grained
soft error fault injector named F-SEFI. F-SEFI allows for the targeted injection of soft errors into
instructions belonging to applications of interest and into those applications' individual subroutines. F-
SEFI leverages the QEMU [92] virtual machine (VM) and its hypervisor. QEMU uses the Tiny Code
Generator (TCG) to reference and translate instruction sets between the guest and host architec-
Generation (TCG) to reference and translate instruction sets between the guest and host architec-
tures before the instructions are delivered to the host system for execution. F-SEFI provides the
ability to emulate soft errors and corrupt data at runtime by intercepting instructions and replacing
them with contaminated versions during the TCG translation. With the addition of a binary symbol
table, F-SEFI supports a tunable fine-grained injection strategy where soft errors can be injected
into chosen instructions in specified functions of an application. In addition, F-SEFI allows multi-
ple fault models to mimic upsets in hardware (e.g., a probabilistic model, a single-bit fault model,
and a multiple-bit fault model). Overall, F-SEFI manages the fault injections and the user decides
where, when, and how to inject faults.
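To make these fault models concrete, here is a small Python sketch (for illustration only; F-SEFI itself performs the corruption inside QEMU's TCG translation) of single-bit, multiple-bit, and probabilistic flips applied to a 64-bit value:

import random

def flip_bits(value, positions):
    """Flip the given bit positions of a 64-bit integer value."""
    for p in positions:
        value ^= (1 << p)
    return value & 0xFFFFFFFFFFFFFFFF

def single_bit_fault(value):
    return flip_bits(value, [random.randrange(64)])

def multi_bit_fault(value, n_bits=3):
    return flip_bits(value, random.sample(range(64), n_bits))

def probabilistic_fault(value, p=0.001):
    """With probability p, corrupt the value; otherwise leave it untouched."""
    return single_bit_fault(value) if random.random() < p else value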
I implemented a prototype F-SEFI system and conducted the fault injection campaign on
multiple HPC applications. The experimental results show that the effect of the injected faults is
amplified when the fault propagates to other software components, resulting in a number of silent
data corruptions at multiple sites. F-SEFI provides sufficient instruction-level soft error samples
under different fault models, which helps programmers understand the vulnerabilities of the underlying
HPC applications and further helps in designing resilience strategies to mitigate the impact of SDCs.
The rest of this chapter is organized as follows. Section 7.2 presents the coarse-grained soft
error fault injection (C-SEFI) platform, which requires gdb to manually snoop and inject soft
errors into specific applications. Section 7.3 describes the design goals of the fine-grained soft error
fault injector (F-SEFI) and the capabilities of F-SEFI. Fault definitions and models supported
in F-SEFI are presented in Section 7.3.2. Section 7.3.3 depicts the fault injection mechanism and
FIGURE 7.1. Overview of C-SEFI
the implementation of components of F-SEFI. Cases studies on three widely used benchmarks
are demonstrated in Section 7.3.4. Discussion and conclusion are presented in Section 7.4 and
Section 7.5.
7.2. A Coarse-Grained Soft Error Fault Injection (C-SEFI) Mechanism
C-SEFI’s logic soft error injection operational flow is roughly depicted in Figure 7.1. First,
the guest environment is booted and the application to inject faults into is started. Next, I probe
the guest operating system for information related to the code region of the target application and
notify the VM which code regions to watch. Then the application is released, allowing it to run.
The VM observes the instructions occurring on the machine and augments ones of interest. A more
detailed explanation of these techniques follows.
7.2.1. C-SEFI Startup
Initial startup of C-SEFI begins by simply booting a debug-enabled Linux kernel within a
standard QEMU virtual machine. QEMU allows me to start a gdbserver within the QEMU monitor
such that I can attach to the running Linux kernel with an external gdb instance. This allows me
to set breakpoints and extract kernel data structures from outside the guest operating system as
well as from outside QEMU itself. This is a fairly standard technique used by many Linux kernel
developers. Figure 7.2 depicts the startup phase.
7.2.2. C-SEFI Probe
Once the guest Linux operating system is fully booted and sitting idle, I use the attached
external gdb to set a breakpoint at the end of the sys exec call tree but before an application is
sent to a CPU to be executed. I am currently focused only on ELF binaries and have therefore
set the breakpoint at the end of the load elf binary routine. This is trivial to generalize to other
FIGURE 7.2. SEFI’s startup phase
FIGURE 7.3. C-SEFI’s probe phase
binary formats in future work. With the breakpoint set, I am free to issue a continue command via gdb to
allow the Linux kernel to operate. The application of interest can now be started and will almost
immediately hit the breakpoint and bring the kernel back to a stopped state. By this point in the
exec procedure the kernel has already loaded an application’s text section into physical memory in
a memory region denoted by the start code and end code elements of the task’s mm struct memory
structure. I can now extract the location in memory assigned to our application by the kernel by
walking the task list in the kernel. Starting with the symbol init task, I can find the application of
interest either by comparing a binary name to the task struct’s comm field or by searching for a
known pid which is also contained in the task struct. The physical addresses within the VM of the
application’s text region can now be fed into our fault injection code in the modified QEMU virtual
machine. Currently this is done by hand but I have plans to automate this discovery and transfer
using scripts and hypervisor calls.
Figure 7.3 depicts the probe phase of C-SEFI.
7.2.3. C-SEFI Fault Injection
Once QEMU has the code segment range of the target application, the application is resumed. Next, when any opcode that I am interested in injecting faults into is executed in the guest hardware, QEMU checks the current instruction pointer register (EIP). If that instruction pointer address is within the range of the target application (obtained during the probe phase), QEMU knows that the targeted application is executing this particular instruction. At this point I am able to inject any number of faults with confidence that only the desired application is affected.
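To make the guard concrete, the check can be thought of as the small predicate below; the variable and helper names (target_start, target_end, in_target_app) are illustrative rather than the actual symbols in the modified QEMU.

#include <stdint.h>

/* Text-segment bounds of the target application, filled in during the probe
 * phase via the hypervisor call. */
static uint32_t target_start, target_end;

/* Return nonzero when the guest EIP falls inside the target's code region,
 * i.e. when the instruction being emulated belongs to the target application. */
static int in_target_app(uint32_t eip)
{
        return eip >= target_start && eip < target_end;
}

Only when this predicate holds does an opcode helper go on to corrupt its result; every other process on the guest, including the kernel, sees the unmodified instruction semantics.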
This approach is novel for several reasons. Causing opcodes in emulated machine hardware to produce wrong results is not particularly novel or complex. What is complex is doing it only in applications of interest and not every time that instruction is called. For instance, making every add operation on the machine faulty would neither be interesting nor allow the kernel to
boot. Our technique of pinpointing which instructions are being executed by an application affords
us this capability.
FIGURE 7.4. C-SEFI’s fault injection phase
Figure 7.4 depicts this fault injection phase of the C-SEFI logic plug-in. In the first step
of this phase, QEMU brings in the code segment range obtained in the previous probe phase. This range is passed into QEMU by a new hypervisor call that I added. Next, the gdb breakpoint is removed. The application is then resumed and continues operating as normal. Once the application executes opcodes that I am monitoring, the fault injection code inside of
QEMU can determine if, and how, to insert a simulated soft error in that opcode. Finally, the
application continues to run in this state and I observe and analyze how the injected fault is handled
in the application.
The opcode fault injection code has several capabilities. First, it can flip a bit in the inputs of the operation; this simulates a soft error in the input registers used for the operation. Second, it can flip a bit in the output of the operation; this simulates either a soft error in the actual operation of the logic unit (such as a faulty multiplier) or a soft error in the register where the data value is stored. Currently the bit flipping is random, but it can be seeded to produce errors in a specified bit range. Third, the opcode fault injection can perform more complicated changes to the output of an operation by flipping multiple bits in a pattern consistent with an error in part, but not all, of an opcode's physical circuitry. For example, consider how the output of adding two floating point numbers with different exponents changes if a transient error occurs in one of the numbers while the significands are being aligned so that they can be added. By carefully considering the elements of such an operation, I can alter its output to reflect the different possible incorrect outputs that might occur.
The fault injector also has the ability to let some calls to the opcode go unmodified. It is
possible to cause the faults to occur after a certain number of calls or with some probability. In this way, the fault can occur on every call, which closely emulates permanently damaged hardware, or on a single call only, which emulates a transient soft error.
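A compact way to picture these controls is a small policy record consulted on every monitored call. The structure and function names below are hypothetical and only sketch the behavior described above.

#include <stdint.h>
#include <stdlib.h>

typedef struct {
        uint64_t calls_seen;     /* monitored calls observed so far            */
        uint64_t trigger_call;   /* inject on exactly this call (0 = disabled) */
        double   probability;    /* or inject with this per-call probability   */
        int      bit_lo, bit_hi; /* inclusive bit range eligible for flipping  */
} inject_policy;

/* Return the (possibly corrupted) result of one monitored operation. */
static uint64_t maybe_corrupt(inject_policy *p, uint64_t value)
{
        p->calls_seen++;
        int fire = (p->trigger_call != 0 && p->calls_seen == p->trigger_call) ||
                   (p->probability > 0.0 &&
                    (double)rand() / RAND_MAX < p->probability);
        if (!fire)
                return value;                      /* let this call pass untouched */

        int span = p->bit_hi - p->bit_lo + 1;
        int bit  = p->bit_lo + rand() % span;      /* pick one bit in the range    */
        return value ^ (1ULL << bit);              /* single-bit flip              */
}

With a probability of 1.0 every call is faulty, which approximates permanently damaged hardware, while firing on a single trigger_call models a transient soft error.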
Most importantly, whenever I cause a fault to occur I know precisely what the instruction
pointer was at that time. Using this information I should be able to reference back to the original
source code. An obvious complication is that there is no readily available one-to-one mapping between high-level language source code and the machine code generated by the compiler and assembler. However, if the target application is compiled with debug symbols, I can at the very least recognize which function the application was in when I injected the fault. This, coupled with careful code organization, should make this mapping more feasible.
7.2.4. Performance Evaluation of C-SEFI
To demonstrate C-SEFI's capability to inject errors into specific instructions, I present two simple experiments. For each experiment I modified the translation of the instruction of interest inside QEMU. Once the instruction was called, the modified QEMU checked the current instruction pointer (EIP) to see whether the address was within the range of the target application; if so, a fault could be injected. I performed two experiments in this way, injecting faults into the floating point multiply and floating point add operations.
For this experiment I instrumented the floating point multiply operation, fmul, in QEMU. I created a toy application that evaluates Equation 28 iteratively for 40 iterations. The variable y is initialized to 1.0.

(28) y = y ∗ 0.9
Then, at iteration 10 I injected a single fault into the multiplication operation by flipping a random
bit in the output. Figure 7.5 plots the results of this experiment. The thick solid line represents the fault-free output. The other five lines represent separate executions of the
application with different random faults injected. Each fault introduces a numerical error in the
results which continues through the lifetime of the program.
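For reference, the toy workload is small enough to reproduce in a few lines. The sketch below simply mirrors the description above (the printing is only there to make the divergence visible) and is not the exact program used in the experiment.

#include <stdio.h>

int main(void)
{
        double y = 1.0;                       /* initial value            */
        for (int i = 1; i <= 40; i++) {
                y = y * 0.9;                  /* Equation (28): one fmul  */
                printf("iteration %2d: y = %.17g\n", i, y);
        }
        return 0;
}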
I focus on two areas of interest from this plot in Figures 7.6 and 7.7. In Figure 7.6 the plot is zoomed in on the point where the five faults are injected so that they are easier to see. Figure 7.7 focuses on the final results of the application. In this figure it becomes clear that each fault caused an error that propagated through the application to the final results.
7.3. A Fine-Grained Soft Error Fault Injection (F-SEFI) Framework
In the previous section I discussed the coarse-grained SEFI (C-SEFI), which validates the idea of designing a soft error fault injector with minimal modifications to the environment and the source code. However, C-SEFI is impractical because it requires the user to pause the system and extract application knowledge manually in order to inject faults into a specific application. Moreover, C-SEFI only supports a coarse injection granularity; it is difficult to inject faults into specific subroutines, which limits its capability to profile an application's vulnerability to soft errors. Based on the study of C-SEFI, I propose a fine-grained soft error fault injection platform that not only inherits the key features of C-SEFI but also combines all of the features I desired in a tool meant to study the behavior of applications in the presence of soft errors. The key features of F-SEFI are summarized as follows:
FIGURE 7.5. The multiplication experiment uses the floating point multiply in-
struction where a variable initially is set to 1.0 and is repeatedly multiplied by 0.9.
For five different experiments a random bit was flipped in the output of the multiply
at iteration 10, simulating a soft error in the logic unit or output register.
FIGURE 7.6. Experiments focusing on the injection point. It can be seen that each of the five separately injected faults causes the value of y to change, once radically and the other times only slightly.
FIGURE 7.7. Experiments focusing on the effects on the final solution. It can be seen that the final output of the algorithm differs due to these injected faults.
7.3.1. F-SEFI Design Objectives
Non-intrusion: in designing F-SEFI I was keenly focused on providing fault injection with as little impact on the operating environment as possible. Our approach is non-intrusive in that it requires no modifications to the application source code, compilers, third-party software, or the operating system. It does not require custom hardware and runs entirely in user space, so it can be run
on production supercomputers alongside scientific applications. These constraints are pragmatic at
a production DOE facility and also exclude any possibility of side effects due to intrusive changes.
Additionally, our approach allows other applications to run alongside the application under fault
injection. In particular, this facilitates studies in resilient libraries and helper applications.
Infrastructure Independence: F-SEFI is designed as a module of the QEMU hypervisor and, therefore, benefits from virtualization. Since the hypervisor supports a wide range of platforms, so does our fault injection capability. This enables us to explore hardware that I might not physically have as well as to explore new hardware approaches. For instance, I can implement triple-
modular redundancy (TMR) in certain instructions and generate errors probabilistically to evaluate
classes of applications that might be resilient on such hardware. In addition, since all guest OSs
are isolated, multiple target guest OSs from different architectures can work at the same time with-
out any interference. Faults can then be contained and I can run multiple applications in different
guest OSs and inject faults into them concurrently. Similarly, since F-SEFI can target a specific application, I can inject into multiple applications running within the same guest OS. This can help
reduce the effects of the virtualization overhead by studying multiple applications (or input sets)
concurrently.
Application Knowledge: F-SEFI performs binary injection dynamically without augmenting the source code. Moreover, it adapts to the dynamicity of data objects, covering all static and
dynamic data. This is especially useful for applications that operate on random data or whose fault
characteristics vary when given different input datasets. F-SEFI does not require the memory ac-
cess information of the data objects at runtime. All the injections target the instructions, covering
the opcodes, addresses, and data in registers copied from memory.
Tunable Injection Granularity: F-SEFI supports a tunable injection granularity, allowing it
to inject faults semantically. Faults can target the entire application or focus in on specific func-
tions. Furthermore, the faults can be configured to corrupt specific operands and specific bit ranges.
Particularly with function-level injection, F-SEFI can provide a gprof-like [6] vulnerability profile
which is useful to programmers analyzing coverage vulnerability. While fine-grained tunability op-
erates on the symbol table extracted from an unstripped binary, F-SEFI can still do fault injections
into random locations in the application if the symbol table is not available.
Injection Efficiency: F-SEFI can be configured to inject faults only in specific micro-operations and to stay out of the way of all others. As such, it can be configured to cause only SDCs by flipping bits in mathematical operations. Alternatively, it can be used to explore control corruptions (such as looping and jumps) or crashes (accessing protected memory, etc.). This generality allows a user of
F-SEFI to focus their attention on studying the effects of specific SDC scenarios.
7.3.2. F-SEFI Fault Model
In this work I consider soft errors that occur in the functional units (e.g., ALU and FPU)
of the processor. In order to produce SDCs, I corrupt the micro-operations executed in the ALU
(e.g., XOR) and FPU (e.g., FADD and FMUL) unit(s) by tainting values in the registers during
instruction execution. Fault characteristics can be configured in several ways to comprehensively
study how an application responds.
Faulty Instruction(s): soft errors can be injected into any machine instruction. In this work I study corrupted FADD, FMUL, and XOR instructions, which are summarized in Table 7.1. Since QEMU performs guest-to-host instruction translation, I merely modify this translation process to perform the type of corruption I want to study.

TABLE 7.1. Fault types for injection

Fault Type    Description
FADD          Bit-flip in the floating point addition micro-operation.
FMUL          Bit-flip in the floating point multiplication micro-operation.
XOR           Bit-flip in the xor micro-operation.
Random and Targeted: F-SEFI offers both random (for coarse-grained) and targeted (for
fine-grained) fault injections. Initial development of the tool demonstrated the coarse-grained in-
jection by randomly choosing instructions to corrupt in an application [4]. This technique provides
limited resilience evaluation at the application level. F-SEFI now also has the ability to do targeted
fault injection into specific instructions and functions of an application. This allows a finer-grained
study of the vulnerabilities of an application.
Single and Multiple-Bit Corruption: any number of bits can be corrupted in an instruction
using F-SEFI. This allows for studying how applications would behave in the absence of various forms of error protection, as well as studying faults that cause silent data corruption.
Deterministic and Probabilistic: while injecting faults into instructions, F-SEFI can deter-
ministically flip any bit of the input or output register(s). It can also be configured to apply a
probability function to determine which bits are more vulnerable than others. For example, one
can target the exponent, mantissa, or sign bit(s) of a floating point value.
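One way to picture field-targeted corruption of a double (sign bit 63, exponent bits 62 down to 52, mantissa bits 51 down to 0 in IEEE 754) is the following sketch; flip_double_bit is an illustrative helper, not part of F-SEFI's interface.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Flip a single bit of a double by reinterpreting its 64-bit pattern. */
static double flip_double_bit(double x, int bit)
{
        uint64_t u;
        memcpy(&u, &x, sizeof u);   /* well-defined type punning */
        u ^= (uint64_t)1 << bit;    /* flip the chosen bit       */
        memcpy(&x, &u, sizeof x);
        return x;
}

int main(void)
{
        double v = 3.141592653589793;
        /* Target only the exponent field, bits 52..62. */
        double faulty = flip_double_bit(v, 52 + rand() % 11);
        printf("original %.17g, corrupted %.17g\n", v, faulty);
        return 0;
}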
7.3.3. F-SEFI Fault Injection Mechanisms
F-SEFI leverages extensive open source work on the QEMU processor emulator and virtu-
alizer by interfacing with the hypervisor as a plug-in module. After the QEMU hypervisor starts
a virtual machine image, the F-SEFI broker is loaded dynamically. As instructions are issued by
applications running within a guest OS, F-SEFI intercepts these instructions and potentially corrupts them before sending them on to the host kernel. This interaction is depicted in Figure 7.8. F-SEFI runs entirely in user space and can be run as a "black box" on the command line. This launches the tool, performs the fault injections, tracks how the application responds, and logs all the results back to the host file system. This is particularly useful for batch-mode analysis in campaign studies of application vulnerability. F-SEFI consists of five major components: the profiler, configurator, probe, injector, and tracker, which are shown in Figure 7.9. These are explained in more detail in the next few sections.

FIGURE 7.8. The overall system infrastructure of F-SEFI: guest OSs run on top of the VM hypervisor (QEMU), which hosts the F-SEFI broker and its log, above the host kernel and the hardware.

FIGURE 7.9. The components of the F-SEFI broker: the Profiler collects information about the target instructions; the Configurator configures the Probe and Injector with multiple features; the Probe snoops the EIP before the execution of each guest code block; the Injector uses a bit-flipper to contaminate the target application/function; and the Tracker logs all injection events.
Profiler: as with most dynamic fault injectors, F-SEFI profiles the application to gather in-
formation about it before injecting faults. As described in Section 7.3.2, F-SEFI can target specific
instructions for corruption. It is in this profiling stage that F-SEFI gathers information about how
many occurrences of each instruction there are as well as their relative location within the binary. It
is also in this profiling stage that the function symbol table (FST) is extracted from the unstripped
binary. This allows F-SEFI to understand where the application's functions start and end. Then the Execution Instruction Pointer (EIP) is observed through QEMU to trace where the application is at runtime. Figure 7.10 shows the relevant information from a sample symbol table used in a later case study.

Num:  Value     Size  Type  Bind    Vis      Ndx  Name
 65:  08048130   136  FUNC  GLOBAL  DEFAULT   13  find_nearest_point
 86:  080489a0   143  FUNC  GLOBAL  DEFAULT   13  clusters
101:  080491f0   661  FUNC  GLOBAL  DEFAULT   13  kmeans_clustering
105:  08048a70  1713  FUNC  GLOBAL  DEFAULT   13  main

FIGURE 7.10. A subset of the function symbol table (FST) for the K-means clustering algorithm studied in Section 7.3.4. This is extracted during the profiling stage and used to trace where the application is at runtime for targeted fault injections.
Configurator: the configuration contains all the specifics related to the faults that will be
injected. This includes what application is to be studied, functions to target, the injection granular-
ity, and the instructions to alter. Additionally, probabilities of alteration can be assigned to specific
bit regions where injections are desired.
As an example, in [101], the authors present an application that is highly resilient except to data corruption in the high-order bits. This configuration stage makes it possible to target injections,
for instance, at only the most significant bits in a 64-bit register from the 52nd bit to the 63rd
bit. This is precisely the kind of study that is enabled by F-SEFI. Another example use would
be choosing a probability of corruption related to the neutron flux of the environment where the application will run
(sea level, high altitude terrestrial, aerial, satellite, etc.). The configurator allows a great deal of
flexibility in the way instructions can be targeted. For example, one can skip over N instances of a
target instruction and only then begin injecting faults.
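Gathering those knobs in one place, a configuration record along the following lines captures what the configurator manages; all field names are hypothetical and F-SEFI's actual configuration format may differ.

#include <stdint.h>

typedef struct {
        const char *app_name;        /* application to study, e.g. "kmeans"       */
        const char *function_name;   /* function to target, or NULL for whole app */
        const char *opcode;          /* "FADD", "FMUL", or "XOR"                  */
        uint64_t    skip_instances;  /* let this many matching calls pass first   */
        int         bit_lo, bit_hi;  /* eligible bit range, e.g. 52..63           */
        double      probability;     /* per-call corruption probability           */
} sefi_config;

A record restricted to, say, bits 52 through 63 of FADD results in a single target function would reproduce the kind of high-order-bit study mentioned above.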
Probe: once profiled and configured, the application under analysis is run within the guest
OS. The F-SEFI probe component then dynamically observes the guest instruction stream before
it is sent to the host hardware for execution. This instruction stream is snooped at the block level, where QEMU organizes instructions into blocks to reduce overhead. The probe monitors the Execution Instruction Pointer (EIP), and if it enters the memory region belonging to the target application, the probe switches to instruction-level monitoring. At this more fine-grained level
the probe begins checking micro-operations of each instruction that passes to the host. If the
underlying instruction satisfies the conditions defined in the configuration phase, then the probe
phase activates the injector. The algorithm for the probing process is shown as follows.
ALGORITHM 6. Probing Algorithm
PROBE() {
1: Load Probe configuration
2: Load Injector configuration
3: FOR each translation block (TB) intercepted by F-SEFI DO
4: Extract the process name of the current TB
5: IF the process name == target application THEN
6: Start the F-SEFI Tracker
7: Extract the critical memory region
8: FOR each instruction to execute DO
9: IF the instruction address resides within the critical memory region THEN
10: Start the Injector
11: END IF
12: END FOR
13: END IF
14: END FOR
15: }
Injector: QEMU has translation (helper) functions that describe how to translate each instruction of a guest architecture into an instruction (or series of instructions) on the host architecture. The injector phase
of F-SEFI substitutes the original helper function with a modified one. The new corrupted ver-
sion is controlled by the configuration to conditionally flip bits in the registers used during the
calculation. This translation is entirely transparent to the QEMU hypervisor and allows F-SEFI to
closely emulate faulty hardware without the associated overheads and limitations of hardware fault
injection.
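The substitution can be pictured along the lines of the fragment below, which follows the general shape of QEMU's softfloat-based floating point helpers. The helper name, the injection_enabled flag, and in_target_region() are illustrative only, this is not a drop-in patch, and the exact helper signatures differ across QEMU versions and target architectures.

/* Illustrative replacement for a floating point multiply helper; assumed to
 * live inside QEMU's FPU helper source for the guest architecture.  float64
 * is QEMU softfloat's 64-bit container, so XOR-ing it flips raw bits. */
float64 helper_fmul_injected(CPUX86State *env, float64 a, float64 b)
{
        float64 r = float64_mul(a, b, &env->fp_status);   /* original semantics */

        if (injection_enabled && in_target_region(env->eip))
                r ^= (uint64_t)1 << (rand() % 64);         /* corrupt the result */

        return r;
}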
TABLE 7.2. Benchmarks and target functions for fine-grained fault injection

FFT [95]: Fast Fourier Transform using Radix-2, 3, 4, 5, and 8 FFT routines
    Target functions: fft4b (Radix-4 routine), fft8b (Radix-8 routine)
BMM [78]: Bit Matrix Multiply algorithm from the CPU Suite benchmark
    Target functions: maketable (construct the lookup tables), bmm_update (apply the 64-bit matrix multiply based on the lookup tables)
Kmeans [16]: K-means clustering algorithm from the Rodinia benchmark suite
    Target functions: kmeans_clustering (update the cluster center coordinates), find_nearest_point (update membership)

Tracker: F-SEFI maintains very detailed logs of what happens while monitoring an application, as well as carefully tracking fault injections. When it decides to inject a fault, it reports information about what instruction was being executed and the state of the registers before and after the injection. In this way it is possible to analyze post-mortem how the application behaved when faults occurred.
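As a sketch of the kind of per-injection record the tracker keeps (the exact log format used by F-SEFI is not reproduced here), one entry might carry the fields below.

#include <stdint.h>
#include <stdio.h>

typedef struct {
        uint32_t    eip;            /* guest instruction pointer at injection time */
        const char *opcode;         /* which micro-operation was corrupted         */
        int         bit;            /* which bit was flipped                       */
        uint64_t    before, after;  /* register value before and after the flip    */
} inject_record;

static void log_injection(FILE *log, const inject_record *r)
{
        fprintf(log, "eip=0x%08x op=%s bit=%d before=0x%016llx after=0x%016llx\n",
                r->eip, r->opcode, r->bit,
                (unsigned long long)r->before, (unsigned long long)r->after);
}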
7.3.4. Case Studies
In this section I demonstrate F-SEFI injecting faults into three benchmark applications: fast Fourier transform (FFT), bit matrix multiplication (BMM), and K-means clustering. These experiments
were conducted using the QEMU virtual machine hypervisor. The guest kernel used was Linux
version 2.6.0 running on Ubuntu 9.04. The host specifications are unimportant, as all that is required is that QEMU can run on the host in user space.
Table 7.2 gives specifics about the benchmarks I studied, including the functions targeted for
fault injection. Each benchmark was profiled to determine the number of floating point addition
(FADD), floating point multiplication (FMUL), and exclusive-or (XOR) operations. These results
are shown in Figure 7.11 and are the basis for the instructions that are targeted in the following
experiments.
1-D Fast Fourier Transform (FFT). After profiling the benchmark I chose to target the
fft4b function. This function comprises a large percentage of the FADD and FMUL instructions.
F-SEFI was configured to inject either one or two single-bit errors into the benchmark, targeting randomly selected FADD and FMUL instructions in the fft4b routine. The injection procedure is shown
in Figure 7.12. Selected results from these four experiments are shown in Figure 7.13 and are