
Autonomic Feature Selection for Application Classification

Jian Zhang and Renato J. Figueiredo
Advanced Computing and Information Systems (ACIS) Laboratory
Department of Electrical and Computer Engineering
University of Florida, Gainesville, FL 32611, USA

{jianzh, renato}@acis.ufl.edu

Abstract- Application classification techniques based on monitoring and learning of resource usage (e.g., CPU, memory, disk and network) have been proposed to aid in resource scheduling decisions. An important problem that arises in application classifiers is how to decide which subset of the numerous performance metrics collected from monitoring tools should be used for the classification. This paper presents an approach based on a probabilistic model (Bayesian Network) to systematically select the representative performance features, which can provide optimal classification accuracy and adapt to changing workloads. Virtual machines (VMs) are used to host the application execution, and system-level performance metrics for a VM summarize the application's and its host's resource usage. This approach requires no application source code modification nor execution intervention. Results from experiments show that the proposed scheme can effectively select a performance metric subset providing above 90% classification accuracy for a set of benchmark applications.

I. INTRODUCTION

Awareness of application resource consumption patterns (such as CPU-intensive, I/O- and paging-intensive, and network-intensive) can facilitate the mapping of workloads to appropriate resources. Techniques of application classification based on monitoring and learning of resource usage can be used to gain application awareness [1]. Well-known monitoring tools such as the open source packages Ganglia [2] and dproc [3], and commercial products such as HP's OpenView [4], provide the capability of monitoring a rich set of system-level performance metrics. An important problem that arises is how to decide which subset of the numerous performance metrics collected from monitoring tools should be used for the classification in a dynamic environment. In this work we address this problem. Our approach is based on autonomic feature selection and can help to improve the system's self-manageability [5] by reducing the reliance on expert knowledge and increasing the system's adaptability.

The need for autonomic feature selection and application classification is motivated by systems such as VMPlant [6], which provides automated provisioning of flexible and isolated Virtual Machine (VM) based execution environments for applications of various types. In the context of VMPlant, the application can be scheduled to run on a dedicated virtual machine, whose system-level performance metrics reflect the application's resource usage. An application classifier categorizes the application into different classes such as CPU-intensive, disk I/O-intensive, and network-intensive based on the selected VM performance metrics.

To build an autonomic classification system with self-configurability, it is critical to devise a systematic feature selection scheme that can automatically choose the most representative features for application classification and adapt to changing workloads. This paper presents an approach that uses a probabilistic model, the Bayesian Network, to automatically select the performance metrics that correlate with application classes and optimize the classification accuracy. The approach also uses the Mahalanobis distance to support online selection of training data, which enables the feature selection to adapt to dynamic workloads. In the rest of the paper, we use the terms "metrics" and "features" interchangeably.

In previous work on application classification [1], a subset of performance metrics was manually selected, based on expert knowledge, to correlate with the resource consumption behavior of each application class. However, expert knowledge is not always available, and in the case of highly dynamic workloads or massive volumes of performance data, manual configuration by a human expert is not feasible. This presents the need for a systematic way to select the representative metrics in the absence of sufficient expert knowledge. On the other hand, the use of the Bayesian Network leaves the option open to integrate expert knowledge with the automatic feature selection to improve classification accuracy and efficiency.

Feature selection based on statically selected application performance data, used as the training set, may not always provide optimal classification results in dynamic environments. To enable the feature selection to adapt to changing workloads, the system must be able to dynamically update the training set with data from recent workloads. A question that arises is how to decide which data should be selected as training data. In this paper, a Mahalanobis distance based algorithm is used to identify training data which can represent the resource consumption pattern of the corresponding application class.

Our experimental results show the following. First, we observe correlations between pairs of selected performance metrics, which justifies the use of the Mahalanobis distance as a means of taking correlation into account in the training data selection process. Second, the classification utility (i.e., the ratio of classification accuracy to the number of selected metrics) shows diminishing returns as more features are selected. The experiments showed that above 90% application classification accuracy can be achieved with a small subset of performance metrics that are highly correlated with the application class. Third, the application classification based on the selected features for a set of benchmark programs and scientific applications matched our empirical experience with these applications.

The rest of the paper is organized as follows: Section II briefly introduces VMPlant and application classification in the context of VMPlant. The statistical techniques used are described in Section III. Section IV presents the feature selection model. Section V presents and discusses the experimental results. Section VI discusses related work. Conclusions and future work are discussed in Section VII.

II. VMPLANT AND APPLICATION CLASSIFICATION

A "classic" virtual machine enables multiple independent, isolated operating systems to run on one physical machine, efficiently multiplexing the system resources of the host machine [7]. It provides a secure and isolated environment for application execution [8]. The VMPlant Grid service [6] builds upon VM technologies and provides support for automated VM creation and provisioning. In the context of VMPlant, the application can be scheduled to run on a dedicated virtual machine. Therefore, the VM system performance metrics collected during the application execution summarize the application and its entire hosting environment's resource usage.

A subset of the performance metrics collected by the monitoring daemon is selected based on expert knowledge. The application classifier is trained with the data of the selected features and applies statistical methods to categorize the application into one of the following four classes: CPU-intensive, I/O and paging-intensive, network-intensive, or idle. The application behavior knowledge gained from the learning process over its historical runs can be used to assist scheduling or resource reservation on virtual machine host servers. For example, it can improve the system throughput by allocating applications of different classes to run on the same machine. The application classification framework can work for systems, such as VMPlant, that can provide isolated environments in which to collect performance data. It requires a dedicated machine, which can be either physical or virtual, and knowledge of the starting and ending times of the application execution.

In this paper, an automatic feature selection scheme proposed in this framework is designed to automate the feature selection process with or without the presence of expert knowledge and to make the selector adapt to the changing workload. It enables online training of the application classifier by providing an automatic method to select training data and a systematic way to identify the combination of performance metrics which are highly correlated with the application class. It is a step further towards self-manageability in building an autonomic classification system.

III. STATISTICAL TOOLS USED FOR FEATURE SELECTION

A. Feature Selection

Feature selection is a process that selects an optimal subset of the original features based on an evaluation criterion; the evaluation criterion in this paper is the classification accuracy. A typical feature selection process consists of four steps: subset generation, subset evaluation, stopping criterion, and result validation [9]. Subset generation is a heuristic search over candidate subsets. Each subset is evaluated based on the evaluation criterion, and the evaluation result is compared with the previously computed best result. If it is better, it replaces the best result, and the process continues until the stopping criterion is reached. The selection result is validated by different tests or prior knowledge.

Fig. 1. Sample Bayesian Network generated by the feature selector (root: the Application Class node; leaves: selected performance metrics).

There are two major types of feature selection algorithms: the filter model and the wrapper model. The filter model relies on general characteristics of the data to evaluate and select feature subsets without involving any mining algorithm. In contrast, the wrapper model requires one predetermined mining algorithm and uses its performance as the evaluation criterion. In this work, a wrapper model is used to search for features better suited to the classification algorithm (Bayesian Network), aiming to improve the classification accuracy. Our model employs a forward Bayesian Network based wrapper algorithm, which is introduced in detail in Section IV-B.

B. Bayesian Network

A Bayesian Network (BN) is a directed acyclic graph (DAG) with a conditional probability distribution for each node. Each node represents a domain variable, and each arc between nodes represents a probabilistic dependency [10]. It can be used to compute the conditional probability of a node given the values of its predecessors; hence, a BN can be used as a classifier that gives the posterior probability distribution of the class decision node given the values of the other nodes. Bayesian networks have been applied successfully to many areas, including map learning [11], medical diagnosis [12][13], and speech and vision processing [14][15]. Compared with other predictive models, such as decision trees and neural networks, and with the standard Principal Component Analysis (PCA) based feature selection model, Bayesian networks have the advantage of interpretability. Human experts can easily understand the network structure and modify it to obtain better predictive models. By adding decision nodes and utility nodes, BN models can also be extended to decision networks for decision analysis [16].


In this work, the Bayesian Network has a tree structure as shown in Figure 1. The root is the application class decision node, which is used to decide an application class given the values of the leaf nodes. The root node is the parent of all other nodes. The leaf nodes represent selected performance metrics, such as network packets sent and bytes written to disk, and are connected one to another in a series.
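Ignoring the serial links among the leaf nodes, this structure behaves like a naive Bayes classifier: each selected metric depends only on the class node. The following is a minimal sketch (not the authors' implementation) of how such a classifier estimates the posterior class distribution from discretized metrics; the bin count and the Laplace smoothing constant are illustrative assumptions.

import numpy as np

class TreeBNClassifier:
    """Simplified tree-structured Bayesian network (naive Bayes form):
    the class node is the parent of every discretized metric node."""

    def __init__(self, n_bins, alpha=1.0):
        self.n_bins = n_bins      # number of discrete levels per metric (assumed)
        self.alpha = alpha        # Laplace smoothing constant (assumed)
        self.classes_ = None
        self.prior_ = None        # P(class)
        self.cpt_ = None          # P(metric value | class), one table per metric

    def fit(self, X, y):
        # X: (samples, metrics) integer bin indices in [0, n_bins); y: class labels
        self.classes_, counts = np.unique(y, return_counts=True)
        self.prior_ = counts / counts.sum()
        n_metrics = X.shape[1]
        self.cpt_ = np.zeros((n_metrics, len(self.classes_), self.n_bins))
        for ci, c in enumerate(self.classes_):
            Xc = X[y == c]
            for m in range(n_metrics):
                hist = np.bincount(Xc[:, m], minlength=self.n_bins)
                self.cpt_[m, ci] = (hist + self.alpha) / (hist.sum() + self.alpha * self.n_bins)
        return self

    def posterior(self, x):
        # log P(class) + sum of log P(metric value | class), then normalize
        logp = np.log(self.prior_).copy()
        for m, v in enumerate(x):
            logp += np.log(self.cpt_[m, :, v])
        p = np.exp(logp - logp.max())
        return p / p.sum()

    def predict(self, x):
        return self.classes_[np.argmax(self.posterior(x))]

The class with the largest posterior probability becomes the decision of the class node, which is how the network is used as a classifier in this work.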

C. Mahalanobis Distance

The Mahalanobis distance is a measure of the distance between two points in the multidimensional space defined by multidimensional correlated variables [17][18]. For example, if x_1 and x_2 are two points from a distribution characterized by the covariance matrix \Sigma, then the quantity

d(x_1, x_2) = ((x_1 - x_2)^T \Sigma^{-1} (x_1 - x_2))^{1/2}    (1)

is called the Mahalanobis distance from x_1 to x_2, where T denotes the transpose of a matrix.

In the cases where there are correlations between variables, the simple Euclidean distance is not an appropriate measure, whereas the Mahalanobis distance can adequately account for the correlations and is scale-invariant. Statistical analysis of the performance data in Section V-C shows that there are correlations of various degrees between the application performance metrics. Therefore, the Mahalanobis distance between the unlabeled performance sample and the class centroid, which represents the average of all existing training data of the class, is used in the training data qualification process in Section IV-A.
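As an illustration of Eq. (1) and of the centroid-based use just described, the following numpy sketch computes the Mahalanobis distance of an unlabeled sample to a class centroid estimated from that class's training data. It is not the authors' code; the ridge term added to the covariance matrix is an assumption made only to keep the matrix invertible for small samples.

import numpy as np

def mahalanobis(x1, x2, cov, reg=1e-6):
    """Mahalanobis distance between x1 and x2 under covariance `cov` (Eq. 1)."""
    cov = cov + reg * np.eye(cov.shape[0])     # small ridge (assumption) so the inverse exists
    diff = np.asarray(x1, float) - np.asarray(x2, float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def distance_to_centroid(sample, class_data):
    """Distance from an unlabeled sample to the centroid of one class's training data.
    class_data: (K, n) array of the class's n-dimensional training points."""
    centroid = class_data.mean(axis=0)
    cov = np.cov(class_data, rowvar=False)     # n x n covariance of the class cluster
    return mahalanobis(sample, centroid, cov)

# Example: a correlated 2-D cluster, where Euclidean and Mahalanobis distances disagree.
rng = np.random.default_rng(0)
cluster = rng.multivariate_normal([10, 5], [[4.0, 3.5], [3.5, 4.0]], size=200)
sample = np.array([12.0, 7.0])
print(distance_to_centroid(sample, cluster),
      np.linalg.norm(sample - cluster.mean(axis=0)))

Because the sample lies along the cluster's direction of strong correlation, its Mahalanobis distance is much smaller than its Euclidean distance, which is exactly the property exploited in the training data qualification process.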

D. Confusion Matrix

The confusion matrix [19] is commonly used to evaluate the performance of classification systems. It shows the predicted and actual classifications made by the system. The matrix size is L x L, where L is the number of different classes; in our case, with five target application classes, L is equal to 5. The classification accuracy is measured as the proportion of the total number of predictions that are correct. A prediction is considered correct if the data is classified to the same class as its actual class. Table I shows a sample confusion matrix with L = 2. There are only two possible classes in this example, positive and negative, so the classification accuracy can be calculated as (a + d)/(a + b + c + d).

TABLE I
SAMPLE CONFUSION MATRIX WITH TWO CLASSES (L = 2)

                    Predicted class
Actual class     Negative     Positive
Negative         a            b
Positive         c            d
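A small worked example of this accuracy computation: the sketch below builds the L x L confusion matrix from actual and predicted labels and reads the accuracy (a + d)/(a + b + c + d) off its diagonal. The label sequences are made-up values used only for illustration.

import numpy as np

def confusion_matrix(actual, predicted, labels):
    """Rows: actual class; columns: predicted class."""
    index = {c: i for i, c in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for a, p in zip(actual, predicted):
        cm[index[a], index[p]] += 1
    return cm

labels = ["negative", "positive"]                      # L = 2, as in Table I
actual    = ["negative", "negative", "positive", "positive", "positive"]
predicted = ["negative", "positive", "positive", "positive", "negative"]
cm = confusion_matrix(actual, predicted, labels)
accuracy = np.trace(cm) / cm.sum()                     # (a + d) / (a + b + c + d)
print(cm, accuracy)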

IV. FEATURE SELECTION MODEL FOR APPLICATION CLASSIFICATION

Fig. 2. Application classification model. The Performance profiler collects performance metrics of the target application node. The Application classifier classifies the application using extracted key components and performs statistical analysis of the classification results. The DataQA selects the training data for the classification. The Feature selector selects the performance metrics which can provide optimal classification accuracy. The Trainer trains the classifier using the selected metrics of the training data. The Application DB stores the application class information. (t0 and t1 are the beginning and ending times of the application execution; VM IP is the IP address of the application's host machine. CTC: Classification Training Center; DataQA: Data Quality Assuror.)

Before continuing to introduce the feature selection model for application classification, we briefly describe how the application classification model works in the context of VMPlant. The application classification model has three major components: the performance profiler, the application classifier, and the classification training center (CTC), as shown in Figure 2.

The performance profiler takes snapshots of the application-VM's performance metrics collected by the monitoring daemon upon the request of the resource manager. In our implementation, Ganglia [2] is used as the monitoring tool, and the snapshots are taken once every five seconds during the application execution. These performance snapshots form multivariate time series. However, the Bayesian Network does not take into account the actual time the performance data correspond to, and the data are treated in the same way as static data.

The classification training center is responsible for selecting the classification training data, deriving the representative performance metrics based on the training data, and conducting the classification training for the application classifier.

The application classifier classifies the application performance snapshots into classes such as CPU-intensive, I/O and paging-intensive, and network-intensive. The application classifier is based on the k-Nearest Neighbor (k-NN) algorithm, which is described in detail in [1]. The classification results are stored in the application database (DB) to assist future resource scheduling.

In this section, we focus on the classification training center, which enables self-configurability for online application classification. The training center has two major functions: quality assurance of training data, which enables the classifier to adapt to changing workloads, and systematic feature selection, which supports automatic feature selection. The training center consists of three components: the data quality assuror, the feature selector, and the trainer.


A. Data Quality Assuror

The data quality assuror (DataQA) is responsible for selecting the training data for application classification. The inputs of the DataQA are the performance snapshots taken during the application execution. The outputs are the qualified training data together with their classes, such as CPU-intensive.

The training data pool consists of representative data of five application classes: CPU-intensive, I/O-intensive, memory-intensive, network-intensive, and idle. The training data of each class c is a set of K_c m-dimensional points, where m is the number of application-specific performance metrics reported by the monitoring tools. To select the training data from the application snapshots, only n out of the m metrics are extracted, based on the previous feature selection result, to form a set of K_c n-dimensional training points

{ x_k = (x_{k,1}, x_{k,2}, ..., x_{k,n}) | k = 1, 2, ..., K_c }    (2)

that comprise a cluster C_c. From [20], it follows that the n-tuple

\mu_c = (\bar{x}_{1,c}, \bar{x}_{2,c}, ..., \bar{x}_{n,c})    (3)

where

\bar{x}_{i,c} = (1 / K_c) \sum_{k=1}^{K_c} x_{k,i},   i = 1, 2, ..., n    (4)

is called the centroid of the cluster C_c.

The training data selection is a three-step process. First, the DataQA extracts the n out of m metrics of the input performance snapshot to form a training data candidate; each candidate is thus represented by an n-dimensional point x = (x_1, x_2, ..., x_n). Second, it evaluates whether the candidate is qualified to be training data representing one of the application classes. Last, the qualified training data candidate is associated with a scalar value Class, which defines the application class.

The first step is straightforward. In the second and third steps, the Mahalanobis distance between the training data candidate x and the centroid \mu_c of cluster C_c is calculated as follows:

d_c(x) = ((x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c))^{1/2}    (5)

where c = 1, 2, ..., 5 represents the application class and \Sigma_c^{-1} denotes the inverse covariance matrix of the cluster C_c. The distance from the training data candidate x to the boundary between two class clusters, for example C_1 and C_2, is |d_1(x) - d_2(x)|. If |d_1(x) - d_2(x)| = 0, the candidate lies exactly on the boundary between classes 1 and 2. The further the candidate is from the class boundaries, the better it can represent a class; in other words, there is less probability of it being misclassified. Therefore, the DataQA calculates the distance from the candidate to the boundaries of all possible pairs of classes. If the minimal distance to the class boundaries, min(|d_1 - d_2|, |d_1 - d_3|, ..., |d_4 - d_5|), is larger than a predefined threshold \gamma, the corresponding m-dimensional snapshot of the candidate is accepted as qualified training data of the class whose centroid has the smallest Mahalanobis distance min(d_1, d_2, ..., d_5) to the snapshot. Automated and adaptive threshold setting is discussed in detail in [21].

In our implementation, Ganglia is used as the monitoring tool, and twenty (m = 20) performance metrics related to resource usage are included in the training data. These comprise 16 of the 33 default metrics monitored by Ganglia and 4 metrics that we added based on the needs of classification: the number of I/O blocks read from / written to disk and the number of memory pages swapped in / out. A program was developed to collect these four metrics (using vmstat) and add them to the metric list of Ganglia's monitoring daemon gmond. Table II shows some sample performance metrics of the training candidate.

TABLE II
SAMPLE PERFORMANCE METRICS IN THE ORIGINAL FEATURE SET

Performance Metrics          Description
cpu_system / user / idle     Percent CPU system / user / idle
cpu_nice                     Percent CPU nice
bytes_in / out               Number of bytes per second into / out of the network
io_bi / bo                   Blocks sent to / received from a block device (blocks/s)
swap_in / out                Amount of memory swapped in / out from / to disk (kB/s)
pkts_in / out                Packets in / out per second
proc_run                     Total number of running processes
load_one / five / fifteen    One / five / fifteen minute load average

The first round of quality assurance was performed by a human expert at initialization. Subsequent assurance can be conducted automatically by following the above steps to select representative training data for each class.

B. Feature Selector

The feature selector is responsible for selecting, from the numerous performance metrics collected by monitoring tools, the features that are correlated with the application's resource consumption pattern. By filtering out metrics that contribute less to the classification, it helps not only to reduce the computational complexity of subsequent classifications, but also to improve classification accuracy.

In our previous work [1], representative features were selected manually based on expert knowledge. For example, the cpu_system and cpu_user metrics are correlated with the behavior of CPU-intensive applications; bytes_in and bytes_out are correlated with network-intensive applications; io_bi and io_bo are correlated with I/O-intensive applications; and swap_in and swap_out are correlated with memory-intensive applications. However, to support online classification, the feature selection must be able to adapt to changing workloads, and static selection by a human expert may not be sufficient in a highly dynamic environment. A feature selection scheme that can automatically select the representative features for application classification in a systematic way can help to solve this problem.


Input:  C = (F0, F1, ..., FN-1)         // training data set with N features
Input:  Class                           // class of training data (teacher for learning)
Output: Sbest                           // selected feature subset
begin
    initialize Sbest = {};
    initialize Amax = 0;                // maximum accuracy
    D = discretize(C);                  // convert continuous to discrete features
    repeat
        initialize Anode = 0;           // max accuracy for each node
        initialize Fnode = null;        // selected feature for each node
        foreach F in ({F0, F1, ..., FN-1} - Sbest) do
            Accuracy = eval(D, Class, Sbest U {F});   // evaluate Bayesian network with extra feature F
            if Accuracy > Anode then
                Anode = Accuracy;
                Fnode = F;              // store the current feature
            end
        end
        if Anode > Amax then
            Sbest = Sbest U {Fnode};
            Amax = Anode;
            Anode = Anode + 1;          // keep Anode above Amax so the search continues
        end
    until (Anode < Amax)                // stop when adding a feature no longer improves accuracy
end

Fig. 3. Bayesian Network based feature selection algorithm for application classification

This automated feature selection enables the application classifier to self-configure its input feature subset to adapt to the changing workload.

A Bayesian Network based wrapper algorithm is employed by the feature selector to conduct the feature selection. As introduced in Section III-A, although this feature selection scheme reduces the reliance on human experts' knowledge, the Bayesian network's interpretability leaves the option open to integrate expert knowledge into the selection scheme to build a better classification model.

Figure 3 shows the feature selection algorithm. It starts with an empty feature subset Sbest = {}. To search for the best feature F, it uses the temporary feature set Sbest U {F} to perform Bayesian Network classification of the discretized training data D. The classification accuracy is calculated by comparing the classification result with the true class labels contained in the training data. After evaluating the accuracy using each of the remaining features ({F0, F1, ..., FN-1} - Sbest), the best accuracy is stored in Anode. If Anode is better than the previous best accuracy Amax, the corresponding feature node is added to the feature subset to form the new subset. This process is repeated until the classification accuracy can no longer be improved by adding any of the remaining features to the subset.
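For concreteness, the forward search of Figure 3 can be written as a short Python routine. This is an illustrative sketch rather than the authors' Matlab implementation: the Bayesian Network training and evaluation step is hidden behind an eval_accuracy callback supplied by the caller (for instance, cross-validated accuracy of the classifier sketched in Section III-B), and ties are treated as "no improvement".

def select_features(n_features, eval_accuracy):
    """Greedy forward wrapper selection (cf. Fig. 3).

    eval_accuracy(subset) -> classification accuracy of the wrapped classifier
    (e.g. the Bayesian network) trained and validated on that feature subset.
    """
    s_best, a_max = [], 0.0
    while True:
        a_node, f_node = 0.0, None
        for f in range(n_features):
            if f in s_best:
                continue
            acc = eval_accuracy(s_best + [f])      # evaluate subset with extra feature f
            if acc > a_node:
                a_node, f_node = acc, f            # best feature found in this round
        if f_node is None or a_node <= a_max:      # no remaining feature improves accuracy
            return s_best, a_max
        s_best.append(f_node)                      # grow the subset and continue
        a_max = a_node

A call such as select_features(20, eval_fn) would return the indices of the chosen metrics together with the best accuracy observed, mirroring the Sbest and Amax outputs of Figure 3.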

C. Trainer

The classification trainer manages the training of the application classifier. It monitors the updating status of the training data pool. Every time the DataQA qualifies new training data, it replaces the oldest data in the training data pool with the new data. When the percentage of new training data in the pool reaches a predefined threshold (for example, 80%), the trainer sends a request to the feature selector to start the feature selection process and generate an updated feature subset. After receiving the updated feature subset, it calls the classifier to perform classification of the data in the updated training data pool using the old and the new feature subsets respectively, and then compares the classification accuracy of the two. If the accuracy achieved by the new feature subset is higher than that achieved by the previous subset, the selected feature set is updated. Otherwise, it remains the same.
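The trainer's bookkeeping can be summarized in a short sketch: a bounded training pool in which newly qualified snapshots replace the oldest entries, and a retraining trigger once the fraction of new data reaches the threshold. The 80% threshold comes from the text; the pool structure, callbacks, and names below are illustrative assumptions rather than the authors' implementation.

from collections import deque

class Trainer:
    def __init__(self, pool_size, feature_selector, compare_accuracy, new_data_threshold=0.8):
        self.pool = deque(maxlen=pool_size)       # oldest entries are evicted automatically
        self.new_count = 0
        self.threshold = new_data_threshold       # e.g. 80% of the pool is new data
        self.feature_selector = feature_selector  # pool -> candidate feature subset (assumed callback)
        self.compare_accuracy = compare_accuracy  # (feature_subset, pool) -> accuracy (assumed callback)
        self.selected_features = None

    def add_qualified_sample(self, sample, label):
        """Called by the DataQA whenever a snapshot passes the qualification test."""
        self.pool.append((sample, label))
        self.new_count += 1
        if self.new_count >= self.threshold * self.pool.maxlen:
            self._maybe_update_features()
            self.new_count = 0

    def _maybe_update_features(self):
        candidate = self.feature_selector(list(self.pool))
        if self.selected_features is None:
            self.selected_features = candidate
            return
        # keep whichever subset classifies the updated pool more accurately
        if self.compare_accuracy(candidate, list(self.pool)) > \
           self.compare_accuracy(self.selected_features, list(self.pool)):
            self.selected_features = candidate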

V. EMPIRICAL EVALUATION

We have implemented a prototype of the feature selector based on Matlab. This section shows the experimental results of feature selection for data collected during the execution of a set of applications representative of each class (CPU-, I/O-, memory-, and network-intensive) and the classification accuracy achieved. In addition, statistical analysis of the performance metrics was conducted to justify the use of the Mahalanobis distance in the training data quality assurance process.

In the experiments, all the applications were executed in a VMware GSX 2.5 virtual machine with 256 MB of memory. The virtual machine was hosted on an Intel(R) Xeon(TM) dual-CPU 1.80 GHz machine with 512 KB cache and 1 GB RAM. The CTC and application classifier were running on an Intel(R) Pentium(R) III 750 MHz machine with 256 MB RAM.


A. Feature Selection and Classification Accuracy

Fig. 4. Five-class (idle, CPU, I/O, network, memory) test data distribution with the first two selected features (x-axis: cpu_system (%)).

Fig. 5. Average classification accuracy of 10 sets of test data versus the number of features selected in the first experiment.

Two sets of experiments were conducted offline to evaluate our feature selection algorithm. In both experiments, the training data, described by 20 performance metrics, consist of performance snapshots of applications belonging to different classes. Ten iterations of cross-validation were performed: the training data was randomly divided into two parts, with a combination of 50% of the data from each class used to train the feature selector (training set) and derive the feature subset, and the other 50% used as a test set to validate the selected features by calculating the classification accuracy. The first experiment was designed to show the relationship between classification accuracy and the number of features selected. The second experiment was designed to show that prior class information can be used to achieve higher classification accuracy with a smaller number of features.

Fig. 6. Two-class test data distribution with the first two selected features (x-axis: bytes_in).

In the first experiment, the training data consist of performance snapshots of five classes of applications: CPU-intensive, I/O-intensive, memory-intensive, and network-intensive applications, plus snapshots collected from an idle application-VM, which has only "background noise" from system activity (i.e., no application executing during the monitoring period). The feature selector's task is to select those metrics which can be used to classify the test set into five classes with optimal accuracy.

In all ten iterations of cross-validation, two performance metrics (cpu_system and load_fifteen) were always selected as the best two features. Figure 4 shows a sample test data distribution with these two features. If we project the data onto the x-axis or the y-axis, we can see that it is more difficult to differentiate the data of each class using either cpu_system or load_fifteen alone than using both metrics. For example, the cpu_system value ranges of the network-intensive and I/O-intensive applications largely overlap, which makes it hard to classify these two applications with only the cpu_system metric. Compared with one-metric classification, it is much easier to decide which class the test data belong to by using information from both metrics. In other words, a combination of multiple features is more descriptive than a single feature.

The classification accuracy versus the number of features selected for the learned Bayesian network is plotted in Figure 5. It shows that with a small number of features (3 to 4), above 90% classification accuracy can be achieved for this 5-class classification.

In the second experiment, the training data consist of performance snapshots of two classes of applications, I/O-intensive and memory-intensive. Figure 6 shows the test data distribution with the first two selected features, bytes_in and pkts_in. A comparison of Figure 4 and Figure 6 shows that with a reduced number of application classes, higher classification accuracy can be achieved with fewer features. For example, in this experiment, if we know that the application belongs to either the I/O-intensive or the memory-intensive class, 96% classification accuracy can be achieved with two selected features, versus 87% accuracy in the 5-class case.


TABLE III
CONFUSION MATRICES OF CLASSIFICATION RESULTS WITH EXPERT-SELECTED AND AUTOMATICALLY SELECTED FEATURE SETS

(a) Automatic
Actual        Classified as
Class      Idle    CPU     I/O     Net     Mem
Idle       4938    0       62      0       0
CPU        231     4746    23      0       0
I/O        20      86      2888    6       0
Net        0       12      8       4980    0
Mem        0       0       0       0       5000

(b) Expert
Actual        Classified as
Class      Idle    CPU     I/O     Net     Mem
Idle       4962    0       38      0       0
CPU        4       4882    10      0       104
I/O        20      10      2797    0       173
Net        0       0       24      4970    6
Mem        3       0       36      0       4961

The entries along the diagonal are the numbers of correctly classified data.

This shows the potential of using pair-wise classification to improve the classification accuracy for multi-class cases. Using the pair-wise approach for multi-class classification is a topic of future research.

B. Classification Validation

This set of experiments aims to validate the feature selection results with the Principal Component Analysis (PCA) and k-Nearest Neighbor (k-NN) based application classification framework described in [1].

First, the training data distributions based on the principal components derived from the automatically selected features of Section V-A and from the manually selected features of previous work [1] are shown in Figure 7. The distances between each pair of class centroids in Figure 7 are calculated and plotted in Figure 8, which shows that the distances between 9 out of 10 pairs of cluster centroids are larger in the automatic selection case than in the expert's manual selection case. This means that the class clusters formed with the two principal components derived from the automatically selected features are comparably distinct to, or more distinct than, those formed with the expert-selected features.

Second, PCA and k-NN based classifications were conducted with both the 8 features selected by the expert in previous work [1] and the automatically selected features of Section V-A. Table III shows the confusion matrices of the classification results. If data are classified to the same classes as their actual classes, the classifications are considered correct; the classification accuracy is the proportion of the total number of classifications that are correct. The confusion matrices show that a classification accuracy of 98.05% can be achieved with the automatically selected feature set, which is similar to the 98.14% accuracy achieved with the expert-selected feature set. Thus the Bayesian Network based automatic feature selection can reduce the reliance on expert knowledge while offering classification accuracy competitive with manual selection by a human expert.
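The comparison behind Figures 7 and 8 can be reproduced with a few lines of linear algebra: project the snapshots restricted to a given feature set onto their first two principal components, compute per-class centroids, and measure the pairwise centroid distances. The sketch below uses a plain numpy SVD rather than the authors' Matlab code, and the input arrays named at the bottom are placeholders for the real data.

import numpy as np
from itertools import combinations

def pca_project(X, n_components=2):
    """Project rows of X onto the top principal components (via SVD of centered data)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

def pairwise_centroid_distances(X, y, n_components=2):
    """Euclidean distances between class centroids in the PCA plane (cf. Fig. 8)."""
    Z = pca_project(X, n_components)
    centroids = {c: Z[y == c].mean(axis=0) for c in np.unique(y)}
    return {(a, b): float(np.linalg.norm(centroids[a] - centroids[b]))
            for a, b in combinations(sorted(centroids), 2)}

# X_auto / X_expert: snapshots restricted to the automatically / expert-selected metrics,
# y: class label per snapshot (idle, cpu, io, net, mem) -- placeholders for real data.
# dist_auto   = pairwise_centroid_distances(X_auto, y)
# dist_expert = pairwise_centroid_distances(X_expert, y)

Larger centroid distances in the PCA plane correspond to more separable class clusters, which is the criterion used in Figure 8 to compare the two feature sets.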

Fig. 7. Training data clustering diagrams (Principal Component 1 vs. Principal Component 2) derived from (a) the automatically selected and (b) the expert-selected feature sets. Classes shown: Idle, CPU, I/O, NET, MEM.

Fig. 8. Comparison of the distances between cluster centers derived from the expert-selected and the automatically selected feature sets. Cluster pairs: 1: idle-cpu, 2: idle-I/O, 3: idle-net, 4: idle-mem, 5: cpu-I/O, 6: cpu-net, 7: cpu-mem, 8: I/O-net, 9: I/O-mem, 10: net-mem.

Fig. 9. Classification results of benchmark programs: (a) SPECseis96, (b) PostMark, (c) PostMark_NFS. Principal components 1 and 2 are the principal component metrics extracted by PCA.

TABLE IV
PERFORMANCE METRIC CORRELATION MATRICES OF TEST APPLICATIONS

(a) Correlation matrix of SPECseis96 performance data
Metric    1       2       3       4       5       6
1         1.00    -0.21   -0.34   0.74    0.20    -0.02
2         -0.21   1.00    -0.16   -0.02   -0.17   -0.06
3         -0.34   -0.16   1.00    -0.60   0.20    -0.05
4         0.74    -0.02   -0.60   1.00    -0.19   0.04
5         0.20    -0.17   0.20    -0.19   1.00    0.12
6         -0.02   -0.06   -0.05   0.04    0.12    1.00

(b) Correlation matrix of PostMark performance data
Metric    1       2       3       4       5       6
1         1.00    -0.24   0.22    0.34    -0.08   -0.13
2         -0.24   1.00    -0.22   0.18    0.04    -0.02
3         0.22    -0.22   1.00    0.33    0.30    0.18
4         0.34    0.18    0.33    1.00    0.42    0.47
5         -0.08   0.04    0.30    0.42    1.00    0.20
6         -0.13   -0.02   0.18    0.47    0.20    1.00

(c) Correlation matrix of NetPIPE performance data
Metric    1       2       3       4       5       6
1         1.00    0.29    0.31    0.48    0.27    0.30
2         0.29    1.00    0.49    0.39    0.75    0.95
3         0.31    0.49    1.00    0.50    0.59    0.52
4         0.48    0.39    0.50    1.00    0.42    0.39
5         0.28    0.75    0.59    0.42    1.00    0.75
6         0.30    0.95    0.52    0.39    0.75    1.00

Metric key: 1 = load_five, 2 = pkts_in, 3 = cpu_system, 4 = load_fifteen, 5 = pkts_out, 6 = bytes_out. Correlations larger than 0.5 are highlighted in bold.

In addition, the set of 8 features selected in the 5-class feature selection experiment of Section V-A was used to configure the application classifier, and the same training data used in the feature selection experiment were used to train it. The trained classifier then performed classification for a set of three benchmark programs: SPECseis96 [22], PostMark, and PostMark_NFS [23]. SPECseis96 is a scientific application which is computation-intensive but also exercises disk I/O in the initial and final phases of its execution. PostMark is originally a disk I/O benchmark program; in PostMark_NFS, a network file system (NFS) mounted directory was used to store the files read and written by the benchmark, so PostMark_NFS performs substantial network I/O rather than disk I/O. The classification results are shown in Figure 9: 86% of the SPECseis96 test data were classified as CPU-intensive, 95% of the PostMark data were classified as I/O-intensive, and 61% of the PostMark_NFS data were classified as network-intensive. These results match our empirical experience with these programs and are close to the results of classification based on the expert-selected features, which show 85% CPU-intensive for SPECseis96, 97% I/O-intensive for PostMark, and 62% network-intensive for PostMark_NFS.

C. Training Data Quality Assurance

This set of experiments shows the need for using the Mahalanobis distance in the training data quality assurance process.


The data quality assuror classifies each unlabeled test sample by identifying its nearest neighbor among all class centroids. Its performance thus depends crucially on the distance metric used to identify the nearest class centroid. In fact, a number of researchers have demonstrated that nearest neighbor classification can be greatly improved by learning an appropriate distance metric from labeled examples [18].

Table IV shows the correlation coefficients of each pair of the first six performance metrics collected during application execution: load_five, pkts_in, cpu_system, load_fifteen, pkts_out, and bytes_out. Three applications are used in these experiments: SPECseis96 [22], PostMark [23], and NetPIPE [24].

The experiments show that there are correlations of various degrees between pairs of performance metrics. For example, NetPIPE's bytes_out metric is highly correlated with its pkts_in, pkts_out, and cpu_system metrics. In cases where there are correlations between metrics, a distance metric which can take the correlation into account when determining the distance from the class centroid should be used. Therefore, the Mahalanobis distance is used in the training data selection process.
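The correlation matrices in Table IV are ordinary Pearson correlations computed over the per-snapshot metric values; a minimal numpy sketch is shown below. The metric ordering follows Table IV, and the snapshot array is synthetic placeholder data rather than measurements from the experiments.

import numpy as np

metrics = ["load_five", "pkts_in", "cpu_system", "load_fifteen", "pkts_out", "bytes_out"]

def correlation_matrix(snapshots):
    """snapshots: (num_snapshots, num_metrics) array, one row per 5-second sample."""
    return np.corrcoef(snapshots, rowvar=False)   # Pearson correlation, metrics as columns

# Example with synthetic data shaped like Table IV's inputs:
rng = np.random.default_rng(1)
snapshots = rng.normal(size=(120, len(metrics)))
R = correlation_matrix(snapshots)
strongly_correlated = [(metrics[i], metrics[j])
                       for i in range(len(metrics)) for j in range(i + 1, len(metrics))
                       if abs(R[i, j]) > 0.5]     # pairs of the kind highlighted in Table IV
print(np.round(R, 2), strongly_correlated)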

VI. RELATED WORK

Feature selection [25][26] and classification techniques have been applied successfully to many areas, such as intrusion detection [27][28][29], text categorization [30], speech and image processing [14][15], and medical diagnosis [12][13]. The following works applied these techniques to analyze system performance; however, they differ from each other in the following aspects: the goals of feature selection, the features under study, and the implementation complexity.

Nickolayev et al. used statistical clustering techniques to identify representative processors for parallel application performance tuning [20]. Only event traces of the selected processors are needed to capture the interaction between the application and system components. This helps to reduce the large event data volume, which can perturb system performance. Their approach does not require modification of application source code.

Ahn et al. applied various statistical techniques to extract the important performance counter metrics for application performance analysis [31]. Their prototype can support parallel applications' performance analysis by collecting and aggregating local data. It requires annotation of the application source code as well as appropriate operating system and library support to collect hardware counter based process information.

Cohen et al. [32] studied the correlation between component performance metrics and SLO violations in an Internet server platform. There are some similarities between their work and ours in terms of the level of performance metrics under study and the type of classifier used. However, our study differs from theirs in the following ways. First, our study focuses on application classification (CPU-intensive, I/O and paging-intensive, and network-intensive) for resource scheduling, whereas their study focused on performance anomaly detection (SLO violation and compliance). Second, our prototype targets online classification and addresses the training data qualification problem to adapt the feature selection to changing workloads, whereas online training data selection was not the focus of [32]. Third, in our prototype, virtual machines are used to host application executions and summarize the application's resource usage, and the prototype supports a wide range of applications, such as scientific programs and business online transaction systems, while [32] studied web applications in three-tier client/server systems.

In addition to [32], Aguilera et al. [33] and Magpie [34] also studied performance analysis of distributed systems. However, they considered message-level traces of system activities instead of system-level performance metrics. Both of them treated the components of distributed systems as black boxes; therefore, their approaches do not require application and middleware modifications.

VII. SUMMARY

The autonomic feature selection prototype presented in this paper shows how to apply statistical analysis techniques to support online application classification. We envision that this classification approach can be used to provide a first-order analysis of the dominant resource consumption patterns of an application. This paper shows that autonomic feature selection enables classification without requiring expert knowledge in the selection of relevant low-level performance metrics. With knowledge of an application's class, we envision that further prediction techniques can be applied by selecting models that are tailored to a particular resource characteristic, e.g., by using existing execution time prediction frameworks for CPU-intensive applications and non-linear models that account for the relationship between memory size and execution time for memory-intensive applications. This is the topic of future research.

ACKNOWLEDGEMENT

This work was supported in part by a grant from the Intel Corporation ISTG R&D Council and by the National Science Foundation under grants EIA-0224442, EEC-0228390, ACI-0219925 and NSF Middleware Initiative (NMI) grants ANI-0301108 and SCI-0438246. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of Intel or NSF.

REFERENCES

[1] J. Zhang and R. Figueiredo, "Application classification through monitoring and learning of resource consumption patterns," in Proceedings of the 10th IEEE International Parallel & Distributed Processing Symposium, Rhodes Island, Greece, Apr. 25-29, 2006 (to appear).
[2] M. Massie, B. Chun, and D. Culler, The Ganglia Distributed Monitoring System: Design, Implementation, and Experience. Reading, MA: Addison-Wesley, 2003.
[3] S. Agarwala, C. Poellabauer, J. Kong, K. Schwan, and M. Wolf, "Resource-aware stream management with the customizable dproc distributed monitoring mechanisms," in Proceedings of the 12th IEEE International Symposium on High Performance Distributed Computing, June 22-24, 2003, pp. 250-259.
[4] HP OpenView. [Online]. Available: http://www.managementsoftware.hp.com
[5] J. Kephart and D. Chess, "The vision of autonomic computing," Computer, vol. 36, no. 1, pp. 41-50, 2003.
[6] I. Krsul, A. Ganguly, J. Zhang, J. Fortes, and R. Figueiredo, "VMPlants: Providing and managing virtual machine execution environments for grid computing," in Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, Washington, DC, Nov. 6-12, 2004.
[7] R. P. Goldberg, "Survey of virtual machine research," IEEE Computer Magazine, vol. 7, no. 6, pp. 34-45, June 1974.
[8] R. Figueiredo, P. Dinda, and J. Fortes, "A case for grid computing on virtual machines," in Proceedings of the 23rd International Conference on Distributed Computing Systems, May 19-22, 2003, pp. 550-559.
[9] H. Liu and L. Yu, "Toward integrating feature selection algorithms for classification and clustering," IEEE Trans. Knowl. Data Eng., vol. 17, no. 4, pp. 491-502, Apr. 2005.
[10] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, CA: Morgan Kaufmann Publishers, 1988.
[11] T. Dean, K. Basye, R. Chekaluk, S. Hyun, M. Lejter, and M. Randazza, "Coping with uncertainty in a control system for navigation and exploration," in Proceedings of the 8th National Conference on Artificial Intelligence, Boston, MA, July 29-Aug. 3, 1990, pp. 1010-1015.
[12] D. Heckerman, "Probabilistic similarity networks," Depts. of Computer Science and Medicine, Stanford University, Tech. Rep., 1990.
[13] D. J. Spiegelhalter, R. C. Franklin, and K. Bull, "Assessment criticism and improvement of imprecise subjective probabilities for a medical expert system," in Proceedings of the Fifth Workshop on Uncertainty in Artificial Intelligence, 1989, pp. 335-342.
[14] E. Charniak and D. McDermott, Introduction to Artificial Intelligence. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1985.
[15] T. S. Levitt, J. Mullin, and T. O. Binford, "Model-based influence diagrams for machine vision," in Proceedings of the Fifth Workshop on Uncertainty in Artificial Intelligence, 1989, pp. 233-244.
[16] R. E. Neapolitan, Probabilistic Reasoning in Expert Systems: Theory and Algorithms. New York, NY, USA: John Wiley & Sons, Inc., 1990.
[17] R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed. New York, NY: Wiley-Interscience, 2001.
[18] K. Weinberger, J. Blitzer, and L. Saul, "Distance metric learning for large margin nearest neighbor classification," in Proceedings of the 19th Annual Conference on Neural Information Processing Systems, Vancouver, Canada, Dec. 2005.
[19] R. Kohavi and F. Provost, "Glossary of terms," Machine Learning, vol. 30, pp. 271-274, 1998.
[20] O. Y. Nickolayev, P. C. Roth, and D. A. Reed, "Real-time statistical clustering for event trace reduction," The International Journal of Supercomputer Applications and High Performance Computing, vol. 11, no. 2, pp. 144-159, Summer 1997.
[21] B. Ziebart, D. Roth, R. Campbell, and A. Dey, "Automated and adaptive threshold setting: Enabling technology for autonomy and self-management," in Proceedings of the 2nd International Conference on Autonomic Computing, June 13-16, 2005, pp. 204-215.
[22] R. Eigenmann and S. Hassanzadeh, "Benchmarking with real industrial applications: The SPEC High-Performance Group," IEEE Computational Science and Engineering, vol. 3, no. 1, pp. 18-23, 1996.
[23] PostMark benchmark. [Online]. Available: http://www.netapp.com/tech-library/3022.html
[24] Q. Snell, A. Mikler, and J. Gustafson, "NetPIPE: A network protocol independent performance evaluator," June 1996. [Online]. Available: citeseer.ist.psu.edu/snell96netpipe.html
[25] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Res., vol. 3, pp. 1157-1182, Mar. 2003.
[26] P. Mitra, C. Murthy, and S. Pal, "Unsupervised feature selection using feature similarity," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 3, pp. 301-312, Mar. 2002.
[27] W. Lee, S. J. Stolfo, and K. W. Mok, "Adaptive intrusion detection: A data mining approach," Artificial Intelligence Review, vol. 14, no. 6, pp. 533-567, 2000.
[28] Y. Liao and V. R. Vemuri, "Using text categorization techniques for intrusion detection," in 11th USENIX Security Symposium, San Francisco, CA, Aug. 5-9, 2002, pp. 51-59.
[29] M. Almgren and E. Jonsson, "Using active learning in intrusion detection," in Proceedings of the 17th IEEE Computer Security Foundations Workshop, June 28-30, 2004, pp. 88-98.
[30] G. Forman, "An extensive empirical study of feature selection metrics for text classification," J. Mach. Learn. Res., vol. 3, pp. 1289-1305, 2003.
[31] D. H. Ahn and J. S. Vetter, "Scalable analysis techniques for microprocessor performance counter metrics," in Proceedings of SuperComputing, Baltimore, MD, Nov. 16-22, 2002, pp. 1-16.
[32] I. Cohen, J. S. Chase, M. Goldszmidt, T. Kelly, and J. Symons, "Correlating instrumentation data to system states: A building block for automated diagnosis and control," in 6th USENIX Symposium on Operating Systems Design and Implementation, 2004, pp. 231-244.
[33] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen, "Performance debugging for distributed systems of black boxes," in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, Bolton Landing, NY, Oct. 19-22, 2003, pp. 74-89.
[34] R. Isaacs and P. Barham, "Performance analysis in loosely-coupled distributed systems," in 7th CaberNet Radicals Workshop, Bertinoro, Italy, Oct. 2002.
