
Unsupervised Anomaly Detection for High Dimensional Data

Dr. Thayasivam, Umashanger

Department of Mathematics, Rowan University.

July 19th, 2013

International Workshop in Sequential Methodologies (IWSM-2013)


Outline of Talk

I Motivation: Biometrics

I SVM (Supervised Learning) Approach

I Unsupervised L2E Estimation Approach

I Experimental Results

I Concluding Remarks


Introduction

We are drowning in the deluge of data that are being collected world-wide, while starving for knowledge at the same time. Anomalous events occur relatively infrequently. However, when they do occur, their consequences can be quite dramatic and quite often in a negative sense.

* - J. Naisbitt, Megatrends: Ten New Directions Transforming Our Lives. New York: Warner Books, 1982.


Need for Accurate Speaker Recognition

I Method of recognizing a person based on his voice

I One of the forms of biometric identification

I Need for accurate and scalable speaker recognition: VoIP applications

I Applications in diverse areas: telephone, internet banking, online trading, forensics

I Security enforcement in corporate and government sectors


What is intrusion detection?

I Intrusions are activities that violate the security policy of a system.

I Intrusion detection is the process used to identify malicious behavior that targets a network and its resources


Intrusion Detection System

I Intrusion Detection Systems (IDSs) play a key role as a defense mechanism against malicious attacks in network security.

I Monitor traffic between users and networks for abnormal activity.

I Analyze patterns/signatures based on data packets.


Intrusion Detection Techniques

I Misuse intrusion detection: intrusion signatures

I Statistical/anomaly intrusion detection


Misuse intrusion detection

I Catch the intrusions in terms of the characteristics of known attacks or system vulnerabilities

I Built with knowledge of bad behaviors

I Collection of signatures: signature analysis

I Examine event stream for signature match: pattern matching

I Cannot detect novel or unknown attacks
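
The pattern-matching idea above can be sketched in a few lines of Python. The signature strings and event text are hypothetical placeholders, not taken from any real IDS rule set. The sketch only illustrates that an event matching no stored signature passes unnoticed, which is the limitation stated in the last bullet.

```python
# Hypothetical signature database: substrings that mark known attacks.
KNOWN_SIGNATURES = {
    "sql_injection": "' OR '1'='1",
    "path_traversal": "../../",
}

def match_signatures(event, signatures=KNOWN_SIGNATURES):
    """Return the names of all known signatures found in one event string."""
    return [name for name, pattern in signatures.items() if pattern in event]

events = [
    "GET /login?user=admin' OR '1'='1",   # matches a known signature
    "GET /files?path=../../etc/passwd",   # matches a known signature
    "GET /new-exploit?payload=xyz",       # novel attack: no signature matches
]
for event in events:
    hits = match_signatures(event)
    print(event, "->", hits if hits else "no match")
```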


Anomaly detection?

I An anomaly is a pattern in the data that does not conform to the expected behavior

I Also referred to as outliers, exceptions, peculiarities, surprises, etc.

I Detect any action that significantly deviates from the normal behavior

I Built with knowledge of normal behaviors

I Examine event stream for deviations from normal
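
As a minimal illustration of "deviation from normal", the sketch below builds a profile from normal data only and flags stream values that lie far from it. It assumes a single numeric feature and a simple z-score rule with a threshold of 3 standard deviations. Both are illustrative choices and not the L2E method of this talk, and the measurements are made up.

```python
import statistics

def fit_normal_profile(normal_values):
    """Summarize normal behavior by its sample mean and standard deviation."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)

def is_anomaly(value, profile, threshold=3.0):
    """Flag a value whose distance from the normal mean exceeds threshold * sd."""
    mean, sd = profile
    return abs(value - mean) > threshold * sd

# Hypothetical measurements observed during normal operation.
normal_values = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.0]
profile = fit_normal_profile(normal_values)

# Event stream to examine: only the third value is far from the profile.
stream = [10.1, 9.9, 14.5, 10.2]
flags = [is_anomaly(v, profile) for v in stream]
print(flags)
```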


Applications of Anomaly detection

I Network intrusion detection

I Insurance / Credit card fraud detection

I Healthcare Informatics / Medical diagnostics

I Industrial Damage Detection

I Image Processing / Video surveillance

I Novel Topic Detection in Text Mining


Real World Anomalies

Figure: Real world anomalies

Key Challenges

I Defining a representative normal region is challenging

I The boundary between normal and outlying behavior is often not precise

I The exact notion of an outlier is different for different application domains

I Availability of labeled data for training/validation

I Data is extremely large, noisy, and can be complex

I Normal behavior keeps evolving

I Fast and accurate real-time detection


Novelty detection

I Identification of new or unknown data or signals that a machine learning system is not aware of during training.

I A fundamental requirement of a good classification or identification system

I Abnormalities are very rare, or there may be no data that describes the faulty conditions


Techniques/approaches to detect anomalies

I Supervised - The data (observations, measurements, etc.) are labeled with pre-defined classes.

I Unsupervised - Class labels of the data are unknown

I Given a set of data, the task is to establish the existence of classes or clusters in the data


Support Vector Machine (SVM)

I A popular supervised anomaly detection technique

I SVMs are linear classifiers that find a hyperplane to separate two classes of data, positive and negative

I The common features in the normal and adversary groups need to be learned and differentiated

I By discovering the key characteristics of network traffic patterns, a decision boundary is superimposed in the space of feature representations.


SVM for Network Traffic Classification

I Effectively understand the patterns of network traffic and detect measurements deemed untrustworthy from malicious targets

I Eliminates the need for arbitrary assumptions about the underlying network topology and parameters or thresholds in favor of direct training data.

I Discover key characteristics of network traffic patterns by superimposing a boundary in the space of measurements.


SVM Framework

I Cast the problem of detecting malicious nodes in an SVM classification framework

I Labeled training examples: $(\vec{x}_i, y_i)$, where $\vec{x}_i$ is the representation of the $i$-th example in the feature space and $y_i \in \{1, -1\}$ is the corresponding label

I Decision boundary function: $y(\vec{x}) = \vec{w} \cdot \vec{x} + w_0$, where $\vec{w}$ is the weight vector and $w_0$ is the bias.


SVM Framework

I Network traffic features: $\vec{x}$

I Parameters found by optimization: $\vec{w}$ and $w_0$

I Prediction of training set label: $\vec{w} \cdot \vec{x} + w_0$
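
The decision rule above can be written out directly. The sketch below assumes a linear decision function and uses hypothetical weights and feature vectors chosen only for illustration. The label is taken from the sign of the decision value.

```python
def decision_value(w, x, w0):
    """Return w . x + w0 for weight vector w, feature vector x and bias w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def predict_label(w, x, w0):
    """Map the decision value to a label in {+1, -1} by its sign."""
    return 1 if decision_value(w, x, w0) >= 0 else -1

# Hypothetical two-feature example.
w = [0.5, -1.0]
w0 = 0.25
print(predict_label(w, [2.0, 0.5], w0))   # decision value is 0.75
print(predict_label(w, [0.0, 1.0], w0))   # decision value is -0.75
```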


SVM Optimization Problem

I
$$\min \; \frac{1}{2} \|\vec{W}\|^2 + \gamma \sum_{i=1}^{N} \varepsilon_i \quad \text{subject to} \quad y_i \left( \vec{W} \cdot \Phi(\vec{x}_i) + W_0 \right) \geq 1 - \varepsilon_i, \; \forall i$$

I where $N$: number of training examples.

I $\varepsilon_i$: collection of non-negative slack variables that account for possible misclassifications.

I $\gamma$: trade-off factor between the slack variables and the regularization on the norm of the weight vector $\vec{W}$.

I The constraint in this minimization implies that we want our predictions, $\vec{W} \cdot \Phi(\vec{x}_i) + W_0$, to be similar to the labels.
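
To make the roles of the slack variables and the trade-off factor concrete, the sketch below evaluates the objective for a given weight vector. It assumes the identity feature map $\Phi(\vec{x}) = \vec{x}$ and sets each slack to the smallest non-negative value that satisfies its constraint. The data points and the value of gamma are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def svm_primal_objective(w, w0, xs, ys, gamma):
    """Return (objective, slacks) for the soft-margin primal with Phi(x) = x."""
    # Smallest non-negative slack satisfying y_i (w . x_i + w0) >= 1 - eps_i.
    slacks = [max(0.0, 1.0 - y * (dot(w, x) + w0)) for x, y in zip(xs, ys)]
    objective = 0.5 * dot(w, w) + gamma * sum(slacks)
    return objective, slacks

# Hypothetical training set: the second point lies inside the margin.
xs = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
ys = [1, 1, -1]
objective, slacks = svm_primal_objective([1.0, 0.0], 0.0, xs, ys, gamma=2.0)
print(objective, slacks)
```

Only the point inside the margin receives a positive slack, so a larger gamma penalizes that violation more heavily relative to the norm of the weight vector.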


Solution to the SVM Optimization Problem

I Solve the optimization by quadratic programming in the dual

I Parameter estimation by cross-validation on the training set

I Given $\vec{W}^*$ and $W_0^*$, predict whether a node is an adversary or not by looking at the sign of $\vec{W}^* \cdot \Phi(\vec{x}) + W_0^*$.

I LibSVM package used to implement the SVM model-based anomaly detection


Key Challenges in Supervised learning

I Defining a representative normal region is challenging

I The boundary between normal and outlying behavior is often not precise

I The exact notion of an outlier is different for different application domains

I Availability of labeled data for training/validation

I Data is extremely large, noisy, and can be complex

I Normal behavior keeps evolving

I Fast and accurate real-time detection


What is a Mixture Model

I Let $f_{\theta_m}(\vec{x})$ denote the general mixture probability density function with $m$ components:

$$f_{\theta_m}(\vec{x}) = \sum_{i=1}^{m} \pi_i f(\vec{x} \mid \vec{\phi}_i).$$

I $\pi_i \geq 0$ for $i = 1, \ldots, m$, $\sum_{i=1}^{m} \pi_i = 1$; $\theta_m = (\pi_1, \ldots, \pi_{m-1}, \pi_m, \vec{\phi}_1^T, \ldots, \vec{\phi}_m^T)^T$.

I In theory, the $f(\vec{x} \mid \vec{\phi}_i)$'s could be any parametric density, although in practice they are often from the same parametric family (usually Gaussian)
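
A minimal sketch of the mixture density, assuming univariate Gaussian components. The weights, means and variances below are hypothetical. The Riemann sum at the end is only a sanity check that the weighted sum is still a density.

```python
import math

def normal_pdf(x, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, weights, means, variances):
    """Mixture density: weighted sum of the component densities at x."""
    return sum(p * normal_pdf(x, m, v) for p, m, v in zip(weights, means, variances))

# Hypothetical two-component example; the weights must sum to one.
weights, means, variances = [0.8, 0.2], [0.0, 3.0], [1.0, 0.25]

# Riemann-sum check over [-10, 10] that the mixture integrates to about one.
step = 0.01
total = sum(
    mixture_pdf(-10.0 + i * step, weights, means, variances) for i in range(2001)
) * step
print(round(total, 4))
```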


Estimation Approach with Built-in Robustness using L2E

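I When $m$ is known, we want to find the $f_{\theta_m}(\vec{x})$ that is close to the unknown true density $g(\vec{x})$ in $L_2$ distance.

I That is,

$$L_2(f_{\theta_m}, g) = \int_{-\infty}^{\infty} [f_{\theta_m}(\vec{x}) - g(\vec{x})]^2 \, d\vec{x}.$$

I The aim is to derive an estimate of $\theta_m$ that minimizes the $L_2$ distance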


Estimation Approach with Built-in Robustness

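$$L_2(f_{\theta_m}, g) = \int_{-\infty}^{\infty} f_{\theta_m}^2(\vec{x}) \, d\vec{x} - 2 \int_{-\infty}^{\infty} f_{\theta_m}(\vec{x}) g(\vec{x}) \, d\vec{x} + \int_{-\infty}^{\infty} g(\vec{x})^2 \, d\vec{x}$$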


Estimation Approach with Built-in Robustness

I The last integral is constant with respect to $\theta_m$

I The first integral is often available as a closed-form expression

I The second integral is simply the average height of the density estimate, $E_g[f_{\theta_m}(\vec{X})]$, so the middle term may be estimated as $-2 n^{-1} \sum_{i=1}^{n} f_{\theta_m}(\vec{X}_i)$, where $\vec{X}_i$ is a sample observation.


Computational Algorithm

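I The L2E estimator of $\theta_m$ is given by

$$\hat{\theta}_m^{L_2E} = \arg\min_{\theta_m} \left[ \int_{-\infty}^{\infty} f_{\theta_m}^2(\vec{x}) \, d\vec{x} - 2 n^{-1} \sum_{i=1}^{n} f_{\theta_m}(\vec{X}_i) \right].$$

I Normal identity:

$$\int_{-\infty}^{\infty} \phi(x \mid \mu_1, \sigma_1^2) \, \phi(x \mid \mu_2, \sigma_2^2) \, dx = \phi(\mu_1 - \mu_2 \mid 0, \sigma_1^2 + \sigma_2^2),$$

I where $\phi(x \mid \mu, \sigma^2)$ is the normal density function with mean $\mu$ and variance $\sigma^2$.

I For multivariate Gaussian mixture models (GMM), $f(\vec{x} \mid \vec{\phi}_i) = \phi(\vec{x} \mid \vec{\mu}_i, \Sigma_i)$, the use of the above identity reduces the key integral to

$$\int_{-\infty}^{\infty} f_{\theta_m}^2(\vec{x}) \, d\vec{x} = \sum_{k=1}^{m} \sum_{l=1}^{m} \pi_k \pi_l \, \phi(\vec{\mu}_k - \vec{\mu}_l \mid 0, \Sigma_k + \Sigma_l)$$

I This makes the integral tractable and thereby significantly reduces the computations involved in minimizing the L2E criterion.

I Thus, the L2E estimation may be performed by any standard optimization algorithm.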
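
The criterion above can be sketched for the univariate case, where the double sum uses the scalar normal identity. The example is illustrative only. It assumes a single Gaussian component with known unit variance, a made-up contaminated sample, and a plain grid search standing in for a standard optimizer. None of these choices come from the experiments in this talk.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, weights, means, variances):
    """Gaussian mixture density evaluated at x."""
    return sum(p * normal_pdf(x, m, v) for p, m, v in zip(weights, means, variances))

def l2e_criterion(data, weights, means, variances):
    """L2E criterion for a univariate Gaussian mixture, using the closed form."""
    m = len(weights)
    # First term: integral of f^2, a double sum given by the normal identity.
    integral_f2 = sum(
        weights[k] * weights[l]
        * normal_pdf(means[k] - means[l], 0.0, variances[k] + variances[l])
        for k in range(m) for l in range(m)
    )
    # Second term: -2/n times the sum of fitted density heights at the sample.
    mean_height = sum(mixture_pdf(x, weights, means, variances) for x in data) / len(data)
    return integral_f2 - 2.0 * mean_height

# Hypothetical contaminated sample: 90% N(0, 1) plus 10% outliers near 8.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(900)] + [rng.gauss(8.0, 1.0) for _ in range(100)]

# Fit one Gaussian with unit variance by grid search over the mean.
grid = [-2.0 + 0.05 * i for i in range(241)]
mu_l2e = min(grid, key=lambda mu: l2e_criterion(data, [1.0], [mu], [1.0]))
mu_mean = sum(data) / len(data)
print("L2E estimate:", round(mu_l2e, 2), "sample mean:", round(mu_mean, 2))
```

In this setup the outliers pull the sample mean toward 8, while the L2E estimate stays near the bulk of the data, which illustrates the built-in robustness.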


Data Analysis

I The effective detection and identification of anomalies in traffic requires the ability to separate them from normal network traffic.

I Network traffic data set from the University of New Mexico.

I Trace files contained 13831 sample observations with process IDs and their respective system calls.

I We apply our L2E (unsupervised) and compare the performance with SVM (supervised)


Results: Accuracy with increasing dimensions (70%-30% train-test partition of the data)

Dimensions   L2E False Detection Rate   L2E True Detection Rate   SVM True Detection Rate
2            0.774                      1.000                     0.9926
3            0.663                      1.000                     0.9926
4            0.561                      1.000                     0.9924
5            0.390                      0.989                     0.9924
6            0.322                      1.000                     0.9924
7            0.189                      1.000                     0.9924
8            0.000                      0.980                     0.9924


Results: Accuracy with varied train-test partitions (using 8 dimensions of the data)

Train-Test   L2E False Detection Rate   L2E True Detection Rate   SVM True Detection Rate
50-50        0.0003                     0.9884                    0.9914
60-40        0.0001                     0.9786                    0.9919
70-30        0.0002                     0.9836                    0.9920
80-20        0.0002                     0.9781                    0.9898
90-10        0.0001                     0.9814                    0.9884


Results: Accuracy with increasing testing sample size (using 70%-30% train-test partition and 8 dimensions of the data)

Testing Sample Size   L2E False Detection Rate   L2E True Detection Rate   SVM True Detection Rate
500                   0.0000                     0.9792                    0.9960
1000                  0.0000                     0.9744                    0.9960
1500                  0.0000                     0.9686                    0.9920
2000                  0.0000                     0.9814                    0.9935
2500                  0.0004                     0.9844                    0.9912
3000                  0.0000                     0.9876                    0.9907
3500                  0.0003                     0.9840                    0.9925
4000                  0.0003                     0.9836                    0.9927


Observations

I The false detection rate for SVM is zero for all scenarios for this data set.

I Despite the lack of labeled training data, the true detection rate of the L2E algorithm is comparable to that of the SVM for all scenarios.
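
The slides do not spell out how the two rates are computed. The sketch below assumes the usual reading: the true detection rate is the fraction of actual anomalies that are flagged, and the false detection rate is the fraction of normal observations that are flagged. The labels are made up for illustration.

```python
def detection_rates(actual, predicted):
    """Return (true_detection_rate, false_detection_rate).

    actual, predicted: sequences of booleans, True meaning "anomaly".
    """
    anomalies = sum(1 for a in actual if a)
    normals = len(actual) - anomalies
    true_hits = sum(1 for a, p in zip(actual, predicted) if a and p)
    false_hits = sum(1 for a, p in zip(actual, predicted) if not a and p)
    tdr = true_hits / anomalies if anomalies else float("nan")
    fdr = false_hits / normals if normals else float("nan")
    return tdr, fdr

# Hypothetical labels: 3 anomalies and 5 normal observations.
actual    = [True, True, False, False, False, False, True, False]
predicted = [True, False, False, True, False, False, True, False]
tdr, fdr = detection_rates(actual, predicted)
print(tdr, fdr)
```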


Analysis for Simulated data

I Case: 5 dimensions, n = 10000, with an 80/20 random split.

I Dataset: mu1 = (2, 2, 2, 2), mu2 = (2.5, 2.5, 2.5, 2.5), σ1 = diag(.1), σ2 = diag(.4), pi1 = 0.8, pi2 = 0.2

I We apply our L2E (unsupervised) and compare the performance with SVM (supervised) and some other machine learning algorithms.

I Classification accuracy for L2E is better than the alternatives.
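
A data set of this kind could be generated roughly as follows. This is a sketch under stated assumptions, not the code used for the reported results. It uses the mean vectors exactly as listed (four coordinates each), treats diag(.1) and diag(.4) as per-coordinate variances, and picks the seed and helper names for illustration only.

```python
import math
import random

def simulate_mixture(n, means, variances, weights, seed=0):
    """Draw n labeled points from a Gaussian mixture with diagonal covariances."""
    rng = random.Random(seed)
    points, labels = [], []
    for _ in range(n):
        # Pick a component index according to the mixing weights.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        sd = math.sqrt(variances[k])  # assumption: diag(.) holds variances
        points.append([rng.gauss(mu, sd) for mu in means[k]])
        labels.append(k)
    return points, labels

def train_test_split(points, labels, train_fraction=0.8, seed=0):
    """Randomly split the sample into training and testing parts."""
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)
    cut = int(round(train_fraction * len(order)))
    train = [(points[i], labels[i]) for i in order[:cut]]
    test = [(points[i], labels[i]) for i in order[cut:]]
    return train, test

means = [[2.0] * 4, [2.5] * 4]   # mean vectors as listed above
variances = [0.1, 0.4]
weights = [0.8, 0.2]
points, labels = simulate_mixture(10000, means, variances, weights)
train, test = train_test_split(points, labels)
print(len(train), len(test), sum(labels) / len(labels))
```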


Results: Comparing machine learning algorithms for the simulated data: testing sample size-

Classifier   Time   False -ve   False +ve
L2E          2.1    0.0345      0.0055
EM           16     0.0315      0.006
Trees        0.31   0.186       0.011
SVM          1.95   0.167       0.007
NN           5.2    0.214       0.01


Conclusion: Significance of our L2E

I Does not require labeled training data or special configuration

I Ease of use

I Efficiency in achieving accuracy without computational overhead

I Results are comparable to SVM and other machine learning algorithms


Current and Future work

I Evaluating the performance using multiple network traffic data sets for speaker recognition.

I Applying to real data sets with higher dimensions and a large number of components.

I Estimating the number of components.

I Data mining: Random Forest/Boosting


Some Reference Articles

I L2E Estimation of Mixture Complexity for Count Data, CSDA (Oct. 2009)

I Simultaneous Robust Estimation in Finite Mixture: The Continuous Case, JISA (Special Golden Jubilee issue, 2012)

I Detection of Anomalies in Network Traffic Using L2E for Accurate Speaker Recognition, IEEE Midwest, August 2012.

I Elements of Statistical Learning (book), http://www-stat.stanford.edu/~tibs/ElemStatLearn/


