Sparse Gaussian Markov Random Field Mixtures for Anomaly Detection

Tsuyoshi Idé
IBM Research, T. J. Watson Research Center
[email protected]

Ankush Khandelwal
University of Minnesota, Department of Computer Science
[email protected]

Jayant Kalagnanam
IBM Research, T. J. Watson Research Center
[email protected]

To appear in the Proceedings of the 2016 IEEE International Conference on Data Mining (ICDM 2016).

Abstract—We propose a new approach to anomaly detection from multivariate noisy sensor data. We address two major challenges: to provide variable-wise diagnostic information and to automatically handle multiple operational modes. Our task is a practical extension of traditional outlier detection, whose goal is to compute a single scalar for each sample. To consistently define the variable-wise anomaly score, we leverage a predictive conditional distribution. We then introduce a mixture of Gaussian Markov random fields and its Bayesian inference, resulting in a sparse mixture of sparse graphical models. Our anomaly detection method is capable of automatically handling multiple operational modes while removing unwanted nuisance variables. We demonstrate the utility of our approach using real equipment data from the oil industry.

1. Introduction

Anomaly detection from sensor data is one of the critical applications of data mining. In the standard setting, we are given a data set under a normal operating condition, and we build a statistical model as a compact representation of the normal state. In operation, when a new observation is provided, we evaluate its discrepancy from what is expected by the normal model. This paper focuses on a different anomaly detection scenario, where the data set has multiple normal operating conditions. Moreover, instead of reporting a single anomaly score, the goal is to compute an anomaly score for each variable separately.

In spite of the long history of research in statistics, as represented by the classical Hotelling's T^2 theory [1], anomaly detection in modern condition-based monitoring applications is still challenging for various reasons. The major requirements can be summarized as follows. First, an anomaly detection algorithm should be capable of handling nuisance variables, which behave like random noise even under normal conditions. We wish to automatically down-weight such unimportant variables as a result of model training. Second, it should be capable of handling dynamic state changes over time. The assumption of a single Gaussian distribution in the T^2 theory is often inappropriate; we wish to capture multiple operational modes due to dynamic changes in the operational conditions of the system. Third, it should be capable of giving actionable or diagnostic information. In that direction, providing variable-wise anomaly scores is a promising approach, instead of giving a single scalar as in most traditional outlier detection methods.

Figure 1. High-level picture of variable-wise anomaly scoring using Gaussian Markov random fields (GMRF). Intuitively, the anomaly score for the i-th variable measures the discrepancy from what is expected by its neighbors ({x_{l_1}, x_{l_2}, x_{l_3}} in this case) in the GMRF sense.

To overcome the limitations of the traditional approach, much work has been done in the data mining community. Major approaches include subspace-based methods [2], [3], [4], [5], distance-based methods [6], [7], and mixture models [8], [9], [10]. However, the goal of these approaches is basically to provide a single scalar representing the degree of outlierness of a sample, and it is generally not straightforward to produce variable-wise information. Although the tasks of anomaly analysis [11] and anomaly localization [12] have been proposed recently, they are not readily applicable to our problem of multivariate but variable-wise anomaly scoring.

This paper presents a statistical machine learning approach to anomaly detection that can 1) automatically remove unwanted effects of nuisance variables, 2) handle multiple states of the system, and 3) compute variable-wise anomaly scores. Specifically, we focus on Gaussian Markov random fields (GMRFs), which provide a natural way to calculate variable-wise anomaly scores (see Fig. 1). To handle multiple operational modes, we then introduce a mixture of GMRFs and propose a novel method to consistently define the conditional distribution from the mixture. Also, to handle nuisance variables, we propose an approach to learning a sparse mixture of sparse graphical Gaussian models (GGMs) [13]. We leverage not only ℓ1 regularization to achieve sparsity in the variable dependency, but also the automated relevance determination (ARD) mechanism [14] to achieve sparsity over the mixture components. To the best of our knowledge, this is the first work on anomaly detection that extends GMRFs and sparse GGMs to mixtures. Using real sensor data from an oil production compressor, we show that our model is capable of capturing multiple operational conditions and of significantly reducing false alerts that had been thought unavoidable.

Regarding related work, in the area of image processing, GMRFs have been extensively studied for the purpose of denoising [15], [16]. However, most of that work is based on single-component GMRFs, not on mixtures. To the best of our knowledge, practical procedures to derive the conditional distribution from GMRF mixtures are not known, at least in the context of anomaly detection.

2. Problem setting

We are given a training data set D as

D = \{ x^{(t)} \in \mathbb{R}^M \mid t = 1, \dots, N \},   (1)

where N is the number of observations and M is the dimensionality of the samples, i.e., the number of sensors. We represent the dimensions by subscripts and the sample indexes by superscripts, e.g. x^{(n)}_i. The training data D is assumed to be collected under normal conditions of the system. One of the major assumptions is that the data-generating mechanism may include multiple operational modes and would not be captured by a unimodal model.

Our goal is to compute the variable-wise anomaly score for a new sample x. For the i-th variable, it can be generally defined as

a_i(x) = -\ln p(x_i \mid x_{-i}, D),   (2)

where p(x_i | x_{-i}, D) is the conditional predictive distribution for the i-th variable, given the rest of the variables x_{-i} \equiv (x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_M)^\top. Intuitively, a_i computes the degree of discrepancy between an observed x_i and what is expected from the rest of the variables x_{-i} (see Fig. 1).

This definition is a natural extension of Hotelling's T^2, which computes the outlier score as -\ln N(x \mid \mu, \Sigma), up to unimportant constant terms and a prefactor. Here N(\cdot \mid \mu, \Sigma) denotes the Gaussian distribution with mean \mu and covariance matrix \Sigma. Notice that the T^2 score is a single scalar even when x is a multivariate sample; our task is thus more general than traditional outlier detection.
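As a point of contrast, here is a minimal sketch of the classical scalar score (our own illustration, not from the paper; numpy assumed, function name ours):

import numpy as np

def hotelling_t2(x, mu, Sigma):
    # Classical Hotelling T^2: a single scalar per sample, however large M is.
    # Equivalent to -2 ln N(x | mu, Sigma) up to constants and a prefactor.
    d = x - mu
    return float(d @ np.linalg.solve(Sigma, d))

By contrast, Eq. (2) returns one score per variable; its closed form for a single Gaussian appears in Eq. (6) below.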

3. Gaussian Markov random field mixtures

This section describes how to derive the conditional predictive distribution p(x_i | x_{-i}, D) from a mixture of Gaussian Markov random fields, given the generative process of x.

3.1. Gaussian Markov random field

For the conditional predictive distribution, we assume the following mixture model:

p(x_i \mid x_{-i}, D) = \sum_{k=1}^{K} g^i_k(x) \, N(x_i \mid u^k_i, w^k_i),   (3)

where g^i_k(x) is a function called the gating function, which is learned from the data (see Eq. (21) and its footnote). Each k specifies a mixture component. Since we are interested in modeling the conditional distribution, unlike standard mixture models, the mixture weights depend on the index i. We also assume that the data-generating process of x is described by a K-component Gaussian mixture

p(x \mid D) = \sum_{k=1}^{K} \pi_k \, N(x \mid m^k, (A^k)^{-1}).

The means and precision matrices {m^k, A^k}, as well as the optimal number K, are also learned from the data (see Sec. 4), but let us assume that they are given for now.

For the mean u^k_i and the variance w^k_i in Eq. (3), we use the particular form

u^k_i = m^k_i - \frac{1}{A^k_{i,i}} \sum_{l \neq i} A^k_{i,l} (x_l - m^k_l),   (4)

w^k_i = \frac{1}{A^k_{i,i}}.   (5)

Gaussian distributions having these expressions are generally called Gaussian Markov random fields (GMRFs). The term "Markov" highlights the property that only direct neighbors, as defined by the nonzero entries of A^k, can affect the distribution of x_i (see Fig. 1). For the derivation of the functional form, see Theorem 2.3 in [17]. Note that the problem is trivial when K = 1. In this case, the anomaly score (2) is readily given by

a_i(x)_{K=1} = \frac{1}{2 A_{i,i}} \bigl[ A (x - m) \bigr]_i^2 - \frac{1}{2} \ln \frac{A_{i,i}}{2\pi},   (6)

where we dropped the superscript k and [\cdot]_i denotes the i-th entry of the vector inside the square bracket. This paper is all about how to handle the difficulties that arise when K > 1.
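To make Eqs. (4)-(6) concrete, here is a minimal numpy sketch (ours, not the authors' code). It uses the fact that Eq. (4) can be rewritten as u_i = x_i - [A(x - m)]_i / A_{i,i}, which lets all M conditionals be evaluated at once:

import numpy as np

def gmrf_conditional(x, m, A):
    # Eqs. (4)-(5): conditional means u_i and variances w_i for all i,
    # using u_i = x_i - [A(x - m)]_i / A_ii.
    d = A @ (x - m)
    diag = np.diag(A)
    return x - d / diag, 1.0 / diag

def anomaly_score_k1(x, m, A):
    # Eq. (6): the variable-wise anomaly score in the K = 1 case.
    d = A @ (x - m)
    diag = np.diag(A)
    return d ** 2 / (2.0 * diag) - 0.5 * np.log(diag / (2.0 * np.pi))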

3.2. Variational inference for GMRF mixture

Now let us consider how to find the gating function in Eq. (3) under the assumption that {(m^k, A^k)} are given. With a cluster assignment indicator for the i-th variable, h^i, we consider the following model:

p(x_i \mid x_{-i}, h^i) = \prod_{k=1}^{K} N(x_i \mid u^k_i, w^k_i)^{h^i_k},   (7)

p(h^i \mid \theta^i) = \prod_{k=1}^{K} (\theta^i_k)^{h^i_k},   (8)

p(\theta^i \mid \alpha^i) = \frac{\Gamma(\alpha)}{\Gamma(\alpha^i_1) \cdots \Gamma(\alpha^i_K)} \prod_{k=1}^{K} (\theta^i_k)^{\alpha^i_k - 1},   (9)


where \sum_{k=1}^{K} \theta^i_k = 1, \Gamma(\cdot) is the gamma function, and \alpha \equiv \sum_{k=1}^{K} \alpha^i_k, with \alpha^i_k being a hyperparameter treated as a given constant. Alternatively, p(\theta^i \mid \alpha) may be denoted by Dir(\theta^i \mid \alpha), the Dirichlet distribution. As usual, h^i_k \in \{0, 1\} and \sum_{k=1}^{K} h^i_k = 1. Based on this model, the complete log likelihood is written as

\ln P(D, H^i \mid \theta^i) = \sum_{n=1}^{N} \sum_{k=1}^{K} h^{i(n)}_k \ln \bigl\{ \theta^i_k \, N(x^{(n)}_i \mid u^k_i, w^k_i) \bigr\} + \ln \Gamma(\alpha) + \sum_{k=1}^{K} \bigl\{ (\alpha_k - 1) \ln \theta^i_k - \ln \Gamma(\alpha_k) \bigr\},   (10)

where h^{i(n)} is the indicator vector for the n-th sample and H^i is a collective notation for {h^{i(n)} | n = 1, . . . , N}.

To infer the model, we use the variational Bayes (VB) method [14]. We assume the functional forms of the posterior distributions to be

q(H^i) = \prod_{n=1}^{N} \prod_{k=1}^{K} \bigl( g^{i(n)}_k \bigr)^{h^{i(n)}_k},   (11)

q(\theta^i) = Dir(\theta^i \mid a^i).   (12)

The VB and point-estimation equations are given by

\ln q(H^i) = c. + \langle \ln P(D, H^i \mid \theta^i) \rangle_{\theta^i},   (13)

\ln q(\theta^i) = c. + \langle \ln P(D, H^i \mid \theta^i) \rangle_{H^i},   (14)

where c. symbolically represents a constant, and \langle \cdot \rangle_{H^i} and \langle \cdot \rangle_{\theta^i} denote the expectations with respect to q(H^i) and q(\theta^i), respectively. Using the well-known result \langle \ln \theta^i_k \rangle_{\theta^i} = \psi(a^i_k) - \psi(a^i), where \psi(\cdot) is the digamma function and a^i \equiv \sum_{k=1}^{K} a^i_k, we can easily derive the VB iterative equations

a^i_k \leftarrow \alpha_k + N^i_k,   (15)

\theta^i_k \leftarrow \exp\bigl\{ \psi(a^i_k) - \psi(a^i) \bigr\},   (16)

g^{i(n)}_k \leftarrow \frac{\theta^i_k \, N(x^{(n)}_i \mid u^k_i, w^k_i)}{\sum_{l=1}^{K} \theta^i_l \, N(x^{(n)}_i \mid u^l_i, w^l_i)} \quad \text{for all } n,   (17)

N^i_k \leftarrow \sum_{n=1}^{N} g^{i(n)}_k.   (18)

These substitutions are repeated until convergence. Repeating over i = 1, . . . , M and k = 1, . . . , K, we obtain an M × K matrix Θ = [θ^i_k].
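The VB loop (15)-(18) is short in code. The following sketch (ours, not the authors' code; scipy assumed, names hypothetical) fits the gating weights for one variable i, given the per-component conditional means U and variances W of Eqs. (4)-(5) evaluated at every training sample:

import numpy as np
from scipy.special import digamma
from scipy.stats import norm

def fit_gating(xi, U, W, alpha, n_iter=100):
    # xi: (N,) values x_i^(n); U, W: (N, K) arrays of u_i^k, w_i^k;
    # alpha: (K,) Dirichlet hyperparameters.
    N, K = U.shape
    theta = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        logp = norm.logpdf(xi[:, None], loc=U, scale=np.sqrt(W))
        g = theta * np.exp(logp - logp.max(axis=1, keepdims=True))
        g /= g.sum(axis=1, keepdims=True)              # Eq. (17)
        Nik = g.sum(axis=0)                            # Eq. (18)
        a = alpha + Nik                                # Eq. (15)
        theta = np.exp(digamma(a) - digamma(a.sum()))  # Eq. (16)
    return theta, g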

3.3. Predictive distribution for GMRF mixture

The predictive distribution in Eq. (2) is formally defined as

p(x_i \mid x_{-i}, D) = \int dh^i \, q(h^i) \, p(x_i \mid x_{-i}, h^i).   (19)

Figure 2. Overview of the sGMRFmix algorithm. Starting from K initial patterns that may be redundant, a sparse mixture of sparse graphical models is learned, from which p(x_i | x_{-i}, D) is derived.

To find q(h^i), the posterior distribution of the indicator variable associated with a new sample x, consider an augmented data set D ∪ {x}. In this case, the complete log likelihood is given by

\ln P(D, x, H^i, h^i \mid \theta^i) = \ln P(D, H^i \mid \theta^i) + \sum_{k=1}^{K} h^i_k \ln \bigl\{ \theta^i_k \, N(x_i \mid u^k_i, w^k_i) \bigr\}.   (20)

Correspondingly, let the posterior be

q(H^i, h^i) = q(H^i) \times \prod_{k=1}^{K} (g^i_k)^{h^i_k},

from which we get VB iterative equations similar to Eqs. (15)-(18). Although the resulting {θ^i_k} differs from the one obtained using only D, Eq. (18) suggests that the difference is just on the order of 1/N, which is negligible when N ≫ 1. Therefore, we conclude that the posterior distribution for a new sample x is given by

g^i_k(x) \approx \frac{\theta^i_k \, N(x_i \mid u^k_i, w^k_i)}{\sum_{l=1}^{K} \theta^i_l \, N(x_i \mid u^l_i, w^l_i)},   (21)

where θ^i_k is the solution of Eqs. (17)-(18).¹

Finally, using Eqs. (3) and (21), the variable-wise anomaly score defined in Eq. (2) is given by

a_i(x) = -\ln \sum_{k=1}^{K} g^i_k(x) \, N(x_i \mid u^k_i, w^k_i).   (22)

The r.h.s. includes the parameters {(m^k, A^k)} that represent the generative process of x. The next section discusses how to obtain them.
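Putting Eqs. (21)-(22) together, scoring a new sample takes only a few lines. A sketch under the same assumptions as above, reusing gmrf_conditional from the earlier snippet:

import numpy as np
from scipy.stats import norm

def anomaly_scores(x, theta, means, precisions):
    # theta: (M, K) learned gating weights; means, precisions: the
    # {(m^k, A^k)} of the K mixture components.
    M, K = theta.shape
    dens = np.empty((M, K))
    for k in range(K):
        u, w = gmrf_conditional(x, means[k], precisions[k])  # Eqs. (4)-(5)
        dens[:, k] = norm.pdf(x, loc=u, scale=np.sqrt(w))
    g = theta * dens
    g /= g.sum(axis=1, keepdims=True)       # gating function, Eq. (21)
    return -np.log((g * dens).sum(axis=1))  # Eq. (22)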

4. Sparse mixture of sparse graphical models

To capture multiple operational modes of the system, we assume a Gaussian mixture model for the generative process of x. To ensure the capability of removing noisy nuisance variables, we further require the model to be sparse. This section explains how to learn a sparse mixture of sparse graphical Gaussian models (see Fig. 2).

1. By construction, g^i_k(x) has to be treated as a constant when considering the normalization condition of Eq. (3).


4.1. Observation model and priors

We employ a Bayesian Gaussian mixture model having K mixture components. First, we define the observation model by

p(x \mid z, \mu, \Lambda) \equiv \prod_{k=1}^{K} N(x \mid \mu^k, (\Lambda^k)^{-1})^{z_k},   (23)

where \mu and \Lambda are collective notations representing {\mu^k} and {\Lambda^k}, respectively. Also, z is the indicator variable of cluster assignment. As before, z_k \in \{0, 1\} for all k, and \sum_{k=1}^{K} z_k = 1.

We place a Gauss-Laplace prior on (\mu^k, \Lambda^k) and a categorical distribution on z:

p(\mu^k, \Lambda^k) \propto e^{-\frac{\rho}{2} \| \Lambda^k \|_1} \, N(\mu^k \mid m_0, (\lambda_0 \Lambda^k)^{-1}),   (24)

p(z \mid \pi) = \prod_{k=1}^{K} \pi_k^{z_k} \quad \text{s.t.} \quad \sum_{k=1}^{K} \pi_k = 1, \ \pi_k \geq 0,   (25)

where \| \Lambda \|_1 = \sum_{i,j} |\Lambda_{i,j}|. The parameter \pi is determined as a part of the model, while \rho, \lambda_0, m_0 are given constants. From these equations, we can write down the complete likelihood as

P(D, Z, \Lambda \mid \mu, \pi) \equiv \prod_{k=1}^{K} p(\mu^k, \Lambda^k) \times \prod_{n=1}^{N} p(z^{(n)} \mid \pi) \, p(x^{(n)} \mid z^{(n)}, \mu, \Lambda),   (26)

where Z is a collective notation for {z^{(n)}}.

4.2. Variational Bayes inference

Since the Laplace distribution is not the conjugate prior of the Gaussian, exact inference is not possible. We again use the VB method, based on the categorical distribution for the posterior of Z and the Gauss-delta distribution for the posterior of (\mu, \Lambda):

q(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} (r^{(n)}_k)^{z^{(n)}_k},   (27)

q(\mu, \Lambda) = \prod_{k=1}^{K} N(\mu^k \mid m^k, (\lambda^k \Lambda^k)^{-1}) \, \delta(\Lambda^k - \bar{\Lambda}^k),   (28)

where \delta(\cdot) is Dirac's delta function. We combine the VB analysis for {Z, \mu, \Lambda} with point estimation for the mixture weight \pi. As shown in [18], this leads to a sparse solution (i.e. \pi_k = 0 for many k's) through the ARD mechanism.

By expanding \langle \ln P(D, Z, \Lambda \mid \pi, \mu) \rangle_{\Lambda, \mu}, it is straightforward to obtain the VB iterative equations for {r^{(n)}_k}:

\ln r^{(n)}_k \leftarrow \ln \bigl\{ \pi_k \, N(x^{(n)} \mid m^k, (\bar{\Lambda}^k)^{-1}) \bigr\} - \frac{M}{2 \lambda^k},   (29)

r^{(n)}_k \leftarrow \frac{r^{(n)}_k}{\sum_{l=1}^{K} r^{(n)}_l}.   (30)

Similarly, for the other variables, including the point-estimated \pi, we have the VB solutions

N^k \leftarrow \sum_{n=1}^{N} r^{(n)}_k, \qquad \pi_k \leftarrow \frac{N^k}{N},   (31)

\bar{x}^k \leftarrow \frac{1}{N^k} \sum_{n=1}^{N} r^{(n)}_k x^{(n)},   (32)

\Sigma^k \leftarrow \frac{1}{N^k} \sum_{n=1}^{N} r^{(n)}_k (x^{(n)} - \bar{x}^k)(x^{(n)} - \bar{x}^k)^\top,   (33)

\lambda^k \leftarrow \lambda_0 + N^k, \qquad m^k \leftarrow \frac{1}{\lambda^k} (\lambda_0 m_0 + N^k \bar{x}^k),   (34)

Q^k \leftarrow \Sigma^k + \frac{\lambda_0}{\lambda^k} (\bar{x}^k - m_0)(\bar{x}^k - m_0)^\top,   (35)

\bar{\Lambda}^k \leftarrow \arg\max_{\Lambda^k} \Bigl\{ \ln |\Lambda^k| - \mathrm{Tr}(\Lambda^k Q^k) - \frac{\rho}{N^k} \|\Lambda^k\|_1 \Bigr\}.   (36)

These VB equations are computed for k = 1, . . . , K and repeated until convergence. Notice that the VB equation for \bar{\Lambda}^k preserves the original ℓ1-regularized GGM formulation [13]. We see that the fewer samples a cluster has, the more strongly the ℓ1 regularization is applied, due to the ρ/N^k term.
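As an illustration of Eqs. (29)-(36), one VB sweep could be sketched as follows (our code, not the authors'; numpy, scipy, and scikit-learn assumed). Note that scikit-learn's graphical_lasso penalizes only the off-diagonal entries of the precision matrix, a close but not exact stand-in for the full ℓ1 penalty ‖Λ‖₁ of Eq. (36):

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.covariance import graphical_lasso

def vb_sweep(X, r, m0, lam0, rho):
    # One pass of Eqs. (31)-(36) given responsibilities r of shape (N, K).
    N, M = X.shape
    pi, means, lams, Lbars = [], [], [], []
    for k in range(r.shape[1]):
        Nk = r[:, k].sum()                                        # Eq. (31)
        pi.append(Nk / N)
        xbar = r[:, k] @ X / Nk                                   # Eq. (32)
        diff = X - xbar
        Sk = (r[:, k, None] * diff).T @ diff / Nk                 # Eq. (33)
        lamk = lam0 + Nk                                          # Eq. (34)
        mk = (lam0 * m0 + Nk * xbar) / lamk
        Qk = Sk + (lam0 / lamk) * np.outer(xbar - m0, xbar - m0)  # Eq. (35)
        _, Lbar = graphical_lasso(Qk, alpha=rho / Nk)             # Eq. (36)
        means.append(mk); lams.append(lamk); Lbars.append(Lbar)
    return np.array(pi), means, lams, Lbars

def update_r(X, pi, means, lams, Lbars):
    # Eqs. (29)-(30): refresh the responsibilities in log space.
    M = X.shape[1]
    logr = np.column_stack([
        np.log(pi[k] + 1e-300)
        + multivariate_normal.logpdf(X, means[k], np.linalg.inv(Lbars[k]))
        - M / (2.0 * lams[k])
        for k in range(len(pi))])
    logr -= logr.max(axis=1, keepdims=True)
    r = np.exp(logr)
    return r / r.sum(axis=1, keepdims=True)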

Finally, the predictive distribution is given by

p(x \mid D) = \sum_{k=1}^{K} \pi_k \int d\mu^k \int d\Lambda^k \, N(x \mid \mu^k, (\Lambda^k)^{-1}) \, q(\mu^k, \Lambda^k) = \sum_{k=1}^{K} \pi_k \, N(x \mid m^k, (A^k)^{-1}),   (37)

where A^k \equiv \frac{\lambda^k}{1 + \lambda^k} \bar{\Lambda}^k. This is the distribution we assumed in Sec. 3.
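The mapping from the variational parameters to the predictive precision in Eq. (37) is then a one-liner (a sketch with our own naming):

def predictive_precisions(lams, Lbars):
    # Eq. (37): A^k = (lambda^k / (1 + lambda^k)) * Lambda-bar^k.
    return [lam / (1.0 + lam) * L for lam, L in zip(lams, Lbars)]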

5. Algorithm summary

Algorithm 1 gives a high-level summary of the sGMRFmix (sparse GMRF mixture) algorithm. The first stage (Sec. 4), sparseGaussMix, starts with a sufficiently large K and identifies the major dependency patterns in the data. In the context of industrial condition-based monitoring, initialization of {m^k, \bar{\Lambda}^k} can be done naturally by disjointly partitioning the data along the time axis as D = D_1 ∪ . . . ∪ D_K and applying, e.g., the graphical lasso algorithm [13] to each partition, as illustrated in Fig. 2; a sketch of this step follows Algorithm 1 below. After the initialization, the VB iteration can start with π_k = 1/K and λ^k = π_k N, as well as with λ_0 = 1 and m_0 = 0 if no prior information is available.

The second stage (Sec. 3), GMRFmix, determines the gating function g^i_k(x) for an arbitrary input sample x through the resulting θ^i_k's, to define the anomaly score in Eq. (22). For α, it is reasonable to choose α_k = 1 for k's with π_k ≠ 0 and zero otherwise. Regarding ρ, an optimal value should be determined together with the threshold on the anomaly score so that the performance of anomaly detection is maximized. One reasonable performance metric is the F-measure between the accuracies separately computed for the normal and anomalous samples.


Algorithm 1 The sGMRFmix algorithm
Input: D, ρ, α.
Output: {m^k, λ^k, \bar{\Lambda}^k}, {θ^i_k}.
{π_k, m^k, A^k} = sparseGaussMix(D, m_0, λ_0, ρ).
{θ^i_k} = GMRFmix({π_k, m^k, A^k}, α).
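A sketch of the initialization described above (ours; scikit-learn assumed): split the training series into K contiguous blocks and fit a sparse GGM to each with the graphical lasso. In practice a small ridge term may be needed to keep each sample covariance well conditioned:

import numpy as np
from sklearn.covariance import graphical_lasso

def init_by_partition(X, K, rho, ridge=1e-6):
    # Disjoint time-axis partitioning D = D_1 ∪ ... ∪ D_K (Sec. 5).
    means, Lbars = [], []
    for B in np.array_split(X, K):
        S = np.cov(B, rowvar=False) + ridge * np.eye(X.shape[1])
        _, L = graphical_lasso(S, alpha=rho)
        means.append(B.mean(axis=0))
        Lbars.append(L)
    return means, Lbars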

6. Experimental results

This section presents experimental results of the proposed algorithm. The methods compared are as follows.

single [11] is essentially the K = 1 version of the proposed algorithm with λ_0 = 0. The same ρ value as for sGMRFmix is used.

sPCA [19] computes the i-th anomaly score via

a_i(x)_{sPCA} \equiv |x_i - e_i^\top U U^\top x|,

where e_i is the i-th basis vector and U ≡ [u^1, . . . , u^{K'}] is the matrix of K' principal components computed by sparse principal component analysis (sPCA) [20]. The same values of ρ and K' as for sGMRFmix are used for the ℓ1 regularization coefficient and U, respectively.

autoencoder trains a sparse autoencoder [21] with one hidden layer on the normalized input x_i ← (x_i − min_i) / (max_i − min_i), where max_i and min_i are the maximum and minimum values of the i-th variable over the training data, respectively. The anomaly score is simply defined as

a_i(x)_{autoencoder} \equiv |x_i - \hat{x}_i|,   (38)

where \hat{x}_i is the output of the i-th output neuron. The input, hidden, and output layers have the same number of neurons. The values of the ℓ1 and ℓ2 regularization parameters (β and λ in [21]) are determined by cross-validation on the average reconstruction error over the training data.
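For reference, the sPCA baseline score is a one-liner once U is available (our sketch; U could come from, e.g., sklearn.decomposition.SparsePCA, which is our suggestion rather than the paper's specification):

import numpy as np

def spca_scores(x, U):
    # a_i(x) = |x_i - e_i^T U U^T x|, with U of shape (M, K').
    return np.abs(x - U @ (U.T @ x))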

6.1. Synthetic data: illustration

We synthetically generated a data set by adding t-distributed random noise to Gaussian-distributed samples whose correlation structures are shown in Fig. 3. By shifting the mean, we created an A-B-A-B sequence for the training data and an A-B-Anomaly sequence for the testing data. Both data sets have 1,000 samples, as shown in Fig. 4, where the Anomaly pattern is highlighted with the dashed line. To initialize the model, we used a K = 7 disjoint partitioning (see Sec. 5 for details). Figure 5 shows the learned model. We see that the distinctive patterns A and B are automatically discovered without specifying the ground-truth cluster number, thanks to the ARD mechanism.

Figure 5. Converged mixture weights {π_k} and the corresponding precision matrices for the synthetic data.

With the trained model, we computed the anomaly score on the testing data and evaluated the AUC (area under the curve) based on the classification accuracies separately computed for negative and positive samples. The accuracies are defined with respect to the negative and positive labels given to the first and second halves of the testing data, respectively. Table 1 clearly shows that the proposed model outperforms the alternatives.

Figure 3. Synthetic Pattern A, Pattern B, and Anomaly (pairwise scatter plots of the variables x1, . . . , x5).

Figure 4. Synthetic training data (a) and testing data (b) for the variables x1, . . . , x5.

6.2. Real application: offshore oil production

We applied the proposed method to the task of early anomaly detection for a compressor in offshore oil production. Figure 6 shows simulated examples of the M = 53 sensor signals (acceleration, pressure, flow rate, etc.) over about one month. Apparently, the system irregularly makes transitions between different trends under heavy spike-like noise. See [22] for more details about the challenges of condition-based monitoring in the oil industry.

Figure 6. Compressor data under normal operating conditions.

To train the model, we used reference data under normal operating conditions over about one year, selected by domain experts. For sGMRFmix, we partitioned the data into K = 21 disjoint subsets for initialization. We also trained the alternative methods; the cross-validated parameters of the sparse autoencoder are λ = 10^{-8} and β = 10^{-8} in the notation of [21]. Figure 7 shows the {π_k} computed by sparseGaussMix with ρ = 0.1. Thanks to the ARD mechanism, the algorithm automatically discovered four major patterns (K' = 13 in total).

Figure 7. Converged mixture weights {π_k} for the compressor data.

Using the trained model, we computed the variable-wise anomaly score on testing data that includes a few real failures which an existing monitoring system was unable to detect. Figure 8 presents the distribution of a_14 over the testing data. This is a flow-rate variable and was confirmed to be involved in the physical failure process related to pump surge. In Fig. 8 (a), we see that the anomaly score in the pre-failure window is significantly higher than in the other periods, while the separation is not very clear in (b)-(c).

Figure 8. Density plots of the anomaly scores (arbitrary scales) for variable x14: (a) sGMRFmix, (b) single, (c) sPCA, (d) autoencoder. The dashed lines represent the distribution in the 24-hour pre-failure window.

TABLE 1. AUC VALUES FOR THE SYNTHETIC DATA.

sGMRFmix   single   sPCA   autoencoder
  0.72      0.52    0.63      0.57


7. Conclusion

We have proposed a new outlier detection method, the sparse GMRF mixture, that is capable of handling multiple operational modes in the normal condition and of computing variable-wise anomaly scores. We derived variational Bayes iterative equations based on the Gauss-delta posterior model.

References

[1] T. W. Anderson, An Introduction to Multivariate Statistical Analysis, 3rd ed. Wiley-Interscience, 2003.

[2] T. Ide and H. Kashima, “Eigenspace-based anomaly detection in computer systems,” in Proc. ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining, 2004, pp. 440–449.

[3] H. Ringberg, A. Soule, J. Rexford, and C. Diot, “Sensitivity of PCA for traffic anomaly detection,” in ACM SIGMETRICS Performance Evaluation Review, vol. 35, no. 1. ACM, 2007, pp. 109–120.

[4] L. Xiong, X. Chen, and J. Schneider, “Direct robust matrix factorization for anomaly detection,” in Proc. 2011 IEEE 11th International Conference on Data Mining (ICDM). IEEE, 2011, pp. 844–853.

[5] D. Blythe, P. von Bünau, F. Meinecke, and K.-R. Müller, “Feature extraction for change-point detection using stationary subspace analysis,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 4, pp. 631–643, 2012.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: identifying density-based local outliers,” ACM SIGMOD Record, vol. 29, no. 2, pp. 93–104, 2000.

[7] G. O. Campos, A. Zimek, J. Sander, R. J. Campello, B. Micenková, E. Schubert, I. Assent, and M. E. Houle, “On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study,” Data Mining and Knowledge Discovery, pp. 1–37, 2016.

[8] K. Yamanishi, J. Takeuchi, G. Williams, and P. Milne, “On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms,” in Proc. of the Sixth ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2000, pp. 320–324.


[9] S. Hirai and K. Yamanishi, “Detecting changes of clustering structures using normalized maximum likelihood coding,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’12, 2012, pp. 343–351.

[10] L. I. Kuncheva, “Change detection in streaming multivariate data using likelihood detectors,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 5, pp. 1175–1180, 2013.

[11] T. Ide, A. C. Lozano, N. Abe, and Y. Liu, “Proximity-based anomaly detection using sparse structure learning,” in Proc. of 2009 SIAM International Conference on Data Mining (SDM 09), pp. 97–108.

[12] R. Jiang, H. Fei, and J. Huan, “Anomaly localization for network data streams with graph joint sparse PCA,” in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011, pp. 886–894.

[13] J. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance estimation with the graphical lasso,” Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.

[14] C. M. Bishop, Pattern Recognition and Machine Learning. Springer-Verlag, 2006.

[15] S. M. Schweizer and J. M. F. Moura, “Hyperspectral imagery: clutter adaptation in anomaly detection,” IEEE Transactions on Information Theory, vol. 46, no. 5, pp. 1855–1871, Aug 2000.

[16] L. Shadhan and I. Cohen, “Detection of anomalies in texture images using multi-resolution random field models,” Signal Processing, vol. 87, pp. 3045–3062, 2007.

[17] H. Rue and L. Held, Gaussian Markov Random Fields: Theory and Applications, ser. CRC Monographs on Statistics & Applied Probability. Chapman & Hall, 2005.

[18] A. Corduneanu and C. M. Bishop, “Variational Bayesian model selection for mixture distributions,” in Artificial Intelligence and Statistics. Morgan Kaufmann, 2001, pp. 27–34.

[19] R. Jiang, H. Fei, and J. Huan, “A family of joint sparse PCA algorithms for anomaly localization in network data streams,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 11, pp. 2421–2433, 2013.

[20] H. Zou, T. Hastie, and R. Tibshirani, “Sparse principal component analysis,” Journal of Computational and Graphical Statistics, vol. 15, pp. 265–286, 2006.

[21] A. Ng, “Sparse autoencoder,” CS294A Lecture notes, vol. 72, pp. 1–19, 2011.

[22] S. Natarajan and R. Srinivasan, “Multi-model based process condition monitoring of offshore oil and gas production process,” Chemical Engineering Research and Design, vol. 88, no. 5, pp. 572–591, 2010.
