Feature Extraction for Outlier Detection in High-Dimensional Spaces
Hoang Vu Nguyen
Vivekanand Gopalkrishnan
Motivation
• Outlier detection techniques compute distances between points in the full feature space
→ curse of dimensionality
→ solution: feature extraction
• Feature extraction techniques do not consider class imbalance
→ not suitable for asymmetric classification (and hence for outlier detection!)
Overview
• DROUT: Dimensionality Reduction/Feature Extraction for OUTlier Detection
Extract features for the detection process
To be integrated with outlier detectors
[Diagram: Training set → DROUT → Features; Testing set + Features → Detector → Outliers]
Background
• Training set:
Normal class ωm: cardinality Nm, mean vector μm, covariance matrix ∑m
Anomaly class ωa: cardinality Na, mean vector μa, covariance matrix ∑a
Nm >> Na
Total number of points: Nt = Nm + Na; total mean vector: μt = (Nm·μm + Na·μa)/Nt
∑w = (Nm/Nt)·∑m + (Na/Nt)·∑a
∑b = (Nm/Nt)·(μm − μt)(μm − μt)T + (Na/Nt)·(μa − μt)(μa − μt)T
∑t = ∑w + ∑b
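As a concrete check of these definitions, here is a minimal NumPy sketch (the function name `scatter_matrices` is illustrative, not from the paper) that computes the three scatter matrices and satisfies ∑t = ∑w + ∑b:

```python
import numpy as np

def scatter_matrices(X_m, X_a):
    """Within-, between-, and total scatter of a two-class training set.

    X_m: normal-class samples (N_m x d); X_a: anomaly samples (N_a x d).
    Follows the slide's definitions with biased (1/N) class covariances.
    """
    N_m, N_a = len(X_m), len(X_a)
    N_t = N_m + N_a
    mu_m, mu_a = X_m.mean(axis=0), X_a.mean(axis=0)
    mu_t = (N_m * mu_m + N_a * mu_a) / N_t          # total mean

    S_m = np.cov(X_m, rowvar=False, bias=True)      # class covariance of normals
    S_a = np.cov(X_a, rowvar=False, bias=True)      # class covariance of anomalies

    S_w = (N_m / N_t) * S_m + (N_a / N_t) * S_a     # within-class scatter
    d_m, d_a = mu_m - mu_t, mu_a - mu_t
    S_b = (N_m / N_t) * np.outer(d_m, d_m) + (N_a / N_t) * np.outer(d_a, d_a)
    return S_w, S_b, S_w + S_b                      # S_t = S_w + S_b
```

With biased class covariances, ∑w + ∑b equals the covariance of the pooled training set exactly, which makes the decomposition easy to verify.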
Background (cont.)
• Eigenspace of a scatter matrix ∑ (spanned by its eigenvectors) consists of 3 subspaces: principal, noise, and null
Solving the eigenvalue problem yields d eigenvalues v1 ≥ v2 ≥ … ≥ vd
Noise and null subspaces are caused by noise and, mainly, by insufficient training data
Existing methods discard the noise and null subspaces → loss of information
Jiang et al. 2008: regularize all 3 subspaces before performing feature extraction
[Figure: plot of eigenvalues over index i; principal subspace P (1..m), noise subspace N (m+1..r), null subspace Ø (r+1..d)]
DROUT Approach
• Weight-adjusted Within-Class Scatter Matrix
∑w = (Nm/Nt) . ∑m + (Na/Nt) . ∑a
Nm >> Na → ∑a is far less reliable than ∑m
Weighting ∑m and ∑a by (Nm/Nt) and (Na/Nt) lets ∑m dominate ∑w
→ when extracting features from ∑w (using PCA etc.), dimensions (eigenvectors) associated mainly with small eigenvalues of ∑m are unexpectedly removed
→ the extracted dimensions are not really relevant for the asymmetric classification task
Xudong Jiang: Asymmetric principal component and discriminant analyses for pattern classification. IEEE Trans. Pattern Anal. Mach. Intell., 31(5), 2009
• Solution
∑w = wm·∑m + wa·∑a with wm < wa and wm + wa = 1
→ more suitable for asymmetric classification
DROUT Approach (cont.)
• Which matrix to regularize first?
Goal: extract features that minimize the within-class and maximize the between-class variances
Within-class variances are estimated from limited training data
→ small estimated variances tend to be unstable and cause overfitting
→ proceed by regularizing the 3 subspaces of the (weight-adjusted) within-class scatter matrix first
DROUT Approach (cont.)
• Subspace decomposition
Solving the eigenvalue problem on the (weight-adjusted) ∑w yields eigenvectors {e1, e2, …, ed} with corresponding eigenvalues v1 ≥ v2 ≥ … ≥ vd
Identify m, the boundary between the principal and noise subspaces:
vmed = mediani ≤ r {vi}
vm+1 = maxi ≤ r {vi | vi < 2vmed – vr}
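The median rule above can be sketched in a few lines of NumPy (the name `split_principal_noise` is illustrative; eigenvalues are assumed already sorted in descending order, so the largest eigenvalue below the threshold is simply the first one below it):

```python
import numpy as np

def split_principal_noise(eigvals, r):
    """Locate m, the boundary between principal and noise subspaces.

    eigvals: eigenvalues sorted descending (v_1 >= ... >= v_d);
    r: rank of the scatter matrix (v_{r+1}..v_d are ~0).
    Median rule: v_{m+1} is the largest of the first r eigenvalues
    that falls below 2*v_med - v_r.
    """
    v = np.asarray(eigvals, dtype=float)
    v_med = np.median(v[:r])
    thresh = 2 * v_med - v[r - 1]            # v_r in the slide's 1-based notation
    below = np.where(v[:r] < thresh)[0]      # 0-based indices below the threshold
    return int(below[0])                     # v[below[0]] = v_{m+1}, so m = below[0]
```

For example, with eigenvalues [10, 8, 1, 0.9, 0.8] and r = 5, the median is 1 and the threshold is 2·1 − 0.8 = 1.2, so v3 = 1 is v_{m+1} and m = 2.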
DROUT Approach (cont.)
• Subspace regularization
a = v1 . vm . (m – 1)/(v1 – vm)
b = (mvm – v1)/(v1 – vm)
Regularize:
i ≤ m: xi = vi
m < i ≤ r: xi = a/(i + b)
r < i ≤ d: xi = a/(r + 1 + b)
A = [ei . wi]1 ≤ i ≤ d where wi = 1/sqrt(xi)
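A small sketch of these regularization formulas, returning the weights wi = 1/sqrt(xi) (the function name is illustrative). The model a/(i + b) is fitted so that it passes through (1, v1) and (m, vm); eigenvalues in the noise subspace are replaced by model values and the null subspace gets the constant value at i = r + 1:

```python
import numpy as np

def regularized_weights(eigvals, m, r):
    """Regularize eigenvalues x_i per the slide's three-subspace rule
    and return the feature weights w_i = 1 / sqrt(x_i)."""
    v = np.asarray(eigvals, dtype=float)
    v1, vm = v[0], v[m - 1]                  # v_1 and v_m (1-based notation)
    a = v1 * vm * (m - 1) / (v1 - vm)
    b = (m * vm - v1) / (v1 - vm)
    i = np.arange(1, len(v) + 1)             # 1-based index
    x = np.where(i <= m, v,                  # principal: keep v_i
        np.where(i <= r, a / (i + b),        # noise: model a/(i+b)
                 a / (r + 1 + b)))           # null: constant a/(r+1+b)
    return 1.0 / np.sqrt(x)
```

One can check the fit algebraically: a/(1 + b) = v1 and a/(m + b) = vm, so the model is continuous with the kept principal eigenvalues at both ends.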
[Figure: plot of the weights wi over index i across the subspaces P (1..m), N (m+1..r), and Ø (r+1..d)]
DROUT Approach (cont.)
• Subspace regularization and extraction: transform each data point p as p̃ = AT·p
Form the new (weight-adjusted) total scatter matrix from the transformed data (see Background) and solve the eigenvalue problem on it
B = matrix whose columns are the c resulting eigenvectors with largest eigenvalues
→ feature extraction is done only after regularization → limits loss of information
Xudong Jiang, Bappaditya Mandal, and Alex ChiChung Kot: Eigenfeature regularization and extraction in face recognition. IEEE Trans. Pattern Anal. Mach. Intell., 30(3):383–394, 2008
Transform matrix: M = A . B
DROUT Approach (cont.)
• Summary:
1. Let ∑w = wm·∑m + wa·∑a
2. Compute A from ∑w
3. Transform the training set using A
4. Compute the new total scatter matrix ∑t
5. Compute B by solving the eigenvalue problem on ∑t
6. M = A·B
7. Use M to transform the testing set
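The summary above can be sketched end-to-end in NumPy. One deliberate simplification in this sketch: instead of the three-subspace median-rule regularization, the eigenvalues of ∑w are simply floored at a small fraction of the largest one, so this is an approximation of DROUT rather than the exact algorithm (names and the floor constant are assumptions of the sketch):

```python
import numpy as np

def drout(X_m, X_a, w_m=0.1, w_a=0.9, c=None):
    """Approximate DROUT transform; returns M so that a test point p
    maps to M.T @ p (equivalently, rows of X_test map via X_test @ M)."""
    d = X_m.shape[1]
    c = c or d // 2                                   # number of extracted features
    # Step 1: weight-adjusted within-class scatter
    S_m = np.cov(X_m, rowvar=False, bias=True)
    S_a = np.cov(X_a, rowvar=False, bias=True)
    S_w = w_m * S_m + w_a * S_a
    # Step 2: regularizing matrix A = [e_i / sqrt(x_i)]
    v, E = np.linalg.eigh(S_w)
    v, E = v[::-1], E[:, ::-1]                        # eigenvalues descending
    x = np.maximum(v, 1e-3 * v[0])                    # simplified regularization (floor)
    A = E / np.sqrt(x)                                # scale each eigenvector column
    # Step 3: transform the training set
    Y = np.vstack([X_m, X_a]) @ A
    # Steps 4-5: total scatter in the transformed space; keep top-c eigenvectors
    S_t = np.cov(Y, rowvar=False, bias=True)
    _, F = np.linalg.eigh(S_t)
    B = F[:, ::-1][:, :c]
    # Step 6: final transform matrix
    return A @ B
```

The resulting M is d × c, so applying it to the testing set reduces each point to c extracted features, matching the pipeline on this slide.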
Related Work
• APCDA
Xudong Jiang: Asymmetric principal component and discriminant analyses for pattern classification. IEEE Trans. Pattern Anal. Mach. Intell., 31(5), 2009
Uses weight-adjusted scatter matrices for feature extraction
Discards noise and null subspaces → loss of information
• ERE
Xudong Jiang, Bappaditya Mandal, and Alex ChiChung Kot: Eigenfeature regularization and extraction in face recognition. IEEE Trans. Pattern Anal. Mach. Intell., 30(3):383–394, 2008
Performs regularization before feature extraction
Ignores class imbalance → not suitable for outlier detection
• ACP
David Lindgren and Per Spangeus: A novel feature extraction algorithm for asymmetric classification. IEEE Sensors Journal, 4(5):643–650, 2004
Considers neither noise/null subspaces nor class imbalance
Outlier Detection with DROUT
• Detectors: ORCA
Stephen D. Bay and Mark Schwabacher: Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In KDD, pages 29–38, 2003
BSOUT
George Kollios, Dimitrios Gunopulos, Nick Koudas, and Stefan Berchtold: Efficient biased sampling for approximate clustering and outlier detection in large data sets. IEEE Trans. Knowl. Data Eng., 15(5):1170–1187, 2003
Outlier Detection with DROUT (cont.)
• Datasets: KDD Cup 1999
Normal class (60593 records) vs. U2R class (246)
d = 34 (7 categorical attributes are excluded)
Training set: 1000 normal recs. vs. 50 anomalous recs.
Ann-thyroid 1
Class 3 vs. class 1
d = 21
Training set: 450 normal recs. vs. 50 anomalous recs.
Ann-thyroid 2
Class 3 vs. class 2
d = 21
Training set: 450 normal recs. vs. 50 anomalous recs.
• Parameter settings:
wm = 0.1 and wa = 0.9
Number of extracted features c ≤ d/2
Conclusion
• Summary of contributions
Explored the effect of feature extraction on outlier detection
DROUT: a novel feature-extraction framework to be integrated with outlier detectors
Results on real datasets with two detection methods are promising
• Future work
More experiments on larger datasets
Examine other approaches to dimensionality reduction