Finding the Oddballs

Adam Ashenfelter ([email protected])

Apr 16, 2017

Transcript
Page 1: Adam Ashenfelter - Finding the Oddballs

Finding the Oddballs

Adam Ashenfelter ([email protected])

Page 2: Adam Ashenfelter - Finding the Oddballs

Why Anomaly Detection?

Data Cleaning

- Alerting when data may be corrupt

Science / Analysis

- Faster identification of interesting measurements

Security / Fraud Detection

- Target unusual and/or suspicious activity

Detect Concept Drift for Predictive Models

- Generate alerts when retraining may be necessary

Page 3: Adam Ashenfelter - Finding the Oddballs

DARPA ADAMS Project

Desktop activity collected from ~5,000 employees of a corporation using Raytheon-Oakley's SureView

Insider threat scenarios overlaid on selected employees:

● Anomalous Encryption
● Layoff Logic Bomb
● Insider Startup
● Circumventing SureView

Page 4: Adam Ashenfelter - Finding the Oddballs

Anomaly Detection Techniques

Many universities competed to find the simulated attacks in monthly contests.

Tom Dietterich’s research group (Oregon State University) considered a multitude of detectors:

● Local Outlier Factor (lof)
● One-Class SVM (ocsvm)
● Support Vector Data Description (svdd)
● Isolation Forest (if)
● Ensemble of Gaussian Mixture Models (egmms)

Page 5: Adam Ashenfelter - Finding the Oddballs

Lack of Public Anomaly Detection Datasets

Most anomaly detection datasets are proprietary and/or classified.

No equivalent of the UCI supervised learning dataset collection.

How can we rigorously compare all these detection techniques in a repeatable manner?

Let’s transform supervised learning datasets!

A. Emmott, et al. Systematic Construction of Anomaly Detection Benchmarks. ODD Workshop, 2013

Page 6: Adam Ashenfelter - Finding the Oddballs

Support Vector Machine Techniques

One-Class SVM

Shifts the data away from the origin (with a kernel feature map) and searches for a hyper-plane which separates the majority of the data from the origin.

B. Schölkopf, et al. Estimating the support of a high-dimensional distribution, 1999

Available in libsvm, R, and scikit-learn.

Support Vector Data Description

Finds the smallest hyper-sphere (in kernel space) that encloses most of the data.

Tax and Duin. Support vector data description. Machine Learning, 2004

Available in libsvmtools.
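A minimal scikit-learn sketch of one-class SVM scoring; the nu and gamma settings here are illustrative assumptions, not values from the talk:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))               # mostly "normal" training points
X_test = np.array([[0.0, 0.0], [6.0, 6.0]])

# nu upper-bounds the fraction of training points treated as outliers
detector = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X)

# decision_function: positive = inside the learned region, negative = outside
print(detector.decision_function(X_test))   # the [6, 6] point scores negative
```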

Pages 7-12: Adam Ashenfelter - Finding the Oddballs

Support Vector Machine Techniques

[Figure sequence illustrating the SVM techniques]

Pages 13-16: Adam Ashenfelter - Finding the Oddballs

Local Outlier Factor

Compute the average distance from each point to its closest neighbors as an estimate of the inverse density.

For each point, calculate:

LOF = (average density of the neighbor points) / (current point's density)

Downside - picking 'k', the number of neighbors.

M. Breunig, et al. LOF: identifying density-based local outliers. ACM SIGMOD Record, 2000

Available in R.
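A minimal numpy/scipy sketch of that ratio (a simplification of the paper's reachability-distance definition; k = 10 is an arbitrary choice):

```python
import numpy as np
from scipy.spatial import cKDTree

def simplified_lof(X, k=10):
    """Simplified LOF: average density of a point's neighbors divided
    by the point's own density (higher = more anomalous)."""
    tree = cKDTree(X)
    # k+1 because each point is its own nearest neighbor
    dists, idx = tree.query(X, k=k + 1)
    dists, idx = dists[:, 1:], idx[:, 1:]
    # mean distance to the k nearest neighbors ~ inverse density
    inv_density = dists.mean(axis=1)
    density = 1.0 / np.maximum(inv_density, 1e-12)
    # LOF = average neighbor density / current point's density
    return density[idx].mean(axis=1) / density
```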

Page 17: Adam Ashenfelter - Finding the Oddballs

Ensemble of Gaussian Mixture Models

A Gaussian mixture model can capture density, but picking 'k' is tricky.

[Figure: GMM with k = 3]

A. Emmott, et al. Systematic Construction of Anomaly Detection Benchmarks. ODD Workshop, 2013

Page 18: Adam Ashenfelter - Finding the Oddballs

Ensemble of Gaussian Mixture Models

Let's build an ensemble!

Pick a variety of values for 'k', and fit many GMMs.

Keep the best, and average their scores.
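A minimal sketch of such an ensemble, assuming BIC as the selection criterion and averaged negative log-likelihood as the anomaly score (the talk does not pin down either choice):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def egmm_scores(X, ks=(2, 3, 4, 5), fits_per_k=3, keep=6):
    """Fit GMMs over a range of k, keep the best by BIC, and average
    their per-point negative log-likelihoods as the anomaly score."""
    fits = []
    for k in ks:
        for seed in range(fits_per_k):
            gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
            fits.append((gmm.bic(X), gmm))
    fits.sort(key=lambda f: f[0])               # lower BIC = better fit
    best = [gmm for _, gmm in fits[:keep]]
    # higher score = lower likelihood = more anomalous
    return -np.mean([gmm.score_samples(X) for gmm in best], axis=0)
```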

Page 19: Adam Ashenfelter - Finding the Oddballs

Isolation Forest

To grow an isolation tree (a sketch follows the references below):

● Take a small random sample of the data (256 points).
● Choose a random field and split anywhere between the minimum and maximum values.
● Repeat random splits until each instance is isolated from the rest.

F. Liu, et al. Isolation Forest. ICDM, 2008

Available in R; coming to scikit-learn (0.18-dev).

Extended Isolation Forests available with BigML.
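A minimal sketch of growing one isolation tree; representing instances as dicts keyed by field name is an arbitrary choice, reused by the scoring sketches later:

```python
import random

def grow_itree(points, depth=0, max_depth=12):
    """Recursively isolate instances (dicts of field -> value) with
    random axis-parallel splits; returns a nested dict of nodes."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}                     # leaf
    field = random.choice(list(points[0]))               # random field
    lo = min(p[field] for p in points)
    hi = max(p[field] for p in points)
    if lo == hi:                                         # constant field
        return {"size": len(points)}
    split = random.uniform(lo, hi)                       # random cut point
    left = [p for p in points if p[field] < split]
    right = [p for p in points if p[field] >= split]
    return {"size": len(points), "field": field, "split": split,
            "left": grow_itree(left, depth + 1, max_depth),
            "right": grow_itree(right, depth + 1, max_depth)}
```

In practice each tree is grown on its own small sample, e.g. grow_itree(random.sample(data, 256)).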

Pages 26-27: Adam Ashenfelter - Finding the Oddballs

Isolation Forest - Scoring

Root
├── Age <= 84
│   ├── Salary <= $50K   (depth 2)
│   └── Salary > $50K
└── Age > 84

Scoring instance: {"Salary": 20000, "Age": 50}

The instance follows Root → Age <= 84 → Salary <= $50K, terminating at depth 2.

Tree output: 2

Page 28: Adam Ashenfelter - Finding the Oddballs

Isolation Forest - Scoring

To score an instance, find the depth of the terminal node. The smaller the depth, the more anomalous.

Each tree is a weak anomaly detector, but an ensemble becomes powerful.

The average ensemble depth is normalized by the expected depth for a balanced binary tree, to give the score a 0 - 1 range.
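A sketch of that normalization using the c(n) constant from the Liu et al. paper (the expected path length of an unsuccessful binary-search-tree lookup, approximated with the harmonic number):

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Expected path length of an unsuccessful BST search over n points
    (the normalizing constant from Liu et al.)."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + EULER_GAMMA      # approximates H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_depth, sample_size=256):
    """Map the ensemble's average depth into (0, 1); higher = more anomalous."""
    return 2.0 ** (-avg_depth / c(sample_size))

# a very shallow average depth yields a score near 1
print(anomaly_score(2.0), anomaly_score(12.0))
```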

Page 29: Adam Ashenfelter - Finding the Oddballs

Lightweight Online Detector of Anomalies (LODA)

Invented by Tomáš Pevný, Czech Technical University

An ensemble method using sparse random projections and 1-dimensional histograms

Pevný, T. Loda: Lightweight on-line detector of anomalies. Machine Learning, 2016


Page 31: Adam Ashenfelter - Finding the Oddballs

Lightweight Online Detector of Anomalies (LODA)

Another ensemble method - weak individually but powerful in numbers.

Can operate as an on-line anomaly detector when paired with streaming histograms (see https://github.com/bigmlcom/histogram).

When operating in batch mode, LODA is often competitive with Isolation Forests.

Scales nicely to wide datasets.

Blazingly fast to train and score!
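A minimal batch-mode sketch; the sqrt(d) sparsity follows the paper, while the projection and bin counts here are illustrative assumptions:

```python
import numpy as np

def loda_scores(X, n_projections=100, n_bins=50, seed=0):
    """Batch LODA: sparse random projections + 1-d histograms.
    Score = average negative log density across projections."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    nonzero = max(1, int(np.sqrt(d)))        # sqrt(d) non-zeros per projection
    scores = np.zeros(n)
    for _ in range(n_projections):
        w = np.zeros(d)
        cols = rng.choice(d, size=nonzero, replace=False)
        w[cols] = rng.standard_normal(nonzero)
        z = X @ w                            # project to one dimension
        hist, edges = np.histogram(z, bins=n_bins, density=True)
        bins = np.clip(np.digitize(z, edges[1:-1]), 0, n_bins - 1)
        scores -= np.log(np.maximum(hist[bins], 1e-12))
    return scores / n_projections
```

An on-line variant would replace the fixed numpy histograms with the streaming histograms linked above.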

Page 32: Adam Ashenfelter - Finding the Oddballs

Benchmark Suite

The base assumption is that anomalies are generated by different processes than the regular points.

Use supervised datasets to emulate this. Starting from 19 "mother" UCI datasets, construct many benchmark datasets by varying the following (a sketch follows this list):

● Relative frequency
● Clusteredness
● Point difficulty

~4.4K benchmark datasets generated from the original 19 UCI datasets.
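A minimal sketch of the relative-frequency knob (the anomaly_classes assignment comes from the two-coloring procedure on the next slides; the sampling details are assumptions):

```python
import numpy as np

def make_benchmark(X, y, anomaly_classes, rate=0.01, seed=0):
    """Relabel some classes as 'anomalous' and subsample them to a
    target relative frequency; returns data plus 0/1 anomaly labels."""
    rng = np.random.default_rng(seed)
    is_anom = np.isin(y, anomaly_classes)
    X_norm, X_anom = X[~is_anom], X[is_anom]
    n_keep = min(len(X_anom), max(1, int(rate * len(X_norm))))
    keep = rng.choice(len(X_anom), size=n_keep, replace=False)
    data = np.vstack([X_norm, X_anom[keep]])
    labels = np.r_[np.zeros(len(X_norm)), np.ones(n_keep)]
    return data, labels
```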

Page 33: Adam Ashenfelter - Finding the Oddballs

Anomalies from a multi-class dataset

Train a random decision forest on the dataset.

Build a graph over the classes, using the misclassifications as edges weighted by the number of mistakes.

[Figure: misclassification graph over the ten classes, labeled 0-9]

Page 34: Adam Ashenfelter - Finding the Oddballs

Anomalies from a multi-class dataset

Find the minimum spanning tree, selecting the edges with the most mistakes.

[Figure: spanning tree over the class graph]

Page 35: Adam Ashenfelter - Finding the Oddballs

Anomalies from a multi-class dataset

Two-color the graph to pick the 'normal' classes and the 'anomalous' classes.

This encourages diversity in the anomalous points.

[Figure: two-colored spanning tree assigning each class to the 'normal' or 'anomalous' side]
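A sketch of the whole construction with networkx, using a maximum-weight spanning tree to keep the most-confused edges (y_true and y_pred are assumed to be cross-validated predictions from the random forest):

```python
import networkx as nx
from networkx.algorithms import bipartite

def split_classes(y_true, y_pred):
    """Two-color a spanning tree of the confusion graph so that
    frequently-confused classes land on opposite sides."""
    g = nx.Graph()
    for t, p in zip(y_true, y_pred):
        if t != p:                                   # a mistake = an edge
            w = g.get_edge_data(t, p, default={"weight": 0})["weight"]
            g.add_edge(t, p, weight=w + 1)
    tree = nx.maximum_spanning_tree(g, weight="weight")
    colors = bipartite.color(tree)                   # trees are bipartite
    normal = sorted(c for c, side in colors.items() if side == 0)
    anomalous = sorted(c for c, side in colors.items() if side == 1)
    return normal, anomalous
```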

Page 36: Adam Ashenfelter - Finding the Oddballs

Frequency - High

Page 37: Adam Ashenfelter - Finding the Oddballs

Frequency - Medium

Page 38: Adam Ashenfelter - Finding the Oddballs

Frequency - Low

Page 39: Adam Ashenfelter - Finding the Oddballs

Difficulty - Low

Page 40: Adam Ashenfelter - Finding the Oddballs

Difficulty - Medium

Page 41: Adam Ashenfelter - Finding the Oddballs

Difficulty - Hard

Difficulty measured with probabilities from a logistic regression classifier.

Page 42: Adam Ashenfelter - Finding the Oddballs

Clustered Anomalies (low variance)

Page 43: Adam Ashenfelter - Finding the Oddballs

Non-Clustered Anomalies (equivalent variance)

Page 44: Adam Ashenfelter - Finding the Oddballs

Scattered Anomalies (high variance)

Page 45: Adam Ashenfelter - Finding the Oddballs

Benchmark Results

Trends over the results from all ~4.4K benchmarks

Page 46: Adam Ashenfelter - Finding the Oddballs

Benchmark Results

Page 47: Adam Ashenfelter - Finding the Oddballs

Isolation Forests

The Good

● Very easy to construct (no distance function needed)
● Gracefully handles missing values at scoring time
● Can provide anomaly explanations
● Parameter free*

The Bad

● *Not really parameter free
● Wide datasets will need more trees (how many?)
● Needs extensions to handle discrete values and categorical data


Pages 50-51: Adam Ashenfelter - Finding the Oddballs

Isolation Forest - Missings

Root
├── Age <= 84
│   ├── Salary <= $50K   (depth 2)
│   └── Salary > $50K
└── Age > 84   (depth 1)

Scoring instance: {"Salary": 20000, "Age": null}

With Age missing, the instance follows both branches of the Age split and the resulting depths are averaged: (1 + 2) / 2 = 1.5.

Tree depth output: 1.5
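A minimal sketch of this rule over the grow_itree node format from earlier; the unweighted average across branches is an assumption (an implementation could weight branches by their training populations):

```python
def depth_with_missings(node, instance, depth=0):
    """Walk one isolation tree; when the split field is missing from
    the instance, average the depths from both branches."""
    if "field" not in node:                          # leaf
        return depth
    value = instance.get(node["field"])
    if value is None:                                # missing field
        return 0.5 * (depth_with_missings(node["left"], instance, depth + 1) +
                      depth_with_missings(node["right"], instance, depth + 1))
    branch = "left" if value < node["split"] else "right"
    return depth_with_missings(node[branch], instance, depth + 1)

# {"Salary": 20000} with Age absent reproduces the 1.5 example above
```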

Page 52: Adam Ashenfelter - Finding the Oddballs

Isolation Forest - Missings

We can draw partial dependence plots by marginalizing away the other dimensions as 'missing'.

An Isolation Forest built on a 3d cylinder dataset and then reduced to a 2d partial dependence plot:

[Figure: 2d partial dependence plot of the 3d cylinder dataset]
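A sketch of that reduction using depth_with_missings above; the grid and field names are illustrative assumptions, and a forest would average this over its trees:

```python
import numpy as np

def partial_dependence_2d(tree, xs, ys, fields=("x", "y")):
    """Score a 2-d grid against one isolation tree, leaving every
    other field out of the instance so it is treated as missing."""
    grid = np.zeros((len(xs), len(ys)))
    for i, xv in enumerate(xs):
        for j, yv in enumerate(ys):
            grid[i, j] = depth_with_missings(tree, {fields[0]: xv,
                                                    fields[1]: yv})
    return grid
```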


Page 54: Adam Ashenfelter - Finding the Oddballs

Isolation Forests - Explanations

We can explain which fields contributed most to an anomaly using a technique similar to Gini variable importance from random forests (see the sketch below).

Record the original node populations when growing the trees. When scoring, sum the isolated points after each split according to field, then normalize the field sums.

Relative field importance for an anomaly detected on the UCI wine quality dataset:

Chlorides Sulphates WineType

CitricAcid

VolatileAcidity
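A minimal sketch over the grow_itree node format; crediting the split field with the whole population of the branch not taken is an assumed reading of "sum the isolated points after each split":

```python
from collections import defaultdict

def explain_anomaly(node, instance, credit=None):
    """Credit each split field with the training points separated from
    the instance's path, then normalize to relative importances."""
    if credit is None:
        credit = defaultdict(float)
    if "field" not in node:                          # reached the leaf
        total = sum(credit.values()) or 1.0
        return {field: c / total for field, c in credit.items()}
    if instance[node["field"]] < node["split"]:
        follow, other = "left", "right"
    else:
        follow, other = "right", "left"
    credit[node["field"]] += node[other]["size"]     # points isolated away
    return explain_anomaly(node[follow], instance, credit)
```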


Page 56: Adam Ashenfelter - Finding the Oddballs

Isolation Forests

The Good

● Very easy to construct (no distance function needed)
● Gracefully handles missing values at scoring time
● Can provide anomaly explanations
● Parameter free*

The Bad

● *Not really parameter free
● Wide datasets will need more trees (how many?)
● Needs extensions to handle discrete values and categorical data
  ○ 0.966 AUC on the UCI mushrooms dataset