Scalable Anomaly Detection with Spark and SOS Strata NYC September 26, 2019


May 17, 2022

Transcript
Page 1: Scalable Anomaly Detection with Spark and SOS

Scalable Anomaly Detection with Spark and SOS

Strata NYC, September 26, 2019

Page 2: Scalable Anomaly Detection with Spark and SOS

Hi there, my name is Jeroen Janssens

Page 3: Scalable Anomaly Detection with Spark and SOS

Today

● SOS, World!
● Anomalies and outliers
● Evaluating outlier-selection algorithms
● Various approaches to outlier selection
● Stochastic Outlier Selection
● Conclusion

Page 4: Scalable Anomaly Detection with Spark and SOS

SOS, World!

01-sos-world.ipynb

Page 5: Scalable Anomaly Detection with Spark and SOS

Implementations of SOS

● Python: http://bit.ly/sos-python
● Spark: http://bit.ly/sos-spark
● R: http://bit.ly/sos-r
● Flink: http://bit.ly/sos-flink

Page 6: Scalable Anomaly Detection with Spark and SOS

Anomalies and outliers

Page 7: Scalable Anomaly Detection with Spark and SOS

An anomaly is an observation or event that deviates qualitatively from what is considered to be normal, according to a domain expert.

Page 8: Scalable Anomaly Detection with Spark and SOS

Detecting anomalies is important

● Expensive
● Dangerous
● Mess up your model

Page 9: Scalable Anomaly Detection with Spark and SOS

Human anomaly detection may suffer from

● Fatigue
● Information overload
● Emotional bias

Page 10: Scalable Anomaly Detection with Spark and SOS

Feature-vector representation

Page 11: Scalable Anomaly Detection with Spark and SOS

Dissimilarity-matrix representation

Page 12: Scalable Anomaly Detection with Spark and SOS

From anomaly to outlier

Page 13: Scalable Anomaly Detection with Spark and SOS

An outlier is a data point that deviates quantitatively from the majority of the data points, according to an outlier-selection algorithm.

Page 14: Scalable Anomaly Detection with Spark and SOS

The symbiotic relationship between the domain expert and the algorithm

Page 15: Scalable Anomaly Detection with Spark and SOS

Data flow diagram

Data flow diagram illustrating the relationship between the domain expert (square) and the outlier-selection algorithm (top circle).

Page 16: Scalable Anomaly Detection with Spark and SOS

Six Euler diagrams (1/2)

Page 17: Scalable Anomaly Detection with Spark and SOS

Six Euler diagrams (2/2)

Page 18: Scalable Anomaly Detection with Spark and SOS

Evaluating outlier-selection algorithms

Page 19: Scalable Anomaly Detection with Spark and SOS

Confusion matrix

Computer says no.

Page 20: Scalable Anomaly Detection with Spark and SOS

Four possible outcomes
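
The four outcomes come from comparing the domain expert's label (anomaly or normal) with the algorithm's label (outlier or inlier). A minimal sketch of counting them; the function name is hypothetical, and the names "miss" and "correct rejection" are assumed here to complement the "hit" and "false alarm" terms used for the ROC curve.

```python
from collections import Counter

def four_outcomes(is_anomaly, is_outlier):
    """Count the four outcomes of comparing expert labels with algorithm labels."""
    # (expert says anomaly, algorithm says outlier) -> outcome name (names are assumptions)
    names = {
        (True, True): "hit",
        (False, True): "false alarm",
        (True, False): "miss",
        (False, False): "correct rejection",
    }
    return Counter(names[(a, o)] for a, o in zip(is_anomaly, is_outlier))

counts = four_outcomes(
    is_anomaly=[True, True, False, False, False],
    is_outlier=[True, False, True, False, False],
)
print(dict(counts))
```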

Page 21: Scalable Anomaly Detection with Spark and SOS

Evaluation

Illustration of relabelling a multi-class data set into multiple one-class data sets.
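
One way to read this relabelling, sketched below under the assumption that each class takes a turn as the "normal" class while every other class is treated as anomalous. The function name is hypothetical.

```python
def one_class_datasets(labels):
    """Map each class to a list of anomaly flags in which only that class is normal."""
    datasets = {}
    for normal_class in sorted(set(labels)):
        # A point is an anomaly in this data set when it is not in the normal class.
        datasets[normal_class] = [label != normal_class for label in labels]
    return datasets

datasets = one_class_datasets(["a", "b", "c", "a"])
print(datasets["a"])
```

A data set with k classes yields k one-class data sets this way, so a single labelled data set can be reused for several evaluations.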

Page 22: Scalable Anomaly Detection with Spark and SOS

Anomalies are rare

In order to evaluate the algorithm we simulate anomalies to be rare. Banana for scale.
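
A rough sketch of how rarity could be simulated, assuming it is done by randomly subsampling the anomalous points until they make up a small fraction of the data. The function name, the 5% default, and the fixed seed are illustrative assumptions, not values from the talk.

```python
import random

def make_anomalies_rare(points, is_anomaly, anomaly_fraction=0.05, seed=0):
    """Keep all normal points and a random subset of the anomalies."""
    rng = random.Random(seed)
    normal = [p for p, a in zip(points, is_anomaly) if not a]
    anomalous = [p for p, a in zip(points, is_anomaly) if a]
    # Choose k so that k / (len(normal) + k) is about anomaly_fraction.
    k = round(anomaly_fraction * len(normal) / (1.0 - anomaly_fraction))
    k = min(k, len(anomalous))
    kept = rng.sample(anomalous, k)
    return normal + kept, [False] * len(normal) + [True] * k

points = list(range(200))
is_anomaly = [p >= 95 for p in points]  # 95 normal points, 105 anomalies
rare_points, rare_labels = make_anomalies_rare(points, is_anomaly)
print(len(rare_points), sum(rare_labels))
```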

Page 23: Scalable Anomaly Detection with Spark and SOS

Outlier scores

The dashed line indicates the threshold chosen by the domain expert.

Page 24: Scalable Anomaly Detection with Spark and SOS

ROC curve

An ROC curve plots the hit rate against the false alarm rate for all possible thresholds.
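
Each threshold on the outlier scores gives one (false alarm rate, hit rate) pair, and the ROC curve connects those pairs. A minimal sketch in plain Python; the function name and the toy scores are illustrative assumptions.

```python
def roc_points(scores, is_anomaly):
    """Return (false_alarm_rate, hit_rate) pairs, one per threshold."""
    positives = sum(is_anomaly)
    negatives = len(is_anomaly) - positives
    # Start with a threshold so strict that nothing is flagged,
    # then lower it to each distinct score in turn.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        flagged = [s >= threshold for s in scores]
        hits = sum(f and a for f, a in zip(flagged, is_anomaly))
        false_alarms = sum(f and not a for f, a in zip(flagged, is_anomaly))
        points.append((false_alarms / negatives, hits / positives))
    return points

points = roc_points([0.9, 0.8, 0.4, 0.2], [True, False, True, False])
print(points)
```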

Page 25: Scalable Anomaly Detection with Spark and SOS

Various approaches to outlier selection

Page 26: Scalable Anomaly Detection with Spark and SOS

Distribution-based outlier selection

Page 27: Scalable Anomaly Detection with Spark and SOS

Distance-based outlier selection

Size does matter

Page 28: Scalable Anomaly Detection with Spark and SOS

Density-based outlier selection

Page 29: Scalable Anomaly Detection with Spark and SOS

Stochastic Outlier Selection

Page 30: Scalable Anomaly Detection with Spark and SOS

Stochastic Outlier Selection

● Unsupervised outlier selection algorithm
● Employs concept of affinity (inspired by t-SNE)
● One parameter: perplexity
● Computes outlier probabilities

Page 31: Scalable Anomaly Detection with Spark and SOS

t-Distributed Stochastic Neighbor Embedding (t-SNE; Van der Maaten and Hinton) employs affinity to perform dimensionality reduction

Page 32: Scalable Anomaly Detection with Spark and SOS

A data point is selected as an outlier when all the other data points have insufficient affinity with it.

Page 33: Scalable Anomaly Detection with Spark and SOS

From input to output

Page 34: Scalable Anomaly Detection with Spark and SOS

From feature matrix to dissimilarity matrix
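
A minimal sketch of this step, assuming Euclidean distance as the dissimilarity measure (other measures could be substituted). The function name is hypothetical.

```python
import math

def dissimilarity_matrix(X):
    """Pairwise Euclidean distances between the rows of feature matrix X."""
    return [[math.dist(xi, xj) for xj in X] for xi in X]

X = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]
D = dissimilarity_matrix(X)
print(D[0])
```

For n data points this produces an n-by-n matrix, which is why the dissimilarity step dominates memory use on large data sets.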

Page 35: Scalable Anomaly Detection with Spark and SOS

From input to output

Page 36: Scalable Anomaly Detection with Spark and SOS

Smooth neighborhoods

Page 37: Scalable Anomaly Detection with Spark and SOS

Affinity between data points
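
A sketch of the affinity computation, assuming the Gaussian form used in the SOS paper: the affinity that point i has with point j is exp(-d_ij^2 / (2 * sigma_i^2)), and a point has no affinity with itself. The per-point variances are taken as given here; the names are illustrative.

```python
import math

def affinity_matrix(D, variances):
    """Affinity of point i with point j, given dissimilarities D and per-point variances."""
    n = len(D)
    return [
        [
            # A point has zero affinity with itself.
            0.0 if i == j else math.exp(-D[i][j] ** 2 / (2.0 * variances[i]))
            for j in range(n)
        ]
        for i in range(n)
    ]

D = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
A = affinity_matrix(D, variances=[0.5, 0.5, 2.0])
print(A[0])
```

Because each row uses its own variance, the affinity matrix is generally not symmetric.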

Page 38: Scalable Anomaly Detection with Spark and SOS

From input to output

Page 39: Scalable Anomaly Detection with Spark and SOS

From affinity to binding probability

The binding matrix B is obtained by normalising each row in the affinity matrix A.
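
Row normalisation means dividing every entry by the sum of its row, so that each row of B becomes a probability distribution. A minimal sketch with a hypothetical function name and a toy affinity matrix:

```python
def binding_matrix(A):
    """Normalise each row of the affinity matrix A so that it sums to 1."""
    B = []
    for row in A:
        total = sum(row)
        B.append([a / total for a in row])
    return B

A = [[0.0, 1.0, 3.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
B = binding_matrix(A)
print(B[0])
```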

Page 40: Scalable Anomaly Detection with Spark and SOS

Binding probabilities form a graph

Page 41: Scalable Anomaly Detection with Spark and SOS

Binding probabilities form a graph

Page 42: Scalable Anomaly Detection with Spark and SOS

Stochastic Neighbor Graph

A data point belongs to the outlier class when it is not selected by any other data point.

Page 43: Scalable Anomaly Detection with Spark and SOS

Three SNGs

The three SNGs Ga, Gb, and Gc are sampled from the discrete probability distribution P(G).

Page 44: Scalable Anomaly Detection with Spark and SOS

Set of all SNGs

Page 45: Scalable Anomaly Detection with Spark and SOS

Approximating outlier probabilities by sampling SNGs
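
A sketch of the sampling approach, assuming that in each SNG every data point binds to exactly one other point, drawn according to its row in the binding matrix B. The outlier probability of a point is then estimated as the fraction of sampled SNGs in which no point selected it. Function name, sample count, and seed are illustrative.

```python
import random

def sample_outlier_probabilities(B, n_samples=10_000, seed=0):
    """Estimate outlier probabilities by sampling stochastic neighbor graphs."""
    rng = random.Random(seed)
    n = len(B)
    outlier_counts = [0] * n
    for _ in range(n_samples):
        # One SNG: each point i selects one neighbor j with probability B[i][j].
        selected = {rng.choices(range(n), weights=B[i])[0] for i in range(n)}
        for j in range(n):
            if j not in selected:  # nobody selected j in this graph
                outlier_counts[j] += 1
    return [count / n_samples for count in outlier_counts]

B = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.9, 0.1, 0.0]]
estimates = sample_outlier_probabilities(B)
print(estimates)
```

The estimates fluctuate with the random seed and only become precise with many samples.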

Page 46: Scalable Anomaly Detection with Spark and SOS

Demo: Sampling SNGs in CoffeeScript and D3

http://bit.ly/sos-d3

Page 47: Scalable Anomaly Detection with Spark and SOS

Computing outlier probabilities through marginalisation

Page 48: Scalable Anomaly Detection with Spark and SOS

Computing outlier probabilities in closed form
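
A sketch of the closed form, assuming the expression from the SOS paper: point i is an outlier exactly when no other point j selects it, and since the points choose independently, the probability is the product over j != i of (1 - b_ji). The function name is hypothetical.

```python
import math

def outlier_probabilities(B):
    """Probability that each point is selected by none of the other points."""
    n = len(B)
    return [
        # B[j][i] is the probability that point j binds to point i.
        math.prod(1.0 - B[j][i] for j in range(n) if j != i)
        for i in range(n)
    ]

B = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.9, 0.1, 0.0]]
probabilities = outlier_probabilities(B)
print(probabilities)
```

This needs a single pass over the columns of B and involves no randomness, unlike estimating the same quantities by sampling.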

Page 49: Scalable Anomaly Detection with Spark and SOS

Proof!

Page 50: Scalable Anomaly Detection with Spark and SOS

Selecting outliers

Page 51: Scalable Anomaly Detection with Spark and SOS

Adaptive variances via the perplexity parameter

Page 52: Scalable Anomaly Detection with Spark and SOS

Continuous binary search
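
A sketch of how a binary search could set the variance of one data point, assuming Gaussian affinities and the t-SNE definition of perplexity as 2 raised to the entropy of the point's binding distribution. A larger variance flattens the distribution and raises the perplexity, so the search can halve the interval each step. Function names, tolerance, and starting variance are illustrative assumptions.

```python
import math

def perplexity(d_row, i, variance):
    """Perplexity (2 ** entropy) of point i's binding distribution."""
    affinities = [
        0.0 if j == i else math.exp(-d ** 2 / (2.0 * variance))
        for j, d in enumerate(d_row)
    ]
    total = sum(affinities)
    if total == 0.0:
        # All affinities underflowed; treat as the sharpest possible distribution.
        return 1.0
    probs = [a / total for a in affinities if a > 0.0]
    entropy = -sum(p * math.log2(p) for p in probs)
    return 2.0 ** entropy

def find_variance(d_row, i, target_perplexity, tol=1e-5, max_iter=200):
    """Binary search for the variance that yields the target perplexity."""
    low, high = 0.0, None  # no upper bound is known at the start
    variance = 1.0
    for _ in range(max_iter):
        h = perplexity(d_row, i, variance)
        if abs(h - target_perplexity) < tol:
            break
        if h < target_perplexity:
            # Distribution too peaked: increase the variance.
            low = variance
            variance = variance * 2.0 if high is None else (variance + high) / 2.0
        else:
            # Distribution too flat: decrease the variance.
            high = variance
            variance = (low + variance) / 2.0
    return variance

d_row = [0.0, 1.0, 2.0, 3.0, 4.0]  # dissimilarities from point 0 to all points
variance = find_variance(d_row, i=0, target_perplexity=2.0)
print(variance, perplexity(d_row, 0, variance))
```

With n data points the perplexity can only range between 1 and n - 1, so the target must lie inside that interval for the search to converge.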

Page 53: Scalable Anomaly Detection with Spark and SOS

Perplexity influences outlier probabilities

Page 54: Scalable Anomaly Detection with Spark and SOS

Evaluation and comparison

Page 55: Scalable Anomaly Detection with Spark and SOS

Outlier-score plots

Page 56: Scalable Anomaly Detection with Spark and SOS

Real-world datasets

Page 57: Scalable Anomaly Detection with Spark and SOS

Synthetic datasets

Page 58: Scalable Anomaly Detection with Spark and SOS

Synthetic datasets

Page 59: Scalable Anomaly Detection with Spark and SOS

Synthetic datasets

Page 60: Scalable Anomaly Detection with Spark and SOS

Synthetic datasets

Page 61: Scalable Anomaly Detection with Spark and SOS

SOS performs significantly better

Page 62: Scalable Anomaly Detection with Spark and SOS

Spark implementation of SOS

Page 63: Scalable Anomaly Detection with Spark and SOS

Spark implementation of SOS

● Developed by Fokko Driesprong
● Works with the DataFrame API
● Available on GitHub
● The plan is to make it part of MLlib

Page 64: Scalable Anomaly Detection with Spark and SOS

SOS on PySpark

92-pyspark-sos.ipynb

Page 65: Scalable Anomaly Detection with Spark and SOS

Summary

● Outlier selection can support the detection of anomalies
● SOS is an intuitive and probabilistic algorithm to select outliers
● SOS performs very well
● No free lunch

Page 66: Scalable Anomaly Detection with Spark and SOS

Thank you! Here are some links

● Blog: http://bit.ly/sos-blog
● D3 Demo: http://bit.ly/sos-d3
● Python implementation: http://bit.ly/sos-python
● Spark implementation: http://bit.ly/sos-spark
● R implementation: http://bit.ly/sos-r
● Flink implementation: http://bit.ly/sos-flink