(Rare) Category Detection Using Hierarchical Mean Shift Pavan Vatturi ([email protected]) Weng-Keen Wong ([email protected])

Mar 22, 2016


Serena Poletto

Transcript
Page 1: (Rare) Category Detection Using Hierarchical Mean Shift

(Rare) Category Detection Using Hierarchical Mean Shift

Pavan Vatturi ([email protected])

Weng-Keen Wong ([email protected])

Page 2: (Rare) Category Detection Using Hierarchical Mean Shift

1. Introduction

• Applications for surveillance, scientific discovery and data cleaning require anomaly detection

• Anomalies often identified as statistically unusual data points

• Many detected anomalies are simply uninteresting or correspond to known sources of noise

Page 3: (Rare) Category Detection Using Hierarchical Mean Shift

1. Introduction

Known objects (99.9% of the data)

Anomalies (0.1% of the data)

Uninteresting (99% of anomalies)

Interesting (1% of anomalies)

Pictures from: Sloan Digital Sky Survey (http://www.sdss.org/iotw/archive.html)

Pelleg, D. (2004). Scalable and Practical Probability Density Estimators for Scientific Anomaly Detection. PhD Thesis, Carnegie Mellon University.

Page 4: (Rare) Category Detection Using Hierarchical Mean Shift

1. Introduction

Category Detection [Pelleg and Moore 2004]: human-in-the-loop exploratory data analysis

Data Set

Build Model

Spot Interesting Data Points

Ask User to Label Categories of Interesting Data Points

Update Model with Labels

Page 5: (Rare) Category Detection Using Hierarchical Mean Shift

1. Introduction

Data Set

Build Model

Spot Interesting Data Points

Ask User to Label Categories of Interesting Data Points

Update Model with Labels

User can:

• Label a query data point under an existing category

• Or declare the data point to belong to a previously undeclared category

Page 6: (Rare) Category Detection Using Hierarchical Mean Shift

1. Introduction

• Goal: present to user a single instance from each category in as few queries as possible

• Difficult to detect rare categories if class imbalance is severe

• Interested in rare categories for anomaly detection

Page 7: (Rare) Category Detection Using Hierarchical Mean Shift

Outline

1. Introduction

2. Related Work

3. Background

4. Methodology

5. Results

6. Conclusion / Future Work

Page 8: (Rare) Category Detection Using Hierarchical Mean Shift

2. Related Work

• Interleave [Pelleg and Moore 2004]

• Nearest-Neighbor-based active learning for rare-category detection for multiple classes [He and Carbonell 2008]

• Multiple output identification [Fine and Mansour 2006]

Page 9: (Rare) Category Detection Using Hierarchical Mean Shift

3. Background: Mean Shift [Fukunaga and Hostetler 1975]

Reference data set

Query point

Center of Mass

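Mean shift vector (follows density gradient)

$$m(x) = \frac{\sum_{i=1}^{n} x_i \, k'\left(\left\|\frac{x - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} k'\left(\left\|\frac{x - x_i}{h}\right\|^2\right)} - x$$

Mean shift vector with kernel k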
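
The mean shift vector can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes a Gaussian kernel profile k(u) = exp(-u/2), for which k' is proportional to exp(-u/2) and the constant cancels in the ratio. The names `mean_shift_vector` and `seek_mode`, and the parameters `tol` and `max_iter`, are hypothetical.

```python
import numpy as np

def mean_shift_vector(x, data, h):
    """Mean shift vector m(x), assuming a Gaussian kernel profile."""
    # Squared scaled distances ||(x - x_i) / h||^2 to every reference point.
    d2 = np.sum(((data - x) / h) ** 2, axis=1)
    # Gaussian profile: k'(u) is proportional to exp(-u / 2); the constant
    # (and its sign) cancels between numerator and denominator.
    w = np.exp(-0.5 * d2)
    center_of_mass = w @ data / w.sum()
    return center_of_mass - x

def seek_mode(x, data, h, tol=1e-6, max_iter=500):
    """Follow the mean shift vector from query point x until it stops moving."""
    x = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        m = mean_shift_vector(x, data, h)
        x = x + m
        if np.linalg.norm(m) < tol:
            break
    return x

# Two small groups of reference points; the query starts near the first group.
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                 [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
mode = seek_mode([0.3, 0.3], data, h=0.5)
```

Each update moves the query point to the kernel-weighted center of mass of the reference data, which is why repeated updates climb the density gradient.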

Page 10: (Rare) Category Detection Using Hierarchical Mean Shift

3. Background: Mean Shift [Fukunaga and Hostetler 1975]

Reference data set

Query point

Center of Mass

Convergence to cluster center

Page 11: (Rare) Category Detection Using Hierarchical Mean Shift

3. Background: Mean Shift Blurring

Reference data set

Query point

Center of Mass

Blurring:

• When query points are the same as the reference data set

• Progressively blurs the original data set
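
One blurring step can be sketched as follows: every point acts as a query against the current data set, and all points are then replaced at once by their weighted centers of mass. The Gaussian weights and the name `blur_step` are assumptions made for illustration.

```python
import numpy as np

def blur_step(points, h):
    """One blurring mean shift step: every point moves to its weighted center of mass."""
    # Pairwise squared distances, scaled by the bandwidth h.
    diff = points[:, None, :] - points[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1) / h ** 2
    w = np.exp(-0.5 * d2)  # assumed Gaussian weights
    # Row i of the result is the weighted mean of all points as seen from point i.
    return w @ points / w.sum(axis=1, keepdims=True)

# Two tight pairs on a line; one step pulls each pair together.
points = np.array([[0.0], [0.2], [10.0], [10.2]])
blurred = blur_step(points, h=1.0)
```

Because the reference set itself moves, repeated steps shrink each group toward a single location instead of leaving the data fixed.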

Page 12: (Rare) Category Detection Using Hierarchical Mean Shift

3. Background: Mean Shift

End result of applying mean shift to a synthetic data set

Page 13: (Rare) Category Detection Using Hierarchical Mean Shift

4. Methodology: Overview

1. Sphere the data

2. Hierarchical Mean Shift

3. Query user
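
The slides do not say how the data is sphered. A common reading, assumed here, is whitening: center the data and rescale along the principal axes so the sample covariance becomes the identity. The function name `sphere` and the `eps` guard are hypothetical.

```python
import numpy as np

def sphere(data, eps=1e-12):
    """Whiten the data: zero mean and (approximately) identity covariance."""
    centered = data - data.mean(axis=0)
    cov = np.atleast_2d(np.cov(centered, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rotate onto the principal axes, then divide each axis by its standard deviation.
    scale = 1.0 / np.sqrt(np.maximum(eigvals, eps))
    return centered @ eigvecs * scale

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 3.0, 0.5]])
raw = rng.normal(size=(200, 3)) @ mixing
sphered = sphere(raw)
```

Sphering puts all dimensions on a comparable scale, so a single bandwidth h is meaningful in every direction.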

Page 14: (Rare) Category Detection Using Hierarchical Mean Shift

4. Methodology: Hierarchical Mean Shift

Repeatedly blur data using Mean Shift with increasing bandwidth: h_new = k * h_old
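
A rough sketch of the bandwidth schedule follows. It blurs the points to convergence at each bandwidth, records which points have collapsed together, then multiplies the bandwidth by k. The Gaussian weights, the stopping rule, the grouping tolerance, and all function and parameter names are assumptions, not details taken from the slides.

```python
import numpy as np

def blur_step(points, h):
    """One blurring mean shift step with assumed Gaussian weights."""
    diff = points[:, None, :] - points[None, :, :]
    w = np.exp(-0.5 * np.sum(diff ** 2, axis=-1) / h ** 2)
    return w @ points / w.sum(axis=1, keepdims=True)

def group_points(points, tol):
    """Give the same label to points linked by chains of distances below tol."""
    labels = -np.ones(len(points), dtype=int)
    next_label = 0
    for i in range(len(points)):
        if labels[i] >= 0:
            continue
        labels[i] = next_label
        stack = [i]
        while stack:
            j = stack.pop()
            dist = np.linalg.norm(points - points[j], axis=1)
            near = np.where((dist < tol) & (labels < 0))[0]
            labels[near] = next_label
            stack.extend(near.tolist())
        next_label += 1
    return labels

def hierarchical_mean_shift(data, h0, k=1.1, tol=1e-3, merge_tol=0.1,
                            max_levels=100, max_iter=200):
    """Return a list of (bandwidth, labels) pairs, one per bandwidth level."""
    points = np.asarray(data, dtype=float).copy()
    h = h0
    levels = []
    for _ in range(max_levels):
        # Blur at the current bandwidth until the points stop moving.
        for _ in range(max_iter):
            new_points = blur_step(points, h)
            moved = np.max(np.linalg.norm(new_points - points, axis=1))
            points = new_points
            if moved < tol * h:
                break
        labels = group_points(points, merge_tol * h)
        levels.append((h, labels))
        if labels.max() == 0:  # everything has merged into one cluster
            break
        h *= k  # h_new = k * h_old
    return levels

data = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
levels = hierarchical_mean_shift(data, h0=0.2, k=1.5)
```

Comparing the labels of consecutive levels shows at which bandwidth each cluster forms and at which bandwidth it merges with another one.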

Page 15: (Rare) Category Detection Using Hierarchical Mean Shift

4. Methodology: Querying the User

The data point closest to the cluster center is the representative data point. Rank the representative data points for querying the user according to:

1. Outlierness [Leung et al. 2000] for Cluster C_i:

$$\text{Outlierness}(C_i) = \frac{\text{Lifetime of } C_i}{\text{Number of data points in } C_i}$$

Lifetime of C_i = Log (bandwidth when cluster C_i is merged with other clusters - bandwidth when cluster C_i is formed)
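
The outlierness score is simple arithmetic once the formation and merge bandwidths of a cluster are known. The sketch below takes the lifetime definition literally as written on the slide. The dictionary keys, the function names, and the choice to query the highest scores first are assumptions.

```python
import math

def lifetime(h_formed, h_merged):
    """Lifetime as written on the slide: log of the bandwidth difference."""
    return math.log(h_merged - h_formed)

def outlierness(h_formed, h_merged, n_points):
    """Lifetime of the cluster divided by the number of data points in it."""
    return lifetime(h_formed, h_merged) / n_points

def rank_by_outlierness(clusters):
    """Sort clusters so the highest outlierness comes first (assumed query order)."""
    return sorted(
        clusters,
        key=lambda c: outlierness(c["h_formed"], c["h_merged"], c["n_points"]),
        reverse=True,
    )

clusters = [
    {"name": "big", "h_formed": 0.1, "h_merged": 10.1, "n_points": 1000},
    {"name": "rare", "h_formed": 0.1, "h_merged": 10.1, "n_points": 5},
]
ranked = rank_by_outlierness(clusters)
```

With equal lifetimes, the cluster with fewer points scores higher, which matches the goal of surfacing rare categories early.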

Page 16: (Rare) Category Detection Using Hierarchical Mean Shift

4. Methodology: Querying the User

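Rank the representative data points for querying the user according to:

2. Compactness + Isolation [Leung et al. 2000] for Cluster C_i, with cluster center p_i and bandwidth h:

$$\text{Compactness}(C_i) = \frac{\sum_{x \in C_i} e^{-\frac{\|x - p_i\|^2}{2h^2}}}{\sum_{x \in C_i} \sum_{j} e^{-\frac{\|x - p_j\|^2}{2h^2}}}$$

$$\text{Isolation}(C_i) = \frac{\sum_{x \in C_i} e^{-\frac{\|x - p_i\|^2}{2h^2}}}{\sum_{x} e^{-\frac{\|x - p_i\|^2}{2h^2}}}$$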
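
Both scores can be computed directly from the data, the cluster labels, and the cluster centers. The sketch below assumes the Gaussian form from Leung et al. 2000, with p_i the center of cluster C_i and h the bandwidth. The function names and the array layout are hypothetical.

```python
import numpy as np

def gaussian_affinity(points, center, h):
    """exp(-||x - p||^2 / (2 h^2)) for every row x of points."""
    return np.exp(-np.sum((points - center) ** 2, axis=1) / (2.0 * h ** 2))

def isolation(data, labels, centers, i, h):
    """Share of the affinity to center p_i that comes from C_i's own members."""
    a = gaussian_affinity(data, centers[i], h)
    return a[np.asarray(labels) == i].sum() / a.sum()

def compactness(data, labels, centers, i, h):
    """Affinity of C_i's members to p_i relative to their affinity to all centers."""
    members = data[np.asarray(labels) == i]
    num = gaussian_affinity(members, centers[i], h).sum()
    den = sum(gaussian_affinity(members, c, h).sum() for c in centers)
    return num / den

# Two well-separated clusters: both scores are close to 1.
data = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.05, 0.0], [10.05, 10.0]])
iso = isolation(data, labels, centers, 0, h=0.5)
comp = compactness(data, labels, centers, 0, h=0.5)
```

Both values lie in (0, 1]; they drop as clusters overlap, so a high score marks a tight cluster that stands apart from the rest of the data.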

Page 17: (Rare) Category Detection Using Hierarchical Mean Shift

4. Methodology: Tiebreaker

• Ties may occur in Outlierness or Compactness/Isolation values.

• Highest Average Distance heuristic: choose representative data point with highest average distance from user-labeled points.
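
The Highest Average Distance tiebreaker can be sketched as follows, assuming Euclidean distance. The function name and the fallback when nothing has been labeled yet are hypothetical.

```python
import numpy as np

def break_tie_by_had(candidates, labeled):
    """Index of the tied candidate with the highest average distance to labeled points."""
    candidates = np.asarray(candidates, dtype=float)
    labeled = np.asarray(labeled, dtype=float)
    if len(labeled) == 0:
        return 0  # nothing labeled yet: keep the first candidate (assumed fallback)
    # dists[a, b] is the distance from candidate a to labeled point b.
    dists = np.linalg.norm(candidates[:, None, :] - labeled[None, :, :], axis=-1)
    return int(np.argmax(dists.mean(axis=1)))

tied = [[0.0, 0.0], [5.0, 5.0]]
already_labeled = [[0.0, 1.0], [1.0, 0.0]]
choice = break_tie_by_had(tied, already_labeled)
```

Preferring the candidate farthest from what the user has already seen steers queries toward unexplored regions of the data.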

Page 18: (Rare) Category Detection Using Hierarchical Mean Shift

5. Results

Name        Dims  Records  Classes  Smallest Class  Largest Class
Abalone     7     4177     20       0.34%           16%
Shuttle     8     4000     7        0.02%           64.2%
OptDigits   64    1040     10       0.77%           50%
OptLetters  16    2128     26       0.37%           24%
Statlog     19    512      7        1.5%            50%
Yeast       8     1484     10       0.33%           31.68%

Shuttle, OptDigits, OptLetters, and Statlog were subsampled to simulate class imbalance.

Data sets used in experiments

Page 19: (Rare) Category Detection Using Hierarchical Mean Shift

5. Results (Yeast)

Category detection metric: # queries before user presented with at least one example from all categories
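
The metric is a simple count over the sequence of queries. In the sketch below, `queried_labels` is the category of each queried point in query order, as the user would report it. The function name and the `None` return for an incomplete run are assumptions.

```python
def queries_to_discover_all(queried_labels, all_categories):
    """Number of queries until every category has been shown at least once."""
    remaining = set(all_categories)
    for n_queries, label in enumerate(queried_labels, start=1):
        remaining.discard(label)
        if not remaining:
            return n_queries
    return None  # some category was never presented

n = queries_to_discover_all(["a", "a", "b", "c", "a"], {"a", "b", "c"})
```

Lower is better: a method that surfaces rare categories early needs fewer queries to cover all of them.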

Page 20: (Rare) Category Detection Using Hierarchical Mean Shift

5. Results

Dataset     HMS-CI  HMS-CI+HAD  HMS-Out  HMS-Out+HAD  NNDM  Interleave
Abalone     1195    93          603      385          124   193
Shuttle     44      32          36       28           162   35
OptDigits   100     100         160      118          576   117
OptLetters  133     133         161      182          420   489
Statlog     18      20          34       124          228   54
Yeast       73      91          103      77           88    111

Number of hints to discover all classes

Page 21: (Rare) Category Detection Using Hierarchical Mean Shift

5. Results

Dataset     HMS-CI  HMS-CI+HAD  HMS-Out  NNDM   Interleave
Abalone     0.835   0.873       0.837    0.846  0.840
Shuttle     0.925   0.929       0.917    0.480  0.905
OptDigits   0.855   0.855       0.840    0.199  0.808
OptLetters  0.936   0.936       0.917    0.573  0.765
Statlog     0.956   0.958       0.944    0.472  0.934
Yeast       0.821   0.805       0.793    0.838  0.778

Area under the category detection curve

Page 22: (Rare) Category Detection Using Hierarchical Mean Shift

6. Conclusion / Future Work

Conclusions

• HMS-based methods consistently discover more categories in fewer queries than existing methods

• Do not need a priori knowledge of dataset properties

Page 23: (Rare) Category Detection Using Hierarchical Mean Shift

6. Conclusion / Future Work

Future Work

• Better use of user feedback

• Presentation of an entire cluster to the user instead of a representative data point

• Improved computational efficiency

• Theoretical analysis