Classification and Novel Class Detection in Data Streams
Mehedy Masud (1), Latifur Khan (1), Jing Gao (2), Jiawei Han (2), and Bhavani Thuraisingham (1)
(1) Department of Computer Science, University of Texas at Dallas
(2) Department of Computer Science, University of Illinois at Urbana-Champaign
This work was funded in part by
Presentation Overview
Stream Mining Background
Novel Class Detection – Concept Evolution
Data Streams
Data streams are:
◦ Continuous flows of data
◦ Examples: network traffic, sensor data, call center records
Data Stream Classification
◦ Uses past labeled data to build a classification model
◦ Predicts the labels of future instances using the model
◦ Helps decision making
[Figure: network traffic passes through the classification model. Attack traffic is sent to the firewall (block and quarantine); benign traffic is passed to the server. Expert analysis and labeling feed the model update.]
Introduction
Challenges
◦ Infinite length
◦ Concept-drift
◦ Concept-evolution (emergence of a novel class)
◦ Recurrence (seasonal) class
ICDM 2012, Brussels, Belgium, 12/11/2012
Infinite Length
Impractical to store and use all historical data
◦ Requires infinite storage
◦ And running time
Concept-Drift
[Figure: a data chunk of positive and negative instances, showing the previous hyperplane, the current hyperplane, and the instances that are victims of concept-drift.]
Concept-Evolution
Streams
Divide the data stream into equal-sized chunks
◦ Train a classifier from each data chunk
◦ Keep the best L such classifiers as the ensemble
◦ Example: for L = 3
[Figure: data chunks D1 to D6, the classifiers C1 to C5 trained from them, and the ensemble {C1, C2, C3} used for prediction on D4. Legend: labeled chunk, unlabeled chunk.]
Addresses infinite length and concept-drift
Note: Di may contain data points from different classes
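The chunk-based scheme above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the nearest-centroid base learner, the majority vote, the use of accuracy on the newest labeled chunk as the "best L" criterion, and all function names are assumptions made for this example.

```python
from collections import Counter

def train_centroids(chunk):
    """Train a toy nearest-centroid classifier from one labeled chunk.

    chunk is a list of (features, label) pairs; the model maps each label
    to the mean feature vector of that label's points.
    """
    sums, counts = {}, Counter()
    for x, y in chunk:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] += 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict_one(model, x):
    """Return the label whose centroid is closest to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda y: sq_dist(model[y]))

def accuracy(model, chunk):
    """Fraction of points in a labeled chunk that the model gets right."""
    return sum(predict_one(model, x) == y for x, y in chunk) / len(chunk)

def ensemble_predict(ensemble, x):
    """Majority vote over the classifiers in the ensemble."""
    votes = Counter(predict_one(m, x) for m in ensemble)
    return votes.most_common(1)[0][0]

def update_ensemble(ensemble, labeled_chunk, L=3):
    """Train a classifier on the newly labeled chunk and keep the best L.

    Assumption: "best" means highest accuracy on the newest labeled chunk.
    The new classifier is listed first so that it wins ties (stable sort).
    """
    candidates = [train_centroids(labeled_chunk)] + ensemble
    candidates.sort(key=lambda m: accuracy(m, labeled_chunk), reverse=True)
    return candidates[:L]
```

Each time a chunk such as D4 receives its labels, `update_ensemble` would be called once, so memory stays bounded by L models and stale models are dropped as the concept drifts.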
Examples of Recurrence and Novel Classes
Twitter stream: a stream of messages
Each message may be given a category or “class”
◦ based on the topic
Examples
◦ “Election 2012”, “London Olympic”, “Halloween”, “Christmas”, “Hurricane Sandy”, etc.
Among these
◦ “Election 2012” and “Hurricane Sandy” are novel classes because they are new events.
Also
◦ “Halloween” is a recurrence class because it “recurs” every year.
Concept-Evolution and Feature Space
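If q-NSC(x) is positive, x is closer to the F-outliers than to any existing class.
[Figure: an instance x with its q = 5 neighborhoods o,5(x), +,5(x), and -,5(x) among the F-outliers, the + class, and the - class; a(x), b+(x), and b-(x) mark the associated distances.]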
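The slide does not show the formula for q-NSC, so the sketch below assumes a silhouette-style form suggested by the symbols in the figure: a(x) is the mean distance from x to its q nearest F-outliers, b_c(x) is the mean distance from x to its q nearest instances of existing class c, and q-NSC(x) = (b_min(x) - a(x)) / max(b_min(x), a(x)) with b_min(x) the smallest b_c(x). The function names and the Euclidean distance are assumptions for illustration.

```python
import math

def mean_knn_distance(x, points, q):
    """Mean Euclidean distance from x to its q nearest neighbors in points.

    points must be non-empty; if it holds fewer than q points, all are used.
    """
    dists = sorted(math.dist(x, p) for p in points)
    nearest = dists[:q]
    return sum(nearest) / len(nearest)

def q_nsc(x, outliers, classes, q=5):
    """Silhouette-style score in [-1, 1] (assumed form, see lead-in).

    outliers: the other buffered outlier points (x itself excluded by the caller).
    classes:  dict mapping each existing class label to its list of points.
    A positive value means x is nearer to the outliers than to any class.
    """
    a = mean_knn_distance(x, outliers, q)                       # a(x)
    b_min = min(mean_knn_distance(x, pts, q) for pts in classes.values())
    denom = max(a, b_min)
    return 0.0 if denom == 0 else (b_min - a) / denom
```

Under this assumed form, a tight group of outliers far from every existing class scores close to +1, while a point sitting inside an existing class scores close to -1.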
Prior work
Limitation: Recurrence Class
[Figure: the stream drawn as three rows of chunks (chunk0 to chunk50, chunk51 to chunk100, chunk101 to chunk150), with the labels "Novel" and "Recurrence".]
Why Are Recurrence Classes Forgotten?
Divide the data stream into equal-sized chunks
◦ Train a classifier from the whole data chunk
◦ Keep the best L such classifiers as the ensemble
◦ Example: for L = 3
◦ Therefore, old models are discarded
◦ Old classes are “forgotten” after a while
[Figure: data chunks D1 to D6, the classifiers C1 to C5 trained from them, and the ensemble {C1, C2, C3} used for prediction on D4. Legend: labeled chunk, unlabeled chunk.]
Addresses infinite length and concept-drift
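The forgetting effect can be traced with a toy simulation. The sketch below makes a simplifying assumption that is not stated on the slide: the newest classifier always replaces the oldest one, so the ensemble only "knows" the classes seen in the last L chunks. The function name and the example stream are hypothetical.

```python
from collections import deque

def classes_remembered(chunk_classes, L=3):
    """For each chunk, list the classes covered by the ensemble after training on it.

    chunk_classes: list of sets, the classes present in each labeled chunk.
    Simplifying assumption: the newest classifier always evicts the oldest.
    """
    window = deque(maxlen=L)   # class sets of the classifiers currently kept
    history = []
    for classes in chunk_classes:
        window.append(set(classes))
        history.append(set().union(*window))
    return history

# Class B appears in the first two chunks, disappears, and recurs in the last.
stream = [{"A", "B"}, {"A", "B"}, {"A", "C"}, {"A", "C"}, {"A", "C"}, {"A", "B"}]
history = classes_remembered(stream, L=3)
```

Here `history[4]` no longer contains B, so when B recurs in the last chunk the ensemble has no model for it and would have to treat it as if it were new.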
CLAM: The Proposed Approach
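[Figure: CLAM workflow. The latest labeled chunk from the stream is used to train a new model, which updates the ensemble M (keeps all classes). The latest unlabeled instance goes through outlier detection: if it is not an outlier, it is classified using M (existing class); if it is an outlier, it goes to buffering and novel class detection.]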
Proposed method
CLAss Based Micro-Classifier Ensemble
Training and Updating
• Each chunk is first separated into different classes
• A micro-classifier is trained from each class’s data
• Each micro-classifier replaces one existing micro-classifier
• A total of L micro-classifiers make a Micro-Classifier Ensemble (MCE)
• C such MCEs constitute the whole ensemble, E
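The training and update steps above might look roughly like the following sketch. Several details are assumptions made for illustration and are not taken from the slides: a micro-classifier is reduced to a single (centroid, radius) hypersphere, the micro-classifier that gets replaced is the oldest one of the same class, and the function names are hypothetical.

```python
import math
from collections import defaultdict, deque

def train_micro_classifier(points):
    """Summarize one class's points in a chunk as a (centroid, radius) hypersphere."""
    dim = len(points[0])
    centroid = tuple(sum(p[i] for p in points) / len(points) for i in range(dim))
    radius = max(math.dist(centroid, p) for p in points)
    return centroid, radius

def update_clam_ensemble(E, labeled_chunk, L=3):
    """Split a labeled chunk by class and refresh each class's MCE in place.

    E maps a class label to its MCE, a deque holding at most L micro-classifiers.
    Assumption: the new micro-classifier evicts the oldest one of its class.
    """
    by_class = defaultdict(list)
    for x, y in labeled_chunk:
        by_class[y].append(x)
    for label, points in by_class.items():
        mce = E.setdefault(label, deque(maxlen=L))
        mce.append(train_micro_classifier(points))
    return E
```

Because an MCE is only touched when its class appears in the new chunk, a class that is absent for many chunks keeps its micro-classifiers, which is how the ensemble can keep all classes.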
CLAss Based Micro-Classifier Ensemble
Outlier Detection and Classification
• A test instance x is first classified with each micro-classifier ensemble
• Each micro-classifier ensemble gives a partial output (Yr) and an outlier flag (boolean)
• If all ensembles flag x as an outlier, then it is buffered and sent to the novel class detector
• Otherwise, the partial outputs are combined and a class label is predicted
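The decision logic in these bullets can be sketched as follows. The slides do not define how the partial output Yr is computed or combined, so this example assumes that each MCE is a list of (centroid, radius) hyperspheres, that the partial output is the distance to the nearest centroid, and that combining means choosing the class with the smallest partial output. All names are hypothetical.

```python
import math

def mce_output(x, mce):
    """Return (partial_output, is_outlier) for one micro-classifier ensemble.

    mce is a list of (centroid, radius) pairs. The partial output is the
    smallest distance from x to a centroid; x is an outlier for this MCE
    when it falls outside every hypersphere.
    """
    dists = [(math.dist(x, c), r) for c, r in mce]
    partial = min(d for d, _ in dists)
    is_outlier = all(d > r for d, r in dists)
    return partial, is_outlier

def classify_or_buffer(x, E, buffer):
    """Predict an existing class for x, or buffer x when every MCE flags it."""
    outputs = {label: mce_output(x, mce) for label, mce in E.items()}
    if all(flag for _, flag in outputs.values()):
        buffer.append(x)   # candidate for the novel class detector
        return None
    # Combine the partial outputs: choose the class with the nearest centroid.
    return min(outputs, key=lambda label: outputs[label][0])

E = {"A": [((0.0, 0.0), 1.0)], "B": [((5.0, 5.0), 1.0)]}
buffer = []
label_a = classify_or_buffer((0.2, 0.1), E, buffer)     # inside A's hypersphere
label_b = classify_or_buffer((5.0, 4.5), E, buffer)     # inside B's hypersphere
label_new = classify_or_buffer((10.0, -10.0), E, buffer)  # outside both
```

Only instances that every MCE rejects reach the buffer, so the novel class detector runs on a small subset of the stream.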
Datasets: Synthetic, KDD Cup 1999 & Forest covertype