Transcript
Slide 1
Stream Clustering CSE 902
Slide 2
Big Data
Slide 3
Stream analysis
Stream: Continuous flow of data
Challenges
  Volume: Not possible to store all the data
  One-time access: Not possible to process the data using multiple passes
  Real-time analysis: Certain applications need real-time analysis of the data
  Temporal locality: Data evolves over time, so the model should be adaptive
Slide 4
Stream Clustering Topic cluster Article Listings
Slide 5
Stream Clustering
Online Phase: Summarize the data into memory-efficient data structures
Offline Phase: Use a clustering algorithm to find the data partition
CF-Trees
Summarize the data in each CF-vector
  Linear sum of data points
  Squared sum of data points
  Number of points
Scalable k-means, Single pass k-means
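The CF-vector described above can be sketched as a small class. This is a minimal illustration, not code from the slides: the class name `CFVector`, its method names, and the choice to store the squared sum as a single scalar (the sum of squared norms) are assumptions made for the example.

```python
import numpy as np

class CFVector:
    """Clustering feature: number of points, linear sum, squared sum."""

    def __init__(self, dim):
        self.n = 0                 # number of points absorbed so far
        self.ls = np.zeros(dim)    # linear sum of the points
        self.ss = 0.0              # sum of squared norms (scalar variant, an assumption)

    def add(self, x):
        """Absorb one point without storing it."""
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def merge(self, other):
        """CF-vectors are additive, so two summaries merge by summing fields."""
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        """Root-mean-square distance of the absorbed points from the centroid."""
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - c @ c, 0.0)))
```

Because the three fields are plain sums, the centroid and radius can be recovered at any time from the summary alone, which is what makes a single pass over the stream sufficient.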
Slide 9
Microclusters
CF-Trees with time element
CluStream
  Linear sum and square sum of timestamps
  Delete old microclusters / merge microclusters if their timestamps are close to each other
Sliding Window Clustering
  Timestamp of the most recent data point added to the vector
  Maintain only the most recent T microclusters
DenStream
  Microclusters are associated with weights based on recency
  Outliers detected by creating separate microcluster
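One way to picture recency-based weights is a microcluster whose statistics fade as time passes. The sketch below is illustrative only: the class name `DecayedMicroCluster`, the parameter `lam`, and the exponential fading factor 2^(-lam * elapsed) are assumptions for the example, and the outlier handling mentioned on the slide is not modelled.

```python
import numpy as np

class DecayedMicroCluster:
    """Microcluster whose weight and linear sum fade with elapsed time."""

    def __init__(self, dim, lam=0.01):
        self.lam = lam              # decay rate (hypothetical parameter)
        self.weight = 0.0           # faded count of absorbed points
        self.ls = np.zeros(dim)     # faded linear sum of absorbed points
        self.last_update = 0.0      # timestamp of the most recent update

    def _fade(self, now):
        # Scale the stored statistics by 2^(-lam * elapsed time).
        factor = 2.0 ** (-self.lam * (now - self.last_update))
        self.weight *= factor
        self.ls *= factor
        self.last_update = now

    def add(self, x, now):
        """Fade the old statistics to time `now`, then absorb the new point."""
        self._fade(now)
        self.weight += 1.0
        self.ls += np.asarray(x, dtype=float)

    def center(self):
        return self.ls / self.weight
```

With `lam=1.0`, a point seen one time unit ago counts half as much as a point seen now, so the center drifts toward recent data.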
Slide 10
Microclusters
CF-Trees with time element
DenStream
  Microclusters are associated with weights based on recency
  Outliers detected by creating separate microcluster
ClusTree
  Allows real-time clustering
Slide 11
Grids
D-Stream
  Assign the data to grids
  Grids weighted by recency of points added to it
  Each grid associated with a label
DGClust
  Distributed clustering of sensor data
  Sensors maintain local copies of the grid and communicate updates to the grid to a central site
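The grid idea can be sketched as a dictionary of cells, each holding a density that decays between updates. This is a rough illustration under stated assumptions: the class name `DecayedGrid`, the uniform `cell_size`, and the per-time-step multiplicative `decay` are hypothetical choices, and the labelling of cells and the distributed updates are left out.

```python
import math
from collections import defaultdict

class DecayedGrid:
    """Uniform grid whose cell densities are weighted by recency."""

    def __init__(self, cell_size, decay=0.998):
        self.cell_size = cell_size          # side length of each cell (assumed uniform)
        self.decay = decay                  # per-time-step decay factor in (0, 1)
        self.density = defaultdict(float)   # cell index -> decayed density
        self.last_seen = {}                 # cell index -> time of last update

    def cell_of(self, x):
        """Map a point to the integer index of the cell that contains it."""
        return tuple(math.floor(v / self.cell_size) for v in x)

    def add(self, x, t):
        """Decay the cell's density to time t, then count the new point."""
        cell = self.cell_of(x)
        elapsed = t - self.last_seen.get(cell, t)
        self.density[cell] = self.density[cell] * self.decay ** elapsed + 1.0
        self.last_seen[cell] = t
        return cell
```

Only cells that have received points are stored, so memory grows with the number of occupied cells rather than with the number of points in the stream.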
Slide 12
StreamKM++ (Coresets)
StreamKM++: A Clustering Algorithm for Data Streams, Ackermann, Journal of Experimental Algorithmics 2012
Slide 13
Kernel-based Clustering
Slide 14
Kernel-based Stream Clustering
Use non-linear distance measures to define similarity between data points in the stream
Challenges
  Quadratic running time complexity
  Computationally expensive to compute centers using linear sums and squared sums (CF-vector approach will not work)
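To see why linear and squared sums do not help here, note that in kernel k-means the cluster center lives in the implicit feature space, so the distance from a point to a center has to be written in terms of kernel values between the point and every cluster member. The sketch below computes those distances from a full kernel matrix; the function name `kernel_distances` and its arguments are assumptions made for illustration.

```python
import numpy as np

def kernel_distances(K, labels, k):
    """Squared feature-space distance from every point to every cluster center.

    K      : (n, n) kernel matrix, K[i, j] = kernel(x_i, x_j)
    labels : length-n integer array of current cluster assignments
    k      : number of clusters
    """
    n = K.shape[0]
    dist = np.zeros((n, k))
    for c in range(k):
        members = labels == c
        m = members.sum()
        if m == 0:
            dist[:, c] = np.inf   # empty cluster: no center to measure against
            continue
        # ||phi(x_i) - mu_c||^2 = K_ii - (2/m) sum_j K_ij + (1/m^2) sum_{j,l} K_jl
        dist[:, c] = (np.diag(K)
                      - 2.0 * K[:, members].sum(axis=1) / m
                      + K[np.ix_(members, members)].sum() / m ** 2)
    return dist
```

Every evaluation touches kernel values between pairs of points, which is where the quadratic cost comes from when all points are kept.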
Slide 15
Stream Kernel k-means (sKKM)
Kernel k-means
Weighted Kernel k-means
History from only the preceding data chunk retained
Approximation of Kernel k-Means for Streaming Data, Havens, ICPR 2012
Slide 16
Statistical Leverage Scores
Measures the influence of a point in the low-rank approximation
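A common way to compute leverage scores, sketched here under stated assumptions, is to take the top eigenvectors of the kernel matrix and use the squared norm of each point's row in that eigenvector matrix. The function name `leverage_scores` and the batch eigendecomposition are choices made for this illustration; the slides do not specify how the scores are computed.

```python
import numpy as np

def leverage_scores(K, rank):
    """Leverage score of each point with respect to the best rank-`rank`
    approximation of the symmetric kernel matrix K."""
    # eigh returns eigenvalues in ascending order, eigenvectors as columns.
    eigvals, eigvecs = np.linalg.eigh(K)
    top = eigvecs[:, -rank:]           # eigenvectors of the `rank` largest eigenvalues
    return (top ** 2).sum(axis=1)      # squared row norms; they sum to `rank`
```

Points with a high score contribute strongly to the low-rank approximation, so sampling in proportion to these scores tends to keep the points that matter most.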
Slide 17
Statistical Leverage Scores
Slide 18
Slide 19
Approximate Stream kernel k-means
o Uses statistical leverage score to determine which data points in the stream are potentially important
o Retain the important points and discard the rest
o Use an approximate version of kernel k-means to obtain the clusters
o Linear time complexity
o Bounded amount of memory
Updating eigenvectors
Only eigenvectors and eigenvalues of kernel matrix are required for both sampling and clustering
Update the eigenvectors and eigenvalues incrementally
Slide 25
Approximate Stream Kernel k-means
Slide 26
Network Traffic Monitoring
Clustering used to detect intrusions in the network
Network Intrusion Data set
  TCP dump data from seven weeks of LAN traffic
  10 classes: 9 types of intrusions, 1 class of legitimate traffic

Method                              Running time (ms per data point)   Cluster accuracy (NMI)
Approximate stream kernel k-means   6.6                                14.2
StreamKM++                          0.8                                7.0
sKKM                                42.1                               13.3

Around 200 points clustered per second
Slide 27
Summary
Efficient kernel-based stream clustering algorithm - linear running time complexity
Memory required is bounded
Real-time clustering is possible
Limitation: does not account for data evolution