Clustering on Database Systems Vahid Mirjalili Michigan State University
Page 1: Clustering on database systems rkm

Clustering on Database Systems

Vahid Mirjalili

Michigan State University

Page 2: Clustering on database systems rkm

Clustering

• Partitioning data into groups

Items in the same group should have higher similarity to each other than to items from different groups

• A similarity/dissimilarity measure

• Examples: Clustering patients in a hospital

Genomic clustering

Hand-written character recognition

A. Jain, “Data Clustering: 50 years beyond K-means”

Page 3: Clustering on database systems rkm

Clustering vs. Classification

• Classification is supervised

– Class labels are provided

– Learn a classifier to predict class labels of novel/unseen data

• Clustering is unsupervised or semi-supervised

– No class labels are given

– Understand the structure underlying your data

[Figure: predictive modeling tasks: supervised learning, unsupervised learning, reinforcement learning]

Page 4: Clustering on database systems rkm

Clustering Approaches

Probability-based

– Assumes statistical independence among features

– Updating and storing clusters is inefficient

Distance-based

– Assumes direct access to all data points

– Hierarchical clustering: O(N^2), and not guaranteed to give the best clustering

Page 5: Clustering on database systems rkm

Distance-Based Clustering Algorithms

• kmeans and its variants (kmedoids, kernel kmeans, fuzzy c-means, …)

• Density based methods (DBSCAN)

• Hierarchical methods

Page 6: Clustering on database systems rkm

Challenges

• Unknown number of clusters (from 1 to N)

[Figure panels: input data, K=2, K=6]

You always get some output as clusters

Are they really distinct clusters?

A. Jain, “Data Clustering: 50 years beyond K-means”

Page 7: Clustering on database systems rkm

Challenges

• Clusters with different shapes, sizes and densities

Shapes: globular shape, linear vs. non-linear shapes

A. Jain, “Data Clustering: 50 years beyond K-means”

Page 8: Clustering on database systems rkm

Standard K-Means Algorithm

• Choose the initial cluster centroids randomly

• An iterative algorithm

1. Assignment step: assign each data point to the cluster whose mean is closest (smallest distance)

2. Update step: update the mean (centroid) of each cluster

Distance: squared Euclidean distance

$$dist(x, x') = \sum_{j=1}^{d} (x_j - x'_j)^2$$

Centroid: mean of feature vectors

$$C = \frac{1}{N} \sum_{i \in C} X_i$$
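The two steps above can be sketched in a few lines of plain Python. This is a minimal in-memory illustration, not a tuned implementation; the function names (`sq_euclidean`, `mean_vector`, `kmeans`), the `max_iter` cap, the stopping rule and the toy data are assumptions made for this example.

```python
import random

def sq_euclidean(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean_vector(points):
    """Centroid: component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Random initial centroids: k distinct data points.
    centroids = [list(p) for p in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_euclidean(x, centroids[j]))
                      for x in data]
        if new_labels == labels:
            break  # assignments did not change: stop
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_vector(members)
    return centroids, labels

# Toy data: two well-separated groups of three points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(data, k=2)
print(labels)
```

Every iteration rescans the whole dataset, which is what makes the standard algorithm expensive when the data lives on disk.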

Page 9: Clustering on database systems rkm

Standard K-Means Algorithm

Page 10: Clustering on database systems rkm

Problem in Database-oriented Clustering

• Little memory available compared to the size of the dataset: the data doesn’t fit in main memory

• High I/O

• Necessary to avoid too many iterations

Page 11: Clustering on database systems rkm

RKM: An Efficient Disk-based K-Means Method

• Find the initial centroids by $\mu_c = \mu_{all} \pm \sigma\, r / d$

• Only 3 iterations:

– Assign every $L = \sqrt{N}$ points to their nearest centroids;

– Update the cluster centroids

• Minor efficiency tricks:

– Keep track of LS, SS and Nc for each cluster during assignment

– Update step: $\mu_c = LS / N_c$
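The idea of refreshing centroids after every block of L points, using running sums instead of a full rescan, can be sketched as follows. This is a simplified in-memory illustration and not the authors' disk-based implementation. The name `rkm_sketch`, the seeding near the global mean, the choice L = sqrt(N), and the bookkeeping that moves a point's contribution between clusters when it is reassigned are all assumptions of this example. SS is left out for brevity.

```python
import math
import random

def nearest(x, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))

def rkm_sketch(data, k, passes=3, seed=0):
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    L = max(1, int(math.sqrt(n)))  # assumed refresh interval
    # Assumed seeding: small random offsets around the global mean.
    cols = list(zip(*data))
    mu = [sum(col) / n for col in cols]
    sigma = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
             for col, m in zip(cols, mu)]
    centroids = [[m + s * rng.uniform(-1.0, 1.0) / d for m, s in zip(mu, sigma)]
                 for _ in range(k)]
    LS = [[0.0] * d for _ in range(k)]  # linear sum per cluster
    Nc = [0] * k                        # point count per cluster
    labels = [None] * n
    for _ in range(passes):
        for i, x in enumerate(data):
            j = nearest(x, centroids)
            old = labels[i]
            if old != j:
                if old is not None:  # remove the point from its old cluster
                    Nc[old] -= 1
                    LS[old] = [s - v for s, v in zip(LS[old], x)]
                Nc[j] += 1
                LS[j] = [s + v for s, v in zip(LS[j], x)]
                labels[i] = j
            # Refresh centroids after every L points and at the end of a pass.
            if (i + 1) % L == 0 or i == n - 1:
                for c in range(k):
                    if Nc[c] > 0:
                        centroids[c] = [s / Nc[c] for s in LS[c]]
    return centroids, labels

# Hypothetical data: two Gaussian blobs, shuffled so the scan order is mixed.
gen = random.Random(1)
data = ([(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(200)]
        + [(gen.gauss(8, 1), gen.gauss(8, 1)) for _ in range(200)])
gen.shuffle(data)
centroids, labels = rkm_sketch(data, k=2)
```

Because the centroids move many times within one scan, a handful of passes can do the work of many full K-means iterations, at the cost of making the result depend on the order in which points are read.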

Page 12: Clustering on database systems rkm

Implementation of RKM: storing data matrices

• D: input dataset

• Pj: cluster j (for j in [1..k])

• Mj, Qj, Nj: linear sum, squared sum, cluster size

• Cj, Rj, Wj: centroids, variances, weights (accessed during update step)

$$C_j = M_j / N_j$$

$$R_j = Q_j / N_j - M_j M_j^t / N_j^2$$

$$W_j = N_j / \sum_{l=1}^{k} N_l$$
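As a small worked example, the centroid, variance and weight of a cluster can be derived from its linear sum, squared sum and size alone, without revisiting the points. The helper names `summarize` and `derive` are hypothetical, and the sketch assumes that only per-dimension (diagonal) variances are kept.

```python
def summarize(points):
    """Sufficient statistics of one cluster: linear sum M, squared sum Q, size N."""
    d = len(points[0])
    M = [sum(p[a] for p in points) for a in range(d)]
    Q = [sum(p[a] ** 2 for p in points) for a in range(d)]
    return M, Q, len(points)

def derive(M, Q, N, total):
    """Centroid C, per-dimension variance R and weight W from M, Q, N."""
    C = [m / N for m in M]
    R = [q / N - (m / N) ** 2 for q, m in zip(Q, M)]  # diagonal of Q/N - M M^t / N^2
    W = N / total
    return C, R, W

clusters = [
    [(1.0, 2.0), (3.0, 4.0)],
    [(0.0, 0.0), (2.0, 0.0), (4.0, 6.0)],
]
total = sum(len(c) for c in clusters)
M, Q, N = summarize(clusters[0])
C, R, W = derive(M, Q, N, total)
print(C, R, W)  # [2.0, 3.0] [1.0, 1.0] 0.4
```

Since M, Q and N are plain sums, they can be updated one point at a time during the assignment step.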

Page 13: Clustering on database systems rkm

RKM avoids local minima: split large clusters

• Only performed if the size of a cluster falls below a user-defined threshold

1. Remove the centroid of the small cluster

2. Find the largest cluster (largest Wj)

3. Randomly choose two centroids for the largest cluster (using Cj, and Rj)

4. Reassign the items of small and large clusters
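The four steps above might look like this in code. It is a sketch under stated assumptions: the function `split_step` is hypothetical, the two new centroids are drawn within one standard deviation of the large cluster's centroid in each dimension, and affected points are reassigned to the nearest of all current centroids. The slides do not specify these details.

```python
import math
import random

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def split_step(points, labels, C, R, W, small, rng):
    """Re-seed the small cluster inside the largest one (modifies C and labels)."""
    k = len(C)
    # Step 2: the largest cluster is the one with the largest weight.
    big = max((j for j in range(k) if j != small), key=lambda j: W[j])
    center = list(C[big])
    spread = [math.sqrt(max(v, 0.0)) for v in R[big]]
    # Steps 1 and 3: drop the small centroid and draw two new centroids
    # around the large cluster's centroid (assumed sampling rule).
    for j in (small, big):
        C[j] = [m + s * rng.uniform(-1.0, 1.0) for m, s in zip(center, spread)]
    # Step 4: reassign only the points of the small and large clusters.
    for i, x in enumerate(points):
        if labels[i] in (small, big):
            labels[i] = min(range(k), key=lambda j: sq_dist(x, C[j]))
    return big

# Hypothetical state: cluster 0 is large, cluster 1 holds a single point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (9.0, 9.0), (5.0, 5.0)]
labels = [0, 0, 0, 0, 1, 2]
C = [[0.5, 0.5], [9.0, 9.0], [5.0, 5.0]]
R = [[0.25, 0.25], [0.0, 0.0], [0.0, 0.0]]
W = [4 / 6, 1 / 6, 1 / 6]
big = split_step(points, labels, C, R, W, small=1, rng=random.Random(0))
```

After the call, the linear sums, squared sums and counts of the affected clusters would also need to be rebuilt; that bookkeeping is omitted here.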

Page 14: Clustering on database systems rkm

RKM vs. Standard K-means: Random Dataset

Page 15: Clustering on database systems rkm

RKM vs. Standard K-means: Initial Cluster Centroids

K = 3

Page 16: Clustering on database systems rkm

Cluster assignment: results after one pass over all the data

RKM: 2 more iterations. Standard K-means: many iterations needed.

Page 17: Clustering on database systems rkm

RKM: Database design

• Relational schema for sparse data representation: D(pid, inx, value)

• For other matrices: doing 1 I/O per matrix row to minimize I/O

[Table: matrix access in the E step (assignment step) and the M step (update step)]

Ordonez and Omiecinski, “Efficient Disk-Based K-Means Clustering for Relational Databases”
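To make the sparse schema D(pid, inx, value) concrete, here is a small sketch using Python's built-in `sqlite3` module. Only the D table comes from the slide. The vertical centroid table `C(j, inx, value)`, the sample rows and the query are assumptions for illustration and are not the layout used in the paper. Because zero entries of a point are not stored, the squared distance is computed as the centroid's squared norm plus a correction summed over the point's non-zero entries.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE D (pid INTEGER, inx INTEGER, value REAL, PRIMARY KEY (pid, inx));
CREATE TABLE C (j INTEGER, inx INTEGER, value REAL, PRIMARY KEY (j, inx));
""")

# Points p1 = (1, 0, 2) and p2 = (0, 3, 0); zero entries are simply omitted.
con.executemany("INSERT INTO D VALUES (?, ?, ?)",
                [(1, 1, 1.0), (1, 3, 2.0), (2, 2, 3.0)])
# Centroids c1 = (1, 0, 1) and c2 = (0, 2, 0), stored densely (hypothetical table).
con.executemany("INSERT INTO C VALUES (?, ?, ?)",
                [(1, 1, 1.0), (1, 2, 0.0), (1, 3, 1.0),
                 (2, 1, 0.0), (2, 2, 2.0), (2, 3, 0.0)])

# sum_a (x_a - c_a)^2 = sum_a c_a^2 + sum over non-zero x_a of (x_a^2 - 2 x_a c_a)
rows = con.execute("""
WITH CN AS (SELECT j, SUM(value * value) AS norm2 FROM C GROUP BY j)
SELECT D.pid, C.j,
       CN.norm2 + SUM(D.value * D.value - 2 * D.value * C.value) AS dist
FROM D
JOIN C ON D.inx = C.inx
JOIN CN ON CN.j = C.j
GROUP BY D.pid, C.j, CN.norm2
ORDER BY D.pid, C.j
""").fetchall()

print(rows)  # [(1, 1, 1.0), (1, 2, 9.0), (2, 1, 11.0), (2, 2, 1.0)]
```

A point with no non-zero entries would not appear in the join at all and would need separate handling.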

Page 18: Clustering on database systems rkm

Performance Comparison

• RKM (disk-based)

• Memory based:

– Standard K-means

– Scalable K-means

$$\text{Quan. error}(C) = \sum_{j=1}^{k} \sum_{i \in P_j} dist(x_i, C_j)$$
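The quantization error used to compare the methods is straightforward to compute once every point has a cluster label. The sketch below assumes squared Euclidean distance, as in the K-means slides; the function name and the toy values are made up for this example.

```python
def quantization_error(points, labels, centroids):
    """Sum over all points of the squared distance to the assigned centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(x, centroids[j]))
        for x, j in zip(points, labels)
    )

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
labels = [0, 0, 1]
centroids = [(1.0, 0.0), (10.0, 0.0)]
error = quantization_error(points, labels, centroids)
print(error)  # 2.0
```

Lower values mean tighter clusters, but the error always shrinks as K grows, so it is only meaningful when comparing runs with the same K.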

Page 19: Clustering on database systems rkm

Time Complexity of RKM

Ordonez and Omiecinski, “Efficient Disk-Based K-Means Clustering for Relational Databases”

Page 20: Clustering on database systems rkm

Time Complexity

Ordonez and Omiecinski, “Efficient Disk-Based K-Means Clustering for Relational Databases”

Page 21: Clustering on database systems rkm

Conclusion

• RKM resolves some of the limitations of K-means

• RKM limits disk access (I/O)

• Final clustering is achieved with 3 iterations

• On large datasets RKM outperforms standard K-means

• Other limitations of K-means clustering still remain

Page 22: Clustering on database systems rkm

Read more …

General implementation in IPython notebook: http://goo.gl/YZScH9

http://www.vahidmirjalili.com