Birch: An efficient data clustering method for very large databases By Tian Zhang, Raghu Ramakrishnan Presented by Hung Lai
Dec 22, 2015

Helena McCoy
Transcript
Birch: An efficient data clustering method for very large databases

By Tian Zhang, Raghu Ramakrishnan

Presented by Hung Lai

Outline

What is data clustering

Data clustering applications

Previous approaches and problems

Birch’s goal

Clustering Feature

Birch clustering algorithm

Experiment results and conclusion

What is Data Clustering?

A cluster is a closely packed group: a collection of data objects that are similar to one another and treated collectively as a group.

Data Clustering is the partitioning of a dataset into clusters

Data Clustering

Helps understand the natural grouping or structure in a dataset

Given a large set of multidimensional data:
– The data space is usually not uniformly occupied
– Clustering identifies the sparse and crowded places
– Helps visualization

Some Clustering Applications

Biology – building groups of genes with related patterns

Marketing – partition the population of consumers into market segments

Division of WWW pages into genres

Image segmentation – for object recognition

Land use – identification of areas of similar land use from satellite images

Insurance – identify groups of policy holders with high average claim cost

Data Clustering – previous approaches

Probability based (machine learning): wrongly assumes that the distributions of different attributes are independent of each other

Probability representations of clusters are expensive

Approaches

Distance based (statistics)
– There must be a distance metric between two items
– Assumes that all data points are in memory and can be scanned frequently
– Ignores the fact that not all data points are equally important
– Close data points are not gathered together
– Inspects all data points on multiple iterations

These approaches do not deal with dataset and memory size issues!

Clustering parameters

Centroid – Euclidean center

Radius – average distance from the points of a cluster to its centroid

Diameter – average pairwise distance within a cluster

Radius and diameter are measures of the tightness of a cluster around its center. We wish to keep these low.
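
The three parameters above can be sketched in a few lines of Python. This is only an illustration, and it makes one assumption the slide does not state: the "average" is taken as a root mean square, which is how the BIRCH paper defines radius and diameter. The function names are hypothetical.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def radius(points):
    """Root-mean-square distance from the points to the centroid."""
    c = centroid(points)
    return math.sqrt(sum(math.dist(p, c) ** 2 for p in points) / len(points))

def diameter(points):
    """Root-mean-square distance over all ordered pairs of distinct points."""
    n = len(points)
    total = sum(math.dist(p, q) ** 2 for p in points for q in points)
    return math.sqrt(total / (n * (n - 1)))

cluster = [(0, 0), (2, 0)]
print(centroid(cluster), radius(cluster), diameter(cluster))
```

For the two points in the example the centroid is (1, 0), the radius is 1 and the diameter is 2, so a tighter cluster gives smaller values for both measures.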

Clustering parameters

Other measurements (like the Euclidean distance of the centroids of two clusters) will measure how far away two clusters are.

A good-quality clustering produces high intra-cluster similarity and low inter-cluster similarity

A good quality clustering can help find hidden patterns

Birch’s goals:

Minimize running time and data scans, thus formulating the problem for large databases

Clustering decisions made without scanning the whole data

Exploit the non-uniformity of data – treat dense areas as one, and remove outliers (noise)

Clustering Features (CF)

A CF is a compact summary of the data points in a cluster

Has enough information to calculate the intra-cluster distances

Additivity theorem allows us to merge sub-clusters

Clustering Feature (CF)

Given N d-dimensional data points in a cluster: {Xi} where i = 1, 2, …, N,

CF = (N, LS, SS)

N is the number of data points in the cluster,

LS is the linear sum of the N data points,

SS is the square sum of the N data points.
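
A minimal Python sketch of this definition follows. It assumes LS is kept as a d-dimensional vector and SS as a single scalar (the sum of squared norms), as in the BIRCH paper; the function names are hypothetical. It also shows why the triple is enough to recover the centroid and the radius without the original points.

```python
import math

def make_cf(points):
    """Summarize a non-empty list of d-dimensional points as (N, LS, SS)."""
    n = len(points)
    ls = [sum(col) for col in zip(*points)]      # linear sum, one value per dimension
    ss = sum(x * x for p in points for x in p)   # square sum, a single scalar
    return (n, ls, ss)

def cf_centroid(cf):
    """Centroid = LS / N."""
    n, ls, _ = cf
    return [x / n for x in ls]

def cf_radius(cf):
    """Radius from the CF alone: R^2 = SS / N - |LS / N|^2."""
    n, ls, ss = cf
    r2 = ss / n - sum((x / n) ** 2 for x in ls)
    return math.sqrt(max(r2, 0.0))  # clamp tiny negative rounding errors

cf = make_cf([(1, 2), (3, 4)])
print(cf, cf_centroid(cf), cf_radius(cf))
```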

CF Additivity Theorem

If CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF entries of two disjoint sub-clusters, then the CF entry of the sub-cluster formed by merging the two disjoint sub-clusters is:

CF1 + CF2 = (N1 + N2 , LS1 + LS2, SS1 + SS2)
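
The theorem can be checked with a small Python sketch: merging two CF entries component by component gives the same triple as summarizing the union of their points directly. The helper names are hypothetical, and SS is assumed to be a scalar.

```python
def make_cf(points):
    """Summarize a non-empty list of points as (N, LS, SS)."""
    n = len(points)
    ls = [sum(col) for col in zip(*points)]
    ss = sum(x * x for p in points for x in p)
    return (n, ls, ss)

def merge_cf(cf1, cf2):
    """CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2, [a + b for a, b in zip(ls1, ls2)], ss1 + ss2)

left = [(1, 2), (3, 4)]
right = [(5, 6)]
merged = merge_cf(make_cf(left), make_cf(right))
print(merged == make_cf(left + right))
```

Because the merge needs only the two triples, sub-clusters can be combined without revisiting their data points.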

Properties of CF-Tree

Each non-leaf node has at most B entries

Each leaf node has at most L CF entries, each of which satisfies threshold T

Node size is determined by the dimensionality of the data space and the input parameter P (page size)

CF Tree Insertion

Identifying the appropriate leaf: recursively descending the CF tree and choosing the closest child node according to a chosen distance metric

Modifying the leaf: find the closest leaf entry and test whether it can absorb the new entry without violating the threshold. If it cannot, add a new entry to the leaf; if the leaf has no room for it, split the leaf node

Modifying the path: update CF information up the path.
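
The leaf-level step can be sketched as follows. This is a deliberately simplified illustration, not the full CF tree: it models a single leaf as a flat list of CF entries with no capacity limit, so descending the tree, splitting and updating the path are left out. It assumes the threshold is applied to the radius of the candidate entry and that closeness is the Euclidean distance to the entry's centroid; all names are hypothetical.

```python
import math

def cf_of(point):
    """CF entry (N, LS, SS) for a single point."""
    return (1, list(point), sum(x * x for x in point))

def merge(a, b):
    return (a[0] + b[0], [x + y for x, y in zip(a[1], b[1])], a[2] + b[2])

def centroid(cf):
    n, ls, _ = cf
    return [x / n for x in ls]

def radius(cf):
    n, ls, ss = cf
    r2 = ss / n - sum((x / n) ** 2 for x in ls)
    return math.sqrt(max(r2, 0.0))

def insert_point(leaf_entries, point, threshold):
    """Absorb the point into the closest entry if the threshold allows it,
    otherwise start a new entry. Returns the index of the entry used."""
    new = cf_of(point)
    if leaf_entries:
        i = min(range(len(leaf_entries)),
                key=lambda j: math.dist(centroid(leaf_entries[j]), point))
        candidate = merge(leaf_entries[i], new)
        if radius(candidate) <= threshold:
            leaf_entries[i] = candidate
            return i
    leaf_entries.append(new)
    return len(leaf_entries) - 1

leaf = []
for p in [(0, 0), (1, 0), (10, 10)]:
    insert_point(leaf, p, threshold=1.0)
print(len(leaf))
```

In the example the first two points end up in one entry and the distant third point opens a second entry.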

Birch Clustering Algorithm

Phase 1: Scan all data and build an initial in-memory CF tree.

Phase 2: Condense into a desirable size range by building a smaller CF tree.

Phase 3: Global clustering

Phase 4: Cluster refining – this is optional, and requires more passes over the data to refine the results

Birch – Phase 1

Start with initial threshold and insert points into the tree

If memory runs out, increase the threshold value and rebuild a smaller tree by reinserting the entries of the old tree, then continue with the remaining values

A good initial threshold is important but hard to figure out

Outlier removal – when rebuilding the tree, remove outliers
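
The rebuild loop of phase 1 can be sketched under strong simplifying assumptions: the tree is again a flat list of CF entries, "running out of memory" is modelled as exceeding a hypothetical `max_entries` budget, and the threshold is simply doubled on each rebuild (the slides do not say how the new threshold is chosen). Outlier removal is omitted.

```python
import math

def merge(a, b):
    return (a[0] + b[0], [x + y for x, y in zip(a[1], b[1])], a[2] + b[2])

def centroid(cf):
    n, ls, _ = cf
    return [x / n for x in ls]

def radius(cf):
    n, ls, ss = cf
    r2 = ss / n - sum((x / n) ** 2 for x in ls)
    return math.sqrt(max(r2, 0.0))

def insert_cf(entries, cf, threshold):
    """Merge cf into the closest entry if the merged radius fits the threshold."""
    if entries:
        c = centroid(cf)
        i = min(range(len(entries)),
                key=lambda j: math.dist(centroid(entries[j]), c))
        candidate = merge(entries[i], cf)
        if radius(candidate) <= threshold:
            entries[i] = candidate
            return
    entries.append(cf)

def phase1(points, threshold, max_entries):
    """Single scan; requires max_entries >= 1. Returns (entries, final threshold)."""
    entries = []
    for p in points:
        insert_cf(entries, (1, list(p), sum(x * x for x in p)), threshold)
        while len(entries) > max_entries:
            # Assumed growth rule: double the threshold (or start at 1.0 from zero).
            threshold = threshold * 2 if threshold > 0 else 1.0
            old, entries = entries, []
            for cf in old:          # reinsert summaries, not raw points
                insert_cf(entries, cf, threshold)
    return entries, threshold

data = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (10, 0), (10.1, 0)]
entries, final_t = phase1(data, threshold=0.01, max_entries=3)
print(len(entries), final_t)
```

The rebuild never rereads the data: it only reinserts the CF entries already held in memory, which is why the data scan can continue from where it stopped.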

Birch – Phase 2

Optional

The algorithm used in phase 3 sometimes performs well only within a certain input size, so phase 2 prepares the tree for phase 3.

Removes outliers and groups sub-clusters.

Birch – Phase 3

Problems after phase 1:
– Input order affects results
– Splitting triggered by node size

Phase 3:
– Cluster all leaf entries on their CF values according to an existing algorithm
– Algorithm used here: agglomerative hierarchical clustering
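
A minimal sketch of agglomerative clustering applied directly to CF entries is given below. It assumes the distance between two sub-clusters is the Euclidean distance between their centroids and that merging stops at a hypothetical target count `k`; the slides fix neither choice. Merging relies only on CF additivity, so no raw data points are needed.

```python
import math

def merge(a, b):
    return (a[0] + b[0], [x + y for x, y in zip(a[1], b[1])], a[2] + b[2])

def centroid(cf):
    n, ls, _ = cf
    return [x / n for x in ls]

def agglomerate(entries, k):
    """Repeatedly merge the two closest CF entries until k clusters remain."""
    clusters = list(entries)
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = merge(clusters[i], clusters[j])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

leaf_entries = [
    (1, [0.0, 0.0], 0.0),
    (1, [1.0, 0.0], 1.0),
    (1, [10.0, 0.0], 100.0),
    (1, [11.0, 0.0], 121.0),
]
result = agglomerate(leaf_entries, k=2)
print(sorted(centroid(c) for c in result))
```

The pairwise search is quadratic in the number of entries, which is acceptable here because phase 3 works on the leaf entries rather than on the full dataset.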

Birch – Phase 4

Optional

Do additional passes over the dataset and reassign data points to the closest centroid from phase 3

Recalculate the centroids and redistribute the items.

Always converges (no matter how many times phase 4 is repeated)
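
One refinement pass can be sketched as below, assuming Euclidean distance and that a centroid with no assigned points is left unchanged. The function name and the `passes` parameter are hypothetical.

```python
import math

def refine(points, centroids, passes=1):
    """Reassign each point to its closest centroid, then recompute the centroids."""
    def closest(p):
        return min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))

    for _ in range(passes):
        groups = [[] for _ in centroids]
        for p in points:                      # one pass over the dataset
            groups[closest(p)].append(p)
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else list(c)
            for g, c in zip(groups, centroids)
        ]
    labels = [closest(p) for p in points]
    return centroids, labels

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
new_centroids, labels = refine(points, [[1, 1], [8, 1]])
print(new_centroids, labels)
```

Each pass costs one more scan of the data, which is the price of the extra accuracy this optional phase provides.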

Experimental Results

Create 3 synthetic data sets for testing
– Also create an ordered copy for testing input order

KMEANS and CLARANS require the entire data set to be in memory
– Initial scan is from disk, subsequent scans are in memory

Experimental Results

Intended clustering

Experimental Results

KMEANS clustering

DS   Time   D      # Scans     DS   Time   D      # Scans
1    43.9   2.09   289         1o   33.8   1.97   197
2    13.2   4.43   51          2o   12.7   4.20   29
3    32.9   3.66   187         3o   36.0   4.35   241

Experimental Results

CLARANS clustering

DS   Time   D      # Scans     DS   Time   D      # Scans
1    932    2.10   3307        1o   794    2.11   2854
2    758    2.63   2661        2o   816    2.31   2933
3    835    3.39   2959        3o   924    3.28   3369

Experimental Results

BIRCH clustering

DS   Time   D      # Scans     DS   Time   D      # Scans
1    11.5   1.87   2           1o   13.6   1.87   2
2    10.7   1.99   2           2o   12.1   1.99   2
3    11.4   3.95   2           3o   12.2   3.99   2

Conclusion

Birch outperforms existing algorithms (CLARANS and KMEANS) on large datasets in quality, speed, stability and scalability

Scans whole data only once

Handles outliers better