Midterm Review: CIS 563 – Intro to data science
Sagnik Basumallik
Some Important Points from the slides:
Types of Clusterings: hierarchical and partitional
Clustering Algorithms: K‐means and its variants, Hierarchical clustering, Density‐based clustering
k‐means complexity: O(n × K × I × d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes
Problems with Selecting Initial Points
Solutions to the Initial Centroids Problem:
• Multiple runs (helps, but probability is not on your side)
• Sample and use hierarchical clustering to determine initial centroids
• Select more than k initial centroids, then select among these initial centroids (e.g., the most widely separated ones)
• Postprocessing
• Bisecting K‐means (not as susceptible to initialization issues)
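The "multiple runs" remedy can be sketched in code: run Lloyd's algorithm several times from random initial centroids and keep the run with the lowest SSE. This is an illustrative pure-Python sketch (not the course's implementation); the sample points are made up.

```python
import random

def kmeans(points, k, iters=100, rng=None):
    """One run of Lloyd's algorithm; returns (centroids, sse)."""
    rng = rng or random.Random()
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        # update step: move each centroid to the mean of its cluster
        new = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    sse = sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
              for p in points)
    return centroids, sse

def kmeans_multiple_runs(points, k, runs=20, seed=0):
    """Mitigate bad initial centroids by keeping the run with the lowest SSE."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng=rng) for _ in range(runs)),
               key=lambda r: r[1])

# made-up sample points: three well-separated pairs
pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (10.0, 0.0), (9.8, 0.3)]
centroids, sse = kmeans_multiple_runs(pts, k=3)
```

With a single run, a bad draw of initial centroids can merge two natural clusters; taking the minimum SSE over many runs makes that increasingly unlikely, though, as the slides note, probability is not on your side for large k.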
What is postprocessing?
Evaluating K‐means Clusters
K‐means has problems when clusters are of differing:
• Sizes
• Densities
• Non‐globular shapes
K‐means also has problems when the data contains outliers.
How can these problems be overcome? One solution is to use many clusters: each cluster then captures part of a natural cluster, but the parts need to be put back together afterwards.
Types of Hierarchical Clustering: Agglomerative and Divisive
Algorithms for each clustering process
How to Define Inter‐Cluster Similarity: pros and cons
MIN (single link)
• Pros: can handle non‐elliptical shapes
• Cons: sensitive to noise and outliers
MAX (complete link)
• Pros: less susceptible to noise and outliers
• Cons: tends to break large clusters; biased towards globular clusters
Group Average
• Pros: less susceptible to noise and outliers
• Cons: biased towards globular clusters
Distance Between Centroids
• Pros: less susceptible to noise and outliers
• Cons: biased towards globular clusters
Ward's Method
• Pros: less susceptible to noise and outliers
• Cons: biased towards globular clusters
The SSE objective referenced by Ward's Method:
SSE = Σ_{i=1..K} Σ_{x ∈ C_i} dist(m_i, x)², where m_i is the centroid of cluster C_i
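The SSE formula translates directly into code; a minimal sketch with made-up clusters and centroids:

```python
def sse(clusters, centroids):
    """SSE = sum over clusters i, sum over x in C_i, of dist(m_i, x)^2."""
    total = 0.0
    for cluster, m in zip(clusters, centroids):
        for x in cluster:
            total += sum((xj - mj) ** 2 for xj, mj in zip(x, m))
    return total

# made-up example: two clusters with their centroids
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]
centroids = [(1.0, 0.0), (5.0, 5.0)]
# each point in the first cluster is distance 1 from its centroid
print(sse(clusters, centroids))  # → 2.0
```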
Hierarchical Clustering: Problems and Limitations:
• Once a decision is made to combine two clusters, it cannot be undone
• No objective function is directly minimized
• Different schemes have problems with one or more of the following: sensitivity to noise and outliers; difficulty handling different‐sized clusters and convex shapes; breaking large clusters
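For concreteness, the agglomerative process and the MIN / MAX / group-average similarity definitions can be sketched naively in pure Python (O(n³), suitable only for tiny examples; the points below are made up):

```python
from itertools import combinations

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def linkage_distance(c1, c2, scheme):
    """Inter-cluster distance under the chosen similarity definition."""
    d = [euclidean(p, q) for p in c1 for q in c2]
    if scheme == "min":        # single link: closest pair of points
        return min(d)
    if scheme == "max":        # complete link: farthest pair of points
        return max(d)
    return sum(d) / len(d)     # group average: mean pairwise distance

def agglomerative(points, k, scheme="min"):
    """Repeatedly merge the two closest clusters until only k remain.
    Note: a merge, once made, is never undone (the limitation above)."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage_distance(clusters[ij[0]],
                                                   clusters[ij[1]], scheme))
        clusters[i] += clusters.pop(j)   # j > i, so pop(j) is safe
    return clusters

# made-up 1-D points forming two obvious groups
pts = [(0.0,), (0.5,), (9.0,), (9.5,)]
print(sorted(len(c) for c in agglomerative(pts, k=2, scheme="min")))  # → [2, 2]
```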
Density‐based clustering: DBSCAN
Strengths:
• Resistant to noise
• Can handle clusters of different shapes and sizes
Does not work well with:
• Varying densities
• High‐dimensional data
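A minimal sketch of DBSCAN's core logic (naive O(n²) neighbor search; the eps and min_pts values and the sample points are made up for illustration):

```python
def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including itself)."""
    return [j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[i], q)) ** 0.5 <= eps]

def dbscan(points, eps, min_pts):
    """Naive DBSCAN; returns one label per point (-1 = noise)."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:          # not a core point
            labels[i] = -1                    # noise (may later become border)
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:               # noise reachable from a core point
                labels[j] = cluster           # ...becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = region_query(points, j, eps)
            if len(nb) >= min_pts:            # j is also a core point: expand
                seeds.extend(nb)
    return labels

# made-up points: two dense runs plus one isolated noise point
pts = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0),
       (5.0, 5.0), (5.1, 5.0), (5.2, 5.0),
       (20.0, 20.0)]
labels = dbscan(pts, eps=0.3, min_pts=3)
```

The resistance to noise comes from the `-1` label: points with too few neighbors are never forced into a cluster, unlike in K-means.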
Measures of Cluster Validity
1. Internal Measures:
a. Cohesion and Separation
b. Silhouette Coefficient: combines ideas of both cohesion and separation, but for individual
points, as well as clusters and clusterings
c. Correlation
2. External measures
a. Entropy
b. Purity
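The external measures above can be computed directly when each cluster's true class labels are known; a minimal sketch with made-up labels:

```python
from collections import Counter
from math import log2

def purity(clusters):
    """Each cluster is a list of true class labels; purity is the fraction
    of points that belong to the majority class of their cluster."""
    n = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / n

def entropy(clusters):
    """Size-weighted average of per-cluster class entropy (lower is better)."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        h = -sum((v / len(c)) * log2(v / len(c)) for v in Counter(c).values())
        total += len(c) / n * h
    return total

# made-up labels: the second cluster is pure, the first is mostly "a"
clusters = [["a", "a", "a", "b"], ["b", "b", "b", "b"]]
print(purity(clusters))  # → 0.875
```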
SSE: the sum of squared error (defined above) can also serve as an internal measure of cluster cohesion; lower SSE means tighter clusters.
Time Series
1. SAX (Symbolic Aggregate approXimation)
2. DTW (Dynamic Time Warping)
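DTW can be sketched with its standard dynamic program; this is an illustrative implementation, not the course's code, using |s[i] − t[j]| as the local cost:

```python
def dtw(s, t):
    """Dynamic time warping distance between two 1-D sequences."""
    n, m = len(s), len(t)
    inf = float("inf")
    # D[i][j] = cost of best warping path aligning s[:i] with t[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            # extend the cheapest of: match, insertion, deletion
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]

# identical shapes shifted in time still align perfectly
print(dtw([1, 2, 3, 3], [1, 1, 2, 3]))  # → 0.0
```

Unlike pointwise Euclidean distance, DTW allows one-to-many alignments, which is why the shifted sequences above have distance zero.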
Association Rule Mining:
1. find frequent itemsets
2. Hash trees
Given d items:
Total number of itemsets = 2^d
Total number of possible association rules: R = 3^d − 2^(d+1) + 1
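These counts as code (the closed-form rule count follows because each item can be in the antecedent, in the consequent, or absent, minus the cases where either side is empty):

```python
def itemset_count(d):
    """Every subset of d items, including the empty set."""
    return 2 ** d

def rule_count(d):
    """R = 3^d - 2^(d+1) + 1 possible association rules over d items."""
    return 3 ** d - 2 ** (d + 1) + 1

print(itemset_count(6), rule_count(6))  # → 64 602
```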
Subset Operation: Given a transaction t, what are the possible subsets of size 3?
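The subsets of size 3 that the subset operation enumerates (e.g., when matching a transaction against candidate 3-itemsets in a hash tree) can be generated with itertools.combinations; the transaction below is made up:

```python
from itertools import combinations
from math import comb

t = [1, 2, 3, 5, 6]                      # a made-up transaction of 5 items
subsets = list(combinations(t, 3))       # all size-3 subsets, in sorted order
print(len(subsets), comb(5, 3))          # → 10 10
```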
Support and Confidence
Example 1: Find frequent itemsets and the strong association rules
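A brute-force worked sketch of this kind of exercise (enumerating all itemsets rather than using Apriori's level-wise pruning), with a made-up transaction table and made-up minsup/minconf thresholds:

```python
from itertools import combinations

# made-up market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(minsup):
    """All itemsets with support >= minsup, found by exhaustive search."""
    items = sorted(set().union(*transactions))
    freq = []
    for k in range(1, len(items) + 1):
        level = [frozenset(c) for c in combinations(items, k)
                 if support(frozenset(c)) >= minsup]
        if not level:          # no frequent k-itemsets => none larger, stop
            break
        freq.extend(level)
    return freq

def strong_rules(minsup, minconf):
    """Rules X -> Y from frequent itemsets with confidence >= minconf."""
    rules = []
    for fs in frequent_itemsets(minsup):
        for r in range(1, len(fs)):
            for lhs in map(frozenset, combinations(fs, r)):
                conf = support(fs) / support(lhs)   # conf(X->Y) = s(X∪Y)/s(X)
                if conf >= minconf:
                    rules.append((set(lhs), set(fs - lhs), conf))
    return rules

rules = strong_rules(minsup=0.6, minconf=0.8)
```

With these thresholds only one strong rule survives, {beer} → {diapers} with confidence 1.0: beer appears in three transactions and diapers appears in all three of them.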