Midterm Review: CIS 563 – Intro to data science
Sagnik Basumallik
Some Important Points from the slides:
Types of Clusterings: hierarchical and partitional
Clustering Algorithms: K‐means and its variants, Hierarchical clustering, Density‐based clustering
k‐means complexity: O(n × K × I × d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes
Problems with Selecting Initial Points
Solutions to the Initial Centroids Problem:
• Multiple runs (helps, but probability is not on your side)
• Sample and use hierarchical clustering to determine initial centroids
• Select more than k initial centroids, then select among these initial centroids (e.g., the most widely separated ones)
• Postprocessing
• Bisecting K‐means (not as susceptible to initialization issues)
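The "multiple runs" remedy can be sketched in code: run Lloyd's algorithm several times from random initial centroids and keep the run with the lowest SSE. This is an illustrative pure-Python sketch (not the course's implementation); the sample points are made up.

```python
import random

def kmeans(points, k, iters=100, rng=None):
    """One run of Lloyd's algorithm; returns (centroids, sse)."""
    rng = rng or random.Random()
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        # update step: move each centroid to the mean of its cluster
        new = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    sse = sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
              for p in points)
    return centroids, sse

def kmeans_multiple_runs(points, k, runs=20, seed=0):
    """Mitigate bad initial centroids by keeping the run with the lowest SSE."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng=rng) for _ in range(runs)),
               key=lambda r: r[1])

# made-up sample points: three well-separated pairs
pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (10.0, 0.0), (9.8, 0.3)]
centroids, sse = kmeans_multiple_runs(pts, k=3)
```

With a single run, a bad draw of initial centroids can merge two natural clusters; taking the minimum SSE over many runs makes that increasingly unlikely, though, as the slides note, probability is not on your side for large k.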
What is postprocessing?
Evaluating K‐means Clusters
K‐means has problems when clusters are of differing:
• Sizes
• Densities
• Non‐globular shapes
K‐means also has problems when the data contains outliers.
How can these problems be overcome? One solution is to use many clusters: each cluster then captures part of a natural cluster, but the parts need to be put back together afterwards.
Types of Hierarchical Clustering: Agglomerative and Divisive
Algorithms for each clustering process
How to Define Inter‐Cluster Similarity: pros and cons
MIN (single link)
• Pros: can handle non‐elliptical shapes
• Cons: sensitive to noise and outliers
MAX (complete link)
• Pros: less susceptible to noise and outliers
• Cons: tends to break large clusters; biased towards globular clusters
Group Average
• Pros: less susceptible to noise and outliers
• Cons: biased towards globular clusters
Distance Between Centroids
• Pros: less susceptible to noise and outliers
• Cons: biased towards globular clusters
Ward's Method
• Pros: less susceptible to noise and outliers
• Cons: biased towards globular clusters
The SSE objective referenced by Ward's Method:
SSE = Σ_{i=1..K} Σ_{x ∈ C_i} dist(m_i, x)², where m_i is the centroid of cluster C_i
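The SSE formula translates directly into code; a minimal sketch with made-up clusters and centroids:

```python
def sse(clusters, centroids):
    """SSE = sum over clusters i, sum over x in C_i, of dist(m_i, x)^2."""
    total = 0.0
    for cluster, m in zip(clusters, centroids):
        for x in cluster:
            total += sum((xj - mj) ** 2 for xj, mj in zip(x, m))
    return total

# made-up example: two clusters with their centroids
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]
centroids = [(1.0, 0.0), (5.0, 5.0)]
# each point in the first cluster is distance 1 from its centroid
print(sse(clusters, centroids))  # → 2.0
```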
Hierarchical Clustering: Problems and Limitations:
• Once a decision is made to combine two clusters, it cannot be undone
• No objective function is directly minimized
• Different schemes have problems with one or more of the following: sensitivity to noise and outliers; difficulty handling different‐sized clusters and convex shapes; breaking large clusters
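For concreteness, the agglomerative process and the MIN / MAX / group-average similarity definitions can be sketched naively in pure Python (O(n³), suitable only for tiny examples; the points below are made up):

```python
from itertools import combinations

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def linkage_distance(c1, c2, scheme):
    """Inter-cluster distance under the chosen similarity definition."""
    d = [euclidean(p, q) for p in c1 for q in c2]
    if scheme == "min":        # single link: closest pair of points
        return min(d)
    if scheme == "max":        # complete link: farthest pair of points
        return max(d)
    return sum(d) / len(d)     # group average: mean pairwise distance

def agglomerative(points, k, scheme="min"):
    """Repeatedly merge the two closest clusters until only k remain.
    Note: a merge, once made, is never undone (the limitation above)."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage_distance(clusters[ij[0]],
                                                   clusters[ij[1]], scheme))
        clusters[i] += clusters.pop(j)   # j > i, so pop(j) is safe
    return clusters

# made-up 1-D points forming two obvious groups
pts = [(0.0,), (0.5,), (9.0,), (9.5,)]
print(sorted(len(c) for c in agglomerative(pts, k=2, scheme="min")))  # → [2, 2]
```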
Density‐based clustering: DBSCAN
Strengths:
• Resistant to noise
• Can handle clusters of different shapes and sizes
Does not work well with:
• Varying densities
• High‐dimensional data
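A minimal sketch of DBSCAN's core logic (naive O(n²) neighbor search; the eps and min_pts values and the sample points are made up for illustration):

```python
def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including itself)."""
    return [j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[i], q)) ** 0.5 <= eps]

def dbscan(points, eps, min_pts):
    """Naive DBSCAN; returns one label per point (-1 = noise)."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:          # not a core point
            labels[i] = -1                    # noise (may later become border)
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:               # noise reachable from a core point
                labels[j] = cluster           # ...becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = region_query(points, j, eps)
            if len(nb) >= min_pts:            # j is also a core point: expand
                seeds.extend(nb)
    return labels

# made-up points: two dense runs plus one isolated noise point
pts = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0),
       (5.0, 5.0), (5.1, 5.0), (5.2, 5.0),
       (20.0, 20.0)]
labels = dbscan(pts, eps=0.3, min_pts=3)
```

The resistance to noise comes from the `-1` label: points with too few neighbors are never forced into a cluster, unlike in K-means.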
Measures of Cluster Validity
1. Internal Measures:
a. Cohesion and Separation
b. Silhouette Coefficient: combines ideas of both cohesion and separation, but for individual
points, as well as clusters and clusterings
c. Correlation
2. External measures
a. Entropy
b. Purity
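The external measures above can be computed directly when each cluster's true class labels are known; a minimal sketch with made-up labels:

```python
from collections import Counter
from math import log2

def purity(clusters):
    """Each cluster is a list of true class labels; purity is the fraction
    of points that belong to the majority class of their cluster."""
    n = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / n

def entropy(clusters):
    """Size-weighted average of per-cluster class entropy (lower is better)."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        h = -sum((v / len(c)) * log2(v / len(c)) for v in Counter(c).values())
        total += len(c) / n * h
    return total

# made-up labels: the second cluster is pure, the first is mostly "a"
clusters = [["a", "a", "a", "b"], ["b", "b", "b", "b"]]
print(purity(clusters))  # → 0.875
```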
SSE: the sum of squared error (defined above) can also serve as an internal measure of cluster cohesion; lower SSE means tighter clusters.
Time Series
1. SAX (Symbolic Aggregate approXimation)
2. DTW (Dynamic Time Warping)
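DTW can be sketched with its standard dynamic program; this is an illustrative implementation, not the course's code, using |s[i] − t[j]| as the local cost:

```python
def dtw(s, t):
    """Dynamic time warping distance between two 1-D sequences."""
    n, m = len(s), len(t)
    inf = float("inf")
    # D[i][j] = cost of best warping path aligning s[:i] with t[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            # extend the cheapest of: match, insertion, deletion
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]

# identical shapes shifted in time still align perfectly
print(dtw([1, 2, 3, 3], [1, 1, 2, 3]))  # → 0.0
```

Unlike pointwise Euclidean distance, DTW allows one-to-many alignments, which is why the shifted sequences above have distance zero.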
Association Rule Mining:
1. find frequent itemsets
2. Hash trees
Given d items:
Total number of itemsets = 2^d
Total number of possible association rules: R = 3^d − 2^(d+1) + 1
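These counts as code (the closed-form rule count follows because each item can be in the antecedent, in the consequent, or absent, minus the cases where either side is empty):

```python
def itemset_count(d):
    """Every subset of d items, including the empty set."""
    return 2 ** d

def rule_count(d):
    """R = 3^d - 2^(d+1) + 1 possible association rules over d items."""
    return 3 ** d - 2 ** (d + 1) + 1

print(itemset_count(6), rule_count(6))  # → 64 602
```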
Subset Operation: Given a transaction t, what are the possible subsets of size 3?
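The subsets of size 3 that the subset operation enumerates (e.g., when matching a transaction against candidate 3-itemsets in a hash tree) can be generated with itertools.combinations; the transaction below is made up:

```python
from itertools import combinations
from math import comb

t = [1, 2, 3, 5, 6]                      # a made-up transaction of 5 items
subsets = list(combinations(t, 3))       # all size-3 subsets, in sorted order
print(len(subsets), comb(5, 3))          # → 10 10
```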
Support and Confidence
Example 1: Find frequent itemsets and the strong association rules
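A brute-force worked sketch of this kind of exercise (enumerating all itemsets rather than using Apriori's level-wise pruning), with a made-up transaction table and made-up minsup/minconf thresholds:

```python
from itertools import combinations

# made-up market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(minsup):
    """All itemsets with support >= minsup, found by exhaustive search."""
    items = sorted(set().union(*transactions))
    freq = []
    for k in range(1, len(items) + 1):
        level = [frozenset(c) for c in combinations(items, k)
                 if support(frozenset(c)) >= minsup]
        if not level:          # no frequent k-itemsets => none larger, stop
            break
        freq.extend(level)
    return freq

def strong_rules(minsup, minconf):
    """Rules X -> Y from frequent itemsets with confidence >= minconf."""
    rules = []
    for fs in frequent_itemsets(minsup):
        for r in range(1, len(fs)):
            for lhs in map(frozenset, combinations(fs, r)):
                conf = support(fs) / support(lhs)   # conf(X->Y) = s(X∪Y)/s(X)
                if conf >= minconf:
                    rules.append((set(lhs), set(fs - lhs), conf))
    return rules

rules = strong_rules(minsup=0.6, minconf=0.8)
```

With these thresholds only one strong rule survives, {beer} → {diapers} with confidence 1.0: beer appears in three transactions and diapers appears in all three of them.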