CS 461: Machine Learning, Lecture 7 (Winter 2009)
Dr. Kiri Wagstaff (wkiri@wkiri.com)
Transcript
Page 1:

CS 461: Machine Learning, Lecture 7

Dr. Kiri Wagstaff (wkiri@wkiri.com)

Page 2:

Plan for Today

Unsupervised Learning
K-means Clustering
EM Clustering

Homework 4

Page 3:

Review from Lecture 6

Parametric methods: the data comes from a known distribution (Bernoulli, Gaussian) and we estimate its parameters. How good is a parameter estimate? (bias, variance)

Bayes estimation. ML: use the data (assume equal priors). MAP: use the prior and the data.

Parametric classification: maximize the posterior probability.

Page 4:

Clustering

Chapter 7

Page 5:

Unsupervised Learning

The data has no labels! What can we still learn?

Salient groups in the data; density in feature space

Key approach: clustering … but also:

Association rules, density estimation, principal components analysis (PCA)

Page 6:

Clustering

Group items by similarity

Density estimation, cluster models

Page 7:

Applications of Clustering

Image Segmentation

[Ma and Manjunath, 2004]

Data Mining: Targeted marketing

Remote Sensing: Land cover types

Text Analysis

[Selim Aksoy]

Page 8:

Applications of Clustering

Text Analysis: Noun Phrase Coreference

Input text: "John Simon, Chief Financial Officer of Prime Corp. since 1986, saw his pay jump 20%, to $1.3 million, as the 37-year-old also became the financial-services company’s president."

Cluster JS: John Simon, Chief Financial Officer, his, the 37-year-old, president
Cluster PC: Prime Corp., the financial-services company
Singletons: 1986, pay, 20%, $1.3 million

Page 9:

[ Andrew Moore]

Sometimes easy

Sometimes impossible

and sometimes in between

Page 10:

K-means
1. Ask user how many clusters they’d like (e.g. k=5)

[ Andrew Moore]

Page 11:

K-means
1. Ask user how many clusters they’d like (e.g. k=5)

2. Randomly guess k cluster Center locations

[ Andrew Moore]

Page 12:

K-means
1. Ask user how many clusters they’d like (e.g. k=5)

2. Randomly guess k cluster Center locations

3. Each datapoint finds out which Center it’s closest to. (Thus each Center “owns” a set of datapoints)

[ Andrew Moore]

Page 13:

K-means
1. Ask user how many clusters they’d like (e.g. k=5)

2. Randomly guess k cluster Center locations

3. Each datapoint finds out which Center it’s closest to.

4. Each Center finds the centroid of the points it owns

[ Andrew Moore]

Page 14:

K-means
1. Ask user how many clusters they’d like (e.g. k=5)

2. Randomly guess k cluster Center locations

3. Each datapoint finds out which Center it’s closest to.

4. Each Center finds the centroid of the points it owns…

5. …and jumps there

6. …Repeat until terminated!

[ Andrew Moore]

Page 15:

K-means Start: k=5

Example generated by Dan Pelleg’s super-duper fast K-means system:

Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases 1999 (KDD99). Available at www.autonlab.org/pap.html

[ Andrew Moore]

Page 16:

K-means continues…

[ Andrew Moore]

Page 17:

K-means continues…

[ Andrew Moore]

Page 18:

K-means continues…

[ Andrew Moore]

Page 19:

K-means continues…

[ Andrew Moore]

Page 20:

K-means continues…

[ Andrew Moore]

Page 21:

K-means continues…

[ Andrew Moore]

Page 22:

K-means continues…

[ Andrew Moore]

Page 23:

K-means continues…

[ Andrew Moore]

Page 24:

K-means terminates

[ Andrew Moore]

Page 25:

K-means Algorithm

1. Randomly select k cluster centers

2. While (points change membership):
   1. Assign each point to its closest cluster (use your favorite distance metric)
   2. Update each center to be the mean of its items

Objective function: Variance

K-means applet

V = Σ_{c=1..k} Σ_{x_j ∈ C_c} dist(x_j, μ_c)²
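To make the algorithm above concrete, here is a minimal NumPy sketch (an illustration, not the course's reference implementation). It uses Euclidean distance as the "favorite distance metric" and returns the centers, the point assignments, and the variance V defined above.

```python
import numpy as np

def kmeans(X, k, max_iters=100, rng=None):
    """Plain k-means on an (n, d) array X; returns centers, labels, and variance V."""
    rng = rng if rng is not None else np.random.default_rng(0)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # 1. random initial centers
    labels = None
    for _ in range(max_iters):
        # 2.1 assign each point to its closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                   # no point changed membership: done
        labels = new_labels
        # 2.2 update each center to be the mean of the points it owns
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    variance = sum(((X[labels == c] - centers[c]) ** 2).sum() for c in range(k))
    return centers, labels, variance
```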

Page 26:

K-means Algorithm: Example

1. Randomly select k cluster centers

2. While (points change membership):
   1. Assign each point to its closest cluster (use your favorite distance metric)
   2. Update each center to be the mean of its items

Objective function: Variance

Data: [1, 15, 4, 2, 17, 10, 6, 18]

V = Σ_{c=1..k} Σ_{x_j ∈ C_c} dist(x_j, μ_c)²
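A hand run on this data converges quickly. With k = 2 and initial centers at 1 and 15 (one possible random draw), the assignment step groups {1, 2, 4, 6} with the first center and {10, 15, 17, 18} with the second; the update step moves the centers to 3.25 and 15, after which no point changes membership, giving V = 14.75 + 38 = 52.75. The same run with the Page 25 sketch (a hypothetical usage example; the final clusters depend on the random initialization):

```python
import numpy as np

X = np.array([1, 15, 4, 2, 17, 10, 6, 18], dtype=float).reshape(-1, 1)
centers, labels, V = kmeans(X, k=2)       # kmeans() from the Page 25 sketch
print(sorted(centers.ravel()), V)         # one likely outcome: [3.25, 15.0] and V = 52.75
```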

Page 27:

K-means for Compression

Original image (159 KB) vs. clustered with k=4 (53 KB)
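One way to reproduce this effect (a sketch assuming Pillow and scikit-learn are available; "input.jpg" and "quantized.png" are hypothetical file names, and the slide's exact pipeline is not specified) is to cluster the pixel colors with k = 4 and replace every pixel by its cluster center:

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("input.jpg").convert("RGB"), dtype=float)
pixels = img.reshape(-1, 3)                          # one row per pixel (R, G, B)

km = KMeans(n_clusters=4, n_init=10).fit(pixels)     # k=4, as on the slide
quantized = km.cluster_centers_[km.labels_]          # replace each pixel by its cluster center
out = quantized.reshape(img.shape).astype(np.uint8)
Image.fromarray(out).save("quantized.png")           # only 4 distinct colors: a smaller file
```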

Page 28:

Issue 1: Local Optima

K-means is greedy! Converging to a non-global optimum:

[Example from Andrew Moore]

Page 29:

Issue 1: Local Optima

K-means is greedy! Converging to a non-global optimum:

[Example from Andrew Moore]

Page 30:

Issue 2: How long will it take?

We don’t know! K-means is O(nkdI)

d = # features (dimensionality), I = # iterations (and n = # points, k = # clusters)

The number of iterations depends on the random initialization. A "good" init needs few iterations; a "bad" init needs many. How can we tell the difference before clustering?

We can't. Use heuristics to guess a "good" init; one simple option is sketched below.

Page 31:

Issue 3: How many clusters?

The “Holy Grail” of clustering

Page 32:

Issue 3: How many clusters?

Select the k that gives the partition with the least variance? (No: V only decreases as k grows; see the sketch below.)

Best k depends on the user’s goal

[Dhande and Fiore, 2002]
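Why least variance alone cannot choose k: V can only shrink as k grows, reaching 0 when every point is its own center (k = n). A quick check with the Page 25 sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))            # made-up data, purely for illustration
for k in (1, 2, 5, 10, 30):
    _, _, V = kmeans(X, k, rng=rng)     # kmeans() from the Page 25 sketch
    print(k, round(V, 2))               # V shrinks toward 0 as k approaches n
```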

Page 33:

Issue 4: How good is the result?

Rand Index:
A = # pairs in the same cluster in both partitions
B = # pairs in different clusters in both partitions
Rand = (A + B) / total number of pairs

[Figure: the same ten items (1-10) grouped two different ways, compared pair by pair]

Rand = (5 + 26) / 45
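A minimal Python sketch of this measure, given two labelings of the same items (a direct translation of the definition above):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree (together in both, or apart in both)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )   # agree = A + B
    return agree / len(pairs)
```

For 10 items there are C(10, 2) = 45 pairs, which is the denominator in the example above.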

Page 34:

K-means: Parametric or Non-parametric?

Cluster models: means. Data models?

All clusters are spherical; distance in any direction is the same; a cluster may be arbitrarily "big" to include outliers.

Page 35:

EM Clustering

Parametric solution: model the data distribution

Each cluster: Gaussian model. Data: "mixture of models"

Hidden value z^t is the cluster of item t. E-step: estimate cluster memberships

M-step: maximize likelihood (clusters, params)

N(μ, σ)

L(μ, σ | X) = P(X | μ, σ)

E[z_i^t | X, μ, σ] = p(x^t | C_i, μ_i, σ_i) P(C_i) / Σ_j p(x^t | C_j, μ_j, σ_j) P(C_j)
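Below is a minimal NumPy sketch of these two steps for a one-dimensional mixture with scalar σ_i (an illustration only; the formulas above allow general Gaussians). The E-step fills in the responsibilities E[z_i^t | x^t, μ, σ], and the M-step re-estimates μ_i, σ_i, and P(C_i) from the responsibility-weighted data.

```python
import numpy as np

def em_gmm_1d(x, k, iters=50, seed=0):
    """EM for a 1-D Gaussian mixture: returns means, std devs, priors, responsibilities."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False).astype(float)   # initial means
    sigma = np.full(k, x.std())                                # initial std devs
    pi = np.full(k, 1.0 / k)                                   # mixing proportions P(C_i)
    for _ in range(iters):
        # E-step: responsibility of each component for each point, E[z_i^t | x^t]
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        resp = dens * pi
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the weighted data
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.maximum(np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk), 1e-6)
        pi = nk / len(x)
    return mu, sigma, pi, resp
```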

Page 36:

The GMM assumption

• There are k components. The i’th component is called ωi

• Component i has an associated mean vector μi

[Figure: Gaussian component means μ1, μ2, μ3]

[ Andrew Moore]

Page 37:

The GMM assumption

• There are k components. The i’th component is called ωi

• Component i has an associated mean vector μi

• Each component generates data from a Gaussian with mean μi and covariance matrix σ²I

Assume that each datapoint is generated according to the following recipe:

[Figure: Gaussian component means μ1, μ2, μ3]

[ Andrew Moore]

Page 38:

The GMM assumption

• There are k components. The i’th component is called ωi

• Component i has an associated mean vector μi

• Each component generates data from a Gaussian with mean μi and covariance matrix σ²I

Assume that each datapoint is generated according to the following recipe:

• Pick a component at random. Choose component i with probability P(i).

[Figure: component μ2 is picked]

[ Andrew Moore]

Page 39:

The GMM assumption

• There are k components. The i’th component is called ωi

• Component i has an associated mean vector μi

• Each component generates data from a Gaussian with mean μi and covariance matrix σ²I

Assume that each datapoint is generated according to the following recipe:

• Pick a component at random. Choose component i with probability P(i).

• Datapoint ~ N(μi, σ²I)

[Figure: datapoint x drawn from the component with mean μ2]

[ Andrew Moore]

Page 40:

The General GMM assumption

[Figure: Gaussian components with means μ1, μ2, μ3]

• There are k components. The i’th component is called ωi

• Component i has an associated mean vector μi

• Each component generates data from a Gaussian with mean μi and covariance matrix Σi

Assume that each datapoint is generated according to the following recipe:

• Pick a component at random. Choose component i with probability P(i).

• Datapoint ~ N(μi, Σi)

[ Andrew Moore]
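The generative recipe on these slides translates directly into a short NumPy sketch; the mixing probabilities P(i), means μi, and covariances Σi below are made-up illustration values, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])                       # P(i) for k = 3 components
mus = [np.array([0.0, 0.0]),
       np.array([4.0, 4.0]),
       np.array([-3.0, 5.0])]                        # mean vectors mu_i
covs = [np.eye(2), 0.5 * np.eye(2),
        np.array([[1.0, 0.8], [0.8, 1.0]])]          # covariance matrices Sigma_i

def sample(n):
    comps = rng.choice(len(pi), size=n, p=pi)        # pick a component at random with prob. P(i)
    return np.array([rng.multivariate_normal(mus[c], covs[c]) for c in comps])

X = sample(500)                                      # 500 synthetic datapoints
```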

Page 41:

EM in action

http://www.the-wabe.com/notebook/em-algorithm.html

Page 42:

Gaussian Mixture Example: Start

[ Andrew Moore]

Page 43:

After first iteration

[ Andrew Moore]

Page 44:

After 2nd iteration

[ Andrew Moore]

Page 45:

After 3rd iteration

[ Andrew Moore]

Page 46:

After 4th iteration

[ Andrew Moore]

Page 47:

After 5th iteration

[ Andrew Moore]

Page 48:

After 6th iteration

[ Andrew Moore]

Page 49:

After 20th iteration

[ Andrew Moore]

Page 50:

EM Benefits

Model actual data distribution, not just centers

Get probability of membership in each cluster, not just distance

Clusters do not need to be “round”
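In the EM sketch under Page 35, the returned resp array holds exactly these membership probabilities: resp[t, i] is E[z_i^t | x^t], the probability that item t belongs to cluster i.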

Page 51:

EM Issues?

Local optima. How long will it take? How many clusters? Evaluation.

Page 52:

Summary: Key Points for Today

Unsupervised Learning: why? how?

K-means Clustering: iterative, sensitive to initialization, non-parametric, local optimum, Rand Index

EM Clustering: iterative, sensitive to initialization, parametric, local optimum

Page 53:

Next Time

Clustering. Reading: Alpaydin Ch. 7.1-7.4, 7.8. Reading questions: Gavin, Ronald, Matthew.

Next time: Reinforcement learning – Robots!