Unsupervised Learning: Gaussian Mixture Models and Expectation-Maximization (EM)


Feb 22, 2016




Transcript
Page 1: Unsupervised Learning

Unsupervised Learning

Gaussian Mixture ModelsExpectation-Maximization (EM)

Page 2: Unsupervised Learning

Gaussian Mixture Models

[Figure: data points plotted on axes X1 and X2, with level-set ellipses drawn around each cluster center.]

Like K-Means, GMM clusters have centers.

In addition, they have probability distributions that indicate the probability that a point belongs to the cluster.

These ellipses show “level sets”: lines with equal probability of belonging to the cluster.

Notice that green points still have SOME probability of belonging to the blue cluster, but it’s much lower than the blue points.

This is a more complex model than K-Means: distance from the center can matter more in one direction than another.

Page 3: Unsupervised Learning

GMMs and EM

Gaussian Mixture Models (GMMs) are a type of model, similar to a Naïve Bayes model but with important differences.

Expectation-Maximization (EM) is a parameter-estimation algorithm for training GMMs using unlabeled data.

To explain these further, we first need to review Gaussian (normal) distributions.

Page 4: Unsupervised Learning

The Normal (aka Gaussian) Distribution

[Figure: the bell-shaped Gaussian density f, centered at the mean μ, with variance σ² and standard deviation σ.]
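For reference, the density sketched on this slide is the standard univariate Gaussian pdf, written out explicitly (the same form reappears inside the GMM likelihood on Page 18):

f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)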

Page 5: Unsupervised Learning

Quiz: MLE for Gaussians

Based on your statistics knowledge,
1) What is the MLE for μ from a bunch of example X points?
2) What is the MLE for σ from a bunch of example X points?

Page 6: Unsupervised Learning

Answer: MLE for Gaussians

Based on your statistics knowledge,
1) What is the MLE for μ from a bunch of example X points?

\mu = \frac{1}{M} \sum_i X_i \quad \text{(the average of the X values)}

2) What is the MLE for σ from a bunch of example X points?

\sigma^2 = \frac{1}{M} \sum_i (X_i - \mu)^2 \quad \text{(the average squared deviation from the mean)}

Note: this is the so-called "biased" estimator for \sigma^2; there is also an "unbiased" estimator which basically just uses (M − 1) instead of M. We'll stick with the "biased" one here, but either one is fine.

Page 7: Unsupervised Learning

Quiz: Deriving the ML estimators

How would you derive the MLE equations for Gaussian distributions?

Page 8: Unsupervised Learning

Answer: Deriving the ML estimators

How would you derive the MLE equations for Gaussian distributions?

Same plan of attack as for MLE estimates of Bayes Nets:
1. Write down the Likelihood function P(D | M).
2. Make the assumption that each data point X_i is independently distributed, so P(D | M) = \prod_i P(X_i | M).
3. Take the log.
4. Take the partial derivative with respect to μ, set this equal to zero, and solve for μ.
5. Take the partial derivative with respect to σ, set this equal to zero, and solve for σ.
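Carrying those five steps out for a single Gaussian gives the estimators quoted on the previous slide; a condensed version of the standard derivation, included here for completeness:

\log P(D \mid M) = \sum_{i=1}^{M} \left[ -\log\!\left(\sigma\sqrt{2\pi}\right) - \frac{(X_i - \mu)^2}{2\sigma^2} \right]

\frac{\partial}{\partial \mu}: \; \sum_i \frac{X_i - \mu}{\sigma^2} = 0 \;\Rightarrow\; \mu = \frac{1}{M}\sum_i X_i

\frac{\partial}{\partial \sigma}: \; \sum_i \left[ -\frac{1}{\sigma} + \frac{(X_i - \mu)^2}{\sigma^3} \right] = 0 \;\Rightarrow\; \sigma^2 = \frac{1}{M}\sum_i (X_i - \mu)^2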

Page 9: Unsupervised Learning

Quiz: Estimating a Gaussian

On the left is a dataset with the following X values: 0, 3, 4, 5, 6, 7, 10.

Find the maximum likelihood Gaussian distribution.

\mu = \frac{1}{M} \sum_i X_i

\sigma^2 = \frac{1}{M} \sum_i (X_i - \mu)^2

Page 10: Unsupervised Learning

Answer: Estimating a Gaussian

On the left is a dataset with the following X values: 0, 3, 4, 5, 6, 7, 10.

\mu = \frac{1}{M} \sum_i X_i = \frac{1}{7}(0 + 3 + 4 + 5 + 6 + 7 + 10) = 5

\sigma^2 = \frac{1}{M} \sum_i (X_i - \mu)^2 = \frac{1}{7}\left( (0-5)^2 + (3-5)^2 + (4-5)^2 + (5-5)^2 + (6-5)^2 + (7-5)^2 + (10-5)^2 \right) = \frac{1}{7}\left( 5^2 + 2^2 + 1^2 + 0^2 + 1^2 + 2^2 + 5^2 \right) = \frac{1}{7}(25 + 4 + 1 + 0 + 1 + 4 + 25) = \frac{60}{7}

[Figure: the maximum likelihood Gaussian density f fitted to these points.]
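A quick numerical check of this worked example (plain Python, no external libraries; the variable names are only for illustration):

xs = [0, 3, 4, 5, 6, 7, 10]
M = len(xs)

mu = sum(xs) / M                               # MLE mean
sigma_sq = sum((x - mu) ** 2 for x in xs) / M  # MLE (biased) variance

print(mu)        # 5.0
print(sigma_sq)  # 8.5714... = 60/7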

Page 11: Unsupervised Learning

Clustering by fitting K Gaussians

Suppose our dataset looks like the one above.

It doesn’t really look Gaussian anymore; it looks like it has 3 clusters.

Fitting a single Gaussian to this data will still give you an estimate.

But that Gaussian will have a low Likelihood value: it will give very low probability to the leftmost and rightmost clusters.


Page 12: Unsupervised Learning

Clustering by fitting K Gaussians

What we’d like to do instead is to fit K Gaussians.

A model for data that involves multiple Gaussian distributions is called a Gaussian Mixture Model (GMM).


Page 13: Unsupervised Learning

Clustering by fitting K Gaussians

Another way of drawing these is with “Level sets”:

Curves that show points with equal probability for each Gaussian.

Wider curves have lower probability than narrower curves.

Notice that each point is contained within every Gaussian, but is most tightly bound to the closest Gaussian.

[Figure: level sets of three Gaussians fitted to the data, with means μ_red, μ_blue, and μ_green.]

Page 14: Unsupervised Learning

Expectation-Maximization (EM)

EM is “K-Means for GMMs”.

It is a parameter estimation algorithm for GMMs that will determine a (locally-optimal) setting for all of the GMM parameters, using a bunch of unlabeled X points.

Input:
1. Data points X_1, …, X_M
2. A number K

Output: \mu_1, \sigma_1, …, \mu_K, \sigma_K, such that the GMM with those means and standard deviations has a locally-maximum likelihood for the training data set.

Page 15: Unsupervised Learning

Visualization of EM

1. Initialize the mean and standard deviation of each Gaussian randomly.

2. Repeat until convergence:
– Expectation: For each point X and each Gaussian k, find f(X | Gaussian k)

Page 16: Unsupervised Learning

Visualization of EM

1. Initialize the mean and standard deviation of each Gaussian randomly.

2. Repeat until convergence:
– Expectation: For each point X and each Gaussian k, find f(X | Gaussian k)
– Maximization: Estimate new parameters for each Gaussian. (Technically, you also need to estimate a third parameter, called π_k. More later.)

Page 18: Unsupervised Learning

Gaussian Mixture Model

K Gaussian distributions with parameters (\mu_1, \sigma_1) through (\mu_K, \sigma_K).

It also involves K additional parameters, called prior probabilities, \pi_1 through \pi_K. These describe the relative importance of each of the K Gaussian distributions in the full model.

The likelihood equation for this model looks like this:

f(X_1, \ldots, X_M \mid GMM) = \prod_i f(X_i \mid GMM) \quad \text{(i.i.d. assumption)}

f(X_i \mid GMM) = \sum_{k=1}^{K} \underbrace{\pi_k}_{\text{Prior}} \; \underbrace{\frac{1}{\sigma_k \sqrt{2\pi}} \exp\!\left( -\frac{(X_i - \mu_k)^2}{2\sigma_k^2} \right)}_{\text{Gaussian}}
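These two equations translate almost line-for-line into code; a minimal sketch (function and parameter names are illustrative, with sigmas holding standard deviations):

import math

def gmm_density(x, pis, mus, sigmas):
    # f(x | GMM): a prior-weighted sum of K univariate Gaussian densities
    return sum(
        pi / (sigma * math.sqrt(2 * math.pi))
        * math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
        for pi, mu, sigma in zip(pis, mus, sigmas)
    )

def gmm_log_likelihood(xs, pis, mus, sigmas):
    # log f(X_1, ..., X_M | GMM), using the i.i.d. assumption
    return sum(math.log(gmm_density(x, pis, mus, sigmas)) for x in xs)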

Page 19: Unsupervised Learning

GMMs as Bayes Nets

GMMs are simple Bayes Nets. Two differences from previous BNs we’ve seen:

1. We’re used to binary variables in BNs. Here, the “Cluster” variable has K possible values (1, 2, …, K) instead of just two (+cluster and –cluster). We used to store P(+a) and P(-a) for the parent variable; now we store \pi_1 through \pi_K.

2. The “X” variable has infinitely many values (any real number) instead of just two (+x and –x). We used to store P(+x | +a) and P(+x | -a). Now we store (\mu_1, \sigma_1) through (\mu_K, \sigma_K), and we say f(X \mid \text{Cluster is } j) = \frac{1}{\sigma_j \sqrt{2\pi}} \exp\!\left( -\frac{(X - \mu_j)^2}{2\sigma_j^2} \right).

[Diagram: a two-node Bayes Net, Cluster (1, 2, …, K) → X (a real number).]
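One way to see this two-node structure is ancestral sampling from the Bayes Net: first draw the Cluster value from the priors, then draw X from that cluster’s Gaussian. A sketch with illustrative names:

import random

def sample_from_gmm(pis, mus, sigmas):
    # Parent node: Cluster ~ Categorical(pi_1, ..., pi_K)
    k = random.choices(range(len(pis)), weights=pis)[0]
    # Child node: X | Cluster = k  ~  Normal(mu_k, sigma_k)
    x = random.gauss(mus[k], sigmas[k])
    return k, x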

Page 20: Unsupervised Learning

Formal Description of the Algorithm

1. Init: For each k in {1, …, K}, create a random \pi_k, \mu_k, \sigma_k^2.

2. Repeat until all \pi_k, \mu_k, \sigma_k^2 remain the same from one iteration to the next:

Expectation (aka Assignment in K-Means): For each X_i, for each k, let

C[X_i, k] = \frac{\pi_k \, f(X_i \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \pi_j \, f(X_i \mid \mu_j, \sigma_j^2)}

Maximization (aka Update in K-Means): For each k,

\pi_k = \frac{1}{M} \sum_i C[X_i, k], \qquad \mu_k = \frac{\sum_i C[X_i, k] \, X_i}{\sum_i C[X_i, k]}, \qquad \sigma_k^2 = \frac{\sum_i C[X_i, k] \, (X_i - \mu_k)^2}{\sum_i C[X_i, k]}

3. Return (for all values of k) \pi_k, \mu_k, \sigma_k^2.
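A compact, runnable sketch of this algorithm for 1-D data. Assumptions not taken from the slides: means are initialized from randomly chosen data points and variances from the overall data variance (rather than fully random values), a fixed iteration cap stands in for the exact convergence test, and a small floor on the variances avoids division by zero.

import math
import random

def gaussian_pdf(x, mu, var):
    # Univariate Gaussian density with mean mu and variance var
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(xs, K, iters=100, seed=0):
    rng = random.Random(seed)
    M = len(xs)
    overall_mean = sum(xs) / M
    overall_var = max(1e-6, sum((x - overall_mean) ** 2 for x in xs) / M)

    # 1. Init: pi_k uniform, mu_k from distinct data points, sigma^2_k from the data
    pis = [1.0 / K] * K
    mus = rng.sample(xs, K)
    variances = [overall_var] * K

    for _ in range(iters):
        # Expectation (Assignment): responsibilities C[i][k] proportional to pi_k * f(x_i | k)
        C = []
        for x in xs:
            weights = [pis[k] * gaussian_pdf(x, mus[k], variances[k]) for k in range(K)]
            total = sum(weights) or 1e-300
            C.append([w / total for w in weights])

        # Maximization (Update): re-estimate pi_k, mu_k, sigma^2_k from the weighted points
        for k in range(K):
            Nk = sum(C[i][k] for i in range(M)) or 1e-300
            pis[k] = Nk / M
            mus[k] = sum(C[i][k] * xs[i] for i in range(M)) / Nk
            variances[k] = max(1e-6, sum(C[i][k] * (xs[i] - mus[k]) ** 2 for i in range(M)) / Nk)

    # 3. Return pi_k, mu_k, sigma^2_k for all k
    return pis, mus, variances

For the three-cluster dataset of Page 11, one would call em_gmm(xs, K=3); since EM only finds a local optimum (Page 22), in practice it is often restarted from several random seeds and the highest-likelihood result kept.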

Page 21: Unsupervised Learning

Evaluation metric for GMMs and EM

LOSS Function (or Objective function) for EM:

EM (locally) maximizes the “Marginal” Likelihood:

EM(X_1, \ldots, X_M) = \arg\max_{\pi_1, \mu_1, \sigma_1, \ldots, \pi_K, \mu_K, \sigma_K} f(X_1, \ldots, X_M \mid \pi_1, \mu_1, \sigma_1, \ldots, \pi_K, \mu_K, \sigma_K)

Notice that this is the Likelihood function for just the X variable in our Bayes Net, rather than the Likelihood for (X and Cluster), which is why it is called “marginal likelihood” rather than just “likelihood”.

Page 22: Unsupervised Learning

Analysis of EM Performance

EM is guaranteed to find a local optimum of the Likelihood function.

Theorem: After one iteration of EM, the Likelihood of the new GMM >= the Likelihood of the previous GMM.

(Dempster, A.P.; Laird, N.M.; Rubin, D.B. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm”. Journal of the Royal Statistical Society, Series B (Methodological) 39 (1): 1–38. JSTOR 2984875.)

Page 23: Unsupervised Learning

EM Generality

Even though EM was originally invented for GMMs, the same basic algorithm can be used for learning with arbitrary Bayes Nets when some of the training data has missing values.

This has made EM one of the most popular unsupervised learning techniques in machine learning.

Page 24: Unsupervised Learning

EM Quiz

[Figure: three points a, b, c on a line, with three Gaussians g1, g2, g3.]

Which Gaussian(s) have a nonzero value for f(a)?

How about f(c)?

Page 25: Unsupervised Learning

Answer: EM Quiz

[Figure: three points a, b, c on a line, with three Gaussians g1, g2, g3.]

Which Gaussian(s) have a nonzero f(a)?

All Gaussians (g1, g2, and g3) have a nonzero value for f(a).

How about f(c)?

Ditto. All Gaussians have a nonzero value for f(c).

Page 26: Unsupervised Learning

Quiz: EM vs. K-Means

[Figure: points a and c on a line, with cluster centers g1 and g2; candidate final locations for g1 are marked Option 1 and Option 2.]

At the end of K-Means, where will cluster center g1 end up – Option 1 or Option 2?


At the end of EM, where will cluster center g1 end up – Option 1 or Option 2?


Page 27: Unsupervised Learning

Answer: EM vs. K-Means

[Figure: points a and c on a line, with cluster centers g1 and g2; candidate final locations for g1 are marked Option 1 and Option 2.]

At the end of K-Means, where will cluster center g1 end up – Option 1 or Option 2?

Option 1: K-Means puts the “mean” at the center of all points in the cluster, and point a will be the only point in g1’s cluster.


At the end of EM, where will cluster center g1 end up – Option 1 or Option 2?

Option 2: EM puts the “mean” at the center of all points in the dataset, where each point is weighted by how likely it is according to the Gaussian. Point a and Point c will both have some likelihood, but Point a’s likelihood will be much higher. So the “mean” for g1 will be very close to Point a, but not all the way at Point a.


Page 28: Unsupervised Learning

How many clusters?

We’ve been assuming a fixed K. Here’s a technique to determine this automatically, from data.

New objective function: Minimize:

Algorithm:
1. Initialize K somehow.
Repeat until convergence:
2. Run EM.
3. Remove unnecessary clusters (low π value).
4. Create new random clusters (more or fewer than before, depending on a heuristic estimate of whether there were too many or too few before).

This is slow. But one nice property is that it can overcome some difficulties with local maxima.
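The slide’s objective function wasn’t captured in this transcript, but the loop itself can be sketched. This version reuses em_gmm and gmm_log_likelihood from the earlier sketches, uses a simple penalized score (log-likelihood minus a per-cluster cost) as a stand-in objective, re-runs EM from scratch at each new K, and replaces “until convergence” with a fixed number of rounds; cluster_cost, min_pi, and rounds are illustrative knobs, not values from the slides.

import random

def choose_k(xs, k_init=2, cluster_cost=5.0, min_pi=0.05, rounds=10, seed=0):
    rng = random.Random(seed)
    K = k_init
    best_score, best_model = None, None
    for _ in range(rounds):
        # Run EM with the current K
        pis, mus, variances = em_gmm(xs, K, seed=rng.randrange(10**6))
        sigmas = [v ** 0.5 for v in variances]
        score = gmm_log_likelihood(xs, pis, mus, sigmas) - cluster_cost * K
        if best_score is None or score > best_score:
            best_score, best_model = score, (pis, mus, variances)
        # Remove unnecessary clusters (low pi), then maybe create a new random one
        K = max(1, sum(1 for p in pis if p >= min_pi)) + rng.choice([0, 1])
    return best_model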

Page 29: Unsupervised Learning

Quiz

Is EM for GMMs:

Classification or Regression?

Generative or Discriminative?

Parametric or Nonparametric?

Page 30: Unsupervised Learning

Answer

Is EM for GMMs:

Classification or Regression? Two possible answers:
- Classification: the output is a discrete value (cluster label) for each point.
- Regression: the output is a real value (probability) for each possible cluster label for each point.

Generative or Discriminative?
- Normally, it’s used with a fixed set of input and output variables. However, GMMs are Bayes Nets that store a full joint distribution. Once it’s trained, a GMM can actually make predictions for any subset of the variables given any other subset. Technically, this is generative.

Parametric or Nonparametric?
- Parametric: the number of parameters is 3K, which does not change with the number of training data points.

Page 31: Unsupervised Learning

Quiz

Is EM for GMMs:

Supervised or Unsupervised?

Online or batch?

Closed-form or iterative?

Page 32: Unsupervised Learning

Answer

Is EM for GMMs:

Supervised or Unsupervised?
- Unsupervised.

Online or batch?
- Batch: if you add a new data point, you need to revisit all the training data to recompute the locally-optimal model.

Closed-form or iterative?
- Iterative: training requires many passes through the data.