Unsupervised Learning: Gaussian Mixture Models and Expectation-Maximization (EM)


Feb 22, 2016




Transcript
Page 1: Unsupervised Learning

Unsupervised Learning

Gaussian Mixture ModelsExpectation-Maximization (EM)

Page 2: Unsupervised Learning

Gaussian Mixture Models

[Figure: data points plotted on axes X1 and X2, with level-set ellipses drawn around each cluster center.]

Like K-Means, GMM clusters have centers.

In addition, they have probability distributions that indicate the probability that a point belongs to the cluster.

These ellipses show “level sets”: lines with equal probability of belonging to the cluster.

Notice that green points still have SOME probability of belonging to the blue cluster, but it’s much lower than the blue points.

This is a more complex model than K-Means: distance from the center can matter more in one direction than another.

Page 3: Unsupervised Learning

GMMs and EM

Gaussian Mixture Models (GMMs) are a type of model, similar to a Naïve Bayes model but with important differences.

Expectation-Maximization (EM) is a parameter-estimation algorithm for training GMMs using unlabeled data.

To explain these further, we first need to review Gaussian (normal) distributions.

Page 4: Unsupervised Learning

The Normal (aka Gaussian) Distribution

[Figure: the bell-shaped Gaussian density f, centered at the mean μ, with variance σ² and standard deviation σ.]
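For reference, the density sketched on this slide is the standard univariate Gaussian pdf, written out explicitly (the same form reappears inside the GMM likelihood on Page 18):

f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)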

Page 5: Unsupervised Learning

Quiz: MLE for Gaussians

Based on your statistics knowledge,
1) What is the MLE for μ from a bunch of example X points?
2) What is the MLE for σ from a bunch of example X points?

Page 6: Unsupervised Learning

Answer: MLE for Gaussians

Based on your statistics knowledge,
1) What is the MLE for μ from a bunch of example X points?

\mu = \frac{1}{M} \sum_i X_i \quad \text{(the average of the X values)}

2) What is the MLE for σ from a bunch of example X points?

\sigma^2 = \frac{1}{M} \sum_i (X_i - \mu)^2 \quad \text{(the average squared deviation from the mean)}

Note: this is the so-called "biased" estimator for \sigma^2; there is also an "unbiased" estimator which basically just uses (M − 1) instead of M. We'll stick with the "biased" one here, but either one is fine.

Page 7: Unsupervised Learning

Quiz: Deriving the ML estimators

How would you derive the MLE equations for Gaussian distributions?

Page 8: Unsupervised Learning

Answer: Deriving the ML estimators

How would you derive the MLE equations for Gaussian distributions?

Same plan of attack as for MLE estimates of Bayes Nets:
1. Write down the Likelihood function P(D | M).
2. Make the assumption that each data point X_i is independently distributed, so P(D | M) = \prod_i P(X_i | M).
3. Take the log.
4. Take the partial derivative with respect to μ, set this equal to zero, and solve for μ.
5. Take the partial derivative with respect to σ, set this equal to zero, and solve for σ.
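Carrying those five steps out for a single Gaussian gives the estimators quoted on the previous slide; a condensed version of the standard derivation, included here for completeness:

\log P(D \mid M) = \sum_{i=1}^{M} \left[ -\log\!\left(\sigma\sqrt{2\pi}\right) - \frac{(X_i - \mu)^2}{2\sigma^2} \right]

\frac{\partial}{\partial \mu}: \; \sum_i \frac{X_i - \mu}{\sigma^2} = 0 \;\Rightarrow\; \mu = \frac{1}{M}\sum_i X_i

\frac{\partial}{\partial \sigma}: \; \sum_i \left[ -\frac{1}{\sigma} + \frac{(X_i - \mu)^2}{\sigma^3} \right] = 0 \;\Rightarrow\; \sigma^2 = \frac{1}{M}\sum_i (X_i - \mu)^2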

Page 9: Unsupervised Learning

Quiz: Estimating a Gaussian

On the left is a dataset with the following X values: 0, 3, 4, 5, 6, 7, 10.

Find the maximum likelihood Gaussian distribution.

\mu = \frac{1}{M} \sum_i X_i

\sigma^2 = \frac{1}{M} \sum_i (X_i - \mu)^2

Page 10: Unsupervised Learning

Answer: Estimating a Gaussian

On the left is a dataset with the following X values: 0, 3, 4, 5, 6, 7, 10.

\mu = \frac{1}{M} \sum_i X_i = \frac{1}{7}(0 + 3 + 4 + 5 + 6 + 7 + 10) = 5

\sigma^2 = \frac{1}{M} \sum_i (X_i - \mu)^2 = \frac{1}{7}\left( (0-5)^2 + (3-5)^2 + (4-5)^2 + (5-5)^2 + (6-5)^2 + (7-5)^2 + (10-5)^2 \right) = \frac{1}{7}\left( 5^2 + 2^2 + 1^2 + 0^2 + 1^2 + 2^2 + 5^2 \right) = \frac{1}{7}(25 + 4 + 1 + 0 + 1 + 4 + 25) = \frac{60}{7}

[Figure: the maximum likelihood Gaussian density f fitted to these points.]
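A quick numerical check of this worked example (plain Python, no external libraries; the variable names are only for illustration):

xs = [0, 3, 4, 5, 6, 7, 10]
M = len(xs)

mu = sum(xs) / M                               # MLE mean
sigma_sq = sum((x - mu) ** 2 for x in xs) / M  # MLE (biased) variance

print(mu)        # 5.0
print(sigma_sq)  # 8.5714... = 60/7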

Page 11: Unsupervised Learning

Clustering by fitting K Gaussians

Suppose our dataset looks like the one above.

It doesn’t really look Gaussian anymore; it looks like it has 3 clusters.

Fitting a single Gaussian to this data will still give you an estimate.

But that Gaussian will have a low Likelihood value: it will give very low probability to the leftmost and rightmost clusters.


Page 12: Unsupervised Learning

Clustering by fitting K Gaussians

What we’d like to do instead is to fit K Gaussians.

A model for data that involves multiple Gaussian distributions is called a Gaussian Mixture Model (GMM).


Page 13: Unsupervised Learning

Clustering by fitting K Gaussians

Another way of drawing these is with “Level sets”:

Curves that show points with equal probability for each Gaussian.

Wider curves have lower probability than narrower curves.

Notice that each point is contained within every Gaussian, but is most tightly bound to the closest Gaussian.

[Figure: level sets of three Gaussians fitted to the data, with means μ_red, μ_blue, and μ_green.]

Page 14: Unsupervised Learning

Expectation-Maximization (EM)

EM is “K-Means for GMMs”.

It is a parameter estimation algorithm for GMMs that will determine a (locally-optimal) setting for all of the GMM parameters, using a bunch of unlabeled X points.

Input:
1. Data points X_1, …, X_M
2. A number K

Output: \mu_1, \sigma_1, …, \mu_K, \sigma_K, such that the GMM with those means and standard deviations has a locally-maximum likelihood for the training data set.

Page 15: Unsupervised Learning

Visualization of EM

1. Initialize the mean and standard deviation of each Gaussian randomly.

2. Repeat until convergence:
– Expectation: For each point X and each Gaussian k, find f(X | Gaussian k)

Page 16: Unsupervised Learning

Visualization of EM

1. Initialize the mean and standard deviation of each Gaussian randomly.

2. Repeat until convergence:
– Expectation: For each point X and each Gaussian k, find f(X | Gaussian k)
– Maximization: Estimate new parameters for each Gaussian. (Technically, you also need to estimate a third parameter, called π_k. More later.)

Page 18: Unsupervised Learning

Gaussian Mixture Model

K Gaussian distributions with parameters (\mu_1, \sigma_1) through (\mu_K, \sigma_K).

It also involves K additional parameters, called prior probabilities, \pi_1 through \pi_K. These describe the relative importance of each of the K Gaussian distributions in the full model.

The likelihood equation for this model looks like this:

f(X_1, \ldots, X_M \mid GMM) = \prod_i f(X_i \mid GMM) \quad \text{(i.i.d. assumption)}

f(X_i \mid GMM) = \sum_{k=1}^{K} \underbrace{\pi_k}_{\text{Prior}} \; \underbrace{\frac{1}{\sigma_k \sqrt{2\pi}} \exp\!\left( -\frac{(X_i - \mu_k)^2}{2\sigma_k^2} \right)}_{\text{Gaussian}}
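These two equations translate almost line-for-line into code; a minimal sketch (function and parameter names are illustrative, with sigmas holding standard deviations):

import math

def gmm_density(x, pis, mus, sigmas):
    # f(x | GMM): a prior-weighted sum of K univariate Gaussian densities
    return sum(
        pi / (sigma * math.sqrt(2 * math.pi))
        * math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
        for pi, mu, sigma in zip(pis, mus, sigmas)
    )

def gmm_log_likelihood(xs, pis, mus, sigmas):
    # log f(X_1, ..., X_M | GMM), using the i.i.d. assumption
    return sum(math.log(gmm_density(x, pis, mus, sigmas)) for x in xs)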

Page 19: Unsupervised Learning

GMMs as Bayes Nets

GMMs are simple Bayes Nets. Two differences from previous BNs we’ve seen:

1. We’re used to binary variables in BNs. Here, the “Cluster” variable has K possible values (1, 2, …, K) instead of just two (+cluster and –cluster). We used to store P(+a) and P(-a) for the parent variable; now we store \pi_1 through \pi_K.

2. The “X” variable has infinitely many values (any real number) instead of just two (+x and –x). We used to store P(+x | +a) and P(+x | -a). Now we store (\mu_1, \sigma_1) through (\mu_K, \sigma_K), and we say f(X \mid \text{Cluster is } j) = \frac{1}{\sigma_j \sqrt{2\pi}} \exp\!\left( -\frac{(X - \mu_j)^2}{2\sigma_j^2} \right).

[Diagram: a two-node Bayes Net, Cluster (1, 2, …, K) → X (a real number).]
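One way to see this two-node structure is ancestral sampling from the Bayes Net: first draw the Cluster value from the priors, then draw X from that cluster’s Gaussian. A sketch with illustrative names:

import random

def sample_from_gmm(pis, mus, sigmas):
    # Parent node: Cluster ~ Categorical(pi_1, ..., pi_K)
    k = random.choices(range(len(pis)), weights=pis)[0]
    # Child node: X | Cluster = k  ~  Normal(mu_k, sigma_k)
    x = random.gauss(mus[k], sigmas[k])
    return k, x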

Page 20: Unsupervised Learning

Formal Description of the Algorithm

1. Init: For each k in {1, …, K}, create a random \pi_k, \mu_k, \sigma_k^2.

2. Repeat until all \pi_k, \mu_k, \sigma_k^2 remain the same from one iteration to the next:

Expectation (aka Assignment in K-Means): For each X_i, for each k, let

C[X_i, k] = \frac{\pi_k \, f(X_i \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \pi_j \, f(X_i \mid \mu_j, \sigma_j^2)}

Maximization (aka Update in K-Means): For each k,

\pi_k = \frac{1}{M} \sum_i C[X_i, k], \qquad \mu_k = \frac{\sum_i C[X_i, k] \, X_i}{\sum_i C[X_i, k]}, \qquad \sigma_k^2 = \frac{\sum_i C[X_i, k] \, (X_i - \mu_k)^2}{\sum_i C[X_i, k]}

3. Return (for all values of k) \pi_k, \mu_k, \sigma_k^2.
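A compact, runnable sketch of this algorithm for 1-D data. Assumptions not taken from the slides: means are initialized from randomly chosen data points and variances from the overall data variance (rather than fully random values), a fixed iteration cap stands in for the exact convergence test, and a small floor on the variances avoids division by zero.

import math
import random

def gaussian_pdf(x, mu, var):
    # Univariate Gaussian density with mean mu and variance var
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(xs, K, iters=100, seed=0):
    rng = random.Random(seed)
    M = len(xs)
    overall_mean = sum(xs) / M
    overall_var = max(1e-6, sum((x - overall_mean) ** 2 for x in xs) / M)

    # 1. Init: pi_k uniform, mu_k from distinct data points, sigma^2_k from the data
    pis = [1.0 / K] * K
    mus = rng.sample(xs, K)
    variances = [overall_var] * K

    for _ in range(iters):
        # Expectation (Assignment): responsibilities C[i][k] proportional to pi_k * f(x_i | k)
        C = []
        for x in xs:
            weights = [pis[k] * gaussian_pdf(x, mus[k], variances[k]) for k in range(K)]
            total = sum(weights) or 1e-300
            C.append([w / total for w in weights])

        # Maximization (Update): re-estimate pi_k, mu_k, sigma^2_k from the weighted points
        for k in range(K):
            Nk = sum(C[i][k] for i in range(M)) or 1e-300
            pis[k] = Nk / M
            mus[k] = sum(C[i][k] * xs[i] for i in range(M)) / Nk
            variances[k] = max(1e-6, sum(C[i][k] * (xs[i] - mus[k]) ** 2 for i in range(M)) / Nk)

    # 3. Return pi_k, mu_k, sigma^2_k for all k
    return pis, mus, variances

For the three-cluster dataset of Page 11, one would call em_gmm(xs, K=3); since EM only finds a local optimum (Page 22), in practice it is often restarted from several random seeds and the highest-likelihood result kept.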

Page 21: Unsupervised Learning

Evaluation metric for GMMs and EM

LOSS Function (or Objective function) for EM:

EM (locally) maximizes the “Marginal” Likelihood:

EM(X_1, \ldots, X_M) = \arg\max_{\pi_1, \mu_1, \sigma_1, \ldots, \pi_K, \mu_K, \sigma_K} f(X_1, \ldots, X_M \mid \pi_1, \mu_1, \sigma_1, \ldots, \pi_K, \mu_K, \sigma_K)

Notice that this is the Likelihood function for just the X variable in our Bayes Net, rather than the Likelihood for (X and Cluster), which is why it is called “marginal likelihood” rather than just “likelihood”.

Page 22: Unsupervised Learning

Analysis of EM Performance

EM is guaranteed to find a local optimum of the Likelihood function.

Theorem: After one iteration of EM, the Likelihood of the new GMM >= the Likelihood of the previous GMM.

(Dempster, A.P.; Laird, N.M.; Rubin, D.B. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm”. Journal of the Royal Statistical Society, Series B (Methodological) 39 (1): 1–38. JSTOR 2984875.)

Page 23: Unsupervised Learning

EM Generality

Even though EM was originally invented for GMMs, the same basic algorithm can be used for learning with arbitrary Bayes Nets when some of the training data has missing values.

This has made EM one of the most popular unsupervised learning techniques in machine learning.

Page 24: Unsupervised Learning

EM Quiz

[Figure: three points a, b, c on a line, with three Gaussians g1, g2, g3.]

Which Gaussian(s) have a nonzero value for f(a)?

How about f(c)?

Page 25: Unsupervised Learning

Answer: EM Quiz

[Figure: three points a, b, c on a line, with three Gaussians g1, g2, g3.]

Which Gaussian(s) have a nonzero f(a)?

All Gaussians (g1, g2, and g3) have a nonzero value for f(a).

How about f(c)?

Ditto. All Gaussians have a nonzero value for f(c).

Page 26: Unsupervised Learning

Quiz: EM vs. K-Means

[Figure: points a and c on a line, with cluster centers g1 and g2; candidate final locations for g1 are marked Option 1 and Option 2.]

At the end of K-Means, where will cluster center g1 end up – Option 1 or Option 2?


At the end of EM, where will cluster center g1 end up – Option 1 or Option 2?


Page 27: Unsupervised Learning

Answer: EM vs. K-Means

[Figure: points a and c on a line, with cluster centers g1 and g2; candidate final locations for g1 are marked Option 1 and Option 2.]

At the end of K-Means, where will cluster center g1 end up – Option 1 or Option 2?

Option 1: K-Means puts the “mean” at the center of all points in the cluster, and point a will be the only point in g1’s cluster.


At the end of EM, where will cluster center g1 end up – Option 1 or Option 2?

Option 2: EM puts the “mean” at the center of all points in the dataset, where each point is weighted by how likely it is according to the Gaussian. Point a and Point c will both have some likelihood, but Point a’s likelihood will be much higher. So the “mean” for g1 will be very close to Point a, but not all the way at Point a.


Page 28: Unsupervised Learning

How many clusters?

We’ve been assuming a fixed K. Here’s a technique to determine this automatically, from data.

New objective function: Minimize:

Algorithm:
1. Initialize K somehow.
Repeat until convergence:
2. Run EM.
3. Remove unnecessary clusters (low π value).
4. Create new random clusters (more or fewer than before, depending on a heuristic estimate of whether there were too many or too few before).

This is slow. But one nice property is that it can overcome some difficulties with local maxima.
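The slide’s objective function wasn’t captured in this transcript, but the loop itself can be sketched. This version reuses em_gmm and gmm_log_likelihood from the earlier sketches, uses a simple penalized score (log-likelihood minus a per-cluster cost) as a stand-in objective, re-runs EM from scratch at each new K, and replaces “until convergence” with a fixed number of rounds; cluster_cost, min_pi, and rounds are illustrative knobs, not values from the slides.

import random

def choose_k(xs, k_init=2, cluster_cost=5.0, min_pi=0.05, rounds=10, seed=0):
    rng = random.Random(seed)
    K = k_init
    best_score, best_model = None, None
    for _ in range(rounds):
        # Run EM with the current K
        pis, mus, variances = em_gmm(xs, K, seed=rng.randrange(10**6))
        sigmas = [v ** 0.5 for v in variances]
        score = gmm_log_likelihood(xs, pis, mus, sigmas) - cluster_cost * K
        if best_score is None or score > best_score:
            best_score, best_model = score, (pis, mus, variances)
        # Remove unnecessary clusters (low pi), then maybe create a new random one
        K = max(1, sum(1 for p in pis if p >= min_pi)) + rng.choice([0, 1])
    return best_model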

Page 29: Unsupervised Learning

Quiz

Is EM for GMMs:

Classification or Regression?

Generative or Discriminative?

Parametric or Nonparametric?

Page 30: Unsupervised Learning

Answer

Is EM for GMMs:

Classification or Regression? Two possible answers:
- Classification: the output is a discrete value (cluster label) for each point.
- Regression: the output is a real value (probability) for each possible cluster label for each point.

Generative or Discriminative?
- Normally, it’s used with a fixed set of input and output variables. However, GMMs are Bayes Nets that store a full joint distribution. Once it’s trained, a GMM can actually make predictions for any subset of the variables given any other subset. Technically, this is generative.

Parametric or Nonparametric?
- Parametric: the number of parameters is 3K, which does not change with the number of training data points.

Page 31: Unsupervised Learning

Quiz

Is EM for GMMs:

Supervised or Unsupervised?

Online or batch?

Closed-form or iterative?

Page 32: Unsupervised Learning

Answer

Is EM for GMMs:

Supervised or Unsupervised?
- Unsupervised.

Online or batch?
- Batch: if you add a new data point, you need to revisit all the training data to recompute the locally-optimal model.

Closed-form or iterative?
- Iterative: training requires many passes through the data.