Transcript
Page 1:

1

EM Algorithm

Presented By:

Jonathan Carter

Department of Computer Science, University of Vermont

April 2015

Page 2:

Copyright Note:

This presentation is based on the paper:

– Dempster, A.P.; Laird, N.M.; Rubin, D.B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society, Series B (Methodological) 39 (1): 1–38. JSTOR 2984875. MR 0501537.

Sections 1 and 4 come from Professor Taiwen Yu’s “EM Algorithm”.

Sections 2, 3, and 6 come from Professor Andrew W. Moore’s “Clustering with Gaussian Mixtures”.

Section 5 was edited by Haiguang Li.

Section 7 was edited by Christopher Morse.

Page 3:

Contents

1. Introduction

2. Example − Silly Example

3. Example − Same Problem with Hidden Info

4. Example − Normal Sample

5. EM-algorithm Explained

6. EM-Algorithm Running on GMM

7. EM-algorithm Application: Semi-Supervised Text Classification

8. Questions

Page 4:

Introduction

The EM algorithm was explained and given its name in a classic 1977 paper by Arthur Dempster, Nan Laird, and Donald Rubin.

They pointed out that the method had been "proposed many times in special circumstances" by earlier authors.

EM is typically used to compute maximum likelihood estimates given incomplete samples.

The EM algorithm estimates the parameters of a model iteratively.

– Starting from some initial guess, each iteration consists of an E step (Expectation step) and an M step (Maximization step).
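For reference, one iteration can be written generically as follows (a standard formulation, not from the slides; θ^(t) is the current parameter estimate, X the observed data, and Z the hidden data):

\begin{align*}
\text{E step:}\quad & Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}}\big[\log p(X, Z \mid \theta)\big] \\
\text{M step:}\quad & \theta^{(t+1)} = \arg\max_{\theta} \; Q(\theta \mid \theta^{(t)})
\end{align*}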

Page 5:

Applications

– Filling in missing data in samples

– Discovering the values of latent variables

– Estimating the parameters of HMMs

– Estimating the parameters of finite mixtures

– Unsupervised learning of clusters

– Semi-supervised classification and clustering

Page 6:

Contents

1. Introduction

2. Example − Silly Example

3. Example − Same Problem with Hidden Info

4. Example − Normal Sample

5. EM-algorithm Explained

6. EM-Algorithm Running on GMM

7. EM-algorithm Application: Semi-Supervised Text Classification

8. Questions

Page 7:

7

EM Algorithm

Silly Example

Pages 8–10: (slide figures only; no text captured)

Contents

1. Introduction

2. Example − Silly Example

3. Example − Same Problem with Hidden Info

4. Example − Normal Sample

5. EM-algorithm Explained

6. EM-Algorithm Running on GMM

7. EM-algorithm Application: Semi-Supervised Text Classification

8. Questions

Page 11:

11

EM Algorithm

Same Problem

with Hidden Info

Pages 12–16: (slide figures only; no text captured)

Contents

1. Introduction

2. Example − Silly Example

3. Example − Same Problem with Hidden Info

4. Example − Normal Sample

5. EM-algorithm Explained

6. EM-Algorithm Running on GMM

7. EM-algorithm Application: Semi-Supervised Text Classification

8. Questions

Page 17:

17

EM Algorithm

Normal Sample

Page 18:

Normal Sample

[Figure: sampling x1, …, xN from a normal distribution with parameters μ and σ]

Page 19:

Maximum Likelihood

[Figure: sampling x1, …, xN from a normal distribution with parameters μ and σ]

Given x, the likelihood is a function of μ and σ². We want to maximize it.
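For a normal sample x1, …, xN, the likelihood being maximized is the standard one:

\begin{equation*}
L(\mu, \sigma^2 \mid x_1, \dots, x_N) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right)
\end{equation*}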

Page 20:

Log-Likelihood Function

Maximize the log-likelihood instead, by setting ∂ℓ/∂μ = 0 and ∂ℓ/∂σ² = 0.

Page 21:

Max. the Log-Likelihood Function

Page 22:

Max. the Log-Likelihood Function
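The derivation these two slides walk through is the standard one for a normal sample; written out:

\begin{align*}
\ell(\mu, \sigma^2) &= \log L(\mu, \sigma^2) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2 \\
\frac{\partial \ell}{\partial \mu} = 0 \;&\Rightarrow\; \hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i \\
\frac{\partial \ell}{\partial \sigma^2} = 0 \;&\Rightarrow\; \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{\mu})^2
\end{align*}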

Page 23:

Contents

1. Introduction

2. Example − Silly Example

3. Example − Same Problem with Hidden Info

4. Example − Normal Sample

5. EM-algorithm Explained

6. Illustration: EM-Algorithm Running on GMM

7. EM-algorithm Application: Semi-Supervised Text Classification

8. Questions

Page 24:

24

EM Algorithm

Explained

Page 25:

Begin with Classification

Page 26:

Solve the problem using another method: the parametric method.

Page 27:

Use our model for classification

Page 28:

EM Clustering Algorithm

Page 29:

E-M

Pages 30–32: (slide figures only; no text captured)

Comparison to K-means

Page 33:

Contents

1. Introduction

2. Example − Silly Example

3. Example − Same Problem with Hidden Info

4. Example − Normal Sample

5. EM-algorithm Explained

6. EM-Algorithm Running on GMM

7. EM-algorithm Application: Semi-Supervised Text Classification

8. Questions

Page 34:

34

EM Algorithm

EM Running on GMM

Pages 35–43: (slide figures only; no text captured)
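Since the GMM run is illustrated graphically on these slides, a minimal numerical sketch of the same iteration may help. This is an illustrative one-dimensional implementation; the function name and defaults are ours, not from the slides.

import numpy as np

def em_gmm_1d(x, n_components, n_iters=50, seed=0):
    """EM for a one-dimensional Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Initial guesses: random data points as means, pooled variance, uniform weights.
    mu = rng.choice(x, size=n_components, replace=False)
    var = np.full(n_components, np.var(x))
    weights = np.full(n_components, 1.0 / n_components)

    for _ in range(n_iters):
        # E step: responsibility of each component for each data point.
        dens = (weights / np.sqrt(2 * np.pi * var)) * \
               np.exp(-(x[:, None] - mu) ** 2 / (2 * var))
        resp = dens / dens.sum(axis=1, keepdims=True)

        # M step: re-estimate weights, means, and variances from the responsibilities.
        nk = resp.sum(axis=0)
        weights = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

    return weights, mu, var

For example, weights, mu, var = em_gmm_1d(np.asarray(data), n_components=2) fits a two-component mixture to a one-dimensional data array.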

Contents

1. Introduction

2. Example − Silly Example

3. Example − Same Problem with Hidden Info

4. Example − Normal Sample

5. EM-algorithm Explained

6. EM-Algorithm Running on GMM

7. EM-algorithm Application: Semi-Supervised Text Classification

8. Questions

Page 44:

44

EM Application: Semi-Supervised Text Classification

“Learning to Classify Text From Labeled and Unlabeled Documents”

K. Nigam, A. McCallum, and T. Mitchell (1998)

Page 45:

Learning to Classify Text from Labeled and Unlabeled Documents

K. Nigam et al. present a method for building a more accurate text classifier by augmenting a limited number of labeled training documents with a large number of unlabeled documents.

Why the interest in unlabeled data?

– There is an abundance of unlabeled data available for training, but only very limited amounts of labeled data.

– Labeling data is costly, and there is too much data being produced to make a dent in it.

Page 46:

Learning to Classify Text from Labeled and Unlabeled Documents

The authors present evidence of the efficacy of their method in three main domains: newsgroup postings, web pages, and newswire articles.

The exponential expansion of textual data demands efficient and accurate methods of classification.

The high dimensionality of the feature set and the relatively small number of labeled training samples make effective classification difficult.

Page 47:

Building a Semi-Supervised Classifier

Classify textual documents using a combination of labeled and unlabeled data.

First, build an initial classifier by calculating model parameters from the labeled documents only.

Loop until convergence (or a stopping criterion is met):

– Probabilistically label the unlabeled documents using the classifier.

– Recalculate the classifier parameters given the probabilistically assigned labels.
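As a rough sketch of this loop: the classifier interface below (a fit method that accepts per-class label probabilities, and predict_proba) is a hypothetical one chosen for illustration, not from the paper, and documents are assumed to be held in Python lists.

import numpy as np

def semi_supervised_em(labeled_docs, labels, unlabeled_docs,
                       classifier, n_classes, max_iters=20, tol=1e-4):
    # Build the initial classifier from the labeled documents only.
    hard_labels = np.eye(n_classes)[labels]          # one-hot class "probabilities"
    classifier.fit(labeled_docs, hard_labels)

    prev = None
    for _ in range(max_iters):
        # E step: probabilistically label the unlabeled documents.
        soft_labels = classifier.predict_proba(unlabeled_docs)

        # M step: recalculate the classifier parameters from all documents,
        # weighting the unlabeled ones by their probabilistic labels.
        classifier.fit(labeled_docs + unlabeled_docs,
                       np.vstack([hard_labels, soft_labels]))

        # Stop once the probabilistic labels no longer change appreciably.
        if prev is not None and np.abs(soft_labels - prev).max() < tol:
            break
        prev = soft_labels

    return classifier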

Page 48:

The Probabilistic Framework

Assumptions

– Every document di is generated according to a probability distribution (a mixture model) parameterized by θ.

– The mixture model is composed of components cj ∈ C, with a one-to-one correspondence between classes and mixture components, i.e., cj refers to the jth component and the jth class.

Page 49:

The Probabilistic Framework

Document Generation:

1. A mixture component cj is selected according to its prior probability P(cj | θ).

2. The selected component generates the document according to its own distribution P(di | cj; θ).

Thus the likelihood of a document di is P(di | θ) = Σj P(cj | θ) P(di | cj; θ).

Page 50:

Initial Classifier: Naive Bayes

Documents are treated as an “ordered list of word events” [17].

– The probability of a document given its class is the product of the probabilities of its words: P(di | cj; θ) = Πk P(w_{di,k} | cj; θ).

– w_{di,k} represents the word in position k of document di.
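Written out in the usual notation for this model (added here for reference; N(wt, di) is the number of times word wt occurs in document di and |V| is the vocabulary size), the document probability and the word probabilities, typically estimated with Laplace smoothing, are:

\begin{align*}
P(d_i \mid c_j; \theta) &\propto \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \theta) \\
\hat{\theta}_{w_t \mid c_j} &= \frac{1 + \sum_{i} N(w_t, d_i)\, P(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i} N(w_s, d_i)\, P(c_j \mid d_i)}
\end{align*}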

Page 51:

Naive Bayes

Naive Bayes in the context of this experiment:

– “The learning task for the naive Bayes classifier is to use a set of training documents to estimate the mixture model parameters, then use the estimated model to classify new documents.” [17]

Page 52:

Step 1: Train a Naive Bayes Classifier Using Labeled Data

Page 53:

Apply Bayes’ Rule
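The rule in question, written out for this mixture model (standard Bayes' rule with the quantities defined above):

\begin{equation*}
P(c_j \mid d_i; \hat{\theta}) = \frac{P(c_j \mid \hat{\theta})\, P(d_i \mid c_j; \hat{\theta})}{\sum_{r=1}^{|C|} P(c_r \mid \hat{\theta})\, P(d_i \mid c_r; \hat{\theta})}
\end{equation*}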

Page 54:

Step 2: Combine the Labeled Data with Unlabeled Data Using EM

EM: maximum likelihood estimates given incomplete data.

Wait, where’s the incomplete data?

Aha! The unlabeled data is incomplete: it’s missing class labels!

Page 55:

Step 2: Combine the Labeled Data with Unlabeled Data Using EM

E-step:

– The E-step corresponds to calculating probabilistic labels P(cj | di; θ) for every document using the current estimate of θ, as demonstrated in the previous calculation.

M-step:

– The M-step corresponds to calculating a new maximum likelihood estimate of θ given the current probabilistic document labels.
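Written compactly (a standard way of stating the two steps for this model, added for reference):

\begin{align*}
\text{E step:}\quad & z_{ij}^{(t)} = P(c_j \mid d_i; \theta^{(t)}) \quad \text{(via Bayes' rule, as above)} \\
\text{M step:}\quad & \theta^{(t+1)} = \arg\max_{\theta} \sum_{i} \sum_{j} z_{ij}^{(t)} \log\!\left[ P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \right]
\end{align*}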

Page 56:

Classifier Performance

Page 57:

Results

The experiments showed significant improvements from using unlabeled documents to train classifiers in three real-world text classification tasks.

Using unlabeled data requires a closer match between the data and the model than using labeled data alone.

– This warrants exploring more complex mixture models.

Page 58:

References

1. Dempster, A.P.; Laird, N.M.; Rubin, D.B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society, Series B (Methodological) 39 (1): 1–38. JSTOR 2984875. MR 0501537.

2. Sundberg, Rolf (1974). "Maximum likelihood theory for incomplete data from an exponential family". Scandinavian Journal of Statistics 1 (2): 49–58. JSTOR 4615553. MR 381110.

3. Rolf Sundberg. 1971. Maximum likelihood theory and applications for distributions generated when observing a function of an exponential family variable. Dissertation, Institute for Mathematical Statistics, Stockholm University.

4. Sundberg, Rolf (1976). "An iterative method for solution of the likelihood equations for incomplete data from exponential families". Communications in Statistics – Simulation and Computation 5 (1): 55–64. doi:10.1080/03610917608812007. MR 443190.

5. See the acknowledgement by Dempster, Laird and Rubin on pages 3, 5 and 11.

6. G. Kulldorff. 1961. Contributions to the theory of estimation from grouped and partially grouped samples. Almqvist & Wiksell.

7. Anders Martin-Löf. 1963. "Utvärdering av livslängder i subnanosekundsområdet" ("Evaluation of sub-nanosecond lifetimes"). ("Sundberg formula")

8. Per Martin-Löf. 1966. Statistics from the point of view of statistical mechanics. Lecture notes, Mathematical Institute, Aarhus University. ("Sundberg formula" credited to Anders Martin-Löf.)

9. Per Martin-Löf. 1970. Statistiska Modeller (Statistical Models): Anteckningar från seminarier läsåret 1969–1970 (Notes from seminars in the academic year 1969–1970), with the assistance of Rolf Sundberg. Stockholm University. ("Sundberg formula")

10. Martin-Löf, P. The notion of redundancy and its use as a quantitative measure of the deviation between a statistical hypothesis and a set of observational data. With a discussion by F. Abildgård, A. P. Dempster, D. Basu, D. R. Cox, A. W. F. Edwards, D. A. Sprott, G. A. Barnard, O. Barndorff-Nielsen, J. D. Kalbfleisch and G. Rasch and a reply by the author. Proceedings of Conference on Foundational Questions in Statistical Inference (Aarhus, 1973), pp. 1–42. Memoirs, No. 1, Dept. Theoret. Statist., Inst. Math., Univ. Aarhus, Aarhus, 1974.

11. Martin-Löf, Per. The notion of redundancy and its use as a quantitative measure of the discrepancy between a statistical hypothesis and a set of observational data. Scand. J. Statist. 1 (1974), no. 1, 3–18.

12. Wu, C. F. Jeff (March 1983). "On the Convergence Properties of the EM Algorithm". Annals of Statistics 11 (1): 95–103. doi:10.1214/aos/1176346060. JSTOR 2240463. MR 684867.

13. Neal, Radford; Hinton, Geoffrey (1999). "A view of the EM algorithm that justifies incremental, sparse, and other variants". In Michael I. Jordan (ed.), Learning in Graphical Models. Cambridge, MA: MIT Press: 355–368. ISBN 0262600323.

14. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2001). "8.5 The EM algorithm". The Elements of Statistical Learning. New York: Springer. pp. 236–243. ISBN 0-387-95284-5.

15. Jamshidian, Mortaza; Jennrich, Robert I. (1997). "Acceleration of the EM Algorithm by using Quasi-Newton Methods". Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59 (2): 569–587. doi:10.1111/1467-9868.00083. MR 1452026.

16. Meng, Xiao-Li; Rubin, Donald B. (1993). "Maximum likelihood estimation via the ECM algorithm: A general framework". Biometrika 80 (2): 267–278. doi:10.1093/biomet/80.2.267. MR 1243503.

17. Hunter, D.R. and Lange, K. (2004). "A Tutorial on MM Algorithms". The American Statistician 58: 30–37.

Page 59:

References

17. K. Nigam, A. McCallum, and T. Mitchell, "Learning to Classify Text From Labeled and Unlabeled Documents," pp. 792–799, AAAI Press, 1998.

18. G. Cong, W. S. Lee, H. Wu, and B. Liu, "Semi-Supervised Text Classification Using Partitioned EM," in Database Systems for Advanced Applications, pp. 482–493, 2004.

19. K. Nigam, A. McCallum, and T. M. Mitchell, Semi-Supervised Text Classification Using EM, ch. 3. Boston: MIT Press, 2006.

Page 60:

The End

Thanks very much!

Page 61:

Contents

1. Introduction

2. Example − Silly Example

3. Example − Same Problem with Hidden Info

4. Example − Normal Sample

5. EM-algorithm Explained

6. EM-Algorithm Running on GMM

7. EM-algorithm Application: Semi-Supervised Text Classification

8. Questions

Page 62:

Question #1

Describe a data mining application for EM.

– Using EM to improve a classifier by augmenting labeled training data with unlabeled data.

– The example given illustrated a method for text classification that starts from an initial naive Bayes classifier and assumes documents are generated probabilistically according to an underlying mixture model.

Page 63:

Question #2

What are the EM algorithm initialization methods?

– Random guess.

– Any general classifier that builds a parameterized probability distribution model (e.g., naive Bayes).

– Initialization from k-means: after a few iterations of k-means, use the resulting parameters to initialize EM.
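A small sketch of the third option, assuming hard cluster assignments from a few k-means iterations are already available in an array called labels (variable and function names are illustrative); the resulting values could serve as the starting point for an EM routine such as the one sketched earlier:

import numpy as np

def init_from_kmeans(x, labels, n_components):
    """Turn hard k-means assignments into initial mixture parameters for EM.

    Assumes x and labels are 1-D numpy arrays and every cluster is non-empty.
    """
    weights = np.array([(labels == k).mean() for k in range(n_components)])
    mu = np.array([x[labels == k].mean() for k in range(n_components)])
    var = np.array([x[labels == k].var() for k in range(n_components)])
    return weights, mu, var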

Page 64:

Question #3

What are the main advantages of parametric methods?

– You can easily change the model to adapt to different distributions of data.

– Knowledge representation is very compact: once the model is selected, it is represented by a fixed number of parameters.

– The number of parameters does not increase as the amount of training data grows.