Page 1

Chinese Restaurants and Stick-Breaking: An Introduction to the Dirichlet Process

Teg Grenager, NLP Group Lunch, February 24, 2005

Page 2

Agenda

Motivation
Mixture Models
Dirichlet Process
Gibbs Sampling
Applications

Page 3

Clustering

Goal: learn a partition of the data, such that:
Data within classes are “similar”
Classes are “different” from each other

Two very different approaches:
Agglomerative: build up clusters by iteratively sticking similar things together
Mixture model: learn a generative model over the data, treating the classes as hidden variables

Page 4

Agglomerative Clustering

Pros: doesn’t need a generative model (number of clusters, parametric distribution)

Cons: ad hoc, no probabilistic foundation, intractable for large data sets

[Figure: plot of Max Distance against Num Clusters as clusters are merged, from 20 clusters down to 1.]

Page 5

Mixture Model Clustering

Examples: K-means, mixture of Gaussians, Naïve Bayes
Pros: sound probabilistic foundation, efficient even for large data sets
Cons: requires a generative model, including the number of clusters (mixture components)

Page 6

Problem

[Figure: Distance/Likelihood plotted against Number of Clusters (1-8); the fit keeps improving as clusters are added, so the objective alone cannot pick the number of clusters.]

Page 7

Big Idea

Want to use a generative model, but don’t want to decide the number of clusters in advance

Suggestion: put each datum in its own cluster
Problem: the probability of two clusters colliding is zero under any density function, so there is no “stickiness”

Solution: instead of a density function, use a statistical process where the probability of two clusters falling together is positive

Best of both worlds: stickiness with a variable number of clusters

Page 8

Finite Mixture Model

[Figure: plate diagrams for finite mixture models. Gaussian: class c generates observation x; in plate notation, mixing weights p generate class ci, which generates xi, repeated N times. Naïve Bayes: class c generates features x1 … xM; in plate notation, p generates ci, which generates features xij, over N data and M features.]
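To make the generative story concrete, here is a minimal sketch of sampling from a finite 1-D Gaussian mixture; the component count K and all parameter values are illustrative, not from the slides.

import numpy as np

rng = np.random.default_rng(0)

K = 3                                   # number of components, fixed in advance
weights = rng.dirichlet(np.ones(K))     # mixing proportions p ~ Dirichlet(1,...,1)
means = rng.normal(0.0, 10.0, size=K)   # component means drawn from a broad prior
sigma = 1.0                             # shared, known standard deviation

def sample(n):
    """Draw n points: first a hidden class c_i, then x_i from that component."""
    c = rng.choice(K, size=n, p=weights)    # hidden class variable c_i
    x = rng.normal(means[c], sigma)         # observation x_i given c_i
    return c, x

c, x = sample(1000)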

Page 9

Dirichlet Priors (Review)

A distribution over possible parameter vectors of the multinomial distribution
Thus values must lie in the probability simplex (k nonnegative entries that sum to 1)
The Beta distribution is the 2-parameter special case
Expectation: E[θi] = αi / Σj αj
A conjugate prior to the multinomial
Explicit formulation is ugly!

[Figure: plate diagram, draws xi from the multinomial repeated N times, with a Dirichlet prior on the multinomial parameters.]
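For reference, the explicit formulation in question is the standard Dirichlet density:

p(\theta_1,\dots,\theta_k \mid \alpha_1,\dots,\alpha_k)
  = \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)}
    \prod_{i=1}^{k} \theta_i^{\alpha_i - 1},
\qquad
\mathbb{E}[\theta_i] = \frac{\alpha_i}{\sum_{j=1}^{k}\alpha_j}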

Page 10

Infinite Mixture Model

[Figure: plate diagram for the infinite mixture model: base distribution G0 and mixing weights p generate class ci, which generates xi, repeated N times, with an unbounded number of components.]

Page 11

Chinese Restaurant Process
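In the Chinese restaurant process, customer n sits at an occupied table k with probability nk / (n − 1 + α), where nk is the number of customers already at table k, and starts a new table with probability α / (n − 1 + α). A minimal sketch of sampling a seating arrangement (α and the customer count are illustrative):

import numpy as np

rng = np.random.default_rng(0)

def crp(n, alpha):
    """Seat n customers; return each customer's table index."""
    tables = []                       # tables[k] = number of customers at table k
    seating = []
    for i in range(n):                # customer i is the (i + 1)-th to arrive
        probs = np.array(tables + [alpha], dtype=float)
        probs /= i + alpha            # normalizer is (n - 1) + alpha
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)          # open a new table
        else:
            tables[k] += 1
        seating.append(k)
    return seating

print(crp(20, alpha=1.0))

Tables correspond to clusters: popular tables attract new customers (the “stickiness” from slide 7), while α controls how often new tables open.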

Page 12

DP Mixture Model

[Figure: two equivalent plate diagrams for the DP mixture model. Left: G ~ DP(α, G0) generates per-datum parameters θi, which generate xi, repeated N times. Right: the indicator form, with mixing weights p generating class ci and xi, and cluster parameters drawn from G0.]

Page 13

Stick-breaking Process

The atoms are drawn from G0; the weights come from repeatedly breaking off a fraction of the remaining stick:

Remaining stick   Break fraction   Weight
1.0               0.4              0.4
0.6               0.5              0.3
0.3               0.8              0.24
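A minimal sketch of the construction, truncated to T sticks; the Beta(1, α) break fractions give the stick-breaking weights of a DP, and T and α are illustrative:

import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, T):
    """Return the first T stick-breaking weights of a DP(alpha, G0)."""
    betas = rng.beta(1.0, alpha, size=T)                       # break fractions
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    return betas * remaining                                   # weights pi_k

weights = stick_breaking(alpha=1.0, T=10)
# With break fractions 0.4, 0.5, 0.8 the first weights are 0.4, 0.3, 0.24,
# exactly the rows in the table above.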

Page 14

Properties of the DP

Let (Θ, B) be a measurable space, G0 be a probability measure on the space, and α be a positive real number

A Dirichlet process is any distribution of a random probability measure G over (Θ, B) such that, for all finite measurable partitions (A1,…,Ar) of Θ:

(G(A1),…,G(Ar)) ~ Dirichlet(αG0(A1),…,αG0(Ar))

Draws θ1, θ2,… from a G sampled from the DP are generally not distinct
The number of distinct values grows as O(log n)
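Marginalizing out G gives the predictive rule that makes the repeats explicit (the Pólya urn scheme underlying the Chinese restaurant process):

\theta_n \mid \theta_1,\dots,\theta_{n-1}
  \;\sim\; \frac{1}{n-1+\alpha}\sum_{i=1}^{n-1}\delta_{\theta_i}
  \;+\; \frac{\alpha}{n-1+\alpha}\,G_0

Each new draw repeats an earlier value with probability proportional to its count, or draws a fresh value from G0 with probability proportional to α.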

Page 15

Infinite Exchangeability

In general, an infinite set of random variables is said to be infinitely exchangeable if for every finite subset {x1,…,xn} and for any permutation π we have

p(x1,…,xn) = p(xπ(1),…,xπ(n))

Note that infinite exchangeability is not the same as being independent and identically distributed (i.i.d.)!

Using de Finetti’s theorem, it is possible to show that our draws are infinitely exchangeable

Thus the mixture components may be sampled in any order
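De Finetti’s theorem says that any infinitely exchangeable sequence is a mixture of i.i.d. sequences: there is a random measure G such that

p(x_1,\dots,x_n) = \int \prod_{i=1}^{n} p(x_i \mid G)\; dP(G)

For DP mixtures the mixing distribution P is the Dirichlet process itself, which is what licenses sampling the components in any order.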

Page 16

Mixture Model Inference

We want to find a clustering of the data: an assignment of values to the hidden class variable

Sometimes we also want the component parameters

In most finite mixture models, this can be found with EM

The Dirichlet process is a nonparametric prior with no fixed, finite parameter vector, so it doesn’t permit EM directly

We use Gibbs sampling instead

Page 17

Gibbs Sampling 1

Algorithm 1: integrate out G, and sample the θi directly, conditioned on everything else

This is inefficient, because we update the cluster information for only one datum at a time

[Figure: plate diagrams. Left: the DP mixture with G explicit (G0 → G → θi → xi, plate over N). Right: the same model with G integrated out (G0 → θi → xi, plate over N).]

Page 18

Gibbs Sampling 2

Algorithm 2:
Reintroduce a cluster variable ci which takes on values that are the names c of the clusters
Store the parameters that are shared by all data in class c in a new variable φc

[Figure: plate diagrams. Left: the DP mixture with G explicit (G0 → G → θi → xi, plate over N). Right: the collapsed model with cluster indicators ci over the N data and per-cluster parameters φc drawn from G0.]

Page 19

Gibbs Sampling 2 (cont.)

Algorithm 2: For i = 1,…,N sample ci from

p(ci = c | c−i, xi) ∝ n−i,c ∫ F(xi | φ) dH−i,c(φ)   for an existing cluster c
p(ci = cnew | c−i, xi) ∝ α ∫ F(xi | φ) dG0(φ)       for a new cluster

where H−i,c is the posterior distribution of φc based on the prior G0 and all observations xj for which j ≠ i and cj = c

Repeat; this works well in practice

Note: can also use variational methods (other than EM)
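A minimal sketch of this collapsed sampler for a DP mixture of 1-D Gaussians with known variance σ², where the base measure G0 = Normal(μ0, τ²) is conjugate, so the integrals over H−i,c and G0 are closed-form Gaussian predictives. All hyperparameter values are illustrative:

import math
import numpy as np

rng = np.random.default_rng(0)

def gibbs_dpmm(x, alpha=1.0, sigma2=1.0, mu0=0.0, tau2=10.0, iters=50):
    """Collapsed Gibbs sampling for a DP mixture of 1-D Gaussians."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    c = np.zeros(n, dtype=int)           # start with all data in one cluster
    for _ in range(iters):
        for i in range(n):
            c[i] = -1                    # remove datum i from its cluster
            labels, counts = np.unique(c[c >= 0], return_counts=True)
            logp = []
            for lab, cnt in zip(labels, counts):
                members = x[c == lab]
                # posterior H_{-i,c} of this cluster's mean, then its predictive
                lam = 1.0 / tau2 + len(members) / sigma2
                m = (mu0 / tau2 + members.sum() / sigma2) / lam
                v = sigma2 + 1.0 / lam
                logp.append(math.log(cnt)
                            - 0.5 * math.log(2 * math.pi * v)
                            - 0.5 * (x[i] - m) ** 2 / v)
            # a brand-new cluster: predictive under the base measure G0
            v0 = sigma2 + tau2
            logp.append(math.log(alpha)
                        - 0.5 * math.log(2 * math.pi * v0)
                        - 0.5 * (x[i] - mu0) ** 2 / v0)
            p = np.exp(np.array(logp) - max(logp))   # normalize in log space
            p /= p.sum()
            k = rng.choice(len(p), p=p)
            if k < len(labels):
                c[i] = labels[k]
            else:
                c[i] = labels.max() + 1 if len(labels) else 0
    return c

Running it on, say, np.concatenate([rng.normal(-5, 1, 50), rng.normal(5, 1, 50)]) typically recovers two clusters without the number of clusters being specified anywhere.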

Page 20

NLP Applications

Clustering
Document clustering for topic, genre, sentiment,…
Word clustering for POS, WSD, synonymy,…
Topic clustering across documents (see Blei et al., 2004 and Teh et al., 2004)
Noun coreference: don’t know how many entities there are
Other identity uncertainty problems: deduping, etc.
Grammar induction

Sequence modeling
The “infinite HMM”
Topic segmentation (see Grenager et al., 2005)
Sequence models for POS tagging

Others?

Page 21

Nested CRP

[Figure: the nested CRP unrolled over Day 1, Day 2, and Day 3: each day’s restaurant choice selects a child of the previous day’s table, tracing a path down a tree of restaurants.]

Page 22

Nested CRP (cont.)

To generate a document given a tree with L levels:
Choose a path from the root of the tree to a leaf
Draw a vector θ of topic mixing proportions from an L-dimensional Dirichlet
Generate the words in the document from a mixture of the topics along the path, with mixing proportions θ
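A minimal sketch of this generative step, assuming the tree and its per-node topics are already given; the Node class, the uniform child choice (a stand-in for the per-node CRP, whose seating counts come from earlier documents), and all values are illustrative:

import numpy as np

rng = np.random.default_rng(0)

class Node:
    """A tree node holding one topic: a distribution over the vocabulary."""
    def __init__(self, topic, children=()):
        self.topic = np.asarray(topic, dtype=float)
        self.children = list(children)

def generate_doc(root, n_words, alpha=1.0):
    # 1. Choose a root-to-leaf path (uniform child choice stands in for
    #    the nested CRP's seating rule).
    path = [root]
    while path[-1].children:
        kids = path[-1].children
        path.append(kids[rng.integers(len(kids))])
    L = len(path)
    # 2. Draw topic mixing proportions theta from an L-dimensional Dirichlet.
    theta = rng.dirichlet(np.full(L, alpha))
    # 3. Generate each word from the topic at a level sampled from theta.
    words = []
    for _ in range(n_words):
        level = rng.choice(L, p=theta)
        words.append(rng.choice(len(path[level].topic), p=path[level].topic))
    return words

# Tiny 2-level example over a 3-word vocabulary:
leaf = Node([0.1, 0.1, 0.8])
root = Node([0.6, 0.3, 0.1], children=[leaf])
print(generate_doc(root, n_words=10))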

Page 23

Nested CRP (cont.)


Page 24

References

Seminal:
T.S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1:209-230, 1973.
C.E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics, 2:1152-1174, 1974.

Foundational:
M.D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577-588, 1995.
S.N. MacEachern and P. Müller. Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics, 7:223-238, 1998.
R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249-265, 2000.
C.E. Rasmussen. The infinite Gaussian mixture model. NIPS, 2000.
H. Ishwaran and L. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96:161-173, 2001.

NLP:
D.M. Blei, T.L. Griffiths, M.I. Jordan, and J.B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. NIPS, 2004.
Y.W. Teh, M.I. Jordan, M.J. Beal, and D.M. Blei. Hierarchical Dirichlet processes. NIPS, 2004.