
Mar 30, 2015

Transcript
Page 1:

Course: Neural Networks. Instructor: Professor L. Behera.

Joy Bhattacharjee, Department of ChE, IIT Kanpur.

Johann Peter Gustav Lejeune Dirichlet

DIRICHLET PROCESS

Page 2:

What is a Dirichlet Process?

The Dirichlet process is a stochastic process used in Bayesian nonparametric models of data, particularly in Dirichlet process mixture models (also known as infinite mixture models). It is a distribution over distributions: each draw from a Dirichlet process is itself a distribution. It is called a Dirichlet process because its finite-dimensional marginal distributions are Dirichlet distributed.

Page 3:

Page 4:

Page 5:

Dirichlet Priors

• A distribution over possible parameter vectors of the multinomial distribution

• Thus values must lie in the k-dimensional simplex
• The Beta distribution is the 2-parameter special case
• Expectation: E[θi] = αi / Σj αj

• A conjugate prior to the multinomial

(plate diagram: multinomial observations xi, i = 1, …, N)
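The facts on this slide (Dirichlet draws live on the simplex; the Dirichlet is conjugate to the multinomial) can be sketched in code. This is a minimal illustration using only the Python standard library; the function names are mine, and sampling via normalized Gamma draws is a standard construction rather than anything the slide specifies:

```python
import random

def sample_dirichlet(alpha, seed=None):
    """Draw one sample from Dir(alpha) by normalizing independent
    Gamma(alpha_i, 1) draws; the result lies on the simplex."""
    rng = random.Random(seed)
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

def posterior_alpha(alpha, counts):
    """Conjugacy: after observing multinomial counts n_i, the
    posterior is Dir(alpha_1 + n_1, ..., alpha_k + n_k)."""
    return [a + n for a, n in zip(alpha, counts)]
```

The posterior mean (αi + ni) / Σj (αj + nj) then follows the expectation formula above with the updated parameters.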

Page 6:

What is a Dirichlet Distribution?

Methods to generate a Dirichlet distribution:
1. Polya's Urn
2. Stick Breaking
3. Chinese Restaurant Process

Page 7:

Samples from a DP

Page 8:

Page 9:

Page 10:

Page 11:

Dirichlet Distribution

Page 12:

Polya’s Urn scheme:

Suppose we want to generate a realization of Q ~ Dir(α). To start, put αi balls of color i, for i = 1, 2, …, k, in an urn. Note that αi > 0 is not necessarily an integer, so we may have a fractional or even an irrational number of balls of color i in our urn! At each iteration, draw one ball uniformly at random from the urn, and then place it back into the urn along with an additional ball of the same color. As we iterate this procedure more and more times, the proportions of balls of each color converge to a pmf that is a sample from the distribution Dir(α).
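The urn procedure above can be simulated directly; fractional "ball counts" pose no problem because we only ever need the counts as sampling weights. A minimal sketch (function name and defaults are mine):

```python
import random

def polya_urn(alpha, steps, seed=None):
    """Simulate Polya's urn: start with alpha[i] (possibly fractional)
    balls of color i; repeatedly draw a color with probability
    proportional to its current count and add one ball of that color.
    Returns the final color proportions, which for large `steps`
    approximate a single draw from Dir(alpha)."""
    rng = random.Random(seed)
    counts = list(alpha)
    for _ in range(steps):
        i = rng.choices(range(len(counts)), weights=counts)[0]
        counts[i] += 1
    total = sum(counts)
    return [c / total for c in counts]
```

Each call gives one point on the simplex; repeating the whole simulation many times traces out the Dir(α) distribution itself.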

Page 13:

Mathematical form:

f(θ1, …, θk; α1, …, αk) = (Γ(α1 + … + αk) / (Γ(α1) ⋯ Γ(αk))) · θ1^(α1−1) ⋯ θk^(αk−1), where θi ≥ 0 and θ1 + … + θk = 1.

Page 14:

Stick Breaking Process

The stick-breaking approach to generating a random vector with a Dir(α) distribution iteratively breaks a stick of length 1 into k pieces in such a way that the lengths of the pieces follow a Dir(α) distribution. The following figure illustrates this process with simulation results.

Page 15:

Stick Breaking Process

(figure: stick-breaking simulation with base measure G0. Starting from a unit stick, successive break fractions 0.4, 0.5, 0.8 yield pieces 0.4, then 0.5 × 0.6 = 0.3, then 0.8 × 0.3 = 0.24.)
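The stick-breaking construction is short enough to code directly. In the DP case the break fractions are drawn Beta(1, α); here is a minimal sketch (function names are mine, and the truncation to a finite number of pieces is an approximation of the infinite process):

```python
import random

def break_stick(fractions):
    """Deterministically break a unit stick: fractions[j] is the share
    of the *remaining* stick taken as piece j."""
    remaining, pieces = 1.0, []
    for f in fractions:
        pieces.append(f * remaining)
        remaining *= 1.0 - f
    return pieces

def dp_stick_breaking(alpha, n_pieces, seed=None):
    """Truncated DP stick-breaking: break fractions drawn Beta(1, alpha).
    Returns the first n_pieces mixture weights (they sum to < 1)."""
    rng = random.Random(seed)
    return break_stick([rng.betavariate(1.0, alpha) for _ in range(n_pieces)])
```

With the fixed fractions 0.4, 0.5, 0.8 from the figure, `break_stick` reproduces the piece lengths 0.4, 0.3, 0.24.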

Page 16:

Chinese Restaurant Process

Page 17:

Chinese Restaurant Process
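The Chinese Restaurant Process seats customers one at a time: each new customer joins an occupied table with probability proportional to the number of people already there, or starts a new table with probability proportional to α. A minimal sketch of this standard scheme (function name and return format are mine):

```python
import random

def chinese_restaurant_process(n_customers, alpha, seed=None):
    """Seat customers sequentially: customer i joins table t with
    probability |t| / (i + alpha), or opens a new table with
    probability alpha / (i + alpha). Returns (table index per
    customer, table occupancy counts)."""
    rng = random.Random(seed)
    tables = []        # tables[t] = number of customers at table t
    assignments = []
    for _ in range(n_customers):
        weights = tables + [alpha]   # last slot = open a new table
        t = rng.choices(range(len(weights)), weights=weights)[0]
        if t == len(tables):
            tables.append(1)
        else:
            tables[t] += 1
        assignments.append(t)
    return assignments, tables
```

The rich-get-richer effect is visible in the occupancy counts: a few large tables and a slowly growing number of small ones, matching the O(log n) growth of distinct values noted later.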

Page 18:

Nested CRP

• To generate a document given a tree with L levels:
– Choose a path from the root of the tree to a leaf
– Draw a vector of topic mixing proportions from an L-dimensional Dirichlet
– Generate the words in the document from a mixture of the topics along the path, with the drawn mixing proportions

Page 19:

Nested CRP

(figure: example tree paths chosen on Day 1, Day 2, and Day 3)

Page 20:

Properties of the DP

• Let (Θ, Σ) be a measurable space, G0 a probability measure on that space, and α a positive real number

• A Dirichlet process DP(α, G0) is the distribution of a random probability measure G over (Θ, Σ) such that, for all finite measurable partitions (A1, …, Ar) of Θ,
(G(A1), …, G(Ar)) ~ Dir(α G0(A1), …, α G0(Ar))

• Draws from G are generally not distinct (G is discrete with probability one)
• The number of distinct values grows as O(log n)

Page 21:

• In general, an infinite set of random variables is said to be infinitely exchangeable if for every finite subset {x1, …, xn} and for any permutation π we have
p(x1, …, xn) = p(xπ(1), …, xπ(n))

• Note that infinite exchangeability is not the same as being independent and identically distributed (i.i.d.)!
• Using De Finetti's theorem, it is possible to show that our draws are infinitely exchangeable
• Thus the mixture components may be sampled in any order.

Page 22:

Mixture Model Inference

• We want to find a clustering of the data: an assignment of values to the hidden class variable

• Sometimes we also want the component parameters

• In most finite mixture models, this can be found with EM

• The Dirichlet process is a non-parametric prior, and doesn’t permit EM

• We use Gibbs sampling instead
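The Gibbs approach mentioned above can be sketched for the simplest case. What follows is a minimal illustration in the style of Neal's collapsed Gibbs sampler (Algorithm 3) for a 1-D DP mixture of Gaussians with known variance; the function names and hyperparameter defaults (alpha, sigma2, mu0, tau2) are illustrative assumptions, not the slides' method:

```python
import math, random

def dp_gibbs(data, alpha=1.0, sigma2=1.0, mu0=0.0, tau2=100.0,
             iters=50, seed=None):
    """Collapsed Gibbs sampling for a DP mixture of 1-D Gaussians with
    known variance sigma2 and a N(mu0, tau2) prior on each cluster
    mean. Returns the final cluster label of each point."""
    rng = random.Random(seed)

    def predictive(x, n, s):
        # Posterior-predictive density of x for a cluster holding n
        # points that sum to s (n = 0 gives the prior predictive).
        post_var = 1.0 / (1.0 / tau2 + n / sigma2)
        post_mean = post_var * (mu0 / tau2 + s / sigma2)
        v = post_var + sigma2
        return math.exp(-(x - post_mean) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    z = [0] * len(data)                 # start with one big cluster
    counts, sums = {0: len(data)}, {0: sum(data)}
    for _ in range(iters):
        for i, x in enumerate(data):
            # remove point i from its current cluster
            k = z[i]
            counts[k] -= 1
            sums[k] -= x
            if counts[k] == 0:
                del counts[k], sums[k]
            # CRP weights: existing clusters by size, new cluster by alpha
            labels = list(counts)
            weights = [counts[k] * predictive(x, counts[k], sums[k]) for k in labels]
            labels.append(max(counts, default=-1) + 1)
            weights.append(alpha * predictive(x, 0, 0.0))
            k = rng.choices(labels, weights=weights)[0]
            z[i] = k
            counts[k] = counts.get(k, 0) + 1
            sums[k] = sums.get(k, 0.0) + x
    return z
```

Because the cluster means are integrated out, each sweep only resamples the assignment of one point at a time given all the others, exactly the setting where the exchangeability property above is needed.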

Page 23:

Finite mixture model

Page 24:

Infinite mixture model

Page 25:

DP Mixture model

Page 26:

Agglomerative Clustering

• Pros: Doesn’t need generative model (number of clusters, parametric distribution)

• Cons: Ad-hoc, no probabilistic foundation, intractable for large data sets

(Num Clusters, Max Distance): (20, 0), (19, 5), (18, 5), (17, 5), (16, 8), (15, 8), (14, 8), (13, 8), (12, 8), (11, 9), (10, 9), (9, 9), (8, 10), (7, 10), (6, 10), (5, 10), (4, 12), (3, 12), (2, 15), (1, 16)

Page 27:

Mixture Model Clustering

• Examples: K-means, mixture of Gaussians, Naïve Bayes
• Pros: Sound probabilistic foundation, efficient even for large data sets
• Cons: Requires generative model, including number of clusters (mixture components)

Page 28:

Applications

• Clustering in Natural Language Processing
– Document clustering for topic, genre, sentiment, …
– Word clustering for part of speech (POS), word sense disambiguation (WSD), synonymy, …
– Topic clustering across documents
– Noun coreference: we don't know how many entities there are
– Other identity uncertainty problems: deduping, etc.
– Grammar induction

• Sequence modeling: the "infinite HMM"
– Topic segmentation
– Sequence models for POS tagging

• Society modeling in public places
• Unsupervised machine learning

Page 29:

References:
• Bela A. Frigyik, Amol Kapila, and Maya R. Gupta, University of Washington, Seattle. Introduction to the Dirichlet Distribution and Related Processes. UWEE Technical Report UWEETR-2010-0006.
• Yee Whye Teh, University College London: Dirichlet Process.
• Khalid El-Arini, Select Lab meeting, October 2006.
• Teg Grenager, Natural Language Processing, Stanford University: Introduction to the Chinese Restaurant Problem and the Stick-Breaking Scheme.
• Wikipedia

Page 30:

Questions?
• Suggest some distributions that can use the Dirichlet process to find classes.
• What are the applications in finite mixture models?
• Comment on: the DP of a cluster is also a Dirichlet distribution.