High-Dimensional Unsupervised Selection and Estimation of a Finite Generalized Dirichlet Mixture Model Based on Minimum Message Length
by Nizar Bouguila and Djemel Ziou
Discussion led by Qi An
Duke University Machine Learning Group
Jan 03, 2016
Outline
• Introduction
• The generalized Dirichlet mixture
• The minimum message length (MML) criterion
• Fisher information matrix and priors
• Density estimation and model selection
• Experimental results
• Conclusions
Introduction
• How to determine the number of components in a mixture model for high-dimensional data?
– Stochastic and resampling methods (slow)
• Implementation of model selection criteria
• Fully Bayesian approaches
– Deterministic methods (fast)
• Approximate Bayesian criteria
• Information/coding-theoretic criteria
– Minimum message length (MML)
– Akaike's information criterion (AIC)
The generalized Dirichlet distribution
• A d-dimensional generalized Dirichlet distribution is defined as

p(X_1, \ldots, X_d) = \prod_{i=1}^{d} \frac{\Gamma(\alpha_i + \beta_i)}{\Gamma(\alpha_i)\Gamma(\beta_i)} \, X_i^{\alpha_i - 1} \Big(1 - \sum_{j=1}^{i} X_j\Big)^{\gamma_i}

where \sum_{i=1}^{d} X_i < 1, 0 < X_i < 1, \alpha_i > 0, \beta_i > 0, \gamma_i = \beta_i - \alpha_{i+1} - \beta_{i+1} for i = 1, \ldots, d-1, and \gamma_d = \beta_d - 1.
• It reduces to the Dirichlet distribution when \beta_i = \alpha_{i+1} + \beta_{i+1}.
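As a concrete illustration (mine, not code from the paper), the GDD log-density above can be evaluated directly; the sketch below assumes x satisfies the support constraints and a, b are the positive shape parameters:

```python
import numpy as np
from scipy.special import gammaln

def gdd_logpdf(x, a, b):
    """Log-density of the generalized Dirichlet distribution.

    x : (d,) array with x_i > 0 and sum(x) < 1
    a, b : (d,) arrays of shape parameters (alpha_i > 0, beta_i > 0)
    """
    x, a, b = map(np.asarray, (x, a, b))
    d = len(x)
    # gamma_i = beta_i - alpha_{i+1} - beta_{i+1} for i < d, gamma_d = beta_d - 1
    gamma = np.empty(d)
    gamma[:-1] = b[:-1] - a[1:] - b[1:]
    gamma[-1] = b[-1] - 1.0
    # log normalizing constant of each one-dimensional factor
    log_norm = gammaln(a + b) - gammaln(a) - gammaln(b)
    # remaining stick lengths 1 - (x_1 + ... + x_i)
    tail = 1.0 - np.cumsum(x)
    return np.sum(log_norm + (a - 1.0) * np.log(x) + gamma * np.log(tail))
```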
The generalized Dirichlet distribution
For the generalized Dirichlet distribution, the mean of each coordinate is

E[X_i] = \frac{\alpha_i}{\alpha_i + \beta_i} \prod_{j=1}^{i-1} \frac{\beta_j}{\alpha_j + \beta_j}

The GDD has a more general covariance structure than the Dirichlet distribution (coordinates are not forced to be negatively correlated), and it is conjugate to the multinomial distribution.
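One way to see this structure is through the stick-breaking construction of the GDD (Connor and Mosimann): draw independent Z_i ~ Beta(\alpha_i, \beta_i) and set X_i = Z_i \prod_{j<i}(1 - Z_j). The sketch below, an illustration rather than anything from the paper, samples this way and checks the closed-form mean above by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([2.0, 3.0, 4.0])
b = np.array([5.0, 4.0, 3.0])

# Stick-breaking: Z_i ~ Beta(a_i, b_i) independently,
# then X_i = Z_i * prod_{j<i} (1 - Z_j)
z = rng.beta(a, b, size=(100_000, 3))
sticks = np.cumprod(1.0 - z, axis=1)
x = z.copy()
x[:, 1:] *= sticks[:, :-1]

mc_mean = x.mean(axis=0)
exact = (a / (a + b)) * np.concatenate(([1.0], np.cumprod(b / (a + b))[:-1]))
print(mc_mean, exact)  # should agree to about three decimal places
```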
GDD vs. Gaussian
• The GDD has fewer parameters to estimate (2d per component versus d + d(d+1)/2 for a full-covariance Gaussian), so estimation can be more accurate.
• The GDD is supported on [0,1] and can be extended to any compact support [A,B], which better matches the bounded nature of the data (e.g., proportions).
Beta distribution: p(u \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \, u^{\alpha-1} (1-u)^{\beta-1}, \quad 0 < u < 1

Beta type-II distribution: p(v \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \, v^{\alpha-1} (1+v)^{-(\alpha+\beta)}, \quad v > 0

They are equal under the change of variables u = v/(1+v).
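A quick numerical check of this change of variables (my illustration, not from the paper): if v follows a Beta type-II (beta prime) law, then u = v/(1+v) follows the corresponding Beta law, with densities related by the Jacobian dv/du = 1/(1-u)^2:

```python
import numpy as np
from scipy.stats import beta, betaprime

alpha, b_param = 2.5, 4.0
u = np.linspace(0.05, 0.95, 5)
v = u / (1.0 - u)               # inverse of u = v / (1 + v)
jacobian = 1.0 / (1.0 - u)**2   # dv/du

# p_beta(u) should equal p_betaprime(v) * |dv/du|
print(beta.pdf(u, alpha, b_param))
print(betaprime.pdf(v, alpha, b_param) * jacobian)
```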
A GDD mixture model
A generalized Dirichlet mixture model with M components is

p(X \mid \Theta) = \sum_{j=1}^{M} p(j) \, p(X \mid \alpha_j)

where the p(j) are the mixing weights (\sum_{j=1}^{M} p(j) = 1) and each p(X \mid \alpha_j) takes the form of the GDD.
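As a minimal sketch (mine, building on the gdd_logpdf sketch above), the mixture log-density can be evaluated stably with a log-sum-exp over components:

```python
import numpy as np
from scipy.special import logsumexp

def gdd_mixture_logpdf(x, weights, alphas, betas):
    """log p(x | Theta) = log sum_j p(j) * GDD(x | alpha_j, beta_j).

    weights : (M,) mixing proportions summing to 1
    alphas, betas : (M, d) per-component GDD parameters
    """
    comp = np.array([gdd_logpdf(x, a, b) for a, b in zip(alphas, betas)])
    return logsumexp(np.log(weights) + comp)
```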
The MML criterion
• The message length is defined as the negative logarithm of the joint probability of the model and the data (equivalently, up to an additive constant, of the posterior probability).
• After placing an explicit prior h(\Theta) over the parameters, the message length for a mixture of distributions is given by

MessLen \approx -\log h(\Theta) - \log p(X \mid \Theta) + \frac{1}{2}\log |F(\Theta)| + \frac{N_p}{2}\left(1 + \log \kappa_{N_p}\right)

where h(\Theta) is the prior, p(X \mid \Theta) the likelihood, F(\Theta) the Fisher information matrix, N_p the number of free parameters, and \kappa_{N_p} the optimal quantization lattice constant.
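A minimal sketch of this formula (mine, not the authors' code); the prior, likelihood, and Fisher terms are assumed to be computed elsewhere, and \kappa is approximated by the one-dimensional value 1/12:

```python
import numpy as np

def message_length(log_prior, log_lik, log_det_fisher, n_params,
                   kappa=1.0 / 12.0):
    """Wallace-Freeman (MML87) two-part message length.

    kappa is the optimal quantization lattice constant; 1/12 is the
    one-dimensional value, often used as an approximation (the constants
    decrease toward 1/(2*pi*e) as n_params grows).
    """
    return (-log_prior
            - log_lik
            + 0.5 * log_det_fisher
            + 0.5 * n_params * (1.0 + np.log(kappa)))
```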
Fisher Information matrix
• The Fisher information matrix is the expected value of the Hessian of the negative log-likelihood:

F(\Theta) = -E\left[\frac{\partial^2 \log p(X \mid \Theta)}{\partial \Theta \, \partial \Theta^\top}\right]

where \Theta collects the mixing weights P and the component parameters \alpha_j. Treating these as independent blocks gives |F(\Theta)| \approx |F(P)| \prod_{j=1}^{M} |F(\alpha_j)|.
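For the mixing-weight block, the Fisher information of N multinomial draws has the closed-form determinant |F(P)| = N^{M-1} / \prod_{j=1}^{M} p_j; a small sketch (my illustration):

```python
import numpy as np

def log_det_fisher_weights(weights, n_samples):
    """log |F(P)| for the mixing weights of an M-component mixture:
    |F(P)| = N^(M-1) / prod_j p_j  (multinomial Fisher determinant)."""
    weights = np.asarray(weights)
    m = len(weights)
    return (m - 1) * np.log(n_samples) - np.log(weights).sum()
```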
Prior distribution
• Assume independence between the different components:

h(\Theta) = h(P) \prod_{j=1}^{M} h(\alpha_j)

– Mixture weights: h(P)
– GDD parameters: h(\alpha_j)
• Place a Dirichlet distribution on P and a generalized Dirichlet distribution on each \alpha_j, with all parameters set to 1.
Message length
• After obtaining the Fisher information and specifying the prior distributions, substituting them into the MML formula above yields the message length in closed form; this is the quantity minimized over the number of components.
Estimation and selection algorithm
• The authors use an EM algorithm to estimate the mixture parameters.
• To reduce the computational cost and mitigate the local-maxima problem, they implement a fairly sophisticated initialization algorithm.
• The whole algorithm is summarized on the next slide.
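To make the overall flow concrete, here is a schematic sketch (not the authors' pseudocode) of the selection loop: for each candidate M, run EM to a local maximum of the likelihood, then keep the M with the smallest message length. fit_gdd_mixture_em is a hypothetical stand-in for the paper's EM-plus-initialization routine, and the prior term is left as a placeholder:

```python
import numpy as np

def select_num_components(data, m_candidates, fit_gdd_mixture_em):
    """Pick the number of components that minimizes the message length.

    fit_gdd_mixture_em(data, m) is assumed to return
    (weights, alphas, betas, log_lik) after EM converges.
    """
    best = None
    for m in m_candidates:
        weights, alphas, betas, log_lik = fit_gdd_mixture_em(data, m)
        d = data.shape[1]
        n_params = (m - 1) + m * 2 * d   # weights + (alpha, beta) per dimension
        log_prior = 0.0                  # placeholder: plug in the Dirichlet/GD priors
        log_det_f = log_det_fisher_weights(weights, len(data))
        mml = message_length(log_prior, log_lik, log_det_f, n_params)
        if best is None or mml < best[0]:
            best = (mml, m, (weights, alphas, betas))
    return best
```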
Experimental results
The correct numbers of components are 5, 6, and 7, respectively.
Experimental results
• Web mining:
– Train on multiple classes of labeled documents
– Use the trained mixture to predict the labels of test samples
– Use the frequencies of the top 200 words as features
Conclusions
• An MML-based criterion is proposed to select the number of components in generalized Dirichlet mixtures.
• The full dimensionality of the data is used.
• Generalized Dirichlet mixtures allow more modeling flexibility than mixtures of Gaussians.
• The results clearly indicate that the MML and LEC model selection methods outperform the other methods.