
Machine Learning in Computer Vision: Mixture Models, Part 4

Oren Freifeld, Computer Science, Ben-Gurion University

May 17, 2018


1 Dirichlet-Process GMM


Dirichlet-Process GMM

Dirichlet-Process Mixture Models (DPMM)

DPMM belongs to the larger class of Bayesian ∞-dimensional mixture models. Loosely speaking, K = ∞, and, particularly for DPMM, you can think of a Dirichlet process, an analog of the Dirichlet distribution, as a distribution over an ∞-dimensional probability simplex:

{π = (π_1, π_2, . . .) : ∑_{k=1}^∞ π_k = 1 and π_k ≥ 0 ∀k}

Bayesian ∞-dimensional mixture models generalize finite Bayesian mixture models. For example, DPGMMs generalize finite-K Bayesian GMMs.
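To make the ∞-dimensional simplex concrete, here is a minimal Julia sketch (my own illustration, not from the slides) that draws an approximate sample of π via the stick-breaking (GEM) construction, truncated at K_max components; alpha denotes the DP concentration parameter.

using Distributions   # for Beta

# Stick-breaking: break off a Beta(1, alpha) fraction of the remaining stick at each
# step; the broken-off pieces are the mixture weights pi_1, pi_2, ...
function stick_breaking(alpha, K_max)
    v = rand(Beta(1, alpha), K_max)
    weights = zeros(K_max)
    remaining = 1.0
    for k in 1:K_max
        weights[k] = v[k] * remaining
        remaining *= 1 - v[k]
    end
    return weights   # nonnegative; sums to 1 - remaining (close to 1 for large K_max)
end

weights = stick_breaking(1.0, 1000)
sum(weights)   # close to 1; the leftover stick mass is the truncation error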


Bayesian Non-Parametric Statistics

Bayesian ∞-dimensional mixture models – AKA Bayesian non-parametric mixture models – are one of the two main flavors of Bayesian Non-parametric (BNP) statistics. The other one is Gaussian Processes (GP), an ∞-dimensional generalization of the finite-dimensional Gaussian distribution. In the latter, loosely speaking, n = dim(x) = ∞ where x ∼ N(µ, Σ):

Version 1: x = (x_1, x_2, . . .) or x = (. . . , x_{−2}, x_{−1}, x_0, x_1, x_2, . . .). In both these cases the index set is countable and infinite. The GP here is a prior over (continuous) infinite sequences.
Version 2:

x = (x_t)_{t ∈ ℝ≥0} ∈ ℝⁿ   or   x = (x_t)_{t ∈ ℝ} ∈ ℝⁿ

In both these cases the index set is uncountable. In other words, x is an ℝⁿ-valued function. The GP here is a prior over (continuous) functions.
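As an illustration (my own sketch, not from the slides), a draw from such a GP prior can only be realized on a finite grid of index points; the values there are jointly Gaussian with a kernel-defined covariance. The squared-exponential kernel and the length-scale ell below are assumed, illustrative choices.

using LinearAlgebra, Distributions

# Squared-exponential kernel; ell is an illustrative length-scale.
se_kernel(t, s; ell = 0.5) = exp(-(t - s)^2 / (2 * ell^2))

ts = range(0, 5; length = 200)               # a finite grid over the index set
K  = [se_kernel(t, s) for t in ts, s in ts]  # Gram (covariance) matrix
K += 1e-8 * I                                # jitter to keep it numerically positive definite
f  = rand(MvNormal(zeros(length(ts)), Symmetric(K)))  # one draw: a "random function" on the grid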

BNP Mixture Models and Gaussian Processes have numerous applications in machine learning and computer vision.


Dirichlet-Process GMM (DPGMM)

A probabilistic model where each data point is assumed to be drawn from a mixture of an unknown (possibly infinite) number of Gaussian components with unknown parameters.
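A hedged generative sketch in Julia (assumed notation, not code from the course): drawing data via the Chinese-restaurant-process view of the DP, so that new Gaussian components appear on the fly with probability proportional to the concentration parameter alpha. The unit-covariance likelihood and base_std of the base measure are simplifying assumptions.

using Distributions

function sample_dpgmm(N; alpha = 1.0, d = 2, base_std = 5.0)
    means  = Vector{Vector{Float64}}()   # component means, created on demand
    counts = Int[]                       # points per existing component
    z = zeros(Int, N)                    # assignments
    x = zeros(d, N)                      # data
    for i in 1:N
        # CRP: join an existing component with prob. proportional to its count,
        # or a brand-new component with prob. proportional to alpha
        probs = vcat(counts, alpha) ./ (i - 1 + alpha)
        k = rand(Categorical(probs))
        if k > length(means)
            push!(means, base_std .* randn(d))   # new mean drawn from the base measure
            push!(counts, 0)
        end
        counts[k] += 1
        z[i] = k
        x[:, i] = means[k] .+ randn(d)           # Gaussian likelihood with identity covariance
    end
    return x, z
end

x, z = sample_dpgmm(2000)
length(unique(z))   # the number of components that happened to appear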


Dirichlet-Process GMM (DPGMM)

During inference, the number of clusters is found automatically. This not only produces better results than model-selection methods for finite mixture models (i.e., trying different values of K and then picking the best one according to some criterion, e.g., a penalized likelihood), but also has better theoretical justification; a sketch of that model-selection loop is shown below for contrast.
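A minimal sketch of that loop, using BIC as the penalized-likelihood criterion; fit_gmm is a hypothetical helper (not a real API), standing in for an EM fit that returns the maximized log-likelihood of a K-component GMM.

function select_K_by_bic(x::Matrix{Float64}, Ks)
    d, N = size(x)
    best_K, best_bic = 0, Inf
    for K in Ks
        loglik = fit_gmm(x, K)   # hypothetical: maximized log-likelihood from EM
        n_params = (K - 1) + K * d + K * d * (d + 1) ÷ 2   # weights, means, full covariances
        bic = -2 * loglik + n_params * log(N)              # penalized (negative) log-likelihood
        if bic < best_bic
            best_K, best_bic = K, bic
        end
    end
    return best_K
end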

Those theoretical justifications are related to de Finetti's theorem and the notion of exchangeability, the latter being a relaxation of the notion of iid.


DPGMM Inference

DPGMM inference involves computational challenges, but effective methods have been proposed recently. In particular, there is a lot of recent interest in developing parallel MCMC samplers.


Parallel Sampling in DPGMM

Chang and Fisher's [NIPS '12] parallel DP-GMM sampler augments the model with sub-clusters:

[Figure reproduced from Chang's thesis (Figure 5.6): a visualization of the inferred sub- and super-clusters of the algorithm. Solid ellipses indicate regular-cluster means and covariances, dotted ellipses indicate sub-cluster means and covariances, and the colors of the data points indicate super-cluster membership. The surrounding thesis text describes the sub-cluster split moves, which are proposed and accepted via a Metropolis-Hastings ratio.]

Figure: Solid ellipses are clusters, dashed ellipses are sub-clusters.


Parallel Sampling in DPGMM

In each iteration of Chang and Fisher’s sampler:

1 Sample weights of each cluster and sub-cluster.

2 Sample the cluster and sub-cluster parameters (mean; cov). Parallelize over clusters.

3 Sample cluster and sub-cluster assignments for each point. Parallelize over points.

4 Propose a split for each cluster with a certain probability. Parallelize over clusters.

5 Propose a merge for each pair of clusters with a certain probability. Parallelize over cluster pairs.

Split/merge probabilities are calculated in a principled way; a schematic sketch of one such iteration appears below.
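The following Julia sketch is schematic: the helper functions and the state layout are my own assumptions, not the authors' code. It only shows that steps 2 and 3 are the embarrassingly parallel ones that the parallel implementations exploit.

function sampler_iteration!(state, data)
    sample_weights!(state)                        # (1) cluster and sub-cluster weights
    Threads.@threads for k in 1:n_clusters(state)
        sample_params!(state, data, k)            # (2) means and covariances; parallel over clusters
    end
    Threads.@threads for i in 1:size(data, 2)
        sample_assignment!(state, data, i)        # (3) cluster + sub-cluster labels; parallel over points
    end
    for k in 1:n_clusters(state)
        propose_split!(state, data, k)            # (4) MH split built from the two sub-clusters
    end
    for (k, l) in cluster_pairs(state)
        propose_merge!(state, data, k, l)         # (5) MH merge for a pair of clusters
    end
    return state
end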


Parallel Sampling in DPGMM

See demo


Parallel Sampling in DPGMM

Chang and Fisher proposed a single-machine, multiple-CPU-cores, C++ implementation, using a shared-memory model.

Julian Straub (from Fisher's lab) developed similar single-machine GPU implementations [1], some variants of which were used in several recent applications. In all cases, there was a shared-memory model.

[Yu, Freifeld, and Fisher] [2] recently proposed a multiple-machine, multiple-CPU-cores, Julia implementation that uses a distributed-memory model.

[1] Including for spherical data [Straub, Chang, Freifeld and Fisher, AISTATS '15] and Mixtures of Manhattan Frames [Straub, Freifeld, Rosman, Leonard, and Fisher, CVPR '14 and PAMI '17].

[2] [In preparation; see also Angel Yu's MEng Thesis].


Julia Multi-Machine Implementation

[Diagram labels: Data1–Data4 (DArray) and Assignments1–Assignments4 on the workers; Model Parameters (SharedArray on Master); Sufficient Stats; steps: Sample Assignments, Sample Parameters, Propose Splits, Propose Merges.]

Figure: The proposed architecture (here shown for 4 machines)
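A rough sketch (assumed and simplified, not the actual package) of how the distributed-memory layout in the figure could look in Julia: the data live in a DArray split across worker processes, and each worker computes local sufficient statistics that the master reduces before updating the global parameters.

using Distributed
addprocs(4)                         # e.g., one worker per machine
@everywhere using DistributedArrays

data = drand(2, 1_000_000)          # toy data in a DArray, chunked across the workers

@everywhere function local_stats(chunk)
    # per-worker sufficient statistics: count, sum, and sum of outer products
    n = size(chunk, 2)
    s = sum(chunk; dims = 2)
    S = chunk * chunk'
    return (n, s, S)
end

# each worker touches only its local chunk; only the small statistics travel over the network
stats   = [@fetchfrom p local_stats(localpart(data)) for p in workers()]
N_total = sum(first.(stats))        # the master reduces the statistics, then updates the parameters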


Julia Multi-Machine Implementation

Cores (using only 1 machine)      | 1        | 2       | 4       | 8
C++ (Chang and Fisher)            | 76.94s   | 40.57s  | 21.23s  | 13.01s
Julia (Yu, Freifeld and Fisher)   | 1101.97s | 572.50s | 345.58s | 172.30s

Table: 100 iterations; dim = 2, N_data points = 10⁶, N_Gaussians = 6

The C++/Julia speedup difference is on par with the literature.

Cores × Machines                  | 8×1     | 8×2     | 8×3     | 8×4
C++ (Chang and Fisher)            | 798.94s | -       | -       | -
Julia (Yu, Freifeld and Fisher)   | 398.67s | 218.42s | 146.71s | 124.55s

Table: 100 iterations; dim = 30, N_data points = 10⁶, N_Gaussians = 6

When the dimensionality grows, the Julia implementation is faster even on a single machine (even though the C++ code was highly optimized).


Julia Multi-Machine Implementation

As a result, one can now easily and quickly learn DPGMMs (more generally, DPMMs) on datasets several orders of magnitude larger, and for data of higher dimensionality, than was possible with previous methods.


Julia Multi-Machine Implementation

Figure: Denoising examples using a DPGMM trained on 44 · 10⁶ patches of size 8 × 8. From left to right: originals; corrupted images; denoised images. The denoising method is akin to Zoran and Weiss' method for denoising using a GMM prior over image patches.
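To give a flavor of the per-patch denoising step, here is a simplified sketch of my own in the spirit of Zoran and Weiss (not the actual pipeline): pick the mixture component with the highest posterior responsibility for the noisy patch and apply that component's Wiener filter; sigma_noise is the assumed noise standard deviation.

using Distributions, LinearAlgebra

# y: noisy patch (vectorized); weights, mus, Sigmas: GMM prior over clean patches.
function denoise_patch(y, weights, mus, Sigmas, sigma_noise)
    noise_cov = sigma_noise^2 * I(length(y))
    # responsibility of each component under the noisy-patch likelihood N(y; mu_k, Sigma_k + sigma^2 I)
    logp = [log(weights[k]) +
            logpdf(MvNormal(mus[k], Symmetric(Sigmas[k] + noise_cov)), y)
            for k in eachindex(weights)]
    k = argmax(logp)
    # posterior mean of the clean patch under component k (a Wiener filter)
    G = Sigmas[k] / (Sigmas[k] + noise_cov)    # Sigma_k (Sigma_k + sigma^2 I)^(-1)
    return mus[k] + G * (y - mus[k])
end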


Differences from Zoran and Weiss:

They used a manually-chosen, finite-K GMM; we inferred the number of components on the fly.

On 2M patches: their serial Matlab implementation of EM (for MLE) took 30 hours. Our distributed Julia implementation (Bayesian, Gibbs sampler) took one hour.

Since we could use the memory of several machines, we were able to train on much larger datasets (e.g., we trained on 44M patches) and on larger patches.


Take-home Message (Part 1)

A big part of the huge success of Deep Learning (DL) in supervised learning is due to the availability of large annotated datasets and to the development of excellent SW/HW tools for training these nets.

In the realm of unsupervised learning, where DL has yet to produce a similar success, Bayesian nonparametric mixture models provide a powerful, mathematically-principled, generative framework that adapts model complexity to the data and is especially suitable for streaming-data applications.


Take-home Message (Part 2)

In some sense, Bayesian nonparametric mixture models, with their infinite number of parameters, are complementary to DL.

There is much more unannotated data than annotated data out there.

Combining/connecting BNP and DL is an interesting research direction in itself.

For such models to have a practical impact in unsupervised learning similar to the one DL has had in supervised learning, new scalable computational tools (such as parallel DPGMM samplers and easy-to-use, powerful SW tools) need to be developed; this is an active research area.
