
Algorithms

This section contains concise descriptions of almost all of the models and algorithms in this book. This includes additional details, variations of algorithms, and implementation concerns that were omitted from the main text to improve readability. The goal is to provide sufficient information to implement a naive version of each method, and the reader is encouraged to do exactly this.

WARNING! These algorithms have not been checked very thoroughly. I'm looking for volunteers to help me with this – please mail [email protected] if you can help. In the meantime, treat them with suspicion and send me any problems you find.

Copyright © 2011 by Simon Prince; to be published by Cambridge University Press 2012. For personal use only, not for distribution.


0.1 Fitting probability distributions

0.1.1 ML learning of Bernoulli parameters

The Bernoulli distribution is a probability density model suitable for describing discrete binary data x ∈ {0, 1}. It has pdf

Pr(x) = λ^x (1 − λ)^{1−x},

where the parameter λ ∈ [0, 1] denotes the probability of success.

Algorithm 1: Maximum likelihood learning for Bernoulli distribution
Input : Binary training data {x_i}_{i=1}^I
Output: ML estimate of Bernoulli parameter θ = λ
begin
    λ = Σ_{i=1}^I x_i / I
end

0.1.2 MAP learning of Bernoulli parameters

The conjugate prior to the Bernoulli distribution is the Beta distribution,

Pr(λ) = (Γ[α + β] / (Γ[α]Γ[β])) λ^{α−1}(1 − λ)^{β−1},

where Γ[•] is the gamma function and α, β are hyperparameters.

Algorithm 2: MAP learning for Bernoulli distribution with conjugate prior
Input : Binary training data {x_i}_{i=1}^I, hyperparameters α, β
Output: MAP estimate of parameter θ = λ
begin
    λ = (Σ_{i=1}^I x_i + α − 1) / (I + α + β − 2)
end


0.1.3 Bayesian approach to Bernoulli distribution

Algorithm 3: Bayesian approach to the Bernoulli distribution (predictive distribution)
Input : Binary training data {x_i}_{i=1}^I, hyperparameters α, β
Output: Posterior parameters α, β, predictive distribution Pr(x*|x_{1...I})
begin
    // Compute Beta posterior over λ
    α = α + Σ_{i=1}^I x_i
    β = β + I − Σ_{i=1}^I x_i
    // Evaluate new datapoint under the predictive distribution
    Pr(x* = 1|x_{1...I}) = α / (α + β)
    Pr(x* = 0|x_{1...I}) = 1 − Pr(x* = 1|x_{1...I})
end
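The three Bernoulli procedures above reduce to a few lines of array code. The sketch below is a minimal NumPy implementation of Algorithms 1–3 under the assumption that x is a 1-D array of 0/1 observations; the function names are illustrative, not from the book.

    import numpy as np

    def bernoulli_ml(x):
        """Algorithm 1: lambda = sum(x) / I."""
        return x.mean()

    def bernoulli_map(x, alpha, beta):
        """Algorithm 2: MAP estimate with a Beta(alpha, beta) prior."""
        return (x.sum() + alpha - 1.0) / (len(x) + alpha + beta - 2.0)

    def bernoulli_predictive(x, alpha, beta):
        """Algorithm 3: posterior parameters and Pr(x* = 1 | x_1..I)."""
        alpha_post = alpha + x.sum()
        beta_post = beta + len(x) - x.sum()
        return alpha_post, beta_post, alpha_post / (alpha_post + beta_post)

    x = np.array([1, 0, 1, 1, 0, 1])
    print(bernoulli_ml(x), bernoulli_map(x, 2.0, 2.0), bernoulli_predictive(x, 2.0, 2.0)[2])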


0.1.4 ML learning of univariate normal parameters

The univariate normal distribution is a probability density model suitable for describing continuous data x in one dimension. It has pdf

Pr(x) = (1/√(2πσ²)) exp[−0.5(x − µ)²/σ²],

where the parameter µ denotes the mean and σ² denotes the variance.

Algorithm 4: Maximum likelihood learning for normal distribution
Input : Training data {x_i}_{i=1}^I
Output: Maximum likelihood estimates of parameters θ = {µ, σ²}
begin
    // Set mean parameter
    µ = Σ_{i=1}^I x_i / I
    // Set variance
    σ² = Σ_{i=1}^I (x_i − µ)² / I
end

0.1.5 MAP learning of univariate normal parameters

The conjugate prior to the normal distribution is the normal-scaled inverse gamma, which has pdf

Pr(µ, σ²) = (√γ / (σ√(2π))) (β^α / Γ(α)) (1/σ²)^{α+1} exp[−(2β + γ(δ − µ)²)/(2σ²)],

with hyperparameters α, β, γ > 0 and δ ∈ (−∞, ∞).

Algorithm 5: MAP learning for normal distribution with conjugate prior
Input : Training data {x_i}_{i=1}^I, hyperparameters α, β, γ, δ
Output: MAP estimates of parameters θ = {µ, σ²}
begin
    // Set mean parameter
    µ = (Σ_{i=1}^I x_i + γδ) / (I + γ)
    // Set variance
    σ² = (Σ_{i=1}^I (x_i − µ)² + 2β + γ(δ − µ)²) / (I + 3 + 2α)
end
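A minimal NumPy sketch of Algorithms 4 and 5, assuming x is a 1-D array of observations; note the ML variance divides by I, not I − 1.

    import numpy as np

    def normal_ml(x):
        """Algorithm 4: ML mean and variance."""
        mu = x.mean()
        return mu, ((x - mu) ** 2).mean()

    def normal_map(x, alpha, beta, gamma, delta):
        """Algorithm 5: MAP estimates under a normal-scaled inverse gamma prior."""
        I = len(x)
        mu = (x.sum() + gamma * delta) / (I + gamma)
        var = (((x - mu) ** 2).sum() + 2 * beta + gamma * (delta - mu) ** 2) / (I + 3 + 2 * alpha)
        return mu, var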


0.1.6 Bayesian approach to univariate normal distribution

In the Bayesian approach to the univariate normal distribution we again use a normal-scaled inverse gamma prior. In the learning stage we compute a probability distribution over the mean and variance parameters. The predictive distribution for a new datum is based on all possible values of these parameters.

Algorithm 6: Bayesian approach to univariate normal distribution
Input : Training data {x_i}_{i=1}^I, hyperparameters α, β, γ, δ, test data x*
Output: Posterior parameters α', β', γ', δ', predictive distribution Pr(x*|x_{1...I})
begin
    // Compute normal-scaled inverse gamma posterior over parameters from training data
    α' = α + I/2
    β' = Σ_i x_i²/2 + β + γδ²/2 − (γδ + Σ_i x_i)²/(2γ + 2I)
    γ' = γ + I
    δ' = (γδ + Σ_i x_i)/(γ + I)
    // Compute intermediate parameters that additionally incorporate x*
    α'' = α' + 1/2
    β'' = x*²/2 + β' + γ'δ'²/2 − (γ'δ' + x*)²/(2γ' + 2)
    γ'' = γ' + 1
    // Evaluate new datapoint under the predictive distribution
    Pr(x*|x_{1...I}) = (√γ' β'^{α'} Γ[α'']) / (√(2π) √γ'' β''^{α''} Γ[α'])
end


0.1.7 ML learning of multivariate normal parameters

The multivariate normal distribution is a probability density model suitable for describing continuous data x in D dimensions. It has pdf

Pr(x) = (1/((2π)^{D/2}|Σ|^{1/2})) exp[−0.5(x − µ)^T Σ^{-1}(x − µ)],

where µ denotes the mean vector and Σ denotes the covariance matrix.

Algorithm 7: Maximum likelihood learning for multivariate normal
Input : Training data {x_i}_{i=1}^I
Output: Maximum likelihood estimates of parameters θ = {µ, Σ}
begin
    // Set mean parameter
    µ = Σ_{i=1}^I x_i / I
    // Set covariance
    Σ = Σ_{i=1}^I (x_i − µ)(x_i − µ)^T / I
end

0.1.8 MAP learning of multivariate normal parameters

The conjugate prior to the multivariate normal distribution is the normal inverse Wishart,

Pr(µ, Σ) = ( |Ψ|^{α/2} |Σ|^{−(α+D+2)/2} exp[−0.5(2 Tr(ΨΣ^{-1}) + γ(µ − δ)^T Σ^{-1}(µ − δ))] ) / ( 2^{αD/2} (2π)^{D/2} Γ_D[α/2] ),

with hyperparameters α, Ψ, γ, and δ.

Algorithm 8: MAP learning for multivariate normal with conjugate prior
Input : Training data {x_i}_{i=1}^I, hyperparameters α, Ψ, γ, δ
Output: MAP estimates of parameters θ = {µ, Σ}
begin
    // Compute posterior parameters
    α' = α + I
    Ψ' = Ψ + γδδ^T + Σ_{i=1}^I x_i x_i^T − (γδ + Σ_i x_i)(γδ + Σ_i x_i)^T/(γ + I)
    γ' = γ + I
    δ' = (Σ_i x_i + γδ)/(I + γ)
    // Set mean and covariance
    µ = δ'
    Σ = (2Ψ' + (µ − δ')(µ − δ')^T)/(α' + D + 2)
end


0.1.9 Bayesian approach to multivariate normal distribution

In the Bayesian approach to the multivariate normal distribution we again use a normal inverse Wishart prior. In the learning stage we compute a probability distribution over the mean and covariance parameters. The predictive distribution for a new datum is based on all possible values of these parameters.

Algorithm 9: Bayesian approach to multivariate normal distribution
Input : Training data {x_i}_{i=1}^I, hyperparameters α, Ψ, γ, δ, test data x*
Output: Posterior parameters α', Ψ', γ', δ', predictive distribution Pr(x*|x_{1...I})
begin
    // Compute normal inverse Wishart posterior over parameters
    α' = α + I
    Ψ' = Ψ + γδδ^T/2 + Σ_{i=1}^I x_i x_i^T/2 − (γδ + Σ_i x_i)(γδ + Σ_i x_i)^T/(2γ + 2I)
    γ' = γ + I
    δ' = (Σ_i x_i + γδ)/(I + γ)
    // Compute intermediate parameters that additionally incorporate x*
    α'' = α' + 1
    Ψ'' = Ψ' + γ'δ'δ'^T + x*x*^T − (γ'δ' + x*)(γ'δ' + x*)^T/(γ' + 1)
    γ'' = γ' + 1
    // Evaluate new datapoint under the predictive distribution
    Pr(x*|x_{1...I}) = (|Ψ'|^{α'/2} Γ_D[α'']) / (π^{D/2} |Ψ''|^{α''/2} Γ_D[α'])
end


0.1.10 ML learning of categorical parameters

The categorical distribution is a probability density model suitable for describing discrete multi-valued data x ∈ {1, 2, ..., K}. It has pdf

Pr(x = k) = λ_k,

where the parameter λ_k denotes the probability of observing category k.

Algorithm 10: Maximum likelihood learning for categorical distribution
Input : Multi-valued training data {x_i}_{i=1}^I
Output: ML estimates of categorical parameters θ = {λ_1 ... λ_K}
begin
    for k = 1 to K do
        λ_k = Σ_{i=1}^I δ[x_i − k] / I
    end
end


0.1.11 MAP learning of categorical parameters

The conjugate prior to the categorical distribution is the Dirichlet distribution,

Pr(λ_1 ... λ_K) = (Γ[Σ_{k=1}^K α_k] / Π_{k=1}^K Γ[α_k]) Π_{k=1}^K λ_k^{α_k − 1},    (1)

where Γ[•] is the gamma function and {α_k}_{k=1}^K are hyperparameters.

Algorithm 11: MAP learning for categorical distribution with conjugate prior
Input : Categorical training data {x_i}_{i=1}^I, hyperparameters {α_k}_{k=1}^K
Output: MAP estimates of parameters θ = {λ_k}_{k=1}^K
begin
    for k = 1 to K do
        λ_k = (Σ_{i=1}^I δ[x_i − k] + α_k − 1) / (I + Σ_{k=1}^K α_k − K)
    end
end

0.1.12 Bayesian approach to categorical distribution

Algorithm 12: Bayesian approach to categorical distribution
Input : Categorical training data {x_i}_{i=1}^I, hyperparameters {α_k}_{k=1}^K
Output: Posterior parameters {α_k}_{k=1}^K, predictive distribution Pr(x*|x_{1...I})
begin
    // Compute Dirichlet posterior over λ
    for k = 1 to K do
        α_k = α_k + Σ_{i=1}^I δ[x_i − k]
    end
    // Evaluate new datapoint under the predictive distribution
    for k = 1 to K do
        Pr(x* = k|x_{1...I}) = α_k / Σ_{m=1}^K α_m
    end
end
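A minimal NumPy sketch of Algorithms 10–12, assuming x is an integer array with classes coded 1..K; the helper names are illustrative only.

    import numpy as np

    def categorical_ml(x, K):
        """Algorithm 10: lambda_k = count of class k divided by I."""
        counts = np.bincount(x, minlength=K + 1)[1:]
        return counts / len(x)

    def categorical_map(x, alpha):
        """Algorithm 11: MAP estimate with a Dirichlet(alpha_1..alpha_K) prior."""
        K = len(alpha)
        counts = np.bincount(x, minlength=K + 1)[1:]
        return (counts + alpha - 1.0) / (len(x) + alpha.sum() - K)

    def categorical_predictive(x, alpha):
        """Algorithm 12: posterior Dirichlet parameters and predictive Pr(x* = k)."""
        K = len(alpha)
        alpha_post = alpha + np.bincount(x, minlength=K + 1)[1:]
        return alpha_post, alpha_post / alpha_post.sum()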


0.2 Machine learning for machine vision

0.2.1 Basic generative classifier

Consider the situation where we wish to assign a label w ∈ {1, 2, ..., K} based on an observed multivariate measurement vector x. We model the class-conditional density functions as normal distributions so that

Pr(x_i|w_i = k) = Norm_{x_i}[µ_k, Σ_k]    (2)

with prior probabilities over the world state defined by

Pr(w_i) = Cat_{w_i}[λ].    (3)

Algorithm 13: Basic generative classifier
Input : Training data {x_i, w_i}_{i=1}^I, new data example x*
Output: ML parameters θ = {λ_{1...K}, µ_{1...K}, Σ_{1...K}}, posterior probability Pr(w*|x*)
begin
    // Learning of model
    for k = 1 to K do
        // Set mean
        µ_k = Σ_{i=1}^I x_i δ[w_i − k] / Σ_{i=1}^I δ[w_i − k]
        // Set covariance
        Σ_k = Σ_{i=1}^I (x_i − µ_k)(x_i − µ_k)^T δ[w_i − k] / Σ_{i=1}^I δ[w_i − k]
        // Set prior
        λ_k = Σ_{i=1}^I δ[w_i − k] / I
    end
    // Compute likelihoods for the new datapoint
    for k = 1 to K do
        l_k = Norm_{x*}[µ_k, Σ_k]
    end
    // Classify the new datapoint
    for k = 1 to K do
        Pr(w* = k|x*) = l_k λ_k / Σ_{m=1}^K l_m λ_m
    end
end
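A short sketch of Algorithm 13 in NumPy/SciPy, assuming X is an I×D array, w holds labels 1..K, and x_star is a length-D vector; names are illustrative.

    import numpy as np
    from scipy.stats import multivariate_normal

    def fit_generative_classifier(X, w, K):
        """Learning: per-class mean, covariance and prior (Algorithm 13)."""
        params = []
        for k in range(1, K + 1):
            Xk = X[w == k]
            mu = Xk.mean(axis=0)
            Sigma = (Xk - mu).T @ (Xk - mu) / len(Xk)
            params.append((mu, Sigma, len(Xk) / len(X)))
        return params

    def classify(x_star, params):
        """Inference: posterior over classes by Bayes' rule."""
        lik = np.array([lam * multivariate_normal.pdf(x_star, mu, Sigma)
                        for mu, Sigma, lam in params])
        return lik / lik.sum()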


0.3 Fitting complex densities

0.3.1 Mixture of Gaussians

The mixture of Gaussians (MoG) is a probability density model suitable for data x in D dimensions. The data are described as a weighted sum of K normal distributions:

Pr(x|θ) = Σ_{k=1}^K λ_k Norm_x[µ_k, Σ_k],

where µ_{1...K} and Σ_{1...K} are the means and covariances of the normal distributions and λ_{1...K} are positive-valued weights that sum to one. The MoG is fit using the EM algorithm.

Algorithm 14: Maximum likelihood learning for mixture of Gaussians
Input : Training data {x_i}_{i=1}^I, number of clusters K
Output: ML estimates of parameters θ = {λ_{1...K}, µ_{1...K}, Σ_{1...K}}
begin
    Initialize θ = θ^{(0)}    [a]
    repeat
        // Expectation step
        for i = 1 to I do
            for k = 1 to K do
                l_{ik} = λ_k Norm_{x_i}[µ_k, Σ_k]    // numerator of Bayes' rule
            end
            // Compute posterior (responsibilities) by normalizing
            for k = 1 to K do
                r_{ik} = l_{ik} / Σ_{m=1}^K l_{im}
            end
        end
        // Maximization step    [b]
        for k = 1 to K do
            λ_k^{[t+1]} = Σ_{i=1}^I r_{ik} / Σ_{m=1}^K Σ_{i=1}^I r_{im}
            µ_k^{[t+1]} = Σ_{i=1}^I r_{ik} x_i / Σ_{i=1}^I r_{ik}
            Σ_k^{[t+1]} = Σ_{i=1}^I r_{ik}(x_i − µ_k^{[t+1]})(x_i − µ_k^{[t+1]})^T / Σ_{i=1}^I r_{ik}
        end
        // Compute data log likelihood and EM bound
        L = Σ_{i=1}^I log[ Σ_{k=1}^K λ_k Norm_{x_i}[µ_k, Σ_k] ]
        B = Σ_{i=1}^I Σ_{k=1}^K r_{ik} log[ λ_k Norm_{x_i}[µ_k, Σ_k] / r_{ik} ]
    until no further improvement in L
end

[a] One possibility is to set the weights λ_• = 1/K, the means µ_• to the values of K randomly chosen datapoints, and the covariances Σ_• to the covariance of the whole dataset.
[b] For a diagonal covariance, retain only the diagonal of the Σ_k update.


0.3.2 t-distribution

The t-distribution is a robust (long-tailed) distribution with pdf

Pr(x) = ( Γ[(ν + D)/2] / ((νπ)^{D/2} |Σ|^{1/2} Γ[ν/2]) ) (1 + (x − µ)^T Σ^{-1}(x − µ)/ν)^{−(ν+D)/2}.

We use the EM algorithm to fit the parameters θ = {µ, Σ, ν} of the t-distribution.

Algorithm 15: Maximum likelihood learning for t-distribution
Input : Training data {x_i}_{i=1}^I
Output: Maximum likelihood estimates of parameters θ = {µ, Σ, ν}
begin
    Initialize θ = θ^{(0)}    [a]
    repeat
        // Expectation step (Ψ[•] denotes the digamma function)
        for i = 1 to I do
            δ_i = (x_i − µ)^T Σ^{-1} (x_i − µ)
            E[h_i] = (ν + D)/(ν + δ_i)
            E[log h_i] = Ψ[(ν + D)/2] − log[(ν + δ_i)/2]
        end
        // Maximization step
        µ = Σ_{i=1}^I E[h_i] x_i / Σ_{i=1}^I E[h_i]
        Σ = Σ_{i=1}^I E[h_i](x_i − µ)(x_i − µ)^T / Σ_{i=1}^I E[h_i]
        ν = optimize[tCost[ν, {E[h_i], E[log h_i]}_{i=1}^I], ν]
        // Compute data log likelihood
        for i = 1 to I do
            δ_i = (x_i − µ)^T Σ^{-1} (x_i − µ)
        end
        L = I log[Γ[(ν + D)/2]] − I(D/2) log[νπ] − I log[|Σ|]/2 − I log[Γ[ν/2]]
        L = L − Σ_{i=1}^I ((ν + D)/2) log[1 + δ_i/ν]
    until no further improvement in L
end

[a] One possibility is to initialize µ and Σ to the mean and covariance of the data and to set the initial degrees of freedom to a large value, say ν = 1000.

The optimization of the degrees of freedom ν uses the criterion

tCost[ν, {E[h_i], E[log h_i]}_{i=1}^I] = Σ_{i=1}^I ( (ν/2) log[ν/2] − log[Γ[ν/2]] + (ν/2 − 1) E[log h_i] − (ν/2) E[h_i] ).


0.3.3 Factor analyzer

The factor analyzer is a probability density model suitable for data x in D dimensions. It has pdf

Pr(x|θ) = Norm_x[µ, ΦΦ^T + Σ],

where µ is a D×1 mean vector, Φ is a D×K matrix containing the K factors {φ_k}_{k=1}^K in its columns, and Σ is a D×D diagonal matrix. The factor analyzer is fit using the EM algorithm.

Algorithm 16: Maximum likelihood learning for factor analyzer
Input : Training data {x_i}_{i=1}^I, number of factors K
Output: Maximum likelihood estimates of parameters θ = {µ, Φ, Σ}
begin
    Initialize θ = θ^{(0)}    [a]
    // Set mean
    µ = Σ_{i=1}^I x_i / I
    repeat
        // Expectation step
        for i = 1 to I do
            E[h_i] = (Φ^T Σ^{-1} Φ + I)^{-1} Φ^T Σ^{-1} (x_i − µ)
            E[h_i h_i^T] = (Φ^T Σ^{-1} Φ + I)^{-1} + E[h_i]E[h_i]^T
        end
        // Maximization step
        Φ = ( Σ_{i=1}^I (x_i − µ)E[h_i]^T )( Σ_{i=1}^I E[h_i h_i^T] )^{-1}
        Σ = (1/I) Σ_{i=1}^I diag[ (x_i − µ)(x_i − µ)^T − ΦE[h_i](x_i − µ)^T ]
        // Compute data log likelihood    [b]
        L = Σ_{i=1}^I log[ Norm_{x_i}[µ, ΦΦ^T + Σ] ]
    until no further improvement in L
end

[a] It is usual to initialize Φ to random values. The D diagonal elements of Σ can be initialized to the variances of the D data dimensions.
[b] In high dimensions it is worth reformulating this covariance using the Woodbury relation (section ??).
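A compact NumPy sketch of Algorithm 16 in vectorized form, assuming X is I×D and following the noise update written above with (x − µ) in place of x; names are illustrative.

    import numpy as np

    def fit_factor_analyzer(X, K, n_iter=50, seed=0):
        """EM for a factor analyzer: returns mu, Phi (D x K), diagonal of Sigma."""
        rng = np.random.default_rng(seed)
        I, D = X.shape
        mu = X.mean(axis=0)
        Xc = X - mu
        Phi = rng.standard_normal((D, K))
        sig = Xc.var(axis=0)                        # diagonal of Sigma
        for _ in range(n_iter):
            # E-step: posterior moments of the hidden variables
            PtSi = Phi.T / sig                      # K x D, equals Phi^T Sigma^-1
            V = np.linalg.inv(PtSi @ Phi + np.eye(K))
            Eh = Xc @ PtSi.T @ V                    # I x K rows are E[h_i]
            EhhT = I * V + Eh.T @ Eh                # sum_i E[h_i h_i^T]
            # M-step: update factors and noise variances
            Phi = (Xc.T @ Eh) @ np.linalg.inv(EhhT)
            sig = np.mean(Xc * (Xc - Eh @ Phi.T), axis=0)
        return mu, Phi, sig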


0.4 Regression models

0.4.1 Linear regression model

The linear regression model describes the world y as a normal distribution. The mean of this distribution is a linear function φ_0 + φ^T x of the data and the variance is constant. In practice we prepend a 1 to every data vector x_i ← [1 x_i^T]^T and attach the y-intercept φ_0 to the start of the gradient vector φ ← [φ_0 φ^T]^T, so that we can write

Pr(y_i|x_i, θ) = Norm_{y_i}[φ^T x_i, σ²].

To learn the model, we work with the matrix X = [x_1, x_2, ..., x_I], which contains all of the training data examples in its columns, and the world vector y = [y_1, y_2, ..., y_I]^T, which contains the training world states.

Algorithm 17: Maximum likelihood learning for linear regression
Input : (D+1)×I data matrix X, I×1 world vector y
Output: Maximum likelihood estimates of parameters θ = {φ, σ²}
begin
    // Set gradient parameter
    φ = (XX^T)^{-1} Xy
    // Set variance parameter
    σ² = (y − X^T φ)^T (y − X^T φ)/I
end
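A minimal NumPy sketch of Algorithm 17 with a toy 1-D example; np.linalg.solve replaces the explicit inverse for numerical stability.

    import numpy as np

    def fit_linear_regression(X, y):
        """X is (D+1) x I with a row of ones prepended; y is length I."""
        phi = np.linalg.solve(X @ X.T, X @ y)          # (X X^T)^-1 X y
        resid = y - X.T @ phi
        return phi, resid @ resid / len(y)

    x = np.linspace(0.0, 1.0, 20)
    X = np.vstack([np.ones_like(x), x])
    y = 2.0 + 3.0 * x + 0.1 * np.random.default_rng(0).standard_normal(20)
    print(fit_linear_regression(X, y))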


0.4.2 Bayesian linear regression

To implement the Bayesian version we define a prior

Pr(φ) = Norm_φ[0, σ_p² I],

which contains one hyperparameter σ_p² that determines the prior variance.

Algorithm 18: Bayesian formulation of linear regression
Input : (D+1)×I data matrix X, I×1 world vector y, hyperparameter σ_p²
Output: Distribution Pr(y*|x*) over the world given new data example x*
begin
    // Fit variance parameter σ² with a line search
    σ² = argmax_{σ²} Norm_y[0, σ_p² X^T X + σ² I]
    // Compute covariance of posterior over φ
    A^{-1} = σ_p² I − σ_p² X(X^T X + (σ²/σ_p²)I)^{-1} X^T
    // Compute mean of prediction for new example x*
    µ_{y*|x*} = x*^T A^{-1} Xy/σ²
    // Compute variance of prediction for new example x*
    σ²_{y*|x*} = x*^T A^{-1} x* + σ²
end
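A minimal NumPy/SciPy sketch of Algorithm 18, with the "line search" implemented as a bounded 1-D optimization over log σ²; this is one reasonable choice, not the book's prescription.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def bayesian_linear_regression(X, y, x_star, sig_p_sq=1.0):
        """X is (D+1) x I, y is length I, x_star is length D+1."""
        D1, I = X.shape
        def neg_log_marginal(log_sig_sq):
            # Negative log of Norm_y[0, sig_p^2 X^T X + sig^2 I] up to a constant
            C = sig_p_sq * X.T @ X + np.exp(log_sig_sq) * np.eye(I)
            _, logdet = np.linalg.slogdet(C)
            return 0.5 * (logdet + y @ np.linalg.solve(C, y))
        sig_sq = np.exp(minimize_scalar(neg_log_marginal, bounds=(-10, 10), method="bounded").x)
        # Posterior covariance over phi, in the Woodbury form used by the algorithm
        A_inv = sig_p_sq * np.eye(D1) - sig_p_sq * X @ np.linalg.solve(
            X.T @ X + (sig_sq / sig_p_sq) * np.eye(I), X.T)
        mu_pred = x_star @ A_inv @ X @ y / sig_sq
        var_pred = x_star @ A_inv @ x_star + sig_sq
        return mu_pred, var_pred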


0.4.3 Bayesian non-linear regression (Gaussian process regression)

Algorithm 19: Gaussian process regression
Input : (D+1)×I data matrix X, I×1 world vector y, hyperparameter σ_p², kernel function K[•, •]
Output: Distribution Pr(y*|x*) over the world given new data example x*
begin
    // Fit variance parameter σ² with a line search
    σ² = argmax_{σ²} Norm_y[0, σ_p² K[X, X] + σ² I]
    // Compute inverse term
    A^{-1} = (K[X, X] + (σ²/σ_p²)I)^{-1}
    // Compute mean of prediction for new example x*
    µ_{y*|x*} = (σ_p²/σ²) K[x*, X]y − (σ_p²/σ²) K[x*, X] A^{-1} K[X, X]y
    // Compute variance of prediction for new example x*
    σ²_{y*|x*} = σ_p² K[x*, x*] − σ_p² K[x*, X] A^{-1} K[X, x*] + σ²
end
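A sketch of Algorithm 19 with an example RBF kernel (the kernel choice and the bounded line search over log σ² are my assumptions, not part of the pseudocode).

    import numpy as np
    from scipy.optimize import minimize_scalar

    def rbf_kernel(A, B, length=1.0):
        """RBF kernel between the columns of A and B (both D x N)."""
        d2 = ((A[:, :, None] - B[:, None, :]) ** 2).sum(axis=0)
        return np.exp(-0.5 * d2 / length ** 2)

    def gp_regression(X, y, x_star, sig_p_sq=1.0, kernel=rbf_kernel):
        I = X.shape[1]
        K = kernel(X, X)
        def neg_log_marginal(log_sig_sq):
            C = sig_p_sq * K + np.exp(log_sig_sq) * np.eye(I)
            _, logdet = np.linalg.slogdet(C)
            return 0.5 * (logdet + y @ np.linalg.solve(C, y))
        sig_sq = np.exp(minimize_scalar(neg_log_marginal, bounds=(-10, 5), method="bounded").x)
        A = K + (sig_sq / sig_p_sq) * np.eye(I)
        k_star = kernel(x_star[:, None], X)[0]                   # K[x*, X]
        mu = (sig_p_sq / sig_sq) * (k_star @ y - k_star @ np.linalg.solve(A, K @ y))
        k_ss = kernel(x_star[:, None], x_star[:, None])[0, 0]
        var = sig_p_sq * k_ss - sig_p_sq * k_star @ np.linalg.solve(A, k_star) + sig_sq
        return mu, var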


0.4.4 Sparse linear regression

Algorithm 20: Sparse linear regression
Input : (D+1)×I data matrix X, I×1 world vector y, degrees of freedom ν
Output: Distribution Pr(y*|x*) over the world given new data example x*
begin
    // Initialize hidden (relevance) variables
    H = diag[1, 1, ..., 1]
    for t = 1 to T do
        // Maximize marginal likelihood w.r.t. variance parameter σ² with a line search
        σ² = argmax_{σ²} Norm_y[0, X^T H^{-1} X + σ² I]
        // Maximize marginal likelihood w.r.t. relevance parameters H
        Σ = σ²(XX^T + H)^{-1}
        µ = ΣXy/σ²
        for d = 1 to D do
            if Method 1 then
                h_d = (1 + ν)/(µ_d² + Σ_dd + ν)
            else
                h_d = (1 − h_d Σ_dd + ν)/(µ_d² + ν)
            end
        end
    end
    // Remove rows of X and the corresponding rows and columns of H where the
    // relevance is low (perhaps h_d > 1000)
    [H, X, y] = prune[H, X, y]
    // Compute covariance of posterior over φ
    A^{-1} = H^{-1} − H^{-1} X(X^T H^{-1} X + σ² I)^{-1} X^T H^{-1}
    // Compute mean of prediction for new example x*
    µ_{y*|x*} = x*^T A^{-1} Xy/σ²
    // Compute variance of prediction for new example x*
    σ²_{y*|x*} = x*^T A^{-1} x* + σ²
end


0.4.5 Dual Bayesian linear regression

To implement the Bayesian version we represent the parameter vector φ as a weighted sum

φ = Xψ

of the data examples X. We define a prior over the new parameters ψ so that

Pr(ψ) = Norm_ψ[0, σ_p² I],

which contains one hyperparameter σ_p² that determines the prior variance.

Algorithm 21: Dual formulation of linear regression
Input : (D+1)×I data matrix X, I×1 world vector y, hyperparameter σ_p²
Output: Distribution Pr(y*|x*) over the world given new data example x*
begin
    // Fit variance parameter σ² with a line search
    σ² = argmax_{σ²} Norm_y[0, σ_p² X^T XX^T X + σ² I]
    // Compute inverse covariance of posterior over ψ
    A = X^T XX^T X/σ² + I/σ_p²
    // Compute mean of prediction for new example x*
    µ_{y*|x*} = x*^T X A^{-1} X^T Xy/σ²
    // Compute variance of prediction for new example x*
    σ²_{y*|x*} = x*^T X A^{-1} X^T x* + σ²
end


0.4.6 Dual Gaussian process regression

Algorithm 22: Dual Gaussian process regression
Input : (D+1)×I data matrix X, I×1 world vector y, hyperparameter σ_p², kernel function K[•, •]
Output: Distribution Pr(y*|x*) over the world given new data example x*
begin
    // Fit variance parameter σ² with a line search
    σ² = argmax_{σ²} Norm_y[0, σ_p² K[X, X]K[X, X] + σ² I]
    // Compute inverse term
    A = K[X, X]K[X, X]/σ² + I/σ_p²
    // Compute mean of prediction for new example x*
    µ_{y*|x*} = K[x*, X] A^{-1} K[X, X]y/σ²
    // Compute variance of prediction for new example x*
    σ²_{y*|x*} = K[x*, X] A^{-1} K[X, x*] + σ²
end


0.4.7 Relevance vector regression

Algorithm 23: Relevance vector regression
Input : (D+1)×I data matrix X, I×1 world vector y, kernel function K[•, •], degrees of freedom ν
Output: Distribution Pr(y*|x*) over the world given new data example x*
begin
    // Initialize hidden (relevance) variables
    H = diag[1, 1, ..., 1]
    for t = 1 to T do
        // Maximize marginal likelihood w.r.t. variance parameter σ² with a line search
        σ² = argmax_{σ²} Norm_y[0, K[X, X] H^{-1} K[X, X] + σ² I]
        // Maximize marginal likelihood w.r.t. relevance parameters H
        Σ = σ²(K[X, X]K[X, X] + H)^{-1}
        µ = ΣK[X, X]y/σ²
        for i = 1 to I do
            if Method 1 then
                h_i = (1 + ν)/(µ_i² + Σ_ii + ν)
            else
                h_i = (1 − h_i Σ_ii + ν)/(µ_i² + ν)
            end
        end
    end
    // Remove columns of X and the corresponding rows and columns of H where the
    // relevance is low (perhaps h_i > 1000)
    [H, X, y] = prune[H, X, y]
    // Compute inverse term
    A = K[X, X]K[X, X]/σ² + H
    // Compute mean of prediction for new example x*
    µ_{y*|x*} = K[x*, X] A^{-1} K[X, X]y/σ²
    // Compute variance of prediction for new example x*
    σ²_{y*|x*} = K[x*, X] A^{-1} K[X, x*] + σ²
end


0.5 Classification models

0.5.1 Logistic regression

The logistic regression model is defined as

Pr(w|φ, x) = Bern_w[ 1/(1 + exp[−φ^T x]) ].

This is a straightforward optimization problem. We prepend a 1 to the start of each data example x_i and then optimize the log binomial probability. To do this we need to compute this value, and its derivative and Hessian with respect to the parameter φ.

Algorithm 24: Compute cost function, derivative, and Hessian for logistic regression
Input : Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, parameters φ
Output: Cost L, gradient g, Hessian H
begin
    // Initialize cost, gradient, Hessian
    L = 0
    g = zeros[D + 1, 1]
    H = zeros[D + 1, D + 1]
    // For each data point
    for i = 1 to I do
        // Compute prediction y_i
        y_i = 1/(1 + exp[−φ^T x_i])
        // Update log likelihood, gradient, and Hessian
        if w_i == 1 then
            L = L + log[y_i]
        else
            L = L + log[1 − y_i]
        end
        g = g + (y_i − w_i)x_i
        H = H + y_i(1 − y_i)x_i x_i^T
    end
end

Don't forget to multiply L, g, and H by −1 if you are optimizing with a routine that minimizes a cost function rather than maximizes it.
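The quantities above plug directly into a Newton optimizer. A minimal NumPy sketch, written in the minimization convention throughout (negative log likelihood for the gradient and Hessian), assuming X is I×(D+1) with a leading column of ones and w holds 0/1 labels:

    import numpy as np

    def logreg_cost(phi, X, w):
        """Log likelihood L, plus gradient/Hessian of the negative log likelihood."""
        a = X @ phi
        y = 1.0 / (1.0 + np.exp(-a))
        L = np.sum(w * a - np.logaddexp(0.0, a))       # stable log likelihood
        g = X.T @ (y - w)                              # gradient of -L
        H = X.T @ (X * (y * (1 - y))[:, None])         # Hessian of -L
        return L, g, H

    def fit_logreg(X, w, n_iter=20):
        """Newton's method on the negative log likelihood."""
        phi = np.zeros(X.shape[1])
        for _ in range(n_iter):
            _, g, H = logreg_cost(phi, X, w)
            phi -= np.linalg.solve(H + 1e-8 * np.eye(len(phi)), g)
        return phi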


0.5.2 MAP Logistic Regression

This is a straightforward optimization problem and very similar to the original logistic regression model, except that we now also have a prior over the parameters

Pr(φ) = Norm_φ[0, σ_p² I].    (4)

We prepend a 1 to the start of each data example x_i and then optimize the log posterior probability. To do this we need to compute this value, and its derivative and Hessian with respect to the parameter φ.

Algorithm 25: Compute cost function, derivative, and Hessian for MAP logistic regression
Input : Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, parameters φ, prior variance σ_p²
Output: Cost L, gradient g, Hessian H
begin
    // Initialize cost, gradient, Hessian with the prior contributions
    L = −(D + 1) log[2πσ_p²]/2 − φ^T φ/(2σ_p²)
    g = −φ/σ_p²
    H = −I_{(D+1)×(D+1)}/σ_p²
    // For each data point
    for i = 1 to I do
        // Compute prediction y_i
        y_i = 1/(1 + exp[−φ^T x_i])
        // Update log likelihood, gradient, and Hessian
        if w_i == 1 then
            L = L + log[y_i]
        else
            L = L + log[1 − y_i]
        end
        g = g + (y_i − w_i)x_i
        H = H + y_i(1 − y_i)x_i x_i^T
    end
end

Don't forget to multiply L, g, and H by −1 if you are optimizing with a routine that minimizes a cost function rather than maximizes it.


0.5.3 Bayesian logistic regression

In Bayesian logistic regression, we aim to compute the predictive distribution Pr(w*|x*) over the binary world state w* for a new data example x*. This takes the form of a Bernoulli distribution and is hence summarized by the single value λ* = Pr(w* = 1|x*).

Algorithm 26: Bayesian logistic regression
Input : Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, new data x*
Output: Bernoulli parameter λ* of Pr(w*|x*) for new data x*
begin
    // Prepend a 1 to the start of each data vector
    for i = 1 to I do
        x_i = [1; x_i]
    end
    // Initialize parameters
    φ = zeros[D + 1, 1]
    // Optimize using the cost function of Algorithm 24
    φ = optimize[logRegCrit[{x_i, w_i}, φ], φ]
    // Compute Hessian at the peak (Algorithm 24)
    [L, g, H] = logRegCrit[{x_i, w_i}, φ]
    // Set mean and covariance of the Laplace approximation
    µ = φ
    Σ = −H^{-1}
    // Compute mean and variance of the activation
    µ_a = µ^T x*
    σ_a² = x*^T Σ x*
    // Compute approximate prediction
    λ* = 1/(1 + exp[−µ_a/√(1 + πσ_a²/8)])
end
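A short sketch of the Laplace approximation in Algorithm 26 using a general-purpose SciPy optimizer; it works with the negative log posterior, so the Hessian is inverted directly rather than negated. The prior variance value and the use of BFGS are my choices.

    import numpy as np
    from scipy.optimize import minimize

    def bayesian_logreg_predict(X, w, x_star, sig_p_sq=10.0):
        """X is I x (D+1) with leading ones, w holds 0/1 labels, x_star has length D+1."""
        D1 = X.shape[1]
        def neg_log_post(phi):
            a = X @ phi
            ll = np.sum(w * a - np.logaddexp(0.0, a))        # log Bernoulli likelihood
            return -(ll - phi @ phi / (2 * sig_p_sq))        # minus log posterior
        mu = minimize(neg_log_post, np.zeros(D1), method="BFGS").x
        y = 1.0 / (1.0 + np.exp(-X @ mu))
        # Hessian of the negative log posterior at the mode; Sigma is its inverse
        H = X.T @ (X * (y * (1 - y))[:, None]) + np.eye(D1) / sig_p_sq
        Sigma = np.linalg.inv(H)
        mu_a = mu @ x_star
        sig_a_sq = x_star @ Sigma @ x_star
        return 1.0 / (1.0 + np.exp(-mu_a / np.sqrt(1.0 + np.pi * sig_a_sq / 8.0)))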


0.5.4 MAP dual logistic regression

The dual logistic regression model is defined as

Pr(w|ψ, x) = Bern_w[ 1/(1 + exp[−ψ^T X^T x]) ],
Pr(ψ) = Norm_ψ[0, σ_p² I].

This is a straightforward optimization problem. We prepend a 1 to the start of each data example x_i and then optimize the log posterior probability. To do this we need to compute this value, and its derivative and Hessian with respect to the parameter ψ.

Algorithm 27: Compute cost function, derivative, and Hessian for MAP dual logistic regression
Input : Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, parameters ψ, prior variance σ_p²
Output: Cost L, gradient g, Hessian H
begin
    // Initialize cost, gradient, Hessian with the prior contributions
    L = −I log[2πσ_p²]/2 − ψ^T ψ/(2σ_p²)
    g = −ψ/σ_p²
    H = −I_{I×I}/σ_p²
    // Form compound data matrix
    X = [x_1, x_2, ..., x_I]
    // For each data point
    for i = 1 to I do
        // Compute prediction y_i
        y_i = 1/(1 + exp[−ψ^T X^T x_i])
        // Update log likelihood, gradient, and Hessian
        if w_i == 1 then
            L = L + log[y_i]
        else
            L = L + log[1 − y_i]
        end
        g = g + (y_i − w_i)X^T x_i
        H = H + y_i(1 − y_i)X^T x_i x_i^T X
    end
end

Don't forget to multiply L, g, and H by −1 if you are optimizing with a routine that minimizes a cost function rather than maximizes it.


0.5.5 Dual Bayesian logistic regression

In dual Bayesian logistic regression, we aim to compute the predictive distribution Pr(w*|x*) over the binary world state w* for a new data example x*. This takes the form of a Bernoulli distribution and is hence summarized by the single value λ* = Pr(w* = 1|x*).

Algorithm 28: Dual Bayesian logistic regression
Input : Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, new data x*
Output: Bernoulli parameter λ* of Pr(w*|x*) for new data x*
begin
    // Prepend a 1 to the start of each data vector
    for i = 1 to I do
        x_i = [1; x_i]
    end
    // Initialize parameters
    ψ = zeros[I, 1]
    // Optimize using the cost function of Algorithm 27
    ψ = optimize[logRegCrit[{x_i, w_i}, ψ], ψ]
    // Compute Hessian at the peak (Algorithm 27)
    [L, g, H] = logRegCrit[{x_i, w_i}, ψ]
    // Set mean and covariance of the Laplace approximation
    µ = ψ
    Σ = −H^{-1}
    // Compute mean and variance of the activation
    µ_a = µ^T X^T x*
    σ_a² = x*^T X Σ X^T x*
    // Compute approximate prediction
    λ* = 1/(1 + exp[−µ_a/√(1 + πσ_a²/8)])
end


0.5.6 MAP Kernel logistic regression

Algorithm 29: Compute cost function, derivative, and Hessian for MAP kernel logistic regression
Input : World states {w_i}_{i=1}^I, data {x_i}_{i=1}^I, parameters ψ, kernel function K[•, •], prior variance σ_p²
Output: Cost L, gradient g, Hessian H
begin
    // Initialize cost, gradient, Hessian with the prior contributions
    L = −I log[2πσ_p²]/2 − ψ^T ψ/(2σ_p²)
    g = −ψ/σ_p²
    H = −I_{I×I}/σ_p²
    // Form compound data matrix
    X = [x_1, x_2, ..., x_I]
    // For each data point
    for i = 1 to I do
        // Compute prediction y_i
        y_i = 1/(1 + exp[−ψ^T K[X, x_i]])
        // Update log likelihood, gradient, and Hessian
        if w_i == 1 then
            L = L + log[y_i]
        else
            L = L + log[1 − y_i]
        end
        g = g + (y_i − w_i)K[X, x_i]
        H = H + y_i(1 − y_i)K[X, x_i]K[x_i, X]
    end
end


0.5.7 Bayesian kernel logistic regression (Gaussian process classification)

In Bayesian kernel logistic regression (Gaussian process classification), we aim to compute the predictive distribution Pr(w*|x*) over the binary world state w* for a new data example x*. This takes the form of a Bernoulli distribution and is hence summarized by the single value λ* = Pr(w* = 1|x*).

Algorithm 30: Bayesian kernel logistic regression (Gaussian process classification)
Input : Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, new data x*
Output: Bernoulli parameter λ* of Pr(w*|x*) for new data x*
begin
    // Prepend a 1 to the start of each data vector
    for i = 1 to I do
        x_i = [1; x_i]
    end
    // Initialize parameters
    ψ = zeros[I, 1]
    // Optimize using the cost function of Algorithm 29
    ψ = optimize[logRegKernelCrit[{x_i, w_i}, ψ], ψ]
    // Compute Hessian at the peak (Algorithm 29)
    [L, g, H] = logRegKernelCrit[{x_i, w_i}, ψ]
    // Set mean and covariance of the Laplace approximation
    µ = ψ
    Σ = −H^{-1}
    // Compute mean and variance of the activation
    µ_a = µ^T K[X, x*]
    σ_a² = K[x*, X] Σ K[X, x*]
    // Compute approximate prediction
    λ* = 1/(1 + exp[−µ_a/√(1 + πσ_a²/8)])
end


0.5.8 Relevance vector classification

0.5.9 Incremental fitting for logistic regression

The incremental fitting approach to logistic regression fits the model

Pr(w|φ, x) = Bern_w[ 1/(1 + exp[−φ_0 − Σ_{k=1}^K φ_k f[x, ξ_k]]) ].

The method is to set all of the weight parameters φ_k to zero initially and then optimize them one by one. At the first stage we optimize φ_0, φ_1, and ξ_1; then we optimize φ_0, φ_2, and ξ_2, and so on.

Algorithm 31: Incremental logistic regression
Input : Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I
Output: ML parameters φ_0, {φ_k, ξ_k}_{k=1}^K
begin
    // Initialize parameters
    φ_0 = 0
    for k = 1 to K do
        φ_k = 0
        ξ_k = ξ_k^{(0)}
    end
    // Initialize activations
    for i = 1 to I do
        a_i = 0
    end
    for k = 1 to K do
        // Remove the current offset from the activations
        for i = 1 to I do
            a_i = a_i − φ_0
        end
        φ_0 = 0
        [φ_0, φ_k, ξ_k] = optimize[logRegOffsetCrit[φ_0, φ_k, ξ_k, {a_i, x_i}], φ_0, φ_k, ξ_k]
        for i = 1 to I do
            a_i = a_i + φ_0 + φ_k f[x_i, ξ_k]
        end
    end
end

At each stage the optimization procedure improves the criterion

logRegOffsetCrit[φ_0, φ_k, ξ_k, {a_i}_{i=1}^I] = Σ_{i=1}^I log[ Bern_{w_i}[ 1/(1 + exp[−a_i − φ_0 − φ_k f[x_i, ξ_k]]) ] ]

with respect to the parameters φ_0, φ_k, and ξ_k.


0.5.10 Logit-boost

The logit-boost model is

Pr(w|φ, x) = Bern_w[ 1/(1 + exp[−φ_0 − Σ_{k=1}^K φ_k heaviside[f[x, ξ_{c_k}]]]) ].

Algorithm 32: Logit-boost
Input : Binary world states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, weak classifier functions {f_m[x, ξ_m]}_{m=1}^M
Output: ML parameters φ_0, {φ_k}_{k=1}^K, indices c_k ∈ {1 ... M}
begin
    // Initialize parameters
    φ_0 = 0
    for k = 1 to K do
        φ_k = 0
    end
    // Initialize activations
    for i = 1 to I do
        a_i = 0
    end
    for k = 1 to K do
        // Find the best weak classifier by looking at the magnitude of the gradient
        c_k = argmax_m [ (Σ_{i=1}^I (a_i − w_i) f[x_i, ξ_m])² ]
        // Remove the current offset from the activations
        for i = 1 to I do
            a_i = a_i − φ_0
        end
        φ_0 = 0
        // Perform optimization
        [φ_0, φ_k] = optimize[logitBoostCrit[φ_0, φ_k, {a_i, x_i}], φ_0, φ_k]
        for i = 1 to I do
            a_i = a_i + φ_0 + φ_k f[x_i, ξ_{c_k}]
        end
    end
end

At each stage the optimization procedure improves the criterion

logitBoostCrit[φ_0, φ_k] = Σ_{i=1}^I log[ Bern_{w_i}[ 1/(1 + exp[−a_i − φ_0 − φ_k f[x_i, ξ_{c_k}]]) ] ]

with respect to the parameters φ_0 and φ_k.


0.5.11 Multi-class logistic regression

The multi-class logistic regression model is defined as

Pr(w|φ_{1...K}, x) = Cat_w[ softmax[φ_1^T x, φ_2^T x, ..., φ_K^T x] ],

where we have prepended a 1 to the start of each data vector x. This is a straightforward optimization problem over the log probability. We need to compute this value, and the derivative and Hessian with respect to the parameters φ_k.

Algorithm 33: Cost function, derivative, and Hessian for multi-class logistic regression
Input : World states {w_i}_{i=1}^I, observed data {x_i}_{i=1}^I, parameters {φ_k}_{k=1}^K
Output: Cost L, gradient g, Hessian H
begin
    // Initialize cost, gradient, and Hessian blocks
    L = 0
    for k = 1 to K do
        g_k = 0
        for l = 1 to K do
            H_{kl} = 0
        end
    end
    // For each data point
    for i = 1 to I do
        // Compute prediction y_i
        y_i = softmax[φ_1^T x_i, φ_2^T x_i, ..., φ_K^T x_i]
        // Update log likelihood
        L = L + log[y_{i,w_i}]
        // Update gradient and Hessian blocks
        for k = 1 to K do
            g_k = g_k + x_i(y_{ik} − δ[w_i − k])
            for l = 1 to K do
                H_{kl} = H_{kl} + x_i x_i^T y_{ik}(δ[k − l] − y_{il})
            end
        end
    end
    // Assemble final gradient and Hessian
    g = [g_1; g_2; ...; g_K]
    for k = 1 to K do
        H_k = [H_{k1}, H_{k2}, ..., H_{kK}]
    end
    H = [H_1; H_2; ...; H_K]
end

Don't forget to multiply L, g, and H by −1 if you are optimizing with a routine that minimizes a cost function rather than maximizes it.


0.5.12 Multiclass logistic tree

Algorithm 34: Multi-class classification tree
Input : World states {w_i}_{i=1}^I, data {x_i}_{i=1}^I, binary classifiers {g[•, ω_m]}_{m=1}^M
Output: Categorical parameters at the leaves {λ_p}_{p=1}^{J+1}, classifier indices {c_j}_{j=1}^J
begin
    Enqueue[x_{1...I}, w_{1...I}]
    // For each node in the tree
    for j = 1 to J do
        [x_{1...I}, w_{1...I}] = dequeue[]
        for m = 1 to M do
            // Count the frequency of each class passing into either branch
            for k = 1 to K do
                n_k^{(l)} = Σ_{i=1}^I δ[g[x_i, ω_m] − 0] δ[w_i − k]
                n_k^{(r)} = Σ_{i=1}^I δ[g[x_i, ω_m] − 1] δ[w_i − k]
            end
            // Compute log likelihood
            l_m = Σ_{k=1}^K n_k^{(l)} log[ n_k^{(l)} / Σ_{q=1}^K n_q^{(l)} ]
            l_m = l_m + Σ_{k=1}^K n_k^{(r)} log[ n_k^{(r)} / Σ_{q=1}^K n_q^{(r)} ]
        end
        c_j = argmax_m l_m    // Store the best classifier
        // Partition the data into two sets
        S_l = ∅; S_r = ∅
        for i = 1 to I do
            if g[x_i, ω_{c_j}] == 0 then
                S_l = S_l ∪ i
            else
                S_r = S_r ∪ i
            end
        end
        Enqueue[x_{S_l}, w_{S_l}]    // Add each partition to the queue
        Enqueue[x_{S_r}, w_{S_r}]
    end
    // Recover the categorical parameters at the leaves
    for p = 1 to J + 1 do
        [x_{1...I}, w_{1...I}] = dequeue[]
        for k = 1 to K do
            n_k = Σ_{i=1}^I δ[w_i − k]
        end
        λ_p = [n_1, ..., n_K] / Σ_{k=1}^K n_k
    end
end


0.5.13 Random classification tree


0.5.14 Random classification fern


0.6 Graphical models

0.6.1 Gibbs' sampling from a discrete undirected model

Algorithm 35: Gibbs' sampling from an undirected model
Input : Potential functions {φ_c[S_c]}_{c=1}^C
Output: Samples {x^{(t)}}_{t=1}^T
begin
    // Initialize the first sample in the chain
    x^{(0)} = x_0
    // For each time step
    for t = 1 to T do
        x^{(t)} = x^{(t−1)}
        // For each dimension
        for d = 1 to D do
            // For each possible value
            for k = 1 to K do
                λ_k = 1
                x_d^{(t)} = k
                for each clique c such that d ∈ S_c do
                    λ_k = λ_k · φ_c[S_c]
                end
            end
            λ = λ / Σ_{k=1}^K λ_k
            // Draw from the categorical distribution
            x_d^{(t)} = DrawFromCategorical[λ]
        end
    end
end

It is normal to discard the first few thousand samples so that the initial conditions are forgotten. Samples are then chosen that are spaced well apart to avoid correlation between them.


0.6.2 Contrastive divergence for learning undirected models

Algorithm 36: Contrastive divergence learning of undirected model
Input : Training data {x_i}_{i=1}^I, learning rate α
Output: ML parameters θ
begin
    // Initialize parameters
    θ = θ^{(0)}
    repeat
        for i = 1 to I do
            // Take a single Gibbs' sampling step from the i-th data point
            x_i* = GibbsSample[x_i, θ]
        end
        // Update the parameters; the function gradient[•, •] returns the derivative
        // of the log of the unnormalized probability with respect to θ
        θ = θ + α Σ_{i=1}^I (gradient[x_i, θ] − gradient[x_i*, θ])
    until no further average change in θ
end


0.7 Models for chains and trees

0.7.1 Dynamic programming for chain model

Algorithm 37: Dynamic programming in a chain
Input : Unary costs {U_{n,k}}_{n=1,k=1}^{N,K}, pairwise costs {P_{n,k,l}}_{n=2,k=1,l=1}^{N,K,K}
Output: Minimum cost path {y_n}_{n=1}^N
begin
    // Initialize cumulative sums S_{n,k}
    for k = 1 to K do
        S_{1,k} = U_{1,k}
    end
    // Work forward through the chain
    for n = 2 to N do
        for k = 1 to K do
            // Find the minimum cost to reach this node
            S_{n,k} = U_{n,k} + min_l [S_{n−1,l} + P_{n,k,l}]
            // Store the route by which we got here
            R_{n,k} = argmin_l [S_{n−1,l} + P_{n,k,l}]
        end
    end
    // Find the label y_N with the overall minimum cost
    y_N = argmin_k [S_{N,k}]
    // Trace back to retrieve the route
    for n = N to 2 do
        y_{n−1} = R_{n,y_n}
    end
end
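A minimal NumPy sketch of Algorithm 37, assuming U is an N×K array and P is N×K×K with P[n, k, l] the cost of label k at node n given label l at node n − 1 (P[0] is unused).

    import numpy as np

    def min_cost_chain(U, P):
        N, K = U.shape
        S = np.zeros((N, K))
        R = np.zeros((N, K), dtype=int)
        S[0] = U[0]
        for n in range(1, N):
            total = S[n - 1][None, :] + P[n]          # K x K candidate costs (k, l)
            R[n] = np.argmin(total, axis=1)           # best previous label per k
            S[n] = U[n] + total[np.arange(K), R[n]]
        # Trace back the minimum-cost path
        y = np.zeros(N, dtype=int)
        y[-1] = np.argmin(S[-1])
        for n in range(N - 1, 0, -1):
            y[n - 1] = R[n, y[n]]
        return y, S[-1].min()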


0.7.2 Dynamic programming for tree model

This algorithm relies on pre-computing an order in which to traverse the nodes so that the children of each node in the graph are visited before the parent. It also uses the notation ψ_{n,k}[y_{ch[n]}] to represent the logarithm of the factor in the probability distribution that includes node n and its children, for y_n = k and some values of the children.

Algorithm 38: Dynamic programming in a tree
Input : Unary costs {U_{n,k}}_{n=1,k=1}^{N,K}, joint cost functions {ψ_{n,k}[y_{ch[n]}]}_{n=1}^N
Output: Minimum cost configuration {y_n}_{n=1}^N
begin
    repeat
        // Retrieve nodes in an order such that children always come before parents
        n = GetNextNode[]
        // Add unary costs to the cumulative sums
        for k = 1 to K do
            S_{n,k} = U_{n,k} + min_{y_{ch[n]}} ψ_{n,k}[y_{ch[n]}]
            R_{n,k} = argmin_{y_{ch[n]}} ψ_{n,k}[y_{ch[n]}]
        end
        // Push node index onto the stack
        push[n]
    until pa[n] = ∅    // the root has been processed
    // Find the root label y_n with overall minimum cost
    y_n = argmin_k [S_{n,k}]
    // Trace back to retrieve the configuration
    for c = 1 to N do
        n = pop[]
        if ch[n] ≠ ∅ then
            y_{ch[n]} = R_{n,y_n}
        end
    end
end


0.7.3 Sum Product Algorithm

Algorithm 39: Sum-product algorithm: distribute
Input : Observed data {z*_n}_{n∈S_obs}, functions {φ_k[C_k]}_{k=1}^K
Output: Forward messages m_{e_n} along each edge (used by Algorithm 40 to compute the marginals)
begin
    // Distribute evidence
    repeat
        // Retrieve edges in order
        n = GetNextEdge[]
        // Test for the type of edge
        if isEdgeToFunction[e_n] then
            // If this variable was observed
            if n ∈ S_obs then
                m_{e_n} = δ[z*_n]
            else
                // Find the set of edges that are incoming to the variable node
                S = {k : e_{n1} ∈ e_k \ e_n}
                // Take the product of their messages
                m_{e_n} = Π_{k∈S} m_{e_k}
            end
        else
            // Find the set of edges incoming to the function node
            S = {k : e_{n1} ∈ e_k \ e_n}
            // Sum over the product of the function and the incoming messages
            m_{e_n} = Σ_y φ_n[S ∪ n] Π_{k∈S} m_{e_k}
        end
        // Add the edge to the stack
        push[n]
    until pa[e_n] = ∅
end


Algorithm 40: Sum-product algorithm: collate and compute distributions
Input : Observed data {z*_n}_{n∈S_obs}, functions {φ_k[C_k]}_{k=1}^K
Output: Marginal probability distributions {q_n[y_n]}_{n=1}^N
begin
    // Collate evidence
    repeat
        // Retrieve edges in the opposite order
        n = pop[]
        // Test for the type of edge
        if !isEdgeToFunction[e_n] then
            // Find the set of edges that are incoming to the variable node
            S = {k : e_{n2} ∈ e_k \ e_n}
            // Take the product of the backward messages
            b_{e_n} = Π_{k∈S} b_{e_k}
        else
            // Find the set of edges incoming to the function node
            S = {k : e_{n2} ∈ e_k \ e_n}
            // Sum over the product of the function and the incoming messages
            b_{e_n} = Σ_y φ_n[S ∪ n] Π_{k∈S} b_{e_k}
        end
    until stack empty
    // Compute the distributions at the variable nodes
    for k = 1 to K do
        // Find the sets of edges incoming to this node in each direction
        S_1 = {n : k ∈ e_{n2}}
        S_2 = {n : k ∈ e_{n1}}
        q[k] = Π_{n∈S_1} m_n · Π_{n∈S_2} b_n
    end
end


0.8 Models for grids

0.8.1 Binary graph cuts

Algorithm 41: Binary graph cut algorithm
Input : Unary costs {U_n(k)}, pairwise costs {P_{n,m}(k, l)}, binary edge flags {E_{mn}}_{n,m=1}^N
Output: Label assignments {y_n}
begin
    // Create edges from the source, to the sink, and between neighbours
    for n = 1 to N do
        MakeLink[source, n]
        MakeLink[n, sink]
        for m = 1 to n − 1 do
            if E_{mn} = 1 then
                MakeLink[n, m]
                MakeLink[m, n]
            end
        end
    end
    // Add costs to the edges
    for n = 1 to N do
        AddToLink[source, n, U_n(0)]
        AddToLink[n, sink, U_n(1)]
        for m = 1 to n − 1 do
            if E_{mn} = 1 then
                AddToLink[n, m, P_{mn}(0, 1) − P_{mn}(0, 0) − P_{mn}(1, 1)]
                AddToLink[m, n, P_{mn}(1, 0)]
                AddToLink[source, m, P_{mn}(0, 0)]
                AddToLink[n, sink, P_{mn}(1, 1)]
            end
        end
    end
    Reparameterize[]
    ComputeMinCut[]
    // Read off the values
    for n = 1 to N do
        if isConnected[n, source] then
            y_n = 1
        else
            y_n = 0
        end
    end
end


0.8.2 Reparameterization for graph cuts

Algorithm 42: Reparameterization for binary graph cut
Input : Graph
Output: Modified graph
begin
    for n = 1 to N do
        for m = 1 to n − 1 do
            if E_{mn} = 1 then
                β = 0
                if GetEdgeCost[n, m] < 0 then
                    β = β + GetEdgeCost[n, m]
                else
                    if GetEdgeCost[m, n] < 0 then
                        β = β − GetEdgeCost[m, n]
                    end
                end
                AddToLink[n, m, −β]
                AddToLink[m, n, β]
                AddToLink[source, m, β]
                AddToLink[n, sink, β]
                α = min[GetEdgeCost[source, n], GetEdgeCost[n, sink]]
                AddToLink[source, n, −α]
                AddToLink[n, sink, −α]
            end
        end
    end
end


0.8.3 Multi-label graph cuts

Algorithm 43: Multi-label graph cut algorithm
Input : Unary costs {U_n(k)}_{n=1,k=1}^{N,K}, pairwise costs {P_{n,m}(k, l)}, binary edge flags {E_{mn}}_{n,m=1}^N
Output: Label assignments {y_n}
begin
    // Create a chain of vertices for each pixel, linked to the source and sink
    for n = 1 to N do
        MakeLink[source, (n−1)(K+1) + 1, ∞]
        MakeLink[n(K+1), sink, ∞]
        for k = 1 to K do
            MakeLink[(n−1)(K+1) + k, (n−1)(K+1) + k + 1, U_n(k)]
            MakeLink[(n−1)(K+1) + k + 1, (n−1)(K+1) + k, ∞]
        end
        for m = 1 to n − 1 do
            if E_{mn} = 1 then
                for k = 1 to K do
                    for l = 2 to K + 1 do
                        C = P_{n,m}(k, l−1) + P_{n,m}(k−1, l) − P_{n,m}(k, l) − P_{n,m}(k−1, l−1)
                        MakeLink[(n−1)(K+1) + k, (m−1)(K+1) + l, C]
                    end
                end
            end
        end
    end
    Reparameterize[]
    ComputeMinCut[]
    // Read off the values: the label is determined by where the chain is cut
    for n = 1 to N do
        for k = 1 to K do
            if isConnected[(n−1)(K+1) + k, source] then
                y_n = k
            end
        end
    end
end


0.8.4 Alpha-Expansion Algorithm

Algorithm 44: Alpha expansion algorithm (main loop)
Input : Unary costs {U_n(k)}, pairwise costs {P_{n,m}(k, l)}, binary edge flags {E_{mn}}_{n,m=1}^N
Output: Label assignments {y_n}
begin
    y = y^{(0)}
    L = Σ_{n=1}^N U_n(y_n) + Σ_{n=1}^N Σ_{m=1}^N E_{mn} P_{n,m}(y_n, y_m)
    L_0 = −∞
    repeat
        L_0 = L
        for k = 1 to K do
            y = AlphaExpand[y, k]
        end
        L = Σ_{n=1}^N U_n(y_n) + Σ_{n=1}^N Σ_{m=1}^N E_{mn} P_{n,m}(y_n, y_m)
    until L = L_0
end


Algorithm 45: Alpha expansion algorithm (expansion move)
Input : Unary costs {U_n(k)}, pairwise costs {P_{n,m}(k, l)}, binary edge flags {E_{mn}}, expansion label k, label assignments {y_n}
Output: New label assignments {y_n}
begin
    t = 0
    for n = 1 to N do
        MakeLink[source, n, U_n(k)]
        if y_n = k then
            MakeLink[n, sink, ∞]
        else
            MakeLink[n, sink, U_n(y_n)]
        end
        for m = 1 to n do
            if E_{mn} == 1 then
                if y_n == k or y_m == k then
                    if y_n ≠ k and y_m == k then
                        MakeLink[n, m, P_{n,m}(y_m, y_n)]
                    end
                    if y_n == k and y_m ≠ k then
                        MakeLink[m, n, P_{n,m}(y_n, y_m)]
                    end
                else
                    if y_n == y_m then
                        MakeLink[n, m, P_{n,m}(y_m, y_n)]
                        MakeLink[m, n, P_{n,m}(y_n, y_m)]
                    else
                        // Insert an auxiliary node between pixels with different labels
                        t = t + 1
                        MakeLink[n, t, P_{n,m}(y_n, k)]
                        MakeLink[m, t, P_{n,m}(k, y_m)]
                        MakeLink[t, sink, P_{n,m}(y_m, y_n)]
                    end
                end
            end
        end
    end
    Reparameterize[]
    ComputeMinCut[]
    // Read off the values
    for n = 1 to N do
        if isConnected[n, sink] then
            y_n = k
        end
    end
end


0.9 The pinhole camera

0.9.1 ML learning of camera extrinsic parameters

Given a known object with I distinct three-dimensional points {w_i}_{i=1}^I, their corresponding projections in the image {x_i}_{i=1}^I, and known intrinsic camera parameters Λ, estimate the geometric relationship between the camera and the object, determined by the rotation Ω and the translation τ.

Algorithm 46: ML learning of extrinsic parameters
Input : Intrinsic matrix Λ, pairs of points {x_i, w_i}_{i=1}^I
Output: Extrinsic parameters: rotation Ω and translation τ
begin
    for i = 1 to I do
        // Convert to normalized camera coordinates
        [x_i', y_i', 1]^T = Λ^{-1}[x_i, y_i, 1]^T
        // Compute linear constraints (w_i = [u_i, v_i, w_i]^T)
        a_{1i} = [u_i, v_i, w_i, 1, 0, 0, 0, 0, −u_i x_i', −v_i x_i', −w_i x_i', −x_i']
        a_{2i} = [0, 0, 0, 0, u_i, v_i, w_i, 1, −u_i y_i', −v_i y_i', −w_i y_i', −y_i']
    end
    // Stack the linear constraints
    A = [a_{11}; a_{21}; a_{12}; a_{22}; ...; a_{1I}; a_{2I}]
    // Solve with the SVD
    [U, L, V] = svd[A]
    b = v_{12}    // extract the last column of V
    // Extract estimates up to an unknown scale
    Ω̃ = [b_1, b_2, b_3; b_5, b_6, b_7; b_9, b_{10}, b_{11}]
    τ̃ = [b_4; b_8; b_{12}]
    // Find the closest true rotation matrix using the Procrustes method
    [U, L, V] = svd[Ω̃]
    Ω = UV^T
    // Rescale the translation by the average ratio between Ω and Ω̃
    τ = τ̃ · Σ_{i=1}^3 Σ_{j=1}^3 Ω_{ij}/(9 Ω̃_{ij})
    // Refine the parameters with non-linear optimization
    [Ω, τ] = optimize[projCost[Ω, τ], Ω, τ]
end

The final optimization minimizes the least squares error between the predicted projections of the points w_i into the image and the observed data x_i:

projCost[Ω, τ] = Σ_{i=1}^I (x_i − pinhole[w_i, Λ, Ω, τ])^T (x_i − pinhole[w_i, Λ, Ω, τ]).

This optimization should be carried out while enforcing the constraint that Ω remains a valid rotation matrix.


0.9.2 ML learning of intrinsic parameters (camera calibration)

Given a known object with I distinct 3D points {w_i}_{i=1}^I and their corresponding projections in the image {x_i}_{i=1}^I, establish the intrinsic camera parameters Λ.

Algorithm 47: ML learning of intrinsic parameters
Input : World points {w_i}_{i=1}^I, image points {x_i}_{i=1}^I, initial Λ
Output: Intrinsic parameters Λ
begin
    // Main loop for alternating optimization
    for k = 1 to K do
        // Compute extrinsic parameters (Algorithm 46)
        [Ω, τ] = calcExtrinsic[Λ, {w_i, x_i}_{i=1}^I]
        // Compute intrinsic parameters
        for i = 1 to I do
            // Compute matrix A_i
            a_i = (ω_1^T w_i + τ_x)/(ω_3^T w_i + τ_z)
            b_i = (ω_2^T w_i + τ_y)/(ω_3^T w_i + τ_z)
            A_i = [a_i, b_i, 1, 0, 0; 0, 0, 0, b_i, 1]
        end
        // Concatenate the matrices and data points
        x = [x_1; x_2; ...; x_I]
        A = [A_1; A_2; ...; A_I]
        // Compute parameters
        θ = (A^T A)^{-1} A^T x
        Λ = [θ_1, θ_2, θ_3; 0, θ_4, θ_5; 0, 0, 1]
    end
    // Refine all parameters with non-linear optimization
    [Ω, τ, Λ] = optimize[projCost[Ω, τ, Λ], Ω, τ, Λ]
end

The final optimization minimizes the squared error between the projections of w_i and the observed data x_i, respecting the constraints on the rotation matrix Ω:

projCost[Ω, τ, Λ] = Σ_{i=1}^I (x_i − pinhole[w_i, Λ, Ω, τ])^T (x_i − pinhole[w_i, Λ, Ω, τ]).


0.9.3 Inferring 3D world points (reconstruction)

Given J calibrated cameras in known positions (i.e., cameras with known Λ, Ω, τ) viewing the same three-dimensional point w, and knowing the corresponding projections {x_j}_{j=1}^J in the images, establish the position of the point in the world.

Algorithm 48: Inferring a 3D world position
Input : Image points {x_j}_{j=1}^J, camera parameters {Λ_j, Ω_j, τ_j}_{j=1}^J
Output: 3D world point w
begin
    for j = 1 to J do
        // Convert to normalized camera coordinates
        [x_j', y_j', 1]^T = Λ_j^{-1}[x_j, y_j, 1]^T
        // Compute linear constraints
        a_{1j} = [ω_{31j}x_j' − ω_{11j}, ω_{32j}x_j' − ω_{12j}, ω_{33j}x_j' − ω_{13j}]
        a_{2j} = [ω_{31j}y_j' − ω_{21j}, ω_{32j}y_j' − ω_{22j}, ω_{33j}y_j' − ω_{23j}]
        b_j = [τ_{xj} − τ_{zj}x_j'; τ_{yj} − τ_{zj}y_j']
    end
    // Stack the linear constraints
    A = [a_{11}; a_{21}; a_{12}; a_{22}; ...; a_{1J}; a_{2J}]
    b = [b_1; b_2; ...; b_J]
    // Least squares solution for the parameters
    w = (A^T A)^{-1} A^T b
    // Refine the parameters with non-linear optimization
    w = optimize[projCost[w, {x_j, Λ_j, Ω_j, τ_j}_{j=1}^J], w]
end

The final optimization minimizes the squared error between the projections of w and the observed data x_j:

projCost[w, {x_j, Λ_j, Ω_j, τ_j}_{j=1}^J] = Σ_{j=1}^J (x_j − pinhole[w, Λ_j, Ω_j, τ_j])^T (x_j − pinhole[w, Λ_j, Ω_j, τ_j]).
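A minimal NumPy sketch of the linear part of Algorithm 48 (the non-linear refinement is omitted); np.linalg.lstsq stands in for the normal-equation solve.

    import numpy as np

    def triangulate(x_list, Lambda_list, Omega_list, tau_list):
        """x_list holds 2-D pixel positions, one per camera."""
        A, b = [], []
        for x, Lam, Om, tau in zip(x_list, Lambda_list, Omega_list, tau_list):
            xp, yp, _ = np.linalg.solve(Lam, np.array([x[0], x[1], 1.0]))  # normalized coords
            A.append(Om[2] * xp - Om[0])          # row: omega_3 x' - omega_1
            A.append(Om[2] * yp - Om[1])          # row: omega_3 y' - omega_2
            b.append(tau[0] - tau[2] * xp)
            b.append(tau[1] - tau[2] * yp)
        w, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
        return w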


0.10 Transformation models

0.10.1 ML learning of Euclidean transformation

The Euclidean transformation model maps one set of 2D points {w_i}_{i=1}^I to another set {x_i}_{i=1}^I with a rotation Ω and a translation τ, so that

Pr(x_i|w_i, Ω, τ, σ²) = Norm_{x_i}[euc[w_i, Ω, τ], σ²I],

where the Euclidean transformation is defined as

euc[w_i, Ω, τ] = Ωw_i + τ,

and Ω is constrained to be a rotation matrix so that Ω^T Ω = I and det[Ω] = 1.

Algorithm 49: Maximum likelihood learning of Euclidean transformation
Input : Training data pairs {x_i, w_i}_{i=1}^I
Output: Rotation Ω, translation τ, variance σ²
begin
    // Compute the means of the two datasets
    µ_w = Σ_{i=1}^I w_i / I
    µ_x = Σ_{i=1}^I x_i / I
    // Concatenate the centred data into matrix form
    W = [w_1 − µ_w, w_2 − µ_w, ..., w_I − µ_w]
    X = [x_1 − µ_x, x_2 − µ_x, ..., x_I − µ_x]
    // Solve for the rotation (Procrustes)
    [U, L, V] = svd[XW^T]
    Ω = UV^T
    // Solve for the translation
    τ = Σ_{i=1}^I (x_i − Ωw_i)/I
    // Solve for the variance
    σ² = Σ_{i=1}^I (x_i − Ωw_i − τ)^T (x_i − Ωw_i − τ)/(2I)
end
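A short NumPy sketch of Algorithm 49, assuming W and X are 2×I arrays of corresponding points; it adds a determinant check against reflections that the pseudocode does not mention.

    import numpy as np

    def fit_euclidean(W, X):
        """ML Euclidean transform x ~ Omega w + tau."""
        mw, mx = W.mean(axis=1, keepdims=True), X.mean(axis=1, keepdims=True)
        Wc, Xc = W - mw, X - mx
        U, _, Vt = np.linalg.svd(Xc @ Wc.T)
        Omega = U @ Vt                               # closest rotation (Procrustes)
        if np.linalg.det(Omega) < 0:                 # guard against a reflection
            Omega = U @ np.diag([1.0, -1.0]) @ Vt
        tau = (X - Omega @ W).mean(axis=1)
        resid = X - Omega @ W - tau[:, None]
        sigma_sq = (resid ** 2).sum() / (2 * X.shape[1])
        return Omega, tau, sigma_sq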


0.10.2 ML learning of similarity transformation

The similarity transformation model maps one set of 2D points {w_i}_{i=1}^I to another set {x_i}_{i=1}^I with a rotation Ω, a translation τ, and a scaling ρ, so that

Pr(x_i|w_i, Ω, τ, ρ, σ²) = Norm_{x_i}[sim[w_i, Ω, τ, ρ], σ²I],

where the similarity transformation is defined as

sim[w_i, Ω, τ, ρ] = ρΩw_i + τ,

and Ω is constrained to be a rotation matrix so that Ω^T Ω = I and det[Ω] = 1.

Algorithm 50: Maximum likelihood learning of similarity transformation
Input : Training data pairs {x_i, w_i}_{i=1}^I
Output: Rotation Ω, translation τ, scale ρ, variance σ²
begin
    // Compute the means of the two datasets
    µ_w = Σ_{i=1}^I w_i / I
    µ_x = Σ_{i=1}^I x_i / I
    // Concatenate the centred data into matrix form
    W = [w_1 − µ_w, w_2 − µ_w, ..., w_I − µ_w]
    X = [x_1 − µ_x, x_2 − µ_x, ..., x_I − µ_x]
    // Solve for the rotation (Procrustes)
    [U, L, V] = svd[XW^T]
    Ω = UV^T
    // Solve for the scaling
    ρ = (Σ_{i=1}^I (x_i − µ_x)^T Ω(w_i − µ_w)) / (Σ_{i=1}^I (w_i − µ_w)^T (w_i − µ_w))
    // Solve for the translation
    τ = Σ_{i=1}^I (x_i − ρΩw_i)/I
    // Solve for the variance
    σ² = Σ_{i=1}^I (x_i − ρΩw_i − τ)^T (x_i − ρΩw_i − τ)/(2I)
end


0.10.3 ML learning of affine transformation

The affine transformation model maps one set of 2D points {w_i}_{i=1}^I to another set {x_i}_{i=1}^I with a linear transformation Φ and an offset τ, so that

Pr(x_i|w_i, Φ, τ, σ²) = Norm_{x_i}[aff[w_i, Φ, τ], σ²I],

where the affine transformation is defined as

aff[w_i, Φ, τ] = Φw_i + τ.

Algorithm 51: Maximum likelihood learning of affine transformation
Input : Training data pairs {x_i, w_i}_{i=1}^I
Output: Linear transformation Φ, offset τ, variance σ²
begin
    // Solve for the translation
    τ = Σ_{i=1}^I (x_i − w_i)/I
    // Compute intermediate 2×4 matrices A_i
    for i = 1 to I do
        A_i = [w_i^T, 0^T; 0^T, w_i^T]
    end
    // Concatenate the matrices A_i into a 2I×4 matrix A
    A = [A_1; A_2; ...; A_I]
    // Concatenate the output points into a 2I×1 vector c
    c = [x_1 − τ; x_2 − τ; ...; x_I − τ]
    // Solve for the linear transformation
    φ = (A^T A)^{-1} A^T c
    Φ = [φ_1, φ_2; φ_3, φ_4]
    // Solve for the variance
    σ² = Σ_{i=1}^I (x_i − Φw_i − τ)^T (x_i − Φw_i − τ)/(2I)
end


0.10.4 ML learning of projective transformation (homography)

The projective transformation model maps one set of 2D points {w_i}_{i=1}^I to another set {x_i}_{i=1}^I with a non-linear transformation parameterized by a 3×3 matrix Φ, so that

Pr(x_i|w_i, Φ, σ²) = Norm_{x_i}[proj[w_i, Φ], σ²I],

where the homography is defined as

proj[w_i, Φ] = [ (φ_{11}u + φ_{12}v + φ_{13})/(φ_{31}u + φ_{32}v + φ_{33}),  (φ_{21}u + φ_{22}v + φ_{23})/(φ_{31}u + φ_{32}v + φ_{33}) ]^T.

Algorithm 52: Maximum likelihood learning of projective transformation
Input : Training data pairs {x_i, w_i}_{i=1}^I
Output: Parameter matrix Φ, variance σ²
begin
    // Convert the plane points to a homogeneous representation
    for i = 1 to I do
        w_i = [w_i; 1]
    end
    // Compute intermediate 2×9 matrices A_i (DLT constraints from x_i × Φw_i = 0)
    for i = 1 to I do
        A_i = [0^T, −w_i^T, y_i w_i^T; w_i^T, 0^T, −x_i w_i^T]
    end
    // Concatenate the matrices A_i into a 2I×9 matrix A
    A = [A_1; A_2; ...; A_I]
    // Solve for approximate parameters
    [U, L, V] = svd[A]
    Φ_0 = [v_{19}, v_{29}, v_{39}; v_{49}, v_{59}, v_{69}; v_{79}, v_{89}, v_{99}]
    // Refine the parameters with non-linear optimization
    Φ = optimize[homCostFn[Φ], Φ_0]
    // Solve for the variance
    σ² = Σ_{i=1}^I (x_i − proj[w_i, Φ])^T (x_i − proj[w_i, Φ])/(2I)
end

The cost function for the non-linear optimization is based on the least squares error of the model:

homCostFn[Φ] = Σ_{i=1}^I (x_i − proj[w_i, Φ])^T (x_i − proj[w_i, Φ]).

The optimization should be carried out under the constraint ||Φ||_F = 1, i.e., that the sum of the squares of the elements of Φ is one.
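A minimal NumPy sketch of the direct linear transform stage of Algorithm 52 (no non-linear refinement), assuming W and X are I×2 arrays of plane and image points respectively.

    import numpy as np

    def fit_homography_dlt(W, X):
        """DLT estimate of the homography mapping w to x."""
        rows = []
        for (u, v), (x, y) in zip(W, X):
            wh = np.array([u, v, 1.0])
            rows.append(np.concatenate([np.zeros(3), -wh, y * wh]))
            rows.append(np.concatenate([wh, np.zeros(3), -x * wh]))
        _, _, Vt = np.linalg.svd(np.array(rows))
        Phi = Vt[-1].reshape(3, 3)          # singular vector with smallest singular value
        return Phi / np.linalg.norm(Phi)    # enforce ||Phi||_F = 1

    def apply_homography(Phi, w):
        uvw = Phi @ np.array([w[0], w[1], 1.0])
        return uvw[:2] / uvw[2]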


0.10.5 ML Inference for transformation models

Consider a transformation model that maps one set of 2D points {w_i}_{i=1}^I to another set {x_i}_{i=1}^I so that

Pr(x_i|w_i, Φ) = Norm_{x_i}[trans[w_i, Φ], σ²I].

In inference we are given a new data point x = [x, y]^T and wish to compute the most likely point w = [u, v]^T that was responsible for it. To make progress, we consider the transformation model trans[w_i, Φ] in homogeneous form

λ[x, y, 1]^T = [φ_{11}, φ_{12}, φ_{13}; φ_{21}, φ_{22}, φ_{23}; φ_{31}, φ_{32}, φ_{33}] [u, v, 1]^T,

or x̃ = Φw̃. The Euclidean, similarity, affine, and projective transformations can all be expressed as a 3×3 matrix of this kind.

Algorithm 53: Maximum likelihood inference for transformation models
Input : Transformation parameters Φ, new point x
Output: Point w
begin
    // Convert data to homogeneous representation
    x = [x; 1]
    // Apply the inverse transformation
    a = Φ^{-1}x
    // Convert back to Cartesian coordinates
    w = [a_1/a_3, a_2/a_3]^T
end


0.10.6 Learning extrinsic parameters (planar scene)

Consider a calibrated camera with known intrinsic parameters Λ viewing a planar scene. We are given a set of 2D positions {w_i}_{i=1}^I on the plane (measured in real-world units such as cm) and their corresponding 2D pixel positions {x_i}_{i=1}^I. The goal of this algorithm is to learn the 3D rotation Ω and translation τ that map a point w = [u, v, w]^T in the frame of reference of the plane (where w = 0 on the plane) into the frame of reference of the camera.

Algorithm 54: ML learning of extrinsic parameters (planar scene)

Input : Intrinsic matrix Λ, pairs of points {x_i, w_i}_{i=1}^I
Output: Extrinsic parameters: rotation Ω and translation τ
begin
    // Compute homography between pairs of points
    Φ = LearnHomography[{x_i}_{i=1}^I, {w_i}_{i=1}^I]
    // Eliminate effect of intrinsic parameters
    Φ = Λ^{-1} Φ
    // Compute SVD of first two columns of Φ
    [U, L, V] = svd[[φ_1, φ_2]]
    // Estimate first two columns of rotation matrix
    [ω_1, ω_2] = [u_1, u_2] V^T
    // Estimate third column by taking cross product
    ω_3 = ω_1 × ω_2
    Ω = [ω_1, ω_2, ω_3]
    // Check that determinant is one
    if det[Ω] < 0 then
        Ω = [ω_1, ω_2, −ω_3]
    end
    // Compute scaling factor for translation vector
    λ = (Σ_{i=1}^3 Σ_{j=1}^2 ω_ij/φ_ij) / 6
    // Compute translation
    τ = λ φ_3
    // Refine parameters with non-linear optimization
    [Ω, τ] = optimize[projCost[Ω, τ], Ω, τ]
end

The final optimization minimizes the least squares error between the predicted projections of the points w_i into the image and the observed data x_i, so

projCost[Ω, τ] = Σ_{i=1}^I (x_i − pinhole[[w_i; 0], Λ, Ω, τ])^T (x_i − pinhole[[w_i; 0], Λ, Ω, τ]).

This optimization should be carried out while enforcing the constraint that Ω remains a valid rotation matrix.
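
The closed-form part of this procedure might be sketched in Python/NumPy as follows; the function name is mine, the homography Φ is assumed to have been estimated already (e.g. with the DLT sketch above), and the non-linear refinement of projCost is omitted.

import numpy as np

def extrinsics_from_plane(Lambda, Phi):
    """Recover rotation Omega and translation tau from a plane-to-image
    homography Phi and intrinsic matrix Lambda (closed-form part only)."""
    Phi = np.linalg.solve(Lambda, Phi)          # remove intrinsics
    U, L, Vt = np.linalg.svd(Phi[:, :2], full_matrices=False)
    omega12 = U @ Vt                            # first two rotation columns
    omega3 = np.cross(omega12[:, 0], omega12[:, 1])
    Omega = np.column_stack([omega12, omega3])
    if np.linalg.det(Omega) < 0:                # ensure a valid rotation
        Omega[:, 2] = -Omega[:, 2]
    # Scaling factor: average ratio of rotation entries to homography entries
    lam = np.mean(Omega[:, :2] / Phi[:, :2])
    tau = lam * Phi[:, 2]
    return Omega, tau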

0.10.7 Learning intrinsic parameters (planar scene)

This is also known as camera calibration from a plane. The camera is presented with J views of a plane with unknown poses {Ω_j, τ_j}. For each image we know I points {w_i}_{i=1}^I where w_i = [u_i, v_i, 0]^T, and we know their imaged positions {x_ij}_{i=1,j=1}^{I,J} in each of the J scenes. The goal is to compute the intrinsic matrix Λ.

Algorithm 55: ML learning of intrinsic parameters (planar scene)

Input : World points {w_i}_{i=1}^I, image points {x_ij}_{i=1,j=1}^{I,J}, initial Λ
Output: Intrinsic parameters Λ
begin
    // Main loop for alternating optimization
    for k = 1 to K do
        // Compute extrinsic parameters for each image
        for j = 1 to J do
            [Ω_j, τ_j] = calcExtrinsic[Λ, {w_i, x_ij}_{i=1}^I]
        end
        // Compute intrinsic parameters
        for i = 1 to I do
            for j = 1 to J do
                // Compute matrix A_ij from normalized camera coordinates
                // (ω_1, ω_2, ω_3 are the rows of Ω_j and τ_j = [τ_x, τ_y, τ_z]^T)
                a_ij = (ω_1^T w_i + τ_x)/(ω_3^T w_i + τ_z)
                b_ij = (ω_2^T w_i + τ_y)/(ω_3^T w_i + τ_z)
                A_ij = [a_ij, b_ij, 1, 0, 0; 0, 0, 0, b_ij, 1]
            end
        end
        // Concatenate matrices and data points
        x = [x_11; x_12; ... x_IJ]
        A = [A_11; A_12; ... A_IJ]
        // Compute parameters
        θ = (A^T A)^{-1} A^T x
        Λ = [θ_1, θ_2, θ_3; 0, θ_4, θ_5; 0, 0, 1]
    end
    // Refine parameters with non-linear optimization
    [{Ω_j, τ_j}_{j=1}^J, Λ] = optimize[projCost[{Ω_j, τ_j}_{j=1}^J, Λ], {Ω_j, τ_j}_{j=1}^J, Λ]
end

The final optimization minimizes the squared error between the projections of the points w_i and the observed data x_ij, respecting the constraints on the rotation matrices Ω_j:

projCost[{Ω_j, τ_j}_{j=1}^J, Λ] = Σ_{i=1}^I Σ_{j=1}^J (x_ij − pinhole[[w_i; 0], Λ, Ω_j, τ_j])^T (x_ij − pinhole[[w_i; 0], Λ, Ω_j, τ_j]).
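
To illustrate the inner linear step only, the following Python/NumPy sketch builds the A_ij rows from known extrinsics and solves for the five unknown entries of Λ in a least-squares sense; the alternation with calcExtrinsic and the final joint refinement are not shown, and the helper name is mine.

import numpy as np

def solve_intrinsics(w, x, extrinsics):
    """Linear estimate of Lambda given plane points and per-image extrinsics.

    w          : (I, 2) points on the plane (third world coordinate is 0)
    x          : (I, J, 2) observed pixel positions
    extrinsics : list of J (Omega, tau) pairs
    """
    rows, rhs = [], []
    I = w.shape[0]
    for j, (Omega, tau) in enumerate(extrinsics):
        for i in range(I):
            wi = np.array([w[i, 0], w[i, 1], 0.0])
            cam = Omega @ wi + tau                      # point in camera frame
            a = cam[0] / cam[2]                         # normalized coordinates
            b = cam[1] / cam[2]
            rows.append([a, b, 1, 0, 0]); rhs.append(x[i, j, 0])
            rows.append([0, 0, 0, b, 1]); rhs.append(x[i, j, 1])
    A, c = np.array(rows), np.array(rhs)
    theta = np.linalg.lstsq(A, c, rcond=None)[0]
    return np.array([[theta[0], theta[1], theta[2]],
                     [0.0,      theta[3], theta[4]],
                     [0.0,      0.0,      1.0]])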

0.10.8 Robust learning of projective transformation with RANSAC

The goal of this algorithm is to fit a homography that maps one set of 2D points {w_i}_{i=1}^I to another set {x_i}_{i=1}^I in the case where some of the point matches are known to be wrong (outliers). The algorithm also returns the indices of the inlying (true) matches.

Algorithm 56: Robust ML learning of homography

Input : Point pairs {x_i, w_i}_{i=1}^I, number of RANSAC steps N, threshold τ
Output: Homography Φ, inlier indices I
begin
    // Initialize best inlier set to empty
    I = {}
    for n = 1 to N do
        // Draw 4 different random integers between 1 and I
        R = RandomSubset[1...I, 4]
        // Compute homography (algorithm 52)
        Φ_n = LearnHomography[{x_i}_{i∈R}, {w_i}_{i∈R}]
        // Initialize set of inliers to empty
        S_n = {}
        for i = 1 to I do
            // Compute squared distance
            d = (x_i − proj[w_i, Φ_n])^T (x_i − proj[w_i, Φ_n])
            // If small enough then add to inliers
            if d < τ² then
                S_n = S_n ∪ {i}
            end
        end
        // If best inlier set so far then store
        if |S_n| > |I| then
            I = S_n
        end
    end
    // Compute homography from all inliers
    Φ = LearnHomography[{x_i}_{i∈I}, {w_i}_{i∈I}]
end
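
A compact Python/NumPy sketch of this RANSAC loop, reusing the learn_homography_dlt and proj helpers sketched earlier (these names are mine, not the book's):

import numpy as np

def ransac_homography(x, w, n_steps, tau, rng=np.random.default_rng(0)):
    """Fit a homography robustly; returns (Phi, inlier_indices)."""
    best_inliers = np.array([], dtype=int)
    I = w.shape[0]
    for _ in range(n_steps):
        R = rng.choice(I, size=4, replace=False)      # minimal sample
        Phi_n = learn_homography_dlt(w[R], x[R])
        d = np.sum((x - proj(w, Phi_n)) ** 2, axis=1) # squared residuals
        inliers = np.flatnonzero(d < tau ** 2)
        if inliers.size > best_inliers.size:          # keep best consensus set
            best_inliers = inliers
    Phi = learn_homography_dlt(w[best_inliers], x[best_inliers])
    return Phi, best_inliers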

0.10.9 Sequential RANSAC for fitting homographies

The goal of this algorithm is to fit K homographies to subsets of the point pairs {w_i, x_i}_{i=1}^I using sequential RANSAC.

Algorithm 57: Robust sequential learning of homographies

Input : Point pairs {x_i, w_i}_{i=1}^I, number of RANSAC steps N, inlier threshold τ, number of homographies to fit K
Output: K homographies Φ_k and associated inlier indices I_k
begin
    // Initialize set of indices of remaining point pairs
    S = {1 ... I}
    for k = 1 to K do
        // Compute homography using RANSAC (algorithm 56)
        [Φ_k, I_k] = LearnHomographyRobust[{x_i}_{i∈S}, {w_i}_{i∈S}, N, τ]
        // Remove inliers from remaining points
        S = S \ I_k
        // Check that there are enough remaining points
        if |S| < 4 then
            break
        end
    end
end

0.10.10 PEaRL for fitting homographies

The propose, expand and re-learn (PEaRL) algorithm first suggests a large number of possible homographies relating the point pairs {w_i, x_i}_{i=1}^I. These then compete for the point pairs to be assigned to them, and the homographies are re-learnt based on these assignments.

Algorithm 58: PEaRL learning of homographies

Input : Point pairs {x_i, w_i}_{i=1}^I, number of initial models M, inlier threshold τ, minimum number of inliers l, number of iterations J, neighborhood system {N_i}_{i=1}^I, pairwise cost P
Output: Set of homographies {Φ_m} and associated inlier indices {I_m}
begin
    // Propose step: generate M hypotheses
    m = 1    // hypothesis number
    repeat
        // Draw 4 different random integers between 1 and I
        R = RandomSubset[1...I, 4]
        // Compute homography (algorithm 52)
        Φ_m = LearnHomography[{x_i}_{i∈R}, {w_i}_{i∈R}]
        // Initialize inlier set to empty
        I_m = {}
        for i = 1 to I do
            d_im = (x_i − proj[w_i, Φ_m])^T (x_i − proj[w_i, Φ_m])
            // If distance small, add to inliers
            if d_im < τ² then
                I_m = I_m ∪ {i}
            end
        end
        // If enough inliers, move on to the next hypothesis
        if |I_m| ≥ l then
            m = m + 1
        end
    until m > M
    for j = 1 to J do
        // Expand step: returns I×1 label vector L
        // (D holds the unary costs d_im; P is the pairwise cost)
        L = AlphaExpand[D, P, {N_i}_{i=1}^I]
        // Re-learn step: re-estimate homographies with support
        for m = 1 to M do
            // Extract points with label m
            I_m = find[L == m]
            // If enough support then re-learn and update distances
            if |I_m| ≥ 4 then
                Φ_m = LearnHomography[{x_i}_{i∈I_m}, {w_i}_{i∈I_m}]
                for i = 1 to I do
                    d_im = (x_i − proj[w_i, Φ_m])^T (x_i − proj[w_i, Φ_m])
                end
            end
        end
    end
end

0.11 Multiple cameras

0.11.1 Camera geometry from point matches

This algorithm computes the rotation and translation (up to scale) between two cameras, given a set of I point matches {x_i1, x_i2}_{i=1}^I between the two images.

Algorithm 59: Extracting relative camera position from point matches

Input : Point pairs {x_i1, x_i2}_{i=1}^I, intrinsic matrices Λ_1, Λ_2
Output: Rotation Ω, translation τ between cameras
begin
    // Compute fundamental matrix (algorithm 60)
    F = ComputeFundamental[{x_i1, x_i2}_{i=1}^I]
    // Compute essential matrix
    E = Λ_2^T F Λ_1
    // Extract four possible rotations and translations from E
    // (the translations are recovered in skew-symmetric, cross-product form)
    W = [0, −1, 0; 1, 0, 0; 0, 0, −1]
    [U, L, V] = svd[E]
    τ_1 = ULWU^T;      Ω_1 = UW^{-1}V^T
    τ_2 = ULW^{-1}U^T; Ω_2 = UWV^T
    τ_3 = −τ_1;        Ω_3 = Ω_1
    τ_4 = −τ_2;        Ω_4 = Ω_2
    // For each of the four possibilities
    for k = 1 to 4 do
        FailFlag = 0
        // For each point
        for i = 1 to I do
            // Reconstruct point (algorithm ??); the first camera has Ω = I, τ = 0
            w = Reconstruct[x_i1, x_i2, Λ_1, Λ_2, I, 0, Ω_k, τ_k]
            // Test whether the point is reconstructed behind the camera
            if w_3 < 0 then
                FailFlag = 1
            end
        end
        // If all points are in front of the camera then return this solution
        if FailFlag == 0 then
            Ω = Ω_k
            τ = τ_k
            return
        end
    end
end
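
The following Python/NumPy sketch illustrates the same idea with the common equivalent parameterization in which the translation direction is taken as the third column of U, and with a basic linear triangulation standing in for the Reconstruct routine; treat it as an assumption-laden illustration rather than a transcription of the algorithm above.

import numpy as np

def triangulate(x1, x2, P1, P2):
    """Linear (DLT) triangulation of one correspondence; returns a 3D point."""
    A = np.vstack([x1[0] * P1[2] - P1[0], x1[1] * P1[2] - P1[1],
                   x2[0] * P2[2] - P2[0], x2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

def relative_pose_from_E(E, Lambda1, Lambda2, x1, x2):
    """Choose the rotation/translation (up to scale) that puts the points in
    front of both cameras.  x1, x2 are (I, 2) matched pixel coordinates."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U @ Vt) < 0:            # keep a proper rotation
        Vt = -Vt
    W = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
    t = U[:, 2]                              # translation direction (up to scale)
    candidates = [(U @ W @ Vt, t), (U @ W @ Vt, -t),
                  (U @ W.T @ Vt, t), (U @ W.T @ Vt, -t)]
    P1 = Lambda1 @ np.hstack([np.eye(3), np.zeros((3, 1))])
    for Omega, tau in candidates:
        P2 = Lambda2 @ np.hstack([Omega, tau.reshape(3, 1)])
        ok = True
        for a, b in zip(x1, x2):
            X = triangulate(a, b, P1, P2)
            if X[2] < 0 or (Omega @ X + tau)[2] < 0:   # behind either camera
                ok = False
                break
        if ok:
            return Omega, tau
    return None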

0.11.2 Eight point algorithm for fundamental matrix

This algorithm takes a set of I ≥ 8 point correspondences {x_i1, x_i2}_{i=1}^I between two images and computes the fundamental matrix using the eight-point algorithm. To improve the numerical stability of the algorithm, the points are transformed before the calculation and the resulting fundamental matrix is modified to compensate for this transformation.

Algorithm 60: Eight-point algorithm for fundamental matrix

Input : Point pairs {x_i1, x_i2}_{i=1}^I
Output: Fundamental matrix F
begin
    // Compute statistics of data
    µ_1 = Σ_{i=1}^I x_i1 / I
    Σ_1 = Σ_{i=1}^I (x_i1 − µ_1)(x_i1 − µ_1)^T / I
    µ_2 = Σ_{i=1}^I x_i2 / I
    Σ_2 = Σ_{i=1}^I (x_i2 − µ_2)(x_i2 − µ_2)^T / I
    for i = 1 to I do
        // Compute transformed coordinates
        x̃_i1 = Σ_1^{-1/2}(x_i1 − µ_1)
        x̃_i2 = Σ_2^{-1/2}(x_i2 − µ_2)
        // Compute constraint
        A_i = [x̃_i2 x̃_i1, x̃_i2 ỹ_i1, x̃_i2, ỹ_i2 x̃_i1, ỹ_i2 ỹ_i1, ỹ_i2, x̃_i1, ỹ_i1, 1]
    end
    // Append constraints and solve
    A = [A_1; A_2; ... A_I]
    [U, L, V] = svd[A]
    F̃ = [v_19, v_29, v_39; v_49, v_59, v_69; v_79, v_89, v_99]
    // Compensate for transformation
    T_1 = [Σ_1^{-1/2}, −Σ_1^{-1/2} µ_1; 0, 0, 1]
    T_2 = [Σ_2^{-1/2}, −Σ_2^{-1/2} µ_2; 0, 0, 1]
    F = T_2^T F̃ T_1
    // Ensure that the matrix has rank 2
    [U, L, V] = svd[F]
    l_33 = 0
    F = ULV^T
end
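
A Python/NumPy sketch of the normalized eight-point algorithm as described, using a symmetric inverse square root of the point covariance as the whitening transform (helper names are mine):

import numpy as np

def eight_point(x1, x2):
    """Normalized eight-point estimate of the fundamental matrix.

    x1, x2 : (I, 2) arrays of matched points (I >= 8) with x2~^T F x1~ = 0.
    """
    def whitener(x):
        mu = x.mean(axis=0)
        vals, vecs = np.linalg.eigh(np.cov(x.T, bias=True))
        S = vecs @ np.diag(vals ** -0.5) @ vecs.T        # Sigma^{-1/2}
        T = np.eye(3)
        T[:2, :2] = S
        T[:2, 2] = -S @ mu                               # x_tilde = S (x - mu)
        return T

    T1, T2 = whitener(x1), whitener(x2)
    n = x1.shape[0]
    x1h = np.hstack([x1, np.ones((n, 1))]) @ T1.T        # whitened points
    x2h = np.hstack([x2, np.ones((n, 1))]) @ T2.T
    # Each correspondence gives one linear constraint on the 9 entries of F
    A = np.column_stack([x2h[:, 0] * x1h[:, 0], x2h[:, 0] * x1h[:, 1], x2h[:, 0],
                         x2h[:, 1] * x1h[:, 0], x2h[:, 1] * x1h[:, 1], x2h[:, 1],
                         x1h[:, 0], x1h[:, 1], np.ones(n)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    F = T2.T @ F @ T1                                    # undo normalization
    U, L, Vt = np.linalg.svd(F)                          # enforce rank 2
    L[2] = 0.0
    return U @ np.diag(L) @ Vt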

0.11.3 Robust computation of fundamental matrix with RANSAC

The goal of this algorithm is to estimate the fundamental matrix from 2D point pairs {x_i1, x_i2}_{i=1}^I in the case where some of the point matches are known to be wrong (outliers). The algorithm also returns the indices of the true matches.

Algorithm 61: Robust ML fitting of fundamental matrix

Input : Point pairs {x_i1, x_i2}_{i=1}^I, number of RANSAC steps N, threshold τ
Output: Fundamental matrix F, inlier indices I
begin
    // Initialize best inlier set to empty
    I = {}
    for n = 1 to N do
        // Draw 8 different random integers between 1 and I
        R = RandomSubset[1...I, 8]
        // Compute fundamental matrix (algorithm 60)
        F_n = ComputeFundamental[{x_i1, x_i2}_{i∈R}]
        // Initialize set of inliers to empty
        S_n = {}
        for i = 1 to I do
            // Compute epipolar line in first image
            x̃_i2 = [x_i2; 1]
            l = x̃_i2^T F_n
            // Compute squared distance to epipolar line
            d_1 = (l_1 x_i1 + l_2 y_i1 + l_3)² / (l_1² + l_2²)
            // Compute epipolar line in second image
            x̃_i1 = [x_i1; 1]
            l′ = F_n x̃_i1
            // Compute squared distance to epipolar line
            d_2 = (l′_1 x_i2 + l′_2 y_i2 + l′_3)² / (l′_1² + l′_2²)
            // If both distances are small enough then add to inliers
            if (d_1 < τ²) and (d_2 < τ²) then
                S_n = S_n ∪ {i}
            end
        end
        // If best inlier set so far then store
        if |S_n| > |I| then
            I = S_n
        end
    end
    // Compute fundamental matrix from all inliers
    F = ComputeFundamental[{x_i1, x_i2}_{i∈I}]
end

0.11.4 Planar rectification

This algorithm computes homographies that can be used to rectify the two images. The homography for the second image is chosen so that it moves the epipole to infinity. The homography for the first image is chosen so that the matches lie on the same horizontal lines as in the transformed second image and the distance between the matches is smallest in a least squares sense (i.e., the disparity is smallest).

Algorithm 62: Planar rectification

Input : Point pairs {x_i1, x_i2}_{i=1}^I
Output: Homographies Φ_1, Φ_2 to transform first and second images
begin
    // Compute fundamental matrix (algorithm 60)
    F = ComputeFundamental[{x_i1, x_i2}_{i=1}^I]
    // Compute epipole in image 2
    [U, L, V] = svd[F]
    e = [u_13, u_23, u_33]^T
    // Compute three transformation matrices
    // (δ_x, δ_y is the image centre; e_x = u_13/u_33, e_y = u_23/u_33)
    T_1 = [1, 0, −δ_x; 0, 1, −δ_y; 0, 0, 1]
    θ = atan2[e_y − δ_y, e_x − δ_x]
    T_2 = [cos[θ], sin[θ], 0; −sin[θ], cos[θ], 0; 0, 0, 1]
    T_3 = [1, 0, 0; 0, 1, 0; −1/(cos[θ](e_x − δ_x) + sin[θ](e_y − δ_y)), 0, 1]
    // Compute homography for second image
    Φ_2 = T_3 T_2 T_1
    // Compute factorization of fundamental matrix
    L̂ = diag[l_11, l_22, (l_11 + l_22)/2]
    W = [0, −1, 0; 1, 0, 0; 0, 0, 1]
    M = UL̂WV^T
    for i = 1 to I do
        // Transform points
        x′_i1 = hom[x_i1, Φ_2 M]
        x′_i2 = hom[x_i2, Φ_2]
        // Create elements of A and b (using the components [x′_i1, y′_i1]^T)
        A_i = [x′_i1, y′_i1, 1]
        b_i = x′_i2 − x′_i1
    end
    // Concatenate elements of A and b
    A = [A_1; A_2; ... A_I]
    b = [b_1; b_2; ... b_I]
    // Solve for α
    α = (A^T A)^{-1} A^T b
    // Calculate homography for first image
    Φ_1 = (I + [1, 0, 0]^T α^T) Φ_2 M
end

0.12 Shape Models

0.12.1 Snake

0.12.2 Template model

0.12.3 Generalized Procrustes analysis

The goal of generalized Procrustes analysis is to align a set of shape vectors {w_i}_{i=1}^I with respect to a given transformation family (Euclidean, similarity, affine etc.). Each shape vector consists of a set of N 2D points, w_i = [w_i1^T, w_i2^T, ... w_iN^T]^T. In the algorithm below, we use the example of registering with respect to a similarity transformation, which consists of a rotation Ω, a scaling ρ and a translation τ.

Algorithm 63: Generalized Procrustes analysis

Input : Shape vectors {w_i}_{i=1}^I, number of iterations K
Output: Template shape w̄, transformations {Ω_i, ρ_i, τ_i}_{i=1}^I
begin
    Initialize w̄ = w_1
    // Main iteration loop
    for k = 1 to K do
        // For each shape
        for i = 1 to I do
            // Compute transformation from template to shape (algorithm ??)
            [Ω_i, ρ_i, τ_i] = EstimateSimilarity[{w̄_n}_{n=1}^N, {w_in}_{n=1}^N]
        end
        // Update each template point (average of inverse-transformed points)
        for n = 1 to N do
            w̄_n = Σ_{i=1}^I Ω_i^T (w_in − τ_i) / (I ρ_i)
        end
    end
end
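
A Python/NumPy sketch under my own naming; the similarity fit is the standard closed-form Procrustes solution, and the template re-normalization that careful implementations add (to stop the template shrinking over iterations) is omitted.

import numpy as np

def estimate_similarity(template, shape):
    """Similarity transform (Omega, rho, tau) with shape ~ rho*Omega*template + tau.

    template, shape : (N, 2) arrays of corresponding 2D points.
    """
    p_bar, q_bar = template.mean(axis=0), shape.mean(axis=0)
    P, Q = template - p_bar, shape - q_bar
    M = P.T @ Q
    U, S, Vt = np.linalg.svd(M)
    D = np.diag([1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    Omega = Vt.T @ D @ U.T                       # best rotation (no reflection)
    rho = np.trace(Omega @ M) / np.sum(P ** 2)   # best scaling
    tau = q_bar - rho * Omega @ p_bar
    return Omega, rho, tau

def generalized_procrustes(shapes, n_iter=10):
    """Align shapes (list of (N, 2) arrays); returns template and transforms."""
    template = shapes[0].copy()
    for _ in range(n_iter):
        transforms = [estimate_similarity(template, w) for w in shapes]
        # Update template as the average of the inverse-transformed shapes
        template = np.mean([(Omega.T @ (w - tau).T).T / rho
                            for (Omega, rho, tau), w in zip(transforms, shapes)],
                           axis=0)
    return template, transforms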

0.12.4 Probabilistic principal components analysis

The probabilistic principal components analysis algorithm describes a set of I D×1 data examples {x_i}_{i=1}^I with the model

Pr(x_i) = Norm_{x_i}[µ, ΦΦ^T + σ²I],

where µ is the D×1 mean vector and Φ is a D×K matrix containing the K principal components in its columns. The principal components define a K-dimensional subspace and the parameter σ² explains the variation of the data around this subspace.

Algorithm 64: ML learning of PPCA model

Input : Training data {x_i}_{i=1}^I, number of principal components K
Output: Parameters µ, Φ, σ²
begin
    // Estimate mean parameter
    µ = Σ_{i=1}^I x_i / I
    // Form matrix of mean-zero data
    X = [x_1 − µ, x_2 − µ, ... x_I − µ]
    // Decompose X into matrices U, L, V
    [V, L, V^T] = svd[X^T X]
    U = XVL^{-1/2}
    // Estimate noise parameter
    σ² = Σ_{j=K+1}^D l_jj / (D − K)
    // Estimate principal components
    U_K = [u_1, u_2, ... u_K]
    L_K = diag[l_11, l_22, ... l_KK]
    Φ = U_K (L_K − σ²I)^{1/2}
end
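
For modest D, the closed-form solution can be written directly from the eigendecomposition of the sample covariance, as in this Python/NumPy sketch (the algorithm above instead works with the I×I matrix X^T X, which is preferable when D is large; the function name is mine):

import numpy as np

def ppca_ml(X, K):
    """ML fit of the PPCA model; X is (D, I) with one example per column."""
    D, I = X.shape
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu                                       # mean-zero data
    vals, vecs = np.linalg.eigh(Xc @ Xc.T / I)        # sample covariance
    order = np.argsort(vals)[::-1]                    # descending eigenvalues
    vals, vecs = vals[order], vecs[:, order]
    sigma2 = vals[K:].mean()                          # residual variance
    Phi = vecs[:, :K] @ np.sqrt(np.diag(vals[:K]) - sigma2 * np.eye(K))
    return mu, Phi, sigma2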

0.12.5 Active shape model

0.13 Models for style and identity

0.13.1 ML learning of subspace identity model

This describes the jth of J data examples from the ith of I identities as

xij = µ+ Φhi + εij ,

where x_ij is the D×1 observed data, µ is the D×1 mean vector, Φ is the D×K factor matrix, h_i is the K×1 hidden variable representing the identity, and ε_ij is a D×1 additive multivariate normal noise term with diagonal covariance Σ.

Algorithm 65: Maximum likelihood learning for identity subspace model

Input : Training data {x_ij}_{i=1,j=1}^{I,J}, number of factors K
Output: Maximum likelihood estimates of parameters θ = {µ, Φ, Σ}
begin
    Initialize θ = θ_0 [a]
    // Set mean
    µ = Σ_{i=1}^I Σ_{j=1}^J x_ij / (IJ)
    repeat
        // Expectation step
        for i = 1 to I do
            E[h_i] = (JΦ^T Σ^{-1} Φ + I)^{-1} Φ^T Σ^{-1} Σ_{j=1}^J (x_ij − µ)
            E[h_i h_i^T] = (JΦ^T Σ^{-1} Φ + I)^{-1} + E[h_i]E[h_i]^T
        end
        // Maximization step
        Φ = (Σ_{i=1}^I Σ_{j=1}^J (x_ij − µ)E[h_i]^T)(Σ_{i=1}^I J E[h_i h_i^T])^{-1}
        Σ = (1/(IJ)) Σ_{i=1}^I Σ_{j=1}^J diag[(x_ij − µ)(x_ij − µ)^T − ΦE[h_i](x_ij − µ)^T]
        // Compute data log likelihood
        for i = 1 to I do
            x′_i = [x_i1^T, x_i2^T, ..., x_iJ^T]^T    // compound data vector, JD×1
        end
        µ′ = [µ^T, µ^T, ..., µ^T]^T                   // compound mean vector, JD×1
        Φ′ = [Φ^T, Φ^T, ..., Φ^T]^T                   // compound factor matrix, JD×K
        Σ′ = diag[Σ, Σ, ..., Σ]                       // compound covariance, JD×JD
        L = Σ_{i=1}^I log[Norm_{x′_i}[µ′, Φ′Φ′^T + Σ′]] [b]
    until No further improvement in L
end

a It is usual to initialize Φ to random values. The D diagonal elements of Σ can be initialized to the variances of the D data dimensions.
b In high dimensions it is worth reformulating this covariance using the Woodbury relation (section ??).

0.13.2 Identity matching subspace identity model

To perform inference about the identities of newly observed data {x_n}_{n=1}^N we build M competing models that explain the data in terms of different identities and which correspond to world states y = 1...M. We define a prior Pr(y = m) = λ_m for each model. Then we compute the posterior over world states using Bayes' rule:

Pr(y = m|x_1...N) = Pr(x_1...N|y = m)Pr(y = m) / Σ_{p=1}^M Pr(x_1...N|y = p)Pr(y = p).   (5)

Let the mth model divide the data into Q non-overlapping partitions {S_q}_{q=1}^Q, where each subset S_q is assumed to belong to the same identity. We now compute the likelihood Pr(x_1...N|y = m) as

Pr(x_1...N|y = m) = Π_{q=1}^Q Pr(S_q|θ).   (6)

The likelihood of the qth subset is given by

Pr(S_q|θ) = Norm_{x′}[µ′, Φ′Φ′^T + Σ′],   (7)

where x′ is a compound data vector formed by stacking all of the data associated with cluster S_q on top of one another. If there are |S_q| data vectors associated with S_q then this will be a |S_q|D×1 vector. Similarly, the vector µ′ is a |S_q|D×1 compound mean formed by stacking |S_q| copies of the mean vector µ on top of one another, Φ′ is a |S_q|D×K compound factor matrix formed by stacking |S_q| copies of Φ on top of one another, and Σ′ is a |S_q|D×|S_q|D compound covariance matrix which is block diagonal with each block equal to Σ. In high dimensions it is worth reformulating the covariance using the Woodbury relation (section ??).
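
The log of equation 7 can be evaluated by explicitly building the compound quantities, as in this Python/NumPy sketch (it ignores the Woodbury speed-up, assumes SciPy is available, and uses my own function name):

import numpy as np
from scipy.stats import multivariate_normal

def log_lik_same_identity(X, mu, Phi, Sigma):
    """log Pr(S_q | theta) for a cluster X of examples sharing one identity.

    X     : (n, D) data vectors assigned to the cluster S_q
    mu    : (D,) mean,  Phi : (D, K) factors,  Sigma : (D,) diagonal noise
    """
    n, D = X.shape
    x_prime = X.reshape(-1)                          # stack data, (nD,)
    mu_prime = np.tile(mu, n)                        # stacked means
    Phi_prime = np.tile(Phi, (n, 1))                 # stacked factor matrix
    Sigma_prime = np.diag(np.tile(Sigma, n))         # block-diagonal noise
    cov = Phi_prime @ Phi_prime.T + Sigma_prime
    return multivariate_normal.logpdf(x_prime, mean=mu_prime, cov=cov)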

0.13.3 ML learning of PLDA model

PLDA describes the jth of J data examples from the ith of I identities as

xij = µ+ Φhi + Ψsij + εij ,

where all terms are the same as in the subspace identity model, but now we add Ψ, the D×L within-individual factor matrix, and s_ij, the L×1 style variable.

Algorithm 66: Maximum likelihood learning for PLDA model

Input : Training data {x_ij}_{i=1,j=1}^{I,J}, numbers of factors K, L
Output: Maximum likelihood estimates of parameters θ = {µ, Φ, Ψ, Σ}
begin
    Initialize θ = θ_0 [a]
    // Set mean
    µ = Σ_{i=1}^I Σ_{j=1}^J x_ij / (IJ)
    repeat
        µ′ = [µ^T, µ^T, ..., µ^T]^T    // compound mean vector, JD×1
        Φ′ = [Φ^T, Φ^T, ..., Φ^T]^T    // compound factor matrix 1, JD×K
        Ψ′ = diag[Ψ, Ψ, ..., Ψ]        // compound factor matrix 2, JD×JL
        Φ′ = [Φ′, Ψ′]                  // concatenate matrices, JD×(K+JL)
        Σ′ = diag[Σ, Σ, ..., Σ]        // compound covariance, JD×JD
        // Expectation step
        for i = 1 to I do
            x′_i = [x_i1^T, x_i2^T, ..., x_iJ^T]^T    // compound data vector, JD×1
            µ_h′i = (Φ′^T Σ′^{-1} Φ′ + I)^{-1} Φ′^T Σ′^{-1} (x′_i − µ′)
            Σ_h′i = (Φ′^T Σ′^{-1} Φ′ + I)^{-1} + µ_h′i µ_h′i^T
            for j = 1 to J do
                // Extract indices of the hidden variables relevant to x_ij
                S_ij = [1 ... K, K+(j−1)L+1 ... K+jL]
                E[h″_ij] = µ_h′i(S_ij)
                E[h″_ij h″_ij^T] = Σ_h′i(S_ij, S_ij)
            end
        end
        // Maximization step
        Φ″ = (Σ_{i=1}^I Σ_{j=1}^J (x_ij − µ)E[h″_ij]^T)(Σ_{i=1}^I Σ_{j=1}^J E[h″_ij h″_ij^T])^{-1}
        Φ = Φ″(:, 1:K)             // Extract identity factor matrix
        Ψ = Φ″(:, K+1:K+L)         // Extract within-individual factor matrix
        Σ = (1/(IJ)) Σ_{i=1}^I Σ_{j=1}^J diag[(x_ij − µ)(x_ij − µ)^T − [Φ, Ψ]E[h″_ij](x_ij − µ)^T]
        // Compute data log likelihood
        L = Σ_{i=1}^I log[Norm_{x′_i}[µ′, Φ′Φ′^T + Σ′]]
    until No further improvement in L
end

a Initialize Ψ to random values; the other variables are initialized as in the identity subspace model.

0.13.4 Identity matching using PLDA model

To perform inference about the identities of newly observed data {x_n}_{n=1}^N we build M competing models that explain the data in terms of different identities and which correspond to world states y = 1...M. We define a prior Pr(y = m) = λ_m for each model. Then we compute the posterior over world states using Bayes' rule:

Pr(y = m|x_1...N) = Pr(x_1...N|y = m)Pr(y = m) / Σ_{p=1}^M Pr(x_1...N|y = p)Pr(y = p).   (8)

Let the mth model divide the data into Q non-overlapping partitions {S_q}_{q=1}^Q, where each subset S_q is assumed to belong to the same identity. We now compute the likelihood Pr(x_1...N|y = m) as

Pr(x_1...N|y = m) = Π_{q=1}^Q Pr(S_q|θ).   (9)

The likelihood of the qth subset is given by

Pr(S_q|θ) = Norm_{x′}[µ′, Φ′Φ′^T + Σ′],   (10)

where x′ is a compound data vector formed by stacking all of the data associated with cluster S_q on top of one another. If there are |S_q| data vectors associated with S_q then this will be a |S_q|D×1 vector. Similarly, the vector µ′ is a |S_q|D×1 compound mean formed by stacking |S_q| copies of the mean vector µ on top of one another. The matrix Φ′ is a |S_q|D×(K + |S_q|L) compound factor matrix which is constructed as

Φ′ = [ Φ   Ψ   0   ...   0
       Φ   0   Ψ   ...   0
       ⋮   ⋮   ⋮   ⋱    ⋮
       Φ   0   0   ...   Ψ ].   (11)

Finally, Σ′ is a |S_q|D×|S_q|D compound covariance matrix which is block diagonal with each block equal to Σ. In high dimensions it is worth reformulating the covariance using the Woodbury relation (section ??).

0.13.5 ML learning of asymmetric bilinear model

This describes the jth data example from the ith identity in the sth style as

x_ijs = µ_s + Φ_s h_i + ε_ijs,

where the terms have the same interpretation as for the subspace identity model, except that now there is one set of parameters θ_s = {µ_s, Φ_s, Σ_s} per style s.

Algorithm 67: Maximum likelihood learning for asymmetric bilinear model

Input : Training data {x_ijs}_{i=1,j=1,s=1}^{I,J,S}, number of factors K
Output: ML estimates of parameters θ = {µ_1...S, Φ_1...S, Σ_1...S}
begin
    Initialize θ = θ_0
    for s = 1 to S do
        // Set mean
        µ_s = Σ_{i=1}^I Σ_{j=1}^J x_ijs / (IJ)
    end
    repeat
        // Expectation step
        for i = 1 to I do
            E[h_i] = (I + J Σ_{s=1}^S Φ_s^T Σ_s^{-1} Φ_s)^{-1} Σ_{s=1}^S Φ_s^T Σ_s^{-1} Σ_{j=1}^J (x_ijs − µ_s)
            E[h_i h_i^T] = (I + J Σ_{s=1}^S Φ_s^T Σ_s^{-1} Φ_s)^{-1} + E[h_i]E[h_i]^T
        end
        // Maximization step
        for s = 1 to S do
            Φ_s = (Σ_{i=1}^I Σ_{j=1}^J (x_ijs − µ_s)E[h_i]^T)(Σ_{i=1}^I J E[h_i h_i^T])^{-1}
            Σ_s = (1/(IJ)) Σ_{i=1}^I Σ_{j=1}^J diag[(x_ijs − µ_s)(x_ijs − µ_s)^T − Φ_s E[h_i](x_ijs − µ_s)^T]
        end
        // Compute data log likelihood
        for s = 1 to S do
            µ′_s = [µ_s^T, µ_s^T, ..., µ_s^T]^T    // J copies, JD×1
            Φ′_s = [Φ_s^T, Φ_s^T, ..., Φ_s^T]^T    // J copies, JD×K
            Σ′_s = diag[Σ_s, Σ_s, ..., Σ_s]        // J blocks, JD×JD
            for i = 1 to I do
                x′_is = [x_i1s^T, x_i2s^T, ..., x_iJs^T]^T
            end
        end
        for i = 1 to I do
            x′_i = [x′_i1^T, x′_i2^T, ..., x′_iS^T]^T    // compound data vector, JSD×1
        end
        µ′ = [µ′_1^T, µ′_2^T, ..., µ′_S^T]^T             // compound mean vector, JSD×1
        Φ′ = [Φ′_1^T, Φ′_2^T, ..., Φ′_S^T]^T             // compound factor matrix, JSD×K
        Σ′ = diag[Σ′_1, Σ′_2, ..., Σ′_S]                 // compound covariance, JSD×JSD
        L = Σ_{i=1}^I log[Norm_{x′_i}[µ′, Φ′Φ′^T + Σ′]]
    until No further improvement in L
end

0.13.6 Identity matching with asymmetric bilinear model

This formulation assumes that the style s of each observed data example is known. To perform inference about the identities of newly observed data {x_n}_{n=1}^N we build M competing models that explain the data in terms of different identities and which correspond to world states y = 1...M. We define a prior Pr(y = m) = λ_m for each model. Then we compute the posterior over world states using Bayes' rule:

Pr(y = m|x_1...N) = Pr(x_1...N|y = m)Pr(y = m) / Σ_{p=1}^M Pr(x_1...N|y = p)Pr(y = p).   (12)

Let the mth model divide the data into Q non-overlapping partitions {S_q}_{q=1}^Q, where each subset S_q is assumed to belong to the same identity. We now compute the likelihood Pr(x_1...N|y = m) as

Pr(x_1...N|y = m) = Π_{q=1}^Q Pr(S_q|θ).   (13)

The likelihood of the qth subset is given by

Pr(S_q|θ) = Norm_{x′}[µ′, Φ′Φ′^T + Σ′],   (14)

where x′ is a compound data vector formed by stacking all of the data associated with cluster S_q on top of one another. If there are |S_q| data vectors associated with S_q then this will be a |S_q|D×1 vector. Similarly, the vector µ′ is a |S_q|D×1 compound mean formed by stacking the appropriate mean vectors µ_s for the style of each example on top of one another. The matrix Φ′ is a |S_q|D×K compound factor matrix constructed by stacking the factor matrices Φ_s on top of one another, where each style matches that of the corresponding data example. Finally, Σ′ is a |S_q|D×|S_q|D compound covariance matrix which is block diagonal with each block equal to Σ_s, where the style is again chosen to match the style of the data. In high dimensions it is worth reformulating the covariance using the Woodbury relation (section ??).

0.13.7 Style translation with asymmetric bilinear model

Algorithm 68: Style translation with asymmetric bilinear model

Input : Example x in style s_1, model parameters θ
Output: Prediction x* for the data in style s_2
begin
    // Estimate hidden variable
    E[h] = (I + Φ_s1^T Σ_s1^{-1} Φ_s1)^{-1} Φ_s1^T Σ_s1^{-1} (x − µ_s1)
    // Predict in different style
    x* = µ_s2 + Φ_s2 E[h]
end
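
A minimal Python/NumPy sketch of Algorithm 68, assuming the per-style noise covariances are stored as diagonal vectors (the function name and container layout are mine):

import numpy as np

def translate_style(x, mu, Phi, Sigma, s1, s2):
    """Translate observation x from style s1 to style s2 under the
    asymmetric bilinear model.

    mu, Phi, Sigma : dicts or lists indexed by style; Sigma[s] holds the
    diagonal of the noise covariance for style s.
    """
    K = Phi[s1].shape[1]
    A = Phi[s1].T / Sigma[s1]                    # Phi_s1^T Sigma_s1^{-1}
    Eh = np.linalg.solve(np.eye(K) + A @ Phi[s1], A @ (x - mu[s1]))
    return mu[s2] + Phi[s2] @ Eh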

0.13.8 Symmetric bilinear model

0.14 Temporal models

0.14.1 Kalman filter

0.14.2 Kalman smoother

0.14.3 Extended Kalman filter

0.14.4 Iterated extended Kalman filter

0.14.5 Unscented Kalman filter

0.14.6 Condensation algorithm

0.14.7 Bag of features model

The bag of features model treats each object class as a distribution over discrete features f, regardless of their position in the image. Assume that there are I images with J_i features in the ith image. Denote the jth feature in the ith image as f_ij. Then we have

Pr(X_i|w = n) = Π_{j=1}^{J_i} Cat_{f_ij}[λ_n].   (15)

Algorithm 69: Learn bag of words model

Input : Features {f_ij}_{i=1,j=1}^{I,J_i}, class labels {w_i}_{i=1}^I, Dirichlet parameter α
Output: Model parameters {λ_n}_{n=1}^N
begin
    // For each object class
    for n = 1 to N do
        // For each possible feature value
        for k = 1 to K do
            // Compute number of times feature k was observed for object class n
            N^f_nk = Σ_{i=1}^I Σ_{j=1}^{J_i} δ[w_i − n] δ[f_ij − k]
        end
        // Compute parameters (MAP estimate under the Dirichlet prior)
        for k = 1 to K do
            λ_nk = (N^f_nk + α − 1) / (Σ_{k′=1}^K N^f_nk′ + Kα − K)
        end
    end
end

We can then define a prior Pr(w) over the N object classes and classify a new image using Bayes' rule,

Pr(w = n|X) = Pr(X|w = n)Pr(w = n) / Σ_{n′=1}^N Pr(X|w = n′)Pr(w = n′).   (16)
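
A Python/NumPy sketch of the learning and classification steps under my own naming; features and labels are assumed to be integer-coded as in the text, and α ≥ 1 so that the MAP estimate is well defined.

import numpy as np

def learn_bag_of_features(f, w, N, K, alpha):
    """MAP estimates of the per-class feature distributions lambda.

    f : list of I integer arrays, features (1..K) found in each image
    w : length-I array of class labels (1..N)
    Returns an (N, K) array lam with lam[n-1, k-1] = Pr(feature k | class n).
    """
    counts = np.zeros((N, K))
    for fi, wi in zip(f, w):
        for k in fi:
            counts[wi - 1, k - 1] += 1
    lam = counts + alpha - 1                          # assumes alpha >= 1
    return lam / lam.sum(axis=1, keepdims=True)

def classify(f_new, lam, prior):
    """Posterior over classes for a new image given its feature indices."""
    f_new = np.asarray(f_new)
    log_lik = np.log(lam[:, f_new - 1]).sum(axis=1)   # per-class log likelihood
    log_post = log_lik + np.log(prior)
    log_post -= log_post.max()                        # numerical stability
    post = np.exp(log_post)
    return post / post.sum()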

0.14.8 Latent Dirichlet Allocation

The LDA model describes a discrete set of features f_ij ∈ {1...K} as a mixture of M categorical distributions (parts), where the categorical distributions themselves are shared across images, but the mixture weights π_i differ from image to image.

Algorithm 70: Learn latent Dirichlet allocation model

Input : Features {f_ij}_{i=1,j=1}^{I,J_i}, Dirichlet parameters α, β
Output: Model parameters {λ_m}_{m=1}^M, {π_i}_{i=1}^I
begin
    // Initialize categorical parameters
    θ = θ_0 [a]
    // Initialize count parameters
    N^(f) = 0
    N^(p) = 0
    for i = 1 to I do
        for j = 1 to J_i do
            // Initialize hidden part labels
            p_ij = randInt[M]
            // Update count parameters
            N^(f)_{p_ij, f_ij} = N^(f)_{p_ij, f_ij} + 1
            N^(p)_{i, p_ij} = N^(p)_{i, p_ij} + 1
        end
    end
    // Main MCMC loop
    for t = 1 to T do
        p^(t) = MCMCSample[p, f, N^(f), N^(p), {λ_m}_{m=1}^M, {π_i}_{i=1}^I, M, K]
    end
    // Choose samples to use for the parameter estimates
    S_t = [BurnInTime : SkipTime : LastSample]
    for i = 1 to I do
        for m = 1 to M do
            π_im = Σ_{j=1}^{J_i} Σ_{t∈S_t} δ[p^(t)_ij − m] + α
        end
        π_i = π_i / Σ_{m=1}^M π_im
    end
    for m = 1 to M do
        for k = 1 to K do
            λ_mk = Σ_{i=1}^I Σ_{j=1}^{J_i} Σ_{t∈S_t} δ[p^(t)_ij − m] δ[f_ij − k] + β
        end
        λ_m = λ_m / Σ_{k=1}^K λ_mk
    end
end

a One way to do this is to set the categorical parameters {λ_m}_{m=1}^M, {π_i}_{i=1}^I to random values by generating positive random vectors and normalizing them to sum to one.

Algorithm 71: MCMC Sampling for LDA

Input : p, f, N^(f), N^(p), {λ_m}_{m=1}^M, {π_i}_{i=1}^I, M, K
Output: Part sample p
begin
    repeat
        // Choose next feature
        (a, b) = ChooseFeature[J_1, J_2, ... J_I]
        // Remove feature from statistics
        N^(f)_{p_ab, f_ab} = N^(f)_{p_ab, f_ab} − 1
        N^(p)_{a, p_ab} = N^(p)_{a, p_ab} − 1
        // Compute conditional distribution over parts
        for m = 1 to M do
            q_m = (N^(f)_{m, f_ab} + β)(N^(p)_{a, m} + α)
            q_m = q_m / ((Σ_{k=1}^K (N^(f)_{m,k} + β))(Σ_{m′=1}^M (N^(p)_{a,m′} + α)))
        end
        // Normalize
        q = q / Σ_{m=1}^M q_m
        // Draw new part label
        p_ab = DrawCategorical[q]
        // Replace feature in statistics
        N^(f)_{p_ab, f_ab} = N^(f)_{p_ab, f_ab} + 1
        N^(p)_{a, p_ab} = N^(p)_{a, p_ab} + 1
    until all parts p_ij have been updated
end
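
For reference, here is a compact Python/NumPy sketch of the collapsed Gibbs sampler that Algorithms 70 and 71 describe, using 0-based indices, a fixed number of sweeps, and my own function name; the burn-in/skip bookkeeping and the final normalization of the counts into π and λ are left out.

import numpy as np

def lda_gibbs(f, M, K, alpha, beta, n_sweeps=200, rng=np.random.default_rng(0)):
    """Collapsed Gibbs sampler for the LDA model of the features.

    f : list of I integer arrays with values in 0..K-1 (features per image)
    Returns the final part assignments and the count matrices N_f (MxK)
    and N_p (IxM); estimates of pi and lam follow by normalizing the
    counts plus alpha and beta as in Algorithm 70.
    """
    I = len(f)
    parts = [rng.integers(M, size=len(fi)) for fi in f]       # initial labels
    N_f = np.zeros((M, K))
    N_p = np.zeros((I, M))
    for i, (fi, pi) in enumerate(zip(f, parts)):
        for fij, pij in zip(fi, pi):
            N_f[pij, fij] += 1
            N_p[i, pij] += 1
    for _ in range(n_sweeps):
        for i in range(I):
            for j in range(len(f[i])):
                k, m_old = f[i][j], parts[i][j]
                N_f[m_old, k] -= 1                            # remove from stats
                N_p[i, m_old] -= 1
                q = (N_f[:, k] + beta) / (N_f.sum(axis=1) + K * beta) \
                    * (N_p[i] + alpha)
                m_new = rng.choice(M, p=q / q.sum())          # sample new part
                parts[i][j] = m_new
                N_f[m_new, k] += 1                            # replace in stats
                N_p[i, m_new] += 1
    return parts, N_f, N_p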

0.15 Preprocessing

0.15.1 Principal components analysis

The goal of PCA is to approximate a set of multivariate data {x_i}_{i=1}^I with a second set of variables of reduced size {h_i}_{i=1}^I, so that

x_i ≈ µ + Φh_i,

where Φ is a rectangular matrix whose columns are unit length and orthogonal to one another, so that Φ^TΦ = I.

Algorithm 72: Principal components analysis

Input : Training data {x_i}_{i=1}^I, number of components K
Output: Mean µ, PCA basis functions Φ, low dimensional data {h_i}_{i=1}^I
begin
    // Estimate mean
    µ = Σ_{i=1}^I x_i / I
    // Form mean-zero data matrix
    X = [x_1 − µ, x_2 − µ, ... x_I − µ]
    // Do spectral decomposition of the scatter matrix
    [U, L, V] = svd[X^T X]
    // Compute dual principal components (first K columns)
    Ψ = [u_1, u_2, ... u_K]
    L_K = diag[l_11, l_22, ... l_KK]
    // Compute principal components (normalized so that Φ^T Φ = I)
    Φ = XΨL_K^{-1/2}
    // Convert data to low dimensional representation
    for i = 1 to I do
        h_i = Φ^T (x_i − µ)
    end
end
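
A Python/NumPy sketch of this dual-formulation PCA under my own naming; the components are normalized so that Φ^TΦ = I, as required by the model statement above.

import numpy as np

def pca(X, K):
    """Dual-formulation PCA; X is (D, I) with one example per column.

    Returns the mean mu, an orthonormal basis Phi (D, K) and the
    low-dimensional representations H (K, I).
    """
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu                                        # mean-zero data matrix
    vals, vecs = np.linalg.eigh(Xc.T @ Xc)             # spectral decomposition
    order = np.argsort(vals)[::-1][:K]                 # top-K eigenvectors
    Psi, L = vecs[:, order], vals[order]
    Phi = Xc @ Psi / np.sqrt(L)                        # unit-length components
    H = Phi.T @ Xc                                     # low-dimensional data
    return mu, Phi, H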

0.15.2 k-means algorithm

The goal of the K-means algorithm is to partition a set of data {x_i}_{i=1}^I into K clusters. It can be thought of as approximating each data point with the associated cluster mean µ_k, so that

x_i ≈ µ_{h_i},

where h_i ∈ {1, 2, ..., K} is a discrete variable that indicates which cluster the ith point belongs to.

Algorithm 73: K-means algorithm

Input : Data {x_i}_{i=1}^I, number of clusters K, data dimension D
Output: Cluster means {µ_k}_{k=1}^K, cluster assignment indices {h_i}_{i=1}^I
begin
    // Initialize cluster means (one of many possible heuristics)
    µ = Σ_{i=1}^I x_i / I
    for i = 1 to I do
        d_i = (x_i − µ)^T (x_i − µ)
    end
    Σ = Diag[Σ_{i=1}^I d_i / I]
    for k = 1 to K do
        µ_k = µ + Σ^{1/2} randn[D, 1]
    end
    // Main loop
    repeat
        for i = 1 to I do
            // Compute distance from data point to each cluster mean
            for k = 1 to K do
                d_ik = (x_i − µ_k)^T (x_i − µ_k)
            end
            // Update cluster assignment
            h_i = argmin_k d_ik
        end
        // Update cluster means
        for k = 1 to K do
            µ_k = Σ_{i=1}^I δ[h_i − k] x_i / (Σ_{i=1}^I δ[h_i − k])
        end
    until No further change in {µ_k}_{k=1}^K
end
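
A minimal Python/NumPy sketch of the K-means loop under my own naming; it adds a guard for empty clusters, which the pseudocode above does not address.

import numpy as np

def kmeans(X, K, rng=np.random.default_rng(0), max_iter=100):
    """Basic K-means; X is (I, D).  Returns cluster means and assignments."""
    I, D = X.shape
    # Initialize means by perturbing the data mean (one of many heuristics)
    mu = X.mean(axis=0)
    std = X.std(axis=0)
    means = mu + std * rng.standard_normal((K, D))
    for _ in range(max_iter):
        # Squared distances from every point to every cluster mean
        d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        h = d.argmin(axis=1)                         # cluster assignments
        new_means = np.array([X[h == k].mean(axis=0) if np.any(h == k) else means[k]
                              for k in range(K)])
        if np.allclose(new_means, means):            # converged
            break
        means = new_means
    return means, h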
