Algorithms
This section contains concise descriptions of almost all of the models and algorithms in this book. This includes additional details, variations of algorithms and implementation concerns that were omitted from the main text to improve readability. The goal is to provide sufficient information to implement a naive version of each method and the reader is encouraged to do exactly this.
WARNING! These algorithms have not been checked very well. I’m looking for volunteers to help me with this - please mail [email protected] if you can help. In the meantime, treat them with suspicion and send me any problems you find.
0.1.6 Bayesian approach to univariate normal distribution
In the Bayesian approach to the univariate normal distribution we again use a normal-scaled inverse gamma prior. In the learning stage we compute a probability distribution over the mean and variance parameters. The predictive distribution for a new datum is based on all possible values of these parameters.
Algorithm 6: Bayesian approach to normal distribution
Input : Training data {xi}Ii=1, hyperparameters α, β, γ, δ, test data x∗
Output: Posterior parameters α̃, β̃, γ̃, δ̃, predictive distribution Pr(x∗|x1...I)
begin
// Compute normal-scaled inverse gamma posterior over parameters from training data
α̃ = α + I/2
β̃ = ∑i xi²/2 + β + (γδ²)/2 − (γδ + ∑i xi)²/(2γ + 2I)
γ̃ = γ + I
δ̃ = (γδ + ∑i xi)/(γ + I)
// Compute intermediate parameters
ᾰ = α̃ + 1/2
β̆ = (x∗²)/2 + β̃ + (γ̃δ̃²)/2 − (γ̃δ̃ + x∗)²/(2γ̃ + 2)
γ̆ = γ̃ + 1
// Evaluate new datapoint under predictive distribution
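The following is a minimal NumPy/SciPy sketch of Algorithm 6. It is not part of the original text; the function names are illustrative only, and the final predictive density uses the standard conjugate normal-scaled inverse gamma result, which should be checked against the book's expression.

    import numpy as np
    from scipy.special import gammaln

    def fit_bayes_normal(x, alpha, beta, gamma, delta):
        """Posterior normal-scaled inverse gamma parameters (mirrors Algorithm 6)."""
        I = len(x)
        s, s2 = np.sum(x), np.sum(x ** 2)
        alpha_t = alpha + I / 2
        beta_t = s2 / 2 + beta + gamma * delta ** 2 / 2 \
                 - (gamma * delta + s) ** 2 / (2 * gamma + 2 * I)
        gamma_t = gamma + I
        delta_t = (gamma * delta + s) / (gamma + I)
        return alpha_t, beta_t, gamma_t, delta_t

    def predictive(x_star, alpha_t, beta_t, gamma_t, delta_t):
        """Density of a new point under the conjugate predictive distribution."""
        alpha_b = alpha_t + 0.5
        gamma_b = gamma_t + 1.0
        beta_b = x_star ** 2 / 2 + beta_t + gamma_t * delta_t ** 2 / 2 \
                 - (gamma_t * delta_t + x_star) ** 2 / (2 * gamma_t + 2)
        log_p = (0.5 * (np.log(gamma_t) - np.log(gamma_b)) - 0.5 * np.log(2 * np.pi)
                 + alpha_t * np.log(beta_t) - alpha_b * np.log(beta_b)
                 + gammaln(alpha_b) - gammaln(alpha_t))
        return np.exp(log_p)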
0.1.9 Bayesian approach to multivariate normal distribution
In the Bayesian approach to the multivariate normal distribution we again use a normal inverse Wishart prior. In the learning stage we compute a probability distribution over the mean and covariance parameters. The predictive distribution for a new datum is based on all possible values of these parameters.
Algorithm 9: Bayesian approach to multivariate normal distribution
Input : Training data {xi}Ii=1, hyperparameters α, Ψ, γ, δ, test data x∗
Output: Posterior parameters α̃, Ψ̃, γ̃, δ̃, predictive distribution Pr(x∗|x1...I)
begin
// Compute normal inverse Wishart posterior over parameters
α̃ = α + I
Ψ̃ = Ψ + γδδᵀ/2 + ∑Ii=1 xixiᵀ/2 − (γδ + ∑i xi)(γδ + ∑i xi)ᵀ/(2γ + 2I)
γ̃ = γ + I
δ̃ = (∑i xi + γδ)/(I + γ)
// Compute intermediate parameters
ᾰ = α̃ + 1
Ψ̆ = γ̃δ̃δ̃ᵀ + x∗x∗ᵀ − (γ̃δ̃ + x∗)(γ̃δ̃ + x∗)ᵀ/(γ̃ + 1)
γ̆ = γ̃ + 1
// Evaluate new datapoint under predictive distribution
Consider the situation where we wish to assign a label w ∈ {1, 2, . . . , K} based on an observed multivariate measurement vector xi. We model the class conditional density functions as normal distributions so that
Pr(xi|wi = k) = Normxi [µk,Σk] (2)
with prior probabilities over the world state defined by
Pr(wi) = Catwi [λ] (3)
Algorithm 13: Basic Generative classifier
Input : Training data {xi, wi}Ii=1, new data example x∗
Output: ML parameters θ = {λ1...K, µ1...K, Σ1...K}, posterior probability Pr(w∗|x∗)
begin
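The body of Algorithm 13 is truncated in this transcript. As a rough illustration of the intended computation (maximum likelihood fitting of one normal per class, followed by Bayes' rule at test time), here is a hedged NumPy/SciPy sketch with illustrative function names:

    import numpy as np
    from scipy.stats import multivariate_normal

    def fit_generative_classifier(X, w, K):
        """ML estimates of class priors, means and covariances. X is I x D; w holds labels 0..K-1."""
        lam, mu, Sigma = [], [], []
        for k in range(K):
            Xk = X[w == k]
            lam.append(len(Xk) / len(X))
            mu.append(Xk.mean(axis=0))
            Sigma.append(np.cov(Xk, rowvar=False, bias=True))
        return np.array(lam), mu, Sigma

    def classify(x_star, lam, mu, Sigma):
        """Posterior Pr(w*|x*) via Bayes' rule over the class-conditional normals."""
        lik = np.array([multivariate_normal.pdf(x_star, mu[k], Sigma[k])
                        for k in range(len(lam))])
        post = lik * lam
        return post / post.sum()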
The mixture of Gaussians (MoG) is a probability density model suitable for data x in D dimensions. The data is described as a weighted sum of K normal distributions

Pr(x|θ) = ∑Kk=1 λk Normx[µk, Σk],

where µ1...K and Σ1...K are the means and covariances of the normal distributions and λ1...K are positive-valued weights that sum to one. The MoG is fit using the EM algorithm.
Algorithm 14: Maximum likelihood learning for mixtures of Gaussians
Input : Training data {xi}Ii=1, number of clusters K
Output: ML estimates of parameters θ = {λ1...K, µ1...K, Σ1...K}
begin
Initialize θ = θ0 (a)
repeat
// Expectation step
for i=1 to I do
for k=1 to K do
lik = λk Normxi[µk, Σk] // numerator of Bayes' rule
end
// Compute posterior (responsibilities) by normalizing
for k=1 to K do
rik = lik/(∑Kk=1 lik)
end
end
// Maximization step (b)
for k=1 to K do
λk[t+1] = ∑Ii=1 rik/(∑Kk=1 ∑Ii=1 rik)
µk[t+1] = ∑Ii=1 rik xi/(∑Ii=1 rik)
Σk[t+1] = ∑Ii=1 rik (xi − µk[t+1])(xi − µk[t+1])ᵀ/(∑Ii=1 rik)
end
// Compute data log likelihood and EM bound
L = ∑Ii=1 log[∑Kk=1 λk Normxi[µk, Σk]]
B = ∑Ii=1 ∑Kk=1 rik log[λk Normxi[µk, Σk]/rik]
until no further improvement in L
end
(a) One possibility is to set the weights λ• = 1/K, the means µ• to the values of K randomly chosen datapoints and the variances Σ• to the variance of the whole dataset.
(b) For a diagonal covariance retain only the diagonal of the Σk update.
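For reference, a compact NumPy/SciPy sketch of Algorithm 14 follows (illustrative only; a small ridge is added to each covariance for numerical stability, which is an implementation choice rather than part of the algorithm):

    import numpy as np
    from scipy.stats import multivariate_normal

    def fit_mog(X, K, n_iter=100, seed=0):
        """EM for a mixture of Gaussians. X is an I x D array. Returns (lam, mu, Sigma)."""
        rng = np.random.default_rng(seed)
        I, D = X.shape
        lam = np.full(K, 1.0 / K)                       # equal weights
        mu = X[rng.choice(I, K, replace=False)]         # K random datapoints as means
        Sigma = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(D)] * K)
        prev_L = -np.inf
        for _ in range(n_iter):
            # E-step: responsibilities r[i, k] proportional to lam_k * N(x_i | mu_k, Sigma_k)
            lik = np.column_stack([lam[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                   for k in range(K)])
            r = lik / lik.sum(axis=1, keepdims=True)
            # M-step: re-estimate weights, means and covariances from responsibilities
            Nk = r.sum(axis=0)
            lam = Nk / I
            mu = (r.T @ X) / Nk[:, None]
            for k in range(K):
                d = X - mu[k]
                Sigma[k] = (r[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(D)
            L = np.log(lik.sum(axis=1)).sum()           # data log likelihood
            if L - prev_L < 1e-6:
                break
            prev_L = L
        return lam, mu, Sigma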
ν = optimize[tCost[ν, {E[hi], E[log hi]}Ii=1], ν]
// Compute data log likelihood
for i=1 to I do
δi = (xi − µ)ᵀΣ−1(xi − µ)
end
L = I log[Γ[(ν + D)/2]] − I(D/2) log[νπ] − I log[|Σ|]/2 − I log[Γ[ν/2]]
L = L − ∑Ii=1 ((ν + D)/2) log[1 + δi/ν]
until no further improvement in L
end
(a) One possibility is to initialize the parameters µ and Σ to the mean and variance of the data and set the initial degrees of freedom to a large value, say ν = 1000.
The optimization of the degrees of freedom ν uses the criterion
The factor analyzer is a probability density model suitable for data x in D dimensions. It has pdf

Pr(xi|θ) = Normxi[µ, ΦΦᵀ + Σ],

where µ is a D×1 mean vector, Φ is a D×K matrix containing the K factors {φk}Kk=1 in its columns and Σ is a diagonal matrix of size D×D. The factor analyzer is fit using the EM algorithm.
Algorithm 16: Maximum likelihood learning for factor analyzer
Input : Training data {xi}Ii=1, number of factors K
Output: Maximum likelihood estimates of parameters θ = {µ, Φ, Σ}
begin
Initialize θ = θ0 (a)
// Set mean
µ = ∑Ii=1 xi/I
repeat
// Expectation step
for i=1 to I do
E[hi] = (ΦᵀΣ−1Φ + I)−1ΦᵀΣ−1(xi − µ)
E[hihiᵀ] = (ΦᵀΣ−1Φ + I)−1 + E[hi]E[hi]ᵀ
end
// Maximization step
Φ = (∑Ii=1 (xi − µ)E[hi]ᵀ)(∑Ii=1 E[hihiᵀ])−1
Σ = (1/I) ∑Ii=1 diag[(xi − µ)(xi − µ)ᵀ − Φ E[hi](xi − µ)ᵀ]
// Compute data log likelihood (b)
L = ∑Ii=1 log[Normxi[µ, ΦΦᵀ + Σ]]
until no further improvement in L
end
(a) It is usual to initialize Φ to random values. The D diagonal elements of Σ can be initialized to the variances of the D data dimensions.
(b) In high dimensions it is worth reformulating this covariance using the Woodbury relation (section ??).
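A NumPy sketch of Algorithm 16 follows (illustrative only; it stores the diagonal covariance Σ as a vector and folds the mean subtraction into a centred data matrix):

    import numpy as np

    def fit_factor_analyzer(X, K, n_iter=100, seed=0):
        """EM for a factor analyzer. X is I x D; returns (mu, Phi, Sigma_diag)."""
        rng = np.random.default_rng(seed)
        I, D = X.shape
        mu = X.mean(axis=0)
        Z = X - mu
        Phi = rng.standard_normal((D, K))
        Sigma = Z.var(axis=0) + 1e-6                    # diagonal covariance as a vector
        for _ in range(n_iter):
            # E-step: posterior moments of the hidden variables h_i
            SinvPhi = Phi / Sigma[:, None]              # Sigma^{-1} Phi (Sigma diagonal)
            M = np.linalg.inv(Phi.T @ SinvPhi + np.eye(K))
            Eh = Z @ SinvPhi @ M                        # rows are E[h_i]
            EhhT_sum = I * M + Eh.T @ Eh                # sum over i of E[h_i h_i^T]
            # M-step: update factors and diagonal noise covariance
            Phi = (Z.T @ Eh) @ np.linalg.inv(EhhT_sum)
            Sigma = np.mean(Z ** 2 - Z * (Eh @ Phi.T), axis=0) + 1e-6
        return mu, Phi, Sigma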
The linear regression model describes the world y as a normal distribution. The mean of this distribution is a linear function φ0 + φᵀx and the variance is constant. In practice we add a 1 to the start of every data vector xi ← [1 xiᵀ]ᵀ and attach the y-intercept φ0 to the start of the gradient vector φ ← [φ0 φᵀ]ᵀ and write

Pr(yi|xi, θ) = Normyi[φ0 + φᵀxi, σ²].

To learn the model, we will work with the matrix X = [x1, x2, . . . , xI] which contains all of the training data examples in its columns and the world vector y = [y1, y2, . . . , yI]ᵀ which contains the training world states.
Algorithm 17: Maximum likelihood learning for linear regression
Input : (D + 1)×I data matrix X, I×1 world vector y
Output: Maximum likelihood estimates of parameters θ = {φ, σ²}
begin
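The body of Algorithm 17 is missing from this transcript. The computation it performs has a standard closed form (normal equations for φ, residual variance for σ²), sketched here in NumPy with illustrative names:

    import numpy as np

    def fit_linear_regression(X, y):
        """ML linear regression. X is (D+1) x I with a leading row of ones; y has length I."""
        # phi minimizes ||X^T phi - y||^2, i.e. the normal equations
        phi = np.linalg.solve(X @ X.T, X @ y)
        residual = X.T @ phi - y
        sigma_sq = residual @ residual / len(y)         # ML (biased) noise variance
        return phi, sigma_sq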
This is a straightforward optimization problem. We prepend a 1 to the start of each data example xi and then optimize the log binomial probability. To do this we need to compute this value, and the derivative and Hessian with respect to the parameter φ.
Algorithm 24: Compute cost function, derivative and Hessian
Input : Binary world states {wi}Ii=1, observed data {xi}Ii=1, parameters φ
Output: Cost L, gradient g, Hessian H
begin
// Initialize cost, gradient, Hessian
L = 0
g = zeros[D + 1, 1]
H = zeros[D + 1, D + 1]
// For each data point
for i=1 to I do
// Compute prediction y
yi = 1/(1 + exp[−φᵀxi])
// Update log likelihood, gradient and Hessian
if wi == 1 then
L = L + log[yi]
else
L = L + log[1 − yi]
end
g = g + (yi − wi)xi
H = H + yi(1 − yi)xixiᵀ
end
end
Don’t forget to multiply L, g and H by −1 if you are optimizing with a routine that minimizes a cost function rather than maximizes it.
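A NumPy sketch of Algorithm 24 follows. It mirrors the convention above, where L is the log likelihood while g and H accumulate the derivatives used by a minimizer, so flip signs as appropriate for your optimizer:

    import numpy as np

    def logreg_cost(phi, X, w):
        """Log likelihood, gradient and Hessian for logistic regression.
        X is I x (D+1) with a leading column of ones; w is a 0/1 vector of length I."""
        y = 1.0 / (1.0 + np.exp(-X @ phi))              # predictions y_i
        L = np.sum(w * np.log(y) + (1 - w) * np.log(1 - y))
        g = X.T @ (y - w)                               # as in the text: sum_i (y_i - w_i) x_i
        H = (X * (y * (1 - y))[:, None]).T @ X          # sum_i y_i (1 - y_i) x_i x_i^T
        return L, g, H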
This is a straightforward optimization problem and very similar to the original logistic regression model except that we now also have a prior over the parameters

Pr(φ) = Normφ[0, σp²I] (4)

We prepend a 1 to the start of each data example xi and then optimize the log binomial probability. To do this we need to compute this value, and the derivative and Hessian with respect to the parameter φ.
Algorithm 25: Compute cost function, derivative and Hessian
Input : Binary world states {wi}Ii=1, observed data {xi}Ii=1, parameters φ, prior variance σp²
Output: Cost L, gradient g, Hessian H
begin
// Initialize cost, gradient, Hessian with the contribution of the prior
L = −(D + 1) log[2πσp²]/2 − φᵀφ/(2σp²)
g = −φ/σp²
H = −(1/σp²)I
// For each data point
for i=1 to I do
// Compute prediction y
yi = 1/(1 + exp[−φᵀxi])
// Update log likelihood, gradient and Hessian
if wi == 1 then
L = L + log[yi]
else
L = L + log[1 − yi]
end
g = g + (yi − wi)xi
H = H + yi(1 − yi)xixiᵀ
end
end
Don’t forget to multiply L, g and H by −1 if you are optimizing with a routine that minimizes a cost function rather than maximizes it.
In Bayesian logistic regression, we aim to compute the predictive distribution Pr(w∗|x∗) over the binary world state w∗ for a new data example x∗. This takes the form of a Bernoulli distribution and is hence summarized by the single parameter λ∗ = Pr(w∗ = 1|x∗).
Algorithm 26: Bayesian logistic regression
Input : Binary world states {wi}Ii=1, observed data {xi}Ii=1, new data x∗
Output: Bernoulli parameter λ∗ from Pr(w∗|x∗) for new data x∗
begin
// Prepend a 1 to the start of each data vector
for i=1 to I do
xi = [1; xi]
end
// Initialize parameters
φ = zeros[D + 1, 1]
// Optimization using cost function of algorithm ??
φ = optimize[logRegCrit[{xi, wi}, φ], φ]
// Compute Hessian at peak (algorithm ??)
[L, g, H] = logRegCrit[{xi, wi}, φ]
// Set mean and variance of Laplace approximation
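The remainder of Algorithm 26 is cut off above. As a hedged sketch of the whole pipeline (MAP estimate, Laplace approximation at the peak, and the usual approximation to the resulting logistic–Gaussian integral, which may differ in detail from the book's expression):

    import numpy as np
    from scipy.optimize import minimize

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def bayesian_logreg_predict(X, w, x_star, prior_var=1e3):
        """Laplace-approximation predictive for logistic regression.
        X is I x (D+1), x_star is (D+1,), both already with a leading 1."""
        D1 = X.shape[1]

        def neg_log_post(phi):
            a = X @ phi
            # negative log likelihood plus negative log prior (Gaussian, variance prior_var)
            return -np.sum(w * a - np.logaddexp(0, a)) + phi @ phi / (2 * prior_var)

        phi_map = minimize(neg_log_post, np.zeros(D1), method="BFGS").x
        y = sigmoid(X @ phi_map)
        # Hessian of the negative log posterior at the MAP estimate
        H = (X * (y * (1 - y))[:, None]).T @ X + np.eye(D1) / prior_var
        Sigma = np.linalg.inv(H)                        # Laplace covariance
        mu_a = x_star @ phi_map
        var_a = x_star @ Sigma @ x_star
        # Standard approximation to the logistic-Gaussian integral
        return sigmoid(mu_a / np.sqrt(1 + np.pi * var_a / 8))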
This is a straightforward optimization problem. We prepend a 1 to the start of each data example xi and then optimize the log binomial probability. To do this we need to compute this value, and the derivative and Hessian with respect to the parameter ψ.
Algorithm 27: Compute cost function, derivative and Hessian
Input : Binary world states {wi}Ii=1, observed data {xi}Ii=1, parameters ψ
Output: Cost L, gradient g, Hessian H
begin
// Initialize cost, gradient, Hessian with the contribution of the prior
L = −I log[2πσp²]/2 − ψᵀψ/(2σp²)
g = −ψ/σp²
H = −(1/σp²)I
// Form compound data matrix
X = [x1, x2, . . . , xI]
// For each data point
for i=1 to I do
// Compute prediction y
yi = 1/(1 + exp[−ψᵀXᵀxi])
// Update log likelihood, gradient and Hessian
if wi == 1 then
L = L + log[yi]
else
L = L + log[1 − yi]
end
g = g + (yi − wi)Xᵀxi
H = H + yi(1 − yi)XᵀxixiᵀX
end
end
Don’t forget to multiply L, g and H by −1 if you are optimizing with a routine that minimizes a cost function rather than maximizes it.
In dual Bayesian logistic regression, we aim to compute the predictive distribution Pr(w∗|x∗) over the binary world state w∗ for a new data example x∗. This takes the form of a Bernoulli distribution and is hence summarized by the single parameter λ∗ = Pr(w∗ = 1|x∗).
Algorithm 28: Dual Bayesian logistic regression
Input : Binary world states {wi}Ii=1, observed data {xi}Ii=1, new data x∗
Output: Bernoulli parameter λ∗ from Pr(w∗|x∗) for new data x∗
begin
// Prepend a 1 to the start of each data vector
for i=1 to I do
xi = [1; xi]
end
// Initialize parameters
ψ = zeros[I, 1]
// Optimization using cost function of algorithm ??
ψ = optimize[logRegCrit[ψ], ψ]
// Compute Hessian at peak (algorithm ??)
[L, g, H] = logRegCrit[{xi, wi}, ψ]
// Set mean and variance of Laplace approximation
0.5.7 Bayesian kernel logistic regression (Gaussian process classification)
In dual Bayesian logistic regression, we aim to compute the predictive distribution Pr(w∗|x∗) over the binary world state w∗ for a new data example x∗. This takes the form of a Bernoulli distribution and is hence summarized by the single parameter λ∗ = Pr(w∗ = 1|x∗).
Algorithm 30: Bayesian kernel logistic regression
Input : Binary world states {wi}Ii=1, observed data {xi}Ii=1, new data x∗
Output: Bernoulli parameter λ∗ from Pr(w∗|x∗) for new data x∗
begin
// Prepend a 1 to the start of each data vector
for i=1 to I do
xi = [1; xi]
end
// Initialize parameters
ψ = zeros[I, 1]
// Optimization using cost function of algorithm ??
ψ = optimize[logRegKernelCrit[ψ], ψ]
// Compute Hessian at peak (algorithm ??)
[L, g, H] = logRegKernelCrit[{xi, wi}, ψ]
// Set mean and variance of Laplace approximation
The incremental fitting approach to logistic regression fits the model

Pr(w|φ, x) = Bernw[1/(1 + exp[−φ0 − ∑Kk=1 φk f[x, ξk]])].

The method is to set all the weight parameters φk to zero initially and to optimize them one by one. At the first stage we optimize φ0, φ1 and ξ1. Then we optimize φ0, φ2 and ξ2 and so on.
Algorithm 31: Incremental logistic regression
Input : Binary world states {wi}Ii=1, observed data {xi}Ii=1
Output: ML parameters φ0, {φk, ξk}Kk=1
begin
// Initialize parameters
φ0 = 0
for k=1 to K do
φk = 0
ξk = ξk(0)
end
// Initialize offsets
for i=1 to I do
ai = 0
end
for k=1 to K do
// Reset offset parameters
for i=1 to I do
ai = ai − φ0
end
φ0 = 0
[φ0, φk, ξk] = optimize[logRegOffsetCrit[φ0, φk, ξk, {ai, xi}], φ0, φk, ξk]
for i=1 to I do
ai = ai + φ0 + φk f[xi, ξk]
end
end
end
At each stage the optimization procedure improves the criterion
where we have prepended a 1 to the start of each data vector x. This is a straightforward optimization problem over the log probability. We need to compute this value, and the derivative and Hessian with respect to the parameters φk.
Algorithm 33: Cost function, derivative and Hessian for multi-class logistic regression
Input : World states {wi}Ii=1, observed data {xi}Ii=1, parameters {φk}Kk=1
Output: Cost L, gradient g, Hessian H
begin
// Initialize cost, gradient, Hessian
L = 0
for k=1 to K do
gk = 0
for l=1 to K do
Hkl = 0
end
end
// For each data point
for i=1 to I do
// Compute prediction y
yi = softmax[φ1ᵀxi, φ2ᵀxi, . . . , φKᵀxi]
// Update log likelihood
L = L + log[yi,wi]
// Update gradient and Hessian
for k=1 to K do
gk = gk + xi(yik − δ[wi − k])
for l=1 to K do
Hkl = Hkl + xixiᵀ yik(δ[k − l] − yil)
end
end
end
// Assemble final gradient and Hessian
g = [g1; g2; . . . ; gK]
for k=1 to K do
Hk = [Hk1, Hk2, . . . , HkK]
end
H = [H1; H2; . . . ; HK]
end
Don’t forget to multiply L, g and H by −1 if you are optimizing with a routine that minimizes a cost function rather than maximizes it.
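A NumPy sketch of Algorithm 33 follows (illustrative only; labels are zero-based and the Hessian is assembled directly into one K(D+1) × K(D+1) matrix):

    import numpy as np

    def multiclass_logreg_cost(Phi, X, w):
        """Log likelihood, gradient and Hessian for multi-class logistic regression.
        Phi is K x (D+1); X is I x (D+1) with leading ones; w holds labels in 0..K-1."""
        K, D1 = Phi.shape
        A = X @ Phi.T                                   # activations, I x K
        A -= A.max(axis=1, keepdims=True)               # numerical stability
        Y = np.exp(A)
        Y /= Y.sum(axis=1, keepdims=True)               # softmax predictions y_ik
        I = X.shape[0]
        L = np.sum(np.log(Y[np.arange(I), w]))
        W = np.zeros_like(Y)
        W[np.arange(I), w] = 1.0                        # one-hot labels delta[w_i - k]
        g = ((Y - W).T @ X).reshape(-1)                 # stacked gradient blocks g_k
        H = np.zeros((K * D1, K * D1))
        for k in range(K):
            for l in range(K):
                coeff = Y[:, k] * ((k == l) - Y[:, l])
                H[k*D1:(k+1)*D1, l*D1:(l+1)*D1] = (X * coeff[:, None]).T @ X
        return L, g, H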
0.6.1 Gibbs’ sampling from a discrete undirected model
Algorithm 35: Gibbs’ sampling from undirected model
Input : Potential functions {φc[Sc]}Cc=1
Output: Samples {xt}Tt=1
begin
// Initialize first sample in chain
x0 = x(0)
// For each time sample
for t=1 to T do
xt = xt−1
// For each dimension
for d=1 to D do
// For each possible value
for k=1 to K do
λk = 1
xtd = k
for c such that d ∈ Sc do
λk = λk φc[Sc]
end
end
λ = λ/∑Kk=1 λk
// Draw from categorical distribution
xtd = DrawFromCategorical[λ]
end
end
end
It is normal to discard the first few thousand entries so that the initial conditions are forgotten. Then entries are chosen that are spaced apart to avoid correlation between the samples.
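A sketch of Algorithm 35 in Python follows. The representation of the potential functions is left open in the text, so the (clique, phi) interface below is purely hypothetical:

    import numpy as np

    def gibbs_sample(potentials, x0, K, T, rng=None):
        """Gibbs sampling from a discrete undirected model.
        `potentials` is a list of (clique, phi) pairs where `clique` is a tuple of dimension
        indices and phi(x) returns the potential value for the current full state x.
        States take values in {0..K-1}."""
        rng = rng or np.random.default_rng()
        x = np.array(x0, dtype=int)
        samples = []
        for _ in range(T):
            for d in range(len(x)):
                lam = np.ones(K)
                for k in range(K):
                    x[d] = k
                    # product of all potentials whose clique contains dimension d
                    for clique, phi in potentials:
                        if d in clique:
                            lam[k] *= phi(x)
                lam /= lam.sum()
                x[d] = rng.choice(K, p=lam)             # draw from categorical distribution
            samples.append(x.copy())
        return samples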
This algorithm relies on pre-computing an order to traverse the nodes so that the children of each node in the graph are visited before the parent. It also uses the notation ψn,k[ych[n]] to represent the logarithm of the factor in the probability distribution that includes node n and its children, for some yn = k and some values of the children.
Algorithm 38: Dynamic programming in tree
Input : Unary costs {Un,k}N,Kn=1,k=1, joint cost functions {ψn,k[ych[n]]}Nn=1
Output: Minimum cost path {yn}Nn=1
begin
repeat
// Retrieve nodes in an order so children always come before parents
n = GetNextNode[]
// Add unary costs to cumulative sums
for k=1 to K do
Sn,k = Un,k + min ych[n] ψn,k[ych[n]]
Rn,k = argmin ych[n] ψn,k[ych[n]]
end
// Push node index onto stack
push[n]
until pa[yn] = ∅
// Find node yN with overall minimum cost
Given a known object with I distinct three-dimensional points {wi}Ii=1, their corresponding projections in the image {xi}Ii=1 and known camera parameters Λ, estimate the geometric relationship between the camera and the object, determined by the rotation Ω and the translation τ.
Algorithm 46: ML learning of extrinsic parameters
Input : Intrinsic matrix Λ, pairs of points {xi, wi}Ii=1
Output: Extrinsic parameters: rotation Ω and translation τ
begin
for i=1 to I do
// Convert to normalized camera coordinates
0.9.2 ML learning of intrinsic parameters (camera calibration)
Given a known object with I distinct 3D points {wi}Ii=1 and their corresponding projections in the image {xi}Ii=1, establish the camera parameters Λ.
Given J calibrated cameras in known positions (i.e. cameras with known Λ, Ω, τ), viewing the same three-dimensional point w and knowing the corresponding projections in the images {xj}Jj=1, establish the position of the point in the world.
0.10.4 ML learning of projective transformation (homography)
The projective transformation model maps one set of 2D points {wi}Ii=1 to another set {xi}Ii=1 using a non-linear transformation with a 3×3 parameter matrix Φ so that
Pr(xi|wi, Φ, σ²) = Normxi[proj[wi, Φ], σ²I],
where the homography is defined as
proj[wi, Φ] = [(φ11u + φ12v + φ13)/(φ31u + φ32v + φ33), (φ21u + φ22v + φ23)/(φ31u + φ32v + φ33)]ᵀ.
Algorithm 52: Maximum likelihood learning of projective transformation
Input : Training data pairs {xi, wi}Ii=1
Output: Parameter matrix Φ, variance σ²
begin
// Convert data to homogeneous representation
for i=1 to I do
xi = [xi; 1]
end
// Compute intermediate 2×9 matrices Ai
for i=1 to I do
Ai = [0, xi; −xi, 0; vixi, −uixi]ᵀ
end
// Concatenate matrices Ai into 2I×9 matrix A
A = [A1; A2; . . . ; AI]
// Solve for approximate parameters
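The remaining steps of Algorithm 52 are cut off above. A standard DLT sketch in NumPy is given below for orientation; the exact construction of the Ai rows may differ from the text by a sign or ordering:

    import numpy as np

    def fit_homography(W, X):
        """Direct linear transform estimate of the homography mapping points w_i to x_i.
        W and X are I x 2 arrays of corresponding 2D points."""
        rows = []
        for (u, v), (x, y) in zip(W, X):
            wt = np.array([u, v, 1.0])
            rows.append(np.concatenate([np.zeros(3), -wt, y * wt]))
            rows.append(np.concatenate([wt, np.zeros(3), -x * wt]))
        A = np.vstack(rows)                             # 2I x 9
        # Null-space direction of A: right singular vector with smallest singular value
        _, _, Vt = np.linalg.svd(A)
        Phi = Vt[-1].reshape(3, 3)
        return Phi / Phi[2, 2]                          # fix the arbitrary scale

    def apply_homography(Phi, W):
        """Map 2D points through the homography (projective transformation)."""
        Wh = np.hstack([W, np.ones((len(W), 1))])
        Xh = Wh @ Phi.T
        return Xh[:, :2] / Xh[:, 2:3]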
Consider a transformation model that maps one set of 2D points {wi}Ii=1 to another set {xi}Ii=1 so that

Pr(xi|wi, Φ) = Normxi[trans[wi, Φ], σ²I].
In inference we are given a new data point x = [x, y] and wish to compute the most likely point w = [u, v] that was responsible for it. To make progress, we consider the transformation model trans[wi, Φ] in homogeneous form
λ[x, y, 1]ᵀ = [φ11 φ12 φ13; φ21 φ22 φ23; φ31 φ32 φ33][u, v, 1]ᵀ,

or x = Φw. The Euclidean, similarity, affine and projective transformations can all be expressed as a 3×3 matrix of this kind.
Algorithm 53: Maximum likelihood inference for transformation models
Input : Transformation parameters Φ, new point x
Output: Point w
begin
Consider a calibrated camera with known parameters Λ viewing a planar scene. We are given a set of 2D positions on the plane {wi}Ii=1 (measured in real-world units like cm) and their corresponding 2D pixel positions {xi}Ii=1. The goal of this algorithm is to learn the 3D rotation Ω and translation τ that map a point in the frame of reference of the plane w = [u, v, w]ᵀ (with w = 0 on the plane) into the frame of reference of the camera.
Algorithm 54: ML learning of extrinsic parameters (planar scene)
Input : Intrinsic matrix Λ, pairs of points {xi, wi}Ii=1
Output: Extrinsic parameters: rotation Ω and translation τ
begin
// Compute homography between pairs of points
Φ = LearnHomography[{xi}Ii=1, {wi}Ii=1]
// Eliminate effect of intrinsic parameters
Φ = Λ−1Φ
// Compute SVD of first two columns of Φ
[U, L, V] = svd[[φ1, φ2]]
// Estimate first two columns of rotation matrix
[ω1, ω2] = [u1, u2]Vᵀ
// Estimate third column by taking cross product
ω3 = ω1 × ω2
Ω = [ω1, ω2, ω3]
// Check that determinant is one
if det[Ω] < 0 then
Ω = [ω1, ω2, −ω3]
end
// Compute scaling factor for translation vector
λ = (∑3i=1 ∑2j=1 ωij/φij)/6
// Compute translation
τ = λφ3
// Refine parameters with non-linear optimization
[Ω, τ] = optimize[projCost[Ω, τ], Ω, τ]
end
The final optimization minimizes the least squares error between the predicted projections of the points wi into the image and the observed data xi, so

projCost[Ω, τ] = ∑Ii=1 (xi − pinhole[[wi, 0], Λ, Ω, τ])ᵀ(xi − pinhole[[wi, 0], Λ, Ω, τ]).

This optimization should be carried out while enforcing the constraint that Ω remains a valid rotation matrix.
This is also known as camera calibration from a plane. The camera is presented with J views of a plane with unknown poses {Ωj, τj}. For each image we know I points {wi}Ii=1 where wi = [ui, vi, 0] and we know their imaged positions {xij}I,Ji=1,j=1 in each of the J scenes. The goal is to compute the intrinsic matrix Λ.
Algorithm 55: ML learning of intrinsic parameters (planar scene)
0.10.8 Robust learning of projective transformation with RANSAC
The goal of this algorithm is to fit a homography that maps one set of 2D points {wi}Ii=1 to another set {xi}Ii=1, in the case where some of the point matches are known to be wrong (outliers). The algorithm also returns the true matches and the outliers.
Algorithm 56: Robust ML learning of homography
Input : Point pairs {xi, wi}Ii=1, number of RANSAC steps N, threshold τ
Output: Homography Φ, inlier indices I
begin
// Initialize best inlier set to empty
I = ∅
for n=1 to N do
// Draw 4 different random integers between 1 and I
R = RandomSubset[1 . . . I, 4]
// Compute homography (algorithm ??)
Φn = LearnHomography[{xi}i∈R, {wi}i∈R]
// Initialize set of inliers to empty
Sn = ∅
for i=1 to I do
// Compute squared distance
d = (xi − Hom[wi, Φn])ᵀ(xi − Hom[wi, Φn])
// If small enough then add to inliers
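The rest of Algorithm 56 is truncated above. A NumPy sketch of the whole loop, reusing the fit_homography and apply_homography sketches given earlier, is:

    import numpy as np

    def fit_homography_ransac(W, X, n_steps=1000, tau=3.0, rng=None):
        """RANSAC homography fitting. W, X are I x 2 corresponding points;
        tau is the inlier distance threshold in pixels."""
        rng = rng or np.random.default_rng()
        best_inliers = np.array([], dtype=int)
        for _ in range(n_steps):
            subset = rng.choice(len(W), 4, replace=False)   # minimal sample of 4 correspondences
            Phi = fit_homography(W[subset], X[subset])
            d2 = np.sum((X - apply_homography(Phi, W)) ** 2, axis=1)
            inliers = np.flatnonzero(d2 < tau ** 2)
            if len(inliers) > len(best_inliers):
                best_inliers = inliers
        # Re-fit using all inliers of the best model
        Phi = fit_homography(W[best_inliers], X[best_inliers])
        return Phi, best_inliers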
The goal of this algorithm is to estimate K homographies between subsets of the point pairs {wi, xi}Ii=1 using sequential RANSAC.
Algorithm 57: Robust sequential learning of homographies
Input : Point pairs {xi, wi}Ii=1, number of RANSAC steps N, inlier threshold τ, number of homographies to fit K
Output: K homographies {Φk} and associated inlier indices {Ik}
begin
// Initialize set of indices of remaining point pairs
S = {1 . . . I}
for k=1 to K do
// Compute homography using RANSAC (algorithm ??)
[Φk, Ik] = LearnHomographyRobust[{xi}i∈S, {wi}i∈S, N, τ]
// Remove inliers from remaining points
S = S\Ik
// Check that there are enough remaining points
The propose, expand and re-learn (PEaRL) algorithm first suggests a large number of possible homographies relating the point pairs {wi, xi}Ii=1. These then compete for the point pairs to be assigned to them, and they are re-learnt based on these assignments.
Algorithm 58: PEaRL learning of homographies
Input : Point pairs {xi, wi}Ii=1, number of initial models M, inlier threshold τ, minimum number of inliers l, number of iterations J, neighborhood system {Ni}Ii=1, pairwise cost P
Output: Set of homographies {Φk} and associated inlier indices {Ik}
begin
// Propose step: generate M hypotheses
m = 1 // hypothesis number
repeat
// Draw 4 different random integers between 1 and I
R = RandomSubset[1 . . . I, 4]
// Compute homography (algorithm ??)
Φm = LearnHomography[{xi}i∈R, {wi}i∈R]
// Initialize inlier set to empty
Im = ∅
for i=1 to I do
dim = (xi − Hom[wi, Φm])ᵀ(xi − Hom[wi, Φm])
if dim < τ² then // if distance small, add to inliers
Im = Im ∪ {i}
end
end
if |Im| ≥ l then // If enough inliers, get next hypothesis
m = m + 1
end
until m > M
for j=1 to J do
// Expand step: returns I × 1 label vector l
l = AlphaExpand[D, P, {Ni}Ii=1]
// Re-learn step: re-estimate homographies with support
for m=1 to M do
Im = find[l == m] // Extract points with label m
// If enough support then re-learn, update distances
if |Im| ≥ 4 then
Φm = LearnHomography[{xi}i∈Im, {wi}i∈Im]
for i=1 to I do
0.11.2 Eight point algorithm for fundamental matrix
This algorithm takes a set of I ≥ 8 point correspondences {xi1, xi2}Ii=1 between two images and computes the fundamental matrix using the eight point algorithm. To improve the numerical stability of the algorithm, the points are transformed before the calculation and the resulting fundamental matrix is modified to compensate for this transformation.
Algorithm 60: Eight point algorithm for fundamental matrix
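The body of Algorithm 60 is absent from this transcript. A sketch of the normalized eight point algorithm is given below; the sign and transpose conventions (x̃2ᵀ F x̃1 = 0) are assumptions that should be checked against the text:

    import numpy as np

    def eight_point(X1, X2):
        """Normalized eight point algorithm. X1, X2 are I x 2 arrays of corresponding points;
        returns the 3 x 3 fundamental matrix F."""
        def normalize(X):
            # Translate to zero mean and scale so the mean distance from the origin is sqrt(2)
            mean = X.mean(axis=0)
            scale = np.sqrt(2) / np.mean(np.linalg.norm(X - mean, axis=1))
            T = np.array([[scale, 0, -scale * mean[0]],
                          [0, scale, -scale * mean[1]],
                          [0, 0, 1]])
            Xh = np.hstack([X, np.ones((len(X), 1))]) @ T.T
            return Xh, T

        X1h, T1 = normalize(X1)
        X2h, T2 = normalize(X2)
        # Each correspondence gives one linear constraint on the 9 entries of F
        A = np.stack([np.kron(x2, x1) for x1, x2 in zip(X1h, X2h)])
        _, _, Vt = np.linalg.svd(A)
        F = Vt[-1].reshape(3, 3)
        # Enforce rank 2 by zeroing the smallest singular value
        U, S, Vt = np.linalg.svd(F)
        F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
        # Undo the normalizing transformations
        F = T2.T @ F @ T1
        return F / F[2, 2]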
0.11.3 Robust computation of fundamental matrix with RANSAC
The goal of this algorithm is to estimate the fundamental matrix from 2D point pairs {xi1, xi2}Ii=1 in the case where some of the point matches are known to be wrong (outliers). The algorithm also returns the true matches.
Algorithm 61: Robust ML fitting of fundamental matrix
Input : Point pairs {xi1, xi2}Ii=1, number of RANSAC steps N, threshold τ
Output: Fundamental matrix F, inlier indices I
begin
// Initialize best inlier set to empty
I = ∅
for n=1 to N do
// Draw 8 different random integers between 1 and I
R = RandomSubset[1 . . . I, 8]
// Compute fundamental matrix (algorithm ??)
Fn = ComputeFundamental[{xi1}i∈R, {xi2}i∈R]
// Initialize set of inliers to empty
Sn = ∅
for i=1 to I do
// Compute epipolar line in first image
x̃i2 = [xi2; 1]
l = x̃i2ᵀFn
// Compute squared distance to epipolar line
d1 = (l1xi1 + l2yi1 + l3)²/(l1² + l2²)
// Compute epipolar line in second image
x̃i1 = [xi1; 1]
l′ = Fn x̃i1
// Compute squared distance to epipolar line
d2 = (l′1xi2 + l′2yi2 + l′3)²/(l′1² + l′2²)
// If small enough then add to inliers
if (d1 < τ²) && (d2 < τ²) then
Sn = Sn ∪ {i}
end
end
// If best inliers so far then store
if |Sn| > |I| then
I = Sn
end
end
// Compute fundamental matrix from all inliers
This algorithm computes homographies that can be used to rectify the two images. The homography for the second image is chosen so that it moves the epipole to infinity. The homography for the first image is chosen so that the matches lie on the same horizontal lines as in the transformed second image and the distance between the matches is smallest in a least squares sense (i.e. the disparity is smallest).
Algorithm 62: Planar rectification
Input : Point pairs {xi1, xi2}Ii=1
Output: Homographies Φ1, Φ2 to transform first and second images
begin
// Compute fundamental matrix (algorithm ??)
F = ComputeFundamental[{xi1, xi2}Ii=1]
// Compute epipole in image 2
The goal of generalized Procrustes analysis is to align a set of shape vectors {wi}Ii=1 with respect to a given transformation family (Euclidean, similarity, affine etc.). Each shape vector consists of a set of N 2D points wi = [wi1ᵀ, wi2ᵀ, . . . , wiNᵀ]ᵀ. In the algorithm below, we will use the example of registering with respect to a similarity transformation, which consists of a rotation Ω, scaling ρ and translation τ.
Algorithm 63: Generalized Procrustes analysis
Input : Shape vectors {wi}Ii=1, number of iterations K
Output: Template w, transformations {Ωi, ρi, τi}Ii=1
begin
Initialize w = w1
// Main iteration loop
for k=1 to K do
// For each transformation
for i=1 to I do
// Compute transformation to template (algorithm ??)
0.12.4 Probabilistic principal components analysis
The probabilistic principal components analysis algorithm describes a set of I D×1 data examples {xi}Ii=1 with the model

Pr(xi) = Normxi[µ, ΦΦᵀ + σ²I]

where µ is the D×1 mean vector and Φ is a D×K matrix containing the K principal components in its columns. The principal components define a K-dimensional subspace and the parameter σ² explains the variation of the data around this subspace.
Algorithm 64: ML learning of PPCA model
Input : Training data {xi}Ii=1, number of principal components K
Output: Parameters µ, Φ, σ²
begin
// Estimate mean parameter
µ = ∑Ii=1 xi/I
// Form matrix of mean-zero data
X = [x1 − µ, x2 − µ, . . . , xI − µ]
// Decompose X into matrices U, L, V
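The remainder of Algorithm 64 is cut off above. The ML solution has a well-known closed form (Tipping and Bishop), sketched here in NumPy under the assumption K < min(I, D):

    import numpy as np

    def fit_ppca(X, K):
        """Closed-form ML fit of probabilistic PCA. X is I x D.
        Returns (mu, Phi, sigma_sq) with model covariance Phi Phi^T + sigma_sq I."""
        I, D = X.shape
        mu = X.mean(axis=0)
        Z = X - mu
        # Eigen-decomposition of the sample covariance via SVD of the centred data
        _, s, Vt = np.linalg.svd(Z, full_matrices=False)
        eigvals = s ** 2 / I                            # eigenvalues of the ML covariance
        sigma_sq = eigvals[K:].mean()                   # average of the discarded eigenvalues
        Phi = Vt[:K].T @ np.diag(np.sqrt(eigvals[:K] - sigma_sq))
        return mu, Phi, sigma_sq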
This describes the jth of J data examples from the ith of I identities as
xij = µ+ Φhi + εij ,
where xij is the D×1 observed data, µ is the D×1 mean vector, Φ is the D×K factor matrix, hi is the K×1 hidden variable representing the identity and εij is a D×1 additive multivariate normal noise term with diagonal covariance Σ.
Algorithm 65: Maximum likelihood learning for identity subspace model
Input : Training data {xij}I,Ji=1,j=1, number of factors K
Output: Maximum likelihood estimates of parameters θ = {µ, Φ, Σ}
begin
(a) It is usual to initialize Φ to random values. The D diagonal elements of Σ can be initialized to the variances of the D data dimensions.
(b) In high dimensions it is worth reformulating this covariance using the Woodbury relation (section ??).
To perform inference about the identities of newly observed data {xn}Nn=1 we build M competing models that explain the data in terms of different identities and which correspond to world states y = 1 . . . M. We define a prior Pr(y = m) = λm for each model. Then we compute the posterior over world states using Bayes' rule.
Let the mth model divide the data into Q non-overlapping partitions {Sq}Qq=1 where each subset Sq is assumed to belong to the same identity. We now compute the likelihood Pr(x1...N|y = m) as

Pr(x1...N|y = m) = ∏Qq=1 Pr(Sq|θ). (6)

The likelihood of the qth subset is given by

Pr(Sq|θ) = Normx′[µ′, Φ′Φ′ᵀ + Σ′] (7)

where x′ is a compound data vector formed by stacking all of the data associated with cluster Sq on top of each other. If there are |Sq| data vectors associated with Sq then this will be a |Sq|D × 1 vector. Similarly the vector µ′ is a |Sq|D×1 compound mean formed by stacking |Sq| copies of the mean vector µ on top of each other, Φ′ is a |Sq|D×K compound factor matrix formed by stacking |Sq| copies of Φ on top of each other, and Σ′ is a |Sq|D×|Sq|D compound covariance matrix which is block diagonal with each block equal to Σ. In high dimensions it is worth reformulating the covariance using the Woodbury relation (section ??).
To perform inference about the identities of newly observed data {xn}Nn=1 we build M competing models that explain the data in terms of different identities and which correspond to world states y = 1 . . . M. We define a prior Pr(y = m) = λm for each model. Then we compute the posterior over world states using Bayes' rule.
Let the mth model divide the data into Q non-overlapping partitions {Sq}Qq=1 where each subset Sq is assumed to belong to the same identity. We now compute the likelihood Pr(x1...N|y = m) as

Pr(x1...N|y = m) = ∏Qq=1 Pr(Sq|θ). (9)

The likelihood of the qth subset is given by

Pr(Sq|θ) = Normx′[µ′, Φ′Φ′ᵀ + Σ′] (10)

where x′ is a compound data vector formed by stacking all of the data associated with cluster Sq on top of each other. If there are |Sq| data vectors associated with Sq then this will be a |Sq|D × 1 vector. Similarly the vector µ′ is a |Sq|D×1 compound mean formed by stacking |Sq| copies of the mean vector µ on top of each other. The matrix Φ′ is a |Sq|D×(K + |Sq|L) compound factor matrix which is constructed as

Φ′ = [Φ Ψ 0 . . . 0; Φ 0 Ψ . . . 0; . . . ; Φ 0 0 . . . Ψ]. (11)

Finally, Σ′ is a |Sq|D×|Sq|D compound covariance matrix which is block diagonal with each block equal to Σ. In high dimensions it is worth reformulating the covariance using the Woodbury relation (section ??).
0.13.6 Identity matching with asymmetric bilinear model
This formulation assumes that the style s of each observed data example is known. To perform inference about the identities of newly observed data {xn}Nn=1 we build M competing models that explain the data in terms of different identities and which correspond to world states y = 1 . . . M. We define a prior Pr(y = m) = λm for each model. Then we compute the posterior over world states using Bayes' rule.
Let the mth model divide the data into Q non-overlapping partitions {Sq}Qq=1 where each subset Sq is assumed to belong to the same identity. We now compute the likelihood Pr(x1...N|y = m) as

Pr(x1...N|y = m) = ∏Qq=1 Pr(Sq|θ). (13)

The likelihood of the qth subset is given by

Pr(Sq|θ) = Normx′[µ′, Φ′Φ′ᵀ + Σ′] (14)

where x′ is a compound data vector formed by stacking all of the data associated with cluster Sq on top of each other. If there are |Sq| data vectors associated with Sq then this will be a |Sq|D × 1 vector. Similarly the vector µ′ is a |Sq|D×1 compound mean formed by stacking the appropriate mean vectors µs for the style of each example on top of each other. The matrix Φ′ is a |Sq|D×(K + |Sq|L) compound factor matrix which is constructed by stacking the factor matrices Φs on top of each other, where the style matches that of the data. Finally, Σ′ is a |Sq|D×|Sq|D compound covariance matrix which is block diagonal with each block equal to Σs where the style is chosen to match the style of the data. In high dimensions it is worth reformulating the covariance using the Woodbury relation (section ??).
0.13.7 Style translation with asymmetric bilinear model
Algorithm 68: Style translation with asymmetric bilinear model
Input : Example x in style s1, model parameters θ
Output: Prediction for data x∗ in style s2
The bag of features model treats each object class as a distribution over discrete features f regardless of their position in the image. Assume that there are I images with Ji features in the ith image. Denote the jth feature in the ith image as fij. Then we have

Pr(Xi|w = n) = ∏Jij=1 Catfij[λn] (15)
Algorithm 69: Learn bag of words model
Input : Features {fij}I,Jii=1,j=1, world states {wi}Ii=1, Dirichlet parameter α
Output: Model parameters {λn}Nn=1
begin
// For each object class
for n=1 to N do
// For each feature value
for k=1 to K do
// Compute number of times feature k observed for object n
Nfnk = ∑Ii=1 ∑Jij=1 δ[wi − n]δ[fij − k]
end
// Compute parameter
λnk = (Nfnk + α − 1)/(∑Kk=1 Nfnk + Kα − 1)
end
end
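A small Python sketch of Algorithm 69 follows (illustrative; it uses the standard MAP normalization for a symmetric Dirichlet prior, which may differ from the denominator printed above by a constant):

    import numpy as np

    def learn_bag_of_words(features, labels, N, K, alpha=1.0):
        """MAP estimate of per-class feature histograms.
        features[i] is a list of discrete feature indices in 0..K-1 for image i,
        labels[i] is its class in 0..N-1, and alpha is the Dirichlet parameter."""
        counts = np.zeros((N, K))
        for f_i, w_i in zip(features, labels):
            for f in f_i:
                counts[w_i, f] += 1                     # N^f_{nk} in the notation above
        # MAP estimate under a symmetric Dirichlet prior
        lam = (counts + alpha - 1) / (counts.sum(axis=1, keepdims=True) + K * (alpha - 1))
        return lam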
We can then define a prior Pr(w) over the N object classes and classify a new image using Bayes' rule,
The LDA model describes a discrete set of features fij ∈ {1 . . . K} as a mixture of M categorical distributions (parts), where the categorical distributions themselves are shared, but the mixture weights πi differ from image to image.
Algorithm 70: Learn latent Dirichlet allocation model
Input : Features {fij}I,Jii=1,j=1, {wi}Ii=1, Dirichlet parameters α, β
Output: Model parameters {λm}Mm=1, {πi}Ii=1
begin
// Initialize categorical parameters
θ = θ0 (a)
// Initialize count parameters
N(f) = 0
N(p) = 0
for i=1 to I do
for j=1 to Ji do
// Initialize hidden variables
pij = randInt[M]
// Update count parameters
N(f)pij,fij = N(f)pij,fij + 1
N(p)i,pij = N(p)i,pij + 1
end
end
// Main MCMC loop
for t=1 to T do
p(t) = MCMCSample[p, f, N(f), N(p), {λm}Mm=1, {πi}Ii=1, M, K]
end
// Choose samples to use for parameter estimate
St = [BurnInTime : SkipTime : LastSample]
for i=1 to I do
for m=1 to M do
πi,m = ∑Jij=1 ∑t∈St δ[p[t]ij − m] + α
end
πi = πi/∑Mm=1 πi,m
end
for m=1 to M do
for k=1 to K do
λm,k = ∑Ii=1 ∑Jij=1 ∑t∈St δ[p[t]ij − m]δ[fij − k] + β
end
λm = λm/∑Kk=1 λm,k
end
end
(a) One way to do this would be to set the categorical parameters {λm}Mm=1, {πi}Ii=1 to random values by generating positive random vectors and normalizing them to sum to one.
The goal of the k-means algorithm is to partition a set of data {xi}Ii=1 into K clusters. It can be thought of as approximating each data point with the associated cluster mean µk, so that

xi ≈ µhi,

where hi ∈ {1, 2, . . . , K} is a discrete variable that indicates which cluster the ith point belongs to.
Algorithm 73: K-means algorithm
Input : Data {xi}Ii=1, number of clusters K, data dimension D
Output: Cluster means {µk}Kk=1, cluster assignment indices {hi}Ii=1
begin
// Initialize cluster means (one of many heuristics)
µ = ∑Ii=1 xi/I
for i=1 to I do
di = (xi − µ) ⊙ (xi − µ) // elementwise square
end
Σ = Diag[∑Ii=1 di/I]
for k=1 to K do
µk = µ + Σ1/2 randn[D, 1]
end
// Main loop
repeat
// Compute distance from data points to cluster means
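The main loop of Algorithm 73 is truncated above. A complete k-means sketch in NumPy, using essentially the initialization heuristic just described (sampling around the data mean with per-dimension standard deviations), is:

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        """K-means clustering. X is I x D. Returns cluster means and assignment indices."""
        rng = np.random.default_rng(seed)
        mu = X.mean(axis=0)
        std = X.std(axis=0)
        means = mu + std * rng.standard_normal((K, X.shape[1]))
        for _ in range(n_iter):
            # Assign each point to its closest cluster mean
            d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
            h = d2.argmin(axis=1)
            # Recompute each mean from its assigned points
            new_means = means.copy()
            for k in range(K):
                if np.any(h == k):
                    new_means[k] = X[h == k].mean(axis=0)
            if np.allclose(new_means, means):
                break
            means = new_means
        return means, h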