Inference in generative models of images and video John Winn MSR Cambridge May 2004
Overview
Generative vs. conditional models
Combined approach
Inference in the flexible sprite model
Extending the model
We have an image I and latent variables H which we wish to infer, e.g. object position, orientation, class. There will also be other sources of variability, e.g. illumination, parameterised by θ.
Generative vs. conditional models
Generative model: P(H, θ, I)
Conditional model: P(H, θ|I) or P(H|I)
Conditional models use features
Features are functions of I which aim to be informative about H but invariant to θ.
[Examples: edge features, corner features, blob features]
Conditional models
Using features f(I), train a conditional model, e.g. using labelled data:
P(H|I) = g(f(I))
Example: Viola & Jones face detection using rectangle features and AdaBoost
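As an illustration of the g(f(I)) pattern, here is a minimal sketch of a single rectangle-feature weak classifier in the spirit of Viola & Jones. All names and the toy 4×4 image are hypothetical; the real detector boosts thousands of such features with AdaBoost.

```python
import numpy as np

def rectangle_feature(img, top, left, h, w):
    # Two-rectangle Haar-like feature: difference between the pixel sums
    # of the upper and lower halves of a window (hypothetical simplification).
    upper = img[top:top + h // 2, left:left + w].sum()
    lower = img[top + h // 2:top + h, left:left + w].sum()
    return upper - lower

def weak_classifier(img, threshold=0.0):
    # g(f(I)): the decision depends on the image only through the feature.
    return rectangle_feature(img, 0, 0, 4, 4) > threshold

# Toy 4x4 "image" whose upper half is brighter than its lower half.
img = np.vstack([np.ones((2, 4)), np.zeros((2, 4))])
print(weak_classifier(img))  # True
```

Note there is no model of the image itself: only the feature value is used, which is what makes inference fast but non-robust.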
Conditional models
Advantages
Simple - only model variables of interest
Inference is fast - due to use of features and simple model
Disadvantages
Non-robust
Difficult to compare different models
Difficult to combine different models
Generative models
A generative model defines a process of generating the image pixels I from the latent variables H and θ, giving a joint distribution over all variables: P(H, θ, I)
Learning and inference are carried out using standard machine learning techniques, e.g. Expectation Maximisation, MCMC, or variational methods.
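As a reminder of how such techniques work, Expectation Maximisation alternates between inferring posterior responsibilities (E-step) and re-estimating parameters (M-step). A minimal sketch for a two-component 1-D Gaussian mixture, with fixed variances and equal weights for brevity (function and variable names are illustrative, not from the talk):

```python
import numpy as np

def em_step(x, mu, sigma=1.0):
    # One EM iteration for a two-component 1-D Gaussian mixture with
    # fixed variances and equal mixing weights (illustrative sketch).
    log_p = -0.5 * (x[:, None] - mu[None, :]) ** 2 / sigma ** 2
    r = np.exp(log_p)                      # E-step: responsibilities
    r /= r.sum(axis=1, keepdims=True)
    return (r * x[:, None]).sum(axis=0) / r.sum(axis=0)  # M-step: new means

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])
mu = np.array([-1.0, 1.0])                 # deliberately poor initialisation
for _ in range(20):
    mu = em_step(x, mu)
```

The same E/M alternation, generalised to variational posteriors, is what the flexible sprite model uses later in the talk.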
No features!
Generative models
Example: image modelled as layers of ‘flexible’ sprites.
Generative models
Advantages
Accurate – as the entire image is modeled
Can compare different models
Can combine different models
Can generate new images
Disadvantages
Inference is difficult due to local minima
Inference is slower due to complex model
Limitations on model complexity
Combined approach
Use a generative model, but speed up inference using proposal distributions given by a conditional model.
A proposal R(X) suggests a new distribution over some subset of the latent variables X ⊆ {H, θ}.
Inference is extended to allow accepting or rejecting the proposal e.g. depending on whether it improves the model evidence.
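The accept/reject step can be sketched as a Metropolis-style rule: an improving proposal is always accepted, and a worsening one only with probability given by the evidence ratio. This is a simplification; the full MCMC scheme of Tu et al. 2003 also corrects for the proposal densities.

```python
import math
import random

def accept(log_ev_new, log_ev_old, rng=random.Random(0)):
    # Metropolis-style rule: always accept if the model evidence improves,
    # otherwise accept with probability equal to the evidence ratio.
    # (Sketch only; a shared seeded rng is used here for reproducibility.)
    if log_ev_new >= log_ev_old:
        return True
    return rng.random() < math.exp(log_ev_new - log_ev_old)

print(accept(-10.0, -12.0))  # True: improving proposals are always accepted
```

Accepting occasional evidence-decreasing moves is what lets the sampler escape the local minima that plague pure coordinate-wise inference.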
Using proposals in an MCMC framework
Proposals for text and faces
Accepted proposals
From Tu et al., 2003
Generative model: textured regions combined with face and text models
Conditional model: face and text detector using AdaBoost (Viola & Jones)
Using proposals in an MCMC framework
Proposals for text and faces
Reconstructed image
From Tu et al., 2003
Proposals in the flexible sprite model
Flexible sprite model
[Graphical model, built up over several slides:
x – set of images, e.g. frames from a video
π, f – sprite shape (mask) and appearance
T – sprite transform for this image (discretised)
m – transformed mask instance for this image
b – background]
Inference method & problems
Apply variational inference with a factorised Q distribution
Slow – since we have to search the entire discrete transform space
Limited size of transform space, e.g. translations only (160×120)
Many local minima
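The cost of that exhaustive search can be seen in a sketch that evaluates the posterior at every discrete translation, assuming Gaussian pixel noise and a flat prior over translations (all names here are illustrative):

```python
import numpy as np

def translation_posterior(image, sprite, noise_var=0.1):
    # Evaluate the posterior over every discrete translation of the sprite
    # (the exhaustive search the slides describe as slow), assuming
    # Gaussian pixel noise and a flat prior over translations.
    H, W = image.shape
    h, w = sprite.shape
    log_p = np.zeros((H - h + 1, W - w + 1))
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            diff = image[i:i + h, j:j + w] - sprite
            log_p[i, j] = -0.5 * (diff ** 2).sum() / noise_var
    p = np.exp(log_p - log_p.max())        # normalise in log space first
    return p / p.sum()

image = np.zeros((8, 8))
image[3:5, 2:4] = 1.0                      # a 2x2 bright "object" at (3, 2)
sprite = np.ones((2, 2))
post = translation_posterior(image, sprite)
best = tuple(int(v) for v in np.unravel_index(post.argmax(), post.shape))
print(best)  # (3, 2)
```

For a 160×120 translation space every one of the ~19,200 offsets must be scored per image, which is exactly what the proposals below avoid.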
Proposals in the flexible sprite model
[Graphical model with a proposal R(T) attached to the transform node T]
We wish to create a proposal R(T).
We cannot use features of the image directly until the object appearance has been found.
Instead, use features of the inferred mask.
Moment-based features
Use the first and second moments of the inferred mask as features, and learn a proposal distribution R(T).
[Figure: true location, centre of gravity of the mask, and a contour of the proposal distribution over object location]
Can also use R to get a probabilistic bound on T.
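A sketch of the moment computation: the mask's centre of gravity gives a location estimate, and its second moments a covariance, which together parameterise a Gaussian proposal over position (names are illustrative):

```python
import numpy as np

def mask_moments(mask):
    # First moment: centre of gravity of the (soft) mask, an estimate of
    # the object's location. Second moments: a covariance describing its
    # spread. Together they parameterise a Gaussian proposal R(T).
    ys, xs = np.indices(mask.shape)
    total = mask.sum()
    cy = (mask * ys).sum() / total
    cx = (mask * xs).sum() / total
    cov = np.array([
        [(mask * (ys - cy) ** 2).sum(), (mask * (ys - cy) * (xs - cx)).sum()],
        [(mask * (xs - cx) * (ys - cy)).sum(), (mask * (xs - cx) ** 2).sum()],
    ]) / total
    return (cy, cx), cov

mask = np.zeros((10, 10))
mask[4:6, 3:5] = 1.0                       # inferred mask: a 2x2 blob
(cy, cx), cov = mask_moments(mask)
print(cy, cx)  # 4.5 3.5
```

Because the moments come from the inferred mask rather than the raw image, the proposal is available even before the object's appearance has been learned.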
[Iterations 1–7: frames showing inference converging]
Results on the scissors video.
On average, only ~1% of the transform space is searched. Inference always converges, independently of initialisation.
[Video: original, reconstruction, and foreground only]
Beyond translation
Extended transform space
[Videos: original and reconstruction under the extended transform space]
Extended transform space
Normalised video
Learned sprite appearance
Corner features
[Figure: learned sprite appearance, masked normalised image, and corner-feature proposals]
Preliminary results
Future directions
Extensions to the generative model
Very wide range of possible extensions:
Local appearance model, e.g. patch-based
Multiple layered objects
Object classes
Illumination modelling
Incorporation of object-specific models, e.g. faces
Articulated models
Further investigation of using proposals
Investigate other bottom-up features, including:
Optical flow
Colour/texture
Use of standard invariant features, e.g. SIFT
Discriminative models for particular object classes, e.g. faces, text
[Closing figure: graphical model of the flexible sprite model – π, m, f, b, T, x – with a plate over the N images]