Jan 11, 2016
The Science of Silly Walks
Hedvig Sidenbladh, Royal Inst. of Technology, KTH, Stockholm, Sweden
http://www.nada.kth.se/~hedvig
Michael J. Black, Department of Computer Science, Brown University
http://www.cs.brown.edu/~black
Collaborators
David Fleet, Xerox PARC
Nancy Pollard, Brown University
Dirk Ormoneit and Trevor Hastie Dept. of Statistics, Stanford University
Allan Jepson, University of Toronto
The (Silly) Problem
Inferring 3D Human Motion
* Infer 3D human motion from 2D image properties.
* No special clothing
* Monocular, grayscale sequences (archival data)
* Unknown, cluttered environment
* Incremental estimation
Why is it Hard?
Low contrast
Self occlusion
Singularities in viewing direction
Unusual viewpoints
Clothing and Lighting
Large Motions
Limbs move rapidly with respect to their width.
Non-linear dynamics.
Motion blur.
Ambiguities
Where is the leg?
Which leg is in front?
Ambiguities
Accidental alignment
Ambiguities
Whose legs are whose?
Occlusion
Inference/Issues
Bayesian formulation
$$p(\text{model} \mid \text{cues}) = \frac{p(\text{cues} \mid \text{model})\, p(\text{model})}{p(\text{cues})}$$
1. Need a constraining likelihood model that is also invariant to variations in human appearance.
2. Need a prior model of how people move.
3. Need an effective way to explore the model space (very high dimensional) and represent ambiguities.
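A minimal sketch of the Bayesian formulation over a small discrete set of candidate models. The hypothesis names and the numbers are made up for illustration; the real model space is continuous and very high dimensional.

```python
def posterior(prior, likelihood):
    """Apply Bayes' rule over a discrete set of candidate models."""
    # p(cues | model) * p(model) for every candidate model
    unnormalized = {m: likelihood[m] * prior[m] for m in prior}
    # p(cues): the normalizing constant
    evidence = sum(unnormalized.values())
    return {m: v / evidence for m, v in unnormalized.items()}

# Hypothetical numbers for the "which leg is in front?" ambiguity.
prior = {"left_in_front": 0.5, "right_in_front": 0.5}
likelihood = {"left_in_front": 0.2, "right_in_front": 0.6}
post = posterior(prior, likelihood)
```

Both hypotheses keep non-zero probability, which is the point: the ambiguity is represented rather than resolved early.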
Simple Body Model
* Limbs are truncated cones
* Parameter vector of joint angles and angular velocities = $\phi$
Key Idea #1 (Likelihood)
1. Use the 3D model to predict the location of limb boundaries (not necessarily features) in the scene.
2. Compute various filter responses steered to the predicted orientation of the limb.
3. Compute likelihood of filter responses using a statistical model learned from examples.
Example Training Images
Edge Filters
Normalized derivatives of Gaussians (Lindeberg, Granlund and Knutsson, Perona, Freeman & Adelson, …)
Edge filter response steered to limb orientation:
$$f_e(\mathbf{x}, \theta, \sigma) = \sin\theta\, f_x(\mathbf{x}, \sigma) - \cos\theta\, f_y(\mathbf{x}, \sigma)$$
Filter responses steered to arm orientation.
Distribution of Edge Filter Responses
p_on(F), p_off(F)
Likelihood ratio, p_on / p_off, used for edge detection (Geman & Jedynak; Konishi, Yuille, & Coughlan)
Object specific statistics
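A rough sketch of how p_on and p_off could be learned as histograms of filter responses and combined into a log likelihood ratio. The bin count, the response range, the add-one smoothing, and the sample values are all assumptions made for this example, not values from the talk.

```python
import math

def bin_index(value, n_bins, lo, hi):
    """Map a filter response to a histogram bin, clamping to the valid range."""
    width = (hi - lo) / n_bins
    return min(n_bins - 1, max(0, int((value - lo) / width)))

def learn_histogram(responses, n_bins, lo, hi):
    """Normalized histogram with add-one smoothing so no bin is empty."""
    counts = [1.0] * n_bins
    for r in responses:
        counts[bin_index(r, n_bins, lo, hi)] += 1.0
    total = sum(counts)
    return [c / total for c in counts]

def log_likelihood_ratio(response, p_on, p_off, lo, hi):
    """log(p_on(F) / p_off(F)) for a single filter response F."""
    i = bin_index(response, len(p_on), lo, hi)
    return math.log(p_on[i] / p_off[i])

# Hypothetical training responses on limb edges and on background.
on_limb = [0.8, 0.9, 0.7, 0.85]
off_limb = [0.1, 0.0, -0.1, 0.05]
p_on = learn_histogram(on_limb, 4, -1.0, 1.0)
p_off = learn_histogram(off_limb, 4, -1.0, 1.0)
```

A positive ratio favours "limb edge" and a negative one favours "background"; no hard edge threshold is needed.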
Other Cues
I(x, t)
I(x+u, t+1)
Ridges
Motion
Key Idea #2 (Likelihood)
“Explain” the entire image.
Generic, unknown, background
Foreground person
Foreground should explain what the background can’t.
$$p(\text{image} \mid \text{foreground}, \text{background}) = \text{const} \times \prod_{\text{fore pixels}} \frac{p(\text{image} \mid \text{fore})}{p(\text{image} \mid \text{back})}$$
Likelihood
Steered edge filter responses
crude assumption: filter responses independent across scale.
$$\prod_{\text{limbs}} \prod_{\text{cues}} \frac{p(\text{filter response} \mid \text{person})}{p(\text{filter response} \mid \text{background})}$$
Learning Human Motion
* constrain the posterior to likely & valid poses/motions
* model the variability
time
joint angles
3D motion-capture data. * Database with multiple actors and a variety of motions.
(from M. Gleicher)
Key Idea #3 (Prior)
Problem:
* insufficient data to learn probabilistic model of human motion.
Alternative:
* the data represents all we know
* replace representation and learning with search. (search has to be fast)
* De Bonet & Viola, Efros & Leung, Efros & Freeman, Pasztor & Freeman, Hertzmann et al, …
Efros & Freeman ’01
Implicit Empirical Distribution
Off-line:
• learn a low-dimensional model of every n-frame sequence of joint angles and angular velocities (Leventon & Freeman, Ormoneit et al, …)
• project training data onto model to get small number of coefficients describing each time instant
• build a tree structured representation
“Textural” Model
On-line: Given an n-frame input motion
• project onto low-dimensional model.
• index in log time using the coefficients.
• return the best k approximate matches (and form a “proposal” distribution).
• sample from them and return the n+1st pose.
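The on-line steps can be sketched as below. This is a simplification under stated assumptions: a linear scan stands in for the tree-structured index, raw one-dimensional poses stand in for the low-dimensional coefficients, and sampling among the k matches is uniform. All function names are hypothetical.

```python
import random

def build_database(sequence, n):
    """Pair every n-frame window of a training sequence with the pose that follows it."""
    return [(sequence[i:i + n], sequence[i + n]) for i in range(len(sequence) - n)]

def k_best_matches(database, query, k):
    """Return the k windows closest to the query (linear scan, not log time)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return sorted(database, key=lambda item: dist(item[0], query))[:k]

def sample_next_pose(database, query, k, rng):
    """Sample the (n+1)st pose from the proposal formed by the k best matches."""
    matches = k_best_matches(database, query, k)
    return rng.choice(matches)[1]

# Toy 1D "motion": the pose simply increases over time.
db = build_database([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0], n=2)
next_pose = sample_next_pose(db, [2.0, 3.0], k=1, rng=random.Random(0))
```

With k = 1 the best match is returned deterministically; larger k adds variability to the synthesized motion.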
Synthetic Walker
* Colors indicate different training sequences.
Synthetic Swing Dancer
Bayesian Formulation
$$p(\phi_t \mid \mathbf{I}_t) \propto p(\mathbf{I}_t \mid \phi_t) \int p(\phi_t \mid \phi_{t-1})\, p(\phi_{t-1} \mid \mathbf{I}_{t-1})\, d\phi_{t-1}$$
Posterior over model parameters given an image sequence: $p(\phi_t \mid \mathbf{I}_t)$
Likelihood of observing the image given the model parameters: $p(\mathbf{I}_t \mid \phi_t)$
Temporal model (prior): $p(\phi_t \mid \phi_{t-1})$
Posterior from previous time instant: $p(\phi_{t-1} \mid \mathbf{I}_{t-1})$
Key Idea #4 (Ambiguity)
Samples from a distribution over 3D poses.
* Represent a multi-modal posterior probability distribution over model parameters
  - sampled representation
  - each sample is a pose and its probability
  - predict over time using a particle filtering approach.
Particle Filter
Posterior $p(\phi_{t-1} \mid \mathbf{I}_{t-1})$ → sample → Temporal dynamics $p(\phi_t \mid \phi_{t-1})$ → sample → Likelihood $p(\mathbf{I}_t \mid \phi_t)$ → normalize → Posterior $p(\phi_t \mid \mathbf{I}_t)$
What does the posterior look like?
(Figure axes: x, y, z)
Shoulder: 3 dof
Elbow: 1 dof
Elbow bends
Stochastic 3D Tracking
* 2500 samples, multiple cues.
(Preliminary result)
Conclusions
Inferring human motion, silly or not, from video is challenging.
We have tackled three important parts of the problem:
1. Probabilistically modeling human appearance in a generic, yet useful, way.
2. Representing the range of possible motions using techniques from texture modeling.
3. Dealing with ambiguities and non-linearities using particle filtering for Bayesian inference.
Learned Walking Model
* mean walker
Learned Walking Model
* sample with small
Learned Walking Model
* sample with moderate
Learned Walking Model
* sample with very large
(Silly-Walk Generator)
(Preliminary result)
Tracking with Occlusion
1500 samples, ~2 minutes/frame.
(Preliminary result)
Moving Camera
1500 samples, ~2 minutes/frame.
Ongoing and Future Work
Hybrid Monte Carlo tracker (Choo and Fleet ’01)
* analytic, differentiable, likelihood.
Learned dynamics.
Correlation across scale.
Estimate background motion.
Statistical models of color and texture.
Automatic initialization.
Training data and likelihood models to be made available on the web.
Lessons Learned
* Probabilistic (Bayesian) framework allows
  - integration of information over time
  - modeling of priors
* Particle filtering allows
  - multi-modal distributions
  - tracking with ambiguities and non-linear models
* Learning image statistics and combining cues improves robustness and reduces computation
Outlook
5 years:
- Relatively reliable people tracking in monocular video.
- Path is pretty clear.
… solve the vision problem.
Next step: Beyond person-centric
- people interacting with object/world
Beyond that: Recognizing action
- goals, intentions, ...
… solve the AI problem.
Conclusions
* Generic, learned, model of appearance.
• Combines multiple cues.
* Exploits work on image statistics.
* Use the 3D model to predict features.
* Principled way to choose filters.
* Model of foreground and background is incorporated into the tracking framework.
• exploits the ratio between foreground and background likelihood.
• improves tracking.
Motion Blur
Requirements
1. Represent uncertainty and multiple hypotheses.
2. Model non-linear dynamics of the body.
3. Exploit image cues in a robust fashion.
4. Integrate information over time.
5. Combine multiple image cues.
What Image Cues?
Pixels?
Temporal differences?
Background differences?
Edges?
Color?
Silhouettes?
Brightness Constancy
$$I(\mathbf{x}, t+1) = I(\mathbf{x}+\mathbf{u}, t) + \eta$$
where $\eta$ is a noise term.
Image motion of foreground as a function of the 3D motion of the body.
Problem: no fixed model of appearance (drift).
Changing background
Low contrast limb boundaries
Occlusion
Varying shadows
Deforming clothing
What do people look like?
What do non-people look like?
Edges as a Cue?
• Probabilistic model?
• Under/over-segmentation, thresholds, …
Contrast Normalization?
$$w = \frac{1 + \tanh(S \cdot \text{contrast} - O)}{2 \cdot \text{contrast}}$$
Lee, Mumford & Huang:
$$I_{\text{norm}} = \log(I / \hat{I})$$
Contrast Normalization
Maximize difference between distributions
* e.g. Bhattacharyya distance:
$$\delta_B(p_{\text{on}}, p_{\text{off}}) = -\log \int \sqrt{p_{\text{on}}(y)\, p_{\text{off}}(y)}\; dy$$
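For histogram estimates of p_on and p_off the integral becomes a sum over bins. A small sketch, assuming both histograms use the same bins and each sums to one:

```python
import math

def bhattacharyya_distance(p_on, p_off):
    """Discrete Bhattacharyya distance: -log of the summed sqrt(p_on * p_off)."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p_on, p_off))
    if coefficient == 0.0:
        return math.inf  # the two distributions do not overlap at all
    return -math.log(coefficient)

# Hypothetical two-bin histograms.
d_same = bhattacharyya_distance([0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25])
d_diff = bhattacharyya_distance([0.9, 0.1], [0.1, 0.9])
```

Identical distributions give a distance of zero; a normalization that separates p_on from p_off more strongly gives a larger value.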
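Local Contrast Normalization
Ridge Features
$$f_r(\mathbf{x}, \theta, \sigma) = \left| \sin^2\theta\, f_{xx}(\mathbf{x}, \sigma) + \cos^2\theta\, f_{yy}(\mathbf{x}, \sigma) - 2 \sin\theta \cos\theta\, f_{xy}(\mathbf{x}, \sigma) \right| - \left| \cos^2\theta\, f_{xx}(\mathbf{x}, \sigma) + \sin^2\theta\, f_{yy}(\mathbf{x}, \sigma) + 2 \sin\theta \cos\theta\, f_{xy}(\mathbf{x}, \sigma) \right|$$
Scale specific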
Ridge Thigh Statistics
Brightness Constancy
What are the statistics of brightness variation I(x, t) - I(x+u, t+1)?
Variation due to clothing, self shadowing, etc.
I(x, t) I(x+u, t+1)
Brightness Constancy
Scale 4
Scale 0
Edges
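Temporal Model: Smooth Motion
$$p(\theta_{i,t} \mid \theta_{i,t-1}, V_{i,t-1}) = \begin{cases} G(\theta_{i,t} - (\theta_{i,t-1} + V_{i,t-1}),\, \sigma_i^{\theta}) & \text{if } \theta_{i,t} \in [\theta_{i,\min}, \theta_{i,\max}] \\ 0 & \text{otherwise} \end{cases}$$
$$p(V_{i,t} \mid V_{i,t-1}) = G(V_{i,t} - V_{i,t-1},\, \sigma_i^{V})$$
* individual angles and velocities assumed independent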
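One way to draw from a smooth-motion prior of this kind for a single joint: Gaussian noise around the constant-velocity prediction, rejecting angles outside the joint limits. The sketch assumes the spread parameters are standard deviations, and the retry cap with a clamped fallback is an addition for robustness, not part of the slide.

```python
import random

def sample_smooth_motion(theta_prev, v_prev, sigma_theta, sigma_v,
                         theta_min, theta_max, rng, max_tries=1000):
    """Sample (theta_t, v_t) for one joint given the previous angle and velocity."""
    # Velocity: Gaussian random walk around the previous velocity.
    v = rng.gauss(v_prev, sigma_v)
    # Angle: Gaussian around theta_prev + v_prev, zero probability outside the limits.
    for _ in range(max_tries):
        theta = rng.gauss(theta_prev + v_prev, sigma_theta)
        if theta_min <= theta <= theta_max:
            return theta, v
    # Fallback (assumption): clamp the prediction if rejection keeps failing.
    return min(theta_max, max(theta_min, theta_prev + v_prev)), v

rng = random.Random(0)
samples = [sample_smooth_motion(0.5, 0.1, 0.05, 0.02, 0.0, 1.0, rng) for _ in range(100)]
```

Because angles and velocities are assumed independent across joints, the same function can be called once per joint.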
Particle Filtering
* large literature (Gordon et al ’93, Isard & Blake ’96, …)
* non-Gaussian posterior approximated by N discrete samples $\phi_t^{(n)}$, $n = 1, \ldots, N$ ($N \approx 10^3$)
* explicitly represent the ambiguities
* exploit stochastic sampling for tracking
Representing the Posterior
$p(\phi_t \mid \mathbf{I}_t)$ represented by discrete set of N samples $(\phi_t^{(n)}, \pi_t^{(n)})$
Normalized likelihood:
$$\pi_t^{(n)} = \frac{p(\mathbf{I}_t \mid \phi_t^{(n)})}{\sum_{i=1}^{N} p(\mathbf{I}_t \mid \phi_t^{(i)})}$$
Condensation
1. Selection
   Sample from posterior at t-1. Most probable states selected most often.
2. Prediction
3. Updating
(Figure: p over states at times t-1 and t)
Condensation
1. Selection
2. Prediction/Diffusion (sample from $p(\phi_t \mid \phi_{t-1})$)
   Models the dynamics.
3. Updating
Condensation
1. Selection
2. Prediction
3. Updating (the distribution)
   Evaluate new likelihood.
Repeat until N new samples have been generated.
Compute normalized probability distribution $p(\phi_t \mid \mathbf{I}_t)$.
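The three Condensation steps can be sketched on a toy one-dimensional state. The random-walk dynamics, the Gaussian-shaped likelihood around a fixed observation, and all numeric values are assumptions for illustration only; the tracker in the talk uses a 3D body state and image-based likelihoods.

```python
import math
import random

def condensation_step(particles, weights, sample_dynamics, likelihood, rng):
    """One selection / prediction / updating cycle of a particle filter."""
    n = len(particles)
    # 1. Selection: resample in proportion to the previous weights.
    selected = rng.choices(particles, weights=weights, k=n)
    # 2. Prediction/diffusion: push each sample through the temporal model.
    predicted = [sample_dynamics(s, rng) for s in selected]
    # 3. Updating: evaluate the likelihood and normalize.
    raw = [likelihood(s) for s in predicted]
    total = sum(raw)
    if total == 0.0:
        return predicted, [1.0 / n] * n  # degenerate case: fall back to uniform
    return predicted, [w / total for w in raw]

# Toy problem: the hidden state sits near 1.0.
rng = random.Random(0)
particles = [rng.uniform(-2.0, 2.0) for _ in range(500)]
weights = [1.0 / len(particles)] * len(particles)
dynamics = lambda s, r: s + r.gauss(0.0, 0.1)
observe = lambda s: math.exp(-0.5 * ((s - 1.0) / 0.2) ** 2)
for _ in range(5):
    particles, weights = condensation_step(particles, weights, dynamics, observe, rng)
estimate = sum(w * s for s, w in zip(particles, weights))
```

The weighted sample set stays multi-modal when the likelihood is ambiguous, which a single-hypothesis tracker cannot do.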
Temporal Model: Walking
Parameters of the generative model are now a five-component vector that includes the coefficients $c_t$ and two global (superscript $g$) components.
[Equations: five Gaussian transition densities $G(\cdot)$, one per component, including $p(c_{t,k} \mid c_{t-1,k})$.]
Probabilistic model for $p(\phi_t \mid \phi_{t-1})$
No likelihood
* how strong is the walking prior? (or is our likelihood doing anything?)
Other Related Work
J. Sullivan, A. Blake, M. Isard, and J. MacCormick. Object localization by Bayesian correlation. ICCV ’99.
J. Sullivan, A. Blake, and J. Rittscher. Statistical foreground modelling for object localisation. ECCV, 2000.
J. Rittscher, J. Kato, S. Joga, and A. Blake. A Probabilistic Background Model for Tracking. ECCV, 2000.
S. Wachter and H. Nagel. Tracking of persons in monocular image sequences. CVIU, 74(3), 1999.
Statistics of Limbs
How do people appear in natural scenes?
Want a general model.
Edge Filters
Ridge Filters
Other Related Work
* Bregler & Malik: image motion, single hypothesis, full-body required multiple cameras, scaled ortho.
* Ju, Black, Yacoob: cardboard person model, image motion, 2D
* Deutscher et al: Condensation, edge cues, background subtraction.
* Cham & Rehg: known templates, 2D (SPM), particle filter.
* Wachter & Nagel: nicely combines motion and edges, single hypothesis (Kalman filter).
* Leventon & Freeman: assumes 2D tracking, probabilistic formulation, learned temporal model
(full body, monocular, articulated)
Open Questions
Representation of human motions
* model the range of human activity
* constrain the estimation to plausible motions
Representation of human appearance
* (somewhat) invariant to the variation in human appearance
* specific enough to constrain the estimation
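Likelihood
$$p(I \mid \phi_f, \phi_b) = \prod_{\mathbf{x} \in f} p(I(\mathbf{x}) \mid \phi_f) \prod_{\mathbf{x} \in b} p(I(\mathbf{x}) \mid \phi_b)$$
$$= \frac{\prod_{\mathbf{x}} p(I(\mathbf{x}) \mid \phi_b) \prod_{\mathbf{x} \in f} p(I(\mathbf{x}) \mid \phi_f)}{\prod_{\mathbf{x} \in f} p(I(\mathbf{x}) \mid \phi_b)}$$
$$= c \prod_{\mathbf{x} \in f} \frac{p(I(\mathbf{x}) \mid \phi_f)}{p(I(\mathbf{x}) \mid \phi_b)}$$
Foreground pixels: f
Background pixels: b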
Overview
* Why is 3D human motion important?
* Why is recovering it hard?
* A Bayesian approach
  * generative model
  * robust likelihood function
  * temporal prior model (learning)
  * stochastic search (particle filtering)
* Where are we going?
* Recent advances & state of the art.
* What remains to be done?
Problems
A simple articulated human model may have 30+ parameters (e.g. joint angles; 60+ with velocities).
Models of human action are non-linear and likelihood models will be multi-modal.
Key challenges (common to other domains)
• representation,
• learning, and
• search
in high dimensional spaces.
Bayesian Formulation
Represent a distribution over 3D poses.
* define generative model of image appearance
* multi-modal posterior over model parameters
  - sampled representation
  - particle filtering approach.
* focus on image motion as a cue (adding edges, …)
Generative Model: Temporal
First order Markov assumption on angles, $\theta$, and angular velocity, $V$:
$$p(\theta_t, V_t \mid \theta_{t-1}, V_{t-1}) = p(\theta_t \mid \theta_{t-1}, V_{t-1})\, p(V_t \mid V_{t-1})$$
Explore two models of human motion:
* general smooth motion or,
* action-specific motion (walking)
Arm Tracking: Smooth motion prior
Particle filter
* represents ambiguity
* propagates information over time
(Figure axes: x, y, z)
Display: expected value of joint angles.
Learning Temporal Models
* Motion capture data is noisy, data is missing, activities are performed differently.
* For cyclic motion (important but special class):
1. Detect cycles and segment
2. Account for missing data
3. Preserve continuity of cycles
4. Statistical model of variation
* Approaches should generalize to non-cyclic motion.
(Dirk Ormoneit & Trevor Hastie)
Detecting Cycles
Automatically detect length of cycles. Automatically segment and align cycles.
Modeling Cyclic Motion
Automatically align 3D data with a reference curve represented using periodically constrained regression splines.
Modeling Cyclic Motion
* Iterative SVD method (from gene expression work)
* computes SVD in Fourier domain
* construct a rank-q approximation and take inverse Fourier transform
* impute missing data from the approximation
* repeat until convergence.
* Segment into cycles, compute mean curve and represent variation by performing PCA on data.
* SVD must enforce periodicity and cope with missing data.
Issues
* Large parameter space
  * approx. 10000 samples
  * sparsely represented
  * not real time
* Flow-based models can drift
* Requires initialization
Conclusions
Bayesian formulation for tracking 3D human figures using monocular image information.
* Generative model of image appearance.
* Non-linear model represents ambiguities, singularities, occlusion, etc.
  - sampled representation of posterior.
* Particle filtering for incremental estimation.
* Automatic learning of cyclic motion prior.
Rich framework for modeling the complexity of human motion.
Initialization Using 2D Model
* Full-body walking model.
* Constructed from 3D mocap data.
* 2D, view-based (every 30 degrees)
* 4 subjects, 14 cycles
2D, View-Based Walker
* Construct linear optical flow basis
* Use similar Bayesian framework for tracking (Black CVPR’99)
* Coarse estimate of 3D parameters
* Automatic initialization
Example Bases:
...
...
0 degrees
90 degrees
Recent Results
* Box indicates mean position and scale.
* Recovers distribution over phase and 3D scale.
Motion
Converged
Dense optical flow.
Open questions: appearance change, textural motion.
Converging
Human motion.
Faces:
Here we focus on full-body.
Truth in Advertising
Not about realistic models for synthesizing
* faces
* clothing
* skin
* hair
Focus on generic models of appearance for human motion capture.
Graphics to the Rescue?
Hodgins and Pollard ‘97
How big is the parameter space of all possible appearances?
Accurately synthesize appearance?
Human Appearance
Likelihood
* To cope with occluded limbs or those viewed at narrow angles, we introduce a probability of occlusion.
* likelihood of observing limb j is then
$$p_j = (1 - q)\, p_{\text{image}} + q\, p_{\text{occluded}}$$
* likelihood of the model is product of limb likelihoods
$$p(\mathbf{I}_t \mid \phi_t, \mathbf{R}_t) = \prod_j p_j$$
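A small sketch of the occlusion-aware likelihood. It assumes q is the probability that a limb is occluded, so each limb likelihood is the mixture (1 - q) * p_image + q * p_occluded, and the model likelihood is the product over limbs. The numbers are hypothetical.

```python
def limb_likelihood(p_image, p_occluded, q):
    """Mixture of the image-based likelihood and a constant occluded likelihood."""
    return (1.0 - q) * p_image + q * p_occluded

def model_likelihood(limbs, q):
    """Product of limb likelihoods; limbs is a list of (p_image, p_occluded) pairs."""
    result = 1.0
    for p_image, p_occluded in limbs:
        result *= limb_likelihood(p_image, p_occluded, q)
    return result

# Two limbs: one well supported by the image, one poorly supported.
total = model_likelihood([(0.8, 0.1), (0.2, 0.1)], q=0.5)
```

The mixture keeps a poorly supported limb from driving the whole product toward zero, which matters when a limb is hidden or seen end-on.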
Generative Model: Motion
(Figure: a point P(x, y) shown at times t-1 and t, with the image motion u relating the two.)
Learned Walking Model
* sample with large
Common Assumptions
* Multiple Cameras (additional constraints, occlusion)
* Color Images (locate face and hands)
* Known Background (background subtraction to locate person)
* Batch process an entire sequence.
* Known Initialization
(to be avoided)
Ratios for different limbs
Modeling Appearance
What do people look like?
What do non-people look like?
How can we model appearance in a way that captures the variability across people, clothing, lighting, pose, …?
Ridge Filters
Relationship between limb diameter in image and scale of maximum ridge filter response.
Ridges
Brightness Constancy
Correct position at t
Incorrect position at t
Vary position at t+1
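Condensation
1. Selection
2. Prediction/Diffusion (sample from $p(\mathbf{s}_t \mid \mathbf{s}_{t-1})$), i.e. from the temporal prior:
$$p(\theta_t \mid \theta_{t-1}, V_{t-1})\, p(V_t \mid V_{t-1})\, p(\mathbf{R}_t \mid \mathbf{I}_{t-1}, \theta_{t-1})$$
   1. Compute $\mathbf{R}_t$ from $p(\mathbf{R}_t \mid \mathbf{I}_{t-1}, \theta_{t-1})$
   2. Sample from $p(V_t \mid V_{t-1})$
   3. Sample from $p(\theta_t \mid \theta_{t-1}, V_{t-1})$
3. Updating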
Visualizing Results
$$E[f(\mathbf{s}_t) \mid \mathbf{I}_t] = \sum_{n=1}^{N} f(\mathbf{s}_t^{(n)})\, \pi_t^{(n)}$$
Expected value of state parameter $f(\mathbf{s}_t)$
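With a weighted sample set, the expected value used for display reduces to a weighted sum. A minimal sketch, assuming the weights are already normalized; the dictionary-based state layout is hypothetical.

```python
def expected_value(samples, weights, f):
    """Weighted mean of f over the sample set (weights assumed to sum to one)."""
    return sum(w * f(s) for s, w in zip(samples, weights))

# Three hypothetical samples of an elbow angle (radians) with normalized weights.
samples = [{"elbow": 0.0}, {"elbow": 1.0}, {"elbow": 2.0}]
weights = [0.25, 0.5, 0.25]
mean_elbow = expected_value(samples, weights, lambda s: s["elbow"])
```

The mean can fall between modes of a multi-modal posterior, so it is a display convenience rather than a guaranteed valid pose.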
Why is it hard?
Geometrically under-constrained.
Vigil Calculare
Watchful computation.
Tiny People
Why is it Important?
Applications
• Human-Computer Interaction
• Surveillance
• Motion capture (games and animation)
• Video search/annotation
• Work practice analysis.
Social display of puzzlement
* detect moving regions
* estimate motion
* model articulated objects
* model temporal patterns of activity
* interpret the motion
Why is it Hard?
The appearance of people can vary dramatically.
Bones and joints are unobservable (muscle, skin, clothing hide the underlying structure).
(inference)
Why is it hard?
People can appear in arbitrary poses.
They can deform in complex ways.
Occlusion results in ambiguities and multiple interpretations.
Other Problems
* geometrically under-constrained
* non-linear dynamics of limbs
* similarity of appearance of different limbs (matching ambiguities)
* image noise
* outliers
Our models are approximations. Image changes that are not modeled (e.g. clothing deformation) will be outliers.
Bregler and Malik ’98
State of the Art.
* Brightness constancy cue
• insensitive to appearance
* Full-body required multiple cameras.
* Single hypothesis.
• MAP estimate
Cham and Rehg ’99
State of the Art.
* Single camera, multiple hypotheses.
* 2D templates (solves drift but is view dependent)
I(x, t) = I(x+u, 0) + η, where η is a noise term
Deutscher, North, Bascle, & Blake ’99
State of the Art.
* Multiple cameras
* Simplified clothing, lighting and background.
Sidenbladh, Black, & Fleet ’00
* Monocular. Brightness constancy as the only cue.
* Significant changes in view and depth.
* Template-based methods will fail.
State of the Art.
Bayesian Inference
Exploit cues in the images. Learn likelihood models:
p(image cue | model)
Build models of human form and motion. Learn priors over model parameters:
p(model)
Represent the posterior distribution:
p(model | cue) ∝ p(cue | model) p(model)
Natural Image Statistics
Ruderman. Lee, Mumford, Huang. Portilla and Simoncelli. Olshausen & Field. Xu, Wu, & Mumford. …
* Statistics of image derivatives are non-Gaussian.
* Consistent across scale.
Statistics of Edges
Statistics of filter responses, F, on edges, p_on(F), differ from background statistics, p_off(F).
Likelihood ratio, p_on / p_off, can be used for edge detection and road following.
Geman & Jedynak and Konishi, Yuille, & Coughlan
What about the object specific statistics of limbs?
* edge may be present or not.
Distribution of Edge Filter Responses
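Likelihood
$$p(\text{image} \mid \text{fore}, \text{back}) = \prod_{\text{fore pixels}} p(\text{image} \mid \text{fore}) \prod_{\text{back pixels}} p(\text{image} \mid \text{back})$$
$$= \frac{\prod_{\text{all pixels}} p(\text{image} \mid \text{back}) \prod_{\text{fore pixels}} p(\text{image} \mid \text{fore})}{\prod_{\text{fore pixels}} p(\text{image} \mid \text{back})}$$
$$= \text{const} \times \prod_{\text{fore pixels}} \frac{p(\text{image} \mid \text{fore})}{p(\text{image} \mid \text{back})}$$
Foreground pixels
Background pixels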
Action-Specific Model
$$\theta_t = \tilde{\mu}(\psi_t) + \sum_{k=1}^{q} c_{t,k}\, v_k(\psi_t)$$
The joint angles at time t are a linear combination of the basis motions evaluated at phase $\psi_t$.
Mean curve: $\tilde{\mu}$. Basis curves: $v_k$.
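A sketch of evaluating an action-specific model for one joint angle: a mean curve plus a weighted sum of basis curves, all evaluated at the current phase. The sinusoidal curves and the coefficient below are placeholders, not learned curves from the talk.

```python
import math

def joint_angle(phase, mean_curve, basis_curves, coeffs):
    """Mean curve plus linear combination of basis curves, evaluated at one phase."""
    return mean_curve(phase) + sum(c * v(phase) for c, v in zip(coeffs, basis_curves))

# Placeholder periodic curves over phase in [0, 1).
mean_curve = lambda psi: 0.3 * math.sin(2.0 * math.pi * psi)
basis_curves = [lambda psi: math.cos(2.0 * math.pi * psi)]
angle = joint_angle(0.0, mean_curve, basis_curves, [0.5])
```

Setting all coefficients to zero reproduces the mean walker; larger coefficients move the motion further from the mean.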