Learning the Appearance and Motion of People in Video

Michael J. Black, February 2002

Hedvig Sidenbladh, Defense Research Institute, Stockholm, Sweden
http://www.nada.kth.se/~hedvig

Michael J. Black, Department of Computer Science, Brown University
http://www.cs.brown.edu/~black

(The Science of Silly Walks)


Collaborators

David Fleet, Xerox PARC

Nancy Pollard, Brown University

Dirk Ormoneit and Trevor Hastie, Dept. of Statistics, Stanford University

Allan Jepson, University of Toronto


The (Silly) Problem

Unsolved without manual intervention.


Inferring 3D Human Motion

* No special clothing
* Monocular, grayscale sequences (archival data)
* Unknown, cluttered environment
* Incremental estimation

* Infer 3D human motion from 2D image properties.


Why is it Hard?

Low contrast

Self occlusion

Singularities in viewing direction

Unusual viewpoints

Ambiguous matches


Clothing and Lighting


Large Motions

Limbs move rapidly with respect to their width.

Non-linear dynamics.

Motion blur.


Ambiguities

Where is the leg?

Which leg is in front?



Accidental alignment



Whose legs are whose?

Occlusion


Requirements

1. Represent uncertainty and multiple hypotheses.

2. Model non-linear dynamics of the body.

3. Exploit image cues in a robust fashion.

4. Integrate information over time.

5. Combine multiple image cues.


Simple Body Model

* Limbs are truncated cones
* Parameter vector of joint angles and angular velocities: $\phi$


Inference/Issues

Bayesian formulation

p(model | cues) = p(cues | model) p(model) / p(cues)

1. Need a constraining likelihood model that is also invariant to variations in human appearance.

2. Need a prior model of how people move.

3. Need an effective way to explore the model space (very high dimensional) and represent ambiguities.


What Image Cues?

Pixels?

Temporal differences?

Background differences?

Edges?

Color?

Silhouettes?

Optical flow?


Brightness Constancy

I(x, t+1) = I(x+u, t) + noise

Image motion of foreground as a function of the 3D motion of the body.

Problem: no fixed model of appearance (drift).
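The brightness constancy relation can be illustrated on a toy 1D signal. This is a minimal sketch, not the tracker from the talk: the function name `brightness_residuals`, the integer displacement, and the sample frames are all hypothetical.

```python
def brightness_residuals(frame_t, frame_t1, flow):
    """Residuals I(x, t+1) - I(x+u, t) for a 1D signal and integer displacement u.

    Positions whose source x+u falls outside frame_t are skipped.
    """
    residuals = []
    for x in range(len(frame_t1)):
        src = x + flow
        if 0 <= src < len(frame_t):
            residuals.append(frame_t1[x] - frame_t[src])
    return residuals

# Hypothetical frames: a bright bump that moves one pixel to the right.
frame_t = [0, 0, 5, 9, 5, 0, 0]
frame_t1 = [0, 0, 0, 5, 9, 5, 0]

correct = brightness_residuals(frame_t, frame_t1, -1)  # I(x, t+1) = I(x-1, t)
wrong = brightness_residuals(frame_t, frame_t1, 0)     # assumes no motion
```

With the correct displacement every residual vanishes; with the wrong one the residuals are large. Because each frame is only compared with the previous one, small errors can accumulate over time, which is the drift problem noted above.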

[Figure: frames at times t-1 and t]


Bregler and Malik ’98

State of the Art.

* Brightness constancy cue

• insensitive to appearance

* Full-body required multiple cameras.

* Single hypothesis.

• MAP estimate


Cham and Rehg ’99

State of the Art.

* Single camera, multiple hypotheses.

* 2D templates (solves drift but is view dependent)

I(x, t) = I(x+u, 0) + noise


Edges as a Cue?

• Probabilistic model?
• Under/over-segmentation, thresholds, …


Deutscher, North, Bascle, & Blake ’99

* Multiple cameras

* Simplified clothing, lighting, and background.

State of the Art.


Changing background

Low contrast limb boundaries

Occlusion

Varying shadows

Deforming clothing

What do people look like?

What do non-people look like?


Key Idea #1 (Rigorous Likelihood)

1. Use the 3D model to predict the location of limb boundaries (not necessarily features) in the scene.

2. Compute various filter responses steered to the predicted orientation of the limb.

3. Compute likelihood of filter responses using a statistical model learned from examples.


Natural Image Statistics

Ruderman. Lee, Mumford, Huang. Portilla and Simoncelli. Olshausen & Field. Xu, Wu, & Mumford. …

* Statistics of image derivatives are non-Gaussian.
* Consistent across scale.


Statistics of Edges

Statistics of filter responses, F, on edges, p_on(F), differ from background statistics, p_off(F).

The likelihood ratio, p_on / p_off, can be used for edge detection and road following.

Geman & Jedynak and Konishi, Yuille, & Coughlan

What about the object specific statistics of limbs?

* edge may be present or not.
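The likelihood-ratio idea can be sketched with two learned histograms. This is only an illustration under stated assumptions: the bin edges, the add-one smoothing, the helper names, and the training responses below are all made up, not data or code from the talk.

```python
import math

def bin_index(value, edges):
    """Index of the histogram bin containing value, or None if out of range."""
    if value < edges[0] or value > edges[-1]:
        return None
    for i in range(len(edges) - 2):
        if value < edges[i + 1]:
            return i
    return len(edges) - 2  # last bin includes the right edge

def histogram(samples, edges):
    """Normalized histogram with add-one smoothing so no bin has zero mass."""
    counts = [1.0] * (len(edges) - 1)
    for s in samples:
        i = bin_index(s, edges)
        if i is not None:
            counts[i] += 1.0
    total = sum(counts)
    return [c / total for c in counts]

def log_likelihood_ratio(response, edges, p_on, p_off):
    """log(p_on(F) / p_off(F)) for a single filter response F."""
    i = bin_index(response, edges)
    if i is None:
        return 0.0  # out of range: treat the response as uninformative
    return math.log(p_on[i] / p_off[i])

# Hypothetical training responses measured on and off limb edges.
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
p_on = histogram([0.6, 0.7, 0.8, 0.9, 0.85, 0.4], edges)
p_off = histogram([0.05, 0.1, 0.2, 0.15, 0.3, 0.6], edges)

strong = log_likelihood_ratio(0.9, edges, p_on, p_off)  # positive: looks like an edge
weak = log_likelihood_ratio(0.1, edges, p_on, p_off)    # negative: looks like background
```

Working with the log ratio keeps later products over many responses numerically stable, since they turn into sums.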


Object-Specific Statistics


Edge Filters

Normalized derivatives of Gaussians (Lindeberg, Granlund and Knutsson, Perona, Freeman & Adelson, …)

Edge filter response steered to limb orientation:

$f_e(\mathbf{x}, \theta, \sigma) = \sin\theta \, f_x(\mathbf{x}, \sigma) - \cos\theta \, f_y(\mathbf{x}, \sigma)$

[Figure: filter responses steered to arm orientation.]
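A steered edge response can be sketched as a derivative taken across the limb, f_e = sin(θ) f_x - cos(θ) f_y, where θ is the limb orientation. In this minimal sketch plain central differences stand in for the normalized Gaussian derivatives named on the slide, and the function names and the tiny test image are hypothetical.

```python
import math

def central_differences(image, row, col):
    """Horizontal and vertical derivatives (fx, fy) at an interior pixel.

    Assumption: simple central differences replace Gaussian derivative filters.
    """
    fx = (image[row][col + 1] - image[row][col - 1]) / 2.0
    fy = (image[row + 1][col] - image[row - 1][col]) / 2.0
    return fx, fy

def steered_edge_response(fx, fy, theta):
    """Derivative perpendicular to a limb with orientation theta (radians)."""
    return math.sin(theta) * fx - math.cos(theta) * fy

# Hypothetical image containing a vertical step edge.
image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
fx, fy = central_differences(image, 1, 1)

vertical_limb = steered_edge_response(fx, fy, math.pi / 2)  # limb parallel to the edge
horizontal_limb = steered_edge_response(fx, fy, 0.0)        # limb perpendicular to the edge
```

The response is large only when the hypothesized limb orientation agrees with the image edge, which is what makes the steered response informative about pose.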


Distribution of Edge Filter Responses

[Figure: p_on(F) and p_off(F)]


Contrast Normalization?

$w = \dfrac{1 + \tanh(S \cdot \text{contrast} - O)}{2 \cdot \text{contrast}}$

Lee, Mumford & Huang

$I_{norm} = \log(I / \hat{I})$


Contrast Normalization

Maximize difference between distributions

* e.g. Bhattacharyya distance:

$\delta_B(p_{on}, p_{off}) = -\log \int \sqrt{p_{on}(y) \, p_{off}(y)} \, dy$
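For histograms, the Bhattacharyya distance is the negative log of the summed square roots of the bin-wise products. The sketch below assumes both inputs are normalized histograms over the same bins; the example distributions are made up for illustration.

```python
import math

def bhattacharyya_distance(p, q):
    """-log(sum_i sqrt(p_i * q_i)) for two normalized histograms."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    if coefficient == 0.0:
        return math.inf  # no overlap at all
    return -math.log(coefficient)

# Hypothetical on/off histograms under two candidate normalizations.
overlapping = bhattacharyya_distance([0.5, 0.5], [0.5, 0.5])
separated = bhattacharyya_distance([0.9, 0.1], [0.1, 0.9])
```

A normalization whose on and off distributions are further apart under this measure gives a more discriminative likelihood ratio.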


Local Contrast Normalization


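Ridge Features

$f_r(\mathbf{x}, \theta, \sigma) = \left| \sin^2\theta \, f_{xx}(\mathbf{x}, \sigma) + \cos^2\theta \, f_{yy}(\mathbf{x}, \sigma) - 2 \sin\theta \cos\theta \, f_{xy}(\mathbf{x}, \sigma) \right| - \left| \cos^2\theta \, f_{xx}(\mathbf{x}, \sigma) + \sin^2\theta \, f_{yy}(\mathbf{x}, \sigma) + 2 \sin\theta \cos\theta \, f_{xy}(\mathbf{x}, \sigma) \right|$

Scale specific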
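A ridge response can be read as the magnitude of the second derivative across the limb minus the magnitude of the second derivative along it. The sketch below assumes that form, with θ the limb orientation; the function name and the example derivative values are hypothetical.

```python
import math

def ridge_response(fxx, fyy, fxy, theta):
    """|second derivative across the limb| - |second derivative along the limb|."""
    s, c = math.sin(theta), math.cos(theta)
    across = s * s * fxx + c * c * fyy - 2.0 * s * c * fxy
    along = c * c * fxx + s * s * fyy + 2.0 * s * c * fxy
    return abs(across) - abs(along)

# Hypothetical second derivatives at the center of a bright vertical bar:
# strong curvature in x, none in y.
fxx, fyy, fxy = -2.0, 0.0, 0.0

aligned = ridge_response(fxx, fyy, fxy, math.pi / 2)  # limb hypothesized vertical
misaligned = ridge_response(fxx, fyy, fxy, 0.0)       # limb hypothesized horizontal
```

The response is positive for an elongated structure aligned with the hypothesized limb and negative when the hypothesis is rotated by 90 degrees.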


Ridge Filters

Relationship between limb diameter in image and scale of maximum ridge filter response.


Ridge Thigh Statistics



What are the statistics of brightness variation I(x, t) - I(x+u, t+1)?

Variation due to clothing, self shadowing, etc.

I(x, t) I(x+u, t+1)



• well fit by t-distribution or Cauchy distribution (heavy tails)

• related to robust statistics
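The link to robust statistics can be seen by comparing negative log-densities. This is a small sketch with made-up scale parameters: under a Gaussian the penalty for a brightness difference grows quadratically, while under a Cauchy distribution it grows only logarithmically, so outliers are penalized far less.

```python
import math

def gaussian_neg_log(r, sigma):
    """Negative log-density of a zero-mean Gaussian at residual r."""
    return 0.5 * (r / sigma) ** 2 + math.log(sigma * math.sqrt(2.0 * math.pi))

def cauchy_neg_log(r, gamma):
    """Negative log-density of a zero-centered Cauchy distribution at residual r."""
    return math.log(math.pi * gamma * (1.0 + (r / gamma) ** 2))

# Hypothetical residuals: a typical one and an outlier (e.g. a shadow or a fold in clothing).
typical, outlier = 0.5, 10.0
gaussian_cost = gaussian_neg_log(outlier, 1.0)
cauchy_cost = cauchy_neg_log(outlier, 1.0)
```

For the outlier the Gaussian cost is roughly ten times the Cauchy cost, so a single bad pixel dominates a Gaussian likelihood but not a heavy-tailed one.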


Key Idea #2 (Explain the Image)

p(image | foreground, background)

Generic, unknown, background

Foreground person

Foreground should explain what the background can’t.

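$p(\text{image} \mid \text{fore}, \text{back}) = \text{const} \cdot \dfrac{\prod_{\text{fore pixels}} p(\text{image} \mid \text{fore})}{\prod_{\text{fore pixels}} p(\text{image} \mid \text{back})}$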
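One way to see where the foreground/background ratio comes from is the short derivation below. It assumes that pixels are conditionally independent given the model, which is a simplifying assumption made here for illustration; the constant is the product of background probabilities over all pixels, which does not depend on the pose.

```latex
\begin{align*}
p(\text{image} \mid \text{fore}, \text{back})
  &= \prod_{\text{fore pixels}} p(\text{image} \mid \text{fore})
     \prod_{\text{back pixels}} p(\text{image} \mid \text{back}) \\
  &= \prod_{\text{all pixels}} p(\text{image} \mid \text{back})
     \prod_{\text{fore pixels}}
     \frac{p(\text{image} \mid \text{fore})}{p(\text{image} \mid \text{back})} \\
  &= \text{const} \cdot
     \frac{\prod_{\text{fore pixels}} p(\text{image} \mid \text{fore})}
          {\prod_{\text{fore pixels}} p(\text{image} \mid \text{back})}
\end{align*}
```

Only pixels covered by the projected body model need to be evaluated, so the background never has to be modeled explicitly for the whole image.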

See also McCormick and Isard, ICCV’01.


Likelihood

Steered edge filter responses

crude assumption: filter responses independent across scale.

$\prod_{\text{limbs}} \prod_{\text{cues}} \dfrac{p(\text{filter response} \mid \text{person})}{p(\text{filter response} \mid \text{background})}$
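Combining cues then amounts to multiplying person/background ratios over limbs and cues, or summing their logs. The sketch below assumes the individual ratios have already been computed; the limb names, cue names, and ratio values are hypothetical.

```python
import math

def log_likelihood(ratios_per_limb):
    """Sum of log ratios over all limbs and cues.

    ratios_per_limb maps limb -> cue -> p(response | person) / p(response | background).
    Treating the terms as independent is the crude assumption noted on the slide.
    """
    total = 0.0
    for cues in ratios_per_limb.values():
        for ratio in cues.values():
            total += math.log(ratio)
    return total

# Hypothetical ratios for two candidate poses.
pose_a = {"thigh": {"edge": 2.0, "ridge": 1.5}, "calf": {"edge": 3.0, "ridge": 1.0}}
pose_b = {"thigh": {"edge": 0.5, "ridge": 0.8}, "calf": {"edge": 0.4, "ridge": 1.0}}

score_a = log_likelihood(pose_a)
score_b = log_likelihood(pose_b)
```

A pose whose predicted limb locations look more like a person than like background receives the higher score.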



p(model | cues) = p(cues | model) p(model) / p(cues)




Learning Human Motion

* constrain the posterior to likely & valid poses/motions
* model the variability

3D motion-capture data.

* Database with multiple actors and a variety of motions.

[Figure: joint angles plotted against time (from M. Gleicher)]


Key Idea #3 (Trade learning for search.)

Problem:

* insufficient data to learn a prior probabilistic model of human motion.

Alternative:

* the data represents all we know

* replace representation and learning with search. (challenge: search has to be fast)


Texture Synthesis

Efros & Freeman ’01

[Figure: “Database” and synthetic texture]

* De Bonet & Viola, Efros & Leung, Efros & Freeman, Pasztor & Freeman, Hertzmann et al., …

* Image(s) as an implicit probabilistic model.


Implicit Probabilistic Model

Key idea: probabilistic search (log time) of this tree approximates sampling from p(stored sequence | generated sequence).


Synthesis

* Colors indicate different training sequences.

* For graphics, we need: editability, constraints (ground contact, pose, interpenetration), key frames, style, …


Tracking

* Efficiently generate samples (image data will sort out which are good).

* Temperature parameter controls randomness of tree search.
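The role of a temperature parameter can be illustrated with a flat list of candidates instead of a tree. The talk does not give the exact form, so the Boltzmann weighting exp(-distance / temperature), the function name, and the distances below are assumptions made for this sketch.

```python
import math
import random

def sample_with_temperature(distances, temperature, rng):
    """Pick an index with probability proportional to exp(-distance / temperature).

    Low temperature: almost always the closest match.
    High temperature: close to uniform over the candidates.
    """
    d_min = min(distances)  # subtract the minimum for numerical stability
    weights = [math.exp(-(d - d_min) / temperature) for d in distances]
    return rng.choices(range(len(distances)), weights=weights, k=1)[0]

# Hypothetical distances between the generated sequence and three stored sequences.
rng = random.Random(0)
distances = [0.2, 1.0, 3.0]
cold = [sample_with_temperature(distances, 0.01, rng) for _ in range(100)]
hot = [sample_with_temperature(distances, 100.0, rng) for _ in range(200)]
```

At low temperature the search is effectively deterministic nearest-neighbor lookup; raising the temperature spreads samples over less similar stored sequences, which helps keep multiple motion hypotheses alive during tracking.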


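Bayesian Formulation

$p(\phi_t \mid \vec{I}_t) \propto p(I_t \mid \phi_t) \int p(\phi_t \mid \phi_{t-1}) \, p(\phi_{t-1} \mid \vec{I}_{t-1}) \, d\phi_{t-1}$

* $p(\phi_t \mid \vec{I}_t)$: posterior over model parameters given an image sequence.
* $p(I_t \mid \phi_t)$: likelihood of observing the image given the model parameters.
* $p(\phi_t \mid \phi_{t-1})$: temporal model (prior).
* $p(\phi_{t-1} \mid \vec{I}_{t-1})$: posterior from previous time instant.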
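On a small discrete state space the recursion (new posterior proportional to likelihood times the temporal model integrated against the previous posterior) can be computed exactly, which makes the three ingredients easy to see. The two-state example and all probabilities below are hypothetical; the real model space is continuous and very high dimensional.

```python
def bayes_filter_step(prior, transition, likelihood):
    """One step of the recursive Bayesian update on a discrete state space.

    prior[j]         = p(state_{t-1} = j | images up to t-1)
    transition[j][i] = p(state_t = i | state_{t-1} = j)
    likelihood[i]    = p(image_t | state_t = i)
    """
    n = len(prior)
    # Temporal prediction: the integral becomes a sum over previous states.
    predicted = [sum(transition[j][i] * prior[j] for j in range(n)) for i in range(n)]
    # Weight by the likelihood and normalize.
    unnormalized = [likelihood[i] * predicted[i] for i in range(n)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical two-state example ("left leg in front", "right leg in front").
prior = [0.5, 0.5]
transition = [[0.9, 0.1], [0.2, 0.8]]
likelihood = [0.7, 0.1]
posterior = bayes_filter_step(prior, transition, likelihood)
```

In the tracking problem the sum cannot be enumerated, which motivates a sampled representation of the posterior.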


What does the posterior look like?

[Figure axes: x, y, z]

Shoulder: 3 dof
Elbow: 1 dof

Elbow bends



p(model | cues) = p(cues | model) p(model) / p(cues)

3. Need an effective way to explore the model space (very high dimensional) and represent ambiguities.




Key Idea #4 (Represent Ambiguity)

Samples from a distribution over 3D poses.

* Represent a multi-modal posterior probability distribution over model parameters
  - sampled representation
  - each sample is a pose and its probability
  - predict over time using a particle filtering approach.


Particle Filtering

* large literature (Gordon et al. ’93, Isard & Blake ’96, …)

* non-Gaussian posterior approximated by N discrete samples

* explicitly represent the ambiguities

* exploit stochastic sampling for tracking

$\phi_t^{(n)}, \quad n = 1, \ldots, N \quad (N \sim 10^3)$


Particle Filter

Posterior $p(\phi_{t-1} \mid \vec{I}_{t-1})$
→ sample →
Temporal dynamics $p(\phi_t \mid \phi_{t-1})$
→ sample →
Likelihood $p(I_t \mid \phi_t)$
→ normalize →
Posterior $p(\phi_t \mid \vec{I}_t)$
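The sample / sample / normalize cycle can be sketched for a toy one-dimensional state. This is not the tracker from the talk: the Gaussian random-walk dynamics, the Gaussian likelihood around a fixed observation, and all numeric values are hypothetical stand-ins for the learned motion prior and the image likelihood.

```python
import math
import random

def particle_filter_step(particles, weights, dynamics, likelihood, rng):
    """One sample / sample / normalize cycle for a weighted particle set."""
    n = len(particles)
    # Sample from the previous posterior (resampling by weight).
    resampled = rng.choices(particles, weights=weights, k=n)
    # Sample from the temporal dynamics to predict each particle forward.
    predicted = [dynamics(p, rng) for p in resampled]
    # Weight by the likelihood and normalize to get the new posterior.
    new_weights = [likelihood(p) for p in predicted]
    total = sum(new_weights)
    return predicted, [w / total for w in new_weights]

# Toy problem: the state is a single joint angle observed near 1.0.
observed_angle = 1.0

def dynamics(angle, rng):
    return angle + rng.gauss(0.0, 0.1)

def likelihood(angle):
    return math.exp(-0.5 * ((observed_angle - angle) / 0.2) ** 2)

rng = random.Random(0)
n_samples = 1000
particles = [rng.uniform(-2.0, 2.0) for _ in range(n_samples)]
weights = [1.0 / n_samples] * n_samples
for _ in range(5):
    particles, weights = particle_filter_step(particles, weights, dynamics, likelihood, rng)
estimate = sum(p * w for p, w in zip(particles, weights))
```

Because the posterior is carried as a set of weighted samples, several distinct modes can survive from frame to frame until the image evidence resolves them.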


Particle Filter

Isard & Blake ’96


Tracking with Occlusion

1500 samples, ~2 minutes/frame.


Moving Camera

1500 samples, ~2 minutes/frame.


Stochastic 3D Tracking

* 2500 samples (now down as low as 300 with the new prior).


Conclusions

Inferring human motion, silly or not, from video is challenging.

We have tackled three important parts of the problem:

1. Probabilistically modeling human appearance in a generic, yet useful, way.

2. Representing the range of possible motions using techniques from texture modeling.

3. Dealing with ambiguities and non-linearities using particle filtering for Bayesian inference.


Ongoing and Future Work

Better search algorithms
* Hybrid Monte Carlo tracker (Choo and Fleet ’01)
* Covariance scaled sampling (Sminchisescu & Triggs ’01)

Richer prior models of motion.

Estimate background motion.

Statistical models of color and texture.

Automatic initialization.

Training data and likelihood models to be made available on the web.
