Monocular Human Pose Estimation with Bayesian Networks


Electronic Engineering Department, Fu Jen University

2010/6/11

Yuan-Kai Wang

This work is licensed under the Creative Commons Attribution 3.0 Taiwan license.

http://creativecommons.org/licenses/by/3.0/tw/


Outline
1. Introduction
2. Markerless Monocular Human Pose Estimation
3. Overview of the Approach
4. Model Learning by EM algorithm
5. Pose Estimation by Approximate Inference
6. Feature Extraction
7. Experimental Results
8. Conclusions


1. Introduction
• Applications of Human Motion Capture
  – Performance animation in movie making
  – Game
  – Medical diagnosis
  – Sport & Health
  – Visual surveillance


Performance Animation
• Avatar
• The Lord of the Rings


Game
• Microsoft's Project Natal for Xbox 360


Medical Diagnosis
• Gait analysis for rehabilitation


Sport & Health
• Golf training


Visual Surveillance
• Behavior analysis for event detection
  – Irregular movement, body language, unusual interactions, and fighting
  – Car crash
• Content-based retrieval


Sensor Approaches
• Active sensors
  – Types
    • Electro-magnetic marker
    • Optical
    • Accelerometer
  – Wired connection
  – Drawbacks
    • Intrusive
    • Expensive
    • Time consuming
• Passive sensors by camera
  – Marker-based
  – Markerless
[Figure: too many wires]


Marker-based Sensors
• Add visual markers on body
  – Active marker
    • Visual/non-visual light
  – Passive marker
• Need computer vision algorithms
• Advantages
  – No wires
• Drawbacks
  – Semi-intrusive
  – Time consuming
[Figures: active marker; passive marker]


Markerless Sensors
• No attachment on human body
• Heavily dependent on computer vision analyzer
  – Stereo/multiple cameras
  – Monocular cameras
• Pure vision solution


Sensor vs. Analyzer

T. B. Moeslund, "Computer vision-based human motion capture – a survey", Technical report LIA 99-02, University of AALBORG, 1999.


Pose Estimation vs. Gesture Recognition

[Figure: gesture recognition outputs an action label such as "Walking"; pose estimation outputs the body pose]


2D vs. 3D


2. Markerless Monocular Human Motion Capture

• Goal
  – Markerless
  – Single camera
  – 3D poses
• Challenges
  – Ill-posed
  – Highly articulated
  – Self-occluding
[Figure: depth ambiguities and occlusion when using monocular silhouettes]


Joint Representation

• Articulated human body is linked by joints


Abstract Representation

2D 3D

Stick

Surface/Volume


Literature Review

From low-level observation to high-level abstraction:
• Image space (pixel domain)
• Human segmentation (S)
  – Full body
  – Body parts
  – Background subtraction
  – Object detection
• Image feature descriptor (F)
  – Shape, silhouette, color, appearance, motion, feature point (corner), ...
• 2D joint location (J)
  – Marker-based
• 3D model parametric space (pose domain, P)
  – Joint angle Θi
  – Joint location Pi
• Mappings used in the literature: P = f(S), P = f(F), P = f(J)
• A two-stage approach is proposed: P = f1(f2(F))
[Figure: 3D stick figure with X, y, Z axes and joint labels: neck, left/right shoulder, left/right elbow, left/right hand, bottom, left/right waist, left/right knee, left/right foot]


Approaches
• Model-free [Agarwal, 2006] [Loy, 2004]
  – Joint articulation is not used to constrain the search for the mapping function P = f(X)
• Model-based [Rbert, 2006] [Rohr, 1994]
  – A model of human articulation constrains the search for f and P
  – Two kinds of approach
    • Discriminative
    • Generative: Bayesian networks (BNs)
• Training: $\hat{f} = \arg\max_{f} L_1(f, \text{Training})$
• Inference: $\hat{P} = \arg\max_{P} L_2(\hat{f} \mid X, P)$


An Articulated Model = A Bayesian Network

• The human body is represented as a kinematic tree, consisting of segments linked by joints
• Kinematic models are handled with graphical probability networks
• Graphical probability models are computed via Bayesian networks
[Figure: 3D stick figure with X, y, Z axes and joint labels: neck, left/right shoulder, left/right elbow, left/right hand, bottom, left/right waist, left/right knee, left/right foot]


Three Steps to Utilize BNs

• Representation, learning and inference
  – Representation: joints and features are nodes (X1; X2, X3, X4) linked by conditional probabilities such as P(X1|X2,X3,X4)
  – Learning: feature-joint correspondence by conditional probability, $\hat{f} = \arg\max_{f} L_1(f, \text{Training})$
  – Inference: pose estimation, $\hat{P} = \arg\max_{P} L_2(\hat{f} \mid X, P)$


Two Causal Models in BNs
• Undirected acyclic graph [Lan, 2008] [Hua, 2005]
  – The network is a tree or graph model in which the edge linking two nodes has no direction; two linked nodes X1 and X2 are described by the joint probability P(X1,X2).
• Directed acyclic graph [Ramanan, 2007] [Lee, 2006] [Leonid, 2003]
  – Every node is linked to another node by directed arcs; two linked nodes X1 and X2 are described by the conditional probability P(X1|X2).


Directed Bayesian Articulated Model

• Nodes in a directed acyclic graph (DAG) are not influenced by their child nodes.
• Relations between human body parts are not regarded as two-way.
[Figure: directed graph over the 2D joint nodes h2d,1 to h2d,15]


Inference of Bayesian Networks

• Top-down approach [Gavrila, 1996]
  – Is strong at finding human body parts in the image.
• Bottom-up approach [Ren, 2005]
  – Is strong at finding people in the image.
• Combined approach [Navaraman, 2005] [Lee, 2002]
  – Benefits from the advantages of both.


3. Overview of the Approach

[Figure: the articulated stick figure in 2D and in 3D (X, y, Z axes); joint labels: head, neck, left/right shoulder, left/right elbow, left/right hand, bottom, left/right waist, left/right knee, left/right foot]
• Both the 2D and the 3D model are belief propagation networks that use an annealed Gibbs sampling algorithm.


System Architecture
• We estimate the 2D human joint positions before 3D estimation.
• Training
  – 2D model training: training features → 2D Bayesian human model setting → EM training
  – 3D model training: training features → 3D Bayesian human model setting → EM training
• Testing
  – Testing image → feature extraction → 2D Bayesian inference with annealed Gibbs sampling → 3D Bayesian inference with annealed Gibbs sampling → result


2D Human Graphical Model
• The articulated structure of the 2D human body is represented by a 15-node graphical model.
  $H_{2D} = \{h_{2d,1}, \ldots, h_{2d,15}\}$
[Figure: 2D stick figure (articulated model) with nodes h2d,1 to h2d,15 for the head, neck, left/right shoulder, left/right elbow, left/right hand, bottom, left/right waist, left/right knee, and left/right foot]


3D Human Graphical Model
• The 3D human body model is described by a 45-dimensional vector H3D that holds the three spatial coordinates of each of the 15 joint nodes.
  $H_{3D} = \{h_{3d,1}, \ldots, h_{3d,15}\}$
[Figure: 3D stick figure (articulated model) with nodes h3d,1 to h3d,15, X, y, Z axes, and joint labels: neck, left/right shoulder, left/right elbow, left/right hand, bottom, left/right waist, left/right knee, left/right foot]


The BN Model
• A directed acyclic graph $G = (V, E, C)$
  – V: vertex set {Vi, 1≤i≤N}
  – E: a set of directed edges (i,j)
  – C: (i,j) → R+, edge cost functions
• To encode probabilistic information
  – An edge indicates a probabilistic dependence
  – C: P(Vi | Vj), the conditional probability function set
• The 2D and 3D BNs: $G_{2D} = (V_{2D}, E_{2D}, C_{2D})$ and $G_{3D} = (V_{3D}, E_{3D}, C_{3D})$
[Figure: directed graph over the 2D joint nodes h2d,1 to h2d,15]


2D Graphical Model

• Vertices: $V_{2D} = \{H_{2D}, O_{2D}\}$
• Observation nodes O2d: S, Nc, A, C
• Conditional probabilities: $C_{2D} = \{P(h_{2d,i} \mid pa(h_{2d,i}))\}$
[Figure: 2D Bayesian network linking the observation nodes to the joint nodes h2d,i]


3D Graphical Model

• Vertices: $V_{3D} = \{H_{3D}, O_{3D}\}$
• Observation nodes O3d: h2d, wN, L
• Upper body: 3D nodes hu3d,1 to hu3d,7, with 2D nodes h2d,1 and h2d,3 to h2d,9
• Lower body: 3D nodes hl3d,1 to hl3d,7, with 2D nodes h2d,9 to h2d,15
• Conditional probabilities: $C_{3D} = \{P(h_{3d,i} \mid pa(h_{3d,i}))\}$


Joint Probability Distribution (JPD)

• The two proposed graphical models specify two unique JPDs: P2D(V2D) and P3D(V3D)
• Let P(V) represent the two JPDs
  $$P(V) = \prod_{i=1}^{n} P(V_i \mid pa(V_i))$$
• The factorization of the JPD comes from the Markov blanket, a local Markov property
• If we can learn the finite set of conditional probabilities, we can infer the human pose
[Figure: 2D Bayesian network over the joint nodes h2d,i]
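
As a minimal sketch of the factorization P(V) = ∏ P(Vi | pa(Vi)), the Python snippet below evaluates the joint probability of a hypothetical three-node chain. The node names, the two-valued states, and every probability value are illustrative assumptions; they are not the 15-node model or the learned parameters of this presentation.

```python
# Hypothetical chain neck -> shoulder -> elbow; each node is "low" or "high".
# All numbers are made up for illustration.
parents = {"neck": None, "shoulder": "neck", "elbow": "shoulder"}

# Prior of the root node.
prior = {"neck": {"low": 0.4, "high": 0.6}}

# Conditional tables indexed as cpt[node][parent_value][node_value].
cpt = {
    "shoulder": {"low": {"low": 0.7, "high": 0.3},
                 "high": {"low": 0.2, "high": 0.8}},
    "elbow": {"low": {"low": 0.9, "high": 0.1},
              "high": {"low": 0.5, "high": 0.5}},
}


def joint_probability(assignment):
    """Return P(V) as the product of P(V_i | pa(V_i)) over all nodes."""
    p = 1.0
    for node, parent in parents.items():
        value = assignment[node]
        if parent is None:
            p *= prior[node][value]
        else:
            p *= cpt[node][assignment[parent]][value]
    return p


print(joint_probability({"neck": "high", "shoulder": "high", "elbow": "low"}))
```

Only one small table per node is stored, which is why learning "the finite conditional probabilities" is enough to evaluate any full assignment.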


Two Problems
• Training problem
  – Given a training set {O2d, O3d}
  – How can we learn the edge cost function C = { P(h | pa(h)) }?
  – We apply the EM algorithm
• Inference problem
  – Given an evidence O
  – How can we infer the human pose P(H | O) from P(V)?
  – We propose an annealed Gibbs sampling algorithm


4. Model Learning by EM
• Why apply the EM algorithm for model learning?
  – The human poses and observations are incomplete and sparse
    • Incomplete: occlusion due to the single camera
    • Sparse: few training samples in a high-dimensional space


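The Likelihood Function
• The training set D = {D1, …, DN}
  – N represents the number of training samples
  – Dl = {V1[l], …, Vn[l]} is the l-th training sample
• Let θ be the learning model: C = { P(h | pa(h)) }
• $L_D(\theta) = \log P(D \mid \theta)$
• A log-likelihood function is formulated based on the independence assumption of training samples
  $$L_D(\theta) = \log \prod_{l=1}^{N} P(V_1[l], \ldots, V_n[l] \mid \theta) = \sum_{i=1}^{n} \sum_{l=1}^{N} \log P(V_i[l] \mid pa(V_i)[l], \theta)$$
• The estimate of θ (the third equality drops P(D), which does not depend on θ, and treats the prior P(θ) as uniform):
  $$\hat{\theta} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} \frac{P(D \mid \theta)\, P(\theta)}{P(D)} = \arg\max_{\theta} P(D \mid \theta) = \arg\max_{\theta} \prod_{l=1}^{N} P(D_l \mid \theta)$$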


MLE vs. EM
• If D is complete, we can apply MLE (Maximum Likelihood Estimation) to find θ
• However, D is incomplete because of occlusion and partial observability
• Let D = Y ∪ U
  – Y is the observed data
  – U is the missing data


The EM
• Expectation step
  – Computes the expectation of the log-likelihood function
    $$Q(\theta \mid \theta^{(t)}) = E\left[\log P(D \mid \theta) \mid \theta^{(t)}, Y\right]$$
• Maximization step
  – Updates the parameter θ(t+1) of step t+1 from the current parameter θ(t)
    $$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$$
• Stop condition of the E-M iteration
  – $L_D(\theta^{(t+1)}) - L_D(\theta^{(t)})$ converges
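
The E and M steps can be sketched on a toy problem. The Python example below is a minimal illustration, not the training code behind these slides: it assumes a hypothetical two-node network A → B with binary values, in which the parent A is missing (None) in some samples, standing in for an occluded joint. The data set, the initial parameters, and the tolerance are all made-up assumptions.

```python
import math


def em_step(data, p_a, p_b):
    """One EM iteration for a binary network A -> B.

    data: list of (a, b) pairs; a is None when A is hidden.
    p_a: current P(A=1); p_b: [P(B=1|A=0), P(B=1|A=1)].
    """
    n_a = [0.0, 0.0]    # expected counts of A=0 and A=1
    n_ab1 = [0.0, 0.0]  # expected counts of (A=a, B=1)
    for a, b in data:
        if a is None:
            # E-step: responsibility P(A=1 | B=b, theta^(t))
            like1 = p_a * (p_b[1] if b else 1 - p_b[1])
            like0 = (1 - p_a) * (p_b[0] if b else 1 - p_b[0])
            w1 = like1 / (like1 + like0)
        else:
            w1 = float(a)
        n_a[0] += 1 - w1
        n_a[1] += w1
        n_ab1[0] += (1 - w1) * b
        n_ab1[1] += w1 * b
    # M-step: re-estimate the parameters from the expected counts
    return n_a[1] / len(data), [n_ab1[0] / n_a[0], n_ab1[1] / n_a[1]]


def log_likelihood(data, p_a, p_b):
    """Log-likelihood of the observed part of the data."""
    def joint(a, b):
        pa = p_a if a else 1 - p_a
        pb = p_b[a] if b else 1 - p_b[a]
        return pa * pb

    total = 0.0
    for a, b in data:
        if a is None:
            total += math.log(joint(0, b) + joint(1, b))  # sum out hidden A
        else:
            total += math.log(joint(a, b))
    return total


def run_em(data, p_a=0.5, p_b=(0.3, 0.6), tol=1e-8, max_iter=200):
    """Iterate EM until the log-likelihood gain falls below tol."""
    history = [log_likelihood(data, p_a, p_b)]
    for _ in range(max_iter):
        p_a, p_b = em_step(data, p_a, p_b)
        history.append(log_likelihood(data, p_a, p_b))
        if history[-1] - history[-2] < tol:
            break
    return p_a, p_b, history


# Hypothetical samples; None marks a hidden parent value.
DATA = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1),
        (None, 1), (None, 1), (None, 0), (None, 0)]

p_a, p_b, history = run_em(DATA)
print(p_a, p_b, len(history))
```

The log-likelihood recorded in history never decreases from one iteration to the next, which is the property the stop condition relies on.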


5. Pose Estimation by Approximate Inference

• Let the observed data be O' = O − U
  – U is the set of hidden variables that are unobservable due to occlusion
• The best estimated pose is a vector H*, which is defined as the pose with the maximum probability given O'.
  $$H^* = \arg\max_{H} P(H \mid O') = \arg\max_{H} \int_{u \in U} P(H, u \mid O')\, du = \arg\max_{H} \int_{u \in U} P(H, O', u)\, du$$
• With V = H ∪ O' ∪ U and the factorization of P(V):
  $$H^* = \arg\max_{H} \int_{u \in U} \prod_{i=1}^{n} P(V_i \mid pa(V_i))\, du$$
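
For a discrete toy model the integral over the hidden variables becomes a sum, and H* can be found by enumeration. The sketch below assumes a hypothetical chain H → U → O' with two poses, one hidden variable, and one observed feature; all names and probabilities are invented for illustration and have nothing to do with the real 2D/3D networks.

```python
# Hypothetical model: pose H -> hidden (occluded) variable U -> observed O'.
P_H = {"stand": 0.5, "walk": 0.5}
P_U_GIVEN_H = {"stand": {"bent": 0.1, "straight": 0.9},
               "walk": {"bent": 0.6, "straight": 0.4}}
P_O_GIVEN_U = {"bent": {"wide": 0.8, "narrow": 0.2},
               "straight": {"wide": 0.3, "narrow": 0.7}}


def pose_scores(observed):
    """Return sum_u P(H, O'=observed, u) for every pose H."""
    scores = {}
    for h, p_h in P_H.items():
        scores[h] = sum(p_h * p_u * P_O_GIVEN_U[u][observed]
                        for u, p_u in P_U_GIVEN_H[h].items())
    return scores


def best_pose(observed):
    """Return H* = argmax_H sum_u P(H, O', u)."""
    scores = pose_scores(observed)
    return max(scores, key=scores.get)


print(best_pose("wide"), best_pose("narrow"))
```

Exhaustive enumeration like this is only feasible for tiny models, which is the motivation for the approximate inference methods discussed next.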


Inference of Posterior Probability

• How to calculate the posterior probability?
  $$H^* = \arg\max_{H} \int_{u \in U} \prod_{i=1}^{n} P(V_i \mid pa(V_i))\, du$$
  – Exact inference
    • Junction tree, message passing
  – Approximate inference
    • Loopy belief propagation, variational methods
    • Markov chain Monte Carlo (MCMC) sampling
      – Metropolis-Hastings
      – Gibbs sampling


Approximate Inference (1/2)

• MCMC algorithms use sampling to approximate the posterior distribution P(V) by random number generation
• The key idea of MCMC is to simulate the sampling process as a Markov chain
• Definition
  – A sample vector v of V
  – A proposal distribution q(v*|v(t-1)) to generate v*
  – An acceptance distribution α to accept v* as v(t)
    $$\alpha(v^{(t-1)}, v^*) = \min\left\{1, \frac{p(v^*)\, q(v^{(t-1)} \mid v^*)}{p(v^{(t-1)})\, q(v^* \mid v^{(t-1)})}\right\}$$
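
A small Metropolis-Hastings sampler makes the roles of the proposal q and the acceptance probability α = min{1, p(v*) q(v|v*) / (p(v) q(v*|v))} concrete. The sketch below is an assumption-laden toy: the five-state target WEIGHTS, the symmetric random-walk proposal on a ring, and the seed are all hypothetical choices, not anything used in the presented system.

```python
import random

# Hypothetical unnormalized target p(v) over the states 0..4.
WEIGHTS = [1.0, 2.0, 4.0, 2.0, 1.0]


def acceptance(p_new, p_old, q_old_given_new, q_new_given_old):
    """alpha = min{1, p(v*) q(v | v*) / (p(v) q(v* | v))}."""
    return min(1.0, (p_new * q_old_given_new) / (p_old * q_new_given_old))


def metropolis_hastings(n_samples, seed=0):
    """Sample the target with a random-walk proposal on a ring of states."""
    rng = random.Random(seed)
    v = 0
    samples = []
    for _ in range(n_samples):
        # Propose a neighbor; each direction has probability 0.5,
        # so q(v* | v) = q(v | v*) = 0.5.
        v_star = (v + rng.choice([-1, 1])) % len(WEIGHTS)
        alpha = acceptance(WEIGHTS[v_star], WEIGHTS[v], 0.5, 0.5)
        if rng.random() < alpha:
            v = v_star  # accept v* as the next state
        samples.append(v)   # on rejection the old state is repeated
    return samples


samples = metropolis_hastings(50000)
frequencies = [samples.count(s) / len(samples) for s in range(len(WEIGHTS))]
print(frequencies)
```

Because only the ratio p(v*)/p(v) enters α, the target never has to be normalized; the empirical frequencies should approach WEIGHTS divided by their sum.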


Approximate Inference (2/2)
• MCMC generates a Markov chain (v(0), v(1), ..., v(k), ...), because the transition probability from v(t-1) to v(t)
  – Depends only on v(t-1)
  – But not on (v(0), v(1), ..., v(t-2))
• The chain approaches its stationary distribution
  – Samples from the vector (v(k+1), ..., v(k+n)) are samples from P(V)
• However, if V is high-dimensional, MCMC does not converge easily


Annealed Gibbs Sampling (1/4)
• Gibbs sampling method
  – Formally proposed by Geman & Geman in 1984 for Markov Random Fields (MRFs)
  – Here the sampler is revised for the proposed two-stage Bayesian network
  – The basic idea
    • Sampling univariate conditional distributions
    • That is, the Markov chain (v(0), v(1), ..., v(k), ...) is obtained by changing only one variable of v at a time
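
The basic idea of changing one variable at a time can be sketched with a plain (not annealed) Gibbs sampler. The example below assumes a hypothetical joint distribution JOINT over two binary variables; the table values and the seed are made up, and the code is an illustration rather than the sampler used for the two-stage network.

```python
import random

# Hypothetical joint distribution P(V1, V2) over two binary variables.
JOINT = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}


def sample_conditional(rng, v, j):
    """Draw variable j from P(V_j | all other variables)."""
    candidates = []
    for value in (0, 1):
        c = list(v)
        c[j] = value
        candidates.append(tuple(c))
    # Weights proportional to the joint give the univariate conditional.
    weights = [JOINT[c] for c in candidates]
    return rng.choices(candidates, weights=weights)[0]


def gibbs(n_sweeps, seed=0):
    """Run n_sweeps sweeps; each sweep updates every variable once."""
    rng = random.Random(seed)
    v = (0, 0)
    samples = []
    for _ in range(n_sweeps):
        for j in range(len(v)):
            v = sample_conditional(rng, v, j)
        samples.append(v)
    return samples


samples = gibbs(20000)
print(samples.count((1, 1)) / len(samples))
```

Each update needs only a one-dimensional conditional distribution, which is what keeps the method practical when V has many components.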


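Annealed Gibbs Sampling (2/4)
• We draw from the distribution
  $$v_j^{(t)} \sim P(V_j \mid v_1^{(t)}, \ldots, v_{j-1}^{(t)}, v_{j+1}^{(t)}, \ldots, v_n^{(t)})$$
  with the proposal distribution
  $$q(v^* \mid v^{(t)}) = \begin{cases} p(v_j^* \mid v_{-j}^{(t)}) & \text{if } v_{-j}^* = v_{-j}^{(t)} \\ 0 & \text{otherwise} \end{cases}$$
• The Annealed Gibbs (AG) sampler
  – The sampling of the univariate conditional distributions is controlled by a stochastic process of simulated cooling
  $$\alpha_{AG} = \min\left\{1, \frac{p(v_j^*)^{1/T(t)}\, q(v_j^{(t)} \mid v_j^*)}{p(v_j^{(t)})^{1/T(t)}\, q(v_j^* \mid v_j^{(t)})}\right\}$$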


Annealed Gibbs Sampling (3/4)
• The function T(t) is called the cooling schedule
  $$T(t) = T_0 \left(\frac{T_f}{T_0}\right)^{t/n}$$
• The particular value of T at any point in the chain is called the temperature
  – T0 is the start temperature
  – Tf is the final temperature, reached by cooling down over n steps
• As the process proceeds, we decrease the probability of down-hill moves
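
The geometric cooling schedule T(t) = T0 (Tf / T0)^(t/n) is easy to tabulate. The short sketch below uses hypothetical values T0 = 10, Tf = 0.1, and n = 100, chosen only to show the shape of the schedule.

```python
def temperature(t, n, t0, tf):
    """Geometric cooling schedule T(t) = T0 * (Tf / T0) ** (t / n)."""
    return t0 * (tf / t0) ** (t / n)


# Hypothetical settings: start at 10, end at 0.1 after 100 steps.
for step in range(0, 101, 25):
    print(step, temperature(step, 100, 10.0, 0.1))
```

The temperature starts at T0, ends at Tf, and shrinks by the same factor at every step, so most of the chain is spent at low temperatures where down-hill moves are rarely accepted.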


Annealed Gibbs Sampling (4/4)
• The AG sampler adopts a stochastic iterative algorithm that converges to the set of points which are the global maxima of the given function
• The advantage of the AG sampler is
  – It is more efficient than the Gibbs sampler
    • Because, instead of approximating P(V),
      – We want to find the global maximum, i.e., the maximum (mode) of the posterior distribution.
      – We run a Markov chain of invariant distribution P(V) and estimate only the global mode
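
A mode-seeking annealed Gibbs sampler can be sketched as follows. This is one possible reading of the scheme, under stated assumptions: the candidate for one variable is drawn from its conditional distribution (the Gibbs proposal), and it is accepted with the tempered probability min{1, p(v*)^(1/T) q(v|v*) / (p(v)^(1/T) q(v*|v))}, which for a Gibbs proposal simplifies to min{1, (p(v*)/p(v))^(1/T − 1)}. The toy target with mode MODE, the temperatures, the step count, and the seed are all hypothetical.

```python
import math
import random

MODE = (2, 3, 1)   # hypothetical location of the global maximum
STATES = range(5)  # each variable takes a value in 0..4


def p(v):
    """Unnormalized toy target with a single global mode at MODE."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(v, MODE)))


def temperature(t, n, t0=1.0, tf=0.01):
    """Geometric cooling schedule T(t) = T0 * (Tf / T0) ** (t / n)."""
    return t0 * (tf / t0) ** (t / n)


def annealed_gibbs(n_steps=600, seed=0):
    """Return the most probable state visited by the annealed chain."""
    rng = random.Random(seed)
    v = [0, 0, 0]
    best = tuple(v)
    for t in range(n_steps):
        j = t % len(v)  # update one variable per step, in cyclic order
        # Gibbs proposal: draw v_j* from p(v_j | v_-j).
        candidates = [v[:j] + [s] + v[j + 1:] for s in STATES]
        weights = [p(c) for c in candidates]
        proposal = rng.choices(candidates, weights=weights)[0]
        # Tempered acceptance; up-hill moves are always accepted.
        ratio = p(proposal) / p(v)
        if ratio >= 1.0:
            alpha = 1.0
        else:
            alpha = ratio ** (1.0 / temperature(t, n_steps) - 1.0)
        if rng.random() < alpha:
            v = proposal
        if p(v) > p(best):
            best = tuple(v)
    return best


print(annealed_gibbs())
```

At T = 1 every proposal is accepted and the chain behaves like a plain Gibbs sampler; as T falls, down-hill moves are accepted less and less often, so the chain settles on a maximum instead of exploring the whole distribution.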


6. Feature Extraction
• Human silhouette sampling
• Normalized width
• Normalized center
• Spatial distribution of skin color
• Corners of silhouette
[Figure: silhouette bounding box with its width and length]


Human Silhouette Sampling (S)
• Human segmentation
• Human silhouette capturing [Suzuki, 1985]

• Uniform sampling is used in human silhouette sampling.
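
Uniform sampling of a silhouette can be sketched as picking evenly spaced points along an ordered contour. The helper below is a hypothetical illustration: the function name, the contour representation as a list of (x, y) points, and the number of samples k are assumptions, and the contour extraction itself is not shown.

```python
def uniform_sample(contour, k):
    """Pick k evenly spaced points from an ordered list of contour points."""
    n = len(contour)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of contour points")
    return [contour[(i * n) // k] for i in range(k)]


# Hypothetical 8-point contour of a small square, in boundary order.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
print(uniform_sample(square, 4))
```

A fixed k gives every silhouette a feature vector of the same length, whatever the size of the person in the image.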


Normalized Width (wN)
• Human segmentation
• Binary image profile
• Width adjustment
  $$w_N = x_R - x_L$$
  $$x_L = x \quad \text{if } h_x \ge \text{threshold and } h_{x-1} < \text{threshold}, \quad \text{for } x = 1 \to w$$
  $$x_R = x \quad \text{if } h_x \ge \text{threshold and } h_{x+1} < \text{threshold}, \quad \text{for } x = 1 \to w$$
[Figure: profile of the X coordinate; horizontal axis: x coordinate of image; vertical axis: pixel accumulation value; the normalized width is marked on the profile]


Normalized Center (Nc)
• Boundary adjustment
• Center of new boundary
  $$x_N = x_p + 0.5\, w_N$$
  $$y_N = y_p + 0.5\, L$$
[Figure: adjusted silhouette boundary with its width and length]
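
The normalized width and center can be sketched on a small binary mask. The code below is an interpretation under assumptions, not the original implementation: h(x) is taken to be the number of foreground pixels in column x, xL and xR are the first and last columns at which the profile crosses the threshold, and the reference point (xp, yp) is assumed to be the top-left corner of the adjusted boundary. The mask and the threshold values are made up.

```python
def column_profile(mask):
    """h(x): number of foreground pixels in each image column."""
    return [sum(column) for column in zip(*mask)]


def normalized_width(mask, threshold=1):
    """Return (x_left, x_right, w_N) from the column profile."""
    h = column_profile(mask)
    w = len(h)
    x_left = x_right = None
    for x in range(w):
        before = h[x - 1] if x > 0 else 0
        after = h[x + 1] if x < w - 1 else 0
        if h[x] >= threshold and before < threshold and x_left is None:
            x_left = x   # first column where the profile rises
        if h[x] >= threshold and after < threshold:
            x_right = x  # last column before the profile falls
    if x_left is None:
        raise ValueError("no column reaches the threshold")
    return x_left, x_right, x_right - x_left


def normalized_center(mask, threshold=1):
    """Return (x_N, y_N) = (x_p + 0.5 * w_N, y_p + 0.5 * L)."""
    x_left, _, w_n = normalized_width(mask, threshold)
    rows = [y for y, row in enumerate(mask) if any(row)]
    y_top, length = rows[0], rows[-1] - rows[0]
    return x_left + 0.5 * w_n, y_top + 0.5 * length


# Hypothetical 4 x 6 binary silhouette mask.
MASK = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
]
print(normalized_width(MASK), normalized_center(MASK))
```

Raising the threshold trims thin protrusions such as an outstretched arm from the measured width, which appears to be the purpose of the width adjustment step.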


Spatial Distribution of Skin Color (A)

Skin color detection by GMM

Morphology

Region segment

Spatial distribution of skin color


Corners of Silhouette (C)
• Human segmentation
• Human silhouette capturing
• The level curve curvature approach [Lindeberg, 1998]
• Adaptive corner choice
  $$\tilde{I}(x, y) = \arg\max \left( D_y^2 D_{xx} + D_x^2 D_{yy} - 2 D_x D_y D_{xy} \right)$$


7. Experimental Results
• Experimental environment
  – CPU: 1.86 GHz, RAM: 1 GB, VC6.0
  – HumanEva database I


HumanEva Database I
• Provider:
  – Department of Computer Science, Brown University
• Actions of HumanEva I

Action       Description
Walking      Subjects walked in an elliptical path around the capture space.
Jog          Subjects jogged in an elliptical path around the capture space.
Gesture      Subjects performed "hello" and "good-bye" gestures in repetition.
Throw/Catch  Subjects tossed and caught a baseball with the help of the lab assistant.
Box          Subjects imitated boxing.
Combo        Subjects performed combinational actions of walking and jogging.


Environment Setting

• 7 cameras
  – 3 color cameras (C1, C2, C3)
  – 4 gray-level cameras (BW1, BW2, BW3, BW4)
[Figure: layout of the control station and the 2 m × 3 m capture space, surrounded by cameras BW1, BW2, BW3, BW4, C1, C2, and C3]


The Experimental Data
• Our proposed method was trained on 1900 images from the walking sequences of subjects 1 and 2 from camera C1
• 200 testing images:
  – 100 images from subject 1
  – 100 images from subject 2
• Difficulties:
  – Self-occlusion
  – Clothing variation
  – Large variation of joint location


Evaluation of Accuracy

• Average distance error of poses between estimated results and ground truth
• Let H = {h1, h2, ..., hM}, where hm ∈ R3 (or hm ∈ R2 for the 2D body model), be the position vector of the body pose in the world (or in the image, respectively)
• D(H, H*): the error of the estimated pose H* with respect to the ground truth pose H
  $$D(H, H^*) = \sum_{m=1}^{M} \frac{\lVert h_m - h_m^* \rVert}{M}$$
• The average error ξ over the indices n = 1, ..., N and t = 1, ..., T:
  $$\xi = \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} D(H_{n,t}, H_{n,t}^*)$$
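
The two error measures can be computed directly from joint coordinates. The sketch below assumes poses are stored as lists of coordinate tuples and that the outer indices n and t address a nested list gt[n][t]; what n and t stand for (for example sequences and frames) is an assumption, and the sample poses are made up.

```python
import math


def pose_error(pose, estimate):
    """D(H, H*): mean Euclidean distance between corresponding joints."""
    distances = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(h, h_star)))
        for h, h_star in zip(pose, estimate)
    ]
    return sum(distances) / len(distances)


def mean_error(gt, est):
    """xi: average of D over all indices n (outer) and t (inner)."""
    errors = [
        pose_error(gt[n][t], est[n][t])
        for n in range(len(gt))
        for t in range(len(gt[n]))
    ]
    return sum(errors) / len(errors)


# Hypothetical two-joint poses in 3D.
truth = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
guess = [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
print(pose_error(truth, guess))
```

The same functions work for the 2D body model, since the distance is computed over however many coordinates each joint has.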


Performance Comparison Between Two-stage and One-stage Methods

• The AG sampler performs better than the Gibbs sampler
• The two-stage approach performs better than the classical one-stage approach
• The AG sampler takes less inference time


Effect of Iteration Number on Accuracy


2D Results of Subject 1

[Figure: ground truth (GT) and annealed Gibbs sampling (AGs) results for frames 1122, 1149, 1172, and 1200]


2D Results of Subject 2

[Figure: ground truth (GT) and annealed Gibbs sampling (AGs) results for frames 804, 835, 875, and 899]


3D Results
• Frame 1110 of subject 1
[Figure: 3D plots of the ground truth and the AGs estimation result]


3D Results (Cont.)

• Frame 1135 of subject 1
[Figure: 3D plots of the ground truth and the AGs estimation result]


3D Results (Cont.)
• Frame 845 of subject 2
[Figure: 3D plots of the ground truth and the estimation result]



3D Results (Cont.)
• Frame 872 of subject 2
[Figure: 3D plots of the ground truth and the estimation result]



8. Conclusions
• A markerless and monocular motion capture problem is considered
• The proposed two-stage annealed Gibbs sampling method can estimate more accurate poses with less computation time
• The method can overcome three challenges of the problem
  – Self-occlusion
  – High-degree variation of joint locations
  – Clothing limitation


Future Work
• Use GMM to approximate the prior and posterior distributions of our human models
• Combine model-free and model-based methods to obtain the benefits of both
• Exploit HMM to infer human motions in time series
• Add human part detectors to help locate human joints


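License Statement for This Presentation
• The content of this presentation is licensed under the Creative Commons "Attribution-NonCommercial 3.0 Taiwan" license.
• You are welcome to reproduce, distribute, or modify the content of this presentation for non-commercial purposes, provided that you credit: (1) the original author's name: Yuan-Kai Wang (王元凱); (2) the license mark:
• Some of the figures used in this presentation were taken from the Internet and are used by the speaker only under a claim of fair use for lectures promoting free software. Readers must not reuse them unless they judge that they themselves also qualify for fair use, and they bear the associated legal responsibility.

http://creativecommons.org/licenses/by-nc-nd/3.0/deed.zh_TW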







http://www.islab.tw/
