DRAFT FOR TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 1
Tracking people’s hands and feet using mixed
network AND/OR search
Vlad I. Morariu, Member, IEEE, David Harwood, Member, IEEE,
and Larry S. Davis, Fellow, IEEE
Abstract
We describe a framework that leverages mixed probabilistic and deterministic networks and their
AND/OR search space to efficiently find and track the hands and feet of multiple interacting humans
in 2D from a single camera view. Our framework detects and tracks multiple people’s heads, hands,
and feet through partial or full occlusion; requires few constraints (does not require multiple views,
high image resolution, knowledge of performed activities, or large training sets); and makes use of
constraints and AND/OR Branch-and-Bound with lazy evaluation and carefully computed bounds to
efficiently solve the complex network that results from the consideration of inter-person occlusion.
Our main contributions are 1) a multi-person part-based formulation that emphasizes extremities and
allows for the globally optimal solution to be obtained in each frame, and 2) an efficient and exact
optimization scheme that relies on AND/OR Branch-and-Bound, lazy factor evaluation, and factor cost
sensitive bound computation.
We demonstrate our approach on three datasets: the public single person HumanEva dataset, outdoor
sequences where multiple people interact in a group meeting scenario, and outdoor one-on-one basketball
videos. The first dataset demonstrates that our framework achieves state-of-the-art performance in the
single person setting, while the last two demonstrate robustness in the presence of partial and full
occlusion and fast non-trivial motion.
Index Terms
Tracking, Motion, Pictorial Structures.
V. I. Morariu, D. Harwood, and L. S. Davis are with the Department of Computer Science, AV Williams Bldg, University of Maryland, College Park, MD 20742. E-mail: {morariu,lsd}@cs.umd.edu, [email protected].
Fig. 3. Occlusion constraints: simultaneous assignment of overlapping candidates is disallowed if there is no evidence that
the overlap is caused by occlusion. When there is visual evidence of occlusion, the disallowed assignment pairs list is empty,
and two overlapping detections can be simultaneously assigned. When there is no visual evidence of occlusion, the disallowed
assignment pairs list ensures that only one of the overlapping detections is assigned. In this figure, disallowed pairs included
by an assignment are highlighted in yellow.
$$P(X \mid \theta) \propto \prod_{p} \left[ \prod_{i} f_{\mathrm{unk}}(x^p_i) \prod_{(x^p_i, x^p_j) \in E} f_{\mathrm{prior}}(x^p_i, x^p_j, \theta) \right]$$
where $E$ is the set of edges in the model (partitioned into symmetric appearance and skeletal edges,
$E_{\mathrm{skel}}$ and $E_{\mathrm{app}}$); skeletal and symmetric edges are solid and dotted in Figure 2a, respectively.
See Section IV for additional details on factors. Instead of precomputing the factors (which is
costly, as we will show), we employ a lazy evaluation scheme, evaluating each factor entry the
first time that it is needed during search. The prior $P(X \mid \theta)$ penalizes unknown locations through
$f_{\mathrm{unk}}$ and can include other priors on nodes through $f_{\mathrm{prior}}$, e.g., pairwise length priors.
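The lazy evaluation scheme can be sketched as a memoizing wrapper around a costly factor: each entry is computed only the first time search requests it, and served from a cache thereafter (an illustrative Python sketch, not our implementation; the toy factor below is hypothetical):

```python
import math

class LazyFactor:
    """Memoizing wrapper: each factor entry is evaluated at most once,
    the first time it is requested during search (illustrative sketch)."""
    def __init__(self, fn):
        self.fn = fn            # the costly factor function
        self.cache = {}         # assignment tuple -> value
        self.evaluations = 0    # number of actual costly evaluations

    def __call__(self, *assignment):
        if assignment not in self.cache:
            self.evaluations += 1
            self.cache[assignment] = self.fn(*assignment)
        return self.cache[assignment]

# toy pairwise prior preferring nearby candidate locations (hypothetical)
f_prior = LazyFactor(lambda xi, xj: math.exp(-abs(xi - xj)))
f_prior(1, 2); f_prior(1, 2); f_prior(3, 0)
print(f_prior.evaluations)  # 2: the repeated entry is served from the cache
```

Entries the search never visits are never evaluated, which is where the savings come from.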
B. Deterministic constraints
A deterministic network $R = (X, D, C)$ encodes length and occlusion constraints. Variables
$X$ and their domains $D$ are the same as in the previous section.
1) Length constraints: Length constraints, depicted in Figure 2b, ensure that body segments
have bounded length. The motivation for hard length constraints is that body segment length
is bounded in 3D as a ratio of height, and will be bounded even after projection to 2D under
mild assumptions (i.e., camera is not pointed down toward people’s heads). As long as length
constraints are satisfied, we do not prefer one length over another since foreshortening can cause
body segments to be arbitrarily short, so we use uniform length priors. Minimum lengths can
also be imposed for practical reasons, e.g., to avoid the degenerate case of zero-length segments.
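In code, such a hard length constraint reduces to a bounds check on squared distance, with all feasible lengths equally preferred (an illustrative sketch; the bound values are hypothetical):

```python
def length_ok(a, b, min_len, max_len):
    """Hard length constraint on a 2D body segment between candidate joint
    locations a and b: the projected length must lie in [min_len, max_len].
    Within the bounds, all lengths are equally likely (uniform prior)."""
    d2 = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return min_len ** 2 <= d2 <= max_len ** 2

# a 3-4-5 segment satisfies the bounds; a zero-length segment violates min_len
print(length_ok((0, 0), (3, 4), 1.0, 6.0), length_ok((0, 0), (0, 0), 1.0, 6.0))
# True False
```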
2) Occlusion constraints: Intra- and inter-person occlusion constraints are added only between
extremities (not inner joints), as we are interested mainly in tracking them. As Figure 3 shows,
August 23, 2012 DRAFT
Fig. 4. Occlusion evidence over time. Partial overlap: if a period of no overlap (between tracklet pairs) occurs immediately
before or after a period of partial overlap, then this is counted as evidence of occlusion for the partial overlap period, and
simultaneous assignment is allowed. Full overlap: during periods of full overlap, the track of the occluded part is assumed to
be unreliable and only one of the overlapping tracks can be assigned. No overlap: simultaneous assignment is always allowed
during periods of no overlap.
two extremity tracklets can overlap either due to occlusion (first column) or detector false
positives (second and third columns). In the former case, each tracklet corresponds to a real
extremity, so both should be assigned; in the latter, only one tracklet should be assigned to
a person. To be conservative, when two candidates overlap (as in Figs. 1 and 3), they can be
assigned simultaneously only if visual evidence suggests that they are occluding each other, e.g.,
if two initially non-overlapping candidates are observed to move toward each other and partially
overlap. If two candidates $v \in D^p_i$ and $v' \in D^q_j$ overlap for a period of time but no evidence
of occlusion is observed, then mutual exclusion constraints are automatically added between $x^p_i$
and $x^q_j$ during that time period, preventing $v$ and $v'$ from being simultaneously assigned.
We use a two-threshold approach to determine if there is enough evidence to allow simultaneous
assignment of overlapping tracklets, based on the following assumptions: 1) low-level trackers
generally track extremities through partial occlusion if they were at some point observed in
isolation, and 2) low-level trackers generally fail for extremities that become fully occluded. Using
these assumptions, we use two thresholds on the area of overlap between two tracked extremities
to create three types of relationships between extremity pairs: full overlap, partial overlap, and
no overlap. During a period of partial overlap, we allow two tracklets to be simultaneously
assigned to two body parts only if there is a period of no overlap immediately before or after
the partial overlap period, due to the first assumption. During a period of full overlap, tracklets
are not allowed to be simultaneously assigned, due to the second assumption. During periods of
no overlap we allow simultaneous assignment. Figure 4 illustrates this approach.
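The two-threshold rule can be sketched as follows, assuming a per-frame overlap ratio has already been measured for a tracklet pair (threshold values and helper names are illustrative, not the values used in our experiments):

```python
def overlap_relation(area_ratio, t_partial=0.1, t_full=0.8):
    """Classify a per-frame overlap measurement into no/partial/full overlap
    using two thresholds (threshold values are illustrative)."""
    if area_ratio < t_partial:
        return "none"
    return "full" if area_ratio > t_full else "partial"

def allow_simultaneous(relations):
    """Per frame, decide whether two overlapping tracklets may be assigned
    simultaneously: always during no overlap; never during full overlap;
    during partial overlap only if a no-overlap period is adjacent to the
    partial-overlap period (evidence that the overlap is real occlusion)."""
    allowed = []
    for t, rel in enumerate(relations):
        if rel == "none":
            allowed.append(True)
        elif rel == "full":
            allowed.append(False)
        else:
            # scan left and right past the contiguous 'partial' run
            left = t
            while left > 0 and relations[left - 1] == "partial":
                left -= 1
            right = t
            while right < len(relations) - 1 and relations[right + 1] == "partial":
                right += 1
            evidence = (left > 0 and relations[left - 1] == "none") or \
                       (right < len(relations) - 1 and relations[right + 1] == "none")
            allowed.append(evidence)
    return allowed

rels = ["none", "partial", "partial", "full", "partial"]
print(allow_simultaneous(rels))  # [True, True, True, False, False]
```

The final partial-overlap frame is disallowed because its run is bordered by full overlap on one side and the sequence end on the other, so no occlusion evidence exists.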
Intra- and inter-person occlusion constraints, both obtained by this approach, are illustrated in
Figures 2c and 2d. Note that Figure 2c shows the worst case scenario (all extremity pairs have
occlusion constraints), but in Figure 2d only some pairs have overlapping domains. A graph
August 23, 2012 DRAFT
DRAFT FOR TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 9
Fig. 5. Example of how inner joint domains are obtained. Fixed head and hand locations each constrain the location of the
elbow, (a) and (b), respectively, and the elbow must lie in their intersection (c). The resulting domain for each inner joint
(color-coded) can be obtained this way given all extremities (d). If we have a silhouette (e), we further constrain locations (f).
with fully connected extremities is very unlikely, because length constraints cause domains to be
spatially localized; thus, hand candidates generally do not overlap foot candidates, and a person
only has constraints between immediate neighbors.
C. Temporal assignment tracking
An additional set of factors, $f_{\mathrm{trans}}(x^{t-1}_i, x^t_i)$, links the extremities of people who appear
in consecutive frames, enforcing temporal assignment consistency (see Figure 2e). Structural
changes over time can result in a complex overall graph, so we approximate the solution for
a sequence as follows. We first ignore temporal factors to obtain the exact top $k$ solutions in
each frame by AND/OR Branch-and-Bound on the mixed network defined by $P$ and $R$. We
then compute the transition probability between two frames as the product of all temporal factors
between extremity nodes that appear in both frames,
$$f_{\mathrm{trans}}(X^{t-1}, X^t) = \prod_{p} \prod_{i} f_{\mathrm{trans}}(x^{p,t-1}_i, x^{p,t}_i),$$
and obtain the best sequence of assignments by dynamic programming (the Viterbi algorithm).
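The dynamic-programming pass over the top-k per-frame solutions can be sketched as a standard Viterbi recursion (the interface below is illustrative and assumes log-scores; it is not our implementation):

```python
def viterbi_topk(frame_scores, transition):
    """Select one of the top-k solutions in each frame so that the sum of
    per-frame log-scores and temporal transition log-scores is maximized.
    frame_scores: per frame, a list of per-solution log-scores.
    transition(t, i, j): log transition score from solution i at frame t-1
    to solution j at frame t. (Interface is illustrative.)"""
    T = len(frame_scores)
    best = [list(frame_scores[0])]
    back = []
    for t in range(1, T):
        cur, bk = [], []
        for j, sj in enumerate(frame_scores[t]):
            cands = [best[t - 1][i] + transition(t, i, j)
                     for i in range(len(frame_scores[t - 1]))]
            i_star = max(range(len(cands)), key=cands.__getitem__)
            cur.append(cands[i_star] + sj)
            bk.append(i_star)
        best.append(cur)
        back.append(bk)
    # backtrack from the best final solution
    j = max(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for t in range(T - 2, -1, -1):
        j = back[t][path[-1]]
        path.append(j)
    return path[::-1]

# two frames, two candidate solutions each; the transition prefers staying
scores = [[0.0, -1.0], [-1.0, 0.0]]
trans = lambda t, i, j: 0.0 if i == j else -5.0
print(viterbi_topk(scores, trans))  # [0, 0]
```

The strong self-transition outweighs the slightly better second-frame score, so the path stays on solution 0.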
D. Node domains
Recall that for extremities, $X^p_e = \{x^p_i \mid i = h, a_1, a_2, f_1, f_2\}$, domains consist of detected
candidate locations plus the unknown state. Instead of explicitly detecting internal joints, one
advantage of our formulation is that a set of extremity candidates and hard length constraints yield
compact feasible regions for each internal joint. Assume that we fix an extremity assignment;
then, given an extremity location, any internal joint must lie inside of a circular region centered
at that extremity, with radius equal to the maximum allowable distance imposed by the length
constraints. To satisfy all constraints, an internal joint must lie in the intersection of the feasible
regions given each extremity. Let $D^p_{ij}(v^p_j)$ denote the feasible domain of inner joint $x^p_i$ if extremity
$x^p_j$ is fixed to a single location $v^p_j$; then the domain of $x^p_i$ is $D^p_i = \bigcap_{j : x_j \in X^p_e} D^p_{ij}(v^p_j)$, as depicted
in Figure 5. However, $x^p_j$ is not yet known, so the internal joint domain given an extremity is
the union over the set of potential extremity locations: $D^p_i = \bigcap_{j : x_j \in X^p_e} \big( \bigcup_{v \in D^p_j} D^p_{ij}(v) \big)$. We
discretize inner joint regions into uniform grids to obtain feasible joint domains.
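The intersection-of-unions construction can be sketched directly on a 2D grid (an illustrative sketch with hypothetical names; the real system uses the length bounds of Section II-B):

```python
def inner_joint_domain(extremity_domains, max_len, grid):
    """Feasible grid locations of an inner joint: the intersection, over
    constraining extremities, of the union of disks of radius max_len around
    each candidate location of that extremity (2D illustrative sketch).
    extremity_domains: one list of (x, y) candidates per extremity."""
    def feasible_for(candidates, p):
        # union over candidate locations of a disk of radius max_len
        return any((p[0] - v[0]) ** 2 + (p[1] - v[1]) ** 2 <= max_len ** 2
                   for v in candidates)
    return [p for p in grid
            if all(feasible_for(e, p) for e in extremity_domains)]

grid = [(x, y) for x in range(5) for y in range(5)]
hand = [(0, 0)]   # hypothetical extremity candidates
head = [(4, 0)]
dom = inner_joint_domain([hand, head], max_len=3.0, grid=grid)
print((2, 0) in dom, (0, 4) in dom)  # True False
```

Points between the two extremities survive both disk tests; points far from either extremity are pruned before any likelihood is ever evaluated there.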
III. AND/OR SEARCH WITH COSTLY FACTORS
Consider the problem as described in Section II-A, where the waist and neck joints, together
with a width parameter, define the torso segment of the body. If there are $N_w$ candidate locations
for the waist, and $N_n$ locations for the neck, then there are $N_w N_n$ candidate configurations for
the torso segment. Thus, the factor that evaluates local image likelihoods of torso configurations
would need to be evaluated $N_w N_n$ times to precompute all factor entries before processing. If
$N_w$ and $N_n$ are large enough relative to the graphical model complexity (as they are in our
experiments), the process of precomputing factor entries can be more time-consuming than the
subsequent optimization problem. Since informed search (e.g., Branch-and-Bound) avoids a large
part of the search space, lazy evaluation can reduce image evaluations. However, if the upper
bounds needed to guide the search are computed using standard approaches, e.g., via MBE [3],
all factor entries would be evaluated in the process. By using the careful bound computation
approach we describe below, AND/OR search spaces can be used to reduce evaluation cost
while obtaining the exact global solution. Unlike other approaches, such as Belief Propagation
(BP), whose performance degrades as determinism (probabilities close or equal to 0 or 1) is
introduced³, AND/OR Branch-and-Bound leverages determinism present in the network.
We first summarize AND/OR search spaces [1], mixed networks [2], and AND/OR Branch-and-Bound [3],
and then describe how to efficiently compute upper bounds directly from the
data while evaluating few costly factor entries.
A. AND/OR search spaces
A graphical model $P = (X, D, F)$ is defined by a set of variables $X = \{x_1, \ldots, x_n\}$, the
domain $D_i \in D$ of each variable, and a set of functions (or factors) $F$ defined on subsets of
$X$. The primal graph of a graphical model is the undirected graph $G = (V, E)$ whose nodes
³ See [27] for an analysis of BP in the presence of determinism or near-determinism.
$V$ are the variables, $X$, and whose edge set $E$ is formed by connecting any two nodes if their
corresponding variables appear together as arguments in one or more of the factors in $F$ (see
Fig. 6a). A pseudo tree $T = (V, E')$ is a directed rooted tree defined on all of the graph nodes;
any arc of $G$ which is not included in $E'$ is a back-arc (see Fig. 6b). Given this pseudo tree,
the associated AND/OR search tree has alternating levels of OR and AND nodes, labeled $X_i$
and $\langle X_i, x_i \rangle$, respectively, where $X_i$ is one of the variables in $X$, and $x_i$ is a value from $D_i$.
The root of the search tree is an OR node labeled with the root of $T$. Each child of an OR
node $X_i$, labeled $\langle X_i, x_i \rangle$, represents an instantiation of $X_i$ with a value $x_i$. The children of
each AND node $\langle X_i, x_i \rangle$ are the children of $X_i$ in the pseudo tree $T$. Thus, depth-first traversal
of the AND/OR search space begins with the root node, and alternates between choosing from
possible assignments for a variable at OR nodes and decomposing the search into independent
sub-problems at AND nodes. A solution is a subtree $T'$ of the search tree, defined as follows: (1) it
contains the root node, (2) each OR node in $T'$ must have exactly one of its successors in $T'$,
and (3) each non-terminal AND node in $T'$ must have all of its successors in $T'$ (see Fig. 6c).
Each node $X_i$ of the pseudo tree has an associated bucket $B_T(X_i)$ containing each factor
in $F$ whose scope includes $X_i$ and is fully contained along the path from the root down to
$X_i$. During the search process, at each AND node $n = \langle X_i, x_i \rangle$, the factors in $B_T(X_i)$ can be
evaluated, as all their arguments have been instantiated along the path from the root to node $n$.
During depth-first traversal of the graph, factors in a node's bucket are evaluated when the node
is first opened. After a node's subtree has been evaluated, the partial solution of that subtree is
propagated to its predecessor. For a max-product task such as ours, the value at an OR node is
the maximum of its children, and the value at each AND node is the product of the values of
its children and the values of factors in its bucket $B_T(X_i)$.
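The max-product evaluation over an AND/OR search tree can be sketched recursively, alternating value choice (OR) and decomposition into independent subproblems (AND). This is a plain depth-first sketch without caching or pruning; the toy model below is illustrative:

```python
def solve(var, children, domains, buckets, assignment):
    """Max-product value of the AND/OR subtree rooted at OR node var.
    children[var]: pseudo-tree children of var; buckets[var]: factors
    (functions of the assignment dict) whose scope is fully instantiated
    once var is assigned. (A sketch: no caching, no Branch-and-Bound.)"""
    best = 0.0
    for val in domains[var]:              # OR node: choose a value for var
        assignment[var] = val
        w = 1.0
        for f in buckets[var]:            # bucket factors at this AND node
            w *= f(assignment)
        for c in children[var]:           # AND node: independent subproblems
            w *= solve(c, children, domains, buckets, assignment)
        best = max(best, w)               # OR node: maximum over children
    assignment.pop(var, None)
    return best

# toy model: A with independent children B and C, pairwise factors on each
children = {"A": ["B", "C"], "B": [], "C": []}
domains = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}
fb = lambda a: 0.9 if a["A"] == a["B"] else 0.2
fc = lambda a: 0.8 if a["A"] != a["C"] else 0.3
buckets = {"A": [], "B": [fb], "C": [fc]}
print(round(solve("A", children, domains, buckets, {}), 2))  # 0.72
```

Because B and C decompose under A, each subproblem is maximized independently and the results are multiplied, which is exactly what makes the AND/OR space smaller than a flat OR space.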
The AND/OR search graph can be implicitly searched by caching solutions at OR nodes.
A node $n$ is cached based on values assigned to its parent set, which is defined as the union
of all ancestors of $n$ in the pseudo-tree connected by an edge in the primal graph to $n$ or
its descendants in the pseudo-tree. The graph size and search complexity are determined by the
maximum parent set size in the graph. Letting the maximum parent set size be the induced width
(or treewidth) and denoting it by $w$, graph size and hence search complexity are $O(n k^w)$, where
$k = \max_i |D_i|$ is the maximum domain size. Parent sets are directly affected by the ordering of
nodes in the search pseudo-tree, so it is important to choose an ordering which yields the smallest
Fig. 6. Sample AND/OR search space: (a) network description ($X = \{A, B, C, D, E\}$, binary domains $D_A = \cdots = D_E = \{0, 1\}$,
and factors $F = \{f_1(A,B), f_2(B,E), f_3(A,D), f_4(C,D), f_5(C,A)\}$) and primal graph, (b) pseudo-tree (with the parent set used
for defining OR context in square brackets, e.g., $[A]$ is the parent set for node $B$ and $[A, C]$ is the parent set for node $D$), (c)
AND/OR search tree, with solution highlighted, (d) AND/OR search graph, in which nodes are implicitly merged by caching
previously reached solutions, (e) AND/OR search tree and solution of our graphical model with two candidate locations for
elbows, knees, waist, and neck, and a single candidate location for all other joints (we use a limited number of candidates
for display only; joints are round nodes, their candidate assignments are square nodes), (f) corresponding AND/OR search graph,
(g) corresponding domains in image coordinates, with color-coded joint domains and the solution highlighted in red.
possible parent sets. This can only be done efficiently by greedy approximation algorithms such
as Min-Fill [3]. Figures 6b and 6d respectively show the parent sets in square brackets for each
variable, and the implicit AND/OR graph that results.
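The parent-set (OR context) computation just described can be sketched directly; on the example network of Fig. 6 it reproduces the contexts shown in square brackets ($[A]$ for $B$, $[A, C]$ for $D$). This is a direct, unoptimized sketch:

```python
def or_contexts(parent, primal_edges):
    """For each variable, compute its OR context (parent set): the ancestors
    in the pseudo tree that share a primal-graph edge with the variable or
    with any of its pseudo-tree descendants."""
    children = {v: [] for v in parent}
    for v, p in parent.items():
        if p is not None:
            children[p].append(v)

    def descendants(v):                  # v plus everything below it
        out = {v}
        for c in children[v]:
            out |= descendants(c)
        return out

    def ancestors(v):                    # path from v up to the root
        out, p = [], parent[v]
        while p is not None:
            out.append(p)
            p = parent[p]
        return out

    adj = {v: set() for v in parent}     # primal-graph adjacency
    for a, b in primal_edges:
        adj[a].add(b); adj[b].add(a)

    return {v: sorted(a for a in ancestors(v)
                      if any(u in adj[a] for u in descendants(v)))
            for v in parent}

# pseudo tree and primal graph of the example network in Fig. 6
parent = {"A": None, "B": "A", "C": "A", "E": "B", "D": "C"}
edges = [("A", "B"), ("B", "E"), ("A", "D"), ("C", "D"), ("C", "A")]
print(or_contexts(parent, edges))
# {'A': [], 'B': ['A'], 'C': ['A'], 'E': ['B'], 'D': ['A', 'C']}
```

The largest context here has size 2, so caching on contexts bounds the search graph by $O(n k^2)$ for this example.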
B. Mixed networks
Given a constraint network $R = (X, D, C)$, where $X$ and $D$ are defined as above, and $C$
is a set of deterministic constraints defined on subsets of $X$ which allow or disallow certain
tuples of variable assignments, AND/OR search can be modified to check constraints of $R$
while maximizing $P$. This can be done by replacing the primal graph described above with the
union of the primal graphs of $P$ and $R$. During the search process, constraints (and constraint
processing techniques [28]) can be applied to prune inconsistent paths early (see [2] for details).
In our case, it is beneficial to avoid exploring portions of the search tree which have no solutions
by performing backtrack-free search. This can be done by preprocessing constraints using Bucket
Elimination (BE) [29]; in cases where BE is infeasible, Mini-bucket Elimination (MBE) [30]
can be used to reduce the number of dead ends explored.
C. Branch-and-Bound
Branch-and-Bound [3] can further reduce the search space and complexity required to obtain
the optimal solution. Branch-and-Bound search involves updating the lower bound on the best
solution each time a solution sub-tree is traversed. The lower bound, coupled with upper bounds
on the best extensions of a partial solution, can be used to prune the search space. In particular, a
node is opened only if the upper bound on extensions of the current partial solution is greater
than the current lower bound. Unlike the lower bound, which is computed during search, the
upper bound on the value of a subtree is obtained from the graphical model before search, using
the process of Mini-Bucket Elimination (MBE) [30], [3], which partitions buckets with large
parent sets into smaller mini-buckets. This partitioning ensures that the approximate best solution
has a score greater than or equal to that of the original best solution, but is cheaper to compute.
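The pruning rule can be sketched with precomputed suffix bounds standing in for MBE (a toy sketch over a flat variable order with unary weights; the interface and weights are illustrative):

```python
def branch_and_bound(domains, weights, ub_suffix):
    """Depth-first Branch-and-Bound over variables 0..n-1 with unary weights
    (a stand-in for the real factors). ub_suffix[d] upper-bounds the score
    of the best completion of any length-d prefix (as MBE would supply); a
    value is expanded only if prefix score * bound can beat the incumbent."""
    n = len(domains)
    best = 0.0        # incumbent lower bound, updated at full assignments
    expanded = 0      # number of nodes opened

    def score(prefix):
        s = 1.0
        for d, v in enumerate(prefix):
            s *= weights[d][v]
        return s

    def dfs(prefix):
        nonlocal best, expanded
        d = len(prefix)
        if d == n:
            best = max(best, score(prefix))
            return
        for v in domains[d]:
            expanded += 1
            if score(prefix + [v]) * ub_suffix[d + 1] > best:  # else prune
                dfs(prefix + [v])

    dfs([])
    return best, expanded

weights = [[0.9, 0.1], [0.8, 0.2]]
ub = [0.72, 0.8, 1.0]  # product of the best remaining weights per depth
best, expanded = branch_and_bound([[0, 1], [0, 1]], weights, ub)
print(round(best, 2), expanded)  # 0.72 4
```

Of the six possible nodes, only four are opened: once the incumbent reaches 0.72, the bounds prune both remaining branches without ever scoring their completions.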
D. Branch-and-Bound with costly factors
A disadvantage of Mini-Bucket Elimination is that it evaluates most if not all factor entries.
This is usually acceptable, but in our case accessing the value of a factor entry is an expensive
operation. If the cost of evaluating one entry of factor $f_j \in F$ is $\alpha_j$ for $j = 1, \ldots, m$, then the
cost of evaluating all entries of $f_j$ is $\beta_j = \alpha_j \prod_{X_i \in \mathrm{scope}(f_j)} |D_i|$. To avoid evaluating all entries, we
first sort factors by $\beta_j$, and then iteratively remove the largest factors until the total remaining
evaluation cost is below some ratio $r$. To ensure that any solution of the resulting graphical
model bounds the original from above, we must replace each removed factor by a constant that
bounds all of its entries. By construction, the factors described hereafter are always bounded by
1, so upper bounds obtained by MBE on the reduced problem can be used to obtain the exact
global solution, while ensuring that few entries need to be evaluated in the process.
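The cost-based factor removal can be sketched as follows (factor names, $\alpha_j$ values, and domain sizes below are illustrative; `math.prod` requires Python 3.8+):

```python
from math import prod

def drop_costly_factors(factors, r):
    """Compute each factor's total evaluation cost beta_j = alpha_j * prod|D_i|
    and drop the costliest factors until the remaining total cost is at most
    r times the original. Each dropped factor is replaced by the constant 1,
    a valid upper bound since all factors here are bounded by 1. (Sketch.)
    factors: list of (name, alpha, domain_sizes)."""
    beta = {name: alpha * prod(sizes) for name, alpha, sizes in factors}
    total = sum(beta.values())
    order = sorted(beta, key=beta.get, reverse=True)  # costliest first
    remaining, dropped, i = dict(beta), [], 0
    while sum(remaining.values()) > r * total:
        name = order[i]
        i += 1
        dropped.append(name)
        del remaining[name]
    return sorted(remaining), dropped

kept, dropped = drop_costly_factors(
    [("torso", 1.0, (100, 100)),   # beta = 10000: dominates the total cost
     ("head", 1.0, (100,)),        # beta = 100
     ("hand", 2.0, (50,))],        # beta = 100
    r=0.05)
print(kept, dropped)  # ['hand', 'head'] ['torso']
```

MBE then runs on the cheap kept factors only, and because every dropped factor is over-approximated by 1, the resulting bounds remain valid upper bounds for the original problem.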
Our lazy evaluation approach to dealing with costly factors is summarized by the algorithm
in Figure 7. The first step is to compute the variable ordering needed to construct the pseudo
tree $T$. Constraint processing is then performed to yield a new set of constraint factors $C'$. In
our implementation, constraint processing involves propagating hard constraints toward the root
of the pseudo tree (using MBE), producing a new set of constraints that are consistent with
Inputs:
(1) probabilistic and constraint networks, $P = (X, D, F)$ and $R = (X, D, C)$, respectively
(2) factor entry evaluation costs, $\alpha_j$ for each $f_j \in F$
step by removing from consideration any assignments that violate a head-limb constraint. Table
IV shows detection results on the group dataset for various occlusion approaches, in terms of
precision, recall (computed only when extremities are at least partially visible), and F1 measures.
The first row shows the result of finding the best assignment for each person individually, without
head constraint pre-processing (thus, skin blobs corresponding to people’s faces are incorrectly
assigned as hands). The second and third rows show the individual and iterative approaches, both
with head constraint pre-processing. Finally, the last two rows show the results of our approach,
with and without temporal assignment tracking. Note that the single-frame results are single-
frame only in the sense that we are reporting the best solution found per frame; candidates are
still obtained from candidate tracklets obtained from multiple frames. Table IV shows that as
we increase the complexity of occlusion reasoning, performance increases. F1 remains fixed or
is lower only between the individual and iterative occlusion handling approaches; in this case,
obtaining joint solutions iteratively increases precision significantly, but also lowers recall. Our
joint approach increases both precision and recall. The multi-frame results show that our temporal
transition factors do improve results, but very little; this might be explained by the fact that some
temporal information is included during low-level tracklet formation. In addition, single frame
performance is already very high (up to the quality of the detected set of candidates), and since
the multi-frame approach adds no candidate locations, it provides limited gains in performance.
E. Generalization in the absence of silhouettes
We explore the use of Histograms of Gradients (HOG), as implemented by Felzenszwalb et al.
[49], as replacements for silhouettes. We detect humans using [49], train detectors for axis-aligned
joints and rotated body parts⁶ on the TUD training set [18] using Partial Least Squares (PLS) and
⁶ We use templates of size (64, 64), (48, 48), and (72, 72) for the head, feet, and other joints, and (64, 104), (64, 72) and (80,
144) for the lower legs, upper legs, and torso, respectively, all using a HOG block stride of 8.
Fig. 14. Generalization to the case with HOG features and no silhouettes. Individual detectors generate informative response maps.
[Fig. 15, left: precision-recall curves with average precision (AP): torso 0.90, head 0.59, lowerleg1 0.37, foot1 0.31, foot2 0.22, lowerleg2 0.17, upperleg1 0.048, upperleg2 0.046.]

[Fig. 15, middle:]
                     P      R      F1
detections          0.33   0.63   0.43
candidates          0.32   0.64   0.43
top candidate       0.68   0.54   0.60
assignments, single 0.83   0.61   0.70
assignments, multi  0.84   0.61   0.71
Fig. 15. Quantitative results with HOG features replacing silhouettes. Left: precision-recall curves of individual models evaluated
as detectors on fixed size 300 by 300 pixel windows extracted from the tud-campus test sequence, where people are scaled to
a height of 200 pixels. Middle: tracking performance (feet only). The top candidate row is a baseline comparison to keeping
the top-scoring location for each detector individually. Right: sample results of our framework using only HOG features.
Quadratic Discriminant Analysis (QDA) as in [50], and evaluate on the tud-campus dataset [18].
These videos have constant foreground motion and are short, making background model training
difficult. Figures 14 and 15 show qualitative and quantitative results of HOG-based detectors
individually as well as part of our entire framework (using raw uncalibrated responses to detect
humans/feet and model part likelihoods). Our focus is on extremity detection and tracking in
these experiments, so while humans were automatically detected using [49], manual supervision
was used as for the basketball dataset to fix identity switches/merges (however, detector noise
remains: no bounding boxes were added or adjusted). Our results show that individual part
detectors and likelihood models are informative but not sufficient to associate feet to people.
However, as part of our framework, precision increases significantly.
The common approach of running part models at discretized translation and rotation intervals
requires over 500,000 evaluations per person to run 10 part detectors every 8 pixels and 10
degrees for a 300 by 300 pixel region of interest, as in our experiments. As Figure 12 shows,
even if all possible likelihood evaluations are performed in our framework, most cases require
below 500,000 evaluations for all (up to 5) people combined at a much finer set of angles (since
parts are defined by their endpoints); AOBB reduces these to roughly 50,000 image evaluations.
Figure 11 shows that likelihood evaluation (not inference) dominates the total time, so little can
be gained by more efficient inference, e.g., a fast generalized distance transform [6].
VI. CONCLUSION
We proposed a framework for detecting and tracking the extremities of multiple interacting people.
We quantitatively evaluated our approach on the publicly available HumanEva I dataset, a
dataset of a group of interacting people, and a dataset of one-on-one basketball games. Our
experiments show that AND/OR Branch-and-Bound with lazy evaluation can significantly reduce
computational cost, while yielding the globally optimal solution in each frame. Our approach
is flexible enough to deal with significant occlusion of people in groups as well as the rapid
motions and large pose variations observed during basketball games.
REFERENCES
[1] R. Dechter and R. Mateescu, “AND/OR search spaces for graphical models,” Artif. Intell., 2007.
[2] R. Mateescu and R. Dechter, “Mixed deterministic and probabilistic networks,” ICS Technical Report (Submitted), 2008.
[3] R. Marinescu and R. Dechter, “AND/OR branch-and-bound search for combinatorial optimization in graphical models,” ICS Technical Report, 2008.
[4] C.-S. Lee and A. Elgammal, “Body pose tracking from uncalibrated camera using supervised manifold learning,” in EHuM, 2006.
[5] R. Urtasun and T. Darrell, “Sparse probabilistic regression for activity-independent human pose inference,” CVPR, 2008.
[6] P. F. Felzenszwalb and D. P. Huttenlocher, “Pictorial structures for object recognition,” IJCV, 2005.
[7] D. Ramanan, D. Forsyth, and A. Zisserman, “Tracking people by learning their appearance,” PAMI, 2007.
[8] E. B. Sudderth, M. I. Mandel, W. T. Freeman, and A. S. Willsky, “Distributed occlusion reasoning for tracking with nonparametric belief propagation,” in NIPS, 2004.
[9] L. Sigal and M. Black, “Measure locally, reason globally: Occlusion-sensitive articulated pose estimation,” in CVPR, 2006.
[10] A. Gupta, A. Mittal, and L. S. Davis, “Constraint integration for efficient multiview pose estimation with self-occlusions,” PAMI, 2008.
[11] H. Jiang and D. Martin, “Global pose estimation using non-tree models,” in CVPR, 2008.
[12] X. Ren, A. Berg, and J. Malik, “Recovering human body configurations using pairwise constraints between parts,” in ICCV, 2005.
[13] G. Hua, M.-H. Yang, and Y. Wu, “Learning to estimate human pose with data driven belief propagation,” in CVPR, 2005.
[14] L. Karlinsky, M. Dinerstein, D. Harari, and S. Ullman, “The chains model for detecting parts by their context,” in CVPR, 2010.
[15] H. Jiang, “Human pose estimation using consistent max-covering,” in ICCV, 2009.
[16] T.-P. Tian and S. Sclaroff, “Fast globally optimal 2D human detection with loopy graph models,” in CVPR, 2010.
[17] M. Bergtholdt, J. H. Kappes, S. Schmidt, and C. Schnorr, “A study of parts-based object class detection using complete graphs,” IJCV, 2010.
[18] M. Andriluka, S. Roth, and B. Schiele, “People-tracking-by-detection and people-detection-by-tracking,” CVPR, 2008.
[19] S. Gammeter, A. Ess, T. Jaggli, K. Schindler, B. Leibe, and L. V. Gool, “Articulated multi-body tracking under egomotion,” in ECCV, 2008.
[20] T. Zhao and R. Nevatia, “Tracking multiple humans in complex situations,” PAMI, 2004.
[21] S. Park and J. K. Aggarwal, “Simultaneous tracking of multiple body parts of interacting persons,” CVIU, 2006.
[22] T. Yu and Y. Wu, “Decentralized multiple target tracking using netted collaborative autonomous trackers,” in CVPR, 2005.
[23] M. Eichner and V. Ferrari, “We are family: Joint pose estimation of multiple persons,” in ECCV, 2010.
[24] L. Zhu, Y. Chen, Y. Lu, C. Lin, and A. Yuille, “Max margin and/or graph learning for parsing the human body,” in CVPR,
2008.
[25] C. H. Lampert, M. B. Blaschko, and T. Hofmann, “Beyond sliding windows: Object localization by efficient subwindow
search,” in CVPR, 2008.
[26] V. Lempitsky, A. Blake, and C. Rother, “Image segmentation by branch-and-mincut,” in ECCV, 2008.
[27] R. Dechter, B. Bidyuk, R. Mateescu, and E. Rollon, “On the power of belief propagation: A constraint propagation
perspective,” in Heuristics, Probabilities and Causality: A Tribute to Judea Pearl, 2010.
[28] R. Dechter, Constraint Processing. Elsevier Morgan Kaufmann, 2003.
[29] ——, “Bucket elimination: A unifying framework for probabilistic inference,” in UAI, 1996.
[30] K. Kask and R. Dechter, “Mini-bucket heuristics for improved search,” in UAI, 1999.
[31] I. Haritaoglu, D. Harwood, and L. Davis, “W4: Real-time surveillance of people and their activities,” PAMI, 2000.
[32] R. Ronfard, C. Schmid, and B. Triggs, “Learning to parse pictures of people,” in ECCV, 2002.
[33] W. Schwartz, A. Kembhavi, D. Harwood, and L. Davis, “Human detection using partial least squares analysis,” in
ICCV, 2009.
[34] C. Huang, B. Wu, and R. Nevatia, “Robust object tracking by hierarchical association of detection responses,” in ECCV,
2008.
[35] M. Everingham, J. Sivic, and A. Zisserman, “‘Hello! My name is... Buffy’ - automatic naming of characters in TV video,”
in BMVC, 2006, pp. 889–908.
[36] Y. Li, C. Huang, and R. Nevatia, “Learning to associate: HybridBoosted multi-target tracker for crowded scene,” in CVPR,
2009, pp. 2953–2960.
[37] H. Sidenbladh and M. J. Black, “Learning image statistics for Bayesian tracking,” in ICCV, 2001.
[38] K. Kim, T. H. Chalidabhongse, D. Harwood, and L. S. Davis, “Real-time foreground-background segmentation using
codebook model,” Real-Time Imaging, 2005.
[39] A. S. Ogale and Y. Aloimonos, “A roadmap to the integration of early visual modules,” IJCV, 2007.
[40] D. Comaniciu and P. Meer, “Mean shift analysis and applications,” in ICCV, 1999.
[41] V. I. Morariu, B. V. Srinivasan, V. C. Raykar, R. Duraiswami, and L. S. Davis, “Automatic online tuning for fast Gaussian
summation,” in Advances in Neural Information Processing Systems (NIPS), 2008.
[42] L. Sigal, A. O. Balan, and M. J. Black, “HumanEva: Synchronized video and motion capture dataset and baseline algorithm
for evaluation of articulated human motion,” IJCV, vol. 87, no. 1-2, pp. 4–27, 2010.
[43] J. Martinez del Rincon, J. Nebel, D. Makris, and C. Orrite, “Tracking human body parts using particle filters constrained
by human biomechanics,” in BMVC, 2008.
[44] R. Poppe, “Evaluating example-based pose estimation: Experiments on the HumanEva sets,” in EHuM2, 2007.
[45] N. Howe, “Evaluating lookup-based monocular human pose tracking on the HumanEva test data,” in EHuM, 2006.
[46] C. Yanover and Y. Weiss, “Finding the M most probable configurations using loopy belief propagation,” in NIPS, 2004.
[47] J. M. Mooij, “libDAI: A free and open source C++ library for discrete approximate inference in graphical models,” Journal
of Machine Learning Research, vol. 11, pp. 2169–2173, Aug. 2010.
[48] W. S. Cleveland and S. J. Devlin, “Locally weighted regression: An approach to regression analysis by local fitting,”
Journal of the American Statistical Association, vol. 83, no. 403, pp. 596–610, Sept. 1988.
[49] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part based
models,” PAMI, 2010.
[50] W. R. Schwartz, A. Kembhavi, D. Harwood, and L. S. Davis, “Human detection using partial least squares analysis,” in
ICCV, 2009.
Vlad I. Morariu received the BS and MS degrees in computer science from the Pennsylvania State
University in 2005, and the PhD degree in computer science from the University of Maryland in 2010.
He is currently a research associate in the Computer Vision Laboratory of the Institute for Advanced
Computer Studies at the University of Maryland, College Park. His research interests include activity
recognition, object detection, graphical models, and probabilistic logic models. He is a member of the
IEEE.
David Harwood received the graduate degrees from the University of Texas and the Massachusetts
Institute of Technology. His research, with many publications, is in the fields of computer image and
video analysis and AI systems for computer vision. He is a longtime member of the research staff of
the Computer Vision Laboratory of the Institute for Advanced Computer Studies at the University of
Maryland, College Park. He is a member of the IEEE.
Larry S. Davis received the BA degree from Colgate University in 1970 and the MS and PhD degrees in
computer science from the University of Maryland in 1974 and 1976, respectively. From 1977 to 1981,
he was an assistant professor in the Department of Computer Science at the University of Texas, Austin.
He returned to the University of Maryland as an associate professor in 1981. From 1985 to 1994, he was
the director of the University of Maryland Institute for Advanced Computer Studies. He is currently a
professor in the institute and in the Computer Science Department, as well as the chair of the Computer