Structured Time Series Analysis for Human Action Segmentation and Recognition

Dian Gong, Gerard Medioni, Fellow, IEEE, and Xuemei Zhao

D. Gong, G. Medioni and X. Zhao are with the Institute for Robotics and Intelligent Systems, University of Southern California, Los Angeles, CA 90089. E-mail: {diangong, medioni, xuemeiz}@usc.edu.

Abstract: We address the problem of structure learning of human motion in order to recognize actions from a continuous monocular motion sequence of an arbitrary person from an arbitrary viewpoint. Human motion sequences are represented by multivariate time series in the joint-trajectories space. Under this structured time series framework, we first propose Kernelized Temporal Cut (KTC), an extension of previous works on change-point detection by incorporating Hilbert space embedding of distributions, to handle the nonparametric and high dimensionality issues of human motions. Experimental results demonstrate the effectiveness of our approach, which yields realtime segmentation and produces high action segmentation accuracy. Second, a spatio-temporal manifold framework is proposed to model the latent structure of time series data. Then an efficient spatio-temporal alignment algorithm, Dynamic Manifold Warping (DMW), is proposed for multivariate time series to calculate motion similarity between action sequences (segments). Furthermore, by combining the temporal segmentation algorithm and the alignment algorithm, online human action recognition can be performed by associating a few labeled examples from motion capture data. Results on human motion capture data and 3D depth sensor data demonstrate the effectiveness of the proposed approach in automatically segmenting and recognizing motion sequences, and its ability to handle noisy and partially occluded data, in the transfer learning module.

Index Terms: Multivariate Time Series, Action Recognition, Online Temporal Segmentation, Spatio-Temporal Alignment, Transfer Learning.
1 INTRODUCTION

Recognizing human action is a key component in many applications, such as human computer interaction, computer games, surveillance and human pose estimation. Extracting this high-level information from motion capture data or depth sensor data is the problem we propose to address here.

Although significant progress has been made in human action recognition [1], [2], [3], [4], [5], [6], the problem remains inherently challenging due to significant intra-class variations, viewpoint change, partial occlusion and background dynamic variations. A key limitation of many action-recognition approaches is that the models are learned from single 2D view video features on individual datasets, and are unable to handle arbitrary view change or scale and background variations. Also, since they are not generalizable across different datasets, retraining is needed for every new dataset. Furthermore, many works in human activity recognition focus on simple primitive actions such as walking, running and jumping, in contrast to the fact that daily activity involves complex temporal patterns (walking, sit-down, then stand-up). Thus, recognizing such complex activities relies on accurate temporal structure decomposition [7].

We offer to take as input either a motion capture (Mocap) sequence providing 3D joint positions, or a depth video from a 3D camera, pre-processed to obtain partial, noisy 3D joint positions, or, in future work, a 2D video from an arbitrary viewpoint, pre-processed to provide partial, noisy 2D joint positions. Our first step is to online segment these sequences into segments corresponding to different activities. This is achieved with no training. Furthermore, these segments are subdivided into action units corresponding to cycles (such as walking). Offline, we learn different activities from one or very few examples of labeled Mocap segments. Then, online, we compare a segment to labeled ones using a novel alignment algorithm in order to perform classification. We show promising results, demonstrating the power of our approach. The proposed approach has the following modules:

(1) Given a labeled Mocap sequence with M markers in 3D, which is a 3M-dimensional time series (3MD+t), the low dimensional manifold structure (i.e., tangent space, geodesic distance, etc.) is learned by using Tensor Voting. This is an offline process, as shown in Fig. 1.

(2) For other unlabeled motion sequences in 3D, sequential temporal segmentation is performed to automatically segment the input motion sequence into different action units. For a single action unit, after structure learning (1), we calculate the motion similarity score with each labeled motion sequence by the proposed spatio-temporal alignment approach, and perform action recognition. This is an online process.

(3) Our system can recognize actions from a depth sensor. Available human pose estimation methods (Kinect SDK and OpenNI) can provide 3D human pose estimation results, but they are often noisy and have occlusions, while the structure learning algorithm (1) remains the same and our temporal
segmentation and alignment approach (2) can naturally handle noisy input and occlusion.

Fig. 1. Flow chart of the proposed approach.

Our approach has the following advantages:

One or very few examples are required in each action category in the training stage, compared to 100s for many learning approaches.

Transfer learning: when applying our approach to depth image sequences, there is no training process on these depth images, and people in these images do not necessarily appear in the labeled Mocap sequences. Thus, our approach can be considered as a transfer learning framework, i.e., the knowledge from labeled Mocap data can be adapted to any human motion data.

Online action recognition: the input sequence from unlabeled Mocap sequences or 3D sensors can be temporally segmented in an online fashion (sec. 3), resulting in continuous action recognition (sec. 6.3).

Intra/Inter-person variations: a person repeating an action twice with differences, or two people performing an action with differences in both pose style and motion dynamics, can be handled by combining the proposed temporal and spatial alignment methods (sec. 5.2 and 5.3) together.

View invariance: low dimensional human motion manifold models are learnt from the 3D Mocap data, and our spatio-temporal alignment algorithms can handle 3D input with arbitrary viewpoint; these two features make our system robust to the action's viewpoint.

Noise and occlusion handling: in order to recognize actions from depth image sequences, human poses need to be estimated. Instead of M key points, often only K visible points can be estimated (with noise) during the whole action (K <= M), such as a side-view boxing man. Our system can handle these noisy trajectories, even with occlusion (3KD+t).

An overview of our approach is sketched in Fig. 1. The joint-trajectories of M human body key points are used to represent a human motion sequence. Trajectories can either be provided by Mocap (3D) or be tracked from depth image sequences by available human pose estimation methods such as Kinect SDK and OpenNI (noisy 3D). The core of our approach is the structured time series representation and two newly proposed machine learning algorithms for time series data, i.e., Kernelized Temporal Cut (KTC) and Dynamic Manifold Warping (DMW). KTC is a temporal extension of Hilbert space embedding of distributions [8], [9] and the kernelized two-sample test [10], [11] for online change-point detection. DMW extends previous works on spatio-temporal alignment by incorporating manifold learning. Empirical results demonstrate the superior performance of these two algorithms compared to other state-of-the-art methods on human action segmentation and recognition. The technical details of the proposed algorithms are given in sections 3, 4, 5 and 6.
2 RELATED WORK
Dynamic Manifold Model. Non-linear manifold learning and Latent Variable Modeling (LVM) have been prominent in machine learning research over the past decade [12], [13], [14]. In [15], Tensor Voting [16] is used to analyze the 1D manifold of landmark sequences, and the manifold structure is applied to 3D face tracking and expression inference. In particular, some probabilistic latent variable frameworks, i.e., GP-LVM, GPDM and their variants [17], [18], [19], focus on motion capture data and try to capture the intrinsic structure of human motion, which is further applied to 3D monocular people tracking [20].

Moreover, there are some manifold-related works on human action recognition and motion analysis. [21], [22] apply generative manifold models to several aspects of human motion analysis including pose recovery, body tracking, gait recognition and facial expression recognition. [23] utilizes manifold learning to perform motion retrieval and [24] combines ISOMAP and DTW to recognize actions from silhouettes. The focus of our approach differs from these works significantly. The problem we address is to perform online action segmentation and recognition jointly, and to recognize actions from realtime OpenNI input based on labeled Mocap sequences. This online processing of stream input and the transfer learning functionality are not the target of the above mentioned works. Besides the differences in temporal segmentation and transfer learning, our alignment step uses manifold learning to infer the latent completion variable, while [21], [22] use it to build the generative model. Their human motion generative models have advantages on pose recovery and tracking.

Temporal Segmentation. This is a multifaceted area, and several related topics in machine learning, statistics, computer vision and graphics are discussed.

-Change-Point Detection. Most of the work in statistics, i.e., offline or quickest (online) change-point detection (CD) [25], is often restricted to univariate series (1D) and parametric distribution assumptions, which do not hold for human motions with complex structure. [26] uses undirected sparse Gaussian graphical models and performs structure estimation and segmentation jointly. Recently, as a nonparametric extension of Bayesian online change-point detection (BOCD) [27], [28] is proposed to combine BOCD and Gaussian Processes (GPs) to relax the i.i.d assumption in a regime. Although GPs improve the ability to
model complex data, they also bring in high computational cost. More relevant to us, kernel methods have been applied to non-parametric change-point detection on multivariate time series [29], [30]. In particular, [29] (KCD) utilizes the one-class SVM as the online training method and [30] (KCpA) performs sequential segmentation based on the Kernel Fisher Discriminant Ratio. Unlike all the above works, KTC can detect not only action transitions but also cyclic motions.

-Temporal Clustering. Recently, as an extension of clustering [31], [32], some works focus on how to correctly temporally segment time series into different clusters. As an elegant combination of Kernel K-means and spectral clustering, Aligned Cluster Analysis (ACA) is developed for temporal clustering of facial behavior, with a multi-subject correspondence algorithm for matching facial expressions [33]. To estimate the unknown number of clusters, [34] uses the hierarchical Dirichlet process as a prior to improve the switching linear dynamical system (SLDS). Most of these works segment time series offline and provide cluster labels as in clustering. As a complementary approach, KTC performs online temporal segmentation, which is suitable for realtime applications.

-Motion Analysis. In computer vision and graphics, some works focus on grouping human motions. Unusual human activity detection is addressed in [35] using (bipartite) graph spectral clustering. [36] extracts spatio-temporal features to address event clustering on video sequences. [37] proposes a geometric-invariant temporal clustering algorithm to cluster facial expressions. More relevantly, [38] proposes an online algorithm to decompose motion sequences into distinct action segments. Their method is an elegant temporal extension of Probabilistic Principal Component Analysis for change-point detection (PPCA-CD), which is computationally efficient but restricted to (approximately) Gaussian assumptions.

Action Recognition. Inspired by the success in object recognition, low-level features like Space-Time Interest Points (STIPs) plus Histogram of Oriented Gradient (HOG) descriptors are used in many action recognition works [1], [2], [39]. Silhouette-based features are also popular [40], [41], for which good results rely on accurate foreground extraction. Some works also use tracked key points, which are quantized as feature vectors by a pre-learned or manually designed codebook [3], [42], [43]. Action recognition is a multifaceted field; our discussion focuses on view-invariant methods, and readers can refer to a recent review [44] for more details.

A Hidden Markov Model (HMM) is built on 3D joint trajectories (Mocap) to capture the dynamic information of human motion in [45]. The claimed advantage of the 3D HMM model is that the dependence on view point and illumination is removed. However, HMMs require a large amount of training data in a relatively high dimensional space (e.g. 67) and the HMM structure must be adaptively designed for specific application domains. These may be potential factors that make the recognition performance unsatisfactory, and AdaBoost is used to improve the accuracy [45]. View-independence is also addressed in [41], [4] by rendering Mocap data of various actions from multiple viewpoints, which is a time and storage consuming process. In [40], 3D models are projected onto 2D silhouettes with respect to different view points, and [5] detects 2D features first and then back-projects them to action features based on a 3D visual hull. These methods require a computationally expensive search process over model parameters to find the best match between 2D features and the 3D model. Very recently, in [46], a 3D HOG descriptor was proposed to handle view point change, and this approach requires multiple view camera settings for training data to achieve view-invariant recognition. Departing from these methods, our recognition process does not require pose rendering or parameter search. Our trajectory features are located at the body skeleton's key locations, with explicit semantic meaning, allowing our system to be directly applied to an arbitrary scene without dataset-dependent training. Recently, there are a few works [47], [48], [49] focusing on action recognition on depth or RGB plus depth (RGBD) image sequences.

Spatio-Temporal Alignment. Given two human motion sequences, an important question is to consider whether those two sequences represent the same motion, similar motions or distinct motions. This can be viewed as a (spatio-temporal) alignment problem, serving as a foundation for action recognition, clustering, etc. Canonical Correlation Analysis (CCA) [50], proposed for learning the shared subspace between two high dimensional features, has been used as the spatial matching algorithm for activity recognition from video [51] and activity correlation across cameras [52]. Video synchronization is addressed as a temporal alignment problem in [53], [54], which uses dynamic time warping (DTW) or its variants [55]. [56] uses optimization methods to maximize a similarity measure of two human action sequences, while the temporal warping is constrained by a 1D affine transformation. The same linear temporal model is also used in [57].

Very recently, as an elegant extension of CCA and DTW, Canonical Time Warping (CTW) was proposed for spatio-temporal alignment of two multivariate time series and applied to align human motion sequences between two subjects [58]. CTW is formulated as an energy minimization framework and solved by an iterative gradient descent procedure. Since the spatial and temporal transformations are coupled together, the objective function becomes non-convex and the solution is not guaranteed to be globally optimal. Under the STM model, we propose Dynamic Manifold Warping (DMW), which focuses on time series with intrinsic spatial structure and guarantees a globally optimal solution. By combining KTC and alignment approaches such as [59], [58], we can perform online action recognition for input from a 2.5D depth sensor. Unlike other works on supervised joint segmentation and recognition [60], two significant features of our approach are viewpoint independence and handling an arbitrary person with a few labeled Mocap sequences, in the transfer learning module.
Fig. 2. Online Hierarchical Temporal Segmentation. A 22-sec input sequence is temporally cut into two segments: a walking segment (S1), which is further cut into 6 action units, and a jumping segment (S2), which is further cut into 4 action units.
3 ONLINE TEMPORAL SEGMENTATION
3.1 Time Series Representation

To effectively represent human motion sequences, joint position (or joint angle) is used in this paper. In each frame, the joint positions (or joint angles) of several key points on the human skeleton are formulated as a point in a multi-dimensional space. Thus, a human motion sequence is represented as a trajectory, i.e., a structured multivariate time series which implicitly contains the human motion structure. For instance, given a length $L_x$ human action sequence (e.g. stretching), the joint-position trajectory can be represented as a matrix $X_{1:L_x} = [x_1\, x_2\, \dots\, x_{L_x}] \in \mathbb{R}^{D \times L_x}$, where $x_t$ is the joint-position vector at temporal index $t$. In 3D (Mocap), $x_t = [p_{t1}, p_{t2}, \dots, p_{tM}]^T \in \mathbb{R}^{3M \times 1}$ and $p_{ti} = (p_{ti1}, p_{ti2}, p_{ti3})$ is the coordinate of the $i$th marker in $\mathbb{R}^3$. Or, in partial 3D (e.g., tracking trajectories from a depth sensor), $x_t = [q_{t1}, q_{t2}, \dots, q_{tK}]^T \in \mathbb{R}^{3K \times 1}$ ($K \le M$), where $q_{ti} = (q_{ti1}, q_{ti2}, q_{ti3})$ is the location of the $i$th tracked point. This structured time series representation is used for both temporal segmentation and alignment.

This section describes the Kernelized Temporal Cut (KTC), a temporal application of Hilbert space embedding of distributions [8] and the kernelized two-sample test [10], [11], to sequentially estimate temporal cut points in human motion sequences (Fig. 2) [61]. It is notable that, as a kernelized learning algorithm, KTC can be applied to structured sequential data in general, such as multivariate time series and dynamic graphs.
3.2 Problem Formulation

Given a stream input $X_{1:L_x} = \{x_t\}_{t=1}^{L_x}$ ($x_t \in \mathbb{R}^{D_t}$, where $D_t$ can be fixed or change over time), the goal of temporal segmentation is to predict temporal cut points $c_i$. For instance, if a person walks and then boxes, a temporal cut point must be detected. For depth sensor data, $x_t$ is the vector representation of tracked joints. More details of $x_t$ are given in sec. 6.1. From a machine learning perspective, the estimated $\{c_i\}_{i=1}^{N_c}$ can be modeled by minimizing the following objective function,

$$L_X(\{c_i\}_{i=1}^{N_c}, N_c) = \sum_{i=1}^{N_c} I(X_{c_{i-1}:c_i-1},\, X_{c_i:c_{i+1}-1}) \qquad (1)$$
where $X_{c_i:c_{i+1}-1} \in \mathbb{R}^{D \times (c_{i+1}-c_i)}$ indicates the segment between two cut points $c_i$ and $c_{i+1}$ ($c_1 = 1$, $c_{N_c+1} = L_x + 1$). Here $I(\cdot)$ is the homogeneity function measuring the spatio-temporal consistency between two consecutive segments. It is worth noting that both $\{c_i\}_{i=1}^{N_c}$ and $N_c$ need to be estimated from eq. 1. Next, the main task is to design $I(\cdot)$ and to optimize eq. 1 online. As the counterpart, eq. 1 could be optimized offline by dynamic programming when $N_c$ is given, which is out of the scope of this paper.

3.3 KTC-S

Instead of jointly optimizing eq. 1, the proposed Kernelized Temporal Cut (KTC) sequentially optimizes $c_{i+1}$ based on $c_i$ by minimizing the following loss function,

$$L_{\{X_{c_i:c_i+T-1}\}}(c_{i+1}) = I(X_{c_i:c_{i+1}-1},\, X_{c_{i+1}:c_i+T-1}), \quad i = 1, 2, \dots, N_c - 1 \qquad (2)$$
where $c_i$ ($c_1 = 1$, $c_{N_c+1} = L_x + 1$) is provided by the previous step and $T$ is a fixed length. We refer to this sequential optimization process for eq. 2 as KTC-S, where S stands for sequential. Sequentially optimizing $L$ is actually a fixed-length sliding window process, which is also used in [30]. However, setting $T$ is a difficult task, and how to improve this process is described in sec. 3.4. Essentially, eq. 2 is a two-class temporal clustering problem for $X_{c_i:c_i+T-1} \in \mathbb{R}^{D \times T}$. The crucial factor is constructing $I(\cdot)$, which is related to temporal versions of (dis)similarity functions in spectral clustering [31], [32], [37] and information theoretic clustering [62].

To handle the complex structure of human motion, unlike previous work, KTC utilizes Hilbert space embedding of distributions (HED) to map the distribution of $X_{t_1:t_2}$ into a Reproducing Kernel Hilbert Space (RKHS). [8], [9] are seminal works on combining kernel methods and probability distribution analysis. Without going into details, the idea of using HED for temporal segmentation is straightforward. The change-point is detected by using a well-behaved (smooth) kernel function, whose values are large on samples belonging to the same spatio-temporal pattern and small on samples from different patterns. By doing this, KTC not only handles nonparametric and high dimensionality problems but also rests on a solid theoretical foundation [8].

HED. Inspired by [9], probability distributions can be embedded in an RKHS. At the center of the Hilbert space embedding of distributions are the mean mapping functions,

$$\mu(P_x) = E_x(k(x, \cdot)), \qquad \mu(X) = \frac{1}{T}\sum_{t=1}^{T} k(x_t, \cdot) \qquad (3)$$

where $\{x_t\}_{t=1}^{T}$ are assumed to be i.i.d. sampled from the distribution $P_x$. Under mild conditions, $\mu(P_x)$ (same for $\mu(X)$) is an element of the Hilbert space, satisfying

$$\langle \mu(P_x), f \rangle = E_x(f(x)), \qquad \langle \mu(X), f \rangle = \frac{1}{T}\sum_{t=1}^{T} f(x_t)$$

Mappings $\mu(P_x)$ and $\mu(X)$ are attractive because:
Theorem 1. If the kernel $k$ is universal, then the mean map $P_x \mapsto \mu(P_x)$ is injective. [9]

This theorem states that distributions of $x \in \mathbb{R}^D$ have a one-to-one correspondence with mappings $\mu(P_x)$. Thus, for two distributions $P_x$ and $P_y$, we can use the function norm $\|\mu(P_x) - \mu(P_y)\|$ to quantitatively measure the difference (denoted as $D(P_x, P_y)$) between these two distributions. Moreover, we do not need to access the actual distributions but rather finite samples to calculate $D(P_x, P_y)$ because:

Theorem 2. Assume that $\|f\|_\infty \le C$ for all $f \in \mathcal{H}$ with $\|f\|_{\mathcal{H}} \le 1$. Then with probability at least $1 - \delta$, $\|\mu(P_x) - \mu(X)\| \le 2R_T(\mathcal{H}, P_x) + C\sqrt{T^{-1}\log(1/\delta)}$. [9]

As long as the Rademacher average is well behaved, finite samples yield an error that converges to zero; thus they empirically approximate $\mu(P_x)$. Therefore, $D(P_x, P_y)$ can be precisely approximated by using the finite sample estimation $\|\mu(X) - \mu(Y)\|$.

Thanks to the above facts, we use HED to construct $I_{KTC}(X_{1:T_1}, Y_{1:T_2})$, which measures the consistency between the distributions of two segments as follows,

$$I_{KTC}(X_{1:T_1}, Y_{1:T_2}) = \frac{2}{T_1 T_2}\sum_{i,j} k(x_i, y_j) - \frac{1}{T_1^2}\sum_{i,j} k(x_i, x_j) - \frac{1}{T_2^2}\sum_{i,j} k(y_i, y_j) \qquad (4)$$

Combining eq. 2 and eq. 4, $c_{i+1}$ is estimated by minimizing the following function in matrix formulation:

$$L_{\{X_{c_i:c_i+T-1}\}}(c_{i+1}) = -(E_T^{\Delta c_i})^T K^{KTC}_{c_i:c_i+T-1}\, E_T^{\Delta c_i}, \qquad E_T^{\Delta c_i} = \frac{e_{1:\Delta c_i}}{\Delta c_i} - \frac{e_{\Delta c_i+1:T}}{d_i} \qquad (5)$$

where $\Delta c_i$ and $d_i$ are short notations for $c_{i+1} - c_i$ and $c_i + T - c_{i+1}$. $e_{t_1:t_2} \in \mathbb{R}^{T \times 1}$ is a binary vector with 1 for positions from $t_1$ to $t_2$ and 0 for others. $K^{KTC}_{c_i:c_i+T-1} \in \mathbb{R}^{T \times T}$ is the kernel matrix based on the kernel function $k_{KTC}(\cdot)$.

Kernel. The success of kernel methods largely depends on the choice of the kernel function [8]. As mentioned before, the difficulty of human motion is that both spatial and temporal structures are important. Thus, we propose a novel spatio-temporal kernel $k_{KTC}(\cdot)$ as follows,

$$k_{KTC}(x_i, x_j) = k_S(x_i, x_j)\, k_T(x_i, x_j) = k_S(x_i, x_j)\, k_T(\hat{\theta}(x_i), \hat{\theta}(x_j)) \qquad (6)$$

where $k_S(\cdot)$ is the spatial kernel and $k_T(\cdot)$ is the temporal kernel. $\hat{\theta}(x)$ is the estimated local tangent space at point $x$. $k_S(\cdot)$ and $k_T(\cdot)$ can be chosen according to domain knowledge or as universal kernels such as the Gaussian. For instance, the canonical correlation analysis (CCA) kernel [50] is used for joint-position features as,

$$k_S^{CCA}(x_i, x_j) = \exp(-\lambda_S\, d_{CCA}(x_i, x_j)^2) \qquad (7)$$

where $d_{CCA}(\cdot)$ is the CCA metric based on the $M \times 3$ matrix representation of $x \in \mathbb{R}^{3M \times 1}$ (M is the number of 3D joints from Mocap or the depth sensor). Or, in general, we set them
as,

$$k_S(x_i, x_j) = \exp(-\lambda_S \|x_i - x_j\|^2), \qquad k_T(\hat{\theta}(x_i), \hat{\theta}(x_j)) = \exp(-\lambda_T\, \Theta(\hat{\theta}(x_i), \hat{\theta}(x_j))^2) \qquad (8)$$

where $\lambda_S$ is the kernel parameter for $k_S(\cdot)$ and $\lambda_T$ is the kernel parameter for $k_T(\cdot)$. $\Theta(\cdot)$ denotes the principal angle between two subspaces (ranging from 0 to $\pi/2$).

In short, the spatio-temporal kernel $k_{KTC}$ captures both the spatial and temporal distributions of the data (a visual example is in Fig. 3), which is suitable to model structured sequential data. As special cases, $k_{KTC}$ degenerates to the spatial kernel if $\lambda_T \to 0$ and to the temporal kernel if $\lambda_S \to 0$.

Optimization. Unlike the NP-hard optimization in spectral clustering [32], eq. 5 can be efficiently solved because the feasible region of $c_{i+1}$ is $[c_i + 1, c_i + T - 1]$, allowing us to search the entire space to minimize $L(c_{i+1})$. For each step, minimizing eq. 5 has complexity at most $O(T^2)$ to access $k_{KTC}(\cdot)$.
3.4 KTC-R

Sequentially optimizing eq. 1 was described in sec. 3.3. However, this process may not be suitable for realtime applications. A key feature of human motion is temporal variation, i.e., one action can last for a long time or only a few seconds. Thus, it is difficult to use a fixed-length-$T$ sliding window to capture transitions. Small values of $T$ cause over-segmentation and large values of $T$ cause large delays ($T = 300$ for the depth sensor results in a 10 sec delay). To overcome this problem, we combine the incremental sliding window strategy [38] and the two-sample test [10], [11] to design a realtime algorithm for eq. 5, i.e., KTC-R (Fig. 3).

Given $X_{1:L_x} = [x_1, \dots, x_{L_x}] \in \mathbb{R}^{D \times L_x}$, KTC-R sequentially processes the varying-length window $X_t = [x_{n_t}, \dots, x_{n_t+T_t}]$ at step $t$. This process starts from $n_1 = 1$ and $T_1 = 2T_0$, where $T_0$ is the pre-defined shortest possible action length. At step $t$ (assume the last cut is $c_i$), if no action transition point is captured, the following update is performed,

$$n_{t+1} = n_t, \quad T_{t+1} = T_t + \Delta T \qquad (9)$$

else, if there is a transition point,

$$c_{i+1} = n_t + T_t - T_0, \quad n_{t+1} = c_{i+1}, \quad T_{t+1} = T_1 \qquad (10)$$

where $\Delta T$ is the step length for growing the window. This process ends when $n_t \ge L_x - T_0$. As shown in eq. 9 and eq. 10, $X_{1:L_x}$ is sequentially processed, and each cut $c_i$ is estimated when the algorithm receives the $(c_i + T_0 - 1)$th frame (same for non-cut frames). This fact indicates that KTC-R has a fixed-length time delay $T_0$, as shown in Fig. 3.

At each step, deciding on a cut (at frame $n_t + T_t - T_0$) is equivalent to the following hypothesis test,

$$H_0: \{x_i\}_{i=n_t}^{\bar{n}_t-1} \text{ and } \{x_i\}_{i=\bar{n}_t}^{n_t+T_t-1} \text{ are from the same distribution}; \qquad H_A: \text{not } H_0 \qquad (11)$$

where $\bar{n}_t$ is the short notation for $n_t + T_t - T_0$. Eq. 11 is
Fig. 3. An illustration of KTC-R. Left: tracked joints ($X_{1:190} \in \mathbb{R}^{45 \times 190}$) from the depth sensor; right: $K^{KTC}_{1:190} \in \mathbb{R}^{190 \times 190}$ for the window $X_{1:190}$ (Human denotes the human-annotated ground-truth). The decision to make no cut between frames 1 to 110 is made before the current window, with a maximum delay of $T_0 = 80$ frames.
re-written by combining eq. 5 as follows,

$$\mathcal{L}_t = -(E_{T_t}^{\bar{n}_t - n_t})^T K^{KTC}_{n_t:n_t+T_t-1}\, E_{T_t}^{\bar{n}_t - n_t}; \qquad H_0: \mathcal{L}_t \ge \epsilon_t: \bar{n}_t \text{ is not a cut}; \quad H_A: \mathcal{L}_t < \epsilon_t: \bar{n}_t \text{ is a cut} \qquad (12)$$

where $\epsilon_t$ is the adaptive threshold for the hypothesis test 12. In fact, eq. 12 is directly inspired by [10], which proposes a kernelized two-sample test method. $\mathcal{L}_t$ is analogous to the negative square of the empirical estimate of Maximum Mean Discrepancy (MMD), which has the following formulation,

$$MMD[\mathcal{F}, X_{1:T_1}, Y_{1:T_2}] = \Big( \frac{1}{T_1^2}\sum_{i,j=1}^{T_1} k(x_i, x_j) - \frac{2}{T_1 T_2}\sum_{i,j=1}^{T_1,T_2} k(x_i, y_j) + \frac{1}{T_2^2}\sum_{i,j=1}^{T_2} k(y_i, y_j) \Big)^{1/2} \qquad (13)$$

where $\mathcal{F}$ is a unit ball in a universal RKHS $\mathcal{H}$, and $\{x_i\}_{i=1}^{T_1}$ and $\{y_j\}_{j=1}^{T_2}$ are i.i.d. samples from distributions $P_x$ and $P_y$. It can be shown that,

$$\lim_{\lambda_T \to 0} \mathcal{L}_t = -MMD[\mathcal{F}, X_{n_t:\bar{n}_t-1}, X_{\bar{n}_t:n_t+T_t-1}]^2 \qquad (14)$$

if the same kernel as in MMD is used as the spatial kernel in $k_{KTC}(\cdot)$ (eq. 6) and $k_T(\cdot)$ degenerates to 1 as $\lambda_T \to 0$. Based on eq. 14, $\epsilon_t$ is set as $B_R(t) + \epsilon$, where $B_R(t)$ is an adaptive threshold calculated from the Rademacher bound [10], and $\epsilon$ is a fixed global threshold which is the only non-trivial parameter in KTC-R (used to control the coarse-fine level of segmentation).

Analysis. In summary, both KTC-S and KTC-R are based on eq. 5. The main differences are: KTC-S performs segmentation by sequential optimization in a two-class temporal clustering way, while KTC-R performs segmentation using an incremental sliding window in a two-sample test way. KTC-R requires more sliding windows than KTC-S, but for each one there is no optimization, and accessing $k_{KTC}(\cdot)$ $O(T_t \Delta T)$ times is enough (linear in $T_t$). Only when a new cut is detected is $O(T_t^2)$ accessing required. Thus, KTC-R is extremely efficient and suitable for realtime applications. It is notable that, even if the fixed-length sliding-window method (sec. 3.3) is improved to make the decision whether a cut happens or not within $X_{c_i:c_i+T-1}$, a small $T$ is still not reliable for realtime applications. The reason is that a clear temporal cut for human motion requires a large number of observations before and after the cut. Indeed, the required number of frames varies from action to action, even for manual annotation.
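Here is a minimal sketch of the KTC-R loop (eqs. 9-12). It reuses ktc_kernel from the previous sketch; the fixed threshold eps is a stand-in for the adaptive $\epsilon_t$ derived from the Rademacher bound, and all parameter values are illustrative assumptions.

```python
import numpy as np

def ktc_r(X, kernel, T0=80, dT=10, eps=-0.05):
    """KTC-R sketch: incremental sliding-window change-point detection.

    X: (D x L) sequence; kernel: window -> kernel matrix (e.g., ktc_kernel);
    T0: shortest action length (also the fixed delay); eps: fixed global
    threshold standing in for the adaptive eps_t. Returns 0-based cut indices.
    """
    D, L = X.shape
    cuts, n, T = [], 0, 2 * T0
    while n + T0 < L:
        W = X[:, n:min(n + T, L)]
        Tt = W.shape[1]
        c = Tt - T0                       # candidate cut inside the window
        if c < 2:
            break
        K = kernel(W)
        E = np.concatenate([np.full(c, 1.0 / c),
                            np.full(Tt - c, -1.0 / (Tt - c))])
        Lt = -E @ K @ E                   # statistic of eq. 12
        if Lt < eps:                      # reject H0: declare a cut (eq. 10)
            cuts.append(n + c)
            n, T = n + c, 2 * T0
        elif n + T >= L:                  # window already covers the tail
            break
        else:
            T += dT                       # grow the window (eq. 9)
    return cuts
```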
3.5 Online Hierarchical Temporal Segmentation

Besides estimating $\{c_i\}$, decomposing an action segment $X_{c_i:c_{i+1}-1}$ into an unknown number of action units (e.g., three walking cycles), if cyclic motion exists, is also needed [63]. This is not only helpful for understanding motion sequences, but also for other applications such as recognition and indexing. Thus, an online cyclic structure segmentation algorithm, i.e., Kernelized Alignment Cut (KAC), is proposed as a generalization of kernel embedding of distributions and temporal alignment [58], [33]. By combining KAC and KTC-R, we get the two-layer segmentation algorithm KTC-H, where H stands for hierarchical. Action unit segmentation is difficult for non-periodic motions (e.g., jumping), which are actions that are usually performed once locally. However, people can still perform two consecutive non-periodic motions, and these two motions are not identical because of intra-person variations, which brings challenges for KAC.

KAC. As an online algorithm, KAC utilizes the sliding window strategy. Each window $X_{a_j+n_t-T_m:a_j+n_t-1}$ is sequentially processed, starting from $n_1 = 2T_m$, $a_1 = c_i$, where $a_j$ is the $j$th action unit cut. $T_m$ is a parameter giving the minimal length of one action unit. We empirically find that the results are insensitive to $T_m$.

For each window $X_{a_j+n_t-T_m:a_j+n_t-1}$, this process has two branches. Either the last action unit continues: $n_{t+1} = n_t + \Delta T_m$; or there is a new action unit: $a_{j+1} = a_j + n_t - T_m$, $n_{t+1} = 2T_m$. Here $\Delta T_m$ is the step length. This process ends when a new cut point $c_{i+1}$ is received. Deciding whether $X_{a_j+n_t-T_m:a_j+n_t-1}$ is the start of a new unit or not can be formulated as,

$$S_t = S_{Align}(X_{a_j:a_j+T_m-1},\, X_{a_j+n_t-T_m:a_j+n_t-1}); \qquad H_0: S_t \le \eta_t: a_j + n_t - T_m \text{ is a new unit}; \quad H_A: S_t > \eta_t: \text{not } H_0 \qquad (15)$$

where $S_{Align}(\cdot)$ is the metric measuring the structure similarity between $X_{a_j:a_j+T_m-1}$ and $X_{a_j+n_t-T_m:a_j+n_t-1}$, to handle intra-person variations. $\eta_t$ is an adaptive threshold (empirically set by cross-validation) and ideally should be
close to zero if alignment can perfectly leverage variations. Similar to KTC-R, KAC has delay $T_m$. In particular, KAC uses dynamic time warping (DTW) [58], [33] to design $S_{KAC}(\cdot)$ by minimizing the following loss function based on the kernel from eq. 6,

$$S_{KAC}\big(K_{a_j:a_j+T_m-1}^{a_j+n_t-T_m:a_j+n_t-1};\ W_1, W_2\big) \qquad (16)$$

where $K$ is the cross-kernel matrix for the two segments, and $W_1$ and $W_2$ are binary temporal warping matrices encoding the temporal alignment path as shown in [58]. Interested readers are referred to [58], [33] for more details about $S(\cdot)$. Eq. 16 can be optimized by using dynamic programming with complexity $O(T_m^2)$, and $S_{KAC}(\cdot)$ measures the similarity between the current action unit (a part) and the current window. Importantly, alignment methods such as DTW are not suitable for eq. 12. This is because alignment requires two segments to have roughly the same starting and ending points, which does not hold in eq. 12.

KTC-H. By combining KTC-R and KAC, we can sequentially and simultaneously capture action transitions (cuts) and action units, in the integrated algorithm KTC-H. Formally, KTC-H uses a two-layer sliding window strategy, i.e., the outer loop (sec. 3.4) to estimate $c_i$ and the inner loop to estimate $a_j$ between $c_i$ and the current frame from the outer loop. Since KTC-R (eq. 12) and KAC (eq. 15) both have fixed delay ($T_0$ and $T_m$), KTC-H is suitable for realtime transition and action unit segmentation.

Discussion. We compare with several related algorithms: (1) Spectral clustering [32] can be extended to temporal clustering if only temporal cuts are allowed (TSC) [37]. Similarly, minimizing eq. 5 can be viewed as an instance of TSC motivated by embedding distributions into an RKHS. (2) PPCA-CD is proposed in [38] to model motion segments by Gaussian models, where CD stands for change-point detection. Compared to [38], KTC has higher computational cost but gains the ability to handle nonparametric distributions. (3) KTC is similar to KCpA [30], which uses the novel kernel Fisher discriminant ratio. Compared to [30], KTC performs change-point detection by using the incremental sliding window. More importantly, KTC detects both change-points and cyclic structures. This is crucial for online recognition, making it possible to recognize an action after only one unit instead of the whole action. (4) As an elegant extension of Kernel K-means and spectral clustering, ACA is proposed in [33] for offline temporal clustering. KTC can be viewed as an online complementary approach to [33]. (5) Differing from the online two one-class SVM strategy KCD of [29], the hypothesis testing in KTC has null distributions thanks to the embedding of distributions [9].
4 SPATIO-TEMPORAL MANIFOLD MODEL
As given in sec. 3.1, we use the structured time series $x_t$ to represent human motion sequences. Although $x_t$ lies in a high dimensional space, the natural properties of human pose suggest $x_t$ has a lower intrinsic degree of freedom. Suppose there is a $d$-dimensional submanifold $\mathcal{M}$ embedded in an ambient space of dimensionality $D \gg d$. We use a latent variable model (LVM) to represent $\mathcal{M}$ as a mapping between the intrinsic space and the ambient space: $f: \mathbb{R}^d \to \mathbb{R}^D$ and $x = f(\tau) + \varepsilon$, where $x \in \mathbb{R}^D$ is the observation variable, $\tau \in \mathbb{R}^d$ is the latent variable and $\varepsilon \in \mathbb{R}^D$ is the noise. In computer vision applications, the mapping function $f$ is often highly non-linear, and the ambient space is the spatial (feature) space, so $\mathcal{M}$ is also called a spatial manifold. To incorporate the temporal dimension into the standard LVM to model human motion time series, we propose a novel framework as follows.

Definition: a spatio-temporal manifold (STM) is a directed traversing path $\mathcal{M}_p$ (with boundary or compact) on a spatial manifold $\mathcal{M}$, further embedded in $\mathbb{R}^D$.

A traversing path $\mathcal{M}_p$ can be intuitively thought of as a point walking on $\mathcal{M}$ from a starting point $(\tau_{start}, x_{start})$ at time $t_1$ to an ending point $(\tau_{end}, x_{end})$ at time $t_2$. A path is not just a subset of $\mathcal{M}$ which looks like a curve; it also includes a natural parametrization, $g: [0\ 1] \to \mathcal{M}$, s.t. $g(0) = \tau_{start}$ and $g(1) = \tau_{end}$. So, a new latent variable $\lambda \in [0\ 1]$ is associated with every point on this path. Furthermore, the relationship between $\lambda$ and the temporal index $t$ can be modeled as a time series $p: [t_1\ t_2] \to [0\ 1]$, s.t. $p(t_1) = 0$ and $p(t_2) = 1$. Since $\mathcal{M}$ is embedded in $\mathbb{R}^D$ by $f(\cdot)$, essentially the traversing path (with noise) can be described as a non-linear multivariate time series $x(t) = f(g(p(t))) + \varepsilon$.

Under this definition, the structured representation $X_{1:L_x}$ of a human motion sequence is just a sequence of sampled observations on an STM. Here, the ambient space is the joint-position space, the manifold $\mathcal{M}$ is the human pose space, and $\mathcal{M}_p$ is a specific type of human action. The newly introduced variable $\lambda$ is assigned a semantic meaning which indicates the completion degree of an action. For an action sequence including only one action unit, we assume the starting point of the action has $\lambda = 0$ and the ending point has $\lambda = 1$ (for periodic motion, e.g., walking, this defines a motion cycle). Inferring $\lambda$ from $X_{1:L_x}$ is important for temporal alignment in our approach, which is given in sec. 5.1. It is notable that this 1D representation $\lambda$ is mainly used for temporal alignment in sec. 5.2, and multivariate time series are used in the other steps such as spatial matching and temporal segmentation.
5 SPATIO-TEMPORAL ALIGNMENT
Given two human action segments $X_{1:L_x} \in \mathbb{R}^{D_x \times L_x}$ (on $\mathcal{M}_p^x$) and $Y_{1:L_y} \in \mathbb{R}^{D_y \times L_y}$ (on $\mathcal{M}_p^y$), both one-action-unit sequences after temporal segmentation, we need to calculate the motion distance score $S(X_{1:L_x}, Y_{1:L_y})$ after proper spatial and temporal alignment. The problem is inherently challenging because of the large spatial/temporal scale differences between human actions, the ambiguity between human poses, as well as the inter/intra subject variability [58]. We model motion sequence matching as a spatio-temporal alignment problem under the STM framework, and incorporate manifold learning, spatial alignment and temporal alignment together, resulting in Dynamic Manifold Warping (DMW) [59].
5.1 Structure Learning

An important module of the proposed spatio-temporal alignment is structure learning. Given $\{x_t\}_{t=1}^{L}$ as $L$ ordered data points sampled from an STM, the goal of structure learning is to recover the latent completion variable $\lambda_t$ from those samples. Note that our goal is different from most latent variable models, which aim to identify $\tau$ [12], [13] and sometimes $f(\cdot)$ [17], [18], [19].

Estimating $d_{Geo}(\cdot)$. We use Tensor Voting to calculate the minimum traversing distance between $x_s$ and $x_{s+1}$ ($1 \le s \le L-1$) to approximate the geodesic distance $d_{Geo}(\cdot)$. Tensor Voting is a non-parametric framework proposed to estimate the geometric information of manifolds, as well as the intrinsic dimensionality [16]. Let $x_s(0) = x_s$; we have

$$d_{Geo}(x_s, x_{s+1}; \mathcal{M}_p) \approx \sum_{r=0}^{R} \|x_s(r) - x_s(r+1)\|_{L_2} \qquad (17)$$

where $x_s(r+1)$ is updated from the current point by the first order Taylor expansion at $x_s(r)$,

$$x_s(r+1) = x_s(r) + \alpha\, J(x_s(r)) J(x_s(r))^T (x_{s+1} - x_s(r))$$

until $x_s(r+1)$ converges to $x_{s+1}$. $\alpha$ is a step length, and $J(x_s(r))$ is the tangent space estimated at $x_s(r)$ by Tensor Voting (local PCA [64] can also be used). [15] uses Tensor Voting to estimate the manifold structure for 3D face tracking in 126D space, while the temporal index is not explicitly considered. Our algorithm is a revised version of [15] under the STM framework.

Learning $\lambda_t$. A two stage approach is possible: first estimate $\tau$ (or $f(\cdot)$) on a collection of time series, and then optimize $\{\lambda_{1:L}\}$. Nevertheless, we propose a solution which performs direct estimation for an individual sequence based on the learnt geodesic distance,

$$\lambda_t = \frac{\sum_{s=1}^{t-1} d_{Geo}(x_s, x_{s+1}; \mathcal{M}_p)}{\sum_{s=1}^{L-1} d_{Geo}(x_s, x_{s+1}; \mathcal{M}_p)} \qquad (18)$$

Since the traversing path is continuous and smooth, the global geodesic distance is approximately decomposed into the sum of local distances, inspired by ISOMAP [12].
Fig. 4. An illustration of the non-linearity of $\lambda(t)$. Top: action stretching (Mocap), 6 samples uniformly distributed in 368 frames; bottom: the estimated latent completion variable. The whole action is decomposed into 5 stages.
Fig. 4 illustrates the latent completion variable learning results for a stretching sequence within one action unit, i.e., from the action's start to its end. We use the CMU Mocap data [65] in this experiment, and M = 15 key points are used to represent the human body, resulting in joint 3D trajectories in 45D. These 15 key points are extracted from the amc and asf files by our joint-angle to joint-position transfer algorithm. We uniformly divide the sequence into 5 stages along the time index. The dynamic variations in stages 2 and 4 are larger than the others; these two correspond to stretching and folding the arms. Stage 3 has the smallest variation, because it corresponds to the peak state of a stretch, i.e., there is almost no arm movement.

5.2 Temporal Alignment

The temporal alignment part of DMW is called Dynamic Manifold Temporal Warping (DMTW). DMTW is the combination of manifold learning and Dynamic Time Warping (DTW), and can be applied to any temporal data with latent spatial structure.

Formulation. Given two time series $X_{1:L_x} \in \mathbb{R}^{D_x \times L_x}$ and $Y_{1:L_y} \in \mathbb{R}^{D_y \times L_y}$, find the optimal alignment path $Q = [q_1, q_2, \dots, q_L] \in \mathbb{R}^{2 \times L}$ by minimizing the following loss function ($\|\cdot\|_F$ is the Frobenius norm),

$$L_{DMTW}(F_x(\cdot), F_y(\cdot), W_x, W_y) = \|F_x(X_{1:L_x}) W_x^T - F_y(Y_{1:L_y}) W_y^T\|_F^2 \qquad (19)$$

where $W_x = \{w^x_{t,t'}\} \in \{0,1\}^{L \times L_x}$ and $W_y = \{w^y_{t,t'}\} \in \{0,1\}^{L \times L_y}$ are binary selection matrices encoding the temporal alignment path $Q$ [58]. $w^x_{t,t_x} = w^y_{t,t_y} = 1$ is equivalent to $q_t = [t_x\ t_y]^T$, which means $x_{t_x}$ corresponds to $y_{t_y}$ at step $t$ in the alignment path. $F(\cdot)$ maps $X_{1:L_x}$ and $Y_{1:L_y}$ to a shared subspace with the same dimensionality. Essentially, $F_x(\cdot)$ and $F_y(\cdot)$ are spatial mapping functions and $W_x$ and $W_y$ are temporal warping matrices.

If $F(\cdot)$ is an identity function, then $L_{DMTW}$ reduces to $\|X_{1:L_x} W_x^T - Y_{1:L_y} W_y^T\|_F^2$, which is equivalent to performing standard DTW directly on $X_{1:L_x}$ and $Y_{1:L_y}$. Unlike the alternating iterative algorithm to optimize $L_{DMTW}$, i.e., optimize $W$ with fixed $F$ and then optimize $F$ with fixed $W$, we propose a two-step approach without the iterative computation. Instead of optimizing $F_x$, $F_y$ in eq. 19, we directly estimate them under the STM framework.

Step 1. Under the STM model in section 4, we choose $F_x(X_{1:L_x})$ to be $\lambda^x_{1:L_x} \in \mathbb{R}^{1 \times L_x}$ and $F_y(Y_{1:L_y})$ to be $\lambda^y_{1:L_y} \in \mathbb{R}^{1 \times L_y}$. $\lambda_t$ represents the universal structure for all STMs, making it possible to align two sequences with different actions. If the sequence is training data (i.e. Mocap), then the methods in sec. 5.1 can be used. Otherwise, instead of performing the variable-length path estimation, we can directly estimate $d_{Geo}(\cdot)$ by using a fixed-length (i.e., 1 or 2) traversing path, without re-performing Tensor Voting at each step. After learning $d_{Geo}(\cdot)$ and combining with eq. 18, we can obtain the estimated results for $\lambda^x_{1:L_x}$ and $\lambda^y_{1:L_y}$, denoted as $\hat{\lambda}^x \in \mathbb{R}^{1 \times L_x}$ and $\hat{\lambda}^y \in \mathbb{R}^{1 \times L_y}$.
Step 2. Replacing $F_x(\cdot)$ and $F_y(\cdot)$ with $\hat{\lambda}^x$ and $\hat{\lambda}^y$ in eq. 19, $L_{DMTW}$ reduces to the following formulation,

$$L_{DMTW}(W_x, W_y) = \|\hat{\lambda}^x W_x^T - \hat{\lambda}^y W_y^T\|_F^2 \qquad (20)$$

This is equivalent to performing DTW in the transform domain, i.e., on $\hat{\lambda}^x$ and $\hat{\lambda}^y$. The temporal aligning matrix $A = \{a_{t_x,t_y}\}$ is defined as $a_{t_x,t_y} = (\hat{\lambda}^x_{t_x} - \hat{\lambda}^y_{t_y})^2$, which is a compact representation of $\hat{\lambda}^x$ and $\hat{\lambda}^y$. Optimizing eq. 20 results in a variable length path (varying from $\max(L_x, L_y)$ to $L_x + L_y - 1$), which is not proper for a similarity metric. Thus, referenced DTW is proposed to fix the path length by setting one warping matrix to be the identity,

$$L_{DMTW}(W_y) = \|\hat{\lambda}^x I_{L_x} - \hat{\lambda}^y W_y^T\|_F^2 \qquad (21)$$

where $I_{L_x}$ is an identity matrix. $X_{1:L_x}$ is chosen as the reference sequence, and $Y_{1:L_y}$ is aligned to $X_{1:L_x}$ by the warping matrix $W_y \in \mathbb{R}^{L_x \times L_y}$. The path $Q$ in eq. 21 has fixed length $L_x$. Since $\hat{\lambda}^x$ and $\hat{\lambda}^y$ are monotonically increasing sequences, dynamic programming provides an extremely efficient solution ($O(L_x L_y)$) to optimize $Q$ ($W_y$).
Fig. 5. Temporal Alignment Results. DMTW is compared with DTW and CTW. The reference sequence is shown in the first row, followed by the aligned results. 2 red arrows indicate 2 key states in the reference sequence, i.e., the peaks of the first and the second boxing. The aligned sequence also has 2 red arrows, indicating the peaks of the first and the second jump. DMTW is able to align the two peak states in the jumping sequence to the peak states in the boxing sequence very well.

Results. The proposed DMTW algorithm (eq. 19) is compared with other state-of-the-art algorithms. In particular, Dynamic Time Warping (DTW) [53], [55] is chosen as the baseline algorithm and Canonical Time Warping (CTW) [58] is chosen as the alternative method. To make the comparison clearer, the sequences may include more than one action unit. Fig. 5 shows the visual comparison for two motion sequences, one boxing (twice) and the other side jumping (twice). DTW does not consider the spatial transformation, making it difficult to align two motion sequences by two people. CTW significantly outperforms DTW. Our DMTW gets the best results among the three methods. It is notable that our temporal alignment step does not involve spatial matching (unlike CTW). More visual comparison results are not provided due to lack of space; DMTW gets similar performance in all experiments.

While the objective function of DMTW (eq. 19) is inspired by CTW [58], key differences exist. CTW uses a linear $F(\cdot)$, and its optimization process may lead to local extrema since the objective function is non-convex. In DMTW, $F(\cdot)$ is chosen as the non-linear mapping to the latent completion variable, which can guarantee a global solution. It is notable that CTW does not need the smooth manifold assumption, and thus has more general applications than DMTW, while DMTW focuses on time series with intrinsic manifold structure. DMTW is also related to Profile Models [66]. Although the ideas of the Profile and $\lambda_t$ seem similar, they differ in many aspects. In particular, Profile Models need multiple training examples and the size of the discrete Profile space increases exponentially with the precision requirement, which is not only computationally impractical but also causes over-fitting. In contrast, DMTW does not need a training stage, and $\lambda_t$ is continuous in nature.

5.3 Temporally Local Spatial Alignment

After temporal alignment, spatial alignment is performed to leverage the subjects' variability, i.e., body-skeleton scale variations between different people, or viewpoint variations. In particular, we propose Dynamic Manifold Spatial Warping (DMSW), which has the following framework,

$$D_{DMSW}(X_{t_1:t_2}, Y_{t_1:t_2}) = \|V_x(U(X_{t_1:t_2})) - V_y(U(Y_{t_1:t_2}))\|_F^2 \qquad (22)$$
where $X_{t_1:t_2} \in \mathbb{R}^{D_x \times (t_2-t_1+1)}$ are the consecutive frame features $x_{t_1}$ to $x_{t_2}$ in the reference sequence, and $Y_{t_1:t_2} \in \mathbb{R}^{D_y \times (t_2-t_1+1)}$ are the temporally corresponding samples in the aligned sequence $Y_{1:L_x}$. $V_x(\cdot)$ is the spatial alignment function (same for $V_y(\cdot)$) and $U(\cdot)$ is the pre-defined feature extraction function. Spatial alignment is restricted to temporally local (from $t_1$ to $t_2$) segments, since global matching on entire sequences is often not accurate due to non-linear variations. How to set $V(\cdot)$ is explained in the following part, and $U(\cdot)$ will be discussed in eq. 25.

Denoting the features extracted by $U(\cdot)$ as two zero-mean feature sets, $U_x \in \mathbb{R}^{d_1 \times n}$ and $U_y \in \mathbb{R}^{d_2 \times n}$, we consider an unsupervised learning approach, i.e., Canonical Correlation Analysis (CCA), in which a pair of linear alignment matrices is optimized in the sense of maximizing the correlation $E(\cdot)$ of the transformed features as follows,

$$E(V_x, V_y) = Tr\big(V_x^T U_x (V_y^T U_y)^T\big) \quad s.t.\ V_x^T U_x U_x^T V_x = V_y^T U_y U_y^T V_y = I_d \qquad (23)$$

where $V_x \in \mathbb{R}^{d_1 \times d}$ and $V_y \in \mathbb{R}^{d_2 \times d}$ are two linear spatial alignment matrices for $U_x$ and $U_y$, and $I_d$ is the identity matrix of size $d \times d$. $Tr(\cdot)$ is the trace operator. Maximizing this objective function is equivalent to solving a generalized eigenvalue problem [50].
The metric can be induced in the transform domain as,

$$D_{DMSW}(X_{t_1:t_2}, Y_{t_1:t_2}) = \|V_x^T U_x - V_y^T U_y\|_F^2 \qquad (24)$$

where $V_x \in \mathbb{R}^{d_1 \times d}$ and $V_y \in \mathbb{R}^{d_2 \times d}$ are the solutions of eq. 23. Eq. 24 can handle two feature sets with different dimensionalities, making alignment between 2D and 3D input possible. Both the DMSW and CTW algorithms use CCA, but key differences exist. Spatial alignment in DMSW is restricted to temporally local manifolds, since global linear matching on entire sequences (CTW) is often not accurate due to non-linear variations. But this global matching is not necessarily a disadvantage: CTW can provide dimension reduction results, which is useful in some applications. In short, DMW (DMTW and DMSW) extends previous works in two ways: (i) it combines temporal alignment with manifold learning, and (ii) it allows local CCA to be performed on locally temporally-aligned segments.

Based on the proposed DMTW for temporal alignment and DMSW for spatial alignment, we further propose two types of motion distance functions by choosing two feature extraction functions $U(\cdot)$. In particular, instead of treating $x_t \in \mathbb{R}^{D_x \times 1}$ (or $y_t$) as a multi-dimensional vector, the implicit structure of the joint-position space is considered. Recall that $x_t = [p_{t1}, \dots, p_{tM}]^T \in \mathbb{R}^{3M \times 1}$; the 3D Euclidean space is implicitly embedded in the joint-position space $\mathbb{R}^{3M}$. Thus, we reformulate $x_t$ as,

$$N_t = \begin{bmatrix} p_{11} & \dots & p_{M1} \\ p_{12} & \dots & p_{M2} \\ p_{13} & \dots & p_{M3} \end{bmatrix} \in \mathbb{R}^{3 \times M} \qquad (25)$$

which turns out to be $M$ samples in $\mathbb{R}^3$ (a similar operation maps $x_t \in \mathbb{R}^{3K}$ to $N_t \in \mathbb{R}^{3 \times K}$, $K \le M$, as used in sec. 6.1). This operation is defined as $T_3: \mathbb{R}^{3M} \to \mathbb{R}^{3 \times M}$ (same for $\mathbb{R}^{3K} \to \mathbb{R}^{3 \times K}$). It is notable that this operation can also be performed for joints in 2D, as $T_2: \mathbb{R}^{2K} \to \mathbb{R}^{2 \times K}$. Thus, we can align 2KD video tracks (with noise) with 3MD Mocap sequences, which is not addressed by previous works.

The first feature extraction function is chosen as $U_1(x_t) = T(x_t)$, which is the static pose feature (joint position in matrix formulation). The second one is $U_2(x_t, x_{t+1}) = T(x_t) - T(x_{t+1})$, which is the motion pose feature between two consecutive frames. Thus, the final similarity score $S_1(X_{1:L_x}, Y_{1:L_y})$ given by the static features is as follows,

$$S_1 = \sum_{t=1}^{L_x} D_{DMSW}(T(x_t), T(y_{\hat{t}})) \qquad (26)$$

where $y_{\hat{t}}$ is the temporally corresponding frame estimated by eq. 19. The similarity score $S_2(X_{1:L_x}, Y_{1:L_y})$ given by the motion features is as follows,

$$S_2 = \sum_{t=1}^{L_x} D_{DMSW}(T(x_t) - T(x_{t+1}),\, T(y_{\hat{t}}) - T(y_{\hat{t}+1})) \qquad (27)$$
Fig. 6. Examples of Online Temporal Segmentation. Top (Depth Sensor): a sequence including 3 segments: walking, boxing and jumping. Noisy joint trajectories are tracked by OpenNI. Middle (Mocap): a sequence with 4579 frames and 7 action segments. Bottom (Video): a clip including walking and running. For all cases, KTC-R achieves the highest accuracy.
These two scores can be linearly combined,

$$S_{DMW}(X_{1:L_x}, Y_{1:L_y}) = \alpha S_1(X_{1:L_x}, Y_{1:L_y}) + (1 - \alpha) S_2(X_{1:L_x}, Y_{1:L_y}) \qquad (28)$$

where $\alpha \in [0\ 1]$ can either be optimized by cross-validation in the supervised setting (i.e., recognition), or chosen manually in the unsupervised setting (i.e., clustering). Eq. 28 is a summary result of eqs. 18, 19, 23 and the two feature extraction functions. The similarity metric is not symmetric, so we set the testing sequence to be the reference sequence.
6 EXPERIMENTAL RESULTS
We quantitatively evaluate the performance of the proposed approach on CMU Motion Capture data (Mocap), HumanEva-2 [67] and depth sensor data. These data are chosen to demonstrate the general capability of our algorithms on human motion analysis, as well as the advantages on continuous action recognition in the transfer learning module. We investigate the performance of realtime action segmentation across different data sets with comparison to alternative methods. We also compare action recognition on Mocap data with other state-of-the-art alignment algorithms. In particular, we can online recognize actions of an arbitrary person from an arbitrary viewpoint, given realtime continuous depth sensor input.

6.1 Segmentation Results

In this section, a quantitative comparison of online temporal segmentation methods is provided. KTC-R is compared
TABLE 1
Temporal Segmentation Results Comparison. Precision (P), recall (R) and rand index (RI) are reported.

Methods            Depth                      Mocap                      Video
PPCA-CD (online)   0.73(P)/0.78(R)/0.80(RI)   0.85(P)/0.90(R)/0.90(RI)   --
TSC-CD (online)    0.77(P)/0.81(R)/0.81(RI)   0.83(P)/0.86(R)/0.88(RI)   0.78(P)/0.85(R)/0.82(RI)
KTC-S (online)     0.88(P)/0.91(R)/0.89(RI)   0.87(P)/0.90(R)/0.91(RI)   0.85(P)/0.89(R)/0.87(RI)
KTC-R (online)     0.87(P)/0.93(R)/0.88(RI)   0.86(P)/0.91(R)/0.92(RI)   0.85(P)/0.92(R)/0.88(RI)
with other state-of-the-art methods, i.e., PPCA-CD [38] and TSC-CD [36], [37], where TSC-CD is a change-point detection algorithm based on temporal spectral clustering, implemented by us. PPCA-CD uses the same incremental sliding-window strategy as sec. 3.4 and TSC-CD uses the fixed-length sliding window as in sec. 3.3. Thresholds (e.g., $\epsilon$ for KTC-R and the thresholds for other methods) are set by cross-validation on one sequence. Methods like ACA [33] and [34] can not be directly compared since they are offline. Results are evaluated by three metrics, i.e., precision, recall and rand index. The first two are for cut points and the last one is for all frames. The ground-truth for rand index (RI) is labeled as consecutive numbers 1, 2, 3, ... for different segments. Importantly, $T_0$ is set as 80, 250 and 60 for depth sensor, Mocap and video respectively, making KTC-R have 2.3, 2.1 and 1 seconds of delay. Results are very robust to $T_0$ and $\Delta T$. For instance, we got almost identical results when $T_0$ ranges from 60 to 120 on OpenNI data. Furthermore, KTC-S achieves similar accuracy to KTC-R but with a longer delay, thus KTC-R is preferred.

Depth Sensor. To validate online temporal segmentation on depth sensor data, 10 human motion sequences were captured by the PrimeSense sensor. Each sequence is a combination of 3 to 5 actions (e.g., walking to boxing) with length around 700 frames (30Hz). For human pose tracking, we use the available OpenNI tracker to automatically track joints on the human skeleton. $K \in [12, 15]$ key points are tracked, resulting in joint 3D positions $x_t$ in $\mathbb{R}^{36}$ to $\mathbb{R}^{45}$. Although human pose tracking results are often noisy (Fig. 6 and Fig. 8), we can correctly estimate action transitions from these noisy tracking results. In particular, KTC-R ($\Delta T = 30$) significantly improves the accuracy over other methods (Table 1). The main reason is that the joint positions of noisy tracked joints have complex nonparametric structures, which is handled by the kernel embedding of distributions [10], [11], [9] in KTC.

KTC-H. Besides action transitions, results on detecting both cyclic motions and transitions are reported by performing KTC-H ($T_m = 50$, $\Delta T_m = 1$). Since other methods don't have this module, we report a quantitative comparison on online hierarchical segmentation by using KTC-H, or other methods plus our KAC algorithm of sec. 3.5. Results show (Table 3) that KTC-H gets higher accuracy than the other combinations. It is notable that, because of the nature of RI, the RI metric increases when the number of cuts increases, even for low P/R, which is the case for hierarchical segmentation (including two types of cuts).

Mocap. Similar to [38], [33], M = 14 joints are used to represent the human skeleton, resulting in joint quaternions of joint angles in 42D. Online temporal segmentation methods are tested on 14 selected Mocap sequences from subject 86 in the CMU Mocap database. Each sequence is a combination of roughly 10 action segments, and in total there are around $10^5$ frames (120Hz). Since our implementation of PPCA-CD differs from [38] (for example, only a forward pass is allowed in our experiments), results are not the same as in [38]. Table 1 shows that the gain of KTC-R ($\Delta T = 50$) over other methods on Mocap is reduced, compared with depth sensor data. This is because the Gaussian property is more likely to hold for the quaternion representation of noiseless Mocap data, which is not the case for real data in general.

Video. Furthermore, KTC-R is performed on a number of sequences from HumanEva-2, which is a benchmark for human motion analysis [67]. Silhouettes are extracted by background subtraction, resulting in a sequence of binary masks (60Hz). $x_t \in \mathbb{R}^{D_t}$ is set as the vector representation of the mask at frame $t$. It is notable that $D_t$ (the size of the mask) in different frames may not be identical, so PPCA-CD can not be applied. This fact supports the advantage of KTC, which is applicable to complex sequential data as long as a (pseudo) kernel can be defined. In particular, we follow [33] to compute the matching distance of silhouettes to set the kernel. Results are shown in Fig. 6 and Table 1. As a reference, the state-of-the-art offline temporal clustering method ACA achieves higher accuracy than KTC-R on Mocap (96% precision). However, offline methods (1) are not suitable for real-time applications, and (2) require the number of clusters (segments) to be set in advance, which is not applicable in many cases.

6.2 Recognition Results
Fig. 7. Action recognition on Mocap. 1 Mocap example for each action.

We collected 3978 frames from CMU Mocap [65], capturing fifteen people performing 10 natural actions (details in Fig. 7). For action recognition, we use the leave-one-out procedure for each sequence, i.e., each sequence is treated as unlabeled and associated with all other sequences.
TABLE 2
Action Recognition Rates on Mocap. Rate is measured by # of sequences (S) or # of frames (F).

Methods            Rate (S)   Rate (F)
DTW+DMSW (3MD)     60%        62%
CTW+DMSW (3MD)     85%        91%
DMW (3MD)          95%        99%
DMW (2KD)          90%        87%
Since each person performs a specific action only once, the recognition process cannot benefit from the fact that the same person repeating the same action yields a large similarity. The weighting parameter in eq. 28 is set to 0.5, and results (Table 2) show that our approach misclassifies only 5% of sequences, or 1.2% when weighted by the number of frames. To investigate how temporal alignment affects recognition, results using DTW [53] and CTW [58] are also provided in Table 2. To make a fair comparison, only the temporal alignment step is changed. Results show that this change reduces accuracy significantly, which supports the effectiveness of DMW not only for temporal alignment (Fig. 5) but also, quantitatively, for action recognition. Furthermore, to demonstrate the ability to recognize actions from an arbitrary 2D view, Mocap sequences are linearly projected to joint 2D trajectories in 30D space using a synthetic camera (without occlusion, K = M = 15). We achieve 90% accuracy on this 2D view recognition task. [45] (HMM+AdaBoost) also reports a recognition rate for Mocap data, but it requires a large number of training sequences, and 2D view recognition is not included.
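Since only the temporal alignment step differs between the rows of Table 2, the following minimal sketch shows the dynamic-programming recurrence at the core of the DTW baseline [53]; DMW keeps a similar alignment structure but replaces the raw Euclidean frame cost with its spatio-temporal manifold distance. The implementation below is illustrative, not the code used in our experiments.

import numpy as np

def dtw_distance(X, Y):
    # Classic DTW between multivariate sequences X (n x d) and Y (m x d)
    # with Euclidean frame cost; returns the accumulated alignment cost
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # repeat a frame of Y
                                 D[i, j - 1],      # repeat a frame of X
                                 D[i - 1, j - 1])  # advance both
    return D[n, m]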
6.3 Joint Segmentation and Recognition
The proposed approach in sec. 5 can also be used to recognize actions from 2.5D depth sensor input, without an extra training process in the transfer learning module. Furthermore, we combine the temporal segmentation in sec. 3 and the action recognition method in sec. 5 to build an online system for sequential action segmentation and recognition.
Following the same approach as in sec. 6.1, we use OpenNI [68] to get the 36D to 45D time series Y^{1:L_y} ∈ R^{3K×L_y} (K ≤ M) from the depth sensor. Afterwards, the online recognition process for Y^{1:L_y} is performed by the following procedure:
(1) Temporal Segmentation. Use the algorithms in section 3 to sequentially cut Y^{1:L_y} into action units Y^{a_j:a_{j+1}-1} ∈ R^{3K×(a_{j+1}-a_j)} (j = 1, 2, 3, ..., a_1 = 1).
(2) Structure Learning. Use the algorithm in section 5.1 for the current action unit Y^{a_j:a_{j+1}-1}.
(3) Temporal Alignment. Use the proposed DMTW algorithm (eq. 19) to get the temporal correspondence X_i^{1:a_{j+1}-a_j} ∈ R^{3M×(a_{j+1}-a_j)} from a labeled Mocap sequence X^i_mocap.
(4) Spatial Alignment. Select the K markers from X_i^{1:a_{j+1}-a_j} that correspond to the ones tracked by OpenNI, resulting in X_K^{1:a_{j+1}-a_j} ∈ R^{3K×(a_{j+1}-a_j)}. Only these features from X^{1:a_{j+1}-a_j} are used to match Y^{a_j:a_{j+1}-1} with DMSW (eq. 24), since the information of the remaining M − K markers is missing from the OpenNI tracking results.
(5) Motion Distance. S_DMW(Y^{a_j:a_{j+1}-1}, X^i_mocap) is calculated using eq. 28.
(6) Recognition. Assume there are N labeled motion sequences {X^i_mocap}_{i=1}^{N} with action labels I^i, where I^i ∈ {1, 2, ..., C} and C is the number of action classes. The estimated action label I^y_j for the current action unit is given by

I^y_j = arg min_{i ∈ {1,2,...,N}} S_DMW(Y^{a_j:a_{j+1}-1}, {X^i_mocap, I^i})    (29)
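A compact sketch of this decision rule follows: each segmented action unit is labeled by its nearest labeled Mocap exemplar under the alignment-based distance. Here motion_distance stands in for S_DMW of eq. 28, and segment stands in for the KTC-H cuts of step (1); both names, and the exemplar format, are illustrative assumptions.

def recognize_unit(Y_unit, exemplars, motion_distance):
    # Eq. 29: label an action unit by its nearest labeled exemplar;
    # exemplars is a list of (X_mocap, action_label) pairs
    best_label, best_dist = None, float("inf")
    for X_mocap, label in exemplars:
        d = motion_distance(Y_unit, X_mocap)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label, best_dist

def online_recognition(Y, segment, exemplars, motion_distance):
    # Cut the incoming stream into action units (step 1), then label each
    return [recognize_unit(Y[a:b], exemplars, motion_distance)
            for a, b in segment(Y)]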
Results. We collect an additional 5109 frames (N = 30) with 10 primitive actions from CMU Mocap as the labeled (training) data for recognition. In order to associate labeled Mocap sequences with data from other domains, joint position trajectories (M = 15) are used in eq. 29 [59]. The testing data are the previously collected sequences from the depth sensor, and online segmentation and recognition are performed simultaneously by KTC-H and eq. 29. A significant feature of our approach is that there is no extra training process for the depth sensor, i.e., the knowledge from Mocap can be transferred to other motion sequences, given proper features. Tracked trajectories from OpenNI in an action unit (segmented by KTC-H) are associated with labeled Mocap sequences from 10 action categories.
Although OpenNI tracking results are often noisy (highlighted by blue circles in Fig. 8), we achieve 85% recognition accuracy (Acc) from these noisy tracking results (Table 3), without any additional training on depth sensor data. This result benefits not only from DMW [59] but also from KTC-H: DMW requires that its input contain only one action unit, and KTC-H performs this critical missing step, i.e., accurate online temporal segmentation, before recognition. As illustrated in Table 3, the accuracy on OpenNI data is enhanced from 0.71 to 0.85, which strongly supports the effectiveness of KTC-H. Furthermore, the complete and accurate 3M-D human motion sequences can be inferred by associating the learned manifolds from Mocap.

TABLE 3
Online hierarchical segmentation and recognition on 2.5D depth sensor.

Methods              Depth
PPCA-CD+KAC+CTW      0.72(P)/0.76(R)/0.89(RI)/0.62(Acc)
PPCA-CD+KAC+DMW      0.72(P)/0.76(R)/0.89(RI)/0.71(Acc)
KTC-H+DMW            0.85(P)/0.87(R)/0.94(RI)/0.85(Acc)

Fig. 8. Online action segmentation and recognition on 2.5D depth sensor. Top to bottom: depth image sequences, KTC-H results, and action recognition results. For segmentation, the blue line indicates the cut and different rectangles indicate different action units. The blue circle indicates noisy tracking results. For recognition: distance to labeled Mocap sequences, and inferred 3M-D motion sequences.
7 CONCLUSION
In this paper, we first propose an online temporal segmentation method, KTC, as a temporal extension of Hilbert space embedding of distributions for change-point detection, based on the novel spatio-temporal kernel. Then, a realtime implementation of KTC and a hierarchical extension are designed, which can detect both action transitions and action units. Furthermore, a robust and efficient alignment algorithm, DMW, is designed to calculate the similarity between two multivariate time series. Finally, temporal segmentation is combined with spatio-temporal alignment, resulting in realtime action recognition on depth sensor input, without the need for training data from the depth sensor.
Future work includes the extension to 2D videos and applications beyond human motion analysis. In order to apply our approach to action recognition in 2D videos, we can use an idea similar to [69] to estimate the 2D key points of the human skeleton from the image. As the proposed framework and its two algorithms, KTC and DMW, are general methods, they can also be used in other domains, such as 3D facial expression analysis, whenever a time series representation is available.
REFERENCES
[1] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," in Proc. CVPR, 2008.
[2] J. C. Niebles, H. Wang, and L. Fei-Fei, "Unsupervised learning of human action categories using spatial-temporal words," IJCV, vol. 79, pp. 299-318, 2008.
[3] P. Matikainen, M. Hebert, and R. Sukthankar, "Representing pairwise spatial and temporal relations for action recognition," in Proc. ECCV, 2010, vol. 6311, pp. 508-521.
[4] P. Natarajan and R. Nevatia, "View and scale invariant action recognition using multiview shape-flow models," in Proc. CVPR, 2008.
[5] P. Yan, S. M. Khan, and M. Shah, "Learning 4D action feature models for arbitrary view action recognition," in Proc. CVPR, 2008.
[6] H. Ning, W. Xu, Y. Gong, and T. Huang, "Latent pose estimator for continuous action recognition," in Proc. ECCV, 2008, vol. 5303, pp. 419-433.
[7] J. C. Niebles, C.-W. Chen, and L. Fei-Fei, "Modeling temporal structure of decomposable motion segments for activity classification," in Proc. ECCV, 2010.
[8] T. Hofmann, B. Schölkopf, and A. J. Smola, "Kernel methods in machine learning," Annals of Statistics, vol. 36, pp. 1171-1220, 2008.
[9] A. Smola, A. Gretton, L. Song, and B. Schölkopf, "A Hilbert space embedding for distributions," in Algorithmic Learning Theory, Springer-Verlag, 2007, pp. 13-31.
[10] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola, "A kernel method for the two-sample-problem," in NIPS 19, MIT Press, 2007, pp. 513-520.
[11] A. Gretton, K. Fukumizu, Z. Harchaoui, and B. Sriperumbudur, "A fast, consistent kernel two-sample test," in NIPS, 2009, vol. 19, pp. 673-681.
[12] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, pp. 2319-2323, December 2000.
[13] L. K. Saul and S. T. Roweis, "Think globally, fit locally: unsupervised learning of low dimensional manifolds," JMLR, vol. 4, pp. 119-155, 2003.
[14] O. C. Jenkins and M. J. Mataric, "A spatio-temporal extension to Isomap nonlinear dimension reduction," in Proc. ICML, 2004.
[15] W. Liao and G. Medioni, "3D face tracking and expression inference from a 2D sequence using manifold learning," in Proc. CVPR, vol. 2, 2008, pp. 416-423.
[16] P. Mordohai and G. Medioni, "Dimensionality estimation, manifold learning and function approximation using tensor voting," JMLR, vol. 11, pp. 411-450, 2010.
[17] N. Lawrence, "Probabilistic non-linear principal component analysis with Gaussian process latent variable models," JMLR, vol. 6, pp. 1783-1816, November 2005.
[18] J. M. Wang, D. J. Fleet, and A. Hertzmann, "Gaussian process dynamical models for human motion," IEEE PAMI, vol. 30, pp. 283-298, 2008.
[19] R. Urtasun, D. J. Fleet, A. Geiger, J. Popovic, T. Darrell, and N. D. Lawrence, "Topologically-constrained latent variable models," in Proc. ICML, 2008, pp. 1080-1087.
[20] R. Urtasun, D. Fleet, and P. Fua, "3D people tracking with Gaussian process dynamical models," in Proc. CVPR, vol. 1, 2006, pp. 238-245.
[21] C.-S. Lee, "Modeling human motion using manifold learning and factorized generative models," PhD thesis, 2007.
[22] A. Elgammal and C.-S. Lee, "The role of manifold learning in human motion analysis," Computational Imaging and Vision, vol. 36, 2008.
[23] N. C. Tang, C.-T. Hsu, T.-Y. Lin, and H.-Y. M. Liao, "Example-based human motion extrapolation based on manifold learning," in Proc. ACM MM, 2011.
[24] J. Blackburn and E. Ribeiro, "Human motion recognition using Isomap and dynamic time warping," in Proc. ICCV Workshop, 2007.
[25] J. Chen and A. Gupta, Parametric Statistical Change-point Analysis. Birkhäuser, 2000.
[26] X. Xuan and K. Murphy, "Modeling changing dependency structure in multivariate time series," in Proc. ICML, 2007.
[27] R. P. Adams and D. J. MacKay, "Bayesian online changepoint detection," University of Cambridge Technical Report, 2007.
[28] Y. Saatci, R. Turner, and C. Rasmussen, "Gaussian process change point models," in Proc. ICML, 2010.
[29] F. Desobry, M. Davy, and C. Doncarli, "An online kernel change detection algorithm," IEEE Transactions on Signal Processing, vol. 53, pp. 2961-2974, 2005.
[30] Z. Harchaoui, F. Bach, and E. Moulines, "Kernel change-point analysis," in NIPS 21, 2009.
[31] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in NIPS 14, MIT Press, 2002, pp. 849-856.
[32] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, pp. 395-416, 2007.
[33] F. Zhou, F. De la Torre, and J. K. Hodgins, "Hierarchical aligned cluster analysis for temporal clustering of human motion," IEEE PAMI, accepted for publication, 2012.
[34] E. Fox, E. Sudderth, M. Jordan, and A. Willsky, "Nonparametric Bayesian learning of switching linear dynamical systems," in NIPS 21, 2009, pp. 457-464.
[35] H. Zhong, J. Shi, and M. Visontai, "Detecting unusual activity in video," in Proc. CVPR, 2004, pp. 816-823.
[36] L. Zelnik-Manor and M. Irani, "Statistical analysis of dynamic actions," IEEE PAMI, vol. 28, pp. 1530-1535, 2006.
[37] F. De la Torre, J. Campoy, Z. Ambadar, and J. F. Cohn, "Temporal segmentation of facial behavior," in Proc. ICCV, 2007.
[38] J. Barbic, A. Safonova, J.-Y. Pan, C. Faloutsos, J. K. Hodgins, and N. S. Pollard, "Segmenting motion capture data into distinct behaviors," in Proc. Graphics Interface, 2004, pp. 185-194.
[39] I. Junejo, E. Dexter, I. Laptev, and P. Perez, "View-independent action recognition from temporal self-similarities," IEEE PAMI, vol. 33, pp. 172-185, 2011.
[40] D. Weinland, E. Boyer, and R. Ronfard, "Action recognition from arbitrary views using 3D exemplars," in Proc. ICCV, 2007.
[41] F. Lv and R. Nevatia, "Single view human action recognition using key pose matching and Viterbi path searching," in Proc. CVPR, 2007, pp. 1-8.
[42] R. Messing, C. Pal, and H. Kautz, "Activity recognition using the velocity histories of tracked keypoints," in Proc. ICCV, 2009, pp. 104-111.
[43] J. Sun, X. Wu, S. Yan, L. F. Cheong, T. S. Chua, and J. Li, "Hierarchical spatial-temporal context modeling for action recognition," in Proc. CVPR, 2009, pp. 2004-2011.
[44] R. Poppe, "A survey on vision-based human action recognition," Image and Vision Computing, vol. 28, pp. 976-990, 2010.
[45] F. Lv and R. Nevatia, "Recognition and segmentation of 3-D human action using HMM and multi-class AdaBoost," in Proc. ECCV, 2006, vol. 3954, pp. 359-372.
[46] D. Weinland, M. Ozuysal, and P. Fua, "Making action recognition robust to occlusions and viewpoint changes," in Proc. ECCV, 2010, vol. 6313, pp. 635-648.
[47] W. Li, Z. Zhang, and Z. Liu, "Action recognition based on a bag of 3D points," in Workshop on CVPR for Human Communicative Behavior Analysis, 2010, pp. 9-14.
[48] B. Ni, G. Wang, and P. Moulin, "RGBD-HuDaAct: A color-depth video database for human daily activity recognition," in Workshop on Consumer Depth Cameras for Computer Vision (in conjunction with ICCV), 2011.
[49] J. Wang, Z. Liu, Y. Wu, and J. Yuan, "Mining actionlet ensemble for action recognition with depth cameras," in Proc. CVPR, 2012, pp. 1290-1297.
[50] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," JMLR, vol. 3, pp. 1-48, 2003.
[51] T. K. Kim and R. Cipolla, "Canonical correlation analysis of video volume tensors for action categorization and detection," IEEE PAMI, vol. 31, pp. 1415-1428, 2009.
[52] C. C. Loy, T. Xiang, and S. Gong, "Multi-camera activity correlation analysis," in Proc. CVPR, 2009, pp. 1988-1995.
[53] C. Rao, A. Gritai, M. Shah, and T. Syeda-Mahmood, "View-invariant alignment and matching of video sequences," in Proc. ICCV, 2003, pp. 939-945.
[54] M. Singh, I. Cheng, M. Mandal, and A. Basu, "Optimization of symmetric transfer error for sub-frame video synchronization," in Proc. ECCV, 2008, vol. 5303, pp. 554-567.
[55] L. R. Rabiner and B. Juang, Fundamentals of Speech Recognition. Prentice-Hall, Inc., 1993.
[56] Y. Ukrainitz and M. Irani, "Aligning sequences and actions by maximizing space-time correlations," in Proc. ECCV, 2006, vol. 3953, pp. 538-550.
[57] F. Padua, R. Carceroni, G. Santos, and K. Kutulakos, "Linear sequence-to-sequence alignment," IEEE PAMI, vol. 32, pp. 304-320, 2010.
[58] F. Zhou and F. De la Torre, "Canonical time warping for alignment of human behavior," in NIPS, 2009, vol. 22, pp. 2286-2294.
[59] D. Gong and G. Medioni, "Dynamic manifold warping for view invariant action recognition," in Proc. ICCV, 2011, pp. 571-578.
[60] M. Hoai, Z. Lan, and F. De la Torre, "Joint segmentation and classification of human actions in video," in Proc. CVPR, 2011.
[61] D. Gong, G. Medioni, S. Zhu, and X. Zhao, "Kernelized temporal cut for online temporal segmentation and recognition," in Proc. ECCV, 2012.
[62] L. Faivishevsky and J. Goldberger, "A nonparametric information theoretic clustering algorithm," in Proc. ICML, 2010, pp. 351-358.
[63] I. Laptev, S. Belongie, P. Perez, and J. Wills, "Periodic motion detection and segmentation via approximate sequence alignment," in Proc. ICCV, 2005, pp. 816-823.
[64] Y. W. Teh and S. Roweis, "Automatic alignment of local representations," in NIPS 15, MIT Press, 2003, pp. 841-848.
[65] CMU Motion Capture Database, http://mocap.cs.cmu.edu/.
[66] J. Listgarten, R. M. Neal, S. T. Roweis, and A. Emili, "Multiple alignment of continuous time series," in NIPS, 2005, vol. 17.
[67] L. Sigal, A. O. Balan, and M. J. Black, "HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion," IJCV, vol. 87, pp. 4-27, 2010.
[68] OpenNI, http://www.openni.org/Downloads/OpenNIModules.aspx.
[69] B. Yao and L. Fei-Fei, "Action recognition with exemplar based 2.5D graph matching," in Proc. ECCV, 2012, pp. 173-186.
Dian Gong received the PhD degree, with a major in Electrical Engineering and a minor in Computer Science, from the University of Southern California. He received his BS degree in Electronic Engineering from Tsinghua University. His PhD thesis applies machine learning to mining (large-scale) time series data. He is currently working as a quantitative trading associate at Susquehanna International Group, and previously worked at Barclays Investment Bank, Sony US Research, and Microsoft Research. He has won several mathematics contest awards and was selected as a national team candidate for the International Mathematics Olympiad.
Gerard Medioni received the Diplôme d'Ingénieur from ENST, Paris, in 1977, and an M.S. and Ph.D. from the University of Southern California in 1980 and 1983, respectively. He has been at USC since then, and is currently Professor of Computer Science and Electrical Engineering, co-director of the Institute for Robotics and Intelligent Systems (IRIS), and co-director of the USC Games Institute. He served as Chairman of the Computer Science Department from 2001 to 2007. Professor Medioni has made significant contributions to the field of computer vision. His research covers a broad spectrum of the field, such as edge detection, stereo and motion analysis, shape inference and description, and system integration. He has published 4 books, over 75 journal papers, and 200 conference articles, and is the recipient of 14 international patents. Prof. Medioni is on the advisory board of the IEEE Transactions on PAMI, associate editor of the International Journal of Computer Vision, associate editor of the Pattern Recognition and Image Analysis Journal, and associate editor of the International Journal of Image and Video Processing. Prof. Medioni served as program co-chair of the 1991 IEEE CVPR Conference in Hawaii and of the 1995 IEEE Symposium on Computer Vision in Miami, general co-chair of the 1997 IEEE CVPR Conference in Puerto Rico, conference co-chair of the 1998 ICPR Conference in Australia, general co-chair of the 2001 IEEE CVPR Conference in Kauai, general co-chair of the 2007 IEEE CVPR Conference in Minneapolis, general co-chair of the 2009 IEEE CVPR Conference in Miami, program co-chair of the 2009 IEEE WACV Conference in Snowbird, Utah, general co-chair of the 2011 IEEE WACV Conference in Kona, Hawaii, and general co-chair of the 2013 IEEE CVPR in Portland. He is a Fellow of IAPR, a Fellow of the IEEE, and a Fellow of AAAI.
Xuemei Zhao received the BS degree in Electronic Engineering from Tsinghua University, Beijing, in 2008, and the PhD degree in Electrical Engineering from the University of Southern California, Los Angeles, in 2013. She is currently working as a software engineer at Google in the New York office.