Learning Appearance Manifolds from Video

Ali Rahimi
MIT CS and AI Lab, Cambridge, MA 02139
[email protected]

Ben Recht
MIT Media Lab, Cambridge, MA 02139
[email protected]

Trevor Darrell
MIT CS and AI Lab, Cambridge, MA 02139
[email protected]

Abstract

The appearance of dynamic scenes is often largely governed by a latent low-dimensional dynamic process. We show how to learn a mapping from video frames to this low-dimensional representation by exploiting the temporal coherence between frames and supervision from a user. This function maps the frames of the video to a low-dimensional sequence that evolves according to Markovian dynamics. This ensures that the recovered low-dimensional sequence represents a physically meaningful process. We relate our algorithm to manifold learning, semi-supervised learning, and system identification, and demonstrate it on the tasks of tracking 3D rigid objects, deformable bodies, and articulated bodies. We also show how to use the inverse of this mapping to manipulate video.

1. Introduction

The change in appearance of most scenes is governed by a low-dimensional time-varying physical process. The 3D motion of a camera through a scene in an egomotion problem, the contraction of the muscles in a face in expression analysis, or the motion of limbs in articulated-body tracking are examples of low-dimensional processes that almost completely determine the appearance of a scene. Recovering these processes is a fundamental problem in many areas of computer vision.

Recently, manifold learning algorithms have been used to automatically recover low-dimensional representations of collections of images [1, 3, 4, 11, 14]. But when these collections are video sequences, these algorithms ignore the temporal coherence between frames, even though this cue provides useful information about the neighborhood structure and the local geometry of the manifold. Semi-supervised regression methods that take advantage of the manifold structure of the data set provide another framework for addressing this problem [2, 17]. These algorithms learn a mapping between high-dimensional observations and low-dimensional representations given a few examples of the mapping. But they do not take advantage of the temporal coherence between video frames either. One could use nonlinear system identification [6, 15] to model the dynamics of low-dimensional states, simultaneously estimating these states while learning a lifting from them to the observed images. But current nonlinear system identification methods do not scale to image-sized observations and get stuck in local minima.

Our main contribution is a synthesis of a semi-supervised regression model with a model for nonlinear system identification. The result is a semi-supervised regression model that takes advantage of the dynamics model used in system identification to learn an appearance manifold. The algorithm finds a smooth mapping, represented with radial basis functions, that maps images to a low-dimensional process consistent with physical dynamics defined by a linear-Gaussian Markov chain. The algorithm allows a user to label a few data points to specify a coordinate system and to provide guidance to the algorithm when needed.

We demonstrate our algorithm with an interactive tracking system where the user specifies a desired output for a few key frames in a video sequence. These examples, together with the unlabeled portion of the video sequence, allow the system to compute a function that maps as-yet unseen images to the desired representation. This function is represented using radial basis kernels centered on the frames of the video sequence. We demonstrate our algorithm on three different examples: 1) a rigid pose estimation problem where the user specifies the pose of a synthetic object for a few key frames, 2) a lip tracking example where the user specifies the shape of the subject's lips, and 3) an articulated body tracking experiment where the user specifies positions of the subject's limbs. The algorithm operates on the video frames directly and does not require any preprocessing. Semi-supervision allows the user to specify additional examples to improve the performance of the system where needed.

By inverting the learned mapping, we can also generate novel frames and video sequences. We demonstrate this by manipulating low-dimensional representations to synthesize videos of lips and articulated limbs.

2. Related Work

Manifold learning techniques [1, 3, 4, 11, 14, 16] find a low-dimensional representation that preserves some local geometric attribute of the high-dimensional observations. This requires identifying data points that lie in a local neighborhood along the manifold around every high-dimensional data point. When the manifold is sparsely sampled, these neighboring points are difficult to identify, and the algorithms can fail to recover any meaningful structure. Our algorithm obviates the need to search for such neighbors by utilizing the time ordering of data points instead. Jenkins and Mataric [8] suggest artificially reducing the distance between temporally adjacent points to provide an additional hint to Isomap about the local neighborhoods of image windows. We also take advantage of dynamics in the low-dimensional space to allow our algorithm to better estimate the distance between pairs of temporally adjacent points along the manifold. This requires only fine enough sampling over time to retain the temporal coherence between video frames, which is much less onerous than the sampling rate required to correctly estimate neighborhood relationships in traditional manifold learning algorithms. While various semi-supervised extensions to manifold learning algorithms have been proposed [7, 10], these algorithms still do not take advantage of the temporal coherence between adjacent samples of the input time series.

The semi-supervised regression approaches of [17] and [2] take into account the manifold structure of the data. But they also rely on brittle estimates of the neighborhood structure, and do not take advantage of the time ordering of the data set. These semi-supervised regression methods are similar to our method in that they also impose a random field on the low-dimensional representation. The work presented here augments these techniques by introducing a temporal dependency between output samples in the random field. It can be viewed as a special case of estimating the parameters of a continuously-valued conditional random field [9] or a manifold learning algorithm based on function estimation [13].

Nonlinear system identification (see [6, 15] and references within) provides another framework for introducing dynamics into manifold learning. In this context, the frames in the video are modeled as observations generated by a Markov chain of low-dimensional states. Nonlinear system identification recovers the parameters of this model, including an observation function which maps low-dimensional states to images. This usually requires approximate coordinate ascent over a non-convex space, making the algorithms computationally intensive and susceptible to local minima.

Figure 1. A generative model for video sequences. The states x_t are low-dimensional representations of the scene. The embedding f lifts these to high-dimensional images y_t.

Dynamic Textures [5] sidesteps these issues by performing linear system identification instead, which limits it to linear appearance manifolds. Instead of searching for a mapping from states to images, as would be done in nonlinear system identification, we search for a mapping from images to states. This results in an optimization problem that is quadratic in the latent states and the parameters of the projection function, making the problem computationally tractable and not subject to local minima.

3. Model for Semi-Supervised Nonlinear System ID

Figure 1 depicts a plausible generative model for video. The latent state of the scene evolves according to a Markov chain of states x_t, t = 1...T. At each time step, a nonlinear function f : R^d → R^D maps a d-dimensional subset of the state x_t to an image with D pixels, represented as a D-dimensional vector y_t. The Markov chain captures the notion that the underlying process that generates the video sequence is smooth. Effects not accounted for by f are modeled as iid noise modifying the output of f.

Learning the parameters of this generative model from a sequence of observations y_1, ..., y_T can be computationally expensive [6, 15]. Instead of solving for f in this generative model, we recover a projection function g : R^D → R^d that maps images to their low-dimensional representation in a random field. This random field consists of a function g that maps the sequence of observed images to a sequence in R^d that evolves in accordance with a Markov chain. The random field mirrors the generative model of Figure 1 by modeling the interactions between a Markov chain, the observations, and supervised points provided by the user. We address each interaction in turn, gradually building up a cost functional for g.

Consider each component g_i of g = [g_1(y) ... g_d(y)] separately. If the desired output of g_i at time steps t ∈ S were known to be z_t^i, we could use Tikhonov regularization on a Reproducing Kernel Hilbert Space (RKHS) to solve for the best approximation of g_i:

\min_{g_i} \sum_{t \in S} \|g_i(y_t) - z_t^i\|^2 + \lambda_k \|g_i\|_k^2.  (1)

The first term in this cost functional penalizes the deviation from the desired outputs, and the norm in the cost function governs the smoothness of g_i. In particular, according to the Representer Theorem [12], when the norm is an RKHS norm induced by a radial basis kernel, such as k(y', y) = exp(−‖y − y'‖²/σ_k²), any cost functional of the form Σ_{t∈S} V(g_i(y_t)) + ‖g_i‖_k² will be minimized by a weighted sum of kernels centered at each y_t:

g_i(y) = \sum_{t \in S} c_t^i \, k(y, y_t),  (2)

where the vector c^i contains the coefficients for the ith dimension of g.
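As a concrete illustration, here is a minimal NumPy sketch of this supervised-only step: a Gaussian RBF kernel and the kernel ridge solution of (1) through the representer form (2). All variable names and placeholder data are ours, and the closed form assumes K is invertible.

```python
import numpy as np

def rbf_kernel(Ya, Yb, sigma_k):
    """k(y, y') = exp(-||y - y'||^2 / sigma_k^2), computed for all pairs of rows."""
    sq = np.sum((Ya[:, None, :] - Yb[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / sigma_k**2)

def fit_supervised(Y_sup, z_sup, sigma_k, lam_k):
    """Minimize Eq. (1) over the representer form (2), g_i(y) = sum_t c_t k(y, y_t).
    For invertible K this reduces to the kernel ridge solution c = (K + lam I)^-1 z."""
    K = rbf_kernel(Y_sup, Y_sup, sigma_k)
    return np.linalg.solve(K + lam_k * np.eye(len(K)), z_sup)

# Usage: Y_sup holds the labeled frames as rows, z_sup their desired outputs.
Y_sup = np.random.rand(5, 2500)   # placeholder: 5 labeled 50x50 images, flattened
z_sup = np.random.rand(5)         # placeholder labels for one output dimension
c = fit_supervised(Y_sup, z_sup, sigma_k=1.0, lam_k=1e-3)
g_new = rbf_kernel(np.random.rand(3, 2500), Y_sup, 1.0) @ c  # Eq. (2) at new frames
```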

But in practice, only a few z_t^i's are provided by the user. Because we know the low-dimensional process is smooth, we assume the missing z_t^i's evolve according to second-order Newtonian dynamics:

x_{t+1} = A x_t + \omega_t,  (3)

A = \begin{bmatrix} 1 & A_v & 0 \\ 0 & 1 & A_a \\ 0 & 0 & 1 \end{bmatrix},  (4)

z_t^i = h' x_t.  (5)

The Gaussian random variable ω_t has zero mean and a diagonal covariance matrix Λ_ω. The matrices A and Λ_ω specify the desired dynamics, and are parameters of our algorithm. The components of x_t have intuitive physical analogs: the first component corresponds to a position, the second to velocity, and the third to acceleration. The vector h = [1 0 0]' extracts the position component of x_t.
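To make the dynamics concrete, the short sketch below builds A and h from Equations (3)–(5) and simulates the chain. The values of A_v, A_a, and the diagonal of Λ_ω are made up for illustration; the paper tunes them by cross validation (Section 5.3).

```python
import numpy as np

# State transition of Eqs. (3)-(4): position, velocity, acceleration.
A_v, A_a = 1.0, 1.0                       # illustrative values
A = np.array([[1.0, A_v, 0.0],
              [0.0, 1.0, A_a],
              [0.0, 0.0, 1.0]])
h = np.array([1.0, 0.0, 0.0])             # extracts the position component (Eq. 5)
Lam_omega = np.diag([1e-4, 1e-3, 1e-2])   # illustrative diagonal covariance

rng = np.random.default_rng(0)
T = 100
x = np.zeros((T, 3))
for t in range(1, T):
    omega = rng.multivariate_normal(np.zeros(3), Lam_omega)
    x[t] = A @ x[t - 1] + omega           # Eq. (3)
z = x @ h                                 # z_t = h' x_t  (Eq. 5)
```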

We can compensate for the absence of z_t^i at every data point by forcing g_i(y_t) to agree with the position component of the corresponding x_t using additional penalty terms:

\min_{g_i, x} \sum_{t=1}^{T} \|g_i(y_t) - h' x_t\|^2 + \lambda_d \sum_{t=2}^{T} \|x_t - A x_{t-1}\|^2_{\Lambda_\omega} + \lambda_s \sum_{t \in S} \|g_i(y_t) - z_t^i\|^2 + \lambda_k \|g_i\|_k^2.  (6)

The first term favors functions whose outputs evolve according to the trajectory x. The term weighted by λ_d favors trajectories that are compatible with the given dynamics model.

Figure 2 depicts a random field describing the factorization prescribed by (6). This random field mirrors the generative model of Figure 1. Note that according to the Representer Theorem, the optimal g_i has the form Σ_{t=1}^{T} c_t^i k(y, y_t), where the kernels are now placed on all data points, not just those supervised by the user.

Figure 2. Forcing agreement between projections of images y_t and a Markov chain of states x_t. z_3 is a semi-supervised point provided by the user. The function g maps observations to states.

4. Learning the Projection Function

The optimization (6) is quadratic in the quantities of interest. Substituting the representer form results in a finite-dimensional quadratic optimization problem:

\arg\min_{c^i, x} \|K c^i - H x\|^2 + \lambda_d x' \Omega_x x + \lambda_s \|G K c^i - z^i\|^2 + \lambda_k c^{i\prime} K c^i.  (7)

The matrix K has k(y_t, y_τ) in entry (t, τ). The path x is a 3T-dimensional vector of states stacked on top of each other, and H = I_T ⊗ h' extracts the position components of x. The matrix Ω_x is the inverse covariance of the Markov process and is block tri-diagonal. The matrix G extracts the rows of K corresponding to the supervised frames t ∈ S, and z^i is a column vector consisting of the ith component of all the semi-supervised points.

The minimizer can be found by setting derivatives to zero and solving for c^i. After an application of the matrix inversion lemma, we find

c^{i*} = \lambda_s S^{-1} G' z^i  (8)
S = K + \lambda_s G' G K - H R^{-1} H' K + \lambda_k I  (9)
R = \lambda_d \Omega_x + H' H  (10)

Having recovered the coefficients of the radial basis functions, g can be applied to an as-yet unseen image y_new by computing the vector K_new whose tth component is k(y_new, y_t). Then, according to Equation (2), g_i(y_new) = K_new c^i. Unlabeled frames of the video can be labeled by using the tth row of K, K_t, to get g_i(y_t) = K_t c^i.
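Putting the pieces together, the following is a minimal NumPy sketch of Equations (8)–(10). It assumes the kernel matrix K, the block tri-diagonal Ω_x, and the supervised index set have already been assembled; the helper names are ours, not the paper's.

```python
import numpy as np

def solve_coefficients(K, Omega_x, sup_idx, z_sup, lam_d, lam_s, lam_k):
    """Closed-form minimizer of Eq. (7) via Eqs. (8)-(10).
    K: T x T kernel matrix; Omega_x: 3T x 3T inverse covariance of the chain;
    sup_idx: indices of supervised frames; z_sup: their labels (one output dim)."""
    T = K.shape[0]
    h = np.array([1.0, 0.0, 0.0])
    H = np.kron(np.eye(T), h[None, :])           # H = I_T (x) h', size T x 3T
    G = np.eye(T)[sup_idx]                       # selects rows with t in S
    R = lam_d * Omega_x + H.T @ H                # Eq. (10)
    S = (K + lam_s * G.T @ G @ K
         - H @ np.linalg.solve(R, H.T @ K)       # H R^{-1} H' K
         + lam_k * np.eye(T))                    # Eq. (9)
    return lam_s * np.linalg.solve(S, G.T @ z_sup)  # Eq. (8)

# Labels for new or unlabeled frames then follow Eq. (2):
#   g_i(y_new) = [k(y_new, y_1), ..., k(y_new, y_T)] @ c
```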

5. Experiments

To compare with Isomap, LLE, and Laplacian Eigenmaps, we relied on source code available from the respective authors' web sites. We also compare against Belkin and Niyogi's graph Laplacian-based semi-supervised regression algorithm [2], which we refer to as BNR in this section. We used our own implementation of BNR.

5.1. Synthetic Results

We first demonstrate our algorithm on a synthetic 2D manifold embedded in R³. The neighborhood structure of this manifold is difficult to estimate from high-dimensional data, so traditional manifold learning techniques perform poorly on this data set. Taking into account the temporal coherence between data points and using user supervision alleviates these problems.

Figure 3 (top-middle) shows an embedding of the 2D Markov process shown in Figure 3 (top-left) into R³. The semi-supervised points are marked with a large triangle. Figure 3 (top-right) shows our interpolated results for the unlabeled points. The interpolated values are close to the true values that generated the data set. Although the process is smooth, it clearly does not follow the dynamics assumed by Equation (3) because it bounces off the boundaries of the rectangle [0, 5] × [−3, 3]. Nevertheless, the assumed dynamics of Equation (3) are sufficient for recovering the true location of unlabeled points.

To assess the quality of the learned function g on as-yet unseen points, we evenly sampled the 2D rectangle [0, 5] × [−3, 3] and lifted the samples to R³ using the same mapping used to generate the training sequence; see Figure 3 (bottom-left and bottom-middle). Each sample in R³ is passed through g to obtain the 2D representation shown in Figure 3 (bottom-right). The projections fall close to the true 2D location of these samples.
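For reference, the lifting F given explicitly in the caption of Figure 3 is easy to reproduce; a small sketch of the sampling-and-lifting step (grid resolution is our choice):

```python
import numpy as np

def lift(x, y):
    """The embedding F of Figure 3: maps the 2D rectangle into R^3."""
    return np.stack([x,
                     np.abs(y),
                     np.sin(np.pi * y) * (y**2 + 1) ** -2 + 0.3 * y], axis=-1)

# Evenly sample [0, 5] x [-3, 3] and lift the samples to R^3.
gx, gy = np.meshgrid(np.linspace(0, 5, 50), np.linspace(-3, 3, 50))
samples_3d = lift(gx.ravel(), gy.ravel())
```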

We applied LLE, Laplacian Eigenmaps, and Isomap to the data set of Figure 3 (top-middle). Isomap produced the result shown in Figure 4 (left). It is difficult to estimate the neighborhood structure near the neck, where the manifold comes close to intersecting itself, so Isomap creates folds in the projection.

Figure 4 (right) shows the result of BNR. Compared to our result in Figure 3 (top-right), the interpolated results are incorrect for most points. Since BNR does not attempt to enforce any geometric invariance in the projection, it is fairly robust to the neighborhood estimation problem.

For this and subsequent data sets, neither LLE nor Laplacian Eigenmaps produced sensible results. This may be due to the low rate at which the manifold is sampled.

5.2. Synthetic Images

We quantitatively gauged the performance of our algorithm on images by running it on a synthetic image sequence. Figure 5 shows frames in a synthetically generated sequence of 50 × 50 pixel images of a rigidly rotating object. Six images were chosen for semi-supervision by providing their true elevation and azimuth to the algorithm.

Figure 5. (top) A few frames of a synthetically generated 1500-frame sequence of a rotating cube. (bottom) The 6 semi-supervised frames. The rotation for each frame in the sequence was recovered with an average deviation of 4° from ground truth.

The azimuth and elevation of the filled-in frames deviated from ground truth by only an average of 3.5°. We evaluated BNR on the same data set, with the same semi-supervised points, and obtained average errors of 17° in elevation and 7° in azimuth. To test the learned function g, we generated a video sequence that swept through the range of azimuths and elevations in 4° increments. These images were passed through g to estimate their azimuth and elevation. The mean squared error was about 4° in each direction.

5.3. Interactive Tracking

Our algorithm is not limited to rigid body tracking. We applied it to a lip tracking experiment exhibiting deformable motion, and to an upper-body tracking experiment exhibiting articulated motion. In these experiments, we restricted ourselves to recovering the missing labels of the training data and labeling frames acquired under the same setting from which the training data was gathered. Our algorithm operates on the entire frames, as shown in the figures. Images were not in any way preprocessed before applying our algorithm, though to apply the learned mapping to different settings, more tailored representations or kernels could be employed. We tuned the parameters of our algorithm (A_v, A_a, the diagonal entries of Λ_ω, and the weights λ_d, λ_s, and λ_k) by minimizing the leave-one-out cross validation error on the semi-supervised points using the simplex method.
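A sketch of that tuning loop follows, using SciPy's Nelder-Mead simplex routine over log-parameters. For brevity the inner fit here is the plain kernel ridge of Equations (1)–(2) rather than the full dynamics-regularized solver of Section 4, and all data are placeholders.

```python
import numpy as np
from scipy.optimize import minimize

def rbf(Ya, Yb, sigma):
    d2 = np.sum((Ya[:, None] - Yb[None, :]) ** 2, axis=2)
    return np.exp(-d2 / sigma**2)

def loo_error(log_params, Y_sup, z_sup):
    """Leave-one-out CV error on the labeled frames: refit with each labeled
    frame held out and measure the prediction error at that frame. In the full
    system the inner fit would also expose A_v, A_a, and the diagonal of
    Lambda_omega as parameters to tune."""
    sigma, lam_k = np.exp(log_params)      # optimize in log space: stays > 0
    err = 0.0
    for i in range(len(Y_sup)):
        keep = np.arange(len(Y_sup)) != i
        K = rbf(Y_sup[keep], Y_sup[keep], sigma)
        c = np.linalg.solve(K + lam_k * np.eye(K.shape[0]), z_sup[keep])
        pred = rbf(Y_sup[i:i + 1], Y_sup[keep], sigma) @ c
        err += np.sum((pred - z_sup[i]) ** 2)
    return err

# Simplex (Nelder-Mead) search over the hyperparameters, as in the paper.
Y_sup = np.random.rand(6, 2500)            # placeholder: 6 labeled 50x50 frames
z_sup = np.random.rand(6)                  # placeholder labels
best = minimize(loo_error, x0=np.log([1.0, 0.1]), args=(Y_sup, z_sup),
                method="Nelder-Mead")
sigma_opt, lam_opt = np.exp(best.x)
```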

Figure 6 shows frames in a 2000-frame sequence of a subject articulating his lips. The top row shows the frames that were manually annotated with a bounding box around the lips. The bottom row shows the bounding boxes returned by g on some typical frames in the sequence. Only five labeled frames were necessary to obtain good lip tracking performance. The tracker is robust to natural changes in lighting, blinking, facial expressions, small movements of the head, and the appearance and disappearance of teeth.

Figure 3. (top-left) The true 2D parameter trajectory. Semi-supervised points are marked with big black triangles. The trajectory is sampled at 1500 points (small markers). Points are colored according to their y-coordinate on the manifold. (top-middle) Embedding of a path via the lifting F(x, y) = (x, |y|, sin(πy)(y² + 1)^{−2} + 0.3y). (top-right) Recovered low-dimensional representation using our algorithm. The original data in (top-left) is correctly recovered. (bottom-left) Even sampling of the rectangle [0, 5] × [−3, 3]. (bottom-middle) Lifting of this rectangle via F. (bottom-right) Projection of (bottom-middle) via the learned function g. g has correctly learned the mapping from 3D to 2D. These figures are best viewed in color.

Figure 4. (left) Isomap's projection into R² of the data set of Figure 3 (top-middle). Errors in estimating the neighborhood relations at the neck of the manifold cause the projection to fold over itself. (right) Projection with BNR, a semi-supervised regression algorithm. There is no folding, but the projections are not close to the ground truth shown in Figure 3 (top-left).


Figure 8 shows 12 labeled images in a 2300-frame sequence of a subject moving his arms. These frames were manually labeled with line segments denoting the upper and lower arms. Figure 9 shows the recovered limb positions for unlabeled samples, some of which were not in the training sequence. Because the raw pixel representation is used, there are very few visual ambiguities between appearance and pose, and occlusions due to crossing arms do not present a problem.

The utility of dynamics is most apparent in articulated tracking. Setting λ_d to zero makes our algorithm ignore dynamics, forcing it to regress on the semi-supervised examples only. The resulting function produced the limb locations shown in black in Figure 9. Using dynamics allows the system to take advantage of the unsupervised points, producing better estimates of limb position.

5.4. Resynthesizing Video

When g is one-to-one, it can be inverted. This inverse function maps the intrinsic representation to images, allowing us to easily create new video sequences by controlling the intrinsic representation. We have explored two different approaches for computing pseudo-inverses of g that do not require g to be exactly one-to-one.

In Figure 7, where we animate the mouth by manipulating its bounding box, the inverse simply returns the training image whose estimated parameter is closest to the desired intrinsic parameter. In Figure 10, where we manipulate limb locations to generate new images, we computed the inverse by fitting a function using Tikhonov regularization to a data set consisting of the training images and their estimated labels. This representation can automatically interpolate between images, allowing us to generate images that do not appear in the training sequence.
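Both pseudo-inverses are straightforward to sketch; the nearest-neighbor variant drives Figure 7 and the regularized regression variant drives Figure 10. Names and bandwidths below are illustrative.

```python
import numpy as np

def nn_inverse(z_query, Z_train, Y_train):
    """Nearest-neighbor inverse: return the training frame whose estimated
    low-dimensional parameter is closest to the query parameter."""
    i = np.argmin(np.sum((Z_train - z_query) ** 2, axis=1))
    return Y_train[i]

def fit_rbf_inverse(Z_train, Y_train, sigma, lam):
    """Regression inverse: Tikhonov-regularized RBF fit from estimated labels
    back to images, which can interpolate frames not in the training set."""
    d2 = np.sum((Z_train[:, None] - Z_train[None, :]) ** 2, axis=2)
    K = np.exp(-d2 / sigma**2)
    C = np.linalg.solve(K + lam * np.eye(len(K)), Y_train)  # one column per pixel

    def inverse(z_query):
        k = np.exp(-np.sum((Z_train - z_query) ** 2, axis=1) / sigma**2)
        return k @ C                       # synthesized image as a pixel vector
    return inverse
```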

6. Conclusion

We have presented a semi-supervised regression algorithm for learning the appearance manifold of a scene from a video sequence. By taking advantage of the dynamics in video sequences, our algorithm learns a function that projects images to a low-dimensional space with semantically meaningful coordinate axes. The pseudo-inverse of this mapping can also be used to generate images from these low-dimensional representations.

We demonstrated our algorithm on lip tracking and articulated body tracking, two domains where the appearance manifold is nonlinear. With very few labeled frames and no preprocessing of the images, we were able to recover poses for the frames in the training sequences as well as outside the training sequences.

References

[1] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, 2002.

[2] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.

[3] M. Brand. Charting a manifold. In Neural Information Processing Systems (NIPS), 2002.

[4] D. L. Donoho and C. Grimes. Hessian eigenmaps: new locally linear embedding techniques for high-dimensional data. Technical Report TR2003-08, Dept. of Statistics, Stanford University, 2003.

[5] G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto. Dynamic textures. International Journal of Computer Vision (IJCV), 51(2):91–109, 2003.

[6] Z. Ghahramani and S. Roweis. Learning nonlinear dynamical systems using an EM algorithm. In Neural Information Processing Systems (NIPS), pages 431–437, 1998.

[7] J. H. Ham, D. D. Lee, and L. K. Saul. Learning high dimensional correspondences from low dimensional manifolds. In ICML, 2003.

[8] O. Jenkins and M. Mataric. A spatio-temporal extension to Isomap nonlinear dimension reduction. In International Conference on Machine Learning (ICML), 2004.

[9] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, pages 282–289. Morgan Kaufmann, San Francisco, CA, 2001.

[10] R. Pless and I. Simon. Using thousands of images of an object. In CVPRIP, 2002.

[11] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

[12] B. Scholkopf, R. Herbrich, A. J. Smola, and R. C. Williamson. A generalized representer theorem. Technical Report 81, NeuroCOLT, 2000.

[13] A. Smola, S. Mika, B. Schoelkopf, and R. C. Williamson. Regularized principal manifolds. Journal of Machine Learning Research, 1:179–209, 2001.

[14] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

[15] H. Valpola and J. Karhunen. An unsupervised ensemble learning method for nonlinear dynamic state-space models. Neural Computation, 14(11):2647–2692, 2002.

[16] K. Q. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. In Computer Vision and Pattern Recognition (CVPR), 2004.

[17] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 2003.


Figure 6. The bounding box of the mouth was annotated for 5 frames of a 2000-frame video. The labeled points (shown in the top row) and the first 1500 frames were used to train our algorithm. The images were not altered in any way before computing the kernel. The parameters of the model were fit using leave-one-out cross validation on the labeled data points. Plotted in the second row are the recovered bounding boxes of the mouth for various frames. The first three examples correspond to unlabeled points in the training set. The tracker is robust to natural changes in lighting, blinking, facial expressions, small movements of the head, and the appearance and disappearance of teeth.

Figure 7. Resynthesized trajectories using radial basis interpolation. The two rows show a uniform walk along two of the coordinate axes of the low-dimensional space. The appearance and disappearance of the tongue is a nonlinearity that is well captured with our model.

Figure 8. The twelve supervised points in the training set for articulated hand tracking (see Figure 9).


Figure 9. The hand and elbow positions were annotated for 12 frames of a 2300-frame video. The labeled points (shown in Figure 8) and the first 1500 frames were used to train our algorithm. The images were not preprocessed in any way. Plotted in white are the recovered positions of the hands and elbows. Plotted in black are the recovered positions when the algorithm is trained without taking advantage of dynamics. Using dynamics improves tracking significantly. The first two rows correspond to unlabeled points in the training set. The last row corresponds to frames in the last 800 frames of the video, which were held out during training.

Figure 10. Resynthesized trajectories using nearest neighbors. Top row: The left hand moving straight up while keeping the right handfixed. Middle row: The same trajectory with the hands switched. Bottom row: Both arms moving in opposite directions at the same time.