
Video Stabilization and Face Saliency-based Retargeting

Yinglan Ma1, Qian Lin2, Hongyu Xiong2

1 Department of Computer Science, Stanford University
2 Department of Applied Physics, Stanford University

Abstract

The technology revolution has made it very convenient to record daily life with cellphones and wearable devices. However, hand shake and body movement are likely to occur during capture, which significantly degrades video quality. In this work, we study and implement an algorithm that automatically stabilizes shaky videos. We first calculate the video motion path using feature matching, and then smooth out undesired high-frequency jitter with L1 optimization. The method ensures that the smoothed paths are composed only of constant, linear, and parabolic segments, mimicking the camera motions employed by professional cinematographers. Since human faces are of broad interest and appear in a large number of videos, we further incorporate a face feature detection module for video retargeting. The detected faces in the video also enable many potential applications; in this work we add decoration features, e.g., glasses and hats on the faces.

1. Introduction

Nowadays nearly 2 billion people own smartphones worldwide, and an increasing number of videos are captured by mobile devices. However, videos captured by hand-held devices are often shaky and undirected due to the lack of stabilization equipment on these devices. Even though there are commercial hardware components that can stabilize the device during recording, they are cumbersome and not handy for daily use. Moreover, most hardware stabilization systems only remove high-frequency jitter and are unable to remove low-frequency motion arising from panning shots or walking movements. Such slow motion is particularly problematic in shots intended to track a prominent foreground object or person.

To overcome the above difficulties, we implement a post-processing video stabilization pipeline that aims to remove undesirable high- and low-frequency motions from casually captured videos. Similar to most post-processing video stabilization algorithms, our implementation involves three main steps: (1) estimate the original shaky camera path by feature tracking in the video; (2) calculate a smoothed path, which is cast as a constrained optimization problem; (3) synthesize the stabilized video using the calculated smooth camera path. To reduce high-frequency noise, we use the L1 path optimization method described in [1] to produce smoothed motion composed purely of constant, linear, or parabolic segments, which follows cinematographic rules. To reduce low-frequency swaying in videos containing a person as the central object, we apply an additional constraint on the motion of the facial features. To make the solution approachable, our method uses automatic feature detection and does not require user interaction.

Our video stabilization method is a purely software approach, and can be applied to videos from any camera device or source. Another popular class of mobile video stabilization methods uses the phone's built-in gyroscope to measure the camera path. Our method has the advantage of being applicable to videos from any source, for example online videos, without any prior knowledge of the capturing device or other physical parameters of the scene. Our approach also enables facial retargeting, which can be extended to other kinds of salient features.

2. Previous Work

2.1. Literature

Video stabilization methods can be categorized into three major directions: 2D methods, 3D methods, and motion estimation methods.

2D methods estimate frame-to-frame 2D transformations, and smooth these transformations to create a more stable camera path. Early work by Matsushita et al. [5] applied low-pass filters to smooth the camera trajectories. Gleicher and Liu [4] proposed to create a smooth camera path by inserting linearly interpolated frames. Liu et al. [6] later incorporated subspace constraints in smoothing camera trajectories, but this required longer feature tracks.

3D methods rely on feature tracking to stabilize shaky videos. Buehler et al. [8] utilized projective 3D reconstruction to stabilize videos from uncalibrated cameras. Liu et al. [9] were the first to introduce content-preserving warping in video stabilization. However, 3D reconstruction is difficult and not robust. Liu et al. [6] reduced the problem to smoothing long feature trajectories, and achieved results comparable to 3D-reconstruction-based methods. Goldstein and Fattal [10] proposed an epipolar transfer method to avoid direct 3D reconstruction. Obtaining long feature tracks is often fragile in consumer videos due to occlusion, rapid camera motion, and motion blur. Lee et al. [11] incorporated feature pruning to select more robust feature trajectories and resolve the occlusion issue.

Motion estimation methods calculate the transitions between consecutive frames with overlapping views. To reduce the alignment error due to parallax, Shum and Szeliski [12] imposed local alignment, and Gao et al. [7] introduced a dual-homography model. Liu et al. [13] proposed a mesh-based, spatially-variant homography model to represent the motion between video frames, but the smoothing strategy did not follow cinematographic rules.

Our implementation, based on [1], applies L1-norm optimization to generate a camera path consisting of only constant, linear, and parabolic segments, following the cinematographic principles of professional video production.

2.2. Our Contribution

In this work, we re-implement the L1-norm optimization algorithm [1] to automatically stabilize captured videos, producing a smoothed feature path containing only constant, linear, and parabolic segments. Additionally, to enable the video to retarget on human faces, we use the facial landmark detection algorithm from the OpenFace toolkit [3] to set facial saliency constraints for the path smoothing; the strength of the constraint can be tuned from 0 (no facial retargeting) to 1 (video fixed on the facial features), so we are able to combine video path smoothing and facial retargeting according to specific user needs.

Beyond that, to make our work more fun, we also attach interesting decorations such as hats, glasses, and ties above, on, or below the detected human faces; their transformations follow the movement of the faces in the video.

3. Proposed Method

3.1. L1-Norm Optimized Video Stabilization

In this section, we describe the video stabilization method used in this work.

3.1.1 Norms of Smoothing

When applying a path smoothing algorithm, we should always be careful in choosing the regularization norm, since different norms work best for different error distributions [2].

For error distributions with sharply defined edges or extremes (typified by the uniform distribution), one should use Tchebycheff (L∞) smoothing. For error distributions at the other end of the spectrum, with long tails, one should use L1 smoothing. In between these extremes, for short-tailed distributions such as the normal distribution, least-squares (L2) smoothing appears to be best.
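To make the ranking concrete, here is a small numpy experiment (our illustration, not from [2]): the midrange is the L∞-optimal estimator and wins for uniform errors, the mean (L2) wins for Gaussian errors, and the median (L1) wins for long-tailed Laplace errors.

import numpy as np

rng = np.random.default_rng(0)
n, trials = 101, 2000

def estimation_errors(noise):
    # L2 (mean), L1 (median), and L-infinity (midrange) estimates of a
    # constant signal whose true value is 0, under the given noise samples.
    mean = noise.mean(axis=1)
    median = np.median(noise, axis=1)
    midrange = 0.5 * (noise.max(axis=1) + noise.min(axis=1))
    return [np.abs(e).mean() for e in (mean, median, midrange)]

cases = {"uniform": rng.uniform(-1, 1, (trials, n)),   # sharp-edged errors
         "gaussian": rng.normal(0, 1, (trials, n)),    # short-tailed errors
         "laplace": rng.laplace(0, 1, (trials, n))}    # long-tailed errors
for name, noise in cases.items():
    l2, l1, linf = estimation_errors(noise)
    print(f"{name:8s} L2(mean)={l2:.4f} L1(median)={l1:.4f} Linf(midrange)={linf:.4f}")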

3.1.2 L1-Norm Optimization

From the perspective of a single feature point, the video motion can be viewed as a path of its coordinates (x, y) with respect to the frame number. Since it is difficult to avoid jitter with hand-held devices, we will observe that this path wiggles. Video stabilization then amounts to obtaining new coordinates (x, y) at each frame, and thus a new path with enhanced smoothness. From the perspective of the frames, the task is to smooth the transformations between frames so that the motion of the feature points is minimal. The frame transformation is generalized as an affine transform, covering translational and rotational motion as well as scaling caused by changes in object-camera distance.

We estimate the camera path by first matching features between consecutive frames C_t and C_{t+1}, and then calculating the affine transformation F_{t+1} based on the matching; that is, C_{t+1} = F_{t+1} C_t, where F_{t+1} is estimated from the two sets of feature coordinates C_t and C_{t+1}. In this work, we extract features from each frame (OpenCV function cv::goodFeaturesToTrack) and find their matches in the next frame using the iterative Lucas-Kanade method with pyramids (cv::calcOpticalFlowPyrLK).
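A minimal Python sketch of this motion-estimation step, for illustration (the paper uses the OpenCV C++ API; the parameter values and the use of cv2.estimateAffine2D for the fitting step are our assumptions):

import cv2
import numpy as np

def estimate_interframe_affines(video_path):
    """Estimate the 2x3 affine transform F_{t+1} between consecutive frames."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    transforms = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Detect corners in frame t ...
        pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                      qualityLevel=0.01, minDistance=30)
        # ... and track them into frame t+1 with pyramidal Lucas-Kanade.
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
        good = status.ravel() == 1
        # Robustly fit the 6-DOF affine F_{t+1} mapping C_t to C_{t+1}
        # (error handling for too few matches is omitted in this sketch).
        F, _ = cv2.estimateAffine2D(pts[good], nxt[good])
        transforms.append(F)
        prev_gray = gray
    cap.release()
    return transforms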

We denote the smoothed features as P_t; the original features in frame t are then related to the smoothed ones by P_t = B_t C_t, where B_t is the stabilization/retargeting matrix, transforming the original features into the smoothed ones. Since we only want the smoothed path to contain constant, linear, and parabolic segments, we minimize the first, second, and third derivatives of the smoothed path with weights c = (c_1, c_2, c_3)^T:

O(P) = c_1 |D(P)|_1 + c_2 |D^2(P)|_1 + c_3 |D^3(P)|_1,    (1)

where

|D(P)|_1   = \sum_t |P_{t+1} - P_t|_1 = \sum_t |R_t|_1,
|D^2(P)|_1 = \sum_t |R_{t+1} - R_t|_1,
|D^3(P)|_1 = \sum_t |R_{t+2} - 2 R_{t+1} + R_t|_1.    (2)
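For intuition, Eq. (1) can be evaluated for a sampled path with finite differences; a short numpy sketch (the weights here are illustrative, not necessarily those used in our experiments):

import numpy as np

def l1_path_objective(P, c=(10.0, 1.0, 100.0)):
    """Evaluate Eq. (1) for a sampled camera path.

    P: array of shape (n, d), one row per frame (e.g. d = 6 affine parameters).
    c: weights (c1, c2, c3) for the derivative terms (illustrative values).
    """
    d1 = np.diff(P, n=1, axis=0)  # |D(P)|   terms: P_{t+1} - P_t
    d2 = np.diff(P, n=2, axis=0)  # |D^2(P)| terms
    d3 = np.diff(P, n=3, axis=0)  # |D^3(P)| terms
    return (c[0] * np.abs(d1).sum() +
            c[1] * np.abs(d2).sum() +
            c[2] * np.abs(d3).sum())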

Here the residual is R_t = B_{t+1} F_{t+1} - B_t. Each affine transform

B_t = \begin{bmatrix} b_{11} & b_{12} & t_x \\ b_{21} & b_{22} & t_y \end{bmatrix}    (3)

has 6 degrees of freedom, and we vectorize it as p_t = (b_{11}, b_{12}, b_{21}, b_{22}, t_x, t_y)^T, the parametrization of B_t; correspondingly,

|R_t(p)|_1 = |p_{t+1}^T M(F_{t+1}) - p_t|_1.    (4)

We make use of linear programming (LP) to solve this L1-norm optimization problem. To minimize |R_t(p)|_1 in the LP, we introduce slack variables e_1 ≥ 0 such that -e_1 ≤ R_t(p) ≤ e_1; similarly, e_2 and e_3 bound |R_{t+1}(p) - R_t(p)|_1 and |R_{t+2}(p) - 2 R_{t+1}(p) + R_t(p)|_1, respectively. With e = (e_1, e_2, e_3)^T, the objective of the problem is to minimize c^T e.

In addition, we want to limit how much B_t (or p_t) can deviate from the original path, i.e., the actual shift should stay within the cropping window. We therefore add constraints on the parameters in the LP, of the form lb ≤ U p_t ≤ ub, where U holds the linear-combination coefficients of p_t. The complete L1-minimization LP for the smoothed video path with constraints is summarized below:

Algorithm 1 Summarized LP for the smoothed video path
Input: frame-pair transforms F_t, t = 1, 2, ..., n
Output: update transforms B_t (each B_t can be transformed to p_t)
Minimize: c^T e w.r.t. p = (p_1, p_2, ..., p_n),
where e = (e_1, e_2, e_3)^T, e_i = (e_{i,1}, e_{i,2}, ..., e_{i,n}), c = (c_1, c_2, c_3)^T,
subject to:
1. -e_{1,t} ≤ R_t(p) ≤ e_{1,t}
2. -e_{2,t} ≤ R_{t+1}(p) - R_t(p) ≤ e_{2,t}
3. -e_{3,t} ≤ R_{t+2}(p) - 2 R_{t+1}(p) + R_t(p) ≤ e_{3,t}
4. e_{i,t} ≥ 0
inclusion constraints: lb ≤ U p_t ≤ ub

We use the lpsolve library to model and solve our LP system.
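As a hedged illustration of the same construction, the sketch below solves a simplified, translation-only version of Algorithm 1 with scipy.optimize.linprog instead of lpsolve: it smooths a 1-D path directly (rather than optimizing 6-DOF update transforms B_t) and keeps the result within a crop window around the original path. The weights and crop radius are our choices.

import numpy as np
from scipy.optimize import linprog

def smooth_path_l1(P, crop_radius=30.0, c=(10.0, 1.0, 100.0)):
    """L1 path smoothing as an LP for a 1-D path (e.g. x-translation).

    Variables: the smoothed path S (n values) followed by slack variables
    e1, e2, e3 bounding the 1st/2nd/3rd finite differences of S, exactly
    as in constraints 1-4 of Algorithm 1.
    """
    n = len(P)
    sizes = (n - 1, n - 2, n - 3)
    nvar = n + sum(sizes)
    # Objective: minimize c1*sum(e1) + c2*sum(e2) + c3*sum(e3).
    cost = np.zeros(nvar)
    offset = n
    for ci, m in zip(c, sizes):
        cost[offset:offset + m] = ci
        offset += m

    def diff_matrix(order):
        # Finite-difference operator D^order acting on S.
        D = np.eye(n)
        for _ in range(order):
            D = D[1:] - D[:-1]
        return D

    A, b = [], []
    offset = n
    for order, m in zip((1, 2, 3), sizes):
        DS = np.zeros((m, nvar))
        DS[:, :n] = diff_matrix(order)
        E = np.zeros((m, nvar))
        E[:, offset:offset + m] = np.eye(m)
        A += [DS - E, -DS - E]          #  D^k S <= e_k  and  -D^k S <= e_k
        b += [np.zeros(m), np.zeros(m)]
        offset += m
    # Inclusion constraint: keep S inside the crop window around P; e >= 0.
    bounds = [(p - crop_radius, p + crop_radius) for p in P]
    bounds += [(0, None)] * sum(sizes)
    res = linprog(cost, A_ub=np.vstack(A), b_ub=np.concatenate(b),
                  bounds=bounds, method="highs")
    return res.x[:n]

In the full method, one 6-dimensional parameter vector p_t per frame replaces each scalar S_t, and the residuals R_t(p) of Eq. (4) replace the plain finite differences.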

3.2. Facial Features Detection and Retargeting

In many videos a particular subject, usually a person, is featured. In this case it is important to remove not only fast, jittering camera motions, but also unintended slow panning or swaying that momentarily moves the subject off-center and distracts the viewer. This can be posed as a constraint on the path optimization, requiring the salient features of the subject to stay close to the center region throughout the video.

The first step towards salient-point-preserving video stabilization is salient feature detection and tracking. In particular, it is desirable for the algorithm to automatically recognize and detect these salient features without user input. There are many face detectors available for this task.

We use Constrained Local Neural Fields (CLNF) for facial landmark detection, available in OpenFace; details of the algorithm can be found in [3]. The CLNF algorithm works robustly under varied illumination and is stable across video frames. It outputs a fixed number of facial landmarks, including the face silhouette, the lips, the nose tip, and the eyes, as shown in Fig. 2c. These multiple landmarks allow a more stable and accurate estimate of the facial position. In contrast, other face detectors, for example the OpenCV built-in ones, were observed in our experiments to produce inaccurate bounding boxes and to be unstable across video frames. The detailed facial landmarks from CLNF also enable other post-processing on the video, for example the face decoration described in Section 3.4.
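For reference, a small sketch of collecting the per-frame landmarks into a face-center track, assuming OpenFace's FeatureExtraction tool has been run and has written a CSV with landmark columns x_0..x_67 and y_0..y_67 (column names vary across OpenFace versions; adjust as needed):

import numpy as np
import pandas as pd

def face_centers_from_openface(csv_path, n_landmarks=68):
    """Average the detected 2-D landmarks of each frame into a face center C_{f,t}."""
    df = pd.read_csv(csv_path)
    # Some OpenFace versions pad column names with spaces.
    df.columns = [col.strip() for col in df.columns]
    xs = df[[f"x_{i}" for i in range(n_landmarks)]].to_numpy()
    ys = df[[f"y_{i}" for i in range(n_landmarks)]].to_numpy()
    return np.stack([xs.mean(axis=1), ys.mean(axis=1)], axis=1)  # shape (n_frames, 2)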

After detecting the facial landmarks in each frame t, we estimate the center of the face C_{f,t} by averaging all the landmarks. Let C_0 be the desired position of the face center, for example the center of the frame. Let P_t and S_t be the original and smoothed camera trajectories; the saliency constraint can then be posed as an additional term in the loss function

L_t = (1 - w_s)(S_t - \bar{P}_t)^2 + w_s (S_t - P_t + C_{f,t} - C_0)^2,    (5)

where \bar{P}_t is an average over a window of frames, and w_s is a parameter that adjusts how much weight the saliency constraint has in the optimization. Minimizing L_t then produces the desired smoothed trajectory S_t.
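Since Eq. (5) is quadratic in S_t, setting its gradient to zero gives the per-frame minimizer in closed form, S_t = (1 - w_s) \bar{P}_t + w_s (P_t - C_{f,t} + C_0). A short sketch (the window size is our choice):

import numpy as np

def retargeted_path(P, face_centers, C0, w_s=0.5, window=15):
    """Per-frame minimizer of the loss in Eq. (5).

    P:            original camera trajectory, shape (n, 2)
    face_centers: face centers C_{f,t}, shape (n, 2)
    C0:           desired face position, e.g. the frame center, shape (2,)
    """
    n = len(P)
    Pbar = np.empty_like(P, dtype=float)
    for t in range(n):
        lo, hi = max(0, t - window), min(n, t + window + 1)
        Pbar[t] = P[lo:hi].mean(axis=0)  # windowed average of the original path
    # Closed form from d L_t / d S_t = 0.
    return (1.0 - w_s) * Pbar + w_s * (P - face_centers + C0)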

3.3. Metrics & Characterization

3.3.1 Evaluation of Smoothed Path

For the stabilization problem we are concerned with, it would be inappropriate to simply regard the undesired shaking as a short-tailed normal distribution, so using the L1 norm between each frame pair during minimization is more suitable. In addition, L1 optimization has the property that the resulting solution is sparse, i.e., the computed path has derivatives that are exactly zero for most segments. On the other hand, L2 minimization (in a least-squares sense) tends to produce small but non-zero gradients. Qualitatively, an L2-optimized camera path always retains some small non-zero motion (most likely in the direction of the camera shake), while the L1 optimization we use (on |D(P)|_1, |D^2(P)|_1, and |D^3(P)|_1) creates a path composed only of segments resembling a static camera, (uniform) linear motion, and constant acceleration [1].

Therefore, we compare the L1 norm |D(P)|_1 between the original video feature path and the smoothed one, and use this comparison as the metric in the experiments described below. Specifically, we calculate the average absolute shift between adjacent points on the video feature path, in both the x and y directions, and the average absolute increment of the rotation angle. The same calculations are done for the smoothed path.
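A sketch of this metric computation (our helper, operating on the per-frame path parameters):

import numpy as np

def path_metrics(xs, ys, angles):
    """Average absolute per-frame increments <|dx_t|>, <|dy_t|>, <|da_t|>."""
    return (np.abs(np.diff(xs)).mean(),
            np.abs(np.diff(ys)).mean(),
            np.abs(np.diff(angles)).mean())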

3.3.2 Evaluation of Facial Retargeting

For the facial retargeting part, in addition to comparing the L1 norm |D(P)|_1 of the original video feature path with that of the new one, which captures the smoothing, we are also interested in how well the facial features are targeted. We therefore calculate the average position of the face features with respect to the center of the frame, together with the average absolute position deviation.

3.4. Face Decoration

With per-frame face features detected, we can add fun face decorations to our videos, such as glasses, hats, and mustaches. By incorporating the feature locations, we are able to translate, scale, and rotate the decorations to place them appropriately onto human faces. Since our videos are stabilized and focused on faces, the transitions of the decorations are smoother. Here is an example of how we utilize the feature points in adding decorations.

Adding glasses: we extract the left-eye, right-eye, left-brow, and right-brow feature points to calculate a horizontal eye axis, and use it to estimate the orientation of the glasses. Scale is approximated from the eye distance, and translation depends on the locations of the eye points.
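A sketch of this placement computation (the scale factor and the return convention are our assumptions; the actual compositing, e.g. via cv2.warpAffine with alpha blending, is left to the caller):

import numpy as np

def glasses_transform(left_eye_pts, right_eye_pts, glasses_width):
    """Rotation, scale, and translation for overlaying a glasses sprite.

    left_eye_pts / right_eye_pts: landmark coordinates of each eye,
    arrays of shape (k, 2). glasses_width: sprite width in pixels.
    """
    left = left_eye_pts.mean(axis=0)
    right = right_eye_pts.mean(axis=0)
    axis = right - left                     # horizontal eye axis
    angle = np.arctan2(axis[1], axis[0])    # orientation of the glasses
    eye_dist = np.linalg.norm(axis)
    scale = 2.2 * eye_dist / glasses_width  # 2.2 is an illustrative factor
    center = (left + right) / 2.0           # translation target
    return angle, scale, center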

Since face silhouette feature points are usually less stable, we avoid using those points when adding face decorations.

Screenshots with an added hat and glasses are shown in Figure 4.

4. Experiments

Table 1 lists the algorithm run time on our laptop. The second column lists the time for path smoothing without facial features, and the third column the time for path smoothing with facial features as a saliency constraint. In the latter case, the CLNF facial landmark detection takes up the biggest chunk of time (~45 ms per frame). [1] reported 20 fps on low-resolution video, and 10 fps with un-optimized saliency.

Table 1. Timing per frame of the algorithm. Video resolution 640×360.

                             w/o face   w/ face
motion estimation (ms)         12.1       59.1
optimize camera path (µs)      0.15       0.40
render final result (µs)       2.7        2.7
face decoration (ms)           -          5.7
total (ms)                     15         68
speed (fps)                    67         15

4.1. Video Stabilization

We apply our path smoothing algorithm to shaky videos and observe a significant reduction of jittering. An example output can be found on Youtube.

To visualize the effect of stabilization, we plot the estimated camera trajectory before and after our algorithm in Fig. 1. We also provide a quantitative measurement of the L1 norm |D(P)|_1 before and after smoothing in Table 2. The L1 norm decreases substantially, meaning the abrupt jitters are significantly reduced.

Figure 1. Path before and after (left column) L2-norm smoothing and (right column) L1-norm smoothing. (Top) x-direction. (Middle) y-direction. (Bottom) rotation angle.

Table 2. L1 norm |D(P)|_1 between the original video feature path and the smoothed one, in the x and y directions and the rotation angle.

path        <|Δx_t|>   <|Δy_t|>   <|Δa_t|>
original      1569        857       1.12
smoothed       705        234       0.44

4.2. Facial Retargeting

Our experiments with video stabilization using facial features are shown in Fig. 2. Fig. 2(a) is the original video, which contains slow swaying motion of both the camera and the subject person. Fig. 2(b) is the stabilized output using only camera path smoothing; the slow motion of the subject is still prominent. Fig. 2(c) is the stabilized output using camera path smoothing with a constraint on the motion of the facial features, which stabilizes the subject at the center across frames. Both result videos can be found on Youtube: link 1 and link 2.

As expected, stabilization comes at the price of reduced resolution. The original images are cropped by 20% in Fig. 2(b) and (c) to remove the black margins due to warping. There are still residual margins in Fig. 2(c).

We also quantify the smoothing effect and the facial targeting in Table 3. As the facial saliency constraint ratio ω (the weight w_s in Eq. (5)) increases, both the L1 norm and the absolute position shift drop: the larger ω is, the smoother the video gets, and the more centered the human face is. This is the behavior expected from our algorithm.

4.3. Comparison with State-of-the-art Systems

Since there is no publicly available implementation of previous works, we obtained the original and output videos reported in Grundmann's paper [1], calculated the evaluation metrics described in Section 3.3 on their output video, and present them alongside our results in Table 4. As the comparison shows, our implemented algorithm is comparable to the state-of-the-art system.

4.4. Face Decoration

As described in Section 3.4, we add fun face decorations such as glasses, hats, and mustaches by translating, scaling, and rotating them onto the detected faces using the per-frame feature locations. Since our videos are stabilized and focused on faces, the transitions of the decorations are smoother. Screenshots with an added hat and glasses are shown in Fig. 4.

5. Conclusion & Perspectives

All in all, the video feature path is significantly smoothed by the L1-optimization stabilization algorithm; the L1 norm |D(P)|_1, which quantifies the motion between frames, drops greatly after applying the stabilization.

When the facial retargeting method is included, the video becomes more focused on human faces; the larger the saliency constraint ratio ω is, the more centered the human faces are with respect to the cropped video frame.

Decorations such as glasses, hats, or ties can also be attached to the faces in the video, with the same orientation as the faces. More fun features will be added to make this work fancier in the future.

References

[1] Matthias Grundmann, Vivek Kwatra, and Irfan Essa. Auto-directed video stabilization with robust L1 optimal camera paths. CVPR, 2011.

[2] J. R. Rice and J. S. White. Norms for smoothing and estimation. SIAM Review, 1964.

[3] Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. Constrained Local Neural Fields for robust facial landmark detection in the wild. ICCVW, 2013.

[4] Michael L. Gleicher and Feng Liu. Re-cinematography: improving the camera dynamics of casual video. ACM Multimedia (MM '07), pp. 27-36, 2007.

[5] Yasuyuki Matsushita, Eyal Ofek, Weina Ge, Xiaoou Tang, and Heung-Yeung Shum. Full-frame video stabilization with motion inpainting. IEEE Trans. Pattern Anal. Mach. Intell., 28(7):1150-1163, 2006.

[6] F. Liu, M. Gleicher, J. Wang, H. Jin, and A. Agarwala. Subspace video stabilization. ACM Transactions on Graphics, 30, 2011.

[7] Junhong Gao, Seon Joo Kim, and M. S. Brown. Constructing image panoramas using dual-homography warping. CVPR, pp. 49-56, 2011.

[8] C. Buehler, M. Bosse, and L. McMillan. Non-metric image-based rendering for video stabilization. CVPR, 2001.

[9] Feng Liu, Michael Gleicher, Hailin Jin, and Aseem Agarwala. Content-preserving warps for 3D video stabilization. ACM Transactions on Graphics, 28(3), 2009.

[10] Amit Goldstein and Raanan Fattal. Video stabilization using epipolar geometry. ACM Transactions on Graphics, 31(5):1-10, 2012.

[11] B.-Y. Chen, K.-Y. Lee, W.-T. Huang, and J.-S. Lin. Capturing intention-based full-frame video stabilization. Computer Graphics Forum, 27(7):1805-1814, 2008.

[12] Heung-Yeung Shum and Richard Szeliski. Construction of panoramic image mosaics with global and local alignment. International Journal of Computer Vision, 36(2):101-130, 2000.

[13] Shuaicheng Liu, Lu Yuan, Ping Tan, and Jian Sun. Bundled camera paths for video stabilization. ACM Transactions on Graphics, 32(4), Article 78, 2013.

Figure 2. Demonstration of facial retargeting in video stabilization. The green dot indicates the center of the frame; green lines show the border of the frame. Red dots in (c) indicate the facial landmarks detected by OpenFace [3]; they are intended as a guide to the eye. Both videos can be found on Youtube: (b) and (c).

Table 3. L1 norm |D(P)|_1 of the video feature path in the x and y directions, and average absolute deviation of the face position from the frame center, for different facial saliency constraint ratios ω.

ω           <|Δx_t|>   <|Δy_t|>   <|x - x_center|>   <|y - y_center|>
original      1392        496         32805               5882
0.2           1139        254         32583               4902
0.5            792        234         21568               3433
0.95           221        247          2695               1954

Figure 3. Path smoothing before and after, with facial saliency constraints. (Left column) x-direction. (Right column) y-direction. From top to bottom, the facial constraint ratios ω are 0.2, 0.5, and 0.95, respectively.

Table 4. Comparison between our algorithm and the state-of-the-art method from [1].

method                  <|Δx_t|>   <|Δy_t|>   <|Δa_t|>
state-of-the-art [1]       273        296      0.53693
our algorithm              705        234      0.44387

Figure 4. Face decoration with glasses and hat.
