FaceTrack: Tracking and summarizing faces from compressed video Hualu Wang, Harold S. Stone*, Shih-Fu Chang Dept. of Electrical Engineering, Columbia University.

Transcript
Page 1: FaceTrack: Tracking and summarizing faces from compressed video

Hualu Wang, Harold S. Stone*, Shih-Fu Chang
Dept. of Electrical Engineering, Columbia University
*NEC Research Institute

Presentation by Andy Rova
School of Computing Science, Simon Fraser University

Page 2: Introduction

FaceTrack is a system for both tracking and summarizing faces in compressed video data.
Tracking: detect faces and trace them through time within video shots.
Summarizing: cluster the faces across video shots and associate them with different people.
Working on compressed video avoids the costly overhead of decoding prior to face detection.

Page 3: System Overview

The FaceTrack system's goals are related to ideas discussed in previous presentations.
A face-based video summary can help users decide whether they want to download the whole video.
The summary also provides good visual indexing information for a database search engine.

Page 4: Problem definition

The goal of the FaceTrack system is to take an input video sequence, generate a list of the prominent faces that appear in the video, and determine the time periods where each of the faces appears.

Page 5: General Approach

Track faces within shots.
Once tracking is done, group faces across video shots into the faces of different people.
Output a list of faces for each sequence; for each face, list the shots where it appears, and when.
Face recognition is not performed: it is very difficult in unconstrained videos due to the broad range of face sizes, numbers, orientations, and lighting conditions.

Page 6: General Approach

Try to work in the compressed domain as much as possible, on MPEG-1 and MPEG-2 videos (used in applications such as digital TV and DVD).
Macroblocks and motion vectors can be used directly in tracking, giving greater computational speed compared to decoding.
Selected frames can always be decoded down to the pixel level for further analysis, for example when grouping faces across shots.

Page 7: MPEG Review

Three types of frame data: intra-frames (I-frames), forward predictive frames (P-frames), and bidirectional predictive frames (B-frames).
Macroblocks are coding units that combine pixel information via the DCT; luminance and chrominance are separated.
P-frames and B-frames are subjected to motion compensation: motion vectors are found and their differences are encoded.

Page 8: System Diagram

[Figure: FaceTrack system block diagram.]

Page 9: Face Tracking

Challenges:
Locations of detected faces may not be accurate, since the face detection algorithm works on 16x16 macroblocks.
There are false alarms and misses.
Multiple faces cause ambiguities when they move close to each other.
The motion approximated by the MPEG motion vectors may not be accurate.
A tracking framework that can handle these issues in the compressed domain is needed.

Page 10: The Kalman Filter

A linear, discrete-time dynamic system is defined by a pair of difference equations (given on the slide; the standard form is reproduced below).
We only have access to a sequence of noisy measurements.
Given this noisy observation data, the problem is to find the optimal estimate of the unknown system state variables.
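For reference, the standard linear state-space form referred to here, in textbook notation (the symbols are the common convention, not necessarily those used in the paper):

    x_{k+1} = F x_k + w_k        (state transition, with process noise w_k)
    z_k     = H x_k + v_k        (measurement, with observation noise v_k)

where x_k is the unknown state vector, z_k the measurement, F the state transition matrix, and H the observation matrix.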

Page 11: The Kalman Filter

The "filter" is actually an iterative algorithm that keeps taking in new observations.
The new states are successively estimated.
The error of the prediction of the observation is called the innovation.
The innovation is amplified by a gain matrix and used as a correction for the state prediction.
The corrected prediction is the new state estimate.
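As a concrete illustration of the predict/correct cycle just described, here is a minimal NumPy sketch; the matrix names (F, H, Q, R) follow textbook convention and are assumptions, not values from the paper.

    import numpy as np

    def kalman_step(x, P, z, F, H, Q, R):
        # Predict the next state and its covariance.
        x_pred = F @ x
        P_pred = F @ P @ F.T + Q
        # Innovation: the error of the prediction of the observation.
        y = z - H @ x_pred
        S = H @ P_pred @ H.T + R                 # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)      # gain matrix
        # Correct the prediction with the gain-amplified innovation.
        x_new = x_pred + K @ y
        P_new = (np.eye(len(x)) - K @ H) @ P_pred
        return x_new, P_new                      # the new state estimate

Calling kalman_step once per observation reproduces the iteration above: predict, compute the innovation, amplify it by the gain, and correct.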

Page 12: The Kalman Filter

In the FaceTrack system, the state vector of the Kalman filter holds the kinematic information of the face: position, velocity (and sometimes acceleration).
The observation vector is the position of the detected face, which may not be accurate.
The Kalman filter lets the system predict and update the position and parameters of the faces.

Page 13: The Kalman Filter

The FaceTrack system uses a 0.1 second time interval for state updates.
This corresponds to every I-frame and P-frame for a typical MPEG GOP ("Group Of Pictures") structure, for example IBBPBBP...: at roughly 30 frames per second, an I- or P-frame occurs every three frames, i.e. about every 0.1 seconds.

Page 14: The Kalman Filter

For I-frames, the face detector results are used directly.
For P-frames, the face detector results are more prone to false alarms.
Instead, P-frame face locations are predicted (approximately) based on the MPEG motion vectors.
These locations are then fed into the Kalman filter as observations (in contrast with previous trackers, which simply assumed that the locations calculated from motion vectors were correct).
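A rough sketch of that idea, not the paper's exact procedure: shift the face box by the average motion vector of the 16x16 macroblocks it overlaps, and hand the result to the Kalman filter as an observation. The function and data layout are hypothetical, and MPEG motion-vector sign and reference-frame conventions are glossed over.

    import numpy as np

    def predict_face_from_motion_vectors(face_box, motion_vectors):
        """face_box: (x, y, w, h) in pixels.
        motion_vectors: dict mapping (mb_row, mb_col) -> (dx, dy) for this P-frame.
        Returns the box shifted by the mean motion of the macroblocks it covers."""
        x, y, w, h = face_box
        covered = [(r, c)
                   for r in range(int(y) // 16, int(y + h) // 16 + 1)
                   for c in range(int(x) // 16, int(x + w) // 16 + 1)]
        vecs = [motion_vectors.get(mb, (0.0, 0.0)) for mb in covered]
        dx, dy = np.mean(vecs, axis=0)
        return (x + dx, y + dy, w, h)   # fed to the Kalman filter as an observation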

Page 15: The Face Tracking Framework

How can new faces be discriminated from previous ones during tracking?
The Mahalanobis distance is a quantitative indicator of how close the new observation is to the prediction.
This helps separate new faces from existing tracks: if the Mahalanobis distance is greater than a certain threshold, the newly detected face is unlikely to belong to a particular existing track.
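A minimal sketch of such a gating test; the threshold is illustrative (a chi-square-style cutoff), not the paper's value.

    import numpy as np

    def mahalanobis_gate(z, z_pred, S, threshold=9.21):
        """Accept observation z for a track whose predicted observation is z_pred
        with innovation covariance S. 9.21 is roughly the chi-square 99% quantile
        for 2 dimensions; the paper's threshold is not given here."""
        d = np.asarray(z) - np.asarray(z_pred)
        dist2 = float(d @ np.linalg.inv(S) @ d)   # squared Mahalanobis distance
        return dist2 <= threshold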

Page 16: The Face Tracking Framework

When two faces move close together, the Mahalanobis distance alone cannot keep track of multiple faces.
Case where a face is missed or occluded: hypothesize the continuation of the face track.
Case of a false alarm, or of faces close together: hypothesize the creation of a new track.
The idea is to wait for new observation data before making the final decision about a track (a toy version of this bookkeeping is sketched below).
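A toy sketch of that wait-and-see bookkeeping, not the paper's algorithm. The gating function could be the Mahalanobis test above; the miss and confirmation thresholds are illustrative.

    from dataclasses import dataclass

    @dataclass
    class TrackHypothesis:
        state: tuple            # last matched or predicted face position
        misses: int = 0         # consecutive frames with no matching detection
        hits: int = 1
        tentative: bool = True  # new tracks stay tentative until confirmed

    def update_tracks(tracks, detections, matches_track, max_misses=3, confirm_hits=3):
        unmatched = list(detections)
        for t in tracks:
            m = next((d for d in unmatched if matches_track(d, t)), None)
            if m is not None:
                t.state, t.misses, t.hits = m, 0, t.hits + 1
                if t.hits >= confirm_hits:
                    t.tentative = False       # enough evidence to keep the track
                unmatched.remove(m)
            else:
                t.misses += 1                 # hypothesize the face is missed/occluded
        survivors = [t for t in tracks if t.misses <= max_misses]
        # Leftover detections may be false alarms or genuinely new faces:
        # start tentative tracks and let later frames decide.
        survivors += [TrackHypothesis(state=d) for d in unmatched]
        return survivors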

Page 17: Intra-shot Tracking Challenges

Multiple hypothesis method (illustrated by a figure on the slide).

Page 18: Kalman Motion Models

The Kalman filter is a framework that can model different types of motion, depending on the system matrices used.
Several models were tested for the paper, with varying results.
Intuition: who pays to research object tracking? The military! Hence many tracking models are based on trajectories unlike those that faces in video are likely to exhibit.
For example, in most commercial video, a human face will not maneuver like a jet or missile.

Page 19: Kalman Motion Models

Four motion models were tested for FaceTrack:
Constant Velocity (CV)
Constant Acceleration (CA)
Correlated Acceleration (AA)
Variable Dimension Filter (VDF)
The testing was done against ground truth consisting of manually identified face centers in each frame.

Page 20: Kalman Motion Models

Rather than go through the whole process in exact detail, the next several slides illustrate the differences between the CV and CA models.
The matrices are also expanded to show how the states are updated.

Page 21: Constant Velocity (CV) Model

[Slide: the CV state equations written in expanded matrix form; a textbook version follows below.]
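For reference, a textbook constant-velocity model for one position axis, with update interval T (0.1 s here); the paper's exact matrices may differ.

    state:        x_k = [ p_k, v_k ]^T          (position and velocity)

    transition:   x_{k+1} = [ 1  T ] x_k + w_k
                            [ 0  1 ]

    observation:  z_k = [ 1  0 ] x_k + v_k      (only the position is observed)

Expanded, this is simply p_{k+1} = p_k + T v_k + noise and v_{k+1} = v_k + noise.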

Page 22: Constant Velocity (CV) Model

[Slide: the expanded CV equations simplified back to compact form.]

Page 23: Constant Velocity (CV) Model

[Slide: further simplification and expansion of the CV model equations.]

Page 24: Constant Acceleration (CA) Model

Acceleration is now added to the state vector, and is explicitly modeled as constants disturbed by random noise.
[Slide: the CA state equations in expanded matrix form; a textbook version follows below.]
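Again for reference, the textbook constant-acceleration form for one axis (an assumption, not necessarily the paper's exact matrices):

    state:        x_k = [ p_k, v_k, a_k ]^T

    transition:   x_{k+1} = [ 1  T  T^2/2 ]
                            [ 0  1  T     ] x_k + w_k
                            [ 0  0  1     ]

    observation:  z_k = [ 1  0  0 ] x_k + v_k

The acceleration components are held constant apart from the random noise w_k.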

Page 25: Constant Acceleration (CA) Model

[Slide: the expanded CA equations simplified back to compact form.]

Page 26: The Correlated Acceleration Model

Replaces the constant accelerations with an AR(1) model.
AR(1): first-order autoregressive, a stochastic process in which the immediately previous value has an effect on the current value (plus some random noise).
Why? There is a strong negative autocorrelation between the accelerations of consecutive frames: positive accelerations tend to be followed by negative accelerations, implying that faces tend to "stabilize".
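In symbols, the acceleration then evolves roughly as an AR(1) process (a sketch; the paper's coefficient and noise model are not given here):

    a_{k+1} = rho * a_k + w_k,    |rho| < 1

with rho negative, so an acceleration in one frame tends to be followed by one of the opposite sign in the next, matching the observed negative autocorrelation.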

Page 27: The Variable Dimension Filter

A system that switches between CV (constant velocity) and CA (constant acceleration) modes.
The dimension of the state vector changes when a maneuver is detected, hence "VDF".
Developed for tracking highly maneuverable targets (probably military jets).

Page 28: Comparison of Motion Models

[Chart: average tracking error of each motion model over the first 16 tracking runs.]

Page 29: Comparison of Motion Models

Why does CV perform best?
The small sampling interval justifies viewing face motion as piecewise linear movements.
A face cannot achieve very high accelerations (as opposed to a jet fighter).
AA also performs well because it fits the nature of the face motion: faces in commercial video exhibit few persistent accelerations (negative autocorrelation).

Page 30: Summarization Across Shots

Select representative frames for the tracked faces; large, frontal-view faces are best.
Decode the representative frames into the pixel domain.
Use clustering algorithms to group the faces into different persons (a toy grouping sketch follows).
Make use of domain knowledge: for example, people do not usually change clothes within a news segment, but often do change outfits within a sitcom episode.
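A toy grouping sketch, only to make the idea concrete: the features, distance metric, and clustering scheme here are placeholders, not the paper's method. face_features is assumed to come from some face descriptor computed on the decoded representative frames.

    import numpy as np

    def group_faces(face_features, threshold):
        """Greedily assign each representative face (a NumPy feature vector) to the
        first group whose mean feature is within `threshold`; otherwise start a new
        group. Returns one group label per face."""
        groups, labels = [], []
        for f in face_features:
            for i, g in enumerate(groups):
                if np.linalg.norm(f - np.mean(g, axis=0)) < threshold:
                    g.append(f)
                    labels.append(i)
                    break
            else:
                groups.append([f])
                labels.append(len(groups) - 1)
        return labels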

Page 31: Simulation Results

[Slide: simulation result figures.]

Page 32: Conclusions & Future Research

FaceTrack is an effective face tracking (and summarization) architecture, within which different detection and tracking methods can be used; it could be updated to use new face detection algorithms or improved motion models.
Based on the results, the CV and AA motion models are sufficient for face motion in commercial video.
Summarization techniques need the most development, followed by optimizing tracking for adverse situations.