Towards Scaling Video Understanding

Serena Yeung, PhD, Stanford, at MLconf Seattle 2017

Mar 17, 2018

Transcript
Page 1

Towards Scaling Video Understanding

Serena Yeung

Page 2
Page 3

YouTube, TV, GoPro, smart spaces

Page 4

State-of-the-art in video understanding

Page 5

State-of-the-art in video understanding: Classification

Classification: 4,800 categories, 15.2% Top-5 error (Abu-El-Haija et al. 2016)

Page 6

State-of-the-art in video understanding: Classification, Detection

Classification: 4,800 categories, 15.2% Top-5 error (Abu-El-Haija et al. 2016)
Detection: tens of categories, ~10-20 mAP at 0.5 overlap (Idrees et al. 2017; Sigurdsson et al. 2016)

Page 7

State-of-the-art in video understanding: Classification, Detection, Captioning

Classification: 4,800 categories, 15.2% Top-5 error (Abu-El-Haija et al. 2016)
Detection: tens of categories, ~10-20 mAP at 0.5 overlap (Idrees et al. 2017; Sigurdsson et al. 2016)
Captioning: just getting started; short clips, niche domains (Yu et al. 2016)

Page 8

Comparing video with image understanding

Page 9

Comparing video with image understanding: Classification

Videos: 4,800 categories, 15.2% Top-5 error
Images: 1,000 categories*, 3.1% Top-5 error (Krizhevsky 2012, Xie 2016)

*Transfer learning widespread

Page 10

Comparing video with image understanding: Classification, Detection

Classification
  Videos: 4,800 categories, 15.2% Top-5 error
  Images: 1,000 categories*, 3.1% Top-5 error (Krizhevsky 2012, Xie 2016)

Detection
  Videos: tens of categories, ~10-20 mAP at 0.5 overlap
  Images: hundreds of categories*, ~60 mAP at 0.5 overlap, pixel-level segmentation (He 2017)

*Transfer learning widespread

Page 11

Comparing video with image understanding: Classification, Detection, Captioning

Classification
  Videos: 4,800 categories, 15.2% Top-5 error
  Images: 1,000 categories*, 3.1% Top-5 error (Krizhevsky 2012, Xie 2016)

Detection
  Videos: tens of categories, ~10-20 mAP at 0.5 overlap
  Images: hundreds of categories*, ~60 mAP at 0.5 overlap, pixel-level segmentation (He 2017)

Captioning
  Videos: just getting started; short clips, niche domains
  Images: dense captioning, coherent paragraphs (Johnson 2016, Krause 2017)

*Transfer learning widespread

Page 12

Comparing video with image understanding: Classification, Detection, Captioning, and beyond

Classification
  Videos: 4,800 categories, 15.2% Top-5 error
  Images: 1,000 categories*, 3.1% Top-5 error (Krizhevsky 2012, Xie 2016)

Detection
  Videos: tens of categories, ~10-20 mAP at 0.5 overlap
  Images: hundreds of categories*, ~60 mAP at 0.5 overlap, pixel-level segmentation (He 2017)

Captioning
  Videos: just getting started; short clips, niche domains
  Images: dense captioning, coherent paragraphs (Johnson 2016, Krause 2017)

Beyond
  Images: significant work on question answering (Yang 2016)

*Transfer learning widespread

Page 13

The challenge of scale

Training labels: video annotation is labor-intensive
Inference: video processing is computationally expensive
Models: the temporal dimension adds complexity

Page 14

The challenge of scale

Training labels: video annotation is labor-intensive
Inference: video processing is computationally expensive
Models: the temporal dimension adds complexity

Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.

Page 15

Task: Temporal action detection

Input: video frames, t = 0 to t = T
Output: temporal intervals of actions (e.g., "Running", "Talking")

Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.

Page 16

Efficient video processing

Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.

Page 17

Our model for efficient action detection

Frame model. Input: a frame (the video spans t = 0 to t = T).

Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.

Page 18

Our model for efficient action detection

Frame model. Input: a frame. Output: detection instance [start, end]; next frame to glimpse.

Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.

Page 19

Our model for efficient action detection

A convolutional neural network (frame information) feeds a recurrent neural network (time information). At each glimpse the model outputs a detection instance [start, end] and the next frame to glimpse. (A rough code sketch follows below.)

Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
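To make the structure concrete, here is a minimal PyTorch-style sketch of such a glimpse-based detector. It is only an illustration under assumed dimensions and module choices (ResNet-18 features, a GRU cell, simple linear heads, an assumed class name GlimpseDetector); it is not the authors' implementation.

    # Minimal sketch of a frame-glimpse detector (illustrative assumptions only).
    import torch
    import torch.nn as nn
    import torchvision.models as models

    class GlimpseDetector(nn.Module):
        def __init__(self, hidden_size=256):
            super().__init__()
            cnn = models.resnet18(weights=None)           # frame information
            self.cnn = nn.Sequential(*list(cnn.children())[:-1])
            self.rnn = nn.GRUCell(512, hidden_size)       # time information
            self.detect_head = nn.Linear(hidden_size, 3)  # [confidence, start, end]
            self.where_head = nn.Linear(hidden_size, 1)   # next frame to glimpse

        def forward(self, video, num_glimpses=8):
            # video: (T, 3, H, W); only a handful of the T frames are ever processed.
            T = video.shape[0]
            h = torch.zeros(1, self.rnn.hidden_size)
            t = 0
            detections = []
            for _ in range(num_glimpses):
                feat = self.cnn(video[t:t + 1]).flatten(1)     # CNN feature of the glimpsed frame
                h = self.rnn(feat, h)                          # update temporal state
                conf, start, end = self.detect_head(h).squeeze(0)
                if torch.sigmoid(conf) > 0.5:                  # optionally emit a detection
                    detections.append((start.item(), end.item()))
                offset = torch.sigmoid(self.where_head(h)).item()
                t = min(T - 1, t + max(1, int(offset * T * 0.1)))  # jump to the next glimpse
            return detections

The key property is that only the glimpsed frames are run through the CNN, so cost scales with the number of glimpses rather than with video length.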

Pages 20-21

(Same content as Page 19.)

Page 22

Our model for efficient action detection

The model steps forward through the video (t = 0 to t = T), emitting successive detection instances [start, end] and choosing the next frame to glimpse at each step.

Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.

Pages 23-26

(Same figure as Page 22, continued: further glimpses and detection outputs as the model moves toward t = T.)

Page 27

Our model for efficient action detection

Recurrent neural network (time information) over convolutional neural network features (frame information), t = 0 to t = T.
Output: next frame to glimpse.
Optional output: detection instance [start, end].

Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.

Page 28

Our model for efficient action detection

• Train differentiable outputs (detection output class and bounds) using standard backpropagation
• Train non-differentiable outputs (where to look next, when to emit a prediction) using reinforcement learning (the REINFORCE algorithm); a minimal sketch of this update follows below
• Achieves detection performance on par with dense sliding-window-based approaches, while observing only 2% of frames

Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.

Page 29

Learned policy in action

Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.

Page 30

Learned policy in action

Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.

Page 31

The challenge of scale

Training labels: video annotation is labor-intensive
Inference: video processing is computationally expensive
Models: the temporal dimension adds complexity

Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.

Page 32

Dense action labeling

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.

Page 33

MultiTHUMOS

• Extends the THUMOS’14 action detection dataset with dense, multilevel, frame-level action annotations for 30 hours across 400 videos

                            THUMOS    MultiTHUMOS
Annotations                 6,365     38,690
Classes                     20        65
Density (labels / frame)    0.3       1.5
Classes per video           1.1       10.5
Max actions per frame       2         9
Max actions per video       3         25

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.

Page 34

Modeling dense, multilabel actions

• Need to reason about multiple potential actions simultaneously
• High degree of temporal dependency
• In standard recurrent models for action recognition, all state is in the hidden layer representation
• At each time step, the model predicts the current frame's labels from the current frame and the previous hidden representation (a sketch of this baseline follows below)

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
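For reference, a minimal sketch of that standard recurrent baseline: one CNN feature in and one multilabel prediction out per time step. The dimensions (e.g., 65 classes, as in MultiTHUMOS) and the name FrameLSTM are illustrative assumptions, not the paper's exact configuration.

    # Standard per-frame recurrent baseline: single input, single output per step,
    # trained with a multilabel sigmoid loss. Illustrative sketch only.
    import torch.nn as nn

    class FrameLSTM(nn.Module):
        def __init__(self, feat_dim=4096, hidden=512, num_classes=65):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.classifier = nn.Linear(hidden, num_classes)

        def forward(self, frame_feats):
            # frame_feats: (batch, T, feat_dim) CNN features, one per frame
            h, _ = self.lstm(frame_feats)
            return self.classifier(h)       # (batch, T, num_classes) logits

    criterion = nn.BCEWithLogitsLoss()      # each frame may carry multiple labels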

Page 35

MultiLSTM

• An extension of the LSTM that expands the temporal receptive field of its input and output connections
• Key idea: providing the model with more freedom in both reading input and writing output reduces the burden placed on the hidden layer representation

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.

Page 36

MultiLSTM

[Figure: a standard LSTM (Donahue 2014) mapping input video frames to frame class predictions over time t]

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.

Page 37

MultiLSTM

[Figure: the standard LSTM reads a single input and writes a single output per time step]

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.

Page 38

MultiLSTM

[Figure: standard LSTM (single input, single output per step) vs. MultiLSTM (multiple inputs, multiple outputs per step), both mapping input video frames to frame class predictions over time t]

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.

Page 39

MultiLSTM

• Multiple inputs (soft attention)
• Multiple outputs (weighted average)
• Multilabel loss

(A rough code sketch of these three pieces follows below.)

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
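The sketch below illustrates the MultiLSTM ideas only in spirit: soft attention over a temporal window of input features, and combining predictions made for the same frame at several time steps (here by a plain average rather than the paper's learned weighting). All dimensions, the window size, and the name MultiLSTMSketch are assumptions.

    # Rough sketch of soft-attention inputs and multi-step outputs per frame.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiLSTMSketch(nn.Module):
        def __init__(self, feat_dim=4096, hidden=512, num_classes=65, window=5):
            super().__init__()
            self.window = window
            self.attn = nn.Linear(feat_dim + hidden, 1)   # soft attention over the input window
            self.cell = nn.LSTMCell(feat_dim, hidden)
            self.classifier = nn.Linear(hidden, num_classes)

        def forward(self, feats):
            # feats: (T, feat_dim) CNN features; returns per-frame logits (T, num_classes)
            T = feats.shape[0]
            h = torch.zeros(1, self.cell.hidden_size)
            c = torch.zeros(1, self.cell.hidden_size)
            sum_logits = torch.zeros(T, self.classifier.out_features)
            counts = torch.zeros(T, 1)
            for t in range(T):
                lo = max(0, t - self.window + 1)
                win = feats[lo:t + 1]                                   # multiple inputs
                scores = self.attn(torch.cat([win, h.expand(win.shape[0], -1)], dim=1))
                alpha = F.softmax(scores, dim=0)
                x = (alpha * win).sum(dim=0, keepdim=True)              # attended input
                h, c = self.cell(x, (h, c))
                logits = self.classifier(h)
                sum_logits[lo:t + 1] += logits                          # write to multiple outputs
                counts[lo:t + 1] += 1
            return sum_logits / counts    # averaged per-frame predictions; pair with BCEWithLogitsLoss

As on Page 34, the per-frame logits would be trained with a multilabel (sigmoid) loss against the dense MultiTHUMOS annotations.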

Page 40

MultiLSTM

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.

Page 41

Retrieving sequential actions

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.

Page 42

Retrieving co-occurring actions

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.

Page 43

The challenge of scale

Training labels: video annotation is labor-intensive
Inference: video processing is computationally expensive
Models: the temporal dimension adds complexity

Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.

Page 44

Labeling videos is expensive

• Takes significantly longer to label a video than an image
• If spatial or temporal bounds are desired, even worse
• How can we practically learn about new concepts in video?

Page 45

Web queries are a source of noisy video labels

Page 46

Image search is much cleaner!

Page 47

Can we effectively learn from noisy web queries?

• Our approach: learn how to select positive training examples from noisy queries in order to train classifiers for new classes
• Use a reinforcement learning-based formulation to learn a data-labeling policy that achieves strong performance on a small, manually labeled dataset of classes
• Then use this policy to automatically label noisy web data for new classes

Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.

Page 48

Balancing diversity vs. semantic drift

• Want diverse training examples to improve the classifier
• But too much diversity can also lead to semantic drift
• Our approach: balance diversity and drift by training labeling policies using an annotated reward set which the policy must successfully classify

Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.

Page 49

Overview of approach

[Figure: From candidate web queries (YouTube autocomplete: "Boomerang", "Boomerang on a beach", "Boomerang music video", ...), the agent picks a query, labels its videos as new positives, adds them to the current positive set, updates the classifier, and updates its state.]

Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.

Page 50

Overview of approach

[Figure: as on Page 49, with the classifier trained against a fixed negative set.]

Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.

Page 51

Overview of approach

[Figure: as on Page 49, with a fixed negative set; the training reward is obtained by evaluating the updated classifier on the annotated reward set.]

(A toy sketch of this loop follows below.)

Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.
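To show the shape of this loop, here is a toy, self-contained Python sketch. The synthetic feature vectors, the greedy query choice, and every name in it are stand-ins; the paper learns the query-selection policy with reinforcement learning rather than the greedy rule used here.

    # Toy sketch of the query-selection loop from the overview figure (not the paper's method).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Fake "videos" as feature vectors, grouped by candidate web query.
    query_pool = {
        "boomerang":             rng.normal(1.0, 1.0, size=(20, 8)),
        "boomerang on a beach":  rng.normal(1.2, 1.0, size=(20, 8)),
        "boomerang music video": rng.normal(-0.5, 1.0, size=(20, 8)),  # mostly noise
    }
    negatives = rng.normal(-1.0, 1.0, size=(60, 8))       # fixed negative set
    reward_x = np.vstack([rng.normal(1.0, 1.0, size=(20, 8)),
                          rng.normal(-1.0, 1.0, size=(20, 8))])
    reward_y = np.array([1] * 20 + [0] * 20)              # small annotated reward set

    positives = np.empty((0, 8))                          # current positive set
    clf = None
    for step in range(3):
        # Agent step: here a greedy choice by current classifier score; the paper
        # instead trains this selection policy with reinforcement learning.
        def score(q):
            vids = query_pool[q]
            return vids.mean() if clf is None else clf.predict_proba(vids)[:, 1].mean()
        query = max(query_pool, key=score)
        positives = np.vstack([positives, query_pool[query]])   # label new positives
        X = np.vstack([positives, negatives])
        y = np.array([1] * len(positives) + [0] * len(negatives))
        clf = LogisticRegression(max_iter=1000).fit(X, y)       # update classifier
        reward = clf.score(reward_x, reward_y)                  # eval on reward set
        print(f"step {step}: picked '{query}', reward = {reward:.2f}")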

Page 52

Sports1M: greedy classifier vs. ours

Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.

Page 53

(Same content as Page 52.)

Page 54

Novel classes: greedy classifier vs. ours

Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.

Page 55

The challenge of scale (recap of Page 43)

Page 56

The challenge of scale (recap of Page 43)

Learning to learn

Page 57

The challenge of scale (recap of Page 43)

Learning to learn
Unsupervised learning

Page 58

Towards Knowledge

Training labels: video annotation is labor-intensive
Inference: video processing is computationally expensive
Models: the temporal dimension adds complexity

From videos to knowledge of the dynamic visual world

Page 59

Collaborators

Olga Russakovsky, Mykhaylo Andriluka, Ning Jin, Vignesh Ramanathan, Liyue Shen, Greg Mori, Fei-Fei Li

Page 60

Thank You