Describing Videos by Exploiting Temporal Structure

Slides by Alberto Montes

Computer Vision Group, April 12th, 2016

Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville

[arXiv] http://arxiv.org/abs/1502.08029

[GitXiv] http://gitxiv.com/posts/dKhP7eSaZKEfFmpzt/describing-videos-by-exploiting-temporal-structure

[video] https://www.youtube.com/watch?v=Q6BiLAxJtXk&feature=youtu.be

[code] https://github.com/yaoli/arctic-capgen-vid

Introduction

Goal: Generate captions from videos.

Video Description Generation Framework

Encoder-Decoder Framework

Encoder: Convolutional Neural Network

Basic approach:

Deep CNN over frames
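
A minimal sketch of how per-frame CNN features can be collapsed into one video-level vector. It assumes mean pooling over frames, which is the simplest option and is not spelled out on the slide; the function name `mean_pool` and the toy 4-dimensional features are hypothetical stand-ins for real CNN outputs.

```python
from typing import List


def mean_pool(frame_features: List[List[float]]) -> List[float]:
    """Average per-frame feature vectors into a single video-level vector."""
    if not frame_features:
        raise ValueError("need at least one frame")
    dim = len(frame_features[0])
    if any(len(f) != dim for f in frame_features):
        raise ValueError("all frame feature vectors must have the same length")
    n = len(frame_features)
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


# Toy stand-in for CNN outputs: 3 frames, 4 features each
# (assumption: a real CNN would produce far larger vectors).
frames = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
video_vector = mean_pool(frames)
print(video_vector)
```

Averaging discards the order of the frames, which is the limitation the temporal attention mechanism later in the slides is meant to address.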

Decoder: Long Short-Term Memory Network

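Long Short-Term Memory

Gates and state updates:

● Forget gate
● Input gate layer
● New candidates for cell state
● Update of memory content

Terms in each gate equation:

● E[yt]: embedding of the input word (E is the word embedding matrix)
● Previous hidden state
● Weight matrices
● Context from encoder
● Bias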
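
The gate equations themselves did not survive in the slide text. The block below is a reconstruction based on the standard LSTM formulation with an extra context term, so it should be read as a sketch rather than a copy of the slide. The symbols W, U, A (weight matrices), b (bias) and φ_t(V) (context from the encoder) are notation assumed here. The input word embedding, written E[yt] on the slide, is taken to be the embedding of the previously generated word, hence E[y_{t-1}].

```latex
\begin{aligned}
f_t &= \sigma\left(W_f E[y_{t-1}] + U_f h_{t-1} + A_f \varphi_t(V) + b_f\right)
  && \text{(forget gate)} \\
i_t &= \sigma\left(W_i E[y_{t-1}] + U_i h_{t-1} + A_i \varphi_t(V) + b_i\right)
  && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c E[y_{t-1}] + U_c h_{t-1} + A_c \varphi_t(V) + b_c\right)
  && \text{(new candidates for cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
  && \text{(update of memory content)} \\
o_t &= \sigma\left(W_o E[y_{t-1}] + U_o h_{t-1} + A_o \varphi_t(V) + b_o\right)
  && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t)
  && \text{(hidden state)}
\end{aligned}
```

Every gate sees the same four inputs (embedded input word, previous hidden state, encoder context, bias); only the weights differ.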

Exploiting Temporal Structure

Exploiting Local Features

A Spatio-Temporal Convolutional Neural Net (3D CNN)

Input descriptors:

● Histograms of Oriented Gradients (HOG)
● Histograms of Oriented Flow (HOF)
● Motion Boundary Histograms (MBH)

● Trained for activity recognition.
● Only the conv layers will be used.
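
The three descriptors named on this slide are all orientation histograms: HOG bins image gradients, HOF bins optical-flow vectors, and MBH bins the gradients of each flow component. As a rough illustration only, here is a single-cell, magnitude-weighted gradient orientation histogram. It is a simplified sketch (no cells, blocks, or normalization as in a full HOG pipeline), and the function name `orientation_histogram` and the 8-bin choice are assumptions.

```python
import math
from typing import List


def orientation_histogram(image: List[List[float]], bins: int = 8) -> List[float]:
    """Magnitude-weighted histogram of unsigned gradient orientations (0 to 180 degrees)."""
    hist = [0.0] * bins
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = image[r][c + 1] - image[r][c - 1]  # central difference, horizontal
            gy = image[r + 1][c] - image[r - 1][c]  # central difference, vertical
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue  # flat region, no orientation to vote for
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle / (180.0 / bins)) % bins] += magnitude
    return hist


# Toy patch with a vertical edge: intensity jumps from 0 to 1 between columns 1 and 2.
patch = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
hist = orientation_histogram(patch)
print(hist)
```

For this patch all gradient energy falls into the first bin (horizontal gradient direction), since the intensity only changes from left to right.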

Exploiting Global Structure

Attention Mechanism

Update of attention weights:
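
The update equation is missing from the slide text. A common formulation, assumed here, scores each frame feature v_i against the previous decoder state h_{t-1} as e_i = wᵀ tanh(W h_{t-1} + U v_i + b), normalizes the scores with a softmax to get the weights α_i, and returns the weighted sum of the frame features as the context. The sketch below follows that assumption; all parameter names (`W`, `U`, `b`, `w`) and the toy values are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(h_prev: Vector, frames: List[Vector],
                       W: Matrix, U: Matrix, b: Vector, w: Vector) -> Tuple[Vector, Vector]:
    """Return (attention weights over frames, attended context vector)."""
    Wh = matvec(W, h_prev)  # shared across all frames at this decoding step
    scores = []
    for v in frames:
        Uv = matvec(U, v)
        hidden = [math.tanh(x + y + z) for x, y, z in zip(Wh, Uv, b)]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    alphas = softmax(scores)
    dim = len(frames[0])
    context = [sum(a * v[d] for a, v in zip(alphas, frames)) for d in range(dim)]
    return alphas, context


# Toy example: 3 frames with 2-dimensional features, identity weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
h_prev = [0.5, -0.5]
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas, context = temporal_attention(h_prev, frames, identity, identity, [0.0, 0.0], [1.0, 1.0])
print(alphas, context)
```

Because the weights are recomputed from h_{t-1} at every decoding step, the decoder can focus on different frames for different words of the caption.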

Experiments

Datasets

YouTube2Text

1,970 video clips with multiple descriptions

Training set: 1,200 video clips

Validation set: 100 video clips

Test set: 670 video clips (the remainder)

DVS

Videos taken from DVDs

49,000 video clips

Training set: 39,000 video clips

Validation set: 5,000 video clips

Test set: 5,000 video clips

Setup and Training

4 setups:

◉ Basic (2D GoogLeNet CNN)
◉ Local (+ 3D CNN features)
◉ Global (+ temporal attention mechanism)
◉ Local + Global

Training

- Optimization: Adadelta
- Loss function: negative log-likelihood of the reference descriptions
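
The loss formula is missing from the slide text. A sketch under the usual assumption for caption models: training maximizes the log-probability of each reference word given the previous words and the video, which is the same as minimizing the average negative log-likelihood. The function names and the toy probabilities below are hypothetical.

```python
import math
from typing import List


def caption_nll(word_probs: List[float]) -> float:
    """Negative log-likelihood of one caption.

    word_probs[i] is the model probability of the i-th reference word
    given the previous words and the video.
    """
    return -sum(math.log(p) for p in word_probs)


def dataset_loss(captions: List[List[float]]) -> float:
    """Average negative log-likelihood over all captions."""
    return sum(caption_nll(c) for c in captions) / len(captions)


# Toy example: two captions with made-up per-word probabilities.
loss = dataset_loss([[0.5, 0.5], [0.25]])
print(loss)
```

A perfect model assigns probability 1 to every reference word and gets a loss of 0; lower probabilities on the reference words increase the loss.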

Results

Evaluation

Conclusions

Propose a 3D CNN to capture local fine-grained motion information.

A temporal attention mechanism to capture global information.

State-of-the-art results on YouTube2Text with a combination of both approaches.

Thank you! Questions?
