Describing Videos by Exploiting Temporal Structure. Slides by Alberto Montes, Computer Vision Group, April 12th, 2016. [arXiv] [GitXiv] [video] [code] Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville

Describing videos by exploiting temporal structure

Apr 13, 2017

Xavier Giro
Transcript
Page 1: Describing videos by exploiting temporal structure

Describing Videos by Exploiting Temporal Structure

Slides by Alberto Montes, Computer Vision Group, April 12th, 2016

[arXiv] [GitXiv] [video] [code]

Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville

Page 2: Describing videos by exploiting temporal structure

Introduction

Page 3: Describing videos by exploiting temporal structure

Introduction

Goal: Generate captions from videos.

Page 4: Describing videos by exploiting temporal structure

Video Description Generation Framework

Page 5: Describing videos by exploiting temporal structure

Encoder-Decoder Framework

Encoder: Convolutional Neural Network

Basic approach:

Deep CNN over frames

Decoder: Long Short-Term Memory Network
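As a rough illustration of the basic approach, the sketch below collapses per-frame CNN features into one fixed-length context vector by averaging them over time. Mean pooling is an assumption here (the slide only says "Deep CNN over frames"), and `mean_pool_frames` and the toy feature values are hypothetical stand-ins for real CNN outputs.

```python
from typing import List


def mean_pool_frames(frame_features: List[List[float]]) -> List[float]:
    """Average per-frame feature vectors into a single context vector."""
    if not frame_features:
        raise ValueError("need at least one frame")
    dim = len(frame_features[0])
    if any(len(f) != dim for f in frame_features):
        raise ValueError("all frame features must have the same length")
    n = len(frame_features)
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


# Toy stand-in for CNN outputs: 3 frames, 2 features per frame.
features = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
context = mean_pool_frames(features)
print(context)  # [2.0, 2.0]
```

Averaging discards the order of the frames, so a decoder fed only this vector cannot tell which event happened first.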

Page 6: Describing videos by exploiting temporal structure

Long Short Term Memory

Page 7: Describing videos by exploiting temporal structure

Long Short Term Memory

Forget Gate:
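A common way to write the forget gate, assuming the standard LSTM formulation, is a sigmoid over the previous hidden state $h_{t-1}$ and the current input $x_t$. The symbols $W_f$ and $b_f$ are the usual names for the gate's weights and bias, not ones taken from the slide.

```latex
f_t = \sigma\left(W_f \, [h_{t-1}, x_t] + b_f\right)
```

Each component of $f_t$ lies in $(0, 1)$ and scales how much of the previous cell state is kept.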

Page 8: Describing videos by exploiting temporal structure

Long Short Term Memory

Input Gate Layer

New candidates for cell state
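In the standard LSTM formulation, assumed here, the input gate $i_t$ decides which components of the cell state to write to, and $\tilde{c}_t$ holds the candidate values to be written. The weight and bias names ($W_i$, $b_i$, $W_c$, $b_c$) follow the usual convention and are not taken from the slide.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i \, [h_{t-1}, x_t] + b_i\right) \\
\tilde{c}_t &= \tanh\left(W_c \, [h_{t-1}, x_t] + b_c\right)
\end{aligned}
```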

Page 9: Describing videos by exploiting temporal structure

Long Short Term Memory

Update Memory Content:
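The memory update combines the two gates: the forget gate scales the old cell state, and the input gate scales the new candidates. The sketch below assumes the standard element-wise rule c_t = f_t * c_{t-1} + i_t * c~_t; the function name `update_cell_state` and the toy gate values are hypothetical.

```python
from typing import List


def update_cell_state(
    c_prev: List[float],
    f_gate: List[float],
    i_gate: List[float],
    c_candidate: List[float],
) -> List[float]:
    """Element-wise LSTM memory update: c_t = f_t * c_{t-1} + i_t * c~_t."""
    sizes = {len(c_prev), len(f_gate), len(i_gate), len(c_candidate)}
    if len(sizes) != 1:
        raise ValueError("all vectors must have the same length")
    return [
        f * c + i * g
        for f, c, i, g in zip(f_gate, c_prev, i_gate, c_candidate)
    ]


c_prev = [1.0, -2.0]
f_gate = [1.0, 0.0]       # keep the first component, forget the second
i_gate = [0.0, 1.0]       # write only to the second component
c_candidate = [0.5, 0.25]

c_t = update_cell_state(c_prev, f_gate, i_gate, c_candidate)
print(c_t)  # [1.0, 0.25]
```

With the gates saturated at 0 or 1 as in this toy case, the first component is copied unchanged and the second is overwritten by its candidate.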

Page 10: Describing videos by exploiting temporal structure

Long Short Term Memory

E[y_t]: word embedding matrix

Equation labels: input, previous hidden state, weight matrices, context from encoder, bias
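Putting the labels on this slide together, each gate of the decoder LSTM can be written in the following form, sketched here for the input gate. The symbols $W_i$, $U_i$, $A_i$, $b_i$ and $\varphi_t(V)$ (the encoder context at step $t$) are assumed names following the usual notation for a conditioned LSTM. The slide labels the embedded word $E[y_t]$, whereas this sketch assumes the previously generated word $y_{t-1}$ is the input at step $t$.

```latex
i_t = \sigma\left(
    \underbrace{W_i \, E[y_{t-1}]}_{\text{input}}
  + \underbrace{U_i \, h_{t-1}}_{\text{previous hidden state}}
  + \underbrace{A_i \, \varphi_t(V)}_{\text{context from encoder}}
  + \underbrace{b_i}_{\text{bias}}
\right)
```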

Page 11: Describing videos by exploiting temporal structure

Exploiting Temporal Structure

Page 12: Describing videos by exploiting temporal structure

Exploiting Local Features

● Trained for activity recognition.
● Only the conv layers will be used.

Histograms of oriented Gradient

Histograms of oriented Flow

Motion Boundary Histogram

A Spatio-Temporal Convolutional Neural Net

Page 13: Describing videos by exploiting temporal structure

Exploiting Global Structure

Attention Mechanism

Update of attention weights:
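A minimal sketch of soft temporal attention follows, assuming an additive (tanh) scoring function: each frame feature is scored against the decoder's previous hidden state, the scores are normalized with a softmax, and the context is the weighted sum of the frame features. All parameter names (`W`, `U`, `b`, `w`) and the toy values are hypothetical; in a real model they would be learned.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def matvec(m: Matrix, v: Vector) -> Vector:
    return [dot(row, v) for row in m]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax: subtract the max before exponentiating."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(
    h_prev: Vector, frames: List[Vector],
    W: Matrix, U: Matrix, b: Vector, w: Vector,
) -> Tuple[Vector, Vector]:
    """Return (attention weights, context vector) for one decoding step.

    score_i = w . tanh(W h_prev + U v_i + b)
    alpha   = softmax(score)
    context = sum_i alpha_i * v_i
    """
    wh = matvec(W, h_prev)
    scores = []
    for v in frames:
        uv = matvec(U, v)
        hidden = [math.tanh(p + q + r) for p, q, r in zip(wh, uv, b)]
        scores.append(dot(w, hidden))
    alphas = softmax(scores)
    dim = len(frames[0])
    context = [sum(a * v[d] for a, v in zip(alphas, frames)) for d in range(dim)]
    return alphas, context


identity = [[1.0, 0.0], [0.0, 1.0]]
h_prev = [0.5, -0.5]
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one feature vector per frame
alphas, context = attention_context(
    h_prev, frames, W=identity, U=identity, b=[0.0, 0.0], w=[1.0, 1.0]
)
print(alphas, context)
```

Because the weights depend on `h_prev`, they are recomputed at every decoding step, so the decoder can focus on different frames for different words.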

Page 14: Describing videos by exploiting temporal structure

Experiments

Page 15: Describing videos by exploiting temporal structure

Datasets

YouTube2Text

1,970 video clips with multiple descriptions

Training set: 1,200 video clips

Validation set: 100 video clips

DVS

Videos taken from DVDs

49,000 video clips

Training set: 39,000 video clips

Validation set: 5,000 video clips

Test set: 5,000 video clips
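As a quick sanity check on the split sizes listed above, the sketch below confirms that the DVS splits account for all 49,000 clips and computes how many YouTube2Text clips remain after the training and validation sets. Treating that remainder as the test set is an assumption, since the slide does not state a YouTube2Text test size; the dictionary names are hypothetical.

```python
# Split sizes as listed on the slide.
youtube2text = {"total": 1970, "train": 1200, "val": 100}
dvs = {"total": 49000, "train": 39000, "val": 5000, "test": 5000}

# Clips left over in YouTube2Text (assumed here to be held out for testing).
youtube2text_rest = (
    youtube2text["total"] - youtube2text["train"] - youtube2text["val"]
)

# DVS lists all three splits, so they should add up to the total.
dvs_accounted = dvs["train"] + dvs["val"] + dvs["test"] == dvs["total"]

print(youtube2text_rest)  # 670
print(dvs_accounted)      # True
```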

Page 16: Describing videos by exploiting temporal structure

Setup and Training

4 setups:

◉ Basic (2D GoogLeNet CNN)
◉ Local (+ 3D CNN features)
◉ Global (+ temporal attention mechanism)
◉ Local + Global

Training

- Adadelta gradient
- Loss function:
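A common training objective for this kind of encoder-decoder captioning model, and the one assumed in this sketch, is the average log-likelihood of each reference word given the video and the preceding words. Here $N$ is the number of video-description pairs, $t_n$ the length of the $n$-th description, $x^n$ the video, and $\theta$ the model parameters; these symbols are assumed names, not taken from the slide.

```latex
\mathcal{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{t_n}
    \log p\left(y_i^n \mid y_{<i}^n, x^n, \theta\right)
```

Training maximizes this quantity, or equivalently minimizes its negative.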

Page 17: Describing videos by exploiting temporal structure

Results

Page 18: Describing videos by exploiting temporal structure

Evaluation

Page 19: Describing videos by exploiting temporal structure

Evaluation

Page 20: Describing videos by exploiting temporal structure

Conclusions

Propose a 3D CNN to capture local fine-grained motion information.

A temporal attention mechanism to capture global information.

State-of-the-art results on YouTube2Text with a combination of both approaches.

Page 21: Describing videos by exploiting temporal structure

Thank you!

Questions?