Lecture 8: Convolutional Neural Networks on Videos
CSED703R: Deep Learning for Visual Recognition (2016S)
Bohyung Han, Computer Vision Lab.
[email protected]
CNNs on Videos
• Challenges in video processing using CNNs A large number of frames: high computational complexity Variable lengths Temporal dependency of data
• Relevant problems Action detection and recognition Video event detection Object detection and recognition in videos Scene recognition in videos Visual tracking …
Action Recognition
• Classifying actions From images or videos Deep learning vs. shallow learning Slightly different from detection and localization
• Deep learning for action recognition Not yet mature Approaches based on convolutional neural networks and recurrent neural networks
Algorithms based on deep learning have started to outperform methods based on handcrafted features.
Datasets
• UCF‐101 101 classes: 13,320 realistic videos collected from YouTube 5 types: human‐object interaction, body motion only, human‐human interaction, playing musical instruments, sports Three training/testing splits
http://crcv.ucf.edu/data/UCF101.php
Datasets
• Sports‐1M 487 classes, 1,000‐3,000 videos per class 1M YouTube videos The classes are arranged in a manually‐curated taxonomy. Noisy labels due to the automated annotation process
• HMDB‐51 A large human motion database 7000 clips for 51 classes Original and stabilized versions are available. STIP (Space Time Interest Point) features are available.
Representative methods and reported accuracies (Sports‐1M / UCF‐101, %):
• Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice (arXiv 14): iDT with higher‐dimensional encoding; UCF‐101: 87.9
• Large‐scale Video Classification with CNNs (CVPR 14): spatial‐temporal CNN; Sports‐1M: 60.9, UCF‐101: 65.4
• Two‐Stream CNNs for Action Recognition in Videos (NIPS 14): two‐stream CNN fused by SVM; UCF‐101: 88.0
• Long‐Term Recurrent Convolutional Networks for Visual Recognition and Description (CVPR 15): CNN + LSTM; UCF‐101: 82.9
• Beyond Short Snippets: Deep Networks for Video Classification (CVPR 15): two‐stream + temporal feature pooling; Sports‐1M: 72.4, UCF‐101: 88.6
• Learning Spatiotemporal Features with 3D Convolutional Networks (ICCV 15): spatiotemporal 3D‐CNN + iDT; Sports‐1M: 61.1, UCF‐101: 90.4
• Action Recognition with Trajectory‐Pooled Deep‐Convolutional Descriptors (CVPR 15): two‐stream model + iDT; UCF‐101: 91.5
• Actions ∼ Transformations (CVPR 16): UCF‐101: 92.4
Improved Dense Trajectory (iDT)
• Main idea Global motion compensation: removing trajectories caused by camera motion Outlier rejection: removing feature matches on human regions when estimating camera motion
• Feature encoding Extraction of Trajectory, HOG, HOF, and MBH descriptors Bag of trajectory features Fisher vector + PCA
H. Wang, C. Schmid: Action Recognition with Improved Trajectories. ICCV 2013
iDT demonstrates very good accuracy even compared with deep learning approaches.
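The Fisher vector encoding step mentioned above can be sketched in NumPy. This is a minimal first‐order (mean‐gradient) version with random stand‐in GMM parameters and descriptors, not the authors' implementation:

```python
import numpy as np

def fisher_vector(X, weights, means, sigmas):
    """First-order Fisher vector of descriptors X (N x D) w.r.t. a
    diagonal-covariance GMM with K components. Returns a K*D vector."""
    N, D = X.shape
    # Soft assignments gamma[n, k] of each descriptor to each component.
    diff = X[:, None, :] - means[None, :, :]               # N x K x D
    log_p = -0.5 * np.sum(diff**2 / sigmas[None], axis=2) \
            - 0.5 * np.sum(np.log(2 * np.pi * sigmas), axis=1)
    log_w = np.log(weights) + log_p                        # N x K
    gamma = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Gradient w.r.t. the component means, normalized by component weight.
    fv = (gamma[:, :, None] * diff / np.sqrt(sigmas)[None]).sum(axis=0)
    fv /= N * np.sqrt(weights)[:, None]
    fv = fv.ravel()
    # Power and L2 normalization, as is standard for Fisher vectors.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)

rng = np.random.default_rng(0)
N, D, K = 200, 8, 4                    # e.g. PCA-reduced iDT descriptors
X = rng.normal(size=(N, D))
weights = np.full(K, 1.0 / K)
means = rng.normal(size=(K, D))
sigmas = np.ones((K, D))               # diagonal variances
fv = fisher_vector(X, weights, means, sigmas)
print(fv.shape)                        # (32,) = K * D
```

In practice the GMM is fit on PCA‐reduced training descriptors, and the full encoding also includes second‐order (variance) gradients, giving a 2KD‐dimensional vector.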
Large‐Scale Video Classification
• Contributions Construction of a large‐scale dataset, Sports‐1M Quantitative accuracies for several baseline algorithms
• Cons Still worse than the iDT algorithm by a large margin Still no sophisticated scheme for video‐level prediction
A. Karpathy et al.: Large‐scale Video Classification with Convolutional Neural Networks. CVPR 2014
Architectures with Pooling Variations
• Main idea Regards a video as a collection of frames Applies CNNs to a single frame or to multiple frames
Fusion variants: late fusion (two single‐frame streams with a T‐frame gap, 2 frames), early fusion (3D convolutions over multiple frames), and slow fusion (3D convolutions with progressive fusion over multiple frames)
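To make the 3D‐convolution idea behind early and slow fusion concrete, here is a naive single‐channel, valid‐mode sketch (an illustration, not the paper's implementation):

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Naive valid-mode 3D convolution over a (T, H, W) clip with a
    (t, h, w) kernel; each output value pools information across time."""
    T, H, W = clip.shape
    t, h, w = kernel.shape
    out = np.empty((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(clip[i:i+t, j:j+h, k:k+w] * kernel)
    return out

clip = np.ones((4, 5, 5))        # 4 frames of 5x5 "pixels"
kernel = np.ones((2, 3, 3))      # spans 2 frames: temporal fusion
out = conv3d_valid(clip, kernel)
print(out.shape)                 # (3, 3, 3); every entry is 18.0
```

Because the kernel spans several frames, filters can respond to motion patterns, which a per‐frame 2D convolution cannot.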
Multi‐Resolution Approach
• Combination of high‐ and low‐resolution inputs Fovea stream: center crop of each frame Context stream: entire frame at half resolution
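A minimal sketch of how the two streams' inputs can be derived from one frame; following the paper, a 178x178 frame yields an 89x89 input for both streams:

```python
import numpy as np

def fovea_and_context(frame):
    """Split a square frame into a center crop (fovea stream) and a
    2x-downsampled full view (context stream) of the same size."""
    H, W = frame.shape[:2]
    h, w = H // 2, W // 2
    top, left = (H - h) // 2, (W - w) // 2
    fovea = frame[top:top + h, left:left + w]   # center crop, full detail
    context = frame[::2, ::2]                   # whole frame, half resolution
    return fovea, context

frame = np.arange(178 * 178, dtype=float).reshape(178, 178)
fovea, context = fovea_and_context(frame)
print(fovea.shape, context.shape)               # (89, 89) (89, 89)
```

Since both streams process inputs of half the original resolution, the total compute is roughly halved while full detail is kept where it matters most (the center).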
Multi‐Resolution Approach
• Learned features in the first layer
First‐layer filters learned by the context stream and the fovea stream
Results
• Video‐level prediction Random sampling of 20 clips: non‐overlapping and 16 frames long 4 different crops and flips for each clip Simple averaging of clip‐level predictions
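The aggregation scheme above can be sketched as follows; the per‐clip scores here are random placeholders standing in for actual network outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
num_frames, clip_len, num_clips, num_classes = 400, 16, 20, 487

# Sample 20 non-overlapping 16-frame clips from the video.
starts = rng.choice(num_frames // clip_len, size=num_clips,
                    replace=False) * clip_len

# Placeholder per-clip scores: 4 crops/flips per clip, softmax over classes.
logits = rng.normal(size=(num_clips, 4, num_classes))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Video-level prediction: simple averaging over clips and crops.
video_probs = probs.mean(axis=(0, 1))
prediction = int(video_probs.argmax())
print(video_probs.shape)   # (487,)
```

Averaging softmax distributions keeps the result a valid distribution, so the video‐level label is just its argmax.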
• Comparisons Slow fusion and multi‐resolution are helpful.
Two Stream CNNs
• Combination of two complementary sources of information Spatial stream: CNN on a single image (2D convolution) Temporal stream: CNN on multi‐frame optical flow (2D convolution) Fusion: SVM on softmax scores from the two streams
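Besides the SVM on stacked softmax scores, the paper also reports a simpler fusion by averaging, which can be sketched as follows (the logits here are random placeholders):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
num_classes = 101
spatial_logits = rng.normal(size=num_classes)    # from the RGB stream
temporal_logits = rng.normal(size=num_classes)   # from the optical-flow stream

# Late fusion: average the two softmax score vectors.
fused = 0.5 * (softmax(spatial_logits) + softmax(temporal_logits))
prediction = int(fused.argmax())
print(fused.shape)   # (101,); fused scores still sum to ~1
```

The SVM variant instead trains a linear classifier on the concatenated softmax scores of the two streams, which gave slightly better results in the paper.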
K. Simonyan, A. Zisserman: Two‐Stream Convolutional Networks for Action Recognition in Videos. NIPS 2014
Spatial stream: pretrained on ImageNet Temporal stream: trained from scratch
Optical Flow for Temporal Information
• Optical flow stacking Stack the horizontal and vertical flow channels (dx, dy) of L consecutive frames
• Trajectory stacking Inspired by trajectory‐based descriptors Stack flow vectors sampled along motion trajectories
• Bidirectional optical flow Stack forward and backward flow fields together
• Mean flow subtraction Subtract the mean flow vector from each field to suppress global camera motion
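The basic flow‐stacking input, with per‐field mean subtraction, can be sketched as follows (the flow fields are random placeholders for the output of an optical flow estimator):

```python
import numpy as np

def stack_flows(flows_x, flows_y, subtract_mean=True):
    """Build an (H, W, 2L) temporal-stream input from L horizontal and
    L vertical optical-flow fields, optionally subtracting the mean of
    each field to suppress global camera motion."""
    channels = []
    for fx, fy in zip(flows_x, flows_y):
        if subtract_mean:
            fx = fx - fx.mean()
            fy = fy - fy.mean()
        channels.extend([fx, fy])
    return np.stack(channels, axis=-1)

rng = np.random.default_rng(0)
L, H, W = 10, 224, 224
flows_x = [rng.normal(size=(H, W)) for _ in range(L)]
flows_y = [rng.normal(size=(H, W)) for _ in range(L)]
volume = stack_flows(flows_x, flows_y)
print(volume.shape)   # (224, 224, 20) = H x W x 2L
```

The resulting 2L‐channel volume is what the first convolutional layer of the temporal stream consumes.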
Illustration: optical flow stacking vs. trajectory stacking
Temporal Stream
• Input Flow data generation: I_τ ∈ R^(w×h×2L) (L frames, each contributing 2 flow channels) Network input: a 224×224×2L volume subsampled from I_τ
• Multi‐task learning To train the temporal stream from scratch with a small amount of data Trained jointly on UCF‐101 (9.5K videos) and HMDB‐51 (3.5K videos) Two softmax classification layers, one per dataset
• Learned features
96 first‐layer convolutional filters learned on 10 stacked optical flows
Results
Individual ConvNets accuracy on UCF‐101 (split 1)
Mean accuracy (over three splits) on UCF‐101 and HMDB‐51
iDT + CNN
• Main idea Shares merits of handcrafted features and deeply learned features Shallow features: iDT Deep features: A slight variation of two‐stream CNN
L. Wang, Y. Qiao, X. Tang: Action Recognition with Trajectory‐Pooled Deep‐Convolutional Descriptors. CVPR 2015
TDD
• Trajectory‐pooled deep convolutional descriptors Local trajectory‐aligned descriptors computed in a 3D volume around the trajectory (optionally at multiple scales) Feature map normalization Trajectory pooling
• Normalization of feature maps Spatio‐temporal normalization Channel normalization
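A sketch of the two normalizations on a clip's feature maps of shape (T, H, W, C): spatiotemporal normalization divides each channel by its maximum over the whole video volume, while channel normalization divides each position by its maximum across channels:

```python
import numpy as np

def spatiotemporal_norm(fmap, eps=1e-12):
    """Divide each channel by its max over the whole (T, H, W) volume."""
    return fmap / (fmap.max(axis=(0, 1, 2), keepdims=True) + eps)

def channel_norm(fmap, eps=1e-12):
    """Divide each spatio-temporal position by its max across channels."""
    return fmap / (fmap.max(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
fmap = rng.random(size=(8, 14, 14, 64))   # e.g. conv feature maps of a clip
st = spatiotemporal_norm(fmap)
ch = channel_norm(fmap)
print(st.shape, ch.shape)                 # both (8, 14, 14, 64)
```

The first normalization keeps every channel in a comparable range across the video; the second prevents a few strong channels from dominating the pooled descriptor.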
• TDD extraction Sum‐pooling of the normalized feature maps over the 3D volume centered