Lecture 8: Convolutional Neural Networks on Videos
CSED703R: Deep Learning for Visual Recognition (2016S)
Bohyung Han, Computer Vision Lab.
[email protected]
CNNs on Videos
• Challenges in video processing using CNNs A large number of frames: high computational complexity Variable lengths Temporal dependency of data
• Relevant problems Action detection and recognition Video event detection Object detection and recognition in videos Scene recognition in videos Visual tracking …
Action Recognition
• Classifying actions From images or videos Deep learning vs. shallow learning Slightly different from detection and localization
• Deep learning for action recognition Not yet mature Approaches based on convolutional neural networks and recurrent neural networks
Algorithms based on deep learning have started to outperform methods based on handcrafted features.
Datasets
• UCF‐101 101 classes: 13,320 realistic videos collected from YouTube 5 types: human‐object interaction, body motion only, human‐human interaction, playing musical instruments, sports Three training/testing splits
http://crcv.ucf.edu/data/UCF101.php
Datasets
• Sports‐1M 487 classes, 1,000‐3,000 videos per class 1M YouTube videos The classes are arranged in a manually‐curated taxonomy. Noisy labels due to the automated annotation process
• HMDB‐51 A large human motion database 7000 clips for 51 classes Original and stabilized versions are available. STIP (Space Time Interest Point) features are available.
Representative methods and reported accuracies (Sports‐1M / UCF‐101, %):
• Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice (arXiv 14): iDT with higher‐dimensional encoding; UCF‐101: 87.9
• Large‐scale Video Classification with CNNs (CVPR 14): spatial‐temporal CNN; Sports‐1M: 60.9, UCF‐101: 65.4
• Two‐Stream CNNs for Action Recognition in Videos (NIPS 14): two‐stream CNN fused by SVM; UCF‐101: 88.0
• Long‐Term Recurrent Convolutional Networks for Visual Recognition and Description (CVPR 15): CNN + LSTM; UCF‐101: 82.9
• Beyond Short Snippets: Deep Networks for Video Classification (CVPR 15): two‐stream + temporal feature pooling; Sports‐1M: 72.4, UCF‐101: 88.6
• Learning Spatiotemporal Features with 3D Convolutional Networks (ICCV 15): spatiotemporal 3D‐CNN + iDT; Sports‐1M: 61.1, UCF‐101: 90.4
• Action Recognition with Trajectory‐Pooled Deep‐Convolutional Descriptors (CVPR 15): two‐stream model + iDT; UCF‐101: 91.5
• Actions ∼ Transformations (CVPR 16): UCF‐101: 92.4
Improved Dense Trajectory (iDT)
• Main idea Global motion compensation: removing trajectories caused by camera motion Outlier rejection: removing feature matches on human regions when estimating camera motion
• Feature encoding Extraction of Trajectory, HOG, HOF, and MBH descriptors Bag of trajectory features Fisher vector + PCA
H. Wang, C. Schmid: Action Recognition with Improved Trajectories. ICCV 2013
iDT demonstrates very good accuracy even compared with deep learning approaches.
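The Fisher vector encoding step mentioned above can be sketched in NumPy. This is a minimal first‐order (mean‐gradient) version with random stand‐in GMM parameters and descriptors, not the authors' implementation:

```python
import numpy as np

def fisher_vector(X, weights, means, sigmas):
    """First-order Fisher vector of descriptors X (N x D) w.r.t. a
    diagonal-covariance GMM with K components. Returns a K*D vector."""
    N, D = X.shape
    # Soft assignments gamma[n, k] of each descriptor to each component.
    diff = X[:, None, :] - means[None, :, :]               # N x K x D
    log_p = -0.5 * np.sum(diff**2 / sigmas[None], axis=2) \
            - 0.5 * np.sum(np.log(2 * np.pi * sigmas), axis=1)
    log_w = np.log(weights) + log_p                        # N x K
    gamma = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Gradient w.r.t. the component means, normalized by component weight.
    fv = (gamma[:, :, None] * diff / np.sqrt(sigmas)[None]).sum(axis=0)
    fv /= N * np.sqrt(weights)[:, None]
    fv = fv.ravel()
    # Power and L2 normalization, as is standard for Fisher vectors.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)

rng = np.random.default_rng(0)
N, D, K = 200, 8, 4                    # e.g. PCA-reduced iDT descriptors
X = rng.normal(size=(N, D))
weights = np.full(K, 1.0 / K)
means = rng.normal(size=(K, D))
sigmas = np.ones((K, D))               # diagonal variances
fv = fisher_vector(X, weights, means, sigmas)
print(fv.shape)                        # (32,) = K * D
```

In practice the GMM is fit on PCA‐reduced training descriptors, and the full encoding also includes second‐order (variance) gradients, giving a 2KD‐dimensional vector.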
Large‐Scale Video Classification
• Contributions Construction of a large‐scale dataset, Sports‐1M Quantitative accuracies for several baseline algorithms
• Cons Still worse than the iDT algorithm by a large margin Still no sophisticated scheme for video‐level prediction
A. Karpathy et al.: Large‐scale Video Classification with Convolutional Neural Networks. CVPR 2014
Architectures with Pooling Variations
• Main idea Regards a video as a collection of frames Applies CNNs to a single frame or to multiple frames
Fusion variants: late fusion (two single‐frame streams with a T‐frame gap, 2 frames), early fusion (3D convolutions over multiple frames), and slow fusion (3D convolutions with progressive fusion over multiple frames)
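To make the 3D‐convolution idea behind early and slow fusion concrete, here is a naive single‐channel, valid‐mode sketch (an illustration, not the paper's implementation):

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Naive valid-mode 3D convolution over a (T, H, W) clip with a
    (t, h, w) kernel; each output value pools information across time."""
    T, H, W = clip.shape
    t, h, w = kernel.shape
    out = np.empty((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(clip[i:i+t, j:j+h, k:k+w] * kernel)
    return out

clip = np.ones((4, 5, 5))        # 4 frames of 5x5 "pixels"
kernel = np.ones((2, 3, 3))      # spans 2 frames: temporal fusion
out = conv3d_valid(clip, kernel)
print(out.shape)                 # (3, 3, 3); every entry is 18.0
```

Because the kernel spans several frames, filters can respond to motion patterns, which a per‐frame 2D convolution cannot.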
Multi‐Resolution Approach
• Combination of high‐ and low‐resolution inputs Fovea stream: center crop of each frame Context stream: entire frame at half resolution
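A minimal sketch of how the two streams' inputs can be derived from one frame; following the paper, a 178x178 frame yields an 89x89 input for both streams:

```python
import numpy as np

def fovea_and_context(frame):
    """Split a square frame into a center crop (fovea stream) and a
    2x-downsampled full view (context stream) of the same size."""
    H, W = frame.shape[:2]
    h, w = H // 2, W // 2
    top, left = (H - h) // 2, (W - w) // 2
    fovea = frame[top:top + h, left:left + w]   # center crop, full detail
    context = frame[::2, ::2]                   # whole frame, half resolution
    return fovea, context

frame = np.arange(178 * 178, dtype=float).reshape(178, 178)
fovea, context = fovea_and_context(frame)
print(fovea.shape, context.shape)               # (89, 89) (89, 89)
```

Since both streams process inputs of half the original resolution, the total compute is roughly halved while full detail is kept where it matters most (the center).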
Multi‐Resolution Approach
• Learned features in the first layer
First‐layer filters learned by the context stream and the fovea stream
Results
• Video‐level prediction Random sampling of 20 clips: non‐overlapping and 16 frames long 4 different crops and flips for each clip Simple averaging of clip‐level predictions
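The aggregation scheme above can be sketched as follows; the per‐clip scores here are random placeholders standing in for actual network outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
num_frames, clip_len, num_clips, num_classes = 400, 16, 20, 487

# Sample 20 non-overlapping 16-frame clips from the video.
starts = rng.choice(num_frames // clip_len, size=num_clips,
                    replace=False) * clip_len

# Placeholder per-clip scores: 4 crops/flips per clip, softmax over classes.
logits = rng.normal(size=(num_clips, 4, num_classes))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Video-level prediction: simple averaging over clips and crops.
video_probs = probs.mean(axis=(0, 1))
prediction = int(video_probs.argmax())
print(video_probs.shape)   # (487,)
```

Averaging softmax distributions keeps the result a valid distribution, so the video‐level label is just its argmax.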
• Comparisons Slow fusion and multi‐resolution are helpful.
Two Stream CNNs
• Combination of two complementary sources of information Spatial stream: CNN on a single image (2D convolution) Temporal stream: CNN on multi‐frame optical flow (2D convolution) Fusion: SVM on softmax scores from the two streams
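Besides the SVM on stacked softmax scores, the paper also reports a simpler fusion by averaging, which can be sketched as follows (the logits here are random placeholders):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
num_classes = 101
spatial_logits = rng.normal(size=num_classes)    # from the RGB stream
temporal_logits = rng.normal(size=num_classes)   # from the optical-flow stream

# Late fusion: average the two softmax score vectors.
fused = 0.5 * (softmax(spatial_logits) + softmax(temporal_logits))
prediction = int(fused.argmax())
print(fused.shape)   # (101,); fused scores still sum to ~1
```

The SVM variant instead trains a linear classifier on the concatenated softmax scores of the two streams, which gave slightly better results in the paper.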
K. Simonyan, A. Zisserman: Two‐Stream Convolutional Networks for Action Recognition in Videos. NIPS 2014
Spatial stream: pretrained on ImageNet Temporal stream: trained from scratch
Optical Flow for Temporal Information
• Optical flow stacking Stack the horizontal and vertical flow channels (dx, dy) of L consecutive frames
• Trajectory stacking Inspired by trajectory‐based descriptors Stack flow vectors sampled along motion trajectories
• Bidirectional optical flow Stack forward and backward flow fields together
• Mean flow subtraction Subtract the mean flow vector from each field to suppress global camera motion
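The basic flow‐stacking input, with per‐field mean subtraction, can be sketched as follows (the flow fields are random placeholders for the output of an optical flow estimator):

```python
import numpy as np

def stack_flows(flows_x, flows_y, subtract_mean=True):
    """Build an (H, W, 2L) temporal-stream input from L horizontal and
    L vertical optical-flow fields, optionally subtracting the mean of
    each field to suppress global camera motion."""
    channels = []
    for fx, fy in zip(flows_x, flows_y):
        if subtract_mean:
            fx = fx - fx.mean()
            fy = fy - fy.mean()
        channels.extend([fx, fy])
    return np.stack(channels, axis=-1)

rng = np.random.default_rng(0)
L, H, W = 10, 224, 224
flows_x = [rng.normal(size=(H, W)) for _ in range(L)]
flows_y = [rng.normal(size=(H, W)) for _ in range(L)]
volume = stack_flows(flows_x, flows_y)
print(volume.shape)   # (224, 224, 20) = H x W x 2L
```

The resulting 2L‐channel volume is what the first convolutional layer of the temporal stream consumes.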
Illustration: optical flow stacking vs. trajectory stacking
Temporal Stream
• Input Flow data generation: I_τ ∈ R^(w×h×2L) (L frames, each contributing 2 flow channels) Network input: a 224×224×2L volume subsampled from I_τ
• Multi‐task learning To train the temporal stream from scratch with a small amount of data Trained jointly on UCF‐101 (9.5K videos) and HMDB‐51 (3.5K videos) Two softmax classification layers, one per dataset
• Learned features
96 first‐layer convolutional filters learned on 10 stacked optical flows
Results
Individual ConvNets accuracy on UCF‐101 (split 1)
Mean accuracy (over three splits) on UCF‐101 and HMDB‐51
iDT + CNN
• Main idea Shares merits of handcrafted features and deeply learned features Shallow features: iDT Deep features: A slight variation of two‐stream CNN
L. Wang, Y. Qiao, X. Tang: Action Recognition with Trajectory‐Pooled Deep‐Convolutional Descriptors. CVPR 2015
TDD
• Trajectory‐pooled deep convolutional descriptors Local trajectory‐aligned descriptors computed in a 3D volume around the trajectory (optionally at multiple scales) Feature map normalization Trajectory pooling
• Normalization of feature maps Spatio‐temporal normalization Channel normalization
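A sketch of the two normalizations on a clip's feature maps of shape (T, H, W, C): spatiotemporal normalization divides each channel by its maximum over the whole video volume, while channel normalization divides each position by its maximum across channels:

```python
import numpy as np

def spatiotemporal_norm(fmap, eps=1e-12):
    """Divide each channel by its max over the whole (T, H, W) volume."""
    return fmap / (fmap.max(axis=(0, 1, 2), keepdims=True) + eps)

def channel_norm(fmap, eps=1e-12):
    """Divide each spatio-temporal position by its max across channels."""
    return fmap / (fmap.max(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
fmap = rng.random(size=(8, 14, 14, 64))   # e.g. conv feature maps of a clip
st = spatiotemporal_norm(fmap)
ch = channel_norm(fmap)
print(st.shape, ch.shape)                 # both (8, 14, 14, 64)
```

The first normalization keeps every channel in a comparable range across the video; the second prevents a few strong channels from dominating the pooled descriptor.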
• TDD extraction Sum‐pooling of the normalized feature maps over the 3D volume centered