Action Recognition Overview
Vadim Andronov
Internet of Things Group
Task definition
● Action Recognition
○ Predict the action at the current moment (over some interval of time)
○ Video classification - predict the action for the whole video
○ Simplest model - an image classifier
● Action Detection
○ Consists of two tasks:
■ Detection (localization) of actions
■ Classification of the detected actions
○ Simplest model - an object detector
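The "simplest model" idea above can be sketched as per-frame image classification with score averaging. Everything here is illustrative: `classify_frame` is a hypothetical stand-in for a real CNN classifier.

```python
# Naive video classification: run an image classifier on every frame
# and average the per-frame class scores (the "simplest model" above).
# `classify_frame` is a mock stand-in for a real CNN.

def classify_frame(frame):
    # Mock classifier: returns probabilities for two made-up classes.
    brightness = sum(frame) / len(frame)
    return [brightness, 1.0 - brightness]

def classify_video(frames):
    scores = [classify_frame(f) for f in frames]
    n = len(scores)
    avg = [sum(s[c] for s in scores) / n for c in range(len(scores[0]))]
    return max(range(len(avg)), key=avg.__getitem__)  # argmax class id

video = [[0.9, 0.8], [0.7, 0.6]]  # two tiny "frames"
print(classify_video(video))      # prints 0
```

This baseline ignores temporal order entirely, which is exactly the weakness the next slides motivate.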
Action Examples (Kinetics Dataset)
Datasets
● Kinetics-400: 400 classes of actions, ~300k videos from YouTube
● UCF-101: 101 classes of human actions, 13k clips from YouTube
● HMDB-51: 51 actions, 7k clips from movies
● Sports 1M: 487 sport actions, ~1M clips
● Jester: 27 human gestures
Why not just a classifier?
What is the correct action for this image?
Action Recognition Approaches
● Two-stream models
○ Two-stream convolutional networks for action recognition in videos, 2014
○ Temporal Segment Networks: Towards Good Practices for Deep Action Recognition (TSN),
2016
● 3D models
○ Learning Spatiotemporal Features with 3D Convolutional Networks (C3D), 2014
○ Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset (I3D), 2017
○ Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet (R3D), 2017
○ A Closer Look at Spatiotemporal Convolutions for Action Recognition (R2+1D), 2017
○ Non-local Neural Networks, 2017
● Sequence modeling approaches
○ Action Recognition using Visual Attention, 2015
○ Lightweight Network Architecture for Real-Time Action Recognition
(VTN), ours - 2018
Two-stream approach
● One of the first successful deep learning approaches to the action recognition problem: Two-stream convolutional networks for action recognition in videos, 2014
● Fusion of two AlexNet-based CNNs that work on different modalities: RGB frames and optical flow (OF)
● Techniques to prevent overfitting:
○ Multi-task learning - two datasets
[Figure: optical flow - sparse vs. dense]
Temporal Segment Networks: Towards Good
Practices for Deep Action Recognition (TSN), 2016
● Given a video V, divide it evenly into K segments S_1, ..., S_K
● Then
TSN(T_1, T_2, ..., T_K) = H(G(F(T_1; W), F(T_2; W), ..., F(T_K; W)))
● T_k - a random snippet sampled from segment S_k; F(T_k; W) - the CNN function with weights W (BN-Inception)
● G - an aggregation function (even averaging); H - the prediction function (Softmax)
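The sparse sampling and consensus scheme can be sketched as follows (assumptions: an even segment split, one random snippet per segment, and averaging as the aggregation function, matching the slide's description):

```python
import random

# Sketch of TSN-style sparse sampling and consensus.

def sample_snippets(num_frames, k, rng=random.Random(0)):
    """Split frame indices into k equal segments; pick one index from each."""
    seg_len = num_frames // k
    return [seg * seg_len + rng.randrange(seg_len) for seg in range(k)]

def tsn_consensus(snippet_scores):
    """Average per-snippet class scores (the aggregation function G)."""
    k = len(snippet_scores)
    return [sum(s[c] for s in snippet_scores) / k
            for c in range(len(snippet_scores[0]))]

idx = sample_snippets(num_frames=90, k=3)
print(idx)  # one random frame index per segment
```

A Softmax over the consensus scores (the function H) would then give the final class probabilities.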
3D CNN models
● Introduce higher-dimensional primitives operating on 5D tensors (BxCxTxHxW):
○ 3D convolution
○ 3D pooling
● Consider spatial and temporal information at the same time
● Problems:
○ Higher computational complexity
○ Hard to train
3D Convolution
[Figure: 2D convolution vs. 3D convolution]
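A minimal sketch of the 3D convolution primitive (batch dimension omitted for brevity, so the input is a 4D slice of the 5D BxCxTxHxW tensor; valid padding, stride 1, single filter):

```python
import numpy as np

def conv3d(x, w):
    """x: (C, T, H, W) input; w: (C, kT, kH, kW) single-filter kernel."""
    C, T, H, W = x.shape
    _, kT, kH, kW = w.shape
    out = np.zeros((T - kT + 1, H - kH + 1, W - kW + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                # The kernel slides over time as well as space.
                out[t, i, j] = np.sum(x[:, t:t+kT, i:i+kH, j:j+kW] * w)
    return out

x = np.ones((3, 8, 16, 16))   # C=3 channels, T=8 frames of 16x16
w = np.ones((3, 3, 3, 3))     # a 3x3x3 kernel over 3 channels
print(conv3d(x, w).shape)     # (6, 14, 14)
```

The extra temporal loop makes the cost grow with the kernel's temporal extent, which is the "higher computational complexity" noted above.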
Learning Spatiotemporal Features with 3D
Convolutional Networks (C3D), 2014
● AlexNet-like architecture with 3D primitives
● First introduced 3D networks to address the action recognition problem
Quo Vadis, Action Recognition? A New Model and
the Kinetics Dataset (I3D)
● “Inflated” Inception V1 architecture - 3D CNN (224x224)
● Showed that transfer learning from ImageNet works
● Introduced Kinetics dataset
● Mixed approaches (two-stream and 3D)
● Saturated UCF-101
Can Spatiotemporal 3D CNNs Retrace the History of
2D CNNs and ImageNet (R3D)
● Adapts residual architectures to 3D (112x112)
● Uses Kinetics as a main benchmark
A Closer Look at Spatiotemporal Convolutions for
Action Recognition (R2+1D)
● Compares mixed architectures
● Decomposes the 3D convolutional kernel:
○ Fewer weights
○ Easier to train
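A back-of-envelope weight count illustrates the decomposition: a full txdxd 3D kernel versus a spatial 1xdxd convolution followed by a temporal tx1x1 one. Assumption: the intermediate width M is simply set to c_out here; the R(2+1)D paper instead tunes M so the factorized block matches the 3D block's parameter count.

```python
# Weight counts for a full 3D kernel vs. its (2+1)D factorization.

def weights_3d(c_in, c_out, t, d):
    return c_in * c_out * t * d * d

def weights_2plus1d(c_in, c_out, t, d, m=None):
    m = c_out if m is None else m   # intermediate channel width (assumption)
    spatial = c_in * m * d * d      # 1 x d x d convolution
    temporal = m * c_out * t        # t x 1 x 1 convolution
    return spatial + temporal

print(weights_3d(64, 64, 3, 3))       # 110592
print(weights_2plus1d(64, 64, 3, 3))  # 49152
```

With M fixed to c_out the split uses fewer weights; the extra nonlinearity between the two convolutions is also part of why the factorized net is easier to train.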
Non-local Neural Networks
● Introduces a special non-local block
● Uses a self-attention mechanism to re-weight features
● Can be used in any CNN
● Increases computational complexity
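The re-weighting can be sketched in the block's embedded-Gaussian form, y = softmax(theta(x) phi(x)^T) g(x) plus a residual. Simplification assumed here: the embeddings theta/phi/g are identities (real blocks implement them as 1x1x1 convolutions), and the T*H*W positions are pre-flattened.

```python
import numpy as np

def nonlocal_block(x):
    """x: (N, C) - N flattened positions (T*H*W), C channels."""
    attn = x @ x.T                                  # pairwise similarities (N, N)
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)         # row-wise softmax
    return x + attn @ x                             # residual connection

x = np.random.default_rng(0).normal(size=(5, 4))
y = nonlocal_block(x)
print(y.shape)  # (5, 4): same shape, each position re-weighted by all others
```

The NxN attention matrix is why the slide notes increased computational complexity: cost grows quadratically in the number of positions.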
Sequence modeling methods
● Model sequential connections explicitly via a special NN architecture (e.g. recurrent, 1D-CNN, etc.)
[Figure: RNN cell vs. LSTM cell]
In practice, a more complicated cell is used to allow training on longer sequences
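One step of a standard LSTM cell, the "more complicated cell" in question, can be sketched as follows (weight shapes are the usual stacked-gate convention: hidden size H, input size D):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """x: (D,), h/c: (H,), W: (4H, D), U: (4H, H), b: (4H,)."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0:H])          # input gate
    f = sigmoid(z[H:2*H])        # forget gate
    o = sigmoid(z[2*H:3*H])      # output gate
    g = np.tanh(z[3*H:4*H])      # candidate cell state
    c_new = f * c + i * g        # gated update keeps long-range information
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
D, H = 3, 4
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c,
                 rng.normal(size=(4 * H, D)), rng.normal(size=(4 * H, H)),
                 np.zeros(4 * H))
print(h.shape, c.shape)  # (4,) (4,)
```

The additive cell update c_new = f*c + i*g is what mitigates vanishing gradients compared with a plain RNN cell.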
Action recognition using visual attention, 2015
● Uses attention + RNN
● Re-weights features based on the history
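A rough sketch of that re-weighting, following the paper's general soft-attention scheme (the exact notation here is an assumption): attention weights over spatial locations are predicted from the previous LSTM hidden state, and the frame's feature slices are averaged with those weights.

```latex
\alpha_{t,i} = \frac{\exp\!\big(w_i^\top h_{t-1}\big)}
                    {\sum_{j}\exp\!\big(w_j^\top h_{t-1}\big)},
\qquad
x_t = \sum_{i} \alpha_{t,i}\, X_{t,i}
```

where $h_{t-1}$ is the previous hidden state, $X_{t,i}$ is the CNN feature at spatial location $i$ of frame $t$, and $x_t$ is the attended input fed to the RNN at step $t$.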
Sequential modeling evolution
RNNs (LSTM, GRU) are considered a default starting point for many sequential
modeling tasks, e.g. machine translation, speech recognition, image captioning...
They are simple and do the job, but:
● Sequentiality limits parallelization
● Despite gating, they only capture short-range context
Sequential modeling evolution: CNNs
WaveNet (Speech), ByteNet (Language model), TCN (Many tasks)
[Figure: WaveNet and ByteNet architectures]
Transformer (Tensor2Tensor, Attention Is All You Need)
[Figure: Transformer architecture; English-to-German translation results]
Back to AR: Our approach (Video Transformer)
● Embed each frame using a spatial CNN
● Find temporal relations between the embeddings in a stack of decoder blocks
● A decoder block consists of multi-head self-attention and a 1D-convolutional block with residual connections
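A toy sketch of one such decoder block (dimensions and details are assumptions, not the actual VTN implementation): self-attention across frame embeddings, then a depthwise 1D convolution over time, each wrapped in a residual connection.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decoder_block(x, w_conv):
    """x: (T, C) frame embeddings; w_conv: (k, C) depthwise 1D kernel."""
    T, C = x.shape
    # Self-attention (single head, identity projections for brevity).
    attn = softmax((x @ x.T) / np.sqrt(C))
    x = x + attn @ x                                 # residual 1
    # Depthwise temporal convolution with 'same' padding.
    k = w_conv.shape[0]
    pad = np.pad(x, ((k // 2, k // 2), (0, 0)))
    conv = np.stack([(pad[t:t + k] * w_conv).sum(axis=0) for t in range(T)])
    return x + conv                                  # residual 2

x = np.random.default_rng(0).normal(size=(8, 16))   # 8 frames, 16-dim embeddings
y = decoder_block(x, np.full((3, 16), 1 / 3))       # 3-tap averaging kernel
print(y.shape)  # (8, 16)
```

Unlike the RNN approaches above, every frame attends to every other frame in one step, so the block parallelizes over the whole clip.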
Results (HMDB, UCF)
Results (Kinetics)
The end. Questions are welcome!