GESTURE RECOGNITION WITH 3D CNNS
Pavlo Molchanov
Xiaodong Yang
Shalini Gupta
Kihwan Kim
Stephen Tyree
Jan Kautz
April 4-7, 2016 | Silicon Valley
4/6/2016
AGENDA
Motivation
Problem statement
Selecting the best classifier
Online gesture detection and classification
Demos
PROBLEM STATEMENT
Single commodity sensor:
• Gesture recognition
• Skeleton tracking
• Gaze estimation
• Head tracking
No special devices
Example sensors: Kinect v1, SoftKinetic
PROBLEM STATEMENT
We don't: hand model fitting and tracking*
We do: understanding gesture concepts with a classifier (e.g. "Thumb up", "Wave hand")
*http://www.virtualrealityreviewer.com/leap-motion-enters-vr-new-software-product-accessory-preview-what%C2%B9s-next/
SELECTING THE BEST CLASSIFIER VIVA CHALLENGE 2015 organized by UCLA
19 classes, 8 subjects
Driver and passenger
RGB + Depth from Microsoft Kinect
885 gestures in total
Gesture examples: slide 2 fingers left, zoom out, rotate CCW
SELECTING THE BEST CLASSIFIER 3D Convolutional Neural Network
[Figure: RGB and depth input -> 4 x (3D convolution and max-pooling) -> ReLU -> ReLU -> Softmax -> prediction]
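The building block named on this slide, a 3D convolution followed by max-pooling, can be sketched in plain Python. This is a minimal single-channel illustration on nested lists, not the network from the talk: the kernel values, the toy clip, and the function names are assumptions made for the example, and a real network would use many channels and learned kernels.

```python
def conv3d(volume, kernel):
    """Valid (no padding, stride 1) 3D cross-correlation of a T x H x W volume."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def relu3d(volume):
    """Element-wise ReLU."""
    return [[[max(0.0, v) for v in row] for row in plane] for plane in volume]


def max_pool3d(volume, size=2):
    """Non-overlapping max pooling with a size x size x size window."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    out = []
    for t in range(0, T - size + 1, size):
        plane = []
        for y in range(0, H - size + 1, size):
            row = []
            for x in range(0, W - size + 1, size):
                row.append(max(
                    volume[t + dt][y + dy][x + dx]
                    for dt in range(size)
                    for dy in range(size)
                    for dx in range(size)
                ))
            plane.append(row)
        out.append(plane)
    return out


# Toy clip (assumption): 4 frames of 4 x 4 pixels whose intensity grows by 1 per frame.
clip = [[[float(t) for _ in range(4)] for _ in range(4)] for t in range(4)]
# Hypothetical 2 x 1 x 1 kernel that responds to change between consecutive frames.
kernel = [[[-1.0]], [[1.0]]]
features = max_pool3d(relu3d(conv3d(clip, kernel)), size=2)
```

Because the kernel spans the time axis as well as the image plane, the response encodes motion between frames, which a 2D convolution applied frame by frame cannot do.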
SELECTING THE BEST CLASSIFIER First result
Classification accuracy (higher is better)
               HON4D [1]   HOG [2]   3D-CNN
Testing set    58.7%       64.5%     48.3%
Training set                         99.9%
[1] Oreifej and Liu. HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences, CVPR, 2013.
[2] Ohn-Bar and Trivedi, IEEE Trans. on Intelligent Transportation Systems, 2014.
SELECTING THE BEST CLASSIFIER
Recent success in deep learning has benefited from large datasets
ImageNet: 1.5 M examples
VIVA: 885 examples
SELECTING THE BEST CLASSIFIER Training
[Figure: RGB and depth input -> data augmentation -> 3D CNN -> error -> back-propagation update]
SELECTING THE BEST CLASSIFIER Data augmentation
Spatial geometric transformations
Temporal augmentation
Generating new training data (e.g. flip)
[Figure: original vs. augmented frames]
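The augmentation ideas listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the talk: a clip is a list of frames, a frame is a list of pixel rows, the 80% to 120% temporal range is arbitrary, and the `MIRROR_LABEL` table is hypothetical. It is included because mirroring a directional gesture changes its meaning, so the label has to change with it.

```python
import random

# Hypothetical label remapping: mirroring turns a leftward gesture into a rightward one.
MIRROR_LABEL = {"slide left": "slide right", "slide right": "slide left"}


def flip_horizontal(clip):
    """Spatial transformation: mirror every frame left to right."""
    return [[row[::-1] for row in frame] for frame in clip]


def resample_time(clip, n_frames):
    """Temporal augmentation: nearest-neighbour resampling to n_frames."""
    last = len(clip) - 1
    if n_frames == 1:
        return [clip[0]]
    return [clip[round(i * last / (n_frames - 1))] for i in range(n_frames)]


def augment(clip, label, rng):
    """Return one randomly augmented (clip, label) pair."""
    if rng.random() < 0.5:
        clip = flip_horizontal(clip)
        label = MIRROR_LABEL.get(label, label)
    # Play the gesture slightly faster or slower (assumed range: 80% to 120%).
    n_frames = max(2, round(len(clip) * rng.uniform(0.8, 1.2)))
    return resample_time(clip, n_frames), label


# Toy clip (assumption): 5 frames, each a single row of 3 pixels.
clip = [[[t, t + 1, t + 2]] for t in range(5)]
new_clip, new_label = augment(clip, "slide left", random.Random(0))
```

Calling `augment` repeatedly with different random states yields many distinct training samples from each recorded gesture, which is the point of augmentation when only a few hundred examples are available.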
SELECTING THE BEST CLASSIFIER Official challenge results
Classification accuracy (%), higher is better
Harris-3.5D                              36.4
HOG3D                                    44.6
Dense Trajectories                       54
HON4D                                    58.7
HOG+HOG2                                 64.5
NVIDIA (3D-CNN), no data augmentation    48.3
NVIDIA (3D-CNN), with data augmentation  77.5
SELECTING THE BEST CLASSIFIER Speed
FPS, higher is better
Harris-3.5D          0.2
HOG3D                3
Dense Trajectories   18
HON4D                25
HOG+HOG2             50
NVIDIA (3D-CNN)      110 (CPU), +250 (GPU), +400 (cuDNNv4)
SEGMENTED GESTURE CLASSIFICATION
[Timeline: start of the gesture -> end of the gesture -> classification -> decision]
Deciding only after the gesture ends introduces latency
ONLINE GESTURE CLASSIFICATION
[Timeline: classification and decision fall between the start and the end of the gesture]
Deciding before the gesture ends improves feedback and user experience
ONLINE GESTURE CLASSIFICATION R3DCNN
[Figure: video server -> 8-frame clips -> 3D CNN (local motion descriptor) -> RNN (global motion descriptor) -> softmax, repeated for each clip]
Forward recurrence only
Detection and classification
109M parameters
Connectionist Temporal Classification (CTC) for training only
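The clip-by-clip structure of the diagram can be sketched as a loop. This is a toy illustration only: `clip_feature` is a hypothetical stand-in for the 3D CNN, each "frame" is reduced to a single number, and the scalar weights `w_in`, `w_rec`, and `w_out` are invented for the example. What it is meant to show is that, with forward recurrence only, a class distribution is available after every 8-frame clip without waiting for the gesture to end.

```python
import math

CLIP_LEN = 8  # frames per clip, as on the slide


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def clip_feature(clip):
    """Stand-in for the 3D CNN (assumption): mean frame-to-frame change in the clip."""
    diffs = [abs(b - a) for a, b in zip(clip, clip[1:])]
    return sum(diffs) / len(diffs)


def run_online(frames, w_in, w_rec, w_out):
    """Process a stream clip by clip and return one class distribution per clip."""
    hidden = 0.0
    outputs = []
    for start in range(0, len(frames) - CLIP_LEN + 1, CLIP_LEN):
        feature = clip_feature(frames[start:start + CLIP_LEN])
        # Forward recurrence only: the state depends on past clips, never on future ones.
        hidden = math.tanh(w_in * feature + w_rec * hidden)
        outputs.append(softmax([w * hidden for w in w_out]))
    return outputs


# Toy stream (assumption): a still hand, a moving hand, then a still hand again.
frames = [0.0] * 8 + [float(i) for i in range(8)] + [7.0] * 8
# Two hypothetical classes: index 0 = "no gesture", index 1 = "gesture".
outputs = run_online(frames, w_in=2.0, w_rec=0.5, w_out=[-1.0, 1.0])
```

In this toy run the "gesture" probability rises on the second clip, while the motion is happening, and decays afterwards as the recurrent state relaxes.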
ONLINE GESTURE CLASSIFICATION Training loss function
Labeling dynamic gestures is difficult
Labeling per frame is ambiguous
[Figure: input frames with per-frame labels]
Loss function: per-frame negative log-likelihood
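A small sketch of the per-frame negative log-likelihood makes the ambiguity concrete. The probabilities and the two labelings below are invented for illustration: the model output is identical in both cases, yet the loss changes depending on which frame an annotator marks as the start of the gesture.

```python
import math


def per_frame_nll(frame_probs, frame_labels):
    """Sum over frames of -log p(label_t | frame_t)."""
    return -sum(math.log(p[y]) for p, y in zip(frame_probs, frame_labels))


# Hypothetical model output for 4 frames over the classes [nothing, gesture].
frame_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.9, 0.1]]

# Two plausible annotations that differ only in where the gesture starts.
labels_early_start = [0, 1, 1, 0]
labels_late_start = [0, 0, 1, 0]

loss_early = per_frame_nll(frame_probs, labels_early_start)
loss_late = per_frame_nll(frame_probs, labels_late_start)
```

Here `loss_early` is larger than `loss_late`, so the training signal depends on an arbitrary boundary choice rather than on whether the gesture was recognized.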
ONLINE GESTURE CLASSIFICATION Training loss function
Sequence-based training is the solution
Input: [figure: video frames]
Sequence: nothing – slide right – nothing – slide left – nothing
Loss function: Connectionist Temporal Classification (CTC) by A. Graves et al.
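The quantity CTC optimizes can be sketched with the standard forward recursion: the probability of a label sequence is the sum over all per-step alignments that collapse to it once repeats are merged and blanks removed. The example below is a minimal plain-Python version under assumptions: class 0 ("nothing") plays the role of the CTC blank, classes 1 and 2 stand for "slide right" and "slide left", and the per-step probabilities are invented. It is not the training code from the talk.

```python
import math


def ctc_probability(probs, target, blank=0):
    """Probability of `target` under CTC.

    probs[t][k] is the probability of class k at time step t;
    target is the list of non-blank labels in order.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank].
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S = len(ext)

    # alpha[s]: total probability of all alignments ending in ext[s] so far.
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one symbol
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]             # skip the blank between two different labels
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last label or on the trailing blank.
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)


def ctc_loss(probs, target, blank=0):
    """Negative log-likelihood of the target sequence."""
    return -math.log(ctc_probability(probs, target, blank))


# Assumed classes: 0 = nothing (blank), 1 = slide right, 2 = slide left.
uniform = [[1 / 3, 1 / 3, 1 / 3] for _ in range(3)]
p_right_then_left = ctc_probability(uniform, [1, 2])
```

Only the order of gestures is supervised, so no frame-level boundaries have to be annotated. With three uniform time steps, five of the 27 possible alignments collapse to "slide right, slide left", which gives 5/27.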
ONLINE GESTURE CLASSIFICATION Italian sign language recognition
ChaLearn 2014 challenge
RGB-D videos of 20 Italian sign language gesture classes
13K gestures
20 subjects
ONLINE GESTURE CLASSIFICATION Italian sign language recognition
Classification accuracy (%)
Pigou et al.*    97.2
3D-CNN           97.4
3D-CNN + CTC     98.2
Improvement in accuracy: 35%
By seeing only 41% of the gesture
No pre- or post-processing
*L. Pigou et al. Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video
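One reading of the 35% figure that is consistent with the accuracies on this slide is a relative reduction in error rate, going from 97.2% to 98.2% accuracy. This interpretation is an assumption, since the slide does not say how the number was computed. The arithmetic is:

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline's errors removed, with accuracies given in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err


# Assumed pairing: Pigou et al. (97.2%) as baseline, 3D-CNN + CTC (98.2%) as the new result.
reduction = relative_error_reduction(97.2, 98.2)
```

The error drops from 2.8% to 1.8%, a reduction of 1.0 / 2.8, or roughly 36% of the baseline's errors.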
ONLINE GESTURE CLASSIFICATION Car interfaces
In-house database
Media player, navigation, phone
20 subjects, 25 gestures
More information at CVPR2016
Classification accuracy (%)
HOG+HOG2         37
Two-stream CNN   66
SNV              71
iDT              73
C3D              79
Ours             84
Human            88
ONLINE GESTURE CLASSIFICATION
Suitability of hardware for inference:
Latency is critical
[Chart: GPU vs. CPU for image classification and for video classification]
ONLINE GESTURE CLASSIFICATION
NVIDIA TX1 for embedded solutions
A credit-card-sized GPU in your pocket
Our R3DCNN uses only 30% of the GPU
Scalability
CONTRIBUTIONS
Data augmentation substantially helps deep learning on small datasets (48.3% -> 77.5% on VIVA)
R3DCNN gives the best results for sign language and gesture recognition
CTC substantially helps video sequence learning
Scalable enough to run on NVIDIA TX1
THANK YOU