GESTURE RECOGNITION WITH 3D CNNS

NVIDIA, on-demand.gputechconf.com/gtc/2016/presentation/s6432-pavlo-mo…

April 4-7, 2016 | Silicon Valley

Pavlo Molchanov

Xiaodong Yang

Shalini Gupta

Kihwan Kim

Stephen Tyree

Jan Kautz

GESTURE RECOGNITION WITH 3D CNNS 4/6/2016

2

AGENDA

Motivation

Problem statement

Selecting the best classifier

Online gesture detection and classification

Demos

3

MOTIVATION

4

GESTURE IS A NATURAL FORM OF COMMUNICATION

photo.elsoar.com

5

SAFE INTERFACES

@ bmw.com

6

THE NEED FOR VIDEO RELAY SERVICES

@ http://relayservice.gov.au/

7

GAMING

@ leapmotion

8

PROBLEM STATEMENT

9

PROBLEM STATEMENT

Single commodity sensor:

• Gesture recognition

• Skeleton tracking

• Gaze estimation

• Head tracking

No special devices

Kinect v1

SoftKinetic

10

PROBLEM STATEMENT

We don't: Hand model fitting and tracking*

We do: Understanding gesture concepts

Classifier output examples: Thumb up, Wave hand

*http://www.virtualrealityreviewer.com/leap-motion-enters-vr-new-software-product-accessory-preview-what%C2%B9s-next/


12

SELECTING THE BEST CLASSIFIER

13

SELECTING THE BEST CLASSIFIER

VIVA CHALLENGE 2015 organized by UCLA

19 classes, 8 subjects

Driver and passenger

RGB + Depth from Microsoft Kinect

885 gestures in total

14






Gesture examples:

• Slide 2 fingers left

• Zoom out

• Rotate CCW (counterclockwise)

17

SELECTING THE BEST CLASSIFIER

3D Convolutional Neural Network

[Diagram: RGB and Depth input, four stages of 3D convolution and max-pooling, two ReLU layers, Softmax, Prediction]
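The building block named in the diagram, 3D convolution followed by max-pooling, can be sketched in a few lines of NumPy. This is only an illustration, not the network from the talk: the clip size, the single 3x3x3 filter, the pooling size, and the placement of the ReLU right after the convolution are all assumptions made for the example.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D cross-correlation of a (T, H, W) volume with a (t, h, w) kernel."""
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i + t, j:j + h, k:k + w] * kernel)
    return out

def max_pool3d(volume, size=2):
    """Non-overlapping 3D max-pooling; elements that do not fill a window are dropped."""
    T, H, W = (d // size for d in volume.shape)
    trimmed = volume[:T * size, :H * size, :W * size]
    return trimmed.reshape(T, size, H, size, W, size).max(axis=(1, 3, 5))

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
clip = rng.standard_normal((8, 16, 16))   # hypothetical clip: 8 frames, 16x16 pixels, one channel
kernel = rng.standard_normal((3, 3, 3))   # one spatio-temporal filter (time x height x width)
features = max_pool3d(relu(conv3d_valid(clip, kernel)))
print(features.shape)  # (3, 7, 7)
```

Unlike a 2D convolution applied frame by frame, the filter also spans the time axis, so a single response depends on how the image changes across neighbouring frames.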

18

SEGMENTED GESTURE CLASSIFICATION

Training

[Diagram: RGB and Depth input, 3D CNN, error; back propagation updates the network]
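The training loop in the diagram (forward pass, error, back propagation, update) can be illustrated on a much smaller model. The sketch below trains a single linear softmax layer on random feature vectors; it stands in for the 3D CNN only to show the error and update steps. The feature size, batch size, and learning rate are assumptions, and only the 19 classes come from the VIVA description.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_step(W, x, y, lr=0.1):
    """One gradient-descent step on the mean cross-entropy of a linear softmax layer."""
    p = softmax(x @ W)
    loss = -np.mean(np.log(p[np.arange(len(y)), y]))
    grad_logits = p.copy()
    grad_logits[np.arange(len(y)), y] -= 1.0   # error signal: prediction minus one-hot target
    grad_W = x.T @ grad_logits / len(y)        # back-propagate the error to the weights
    return W - lr * grad_W, loss               # update

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 10))   # hypothetical: 32 feature vectors of length 10
y = rng.integers(0, 19, size=32)    # 19 gesture classes
W = np.zeros((10, 19))
losses = []
for _ in range(50):
    W, loss = train_step(W, x, y)
    losses.append(loss)
```

With only 32 examples the loss on the training data keeps falling, which is the same overfitting risk the talk raises for a dataset of 885 gestures.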

19

SELECTING THE BEST CLASSIFIER

First result

Classification accuracy (higher is better)

                HON4D [1]   HOG [2]   3D-CNN
Testing set     58.7%       64.5%     48.3%
Training set                          99.9%

[1] Oreifej and Liu. HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences, CVPR, 2013

[2] Ohn-Bar and Trivedi, IEEE Trans. on Intelligent Transportation Systems, 2014.

20


ImageNet: 1.5 M examples

VIVA: 885 examples

Recent successes in deep learning have benefited from large datasets

21

SELECTING THE BEST CLASSIFIER

Training

[Diagram: RGB and Depth input, 3D CNN, error; back propagation updates the network]

22

SELECTING THE BEST CLASSIFIER

Training

[Diagram: RGB and Depth input, data augmentation, 3D CNN, error; back propagation updates the network]

23

SELECTING THE BEST CLASSIFIER

Data augmentation

Spatial geometric transformations

Temporal augmentation

Generating new training data
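The two augmentation families listed above can be sketched for a clip stored as a (frames, height, width) array. This is a minimal example under assumptions: the specific transformations (mirror, translation, nearest-neighbour temporal resampling) and their parameters are illustrative choices, not the exact set used in the talk.

```python
import numpy as np

def flip_horizontal(clip):
    """Spatial: mirror every frame left-to-right; clip has shape (T, H, W)."""
    return clip[:, :, ::-1]

def random_shift(clip, max_shift, rng):
    """Spatial: translate all frames by one random offset, padding with zeros."""
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    T, H, W = clip.shape
    out = np.zeros_like(clip)
    out[:, max(0, dy):min(H, H + dy), max(0, dx):min(W, W + dx)] = \
        clip[:, max(0, -dy):min(H, H - dy), max(0, -dx):min(W, W - dx)]
    return out

def temporal_resample(clip, num_frames):
    """Temporal: nearest-neighbour resampling of the time axis to num_frames frames."""
    idx = np.round(np.linspace(0, clip.shape[0] - 1, num_frames)).astype(int)
    return clip[idx]

rng = np.random.default_rng(0)
clip = np.arange(8 * 4 * 4, dtype=float).reshape(8, 4, 4)   # hypothetical 8-frame clip
augmented = temporal_resample(random_shift(flip_horizontal(clip), 1, rng), 8)
```

One caveat the sketch does not handle: mirroring changes the meaning of directional gestures (a mirrored "slide left" looks like "slide right"), so labels may need to be swapped when a flip is applied.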

[Figures: pairs of original and augmented example clips; one of the augmentations shown is a flip]

31


VIVA: 885 examples

Augmented: 0.3 M examples

32

SELECTING THE BEST CLASSIFIER

Official challenge results

Classification accuracy (%), chart axis 0 to 80

Harris-3.5D                               36.4
HOG3D                                     44.6
Dense Trajectories                        54
HON4D                                     58.7
HOG+HOG2                                  64.5
NVIDIA (3D-CNN), no data augmentation     48.3


33

SELECTING THE BEST CLASSIFIER

Official challenge results

Classification accuracy (%), chart axis 0 to 80

Harris-3.5D                                 36.4
HOG3D                                       44.6
Dense Trajectories                          54
HON4D                                       58.7
HOG+HOG2                                    64.5
NVIDIA (3D-CNN), no data augmentation       48.3
NVIDIA (3D-CNN), with data augmentation     77.5

34

SELECTING THE BEST CLASSIFIER

Speed

FPS (higher is better), chart axis 0 to 900

Harris-3.5D            0.2
HOG3D                  3
Dense Trajectories     18
HON4D                  25
HOG+HOG2               50
NVIDIA (3D-CNN)        110 (CPU), +250 (GPU), +400 (cuDNNv4)

35

SEGMENTED GESTURE CLASSIFICATION

[Timeline: the gesture runs from "Start of the gesture" to "End of the gesture"; classification and the decision come after the end]

Making the decision after the gesture ends introduces latency

36

ONLINE GESTURE DETECTION AND CLASSIFICATION

37

ONLINE GESTURE CLASSIFICATION

[Timeline: the gesture runs from "Start of the gesture" to "End of the gesture"; classification and the decision happen before the end]

Making the decision before the gesture ends improves feedback and user experience
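One simple way to picture an early decision is to emit a class as soon as the classifier is confident enough, rather than waiting for the last frame. The sketch below is an assumption for illustration only: the slides do not state a decision rule, and the threshold value and the probability table are made up.

```python
import numpy as np

def first_confident_decision(probs, threshold=0.9):
    """Return (step_index, class_index) for the first step whose top class
    probability reaches the threshold, or None if no step does."""
    for t, p in enumerate(probs):
        c = int(np.argmax(p))
        if p[c] >= threshold:
            return t, c
    return None

# Hypothetical class probabilities at four successive steps of one gesture.
probs = np.array([
    [0.40, 0.35, 0.25],
    [0.30, 0.60, 0.10],
    [0.05, 0.93, 0.02],
    [0.01, 0.98, 0.01],
])
decision = first_confident_decision(probs)
print(decision)  # (2, 1)
```

Here the decision is available at the third of four steps, so the interface can respond before the gesture has finished.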

38

ONLINE GESTURE CLASSIFICATION

R3DCNN

[Diagram: Video server, 8-frame clips, 3D CNN per clip (local motion descriptor), RNN across clips (global motion descriptor), softmax per clip]

Forward recurrence only

Detection and classification

109M parameters

Connectionist Temporal Classification (CTC) for training only
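The "forward recurrence only" property can be sketched as a plain recurrent layer that consumes one clip descriptor at a time and emits class probabilities after each clip. This is a minimal sketch, not the R3DCNN itself: the descriptors are random stand-ins for 3D CNN outputs, the tanh cell and all sizes are assumptions, and the extra "no gesture" class is a hypothetical choice for the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def run_forward_rnn(clip_features, Wx, Wh, Wo):
    """Process clip descriptors in temporal order; emit class probabilities after each clip."""
    h = np.zeros(Wh.shape[0])
    outputs = []
    for f in clip_features:              # one local descriptor per 8-frame clip
        h = np.tanh(Wx @ f + Wh @ h)     # hidden state accumulates context from past clips only
        outputs.append(softmax(Wo @ h))
    return np.array(outputs)

rng = np.random.default_rng(0)
feat_dim, hidden_dim, num_classes = 16, 8, 26   # hypothetical sizes: 25 gestures plus "no gesture"
clip_features = rng.standard_normal((5, feat_dim))
Wx = rng.standard_normal((hidden_dim, feat_dim)) * 0.1
Wh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
Wo = rng.standard_normal((num_classes, hidden_dim)) * 0.1
probs = run_forward_rnn(clip_features, Wx, Wh, Wo)
```

Because the recurrence never looks ahead, the output for a clip is fixed as soon as that clip arrives, which is what allows a prediction while the gesture is still in progress.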

39

ONLINE GESTURE CLASSIFICATION

Training loss function

Labeling dynamic gestures is difficult

Labeling per frame is ambiguous

[Figure: input video frames and their per-frame labels]

Loss function: Per frame negative log likelihood
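The per-frame negative log likelihood can be written out directly. The sketch assumes the network outputs a probability vector per frame and that every frame has exactly one integer label, which is the requirement the slide calls ambiguous; the numbers are made up.

```python
import numpy as np

def per_frame_nll(probs, labels):
    """Mean negative log-likelihood of the labelled class at each frame.
    probs: (T, C) class probabilities; labels: (T,) integer class per frame."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked)))

# Hypothetical probabilities for three frames and three classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6]])
labels = np.array([0, 1, 2])
loss = per_frame_nll(probs, labels)
```

Every frame contributes equally to this loss, so a frame near a gesture boundary that is labelled one way or the other by an annotator penalises the network just as much as a clear-cut frame.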

40

ONLINE GESTURE CLASSIFICATION

Training loss function

Sequence-based training is the solution

Input: [video frames]

Sequence: nothing – slide right – nothing – slide left – nothing

Loss function: Connectionist Temporal Classification (CTC) by A. Graves et al.
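The CTC loss sums the probability of every frame-level labelling that collapses to the target sequence, so no per-frame labels are needed. Below is a minimal sketch of the standard CTC forward algorithm in NumPy. Treating "nothing" as the CTC blank (class 0) is an assumption of this example, and the code works in plain probabilities rather than log space, so it is only suitable for short sequences.

```python
import numpy as np

def ctc_neg_log_likelihood(probs, target, blank=0):
    """CTC loss via the forward algorithm.
    probs: (T, C) per-step class probabilities; target: label sequence without blanks."""
    ext = [blank]
    for c in target:
        ext += [c, blank]                # interleave blanks: b, l1, b, l2, b, ...
    T, S = probs.shape[0], len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]          # stay on the same symbol
            if s >= 1:
                a += alpha[t - 1, s - 1] # advance by one symbol
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    total = alpha[T - 1, S - 1]          # end on the final blank ...
    if S > 1:
        total += alpha[T - 1, S - 2]     # ... or on the final label
    return float(-np.log(total))

# Hypothetical classes: 0 = "nothing" (blank), 1 = "slide right", 2 = "slide left".
rng = np.random.default_rng(0)
probs = rng.random((6, 3))
probs /= probs.sum(axis=1, keepdims=True)
loss = ctc_neg_log_likelihood(probs, target=[1, 2])
```

The target here is just the ordered list of gestures; where each gesture starts and ends is left for the network to work out during training.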

41

ONLINE GESTURE CLASSIFICATION

Italian sign language recognition

ChaLearn 2014 challenge

RGB-D videos of 20 Italian sign language gestures

13K gestures

20 subjects

42


Classification accuracy (%)

Pigou et al.*     97.2
3D-CNN            97.4
3D-CNN CTC        98.2

Improvement in accuracy: 35%

By seeing only 41% of the gesture

*L. Pigou et al. Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video
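One possible reading of the 35% figure is a relative reduction in error rate between the first and last bars. This is an interpretation, not something the slide states, and it assumes the three chart values map to the three labels in the order listed (Pigou et al. at 97.2, 3D-CNN CTC at 98.2).

```python
baseline_acc = 97.2   # assumed to be Pigou et al.
ctc_acc = 98.2        # assumed to be 3D-CNN CTC

baseline_err = 100.0 - baseline_acc   # error rate in percent
ctc_err = 100.0 - ctc_acc
relative_reduction = (baseline_err - ctc_err) / baseline_err
print(f"{relative_reduction:.1%}")  # prints 35.7%
```

Under this reading, a one-point gain in accuracy removes roughly a third of the remaining errors.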

43


Improvement in accuracy: 35%

By seeing only 41% of the gesture

No pre- or post-processing

44

ONLINE GESTURE CLASSIFICATION

Car interfaces

In-house database

Media player, navigation, phone

20 subjects, 25 gestures

More information at CVPR 2016

45

ONLINE GESTURE CLASSIFICATION

Car interfaces

Classification accuracy (%), chart axis 25 to 85

HOG+HOG2           37
Two stream CNN     66
SNV                71
iDT                73
C3D                79
Ours               84
Human              88

In-house database

Media player, navigation, phone

20 subjects, 25 gestures

More information at CVPR 2016

46


Suitability of hardware for inference:

Latency is critical

[Charts: GPU vs. CPU for image classification and for video classification]

47


NVIDIA TX1 - for embedded solutions

Credit-card-sized GPU in your pocket

Our R3DCNN uses only 30% of the GPU

Scalability

48

CONTRIBUTIONS

Data augmentation helps deep learning considerably (VIVA accuracy rose from 48.3% to 77.5%)

R3DCNN gave the best results for sign language and gesture recognition

CTC helps video sequence learning

Scalable enough to run on NVIDIA TX1

[Diagram: Deep Learning, CTC, Data Augmentation]


THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join
