【ECCV 2016 BNMW】Human Action Recognition without Human

Human Action Recognition without Human

He Yun1,2, Soma Shirakabe1,2, Yutaka Satoh1,2, Hirokatsu Kataoka1

1Computer Vision Research Group, AIST, Japan 2Human-Centered Vision Lab., University of Tsukuba, Japan

Motion representation

•  Database: UCF101, HMDB51, ActivityNet

•  Approach: IDT, Two-Stream CNN

–  Databases and approaches are already well established in the field

Action Database

http://www.thumos.info/

The problem setting in action recognition

•  Video-level prediction

–  1 action-label prediction per input video
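A minimal sketch of video-level prediction, assuming a classifier has already produced one score vector per frame (or per clip). Averaging the scores over time and taking the arg-max is one common way to get a single label per video; the slides do not say which aggregation is used, and the function name `predict_video_label` is hypothetical.

```python
import numpy as np

def predict_video_label(frame_scores, class_names):
    """Return one action label for a whole video.

    frame_scores: array-like of shape (num_frames, num_classes).
    class_names:  list of num_classes label strings.
    Assumption: scores are aggregated by a simple mean over frames.
    """
    frame_scores = np.asarray(frame_scores, dtype=np.float64)
    video_scores = frame_scores.mean(axis=0)  # one score per class
    return class_names[int(video_scores.argmax())]

# Toy example with two classes and three frames.
label = predict_video_label(
    [[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]],
    ["GolfSwing", "TennisSwing"],
)
```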

TennisSwing

Motion Descriptor

Dense Trajectories (DT) [Wang+, CVPR11]

•  Trajectory-based representation

–  A large number of trajectories

–  Feature description (HOG, HOF, MBH)

–  A codeword vector is generated

Two-Stream CNN [Simonyan+, NIPS14]

•  Spatial and temporal convolution

–  Spatial-stream: From an RGB image

–  Temporal-stream: From stacked optical flows

–  Score fusion: Average or SVM
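The averaging variant of score fusion can be sketched as follows, assuming each stream outputs a vector of class scores for the same video. The function name `fuse_by_average` is hypothetical, and the SVM-based alternative is not shown.

```python
import numpy as np

def fuse_by_average(spatial_scores, temporal_scores):
    """Fuse the two streams by averaging their class scores."""
    spatial = np.asarray(spatial_scores, dtype=np.float64)
    temporal = np.asarray(temporal_scores, dtype=np.float64)
    if spatial.shape != temporal.shape:
        raise ValueError("both streams must score the same set of classes")
    return (spatial + temporal) / 2.0

# Toy example: the streams disagree, and the fused score decides.
fused = fuse_by_average([0.7, 0.3], [0.1, 0.9])
predicted_class = int(fused.argmax())
```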

Is background enough to classify actions?

•  RGB input is too strong!

–  The two-stream CNN [Simonyan+, NIPS14] reported that the spatial-stream recognizes actions better than expected

•  72.4% with spatial-stream (RGB) @UCF101

•  “Human Action Recognition without Human”

Without Human?

•  Can human action recognition be done using only the motion of the background?

TennisSwing

Motion Descriptor

TennisSwing?

Motion Descriptor

Detailed setting of w/ and w/o Human

•  With and without human setting

–  Without human setting: center-blind image with UCF101

–  With human setting: inverse of the without human setting

[Figure: I'(x,y) = I(x,y) * f(x,y), where f(x,y) divides each image dimension into 1/4, 1/2, 1/4 to define the central region. Left: Without Human Setting. Right: With Human Setting.]
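The two settings can be sketched as an element-wise product of the image I(x,y) with a binary mask f(x,y). This is an illustration under assumptions, not the authors' code: it assumes the central region spans the middle half of the height and width (1/4 margins on each side), that masked pixels are set to zero, and the function names are hypothetical.

```python
import numpy as np

def center_mask(height, width):
    """Mask f(x, y): 0 inside the central half-height x half-width box, 1 elsewhere."""
    mask = np.ones((height, width), dtype=np.float32)
    top, left = height // 4, width // 4
    mask[top:top + height // 2, left:left + width // 2] = 0.0
    return mask

def apply_mask(image, mask):
    """Compute I'(x, y) = I(x, y) * f(x, y) for grayscale or color images."""
    if image.ndim == 3:
        mask = mask[..., None]  # broadcast over the channel axis
    return (image * mask).astype(image.dtype)

def without_human(image):
    """Center-blind image: the central region is blacked out."""
    return apply_mask(image, center_mask(*image.shape[:2]))

def with_human(image):
    """Inverse setting: only the central region is kept."""
    return apply_mask(image, 1.0 - center_mask(*image.shape[:2]))

# Toy example on a constant 8x8 RGB frame.
frame = np.full((8, 8, 3), 10, dtype=np.uint8)
no_human = without_human(frame)
only_human = with_human(frame)
```

Because the two masks are complementary, the two outputs add back up to the original frame.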

Framework

–  Baseline: Very deep two-stream CNN [Wang+, arXiv15]

–  Two different scenarios: without human and with human

Exploration experiment

•  @UCF101

–  UCF101 pre-trained model with very deep two-stream CNN

–  With/Without Human Setting

Visual results (Full Image)

Visual results (Without Human Setting)

Without Human

•  The concept of “Human Action Recognition without Human”

–  The accuracies are very close

•  The with human setting is +9.49% more accurate than the without human setting

–  Current motion representations rely heavily on the background

Future work

•  This is a suggestive reality

–  We must accept this reality in order to achieve better motion representations

–  A pure motion representation is urgently needed!

•  More sophisticated approach

•  Human only motion
