
Making Third Person Techniques Recognize First-Person Actions in Egocentric Videos

Sagar Verma, Pravin Nagar, Divam Gupta, and Chetan Arora

[email protected], [email protected], [email protected], [email protected]

Problem Statement
DNNs trained on third-person actions do not adapt to egocentric actions due to a large difference in the size of visible objects. Another complexity is the presence of multiple action categories. This work unifies feature learning for multiple action categories using a generic two-stream architecture.

[Figure: example RGB and flow frames for the 'take' and 'walking' actions.]
Actions with hand-object interaction ('take') and without ('walking') in the two view streams.

Contributions
1. Deep neural networks trained on third-person videos do not adapt to egocentric actions due to the large difference in the size of the visible objects. After cropping and resizing, the objects become comparable in size to the objects in third-person videos.
2. We propose curriculum learning that merges similar but opposite actions while training the CNN (a sketch of contributions 1 and 2 follows this list).
3. The proposed framework is generic to all categories of egocentric actions.
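
The first two contributions lend themselves to a short illustration. The Python sketch below is not the authors' released code: the M x N crop size, the dummy frame, and the merged action pairs are illustrative assumptions.

```python
# Sketch of contributions 1 and 2, using standard torchvision transforms.
import torchvision.transforms as T
from PIL import Image

# Contribution 1: bring egocentric objects to a third-person-like scale by
# central-cropping the interaction region, then resizing and random-cropping
# to the usual 224x224 network input.
egocentric_to_third_person = T.Compose([
    T.CenterCrop((360, 480)),   # M x N central crop (values assumed)
    T.Resize((300, 300)),       # resize step from the pipeline
    T.RandomCrop((224, 224)),   # standard input crop
    T.ToTensor(),
])

frame = Image.new("RGB", (640, 480))         # stand-in for a video frame
inp = egocentric_to_third_person(frame)      # 3 x 224 x 224 tensor

# Contribution 2: curriculum learning that first merges similar but opposite
# actions into a single label, then restores the full label set later.
# The concrete pairs below are illustrative.
MERGED = {"open": "open_or_close", "close": "open_or_close",
          "take": "take_or_put", "put": "take_or_put"}

def curriculum_label(action, phase):
    """Phase 1 trains on merged labels; later phases use the full label set."""
    return MERGED.get(action, action) if phase == 1 else action
```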

Related Work
Earlier works on first-person action recognition use hands and objects as important cues [1, 2]. At the other end, many works use only motion information for first-person action recognition [3, 4]. State-of-the-art (SoTA) techniques focus on only one specific category of action classes.

Proposed Architecture

[Architecture diagram: a window of frames enters the RGB and flow streams. Inputs are resized to 300x300 and then either randomly cropped to 224x224 or centrally cropped to MxN for long/short-term actions. Each stream has a ResNet50 backbone; its per-frame output vectors (L x W, for RGB and for flow) feed LSTM units 1..w with per-step softmax, and the class scores of the two streams are fused to produce the predicted action label.]
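
A minimal PyTorch sketch of the diagram above. The window length, hidden size, flow-channel count, and mean class-score fusion are assumptions; only the overall structure (per-frame ResNet50 features, an LSTM over the window, softmax, and class-score fusion) follows the poster.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class StreamNet(nn.Module):
    """One stream (RGB or flow): ResNet50 backbone + LSTM over a frame window."""
    def __init__(self, num_classes, in_channels=3, hidden=512):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")  # ImageNet-pretrained
        if in_channels != 3:  # optical-flow stacks usually have 2 channels
            backbone.conv1 = nn.Conv2d(in_channels, 64, 7, 2, 3, bias=False)
        backbone.fc = nn.Identity()              # keep the 2048-d features
        self.backbone = backbone
        self.lstm = nn.LSTM(2048, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, clip):                     # clip: batch x window x C x H x W
        b, w = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1))        # (b*w) x 2048
        out, _ = self.lstm(feats.view(b, w, -1))         # b x w x hidden
        return self.classifier(out[:, -1])               # class scores

class TwoStreamNet(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.rgb = StreamNet(num_classes, in_channels=3)
        self.flow = StreamNet(num_classes, in_channels=2)

    def forward(self, rgb_clip, flow_clip):
        # Class-score fusion: average the per-stream softmax scores.
        scores = (self.rgb(rgb_clip).softmax(-1) +
                  self.flow(flow_clip).softmax(-1)) / 2
        return scores.argmax(-1)                 # predicted action label
```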

Results and Discussion

Accuracy comparison of our method with SoTA, and statistics of the egocentric video datasets:

Dataset      Subjects  Frames     Classes  Accuracy (SoTA)  Accuracy (Ours)
GTEA [1]     4         31,253     11       68.50 [5]        82.71
EGTEA+ [1]   32        1,055,937  19       NA               66
Kitchen [6]  7         48,117     29       66.23 [5]        71.92
ADL [2]      5         93,293     21       37.58 [5]        44.13
UTE [7]      2         208,230    21       60.17 [5]        65.12
HUJI [8]     NA        1,338,606  14       86 [8]           93.92

[Confusion matrix on the mixed GTEA + HUJI test set: true labels (rows) vs. predicted labels (columns) over classes from both datasets, namely Walking, Standing, Static, Sitting, Stair Climbing, Horseback, Boxing, Driving, Biking, Riding Bus, Running, Skiing, Sailing, bg, fold, pour, scoop, spread, take, close, open, put, shake, stir; the mass is concentrated on the diagonal.]

Top and bottom rows show the visualization of the original and resized inputs, respectively, for the 'close', 'open', and 'take' actions (column-wise).

Applicability in a real-life setting where different action categories are present: To validate the applicability of our method, we use mixed samples from the GTEA [1] and HUJI [8] datasets. From the confusion matrix it is evident that the proposed network shows no confusion between the different categories of actions.
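
As an illustration of this mixed-category check, here is a minimal sketch assuming frame-level ground truth and predictions are available as class-name strings; the tiny lists are dummy data, not results from the poster.

```python
from sklearn.metrics import confusion_matrix

labels = ["take", "open", "close", "pour",        # hand-object (GTEA-style)
          "Walking", "Driving", "Running"]        # long-term (HUJI-style)
y_true = ["take", "open", "Walking", "Driving", "pour", "Running"]
y_pred = ["take", "close", "Walking", "Driving", "pour", "Running"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
# Rows are true labels, columns are predictions; any counts linking the
# hand-object block to the long-term block would indicate cross-category
# confusion between the two datasets.
print(cm)
```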

[Predicted vs. ground-truth action sequences (labels such as bg, take, take2, open, put, pour, close, scoop, spread, fold) for four activities: Cheese, Hotdog, Pealate, and Peanut.]

The top and bottom of each subfigure show the predicted and ground-truth sequences, respectively.
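
Such sequence plots can be produced by collapsing frame-level outputs into segments; a minimal sketch with dummy per-frame predictions:

```python
from itertools import groupby

frame_preds = ["bg", "bg", "take", "take", "take", "open", "open", "pour"]

# Merge runs of identical consecutive labels into (label, run length) segments.
segments = [(label, len(list(run))) for label, run in groupby(frame_preds)]
print(segments)  # [('bg', 2), ('take', 3), ('open', 2), ('pour', 1)]
```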

References
[1] Alireza Fathi, Xiaofeng Ren, and James M Rehg, “Learning to recognize objects in egocentric activities,” in CVPR, 2011.
[2] Hamed Pirsiavash and Deva Ramanan, “Detecting activities of daily living in first-person camera views,” in CVPR, 2012.
[3] Kris Makoto Kitani, Takahiro Okabe, Yoichi Sato, and Akihiro Sugimoto, “Fast unsupervised ego-action learning for first-person sports videos,” in CVPR, 2011, pp. 3241–3248.
[4] Suriya Singh, Chetan Arora, and C. V. Jawahar, “Trajectory aligned features for first person action recognition,” Pattern Recognition, vol. 62, pp. 45–55, 2016.
[5] Suriya Singh, Chetan Arora, and C. V. Jawahar, “First person action recognition using deep learned descriptors,” in CVPR, 2016, pp. 2620–2628.
[6] Ekaterina H Spriggs, Fernando De La Torre, and Martial Hebert, “Temporal segmentation and activity classification from first-person sensing,” in CVPRW, 2009, pp. 17–24.
[7] Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman, “Discovering important people and objects for egocentric video summarization,” in CVPR, 2012, pp. 1346–1353.
[8] Yair Poleg, Ariel Ephrat, Shmuel Peleg, and Chetan Arora, “Compact CNN for indexing egocentric videos,” in WACV, 2016, pp. 1–9.

Acknowledgement
This work has been supported by the Infosys Center for Artificial Intelligence, the Visvesvaraya Young Faculty Research Fellowship, and the Visvesvaraya Ph.D. Fellowship from the Government of India. We thank Inria and the CVN Lab at CentraleSupélec for supporting travel for Sagar Verma.