
Video Action Detection with Relational Dynamic-Poselets

Limin Wang¹,², Yu Qiao², and Xiaoou Tang¹,²

¹Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong
²Shenzhen Institutes of Advanced Technology, CAS, China

Introduction

• Problem: We aim not only to recognize the on-going action class (action recognition), but also to localize its spatiotemporal extent (action detection), and even to estimate the pose of the actor (pose estimation).

• Key insights:

Figure 1: Illustration for motivation.

◮ An action can be temporally decomposed into a sequence of key poses.
◮ Each key pose can be decomposed into a spatial arrangement of mixtures of action parts.

• Main contributions:

◮ We propose a new pose and motion descriptor to cluster cuboids into dynamic-poselets.
◮ We design a sequential skeleton model to jointly capture spatiotemporal relations among body parts, co-occurrences of mixture types, and local part templates.

Dynamic-Poselets

Figure 2: Construction of dynamic-poselets.

• A pose and motion descriptor:

◮ $f(p_{i,j}^{v}) = [\Delta p_{i,j}^{v,1}, \Delta p_{i,j}^{v,2}, \Delta p_{i,j}^{v,3}]$.
◮ $\Delta p_{i,j}^{v,1} = p_{i,j}^{v} - p_{\mathrm{par}(i),j}^{v}$, $\Delta p_{i,j}^{v,2} = p_{i,j}^{v} - p_{i,j-1}^{v}$, $\Delta p_{i,j}^{v,3} = p_{i,j}^{v} - p_{i,j+1}^{v}$.
◮ $\Delta p_{i,j}^{v,k} = [\Delta x_{i,j}^{v,k} / s_{i,j}^{v},\ \Delta y_{i,j}^{v,k} / s_{i,j}^{v}]$ $(k = 1, 2, 3)$.

• Using this descriptor, we run the k-means algorithm to cluster cuboids into dynamic-poselets.
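As a concrete illustration, below is a minimal sketch of how the descriptor and the clustering step might be implemented, assuming joint positions and scales are available as arrays and that a parent map `par` encodes the kinematic tree; the array layout and the use of scikit-learn's `KMeans` are illustrative assumptions, not details of the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def pose_motion_descriptor(p, s, par, i, j):
    """f(p^v_{i,j}) for joint i at frame j of one video clip.

    p   : (num_joints, num_frames, 2) array of (x, y) joint positions
    s   : (num_joints, num_frames) array of joint scales
    par : list mapping joint index -> index of its parent joint
    """
    d1 = p[i, j] - p[par[i], j]   # offset to the parent joint, same frame
    d2 = p[i, j] - p[i, j - 1]    # offset to the same joint, previous frame
    d3 = p[i, j] - p[i, j + 1]    # offset to the same joint, next frame
    # Scale-normalize the three offsets and concatenate them.
    return np.concatenate([d1, d2, d3]) / s[i, j]

def cluster_dynamic_poselets(descriptors, num_mixtures=8, seed=0):
    """Cluster the cuboid descriptors of one body part into mixture types."""
    km = KMeans(n_clusters=num_mixtures, n_init=10, random_state=seed)
    mixture_labels = km.fit_predict(np.asarray(descriptors))
    return mixture_labels, km.cluster_centers_
```

Descriptors would be collected per body part over all annotated training frames and clustered separately for each part, so that the cluster index of a cuboid serves as its mixture type (dynamic-poselet).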

Action Detection with SSM

[Overview diagram: dynamic-poselet construction and sequential skeleton model learning (part models, spatiotemporal relations, mixture type relations) from positive videos with annotations and negative videos without annotations at training time; at test time, a temporal sliding window (x, y, t) over the unannotated test video, followed by model inference and post-processing.]

Figure 3: Overview of our approach.

• Sequential Skeleton Model (SSM):

$$S(v, p, t) = \underbrace{b(t)}_{\text{Mixture Type Relations}} + \underbrace{\Psi(p, t)}_{\text{Spatiotemporal Relations}} + \underbrace{\Phi(v, p, t)}_{\text{Action Part Models}}$$

$v$ is a video clip; $p$ and $t$ are the pixel positions and the mixture types of the dynamic-poselets, respectively.

◮ Mixture Type Relations:

$$b(t) = \sum_{j=1}^{N} \sum_{i=1}^{K} b_{i,j}^{t_{i,j}} + \sum_{(i,j)\sim(m,n)} b_{(i,j),(m,n)}^{t_{i,j}\, t_{m,n}}$$

$b_{i,j}^{t_{i,j}}$ encodes the mixture prior, and $b_{(i,j),(m,n)}^{t_{i,j}\, t_{m,n}}$ captures the compatibility of mixture types.

◮ Spatiotemporal Relations:

$$\Psi(p, t) = \sum_{(i,j)\sim(m,n)} \beta_{(i,j),(m,n)}^{t_{i,j}\, t_{m,n}} \cdot \psi(p_{i,j}, p_{m,n}), \qquad \psi(p_{i,j}, p_{m,n}) = [dx, dy, dz, dx^2, dy^2, dz^2]$$

$\beta_{(i,j),(m,n)}^{t_{i,j}\, t_{m,n}}$ represents the parameters of the quadratic spring model.

◮ Action Part Models:

$$\Phi(v, p, t) = \sum_{j=1}^{N} \sum_{i=1}^{K} \alpha_{i}^{t_{i,j}} \cdot \phi(v, p_{i,j})$$

$\phi(v, p_{i,j})$ is the feature vector and $\alpha_{i}^{t_{i,j}}$ denotes the feature template.

Note that the body part template $\alpha_{i}^{t_{i,j}}$ is shared among different key poses.
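To make the scoring function concrete, here is a minimal sketch of evaluating $S(v, p, t)$ for one candidate configuration of parts; the data structures (per-part mixture priors, pairwise compatibility and spring-weight tables, precomputed appearance responses) and the flat part indexing over key poses are illustrative assumptions, and full inference would additionally require dynamic programming over the part structure rather than scoring a single configuration.

```python
import numpy as np

def ssm_score(p, t, edges, b_unary, b_pair, beta, app_response):
    """Score S(v, p, t) for one configuration of dynamic-poselets.

    p            : (num_parts, 3) array of (x, y, z) part positions (z = frame index)
    t            : (num_parts,) integer array of mixture types
    edges        : list of (a, b) index pairs of connected parts
    b_unary      : (num_parts, num_mixtures) mixture priors
    b_pair       : dict (a, b) -> (num_mixtures, num_mixtures) compatibility table
    beta         : dict (a, b) -> (num_mixtures, num_mixtures, 6) spring weights
    app_response : (num_parts, num_mixtures) precomputed template responses
                   alpha . phi(v, p) at the positions in p
    """
    num_parts = len(t)
    idx = np.arange(num_parts)

    # Mixture type relations b(t): mixture priors plus pairwise compatibilities.
    score = b_unary[idx, t].sum()
    score += sum(b_pair[(a, b)][t[a], t[b]] for a, b in edges)

    # Spatiotemporal relations Psi(p, t): quadratic spring model on relative positions.
    for a, b in edges:
        dx, dy, dz = p[a] - p[b]
        psi = np.array([dx, dy, dz, dx**2, dy**2, dz**2])
        score += beta[(a, b)][t[a], t[b]].dot(psi)

    # Action part models Phi(v, p, t): appearance response of each part template.
    score += app_response[idx, t].sum()
    return score
```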

• Action detection pipeline:

◮ temporal sliding window → model inference → non-maximum suppression.
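A minimal sketch of this pipeline is given below, assuming a `score_clip` callable that runs SSM inference on a video sub-clip and returns a detection score together with the inferred part locations; the window lengths, stride, and the greedy temporal non-maximum suppression are illustrative choices rather than the paper's exact settings.

```python
def temporal_iou(a, b):
    """Intersection-over-union in time of two detections (score, start, end, parts)."""
    inter = max(0, min(a[2], b[2]) - max(a[1], b[1]))
    union = (a[2] - a[1]) + (b[2] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def detect_actions(video, score_clip, win_lens=(30, 45, 60), stride=5,
                   score_thresh=0.0, overlap_thresh=0.5):
    """Temporal sliding window -> model inference -> non-maximum suppression."""
    num_frames = len(video)
    candidates = []

    # 1) Slide windows of several lengths over the video and run SSM inference.
    for win in win_lens:
        for start in range(0, max(num_frames - win, 0) + 1, stride):
            clip = video[start:start + win]
            score, parts = score_clip(clip)
            if score > score_thresh:
                candidates.append((score, start, start + win, parts))

    # 2) Greedy non-maximum suppression: keep high-scoring, non-overlapping windows.
    candidates.sort(key=lambda c: c[0], reverse=True)
    detections = []
    for cand in candidates:
        if all(temporal_iou(cand, kept) < overlap_thresh for kept in detections):
            detections.append(cand)
    return detections
```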

References

1. Cao, L., Liu, Z., Huang, T.S.: Cross-dataset action detection. In: CVPR (2010).
2. Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixtures-of-parts. In: CVPR (2011).
3. Lan, T., et al.: Discriminative figure-centric models for joint action localization and recognition. In: ICCV (2011).
4. Tian, Y., Sukthankar, R., Shah, M.: Spatiotemporal deformable part models for action detection. In: CVPR (2013).
5. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: ICCV (2013).

Experiments

• Quantitative results on MSR-II, UCF Sports, and J-HMDB:

[Result plots: precision-recall curves for Boxing, Hand Clapping, and Hand Waving on MSR-II, comparing SDPM w/o parts, SDPM, the cross-dataset model before and after adaptation, the model adapted with ground truth, and our result; per-class AUC vs. overlap threshold (Diving, Golf, Kicking, Lifting, HorseRiding, Running, Skating, SwingBench, SwingSide, Walking), ROC curves, and average AUC vs. overlap threshold on UCF Sports, comparing SDPM, the Figure-Centric Model, and our result; per-class AUC vs. overlap threshold (Catch, ClimbStairs, Golf, Jump, Kick, Pick, Pullup, Push, Run, ShootBall, Swing, Walk), ROC curves, and average AUC vs. overlap threshold on J-HMDB, comparing iDT+FV and our result.]

• Detection examples on MSR-II, UCF Sports, and J-HMDB:

Limin Wang, Yu Qiao, and Xiaoou Tang. European Conference on Computer Vision (ECCV), Zurich, Switzerland, 2014.