A Hierarchical Model of Shape and Appearance for Human Action Classification

Juan Carlos Niebles 1,2 and Li Fei-Fei 1,3
1 University of Illinois at Urbana-Champaign
2 Universidad del Norte, Colombia
3 Princeton University

Ref: J.C. Niebles & L. Fei-Fei. A hierarchical model of shape and appearance for human action classification. CVPR 2007. Minneapolis, USA.

Highlights and Summary
• A novel model for human action categorization from video sequences.
• Our model can be characterized as a constellation of bags-of-features.
• Use of hybrid features: combines both static shape and spatio-temporal features.

Hybrid features
• Static features (from the original frame): feature detection with the Canny edge detector; feature description with shape context [Belongie et al. 2000].
• Dynamic features: feature detection with the spatio-temporal interest point detector [Dollár et al. 2005]; feature description with the concatenated brightness gradient.
• Codebook and membership assignment: the descriptors of each feature type are assigned to the codewords of a codebook.

Final Representation
• Video frame: $w = \{x, a\}$, made of features $w_i = \{x_i, a_i\}$.
• $x_i$: i-th feature position.
• $a_i$: i-th feature appearance.

Learning
• Estimate model parameters using EM:
\[
\theta = \left\{ \pi_\omega,\; \mu^{L}_{\omega},\; \Sigma^{L}_{\omega},\; \theta^{X}_{p,\omega},\; \theta^{A}_{p,\omega},\; \theta^{X}_{0},\; \theta^{A}_{0} \right\}, \qquad p = 1, \dots, P, \qquad \omega = 1, \dots, \Omega
\]
• The per-class frame likelihoods form a vector that is passed to an SVM, which outputs the action class:
\[
\big[\, p(\text{frame} \mid \text{class } 1),\; p(\text{frame} \mid \text{class } 2),\; \dots,\; p(\text{frame} \mid \text{class } C) \,\big] \;\rightarrow\; \text{SVM} \;\rightarrow\; \text{action class}
\]

Conclusions
• The constellation of bags-of-features is able to capture semantic information of human action classes.
• Combines hybrid features: static shape features and dynamic motion features.
• Capable of classifying in both frame based and video based manner.

Algorithm
• Learning: input video sequences (Class 1 ... Class N) → feature extraction and description → video frame representation → learn a model for each class (Model 1 ... Model N).
• Recognition: new sequence → feature extraction and description → decide on best model → class.
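The codebook and membership-assignment steps of the pipeline can be illustrated with a minimal sketch. The poster does not say how the codebook is built, so the k-means clustering used here is an assumption, as are the function names `form_codebook` and `assign_codewords`, the Euclidean distance, and the synthetic 2-D descriptors that stand in for shape-context or brightness-gradient descriptors.

```python
import numpy as np


def assign_codewords(descriptors, codebook):
    """Return, for each descriptor row, the index of its nearest codeword."""
    # Pairwise squared Euclidean distances, shape (n_descriptors, n_codewords).
    diffs = descriptors[:, None, :] - codebook[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    return np.argmin(sq_dists, axis=1)


def form_codebook(descriptors, n_codewords, n_iters=20, seed=0):
    """Toy k-means (an assumed choice): random initial centers, then
    alternate between assignment and center update."""
    rng = np.random.default_rng(seed)
    init = rng.choice(len(descriptors), size=n_codewords, replace=False)
    codebook = descriptors[init].astype(float)
    for _ in range(n_iters):
        labels = assign_codewords(descriptors, codebook)
        for k in range(n_codewords):
            members = descriptors[labels == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                codebook[k] = members.mean(axis=0)
    return codebook


if __name__ == "__main__":
    # Synthetic descriptors forming two well-separated groups (illustration only).
    descriptors = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                            [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
    codebook = form_codebook(descriptors, n_codewords=2)
    print(assign_codewords(descriptors, codebook))
```

In this sketch the appearance $a_i$ of a feature would be the codeword index returned by `assign_codewords`, while its position $x_i$ is kept separately.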
Algorithm diagram, learning stage (continued): form codebook from the sequences of Class 1 ... Class N.

Previous Works [Weber et al. 2000, Csurka et al. 2004, Sudderth et al. 2006]
• Constellation model (part layer): strong shape representation (☺), but a small number of features.
• Bag-of-features model (feature layer): large number of features (☺), but no geometrical or shape information.

Hierarchical model
• Image → part layer (parts P1, P2, P3, P4 and background Bg) → feature layer (features w).
• Mixture components are indexed by ω; model parameters are denoted θ.
• Large number of features from the bag-of-features model.
• Strong shape representation from the constellation model.

Action Models
Figure: learned action models, shown for mixture components ω = 1, ω = 2, ω = 3, for the classes bend, jack, run, wave1, jump, pjump, wave2, side, walk.

http://vision.cs.princeton.edu

Approximated data likelihood:
\[
p(w, Y \mid \theta) \approx \sum_{\omega=1}^{\Omega} \pi_\omega \sum_{h \in H} p(h \mid \theta_\omega)\, \underbrace{p(Y \mid h, \theta_\omega)}_{\text{Part layer}}\; \underbrace{p(w \mid Y, h, m^{*}, \theta_\omega)}_{\text{Local feature layer}}
\]

Part layer term:
\[
p(Y \mid h, \theta_\omega) = \mathcal{N}\!\left( Y(h)^{T} \mid \mu^{L}_{\omega}, \Sigma^{L}_{\omega} \right)
\]

Local feature layer term (the part and background parameters are those of component ω):
\[
p(w \mid Y, h, m^{*}, \theta_\omega) = \prod_{p=1}^{P} \prod_{i \in w_{P_p}} \underbrace{p(x^{r}_{i} \mid Y, h_p, \theta^{X}_{p})}_{\text{Part Shape}}\; \underbrace{p(a_i \mid \theta^{A}_{p})}_{\text{Part Appearance}} \;\prod_{j \in w_{Bg}} \underbrace{p(x^{r}_{j} \mid \theta^{X}_{0})}_{\text{Bg Shape}}\; \underbrace{p(a_j \mid \theta^{A}_{0})}_{\text{Bg Appearance}}
\]

EM updates
• E-step:
\[
p(\omega, h \mid w, Y, \theta^{old}) \approx \frac{\pi^{old}_{\omega}\, p(h \mid \theta^{old}_{\omega})\, p(Y \mid h, \theta^{old}_{\omega})\, p(w \mid Y, h, m^{*}, \theta^{old}_{\omega})}{p(w, Y \mid \theta^{old})}
\]
• M-step:
\[
\theta^{new} = \arg\max_{\theta} \sum_{\omega,\, h} p(\omega, h \mid w, Y, \theta^{old})\, \ln p(w, Y, h, \omega \mid \theta)
\]

Experimental Results
• 9 action classes, performed by 9 subjects [Blank et al. 2005].
• Leave-one-out cross-validation.
• Video classification performance: 72.8%.

Recognition
• Classify actions in both frame based and video based manner.
• Video classification based on majority votes of frames.
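The video-level decision by majority vote over frames can be sketched as follows. This is only an illustration: the function name `classify_video`, the string labels, and the tie-breaking rule (the label seen first among the tied ones wins) are assumptions, and the per-frame labels are taken as given rather than computed from the model.

```python
from collections import Counter


def classify_video(frame_labels):
    """Return the label predicted for the most frames (majority vote).

    Ties are broken in favor of the label that appears first in the
    sequence; the poster does not specify a tie-breaking rule.
    """
    if not frame_labels:
        raise ValueError("need at least one per-frame label")
    counts = Counter(frame_labels)
    # most_common(1) returns a list with one (label, count) pair.
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical per-frame predictions for one short sequence.
    per_frame = ["walk", "run", "walk", "walk", "side"]
    print(classify_video(per_frame))
```

A frame-based evaluation would compare each entry of `per_frame` with the true label directly, while the video-based evaluation uses only the single voted label.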