Cross-View Action Recognition via View Knowledge Transfer
Jingen Liu (1), Mubarak Shah (2), Benjamin Kuipers (1), Silvio Savarese (1)

Slide 1

Jingen Liu (1), Mubarak Shah (2), Benjamin Kuipers (1), Silvio Savarese (1)
Cross-View Action Recognition via View Knowledge Transfer
(1) Department of EECS, University of Michigan, Ann Arbor, MI, USA
(2) Department of EECS, University of Central Florida, Orlando, FL, USA
IEEE International Conference on Computer Vision and Pattern Recognition, 2011

Slide 2: Cross-View Action Recognition
• View 1: has labeled examples to train an action classifier F1.
• View 2: has NO training examples (e.g., of the action "checking watch").
• Question: how can the knowledge of view 1 be used to recognize unknown actions in view 2?
(Diagram: in view 1, a low-level feature representation feeds a classifier that outputs "checking watch"; view 2 has only the low-level representation and an unknown label.)

Slide 3: Cross-View Action Recognition
• Can classifier F1 be used directly on actions from view 2? No: performance decreases dramatically, because motion appearance looks very different across views.

Slide 4: Analogy to Text Analysis
• Cross-lingual text categorization/retrieval [Bel et al. 2004; Pirkola 1998].
• Translate documents into a common language, e.g., an interlingua as used in machine translation [Hutchins et al. 1992].
• Underlying assumption: a word-by-word translation is available.
(Diagram: documents in Chinese and in French are mapped into common languages or an interlingua.)

Slide 5: Our Proposal
• An action view interlingua: treat each viewpoint as a language and construct a vocabulary for it.
• Model an action by a Bag-of-Visual-Words (BoVW) model, i.e., a histogram of visual words.
• Translate the two BoVW models into the action view interlingua.
(Diagram: videos from view 1 and view 2 are quantized with vocabularies V1 and V2, then mapped into an action view interlingua.)

Slide 6: Previous Work
Geometry-based approaches:
• Geometric measurement of body joints [C. Rao et al. IJCV 2002; V. Parameswaran et al. IJCV 2006; etc.]: requires stable body-joint detection and tracking.
• 3D-reconstruction-related methods [D. Weinland et al. ICCV 2007; P. Yan et al. CVPR 2008; F. Lv et al. ICCV 2007; D. Gavrila et al. CVPR 1996; R. Li et al. ICCV 2007; etc.].
These methods require strict alignment between views, and reconstruction is computationally expensive.
• Temporal self-similarity matrix [Junejo et al. ECCV 2008]: no knowledge transfer; poor performance on the top view.

Slide 7: Previous Work
Transfer-based approaches:
• Farhadi et al. ECCV 2008: requires feature-to-feature correspondence at the frame level; the mapping is provided by a trained predictor and is conducted in one direction only.
• Farhadi et al. ICCV 2009: abstracts discriminative aspects and trains a hash mapping; no explicit model transfer.

Slide 8: Our Contributions
Advantages of our approach:
• More flexible: no geometry constraints, no human body-joint detection and tracking, and no 3D reconstruction.
• No strict temporal alignment required.
• Two-directional mapping rather than one direction.
• No supervision needed for bilingual-word discovery.
• Fuses the transferred multi-view knowledge using the Locally Weighted Ensemble method.

Slide 9: Our Framework
Phase I: discovery of bilingual words.
• Given N pairs of unlabelled videos captured from the two views.
• Learn two view-dependent visual vocabularies, V1 and V2.
• Discover bilingual words by bipartite graph partitioning.
(Pipeline: training data from both views → BoVW models, stacked into the matrix M → bipartite graph between V1 and V2 → graph partitioning → bilingual words.)

Slide 10: Our Framework (continued)
• The same Phase I pipeline, now showing the BoVW models of the two views re-expressed as Bag-of-Bilingual-Words (BoBW) models.

Slide 11: Our Framework
Phase II: cross-view novel action recognition.
• Source view: translate the training videos from Bag-of-Visual-Words to Bag-of-Bilingual-Words using the learned bilingual words, and train a classifier.
• Target view: translate the testing videos the same way, and recognize novel actions with the classifier trained on the source view.

Slide 12: Low-level Action Representation
• Acquire the training matrix M.
• Feature detection: 3D cuboid extraction.
• Feature clustering yields a visual vocabulary of d visual words.
• Each video becomes a histogram over the visual words: the Bag-of-Visual-Words (BoVW) model.

Slide 13: Bipartite Graph Modeling
• Build a bipartite graph between the two views: X are the visual words of view 1, Y the visual words of view 2.
• The edge-weight matrix W is built from a similarity matrix S.
• Each entry S(i, j) is estimated in the column space of M (columns index the video examples, rows index the visual words).

Slide 14: Bipartite Graph Bi-Partitioning
• Spectral bipartite graph partitioning [1: H. Zha, X. He, C. Ding, H. Simon and M. Gu, CIKM 2001; 2: I. S. Dhillon, SIGKDD 2001].
• Example: partitioning yields two clusters (1, 2, 3; a, b) and (4, 5; c, d, e), i.e., two bilingual words.

Slide 15: IXMAS Data Set
• IXMAS videos: 11 actions performed by 10 actors, captured from 5 views.
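The Phase I steps above (Slides 12-14) can be sketched in code. This is a minimal sketch, not the authors' implementation: it assumes cosine similarity between word-occurrence columns as a stand-in for the paper's similarity measure, and uses the spectral co-clustering recipe of Dhillon (SIGKDD 2001) with a tiny k-means.

```python
import numpy as np

def _kmeans(X, k, seed=0, iters=50):
    """Tiny k-means used to cluster the joint word embedding."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)
    return labels

def discover_bilingual_words(M1, M2, k, seed=0):
    """Co-cluster the visual words of two views into k bilingual words.

    M1: (n_videos, d1) BoVW histograms of view 1 (rows are paired videos).
    M2: (n_videos, d2) BoVW histograms of view 2, same row pairing.
    Returns (labels1, labels2): a bilingual-word index for every visual word.
    """
    # Similarity between every view-1 word and every view-2 word: cosine
    # similarity of their occurrence patterns over the paired videos
    # (an assumed stand-in for the paper's similarity measure).
    A = M1 / (np.linalg.norm(M1, axis=0, keepdims=True) + 1e-12)
    B = M2 / (np.linalg.norm(M2, axis=0, keepdims=True) + 1e-12)
    S = A.T @ B                                    # (d1, d2) edge weights

    # Degree-normalised adjacency D1^(-1/2) S D2^(-1/2), as in spectral
    # bipartite co-clustering (Dhillon, SIGKDD 2001).
    r1 = 1.0 / np.sqrt(S.sum(axis=1) + 1e-12)
    r2 = 1.0 / np.sqrt(S.sum(axis=0) + 1e-12)
    Sn = r1[:, None] * S * r2[None, :]

    # Singular vectors 2..l+1 embed the words of BOTH views in one space.
    l = max(1, int(np.ceil(np.log2(k))))
    U, _, Vt = np.linalg.svd(Sn)
    Z = np.vstack([r1[:, None] * U[:, 1:1 + l],
                   r2[:, None] * Vt[1:1 + l, :].T])

    labels = _kmeans(Z, k, seed)
    return labels[:M1.shape[1]], labels[M1.shape[1]:]

def to_bobw(h, labels, k):
    """Translate a BoVW histogram into a Bag-of-Bilingual-Words histogram."""
    return np.bincount(labels, weights=np.asarray(h, float), minlength=k)
```

Once the bilingual words are found, `to_bobw` performs the "translation" of Slide 11: histograms from either view collapse onto the shared bilingual vocabulary, so a classifier trained on source-view BoBW histograms can be applied to target-view ones.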
(Example frames from cameras C0 to C4; actions shown include check-watch, scratch-head, sit-down, wave-hand, kick, and pick-up.)

Slides 16-18: Data Partition
• The IXMAS data are split into action classes Y (check-watch, scratch-head, sit-down, wave-hand) and classes Z (kick, pick-up), each available in a source view and a target view.
• Setting 1 (Slide 17): learn bilingual words from both views; train on the Z classes in the source view; test on the Z classes in the target view.
• Setting 2 (Slide 18): learn bilingual words from both views; train on the Z+Y classes in the source view; test on the Z classes in the target view.

Slide 19: Results on View Knowledge Transfer (%)
W/O and W/ show the results without and with view knowledge transfer via the bag of bilingual words, respectively. Rows are the training view, columns the testing view. (Slides 20-23 repeat this table under the title "Performance of Transfer".)

Train \ Test | Camera 0      | Camera 1      | Camera 2      | Camera 3      | Camera 4
Cam0         | --            | 14.40 / 75.46 | 10.69 / 64.40 | 10.61 / 67.68 | 19.09 / 65.99
Cam1         | 16.12 / 75.72 | --            | 11.11 / 64.23 |  7.41 / 68.10 |  9.22 / 56.02
Cam2         | 10.27 / 70.33 | 11.80 / 66.25 | --            | 12.90 / 71.34 |  8.08 / 62.42
Cam3         | 11.15 / 73.74 |  8.59 / 65.62 |  9.98 / 71.30 | --            |  9.30 / 58.04
Cam4         |  8.80 / 71.34 |  8.46 / 66.29 |  9.22 / 70.88 | 10.06 / 63.55 | --

Average: W/O = 10.9%, W/ = 67.4%.

Slides 24-26: Performance Comparison
• Low-level features: ST cuboids + shape-flow features [D. Tran et al. ECCV 2008].
• Columns: A = A. Farhadi et al. ECCV 2008; B = I. N. Junejo et al. ECCV 2008; C = A. Farhadi et al. ICCV 2009.
• Each cell lists Ours / A / B / C in %; rows are the training view, columns the testing view. (Slides 24 and 25 build up to the complete table of Slide 26, shown here.)

Train | Camera 0             | Camera 1             | Camera 2             | Camera 3             | Camera 4
C0    | --                   | 79.9 / 72 / 77.6 / 79 | 76.8 / 61 / 69.4 / 79 | 76.8 / 62 / 70.3 / 68 | 74.8 / 30 / 44.8 / 76
C1    | 81.2 / 69 / 77.3 / 72 | --                   | 75.8 / 64 / 73.9 / 74 | 78.0 / 68 / 67.3 / 70 | 70.4 / 41 / 43.9 / 66
C2    | 79.6 / 62 / 66.1 / 71 | 76.6 / 67 / 70.6 / 82 | --                   | 79.8 / 67 / 63.6 / 76 | 72.8 / 43 / 53.6 / 72
C3    | 73.0 / 63 / 69.4 / 75 | 74.1 / 72 / 70.0 / 75 | 74.4 / 68 / 63.0 / 79 | --                   | 66.9 / 44 / 44.2 / 76
C4    | 82.0 / 51 / 39.1 / 80 | 68.3 / 55 / 38.8 / 73 | 74.0 / 51 / 51.8 / 73 | 71.1 / 53 / 34.2 / 79 | --
Ave.  | 79.0 / 61 / 63.0 / 74 | 74.7 / 67 / 64.3 / 77 | 75.2 / 61 / 64.5 / 76 | 76.4 / 63 / 58.9 / 73 | 71.2 / 40 / 46.6 / 72

Slide 27: Transferred Knowledge Fusion
• One target view vs. n-1 source views; each source view has its own action classifier.
• How to fuse the transferred knowledge into a final decision? The Locally Weighted Ensemble strategy [Gao et al. SIGKDD 2008].
(Diagram: the classifiers of source 1 and source 2 are fused into a single decision region R.)

Slides 28-29: Knowledge Fusion Results (%)
Each column denotes a testing (target) view; the remaining four views are the source views. (Slide 28 shows the first two rows; Slide 29 adds the published comparisons. One of the six entries of the Ours row was lost in extraction; the five recovered values are listed in order.)

Method                    | Camera 0 | Camera 1 | Camera 2 | Camera 3 | Camera 4 | Average
Ours                      | 86.6     | 81.1     | 80.1     | 83.6     | 82.8     | --
Baseline                  | 80.6     | 78.5     | 78.0     | 78.3     | 73.0     | 77.7
Junejo et al. ECCV 2008   | 74.8     | 74.5     | 74.8     | 70.6     | 61.2     | 71.2
Liu et al. CVPR 2008      | 76.7     | 73.3     | 72.0     | 73.0     | N/A      | 73.8
Weinland et al. ECCV 2010 | 86.7     | 89.9     | 86.4     | 87.6     | 66.4     | 83.4

Slide 30: Detailed Recognition Rate (figure)

Slide 31: Summary
• Created an action view interlingua for cross-view action recognition.
• Bilingual words serve as a bridge for view knowledge transfer.
• Multiple sources of transferred knowledge are fused using the Locally Weighted Ensemble method.
• The approach achieves state-of-the-art performance.

Slide 32: Thank You!
Acknowledgements: UMich Intelligent Robotics Lab, UMich Computer Vision Lab, UCF Computer Vision Lab, NSF.

Slide 33: Confusion Table (figure)
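The Locally Weighted Ensemble fusion used in Slides 27-29 can be sketched roughly as follows. This is an illustration, not the method of Gao et al.: here each source classifier's per-sample weight is its prediction consistency over the sample's nearest neighbours, a simplified stand-in for the paper's clustering-based weighting.

```python
import numpy as np

def lwe_fuse(probas, X_target, k=3):
    """Locally weighted fusion of per-source class posteriors.

    probas: list of (n, c) arrays, one per source-view classifier, giving
            class probabilities for the n target-view samples.
    X_target: (n, d) target-view features, used only to find neighbours.
    Returns the fused (n, c) posterior.
    """
    X = np.asarray(X_target, dtype=float)
    n = X.shape[0]
    # Pairwise distances and each sample's k nearest neighbours.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                 # exclude self-matches
    nbrs = np.argsort(D, axis=1)[:, :k]         # (n, k) neighbour indices

    fused = np.zeros_like(probas[0], dtype=float)
    wsum = np.zeros(n)
    for P in probas:
        pred = P.argmax(axis=1)                 # hard labels per sample
        # Local consistency: fraction of neighbours with the same label.
        w = (pred[nbrs] == pred[:, None]).mean(axis=1)
        fused += w[:, None] * P
        wsum += w
    return fused / np.maximum(wsum, 1e-12)[:, None]
```

A classifier whose predictions vary erratically within a local neighbourhood of the target data gets down-weighted there, so each test sample is dominated by the source views that transfer well in its region, which is the intuition behind the fusion step on Slide 27.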