• Training: Given a pair of frames from the same video—a labeled Frame A and an unlabeled Frame B—we train our model to detect pose in Frame A using the features from Frame B.
I. Introduction
Learning Temporal Pose Estimation from Sparsely Labeled Videos
Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani
• Goal: Pose detection in video.
• Challenge: Densely labeling every frame with multi-person pose annotations is costly and time-consuming.
• Idea: Leverage sparsely annotated videos, i.e., pose annotations are given only every k frames.
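As a concrete illustration of the sparse-labeling setting, a minimal sketch (the helper name is ours, not from the paper's code) of which frames carry annotations when only every k-th frame is labeled:

```python
# Hypothetical sketch of the sparse annotation schedule: with labels on
# every k-th frame, a video's annotated frame indices are 0, k, 2k, ...
def labeled_frame_indices(num_frames: int, k: int = 7) -> list[int]:
    """Return the indices of the sparsely labeled frames."""
    return list(range(0, num_frames, k))

# A 30-frame video labeled every 7th frame has only 5 annotated frames.
indices = labeled_frame_indices(30, 7)  # [0, 7, 14, 21, 28]
```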
II. The PoseWarper Network
• We use dilated deformable convolutions to warp pose heatmaps from an unlabeled Frame B for pose prediction in a labeled Frame A.
• Our model implicitly learns motion cues between Frame A and Frame B (no ground truth alignment is available between frames).
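The warping operation can be caricatured as resampling a heatmap at learned per-location offsets. The NumPy sketch below uses integer nearest-neighbor offsets for brevity; the actual network predicts fractional offsets through dilated deformable convolutions and samples bilinearly, so treat this only as an illustration of the idea:

```python
import numpy as np

def warp_heatmap(heatmap: np.ndarray, offsets: np.ndarray) -> np.ndarray:
    """Resample an (H, W) heatmap at per-pixel integer offsets (H, W, 2).

    output[y, x] = heatmap[y + dy, x + dx], clipped to the image bounds.
    """
    H, W = heatmap.shape
    ys, xs = np.mgrid[0:H, 0:W]
    sy = np.clip(ys + offsets[..., 0], 0, H - 1)
    sx = np.clip(xs + offsets[..., 1], 0, W - 1)
    return heatmap[sy, sx]

# A uniform offset of (+1, +1) moves a peak at (2, 2) to (1, 1).
peak = np.zeros((5, 5))
peak[2, 2] = 1.0
shifted = warp_heatmap(peak, np.ones((5, 5, 2), dtype=int))
```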
III. Applications of the PoseWarper Network
1. Video Pose Propagation
• Goal: Propagate ground truth pose annotations across the entire video from only a few labeled frames.
3. Temporal Pose Aggregation at Inference
• Challenge: Detecting pose in video can be difficult due to nuisance effects such as occlusion, motion blur, and video defocus.
• Goal: Improve detection by using pose information from adjacent frames.
• We train our PoseWarper model on the full PoseTrack dataset, i.e., when frames in videos are densely labeled.
• During inference, we use our temporal pose aggregation scheme to aggregate information from 5 nearby frames.
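At a high level, this aggregation averages heatmaps that have been warped from neighboring frames into the coordinate frame of the current frame. A minimal NumPy sketch (hypothetical simplification; the actual scheme operates on per-joint heatmaps inside the network):

```python
import numpy as np

def aggregate_heatmaps(warped: list) -> np.ndarray:
    """Average pose heatmaps already warped onto the current frame.

    At inference PoseWarper aggregates 5 frames (t-2 ... t+2); averaging
    suppresses per-frame noise caused by occlusion, blur, or defocus.
    """
    return np.mean(np.stack(warped, axis=0), axis=0)

# Three noisy estimates of the same joint reinforce each other when averaged.
a = np.array([[0.0, 0.9], [0.0, 0.0]])
b = np.array([[0.0, 1.0], [0.1, 0.0]])
c = np.array([[0.1, 0.8], [0.0, 0.0]])
fused = aggregate_heatmaps([a, b, c])
```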
2. Data Augmentation with PoseWarper
• Goal: Augment sparsely labeled video data with our propagated poses. Then, train an HRNet-W48 pose detector on this joint data.
Comparison to State-of-the-Art
IV. Additional Experiments
[Figures: accuracy (mAP) vs. number of labeled frames per video and vs. number of training videos, comparing GT (7x), GT (1x) + pGT (6x), GT (1x) + ST-Agg., and GT (1x)]
Interpreting our Learned Offsets
• It appears that different offset maps encode different motion, thus performing a sort of motion decomposition of discriminative regions in the video.
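The offset-magnitude maps visualized here can be computed per channel as a simple L2 norm; a sketch (the helper name is ours, not the paper's):

```python
import numpy as np

def offset_magnitude(offsets: np.ndarray) -> np.ndarray:
    """Per-pixel L2 magnitude of one (H, W, 2) offset channel, as used to
    visualize which image regions and motions a channel responds to."""
    return np.sqrt((offsets ** 2).sum(axis=-1))

# A single pixel with offset (3, 4) has magnitude 5.
mag = offset_magnitude(np.array([[[3.0, 4.0]]]))
```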
[Figure: Frame t vs. Frame t+5, with offset magnitudes for Channel 99 (x, y) and Channel 123 (x, y)]
V. Conclusions
• Our approach reduces the need for densely labeled video data, while producing strong pose detection performance.
• PoseWarper is useful even when training videos are densely labeled.
• The source code and our trained models will be made publicly available at: https://github.com/facebookresearch/PoseWarper
• Application Stage: After training, we can apply our model in the reverse direction to propagate pose information across the training video from the sparse ground truth poses.
2. Application Stage (Labeled → Unlabeled)
Labeled Frame A (time t+δ)
Unlabeled Frame B (time t)
Ground Truth Pose Propagated to Frame B
Ground Truth Pose in Frame A
Features Relating the Two Frames
Feature Warping
Architecture
• Every 7th frame in a training video is labeled. We propagate pose annotations from manually labeled frames to all unlabeled frames in the same video.
• Our temporal pose aggregation scheme during inference is another effective way to maintain strong performance in a sparsely-labeled video setting.
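For propagation with labels on every k-th frame, each unlabeled frame can take its pose from the closest labeled one. A hypothetical sketch of that assignment (ignoring video-boundary effects):

```python
def nearest_labeled_frame(t: int, k: int = 7) -> int:
    """Index of the closest manually labeled frame when frames 0, k, 2k, ...
    are labeled; ties go to the earlier labeled frame."""
    lo = (t // k) * k          # labeled frame at or before t
    hi = lo + k                # next labeled frame after t
    return lo if t - lo <= hi - t else hi

# With k = 7, frame 3 is closest to labeled frame 0; frame 5 to frame 7.
src = nearest_labeled_frame(5, 7)
```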
Warping from A to B
Pose Heatmap B
Frame B (time t) Frame A (time t-1) Frame C (time t+1)