Peeking into occluded joints: A novel framework for crowd pose estimation

Lingteng Qiu*1,2,3, Xuanye Zhang*1,2, Yanran Li4, Guanbin Li5, Xiaojun Wu3, Zixiang Xiong6, Xiaoguang Han**1,2, and Shuguang Cui1,2

1 The Chinese University of Hong Kong, Shenzhen  2 Shenzhen Research Institute of Big Data  3 Harbin Institute of Technology (Shenzhen)  4 Bournemouth University  5 Sun Yat-sen University  6 Texas A&M University

Abstract. Although occlusion widely exists in nature and remains a fundamental challenge for pose estimation, existing heatmap-based approaches suffer serious degradation on occlusions. Their intrinsic problem is that they directly localize the joints based on visual information; however, the invisible joints lack such information. In contrast to localization, our framework estimates the invisible joints from an inference perspective by proposing an Image-Guided Progressive GCN module which provides a comprehensive understanding of both image context and pose structure. Moreover, existing benchmarks contain limited occlusions for evaluation. Therefore, we thoroughly pursue this problem and propose a novel OPEC-Net framework together with a new Occluded Pose (OCPose) dataset with 9k annotated images. Extensive quantitative and qualitative evaluations on benchmarks demonstrate that OPEC-Net achieves significant improvements over recent leading works. Notably, our OCPose is the most complex occlusion dataset with respect to average IoU between adjacent instances. Source code and OCPose will be publicly available.

Keywords: Pose Estimation, Occlusion, Progressive GCN

1 Introduction

Human pose estimation is a long-standing problem in Computer Vision. It has attracted increasing attention in recent years due to rising demands from a wide range of applications that require human pose as input [2,4,7,11,16,18,22]. Despite the significant progress achieved in this area by advanced deep learning techniques [13,6,3,8,24], pose estimation in crowd scenarios still remains extremely challenging due to the intractable occlusion problem.

Trending models for crowd pose estimation rely strongly on heatmap representation for joint estimation: albeit effective for visible joints, these methods still suffer performance degradation on occlusions because invisible joints are hidden and thus infeasible to localize directly. To date, researchers have made painstaking efforts and complicated remedies in developing heatmap models and improving their localization accuracy. However, the occlusion problem has received little attention and only a few attempts have been made to solve it. As illustrated in Fig 1, the current state-of-the-art work still produces very awkward poses and fails to estimate the occluded joints.

Occlusion is an intractable challenge in pose estimation due to the complicated background context, complex intertwined human poses and arbitrary occluded shapes. To reveal the hidden joints, a comprehensive inference method becomes necessary rather than simple localization. Our key insight is that the invisible joints are strongly related to a contextual understanding of the image and a structural understanding of the human pose. For example, humans can easily infer the location of invisible joints using clues derived from the action type and the image context. Therefore, we delve deeper into the clues needed for invisible joint inference and propose a novel framework, OPEC-Net, to incorporate these clues for multi-person pose estimation. To achieve this goal, two stages are proposed in our framework:

* indicates equal contributions. ** Corresponding author: [email protected]

arXiv:2003.10506v3 [cs.CV] 31 Mar 2020


Fig. 1. The current SOTA method [13] (left) vs. our method (right). Our method demonstrates a more natural and accurate estimation for occluded joints

Initial Pose Estimation and GCN-based Pose Correction. The first stage generates heatmaps to produce an initial pose, and the subsequent correction stage adjusts the initial pose obtained from the heatmaps by an Image-Guided Progressive GCN (IGP-GCN) module.

The correction module deals with the image context and pose structure clues in the following aspects: (1) The human body structure provides essential constraint information between joints. For this reason, the correction module is designed as a GCN-based network, which offers an explicit way of modeling the body structural information that is advantageous for correcting the joints. (2) Another important clue for inferring the invisible joints is their related image context. Considering that, our GCN network is specially designed in an Image-Guided way: the IGP-GCN feeds both the coordinates of joints and the image features extracted at the joint locations as input to each graph node. Moreover, the multi-scale image features from the heatmap modules are fed into the IGP-GCN in a progressive way, so that large displacements can be learned steadily. This enables the IGP-GCN to capture not only pose structural information but also contextual image information at the same time. (3) In a crowd scenario, human interaction information becomes vital for inferring poses. Therefore, we further formulate a CoupleGraph by connecting the corresponding joints of two instances, making the interaction between the pair of people contribute to our estimation results as well. However, the multi-scale image features learnt for heatmap estimation are not compatible with the coordinate correction module. Thus, a Cascaded Feature Adaption (CFA) strategy is introduced to process the features first: since the finer image features have lost more global contextual information, we fuse the low-level features with high-level features following a cascaded design in order to strengthen their contextual information.

Finally, our framework is trained in an end-to-end fashion and addresses the occlusion problem in an elegant way. Interestingly, the heatmap module and the coordinate GCN module are complementary in our framework: the quantisation error introduced by the heatmap modules can be addressed by the IGP-GCN and, at the same time, the heatmap modules offer a more accurate initial value for the IGP-GCN that benefits the correction.

We conduct comprehensive experiments and introduce a new dataset to evaluate our framework. While occlusion cases are ubiquitous in crowded scenarios, only a few existing benchmarks include enough complex examples tailored to the evaluation of this problem. Thus, it becomes necessary to have datasets that not only contain light occlusions but also include heavily occluded scenes, such as waltz and wrestling, in which individuals are intertwined in complex ways. However, this field still lacks such datasets because annotating human poses in heavily occluded scenes is very difficult and requires massive manual work. Therefore, we introduce a new dataset called Occluded Pose (OCPose) that includes more complex occlusions. We manually label all 18k ground-truth human poses of the 9k images in OCPose. We also compare the average intersection over union (IoU) with typical datasets. MSCOCO [15] and MPII [1] have less than 5% of data with IoU higher than 30%; in contrast, our dataset OCPose contains 90% of data with IoU higher than 30%.

In summary, the contributions of this work are:


– To the best of our knowledge, this is the first attempt at tackling the challenging problem of occluded joints by mining image context and pose structure clues from an inference perspective. A novel framework, named OPEC-Net, is proposed, which significantly outperforms existing methods.

– Our approach designs a novel Image-Guided Progressive GCN to accommodate the structural pose information and contextual image information for correction in a single pass.

– We contribute a carefully-annotated 9K human pose dataset, OCPose, that includes highly challenging occluded scenes. To the best of our knowledge, OCPose is one of the datasets that contains the most complex occlusions to date. The OCPose dataset will be released to the public to facilitate research in the pose estimation field.

2 Related Works

Heatmap-based Models for pose estimation. Models for multi-person pose estimation (MPPE) can be divided into two categories, namely bottom-up and top-down approaches. The bottom-up methods first detect the joints and then assign them to the matching person. Pioneering bottom-up works [21,10,3,17,26] attempted to design different joint grouping strategies. Newell et al. [17] introduced a stacked hourglass network to utilize the tagging heatmap. DeepCut [21] presented an Integer Linear Program (ILP) and Zanfir et al. [26] grouped joints by learned scoring functions. Cao et al. [3] proposed a novel 2D vector field, Part Affinity Fields (PAFs), for association as well. However, these prior works all share a serious deficiency: invisible joints decrease their performance drastically.

In the second category, the top-down methods first detect all people in the scene and then perform pose estimation for each person. Most existing top-down approaches [8,19,6] focused on proposing a more effective human detector to obtain better results. Fang et al. [6] proposed a framework that is more robust to redundant human bounding boxes. Li et al. [13] designed a global maximum joints association algorithm to address the association problem in crowd scenarios. Nevertheless, all of these strategies are unable to adequately reduce errors, especially in severe occlusion cases, where one bounding box captures joints of multiple people. Most mainstream approaches are heatmap-based and thus limited in estimating invisible joints, which lack visual information. Therefore, we propose OPEC-Net, which completely differs from these works and is able to estimate invisible joints by inference rather than by localization.

GCN for pose modelling. The human body exhibits a natural graph structure, so some advanced works constructed graph networks to address human pose related problems, such as action recognition [25], motion prediction [14], and 3D pose regression [28,5]. These works intuitively represent the natural human pose as a graph and apply convolutional layers on it. Compared to other approaches, Graph Convolutional Networks demonstrate one compelling advantage when dealing with human pose modeling problems: they are more effective in capturing dependency relationships between joints.

Previous works [25,14] achieved significant improvements in human motion understanding by modeling spatial and temporal relationships as edges in a graph. Moreover, pose regression from 2D to 3D is a natural graph prediction problem, for which SemGCN [28] was proposed. However, GCN frameworks have never been introduced for keypoint detection problems such as MPPE. In comparison, our graph network is specially designed for keypoint detection, contains a progressive learning strategy, and is guided by image features.

3 OPEC-Net: Occluded Pose Estimation and Correction

Existing pose estimation approaches achieve striking results on visible joints but produce wildly inaccurate outcomes on invisible ones. This is mainly due to the fact that localizing invisible joints from the heatmap is very challenging since they are occluded and there is a lack of visual information. To rectify this shortcoming, we introduce a novel framework that infers invisible joints from image contextual and pose structural clues.


Considering that, we generate an initial pose from a heatmap-based module and feed it into a GCN-based joint correction module to learn precise positions. An Image-Guided GCN network (IGP-GCN) and a Cascaded Feature Adaption module are proposed in the correction stage. The IGP-GCN network exploits the human body structure and image context together to optimize the estimation results. By learning the displacements in a progressive way, it also offers a stable path to more accurate results.

The heatmap and coordinate modules in our framework are actually interdependent. Due to our heatmap inference network, the IGP-GCN module has a more accurate pose initialization, which also contributes to a more precise local contextual understanding, before conducting corrections. On the other hand, the coordinate-based IGP-GCN also addresses the limitation of heatmap modules: due to a size limit, heatmap representation usually causes quantisation error in joint estimation. Our IGP-GCN design tackles this issue by converting the heatmap into a coordinate representation. The overall framework and the proposed OPEC-Net module are illustrated in Fig 2.

Fig. 2. The schematic diagram of our pipeline. This figure depicts the two stages of estimation for one single pose. The GCN-based pose correction stage contains two modules: the Cascaded Feature Adaptation and the Image-Guided Progressive GCN. First, a base module is employed to generate heatmaps. After that, an integral regression method [23] is employed to transform the heatmap representation into a coordinate representation, which serves as the initial pose for the GCN network. The initial pose and the three feature maps from the base module are processed by the Image-Guided Progressive GCN. The multi-scale feature maps are updated through the Cascaded Feature Adaptation module and fed into each ResGCN Attention block. J1, J2 and J3 are the node features excavated at the related location (x, y) from the image features. The errors of the Initial Pose, Pose1, Pose2, and the Final Pose are all considered in the objective function. OPEC-Net is then trained end-to-end to estimate the human pose. The details of the whole framework are described in Section 3

3.1 Initial Pose Estimation from Heatmap-based modules

In this stage, AlphaPose+ [13] is employed as the base module to generate a heatmap for visible joints. This is a top-down approach, which first detects a bounding box for each person and then performs instance-level human pose estimation. We describe the process for an instance-level human pose in the following.

Firstly, the three layers of the decoder of the base module generate three corresponding feature maps with different levels of fine detail: a coarse feature map $F_1$, a middle feature map $F_2$ and a fine feature map $F_3$. The base module outputs a heatmap which has high confidence for visible joints. The estimated pose from the heatmap $H$ can be denoted as $P$, which contains the estimation result for each joint:

$$P = \langle x_1, y_1, c_1 \rangle, \langle x_2, y_2, c_2 \rangle, \ldots, \langle x_k, y_k, c_k \rangle, \quad (1)$$

where $x_j$ and $y_j$ are the position of the $j$th joint, $c_j$ is its confidence score, and $k$ is the number of joints in the skeleton.

3.2 GCN-based Joints Correction

The occluded poses can be inferred easily by humans mostly because of their abundant prior knowledge of implicit body structure and pose properties. More specifically, a natural human pose is highly constrained by the environment and by human body properties, such as the biomechanical structure of the human body and the constraints implied by the environment. In light of that, we propose an Image-Guided graph network for correction which takes the initial pose generated from the above modules and adjusts the estimation results according to the implicit relationships of joints.

Heatmap representation to Coordinate representation. First of all, we generate the initial pose for the GCN network from the heatmaps produced in the first stage. An important factor to consider in obtaining the initial pose is that the translation from heatmap to coordinate representation needs to be differentiable for end-to-end training, so the initial pose cannot be obtained directly from the heatmap by searching for maximum values as in $P$. Instead, a coordinate-form initial pose $J_i$ can be generated from the heatmap by an integral regression method [23].

Specifically, the heatmap is passed through a Softmax layer which normalizes its values into likelihood values in $[0, 1]$. After that, an integral operation is applied to the likelihood map to sum up the values and estimate the joint positions:

$$J_i^k = \int_{p \in A} p \cdot H_k(p), \quad (2)$$

where $J_i^k$ is the position estimate of the $k$th joint. We use $A$ to denote the region of the likelihood map and $H_k(p)$ to represent the likelihood value at point $p$. Therefore, every heatmap matrix contains the information to produce an initial pose $P_{init}$.
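For concreteness, the following is a minimal PyTorch sketch of this integral (soft-argmax) operation; the function name and tensor shapes are illustrative assumptions rather than the authors' released code.

```python
import torch

def integral_regression(heatmaps: torch.Tensor) -> torch.Tensor:
    """Differentiable heatmap-to-coordinate conversion (soft-argmax).

    heatmaps: (B, K, H, W) raw heatmap scores for K joints.
    returns:  (B, K, 2) expected (x, y) positions in pixel units.
    """
    b, k, h, w = heatmaps.shape
    # Normalize each heatmap into a likelihood map with a softmax over all pixels.
    probs = torch.softmax(heatmaps.view(b, k, -1), dim=-1).view(b, k, h, w)
    # Pixel coordinate grids.
    ys = torch.arange(h, dtype=probs.dtype, device=probs.device)
    xs = torch.arange(w, dtype=probs.dtype, device=probs.device)
    # Expected coordinate = sum_p p * H_k(p), computed per axis.
    exp_x = (probs.sum(dim=2) * xs).sum(dim=-1)  # (B, K)
    exp_y = (probs.sum(dim=3) * ys).sum(dim=-1)  # (B, K)
    return torch.stack([exp_x, exp_y], dim=-1)
```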

Graph Formulation. The human body skeleton has a natural hierarchical graph structure. Previous research on MPPE merely utilizes this information through primitive graph matching strategies. We claim that the implicit relationships between different joints are helpful to guide position estimation. We thus construct an intuitive graph $G = (V, E)$ to formulate the human pose with $N$ joints. $V$ is the node set of $G$, denoted as $V = \{v_i \mid i = 1, 2, \ldots, N\}$. $E = \{v_i v_j \mid \text{joints } i \text{ and } j \text{ are connected in the human body}\}$ is the edge set, which corresponds to the limbs of the human body. The adjacency matrix of $G$ is $A = (a_{ij})$, with $a_{ij} = 1$ when $v_i$ and $v_j$ are neighbors in $G$ or $i = j$, and $a_{ij} = 0$ otherwise.

For every node, the input feature $G_i^j$ is the joint estimation result $\langle x_i^j, y_i^j, c_i^j \rangle$, where $i$ indexes the pose and $j$ the joint of the skeleton. We denote $G_i \in \mathbb{R}^{L \times N}$ as the input feature of the $i$th pose in the training set, where $L$ is the feature dimension.
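A minimal sketch of how such an adjacency matrix can be constructed is given below; the 12-joint edge list is a hypothetical skeleton layout for illustration, not the exact definition used for OCPose.

```python
import numpy as np

def build_adjacency(num_joints, edges):
    """Adjacency matrix A with a_ij = 1 if joints i, j are connected or i == j."""
    A = np.eye(num_joints, dtype=np.float32)   # self-loops (i == j)
    for i, j in edges:                         # limb edges, symmetric
        A[i, j] = 1.0
        A[j, i] = 1.0
    return A

# Hypothetical 12-joint layout: shoulders (0,1), elbows (2,3), wrists (4,5),
# hips (6,7), knees (8,9), ankles (10,11).
EDGES = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (0, 6),
         (1, 7), (6, 7), (6, 8), (7, 9), (8, 10), (9, 11)]
A = build_adjacency(12, EDGES)
```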

Image-Guided Progressive GCN Network. The core methodology proposed in our work is the Image-Guided Progressive GCN for correction. In this network, the image context and pose structure clues for invisible joint inference are merged together in an innovative way. The details of each layer and the ResGCN Attention Blocks are described in the supplementary materials.

(1) The estimated positions of invisible joints from the base module are sometimes far from their correct locations, which makes it challenging to directly regress their displacements. Therefore, we design an intuitive coarse-to-fine learning mechanism in the coordinate-based module, which builds a progressive GCN architecture and steadily improves performance by injecting multi-scale image features in a progressive manner.

(2) The coordinate-based module lacks local context information. Consequently, we excavate the related image features at each joint position and fuse them into the module. In other words, we improve the pose estimation results by incorporating the image feature maps F1, F2, and F3. Specifically, we design cascaded ResGCN attention blocks to grasp the useful information that is stored in the feature maps but lost in the initial pose $P_i$. The three feature maps are ordered from coarse to fine according to the size of their receptive fields. After that, we employ a grid sampling method that obtains the $j$th joint feature by extracting the feature located at $\langle x_i^j, y_i^j \rangle$ on the corresponding feature map. Every pose yields three node feature vectors J1, J2, and J3 extracted following this process. Finally, these node features are fed into the ResGCN attention blocks accordingly.
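The grid sampling step can be sketched with PyTorch's grid_sample as follows; the names are illustrative, and the joint coordinates are assumed to be pre-normalized to [-1, 1] as grid_sample expects.

```python
import torch
import torch.nn.functional as F

def sample_joint_features(feat: torch.Tensor, joints_xy: torch.Tensor) -> torch.Tensor:
    """Extract per-joint node features from a feature map.

    feat:      (B, C, H, W) feature map (e.g. F1, F2 or F3).
    joints_xy: (B, K, 2) joint coordinates already normalized to [-1, 1].
    returns:   (B, K, C) node features, one vector per joint.
    """
    grid = joints_xy.unsqueeze(2)                      # (B, K, 1, 2)
    sampled = F.grid_sample(feat, grid, mode='bilinear',
                            align_corners=False)       # (B, C, K, 1)
    return sampled.squeeze(-1).permute(0, 2, 1)        # (B, K, C)
```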

Cascaded Feature Adaption (CFA). Feature maps F1, F2, and F3 should be adapted to provide more effective information to the IGP-GCN. Moreover, the low-level and high-level features are fused in a cascaded design in order to enlarge their receptive fields, so that the updated features are more informative. The details of the Conv Blocks and Fusion Blocks used in this module are given in the supplementary materials.

CoupleGraph. We extend the single-human graph into a CoupleGraph that captures more human interactions; this is achieved by connecting the corresponding joints of the two people. The couple graph can be denoted as $G' = (V', E')$. Each person has $N$ joints, so there are $2N$ joints in total in the couple graph, formulated as $V' = \{v_i \mid i = 1, 2, \ldots, 2N\}$. There are two types of edges in $E'$: the edges representing the human skeletons, $E_s$, and the edges connecting the two humans, $E_c$. The human skeleton edges are $E_s = \{v_i v_j \mid \text{joints } i \text{ and } j \text{ are connected in the human body}\}$. The human interaction edges can be written as $E_c = \{v_i v_{i+N}\}$, where $v_i$ and $v_{i+N}$ correspond to the same joint of the two human skeletons. The CoupleGraph module is appended after the OPEC-Net module to enhance the estimation performance, and each pair of people is processed by the CoupleGraph.
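Assuming the single-person adjacency matrix defined in the graph formulation above, a CoupleGraph adjacency can be assembled as in the following sketch (names are illustrative).

```python
import numpy as np

def build_couple_adjacency(A_single: np.ndarray) -> np.ndarray:
    """Couple-graph adjacency: two skeleton copies plus cross-person edges
    linking each joint i to its counterpart i + N in the other skeleton."""
    N = A_single.shape[0]
    A = np.zeros((2 * N, 2 * N), dtype=A_single.dtype)
    A[:N, :N] = A_single          # skeleton edges E_s of person 1
    A[N:, N:] = A_single          # skeleton edges E_s of person 2
    idx = np.arange(N)
    A[idx, idx + N] = 1.0         # interaction edges E_c
    A[idx + N, idx] = 1.0
    return A
```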

3.3 Loss Functions

The objective function of our OPEC-Net module can now be formulated. We denote the training set as $\Omega$, the ground truth of the $i$th pose in $\Omega$ as $P_i$, and the output pose of the $j$th ResGCN Attention block as $P_{ij}$. From heatmap representation to coordinate representation, the integral regression method produces an initial pose $P_{init}$. Hence, the total loss is defined as the sum of the rectified loss of the poses from the IGP-GCN and the loss of the initial pose:

$$\mathrm{Loss} = \min_{\theta} \sum_{i \in \Omega} \left( \sum_{j=1}^{n} \lambda_j \left| (P_{ij} - P_i) \odot M \right| + \left| (P_{init} - P_i) \odot M \right| \right) \quad (3)$$

The term $|P_{ij} - P_i|$ denotes the L1 loss between our estimated pose and the ground truth, and $n$ is the number of ResGCN attention blocks in the model; in this work, we set $n = 3$. We sum up the errors of the poses produced by each block and assign a parameter $\lambda_j$ to control their weights. All the trainable parameters in our network are denoted as $\theta$. $M \in \mathbb{Z}_2^N$ is a binary mask whose element is 1 when the related joint has a ground truth label, and 0 otherwise. The $\odot$ denotes the element-wise product, so that we only take into account the errors on the joints with ground truth.

The last generated pose corresponds to the best estimation result, so we treat the final one as our estimated result.
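A minimal sketch of the masked objective in Eq. (3) is given below; the shapes, argument names and default weights (taken from the implementation details in Section 5.1) are assumptions, not the authors' code.

```python
import torch

def opec_loss(block_poses, init_pose, gt_pose, mask, lambdas=(0.3, 0.5, 1.0)):
    """Masked L1 objective of Eq. (3).

    block_poses: list of (B, K, 2) poses from the n ResGCN Attention blocks.
    init_pose:   (B, K, 2) initial pose from integral regression.
    gt_pose:     (B, K, 2) ground-truth joints.
    mask:        (B, K) binary mask, 1 where a ground-truth label exists.
    """
    m = mask.unsqueeze(-1)                               # broadcast over (x, y)
    loss = ((init_pose - gt_pose).abs() * m).sum()       # initial-pose term
    for lam, pose in zip(lambdas, block_poses):          # weighted block terms
        loss = loss + lam * ((pose - gt_pose).abs() * m).sum()
    return loss
```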

4 Our Occluded Pose Dataset

We build a new dataset, called Occluded Pose (OCPose), that includes heavier occlusions to evaluate MPPE. It contains challenging invisible joints and complex intertwined human poses. We mostly consider couple-pose scenes, such as dancing, skating, and wrestling, because they offer more reliable annotations and practical utility. This section gives details of data collection, data annotation, and data statistics.

Data Collection. The ground truth of a human pose can be hard to recognize when the occlusions are extremely heavy. Thus, we mainly collect videos of two-person interactions since they are much easier to annotate: the volunteer can infer the pose from contextual information. We first search for videos on the Internet using keywords such as boxing, dancing, and wrestling. We then capture distinctive images which contain diverse poses and humans from these videos by restricting the sampling interval to be at least 3 seconds. Finally, we manually sift through the clips to select high-quality images. All images were collected with attention to privacy concerns.

Data Annotation. We develop an annotation tool that lets the user draw a bounding box around the couple and then drag two template skeletons to their correct positions. Six volunteers are recruited for manual labelling. Each skeleton has 12 joints, and the left and right components are distinguished. In addition to annotating the bounding box and the human body poses, the volunteers also indicate whether each joint is visible. To ensure accuracy, we use cross annotation for every image: at least two volunteers provide annotations on the same image. If an intolerable deviation exists between their results, the image is annotated again. The final joint positions are the mean of the two annotations.

Table 1. Comparison of occlusion level. We count the number of images in each dataset at different levels of occlusion. As shown, MSCOCO and MPII contain almost no heavy occlusions. OCHuman is the state-of-the-art dataset for occlusions, but our dataset is larger and contains more severe occlusion

Dataset     Total    IoU>0.3      IoU>0.5      IoU>0.75     Average IoU
CrowdPose   20000    8706 (44%)   2909 (15%)   309 (2%)     0.27
MSCOCO      118287   6504 (5%)    1209 (1%)    106 (<1%)    0.06
MPII        24987    0            0            0            0.11
OCHuman     4773     3264 (68%)   3244 (68%)   1082 (23%)   0.46
Ours        9000     8105 (90%)   6843 (76%)   2442 (27%)   0.47

Data Statistics. In total, our dataset contains 9000 images and 18000 fully annotated persons. The training set consists of 5000 images, whereas the validation and test sets each contain 2000 images.

To compare occlusion levels, we evaluate the average intersection over union (IoU) of bounding boxes on other public benchmarks: CrowdPose [13], OCHuman [27], MSCOCO [15] and MPII [1]. We report the comparison in Table 1, which shows that our dataset surpasses all the other benchmarks in occlusion level.
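For reference, the IoU between two instance bounding boxes, which underlies the statistics in Table 1, can be computed as in the small sketch below (the (x1, y1, x2, y2) box format is an assumption).

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)
```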

Other Datasets. In our approach, we carried out extensive experiments on public benchmarks. Following the typical training procedure, we evaluate OPEC-Net on our OCPose, CrowdPose [13], MSCOCO [15] and the particularly occluded dataset OCHuman [27]. The CrowdPose dataset is split in a ratio of 5 : 4 : 1 for training, testing and validation respectively. We regard the validation set of OCHuman with 2500 images as our training set, and the remaining 2273 images are used for testing. We follow the typical training strategy on MSCOCO.


5 Experiments

In this section, extensive quantitative and qualitative experiments are presented to evaluate the effectiveness of our OPEC-Net. Comprehensive ablation studies are carried out to validate the effectiveness of each component.

5.1 Experimental Settings

Implementation Details. For training, we set the parameters $\lambda_1 = 0.3$, $\lambda_2 = 0.5$, $\lambda_3 = 1$ and train for 30 epochs. We feed 10 images in a batch to train the whole framework. The initial learning rate is set to 1e-3 and decays with a cosine schedule. The input image size is 384 × 288 for MSCOCO and 320 × 256 for the other datasets. An Adam optimizer is employed to optimize the parameters by backpropagation. For a fair comparison, we filter out proposals of background instances and only focus on the Object Keypoint Similarity (OKS) of the targets when evaluating baselines on our dataset. We implement our model in PyTorch [20] and conduct experiments on one Nvidia GeForce GTX 1080 Ti with 11GB memory. More details are described in the supplementary materials.
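A rough sketch of a training loop matching these reported settings is shown below; the model and data loader interfaces are assumptions, and the model is assumed to return the total loss of Eq. (3).

```python
import torch

def train_opec_net(model, train_loader, epochs=30, lr=1e-3):
    """Training loop matching the reported settings: Adam, initial lr 1e-3,
    cosine decay, 30 epochs; batches are supplied by `train_loader`."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for _ in range(epochs):
        for images, targets in train_loader:
            optimizer.zero_grad()
            loss = model(images, targets)   # assumed to return the total loss of Eq. (3)
            loss.backward()
            optimizer.step()
        scheduler.step()
```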

Evaluation Metric. We follow the standard evaluation metric of MSCOCO, which is widely used by existing works as well [6,13,9,3]. Specifically, we report the mean Average Precision (mAP) at 0.5:0.95, 0.5, 0.75, 0.80 and 0.90. In order to select qualified poses for the OPEC-Net training procedure, two rules are formulated to filter the proposals: a proposal pose must contain more than 5 visible joints and have an OKS value above 0.3. To enrich the dataset, we also flip the images as a data augmentation strategy. Furthermore, we provide visualization results of the pose estimation.
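The two proposal selection rules can be expressed as a small filter like the following sketch; the field names 'visible' and 'oks' are assumptions about the data format.

```python
def select_training_proposals(proposals, min_visible=6, min_oks=0.3):
    """Keep proposal poses with more than 5 visible joints and OKS above 0.3.

    proposals: iterable of dicts with keys 'visible' (list of 0/1 flags)
               and 'oks' (float); both field names are assumed.
    """
    return [p for p in proposals
            if sum(p['visible']) >= min_visible and p['oks'] > min_oks]
```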

Baselines. For comparison, we assess the performance of our OPEC-Net module against three state-of-the-art approaches for MPPE: Mask RCNN [8], AlphaPose+1 [13] and SimplePose [24]. For a fair comparison, we quote the results of Mask RCNN and SimplePose directly from [13] and re-train AlphaPose+ from its public code. For the evaluation on OCPose, CrowdPose and OCHuman, we take AlphaPose+ for the initial pose estimation stage with ResNet-101 as the backbone and Yolo V3 as the detector. For the MSCOCO dataset, we take the public code of SimplePose2 for the first stage, since it has higher performance than AlphaPose+ on MSCOCO; Mask RCNN is used as the detector and ResNet-152 as the backbone. In the following, OPEC-Net denotes the framework with a single person as the graph, and CoupleGraph denotes the baseline that applies a CoupleGraph-based framework after OPEC-Net.

Fig. 3. The qualitative evaluation of CoupleGraph and OPEC-Net. The left images are generated by OPEC-Net and the right ones by CoupleGraph

1 https://github.com/MVIG-SJTU/AlphaPose/tree/pytorch
2 https://github.com/leoxiaobin/pose.pytorch


Table 2. The comparison on our OCPose dataset

Method            [email protected]:0.95   AP50   AP75   AP80   AP90
Mask RCNN [8]     21.5          49.8   15.9   7.7    0.1
Simple Pose [24]  27.1          54.3   24.2   16.8   4.7
AlphaPose+ [13]   30.8          58.4   28.5   22.4   8.2
OPEC-Net          32.8 (+2.0)   60.5   31.1   24.0   9.2
CoupleGraph       33.6 (+2.8)   60.8   32.5   25.0   9.8

5.2 Performance Comparison on our OCPose dataset

Quantitative Comparison. The quantitative results are presented in Table 2. Our approach attains the best mAP compared to all the baselines by a considerable margin. OPEC-Net achieves a significant gain of 2.0 [email protected]:0.95 over AlphaPose+. In addition, a significant 1.0 AP90 improvement is achieved, which shows that our OPEC-Net has a strong inference ability, especially at high levels of occlusion, compared to localization methods. In conclusion, these results validate the effectiveness of our OPEC-Net module on MPPE tasks.

Qualitative Comparison. As illustrated in the first row of Fig 4, our OPEC-Net is capable of correcting wrong links between joints and estimating the occluded joints while maintaining high performance on visible joints. We make the following observations from the results: (1) For the first sample, a superior pose estimation result is provided by our method; even an error with a large displacement can be corrected by OPEC-Net. (2) Although the second case has difficult sunlight interference, our approach can still adjust the joints to their correct locations. (3) The third group provides evidence that our OPEC-Net module produces more natural poses that conform to human body constraints. (4) The fourth figure shows that our method can find the correct links between joints.

CoupleGraph. The evaluations of CoupleGraph are given in Tab 2 and Fig 3. Compared to OPEC-Net, the CoupleGraph baseline provides a further 0.8 [email protected]:0.95 improvement, which validates that human interaction clues are quite informative. As illustrated in Fig 3, CoupleGraph outperforms OPEC-Net significantly in quality. In these human interactive scenarios, the poses estimated by CoupleGraph are more concordant.

5.3 Comparison against state-of-the-art methods on other benchmarks

Extensive evaluations on heavily benchmarked datasets demonstrate the effectiveness of our model for the occlusion problem. The experimental results on existing benchmarks are presented in Tables 3, 4 and Fig 4. Our model surpasses all the baselines by a considerable margin.

Table 3. Quantitative results on the occlusion datasets

OCHuman [27]
Method            [email protected]:0.95   AP50   AP75   AP80   AP90
Mask RCNN [8]     20.2          33.2   24.5   18.3   2.1
SimplePose [24]   24.1          37.4   26.8   22.6   4.5
AlphaPose+ [13]   27.5          40.8   29.9   24.8   9.5
OPEC-Net          29.1 (+1.6)   41.3   31.4   27.0   12.8 (+3.3)

CrowdPose [13]
Method            [email protected]:0.95   AP50   AP75   AP80   AP90
Mask RCNN [8]     57.2          83.5   60.3   -      -
SimplePose [24]   60.8          81.4   65.7   -      -
AlphaPose+ [13]   68.5          86.7   73.2   66.9   45.9
OPEC-Net          70.6 (+2.1)   86.8   75.6   70.1   48.8 (+2.9)

OCHuman. As OCHuman is a new benchmark proposed mainly for pose segmentation, we are the first to report all the baseline results on this challenging occlusion dataset.


Fig. 4. The results on OCPose, OCHuman and CrowdPose. These are the qualitative comparison results of the AlphaPose+ method and OPEC-Net on our datasets. The left pose is estimated by AlphaPose+ and the right one is ours. The first row is OCPose, the second row OCHuman, and the rest CrowdPose

Compared to AlphaPose+, we achieve a maximal 3.3-point improvement on AP90. This further validates that our OPEC-Net model is robust even in highly challenging occlusion scenarios.

CrowdPose. As shown in Table 3, OPEC-Net lifts the estimation result by 2.1 [email protected]:0.95 over AlphaPose+. It is also worth noting that the improvements remain high at the stricter AP thresholds: our model achieves gains of 0.1, 2.4, 3.2 and 2.9 on AP50, AP75, AP80 and AP90 respectively.

MSCOCO. We also present our results on the largest benchmark, MSCOCO. Here our model contributes only slight accuracy improvements. The reason is the key difference between MSCOCO and the other datasets: it contains very few occlusion scenarios, especially severe ones. Moreover, many invisible joints lack annotations on MSCOCO.

Invisible vs. Visible. To investigate the effectiveness on invisible (Inv) and visible (V) joints separately, we report statistics for each type of joint following a rule similar to OKS. From Tab 5, our OPEC-Net improves mostly on the invisible joints rather than the visible ones. In terms of Inv@75, our framework achieves considerable gains of 3.3% and 4.9% on CrowdPose and OCPose respectively. In contrast, OPEC-Net improves visible joints by at most 1%, because our main focus is the invisible joints. This comparison also explains why the gains are smaller on MSCOCO than on the other datasets, which contain more occlusions.


Table 4. MSCOCO 2017 test-dev set [15]

Method            [email protected]:0.95   [email protected]   [email protected]
AlphaPose+ [13]   72.2          90.1     79.3
Simple Pose [24]  73.7          91.9     81.8
OPEC-Net          73.9 (+0.2)   91.9     82.2

Table 5. Results for Visible and Invisible Joints on CrowdPose and OCPose

CrowdPose
Method        Inv@75   Inv@90   V@75    V@90
AlphaPose+    76.2%    57.2%    89.5%   67.8%
OPEC-Net      79.5%    58.4%    90.0%   67.8%

OCPose
Method        Inv@75   Inv@90   V@75    V@90
AlphaPose+    50.7%    17.7%    85.2%   55.3%
OPEC-Net      55.6%    20.5%    86.2%   55.1%

5.4 Ablation Studies

To analyze our model in detail, we conduct comprehensive ablative experiments to evaluate the capability of each component and the clues we claimed. As illustrated in Table 6, we present baselines to investigate the impact of each component.

Firstly, we investigate the impact of the image-guided strategy that blends the image context with the GCN. From (a), a clear decrease of around 2.0 [email protected]:0.95 is observed, which points out the importance of the Image-Guided strategy. Without the Image-Guided part, a GCN network alone improves the performance poorly. This evidence validates that the GCN module must learn under the guidance of image features.

We further investigate each design of the IGP-GCN. From (b) and (c), we can conclude that the strategy of progressive, coarse-to-fine feature learning is effective. Moreover, the proposed Cascaded Feature Adaption module is analysed as well. As shown in Table 6, removing it (d) causes the mAP on all three datasets to fall significantly, demonstrating that the CFA module plays an indispensable role in the whole framework. We also remove the Fusion Blocks and report the results in (e), which further proves the effectiveness of the fusion part in the CFA module. We conclude that the image guidance is the most imperative component in the framework. The CFA module brings an average 0.7 [email protected]:0.95 gain on the three datasets, which shows the necessity of making the image features adaptive for the coordinate module. Overall, these ablation studies validate that every component is effective and the clues are informative for invisible joint inference.

Table 6. Ablation study of our OPEC-Net framework ([email protected]:0.95)

Evaluation of removing each component                    OCPose   OCHuman   CrowdPose
The AlphaPose+ baseline                                  30.8     27.5      68.5
(a) Without the Image-Guided strategy in the GCN         30.8     27.7      68.6
(b) Without the Progressive design in the GCN            32.2     28.3      69.3
(c) Use one feature F3 instead of multi-scale features   32.5     28.5      69.7
(d) Remove the Cascaded Feature Adaption module          32.1     28.4      69.9
(e) Remove the Fusion Block in the CFA module            32.4     28.7      69.6
The full OPEC-Net                                        32.8     29.1      70.6


6 Conclusion

In this paper, we proposed a novel OPEC-Net module and a challenging Occluded Pose (OCPose) dataset to address the occlusion problem in crowd pose estimation. Two elaborate components, the Image-Guided Progressive GCN and the Cascaded Feature Adaptation, are designed to exploit natural human body constraints and image context. We conduct thorough experiments on four benchmarks together with ablation studies to demonstrate the effectiveness of our approach and provide a variety of insights. The heatmap and coordinate modules are shown to work cooperatively and achieve remarkable improvements in all aspects. By making this dataset available, we hope to draw attention to and increase interest in the occlusion problem in pose estimation.

Acknowledgment

The work was supported in part by grants No. 2018YFB1800800, No. 2018B030338001, No. 2017ZT07X152, No. ZDSYS201707251409055 and in part by the National Natural Science Foundation of China (Grant Nos. 61902334 and 61629101). The authors would also like to thank Running Gu and Yuheng Qiu for their early efforts on data labeling.

References

1. Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2d human pose estimation: New benchmark and state of the art analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3686–3693 (2014)
2. Bansal, A., Ma, S., Ramanan, D., Sheikh, Y.: Recycle-gan: Unsupervised video retargeting. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 119–135 (2018)
3. Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2d pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7291–7299 (2017)
4. Chan, C., Ginosar, S., Zhou, T., Efros, A.A.: Everybody dance now. arXiv preprint arXiv:1808.07371 (2018)
5. Ci, H., Wang, C., Ma, X., Wang, Y.: Optimizing network structures for 3d human pose estimation. ICCV (2019)
6. Fang, H., Xie, S., Tai, Y.W., Lu, C.: Rmpe: Regional multi-person pose estimation. In: The IEEE International Conference on Computer Vision (ICCV). vol. 2 (2017)
7. Gui, L.Y., Zhang, K., Wang, Y.X., Liang, X., Moura, J.M., Veloso, M.: Teaching robots to predict human motion. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 562–567. IEEE (2018)
8. He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask r-cnn. In: Computer Vision (ICCV), 2017 IEEE International Conference on. pp. 2980–2988. IEEE (2017)
9. Huang, Z., Huang, L., Gong, Y., Huang, C., Wang, X.: Mask scoring r-cnn. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6409–6418 (2019)
10. Insafutdinov, E., Pishchulin, L., Andres, B., Andriluka, M., Schiele, B.: Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In: European Conference on Computer Vision. pp. 34–50. Springer (2016)
11. Joo, H., Simon, T., Sheikh, Y.: Total capture: A 3d deformation model for tracking faces, hands, and bodies. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8320–8329 (2018)
12. Li, G., Muller, M., Thabet, A., Ghanem, B.: Deepgcns: Can gcns go as deep as cnns? In: Proceedings of the IEEE International Conference on Computer Vision. pp. 9267–9276 (2019)
13. Li, J., Wang, C., Zhu, H., Mao, Y., Fang, H.S., Lu, C.: Crowdpose: Efficient crowded scenes pose estimation and a new benchmark. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 10863–10872 (2019)
14. Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., Tian, Q.: Actional-structural graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3595–3603 (2019)
15. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European Conference on Computer Vision. pp. 740–755. Springer (2014)
16. Mehta, D., Sridhar, S., Sotnychenko, O., Rhodin, H., Shafiei, M., Seidel, H.P., Xu, W., Casas, D., Theobalt, C.: Vnect: Real-time 3d human pose estimation with a single rgb camera. ACM Transactions on Graphics (TOG) 36(4), 44 (2017)
17. Newell, A., Huang, Z., Deng, J.: Associative embedding: End-to-end learning for joint detection and grouping. In: Advances in Neural Information Processing Systems. pp. 2277–2287 (2017)
18. Panteleris, P., Oikonomidis, I., Argyros, A.: Using a single rgb frame for real time 3d hand pose estimation in the wild. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 436–445. IEEE (2018)
19. Papandreou, G., Zhu, T., Kanazawa, N., Toshev, A., Tompson, J., Bregler, C., Murphy, K.: Towards accurate multi-person pose estimation in the wild. In: CVPR. vol. 3, p. 6 (2017)
20. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch (2017)
21. Pishchulin, L., Insafutdinov, E., Tang, S., Andres, B., Andriluka, M., Gehler, P.V., Schiele, B.: Deepcut: Joint subset partition and labeling for multi person pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4929–4937 (2016)
22. Qian, X., Fu, Y., Xiang, T., Wang, W., Qiu, J., Wu, Y., Jiang, Y.G., Xue, X.: Pose-normalized image generation for person re-identification. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 650–667 (2018)
23. Sun, X., Xiao, B., Wei, F., Liang, S., Wei, Y.: Integral human pose regression. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 529–545 (2018)
24. Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 466–481 (2018)
25. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
26. Zanfir, A., Marinoiu, E., Zanfir, M., Popa, A.I., Sminchisescu, C.: Deep network for the integrated 3d sensing of multiple people in natural images. In: Advances in Neural Information Processing Systems. pp. 8410–8419 (2018)
27. Zhang, S.H., Li, R., Dong, X., Rosin, P., Cai, Z., Han, X., Yang, D., Huang, H., Hu, S.M.: Pose2seg: Detection free human instance segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 889–898 (2019)
28. Zhao, L., Peng, X., Tian, Y., Kapadia, M., Metaxas, D.N.: Semantic graph convolutional networks for 3d human pose regression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3425–3435 (2019)

Supplementary Materials

We provide details of our algorithms and more visualization results in this supplementary material. Two videos are attached to demonstrate the visual comparisons between our OPEC-Net and AlphaPose+ [13] in crowd and couple scenarios respectively. Our estimations are displayed as green skeletons and the red ones are from AlphaPose+. It is evident from the videos that our estimation results are more accurate than those of AlphaPose+, especially for occlusion cases. For the couple scenario, the comparison between OPEC-Net and OPEC-CG (with the CoupleGraph extension) is also shown, where OPEC-Net is in red and OPEC-CG is in green. As seen, with the help of the CoupleGraph, the approach is more robust to severe occlusions.

1 Illustration of the CoupleGraph

The CoupleGraph is proposed to capture human interaction information. In the main paper, we gave the formulation of the CoupleGraph. Here, we illustrate the structure of the couple graph in the following:


Fig. 5. The couple graph. The blue and red skeletons represent two different individuals. Besides the in-skeleton bone edges, the graph also links the two corresponding joints between the two instances.

2 Details of the Network Architecture

We explain the detailed settings of the layers and parameters of our OPEC-Net in this section. Firstly, the proposals $J_k$ are normalized into the range $[-1, 1]$. There is a ReLU layer after each deep GCN layer [12] in our network. In the Cascaded Feature Adaption (CFA) module, each convolutional layer has a 3×3 kernel and is followed by a ReLU function. In the IGP-GCN, the first and second GCN layers have 128 channels. The first ResGCN Attention Block (RAB) takes a 128-channel feature map as input and outputs a feature map with 256 channels. Both the input and output channels of the following two RABs are 256. The last block takes a 256-channel feature map together with a 128-channel feature map as input and outputs a 256-channel feature map.

3 ResGCN Attention Block

As seen in Fig 6, each RAB consists of three GCN layers [12] and a self-attention layer. To ease the learning, we also add a residual link in the RAB so that the network focuses on inferring the residual value. To this end, before conducting the addition with the output of the lower two GCN layers, we also involve a GCN layer (the upper one in the figure) to adjust the feature size of the input. Furthermore, we append a self-attention layer at the end to learn the dependencies between joints and utilize them to capture a more informative understanding of the inherent relationships in the human pose, for more accurate estimation.

Fig. 6. The ResGCN Attention Block. C denotes the channel size of the feature.
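The block can be sketched as follows; here the graph convolution is a simplified stand-in for the DeepGCN layers [12], the self-attention layer is realized with multi-head attention, and the number of heads is an illustrative choice.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Simple graph convolution: per-node linear map followed by
    normalized-adjacency aggregation (a stand-in for the DeepGCN layers [12])."""
    def __init__(self, in_ch, out_ch, adj):
        super().__init__()
        self.fc = nn.Linear(in_ch, out_ch)
        deg = adj.sum(dim=-1, keepdim=True)
        self.register_buffer('adj', adj / deg)      # row-normalized adjacency
    def forward(self, x):                            # x: (B, N, in_ch)
        return self.adj @ self.fc(x)

class ResGCNAttentionBlock(nn.Module):
    """RAB sketch: two stacked GCN layers, a GCN shortcut that matches the
    feature size, a residual sum, and a joint self-attention layer."""
    def __init__(self, in_ch, out_ch, adj, heads=4):
        super().__init__()
        self.gcn1 = GCNLayer(in_ch, out_ch, adj)
        self.gcn2 = GCNLayer(out_ch, out_ch, adj)
        self.shortcut = GCNLayer(in_ch, out_ch, adj)   # adjusts input feature size
        self.relu = nn.ReLU(inplace=True)
        self.attn = nn.MultiheadAttention(out_ch, heads, batch_first=True)
    def forward(self, x):                              # x: (B, N, in_ch)
        y = self.gcn2(self.relu(self.gcn1(x)))
        y = self.relu(y + self.shortcut(x))            # residual link
        out, _ = self.attn(y, y, y)                    # self-attention over joints
        return out
```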

4 Cascaded Feature Adaption (CFA) Module

As seen in Fig 7, the CFA module consists of two Fusion blocks and three Conv blocks. In the experiments, we use one convolution layer in each Conv block. More specifically, the CFA module takes the three feature maps $F_1$, $F_2$, and $F_3$ as input and outputs the adapted maps $\hat{F}_1$, $\hat{F}_2$, and $\hat{F}_3$. Firstly, $F_1$ is transformed into $\hat{F}_1$ by a convolutional block, which can be formulated as

$$\hat{F}_1 = \mathrm{Conv}(F_1; \theta). \quad (4)$$

We subsequently fuse the two feature maps $\hat{F}_1$ and $F_2$ with a Fusion block, which produces a new feature map $\hat{F}_2$. This process is performed once more to generate $\hat{F}_3$. We formulate it as

$$\hat{F}_{i+1} = \mathrm{Fusion}(\hat{F}_i, F_{i+1}; \theta). \quad (5)$$

Thanks to the CFA module, the feature maps $\hat{F}_1$, $\hat{F}_2$, and $\hat{F}_3$ are more informative and adaptive than $F_1$, $F_2$, and $F_3$ for the next stage. The heatmap is predicted from the finest feature map:

$$\mathrm{Heatmap} = \mathrm{Conv}(F_3; \theta). \quad (6)$$

Fig. 7. The left figure shows the design of the Cascaded Feature Adaption module and the right one demonstrates the design of the Fusion block.

In particular, the Fusion Block takes $\hat{F}_i$ and $F_{i+1}$ as input and outputs $\hat{F}_{i+1}$. The two feature maps are concatenated and passed into an attention layer, which learns weights for automatically combining information from both sides.
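A rough sketch of the CFA module under these descriptions is given below; it assumes all three feature maps share the same channel count and uses bilinear resizing to align spatial sizes, both of which are simplifying assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Fuse an adapted coarse feature with the next finer feature: concatenate,
    predict per-pixel attention weights, and blend the two maps."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid())
    def forward(self, f_prev, f_next):
        # Resize the coarser map so the two maps can be blended pixel-wise.
        f_prev = nn.functional.interpolate(f_prev, size=f_next.shape[-2:],
                                           mode='bilinear', align_corners=False)
        w = self.attn(torch.cat([f_prev, f_next], dim=1))
        return w * f_prev + (1.0 - w) * f_next

class CFAModule(nn.Module):
    """Cascaded Feature Adaption sketch: a Conv block on F1, then two cascaded
    fusions producing the adapted maps used by the IGP-GCN (Eqs. 4-5)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                   nn.ReLU())
        self.fuse12 = FusionBlock(channels)
        self.fuse23 = FusionBlock(channels)
    def forward(self, f1, f2, f3):
        a1 = self.conv1(f1)           # Eq. (4)
        a2 = self.fuse12(a1, f2)      # Eq. (5), i = 1
        a3 = self.fuse23(a2, f3)      # Eq. (5), i = 2
        return a1, a2, a3
```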

5 Result Gallery

In this section, we present more visual comparison results on the three datasets: OCPose, OCHuman [27], and CrowdPose [13]. For each example, the left result is obtained from AlphaPose+, and the right one is estimated by our OPEC-Net.


Fig. 8. Result Gallery of the OCPose. For each example, the left result is from AlphaPose+ while the right is from our OPEC-Net.


Fig. 9. Result Gallery of the CrowdPose. For each example, the left result is from AlphaPose+ while the right is from our OPEC-Net.


Fig. 10. Result Gallery of the OCHuman. For each example, the left result is from AlphaPose+ while the right is from our OPEC-Net.