An Efficient Convolutional Network for Human Pose Estimation

Umer Rafi 1

rafi@vision.rwth-aachen.de

Ilya Kostrikov 1

[email protected]

Juergen Gall 2

[email protected]

Bastian Leibe 1

[email protected]

1 Computer Vision Group, RWTH Aachen University, Germany

2 Computer Vision Group, University of Bonn, Germany

In recent years, human pose estimation has greatly benefited from deep learning, and huge gains in performance have been achieved on popular benchmarks [1, 3, 4]. The trend to maximise accuracy on benchmarks, however, has resulted in computationally expensive deep network architectures that require expensive hardware and pre-training on large datasets. In this work, we propose an efficient deep network architecture that can be trained on mid-range GPUs without any pre-training and that is on par with much more complex models on the benchmarks [1, 3, 4].

Our proposed Fully Convolutional GoogLeNet (FCGN) network (see Figure 1) is based on the network architecture from [2]. We take the first 17 layers of [2] and add a deconvolution layer to make it fully convolutional. In addition, we introduce a skip layer and combine two FCGNs with shared weights to obtain a multi-resolution network. Belief maps for each joint are then obtained by a deconvolution layer with a large kernel size, in combination with a sigmoid function for normalisation and spatial dropout for regularisation.
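The output stage described above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the deconvolution layer is omitted, the score values are made up, and both the inverted-dropout rescaling and the arg-max decoding of a joint position are assumptions of this sketch. The function names (`spatial_dropout`, `belief_map`, `peak_location`) are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function; maps any real score into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def spatial_dropout(feature_maps, drop_prob, rng):
    """Zero out whole feature maps (channels) instead of single activations.

    Assumes 0 <= drop_prob < 1. Surviving maps are rescaled by
    1 / (1 - drop_prob) (inverted dropout, an assumption of this sketch).
    """
    keep_scale = 1.0 / (1.0 - drop_prob)
    out = []
    for fmap in feature_maps:
        if rng.random() < drop_prob:
            # Drop the entire channel.
            out.append([[0.0] * len(row) for row in fmap])
        else:
            out.append([[v * keep_scale for v in row] for row in fmap])
    return out


def belief_map(scores):
    """Apply the sigmoid element-wise to a 2D score map for one joint."""
    return [[sigmoid(v) for v in row] for row in scores]


def peak_location(belief):
    """Return (row, col, value) of the strongest belief (hypothetical decoding)."""
    best = (0, 0, belief[0][0])
    for r, row in enumerate(belief):
        for c, v in enumerate(row):
            if v > best[2]:
                best = (r, c, v)
    return best


# Toy raw scores for one joint (made-up values on a 3 x 4 grid).
raw_scores = [
    [-4.0, -2.0, -3.0, -5.0],
    [-1.0, 3.0, 0.5, -2.0],
    [-3.0, 0.0, -1.0, -4.0],
]
belief = belief_map(raw_scores)
peak_row, peak_col, peak_value = peak_location(belief)
print(f"peak at row={peak_row}, col={peak_col}, belief={peak_value:.3f}")

# Spatial dropout on two toy feature maps, using a seeded generator.
features = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
dropped = spatial_dropout(features, drop_prob=0.5, rng=random.Random(0))
```

Because the sigmoid is applied to each location independently, every belief value lies in (0, 1) but a map does not sum to one, unlike a softmax over locations. Spatial dropout is applied only during training; at inference the feature maps pass through unchanged.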

We compare the performance of the proposed architecture against convolutional pose machines [5] on the well-known FLIC, LSP, and MPII benchmarks [1, 3, 4]. Our proposed network outperforms most previous approaches and performs competitively with the more complex model of [5], while requiring only 3 GB of memory and far less training time.

Figure 1: (a) Proposed fully convolutional GoogLeNet (FCGN). (b) The proposed multi-resolution network, which combines two FCGNs.

Figure 2: Qualitative results of our network on FLIC [4], LSP [3], and MPII [1].

[1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D Human Pose Estimation: New Benchmark and State of the Art Analysis. In CVPR, 2014.

[2] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, 2015.

[3] S. Johnson and M. Everingham. Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation. In BMVC, 2010.

[4] B. Sapp and B. Taskar. MODEC: Multimodal Decomposable Models for Human Pose Estimation. In CVPR, 2013.

[5] S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional Pose Machines. In CVPR, 2016.