Abstract

Existing methods for human pose estimation usually rely on large intermediate tensors, leading to a high computational load that is detrimental to resource-limited devices. To solve this problem, we propose a low computational cost pose estimation network, MobilePoseNet, which consists of an encoder, a decoder, and a parallel nonmaximum suppression operation. Specifically, we design a lightweight upsampling block to replace transposed convolution in the decoder and use a lightweight network as our downsampling part. We then choose high-resolution features as the input to the upsampling stage to reduce the number of model parameters. Finally, we propose a parallel OKS-NMS, which significantly outperforms conventional NMS in terms of accuracy and speed. Experimental results on the benchmark datasets show that MobilePoseNet obtains results almost comparable to state-of-the-art methods with a low computational load. Compared to SimpleBaseline, MobilePoseNet uses only 4% of the parameters while reaching 98% of its estimation accuracy.

1. Introduction

Human pose estimation, also called human key point detection, aims to detect the key points of the human body (eyes, nose, shoulders, elbows, etc.) in a given RGB image. It is one of the basic tasks of computer vision and has many practical applications, such as human-computer interaction [1], human tracking [2], and motion analysis [3]. In recent years, with the rapid development of neural networks, human pose estimation based on deep neural networks [4–9] has achieved high accuracy. However, these works have focused only on improving the accuracy of pose estimation through complex and computationally expensive models, while largely ignoring the cost of model inference. Many methods already require computational resources beyond the capabilities of mobile and embedded devices. At the same time, information security is a growing concern, and deploying applications directly on edge devices is important for protecting personal information, which places strict requirements on the computational volume and complexity of human pose estimation models.

Many works have attempted to solve this problem by building human pose estimation networks with small model size and low computational cost [7, 10, 11]. For example, a recent attempt [10] constructs pose estimation models with fewer parameters using quantization, but the performance of the resulting model degrades substantially. Some researchers [7] use knowledge distillation to reduce the number of model parameters, but this increases training and deployment time. Other works [12] search for lightweight pose estimation models with neural architecture search; however, the obtained models have complex structures and slow inference speed. The key problem is how to balance model accuracy and inference efficiency.

To address this problem, in this paper, we propose a lightweight human pose estimation network designed with compact convolutional filters, specifically for mobile and resource-constrained environments. As shown in Figure 1, our model contains three main parts: an encoder, a decoder, and a heat map regressor that estimates each key point. To keep the model lightweight, we use the first 13 layers of MobileNetV3 [13] as our encoder. Intuitively, high resolution is beneficial for human pose estimation, so we design the encoder with less downsampling; the specific model structure and its comparison with SimpleBaseline can be seen in Figure 2. In the decoder, inspired by the bottleneck block, we propose a lightweight upsampling module, whose concrete structure is shown in Figure 1. The detailed structure of the overall model is shown in Table 1. Finally, we also propose a parallel OKS-based NMS to further improve the speed of pose estimation. Experimental results show that our method achieves 69.0 AP with only 1.5M model parameters and 1.23 GFLOPs. The contributions of the proposed method are summarized as follows:
(i) We design a lightweight upsampling block that integrates separable transposed convolution and channel-based attention. This design is informed by an extensive examination of the upsampling modules in existing state-of-the-art deep convolutional networks.
(ii) We reduce the number of upsampling steps and use lightweight upsampling blocks to obtain a lightweight pose estimation network. In particular, we balance model accuracy and inference speed, which is the key issue in extending existing deep pose estimation methods to practical applications.
(iii) We propose a parallel OKS-NMS by combining Matrix-NMS [14] and OKS-NMS [15] to further improve the efficiency of the human pose estimation system.

The rest of this paper is organized as follows. We briefly review related work in the second section, followed by a description of the proposed method. We then report experiments on the MSCOCO and MPII datasets and conclude the work.

2. Related Work

2.1. Human Pose Estimation

In recent years, deep learning-based pose estimation methods [16] have made great progress. Despite significant performance improvements, these prior works focus only on improving the accuracy of pose estimation by using complex networks and large tensors, while largely ignoring the cost of model inference. This significantly limits their deployability in real-world applications, especially when the available computational budget is very limited.

In the literature, several recent works aim to improve model efficiency. Bulat and Tzimiropoulos [10] designed a binary hourglass network using quantization, but the restricted binary network has weak representational power and low accuracy. Zhang et al. [7] proposed a fast pose distillation (FPD) learning strategy in which a pretrained teacher network is used to obtain a computationally fast and inexpensive student network; however, it requires a long training time. Yu et al. [17] proposed conditional channel weighting blocks and constructed the Lite-HRNet network, which offers clear advantages in model accuracy and size, but its structure is complex, resulting in slow inference. Zhang and Tang [11] proposed a lightweight bottleneck block with depthwise convolution and an attention mechanism, but the model still contains 2.7M parameters.

In contrast to previous methods, we jointly consider model accuracy, inference speed, and model complexity, and directly design a model with a simple structure and low complexity, making it more practical and reliable in real application scenarios.

2.2. Efficient Upsampling Module

Recent work [13, 18–20] has shown that deep convolutional neural networks achieve state-of-the-art performance. For high-level vision problems such as semantic segmentation [21], pose estimation [16], and object detection [22], existing approaches pass the input through a network that usually consists of high-to-low-resolution subnetworks followed by a subnetwork that raises the resolution again. Many approaches improve the resolution of the main network in different ways. For example, networks such as the hourglass network [6] reduce the high-resolution input features to low-resolution features and then use interpolation-based upsampling to scale the low-resolution features back to the input resolution, fusing them with the previously computed high-resolution features in the hope of generating representations that are both semantically rich and high resolution. Although this achieves very good results, large tensors are used during feature fusion. Zhou et al. [23, 24] constructed an attention-driven feature fusion upsampling network that uses heterogeneous convolution to reduce model complexity and the use of large tensors; however, the network structure is complex and does not fundamentally solve the problem of slow inference. SimpleBaseline [25] uses several transposed convolutional layers to generate high-resolution representations and achieves very good results. Although the model structure is simple, transposed convolution introduces a large number of parameters and computations, which is unfriendly to small devices.

Therefore, we propose an efficient upsampling module that achieves a significant reduction in the number of parameters and computation during upsampling while ensuring the simplicity of the model structure and inference accuracy.

3. Proposed Method

In this section, we detail a simple, low computational cost human pose estimation network (MobilePoseNet), which combines a lightweight upsampling block (LPB) with the direct use of high-resolution features to obtain high-resolution representations while remaining lightweight.

3.1. Lightweight Upsampling Block

Transposed convolution was first introduced into pose estimation by SimpleBaseline and achieved excellent performance. However, this operation accounts for nearly a third of the model's parameters and computation. Specifically, given an input feature map of size $H_{in} \times W_{in} \times C_{in}$ and an output feature map of size $H_{out} \times W_{out} \times C_{out}$, the amount of computation of a conventional transposed convolution is
$$\mathrm{FLOPs}_{trans} = k^{2} \cdot C_{in} \cdot C_{out} \cdot H_{out} \cdot W_{out}.$$

The number of parameters of a traditional transposed convolution is
$$\mathrm{Params}_{trans} = k^{2} \cdot C_{in} \cdot C_{out},$$
where $k$ is the kernel size of the transposed convolution.

To reduce the computational burden and the number of parameters while retaining the effect of transposed convolution, we design a lightweight upsampling block, inspired by the intuition that the bottlenecks actually contain all the necessary information. As shown in Figure 3(b), it is composed of three parts: a depthwise transposed convolution, pointwise convolutions, and an attention module. Specifically, we first expand the low-dimensional information to high-dimensional information with a pointwise convolution and then apply a depthwise transposed convolution to each channel of the feature map for the spatial transformation. Finally, we use a 1×1 pointwise convolution to fuse the information across channels and compress the high-dimensional information back to the original input dimension.

As shown in Figure 3(b), the computation of the lightweight upsampling block is the sum of the depthwise transposed convolution and the two pointwise convolution computations:
$$\mathrm{FLOPs}_{LPB} = C_{in} \cdot C_{e} \cdot H_{in} \cdot W_{in} + k^{2} \cdot C_{e} \cdot H_{out} \cdot W_{out} + C_{e} \cdot C_{out} \cdot H_{out} \cdot W_{out}.$$

The number of parameters of the lightweight upsampling block is
$$\mathrm{Params}_{LPB} = C_{in} \cdot C_{e} + k^{2} \cdot C_{e} + C_{e} \cdot C_{out},$$
where $C_{e}$ is the number of channels of the high-dimensional (expanded) features. Compared with the traditional transposed convolution, our method reduces the amount of computation to 83.2% and the number of parameters to 74%.
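As a rough illustration of these counts, the following Python helpers compute the parameter and multiply-accumulate counts for a standard transposed convolution and for the LPB (attention branch ignored). The layer sizes in the usage example are hypothetical and are not the paper's exact configuration, so the printed ratios will not reproduce the figures above.

```python
def transposed_conv_cost(c_in, c_out, k, h_out, w_out):
    """Parameter and multiply-accumulate counts of a standard transposed convolution."""
    params = k * k * c_in * c_out
    flops = params * h_out * w_out
    return params, flops

def lpb_cost(c_in, c_out, c_e, k, h_in, w_in, h_out, w_out):
    """Counts for the LPB: 1x1 expansion + depthwise transposed conv + 1x1 projection."""
    params = c_in * c_e + k * k * c_e + c_e * c_out
    flops = (c_in * c_e * h_in * w_in        # 1x1 expansion on the low-resolution input
             + k * k * c_e * h_out * w_out   # depthwise transposed convolution
             + c_e * c_out * h_out * w_out)  # 1x1 projection on the upsampled output
    return params, flops

# Hypothetical sizes for illustration only (2x upsampling from 32x24 to 64x48).
p_t, f_t = transposed_conv_cost(256, 256, 4, 64, 48)
p_l, f_l = lpb_cost(256, 256, 4 * 256, 4, 32, 24, 64, 48)
print(f"LPB/transposed params: {p_l / p_t:.1%}, FLOPs: {f_l / f_t:.1%}")
```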

Since the LPB separates the spatial operation and the channel operation into two independent steps, the decoding effect of the transposed convolution is weakened. To solve this problem, we enhance the feature responses with a channel attention mechanism. Here, we directly use SENet [26] as our channel attention mechanism to dynamically adjust the weight of each channel, as shown in Figure 3(b). To sum up, given an input feature map $X$, the feature $F$ output by the LPB serves as the input to the channel attention mechanism.

The output of the channel attention mechanism is a channel weight map $A$. The LPB output $F$ and the attention output $A$ are then multiplied and summed to obtain the final fused feature $Y$, i.e.,
$$Y = F \odot A + F.$$
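To make the block concrete, here is a minimal PyTorch sketch of the LPB as described above; the expansion ratio, transposed-convolution kernel and stride, and the SE reduction ratio are our assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class LightweightUpsamplingBlock(nn.Module):
    """Sketch of the LPB: pointwise expansion, depthwise transposed convolution
    for 2x spatial upsampling, pointwise projection, and an SE-style channel
    attention branch fused by a residual sum."""

    def __init__(self, in_ch, out_ch, expand_ratio=4, kernel=4, se_reduction=16):
        super().__init__()
        mid_ch = in_ch * expand_ratio
        # 1x1 pointwise convolution: expand low-dimensional features
        self.expand = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        # depthwise transposed convolution: per-channel 2x spatial upsampling
        self.dw_up = nn.Sequential(
            nn.ConvTranspose2d(mid_ch, mid_ch, kernel, stride=2, padding=1,
                               groups=mid_ch, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        # 1x1 pointwise convolution: fuse channels and project back down
        self.project = nn.Sequential(
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch))
        # SE-style channel attention (squeeze-and-excitation)
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch // se_reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // se_reduction, out_ch, 1), nn.Sigmoid())

    def forward(self, x):
        f = self.project(self.dw_up(self.expand(x)))
        a = self.se(f)      # per-channel weights in (0, 1)
        return f + f * a    # "multiplied and summed" fusion
```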

3.2. Lightweight Human Pose Estimation

Usually, the pipeline for pose estimation [5, 6, 27, 28] consists of three parts: upsampling, downsampling, and heat map estimation. In this work, we focus on designing lightweight upsampling and downsampling.

Different from SimpleBaseline, which uses a ResNet backbone for downsampling and three traditional deconvolutional layers for upsampling, we use MobileNetV3 for downsampling, which reduces the number of parameters by up to 96% and the computational load by up to 79%. For upsampling, we replace each traditional deconvolution layer with a lightweight upsampling block. The details of the model are shown in Table 1.

As shown in Figure 2, different from SimpleBaseline, we use a higher-resolution feature map as the input for upsampling. The rationale is that maintaining high-resolution representations before upsampling is beneficial.
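The following PyTorch sketch shows how these pieces can be assembled, reusing the LightweightUpsamplingBlock from the earlier sketch; the torchvision backbone slice, channel widths, and the number of upsampling blocks are assumptions made for illustration, not the paper's exact configuration.

```python
import torch.nn as nn
from torchvision.models import mobilenet_v3_large

class MobilePoseNetSketch(nn.Module):
    """Rough sketch of the overall pipeline: a truncated MobileNetV3 encoder,
    a short stack of lightweight upsampling blocks (LPB, sketched earlier),
    and a 1x1 heat map head producing one channel per key point."""

    def __init__(self, num_joints=17):
        super().__init__()
        # keep only the early, higher-resolution part of MobileNetV3 (1/16 resolution)
        backbone = mobilenet_v3_large(weights=None)
        self.encoder = nn.Sequential(*backbone.features[:13])
        enc_ch = 112  # channels after features[:13] in torchvision's MobileNetV3-Large
        # fewer upsampling steps than SimpleBaseline; each step is a lightweight block
        self.decoder = nn.Sequential(
            LightweightUpsamplingBlock(enc_ch, 64),
            LightweightUpsamplingBlock(64, 64))
        # per-joint heat map regression
        self.head = nn.Conv2d(64, num_joints, kernel_size=1)

    def forward(self, x):
        return self.head(self.decoder(self.encoder(x)))
```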

3.3. Parallel Pose NMS

In pose estimation, human body detectors inevitably generate redundant detections, and pose estimation in turn generates redundant poses. Therefore, nonmaximum suppression (NMS) is required to eliminate redundant poses.

Given a pose $P_i$ with $J$ joints $\{(k_i^j, c_i^j)\}_{j=1}^{J}$, where $k_i^j$ and $c_i^j$ are the location and confidence score of the $j$-th joint, respectively, together with the corresponding detection box and its confidence score, the general pose NMS proceeds as follows: first, the pose with the highest confidence is chosen as the reference, and poses similar to it are suppressed or discarded; this process is then repeated on the remaining poses until the pose set is empty.
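A minimal sketch of this conventional sequential procedure is given below, assuming a generic pose similarity function (such as the OKS defined in the following) and a hypothetical suppression threshold; each iteration depends on the previous one, which is exactly what prevents parallelization.

```python
def greedy_pose_nms(poses, scores, similarity, threshold=0.9):
    """Conventional sequential pose NMS: repeatedly keep the highest-scoring
    remaining pose and discard poses that are too similar to it."""
    order = sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        ref = order.pop(0)  # current highest-confidence pose becomes the reference
        keep.append(ref)
        # suppress poses whose similarity to the reference exceeds the threshold
        order = [i for i in order if similarity(poses[ref], poses[i]) < threshold]
    return keep
```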

However, the main problem is that this process is sequential and cannot be implemented in parallel, resulting in slower speeds. Inspired by Matrix-NMS, we propose a parallel nonmaximum suppression that considers the following two key factors:
(1) The confidence of a pose: the higher the confidence of a pose, the lower the probability of its joints being suppressed; i.e., for two poses $P_i$ and $P_j$ with confidences $s_i < s_j$, $P_i$ has a higher probability of being suppressed.
(2) The similarity between a pose and other poses: the lower the similarity between a pose and the other poses, the lower its suppression ratio.

For the pose confidence, we set the product of the average confidence of the key points and the confidence of the human detector as the final pose confidence:
$$s_i = b_i \cdot \frac{1}{|V_i|} \sum_{j \in V_i} c_i^j,$$
where $b_i$ is the confidence of the detection box and $V_i$ is the set of key points whose predictions are considered true, defined as follows:
$$V_i = \{\, j : c_i^j > \tau \,\}.$$

We consider a key point prediction to be true if its confidence $c_i^j$ is larger than a threshold $\tau$ and to be false otherwise.

For the similarity between two poses, we use the object key point similarity (OKS) [29] as the pose distance function:
$$\mathrm{OKS}(P_i, P_j) = \frac{\sum_{n} \exp\!\left(-d_n^{2} / 2 s^{2} \kappa_n^{2}\right) \, \delta(v_n > 0)}{\sum_{n} \delta(v_n > 0)},$$
where $d_n$ is the Euclidean distance between the $n$-th key points of the two poses, $\kappa_n$ is the per-key-point constant, $v_n$ is the visibility flag, and the object scale $s$ is given by $s = \sqrt{A}$, where $A$ is the area of the detection box.
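For reference, a minimal NumPy sketch of this similarity is given below, following the standard COCO OKS definition; the per-key-point constants are the commonly used COCO values, and the exact handling of visibility in the paper may differ.

```python
import numpy as np

# Commonly used COCO per-key-point constants (sigmas); kappa_n = 2 * sigma_n.
COCO_SIGMAS = np.array([.26, .25, .25, .35, .35, .79, .79, .72, .72, .62, .62,
                        1.07, 1.07, .87, .87, .89, .89]) / 10.0

def oks(kpts_a, kpts_b, area, visible, sigmas=COCO_SIGMAS):
    """Object key point similarity between two poses.
    kpts_a, kpts_b: (17, 2) key point coordinates; area: detection box area
    (the object scale squared); visible: (17,) boolean mask of counted joints."""
    d2 = np.sum((kpts_a - kpts_b) ** 2, axis=1)            # squared distances d_n^2
    per_kpt = np.exp(-d2 / (2.0 * area * (2 * sigmas) ** 2 + 1e-9))
    return per_kpt[visible].mean() if visible.any() else 0.0
```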

Following Matrix-NMS [14], we define a new decay factor for pose NMS. For each pose $P_j$, we obtain a decay factor
$$\mathrm{decay}_j = \min_{\forall i:\, s_i > s_j} \frac{f(\mathrm{OKS}_{i,j})}{f(\mathrm{OKS}^{\max}_{i})},$$
where $f(\cdot)$ is a monotonically decreasing penalty function (we use the Gaussian form $f(x) = \exp(-x^{2}/\sigma)$ of Matrix-NMS) and $\mathrm{OKS}^{\max}_{i}$ is the largest OKS between pose $P_i$ and any pose with a higher score.

Finally, we obtain a new pose confidence $s_j' = \mathrm{decay}_j \cdot s_j$. At inference time, we only need to threshold the updated scores and select the top-$k$ scoring poses as the final predictions.

As in Matrix-NMS, all the operations in this pose NMS can be implemented in one shot without recurrence. We first compute the pose confidences and then a pairwise OKS matrix for the poses sorted in descending order of confidence. The decay factor of each pose can then be obtained by looking up the OKS matrix. Finally, the pose scores are updated by the decay factors. At inference time, we threshold the updated scores and select the top-$k$ scoring poses as the final predictions. The whole procedure is summarized in Algorithm 1.

Input: the area of the detection box $A_i$, the confidence of the detection box $b_i$, the location of the key point $k_i^j$, the confidence of the key point $c_i^j$, and the required parameters. Here, $i$ denotes the $i$-th person, $i = 1, \dots, N$, and $j$ denotes the $j$-th key point, $j = 1, \dots, J$.
Output: the updated confidence of the key point.
1: Initialize
2: Calculate by equation (6) and parameter , ,
3: Sort , , and in descending order by
4: Calculate using
5: Calculate using
6: Calculate OKS matrix by equation (9) and parameter , ,
7: Update
8: Set
9: Set by repeating times
10: Calculate decay matrix using
11: Set decay
12: Update the confidence of the key point by
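Under our reading of the procedure above, a NumPy sketch of the parallel (Matrix-NMS style) OKS-based suppression might look as follows. It reuses the oks helper from the earlier sketch and takes pose confidences that have already been fused as described above; the Gaussian parameter sigma and the key point threshold tau are assumptions, not the paper's exact settings.

```python
import numpy as np

def parallel_oks_nms(pose_scores, kpts, kpt_conf, areas, sigma=2.0, tau=0.05):
    """Matrix-NMS-style parallel pose NMS sketch: compute a pairwise OKS matrix
    once, derive a decay factor per pose, and rescale all pose scores in one
    shot instead of looping over suppression decisions."""
    order = np.argsort(-pose_scores)           # sort by descending pose confidence
    scores, kpts, conf, areas = pose_scores[order], kpts[order], kpt_conf[order], areas[order]
    n = len(scores)
    # upper-triangular pairwise OKS matrix: entry (i, j) with i < j compares
    # pose j against the higher-scored pose i
    oks_mat = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            oks_mat[i, j] = oks(kpts[i], kpts[j], areas[i], conf[i] > tau)
    # for each pose, the largest OKS with any higher-scored pose
    oks_cmax = oks_mat.max(axis=0)
    # Gaussian decay, compensated by how likely the suppressor itself survives
    decay_mat = np.exp(-(oks_mat ** 2) / sigma)
    compensate = np.exp(-(oks_cmax[:, None] ** 2) / sigma)
    decay = (decay_mat / compensate).min(axis=0)
    return order, scores * decay               # decayed scores, ready for thresholding
```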

4. Experiments

We conduct experiments on the MSCOCO and MPII datasets to evaluate the performance of our method in multiperson pose estimation.

4.1. Datasets

(i) The MSCOCO dataset contains over 200K images, 250K human body instances, and 17 key points. We train our model on the MSCOCO train2017 set, which includes 57K images and 150K person instances, and evaluate our approach on val2017 and test-dev2017, which contain 5000 and 20K images, respectively.
(ii) The MPII Human Pose dataset contains about 25K images of more than 40,000 people with annotated human joints, taken from a wide range of real-world activities with full-body pose annotations.

We select the object key point similarity (OKS) based AP as the evaluation metric for the MSCOCO dataset. The standard PCKh (percentage of correct key points normalized by head size) metric [30] is used for the MPII dataset.

4.2. Implementation Details

In MSCOCO, we extend the human detection box to a fixed 4:3 aspect ratio and crop the box from the image at one of two fixed input sizes. In MPII, the input is cropped to a fixed size for fair comparison with other methods. The same data augmentation and training strategy are used for both datasets. The data augmentation includes random rotation ([-45°, 45°]), random scaling ([0.65, 1.35]), and flipping. In MSCOCO, half-body data augmentation is also applied.
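As a sketch of this box preprocessing step, the helper below pads a detection box to a fixed 4:3 (height:width) aspect ratio around its center before cropping; the exact padding convention used in the paper may differ.

```python
def expand_box_to_ratio(x, y, w, h, target_ratio=3.0 / 4.0):
    """Pad a detection box (top-left x, y, width w, height h) so that its
    width/height ratio becomes 3:4, i.e., a 4:3 height:width crop, keeping
    the box center fixed."""
    cx, cy = x + w / 2.0, y + h / 2.0
    if w / h > target_ratio:          # too wide: grow the height
        h = w / target_ratio
    else:                             # too tall: grow the width
        w = h * target_ratio
    return cx - w / 2.0, cy - h / 2.0, w, h
```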

We use the Adam optimizer with the same initial learning rate for all experiments. The model is trained on a single Nvidia TITAN RTX GPU with a minibatch size of 32, and training stops at 210 epochs.

4.3. Experimental Results
4.3.1. Results on MSCOCO Dataset

From the results in Table 2, we can see that our method has a significant advantage in model size and complexity while achieving comparable accuracy. For the first input size, our method achieves comparable accuracy with less than 6% of the parameters of the hourglass network. Compared with MobileNetV2 and ShuffleNetV2, our method obtains better accuracy at lower complexity. Compared with the small networks HRNet-W16 and Lite-HRNet-18, our model is also more accurate, although its model size is slightly larger. The same conclusion holds for the other input size.

Figure 4 compares the accuracy and complexity of small networks, and Figure 5 shows visualization results of our method on MSCOCO. It can be seen that our model achieves a better balance between complexity and accuracy and can accurately estimate joints in various complex scenes.

Table 3 lists the mAP, input size, Params, and GFLOP values of compared methods and our method on the MSCOCO dataset.

4.3.2. Results on MPII Human Pose Dataset

Table 4 reports the results of our network and other lightweight networks on the MPII val set. Compared with MobileNetV2, MobileNetV3, ShuffleNetV2, and Small HRNet-W16, our model achieves better accuracy with fewer parameters and lower computational cost. Compared to Lite-HRNet-30, our model achieves 87.3 PCKh@0.5 with 0.3M fewer parameters. Compared to MobileNetV2, MobileNetV3, ShuffleNetV2, and Small HRNet-W16, the accuracy improves by 1.9%, 3.0%, 4.5%, and 7.1%, respectively. Figure 6 illustrates the comparison of accuracy and complexity.

4.4. Inference Speed

FLOPs and the number of parameters only measure the size and theoretical complexity of a model. In this section, we study the actual inference speed of the human pose estimation network in terms of inference items per second. The speed is tested on devices with and without a GPU, with a batch size of 32 and full precision (fp32). We use an Nvidia TITAN RTX as the GPU device and an Intel Core i9-10900K CPU as the non-GPU device. To better reflect the running speed of the models, all methods are tested on the MSCOCO validation set, using the same person detector as provided by SimpleBaseline. In the tests without a GPU, a single thread is used for evaluation. As can be seen in Table 5, thanks to the simple structure of our model, our actual inference in the GPU speed test is 3 times faster than Lite-HRNet, which has lower FLOPs. In the GPU-free speed test, our method is faster than large networks such as HRNet. Moreover, our model has a significant advantage in complexity and computational cost compared to other models, which makes it easier to deploy on embedded devices.
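A rough sketch of how such a throughput measurement can be implemented in PyTorch is shown below; the input size, warm-up count, and iteration count here are arbitrary example values, not the paper's benchmarking protocol.

```python
import time
import torch

@torch.no_grad()
def items_per_second(model, batch_size=32, input_size=(256, 192),
                     iters=50, device="cuda"):
    """Measure inference items per second: run full-precision (fp32) forward
    passes on fixed-size batches and divide the number of processed images by
    the elapsed time. Warm-up iterations and CUDA synchronization points keep
    GPU timings from being distorted by lazy initialization or async execution."""
    model = model.to(device).eval()
    x = torch.randn(batch_size, 3, *input_size, device=device)
    for _ in range(5):                      # warm-up
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return iters * batch_size / (time.perf_counter() - start)
```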

5. Ablation Study

We study the effect of each component of our approach on the validation set of MSCOCO.

5.1. Deconvolution Blocks

In this section, we analyze the impact of reducing the number of upsampling steps and of using different upsampling blocks on accuracy at a fixed input resolution. From Table 6, it can be seen that the number of parameters and the amount of computation of our model are reduced compared with the other variants, while the accuracy is even improved.

5.2. OKS-Based Nonmaximum Suppression

We compare the proposed OKS-based nonmaximum suppression with other OKS-based nonmaximum suppression methods in terms of accuracy and speed using the same pose estimator. As shown in Table 7, our proposed OKS-based nonmaximum suppression has clear advantages in both accuracy and speed.

6. Conclusion

In this paper, we propose a lightweight pose estimation network that achieves an AP score of 69.0 on the MSCOCO val set with only 1.5M parameters and 1.23 GFLOPs. However, there is still an accuracy gap between our model and high-performance algorithms, mainly because our model lacks multiscale information fusion. Designing more complex networks and introducing multiscale fusion would, however, slow down model inference. In future work, we will redesign the backbone network for human pose estimation by introducing multiscale information to balance accuracy and speed.

Data Availability

The datasets used in this paper are the public datasets MSCOCO and MPII.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Nos. 61962010 and 61976107), the Excellent Young Scientific and Technological Talent of Guizhou Province ([2019]-5670), the Natural Science Foundation of Guizhou Province (Grant No. [2017]5726-32), the National Natural Science Foundation (No. 61863006), and the Basic Research Project (Key Project) of Guizhou Province ([2019]-1416).