Abstract

Existing methods for human pose estimation usually rely on large intermediate tensors, leading to a high computational load that is detrimental to resource-limited devices. To solve this problem, we propose a low computational cost pose estimation network, MobilePoseNet, which consists of an encoder, a decoder, and a parallel nonmaximum suppression operation. Specifically, we design a lightweight upsampling block to replace transposed convolution in the decoder and use a lightweight network as our downsampling part. We then choose high-resolution features as the input to the upsampling stage to reduce the number of model parameters. Finally, we propose a parallel OKS-NMS, which significantly outperforms conventional NMS in terms of accuracy and speed. Experimental results on the benchmark datasets show that MobilePoseNet obtains results almost comparable to state-of-the-art methods with a low computational load. Compared to SimpleBaseline, MobilePoseNet uses only 4% of the parameters while reaching 98% of its estimation accuracy.

1. Introduction

Human pose estimation, also called human key point detection, aims to detect the key points of the human body (eyes, nose, shoulders, elbows, etc.) in a given RGB image. It is one of the basic tasks of computer vision and has many practical applications, such as human-computer interaction [1], human tracking [2], and motion analysis [3]. In recent years, with the rapid development of neural networks, human pose estimation based on deep neural networks [4–9] has achieved high accuracy. However, these works have focused only on improving the accuracy of pose estimation through complex and computationally expensive models, while largely ignoring the cost of model inference. Many methods already require computational resources beyond the capabilities of mobile and embedded devices. At the same time, information security is a growing concern, and deploying applications directly on edge devices is important for protecting personal information, which places strict requirements on the computational volume and complexity of human pose estimation models.

Many works have attempted to solve this problem by building human pose estimation networks with small model size and low computational cost [7, 10, 11]. For example, a recent attempt [10] constructs pose estimation models with fewer parameters using quantization, but the performance of the resulting model degrades substantially. Some researchers [7] use knowledge distillation to reduce the number of model parameters, but this increases training and deployment time. Other works [12] search for lightweight pose estimation models with neural architecture search; however, the obtained models have complex structures and slow inference speed. The key problem is how to balance model accuracy and inference efficiency.

To address this problem, in this paper, we propose a lightweight human pose estimation network designed with compact convolutional filters, specifically for mobile and resource-constrained environments. As shown in Figure 1, our model contains three main parts: an encoder, a decoder, and a heat map regressor that estimates each key point. To keep the model lightweight, we use the first 13 layers of MobileNetV3 [13] as our encoder. Intuitively, high resolution is beneficial for human pose estimation, so we design the encoder with less downsampling; the specific model structure and its comparison with SimpleBaseline can be seen in Figure 2. In the decoder, inspired by the bottleneck block, we propose a lightweight upsampling module, whose concrete structure is shown in Figure 1. The detailed structure of the overall model is shown in Table 1. Finally, we also propose a parallel OKS-based NMS to further improve the speed of pose estimation. Experimental results show that our method achieves 69.0 AP with only 1.5M model parameters and 1.23 GFLOPs. The contributions of the proposed method are summarized as follows:
(i) We design a lightweight upsampling block that integrates separable transposed convolution and channel-based attention. This design is informed by an extensive examination of the upsampling modules in existing state-of-the-art deep convolutional networks.
(ii) We reduce the number of upsampling steps and use lightweight upsampling blocks to obtain a lightweight pose estimation network. In particular, we balance model accuracy and inference speed, which is the key issue in extending existing deep pose estimation methods to practical applications.
(iii) We propose a parallel OKS-NMS by combining Matrix-NMS [14] and OKS-NMS [15] to further improve the efficiency of the human pose estimation system.

The rest of this paper is organized as follows. We briefly review related work in the second section, followed by a description of the proposed method. We then report experiments on the MSCOCO and MPII datasets and conclude the work.

2. Related Work

2.1. Human Pose Estimation

In recent years, deep learning-based pose estimation methods [16] have made great progress. Despite significant performance improvements, these prior works focus only on improving the accuracy of pose estimation by using complex networks and large tensors, while largely ignoring the cost of model inference. This significantly limits their deployability in real-world applications, especially when the available computational budget is very limited.

In the literature, several recent works aim to improve model efficiency. Bulat and Tzimiropoulos [10] designed a binary hourglass network using quantization, but the restricted binary network has weak representational power and low accuracy. Zhang et al. [7] proposed a fast pose distillation (FPD) learning strategy in which a pretrained teacher network is used to obtain a computationally fast and inexpensive student network; however, it requires a long training time. Yu et al. [17] proposed conditional channel weighting blocks and constructed the Lite-HRNet network, which offers clear advantages in model accuracy and size, but its structure is complex, resulting in slow inference. Zhang and Tang [11] proposed a lightweight bottleneck block with depthwise convolution and an attention mechanism, but the model still contains 2.7M parameters.

In contrast to previous methods, we jointly consider model accuracy, inference speed, and model complexity, and directly design a model with a simple structure and low complexity, making it more practical and reliable in real application scenarios.

2.2. Efficient Upsampling Module

Recent work [13, 18–20] has shown that deep convolutional neural networks achieve state-of-the-art performance. For high-level vision problems such as semantic segmentation [21], pose estimation [16], and object detection [22], existing approaches pass the input through a network that usually consists of high-to-low-resolution subnetworks followed by a subnetwork that raises the resolution again. Many approaches improve the resolution of the main network in different ways. For example, networks such as the hourglass network [6] reduce the high-resolution input features to low-resolution features and then use interpolation-based upsampling to scale the low-resolution features back to the input resolution, fusing them with the previously computed high-resolution features in the hope of generating representations that are both semantically rich and high resolution. Although this achieves very good results, large tensors are used during feature fusion. Zhou et al. [23, 24] constructed an attention-driven feature fusion upsampling network that uses heterogeneous convolution to reduce model complexity and the use of large tensors; however, the network structure is complex and does not fundamentally solve the problem of slow inference. SimpleBaseline [25] uses several transposed convolutional layers to generate high-resolution representations and achieves very good results. Although the model structure is simple, transposed convolution introduces a large number of parameters and computations, which is unfriendly to small devices.

Therefore, we propose an efficient upsampling module that achieves a significant reduction in the number of parameters and computation during upsampling while ensuring the simplicity of the model structure and inference accuracy.

3. Proposed Method

In this section, we detail a simple, low computational cost human pose estimation network (MobilePoseNet), which combines a lightweight upsampling block (LPB) with the direct use of high-resolution features to obtain high-resolution representations while remaining lightweight.

3.1. Lightweight Upsampling Block

Transposed convolution was first introduced into pose estimation by SimpleBaseline and achieved excellent performance. However, this operation accounts for nearly a third of the model's parameters and computation. Specifically, given an input feature map of size $H_{in} \times W_{in} \times C_{in}$ and an output feature map of size $H_{out} \times W_{out} \times C_{out}$, the amount of computation of a conventional transposed convolution is
$$\mathrm{FLOPs}_{trans} = k^{2} \cdot C_{in} \cdot C_{out} \cdot H_{out} \cdot W_{out}.$$

The number of parameters of a traditional transposed convolution is
$$\mathrm{Params}_{trans} = k^{2} \cdot C_{in} \cdot C_{out},$$
where $k$ is the kernel size of the transposed convolution.

To reduce the computational burden and the number of parameters while retaining the effect of transposed convolution, we design a lightweight upsampling block, inspired by the intuition that the bottlenecks actually contain all the necessary information. As shown in Figure 3(b), it is composed of three parts: a depthwise transposed convolution, pointwise convolutions, and an attention module. Specifically, we first expand the low-dimensional information to high-dimensional information with a pointwise convolution and then apply a depthwise transposed convolution to each channel of the feature map for the spatial transformation. Finally, we use a 1×1 pointwise convolution to fuse the information across channels and compress the high-dimensional information back to the original input dimension.

As shown in Figure 3(b), the computation of the lightweight upsampling block is the sum of the depthwise transposed convolution and the two pointwise convolution computations:
$$\mathrm{FLOPs}_{LPB} = C_{in} \cdot C_{e} \cdot H_{in} \cdot W_{in} + k^{2} \cdot C_{e} \cdot H_{out} \cdot W_{out} + C_{e} \cdot C_{out} \cdot H_{out} \cdot W_{out}.$$

The number of parameters of the lightweight upsampling block is
$$\mathrm{Params}_{LPB} = C_{in} \cdot C_{e} + k^{2} \cdot C_{e} + C_{e} \cdot C_{out},$$
where $C_{e}$ is the number of channels of the high-dimensional (expanded) features. Compared with the traditional transposed convolution, our method reduces the amount of computation to 83.2% and the number of parameters to 74%.
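As a rough illustration of these counts, the following Python helpers compute the parameter and multiply-accumulate counts for a standard transposed convolution and for the LPB (attention branch ignored). The layer sizes in the usage example are hypothetical and are not the paper's exact configuration, so the printed ratios will not reproduce the figures above.

```python
def transposed_conv_cost(c_in, c_out, k, h_out, w_out):
    """Parameter and multiply-accumulate counts of a standard transposed convolution."""
    params = k * k * c_in * c_out
    flops = params * h_out * w_out
    return params, flops

def lpb_cost(c_in, c_out, c_e, k, h_in, w_in, h_out, w_out):
    """Counts for the LPB: 1x1 expansion + depthwise transposed conv + 1x1 projection."""
    params = c_in * c_e + k * k * c_e + c_e * c_out
    flops = (c_in * c_e * h_in * w_in        # 1x1 expansion on the low-resolution input
             + k * k * c_e * h_out * w_out   # depthwise transposed convolution
             + c_e * c_out * h_out * w_out)  # 1x1 projection on the upsampled output
    return params, flops

# Hypothetical sizes for illustration only (2x upsampling from 32x24 to 64x48).
p_t, f_t = transposed_conv_cost(256, 256, 4, 64, 48)
p_l, f_l = lpb_cost(256, 256, 4 * 256, 4, 32, 24, 64, 48)
print(f"LPB/transposed params: {p_l / p_t:.1%}, FLOPs: {f_l / f_t:.1%}")
```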

Since the LPB separates the spatial operation and the channel operation into two independent steps, the decoding effect of the transposed convolution is weakened. To solve this problem, we enhance the feature responses with a channel attention mechanism. Here, we directly use SENet [26] as our channel attention mechanism to dynamically adjust the weight of each channel, as shown in Figure 3(b). To sum up, given an input feature map $X$, the feature $F$ output by the LPB serves as the input to the channel attention mechanism.

The output of the channel attention mechanism is a channel weight map $A$. The LPB output $F$ and the attention output $A$ are then multiplied and summed to obtain the final fused feature $Y$, i.e.,
$$Y = F \odot A + F.$$
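To make the block concrete, here is a minimal PyTorch sketch of the LPB as described above; the expansion ratio, transposed-convolution kernel and stride, and the SE reduction ratio are our assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class LightweightUpsamplingBlock(nn.Module):
    """Sketch of the LPB: pointwise expansion, depthwise transposed convolution
    for 2x spatial upsampling, pointwise projection, and an SE-style channel
    attention branch fused by a residual sum."""

    def __init__(self, in_ch, out_ch, expand_ratio=4, kernel=4, se_reduction=16):
        super().__init__()
        mid_ch = in_ch * expand_ratio
        # 1x1 pointwise convolution: expand low-dimensional features
        self.expand = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        # depthwise transposed convolution: per-channel 2x spatial upsampling
        self.dw_up = nn.Sequential(
            nn.ConvTranspose2d(mid_ch, mid_ch, kernel, stride=2, padding=1,
                               groups=mid_ch, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        # 1x1 pointwise convolution: fuse channels and project back down
        self.project = nn.Sequential(
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch))
        # SE-style channel attention (squeeze-and-excitation)
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch // se_reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // se_reduction, out_ch, 1), nn.Sigmoid())

    def forward(self, x):
        f = self.project(self.dw_up(self.expand(x)))
        a = self.se(f)      # per-channel weights in (0, 1)
        return f + f * a    # "multiplied and summed" fusion
```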

3.2. Lightweight Human Pose Estimation

Usually, the pipeline for pose estimation [5, 6, 27, 28] consists of three parts: upsampling, downsampling, and heat map estimation. In this work, we focus on designing lightweight upsampling and downsampling.

Different from SimpleBaseline, which uses a ResNet backbone for downsampling and three traditional deconvolutional layers for upsampling, we use MobileNetV3 for downsampling, which reduces the number of parameters by up to 96% and the computational load by up to 79%. For upsampling, we replace each traditional deconvolution layer with a lightweight upsampling block. The details of the model are shown in Table 1.

As shown in Figure 2, different from SimpleBaseline, we use a higher-resolution feature map as the input for upsampling. The rationale is that maintaining high-resolution representations before upsampling is beneficial.
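The following PyTorch sketch shows how these pieces can be assembled, reusing the LightweightUpsamplingBlock from the earlier sketch; the torchvision backbone slice, channel widths, and the number of upsampling blocks are assumptions made for illustration, not the paper's exact configuration.

```python
import torch.nn as nn
from torchvision.models import mobilenet_v3_large

class MobilePoseNetSketch(nn.Module):
    """Rough sketch of the overall pipeline: a truncated MobileNetV3 encoder,
    a short stack of lightweight upsampling blocks (LPB, sketched earlier),
    and a 1x1 heat map head producing one channel per key point."""

    def __init__(self, num_joints=17):
        super().__init__()
        # keep only the early, higher-resolution part of MobileNetV3 (1/16 resolution)
        backbone = mobilenet_v3_large(weights=None)
        self.encoder = nn.Sequential(*backbone.features[:13])
        enc_ch = 112  # channels after features[:13] in torchvision's MobileNetV3-Large
        # fewer upsampling steps than SimpleBaseline; each step is a lightweight block
        self.decoder = nn.Sequential(
            LightweightUpsamplingBlock(enc_ch, 64),
            LightweightUpsamplingBlock(64, 64))
        # per-joint heat map regression
        self.head = nn.Conv2d(64, num_joints, kernel_size=1)

    def forward(self, x):
        return self.head(self.decoder(self.encoder(x)))
```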

3.3. Parallel Pose NMS

In pose estimation, human body detectors inevitably generate redundant detections, and pose estimation in turn generates redundant poses. Therefore, nonmaximum suppression (NMS) is required to eliminate redundant poses.

Given a pose $P_i$ with $J$ joints $\{(k_i^j, c_i^j)\}_{j=1}^{J}$, where $k_i^j$ and $c_i^j$ are the location and confidence score of the $j$-th joint, respectively, together with the corresponding detection box and its confidence score, the general pose NMS proceeds as follows: first, the pose with the highest confidence is chosen as the reference, and poses similar to it are suppressed or discarded; this process is then repeated on the remaining poses until the pose set is empty.
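A minimal sketch of this conventional sequential procedure is given below, assuming a generic pose similarity function (such as the OKS defined in the following) and a hypothetical suppression threshold; each iteration depends on the previous one, which is exactly what prevents parallelization.

```python
def greedy_pose_nms(poses, scores, similarity, threshold=0.9):
    """Conventional sequential pose NMS: repeatedly keep the highest-scoring
    remaining pose and discard poses that are too similar to it."""
    order = sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        ref = order.pop(0)  # current highest-confidence pose becomes the reference
        keep.append(ref)
        # suppress poses whose similarity to the reference exceeds the threshold
        order = [i for i in order if similarity(poses[ref], poses[i]) < threshold]
    return keep
```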

However, the main problem is that this process is sequential and cannot be implemented in parallel, resulting in slower speeds. Inspired by Matrix-NMS, we propose a parallel nonmaximum suppression that considers the following two key factors:
(1) The confidence of a pose: the higher the confidence of a pose, the lower the probability of its joints being suppressed; i.e., for two poses $P_i$ and $P_j$ with confidences $s_i < s_j$, $P_i$ has a higher probability of being suppressed.
(2) The similarity between a pose and other poses: the lower the similarity between a pose and the other poses, the lower its suppression ratio.

For the pose confidence, we set the product of the average confidence of the key points and the confidence of the human detector as the final pose confidence:
$$s_i = b_i \cdot \frac{1}{|V_i|} \sum_{j \in V_i} c_i^j,$$
where $b_i$ is the confidence of the detection box and $V_i$ is the set of key points whose predictions are considered true, defined as follows:
$$V_i = \{\, j : c_i^j > \tau \,\}.$$

We consider a key point prediction to be true if its confidence $c_i^j$ is larger than a threshold $\tau$ and to be false otherwise.

For the similarity between two poses, we use the object key point similarity (OKS) [29] as the pose distance function:
$$\mathrm{OKS}(P_i, P_j) = \frac{\sum_{n} \exp\!\left(-d_n^{2} / 2 s^{2} \kappa_n^{2}\right) \, \delta(v_n > 0)}{\sum_{n} \delta(v_n > 0)},$$
where $d_n$ is the Euclidean distance between the $n$-th key points of the two poses, $\kappa_n$ is the per-key-point constant, $v_n$ is the visibility flag, and the object scale $s$ is given by $s = \sqrt{A}$, where $A$ is the area of the detection box.
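For reference, a minimal NumPy sketch of this similarity is given below, following the standard COCO OKS definition; the per-key-point constants are the commonly used COCO values, and the exact handling of visibility in the paper may differ.

```python
import numpy as np

# Commonly used COCO per-key-point constants (sigmas); kappa_n = 2 * sigma_n.
COCO_SIGMAS = np.array([.26, .25, .25, .35, .35, .79, .79, .72, .72, .62, .62,
                        1.07, 1.07, .87, .87, .89, .89]) / 10.0

def oks(kpts_a, kpts_b, area, visible, sigmas=COCO_SIGMAS):
    """Object key point similarity between two poses.
    kpts_a, kpts_b: (17, 2) key point coordinates; area: detection box area
    (the object scale squared); visible: (17,) boolean mask of counted joints."""
    d2 = np.sum((kpts_a - kpts_b) ** 2, axis=1)            # squared distances d_n^2
    per_kpt = np.exp(-d2 / (2.0 * area * (2 * sigmas) ** 2 + 1e-9))
    return per_kpt[visible].mean() if visible.any() else 0.0
```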

Following Matrix-NMS [14], we define a new decay factor for pose NMS. For each pose $P_j$, we obtain a decay factor
$$\mathrm{decay}_j = \min_{\forall i:\, s_i > s_j} \frac{f(\mathrm{OKS}_{i,j})}{f(\mathrm{OKS}^{\max}_{i})},$$
where $f(\cdot)$ is a monotonically decreasing penalty function (we use the Gaussian form $f(x) = \exp(-x^{2}/\sigma)$ of Matrix-NMS) and $\mathrm{OKS}^{\max}_{i}$ is the largest OKS between pose $P_i$ and any pose with a higher score.

Finally, we obtain a new pose confidence $s_j' = \mathrm{decay}_j \cdot s_j$. At inference time, we only need to threshold the updated scores and select the top-$k$ scoring poses as the final predictions.

As in Matrix-NMS, all the operations in this pose NMS can be implemented in one shot without recurrence. We first compute the pose confidences and then a pairwise OKS matrix for the poses sorted in descending order of confidence. The decay factor of each pose can then be obtained by looking up the OKS matrix. Finally, the pose scores are updated by the decay factors. At inference time, we threshold the updated scores and select the top-$k$ scoring poses as the final predictions. The whole procedure is summarized in Algorithm 1.

Input: the area of the detection box $A_i$, the confidence of the detection box $b_i$, the location of the key point $k_i^j$, the confidence of the key point $c_i^j$, and the required parameters. Here, $i$ denotes the $i$-th person, $i = 1, \dots, N$, and $j$ denotes the $j$-th key point, $j = 1, \dots, J$.
Output: the updated confidence of the key point.
1: Initialize
2: Calculate by equation (6) and parameter , ,
3: Sort , , and in descending order by
4: Calculate using
5: Calculate using
6: Calculate OKS matrix by equation (9) and parameter , ,
7: Update
8: Set
9: Set by repeating times
10: Calculate decay matrix using
11: Set decay
12: Update the confidence of the key point by
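Under our reading of the procedure above, a NumPy sketch of the parallel (Matrix-NMS style) OKS-based suppression might look as follows. It reuses the oks helper from the earlier sketch and takes pose confidences that have already been fused as described above; the Gaussian parameter sigma and the key point threshold tau are assumptions, not the paper's exact settings.

```python
import numpy as np

def parallel_oks_nms(pose_scores, kpts, kpt_conf, areas, sigma=2.0, tau=0.05):
    """Matrix-NMS-style parallel pose NMS sketch: compute a pairwise OKS matrix
    once, derive a decay factor per pose, and rescale all pose scores in one
    shot instead of looping over suppression decisions."""
    order = np.argsort(-pose_scores)           # sort by descending pose confidence
    scores, kpts, conf, areas = pose_scores[order], kpts[order], kpt_conf[order], areas[order]
    n = len(scores)
    # upper-triangular pairwise OKS matrix: entry (i, j) with i < j compares
    # pose j against the higher-scored pose i
    oks_mat = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            oks_mat[i, j] = oks(kpts[i], kpts[j], areas[i], conf[i] > tau)
    # for each pose, the largest OKS with any higher-scored pose
    oks_cmax = oks_mat.max(axis=0)
    # Gaussian decay, compensated by how likely the suppressor itself survives
    decay_mat = np.exp(-(oks_mat ** 2) / sigma)
    compensate = np.exp(-(oks_cmax[:, None] ** 2) / sigma)
    decay = (decay_mat / compensate).min(axis=0)
    return order, scores * decay               # decayed scores, ready for thresholding
```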

4. Experiments

We conduct experiments on the MSCOCO and MPII datasets to evaluate the performance of our method in multiperson pose estimation.

4.1. Datasets

(i) The MSCOCO dataset contains over 200K images, 250K human body instances, and 17 key points. We train our model on the MSCOCO train2017 set, which includes 57K images and 150K person instances, and evaluate our approach on val2017 and test-dev2017, which contain 5000 and 20K images, respectively.
(ii) The MPII Human Pose dataset contains about 25K images of more than 40,000 people with annotated human joints, taken from a wide range of real-world activities with full-body pose annotations.

We select the object key point similarity (OKS) based AP as the evaluation metric for the MSCOCO dataset. The standard PCKh (percentage of correct key points normalized by head size) metric [30] is used for the MPII dataset.

4.2. Implementation Details

In MSCOCO, we extend the human detection box to a fixed 4:3 aspect ratio and crop the box from the image at one of two fixed input sizes. In MPII, the input is cropped to a fixed size for fair comparison with other methods. The same data augmentation and training strategy are used for both datasets. The data augmentation includes random rotation ([-45°, 45°]), random scaling ([0.65, 1.35]), and flipping. In MSCOCO, half-body data augmentation is also applied.
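As a sketch of this box preprocessing step, the helper below pads a detection box to a fixed 4:3 (height:width) aspect ratio around its center before cropping; the exact padding convention used in the paper may differ.

```python
def expand_box_to_ratio(x, y, w, h, target_ratio=3.0 / 4.0):
    """Pad a detection box (top-left x, y, width w, height h) so that its
    width/height ratio becomes 3:4, i.e., a 4:3 height:width crop, keeping
    the box center fixed."""
    cx, cy = x + w / 2.0, y + h / 2.0
    if w / h > target_ratio:          # too wide: grow the height
        h = w / target_ratio
    else:                             # too tall: grow the width
        w = h * target_ratio
    return cx - w / 2.0, cy - h / 2.0, w, h
```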

We use the Adam optimizer with the same initial learning rate for all experiments. The model is trained on a single Nvidia TITAN RTX GPU with a minibatch size of 32, and training stops at 210 epochs.

4.3. Experimental Results
4.3.1. Results on MSCOCO Dataset

From the results in Table 2, we can see that our method has a significant advantage in model size and complexity while achieving comparable accuracy. For the first input size, our method achieves comparable accuracy with less than 6% of the parameters of the hourglass network. Compared with MobileNetV2 and ShuffleNetV2, our method obtains better accuracy at lower complexity. Compared with the small networks HRNet-W16 and Lite-HRNet-18, our model is also more accurate, although its model size is slightly larger. The same conclusion holds for the other input size.

Figure 4 compares the accuracy and complexity of small networks, and Figure 5 shows visualization results of our method on MSCOCO. It can be seen that our model achieves a better balance between complexity and accuracy and can accurately estimate joints in various complex scenes.

Table 3 lists the mAP, input size, Params, and GFLOP values of compared methods and our method on the MSCOCO dataset.

4.3.2. Results on MPII Human Pose Dataset

Table 4 reports the results of our network and other lightweight networks on the MPII val set. Compared with MobileNetV2, MobileNetV3, ShuffleNetV2, and Small HRNet-W16, our model achieves better accuracy with fewer parameters and lower computational cost. Compared to Lite-HRNet-30, our model achieves 87.3 PCKh@0.5 with 0.3M fewer parameters. Compared to MobileNetV2, MobileNetV3, ShuffleNetV2, and Small HRNet-W16, the accuracy improves by 1.9%, 3.0%, 4.5%, and 7.1%, respectively. Figure 6 illustrates the comparison of accuracy and complexity.

4.4. Inference Speed

FLOPs and the number of parameters only measure the size and theoretical complexity of a model. In this section, we study the actual inference speed of the human pose estimation network in terms of inference items per second. The speed is tested on devices with and without a GPU, with a batch size of 32 and full precision (fp32). We use an Nvidia TITAN RTX as the GPU device and an Intel Core i9-10900K CPU as the non-GPU device. To better reflect the running speed of the models, all methods are tested on the MSCOCO validation set, using the same person detector as provided by SimpleBaseline. In the tests without a GPU, a single thread is used for evaluation. As can be seen in Table 5, thanks to the simple structure of our model, our actual inference in the GPU speed test is 3 times faster than Lite-HRNet, which has lower FLOPs. In the GPU-free speed test, our method is faster than large networks such as HRNet. Moreover, our model has a significant advantage in complexity and computational cost compared to other models, which makes it easier to deploy on embedded devices.
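A rough sketch of how such a throughput measurement can be implemented in PyTorch is shown below; the input size, warm-up count, and iteration count here are arbitrary example values, not the paper's benchmarking protocol.

```python
import time
import torch

@torch.no_grad()
def items_per_second(model, batch_size=32, input_size=(256, 192),
                     iters=50, device="cuda"):
    """Measure inference items per second: run full-precision (fp32) forward
    passes on fixed-size batches and divide the number of processed images by
    the elapsed time. Warm-up iterations and CUDA synchronization points keep
    GPU timings from being distorted by lazy initialization or async execution."""
    model = model.to(device).eval()
    x = torch.randn(batch_size, 3, *input_size, device=device)
    for _ in range(5):                      # warm-up
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return iters * batch_size / (time.perf_counter() - start)
```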

5. Ablation Study

We study the effect of each component of our approach on the validation set of MSCOCO.

5.1. Deconvolution Blocks

In this section, we analyze the impact of reducing the number of upsampling steps and of using different upsampling blocks on accuracy at a fixed input resolution. From Table 6, it can be seen that the number of parameters and the amount of computation of our model are reduced compared with the other variants, while the accuracy is even improved.

5.2. OKS-Based Nonmaximum Suppression

We compare the proposed OKS-based nonmaximum suppression with other OKS-based nonmaximum suppression methods in terms of accuracy and speed using the same pose estimator. As shown in Table 7, our proposed OKS-based nonmaximum suppression has clear advantages in both accuracy and speed.

6. Conclusion

In this paper, we propose a lightweight pose estimation network that achieves an AP score of 69.0 on the MSCOCO val set with only 1.5M parameters and 1.23 GFLOPs. However, there is still an accuracy gap between our model and high-performance algorithms, mainly because our model lacks multiscale information fusion. Designing more complex networks and introducing multiscale fusion would, however, slow down model inference. In future work, we will redesign the backbone network for human pose estimation by introducing multiscale information to balance accuracy and speed.

Data Availability

The datasets used in this paper are the public datasets MSCOCO and MPII.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Nos. 61962010 and 61976107), the Excellent Young Scientific and Technological Talent of Guizhou Province ([2019]-5670), the Natural Science Foundation of Guizhou Province (Grant No. [2017]5726-32), the National Natural Science Foundation (No. 61863006), and the Basic Research Project (Key Project) of Guizhou Province ([2019]-1416).