Abstract

Human parsing, which aims at resolving the human body and clothes into semantic part regions from a human image, is a fundamental task in human-centric analysis. Recently, approaches to human parsing based on deep convolutional neural networks (DCNNs) have made significant progress. However, hierarchically exploiting multiscale and spatial contexts in convolutional features is still a hurdle to overcome. In order to boost the scale and spatial awareness of a DCNN, we propose two effective structures, named "Attention SPP" and "Attention RefineNet," which together form a Mutual Attention operation, to exploit multiscale and spatial semantics differently from the existing approaches. Moreover, we propose a novel Attention Guidance Network (AG-Net), a simple yet effective architecture without bells and whistles (such as human pose and edge information), to address human parsing tasks. Comprehensive evaluations on two public datasets demonstrate that the AG-Net outperforms the state-of-the-art networks.

1. Introduction

Human parsing, which segments a human image into regions of semantic parts, has recently received considerable interest in the computer vision community. Due to its comprehensive and elaborate analysis of human information, human parsing has served as an indispensable basis for many high-level computer vision applications, such as object detection [1], clothing parsing [2], human pose estimation [3–6], video surveillance [7, 8], and person reidentification [9].

Significant progress on human parsing has been achieved using deep convolutional neural networks (DCNNs). However, the diversity of human poses, the foreshortening caused by viewpoint changes, and the varied distributions of human bodies affect the accuracy of human parsing. For example, due to severe foreshortening and an unusual pose, the upper-body of the human has a larger scale than the lower-body in Figure 1(a), and the shoes appear on the right side in Figure 1(b). The body part scales and body distributions of the humans in Figure 1 differ from those in the majority of scenarios. Therefore, how to design a powerful and robust model to capture multiscale and spatial contextual information is crucial for addressing the human parsing task.

Confronted by the hurdle of exploiting multiscale features, some multibranch tactics have been proposed. The Spatial Pyramid Pooling (SPP) [10–12] and RefineNet [13] approaches, where parallel convolution layers with different receptive fields are used to capture multiscale information, are two prevalent strategies to overcome this hurdle. However, these multibranch methods simply employ a concatenation or an addition operation to fuse features, hence producing feature redundancies and suppressing the representation capacity of the whole network. Moreover, human parsing has an important characteristic different from other segmentation tasks: it greatly requires spatial awareness to parse spatial-oriented labels, such as right-arm, left-arm, right-shoe, and left-shoe. However, SPP and RefineNet are limited in capturing spatial semantics, because these multibranch methods have no special design to distinguish the upper and lower parts, or the right and left parts, of a human body, particularly in challenging human images such as those shown in Figure 1.

In order to address the challenges of multiscale and spatial feature extraction in human parsing, we impose a soft-attention mechanism onto the SPP and RefineNet methods to recalibrate high-level features in the model, producing the Attention SPP and Attention RefineNet, respectively. Based on the characteristics of multibranch models, we develop a lightweight trainable mechanism, named Mutual Attention (MA).

To exploit multiscale and spatial awareness with this attention-oriented philosophy, we propose an efficient Attention Guidance Network (AG-Net) for human parsing, as shown in Figure 2. Specifically, the AG-Net can be divided into four steps. Firstly, we leverage a fully convolutional network to encode a high-level feature map of the input image. Secondly, based on the high-level feature, we use the Attention SPP module to capture multiscale and spatial information. Thirdly, we decode the output feature from the Attention SPP, and each scale of the decoding stage is supervised by a dedicated supervision strategy. Finally, each decoding-stage feature map is guided by the Attention RefineNet to further fuse multiscale and spatial features.

Comprehensive experiments are conducted on two human parsing benchmark datasets, the ATR [14] and LIP [3] datasets, to evaluate our model. We demonstrate the feasibility and superiority of our methods on the ATR dataset.

Besides, we also use the ATR and LIP datasets to synthetically evaluate our AG-Net and obtain a state-of-the-art performance. In particular, in the evaluation of fine-grained and spatial-oriented labels, our approach obtains substantial improvement, which illustrates the remarkable ability of our AG-Net for human parsing.

There are two main contributions in this paper: (i) to overcome the issues of feature redundancy and limited spatial semantics in SPP and RefineNet, we propose Attention SPP and Attention RefineNet and form Mutual Attention to recalibrate the models; (ii) a portable and powerful architecture, named Attention Guidance Network (AG-Net), is designed to boost the multiscale and spatial semantic representation ability of a deep learning model and obtain excellent human parsing performance.

The remainder of this paper is organized as follows. In Section 2, we review related works. Subsequently, we describe each part of the proposed network in detail in Section 3. The experiments and conclusions are provided in Sections 4 and 5, respectively.

2. Related Work

Due to its great scientific value and commercial potential, human parsing has attracted increasing research interest [15–18] in recent years. In particular, significant progress on human parsing has been made using fully convolutional networks (FCNs) [19]. However, the diversity and complexity of real-world scenes make it hard to improve the accuracy of parsing results. Therefore, how to exploit multiscale and spatial features is a key point in boosting parsing performance.

Recently, based on deep learning frameworks, there are two types of mainstream methods to improve the human parsing performance. The first type adopts some extra human body information to construct a model. The second type aims at exploiting the multiscale features of a human.

2.1. Introducing Extra Human Body Information

Introducing extra human body information (e.g., pose information or the structural relationship of different human body parts) aims at exploiting the spatial features of humans and improving parsing results on spatial-oriented labels. The methods of MuLA [6], LIPNet [3], and JMPP [4] were established by combining human parsing and pose estimation into one network. With the pose information, the joint network could generate refined parsing results. PCNet [20] manually divided human body parts into different levels and established an end-to-end network to parse the human body parts from coarse to fine. Based on the ATR and CPC datasets, Guo et al. [21] applied prior pose information to increase the parsing accuracy. Different from the methods above, Luo et al. [15] proposed a trusted guidance learning framework to address the label parsing fragmentation issue. Su et al. [22] leveraged a label trusted network to solve the label confusion problem with the prior statistics of labels. Liu et al. [23] proposed a braiding network for fine-grained human parsing. At the end of this model, the semantic ambiguity of different body parts is eliminated with the help of pairwise hard region priors. Wang et al. [24] treated human parsing as a multisource information fusion process by combining a convolutional neural network (CNN) with the compositional hierarchy of human bodies. Liu et al. [25] leveraged various priors such as feature resolution, global context information, and edge details to improve the human parsing performance. Huang et al. [26] rebalanced the imbalanced dataset from the perspective of geometry.

Although those methods could deliver promising parsing performance, they required extra human body ground-truth information and hence greatly increased the workload of annotating datasets.

2.2. Exploiting the Multiscale Features

Exploiting multiscale features aims to obtain high-level abstract semantics without simultaneously losing the semantic information of detailed textures. For example, DeepLab-v2 [10] created the ASPP model by employing atrous convolution layers in a Spatial Pyramid Pooling structure. Inspired by the image pyramid, Chen et al. [17] trained several weight-sharing networks at different scales and merged the multiscale outputs in an attention network. In this way, high-level features with rich class information can be used to weight the underlying information and select details at precise resolutions. Co-CNN [18] imposed global image-level context and local super-pixel context into a unified model. To maintain pixel-level location information, Li et al. [27] and Chen et al. [28] used a pyramid structure to learn an attention mask instead of directly learning the feature map. During the decoding phase, this model introduces an attention mechanism that uses a high-resolution feature map to predict a channel mask, and then multiplies the predicted channel mask with a low-resolution feature map shortcut. Huang et al. [29] proposed a novel trilateral awareness mechanism to sense feature maps at trilateral levels, obtaining comprehensive multiscale, spatial, and feature distribution information to exploit semantic information precisely.

This type of method greatly improves the accuracy of a deep model on human segmentation results, particularly for scale-oriented labels. However, it also greatly increases the number of parameters by enlarging the network structure. Moreover, it is severely limited in extracting position features to distinguish position-oriented labels in human parsing tasks.

Different from the methods mentioned above, we propose Attention SPP and Attention RefineNet, which learn multiscale and spatial features simultaneously through Mutual Attention guidance. Moreover, an efficient AG-Net is proposed to address the challenges of human parsing. Compared with other methods, our model is simple yet effective in exploiting multiscale and spatial features without any bells and whistles (such as human pose or edge information).

3. Attention Guidance Network (AG-Net)

In this paper, based on the Encoder-Decoder architecture, we propose a novel Attention Guidance Network (AG-Net) for the human parsing task, as shown in Figure 2. At the end of the encoder and at each scale of the decoder, we impose the Attention SPP and the Attention RefineNet, respectively. Guided by Mutual Attention, the SPP and RefineNet gain further capacity to exploit multiscale and spatial semantics. Therefore, the whole model is designed with the attention-guided philosophy, which aims at selectively emphasising informative features and restraining less useful ones, so that the network has much more powerful awareness to handle the complicated multiscale- and spatial-oriented features in the human parsing task.

3.1. Mutual Attention (MA)

The structure of Mutual Attention (MA), as shown in Figure 3, is composed of a Spatial Attention part and a Channel Attention part. The Spatial Attention part concentrates on optimizing position-sensitive features, such as the location distribution of human poses and organs, and enhances the spatial perception and generalization ability of the model. The Channel Attention part optimizes cross-channel contexts by emphasising informative semantics and dampening valueless ones. Therefore, MA achieves the goal of recalibrating the position and channel contexts of a feature matrix.

Let $X$ and $Y$ be the input and output feature matrices of dimensions $H \times W \times C$, where $H$ and $W$ denote the spatial dimensions and $C$ denotes the channel dimension. For feature extraction in the Conv block, the output feature matrix has a thickness of $C$ channels over the $H \times W$ feature points in the spatial dimensions. Through max pooling with stride = 1, two convolutions, and a softmax operation, a Spatial Attention map $M_s$ is produced. A ReLU [31] is embedded between the two convolutions. The matrix generated by the Spatial Attention operation is defined as
$$X_s = M_s \otimes X, \qquad M_s = \mathrm{softmax}\bigl(f_2\bigl(\mathrm{ReLU}\bigl(f_1(\mathrm{MaxPool}(X))\bigr)\bigr)\bigr),$$
where $f_1$ and $f_2$ denote the two convolutions and $\otimes$ denotes element-wise multiplication broadcast over channels.

Using the global average pooling approach, $X_s$ is transformed into a feature vector $z \in \mathbb{R}^{C}$, whose $c$-th element is calculated by
$$z_c = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_s^{c}(i, j),$$
where $x_s^{c}(i, j)$ is the value of the $c$-th channel of $X_s$ at spatial position $(i, j)$.

In order to reduce the feature parameters, following the SENet [32], we employ a squeeze-and-excitation structure to extract channel-level features. Due to the sharp compression of channel features in the hourglass structure, if a nonlinear activation function such as the ReLU is introduced, some useful informative features will inevitably be lost [33]. Therefore, different from the SENet, we do not employ the ReLU operation after the squeezing FC layer so as to preserve the integrity of useful contexts. The output Channel Attention vector $a$ is computed as
$$a = \sigma\bigl(W_2(W_1 z)\bigr),$$
where $\sigma$ denotes the sigmoid function, and $W_1$ and $W_2$ are the weights of the two FC layers. With the Channel Attention part, the final output of the features is defined as
$$Y = a \otimes X_s,$$
where $\otimes$ here denotes channel-wise multiplication.
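To make the Mutual Attention operation concrete, the following PyTorch sketch assembles the Spatial Attention and Channel Attention parts described above in the sequential order. The max-pooling kernel size, the width of the intermediate convolution, and the reduction ratio of the FC layers are illustrative assumptions rather than values reported in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Max pooling (stride 1) -> two convs with a ReLU in between -> softmax
    over spatial positions; the resulting map reweights the input feature."""
    def __init__(self, channels, mid=64):  # mid width is an assumption
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.conv2 = nn.Conv2d(mid, 1, kernel_size=1)

    def forward(self, x):
        b, _, h, w = x.size()
        s = self.conv2(F.relu(self.conv1(self.pool(x))))
        s = F.softmax(s.view(b, 1, h * w), dim=-1).view(b, 1, h, w)
        return x * s

class ChannelAttention(nn.Module):
    """SE-style squeeze and excitation, without a ReLU after the squeezing FC."""
    def __init__(self, channels, reduction=16):  # reduction ratio is an assumption
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        b, c, _, _ = x.size()
        z = F.adaptive_avg_pool2d(x, 1).view(b, c)   # global average pooling
        a = torch.sigmoid(self.fc2(self.fc1(z)))     # no ReLU between the FCs
        return x * a.view(b, c, 1, 1)

class MutualAttention(nn.Module):
    """Sequential variant of MA: Spatial Attention first, then Channel Attention."""
    def __init__(self, channels):
        super().__init__()
        self.sa = SpatialAttention(channels)
        self.ca = ChannelAttention(channels)

    def forward(self, x):
        return self.ca(self.sa(x))
```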

Mutual Attention rebalances the features at the spatial level and the channel level with a negligible increase in parameters. From the perspective of attention, this mechanism achieves the goal of feature decoupling and boosts the feature expression capacity of the network.

3.2. Attention SPP

The SPP uses multiscale convolution filters and a pyramid pooling structure to extract a feature pyramid from high-level semantics. Due to its powerful performance, SPP has been adopted in many semantic segmentation tasks. However, because of the massive size of its multibranch fusion model, SPP unavoidably produces redundant and overcoupled feature information in its output, so that the features cannot be fully utilized and the network may even have to be retrained. Moreover, SPP is weak at capturing spatial semantics, which cannot satisfy the need to parse spatial-oriented labels.

In order to improve feature utilization for multiscale feature mining and to impose spatial feature extraction ability on SPP, we propose the Attention SPP model. The specific design of Attention SPP is illustrated in Figure 4, which uses the Atrous Spatial Pyramid Pooling (ASPP) [28] as an example. For the different branches of SPP, at the spatial level, Spatial Attention is employed to rebalance the different-scale semantics so that each branch can concentrate on its own job. Besides, the Spatial Attention guides the convolution layers to reweight the position and angular distributions based on multiscale receptive fields, which enables the SPP to parse spatial semantics. At the end of the model, we use a concatenation operation to aggregate all paths and generate a feature matrix with affluent contextual information. Nevertheless, with sparse and redundant features, this output matrix would hinder the feature extraction of later layers and dampen the representational capability of the network. Consequently, the Channel Attention is employed to recalibrate the feature-rich but redundant matrix at the channel level.

Finally, by using the self-learned weight coefficients to rebalance the matrix, the output features contain much more abundant multiscale contextual information. Additionally, the Attention SPP model is efficient in capturing spatial features. With multiscale and spatial feature extraction intertwined in one model, the Attention SPP has much more representational power for human parsing tasks.
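As a rough illustration of how the two attention parts attach to an ASPP-style pyramid, the following sketch reuses the SpatialAttention and ChannelAttention modules from the Mutual Attention sketch in Section 3.1. The atrous rates, branch widths, and the final projection are assumptions, not the exact configuration of our model.

```python
import torch
import torch.nn as nn

class AttentionSPP(nn.Module):
    """Sketch: one Spatial Attention per pyramid branch, then concatenation,
    then Channel Attention on the fused (redundant) feature matrix."""
    def __init__(self, in_ch, branch_ch=256, rates=(6, 12, 18)):  # assumed values
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, branch_ch, kernel_size=1)] +
            [nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=r, dilation=r)
             for r in rates]
        )
        # One Spatial Attention per branch so that each scale reweights its own
        # receptive field before fusion (module from the Section 3.1 sketch).
        self.spatial_att = nn.ModuleList(
            [SpatialAttention(branch_ch) for _ in range(len(rates) + 1)]
        )
        fused_ch = branch_ch * (len(rates) + 1)
        self.channel_att = ChannelAttention(fused_ch)
        self.project = nn.Conv2d(fused_ch, branch_ch, kernel_size=1)

    def forward(self, x):
        feats = [att(branch(x)) for branch, att in zip(self.branches, self.spatial_att)]
        fused = torch.cat(feats, dim=1)                # feature-rich but redundant
        return self.project(self.channel_att(fused))   # channel-level recalibration
```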

3.3. Attention RefineNet

To absorb the multiscale and spatial contextual information generated by the Encoder-Decoder architecture, we impose the Attention RefineNet into the AG-Net, as shown in Figure 2, which then further optimizes the predicted score maps.

Inspired by CPNet [34], for the feature layers downsampled by factors of 2, 4, and 8, we cascade 1, 2, and 3 Residual blocks, respectively. At the end of the Refine block, we integrate the information of the different levels via a concatenation operation. In order to reduce the complexity of the model, bilinear interpolation is used to unify the multiscale features to a common reduced resolution instead of the original input scale.

As in SPP, the feature matrix generated by multiscale feature fusion also suffers from feature redundancy and limited spatial context. Accordingly, MA is injected into the RefineNet to rebalance the feature matrix at the spatial and channel levels. MA makes the different paths of the RefineNet focus on their corresponding-scale features and explores the potential of spatial feature extraction. In addition, it redistributes the learning direction of the model, reduces the learning difficulty of the original task, and makes the network easier to train.
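A minimal sketch of this refinement path, again reusing the MutualAttention module from Section 3.1, is given below. The residual block design, channel widths, and the common target resolution are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Plain 3x3 residual block used here as a placeholder refine unit."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)

    def forward(self, x):
        return F.relu(x + self.conv2(F.relu(self.conv1(x))))

class AttentionRefineNet(nn.Module):
    """Sketch: the 1/2, 1/4, and 1/8 decoder features pass through 1, 2, and 3
    residual blocks respectively, are resized to a common resolution,
    concatenated, and recalibrated by Mutual Attention."""
    def __init__(self, ch):
        super().__init__()
        self.paths = nn.ModuleList(
            [nn.Sequential(*[ResidualBlock(ch) for _ in range(n)]) for n in (1, 2, 3)]
        )
        self.ma = MutualAttention(ch * 3)   # module from the Section 3.1 sketch
        self.project = nn.Conv2d(ch * 3, ch, kernel_size=1)

    def forward(self, feats):
        # feats: decoder features downsampled by factors of 2, 4, and 8
        target = feats[0].shape[-2:]        # assumed common target resolution
        refined = [F.interpolate(path(f), size=target, mode='bilinear',
                                 align_corners=False)
                   for path, f in zip(self.paths, feats)]
        return self.project(self.ma(torch.cat(refined, dim=1)))
```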

3.4. Supervision Strategy

There have been existing works [34, 35] using a multiscale supervision strategy to refine their models. The proposed AG-Net also adopts this strategy as shown in Figure 2.

Based on the traditional softmax loss function on pixel-wise ground-truth masks, we inject a global loss function as follows to optimize AG-Net:
$$L = L_{parsing}(P, G) + \lambda\, L_{global}(v_p, v_g),$$
where $P$ is the predicted label map and $G$ is the GT mask, $v_p$ denotes the global prediction vector extracted from the predicted label map using a global pooling operation, and $v_g$ denotes the global GT vector extracted from the GT mask. $L_{parsing}$ represents the common loss function of semantic parsing, and $\lambda$ is the weight of the global loss $L_{global}$, which is obtained from the count of ground-truth labels and a hyperparameter that is set to 0.1 in our experiments.
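The sketch below shows one plausible realisation of this supervision: pixel-wise cross-entropy plus a global term computed on globally pooled class statistics of the prediction and the ground truth. The choice of mean-squared error for the global term and the fixed weight passed in as lam are assumptions; they are not the exact weighting formula used in our model.

```python
import torch
import torch.nn.functional as F

def parsing_loss(logits, target, num_classes, lam=0.1):
    """logits: (B, C, H, W) predicted score maps; target: (B, H, W) GT labels."""
    # Traditional pixel-wise softmax / cross-entropy loss on the label map.
    pixel_loss = F.cross_entropy(logits, target)
    # Global prediction vector: class probabilities averaged over all pixels.
    pred_global = torch.softmax(logits, dim=1).mean(dim=(2, 3))
    # Global GT vector: class frequencies of the ground-truth mask.
    gt_onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    gt_global = gt_onehot.mean(dim=(2, 3))
    # Global loss (MSE chosen here for illustration) weighted by lam.
    global_loss = F.mse_loss(pred_global, gt_global)
    return pixel_loss + lam * global_loss
```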

4. Experiment Analysis

In the training process, we remove the softmax and fully connected layers of VGG-16 [36] and DenseNet-121 [37] and replace them with fully convolutional layers to extract features. Besides, we utilize hybrid dilated convolution in the conv5 layers. The input image is cropped from the original image, and the model runs at about 10 FPS with the VGG-16 backbone and 15 FPS with the DenseNet-121 backbone. We initialize the network with a pretrained model and use a Gaussian distribution with a standard deviation of 0.01 to initialize the layers without pretrained weights. We utilize the Adam [38] solver with a batch size of 6, momentum of 0.9, weight decay of 0.0005, and an initial learning rate of 0.0001. Inspired by the DeepLab method [10], we use the poly strategy to dynamically adjust the learning rate. The training data are augmented by left-right flipping. All models are implemented on the PyTorch platform [39]. Our experiments are run on a system with an Intel Core i7-5930K CPU and a single NVIDIA GTX 1080 Ti GPU. We evaluate experimental results on two human parsing benchmark datasets, the LIP [3] and ATR [14] datasets. We train our model for 30 epochs on ATR and 60 epochs on LIP.
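For reference, the poly strategy from DeepLab decays the learning rate as a power of the remaining training fraction; a small sketch is given below, where the power of 0.9 is the common DeepLab choice and is assumed here rather than quoted from our settings.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Poly learning-rate schedule: base_lr * (1 - step / max_steps) ** power."""
    return base_lr * (1 - step / max_steps) ** power

# Usage with an Adam optimizer, updating the rate every iteration:
# for step in range(max_steps):
#     for group in optimizer.param_groups:
#         group['lr'] = poly_lr(1e-4, step, max_steps)
#     ...training step...
```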

4.1. Datasets
4.1.1. LIP Dataset

The Look into Person (LIP) dataset contains 50,462 images with careful pixel-wise annotations of 19 semantic human parts from the MS COCO dataset. These images are collected from real-world scenes and present various and complicated views of human appearances, poses, sizes, clothes, occlusion, illumination, and feature confusion. Besides, it not only categorizes the traditional human parts, but also annotates some tiny labels (e.g., sunglasses, socks). Some annotations contain spatial-oriented information (e.g., left-arm, right-arm, left-shoe, right-shoe). Therefore, it is a challenging human parsing dataset.

4.1.2. ATR Dataset

ATR dataset contains a total of 17,700 images, which consist of 7,700 images in the original ATR [14] and 10,000 additional images in the Chictopia10K [18]. All images are annotated pixel-wise with 18 categories. For a convenient comparison, we follow the setting of Co-CNN [18] and split the original ATR dataset into 700 images for validation, 1,000 images for testing, and the rest for training.

We only experiment on these two datasets because other datasets do not have fine-grained and spatial-oriented segment annotations to satisfy the needs of our model.

4.2. Attention SPP Evaluation and Discussion

To verify the advantages of our Attention SPP, we embed our MA into four classic SPP methods [10–12]: DeepLab v2, DeepLab v3, Vortex, and PSPNet. We evaluate these four methods on the ATR benchmark with the same experimental settings. All models are trained with the VGG-16 bottleneck for 30 epochs. We deploy two effective measurements, the average F1 score and the mean intersection-over-union (mIoU), to compare the performance of the models. We can see from Table 1 that, with MA, each model improves by 1–3% in both F1 score and mIoU. In particular, the Vortex method with the sequential type of MA improves by 2.94% and 4.42% in terms of F1 score and mIoU, respectively.
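For clarity, the sketch below shows how the mIoU measurement can be computed from a pair of integer label maps; skipping classes absent from both maps and averaging per image rather than accumulating over the whole test set are simplifying assumptions.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """pred and gt: integer label maps of identical shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                 # ignore classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```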

We depict testing curves of four models with/without Mutual Attention (MA) on the ATR dataset in Figure 5. For each model, both types of MA can improve model performance and accelerate the convergence. It is demonstrated that the attention structure can enhance the representative and comprehensive ability of the models by guiding the networks to capture more useful information.

Furthermore, the MA employing the sequential operation outperforms the parallel operation in aggregating the Spatial and Channel Attention parts, as shown in Table 1. The sequential operation achieves the feature decoupling purpose: the Spatial Attention focuses on learning spatial contexts, and then, based on the optimized spatial contexts, the Channel Attention attentively recalibrates cross-channel contexts. This sequential structure can dynamically guide the model to use features of different levels at different stages, so as to alleviate the learning difficulty of the deep model.

Since the Vortex method [12] with the sequential type of MA performs best in Table 1, we adopt the Vortex method, which leverages the Vortex Pooling strategy to improve the feature utilization ratio of the original SPP, as the backbone of the Attention SPP in AG-Net.

4.3. Human Parsing Performances

We evaluate the comprehensive performance of the proposed AG-Net for human parsing in ATR and LIP datasets.

4.3.1. ATR Dataset

Table 2 compares the proposed method with ten state-of-the-art methods on the ATR dataset. Following the evaluation settings in Co-CNN [18] and TGPNet [15], we employ the average F1 score as the evaluation criterion. From Table 2, we can see that both variants of our AG-Net achieve remarkable results. Compared with the state-of-the-art approaches, our method with the DenseNet-121 backbone achieves an improvement of 0.5%.

4.3.2. LIP Dataset

According to the evaluation protocol introduced in LIP [3], Table 3 shows comparison results with fifteen state-of-the-art approaches in terms of mIoU, and Table 4 compares our method with eight state-of-the-art approaches in terms of per-class IoU on 20 class labels.

In Table 3, the Baseline network means the combination of the Encoder-Decoder architecture, the Vortex bottleneck, and multiscale supervision. Building on the Baseline, we inject the Attention SPP and the Attention RefineNet separately, yielding 1.90% and 3.37% improvements over the Baseline, respectively. Additionally, the AG-Net with a VGG-16 backbone improves by 5.92% compared with the Baseline, whereas the network scale expands only slightly (Baseline 156M vs. AG-Net 161M) and the training speed remains almost identical. With the DenseNet-121 backbone, our AG-Net outperforms all state-of-the-art methods with a relatively small network size (106M). We use only the traditional pixel-wise mask supervision instead of adding auxiliary pose labels to supervise our model. Therefore, our model is a simple and efficient model for human parsing tasks.

Five methods (i.e., CE2P [25], BraidNet [23], HRNetV2 [46], CNIF [24], and A-CE2P [47]) obtain very high scores by introducing various priors, such as the prior that a human body can be represented as a hierarchy of multilevel parts. Nevertheless, these priors require introducing additional datasets [46] and additional networks [24, 25, 46], which lead to complex and inefficient network structures as well as extra costs of annotation and training. In addition, some of these models have evolved into multistage inference approaches [23, 24, 46], which are not flexible enough to be embedded in other tasks.

Our DenseNet-121 based model achieves the best performance on 13 of the detailed labels in terms of IoU, as shown in Table 4. Note that for the labels that require high-level spatial-oriented features to distinguish global direction and spatial position, such as left-arm, right-arm, left-leg, right-leg, left-shoe, and right-shoe, our model surpasses the existing state-of-the-art approaches by 6.83%, 5.05%, 4.74%, 6.43%, 9.49%, and 6.56%, respectively. Besides, our proposed AG-Net also obtains the best results on scale-oriented labels, such as gloves, socks, and hat, which all need fine-grained features as guidance. Moreover, under the same network settings with the VGG-16 backbone, the AG-Net significantly improves over the Baseline on gloves, sunglasses, socks, skirt, right- and left-leg, and right- and left-shoe by about 10%. This great improvement demonstrates the strong generalization ability of our network in extracting multiscale and spatial features for human parsing.

Figure 6 shows the Spatial Attention heatmaps for different scales of receptive fields in the Attention SPP and Attention RefineNet. For the layers with small receptive fields, the Spatial Attention heatmaps tend to focus on edges and small parts, while large receptive fields cause the heatmaps to comprehend global information. For example, as shown in the first row of images in Figure 6(b), the heatmap with a smaller atrous rate has higher confidence on human edges, while the heatmap with a larger atrous rate tends to represent midlevel objects, such as hands, whose orientations need to be distinguished. Global pooling, additionally, is better aware of large-scale regions, such as upper-clothes and the foreground-background. To sum up, the Attention SPP and Attention RefineNet can achieve effective decoupling and can guide networks to represent multiscale features by reducing redundancies among different features. Besides, they can endow the model with sensitivity to human parts and positions.

Visualized comparisons on the LIP dataset are shown in Figure 7. We compare the visual results of our AG-Net with three state-of-the-art methods, namely, DeepLab (VGG16) [10], DRN-50 + Vortex [12, 45], and SS-JPPNet [3], and a Baseline network, on ten representative and challenging images. From columns 1 and 3, we can conclude that our AG-Net has a strong ability to segment small and confusing parts. The proposed AG-Net is the only method that successfully parses the sunglasses and head labels in columns 1 and 3, respectively. For spatial-oriented labels, our AG-Net can accurately classify them without being influenced by shooting angles and pose variations. For example, in column 6, the child's upper-body has a larger scale than the lower-body due to the high and side shooting angles. Besides, in column 9, the human pose differs from those in commonly seen images and has a transverse distribution, yet our AG-Net still produces accurate parsing results. Therefore, our method shows robust performance in segmenting multiscale- and spatial-oriented human parts.

5. Conclusion

In this paper, we have proposed the novel Attention SPP and Attention RefineNet and used a Mutual Attention mechanism to recalibrate feature maps at bilateral levels (the spatial dimension and the channel dimension) to obtain comprehensive multiscale and spatial features differently from the existing approaches. Moreover, the Attention Guidance Network (AG-Net), a simple and efficient model designed with an attention-centric philosophy, has been proposed to boost human parsing performance on scale- and spatial-oriented labels. Extensive experiments on two human parsing benchmarks have demonstrated the representational power of AG-Net and shown that our method outperforms the state-of-the-art methods.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (61872394, 61672547, and 61772140), Guangzhou Science and Technology Plan Project (201902010056), and Guangxi Innovation Driven Development Special Fund Project (AA18118039).