Abstract

Existing attribute learning methods rely on predefined attributes, which require manual annotation. Because predefined attribute sets are limited by human experience, they cannot provide sufficiently rich descriptions of objects. This paper proposes a self-supervised attribute learning (SAL) method that deals with these problems by automatically generating attribute descriptions through differential occlusion of the object region. The relationship between attributes is formulated with triplet loss functions and is used to supervise the CNN. Attribute learning serves as an auxiliary task of a multitask image classification and segmentation network, in which the self-supervision of attributes motivates the CNN to learn more discriminative features for the main semantic tasks. Experimental results on the public benchmarks CUB-2011 and Pascal VOC show that the proposed SAL-Net obtains more accurate classification and segmentation results without additional annotations. Moreover, SAL-Net is embedded into a multiobject recognition and segmentation system, which realizes instance-aware semantic segmentation with the help of a region proposal algorithm and a fusion nonmaximum suppression algorithm.

1. Introduction

Visual attributes are designed as midlevel semantic features to describe objects. However, most existing attribute learning methods [1–7] rely on manually annotated datasets [8–11]. In these datasets, domain experts predefine semantic attributes, and annotators then judge whether each attribute appears in each object. On the one hand, the higher the dimensionality of the predefined attributes, the longer it takes to label each object, so attribute sets with limited dimensions are preferred when building large-scale datasets with limited resources. On the other hand, predefined attributes with limited dimensions cannot fully describe the object, reducing the attribute dataset's usability. Since there are always potential attributes essential for object representation, predefined attributes cannot fully describe the real world, even with high dimensionality.

With the increasing demand for large-scale training sets for deep learning models [12–22], some researchers have proposed self-supervised learning methods [23–26], which train CNNs with labels generated automatically from the structure or characteristics of the image itself. We observe that each attribute of an object corresponds to a particular part of the object region. For example, the attribute “blue wings” depends on the visual features inside the wings. Occluding a part of the object region hides certain attributes, so the occluded object has a slightly different attribute distribution from the original object. Inspired by this observation, this article proposes an automatic attribute generation method that occludes the object twice to obtain augmented objects whose attributes differ from those of the original object. The relationship between their attributes can be formulated and used as self-supervision to learn an attribute CNN. Based on the proposed method, this paper extends our previous work [27] by replacing the manually annotated supervision with the formulated attribute relationship to improve object recognition and segmentation.

The contributions of this paper can be summarized as follows:
(1) A self-supervised attribute learning (SAL) method is proposed, which automatically generates attribute relationships to learn the CNN.
(2) We design a multitask deep neural network, SAL-Net, in which attribute learning is an auxiliary self-supervised task that alleviates intraclass variations in the recognition task and refines the segmentation results.
(3) We embed SAL-Net into an instance-aware semantic segmentation system, in which a Fusion NMS algorithm is proposed to deal with repeated extraction of the same object.

The remainder of this paper is organized as follows. The following section presents some relevant studies, including attribute learning methods and self-supervised learning methods. Then, the methodology of the proposed self-supervised attribute learning approach is described, and extensive experiments have been carried out to compare with baseline and state-of-the-art methods on two public datasets. Finally, the article is concluded.

2. Related Work

2.1. Attribute Learning

Visual attributes denote semantic characteristics, such as color, shape, and material, which can be predefined to describe objects. Lampert et al. [5] developed the AWA dataset, which consists of 37,322 images of 50 animal classes and provides an 85-dimensional attribute distribution for each class. Deng et al. [9] compiled the PETA dataset, which contains 19,000 images of 8705 pedestrians and provides 61 binary and four multiclass attributes for each pedestrian. Farhadi et al. [8] annotated 12,679 samples of Pascal VOC [28] to build the attribute dataset aPY, where each sample is labeled with a 64-dimensional attribute description. Patterson and Hays [11] provided 204-dimensional attribute vectors for 29 categories of samples in COCO [29], although only 18,073 samples are labeled with more than ten positive attributes.

Attribute learning has been widely used in computer vision, for example, in fine-grained recognition [2–4], zero-shot learning [5–7], face analysis [30–33], person reidentification [9, 34–38], and semantic segmentation [39–41]. Akata et al. [6] presented a model that embeds each category label in the space of attribute vectors for unseen object recognition. Han et al. [4] proposed an attribute-aware attention model that learns local attribute representations and a global category representation for fine-grained image classification. In [37], pedestrian attributes and ID labels are jointly learned with a shared backbone CNN, and the attributes are combined with low-level visual features to improve reidentification accuracy. Zhang et al. [38] jointly learned the center position, scale, offset, and semantic attributes in a multitask CNN for pedestrian detection. In [41], an end-to-end attribute-aware semantic segmentation method was proposed by replacing the single category class person with nine orientation attribute classes.

However, most of the existing attribute learning methods are limited by the representative ability of human-defined attributes. Therefore, some researchers have proposed to learn latent attributes to obtain more discriminative features. Fu et al. [42] introduced a concept of semilatent attribute space, expressing user-defined and latent attributes in a unified framework, and proposed a novel scalable probabilistic topic model for learning multimodal semilatent attributes. Peng et al. [43] proposed a novel dictionary learning model which decomposes the dictionary space into three parts corresponding to semantic, latent discriminative, and latent background attributes. Li et al. [44] proposed an end-to-end network that is capable of learning discriminative semantic representations in an augmented space introduced for both user-defined and latent attributes. Xie et al. [45] proposed a semantic dictionary learning approach to exploit the latent visual attributes and align the visual-semantic spaces at the same time. Unlike the aforementioned methods, which learn latent attributes complementary to human-defined attributes for cross-class transfer learning, this paper proposes a self-supervised attribute learning method for object recognition and segmentation, which does not require additional attribute annotations.

2.2. Self-Supervised Learning

Convolutional neural networks (CNNs) [12–15] are typically learned end-to-end in a supervised manner and obtain more discriminative features than traditional hand-crafted methods [46–48]. Large-scale training samples are easy to collect, but their human-annotated labels are costly. Thus, some researchers have proposed self-supervised learning models, which automatically generate labels for specific tasks by processing the samples. Doersch et al. [23] trained a neural network to predict the relative position of image patches, where the position labels are generated by splitting the entire image into blocks. Zhang et al. [24] transformed color images into gray-scale images and proposed to learn latent semantic information of pixels from the colorization task. Gidaris et al. [25] trained a network to predict the rotation angle of an image, where training samples are obtained by rotating the original image by different angles.

Different from the above methods, which pretrain the CNN on a set of unlabeled data to build representations for downstream tasks, some researchers combine self-supervised learning with multitask learning. Lee et al. [26] embedded the rotation prediction task into the traditional classification task and attained a significant improvement in classification accuracy. In the proposed SAL-Net, self-supervised attribute prediction, image-level classification, and pixel-level segmentation are jointly learned to obtain more generalized features for different semantic analysis tasks.

3. Methodology

In this section, the proposed self-supervised attribute learning method is described, including an attribute generation approach and the definition of attribute loss. Then, a multitask deep neural network, SAL-Net, is designed based on the proposed method. In the network, attribute learning performs as an auxiliary task to constrain the CNN to learn more discriminative representations for object recognition and segmentation. Moreover, the proposed self-supervised attribute learning method is extended to instance-aware semantic segmentation by a multiobject recognition and segmentation system.

3.1. Attribute Generation

There is a correspondence between attributes and the local regions of the object; e.g., the yellow forehead attribute of a bird depends on the features in the forehead region. Therefore, an automatic attribute generation method is proposed, which generates two additional training samples and the attribute relationship between them by differentially occluding an original training sample. As shown in Figure 1, the original sample $S_0$ includes an image $I_0$, a category label $y_0$, and a mask $M_0$. According to the contour information provided by the mask, the object's bounding box can be obtained. The area inside the bounding box is divided into $K$ blocks, labeled $\{b_1, \dots, b_K\}$. Two blocks $b_m$ and $b_n$ are randomly selected, and the first block $b_m$ is occluded to obtain a new sample $S_1 = \{I_1, y_1, M_1\}$. The pixels in the occluded block are set to a constant, as given in

$I_1(p) = c, \quad \forall p \in b_m, \quad (1)$

where $c$ denotes the normalized mean BGR value among all pixels of the ImageNet dataset [49] and $p$ represents the position of each occluded pixel. In the mask $M_1$, the occluded pixels are set to 0. Based on sample $S_1$, the second block $b_n$ is occluded to obtain a new sample $S_2 = \{I_2, y_2, M_2\}$. In this way, the original single training sample is expanded into a group $\{S_0, S_1, S_2\}$. The positive attributes of an occluded sample are likely to be fewer than those of the sample before occlusion. In particular, sample $S_2$ is more occluded than sample $S_1$, so the attribute difference between $S_0$ and $S_2$ is greater than the difference between $S_0$ and $S_1$. The attribute relationship is described in

$d(A_0, A_2) \geq d(A_0, A_1) \geq 0, \quad (2)$

where $A_0$, $A_1$, and $A_2$ represent the attribute distributions of samples $S_0$, $S_1$, and $S_2$, and $d(\cdot, \cdot)$ is a measure of the difference between two attribute distributions. The equal sign means that when the occluded block does not contain any attributes, the occluded sample maintains the original attribute distribution. In the classification task, the category label of sample $S_0$ is directly applied to $S_1$ and $S_2$, i.e., $y_0 = y_1 = y_2$, which makes the learned features invariant to occlusion for categorization.
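The following sketch illustrates the occlusion procedure described above, assuming NumPy image and mask arrays. The grid size, the BGR mean values, and the function names are illustrative rather than the paper's exact settings.

```python
import numpy as np

# Assumed occlusion constant: a normalized mean BGR value of ImageNet (Eq. (1));
# the exact numbers here are illustrative.
MEAN_BGR = np.array([0.406, 0.456, 0.485], dtype=np.float32)

def occlude(image, mask, bbox, block, grid):
    """Occlude one block of the grid inside the bounding box (Eq. (1))."""
    x0, y0, x1, y1 = bbox
    bw, bh = (x1 - x0) / grid, (y1 - y0) / grid
    r, c = divmod(int(block), grid)
    xs, xe = int(x0 + c * bw), int(x0 + (c + 1) * bw)
    ys, ye = int(y0 + r * bh), int(y0 + (r + 1) * bh)
    img, msk = image.copy(), mask.copy()
    img[ys:ye, xs:xe] = MEAN_BGR   # occluded pixels set to a constant
    msk[ys:ye, xs:xe] = 0          # occluded pixels removed from the mask
    return img, msk

def generate_group(image, label, mask, bbox, grid=4, rng=None):
    """Expand one sample S0 into the training group {S0, S1, S2}."""
    rng = rng or np.random.default_rng()
    first, second = rng.choice(grid * grid, size=2, replace=False)
    img1, msk1 = occlude(image, mask, bbox, first, grid)   # S1: one block occluded
    img2, msk2 = occlude(img1, msk1, bbox, second, grid)   # S2: a second block occluded
    return [(image, label, mask), (img1, label, msk1), (img2, label, msk2)]
```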

3.2. Triplet Loss

A triplet loss is defined to learn the attribute prediction model; it consists of two parts.

The first part constrains the attribute difference between $S_0$ and $S_2$ to be greater than the attribute difference between $S_0$ and $S_1$:

$L_{t1} = \max\left(0, \;\|\hat{A}_0 - \hat{A}_1\|_2 + m_1 - \|\hat{A}_0 - \hat{A}_2\|_2\right), \quad (3)$

where $\hat{A}_i = f(I_i; \theta)$ represents the predicted attribute distribution of the input image $I_i$ based on the parameter set $\theta$, $\|\cdot\|_2$ denotes the Euclidean distance between two distributions, and $m_1$ is the minimum margin between the two metrics:

$\|\hat{A}_0 - \hat{A}_2\|_2 \geq \|\hat{A}_0 - \hat{A}_1\|_2 + m_1. \quad (4)$

This means that when the distance between $\hat{A}_0$ and $\hat{A}_2$ is larger than the distance between $\hat{A}_0$ and $\hat{A}_1$ plus $m_1$, the cost is 0. Thus, by minimizing $L_{t1}$, we keep $\hat{A}_2$ away from $\hat{A}_0$ and bring $\hat{A}_1$ close to $\hat{A}_0$ in the attribute space.

However, $L_{t1}$ can be minimized via $\hat{A}_1 = \hat{A}_0$. This results in the same attribute description for $S_0$ and $S_1$, which is not our expectation. Thus, an additional triplet loss function is defined:

$L_{t2} = \max\left(0, \; m_2 - \|\hat{A}_0 - \hat{A}_1\|_2\right). \quad (5)$

This means that when the distance between $\hat{A}_0$ and $\hat{A}_1$ is smaller than $m_2$, there is a cost.

By simultaneously minimizing $L_{t1}$ and $L_{t2}$, we can learn appropriate parameters $\theta$ to realize the attribute relationship represented in Equation (2). The randomly selected block probably contains attributes for most samples of the training set, so it is reasonable to set $m_2$ to a positive constant rather than 0.
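A minimal PyTorch sketch of the two triplet losses is given below; the margin values and the function name are illustrative assumptions.

```python
import torch

def attribute_loss(a0, a1, a2, m1=0.2, m2=0.1):
    """Triplet losses over predicted attribute distributions of {S0, S1, S2}.

    a0, a1, a2: (batch, Na) predictions for the original, once-occluded,
    and twice-occluded samples; m1, m2 are the margins (values illustrative).
    """
    d01 = torch.norm(a0 - a1, dim=1)   # Euclidean distance ||A0 - A1||
    d02 = torch.norm(a0 - a2, dim=1)   # Euclidean distance ||A0 - A2||
    # Eq. (3): S2 must lie farther from S0 than S1 by at least the margin m1.
    l_t1 = torch.clamp(d01 + m1 - d02, min=0).mean()
    # Eq. (5): keep S1 at least m2 away from S0 to avoid the degenerate
    # solution in which A1 collapses onto A0.
    l_t2 = torch.clamp(m2 - d01, min=0).mean()
    return l_t1 + l_t2                 # attribute loss L_a as in Eq. (8)
```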

3.3. SAL-Net

In this section, an object recognition and segmentation network, SAL-Net, is designed based on the proposed self-supervised attribute learning method. As shown in Figure 2, the end-to-end network consists of three main components highlighted in different colors:
(1) A shared feature coding module (the green block). It consists of a backbone CNN to extract image features and an attribute feature coding model to obtain attribute feature maps for the segmentation and recognition tasks.
(2) A segmentation module (the orange block). It projects the attribute feature maps into multiscale segmentation predictions.
(3) A classification module (the blue block). It projects semantic features into a smooth category space and combines the obtained category description with the attribute description to predict the category label.

3.3.1. Attribute Feature Encoding

In a CNN, features extracted by different layers contain different attribute information. Local attributes, e.g., yellow forehead and blue wing, are captured by low-level layers, while holistic attributes, e.g., chicken-like shape, are captured by high-level layers. Training with features from different layers can improve the robustness of the attribute model. Therefore, we encode the visual feature $F^l$ of each selected layer with a shared attribute coding module and obtain the attribute-wise feature $F_a^l$:

$F_a^l = g_a\left(g_l(F^l)\right), \quad l = 1, \dots, L, \quad (6)$

where $g_l$ denotes a layer-specific convolution and $g_a$ the shared attribute coding layer.

There are $L = 3$ layers selected in the backbone CNN, i.e., Conv3_x, Conv4_x, and Conv5_x. Since the feature maps extracted from different layers have different dimensions, the layer-specific convolutional layer $g_l$ first converts them into feature maps with a common channel dimension. Then, the shared convolutional layer $g_a$ encodes these features as $N_a$-dimensional attribute-wise feature maps.

The encoded features are fused to obtain the attribute prediction $\hat{A}$:

$\hat{A}(k) = \max_{l,\,p} F_a^l(k, p), \quad k = 1, \dots, N_a, \quad (7)$

where each item $\hat{A}(k)$ denotes the maximum response of the $k$-th attribute feature map over all spatial positions $p$ and layers $l$.

According to Equations (3) and (5), the attribute loss function is defined as

$L_a = L_{t1} + L_{t2}. \quad (8)$
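The following PyTorch sketch outlines one possible form of the attribute feature coding module described above. The channel widths of Conv3_x, Conv4_x, and Conv5_x follow ResNet50, while the common channel dimension, the kernel sizes, $N_a = 400$, and the max-based fusion are assumptions.

```python
import torch
import torch.nn as nn

class AttributeEncoder(nn.Module):
    """Shared attribute feature coding module (sketch under stated assumptions)."""
    def __init__(self, in_channels=(512, 1024, 2048), common=256, num_attrs=400):
        super().__init__()
        # Layer-specific convolutions g_l align the channel dimensions (Eq. (6)).
        self.align = nn.ModuleList([nn.Conv2d(c, common, kernel_size=1)
                                    for c in in_channels])
        # Shared convolution g_a encodes Na-dimensional attribute-wise maps.
        self.shared = nn.Conv2d(common, num_attrs, kernel_size=1)

    def forward(self, feats):
        # feats: list of feature maps from the three selected backbone layers.
        attr_maps = [self.shared(g(f)) for g, f in zip(self.align, feats)]
        # Eq. (7): take the maximum response of each attribute map over all
        # spatial positions and layers as the attribute prediction.
        per_layer = [m.flatten(2).max(dim=2).values for m in attr_maps]
        attr_pred = torch.stack(per_layer, dim=0).max(dim=0).values  # (batch, Na)
        return attr_maps, attr_pred
```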

3.3.2. Segmentation Module

The segmentation module consists of $L$ segmentation models $\{s_1, \dots, s_L\}$. Each segmentation model $s_l$ maps an attribute-wise feature $F_a^l$ into a segmentation map $P^l$:

$P^l = s_l(F_a^l). \quad (9)$

Each segmentation model consists of a convolutional layer to localize the object and a deconvolutional layer to upsample the segmentation to the same size as the original image.

The final segmentation $P$ is obtained by element-wise adding the coarse-to-fine segmentation maps and then applying the sigmoid function:

$P = \sigma\left(\sum_{l=1}^{L} P^l\right). \quad (10)$

$P$ represents the probability of each pixel belonging to the foreground object.

The segmentation loss is defined as the cross-entropy between the prediction and the mask provided by the dataset. For a single sample, it is the cross-entropy of the ground-truth mask $M$ and the prediction $P$:

$\ell_{seg}(M, P) = -\sum_{p} \left[ M(p)\log P(p) + \left(1 - M(p)\right)\log\left(1 - P(p)\right) \right]. \quad (11)$

Three training samples are fed into the network during the training phase. Thus, the segmentation loss function is defined as

$L_s = \sum_{i=0}^{2} \ell_{seg}(M_i, P_i). \quad (12)$
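A possible implementation of one segmentation head and the fused segmentation loss is sketched below, assuming the attribute maps come from the encoder sketched above; the kernel sizes and the upsampling factor (which depends on the backbone layer) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentationHead(nn.Module):
    """One segmentation model s_l (sketch; layer sizes are assumptions).

    A convolutional layer localizes the object on the attribute-wise feature
    map and a deconvolutional layer upsamples it to the input size (Eq. (9)).
    """
    def __init__(self, num_attrs=400, up_factor=8):
        super().__init__()
        self.localize = nn.Conv2d(num_attrs, 1, kernel_size=3, padding=1)
        self.upsample = nn.ConvTranspose2d(1, 1, kernel_size=2 * up_factor,
                                           stride=up_factor, padding=up_factor // 2)

    def forward(self, attr_map):
        return self.upsample(self.localize(attr_map))  # coarse segmentation logits

def segmentation_loss(seg_maps, gt_mask):
    """Eq. (10): element-wise sum of the multiscale maps followed by a sigmoid;
    Eqs. (11)-(12): binary cross-entropy against the ground-truth mask."""
    prob = torch.sigmoid(torch.stack(seg_maps, dim=0).sum(dim=0))
    return prob, F.binary_cross_entropy(prob, gt_mask)
```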

3.3.3. Classification Module

The classification module combines category descriptions and attribute descriptions to classify the image. The semantic feature $F_c$, which is the output of the global pooling layer in the backbone CNN, is encoded into a category description feature $\hat{y}_c$:

$\hat{y}_c = \phi_c(F_c). \quad (13)$

$\phi_c(\cdot)$ denotes a fully connected layer with softmax activation. The obtained description represents the location of the input sample in a category description space. The space is defined by applying the label smoothing [15] strategy to the typical one-hot category labels:

$\tilde{y}(j) = (1 - \epsilon)\,\mathbb{1}[j = y] + \frac{\epsilon}{N_c}, \quad j = 1, \dots, N_c, \quad (14)$

where $N_c$ is the number of categories, $y$ is the ground-truth label, and $\epsilon$ is the smoothing factor.

The loss function of the category feature is defined as

$L_{cat} = -\sum_{j=1}^{N_c} \tilde{y}(j)\,\log \hat{Y}_c(j), \quad (15)$

where $\hat{Y}_c = \hat{y}_c^{(0)} \odot \hat{y}_c^{(1)} \odot \hat{y}_c^{(2)}$ represents the product of the category features of the three samples. Then, the category feature and the attribute feature are concatenated to obtain an $(N_c + N_a)$-dimensional semantic feature vector, which is utilized to predict the category label with a classifier:

$\hat{y} = h\left(\left[\hat{y}_c, \hat{A}\right]; \theta_h\right). \quad (16)$

$\theta_h$ denotes the parameter set of the classifier $h$; e.g., a fully connected layer with softmax activation works well. An additional loss is defined to train the classifier:

$L_{cls} = \sum_{i=0}^{2} \mathrm{CE}\left(y_i, \hat{y}^{(i)}\right), \quad (17)$

where $\mathrm{CE}(\cdot, \cdot)$ is the cross-entropy between the ground-truth label and the prediction of sample $S_i$.

In the training phase, we automatically generate sample groups $\{S_0, S_1, S_2\}$ from the original samples and use them to optimize the parameters of SAL-Net by minimizing the total cost of the three tasks:

$L_{total} = L_a + L_s + L_{cat} + L_{cls}. \quad (18)$
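The following sketch shows how the classification module could be assembled, with label smoothing and the concatenation of category and attribute descriptions; the smoothing factor, layer sizes, and names are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationHead(nn.Module):
    """Category description and attribute-augmented classifier (sketch)."""
    def __init__(self, feat_dim=2048, num_classes=200, num_attrs=400, eps=0.1):
        super().__init__()
        self.category = nn.Linear(feat_dim, num_classes)                    # Eq. (13)
        self.classifier = nn.Linear(num_classes + num_attrs, num_classes)   # Eq. (16)
        self.eps, self.num_classes = eps, num_classes

    def smooth_labels(self, labels):
        # Eq. (14): label smoothing over the one-hot category labels.
        one_hot = F.one_hot(labels, self.num_classes).float()
        return (1 - self.eps) * one_hot + self.eps / self.num_classes

    def forward(self, pooled_feat, attr_pred):
        y_cat = torch.softmax(self.category(pooled_feat), dim=1)  # category description
        # Concatenate category and attribute descriptions, then classify.
        y_cls = self.classifier(torch.cat([y_cat, attr_pred], dim=1))
        return y_cat, y_cls
```

During training, the classification losses of Equations (15) and (17) would be computed from the two outputs for the three samples of each group and added to the attribute and segmentation losses as in Equation (18).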

3.4. Multiobject System

In this section, a multiobject recognition and segmentation system is designed based on the proposed SAL-Net to achieve instance-aware semantic segmentation. As shown in Figure 3, the system consists of four parts: an attribute generation module, a region proposal module, the SAL-Net, and a postprocessing module.

3.4.1. Candidate Region Proposal Module

This module automatically generates candidate regions from the input image based on the selective search [50] algorithm. In the training phase, it generates negative samples: we calculate the IoU (intersection over union) score between each proposal and the ground-truth regions, and a proposal whose maximum IoU is small, e.g., smaller than 0.1, is taken as a “background” sample, as sketched below. In the testing phase, this module generates testing samples that are fed into SAL-Net.
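A simple sketch of the IoU-based negative-sample selection follows; the function names are hypothetical, and boxes are assumed to be (x0, y0, x1, y1) tuples.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_negatives(proposals, gt_boxes, thr=0.1):
    """Keep proposals whose maximum IoU with every ground-truth region stays
    below the threshold (0.1 in the text) as "background" samples."""
    return [p for p in proposals
            if max(box_iou(p, g) for g in gt_boxes) < thr]
```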

3.4.2. Attribute Generation Module

This module generates positive samples for training SAL-Net, as shown in Figure 1. If a training image contains $n$ objects, then $n$ groups of positive samples can be obtained, where each group contains the three samples $\{S_0, S_1, S_2\}$.

3.4.3. SAL-Net Module

The structure is similar to Figure 2, except that negative samples are utilized for training along with the groups of positive samples. We add a background class to the category description space; that is, the category feature becomes an $(N_c + 1)$-dimensional vector. It is worth noting that negative samples are not used for training the attribute and segmentation tasks. Equation (18) is therefore modified as follows:

$L_{total} = L_a^{pos} + L_s^{pos} + L_{cat} + L_{cls}, \quad (19)$

where $L_a^{pos}$ and $L_s^{pos}$ are computed only on the positive sample groups.

In the testing phase, we use the category description as the basis for distinguishing objects from the background. If the maximum probability of $\hat{y}$ falls on the background class, the testing sample is regarded as background; otherwise, the maximum probability of $\hat{y}$ is taken as the confidence of the sample, and the corresponding category is regarded as the recognition result.
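This testing-phase decision can be sketched as follows; the function name and argument layout are assumptions.

```python
def recognize_region(y_pred, background_index):
    """Drop background regions; otherwise return the predicted category and
    its probability as the confidence score.
    y_pred: 1-D array of class probabilities including the background class."""
    best = int(y_pred.argmax())
    if best == background_index:
        return None                        # the region belongs to the background
    return best, float(y_pred[best])       # (category, confidence)
```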

3.4.4. Postprocessing Module

In the testing phase, this module filters the recognition and segmentation results of multiple test samples to obtain the final instance-aware semantic segmentation results. A fusion nonmaximum suppression algorithm is proposed to deal with repeated extraction of the same object. As described in Algorithm 1, we calculate the segmentation IoU of each pair of predictions with the same category. When the IoU is larger than a threshold (set to 0.2), we remove the prediction with lower confidence and fuse its segmentation into the other prediction.

Input: Mask set $M$, category set $C$, confidence set $S$
Output: Updated $M$, $C$, and $S$
Sort $M$, $C$, and $S$ according to the values in $S$ from large to small;
Initialize the number of predictions $n \leftarrow |M|$;
for $i = 1$ to $n - 1$ do
   for $j = i + 1$ to $n$ do
     if $C_j = C_i$ then
       Calculate $\mathrm{IoU}(M_i, M_j)$;
       if $\mathrm{IoU}(M_i, M_j) > t$ then
          Fuse the prediction: $M_i \leftarrow M_i \cup M_j$;
          Remove the prediction: delete $M_j$, $C_j$, and $S_j$;
          $n \leftarrow n - 1$;
       end if
     end if
   end for
end for
return Updated $M$, $C$, and $S$
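A Python sketch of Algorithm 1 is given below, assuming boolean mask arrays; the function names are hypothetical.

```python
import numpy as np

def mask_iou(m1, m2):
    """IoU between two boolean segmentation masks."""
    inter = np.logical_and(m1, m2).sum()
    union = np.logical_or(m1, m2).sum()
    return inter / union if union > 0 else 0.0

def fusion_nms(masks, categories, scores, iou_thr=0.2):
    """Fusion NMS: higher-score predictions absorb the masks of overlapping
    lower-score predictions of the same category, which are then removed."""
    order = np.argsort(scores)[::-1]                 # sort by confidence, descending
    masks = [masks[i].copy() for i in order]
    categories = [categories[i] for i in order]
    scores = [scores[i] for i in order]
    i = 0
    while i < len(masks):
        j = i + 1
        while j < len(masks):
            if categories[i] == categories[j] and mask_iou(masks[i], masks[j]) > iou_thr:
                masks[i] = np.logical_or(masks[i], masks[j])   # fuse the lower-score mask
                del masks[j], categories[j], scores[j]         # remove the duplicate
            else:
                j += 1
        i += 1
    return masks, categories, scores
```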

4. Experiments

4.1. Datasets and Implementation Details

In this section, a series of experiments is conducted on two public benchmarks, CUB-2011 [10] and Pascal VOC [28].
(i) CUB-2011 [10]. Wah et al. [10] established a dataset with 11,788 images from 200 bird categories. Each bird is annotated with a 312-dimensional attribute distribution. The standard training/testing split (i.e., 5994/5794) is adopted. In the training phase, each image is resized to a fixed input resolution, and the batch size is set to 64 for both SAL-Net and the baseline models. All networks are trained for 50,000 iterations with a learning rate of 0.001, then 30,000 iterations with a learning rate of 0.0001, and finally 20,000 iterations with a learning rate of 0.00001.
(ii) Pascal VOC [28]. The Pascal VOC dataset is adopted to evaluate the effectiveness of both SAL-Net and the proposed multiobject system. We use 1464 images for training and 1449 images for testing. In the training phase, 3508 groups of positive samples and 1464 negative samples are used. The batch size is set to the number of samples drawn from 4 training images, and each cropped sample is resized to the network input size. The training of both SAL-Net and the baseline models is performed in three stages: 20,000 iterations with a learning rate of 0.001, then 12,000 iterations with a learning rate of 0.0001, and finally 8000 iterations with a learning rate of 0.00001. To evaluate the performance of SAL-Net alone, a bounding box is provided to extract each object in the testing image and use it as input. To evaluate the performance of the proposed system, the entire testing image is fed into the system.

We compute the average intersection over union (aIoU) among testing samples to evaluate the segmentation performance and the accuracy (Acc) to evaluate the classification performance. For instance-aware semantic segmentation, we calculate the mean average precision and recall among categories (mask AP and AR defined in [29]).
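For reference, a minimal sketch of the aIoU computation is given below, assuming binary mask arrays; the binarization threshold is an assumption.

```python
import numpy as np

def average_iou(pred_masks, gt_masks, thr=0.5):
    """aIoU: IoU between each binarized prediction and its ground-truth mask,
    averaged over all testing samples."""
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        p, g = pred > thr, gt > 0.5
        union = np.logical_or(p, g).sum()
        inter = np.logical_and(p, g).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))
```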

4.2. Comparative Methods

We compare the following degenerate models with the proposed SAL-Net. All comparative models are based on ResNet50 [14], pretrained on the ImageNet dataset [49].
(i) ST-cls (A Single-Task Model for Classification). The output of Conv5_x is fed into a global pooling layer to obtain a 2048-dimensional feature vector, which is projected onto the category space with a fully connected layer.
(ii) ST-seg (A Single-Task Model for Segmentation). The output of Conv5_x is fed into a convolutional layer and a deconvolutional layer to obtain the foreground confidence map.
(iii) MT-SF (A Multitask Model with Single-Layer Features). This is a two-branch multitask network in which the classification task and the segmentation task share the CNN backbone. The output of Conv5_x is fed into a classification branch and a segmentation branch, which have the same structures as ST-cls and ST-seg, respectively.
(iv) MT-MF (A Multitask Model with Multilayer Features). The network has a structure similar to SAL-Net except for the auxiliary attribute task. For the classification branch, the Conv5_x feature is fed into a global pooling layer to obtain a 2048-dimensional vector. For the segmentation branch, features from Conv3_x, Conv4_x, and Conv5_x are respectively fed into three segmentation projection models, each containing a convolutional layer and a deconvolutional layer. The outputs of the three segmentation models are fused into the final prediction.
(v) AFE-Net [27] (Our Previous Attribute-Aware Feature Encoding Network). It projects the visual features onto attribute-wise feature maps before sending them to the segmentation branch. Meanwhile, the classification branch adopts the strategy of smoothing the category space and combining attribute and category descriptions. Compared with SAL-Net, the only difference is that this baseline applies supervision from manually annotated attributes, while SAL-Net is based on the proposed self-supervised attribute learning method.

4.3. Performance of SAL-Net
4.3.1. Comparisons with Baseline Models

The proposed SAL-Net is compared with the baseline models in Table 1. Compared with the single-task model ST-cls, the multitask model MT-SF obtains better classification results. This is because the segmentation supervision signal transfers object contour information to the classification task: the extracted object region suffers less from the adverse effects of background features, which improves the stability of the classification model. However, MT-SF has only a limited advantage over the single-task model ST-seg in the segmentation task, because the additional classification supervision motivates the CNN to extract semantic features rather than location features. This problem can be solved by the multilayer feature strategy. Compared with MT-SF, MT-MF significantly improves the segmentation accuracy, with a gain of 8.39% aIoU on the CUB-2011 dataset and 3.57% on the Pascal VOC dataset.

In our previous work [27], attribute knowledge is applied to constrain the feature extraction, which brings significant improvements in both the classification and segmentation tasks on the CUB-2011 dataset. However, it is not applicable to Pascal VOC since no attribute annotations are provided by that dataset. This paper proposes a self-supervised attribute learning (SAL) method instead. Compared with MT-MF, SAL-Net produces a 4.05% aIoU gain and a 6.1% accuracy gain on the CUB-2011 dataset. On the Pascal VOC dataset, it obtains a 0.6% aIoU gain and a 3.55% accuracy gain. The reason is that SAL-Net introduces attribute relationships to constrain the feature extraction network and smooth the category space, which makes the extracted features more discriminative and robust. Meanwhile, concatenating the attribute features to the category description further improves classification accuracy. Compared with [27], SAL-Net produces a segmentation gain of 1.59% aIoU and a classification gain of 3.18% accuracy on the CUB-2011 dataset. This can be explained by the advantage of latent attributes: the attribute knowledge automatically mined from images is more suitable than manually annotated attributes for identifying and segmenting objects.

4.3.2. Comparisons with State-of-the-Art Methods

In Table 2, SAL-Net is compared with state-of-the-art fine-grained recognition methods on the CUB-2011 benchmark. A widespread data augmentation approach is applied, in which each input image is resized and rotated by a random angle, and a randomly cropped patch is resized and randomly flipped to form the network input. With this data augmentation, the standard ResNet50 achieves high accuracy on the CUB-2011 testing set, while our SAL-Net improves the accuracy from 82.0% to 84.6%. Existing attribute-based methods [4, 27, 51, 52] also learn attribute attention to localize discriminative regions, but they rely on the predefined attributes or textual descriptions provided by the dataset. In contrast, our SAL-Net automatically exploits latent attributes with the proposed self-supervised learning method and does not need any extra annotations.

4.3.3. Effects of the Structure of Attribute Feature Coding Module

Figure 4 illustrates the classification and segmentation performance obtained with different attribute feature coding modules. The horizontal axis represents the number of shared convolutional layers that encode the CNN features as $N_a$-dimensional attribute-wise feature maps. We can see that the performance with one convolutional layer is the best. As the number of hidden layers increases, the supervision of the attributes imposes fewer constraints on the feature extraction. This result indirectly verifies the effectiveness of the proposed self-supervised attribute learning method for representation learning.

4.3.4. Effects of the Attribute Dimension

Figure 5 compares the classification and segmentation results obtained with different attribute dimensions $N_a$. We can see that the change of the attribute dimension has a limited effect on the segmentation task. The reason is that SAL-Net mainly refines the segmentation through the localization of latent attributes: when $N_a$ is small, the area of each latent attribute is learned to be large, and when $N_a$ is large, the area of each latent attribute is learned to be small, so the fused segmentation result remains stable. For the classification task, a network with a small $N_a$ cannot learn enough attributes to describe the object. On the contrary, an excessively large $N_a$ weakens the category knowledge in the concatenated feature and decreases the recognition accuracy. In the Pascal VOC dataset, there are remarkable visual differences among the 20 categories, as well as noticeable visual differences between individuals of a common category, resulting in a dataset with diversified characteristics. Therefore, we select a relatively large value of $N_a$ to learn as many latent attributes as possible. In the fine-grained CUB-2011 dataset, the 200 categories share attributes, and birds from a common category have almost the same attribute distribution. Therefore, we choose $N_a = 400$, which is 88 higher than the 312 predefined attributes provided by the dataset.

4.3.5. Effects of the Attribute Margin

Figure 6 illustrates the influence of the minimum margin $m_1$ in Equation (3). We can see that when $m_1$ is small, the occluded sample is closer to the original sample in the attribute space, which reduces the discrimination of the representation. When $m_1$ is large, the attribute features of the occluded sample and the original sample become quite different, making it difficult for the concatenated features to be classified into the same category. Therefore, an appropriate $m_1$ is selected to learn discriminative attributes while ensuring their occlusion invariance for category labels.

4.4. Performance of Instance-Aware Semantic Segmentation

Figure 7 shows the output of each module in the proposed multiobject recognition and segmentation system. It can be seen from Figure 7(a) that the region proposal module based on selective search [50] obtains multiple candidates concentrated in the object region. Figure 7(b) illustrates the classification and segmentation results of SAL-Net: each candidate region is fed into the network to obtain a binary segmentation mask and a category confidence. Some background regions are filtered out with a threshold of 0.6 on the confidence score. However, multiple overlapping results for the same object still remain. The proposed Fusion NMS approach is applied to process these predictions. In Figure 7(c), we can see that most overlapping areas are filtered out, and the segmentation obtained after fusion is more accurate.

Table 3 further demonstrates the effectiveness of the proposed Fusion NMS method. We can see that the baseline result without postprocessing has a higher recall but a lower precision, because many overlapping SAL-Net predictions are retained (as shown in Figure 7(b)). Mask NMS [58] can filter out overlapping results with lower scores, but its performance is sensitive to the accuracy of each candidate box. Based on the coarse boxes extracted by the selective search algorithm, Mask NMS cannot obtain an effective improvement. Matrix NMS [59] updates the confidence of all predictions instead of directly filtering out low-score predictions, but its gains in AR and AP are limited. The proposed Fusion NMS method is an advancement of Mask NMS [58]: we filter out the overlapping prediction with the lower score while fusing its segmentation into the retained prediction, making the final segmentation more accurate. As shown in Figure 7(c), even if the retained box does not contain the entire object, the segmentation after fusion is still satisfactory.

5. Conclusion

This paper proposes a self-supervised attribute learning method, which automatically generates latent attribute descriptions for training. Based on the proposed method, we design a multitask network SAL-Net, in which attributes constrain the feature coding network to learn more discriminative representations for recognition and segmentation. Experimental results on CUB-2011 and Pascal VOC datasets illustrate that the proposed SAL-Net can significantly improve object recognition and segmentation accuracy without extra annotations. It achieves state-of-the-art fine-grained recognition performance on CUB-2011, superior to existing methods based on predefined attributes. In addition, we naturally embed SAL-Net into a multiobject system to achieve instance-aware semantic segmentation. Qualitative and quantitative experiments on Pascal VOC verify the effectiveness of each module in the system.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant No.: 61620106002) and the Central Government Guides Local Science and Technology Development Fund Project of China (Grant No.: 2021ZY0004).