Abstract

Deep encoder-decoder networks have been adopted for saliency detection and have achieved state-of-the-art performance. However, most existing saliency models fail to detect very small salient objects. In this paper, we propose a multitask architecture, M2Net, and a novel centerness-aware loss for salient object detection. The proposed M2Net aims to solve saliency prediction and centerness prediction simultaneously. Specifically, the network is composed of a bottom-up encoder module, a top-down decoder module, and a centerness prediction module. In addition, unlike binary cross entropy, the proposed centerness-aware loss guides M2Net to uniformly highlight entire salient regions with well-defined object boundaries. Experimental results on five benchmark saliency datasets demonstrate that M2Net outperforms state-of-the-art methods on different evaluation metrics.

1. Introduction

Salient object detection (SOD) [1–3] aims to extract the most visually distinctive objects in an image or video. Over the past decades, it has become a research hotspot in computer vision. Saliency detection results often serve as the first step for a variety of downstream computer vision tasks, including object recognition [4], visual tracking [5], image retrieval [6], no-reference synthetic image quality assessment [7], robot navigation [8], image and video compression [9, 10], and object discovery [11–13].

Earlier SOD methods mostly rely on hand-crafted features (e.g., color, brightness, and texture) to produce saliency maps. However, these low-level features can hardly capture high-level semantic information and are not robust to various complex scenarios.

Recently, convolutional neural networks (CNNs), especially fully convolutional neural networks (FCNs) [14], have pushed salient object detection to very promising results on many popular public benchmark datasets. The encoder-decoder framework [3, 15–19] is frequently used to extract and combine enriched feature blocks and can therefore generate more accurate saliency maps. More recently, many researchers have further improved saliency models by incorporating domain-specific information from other tasks such as contour/edge detection [18, 20, 21], image classification [22, 23], and noise pattern modeling [24].

These U-shape models [3, 21] have greatly refreshed the leaderboards on all commonly used datasets. However, existing saliency methods still leave several problems unsolved that are worthy of further research. First, due to repeated subsampling, a single-scale convolutional kernel has difficulty in accurately segmenting salient objects of varying sizes. Two state-of-the-art methods cannot uniformly highlight a small foreground object with well-defined boundaries, as shown in Figures 1(d) and 1(e). This motivates efforts to characterize the multiscale information from a single layer. Second, most existing saliency methods [15, 25] use binary cross entropy (BCE) loss to train their networks. However, models trained with BCE loss usually have low confidence in distinguishing foreground from background, leading to blurred boundaries. A recent survey [26] indicates that the careful design of the loss function can help to train more effective saliency detection models. Several training losses, such as the PPA loss [19], the Intersection over Union (IoU) loss [17, 27], and the F-measure loss [28], have been proposed to improve model performance. Consequently, it is essential to design a mechanism that extracts multiscale information from each layer and to develop a novel training loss.

To address the above challenges, we propose a novel multiscale and multitask network, named M2Net, which can generate high-quality saliency maps with clear boundaries (see Figure 1(c)). First, in the bottom-up encoder module, we use two branches to extract robust feature blocks. The backbone branch is based on a common pretrained image classification network, while the transformation branch is based on a sequence of three operations: convolution, batch normalization, and ReLU. Second, in the decoder module, we develop two units, a multiscale feature extraction unit and a cross-layer feature block fusion unit, to generate the saliency maps. The multiscale feature extraction unit extracts multiscale contextual features, while the cross-layer feature block fusion unit continually fuses adjacent-level feature blocks. Third, to take full advantage of the ground truth, we design a centerness-aware loss, which considers the location of salient objects. This loss guides the proposed network to generate high-quality saliency maps.

We conduct experiments on five benchmark saliency datasets and demonstrate the better performance of the proposed M2Net. In summary, our contributions are as follows:
(i) We propose a multiscale and multitask deep framework with a centerness-aware loss for salient object detection. M2Net consists of an encoder module, a decoder module, and a centerness prediction module.
(ii) We develop a centerness-aware loss, which helps to generate high-quality saliency maps and pushes the proposed M2Net to uniformly highlight entire salient regions with clear boundaries.
(iii) Extensive experiments on five public SOD datasets show that our model M2Net outperforms state-of-the-art saliency methods on different evaluation metrics. In particular, the proposed model achieves the best performance under different challenging situations.

2. Related Work

2.1. Salient Object Detection

Early SOD methods [2, 29, 30] are mainly based on hand-crafted features and intrinsic cues, such as center prior, color contrast, and background prior. Recently, convolutional neural networks (CNNs) have been used to extract multilevel features from input images. Early CNN-based methods treat patches/superpixels [31–33] or generic object proposals [34–37] as processing units and train the network with an MLP classifier. Wang et al. [35] trained two different CNN models, DNN-L and DNN-G, to extract local and global features, respectively, and used them to generate a saliency map. In particular, fully convolutional networks (FCNs) have shown their advantage and refreshed the state-of-the-art records in the saliency prediction task. The encoder-decoder framework is frequently used in FCN-based saliency models [3, 15–19, 38–40]. Liu et al. [16] proposed a novel network that embeds local and global pixelwise contextual attention modules into a U-shape network. Zhao et al. [3] proposed a simple and effective gated network architecture to control the meaningful message passing from encoder to decoder feature blocks. Almost all of the above methods try to develop more complicated modules and strategies to fuse feature blocks of different levels. Different from these methods, we propose a simple and effective multitask architecture that attempts to solve the saliency task by adding an extra centerness prediction branch.

2.2. Multiscale Feature Extraction

The atrous spatial pyramid pooling (ASPP) module [41] is widely used in many computer vision tasks. Atrous convolution expands the receptive field with fewer parameters to obtain larger-scale and more comprehensive features. The pyramid pooling module (PPM) [42] is another choice for extracting multiscale features. Zhang et al. [43] inserted five ASPP modules into the encoder feature blocks of five levels. However, the larger the atrous rate, the more difficult it is to capture changes in image details. To alleviate this problem, Zhao et al. [3] designed a folded ASPP that achieves a local-in-local effect. Besides, the pyramid attention module [44] can generate multiscale attention maps to enhance saliency features. The above methods can extract multiscale features from images but are more sensitive to background noise. To improve the recall of salient objects against complex backgrounds, we propose a multiscale feature extraction module and insert it into the decoder feature blocks.

2.3. Multitask Learning

Multitask learning (MTL) has led to successes in many research fields, from computer vision and speech recognition to drug discovery and natural language processing. Multitask learning trains two or more related tasks simultaneously, and learning multiple tasks jointly often yields better performance than learning them individually. Recent multitask saliency methods have shown good results by jointly tackling related tasks such as image classification, fixation prediction, and edge detection. Li et al. [23] and Wang et al. [22] proposed to apply image-level tags to assist the detection of foreground objects. Kruthiventi et al. [47] proposed a unified multitask learning framework to jointly solve salient object detection and fixation prediction. Zhao et al. [20] presented an edge guidance network to extract two complementary features, salient object features and salient edge features. Location is important information about an object, yet, to the best of our knowledge, it has never been directly used in saliency prediction. In this paper, we investigate how to integrate the centerness prediction task into saliency detection.

3. Proposed Method

In this paper, we propose a multitask and multiscale deep network for salient object detection. As shown in Figure 2, the proposed network consists of three related modules. To guide the saliency network to uniformly highlight entire objects of different sizes, we propose a multiscale feature extraction approach. To further improve detection accuracy, we introduce a centerness-aware loss, which helps to reduce the impact of complex backgrounds.

3.1. Network Overview

The encoder-decoder architecture has been widely used in the salient object detection task because of its strong ability to combine features from different network layers. Our method is built on feature pyramid networks (FPN) [48] with a pretrained ResNeXt-101 [46] or ResNet-50 [45] backbone, both of which can extract meaningful saliency features for building high-quality U-shape networks. To reduce network parameters, we discard all the fully connected layers of the pretrained backbone [45, 46]. The proposed M2Net consists of a bottom-up encoder module, a top-down decoder module, and a centerness prediction module. In the encoder, we use the pretrained backbone to extract multilevel saliency features from preprocessed images. To obtain robust saliency features, each feature block is processed by a 1 × 1 convolutional layer followed by batch normalization and ReLU (Figure 2). Next, in the decoder, we use a skip/concatenation connection scheme, and a novel multiscale feature extraction approach is proposed to generate the final saliency maps (Figure 3). Lastly, we design a centerness prediction module (Figure 2), which helps to generate high-quality saliency maps. We describe the structures of the three modules and their transformations in the following sections.

3.2. Encoder Module

In our M2Net, the encoder module is composed of a backbone branch and a transformation branch. The backbone branch is based on a common pretrained image classification network, for example, VGG, ResNet-50 [45], or ResNeXt-101 [46]. To fit the saliency prediction task, similar to most previous saliency methods [3, 16, 20], we remove the last pooling layer and cast away all the fully connected layers of the ResNet or ResNeXt. Let $I \in \mathbb{R}^{320 \times 320 \times C}$ denote the input training image with ground-truth label $Y \in \mathbb{R}^{320 \times 320 \times 1}$, as shown in Figure 2, where $C$ denotes the number of channels of the input image. For a given input image of size $H \times W$, the pretrained image classification network extracts saliency features at five different levels, denoted $\{E_i \in \mathbb{R}^{H_i \times W_i \times C_i} \mid i = 1, \dots, 5\}$ with resolutions $[H/2^{i-1}, W/2^{i-1}]$, where $C_i$, $H_i$, and $W_i$ denote the channel count, height, and width of the $i$-th feature block. In the transformation branch, a sequence of three operations is used to generate robust and meaningful saliency features, as shown in Figure 2. The detailed parameters of the three operators can be found in Table 1. After the above processing, we obtain five feature blocks $\{T_i \mid i = 1, \dots, 5\}$, which are used in the decoder part of the network.
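As a concrete illustration, a minimal PyTorch sketch of the transformation branch is given below; the class name and channel widths are illustrative assumptions, and the exact configuration follows Table 1.

import torch
import torch.nn as nn

class TransformBranch(nn.Module):
    """Conv-BN-ReLU transformation applied to an encoder block E_i to produce T_i.
    Channel widths are illustrative; the actual values are listed in Table 1."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, e_i):
        # Same spatial resolution as E_i, unified channel count
        return self.transform(e_i)

# Example (shapes assumed for illustration):
# t3 = TransformBranch(512, 64)(torch.randn(1, 512, 80, 80))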

3.3. Decoder Module

In the encoder, different levels of feature blocks contain different information. High-level feature blocks encode category-level semantic information but pay little attention to local image details, whereas low-level feature blocks contain more detailed information about the image but suffer from semantic ambiguity. The decoder module is designed to integrate these different feature blocks, and combining feature blocks of different levels enhances the representation ability for the saliency prediction task. The decoder comprises two main computing units: (i) the multiscale feature extraction unit, which extracts multiscale contextual features and helps the saliency model learn discriminative features, and (ii) the cross-layer feature block fusion unit, which continually fuses adjacent-level feature blocks from $\{T_i \mid i = 1, \dots, 5\}$.

Figure 3 shows the details of the multiscale feature extraction unit. Given a feature block $T_i$, we first use average pooling to perform a downsampling operation, which yields $T_i^{p}$. To obtain robust saliency features, the two branches are processed by a combination operation composed of convolution, batch normalization, and ReLU, producing $F_i^{t}$ (top branch, from $T_i$) and $F_i^{b}$ (bottom branch, from $T_i^{p}$). The output of the bottom branch is upsampled to match the output of the top branch. To extract multiscale features, we integrate the two branches by using multiplication and addition operations. The multiscale feature extraction unit is formulated as follows:

$M_i = F_i^{t} \otimes \mathrm{Up}(F_i^{b}) + F_i^{t}$,  (1)

where $\mathrm{Up}(\cdot)$ denotes the upsampling operation and $\otimes$ denotes element-wise multiplication.
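A minimal PyTorch sketch of this unit is given below, under the assumption that the two branches are merged as in equation (1); the class name, kernel size, and pooling factor are illustrative rather than the exact settings of our implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def cbr(channels):
    # The combination operation: convolution, batch normalization, ReLU
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )

class MultiScaleFeatureExtraction(nn.Module):
    """Two-branch multiscale unit: a full-resolution (top) branch and an
    average-pooled (bottom) branch, merged by multiplication and addition."""
    def __init__(self, channels, pool_factor=2):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=pool_factor, stride=pool_factor)
        self.top = cbr(channels)
        self.bottom = cbr(channels)

    def forward(self, t_i):
        f_top = self.top(t_i)                     # F_i^t
        f_bottom = self.bottom(self.pool(t_i))    # F_i^b at reduced resolution
        f_bottom = F.interpolate(f_bottom, size=f_top.shape[2:],
                                 mode='bilinear', align_corners=False)
        # Equation (1): multiply the two scales, then add the top branch back
        return f_top * f_bottom + f_top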

Figure 4 shows the details of the cross-layer feature block fusion unit. This unit is composed of four kinds of operators: convolution, upsampling, concatenation, and a combinator. The first convolution layer halves the number of channels of the high-level feature block. To adapt to the low-level feature block, the transformed high-level feature block is then upsampled, which doubles its spatial size. The concatenation operator then builds one larger feature block. Finally, the combinator consists of two cascaded convolution operators, each followed by a batch normalization layer and a ReLU layer. After the above processing, we obtain four different feature blocks $\{D_i \mid i = 1, \dots, 4\}$, as shown in Figure 2.

The cross-layer feature block fusion unit is formulated as follows:

$D_i = \mathrm{Comb}\big(\mathrm{Cat}\big(\mathrm{Up}(\mathrm{Conv}(D_{i+1})), T_i\big)\big), \quad i = 1, \dots, 4$,  (2)

where $\mathrm{Comb}(\cdot)$ and $\mathrm{Up}(\cdot)$ represent the combined operation mentioned above and the upsampling operation, respectively, and $D_5 = T_5$. The output of the decoder, $S$, is formulated as follows:

$S = \mathcal{M}(D_1)$,  (3)

where $\mathcal{M}(\cdot)$ denotes the multiscale feature extraction operation defined in equation (1).
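The following PyTorch sketch illustrates one fusion step of equation (2); the channel arguments and the class name are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerFusion(nn.Module):
    """Fuse a high-level decoder block D_{i+1} with the transformed encoder
    block T_i: 1x1 conv (halve channels) -> 2x upsampling -> concatenation ->
    combinator (two Conv-BN-ReLU stages)."""
    def __init__(self, high_channels, low_channels):
        super().__init__()
        self.reduce = nn.Conv2d(high_channels, high_channels // 2, kernel_size=1)
        fused = high_channels // 2 + low_channels
        self.combinator = nn.Sequential(
            nn.Conv2d(fused, low_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(low_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(low_channels, low_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(low_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, d_high, t_low):
        h = self.reduce(d_high)                                # halve channels
        h = F.interpolate(h, scale_factor=2.0, mode='bilinear',
                          align_corners=False)                 # double spatial size
        return self.combinator(torch.cat([h, t_low], dim=1))   # concat + combinator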

3.4. Centerness Prediction Module

Object location information can be very useful for improving the image classification task but is seldom used in the saliency detection task.

In this paper, we introduce centerness into saliency detection. We define centerness as the ratio between EO and EC, as shown in Figure 5. The location of node O represents the center of the ground truth or saliency map, and it can be calculated as follows:

$x_O = \dfrac{\sum_{(x, y)} x \cdot g(x, y)}{\sum_{(x, y)} g(x, y)}, \qquad y_O = \dfrac{\sum_{(x, y)} y \cdot g(x, y)}{\sum_{(x, y)} g(x, y)}$,  (4)

where $g(x, y)$ denotes the gray value of the ground truth or the predicted saliency map at pixel $(x, y)$.
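A small PyTorch helper for the centroid in equation (4) is sketched below; the function name is ours, and the mask is assumed to be a single-channel map with values in [0, 1].

import torch

def mask_center(mask):
    """Intensity-weighted centroid (node O) of a ground-truth mask or
    predicted saliency map of shape (H, W)."""
    mask = mask.float()
    h, w = mask.shape
    ys = torch.arange(h, dtype=mask.dtype, device=mask.device).view(h, 1)
    xs = torch.arange(w, dtype=mask.dtype, device=mask.device).view(1, w)
    total = mask.sum().clamp(min=1e-8)   # guard against an empty mask
    x_o = (mask * xs).sum() / total      # weighted column coordinate
    y_o = (mask * ys).sum() / total      # weighted row coordinate
    return x_o, y_o

# Example with a random toy mask:
# x_o, y_o = mask_center((torch.rand(320, 320) > 0.99).float())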

The centerness prediction module comprises two main components: (i) FCL, a stack of fully connected layers that maps the high-dimensional feature space to a 1-dimensional output, and (ii) a logistic function, which applies a sigmoid to restrict that output to the range 0–1. Figure 6 shows the details of the FCL: it contains three fully connected layers, each with a different number of neurons.
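A minimal PyTorch sketch of such a centerness head is shown below; the global pooling step, the hidden-layer sizes, and the class name are assumptions for illustration.

import torch
import torch.nn as nn

class CenternessHead(nn.Module):
    """FCL + sigmoid: fully connected layers mapping a high-dimensional
    feature block to a single centerness value in (0, 1)."""
    def __init__(self, in_channels, hidden=(256, 64)):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # collapse spatial dimensions (assumed)
        self.fcl = nn.Sequential(
            nn.Linear(in_channels, hidden[0]),
            nn.ReLU(inplace=True),
            nn.Linear(hidden[0], hidden[1]),
            nn.ReLU(inplace=True),
            nn.Linear(hidden[1], 1),
        )

    def forward(self, feat):
        x = self.pool(feat).flatten(1)             # (B, C, H, W) -> (B, C)
        return torch.sigmoid(self.fcl(x))          # centerness in (0, 1)

# Example: predict centerness from the deepest feature block
# c = CenternessHead(2048)(torch.randn(2, 2048, 10, 10))   # shape (2, 1)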

3.5. Deep Supervision

A well-designed loss function plays an important role in training effective saliency models [26]. When the image contains a complex background, a deep network trained with BCE loss alone will probably generate poor saliency results. To generate high-quality saliency maps with clear boundaries, we propose a centerness-aware loss (CAL), which is defined as follows:

$\mathcal{L}_{cal} = \mathcal{L}_{bce} + \lambda \mathcal{L}_{loc} + \mathcal{L}_{iou}$,  (5)

where $\mathcal{L}_{bce}$, $\mathcal{L}_{loc}$, and $\mathcal{L}_{iou}$ denote the BCE loss [49], the location loss, and the IoU loss [17, 27], respectively. The hyperparameter $\lambda$ is set to 0.5 in this paper.

Binary cross entropy (BCE) is a widely used loss in saliency detection tasks, and it is defined as follows:

$\mathcal{L}_{bce} = -\dfrac{1}{N M}\displaystyle\sum_{k=1}^{N}\sum_{j=1}^{M}\big[g_j^{k}\log p_j^{k} + (1 - g_j^{k})\log(1 - p_j^{k})\big]$,  (6)

where $p_j^{k}$ and $g_j^{k}$ denote the prediction and the ground truth of pixel $j$ in the $k$-th image, $N$ is the batch size, and $M$ is the product of the height and width of a given image.

The position of salient objects is very important information, so we introduce it into our training loss; the location loss $\mathcal{L}_{loc}$ is defined as follows:

$\mathcal{L}_{loc} = \dfrac{1}{N}\displaystyle\sum_{k=1}^{N}\big(c^{k} - \hat{c}^{k}\big)^{2}$,  (7)

where $c^{k}$ denotes the centerness computed from the ground truth of the $k$-th image and $\hat{c}^{k}$ is the result of centerness prediction.

To uniformly highlight the whole salient region, we integrate the IoU loss [17, 27] into our training loss. It is defined as

$\mathcal{L}_{iou} = 1 - \dfrac{\sum_{j=1}^{M} p_j\, g_j}{\sum_{j=1}^{M}\big(p_j + g_j - p_j\, g_j\big)}$,  (8)

where $p_j$ is the predicted probability of pixel $j$ being a foreground object and $g_j$ is the ground truth of pixel $j$.
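The sketch below combines equations (5)–(8) in PyTorch; the squared-error form of the location term and the function names follow the assumptions stated above, so this is a sketch rather than the exact training code.

import torch
import torch.nn.functional as F

def iou_loss(pred, gt, eps=1e-6):
    """Soft IoU loss of equation (8) over a batch of maps, shape (B, 1, H, W)."""
    inter = (pred * gt).sum(dim=(1, 2, 3))
    union = (pred + gt - pred * gt).sum(dim=(1, 2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()

def centerness_aware_loss(pred_sal, gt_sal, pred_center, gt_center, lam=0.5):
    """CAL of equation (5): BCE + lam * location loss + IoU.
    pred_sal and pred_center are assumed to be sigmoid outputs in [0, 1]."""
    l_bce = F.binary_cross_entropy(pred_sal, gt_sal)   # equation (6)
    l_loc = F.mse_loss(pred_center, gt_center)         # equation (7), assumed L2 form
    l_iou = iou_loss(pred_sal, gt_sal)                 # equation (8)
    return l_bce + lam * l_loc + l_iou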

4. Experiments

4.1. Implementation Details

We train our saliency model on the DUTS-TR [22] dataset with 10553 images, following [3, 16]. For a fair comparison, we use ResNet and ResNeXt as backbone networks, respectively. For convenience, all training and testing images are resized to 320 × 320. Our saliency model is implemented in PyTorch. The parameters of the backbone networks are initialized with models pretrained on the image classification dataset, and all other parameters of M2Net are set by the default setting of PyTorch 1.2.0. The hyperparameters are set as follows: weight decay = 0.0005, momentum = 0.9, and an initial learning rate of 0.005 for the pretrained backbone networks [45, 46] and 0.05 for the remaining parts. We use warm-up and linear decay to dynamically adjust the learning rate. During the training stage, random flip, random contrast, random saturation, and random brightness are used as data augmentation to avoid overfitting. We apply a stochastic gradient descent algorithm to update all the parameters of the proposed M2Net. To ensure convergence, M2Net is trained for 32 epochs with a minibatch size of 15 on an NVIDIA RTX 2080 Ti GPU.
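Such a learning-rate schedule can be realized, for example, with a LambdaLR scheduler as sketched below; the warm-up length and the helper name are assumptions, while the base learning rates, momentum, and weight decay match the values listed above.

import torch

def warmup_linear_decay(optimizer, total_steps, warmup_steps):
    """Linear warm-up followed by linear decay to zero (warm-up length assumed)."""
    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Two parameter groups: 0.005 for the pretrained backbone, 0.05 for the rest
# optimizer = torch.optim.SGD(
#     [{"params": backbone.parameters(), "lr": 0.005},
#      {"params": rest.parameters(), "lr": 0.05}],
#     momentum=0.9, weight_decay=0.0005)
# scheduler = warmup_linear_decay(optimizer,
#                                 total_steps=32 * steps_per_epoch,
#                                 warmup_steps=steps_per_epoch)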

4.2. Datasets

The performance of M2Net is evaluated on five benchmark saliency datasets: ECSSD [50], PASCAL-S [51], DUTS [22], DUT-OMRON [30], and HKU-IS [34]. ECSSD [50] contains 1000 semantically meaningful images with pixel-accurate annotations. PASCAL-S [51] is composed of 850 challenging images carefully selected from the PASCAL VOC segmentation dataset. DUTS [22] is the largest salient object detection (SOD) dataset; it contains 10553 images for training and the remaining 5019 images for testing. DUT-OMRON [30] is composed of 5168 high-quality but challenging images, which contain one or more salient objects against complex backgrounds. HKU-IS [34] contains 4447 challenging images, many of which have multiple disconnected salient objects.

4.3. Evaluation Criteria

To quantitatively evaluate the performance, four measures are adopted in our experiments: the Precision-Recall (PR) curve, F-measure, Mean Absolute Error (MAE), and S-measure.

The Precision-Recall curve is a widely used graphical tool to evaluate the robustness of saliency maps. It shows the relation between precision and recall obtained by thresholding the final saliency maps at every value from 0 to 255. The larger the area under the PR curve, the better the performance.
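A straightforward NumPy sketch of this thresholding procedure is given below; the function name is ours, and the saliency maps are assumed to be 8-bit grayscale arrays.

import numpy as np

def pr_curve(sal_maps, gt_masks, num_thresholds=256):
    """Dataset-averaged precision/recall at thresholds 0..255 for pairs of
    8-bit saliency maps and binary ground-truth masks."""
    precisions = np.zeros(num_thresholds)
    recalls = np.zeros(num_thresholds)
    for sal, gt in zip(sal_maps, gt_masks):
        gt = gt > 0
        for t in range(num_thresholds):
            pred = sal >= t                     # binarize at threshold t
            tp = np.logical_and(pred, gt).sum()
            precisions[t] += tp / (pred.sum() + 1e-8)
            recalls[t] += tp / (gt.sum() + 1e-8)
    n = len(sal_maps)
    return precisions / n, recalls / n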

The F-measure is a weighted combination of precision and recall, which is defined as

$F_{\beta} = \dfrac{(1 + \beta^{2}) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^{2} \cdot \mathrm{Precision} + \mathrm{Recall}}$,  (9)

where $\beta^{2}$ is set to 0.3, as done in most recent state-of-the-art saliency methods [3, 16, 19, 52–56], to emphasize precision. The mean F-measure ($F_m$) on each benchmark dataset is reported in this paper.

Mean Absolute Error (MAE) measures the pixelwise average absolute difference between the saliency map and its corresponding ground truth. The MAE score is defined as follows:

$\mathrm{MAE} = \dfrac{1}{H \times W}\displaystyle\sum_{i=1}^{H}\sum_{j=1}^{W}\big|x(i, j) - y(i, j)\big|$,  (10)

where $x$ and $y$ are the prediction result and the ground truth, respectively, and $H \times W$ indicates the total number of image pixels.
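The F-measure and MAE computations in equations (9) and (10) can be sketched in NumPy as follows; the function names are illustrative, and the inputs are assumed to be normalized to [0, 1] (8-bit inputs are rescaled).

import numpy as np

def f_measure(precision, recall, beta_sq=0.3):
    """Equation (9) with beta^2 = 0.3; accepts scalars or per-threshold arrays."""
    return ((1 + beta_sq) * precision * recall
            / (beta_sq * precision + recall + 1e-8))

def mae(sal_map, gt_mask):
    """Equation (10): mean absolute error between a saliency map and its ground truth."""
    sal = sal_map.astype(np.float64)
    gt = gt_mask.astype(np.float64)
    if sal.max() > 1.0:                 # rescale 8-bit maps to [0, 1]
        sal /= 255.0
    if gt.max() > 1.0:
        gt /= 255.0
    return np.abs(sal - gt).mean()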

The S-measure is more sensitive to the foreground structural information of saliency maps, which is closer to the human visual system. It considers both the object-aware structural similarity $S_o$ and the region-aware structural similarity $S_r$:

$S = \alpha S_{o} + (1 - \alpha) S_{r}$,  (11)

where $\alpha$ is set to 0.5 as suggested in [3, 20, 53, 57].

4.4. Comparison with State of the Art

In this section, we compare our method with seventeen previous state-of-the-art saliency models: NLDF [58], Amulet [15], R3Net [59], RAS [60], DGRL [61], C2SNet [54], PiCANet [16], BMPM [43], BASNet [17], AFNet [62], SCRN [63], CPD [64], EGNet [20], PoolNet [18], F3Net [19], MINet [53], and GateNet [3]. Note that the saliency maps of the above methods are either produced by running the released source code or precomputed by the authors.

4.4.1. Quantitative Evaluation

To fully compare the proposed saliency model with these state-of-the-art methods, detailed experimental results in terms of three metrics are listed in Table 2. For a thorough comparison, we use ResNet-50 and ResNeXt-101 as backbone networks for training the proposed M2Net. Specifically, our method achieves a clear improvement in terms of Fm over the most recent saliency model GateNet [3] on the challenging DUTS-TE [22] (0.857 versus 0.816), DUT-OMRON [30] (0.791 versus 0.762), and PASCAL-S [51] (0.858 versus 0.827) datasets. In addition, we show the standard PR curves in Figures 7 and 8. Our method achieves the best performance on the ECSSD, HKU-IS, PASCAL-S, DUT-OMRON, and DUTS-TE datasets.

4.4.2. Qualitative Evaluation

Some prediction results of the proposed M2Net and ten state-of-the-art saliency methods are shown in Figure 9. We observe that M2Net not only uniformly highlights the correct salient object regions but also effectively suppresses background clutter. It excels in various challenging scenarios, including small objects (rows 2, 8, and 9), cluttered backgrounds (rows 1, 3, and 5), low contrast between the salient object and the background region (rows 4 and 7), and objects touching the image boundary (row 6). Compared with other state-of-the-art methods, the object boundaries in our saliency maps are clearer and sharper. Most importantly, the proposed M2Net achieves these results without any postprocessing.

4.5. Ablation Analysis

Before analyzing the influence of each module, the hyperparameter λ in the CAL loss, which balances the different loss terms, must be determined. Table 3 lists the Fm, MAE, and Sm scores for four discrete values of λ. As can be seen, these indicators reach their best values when λ equals 0.50. Next, we conduct a detailed analysis to investigate the importance of the different components of the proposed M2Net.

4.5.1. Effectiveness of Backbones

In saliency detection, VGG [65], ResNet [45], and ResNeXt [46] are widely used as pretrained backbones. Table 2 shows that ResNet-50 and ResNeXt-101 achieve better performance than VGG in most cases. To further examine the two backbones, we select two widely used datasets for evaluation; the comparison results are shown in Table 4. From Table 4, we can see that M2Net with ResNeXt-101 [46] obtains better performance than with ResNet-50 [45].

4.5.2. Effectiveness of Components

We take an FPN-like network as our baseline. We then add the multiscale feature extraction unit to the baseline network and evaluate its performance. The comparison results in Table 4 show that the multiscale feature extraction unit achieves a significant improvement over the FPN-like baseline. We also quantitatively evaluate the effect of the centerness-aware loss in Table 4: compared to "+B," the proposed M2Net with the CAL achieves consistent performance gains on all three metrics. A visual comparison of saliency maps generated with BCE loss and with our centerness-aware loss is shown in Figures 10(c), 10(d), and 10(e). To fully compare CAL with three other losses, namely, FLoss, PPA loss, and IoU loss, detailed experimental results are listed in Table 5. As can be seen, the proposed CAL obtains the best results on two challenging saliency datasets.

5. Conclusion

In this paper, we proposed a multiscale and multitask deep network with a centerness-aware loss for salient object detection. The proposed M2Net solves saliency prediction and centerness prediction simultaneously. Our model consists of a bottom-up encoder module, a top-down decoder module, and a centerness prediction module. In the encoder, we use a pretrained backbone to extract multilevel saliency features from preprocessed images. In the decoder module, we use a skip/concatenation connection scheme and propose a novel multiscale feature extraction method to generate the final saliency maps. Lastly, we design a centerness prediction module, which helps to uniformly highlight the entire salient object. Extensive experimental results on five widely used datasets demonstrate that our method outperforms 17 state-of-the-art approaches under different evaluation metrics.

Data Availability

The labeled datasets used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Natural Science Foundation of Shandong Province (nos. ZR2019PF019 and ZR2020QF044).