Abstract

Existing RGB + depth (RGB-D) salient object detection methods mainly focus on better integrating the cross-modal features of RGB images and depth maps. Many methods use the same feature interaction module to fuse RGB images and depth maps, which ignores the inherent properties of the different modalities. In contrast to previous methods, this paper proposes a novel RGB-D salient object detection method that uses a depth-feature guide cross-modal fusion module based on the properties of RGB images and depth maps. First, a depth-feature guide cross-modal fusion module is designed using coordinate attention to exploit the simple data representation of depth maps effectively. Second, a dense decoder guidance module is proposed to recover the spatial details of salient objects. Furthermore, a context-aware content module is proposed to extract rich context information, which enables more complete prediction of multiple objects. Experimental results on six public benchmark datasets demonstrate that, compared with 15 mainstream convolutional neural network detection methods, the saliency maps produced by the proposed model have more continuous edge contours and clearer spatial structure details, and competitive results are achieved on four quantitative evaluation metrics. Furthermore, the effectiveness of the three proposed modules is verified through ablation experiments.

1. Introduction

Salient object detection (SOD) [1–5] aims to locate the most attractive objects in natural scene images and has been widely used in various computer vision tasks, such as image resolution [6], object detection [7], learning-based compression [8], and image quality assessment [9]. In recent years, benefiting from the rapid development of convolutional neural networks (CNNs), SOD has achieved great success. However, in challenging scenarios, such as when the contrast between the object and the background is low or when there are multiple objects in the image, many models have difficulty predicting the objects clearly and completely. Microsoft Kinect sensors and Huawei mobile phones are widely used tools that can easily capture depth maps. Compared with previous models that use only RGB images for training, models that take depth maps as auxiliary information can achieve improved detection performance, which has led to the development of various RGB + depth (RGB-D) SOD algorithms [10–14]. However, because RGB images and depth maps contain different modal information, it remains challenging to fuse cross-modal features effectively, which significantly affects the robustness of the model. Although many previous methods [15–18] have explored cross-modal feature fusion, their performance remains limited by (1) the effects of the RGB image background and (2) the effects of illumination on the RGB image. Regarding (1), RGB images provide rich color information, but detection accuracy can be seriously disturbed by this color information. For example, as illustrated in the first row of Figure 1, the similarity between the salient object color and the background color causes models to generate incorrect detection results. The detected object (a chair) is extremely similar in color to the background: only a small part of the chair is detected by 3DCNN [3] and LDCM [15], whereas the rest of the chair is absorbed into the background. Regarding (2), as shown in the second row of Figure 1, because the image is affected by illumination, the background area has high brightness whereas the object area has low brightness, so 3DCNN and LDCM misjudge the background area as the object area, and the detected area is blurred. Furthermore, although many methods predict a complete object area, such as the carts generated by 3DCNN and LDCM in the third row of Figure 1, the edge spatial structure details of the salient objects are lost through upsampling convolution. Although encoder feature maps are usually introduced into the decoder through skip connections to recover the spatial details of salient objects, and the ground truth map is used to supervise every layer of the decoder stage, it remains difficult to generate complete detailed features. When there are multiple objects in the image, as shown in the fourth row of Figure 1, the saliency maps predicted by both 3DCNN and LDCM miss objects and retain only a single one. Taken together, these observations show that the detection performance of a model is affected by the color and illumination of the RGB image, the edge spatial structure details, and the number of salient objects.

As a remedy for the aforementioned problems, an RGB-D SOD network is proposed that uses a depth-feature guide cross-modal fusion module with coordinate attention filtering. First, coordinate attention is used to filter invalid information from the depth map and strengthen the expressive ability of salient objects, which can guide the model to learn higher-level semantic features. It can also better locate the positions of salient objects while significantly suppressing interference from the background information of RGB images. Second, a dense decoder guidance (DDG) module is proposed, which not only provides more comprehensive semantic guidance for the encoder features introduced through skip connections but also compensates for the loss of high-level semantic information in the decoder stages, thereby better recovering the structural details of salient objects. Finally, to handle variation in the number of salient objects, a context-aware content (CAC) module is designed that aims to explore rich contextual feature information effectively and efficiently and to extract the most discriminative salient features. Three encoder-decoder U-Net streams are jointly trained in an end-to-end manner.

The main contributions of this study can be summarized as follows:
(i) To suppress the effects of RGB image color and illumination on model detection, a coordinate attention filtering depth-feature guide cross-modal fusion (CFD) module is proposed that uses coordinate attention filtering to enhance the feature representation of salient objects in the depth map, such that the generated attention map guides the model to highlight the locations and contour features of objects more prominently.
(ii) A dense decoder guidance module is designed to compensate effectively for the loss of high-level semantic features in the decoding process, thereby better restoring the edge structural details of salient objects.
(iii) A context-aware content module is designed that can effectively capture rich contextual feature information, which improves the model's ability to detect multiobject scenes.
(iv) Comprehensive experiments on six benchmark datasets with four evaluation metrics demonstrate that, compared with 15 other models, the proposed model has superior detection performance and generates saliency maps with better visual quality.

2. Related Work

2.1. Salient Object Detection

With the development and popular application of deep learning, an increasing number of studies [19–22] have utilized deep learning to detect salient objects. Zhao et al. [19] developed a lightweight and real-time model that directly uses the depth map to guide early and middle fusion between an RGB image and the depth map. Sun et al. [20] introduced a depth-sensitive attention module to enhance RGB features effectively, which can utilize the depth geometry feature to reduce background distraction.

Multilevel feature aggregation and cross-modal feature fusion strategies [23–26] are widely used in models to improve detection performance. Wang et al. [23] proposed cross-modality consistency of correlation for RGB-D SOD. Zhang et al. [24] designed a cross-modality discrete interaction network that includes an RGB-induced detail enhancement module and depth-induced semantic enhancement of different layers. Zhou et al. [25] proposed a crossflow and cross-scale adaptive fusion network to detect salient objects in RGB-D images. Other methods have also achieved good results, such as uncertainty learning [27], collaborative learning [28], saliency prior [21], graph neural networks [29], edge detection [30], and transformers [31, 32].

2.2. Attention Mechanisms

Attention mechanisms have been widely used in computer vision tasks, such as visual tracking [33], image classification [34], video question answering [35], person reidentification [36], and image segmentation [37]. Zhang et al. [38] developed a selection attention mechanism to fuse multimodal information. Chen et al. [4] introduced the channel-wise attention mechanism to achieve a selectively cross-modal cross-level combination. Because the attention mechanism has a strong feature selection ability, its application is well suited to RGB-D SOD [39–41].

Previous methods have directly added or multiplied RGB and depth features when fusing them, and even elaborate fusion modules treat RGB features and depth features equally, with only the fused features being employed layer by layer in the decoding stage. In this paper, inspired by the above methods, the inherent characteristics of RGB images and depth maps are reconsidered; it is argued that the advantages and disadvantages of the inherent characteristics of each modality should be taken into account in cross-modal feature interaction rather than treating the modalities equally. According to our observations, SOD performance is greatly affected by the background information in the collected RGB images; background noise extracted from the RGB features by the network therefore reduces performance. In contrast, the objects in the depth map are not disturbed by color. Therefore, a CFD module is proposed that uses coordinate attention filtering so that the depth features can effectively suppress the interference of background information, which improves the expressive ability of salient objects. In addition, a three-branch decoder structure is adopted in this paper to preserve the original RGB and depth features for decoding, achieving effective utilization of multimodal features and improving the detection accuracy of the model.

3. Proposed Method

The proposed RGB-D SOD network is shown in Figure 2. In the feature extraction stage, one ordinary convolution is used to reduce the image resolution quickly, and the four residual blocks of the ResNet-50 architecture are used as the subsequent feature extractor; two identical backbone branches extract the features of the RGB image and the depth map. These extracted features are denoted as $f_i^{r}$ and $f_i^{d}$, respectively, where $i \in \{1, 2, \ldots, 5\}$ represents the level of the feature layers. At the low levels (first and second layers), the RGB and depth feature maps are added to generate the fusion branch feature map. At the higher levels (third, fourth, and fifth layers), the CFD module is embedded, and the fusion branch feature map is represented by $f_i^{rd}$. For the decoder stage, the DDG and CAC modules are designed. Finally, the RGB, depth, and fusion branch streams are designed as three encoder-decoder architectures with the same structure for joint end-to-end training. The final saliency map is generated by the fusion branch stream.
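To make the encoder layout above concrete, the following PyTorch sketch shows one way the two ResNet-50 streams and the level-dependent fusion (addition at the first two levels, CFD at the last three) could be wired together. The class and variable names, the use of the ResNet stem as the first stage, and the replication of the depth map to three channels are illustrative assumptions rather than the authors' exact implementation.

```python
import torch.nn as nn
import torchvision.models as models

class TwoStreamEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        def make_backbone():
            net = models.resnet50(weights=None)  # pretrained weights could be loaded here
            stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
            return nn.ModuleList([stem, net.layer1, net.layer2, net.layer3, net.layer4])
        self.rgb_layers = make_backbone()    # five stages for the RGB stream
        self.depth_layers = make_backbone()  # five stages for the depth stream

    def forward(self, rgb, depth, cfd_modules=None):
        # `depth` is assumed to be replicated to three channels upstream;
        # `cfd_modules` is an optional list of three fusion modules for stages 3-5.
        feats_r, feats_d, feats_rd = [], [], []
        fr, fd = rgb, depth
        for i in range(5):
            fr = self.rgb_layers[i](fr)
            fd = self.depth_layers[i](fd)
            if i < 2 or cfd_modules is None:   # low levels: element-wise addition
                frd = fr + fd
            else:                              # high levels: depth-guided fusion (CFD)
                frd = cfd_modules[i - 2](fr, fd)
            feats_r.append(fr)
            feats_d.append(fd)
            feats_rd.append(frd)
        return feats_r, feats_d, feats_rd
```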

3.1. Depth-Feature Guide Cross-Modal Fusion Module with Coordinate Attention Filtering

RGB images contain rich color and appearance information. Compared with RGB images, depth maps discard complex color information and intuitively describe the shapes and positions of objects, which provides a more direct and effective expression of object features. At the low levels of the encoder, the model learns the detailed features of the object, including clear boundaries, texture, and spatial structure, but these also contain significant background noise. At the high levels of the encoder, the features learned by the model contain more semantic information. The high-level semantic features of the depth map are relatively simple; therefore, they can be used to guide the fusion of the cross-modal features. However, for some images, the collected depth maps are of low quality. Therefore, a CFD module is designed and embedded in the high levels of the network to make better use of the depth map features. The noise in the depth map is filtered by coordinate attention, which largely suppresses the nonsalient region features in the RGB image, thereby helping the model locate and identify the salient regions more accurately. The structure is shown in Figure 3.

Specifically, the RGB feature map $f_i^{r}$ and the depth feature map $f_i^{d}$ are each fed into a convolutional layer with a kernel size of 3 × 3 and a stride of 1, and the outputs are aggregated to generate the feature map $f_i^{c}$ as follows:

$$f_i^{c} = \mathrm{Cov}(f_i^{r}) + \mathrm{Cov}(f_i^{d}),$$

where Cov represents the convolutional layer.

Coordinate attention is used to filter the noise of the depth map and thus utilize its feature information more effectively. The coordinate attention module is shown in Figure 4. Specifically, pooling kernels of size (H, 1) and (1, W) are used to encode each channel of the input depth feature along the horizontal and vertical coordinate directions, respectively, which correspond to X Avg Pool and Y Avg Pool. Thus, the output features of channel c at height h and width w can be expressed as follows:

$$z_c^{h}(h) = \frac{1}{W}\sum_{0 \le j < W} x_c(h, j), \qquad z_c^{w}(w) = \frac{1}{H}\sum_{0 \le i < H} x_c(i, w).$$

The aforementioned transformations aggregate features along two different directions. These two transformations enable the attention mechanism to capture long-range dependencies along one spatial direction while preserving precise location information along the other, which helps the network locate salient objects more accurately.

The resulting coordinate vectors are used to generate coordinate attention maps that encode a global receptive field and precise position information. The specific operations for generating the attention maps are described next.

First, the two feature vectors $z^{h}$ and $z^{w}$ are concatenated and passed through a 1 × 1 convolutional layer, and the result is then split into two separate feature maps $Z^{H}$ and $Z^{W}$ along the spatial dimension. Next, two 1 × 1 convolutions are used to transform $Z^{H}$ and $Z^{W}$ so that they have the same number of channels as the input depth feature. The two attention maps are generated using the sigmoid function, expressed as follows:

$$g^{h} = \sigma\big(\mathrm{Cov}_h(Z^{H})\big), \qquad g^{w} = \sigma\big(\mathrm{Cov}_w(Z^{W})\big),$$

where $\sigma$ denotes the sigmoid function.

Finally, the two attention maps are multiplied together and added to the aggregated feature map $f_i^{c}$ to obtain the enhanced feature map:

$$f_i^{rd} = f_i^{c} + g^{h} \otimes g^{w},$$

where $\otimes$ denotes element-wise multiplication with broadcasting.
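A minimal PyTorch sketch of the coordinate attention filtering described above (directional average pooling, concatenation, 1 × 1 convolutions, and sigmoid) is given below, following the original coordinate attention design; the channel-reduction ratio and the batch-normalization layer are assumed details not specified in the text.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)   # shared 1x1 conv on [z_h; z_w]
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # restores the channel count
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.size()
        z_h = x.mean(dim=3, keepdim=True)                      # X Avg Pool: (b, c, h, 1)
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # Y Avg Pool: (b, c, w, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([z_h, z_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)               # Z^H and Z^W
        g_h = torch.sigmoid(self.conv_h(y_h))                  # (b, c, h, 1)
        g_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (b, c, 1, w)
        return g_h, g_w                                        # the two attention maps
```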

The CFD module can not only suppress the effects of RGB image color and illumination but also effectively capture the relationship among feature map channels, which guides the effective information interaction among cross-modal features to improve SOD performance.
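Building on the coordinate attention sketch above, the fusion step of the CFD module could be sketched as follows. The residual addition of the attention product to the aggregated feature follows the prose description literally; the exact formulation in the original implementation may differ.

```python
import torch.nn as nn

class CFD(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv_r = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.conv_d = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.coord_att = CoordinateAttention(channels)  # from the sketch above

    def forward(self, f_r, f_d):
        f_c = self.conv_r(f_r) + self.conv_d(f_d)  # aggregate the two modalities
        g_h, g_w = self.coord_att(f_d)             # attention maps filtered from the depth feature
        return f_c + g_h * g_w                     # depth-guided enhancement (one reading of the text)
```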

3.2. Context-Aware Content Module

In the decoder stage, existing methods directly use upsampling convolution to generate the final saliency map. However, in the multiobject case, the same convolutional layer cannot extract sufficiently distinguishable features, causing entire objects to be lost. Therefore, a CAC module is designed, which aims to explore rich contextual information effectively and efficiently and to handle variation in the number of salient objects more effectively.

The CAC module is shown in Figure 5. Four 3 × 3 depth-wise convolutions with dilation rates of 1, 3, 5, and 7 are used to enlarge the receptive field and comprehensively capture multiscale features, while the number of channels and the sizes of all feature maps are kept the same. Subsequently, the input feature map and the four branch feature maps are added to output more discriminative salient features:

$$F_{out} = F_M + \sum_{r \in \{1, 3, 5, 7\}} \mathrm{DCov}_r(F_M).$$

Here, DCov and r represent the depth-wise convolution and the dilation rate, respectively; $F_M$ represents the input feature map of the CAC module, and the output of the j-th CAC module is designated as $F_{CAC}^{j}$, with j ∈ {4, 3, 2, 1}.

In this way, the CAC module obtains multiscale information and provides a more powerful feature representation, which is beneficial for detecting multiple salient objects with high performance.
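The CAC module described above can be sketched as follows; the depth-wise dilated convolutions with rates 1, 3, 5, and 7 and the summation with the input come from the text, while the absence of normalization and activation layers is an assumption.

```python
import torch.nn as nn

class CAC(nn.Module):
    def __init__(self, channels, rates=(1, 3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r,
                      groups=channels)          # depth-wise: one filter per channel
            for r in rates
        ])

    def forward(self, f_m):
        out = f_m                               # keep the input feature map
        for branch in self.branches:
            out = out + branch(f_m)             # add each dilated branch output
        return out
```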

3.3. Dense Decoder Guidance Module

In the feature encoder stage, the intrinsic properties of the depth map are used to guide the learning of cross-modal interaction features, whereas the decoder is committed to learning features related to salient regions and predicting saliency maps of the same size as the ground truth map. Introducing the encoder features into the decoder stage through skip connections is common in SOD models, and applying an attention module between the encoder and the decoder is also a popular methodology. However, these methods only establish relationships between encoder and decoder features of the same size, ignoring the effects of features at different levels. Because high-level features provide rich semantic information that can offer semantic guidance for each decoder layer and compensate for the loss of semantic information in layer-by-layer upsampling, a DDG module is designed to enhance and refine the saliency maps generated by each layer, which better restores the edge structural details of the salient objects.

The DDG module is described by taking the RGB branch stream as an example (the other two branch streams adopt the same strategy). First, the encoder feature map $f_i^{r}$ is fed into a 3 × 3 convolution, which outputs a feature map with 256 channels. Similarly, each deeper decoder feature map is adjusted by a convolution operation and upsampling interpolation to obtain a feature map with the same size and number of channels as the encoder feature map. Finally, the decoder feature map of each deeper layer is multiplied by the encoder feature map, and the resulting maps are concatenated and sent to the CAC module. The entire process can be formulated as follows:

$$D_i = \mathrm{CAC}\Big(\mathrm{Cat}\big(\mathrm{Cov}(f_i^{r}) \otimes \mathrm{up}_2(D_{i+1}),\ \mathrm{Cov}(f_i^{r}) \otimes \mathrm{up}_4(D_{i+2}),\ \ldots\big)\Big),$$

where up$_n$() represents bilinear interpolation and the subscript numbers represent the upsampling factors, Cat denotes channel-wise concatenation, and $D_i$ denotes the decoder feature map of the i-th layer.
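The following sketch illustrates one possible realization of the DDG step for a single branch: every deeper decoder feature is projected, upsampled, multiplied with the projected encoder feature, and the products are concatenated and passed to a CAC block. The fusion convolution that reduces the concatenated channels back to 256 is an assumed detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DDG(nn.Module):
    def __init__(self, enc_channels, num_deeper, mid=256):
        super().__init__()
        self.enc_proj = nn.Conv2d(enc_channels, mid, kernel_size=3, padding=1)
        self.dec_proj = nn.ModuleList(
            [nn.Conv2d(mid, mid, kernel_size=3, padding=1) for _ in range(num_deeper)])
        self.fuse = nn.Conv2d(mid * num_deeper, mid, kernel_size=3, padding=1)
        self.cac = CAC(mid)                      # from the sketch above

    def forward(self, enc_feat, deeper_dec_feats):
        e = self.enc_proj(enc_feat)              # encoder feature -> 256 channels
        guided = []
        for proj, d in zip(self.dec_proj, deeper_dec_feats):
            d = F.interpolate(proj(d), size=e.shape[2:],
                              mode='bilinear', align_corners=False)
            guided.append(e * d)                 # dense semantic guidance by multiplication
        out = self.fuse(torch.cat(guided, dim=1))
        return self.cac(out)
```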

3.4. Loss Function

The binary cross entropy (BCE) and intersection over union (IoU) loss functions are often used to optimize SOD models.

The BCE loss function can be expressed as follows:

$$\ell_{BCE} = -\sum_{i=1}^{H}\sum_{j=1}^{W}\big[G_{ij}\log P_{ij} + (1 - G_{ij})\log(1 - P_{ij})\big].$$

Moreover, the IoU loss function is defined as follows:

$$\ell_{IoU} = 1 - \frac{\sum_{i=1}^{H}\sum_{j=1}^{W} P_{ij} G_{ij}}{\sum_{i=1}^{H}\sum_{j=1}^{W}\big(P_{ij} + G_{ij} - P_{ij} G_{ij}\big)},$$

where H and W represent the height and width of the image, respectively, the subscripts i and j index the pixel coordinates, and P and G represent the predicted saliency map and the ground truth map, respectively.

BCE and IoU are combined to form the optimization loss function of the proposed model:

$$\ell = \ell_{BCE} + \ell_{IoU}.$$
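A compact sketch of the combined objective is given below; it assumes the predictions are raw logits and the ground truth is a binary map of the same shape.

```python
import torch
import torch.nn.functional as F

def bce_iou_loss(pred, gt):
    # pred: raw logits of shape (B, 1, H, W); gt: float ground-truth map in {0, 1}
    bce = F.binary_cross_entropy_with_logits(pred, gt)
    p = torch.sigmoid(pred)
    inter = (p * gt).sum(dim=(2, 3))
    union = (p + gt - p * gt).sum(dim=(2, 3))
    iou = 1.0 - (inter / (union + 1e-6)).mean()   # IoU loss as defined above
    return bce + iou
```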

The auxiliary loss function is used to optimize the model in the decoding stage and to prevent gradient vanishing during the training process. Specifically, a 3 × 3 convolutional layer is applied to the feature map of each layer in the decoder stage to convert the input feature map with 256 channels into a feature map with 1 channel. Simultaneously, the feature map is bilinearly interpolated to the same scale as the ground truth map, and the sigmoid function is used to normalize the generated saliency map.

Next, the loss functions of the three branch streams are as follows:

$$\ell_{RGB} = \sum_{i=1}^{4} \ell\big(\mathrm{up}(F_{CAC,i}^{r}), G\big), \quad \ell_{Depth} = \sum_{i=1}^{4} \ell\big(\mathrm{up}(F_{CAC,i}^{d}), G\big), \quad \ell_{Rd} = \sum_{i=1}^{4} \ell\big(\mathrm{up}(F_{CAC,i}^{rd}), G\big).$$

Therefore, the total loss function of the model is as follows:

$$\ell_{total} = \ell_{RGB} + \ell_{Depth} + \ell_{Rd},$$

where $\ell_{RGB}$, $\ell_{Depth}$, and $\ell_{Rd}$ represent the loss functions of the RGB, depth, and fusion branch streams, respectively, and $F_{CAC,i}^{r}$, $F_{CAC,i}^{d}$, and $F_{CAC,i}^{rd}$ represent the CAC feature maps of the i-th layer in the RGB, depth, and fusion branch streams, respectively.
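Reusing bce_iou_loss from the sketch above, the per-layer auxiliary supervision and the three-branch total loss could be assembled as follows. Sharing one prediction head per branch across decoder layers is a simplification; the original model may use a separate head for each layer.

```python
import torch.nn as nn
import torch.nn.functional as F

class PredictionHead(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=3, padding=1)  # 256 -> 1 channel

    def forward(self, feat, gt_size):
        # Bilinearly upsample to the ground-truth resolution; the sigmoid is applied
        # inside bce_iou_loss during training and explicitly at inference time.
        return F.interpolate(self.conv(feat), size=gt_size,
                             mode='bilinear', align_corners=False)

def branch_loss(head, cac_feats, gt):
    # Sum the supervision over every decoder layer of one branch stream.
    return sum(bce_iou_loss(head(f, gt.shape[2:]), gt) for f in cac_feats)

def total_loss(heads, rgb_feats, depth_feats, fused_feats, gt):
    return (branch_loss(heads['rgb'], rgb_feats, gt)
            + branch_loss(heads['depth'], depth_feats, gt)
            + branch_loss(heads['rd'], fused_feats, gt))
```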

4. Experiments and Results

4.1. Dataset

To verify the effectiveness of the proposed model, experiments were performed on six public datasets: NJU2K [42], DES [43], NLPR [44], SSD [45], DUT-RGBD [46], and SIP [47]. NJU2K contains 1985 image pairs collected from the Internet and 3D movies. The DES (RGBD135) dataset contains 135 RGB-D image pairs captured in seven indoor scenes. The NLPR dataset consists of 1000 image pairs collected by Kinect from 11 different scenes, including more than 400 kinds of common objects. The SIP dataset contains 929 image pairs collected by smartphones with a camera resolution of 992 × 744. The SSD dataset contains 80 images extracted from three stereoscopic movies, for which the depth maps are generated by a depth estimation method. The DUT-RGBD dataset includes 1200 indoor and outdoor complex scenes, of which 800 and 400 image pairs are used for training and testing, respectively.

4.2. Evaluation Metrics

In this paper, the maximum F-measure ($F_\beta^{\max}$) [48], maximum E-measure ($E_\phi^{\max}$) [49], S-measure ($S_\alpha$) [50], and mean absolute error (M) [51] are used as evaluation metrics. The F-measure comprehensively considers the importance of precision and recall and is calculated as follows:

$$F_\beta = \frac{(1 + \beta^2) \times \mathrm{Precision} \times \mathrm{Recall}}{\beta^2 \times \mathrm{Precision} + \mathrm{Recall}},$$

where $\beta^2 = 0.3$; the maximum F-measure over all binarization thresholds is denoted as $F_\beta^{\max}$.

M is the average of the absolute errors between the predicted saliency map and the ground truth map:

$$M = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W}\big|P_{ij} - G_{ij}\big|.$$

$S_\alpha$ evaluates structural similarity by combining object-aware and region-aware similarity:

$$S_\alpha = \alpha \times S_o + (1 - \alpha) \times S_r,$$

where $S_o$ and $S_r$ represent the object-aware and region-aware structural similarities, respectively. Typically, α is set to 0.5. The larger the value of $S_\alpha$, the more similar the saliency and ground truth maps are in their spatial structures.

$E_\phi$ jointly captures local pixel-level and global image-level errors and is defined as follows:

$$E_\phi = \frac{1}{W \times H}\sum_{i=1}^{W}\sum_{j=1}^{H} \phi_{FM}(i, j),$$

where $\phi_{FM}$ denotes the enhanced alignment matrix.
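For reference, the two simplest metrics above can be computed as in the sketch below (NumPy, with the ground truth binarized at 0.5); the maximum F-measure reported in the tables sweeps the threshold over [0, 1], and the S-measure and E-measure involve structural and alignment terms that are omitted here.

```python
import numpy as np

def mae(pred, gt):
    # pred, gt: float arrays in [0, 1] with identical shape
    return np.abs(pred - gt).mean()

def f_measure(pred, gt, beta2=0.3, threshold=0.5):
    binary = pred >= threshold
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / ((gt > 0.5).sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```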

4.3. Experimental Details

The network in this study was implemented using the deep learning framework PyTorch, and the model was run on a machine with an NVIDIA RTX 3090 GPU. There are 1985 image pairs in the NJU2K dataset, of which 1485 and 500 are used for training and testing, respectively. There are 1000 image pairs in the NLPR dataset, of which 700 and 300 are used for training and testing, respectively. In particular, when the DUT-RGBD dataset is tested, an additional 800 DUT-RGBD image pairs are added to the training set. The Adam optimizer is used to optimize the model, and the batch size is set to 10. The initial learning rate is set to 0.0001 and updated every two iterations with a decay rate of 0.9. All training and testing images are resized to 352 × 352. To prevent overfitting, the optimal model is selected based on the validation dataset (800 image pairs); the best-performing model is obtained after 126 epochs of training, which takes approximately 10 h. The proposed model does not require any preprocessing or postprocessing.
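An illustrative training-loop skeleton matching the reported settings (Adam, initial learning rate 0.0001, decay rate 0.9, batch size 10, 352 × 352 inputs) is shown below; the model interface, the data loader, and the reading of "every two iterations" as a scheduler step every two epochs are assumptions.

```python
import torch

def train(model, train_loader, num_epochs=150, device='cuda'):
    # `model` is assumed to return the per-layer CAC features of the three branch
    # streams and to expose the prediction heads used by total_loss() above.
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # Decay the learning rate by 0.9; the text's "every two iterations" is read
    # here as a scheduler step every two epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.9)
    for epoch in range(num_epochs):
        for rgb, depth, gt in train_loader:        # inputs resized to 352 x 352
            rgb, depth, gt = rgb.to(device), depth.to(device), gt.to(device)
            optimizer.zero_grad()
            rgb_feats, depth_feats, fused_feats = model(rgb, depth)
            loss = total_loss(model.heads, rgb_feats, depth_feats, fused_feats, gt)
            loss.backward()
            optimizer.step()
        scheduler.step()
```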

4.4. Experimental Comparison

The proposed method was compared with 15 state-of-the-art CNN-based RGB-D models: SSP [52], EENet [53], LDCM [15], DSN [54], 3DCNN [3], CDNet [14], CMDI [24], CCAF [25], DSAM [20], BiANet [10], DQFM [55], DCF [56], JLDCF [57], and ICNet [5]. For a fair comparison, all saliency maps were obtained directly from the original authors or generated using the trained models provided by the original authors.

The results of the various SOD methods on the six datasets are listed in Table 1. According to the experimental results, the proposed method notably outperforms the other methods on multiple metrics. On the SIP and NLPR datasets, the proposed method is superior to the other methods in all metrics. For example, compared with the second-best model (SSP) on the SIP dataset, $F_\beta^{\max}$, $E_\phi^{\max}$, $S_\alpha$, and M are improved by 0.000, 0.004, 0.001, and 0.001, respectively. Compared with the 3DCNN model on the NLPR dataset, $F_\beta^{\max}$, $E_\phi^{\max}$, $S_\alpha$, and M improve by 0.002, 0.002, 0.004, and 0.002, respectively. On the DUT-RGBD dataset, the proposed method and 3DCNN both use an additional 800 image pairs for training and give the same M, but the proposed method outperforms 3DCNN in terms of $F_\beta^{\max}$ and $E_\phi^{\max}$. On the SSD dataset, except for M, the other three metrics of the proposed method are lower than those of the DSN method, which is caused by the low quality of the depth maps; because the proposed model relies on the quality of the depth map, its detection performance on the SSD dataset is relatively weak. Nevertheless, a comprehensive analysis of all datasets and evaluation metrics demonstrates that the proposed method outperforms the other methods overall.

The precision-recall (PR) and F-measure curves are illustrated in Figure 6. Note that the proposed model achieves both better precision and recall than the other models. Some visual saliency map results for the proposed and nine other methods are shown in Figure 7. Next, several specific challenging cases are summarized. When the background information is similar to the color of the object (first, fifth, and sixth rows), many models only detect a portion of salient objects; in contrast, the proposed model performs well and can detect salient objects clearly and completely. Additionally, for scenes with extremely low brightness (seventh, eighth, and ninth rows), which is a very challenging situation, the shadow between the legs of the person in the eighth row is not detected by the other methods, but the proposed method can detect the complete object in low-light scenes. This finding demonstrates that the CFD module can use depth features to differentiate the object region clearly from a similar background, whereas the objects detected by other methods are submerged into the background.

Low-contrast and multiobject scenarios are shown in the bottom three rows, in which the other methods miss objects. For example, there are two objects in the image in the penultimate row, but many methods detect only one person, whereas the proposed method detects both objects completely. This finding shows that the CAC module can effectively capture rich contextual feature information and improve detection performance in multiobject scenarios. The displayed visualization results also show that the saliency maps generated by the proposed method have a finer spatial structure, which indicates that the DDG module effectively makes the salient objects more uniform and clearer. On the whole, the objects detected by the proposed method are more complete, their texture is clearer, and their boundary contours are more prominent. The proposed model gives better results visually, and the generated saliency maps are closer to the ground truth maps.

4.5. Ablation Experiment

Ablation experiments were mainly conducted to prove the effectiveness of each module, and the experimental results on the NLPR and SIP datasets are listed in Table 2.

4.6. Effectiveness of CFD

In the feature encoder stage, an add operation is used instead of the CFD module to combine the RGB and depth modalities. Specifically, for the feature maps of the two modalities, $f_i^{r}$ and $f_i^{d}$, the combined feature map is computed as $f_i^{rd} = f_i^{r} + f_i^{d}$; this variant is denoted as w/o CFD in Table 2. Considering the experimental results on the SIP and NLPR datasets, the proposed CFD reduces M by 0.002 and improves $S_\alpha$ by 0.003 and 0.004, respectively. This finding shows that model detection performance is improved by using the CFD module instead of simply adding the feature maps. Some visual results are shown in Figure 8. Without the CFD module, the model incorrectly predicts background information as salient objects to varying degrees under illumination effects (first row), when the salient objects are consistent with the background (second row), and in complex backgrounds (third row). With the CFD module, the model accurately predicts the salient objects, effectively suppressing the influence of background information.

4.7. Effectiveness of DDG

In addition, in the feature decoder stage, the DDG module is removed and, as in the standard U-Net design, only the encoder and decoder feature maps of the same scale are concatenated; this variant is denoted as w/o DDG in Table 2. With the DDG module, improvements of 0.004 and 0.003 are obtained on the SIP and NLPR datasets, respectively. The visualization results are shown in Figure 9. Considering the saliency maps, note that without the help of the DDG module, although the salient object can be accurately detected, its spatial structure is not sufficiently clear. With the help of the DDG module, the model generates clearer salient objects with more detailed spatial structures.

4.8. Effectiveness of CAC

To verify the effectiveness of the CAC module, it is replaced with a 3 × 3 convolutional layer, which is denoted as w/o CAC in Table 2. On the SIP dataset, $F_\beta^{\max}$, $E_\phi^{\max}$, and $S_\alpha$ increase by 0.004, 0.001, and 0.003, respectively; moreover, M decreases by 0.001. The four metrics also show improvements of varying degrees on the NLPR dataset. The visual saliency maps for comparison are shown in Figure 10. For multiobject scenes, the saliency maps generated by the model without the CAC module either miss objects or produce objects that are considerably blurred and incomplete. In contrast, the salient objects generated by the proposed method are more complete, which shows that the CAC module can effectively explore rich contextual information.

4.9. Effectiveness of the Three Branch Streams

The effectiveness of jointly training the three branch streams is also verified in the decoder stage. First, the RGB and depth branch streams are removed from the decoder, keeping only the fusion branch stream for training; this variant is denoted as w/o RD. The experimental results show that the detection performance is considerably degraded when only the fusion branch stream is used. For example, on the SIP dataset, compared with the full method, $F_\beta^{\max}$ and $S_\alpha$ are reduced by 0.011 and 0.020, respectively. Second, the RGB and depth branch streams are removed separately, leaving only the remaining two branch streams; these variants are denoted as w/o R and w/o D, respectively. Note that whether the RGB or the depth branch stream is removed, the detection metrics are lower than those of the full model, which indicates that jointly training the three branch streams produces the best results.

4.10. Effectiveness of Our Model on the Three Datasets

Additionally, image pairs were specifically collected for low-illumination, complex-background, and multiobject scenes. There are 122 image pairs for low-light scenes, all collected from the SIP dataset, which form the low-illumination (LI) dataset. A total of 255 image pairs with complex backgrounds were collected from the NLPR dataset, forming the complex background (CB) dataset. The multiobject (MO) dataset contains 38 and 327 image pairs collected from the NLPR and SIP datasets, respectively. The experimental results are listed in Table 3. Compared with the other methods, the proposed method shows better detection performance in these three scenarios and is far ahead of the other methods in terms of M, which further verifies the effectiveness of each proposed module. All models find it more difficult to detect salient objects effectively in multiobject scenes, which confirms the need to improve multiobject detection performance in RGB-D SOD.

4.11. Failure Cases and Analyses

As mentioned above, the results of the quantitative and qualitative evaluations demonstrate the superiority and effectiveness of the proposed method. However, the proposed method still has limitations in some cases. Some failure cases of the saliency maps are shown in Figure 11. It can be seen that the quality of these depth maps is very low, which not only makes it difficult to characterize the salient objects but also introduces considerable noise. Although the object locations are correctly predicted in the first and third rows, redundant and erroneous object regions are generated because of the noise in the depth maps. As can be seen from the second row, the locations of the salient objects in the RGB image are not obvious, and the depth map cannot provide effective saliency features, which leads the model to misclassify the prominent background area as the salient area. In summary, the proposed method does not generate objects effectively when the depth maps are of low quality. Because the attention map generated from the depth map is used in the feature encoder stage to guide the generation of cross-modal features, low-quality depth maps can interfere with the generation of valid cross-modal saliency features and cause the model to produce incorrect object regions. To address this problem, a depth map quality score could be used to weight the contribution of the depth map in the model, and detection performance could be further improved by preprocessing.

5. Conclusion

In this paper, a novel depth-feature guide cross-modal fusion method for RGB-D SOD is proposed. Unlike most previous works, which focus on learning to fuse cross-modal features, the proposed model exploits the inherent simplicity of depth maps to guide the learning of shared cross-modal information and improve detection performance. In addition, the proposed DDG module effectively recovers the spatial structural details of salient objects, and the CAC module achieves effective multiobject detection by extracting rich contextual information. Quantitative and qualitative evaluations on six challenging benchmark datasets demonstrate that the proposed model outperforms existing RGB-D SOD methods.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request. All images used in this study are publicly available and have been approved by the publisher, and the datasets are available at https://github.com/lartpang/awesome-segmentation-saliency-dataset.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Authors’ Contributions

Lingbing Meng wrote the original draft and was involved in the methodology. Mengya Yuan wrote the original draft. Xuehan Shi wrote the review, reviewed the manuscript, and acquired funding. Qingqing Liu wrote the review, reviewed the manuscript, acquired funding, and investigated the study. Le Zhang wrote the review and reviewed the manuscript. Jinhua Wu acquired funding, was responsible for software, and validated the study. Ping Dai wrote the review and reviewed the manuscript. Fei Cheng acquired funding, curated the data, and supervised the study.

Acknowledgments

This work was supported by the General Project of the Natural Science Foundation of Anhui Province, China (2008085MF201); the General Project of Anhui Philosophy and Social Sciences Planning, China (AHSKY2021D142); the Natural Science Research Project of Anhui University (KJ2020a0824, 2022AH051887, and 2022AH051894); the Advanced Talent Scientific Research Project of Anhui Institute of Information Technology (rckj2021A002); the Support Program for Outstanding Young Talents in Colleges and Universities (gxyq2022147); and the Scientific and Technological Innovation 2030 Major Project (2020AAA0103600).