Abstract

Urban data provides a wealth of information that can support people's life and work. In this work, we study object saliency detection in optical remote sensing images, which is conducive to the interpretation of urban scenes. Saliency detection selects the regions carrying important information in remote sensing images, closely imitating the human visual system, and plays a powerful supporting role in other image processing tasks. It has achieved notable success in change detection, object tracking, temperature reversal, and other tasks. Traditional methods suffer from disadvantages such as poor robustness and high computational complexity. Therefore, this paper proposes a deep multiscale fusion method via low-rank sparse decomposition for object saliency detection in optical remote sensing images. First, we perform multiscale segmentation of the remote sensing image. Then, we calculate saliency values and generate proposal regions. The superpixel blocks of the remaining proposal regions of the segmentation map are fed into a convolutional neural network; by extracting deep features, the saliency values are calculated and the proposal regions are updated. A feature transformation matrix is learned by gradient descent, and high-level semantic prior knowledge is obtained with a fully convolutional neural network. The process is iterated to obtain the saliency map at each scale. The transformed matrix is then decomposed into low-rank and sparse components by robust principal component analysis. Finally, the weighted cellular automata method fuses the multiscale saliency maps with the saliency map computed from the sparse noise obtained by the decomposition. Meanwhile, the object prior knowledge filters out most of the background information, reduces unnecessary deep feature extraction, and meaningfully improves the saliency detection rate. Experimental results show that the proposed method effectively improves the detection effect compared with other deep learning methods.

1. Introduction

With the rapid development of information technology, urban data has become one of the most important information sources for human beings, and the amount of information people receive has increased exponentially [1, 2]. How to select the object regions of human interest from the mass of urban image information has become a significant research problem. Studies have found that, in a complex scene, the human visual processing system focuses on several objects, forming regions of interest (ROI) [3]. The ROI is relatively close to human visual perception. Saliency detection, as an image preprocessing step, is widely applied in remote sensing tasks such as visual tracking, image classification, image segmentation, and target relocation.

Saliency detection methods mainly fall into two categories: top-down and bottom-up. Top-down saliency detection [4-6] is a task-driven process in which ground-truth images are labeled manually for supervised training; it integrates more human perception to obtain the salient map. Bottom-up methods, in contrast, are data-driven and pay more attention to image features such as contrast, position, and texture to compute the saliency map (SM). Itti et al. [7] proposed a spatial visual model that takes full advantage of local contrast and obtains the saliency map via center-surround differences. Hou and Zhang [8] put forward a saliency detection algorithm based on the spectral residual (SR). Achanta et al. [9] proposed a frequency-tuned (FT) method based on the image frequency domain to calculate saliency. A histogram-based detection method was presented to calculate global contrast [10]. Other relevant methods have also been proposed and show better results [11-15]. However, these methods do not analyze the image from the perspective of its matrix structure.

Yan et al. [16] treated the saliency region of the image as sparse noise and the background as a low-rank matrix, and calculated the saliency of the image by using sparse representation and the robust principal component analysis algorithm. Firstly, the image was decomposed into blocks, and every image block was sparsely encoded and merged into a coding matrix. Then, the coding matrix was decomposed by robust principal component analysis. Finally, the sparse matrix obtained by the decomposition was used to establish the saliency factor of the corresponding image block. However, because a large saliency object spans many image blocks, the saliency object in each image block no longer satisfies the sparsity assumption, which greatly affects the detection effect. Lang et al. [17] utilized a multitask low-rank recovery approach for saliency detection. The multitask low-rank representation algorithm was used to decompose the feature matrix while constraining the consistency of the sparse components of all features in the same image blocks. Because the algorithm used the consistency information of multiple feature descriptions, its effect was improved. However, since a large object contains a large number of feature descriptions, the features are no longer sparse; the reconstruction error could not solve this problem, so this method could not completely detect large saliency objects. To improve on the above methods, Shen and Wu [18] proposed a low-rank matrix recovery (LRMR) algorithm combining bottom-up and top-down cues (providing low-level and high-level information, respectively). First, superpixel segmentation was performed on the image and several features were extracted. Then, the feature transformation matrix and prior knowledge, including size, texture, and color, were obtained by network learning to transform the feature matrix. Finally, low-rank sparse decomposition of the transformed matrix was carried out using the robust principal component analysis algorithm. This method improved on the earlier deficiencies to some extent. However, due to the limitation of the center prior and the failure of the color prior in complex scenes, this algorithm is not ideal for detecting images with complex backgrounds.

Saliency detection methods using different low-level features are usually only effective for a specific type of image and are not suitable for multiobject images in complex scenes [19-21]. Figure 1 shows an example of saliency detection. Low-level features of visual stimuli lack an understanding of the nature of saliency objects and cannot represent features at a deeper level. Noisy objects in the image that are similar in their low-level features but do not belong to the same category are often wrongly detected as saliency objects. Yang et al. [22] presented a bag-of-words model to detect saliency. First, a prior probability saliency map was obtained from the object features, and a bag-of-words model representing mid-level semantic features was established to calculate a conditional probability saliency map; finally, the two saliency maps were combined by Bayesian inference. The mid-level semantic features represent the image content more accurately than bottom-level features, so the detection was more accurate. Jiang et al. [23] treated saliency detection as a regression problem and integrated regional attributes, contrast, and feature vectors of regional background knowledge under multiscale segmentation. The saliency map was obtained by supervised learning. Due to the introduction of background knowledge features, the algorithm had a better ability to identify background objects and thus obtained more accurate foreground detection results.

Deep learning (DL) combines low-level features to form more abstract high-level features; a typical representative is the convolutional neural network (CNN). Many saliency detection methods have adopted CNNs to optimize the result. Li et al. [24] proposed a deep CNN (DCNN) for saliency detection. First, region and edge information was obtained using a superpixel algorithm and bilateral filtering. The DCNN was then used to extract region and edge features from the raw images. Finally, the region confidence map and edge confidence map generated by the CNN were integrated into a conditional random field to judge saliency. Wang et al. [25] proposed a recurrent fully convolutional network (RFCN) for saliency detection, which mainly includes two steps: pretraining and fine-tuning. The RFCN was trained on the original images to correct the saliency prior map, and a traditional algorithm was then used to further refine the corrected saliency map. Lee et al. [26] proposed a deep saliency (DS) algorithm that uses low-level and high-level information in a unified CNN framework. VGG-Net was used to extract the high-level features, while the low-level features were encoded into distance maps by a CNN; the encoded low-level distance maps were then concatenated with the high-level features, and a fully connected CNN classifier was adopted to evaluate the feature information and obtain the saliency map [27]. The above DL methods show excellent performance in terms of saliency detection rate, but they still have disadvantages such as slow speed and highly complex calculations.

In this paper, we propose a deep multiscale fusion method via low-rank sparse decomposition for object saliency detection in optical remote sensing images. The main contributions are as follows.
(a) First, multiscale segmentation is executed for remote sensing images. For the first segmentation graph, the depth features of all the superpixel blocks are extracted by a CNN.
(b) Then, we calculate the saliency value, and the proposal region is generated. The superpixel blocks of the remaining proposal regions of the segmentation graph are input into the CNN. By extracting the depth features, the saliency value is calculated and the proposal regions are updated. Meanwhile, the mean color, texture, and edge features of all the pixels in each superpixel are calculated to construct the feature matrix. In order to make the image background facilitate low-rank sparse decomposition, this feature matrix is transformed so that the background can be represented as a low-rank matrix in the new feature space.
(c) To exploit high-level information and improve the detection of the ROI, a fully convolutional neural network is used to learn features, and the high-level semantic prior knowledge matrix is obtained. The feature matrix is transformed by using the feature transformation matrix and the high-level semantic prior knowledge. The robust principal component analysis algorithm is used to perform low-rank sparse decomposition of the transformed matrix to obtain a saliency map. The process is iterated to obtain the saliency map at each scale.
(d) Finally, the weighted cellular automata method fuses the multiscale saliency maps. Experiments show that the proposed method can effectively improve the detection effect compared with other DL methods.

The remainder of the paper is organized as follows. Section 2 presents the proposed deep multiscale fusion framework for saliency detection. Section 3 describes saliency region extraction based on multiscale segmentation, the saliency calculation based on deep features, and the low-rank sparse decomposition. Section 4 reports the experiments and analysis, and Section 5 concludes the paper.

2. Deep Multiscale Fusion for Saliency Detection

The proposed deep multiscale fusion method for saliency detection in optical remote sensing images is shown in Figure 2.

Firstly, the image is segmented into a small number of superpixel blocks by the superpixel segmentation algorithm, and deep features are extracted from all the superpixel blocks. The mean color, texture, and edge features of all the pixels in each superpixel are calculated to construct the feature matrix. In order to make the image background facilitate low-rank sparse decomposition, this feature matrix is transformed so that the background can be represented as a low-rank matrix in the new feature space. The multidimensional features containing the key information of the image are then extracted by PCA (principal component analysis). A rough-segmentation saliency map is obtained from the key features, from which the initial saliency region is extracted to form the initial superpixel set, and the similarity between each superpixel and the nonobject region is measured. The input image is segmented at further scales, and only the regions containing superpixel blocks of the current set are selected for deep feature extraction; the saliency map and superpixel set at the next scale are obtained in the same way. Robust PCA is used to perform low-rank sparse decomposition of the transformed matrix to obtain a saliency map at each scale. Finally, weighted cellular automata fusion is used to obtain the final SM.

3. Saliency Region Extraction Based on Multiscale Segmentation

Superpixel segmentation gathers adjacent similar pixels into image regions of different sizes according to low-level features such as brightness, thus reducing the complexity of the saliency calculation. Superpixel segmentation algorithms mainly include the watershed [28] and simple linear iterative clustering (SLIC) [25] methods. In this study, we combine their respective characteristics: the SLIC method is used to obtain segmentation results with regular shape and uniform size during rough segmentation, and the watershed algorithm is used to obtain better object contours during fine segmentation.
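A minimal sketch of this two-stage segmentation with scikit-image is given below. The segment count, compactness, and centroid-based watershed seeding are illustrative assumptions, not the settings used in the paper.

```python
# Minimal sketch of the two-stage superpixel segmentation described above,
# using scikit-image. Parameter values are illustrative assumptions.
import numpy as np
from skimage import io, color, filters, segmentation

image = io.imread("remote_sensing_tile.png")[..., :3]   # hypothetical input path
gray = color.rgb2gray(image)

# Rough segmentation: SLIC gives regularly shaped, uniformly sized superpixels.
rough_labels = segmentation.slic(image, n_segments=200, compactness=10,
                                 start_label=0)

# Fine segmentation: marker-controlled watershed on the gradient image
# tends to follow object contours more closely.
gradient = filters.sobel(gray)
markers = np.zeros_like(gray, dtype=np.int32)
# Seed one marker per rough superpixel centroid (one simple seeding choice).
for lab in np.unique(rough_labels):
    ys, xs = np.nonzero(rough_labels == lab)
    markers[int(ys.mean()), int(xs.mean())] = lab + 1
fine_labels = segmentation.watershed(gradient, markers=markers)

print(rough_labels.max() + 1, "rough superpixels,",
      len(np.unique(fine_labels)), "fine regions")
```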

For each segmentation scale i (i = 1, ..., N), the segmentation yields a superpixel set containing n_i superpixels, and each pixel in a superpixel is described by its color feature vector.

For the input image, we extract color, texture, and edge features to construct the feature matrix.
(i) Color feature. The R, G, B values, hue, and saturation are extracted to describe the color feature of the image.
(ii) Edge feature. A steerable pyramid filter is used to decompose the image at multiple scales and directions; filters with 3 scales and 4 directions are selected to obtain 12 responses as the edge features of the image.
(iii) Texture feature. A Gabor filter is used to extract texture features at different scales and directions; here, 3 scales and 12 directions are selected to obtain 36 responses as the texture features.

The mean value of all pixel features within each superpixel is calculated to represent its feature vector, and the feature vectors of all superpixels constitute the feature matrix F.
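As a concrete illustration of this step, the following sketch builds such a per-superpixel feature matrix with NumPy and scikit-image. It is a minimal sketch under simplifying assumptions: Gabor responses stand in for both the texture and edge channels (the steerable pyramid edge features are not reproduced), and the frequencies and orientations are illustrative rather than the paper's settings.

```python
# Sketch of building the per-superpixel feature matrix: the mean of each
# pixel-level feature is taken over every superpixel.
import numpy as np
from skimage import color, filters

def feature_matrix(image, labels):
    """image: HxWx3 float in [0,1]; labels: HxW superpixel labels (0..n-1)."""
    h, w, _ = image.shape
    hsv = color.rgb2hsv(image)
    gray = color.rgb2gray(image)

    channels = [image[..., 0], image[..., 1], image[..., 2],   # R, G, B
                hsv[..., 0], hsv[..., 1]]                       # hue, saturation
    # Texture-like responses: 3 scales (frequencies) x a few orientations.
    for freq in (0.1, 0.2, 0.4):
        for theta in np.linspace(0, np.pi, 4, endpoint=False):
            real, _ = filters.gabor(gray, frequency=freq, theta=theta)
            channels.append(real)

    stack = np.stack(channels, axis=-1).reshape(h * w, -1)
    flat_labels = labels.ravel()
    n_sp, n_feat = flat_labels.max() + 1, stack.shape[1]
    F = np.zeros((n_feat, n_sp))
    for i in range(n_sp):
        F[:, i] = stack[flat_labels == i].mean(axis=0)   # mean feature per superpixel
    return F   # feature matrix, one column per superpixel
```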

The saliency region of the image is regarded as sparse noise and the background as a low-rank matrix. In a complex background, the similarity of the background regions after clustering is still not high; therefore, the features of the original image are not conducive to low-rank sparse decomposition. To find a suitable feature space in which most image backgrounds can be represented as low-rank matrices, this paper obtains the feature transformation matrix T by gradient descent. The process of obtaining the feature transformation matrix is as follows:
(a) Construct the label matrix Q. If a superpixel lies within the manually marked saliency region, the corresponding entry of Q is set to 1; otherwise, it is set to 0.
(b) According to the following formula, the optimal transformation matrix T is learned from the features of the raw images.

where F_i is the feature matrix of the i-th image, N_i is the number of superpixels of the i-th image, Q_i is the label matrix of the i-th image, ||·||_* denotes the nuclear norm of a matrix (i.e., the sum of all its singular values), λ is a weight coefficient, ||T|| denotes the norm of the transformation matrix T, and c is a constant that prevents ||T|| from arbitrarily increasing or decreasing. If the transformation matrix T is appropriate, then T F_i Q_i is low rank; the norm term avoids the trivial solution in which the rank of T becomes arbitrarily small.
(c) Find the gradient descent direction.
(d) Update the transformation matrix with the following formula until the algorithm converges to a local optimum, where the step size controls the update.
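The following sketch illustrates this kind of gradient descent. Since equation (2) is not reproduced in the text, the objective is a simplified surrogate assumed only for illustration: the nuclear norm of the transformed, label-selected features is minimized over the training images while T is renormalized to a fixed Frobenius norm to rule out the trivial solution.

```python
# Minimal subgradient-descent sketch for learning the feature transformation
# matrix T under a simplified surrogate objective (assumption, not the paper's
# exact equation (2)).
import numpy as np

def learn_transform(features, labels, n_dims, lr=1e-2, n_iters=200):
    """features: list of (d x n_i) feature matrices F_i.
       labels:   list of (n_i,) 0/1 vectors marking the labeled superpixels.
       Returns a (n_dims x d) transformation matrix T."""
    d = features[0].shape[0]
    rng = np.random.default_rng(0)
    T = rng.standard_normal((n_dims, d))
    T /= np.linalg.norm(T)                      # fix ||T||_F = 1

    for _ in range(n_iters):
        grad = np.zeros_like(T)
        for F, q in zip(features, labels):
            Q = np.diag(q.astype(float))        # selects labeled superpixels
            X = T @ F @ Q
            U, _, Vt = np.linalg.svd(X, full_matrices=False)
            # Subgradient of ||T F Q||_* with respect to T is U V^T (F Q)^T.
            grad += U @ Vt @ (F @ Q).T
        T -= lr * grad                          # gradient descent step
        T /= np.linalg.norm(T)                  # keep the norm constant
    return T
```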

3.1. Extracting Proposal Region

The segmentation graph at the rough segmentation scale is taken as input, and its saliency map is obtained by deep feature extraction and saliency value calculation. This saliency map, as object prior knowledge for the next segmentation scale, is used to guide the proposal region extraction. The saliency map is binarized: its values are divided into channels by an adaptive threshold strategy, the number of pixels falling into each channel is counted, the channel with the largest number of pixels is determined, and the threshold value is calculated by formula (4).

In order to prevent the threshold from becoming too large, so that salient pixels would not be binarized to 0 when the saliency object occupies most of the image, the pixel number in each channel must satisfy a constraint that depends on the total pixel number of the image and an empirical value. The binarized object prior map is then used as the prior knowledge. The superpixels of the next-scale segmentation that correspond to the binarized foreground constitute the proposal saliency superpixel set at that scale. Specifically, for each superpixel at the next scale, the total number of its pixels and the number of its pixels whose corresponding positions in the binary map have value 1 are counted; if the ratio of the latter to the former is large enough, the superpixel is considered to belong to the proposal saliency set.

3.2. Region Optimization

The proposal object superpixel set may contain some background areas or miss some saliency areas, so the proposal object area needs to be optimized: possible background areas are removed from the set, and possible saliency areas in the background are added to it. According to the Euclidean distance between superpixel features in the two color spaces, the dissimilarity matrix D is obtained; it is a symmetric matrix whose order equals the number of superpixels.

where the features of each superpixel region correspond to its mean R, G, B, L, a, and b values. For each superpixel in the proposal set, the local average dissimilarity is calculated through equation (6),

where the sum runs over the superpixels in the proposal saliency region set. We also calculate the average dissimilarity between each superpixel in the proposal set and its adjacent background regions:

where the sum runs over the background superpixels adjacent to the considered superpixel, and the normalization is the number of such adjacent background superpixels. If the dissimilarity to the adjacent background is smaller than the local average dissimilarity within the proposal set, the superpixel is more similar to the adjacent background area and is removed from the proposal set.

Similarly, for any background superpixel, the average dissimilarity between it and its adjacent background regions and the average dissimilarity between it and its adjacent proposal saliency regions can be calculated. If the latter is smaller, the superpixel is more similar to the adjacent saliency region than to the other background regions, and it is added to the proposal set. The proposal set is constantly updated by comparing its superpixels with the other saliency regions and background regions, until the set no longer changes.
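The sketch below illustrates the removal direction of this optimization under assumed data structures (a mean RGB + Lab feature per superpixel and a boolean adjacency matrix); the symmetric step that promotes background superpixels into the proposal set would be analogous.

```python
# Hedged sketch of the proposal-set optimization: a superpixel is dropped
# from the proposal set when it is, on average, more similar to its adjacent
# background superpixels than to the proposal set itself.
import numpy as np

def refine_proposal_set(colors, proposal, adjacency):
    """colors:    (n, 6) mean R, G, B, L, a, b per superpixel.
       proposal:  set of superpixel indices currently proposed as salient.
       adjacency: (n, n) boolean matrix of superpixel adjacency."""
    n = colors.shape[0]
    # Pairwise Euclidean dissimilarity matrix D (symmetric, n x n).
    diff = colors[:, None, :] - colors[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))

    changed = True
    while changed:
        changed = False
        for p in sorted(proposal):
            inside = [q for q in proposal if q != p]
            bg_neighbors = [q for q in range(n)
                            if adjacency[p, q] and q not in proposal]
            if not inside or not bg_neighbors:
                continue
            d_inside = D[p, inside].mean()        # local average dissimilarity
            d_bg = D[p, bg_neighbors].mean()      # dissimilarity to adjacent background
            if d_bg < d_inside:                   # closer to background: remove
                proposal.discard(p)
                changed = True
    return proposal
```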

3.3. Deep Feature Extraction of Proposal Region

The deep feature extraction method based on CNN is shown in Figure 3. In the first superpixel segmentation, the deep features of all superpixels are extracted; in the subsequent deep feature extraction processes, only the superpixels in the proposal set are processed. Under a given segmentation strategy, the computation is thus greatly reduced and the computation speed is increased.

Assuming it is not the first segmentation, local and global features are extracted for each superpixel in the proposal set. The local features of a superpixel include two parts: (1) the deep feature of its own region and (2) the deep feature of the region containing itself and its adjacent superpixels.

First, according to the proposal set, the minimum bounding rectangle of each superpixel is extracted. Since most superpixels are not regular rectangles, the extracted rectangle inevitably contains other pixels; these pixels are replaced by the mean value of the superpixel. The deep feature of the superpixel's own region is then obtained through the deep CNN.

Using only the saliency of a superpixel's own region is not meaningful, because it is impossible to determine whether a region is salient without comparing it with adjacent superpixels. Therefore, the region containing the superpixel and its neighbors is also extracted to obtain the deep local feature. The location of a region in the image is another important factor for judging saliency: it is generally believed that an area in the center of the image is more likely to be salient than a region at the edge. Therefore, the whole image is taken as input and the deep feature of the global region is extracted.

If only the bottom-level features are used to compute the saliency map, the result is not ideal due to the many interfering objects. Therefore, high-level information needs to be added to improve the detection. The adopted high-level semantic prior knowledge predicts the most likely ROI based on previous experience (i.e., training samples). The FCNN is used to learn this high-level semantic prior knowledge, which is integrated into the feature transformation process to optimize the final saliency map. Higher-order features can be learned from the raw data without preprocessing in the multistage global training process of the CNN.

The FCNN can accept input images of any size. The difference between the FCNN and a CNN is that deconvolution layers replace the fully connected layers. Finally, pixel-level classification is carried out on the upsampled feature map: a binary prediction is produced for each pixel and a pixel-level classification result is output, which solves the image segmentation problem at the semantic level. The semantic prior is important high-level information that can assist the detection of the ROI. Therefore, this paper adopts the FCNN to obtain high-level semantic prior knowledge and applies it to the detection of the ROI.

The network structure of the FCNN is shown in Figure 4. Based on the original classifier, this paper uses the back-propagation algorithm to fine-tune the parameters of all FCNN layers. In the network structure, the first row obtains the feature map after seven alternating convolutional layers and five pooling layers. The last deconvolution layer upsamples the feature map with a stride of 32 pixels; this network is denoted FCNN-32s. It is found that the precision decreases because of the max-pooling operations: directly upsampling the downsampled feature map results in very rough output and loss of detail. Therefore, in this paper, the stride-32 features obtained from the upsampling are enlarged by a factor of 2 and summed with the stride-16 features; the resulting feature map is then restored to the original image size for training, giving the FCNN-16s model, which recovers more accurate detail than FCNN-32s. The same method is used to train the FCNN-8s model, whose prediction of detail is even more accurate. Experiments show that although fusing lower-level features for training can make the detail prediction more accurate, the effect on the low-rank sparse decomposition result is not significantly improved, while the training time increases sharply. Therefore, this paper adopts the FCNN-8s model to acquire the high-level prior knowledge of images.
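As a rough stand-in for this step, the sketch below derives a semantic prior map from a pretrained segmentation FCN in torchvision. The paper itself trains a VGG-based FCNN-8s; taking one minus the predicted background probability as the prior is an assumption made only for illustration.

```python
# Sketch of obtaining a high-level semantic prior map with a fully
# convolutional network. FCN-ResNet50 is used purely as a stand-in for the
# paper's VGG-based FCNN-8s.
import torch
from torchvision.models.segmentation import fcn_resnet50

model = fcn_resnet50(weights="DEFAULT").eval()

def semantic_prior(image_tensor):
    """image_tensor: (3, H, W) float tensor, ImageNet-normalized."""
    with torch.no_grad():
        logits = model(image_tensor.unsqueeze(0))["out"]   # (1, 21, H, W)
        probs = torch.softmax(logits, dim=1)
    # Class 0 is "background" in the VOC label set used by this model.
    return 1.0 - probs[0, 0]                               # (H, W) prior map
```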

The deep CNN model comprises an input layer, multiple convolution layers, downsampling layers, fully connected layers, and an output layer. The convolution layers and downsampling layers form the intermediate structure of the network and are used for feature extraction and computation. The fully connected layer is connected to the downsampling layer and outputs the features. The output of a convolution layer is:

$$x^{l} = f\left(x^{l-1} \ast k^{l} + b^{l}\right),$$
where $x^{l}$ and $x^{l-1}$ are the feature maps of the current layer and the previous layer, $k^{l}$ is the convolution kernel of the model, $f(\cdot)$ is the neuron activation function, and $b^{l}$ is the neuron bias. The feature extraction result of the downsampling layer is:

$$x^{l} = f\left(\beta^{l}\,\mathrm{down}\!\left(x^{l-1}, s\right) + b^{l}\right),$$
where $s$ is the downsampling template scale, $\beta^{l}$ is the template weight, and $\mathrm{down}(\cdot)$ aggregates the input feature map over each $s \times s$ template. In this paper, the trained GoogleNet model is used to extract the depth features of the proposal object regions. Based on this model, the labeled output layer is removed to obtain the depth features. The convolution layer C1 uses 96 filters to filter the input image. Convolution layers C2, C3, C4, and C5 take the output of the preceding downsampling layer as input; convolution is carried out with their own filters, and several output feature maps are obtained and passed to the next layer. The fully connected layers F6 and F7 each have 4096 features. The output of each fully connected layer is a nonlinear activation of a weighted sum of its inputs plus a bias.
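The following sketch illustrates this extraction with torchvision's pretrained GoogLeNet: the superpixel's minimum bounding rectangle is cropped, pixels outside the superpixel are filled with its mean color, and the 1024-dimensional feature in front of the removed classification layer is read out. The crop size and normalization constants are standard ImageNet settings assumed for illustration.

```python
# Sketch of extracting a deep feature for one proposal superpixel with a
# pretrained GoogLeNet, after dropping the labeled output layer.
import numpy as np
import torch
from torchvision import models, transforms

backbone = models.googlenet(weights="DEFAULT").eval()
backbone.fc = torch.nn.Identity()          # drop the labeled output layer

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def superpixel_feature(image, labels, sp_id):
    """image: HxWx3 uint8; labels: HxW superpixel map; sp_id: superpixel index."""
    mask = labels == sp_id
    ys, xs = np.nonzero(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    patch = image[y0:y1, x0:x1].copy()
    # Pixels inside the rectangle but outside the superpixel are replaced
    # by the superpixel's mean color.
    patch[~mask[y0:y1, x0:x1]] = image[mask].mean(axis=0).astype(image.dtype)
    with torch.no_grad():
        feat = backbone(preprocess(patch).unsqueeze(0))
    return feat.squeeze(0).numpy()          # 1024-D deep feature
```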

3.4. Saliency Calculation Based on Deep Feature

PCA [28] is a common method for dimension reduction of high-dimensional data, replacing the high-dimensional features with a smaller number of features. For the superpixels, the output features constitute a sample matrix. The correlation coefficient matrix R of the samples is calculated by formula (11):

where the entries of R are the correlation coefficients between pairs of feature dimensions. By solving the characteristic equation $|R - \lambda E| = 0$, we find the eigenvalues and sort them in descending order. Then, we calculate the contribution rate and the cumulative contribution rate of each eigenvalue, i.e., the ratio of that eigenvalue to the sum of all eigenvalues and the ratio of the sum of the leading eigenvalues to the sum of all eigenvalues, respectively (equation (12)).

We calculate the orthogonal unit eigenvector corresponding to each eigenvalue. The unit vectors corresponding to the leading eigenvalues whose cumulative contribution rate reaches 95% are selected to form the transformation matrix. The high-dimensional matrix is reduced by formula (13), yielding the principal component features. The same transformation matrix is used to extract the principal component features in the segmentation maps at different scales.
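A compact sketch of this PCA step is given below; the 95% cumulative contribution criterion follows the text, while the standardization details are assumptions.

```python
# Sketch of the PCA step: eigen-decompose the correlation matrix of the
# per-superpixel deep features and keep the leading components whose
# cumulative contribution rate reaches 95%.
import numpy as np

def pca_reduce(W, target=0.95):
    """W: (n_superpixels, n_features) deep-feature sample matrix.
       Returns the principal-component features and the transformation matrix."""
    # Standardize, then eigen-decompose the correlation matrix.
    Z = (W - W.mean(axis=0)) / (W.std(axis=0) + 1e-12)
    R = np.corrcoef(W, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)          # ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    contrib = eigvals / eigvals.sum()             # contribution rate
    k = np.searchsorted(np.cumsum(contrib), target) + 1
    A = eigvecs[:, :k]                            # transformation matrix
    return Z @ A, A                               # k-dimensional principal features
```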

3.5. Contrast Feature

The contrast feature reflects the degree of difference between a region and its neighbors. The contrast feature of a superpixel is defined by the distance between its features and those of the other superpixels, as given in equation (14):

where the sum is taken over all superpixels and $\|\cdot\|_2$ denotes the 2-norm.

3.6. Spatial Feature

In the human visual system, different spatial positions receive different amounts of attention. The distance between pixels at different positions and the image center follows a Gaussian distribution. For any superpixel, its spatial feature is calculated as:

where the spatial feature depends on the central coordinate of the superpixel and the image center; the smaller the average distance from the image center, the larger the spatial feature. The saliency value of the superpixel is then obtained by combining its contrast feature and spatial feature according to equation (16).

We obtain the SM of the first segmented image and use it as the object prior knowledge to guide the proposal region extraction and optimization.
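The sketch below illustrates how such an initial saliency value can be formed from the contrast and spatial features; since equations (14)-(16) are not reproduced in the text, the Gaussian bandwidth and the multiplicative combination are assumptions made for illustration.

```python
# Sketch of the initial saliency computation from the principal-component
# features: contrast = summed feature distance to all other superpixels,
# spatial = Gaussian fall-off with distance from the image center, and the
# two are combined by multiplication (an assumed combination).
import numpy as np

def initial_saliency(pc_features, centroids, image_shape, sigma=0.25):
    """pc_features: (n, k) principal-component features per superpixel.
       centroids:   (n, 2) superpixel centers as (row, col).
       image_shape: (H, W)."""
    # Contrast feature: sum of 2-norm distances to all other superpixels.
    diff = pc_features[:, None, :] - pc_features[None, :, :]
    contrast = np.sqrt((diff ** 2).sum(axis=-1)).sum(axis=1)

    # Spatial feature: Gaussian fall-off with normalized distance from center.
    center = np.array([image_shape[0] / 2.0, image_shape[1] / 2.0])
    diag = np.hypot(*image_shape)
    dist = np.linalg.norm(centroids - center, axis=1) / diag
    spatial = np.exp(-(dist ** 2) / (2 * sigma ** 2))

    saliency = contrast * spatial
    return (saliency - saliency.min()) / (saliency.ptp() + 1e-12)   # [0, 1]
```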

3.7. Saliency Detection Based on Low-Rank Sparse Decomposition

The background in the image can be expressed as a low-rank matrix, and the saliency region can be regarded as sparse noise. For an original image, the feature matrix F and the feature transformation matrix T are obtained, and the FCN is used to obtain the high-level prior knowledge matrix P. The low-rank sparse decomposition of the transformed matrix is then carried out by robust PCA.

$$\min_{L,\,S}\ \|L\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad T F P = L + S,$$
where F is the feature matrix, T is the learned feature transformation matrix, P is the high-level prior knowledge matrix, L is the low-rank matrix, and S is the sparse matrix; $\|\cdot\|_*$ denotes the nuclear norm of a matrix (the sum of all its singular values), and $\|\cdot\|_1$ denotes the $l_1$-norm of a matrix (the sum of the absolute values of all its elements). Supposing that $S^*$ is the optimal solution for the sparse matrix, the saliency map can be calculated by the following equation.

where the saliency value of superpixel $p_i$ is given by the $l_1$-norm of the $i$-th column vector of $S^*$, that is, the sum of the absolute values of all the elements in that column.
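A self-contained sketch of this decomposition, using a standard inexact augmented Lagrange multiplier solver for robust PCA, is given below; the regularization weight and stopping rule are common defaults rather than the paper's settings.

```python
# Sketch of the low-rank sparse decomposition via robust PCA (principal
# component pursuit), solved with a simple inexact ALM loop. D stands for the
# transformed feature matrix (transform and semantic prior already applied).
import numpy as np

def shrink(X, tau):
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

def rpca(D, lam=None, max_iter=500, tol=1e-7):
    """Decompose D into a low-rank part L and a sparse part S (D = L + S)."""
    m, n = D.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))
    norm_D = np.linalg.norm(D, "fro")
    mu = 1.25 / (np.linalg.norm(D, 2) + 1e-12)
    Y = np.zeros_like(D)                       # Lagrange multiplier
    S = np.zeros_like(D)
    for _ in range(max_iter):
        L = svd_threshold(D - S + Y / mu, 1.0 / mu)     # nuclear-norm step
        S = shrink(D - L + Y / mu, lam / mu)            # l1 step
        residual = D - L - S
        Y += mu * residual
        mu *= 1.05                                      # mild continuation
        if np.linalg.norm(residual, "fro") / norm_D < tol:
            break
    return L, S

# Column-wise l1 norms of the sparse part give the superpixel saliency values:
# L_hat, S_hat = rpca(transformed_features)
# saliency = np.abs(S_hat).sum(axis=0)
```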

3.8. Saliency Map Fusion Based on Weighted Cellular Automata

Wang and Wang [29] adopted multilayer cellular automata (MCA) for saliency fusion. Each pixel represents a cell. In an m-layer cellular automaton, a cell in one saliency map has m-1 neighbors, located at the same position in the other saliency maps.

If a cell is labeled as foreground, its neighbors at the same position in the other SMs are assigned a fixed foreground probability. Saliency maps obtained by different methods are considered to be independent, and when updating synchronously, all saliency maps are given the same weight. However, there are guiding and refining relationships between the saliency maps at different segmentation scales, so the weights cannot be treated as equal during the fusion process. It is assumed that the SM obtained at the first segmentation scale has a reference weight, and the SM weight at each other scale is expressed as:

where the weight of each scale is determined by the total pixel number of its proposal object set and the superpixel number of the corresponding saliency map, and the weights are normalized. The synchronous updating mechanism is defined as:

where the state vector represents the saliency values of all the cells of the corresponding SM at time t. If the neighbors of a cell are judged as foreground, its saliency value is increased at the next time step. After the synchronous updating, we obtain the final saliency map by formula (21).
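The following sketch gives one possible reading of this weighted synchronous update; since equations (19)-(21) are not reproduced in the text, the mean-based binarization and the log-odds update strength are assumptions made only to make the idea concrete.

```python
# Simplified sketch of weighted multiscale fusion: each scale's saliency map
# is updated synchronously, being pulled up where the other (weighted) maps
# mark a pixel as foreground and down where they do not.
import numpy as np

def fuse_saliency_maps(maps, weights, n_steps=5, eta=2.0):
    """maps:    list of N saliency maps in [0, 1], all the same shape.
       weights: list of N nonnegative weights summing to 1."""
    S = [m.astype(float).copy() for m in maps]
    w = np.asarray(weights, dtype=float)
    for _ in range(n_steps):
        # Binarize every map with its own mean as an adaptive threshold.
        fg = [s > s.mean() for s in S]
        new_S = []
        for i in range(len(S)):
            # Weighted vote of the other maps: +1 where they see foreground,
            # -1 otherwise, applied in the log-odds domain.
            vote = sum(w[j] * np.where(fg[j], 1.0, -1.0)
                       for j in range(len(S)) if j != i)
            logit = np.log(np.clip(S[i], 1e-6, 1 - 1e-6) /
                           np.clip(1 - S[i], 1e-6, 1.0))
            new_S.append(1.0 / (1.0 + np.exp(-(logit + eta * vote))))
        S = new_S
    fused = sum(wi * si for wi, si in zip(w, S))
    return (fused - fused.min()) / (fused.ptp() + 1e-12)
```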

The proposed deep multiscale fusion method for object saliency detection is summarized in Algorithm 1.

Input: Raw image I, multiscale segment number N, and the segmentation parameter at each scale.
Output: Saliency map.
 for i = 1 to N
 {
  if i = 1 then
   (1) According to the determined parameters, use SLIC to segment the image;
   (2) Determine the three input regions (the superpixel's own region, the region including its neighbors, and the global region) of each superpixel;
   (3) Input the above regions into GoogleNet to extract the corresponding deep features;
   (4) The deep features of all superpixels constitute a matrix W, and the transformation matrix A of W is calculated by PCA to obtain the principal component features;
   (5) According to the principal component features, calculate the saliency values without object priors to obtain the first segmentation saliency map;
  else
   (6) According to the determined parameters, use the watershed algorithm to segment the image;
   (7) Take the previous saliency map as the object prior map, then extract and optimize the proposal object set;
   (8) Determine the three input regions of each superpixel in the proposal object set;
   (9) Input the above regions into GoogleNet to extract the corresponding deep features;
   (10) The deep features of all superpixels constitute a matrix W, and the transformation matrix A of W is calculated by PCA to obtain the principal component features;
   (11) According to the principal component features, calculate the saliency values with object priors and obtain the saliency map at the current scale;
  end if
 }
   (12) Calculate the saliency map weight at each scale;
   (13) Adopt weighted cellular automata to fuse the obtained N saliency maps and get the final SM.
Algorithm 1: Proposed saliency detection method.

4. Experiments and Analysis

In this section, the experimental data are obtained from Google Earth. The remote sensing images have various sizes, and the spatial resolution is 1 m. The experimental environment is an Intel(R) Core(TM) i7-8750 CPU at 2.2 GHz and a GeForce GTX 1060 GPU, with the MATLAB 2017a platform.

4.1. Evaluation Index and Parameter Setting

In the experiments, the PR curve, F-measure, and mean absolute error (MAE) of the saliency maps are compared to evaluate the saliency detection effect and to select a better segmentation scale.

Precision and Recall are the two most commonly used evaluation criteria in image saliency detection. A higher PR curve indicates a better saliency detection effect, and a lower one indicates a poorer effect. For a given manually labeled ground truth G and saliency map S, Precision and Recall are defined in equation (22):

$$\mathrm{Precision} = \frac{\sum_{(x,y)} S(x,y)\,G(x,y)}{\sum_{(x,y)} S(x,y)}, \qquad \mathrm{Recall} = \frac{\sum_{(x,y)} S(x,y)\,G(x,y)}{\sum_{(x,y)} G(x,y)},$$
where the numerator is the sum over all pixels of the product of the saliency map S and the ground truth G, the first denominator is the sum of all pixels in S, and the second denominator is the sum of all pixels in G.

When calculating the F-measure, the adaptive threshold of each image is used to segment the image.

The adaptive threshold is
$$T = \frac{2}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} S(x, y),$$
where W and H denote the width and height of the image, respectively. The average precision and recall of the SM are calculated, and the average F-measure value is computed according to equation (24):
$$F_\beta = \frac{\left(1 + \beta^2\right)\,\mathrm{Precision} \times \mathrm{Recall}}{\beta^2\,\mathrm{Precision} + \mathrm{Recall}},$$
where a larger F-measure value indicates a better saliency detection effect. The F-measure provides a comprehensive evaluation of precision and recall, and β is often set to 1.

MAE is used to evaluate the saliency model by comparing the difference between the SM and the GT. We use formula (25) to compute the MAE value of each input image:
$$\mathrm{MAE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| S(x, y) - G(x, y) \right|.$$
The calculated MAE values can be used to draw a histogram; a lower MAE value indicates a better algorithm.
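For concreteness, the sketch below computes these metrics (together with the IoU used later in Section 4.4) for a single saliency map; binarizing with the same adaptive threshold for the IoU is an assumption.

```python
# Sketch of the evaluation metrics: precision, recall, F-measure under an
# adaptive threshold, MAE, and IoU. S is a saliency map in [0, 1] and G is
# the binary ground truth.
import numpy as np

def evaluate(S, G, beta2=1.0):
    G = G.astype(float)
    # Adaptive threshold: twice the mean saliency of the map.
    T = 2.0 * S.mean()
    B = (S >= T).astype(float)

    tp = (B * G).sum()
    precision = tp / (B.sum() + 1e-12)
    recall = tp / (G.sum() + 1e-12)
    f_measure = ((1 + beta2) * precision * recall /
                 (beta2 * precision + recall + 1e-12))

    mae = np.abs(S - G).mean()
    iou = tp / ((B + G - B * G).sum() + 1e-12)   # |B ∩ G| / |B ∪ G|
    return dict(precision=precision, recall=recall,
                f_measure=f_measure, mae=mae, iou=iou)
```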

4.2. Segment Scale Determination

The main parameter of this algorithm is the segmentation scale. Too many segmentation scales increase the computational complexity, while too few scales reduce the accuracy of saliency detection. Therefore, 15 segmentation scales are set according to experience, and experiments are conducted on randomly selected remote sensing image data. We extract the depth features of all superpixels in the segmentation graphs and calculate the saliency maps. The histogram of the PR curve at different segmentation scales is shown in Figure 5. Through comparative analysis, it is found that segmentation scales 10, 11, and 12 give a relatively better saliency detection effect; these three scales are therefore selected as the final segmentation scales of the proposed method.

4.3. PCA Parameter Determination

To verify the effectiveness of PCA in selecting principal components from the depth features, this section uses the depth features extracted from each superpixel block as the data set. The percentage of explained variance (PEV) is used to measure the importance of each principal component in the overall data, as in formula (26). The PEV is a main index describing the distortion rate of the data.

where the computation involves the right singular matrix of the principal component matrix after singular value decomposition and the covariance matrix of the data. Figure 6 shows the relation between the PEV and the top 50 principal components. It reveals that the PEV increases with the number of principal components, but the trend grows slowly. When the number of principal components exceeds 20, the PEV reaches 90%, which is considered to represent the overall information of the data. In this paper, the top 20 principal components are selected for the saliency calculation.

4.4. Saliency Detection Comparison with Other State-of-the-Art Methods

In this section, five state-of-the-art methods, namely RA [30], RB [31], SC [32], RAD [33], and SCLR [34], are compared with the proposed deep multiscale fusion method. We conduct experiments on optical remote sensing images based on urban data, covering airplane, playground, boat, vehicle, and cloud scenes. Due to limited space, we only display the results for several remote sensing objects. The test images and their ground truth (GT) maps are shown in Figure 7, and Figure 8 displays the saliency results of the different methods.

As Figure 8 shows, the detection effect of the proposed algorithm is obviously better than that of the other algorithms.

Table 1 reports the F-measure results. As the recall changes, the precision of the proposed method remains at a high level. In terms of the F-measure value, our method is 7.18% higher than the second-best method. Under complex background conditions, both the PR curve and the F-measure value of the proposed method are significantly higher than those of the other algorithms, which fully demonstrates the advantages of the proposed algorithm on relatively complex image content. Similarly, the MAE of the proposed algorithm is lower than that of the other algorithms. Figures 9-14 show the subjective evaluation results for the six objects.

We also adopt the IoU (Intersection over Union) to illustrate the effectiveness of the proposed method [35, 36]. The IoU is calculated as follows:
$$\mathrm{IoU} = \frac{|S \cap G|}{|S \cup G|},$$
where S is the binarized saliency detection result and G is the ground truth.

A greater IoU indicates a better effect. The results are shown in Table 2.

From Table 2, we can see that the proposed method has a better saliency detection effect than the other methods.

There are also apparent differences in detection time among the algorithms. In terms of the speed of saliency detection, the proposed method is faster than the other methods, as shown in Figure 15. Although deep learning-based algorithms need to train on many samples, compared with the other deep learning methods, the processing efficiency of the proposed method is improved by about 4%. Overall, the deep multiscale fusion method has a better effect on saliency detection for remote sensing images.

5. Conclusions

Saliency detection algorithms based on DL can overcome the shortcomings of traditional saliency detection algorithms; however, their detection efficiency is often insufficient. Therefore, we present a deep multiscale fusion method for object saliency detection in optical remote sensing images based on urban data. Through deep feature extraction, we calculate the saliency values and use weighted cellular automata to integrate and optimize the multiscale saliency maps. Results reveal that the proposed method acquires better saliency detection results than other methods while remaining efficient. In the future, new models based on deep learning will be researched, and the new methods will be applied to practical engineering.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.