Abstract

Because of the presence of clouds, the available information in optical remote sensing images is greatly reduced. Temporal-based methods are widely used for cloud removal; however, the temporal differences among multitemporal images have consistently been a challenge for this type of method. To this end, a bishift network (BSN) model is proposed to remove thick clouds from optical remote sensing images. As its name implies, BSN consists of two successive shifts. The first shift uses moment matching (MM) and deep style transfer (DST) to preliminarily eliminate temporal differences in multitemporal images. In the second shift, an improved shift net is proposed to reconstruct the missing information under cloud cover. It introduces multiscale feature connectivity with shift connections and depthwise separable convolution (DSC), which can capture local details and global semantics effectively. Experiments with Sentinel-2 images demonstrate that the proposed BSN has clear advantages over both traditional and state-of-the-art methods in cloud removal.

1. Introduction

Optical remote sensing images are an important data source for large-area research and applications. However, clouds and shadows make it difficult to obtain high-quality optical images. In most cases, clouds are detrimental to the practical applications of remote sensing images. When the ground is covered by thin clouds, the sensor captures a mixture of thin clouds and ground objects. When the ground is covered by thick clouds, the thick clouds and their shadows completely obstruct the ground, and the optical sensor usually cannot capture ground information. Clouds (especially thick clouds) and their shadows have long been considered a difficult problem in remote sensing image processing and applications.

In the past decades, great efforts have been made to remove clouds from remote sensing images. Depending on the type of data source, a wide variety of cloud removal methods are classified into four categories: spatial-based, spectral-based, temporal-based, and hybrid methods [1]. Spatial-based methods are not suitable for large-size and complex cloud-covered images. Spectral-based methods work well with thin clouds, but not with thick clouds. Therefore, the most significant and widely used methods are temporal-based methods and hybrid methods.

Remote sensing platforms usually have a fixed revisit period and can acquire images of the same area at different times. Thus, they provide a reliable reference data source for cloud removal with multitemporal remote sensing images. Temporal-based methods introduce additional observations from multitemporal images to reconstruct cloud-covered regions rather than using only the cloudy image itself. They can alleviate temporal differences caused by observational conditions and regular changes in geographic features (e.g., phenological changes). Representative methods include temporal replacement methods [2], temporal filter methods [3, 4], and temporal learning methods [5–8]. Hybrid methods attempt to make better use of correlations among the spatial, spectral, and temporal domains using the same or different sensor data. They can take full advantage of the above three categories of methods to achieve better cloud removal results. Hybrid methods include joint spatiotemporal methods [9] and joint spatiospectral methods [10]. Additionally, multisource data (SAR and optical images) [11, 12] are also used as auxiliary images to improve cloud removal. Generally, however, hybrid methods face difficulties in multisource data acquisition. Relatively speaking, temporal-based methods are more popular and more readily applicable for cloud removal from optical images.

1.1. Related Work

During the past decades, researchers have developed a number of temporal-based methods. Chen et al. proposed a Savitzky–Golay filter to remove noise from images [3], but it fails to remove thick clouds. Based on the idea of replacement, Lin et al. proposed a multitemporal information cloning method for removing thick clouds [2]. An automatic cloud removal method based on Poisson blending, which exploits the temporal similarity of multitemporal images, has also been applied [13]. Li et al. reconstructed cloud-covered regions of remote sensing images within a framework of sparse representation (PM-MTGSR) [14]. Nonlocal correlations can also be exploited effectively to reconstruct missing information in optical remote sensing images [5]. An improved Bayesian dictionary-learning algorithm based on compressed sensing was proposed to restore remote sensing images [15]. Multitemporal dictionary learning was then extended to the recovery of quantitative data contaminated by thick clouds and shadows [7]. To recover the original information covered by clouds and their accompanying shadows, a non-negative matrix factorization error correction method was proposed [8]. Although researchers have made significant progress with multitemporal-based methods, traditional methods have some limitations. For example, they cannot take advantage of the deep correlations within images to remove clouds, which is especially important for remote sensing images. Traditional methods often underperform when dealing with cloud and feature boundaries, and the reconstructed features are not sufficiently accurate. Furthermore, temporal-based methods depend heavily on the quality of the multitemporal reference images; if the reference images are contaminated by clouds, the cloud removal results are significantly degraded.

Recently, owing to its powerful nonlinear representation capability [16, 17], deep learning has attracted more and more attention for missing information reconstruction in remote sensing images. Compared with traditional methods, deep learning-based methods can exploit deep correlations between multitemporal images. Sandhan and Choi successfully used a generative model to remove extremely thin high-altitude clouds [18]; however, it cannot effectively process more heavily obscured images. Thus, researchers moved to thick cloud removal based on convolutional neural networks (CNNs). For example, Zhang et al. proposed a CNN-based spatial-temporal-spectrum (STS) framework [19] to generate accurate reconstruction results. They also improved the STS framework to combine global-local spatiotemporal information for cloud removal [20]. Considering the specificity of different types of information, a conventional CNN-based joint content, texture, and spectrum generation network was proposed for cloud removal [21]. Based on the idea of integration, Ji et al. proposed an integrated cloud detection and removal method with cascaded CNNs [22]. Moreover, a novel gated convolutional network (GCN) has stronger discrimination ability than general convolutional networks for cloud removal [23]. Generative adversarial networks (GANs) [24] are also used for cloud removal. Yu et al. [25] proposed a GAN with a contextual attention method (GAN-CA) for reconstructing information in cloud-obscured images. It can explicitly attend to relevant feature blocks at distant spatial locations, but its ability to handle high-resolution tasks is insufficient. A trainable spatiotemporal generative adversarial network (STGAN) [26] casts cloud removal as conditional image synthesis. Gao et al. proposed the SAR-opt-GAN method, which combines SAR and optical data to facilitate cloud removal [27]. Image translation has also been adopted as a recent approach for SAR-assisted cloud removal [28]. Deep learning-based methods are robust and stable and therefore less susceptible to dataset quality than traditional methods. They contain a large number of model parameters and require pretraining, but in return they achieve relatively high accuracy. Current deep learning-based cloud removal methods have made great advances, but some problems remain. For example, due to the large size of remote sensing scenes, deep learning methods cannot process them directly, and researchers have to crop them into smaller images. Moreover, if the auxiliary images in the datasets have large temporal differences, the cloud removal results will be unsatisfactory. Therefore, the datasets need high spatial, spectral, and temporal correlation with the target images. However, deep learning-based methods have rarely focused on the temporal differences among multitemporal images, which is a great obstacle to exploiting the temporal correlation of multitemporal images.

1.2. Contributions

In order to restrain the temporal differences in multitemporal images while exploiting the advantages of deep learning, a novel bishift network (BSN) model with double shifts is proposed in this paper. BSN can improve the correlation between the target image and the datasets to some extent, helping the model learn the information that is most useful for cloud removal.

The first shift includes moment matching (MM) and deep style transfer (DST). Multitemporal images (reference images) are statistically normalized to cloud-covered images by traditional MM and then processed by DST to further eliminate temporal differences. MM utilizes the mean and variance in optical remote sensing images to match features from datasets to target images. In the stage of DST, the transferred images are constrained to be represented by locally affine color transformations to prevent distortions.

The second shift takes full advantage of a proposed reconstruction network to reduce the temporal differences of images once again for better cloud removal. It is an improved version of Shift-Net [29] with shift connections and depthwise separable convolution (DSC). Shift connections reduce information loss during reconstruction and effectively improve the accuracy of cloud removal. DSC can partially reduce the number of parameters of the model and improve training efficiency. The proposed reconstruction network can better capture the local details and global semantics of images. By two successive shift operations, the temporal differences in the multitemporal images can be suppressed effectively. Eventually, high-quality cloud-removed images can be obtained.

The rest of this paper is organized as follows: The proposed BSN is introduced in Section 2. The effectiveness of BSN is tested by simulated experiments and real experiments in Section 3. Finally, Section 4 summarizes the article.

2. Proposed Method

BSN is a further improvement on our previous research [30]. It consists of two shifts, as shown in Figure 1. The first shift preprocesses the multitemporal images (reference images) with MM and DST to obtain reliable preliminary results. In the second shift, the reconstruction network, an improved Shift-Net, reconstructs the target image covered by clouds and shadows to generate an accurate cloud-free image. BSN requires at least one temporal reference image to ensure the capability of the network, and it can also handle many multitemporal reference images. This section introduces BSN in detail.

2.1. First Shift

It is a challenge that multitemporal remote sensing images have different temporal characteristics. To this end, the first shift of the proposed BSN is used to normalize multitemporal reference images to cloudy images (target images). The first shift contains statistical MM and DST for reducing temporal differences in reference images.

2.1.1. Moment Matching

MM is a mathematical statistical method, commonly used in remote sensing image denoising [31] and difference elimination [32]. In BSN, MM is used to normalize multitemporal reference images. It should be noted that, because parts of the cloudy image are covered by clouds, the statistical values of the target image are computed from cloudless regions only. Thus, the cloud mask is used to distinguish between cloudy and cloud-free pixels. The MM formula is as follows:

$$I_{\mathrm{out}}(i,j) = \frac{\sigma_{\mathrm{tar}}}{\sigma_{\mathrm{ref}}}\bigl(I_{\mathrm{ref}}(i,j) - \mu_{\mathrm{ref}}\bigr) + \mu_{\mathrm{tar}}, \tag{1}$$

where $I(i,j)$ is the pixel at position $(i,j)$, $\mu$ and $\sigma^2$ represent the mean and variance of the entire image, respectively ($\sigma$ is the standard deviation), and the subscripts out, ref, and tar denote the output image, the reference image, and the target image, respectively.
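For illustration, the following NumPy sketch applies this band-by-band matching; the function interface, array layout, and use of a boolean cloud mask for the target statistics are assumptions for demonstration rather than the implementation used in BSN.

```python
import numpy as np

def moment_matching(reference, target, cloud_mask):
    """Match a reference image to the target image's statistics, band by band.

    reference  : (H, W, B) multitemporal reference image.
    target     : (H, W, B) cloudy target image.
    cloud_mask : (H, W) boolean array, True where the target is cloud-covered;
                 target statistics are computed from cloud-free pixels only.
    """
    out = np.empty_like(reference, dtype=np.float64)
    for b in range(reference.shape[2]):
        ref_band = reference[..., b].astype(np.float64)
        tar_clear = target[..., b].astype(np.float64)[~cloud_mask]
        mu_ref, sigma_ref = ref_band.mean(), ref_band.std()
        mu_tar, sigma_tar = tar_clear.mean(), tar_clear.std()
        # Shift the reference band to the target's mean and standard deviation.
        out[..., b] = (ref_band - mu_ref) * (sigma_tar / (sigma_ref + 1e-8)) + mu_tar
    return out
```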

Before MM, there are radiometric differences between the multitemporal reference images caused by acquisition time, weather, and light intensity. For example, there are differences in the greenness of vegetation and the intensity of light reflection. MM can reduce these differences to a certain extent and contributes to the efficiency of the subsequent style transfer.

2.1.2. Deep Style Transfer

Style transfer is an evolving research field. Gatys et al. first proposed a style transfer algorithm based on neural networks [33, 34]. In neural style transfer, the two input images are a "content image" and a "style image." The content of an image is defined as the feature responses from a pretrained CNN, while the style is described by summary feature statistics. The task of style transfer is to convert an image to an artistic style by changing its style. However, preserving the required semantic content is a key challenge, which is generally addressed by changing the weights of pretrained CNN models [35]. In this paper, DST is applied to multitemporal reference images to normalize temporal differences.

The traditional style transfer method is effective for simple styles, such as global color changes and tone curves. It generates the output image by applying the reference style image to the content image. The general objective function is

$$\mathcal{L}_{\mathrm{total}} = \sum_{l=1}^{L} \alpha_l \mathcal{L}_{c}^{l} + \Gamma \sum_{l=1}^{L} \beta_l \mathcal{L}_{s}^{l}, \tag{2}$$

where $\mathcal{L}_{\mathrm{total}}$ is the total loss of the deep CNN, $\alpha_l$ and $\beta_l$ are the loss weights of the content image and the style image in the $l$-th convolutional layer of the total $L$ layers, respectively, $\Gamma$ is the weight of all style images, and $\mathcal{L}_{c}^{l}$ and $\mathcal{L}_{s}^{l}$ represent the content loss and style loss in the style transfer process, respectively.
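As a minimal sketch of this kind of objective, the PyTorch snippet below combines per-layer content losses and Gram-matrix style losses computed from pre-extracted feature maps; the function names, default weights, and mean-squared formulation are illustrative assumptions, not the exact losses used in this paper.

```python
import torch

def gram_matrix(feat):
    """Gram matrix of a (N, C, H, W) feature map, used as the style statistic."""
    n, c, h, w = feat.shape
    f = feat.view(n, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_transfer_loss(content_feats, style_feats, output_feats,
                        alpha=None, beta=None, gamma=1.0):
    """Weighted sum of per-layer content and style losses (Gatys-style objective).

    content_feats / style_feats / output_feats: lists of feature maps, one per
    chosen CNN layer. alpha, beta: per-layer weights; gamma: global style weight.
    """
    L = len(output_feats)
    alpha = alpha or [1.0 / L] * L
    beta = beta or [1.0 / L] * L
    content_loss = sum(a * torch.mean((o - c) ** 2)
                       for a, o, c in zip(alpha, output_feats, content_feats))
    style_loss = sum(b * torch.mean((gram_matrix(o) - gram_matrix(s)) ** 2)
                     for b, o, s in zip(beta, output_feats, style_feats))
    return content_loss + gamma * style_loss
```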

Before BSN, general fast style transfer had been used for temporal difference elimination [30]. However, regular style transfer methods are not suitable for realistic style and complex transfers. Style transfer is particularly difficult for remote sensing images with complex features, large scale, and high resolution. To solve this problem, the style transformation from input to output is constrained to be a local affine projection in RGB color space. It has been demonstrated that the local style transfer algorithm based on spatial color mapping is more expressive [36]. Therefore, DST is adopted [37].

DST can ensure the preservation of image structure, semantic accuracy, and transfer faithfulness, which is beneficial for complex remote sensing images. In the optimization, a photorealism regularization term is added to the objective function, and the reconstructed image is constrained to be represented by locally affine color transformations of the input to prevent distortions. An optional guidance based on semantic segmentation of the reference images is introduced into the style transfer process, so that styles are only transferred between features of the reference images and similar features of the target images. This minimizes the content-mismatch problem and greatly improves the photorealism of the output images, meaning that the multiple feature classes and complex textures of remote sensing images are not lost during style transfer. The method expects no cloud information in the transferred images, so cloud-covered regions are useless; the cloud mask is therefore added to the input image as an additional channel to distinguish cloud-covered from cloud-free regions. The neural style algorithm is then enhanced by concatenating the segmentation channels and updating the augmented style loss $\mathcal{L}_{s+}^{l}$. More details can be found in [37].

The improved objective function of DST is

$$\mathcal{L}_{\mathrm{total}} = \sum_{l=1}^{L} \alpha_l \mathcal{L}_{c}^{l} + \Gamma \sum_{l=1}^{L} \beta_l \mathcal{L}_{s+}^{l} + \lambda \mathcal{L}_{m}, \tag{3}$$

where $\lambda$ indicates the corresponding weight of the photorealism regularization term $\mathcal{L}_{m}$, and $\mathcal{L}_{c}^{l}$ and $\mathcal{L}_{s+}^{l}$ represent the content loss and the augmented style loss, respectively.

The traditional fast style transfer method requires a trade-off between the content and the style of the image: the original content clarity cannot be fully preserved when the style is completely transferred. In practice, fast style transfer achieves only a rough transfer of style, with little change in content, so the transferred image is not close enough to the style image (the cloudy target image). For example, in Figure 2(d), vegetation and bareland show a gradual transition in the cloudy image, but the difference between them is obvious in the reference image. Although the images generated by fast style transfer are similar in style to the cloudy image, vegetation and bareland remain clearly distinguished, which is obviously inappropriate as a preliminary result of cloud removal. In contrast, DST can realize transfer between similar features, which makes the transfer process more accurate. As a result of DST, the features of rivers, vegetation, and bareland are highly consistent with the cloudy target image in Figure 2(e).

In the first shift, MM and DST are applied to process multitemporal reference remote sensing images, gradually removing temporal differences among images so that they can contain more information similar to the target image. The preprocessed images provide more available information for the subsequent work and are therefore reliable. In the subsequent experimental sections, ablation experiments are conducted to demonstrate the effects of preprocessed measures.

2.2. Second Shift

The first shift preliminarily solves the temporal difference problem of multitemporal images. To obtain more accurate cloud-removed images, the second shift recovers cloud-covered information with the proposed reconstruction network. This network operates in the feature domain of a deep encoder learned end-to-end from the training data, which differs from the traditional exemplar-based method [38] that fills pixels or patches.

2.2.1. Reconstruction Network

Adversarial learning has been adopted in low-level vision [39], image generation [40, 41], and image inpainting [25] and has shown its superiority in restoring fine details and photo-realistic textures. In the second shift, the reconstruction network architecture is based on a GAN, which includes a generator and a discriminator, as shown in Figure 3. The generator learns the data distribution, and the adversarial learning between the generator and discriminator improves its ability to remove clouds from images. After training, the generator can transform an input cloud-covered image into a cloud-free image.

The reconstruction network in BSN is inspired by Shift-Net [29]. As shown in Figure 4, its generator includes eight convolutional modules and corresponding deconvolutional modules. The convolutional modules form the encoder of the network, and the deconvolutional modules form the decoder. Encoders and corresponding decoders are connected by skip connections [42]. Skip connections fuse features of different scales and levels, which can effectively reduce gradient vanishing and network degradation. Furthermore, skip connections facilitate the joint use of information from the convolution and deconvolution layers, which is valuable for capturing the local visual details needed in cloud removal for remote sensing images. In order to further enhance the generator's ability to capture local details, shift connections are introduced. Shift connections provide even greater advantages through deep feature rearrangement. Appropriate placement of the shift connection layer keeps the computation time acceptable while maintaining the reconstruction performance of the network.
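To convey the idea of a shift connection, the simplified PyTorch sketch below copies, for every cloud-covered location, the most similar known-region encoder feature into the feature map. This is only an illustration of the feature-rearrangement concept under assumed names and shapes, not Shift-Net's actual implementation; in the real network, the shifted feature map would typically be concatenated with the decoder features before the next deconvolution.

```python
import torch
import torch.nn.functional as F

def shift_connection(dec_feat, enc_feat, mask):
    """Simplified illustration of a shift connection.

    dec_feat, enc_feat : (1, C, H, W) decoder / encoder feature maps.
    mask               : (1, 1, H, W) binary mask, 1 = cloud-covered location.
    For every cloud-covered location, the nearest known-region encoder feature
    (by cosine similarity to the decoder feature) is copied into the output.
    """
    _, C, H, W = enc_feat.shape
    m = mask.view(-1).bool()                 # (H*W,), True where missing
    enc = enc_feat.view(C, -1).t()           # (H*W, C)
    dec = dec_feat.view(C, -1).t()           # (H*W, C)

    known = F.normalize(enc[~m], dim=1)      # encoder features of known pixels
    query = F.normalize(dec[m], dim=1)       # decoder features of missing pixels

    idx = (query @ known.t()).argmax(dim=1)  # nearest known feature per missing pixel

    shifted = enc.clone()
    shifted[m] = enc[~m][idx]                # "shift" the matched features in
    return shifted.t().reshape(1, C, H, W)
```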

Based on Shift-Net, BSN introduces DSC to improve the network's capability. For the first and second modules of the encoder, a combination of DSC [43], a batch normalization (BN) layer, and a leaky ReLU activation function is adopted. DSC splits the convolution operation into a depthwise convolution and a 1 × 1 pointwise convolution. The multichannel feature maps of the previous layer are first split into single-channel feature maps, separate single-channel convolutions are performed, and the results are restacked. DSC thus reduces the computation involved in convolution. For smaller models, replacing 2D convolution with DSC may significantly reduce model capacity and lead to suboptimal results. However, if used properly, DSC can improve efficiency without degrading model performance. In processing remote sensing images, minimizing network parameters is beneficial for training efficiency because of the large size of the images, and it has been found experimentally that DSC can improve the accuracy of cloud removal, as shown later in the experimental section. Experiments show that DSC layers in the first and second encoder modules not only reduce the burden of the network but also improve the effectiveness of feature extraction. In addition, they help generate clear, detailed, and photo-realistic images.
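A minimal PyTorch sketch of such an encoder module is given below; the kernel size, stride, and negative slope are common choices assumed for illustration and are not specified in the text.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution block: a per-channel (depthwise) convolution
    followed by a 1x1 pointwise convolution, then BatchNorm and LeakyReLU, as in
    the first two encoder modules described above."""

    def __init__(self, in_ch, out_ch, kernel_size=4, stride=2, padding=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size, stride, padding, groups=in_ch),  # depthwise
            nn.Conv2d(in_ch, out_ch, kernel_size=1),                              # pointwise
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```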

The discriminator judges whether input images are real images from the dataset or fake cloud-free images produced by the generator. The structure of the discriminator is shown in Figure 5. The discriminator of the GAN for cloud removal consists of five convolutional modules. The first module contains a convolutional layer and a leaky ReLU layer. The second to fourth modules are a combination of convolutional, leaky ReLU, and instance batch norm (IBN) layers. The last module is a single convolutional layer. By passing the input image through these convolutional modules, the discriminator determines whether it is a real image or a fake cloudless image generated by the generator, thereby evaluating the cloud removal ability of the generator. More information on the parameters of the generator and the discriminator can be found in Table 1.
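The PyTorch sketch below mirrors the five-module structure described above; the channel widths, kernel sizes, strides, and the use of nn.InstanceNorm2d for the IBN layers are assumptions made for illustration, not parameters taken from Table 1.

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Five-module discriminator following the structure described in the text."""

    def __init__(self, in_ch=3, base=64):
        super().__init__()
        layers = [nn.Conv2d(in_ch, base, 4, 2, 1), nn.LeakyReLU(0.2, True)]  # module 1
        ch = base
        for _ in range(3):                                                    # modules 2-4
            layers += [nn.Conv2d(ch, ch * 2, 4, 2, 1),
                       nn.LeakyReLU(0.2, True),
                       nn.InstanceNorm2d(ch * 2)]
            ch *= 2
        layers += [nn.Conv2d(ch, 1, 4, 1, 1)]                                 # module 5: single conv
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)  # patch-wise real/fake scores
```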

2.2.2. Objective Function

For remote sensing images with complex features, a suitable compound loss function is required to ensure an effective training process. After experimental attempts, a weighted combination of the generative adversarial loss $\mathcal{L}_{\mathrm{adv}}$, the $\ell_1$ norm loss $\mathcal{L}_{\ell_1}$, and the shift loss $\mathcal{L}_{\mathrm{shift}}$ was found applicable to the training on remote sensing images. The objective function is as follows:

$$\mathcal{L} = \mathcal{L}_{\ell_1} + \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{adv}} + \lambda_{\mathrm{shift}}\,\mathcal{L}_{\mathrm{shift}}, \tag{4}$$

where $\lambda_{\mathrm{adv}}$ and $\lambda_{\mathrm{shift}}$ are the weight hyperparameters of $\mathcal{L}_{\mathrm{adv}}$ and $\mathcal{L}_{\mathrm{shift}}$, respectively.

The generative adversarial loss is used to guide the optimization of the generator and the discriminator [40], as shown in (5):

$$\mathcal{L}_{\mathrm{adv}} = \min_{G}\max_{D}\; \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\bigl[\log D(y)\bigr] + \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\bigl[\log\bigl(1 - D(G(x))\bigr)\bigr], \tag{5}$$

where $\mathbb{E}$ denotes the expected value of the distribution function, and $x$ and $y$ are the pixel distributions of the cloud-covered region and the cloud-free region, respectively.

The $\ell_1$ norm loss is the absolute difference between the ground truth $y$ and the estimated value $G(x)$, which is expressed as follows:

$$\mathcal{L}_{\ell_1} = \bigl\| G(x) - y \bigr\|_1. \tag{6}$$

In the reconstruction network of the encoder-decoder architecture, the shift loss is the constraint between the encoder features $\Phi_{l}$ and the corresponding decoder features $\Phi_{L-l}$, which is expressed as follows:

$$\mathcal{L}_{\mathrm{shift}} = \sum_{y \in \Omega} \bigl\| \bigl(\Phi_{L-l}(I)\bigr)_{y} - \bigl(\Phi_{l}(I^{gt})\bigr)_{y} \bigr\|_{2}^{2}, \tag{7}$$

where $\Omega$ denotes the cloud-covered region, $\Phi_{l}(I^{gt})$ is the encoder feature of the ground truth image at the $l$-th layer, and $\Phi_{L-l}(I)$ is the corresponding decoder feature at the $(L-l)$-th layer.
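A compact PyTorch sketch of how these three terms might be combined during training is given below; the weight values, the cross-entropy form of the adversarial term, and the mean-squared shift term are placeholder assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real, d_fake):
    """Adversarial loss for the discriminator: real images vs. generated images."""
    real = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
    fake = F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    return real + fake

def generator_loss(d_fake, fake, gt, dec_feat, enc_feat,
                   lambda_adv=0.002, lambda_shift=0.01):
    """Generator objective combining the L1, adversarial, and shift terms.

    fake, gt           : generated cloud-free image and ground truth.
    dec_feat, enc_feat : corresponding decoder and encoder feature maps.
    """
    adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))  # fool D
    l1 = F.l1_loss(fake, gt)
    shift = F.mse_loss(dec_feat, enc_feat)
    return l1 + lambda_adv * adv + lambda_shift * shift
```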

After the first shift, the temporal differences in reference images are preliminarily eliminated. The normalized multitemporal images are then used for the training of the reconstruction network in the second shift. Finally, the reconstruction network recovers the information of the cloud-covered region.

3. Experimental Results

In order to test the proposed BSN method, it was compared with traditional cloud removal methods and deep learning methods in real and simulated experiments. The selected traditional methods include exemplar-based methods [38] and information cloning methods [2], which represent spatial-based methods and temporal-based traditional methods, respectively. The compared deep learning methods include style transfer [37], GAN-CA [25], U-Net [42], Shift-Net [29], and SAR-opt-GAN [27]. Then, the ablation experiments are conducted to demonstrate the role of double shifts and loss function. The training data of all methods are the same.

3.1. Experimental Settings

For different datasets, different experimental settings are needed to achieve good results. The details of our experimental settings are given below.

3.1.1. Datasets

Our dataset consists of high-resolution Level-1C Sentinel-2 images between 2019 and 2021 (downloaded from https://www.copernicus.eu/). Only the visible bands (B2, B3, and B4) with a spatial resolution of 10 meters were selected, as shown in Table 2. In the simulated experiments, four datasets containing different feature types such as mountains, rivers, urban buildings, lakes, and oceans were used to validate the effects of cloud removal. In the real experiments, another four datasets verified the cloud removal capability for different types and locations of cloud occlusion.

In the simulated and real experiments, whole-scene images were cropped into samples of size 512 × 512 × 3. The datasets of the simulated experiments are all cloudless images, to which simulated clouds are artificially added. The real experimental datasets include cloud-covered images and cloudless multitemporal images. It is worth noting that having as many temporal images as possible is helpful for cloud removal.

3.1.2. Parameter Settings

Reconstruction network training adopts adaptive moment estimation (Adam) as a gradient descent optimization algorithm [44]. Adam uses gradient first-order moment estimation and second-order moment estimation to dynamically adjust the learning rate. The learning rate is initialized to 0.002, and the number of iterations is set to 1000 epochs.
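As a sketch, the setup below reflects the stated hyperparameters (Adam optimizer, initial learning rate 0.002, 1000 epochs); the beta values and the placeholder modules standing in for the generator and discriminator are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the generator and discriminator defined
# earlier; only the optimizer choice (Adam), the initial learning rate (0.002),
# and the 1000 training epochs come from the text.
generator = nn.Conv2d(3, 3, 3, padding=1)
discriminator = nn.Conv2d(3, 1, 3, padding=1)

opt_g = torch.optim.Adam(generator.parameters(), lr=0.002, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=0.002, betas=(0.5, 0.999))

for epoch in range(1000):
    pass  # alternate discriminator and generator updates over the training batches
```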

DST employed pretrained VGG-19 [45] as a feature extractor. The derivative of the photorealism regularization term is implemented in CUDA for gradient-based optimization. The learning rate of the training is initialized to 0.1 and iterated 1000 epochs.

3.2. Simulated Experiment

In the simulated experiments, comparison experiments and ablation experiments are performed. In the comparison experiments, BSN is compared visually and quantitatively with traditional and deep learning methods. The effects of double shifts and loss function are verified in ablation experiments.

3.2.1. Comparison Experiments

The cloud removal results and detailed enlargements of the simulated experiment are shown in Figure 6. Although the traditional exemplar-based method can reconstruct feature information in cloud-covered regions, the reconstructed features are inconsistent with the ground truth images. For example, in the result of dataset 4 in Figure 6(c), the simulated cloud covers water and land, but the region is incorrectly reconstructed as land. The information cloning method cannot overcome temporal differences between images (e.g., color inconsistency), and its result depends on the quality of the temporal image being cloned. For example, in Figure 6(d), datasets 1 and 2 have good results, while datasets 3 and 4 perform poorly. It can be seen from Figure 6(e) that style transfer can only ensure that the style of the reconstructed region is similar to ground truth, while the content information of the features is inaccurate, and the result is severely affected by the quality of the temporal images. GAN-CA, U-Net, Shift-Net, and SAR-opt-GAN are recent deep learning methods, and their results are shown in Figures 6(f)–6(i). The reconstructed ground objects are highly consistent with ground truth, which is significantly better than the traditional methods. However, the proposed method produces the clearest ground objects in the reconstructed cloud-covered regions, which is further reflected in the following quantitative evaluation.

To better visualize the differences between cloud-removed and ground truth images, an error analysis map is shown in Figure 7. The colors corresponding to the errors can be found in the legend. The red scatter on the error map represents the difference between the restored image and ground truth, with darker red representing a larger difference. White indicates no difference from ground truth. Except for style transfer, the errors of the other methods appear only in the simulated cloud-covered areas; the cloud masks are shown in Figure 7(a). In Figure 7(b), the exemplar-based method has the darkest red color, representing the greatest difference between its results and ground truth. The information cloning method sometimes performs well, but it is not stable. For example, in Figure 7(c), the water region of dataset 3 has severe errors, while the other datasets show good results. From Figure 7(d), it can be found that style transfer does not ensure that the content outside the cloud mask remains unchanged, so the entire image is inaccurate by a large margin. The cloud removal results of the deep learning methods are shown in Figures 7(e)–7(i), where the proposed method has the best performance.

Furthermore, an error scatter plot was also produced to evaluate the error between the reconstructed image and ground truth. The more dispersed scatter distribution indicates the greater difference between the reconstructed image and ground truth. The results in Figure 8 show that the proposed method has the most concentrated error distribution among all methods, and therefore, the results are most similar to ground truth.

Correlation coefficient (CC), structural similarity (SSIM), and peak signal-to-noise ratio (PSNR) are used to evaluate cloud removal results quantitatively in simulated experiments. CC reflects the image correlation between the cloud-free image after cloud removal and the ground truth image, with higher values representing a higher correlation and a maximum of 1. SSIM is a measure of the similarity of two images, and SSIM is equal to 1 when two images are identical. PSNR is the most common and widely used objective measurement to evaluate image quality, and a high PSNR means high image quality after cloud removal. Evaluation metrics for each method are shown in Table 3. In quantitative evaluation, the exemplar-based method and style transfer have the lowest accuracy of cloud removal. This is consistent with the visual results and proves that the image quality after cloud removal is poor. The results of the information cloning method are unstable. For deep learning methods, GAN-CA, U-Net, Shift-Net, SAR-opt-GAN, and the proposed BSN have shown higher accuracies than traditional methods. The proposed method has the highest accuracy in CC, SSIM, and PSNR. This proves that the cloud removal images of the proposed method are most similar to ground truth images and that the proposed method is the most effective.
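For reference, the snippet below shows one way these metrics can be computed with NumPy and scikit-image; the function names and the use of skimage.metrics.structural_similarity (with channel_axis for multiband images) are implementation choices assumed here, not necessarily those used in the paper.

```python
import numpy as np
from skimage.metrics import structural_similarity

def correlation_coefficient(x, y):
    """Pearson correlation coefficient (CC) between two images."""
    x = np.asarray(x, dtype=np.float64).ravel()
    y = np.asarray(y, dtype=np.float64).ravel()
    return np.corrcoef(x, y)[0, 1]

def psnr(x, y, data_range=255.0):
    """Peak signal-to-noise ratio (PSNR) in dB."""
    mse = np.mean((np.asarray(x, np.float64) - np.asarray(y, np.float64)) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def ssim(x, y, data_range=255.0):
    """Structural similarity (SSIM) for H x W x B multiband images."""
    return structural_similarity(x, y, data_range=data_range, channel_axis=-1)
```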

3.2.2. Ablation Experiments

The effects of double shifts, DSC, loss function, and reference images on cloud removal in remote sensing images are demonstrated in ablation experiments.

(1) The Effects of the Double Shift. First, the effects of the double shifts were tested. Ten reference images with large temporal differences from the target cloudy image were adopted as the dataset. Experiments using only the first shift or only the second shift were conducted. It can be seen from Figure 9(c) that when only the first shift was used, the correctness of the reconstructed ground objects could not be guaranteed. When there are large temporal differences between the reference image and the target image, the reconstructed information in the cloud-covered regions usually exhibits temporal differences from the ground truth image. In Figure 9(d), the cloud removal result using only the second shift is visually different from the ground truth image in the cloud-covered region. Meanwhile, the proposed BSN, which includes both the first and second shifts, solves these problems well, as shown in Figure 9(e). In Table 4, the quantitative evaluation of the proposed BSN is also significantly better than using the first shift or the second shift alone. Therefore, the first shift helps eliminate temporal differences, and the second shift guarantees the accuracy of the reconstructed features.

(2) The Effects of Depthwise Separable Convolution. The effects of DSC were tested. The reconstruction network with DSC has a stronger ability to capture details in remote sensing images. To demonstrate this, experiments changing the number of DSC layers in the proposed BSN were performed. The results are shown in Figure 10, and Figure 10(e) shows the result of the proposed method with two layers of DSC. Figures 10(c) and 10(d) show that models with no or one layer of DSC produce blurred and discontinuous boundaries. Figures 10(f) and 10(g) show that more than two layers of DSC do not enhance cloud removal but rather lose some important information, as reflected in the inconsistent spectral information of the river. The proposed method with two layers of DSC demonstrates the best visual results, as shown in Figure 10(e). This shows that a proper amount of DSC can improve the accuracy of cloud removal.

(3) The Effects of the Loss Function. To demonstrate the effects of the loss function on the training of the reconstruction network, ablation experiments on the loss terms were conducted. The loss function of BSN consists of $\mathcal{L}_{\mathrm{adv}}$, $\mathcal{L}_{\ell_1}$, and $\mathcal{L}_{\mathrm{shift}}$. In the ablation experiment, each of these three terms was removed separately and the cloud removal results were evaluated quantitatively. The same training datasets and 1000 training epochs were used. Table 5 shows that removing any of the three terms reduces the accuracy of the cloud removal results. The loss function used in the proposed BSN achieves the highest accuracy in CC, SSIM, and PSNR.

(4) The Effects of Reference Images. As a temporal-based method, the cloud removal capability of BSN is affected by the number of reference images. Normally, more reference images usually mean more valid reference information. An experiment on the effect of the number of reference images was conducted. BSN requires at least one reference image to ensure the availability of the network. The effect of reference images with different numbers on the accuracy is shown in Table 6. It can be seen in the experiment that more reference images are helpful for the improvement of accuracy. Therefore, it is desirable to ensure sufficient temporal reference images. If data acquisition is limited, a small number of temporal images can also finish reconstruction. As a trade-off between reconstruction accuracy and the difficulty of data acquisition, ten multitemporal reference images show a good performance in this experiment. Considering that not many reference images are generally used in practical applications and that there may be more adverse effects of sudden changes, we have not conducted more tests.

3.3. Real Experiment

In the real experiments, cloudy images are reconstructed directly. It is important to note that ground truth does not exist for the cloudy images; therefore, the results can only be judged visually. The cloud removal results of the real experiment are shown in Figure 11. The traditional exemplar-based method can fill in cloud-covered regions, but not with the actual ground features; in Figure 11(b), it often misses the roads in the images. In Figure 11(c), information cloning does not always perform well, and sometimes there are serious spectral errors. As can be seen in Figure 11(d), in the result of style transfer, not only is the overall spectral information inconsistent with the target, but the pixels in non-cloud-covered regions are also altered. Style transfer and information cloning do not perform well in eliminating temporal differences, and the images show severe chromatic aberrations. GAN-CA cannot accurately capture the local details of the image when removing clouds, and it easily confuses the relationships between different features (such as bareland, vegetation, and buildings); moreover, its output has low definition, which is obvious in the enlargement shown in Figure 11(e). U-Net has difficulty recovering complex textures, such as regions where vegetation and mountains are interspersed, as shown in Figure 11(f). In Figure 11(g), the cloud removal results of Shift-Net are better than those of the previous methods, and the visual effects are quite good. Although the SAR-opt-GAN results are acceptable overall, they show temporal differences in dataset 2 and blurring in dataset 3, as shown in Figure 11(h). BSN has clear advantages over the other traditional and recent cloud removal methods: the cloud-free images it reconstructs show reasonable global semantics and local details, accurate feature classes, trace-free boundaries, and high resolution, as shown in Figure 11(i).

To further test and evaluate the cloud removal capability of the proposed BSN, experiments were conducted on four additional real remote sensing images. The results are shown in Figure 12, which includes the cloud-contaminated images, the cloud-removed images, and multitemporal images of the same region as references. Comparing the reconstructed images with the cloudy images, the reconstructed regions are spectrally consistent with the cloud-free regions of the cloudy images. Spatially, the cloud-removed images have the same features and textures as the temporal images in the reconstructed regions. It can therefore be seen that the proposed BSN is effective for cloud removal and performs well in both global semantics and texture details.

4. Conclusions

In this article, a BSN based on two shifts was proposed for cloud removal in optical remote sensing images. In the first shift, MM and DST are used: MM preliminarily normalizes the multitemporal images, and DST further eliminates temporal differences. The preprocessing in the first shift improves the efficiency of the second shift and the accuracy of cloud removal. In the second shift, the reconstruction network is used to remove clouds from cloud-covered images. In order to improve the reconstruction network's ability to extract local details and maintain global consistency, shift connections and DSC are introduced. In both simulated and real experiments, the proposed method shows obvious advantages over traditional methods and deep learning methods in terms of accuracy and visual effects. The ablation experiments also demonstrate the roles of the double shift and the loss function. BSN can effectively remove clouds from optical remote sensing images, thereby increasing their effective information content.

The advantage of the proposed method is that the original spectral information of the image is maintained when clouds are removed, which provides a good visual effect. Although the proposed BSN is effective at removing thick clouds, it still has some limitations. For example, it requires cloudless multitemporal images as reference data, and the quality of the results is affected by these multitemporal images. Considering that clouds occupy different locations in different cloud-covered images, several images could be processed jointly for simultaneous cloud removal in the future. At present, some researchers have also combined optical and SAR images based on deep learning to improve cloud removal [11, 12]. For sudden changes in multitemporal remote sensing images, such as new buildings and man-made landscapes, the envisioned future strategy is to use multisource remote sensing images (e.g., SAR images) as auxiliary data to improve accuracy and reliability.

Data Availability

Our dataset consists of high-resolution Level-1C Sentinel-2 images between 2019 and 2021 (downloaded from https://copernicus.eu/).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Chaojun Long was responsible for methodology and writing the manuscript. Xinghua Li was responsible for conceptualization and framework design. Yinghong Jing was responsible for experimental discussion. Huanfeng Shen was responsible for supervision and proofreading.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC) under Grant no. 42171302 and the Open Fund of Hubei Luojia Laboratory under Grant no. 220100055.