Clouds and snow in optical remote sensing images interfere with image interpretation and can even render an entire image unusable. The proportion of cloud/snow cover therefore needs to be determined to improve the utilization of remote sensing images. The metadata of remote sensing image products contains prior spatiotemporal knowledge, such as imaging time, latitude and longitude, and altitude. This paper proposes a remote sensing image cloud/snow detection method that fuses spatial and temporal information. The proposed method encodes the spatiotemporal information, extracts features from it, and concatenates them with the image features, thus improving the accuracy of remote sensing image cloud/snow detection. In this study, the proposed method is trained and tested with a large-scale cloud/snow image dataset. The experimental results show that temporal or spatial information alone, as well as fused temporal and spatial information, improves the cloud/snow detection accuracy in remote sensing images. The easy-to-obtain imaging time information alone also significantly improves the detection accuracy for cloud/snow. The proposed method can be used to improve cloud/snow detection for any remote sensing image product that carries prior spatiotemporal knowledge and has good application prospects.

1. Introduction

Optical remote sensing images have been widely used for their advantages, such as large information capacity and stable geometric properties. In recent years, with the development of telemetry technology, such as the improvement of UAV endurance in [1], a large number of remote sensing images have been put into use. However, the optical imaging process is susceptible to interference from clouds and snow. On the one hand, clouds occlude the ground cover, with about 1/3 of the ground surface covered by clouds [2], and this occlusion greatly limits the application of optical images. On the other hand, the spectral characteristics of clouds and snow have many similarities, which makes them prone to misclassification in the image classification process. These two factors greatly hamper the use and analysis of remote sensing images. In addition, accurate cloud and snow detection serves downstream tasks; for example, it supports the inversion of atmospheric aerosols [3, 4] and the reconstruction of missing image information [5, 6]. Therefore, accurately and automatically detecting clouds and snow from remote sensing images is of great importance. For example, achieving high-precision cloud detection before satellite image transmission makes it possible to identify and discard heavily clouded images in time, thus saving storage space and improving time efficiency.

To reject high-cloudiness images with low information density and improve the efficiency of image utilization, accurate identification of clouds/snow in remote sensing images has been a basic and important research topic. The current cloud/snow automatic detection algorithms include three main types: (1) physical model-based methods, (2) statistical model-based methods, and (3) deep learning-based methods.

Traditional cloud/snow detection methods are mainly based on physical models. These methods use the reflectance of a particular image band, or the ratio between bands, to identify cloud/snow. One of the most typical methods is the automatic cloud coverage assessment [7, 8], which operates on bands 2 to 6 of Landsat-7 ETM+ imagery and uses the spectral differences that exist between clouds and snow in different geographical environments to distinguish them. The normalized difference snow index [9], which separates cloud and snow pixels by calculating the ratio of the difference between the green band and the shortwave infrared band to their sum, has also been proposed. In addition, an enhanced multitemporal cloud detection algorithm [10] was designed to solve the problem of the overdetection of cloud in snow-covered regions. The physical model-based approach, while effective in obtaining cloud/snow masks of images without relying on pixel-level labels for training, is strongly influenced by the reflectance of the image bands and predefined thresholds.

In recent years, big data has driven the development of statistical models [11]. The statistical model-based approaches typically handle cloud/snow detection tasks in a classification paradigm. Current popular classifiers, such as support vector machines (SVMs) [12] and random forests [13], are commonly used for such tasks. In [14], Amato applied principal component analysis to remote sensing image cloud detection based on statistical theory. In [15], Merchant proposed a cloud detection algorithm based on full probability Bayesian theory. In [16, 17], Bai and Li performed SVM classification based on multiple texture features. Methods based on statistical models can utilize a priori knowledge, such as spatial pattern information, and improve the utilization of remote sensing image information via segmentation models. Moreover, the use of image features greatly reduces the dependence of traditional cloud/snow detection methods on spectral characteristics.

In recent years, deep learning-based methods have shown powerful capabilities in remote sensing feature extraction and classification tasks [18]. These methods treat the task of cloud/snow detection as a pixel-by-pixel image classification process (the pixels are usually divided into three categories: “cloud,” “snow,” and “background”) [19]. In [20], Lei segmented an image into a set of superpixels and then used a neural network to classify these superpixels. In [21], Long and Shelhamer introduced fully convolutional networks (FCNs) for cloud/snow detection. In [22], Nie and Xu, respectively, introduced DeepLabV3+ for cloud/snow detection. All these approaches implement cloud/snow detection as a deep learning-based semantic segmentation task. To address the characteristics and difficulties of the cloud/snow detection task itself, the multiscale convolutional feature fusion method [23] and the multiscale fusion gated cloud detection model MFGNet [24] have also been proposed. They couple the multiscale features of remote sensing images, which effectively improves the cloud detection accuracy. At present, methods based on deep learning can mine the multidimensional and deep-level features of clouds and snow, thereby enhancing the distinguishability of clouds and snow.

In addition to the spectral properties of the cloud/snow itself, the spatiotemporal information carried in images also helps in cloud/snow detection. In fact, geographic information (such as elevation, latitude and longitude, and imaging time) is often present in remote sensing images as basic metadata records. In cloud/snow detection, altitude and latitude are important a priori information. For example, in some low-altitude or low-latitude regions, snow is unlikely to exist, and the visual appearance of clouds generated in different geographic regions may also differ. To use spatial information to guide the detection of clouds and snow in remote sensing imagery, a geographic information-driven method (GeoInfoNet) for remote sensing cloud/snow detection was proposed in [25]. Unlike previous methods of detection based solely on image data, this method encodes the elevation and spatial location of the image into a set of geographic knowledge-aided maps and then integrates these maps containing spatial information into the feature extraction network to assist in cloud and snow detection. In [26], Wu and Shi proposed the scene aggregation network, which fuses scene information with remote sensing images to perform scene classification while achieving cloud detection. Although the above models confirm the effectiveness of spatial prior knowledge in driving cloud/snow detection, they ignore the effect of temporal prior knowledge on the cloud/snow distribution. The imaging time, which is carried by almost all remote sensing images, can complement the spatial information. For example, snow at low altitudes occurs more often in winter and rarely in summer; as latitude rises, snow persists for a longer part of each year; clouds occur more often in the rainy season. The temporal information should be fully utilized to reduce false detection of cloud/snow effectively. Therefore, the fusion of spatiotemporal information can provide richer and more reliable a priori knowledge for cloud/snow detection tasks.

In view of this, this study builds on existing work and incorporates temporal information to guide cloud/snow detection. Specifically, this study constructs a cloud/snow detection model integrating temporal and spatial information to provide a feasible solution for cloud/snow detection in remote sensing image processing. We conduct ablation experiments in which different combinations of spatiotemporal information are used as prior knowledge to verify the effects of the various types of spatiotemporal prior knowledge on cloud/snow detection. The remainder of this paper is organized as follows: Section 2 introduces the specific structure of the model; Section 3 introduces the dataset and evaluation indicators and analyzes and evaluates the results; Section 4 draws conclusions and discusses future research directions.

2. Methodology

The model in this paper is a two-branch feature extraction network, as shown in Figure 1. First, spatiotemporal information is encoded into a geographic knowledge-aided map by using a geographic information encoder. Then, the original image and the geographic knowledge-aided map are fed into the two branches of the feature extraction network to obtain hierarchical features. Finally, the cloud/snow detection results are obtained by fusing the features of the two branches through a dual-feature concatenation module.

2.1. Geographic Information Encoder

Specifically, to make full use of the a priori geographic knowledge from satellite remote sensing images, the spatial a priori knowledge is encoded into a spatial knowledge-aided map by using a geographic information encoder [25]. To investigate the effect of temporal prior knowledge on cloud/snow distribution, we extend this cloud/snow detection network by also encoding the imaging time and integrating it into the geographic knowledge-aided map used for feature extraction. In this way, the imaging time can be integrated into the network as a priori knowledge, enabling the model to couple the spatiotemporal a priori knowledge to aid cloud/snow detection from remote sensing images.

The geographic information encoder can be considered the preprocessing module of the model. It maps the longitude and latitude information of the remote sensing image onto the pixel grid of the image through an affine transformation and obtains longitude and latitude maps of the same size as the image. Then, the longitude and latitude maps are combined with the DEM, upsampled to a consistent resolution, to obtain a spatial knowledge-aided map of the satellite remote sensing image. Following the same idea as the spatial information encoding, the time information can be converted into a global feature. The imaging time, expressed as the day of the year, is normalized by dividing it by the total number of days in the year; time encoded in this way represents the season in which the image was acquired. We can then integrate the temporal parameter as a channel with the spatial knowledge-aided map to obtain a geographic knowledge-aided map integrating spatiotemporal information. The cloud/snow detection of remote sensing images is then assisted by mining the spatiotemporal a priori knowledge in the geographic knowledge-aided map.

Specifically, in each channel of the geographic knowledge-aided map, for a pixel in row y and column x, the corresponding elevation is taken from the resampled DEM, while the corresponding longitude, latitude, and time are calculated as follows:
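A minimal sketch of these quantities, assuming the longitude/latitude mapping is a GDAL-style six-coefficient affine geotransform. The symbols E (resampled DEM), g_0 to g_5 (affine coefficients), d (day of the year), and D (number of days in that year) are notation introduced here for illustration and may differ from the original equations:

```latex
\[
\begin{aligned}
A_{\mathrm{elev}}(x, y) &= E(x, y), \\
A_{\mathrm{lon}}(x, y)  &= g_0 + g_1 x + g_2 y, \\
A_{\mathrm{lat}}(x, y)  &= g_3 + g_4 x + g_5 y, \\
A_{\mathrm{time}}(x, y) &= \frac{d}{D}.
\end{aligned}
\]
```

Under this reading, the time channel is constant over the image, which matches its description as a global feature.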

The final coded auxiliary map A for each input satellite remote sensing image is obtained by concatenating the four geographic knowledge-aided channels described above along the channel dimension.
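The encoding steps above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the encoder used in the experiments: the function name `encode_aux_map`, the GDAL-style coefficient order, and the toy values are all assumptions, and nested lists stand in for the image arrays that an array library would normally provide.

```python
from datetime import date


def encode_aux_map(dem, geotransform, imaging_date):
    """Build a 4-channel auxiliary map [elevation, longitude, latitude, time].

    dem          -- 2-D list (rows x cols) of elevations, already resampled
                    to the image grid
    geotransform -- (g0, g1, g2, g3, g4, g5), assumed affine coefficients:
                    lon = g0 + g1*x + g2*y, lat = g3 + g4*x + g5*y
    imaging_date -- datetime.date of the acquisition
    """
    g0, g1, g2, g3, g4, g5 = geotransform
    rows, cols = len(dem), len(dem[0])

    # Normalise the day of the year by the length of that year (365 or 366).
    year = imaging_date.year
    days_in_year = (date(year + 1, 1, 1) - date(year, 1, 1)).days
    t = imaging_date.timetuple().tm_yday / days_in_year

    elevation = [list(row) for row in dem]
    longitude = [[g0 + g1 * x + g2 * y for x in range(cols)] for y in range(rows)]
    latitude = [[g3 + g4 * x + g5 * y for x in range(cols)] for y in range(rows)]
    # The time is a single global value repeated at every pixel.
    time_channel = [[t] * cols for _ in range(rows)]

    # Stack the four channels along the channel dimension.
    return [elevation, longitude, latitude, time_channel]


# Hypothetical 2 x 3 scene: toy DEM, toy geotransform, toy acquisition date.
dem = [[120.0, 125.0, 130.0], [118.0, 122.0, 127.0]]
aux = encode_aux_map(dem, (100.0, 0.5, 0.0, 40.0, 0.0, -0.5), date(2016, 7, 1))
print(len(aux), len(aux[0]), len(aux[0][0]))
```

The sketch leaves elevation, longitude, and latitude in their raw units; any rescaling of those channels before they enter the network is not specified in the text and is therefore omitted.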

To fully exploit the spatiotemporal correlation knowledge embedded in the geographic knowledge-aided map, we first use a two-branch network with the DenseNet121 [27] structure as the backbone to extract dense multiscale features from the input image and the geographic knowledge-aided map separately and then fuse the features extracted from both branches. The two basic modules are “DenseNet-based feature extraction” and “dual-feature concatenation.” The role of the former is to mine the deep features of the original image and the geographic knowledge-aided map, and the role of the latter is to concatenate and fuse the features of the two branches and generate a high-precision cloud/snow mask segmentation result.

2.2. DenseNet-Based Feature Extraction

To fully exploit the spatiotemporal correlation knowledge embedded in the geographic knowledge-aided map, the original image and the geographic knowledge-aided map, which encodes four types of spatiotemporal prior knowledge, are input to the dense feature extraction module. The module is capable of mining deep features of the image and of the spatiotemporal prior knowledge and using them for the prediction of cloud/snow masks.

Considering the balance of computational efficiency and GPU memory cost, the dense feature extraction module uses DenseNet121 (structure shown in Table 1) as the backbone network for feature extraction, which consists of multiple dense blocks. In each block, the feature maps from all previous convolutional layers are concatenated. Formally, the feature map M_{l+1} in the (l + 1)th layer can be expressed as (4), where (⋅) represents the nonlinear transformation on the feature:
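For reference, dense connectivity of this kind is usually written as follows. The name H(⋅) for the nonlinear transformation (typically batch normalization, ReLU, and convolution) and the bracket notation for channel-wise concatenation are assumed here:

```latex
\begin{equation}
M_{l+1} = H\big([M_0, M_1, \ldots, M_l]\big) \tag{4}
\end{equation}
```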

The module receives an input feature map and first performs feature extraction via a convolutional layer (“Conv_0”) and a pooling layer (“Pool_0”); the result is then fed sequentially into the four dense blocks and three transformation blocks for processing. In this feature extraction process, the number of output feature maps increases as the network deepens. The dense feature extraction module enables the full exploitation of the feature maps, i.e., of the spatiotemporal a priori knowledge.
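The growth in the number of feature maps can be made concrete with a short calculation. The sketch below assumes the standard DenseNet121 configuration (64 initial maps, growth rate 32, dense blocks of 6, 12, 24, and 16 layers, and transition layers that halve the channel count); the configuration in Table 1 may differ, and the function name is hypothetical.

```python
def densenet_channel_schedule(block_layers=(6, 12, 24, 16), growth_rate=32,
                              init_channels=64, compression=0.5):
    """Return the number of feature maps after the stem and after each dense block.

    Each dense layer appends `growth_rate` maps to the concatenated input,
    and each transition layer keeps a `compression` fraction of the maps.
    """
    channels = init_channels
    schedule = [channels]                     # output of Conv_0 / Pool_0
    for i, n_layers in enumerate(block_layers):
        channels += n_layers * growth_rate    # concatenation inside the block
        schedule.append(channels)
        if i < len(block_layers) - 1:         # no transition after the last block
            channels = int(channels * compression)
    return schedule


print(densenet_channel_schedule())
```

With these assumed settings, the stem and the four dense blocks yield five feature levels, which is consistent with the subscripts 0–4 used for the concatenated features in Section 2.3.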

2.3. Dual Feature Concatenation

The DenseNet-based feature extraction module enables the extraction of fine-grained feature representations required for cloud/snow mask segmentation. To apply the spatiotemporal a priori knowledge extracted from the auxiliary map branch to the cloud/snow detection task, the image features need to be fused with the spatiotemporal a priori knowledge. Therefore, this module is connected to the DenseNet-based feature extraction module (as shown in Figure 2) and fuses spatiotemporal prior knowledge with high-level features of cloud/snow by merging feature maps from different blocks to obtain segmentation results of high-precision cloud/snow masks.

The image feature extraction branch takes the blue, green, red, and infrared bands as input, and the auxiliary map feature extraction branch takes the coded auxiliary map. Within each branch, the feature maps extracted by different blocks have different resolutions. Thus, the feature maps of each block are upsampled to the size of the input image through bilinear interpolation and then concatenated along the channel dimension. Before concatenation, a 1 × 1 convolution is used to adjust the channel size of each block’s features so that they have the same number of channels.

All blocks in both branches are concatenated to obtain the concatenated feature M, which can be expressed as M = concat{}. The subscripts 0–4 denote the upsampled features of the feature maps from each stage of the dense feature extraction module, and the spatiotemporal knowledge is fused through the concatenation of the feature maps. Finally, the feature maps incorporating spatiotemporal information are fed into a convolutional layer with a 1 × 1 filter to generate pixel-level score maps for three classes: background S1, cloud S2, and snow S3.
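The upsample-then-concatenate step can be sketched as follows. This is a simplified illustration in plain Python, not the module used in the experiments: the function names are hypothetical, the corner-aligned interpolation convention is an assumption, and the 1 × 1 convolutions (both the per-block channel adjustment and the final classifier) are left out.

```python
def bilinear_upsample(fmap, out_h, out_w):
    """Bilinearly resize a 2-D list `fmap` to out_h x out_w (corner-aligned)."""
    in_h, in_w = len(fmap), len(fmap[0])
    out = []
    for i in range(out_h):
        # Map the output row back to a fractional source row.
        sy = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        wy = sy - y0
        row = []
        for j in range(out_w):
            sx = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            wx = sx - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = fmap[y0][x0] * (1 - wx) + fmap[y0][x1] * wx
            bottom = fmap[y1][x0] * (1 - wx) + fmap[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def concat_features(branches, out_h, out_w):
    """Upsample every channel of every block in every branch and stack them."""
    fused = []
    for blocks in branches:          # e.g. [image_branch, auxiliary_branch]
        for channels in blocks:      # one entry per block: a list of 2-D maps
            fused.extend(bilinear_upsample(c, out_h, out_w) for c in channels)
    return fused


# Hypothetical toy input: one block per branch, one channel per block.
image_branch = [[[[0.0, 1.0], [2.0, 3.0]]]]   # a 2 x 2 feature map
aux_branch = [[[[5.0]]]]                      # a 1 x 1 feature map
fused = concat_features([image_branch, aux_branch], 3, 3)
print(len(fused), len(fused[0]), len(fused[0][0]))
```

Because every map is brought to the input resolution before stacking, features from blocks of different depth and from both branches line up pixel by pixel, which is what allows a single 1 × 1 convolution to produce the per-pixel class scores.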

2.4. Loss Settings

The output score maps are normalized by using a softmax function, which converts the pixel scores to probabilities in [0, 1]. The probability map Pt of each class t ∈ {1, 2, 3} can be expressed as follows:
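In the notation of this section, the standard softmax over the three score maps takes the following form, where the pixel index (x, y) is written out explicitly as an assumption about how the maps are indexed:

```latex
\[
P_t(x, y) = \frac{\exp\big(S_t(x, y)\big)}{\sum_{k=1}^{3} \exp\big(S_k(x, y)\big)},
\qquad t \in \{1, 2, 3\}.
\]
```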

Finally, the cross-entropy loss is used as the loss function of the network. Suppose ym ∈ {0, 1} represents the ground truth label of the class m. The loss function is expressed as follows:
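For a single pixel, the usual cross-entropy is the negative sum over the three classes of ym log Pm; whether the loss is averaged or summed over pixels is not stated here, so the sketch below is restricted to one pixel. The function names and the toy scores are illustrative assumptions.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, one_hot, eps=1e-12):
    """Per-pixel cross-entropy: -sum_m y_m * log(P_m)."""
    return -sum(y * math.log(p + eps) for p, y in zip(probs, one_hot))


scores = [2.0, 1.0, 0.1]                   # hypothetical scores S1, S2, S3 for one pixel
probs = softmax(scores)
loss = cross_entropy(probs, [1, 0, 0])     # ground truth: background
print([round(p, 3) for p in probs], round(loss, 3))
```

Since the label vector is one-hot, only the probability assigned to the true class contributes, so the loss falls toward zero as that probability approaches 1.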

3. Results and Analysis

3.1. Experimental Design and Dataset

To investigate the role of spatiotemporal prior knowledge in remote sensing image cloud/snow detection, this study conducts experiments on how the model affects the cloud/snow mask extraction accuracy when different spatiotemporal information is added. The effect of cloud/snow mask extraction on remote sensing images assisted by spatiotemporal a priori knowledge is verified through comparative experiments.

In this study, we use the large-scale cloud/snow detection dataset “Levir_CS” [25] for model training, which contains a total of 4168 GF-1 WFV scenes. The scenes in the dataset are distributed across the globe, as shown in Figure 3. Various types of topographical features are taken into account, such as plains, plateaus, water bodies, deserts, and glaciers. Complex landforms formed by different combinations of these features also exist. Figure 4 shows example scenes. Given that these scenes are globally distributed, they cover multiple types of climatic conditions, such as desert climate or ocean climate. All scenes were imaged from May 2013 to February 2019 and are available for download at http://www.cresda.com/.

In this dataset, level-1A product data, which are radiometrically calibrated but not systematically geometrically corrected, are used for each scene to improve time efficiency, as needed for cloud/snow detection in real-world situations. The pixel-level label masks of all images in the dataset are divided into three categories: “background” (labeled 0), “cloud” (labeled 128), and “snow” (labeled 255). Background pixels are the most numerous (79.2%), cloud pixels account for 18.6% of the total, and snow pixels are the least numerous (2.2%). There are 3068 images for training and 1100 for testing. The following observations can be obtained through visual interpretation:
(1) The three types of pixels are remarkably different in various locations. In particular, the background varies a lot.
(2) Clouds are common in different geographical locations.
(3) In terms of latitude, most of the snow is concentrated in high latitudes (see Figure 4(f) for an example). There is almost no snow in the equatorial regions (see Figures 4(a), 4(c), and 4(e) for examples).
(4) In terms of altitude, the cloud amount is higher in the region below 500 m (see Figures 4(a), 4(b), and 4(e) for examples), and the snow accumulation gradually increases with increasing altitude (see Figures 4(d) and 4(f) for examples). High-altitude areas are usually mountainous, while the snow accumulation varies regularly with the seasons.
(5) In terms of imaging time, snow at low altitudes and low latitudes often occurs in winter (see Figure 4(f) for an example), but rarely in summer (see Figure 4(c) for an example).
From the above statistics, the use of spatial and temporal information is necessary for cloud/snow detection.
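The label encoding described above (0, 128, and 255) has to be mapped to consecutive class indices before a three-class loss can be applied, and the same mapping gives the per-class pixel shares. A minimal sketch follows, in which the helper names and the toy mask are hypothetical:

```python
from collections import Counter

LABEL_TO_CLASS = {0: 0, 128: 1, 255: 2}            # mask value -> class index
CLASS_NAMES = ("background", "cloud", "snow")


def mask_to_classes(mask):
    """Map a 2-D list of mask values (0/128/255) to class indices (0/1/2)."""
    return [[LABEL_TO_CLASS[v] for v in row] for row in mask]


def class_fractions(mask):
    """Return the fraction of pixels that belongs to each class name."""
    counts = Counter(c for row in mask_to_classes(mask) for c in row)
    total = sum(counts.values())
    return {name: counts.get(i, 0) / total for i, name in enumerate(CLASS_NAMES)}


toy_mask = [[0, 0, 128, 128], [0, 0, 0, 255]]       # hypothetical 2 x 4 label mask
print(class_fractions(toy_mask))
```

Applied to every mask in the dataset, a count of this kind would produce the class shares quoted above; the strong imbalance toward background is one reason why per-class IoU is reported in addition to overall accuracy.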

3.2. Parameters and Evaluation Indexes

In the experiments, different prior knowledge is encoded, combined and input to the auxiliary branch for training to explore the role of spatiotemporal prior knowledge in remote sensing image cloud/snow detection. The model is trained for 200 epochs to reach convergence.

The scenes in the dataset are randomly divided into two subsets: 3068 scenes for the training set and 1100 scenes for the test set.

In this study, the prediction results are evaluated quantitatively by using recall, precision, F1-score, accuracy, and IoU. The definition and calculation formula of each parameter are as follows:
True positives (TP): the number of samples that are actually positive and correctly classified as positive by the classifier
False positives (FP): the number of samples that are actually negative but are incorrectly classified as positive by the classifier
False negatives (FN): the number of samples that are actually positive but are incorrectly classified as negative by the classifier
True negatives (TN): the number of samples that are actually negative and correctly classified as negative by the classifier

IoU defines the overlap between the labeled and predicted regions.
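The five metrics follow from the four counts in the usual way: recall is TP / (TP + FN), precision is TP / (TP + FP), F1 is the harmonic mean of the two, accuracy is (TP + TN) divided by all samples, and IoU is TP / (TP + FP + FN). The sketch below applies these standard definitions to one class treated as positive; the function names and the toy counts are illustrative assumptions.

```python
def confusion_counts(pred, truth, positive):
    """Count TP, FP, FN, TN over flat label lists, treating one class as positive."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp, fp, fn, tn):
    """Compute recall, precision, F1-score, accuracy, and IoU from the counts."""
    recall = safe_div(tp, tp + fn)
    precision = safe_div(tp, tp + fp)
    return {
        "recall": recall,
        "precision": precision,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "iou": safe_div(tp, tp + fp + fn),
    }


# Hypothetical counts for the "cloud" class over 1000 pixels.
metrics = binary_metrics(tp=80, fp=10, fn=20, tn=890)
print({k: round(v, 4) for k, v in metrics.items()})
```

IoU ignores true negatives, so for a rare class such as snow it is a much stricter measure than accuracy, which is dominated by the background.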

3.3. Results Evaluation

The results show that the inclusion of different auxiliary information in the cloud/snow detection task is effective. Compared with the advanced cloud/snow detection methods, our method also shows excellent performance and achieves the best accuracy and IoU, and the time cost is acceptable, as shown in Table 2.

Tables 3–5 show the accuracy of cloud/snow detection in remote sensing images aided by different spatiotemporal a priori knowledge. The background occupies most of the area in the remote sensing images, and there is a large gap between the background and the clouds and snow. Thus, the background achieves a high detection accuracy even without prior knowledge. The introduction of prior knowledge slightly improves the IoU of background detection, and the model incorporating spatiotemporal information achieves the highest detection accuracy, as shown in Table 3.

For the detection of clouds in images, temporal or spatial information alone is of limited use in improving accuracy. Among the single-prior models, the one that introduces the DEM as a priori knowledge yields the largest improvement, raising IoU by 0.55%. Meanwhile, the model incorporating spatiotemporal information improves IoU by 1.16%. Table 4 lists the detection accuracy of clouds in images; its results demonstrate that the joint spatiotemporal information can effectively improve the detection accuracy for clouds in remote sensing images.

Although snow accounts for the smallest share of the image pixels, introducing any single type of spatiotemporal prior knowledge effectively improves the snow detection accuracy. That is, a strong correlation exists between snow distribution and both spatial and temporal properties, which is in line with intuition. The model incorporating spatiotemporal information obtains the highest accuracy, with an 11-point improvement in IoU (from 60% to 71%). As shown in Table 5, this finding further confirms the effectiveness of the fusion of spatiotemporal information in improving the accuracy of cloud/snow detection.

Based on the above quantitative analysis of the experimental results, training with a single spatiotemporal information-aided map can improve cloud/snow detection compared with using images alone. Deep convolutional neural networks can mine the correlation between spatiotemporal a priori information and the distribution of clouds and snow. From Table 4, the use of temporal information alone is less effective for cloud detection, although some improvement is realized. By contrast, the addition of temporal information alone improves the snow IoU by 7 points (from 60% to 67%), a very significant improvement. The extent to which a single type of spatial information improves cloud detection accuracy is similar to that of temporal information alone. The reason may be that the IoU for cloud detection is already high and stable at around 90%, with limited room for improvement. However, Table 5 indicates that the addition of any of the spatiotemporal information components improves the accuracy of snow detection by about 7 points. Specifically, simply adding latitude and longitude information yields a gain of as much as 10 points. Accordingly, the occurrence of snow shows a strong correlation with the latitude and longitude where it is located, and it can be speculated that the climatic zones of different latitudes cause this difference.

From the comparison experiments with different combinations of spatiotemporal information-aided components, the model with fused spatiotemporal information obtains the highest precision for cloud/snow detection. This result shows that cloud/snow detection with only spatial information can be further improved by introducing inexpensive and easily accessible temporal information. In summary, our proposed remote sensing image cloud/snow detection method incorporating spatiotemporal information achieves the highest accuracy in cloud/snow detection, with IoU reaching 91% for cloud and 71% for snow, which is a stable performance improvement compared with the method in [25] that incorporates only spatial knowledge.

The results presented in Figure 5 further corroborate the conclusions of the quantitative analysis. On the one hand, individual temporal or spatial information can be useful for cloud/snow detection. The most accessible temporal information can effectively improve cloud detection, while the spatial information contributes most to the improvement of snow detection accuracy. On the other hand, the models incorporating spatiotemporal prior knowledge have the best results for cloud/snow detection. In particular, the introduction of temporal information can effectively improve the detection of clouds in images, as shown in the red box in Figure 5. In the high-altitude region shown in the blue box, clouds have a high probability of being misdetected as snow when only a temporal prior or only a spatial prior is introduced. The fusion of temporal and geographical information is complementary and effectively alleviates the problem of cloud/snow false detection.

4. Conclusion

Inspired by the fact that cloud/snow has a strong seasonal a priori, this paper proposes a cloud/snow detection model integrating temporal and geographical information based on existing research on geographic knowledge-driven cloud/snow detection. Among all the a priori knowledge, temporal information contributes the most to improving cloud detection accuracy, while latitude and longitude information contributes the most to improving snow detection accuracy, and fusing temporal and spatial information obtains the highest accuracy. This research can effectively complement the existing algorithms that do not fully utilize the a priori knowledge of imaging time. This method is expected to be used for the fast detection of high cloud cover images driven by temporal information and for cloud/snow detection that fuses spatiotemporal information. Our method also has some limitations. For example, it only performs a simple concatenation of spatiotemporal features and does not explore their correlations.

In the future, research can be further improved from several perspectives, including incorporating richer a priori knowledge, such as scene information, to assist remote sensing images for cloud/snow detection, as well as introducing an attention mechanism to explore the more essential and deeper relationship between cloud/snow and a priori knowledge in remote sensing images.

Data Availability

The image dataset used to support the findings of this study can be downloaded from https://github.com/permanentCH5/GeoInfoNet. The dataset is cited at relevant places within the text as reference [25].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.


Acknowledgments

This work was supported in part by the Natural Science Foundation of Hunan Province, China (Grant no. 2020JJ4691).