Advanced Sensor Technologies in Geospatial Sciences and Engineering 2020
Research Article
Spatiotemporal Fusion of Remote Sensing Image Based on Deep Learning
Abstract
High spatial and temporal resolution remote sensing data play an important role in monitoring rapid changes of the Earth's surface. However, there is an inherent trade-off between the spatial and temporal resolutions of remote sensing images acquired by the same sensor. Spatiotemporal fusion of remote sensing data is an effective way to resolve this trade-off. In this paper, we study a spatiotemporal fusion method based on the convolutional neural network, which fuses Landsat data (high spatial but low temporal resolution) with MODIS data (low spatial but high temporal resolution) to generate time-series data with high spatial resolution. To improve the accuracy of spatiotemporal fusion, a residual convolutional neural network is proposed: the MODIS image is used as the input to predict the residual image between the MODIS and Landsat images, and the sum of the predicted residual image and the MODIS data is taken as the predicted Landsat-like image. The residual design not only increases the depth of the super-resolution network but also avoids the vanishing-gradient problem caused by deep network structures. The experimental results show that the prediction accuracy of our method exceeds that of several mainstream methods.
1. Introduction
Due to limitations of remote sensing satellite hardware and the cost of satellite launches, it is difficult for the same satellite to obtain remote sensing images with both high spatial and high temporal resolution. The Landsat series of satellites can obtain multispectral data with a spatial resolution of 30 m. Multispectral images reflect the spectral information of ground features; unlike hyperspectral images, whose many bands usually require dimensionality reduction before classification and other processing (and for which many dimensionality reduction methods exist [1]), multispectral imaging is more convenient, making it widely used in many fields. Accordingly, Landsat data have been widely used in earth resource exploration; agricultural, forestry, and animal husbandry management; and natural disaster and environmental pollution monitoring [2–4]. However, the 16-day revisit cycle of the Landsat satellites and the impact of cloud contamination limit their potential use in monitoring and researching land surface dynamics. On the other hand, the Moderate-Resolution Imaging Spectroradiometer (MODIS) on the Terra/Aqua satellites has a revisit cycle of 1–2 days; this high temporal resolution makes it applicable to vegetation phenology [5, 6] and other fields. However, the spatial resolution of MODIS data is 250–1000 m, which represents the details of ground objects poorly and is not sufficient for observing heterogeneous landscapes.
In 1995, Vignolles et al. [7] first proposed generating high spatiotemporal resolution data by spatiotemporal fusion technology. Since then, various types of spatiotemporal fusion methods have emerged. Spatiotemporal fusion of remote sensing images combines the spatial features of high spatial but low temporal resolution images with the temporal features of low spatial but high temporal resolution images to generate time-series images with high spatial resolution. By principle, existing spatiotemporal fusion models can be divided into three types: reconstruction based, spatial unmixing based, and learning based.
The basic principle of reconstruction-based methods is to calculate the reflectance of the central fused pixel through a weighting function that takes full account of the spectral, temporal, and spatial information in similar pixels. Gao et al. [8] first proposed the spatial and temporal adaptive reflectance fusion model (STARFM), which uses a pair of MODIS and ETM+ reflectance images at a known time phase together with a MODIS reflectance image at the predicted time phase to generate a 30 m spatial resolution image. Hilker et al. [9] proposed a spatiotemporal fusion algorithm for mapping reflectance change (STAARCH) based on the tasseled cap transformation. The algorithm can not only generate 30 m spatial resolution ETM+-like images but also detect highly detailed surface classes. However, the fusion accuracy of STARFM and STAARCH is highly related to surface landscape heterogeneity, resulting in low fusion accuracy for heterogeneous areas. David et al. [10] considered the influence of bidirectional reflectance effects and proposed a semiphysical method to generate fused Landsat ETM+ reflectance using MODIS and Landsat ETM+ data. Building on STARFM, Zhu et al. [11] considered the reflectance differences between sensor imaging systems caused by different orbital parameters, band bandwidths, spectral response curves, and other factors, introduced transfer coefficients between sensors, and proposed the enhanced STARFM (ESTARFM) model, which improves the fusion accuracy for complex (heterogeneous) surface areas to a certain extent. The model uses two pairs of MODIS and ETM+ reflectance images plus a MODIS reflectance image to generate a 30 m spatial resolution ETM+-like image.
Wang and Atkinson [12] proposed a spatiotemporal fusion algorithm consisting of three parts, regression model fitting (RM fitting), spatial filtering (SF), and residual compensation (RC), referred to as FitFC; this method uses only one known high-low resolution image pair as input and can better predict the spatial changes between images of different periods. Chiman et al. [13] proposed a simple and intuitive method with two steps. First, a mapping is established between two MODIS images, one at an earlier time t1 and the other at the prediction time t2. Second, this mapping is applied to a known Landsat image at t1 to generate a predicted Landsat image at t2.
Spatiotemporal fusion methods based on spatial unmixing perform spectral unmixing of pixels in known low-resolution images and apply the classification results to high-resolution images at the unknown time to predict high-resolution images. Zhukov et al. [14] proposed a spatiotemporal fusion method that accounts for the spatial variability of pixel reflectance, based on the assumption that pixel reflectance does not change drastically between neighboring pixels. The method introduces a window technique to predict the high-resolution reflectance of each feature class, but it is not ideal for farmland areas that change dramatically over short periods. Wu [15] proposed a spatial and temporal data fusion approach (STDFA) based on the assumption that the temporal variation of class reflectance is consistent with that of intraclass pixel reflectance. This method extracts the temporal change information of ground features from time-series low spatial resolution images, and performs classification and density segmentation on two known periods of high spatial resolution images to obtain classified images, from which the class-average reflectance used for image fusion is derived. On the basis of [15], Wu and Huang [16] comprehensively considered the spatial variability and temporal variation of pixel reflectance and proposed an enhanced STDFA, which solves the problem of missing remote sensing data. Hazaymeh and Hassan [17] proposed a relatively simple and more efficient algorithm, the spatiotemporal image-fusion model (STIFM), which first applies clustering to the images and then performs a separate prediction for each cluster. Zhu et al. [18] proposed a flexible spatiotemporal data fusion method (FSDAF) based on spectral unmixing analysis and thin-plate spline interpolation. The algorithm uses less input data, is suitable for heterogeneous areas, and can effectively preserve the low-resolution details of the image during the prediction period.
In remote sensing image processing, learning-based methods are more commonly used for the classification of ground features [19]. In recent years, learning-based spatiotemporal fusion methods have received wide attention. In 2012, Huang and Song [20] first introduced sparse representation into spatiotemporal fusion and proposed the sparse-representation-based spatiotemporal reflectance fusion model (SPSTFM), which uses MODIS and ETM+ image pairs from before and after the predicted time phase. First, the high- and low-resolution difference images are used to train a coupled dictionary representing high- and low-resolution features, and then a low-resolution image is used to predict the high-resolution image. Song and Huang [21] proposed a sparse-representation spatiotemporal reflectance fusion model using only one known high-low resolution image pair, which first enhances the MODIS image by sparse representation to obtain a transition image, and then generates the predicted image by combining the known high-resolution image with the transition image through high-pass modulation. The model reduces the number of known image pairs required as input, so the algorithm can be applied when data are lacking and has more general applicability. Spatiotemporal fusion methods based on feature learning consider the spatial information of the changing image. However, previous sparse-representation-based methods have some limitations. First, the image features need to be hand designed, which brings complexity and performance instability. Second, these methods do not consider the large volume of actual remote sensing data but only develop and validate the algorithms on small-scale study areas.
The convolutional neural network (CNN) [22] model has a simple structure and can be used to solve target recognition [23] and image classification [24] problems in computer vision. In recent years, CNNs have also been used for super-resolution. As the pioneering CNN model for SR, the super-resolution convolutional neural network (SRCNN) [25] predicts the nonlinear LR-HR mapping via a fully convolutional network and significantly outperforms classical non-DL methods. In the field of remote sensing, Song et al. [26] proposed a five-layer CNN spatiotemporal fusion model. This model is similar to [21] and is a two-stage model: it learns a CNN nonlinear mapping between MODIS and Landsat images and combines high-pass modulation with a weighting strategy to predict Landsat-like images. Liu et al. [27] proposed a two-stream convolutional neural network, StfNet, which not only considers the temporal dependence of remote sensing images but also introduces a temporal constraint; the network takes a coarse difference image together with the neighboring fine image as inputs and the corresponding fine difference image as output, and it can restore spatial details better. At present, learning-based spatiotemporal fusion methods face two main problems. First, a deeper network can improve prediction accuracy, but a deeper network also leads to vanishing gradients or convergence difficulties. Second, it is difficult to obtain two pairs of suitable prior image pairs as input for network training; StfNet, for example, is a fusion method that requires two pairs of prior images as input. Considering these two points, we propose a spatiotemporal fusion model based on a residual convolutional neural network. The model can use as few as one pair of prior images as training input. The MODIS image is very similar to the predicted Landsat image.
In other words, the low-frequency information of the low-resolution image is similar to that of the high-resolution image; in fact, the two differ only by the residual high-frequency part. If we train only on the high-frequency residual between the high-resolution and low-resolution images, the network does not need to spend capacity on the low-frequency part, and the network structure can be deepened while avoiding problems such as vanishing gradients. For this reason, we adopt the idea of ResNet [28] and set up a CNN-based spatiotemporal fusion framework for remote sensing images that suits a small training set. Considering the temporal dependence between image sequences, we use the MODIS-Landsat image pairs from before and after the prediction date to construct the prediction networks, respectively. The experimental results show that, compared with benchmark methods, the spectral color and spatial details of our method are closer to the real Landsat image.
The rest of this paper is divided into three sections. In Section 2, the principle of residual CNN is introduced. Section 3 provides the experimental verification process and results. Section 4 gives the conclusion.
2. Methods
In this paper, we use a CNN with ResNet-style residual connections to construct a dual-stream network to predict Landsat-like images. The principles involved are briefly introduced below.
2.1. CNN
The convolutional neural network (CNN) is one of the most representative network models in deep learning [29]. With the continuous development of deep learning techniques in recent years, it has achieved very good results in the field of image processing. Compared with traditional data processing algorithms, a CNN avoids complicated preprocessing such as manual feature extraction, so it can be applied directly to the raw data.
A CNN is a non-fully connected multilayer neural network, as shown in Figure 1. The main structure consists of convolutional layers, pooling layers, activation layers, and fully connected layers [30]. The convolutional, pooling, and activation layers are the feature extraction layers of the CNN, used to extract signal features; the fully connected layers form the CNN classifier. Since this paper mainly uses a deep convolutional network to extract the spatial characteristics of remote sensing images, we focus on the feature extraction layers of deep convolutional neural networks.
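To make the feature extraction layers concrete, the following minimal sketch (plain Python, no deep learning framework; the 4×4 image and all-ones filter are purely illustrative) implements one "valid" convolution, a ReLU activation, and 2×2 max pooling:

```python
def conv2d_valid(img, kernel):
    """'Valid' 2-D cross-correlation: slide the kernel over the image and
    take the elementwise product-sum at each position (one feature map)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img), len(img[0])
    return [[sum(img[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(w - kw + 1)]
            for i in range(h - kh + 1)]

def relu(x):
    """Elementwise activation applied after each convolution."""
    return [[max(v, 0.0) for v in row] for row in x]

def max_pool2(x):
    """Non-overlapping 2x2 max pooling, halving each spatial dimension."""
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, len(x[0]) - 1, 2)]
            for i in range(0, len(x) - 1, 2)]

# Toy 4x4 single-band "image" and a 3x3 all-ones filter (illustrative values).
img = [[float(v) for v in range(r * 4, r * 4 + 4)] for r in range(4)]
kernel = [[1.0] * 3 for _ in range(3)]
feature_map = relu(conv2d_valid(img, kernel))  # 2x2 map of local sums
pooled = max_pool2(feature_map)                # 1x1 after 2x2 max pooling
```

A real CNN stacks many such convolution-activation stages with learned filters; this sketch only shows the data flow of one feature extraction step.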
2.2. Residual Learning
Suppose the input of a neural network is x and the expected output is H(x), where H is the desired mapping. Learning such a mapping directly can be difficult to train. Once accuracy has saturated (or when the error in the deeper layers starts to grow), the next learning goal becomes an identity mapping, that is, making the output approximately equal to the input x, so that accuracy does not drop in the later layers.
As shown in the residual network structure diagram in Figure 2, the input x is passed directly to the output as an initial result through a "shortcut connection," and the output becomes H(x) = F(x) + x. When F(x) = 0, then H(x) = x, which is the identity mapping mentioned above. ResNet therefore changes the learning goal: instead of learning the complete output, it learns the difference between the target value H(x) and x, that is, the residual F(x) = H(x) − x. The training goal then becomes driving the residual toward 0, so that accuracy does not decline as the network deepens.
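The identity-shortcut idea above reduces to a few lines of plain Python (the residual branches below are stand-in lambdas, not trained networks):

```python
def residual_block(x, residual_branch):
    """ResNet shortcut: the block learns F(x) and outputs H(x) = F(x) + x.
    If the learned branch F collapses to zero, H reduces to the identity."""
    return [f + v for f, v in zip(residual_branch(x), x)]

x = [1.0, 2.0, 3.0]
identity_out = residual_block(x, lambda v: [0.0] * len(v))      # F(x) = 0 -> H(x) = x
corrected = residual_block(x, lambda v: [0.1 * t for t in v])   # small learned correction
```

Because the shortcut carries the input through unchanged, a deep stack of such blocks can always fall back to the identity, which is why accuracy need not degrade with depth.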
2.3. Spatiotemporal Fusion Using Residual Learning in CNN
In this paper, the Landsat image is regarded as high spatial but low temporal resolution data, and the MODIS image as high temporal but low spatial resolution data. We denote the Landsat and MODIS images at time tk as Lk and Mk, respectively. If two pairs of prior images are available, the two-stream residual CNN uses the known Landsat-MODIS pairs (L1, M1) at t1 and (L3, M3) at t3, together with the MODIS image M2 at the prediction date t2, to predict the Landsat-like image L2.
2.3.1. Training Stage.
In the training stage, to build a nonlinear mapping model between the MODIS image and the Landsat-MODIS residual image, we first upsample Mk to the same spatial size as Lk. The Landsat and MODIS images of the same date are then differenced to obtain a residual image Rk = Lk − Mk. We thus expect to learn a mapping f that approximates Rk. Pixel values in Rk are likely to be zero or small, and it is this residual image that we want to predict; the loss function becomes (1/2)||Rk − f(Mk)||², where f(Mk) is the network prediction. We divide the high- and low-resolution images of the same date into overlapping image patches, giving sample sets {Mk^i} and {Rk^i} in which corresponding patches Mk^i and Rk^i form a training pair. The overlapping segmentation is performed to increase the number of training samples. After predicting the residual image, the predicted Landsat-like image is obtained as the sum of the input MODIS image and the predicted residual image.
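The construction of the residual training pairs can be sketched as follows (plain Python; the 8×8 tiles, patch size 4, and stride 2 are illustrative stand-ins for real co-registered MODIS/Landsat tiles):

```python
def overlapping_patches(img, size, stride):
    """Slice an image into overlapping square patches (stride < size),
    which multiplies the number of training samples."""
    h, w = len(img), len(img[0])
    return [[row[j:j + size] for row in img[i:i + size]]
            for i in range(0, h - size + 1, stride)
            for j in range(0, w - size + 1, stride)]

# Hypothetical co-registered tiles: upsampled MODIS (M) and Landsat (L).
M = [[(i * 8 + j) / 64.0 for j in range(8)] for i in range(8)]
L = [[v + 0.05 for v in row] for row in M]   # Landsat = MODIS + residual
R = [[lv - mv for lv, mv in zip(lr, mr)] for lr, mr in zip(L, M)]

inputs = overlapping_patches(M, size=4, stride=2)   # network input patches
labels = overlapping_patches(R, size=4, stride=2)   # residual label patches
```

With an 8×8 tile, a 4×4 patch, and stride 2, each tile yields 3 × 3 = 9 overlapping training pairs instead of the 4 a non-overlapping split would give.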
In the network, the loss layer has three inputs: the residual estimate, the input MODIS image, and the Landsat image. The loss is computed as the Euclidean distance between the reconstructed image and the real Landsat image. To achieve high-precision spatiotemporal fusion, we use a very deep convolutional network of 18 layers, in which all layers except the first and the last are of the same type: 64 filters of size 3 × 3 × 64, where each filter operates on a 3 × 3 spatial region across 64 channels (feature maps). The first layer operates on the input image, and the last layer, used for image reconstruction, consists of a single filter of size 3 × 3 × 64. The processing structure is shown in Figure 3.
Training was performed by backpropagation-based mini-batch gradient descent to optimize the regression target. We set the momentum parameter to 0.9, and training is regularized by weight decay (an L2 penalty multiplied by 0.0001).
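The update rule just described can be sketched in plain Python (the weight and gradient values are illustrative; a real trainer would apply this per mini-batch over all network parameters):

```python
def sgd_momentum_step(w, grad, v, lr=0.01, momentum=0.9, weight_decay=1e-4):
    """One mini-batch update with momentum 0.9 and an L2 penalty (weight
    decay) of 1e-4, matching the training setup described above. The
    learning rate of 0.01 is an assumed placeholder."""
    new_w, new_v = [], []
    for wi, gi, vi in zip(w, grad, v):
        gi = gi + weight_decay * wi      # add the L2 regularization gradient
        vi = momentum * vi - lr * gi     # update the velocity
        new_v.append(vi)
        new_w.append(wi + vi)            # apply the velocity to the weight
    return new_w, new_v

w, v = [1.0, -2.0], [0.0, 0.0]
w, v = sgd_momentum_step(w, [0.5, -0.5], v)
```

The momentum term accumulates past gradients, which smooths the descent direction; the weight-decay term shrinks large weights and acts as the L2 regularizer mentioned in the text.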
2.3.2. Prediction Stage.
There are two pairs of prior Landsat-MODIS images and the MODIS image on the prediction date, and we aim to fuse them to predict the Landsat-like image on that date. Denote the prior dates as t1 and t3 and the prediction date as t2; we predict L2 based on the residual learning CNN. L1, M1, L3, M3, and M2 are divided into patches, with corresponding image patches L1^i, M1^i, L3^i, M3^i, and M2^i, respectively. Taking M1 as the input of the CNN with label R1 = L1 − M1, the sum of the predicted residual image and M1 is used as the prediction; the number of network layers is set to 18. During reconstruction, M2 is input into the network trained on the t1 pair to obtain a predicted L2^(1). Similarly, L2^(3) can be predicted using the Landsat-MODIS image pair at t3. Considering the temporal correlation between the image at the prediction date and the reference images, we apply temporal weights when reconstructing each image patch. Finally, the high spatial resolution image patch at the prediction date is obtained as

L2^i = w1 × L2^(1),i + w3 × L2^(3),i,

where L2^(1),i and L2^(3),i are the ith patch predicted using (L1, M1) and (L3, M3) as the reference pair, respectively, and w1 and w3 are the corresponding weights, determined as

wk = (1/dk) / (1/d1 + 1/d3), k ∈ {1, 3}.

The local change magnitude dk is calculated from the sum of the normalized difference vegetation index (NDVI) [31] and the normalized difference built-up index (NDBI) [32]: dk measures the degree of change between the MODIS images at tk and t2, taken as the absolute average change of NDVI + NDBI within the patch. After each image patch is reconstructed, the patches are restored to the whole image. To ensure the continuity of the reconstructed image, adjacent patches overlap, and the pixel values in overlapping regions are averaged when the whole image is restored.
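A minimal sketch of this temporal weighting follows; the inverse-change form of the weights reflects our reading of the description above (a less-changed reference date gets more weight), and the change magnitudes and patch values are illustrative:

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from NIR and red reflectance."""
    return (nir - red) / (nir + red)

def ndbi(swir, nir):
    """Normalized difference built-up index from SWIR and NIR reflectance."""
    return (swir - nir) / (swir + nir)

def temporal_weights(d1, d3, eps=1e-8):
    """Inverse-change weighting: the reference date whose MODIS image
    changed least toward the prediction date gets the larger weight."""
    inv1, inv3 = 1.0 / (d1 + eps), 1.0 / (d3 + eps)
    return inv1 / (inv1 + inv3), inv3 / (inv1 + inv3)

def fuse_patch(p1, p3, d1, d3):
    """Weighted sum of the two patch predictions (patches as flat lists)."""
    w1, w3 = temporal_weights(d1, d3)
    return [w1 * a + w3 * b for a, b in zip(p1, p3)]

# Illustrative change magnitudes and two constant 2x2 patch predictions.
fused = fuse_patch([2.0] * 4, [4.0] * 4, d1=0.1, d3=0.3)
```

With d1 = 0.1 and d3 = 0.3, the t1 prediction receives weight 0.75 and the t3 prediction 0.25, so the fused patch lies closer to the prediction from the less-changed reference date.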
3. Experiments
Two datasets were used in the experiments. The first dataset contains two pairs of MODIS-Landsat images, and the second contains three pairs. Both areas are located in Coleambally, New South Wales, Australia. The MODIS data are the surface reflectance of the 8-day composite products MOD09A1 (500 m) and MOD09Q1 (250 m). We upsampled all MODIS images in the dataset to the same resolution as the Landsat image of the corresponding date. Compared with natural images, remote sensing images are large and rich in detail, so the remote sensing images are divided into overlapping patches to obtain the training set. In this paper, the images of the two areas are divided with overlap into image patches, and the resulting patch set is used as the training and prediction sets. We compare our method with mainstream and advanced methods (STARFM, FSDAF, FitFC, STDFA, STIFM, HCM, ESTARFM, SPSTFM, and StfNet), as described in detail in this section.
3.1. Experiment on the First Dataset
To verify the applicability of our residual-CNN-based spatiotemporal fusion method when only one prior Landsat-MODIS image pair is available, we use a single-stream network and compare it against STARFM, FSDAF, FitFC, STDFA, STIFM, and HCM with the same data as input.
In this experiment, two pairs of Landsat and MODIS surface reflectance images covering an area in Coleambally are used. The two image pairs were acquired on 2 July 2013 and 17 August 2013. Figure 4 shows the 30 m Landsat images (upper row) and 500 m MODIS images (lower row) using green-red-NIR as the RGB composite. We then use bicubic interpolation to resample the 500 m MODIS image to 30 m. The experimental task uses the Landsat-MODIS image pair from 2 July 2013 and the MODIS image from 17 August 2013 to predict the 30 m Landsat-like image on 17 August 2013. STARFM, FSDAF, FitFC, STDFA, STIFM, and HCM are tested with the same input, and the true 30 m Landsat image acquired on 17 August 2013 is used as the reference to evaluate the accuracy of the fusion results.
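The resampling step can be illustrated with a dependency-free nearest-neighbour upscaler (the experiment uses bicubic interpolation; this stand-in, with an illustrative 2×2 grid and factor 4, only shows the grid-size change of one coarse pixel covering many fine cells):

```python
def upsample_nearest(img, factor):
    """Nearest-neighbour upscaling of a 2-D grid: each coarse pixel is
    replicated into a factor x factor block of fine pixels. A real pipeline
    would use bicubic interpolation for smoother transitions."""
    h, w = len(img), len(img[0])
    return [[img[i // factor][j // factor] for j in range(w * factor)]
            for i in range(h * factor)]

modis = [[0.2, 0.4], [0.6, 0.8]]        # toy 2x2 coarse reflectance grid
up = upsample_nearest(modis, factor=4)  # 8x8 fine grid
```

Note that upscaling (whether nearest-neighbour or bicubic) only changes the grid spacing; it adds no genuine spatial detail, which is exactly what the fusion network must supply as the residual.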
Figure 5 shows the fusion results of the seven methods (STARFM, FitFC, FSDAF, STDFA, STIFM, HCM, and our method). Our method clearly achieves higher prediction accuracy. For example, in the highlighted areas in the bottom left part of subarea S, FitFC, STDFA, FSDAF, STIFM, HCM, and STARFM incorrectly predict some dark green pixels as purple pixels. In addition, in the highlighted areas in the bottom right part of the subarea, FitFC, STDFA, FSDAF, STIFM, HCM, and STARFM incorrectly predict some red pixels as purple and blue pixels, whereas our method stays closer to the reference image. The main reason is that FitFC directly applies the linear coefficients fitted on the known low-resolution images to the high-resolution image of the prediction period; when the spatial resolution gap between the high- and low-resolution images is large (the spatial resolutions of the Landsat and MODIS images differ by a factor of nearly 17), this produces an obvious "block effect." STDFA assumes that the temporal variation of the same surface cover class within coarse pixels is consistent, which may not hold in practice, so the fusion result is affected. The prediction accuracy of FSDAF is worse, mainly for two reasons: first, the known high-resolution data must be classified, and the accuracy of the unsupervised classification method (such as the k-means method) affects the results; second, when the spatial resolution gap between high- and low-resolution data is large, the area represented by each endmember (that is, each high-resolution pixel) becomes more refined.
When the number of categories is small, the fusion result is relatively smooth, and when the number of categories is large, the fitting accuracy also decreases (for instance, within a low-resolution pixel, if the abundance of a certain category is low, the total prediction error increases). STIFM is susceptible to interference from outliers, so when the spatial characteristics change significantly, its predictions are poor. The gradation-mapping approach is strongly affected by heterogeneous regions, so HCM also failed to deliver the best performance in this experiment. STARFM considers the similarity of neighboring pixels, so its prediction accuracy is relatively stable; however, STARFM presupposes that the spectra of similar pixels in the neighborhood are constant and that no land cover change occurs during the observation period, which makes the model susceptible to environmental and phenological changes and leads to large prediction errors, especially in heterogeneous areas. Our method uses deep convolutional neural networks to extract the features of the low-resolution and residual images more effectively and constructs a mapping between low-resolution images and residual images through a residual learning network. This mapping is nonlinear and better matches the changes of ground features. In addition, residual learning allows the number of layers to be deepened, which strengthens the robustness of the network. The experimental results of our method therefore have better visual quality.
Table 1 lists the objective evaluation results of the seven fusion methods using three common fusion evaluation metrics for remote sensing images: root mean square error (RMSE) [33], correlation coefficient (CC) [34], and universal image quality index (UIQI) [35]. The ideal values of RMSE, CC, and UIQI are 0, 1, and 1, respectively. From Table 1, we can see that across the six bands of all fusion results, our method yields smaller RMSE and larger CC and UIQI. Compared with the other six methods (STARFM, FitFC, FSDAF, STDFA, STIFM, and HCM), the gains in mean CC are 0.0259, 0.0365, 0.0168, 0.0253, 0.0620, and 0.0487, the gains in mean UIQI are 0.0261, 0.0368, 0.0175, 0.0254, 0.0620, and 0.0489, and the mean RMSE is reduced by 0.0018, 0.0024, 0.0012, 0.0021, 0.0040, and 0.0032, respectively. In addition, the fusion result of our method is better than that of STARFM, and STARFM is better than FitFC; the rest of the ordering is STDFA > FSDAF > HCM > STIFM. The main reason is that when the spatial resolution gap between high- and low-resolution images is large, FitFC directly applies the fitting coefficients of the low-resolution images to the high-resolution images, which causes large errors, and FSDAF has similar fitting errors. STDFA assumes that the temporal variation of the same surface cover class within coarse pixels is consistent, which may not hold in practice, so the fusion result is affected. Although STARFM considers the similarity of neighboring pixels, its per-pixel reconstruction cannot account for the continuity of the image. STIFM is susceptible to outliers, so its predictions degrade when the spatial characteristics change significantly, and HCM's gradation mapping is strongly affected by heterogeneous regions. Our method better restores the continuity of the image by reconstructing image patches.
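The three evaluation metrics can be computed per band as follows (pure Python on flattened band values; the UIQI follows the Wang-Bovik single-window form, which combines correlation, luminance distortion, and contrast distortion):

```python
import math

def _mean(x):
    return sum(x) / len(x)

def rmse(x, y):
    """Root mean square error; 0 for identical bands."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def cc(x, y):
    """Pearson correlation coefficient between two flattened bands."""
    mx, my = _mean(x), _mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def uiqi(x, y):
    """Universal image quality index (Wang & Bovik): 1 means identical
    images; penalizes correlation, mean and variance mismatches."""
    mx, my = _mean(x), _mean(y)
    vx = sum((a - mx) ** 2 for a in x) / len(x)
    vy = sum((b - my) ** 2 for b in y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return 4 * cov * mx * my / ((vx + vy) * (mx ** 2 + my ** 2))
```

In practice, UIQI is usually evaluated over sliding windows and averaged; the single-window form above is the core formula, computed here over the whole band for brevity.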

3.2. Experiment on the Second Dataset
In this experiment, three pairs of Landsat-MODIS images covering an area of Coleambally are used to verify the applicability of our method with two pairs of prior images. The three image pairs were acquired on 6 April 2012, 12 May 2012, and 20 July 2012. Figure 6 shows the 30 m Landsat images (upper row) and 500 m MODIS images (lower row) using green-red-NIR as the RGB composite. To verify the accuracy of our method with two pairs of prior images, we use the two image pairs from 6 April 2012 and 20 July 2012 together with the MODIS image from 12 May 2012 to predict the Landsat-like image on 12 May 2012.
Figure 7 shows the 30 m prediction results on 12 May 2012 for the four methods (ESTARFM, SPSTFM, StfNet, and our method). It is worth noting that ESTARFM performs worst, StfNet is better than SPSTFM, and our method is better than StfNet. For example, in the highlighted areas in the bottom left of subarea S, ESTARFM and SPSTFM incorrectly predict some light green pixels as dark green pixels. Although the StfNet prediction is similar to the reference image, some yellow pixels are incorrectly predicted as blue pixels; the prediction of our method is closest to the true reference image. Compared with the three benchmark methods, our method provides excellent performance. The main reasons are as follows. ESTARFM assumes that the conversion coefficients between high- and low-resolution images remain unchanged during the observation period, but in reality land types and coverage change, so this assumption does not hold in areas with significant change. SPSTFM uses sparse representation and dictionary learning in the signal domain to increase prediction accuracy for land cover change and heterogeneous regions; however, compared with our method, SPSTFM is applicable only to small-scale regions and cannot extract sufficient image features. Although StfNet can produce more accurate predictions with a deep network, the volume of data involved in training is large and the network is hard to converge, which also affects the prediction accuracy. Our residual learning network needs to learn only the difference information between high- and low-resolution images. Since the low-frequency information of the high- and low-resolution images is similar, directly learning the mapping between them would increase the amount of computation and introduce errors.
Through residual learning, not only can the nonlinear mapping of the high-frequency residual be learned directly, but the network layers can also be deepened, which enhances the accuracy and stability of the network structure.
Table 2 shows the comparison in terms of RMSE, CC, and UIQI. From Table 2, we can see that across the six bands, our method obtains smaller average RMSE and larger CC and UIQI. It is easy to see that our method is better than StfNet, StfNet is better than SPSTFM, and ESTARFM is the worst of the four approaches. Specifically, the CC gains of our method over ESTARFM, SPSTFM, and StfNet are 0.0508, 0.0257, and 0.0134, and the UIQI gains are 0.0467, 0.0238, and 0.0126, respectively. The main reason is that ESTARFM assumes the conversion coefficients remain unchanged during the observation period, but land cover changes occur in this area (for example, in subregion S), so the conversion coefficients are not consistent and the prediction results are heavily biased. SPSTFM takes the image patch as the reconstruction unit and considers the continuity between adjacent pixels, so it is robust to complex surface changes; however, forcing the high- and low-resolution dictionaries to share the same sparse coefficients makes the constructed mapping unstable, so its performance in this experiment is worse than ours. StfNet has many network layers, but it is difficult to converge because it directly trains the mapping between high- and low-resolution images, which also makes the network unstable. Our residual network not only improves the stability of the network but also enhances the accuracy of the fusion results.

4. Conclusion
In this paper, we propose a residual convolutional neural network to predict Landsat-like images; the method can be applied even when only one pair of prior images is available. The method consists of two main steps: first, the known MODIS-Landsat image pair is used to train the residual convolutional neural network; second, the MODIS image at the prediction phase is input to reconstruct the Landsat-like image. Compared with several benchmark algorithms (STARFM, FSDAF, FitFC, ESTARFM, SPSTFM, and StfNet), our method has the advantages of a learning algorithm, taking the image patch as the reconstruction unit and considering the continuity between adjacent pixels. Training on the residual to construct the deep network not only enhances the stability of the network but also improves prediction accuracy.
Learning-based spatiotemporal fusion methods achieve greater prediction accuracy in heterogeneous regions. In this paper, we use a multilayer convolutional neural network to extract spatial features; in future work, we will try to design more effective spatial feature extractors to improve the recognition of change information. In recent years, deep learning has received extensive attention, and it requires a large amount of data to train a model. Because remote sensing data are large in volume and rich in information, we can exploit these "big data" characteristics to learn a more effective mapping between MODIS and Landsat images and thus improve prediction accuracy. In addition, although learning-based spatiotemporal fusion models perform well, their computation time is long, a common weakness of learning-based methods. Therefore, our future work will aim both at improving the accuracy of the fusion results and at reducing the computational complexity.
Data Availability
Data is not available for the following reason: the remote sensing data used in the experiments in this paper were provided by the Institute of Remote Sensing Applications, Chinese Academy of Sciences, and without the Institute's consent the authors cannot make the data available.
Conflicts of Interest
The authors declare that they have no conflict of interest.
References
 B. Rasti, D. Hong, R. Hang et al., “Feature extraction for hyperspectral imagery: the evolution from shallow to deep (overview and toolbox),” IEEE Geoscience and Remote Sensing Magazine, vol. 8, no. 3, pp. 63–92, 2020.
 M. C. Anderson, R. G. Allen, A. Morse, and W. P. Kustas, “Use of Landsat thermal imagery in monitoring evapotranspiration and managing water resources,” Remote Sensing of Environment, vol. 122, pp. 50–65, 2012.
 F. D. van der Meer, H. M. A. van der Werff, F. J. A. van Ruitenbeek et al., “Multi- and hyperspectral geologic remote sensing: a review,” International Journal of Applied Earth Observation and Geoinformation, vol. 14, no. 1, pp. 112–128, 2012.
 D. Hong, N. Yokoya, N. Ge, J. Chanussot, and X. X. Zhu, “Learnable manifold alignment (LeMA): a semi-supervised cross-modality learning framework for land cover and land use classification,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 147, pp. 193–205, 2019.
 S. Ganguly, M. A. Friedl, B. Tan, X. Zhang, and M. Verma, “Land surface phenology from MODIS: characterization of the collection 5 global land cover dynamics product,” Remote Sensing of Environment, vol. 114, no. 8, pp. 1805–1816, 2010.
 X. Zhang, M. A. Friedl, C. B. Schaaf et al., “Monitoring vegetation phenology using MODIS,” Remote Sensing of Environment, vol. 84, no. 3, pp. 471–475, 2003.
 C. Vignolles, M. Gay, G. Flouzat, and P. Puyou-Lascassies, “Spatio-temporal connection of remote sensing data concerning agricultural production modelisation at a middle scale,” in 1995 International Geoscience and Remote Sensing Symposium (IGARSS '95): Quantitative Remote Sensing for Science and Applications, Firenze, Italy, July 1995.
 F. Gao, J. Masek, M. Schwaller, and F. Hall, “On the blending of the Landsat and MODIS surface reflectance: predicting daily Landsat surface reflectance,” IEEE Transactions on Geoscience and Remote Sensing, vol. 44, no. 8, pp. 2207–2218, 2006.
 T. Hilker, M. A. Wulder, N. C. Coops et al., “A new data fusion model for high spatial- and temporal-resolution mapping of forest disturbance based on Landsat and MODIS,” Remote Sensing of Environment, vol. 113, no. 8, pp. 1613–1627, 2009.
 D. P. Roy, J. Ju, P. Lewis et al., “Multi-temporal MODIS-Landsat data fusion for relative radiometric normalization, gap filling, and prediction of Landsat data,” Remote Sensing of Environment, vol. 112, no. 6, pp. 3112–3130, 2008.
 X. Zhu, J. Chen, F. Gao, X. Chen, and J. G. Masek, “An enhanced spatial and temporal adaptive reflectance fusion model for complex heterogeneous regions,” Remote Sensing of Environment, vol. 114, no. 11, pp. 2610–2623, 2010.
 Q. Wang and P. M. Atkinson, “Spatio-temporal fusion for daily Sentinel-2 images,” Remote Sensing of Environment, vol. 204, pp. 31–42, 2018.
 C. Kwan, B. Budavari, F. Gao, and X. Zhu, “A hybrid color mapping approach to fusing MODIS and Landsat images for forward prediction,” Remote Sensing, vol. 10, no. 4, pp. 520–529, 2018.
 B. Zhukov, D. Oertel, F. Lanzl, and G. Reinhackel, “Unmixing-based multisensor multiresolution image fusion,” IEEE Transactions on Geoscience and Remote Sensing, vol. 37, no. 3, pp. 1212–1226, 1999.
 M. Wu, “Use of MODIS and Landsat time series data to generate high-resolution temporal synthetic Landsat data using a spatial and temporal reflectance fusion model,” Journal of Applied Remote Sensing, vol. 6, no. 1, p. 063507, 2012.
 M. Wu, W. Huang, Z. Niu, and C. Wang, “Generating daily synthetic Landsat imagery by combining Landsat and MODIS data,” Sensors, vol. 15, no. 9, pp. 24002–24025, 2015.
 K. Hazaymeh and Q. K. Hassan, “Spatiotemporal image-fusion model for enhancing the temporal resolution of Landsat-8 surface reflectance images using MODIS images,” Journal of Applied Remote Sensing, vol. 9, no. 1, p. 096095, 2015.
 X. Zhu, E. H. Helmer, F. Gao, D. Liu, J. Chen, and M. A. Lefsky, “A flexible spatiotemporal method for fusing satellite images with different resolutions,” Remote Sensing of Environment, vol. 172, pp. 165–177, 2016.
 D. Hong, N. Yokoya, J. Chanussot, and X. X. Zhu, “CoSpace: common subspace learning from hyperspectral-multispectral correspondences,” IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 7, pp. 4349–4359, 2019.
 B. Huang and H. Song, “Spatiotemporal reflectance fusion via sparse representation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 50, no. 10, pp. 3707–3716, 2012.
 H. Song and B. Huang, “Spatiotemporal satellite image fusion through one-pair image learning,” IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 4, pp. 1883–1896, 2013.
 D. Yu and L. Deng, “Deep learning and its applications to signal and information processing [exploratory DSP],” IEEE Signal Processing Magazine, vol. 28, no. 1, pp. 145–154, 2011.
 F. H. C. Tivive and A. Bouzerdoum, “A new class of convolutional neural networks (SICoNNets) and their application to face detection,” in Proceedings of the International Joint Conference on Neural Networks, 2003, pp. 2157–2162, Portland, OR, USA, July 2003.
 A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, pp. 1097–1105, Lake Tahoe, Nevada, USA, 2012, Curran Associates, Inc.
 C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2016.
 H. Song, Q. Liu, G. Wang, R. Hang, and B. Huang, “Spatiotemporal satellite image fusion using deep convolutional neural networks,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 11, no. 3, pp. 821–829, 2018.
 X. Liu, C. Deng, J. Chanussot, D. Hong, and B. Zhao, “StfNet: a two-stream convolutional neural network for spatiotemporal image fusion,” IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 9, pp. 6552–6564, 2019.
 K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, June 2016.
 S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017.
 M. Liang and X. Hu, “Recurrent convolutional neural network for object recognition,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, June 2015.
 E. F. Lambin and A. H. Strahler, “Indicators of land-cover change for change-vector analysis in multitemporal space at coarse spatial scales,” International Journal of Remote Sensing, vol. 15, no. 10, pp. 2099–2119, 1994.
 C. He, P. Shi, D. Xie, and Y. Zhao, “Improving the normalized difference built-up index to map urban built-up areas using a semiautomatic segmentation approach,” Remote Sensing Letters, vol. 1, no. 4, pp. 213–221, 2010.
 Z. Zhang and R. S. Blum, “A categorization of multiscale-decomposition-based image fusion schemes with a performance study for a digital camera application,” Proceedings of the IEEE, vol. 87, no. 8, pp. 1315–1326, 1999.
 K. D. Kim and J. H. Heo, “Comparative study of flood quantiles estimation by nonparametric models,” Journal of Hydrology, vol. 260, no. 1–4, pp. 176–193, 2002.
 Z. Wang and A. C. Bovik, “A universal image quality index,” IEEE Signal Processing Letters, vol. 9, no. 3, pp. 81–84, 2002.
Copyright
Copyright © 2020 Xiaofei Wang and Xiaoyi Wang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.