Advanced Sensor Technologies in Geospatial Sciences and Engineering 2020
Research Article
Spatiotemporal Fusion of Remote Sensing Image Based on Deep Learning
Abstract
High spatial and temporal resolution remote sensing data play an important role in monitoring rapid changes of the Earth's surface. However, there is an inherent trade-off between the spatial and temporal resolutions of remote sensing images acquired by the same sensor. Spatiotemporal fusion of remote sensing data is an effective way to resolve this trade-off. In this paper, we study a spatiotemporal fusion method based on the convolutional neural network, which fuses Landsat data (high spatial but low temporal resolution) with MODIS data (low spatial but high temporal resolution) to generate time-series data with high spatial resolution. To improve the accuracy of spatiotemporal fusion, a residual convolutional neural network is proposed: the MODIS image is used as the input to predict the residual image between the MODIS and Landsat images, and the sum of the predicted residual image and the MODIS data is taken as the predicted Landsat-like image. The residual design not only increases the depth of the super-resolution network but also avoids the vanishing-gradient problem caused by deep network structures. The experimental results show that the prediction accuracy of our method exceeds that of several mainstream methods.
1. Introduction
Due to limitations of remote sensing satellite hardware and the cost of satellite launches, it is difficult for the same satellite to obtain remote sensing images with both high spatial and high temporal resolution. The Landsat series of satellites can obtain multispectral data with a spatial resolution of 30 m. Multispectral images reflect the spectral information of ground features; unlike hyperspectral images, whose many bands usually require dimensionality reduction before classification and other processing (and for which many dimensionality reduction methods exist [1]), multispectral imaging is more convenient, making it widely used in many fields. Accordingly, Landsat data have been widely used in earth resource exploration; agricultural, forestry, and animal husbandry management; and natural disaster and environmental pollution monitoring [2–4]. However, the 16-day revisit cycle of the Landsat satellites and the impact of cloud contamination limit their potential use in monitoring and researching land surface dynamics. On the other hand, the Moderate-Resolution Imaging Spectroradiometer (MODIS) on the Terra/Aqua satellites has a revisit cycle of 1–2 days; this high temporal resolution makes it applicable to vegetation phenology [5, 6] and other fields. However, the spatial resolution of MODIS data is 250–1000 m, which represents the details of ground objects poorly and is not sufficient for observing heterogeneous landscapes.
In 1995, Vignolles et al. [7] first proposed generating high spatiotemporal resolution data by spatiotemporal fusion technology. Since then, various types of spatiotemporal fusion methods have emerged. Spatiotemporal fusion of remote sensing images combines the spatial features of high spatial but low temporal resolution images with the temporal features of low spatial but high temporal resolution images to generate time-series images with high spatial resolution. By principle, existing spatiotemporal fusion models can be divided into three types: reconstruction based, spatial unmixing based, and learning based.
The basic principle of reconstruction-based methods is to calculate the reflectance of the central fused pixel through a weighting function that takes full account of the spectral, temporal, and spatial information in similar pixels. Gao et al. [8] first proposed the spatial and temporal adaptive reflectance fusion model (STARFM), which uses a pair of MODIS and ETM+ reflectance images at a known time phase together with a MODIS reflectance image at the predicted time phase to generate a 30 m spatial resolution image. Hilker et al. [9] proposed a spatiotemporal fusion algorithm for mapping reflectance change (STAARCH) based on the tasseled cap transformation. The algorithm can not only generate 30 m spatial resolution ETM+-like images but also detect highly detailed surface classes. However, the fusion accuracy of STARFM and STAARCH is highly related to surface landscape heterogeneity, resulting in low fusion accuracy for heterogeneous areas. David et al. [10] considered the influence of bidirectional reflectance effects and proposed a semiphysical method to generate fused Landsat ETM+ reflectance using MODIS and Landsat ETM+ data. Building on STARFM, Zhu et al. [11] considered the reflectance differences between sensor imaging systems caused by different orbital parameters, band bandwidths, spectral response curves, and other factors, introduced transfer coefficients between sensors, and proposed the enhanced STARFM (ESTARFM) model, which improves the fusion accuracy for complex (heterogeneous) surface areas to a certain extent. The model uses two pairs of MODIS and ETM+ reflectance images plus a MODIS reflectance image to generate a 30 m spatial resolution ETM+-like image.
Wang and Atkinson [12] proposed a spatiotemporal fusion algorithm consisting of three parts, regression model fitting (RM fitting), spatial filtering (SF), and residual compensation (RC), referred to as FitFC; this method uses only one known high-low resolution image pair as input and can better predict the spatial changes between images of different periods. Chiman et al. [13] proposed a simple and intuitive method with two steps. First, a mapping is established between two MODIS images, one at an earlier time t1 and the other at the prediction time t2. Second, this mapping is applied to a known Landsat image at t1 to generate a predicted Landsat image at t2.
Spatiotemporal fusion methods based on spatial unmixing perform spectral unmixing of pixels in known low-resolution images and apply the classification results to high-resolution images at the unknown time to predict high-resolution images. Zhukov et al. [14] proposed a spatiotemporal fusion method that accounts for the spatial variability of pixel reflectance, based on the assumption that pixel reflectance does not change drastically between neighboring pixels. The method introduces a window technique to predict the high-resolution reflectance of each feature class, but it is not ideal for farmland areas that change dramatically over short periods. Wu [15] proposed a spatial and temporal data fusion approach (STDFA) based on the assumption that the temporal variation of class reflectance is consistent with that of intraclass pixel reflectance. This method extracts the temporal change information of ground features from time-series low spatial resolution images, and performs classification and density segmentation on two known periods of high spatial resolution images to obtain classified images, from which the class-average reflectance used for image fusion is derived. On the basis of [15], Wu and Huang [16] comprehensively considered the spatial variability and temporal variation of pixel reflectance and proposed an enhanced STDFA, which solves the problem of missing remote sensing data. Hazaymeh and Hassan [17] proposed a relatively simple and more efficient algorithm, the spatiotemporal image-fusion model (STIFM), which first applies clustering to the images and then performs a separate prediction for each cluster. Zhu et al. [18] proposed a flexible spatiotemporal data fusion method (FSDAF) based on spectral unmixing analysis and thin-plate spline interpolation. The algorithm uses less input data, is suitable for heterogeneous areas, and can effectively preserve the low-resolution details of the image during the prediction period.
In remote sensing image processing, learning-based methods are more commonly used for the classification of ground features [19]. In recent years, learning-based spatiotemporal fusion methods have received wide attention. In 2012, Huang and Song [20] first introduced sparse representation into spatiotemporal fusion and proposed the sparse-representation-based spatiotemporal reflectance fusion model (SPSTFM), which uses MODIS and ETM+ image pairs from before and after the predicted time phase. First, the high- and low-resolution difference images are used to train a coupled dictionary representing high- and low-resolution features, and then a low-resolution image is used to predict the high-resolution image. Song and Huang [21] proposed a sparse-representation spatiotemporal reflectance fusion model using only one known high-low resolution image pair, which first enhances the MODIS image by sparse representation to obtain a transition image, and then generates the predicted image by combining the known high-resolution image with the transition image through high-pass modulation. The model reduces the number of known image pairs required as input, so the algorithm can be applied when data are lacking and has more general applicability. Spatiotemporal fusion methods based on feature learning consider the spatial information of the changing image. However, previous sparse-representation-based methods have some limitations. First, the image features need to be hand designed, which brings complexity and performance instability. Second, these methods do not consider the large volume of actual remote sensing data but only develop and validate the algorithms on small-scale study areas.
The convolutional neural network (CNN) [22] model has a simple structure and can be used to solve target recognition [23] and image classification [24] problems in computer vision. In recent years, CNNs have also been used for super-resolution. As the pioneering CNN model for SR, the super-resolution convolutional neural network (SRCNN) [25] predicts the nonlinear LR-HR mapping via a fully convolutional network and significantly outperforms classical non-DL methods. In the field of remote sensing, Song et al. [26] proposed a five-layer CNN spatiotemporal fusion model. This model is similar to [21] and is a two-stage model: it learns a CNN nonlinear mapping between MODIS and Landsat images and combines high-pass modulation with a weighting strategy to predict Landsat-like images. Liu et al. [27] proposed a two-stream convolutional neural network, StfNet, which not only considers the temporal dependence of remote sensing images but also introduces a temporal constraint; the network takes a coarse difference image together with the neighboring fine image as inputs and the corresponding fine difference image as output, and it can restore spatial details better. At present, learning-based spatiotemporal fusion methods face two main problems. First, a deeper network can improve prediction accuracy, but a deeper network also leads to vanishing gradients or convergence difficulties. Second, it is difficult to obtain two pairs of suitable prior image pairs as input for network training; StfNet, for example, is a fusion method that requires two pairs of prior images as input. Considering these two points, we propose a spatiotemporal fusion model based on a residual convolutional neural network. The model can use as few as one pair of prior images as training input. The MODIS image is very similar to the predicted Landsat image.
In other words, the low-frequency information of the low-resolution image is similar to that of the high-resolution image; in fact, the two differ only by the residual high-frequency part. If we train only on the high-frequency residual between the high-resolution and low-resolution images, the network does not need to spend capacity on the low-frequency part, and the network structure can be deepened while avoiding problems such as vanishing gradients. For this reason, we adopt the idea of ResNet [28] and set up a CNN-based spatiotemporal fusion framework for remote sensing images that suits a small training set. Considering the temporal dependence between image sequences, we use the MODIS-Landsat image pairs from before and after the prediction date to construct the prediction networks, respectively. The experimental results show that, compared with benchmark methods, the spectral color and spatial details of our method are closer to the real Landsat image.
The rest of this paper is divided into three sections. In Section 2, the principle of residual CNN is introduced. Section 3 provides the experimental verification process and results. Section 4 gives the conclusion.
2. Methods
In this paper, we use a CNN with ResNet-style residual connections to construct a dual-stream network to predict Landsat-like images. The principles involved are briefly introduced below.
2.1. CNN
The convolutional neural network (CNN) is one of the most representative network models in deep learning [29]. With the continuous development of deep learning techniques in recent years, it has achieved very good results in the field of image processing. Compared with traditional data processing algorithms, a CNN avoids complicated preprocessing such as manual feature extraction, so it can be applied directly to the raw data.
A CNN is a non-fully connected multilayer neural network, as shown in Figure 1. The main structure consists of convolutional layers, pooling layers, activation layers, and fully connected layers [30]. The convolutional, pooling, and activation layers are the feature extraction layers of the CNN, used to extract signal features; the fully connected layers form the CNN classifier. Since this paper mainly uses a deep convolutional network to extract the spatial characteristics of remote sensing images, we focus on the feature extraction layers of deep convolutional neural networks.
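To make the feature extraction layers concrete, the following minimal sketch (plain Python, no deep learning framework; the 4×4 image and all-ones filter are purely illustrative) implements one "valid" convolution, a ReLU activation, and 2×2 max pooling:

```python
def conv2d_valid(img, kernel):
    """'Valid' 2-D cross-correlation: slide the kernel over the image and
    take the elementwise product-sum at each position (one feature map)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img), len(img[0])
    return [[sum(img[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(w - kw + 1)]
            for i in range(h - kh + 1)]

def relu(x):
    """Elementwise activation applied after each convolution."""
    return [[max(v, 0.0) for v in row] for row in x]

def max_pool2(x):
    """Non-overlapping 2x2 max pooling, halving each spatial dimension."""
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, len(x[0]) - 1, 2)]
            for i in range(0, len(x) - 1, 2)]

# Toy 4x4 single-band "image" and a 3x3 all-ones filter (illustrative values).
img = [[float(v) for v in range(r * 4, r * 4 + 4)] for r in range(4)]
kernel = [[1.0] * 3 for _ in range(3)]
feature_map = relu(conv2d_valid(img, kernel))  # 2x2 map of local sums
pooled = max_pool2(feature_map)                # 1x1 after 2x2 max pooling
```

A real CNN stacks many such convolution-activation stages with learned filters; this sketch only shows the data flow of one feature extraction step.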
2.2. Residual Learning
Suppose the input of a neural network is x and the expected output is H(x), where H is the desired mapping. Learning such a mapping directly can be difficult to train. Once accuracy has saturated (or when the error in the deeper layers starts to grow), the next learning goal becomes an identity mapping, that is, making the output approximately equal to the input x, so that accuracy does not drop in the later layers.
As shown in the residual network structure diagram in Figure 2, the input x is passed directly to the output as an initial result through a "shortcut connection," and the output becomes H(x) = F(x) + x. When F(x) = 0, then H(x) = x, which is the identity mapping mentioned above. ResNet therefore changes the learning goal: instead of learning the complete output, it learns the difference between the target value H(x) and x, that is, the residual F(x) = H(x) − x. The training goal then becomes driving the residual toward 0, so that accuracy does not decline as the network deepens.
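The identity-shortcut idea above reduces to a few lines of plain Python (the residual branches below are stand-in lambdas, not trained networks):

```python
def residual_block(x, residual_branch):
    """ResNet shortcut: the block learns F(x) and outputs H(x) = F(x) + x.
    If the learned branch F collapses to zero, H reduces to the identity."""
    return [f + v for f, v in zip(residual_branch(x), x)]

x = [1.0, 2.0, 3.0]
identity_out = residual_block(x, lambda v: [0.0] * len(v))      # F(x) = 0 -> H(x) = x
corrected = residual_block(x, lambda v: [0.1 * t for t in v])   # small learned correction
```

Because the shortcut carries the input through unchanged, a deep stack of such blocks can always fall back to the identity, which is why accuracy need not degrade with depth.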
2.3. Spatiotemporal Fusion Using Residual Learning in CNN
In this paper, the Landsat image is regarded as high spatial but low temporal resolution data, and the MODIS image as high temporal but low spatial resolution data. We denote the Landsat and MODIS images at time tk as Lk and Mk, respectively. If two pairs of prior images are available, the two-stream residual CNN uses the known Landsat-MODIS pairs (L1, M1) at t1 and (L3, M3) at t3, together with the MODIS image M2 at the prediction date t2, to predict the Landsat-like image L2.
2.3.1. Training Stage.
In the training stage, to build a nonlinear mapping model between the MODIS image and the Landsat-MODIS residual image, we first upsample Mk to the same spatial size as Lk. The Landsat and MODIS images of the same date are then differenced to obtain a residual image Rk = Lk − Mk. We thus expect to learn a mapping f that approximates Rk. Pixel values in Rk are likely to be zero or small, and it is this residual image that we want to predict; the loss function becomes (1/2)||Rk − f(Mk)||², where f(Mk) is the network prediction. We divide the high- and low-resolution images of the same date into overlapping image patches, giving sample sets {Mk^i} and {Rk^i} in which corresponding patches Mk^i and Rk^i form a training pair. The overlapping segmentation is performed to increase the number of training samples. After predicting the residual image, the predicted Landsat-like image is obtained as the sum of the input MODIS image and the predicted residual image.
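The construction of the residual training pairs can be sketched as follows (plain Python; the 8×8 tiles, patch size 4, and stride 2 are illustrative stand-ins for real co-registered MODIS/Landsat tiles):

```python
def overlapping_patches(img, size, stride):
    """Slice an image into overlapping square patches (stride < size),
    which multiplies the number of training samples."""
    h, w = len(img), len(img[0])
    return [[row[j:j + size] for row in img[i:i + size]]
            for i in range(0, h - size + 1, stride)
            for j in range(0, w - size + 1, stride)]

# Hypothetical co-registered tiles: upsampled MODIS (M) and Landsat (L).
M = [[(i * 8 + j) / 64.0 for j in range(8)] for i in range(8)]
L = [[v + 0.05 for v in row] for row in M]   # Landsat = MODIS + residual
R = [[lv - mv for lv, mv in zip(lr, mr)] for lr, mr in zip(L, M)]

inputs = overlapping_patches(M, size=4, stride=2)   # network input patches
labels = overlapping_patches(R, size=4, stride=2)   # residual label patches
```

With an 8×8 tile, a 4×4 patch, and stride 2, each tile yields 3 × 3 = 9 overlapping training pairs instead of the 4 a non-overlapping split would give.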
In the network, the loss layer has three inputs: the residual estimate, the input MODIS image, and the Landsat image. The loss is computed as the Euclidean distance between the reconstructed image and the real Landsat image. To achieve high-precision spatiotemporal fusion, we use a very deep convolutional network of 18 layers, in which all layers except the first and the last are of the same type: 64 filters of size 3 × 3 × 64, where each filter operates on a 3 × 3 spatial region across 64 channels (feature maps). The first layer operates on the input image, and the last layer, used for image reconstruction, consists of a single filter of size 3 × 3 × 64. The processing structure is shown in Figure 3.
Training was performed by backpropagation-based mini-batch gradient descent to optimize the regression target. We set the momentum parameter to 0.9, and training is regularized by weight decay (an L2 penalty multiplied by 0.0001).
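The update rule just described can be sketched in plain Python (the weight and gradient values are illustrative; a real trainer would apply this per mini-batch over all network parameters):

```python
def sgd_momentum_step(w, grad, v, lr=0.01, momentum=0.9, weight_decay=1e-4):
    """One mini-batch update with momentum 0.9 and an L2 penalty (weight
    decay) of 1e-4, matching the training setup described above. The
    learning rate of 0.01 is an assumed placeholder."""
    new_w, new_v = [], []
    for wi, gi, vi in zip(w, grad, v):
        gi = gi + weight_decay * wi      # add the L2 regularization gradient
        vi = momentum * vi - lr * gi     # update the velocity
        new_v.append(vi)
        new_w.append(wi + vi)            # apply the velocity to the weight
    return new_w, new_v

w, v = [1.0, -2.0], [0.0, 0.0]
w, v = sgd_momentum_step(w, [0.5, -0.5], v)
```

The momentum term accumulates past gradients, which smooths the descent direction; the weight-decay term shrinks large weights and acts as the L2 regularizer mentioned in the text.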
2.3.2. Prediction Stage.
There are two pairs of prior Landsat-MODIS images and the MODIS image on the prediction date, and we aim to fuse them to predict the Landsat-like image on that date. Denote the prior dates as t1 and t3 and the prediction date as t2; we predict L2 based on the residual learning CNN. L1, M1, L3, M3, and M2 are divided into patches, with corresponding image patches L1^i, M1^i, L3^i, M3^i, and M2^i, respectively. Taking M1 as the input of the CNN with label R1 = L1 − M1, the sum of the predicted residual image and M1 is used as the prediction; the number of network layers is set to 18. During reconstruction, M2 is input into the network trained on the t1 pair to obtain a predicted L2^(1). Similarly, L2^(3) can be predicted using the Landsat-MODIS image pair at t3. Considering the temporal correlation between the image at the prediction date and the reference images, we apply temporal weights when reconstructing each image patch. Finally, the high spatial resolution image patch at the prediction date is obtained as

L2^i = w1 × L2^(1),i + w3 × L2^(3),i,

where L2^(1),i and L2^(3),i are the ith patch predicted using (L1, M1) and (L3, M3) as the reference pair, respectively, and w1 and w3 are the corresponding weights, determined as

wk = (1/dk) / (1/d1 + 1/d3), k ∈ {1, 3}.

The local change magnitude dk is calculated from the sum of the normalized difference vegetation index (NDVI) [31] and the normalized difference built-up index (NDBI) [32]: dk measures the degree of change between the MODIS images at tk and t2, taken as the absolute average change of NDVI + NDBI within the patch. After each image patch is reconstructed, the patches are restored to the whole image. To ensure the continuity of the reconstructed image, adjacent patches overlap, and the pixel values in overlapping regions are averaged when the whole image is restored.
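A minimal sketch of this temporal weighting follows; the inverse-change form of the weights reflects our reading of the description above (a less-changed reference date gets more weight), and the change magnitudes and patch values are illustrative:

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from NIR and red reflectance."""
    return (nir - red) / (nir + red)

def ndbi(swir, nir):
    """Normalized difference built-up index from SWIR and NIR reflectance."""
    return (swir - nir) / (swir + nir)

def temporal_weights(d1, d3, eps=1e-8):
    """Inverse-change weighting: the reference date whose MODIS image
    changed least toward the prediction date gets the larger weight."""
    inv1, inv3 = 1.0 / (d1 + eps), 1.0 / (d3 + eps)
    return inv1 / (inv1 + inv3), inv3 / (inv1 + inv3)

def fuse_patch(p1, p3, d1, d3):
    """Weighted sum of the two patch predictions (patches as flat lists)."""
    w1, w3 = temporal_weights(d1, d3)
    return [w1 * a + w3 * b for a, b in zip(p1, p3)]

# Illustrative change magnitudes and two constant 2x2 patch predictions.
fused = fuse_patch([2.0] * 4, [4.0] * 4, d1=0.1, d3=0.3)
```

With d1 = 0.1 and d3 = 0.3, the t1 prediction receives weight 0.75 and the t3 prediction 0.25, so the fused patch lies closer to the prediction from the less-changed reference date.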
3. Experiments
Two datasets were used in the experiments. The first dataset contains two pairs of MODIS-Landsat images, and the second contains three pairs. Both areas are located in Coleambally, New South Wales, Australia. The MODIS data are the surface reflectance of the 8-day composite products MOD09A1 (500 m) and MOD09Q1 (250 m). We upsampled all MODIS images in the dataset to the same resolution as the Landsat image of the corresponding date. Compared with natural images, remote sensing images are large and rich in detail, so the remote sensing images are divided into overlapping patches to obtain the training set. In this paper, the images of the two areas are divided with overlap into image patches, and the resulting patch set is used as the training and prediction sets. We compare our method with mainstream and advanced methods (STARFM, FSDAF, FitFC, STDFA, STIFM, HCM, ESTARFM, SPSTFM, and StfNet), as described in detail in this section.
3.1. Experiment on the First Dataset
To verify the applicability of our residual-CNN-based spatiotemporal fusion method when only one prior Landsat-MODIS image pair is available, we use a single-stream network and compare it against STARFM, FSDAF, FitFC, STDFA, STIFM, and HCM with the same data as input.
In this experiment, two pairs of Landsat and MODIS surface reflectance images covering an area in Coleambally are used. The two image pairs were acquired on 2 July 2013 and 17 August 2013. Figure 4 shows the 30 m Landsat images (upper row) and 500 m MODIS images (lower row) using green-red-NIR as the RGB composite. We then use bicubic interpolation to resample the 500 m MODIS image to 30 m. The experimental task uses the Landsat-MODIS image pair from 2 July 2013 and the MODIS image from 17 August 2013 to predict the 30 m Landsat-like image on 17 August 2013. STARFM, FSDAF, FitFC, STDFA, STIFM, and HCM are tested with the same input, and the true 30 m Landsat image acquired on 17 August 2013 is used as the reference to evaluate the accuracy of the fusion results.
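The resampling step can be illustrated with a dependency-free nearest-neighbour upscaler (the experiment uses bicubic interpolation; this stand-in, with an illustrative 2×2 grid and factor 4, only shows the grid-size change of one coarse pixel covering many fine cells):

```python
def upsample_nearest(img, factor):
    """Nearest-neighbour upscaling of a 2-D grid: each coarse pixel is
    replicated into a factor x factor block of fine pixels. A real pipeline
    would use bicubic interpolation for smoother transitions."""
    h, w = len(img), len(img[0])
    return [[img[i // factor][j // factor] for j in range(w * factor)]
            for i in range(h * factor)]

modis = [[0.2, 0.4], [0.6, 0.8]]        # toy 2x2 coarse reflectance grid
up = upsample_nearest(modis, factor=4)  # 8x8 fine grid
```

Note that upscaling (whether nearest-neighbour or bicubic) only changes the grid spacing; it adds no genuine spatial detail, which is exactly what the fusion network must supply as the residual.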
Figure 5 shows the fusion results of the seven methods (STARFM, FitFC, FSDAF, STDFA, STIFM, HCM, and our method). Our method clearly achieves higher prediction accuracy. For example, in the highlighted areas in the bottom left part of subarea S, FitFC, STDFA, FSDAF, STIFM, HCM, and STARFM incorrectly predict some dark green pixels as purple pixels. In addition, in the highlighted areas in the bottom right part of the subarea, FitFC, STDFA, FSDAF, STIFM, HCM, and STARFM incorrectly predict some red pixels as purple and blue pixels, whereas our method stays closer to the reference image. The main reason is that FitFC directly applies the linear coefficients fitted on the known low-resolution images to the high-resolution image of the prediction period; when the spatial resolution gap between the high- and low-resolution images is large (the spatial resolutions of the Landsat and MODIS images differ by a factor of nearly 17), this produces an obvious "block effect." STDFA assumes that the temporal variation of the same surface cover class within coarse pixels is consistent, which may not hold in practice, so the fusion result is affected. The prediction accuracy of FSDAF is worse, mainly for two reasons: first, the known high-resolution data must be classified, and the accuracy of the unsupervised classification method (such as the k-means method) affects the results; second, when the spatial resolution gap between high- and low-resolution data is large, the area represented by each endmember (that is, each high-resolution pixel) becomes more refined.
When the number of categories is small, the fusion result is relatively smooth, and when the number of categories is large, the fitting accuracy also decreases (for instance, within a low-resolution pixel, if the abundance of a certain category is low, the total prediction error increases). STIFM is susceptible to interference from outliers, so when the spatial characteristics change significantly, its predictions are poor. The gradation-mapping approach is strongly affected by heterogeneous regions, so HCM also failed to deliver the best performance in this experiment. STARFM considers the similarity of neighboring pixels, so its prediction accuracy is relatively stable; however, STARFM presupposes that the spectra of similar pixels in the neighborhood are constant and that no land cover change occurs during the observation period, which makes the model susceptible to environmental and phenological changes and leads to large prediction errors, especially in heterogeneous areas. Our method uses deep convolutional neural networks to extract the features of the low-resolution and residual images more effectively and constructs a mapping between low-resolution images and residual images through a residual learning network. This mapping is nonlinear and better matches the changes of ground features. In addition, residual learning allows the number of layers to be deepened, which strengthens the robustness of the network. The experimental results of our method therefore have better visual quality.
Table 1 lists the objective evaluation results of the seven fusion methods using three common fusion evaluation metrics for remote sensing images: root mean square error (RMSE) [33], correlation coefficient (CC) [34], and universal image quality index (UIQI) [35]. The ideal values of RMSE, CC, and UIQI are 0, 1, and 1, respectively. From Table 1, we can see that across the six bands of all fusion results, our method yields smaller RMSE and larger CC and UIQI. Compared with the other six methods (STARFM, FitFC, FSDAF, STDFA, STIFM, and HCM), the gains in mean CC are 0.0259, 0.0365, 0.0168, 0.0253, 0.0620, and 0.0487, the gains in mean UIQI are 0.0261, 0.0368, 0.0175, 0.0254, 0.0620, and 0.0489, and the mean RMSE is reduced by 0.0018, 0.0024, 0.0012, 0.0021, 0.0040, and 0.0032, respectively. In addition, the fusion result of our method is better than that of STARFM, and STARFM is better than FitFC; the rest of the ordering is STDFA > FSDAF > HCM > STIFM. The main reason is that when the spatial resolution gap between high- and low-resolution images is large, FitFC directly applies the fitting coefficients of the low-resolution images to the high-resolution images, which causes large errors, and FSDAF has similar fitting errors. STDFA assumes that the temporal variation of the same surface cover class within coarse pixels is consistent, which may not hold in practice, so the fusion result is affected. Although STARFM considers the similarity of neighboring pixels, its per-pixel reconstruction cannot account for the continuity of the image. STIFM is susceptible to outliers, so its predictions degrade when the spatial characteristics change significantly, and HCM's gradation mapping is strongly affected by heterogeneous regions. Our method better restores the continuity of the image by reconstructing image patches.
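The three evaluation metrics can be computed per band as follows (pure Python on flattened band values; the UIQI follows the Wang-Bovik single-window form, which combines correlation, luminance distortion, and contrast distortion):

```python
import math

def _mean(x):
    return sum(x) / len(x)

def rmse(x, y):
    """Root mean square error; 0 for identical bands."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def cc(x, y):
    """Pearson correlation coefficient between two flattened bands."""
    mx, my = _mean(x), _mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def uiqi(x, y):
    """Universal image quality index (Wang & Bovik): 1 means identical
    images; penalizes correlation, mean and variance mismatches."""
    mx, my = _mean(x), _mean(y)
    vx = sum((a - mx) ** 2 for a in x) / len(x)
    vy = sum((b - my) ** 2 for b in y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return 4 * cov * mx * my / ((vx + vy) * (mx ** 2 + my ** 2))
```

In practice, UIQI is usually evaluated over sliding windows and averaged; the single-window form above is the core formula, computed here over the whole band for brevity.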

3.2. Experiment on the Second Dataset
In this experiment, three pairs of Landsat-MODIS images covering an area of Coleambally are used to verify the applicability of our method with two pairs of prior images. The three image pairs were acquired on 6 April 2012, 12 May 2012, and 20 July 2012. Figure 6 shows the 30 m Landsat images (upper row) and 500 m MODIS images (lower row) using green-red-NIR as the RGB composite. To verify the accuracy of our method with two pairs of prior images, we use the two image pairs from 6 April 2012 and 20 July 2012 together with the MODIS image from 12 May 2012 to predict the Landsat-like image on 12 May 2012.
Figure 7 shows the 30 m prediction results on 12 May 2012 for the four methods (ESTARFM, SPSTFM, StfNet, and our method). It is worth noting that ESTARFM performs worst, StfNet is better than SPSTFM, and our method is better than StfNet. For example, in the highlighted areas in the bottom left of subarea S, ESTARFM and SPSTFM incorrectly predict some light green pixels as dark green pixels. Although the StfNet prediction is similar to the reference image, some yellow pixels are incorrectly predicted as blue pixels; the prediction of our method is closest to the true reference image. Compared with the three benchmark methods, our method provides excellent performance. The main reasons are as follows. ESTARFM assumes that the conversion coefficients between high- and low-resolution images remain unchanged during the observation period, but in reality land types and coverage change, so this assumption does not hold in areas with significant change. SPSTFM uses sparse representation and dictionary learning in the signal domain to increase prediction accuracy for land cover change and heterogeneous regions; however, compared with our method, SPSTFM is applicable only to small-scale regions and cannot extract sufficient image features. Although StfNet can produce more accurate predictions with a deep network, the volume of data involved in training is large and the network is hard to converge, which also affects the prediction accuracy. Our residual learning network needs to learn only the difference information between high- and low-resolution images. Since the low-frequency information of the high- and low-resolution images is similar, directly learning the mapping between them would increase the amount of computation and introduce errors.
Through residual learning, not only can the nonlinear mapping of the high-frequency residual be learned directly, but the network layers can also be deepened, which enhances the accuracy and stability of the network structure.
Table 2 shows the comparison in terms of RMSE, CC, and UIQI. From Table 2, we can see that across the six bands, our method obtains smaller average RMSE and larger CC and UIQI. It is easy to see that our method is better than StfNet, StfNet is better than SPSTFM, and ESTARFM is the worst of the four approaches. Specifically, the CC gains of our method over ESTARFM, SPSTFM, and StfNet are 0.0508, 0.0257, and 0.0134, and the UIQI gains are 0.0467, 0.0238, and 0.0126, respectively. The main reason is that ESTARFM assumes the conversion coefficients remain unchanged during the observation period, but land cover changes occur in this area (for example, in subregion S), so the conversion coefficients are not consistent and the prediction results are heavily biased. SPSTFM takes the image patch as the reconstruction unit and considers the continuity between adjacent pixels, so it is robust to complex surface changes; however, forcing the high- and low-resolution dictionaries to share the same sparse coefficients makes the constructed mapping unstable, so its performance in this experiment is worse than ours. StfNet has many network layers, but it is difficult to converge because it directly trains the mapping between high- and low-resolution images, which also makes the network unstable. Our residual network not only improves the stability of the network but also enhances the accuracy of the fusion results.

4. Conclusion
In this paper, we propose a residual convolutional neural network to predict Landsat-like images; the method can be applied even when only one pair of prior images is available. The method consists of two main steps: first, the known MODIS-Landsat image pair is used to train the residual convolutional neural network; second, the MODIS image at the prediction phase is input to reconstruct the Landsat-like image. Compared with several benchmark algorithms (STARFM, FSDAF, FitFC, ESTARFM, SPSTFM, and StfNet), our method has the advantages of a learning algorithm, taking the image patch as the reconstruction unit and considering the continuity between adjacent pixels. Training on the residual to construct the deep network not only enhances the stability of the network but also improves prediction accuracy.
Learning-based spatiotemporal fusion methods achieve greater prediction accuracy in heterogeneous regions. In this paper, we use a multilayer convolutional neural network to extract spatial features; in future work, we will try to design more effective spatial feature extractors to improve the recognition of change information. In recent years, deep learning has received extensive attention, and it requires a large amount of data to train a model. Because remote sensing data are large in volume and rich in information, we can exploit these "big data" characteristics to learn a more effective mapping between MODIS and Landsat images and thus improve prediction accuracy. In addition, although learning-based spatiotemporal fusion models perform well, their computation time is long, a common weakness of learning-based methods. Therefore, our future work will aim both at improving the accuracy of the fusion results and at reducing the computational complexity.
Data Availability
Data is not available for the following reason: the remote sensing data used in the experiments in this paper were provided by the Institute of Remote Sensing Applications, Chinese Academy of Sciences, and without the Institute's consent the authors cannot make the data available.
Conflicts of Interest
The authors declare that they have no conflict of interest.
References
 B. Rasti, D. Hong, R. Hang et al., “Feature extraction for hyperspectral imagery: the evolution from shallow to deep (overview and toolbox),” IEEE Geoscience and Remote Sensing Magazine, vol. 8, no. 3, pp. 63–92, 2020.
 M. C. Anderson, R. G. Allen, A. Morse, and W. P. Kustas, “Use of Landsat thermal imagery in monitoring evapotranspiration and managing water resources,” Remote Sensing of Environment, vol. 122, pp. 50–65, 2012.
 F. D. van der Meer, H. M. A. van der Werff, F. J. A. van Ruitenbeek et al., “Multi- and hyperspectral geologic remote sensing: a review,” International Journal of Applied Earth Observation and Geoinformation, vol. 14, no. 1, pp. 112–128, 2012.
 D. Hong, N. Yokoya, N. Ge, J. Chanussot, and X. X. Zhu, “Learnable manifold alignment (LeMA): a semi-supervised cross-modality learning framework for land cover and land use classification,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 147, pp. 193–205, 2019.
 S. Ganguly, M. A. Friedl, B. Tan, X. Zhang, and M. Verma, “Land surface phenology from MODIS: characterization of the collection 5 global land cover dynamics product,” Remote Sensing of Environment, vol. 114, no. 8, pp. 1805–1816, 2010.
 X. Zhang, M. A. Friedl, C. B. Schaaf et al., “Monitoring vegetation phenology using MODIS,” Remote Sensing of Environment, vol. 84, no. 3, pp. 471–475, 2003.
 C. Vignolles, M. Gay, G. Flouzat, and P. Puyou-Lascassies, “Spatio-temporal connection of remote sensing data concerning agricultural production modelisation at a middle scale,” in 1995 International Geoscience and Remote Sensing Symposium (IGARSS '95): Quantitative Remote Sensing for Science and Applications, Firenze, Italy, July 1995.
 F. Gao, J. Masek, M. Schwaller, and F. Hall, “On the blending of the Landsat and MODIS surface reflectance: predicting daily Landsat surface reflectance,” IEEE Transactions on Geoscience and Remote Sensing, vol. 44, no. 8, pp. 2207–2218, 2006.
 T. Hilker, M. A. Wulder, N. C. Coops et al., “A new data fusion model for high spatial- and temporal-resolution mapping of forest disturbance based on Landsat and MODIS,” Remote Sensing of Environment, vol. 113, no. 8, pp. 1613–1627, 2009.
 D. P. Roy, J. Ju, P. Lewis et al., “Multi-temporal MODIS-Landsat data fusion for relative radiometric normalization, gap filling, and prediction of Landsat data,” Remote Sensing of Environment, vol. 112, no. 6, pp. 3112–3130, 2008.
 X. Zhu, J. Chen, F. Gao, X. Chen, and J. G. Masek, “An enhanced spatial and temporal adaptive reflectance fusion model for complex heterogeneous regions,” Remote Sensing of Environment, vol. 114, no. 11, pp. 2610–2623, 2010.
 Q. Wang and P. M. Atkinson, “Spatio-temporal fusion for daily Sentinel-2 images,” Remote Sensing of Environment, vol. 204, pp. 31–42, 2018.
 C. Kwan, B. Budavari, F. Gao, and X. Zhu, “A hybrid color mapping approach to fusing MODIS and Landsat images for forward prediction,” Remote Sensing, vol. 10, no. 4, pp. 520–529, 2018.
 B. Zhukov, D. Oertel, F. Lanzl, and G. Reinhackel, “Unmixing-based multisensor multiresolution image fusion,” IEEE Transactions on Geoscience and Remote Sensing, vol. 37, no. 3, pp. 1212–1226, 1999.
 M. Wu, “Use of MODIS and Landsat time series data to generate high-resolution temporal synthetic Landsat data using a spatial and temporal reflectance fusion model,” Journal of Applied Remote Sensing, vol. 6, no. 1, p. 063507, 2012.
 M. Wu, W. Huang, Z. Niu, and C. Wang, “Generating daily synthetic Landsat imagery by combining Landsat and MODIS data,” Sensors, vol. 15, no. 9, pp. 24002–24025, 2015.
 K. Hazaymeh and Q. K. Hassan, “Spatiotemporal image-fusion model for enhancing the temporal resolution of Landsat-8 surface reflectance images using MODIS images,” Journal of Applied Remote Sensing, vol. 9, no. 1, p. 096095, 2015.
 X. Zhu, E. H. Helmer, F. Gao, D. Liu, J. Chen, and M. A. Lefsky, “A flexible spatiotemporal method for fusing satellite images with different resolutions,” Remote Sensing of Environment, vol. 172, pp. 165–177, 2016.
 D. Hong, N. Yokoya, J. Chanussot, and X. X. Zhu, “CoSpace: common subspace learning from hyperspectral-multispectral correspondences,” IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 7, pp. 4349–4359, 2019.
 B. Huang and H. Song, “Spatiotemporal reflectance fusion via sparse representation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 50, no. 10, pp. 3707–3716, 2012.
 H. Song and B. Huang, “Spatiotemporal satellite image fusion through one-pair image learning,” IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 4, pp. 1883–1896, 2013.
 D. Yu and L. Deng, “Deep learning and its applications to signal and information processing [exploratory DSP],” IEEE Signal Processing Magazine, vol. 28, no. 1, pp. 145–154, 2011.
 F. H. C. Tivive and A. Bouzerdoum, “A new class of convolutional neural networks (SICoNNets) and their application to face detection,” in Proceedings of the International Joint Conference on Neural Networks, 2003, pp. 2157–2162, Portland, OR, USA, July 2003.
 A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, pp. 1097–1105, Lake Tahoe, Nevada, USA, 2012, Curran Associates, Inc.
 C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2016.
 H. Song, Q. Liu, G. Wang, R. Hang, and B. Huang, “Spatiotemporal satellite image fusion using deep convolutional neural networks,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 11, no. 3, pp. 821–829, 2018.
 X. Liu, C. Deng, J. Chanussot, D. Hong, and B. Zhao, “StfNet: a two-stream convolutional neural network for spatiotemporal image fusion,” IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 9, pp. 6552–6564, 2019.
 K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, June 2016.
 S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017.
 M. Liang and X. Hu, “Recurrent convolutional neural network for object recognition,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, June 2015.
 E. F. Lambin and A. H. Strahler, “Indicators of land-cover change for change-vector analysis in multitemporal space at coarse spatial scales,” International Journal of Remote Sensing, vol. 15, no. 10, pp. 2099–2119, 1994.
 C. He, P. Shi, D. Xie, and Y. Zhao, “Improving the normalized difference built-up index to map urban built-up areas using a semiautomatic segmentation approach,” Remote Sensing Letters, vol. 1, no. 4, pp. 213–221, 2010.
 Z. Zhang and R. S. Blum, “A categorization of multiscale-decomposition-based image fusion schemes with a performance study for a digital camera application,” Proceedings of the IEEE, vol. 87, no. 8, pp. 1315–1326, 1999.
 K. D. Kim and J. H. Heo, “Comparative study of flood quantiles estimation by nonparametric models,” Journal of Hydrology, vol. 260, no. 1–4, pp. 176–193, 2002.
 Z. Wang and A. C. Bovik, “A universal image quality index,” IEEE Signal Processing Letters, vol. 9, no. 3, pp. 81–84, 2002.
Copyright
Copyright © 2020 Xiaofei Wang and Xiaoyi Wang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.