Abstract

Air temperature (Ta) is an essential parameter for science research and engineering practice. While the traditional site-based approach is only able to obtain observations in limited and discrete locations, satellite remote sensing is promising to retrieve some environmental variables with spatially continuous coverage. Nowadays, land surface temperature (Ts) measurements can be obtained from some satellite sensors (e.g., MODIS), further enabling us to estimate Ta in view of the relationship between Ta and Ts. In this article, we proposed a two-phase integrated framework to estimate daily mean Ta nationwide. In the first phase, multivariate linear regression models were fitted between site-based observations of daily mean air temperature (Ta-mean) and MODIS land surface temperature products (including Terra day: TMOD-day, Terra night: TMOD-night, Aqua day: TMYD-day, and Aqua night: TMYD-night) conditional on some covariates of environmental factors. The fitted models were then used to predict Ta-mean from those covariates at unobserved locations. The predicted Ta-mean were looked on as stochastic variables, and their distributions were also obtained. In the second phase, Bayesian maximum entropy (BME) methods were used to produce spatially continuous maps of Ta-mean taking the meteorological station observations as hard data and the predicted Ta-mean in the first phase as soft data. It is shown that the proposed approach is promising to improve the interpolation accuracy significantly, comprehensively considering the prior knowledge and the context of space variability and correlation, which will enable it to compile spatially continuous air temperature products with higher accuracy.

1. Introduction

Air temperature (Ta) is one of the most important variables in scientific research and engineering practice. Short-term effects of air temperature include disease spreading, crop growth, snow-melting, and inundation, while the long-term effects of it on regional and global development such as global warming, drought, extreme weather, and food safety are concerned. In meteorology, air temperature is one of the basic meteorological factors commonly observed at 2 m above the ground in weather observation sites. Therefore, their spatial distribution depends on the sites built for weather data collection [1]. Ta is obtained as point data that cannot directly depict the range of climate variability within a region. Weather observation sites are dense in cities and sparse in sparsely populated and underdeveloped regions. In China, most of weather observation sites are distributed in the east. To densify air temperature observations, the remotely sensed land surface temperature (LST, Ts), retrieved from thermal images, has been used in estimating Ta. However, the major limitations in estimating Ta using remotely sensed Ts are the uncertainties of Ts estimation, nonlinear relationship between Ts and Ta, and trade-off between the temporal and the spatial resolution. In addition, the thermal remote sensing approach is not applicable under cloudy conditions.

To produce high-resolution Ta data, many methods involving in station observations and remotely sensed Ts are proposed. Three types of methods—interpolation, regression analysis, and simulation—were reviewed and have been proven to be useful in mapping high-resolution Ta [2]. Among them, satellite remote sensing Ts-based regression estimation and spatial interpolation of observed Ta are most popular approaches. Wang et al. [3] compared spatial interpolation and regression analysis models for estimates of monthly near-surface Ta from station observations in China [4]. Their results indicated that the higher standard deviation and the lower mean of near-surface Ta from sample data would be associated with a better performance of predicting monthly near-surface Ta using spatial interpolation models. Considering filling in data gaps in the time series of Ta, Shtiliyanova et al. applies a Kriging-based interpolation in the temporal dimension to predict missing Ta data [5] against temperature datasets from five sites in Europe and one site situated in the Indian Ocean (Réunion Island, France overseas). To provide long-term grid historical temperature datasets based on sparse historical stations, a temperature spatial interpolation based on the Biased Sentinel Hospitals Areal Disease Estimation (P-BSHADE) method was proposed, which successfully interpolate 1-km grids of monthly Ts in the historical period of 1900–1950 in China [6]. Considering the characteristics of spatial autocorrelation and nonhomogeneity of the temperature distribution to obtain unbiased and minimum error variance estimates, the proposed method shows better accuracy compared to inverse distance weighting (IDW), spline. Cho et al. proposed a so-called stacking ensemble model consisting of multilinear regression (MLR), support vector regression (SVR), and random forest (RF) optimized by the SVR to interpolate the daily maximum Ta during summertime in Seoul, the capital of South Korea [7]. Some other case studies can also be seen, which have selected either satellite remote sensing-based Ta estimation or spatial interpolation of site Ta to conduct spatially continuous Ta prediction. However, thanks to either of them has limitations and shortcomings, it is necessary to integrate advantages of potential methods to enhance the knowledge of Ta patterns, and to map spatially continuously covered and high-resolution Ta.

It is well known that remote sensing takes some advantages such as continuous space tessellation, wide observation scope, and low cost over other observation methods. In recent decades, several well-known sensors such as AVHRR and MODIS were launched into orbit, which enables us to obtain moderate resolution images with many spectrums (e.g., up to 36 for MODIS) ranging from infrared to microwave. Simultaneously, researchers begin to develop a variety of algorithms for retrieving Ts from these remotely sensed data. In view of the fact that, Ts and Ta have a strong association with each other, remote sensing has been combined with meteorological sites to improve the spatial coverage and accuracy of Ta. Kloog et al. presented work on predicting Ta from Ts in Massachusetts by predicting 24 h Ta means on a 1-km grid across the Northeast and Mid-Atlantic states demonstrating how Ts can be used reliably to predict daily Ta at high resolution in large geographical areas even in nonretrieval days [8]. Alonso et al. proposed to estimate Ta from 28 explanatory variables (covariates), using multiple linear regressions, which integrates variables from remote sensing and the variables traditionally used like the ones from the Land Use Land Cover [9]. Some other researchers proposed statistical models to estimate Ta using MODIS land surface temperature data [1013]. Furthermore, with a daytime Ta variation model, it is possible to estimate daily Ta at any time. For example, Chen et al. [14] first estimate daytime Ta at the times of Terra and Aqua satellites overpass, and subsequently, the maximum Ta is inferred from the daytime Ta. A general approach to obtaining Ta from images involves two steps: (1) retrieve the land surface temperature at the satellite overpass time and then (2) calculate the Ta from the land surface temperature according to the physical or statistical function relationship between them.

Spatial interpolation is a class of methods that interpolate the values of unobserved locations from the values of observed locations. In a geographical context, spatial auto-association is universally present, which differs spatial problems from nonspatial ones [15]. Therefore, the focus of most of the spatial interpolation methods is devoted to dealing with this nuisance. These methods estimate the most likely values through spatial trend analysis and spatial correlation analysis of discrete observed data. Geostatistics is developed from Kriging interpolation techniques along this thread. The recent advances in geostatistics have been striving to obtain better estimates by blending as much information as possible (including observations and prior knowledge). Christakos [16] extended Kriging techniques into a new methodological framework called BME (Bayesian maximum entropy) [16]. BME takes some types of data and different types of knowledge into spatial interpolation. According to BME methodology, these data and knowledge are divided into general knowledge (GK) and site-specific knowledge (SK). Two types of data are involved in BME interpolation procedure: hard data and soft data. Hard data are generally measured with instruments and considered having definite values. Soft data have indefinite values relative to hard data. Soft data are generally depicted by value interval, probability distribution, etc. The application of BME methods in some fields can be referenced in [3, 17, 18]. It has been also shown that BME is promising in blending multiple sensor data and multiresolution satellite products [19, 20].

In this study, Ts from MODIS sensors and Ta from weather stations were integrated to densify Ta, considering that Ts is spatially continuous and is able to provide additional information for spatial estimation of Ta. The purpose of this study was to estimate Ta with a high spatial resolution and accuracy simultaneously, coupling high spatial resolution remote sensing data with high accuracy (but spatially sparse) station data. We proposed a two-phase approach to estimating Ta at unobserved locations and demonstrated the estimation and evaluation of daily mean Ta (Ta-mean) on four days of 2004 (the vernal equinox (VE), the summer solstice (SS), the autumnal equinox (AE), and the winter solstice (WS)).

2. Study Area and Materials

The study area is Mainland China. There is a total of 839 sites distributed over the area. Observations from these sites are taken as hard data. A total of 334 locations are supplemented to obtain soft data through multivariate linear regression estimate techniques. These supplemented locations were generated at spatial random conditional on meteorological site locations (constrained in areas with sparse sites). Figure 1 shows the study area and the distribution of meteorological observation sites and the supplemented locations within it. According to the elevation, the whole study area can be divided into three terrain ranges. The first terrain range is in the western China and has a mean altitude higher than 4000 m. High mountains are main features in this range. The second range is in the central and northern China and has a mean altitude from 1000 m to 2000 m. It consists mainly of plateaus, large basins, and some high mountains. And the third range is in the eastern China and has mean altitude from 200 m to 1000 m. It is mainly composed of plains, hills, and low mountains. We also divided the whole study area into the west part and the east part in terms of the Hu Huanyong line in the name of Chinese scholar Hu Huanyong, which is going to be used to examine the effects of areas with different geographical characteristics on the Ta estimation.

The Ta observations from meteorological sites and the Ts products derived from MODIS in the 80th day (VE), the 172th day (SS), the 266th day (AE), and the 356th day (WS) of 2014 were used. These are four representative days in a whole year. Daily Ta variables are collected at those sites from China Meteorological Service Center. Ta-mean was calculated based on a published standard by CMA (China Meteorological Administration) from the original weather site observations, which is the arithmetic mean of the Ta observations at four time points: 02 : 00, 08 : 00, 14 : 00, and 20 : 00 a day.

Two MODIS products—MOD11C1 and MYD11C1— were used, which are global Ts data retrieved from MODIS sensors aboard Terra and Aqua satellites. They are estimated at time when satellites pass according to an algorithm developed by NASA. The nominal equatorial passing time of Terra is around 10:30 am and 10:30 pm of local solar time, while Aqua passes in the opposite direction at about 1:30 am and 1:30 pm. Therefore, Terra MOD11C1 and Aqua MYD11C1 include both day and night Ts (Terra day: TMOD-day, Terra night: TMOD-night, Aqua day: TMYD-day, and Aqua night: TMYD-night). These two products cover the main continents on the Earth surface, and the spatial resolution is 0.05 degrees. Therefore, the real ground extent is 5 km × 5 km or so near the equator. We downloaded the Ts data from a remotely sensed data distribution website created by the United States National Aeronautics and Space Administration (NASA) (https://disc.gsfc.nasa.gov/). The global products were trimmed to the north latitude range (39.4°, 41.1°) and east longitude range (115.3°, 117.6°), which is the bounding box of the study area. According to the quality control information of the products, we excluded the missing data and the data whose nominal errors are greater than 3K. As a result, we got some data with acceptable quality (nominal errors ≤3K). The analysis will be carried out based on these remaining data.

We also acquired a DEM data of China from the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) Global Digital Elevation Model (GDEM), which was developed jointly by the Ministry of Economy, Trade and Industry (METI) of Japan and NASA under the agreement of contribution to GEOSS (Global Earth Observation System of Systems), and a public release was started on June 29, 2009 [21]. Version 1 of the DEM data is produced in 2009 and reproduced and improved in 2011 (version 2). We downloaded version 2 of the DEM data. This dataset is provided in raster format, and its spatial resolution is 30 m. The absolute vertical accuracy is within 0.20 meters on average.

Two vegetation index data provided by MODIS were used. One is the classical NDVI (normalized difference vegetation index) product and the other is EVI (enhanced vegetation index) product. The latter is more sensitive to areas with high vegetation coverage than the former. The MOD13C1 product was selected to extract these two indices. Since the MOD13C1 product is a 16-day composite one. Corresponding to the four days when Ta-mean was estimated in this study, the MOD13C1 data on the 65th day, the 161th day, the 257th day, and the 353th day of 2014 were selected and downloaded.

A LUCC (Land Use/Cover Change) dataset from the Chinese Academy of Sciences Resources and Environment Science Data Center (https://www.resdc.cn/) was selected to extract the land use/cover. The dataset covers 1990 to 2015, and the time interval of production is 5 years. Its spatial resolution is 1 km × 1 km. We selected the LUCC data in 2015 that is closest to 2014. The format of the data is raster format. In this dataset, the land use/cover types of level 1 include forest, grassland, wetland, farmland, and artificial surface.

3. Methods

The proposed methodology includes two phases. The first phase regressed the Ta observations from weather sites on Ts from MODIS conditional on some environmental factors such as topological, meteorological, vegetative, and LUCC as covariates. In this phase, multivariate linear regression models are fitted. Once proper models are fitted, we can predict the Ta at a specific location given covariates. According to the statistical theory of multivariate linear regression analysis, the predicted Ta has an uncertainty, and we can also estimate its confidence interval. Then, in the second phase, we make an ensemble estimation of Ta by integrating weather site observations and the prediction of Ta from the regression analysis in the first phase.

3.1. Multivariate Linear Estimates

The multivariate linear estimation is formulated aswhere Y is the dependent variable, x is the vector of the prediction variables, is the parameter vector of the multivariate linear regression model, and e is the residual error. Thus, given a group of samples X0 (land surface temperatures and environmental variables), we can estimate the corresponding Ta as

For a linear regression, the estimate of variance iswhere is the estimated standard error, which is given aswhere n is the number of samples, and k is the number of covariates.

We further construct a statistic:

In probability theory, this statistic is t-distributed with a degree of freedom n−k−1. Therefore, we can calculate the confidence interval of :

When n > 35, the t distribution can be approximated with the normal distribution N (, ).

In this study, three groups of variables are taken as covariates: Ts from Terra and Aqua, including TMOD_Day, TMOD_Night, TMYD_Day, and TMYD_Night; vegetation data including EVI and NDVI; and topological data including latitude, altitude, and longitude. All variables used in the regression analysis are listed in Table 1. The dependent variable Ta-mean is regressed on the covariates. The first-level classification of LUCC was represented by setting up a nominal covariate LUCC. To carry out the regression analysis, we expanded it into 7 dummy variables according to 6 types of the first-level classification.

Taking weather sites as reference locations, we extracted the values of covariates from corresponding data sources such as MOD11C1, MYD11C1, MOD13C1, LUCC, and DEM using GIS Extraction tools (Extract Values to Points in ArcMap). For a certain weather site, if there exist missing values or outliers in either of the dependent variable and covariates, the sample of this site was excluded.

3.2. BME Interpolation

We selected the BME method as the interpolation estimator of the Ta, which is firstly established by Christakos [16, 22]. BME uses many types of data for spatial estimation. These data are classified into two types: GK and SK, whose specific definitions and meaning in meteorological data can be identified in some research examples [3, 17, 23]. In this study, we intend to estimate the values of a meteorological variable X(s) (a spatial random field) at unmeasured locations sk given acquired data: , where represents a set of meteorological data xi at spatial locations si (i = 1, …,m; si≠sk). These datasets can be divided into two main groups: hard data and soft data. The former is observations of Ta from meteorological stations, and the latter is Ta estimation obtained from the multivariate linear regression analysis. According to the regression modeling, the soft data are expressed in normal distribution (an approximation of t distribution) with the corresponding estimated Ta and their variances as distribution parameters. We let the vector denote the hard data and is the soft data. Furthermore, let denote the random vector including the hard data, soft data, and unknown value xk. There are three stages to finish BME estimation.

3.2.1. Prior Stage

The aim of this stage was to determine the prior probability density function (pdf) based on GK, called prior pdf. Prior knowledge of GK is taken as statistics constraint conditions derived from , which is expressed mathematically aswhere , Nc is the number of constraints, are known functions, and the case α = 0,  = 1 is a normalization constraint. In this article, other constraints include means and covariances of and probability of soft data. The corresponding forms of can be referred to [22].

An entropy function of is defined as

Thus, prior pdf may be derived by means of a procedure that maximizes the entropy function and takes into consideration the constraints of equation (7), which represent prior knowledge.

3.2.2. Preposterior Stage

The objective of this stage is to collect and organize additional auxiliary information in appropriate forms to produce SK. These are then used in the BME model. Hard data were incorporated in the prior stage indirectly and used directly in the preposterior stage. Soft data were generated according to the responding Gaussian distributions.

3.2.3. Posterior Stage

The aim of this stage is to update the prior pdf based on the Bayesian conditional probability theorem and SK, thereby attaining the posterior pdf. When the distribution of hard and soft data is certain, the posterior pdf of spatial variable xk at location sk is

In practice, only hard and soft data within the scope of maximum distance to the estimation point are used to calculate xk. If the soft data are in the form of a pdf, thenwhere is the pdf of soft data.

3.3. Prediction Accuracy Evaluation

We used the leave-one-out cross-validation to evaluate the goodness of the results. Each time, one out of mh observations were left as the validation point to calculate the error between the estimated value and the observed value, and the other mh−1 points were used to build the model and predict the value at the leave-one-out location. Based on the prediction errors, two indices are used to evaluate the interpolation accuracy, and one is mean absolute error (MAE), defined as

And the other is root mean square error (RMSE), which is defined aswhere represents the estimated Ta value by the spatial interpolation at the spatial location si.

MAE is mainly used to evaluate the upper and lower limits of errors, while RMSE is better at evaluating the sensitivity of spatial interpolation results and the maximal minimum effect of some points.

4. Results

4.1. Regression Analysis and Prediction

We made regression analysis and prediction for each of the four selected days. With a stepwise regression procedure, Ta-mean was regressed on those selected covariates. At each step, a temporary best model (with maximal adjusted R-square) was chosen given the certain covariates. When adding any new covariate always leads to a decrease in the adjusted R-square, the final best model was obtained. Table 2 shows the best fitted models. Among all these models, Model 1 of VE has the smallest R-square value of 0.891, and Model 6 of WS has the biggest R-square value of 0.976. Furthermore, the F test shows the -values of these models are all ≤0.001, indicating these models are statistically significant. Also, the t test for regression coefficients shows all these models have coefficients with significance levels <0.01. Thus, these models were used to predict the corresponding Ta at the supplemented locations. Table 2 shows the best models with Ta-mean regressed on different sets of covariates and their specifications for the selected four days.

According to the fitted models, the prediction values at the supplemented locations (pink dots in Figure 1) and their corresponding confidence intervals were calculated taking the extracted values of covariates. Perhaps not all the values of the covariates are able to be extracted due to potential missing data and outliers. In this case, we turned to the model whose covariates are available and R2 is largest (we call it the applicable model) to calculate the estimated Ta and its corresponding confidence intervals. Figure 2 shows the boxplots of the standard deviations (σ) of the estimated values of Ta-mean. From Figure 2, we can evaluate the regression accuracy; for example, the regression estimates of Ta-mean indicate that most of estimates have an estimation error with σ approximating 1.6 degrees Celsius. The boxplots on the right show the limits of the confidence intervals of the estimated values of Ta-mean, showing most of estimates have limits of about 6 degrees Celsius (95% confidence interval). These estimated values and their corresponding confidence intervals were taken as soft data in the coming BME interpolation.

According to the estimate u and the standard deviation σ at a certain supplemented location, a normal distribution (Gaussian distribution) N (u, σ2) is used to approximate the distribution of the variable. A graphical BME tool, the SEKS-GUI software library, was used to carry out the interpolation operation, which accepts soft data in the form of Gaussian, uniform, or triangular distributions [24, 25]. An Excel file was created to feature the soft data of the supplemented locations. In the case of Gaussian distribution, for a certain supplemented location s (x, y), its soft data are portrayed in consecutive cells in the same row: .

4.2. Spatial Interpolation and Evaluation

Taking longitude as X axis, latitude as Y axis, and Ta-mean as Z axis, in a XYZ coordinate system, the 3D scatterplots of Ta-mean and their projections on XZ and YZ planes are plotted as shown in Figure 3. From the spatial distribution of the Ta observations and the characteristics of the projections on XZ and YZ planes, quadratic polynomial was used in the coming interpolation operation of BME.

After removing the global trends from the observations of Ta, we analyzed the semivariograms of the residuals. With a technique called binning, the experimental semivariograms of the residuals were plotted. Through observing the experimental semivariograms, we selected the spherical model to fit the semivariograms. Isotropy was applied for the fitting of the experimental semivariograms. Figure 4 shows the fitted results. The fitted models are also given in each subgraph.

After identifying the semivariograms, two methods—BME with only hard data (BME-hard) and BME with both hard and soft data (BME-both)—were applied to predict Ta-mean.

The scatter plots of measured and predicted values are shown in Figure 5. Following the scatter plots, the BME-both achieved higher accuracy than BME-hard. Generally, scatter plots of SS and AE are more clustered and better fitted with the line y = x, indicating their predictions are more accurate and stabler than SEs and WSs.

Spatially continuous distribution maps of Ta-mean in the study area were produced by setting the output grid size to 0.1degree0.1degree, as well as of the corresponding prediction standard errors as shown in Figures 69 for VE, SS, AE, and WS, respectively. Each figure shows two pairs of maps. Each pair of maps consists of an estimation map and a prediction standard error map, which are produced by BME-hard (maps in the upper part) and BME-both (maps in the lower part). From these maps, we see the BME-both-produced maps are more accurate than those produced by BME-hard overall. Comparing the prediction standard error maps, we found BME-both-produced maps have significantly smaller and stabler prediction standard error than BME-hard-produced maps, which means the estimation from BME-both has less uncertainty. Furthermore, interpolation errors are closely associated with the density of the site distribution. The first terrain range has the largest errors, where meteorological sites are very sparse because of high altitude and sparse population. Furthermore, according to the results of the four selected days, interpolation errors are probably dependent of days or seasons. The prediction standard error maps show that interpolation results in SS and AE have higher accuracy than those in SE and WS. WS has the highest interpolation errors among the four selected days.

To explore the effects of different geographic characteristics on the interpolation accuracy, the prediction accuracy in six areas—the whole study area, the west and east of Hu Huanyong line (Hu W. and Hu E.), the high terrain range (Terrain 1), the medium terrain range (Terrain 2), and the low terrain range (Terrain 3)—were evaluated with BME-both and BME-hard. Table 3 shows the calculated RMSE and MAE for each area and method. Our two-phase method, BME-both, achieved significant improvement on nationwide Ta prediction. Quantitative calculation further confirmed our judgement that interpolation accuracy is associated with some geographic factors such as site density and altitude. In areas with low altitude and dense population, the interpolation accuracy is better than that in other areas.

5. Discussion

Soft data are one of the two kinds of input data in BME interpolation, which differ from other interpolation methods. The general framework of BME explains some approaches to the acquisition and representation of soft data. Transforming prior knowledge into probability distribution is the most used technique in practice. It is well known that regression analysis can predict a response variable from some covariates in terms of a function relationship. Furthermore, as a random variable, the predicted variable (dependent variable) obeys a certain probability distribution. Taking advantage of this characteristic, in this article, we employed a multivariate linear regression method to retrieve the distribution of Ta at some unobserved locations and used Gaussian distribution to approximate the t distributions of the predicted variable of the regression model. Thus, additional information is incorporated into the interpolation process as prior knowledge of a probabilistic distribution. The results indicate that significant improvement is made on the interpolation accuracy. In some studies other than Ta interpolation, researchers practiced this idea and showed enhanced effects, which confirmed that it is an effective approach to retrieve soft data and implement BME interpolation operation. For example, Cao et al. used a binary logistic regression to model the occurrence of the highly pathogenic avian influenza and interpolate the risk, indicating an improved result [26]. Some scholars also regressed the precipitation on historical records or other factors to generate soft data for better interpolation [3, 27, 28].

We compared the results with several studies contributed to daily Ta estimation. Janatian and Zeng investigated daily Ta retrieval from remotely sensed Ts considering the TaTs relationship. The former proposed a statistical framework for estimating Ta using MODIS Ts data and made a case study in the eastern part of Iran [10], which achieved a RMSE of 3.0°C in daily Ta estimation. The latter evaluated the ability of MODIS Ts data to estimate daily temperature over the Corn Belt in the United States [12]. They found that the RMSE of different land covers ranges from 2°C to 5°C. As seen, the errors of these results are much larger than ours. This can be explained that our methods take far more explanatory variables and integrate hard data and soft data. There are also some studies considering some potential explanatory variables. For example, Alonso et al. used multiple regression models to estimate the daily Ta from some remotely sensed explanatory variables and the variables traditionally used like the ones from the Land Use Land Cover in Rhône-Alpes county, located in southeastern France. They achieved a RMSE of 1.20. This result is consistent with our results. However, our study area has a larger extent and more complicated environment, which may cause that our results have a little lower accuracy than theirs. In their study, remote sensing variables were incorporated and made significant contribution to the Ta prediction model. In another study, a stacking ensemble model was proposed to interpolate daily maximum Ta during summertime over Seoul [7]. They achieved a RMSE of 0.7°C, which is lower than our results. In our study, however, BME-both has an improvement of 40%, which is much better than theirs (less than 20%).

In the first phase of the proposed two-phase process, we used regression modeling to predict Ta-mean in unobserved locations and took them as soft data. From the multivariate linear estimates, the best models show good fitting. However, as explained in the regression analysis operation, the best models are sometimes unusable due to failure to extract data of some variables from the corresponding data sources, and as a result, inferior models have to be selected for calculating estimates in unobserved locations, which degrades the prediction accuracy. The observation errors and unavailability of covariates are of great concerns in the regression phase of our two-phase model. For example, Ts retrieved from satellite remote sensing includes errors depending on the underlying surface. And even Ts is unavailable when the surface is sheltered by clouds. Furthermore, since limited cognition for the problem domain and data unavailability, it is difficult to build a perfect model. For example, we considered 10 putative factors as covariates in our multivariate linear regression analysis. However, there are certainly other unknown (or untaken) factors affecting the dependent variable and the relationship between Ta and covariates is likely nonlinear. Therefore, regression analysis generally is taken as a primary approach in some practices where rough results are acceptable. Nowadays, some alternative advanced approaches such as neural networks are prevalent [29, 30]. These methods can depict the complex nonlinear relationships between outputs and inputs and can improve the prediction accuracy. Nonetheless, enormous input parameters and large data are needed to achieve good results in neural network models.

From the comparison of the interpolation accuracy in six areas, it is shown that BME-both is consistently better than BME-hard in any certain area. This can be expected from the characteristics of spatial distribution of meteorological observation stations: the meteorological observation stations in the east are spatially dense, more hard data (with high quality) are incorporated into the interpolation, and the soft data contribute less to the interpolation results. In contrast, more soft data (considered high uncertainty) are used in the interpolation and lead to bigger errors of the interpolation results.

6. Conclusions

is one of the most important parameters for science and practice. While traditional meteorological station observations are limited and spatially distributed unevenly, environmental factors and remote sensing images are promising to predict Ta with spatially continuous coverage. For example, Ts retrieval has undergone long-term studies shortly after the first Landsat launched. And then in terms of the statistical relationship between Ts and Ta, Ta can be predicted at the pixel locations. With advances in the acquisition of remotely sensed data and retrieval models, some full-fledged algorithms for retrieving Ts from remotely sensed data have been developed. As a newer remotely sensed data source, MODIS has been put into use in many fields since 2000. Especially benefiting from well-developed products with sound algorithms, the width and depth of MODIS data application are promoted largely. The distribution website of MODIS data products created by NASA is capable of providing some land products covering the globe (e.g., Ts). And the global users can freely download these product data according to their requirements.

To densify Ta from discrete meteorological observations, two popular approaches, interpolation and regression, are widely used. Interpolation uses the observed data, while regression fits the observed data and covariates and then predicts the Ta in unobserved locations with covariate inputs. There are advantages and disadvantages for them. Incorporating the relationship between Ta and environmental factors into the interpolation procedure is supposed to supplement additional information into the observations, and thus, higher accuracy is expected to achieve. The BME methods can well blend observations and prior knowledge, which use two kinds of data called hard data and soft data.

Utilizing the relationship between Ta and environmental variables, in this article, a two-phase approach was proposed to increase the accuracy of Ta estimates. Taking Ta-mean prediction as example, first, multivariate linear regression models were fitted between Ta-mean and Ts (TMOD-day, TMOD-night, TMYD-day, TMYD-night) conditional on some environmental factors including vegetative and topographical ones. The fitted models were then used to predict Ta-mean from those factors. The predicted Ta-mean were looked on as stochastic variables, and their distributions were also estimated. In the second phase, BME methods were used to interpolate the meteorological observations of Ta-mean taking the meteorological station observations as hard data and the predicted Ta-mean in the first phase as soft data. It is approved that the proposed methods supplement new information to the observations and thus reduce uncertainty of the estimated results. Particularly, for some stations distributed spatially sparse, our method significantly improves the accuracy. The proposed approach is supposed to map Ta with spatially continuous coverage and higher accuracy simultaneously through blending multisource information.

In the coming work, we are going to investigate other advanced methods for producing soft data, which is expected to further improve the accuracy of Ta estimates. Deep learning regression methods are also promising approaches to achieving Ta estimates with high precision in view of their performance in the prediction of environmental parameters, which are of great interest in our next work. According to the improved methods, we plan to produce high spatial resolution Ta products and distribute them for free use.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors acknowledge the support from the National Key R&D Program of China (Grant no. 2019YFF0301301), National Natural Science Foundation of China (Grant no. 72174031), and Youth Scholar Program of Beijing Academy of Science and Technology (Contract no. YS202004). The authors also appreciate the support for this paper from the Beijing Key Laboratory of Operation Safety of Gas, Heating and Underground Pipelines.