#### Abstract

A multicriteria score-based method was developed to assess the performances of 18 general circulation models (GCMs) in the study region from 1970 to 2005. The results indicate the following. (1) GCMs simulate temperature better than precipitation. The temporal and spatial distributions of simulated temperature agreed well with those from the observations, whereas the spatial distribution of simulated precipitation was reproduced poorly. Most of the GCMs underestimated temperature and overestimated precipitation. (2) The Grubbs test was used to detect anomalous moving changes in the rank score (RS) results; the inm-cm4 and ipsl-cm5b-lr models were rejected when simulating temperature, while the bnu-esm and canesm2 models performed poorly when simulating precipitation. (3) Adding or removing any criterion does not significantly influence the RS result, which indicates that the multicriteria score-based method is robust. The advantages of using the multicriteria score-based method to assess GCM performance were demonstrated, and this method provides a more comprehensive assessment than the single-criterion method. The criteria in the multicriteria method can be replaced according to research requirements, and the method can be easily extended to different study regions; the results could be used for better informed regional climate change impact analyses.

#### 1. Introduction

General circulation models (GCMs) are the most common tools for projecting future climate change. Errors and uncertainties in GCM metadata vary in severity and can leave the models unable to simulate observed meteorological events. GCM simulations are often characterized by biases and uncertainties that limit their direct application [1]. Different forcing scenarios, GCMs, and subgrid-scale forcings and processes cause uncertainties; the resulting simulations contain an abundance of information, but a large amount of work is required to identify the useful part, which limits GCM applications [2]. Despite continuous efforts to improve GCM simulation performance, the application of assessment methods remains essential for climate change impact studies [3].

To improve the accuracy of GCM applications, GCMs have been assessed in many studies [4–6]. These assessments emphasize various aspects of GCMs according to their different applications. For example, in one study, where a long-term climate change analysis was the main focus, an assessment of the GCM performance before its application only focused on its long-term temporal and spatial distribution simulations. However, the drawback of this assessment was that using a single criterion could only describe the temporal or spatial performances of the GCMs but may not meet the other requirements of the study [4]. A more comprehensive understanding of the advantages and disadvantages of GCMs is possible when more criteria are included in a GCM assessment.

To date, no assessment method used in the study of GCMs has been widely accepted. Assessing the performance of GCMs before using them is becoming an interesting issue. In this paper, a multicriteria score-based method was analyzed and the performances of all GCMs were quantitatively calculated and examined. We studied this method with the aim of comprehensively and accurately evaluating the performances of the GCMs.

The outline of the paper is as follows: the data and methods are presented in Sections 2 and 3, respectively. Section 4 describes the performance of each GCM. The GCM simulations of temperature and precipitation are evaluated in the study region. The concluding remarks are provided in Section 5.

#### 2. Study Region and Dataset

The performances of the GCMs in the Yellow-Huai-Hai region were assessed in this study. The Yellow-Huai-Hai region, which is located in north-central China between 30° and 42.5°N and 90° and 122.5°E (Figure 1), has the largest fluvial plain in China. Most parts of the study area are semiarid and semihumid (i.e., the Yellow River and Hai River basins, respectively), and only a small part of the region in the southeast of the study area has a humid climate (the area covering the Huai River basin). The Yellow-Huai-Hai region is an agricultural breadbasket and a prime urban and industrial region in China, and it therefore plays an important role in the social and economic development of the country. Consequently, the impacts of climate change in this region can seriously restrain economic growth [7, 8].

All GCM data are from the fifth phase of the Coupled Model Intercomparison Project (CMIP5), which is the most important tool for analyzing future climate change. The dataset from this project provides a framework for coordinating climate change experiments with the aim of evaluating climate simulations of the recent past to provide more accurate projections of climate change and quantifications of climate feedbacks compared to those from CMIP3 [9]. Details on the data can be found in Moss et al. [10] and Taylor et al. [9]. To simulate future climate change, 18 global GCMs from CMIP5 were considered in this paper (including access1-0, bcc-csm1, bnu-esm, canesm2, ccsm4, cesm1-bgc, cnrm-cm5, giss-e2-h, csiro-mk3.6, fgoals-g2, gfdl-cm3, hadgem2, inm-cm4, noresm1-m, miroc-esm, ipsl-cm5b-lr, mri-cgcm3). More details on all of the models are available at http://cmip-pcmdi.llnl.gov/cmip5/docs/CMIP5_modeling_groups.docx. Since GCM horizontal resolutions vary, the GCM outputs were interpolated to a uniform resolution of 2.5° × 2.5°. The grid cell distributions over the study region are shown in Figure 1.

High-quality temperature and precipitation data for the period 1970–2005 were derived from the daily dataset of China’s surface climate (V3.0) provided by the National Meteorological Information Center. These data are based on gauged data from 128 meteorological stations (Figure 1) and have undergone quality control, with an accuracy of nearly 100%; for more details, see http://data.cma.cn/data/cdcdetail/dataCode/SURF_CLI_CHN_MUL_DAY_V3.0.html. To effectively assess the performances of the GCMs, the daily data observed by the meteorological stations were aggregated to monthly data and interpolated to 2.5° × 2.5° cells using the inverse distance weighted method. The hollow circles in Figure 1 represent the locations of the GCM grid points, and the grid points denoted with black circles were selected for the assessment in this study.
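
The inverse distance weighted interpolation mentioned above can be illustrated with a minimal sketch. The function name `idw_interpolate`, the distance exponent of 2, and the use of plain degree differences as distance are assumptions made for illustration; the paper does not state these details.

```python
import math


def idw_interpolate(stations, target, power=2.0):
    """Inverse-distance-weighted estimate at `target` = (lon, lat).

    stations: list of (lon, lat, value) tuples for one time step.
    power: distance exponent (assumed to be 2; not given in the paper).
    Distances are plain degree differences, a simplification.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for lon, lat, value in stations:
        dist = math.hypot(lon - target[0], lat - target[1])
        if dist == 0.0:
            # The target coincides with a station: use its value directly.
            return value
        weight = 1.0 / dist ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical stations: two gauges at equal distance from the grid point.
grid_value = idw_interpolate([(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)], (1.0, 0.0))
```

With equal distances the two stations receive equal weights, so the estimate is their simple mean.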

#### 3. Methods

In this study, a multicriteria score-based method was developed to assess the performances of GCM simulations at a regional scale. The criteria included mean annual data, standard deviation, annual climate cycle, normalized root mean square error (NRMSE), spatial distribution, climate change trend, empirical orthogonal function (EOF), and probability density function (PDF); these criteria are listed in Table 1.

In the assessment, rank score (RS) values of 0–9 are assigned for each individual assessment criterion in the following form:

$$\mathrm{RS}_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} \times 9, \tag{1}$$

where $x_i$ represents the relative error (RE) between the result of the $i$th GCM and the observation, or the related statistical value for the GCM, and $x_{\min}$ and $x_{\max}$ are the minimum and maximum of $x_i$ among all GCMs. For the RE, a larger $x_i$ indicates a larger RS in the GCM performance assessment. In addition, the total RS for each GCM is the weighted sum of the RSs for all criteria. This RS method is used to describe the degree of fit between the observed and simulated sequential statistical characteristics. According to the fitting results, each GCM was assigned a score between 0 and 9 to assess its performance. The RS does not represent the actual simulation accuracy of a specific model but is suitable for comparing the performances of different GCMs. Several criteria that serve the same statistical purpose have weights of 0.5 each in this summation (Table 1): the Mann-Kendall (M-K) test (Z) and trend magnitude (*β*), which are criteria for trend analyses; EOF1 and EOF2, which are criteria for EOF analyses; and the Brier score (BS) and significance score (Sscore), which are criteria for PDFs (described later). The other individual criteria have weights of 1.0. If a GCM effectively simulates an observation, then the RS is small.
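
The scoring described above can be sketched in a few lines. This is only an illustration under stated assumptions: the linear min-max scaling onto 0–9 is assumed, the function names `rank_scores` and `total_rank_score` are hypothetical, and the numbers in the usage lines are invented.

```python
def rank_scores(values):
    """Scale per-GCM criterion values (e.g. relative errors) to 0-9.

    Assumes a linear min-max scaling in which the smallest value
    receives 0 (best) and the largest receives 9 (worst).
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All models are identical for this criterion.
        return [0.0 for _ in values]
    return [9.0 * (v - lo) / (hi - lo) for v in values]


def total_rank_score(scores_by_criterion, weights):
    """Weighted sum of the per-criterion rank scores for each GCM."""
    n_models = len(next(iter(scores_by_criterion.values())))
    totals = [0.0] * n_models
    for name, scores in scores_by_criterion.items():
        for k, score in enumerate(scores):
            totals[k] += weights[name] * score
    return totals


# Invented example: two models, one full-weight and two half-weight criteria.
scores = {"mean": [0.0, 9.0], "Z": [9.0, 0.0], "beta": [9.0, 0.0]}
weights = {"mean": 1.0, "Z": 0.5, "beta": 0.5}
totals = total_rank_score(scores, weights)
```

Giving the paired trend criteria a weight of 0.5 each means that trend as a whole counts as much as one full-weight criterion, which is why both invented models end with the same total here.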

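The RE was used to quantify the similarity between simulated and observed values for long-term monthly means and standard deviations:

$$\mathrm{RE} = \frac{\left| F(X_{s,1}, \ldots, X_{s,n}) - F(X_{o,1}, \ldots, X_{o,n}) \right|}{\left| F(X_{o,1}, \ldots, X_{o,n}) \right|}, \tag{2}$$

where $F$ denotes the statistic being compared (the mean or the standard deviation), $X_{s,i}$ and $X_{o,i}$ represent the simulated and observed data of the time series, respectively, and $n$ represents the duration of these samples (432 months from 1970 to 2005).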

The GCM performances for time series are evaluated by the NRMSE [11, 12], which is defined as

$$\mathrm{NRMSE} = \frac{\sqrt{\dfrac{1}{n}\sum_{i=1}^{n}\left(X_{s,i} - X_{o,i}\right)^{2}}}{\sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}\left(X_{o,i} - \overline{X}_{o}\right)^{2}}}, \tag{3}$$

where, based on historical data, $X_{s,i}$ and $X_{o,i}$ represent the GCM and observation results at historical time $i$, respectively; $\overline{X}_{o}$ represents the mean of the observations; and $n$ represents the length of the time series. The advantage of the NRMSE is that it can consider the mean and standard deviation of the predictor. The NRMSE is essentially the root mean square error divided by the standard deviation of the corresponding observations. The lowest value of the NRMSE is always associated with the best results, and this lowest value is reliable for determining the best simulation. The NRMSE ranges from 0 to positive infinity, where 0 indicates perfect agreement between the GCM data and reference data.
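
As a rough illustration of the NRMSE as described in the text (root mean square error divided by the standard deviation of the observations), the following sketch uses the sample standard deviation with an `n - 1` denominator; that choice and the function name `nrmse` are assumptions.

```python
import math


def nrmse(sim, obs):
    """Root mean square error normalized by the std. dev. of the observations."""
    n = len(obs)
    if len(sim) != n or n < 2:
        raise ValueError("sim and obs must have the same length (at least 2)")
    rmse = math.sqrt(sum((s - o) ** 2 for s, o in zip(sim, obs)) / n)
    obs_mean = sum(obs) / n
    # Sample standard deviation (n - 1 denominator), an assumption here.
    obs_std = math.sqrt(sum((o - obs_mean) ** 2 for o in obs) / (n - 1))
    return rmse / obs_std


# Invented series: a simulation with a constant bias of one unit.
biased = nrmse([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
```

A constant bias equal to one observed standard deviation gives an NRMSE of 1, and a perfect simulation gives 0.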

The correlation coefficient of the annual cycle was calculated between the observed and modeled long-term monthly mean values. For the spatial distribution, the correlation coefficient was calculated between the observed and modeled long-term means for each individual grid cell.
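
The annual-cycle criterion can be sketched as follows, assuming the Pearson correlation coefficient is meant (the paper does not name the type) and that each monthly series starts in January. The helper names are hypothetical.

```python
import math


def monthly_climatology(series):
    """Long-term mean for each calendar month (series starts in January)."""
    return [sum(series[m::12]) / len(series[m::12]) for m in range(12)]


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Invented three-year series; the "model" is a linear rescaling of the
# "observations", so the annual cycles correlate perfectly.
obs_series = [float(m) for m in range(12)] * 3
sim_series = [2.0 * v + 1.0 for v in obs_series]
cycle_r = pearson(monthly_climatology(sim_series), monthly_climatology(obs_series))
```

The same `pearson` helper applies to the spatial criterion if the two inputs are the long-term means at each grid cell instead of the twelve monthly means.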

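The M-K test and trend magnitude method were applied to determine the long-term monotonic annual trends and quantify their magnitudes [13]. The rank-based nonparametric M-K test statistic ($Z$) for the climate variables in the GCMs and observations was estimated by

$$Z = \begin{cases} \dfrac{S - 1}{\sqrt{\operatorname{Var}(S)}}, & S > 0, \\ 0, & S = 0, \\ \dfrac{S + 1}{\sqrt{\operatorname{Var}(S)}}, & S < 0, \end{cases}$$

where

$$S = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \operatorname{sgn}\left(x_j - x_i\right), \qquad \operatorname{Var}(S) = \frac{n(n-1)(2n+5) - \sum_{t} t(t-1)(2t+5)}{18},$$

where $x$ represents the time series of the annual climate variable, $n$ represents the length of the series in years, $t$ represents the extent of any given tie (length of consecutive equal values), and $\sum_{t}$ denotes the summation over all ties.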

The trend magnitude, $\beta$, given by Sen’s slope, a metric proposed by Sen [14] and developed by Hirsch et al. [13], is defined as

$$\beta = \operatorname{Median}\left(\frac{x_j - x_i}{j - i}\right),$$

where $i < j$. The slope estimator, $\beta$, is equal to the median over all possible combinations of pairs for the whole dataset [7]. $x$ represents the time series for the variable to be assessed in the study. Sen’s slope estimates the change trend directly from the time series data and can reduce the adverse influence of missing data on the analysis. The RE in Equation (2) was used to assess how close the values of $Z$ and $\beta$ for each GCM are to the observed values.
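
The two trend criteria can be sketched with the standard textbook forms of the Mann-Kendall statistic and Sen's slope. Two points are assumptions: ties are grouped by equal value (the usual convention), and the function names are hypothetical.

```python
import math
from collections import Counter
from statistics import median


def mann_kendall_z(x):
    """Standardized Mann-Kendall statistic Z for an annual series x."""
    n = len(x)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = x[j] - x[i]
            s += (diff > 0) - (diff < 0)  # sign of the difference
    # Tie correction: each group of t equal values reduces the variance.
    tie_sizes = Counter(x).values()
    var_s = (n * (n - 1) * (2 * n + 5)
             - sum(t * (t - 1) * (2 * t + 5) for t in tie_sizes)) / 18.0
    if s > 0:
        return (s - 1) / math.sqrt(var_s)
    if s < 0:
        return (s + 1) / math.sqrt(var_s)
    return 0.0


def sens_slope(x):
    """Median of the slopes between all pairs (i, j) with i < j."""
    slopes = [(x[j] - x[i]) / (j - i)
              for i in range(len(x) - 1)
              for j in range(i + 1, len(x))]
    return median(slopes)


# Invented, strictly increasing five-year series.
z_value = mann_kendall_z([1.0, 2.0, 3.0, 4.0, 5.0])
slope = sens_slope([1.0, 2.0, 3.0, 4.0, 5.0])
```

For a two-sided test at the 0.05 level, |Z| is usually compared with 1.96, which is why an observed Z of 4.81 counts as a significant increase.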

An EOF analysis was used in this study to compare the spatial distribution differences between the modeled climate variables and observations [15]. An EOF can identify and quantify the spatial structures of correlated variabilities [16]. The two leading modes are selected in this assessment since they account for a majority of the total variance.

The BS and Sscore were used to assess the PDFs of the monthly climate variables in the GCMs:

$$\mathrm{BS} = \frac{1}{n}\sum_{i=1}^{n}\left(P_{s,i} - P_{o,i}\right)^{2}, \qquad \mathrm{Sscore} = \sum_{i=1}^{n}\min\left(P_{s,i}, P_{o,i}\right),$$

where $P_{s,i}$ and $P_{o,i}$ represent the simulated and observed probability values, respectively, in each bin and $n$ represents the number of bins. According to the data ranges, we set the number of bins as 100; thus, we divided all of the data into 100 equal parts sequentially and then calculated the probability density of each part. In this study, the BS represents the mean square error measure for probability forecasts [17, 18] and the Sscore represents the cumulative minimum value of the observed and simulated distributions over the bins, which quantifies the overlap between the observed and simulated data [19, 20]. Therefore, when the BS of a GCM is lower and the Sscore is higher, the performance of the GCM is better.
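
A minimal sketch of the two PDF criteria is given here, assuming equal-width bins spanning the combined range of the simulated and observed data. The binning rule and the names `bin_probabilities` and `pdf_scores` are assumptions, not the authors' implementation.

```python
def bin_probabilities(data, lo, hi, n_bins):
    """Fraction of samples falling in each of n_bins equal-width bins."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in data:
        k = int((v - lo) / width) if width > 0 else 0
        k = min(max(k, 0), n_bins - 1)  # clamp so v == hi lands in the last bin
        counts[k] += 1
    return [c / len(data) for c in counts]


def pdf_scores(sim, obs, n_bins=100):
    """Return (BS, Sscore) for simulated and observed samples."""
    lo = min(min(sim), min(obs))
    hi = max(max(sim), max(obs))
    p_sim = bin_probabilities(sim, lo, hi, n_bins)
    p_obs = bin_probabilities(obs, lo, hi, n_bins)
    # BS: mean squared difference between the binned probabilities.
    bs = sum((ps - po) ** 2 for ps, po in zip(p_sim, p_obs)) / n_bins
    # Sscore: total overlap of the two distributions (1 means identical).
    sscore = sum(min(ps, po) for ps, po in zip(p_sim, p_obs))
    return bs, sscore


# Invented samples: identical distributions, then fully separated ones.
same = pdf_scores([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0], n_bins=3)
apart = pdf_scores([3.0, 3.0], [0.0, 0.0], n_bins=3)
```

Identical samples give a BS of 0 and an Sscore of 1, while samples with no overlap give an Sscore of 0.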

#### 4. Results

##### 4.1. Assessment of Temperature

Table 2 includes the assessment results of the GCM performance for temperature in the Yellow-Huai-Hai region. The observed mean temperature during the historical period in the study region is 8.49°C, while the simulated temperature via the GCMs is 3.62–8.09°C for the same period. Most GCMs underestimate the mean temperature by approximately 2°C. The standard deviation in the observations is 0.53°C, and most standard deviations in the GCMs are from 0.4–0.6°C. The NRMSE is always used to compare the difference between the observations and simulations. Therefore, if there were sets of data with very similar results for the means and standard deviations, smaller NRMSE results indicated a better simulation for the set of data. For the monthly mean temperature, the best NRMSE occurs with the mpi-esm-lr GCM (0.16), while the NRMSE result for the ipsl-cm5b-lr model was the largest of the GCMs. The simulated monthly distribution for the annual climate cycle for each GCM was relatively similar to that from the observed data, which can be seen from the correlation index (all values are larger than 0.995). Consequently, the correlation results for the monthly distribution of the annual cycle were rounded to 1. The correlation coefficients of the spatial temperature distribution between each GCM and the observations were also larger than 0.9. The simulated spatial temperature had a distribution similar to that of the observations, where the temperature increased from west to east, and the temperature was lowest in the source region of the Yellow River, while it was highest in the southern region of the Huai River basin (Figure 2).

According to the results of the M-K analysis in Table 2, the temperature in the Yellow-Huai-Hai region has increased over the past 36 years. Most GCMs show an increasing trend in temperature, excluding the giss-e2-h model. The performances of the different GCMs in simulating the change trend differ. The Z value in the M-K test for observed temperature is 4.81, which means that the observed mean temperature increases significantly at the 0.05 significance level. However, the Z values for most GCMs are between 1.13 and 4.59 (excluding giss-e2-h and canesm2), which indicates that most GCMs underestimate the temperature change trend in this region. The trend magnitude, *β*, via Sen’s slope shows similar results.

The results of the EOF analysis of spatial temperature show that the first and second EOF modes for monthly temperature in the observations account for 98.9% and 0.51% of the total variance (Table 2), respectively. The first explained variance of the GCMs ranges between 96.99% and 98.63%, while the second explained variance ranges between 0.55% and 1.23%. This result simply evaluates the GCMs’ performance by using two explained variance values and indicates that all GCMs simulate the variability well. According to the EOF results, all the GCMs perform well in terms of the physical process of temperature variability. It should be noted that there are certain special cases where the spatial patterns could differ even though the simulations and observations have similar values of explained variance. However, this situation is relatively rare and is, therefore, not discussed in this study.

The empirical cumulative probability distributions for monthly mean temperature simulated by most GCMs are quite close to the observations (Figure 3), excluding the inm-cm4 and ipsl-cm5b-lr models, which underestimate the temperature in the Yellow-Huai-Hai region. The results of the Sscore and BS across all 53 selected grid points are presented in Figure 4. The variations in the scores across the 53 grid points imply spatial differences. A high Sscore with a relatively low BS indicates excellent GCM performance in terms of probability distributions in the grid points. The mean Sscores in the grid points of most GCMs are over 80%. The results of the ipsl-cm5b-lr model, which has larger BS and smaller Sscore values, are consistent with the empirical cumulative probability plots. The BS and Sscore of each model differ between grid points, reflecting the spatial variability of climatic elements. For example, in some GCMs, the Sscore in some grid points is over 90%, and the BS value is close to 0, which means that the probability density distribution of the GCM temperature in these grid points is very similar to that of the observations. However, the Sscore does not exceed 50% when there is a high BS value in some grid points, which indicates that the probability distribution of temperature is poorly simulated by these GCMs in these grid points.

By using the RS assessment, the performances of the GCMs have been evaluated and the final score of each model has been calculated. The ccsm4 model performs best (lowest total RS), while the inm-cm4 model performs worst (highest total RS). Figure 5 describes the differences in annual temperature changes between the observations and the best and worst performing models in the Yellow-Huai-Hai region. Even though the ccsm4 model underestimates the mean temperature, it simulates a change trend similar to that of the observations. In contrast, the inm-cm4 model vastly underestimates the temperature and simulates an incorrect temperature change in comparison to that of the observations.

##### 4.2. Assessment of Precipitation

Table 3 includes the assessment results of the GCM performances for precipitation. In comparison with temperature, the GCMs perform poorly in terms of precipitation. The observed mean annual precipitation in the Yellow-Huai-Hai region is 568 mm, while most GCMs overestimate the value of precipitation (650–1256 mm). Specifically, precipitation in the bnu-esm model reaches 1,256 mm, which is more than twice the observed precipitation amount. The standard deviations in the bnu-esm and mri-cgcm3 models are 83.4 mm and 33.58 mm, respectively, which are quite different from that of the observations (61.5 mm). The NRMSE for precipitation (0.54–1.5) is much larger than that for temperature (0.16–0.55). The correlation coefficients for monthly precipitation in the annual cycle via the GCMs are lower than those for temperature, but most correlation coefficient values are still greater than 0.9. However, when we analyze the performances of the GCMs in terms of the precipitation spatial distribution, the spatial correlation coefficients of the GCMs are 0.45–0.82, which indicates that the GCM simulations of the spatial distributions of precipitation are much worse than those of temperature. Figure 6 shows that the bnu-esm model performs poorly in terms of simulating spatial precipitation and, specifically, that it incorrectly locates the high precipitation area in the study region.

The annual precipitation in the Yellow-Huai-Hai region experiences a nonsignificant decrease at the 0.05 significance level. According to the *Z* value and magnitude of Sen’s slope, the change trends for most GCMs decrease less than those in the observations; specifically, some GCMs appear to have increasing trends during the study period. The M-K test shows that precipitation in the GCMs shows different change trends, which indicates that simulated precipitation via the GCMs is much more uncertain than the simulated temperature. Analyzing the EOFs reveals that the difference between the observations and GCMs is larger than that for temperature, which is consistent with other criteria assessment results, where precipitation is relatively poorly simulated compared to temperature. The physical mechanisms affecting precipitation are mainly influenced by large-scale circulation factors; the inconsistent spatial distribution of simulated precipitation indicates that some GCMs could not explain the influence of circulation factors.

The empirical cumulative probability distributions for the GCM monthly precipitation are compared with the observations in Figure 7. The simulation of the empirical cumulative probability distribution for precipitation is not as accurate as that for temperature. Generally, most GCMs simulate a poorer result for high values of precipitation in the empirical cumulative probability distribution. Most GCMs overestimate high precipitation values in the probability distribution, which is consistent with the results of the annual precipitation analysis. The BS for monthly precipitation in the 53 grid points is much larger than that for monthly temperature, and the outliers also indicate an inconsistency between the GCMs and observations (Figure 8). Although the Sscore median values have almost the same magnitudes as those for monthly temperature, the result of a higher BS and lower Sscore also indicates that the temperature simulations in the GCMs are better than the simulations for precipitation (especially the bnu-esm model).

The results of the GCM performances by using RS values are shown in Table 3. The csiro-mk3.6 model simulated precipitation better than the other models, and its RS was only 12.24. In addition, the bnu-esm model performed the worst in terms of simulating precipitation, with the highest RS (48.52). Figure 9 describes the annual precipitation change in the Yellow-Huai-Hai region. The bnu-esm model vastly overestimated annual rainfall in the study region, and the oscillations in the model seem to have a reversed phase compared with those in the observations. Even the csiro-mk3.6 model appears to slightly overestimate the annual precipitation in the study area and exhibits a different fluctuation change compared with the observations in the early 1970s; the model has a fluctuation change similar to that in the observed data after 1975.

Five GCMs with a total RS for temperature less than 17 (ccsm4, hadgem2, mpi-esm-lr, cesm1-bgc, and access1-0) and six GCMs with a total RS for precipitation less than 20 (csiro-mk3.6, access1-0, ccsm4, cnrm-cm5, hadgem2, and cesm1-bgc) were chosen as the good GCM groups (Figure 10). Compared with the observations, the good GCM groups are narrower in terms of their uncertainty intervals, and the mean values are closer to those of the observations. Note that the errors in the GCM metadata affect the entire intensity spectrum, and bias correction is required to improve the GCM simulation capacity. After a simple bias correction, the models in the good GCM groups could be effectively applied in future studies.

##### 4.3. Overall Performance of the GCMs and Sensitivity Analysis

The RSs for temperature and precipitation are used for assessing the performances of all GCMs (the last columns in Tables 2 and 3). In ascending order, the difference between two successive ranking scores (i.e., the moving range (MR)) is used to detect the presence of any change points [21–23]. In addition, the Grubbs test [24] is used to test whether there is an anomalous value in a univariate dataset. If the tests indicate significant differences, then we have evidence to reject the GCMs within the larger ranking score group. The results of the MR analysis and Grubbs test are shown in Table 4.
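
The moving-range step and the Grubbs statistic can be sketched as below. Following the abstract, the Grubbs statistic is applied here to the moving ranges; that reading, the function names, and the invented scores are assumptions. The critical value needed for the final decision depends on the sample size and significance level and must be taken from a Grubbs table, so it is not computed here.

```python
from statistics import mean, stdev


def moving_ranges(scores):
    """Differences between successive total rank scores in ascending order."""
    ordered = sorted(scores)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def grubbs_statistic(values):
    """Return (G, index) for the value farthest from the sample mean.

    G = |value - mean| / sample standard deviation; it must be compared
    with a tabulated critical value to decide whether it is an outlier.
    Requires at least two values.
    """
    m = mean(values)
    s = stdev(values)
    if s == 0:
        return 0.0, 0  # all values equal, so nothing can be an outlier
    idx = max(range(len(values)), key=lambda k: abs(values[k] - m))
    return abs(values[idx] - m) / s, idx


# Invented total rank scores for four models; one is far from the rest.
ranges = moving_ranges([3.0, 1.0, 2.0, 10.0])
g_value, g_index = grubbs_statistic(ranges)
```

In this invented example the largest jump sits between the third and fourth ordered scores, which marks where a candidate change point would be.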

Two change points in the temperature RSs were detected, and the result of the Grubbs test indicates that these change points are outliers at the 95% confidence level (Figure 11). Thus, these GCMs (inm-cm4 and ipsl-cm5b-lr) should be rejected because their RSs are significantly different from those of the other GCMs. The differences in the first 8 GCMs are not significant. Some GCMs with high ranking scores could not be rejected by the test because these GCMs could capture one or more characteristics in the temporal or spatial distributions for monthly temperature.

For precipitation, two change points were detected (Figure 12). The result of the Grubbs test indicates that the bnu-esm and canesm2 models should be rejected due to their poor performances in simulating precipitation. The RSs of the last two GCMs are remarkably different compared to those for the other GCMs, while the differences in the GCM RSs among the other models are small.

According to Table 4, the simulations in the GCMs for temperature perform better than those for precipitation. This result is consistent with the study of global-scale AR4 GCMs, which revealed that most GCMs could capture the characteristics of monthly temperature but not those of precipitation [12]. We should note that the same GCM performs differently for different climate variables. For example, the bnu-esm model is the 6th best model for temperature, but it is the worst model for precipitation. In addition, the csiro-mk3.6 model simulates precipitation the best, while it ranks only 10th in the RS assessment for temperature.

In addition, the GCMs perform differently in different regions. For example, the bnu-esm model is not suitable for projecting future climate changes in the Yellow-Huai-Hai region, but it may potentially perform well for another study region.

To analyze the influence of each individual assessment criterion on the ranking results, the overall results were compared with the results obtained after removing an individual statistical criterion. Based on Figure 13, adding or removing any assessment criterion does not obviously influence the overall ranking. The RS may change after a criterion is removed, but the better-performing models still perform well after adding or removing a criterion. The results indicate that this RS method robustly assesses the GCM performances. This robustness is an advantage of using the multicriteria method to assess the performances of GCMs rather than an individual assessment criterion, because a GCM may simulate individual statistical factors well but not provide good simulations for other factors.
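
The leave-one-criterion-out comparison can be sketched as follows. The function names and the toy scores are hypothetical; the sketch simply recomputes the weighted totals with one criterion dropped and reports the resulting model order.

```python
def ranking(totals):
    """Model indices ordered from the lowest (best) to the highest total RS."""
    return sorted(range(len(totals)), key=lambda k: totals[k])


def leave_one_out_rankings(scores_by_criterion, weights):
    """Ranking of the models after dropping each criterion in turn."""
    results = {}
    for dropped in scores_by_criterion:
        n_models = len(scores_by_criterion[dropped])
        totals = [0.0] * n_models
        for name, scores in scores_by_criterion.items():
            if name == dropped:
                continue  # skip the criterion being tested
            for k, score in enumerate(scores):
                totals[k] += weights[name] * score
        results[dropped] = ranking(totals)
    return results


# Invented rank scores for three models and three criteria.
toy_scores = {
    "mean": [0.0, 4.5, 9.0],
    "nrmse": [0.0, 9.0, 4.5],
    "trend": [1.0, 5.0, 9.0],
}
toy_weights = {"mean": 1.0, "nrmse": 1.0, "trend": 1.0}
loo = leave_one_out_rankings(toy_scores, toy_weights)
```

In this toy case the best model (index 0) stays first whichever criterion is dropped, while the order of the other two can swap, which mirrors the kind of stability check described above.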

The RSs produced by each single criterion were used individually and compared with the overall results (Figure 14). According to the correlation analysis, no single criterion produced exactly the same result as the overall ranking, which also affirmed that the multicriteria method produced more information than the single-criterion assessment. The single-criterion assessments provided different results; for example, the RS of the NRMSE criterion was close to the overall ranking, with a correlation coefficient of 0.75, while the correlation coefficient of the spatial distribution was only 0.08. Thus, if a GCM can simulate the spatial and seasonal distributions well in the Yellow-Huai-Hai region, this does not mean that the model would also have better results in simulating other statistics (e.g., long-term means, trend magnitude, or probability density).

#### 5. Conclusion

In this paper, a multicriteria score-based method is developed to assess GCM performance in the Yellow-Huai-Hai region from 1970 to 2005. The RSs of these criteria are applied to comprehensively assess the temporal and spatial performances of 18 GCMs when simulating precipitation and temperature in the study region.

All GCMs perform well when simulating temperature. Although all of the models underestimated the mean temperature, the results of the temporal and spatial distributions are quite close to those from the observations. The GCMs did not simulate precipitation as well as temperature, especially in terms of simulating precipitation spatial distributions. Most GCMs overestimate mean precipitation in the study area. The good performing models are selected to comprise the good GCM groups, where the means of the good GCM groups are closer to the observations.

By analyzing the sensitivity of the criteria, we found that removing or adding a criterion does not obviously influence the results of the assessment, which indicates that the multicriteria score-based method is a robust method for assessing GCMs. This study provides a different method from a single evaluation criterion to assess the GCMs’ simulation ability. Researchers could specify the criteria relevant to their specific application and research requirements to select appropriate GCMs for their study. This method could be easily applied to different study regions and guide the selection of GCMs for use in regional climate change impact studies.

#### Data Availability

The GCM data used to support the findings of this study are available at http://wwwpemdi.llnl.gov.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Acknowledgments

We are very grateful to the anonymous reviewers for their constructive comments, which greatly improved the manuscript. We are grateful to the climate modeling groups for producing and making their model outputs available to the public. Financial support from the National Key Research and Development Program (grant 2018YFC0407403), National Natural Science Foundation of China (grants 51809103 and 41701509), Special Research Fund of the Yellow River Institute of Hydraulic Research (grants HKY-JBYW-2018-06 and HKY-JBYW-2017-08), China Postdoctoral Science Foundation (grant 2017M610458), and the Technology Development Foundation of the Yellow River Institute of Hydraulic Research (grant HKF201604) is gratefully acknowledged.