Abstract

Gridded precipitation datasets have been used as alternatives to rain gauge observations, but their applicability to a specific region should be thoroughly evaluated. This article aims at finding the most appropriate one for climatological and hydrological applications in Indonesia by evaluating the statistical performance of eight different datasets (research products) with horizontal resolutions between 0.1° and 0.25° and a common time span of data availability from 2003 to 2015. The datasets are compared against observed daily rainfall at 133 stations using 13 statistical metrics that can be classified into three groups with different measurement characteristics, namely distribution, time sequence, and extreme value representations. By applying the summation of rank (SR), it is found that MSWEP and TMPA 3B42 are the top two datasets based on the distribution and time sequence performance metric groups. The extreme value performance of all datasets is still good at the 75th percentile; however, performance decreases above the 75th percentile, indicating that daily extreme rainfall is still poorly represented in all gridded datasets. The results of this study suggest that MSWEP (v2) is presently the best gridded precipitation dataset available for climatological and hydrological applications in Indonesia.

1. Introduction

Climate variability at subseasonal, seasonal, interannual, and interdecadal timescales has potential societal impacts across the globe. In terms of agricultural production, for example, roughly one-third of the observed variation in global yield is caused by climate variability [1]. Furthermore, climate change is causing extreme weather events and climate anomalies to increase in both frequency and intensity [2], leading to greater risks for natural and human systems [3]. The risks are even higher for countries like Indonesia that are prone to natural disasters. From 1900 to 2011, 56% of the disasters in Indonesia, which killed almost 241,000 people, affected about 28 million people, and caused losses of around US$24 billion, were of the hydrometeorological (climate-related) type [4]. Therefore, accurate estimation of hazards and risks due to historical and projected climate anomalies is essential to make development and business plans more climate-proof and adaptive to climate change.

High impacts of hydrometeorological disasters (drought, wildfire, flood, and landslide) in Indonesia are associated with the excess or deficit of rainfall, and such disasters are more prevalent than other types of climatological disasters such as heat waves [5]. While climate change analyses using the top-down approach have been facilitated by downscaling of projected future precipitation under WCRP CORDEX (the Coordinated Regional Downscaling Experiment sponsored by the World Climate Research Programme) for the Southeast Asia region [6], the feasibility of further quantitative impact studies depends on the availability of observational data to calibrate the model output. In this context, the availability and quality of baseline climate data are crucial for carrying out both bottom-up and top-down climate change studies [7] at regional scales. However, long-term continuous rainfall observations in Indonesia are available at only a very limited number of locations. Jakarta, for example, is an exceptional location where 130-year records of rainfall and temperature data, from 1883 to 2012, are available [8, 9]. Elsewhere, rainfall data vary in length, network density, quality, and consistency across regions, making it difficult to assess climate hazards and risks associated with extreme events, even for regions with important sociopolitical contexts.

In recent decades, there have been efforts to develop globally gridded precipitation datasets by various research groups and institutions. Those datasets vary in terms of purpose, data origin, area coverage, record length, as well as spatial and temporal resolution [10]. In any case, the availability of such precipitation datasets is potentially helpful for coping with the lack of rain gauge observations. In fact, gridded datasets such as Tropical Rainfall Measuring Mission (TRMM)-based precipitation products have been extensively used in various studies with main concerns on large-scale climatic features [11]. However, prior to their application to study climate impacts at regional scales, global precipitation datasets need to be evaluated to understand their advantages, limitations, and uncertainties [10, 12]. Moreover, Indonesia’s archipelago constitutes the largest part of the Maritime Continent (MC) where spatial variations of rainfall climatology are prominent due to complex land-sea distribution, topography, and strong influence of Asian-Australian monsoons [13].

Intercomparisons of global precipitation datasets have been conducted for monsoon and MC regions involving reanalysis and TRMM precipitation products [14–16]. These studies used EOF analysis, the correlation coefficient, and bias to compare the datasets with observations, except [16], which focused on the relative differences among the products. There are also studies focusing on Indonesian regions and the validation of specific datasets such as TRMM [17] and GSMaP [18, 19] using rain gauge data. An intercomparison of four precipitation datasets, that is, SA-OBS, APHRODITE, CMORPH, and TRMM, has also been conducted for performance evaluation against rain gauge data [20]. Another study evaluated performance for the specific purpose of detecting low rainfall for drought monitoring using three datasets, that is, TMPA 3B42RT, PERSIANN, and CMORPH [21], and another focused on the specific region of Bali Island using three other datasets, that is, GSMaP, IMERG, and CHIRPS [22]. These studies have compared different, but still limited, numbers of datasets. Moreover, only a few performance metrics such as bias and correlation were used, except by Liu et al. [22], who used more diverse metrics of continuous, categorical, and volumetric types. None of the studies compared the statistical distributions of the precipitation datasets with those of the observed data.

In this work, we performed a more comprehensive evaluation of eight precipitation products derived from rain gauges, satellite-based estimates, and their combinations (see Table 1). This study aims to find the most robust precipitation dataset for climatological and hydrological research and applications in Indonesia. We propose a multimetric approach [34, 35] with a total of 13 metrics classified into three groups of statistical measures against observations. The first group assesses the data distribution: (1) mean (M), (2) standard deviation (SC), (3) coefficient of variance (CV), (4) PDF skill score (SS), and (5) Kolmogorov–Smirnov test (KST). The second evaluates the relationships in sequential data pairs as time series: (6) Pearson correlation coefficient (r), (7) mean error (ME), (8) root mean square error (RM), (9) relative change (RC), (10) T-test (TT), and (11) Z-test (ZT). The third addresses the performance of extreme event detection: (12) fraction skill score (FSS) and (13) Anderson–Darling test (ADT) for the 75th, 90th, and 98th percentiles. For overall performance, we apply a summation of rank (SR) to all metrics in all groups and select the top-ranked dataset. As a reference, we employed rain gauge data from 133 meteorological stations belonging to the Indonesian Meteorological, Climatological, and Geophysical Agency (BMKG), observed from 2003 to 2015 (see Figure 1).

2. Materials and Methods

2.1. Gridded Precipitation Datasets

A summary of the eight primary datasets used in this study is presented in Table 1. These datasets are research products with high-latency data delivery (on the order of several months) and are generally derived from combinations of rain gauge, satellite, and reanalysis data. It should be noted that all datasets have daily temporal resolution, but the spatial resolution is 0.25° for five of them (CHIRPS, CMORPH-CDR, GFD, PERSIANN-CDR, and TMPA) and 0.1° for the other three (GSMaP RNL, GPM-IMERG, and MSWEP). Comparisons between gridded and observed station precipitation were performed at the station locations by interpolating the gridded data using the "nearest neighbor" method, as sketched below. In addition to the eight primary datasets, we also analyzed five other gridded datasets with different specifications (see the Discussion section).
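As an illustration of this extraction step, the following is a minimal sketch using xarray's nearest-neighbor selection. The file, variable, and column names are hypothetical placeholders, not the actual product files:

```python
# Sketch: sample a gridded daily product at station locations ("nearest neighbor").
# File, variable, and column names below are hypothetical placeholders.
import pandas as pd
import xarray as xr

ds = xr.open_dataset("mswep_daily.nc")         # dims assumed: (time, lat, lon)
stations = pd.read_csv("bmkg_stations.csv")    # assumed columns: id, lat, lon

series = {}
for _, s in stations.iterrows():
    # Select the grid cell closest to the station coordinates
    cell = ds["precip"].sel(lat=s["lat"], lon=s["lon"], method="nearest")
    series[s["id"]] = cell.to_series()

gridded_at_stations = pd.DataFrame(series)     # one daily column per station
```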

2.2. Observation Datasets

The observation dataset used as a reference employs rain gauge data from 133 meteorological stations in Indonesia from 2003 to 2015 (Figure 1). This period covers the overlapping years between the observations and all precipitation datasets being compared. The rainfall data are the same observed daily precipitation dataset used in Supari et al. [9] up to 2012, with additional stations and time periods. The same quality control analysis described in Supari et al. [9] was applied, consisting of checks for gross errors, missing values, outliers, and overall data homogeneity. Following the guidelines of BMKG [36], the daily precipitation accumulation measured at 07:00 local time is assigned to the date of that day. The intercomparison between the precipitation datasets and the rain gauge observations in this study was carried out without a day shift (for more details, please refer to Van den Besselaar et al. [20]).

2.3. Metrics of Distribution-Based Performance
2.3.1. Mean (M)

The mean (or average) is the measure of central tendency for both discrete and continuous data. Given $\mu_o$ the time mean and $\sigma_o$ the standard deviation of the observations, and $\mu_m$ the time mean of a precipitation dataset over the same period, we define the performance metric

$$M = 1 - \frac{|\mu_m - \mu_o|}{n_g\,\sigma_o}, \quad (1)$$

where $n_g$ is a scale factor taken as equal to 1 [34, 35]. The maximum value of M is 1, and M < 0 if the difference between the time means of the rainfall dataset and the observations is greater than $n_g$ multiplied by $\sigma_o$. The performance indices for this metric were calculated for daily (M_d), monthly (M_m), seasonal (M_DJF, M_MAM, M_JJA, and M_SON), and annual (M_a) timescales. Herein, DJF, MAM, JJA, and SON denote December-January-February, March-April-May, June-July-August, and September-October-November, respectively.
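A minimal sketch of equation (1) in Python, assuming obs and mod are matched arrays of daily precipitation for one station (the synthetic data are purely illustrative):

```python
import numpy as np

def mean_metric(obs, mod, ng=1.0):
    """Equation (1): M = 1 - |mu_m - mu_o| / (ng * sigma_o).
    Perfect skill is M = 1; M < 0 when the mean bias exceeds ng * sigma_o."""
    return 1.0 - abs(mod.mean() - obs.mean()) / (ng * obs.std())

rng = np.random.default_rng(0)
obs = rng.gamma(0.6, 12.0, size=4745)          # ~13 years of synthetic daily rain
mod = obs + rng.normal(0.0, 2.0, obs.size)     # synthetic "dataset" with noise
print(mean_metric(obs, mod))
```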

2.3.2. Standard Deviation (SC)

Standard deviation measures the spread of a data distribution. The metric SC is a normalized quantity comparing the spread of an evaluated dataset with that of the reference, given by

$$SC = 1 - \frac{|\sigma_m - \sigma_o|}{n_g\,\sigma_o},$$

where $\sigma_o$ and $\sigma_m$ are the standard deviations of the observations and the gridded precipitation dataset, respectively, so that SC = 1 is the perfect skill [34, 35]. As with M, the performance indices for SC were calculated for daily (SC_d), monthly (SC_m), seasonal (SC_DJF, SC_MAM, SC_JJA, and SC_SON), and annual (SC_a) timescales.

2.3.3. Coefficient of Variance (CV)

The metric CV is a normalized measure of dispersion. Given $CV_o$ and $CV_m$, the coefficients of variation (standard deviation divided by the mean) of the observations and the precipitation dataset [34, 35], the performance index is calculated as

$$CV = 1 - \frac{|CV_m - CV_o|}{n_g\,CV_o}$$

for daily (CV_d), monthly (CV_m), seasonal (CV_DJF, CV_MAM, CV_JJA, and CV_SON), and annual (CV_a) timescales.
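The SC and CV indices can be sketched in the same pattern; note that the $n_g$-scaled absolute-difference forms below follow the reconstruction of equation (1) above and are therefore an assumption, not a verbatim transcription of [34, 35]:

```python
import numpy as np

def sc_index(obs, mod, ng=1.0):
    # Spread comparison: 1 - |sigma_m - sigma_o| / (ng * sigma_o)
    return 1.0 - abs(mod.std() - obs.std()) / (ng * obs.std())

def cv_index(obs, mod, ng=1.0):
    # Dispersion comparison on the coefficient of variation CV = sigma / mu
    cv_o = obs.std() / obs.mean()
    cv_m = mod.std() / mod.mean()
    return 1.0 - abs(cv_m - cv_o) / (ng * cv_o)
```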

2.3.4. PDF Skill Score (SS)

The probability density function (PDF) skill score (SS) is calculated using samples of rain days (days with precipitation > 0.5 mm) [36]. The SS compares the PDFs of the observed and gridded precipitation datasets by the following formula [37]:

$$SS = \sum_{k=1}^{N_b} \min\!\left( Z_m^k, Z_o^k \right),$$

where $Z_m^k$ and $Z_o^k$ are the relative frequencies of occurrence of values in the kth bin of the histograms of the dataset and the observations, whereas $N_b$ is the number of bins used to calculate the empirical PDF. If SS = 1, the precipitation dataset perfectly reproduces the observed PDF [35, 37].
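A sketch of the SS computation; building common bin edges from the pooled rain-day samples is an implementation choice, not prescribed by the source:

```python
import numpy as np

def pdf_skill_score(obs, mod, nbins=50, wet=0.5):
    """Perkins-type PDF overlap on rain days (> 0.5 mm):
    SS = sum_k min(Zm_k, Zo_k); SS = 1 for identical binned PDFs."""
    obs, mod = obs[obs > wet], mod[mod > wet]
    edges = np.histogram_bin_edges(np.concatenate([obs, mod]), bins=nbins)
    zo = np.histogram(obs, bins=edges)[0] / obs.size   # relative frequencies
    zm = np.histogram(mod, bins=edges)[0] / mod.size
    return float(np.minimum(zo, zm).sum())
```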

2.3.5. Kolmogorov–Smirnov Test (KST)

The KST is similar to SS but applies to the empirical cumulative distribution function (ECDF). Given $F_o(x)$ and $F_m(x)$, the ECDFs of the observed data and a precipitation dataset, the KST performance index is calculated by

$$KST = 1 - D_{KS}, \qquad D_{KS} = \max_x \left| F_m(x) - F_o(x) \right|,$$

where $D_{KS}$ is the maximum absolute difference between the ECDFs of the two datasets. The metric values are normalized and bounded by 1 as the perfect skill [34, 35].
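A minimal sketch, using scipy's two-sample Kolmogorov–Smirnov statistic as $D_{KS}$:

```python
from scipy import stats

def kst_index(obs, mod):
    # D_KS: maximum absolute difference between the two ECDFs
    d_ks = stats.ks_2samp(obs, mod).statistic
    return 1.0 - d_ks            # 1 = perfect skill
```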

2.4. Metrics of Time Sequence
2.4.1. Normalized Mean Error (ME)

Normalized mean error (ME) measures the value-to-value differences between two time series and is calculated as

$$ME = 1 - \frac{\left| \frac{1}{N}\sum_{i=1}^{N} (m_i - o_i) \right|}{n_g\,\sigma_o},$$

where $m_i$ and $o_i$ are the ith values of the time series of the gridded dataset and the observations, and N is the number of data records [38]. ME is calculated at daily (ME_d), monthly (ME_m), annual (ME_a), and seasonal (ME_DJF, ME_MAM, ME_JJA, and ME_SON) time sequences, with ME = 1 meaning the perfect skill.

2.4.2. Normalized Root Mean Square Error (RM)

Root mean square error (RMSE) is a common measure to quantify the difference between two time series. For two time series p(t) and f(t) with n data records, it is calculated as

$$RMSE = \sqrt{\frac{1}{n}\sum_{t=1}^{n} \left( p(t) - f(t) \right)^2}.$$

Herein, we use the normalized RMSE [35], expressed as

$$RM = 1 - \frac{RMSE}{n_g\,\sigma_o},$$

where $n_g$ and $\sigma_o$ are the scale factor and the observation standard deviation relevant to the timescale of interest, for daily (RM_d), monthly (RM_m), annual (RM_a), and seasonal (RM_DJF, RM_MAM, RM_JJA, and RM_SON) time sequences.
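Both time sequence error metrics can be sketched together; the $n_g\,\sigma_o$ normalization mirrors equation (1) as reconstructed above:

```python
import numpy as np

def me_index(obs, mod, ng=1.0):
    # Normalized mean error: 1 - |mean(m - o)| / (ng * sigma_o)
    return 1.0 - abs(np.mean(mod - obs)) / (ng * obs.std())

def rm_index(obs, mod, ng=1.0):
    # Normalized RMSE: 1 - RMSE / (ng * sigma_o); negative when the RMSE
    # exceeds ng standard deviations of the observations
    rmse = np.sqrt(np.mean((mod - obs) ** 2))
    return 1.0 - rmse / (ng * obs.std())
```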

2.4.3. Relative Change (RC)

The RC metric is only applied to annual rainfall, by calculating the change between two consecutive years as

$$C_i = \frac{P_{i+1} - P_i}{P_i},$$

where $P_i$ is the annual rainfall of year i. The RC index for the whole data period is then calculated from the mean difference of $C_i$ between the observations and the precipitation datasets using equation (1) [34, 35].

2.4.4. Pearson Correlation Coefficient (r)

The Pearson correlation coefficient (r) of the sequential time series at every observation point and the corresponding grid cell of a precipitation dataset is computed using the following equation:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},$$

where x and y are the observation and precipitation dataset variables with means $\bar{x}$ and $\bar{y}$, and n is the number of data pairs [35]. The calculations of r are at daily (r_d), monthly (r_m), annual (r_a), and seasonal (r_DJF, r_MAM, r_JJA, and r_SON) timescales.

2.4.5. Z-Test (ZT)

The Z-test assesses the significance of the difference between the mean values of the observations and a precipitation dataset, taking into account the difference in sample size:

$$Z = \frac{\mu_m - \mu_o}{\sqrt{\sigma_m^2 / n_m + \sigma_o^2 / n_o}},$$

where $\mu_m$, $\sigma_m$, and $n_m$ are the mean, standard deviation, and sample size of the dataset, respectively, and $\mu_o$, $\sigma_o$, and $n_o$ are those of the observations. The test statistic is evaluated against the standard Gaussian distribution at the 95% confidence level; a p value < 0.05 means that the mean of the precipitation dataset is significantly different from that of the observations. The ZT score is the number of stations with an insignificant p value (≥ 0.05) divided by the total number of stations [35]. ZT calculations are at daily (ZT_d), monthly (ZT_m), seasonal (ZT_DJF, ZT_MAM, ZT_JJA, and ZT_SON), and annual (ZT_a) timescales.
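A sketch of the Z-test and of the station-fraction ZT score, assuming pairs is a list of (obs, mod) array pairs, one per station (a hypothetical container, for illustration):

```python
import numpy as np
from scipy import stats

def z_test_pvalue(obs, mod):
    # Two-sample z-test for a difference in means with unequal sample sizes
    z = (mod.mean() - obs.mean()) / np.sqrt(
        mod.var(ddof=1) / mod.size + obs.var(ddof=1) / obs.size)
    return 2.0 * stats.norm.sf(abs(z))          # two-sided p value

def zt_score(pairs):
    # Fraction of stations where the dataset mean is NOT significantly
    # different from the observed mean (p >= 0.05)
    pvals = np.array([z_test_pvalue(o, m) for o, m in pairs])
    return float(np.mean(pvals >= 0.05))
```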

2.4.6. T-Test (TT)

The metric TT is computed the same way as the Z-test but using Student’s t distribution [35].

2.5. Metrics of Extreme Value Representation

To evaluate the performance of a precipitation dataset for extreme value representation, two metrics are used: the fraction skill score (FSS) and the Anderson–Darling test (ADT). The FSS uses a forecast verification approach to evaluate the detectability of moderate-to-heavy rainfall events [39], where rainfall values that exceed a threshold are transformed into a binary value of 1 (0 otherwise), following the formula

$$FSS = 1 - \frac{FBS}{FBS_{worst}}.$$

FBS represents the mean squared difference between the binary fields of the referent ($B_o$) and the precipitation dataset ($B_m$) on each grid, computed as

$$FBS = \frac{1}{N}\sum_{i=1}^{N} \left( B_{m,i} - B_{o,i} \right)^2,$$

where N is the number of data, while

$$FBS_{worst} = \frac{1}{N}\left( \sum_{i=1}^{N} B_{m,i}^2 + \sum_{i=1}^{N} B_{o,i}^2 \right)$$

is the largest FBS that can be obtained. FSS ranges between 0 (no skill) and 1 (perfect skill). Although FSS can be calculated with absolute thresholds, in this study it is defined relative to the 75th (p75), 90th (p90) [40], and 98th (p98) percentiles of each dataset.
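For a single station time series, the FSS reduces to the following sketch, with thresholds taken as percentiles of each series separately, as described above:

```python
import numpy as np

def fss(obs, mod, percentile=90):
    # Binary exceedance fields relative to each series' own percentile
    bo = (obs > np.percentile(obs, percentile)).astype(float)
    bm = (mod > np.percentile(mod, percentile)).astype(float)
    fbs = np.mean((bm - bo) ** 2)                    # fractions Brier score
    fbs_worst = np.mean(bm ** 2) + np.mean(bo ** 2)  # largest attainable FBS
    return 1.0 - fbs / fbs_worst                     # 0 = no skill, 1 = perfect
```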

A modified version of the Anderson–Darling test [41, 42] was applied to test the difference between the distribution of a precipitation dataset and the reference data. Let X and Y be samples of sizes n and m with empirical cumulative distribution functions (CDFs) F and G, respectively, and let $H = (nF + mG)/N$ be the weighted average of F and G, with N = n + m. The test statistic is

$$A^2 = \frac{nm}{N} \int_{-\infty}^{\infty} \frac{\left( F(x) - G(x) \right)^2}{H(x)\left( 1 - H(x) \right)}\, dH(x).$$

$A^2$ measures how closely the distribution of daily precipitation from the dataset reproduces the distribution of the daily observations for moderate-to-heavy rainfall (using the same percentile thresholds as FSS); smaller values indicate similarity between the two distributions at the 95% significance level. The rainfall distribution of a dataset is significantly different from the observed rainfall when the p value is < 0.05. The score (ADT) is a normalized value of $A^2$, bounded by 1 as the perfect skill.
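scipy ships a k-sample Anderson–Darling test (a midrank variant of this modified statistic) that can serve as a sketch; restricting the samples to values above a percentile threshold is an assumption chosen to match the FSS thresholds:

```python
import numpy as np
from scipy import stats

def adt_similar(obs, mod, percentile=90):
    """Two-sample Anderson-Darling test on the upper tails.
    Returns the statistic and True when the two extreme-value distributions
    are indistinguishable at the 5% level (statistic below the critical value;
    scipy's critical_values are for 25%, 10%, 5%, 2.5%, 1%, 0.5%, and 0.1%)."""
    o = obs[obs > np.percentile(obs, percentile)]
    m = mod[mod > np.percentile(mod, percentile)]
    res = stats.anderson_ksamp([o, m])
    return res.statistic, bool(res.statistic < res.critical_values[2])
```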

2.6. The Summation of Rank

The metrics were applied based on point-to-grid comparisons, since the observation data used as referents were at point locations, which may affect the representativeness of the values being compared. However, Tan et al. [43] pointed out that the results of point-to-grid comparisons are substantially similar to those of grid-to-grid comparisons. In this work, we use two scoring schemes to index precipitation dataset performance with the distribution and time sequence metrics: the "ratio" scheme, which counts the stations with an index greater than 0.5 and divides by the total number of stations, and the "mean" scheme, which averages the index over all stations. These indices and scores measure statistical performance at daily, monthly, seasonal, and annual timescales. The performance scores for extreme value representation, however, were only calculated with the "mean" scheme. The summation of rank (SR) [34, 35, 44] method is used to summarize and quantify the total score over all metrics, as sketched below.
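A plausible sketch of SR with pandas, adopting the convention that rank 1 is best for each metric so that the smallest rank sum identifies the best dataset (the source does not spell out the rank direction):

```python
import pandas as pd

def summation_of_rank(scores: pd.DataFrame) -> pd.Series:
    """scores: rows = datasets, columns = metric scores (higher = better).
    Rank datasets within each metric, sum the ranks across metrics, and
    sort so the best-performing dataset comes first."""
    ranks = scores.rank(axis=0, ascending=False, method="min")
    return ranks.sum(axis=1).sort_values()
```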

3. Results

3.1. Distribution-Based Performance

Scores for the five performance metrics in the data distribution group, namely M, SC, CV, SS, and KST, are presented as heatmaps in Figure 2. It should be noted that the scores are calculated as aggregates over all validated points in Figure 1, with the two representation schemes, that is, "ratio" (Figure 2(a)) and "mean" (Figure 2(b)), as previously explained. The number of samples for the CMORPH-CDR, GPM-IMERG-F, GSMaP_RNL, GFD, and TMPA 3B42 datasets is 133, that is, they are sampled at all 133 validating points. However, PERSIANN-CDR, MSWEP, and CHIRPS had slightly fewer samples (128, 131, and 131, respectively) because some grid cells do not enclose observational data.

It can be seen from Figure 2(a) that, based on the ratio scoring scheme, MSWEP has the highest scores, followed by TMPA 3B42 and GPM-IMERG-F. MSWEP also has the highest scores with the mean scoring scheme, followed by TMPA 3B42 and CMORPH-CDR (Figure 2(b)). It is of interest to note that the metric M shows relatively low scores at seasonal and annual timescales in comparison with daily and monthly ones. Seasonally, the scores are worse for DJF and MAM than for JJA and SON. At most stations, the JJA and DJF periods correspond to the dry and rainy seasons, respectively. The spatial distribution of the distribution-based SR at all 133 stations in Indonesia can be seen in Figure S1 in the Supplementary Material (SM).

3.2. Time Sequence Performance

The time sequence performance of each dataset was evaluated using six metrics, namely ME, RM, RC, r, TT, and ZT, presented as heatmaps in Figure 3. As in Figure 2, the heatmaps in Figures 3(a) and 3(b) correspond to the results of the ratio and mean scoring schemes, respectively. The SR scores show that the MSWEP dataset consistently occupies the topmost rank, with TMPA 3B42 in second. On the other hand, CHIRPS now emerges as the third-ranked dataset, outperforming GPM-IMERG-F and CMORPH-CDR. In contrast to the distribution-based performance, the time sequence metrics are worse at the daily timescale than at monthly, seasonal, and annual timescales. Thus, in general, longer temporal aggregation improves the time sequence performance of the datasets.

It is notable from Figure 3 that the RM (normalized RMSE) metric group shows the worst scores at all timescales. Most datasets have RM scores of less than 0.5 (0.3) under the ratio (mean) scoring scheme (Figures 3(a) and 3(b)). There are even negative scores, which indicate that the RMSE of the dataset precipitation is larger than one standard deviation of the observations [35]. Negative values also appear in the normalized mean error (ME) scores at the daily timescale with the mean scoring scheme, except for TMPA 3B42. In general, the mean scoring scheme tends to produce lower scores for all metrics in Figures 2 and 3. The spatial distribution of the time sequence SR for all datasets at the 133 stations (Figure S2 in the SM) appears random and shows lower values than the SR of the distribution-based performance.

3.3. Extreme Value Representation

The two metrics for extreme value representation, that is, FSS and ADT, can be calculated using both percentile and absolute thresholds. However, considering that the distributions of extreme values can differ significantly among datasets (Figure S3 in the SM), we calculated the scores based only on three percentiles, that is, p75, p90, and p98, and show the results in Figure 4. In contrast to the two previously discussed performance rankings, Figure 4 shows that the MSWEP and TMPA 3B42 datasets are in the bottom three by SR score, performing better only than CMORPH-CDR. In this case, PERSIANN-CDR is in the top rank, followed by GFD and CHIRPS. Figure S3 in the SM shows the empirical CDF averaged over all stations (Figure S3(a)) and for five stations in Sumatra, Borneo, Java, Sulawesi, and Papua (Figures S3(b)-S3(f)).

In general, all scores in this category drop drastically at higher percentiles. The ADT scores for p75 are the highest, but those for p98 are the lowest among all scores. On the other hand, the FSS scores decrease more gradually from p75 through p90 to p98. These results indicate that extreme precipitation is still poorly represented in the globally gridded datasets, bearing in mind that daily precipitation values are being compared.

4. Discussion

Based on the SR scores, MSWEP and TMPA 3B42 are the top two datasets that consistently outperform the others for both the distribution-based and time sequence performance metric groups. It should be noted that MSWEP combines several datasets, including other gridded data, such as CMORPH and TMPA 3B42, and station (GHCN) data. MSWEP data processing involves bias and frequency correction, CDF matching, improvement of the peak attenuation of the rain distribution, weighted combination of data sources, and refinement of the spatial resolution to 0.1° [33]. The results of a previous study [10] that compared 22 precipitation datasets also show that MSWEP had the best performance in temporal correlation with rain gauge observations and in calibration scores for hydrological model applications. However, Figure 3 clearly shows that the scores of the time sequence performance metrics are low at daily resolution. Longer temporal aggregation is likely needed to improve the representativeness of the datasets.

In contrast, the statistical performance of the MSWEP and TMPA 3B42 datasets is low for extreme value representation. The averaging involved in combining several data sources could have smoothing effects that eliminate real extreme values. A study by Hamada et al. [45] shows that the maximum near-surface rain rate from TRMM PR data surpasses 50 mm per hour only at percentiles higher than 90. Therefore, gridded precipitation datasets may still be useful for qualitative studies of extreme events, but additional efforts are needed for more quantitative applications.

Considering the variations in rainfall climatology across the Indonesian region [46], the spatial variation of the performance indices may also be of concern to users. Maps of the SR scores for all 133 validating stations can be found in the SM (Figure S1). In general, the spatial distribution of SR shows random patterns. However, relatively lower scores tend to concentrate over the northern part of Sumatra Island, an area characterized by a bimodal annual rainfall pattern and mountainous topography.

So far, we have discussed the statistical performance of eight gridded precipitation datasets that are categorized as (high-latency) research data. Some previous studies include rain gauge-based datasets, as well as near real-time (low-latency) satellite products, in their analyses. For further comparison, we applied the same scoring and ranking procedures to five additional datasets, that is, SA-OBS, APHRODITE, CMORPH-Raw, GPM-IMERG Early Run, and GSMaP NRT (see Table S4 in the SM), and summarize the results in Table 2. Comparing the SR scores with those in Figures 2 and 3, the performance of MSWEP is comparable to that of the rain gauge-based SA-OBS, whereas APHRODITE does not show good performance. On the other hand, of the three low-latency satellite products, GPM-IMERG Early Run shows the best statistical performance, comparable to MSWEP in terms of SR scores.

This study adopts a multimetric approach originally developed to evaluate climate models [34, 35]. We use a combination of standard continuous and categorical verification statistics [47] as quantitative measures of the accuracy of the rainfall amounts and occurrences estimated by the gridded precipitation datasets. We apply continuous-type metrics to comprehensively assess the data distribution and the time sequence of data pairs between each dataset and the observed data. In addition, categorical-type metrics were employed to evaluate the representation of extreme events above a threshold value. The summation of ranks diagnoses the rank of each precipitation dataset's performance to decide the best performer robustly. Previous studies [10, 21, 22] used fewer metrics without rank scoring and focused more on bias and correlation in their comparisons. This study adds comparisons of the statistical distributions of all rainfall data and of extreme rain days. A weakness of this study lies in not considering physical factors when comparing the rainfall of the precipitation datasets and the observed data, such as altitude differences [22], while regional influences of the monsoon [35] were only briefly discussed. This research is still a purely statistical analysis, with quantitative "value-to-value" evaluation between gridded precipitation datasets and point-based rain gauge stations. Nonetheless, our results should be informative for those who need to use gridded precipitation datasets, especially for climatological and hydrological applications in the Indonesian region.

5. Conclusions

The performance and reliability of eight gridded precipitation datasets, namely CHIRPS v2.0, CMORPH-CDR v1.0, GFDv3, PERSIANN-CDR v01r01, TMPA 3B42v7, GSMaP_RNL V06, GPM-IMERG V06 (Final Run), and MSWEPv2, were compared with rain gauge station observations at daily, monthly, seasonal, and annual timescales for the period 2003-2015. The findings of this study can be summarized as follows:

(i) A multimetric approach of 13 metrics grouped into three categories, namely data distribution, time sequence, and extreme value representation, was applied. The summation of rank ranks all datasets on every performance metric and quantifies the total scores over all metrics for diagnosing and deciding the best-performing dataset.

(ii) The results show that MSWEPv2 is the best product, followed by TMPA 3B42, for daily, monthly, seasonal, and annual precipitation in comparison with rain gauge data, based on the summation of rank. The extreme value performance of all gridded precipitation datasets is low above the 75th percentile of daily rainfall. These findings have implications for climatological and hydrological applications of gridded precipitation datasets in the Indonesian region.

Data Availability

The meteorological station data are available in BMKG or Indonesia Agency for Meteorology, Climatology, and Geophysics, which can be accessed at https://dataonline.bmkg.go.id/home. Precipitation datasets used for this study are included within the paper.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors acknowledge the Indonesia Agency for Meteorology, Climatology, and Geophysics for providing the rainfall station data. TW drafted the initial manuscript and performed the statistical data processing. TWH revised and improved the overall manuscript, including the storyline and discussion. AS and LMH performed further revisions of the manuscript. All authors were involved in discussions during the review process, and all authors have read and approved the final manuscript. This manuscript contains material from a dissertation by TW, written in partial fulfillment of the requirements for a doctoral degree at Bandung Institute of Technology. The authors would like to express their gratitude to the Indonesia Endowment Fund for Education (LPDP) for funding this research (Grant Number 201812210213600). The second author (TWH) was partially supported by ITB-P3MI research grants.

Supplementary Materials

Supplementary Materials. S1: figures for the spatial distribution of the distribution-based performance metrics. S2: figures for the spatial distribution of the time sequence performance metrics. S3: figures for the datasets' empirical cumulative distribution functions (CDFs). S4: table summarizing the five additional gridded precipitation products.