Abstract

The univariate analysis of hydrological extremes is a well-established practice in developing countries such as Ethiopia. However, for the design of hydrological and hydraulic systems, it is essential to have a thorough understanding of flood event characteristics, including volumes, peaks, time of occurrence, and duration. This study utilizes copula functions for bivariate modeling of flood peak and volume characteristics, examining the performance of four Archimedean copulas in the Guder basin located in Ethiopia from 1987 to 2017. Flood peak and volume were extracted using the theory of runs for analysis of their joint characteristics with the truncation level chosen as equal to the lowest annual maximum event. Univariate distributions with the best fitness on both variables were determined, and results showed that gamma and GEV-fitted flood peaks and lognormal-fitted flood volumes are the most suitable. Four Archimedean copulas were evaluated, and the Gumbel-Hougaard copula was found to be the best fit for the data based on graphical and measurable tests. Bivariate probability and return period were computed in “AND” and “OR” states. The joint return period for flood peak (97.49 m3/s) and volume (77.35 m3/s) was found to be 15 years in the “AND” state and approximately 4 years in the “OR” state. The study also evaluates univariate and conditional return periods, comparing them with the primary one. The copula method was an effective method for distributing marginal variables, highlighting its potential as a valuable tool in flood management.

1. Introduction

The increase in water demand caused by population growth and industrial development has caused issues related to water resource development and optimal use to gain researchers’ attention more than ever in recent decades. Floods are a recurrent natural disaster all over the world, causing substantial socioeconomic consequences and human victims [1, 2]. As a result, developing models that allow for accurate projections of the extent of such catastrophic occurrences is crucial for achieving the construction and management of hydraulic structures such as dams, as well as acceptable flood risk assessments. In practice, such magnitudes are typically assessed using univariate flood frequency studies, which are primarily concerned with peak flows [3], [4] [5]. In addition to the uncertainty involved with the occurrence in both space and time, these events may frequently have varied degrees of association. As a result, using a univariate flood frequency analysis may result in an incorrect evaluation of the risks associated with the event. However, a review of the literature suggests that the bulk of research uses a univariate approach [68] rather than a more realistic approach that acknowledges the multivariate nature of the underlying phenomenology. In order to more properly reflect the risk associated with such events, it is required to analyze their simultaneous occurrence in order to better grasp their probabilistic properties [9].

Recently, researchers have widely used a new method known as the copula function in the multivariate frequency analysis of hydrological phenomena [10, 11]. Copula functions are not constrained by the limitations of ordinary bivariate distribution functions and can be used to create multivariate distribution functions by connecting different univariate marginal distribution functions. Furthermore, copulas can more accurately describe dependence structures among variables ([12], [1317]).

Copulas have been introduced as an efficient tool for quantifying the dependence structure between correlated quantities. Many studies have demonstrated the flexibility provided by copulas in forming joint distributions [15, 1823]. Copulas were first used in hydrological studies by De Michele and Salvadori [24] in rainfall frequency analysis. Naz et al. [15] used copula functions to investigate the flood at Tarbela Dam, Pakistan, and developed a method for forecasting floods along the Indus River. Other studies have also been used in flood analysis [17, 2527].

Siraj et al., [27] tested ten bivariate copulas from three different copula families on flood data from the Sava River in Slovenia. The Gumbel-Hougaard copula was found to be the most appropriate for the peak and volume of the flood, as well as the peak and duration factors, whereas the student’s t copula was found to be the most appropriate for the peak and duration pair of factors. Ganguli and Reddy [28] used a trivariate copula to break down the three flood factors, peak, volume, and duration. Based on the goodness-of-fit and tail-dependence tests, as well as a graphical investigation, the best choice among Clayton, Gumbel-Hougaard, Frank, and the student’s t copula has been made. It was deduced that the student’s t copula was the best fit for their data. Recently, another system’s entropy copula implied for a multivariate examination of commonly influencing factors was tested on three flood factors (P, V, and D) at two distinct Chinese stations [29].

Amini et al. [30] applied the vine copula structures for multivariate analysis of flood characteristics in the Armand watershed, Iran, using the hydrographs of 68 flood events. The study found Johnson SB, lognormal (3P), log-logistic, and lognormal (3P) best-fitted marginal distributions for flood duration, peak discharge, time to peak, and flood volumes, respectively. They found that the Frank copula has the best-fitting copula at most of the edges and nodes. Zhang and Singh [18] investigated the relationship between flood peak, volume, and duration using four families of Archimedean copulas: Ali-Mikhail-Hagh, Frank, Gumbel-Hougaard, and Cook-Johnson. The margins were examined using the extreme values distribution type I and the log-Pearson type III. Their results showed that there are significant associations between flood peak and volume, as well as flood volume and duration. They realized that the Gumbel-Hougaard family would be more appropriate to explain the variable dependence structure. The copula model was also used to calculate the conditional return period. [31] developed a new method for calculating bivariate design events using copulas that allow for simultaneous and nonsimultaneous occasions of the variables under consideration. Copulas were also used in flood frequency analysis by different researchers [1, 19, 20, 28, 2933]. Stamatatou et al. [34] compared univariate and joint bivariate return periods of floods that rely on different probability concepts in the Yermasoyia watershed, Cyprus. The pair of peak discharge and corresponding volume was estimated and compared using annual maximum series (AMS) and partial duration series (PDS) approaches.

According to research, only one comprehensive study on this topic has been conducted previously, which was performed by Haile in 2022. Therefore, the novelty of this research in Ethiopia is that it provides valuable insights into a relatively new research area that can contribute to better flood risk management strategies. Multivariate flood frequency analysis can model the complex relationships between flood variables and provide a comprehensive understanding of the risk associated with floods, leading to accurate estimation of flood hazards, identification of sources of flood risk, and ultimately, better flood risk management strategies. Therefore, further research in this area has the potential to address the current data gap in Ethiopia and contribute to the development of effective flood risk management policies and practices. As a result, the aim of this study was to create a copula-based probability model for bivariate analysis of the Guder watershed’s 30-year flow (1987–2017). Archimedean copulas such as the Clayton copula, Frank copula, Gumbel-Hougaard copula, and Ali-Mikhail-Haq copula have all been examined. Goodness-of-fit statistics, upper tail-dependence coefficients, and graphical analysis were used to compare the copulas. The primary, conditional, and Kendall return periods have also been computed for a better knowledge of the river flow in the study area in order to comprehend the danger of flood occurrences.

2. Materials and Methods

2.1. Description of Study Area

Guder watershed has a drainage area of 7011 km2 and is located in central Ethiopia in the Blue Nile basin’s south-eastern region. It may be found in the Oromia regional state between 7°30 to 9°30 N latitudes and 37°00′ to 39°00′ E longitudes. Tikur Incinis, Ambo, Cheliya, Dendi, Jima Rare, Mida Kegn, Gojo, Guduru, Liban Kutaye, Tokke Kutaye, and Ababo Guduru are the districts covered by the Guder watershed (Figure 1). It is bounded to the east by the Muger subbasin, to the south by the Awash basin, and to the west by the Fincha subbasin. The Guder River rises in the mountains to the south of the villages of Ambo and Guder. The river runs from south to north and empties into the Abbey River. The main tributaries of the Guder River include Huluka, Tarantar, and Debris Rivers [16, 3539].

2.2. Methodology

The primary goal of this study is to apply the copula method and evaluate its results. Initially, approaches for specifying the marginal distribution functions for the various variables were used. This includes data analysis and generation of flood characteristics [36, 40, 41], modeling extremes for the selected parameters using threshold peaks, and finally determining the appropriate distributions for each variable of interest. Following that, the return periods for each variable were calculated using the standard univariate approach. Copula represents the correlated variable’s joint distributions. As a result, it is critical to examine the relationship between the variables. A scatterplot of the data, a scatterplot of the variables’ standardized ranks, and a Chi-plot can be used to perform a graphical analysis of dependence. The Chi-plot is a plot of a rank-based measure of each observation’s location versus a measure of the Chi-squared test of independence. If the transformed data are scattered above (below) the region defined by the Chi-plot’s confidence interval [42, 43], the dependence is positive (negative).

To quantify the dependence between variables, dependence measures such as Kendall’s rank correlation coefficient (or Kendall’s tau), and Spearman’s rho can be computed [16, 44]. Under the null hypothesis that the value of the association measures is 0, the values associated with these measures can be calculated. If the value is less than the significance level, statistical independence is rejected. Kendall’s tau is regarded as the most important measure of association in the theory of copulas, particularly Archimedean copulas, where its relationship with the generating function is used for copula parameter estimation. One of the most common methods for estimating the parameter of the Archimedean copula is the inversion of Kendall’s tau. The relationship shown in Table 1 between Kendall’s tau and the parameter is used to estimate in this method. Because this relationship is dependent on the copula functions generating function , different values of θ are obtained for different copula models for different copula models [14, 15, 17, 36].where ϕ and ϕ′ are the generating function and derivative of the Archimedean copula’s generating function, respectively. Because this relationship is dependent on the generating function of the copula function, different values of are obtained for the various copula models listed in Table 1. Table 1 describes the key characteristics of the Archimedean copula families used in this study.

Following the selection of the best-performing copula approach, the bivariate distributions had to be created. According to Sklar’s theorem [47], a copula is a joint distribution function of typical uniform random variables that may link marginal distribution functions with multivariate probability distributions. Let be a joint distribution function of and marginal distribution. The copula can be described as follows:

If , , and C are distribution functions, then given by equation (2) is a joint distribution function with marginal and .

Evaluation of the joint distributions is greatly simplified when copulas are employed to construct the bivariate distribution, which is demonstrated in the following. The joint distribution function, defined in equation (3), is the probability that both and are less than or equal to specific thresholds of and , respectively; that is,

In practice, the probability of both and exceeding their respective thresholds can also be defined by copulas:

The conditional distribution of given and the conditional distribution of given can also be estimated as the following equations:

The resulting joint probabilities of copula are expressed as a function of univariate marginal distributions. Therefore, for specified univariate marginal distributions, the joint distributions can be described simply through the use of copulas.

The copula-based return periods were computed after demonstrating the joint distribution. The joint (primary) return periods known as “OR” mode and “AND” mode [31, 48, 49] were computed in this study and are defined as follows:where u and are uniformly distributed . The U represents , and V represents , and they were created by applying the probability integral transform to X and Y, a transformation that allowed us to simplify our work by using an equivalent set of values that follow the standard uniform distribution.

In addition, the return period can also be defined by the event for given or the event for given that is called the conditional return period for given and the conditional return period for given , respectively:

On the other hand, a secondary return period, also known as Kendall’s return period, is defined as follows:where is Kendall’s distribution function for the theoretical copula function.

As a reminder, , and ; hence, the (u, ) pairs that satisfy the above equations form the associated bivariate return period curve for a given return period value. In other words, for a given copula value , the (u, ) pairs located in the copula level curve equals t are associated with the return period value.

3. Results and Discussion

The technique described in the methodology section was used to evaluate the design variables that could be used in the analysis of hydraulic or hydrologic problems on the Guder River. In this case, both the flood peak and the flood volume are important; thus, assuming the two variables are related, the copula approach would be more significant than the single-variable approach. The threshold was chosen in accordance with the partial duration series methodology in order to generate a larger sample of extreme events than in the previous analysis [31, 40]. The flood volume V was then estimated by locating the flood volume that corresponded to the chosen flood peak events. More specifically, the truncation line is taken as 40 and 60 m3/s for station-1 and station-2, respectively; i.e., flood events are defined as a daily stream flow equal to or greater than 40 and 60 m3/s. Eighty-seven and hundred fifty-four flood data are abstracted from the recorded daily stream flow data (1987–2017) for the two stations, respectively. The observed number of pairs (n) is 87 for station 1, consisting of peak discharge (Q) and volume (V), with a mean interarrival time of 0.35 years. On the other hand, for station 2, there are 154 pairs with a mean interarrival time of 0.20 years. The location and scale parameters for both variables are depicted in Table 2 at the two stations.

3.1. Univariate Analysis

A univariate flood frequency analysis was performed after the extreme events were chosen. The datasets were subjected to various probability models that were best fitted by gamma and generalized extreme value (GEV) distributions for peak discharge at stations 1 and 2, respectively, and lognormal for flood volumes at both stations. The parameters of the distribution were estimated using the maximum likelihood method, which will also be used in the estimation of copula parameters [16, 48]. Following that, the Kolmogorov–Smirnov (K-S) goodness-of-fit and graphical tests were developed to identify the distributions that produced an adequate fit to the data. After performing graphical and goodness-of-fit (GOF) tests and computing K-S values for each accepted model, the best-fit distributions were chosen (Figure 2). The maximum likelihood (ML) method was used to estimate model parameters (Table 3). Finally, once a suitable model had been identified, the univariate return periods for 2, 5, 10, 25, 50, and 100 years were calculated and are shown in Table 4. In the lower return periods, the estimated univariate return values differ only for the peak discharge variable.

3.2. Bivariate Analysis

Following the univariate analysis, Kendall and Spearman’s correlation coefficient was used to perform a formal assessment of the dependence between the pairs of variables under consideration. The first station yielded Kendall and Spearman’s correlation coefficient values of 0.60 and 0.79, while the second station yielded a value of 0.67 and 0.85, respectively, highlighting the strong correlation between the two variables (Table 5).

The scatter plot of the u and is displayed in Figures 3(a) and 3(b) for both station-1 and station-2, respectively, where a strong correlation between the two variables is readily apparent. The scatter plot of the pseudo-observations is plotted in Figures 3(c) and 3(d) for station-1 and station-2, respectively; this shows a positive relation of dependence between variables. The values of Kendall’s and Spearman’s rank-based nonparametric measures of dependence validate the results provided by the graphical information and also supported by the Chi-plot (Figure 4). The Chi-plot shows that the majority of the values are located above the region defined by the confidence intervals, indicating that the positive dependency between the two variables (peak and volume) is strong.

Nonparametric estimation method is used to estimate copula parameters and to identify copula-based bivariate flood frequency analysis of flood peaks and volumes [44, 50]. The four commonly used Archimedean copula families are considered as candidates for joint flood frequency analysis. These are Ali-Mikhail-Haq, Clayton, Frank, and Gumbel-Hougaard copula. The feasible copulas were checked by testing the admissible range of dependence supported by each one via Kendall’s value. As a result, the Ali-Mikhail-Haq (τ ∈ [0.1817, 1/3]) was eliminated (Figures 5 and 6).

The parameter estimated by Kendall’s τ is given in Table 6 for each copula method. The goodness of fit of observed data to the theoretical bivariate distribution obtained by using copula functions is tested by AIC [51] and KS methods. Table 7 shows that AIC and KS values for bivariate distributions are obtained by using different copula functions for flood peaks and volumes. Considering both tests, the Gumbel-Hougaard copula was selected as the best copula model for station-1 and station-2. Based on this, joint cumulative density function (JCDF), conditional probability, and corresponding return periods for different combinations of flood peaks and volumes are calculated.

In other definition, if the plot is in agreement with a straight line passing through the origin at a 45-degree angle, then the generating function is satisfactory [46], [52]. The 45-degree line indicates that the quantiles are equal. Otherwise, the copula function needs to be reidentified.

3.2.1. Joint and Conditional Density Function

(1) Joint Cumulative Density Function (JCDF). The JCDF of flood peaks and volumes is calculated based on Gumbel-Hougaard copula methods for station-1 and station-2, respectively, in which u and were estimated by corresponding marginal distributions of flood peak and volume. The contour lines demonstrate that, given a fixed occurrence probability of a flood event, one can obtain various occurrence combinations of flood peaks and volumes, and vice versa, which cannot be obtained by single-variable flood frequency analysis (Figure 7). These results should be useful for hydrological structure design studies.

(2) Conditional Density Function. The conditional probabilities and are estimated based on (5) and (6) and are depicted in Figure 8 for both stations.

(3) Joint Return Periods. The (equation (7)), (equation (7)), and (equation (9)) joint return periods associated with the theoretical events with a peak equal to and volume equal to for univariate return periods T equal to 5, 10, 50, and 100 years are estimated for both the Station-1 and Station-2, being and the quantiles obtained from the fitted marginal distributions. The results presented in Table 8 indicate that although values linked to the Gumbel copula are greater than those linked to the Clayton copula, and values are much higher. It can also be seen that the greater the return period, the larger the difference between joint return periods related to each copula. Therefore, as expected for not being an extreme-value copula, the Clayton copula underestimates the risk associated with the and joint return period. Hence, this analysis supports the fact that not taking into consideration the upper tail dependence in joint extreme events modeling can lead to underestimate the risk [53] [46], [1315, 17, 31, 38].

Shiau et al. [52] predicted, in reference to “OR” and “AND” cases and flood peak and volume, that “the use of or as the design criterion depends on what situations will destroy the structure. can be used to calculate the average recurrence interval when either the flood peak or flood volume exceeds a certain magnitude. , on the other hand, is used when the flood volume and flood peak must exceed a certain magnitude that will cause damage.” was proposed by Salvadori [24] and is thus widely used (e.g., [48, 50, 5456]). Based on the following arguments, the idea behind TK is to overcome an apparent shortcoming of and . Different pairs of , e.g., , , and , lying on the same level curve of a bivariate joint distribution, have the same joint probability, i.e., , but they define different and partially overlapping critical regions. As a result, we have infinite OR (AND) critical regions with the same joint probability, making a choice between them impossible (e.g., [27, 54, 57]). Because the lack of correspondence between each value and a unique critical region is incorrect from a measure theoretic standpoint, Salvadori [24] introduced , which is based on the Kendall distribution (or measure) KC and measures the chance of observing an event in one of the two distinct subregions defined by a level curve characterized by a distinct value of joint probability. The contours for joint return periods and are depicted in Figure 9.

Based on the contours of the joint return periods, given a return period, one can obtain various combinations of flood peaks and volumes, and vice versa. These various scenarios can be very useful for hydrologic engineers to carry out effective hydrologic engineering management and designs, such as spillways and flood control reservoirs, in which a design flood hydrograph (DFH) is needed [58]. The DFH is characterized by its peak, volume, duration, and shape, as addressed in previous studies [5961] [62]. Various pairs of flood peak and volume values associated with a given return period, say 100 years, provide one more possible choice on which the DFH should be selected. The variability of design flood events under bivariate flood frequency analysis allows a better selection of the most crucial case according to a specific water resources planning, management, or design problem, which cannot be achieved via single-variable frequency analysis. The conditional joint return periods were computed using the copula-based distributions selected for each set of correlated flood variables. The conditional joint returns periods and of flood peak given flood volume and flood volume given peak based on the conditions and were computed using equation (8).

Figure 10 shows the conditional joint return periods for the Guder River at Station-1 and Station-2 for the above-mentioned conditions. The figures show that the conditional return period is significantly affected by the nature of conditioning. Taking the conditional return period of the Guder River Station-1 and 2 as an example, the return period of a specified value (say, flood peak) when the conditioning is in terms of a fixed value (say, flood volume) is found to be lower than the return period when the conditioning is in terms of a value less than or equal to the fixed value. On the one hand, for low specified flood volume values, no significant differences were found under these two conditions, i.e., in terms of a fixed volume value and a volume less than or equal to the fixed volume value. On the other hand, significant differences were found under these two conditions if a specified peak discharge value was high. Furthermore, if the return period of a larger specified value is desired, then larger differences are obtained under the two conditions. A decreasing trend of the return period is obtained for the same specified value under different conditions. This can be explained by the correlation structure of flood peak and volume, which are positively correlated. The positive correlation results in a less likely occurrence of high specified discharge for low conditioning volume than would be the case for high specified discharge under high conditioning volume.

The conditional return period was always higher than the primary return period and . The secondary return period was higher than the primary return period and lower than the return period . These relationships are in accordance with the different findings ([46], [1315, 17, 38]).

From Table 8, it is evident that the return period for the AND operation is consistently larger than the return period for the OR operation. Additionally, the univariate return period values always fall between these two primary return period notations. This finding aligns with the results reported by [46]. The study’s results highlight the utility of copulas as a valuable mathematical tool for modeling various multidimensional hydrological processes, including floods and sediment transport-related processes.

4. Conclusions

In this study, a bivariate flood frequency analysis was carried out using a large number of bivariate copulas as well as various statistical and graphical tests. The annual maxima series and peaks of threshold approaches were used to collect the data samples, and the corresponding univariate and bivariate return periods were evaluated and compared as the next step. Overall, the results agreed with [46], [1315, 17, 31, 38] who demonstrated that the relationship between univariate and joint return periods can be written as .

The examination of correlation in the two samples of data confirmed that the hydrological variables were significantly dependent on one another. It is worth noting that, while the association pattern changed when different sampling methods were used, the return period estimates did not differ significantly. Overall, the presence of dependence among hydrological variables, as well as the demands of the common problems at hand, points to the need for multivariate distributions to be constructed, particularly when dealing with design values. As a result, more research should be conducted to investigate the significance of copula application in hydrological analysis, specifically in return period estimation.

These findings could be useful for hydrological system design studies, and the same approach could be extended to any station in Ethiopia. The general conclusion is that univariate frequency analysis cannot offer an adequate probabilistic evaluation of correlated multivariate hydrological events, which can result in overestimation or underestimation of their magnitude.

Data Availability

The datasets generated during and/or analysed during this study are available from the corresponding author on reasonable request.

Informed consent to participate and to publish was obtained from all individual participants included in the study.

Disclosure

A preprint of this article has previously been published [63].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

The authors were responsible for the study conception and design, data collection, analysis and interpretation of results, and manuscript preparation.