Special Issue: Robust Statistical Modeling and Machine Learning with Applications in Data Science
On the Properties of the New Generalized Pareto Distribution and Its Applications
In this paper, a new generalization of the Generalized Pareto distribution, named the Khalil Extended Generalized Pareto (KEGP) distribution, is proposed using the generator suggested by Salahuddin et al. Various shapes of the suggested model and its important mathematical properties are investigated, including moments, the quantile function, the moment-generating function, measures of entropy, and order statistics. Parameter estimation for the model is discussed using the method of maximum likelihood. A simulation study is performed to assess the maximum likelihood estimates in terms of their bias and mean squared error. Practical applications are illustrated via two real data sets from survival and reliability theory. The suggested model provided better fits than the other models considered.
1. Introduction
In modelling heavy-tailed data sets, the Generalized Pareto distribution is a very important distribution, with applications in environmental studies, finance, operational risk, and insurance. The Generalized Pareto distribution was first introduced by Pickands [2] for making inferences about the upper tail of a distribution. The distribution is sometimes termed the “peaks over thresholds” model, as it is used for modelling exceedances over a threshold level in flood control. In particular, the Generalized Pareto distribution is used to model extreme values. This application of the Generalized Pareto distribution has been discussed by several authors, for instance, Gupta et al. [3], Hogg et al. [4], Hosking and Wallis [5], Smith [6–8], Davison [9], and Davison and Smith [10]. Smith presented an excellent review of the two most widely used methods in this field, based on generalized extreme value distributions and on the Generalized Pareto distribution. Davison and Smith [10] discussed applications of the Generalized Pareto distribution to river-flow exceedances and used it as a model for excesses over thresholds. Its applications include upper atmosphere ozone levels, environmental extreme events, large fluctuations in financial data, large insurance claims, and reliability studies. The applications of the Generalized Pareto distribution are addressed in various books, such as those by Castillo et al. [11] and Kotz and Nadarajah [12]. A number of researchers have fitted the Generalized Pareto distribution to exceedances over a series of thresholds and have used the Kolmogorov–Smirnov and Anderson–Darling statistics to test the fit. Several authors, including Choulakian and Stephens [13], have noted the flexibility of the Generalized Pareto distribution in modelling long-tailed data. This provides the motivation for proposing another flexible generalization of the Generalized Pareto distribution, referred to as the Khalil Extended Generalized Pareto (KEGP) distribution.
The suggested generalization of the Generalized Pareto distribution has the following aims:
(i) To produce a flexible extension of the Generalized Pareto distribution
(ii) To accommodate various forms of the hazard function, such as increasing, decreasing, and upside-down failure rate functions
(iii) To provide more flexibility in modelling extreme value data
(iv) To extend the considered distribution to a variety of data related to reliability and survival analysis
The rest of the manuscript is organized as follows. In Section 2, the KEGP distribution is defined along with a special case. In Section 3, some important properties of the KEGP model are investigated, namely, moments, the moment-generating function, entropies, the quantile function, and order statistics. Section 4 is devoted to parameter estimation using the maximum likelihood method. In Section 5, a Monte Carlo simulation study of the parameter estimates is performed, and applications of the KEGP model are provided using bladder cancer patients’ data and average wind speed data. Finally, Section 6 concludes the paper with some remarks on the results and their significance.
2. Khalil Extended Generalized Pareto (KEGP)
Salahuddin et al. [1] proposed a very flexible family of distributions, known as the Khalil new generalized family, whose CDF is defined as
Pickands [2] introduced the Generalized Pareto distribution, with cumulative distribution function and probability density function of the following form:
In the above density, the two parameters are the scale and shape parameters, respectively. When the shape parameter equals zero, the density reduces to that of the exponential distribution.
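For reference, the Generalized Pareto distribution is commonly written in the form below. The symbols $\sigma$ (scale) and $\xi$ (shape) are assumptions made for this note, since the original notation is not recoverable from the text, and the paper's equation numbering is not reproduced.

```latex
G(x;\sigma,\xi) = 1 - \left(1 + \frac{\xi x}{\sigma}\right)^{-1/\xi},
\qquad
g(x;\sigma,\xi) = \frac{1}{\sigma}\left(1 + \frac{\xi x}{\sigma}\right)^{-1/\xi - 1},
\qquad \sigma > 0,\ \xi \neq 0,
```

with support $x > 0$ when $\xi > 0$ and $0 < x < -\sigma/\xi$ when $\xi < 0$. Letting $\xi \to 0$ gives $G(x) = 1 - e^{-x/\sigma}$, the exponential CDF mentioned above.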
We now define the Khalil Extended Generalized Pareto distribution by substituting the Generalized Pareto CDF of equation (2) into the generator defined in equation (1). The resulting cumulative distribution function is
The probability density function, reliability function, and hazard function are
Figures 1–3 show the graphs of the CDF, PDF, and hazard rate function, respectively. From Figure 2, it can be seen that as the values of the two shape parameters increase (i.e., beyond 2), the pdf of the KEGP moves closer to a normal shape. Furthermore, the Generalized Pareto distribution in general has a decreasing hazard function for all values of x greater than 0, whereas the KEGP hazard function illustrated in Figure 3 is very flexible and takes various shapes (e.g., monotonically increasing and decreasing).
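The following is a minimal numerical sketch of the CDF, PDF, and hazard function, not the authors' code. It assumes the generator has the form F(x) = (1 − exp(−λ G(x)^α)) / (1 − exp(−λ)), which is consistent with the Exponentiated Exponential Poisson special case of Section 2.1 but is not confirmed by the recoverable text. The parameter names `alpha`, `lam`, `sigma`, and `xi` are hypothetical.

```python
import math

def gp_cdf(x, sigma, xi):
    """CDF G(x) of the Generalized Pareto distribution for x >= 0."""
    if xi == 0.0:
        return 1.0 - math.exp(-x / sigma)
    base = 1.0 + xi * x / sigma
    if base <= 0.0:  # beyond the upper endpoint, which exists when xi < 0
        return 1.0
    return 1.0 - base ** (-1.0 / xi)

def gp_pdf(x, sigma, xi):
    """Density g(x) of the Generalized Pareto distribution for x >= 0."""
    if xi == 0.0:
        return math.exp(-x / sigma) / sigma
    base = 1.0 + xi * x / sigma
    if base <= 0.0:
        return 0.0
    return base ** (-1.0 / xi - 1.0) / sigma

def kegp_cdf(x, alpha, lam, sigma, xi):
    """Assumed KEGP CDF: (1 - exp(-lam * G^alpha)) / (1 - exp(-lam))."""
    G = gp_cdf(x, sigma, xi)
    return (1.0 - math.exp(-lam * G ** alpha)) / (1.0 - math.exp(-lam))

def kegp_pdf(x, alpha, lam, sigma, xi):
    """Derivative of the assumed CDF with respect to x, valid for x > 0."""
    G = gp_cdf(x, sigma, xi)
    g = gp_pdf(x, sigma, xi)
    num = alpha * lam * g * G ** (alpha - 1.0) * math.exp(-lam * G ** alpha)
    return num / (1.0 - math.exp(-lam))

def kegp_hazard(x, alpha, lam, sigma, xi):
    """Hazard rate: density divided by the reliability function 1 - F(x)."""
    return kegp_pdf(x, alpha, lam, sigma, xi) / (1.0 - kegp_cdf(x, alpha, lam, sigma, xi))
```

Evaluating `kegp_hazard` on a grid of x values for different parameter choices is one way to explore the hazard shapes described above.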
2.1. Special Case
As a special case, a suitable substitution of parameter values in equations (4) and (5) yields the probability density function and cumulative distribution function of the Exponentiated Exponential Poisson (EEP) distribution [14] as
3. Statistical Properties of KEGPD
The statistical properties of the KEGP distribution are as follows.
3.1. Moments
Let a random variable X follow the KEGP distribution; then, by definition, the rth moment of X is
Using the exponential series and the binomial expansion, equation (9) simplifies to
We now derive the above rth moment for three cases, as follows:
Case I: the expression in equation (10) can be simplified using a suitable substitution.
Case II: the expression in equation (11) can be simplified using a suitable substitution, where the beta function of the second kind is used. Comparing these two cases, only the second argument of the beta function differs; the first argument is the same in both.
Case III: when the Generalized Pareto density reduces to the exponential density, the derivation is straightforward:
3.2. Moment-Generating Function (MGF)
If a random variable X has the KEGP distribution, then the MGF of X is obtained by using
The Taylor series yields the following simplified expression:
Using equations (11), (12), and (14) in (15), we get
3.3. Entropy Measures
Entropy measures are widely used for measuring the randomness of various systems. They are mainly used in areas such as physics, sparse kernel density estimation, and molecular imaging of tumors. A low entropy value indicates less uncertainty in the data. Thus, the Rényi entropy [15] and the q-entropy [16] are considered for the KEGP distribution to measure the amount of uncertainty in the data. The entropies are characterized as
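For reference, the two measures are commonly defined as below for a density $f$ on $(0, \infty)$. The symbols $\delta$ and $q$ are assumptions made for this note, and the exact form used by the authors (in particular for the q-entropy, for which more than one convention appears in the literature) may differ.

```latex
I_R(\delta) = \frac{1}{1-\delta}\,\log\!\left(\int_0^{\infty} f(x)^{\delta}\,dx\right),
\qquad \delta > 0,\ \delta \neq 1,
\qquad\qquad
H_q = \frac{1}{q-1}\left(1 - \int_0^{\infty} f(x)^{q}\,dx\right),
\qquad q > 0,\ q \neq 1.
```

Both measures depend on the density only through the integral of a power of $f$, which is why the derivations below rest on a single integral evaluated by series expansion.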
The Rényi entropy of the KEGP distribution is as follows:
Applying the exponential series and the binomial expansion in expression (18), it becomes
In the first case, on simplification, the Rényi entropy yields the following result: In the second case, on simplification, the Rényi entropy becomes
The q-entropy introduced by Havrda and Charvát [16] is defined as
Consider the integral appearing in this definition. In the first case, on simplification, it yields the following result:
Substituting the result of the above integral into the definition gives the q-entropy for the first case. In the second case, on simplification, the integral becomes
Now, substituting this result into the definition, the q-entropy for the second case is
3.4. Quantile Function
The quantile function of the KEGP, which also serves as a random number generator, is obtained by inverting the CDF and solving for x. This gives equation (28), where u is a uniform random variable on (0, 1). Replacing u by q, the median, first quartile, and third quartile are obtained from (28) by substituting q = 0.5, 0.25, and 0.75, respectively.
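The inversion can be sketched as follows. This is a sketch under an assumption, not the authors' expression (28): the CDF is taken to be F(x) = (1 − exp(−λ G(x)^α)) / (1 − exp(−λ)) with G the Generalized Pareto CDF, and the names `alpha`, `lam`, `sigma`, and `xi` are hypothetical.

```python
import math
import random

def kegp_cdf(x, alpha, lam, sigma, xi):
    """Assumed KEGP CDF built on the Generalized Pareto CDF G(x)."""
    if xi == 0.0:
        G = 1.0 - math.exp(-x / sigma)
    else:
        G = 1.0 - (1.0 + xi * x / sigma) ** (-1.0 / xi)
    return (1.0 - math.exp(-lam * G ** alpha)) / (1.0 - math.exp(-lam))

def kegp_quantile(u, alpha, lam, sigma, xi):
    """Solve F(x) = u for x, with 0 <= u < 1."""
    # Step 1: undo the generator to recover G(x)^alpha, then G(x).
    w = -math.log(1.0 - u * (1.0 - math.exp(-lam))) / lam
    G = w ** (1.0 / alpha)
    # Step 2: apply the Generalized Pareto quantile function to G.
    if xi == 0.0:
        return -sigma * math.log(1.0 - G)
    return sigma / xi * ((1.0 - G) ** (-xi) - 1.0)

def kegp_sample(n, alpha, lam, sigma, xi, seed=None):
    """Inverse-transform sampling: push uniform draws through the quantile."""
    rng = random.Random(seed)
    return [kegp_quantile(rng.random(), alpha, lam, sigma, xi) for _ in range(n)]
```

For instance, `kegp_quantile(0.5, 2.0, 3.0, 2.0, 1.0)` would return the median under these illustrative parameter values, and `kegp_sample` provides the random numbers needed for a simulation study.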
3.5. Order Statistics
Order statistics play a vital role in reliability and survival analysis. Let the order statistics of a random sample of size k from the KEGP distribution be arranged in increasing order. Then, the pdf of the ith order statistic is given by
Using equations (4) and (5) in (29), we have
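For reference, the density of an order statistic has the standard general form below, into which the KEGP density $f$ and CDF $F$ are substituted. The index $i$ and the notation $f_{i:k}$ are assumptions made for this note; $k$ is the sample size used in the text.

```latex
f_{i:k}(x) = \frac{k!}{(i-1)!\,(k-i)!}\; f(x)\,\bigl[F(x)\bigr]^{\,i-1}\,\bigl[1 - F(x)\bigr]^{\,k-i},
\qquad i = 1, \ldots, k.
```

Setting $i = 1$ and $i = k$ gives the densities of the sample minimum and maximum, respectively.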
4. Parameter Estimation of KEGP Distribution
Parameter estimation for the KEGP distribution is presented in this section using the method of maximum likelihood, since it possesses many desirable properties, for instance, consistency, invariance, and asymptotic normality. The method is based on maximizing the likelihood function.
4.1. Maximum Likelihood Estimation
Consider a random sample of size n from the KEGP model; then, its likelihood function is defined as
The log likelihood function of the KEGP model is obtained by taking the logarithm on both sides of (20) as
To obtain the MLEs of the unknown parameters of the KEGP model, the log-likelihood derived above is differentiated with respect to each of the four parameters in turn, and each partial derivative is set equal to zero. The normal equations are as follows:
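In practice, the normal equations are solved numerically, which is equivalent to maximizing the log-likelihood directly. The sketch below evaluates that log-likelihood under the assumption that the KEGP density is f(x) = αλ g(x) G(x)^(α−1) exp(−λ G(x)^α) / (1 − exp(−λ)), with g and G the Generalized Pareto density and CDF. This form and the parameter names are assumptions, not the authors' stated expressions.

```python
import math

def gp_log_pdf_and_cdf(x, sigma, xi):
    """Return (log g(x), G(x)) for the Generalized Pareto distribution, x > 0."""
    if xi == 0.0:
        return -math.log(sigma) - x / sigma, 1.0 - math.exp(-x / sigma)
    base = 1.0 + xi * x / sigma
    if base <= 0.0:  # observation outside the support
        return -math.inf, 1.0
    log_g = -math.log(sigma) - (1.0 / xi + 1.0) * math.log(base)
    return log_g, 1.0 - base ** (-1.0 / xi)

def kegp_loglik(data, alpha, lam, sigma, xi):
    """Log-likelihood of an i.i.d. sample under the assumed KEGP density."""
    n = len(data)
    # Terms that do not depend on the individual observations.
    total = n * (math.log(alpha) + math.log(lam) - math.log(1.0 - math.exp(-lam)))
    for x in data:
        log_g, G = gp_log_pdf_and_cdf(x, sigma, xi)
        if log_g == -math.inf or G <= 0.0:
            return -math.inf  # parameter values incompatible with the data
        total += log_g + (alpha - 1.0) * math.log(G) - lam * G ** alpha
    return total
```

The negative of `kegp_loglik` can be passed to a general-purpose numerical optimizer to obtain the estimates; the choice of optimizer and starting values is left open here.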
4.2. Asymptotic Confidence Intervals
Since the MLEs cannot be obtained in closed form, iterative procedures such as conjugate-gradient-type algorithms may be used to obtain a numerical solution. Asymptotic confidence intervals for the unknown parameters can then be derived under the assumption that the MLEs are approximately normally distributed, with mean equal to the true parameter values and variance-covariance matrix given by the inverse of the observed Fisher information matrix. The elements of the Fisher information matrix are listed in the Appendix.
Thus, the asymptotic confidence intervals for the parameters of the KEGP model are constructed from each estimate, its estimated variance, and the corresponding upper percentile of the standard normal distribution.
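A minimal sketch of such an interval for a single parameter is given below. It assumes the estimate and its variance (the corresponding diagonal element of the inverse observed information matrix) are already available; the function name `wald_interval` and the `level` argument are hypothetical.

```python
import math
from statistics import NormalDist

def wald_interval(estimate, variance, level=0.95):
    """Normal-approximation interval: estimate -/+ z * sqrt(variance)."""
    # Upper percentile of the standard normal for the chosen confidence level.
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width
```

For a constrained parameter such as a positive scale, the lower limit of such an interval can fall outside the parameter space in small samples, so the result should be read with that in mind.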
5. Applications
Applications of the KEGP distribution are presented in the following subsections using simulated data and two real data sets.
5.1. Simulation Study
A simulation study is performed to obtain the average values of MLEs (maximum likelihood estimators), MSEs (mean square error), and bias. The following steps are used to perform the simulation.
Step 1. First, fix the true values of the four KEGP parameters, that is, Case I = (2, 3, 2, 1), Case II = (3, 4, 2, 1), Case III = (1, 2, 4, 3), and Case IV = (1, 2, 2, 5).
Step 2. The process is repeated 10,000 times and the MSE and bias are computed for the estimates for n = 50, 100, 500, and 1000.
Step 3. Generate random numbers from the KEGP using the quantile function of Section 3.4, where u is a uniform random variable on (0, 1).
The bias and MSE of each estimator are then calculated over the replications. Results of the simulations were obtained for different combinations of the parameters and are displayed in Table 1. The table shows that, as the sample size increases, the estimates approach the true values of the parameters. Furthermore, the MSEs and biases decrease with increasing sample size for all combinations of parameters. Therefore, the maximum likelihood method performs well in estimating the parameters of the KEGP distribution.
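The bias and MSE computations can be sketched as follows, using the usual definitions: the average of the estimation errors and the average of their squares over the replications. The function name and the example values in the usage note are hypothetical and are not taken from Table 1.

```python
def bias_and_mse(estimates, true_value):
    """Return (bias, MSE) of replicated estimates of one parameter."""
    n = len(estimates)
    errors = [est - true_value for est in estimates]
    bias = sum(errors) / n                      # mean estimation error
    mse = sum(err * err for err in errors) / n  # mean squared estimation error
    return bias, mse
```

For example, `bias_and_mse([1.9, 2.1, 2.3], 2.0)` gives a bias of about 0.1 and an MSE of about 0.0367. In the simulation, `estimates` would hold the 10,000 replicated MLEs of one parameter for a given sample size.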
5.2. Real-Life Applications
The practicality of the KEGP is demonstrated using two real-life data sets. The first data set is taken from [17], and the second from [18]. The KEGP is compared with several related Pareto-type models, namely, the Kumaraswamy Pareto [19], Alpha Power Pareto [20], Generalized Pareto [2], and Exponentiated Generalized Pareto [21] distributions, with the following probability density functions:
(i) Kumaraswamy Pareto (KP) distribution:
(ii) Alpha Power Pareto (APP):
(iii) Generalized Pareto (GP):
(iv) Exponentiated Generalized Pareto (EGP):
The performance of the KEGP is assessed by comparing it with the Pareto-type distributions listed above using goodness-of-fit criteria. The R package AdequacyModel is used, which reports Akaike’s Information Criterion (AIC), the Bayesian Information Criterion (BIC), the Consistent Akaike’s Information Criterion (CAIC), the Hannan–Quinn Information Criterion (HQIC), and the Kolmogorov–Smirnov (K-S) test statistic with its p value. Generally, a model is considered the best fit to the data if it has the smallest values of these criteria together with the largest p value. Tables 2 and 3 display the MLEs and the K-S results along with their p values. For the Khalil Extended Generalized Pareto distribution, the K-S statistic is smaller and the p value larger than for the other models, i.e., K-S value = 0.0342 and p value = 0.9982 for data set I, and K-S value = 0.0835 and p value = 0.4881 for data set II, indicating a better fit. Furthermore, Tables 4 and 5 display the AIC, BIC, CAIC, and HQIC values, for which the Khalil Extended Generalized Pareto distribution has lower values than the other models. The good performance of the KEGP distribution is also visible in Figures 4 and 5, which display the empirical and theoretical pdfs and CDFs. Figures 6 and 7 show the Q-Q and P-P plots for the two data sets, respectively. The better performance of the suggested model indicates the need to introduce new distributions for handling some real data sets.
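The information criteria can be sketched as below from a maximized log-likelihood, the number of estimated parameters k, and the sample size n. The AIC, BIC, and HQIC formulas are the standard ones; the CAIC formula shown is the small-sample corrected form and is an assumption about what the package reports. The function name is hypothetical.

```python
import math

def information_criteria(loglik, k, n):
    """Information criteria from a maximized log-likelihood (smaller is better)."""
    deviance = -2.0 * loglik
    return {
        "AIC": deviance + 2.0 * k,
        "BIC": deviance + k * math.log(n),
        # Assumed definition of CAIC (small-sample corrected AIC).
        "CAIC": deviance + 2.0 * k * n / (n - k - 1),
        "HQIC": deviance + 2.0 * k * math.log(math.log(n)),
    }
```

Because every criterion adds a penalty that grows with k, a four-parameter model such as the KEGP must improve the log-likelihood enough to offset the extra parameters relative to two- or three-parameter competitors.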
5.2.1. Data Set I
The first data set is taken from [17] and consists of the remission times (in months) of bladder cancer patients. The data set is positively skewed and unimodal, with a skewness of 3.286, mean remission time of 9.366 months, and standard deviation of 10.508 months. These data were recently studied in [22, 23] and are as follows: 0.08, 2.09, 3.48, 4.87, 6.94, 8.66, 13.11, 23.63, 0.2, 2.23, 3.52, 4.98, 6.97, 9.02, 13.29, 0.4, 2.26, 3.57, 5.06, 7.09, 9.22, 13.80, 25.74, 0.5, 2.46, 3.64, 5.09, 7.26, 9.47, 14.24, 25.82, 0.51, 2.54, 3.7, 5.17, 7.28, 9.74, 14.76, 26.31, 0.81, 2.26, 3.82, 5.32, 7.32, 10.06, 14.77, 32.15, 2.64, 3.88, 5.32, 7.39, 10.34, 14.83, 34.26, 0.09, 2.69, 4.18, 5.34, 7.59, 10.66, 15.96, 36.66, 1.05, 2.69, 4.23, 5.41, 7.62, 10.75, 16.62, 43.01, 1.19, 2.75, 4.26, 5.41, 7.63, 17.12, 46.12, 1.26, 2.83, 4.33, 5.49, 7.66, 11.25, 17.14, 79.05, 1.35, 2.87, 5.62, 7.87, 11.64, 17.36, 1.4, 3.02, 4.43, 5.71, 7.93, 11.79, 18.10, 1.46, 4.4, 5.85, 8.26, 11.98, 19.13, 1.76, 3.25, 4.5, 6.25, 8.37, 12.02, 2.2, 13.31, 4.51, 6.54, 8.53, 12.03, 20.28, 2.02, 3.36, 6.76, 12.07, 21.73, 2.07, 3.36, 6.93, 8.65, 12.63, 22.69.
5.2.2. Data Set II
Statistical methods are very helpful in describing the random behaviour of wind speed. Wind energy is an environmentally friendly and clean alternative to energy obtained from fossil fuels. It is a form of solar energy, driven by the unequal heating of the Earth’s surface. Wind speed is modelled with probability distributions because it is an important parameter of wind power. Here, a data set of daily average wind speeds at Cairo city is considered from [18] as follows: 3.5, 3.1, 3.8, 3.2, 3.2, 4.5, 5.6, 5.7, 4.9, 5.7, 4.3, 9.4, 9.3, 4.4, 2.7, 3.8, 4.9, 5.4, 4.9, 4.2, 5.4, 3.3, 6.9, 9.8, 10, 8, 5.6, 8.2, 9.4, 11.3, 9.4, 5.5, 4.9, 8.6, 5, 4.7, 3.8, 4.3, 6.7, 7.6, 13.3, 8.2, 5.8, 5.1, 7.8, 10.3, 9.3, 4.3, 7.4, 13.8, 10.7, 12, 8.9, 10.6, 6.8, 6.6, 11.1, 12.5, 14.4, 9.9, 4.8, 4.2, 5.5, 7.3, 12.4, 14.7, 6.4, 8.7, 5.2, 6.8, 5.6, 7.5, 7.7, 7.1, 6.1, 7.6, 5.8, 6.3, 12.2, 6, 3.5, 9.5, 8.8, 5.2, 5, 9.8, 8, 7.9, 6.8, 5.7, 7.3, 6.8, 4.7, 5.3, 9.6, 10.1, 7.3, 6.7, 5.4, 5.4.
6. Conclusion
A new generalization of the Generalized Pareto distribution, named the Khalil Extended Generalized Pareto (KEGP) distribution, is proposed in this paper. The proposed generalization proved to be more flexible than the Generalized Pareto distribution and suitable for monotone as well as nonmonotone lifetime data. In addition, the hazard function of the KEGP model is flexible enough to accommodate monotonically increasing and decreasing shapes, among others. Various statistical properties of the proposed model are derived. The parameters are estimated using the method of maximum likelihood, and the consistency of the estimates is supported by a Monte Carlo simulation study. Moreover, the practicality of the distribution is illustrated with two real-life data sets. Among the fitted models, the KEGP provided a better fit than the competing Pareto-type models.
Appendix
Elements of the observed variance-covariance Fisher information matrix:
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare no conflicts of interest regarding the study in this paper.
Authors’ Contributions
All authors have contributed equally to this paper.
References
[1] N. Salahuddin, A. Khalil, W. K. Mashwani, H. Shah, P. Jomsri, and T. Panityakul, “A novel generalized family of distributions for engineering and life sciences data applications,” Mathematical Problems in Engineering, vol. 2021, Article ID 9949999, 16 pages, 2021.
[2] J. Pickands III, “Statistical inference using extreme order statistics,” Annals of Statistics, vol. 3, no. 1, pp. 119–131, 1975.
[3] R. C. Gupta, P. L. Gupta, and R. D. Gupta, “Modeling failure time data by Lehman alternatives,” Communications in Statistics - Theory and Methods, vol. 27, no. 4, pp. 887–904, 1998.
[4] R. V. Hogg, J. W. McKean, and A. T. Craig, Introduction to Mathematical Statistics, Pearson Prentice-Hall, Hoboken, NJ, USA, 6th edition, 2005.
[5] J. R. M. Hosking and J. R. Wallis, “Parameter and quantile estimation for the generalized Pareto distribution,” Technometrics, vol. 29, no. 3, pp. 339–349, 1987.
[6] R. L. Smith, “Threshold methods for sample extremes,” in Statistical Extremes and Applications, J. Tiago de Oliveira, Ed., Reidel, Dordrecht, Netherlands, 1984.
[7] R. L. Smith, “Extreme value analysis of environmental time series: an application to trend detection in ground-level ozone,” Statistical Science, vol. 4, pp. 367–393, 1989.
[8] R. L. Smith, “Extreme value theory,” in Handbook of Applicable Mathematics, W. Ledermann, Ed., Wiley, Hoboken, NJ, USA, 1990.
[9] A. C. Davison, “Modelling excesses over high thresholds, with an application,” in Statistical Extremes and Applications, J. Tiago de Oliveira, Ed., D. Reidel, Dordrecht, Netherlands, 1984.
[10] A. C. Davison and R. L. Smith, “Models for exceedances over high thresholds,” Journal of the Royal Statistical Society: Series B, vol. 52, no. 3, pp. 393–425, 1990.
[11] E. Castillo, A. S. Hadi, N. Balakrishnan, and J. M. Sarabia, Extreme Value and Related Models with Applications in Engineering and Science, Wiley, Hoboken, NJ, USA, 2005.
[12] S. Kotz and S. Nadarajah, Extreme Value Distributions: Theory and Applications, Imperial College Press, London, UK, 2000.
[13] V. Choulakian and M. A. Stephens, “Goodness-of-fit tests for the generalized Pareto distribution,” Technometrics, vol. 43, no. 4, pp. 478–484, 2001.
[14] M. M. Ristić and S. Nadarajah, “A new lifetime distribution,” Journal of Statistical Computation and Simulation, vol. 84, no. 1, pp. 135–150, 2014.
[15] A. Rényi, On Measures of Entropy and Information, Hungarian Academy of Sciences, Budapest, Hungary, 1961.
[16] J. Havrda and F. Charvát, “Quantification method in classification processes: concept of structural α-entropy,” Kybernetika, vol. 3, no. 1, pp. 30–35, 1967.
[17] E. T. Lee and J. Wang, Statistical Methods for Survival Data Analysis, John Wiley & Sons, Hoboken, NJ, USA, 2003.
[18] M. G. Ghazal and H. M. Hasaballah, “Exponentiated Rayleigh distribution: a Bayes study using MCMC approach based on unified hybrid censored data,” Journal of Advances in Mathematics, vol. 12, no. 12, pp. 6863–6880, 2017.
[19] M. B. Pereira, R. B. Silva, L. M. Zea, and G. M. Cordeiro, “The Kumaraswamy Pareto distribution,” https://arxiv.org/abs/1204.1389.
[20] S. Ihtisham, A. Khalil, S. Manzoor, S. A. Khan, and A. Ali, “Alpha-Power Pareto distribution: its properties and applications,” PLoS One, vol. 14, no. 6, Article ID e0218027, 2019.
[21] S. Lee and J. H. Kim, “Exponentiated generalized Pareto distribution: properties and applications towards extreme value theory,” Communications in Statistics - Theory and Methods, vol. 48, no. 8, 2018.
[22] M. A. M. Safari, N. Masseran, and M. H. Abdul Majid, “Robust reliability estimation for Lindley distribution: a probability integral transform statistical approach,” Mathematics, vol. 8, no. 9, p. 1634, 2020.
[23] N. Alotaibi, “A new lifetime distribution: properties, copulas, applications, and different classical estimation methods,” Complexity, vol. 2021, Article ID 6657172, 18 pages, 2021.