A Weighted Estimation for Risk Model
We propose a weighted estimation method for risk models. Two examples of natural disasters are studied: hurricane losses in the USA and forest fire losses in Canada. Risk data are often fitted by a heavy-tailed distribution, for example, a Pareto distribution, which has many applications in economics, actuarial science, survival analysis, networks, and other stochastic models. Inference for the Pareto distribution is difficult because its moments are infinite in the heavy-tailed case. First, this paper applies the truncated Pareto distribution to overcome this difficulty. Second, we propose a weighted semiparametric method to estimate the truncated Pareto distribution. The idea of the new method is to place less weight on the extreme data values. This paper gives an exact efficiency function and the L1-optimal and L2-optimal weights of the new estimator. Monte Carlo simulation results confirm the theoretical conclusions. The two examples are analyzed by using the proposed method. This paper shows that the new estimation method is more efficient, in terms of mean square error, than several existing methods and fits risk data well.
1.1. Two Motivating Examples
In recent years, many extreme events have occurred in financial markets, natural disasters, disease control, and industrial quality control. Natural disasters, for example, earthquakes, hurricanes, forest fires, volcanoes, and floods, affect human life. It is important to predict and prepare for the next disaster occurrence and to estimate losses to inhabitants, insurance companies, and governments. In this section, we study two examples.
1.1.1. A Hurricane Loss Example
Strong winds, heavy rainfall, and storm surges caused by hurricanes cause death and destroy property. They also generate great losses for insurance companies. Figure 1 shows the 49 costliest Atlantic hurricane losses for the United States during 1900–2005. The losses are measured in US dollars and have been adjusted by using the inflation rates from 1900 to 2005.
From the data, we note that the costliest hurricane is the 1926 Great Miami Hurricane, with damage of 157 billion, which is 1.58 times that of the second costliest hurricane, the 1900 Galveston Hurricane. After 1926, on August 28, 2005, Hurricane Katrina caused damage of billion. This is approximately 1.19 times larger than Hurricane Andrew in August 1992, which caused damage of billion. On the other hand, we note that 80% of the hurricane losses are less than 21 billion. Many smaller hurricanes are not listed. These considerations raise a number of questions.
(a) How do we predict the loss of the next hurricane? Will it again be 1.58 times larger than the worst one so far?
(b) Do we keep the traditional approach, which is to prepare for the worst event? Should we focus on the extreme events?
(c) What is the value at risk with 5% probability?
(d) How do we set an upper limit of the losses?
(e) How do we estimate the distribution of the largest losses?
The objective of this study is to find the best model to fit the empirical data and to answer the above questions as accurately as possible. Since the data has large losses, we consider that the data should fit a heavy-tailed distribution. Studies on Pareto-type heavy-tailed distributions are rapidly increasing with applications to extreme values, insurance, survival analysis, networks, and risk analysis [2, 3]. Therefore, we choose a Pareto model. In this paper, we will propose a weighted method to study the hurricane loss data. The statistical analysis results are given in Section 6.
1.1.2. A Forest Fire Loss Example
Large forest fires have a significant impact on natural, social, and economic systems. However, most fires are extinguished in the initial stages and thus remain small. These smaller fires have a large probability of occurrence, but the resulting damage is almost negligible on an individual basis. Large forest fires, in contrast, have a low probability of occurrence, but the damage and the losses are huge. Modelling large fire losses is therefore critical in the analysis of the risk of the next large forest fire. Figure 2 shows 30 forest fire losses during 1977 to 2006 in Alberta, Canada. We are concerned that the losses have been increasing over the last 30 years. The data listed in Table 1 are fire occurrence records from the Forestry Service (Council of Canadian Fire Marshals and Fire Commissioners, 2008, Canadian Fire Statistics, http://www.ccfmfc.ca/stats.html). The database includes forest fire records for all the insurance losses. The data contain a relatively large number of large losses, which convinces us to use a Pareto model. The statistical analysis results are given in Section 6.
1.2. The Truncated Pareto Distribution
Many risk models with heavy tails have been developed using the class of Pareto distributions. Examples include city population sizes, the occurrence of natural resources, the sizes of firms, and personal income. The Pareto distribution belongs to the Fréchet domain of attraction, that is, the domain of the type II extreme value distribution. It is therefore important to explore estimation methods for the Pareto distribution. There are theoretical difficulties in studying the Pareto distribution since it has infinite moments in heavy-tailed situations. We propose using the truncated Pareto distribution in these models to overcome these difficulties. Usually, we choose the upper limit as the largest value in the data set. In recent years, the truncated Pareto distribution has become an alternative model for the original Pareto distribution.
There are several kinds of Pareto distributions. We consider a type I Pareto distribution in this paper.
Definition 1. The probability density function (p.d.f.) and the cumulative distribution function (c.d.f.) of a random variable having the Pareto distribution are given by
where is the shape parameter.
When , which is a heavy-tailed case, the mean and variance of are infinite, and the distribution is heavier on the right tail as decreases.
The truncated Pareto distribution was originally used to describe the distribution of oil fields by size. It has a lower limit , an upper limit , and a shape parameter . In fact, it has been shown that the truncated Pareto distribution fits better than the nontruncated distribution for positively skewed populations .
Definition 2. The p.d.f. and c.d.f. of a random variable having the truncated Pareto distribution are given by The quantile function of the truncated Pareto distribution is The mean and variance of are
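The displayed formulas of Definition 2 are not reproduced in this text. As a minimal sketch, assuming the standard parameterization of the truncated Pareto distribution, the functions below implement the p.d.f., c.d.f., quantile function, and mean. The names `alpha` (shape), `gamma` (lower limit), and `nu` (upper limit) are notation chosen here for illustration and may differ from the symbols used in (3)–(6).

```python
import math


def tp_pdf(x, alpha, gamma, nu):
    """Density alpha*gamma^alpha*x^(-alpha-1) / (1 - (gamma/nu)^alpha) on [gamma, nu]."""
    if x < gamma or x > nu:
        return 0.0
    return alpha * gamma ** alpha * x ** (-alpha - 1.0) / (1.0 - (gamma / nu) ** alpha)


def tp_cdf(x, alpha, gamma, nu):
    """C.d.f.: the Pareto c.d.f. rescaled so that it reaches 1 at the upper limit nu."""
    if x <= gamma:
        return 0.0
    if x >= nu:
        return 1.0
    return (1.0 - (gamma / x) ** alpha) / (1.0 - (gamma / nu) ** alpha)


def tp_quantile(u, alpha, gamma, nu):
    """Inverse of tp_cdf for 0 <= u <= 1."""
    c = 1.0 - (gamma / nu) ** alpha
    return gamma * (1.0 - u * c) ** (-1.0 / alpha)


def tp_mean(alpha, gamma, nu):
    """Mean of the truncated Pareto; finite for every alpha > 0 because nu is finite."""
    rho = gamma / nu
    c = 1.0 - rho ** alpha
    if abs(alpha - 1.0) < 1e-9:
        # Limiting form at alpha = 1.
        return gamma * math.log(nu / gamma) / c
    return gamma * alpha * (rho ** (alpha - 1.0) - 1.0) / ((1.0 - alpha) * c)
```

Because the upper limit is finite, `tp_mean` returns a finite value even when `alpha` is below 1, which is the property that motivates the truncated model.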
1.3. The Weighted Empirical Distribution Function
Recently, some parametric estimation methods for the truncated Pareto distribution have been developed. But there is an efficiency problem in the estimates of the distribution tails. The distribution tail values and their probabilities are important in many fields, for example, value at risk in risk analysis, survival probability in survival analysis, tolerance limits in quality control, prediction intervals, and confidence intervals. Classical statistical inference theory depends on the classical empirical distribution function (EDF), $F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I(X_i \le x)$, which is a minimum variance unbiased estimator for the c.d.f. $F(x)$ based on a random sample $X_1, X_2, \ldots, X_n$. Note that $F_n(x)$ uses the equal weight $1/n$ for each sample point. Should we use equal weights on extreme data values as well? Recently, authors have applied various weights to data points by using different philosophies; for example, the jackknife method gives zero weight to eliminated data, and the weighted bootstrap and weighted empirical distribution functions or processes have been discussed [9, 10]. However, it is difficult to determine what weights should be used for the data points. Huang and Brill introduced a weighted level crossing estimation method from a geometric point of view to visualize random samples in the L1-optimal sense; the method improves the efficiency of the estimation of tails.
This paper proposes a semiparametric approach to estimate in (4) using L1-optimal and L2-optimal weights. Both theoretical and simulation efficiencies are consistently improved when compared with existing methods. This method is based on a symmetric weighted empirical distribution function (SWEDF) of Huang, namely, where the are symmetric general weights and are the order statistics of the random sample. Note that
The parameter in (9) is the weight for the middle data; in (9) is the weight for the extreme data. It is interesting to explore how the value of affects the estimation of a heavy-tailed distribution. We may use flexibly.
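The display (9) is not reproduced in this text. The following is a minimal sketch of a symmetric weighted empirical distribution function, assuming the form suggested by the description above: a common weight `b` on the `n - 2` middle order statistics and the remaining mass split equally between the smallest and largest observations. The name `b` and the admissible range `0 < b < 1/(n - 2)` are assumptions made here for illustration.

```python
def swedf_weights(n, b):
    """Symmetric weights: b for the n-2 middle order statistics,
    (1 - (n-2)*b)/2 for each of the two extremes (assumed form)."""
    if n < 3:
        raise ValueError("need at least three observations")
    if not 0.0 < b < 1.0 / (n - 2):
        raise ValueError("b must lie in (0, 1/(n-2)) so that all weights are positive")
    edge = (1.0 - (n - 2) * b) / 2.0
    return [edge] + [b] * (n - 2) + [edge]


def swedf(x, sample, b):
    """Weighted empirical c.d.f. at x: sum of the weights of order statistics <= x."""
    xs = sorted(sample)
    weights = swedf_weights(len(xs), b)
    return sum(w for w, xi in zip(weights, xs) if xi <= x)
```

With `b = 1/n` every weight equals `1/n` and the function reduces to the classical EDF; choosing `b > 1/n` shifts weight from the two extreme observations to the middle of the sample.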
In Section 2, we propose a weighted method for estimating the shape parameter and the mean of the truncated Pareto distribution. In Section 3, an exact efficiency function of the new mean estimator relative to the classical estimator is derived. Section 4 explores the L1-optimal weights and L2-optimal weights for estimating the mean. Section 5 gives results of Monte Carlo simulations. The simulation efficiencies are consistent with the exact efficiencies in Section 3. In Section 6, we analyze the hurricane loss data and forest fire loss data given in Section 1 by using the proposed method. The statistical inference in these examples shows that the distribution curve estimated by the proposed weighted method fits the tails of the data better than several existing methods. Suggestions for further studies are also discussed.
2. Estimation Methods
In this section, we discuss the existing and proposed methodologies for the truncated Pareto distribution. Consider a random sample from the distribution in (3), and let denote its order statistics.
2.1. Maximum Likelihood Estimators (Hill, Beg, and Aban)
There are several different maximum likelihood estimators (MLE) for estimating the shape parameter .
A popular estimator is the Hill  MLE, which uses the largest order statistics, , to estimate the original Pareto shape parameter in (1). When applying it to the truncated Pareto distribution in (3), it is defined as
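Equation (12) is not reproduced in this text. As an illustration, the sketch below computes the Hill estimator in its usual textbook form, assuming it is based on the `r` largest order statistics with the next-largest observation as the threshold. The argument name `r` is chosen here and the exact indexing used in (12) may differ.

```python
import math


def hill_estimator(sample, r):
    """Hill estimate of the Pareto shape parameter:
    reciprocal of the mean log-excess of the r largest observations
    over the threshold X_(n-r)."""
    xs = sorted(sample)
    n = len(xs)
    if not 1 <= r < n:
        raise ValueError("r must satisfy 1 <= r < n")
    threshold = xs[n - r - 1]          # X_(n-r) in 1-based order-statistic notation
    mean_log_excess = sum(math.log(x / threshold) for x in xs[n - r:]) / r
    return 1.0 / mean_log_excess
```

For example, `hill_estimator(losses, r=len(losses) - 1)` uses every observation above the sample minimum, where `losses` is a hypothetical list of positive loss values.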
Beg  developed the MLE method for the truncated Pareto distribution when is known. The Beg MLE for can be obtained by solving the following equation: where and is the sample geometric mean.
Aban’s MLE  when are known is obtained by solving the equation where and is the sample size.
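Equation (14) is not reproduced in this text. The sketch below solves the maximum likelihood score equation for the shape parameter of a truncated Pareto distribution with known limits, derived directly from the log-likelihood of the standard truncated Pareto density; it may differ in form from (14). The names `alpha`, `gamma`, and `nu` are notation assumed here, and plain bisection is used only to keep the example dependency-free.

```python
import math


def tp_score(alpha, sample, gamma, nu):
    """Derivative of the truncated Pareto log-likelihood with respect to alpha:
    n/alpha - n*L/(exp(alpha*L) - 1) - sum(log(x/gamma)), with L = log(nu/gamma)."""
    n = len(sample)
    log_ratio = math.log(nu / gamma)
    t = alpha * log_ratio
    trunc_term = log_ratio / math.expm1(t) if t < 700.0 else 0.0
    return n / alpha - n * trunc_term - sum(math.log(x / gamma) for x in sample)


def tp_mle_alpha(sample, gamma, nu, lo=1e-6, hi=1.0, tol=1e-10):
    """Root of the score equation by bisection; the score is decreasing in alpha."""
    if tp_score(lo, sample, gamma, nu) <= 0.0:
        # No positive root: the data are too concentrated near the upper limit.
        raise ValueError("score equation has no positive solution")
    while tp_score(hi, sample, gamma, nu) > 0.0:
        hi *= 2.0
        if hi > 1e6:
            raise ValueError("could not bracket the root")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if tp_score(mid, sample, gamma, nu) > 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

The explicit `ValueError` reflects the fact that the score equation need not have a positive root for every sample, which is one way an MLE of this kind can behave unstably.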
Note that we may use estimators and in (14) when are unknown. A similar situation is in the following equations (16) and (18). There are other estimators of and in the literature, for example, Cooke .
2.2. Moment Estimator
To estimate the population truncated Pareto mean, the sample mean estimator is A Moment estimator for estimating is the solution of the equation
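The moment equation can be sketched as follows, assuming the standard truncated Pareto mean and known limits: the estimate is the value of the shape parameter at which the model mean equals the sample mean. The names `alpha`, `gamma`, and `nu` and the bisection solver are choices made here for illustration, not the paper's implementation.

```python
import math


def tp_mean(alpha, gamma, nu):
    """Mean of the truncated Pareto distribution on [gamma, nu]."""
    rho = gamma / nu
    c = 1.0 - rho ** alpha
    if abs(alpha - 1.0) < 1e-9:
        return gamma * math.log(nu / gamma) / c
    return gamma * alpha * (rho ** (alpha - 1.0) - 1.0) / ((1.0 - alpha) * c)


def alpha_from_mean(target, gamma, nu, lo=1e-8, hi=1.0, tol=1e-10):
    """Solve tp_mean(alpha) = target; tp_mean is decreasing in alpha."""
    if not gamma < target < tp_mean(lo, gamma, nu):
        raise ValueError("target mean is outside the attainable range")
    while tp_mean(hi, gamma, nu) > target:
        hi *= 2.0
        if hi > 1e6:
            raise ValueError("could not bracket the root")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if tp_mean(mid, gamma, nu) > target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def moment_alpha(sample, gamma, nu):
    """Moment estimate: match the model mean to the sample mean."""
    return alpha_from_mean(sum(sample) / len(sample), gamma, nu)
```

Since the truncated Pareto mean is finite for every positive shape parameter, this estimator remains usable when the shape parameter is below 1.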
2.3. A Proposed Weighted Estimator
Now, to estimate the population truncated Pareto mean, we define a weighted mean based on the weighted empirical distribution function in (9) as where is the sample size and is the weight as defined in (9). Then, for estimating , we define to be the solution of the equation
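A minimal sketch of the weighted estimator, under the same assumptions as the text suggests: the weighted mean puts weight `b` on each middle order statistic and `(1 - (n - 2) b)/2` on each extreme, and the shape estimate solves the mean equation with the weighted mean in place of the sample mean. The symbols `b`, `alpha`, `gamma`, and `nu` and the bisection solver are assumptions made here, since (17) and (18) are not reproduced in this text.

```python
import math


def tp_mean(alpha, gamma, nu):
    """Mean of the truncated Pareto distribution on [gamma, nu]."""
    rho = gamma / nu
    c = 1.0 - rho ** alpha
    if abs(alpha - 1.0) < 1e-9:
        return gamma * math.log(nu / gamma) / c
    return gamma * alpha * (rho ** (alpha - 1.0) - 1.0) / ((1.0 - alpha) * c)


def alpha_from_mean(target, gamma, nu, lo=1e-8, hi=1.0, tol=1e-10):
    """Solve tp_mean(alpha) = target by bisection (tp_mean decreases in alpha)."""
    if not gamma < target < tp_mean(lo, gamma, nu):
        raise ValueError("target mean is outside the attainable range")
    while tp_mean(hi, gamma, nu) > target:
        hi *= 2.0
        if hi > 1e6:
            raise ValueError("could not bracket the root")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if tp_mean(mid, gamma, nu) > target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def weighted_mean(sample, b):
    """Weighted mean of the order statistics: b on the middle n-2 values,
    (1 - (n-2)*b)/2 on the smallest and on the largest value (assumed form)."""
    xs = sorted(sample)
    n = len(xs)
    if n < 3 or not 0.0 < b < 1.0 / (n - 2):
        raise ValueError("need n >= 3 and 0 < b < 1/(n-2)")
    edge = (1.0 - (n - 2) * b) / 2.0
    return edge * (xs[0] + xs[-1]) + b * sum(xs[1:-1])


def weighted_alpha(sample, b, gamma, nu):
    """Weighted estimate: match the model mean to the weighted mean."""
    return alpha_from_mean(weighted_mean(sample, b), gamma, nu)
```

When `b` exceeds `1/n`, the largest observation receives less weight than in the sample mean, so for right-skewed loss data the weighted mean is typically smaller and the resulting shape estimate larger than the moment estimate.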
3. An Exact Efficiency Function
Theorem 3. The mean and mean square error (MSE), of in (17) when , are given by where , , , , ,
4. Optimal Weights
4.1. L1-Optimal Weights
Huang and Brill proposed an L1-optimal weight, which is based on the Manhattan metric, for the in (9). It is Huang and Brill proved that the exact efficiency (EFF) of in (9) relative to the EDF exceeds 1 on the tails of the distribution. In this paper, we use this weight in order to improve the efficiency of estimating the tail probability of the truncated Pareto distribution. Huang indicates that, in general, for any distribution, if , which means putting more weight on the middle data, the efficiency of estimating the tail probability will be improved.
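Equation (23) is not reproduced in this text. The sketch below assumes the form of the middle-data weight commonly attributed to the level crossing estimator of Huang and Brill, namely 1/sqrt(n(n - 1)); this is an assumption and should be checked against the original reference before use. The function names are hypothetical.

```python
import math


def l1_optimal_weight(n):
    """Assumed L1-optimal weight for the middle order statistics: 1/sqrt(n(n-1))."""
    if n < 3:
        raise ValueError("need at least three observations")
    return 1.0 / math.sqrt(n * (n - 1))


def extreme_weight(n, b):
    """Weight left for each of the two extreme order statistics."""
    return (1.0 - (n - 2) * b) / 2.0
```

Under this assumption the middle weight is slightly larger than the equal weight 1/n and each extreme weight is smaller than 1/n, in line with the remark that more weight is placed on the middle data.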
Next, we explore an alternative optimal weight.
4.2. L2-Optimal Weights
Corollary 5. An L2-optimal weight for the efficiency function of the given in (17) for estimating the population mean in (3) relative to the sample mean in (16) when is given by The minimum and are given by In (24) and (25), , , , , and are defined in (20).
The proof of Corollary 5 is in the appendix.
Table 2 lists the values of , , , , and the exact of relative to for , and 100; and 3, by using (24) and (25). We note that all values of are greater than , and all exact relative to are greater than 1.
Remark 6. The in (23) is totally nonparametric; it is more robust and easy to use. Note that in (24) depends on the parameter . In practice, we may estimate first then obtain a , while still keeping the optimal advantage. Of course, we use the given values in the simulations. However, and are close to each other when .
5. Simulations
Next, we use simulations to compare the performance of the three parametric MLE estimators, that is, Hill’s, Beg’s, and Aban’s estimators in (12), (13), and (14), and the two semiparametric estimators, that is, the Moment and the weighted estimators in (16) and (18), for estimating the shape parameter in (3). We generate m = 1,000 random samples of size from the distribution of (3). is only used for the weighted estimator, since Table 2 indicates that the values of and are close to each other when .
We know that if , the original Pareto distribution has an infinite variance; if , then the mean is infinite. These cases have inference difficulties. We focus on , and 1.8, and let , . Figure 3 contains the box plots comparing these five estimators. Note that in the cases and , Beg's and Hill's estimators have large biases, whereas the weighted, Aban's, and the Moment estimators perform very well; however, Aban's MLE sometimes has unstable solutions. When , the results are similar except that Hill's and Beg's estimators perform better. It is interesting to see that, in all three cases, the median of the weighted estimator for is larger than those of Aban's and the Moment estimators. We will discuss how this affects the tail estimation in the next section. The simulations were run by using MAPLE 15 with double precision.
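A simulation of this kind can be sketched in a few lines. The example below is not the MAPLE code used for Figure 3; it is a Python illustration that draws truncated Pareto samples by inverse transform and compares the medians of the moment and weighted estimates. The parameter values in the usage note, the sample size, the weight `b`, and all function names are assumptions chosen here.

```python
import math
import random
import statistics


def tp_quantile(u, alpha, gamma, nu):
    """Quantile function of the truncated Pareto distribution on [gamma, nu]."""
    c = 1.0 - (gamma / nu) ** alpha
    return gamma * (1.0 - u * c) ** (-1.0 / alpha)


def tp_mean(alpha, gamma, nu):
    rho = gamma / nu
    c = 1.0 - rho ** alpha
    if abs(alpha - 1.0) < 1e-9:
        return gamma * math.log(nu / gamma) / c
    return gamma * alpha * (rho ** (alpha - 1.0) - 1.0) / ((1.0 - alpha) * c)


def alpha_from_mean(target, gamma, nu, lo=1e-8, hi=1.0):
    """Bisection solve of tp_mean(alpha) = target (tp_mean decreases in alpha)."""
    if not gamma < target < tp_mean(lo, gamma, nu):
        raise ValueError("target mean is outside the attainable range")
    while tp_mean(hi, gamma, nu) > target:
        hi *= 2.0
        if hi > 1e6:
            raise ValueError("could not bracket the root")
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if tp_mean(mid, gamma, nu) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate(alpha, gamma, nu, n, m, b, seed=0):
    """Return (median moment estimate, median weighted estimate) over m samples."""
    rng = random.Random(seed)
    edge = (1.0 - (n - 2) * b) / 2.0     # weight on each extreme order statistic
    moment, weighted = [], []
    for _ in range(m):
        xs = sorted(tp_quantile(rng.random(), alpha, gamma, nu) for _ in range(n))
        w_mean = edge * (xs[0] + xs[-1]) + b * sum(xs[1:-1])
        try:
            a_m = alpha_from_mean(sum(xs) / n, gamma, nu)
            a_w = alpha_from_mean(w_mean, gamma, nu)
        except ValueError:
            continue                      # skip samples with no admissible root
        moment.append(a_m)
        weighted.append(a_w)
    return statistics.median(moment), statistics.median(weighted)
```

A call such as `simulate(0.8, 1.0, 100.0, n=50, m=1000, b=1.0 / math.sqrt(50 * 49))` uses illustrative values only; here both limits are treated as known rather than estimated from the sample.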
6. Examples
Now we use the proposed method and compare it with existing methods to analyze the data of the two examples outlined in Section 1.
6.1. Hurricane Loss Example
6.1.1. Comparison of Four Estimation Methods
First, we look at the hurricane loss example in Section 1.1.1. Based on the simulation results in Figure 3, we consider the three better estimators out of the five in the heavy-tailed case, that is, Aban’s, Moment, and the weighted estimators in (27), (28), and (29) for the truncated Pareto p.d.f. in (3) and the c.d.f. in (4). We also compare them with Hill’s estimator in (26) for the original Pareto p.d.f. in (1) and the c.d.f. in (2). Here where is given in (12) with ; consider where is Aban’s MLE given in (14); consider where is given in (16); consider where is given in (18), using the weight .
The results of these four methods are listed in Table 3 by using the hurricane loss data, where , , and .
Figure 4 is a log-log plot showing the upper tail for the hurricane loss data. In this plot, the circles represent the real data, and the straight line represents the estimated original Pareto distribution. The dashed line, dotted line, and thick solid line represent the estimated truncated Pareto distributions by using Aban's, Moment, and the weighted estimators, respectively. We draw two conclusions from the plot.
(1) The original Pareto distribution (straight line) does not fit the data well in the tail. The truncated Pareto distribution fits the data very well using all three estimation methods (Aban, Moment, and weighted). Note that the 5% value at risk estimated by the original Pareto model is 147 billion; the three truncated model estimates are about 80 billion. It appears that the original Pareto model overestimates the risk. An insurance company using the Pareto model would set a high premium, with the result that many people could not afford to buy insurance.
(2) We examine the three truncated Pareto estimates. Around the tail, the weighted estimate fits the data the best; that is, the curve turns downward more quickly, following the trend of the data pattern, because the weighted estimate is the largest among the three truncated model estimates (this is consistent with the simulation results in Figure 3), and the weighted mean estimate and the 5% value at risk are the smallest among the three methods (this is obtained by placing less weight on the extreme value, e.g., the loss of the 1926 Great Miami Hurricane), so its estimated distribution is less heavy in the tail compared with other methods.
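The comparison of value-at-risk estimates can be illustrated with a short sketch. It assumes that the 5% value at risk is the loss level exceeded with probability 0.05 under the fitted model, and it uses the standard Pareto and truncated Pareto quantile functions. The names `alpha`, `gamma`, and `nu` are notation chosen here, and any parameter values passed in are illustrative rather than the estimates reported in Table 3.

```python
def pareto_var(p, alpha, gamma):
    """Loss level exceeded with probability p under the original Pareto model."""
    return gamma * p ** (-1.0 / alpha)


def truncated_pareto_var(p, alpha, gamma, nu):
    """Loss level exceeded with probability p under the truncated Pareto model."""
    c = 1.0 - (gamma / nu) ** alpha
    return gamma * (1.0 - (1.0 - p) * c) ** (-1.0 / alpha)
```

For the same shape parameter and lower limit, the truncated value never exceeds the upper limit and is always below the original Pareto value, which is the mechanism behind the smaller value-at-risk estimates of the truncated models.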
Next, in order to confirm these conclusions, we run three goodness-of-fit tests. Later, we define the absolute error and integrated error as the measures of the distance from the empirical data points to the estimated Pareto curve and truncated Pareto curves.
6.1.2. Goodness-of-Fit Tests
Our objective is to test if the estimated distributions in (26)–(29) fit the data properly. We test the hypotheses against , where is the true unknown distribution function and is the estimated Pareto c.d.f. in (26) or the estimated truncated Pareto c.d.f. in (27)–(29). In this paper we use three EDF goodness-of-fit tests.
(2) Anderson and Darling  test (A-D test) introduced a measure of “distance” between the empirical distribution and the proposed c.d.f. by using a metric function space where is a weight function, with . Let , and under the test statistic and -value are given by where is the observed value of and , , is the Gamma function.
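As an illustration of the A-D statistic, the sketch below uses the standard computational formula for the Anderson-Darling statistic with weight function 1/(u(1 - u)); it does not reproduce the p-value approximation referred to above, whose formula is missing from this text. The function name and the `cdf` callable argument are choices made here.

```python
import math


def anderson_darling(sample, cdf):
    """Anderson-Darling statistic
    A^2 = -n - (1/n) * sum_{i=1}^{n} (2i-1) * [ln F(x_(i)) + ln(1 - F(x_(n+1-i)))]."""
    xs = sorted(sample)
    n = len(xs)
    total = 0.0
    for i, x in enumerate(xs, start=1):
        f_low = cdf(x)
        f_high = cdf(xs[n - i])           # x_(n+1-i) in 1-based notation
        if not (0.0 < f_low < 1.0 and 0.0 < f_high < 1.0):
            raise ValueError("the fitted c.d.f. must lie strictly in (0, 1) at every data point")
        total += (2 * i - 1) * (math.log(f_low) + math.log(1.0 - f_high))
    return -n - total / n
```

Note that a truncated Pareto c.d.f. whose limits are set equal to the sample minimum and maximum takes the values 0 and 1 at those points, so the statistic is undefined there unless the limits are set slightly outside the data range or the extreme points are handled separately.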
For fitting the 49 losses in the hurricane loss data, for example, after computing estimates of by using (26), (27), (28), and (29), we compute the absolute error (AE) in (30) and define the integrated error (IE) by
Figure 5 gives the absolute errors (AE) in (30) for the 10 largest losses and confirms that the original Pareto estimate has relatively larger errors and the weighted estimate has relatively smaller errors in the tail. Figure 5 explains the data fitting of the tail of the distributions in Figure 4.
We also compute the AE in (30) and IE in (36) to confirm those tail errors by using as the number of the largest losses, . The AE and IE values are given in Table 4 (the smallest values are bold with *).
The weighted estimator has the smallest AE and IE values for and largest losses, and its IE value is almost equal to the smallest IE value for , all largest losses. We statistically conclude that the weighted estimated distribution is the best fit in the tail of the hurricane loss data.
6.2. Forest Fire Loss Data Example
Next, we look at the forest fire loss example in Section 1.1.2. The data in Table 1 contain a relatively large number of large forest fire losses, which convinces us to use a truncated Pareto model and compare the four estimators in (26), (27), (28), and (29). We use the 25 largest losses in this study.
Figure 6 is a log-log plot which shows the upper tail for the forest fire loss data. The circles represent the real data, and the straight line represents the estimated original Pareto distribution. The dashed line, dotted line, and thick solid line represent the estimated truncated Pareto distribution by using Aban’s, Moment, and the weighted estimators, respectively. We can see that the estimated truncated Pareto distributions fit the data very well using all three estimation methods and are much better than the original Pareto distribution. Around the tail, the weighted and the Moment estimation methods perform the best. The fact that the tail of the data curves downward in Figure 6 is evidence in support of using a truncated Pareto model.
The results of these three estimators are listed in Table 5 by using the forest fire loss data, where , , and .
Note that the weighted estimate is the largest among the three truncated Pareto estimates. The 5% value at risk of the original Pareto estimate is the largest, which suggests that the original Pareto model may overestimate the 5% VaR relative to the truncated Pareto models.
Figure 7 gives the absolute errors (AE) in (30) for the 10 largest losses and confirms that the original Pareto estimate has relatively larger errors and the weighted estimate has relatively smaller errors in the tail. Figure 7 explains the data fitting of the tail of the distributions in Figure 6.
The weighted estimator has the smallest AE and IE values for and , and its IE value in is almost equal to the smallest IE value. We statistically conclude that the weighted estimated distribution is the best fit to the tail of the forest fire loss data.
In the complicated real world, it is difficult to construct a model combining all the desired features. In general, the final model selection depends on the best fitting model. The criteria are based on goodness-of-fit tests, existence of the moments, characteristic largest values, and log-log plots. The hurricane loss data and forest fire loss data are well fitted by the truncated Pareto distribution. In summary:
(a) We recommend the truncated Pareto model as a loss distribution for analyzing data sets of huge risk losses. The upper and lower limits can be set by the largest and smallest losses or other reasonable values.
(b) The estimated loss distribution provides a prediction of the next disaster’s 5% value at risk. The largest loss in the data set plays an important role in the prediction. In Figures 4 and 6, the trend of the tail distribution is crucial for an insurance company setting policy, and also for inhabitants and governments making plans to minimize damage from natural disasters.
(c) The semiparametric methods (Moment and weighted) are robust, easy to use, more stable, and fit the data better than the MLE methods. In both of the foregoing examples, the estimated shape parameter is less than 1. The Moment and weighted methods cannot be used with the original Pareto model in this case. This is another advantage of using the truncated Pareto model.
(d) The statistical inference in these examples shows that the distribution curve estimated by the proposed weighted method fits the tails of the data better than the Moment and MLE estimators. This is because the weighted estimator places less weight on the extreme values and has good MSE. Based on these studies, we suggest that further studies on the usage of weights may be useful.
Lemma A.1. For a truncated Pareto random variable with p.d.f. given in (3), for , , are order statistics; one has where , , , , , and .
Proof. Let the c.d.f. of a truncated Pareto distribution in (4) be By the theory of order statistics, we have Using the binomial formula , , we have By substituting and 2, respectively, we have (A.1) and (A.2). And
Proof. Let . We have where Since , then
Proof of Corollary 5. By Theorem 3 and Lemma A.2, let which is a quadratic function with first and second derivatives of w.r.t. : Hence, is a convex function with minimum value The maximum value of the is
The authors thank the referees and the editor for their comments which helped to improve the paper. This research is supported by the Natural Sciences and Engineering Research Council of Canada.
P. Embrechts, C. Klüppelberg, and T. Mikosch, Modelling Extremal Events for Insurance and Finance, Springer, New York, NY, USA, 2003.
J. Beirlant, Y. Goegebeur, J. Segers, and J. Teugels, Statistics of Extremes: Theory and Applications, John Wiley & Sons, New York, NY, USA, 2005.
G. R. Shorack and J. A. Wellner, Empirical Processes with Applications to Statistics, John Wiley & Sons, New York, NY, USA, 1986.
H. L. Koul, Weighted Empirical and Linear Models, vol. 21 of Lecture Notes-Monograph Series, Institute of Mathematical Statistics, Hayward, Calif, USA, 1992.
M. L. Huang, “The efficiencies of a weighted distribution function estimator,” in Proceedings of the American Statistical Association, Nonparametric Statistics Section, pp. 1502–1506, 2003.
A. N. Kolmogorov, “Sulla determinazione empirica di una legge di distribuzione,” Giornale dell'Istituto Italiano degli Attuari, vol. 4, pp. 83–91, 1933.