ISRN Probability and Statistics
Volume 2013, Article ID 265373, 10 pages
http://dx.doi.org/10.1155/2013/265373
Research Article

A Cluster Truncated Pareto Distribution and Its Applications

1Department of Mathematics, Brock University, St. Catharines, ON, Canada L2S 3A1
2Department of Statistics, University of British Columbia, Vancouver, BC, Canada V6T 1Z4
3Department of Mathematics & Statistics, University of Windsor, Windsor, ON, Canada N9B 3P4

Received 20 June 2013; Accepted 25 July 2013

Academic Editors: E. Omey and S. Sagitov

Copyright © 2013 Mei Ling Huang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

The Pareto distribution is a heavy-tailed distribution with many applications in the real world. The tail of the distribution is important, but the threshold of the distribution is difficult to determine in some situations. In this paper we consider two real-world examples with heavy-tailed observations, which leads us to propose a mixture truncated Pareto distribution (MTPD) and study its properties. We construct a cluster truncated Pareto distribution (CTPD) by using a two-point slope technique to estimate the MTPD from a random sample. We apply the MTPD and CTPD to the two examples and compare the proposed method with existing estimation methods. The results of log-log plots and goodness-of-fit tests show that the MTPD and the cluster estimation method produce very good fitting distributions with real-world data.

1. Introduction

Many real-world problems are modelled with heavy-tailed distributions, especially the Pareto distribution [1, 2]. However, estimating Pareto distributions presents some difficulties. First, the Pareto distribution has infinite moments in some heavy-tailed cases, so the moment estimation method for the shape parameter cannot be used in these situations. This is a loss for the estimation process, since the moment estimator is a robust estimator. Several authors suggest using a truncated Pareto distribution, which always has finite moments (e.g., [3–5]).

In some situations, data will behave differently within different thresholds. For example, losses from hurricane damage can be classified into small, medium, and large hurricane groups. The data in these classes may have different distributions, or by grouping, data with self-similarity may have the same kind of distribution but with different parameters. A cluster method for data is needed to determine these groups when dealing with real data sets. In this paper we study an example of 49 most damaging Atlantic hurricanes occurring between years 1900 and 2005 [6]. The costs are standardized to 2005 USD; see Figure 1.

Figure 1: The 49 costliest Atlantic hurricanes between the years 1900 and 2005.

Coia and Huang [7] applied Pareto and truncated Pareto models to fit the hurricane data set. The maximum likelihood estimator (MLE) and the moment estimator for the shape parameter were used. The results are shown in a log-log plot in Figure 2. Coia and Huang [7] also used Kolmogorov-Smirnov, Anderson-Darling, and Cramer-von Mises goodness-of-fit tests. We note that the two estimated (by MLE and moment method) truncated Pareto curves fit the data set quite well; they fit much better in the tail than the original Pareto distribution (which appears as a straight line). But the truncated Pareto curves do not fit the data uniformly well, especially for the middle values of the data. We observed that the pattern of the data can be classified into three groups. The data in these three groups may still be Pareto distributed but with different shape parameters. In the literature, researchers study similar data sets by using cluster methods; for example, Coia and Huang [7] proposed a sieve model.

Figure 2: Log-log plot of hurricane example with estimated distribution curves. The red circles are the data; the black straight line is the original Pareto distribution; the green dot line is the MLE estimated truncated Pareto distribution; the brown dash line is the moment estimated truncated Pareto distribution.

In this paper, we propose a more general model, the mixture truncated Pareto distribution (MTPD), in Section 2. We study the properties of the MTPD in Section 3. In Section 4, we propose a cluster method that uses a two-point slope technique to estimate the MTPD from data; the resulting estimator is a cluster truncated Pareto distribution (CTPD). In Section 5, we review the nonparametric kernel density estimation method. In Section 6, we analyze the hurricane data and a second example regarding sizes of fish caught off the Atlantic shore of Massachusetts, USA, by using the CTPD, the nonparametric kernel estimation method, and three other existing semiparametric estimation methods in log-log plots (see Figure 3 in Section 6). We also perform Kolmogorov-Smirnov, Anderson-Darling, and Cramer-von Mises goodness-of-fit tests on these two data sets. The results show that the proposed cluster method and the nonparametric method are superior to the other existing estimation methods in both examples.

Figure 3: Log-log plot of hurricane example with estimated distribution curves. The red circles are the data; the black straight line is the original Pareto distribution; the green dot line is the MLE estimated truncated Pareto distribution; the brown dash line is the moment estimated truncated Pareto distribution; the thick green dash line is the kernel estimated distribution; the thick blue line is the cluster truncated Pareto distribution.

2. Mixture Truncated Pareto Distribution

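Definition 1. The probability density function (p.d.f.) and the cumulative distribution function (c.d.f.) of a random variable $X$ having the Pareto distribution are given by

$$f(x) = \frac{\alpha \gamma^{\alpha}}{x^{\alpha+1}}, \quad x \ge \gamma > 0, \tag{1}$$

$$F(x) = 1 - \left(\frac{\gamma}{x}\right)^{\alpha}, \quad x \ge \gamma > 0, \tag{2}$$

where $\alpha > 0$ is the shape parameter and $\gamma$ is the lower bound of the support.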

When $0 < \alpha \le 1$, which is a heavy-tailed case, the mean and variance of $X$ are infinite, and the distribution is heavier in the right tail as $\alpha$ decreases.

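The truncated Pareto distribution (TPD) was originally used to describe the distribution of oil fields by size. It has a lower limit $\gamma$, an upper limit $\nu$, and a shape parameter $\alpha$. In fact, it has been shown that the truncated Pareto distribution fits better than the nontruncated Pareto distribution for some positively skewed populations [3].

Definition 2. The p.d.f. and c.d.f. of a random variable $X$ having the truncated Pareto distribution are given by

$$f_T(x) = \frac{\alpha \gamma^{\alpha} x^{-(\alpha+1)}}{1 - (\gamma/\nu)^{\alpha}}, \quad 0 < \gamma \le x \le \nu < \infty, \tag{3}$$

$$F_T(x) = \frac{1 - (\gamma/x)^{\alpha}}{1 - (\gamma/\nu)^{\alpha}}, \quad 0 < \gamma \le x \le \nu < \infty, \tag{4}$$

where $\alpha > 0$ is the shape parameter and $\gamma$ and $\nu$ are the left and right truncation points.

The quantile function of the truncated Pareto distribution is

$$F_T^{-1}(u) = \gamma \left[ 1 - u \left( 1 - (\gamma/\nu)^{\alpha} \right) \right]^{-1/\alpha}, \quad 0 \le u \le 1. \tag{5}$$

The mean, second moment, and variance of $X$ are, respectively,

$$E(X) = \frac{\alpha \gamma^{\alpha} \left( \nu^{1-\alpha} - \gamma^{1-\alpha} \right)}{(1-\alpha) \left[ 1 - (\gamma/\nu)^{\alpha} \right]}, \quad \alpha \ne 1, \tag{6}$$

$$E(X^2) = \frac{\alpha \gamma^{\alpha} \left( \nu^{2-\alpha} - \gamma^{2-\alpha} \right)}{(2-\alpha) \left[ 1 - (\gamma/\nu)^{\alpha} \right]}, \quad \alpha \ne 2, \tag{7}$$

$$\operatorname{Var}(X) = E(X^2) - [E(X)]^2. \tag{8}$$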
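As a numerical companion to Definition 2, the following minimal sketch evaluates the truncated Pareto p.d.f., c.d.f., quantile function, and raw moments. It assumes the usual parameterization with lower limit `gamma`, upper limit `nu`, and shape `alpha`; the function names are illustrative and do not come from the paper. The formulas are written in terms of the ratio `gamma / nu` to avoid overflow for large data values.

```python
import math

def tpd_pdf(x, alpha, gamma, nu):
    """P.d.f. of a Pareto distribution truncated to [gamma, nu]."""
    if x < gamma or x > nu:
        return 0.0
    norm = 1.0 - (gamma / nu) ** alpha
    return (alpha / x) * (gamma / x) ** alpha / norm

def tpd_cdf(x, alpha, gamma, nu):
    """C.d.f. of a Pareto distribution truncated to [gamma, nu]."""
    if x <= gamma:
        return 0.0
    if x >= nu:
        return 1.0
    return (1.0 - (gamma / x) ** alpha) / (1.0 - (gamma / nu) ** alpha)

def tpd_quantile(u, alpha, gamma, nu):
    """Inverse of tpd_cdf for 0 <= u <= 1."""
    return gamma * (1.0 - u * (1.0 - (gamma / nu) ** alpha)) ** (-1.0 / alpha)

def tpd_moment(k, alpha, gamma, nu):
    """k-th raw moment; finite for every alpha > 0 because the support is bounded."""
    q = gamma / nu
    norm = 1.0 - q ** alpha
    if math.isclose(alpha, k):
        # Limiting case alpha == k, where the generic formula is 0/0.
        return alpha * gamma ** k * math.log(nu / gamma) / norm
    return alpha * gamma ** k * (1.0 - q ** (alpha - k)) / ((alpha - k) * norm)
```

For example, `tpd_moment(2, a, g, v) - tpd_moment(1, a, g, v) ** 2` gives the variance, which stays finite even when `a <= 1`.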

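We consider a vector of thresholds $\mathbf{t}$ and a vector of shape parameters $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_k)$, $\alpha_i > 0$, $i = 1, \ldots, k$. We define a mixture truncated Pareto distribution as follows.

Definition 3. The c.d.f. of a random variable $X$ having a mixture truncated Pareto distribution (MTPD) is given by

$$F_M(x) = \sum_{i=1}^{k} w_i F_{T_i}(x), \tag{10}$$

where $F_{T_i}$ is the c.d.f. of the truncated Pareto distribution in (4) with shape parameter $\alpha_i$ and truncation points $\gamma_i$, $\nu_i$, $i = 1, \ldots, k$, which are related to the thresholds $\mathbf{t}$, and $\mathbf{w} = (w_1, \ldots, w_k)$ is a vector of weights with $w_i \ge 0$ and $\sum_{i=1}^{k} w_i = 1$. The p.d.f. of a mixture truncated Pareto distribution is given by

$$f_M(x) = \sum_{i=1}^{k} w_i f_{T_i}(x),$$

where $f_{T_i}$ is the p.d.f. of the truncated Pareto distribution in (3) with the same parameters.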
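To make Definition 3 concrete, here is a small sketch that evaluates a mixture of truncated Pareto components as a weighted sum. The representation of each component as a tuple `(weight, alpha, gamma, nu)` and the example parameter values are assumptions made for illustration only.

```python
def tpd_cdf(x, alpha, gamma, nu):
    """C.d.f. of a Pareto distribution truncated to [gamma, nu]."""
    if x <= gamma:
        return 0.0
    if x >= nu:
        return 1.0
    return (1.0 - (gamma / x) ** alpha) / (1.0 - (gamma / nu) ** alpha)

def tpd_pdf(x, alpha, gamma, nu):
    """P.d.f. of a Pareto distribution truncated to [gamma, nu]."""
    if x < gamma or x > nu:
        return 0.0
    return (alpha / x) * (gamma / x) ** alpha / (1.0 - (gamma / nu) ** alpha)

def mtpd_cdf(x, components):
    """Mixture c.d.f.: weighted sum of the component c.d.f.s."""
    return sum(w * tpd_cdf(x, a, g, v) for w, a, g, v in components)

def mtpd_pdf(x, components):
    """Mixture p.d.f.: weighted sum of the component p.d.f.s."""
    return sum(w * tpd_pdf(x, a, g, v) for w, a, g, v in components)

# Hypothetical two-component mixture on [1, 10] and [10, 100]; weights sum to 1.
example_components = [(0.5, 1.0, 1.0, 10.0), (0.5, 2.0, 10.0, 100.0)]
```

With these example values, half of the probability mass lies below the threshold 10, so `mtpd_cdf(10.0, example_components)` equals the first weight.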

3. Properties of Mixture Truncated Pareto Distribution

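Proposition 4. The mean and variance of a mixture truncated Pareto distribution with weights $w_1, \ldots, w_k$ are given by

$$E(X) = \sum_{i=1}^{k} w_i \mu_i, \qquad \operatorname{Var}(X) = \sum_{i=1}^{k} w_i \mu_i^{(2)} - \left( \sum_{i=1}^{k} w_i \mu_i \right)^2,$$

where $\mu_i$ and $\mu_i^{(2)}$ are the mean and the second moment of the $i$th truncated Pareto component in (6) and (7), $i = 1, \ldots, k$.

Proof. Since the mixture p.d.f. is the weighted sum of the component p.d.f.s, linearity of the integral gives $E(X^m) = \sum_{i=1}^{k} w_i E(X_i^m)$ for $m = 1, 2$, where $X_i$ follows the $i$th component. The variance then follows from $\operatorname{Var}(X) = E(X^2) - [E(X)]^2$.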
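The moment formulas in Proposition 4 can be checked numerically with a short sketch. It assumes each component is described by a hypothetical tuple `(weight, alpha, gamma, nu)` and uses the closed-form raw moments of the truncated Pareto distribution; none of the names come from the paper.

```python
import math

def tpd_moment(k, alpha, gamma, nu):
    """k-th raw moment of a Pareto distribution truncated to [gamma, nu]."""
    q = gamma / nu
    norm = 1.0 - q ** alpha
    if math.isclose(alpha, k):
        return alpha * gamma ** k * math.log(nu / gamma) / norm
    return alpha * gamma ** k * (1.0 - q ** (alpha - k)) / ((alpha - k) * norm)

def mtpd_mean_var(components):
    """Mean and variance of the mixture from weighted component moments."""
    m1 = sum(w * tpd_moment(1, a, g, v) for w, a, g, v in components)
    m2 = sum(w * tpd_moment(2, a, g, v) for w, a, g, v in components)
    return m1, m2 - m1 ** 2
```

Note that the variance is computed from the mixed second moment, not as a weighted sum of component variances; the latter would ignore the spread between component means.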

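Proposition 5. For given $u$, $0 < u < 1$, the quantile function of a mixture truncated Pareto distribution in (10) is given by solving for $x$ in the following equation:

$$\sum_{i=1}^{k} w_i F_{T_i}(x) = u,$$

where $w_i$ are the mixture weights and $F_{T_i}$ are the component truncated Pareto c.d.f.s.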
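Because the mixture c.d.f. has no closed-form inverse in general, the equation in Proposition 5 has to be solved numerically. The sketch below uses plain bisection, which is one possible choice and not necessarily the solver used by the authors; the component tuples `(weight, alpha, gamma, nu)` are an assumed representation.

```python
def tpd_cdf(x, alpha, gamma, nu):
    """C.d.f. of a Pareto distribution truncated to [gamma, nu]."""
    if x <= gamma:
        return 0.0
    if x >= nu:
        return 1.0
    return (1.0 - (gamma / x) ** alpha) / (1.0 - (gamma / nu) ** alpha)

def mtpd_cdf(x, components):
    return sum(w * tpd_cdf(x, a, g, v) for w, a, g, v in components)

def mtpd_quantile(u, components, tol=1e-10):
    """Solve mtpd_cdf(x) = u for x by bisection over the overall support."""
    lo = min(g for _, _, g, _ in components)
    hi = max(v for _, _, _, v in components)
    while hi - lo > tol * max(1.0, abs(hi)):
        mid = 0.5 * (lo + hi)
        if mtpd_cdf(mid, components) < u:
            lo = mid   # root lies to the right of mid
        else:
            hi = mid   # root lies at or to the left of mid
    return 0.5 * (lo + hi)
```

Bisection only needs the c.d.f. to be nondecreasing, so it also works when the component supports overlap or leave gaps.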

4. A Cluster Truncated Pareto Distribution Estimator

Consider a random sample from the MTPD in (10), and let denote its order statistics. We divide data into clusters by the domains , ,  , where , . We define a cluster truncated Pareto distribution (CTPD) as an estimator of the MTPD.

Definition 6. The c.d.f. of a random variable having the cluster truncated Pareto distribution (CTPD) is given by where is a c.d.f. of the MTPD in (10) and is the sample size in the th cluster in the th domain: : where ’s depend on the vector , where ,   is the number of data less than or equal to the threshold . Note that is a function of and the random sample . Thus where is the indicator function of set .

We also note that and all depend on , in (16). The key point of applying the CTPD in (16) is to determine from the random sample. In this paper, we propose a two-point slope technique in the log-log plot to estimate thresholds .

Definition 7. A two-point slope is defined as

Then we construct order statistics from the absolute values of the two-point slopes , . The cluster threshold points can be estimated by which are determined by the largest absolute values of the two-point slopes where depends upon empirical observations of differences between successive ’s. This usually occurs when is large compared with previous differences.
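The two-point slope idea can be illustrated with a short sketch. It rests on loudly stated assumptions: the slope is taken between consecutive points of the log-log plot of the empirical survival function, the survival value at the $i$th smallest observation is $(n - i + 1)/n$, and the left endpoint of a steep segment is used as the candidate threshold. These choices, and the function names, are illustrative and may differ from the definitions in (19) and (20).

```python
import math

def two_point_slopes(sample):
    """Slopes between consecutive points of the log-log survival plot.

    Assumes positive, distinct observations; n points give n - 1 slopes.
    """
    xs = sorted(sample)
    n = len(xs)
    # Point i (0-based) is (ln x, ln of empirical survival (n - i) / n), never ln 0.
    pts = [(math.log(x), math.log((n - i) / n)) for i, x in enumerate(xs)]
    return [(y1 - y0) / (x1 - x0) for (x0, y0), (x1, y1) in zip(pts, pts[1:])]

def candidate_thresholds(sample, n_breaks):
    """Data values at the left end of the n_breaks steepest segments."""
    xs = sorted(sample)
    slopes = two_point_slopes(xs)
    order = sorted(range(len(slopes)), key=lambda i: abs(slopes[i]), reverse=True)
    return sorted(xs[i] for i in order[:n_breaks])
```

In practice the number of breaks is chosen by inspecting the gaps between the ordered absolute slopes, as described in the text, rather than being fixed in advance.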

We propose seven steps to construct a cluster truncated Pareto distribution in (16).

Step 1. Compute two-point slopes in (19), .

Step 2. Determine by using (20); there are two main factors:
(1) Determining depends upon empirical observations of differences between successive ’s, when is much larger than the previous difference . (This technique is used on the two examples in Section 6.)
(2) We also ensure that the sample size within each group is sufficiently large (usually ).

Step 3. Find the estimated threshold points by using the values of the largest absolute slopes of the order statistics of in (20), , corresponding to the values of the original sample, which now have been ordered as new order statistics Then we let

Step 4. Determine , where , . Thus Then we have clusters: This construction is shown in Box 1.

Box 1: Construction of cluster truncated Pareto distribution from the data.

Step 5. Construct , in (16).

Step 6. Estimate . We suggest using the moment estimator in (26) since it has robust properties, but there are other estimators available in (25) and (27).

Step 7. Construct an estimator , for (16).
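Steps 4 to 7 can be sketched end to end once the thresholds are known. The sketch below is an illustration under assumptions, not the authors' implementation: clusters are the observations falling in consecutive half-open intervals between thresholds, each cluster's truncation points are taken as its smallest and largest observation, the weights are the cluster proportions, and the shape parameter is obtained by matching the truncated Pareto mean to the cluster mean with bisection. All function names are hypothetical.

```python
import math

def tpd_cdf(x, alpha, gamma, nu):
    if x <= gamma:
        return 0.0
    if x >= nu:
        return 1.0
    return (1.0 - (gamma / x) ** alpha) / (1.0 - (gamma / nu) ** alpha)

def tpd_mean(alpha, gamma, nu):
    """Mean of the truncated Pareto distribution, written with q = gamma / nu."""
    q = gamma / nu
    if math.isclose(alpha, 1.0):
        return gamma * math.log(nu / gamma) / (1.0 - q)
    return alpha * gamma * (1.0 - q ** (alpha - 1.0)) / ((alpha - 1.0) * (1.0 - q ** alpha))

def moment_alpha(xs):
    """Shape estimate solving tpd_mean(alpha) = sample mean (mean decreases in alpha)."""
    gamma, nu = min(xs), max(xs)
    xbar = sum(xs) / len(xs)
    lo, hi = 1e-6, 1.0
    if not tpd_mean(lo, gamma, nu) > xbar > gamma:
        raise ValueError("moment equation has no positive root for this cluster")
    while tpd_mean(hi, gamma, nu) > xbar:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if tpd_mean(mid, gamma, nu) > xbar:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def fit_ctpd(sample, thresholds):
    """Return one (weight, alpha, gamma, nu) tuple per non-empty cluster."""
    xs = sorted(sample)
    edges = [-math.inf] + sorted(thresholds) + [math.inf]
    comps = []
    for left, right in zip(edges, edges[1:]):
        cluster = [x for x in xs if left < x <= right]
        if cluster:
            comps.append((len(cluster) / len(xs), moment_alpha(cluster),
                          cluster[0], cluster[-1]))
    return comps

def ctpd_cdf(x, comps):
    return sum(w * tpd_cdf(x, a, g, v) for w, a, g, v in comps)
```

The moment equation has a positive root only when the cluster mean lies below the mean of the log-uniform limit on the same interval, so the sketch raises an error instead of returning a meaningless shape value.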

Remark 8. There are three estimation methods for the shape parameters for all sub-samples, given by the following.
(1) Hill Estimator (original Pareto distribution in (1)). The Hill [8] MLE for is defined as where is the th smallest order statistic and is the cut-off point.
(2) Moment Estimator (truncated Pareto distribution in (3)). A moment estimator for can be obtained by solving the following equation: where , .
(3) MLE Method (truncated Pareto distribution in (3)). The Aban MLE [4] for is obtained by solving the following equation: where is the th smallest order statistic,  , .
Note that, before using (26) and (27), the parameters and may be estimated by , , which are used in this paper. There are other nonparametric methods in the literature. Cooke [9] proposed estimators
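Two of the estimators in Remark 8 can be sketched with the standard library alone. The Hill estimator is written in its textbook form using the `r` largest observations, and the truncated Pareto MLE is obtained by finding the root of the score function derived from the truncated Pareto log-likelihood, with the truncation points set to the sample minimum and maximum. The indexing conventions, the bisection solver, and the function names are assumptions for illustration and may differ from (25) and (27).

```python
import math

def hill_estimator(sample, r):
    """Hill estimate of the Pareto shape from the r largest observations."""
    xs = sorted(sample)
    n = len(xs)
    base = xs[n - r - 1]            # (r + 1)-th largest observation
    return r / sum(math.log(x / base) for x in xs[n - r:])

def tpd_score(alpha, xs):
    """Derivative of the truncated Pareto log-likelihood with respect to alpha."""
    n, gamma, nu = len(xs), min(xs), max(xs)
    big_l = math.log(nu / gamma)    # requires at least two distinct values
    z = alpha * big_l
    tail = 0.0 if z > 700.0 else n * big_l / math.expm1(z)
    return n / alpha - tail - sum(math.log(x / gamma) for x in xs)

def tpd_mle_alpha(sample):
    """Root of the score by bisection; the score is decreasing in alpha."""
    xs = list(sample)
    lo, hi = 1e-8, 1.0
    if tpd_score(lo, xs) <= 0.0:
        raise ValueError("likelihood equation has no positive root")
    while tpd_score(hi, xs) > 0.0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if tpd_score(mid, xs) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

A positive root exists only when the average of the log-observations is below the midpoint of the log-range, which is why the sketch checks the sign of the score near zero before bisecting.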

5. Nonparametric Kernel Density Estimation

We apply the kernel density estimation (KDE) method [10], a smoothing technique for estimating distributions. The classical kernel density estimator of the true probability density function (p.d.f.) $f$ from a random sample $X_1, \ldots, X_n$ is

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left( \frac{x - X_i}{h} \right), \tag{29}$$

where $K$ is a symmetric density function and $h > 0$ is a bandwidth.
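A minimal sketch of the kernel estimator with a standard normal kernel is given below. The bandwidth function uses the normal-reference rule $h = 1.06\,\hat{\sigma}\,n^{-1/5}$, which is an assumption about the "optimal bandwidth" meant in the text; the c.d.f. estimate is obtained by integrating the kernel, and all names are illustrative.

```python
import math
import statistics

def normal_reference_bandwidth(sample):
    """Rule-of-thumb bandwidth 1.06 * s * n^(-1/5) for a Gaussian kernel."""
    n = len(sample)
    return 1.06 * statistics.stdev(sample) * n ** (-0.2)

def kde_pdf(x, sample, h):
    """Kernel density estimate at x with a standard normal kernel."""
    c = 1.0 / (len(sample) * h * math.sqrt(2.0 * math.pi))
    return c * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)

def kde_cdf(x, sample, h):
    """Kernel estimate of the c.d.f.: average of normal c.d.f.s centred at the data."""
    return sum(0.5 * (1.0 + math.erf((x - xi) / (h * math.sqrt(2.0))))
               for xi in sample) / len(sample)
```

A larger `h` gives a smoother estimated tail at the cost of more bias near the bulk of the data, which is the trade-off referred to when the bandwidth is chosen.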

We will compare the KDE estimator and the CTPD estimator in the next section.

6. Applications

Now we apply the cluster truncated Pareto distribution to the hurricane example in Section 1.

6.1. Cluster Method

By using 48 two-point slopes in (19) and the seven steps in Section 4, we construct clusters. We select the largest absolute values of the two-point slopes in (20) as since the third largest two-point slope does not have a large change compared with .

Then we determine ’s and groups as where , , , and ; , , , and ; , , and ; ; then we have an estimated CTPD in (16) as This construction is shown in Box 2.

Box 2: Construction of cluster truncated Pareto distribution from the data.

6.2. Kernel and Other Estimation Methods in Log-Log Plot

We apply the kernel estimator in (29), normalized to the hurricane data. Here, we use a standard normal kernel and the optimal bandwidth [10, p. 45]

$$h = 1.06\,\hat{\sigma}\,n^{-1/5}, \tag{33}$$

where $\hat{\sigma}$ is the sample standard deviation and $n$ is the sample size. We ensure that the bandwidth is sufficiently large such that the estimated tail distribution is smooth enough.

Table 1 gives , , Median, 5% Value-at-Risk (VaR), and 1% VaR of each of the five estimation methods. We note that the cluster method and kernel method give the largest medians. The cluster method gives the smallest VaRs.

Table 1: Five estimation methods for hurricane example.

Figure 3 is a log-log plot which exhibits data and five estimated distribution curves. We note that the original Pareto distribution does not fit data well in the right tail. The moment and Aban estimated truncated Pareto distribution (TPD) fit data well in the right tail, but not so well in the smaller or middle values data. The cluster truncated Pareto distribution overcomes this problem and has the best fit to the data over the whole range. Figure 3 suggests a single distribution may not totally represent how natural data is distributed. We may consider grouping data by using the cluster method. Figure 3 also shows that the nonparametric kernel estimated distribution fits the data well.

Figure 3 provides a visual observation. It is necessary to run goodness-of-fit tests mathematically to confirm which estimated distribution best fits the hurricane data.

6.3. Goodness-of-Fit Tests

In this section we conduct three goodness-of-fit tests: Kolmogorov-Smirnov, Anderson-Darling, and Cramer-von Mises. All three tests are based on the distance between the empirical distribution function and the proposed distribution function: the original Pareto distribution in (1), the truncated Pareto distribution in (3), or the mixture truncated Pareto distribution in (10).

Each test considers the same null and alternative hypotheses:

$$H_0: F = F_0 \quad \text{versus} \quad H_1: F \ne F_0,$$

where $F$ is the unknown true distribution of the sample data and $F_0$ is one of our five proposed estimated distributions:
(1) Pareto distribution in (1) with Hill estimator in (25);
(2) truncated Pareto distribution (TPD) in (3) with Aban estimator in (27);
(3) truncated Pareto distribution (TPD) in (3) with moment estimator in (26);
(4) kernel estimated distribution in (29);
(5) cluster truncated Pareto distribution in (16) with moment estimator in (26).
We ran a test for each estimated distribution as $F_0$.

(1) The Kolmogorov-Smirnov (K-S) Test [11]. The test statistic is given by where is the empirical distribution function. Under the two-tailed value for the K-S test is as follows: where is the integer part of .
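The K-S statistic, the largest vertical distance between the empirical distribution function and the hypothesized c.d.f., can be computed exactly from the order statistics. The sketch below shows the standard computation; the function name and the callable argument `cdf` are illustrative, and the p value formula is not reproduced here.

```python
def ks_statistic(sample, cdf):
    """Supremum distance between the empirical c.d.f. and a continuous cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical c.d.f. jumps from (i - 1)/n to i/n at the i-th order statistic,
        # so the distance must be checked on both sides of the jump.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d
```

For instance, `ks_statistic(data, lambda x: fitted_cdf(x))` would compare a data set with any fitted distribution, where `fitted_cdf` stands for a user-supplied function.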

(2) Anderson and Darling Test (A-D Test). Anderson and Darling [12] introduced a measure of “distance” between the empirical distribution and the proposed c.d.f. by using a metric function space, where , with . Let ; assume under the test statistic and value are given by where is the observed value of    and  ,  , is the Gamma function.

(3) Cramer-von Mises Test (C-v-M Test). Anderson and Darling [12] proposed this test by using in (37). Thus under the test statistic and value are given by where is the modified Bessel function of the second kind,
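Both the Anderson-Darling and Cramer-von Mises statistics have well-known computing formulas in terms of the hypothesized c.d.f. evaluated at the order statistics. The sketch below implements those standard formulas; the function names are illustrative, and the p value expressions are not reproduced.

```python
import math

def anderson_darling(sample, cdf):
    """A-D statistic; requires 0 < cdf(x) < 1 at every observation."""
    u = sorted(cdf(x) for x in sample)
    n = len(u)
    s = sum((2 * i - 1) * (math.log(u[i - 1]) + math.log(1.0 - u[n - i]))
            for i in range(1, n + 1))
    return -n - s / n

def cramer_von_mises(sample, cdf):
    """C-v-M statistic from the probability-transformed order statistics."""
    u = sorted(cdf(x) for x in sample)
    n = len(u)
    return 1.0 / (12 * n) + sum((u[i - 1] - (2 * i - 1) / (2 * n)) ** 2
                                for i in range(1, n + 1))
```

The A-D statistic weights the tails more heavily than the C-v-M statistic, so a fitted c.d.f. that equals exactly 0 or 1 at an observation (for example, a truncated model whose limits are the sample extremes) needs special handling before taking logarithms.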

Table 2 gives the values of the test statistics and the $p$ values of the three goodness-of-fit tests. Note that the cluster truncated Pareto distribution has the smallest test statistic in the K-S test (i.e., the smallest error) and the largest $p$ values. The kernel estimated distribution gives the smallest test statistics in the A-D test and the C-v-M test. This means the cluster truncated Pareto distribution and the kernel estimated distribution have the best fit to the hurricane data.

Table 2: Goodness-of-fit tests for hurricane example.

In Table 3, we took the largest data in the sample. The absolute error and integrated error are defined by

Table 3: Errors of goodness-of-fit tests for hurricane example.

Table 3 gives absolute errors and integrated errors of the five estimation methods in 6 cases. We note that the cluster truncated Pareto distribution has the smallest errors in all 6 cases. This means the cluster method is superior in fitting the hurricane data compared with the other existing methods.

6.4. Fishing Example

Another example is determining the chance of catching a fish of record length (size). Overfishing is a serious issue that has been known to collapse many fish populations. To help control the population, limits are set upon anglers to determine the size of fish which can be kept.

The data is from a fishing trip, May 29–June 3, 2011, to Buzzard’s Bay in the Cape Cod area of Massachusetts, USA, by Coia [13]. The lengths of the 72 largest black sea bass were measured out of a total of 326 black sea bass caught on the trip, using 43.0 cm as a lower-limit threshold. The threshold of 43.0 cm was chosen to be conservative. When sampling, locations producing smaller fish were avoided and the largest fish were targeted. A time-series plot in Figure 4 shows the lengths of these 72 largest fish in the order in which they were caught.

Figure 4: Lengths of 72 black sea bass (Centropristis striata) caught at least 43 cm long, fishing in Buzzard’s Bay, May 29–June 3, 2011.

By using 71 two-point slopes in (19) and the seven steps in Section 4, we construct clusters. We select the largest absolute values of the two-point slopes in (20) as since the fourth largest two-point slope does not have a large change compared with .

Then we determine ’s and groups as where , , , , and ; , , , , and ; , , , and ; ;   then we have an estimated CTPD in (16) as This construction is shown in Box 3.

Box 3: Construction of cluster truncated Pareto distribution from the data.

We also apply the kernel estimator in (29) and bandwidth in (33) to the fish data. Figure 5 is a log-log plot which exhibits data and five estimated distribution curves. We note that the original Pareto distribution does not fit the data well in the right tail. The moment and Aban estimated truncated Pareto distribution (TPD) fit the data well in the right tail, but not so well in the smaller or middle values data. The cluster truncated Pareto distribution overcomes this problem and has the best fit to the data over the whole range. Figure 5 suggests a single distribution may not totally represent how natural data is distributed. We group the data by using the cluster method. Figure 5 also shows that the nonparametric kernel estimated distribution fits the data well.

Figure 5: Log-log plot of fishing example with estimated distribution curves. The red circles are the data; the black straight line is the original Pareto distribution; the green dot line is the MLE estimated truncated Pareto distribution; the brown dash line is the moment estimated truncated Pareto distribution; the thick green dash line is the kernel estimated distribution; the thick blue line is the cluster truncated Pareto distribution.

Table 4 gives , , Median, 5% VaR, and 1% VaR of each of the five estimation methods. We note that the cluster method and kernel method give the largest medians and the smallest VaRs.

Table 4: Five estimation methods for fishing example.

We also ran goodness-of-fit tests mathematically to confirm which estimated distribution best fits the fish data.

Table 5 gives the values of the test statistics and the $p$ values of each of the three goodness-of-fit tests. We note that the cluster truncated Pareto distribution has the smallest test statistic (i.e., the smallest error) and the largest $p$ value in the K-S test. The kernel estimated distribution has the smallest test statistics in the A-D test and the C-v-M test. Thus the cluster truncated Pareto distribution and the kernel estimated distribution fit the fish data best.

Table 5: Goodness-of-fit tests for fishing example.

Table 6 gives absolute errors and integrated errors of the five estimation methods in 6 cases. We note that the cluster truncated Pareto distribution and the kernel estimated distribution have the smallest errors in all 6 cases. The cluster method and the kernel estimation method are superior in fitting the fish data compared with the other existing semiparametric estimation methods.

Table 6: Errors of goodness-of-fit tests for fishing example.

7. Conclusions

Overall, after the studies in this paper, we may conclude the following.
(1) Truncated Pareto models are useful for analyzing real-world data.
(2) Cluster truncated Pareto models are useful for grouped data.
(3) The two-point slope technique is innovative and useful for determining the threshold points in clustering.
(4) A nonparametric kernel estimated distribution fits heavy-tailed data very well.
(5) It is a difficult problem to determine the number of groups, which is based on the two-point slopes in (20) in Step 2 of the construction of the CTPD in (16). The choice very much depends on empirical observations. We displayed this technique in the two examples in Section 6. We plan to do further studies on more complex data sets in the future.

Acknowledgment

The authors thank the referees and the editor for their comments which helped to improve the paper. This research is supported by the Natural Sciences and Engineering Research Council of Canada.

References

  1. C. Kleiber and S. Kotz, Statistical Size Distributions in Economics and Actuarial Sciences, Wiley Series in Probability and Statistics, Wiley-Interscience, Hoboken, NJ, USA, 2003.
  2. R. Perline, “Strong, weak and false inverse power laws,” Statistical Science, vol. 20, no. 1, pp. 68–88, 2005.
  3. M. A. Beg, “Estimation of the tail probability of the truncated Pareto distribution,” Journal of Information & Optimization Sciences, vol. 2, no. 2, pp. 192–198, 1981.
  4. I. B. Aban, M. M. Meerschaert, and A. K. Panorska, “Parameter estimation for the truncated Pareto distribution,” Journal of the American Statistical Association, vol. 101, no. 473, pp. 270–277, 2006.
  5. M. L. Huang and K. Zhao, “On estimation of the truncated Pareto distribution,” Advances and Applications in Statistics, vol. 16, no. 1, pp. 83–102, 2010.
  6. R. A. Pielke, J. Gratz, C. W. Landsea, D. Collins, M. A. Saunders, and R. Musulin, “Normalized hurricane damage in the United States: 1900–2005,” Natural Hazards Review, vol. 9, no. 1, pp. 29–42, 2008.
  7. V. Coia and M. L. Huang, “A sieve model for extreme values,” Journal of Statistical Computation and Simulation, vol. 83, 2013.
  8. B. M. Hill, “A simple general approach to inference about the tail of a distribution,” The Annals of Statistics, vol. 3, no. 5, pp. 1163–1174, 1975.
  9. P. Cooke, “Statistical inference for bounds of random variables,” Biometrika, vol. 66, no. 2, pp. 367–374, 1979.
  10. B. W. Silverman, Density Estimation for Statistics and Data Analysis, Monographs on Statistics and Applied Probability, Chapman & Hall, London, UK, 1986.
  11. A. N. Kolmogorov, “Sulla determinazione empirica di una legge di distribuzione,” Giornale dell'Istituto Italiano degli Attuari, vol. 4, pp. 83–91, 1933.
  12. T. W. Anderson and D. A. Darling, “Asymptotic theory of certain “goodness of fit” criteria based on stochastic processes,” Annals of Mathematical Statistics, vol. 23, pp. 193–212, 1952.
  13. V. Coia, “On estimation of extreme value distributions,” Brock Report in Mathematics and Statistics 120809-01, Brock University, 2012.