Research Article  Open Access
Mei Ling Huang, Vincenzo Coia, Percy Brill, "A Cluster Truncated Pareto Distribution and Its Applications", International Scholarly Research Notices, vol. 2013, Article ID 265373, 10 pages, 2013. https://doi.org/10.1155/2013/265373
A Cluster Truncated Pareto Distribution and Its Applications
Abstract
The Pareto distribution is a heavy-tailed distribution with many applications in the real world. The tail of the distribution is important, but the threshold of the distribution is difficult to determine in some situations. In this paper we consider two real-world examples with heavy-tailed observations, which lead us to propose a mixture truncated Pareto distribution (MTPD) and study its properties. We construct a cluster truncated Pareto distribution (CTPD) by using a two-point slope technique to estimate the MTPD from a random sample. We apply the MTPD and CTPD to the two examples and compare the proposed method with existing estimation methods. The results of log-log plots and goodness-of-fit tests show that the MTPD and the cluster estimation method produce very good fitting distributions with real-world data.
1. Introduction
Many real-world problems are modelled with heavy-tailed distributions, especially the Pareto distribution [1, 2]. However, there are some difficulties in the estimation of Pareto distributions. First, the Pareto distribution has infinite moments in some heavy-tailed cases, so the moment estimation method for the shape parameter cannot be used in these situations. This is a loss for the estimation process, since the moment estimator is a robust estimator. Several authors suggest using a truncated Pareto distribution, which always has finite moments (e.g., [3–5]).
In some situations, data will behave differently within different thresholds. For example, losses from hurricane damage can be classified into small, medium, and large hurricane groups. The data in these classes may have different distributions, or by grouping, data with self-similarity may have the same kind of distribution but with different parameters. A cluster method for data is needed to determine these groups when dealing with real data sets. In this paper we study an example of the 49 most damaging Atlantic hurricanes occurring between 1900 and 2005 [6]. The costs are standardized to 2005 USD; see Figure 1.
Coia and Huang [7] applied Pareto and truncated Pareto models to fit the hurricane data set. The maximum likelihood estimator (MLE) and the moment estimator for the shape parameter were used. The results are shown in a log-log plot in Figure 2. Coia and Huang [7] also used Kolmogorov-Smirnov, Anderson-Darling, and Cramér-von Mises goodness-of-fit tests. We note that the two estimated truncated Pareto curves (by the MLE and the moment method) fit the data set quite well; they fit much better in the tail than the original Pareto distribution (which is a straight line in the log-log plot). But the truncated Pareto curves do not fit the data uniformly well, especially for the middle-valued data. We observed that the pattern of the data can be classified into three groups. The data in these three groups may still be Pareto distributed but with different shape parameters. In the literature, researchers study similar data sets by using cluster methods; for example, Coia and Huang [7] proposed a sieve model.
In this paper, we propose a more general method, the mixture truncated Pareto distribution (MTPD), in Section 2. We study the properties of the MTPD in Section 3. In Section 4, we propose a cluster method that uses a two-point slope technique to estimate the MTPD from data; the resulting estimator is a cluster truncated Pareto distribution (CTPD). In Section 5, we review the nonparametric kernel density estimation method. In Section 6, we analyze the hurricane data and a second example regarding sizes of fish caught off the Atlantic shore of Massachusetts, USA, by using the CTPD, the nonparametric kernel estimation method, and three other existing semiparametric estimation methods in log-log plots (see Figure 3 in Section 6). We also perform Kolmogorov-Smirnov, Anderson-Darling, and Cramér-von Mises goodness-of-fit tests on these two data sets. The results show that the proposed cluster method and the nonparametric method are superior to other existing estimation methods in both examples.
2. Mixture Truncated Pareto Distribution
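Definition 1. The probability density function (p.d.f.) and the cumulative distribution function (c.d.f.) of a random variable $X$ having the Pareto distribution are given by
$$f(x) = \frac{\alpha \gamma^{\alpha}}{x^{\alpha+1}}, \quad 0 < \gamma \le x < \infty, \tag{1}$$
$$F(x) = 1 - \left(\frac{\gamma}{x}\right)^{\alpha}, \quad 0 < \gamma \le x < \infty, \tag{2}$$
where $\alpha > 0$ is the shape parameter and $\gamma$ is the lower limit.
When $0 < \alpha \le 1$, which is a heavy-tailed case, the mean and variance of $X$ are infinite, and the distribution is heavier in the right tail as $\alpha$ decreases.
The truncated Pareto distribution (TPD) was originally used to describe the distribution of oil fields by size. It has a lower limit $\gamma$, an upper limit $\nu$, and a shape parameter $\alpha$. In fact, it has been shown that the truncated Pareto distribution fits better than the non-truncated Pareto distribution for some positively skewed populations [3].
Definition 2. The p.d.f. and c.d.f. of a random variable $X$ having the truncated Pareto distribution are given by
$$f_T(x) = \frac{\alpha \gamma^{\alpha} x^{-\alpha-1}}{1 - (\gamma/\nu)^{\alpha}}, \quad 0 < \gamma \le x \le \nu < \infty, \tag{3}$$
$$F_T(x) = \frac{1 - (\gamma/x)^{\alpha}}{1 - (\gamma/\nu)^{\alpha}}, \quad 0 < \gamma \le x \le \nu < \infty, \tag{4}$$
where $\gamma$ and $\nu$ are the left and right truncation points.
The quantile function of the truncated Pareto distribution is
$$F_T^{-1}(u) = \gamma \left[ 1 - u \left( 1 - (\gamma/\nu)^{\alpha} \right) \right]^{-1/\alpha}, \quad 0 < u < 1. \tag{5}$$
The mean, second moment, and variance of $X$ are, respectively,
$$E(X) = \frac{\alpha \gamma^{\alpha} \left( \gamma^{1-\alpha} - \nu^{1-\alpha} \right)}{(\alpha - 1)\left( 1 - (\gamma/\nu)^{\alpha} \right)}, \quad \alpha \ne 1, \tag{6}$$
$$E(X^2) = \frac{\alpha \gamma^{\alpha} \left( \gamma^{2-\alpha} - \nu^{2-\alpha} \right)}{(\alpha - 2)\left( 1 - (\gamma/\nu)^{\alpha} \right)}, \quad \alpha \ne 2, \tag{7}$$
$$\mathrm{Var}(X) = E(X^2) - \left[ E(X) \right]^2. \tag{8}$$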
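To make the truncated Pareto quantities concrete, the following is a minimal Python sketch. It assumes the standard parameterization with lower limit `gamma`, upper limit `nu`, and shape `alpha`; the function names are hypothetical and are not taken from the paper. The moment function also covers the limiting cases where the shape equals the moment order, which give a logarithmic form.

```python
import math

def tpd_pdf(x, alpha, gamma, nu):
    """p.d.f. of the truncated Pareto distribution on [gamma, nu]."""
    if x < gamma or x > nu:
        return 0.0
    c = 1.0 - (gamma / nu) ** alpha  # normalizing constant
    return alpha * gamma ** alpha * x ** (-alpha - 1.0) / c

def tpd_cdf(x, alpha, gamma, nu):
    """c.d.f. of the truncated Pareto distribution."""
    if x <= gamma:
        return 0.0
    if x >= nu:
        return 1.0
    c = 1.0 - (gamma / nu) ** alpha
    return (1.0 - (gamma / x) ** alpha) / c

def tpd_quantile(u, alpha, gamma, nu):
    """Inverse of tpd_cdf for 0 <= u <= 1."""
    c = 1.0 - (gamma / nu) ** alpha
    return gamma * (1.0 - u * c) ** (-1.0 / alpha)

def tpd_moment(k, alpha, gamma, nu):
    """k-th raw moment E(X^k); always finite because of the truncation."""
    c = 1.0 - (gamma / nu) ** alpha
    if abs(alpha - k) < 1e-12:
        # limiting case alpha == k: the integrand is proportional to 1/x
        return alpha * gamma ** alpha * math.log(nu / gamma) / c
    return (alpha * gamma ** alpha
            * (nu ** (k - alpha) - gamma ** (k - alpha)) / ((k - alpha) * c))

# Illustrative values only (not from the paper's data).
mean = tpd_moment(1, 1.5, 1.0, 100.0)
variance = tpd_moment(2, 1.5, 1.0, 100.0) - mean ** 2
```

Even with a shape parameter for which the untruncated Pareto has infinite variance, the truncated moments above are finite, which is what makes the moment estimator usable.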
We consider a vector of thresholds Consider a vector , , . We define a mixture truncated Pareto distribution as follows.
Definition 3. The c.d.f. of a random variable having a mixture truncated Pareto distribution (MTPD) is given by where is the c.d.f. of the truncated Pareto distribution in (4), the truncation points , , are related to thresholds , and is a vector of weights The p.d.f. of a mixture truncated Pareto distribution is given by where is the p.d.f. of the truncated Pareto distribution in (3).
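As a rough illustration of Definition 3, the sketch below evaluates a mixture of truncated Pareto components as a weighted sum of component c.d.f.s and p.d.f.s. It assumes that the weights are nonnegative and sum to one and that each component has its own truncation points and shape; the names `mtpd_cdf` and `mtpd_pdf` and all numeric values are hypothetical.

```python
def tpd_pdf(x, alpha, gamma, nu):
    """Truncated Pareto p.d.f. on [gamma, nu]."""
    if x < gamma or x > nu:
        return 0.0
    c = 1.0 - (gamma / nu) ** alpha
    return alpha * gamma ** alpha * x ** (-alpha - 1.0) / c

def tpd_cdf(x, alpha, gamma, nu):
    """Truncated Pareto c.d.f. on [gamma, nu]."""
    if x <= gamma:
        return 0.0
    if x >= nu:
        return 1.0
    c = 1.0 - (gamma / nu) ** alpha
    return (1.0 - (gamma / x) ** alpha) / c

def mtpd_cdf(x, weights, alphas, gammas, nus):
    """Mixture c.d.f.: weighted sum of the component c.d.f.s."""
    return sum(w * tpd_cdf(x, a, g, v)
               for w, a, g, v in zip(weights, alphas, gammas, nus))

def mtpd_pdf(x, weights, alphas, gammas, nus):
    """Mixture p.d.f.: weighted sum of the component p.d.f.s."""
    return sum(w * tpd_pdf(x, a, g, v)
               for w, a, g, v in zip(weights, alphas, gammas, nus))

# Hypothetical three-component example on adjacent domains
# [1, 5], [5, 20], [20, 100]; the values are illustrative only.
weights = [0.5, 0.3, 0.2]
alphas = [0.8, 1.2, 2.0]
gammas = [1.0, 5.0, 20.0]
nus = [5.0, 20.0, 100.0]
```

With adjacent domains, as in this example, the mixture c.d.f. at each threshold equals the cumulative weight of the components below it.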
3. Properties of Mixture Truncated Pareto Distribution
Proposition 4. The mean and variance of a mixture truncated Pareto distribution are given by where and are the mean and the second moment of the in (6) and (7), .
Proof.
Proposition 5. For given , , the quantile function of a mixture truncated Pareto distribution in (10) is given by solving for in the following equation:
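The two propositions can be illustrated numerically. The sketch below assumes the mixture is a weighted sum of truncated Pareto components, so that the mean is the weighted sum of component means and the variance is the weighted sum of component second moments minus the squared mean; the quantile is found by bisection on the mixture c.d.f. All function names are hypothetical.

```python
import math

def tpd_cdf(x, alpha, gamma, nu):
    """Truncated Pareto c.d.f. on [gamma, nu]."""
    if x <= gamma:
        return 0.0
    if x >= nu:
        return 1.0
    c = 1.0 - (gamma / nu) ** alpha
    return (1.0 - (gamma / x) ** alpha) / c

def tpd_moment(k, alpha, gamma, nu):
    """k-th raw moment of the truncated Pareto distribution."""
    c = 1.0 - (gamma / nu) ** alpha
    if abs(alpha - k) < 1e-12:
        return alpha * gamma ** alpha * math.log(nu / gamma) / c
    return (alpha * gamma ** alpha
            * (nu ** (k - alpha) - gamma ** (k - alpha)) / ((k - alpha) * c))

def mtpd_cdf(x, weights, alphas, gammas, nus):
    return sum(w * tpd_cdf(x, a, g, v)
               for w, a, g, v in zip(weights, alphas, gammas, nus))

def mtpd_mean_var(weights, alphas, gammas, nus):
    """Mean and variance of the mixture from component moments."""
    comps = list(zip(weights, alphas, gammas, nus))
    mean = sum(w * tpd_moment(1, a, g, v) for w, a, g, v in comps)
    second = sum(w * tpd_moment(2, a, g, v) for w, a, g, v in comps)
    return mean, second - mean ** 2

def mtpd_quantile(u, weights, alphas, gammas, nus, iters=200):
    """Solve mtpd_cdf(x) = u for x by bisection (0 < u < 1)."""
    lo, hi = min(gammas), max(nus)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mtpd_cdf(mid, weights, alphas, gammas, nus) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Bisection is used here only because the mixture c.d.f. is monotone and has no closed-form inverse in general; any root finder would serve.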
4. A Cluster Truncated Pareto Distribution Estimator
Consider a random sample from the MTPD in (10), and let denote its order statistics. We divide data into clusters by the domains , , , where , . We define a cluster truncated Pareto distribution (CTPD) as an estimator of the MTPD.
Definition 6. The c.d.f. of a random variable having the cluster truncated Pareto distribution (CTPD) is given by where is a c.d.f. of the MTPD in (10) and is the sample size in the th cluster in the th domain: : where ’s depend on the vector , where , is the number of data less than or equal to the threshold . Note that is a function of and the random sample . Thus where is the indicator function of set .
We also note that and all depend on , in (16). The key point of applying the CTPD in (16) is to determine from the random sample. In this paper, we propose a two-point slope technique in the log-log plot to estimate thresholds .
Definition 7. A two-point slope is defined as
Then we construct order statistics from the absolute values of the two-point slopes , . The cluster threshold points can be estimated by which are determined by the largest absolute values of the two-point slopes where depends upon empirical observations of differences between successive ’s. This usually occurs when is large compared with previous differences.
We propose seven steps to construct a cluster truncated Pareto distribution in (16).
Step 1. Compute two-point slopes in (19), .
Step 2. Determine by using (20); there are two main factors:
(1) Determining depends upon empirical observations of differences between successive ’s, when is much larger than the previous difference . (This technique is used on the two examples in Section 6.)
(2) We also ensure that the sample size within each group is sufficiently large (usually ).
Step 3. Find the estimated threshold points by using the values of the largest absolute slopes of the order statistics of in (20), , corresponding to the values of the original sample, which now have been ordered as new order statistics Then we let
Step 4. Determine , where , . Thus Then we have clusters: This construction is shown in Box 1.
Step 5. Construct , in (16).
Step 6. Estimate . We suggest using the moment estimator in (26) since it has robust properties, but there are other estimators available in (25) and (27).
Step 7. Construct an estimator , for (16).
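The clustering steps can be sketched in Python under explicit assumptions, since the exact form of the two-point slope is not reproduced here. The sketch assumes that the slope is taken between successive points of the log-log plot of the empirical survival function, that the survival value at the i-th order statistic is (n - i + 1)/n, that a threshold is placed at the left endpoint of each selected pair, and that the number of clusters `k` is supplied by the user. The function names are hypothetical, and this is an illustration rather than the authors' implementation.

```python
import math

def two_point_slopes(sample):
    """Slopes between successive points (ln x_(i), ln S_i) of the log-log plot.

    Assumption: S_i = (n - i + 1) / n is the empirical survival value at
    the i-th order statistic, so all n points have a finite logarithm.
    """
    xs = sorted(sample)
    n = len(xs)
    pts = [(math.log(x), math.log((n - i) / n)) for i, x in enumerate(xs)]
    slopes = []
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Assumption: tied observations contribute a zero slope.
        slopes.append((y1 - y0) / (x1 - x0) if x1 > x0 else 0.0)
    return xs, slopes

def cluster_by_slopes(sample, k):
    """Split the ordered sample into k clusters at the k - 1 steepest slopes."""
    xs, slopes = two_point_slopes(sample)
    order = sorted(range(len(slopes)), key=lambda i: abs(slopes[i]), reverse=True)
    thresholds = sorted(xs[i] for i in order[:k - 1])
    clusters, start = [], 0
    for t in thresholds + [xs[-1]]:
        end = sum(1 for x in xs if x <= t)  # number of data <= threshold
        clusters.append(xs[start:end])
        start = end
    weights = [len(c) / len(xs) for c in clusters]  # cluster proportions n_i / n
    return thresholds, clusters, weights

# Toy example: a sample of n values yields n - 1 two-point slopes.
thresholds, clusters, weights = cluster_by_slopes([1.0, 2.0, 4.0, 8.0], k=2)
```

In practice `k` would be chosen by inspecting the differences between successive ordered absolute slopes, as described in Step 2, and each cluster should be checked for a sufficient sample size before fitting.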
Remark 8. There are three estimation methods for the shape parameters for all subsamples, given by the following.
(1) Hill Estimator (original Pareto distribution in (1)). The Hill [8] MLE for is defined as
where is the th smallest order statistic and is the cutoff point.
(2) Moment Estimator (truncated Pareto distribution in (3)). A moment estimator for can be obtained by solving the following equation:
where , .
(3) MLE Method (truncated Pareto distribution in (3)). The Aban MLE [4] for is obtained by solving the following equation:
where is the th smallest order statistic, , .
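For illustration, two of these estimators can be sketched in Python. The Hill estimator below uses the `r` largest order statistics. The truncated Pareto MLE solves the score equation obtained by differentiating the truncated Pareto log-likelihood with respect to the shape parameter, with the truncation points replaced by the sample minimum and maximum. Both forms are stated here as assumptions and may differ in detail from (25) and (27); the function names are hypothetical.

```python
import math

def hill_estimator(sample, r):
    """Hill estimator of the shape from the r largest order statistics."""
    xs = sorted(sample)
    n = len(xs)
    base = math.log(xs[n - r - 1])  # ln X_(n-r), the cutoff order statistic
    total = sum(math.log(xs[n - i]) - base for i in range(1, r + 1))
    return r / total

def tpd_score(alpha, sample):
    """Derivative of the truncated Pareto log-likelihood w.r.t. alpha.

    Assumption: gamma and nu are replaced by the sample min and max.
    """
    n = len(sample)
    g, v = min(sample), max(sample)
    log_ratio = math.log(g / v)  # negative
    s = sum(math.log(x / g) for x in sample)
    q = math.exp(alpha * log_ratio)  # (gamma / nu) ** alpha
    # -expm1(.) computes 1 - q accurately for small alpha
    return n / alpha + n * q * log_ratio / (-math.expm1(alpha * log_ratio)) - s

def tpd_mle(sample, lo=1e-6, hi=50.0, iters=200):
    """Bisection on the decreasing score; None if no root lies in (lo, hi)."""
    if tpd_score(lo, sample) <= 0.0 or tpd_score(hi, sample) >= 0.0:
        return None
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if tpd_score(mid, sample) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The score is strictly decreasing in the shape parameter, so bisection finds the unique root when one exists; a `None` result signals that the sample does not support a truncated Pareto fit in the searched range.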
Note that, before using (26) and (27), the parameters and may be estimated by , , which are used in this paper. There are other nonparametric methods in the literature. Cooke [9] proposed estimators
5. Nonparametric Kernel Density Estimation
We apply the kernel density estimation (KDE) method [10], a smoothing technique for estimating distributions. The classical kernel density estimator of the true probability density function (p.d.f.) $f$ from a random sample $X_1, X_2, \ldots, X_n$ is
$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left( \frac{x - X_i}{h} \right), \tag{29}$$
where $K$ is a symmetric density function and $h > 0$ is a bandwidth.
We will compare the KDE estimator and the CTPD estimator in the next section.
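A minimal Python sketch of the kernel estimator follows, assuming a standard normal kernel and the rule-of-thumb bandwidth 1.06 s n^(-1/5), where s is the sample standard deviation. This bandwidth is an assumption chosen for illustration and need not coincide with the bandwidth used in Section 6; the function names are hypothetical.

```python
import math

def rule_of_thumb_bandwidth(sample):
    """Normal-reference bandwidth h = 1.06 * s * n^(-1/5) (assumed choice)."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return 1.06 * sd * n ** (-0.2)

def kde_pdf(x, sample, h):
    """Kernel density estimate at x with a standard normal kernel."""
    n = len(sample)
    total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)
    return total / (n * h * math.sqrt(2.0 * math.pi))

def kde_cdf(x, sample, h):
    """Smoothed distribution function: average of normal c.d.f.s."""
    n = len(sample)
    return sum(0.5 * (1.0 + math.erf((x - xi) / (h * math.sqrt(2.0))))
               for xi in sample) / n
```

The smoothed c.d.f. is what would be plotted against the empirical survival function in a log-log plot, after taking one minus its value.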
6. Applications
Now we apply the cluster truncated Pareto distribution to the hurricane example in Section 1.
6.1. Cluster Method
By using 48 two-point slopes in (19) and the seven steps in Section 4, we construct clusters. We select the largest absolute values of the two-point slopes in (20) as since the third largest two-point slope does not have a large change compared with .
Then we determine ’s and groups as where , , , and ; , , , and ; , , and ; ; then we have an estimated CTPD in (16) as This construction is shown in Box 2.
6.2. Kernel and Other Estimation Methods in Log-Log Plot
We apply the kernel estimator in (29) which is normalized to the hurricane data. Here, we use a standard normal kernel and optimal bandwidth [10, p. 45] We ensure that the bandwidth is sufficiently large such that the estimated tail distribution is smooth enough.
Table 1 gives , , Median, 5% Value-at-Risk (VaR), and 1% VaR of each of four estimation methods. We note that the cluster method and kernel method give the largest medians. The cluster method gives the smallest VaRs.

Figure 3 is a log-log plot which exhibits the data and five estimated distribution curves. We note that the original Pareto distribution does not fit the data well in the right tail. The moment and Aban estimated truncated Pareto distributions (TPD) fit the data well in the right tail, but not so well for the smaller or middle-valued data. The cluster truncated Pareto distribution overcomes this problem and has the best fit to the data over the whole range. Figure 3 suggests that a single distribution may not fully represent how natural data is distributed. We may consider grouping data by using the cluster method. Figure 3 also shows that the nonparametric kernel estimated distribution fits the data well.
Figure 3 provides a visual observation. It is necessary to run goodness-of-fit tests to confirm which estimated distribution best fits the hurricane data.
6.3. Goodness-of-Fit Tests
In this section we conduct three goodness-of-fit tests: Kolmogorov-Smirnov, Anderson-Darling, and Cramér-von Mises. All three tests are based on the distance between the empirical distribution function and the proposed distribution function: original Pareto distribution in (1) or truncated Pareto distribution in (3) or mixture truncated Pareto distribution in (10).
Each test considers the same null and alternative hypotheses:
$$H_0: F(x) = F_0(x) \quad \text{versus} \quad H_1: F(x) \ne F_0(x),$$
where $F$ is the unknown true distribution of the sample data and $F_0$ is one of our proposed five estimated distributions:
(1) Pareto distribution in (1) with Hill estimator in (25);
(2) truncated Pareto distribution (TPD) in (3) with Aban estimator in (27);
(3) truncated Pareto distribution (TPD) in (3) with moment estimator in (26);
(4) kernel estimated distribution in (29);
(5) cluster truncated Pareto distribution in (16) with moment estimator in (26).
We ran a test for each estimated distribution as $F_0$.
(1) The Kolmogorov-Smirnov (KS) Test [11]. The test statistic is given by $D_n = \sup_x |F_n(x) - F_0(x)|$, where $F_n$ is the empirical distribution function and $F_0$ is the proposed distribution function. Under $H_0$, the two-tailed $p$-value for the KS test is as follows: where is the integer part of .
(2) Anderson and Darling Test (AD Test). Anderson and Darling [12] introduced a measure of “distance” between the empirical distribution and the proposed c.d.f. by using a metric function space, where , with . Let ; assume under $H_0$ the test statistic and $p$-value are given by where is the observed value of and , , is the Gamma function.
(3) Cramér-von Mises Test (CvM Test). Anderson and Darling [12] proposed this test by using in (37). Thus under $H_0$ the test statistic and $p$-value are given by where is the modified Bessel function of the second kind,
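As an illustration, the three test statistics can be computed from the ordered sample with the standard computing formulas. The sketch below assumes a fully specified continuous c.d.f. `cdf` passed in as a function, with values strictly between 0 and 1 at every data point (required by the logarithms in the Anderson-Darling statistic). It returns the statistics only and does not compute $p$-values; the function name is hypothetical.

```python
import math

def gof_statistics(sample, cdf):
    """Return (KS, CvM, AD) statistics of the sample against cdf."""
    xs = sorted(sample)
    n = len(xs)
    u = [cdf(x) for x in xs]  # fitted c.d.f. at the order statistics
    # Kolmogorov-Smirnov: largest gap between the empirical and fitted c.d.f.
    ks = max(max((i + 1) / n - u[i], u[i] - i / n) for i in range(n))
    # Cramer-von Mises: W^2 = 1/(12n) + sum (u_(i) - (2i - 1)/(2n))^2
    cvm = 1.0 / (12 * n) + sum((u[i] - (2 * i + 1) / (2 * n)) ** 2
                               for i in range(n))
    # Anderson-Darling: A^2 = -n - (1/n) sum (2i - 1)[ln u_(i) + ln(1 - u_(n+1-i))]
    ad = -n - sum((2 * i + 1) * (math.log(u[i]) + math.log(1.0 - u[n - 1 - i]))
                  for i in range(n)) / n
    return ks, cvm, ad

# Toy check against the uniform c.d.f. on (0, 1).
ks, cvm, ad = gof_statistics([0.25, 0.5, 0.75], lambda x: x)
```

Smaller values of each statistic indicate a closer agreement between the fitted and empirical distributions; the Anderson-Darling statistic places more weight on the tails than the other two.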
Table 2 gives the values of the test statistics and $p$-values of the three goodness-of-fit tests. Note that the cluster truncated Pareto distribution has the smallest test statistic in the KS test (i.e., the smallest error) and the largest $p$-value. The kernel estimated distribution gives the smallest test statistics in the AD test and CvM test, respectively. This means the cluster truncated Pareto distribution and kernel estimated distribution have the best fit to the hurricane data.
 
Note. In this paper, we use (1) to denote the best values and (2) to denote the second-best values in the tables.
In Table 3, we took the largest data in the sample. The absolute error and integrated error are defined by

Table 3 gives absolute errors and integrated errors of the five estimation methods in cases. We note that the cluster truncated Pareto distribution has the smallest errors in all 6 cases. This means the cluster method is superior in fitting the hurricane data compared with the other existing methods.
6.4. Fishing Example
Another example is determining the chance of catching a fish of record length (size). Overfishing is a serious issue that has been known to collapse many fish populations. To help control the population, limits are set upon anglers to determine the size of fish which can be kept.
The data is from a fishing trip, May 29–June 3, 2011, to Buzzard’s Bay of the Cape Cod area in Massachusetts, USA, by Coia [13]. The largest 72 black sea bass lengths were measured out of a total of 326 black sea bass caught on the trip, using 43.0 cm as a lower-limit threshold. The threshold of 43.0 cm was chosen to be conservative. When sampling, locations producing smaller fish were avoided and the largest fish were targeted. A time-series plot in Figure 4 shows the lengths of these largest 72 fish in the order in which they were caught.
By using 71 two-point slopes in (19) and the seven steps in Section 4, we construct clusters. We select the largest absolute values of the two-point slopes in (20) as since the fourth largest two-point slope does not have a large change compared with .
Then we determine ’s and groups as where , , , , and ; , , , , and ; , , , and ; ; then we have an estimated CTPD in (16) as This construction is shown in Box 3.
We also apply the kernel estimator in (29) and bandwidth in (33) to the fish data. Figure 5 is a log-log plot which exhibits the data and five estimated distribution curves. We note that the original Pareto distribution does not fit the data well in the right tail. The moment and Aban estimated truncated Pareto distributions (TPD) fit the data well in the right tail, but not so well for the smaller or middle-valued data. The cluster truncated Pareto distribution overcomes this problem and has the best fit to the data over the whole range. Figure 5 suggests that a single distribution may not fully represent how natural data is distributed. We group the data by using the cluster method. Figure 5 also shows that the nonparametric kernel estimated distribution fits the data well.
Table 4 gives , , Median, 5% VaR, and 1% VaR of each of four estimation methods. We note that the cluster method and kernel method give the largest median and the smallest VaRs.

We also ran goodness-of-fit tests to confirm which estimated distribution best fits the fish data.
Table 5 gives the values of the test statistics and the $p$-value of each of the three goodness-of-fit tests. We note that the cluster truncated Pareto distribution has the smallest test statistic (i.e., the smallest error) and the largest $p$-value in the KS test. The kernel estimated distribution has the smallest test statistics in the AD test and CvM test, respectively. Thus the cluster truncated Pareto distribution and kernel estimated distribution fit the fish data best.

Table 6 gives absolute errors and integrated errors of five estimation methods in cases. We note that the cluster truncated Pareto distribution and the kernel estimated distribution have the smallest errors in all 6 cases. The cluster method and the kernel estimation method are superior in fitting the fish data compared with the other existing semiparametric estimation methods.

7. Conclusions
Overall, after the studies in this paper, we may conclude the following.
(1) Truncated Pareto models are useful for analyzing real-world data.
(2) Cluster truncated Pareto models are useful for grouped data.
(3) The two-point slope technique is innovative and useful for determining the threshold points in clustering.
(4) A nonparametric kernel estimated distribution fits heavy-tailed data very well.
(5) It is a difficult problem to determine the number of groups, which is based on the two-point slope in (20) in Step 2 of the construction of the CTPD in (16). The choice very much depends on empirical observations. We displayed this technique in the two examples in Section 6. We plan to do further studies on more complex data sets in the future.
Acknowledgment
The authors thank the referees and the editor for their comments which helped to improve the paper. This research is supported by the Natural Sciences and Engineering Research Council of Canada.
References
[1] C. Kleiber and S. Kotz, Statistical Size Distributions in Economics and Actuarial Sciences, Wiley Series in Probability and Statistics, Wiley-Interscience, Hoboken, NJ, USA, 2003.
[2] R. Perline, “Strong, weak and false inverse power laws,” Statistical Science, vol. 20, no. 1, pp. 68–88, 2005.
[3] M. A. Beg, “Estimation of the tail probability of the truncated Pareto distribution,” Journal of Information & Optimization Sciences, vol. 2, no. 2, pp. 192–198, 1981.
[4] I. B. Aban, M. M. Meerschaert, and A. K. Panorska, “Parameter estimation for the truncated Pareto distribution,” Journal of the American Statistical Association, vol. 101, no. 473, pp. 270–277, 2006.
[5] M. L. Huang and K. Zhao, “On estimation of the truncated Pareto distribution,” Advances and Applications in Statistics, vol. 16, no. 1, pp. 83–102, 2010.
[6] R. A. Pielke, J. Gratz, C. W. Landsea, D. Collins, M. A. Saunders, and R. Musulin, “Normalized hurricane damage in the United States: 1900–2005,” Natural Hazards Review, vol. 9, no. 1, pp. 29–42, 2008.
[7] V. Coia and M. L. Huang, “A sieve model for extreme values,” Journal of Statistical Computation and Simulation, vol. 83, 2013.
[8] B. M. Hill, “A simple general approach to inference about the tail of a distribution,” The Annals of Statistics, vol. 3, no. 5, pp. 1163–1174, 1975.
[9] P. Cooke, “Statistical inference for bounds of random variables,” Biometrika, vol. 66, no. 2, pp. 367–374, 1979.
[10] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Monographs on Statistics and Applied Probability, Chapman & Hall, London, UK, 1986.
[11] A. N. Kolmogorov, “Sulla determinazione empirica di una legge di distribuzione,” Giornale dell'Istituto Italiano degli Attuari, vol. 4, pp. 83–91, 1933.
[12] T. W. Anderson and D. A. Darling, “Asymptotic theory of certain “goodness of fit” criteria based on stochastic processes,” Annals of Mathematical Statistics, vol. 23, pp. 193–212, 1952.
[13] V. Coia, “On estimation of extreme value distributions,” Brock Report in Mathematics and Statistics 12080901, Brock University, 2012.
Copyright
Copyright © 2013 Mei Ling Huang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.