International Scholarly Research Notices

Research Article | Open Access

Volume 2013 |Article ID 265373 | https://doi.org/10.1155/2013/265373

Mei Ling Huang, Vincenzo Coia, Percy Brill, "A Cluster Truncated Pareto Distribution and Its Applications", International Scholarly Research Notices, vol. 2013, Article ID 265373, 10 pages, 2013. https://doi.org/10.1155/2013/265373

A Cluster Truncated Pareto Distribution and Its Applications

Academic Editor: E. Omey
Received 20 Jun 2013
Accepted 25 Jul 2013
Published 18 Sep 2013

Abstract

The Pareto distribution is a heavy-tailed distribution with many applications in the real world. The tail of the distribution is important, but the threshold of the distribution is difficult to determine in some situations. In this paper we consider two real-world examples with heavy-tailed observations, which leads us to propose a mixture truncated Pareto distribution (MTPD) and study its properties. We construct a cluster truncated Pareto distribution (CTPD) by using a two-point slope technique to estimate the MTPD from a random sample. We apply the MTPD and CTPD to the two examples and compare the proposed method with existing estimation methods. The results of log-log plots and goodness-of-fit tests show that the MTPD and the cluster estimation method produce very good fitting distributions with real-world data.

1. Introduction

Many real-world problems are modelled with heavy-tailed distributions, especially the Pareto distribution [1, 2]. However, estimating Pareto distributions raises some difficulties. First, the Pareto distribution has infinite moments in some heavy-tailed cases, so the moment estimation method for the shape parameter cannot be used in these situations. This is a loss for the estimation process, since the moment estimator is a robust estimator. Several authors suggest using a truncated Pareto distribution, which always has finite moments (e.g., [3–5]).

In some situations, data behave differently within different thresholds. For example, losses from hurricane damage can be classified into small, medium, and large hurricane groups. The data in these classes may have different distributions, or, after grouping, self-similar data may follow the same kind of distribution but with different parameters. A cluster method is needed to determine these groups when dealing with real data sets. In this paper we study the 49 most damaging Atlantic hurricanes that occurred between 1900 and 2005 [6]. The costs are standardized to 2005 USD; see Figure 1.

Coia and Huang [7] applied Pareto and truncated Pareto models to fit the hurricane data set. The maximum likelihood estimator (MLE) and the moment estimator for the shape parameter were used. The results are shown in a log-log plot in Figure 2. Coia and Huang [7] also used Kolmogorov-Smirnov, Anderson-Darling, and Cramer-von Mises goodness-of-fit tests. We note that the two estimated truncated Pareto curves (by the MLE and the moment method) fit the data set quite well; they fit much better in the tail than the original Pareto distribution, which appears as a straight line in the log-log plot. However, the truncated Pareto curves do not fit the data uniformly well, especially for the middle values. We observed that the data can be classified into three groups. The data in these three groups may still be Pareto distributed but with different shape parameters. In the literature, researchers study similar data sets by using cluster methods; for example, Coia and Huang [7] proposed a sieve model.

In this paper, we propose a more general method, the mixture truncated Pareto distribution (MTPD), in Section 2. We study the properties of the MTPD in Section 3. In Section 4, we propose a cluster method that uses a two-point slope technique to estimate the MTPD from data; the resulting estimator is a cluster truncated Pareto distribution (CTPD). In Section 5, we review the nonparametric kernel density estimation method. In Section 6, we analyze the hurricane data and a second example regarding sizes of fish caught off the Atlantic shore of Massachusetts, USA, by using the CTPD, the nonparametric kernel estimation method, and three other existing semiparametric estimation methods in log-log plots (see Figure 3 in Section 6). We also perform Kolmogorov-Smirnov, Anderson-Darling, and Cramer-von Mises goodness-of-fit tests on these two data sets. The results show that the proposed cluster method and the nonparametric method are superior to the other existing estimation methods in both examples.

2. Mixture Truncated Pareto Distribution

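Definition 1. The probability density function (p.d.f.) and the cumulative distribution function (c.d.f.) of a random variable $X$ having the Pareto distribution are given by

$$f(x) = \frac{\alpha \gamma^{\alpha}}{x^{\alpha+1}}, \quad x \ge \gamma > 0, \tag{1}$$

$$F(x) = 1 - \left(\frac{\gamma}{x}\right)^{\alpha}, \quad x \ge \gamma > 0, \tag{2}$$

where $\alpha > 0$ is the shape parameter and $\gamma$ is the lower bound of the support.

When $0 < \alpha \le 1$, which is a heavy-tailed case, the mean and variance of $X$ are infinite, and the distribution is heavier in the right tail as $\alpha$ decreases.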

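The truncated Pareto distribution (TPD) was originally used to describe the distribution of oil fields by size. It has a lower limit $\gamma$, an upper limit $\nu$, and a shape parameter $\alpha$. In fact, it has been shown that the truncated Pareto distribution fits better than the nontruncated Pareto distribution for some positively skewed populations [3].

Definition 2. The p.d.f. and c.d.f. of a random variable $X_T$ having the truncated Pareto distribution are given by

$$f_T(x) = \frac{\alpha \gamma^{\alpha} x^{-(\alpha+1)}}{1 - (\gamma/\nu)^{\alpha}}, \quad 0 < \gamma \le x \le \nu < \infty, \quad \alpha > 0, \tag{3}$$

$$F_T(x) = \frac{1 - (\gamma/x)^{\alpha}}{1 - (\gamma/\nu)^{\alpha}}, \quad \gamma \le x \le \nu, \tag{4}$$

where $\gamma$ and $\nu$ are the left and right truncation points.

The quantile function of the truncated Pareto distribution is

$$F_T^{-1}(u) = \gamma \left[ 1 - u \left( 1 - (\gamma/\nu)^{\alpha} \right) \right]^{-1/\alpha}, \quad 0 \le u \le 1. \tag{5}$$

The mean, second moment, and variance of $X_T$ are, respectively,

$$E(X_T) = \frac{\alpha \gamma^{\alpha}}{1 - (\gamma/\nu)^{\alpha}} \cdot \frac{\nu^{1-\alpha} - \gamma^{1-\alpha}}{1 - \alpha}, \quad \alpha \ne 1, \tag{6}$$

$$E(X_T^2) = \frac{\alpha \gamma^{\alpha}}{1 - (\gamma/\nu)^{\alpha}} \cdot \frac{\nu^{2-\alpha} - \gamma^{2-\alpha}}{2 - \alpha}, \quad \alpha \ne 2, \tag{7}$$

$$\operatorname{Var}(X_T) = E(X_T^2) - \left[ E(X_T) \right]^2. \tag{8}$$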
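As an illustration of Definition 2, the following minimal Python sketch evaluates the truncated Pareto p.d.f., c.d.f., quantile function, and raw moments. It assumes the standard parameterization with lower limit `gamma`, upper limit `nu`, and shape `alpha`; the function names and the example parameter values are hypothetical and are not taken from the paper.

```python
import math

def tpd_pdf(x, alpha, gamma, nu):
    """Density of the truncated Pareto distribution on [gamma, nu]."""
    if x < gamma or x > nu:
        return 0.0
    norm = 1.0 - (gamma / nu) ** alpha
    return alpha * gamma ** alpha * x ** (-(alpha + 1.0)) / norm

def tpd_cdf(x, alpha, gamma, nu):
    """Distribution function; 0 below gamma and 1 above nu."""
    if x <= gamma:
        return 0.0
    if x >= nu:
        return 1.0
    return (1.0 - (gamma / x) ** alpha) / (1.0 - (gamma / nu) ** alpha)

def tpd_quantile(u, alpha, gamma, nu):
    """Inverse of tpd_cdf for 0 <= u <= 1."""
    norm = 1.0 - (gamma / nu) ** alpha
    return gamma * (1.0 - u * norm) ** (-1.0 / alpha)

def tpd_raw_moment(order, alpha, gamma, nu):
    """E[X**order]; uses the logarithmic form when alpha equals the order."""
    norm = 1.0 - (gamma / nu) ** alpha
    p = order - alpha
    # Integral of x**(p - 1) over [gamma, nu].
    integral = math.log(nu / gamma) if p == 0 else (nu ** p - gamma ** p) / p
    return alpha * gamma ** alpha * integral / norm

# Hypothetical parameters chosen only for illustration.
alpha, gamma, nu = 0.6, 4.0, 150.0
mean = tpd_raw_moment(1, alpha, gamma, nu)
variance = tpd_raw_moment(2, alpha, gamma, nu) - mean ** 2
median = tpd_quantile(0.5, alpha, gamma, nu)
print(mean, variance, median)
```

Because the support is bounded, both moments are finite even for a shape parameter below 1, which is the property that makes the moment estimator usable for the truncated model.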

We consider a vector of thresholds Consider a vector , , . We define a mixture truncated Pareto distribution as follows.

Definition 3. The c.d.f. of a random variable having a mixture truncated Pareto distribution (MTPD) is given by where is the c.d.f. of the truncated Pareto distribution in (4), the truncation points , , are related to thresholds , and is a vector of weights The p.d.f. of a mixture truncated Pareto distribution is given by where is the p.d.f. of the truncated Pareto distribution in (3).

3. Properties of Mixture Truncated Pareto Distribution

Proposition 4. The mean and variance of a mixture truncated Pareto distribution are given by where and are the mean and the second moment of the in (6) and (7), .

Proof.
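The moment relations in Proposition 4 can be illustrated numerically. The sketch below assumes that the MTPD is a finite mixture of truncated Pareto components with weights summing to one, so that the mixture mean is the weighted sum of component means and the variance is the weighted second moment minus the squared mean. The function names and parameter values are hypothetical.

```python
import math

def tpd_raw_moment(order, alpha, gamma, nu):
    """E[X**order] for a truncated Pareto on [gamma, nu] with shape alpha."""
    norm = 1.0 - (gamma / nu) ** alpha
    p = order - alpha
    # Integral of x**(p - 1) over [gamma, nu]; logarithmic when p == 0.
    integral = math.log(nu / gamma) if p == 0 else (nu ** p - gamma ** p) / p
    return alpha * gamma ** alpha * integral / norm

def mixture_mean_variance(weights, alphas, gammas, nus):
    """Mean and variance of a finite mixture of truncated Pareto components."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    parts = list(zip(weights, alphas, gammas, nus))
    mean = sum(w * tpd_raw_moment(1, a, g, v) for w, a, g, v in parts)
    second = sum(w * tpd_raw_moment(2, a, g, v) for w, a, g, v in parts)
    return mean, second - mean ** 2

# Hypothetical three-component example on adjacent intervals.
mix_mean, mix_var = mixture_mean_variance(
    weights=[0.5, 0.3, 0.2],
    alphas=[0.7, 5.0, 0.9],
    gammas=[1.0, 5.0, 20.0],
    nus=[5.0, 20.0, 150.0],
)
print(mix_mean, mix_var)
```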

Proposition 5. For given ,  , the quantile function of a mixture truncated Pareto distribution in (10) is given by solving for in the following equation:
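Since Proposition 5 defines the quantile only implicitly, it has to be found numerically. The following sketch uses bisection on the mixture c.d.f., again assuming a finite mixture of truncated Pareto components; the function names, the tolerance, and the example parameters are hypothetical choices, and any other root finder could be used instead.

```python
def tpd_cdf(x, alpha, gamma, nu):
    """Truncated Pareto c.d.f. on [gamma, nu]."""
    if x <= gamma:
        return 0.0
    if x >= nu:
        return 1.0
    return (1.0 - (gamma / x) ** alpha) / (1.0 - (gamma / nu) ** alpha)

def mtpd_cdf(x, weights, alphas, gammas, nus):
    """Weighted sum of the component c.d.f.s."""
    return sum(w * tpd_cdf(x, a, g, v)
               for w, a, g, v in zip(weights, alphas, gammas, nus))

def mtpd_quantile(u, weights, alphas, gammas, nus, tol=1e-12):
    """Solve mtpd_cdf(x) = u for x by bisection, for 0 < u < 1."""
    if not 0.0 < u < 1.0:
        raise ValueError("u must lie strictly between 0 and 1")
    lo, hi = min(gammas), max(nus)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mtpd_cdf(mid, weights, alphas, gammas, nus) < u:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * max(1.0, abs(hi)):
            break
    return 0.5 * (lo + hi)

# Hypothetical two-component mixture on [1, 5] and [5, 50].
params = dict(weights=[0.6, 0.4], alphas=[1.5, 0.8],
              gammas=[1.0, 5.0], nus=[5.0, 50.0])
print(mtpd_quantile(0.5, **params), mtpd_quantile(0.95, **params))
```

Bisection is used here because the mixture c.d.f. is monotone, so the bracket always contains the solution.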

4. A Cluster Truncated Pareto Distribution Estimator

Consider a random sample from the MTPD in (10), and let denote its order statistics. We divide data into clusters by the domains , ,  , where , . We define a cluster truncated Pareto distribution (CTPD) as an estimator of the MTPD.

Definition 6. The c.d.f. of a random variable having the cluster truncated Pareto distribution (CTPD) is given by where is a c.d.f. of the MTPD in (10) and is the sample size in the th cluster in the th domain: : where ’s depend on the vector , where ,   is the number of data less than or equal to the threshold . Note that is a function of and the random sample . Thus where is the indicator function of set .

We also note that and all depend on , in (16). The key point of applying the CTPD in (16) is to determine from the random sample. In this paper, we propose a two-point slope technique in the log-log plot to estimate thresholds .

Definition 7. A two-point slope is defined as

Then we construct order statistics from the absolute values of the two-point slopes , . The cluster threshold points can be estimated by which are determined by the largest absolute values of the two-point slopes where depends upon empirical observations of differences between successive ’s. This usually occurs when is large compared with previous differences.

We propose seven steps to construct a cluster truncated Pareto distribution in (16).

Step 1. Compute two-point slopes in (19), .

Step 2. Determine by using (20); there are two main factors:(1)Determining depends upon empirical observations of differences between successive ’s, when is much larger than the previous difference . (This technique is used on the two examples in Section 6.)(2)We also ensure that the sample size within each group is sufficiently large (usually ).

Step 3. Find the estimated threshold points by using the values of the largest absolute slopes of the order statistics of in (20), , corresponding to the values of the original sample, which now have been ordered as new order statistics Then we let

Step 4. Determine , where , . Thus Then we have clusters: This construction is shown in Box 1.

Step 5. Construct , in (16).

Step 6. Estimate . We suggest using the moment estimator in (26) since it has robust properties, but there are other estimators available in (25) and (27).

Step 7. Construct an estimator , for (16).
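Steps 1 to 4 can be sketched in code. Because the displayed formula for the two-point slope is not reproduced here, the sketch makes an explicit assumption: the slope is taken between consecutive points of the log-log plot of the empirical survival function against the ordered data, with the survival at the $i$th order statistic set to $(n - i + 1)/n$ so that its logarithm is always defined. The choice of the left end point of a steep segment as the threshold, and all function names, are likewise assumptions made for illustration.

```python
import math

def two_point_slopes(sorted_xs):
    """Slopes between consecutive points of the log-log survival plot."""
    n = len(sorted_xs)
    slopes = []
    for i in range(n - 1):
        # Survival P(X >= x) at the two neighbouring order statistics.
        s_left = (n - i) / n
        s_right = (n - i - 1) / n
        dx = math.log(sorted_xs[i + 1]) - math.log(sorted_xs[i])
        if dx == 0.0:
            # Tied values give no usable slope; treated as 0 (an assumption).
            slopes.append(0.0)
            continue
        slopes.append((math.log(s_right) - math.log(s_left)) / dx)
    return slopes

def largest_slope_thresholds(sample, n_thresholds):
    """Data values at the left end of the steepest segments, in increasing order."""
    xs = sorted(sample)
    slopes = two_point_slopes(xs)
    ranked = sorted(range(len(slopes)), key=lambda i: abs(slopes[i]), reverse=True)
    picked = sorted(ranked[:n_thresholds])
    return [xs[i] for i in picked]

def split_into_clusters(sample, thresholds):
    """Group the ordered data into clusters of values <= each threshold."""
    xs = sorted(sample)
    clusters, start = [], 0
    for t in thresholds:
        end = start
        while end < len(xs) and xs[end] <= t:
            end += 1
        clusters.append(xs[start:end])
        start = end
    clusters.append(xs[start:])
    return clusters

# Small artificial sample, not data from the paper.
toy = [1, 2, 4, 8, 8.1, 16, 32]
cuts = largest_slope_thresholds(toy, 1)
print(cuts, split_into_clusters(toy, cuts))
```

The number of thresholds is passed in by hand, which mirrors Step 2: it is chosen by inspecting the gaps between the ordered absolute slopes and by checking that each cluster keeps enough observations.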

Remark 8. There are three estimation methods for the shape parameters for all sub-samples, given by the following.
(1) Hill Estimator (original Pareto distribution in (1)). The Hill [8] MLE for is defined as where is the th smallest order statistic and is the cut-off point.
(2) Moment Estimator (truncated Pareto distribution in (3)). A moment estimator for can be obtained by solving the following equation: where , .
(3) MLE Method (truncated Pareto distribution in (3)). The Aban MLE [4] for is obtained by solving the following equation: where is the th smallest order statistic,  , .
Note that, before using (26) and (27), the parameters and may be estimated by , , which are used in this paper. There are other nonparametric methods in the literature. Cooke [9] proposed estimators
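For reference, the Hill estimator in its commonly used form can be computed as follows. This sketch assumes the usual definition based on the $r$ largest order statistics, namely the reciprocal of the average log-excess over the $(n-r)$th order statistic; the function name and the synthetic sample are hypothetical.

```python
import math

def hill_estimator(sample, r):
    """Hill estimate of the Pareto shape from the r largest observations."""
    xs = sorted(sample)
    n = len(xs)
    if not 1 <= r < n:
        raise ValueError("r must satisfy 1 <= r < n")
    cutoff = xs[n - r - 1]  # the (n - r)th smallest observation
    mean_log_excess = sum(math.log(x / cutoff) for x in xs[n - r:]) / r
    return 1.0 / mean_log_excess

# Synthetic check: evenly spaced quantiles of a Pareto with shape 2 on [1, inf).
n, true_alpha = 2000, 2.0
synthetic = [(1.0 - (i - 0.5) / n) ** (-1.0 / true_alpha) for i in range(1, n + 1)]
print(hill_estimator(synthetic, 200))
```

The estimate depends on the cut-off `r`, so in practice it is worth examining several values of `r` before fixing one.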

5. Nonparametric Kernel Density Estimation

We apply the kernel density estimation (KDE) method [10], a smoothing technique for estimating distributions. The classical kernel density estimator of the true probability density function (p.d.f.) $f(x)$ from a random sample $X_1, X_2, \ldots, X_n$ is

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right), \tag{29}$$

where $K$ is a symmetric density function and $h > 0$ is a bandwidth.

We will compare the KDE estimator and the CTPD estimator in the next section.
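A minimal sketch of the classical kernel density estimator with a standard normal kernel is given below. The bandwidth helper uses the normal-reference rule $h = 1.06\,\hat{\sigma}\, n^{-1/5}$ from Silverman [10]; whether this is exactly the bandwidth used later in the paper is an assumption, and the function names and sample values are hypothetical.

```python
import math

def normal_reference_bandwidth(sample):
    """Normal-reference bandwidth 1.06 * sd * n**(-1/5) (assumed rule)."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return 1.06 * sd * n ** (-0.2)

def gaussian_kde(sample, h):
    """Return the kernel density estimate as a function of x."""
    n = len(sample)
    scale = n * h * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample) / scale

    return density

# Small artificial sample, not data from the paper.
points = [1.0, 2.0, 2.5, 4.0, 7.0]
h = normal_reference_bandwidth(points)
f_hat = gaussian_kde(points, h)
print(h, f_hat(2.0), f_hat(10.0))
```

A larger bandwidth gives a smoother tail estimate at the cost of blurring local structure, which is the trade-off to keep in mind for heavy-tailed data.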

6. Applications

Now we apply the cluster truncated Pareto distribution to the hurricane example in Section 1.

6.1. Cluster Method

By using 48 two-point slopes in (19) and the seven steps in Section 4, we construct clusters. We select the largest absolute values of the two-point slopes in (20) as since the third largest two-point slope does not have a large change compared with .

Then we determine ’s and groups as where , , , and ; , , , and ; , , and ; ; then we have an estimated CTPD in (16) as This construction is shown in Box 2.

6.2. Kernel and Other Estimation Methods in Log-Log Plot

We apply the kernel estimator in (29) which is normalized to the hurricane data. Here, we use a standard normal kernel and optimal bandwidth [10, p. 45] We ensure that the bandwidth is sufficiently large such that the estimated tail distribution is smooth enough.

Table 1 gives the estimated shape parameter, mean, median, 5% Value-at-Risk (VaR), and 1% VaR for each estimation method. We note that the cluster method and the kernel method give the largest medians. The cluster method gives the smallest VaRs.


Table 1

Method        Shape parameter             Mean (billion)  Median (billion)  5% VaR (billion)  1% VaR (billion)
Pareto(Hill)  0.8126                                      8.68              147.68            1070.30
TPD(Aban)     0.6206                      21.10           9.73              85.15             136.17
TPD(moment)   0.6476                      20.48           9.47              82.55             134.90
Kernel                                    19.88           12.88             80.58             156.14
Cluster       $\hat{\alpha}_1$ = 0.6476   27.65           11.19             77.24             132.41
              $\hat{\alpha}_2$ = 5.6498
              $\hat{\alpha}_3$ = 0.8416
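The median and VaR entries for the Pareto(Hill) row can be related to the Pareto quantile function, with the 5% and 1% VaR read as the 0.95 and 0.99 quantiles. The sketch below is only a plausibility check: the shape 0.8126 is taken from the table, but the lower bound of 3.7 (billion) is an assumption back-solved from the reported median and is not stated in the text. The function name is hypothetical.

```python
def pareto_quantile(u, alpha, gamma):
    """Quantile of a Pareto distribution with shape alpha and lower bound gamma."""
    return gamma * (1.0 - u) ** (-1.0 / alpha)

alpha_hill = 0.8126    # Hill estimate reported for the hurricane data
assumed_gamma = 3.7    # ASSUMPTION: lower bound in billions, inferred, not stated

median = pareto_quantile(0.50, alpha_hill, assumed_gamma)
var_5 = pareto_quantile(0.95, alpha_hill, assumed_gamma)
var_1 = pareto_quantile(0.99, alpha_hill, assumed_gamma)
print(median, var_5, var_1)
```

Under this assumption the three quantiles come out close to the tabulated 8.68, 147.68, and 1070.30 billion, which illustrates how quickly the untruncated Pareto tail grows compared with the truncated and cluster models.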

Figure 3 is a log-log plot which exhibits data and five estimated distribution curves. We note that the original Pareto distribution does not fit data well in the right tail. The moment and Aban estimated truncated Pareto distribution (TPD) fit data well in the right tail, but not so well in the smaller or middle values data. The cluster truncated Pareto distribution overcomes this problem and has the best fit to the data over the whole range. Figure 3 suggests a single distribution may not totally represent how natural data is distributed. We may consider grouping data by using the cluster method. Figure 3 also shows that the nonparametric kernel estimated distribution fits the data well.

Figure 3 provides a visual observation. It is necessary to run goodness-of-fit tests mathematically to confirm which estimated distribution best fits the hurricane data.

6.3. Goodness-of-Fit Tests

In this section we conduct three goodness-of-fit tests: Kolmogorov-Smirnov, Anderson-Darling, and Cramer-von Mises. All three tests are based on the distance between the empirical distribution function and the proposed distribution function: the original Pareto distribution in (1), the truncated Pareto distribution in (3), or the mixture truncated Pareto distribution in (10).

Each test considers the same null and alternative hypotheses, $H_0: F = F_0$ versus $H_1: F \ne F_0$, where $F$ is the unknown true distribution of the sample data and $F_0$ is one of our five estimated distributions:
(1) Pareto distribution in (1) with Hill estimator in (25);
(2) truncated Pareto distribution (TPD) in (3) with Aban estimator in (27);
(3) truncated Pareto distribution (TPD) in (3) with moment estimator in (26);
(4) kernel estimated distribution in (29);
(5) cluster truncated Pareto distribution in (16) with moment estimator in (26).
We ran a test for each estimated distribution as $F_0$.

(1) The Kolmogorov-Smirnov (K-S) Test [11]. The test statistic is given by where is the empirical distribution function. Under the two-tailed value for the K-S test is as follows: where is the integer part of .

(2) Anderson and Darling Test (A-D Test). Anderson and Darling [12] introduced a measure of “distance” between the empirical distribution and the proposed c.d.f. by using a metric function space, where , with . Let ; assume under the test statistic and value are given by where is the observed value of    and  ,  , is the Gamma function.

(3) Cramer-von Mises Test (C-v-M Test). Anderson and Darling [12] proposed this test by using in (37). Thus under the test statistic and value are given by where is the modified Bessel function of the second kind,
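The three test statistics can be computed directly from the ordered sample. The sketch below uses the standard computing formulas for the Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling statistics against a fully specified c.d.f.; it does not reproduce the p-value formulas, and the function names and the uniform example are hypothetical.

```python
import math

def ks_statistic(sample, cdf):
    """Largest distance between the empirical c.d.f. and cdf."""
    xs = sorted(sample)
    n = len(xs)
    return max(max(i / n - cdf(x), cdf(x) - (i - 1) / n)
               for i, x in enumerate(xs, start=1))

def cvm_statistic(sample, cdf):
    """Cramer-von Mises statistic W^2."""
    xs = sorted(sample)
    n = len(xs)
    return 1.0 / (12.0 * n) + sum((cdf(x) - (2 * i - 1) / (2.0 * n)) ** 2
                                  for i, x in enumerate(xs, start=1))

def ad_statistic(sample, cdf):
    """Anderson-Darling statistic A^2; requires 0 < cdf(x) < 1 on the sample."""
    xs = sorted(sample)
    n = len(xs)
    total = sum((2 * i - 1) * (math.log(cdf(xs[i - 1])) +
                               math.log(1.0 - cdf(xs[n - i])))
                for i in range(1, n + 1))
    return -n - total / n

# Illustration with a uniform(0, 1) null and an evenly spaced sample.
def uniform_cdf(x):
    return min(max(x, 0.0), 1.0)

grid = [(i - 0.5) / 10 for i in range(1, 11)]
print(ks_statistic(grid, uniform_cdf),
      cvm_statistic(grid, uniform_cdf),
      ad_statistic(grid, uniform_cdf))
```

The Anderson-Darling statistic weights the tails more heavily than the other two, which is why it can rank the fitted models differently from the K-S statistic.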

Table 2 gives the test statistics and p-values of the three goodness-of-fit tests. Note that the cluster truncated Pareto distribution has the smallest test statistic in the K-S test (i.e., the smallest error) and the largest p-values. The kernel estimated distribution gives the smallest test statistics in the A-D and C-v-M tests. This means that the cluster truncated Pareto distribution and the kernel estimated distribution have the best fit to the hurricane data.


Table 2: Goodness-of-fit tests

Method        K-S test                 A-D test                 C-v-M test
              Test statistic  p-value  Test statistic  p-value  Test statistic  p-value
Pareto(Hill)  0.1340          0.2900   2.7141          0.0383   0.2057          0.2568
TPD(Aban)     0.0948          0.6282   2.3126          0.0622   0.0964          0.6030
TPD(moment)   0.1053          0.5308   2.3672          0.0582   0.1095          0.5402
Kernel
Cluster

Note. In this paper, we use (1) to denote the best values and (2) to denote the second-best values in the tables.

In Table 3, we took the largest data in the sample. The absolute error and integrated error are defined by


Table 3: Goodness-of-fit tests

Method        Absolute error (AE)       Integrated error (IE)
Pareto(Hill)  0.1340  0.0584  0.0584    0.0032  0.0027  0.0027
TPD(Aban)     0.0948  0.0839  0.0832    0.0020  0.0018  0.0016
TPD(moment)   0.1053  0.0738  0.0737    0.0019  0.0015  0.0013
Kernel
Cluster

Table 3 gives absolute errors and integrated errors of the five estimation methods in cases. We note that the cluster truncated Pareto distribution has the smallest errors in all 6 cases. This means the cluster method is superior in fitting the hurricane data compared with the other existing methods.

6.4. Fishing Example

Another example is determining the chance of catching a fish of record length (size). Overfishing is a serious issue that has been known to collapse many fish populations. To help control the population, limits are set upon anglers to determine the size of fish which can be kept.

The data are from a fishing trip, May 29–June 3, 2011, to Buzzard’s Bay of the Cape Cod area in Massachusetts, USA, by Coia [13]. The 72 largest black sea bass lengths were measured out of a total of 326 black sea bass caught on the trip, using 43.0 cm as a lower-limit threshold. The threshold of 43.0 cm was chosen to be conservative. When sampling, locations producing smaller fish were avoided and the largest fish were targeted. A time-series plot in Figure 4 shows the lengths of these 72 largest fish in the order in which they were caught.

By using 71 two-point slopes in (19) and the seven steps in Section 4, we construct clusters. We select the largest absolute values of the two-point slopes in (20) as since the fourth largest two-point slope does not have a large change compared with .

Then we determine ’s and groups as where , , , , and ; , , , , and ; , , , and ; ;   then we have an estimated CTPD in (16) as This construction is shown in Box 3.

We also apply the kernel estimator in (29) and bandwidth in (33) to the fish data. Figure 5 is a log-log plot which exhibits data and five estimated distribution curves. We note that the original Pareto distribution does not fit the data well in the right tail. The moment and Aban estimated truncated Pareto distribution (TPD) fit the data well in the right tail, but not so well in the smaller or middle values data. The cluster truncated Pareto distribution overcomes this problem and has the best fit to the data over the whole range. Figure 5 suggests a single distribution may not totally represent how natural data is distributed. We group the data by using the cluster method. Figure 5 also shows that the nonparametric kernel estimated distribution fits the data well.

Table 4 gives the estimated shape parameter, mean, median, 5% VaR, and 1% VaR for each estimation method. We note that the cluster method and the kernel method give the largest medians and the smallest VaRs.


Table 4

Method        Shape parameter               Mean (cm)  Median (cm)  5% VaR (cm)  1% VaR (cm)
Pareto(Hill)  16.3582                                  44.8612      51.6419      56.9812
TPD(Aban)     15.6493                       45.7437    44.9108      51.3511      54.7664
TPD(moment)   15.7782                       45.7250    44.8961      51.2993      54.7273
Kernel                                      45.7484    45.4767      49.3010      54.5943
Cluster       $\hat{\alpha}_1$ = 2.7781     43.4179    45.2801      50.3432      54.0029
              $\hat{\alpha}_2$ = 162.6216
              $\hat{\alpha}_3$ = 64.5429
              $\hat{\alpha}_4$ = 17.0537

We also ran goodness-of-fit tests mathematically to confirm which estimated distribution best fits the fish data.

Table 5 gives the test statistics and p-values of the three goodness-of-fit tests. We note that the cluster truncated Pareto distribution has the smallest test statistic (i.e., the smallest error) and the largest p-value in the K-S test. The kernel estimated distribution has the smallest test statistics in the A-D and C-v-M tests. Thus the cluster truncated Pareto distribution and the kernel estimated distribution fit the fish data best.


Table 5: Goodness-of-fit tests

Method        K-S test                 A-D test                 C-v-M test
              Test statistic  p-value  Test statistic  p-value  Test statistic  p-value
Pareto(Hill)  0.1496          0.0702   2.8568          0.0324   0.4956          0.0409
TPD(Aban)     0.1407          0.1021   2.8413          0.0330   0.4442          0.0554
TPD(moment)   0.1434          0.0916   2.8672          0.0320   0.4560          0.0516
Kernel
Cluster

Table 6 gives absolute errors and integrated errors of five estimation methods in cases. We note that the cluster truncated Pareto distribution and the kernel estimated distribution have the smallest errors in all 6 cases. The cluster method and the kernel estimation method are superior in fitting the fish data compared with the other existing semiparametric estimation methods.


Table 6: Goodness-of-fit tests

Method        Absolute error (AE)       Integrated error (IE)
Pareto(Hill)  0.1496  0.1496  0.0739    0.0154  0.0148  0.0109
TPD(Aban)     0.1407  0.1407  0.0770    0.0149  0.0144  0.0108
TPD(moment)   0.1434  0.1434  0.0747    0.0149  0.0143  0.0105
Kernel
Cluster

7. Conclusions

Overall, from the studies in this paper, we may conclude the following.
(1) Truncated Pareto models are useful for analyzing real-world data.
(2) Cluster truncated Pareto models are useful for grouped data.
(3) The two-point slope technique is innovative and useful for determining the threshold points in clustering.
(4) A nonparametric kernel estimated distribution fits heavy-tailed data very well.
(5) It is difficult to determine the number of groups, which is based on the two-point slopes in (20) in Step 2 of the construction of the CTPD in (16). This number depends heavily on empirical observations. We demonstrated this technique in the two examples in Section 6. We plan to do further studies on more complex data sets in the future.

Acknowledgment

The authors thank the referees and the editor for their comments which helped to improve the paper. This research is supported by the Natural Sciences and Engineering Research Council of Canada.

References

  1. C. Kleiber and S. Kotz, Statistical Size Distributions in Economics and Actuarial Sciences, Wiley Series in Probability and Statistics, Wiley-Interscience, Hoboken, NJ, USA, 2003.
  2. R. Perline, “Strong, weak and false inverse power laws,” Statistical Science, vol. 20, no. 1, pp. 68–88, 2005.
  3. M. A. Beg, “Estimation of the tail probability of the truncated Pareto distribution,” Journal of Information & Optimization Sciences, vol. 2, no. 2, pp. 192–198, 1981.
  4. I. B. Aban, M. M. Meerschaert, and A. K. Panorska, “Parameter estimation for the truncated Pareto distribution,” Journal of the American Statistical Association, vol. 101, no. 473, pp. 270–277, 2006.
  5. M. L. Huang and K. Zhao, “On estimation of the truncated Pareto distribution,” Advances and Applications in Statistics, vol. 16, no. 1, pp. 83–102, 2010.
  6. R. A. Pielke, J. Gratz, C. W. Landsea, D. Collins, M. A. Saunders, and R. Musulin, “Normalized hurricane damage in the United States: 1900–2005,” Natural Hazards Review, vol. 9, no. 1, pp. 29–42, 2008.
  7. V. Coia and M. L. Huang, “A sieve model for extreme values,” Journal of Statistical Computation and Simulation, vol. 83, 2013.
  8. B. M. Hill, “A simple general approach to inference about the tail of a distribution,” The Annals of Statistics, vol. 3, no. 5, pp. 1163–1174, 1975.
  9. P. Cooke, “Statistical inference for bounds of random variables,” Biometrika, vol. 66, no. 2, pp. 367–374, 1979.
  10. B. W. Silverman, Density Estimation for Statistics and Data Analysis, Monographs on Statistics and Applied Probability, Chapman & Hall, London, UK, 1986.
  11. A. N. Kolmogorov, “Sulla determinazione empirica di una legge di distribuzione,” Giornale dell'Istituto Italiano degli Attuari, vol. 4, pp. 83–91, 1933.
  12. T. W. Anderson and D. A. Darling, “Asymptotic theory of certain “goodness of fit” criteria based on stochastic processes,” Annals of Mathematical Statistics, vol. 23, pp. 193–212, 1952.
  13. V. Coia, “On estimation of extreme value distributions,” Brock Report in Mathematics and Statistics 120809-01, Brock University, 2012.

Copyright © 2013 Mei Ling Huang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

