Abstract

This article considers the estimation of the unknown numerical parameters and of the density of the base measure in a Poisson-Dirichlet process prior with grouped, monotone missing data. The numerical parameters are estimated by maximum likelihood and the density function is estimated by the kernel method. A set of simulations was conducted, which shows that the estimates perform well.

1. Introduction

As a young but fast-growing field of statistics, Bayesian nonparametrics (abbreviated as BNP below) focuses on Bayesian solutions of nonparametric and other infinite-dimensional statistical models. Compared with frequentist statistics and classical Bayesian statistics, BNP provides highly flexible and robust models for infinite-dimensional parameter spaces. The most extensively investigated priors in BNP include the famous Dirichlet process prior [1] and the Polya tree process prior [2], which have played fundamental roles in the development of Bayesian nonparametrics.

Dirichlet processes are also referred to as one-parameter Poisson-Dirichlet processes, after Kingman [6]. As the first and foremost generalization of Dirichlet processes, two-parameter Poisson-Dirichlet processes (abbreviated as Poisson-Dirichlet processes below) were first discussed by Pitman and Yor [7] and have since seen great success in Bayesian nonparametric modeling of language, images, ecology, biology, genomics, and so on. Remarkable examples include the following: Goldwater et al. [8] used a Poisson-Dirichlet process as an adaptor to justify the appearance of type frequencies in formal analyses of natural language and improved the performance of an earlier model for unsupervised learning of morphology; Sudderth and Jordan [9] modeled object frequencies and segment sizes by Poisson-Dirichlet processes and developed a statistical framework for the simultaneous, unsupervised segmentation and discovery of visual object categories from image databases; Favaro et al. [10] used a Poisson-Dirichlet model to deal with the issue of prediction within species sampling problems; and Hoshino [11] studied microdata disclosure risk assessment with Pitman's sampling formula, clarified some of its theoretical implications, and compared various models based on the Akaike Information Criterion by applying them to real data sets. For more references on the application of Poisson-Dirichlet processes in the area of language learning, the reader is referred to Johnson et al. [12], Wood and Teh [13], and Wallach et al. [14].

While exact Bayesian methods rest on the assumption that prior distributions are completely specified, empirical Bayesian methods deal with situations where prior distributions are at most partially specified and thus need to be estimated. Empirical Bayesian methods for parametric and semiparametric models have been investigated in a huge volume of literature. However, the study of empirical Bayesian methods in the framework of Bayesian nonparametrics is quite limited. A recent contribution is that of Yang and Wu [15], who studied the estimation of the prior with monotonically missing data when the prior is a Dirichlet process.

In this paper, we aim at estimating the unknown numerical parameters and the density of the base measure from independent and identically distributed (i.i.d.) groups of observations with a Poisson-Dirichlet process prior. Because the Dirichlet process prior is a special case of the Poisson-Dirichlet process prior, we in fact extend the methodology of Yang and Wu [15] to a larger model. The estimation of the unknown parameters is carried out by two different methods, the maximum likelihood method and a naive method proposed by Carlton [16], whose performances are compared in a simulation study.

Because there are two numerical parameters in Poisson-Dirichlet process priors, the maximum likelihood estimates (MLEs) of the unknown parameters are discussed under three different settings (see the next section for the definition of the parameters $\alpha$ and $\theta$ and the density function $h$): (i) the discount parameter $\alpha$ is unknown but the concentration parameter $\theta$ is known; (ii) the concentration parameter $\theta$ is unknown but the discount parameter $\alpha$ is known; and (iii) both $\alpha$ and $\theta$ are unknown. Favaro et al. [10] gave the empirical Bayes estimates when both $\alpha$ and $\theta$ are unknown on complete data without missingness. Comparisons between the estimates of Favaro et al. [10] and ours are presented for the same sample size in terms of bias, standard deviation (SD), and mean squared error (MSE).

The remainder of this paper is structured as follows. In Section 2, we review the basic model and the definition of Poisson-Dirichlet process priors. Data structure and model assumptions are also described in this section. In Section 3, the MLEs of the prior parameters are discussed in detail under the three aforementioned situations. A naive estimate for the discount parameter is also discussed. Section 4 discusses the estimation of the base distribution density by the kernel method. Section 5 presents a small simulation study to show the performance of the estimates discussed in Section 3.

2. The Data and Model

The data are observed over $T$ time periods and accordingly organized in groups $i = 1, \ldots, T$, with $i$ representing the calendar time point at which the individuals in this group begin to be observed. Group $i$ contains $n_i$ individuals. Denote by $X_{ij}$ the $j$th individual in group $i$, which is represented by a random vector $X_{ij} = (X_{ij1}, \ldots, X_{ijT})$ with $X_{ijt}$ being the $t$th coordinate of individual $X_{ij}$. The observations are thus $T$-dimensional vectors whose coordinates are observed sequentially in time, so that the observations are subject to monotone missingness. A clearer picture of the data structure is exhibited in Table 1.

Hence, the observed data up to time $T$ are monotone missing; that is, the components of an individual are ordered in such a way that if an observation of a variable for individual $X_{ij}$ is missing, then so are the observations of all subsequent variables (if any) for the same individual. Clearly, Table 1 indicates that all $T$ variables for individuals in the first group are completely observed and only the first $T - i + 1$ components are observable for individuals in group $i$.

Data structured as in Table 1 are frequently encountered in the real world. A typical example is loss data for claims used for the purpose of loss reserving in the non-life insurance industry (see, e.g., [15, 17, 18]). Assume that the evaluation time of loss reserving is accident year $T$; each claim made in accident year $i$ is paid at the end of each of the development years after the claim (the first payment is the one made at the end of the year the claim is made), so that the payments of an individual correspond to a $T$-vector $X_{ij} = (X_{ij1}, \ldots, X_{ijT})$. Then, observing at the end of the evaluation year $T$, the payments for a claim in accident year $i$ have been made only for those $t$ such that $i + t - 1 \le T$; that is, the observable payments of claim $X_{ij}$ are just the subvector $X_{ij}^{o} = (X_{ij1}, \ldots, X_{ij,T-i+1})$, where the superscript "$o$" means "observable." Other examples of data with this structure can be found in Marini et al. [19], who discussed maximum likelihood estimation in panel studies for the first time, Hao and Krishnamoorthy [20], who considered testing and estimation problems, and Raats et al. [21], who discussed estimation and testing for the multivariate regression model, among a large number of others concerned with monotone missing data.

Before presenting the probabilistic characteristics of the data under a BNP framework in Assumption 2, we recall the definition of Poisson-Dirichlet processes in terms of stick-breaking. Let $(\alpha, \theta)$ be a pair of real numbers satisfying $0 \le \alpha < 1$ and $\theta > -\alpha$, a restriction that is always assumed and hence will not be repeated everywhere.

Definition 1. Let $(V_k)_{k \ge 1}$ be a sequence of mutually independent random variables with $V_k \sim \mathrm{Beta}(1 - \alpha, \theta + k\alpha)$, and set $P_1 = V_1$ and $P_k = V_k \prod_{i=1}^{k-1}(1 - V_i)$ for $k \ge 2$, so that $\sum_{k=1}^{\infty} P_k = 1$ almost surely. Then the ranked sequence $(\tilde P_1 \ge \tilde P_2 \ge \cdots)$ of $(P_k)_{k \ge 1}$ is said to be Poisson-Dirichlet distributed, writing $(\tilde P_k)_{k \ge 1} \sim \mathrm{PD}(\alpha, \theta)$, where $\alpha$ and $\theta$ are called the discount and concentration parameter, respectively.
Let $(Z_k)_{k \ge 1}$ be a sequence of random variables independent and identically distributed as a probability measure $H$ (called the base distribution below), taking values in a measurable space $(\mathcal{X}, \mathcal{B})$, and independent of $(P_k)_{k \ge 1}$. Then the random probability measure
$$P = \sum_{k=1}^{\infty} P_k\, \delta_{Z_k}$$
on $(\mathcal{X}, \mathcal{B})$ is referred to as a Poisson-Dirichlet process (indexed by $(\mathcal{X}, \mathcal{B})$) with parameters $(\alpha, \theta)$ and base distribution $H$ or, in symbols, $P \sim \mathrm{PD}(\alpha, \theta, H)$.

Clearly, for given $(\alpha, \theta)$ and $H$, a realization of the Poisson-Dirichlet process is a discrete distribution with mass $P_k$ at $Z_k$ in the domain $\mathcal{X}$. Pitman and Yor [7] discussed Poisson-Dirichlet processes in detail and generalized a number of well-known properties of Dirichlet processes. Carlton [16] gave methods to estimate the parameters with completely observed data.
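For a computational view, the following minimal sketch (in Python, with NumPy) draws a truncated realization of $P$ by the stick-breaking recipe of Definition 1; the truncation level n_atoms and the helper name stick_breaking_pd are our own choices, not part of the model.

```python
import numpy as np

def stick_breaking_pd(alpha, theta, base_sampler, n_atoms=1000, seed=None):
    """Truncated stick-breaking realization of PD(alpha, theta, H).

    V_k ~ Beta(1 - alpha, theta + k*alpha) independently, and the weight
    of atom k is P_k = V_k * prod_{i<k} (1 - V_i); the atoms Z_k are
    i.i.d. draws from the base distribution H.  Truncating at n_atoms
    is a numerical approximation: the exact process has infinitely many
    atoms.
    """
    rng = np.random.default_rng(seed)
    k = np.arange(1, n_atoms + 1)
    v = rng.beta(1.0 - alpha, theta + k * alpha)            # stick proportions
    weights = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    atoms = base_sampler(rng, n_atoms)                      # Z_k ~ H, i.i.d.
    return atoms, weights

# Example: a (truncated) draw from PD(0.3, 1.0) with standard normal base.
atoms, weights = stick_breaking_pd(
    0.3, 1.0, lambda rng, n: rng.standard_normal(n), seed=1)
```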

Assumption 2. (a) For every $i$, the data in group $i$ have a Poisson-Dirichlet structure: $P_i \sim \mathrm{PD}(\alpha, \theta, H)$ and, given $P_i$, the data $X_{i1}, \ldots, X_{in_i}$ are i.i.d. with distribution $P_i$. (b) The random elements $(P_i, X_{i1}, \ldots, X_{in_i})$, $i = 1, \ldots, T$, are mutually independent over $i$. (c) The probability measure $H$ has a continuous density function, denoted also by $h$, with respect to the $T$-dimensional Lebesgue measure on $\mathbb{R}^T$.

The intuition behind the assumption is as follows. In many situations, the data are recorded chronologically, so that the first subscript $i$ of $X_{ijt}$ reflects the calendar time when the data vector begins to be recorded and the third subscript $t$ represents the calendar time at which component $X_{ijt}$ is recorded; hence dependence between the data beginning at the same calendar time and independence across calendar times are reasonable. This structure is typical of loss reserving in general insurance with individual data (see, e.g., Huang et al. [17, 18]).

Given the above data structure and model assumption, the objective of this paper is to estimate the unknown parameters $\alpha$ and $\theta$ and the density $h$ of the base measure.

3. Estimation of Parameters

The unknown parameters of the distribution of a Poisson-Dirichlet process can generally be estimated by maximum likelihood, as shown in Section 3.2. To do so under the data structure of Table 1, we first introduce some necessary preliminary results for Poisson-Dirichlet processes in Section 3.1.

3.1. Preliminaries

In the following, write $x^{(t)} = (x_1, \ldots, x_t)$ for the first $t$ components of $x = (x_1, \ldots, x_T)$, and $P^{(t)}$ and $H^{(t)}$, respectively, for the marginal distributions of $P$ and $H$ on the first $t$-dimensional coordinate subspace of $\mathbb{R}^T$. The following lemma shows that if the observations have a Poisson-Dirichlet structure, then so do their subvectors.

Lemma 3. If the observations satisfy $P_i \sim \mathrm{PD}(\alpha, \theta, H)$ and, given $P_i$, $X_{i1}, \ldots, X_{in_i}$ are i.i.d. with distribution $P_i$, then, for every $t \le T$, given $P_i^{(t)}$, the partial observations $X_{i1}^{(t)}, \ldots, X_{in_i}^{(t)}$ are i.i.d. with distribution $P_i^{(t)}$ and $P_i^{(t)} \sim \mathrm{PD}(\alpha, \theta, H^{(t)})$.

Proof. The assertion that, given $P_i^{(t)}$, the $X_{ij}^{(t)}$ are i.i.d. with distribution $P_i^{(t)}$ follows immediately from the definition of $P_i^{(t)}$. For any measurable set $B$ of the first $t$-dimensional coordinate subspace, we have
$$P_i^{(t)}(B) = P_i\big(B \times \mathbb{R}^{T-t}\big) = \sum_{k=1}^{\infty} P_{ik}\,\delta_{Z_{ik}}\big(B \times \mathbb{R}^{T-t}\big) = \sum_{k=1}^{\infty} P_{ik}\,\delta_{Z_{ik}^{(t)}}(B),$$
where the atoms $Z_{ik}^{(t)}$ are i.i.d. with distribution $H^{(t)}$. It indicates the desired result $P_i^{(t)} \sim \mathrm{PD}(\alpha, \theta, H^{(t)})$.

The next lemma shows that, with a Poisson-Dirichlet process structure, the observations of any two individuals are identical with probability 1 if and only if so are their corresponding components. This provides an easy and convenient way to judge whether any two observations are the same from any single (e.g., the first) component.

Lemma 4. For any $j \ne l$, the equality $X_{ij1} = X_{il1}$ implies that $X_{ijt} = X_{ilt}$ for all $t \le T$ almost surely.

Proof. We verify the claim for a fixed pair $j \ne l$, and the argument applies for any $t$. By the definition of the Poisson-Dirichlet process, we take $P_i = \sum_k P_{ik}\delta_{Z_{ik}}$ with an i.i.d. sequence $(Z_{ik})$ drawn from $H$. Because $H$ is continuous, we have $Z_{ik} \ne Z_{ik'}$ for $k \ne k'$ almost surely, so that $X_{ij} = X_{il}$ exactly when the two individuals select the same atom. Therefore, a corollary of Pitman and Yor [7] implies that
$$\Pr\{X_{ij} = X_{il}\} = E\Big[\sum_{k=1}^{\infty} P_{ik}^2\Big] = \frac{1-\alpha}{1+\theta},$$
where the expectation is taken over the weights $(P_{ik})$. On the other hand, because Lemma 3 implies that $X_{ij1}$ and $X_{il1}$ are two observations from a PD structure with the same parameters $\alpha$ and $\theta$ and base distribution $H^{(1)}$, a similar discussion implies that $\Pr\{X_{ij1} = X_{il1}\} = (1-\alpha)/(1+\theta)$. Therefore, since $\{X_{ij} = X_{il}\} \subseteq \{X_{ij1} = X_{il1}\}$ and the two events have the same probability,
$$\Pr\{X_{ij} = X_{il} \mid X_{ij1} = X_{il1}\} = 1.$$
The proof is thus completed.
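Lemma 4 is what makes the statistics of Section 3.2 computable from partially missing data: the cluster structure of a group can be read off from the first components alone. A minimal sketch (the function names and array layout are our own):

```python
import numpy as np

def cluster_sizes(first_components):
    """Cluster sizes of one group, computed from the first components
    only; by Lemma 4 two individuals agree in every coordinate exactly
    when they agree in the first (almost surely)."""
    _, counts = np.unique(np.asarray(first_components), return_counts=True)
    return counts                     # multiplicities of the K_i clusters

def cluster_statistics(first_components):
    """The statistics used in Section 3.2: K_i (number of clusters) and
    the counts M_ir (number of clusters of size r), as a dict r -> M_ir."""
    sizes = cluster_sizes(first_components)
    r, m = np.unique(sizes, return_counts=True)
    return len(sizes), dict(zip(r.tolist(), m.tolist()))
```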

3.2. Maximum Likelihood Estimation of the Unknown Parameters

For $i = 1, \ldots, T$, let $K_i$ be the number of distinct observations among the $n_i$ individuals of the $i$th group. Each distinct observation represents a species or cluster in the biological perspective [7]. Let $M_{ir}$ be the number of those clusters, each of which appears exactly $r$ times in group $i$. For example, if $M_{ir} = 3$ for some $r$, then there are in total 3 different species, each of which appears just $r$ times in group $i$. Clearly,

there might exist many $r$ with $M_{ir} = 0$,

and, generally,
$$\sum_{r=1}^{n_i} M_{ir} = K_i, \qquad \sum_{r=1}^{n_i} r\,M_{ir} = n_i.$$
Write $\mathbf{K} = (K_1, \ldots, K_T)$ and $\mathbf{M} = (M_{ir}: 1 \le r \le n_i,\ 1 \le i \le T)$. Similar to Carlton [16], who used $(\mathbf{K}, \mathbf{M})$ to estimate the unknown parameters by the maximum likelihood method when individuals are completely observed, we here discuss the MLE of $(\alpha, \theta)$ in terms of the random variables $(\mathbf{K}, \mathbf{M})$. Under Assumption 2, the log-likelihood function for $(\alpha, \theta)$ given the observations in all the $T$ groups is
$$\ell(\alpha, \theta) = \sum_{i=1}^{T}\Bigg[\sum_{k=1}^{K_i-1}\log(\theta + k\alpha) - \sum_{m=1}^{n_i-1}\log(\theta + m) + \sum_{r \ge 2} M_{ir}\sum_{m=1}^{r-1}\log(m - \alpha)\Bigg] + C,$$
where $C$ is irrelevant to the parameter $(\alpha, \theta)$.
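Under Pitman's sampling formula the log-likelihood above is a simple sum over groups; the following sketch evaluates it from the cluster-size vectors (for instance as returned by cluster_sizes above), dropping the constant $C$. The function name and data layout are our own.

```python
import numpy as np

def pd_loglik(alpha, theta, groups):
    """Log-likelihood of (alpha, theta) up to the additive constant C.

    `groups` is a list with one entry per group: the array of cluster
    sizes (multiplicities of the distinct observations) in that group.
    """
    ll = 0.0
    for sizes in groups:
        sizes = np.asarray(sizes)
        n, k = int(sizes.sum()), len(sizes)
        ll += sum(np.log(theta + j * alpha) for j in range(1, k))
        ll -= sum(np.log(theta + m) for m in range(1, n))
        for r in sizes:                 # looping over clusters = summing M_ir terms
            ll += sum(np.log(m - alpha) for m in range(1, int(r)))
    return ll
```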

The estimation is analyzed in three different situations, as in Carlton [16]:

(i) $\alpha$ is unknown and $\theta$ is known.

(ii) $\alpha$ is known and $\theta$ is unknown.

(iii) Both $\alpha$ and $\theta$ are unknown.

Remark 5. In the following, we assume that $1 < K_i < n_i$ for some $i$. Note that, in group $i$, $K_i = 1$ means that all observations are the same and $K_i = n_i$ means that all the observations are different from one another. Thus, if $K_i = 1$ or $K_i = n_i$ for all $i$, the MLE will be attained at the boundary of the parameter space. However, the joint events $\{K_i = 1 \text{ for all } i\}$ and $\{K_i = n_i \text{ for all } i\}$ become increasingly unlikely as the $n_i$ tend to infinity; see Chapter 3 of Carlton [16].

3.2.1. Estimate $\alpha$ with Known $\theta$

In this case, note that the score function
$$h(\alpha) := \frac{\partial \ell}{\partial \alpha} = \sum_{i=1}^{T}\Bigg[\sum_{k=1}^{K_i-1}\frac{k}{\theta + k\alpha} - \sum_{r \ge 2} M_{ir}\sum_{m=1}^{r-1}\frac{1}{m - \alpha}\Bigg]$$
is a decreasing function of $\alpha$ and $h(\alpha) \to -\infty$ as $\alpha \to 1^-$. The equation $h(\alpha) = 0$ thus has a unique solution in $(0, 1)$ if $h(0^+) > 0$, as stated in the theorem below.

Theorem 6. The MLE $\hat\alpha_{\mathrm{mle}}$ of $\alpha$ is unique and lies in $(0, 1)$ provided that
$$\sum_{i=1}^{T}\frac{K_i(K_i - 1)}{2\theta} > \sum_{i=1}^{T}\sum_{r \ge 2} M_{ir}\,\big(\psi(r) + \gamma\big),$$
where $\psi$ is the digamma function (so that $\psi(r) + \gamma = \sum_{m=1}^{r-1} 1/m$) and $\gamma$ is Euler's constant.

Proof. Note that
$$\frac{\partial^2 \ell}{\partial \alpha^2} = -\sum_{i=1}^{T}\Bigg[\sum_{k=1}^{K_i-1}\frac{k^2}{(\theta + k\alpha)^2} + \sum_{r \ge 2} M_{ir}\sum_{m=1}^{r-1}\frac{1}{(m - \alpha)^2}\Bigg] < 0,$$
so that $h$ is strictly decreasing on $(0, 1)$; since $h(\alpha) \to -\infty$ as $\alpha \to 1^-$ and $h(0^+)$ equals the difference of the two sums in the displayed condition, $h$ has exactly one root in $(0, 1)$ under that condition. The proof is thus completed.
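Since $h$ is strictly decreasing, the root of Theorem 6 can be bracketed on $(0, 1)$ and located by any one-dimensional root finder; a sketch using SciPy's Brent method follows (brentq raises an error when the sign condition of Theorem 6 fails, which signals a boundary MLE as in Remark 5):

```python
from scipy.optimize import brentq

def mle_alpha(theta, groups, eps=1e-8):
    """MLE of alpha for known theta > 0: the unique root of the score
    h(alpha) on (0, 1) under the condition of Theorem 6."""
    def score(alpha):
        s = 0.0
        for sizes in groups:
            s += sum(j / (theta + j * alpha) for j in range(1, len(sizes)))
            for r in sizes:             # each cluster of size r contributes
                s -= sum(1.0 / (m - alpha) for m in range(1, int(r)))
        return s
    return brentq(score, eps, 1.0 - eps)
```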

3.2.2. Estimate $\theta$ with Known $\alpha$

If $\alpha = 0$, then the Poisson-Dirichlet process reduces to a Dirichlet process, for which the estimation problem of $\theta$ is treated by Yang and Wu [15]. We thus consider only the case of $0 < \alpha < 1$. By the log-likelihood function given above, we have
$$g(\theta) := \frac{\partial \ell}{\partial \theta} = \sum_{i=1}^{T}\Bigg[\sum_{k=1}^{K_i-1}\frac{1}{\theta + k\alpha} - \sum_{m=1}^{n_i-1}\frac{1}{\theta + m}\Bigg].$$
A condition under which the MLE is unique is presented in the following theorem.

Theorem 7. Set $K = \sum_{i=1}^{T} K_i$ and $N = \sum_{i=1}^{T} n_i$. Then the solution to the equation $g(\theta) = 0$ is unique if $K < N$.

Proof. For given $\alpha$, notice that the only variable of $\ell$ is $\theta$; we write $g(\theta) = \partial\ell/\partial\theta$ as above. Further recall that $1/(\theta + x)$ is decreasing in $x > 0$; it follows that every summand of $g$ is bounded for large $\theta$ and consequently $g(\theta) \to 0$ as $\theta \to \infty$, for the second term is limited in the above equation. With some algebraic transformation, we obtain
$$\theta\, g(\theta) \longrightarrow K - N < 0 \quad (\theta \to \infty),$$
which implies that, for sufficiently large $\theta$, $g(\theta) < 0$. On the other hand, $g(\theta) \to +\infty$ as $\theta \downarrow -\alpha$ whenever $K_i > 1$ for some $i$. Then, by the continuity of $g$, the Intermediate Value Theorem implies that there exists $\hat\theta \in (-\alpha, \infty)$ such that $g(\hat\theta) = 0$.
For the uniqueness, it suffices to show that $g'(\theta) < 0$ whenever $g(\theta) = 0$. Write $a_{ik} = (\theta + k\alpha)^{-1}$ and $b_{im} = (\theta + m)^{-1}$, so that $g(\theta) = \sum_{i,k} a_{ik} - \sum_{i,m} b_{im}$ and
$$-g'(\theta) = \sum_{i=1}^{T}\sum_{k=1}^{K_i-1} a_{ik}^2 - \sum_{i=1}^{T}\sum_{m=1}^{n_i-1} b_{im}^2.$$
At a root of $g$ the two families $(a_{ik})$ and $(b_{im})$ have equal sums, while $\alpha < 1$ implies $a_{ik} > b_{ik}$ term by term; comparing the sums of squares, the surplus carried by the larger terms $a_{ik}$ dominates that of the tail terms $b_{im}$, $m \ge K_i$, when $K < N$, which yields $-g'(\theta) > 0$. Hence $g$ is strictly decreasing at every root, and the root is unique.

The solution to $g(\theta) = 0$ is the MLE of $\theta$ and we denote it by $\hat\theta_{\mathrm{mle}}$. When $T = 1$, by Pitman and Yor [7], the distributions $\mathrm{PD}(\alpha, \theta)$ are mutually absolutely continuous when $\alpha$ is fixed and $\theta$ is varying; hence $\hat\theta_{\mathrm{mle}}$ is not a consistent estimator as the size of a single group tends to infinity. Some simulation results of $\hat\theta_{\mathrm{mle}}$ are reported in Section 5 for $\theta > 0$; theoretically $\theta$ can be any value in $(-\alpha, \infty)$.
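Numerically, $\hat\theta_{\mathrm{mle}}$ is again a one-dimensional root-finding problem; a sketch mirroring the one for $\hat\alpha_{\mathrm{mle}}$, where the upper bracket hi is an ad hoc numerical choice:

```python
import numpy as np
from scipy.optimize import brentq

def mle_theta(alpha, groups, hi=1e6):
    """MLE of theta for known alpha in (0, 1): the root of the score
    g(theta) on (-alpha, infinity) from Theorem 7."""
    def score(theta):
        s = 0.0
        for sizes in groups:
            sizes = np.asarray(sizes)
            n, k = int(sizes.sum()), len(sizes)
            s += sum(1.0 / (theta + j * alpha) for j in range(1, k))
            s -= sum(1.0 / (theta + m) for m in range(1, n))
        return s
    return brentq(score, -alpha + 1e-8, hi)
```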

3.2.3. Estimate Both $\alpha$ and $\theta$

As discussed above, the MLEs for $\alpha$ and $\theta$ when both are unknown can be obtained by solving the equations $\partial\ell/\partial\alpha = 0$ and $\partial\ell/\partial\theta = 0$ simultaneously. To guarantee that a solution $(\hat\alpha, \hat\theta)$ of the equations is the MLE, we need it to be such that $\mathbf{H}(\hat\alpha, \hat\theta)$ is negative definite, where $\mathbf{H}$ is the Hessian matrix; that is,
$$\mathbf{H}(\alpha, \theta) = \begin{pmatrix} \dfrac{\partial^2 \ell}{\partial \alpha^2} & \dfrac{\partial^2 \ell}{\partial \alpha\,\partial \theta} \\[1ex] \dfrac{\partial^2 \ell}{\partial \theta\,\partial \alpha} & \dfrac{\partial^2 \ell}{\partial \theta^2} \end{pmatrix}.$$
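In practice the stationary point can be found by any quasi-Newton routine applied to $-\ell$; the sketch below uses SciPy's L-BFGS-B in place of a hand-coded Newton-Raphson iteration and, for simplicity, restricts $\theta > 0$ rather than imposing the joint constraint $\theta > -\alpha$ (a simplification of this sketch). It relies on pd_loglik from the sketch in Section 3.2.

```python
import numpy as np
from scipy.optimize import minimize

def mle_joint(groups, start=(0.3, 1.0)):
    """Joint MLE of (alpha, theta) by numerical maximization of the
    log-likelihood; `start` is an arbitrary interior starting point."""
    neg = lambda p: -pd_loglik(p[0], p[1], groups)
    res = minimize(neg, np.asarray(start, dtype=float), method="L-BFGS-B",
                   bounds=[(1e-6, 1.0 - 1e-6), (1e-6, None)])
    return tuple(res.x)               # (alpha_hat, theta_hat)
```

The negative definiteness of $\mathbf{H}$ at the returned point should still be checked, for example by finite differences.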

3.2.4. Remarks

While we have so far proved the existence of the MLEs under the three different cases, analytical expressions of the estimates are available in none of the cases; thus, the MLEs can only be obtained by numerical methods, for example, the Newton-Raphson iteration or the dichotomy (bisection) method.

In addition, by Carlton [16], a naive estimate of $\alpha$ in terms of merely the observations in group $i$ is $\tilde\alpha_i$, which is consistent as $n_i$ tends to $\infty$. Now, with the $T$ groups of data, we can estimate $\alpha$ by
$$\tilde\alpha = \frac{1}{T}\sum_{i=1}^{T} \tilde\alpha_i,$$
where $\tilde\alpha_i$ is the single-group naive estimate. It is then straightforward from the corresponding lemma in Carlton [16] that both $\tilde\alpha_i$ and $\tilde\alpha$ are strongly consistent for the true value of $\alpha$ as $\min_i n_i \to \infty$.
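Assuming the single-group naive estimate takes the form $\tilde\alpha_i = \log K_i / \log n_i$, the form suggested by Carlton's almost sure growth rate $K_n \asymp n^{\alpha}$ (this concrete form is an assumption on our part), the grouped estimate is a plain average:

```python
import numpy as np

def naive_alpha(groups):
    """Naive estimate of alpha: average over groups of the assumed
    single-group estimate log(K_i) / log(n_i)."""
    return float(np.mean([np.log(len(sizes)) / np.log(np.sum(sizes))
                          for sizes in groups]))
```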

4. The Estimation of the Base Distribution

In this section, we estimate the base distribution $H$ under the monotone missing data structure. In Section 4.1, we pick out a set of i.i.d. observations from $H$ and then estimate the density $h$ of $H$ by the kernel method. With the i.i.d. observations, we can estimate the base density by a multivariate kernel density method similar to the one used by Yang and Wu [15], who studied multivariate density estimation in a Dirichlet process prior under the same missing mechanism and showed that their estimate is superior to that of Titterington and Mill [22] under the asymptotic mean integrated squared error (AMISE) criterion (note that the classical versions of [22] for complete data can be found in, e.g., [23, 24]).

4.1. Deducing i.i.d. Observations from $H$

Denote by $Y_{i1}, \ldots, Y_{iK_i}$ the distinct observations of group $i$. The following theorem shows that, given $K_i$, they are independent and identically $H$-distributed (note that, by Lemmas 3 and 4, under the structure of this model, for every group $i$, though the data are partly missing, the value of $K_i$ can easily be worked out).

Theorem 8. Suppose that $H$ is continuous; then, given $K_i = k$, the distinct observations $Y_{i1}, \ldots, Y_{ik}$ are i.i.d. with distribution $H$.

Proof. The proof can be found in the corresponding theorem in Korwar and Hollander [25].

By Theorem 8, there are $K_i$ i.i.d. observations in group $i$ and hence the number of i.i.d. observations in the whole $T$ groups is $K = \sum_{i=1}^{T} K_i$. Note that the observed parts of the i.i.d. observations, namely,
$$Y_{ij}^{o} = \big(Y_{ij1}, \ldots, Y_{ij,T-i+1}\big), \quad 1 \le j \le K_i,\ 1 \le i \le T,$$
still have a monotone missing structure. For every $t$, write $Y_{ij}^{(t)}$ for the subvector of the first $t$ components of $Y_{ij}$ from individual $j$ of group $i$, and
$$D_t = \big\{Y_{ij}^{(t)}: 1 \le j \le K_i,\ 1 \le i \le T - t + 1\big\}$$
for the maximal set of the subvectors $Y_{ij}^{(t)}$ which are completely observed. Obviously, $D_T$ is made of all the observations with no missing components and $D_1$ is the set of the first components of all the $Y_{ij}$ in the $T$ groups. Similarly, define
$$N_t = |D_t| = \sum_{i=1}^{T-t+1} K_i.$$
Obviously, $N_1 \ge N_2 \ge \cdots \ge N_T$.
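A hypothetical helper for assembling the nested sets $D_1, \ldots, D_T$; it assumes the distinct observations of group $i$ are stored as a $(K_i, T - i + 1)$ array of their observed components.

```python
import numpy as np

def monotone_subsets(groups_distinct):
    """Build D_1, ..., D_T: group i has its first T - i + 1 components
    observed, so D_t collects the first t components of the distinct
    observations from groups 1, ..., T - t + 1."""
    T = len(groups_distinct)
    return [np.vstack([g[:, :t] for g in groups_distinct[:T - t + 1]])
            for t in range(1, T + 1)]
```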

4.2. Kernel Density Estimation

There are two methods to estimate the joint density $h$ of the random vector, proposed by Yang and Wu [15] and Titterington and Mill [22], respectively. They are described below.

(1) Yang and Wu's Method. Let $\kappa_1, \ldots, \kappa_T$ be a sequence of consistent kernel functions, a terminology used in Yang and Wu [15] to indicate that the $\kappa_t$ are density functions on $\mathbb{R}^t$ satisfying
$$\int_{\mathbb{R}} \kappa_{t+1}\big(x^{(t)}, u\big)\,du = \kappa_t\big(x^{(t)}\big) \quad \text{for } t = 1, \ldots, T - 1.$$
Then, under the monotone missing data structure, the marginal density $h_t$ of the first $t$ components can be estimated in two different ways: (i) using the consistency property of the kernels and the data $D_{t+1}$ and (ii) using the ordinary kernel density estimation with the data $D_t$. Formally, the two estimates (denoted by $\hat h_t^{(1)}$ and $\hat h_t^{(2)}$, resp.) are given by
$$\hat h_t^{(1)}\big(x^{(t)}\big) = \frac{1}{N_{t+1} b^t}\sum_{y \in D_{t+1}} \kappa_t\Big(\frac{x^{(t)} - y^{(t)}}{b}\Big), \qquad \hat h_t^{(2)}\big(x^{(t)}\big) = \frac{1}{N_t b^t}\sum_{y \in D_t} \kappa_t\Big(\frac{x^{(t)} - y}{b}\Big),$$
with $b > 0$ a bandwidth. Therefore, the density $h_t$ for $t < T$ can be estimated by either of them, and $h_T$ can be estimated by $\hat h_T^{(2)}$. Furthermore, the consistency of the kernels also indicates that
$$\int_{\mathbb{R}} \hat h_{t+1}^{(2)}\big(x^{(t)}, u\big)\,du = \hat h_t^{(1)}\big(x^{(t)}\big).$$

By a decomposition of the joint density as a product of a series of conditional densities, that is, $h(x) = h_1(x_1)\prod_{t=2}^{T} h_t(x^{(t)})/h_{t-1}(x^{(t-1)})$, one can estimate the density $h$ by
$$\hat h^{mm}(x) = \hat h_1(x_1)\prod_{t=2}^{T}\frac{\hat h_t\big(x^{(t)}\big)}{\hat h_{t-1}\big(x^{(t-1)}\big)},$$
where the superscript "$mm$" indicates that the estimate is obtained under the monotone missing data structure and $\hat h_t$ and $\hat h_{t-1}$ are, respectively, proper estimates of $h_t$ and $h_{t-1}$ as above.
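The sketch below implements the telescoping estimate $\hat h^{mm}$ with product-Gaussian kernels, which are consistent kernels in the above sense (integrating the $(t+1)$-dimensional product kernel over the last coordinate gives the $t$-dimensional one). The single bandwidth h is an ad hoc choice of this sketch; in practice it should depend on $N_t$ and $t$.

```python
import numpy as np

def kde(points, data, h):
    """Ordinary product-Gaussian kernel density estimate at `points`
    (shape (m, t)) from `data` (shape (n, t)) with bandwidth h."""
    d = (points[:, None, :] - data[None, :, :]) / h        # (m, n, t)
    kern = np.exp(-0.5 * (d ** 2).sum(axis=2))             # (m, n)
    norm = (np.sqrt(2.0 * np.pi) * h) ** data.shape[1]
    return kern.sum(axis=1) / (len(data) * norm)

def monotone_density(x, subsets, h=0.3):
    """Evaluate h^mm(x) = h1(x1) * prod_t h_t(x^(t)) / h_{t-1}(x^(t-1)),
    each marginal estimated from its maximal data set D_t = subsets[t-1]
    (for example from monotone_subsets above)."""
    x = np.asarray(x, dtype=float)
    dens = kde(x[None, :1], subsets[0], h)[0]
    for t in range(2, len(subsets) + 1):
        dens *= (kde(x[None, :t], subsets[t - 1], h)[0]
                 / kde(x[None, :t - 1], subsets[t - 2], h)[0])
    return dens
```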

(2) Titterington and Mill's Method. For a vector $y = (y_o, y_m)$, Titterington and Mill [22] used the conditional expectation $E[\kappa(x - y) \mid y_o]$ instead of the usual kernel $\kappa(x - y)$ when estimating the density at $x$, where $y_o$ and $y_m$ are, respectively, the observed part and the missing part of $y$ and the expectation is taken under a probability density function serving as an estimate of the true $h$.

With the above idea, the data sets $D_t$, $t = 1, \ldots, T$, and the group structure, another estimate $\hat h^{TM}$ for $h$ is given by averaging, over all individuals, the usual kernels for the completely observed individuals and the conditioned kernels for the partially observed ones. We should note that this requires an appropriate estimate of the conditional density of the missing part given the observed part $y_o$.

5. Simulation

In this section, we report a small numerical simulation regarding the performance of the estimates of $\alpha$ and $\theta$ given in Section 3. The simulation proceeded under the following settings:

(i) Because the estimates depend only on the statistics $K_i$ and $M_{ir}$, which by Lemma 4 can be determined from the first components alone, we set the dimension of the observations to 1 for convenience.

(ii) The standard normal distribution was taken as the base measure $H$.

(iii) The sizes of the groups took three levels of values, labeled level 1 to level 3 in increasing order of the group sizes $n_i$.

The procedure of the simulation under those settings is described as follows.

Algorithm 9.

Step 1. Generate the data for every group $i$ by means of a generalized Polya urn model as follows (a code sketch of this step is given after Step 3): (i) $X_{i1} \sim H$. (ii) For $m \ge 1$, $X_{i,m+1}$ is generated by the prediction rule
$$\Pr\big\{X_{i,m+1} \in \cdot \mid X_{i1}, \ldots, X_{im}\big\} = \frac{\theta + k\alpha}{\theta + m}\,H(\cdot) + \sum_{j=1}^{k}\frac{m_j - \alpha}{\theta + m}\,\delta_{x_j^*}(\cdot),$$
where $x_j^*$ is the $j$th distinct value among $X_{i1}, \ldots, X_{im}$, $m_j$ is the number of occurrences of $x_j^*$, and $k$ is the number of distinct values.

Step 2. Compute the values of $K_i$ and $M_{ir}$ from the data generated in Step 1.

Step 3. Compute the estimates of $(\alpha, \theta)$ with the Newton-Raphson method.
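A sketch of Step 1: the generalized Polya urn draws each new observation either fresh from $H$, with probability $(\theta + k\alpha)/(\theta + m)$, or equal to an existing distinct value $x_j^*$, with probability $(m_j - \alpha)/(\theta + m)$. The function and sampler names are our own.

```python
import numpy as np

def polya_urn_sample(n, alpha, theta, base_sampler, seed=None):
    """Generate X_1, ..., X_n of one group via the prediction rule of
    Step 1."""
    rng = np.random.default_rng(seed)
    distinct, counts, draws = [], [], []
    for m in range(n):                 # m observations generated so far
        k = len(distinct)
        if m == 0 or rng.random() < (theta + k * alpha) / (theta + m):
            distinct.append(base_sampler(rng))      # fresh draw from H
            counts.append(1)
            draws.append(distinct[-1])
        else:                          # repeat x_j* w.p. prop. to m_j - alpha
            j = rng.choice(k, p=(np.array(counts) - alpha) / (m - k * alpha))
            counts[j] += 1
            draws.append(distinct[j])
    return np.array(draws)

# Example: one group of size 100 with standard normal base measure.
x = polya_urn_sample(100, 0.3, 1.0, lambda rng: rng.standard_normal(), seed=2)
```

Step 2 then amounts to applying cluster_statistics from Section 3.1 to the generated values.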

The simulation compared the estimates $\hat\alpha_{\mathrm{mle}}$ and $\tilde\alpha$ of $\alpha$ for known $\theta$, in which we fixed $\theta$ and let $\alpha$ take three values: 0.1, 0.3, and 0.5; the case of a different known $\theta$ can be handled simply by replacing $\theta$ with its known value. For each combination of the parameters, the simulation ran 1000 replications. The performances of the two estimates are shown in Table 2, from which it is clearly seen that the simulated biases of $\hat\alpha_{\mathrm{mle}}$ are all smaller than those of $\tilde\alpha$. Although the simulated standard deviations of $\tilde\alpha$ are stable and smaller across the three levels under the three different values of $\alpha$, the MSEs of $\hat\alpha_{\mathrm{mle}}$ are smaller than those of $\tilde\alpha$. That is, $\hat\alpha_{\mathrm{mle}}$ performs better than $\tilde\alpha$ under the MSE criterion.

Next, the simulation studied the performance of the estimate of $\theta$ with $\alpha$ known, at four true values of $\theta$. The simulation again ran 1000 replications for each combination of the parameters. The results are summarized in Table 3, from which we can see that the estimates generally performed better as the sample size increased.

Finally, we considered the MLEs for $\alpha$ and $\theta$ when neither parameter is known. The true values of the parameters and the performance of the estimates over the 1000 replications are shown in Table 4. The simulation used the Newton-Raphson iteration method.

Favaro et al. [10] adopted an empirical Bayes procedure for estimating $(\alpha, \theta)$, written $(\hat\alpha_{\mathrm{EB}}, \hat\theta_{\mathrm{EB}})$, when the data are not divided into groups. They investigated the issue of prediction within species sampling problems and applications to genomics based on these estimates when both parameters are unknown. Favaro et al. [10] also studied the asymptotic behavior of the number of new species conditionally on the observed sample and derived asymptotic highest posterior density intervals for the estimates of their interest. The comparisons of simulation results between $(\hat\alpha_{\mathrm{EB}}, \hat\theta_{\mathrm{EB}})$ and the MLEs, which are computed under the multigroup data structure, are presented in Table 4 with the same total number of individuals, in terms of bias, SD, and MSE. From Table 4, we can see that the MSE of the estimate of $\alpha$ is very small and the MSE of the estimate of $\theta$ is a bit bigger. Apparently, the simulation shows that the MLEs are generally better than the single-group estimates of Favaro et al. [10] under the criteria of bias, SD, and MSE.

6. Conclusion

In this paper, we studied the estimation of the unknown parameters as well as the density of the base measure in Poisson-Dirichlet process priors under monotonically missing data. The parameters are estimated by maximum likelihood, and a set of simulations shows that the estimates perform well.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by NSFC under Grant no. 71371074.