On the Existence and Uniqueness of the Maximum Likelihood Estimators of Normal and Lognormal Population Parameters with Grouped Data
The lognormal distribution has abundant applications in various fields. In the literature, most inferences on the two parameters of the lognormal distribution are based on Type-I censored sample data. However, exact measurements are not always attainable, especially when an observation is below or above the detection limits, and only the numbers of measurements falling into predetermined intervals can be recorded instead. Such data are called grouped data. In this paper, we show the existence and uniqueness of the maximum likelihood estimators of the two parameters of the underlying lognormal distribution with Type-I censored data and grouped data. The proof is first established for the normal distribution and then extended to the lognormal distribution through the invariance property. The results are applied to estimate the median and mean of the lognormal population.
The lognormal distribution has been used to model many skewed frequency distributions, especially continuous random quantities in medical, physical, chemical, biological, toxicological, economic, and environmental processes.
For example, in medicine, red cell volume distributions, size distributions of plaques in Alzheimer’s patients, surgical procedure times, and survival times of breast and ovarian cancer have all been modeled by the lognormal distribution by various researchers. Tai et al. and Mould et al. validated the use of the lognormal model for predicting long-term survival rates of laryngeal cancer patients using short-term follow-up data.
It is also common to apply the lognormal distribution to fatigue life and residual strength of composite materials, reliability analysis, size distributions in economics and actuarial sciences, cell growth, and many other phenomena.
In all these studies, it is critical to estimate the parameters of a lognormal distribution. A random variable $X$ follows the lognormal distribution $LN(\mu, \sigma^2)$ if the logarithm of the random variable, $\ln X$, follows the normal distribution $N(\mu, \sigma^2)$. Thus, to estimate the parameters $(\mu, \sigma^2)$, it suffices to convert the lognormal data to normal data by log-transformation. In the literature, the estimation of these two parameters has been considered with a complete sample or, in most cases, a Type-I censored sample. However, estimation with grouped data has not yet been studied. We complement this literature by proposing maximum likelihood estimators (MLEs) of the two parameters that are based on grouped sample data (i.e., interval censored data).
The paper is organized as follows. In Section 2, we show that the MLEs of the two parameters exist and are unique under mild conditions, and we establish the asymptotic normality of the estimators. The results are applied to derive point and confidence interval estimates of the mean and median of the underlying lognormal distribution in Section 2.1. Section 3 provides simulation results comparing the properties of the estimators based on grouped samples with those based on Type-I censoring. Section 4 applies the method to a practical problem. To facilitate reading, proofs are relegated to the appendix.
2. Main Results
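In this section, we first show that the MLEs of the parameters $\mu$ and $\sigma$ of a normal population $N(\mu, \sigma^2)$ based on grouped data exist and are unique. Here, grouped data refers to the following. Assume that a sample $X_1, \ldots, X_n$ is drawn from a normal population $N(\mu, \sigma^2)$. The values of the $X_j$s are unknown; however, according to preestablished partition points $a_1 < a_2 < \cdots < a_k$, we know $n_i$, the number of $X_j$s that fall into the interval $(a_{i-1}, a_i]$, $1 \le i \le k+1$, where $a_0 = -\infty$ and $a_{k+1} = \infty$. Denote the density of the standard normal distribution by $\phi(x)$; then the density of the $N(\mu, \sigma^2)$ distribution is $\sigma^{-1}\phi((x-\mu)/\sigma)$, $-\infty < \mu < \infty$, $\sigma > 0$. In order to prove our results, we consider two new parameters $\alpha = \mu/\sigma$ and $\beta = 1/\sigma$. There is a one-to-one correspondence between $(\mu, \sigma)$ and $(\alpha, \beta)$, namely, $\mu = \alpha/\beta$ and $\sigma = 1/\beta$. We will show that the MLEs of $\alpha$ and $\beta$ based on grouped data exist and are unique. Then, by the invariance property of MLEs, the existence and uniqueness of the MLEs of $(\mu, \sigma)$ follow. With the new parameters, the CDF of $X_j$ can be expressed as $F(x) = \Phi(\beta x - \alpha)$, where $\Phi$ is the CDF of the standard normal distribution, and the log-likelihood function is given by
$$\ln L(\alpha, \beta) = C + n_1 \ln \Phi(\beta a_1 - \alpha) + \sum_{i=2}^{k} n_i \ln\left[\Phi(\beta a_i - \alpha) - \Phi(\beta a_{i-1} - \alpha)\right] + n_{k+1} \ln\left[1 - \Phi(\beta a_k - \alpha)\right], \qquad (2.1)$$
where $C$ is a known constant.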
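As an illustration, a grouped-data log-likelihood of this kind can be sketched in a few lines of Python. The sketch below is not the authors' code: it works directly in the original parameters (mu, sigma), drops the additive constant, and assumes the cells are (-inf, a_1], (a_1, a_2], ..., (a_k, +inf). The function name `grouped_loglik` and the example counts are hypothetical.

```python
from math import inf, log
from statistics import NormalDist


def grouped_loglik(mu, sigma, cuts, counts):
    """Log-likelihood (up to an additive constant) of N(mu, sigma^2) for grouped data.

    cuts   : increasing partition points a_1 < ... < a_k
    counts : n_1, ..., n_{k+1}, the numbers of observations falling in
             (-inf, a_1], (a_1, a_2], ..., (a_k, +inf)
    """
    if sigma <= 0:
        return -inf
    dist = NormalDist(mu, sigma)
    # CDF evaluated at -inf, a_1, ..., a_k, +inf
    cdf = [0.0] + [dist.cdf(a) for a in cuts] + [1.0]
    total = 0.0
    for n_i, lo, hi in zip(counts, cdf, cdf[1:]):
        if n_i == 0:
            continue  # empty cells contribute nothing
        p = hi - lo   # probability of the cell under N(mu, sigma^2)
        if p <= 0.0:
            return -inf  # an occupied cell has (numerically) zero probability
        total += n_i * log(p)
    return total


# Hypothetical example: three partition points, hence four cells.
cuts = [-1.0, 0.0, 1.0]
counts = [16, 34, 34, 16]
print(grouped_loglik(0.0, 1.0, cuts, counts))
```

Only the cell counts enter the computation, which is the defining feature of grouped data: the individual observations are never needed.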
Before proceeding, we present two lemmas. Please refer to the appendix for the proofs of the lemmas.
Lemma 2.1. Assume , , . For any given , there exists a compact subset such that
Basically, Lemma 2.1 means that the log-likelihood function will not achieve its maximum value at the boundary of its domain.
Lemma 2.2. Let for . Then the Hessian matrix of , is negative definite.
Theorem 2.3. Suppose that the observed satisfy and , then the MLEs of parameters and of normal population uniquely exist.
Proof. We need only to show that the MLEs of parameters and uniquely exist. According to the results of Mäkeläinen et al., in order to show the existence and uniqueness of the MLEs of , it is sufficient to verify the following two conditions. (i) For any given , (2.2) holds. (ii) The Hessian matrix of ,
is negative definite at every point .
Condition (i) is satisfied by Lemma 2.1. Therefore, to prove the theorem, we need only show (ii), that is, that the Hessian matrix of the log-likelihood function is negative definite at every point.
To this end we should consider each of the three terms in the expression (2.1) of .
Let . It is evident that the Hessian matrix of is where denotes .
To show is negative semidefinite, we will verify the following two conditions: (a) or , ; (b) the determinant of is nonnegative, that is, .
Note that (a) is equivalent to , . This is true since holds for any (see, e.g., Feller ). Hence (a) is satisfied. The two rows of are proportional, so . Hence, the condition (b) is satisfied. Therefore, is negative semidefinite.
Now denote . The Hessian matrix of is where denotes
In a similar way as above, we can show that the matrix is negative semidefinite.
Finally, let us consider , . Let . Then . The Hessian matrix associated with is where and is the transpose of . By Lemma 2.2, is negative definite. Therefore, is negative definite.
The Hessian matrix of the log-likelihood function can be expressed as . Since matrices and are negative semidefinite, each is negative definite, and at least one by our assumptions, so must be negative definite. This completes the proof of the theorem.
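A negative definite Hessian at every point implies that the log-likelihood is strictly concave in the transformed parameters, and this consequence can be spot-checked numerically. The sketch below assumes the reparametrization alpha = mu/sigma, beta = 1/sigma (so that the CDF is Phi(beta*x - alpha)); this choice, the function names, and the example counts are illustrative assumptions. It tests midpoint concavity on random pairs of points; it is a sanity check, not a proof.

```python
import random
from math import inf, log
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF


def loglik(alpha, beta, cuts, counts):
    """Grouped-data log-likelihood in (alpha, beta) = (mu/sigma, 1/sigma)."""
    if beta <= 0:
        return -inf
    cdf = [0.0] + [PHI(beta * a - alpha) for a in cuts] + [1.0]
    total = 0.0
    for n_i, lo, hi in zip(counts, cdf, cdf[1:]):
        if n_i:
            if hi - lo <= 0.0:
                return -inf
            total += n_i * log(hi - lo)
    return total


def midpoint_concavity_holds(cuts, counts, trials=1000, seed=0):
    """Check f((p+q)/2) >= (f(p)+f(q))/2 on random pairs of parameter points."""
    rng = random.Random(seed)
    for _ in range(trials):
        a1, a2 = rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)
        b1, b2 = rng.uniform(0.2, 3.0), rng.uniform(0.2, 3.0)
        mid = loglik((a1 + a2) / 2, (b1 + b2) / 2, cuts, counts)
        avg = (loglik(a1, b1, cuts, counts) + loglik(a2, b2, cuts, counts)) / 2
        # small relative tolerance to absorb floating-point rounding
        if mid < avg - 1e-9 * (1.0 + abs(avg)):
            return False
    return True


# Hypothetical grouped sample with every cell occupied.
print(midpoint_concavity_holds([-1.0, 0.0, 1.0], [16, 34, 34, 16]))
```

The check is carried out in (alpha, beta) rather than (mu, sigma) because concavity is a property of the transformed parametrization; the log-likelihood need not be concave in the original parameters.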
Corollary 2.4. Under the conditions of Theorem 2.3, it holds that as , where means “converges in law,” and
Proof. For each define . Note that . Hence the result follows from Theorem 2.3 and the asymptotic normality of the MLE (see, e.g., Lawless (2003)).
The same results as in Theorem 2.3 and Corollary 2.4 also hold for the case of Type-I censored data. Let be a sample from an population. Suppose that is a predetermined detection limit. Without loss of generality, we will consider left censoring, the common situation in environmental studies, that is, will be observed if and only if . Although Type-I censoring is widely applied in the literature, to the authors' knowledge the existence and uniqueness of the MLEs of have not been proved. This will be shown in the following theorem.
Theorem 2.5. Suppose that the number of observable s is at least 2, then the MLEs of uniquely exist based on the Type-I censored data with as detection limit.
Proof. The result can be proved in the same way as Balakrishnan and Mi .
Remark 2.6. (a) The same result holds for the case of right censoring; (b) the results of Theorem 2.5 are true if each is censored by detection limit .
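For comparison with the grouped-data case, the log-likelihood under Type-I left censoring can be sketched as follows. The sketch assumes that values at or above a single detection limit are observed exactly and that only the number of values below the limit is known; the function name and the example data are hypothetical.

```python
from math import inf, log, pi
from statistics import NormalDist


def left_censored_loglik(mu, sigma, observed, n_censored, limit):
    """Log-likelihood of N(mu, sigma^2) under Type-I left censoring.

    observed   : values recorded exactly (those not below the detection limit)
    n_censored : number of values known only to lie below the detection limit
    limit      : the detection limit
    """
    if sigma <= 0:
        return -inf
    total = 0.0
    for x in observed:
        z = (x - mu) / sigma
        # log of the normal density, written out to avoid underflow in pdf()
        total += -log(sigma) - 0.5 * log(2.0 * pi) - 0.5 * z * z
    if n_censored:
        p_below = NormalDist(mu, sigma).cdf(limit)
        if p_below <= 0.0:
            return -inf
        # each censored value contributes P(X < limit)
        total += n_censored * log(p_below)
    return total


# Hypothetical data: four exact values and three values below the limit -0.5.
obs = [0.1, 0.5, 1.2, -0.2]
print(left_censored_loglik(0.0, 1.0, obs, 3, -0.5))
```

Right censoring can be handled in the same way by replacing the probability of falling below the limit with the probability of exceeding it.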
2.1. Estimation of the Median and Mean
Suppose that a random variable $X$ follows the lognormal distribution $LN(\mu, \sigma^2)$. With the log-transformation, $Y = \ln X$ then follows the normal distribution $N(\mu, \sigma^2)$. The lognormal distribution has been used to model various continuous random variables, as mentioned in Section 1. Specifically, this distribution is frequently applied in environmental statistics. The lognormal random variable $X$ has median $e^{\mu}$ and mean $e^{\mu + \sigma^2/2}$. By the invariance property of the MLE, the MLEs of the median and mean are easily obtained as $e^{\hat\mu}$ and $e^{\hat\mu + \hat\sigma^2/2}$, where $\hat\mu$ and $\hat\sigma$ denote the MLEs of $\mu$ and $\sigma$. We can also obtain approximate confidence intervals for the median and mean as follows.
Denote the inverse of the matrix in Corollary 2.4 to Theorem 2.3 as It is obvious that , by large sample theory we have . From these, an approximate confidence interval of can be obtained as , here is the upper percentile, and is obtained from substituting and for and in the expression of . Similarly, it holds that as where Therefore, an approximate confidence interval of is obtained as where is obtained by substituting and by their MLEs and .
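One common way to turn large-sample normality of the parameter estimates into intervals for the lognormal median and mean is the delta method on the log scale, followed by exponentiation. The sketch below follows that construction; it may differ in detail from the interval formulas intended above. It assumes the lognormal median exp(mu) and mean exp(mu + sigma^2/2), and takes as input an estimated covariance matrix of the two parameter estimates. The function name, the argument `cov`, and the numerical values are hypothetical.

```python
from math import exp, sqrt
from statistics import NormalDist


def lognormal_median_mean_ci(mu_hat, sigma_hat, cov, level=0.95):
    """Point estimates and approximate CIs for the lognormal median and mean.

    cov is an estimated 2x2 covariance matrix of (mu_hat, sigma_hat),
    for example an inverse information matrix divided by the sample size.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # two-sided normal critical value
    var_mu, cov_ms, var_sigma = cov[0][0], cov[0][1], cov[1][1]

    # log(median) = mu, so its standard error is that of mu_hat
    log_median = mu_hat
    se_log_median = sqrt(var_mu)

    # log(mean) = mu + sigma^2 / 2; its gradient in (mu, sigma) is (1, sigma)
    log_mean = mu_hat + sigma_hat ** 2 / 2.0
    se_log_mean = sqrt(var_mu + 2.0 * sigma_hat * cov_ms + sigma_hat ** 2 * var_sigma)

    def interval(center, se):
        # build the interval on the log scale, then map back with exp
        return exp(center - z * se), exp(center + z * se)

    return {
        "median": (exp(log_median), interval(log_median, se_log_median)),
        "mean": (exp(log_mean), interval(log_mean, se_log_mean)),
    }


# Hypothetical estimates and covariance matrix.
print(lognormal_median_mean_ci(0.0, 1.0, [[0.04, 0.0], [0.0, 0.02]]))
```

Building the interval on the log scale guarantees positive endpoints, which is natural for a median or mean of a positive random variable.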
3. Simulation Studies
In this section, we will conduct simulation studies on the MLEs and confidence intervals of and of normal distribution based on grouped data. In addition, we will also examine point and interval estimations of the mean and median of lognormal distribution . The results obtained from grouped data will be compared with those obtained from Type-I censored data.
We create a population of size by drawing values from a normal population with and . Next, for five prefixed partition points , we record the number of values from this population that fall into each interval . Each such set of counts is considered to be an observed sample. The MLEs of and are then computed based on this observed sample. This process is repeated 5,000 times. Different sample sizes and 6 sets of partition points are considered for comparison purposes.
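The data-generating step of such a simulation can be sketched as follows. The values of the sample size, mean, standard deviation, and partition points used here are placeholders chosen for illustration, not the settings of the study, and the function name is hypothetical.

```python
import random
from bisect import bisect_left


def simulate_grouped_counts(n, mu, sigma, cuts, rng):
    """Draw n values from N(mu, sigma^2) and return the cell counts.

    cuts are increasing partition points a_1 < ... < a_k; cell i collects the
    draws in (a_{i-1}, a_i], with a_0 = -inf and a_{k+1} = +inf.
    """
    counts = [0] * (len(cuts) + 1)
    for _ in range(n):
        x = rng.gauss(mu, sigma)
        # bisect_left returns the index of the first cut that is >= x,
        # which is the right-closed cell containing x
        counts[bisect_left(cuts, x)] += 1
    return counts


# Placeholder settings: 200 draws from N(0, 1) and five partition points.
rng = random.Random(2024)
counts = simulate_grouped_counts(200, 0.0, 1.0, [-1.5, -0.5, 0.0, 0.5, 1.5], rng)
print(counts)
```

Repeating this step many times and estimating the parameters from each vector of counts gives the Monte Carlo averages, interval widths, and coverage rates reported in a study of this kind.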
We compute the MLEs of and by solving the likelihood equations using SAS IMSL nonlinear equation solver. Then the MLEs and of and are readily obtained by the invariance of MLE. According to the large sample properties of MLEs stated in Corollary 2.4 to Theorem 2.3, we know that is asymptotically normally distributed. Thus we can obtain approximate confidence intervals for and .
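The sketch below shows one way to compute such estimates without specialized software. It swaps the nonlinear equation solver for a plain Newton iteration with finite-difference derivatives and step halving, applied in the assumed reparametrization alpha = mu/sigma, beta = 1/sigma, where the log-likelihood is concave. The function names, starting values, and tolerances are illustrative assumptions.

```python
from math import inf, log
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF


def loglik(alpha, beta, cuts, counts):
    """Grouped-data log-likelihood in (alpha, beta) = (mu/sigma, 1/sigma)."""
    if beta <= 0:
        return -inf
    cdf = [0.0] + [PHI(beta * a - alpha) for a in cuts] + [1.0]
    total = 0.0
    for n_i, lo, hi in zip(counts, cdf, cdf[1:]):
        if n_i:
            if hi - lo <= 0.0:
                return -inf
            total += n_i * log(hi - lo)
    return total


def grouped_mle(cuts, counts, mu0=0.0, sigma0=1.0, h=1e-4, tol=1e-8, max_iter=100):
    """Return (mu_hat, sigma_hat) maximizing the grouped-data log-likelihood."""
    def f(a, b):
        return loglik(a, b, cuts, counts)

    a, b = mu0 / sigma0, 1.0 / sigma0
    for _ in range(max_iter):
        f0 = f(a, b)
        # central finite differences for the gradient and the Hessian
        ga = (f(a + h, b) - f(a - h, b)) / (2 * h)
        gb = (f(a, b + h) - f(a, b - h)) / (2 * h)
        haa = (f(a + h, b) - 2 * f0 + f(a - h, b)) / h ** 2
        hbb = (f(a, b + h) - 2 * f0 + f(a, b - h)) / h ** 2
        hab = (f(a + h, b + h) - f(a + h, b - h)
               - f(a - h, b + h) + f(a - h, b - h)) / (4 * h ** 2)
        det = haa * hbb - hab ** 2
        # Newton step: solve H * step = -gradient for the 2x2 system
        da = -(hbb * ga - hab * gb) / det
        db = -(haa * gb - hab * ga) / det
        # halve the step until the log-likelihood does not decrease
        t = 1.0
        while f(a + t * da, b + t * db) < f0 and t > 1e-10:
            t /= 2.0
        a, b = a + t * da, b + t * db
        if abs(t * da) + abs(t * db) < tol:
            break
    return a / b, 1.0 / b  # back-transform by invariance of the MLE
```

As a usage check, when the counts are exactly proportional to the cell probabilities of some normal distribution, the maximizer coincides with that distribution's parameters, so `grouped_mle` should recover them up to numerical error.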
Type-I censored data are very common in various experiments. Type-I censoring is widely used in life tests in order to save test time. In particular, in environmental data analysis, values are often reported simply as being below the detection limit, along with the stated detection limit. The data obtained in this way are Type-I left singly censored. To compare the performance of the MLEs based on the grouped data with those obtained from Type-I left singly censored data, we will use as the “detection limit”. Figures 1, 2, 3, 4, 5, and 6 present the estimated MLEs of and under six different partition sets with ranging from to . The results for the median and mean of the lognormal population are listed in Tables 1, 2, 3, 4, 5, and 6.
From these figures (grouped data: solid line; Type-I censoring: dotted line), it is easy to see that the estimates under both data situations improve dramatically with increasing sample size. The estimated values are very close to the true values, with error less than 0.003% when . The choice of 's does not seem to affect the results much except in Figure 6, where , a partition for which most observations fall in the middle and few on either side. The figures also show that the estimates based on grouped data are better than those based on Type-I censored data, especially in the estimation of , with exceptions in a few isolated cases. Moreover, it is interesting to observe how differently the and approach the true values, with following an oscillating path and tending to be consistently underestimated.
4. An Application
Let us consider a sample of 47 observations from the guidance document USEPA [10, pages 6.22–6.25]. The data are measurements of 1,2,3,4-Tetrachlorobenzene (TcCB) concentrations (in parts per billion, usually abbreviated ppb) from soil samples at a “Reference” site.
The normal Q-Q plot for the log-transformed TcCB data shown in the book of Millard and Neerchal (2001) indicates that the lognormal distribution appears to provide a good fit to the original data. The book gives as the MLE of the mean of the lognormal distribution, and as an approximate confidence interval for based on the complete sample data with the 47 observations. The book also uses 0.5 as the detection limit, that is, any observation lower than 0.5 will be censored, which yields 19 censored observations and 28 uncensored observations. The censored data then give as the MLE of and as an approximate confidence interval for .
To apply the results in Section 2 for computing the MLEs of the parameters of this lognormal distribution, we first transform the original data to their logarithms and thus the log-transformed data constitute a sample from a normal distribution, then obtain , , , , , by using the following five partition points . Solving the corresponding log-likelihood equations gives , , , and . Approximate confidence intervals for and are given in Table 7.
Proof of Lemma 2.1. To prove the lemma, it is sufficient to verify the following three limits:
To see (A.1), from the assumption , there exists an index, say , such that and . We have . So and . Therefore, (A.1) holds.
To show (A.2), we denote . For each fixed , it is evident that . Thus .
Note that, , so there exists , such that . Consider function . For any given large number , it is easy to see that there exists such that .
Denote . For any , from our assumptions there exists an index, say , belonging to satisfying (a) ; (b) the following two quantities and have the same sign; and (c) , and .
Note that, if , and both and , then If , then and so . Consequently,
This further implies
If , but both and , then similarly, it can be shown that (A.6) is true again. Therefore, we see that for any given large number , it holds that Due to the arbitrariness of , we conclude that (A.2) is true.
To verify (A.3), we let and for any . For any fixed , we have It can be easily verified that as or . Thus, there exists such that .
We define function , for any given , and as before. Consider any sequence with as . Let and be any converging subsequence of , . Let us study two cases.
Case 1 (). Notice that for any , by our assumptions there exists at least one index, say , in such that (a) ; (b) and ; (c) and have the same sign.
Since as , there exists sufficiently large such that . Thus, From these, as before, we obtain this implies .
Case 2 (). In this case, the inequality can be proved in the same way as Case 1.
From the results in the above two cases, we conclude that . Since is an arbitrary sequence satisfying , (A.3) is true.
Proof of Lemma 2.2. For any given , we have . The Hessian matrix of is
In order to prove is negative definite, the following two conditions must be satisfied: (i) or ; (ii) the determinant of the Hessian matrix is positive.
The inequality is equivalent to . This inequality follows from . Thus the desired inequality is true.
From the expression of it follows that The inequality is equivalent to . We discuss three cases.
Case 1 (). We have . From this, we see that
Case 2 (). It is obvious that .
Case 3 (). It holds that From this, we see that since . It means that . This further implies that since . Hence, in all three cases, we obtain .
From all the above, we conclude that both conditions (i) and (ii) are satisfied and thus the Hessian matrix is negative definite.
|, :||MLEs of , with grouped data|
|, :||MLEs of , with type I left censored data|
|:||MLE of median of distribution with grouped data|
|:||MLE of mean of distribution with grouped data|
|Average:||The average of estimates from 5000 simulations|
|A.W.:||The average width of 5000 approximate 95% confidence intervals|
|C.R.:||The average coverage rate of 5000 approximate 95% confidence intervals.|
P. Tai, E. Yu, R. Shiels, and J. Tonita, “Long-term survival rates of laryngeal cancer patients treated by radiation and surgery, radiation alone, and surgery alone: studied by lognormal and Kaplan-Meier survival methods,” BMC Cancer, vol. 5, article 13, 2005.
C. Kleiber and S. Kotz, Statistical Size Distributions in Economics and Actuarial Sciences, Wiley Series in Probability and Statistics, Wiley-Interscience, Hoboken, NJ, USA, 2003.
J. E. Mosimann and G. Campbell, “Applications in biology of the lognormal distribution: simple growth models,” in Lognormal Distribution: Theory and Applications, pp. 287–302, 1988.
T. Mäkeläinen, K. Schmidt, and G. P. H. Styan, “On the existence and uniqueness of the maximum likelihood estimate of a vector-valued parameter in fixed-size samples,” The Annals of Statistics, vol. 9, no. 4, pp. 758–767, 1981.
W. Feller, An Introduction to Probability Theory and Its Applications: Volume 2, John Wiley & Sons, New York, NY, USA, 1957.
USEPA, Statistical Methods for Evaluating the Attainment of Cleanup Standards. Vol. 3, 1994.