Modelling Inverse Gaussian Data with Censored Response Values: EM versus MCMC
Low detection limits are common in measure environmental variables. Building models using data containing low or high detection limits without adjusting for the censoring produces biased models. This paper offers approaches to estimate an inverse Gaussian distribution when some of the data used are censored because of low or high detection limits. Adjustments for the censoring can be made if there is between 2% and 20% censoring using either the EM algorithm or MCMC. This paper compares these approaches.
Partial missing data is common in environmental applications, where measurement systems have low or high detection limits. The measurement processes at times are incapable of measuring below certain values; for example, near zero counts of colony form units, below a certain lower limit, denoted , are considered too unreliable when the measurement process involves diluting solution samples. An example of a high detection limit () occurs for flows when flow is measured based on the height of a river, yet the river height measurement cannot exceed the height of the bank. All that is known in such cases is that the measurement is below or above , respectively. In this paper, accurate model fitting is the aim. We advocate the EM algorithm  to establish realistic penalised maximum likelihood estimates of regression parameters in such cases. Other approaches apply MCMC methods . The next section introduces the EM algorithm approach for dealing with low/high detection limits. A simulation study for high detects (the case that is more unstable because it is unbounded on the high side) is used to demonstrate the capability of the methodology. Section 3 looks at the MCMC approach. Section 4 uses an example to demonstrate the EM methodology. Section 5 concludes with a discussion.
2. The EM Algorithm
The expectation step of the EM algorithm is used to replace censored values based on the current parameter estimates and the detection limit. These replaced values are, in turn, used to establish penalised maximum likelihood estimates of regression parameters. These two steps are applied iteratively until estimates converge.
In this paper, we will use the GAMLSS model to fit the location and spread for the data using penalised maximum likelihood estimation (see [3, 4]). The library GAMLSS has the capability of fitting models with low detects, but this is experimental, while our approach has been tested at least for the applications considered in this paper.
Let the location of the response given the vector of explanatory variables be denoted by , where is the link function and is the mean. We are interested in the inverse Gausian distribution using the usual logarithm link function giving The inverse Gaussian distribution is given by where is the shape parameter and the variance is given by . The GAMLSS model for is so that . To simplify the notation, we drop subscripts; that is, let . When is a low detect, we want to evaluate which is given by (see Appendix A) where and .
High detect limits are less common in the application considered later but do occur in practice. In this situation, it can be shown that the expectation step is given by (see  for a different result that is often numerically equivalent to the result below)
A simulation study, of 1000 runs per table entry, generated data from the inverse Gaussian distribution with density and truncated values at high detection limits , 4, 5, and 6. The application of the EM algorithm is outlined in Whitemore . This was used to find the maximum likelihood estimates of hidden values and . The results are reported in Table 1. Similar results were found for inverse Gaussian distributions with low detects.
It is clear from the standard errors in Table 1 that the EM algorithm is unstable for 50 observations with standard errors more than 25% of the parameter estimated values. However, for this few observations, we are unlikely to diagnose the data as Inverse Gaussian, and any methodology is going to be suspicious.
If the presence of high detects is ignored so that data with actual values greater than are set equal to , then the resulting parameter estimates are biased. For , 4, 5, and 6, the 1000 observations example with high-truncated values in Table 1, produced average estimates of and , without correction for the truncation, equal to () 1.374, 1.528, 1.639, and 1.716 (actual ) and () 1.257, 1.169, 1.119, and 1.092 (actual ), respectively. These parameter estimates are significantly biased. A simulation exercise for low detects was carried out but is not presented, because the results are less extreme (e.g., if , then little is lost by replacing the low detects by ). This will be demonstrated in the application in Section 4.
3. Markov Chain Monte Carlo
In this section, the results of a Markov chain Monte Carlo analysis of the simulation study of high detects are presented (see previous section for details). The mean and dispersion parameters are estimated using Gibbs sampling (e.g., ). Defining the mean parameter as , the MCMC analysis is undertaken with a uniform prior distribution for and a gamma prior distribution for , with shape parameter equal to 0.001 and scale parameter equal to 1,000 (i.e., prior variance = 1000). That is, the mean of the prior distribution for is equal to 1. Based on these prior distributions, the conditional distribution for is Gamma (, ), so samples of can be drawn from this distribution, while samples for , and hence , are drawn using rejection sampling (e.g., [7, Chapter 3]).
The results of the MCMC analysis of the simulations are presented in the Table 2 below. These results are generated by using MCMC to draw 7,000 parameter samples for each of the 1,000 simulated data sets of the 16 different combinations of cutoff and number of observations in Table 2. That is, a total of 16,000 MCMC analyses are run and each of these is run for 7,000 sample iterations. The first 2,000 samples of each MCMC run are discarded, and of the remaining 5,000 samples, every fifth sample is saved and used to estimate and . The medians of these 1,000 realizations is taken as the estimate of and for that simulated data set. in Table 2 presents summary statistics for the 1,000 median estimates for and from the 1,000 simulated data sets for each cutoff and number of observations combinations. An examination of the results in this table and those for the EM analysis in Table 1 reveals that the MCMC estimates are generally less biased for the dispersion parameter than the corresponding EM estimates, especially for the smaller sample sizes. For the mean parameter the means of the 1,000 median estimates for are generally more biased than the EM estimates.
The data include analytes (e.g., total nitrogen, chlorophyll-a, e. coli, turbidity, etc.) and flows routinely measured at 35 different sites in the waterways that supply Sydney with drinking water for a period going back 20 years. These measures are used to assess the quality of water in the various reservoirs, rivers, and dams that are used to supply Sydney and greater Sydney with drinking water. Some measurements are taken daily (temperature, dissolved oxygen, turbidity, and pH), while others are collected roughly either fortnightly or monthly. The fitted models were used to assess the risk that particular analytes pose to the drinking water supply to Sydney residents.
We used AIC to select the distribution and model which was most appropriate for several water quality variables. The distributions considered were normal, log-normal, inverse gaussian, and zero-adjusted inverse Gaussian. The analytes that appeared to select the inverse Gaussian as the preferred distribution in more than 75% of the sites were firstly phosphorus filterable, nitorogen ammoniacal, and e. coli. Other analytes that where selected less frequently as inverse Gaussian distributed were entrococci, toxic cyanobacterial count, chloride, nitrogen oxide, and phosphorus total. The number of measurements ranged from several thousand to a few hundred with different number of low detects. All of these were successfully fitted by the approach advocated in the paper. As an example, we model phosphorus total(mg/L) at Bendeela Pondage in the Sydney water catchment supply area. This example is illustrated below.
Table 3 presents the result of the fitted model for phosphorus total (mg/L) at Bendeela Pondage. This was measured for the past 15 years with frequency of just over one per month (giving 213 measurements). There are seven low detects with one being below 0.001 and six being below 0.01. Not adjusting for this can adversely influence the model fit, because 0.01 is close to the mean value. Applying the above EM algorithm to this example results in a nonhomogeneous mean and variance, which was fitted in R  by GAMLSS with family equal to the inverse Gaussian. The EM algorithm ran in a fraction of the time of the MCMC approach.
In addition, qq-plots of the theoretical values versus the quantile residuals were closer to each other after convergence using EM algorithm. In Table 3, setting the low detects equal to produces a fitted model that is closest to the EM fit. However, even with the propotionally few low detects in the sample, there are significant differences in some of the time-of-the-day harmonic coefficients for the trend and spread (i.e., coefficients).
The paper offers a way of improving models with the modelled response having an inverse Gaussian distribution when the response variable has low or high detects. Although only one example is reported in the paper, the methodology has been successfully applied to many more examples for the Sydney Catchment Authority. Problems with convergence were encountered in a few applications. The MCMC approach was computationally too intensive to run on all applications, but little appears to be lost by applying the EM algorithm. We used the AIC criterion to select the family from normal, log-normal, inverse Gaussian, or zero-adjusted inverse Gaussian (for which low detects were assumed zeros). This latter model is only realistic when the low detects are below all other measured values, hence, it is inappropriate in the first example in Section 4, where some low detects (i.e., 0.01) are close to the average response value. The high detection limit is seldom an issue in applications, but its theory was included, because it was encountered on a few occasions in the Sydney Catchment Authority work. The EM algorithm in this paper makes no attempt to achieve a maximum likelihood estimate for the variance; however, this limitation is of little consequence, since the variance estimate is reasonably comparable to the MCMC approach, which does properly account for the variance.
Numerical instabilities in the approaches were experienced in simulations, but results can be achieved by small adjustments to the estimated variances.
Our aim in this appendix is to find and using the moment generating function (mgf) of given , denoted or , respectively; for example, .
A. Inverse Gaussian Distribution
The inverse Gaussian distribution for random variable is given by density and has distribution function , where .
The mgf for the Inverse Gaussian distribution is given by By completing the square for the exponent terms in (A.1) and setting , we have so that for low detects, where we know that , the mgf becomes Since when , we have Therefore Similarly, for high detects with , See also Whitemore  which seems to often agree numerically with the result.
P. Toscas, “MCMC and data augmentation for paramter estimation of censored data,” CSIRO Technical Report, CMIS 07/92, 2007.View at: Google Scholar
M. A. Tanner, Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions, Springer, New York, NY, USA, 2nd edition, 1993.