Abstract

While univariate instances of binomial data are readily handled with generalized linear models, cases of multivariate or repeated measure binomial data are complicated by the possibility of correlated responses. Likelihood-based estimation can be applied using mixture distribution models, though this approach can present computational challenges. The logistic transformation can be used to bypass these concerns and allow for alternative estimating procedures. One popular alternative is the generalized estimating equation (GEE) method, though systematic errors can lead to infeasible correlation estimates or nonconvergence problems. Our approach couples the quasi-least squares (QLS) method with a rarely used matrix factorization, which achieves a simplified estimation platform, as compared to the mixture model approach, and does not suffer from the convergence problems of the GEE method. A noncontrived example is provided that shows the mechanical breakdown of GEE in several statistical software packages and highlights the usefulness of the QLS approach.

1. Introduction

Binomial data occur when observations on a given subject consist of a fixed series of Bernoulli trials, resulting in a proportional outcome. Maximum likelihood estimation is readily available in a generalized linear modeling framework when subjects consist of univariate measures (i.e., one Bernoulli or binomial trial per subject). However, estimation becomes more complicated when several Bernoulli or binomial trials are observed for each subject. In this case subject responses could be multivariate (consisting of several series of separately defined trials) or repeated measures (where the sets of trials are defined similarly), and in both instances there is the possibility that the intrasubject responses are correlated. Here we use the term “subject” for convenience, but it could be an item, store, location, plot in an agricultural experiment, and so on. For example, real-life situations where we encounter correlated proportions include (1) bankers interested in the proportion of customers making the $i$th type of transaction at the $j$th bank branch; (2) the proportion of CD deposits at the $i$th branch in the $j$th month of a year; (3) retail managers interested in the proportion of customers purchasing the $i$th item at the $j$th store; (4) marketers interested in the proportion of subjects viewing the $i$th advertisement type on the $j$th website; (5) information technology specialists interested in the proportion of students who use the $i$th computer program in the $j$th computer lab; (6) biologists interested in the proportion of hatched English sole eggs kept in solutions at different temperatures and salinity levels. We will discuss the last example later in this paper. Other examples where correlated binomial data occur are the seed testing experiments described in Gilliland et al. [1].

The data layout for the aforementioned examples is presented in Table 1. In all of these examples the proportions within each row are correlated, but the rows can be assumed to be independent. The within-row correlation, while complicating matters, must be accounted for in order to obtain proper variance estimates and valid inference for the regression parameters representing the associations between the vector of proportions and the covariates. Thus, the problem is to estimate the parameters of interest within the ensemble of all parameters. In this context one could use a likelihood-based approach utilizing mixture distribution models.

In the case of binomial data the mixture model would consist of both binomial and logit-normal components. However, parameter estimation in the mixture model could experience convergence problems due to the multitude of marginal mean, regression, and correlation parameters. A simplified alternative approach would be to transform the variable-specific proportions for each subject in a way that would simplify the assumed probability distribution. The logit of the proportions transforms the outcome scale from $[0,1]$ to $(-\infty, \infty)$, which could make a multivariate normal-based methodology appropriate. One such procedure is the generalized estimating equations (GEEs) proposed by Liang and Zeger [2]. Though a popular methodology for estimating regression parameters in cases of longitudinal or repeated measure data, this procedure can suffer problems when estimating correlations. As will be seen in subsequent sections, the GEE method can fail to converge even for continuous data, which is what we obtain when the logit transformation is applied to binomial proportions.

Coull and Agresti [3] discussed random effects models for logit-transformed correlated binomial data. Here we suggest the method of quasi-least squares (QLS), developed by Chaganty [4] and Chaganty and Shults [5]. While generally seen as an alternative to solving the maximum likelihood score equation for correlation parameters in the case of Gaussian data (Sabo and Chaganty [6]), the estimation of correlation in the QLS procedure can also be supplemented with a little-known matrix factorization that makes it distinct from the maximum likelihood method. In this sense the QLS procedure is applicable to correlated continuous data, which is appropriate for logit-transformed binomial proportions.

The rest of this paper is outlined as follows. The likelihood-based mixture model approach is discussed in Section 2, while the logit transformation of binomial data and the GEE methodology are discussed in Section 3. We briefly outline the QLS estimating procedure in Section 4, while also highlighting the matrix factorization used in estimating the correlation. A noncontrived example is given in Section 5 that shows the usefulness of the QLS approach, as well as the convergence problems experienced by several statistical software packages in implementing the GEE method. A brief conclusion follows in Section 6.

2. Maximum Likelihood Estimation Using Mixture Distribution Models

For $i = 1, \ldots, m$ subjects, let $\mathbf{y}_i = (y_{i1}, \ldots, y_{it})'$ be a vector of $t$ possibly dependent binomial random variables, where $y_{ij}$ is the number of successes out of $n_{ij}$ trials with success probability $p_{ij}$ for the $j$th variable of the $i$th subject. Also assume that $\mathbf{x}_{ij} = (x_{ij1}, \ldots, x_{ijk})'$ is the vector of $k$ covariates corresponding to the $j$th variable in the $i$th subject, such that $\mathbf{X}_i = (\mathbf{x}_{i1}, \ldots, \mathbf{x}_{it})'$ is the $t \times k$ matrix of all covariates for the $i$th subject.

The general mixture distribution model for binomial data is given by
$$f(\mathbf{y}_i) = \int_{[0,1]^t} \prod_{j=1}^{t} \binom{n_{ij}}{y_{ij}}\, p_{ij}^{y_{ij}} \left(1 - p_{ij}\right)^{n_{ij} - y_{ij}} \mathbf{G}(d\mathbf{p}_i), \tag{2.1}$$
where $\mathbf{G}$ is a multivariate cumulative distribution function with support in $[0,1]^t$ and $\mathbf{p}_i = (p_{i1}, \ldots, p_{it})'$. Basically, we assume that $\mathbf{p}_i$ is distributed as $\mathbf{G}(\cdot)$ and that, given $\mathbf{p}_i$, the vector $\mathbf{y}_i$ consists of $t$ independent binomial variables. The marginal distribution of $\mathbf{y}_i$ is then given by (2.1). A popular choice for $\mathbf{G}$ is the multivariate logit-normal distribution, that is, the distribution obtained under the assumption that $\operatorname{logit}(\mathbf{p}_i) = (\log(p_{i1}/(1-p_{i1})), \ldots, \log(p_{it}/(1-p_{it})))'$ is multivariate normal with mean $\boldsymbol{\mu}_i = (\mu_{i1}, \ldots, \mu_{it})'$ and covariance matrix $\Sigma$. Here $\mu_{ij} = \mathbf{x}_{ij}' \boldsymbol{\beta}$ represents the mean as a function of the covariates and a $k$-dimensional regression parameter vector $\boldsymbol{\beta}$. To make model (2.1) identifiable we make the common assumption that $\Sigma = \phi \mathbf{R}$, where $\mathbf{R}$ is a correlation matrix and $\phi$ is a scale parameter (Joe [7], page 219). This condition is necessary for model identification, as the following simple example shows. Suppose $t = 2$ and that the vector $\mathbf{p}$ is multivariate logit-normal with mean $\boldsymbol{\mu} = \mathbf{0}$ and covariance matrix
$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0.3\,\sigma_1\sigma_2 \\ 0.3\,\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}.$$
It is easy to verify that two sets of choices for the variances $\sigma_1^2$ and $\sigma_2^2$ can result in an identical binary distribution for $\mathbf{y}$, as shown in Table 2.
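To make the mixture construction in (2.1) concrete, the following is a minimal R sketch (ours, not from the original paper; all parameter values are illustrative) that draws one subject's response vector: first $\operatorname{logit}(\mathbf{p}_i)$ is drawn from a multivariate normal, and then, given $\mathbf{p}_i$, the counts are drawn as conditionally independent binomials.

library(MASS)   # mvrnorm for multivariate normal draws

set.seed(1)
t <- 3                                # number of variables per subject
n <- c(10, 12, 8)                     # binomial trial counts n_ij
mu <- c(0.5, 0.0, -0.5)               # marginal means mu_ij = x_ij' beta
phi <- 0.4                            # scale parameter
R <- matrix(0.6, t, t); diag(R) <- 1  # illustrative correlation matrix

# Step 1: draw logit(p_i) from N(mu, phi * R), the logit-normal component
u <- mvrnorm(1, mu = mu, Sigma = phi * R)
p <- plogis(u)                        # back-transform to probabilities

# Step 2: given p_i, draw t conditionally independent binomial counts
y <- rbinom(t, size = n, prob = p)
y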

If $\boldsymbol{\beta}$, $\mathbf{R}$, and $\phi$ are the only parameters of interest, we can obtain maximum likelihood estimates by maximizing the likelihood $L(\boldsymbol{\beta}, \mathbf{R}, \phi) = \prod_{i=1}^{m} f(\mathbf{y}_i)$. However, if the $E(y_{ij}/n_{ij}) = p_{ij}$'s are also of interest, we can obtain estimates of these parameters using either the empirical Bayes (EB) method or the EM algorithm applied to the full likelihood
$$L(\mathbf{p}_1, \ldots, \mathbf{p}_m, \boldsymbol{\beta}, \mathbf{R}, \phi) = \prod_{i=1}^{m} \prod_{j=1}^{t} \binom{n_{ij}}{y_{ij}}\, p_{ij}^{y_{ij}} \left(1-p_{ij}\right)^{n_{ij}-y_{ij}} g\!\left(\mathbf{p}_i; \boldsymbol{\beta}, \mathbf{R}, \phi\right). \tag{2.2}$$
Equation (2.2) is the full specification of (2.1) with the covariates, regression parameters, correlation, and variance described through $g(\cdot)$, the multivariate logit-normal density function. One quickly notices that the number of parameters in the likelihood (2.2) grows with $m$, and maximizing (2.2) requires finding roots of complex nonlinear equations. These considerations may make the full likelihood approach subject to computational difficulties and convergence problems. Further, such specific definitions of the components in the mixture model may affect estimator robustness.

3. Alternatives to Likelihood-Based Estimation

For reasons outlined earlier it makes sense to consider the vector of logit-transformed proportions $\hat{\mathbf{u}}_i = (\hat{u}_{i1}, \ldots, \hat{u}_{it})'$, where $\hat{u}_{ij} = \operatorname{logit}(\hat{p}_{ij}) = \log[\hat{p}_{ij}/(1-\hat{p}_{ij})]$ and $\hat{p}_{ij} = y_{ij}/n_{ij}$. Note that $\hat{\mathbf{u}}_i$ is approximately distributed as multivariate normal with parameters $E(\hat{\mathbf{u}}_i) = \boldsymbol{\mu}_i = \mathbf{X}_i \boldsymbol{\beta}$ and $\operatorname{Cov}(\hat{\mathbf{u}}_i) = \phi \mathbf{R}$. The focus on these normally distributed random variables, rather than on the mixture-distribution-based binomial random variables, allows us to relax distributional assumptions and utilize distribution-free methodologies for parameter estimation, such as the generalized estimating equations (GEEs). This methodology is a two-stage process, in which the estimate of the regression parameter $\boldsymbol{\beta}$ is updated using a residual-based moment estimate of $\mathbf{R}$. Specifically, estimation iterates between the two equations
$$\sum_{i=1}^{m} \frac{\partial \boldsymbol{\mu}_i'}{\partial \boldsymbol{\beta}}\, \mathbf{R}^{-1}\left(\hat{\mathbf{u}}_i - \boldsymbol{\mu}_i\right) = \mathbf{0}, \qquad \mathbf{R} = \frac{\mathbf{Z}}{\hat{\phi}}, \tag{3.1}$$
until convergence. Here $\mathbf{Z} = \sum_{i=1}^{m} \mathbf{z}_i \mathbf{z}_i'/m$ and $\hat{\phi} = \sum_{i=1}^{m} \mathbf{z}_i' \mathbf{z}_i/(mt-k)$, where $\mathbf{z}_i = \hat{\mathbf{u}}_i - \boldsymbol{\mu}_i(\hat{\boldsymbol{\beta}})$. The problem with this methodology is that the diagonal elements of $\mathbf{Z}$ are not necessarily equal to $\hat{\phi}$, implying that the diagonal elements of $\mathbf{R} = \mathbf{Z}/\hat{\phi}$ are not necessarily unity. However, the GEE methodology, as implemented in software packages, forces the diagonal elements of $\mathbf{R}$ to unity (i.e., it overwrites the diagonal values with 1), and thus the matrix $\mathbf{R}$ is no longer guaranteed to be positive definite. This can lead (most harmlessly) to convergence problems, but it can also lead to artificially deflated variance estimates for the regression parameters and is thus subject to improper or incorrect inference.
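A small numerical illustration of this failure mode may help; the matrix below is contrived for exposition and is not from the paper. A positive definite moment matrix whose diagonal differs from unity can, after its diagonal is overwritten with ones, cease to be positive definite.

# Contrived illustration: a positive definite moment matrix Z/phi
# whose diagonal elements are not unity.
A <- matrix(c(2.0, 1.2,
              1.2, 0.8), nrow = 2, byrow = TRUE)
eigen(A)$values   # both eigenvalues positive: A is positive definite

# GEE implementations force the diagonal to 1 but keep the off-diagonal
# 1.2, which is an infeasible correlation.
Rgee <- A
diag(Rgee) <- 1
eigen(Rgee)$values  # 2.2 and -0.2: the forced matrix is not positive definite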

4. Quasi-Least Squares

The quasi-least squares (QLS) approach, on the other hand, provides an alternative estimate of $\mathbf{R}$ and does not experience the convergence problems exhibited by GEE. A further benefit of this method is that it does not require the assumption of normality for the joint distribution of the responses or for their marginal distributions. The initial step in the estimation of $\mathbf{R}$ is to minimize $\operatorname{tr}(\mathbf{R}^{-1}\mathbf{Z})$ over the set of correlation matrices. Since the diagonal elements of $\mathbf{R}$ are restricted to be one, introducing a diagonal matrix $\boldsymbol{\Lambda}$ of Lagrange multipliers, we can verify that the minimizer $\widetilde{\mathbf{R}}$ factors the matrix $\mathbf{Z}$ as
$$\mathbf{Z} = \widetilde{\mathbf{R}}\, \boldsymbol{\Lambda}\, \widetilde{\mathbf{R}}. \tag{4.1}$$
Whittle [8] has shown that for a positive definite matrix $\mathbf{Z}$ the factorization (4.1) is unique. Further, $\widetilde{\mathbf{R}} = \boldsymbol{\Lambda}^{-1/2}(\boldsymbol{\Lambda}^{1/2}\mathbf{Z}\boldsymbol{\Lambda}^{1/2})^{1/2}\boldsymbol{\Lambda}^{-1/2}$, and the diagonal matrix $\boldsymbol{\Lambda}$ satisfies the fixed-point equation $\boldsymbol{\Lambda} = \operatorname{diag}[(\boldsymbol{\Lambda}^{1/2}\mathbf{Z}\boldsymbol{\Lambda}^{1/2})^{1/2}]$, which can be solved using a simple fixed-point iterative scheme (Olkin and Pratt [9], Chaganty [4]). Next, using the first-step correlation estimate $\widetilde{\mathbf{R}}$, we can obtain a consistent correlation estimate as
$$\widehat{\mathbf{R}} = \widetilde{\mathbf{R}}\, \boldsymbol{\Delta}\, \widetilde{\mathbf{R}}, \tag{4.2}$$
where $\boldsymbol{\Delta} = \operatorname{diag}(\boldsymbol{\nu})$, $\boldsymbol{\nu} = (\widetilde{\mathbf{R}} \circ \widetilde{\mathbf{R}})^{-1}\mathbf{e}$, $\mathbf{e}$ is a vector of ones, and $\circ$ denotes the Hadamard product. It is possible that the correlation matrix (4.2) is not positive definite, in which case Chaganty and Shults [5] have recommended the estimate
$$\widehat{\mathbf{R}} = \operatorname{diag}(\mathbf{Z})^{-1/2}\, \mathbf{Z}\, \operatorname{diag}(\mathbf{Z})^{-1/2}; \tag{4.3}$$
see equation (3.2) in Chaganty and Shults [5]. The quasi-least squares method uses the estimate (4.2) of $\mathbf{R}$ if it is positive definite and otherwise uses (4.3), which is clearly a positive definite correlation matrix, to update the estimate of $\boldsymbol{\beta}$ until convergence. Code for fitting this model using the R statistical software is provided in the Appendix.
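As a quick numerical check of the factorization (4.1), the following R sketch (ours; the matrix $\mathbf{Z}$ below is illustrative) runs the fixed-point iteration for $\boldsymbol{\Lambda}$ and verifies that the resulting $\widetilde{\mathbf{R}}$ has unit diagonal and reproduces $\mathbf{Z}$; it condenses the logic of the correlation.est function in the Appendix.

# Sketch: verify the Whittle factorization Z = Rtilde Lambda Rtilde
# on an illustrative positive definite matrix.
matsqrt <- function(M) {  # symmetric matrix square root via eigendecomposition
  ev <- eigen(M, symmetric = TRUE)
  ev$vectors %*% diag(sqrt(ev$values)) %*% t(ev$vectors)
}
Z <- matrix(c(4.0, 1.0, 0.5,
              1.0, 3.0, 0.8,
              0.5, 0.8, 2.0), nrow = 3, byrow = TRUE)
Lambda <- diag(3)  # starting value for the fixed-point iteration
repeat {
  Lnew <- diag(diag(matsqrt(sqrt(Lambda) %*% Z %*% sqrt(Lambda))))
  done <- sum((diag(Lnew) - diag(Lambda))^2) < 1e-12
  Lambda <- Lnew
  if (done) break
}
Rtilde <- solve(sqrt(Lambda)) %*% matsqrt(sqrt(Lambda) %*% Z %*% sqrt(Lambda)) %*%
          solve(sqrt(Lambda))
diag(Rtilde)                                # should be (1, 1, 1)
max(abs(Z - Rtilde %*% Lambda %*% Rtilde))  # should be numerically zero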

5. Example

We now provide an example from Alderdice and Forrester [10], who modeled the effects of salinity and temperature on the proportion of hatched English sole eggs. In this study, the number of hatched eggs was recorded at seven salinity and five temperature levels. Measurements were taken in four separate tanks for each combination of salinity and temperature, and for each tank we have recordings of the number of fish eggs and the number hatched. Thus, the tanks represent the repeated measure component for this binomial data set. The data, as given on page 6 of Lindsey [11], is reproduced in Table 3.

The goal of the analysis is to study the dependence of the proportion of eggs hatched on temperature and salinity. After calculating $\hat{u}_{ij} = \operatorname{logit}(y_{ij}/n_{ij})$, where $y_{ij}$ is the number of eggs hatched out of the total $n_{ij}$ in the $j$th tank at the $i$th combination of temperature and salinity, the Shapiro-Wilk test for normality was performed on the transformed responses for each replicate. The results ($P$-values $< 0.05$) indicate a departure from normality, so that maximum likelihood methods for continuous data are not applicable. The data was analyzed using GEE in several statistical software packages with an unstructured working correlation matrix to account for the correlation between the four replications of the solution in the four tanks. The results using PROC GENMOD in SAS version 9.2, the gee.fit module in TIBCO Spotfire S+ version 8.2, and the xtgee procedure in STATA version 11 are shown in parts (i), (ii), and (iii) of Table 4. The warning message from PROC GENMOD read “WARNING: Iteration limit exceeded.” Here we see that in each case the estimates failed to converge. The 0.999 correlation estimates in part (i) represent model breakdown, in that programmers often use this value to indicate nonconvergence.

The warning message from the TIBCO Spotfire S+ software read (sic) “Warning messages: 1: at convergence at the correlation estimate 1 is outside of the range [-1,1] in cgeefit (gee.model) 2: correlation matrix is not full rank, 2<4 in: cgeefit (gee.model).” Note that the correlation between the first and second tanks in part (ii) is greater than one, clearly violating the most liberal of correlation boundaries. The warning message from xtgee in STATA version 11 read “convergence not achieved.” Also, the fourth eigenvalue in part (iii) is negative, indicating that the estimated correlation matrix is not positive definite. The results of the QLS analysis are given in part (iv) of Table 4, which show that the estimated correlation matrix is positive definite.
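For readers who want to attempt the same GEE fit in R (a package not among those compared above), a minimal sketch using geeglm from the geepack package follows; the file path and column names (ID, Hatch, Total, Salinity, Temperature) are taken from the Appendix program and are assumptions about the data layout.

# Sketch: the analogous GEE fit in R using the geepack package.
# File path and column names follow the Appendix program.
library(geepack)
d <- read.table("c:/hatch-eggs.txt", header = TRUE)
d$u <- log((d$Hatch/d$Total)/(1 - d$Hatch/d$Total))  # logit of hatch proportion
fit <- geeglm(u ~ Salinity + Temperature, family = gaussian,
              data = d, id = ID, corstr = "unstructured")
summary(fit)  # inspect the working correlation estimates for feasibility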

6. Discussion

The logit transformation was originally applied to mortality rates in univariate bioassays (Berkson [12]), though the idea also generalizes nicely to the cases of correlated repeated measure, longitudinal, or multivariate binomial data discussed here. Doing so allows the data analyst to bypass complicated, parametrically saturated mixture distributions and utilize methods for correlated continuous data. Interestingly, even after the logit transformation is applied, the GEE method still experiences convergence difficulties and problems with correlation parameter estimation. Potential causes for these problems are explained in Section 3. The QLS method, on the other hand, does not experience these difficulties and handles the simultaneous estimation of both regression and correlation parameters with relative ease. This is made possible by incorporating the little-known and rarely used matrix factorization given in (4.1).

Note that the probit transformation has an earlier origin and a similar function to the logit transformation (Bliss [13]) and can also be used in place of the logit transformation shown here. However, likelihood estimation of correlated binomial data using a latent multivariate distribution has already been established for the probit link function (Ashford and Sowden [14]) and has compared favorably to the GEE method on real data (Sabo and Chaganty [15]).

Appendix

For more details see Algorithm 1.

#########################################################################
# R (ver 2.14.1) program to compute QLS estimates for logit-transformed #
# correlated binomial data                                              #
#########################################################################
# Function to estimate the correlation matrix
# between the repeated measurements
correlation.est <- function(residuals, tol=1e-10)
{
  t <- ncol(residuals)
  Z <- t(residuals)%*%residuals
  # start the decomposition algorithm with an identity matrix
  Lambda0 <- diag(t)
  ev <- eigen(Z)
  # M is the symmetric square root of Lambda0^(1/2) Z Lambda0^(1/2);
  # with Lambda0 = I this is just the square root of Z
  M <- ev$vec %*% diag(sqrt(ev$val)) %*% t(ev$vec)
  Lambdak <- diag(diag(M))
  Diff <- diag(Lambdak - Lambda0)
  while(sum(Diff^2) > tol)
  {
   Lambda0 <- Lambdak
   ev <- eigen(sqrt(Lambda0)%*% Z %*%sqrt(Lambda0))
   M <- ev$vec%*% diag(sqrt(ev$val)) %*%t(ev$vec)
   Lambdak <- diag(diag(M))
   Diff <- diag(Lambdak - Lambda0)
  }
  Rtilde <- solve(sqrt(Lambdak)) %*% M %*% solve(sqrt(Lambdak))
  # consistent estimate (4.2): Rhat = Rtilde Delta Rtilde, where
  # Delta = diag((Rtilde o Rtilde)^{-1} e)
  Rhat <- Rtilde %*% diag(as.vector(solve(Rtilde*Rtilde) %*% rep(1,t))) %*% Rtilde
  ev <- eigen(Rhat)
  if (any(ev$val < 0)) {
   # fall back to the positive definite estimate (4.3)
   Rhat <- solve(sqrt(diag(diag(Z)))) %*% Z %*% solve(sqrt(diag(diag(Z))))
  }
  return(Rhat)
}
# Function to calculate the regression parameter beta.
regression.est <- function(x, y, t, Rhat)
{
  mt <- nrow(x)
  Sigma <- solve(kronecker(diag(mt/t), Rhat))
  XRinvX <- t(x)%*%Sigma%*%x
  XRinvY <- t(x)%*%Sigma%*%y
  betahat <- solve(XRinvX)%*%XRinvY
  return(betahat)
}
# The main program starts here
d <- read.table("c:/hatch-eggs.txt", header=TRUE)
proportion <- d$Hatch/d$Total
y <- log(proportion/(1-proportion))
x <- model.matrix(~Salinity+Temperature, data=d)
tol <- 1e-10
t <- length(d$ID)/length(unique(d$ID))
mt <- nrow(x)
m <- mt/t
k <- ncol(x)
Rhatinit <- diag(t)
betahat <- regression.est(x, y, t, Rhatinit)
residuals <- matrix(y-x%*%betahat, ncol=t, byrow=TRUE)
Rhat <- correlation.est(residuals)
betanew <- regression.est(x, y, t, Rhat)
while(sum((betanew-betahat)^2) > tol)
{
  betahat <- betanew
  residuals <- matrix(y-x%*%betahat, ncol=t, byrow=TRUE)
  Rhat <- correlation.est(residuals)
  betanew <- regression.est(x, y, t, Rhat)
}
# Calculate the scale parameter
residuals <- matrix(y-x%*%betahat, ncol=t, byrow=TRUE)
Z <- t(residuals)%*% residuals
Rhat <- correlation.est(residuals)
scale <- sum(diag(solve(Rhat)%*%Z))/(mt-k)
# Calculate model based standard errors and z-scores for betas
Sigma <- solve(kronecker(diag(m), Rhat))
Covbeta <- scale*solve(t(x)%*%Sigma%*%x)
stderrbeta <- sqrt(diag(Covbeta))
zstat <- betanew/stderrbeta
# Prepare and print the output
output <- cbind(betanew, stderrbeta, zstat, 2*(1-pnorm(abs(zstat))))
colnames(output) <- c("Estimate", "Std. Error", "z value", "P-value")
list(scale = scale, Rhat=Rhat, beta = output)