Abstract

While univariate instances of binomial data are readily handled with generalized linear models, cases of multivariate or repeated measure binomial data are complicated by the possibility of correlated responses. Likelihood-based estimation can be applied using mixture distribution models, though this approach can present computational challenges. The logistic transformation can be used to bypass these concerns and allow for alternative estimating procedures. One popular alternative is the generalized estimating equation (GEE) method, though systematic errors can lead to infeasible correlation estimates or nonconvergence problems. Our approach couples the quasi-least squares (QLS) method with a rarely used matrix factorization, which achieves a simplified estimation platform, as compared to the mixture model approach, and does not suffer from the convergence problems of the GEE method. A noncontrived example is provided that shows the mechanical breakdown of GEE in several statistical software packages and highlights the usefulness of the QLS approach.

1. Introduction

Binomial data occur when observations on a given subject consist of a fixed series of Bernoulli trials, resulting in a proportional outcome. Maximum likelihood estimation is readily available in a generalized linear modeling framework when subjects consist of univariate measures (i.e., one Bernoulli or binomial trial per subject). However, estimation becomes more complicated when several Bernoulli or binomial trials are observed for each subject. In this case subject responses could be multivariate (consisting of several series of separately defined trials) or repeated measures (where the sets of trials are defined similarly), and in both instances there is the possibility that the intrasubject responses are correlated. Here we use the term “subject” for convenience, but it could be an item, store, location, plot in an agricultural experiment, and so on. For example, real-life situations where we encounter correlated proportions include (1) bankers interested in the proportion of customers making the $i$th type of transaction at the $j$th bank branch; (2) the proportion of CD deposits at the $i$th branch in the $j$th month of a year; (3) retail managers interested in the proportion of customers purchasing the $i$th item at the $j$th store; (4) marketers interested in the proportion of subjects viewing the $i$th advertisement type on the $j$th website; (5) information technology specialists interested in the proportion of students who use the $i$th computer program in the $j$th computer lab; (6) biologists interested in the proportion of hatched English sole eggs kept in solutions at different temperatures and salinity levels. We will discuss the last example later in this paper. Other examples where correlated binomial data occur are the seed testing experiments described in Gilliland et al. [1].

The data layout for the aforementioned examples is presented in Table 1. In all of these examples the proportions within each row are correlated, but the rows can be assumed to be independent. The within-row correlation, while complicating matters, must be accounted for in order to obtain proper variance estimates and valid inference for the regression parameters representing the associations between the vector of proportions and the covariates. Thus, the problem is to estimate the parameters of interest within the ensemble of all parameters. In this context one could use a likelihood-based approach utilizing mixture distribution models.

In the case of binomial data the mixture model would consist of both binomial and logit-normal components. However, parameter estimation in the mixture model could experience convergence problems due to the multitude of marginal mean, regression, and correlation parameters. A simplified alternative approach would be to transform the variable-specific proportions for each subject in a way that would simplify the assumed probability distribution. The logit of the proportions transforms the outcome scale from $[0,1]$ to $(-\infty, \infty)$, which could make a multivariate normal-based methodology appropriate. One such procedure is the generalized estimating equations (GEEs) proposed by Liang and Zeger [2]. Though a popular methodology for estimating regression parameters in cases of longitudinal or repeated measure data, this procedure can suffer problems when estimating correlations. As will be seen in subsequent sections, the GEE method can fail to converge even for continuous data, which is what we obtain when the logit transformation is applied to binomial proportions.

Coull and Agresti [3] discussed random effects models for logit-transformed correlated binomial data. Here we suggest the method of quasi-least squares (QLS), developed by Chaganty [4] and Chaganty and Shults [5]. While generally seen as an alternative to solving the maximum likelihood score equation for correlation parameters in the case of Gaussian data (Sabo and Chaganty [6]), the estimation of correlation in the QLS procedure can also be supplemented with a little-known matrix factorization that makes it distinct from the maximum likelihood method. In this sense the QLS procedure is applicable to correlated continuous data, which is appropriate for logit-transformed binomial proportions.

The rest of this paper is outlined as follows. The likelihood-based mixture model approach is discussed in Section 2, while the logit transformation of binomial data and the GEE methodology are discussed in Section 3. We briefly outline the QLS estimating procedure in Section 4, while also highlighting the matrix factorization used in estimating the correlation. A noncontrived example is given in Section 5 that shows the usefulness of the QLS approach, as well as the convergence problems experienced by several statistical software packages in implementing the GEE method. A brief conclusion follows in Section 6.

2. Maximum Likelihood Estimation Using Mixture Distribution Models

For $i = 1, \ldots, m$ subjects, let $\mathbf{y}_i = (y_{i1}, \ldots, y_{it})'$ be a vector of $t$ possibly dependent binomial random variables, where $y_{ij}$ is the number of successes out of $n_{ij}$ trials with success probability $p_{ij}$ for the $j$th variable of the $i$th subject. Also assume that $\mathbf{x}_{ij} = (x_{ij1}, \ldots, x_{ijk})'$ is the vector of $k$ covariates corresponding to the $j$th variable in the $i$th subject, such that $\mathbf{X}_i = (\mathbf{x}_{i1}, \ldots, \mathbf{x}_{it})'$ is the $t \times k$ matrix of all covariates for the $i$th subject.

The general mixture distribution model for binomial data is given by
$$f(\mathbf{y}_i) = \int_{[0,1]^t} \prod_{j=1}^{t} \binom{n_{ij}}{y_{ij}}\, p_{ij}^{y_{ij}} \left(1 - p_{ij}\right)^{n_{ij} - y_{ij}} \mathbf{G}(d\mathbf{p}_i), \tag{2.1}$$
where $\mathbf{G}$ is a multivariate cumulative distribution function with support in $[0,1]^t$ and $\mathbf{p}_i = (p_{i1}, \ldots, p_{it})'$. Basically, we assume that $\mathbf{p}_i$ is distributed as $\mathbf{G}(\cdot)$ and that, given $\mathbf{p}_i$, the vector $\mathbf{y}_i$ consists of $t$ independent binomial variables. The marginal distribution of $\mathbf{y}_i$ is then given by (2.1). A popular choice for $\mathbf{G}$ is the multivariate logit-normal distribution, that is, the distribution obtained under the assumption that $\operatorname{logit}(\mathbf{p}_i) = (\log(p_{i1}/(1-p_{i1})), \ldots, \log(p_{it}/(1-p_{it})))'$ is multivariate normal with mean $\boldsymbol{\mu}_i = (\mu_{i1}, \ldots, \mu_{it})'$ and covariance matrix $\Sigma$. Here $\mu_{ij} = \mathbf{x}_{ij}' \boldsymbol{\beta}$ represents the mean as a function of the covariates and a $k$-dimensional regression parameter vector $\boldsymbol{\beta}$. To make model (2.1) identifiable we make the common assumption that $\Sigma = \phi \mathbf{R}$, where $\mathbf{R}$ is a correlation matrix and $\phi$ is a scale parameter (Joe [7], page 219). This condition is necessary for model identification, as the following simple example shows. Suppose $t = 2$ and that the vector $\mathbf{p}$ is multivariate logit-normal with mean $\boldsymbol{\mu} = \mathbf{0}$ and covariance matrix
$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0.3\,\sigma_1\sigma_2 \\ 0.3\,\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}.$$
It is easy to verify that two sets of choices for the variances $\sigma_1^2$ and $\sigma_2^2$ can result in an identical binary distribution for $\mathbf{y}$, as shown in Table 2.
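To make the mixture construction in (2.1) concrete, the following is a minimal R sketch (ours, not from the original paper; all parameter values are illustrative) that draws one subject's response vector: first $\operatorname{logit}(\mathbf{p}_i)$ is drawn from a multivariate normal, and then, given $\mathbf{p}_i$, the counts are drawn as conditionally independent binomials.

library(MASS)   # mvrnorm for multivariate normal draws

set.seed(1)
t <- 3                                # number of variables per subject
n <- c(10, 12, 8)                     # binomial trial counts n_ij
mu <- c(0.5, 0.0, -0.5)               # marginal means mu_ij = x_ij' beta
phi <- 0.4                            # scale parameter
R <- matrix(0.6, t, t); diag(R) <- 1  # illustrative correlation matrix

# Step 1: draw logit(p_i) from N(mu, phi * R), the logit-normal component
u <- mvrnorm(1, mu = mu, Sigma = phi * R)
p <- plogis(u)                        # back-transform to probabilities

# Step 2: given p_i, draw t conditionally independent binomial counts
y <- rbinom(t, size = n, prob = p)
y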

If $\boldsymbol{\beta}$, $\mathbf{R}$, and $\phi$ are the only parameters of interest, we can obtain maximum likelihood estimates by maximizing the likelihood $L(\boldsymbol{\beta}, \mathbf{R}, \phi) = \prod_{i=1}^{m} f(\mathbf{y}_i)$. However, if the $E(y_{ij}/n_{ij}) = p_{ij}$'s are also of interest, we can obtain estimates of these parameters using either the empirical Bayes (EB) method or the EM algorithm applied to the full likelihood
$$L(\mathbf{p}_1, \ldots, \mathbf{p}_m, \boldsymbol{\beta}, \mathbf{R}, \phi) = \prod_{i=1}^{m} \prod_{j=1}^{t} \binom{n_{ij}}{y_{ij}}\, p_{ij}^{y_{ij}} \left(1-p_{ij}\right)^{n_{ij}-y_{ij}} g\!\left(\mathbf{p}_i; \boldsymbol{\beta}, \mathbf{R}, \phi\right). \tag{2.2}$$
Equation (2.2) is the full specification of (2.1) with the covariates, regression parameters, correlation, and variance described through $g(\cdot)$, the multivariate logit-normal density function. One quickly notices that the number of parameters in the likelihood (2.2) grows with $m$, and maximizing (2.2) requires finding roots of complex nonlinear equations. These considerations may make the full likelihood approach subject to computational difficulties and convergence problems. Further, such specific definitions of the components in the mixture model may affect estimator robustness.

3. Alternatives to Likelihood-Based Estimation

For reasons outlined earlier it makes sense to consider the vector of logit-transformed proportions $\hat{\mathbf{u}}_i = (\hat{u}_{i1}, \ldots, \hat{u}_{it})'$, where $\hat{u}_{ij} = \operatorname{logit}(\hat{p}_{ij}) = \log[\hat{p}_{ij}/(1-\hat{p}_{ij})]$ and $\hat{p}_{ij} = y_{ij}/n_{ij}$. Note that $\hat{\mathbf{u}}_i$ is approximately distributed as multivariate normal with parameters $E(\hat{\mathbf{u}}_i) = \boldsymbol{\mu}_i = \mathbf{X}_i \boldsymbol{\beta}$ and $\operatorname{Cov}(\hat{\mathbf{u}}_i) = \phi \mathbf{R}$. The focus on these normally distributed random variables, rather than on the mixture-distribution-based binomial random variables, allows us to relax distributional assumptions and utilize distribution-free methodologies for parameter estimation, such as the generalized estimating equations (GEEs). This methodology is a two-stage process, in which the estimate of the regression parameter $\boldsymbol{\beta}$ is updated using a residual-based moment estimate of $\mathbf{R}$. Specifically, estimation iterates between the two equations
$$\sum_{i=1}^{m} \frac{\partial \boldsymbol{\mu}_i'}{\partial \boldsymbol{\beta}}\, \mathbf{R}^{-1}\left(\hat{\mathbf{u}}_i - \boldsymbol{\mu}_i\right) = \mathbf{0}, \qquad \mathbf{R} = \frac{\mathbf{Z}}{\hat{\phi}}, \tag{3.1}$$
until convergence. Here $\mathbf{Z} = \sum_{i=1}^{m} \mathbf{z}_i \mathbf{z}_i'/m$ and $\hat{\phi} = \sum_{i=1}^{m} \mathbf{z}_i' \mathbf{z}_i/(mt-k)$, where $\mathbf{z}_i = \hat{\mathbf{u}}_i - \boldsymbol{\mu}_i(\hat{\boldsymbol{\beta}})$. The problem with this methodology is that the diagonal elements of $\mathbf{Z}$ are not necessarily equal to $\hat{\phi}$, implying that the diagonal elements of $\mathbf{R} = \mathbf{Z}/\hat{\phi}$ are not necessarily unity. However, the GEE methodology, as implemented in software packages, forces the diagonal elements of $\mathbf{R}$ to unity (i.e., it overwrites the diagonal values with 1), and thus the matrix $\mathbf{R}$ is no longer guaranteed to be positive definite. This can lead (most harmlessly) to convergence problems, but it can also lead to artificially deflated variance estimates for the regression parameters and is thus subject to improper or incorrect inference.
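A small numerical illustration of this failure mode may help; the matrix below is contrived for exposition and is not from the paper. A positive definite moment matrix whose diagonal differs from unity can, after its diagonal is overwritten with ones, cease to be positive definite.

# Contrived illustration: a positive definite moment matrix Z/phi
# whose diagonal elements are not unity.
A <- matrix(c(2.0, 1.2,
              1.2, 0.8), nrow = 2, byrow = TRUE)
eigen(A)$values   # both eigenvalues positive: A is positive definite

# GEE implementations force the diagonal to 1 but keep the off-diagonal
# 1.2, which is an infeasible correlation.
Rgee <- A
diag(Rgee) <- 1
eigen(Rgee)$values  # 2.2 and -0.2: the forced matrix is not positive definite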

4. Quasi-Least Squares

The quasi-least squares (QLS) approach, on the other hand, provides an alternative estimate of $\mathbf{R}$ and does not experience the convergence problems exhibited by GEE. A further benefit of this method is that it does not require the assumption of normality for the joint distribution of the responses or for their marginal distributions. The initial step in the estimation of $\mathbf{R}$ is to minimize $\operatorname{tr}(\mathbf{R}^{-1}\mathbf{Z})$ over the set of correlation matrices. Since the diagonal elements of $\mathbf{R}$ are restricted to be one, introducing a diagonal matrix $\boldsymbol{\Lambda}$ of Lagrange multipliers, we can verify that the minimizer $\widetilde{\mathbf{R}}$ factors the matrix $\mathbf{Z}$ as
$$\mathbf{Z} = \widetilde{\mathbf{R}}\, \boldsymbol{\Lambda}\, \widetilde{\mathbf{R}}. \tag{4.1}$$
Whittle [8] has shown that for a positive definite matrix $\mathbf{Z}$ the factorization (4.1) is unique. Further, $\widetilde{\mathbf{R}} = \boldsymbol{\Lambda}^{-1/2}(\boldsymbol{\Lambda}^{1/2}\mathbf{Z}\boldsymbol{\Lambda}^{1/2})^{1/2}\boldsymbol{\Lambda}^{-1/2}$, and the diagonal matrix $\boldsymbol{\Lambda}$ satisfies the fixed-point equation $\boldsymbol{\Lambda} = \operatorname{diag}[(\boldsymbol{\Lambda}^{1/2}\mathbf{Z}\boldsymbol{\Lambda}^{1/2})^{1/2}]$, which can be solved using a simple fixed-point iterative scheme (Olkin and Pratt [9], Chaganty [4]). Next, using the first-step correlation estimate $\widetilde{\mathbf{R}}$, we can obtain a consistent correlation estimate as
$$\widehat{\mathbf{R}} = \widetilde{\mathbf{R}}\, \boldsymbol{\Delta}\, \widetilde{\mathbf{R}}, \tag{4.2}$$
where $\boldsymbol{\Delta} = \operatorname{diag}(\boldsymbol{\nu})$, $\boldsymbol{\nu} = (\widetilde{\mathbf{R}} \circ \widetilde{\mathbf{R}})^{-1}\mathbf{e}$, $\mathbf{e}$ is a vector of ones, and $\circ$ denotes the Hadamard product. It is possible that the correlation matrix (4.2) is not positive definite, in which case Chaganty and Shults [5] have recommended the estimate
$$\widehat{\mathbf{R}} = \operatorname{diag}(\mathbf{Z})^{-1/2}\, \mathbf{Z}\, \operatorname{diag}(\mathbf{Z})^{-1/2}; \tag{4.3}$$
see equation (3.2) in Chaganty and Shults [5]. The quasi-least squares method uses the estimate (4.2) of $\mathbf{R}$ if it is positive definite and otherwise uses (4.3), which is clearly a positive definite correlation matrix, to update the estimate of $\boldsymbol{\beta}$ until convergence. Code for fitting this model using the R statistical software is provided in the Appendix.
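As a quick numerical check of the factorization (4.1), the following R sketch (ours; the matrix $\mathbf{Z}$ below is illustrative) runs the fixed-point iteration for $\boldsymbol{\Lambda}$ and verifies that the resulting $\widetilde{\mathbf{R}}$ has unit diagonal and reproduces $\mathbf{Z}$; it condenses the logic of the correlation.est function in the Appendix.

# Sketch: verify the Whittle factorization Z = Rtilde Lambda Rtilde
# on an illustrative positive definite matrix.
matsqrt <- function(M) {  # symmetric matrix square root via eigendecomposition
  ev <- eigen(M, symmetric = TRUE)
  ev$vectors %*% diag(sqrt(ev$values)) %*% t(ev$vectors)
}
Z <- matrix(c(4.0, 1.0, 0.5,
              1.0, 3.0, 0.8,
              0.5, 0.8, 2.0), nrow = 3, byrow = TRUE)
Lambda <- diag(3)  # starting value for the fixed-point iteration
repeat {
  Lnew <- diag(diag(matsqrt(sqrt(Lambda) %*% Z %*% sqrt(Lambda))))
  done <- sum((diag(Lnew) - diag(Lambda))^2) < 1e-12
  Lambda <- Lnew
  if (done) break
}
Rtilde <- solve(sqrt(Lambda)) %*% matsqrt(sqrt(Lambda) %*% Z %*% sqrt(Lambda)) %*%
          solve(sqrt(Lambda))
diag(Rtilde)                                # should be (1, 1, 1)
max(abs(Z - Rtilde %*% Lambda %*% Rtilde))  # should be numerically zero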

5. Example

We now provide an example from Alderdice and Forrester [10], who modeled the effects of salinity and temperature on the proportion of hatched English sole eggs. In this study, the number of hatched eggs was recorded at seven salinity and five temperature levels. Measurements were taken in four separate tanks for each combination of salinity and temperature, and for each tank we have recordings of the number of fish eggs and the number hatched. Thus, the tanks represent the repeated measure component for this binomial data set. The data, as given on page 6 of Lindsey [11], is reproduced in Table 3.

The goal of the analysis is to study the dependence of the proportion of eggs hatched on temperature and salinity. After calculating $\hat{u}_{ij} = \operatorname{logit}(y_{ij}/n_{ij})$, where $y_{ij}$ is the number of eggs hatched out of the total $n_{ij}$ in the $j$th tank at the $i$th combination of temperature and salinity, the Shapiro-Wilk test for normality was performed on the transformed responses for each replicate. The results ($P$-values $< 0.05$) indicate a departure from normality, so that maximum likelihood methods for continuous data are not applicable. The data was analyzed using GEE in several statistical software packages with an unstructured working correlation matrix to account for the correlation between the four replications of the solution in the four tanks. The results using PROC GENMOD in SAS version 9.2, the gee.fit module in TIBCO Spotfire S+ version 8.2, and the xtgee procedure in STATA version 11 are shown in parts (i), (ii), and (iii) of Table 4. The warning message from PROC GENMOD read “WARNING: Iteration limit exceeded.” Here we see that in each case the estimates failed to converge. The 0.999 correlation estimates in part (i) represent model breakdown, in that programmers often use this value to indicate nonconvergence.

The warning message from the TIBCO Spotfire S+ software read (sic) “Warning messages: 1: at convergence at the correlation estimate 1 is outside of the range [-1,1] in cgeefit (gee.model) 2: correlation matrix is not full rank, 2<4 in: cgeefit (gee.model).” Note that the correlation between the first and second tanks in part (ii) is greater than one, clearly violating the most liberal of correlation boundaries. The warning message from xtgee in STATA version 11 read “convergence not achieved.” Also, the fourth eigenvalue in part (iii) is negative, indicating that the estimated correlation matrix is not positive definite. The results of the QLS analysis are given in part (iv) of Table 4, which show that the estimated correlation matrix is positive definite.
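For readers who want to attempt the same GEE fit in R (a package not among those compared above), a minimal sketch using geeglm from the geepack package follows; the file path and column names (ID, Hatch, Total, Salinity, Temperature) are taken from the Appendix program and are assumptions about the data layout.

# Sketch: the analogous GEE fit in R using the geepack package.
# File path and column names follow the Appendix program.
library(geepack)
d <- read.table("c:/hatch-eggs.txt", header = TRUE)
d$u <- log((d$Hatch/d$Total)/(1 - d$Hatch/d$Total))  # logit of hatch proportion
fit <- geeglm(u ~ Salinity + Temperature, family = gaussian,
              data = d, id = ID, corstr = "unstructured")
summary(fit)  # inspect the working correlation estimates for feasibility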

6. Discussion

The logit transformation was originally applied to mortality rates in univariate bioassays (Berkson [12]), though the idea also generalizes nicely to the cases of correlated repeated measure, longitudinal, or multivariate binomial data discussed here. Doing so allows the data analyst to bypass complicated, parametrically saturated mixture distributions and utilize methods for correlated continuous data. Interestingly, even after the logit transformation is applied, the GEE method still experiences convergence difficulties and problems with correlation parameter estimation. Potential causes for these problems are explained in Section 3. The QLS method, on the other hand, does not experience these difficulties and handles the simultaneous estimation of both regression and correlation parameters with relative ease. This is made possible by incorporating the little-known and rarely used matrix factorization given in (4.1).

Note that the probit transformation has an earlier origin and a similar function to the logit transformation (Bliss [13]) and can also be used in place of the logit transformation shown here. However, likelihood estimation of correlated binomial data using a latent multivariate distribution has already been established for the probit link function (Ashford and Sowden [14]) and has compared favorably to the GEE method on real data (Sabo and Chaganty [15]).

Appendix

For more details see Algorithm 1.

#########################################################################
# R (ver 2.14.1) program to compute QLS estimates for logit-transformed #
# correlated binomial data                                              #
#########################################################################
# Function to estimate the correlation matrix
# between the repeated measurements
correlation.est <- function(residuals, tol=1e-10)
{
  t <- ncol(residuals)
  Z <- t(residuals)%*%residuals
  # start the decomposition algorithm with an identity matrix
  Lambda0 <- diag(t)
  ev <- eigen(Z)
  # M is the symmetric square root of Lambda0^(1/2) Z Lambda0^(1/2);
  # with Lambda0 = I this is just the square root of Z
  M <- ev$vec %*% diag(sqrt(ev$val)) %*% t(ev$vec)
  Lambdak <- diag(diag(M))
  Diff <- diag(Lambdak - Lambda0)
  while(sum(Diff^2) > tol)
  {
   Lambda0 <- Lambdak
   ev <- eigen(sqrt(Lambda0)%*% Z %*%sqrt(Lambda0))
   M <- ev$vec%*% diag(sqrt(ev$val)) %*%t(ev$vec)
   Lambdak <- diag(diag(M))
   Diff <- diag(Lambdak - Lambda0)
  }
  Rtilde <- solve(sqrt(Lambdak)) %*% M %*% solve(sqrt(Lambdak))
  # consistent estimate (4.2): Rhat = Rtilde Delta Rtilde, where
  # Delta = diag((Rtilde o Rtilde)^{-1} e)
  Rhat <- Rtilde %*% diag(as.vector(solve(Rtilde*Rtilde) %*% rep(1,t))) %*% Rtilde
  ev <- eigen(Rhat)
  if (any(ev$val < 0)) {
   # fall back to the positive definite estimate (4.3)
   Rhat <- solve(sqrt(diag(diag(Z)))) %*% Z %*% solve(sqrt(diag(diag(Z))))
  }
  return(Rhat)
}
# Function to calculate the regression parameter beta.
regression.est <- function(x, y, t, Rhat)
{
  mt <- nrow(x)
  Sigma <- solve(kronecker(diag(mt/t), Rhat))
  XRinvX <- t(x)%*%Sigma%*%x
  XRinvY <- t(x)%*%Sigma%*%y
  betahat <- solve(XRinvX)%*%XRinvY
  return(betahat)
}
# The main program starts here
d <- read.table("c:/hatch-eggs.txt", header=TRUE)
proportion <- d$Hatch/d$Total
y <- log(proportion/(1-proportion))
x <- model.matrix(~Salinity+Temperature, data=d)
tol <- 1e-10
t <- length(d$ID)/length(unique(d$ID))
mt <- nrow(x)
m <- mt/t
k <- ncol(x)
Rhatinit <- diag(t)
betahat <- regression.est(x, y, t, Rhatinit)
residuals <- matrix(y-x%*%betahat, ncol=t, byrow=TRUE)
Rhat <- correlation.est(residuals)
betanew <- regression.est(x, y, t, Rhat)
while(sum((betanew-betahat)^2) > tol)
{
  betahat <- betanew
  residuals <- matrix(y-x%*%betahat, ncol=t, byrow=TRUE)
  Rhat <- correlation.est(residuals)
  betanew <- regression.est(x, y, t, Rhat)
}
# Calculate the scale parameter
residuals <- matrix(y-x%*%betahat, ncol=t, byrow=TRUE)
Z <- t(residuals)%*% residuals
Rhat <- correlation.est(residuals)
scale <- sum(diag(solve(Rhat)%*%Z))/(mt-k)
# Calculate model based standard errors and z-scores for betas
Sigma <- solve(kronecker(diag(m), Rhat))
Covbeta <- scale*solve(t(x)%*%Sigma%*%x)
stderrbeta <- sqrt(diag(Covbeta))
zstat <- betanew/stderrbeta
# Prepare and print the output
output <- cbind(betanew, stderrbeta, zstat, 2*(1-pnorm(abs(zstat))))
colnames(output) <- c("Estimate", "Std. Error", "z value", "P-value")
list(scale = scale, Rhat=Rhat, beta = output)