Let Continuous Outcome Variables Remain Continuous

Bakhshi, Enayatollah; McArdle, Brian; Mohammad, Kazem; Seifi, Behjat; Biglarian, Akbar

doi:https://doi.org/10.1155/2012/639124

Computational and Mathematical Methods in Medicine


Special Issue

Data Preprocessing and Model Design for Medicine Problems

Research Article | Open Access

Volume 2012 | Article ID 639124 | https://doi.org/10.1155/2012/639124

Enayatollah Bakhshi,¹ Brian McArdle,² Kazem Mohammad,³ Behjat Seifi,⁴ and Akbar Biglarian¹

Academic Editor: Alberto Guillén

Received 08 Nov 2011

Revised 21 Feb 2012

Accepted 29 Feb 2012

Published 29 May 2012

Abstract

The complementary log-log model is an alternative to the logistic model. In many areas of research, the outcome data are continuous. We aim to provide a procedure that allows the researcher to estimate the coefficients of the complementary log-log model without dichotomizing and without loss of information. We show that the sample size required for a specific power with the proposed approach is substantially smaller than with the dichotomizing method. We find that estimators derived from the proposed method are consistently more efficient than those from the dichotomizing method. To illustrate the use of the proposed method, we employ data from the National Health Survey in Iran (NHSI).

1. Introduction

Recently, logistic regression has become a popular tool in biomedical studies. The parameter in logistic regression has the interpretation of a log odds ratio, which is easy for practitioners such as physicians to understand. The probit and complementary log-log models are alternatives to the logistic model. For a covariate $x$ and a binary response variable $Y$, let $\pi(x) = P(Y = 1 \mid x)$; the complementary log-log model takes $\log\{-\log[1 - \pi(x)]\}$ to be linear in $x$. A related model to the complementary log-log link is the log-log link. For it, $\pi(x)$ approaches 0 sharply but approaches 1 slowly. When the complementary log-log model holds for the probability of a success, the log-log model holds for the probability of a failure [1].

These models use a categorical (dichotomous or polytomous) outcome variable. In many areas of research, however, the outcome data are continuous. Many researchers have no hesitation in dichotomizing a continuous variable, but this practice does not make use of within-category information. Several investigators have noted the disadvantages of dichotomizing both independent and outcome variables [2–10]. Ragland [11] showed that the magnitude of the odds ratio and the statistical power depend on the cutpoint used to dichotomize the response variable. From a clinical point of view, binary outcomes may be preferred for reasons such as (1) setting diagnostic criteria for disease and (2) offering a simpler interpretation of common effect measures from statistical models, such as odds ratios and relative risks. However, these advantages come at the cost of lost information. From a statistical point of view, this loss of information means that larger samples are required to attain a prespecified power.

Moser and Coombs [12] provided a closed-form relationship that allows a direct comparison between the logistic and linear regression coefficients. They also provided a procedure that allows the researcher to analyze the original continuous outcome without dichotomizing. To date, a method that applies the complementary log-log model without dichotomizing and without loss of information has not been available.

We aim to (a) provide a method that allows the researcher to estimate the coefficients of the complementary log-log model without dichotomizing and without loss of information, (b) show that the coefficient of the complementary log-log model can be interpreted in terms of the linear regression coefficients, and (c) demonstrate that the coefficient estimates from this method have smaller variances and shorter confidence intervals than those from the dichotomizing method.

2. Methods

2.1. Model

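Let $y_1, y_2, \ldots, y_n$ be $n$ independent observations on $Y$, and let $x_1, x_2, \ldots, x_p$ be $p$ predictor variables thought to be related to the response variable $Y$. The multiple linear regression model for the $i$th observation can be expressed as $y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i$ or $y_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i$, where $\mathbf{x}_i' = (1, x_{i1}, \ldots, x_{ip})$ and $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)'$. To complete the model, we make the following assumptions: (1) $E(\varepsilon_i) = 0$ for $i = 1, \ldots, n$, (2) $\operatorname{var}(\varepsilon_i) = \sigma^2$ for $i = 1, \ldots, n$, (3) the independent $\varepsilon_i$ follow an extreme value distribution for $i = 1, \ldots, n$.

Writing the model for each of the $n$ observations, in matrix form, we have $$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$ or $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$. The preceding three assumptions on $\varepsilon_i$ and $y_i$ can be expressed in terms of this model: (1) $E(\boldsymbol{\varepsilon}) = \mathbf{0}$, (2) $\operatorname{cov}(\boldsymbol{\varepsilon}) = \sigma^2 \mathbf{I}$, (3) each $\varepsilon_i$ is extreme value for $i = 1, \ldots, n$.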
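The model assumptions above can be illustrated with a small data generator. The following is a minimal sketch, assuming the largest extreme value (Gumbel) form with the standard location and scale parameterisation, shifted so that the errors have mean 0 and variance $\sigma^2$. The function names `extreme_value_error` and `simulate_responses`, and the restriction to a single predictor, are hypothetical choices made for illustration and are not taken from the paper.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def extreme_value_error(sigma, rng):
    """Draw one largest-extreme-value error with mean 0 and variance sigma**2."""
    scale = sigma * math.sqrt(6.0) / math.pi   # variance = (pi * scale)**2 / 6
    location = -EULER_GAMMA * scale            # mean = location + gamma * scale = 0
    u = rng.random()
    while u == 0.0:                            # guard against log(0)
        u = rng.random()
    return location - scale * math.log(-math.log(u))  # inverse-CDF transform


def simulate_responses(xs, beta0, beta1, sigma, seed=0):
    """Return y_i = beta0 + beta1 * x_i + error_i for a single predictor."""
    rng = random.Random(seed)
    return [beta0 + beta1 * x + extreme_value_error(sigma, rng) for x in xs]
```

A call such as `simulate_responses([-2.0, 0.0, 2.0], 1.0, 0.5, 1.0)` returns one simulated response per covariate value. Centring the errors at zero is what makes assumptions (1) and (2) hold alongside assumption (3).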

2.2. (Largest) Extreme Value Distribution

The PDF and CDF of the extreme value distribution are given by

It is easy to check that where To return to a random sample of observations , we conclude that the PDF and CDF of each independent are given by (6), and the corresponding equality (7) is given by where the estimate is the th element of vector . It is readily shown that the results also hold true for the smallest extreme value distribution (Appendix A).
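For orientation, the standard textbook parameterisation of the two extreme value (Gumbel) families is sketched below. The symbols $\mu$ (location), $s$ (scale), $c$ (cutoff), and $\gamma$ (Euler's constant) are assumptions introduced here for illustration, and the expressions need not match the notation of the authors' equations (6) and (7).

```latex
% Smallest extreme value distribution, location \mu, scale s > 0
F_{\min}(y) = 1 - \exp\{-e^{(y-\mu)/s}\}, \qquad
f_{\min}(y) = \frac{1}{s}\, e^{(y-\mu)/s} \exp\{-e^{(y-\mu)/s}\}

% Largest extreme value distribution, location \mu, scale s > 0
F_{\max}(y) = \exp\{-e^{-(y-\mu)/s}\}, \qquad
f_{\max}(y) = \frac{1}{s}\, e^{-(y-\mu)/s} \exp\{-e^{-(y-\mu)/s}\}

% Moments
E_{\min}(Y) = \mu - \gamma s, \qquad
E_{\max}(Y) = \mu + \gamma s, \qquad
\operatorname{var}(Y) = \frac{\pi^2 s^2}{6} \ \text{(both families)}

% Complementary log-log transform of a threshold probability
\log\{-\log[1 - P_{\min}(Y \le c)]\} = \frac{c - \mu}{s}, \qquad
\log\{-\log[1 - P_{\max}(Y > c)]\} = \frac{\mu - c}{s}
```

In either family the complementary log-log transform of the threshold probability is linear in the location $\mu$, which is why a linear model for the location of a continuous outcome induces a complementary log-log model for its dichotomized version.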

2.3. The Proposed Confidence Intervals

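Let $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ and $\hat{\sigma}^2 = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})/(n - p - 1)$, where $p$ is the number of predictor variables. According to the preceding three assumptions on $\boldsymbol{\varepsilon}$ and $\mathbf{y}$, we obtain $E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}$ and $E(\hat{\sigma}^2) = \sigma^2$. Therefore, $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$ are unbiased estimators of $\boldsymbol{\beta}$ and $\sigma^2$.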
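As a concrete illustration of the estimators in this subsection, the sketch below computes the ordinary least-squares coefficients and the residual variance for the special case of one predictor, where the divisor of the residual sum of squares is $n - 2$. The function name `ols_simple` is hypothetical, and the closed-form expressions are the usual textbook ones rather than code from the paper.

```python
def ols_simple(xs, ys):
    """Least-squares intercept, slope, and residual variance for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares divided by n - 2 (two estimated coefficients)
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, rss / (n - 2)
```

For example, `ols_simple([0, 1, 2, 3], [1, 3, 5, 7])` recovers the exact line with intercept 1 and slope 2 and a residual variance of 0.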

We have assumed that is distributed as an extreme value, and we approximate the extreme value distribution of the errors by the normal distribution. For normally distributed observations, follows a noncentral $t$ distribution with degree of freedom and noncentrality parameter , where represents the percentile point of a noncentral $t$ distribution with degrees of freedom and noncentrality parameter , and is the st diagonal element of . We use the approximation of the percentiles of the noncentral $t$ distribution by the standard normal percentiles [13], then Thus, we obtain an approximate percent confidence interval for

3. Comparison of the Two Methods

Let be a continuous outcome variable. For fixed value of , we define such that Suppose that form a random sample of observations, and we fit a complementary log-log model where is the vector of covariates for the th observation, and is the vector of unknown parameters. The dichotomized parameter corresponding to the effect is In general, maximum likelihood estimation (MLE) can be used to estimate the parameter. Let be the ML estimate of , and let be the covariance matrix of . Using from (23), one can construct confidence intervals. This matrix has as its diagonal the estimated variances of each of the ML estimates. The th diagonal element is given by . Therefore, and for large samples, is a percent confidence interval for the true . Then is a percent confidence interval for the true .

We now compare the from (7) with the from (17). This shows that the coefficient of the complementary log-log model, , can be interpreted in terms of the regression coefficients, . Note that are related to the responses through the general linear regression model where the independent are distributed as an extreme value with mean 0 and variance .
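The link between the two sets of coefficients can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' equations (7) and (17): it assumes largest extreme value errors with mean 0 and variance $\sigma^2$ in the standard parameterisation, a single predictor, and success defined as the outcome reaching a cutoff. Under those assumptions the slope on the complementary log-log scale is $\pi\beta_1/(\sigma\sqrt{6})$. All function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def cloglog(p):
    """Complementary log-log transform of a probability."""
    return math.log(-math.log(1.0 - p))


def success_probability(x, cutoff, beta0, beta1, sigma):
    """P(Y >= cutoff | x) for Y = beta0 + beta1 * x + error (largest EV, mean 0)."""
    scale = sigma * math.sqrt(6.0) / math.pi
    location = beta0 + beta1 * x - EULER_GAMMA * scale
    return 1.0 - math.exp(-math.exp(-(cutoff - location) / scale))


def implied_cloglog_slope(beta1, sigma):
    """Slope on the complementary log-log scale implied by the linear model."""
    return math.pi * beta1 / (sigma * math.sqrt(6.0))
```

Evaluating `cloglog(success_probability(x, ...))` at two covariate values one unit apart gives a difference equal to `implied_cloglog_slope(beta1, sigma)`, whatever cutoff is used. The cutoff only shifts the intercept.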

4. Covariance Matrix of Model Parameter Estimators

4.1. Derivation of for Large

The information matrix of generalized linear models has the form [1], where is the diagonal matrix with diagonal elements , is response variable with independent observations , and denote the value of predictor , The covariance matrix of is estimated by.

Maximum likelihood estimation for the complementary log-log model is a special case of the generalized linear models. Let

then It is readily shown that the results hold true for the largest extreme value distribution (Appendix A).

In large samples, approaches [14] which equals the th diagonal element of .

By applying the delta method, let , then

4.2. Derivation of for Large

In large samples, from (10) [15]. Therefore, In addition, .

By applying the delta method, let , then

5. Sample Sizes Saving

5.1. The Power for the Dichotomized Method

In large samples, converges to almost surely [14]. Therefore, for a given value of (i.e., ), the power is given by where

5.2. The Power for the Proposed Method

In large samples, converges to almost surely [15]. Therefore, for a given value of (i.e., ), the power is given by

where Our proposed method, since it is based on continuous rather than dichotomized data, is likely to be more powerful. We show that the proposed method can produce substantial sample size savings for a given power. Let (i) the number of parameters (i.e., ), (ii) , , that is, follows a discrete uniform distribution with range (). For simplicity, . (iii) Total samples are and for the proposed and dichotomized methods, respectively. These samples included and set of these uniformly distributed points for the proposed and dichotomized methods, respectively. That is, and , then and from (23), We consider the same power for the two methods: relative sample size That is, (34) is independent of and applies for any power, and any test size .
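The reason the relative sample size does not depend on the power or the test size can be seen from a generic large-sample argument. The sketch below assumes a two-sided Wald z-test of a zero coefficient, with the variance of each estimator written as a per-observation "unit variance" divided by the sample size. The unit variances are hypothetical placeholders for the quantities in the paper's variance expressions, which are not reproduced here, and the function names are illustrative.

```python
from statistics import NormalDist


def wald_power(beta, unit_variance, n, alpha=0.05):
    """Approximate power of a two-sided Wald z-test of H0: beta = 0.

    Assumes var(estimate) is approximately unit_variance / n.
    """
    z = NormalDist()
    standard_error = (unit_variance / n) ** 0.5
    critical = z.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(beta) / standard_error
    # Probability of rejecting in either tail under the alternative
    return z.cdf(shift - critical) + z.cdf(-shift - critical)


def relative_sample_size(unit_variance_dichotomized, unit_variance_proposed):
    """Ratio n_dichotomized / n_proposed that equalises the standard errors."""
    return unit_variance_dichotomized / unit_variance_proposed
```

Two estimators of the same coefficient have equal power exactly when their standard errors are equal, so the sample size ratio reduces to the ratio of unit variances. With hypothetical unit variances of 3 and 1, `wald_power(0.5, 3.0, 300)` and `wald_power(0.5, 1.0, 100)` coincide for any `alpha`.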

Table 1 presents relative sample sizes for a given fixed parameter and an average proportion of success . We consider the situations in which , , .

For given fixed and , the relative sample sizes in Table 1 can be computed by the following steps: (i) compute the value via the equation , (ii) calculate the cut-off point iteratively such that attains the specified value for the values , using the value of in (i).

As can be seen from Table 1, all values are greater than 1. The values of increase as the moves farther away from 1. Values of Table 1 immediately highlight the improvement accomplished by the proposed method.

6. Relative Efficiency of with

Here, we examine the relative efficiency of the estimate to the estimate .

Using (24) and (26), the relative efficiency is given by Note that the relative efficiency is independent of and and converges to a constant. Comparing (34) and (35), the relative efficiency equals the relative sample sizes. Therefore, as in Table 1, the proposed method is a consistent improvement over the dichotomizing method with respect to relative efficiencies.

It should be noted that these results hold true under the following assumptions: (1) the responses and are related through the equation where the independent are distributed as an extreme value with mean 0 and variance , (2) the independent variables follow a discrete uniform distribution.

7. Odds Ratio

For values of larger than 0.90, and are very close. Hence, for large values of , And from (7), odds ratio is given by The parameters estimated from the linear regression can be interpreted as an odds ratio.

8. Simulation Study

It should be noted that, as in Table 1, the proposed method is a consistent improvement over the dichotomizing method with respect to relative efficiencies. These results hold true under the assumption that the predictor variable has a discrete uniform distribution and that the random variables follow an extreme value distribution. To demonstrate the robustness of this conclusion to changes in the distributions of predictor variables, simulations were run under different distributional conditions. The data were sampled 10000 times for three sample sizes , three average proportions of successes , and seven . The simulated data are generated using the following algorithm: (1) Generate , where , through (7) to produce the correct , and for simplicity , . (2) For fixed , generate cutoff point using (15).

We simulated the data for two scenarios based on the distribution of the explanatory variable. In the first scenario, the independent variable follows a continuous uniform distribution with range (−2, 2), and in the second, the independent variable follows a truncated normal distribution with mean 0 and range (−2, 2). The relative mean square errors, relative interval lengths, absolute biases, and coverage probabilities were calculated.
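A reduced version of the first scenario can be sketched as follows. This is an illustration under stated assumptions, not the authors' simulation code. It assumes largest extreme value errors with mean 0, a single predictor drawn from a continuous uniform distribution on (−2, 2), and success defined as the outcome reaching a cutoff. It takes the empirical quantile of each sample as the cutoff in place of the paper's equation (15), and it converts the least-squares slope to the complementary log-log scale with $\pi\hat{\beta}_1/(\hat{\sigma}\sqrt{6})$, which follows from the standard Gumbel variance. The dichotomized fit uses a plain Fisher scoring loop. All names and default settings are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def gumbel_max(location, scale, rng):
    """Inverse-CDF draw from the largest extreme value distribution."""
    u = rng.random()
    while u == 0.0:  # guard against log(0)
        u = rng.random()
    return location - scale * math.log(-math.log(u))


def ols_slope_and_variance(xs, ys):
    """Least-squares slope and residual variance (divisor n - 2)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    return b1, rss / (n - 2)


def fit_cloglog_slope(xs, zs, iterations=15):
    """Fisher scoring for cloglog(P(z = 1 | x)) = a + b * x; returns b."""
    p_bar = sum(zs) / len(zs)
    a, b = math.log(-math.log(1.0 - p_bar)), 0.0  # intercept-only start
    for _ in range(iterations):
        s0 = s1 = i00 = i01 = i11 = 0.0
        for x, z in zip(xs, zs):
            eta = min(max(a + b * x, -20.0), 3.0)  # keep mu strictly in (0, 1)
            mu = -math.expm1(-math.exp(eta))
            dmu = math.exp(eta - math.exp(eta))    # derivative of mu w.r.t. eta
            v = mu * (1.0 - mu)
            score, weight = (z - mu) * dmu / v, dmu * dmu / v
            s0 += score
            s1 += score * x
            i00 += weight
            i01 += weight * x
            i11 += weight * x * x
        det = i00 * i11 - i01 * i01
        a += (i11 * s0 - i01 * s1) / det           # Newton step with expected information
        b += (i00 * s1 - i01 * s0) / det
    return b


def simulate(reps=200, n=200, beta1=0.25, sigma=1.0, p_success=0.2, seed=1):
    """Return the true slope, both sets of estimates, and the relative MSE."""
    rng = random.Random(seed)
    scale = sigma * math.sqrt(6.0) / math.pi
    true_slope = beta1 / scale                     # slope on the cloglog scale
    proposed, dichotomized = [], []
    for _ in range(reps):
        xs = [rng.uniform(-2.0, 2.0) for _ in range(n)]
        ys = [gumbel_max(beta1 * x - EULER_GAMMA * scale, scale, rng) for x in xs]
        cutoff = sorted(ys)[int(round((1.0 - p_success) * n))]  # empirical quantile
        zs = [1 if y >= cutoff else 0 for y in ys]
        b1, s2 = ols_slope_and_variance(xs, ys)
        proposed.append(math.pi * b1 / math.sqrt(6.0 * s2))
        dichotomized.append(fit_cloglog_slope(xs, zs))
    mse = lambda est: sum((e - true_slope) ** 2 for e in est) / reps
    return true_slope, proposed, dichotomized, mse(dichotomized) / mse(proposed)
```

Calling `simulate()` returns the estimates from both approaches along with the ratio of their mean square errors for one configuration. The defaults are arbitrary and far smaller than the 10000 replications described above, so the output is only indicative and is not a reproduction of Tables 2 and 3.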

Results of the simulations addressing the validity of the proposed method are displayed in Tables 2 and 3.

The simulations show that the relative mean square errors are all greater than 1, increasing with the average proportion of successes and when the moves farther away from 1. The results in Tables 2 and 3 demonstrate that the proposed method provides confidence intervals which successfully maintain their nominal 95 percent coverage. For the proposed method in the first scenario, 51 out of 63 coverage probabilities fell within (0.94, 0.96), and all 63 coverage probabilities were greater than 0.93; in the second scenario, almost all coverage probabilities fell within (0.94, 0.96). The absolute biases for the proposed method are never greater than a few percent. The proposed method is less biased than the dichotomizing method in 6 of 63 simulations in both scenarios.

9. An Example

To illustrate the application of the proposed method presented in the previous sections, we use data from the National Health Survey in Iran. Other analyses of these data have been reported elsewhere [16].

In this study, 14176 women aged 20–69 years were investigated. BMI (body mass index), our dependent variable, was calculated as weight in kilograms divided by height in meters squared (kg/m²). Independent variables included place of residence, age, smoking, economic index, marital status, and education level. The independent variables considered were both categorical and continuous. At first, BMI was treated as a continuous variable, and and 95 percent confidence intervals were calculated using the proposed linear regression method. Then subjects were classified into obese (BMI ≥ 30 kg/m²) and nonobese (BMI <30 kg/m²). A complementary log-log model was used for the binary analysis, with obese or nonobese used as the outcome measure. The and 95 percent confidence intervals were calculated using the dichotomized method. Table 4 presents the coefficient estimates, estimated confidence intervals, and relative confidence interval lengths. The proposed and dichotomizing methods produced different confidence intervals, although the and were similar only varying slightly. The estimate from the proposed method had smaller variances and shorter confidence intervals than the dichotomizing method. All relative confidence interval lengths were greater than 2.58.
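The outcome definitions used in this example can be written down directly. The following minimal sketch computes BMI as weight in kilograms divided by height in meters squared and applies the 30 kg/m² obesity threshold stated above. The function names are hypothetical, and the survey data themselves are not reproduced.

```python
def body_mass_index(weight_kg, height_m):
    """BMI in kg/m^2: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2


def is_obese(bmi, threshold=30.0):
    """Dichotomized outcome: True when BMI is at or above the threshold."""
    return bmi >= threshold
```

The proposed method models the value returned by `body_mass_index` directly, whereas the dichotomizing method keeps only the binary result of `is_obese`, so two women with BMI values of 30.1 and 45.0 become indistinguishable.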

10. Discussion

When the errors are assumed to follow an extreme value distribution, as noted before, the method has several advantages. First, the method allows the researcher to apply the complementary log-log model without dichotomizing and without loss of information. Second, the from the dichotomizing method is dependent on the chosen cutoff point and will vary with . However, the proposed is independent of the since is a function of the continuous and not a function of the dichotomized defined through . Third, we show that the coefficient of the complementary log-log model, , can be interpreted in terms of the regression coefficients, . Fourth, when the independent variables follow a discrete uniform distribution, the proposed method is a consistent improvement over the dichotomizing method with respect to relative efficiencies. The proposed method can provide sample size savings, smaller variances, and shorter confidence intervals than the dichotomized method. Fifth, when is large, the parameters estimated from the linear regression can be interpreted as odds ratios.

Our results were consistent with the findings by Moser and Coombs [12] and Bakhshi et al. [16] showing the greater efficiency of parameter estimates from the regression method that avoids dichotomizing in comparison with a more traditional dichotomizing method using the logistic regression.

Our main recommendation is to let continuous response remain continuous. Do not throw away information by transforming the data to binary. This means that if the objective is to estimate and/or test coefficients when responses are continuous, please resist dichotomizing your response variable.

Appendix

A. Largest Extreme Value Distribution

(a) The PDF and CDF are given by where is a continuous outcome variable, is the vector of known independent variables, is the vector of unknown parameters, and .

It is easy to check that where

(b) Suppose that is distributed as a largest extreme value with mean 0 and variance . We conclude that the PDF and CDF of each independent are given by (A.1), and the corresponding equality (A.2) is given by

(c) Similar to largest extreme value distribution then

Conflict of Interests

The authors have declared no conflict of interests.

References

[1] A. Agresti, Categorical Data Analysis, Wiley, New York, NY, USA, 2nd edition, 2002.
[2] L. P. Zhao and L. N. Kolonel, “Efficiency loss from categorizing quantitative exposures into qualitative exposures in case-control studies,” American Journal of Epidemiology, vol. 136, no. 4, pp. 464–474, 1992.
[3] R. C. MacCallum, S. Zhang, K. J. Preacher, and D. D. Rucker, “On the practice of dichotomization of quantitative variables,” Psychological Methods, vol. 7, no. 1, pp. 19–40, 2002.
[4] J. Cohen, “The cost of dichotomization,” Applied Psychological Measurement, vol. 7, no. 3, pp. 249–253, 1983.
[5] S. Greenland, “Avoiding power loss associated with categorization and ordinal scores in dose-response and trend analysis,” Epidemiology, vol. 6, no. 4, pp. 450–454, 1995.
[6] P. C. Austin and L. J. Brunner, “Inflation of the type I error rate when a continuous confounding variable is categorized in logistic regression analyses,” Statistics in Medicine, vol. 23, no. 7, pp. 1159–1178, 2004.
[7] A. Vargha, T. Rudas, H. D. Delaney, and S. E. Maxwell, “Dichotomization, partial correlation, and conditional independence,” Journal of Educational and Behavioral Statistics, vol. 21, no. 3, pp. 264–282, 1996.
[8] S. E. Maxwell and H. D. Delaney, “Bivariate median splits and spurious statistical significance,” Psychological Bulletin, vol. 113, no. 1, pp. 181–190, 1993.
[9] D. L. Streiner, “Breaking up is hard to do: the heartbreak of dichotomizing continuous data,” Canadian Journal of Psychiatry, vol. 47, no. 3, pp. 262–266, 2002.
[10] H. Chen, P. Cohen, and S. Chen, “Biased odds ratios from dichotomization of age,” Statistics in Medicine, vol. 26, no. 18, pp. 3487–3497, 2007.
[11] D. R. Ragland, “Dichotomizing continuous outcome variables: dependence of the magnitude of association and statistical power on the cutpoint,” Epidemiology, vol. 3, no. 5, pp. 434–440, 1992.
[12] B. K. Moser and L. P. Coombs, “Odds ratios for a continuous outcome variable without dichotomizing,” Statistics in Medicine, vol. 23, no. 12, pp. 1843–1860, 2004.
[13] N. L. Johnson and B. L. Welch, “Applications of the non-central t-distribution,” Biometrika, vol. 31, no. 3-4, pp. 362–389, 1940.
[14] R. J. Serfling, Approximation Theorems of Mathematical Statistics, Wiley, New York, NY, USA, 1980.
[15] T. L. Lai, H. Robbins, and C. Z. Wei, “Strong consistency of least squares estimates in multiple regression,” Proceedings of the National Academy of Sciences of the United States of America, vol. 75, no. 7, pp. 3034–3036, 1978.
[16] E. Bakhshi, M. R. Eshraghian, K. Mohammad, and B. Seifi, “A comparison of two methods for estimating odds ratios: results from the National Health Survey,” BMC Medical Research Methodology, vol. 8, article 78, 2008.

Copyright

Copyright © 2012 Enayatollah Bakhshi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
