BioMed Research International

Volume 2014, Article ID 658056, 6 pages

http://dx.doi.org/10.1155/2014/658056

## An Investigation of the Significance of Residual Confounding Effect

^{1}National Drug Research Institute, Curtin University, G.P.O. Box U 1987, Perth, WA 6845, Australia

^{2}Northern Territory Department of Health, Darwin, NT 0800, Australia

^{3}School of Public Health, Curtin University, Perth, WA 6845, Australia

Received 18 December 2013; Accepted 10 January 2014; Published 17 February 2014

Academic Editor: Tanya Chikritzhs

Copyright © 2014 Wenbin Liang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

*Background*. Observational studies are commonly conducted in health research. However, due to their lack of randomization, the estimated associations between the outcome and the exposure can be affected by unmeasured confounding factors. It is important to determine how likely a significant association observed between an outcome variable and a noncausally related exposure may be introduced by residual confounding factors. *Methods*. A simulation approach is developed based on the sufficient cause model to test the likelihood of significant associations observed between a noncausally related exposure and the outcome. *Results*. Based on the estimates from all 500 replicates, the association between the exposure and the outcome is found to be significant in 386 (77%) replicates when all confounders (component causes) are controlled for in the model. However, when a subset of real component causes and some noncausal factors are controlled for in the model, the association between exposure and the outcome becomes significant in 487 (97%) replicates. *Conclusion*. Even when all confounding factors are known and controlled for using conventional multivariate analysis, the observed association between exposure and outcome can still be dominated by residual confounding effects. Therefore, an observed significant association apparently provides limited evidence for a causal relationship.

#### 1. Introduction

Ethical and budgetary constraints often limit the application of experimental study designs in health research, so that observational studies such as cohort or case-control studies have been widely undertaken as methodological alternatives [1–5]. However, due to the lack of randomization, the estimates so obtained can be influenced by uncontrolled or unmeasured confounders and typically, the confounders bias estimates from their true values [6–12]. According to the epidemiological literature, a confounder must meet the following conditions: (i) being a cause of the disease, or a proxy of cause(s), in unexposed people; (ii) being correlated with exposure in the study population; (iii) not being an intermediate step in the causal pathway between the exposure and the disease [1, 13–16]. To deal with confounding effects, known or suspected confounders are measured together with the exposure and outcome of interest. Multivariate analyses are then performed to measure the association between the exposure and the outcome while attempting to remove the effects of such known or suspected confounders [8, 13, 17–19].

Under the sufficient cause model, a sufficient cause is a complete causal mechanism, which can be defined as a combination of minimal conditions (necessary elements) and events that inevitably produce disease, while the necessary elements that constitute a sufficient cause are component causes [2]. Commonly, the component causes and the compositions of sufficient causes are unknown, and measurement errors and misclassifications of exposures, confounders, and outcomes are present at the same time [8, 20–23]. Consequently, the estimated associations between the outcome and the exposure remain likely to be affected by unmeasured confounding factors. For example, even in well-designed studies, significant protective associations observed between truly nonprotective exposures and outcomes have turned out to be caused by unmeasured confounding factors [24, 25]. It is thus important to investigate how likely it is that a significant association observed between an outcome variable and a noncausally related exposure is introduced by residual confounding factors. In this study, we develop a simulation approach to test the likelihood of observing significant associations between a noncausally related exposure and the outcome variable based on standard multivariate analysis, given that the compositions of sufficient causes are not recognized, but either all risk factors/component causes are known and controlled, or only some of the risk factors/component causes are known and controlled. There are two objectives: (1) to investigate the likelihood of false positive observations in observational studies and (2) to propose a simulation framework for assessing epidemiologic methods that deal with confounding effects.

#### 2. Methods

##### 2.1. Overview of the Simulation

The simulation process follows the sufficient cause model [2]. For an event to occur, at least one sufficient cause has to occur. The components of a sufficient cause are randomly chosen from a pool of low to moderately correlated variables, which includes the exposure of interest and 99 other variables. The exposure of interest is set to be *non*causal for the outcome and therefore it will never be chosen as a component of a sufficient cause. Given the correlation among the 100 variables, every chosen variable is a potential confounding factor for the association between the exposure and the outcome. The association between the exposure and the outcome is then estimated using a logistic regression model, while controlling for (i) all component causes; and (ii) some of the component causes (selected at random). The simulations contain 500 replicates, with each replicate being generated through an independent process. All simulations are performed using the STATA package release 12. The procedures involved in each replicate are outlined below. Details of the simulation procedure, including the sufficient cause model and the estimation process, are provided in the Appendix.

(1) Generate a pool of 100 low to moderately correlated random variables from the uniform [0,1) distribution.

(2) Determine the composition of sufficient causes and the threshold values of components. The total number of types of sufficient causes for the outcome is randomly chosen, with at least one and at most nine (see the Appendix). Components for each type of sufficient cause are randomly selected from the 99 variables other than the exposure of interest. The remaining variable is taken as the exposure, which is set to be noncausal for the outcome. For each observation, a sufficient cause is set to occur when each of its components has a value higher than its specific threshold value. The threshold value is specific for each component as well as each type of sufficient cause, and it is randomly chosen from a uniform [0.5, 0.9) distribution. This allows the threshold values to vary between components as well as between different sufficient causes for the same component. To reflect the fact that exact threshold values are typically unknown, the variables are then dichotomized into binary form by applying the following rule: the binary form is set to 1 if the variable passes the cutoff of 0.7, and 0 otherwise. Here, the mean 0.7 of a uniform [0.5, 0.9) variable is used instead of the exact threshold values, in order to account for unavoidable measurement errors and misclassifications in confounders and exposures.

(3) Generate competing events for the outcome. The competing events are independent of the variables in the pool.

(4) Generate small random errors for the outcome, to represent measurement errors of the outcome and to smooth the computing process. The error is a Bernoulli distributed random variable that is independent of the pool variables and of the competing events, and it accounts for only a small proportion of the variance of the outcome.

(5) Determine the status (occurred or not occurred) of the outcome.

(6) Determine the known (not necessarily the true) causal factors for the outcome through a random process. Details of steps 1 to 6 can be found in the Appendix.

(7) Estimate the effect of the exposure on the outcome when all component causes are identified and no noncausal factor is mistaken for a causal factor. A logistic regression model for the outcome is fitted with the dichotomized exposure and, as covariates, the dichotomized form of every variable that is involved in at least one sufficient cause of the outcome. The fitted coefficients are the estimated effects of the exposure and of each of the component causes on the outcome, respectively.

To estimate the effect of the exposure on the outcome when only some component causes are known and some noncausal factors are mistaken for causal factors, the same form of model is fitted, but the covariates are the dichotomized forms of the variables that are “known” or suspected to be involved in at least one sufficient cause of the outcome. The fitted coefficients are then the estimated effects of the exposure and of each of the “known” risk factors on the outcome, respectively.
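The two models in step 7 can be written out as follows. This is only a sketch in notation introduced here for illustration: the symbols $Y_j$ (outcome of observation $j$), $E_j$ (dichotomized exposure), $X^{*}_{ij}$ (dichotomized form of variable $i$), $I_i$ (1 if variable $i$ is a component of at least one real sufficient cause, 0 otherwise), and $K_i$ (1 if variable $i$ is “known” or suspected to be causal, 0 otherwise) are assumptions and are not necessarily the symbols used in the original equations.

$$
\operatorname{logit} P(Y_j = 1) = \beta_0 + \beta_E E_j + \sum_{i=1}^{99} \beta_i \, I_i \, X^{*}_{ij}
$$

$$
\operatorname{logit} P(Y_j = 1) = \gamma_0 + \gamma_E E_j + \sum_{i=1}^{99} \gamma_i \, K_i \, X^{*}_{ij}
$$

In both cases the coefficient on $E_j$ is the quantity of interest: because the exposure is never a component cause, any value of $\beta_E$ or $\gamma_E$ that differs significantly from zero reflects residual confounding rather than a causal effect.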

#### 3. Results

Data obtained from replicate 1 are used as an example. Table 1 shows details of the sufficient causes and their components for replicate 1. Overall, the incidence rate (per 1000 observation units) of the outcome is 32.4, while it is 20.2 among unexposed observations and 89.0 among exposed observations. This leads to an observed crude exposed-to-unexposed risk ratio of 4.4, even though the exposure is not causal for the outcome. Moreover, as shown in Table 2, the associations between the exposure and the confounders are weak, and the level of misclassification of confounder status is low. Even when all confounding factors (component causes) are controlled for in the model, the effect of the exposure remains significant. Table 3 suggests that the effect of the exposure is further biased away from the null when only a subset of real component causes and some noncausal factors are controlled for in the model.
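The crude risk ratio quoted for replicate 1 follows directly from the two stratum-specific incidence rates. The short sketch below reproduces that arithmetic and, as a derived quantity that is not reported in the text, backs out the share of exposed observations implied by the three rates. The function names are illustrative only.

```python
def risk_ratio(rate_exposed, rate_unexposed):
    """Crude exposed-to-unexposed risk ratio."""
    return rate_exposed / rate_unexposed


def exposed_fraction(rate_overall, rate_exposed, rate_unexposed):
    """Share of exposed observations implied by the three rates.

    Solves overall = p * exposed + (1 - p) * unexposed for p.
    """
    return (rate_overall - rate_unexposed) / (rate_exposed - rate_unexposed)


# Incidence rates per 1000 observation units reported for replicate 1.
rr = risk_ratio(89.0, 20.2)
p = exposed_fraction(32.4, 89.0, 20.2)
print(round(rr, 1), round(p, 3))  # prints: 4.4 0.177
```

Under these figures roughly 18% of observations are exposed, which is in line with dichotomizing the exposure at 0.7.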

Based on the estimates from all replicates, the association between the exposure and the outcome is found to be significant in 386 (77%) out of the 500 replicates when all confounders (component causes) are controlled in the model. However, when a subset (rather than all) of real component causes and some noncausal factors are controlled in the model, the association between the exposure and the outcome becomes significant in 487 (97%) out of the 500 replicates.
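Because these percentages are estimated from 500 simulated replicates, they carry some Monte Carlo uncertainty. The sketch below recomputes the two proportions and attaches a binomial standard error to each. The standard error is an addition made here for illustration and is not reported in the study.

```python
import math


def proportion_with_se(successes, replicates):
    """Proportion of significant replicates and its binomial standard error."""
    p = successes / replicates
    se = math.sqrt(p * (1 - p) / replicates)
    return p, se


# Counts of replicates with a significant exposure effect, out of 500.
for label, count in [("all component causes controlled", 386),
                     ("subset plus noncausal factors controlled", 487)]:
    p, se = proportion_with_se(count, 500)
    print(f"{label}: {p:.3f} (SE {se:.3f})")
```

With 500 replicates both standard errors are below two percentage points, so the gap between the two scenarios is far larger than the simulation noise.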

In addition, Figure 1 indicates that, after adjustment for all the real component causes, the statistically significant estimated effects of the exposure are on average substantially smaller than those of the real component causes. The mean (standard deviation) and the 25th, 50th, and 75th percentiles of the significant coefficients (natural logarithm of the odds ratio) are 0.22 (0.17), 0.14, 0.18, and 0.25, respectively, for the noncausal exposure and 0.73 (0.79), 0.23, 0.42, and 0.927, respectively, for the real component causes.
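Since the coefficients are natural logarithms of odds ratios, they can be converted to the odds ratio scale by exponentiation. The values below are computed here from the reported summary statistics and do not appear in the original text. Exponentiating a mean log odds ratio gives a geometric mean odds ratio, whereas exponentiating a percentile gives the corresponding percentile exactly.

$$
\begin{aligned}
\text{Noncausal exposure:}\quad & e^{0.22} \approx 1.25 \ (\text{mean}), \qquad e^{0.18} \approx 1.20 \ (\text{median}) \\
\text{Real component causes:}\quad & e^{0.73} \approx 2.08 \ (\text{mean}), \qquad e^{0.42} \approx 1.52 \ (\text{median})
\end{aligned}
$$

On this scale, a typical spurious but significant exposure effect corresponds to an odds ratio of about 1.2, compared with about 1.5 to 2.1 for the real component causes.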

#### 4. Discussion

In observational studies, when a statistically significant association arises between an exposure and the outcome in the multivariate analysis, it is usually considered as supportive evidence for a causal relationship [8]. We adopt the sufficient cause model in the simulation process to investigate how likely it is that a significant association between the exposure and the outcome is observed when there is no causal association between the two in an observational study setting. The results indicate that a significant association between the exposure and its noncausally related outcome appears in more than 70% of the replicates, even when assuming that all confounders (causal factors) are known to researchers and controlled for in the multivariate analysis. In reality, many component causes of a disease are unknown [8, 20–23].

Moreover, results from the simulation study suggest that under the conventional multivariate analysis approach, residual confounding effects remain strong enough to influence the observed associations and an observed significant association provides only limited evidence for a causal relationship. Therefore, new methods are required to handle residual confounding effects. The simulation design adopted in this study can also serve as a platform to evaluate the performance of such methods.

Our simulation design has several advantages. Firstly, although all component causes and sufficient causes are determined through a random process, they are all tracked and measured, unlike in collected data, where most information on component causes and sufficient causes is unknown and unmeasurable. Secondly, for specific exposures and outcomes, information from the existing literature can be easily incorporated into the simulation design. Thirdly, the simulation design can be adjusted to fit specific prior assumptions on the distributions of, and correlations among, the component causes and the exposure, as well as on the compositions of sufficient causes. Hence it is possible to obtain estimates of the effects of the exposure under different prior assumptions.

#### 5. Conclusion

This study demonstrates that even when all confounding factors are known and controlled for using conventional multivariate analysis, the observed association between exposure and outcome can still be dominated by residual confounding effects. An observed significant association apparently provides limited evidence for a causal relationship.

#### Appendix

#### Details of Steps 1 to 6 in Simulation Procedure

(1) Generate a matrix of correlated random variables, together with two corresponding binary matrices, covering the 100 variables and all observations. The first binary matrix indicates whether a variable passes its threshold value and becomes active (occurs) in a given sufficient cause, if it is a component of that sufficient cause, for a given observation. The second binary matrix is a proximate measure of the first, given that the exact threshold value is usually unknown.

(i) Each variable is a linear combination of a variable component and a unique component, both being uniform [0,1) distributed random variables, weighted by a random proportion drawn from a uniform [0.3, 0.6) distribution. The range [0.3, 0.6) is chosen in order to set a low to moderate level of correlation among the variables. The mean (standard deviation), 25th, 50th, and 75th percentiles of the correlation coefficients for this matrix are 0.35 (0.13), 0.25, 0.33, and 0.45, respectively.

(ii) An entry of the first binary matrix is set to 1 if the variable passes its threshold value and 0 otherwise, where the threshold takes on a random value drawn from a uniform [0.5, 0.9) distribution. The mean (standard deviation), 25th, 50th, and 75th percentiles of the correlation coefficients for this matrix are 0.16 (0.07), 0.11, 0.15, and 0.20, respectively.

(iii) An entry of the second binary matrix is set to 1 if the variable passes 0.7 and 0 otherwise, where 0.7 is the expected value of the uniform [0.5, 0.9) distribution. The mean (standard deviation), 25th, 50th, and 75th percentiles of the correlation coefficients for this matrix are 0.19 (0.07), 0.13, 0.17, and 0.24, respectively.

(2) Determine sufficient cause compositions and their components.

(i) Components for nine possible sufficient causes for the outcome are determined. For each variable and each possible sufficient cause, a membership indicator equals 1 if the variable is a component of that possible sufficient cause and 0 otherwise. The indicator takes on a random value drawn from the Bernoulli distribution with a probability of success that is derived (rescaled) from a gamma distribution with both shape parameter and scale parameter equal to 1. If a sufficient cause has fewer than two components, all of its components are redetermined through the same random process.

(ii) Determine whether a possible sufficient cause occurs. A possible sufficient cause occurs in a given observation when all of its components become active (occur) in that observation; otherwise it does not occur.

(iii) Choose real sufficient causes from the nine possible sufficient causes. For each possible sufficient cause, an indicator equals 1 if it is a real sufficient cause for the outcome and 0 otherwise. The indicator takes on a random value drawn from the Bernoulli distribution with probability of success 0.5. If no real sufficient cause is assigned, the real sufficient causes for the outcome are redetermined through the same random process.

(3) Determine competing events. The competing event for the outcome is, for each observation, a Bernoulli distributed random variable with a probability of success 0.001, with the value of success (competing event occurred) being 1 and the value of failure (competing event not occurred) being 0. It is independent of the pool variables.

(4) Determine small random errors for the outcome. The small random error is, for each observation, a Bernoulli distributed random variable with a probability of success 0.001, with the value of success being 1 and the value of failure being 0. It is independent of both the pool variables and the competing events.

(5) Determine the status of the outcome, coded as 0 (outcome not occurred) or 1 (outcome occurred). For each observation, the outcome occurs when at least one real sufficient cause occurs in that observation, subject to the competing event and the random error defined in steps 3 and 4; otherwise the outcome does not occur.

(6) Determine the known/suspected causal factors, in other words, the potential confounding factors. For each variable, an indicator denotes the researcher’s knowledge (not necessarily the true state) of the variable in relation to its confounding effect on the association between the exposure and the outcome. The indicator is a random value drawn from the Bernoulli distribution, with the value of success being 1 and the value of failure being 0, and with a probability of success that depends on the total number of real sufficient causes that include the variable as a component.
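Steps 1 to 5 above can be sketched in a few lines of Python (the study itself used Stata). This is an illustrative sketch under stated assumptions, not the authors' implementation. The function name, the default number of observations, the use of variable 0 as the exposure, the way a gamma draw is rescaled into an inclusion probability, and the exact way competing events and errors enter the outcome are all assumptions made here, because those details are not recoverable from the text.

```python
import random


def simulate_replicate(n_obs=2000, n_vars=100, n_possible=9, seed=0):
    """Sketch of one replicate; variable 0 plays the noncausal exposure."""
    rng = random.Random(seed)

    # Step 1: mix a component shared within each observation with a unique
    # component. ASSUMPTION: one mixing proportion per variable, drawn
    # from uniform [0.3, 0.6), and the first component is the shared one.
    share = [rng.uniform(0.3, 0.6) for _ in range(n_vars)]
    x = []
    for _ in range(n_obs):
        common = rng.random()
        x.append([p * common + (1 - p) * rng.random() for p in share])

    # Step 2(i): components of each possible sufficient cause, chosen among
    # the non-exposure variables, each with a threshold in [0.5, 0.9).
    causes = []
    for _ in range(n_possible):
        while True:
            # ASSUMPTION: hypothetical rescaling of a gamma(1, 1) draw into
            # a small inclusion probability (the rescaling is not given).
            prob = min(rng.gammavariate(1.0, 1.0) / 25.0, 1.0)
            members = [i for i in range(1, n_vars) if rng.random() < prob]
            if len(members) >= 2:  # redraw when fewer than two components
                break
        causes.append({i: rng.uniform(0.5, 0.9) for i in members})

    # Step 2(iii): each possible cause is real with probability 0.5;
    # redraw if no cause is real.
    while True:
        real = [rng.random() < 0.5 for _ in range(n_possible)]
        if any(real):
            break

    # Steps 2(ii) and 3 to 5: outcome status for each observation.
    y = []
    for row in x:
        occurs = any(
            is_real and all(row[i] > t for i, t in cause.items())
            for cause, is_real in zip(causes, real)
        )
        competing = rng.random() < 0.001
        error = rng.random() < 0.001
        # ASSUMPTION: a competing event blocks the outcome, an error forces it.
        y.append(1 if (error or (occurs and not competing)) else 0)

    # Exposure as measured by the researcher: dichotomized at 0.7.
    exposure = [1 if row[0] > 0.7 else 0 for row in x]
    return exposure, y, causes, real


if __name__ == "__main__":
    exposure, y, causes, real = simulate_replicate()
    print("incidence per 1000:", 1000 * sum(y) / len(y))
```

Step 6 and the logistic regressions are left out because the probability used to assign “known” factors is not stated in the text above. Changing `seed` gives an independent replicate, which mirrors how the 500 replicates are generated.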

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### References

1. K. J. Rothman and S. Greenland, “Precision and validity in epidemiologic studies,” in *Modern Epidemiology*, pp. 115–134, 1998.
2. K. J. Rothman and S. Greenland, “Causation and causal inference,” in *Modern Epidemiology*, pp. 7–28, 1998.
3. E. Riboli and R. Kaaks, “The EPIC project: rationale and study design,” *International Journal of Epidemiology*, vol. 26, supplement 1, pp. S6–S14, 1997.
4. H. Morgenstern and D. Thomas, “Principles of study design in environmental epidemiology,” *Environmental Health Perspectives*, vol. 101, supplement 4, pp. 23–38, 1993.
5. P. S. Yusuf, S. Hawken, S. Ôunpuu et al., “Effect of potentially modifiable risk factors associated with myocardial infarction in 52 countries (the INTERHEART study): case-control study,” *The Lancet*, vol. 364, no. 9438, pp. 937–952, 2004.
6. W. Liang, “Evaluating epidemiological evidence: a simple test,” *International Journal of Medical Sciences*, vol. 10, no. 11, pp. 1459–1461, 2013.
7. W. Liang and T. Chikritzhs, “Does light alcohol consumption during pregnancy improve offspring's cognitive development?” *Medical Hypotheses*, vol. 78, no. 1, pp. 69–70, 2012.
8. K. J. Rothman and S. Greenland, “Causation and causal inference in epidemiology,” *American Journal of Public Health*, vol. 95, no. 1, pp. S144–S150, 2005.
9. S. Greenland, J. Copas, D. R. Jones et al., “Multiple-bias modelling for analysis of observational data,” *Journal of the Royal Statistical Society A*, vol. 168, no. 2, pp. 267–306, 2005.
10. W. Liang and T. Chikritzhs, “Alcohol consumption and health status of family members: health impacts without ingestion,” *Internal Medicine Journal*, vol. 43, no. 9, pp. 1012–1016, 2012.
11. Z. Fewell, G. D. Smith, and J. A. C. Sterne, “The impact of residual and unmeasured confounding in epidemiologic studies: a simulation study,” *American Journal of Epidemiology*, vol. 166, supplement 6, pp. 646–655, 2007.
12. G. Davey Smith and A. N. Phillips, “Confounding in epidemiological studies: why “independent” effects may not be all they seem,” *British Medical Journal*, vol. 305, no. 6856, pp. 757–759, 1992.
13. R. McNamee, “Confounding and confounders,” *Occupational and Environmental Medicine*, vol. 60, no. 3, pp. 227–234, 2003.
14. C. R. Weinberg, “Toward a clearer definition of confounding,” *American Journal of Epidemiology*, vol. 137, no. 1, pp. 1–8, 1993.
15. S. Greenland, J. M. Robins, and J. Pearl, “Confounding and collapsibility in causal inference,” *Statistical Science*, vol. 14, no. 1, pp. 29–46, 1999.
16. S. Greenland and H. Morgenstern, “Confounding in health research,” *Annual Review of Public Health*, vol. 22, pp. 189–212, 2001.
17. R. H. H. Groenwold, A. W. Hoes, and E. Hak, “Confounding in publications of observational intervention studies,” *European Journal of Epidemiology*, vol. 22, no. 7, pp. 413–415, 2007.
18. E. von Elm, D. G. Altman, M. Egger, S. J. Pocock, P. C. Gøtzsche, and J. P. Vandenbroucke, “The strengthening the reporting of observational studies in epidemiology (STROBE) statement: guidelines for reporting observational studies,” *Preventive Medicine*, vol. 45, no. 4, pp. 247–251, 2007.
19. R. M. Mickey and S. Greenland, “The impact of confounder selection criteria on effect estimation,” *American Journal of Epidemiology*, vol. 129, no. 1, pp. 125–137, 1989.
20. J. R. Kelley and J. M. Duggan, “Gastric cancer epidemiology and risk factors,” *Journal of Clinical Epidemiology*, vol. 56, no. 1, pp. 1–9, 2003.
21. M. Susser, “What is a cause and how do we know one? A grammar for pragmatic epidemiology,” *American Journal of Epidemiology*, vol. 133, no. 7, pp. 635–648, 1991.
22. A. Blair, P. Stewart, J. H. Lubin, and F. Forastiere, “Methodological issues regarding confounding and exposure misclassification in epidemiological studies of occupational exposures,” *American Journal of Industrial Medicine*, vol. 50, no. 3, pp. 199–207, 2007.
23. J. R. Marshall and J. L. Hastrup, “Mismeasurement and the resonance of strong confounders: uncorrelated errors,” *American Journal of Epidemiology*, vol. 143, no. 10, pp. 1069–1078, 1996.
24. M. J. Stampfer, G. A. Colditz, W. C. Willett et al., “Postmenopausal estrogen therapy and cardiovascular disease–ten-year follow-up from the nurses' Health Study,” *The New England Journal of Medicine*, vol. 325, no. 11, pp. 756–762, 1991.
25. J. E. Manson, J. Hsia, K. C. Johnson et al., “Estrogen plus progestin and the risk of coronary heart disease,” *The New England Journal of Medicine*, vol. 349, no. 6, pp. 523–534, 2003.