Computational and Mathematical Methods in Medicine
Volume 2017, Article ID 7847531, 18 pages
https://doi.org/10.1155/2017/7847531
Research Article

Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies

1Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health, Munich, Germany
2Department of Mathematics, Technische Universität München, Munich, Germany

Correspondence should be addressed to Christiane Fuchs; christiane.fuchs@helmholtz-muenchen.de

Received 10 February 2017; Accepted 6 June 2017; Published 24 September 2017

Academic Editor: Matthias Schmid

Copyright © 2017 Norbert Krautenbacher et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Linked References

  1. C. E. Rossiter and J. J. Schlesselman, “Case-Control Studies. Design, Conduct, Analysis,” Biometrics, vol. 39, no. 3, p. 821, 1983.
  2. E. W. Steyerberg, G. J. J. M. Borsboom, H. C. van Houwelingen, M. J. C. Eijkemans, and J. D. F. Habbema, “Validation and updating of predictive logistic regression models: A study on sample size and shrinkage,” Statistics in Medicine, vol. 23, no. 16, pp. 2567–2586, 2004.
  3. Y. Huang and M. S. Pepe, “Assessing risk prediction models in case-control studies using semiparametric and nonparametric methods,” Statistics in Medicine, vol. 29, no. 13, pp. 1391–1410, 2010.
  4. S. Rose and M. van der Laan, “A Note on Risk Prediction for Case-Control Studies, 2008,” in press.
  5. K. J. M. Janssen, Y. Vergouwe, C. J. Kalkman, D. E. Grobbee, and K. G. M. Moons, “A simple method to adjust clinical prediction models to local circumstances,” Canadian Journal of Anesthesia, vol. 56, no. 3, pp. 194–201, 2009.
  6. J. E. White, “A two stage design for the study of the relationship between a rare exposure and a rare disease,” American Journal of Epidemiology, vol. 115, no. 1, pp. 119–128, 1982.
  7. J. M. Satagopan, E. S. Venkatraman, and C. B. Begg, “Two-stage designs for gene-disease association studies with sample size constraints,” Biometrics. Journal of the International Biometric Society, vol. 60, no. 3, pp. 589–597, 2004.
  8. O. Saarela, S. Kulathinal, and J. Karvanen, “Secondary analysis under cohort sampling designs using conditional likelihood,” Journal of Probability and Statistics, Article ID 931416, 2012.
  9. T. Saidel, R. Adhikary, M. Mainkar et al., “Baseline integrated behavioural and biological assessment among most at-risk populations in six high-prevalence states of India: Design and implementation challenges,” AIDS, vol. 22, no. 5, pp. S17–S34, 2008.
  10. T. C. Mills, R. Stall, L. Pollack et al., “Health-related characteristics of men who have sex with men: A comparison of those living in "gay ghettos" with those living elsewhere,” American Journal of Public Health, vol. 91, no. 6, pp. 980–983, 2001.
  11. C. Kendall, L. R. F. S. Kerr, R. C. Gondim et al., “An empirical comparison of respondent-driven sampling, time location sampling, and snowball sampling for behavioral surveillance in men who have sex with men, Fortaleza, Brazil,” AIDS and Behavior, vol. 12, no. 1, pp. S97–S104, 2008.
  12. B. Zadrozny, “Learning and evaluating classifiers under sample selection bias,” in Proceedings of the 21st International Conference on Machine Learning (ICML '04), pp. 903–910, Alberta, Canada, July 2004.
  13. J. J. Heckman, “Sample selection bias as a specification error,” Econometrica, vol. 47, no. 1, pp. 153–161, 1979.
  14. C. Cortes, M. Mohri, M. Riley, and A. Rostamizadeh, “Sample selection bias correction theory,” in Algorithmic Learning Theory, vol. 5254 of Lecture Notes in Computer Science, pp. 38–53, Springer, Berlin, 2008.
  15. G. King and L. Zeng, “Logistic regression in rare events data,” Political Analysis, vol. 9, no. 2, pp. 137–163, 2001.
  16. T. Lumley, “Analysis of complex survey samples,” Journal of Statistical Software, vol. 9, pp. 1–19, 2004.
  17. W. H. Dumouchel and G. J. Duncan, “Using sample survey weights in multiple regression analyses of stratified samples,” Journal of the American Statistical Association, vol. 78, no. 383, pp. 535–543, 1983.
  18. B. Zadrozny, J. Langford, and N. Abe, “Cost-sensitive learning by cost-proportionate example weighting,” in Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM '03), pp. 435–442, Melbourne, Fla, USA, November 2003.
  19. W. Fan and I. Davidson, “On sample selection bias and its efficient correction via model averaging and unlabeled examples,” in Proceedings of the 7th SIAM International Conference on Data Mining (SIAM '07), pp. 320–331, Minneapolis, Minn, USA, April 2007.
  20. C. Elkan, “The foundations of cost-sensitive learning,” in Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI '01), pp. 973–978, New York, NY, USA, August 2001.
  21. D. G. Horvitz and D. J. Thompson, “A generalization of sampling without replacement from a finite universe,” Journal of the American Statistical Association, vol. 47, pp. 663–685, 1952.
  22. J. M. Robins, A. Rotnitzky, and L. P. Zhao, “Estimation of regression coefficients when some regressors are not always observed,” Journal of the American Statistical Association, vol. 89, no. 427, pp. 846–866, 1994.
  23. L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
  24. M. Nahorniak, D. P. Larsen, C. Volk, and C. E. Jordan, “Using inverse probability bootstrap sampling to eliminate sample induced bias in model based analysis of unequal probability samples,” PLoS ONE, vol. 10, no. 6, Article ID e0131765, 2015.
  25. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
  26. L. Fahrmeir, T. Kneib, and S. Lang, “Regression,” in Statistik und ihre Anwendungen, Springer Berlin Heidelberg, Berlin, Heidelberg, Germany, 2009.
  27. L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
  28. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer, New York, NY, USA, 2001.
  29. T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
  30. R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2015.
  31. M. N. Wright and A. Ziegler, “ranger: a fast implementation of random forests for high dimensional data in C++ and R,” Journal of Statistical Software, vol. 77, no. 1, pp. 1–17, 2017.
  32. D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, and F. Leisch, e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien, Vienna, Austria, 2015.
  33. W. Siriseriwan, “smotefamily: A Collection of Oversampling Techniques for Class Imbalance Problem Based on SMOTE,” 2016.
  34. X. Robin, N. Turck, A. Hainard et al., “pROC: an open-source package for R and S+ to analyze and compare ROC curves,” BMC Bioinformatics, vol. 12, no. 1, Article ID 77, 2011.
  35. T. Sing, O. Sander, N. Beerenwinkel, and T. Lengauer, “ROCR: visualizing classifier performance in R,” Bioinformatics, vol. 21, no. 20, pp. 3940–3941, 2005.
  36. J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo, “OpenML: Networked Science in Machine Learning,” SIGKDD Explorations, vol. 15, no. 2, pp. 49–60, 2014.
  37. E. R. DeLong, D. M. DeLong, and D. L. Clarke-Pearson, “Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach,” Biometrics, vol. 44, no. 3, pp. 837–845, 1988.