A Cox-Based Risk Prediction Model for Early Detection of Cardiovascular Disease: Identification of Key Risk Factors for the Development of a 10-Year CVD Risk Prediction
Background and Objective. Current cardiovascular disease (CVD) risk models are typically based on traditional laboratory-based predictors. The objective of this research was to identify key risk factors that affect CVD risk prediction and to develop a 10-year CVD risk prediction model using the identified risk factors. Methods. A Cox proportional hazards regression method was applied to generate the proposed risk model. We used the dataset from the Framingham Original Cohort of 5079 men and women aged 30-62 years who had no overt symptoms of CVD at baseline; among the selected cohort, 3189 had a CVD event. Results. A 10-year CVD risk model based on multiple risk factors (such as age, sex, body mass index (BMI), hypertension, systolic blood pressure (SBP), cigarettes per day, pulse rate, and diabetes) was developed, in which heart rate was identified as a novel risk factor. The proposed model achieved good discrimination and calibration, with a C-index (area under the receiver operating characteristic (ROC) curve) of 0.71 in the validation dataset. We validated the model via statistical and empirical validation. Conclusion. The proposed CVD risk prediction model is based on standard risk factors, which could help reduce the cost and time required for clinical and laboratory tests. Healthcare providers, clinicians, and patients can use this tool to estimate the 10-year risk of CVD for an individual. Heart rate was incorporated as a novel predictor, which extends the predictive ability of existing risk equations.
Cardiovascular disease (CVD) describes various conditions that affect the functioning of the heart and cardiovascular system. Due to its high morbidity, CVD has become the leading cause of mortality around the world [2–4]. In New Zealand, statistics on CVD mortality in 2017 suggest that 33% of deaths are caused by CVD.
The majority of cardiovascular-related deaths are premature and preventable, and outcomes can be improved by effective health management through diet plans, lifestyle interventions, and drug intervention. To prevent CVD, a useful approach is to assess CVD risk regularly and then introduce lifestyle adjustments or clinical treatments accordingly.
In the past decades, a great deal of research has been done on CVD risk estimation, such as the Framingham risk scores from the Framingham Heart Study (FHS) [6, 7], the QRISK equations, the Europe SCORE risk equations, the ASSIGN scores from the Scottish Heart Health Extended Cohort (SHHEC), the Prospective Cardiovascular Master (PROCAM) equations, and the CUORE Cohort Study formulas. These CVD risk prediction models have proved their effectiveness in health and disease management for clinicians and individuals [13–15]. The new PREDICT CVD risk assessment equation, developed for primary health care among the population in New Zealand, has been integrated into electronic health records (EHRs), and a web-based software tool called PREDICT has been developed to help general practices manage CVD risk in primary care. PREDICT has been used to assess the CVD risk of 400,728 patients and is becoming a useful tool for decision support and health management for general practitioners.
However, challenges and issues regarding the development of CVD risk estimation models still exist. Some CVD risk models [16–18] are based on a single risk factor and cannot capture the influence of multiple factors simultaneously. Risk models [6, 8, 19] using statistical regression methods [20–22] tend to use classic risk factors such as age, smoking, diabetes, sex, high blood pressure, and total cholesterol to estimate the risk score. Studies [18, 19, 23–27] applying data mining or machine learning techniques to CVD risk estimation cannot provide an absolute risk estimate, although some of these models [18, 26] tried to incorporate novel predictors. This research aims to identify novel risk factors for CVD detection in addition to conventional predictors and then enhance the risk estimation by developing a multiple-variable-based risk prediction model that targets 5-year and 10-year CVD events.
2.1. Study Population
The study population was selected from the Framingham Original Cohort study dataset [28, 29]. We obtained ethics approval from NHLBI and the Auckland University of Technology Ethics Committee (AUTEC) (Ref: 17/385 Early Detection and Self-Management of Cardiovascular Disease Using Artificial Intelligence-Based Model). The data from this cohort study include a total of 5079 men and women aged 30-74 years who were free of CVD at baseline, of whom 3189 eventually had CVD events. Details of the distribution of CVD events among males and females in the study population are summarized in Table 1.
2.2. Data Extraction
There are 32 exams in the Framingham Original Cohort study dataset, as shown in Appendix A. The data frame collected in the first exam, “Exam1”, was chosen to develop the CVD prediction model because it has the largest number of samples (5209 subjects). Data from 130 subjects were removed for ethics protection reasons, leaving 5079 subjects. Five other exams, Exam 8 to Exam 12 (marked in italics in Table 7 of Appendix A), were used to validate the fitted model. Data for the candidate risk factors (listed in Table 2) used to create the risk model were extracted.
2.3. Statistical Analysis
Cox proportional hazards regression analysis was selected for developing the proposed risk model; it is one of the most accurate semiparametric statistical methods. This research aims to develop a prediction model using multiple parameters to estimate the probability of developing CVD for an individual. There are three main statistical approaches in survival analysis: nonparametric, semiparametric, and parametric. The nonparametric approaches can only perform univariate analysis with a single predictor and therefore are not suitable for the study of continuous variables [22, 32]. Both parametric and semiparametric approaches can perform multiple-parameter analysis. They assume that the predictors and the log hazard rate have a linear relationship. However, the Cox proportional hazards model has the advantage that only the rank orderings of the failure and censoring times are used to estimate and test the regression coefficients. The Cox model remains efficient even when the assumptions of the parametric models are met. When the assumptions are not met, Cox regression analysis can still be used efficiently in its extended form, whereas a parametric model such as one based on the Weibull survival distribution would be a null model.
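To make the rank-ordering property concrete, the standard Cox partial likelihood (written here assuming no tied event times) is shown below. The notation is introduced for illustration only and is not taken from the original text: $\delta_i$ is the event indicator for subject $i$, and $R(t_i)$ is the set of subjects still at risk at time $t_i$.

```latex
L(\beta) = \prod_{i:\,\delta_i = 1}
  \frac{\exp\left(\beta^{\top} x_i\right)}
       {\sum_{j \in R(t_i)} \exp\left(\beta^{\top} x_j\right)}
```

The observed times enter this expression only through the composition of the risk sets, so any order-preserving change of the time scale leaves the estimated coefficients unchanged.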
Statistical analyses were performed in the RStudio platform. Missing values for the candidate risk factors listed in Table 2 were imputed using multiple imputation. Continuous and categorical variables were transformed and imputed using algorithms modified from Maximum Generalized Variance (MGV) in the SAS PRINQUAL procedure. The R function transcan in the “Hmisc” package was used.
For the candidate predictors listed in Table 2, variable selection was performed in two steps. The first step was conducted in a “Forward Selection” manner; i.e., univariate Cox analysis was applied to all candidate variables, and predictors with a p value greater than 0.05 were filtered out as insignificant. In the second step, all variables selected by the univariate analysis were entered into the multivariate Cox regression analysis to see how the risk factors jointly affect the incidence rate of CVD. Risk factors with a p value less than 0.05 were retained in the final model.
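The two-step screening described above can be sketched as follows. This is a minimal illustration of the filtering logic only, not the study's R workflow: the function names, the stand-in `example_fit_multivariate` callback, and all p values are hypothetical. In practice the callback would fit a multivariate Cox model and return its p values.

```python
ALPHA = 0.05  # significance level stated in the text


def screen(p_values, alpha=ALPHA):
    """Keep the predictors whose p value is below alpha."""
    return [name for name, p in p_values.items() if p < alpha]


def two_step_selection(univariate_p, fit_multivariate, alpha=ALPHA):
    """Step 1: univariate screening. Step 2: joint (multivariate) screening."""
    candidates = screen(univariate_p, alpha)
    # fit_multivariate is supplied by the caller and returns {name: p value}
    joint_p = fit_multivariate(candidates)
    final = [name for name in candidates if joint_p[name] < alpha]
    return candidates, final


# Hypothetical p values, for illustration only (not results from the study).
example_univariate_p = {"age": 0.001, "pulse_rate": 0.010, "height": 0.400}


def example_fit_multivariate(names):
    # Stand-in for a real multivariate Cox fit; returns made-up p values.
    made_up = {"age": 0.002, "pulse_rate": 0.030, "height": 0.500}
    return {name: made_up[name] for name in names}


step1, step2 = two_step_selection(example_univariate_p, example_fit_multivariate)
print("after univariate screening:", step1)
print("after multivariate screening:", step2)
```

A predictor with a p value of exactly 0.05 is dropped in this sketch; the text does not say how that boundary case was handled.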
In the validation stage, two approaches were undertaken to assess the predictive ability of our fitted model: statistical validation and empirical validation. The statistical validation was performed with respect to both discrimination and calibration. The empirical validation was defined as an empirical comparison with a general CVD risk prediction model (the Framingham office-based risk equation) from a horizontal and a longitudinal perspective. The horizontal comparison was conducted by comparing with the Framingham prognostic model using data collected from multiple samples at the same time point. The longitudinal comparison was conducted by comparing with the Framingham prognostic model using data collected from specific samples at different time points (follow-up at fixed time intervals) and observing the risk trend for an individual over time.
3.1. Derivation of a 10-Year Risk Score for CVD
The risk factors included in the risk model are age, sex, body mass index (BMI), hypertension, systolic blood pressure (SBP), cigarettes per day, pulse rate, and diabetes status. Characteristics of the risk factors are listed in Table 3, which summarizes the “Min.”, “1st Qu.”, “Median”, “Mean”, “3rd Qu.”, and “Max.” statistics of each risk factor.
The regression coefficients, hazard ratios, and their corresponding upper and lower 95% confidence intervals (CI) were estimated, as presented in Table 4. The baseline hazard rate at ten years was estimated as well, as shown in Table 5. The 10-year baseline hazard rate is 0.1023354 at the mean values of all covariates and 0.001863652 when all covariates equal zero. Correspondingly, the survival probability ($S_0(t)$) is 0.9027267 at the mean values and 0.9981381 when all covariates equal zero.
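The reported baseline survival probabilities are consistent with reading the 10-year baseline hazard values as cumulative baseline hazards, so that $S_0(t) = \exp(-H_0(t))$. That reading is an assumption made here, not a statement from the text. The short check below uses only the numbers quoted above; the function and variable names are illustrative.

```python
import math


def survival_from_cumulative_hazard(cumulative_hazard):
    """Convert a cumulative hazard H(t) into a survival probability S(t) = exp(-H(t))."""
    return math.exp(-cumulative_hazard)


# 10-year baseline values quoted in the text
H0_AT_MEANS = 0.1023354    # all covariates at their mean values
H0_AT_ZERO = 0.001863652   # all covariates equal to zero

S0_AT_MEANS = survival_from_cumulative_hazard(H0_AT_MEANS)
S0_AT_ZERO = survival_from_cumulative_hazard(H0_AT_ZERO)

print(round(S0_AT_MEANS, 7), round(S0_AT_ZERO, 7))
```

Both values agree with the reported 0.9027267 and 0.9981381 to within rounding.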
The Cox model has an exponential form:

$$h(t) = h_0(t)\exp\left(\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m\right), \tag{1}$$

where $t$ represents the time at which the event occurs; $h(t)$ is the hazard function for a subject at time $t$, determined by a set of $m$ covariates ($x_1, x_2, \ldots, x_m$); $\beta_1, \beta_2, \ldots, \beta_m$ are the regression coefficients that measure the effect size of the covariates; exp is the exponential function ($\exp(x) = e^x$); and $h_0(t)$ is the baseline hazard rate, an arbitrary (unknown) function corresponding to the value of the hazard when all $x_i$ equal zero.
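So, the Cox model can be written as a survival function:

$$S(t) = S_0(t)^{\exp\left(\sum_{i=1}^{m} \beta_i x_i\right)}, \tag{2}$$

where $S_0(t)$ is the baseline survival function.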
A general formula for computing risk estimates has the following form:

$$H(t) = 1 - S_0(t)^{\exp\left(\sum_{i=1}^{k} \beta_i x_i - \sum_{i=1}^{k} \beta_i \bar{x}_i\right)}, \tag{3}$$

where $H(t)$ is the CVD risk estimated for an individual; $S_0(t)$ is the baseline survival rate at follow-up time $t$, where $t = 10$ years (see Table 5); $\beta_i$ is the regression coefficient (see Table 4); $x_i$ is the value of the $i$th risk factor (if $x_i$ is continuous, it is the log-transformed value); $\bar{x}_i$ is the corresponding mean; and $k$ denotes the number of risk factors. The CVD risk function can be derived from (3) using the regression coefficients from Table 4 and the baseline hazard rates from Table 5; hence, we computed the probability of developing any type of CVD for an individual. An example of computing the absolute 10-year risk score is demonstrated in Appendix C.
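The general risk formula can be sketched in a few lines of code. This is a minimal illustration under stated assumptions: the function name `cvd_risk`, the two-predictor example, and every coefficient and mean value below are hypothetical placeholders (the fitted values are in Table 4, which is not reproduced here). Only the baseline survival of 0.9027267 at mean covariate values comes from the text. The caller is assumed to supply continuous predictors already log-transformed.

```python
import math


def cvd_risk(values, coefficients, means, baseline_survival):
    """Absolute risk 1 - S0(t) ** exp(sum(beta_i * (x_i - mean_i)))."""
    linear_predictor = sum(
        coefficients[name] * (values[name] - means[name]) for name in coefficients
    )
    return 1.0 - baseline_survival ** math.exp(linear_predictor)


S0_10_YEARS = 0.9027267  # baseline survival at mean covariate values (from the text)

# Hypothetical coefficients and means, for illustration only.
example_coefficients = {"log_age": 2.0, "diabetes": 0.5}
example_means = {"log_age": math.log(45.0), "diabetes": 0.05}

average_subject = dict(example_means)
older_subject = {"log_age": math.log(60.0), "diabetes": 0.0}

risk_average = cvd_risk(average_subject, example_coefficients, example_means, S0_10_YEARS)
risk_older = cvd_risk(older_subject, example_coefficients, example_means, S0_10_YEARS)
print(risk_average, risk_older)
```

For a subject whose covariates all equal the cohort means, the exponent is zero and the risk reduces to $1 - S_0(t)$, which is a convenient sanity check on any implementation.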
A nomogram is a two-dimensional diagram that represents a mathematical function involving several predictors. It is a simple graphical device for approximately predicting a particular event based on conventional statistical regression methods such as the Cox proportional hazards model for survival analysis. A nomogram for estimating individual 10-year survival and the median survival time (in years) is depicted in Figure 1.
In Figure 1, each predictor has its own scale, and each scale maps onto the “Points” scale. The scales at the bottom give the corresponding 10-year survival estimate and the median survival time (years). By summing the points corresponding to a patient's specific configuration of covariates, a clinician can manually obtain the predicted value of the event for that patient.
The validation of the proposed predictive risk model was performed using traditional statistics. The C-index (also called the area under the receiver operating characteristic (ROC) curve) was used to assess the goodness of the risk model based on a bootstrap internal resampling validation. The statistical validation analysis gave a C-index (area under the receiver operating characteristic curve [AUROC]) of 0.71, indicating moderately good discrimination.
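For readers unfamiliar with the C-index, the sketch below computes Harrell's concordance index for right-censored data in plain Python. It illustrates the apparent (uncorrected) statistic only. It is not the bootstrap-corrected procedure used in the study, and the function name and toy data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs in which the subject with the
    earlier observed event also has the higher predicted risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's event or censoring time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data, for illustration only: follow-up time, event flag, predicted risk.
toy_times = [1, 2, 3, 4]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.7, 0.5, 0.1]

print(concordance_index(toy_times, toy_events, toy_risks))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which places the reported 0.71 in the moderate range.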
Then, we performed an empirical validation by comparing our risk model with the Framingham Heart Study (FHS) model on an external dataset, both horizontally and longitudinally over time. In the horizontal validation process, the external dataset contained 2786 samples, of which 1693 had a CVD event. Risk scores were computed separately using the FHS model and the proposed risk model. The min (lower whisker), 1st quartile (lower hinge), median, 3rd quartile (upper hinge), and max (extreme of the upper whisker) of the estimated risks for all samples are depicted in Figure 2. The box-and-whisker graph in Figure 2 shows that the risks assessed by our Cox model are higher than the risks calculated by the Framingham model, but the difference in each summary statistic (min, 1st quartile, median, mean, 3rd quartile, and max) is approximately 0.02. For example, the median values of the FHS model and the Cox model are 0.1429475 and 0.1661985, respectively. For subjects with a CVD event, the Cox model is much more accurate than the FHS model, whereas for subjects without CVD, the Cox risk model overestimates the risk. Overall, the risk scale of the Cox model is consistent with the Framingham model, which indicates that the proposed Cox model is on par with the FHS model.
In the longitudinal validation process, we selected four subjects of both sexes, with or without CVD at the end of the Framingham Study. These four subjects, used for the longitudinal validation of the predicted CVD event, are summarized in Table 6.
For each sample, data at fixed time intervals (approximately two years) were extracted from the longitudinal follow-up. The data from five exams (Exam 8, Exam 9, Exam 10, Exam 11, and Exam 12) were extracted for comparison. Data summaries for sample 1, sample 2, sample 3, and sample 4 are listed in Appendix B. For each sample, the 10-year risk of developing CVD was computed separately for each of the five selected exams using the Cox model and the Framingham model. The trend of the risk over the years, with 5% error, is depicted in Figure 3. This figure shows that the risk trends of the two models are consistent and that the risk for a specific sample increases over time; the dotted trend lines in each graph represent the increase in CVD risk over time. Also, samples (both male and female) with diabetes who developed CVD have a higher risk than those who did not develop CVD.
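A trend line of the kind described above can be obtained with an ordinary least-squares fit of risk against exam number. Whether the dotted lines in Figure 3 were produced this way is an assumption, and the risk values below are hypothetical placeholders rather than data from Appendix B.

```python
def linear_trend(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


exams = [8, 9, 10, 11, 12]  # exam numbers used in the longitudinal comparison

# Hypothetical 10-year risk estimates for one subject, for illustration only.
hypothetical_risks = [0.10, 0.12, 0.13, 0.15, 0.18]

slope, intercept = linear_trend(exams, hypothetical_risks)
print("change in estimated risk per exam:", slope)
```

A positive slope corresponds to the rising risk over successive exams described in the text.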
It is widely accepted that CVD has become one of the most significant public health issues globally [42, 43] and contributes significantly to annual deaths worldwide. Previous studies have noted the importance of identifying associated risk factors and of the early detection and intervention of CVDs [44–48] and have investigated reducing the risk of developing CVD in its early stages. Consequently, CVD risk prediction tools based on a single variable or multiple variables have been devised to yield estimates of CVD risk [6, 8, 9, 14, 49–51].
Motivated by the objective of early detection and risk estimation of CVD, the present study was designed to identify novel CVD risk factors, determine the effect of these factors, and then develop a risk prediction model based on the identified factors. Although risk factors could vary from one specific CVD component to another, there is sufficient evidence that different types of CVD have commonalities of risk factors. We developed and validated a 10-year risk equation for CVD risk using follow-up data rigorously measured by the Framingham Heart Study.
This investigation extends the set of risk factors used by previous general CVD risk formulations, incorporating heart rate to estimate absolute CVD risk. The approach used in this research is based on advanced statistical techniques that reduce the bias in the assessment of true CVD risk. The whole process of data analysis strictly follows the guidelines of regression modelling strategies and survival analysis [34, 52].
We used continuous variables (age, BMI, SBP, and pulse rate) to generate a model that performs better than similar models developed using categorical variables. Compared with simpler approaches to inferring 5-year and 10-year risk, such as the model based on logistic regression analysis and the CVD risk model using Kaplan-Meier and the log-rank test, the proposed Cox risk model is more adequate and avoids severe errors of underestimation or overestimation [22, 34]. Moreover, this model was developed from a more substantial number of samples and events, suggesting a valid estimation of the real risk.
4.1. Comparison with Other CVD Risk Prediction Tools
The older version of the Framingham general CVD risk function is useful for identifying persons at high risk of CVD, but it was based on a limited number of risk factors (serum cholesterol, SBP, smoking history, electrocardiogram, and glucose intolerance). The new Framingham laboratory-test-based formula included HDL cholesterol in the risk function. The QRISK study investigators added family history as a novel risk factor to those used in the Framingham general formulas. Although researchers have published risk scores [6, 8, 53] for predicting general CVD, these functions did not include heart rate in the risk model.
Risk models formulated using machine learning or data mining techniques have incorporated heart rate as a risk factor, but few of these tools can predict absolute CVD risk. For example, one prediction tool focuses on the classification of CVD events by employing an ANN and a Bayesian classifier based on heart rate variability. A CVD diagnosis model categorizes CVD risk into different levels, but an absolute risk score cannot be obtained. Although another supportive tool generates an estimated risk score, the user cannot know what time horizon the score targets.
Some equations focused only on specific CVD outcomes. The Europe SCORE project equations were developed for fatal cardiovascular events. Other risk estimation tools [7, 14, 30] cover only coronary heart disease, and there are also risk models targeting stroke [16, 55]. Compared with these disease-specific models, which estimate the risk of developing specific CVD outcomes, the present study generated a general CVD risk tool that could predict a global CVD risk as well as the risk of developing individual components.
Moreover, compared with the laboratory-based algorithms, the present research proposed a more straightforward way to estimate 10-year CVD risk based on risk factors. An individual can assess his or her CVD risk during an office visit or by monitoring the combination of risk factors in the risk model, either manually or with devices such as wearable sensors.
The CVD risk prediction model could be implemented in primary care for population analysis and for identifying high-risk individuals. This would be a transformation in the healthcare management of CVD at both the individual and the population level. However, because the number of diabetes events is small, caution must be applied when using this risk model in practice. Even though we used multiple imputation to impute the missing values for diabetes, the original data were imbalanced, so the imputed data frame for “diabetes” might still be imbalanced. Advanced imputation methods need to be considered in the future to avoid unexpected outcomes caused by the imbalance in the diabetes data.
Our research aims to provide a CVD prediction model based on key risk factors so that it can be used at the point of care for better-informed decision making. Thus, risk factors based on clinical tests, such as total cholesterol and HDL cholesterol, were not included, even though some of these risk factors have a substantial effect on the development of CVD. We have provided a valid framework for creating a risk model using Cox regression; future work should consider risk factors not currently included in our model. Expanding the set of predictors in the risk model is therefore an important issue for future research.
The present study devised a risk prediction model based on multivariable predictors. A novel risk factor, “heart rate”, was incorporated into this risk equation alongside conventional risk factors. A satisfactory predictive ability, with a C-index (AUROC) of 0.71, was obtained, which supports the accuracy of the estimated risk scores. In contrast to studies focusing on specific diseases, the proposed algorithm can be applied to measure the 10-year risk of general CVD. Health care professionals, public health physicians, practice managers, and individuals can run the proposed model to quantify risk at a population level or during patient consultation and to identify high-risk individuals across the entire practice for further preventive health care.
A. Exams in the Framingham Original Cohort Study Dataset
See Table 7.
B. Data Summary for Samples
C. Computation of Absolute Risk
Here, we take a specific subject to illustrate the process of risk score calculation. This subject is a 44-year-old man without diabetes or hypertension. He has a systolic blood pressure of 120 mm Hg, a pulse rate of 82 beats per minute, and a BMI of 26.38689413 kg/m², and he is a current smoker who smokes 40 cigarettes per day, as shown in Table 12.
The risk estimate based on the Cox model is calculated as follows:
The cardiovascular disease (CVD) data used to support the findings of this study were supplied by the Framingham Heart Study-Cohort (FHS-Cohort) under license and so cannot be made freely available. Requests for access to these data should be made to the Open BioLINCC Studies Group through this website: https://biolincc.nhlbi.nih.gov/studies/framcohort/.
The main contribution of the present study is the development of a risk prediction model for early detection of CVD. More specifically, the contribution can be summarized in four major respects: firstly, a novel risk factor, “heart rate”, was identified as significant for the development of CVD; secondly, a CVD risk prediction model aimed at early detection of CVD was developed based on various risk factors; thirdly, an absolute 10-year CVD risk score can be calculated using this risk model; lastly, multiple forms of CVD risk estimation, namely a risk equation and a nomogram, were also developed.
Conflicts of Interest
The authors declare no conflicts of interest.
All authors contributed equally.
S. Mendis, P. Puska, B. Norrving et al., Global Atlas on Cardiovascular Disease Prevention and Control, World Health Organization, 2011.
D. Mozaffarian, E. J. Benjamin, A. S. Go et al., “Heart disease and stroke statistics update: a report from the American Heart Association,” Circulation, vol. 131, no. 4, pp. e29–e322, 2015.
W. C. Chan, C. Wright, T. Riddell et al., “Ethnic and socioeconomic disparities in the prevalence of cardiovascular disease in New Zealand,” The New Zealand Medical Journal, vol. 121, no. 1285, 2008.
Heart Foundation, General heart statistics in New Zealand, Heart Foundation, 2017, https://www.heartfoundation.org.nz/statistics.
J. Hippisley-Cox, C. Coupland, Y. Vinogradova, J. Robson, M. May, and P. Brindle, “Derivation and validation of QRISK, a new cardiovascular disease risk score for the United Kingdom: prospective open cohort study,” British Medical Journal, vol. 335, no. 7611, pp. 136–141, 2007.
S. Wells, T. Riddell, A. Kerr et al., “Cohort profile: the PREDICT cardiovascular disease cohort in New Zealand primary care (PREDICT-CVD 19),” International Journal of Epidemiology, vol. 46, no. 1, pp. 22–22, 2017.
Cardiovascular Disease Risk Assessment Steering Group and others, New Zealand primary care handbook 2012. Wellington: Ministry of health; 2013 (2017).
P. Unnikrishnan, D. K. Kumar, S. Poosapadi Arjunan, H. Kumar, P. Mitchell, and R. Kawasaki, “Development of health parameter model for risk prediction of CVD using SVM,” Computational and Mathematical Methods in Medicine, vol. 2016, Article ID 3016245, 7 pages, 2016.
A. Cannon, Reliability Data Banks, Springer Science & Business Media, 2012.
M. Kumari and S. Godara, “Comparative study of data mining classification methods in cardiovascular disease prediction,” Semantic Scholar, 2011.
P. Melillo, R. Izzo, A. Orrico et al., “Automatic prediction of cardiovascular and cerebrovascular events using heart rate variability analysis,” PLoS ONE, vol. 10, no. 3, Article ID e0118504, 2015.
S. Vaanathi, “Cardiovascular disease prediction using fuzzy logic expert system,” IUP Journal of Computer Sciences, vol. 11, no. 3, 2017.
T. R. Dawber, W. B. Kannel, and L. P. Lyell, “An approach to longitudinal studies in a community: the Framingham Study,” Annals of the New York Academy of Sciences, vol. 107, no. 1, pp. 539–556, 1963.
R. H. Eckel, W. W. Barouch, and A. G. Ershow, “Report of the national heart, lung, and blood institute-national institute of diabetes and digestive and kidney diseases working group on the pathophysiology of obesity-associated cardiovascular disease,” Circulation, vol. 105, no. 24, pp. 2923–2928, 2002.
E. T. Lee and J. Wang, Statistical Methods for Survival Data Analysis, vol. 476, John Wiley & Sons, 2003.
N. Mantel, “Evaluation of survival data and two new rank order statistics arising in its consideration,” Cancer Chemotherapy Reports, vol. 50, no. 3, pp. 163–170, 1966.
F. Harrell, Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis, Springer, 2015.
R. Ihaka and R. R. Gentleman, “A language for data analysis and graphics,” Journal of Computational and Graphical Statistics, vol. 5, no. 3, pp. 299–314, 1996.
S. Van Buuren, Flexible Imputation of Missing Data, CRC Press, 2012.
W. F. Kuhfeld, The prinqual procedure, SAS/STAT Users Guide 2. pp. 1265–1323. 1990.
D. S. Hay, Cardiovascular Disease in New Zealand, 2004: A Summary of Recent Statistical Information, National Heart Foundation of New Zealand, 2004.
L. Cupples, “Some risk factors related to the annual incidence of cardiovascular disease and death using pooled repeated biennial measurements,” Framingham Heart Study, 1987.
D. E. Weiner, H. Tighiouart, M. G. Amin et al., “Chronic kidney disease as a risk factor for cardiovascular disease and all-cause mortality: a pooled analysis of community-based studies,” Journal of the American Society of Nephrology, vol. 15, no. 5, pp. 1307–1315, 2004.
W. De Ruijter, R. G. J. Westendorp, W. J. J. Assendelft et al., “Use of Framingham risk score and new biomarkers to predict cardiovascular mortality in older people: population based observational cohort study,” BMJ, vol. 338, no. 7688, pp. 219–222, 2009.
L. Bannink, S. Wells, J. Broad, T. Riddell, and R. Jackson, “Web-based assessment of cardiovascular disease risk in routine primary care practice in New Zealand: the first 18,000 patients (PREDICT CVD-1),” The New Zealand Medical Journal, vol. 119, no. 1245, 2006.
D. G. Kleinbaum and M. Klein, Survival Analysis, vol. 3, Springer, 2010.