Abstract

Non-small-cell lung cancer (NSCLC) patients often develop bone metastases (BM), and the overall survival of these patients is usually poor. However, a highly accurate model for predicting the survival of NSCLC patients with BM is still lacking. Here, we aimed to establish an artificial intelligence model for predicting the 1-year survival of NSCLC patients with BM by using extreme gradient boosting (XGBoost), a large-scale machine learning algorithm. We selected NSCLC patients with BM diagnosed between 2010 and 2015 from the Surveillance, Epidemiology, and End Results database. In total, 5973 cases were enrolled and divided into the training (n = 4183) and validation (n = 1790) sets. XGBoost, random forest, support vector machine, and logistic regression algorithms were used to generate predictive models. Receiver operating characteristic curves were used to evaluate and compare the predictive performance of each model. Tumor size, age, race, sex, primary site, histological subtype, grade, laterality, T stage, N stage, surgery, radiotherapy, chemotherapy, distant metastases to other sites (lung, brain, and liver), and marital status were selected to construct all predictive models. The XGBoost model performed better than the other models in both the training and validation sets in terms of accuracy. Our data suggest that the XGBoost model is the most precise and personalized tool for predicting the 1-year survival of NSCLC patients with BM. This model can help clinicians design more rational and effective therapeutic strategies.

1. Introduction

Early-stage lung cancer is usually asymptomatic. Hence, lung cancer is frequently diagnosed at a late stage [1, 2]. Non-small-cell lung cancer (NSCLC) is the most common histological subtype of lung cancer, with about 40% of cases harboring distant metastases at first diagnosis [3]. Bone metastases (BM) occur in 30–40% of NSCLC patients, making them one of the most frequent distant metastasis events [4]. Distant metastases are known to be the leading cause of cancer-related death [5, 6]. For NSCLC patients with BM, the reported median survival is less than 1 year in different populations [7]. Such a poor prognosis highlights the need for accurate tools for predicting the prognosis of NSCLC with BM.

The TNM staging system is a tool based on pathological anatomy that can assist clinicians in developing effective treatment strategies and improving patients’ prognosis [8]. However, the prognosis of patients with the same stage differs notably, indicating significant limitations of using the TNM staging system as a prognostic model. More importantly, many other factors should be considered when predicting the prognosis of patients [9, 10]. Survival prediction models designed specifically for lung cancer with BM have been reported previously [11–13]. However, the performance of these models is barely satisfactory, because they are based on the simple Cox regression model and were not established specifically as survival prediction models for NSCLC with BM. Given the impact of histological differences on prognosis, we propose narrowing the scope of the study population. For example, restricting the model to patients with a certain histological type, such as NSCLC, is necessary to improve the accuracy of the predictive model. Currently, artificial intelligence (AI) models based on machine learning (ML) algorithms are increasingly applied in clinical practice. Most models, including random forest (RF), support vector machine (SVM), Bayesian network, and decision tree, are based on traditional ML algorithms [14]. Extreme gradient boosting (XGBoost) is a typical boosting algorithm designed to be highly efficient, flexible, and portable. Boosting is an ensemble technique in which new models correct the errors produced by existing models [15]. These advantages underlie the high performance of XGBoost, which has provided satisfactory results in machine learning competitions and has been successfully used in other studies and domains [16].

Therefore, in the current study, we extracted the NSCLC patients with BM from the Surveillance, Epidemiology, and End Results (SEER) database and searched for an ideal AI model to predict the 1-year survival of NSCLC with BM by testing the XGBoost and other traditional algorithms.

2. Methods and Materials

2.1. Patients

All NSCLC patients with confirmed BM in the SEER database between 2010 and 2015 were selected for this study. The inclusion criteria were as follows: (a) patients diagnosed with lung cancer on histology, (b) the histologic type of NSCLC, and (c) patients with BM. The exclusion criteria were as follows: (a) lung cancer not the primary cancer, (b) patients without complete clinicopathological characteristics, demographic information, or follow-up information, and (c) follow-up month at the follow-up deadline. Finally, we extracted 5973 NSCLC patients with BM from 309,056 lung cancer patients. The study population was divided into the training and validation sets at a ratio of 7:3. The allocation was completely randomized and was performed in R software. In addition, we retrospectively collected data for NSCLC patients with BM from the Affiliated Hospital of Chengde Medical University (AHOCMU) between 2015 and 2019 as an external validation set.
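The random 7:3 allocation can be illustrated with a minimal sketch. The study performed this step in R; the Python function below, its name `split_train_validation`, and the seed value are assumptions made for illustration only, and the resulting counts need not match the study's actual split.

```python
import random

def split_train_validation(patient_ids, train_fraction=0.7, seed=42):
    """Randomly split patient IDs into training and validation lists.

    Hypothetical helper: the seed and rounding rule are illustrative choices,
    not the procedure used in the study.
    """
    shuffled = list(patient_ids)            # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)   # seeded shuffle for reproducibility
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder IDs 0..5972 stand in for the 5973 enrolled cases.
train_ids, validation_ids = split_train_validation(range(5973))
print(len(train_ids), len(validation_ids))
```

Fixing the seed makes the allocation repeatable, which helps when several models must be compared on exactly the same partition.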

2.2. Data Collection

Based on the specific patient information available in the SEER database, we selected 19 variables that may affect the prognosis of NSCLC with BM, including age, sex, race, tumor size, tumor site, histological type, grade, laterality, surgery, chemotherapy, radiotherapy, TNM staging, distant metastasis sites (lung, brain, and liver), insurance status, and marital status.

The primary site is defined according to the International Classification of Diseases for Oncology (ICD-O) codes: main bronchus (C34.0), lobe (C34.1–C34.3), overlapping lesion of the lung (C34.8), and lung, if not otherwise specified (C34.9). The histological type is defined in accordance with the following ICD-O-3 codes: adenocarcinoma (8140, 8141, 8144, 8244, 8250–8255, 8260, 8290, 8310, 8323, 8333, 8470, 8480, 8481, 8490, 8507, 8550, 8551, 8570, 8571, 8574, and 8576), squamous cell carcinoma (8052, 8070–8076, 8078, 8083, 8084, and 8123), and other NSCLC (8004, 8012–8014, 8022, 8030, 8035, 8046, 8082, 8200, 8240, 8249, 8430, 8560, and 8562). Regarding marital status, we excluded the ambiguous category of unmarried or domestic partner; then, “unmarried,” “separated,” “single,” and “widowed” were all included in the unmarried group. The insurance status was divided into insured and uninsured; “any Medicaid,” “insured,” and “insured/no specific” were included in the insured group. All cases in the present study were staged using the 7th edition of the AJCC TNM staging system.
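The code groupings above can be expressed as simple lookups. The following is a minimal sketch: the code lists are copied from the text, while the names `histology_group` and `PRIMARY_SITE` and the choice to return `None` for unlisted codes are assumptions for illustration.

```python
# ICD-O-3 histology codes as listed in the text; ranges are expanded inclusively.
ADENOCARCINOMA = {8140, 8141, 8144, 8244, *range(8250, 8256), 8260, 8290, 8310,
                  8323, 8333, 8470, 8480, 8481, 8490, 8507, 8550, 8551, 8570,
                  8571, 8574, 8576}
SQUAMOUS = {8052, *range(8070, 8077), 8078, 8083, 8084, 8123}
OTHER_NSCLC = {8004, *range(8012, 8015), 8022, 8030, 8035, 8046, 8082, 8200,
               8240, 8249, 8430, 8560, 8562}

# ICD-O topography codes for the primary site, as listed in the text.
PRIMARY_SITE = {
    "C34.0": "main bronchus",
    "C34.1": "lobe",
    "C34.2": "lobe",
    "C34.3": "lobe",
    "C34.8": "overlapping lesion of the lung",
    "C34.9": "lung, not otherwise specified",
}

def histology_group(icd_o_3_code):
    """Map an ICD-O-3 histology code to the study's three groups.

    Returns None for codes outside the listed groups (an illustrative choice).
    """
    if icd_o_3_code in ADENOCARCINOMA:
        return "adenocarcinoma"
    if icd_o_3_code in SQUAMOUS:
        return "squamous cell carcinoma"
    if icd_o_3_code in OTHER_NSCLC:
        return "other NSCLC"
    return None

print(histology_group(8253), "|", PRIMARY_SITE["C34.2"])
```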

2.3. Prognostic Nomogram

The variables that might be related to prognosis were first analyzed by univariate analysis. Variables found to be significant in the univariate analysis were then included in the multivariate logistic analysis to determine independent prognostic factors of NSCLC patients with BM. Next, the independent prognostic factors identified by the multivariate logistic analysis were used to construct a nomogram for predicting the 1-year survival of NSCLC with BM.

2.4. Construction of the XGBoost Model

Before the sample feature data were put into the model for classification, they were preprocessed. In the dataset used in the study, age and tumor size are continuous variables, and the rest are categorical. For the continuous variables, age and size, we adopted standardization to speed up training. The formula is as follows:

$$x' = \frac{x - \mu}{\sigma},$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the variable $x$.
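As a small illustration of this preprocessing step, the sketch below assumes that standardization means z-score scaling (subtract the mean, divide by the standard deviation). The function name `standardize` and the example ages are hypothetical and are not taken from the study data.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

ages = [52, 61, 68, 74, 80]   # hypothetical ages, for illustration only
z = standardize(ages)
print([round(v, 3) for v in z])
```

In practice the mean and standard deviation would be estimated on the training set and reused unchanged on the validation sets, so that no information leaks from the held-out data.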

To allow distances to be calculated accurately in some machine learning models, we used one-hot encoding for multicategory variables. Tree-based models are well suited to calculating the importance of features. XGBoost was used to rank feature importance, and the significant variables were then included in model building. After variable selection, 17 feature variables remained. We also used XGBoost as the classifier: it is an ensemble machine learning method in which each new model predicts the residuals of the prior models, and the models are then combined to make the final prediction. XGBoost uses a second-order Taylor expansion to approximate the loss function and further reduces the likelihood of overfitting by applying regularization. The objective function is as follows:

$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t),$$

where $g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right)$ and $h_i = \partial^{2}_{\hat{y}_i^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right)$, where $l\left(y_i, \hat{y}_i^{(t-1)}\right)$ is the loss function at iteration $(t-1)$, $g_i$ is the first partial derivative of the loss function at iteration $(t-1)$, $h_i$ is the second derivative of the loss function at iteration $(t-1)$, and $\Omega(f_t)$ is the complexity of model $f_t$. The best values of the hyperparameters were determined by performing a grid search.
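To make the second-order approximation concrete, the sketch below computes the first derivative g and second derivative h of the loss for each sample and the resulting optimal weight of a single leaf. Two assumptions are made that the text does not state: the loss is the binary logistic loss, and the complexity term uses the standard XGBoost L2 penalty with strength `reg_lambda`. The function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_grad_hess(y_true, raw_score):
    """First and second derivatives of the logistic loss w.r.t. the raw score."""
    p = sigmoid(raw_score)
    return p - y_true, p * (1.0 - p)

def optimal_leaf_weight(labels, raw_scores, reg_lambda=1.0):
    """Weight minimizing the second-order objective for samples in one leaf.

    With G = sum of g_i and H = sum of h_i, the minimizer is -G / (H + lambda).
    """
    grad_sum = 0.0
    hess_sum = 0.0
    for y, score in zip(labels, raw_scores):
        g, h = logistic_grad_hess(y, score)
        grad_sum += g
        hess_sum += h
    return -grad_sum / (hess_sum + reg_lambda)

# Three samples in one leaf, all starting from a raw score of 0 (p = 0.5).
weight = optimal_leaf_weight([1, 1, 0], [0.0, 0.0, 0.0])
print(round(weight, 4))
```

The regularization term in the denominator shrinks the leaf weight toward zero, which is one way the penalty limits overfitting.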

2.5. Model Evaluation

We also established three other prediction models based on the RF, SVM, and logistic regression algorithms. Receiver operating characteristic (ROC) curves were used to quantify and compare the predictive performance of the XGBoost model and the other prediction models.
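The area under the ROC curve (AUC) used for this comparison equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal sketch of that pairwise definition follows; the function name `roc_auc` and the toy labels and scores are illustrative, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Toy example: 1 = event observed, 0 = not observed; scores are model outputs.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```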

3. Results

3.1. Features of Patients

According to the inclusion and exclusion criteria, 5973 NSCLC patients with BM were selected from the SEER database, and an additional 114 NSCLC patients with BM were identified from the AHOCMU for this study. Of the SEER patients, 4183 were assigned to the training set and the remaining 1790 to the validation set. Patient demographic and clinicopathologic features are presented in Table 1. Briefly, 4657 patients (78.0%) were white and 740 (12.4%) were black. Males (57.4%) slightly outnumbered females (42.6%). Regarding the tumor characteristics, 90.7% were located in a lung lobe; adenocarcinoma (64.5%) accounted for the majority, and most tumors were moderately or poorly differentiated. For therapy, 231 (3.9%) of the patients received surgery, 3853 (64.5%) received chemotherapy, and 3525 (59.0%) underwent radiotherapy. Lung metastases (28.6%) were more common than brain metastases (23.1%) and liver metastases (20.4%).

3.2. Prognostic Nomogram for 1-Year Survival

The univariate analysis is presented in Table 2. The results of the multivariate logistic analysis indicated that tumor size, age, race, sex, histological type, grade, N stage, surgery, chemotherapy, and liver metastases were prognostic factors related to overall survival (Table 2). Next, these prognostic factors were integrated to build a nomogram for predicting the prognosis of NSCLC with BM (Figure 1). As shown in Figure 1, tumor size was the most important prognostic factor, followed by chemotherapy, age, race, grade, surgery, and liver metastases, which affected the prognosis moderately, while N stage, histologic type, and sex had little effect on prognosis. Each prognostic factor was assigned a corresponding score in the nomogram. The total score was obtained by summing the scores of all relevant factors, and a vertical line drawn from the total score gives the individual survival probability for a patient with NSCLC and BM.

3.3. Establishment of the XGBoost Model

Correlated features are redundant and may decrease the performance of ML algorithms. The correlations between the features are depicted in Figure 2. It was therefore necessary to perform feature reduction. The 19 features were ranked by feature importance using the XGBoost classifier; the ranking is shown in Figure 3. A cut-off point for selecting the top-ranked features was determined from the accuracy of the model at different thresholds, so as to obtain the best trade-off between model performance and simplicity. After this selection, the M stage and insurance status were removed, and the remaining 17 features were fitted into our model.
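The cut-off selection described here can be sketched as a loop over the importance ranking that keeps the smallest set of top-ranked features reaching the best accuracy. Everything in the example is hypothetical: the function `select_by_importance`, the importance values, and the toy accuracy function stand in for the fitted XGBoost importances and the model accuracies, which are not reported in the text.

```python
def select_by_importance(importances, accuracy_for):
    """Return the smallest top-k feature set that achieves the best accuracy.

    importances: dict mapping feature name to importance score.
    accuracy_for: callable that takes a list of features and returns accuracy.
    """
    ranked = sorted(importances, key=importances.get, reverse=True)
    best_k = 0
    best_accuracy = float("-inf")
    for k in range(1, len(ranked) + 1):
        accuracy = accuracy_for(ranked[:k])
        # Strict improvement only, so ties favor the simpler (smaller) model.
        if accuracy > best_accuracy:
            best_k, best_accuracy = k, accuracy
    return ranked[:best_k]

# Hypothetical importances and accuracies, for illustration only.
toy_importances = {"tumor_size": 0.30, "age": 0.25, "chemotherapy": 0.20, "m_stage": 0.01}
toy_accuracy_by_size = {1: 0.60, 2: 0.68, 3: 0.72, 4: 0.72}

selected = select_by_importance(
    toy_importances, lambda features: toy_accuracy_by_size[len(features)]
)
print(selected)
```

Because adding the lowest-ranked feature does not improve the toy accuracy, it is dropped, which mirrors the trade-off between performance and simplicity described above.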

After grid search, the parameters of the best model were determined (, , , , , , , and ). Using ROC analysis, the prediction model using XGBoost achieved a fitted AUC of 0.792 in the training set (Figure 4).
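An exhaustive grid search evaluates every combination of candidate hyperparameter values and keeps the best-scoring one. The sketch below is a generic illustration, not the configuration used in the study: `max_depth` and `learning_rate` are real XGBoost hyperparameter names, but the candidate values and the toy scoring function are assumptions. In practice the score would be a validation metric such as cross-validated AUC.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the best (params, score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical candidate values, for illustration only.
toy_grid = {"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.3]}

def toy_score(params):
    # Stand-in for a validation metric; it peaks at max_depth=4, learning_rate=0.1.
    return -((params["max_depth"] - 4) ** 2) - (params["learning_rate"] - 0.1) ** 2

best_params, best_score = grid_search(toy_grid, toy_score)
print(best_params)
```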

3.4. Validation of Predictive Accuracy of the XGBoost Model

We plotted the ROC curves for the XGBoost model and for each single prognostic factor in the training and validation sets. As shown in Figure 5, the AUC of the XGBoost model was markedly larger than that of any single prognostic factor, indicating a much higher prediction accuracy of the XGBoost model. In addition, the XGBoost model had an AUC of 0.764 in the external validation set, demonstrating good discriminative ability (Figure 6).

3.5. Comparison of Predictive Accuracy between Various Prediction Models

In order to assess the advantage of the prediction model generated by the XGBoost algorithm, we also compared it with other models. The training and validation sets of each model were depicted with ROC curves, and the corresponding AUCs were calculated. In the training set, the accuracy of the XGBoost model for predicting survival () was higher than that of RF (), SVM (), and logistic () (Figure 4(a)). The XGBoost model also had a better performance in the validation set (), compared with RF (), SVM (), and logistic () (Figure 4(b)).

4. Discussion

Although patients with NSCLC and BM may now survive longer than before thanks to advances in treatment methods and drugs, accurate prediction of their survival remains both necessary and challenging for clinicians. This study established and validated the XGBoost model as the most appropriate model for predicting the 1-year survival of NSCLC with BM. The XGBoost model achieved AUCs of 0.792, 0.786, and 0.764 in the training, internal validation, and external validation sets, respectively. Compared with the other models, it showed better reliability and accuracy (Figure 4) and could be used to predict the 1-year mortality of NSCLC with BM, thus helping clinicians determine a reasonable, individualized drug treatment program.

To our knowledge, this study is the first to establish a prognostic model for NSCLC with BM by using AI-based models in a large-scale population. The major differences between our study and others can be summarized as follows. First, we narrowed the study population to NSCLC instead of the entire lung cancer patient population. This was important because the histological subtype dramatically affects the prognosis, in line with previous reports [17–19]. Second, our study included only patients with BM rather than patients with any metastasis, because the prognosis of patients with different metastatic sites is quite different [20, 21]. Therefore, the accuracy of using a prognostic model based on patients with any metastatic NSCLC to predict the prognosis of NSCLC with BM is questionable. More importantly, most previous predictive models were based on Cox regression or logistic regression. These are conventional algorithms that can be replaced by more sophisticated ones. For instance, XGBoost has excellent performance in processing large-scale and high-dimensional data [22]. Taken together, after defining the population of NSCLC patients with BM, we constructed and validated a prediction model based on the XGBoost algorithm, which avoided the shortcomings of the other models and achieved the best predictive performance among all models (Figure 2).

According to our study, tumor size is a significant factor affecting patient prognosis, which has not been recognized previously [11–13]. T stage roughly classifies tumor size or depth of invasion, but it cannot reflect the specific characteristics of NSCLC patients with BM or accurately predict the prognosis, because tumor size varies greatly within the same stage [23–25]. Age, tumor size, race, sex, histological type, grade, T stage, N stage, surgery, chemotherapy, liver metastases, and radiotherapy were related to prognosis, which is similar to previous research [26–30]. The presence of liver metastases significantly decreased survival in lung cancer patients with BM [17]. First, this may be related to the liver being an immunosuppressive organ, which hinders immune surveillance of metastases growing in the liver; second, the poorer response to chemotherapy associated with liver metastases leads to a worse prognosis [31]. Patients with undifferentiated or late-stage disease showed a worse prognosis, as expected, which is consistent with general understanding. We also found that surgery and chemotherapy are generally beneficial for patients. However, because information on specific surgical procedures, chemotherapy drugs, and treatment regimens was lacking, we were unable to explore the relationship between treatment methods and prognosis in more depth. Of note, molecular targeted therapy and immunotherapy may provide new options in addition to traditional surgery, radiotherapy, and chemotherapy; whether these new treatments affect NSCLC with BM requires further investigation, although some recent studies show that epidermal growth factor receptor-targeted drugs can improve prognosis in lung cancer [31, 32].

Although deep learning is widely available in academia and industry, boosting algorithms based on tree models still play a significant role in some fields and remain dominant for structured data. The boosting algorithm has proved effective in practice for predictive models in classification and regression tasks. Traditional regression and machine learning methods, such as Cox regression and SVM, are limited in learning capacity and require extensive manual feature engineering. The tree-based model is a nonlinear model with the advantages of natural feature combination and strong feature expressive capacity [33]. Tree-based classifiers, including RF and XGBoost, which split on homogeneity, suit the characteristics of the data set in the present study. We speculate that the application of regularization, the use of a Taylor expansion to approximate the loss function, and the high flexibility that allows fine-tuning might enable XGBoost to perform better than RF [34]. Taken together, our findings suggest that the XGBoost approach can reflect feature importance and yield a mortality prediction model with high accuracy. Furthermore, this approach has considerable potential for practical implementation because it can be incorporated into existing healthcare information systems.

Our research has some advantages. First, the SEER database provides complete follow-up information for patients across a large population. Second, our AI model can provide personalized survival prediction for patients, thereby improving personalized treatment. Finally, our AI model can be used to predict the survival of other NSCLC patients with BM, as all the information used to predict survival is easily accessible, and the model can be deployed as a software-based or web-based tool. However, this study has certain limitations. First, it was a retrospective study, and prospective randomized clinical trials are needed to provide high-level evidence for clinical application. Second, we could not obtain specific information about treatment, such as chemotherapy drugs and protocols, radiation doses, and specific surgical procedures.

5. Conclusion

We used the XGBoost algorithm to build an AI model that predicts the 1-year survival of NSCLC with BM. The XGBoost model has higher accuracy and better performance than models generated from other algorithms. Furthermore, the XGBoost model can be integrated into existing healthcare information systems, so we propose that the XGBoost model could be used as a practical clinical prediction model to help clinicians develop better and more reasonable treatment programs.

Abbreviations

NSCLC: Non-small-cell lung cancer
BM: Bone metastases
AI: Artificial intelligence
ML: Machine learning
RF: Random forest
SVM: Support vector machine
XGBoost: Extreme gradient boosting
SEER: Surveillance, Epidemiology, and End Results
ROC: Receiver operating characteristic
AUC: Area under the curve

Data Availability

The datasets generated and/or analyzed during the current study are available in the SEER database (https://seer.cancer.gov/).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Authors CLZ, ZHH, and CH designed the study. Author ZHH wrote the manuscript. Authors CH, CXC, and ZJ collected the data. Authors ZJ and YXT conducted the statistical analysis. Authors CLZ and ZHH revised the manuscript. All authors critically read the manuscript to improve intellectual content. All authors read and approved the final manuscript. Zhangheng Huang, Chuan Hu, and Changxing Chi contributed equally to this work.

Acknowledgments

We are thankful for the contribution of the SEER database and the 18 registries supplying cancer research information and thank all colleagues involved in the study for their contributions.