Abstract

Background and Aim: Gastric cancer (GC) is a leading cause of cancer-related death worldwide. Immune-related genes (IRGs) may help predict lymph node metastasis (LNM). We aimed to develop a preoperative model to predict LNM based on these IRGs. Methods: We compared and evaluated three machine learning models for predicting LNM using publicly available gene expression data from TCGA-STAD. The Pearson correlation coefficient (PCC) was used for feature selection according to each gene's relationship with LN status. Model performance was assessed using the area under the curve (AUC) and F1 score. Results: The Naive Bayesian model performed best and was constructed from 26 selected gene features, with AUCs of 0.741 in the training set and 0.688 in the test set. The F1 scores in the training set and test set were 0.652 and 0.597, respectively. To our knowledge, this Naive Bayesian model based on 26 IRGs is the first diagnostic tool for the identification of LNM in advanced GC. Conclusion: These results indicate that the proposed method has value as an auxiliary diagnostic tool with promising clinical potential.

1. Introduction

Gastric cancer (GC) is one of the most common gastrointestinal malignancies worldwide, accounting for 1,033,701 new cases and 782,685 deaths in 2018 [1]. Although various new diagnostic and treatment approaches have been developed for the management of GC, the prognosis remains unsatisfactory because of recurrence and metastasis [2]. Lymph node metastasis (LNM) is one of the most crucial indicators influencing prognosis and treatment planning in GC patients [3, 4]. Accurate preoperative identification of LN status is considered critical for treatment decisions in GC patients at different stages. Unfortunately, most histopathologic findings identified as efficient predictors of LNM cannot be observed preoperatively. Traditional strategies for predicting LN status were developed based on radiomics or histopathologic findings. However, predictors based on these two strategies are available only empirically or postoperatively.

Early studies demonstrated that LN size assessed by imaging techniques is not a reliable indicator for the detection of LNM [5, 6]. The prediction accuracy of this approach is often unsatisfactory because of its high false-negative rate [7]. Positron emission tomography (PET) exhibits excellent specificity for detecting LNM in GC. However, the clinical utility of PET is limited by its high cost [8]. In addition, strategies based on histopathologic findings are usually available only postoperatively, and the determination of LN status may be subjective. Therefore, more accurate markers for the preoperative identification of LNM are urgently needed.

Various immune-related molecules have been proven to be key factors during cancer initiation and progression [9–12]. Recent immunotherapy targeting specific immune checkpoints has demonstrated remarkable efficacy in the clinical treatment of GC [13]. Moreover, the prognostic and adjuvant treatment value of immune-related molecules in GC has been shown in several studies [10]. Therefore, an immune-based LN signature for GC could supplement preoperative prediction, and its relevance to postoperative treatment in GC remains to be comprehensively explored.

Machine learning algorithms are promising approaches for disease risk prediction and diagnosis based on high-dimensional genomic data sets. They can rank variables according to their predictive power for a classification target. Here, we performed a systematic comparative study of three machine learning methods using public TCGA data and evaluated how well approaches based on the mRNA expression data of IRGs predict LN status. More specifically, a novel 26-immune-gene panel based on a Naive Bayesian classifier was used for the identification of LNM in advanced GC. An immune-related gene model based on a machine learning method can provide an individual preoperative assessment of the risk of LNM in advanced GC patients.

2. Methods

2.1. Workflow

The overall workflow of this study includes the following parts: (1) analysis of differentially expressed immune-related genes, (2) feature selection, (3) IRG model construction, and (4) model performance evaluation. The statistically significant IRGs were subsequently passed to the machine learning algorithms to construct an LNM prediction model (as shown in Figure 1).

2.2. Data Collection and Preprocessing

This study used publicly available data from the TCGA database (https://cancergenome.nih.gov/) and the ImmPort database (https://www.immport.org/home) for a comprehensive analysis [14]. The normalized mRNA expression profiles (HTSeq-FPKM) and corresponding clinical data of 375 tumors and 32 tumor-adjacent healthy controls were extracted from the TCGA-STAD database, with a closing date of 9 December 2019. A list of 1811 IRGs was downloaded from the ImmPort database, and the expression of these 1811 IRGs was taken from the TCGA profiles. All data were processed with R software (https://www.r-project.org/). The exclusion criteria were as follows: (1) transcriptomic data were missing or not matched; (2) the status of LNM was missing or unknown; (3) distant metastasis had occurred, or the status of distant metastasis was unknown; and (4) the patient was diagnosed with gastric cancer that was not at an advanced stage (as shown in Table 1).

2.3. Identification of Differentially Expressed Immune-Related Genes (DEG-IRGs)

The limma package (https://www.bioconductor.org/packages/release/bioc/html/limma.html) was used to identify DEG-IRGs [15]. The Wilcoxon test was applied to estimate the gene expression changes. The DEG-IRGs were defined as genes with a false discovery rate (FDR) of less than 0.05 and an absolute fold change greater than 1.5 (as shown in Tables S1 and S2).
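
The filtering rule above (FDR below 0.05 and absolute fold change above 1.5) can be illustrated with a small dependency-free Python sketch. It is not the authors' R code: the text does not say which FDR procedure was used, so the Benjamini-Hochberg adjustment here is an assumption, and the function names `benjamini_hochberg` and `select_deg_irgs` are hypothetical. Fold-change values are used on whatever scale the upstream analysis reports.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: the paper reports an FDR cutoff but does not name the
    adjustment procedure; BH is used here only for illustration.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_deg_irgs(genes, p_values, fold_changes,
                    fdr_cutoff=0.05, fc_cutoff=1.5):
    """Keep genes with FDR < fdr_cutoff and |fold change| > fc_cutoff."""
    fdr = benjamini_hochberg(p_values)
    return [gene for gene, q, fc in zip(genes, fdr, fold_changes)
            if q < fdr_cutoff and abs(fc) > fc_cutoff]
```

Here `p_values` would hold one Wilcoxon test p-value per gene; the gene names and numbers passed in are up to the caller and are not taken from Tables S1 or S2.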

2.4. Feature Selection and Cross-Validation

Filter-based feature selection using the Pearson correlation coefficient is an established dimensionality reduction technique [16, 17]. After data preprocessing, 298 eligible samples (89 non-LNM and 209 LNM) were identified and randomly divided into five folds of approximately equal size for 5-fold cross-validation. Feature selection was performed on the training set to measure the importance of each feature according to its correlation with LN status [18]. In each round, the machine learning algorithm was trained on four folds, and the remaining fold was retained as the validation set for testing the selected algorithm. The process was repeated until the algorithm had been validated on every fold. Finally, the results from the five folds were averaged to produce a single performance estimate.
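
The procedure in this subsection can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the study's scikit-learn code: the helper names (`pearson`, `top_k_features`, `k_fold_indices`, `cross_validate_selection`) are hypothetical, features are ranked by the absolute PCC with the LN label, and selection is repeated inside each training split so that the held-out fold does not influence which genes are kept.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def top_k_features(X, y, k):
    """Rank feature columns by |PCC| with the label; return top-k indices."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y))
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k near-equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate_selection(X, y, n_keep, k=5, seed=0):
    """Yield (train_idx, val_idx, selected_features) for each fold."""
    folds = k_fold_indices(len(X), k, seed)
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        # Feature selection sees only the training portion of this round.
        selected = top_k_features([X[j] for j in train_idx],
                                  [y[j] for j in train_idx], n_keep)
        yield train_idx, val_idx, selected
```

With 298 samples and `k=5`, `k_fold_indices` produces three folds of 60 samples and two of 59. A classifier would then be fitted on the selected columns of each training split and scored on the corresponding validation fold.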

2.5. Performance Evaluation of Classification Model

For model evaluation, we used a set of metrics comprising AUC, accuracy, precision, recall, and F1 score to measure discriminative capability. The F1 score is defined as the harmonic mean of precision and recall. True positive (TP), false positive (FP), true negative (TN), and false negative (FN) counts are widely used in binary classification problems. The confusion matrix is shown in Table 2. Accuracy, precision, recall, and F1 score were applied to assess the performance of the model using the following equations:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 score = 2 × Precision × Recall / (Precision + Recall)
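
The four count-based metrics named above can be computed directly from the confusion-matrix entries. The short Python sketch below uses their standard definitions; the function name `classification_metrics` is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something specified in the text.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For example, `classification_metrics(tp=40, fp=10, tn=30, fn=20)` uses made-up counts (not results from this study) and gives an accuracy of 0.7 and a precision of 0.8.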

2.6. Statistical Analysis, Software, and Hardware

Data mining and the related statistical analyses were performed using R version 3.6. An adjusted P value of less than 0.05 was considered statistically significant. The machine learning algorithms were implemented using the scikit-learn 0.21.1 package in Python 3.7 [19]. All computation was conducted on a computer with a 64-bit Windows 10 operating system, an Intel® Core i5-8265U CPU at 1.80 GHz, and 8.0 GB of installed random access memory.

3. Results

3.1. Identification of an IRG Expression Signature

To characterize the expression pattern of immune genes, we used the limma package to analyze the TCGA FPKM data of gastric cancer and nongastric cancer samples and identified genes that were differentially expressed in GC. We then downloaded the list of IRGs from the ImmPort database. Differential expression analysis of these IRGs was subsequently carried out using limma, yielding 141 DEGs, including 88 upregulated and 53 downregulated genes. These 141 IRGs were considered to be implicated in GC (as shown in Figure 2).

3.2. Development of the IRG Panel for Gastric Cancer Lymph Node Metastasis

Starting from these 141 DEGs, we applied feature selection based on the Pearson correlation coefficient to select the combination of immune genes with the greatest power to classify GCs according to their LNM status. The ROC curve and F1 score were used to assess the predictive performance of the model.

Three machine learning classifiers were trained to construct an LNM prediction model based on 298 eligible GC patients. To reduce the risk of overfitting, we conducted 5-fold cross-validation for this binary classification task. An optimized LNM prediction model was eventually constructed using a signature of 26 genes (as shown in Figure 3).
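
To make the classifier concrete, here is a minimal pure-Python sketch of a Naive Bayes model for continuous expression values. The text does not state which Naive Bayes variant was used, so the Gaussian likelihood is an assumption, and the class name `GaussianNaiveBayes` and the `var_floor` parameter are hypothetical. The study itself used scikit-learn, which this sketch does not reproduce.

```python
import math


class GaussianNaiveBayes:
    """Minimal Gaussian Naive Bayes (assumed variant, for illustration)."""

    def fit(self, X, y, var_floor=1e-9):
        self.classes_ = sorted(set(y))
        self.params_ = {}
        for c in self.classes_:
            rows = [x for x, label in zip(X, y) if label == c]
            n = len(rows)
            cols = list(zip(*rows))
            means = [sum(col) / n for col in cols]
            # var_floor keeps the variance positive for constant columns.
            variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                         for col, m in zip(cols, means)]
            self.params_[c] = (math.log(n / len(X)), means, variances)
        return self

    def _log_joint(self, x, c):
        log_prior, means, variances = self.params_[c]
        # Features are treated as conditionally independent given the class.
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances))

    def predict_proba(self, X):
        out = []
        for x in X:
            logs = [self._log_joint(x, c) for c in self.classes_]
            top = max(logs)
            # Subtract the maximum before exponentiating for stability.
            weights = [math.exp(value - top) for value in logs]
            total = sum(weights)
            out.append([w / total for w in weights])
        return out

    def predict(self, X):
        return [self.classes_[max(range(len(p)), key=p.__getitem__)]
                for p in self.predict_proba(X)]
```

In use, `X` would hold one row per patient restricted to the selected gene columns and `y` the LN status labels; the class-1 column of `predict_proba` can then serve as the score for ROC analysis.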

3.3. Validation and Evaluation of the Prediction Model

We first investigated whether immune-related gene panels could predict LNM in advanced gastric cancer. Here, we performed 5-fold cross-validation on the training data set to evaluate the prediction model. The resulting immune gene-based diagnostic model achieved AUCs of 0.741 on the training set and 0.688 on the test set. Moreover, the accuracy, precision, recall, and F1 score supported the generalizability of the Naive Bayesian classifier (as shown in Table 3).
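
For readers less familiar with the AUC values reported here, the sketch below computes AUC from its rank interpretation: the probability that a randomly chosen LNM-positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` is hypothetical, and the pairwise loop is chosen for clarity over speed. It is not the routine used in the study.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # A correctly ordered pair counts 1, a tied pair counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Averaging `roc_auc` over the five validation folds gives one cross-validated estimate of discrimination.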

4. Discussion

Although surgery is well established in the management of gastric cancer, it is widely accepted that patients with advanced gastric cancer also benefit from systemic therapies. Therefore, a continued search for new prognostic factors helps in selecting reasonable treatment strategies. Lymph node metastasis status might be the most significant prognostic indicator for the outcomes of GC patients. Accumulating evidence suggests that the development of LNM is genetically determined and associated with immune progression [20, 21]. To date, no immune molecular biomarkers have been confirmed to predict LNM in GC. Hence, there is an urgent need to identify an immune molecular panel that has preoperative predictive value and can reveal potential malignant progression.

The prognosis and quality of life vary considerably between GC patients with and without LNM, and several studies have demonstrated associations between clinical factors and the risk of LNM [22]. Several reports have indicated that tumor size, tumor differentiation, the depth of tumor invasion, and lymphovascular infiltration are significantly associated with LNM [23–26]. However, these clinical factors still fail to provide accurate preoperative prediction.

Machine learning methods are well-established classification tools for LNM of cancers [27–30]. In recent years, the combination of radiomics and machine learning has succeeded in LNM classification because of its noninvasiveness and high efficiency. Li et al. developed a dual-energy CT-based nomogram to facilitate the preoperative prediction of LNM in GC patients and identified tumor thickness, Borrmann classification, and iodine concentration in the venous phase as independent predictors of LNM [31]. Feng et al. utilized lesion-based radiomic features to identify LNM preoperatively with an accuracy of 76.4% [32]. Wang et al. analyzed the value of radiomics features in the arterial phase, with random forest as the feature selection method, and achieved individual prediction of LNM in GC [33]. However, the combination of radiomics and machine learning has its own challenges. First, model performance depends mainly on a large patient population. Extracting imaging features from a limited data set is likely to diminish predictive value and increase the risk of overfitting. In addition, variability in CT or MRI image segmentation may introduce bias into the derived features.

With the rapid development of genomics in recent years, the molecular characteristics of LNM are becoming clearer. To date, an increasing number of IRGs have been shown to be associated with LNM [34]. However, few studies have combined genomics and machine learning. In this study, we compared three classifiers and validated a Naive Bayesian algorithm using a genomics approach for the preoperative evaluation of LN status in GC patients. First, we developed an IRG expression profile comprising 141 DEGs between gastric cancer and nongastric cancer samples. Gastric mucosal tissue samples can be obtained preoperatively by endoscopic biopsy. Cancer-related gene sets were used to detect LNM in patients with GC. To refine the profile, an immune signature of 26 genes with high predictive power for LNM was extracted from the 141 DEGs using feature selection. Based on mRNA sequencing data from the TCGA-STAD project, our novel 26-IRG panel achieved an AUC of 0.741 in the training set. In internal validation, the selected model also showed useful predictive ability for LNM, with an AUC of 0.688. Our TCGA analysis suggested that altered gene expression might change further during tumor progression. However, the molecular function of several of these genes in GC is not fully understood and deserves further investigation.

Admittedly, our study has several limitations. First, the results were based on a public database obtained from TCGA, and we did not perform further validation on a larger sample. To help address this limitation, we intend to apply this model to our own population cohort. Second, the performance of the model in the early gastric cancer subgroup remains unclear because of the limited T1 sample size. In addition, the majority of patients in this study were white, and the predictive performance for other racial groups is unproven. Therefore, further investigations are essential to confirm the current findings.

5. Conclusions

We developed a 26-mRNA-based Naive Bayesian classifier for the preoperative prediction of LN status in advanced GC patients. The Naive Bayesian model based on IRGs outperformed the other classifiers compared and may help clinicians plan individualized treatment strategies.

Data Availability

The data that support the findings of this study are available from the TCGA or the corresponding authors upon reasonable request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Authors’ Contributions

Yuan Yang, Ya Zheng, Hongling Zhang, Yuping Wang, and Yongning Zhou contributed equally to the article.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (Grant number: 71964021), National Key R&D Program of China (Grant numbers: 2016YFC1302201, 2016YFC0107006), Key Research and Development Program of Gansu Province, China (Grant number: 18YF1FA110), Key Program of the Natural Science Foundation of Gansu Province, China (Grant number: 18JR3RA366), Foundation of The First Hospital of Lanzhou University, China (Grant number: ldyyyn2018-54), and Open Fund of State Key Laboratory of Cancer Biology, China (Grant number: CBSKL201718).

Supplementary Materials

Supplementary 1. Table S1: the expression profile of differentially expressed genes in gastric cancer.

Supplementary 2. Table S2: the expression profile of differentially expressed immune-related genes in gastric cancer.