Special Issue: Integrated Analysis of Multiscale Large-Scale Biological Data for Investigating Human Disease 2020
Development and Validation of a Seven-Gene Signature for Predicting the Prognosis of Lung Adenocarcinoma
Background. Accurate prognosis prediction matters for patients with lung adenocarcinoma (LUAD), yet no robust, highly effective prognostic model has been developed. This study aimed to construct a stable and practicable gene signature-based model via bioinformatics methods for predicting the prognosis of patients with LUAD. Methods. The mRNA expression data were obtained from the TCGA-LUAD dataset, and the paired clinical information was collected from the GDC website. The R package “edgeR” was used to select differentially expressed genes (DEGs), which were then used to construct a gene signature-based model via univariate COX, Lasso, and multivariate COX regression analyses. Kaplan-Meier and ROC analyses were conducted to evaluate the performance of the model in predicting LUAD prognosis, and an independent dataset, GSE26939, was used for further validation. Results. In total, 1,655 DEGs were obtained, and a 7-gene signature-based risk score was developed. Kaplan-Meier survival curves revealed that patients in the high-risk group had a lower survival rate than those in the low-risk group in both the TCGA-LUAD dataset and GSE26939. Further analysis of the relationship between the risk score and clinical characteristics showed that the model was effective in prognosis prediction for patients of different ages and TNM stages (N0&N1, T1&T2, and tumor stage I/II). In sum, our study provides a robust predictive model for LUAD prognosis, which supports clinical research on LUAD and helps to explore the mechanisms underlying its occurrence and progression.
1. Introduction
Lung cancer is a malignant tumor whose morbidity (13% in both males and females) and mortality (24% in males and 23% in females) rank second and first worldwide, respectively, according to the latest data released in CA: A Cancer Journal for Clinicians. Lung cancer is classified into small-cell lung cancer (SCLC) and non-small-cell lung cancer (NSCLC), with NSCLC accounting for the majority of cases (around 80%). Lung adenocarcinoma (LUAD), the main histological subtype of NSCLC, accounts for over 40% of overall lung cancer morbidity. Because around 80% of patients with lung cancer are diagnosed at middle or advanced stages, surgery is no longer an available option, resulting in unfavorable outcomes with a 5-year overall survival (OS) rate of nearly 17% [3, 4]. Distant metastasis and relapse are the main causes of poor treatment outcomes and prognosis [5, 6]; identifying cancer-associated genes and independent prognostic factors, and investigating their impact on tumor progression and prognosis, therefore supports precision medicine and helps to raise the cure rate and improve prognosis. With the development of gene chip technology and RNA sequencing, gene expression profiles have been widely applied to the prediction of LUAD prognosis. For example, PHLPP2 has been reported as a novel biomarker of NSCLC metastasis and prognosis. Thyroid transcription factor-1 is considered a prognostic marker indicating the presence or absence of EGFR-sensitizing mutations in stage IV LUAD. Elevated CX3CL1 mRNA expression has also been found to be a positive prognostic factor in LUAD. However, owing to differences in methods, experimental platforms, batch effects, and other factors, the genes screened for prognosis prediction vary between studies.
In addition, the prognostic models constructed so far may perform well only on the samples used to build them, with weaker performance in independent datasets. A model that remains practicable across datasets is therefore urgently needed so that its value can be realized in different clinical studies.
In the present study, HTSeq-Counts data of LUAD comprising 522 tumor samples and 58 normal samples were obtained from the TCGA database. Based on these data, survival-associated genes were selected using univariate COX regression analysis, after which a Lasso regression model was constructed to remove genes that were relatively strongly correlated and thereby prevent model overfitting. Afterwards, a series of multivariate COX regression models was established, and the optimal model was identified according to the Akaike Information Criterion (AIC). The model was then validated from several aspects: it was effective in both the training set and the testing set, and it performed well in patients of different ages and TNM stages. Furthermore, the model also showed a good ability to predict the prognosis of LUAD patients in an independent dataset, GSE26939. In sum, our study constructs a robust gene signature-based model for predicting the prognosis of LUAD patients, which supports clinical research on LUAD and lays a foundation for future investigation of the molecular mechanisms underlying LUAD occurrence and progression.
2. Methods and Materials
2.1. Data Collection and Preprocessing
HTSeq-Counts data of LUAD (including 522 tumor samples and 58 normal samples) were obtained from the TCGA database (https://portal.gdc.cancer.gov/) and then used for differential analysis with the R package “edgeR” (, adj. ). The corresponding clinical information of TCGA-LUAD patients was collected from the GDC website (https://portal.gdc.cancer.gov/). Patients who were followed up for fewer than 30 days were excluded, leaving 460 TCGA-LUAD patients in the study. In addition, to further verify the validity of the prognostic model, an independent dataset, GSE26939 (including 115 patients with LUAD), and its matched clinical information were obtained from the GEO database (https://www.ncbi.nlm.nih.gov/geo/).
2.2. Candidate Gene Selection
Differentially expressed genes (DEGs) screened out by “edgeR” were randomized into the training set and testing set (5 : 5) and then subjected to univariate COX regression analysis to identify the genes associated with the survival of patients with LUAD. The Lasso regression model was then applied to these survival-related genes to exclude those with relatively high correlation, which reduces the complexity of the prognostic model and helps to find the optimal signature genes.
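The two selection steps can be written compactly in standard notation. This is a sketch, and the symbols are assumptions introduced for illustration rather than taken from the original analysis: let t_i be the follow-up time of patient i, δ_i the event indicator (1 = death, 0 = censored), x_i the vector of gene expression values, and R(t_i) the set of patients still at risk at time t_i. COX regression maximizes the log partial likelihood ℓ(β), and the Lasso subtracts an L1 penalty with tuning parameter λ (commonly chosen by cross-validation), which shrinks some coefficients exactly to zero:

```latex
\ell(\beta) = \sum_{i:\,\delta_i = 1} \left[ x_i^{\top}\beta - \log \sum_{j \in R(t_i)} \exp\!\left(x_j^{\top}\beta\right) \right],
\qquad
\hat{\beta}_{\mathrm{Lasso}} = \arg\max_{\beta} \left\{ \ell(\beta) - \lambda \sum_{k=1}^{p} \lvert \beta_k \rvert \right\}
```

In the univariate step, x_i contains a single gene. In the Lasso step, genes whose coefficients are shrunk to zero drop out of the candidate list.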
2.3. Prognostic Model Construction
Candidate genes selected by Lasso regression analysis were used to construct multivariate COX models, and the Akaike Information Criterion (AIC) was used to select the optimal prognostic model.
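For reference, the standard definition of the AIC is given below. Here k denotes the number of estimated coefficients (the number of genes in a candidate model) and L-hat the maximized likelihood, which for a COX model is the partial likelihood. The symbols are notation assumed for illustration, and how the candidate models were enumerated (for example, stepwise) is not stated in the text:

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
```

Among competing models, the one with the lowest AIC is preferred, so an additional gene is kept only if it improves the fit by more than the penalty of 2 per added coefficient.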
2.4. Stability and Validity Verification
Patients in the training set and testing set were each assigned a risk score and grouped into a high-risk group and a low-risk group based on the median score. The Kaplan-Meier method was used to compare the survival of patients in the two groups, and the log-rank test was used to calculate the P value. Meanwhile, ROC analysis was carried out to assess the performance of the model in predicting the prognosis of LUAD patients, and an independent dataset, GSE26939, was used to verify the validity of the model.
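The grouping and survival-curve steps can be illustrated with a minimal sketch. Python's standard library is used here purely for illustration (the study itself relied on R packages); the function names and the toy data are hypothetical and not part of the original analysis.

```python
from statistics import median


def split_by_median(risk_scores):
    """Label each patient 'high' if the score exceeds the median, else 'low'."""
    cutoff = median(risk_scores)
    return ["high" if score > cutoff else "low" for score in risk_scores]


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (event_time, survival_probability).

    times  : follow-up time of each patient
    events : 1 if the patient died at that time, 0 if censored
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical toy cohort: risk score, follow-up in months, death indicator.
scores = [0.2, 1.5, 0.7, 2.1, 0.4, 1.9]
times = [60, 12, 48, 8, 55, 20]
events = [0, 1, 1, 1, 0, 1]

groups = split_by_median(scores)
for label in ("high", "low"):
    group_times = [t for t, g in zip(times, groups) if g == label]
    group_events = [e for e, g in zip(events, groups) if g == label]
    print(label, kaplan_meier(group_times, group_events))
```

Each step of the curve multiplies the running survival probability by the fraction of at-risk patients who survive that event time, so censored patients contribute to the at-risk count until they leave follow-up.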
3. Results
3.1. Identification of Candidate Genes
In total, 1,655 DEGs were obtained via differential analysis of the TCGA-LUAD dataset (Figure 1(a)) and randomly assigned to the training set and testing set (5 : 5). Univariate COX analysis was performed to screen survival-related genes in the training set with the P value cut-off set at 0.01, and 60 genes were initially screened out, as shown in Supplementary Table 1 (the top 20 genes associated with survival are listed in Table 1). Subsequently, these genes were analyzed in a Lasso regression model. Genes with relatively high correlation were removed to lower the complexity of the prognostic model, and 9 candidate signature genes were finally identified, namely, NTSR1, RHOV, KLK8, TNS4, C1QTNF6, FAM83A, IVL, B4GALNT2, and CREG2 (Figures 1(b) and 1(c)).
3.2. Construction of a 7-Gene Signature-Based Prognostic Model for LUAD
A series of multivariate COX models was constructed based on the candidate genes, and the optimal model was then selected according to the AIC, as shown in Table 2, yielding a 7-gene signature-based risk score formula.
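Risk scores derived from a multivariate COX model conventionally take the form of a linear combination of gene expression levels weighted by the fitted regression coefficients. The expression below sketches that general form only: the fitted coefficients are not reproduced here, and the symbols β_i (coefficient of gene i) and Expr_i (expression level of gene i) are notation assumed for illustration.

```latex
\text{Risk score} = \sum_{i=1}^{7} \beta_i \cdot \mathrm{Expr}_i
```

Under this form, a gene with a positive coefficient raises the risk score as its expression increases, and a gene with a negative coefficient lowers it.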
3.3. Evaluation of the 7-Gene Signature-Based Model in Predicting the Survival of LUAD Patients
Based on the formula, the 7-gene signature-based risk score of each patient in the training set and testing set was calculated, and patients were classified into the high-risk group and the low-risk group according to the median score. Kaplan-Meier curves and the log-rank test were used to compare the survival of the two groups in each set, showing that patients in the high-risk group had poorer survival than those in the low-risk group in both sets (Figures 2(a) and 2(b)). To show the expression levels of the 7 genes, the risk score distribution, and patient survival in the two sets, the data of the training set and testing set were plotted in Figures 2(c)–2(e) and 2(f)–2(h), respectively.
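As a rough illustration of how a log-rank comparison between two risk groups works, the sketch below implements the standard two-group log-rank statistic with the Python standard library. It is not the code used in the study; the function name and the toy data are hypothetical.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi_square, p_value) with 1 degree of freedom."""
    pooled = [(t, e, 0) for t, e in zip(times_a, events_a)]
    pooled += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in pooled if e == 1})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for u, _, g in pooled if u >= t and g == 0)  # at risk, group A
        n_b = sum(1 for u, _, g in pooled if u >= t and g == 1)  # at risk, group B
        d_a = sum(1 for u, e, g in pooled if u == t and e == 1 and g == 0)
        d_b = sum(1 for u, e, g in pooled if u == t and e == 1 and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n  # deaths expected in A if both groups shared one hazard
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0.0:
        return 0.0, 1.0
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Hypothetical toy groups: follow-up in months and death indicators.
high_times, high_events = [8, 12, 20], [1, 1, 1]
low_times, low_events = [48, 55, 60], [1, 0, 0]
print(logrank_test(high_times, high_events, low_times, low_events))
```

At every event time the test compares the deaths observed in one group with the number expected if both groups had the same hazard, then sums these differences over time, which is why it is sensitive to curves that separate consistently across follow-up.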
ROC analysis was conducted using the survivalROC package to verify the model performance in the training set and testing set. The AUC values for 1-, 3-, and 5-year survival were 0.783, 0.781, and 0.801 in the training set (Figure 2(i)) and 0.615, 0.724, and 0.618 in the testing set (Figure 2(j)), respectively. Taken together, the 7-gene signature-based model was demonstrated to be capable of predicting the prognosis of LUAD patients.
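To make the meaning of a 1-, 3-, or 5-year AUC concrete, the sketch below computes a simplified AUC at a fixed time horizon in Python. It is a deliberately reduced stand-in, not the survivalROC estimator: survivalROC handles censoring through Kaplan-Meier or nearest-neighbor estimation, whereas this sketch simply drops patients censored before the horizon. The function name and toy data are hypothetical.

```python
def auc_at_time(risk_scores, times, events, horizon):
    """Simplified AUC at a time horizon.

    Cases    : patients who died at or before the horizon.
    Controls : patients still under follow-up after the horizon.
    Patients censored before the horizon are excluded (a simplification).
    The AUC is the fraction of case/control pairs in which the case has the
    higher risk score, counting ties as one half.
    """
    cases = [s for s, t, e in zip(risk_scores, times, events) if e == 1 and t <= horizon]
    controls = [s for s, t in zip(risk_scores, times) if t > horizon]
    if not cases or not controls:
        raise ValueError("need at least one case and one control at this horizon")

    concordant = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                concordant += 1.0
            elif case_score == control_score:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))


# Hypothetical toy cohort: risk score, follow-up in months, death indicator.
scores = [0.9, 0.1, 0.5, 0.3]
times = [5, 6, 50, 60]
events = [1, 1, 0, 0]
print(auc_at_time(scores, times, events, horizon=12))
```

An AUC of 0.5 corresponds to a score that ranks cases and controls no better than chance, and 1.0 to a score that ranks every case above every control, which gives a scale for reading the values reported above.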
3.4. Verification of Stability and Validity of the Prognostic Model for LUAD with an Independent Dataset GSE26939
An independent dataset, GSE26939, from the GEO database was used to further verify the validity and stability of the 7-gene model. Following the same procedure as above, patients were divided into high-risk and low-risk groups based on the median risk score, and survival was compared using the Kaplan-Meier method, as shown in Figure 3(a), which indicated a lower survival rate in the high-risk group. ROC analysis was then performed for further verification, giving AUC values for 1-, 3-, and 5-year survival of 0.667, 0.616, and 0.623 (Figure 3(b)), respectively. Collectively, the 7-gene model was practicable in an independent dataset.
3.5. Prognostic Impact of the Model on Clinical Characteristics
To further examine the correlation of the 7-gene signature-based risk score with the TNM (Tumor Node Metastasis) stage and overall survival (OS) of LUAD patients, the matched clinical information of the training set and testing set was collected and is listed in Tables 3 and 4. The relationship between the risk score and TNM stage was explored, revealing that the risk score was significantly associated with the pathologic T, N, and tumor stages of patients in both the training set (Figures 4(a)–4(c)) and the testing set (Figures 4(d)–4(f)). Moreover, the performance of the model in predicting the prognosis of patients with different clinical characteristics in the two sets was investigated (Figure 4(g)), showing good performance in patients of different ages and clinical stages (N0&N1, T1&T2, and tumor stage I/II). In the independent dataset GSE26939, however, this correlation was less pronounced (Supplementary Table 2). Altogether, the 7-gene signature-based risk score model was a useful prognosis predictor in patients with different clinical characteristics and could serve as a novel biomarker in LUAD treatment.
4. Discussion
Lung cancer, which ranks first in mortality globally, is already at a middle or advanced stage in most patients at initial diagnosis, when surgery is no longer useful. In addition, the treatment and prognosis of patients are mainly affected by distant metastasis and relapse. It is therefore highly important to build a predictive model with high stability and validity for better early diagnosis, medication guidance, and prognosis prediction. At present, many studies have focused on the construction of prognostic models for LUAD. For instance, Li et al. suggested that clinical immune characteristics were a promising biomarker for evaluating the OS of nonsquamous NSCLC patients (including those with early disease). Park et al. tried to construct a gene signature-based prognostic model for LUAD, and in 2016, Shukla et al. proposed the first RNA-seq-based prognostic signature by analyzing RNA-seq and clinical data, in an attempt to develop a potent predictive tool for LUAD prognosis. Despite extensive research on signature genes for LUAD prognosis, models with robust prediction capability have yet to be constructed. In addition, with the development of high-throughput sequencing, more LUAD gene datasets should be used in new studies.
In our study, seven LUAD survival-related genes were identified, namely, NTSR1, RHOV, KLK8, TNS4, C1QTNF6, IVL, and B4GALNT2. These 7 signature genes were obtained from the HTSeq-Counts data of the TCGA-LUAD dataset using univariate COX, Lasso regression, and multivariate COX analyses. Subsequently, the risk score based on the 7-gene signature was established. As reported, most of these 7 genes are closely related to cancer progression. For example, NTSR1 (Neurotensin Receptor 1) has been reported as a potential prognostic biomarker for surgically resected stage I LUAD and for prostate cancer. RHOV (Ras Homolog Family Member V) has been verified to be highly expressed in NSCLC and can serve as a signature gene in LUAD prognosis. KLK8 (Kallikrein Related Peptidase 8) has shown research value in the prognosis of various cancers, such as lung cancer, ovarian cancer, breast cancer, colon cancer, and rectal cancer. Moreover, TNS4 (Tensin 4) has been found to be upregulated in LUAD and able to predict poor prognosis, and it has been observed to be mediated by miR-150-3p. Another study indicated that aberrant methylation of TNS4 is significantly associated with the OS of LUAD patients. C1QTNF6 (C1q/tumor necrosis factor-related protein 6), a member of the CTRP family, has shown potential as an independent predictor of prognosis in LUAD patients. Additionally, although the role of B4GALNT2 (Beta-1,4-N-Acetyl-Galactosaminyltransferase 2) in LUAD has not been investigated, it has been observed to be highly related to gastric cancer metastasis. The association between IVL (Involucrin) and the progression of LUAD, however, has not been reported and requires further study. In view of the above studies, we could conclude that some of these signature genes exhibit a certain relationship with the prognosis of other cancers.
In this study, each patient in the training set and testing set was assigned a risk score and classified into the high-risk group or the low-risk group according to the median score. The OS curves showed that patients in the high-risk group had poorer survival. ROC curves were plotted, and the AUC values for 1-, 3-, and 5-year survival in the two sets were all above 0.6, indicating that the 7-gene signature-based risk score model was capable of predicting LUAD prognosis. Notably, similar results were found in an independent dataset, GSE26939, from the GEO database, demonstrating the validity and practicality of the 7-gene model. Furthermore, the association between the model and the clinical characteristics of LUAD patients was explored, showing that the model performed well in predicting the prognosis of patients of different ages and TNM stages (N0&N1, T1&T2, and tumor stage I/II), while the effect in GSE26939 was less remarkable.
In conclusion, we obtained 1,655 DEGs from the TCGA-LUAD dataset using the “edgeR” package and constructed a prognostic 7-gene signature-based model (comprising NTSR1, RHOV, KLK8, TNS4, C1QTNF6, IVL, and B4GALNT2) through univariate COX, Lasso, and multivariate COX regression analyses. The robust model we built helps to advance clinical research on LUAD and to better understand the mechanisms underlying LUAD occurrence and progression.
Data Availability
All the data in this manuscript are available.
Conflicts of Interest
The authors declare no conflicts of interest.
Authors’ Contributions
Yingqing Zhang contributed to the study design. Ming Zhang conducted the literature search. Xixi Gao acquired the data. Xiaoping Zhang and Xiaodong Lv wrote the article. Jialiang Liu performed data analysis and drafting. Yufen Xu and Zhixian Fang revised the article. Wenyu Chen gave the final approval of the version to be submitted. Yingqing Zhang, Xiaoping Zhang, and Xiaodong Lv are co-first authors and contributed equally to this article.
Acknowledgments
The authors are grateful for funding from the Natural Science Foundation of Zhejiang Province (No. LQ20H160057), the Key Discipline of Jiaxing Respiratory Medicine Construction Project (No. 2019-zc-04), the Science and Technology Project of Jiaxing (2019AY32030, 2019AD32126, and 2020AY30012), the Jiaxing Key Laboratory of Precision Treatment for Lung Cancer, the Early Diagnosis and Comprehensive Treatment of Lung Cancer Innovation Team Building Project, and the Clinical Research Project of Microwave Ablation Combined with Chemotherapy in the Treatment of Stage IIIb-IV Peripheral Non-small Cell Lung Cancer (2018AD32087). The authors gratefully acknowledge the contributions of the GEO, TCGA, and GDC networks, the patients who participated in TCGA, GDC, and GEO, and the researchers who made their GEO, TCGA, and GDC data available online.
Supplementary Materials
Supplementary Table 1: survival-related genes. Supplementary Table 2: matched clinical information of patients in the independent dataset GSE26939.
References
A. Carrato, A. Vergnenègre, M. Thomas, K. McBride, J. Medina, and G. Cruciani, “Clinical management patterns and treatment outcomes in patients with non-small cell lung cancer (NSCLC) across Europe: EPICLIN-Lung study,” Current Medical Research and Opinion, vol. 30, no. 3, pp. 447–461, 2014.
J. A. Roth, B. H. Goulart, A. Ravelo, H. Kolkey, and S. D. Ramsey, “Survival gains from first-line systemic therapy in metastatic non-small cell lung cancer in the U.S., 1990–2015: progress and opportunities,” The Oncologist, vol. 22, no. 3, pp. 304–310, 2017.
M. M. Vasquez, C. Hu, D. J. Roe, Z. Chen, M. Halonen, and S. Guerra, “Least absolute shrinkage and selection operator type methods for the identification of serum biomarkers of overweight and obesity: simulation and application,” BMC Medical Research Methodology, vol. 16, no. 1, p. 154, 2016.
C. Planque, Y. H. Choi, S. Guyetant, N. Heuzé-Vourc'h, L. Briollais, and Y. Courty, “Alternative splicing variant of kallikrein-related peptidase 8 as an independent predictor of unfavorable prognosis in lung cancer,” Clinical Chemistry, vol. 56, no. 6, pp. 987–997, 2010.
A. Magklara, A. Scorilas, D. Katsaros et al., “The human KLK8 (neuropsin/ovasin) gene: identification of two novel splice variants and its prognostic value in ovarian cancer,” Clinical Cancer Research, vol. 7, no. 4, pp. 806–811, 2001.
K. Michaelidou, A. Ardavanis, and A. Scorilas, “Clinical relevance of the deregulated kallikrein-related peptidase 8 mRNA expression in breast cancer: a novel independent indicator of disease-free survival,” Breast Cancer Research and Treatment, vol. 152, no. 2, pp. 323–336, 2015.