Development of a Ten-lncRNA Signature Prognostic Model for Breast Cancer Survival: A Study with the TCGA Database
Long noncoding RNA (lncRNA) plays a critical role in the development of tumors. The aim of our study was construction of a lncRNA signature model to predict breast cancer (BRCA) patient survival. We downloaded RNA-seq data and relevant clinical information from the Cancer Genome Atlas (TCGA) database. Differentially expressed lncRNA were computed using the “edgeR” package and subjected to the univariate and multivariate Cox regression analysis. Corresponding protein-coding genes were used for Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genome (KEGG) pathway analysis. Finally, 521 differentially expression lncRNA were obtained. We constructed a ten-lncRNA signature model (LINC01208, RP5-1011O1.3, LINC01234, LINC00989, RP11-696F12.1, RP11-909N17.2, CTC-297N7.9, CTA-384D8.34, CTC-276P9.4, and MAPT-IT1) to predict BRCA patient survival using the multivariate Cox proportional hazard regression model. The C-index was 0.712, and AUC scores of training, test, and entire sets were 0.746, 0.717, and 0.732, respectively. Univariate Cox regression analysis indicated that age, tumor status, N status, M status, and risk score were significantly related to overall survival in patients with BRCA. Further, the multivariate analysis showed that risk score and M status had outstanding independent prognostic values, both with . The Gene Ontology (GO) function and KEEG pathway analysis was primarily enriched in immune response, receptor binding, external surface of plasma membrane, signal transduction, cytokine-cytokine receptor interaction, and cell adhesion molecules (CAMs). Finally, we constructed a ten-lncRNA signature model that can serve as an independent prognostic model to predict BRCA patient survival.
Breast cancer (BRCA) is the most commonly diagnosed cancer and the leading cause of cancer death among women worldwide . Although BRCA death is declining due to early detection and improved treatment, significant variability in patient outcomes remains. Almost 62667 died of this malignancy in 2018 . The prognosis of BRCA is affected by multiple factors including age, tumor size, grade, lymph node involvement, histology, hormone-receptor status, HER-2 status, and positive margins . Many clinical prediction models for predicting patient prognosis and disease-free survival have been proposed, mainly focusing on age at diagnosis; post-menopausal status; ER, HER-2, and ki-67 status; tumor size; lymph node involvement; metastasis; and therapeutic strategy [4–7]. These models are difficult to implement in clinical practice due to incomplete diagnostic characteristics and model limitations.
Next-generation sequencing (NGS) allows continuing identification of biomarkers for tumor diagnosis and prognosis. Such models increasingly focus on mRNA, miRNA, and lncRNA. Li et al.  used NGS data available in several databases, including the Geno Expression Omnibus (GEO), the Cancer Genome Atlas (TCGA), the Human Protein Atlas, and the International Cancer Genome Consortium, to construct a six-gene model (SQSTM1, AHSA1, VNN2, SMG5, SRXN1, and GLS) to assist clinicians in selecting personalized treatment for patients with hepatocellular carcinoma. Shi et al. , using the TCGA, identified a five-lncRNA signature model (AC069513.4, AC003092.1, CTC-205M6.2, RP11-507K2.3, and U91328.21) to inform the prognosis of patients with clear cell renal cell carcinoma. Lv et al.  constructed a six-mRNA signature model (TMEM252, PRB2, SMCO1, IVL, SMR3B, and COL9A3) that may aid prognosis for patients with triple-negative BRCA. Zhu et al.  built a four-lncRNA signature model (PVT1, MAPT-AS1, LINC00667, and LINC00938) to predict BRCA survival, but the AUC value for the time-dependent receiver operation characteristic (ROC) curve was only 0.64. To date, few studies of multi-lncRNA signatures in BRCA are available, and functions and mechanisms of lncRNA in BRCA have yet to be explored.
In this study, we deeply mined and analyzed high-throughput sequencing data and clinical characteristics from the TCGA. We subsequently developed a ten-lncRNA signature model that effectively predicts BRCA survival and demonstrated its independence from clinical factors.
2. Material and Methods
2.1. Data Source
The RNA-seq data and corresponding clinical information were mined from TCGA: http://cancergenome.nih.gov/). As of December 2019, 1164 clinical samples and relevant gene expression information were obtained. BACA samples with repeated or incomplete prognostic information were eliminated. Ultimately, 1035 BRCA samples and 111 normal samples were collected for the construction of the model and coexpression analysis. TCGA is an open public database and, ethics approval was not needed for the study.
2.2. Identification of Differentially Expressed lncRNA
RNA count data were obtained from the TCGA data portal, and expression levels of lncRNA and mRNA were determined using Reads Per Kilobase of exons per Million mapped Reads (RPKM). Potential lncRNA were identified based on (1) transcriptome sequences mapped in corresponding lncRNA rather than protein-coding regions, (2) transcriptome sequences annotated in GENCODE data, and (3) transcription sequences expressed in at least half of BRCA tissues. The expression profile of lncRNA was defined as of 1146 BRCA samples. Finally, a total of 6129 lncRNA were enrolled. Differentially expressed lncRNA were identified using the R software “edgeR” package.
2.3. Statistical Analysis and Definition of the lncRNA-Related Prognostic Model
BRCA samples (1035 sequences) were randomly divided into a training set () and a test set (). The Univariate Cox regression analysis was used to examine associations among lncRNA expression levels and overall survival (OS) in the training set. lncRNA were considered significant when values were <0.05. These sequences were then used for the stepwise multivariate Cox regression analysis, using the R package “survival” (choose a model by AIC in a stepwise algorithm) [12–14]. Based on expression levels and coefficients () from multivariate Cox proportional hazards regression analysis, a novel ten-lncRNA-based prognostic risk score formula was defined [13–16]. The risk score formula was as follows:
A risk score for each patient in the training set was then calculated. BRCA patient samples can be divided into high-risk and low-risk groups, respectively, based on the median risk score as a cutoff. The Kaplan–Meier survival curve and the log-rank test were used to assess the prognostic value of the risk score using the R package “survival.”
The receiver operation characteristic (ROC) curve analysis within 3- and 5-year, using the R package “survivalROC,” compared sensitivity and specificity of survival predictions. Subsequently, we compared model predictions with traditional clinical risk factors (age, risk, stage, metastasis, tumor size, and lymph node involvement) using the univariate and multivariate Cox analysis. We also reassessed the relationship between risk level and clinical characteristics using the chi-square test. was considered statistically significant. All data were analyzed using R scripts (version 3.6.1). All the figures were plotted by ggplot2 (version 3.2.1).
2.4. Functional Enrichment Analysis
To explore functional implications of lncRNA, Spearman’s correlation coefficients were calculated between related lncRNA and protein-coding genes. Related protein-coding genes were screened for functional enrichment analysis. We subsequently performed GO analysis and the Kyoto Encyclopedia of Genes and Genome (KEGG) pathway enrichment analysis of differential expression protein-coding genes using the Database for Annotation, Visualization, and Integration Discovery (DAVID version 6.7).
3.1. Differentially Expressed lncRNA and Clinical Characteristic
We analyzed specific baseline clinical characteristics of 1035 BRCA patients (Table 1). We selected lncRNA expression profiles from raw RNA-seq expression data, and then, differentially expressed lncRNA between BRCA samples and normal samples were identified following and false discovery rate . This analysis recognized 406 upregulated lncRNA and 115 downregulated lncRNA (Figure 1).
3.2. Identification of lncRNA Associated with the OS of Patients from the Training Set and Validated in Test Set and Entire Sets
The training set was first analyzed to identify possible prognostic lncRNA; then, the test and entire sets were used for validation. We performed univariate and multivariate Cox regression analysis to identify correlation among differentially expressed lncRNA and OS of BRCA patients using the training set. Finally, 32 of the 521 differentially expressed lncRNA were found to be associated with survival time () by performing univariate Cox regression analysis (Table 2). In addition, a prognostic model, composed of 10 lncRNA, was established by performing a stepwise multivariate Cox proportional hazard regression model (Figure 2). . The C-index for model was 0.712 (CI 0.686-0.740), and the calibration curve showed good performance for the prognostic model (Figure 3).
Patients in the training set were classified into high-risk groups and low-risk groups using median risk score (0.938) as a cutoff. The survival rate of the high-risk group was significantly lower than the low-risk group in the Kaplan–Meier method and the log-rank test (Figure 4(a)). Subsequently, the prognostic ability of the model was assessed by calculating AUC of the time-dependent ROC curve. Generally, an AUC from 0.7 to 0.9 is deemed reliable. The AUC value of the training set was 0.760 in 3 years, 0.746 in 5 years (Figure 4(d)), indicating that the ten-lncRNA signature model shows high sensitivity and specificity.
We next validated our ten-lncRNA signature model with test and entire data sets. We again used risk scores for all patients in test and entire sets and divided patients into high- and low-risk groups based on the same threshold from the training set. Again, we found that survival rate of the high-risk group was significantly lower than the low-risk group in both test and entire sets. (Figures 4(b) and 4(c)). Time-dependent ROC curve analysis for the ten-lncRNA signature model achieved an AUC score of 0.717 in both 3- and 5-year for the test set, and the AUC score of the entire set was 0.741 in 3 years and 0.732 in 5 years, respectively (Figures 4(e) and 4(f)).
3.3. Independence of the Ten-lncRNA Signature and Other Clinical Variables
We assessed whether survival prediction based on the 10-lncRNA signature model was independent of clinical characteristics. Univariate Cox regression analysis indicated that age, tumor status, N status, M status, and risk score were significantly related to OS in the entire set. Also, the multivariate Cox analysis indicated that risk score and M status show outstanding independent prognostic value, both with (Table 3). Further, based on tests, the risk level had no association with age, stage, metastasis, tumor size, and lymph node involvement (Table 4). Collectively, our study demonstrates that the ten-lncRNA signature prognostic model is a robust tool for predicting the prognosis of BRCA patients.
3.4. Functional Characteristic of Ten Prognostic lncRNA
Biological functions of lncRNA remain unclear, but the expression of lncRNA is remarkably correlated with neighboring protein-coding genes. We obtained expression profiles of protein-coding genes from raw RNA-seq data and extracted corresponding protein-coding genes with ten lncRNA. Spearman’s correlation coefficients with and as the cutoff yielded 1178 protein-coding genes for stepwise functional enrichment analysis.
The GO function and KEGG pathway enrichment analysis of protein-coding genes used DAVID bioinformatics resources 6.7. BP results showed that protein-coding genes were enriched for signal transduction, immune response, inflammatory response, and positive regulation of transcription from RNA polymerase II promoter (Figure 5(a)). Characteristics of enrichment in MF were primarily transmembrane signaling receptor activity, protein binding, protein homodimerization activity, calcium ion binding, and receptor binding (Figure 5(b)). For CC analysis, genes were enriched in integral components of the plasma membrane, integral components of membranes, plasma membrane, extracellular exosome, and external side of the plasma membrane (Figure 5(c)). Also, results from the KEGG pathway analysis were enriched for cytokine-cytokine receptor interaction, cell adhesion molecules (CAMs), calcium signaling pathway, transcriptional dysregulation in cancer, and HTLV-I infection (Figure 5(d)).
High-throughput sequencing technology produces increasing amounts of sequencing data for studies of cancer diagnosis, therapy, and prognosis . Current studies focus on ncRNA associated with cancer, especially lncRNA. Some studies confirm that lncRNA plays a crucial role in th4e occurrence and progress of tumors, such as gastric , colon , and BRCAs .
In the present study, we downloaded RNA-seq data and clinically relevant information related to BRCA from the TCGA database. We obtained 1164 clinical samples and corresponding gene expression information. A total of 521 differently expression lncRNA involved in BRCA were pulled from the TCGA database, including 406 upregulated and 115 downregulated genes. Subsequently, univariate and multivariate Cox regression analyses identified correlations among differentially expressed lncRNA and OS in a training set. These correlations were used to establish a risk model for predicting BRCA prognosis. A 10-lncRNA signature risk prediction model (LINC01208, RP5-1011O1.3, LINC01234, LINC00989, RP11-696F12.1, RP11-909N17.2, CTC-297N7.9, CTA-384D8.34, CTC-276P9.4, and MAPT-IT1) was produced. Patients were subdivided into high- and low-risk groups based on the median risk score. Three-year AUC values for the time-dependent ROC curve in the training, test, and entire sets were 0.760, 0.717, and 0.741, respectively, indicating outstanding performance in survival prediction. Recently, Zhu et al.  and Li et al.  have proposed a breast cancer prognosis model based on RNA-seq, and three-year AUC values of their models are 0.641 and 0.711 in the training set, respectively. Therefore, the performance of our model outperforms these two models. We also compared the risk model and clinical parameters (age, stage, metastasis, tumor size, and lymph node involvement) using univariate and multivariate Cox analysis. The prognostic value of the model was independent of other clinical factors in BRCA, but the functional relationship between risk score and tumor development was unclear.
Previously, Liao et al.  showed that LINC01234 knockdown suppressed cell proliferation, migration, and invasion of colorectal cancer cells, while blocking the cell cycle and inducing cell apoptosis. Chen et al.  found that LINC01234 functioned as a ceRNA for miR-304-5p, resulting in derepression of its endogenous target core-binding factor. In addition, Chen et al.  found that LINC01234 expression is increased in non-small-cell lung cancer tissues, and its upregulation is associated with metastasis and shorter survival. Downregulation of LINC01234 impairs cell migration and invasion in vitro and inhibits cell metastasis in vivo by serving as a competing endogenous RNA for the miR-340-5p and miR-27b-3p; LINC01234 also affects RNA-binding proteins LSD1 and EZH2, leading to histone modification and transcriptional repression of antiproliferative gene BTG2 [24, 25]. In another study, Wang et al.  found that the expression of RP11-909N17.2 is positively associated with colorectal cancer outcomes and prognosis; in our study, RP11-909N17.2 is a protective factor in BRCA prognosis. This discrepancy requires further study.
LINC00989 and MAPI-IT1 are associated with congenital diseases [26, 27], and their relationship with cancer remains unclear. No studies have reported associations between LINC01208, RP5-1-11O1.3, RP11-696F12.1, CTC-297N7.9, CTA-384D8.34, CTC-276P9.4, and cancer, but we speculate that these lncRNA may be involved in BRCA tumorigenesis. More research effort is necessary to test this hypothesis.
Many issues remain to be addressed. First, we only download data from the TCGA database. More data are available in other databased that could prove valuable for the risk model. Second, lncRNA play important roles in the occurrence and progress of tumors, but the function of lncRNA in the signature is unclear. Additional experimental study of these lncRNA may help understand functional mechanisms and thus the functional basis for the ten lncRNA for prognosis of BRCA.
We identified differentially expressed gene associated with the pathogenesis of breast cancer and constructed a ten lncRNA prognostic model to predict prognosis of patients with BRCA. The prognostic model presented a good performance in 3- and 5-year OS prediction. Functional mechanisms of these lncRNA have not yet been investigated. Prospective studies are needed to further validate the utility of the ten-lncRNA prognostic model.
The raw data and relevant R code used to support the findings of this study are recorded in https://github.com/zhouwenqing789/TCGA-MODEL-LCNRNA-BRCA.git
Conflicts of Interest
The authors declare that they have no competing interests.
Wenqing Zhou and Yongkui Pang contributed equally to this work.
This work was supported by the Science and Education Foundation of Wujiang district (wwk201906). We would like to thank professor Wei Zhu from the Zhongshan Hospital of Fudan University for her technical guidance.
F. Bray, J. Ferlay, I. Soerjomataram, R. L. Siegel, L. A. Torre, and A. Jemal, “Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries,” CA: a Cancer Journal for Clinicians, vol. 68, no. 6, pp. 394–424, 2018.View at: Publisher Site | Google Scholar
V. Guarneri, F. Piacentini, G. Ficarra et al., “A prognostic model based on nodal status and Ki-67 predicts the risk of recurrence and death in breast cancer patients with residual disease after preoperative chemotherapy,” Annals of Oncology, vol. 20, no. 7, pp. 1193–1198, 2009.View at: Publisher Site | Google Scholar
W. Li, J. Lu, Z. Ma, J. Zhao, and J. Liu, “An integrated model based on a six-gene signature predicts overall survival in patients with hepatocellular carcinoma,” Frontiers in Genetics, vol. 10, 2019.View at: Google Scholar