Background. Bladder cancer (BLCA) is one of the most common urological malignancies globally, posing a severe threat to public health. In combination with protein-protein interaction (PPI) network analysis of proteomics, Gene Set Variation Analysis (GSVA) and “CancerSubtypes” package of software for transcriptomics can help identify biomarkers related to BLCA prognosis. This will have significant implications for prevention and treatment. Method. BLCA data were downloaded from The Cancer Genome Atlas (TCGA) database and GEO database (GSE13507). GSVA analysis converted the gene expression matrix to the gene set expression matrix. “CancerSubtypes” classified patients into three subtypes and established a prognostic model based on differentially expressed gene sets (DEGSs) among the three subtypes. For genes from prognosis-related DEGSs, functional and pathway enrichment analyses and PPI network analysis were carried out. The Human Protein Atlas (HPA) database was used for validation. Finally, the proportion of tumor-infiltrating immune cells (TIICs) was determined using the CIBERSORT algorithm. Results. In total, 414 tumor samples and 19 adjacent-tumor samples were obtained from TCGA, with 145 samples belonging to subtype A, 126 samples belonging to subtype B, and 136 samples belonging to subtype C. Then, we identified 83 DEGSs and constituted a prognostic signature with two of them: “GSE1460_CD4_THYMOCYTE_VS_THYMIC_STROMAL_CELL_DN” and “MODULE_253.” Finally, five subnets of two PPI networks were established, and nine core proteins were obtained: CDH2, COL1A1, EIF2S2, PSMA3, NAA10, DNM1L, TUBA4A, KIF11, and KIF23. The HPA database confirmed the expression of the nine core proteins in BLCA tissues. Furthermore, EIF2S2, PSMA3, DNM1L, and TUBA4A could be novel BLCA prognostic biomarkers. Conclusions. In this study, we discovered two gene sets linked to BLCA prognosis. PPI analysis confirmed the network’s core proteins, and several newly discovered biomarkers of BLCA prognosis were identified.

1. Introduction

Bladder cancer (BLCA) is one of the most common urological malignancies globally and a leading cause of cancer deaths [1]. According to studies, the incidence of bladder cancer has been increasing in recent years [2]. This implies that BLCA poses a serious threat to public health. According to epidemiological studies, cigarette smoking is the leading risk factor for BLCA; however, tobacco cessation interventions do not appear clinically effective for mortality [3]. The mechanisms underlying the phenomena must be understood. As a heterogeneous tumor, BLCA has variable prognosis influenced by different subtypes. BLCA is composed of two major subtypes: nonmuscle-invasive bladder cancer (NMIBC) and muscle-invasive bladder cancer (MIBC) [4]. Most patients are confirmed with NMIBC confined to the mucosa or lamina propria, which is associated with a better prognosis.

Meanwhile, 25% of patients have MIBC that invades the detrusor muscle and also has the potential to metastasize to lymph nodes or distant organs [5, 6]. These differences could account for the large prognostic gap between NMIBC and MIBC. BLCA can also be classified into several subtypes based on tissue morphology phenotypes, including urothelial carcinoma, squamous cell carcinoma, and adenocarcinoma [7], but the effect of tissue morphology phenotypes on prognosis remains debated [8].

Based on transcriptomic data from The Cancer Genome Atlas (TCGA) database, bioinformatics analysis can reclassify BLCA more reasonably and reveal the mechanism different underlying prognosis. To analyze TCGA RNA-seq data, we used gene set variation analysis (GSVA), a nonparametric unsupervised method. GSVA could be used to assess the gene expression at the pathway level rather than at the gene level. This method outperforms single-gene analysis in terms of feature dimension and noise interference, as well as biological interpretability [9]. Furthermore, the “Cancersubtypes” software package has been developed to reveal molecular subtypes of cancer patients from public databases using multiomics data: gene expression, DNA methylation, and miRNA expression [10]. This package has recently been used in several human cancer studies [1113], demonstrating the feasibility of this new method. However, no studies have been conducted to classify BLCA using the “Cancersubtypes” package. Protein-protein interaction (PPI) network analysis could build a protein network and then analyze their interactions [14]. In this study, RNA-seq data from the TCGA-BLCA cohort was analyzed by bioinformatics using GSVA and “Cancersubtypes.” The results were combined with proteomic analysis to reach a more convincing conclusion. Exploring the potential mechanisms influencing prognosis among different BLCA subtypes and the prognostic marker is critical for improving BLCA patient survival outcomes.

2. Materials and Methods

2.1. Data Collection and Processing

Transcriptome data and clinical characteristics from BLCA patients were obtained from the TCGA database (https://portal.gdc.cancer.gov/) and used as a training set. The testing set was composed of GSE13507 datasets obtained from the GEO database (https://www.ncbi.nlm.nih.gov/geo/). For further analysis, only samples with complete prognostic data were extracted.

2.2. Gene Set Variation Analysis

The GSVA algorithm was used with language’s “GSVA” package to reveal biological correlations between training and testing set genes. GSVA is a nonparametric, unsupervised method for identifying closely related pathways to essential genes [9]. By inputting gene expression matrix of training and testing sets, and a collection of gene sets, including “hallmark gene sets,” “KEGG subset of Canonical pathways,” and “immunologic signature gene sets” downloaded from the GSEA database (http://www.gsea-msigdb.org/gsea/index.jsp) for GSVA analysis, the gene expression matrix could be transformed to the matrix of gene set expression to explain the corresponding biological meaning.

2.3. Identification of BLCA Subtypes

The package “CancerSubtypes” helps infer cancer subtypes from input gene sets, using a consensus clustering algorithm to determine the number of subtypes, and uncover potential differences among varying subtypes [10]. The relationship between clinical characteristics (age, gender, grade, stage, TNM stage) and classification was investigated. The software’s “limma” package was used to identify DEGSs within each subtype. DEGSs with a |log2 fold change| of 0.1 and an adjusted were excluded from further analysis.

2.4. Construction of BLCA Prognostic Signature

Univariate Cox analysis and Kaplan–Meier survival analysis were performed in software using the “survival” and “survminer” packages to obtain prognostic DEGSs. Then, using the “glmnet” package, LASSO Cox regression analysis was used to establish the BLCA prognostic signature. The associated expression of these prognostic DEGSs and their correlated coefficient were used to calculate the risk score (RS) of the prognostic DEGSs for each sample. The median value of RS was used to group BLCA samples.

2.5. Functional and Pathway Enrichment Analysis

Pathways with gene symbols were downloaded from KEGG (https://www.kegg.jp/kegg/) and GO (http://geneontology.org/). Functional and pathway enrichment analysis on these prognosis-related genes was performed using the packages “Cluserprofiler.”

2.6. Establishment of Protein-Protein Interaction Network

The PPI network was created using the Search Tool for the Retrieval of Interacting Genes (STRING) database (https://string-db.org/). Furthermore, the Cytoscape software was used to search for meaningful modules in the PPI network (, , , and ).

2.7. Validation of the Core Protein Expression

Several core proteins were chosen from the PPI network. We used the HPA database (http://www.proteinatlas.org/) to obtain the expression level of selected core proteins for further validation.

2.8. Tumor-Infiltrating Immune Cell (TIICs) Analysis

The CIBERSORT algorithm was used to estimate the composition of 22 immune cells in the BLCA prognostic signature between high- and low-RS groups. The relative abundance of each immune cell was calculated based on the gene expression.

2.9. Statistical Analysis

All statistical analyses were conducted using v4.0.3 (https://www.r-project.org/) and Perl v5.32.1.1 (https://strawberryperl.com/). OS was defined as the time between intervals from the date of diagnosis and the date of death by any cause. The prognostic value of gene sets was assessed by Cox regression analysis. was considered statistically significant.

3. Results

3.1. The GSVA Analysis for BLCA

Four hundred thirty-three samples were obtained from the TCGA, including 414 tumor samples and 19 adjacent-tumor samples. Using the package “GSVA,” these sample gene expression matrices were subjected to the GSVA algorithm of hallmark gene sets, KEGG subsets of canonical pathways, and immunologic signature gene sets. We also conducted the GSVA algorithm in GEO datasets GSE13507 for subsequent validation. This step condenses gene-level RNA-seq expression profiles into gene sets used in subsequent analyses. The expression of gene sets in TCGA samples is depicted as a heat map (Figure 1).

3.2. Identification of BLCA Subtypes

The classification of BLCA was conducted through an unsupervised consensus clustering of the “CancerSubtypes” package in 414 tumor samples of TCGA. The samples with incomplete clinical information were discarded, leaving 407 samples. The value determined the optimal number of clusters. In our study, the area under the cumulative distribution function (CDF) curve increased with no significance when ; so, a three-cluster solution () was chosen (Figure S1A). The results showed that 145 samples were cluster I matching subtype A, 126 were cluster II matching subtype B, and 136 were cluster III matching subtype C. Obviously, samples classified as subtype C had a better prognosis than samples classified as subtype A and subtype B (). (Figure S1B; Figure 2). The clinical data show distinct characteristics of each BLCA subtype (Table 1), with significant age, grade, and stage differences. The heat map for gene sets with classified features shows how different gene sets express themselves across multiple subtypes (Figure 3). Similar classified results could be obtained when the GSE13507 dataset was analyzed similarly (Figure S2A-B; Figure 4). Based on the survival curve trend, cluster I matched subtype A, cluster II matched subtype C, and cluster III matched subtype B were significant ().

3.3. Identification of DEGSs

We separately tested for the differential gene set expression among each subtype, including subtype C − subtype A, C − subtype B, and B − subtype A. Then, we compared the gene sets that were differentially expressed in these three groups. Finally, 83 DEGSs were obtained (Figure 5(a)), and their representation of these DEGSs in the three subtypes was markedly different, as shown in Figure 5(b).

3.4. Construction of Prognostic Signature

We applied the univariate cox regression model in 83 DEGSs to identify gene sets that influence patient’s overall survival (OS). A total of 24 DEGSs were obtained () (Table 2). After that, a Lasso regression model (Figure 6(a)) was established to reveal the log (Lambda) value of the 24 screened gene sets. And, after performing crossvalidation (Figure 6(b)), the gene sets with the slightest crossvalidation error were chosen. Finally, we discovered two DEGSs linked to prognosis: “GSE1460_ CD4_THYMOCYTE_ VS_THYMIC_STROMAL_CELL_DN” and “MODULE_253.”

3.5. Verification of Prognostic Signature Using TCGA and GEO Databases

The RS of BLCA samples in the TCGA was calculated, and the calculation formula used was .

According to the median value of RS, the samples were divided into high- and low-RS groups. Kaplan–Meier survival analysis indicated that the OS of the patients was worse in the high-RS group than in the low-RS group in the TCGA-BLCA cohort (; Figure 7(a)). Similar results were confirmed using GSE13507 datasets in GEO databases (; Figure 7(b)). These results illustrate that the prognostic signature has effective predictive power in OS. Aside from that, we conducted survival analysis for the two DEGSs. The results suggested that having a high expression level of both DEGSs is associated with having a low OS (Figures 8(a) and 8(b)).

3.6. GO and KEGG of Prognostic Signature

Table 3 contains the gene list for two DEGSs. GO and KEGG analysis for “GSE1460_CD4_THYMOCYTE_VS_THYMIC_STROMAL_CELL_DN” (199 genes) and “MODULE_253” (21 genes) was performed, respectively. The GSEA database contains a detailed gene list. The former included 417 GO terms of biological process, 34 GO terms of cellular component, and 50 GO terms of molecular function (). The top 30 GO terms are shown in Figure 9(a). In addition, 234 GO terms of biological process, 64 GO terms of cellular component, and 31 GO terms of molecular function were discovered in the latter (). The top 30 significantly enriched GO terms are shown in Figure 9(b). Furthermore, genes from the former were enriched considerably into four KEGG pathways (; Figure 9(c)), while genes from the latter were significantly enriched into fifteen KEGG pathways (; Figure 9(d)).

3.7. PPI Network of Prognostic Signature

The PPI network analysis was carried out separately for each DEGS via String software. We found a PPI network with 138 nodes and 184 edges from “GSE1460_CD4_THYMOCYTE_VS_THYMIC_STROMAL_CELL_DN” (Figure 10(a)) and a network with 19 nodes and 78 edges from “MODULE_253” (Figure 10(b)). In addition, for further analysis, we used the Cytotype software’s MCODE app. The former’s PPI network generated three functional subnet modules (subnet 1, subnet 2, and 3). The hub nodes, CDH2, FGF2, and COL1A1, had higher node degrees in subnet 1 (Figure 11(a)), EIF2S2, PSMA3, and NAA10 were hub nodes in subnet 2 (Figure 11(b)), and DNM1L was the hub node in subnet 3 (Figure 11(c)). Meanwhile, we obtained two functional subnet modules, namely, subnets 1 and 2. KIF2C, TUBA4A, KIF5A, KIF11, and KIF23 were filtered as hub nodes in subnet 1 (Figure 12(a)), while KIF5C was filtered as a node in subnet 2 (Figure 12(b)). To validate these findings, we referred to the HPA (https://www.proteinatlas.org/) database. We discovered that most of the proteins were observed in cancer-adjacent normal tissues and cancer tissues. Still, it was difficult to detect significant differences between normal and cancer tissue (Figures 13(a) and 13(b)).

3.8. Relation between Tumor-Infiltrating Immune Cells and RS

The CIBERSORT algorithm was used to depict the composition of TIICs in all BLCA samples. As shown in the results, high-RS groups had higher fractions of CD8 T cells, M0 macrophages, M2 macrophages, activated dendritic cells, and neutrophils than low-RS groups. In contrast, low-RS groups had lower fractions of resting memory CD4 T cells, memory B cells, plasma cells, follicular helper T cells, regulatory T cells, and monocytes (Figure 14).

4. Discussion

Different BLCA subtypes have different invasive properties, which are also associated with different aggressiveness and prognoses [15]. We were able to identify DEGSs among multiple subtypes thanks to the help of GSVA analysis and the “Cancersubtypes” package. The “Cancersubtypes” package classified BLCA samples into three subtypes based on the gene set expression, and the prognosis differed significantly among the three subtypes. The significant differences in age, grade, and stage, on the other hand, may reflect the fact that the three subtypes of BLCA are at different stages of cancer progression. However, with the assistance of this advanced algorithm, we could still investigate the mechanism underlying BLCA progression. The combination of differential expression analysis and PPI network analysis can identify critical nodes influencing prognosis. Given the current state of treatment for BLCA, these critical nodes may be significant determinants of prognosis.

We discovered two DEGSs associated with prognosis in this study: “GSE1460_CD4_THYMOCYTE_VS_THYMIC_STROMAL_CELL_DN” and “MODULE_253.” Higher levels of both were linked to a lower OS rate. According to the GSVA database, the former is represented as differentially expressed genes between CD4 thymocytes and thymic stromal cells, implying that CD4+ T cells may play an essential role in BLCA progression, and the latter is defined as intracellular transport. We pursued this further to mine the biological significance. Following functional analysis, it was discovered that these DEGSs were primarily enriched in an extracellular matrix organization, MHC II antigen presentation, and the microtubule-associated pathway. This result was consistent with the GSVA database’s description of DEGSs. The next step was to analyze the PPI network to find the corresponding proteins playing important roles. We analyzed both gene sets separately, resulting in several subnets and hub proteins. By comparing information about the protein expression in HPA, we discovered the following core proteins expressed in human BLCA tissue: CDH2, COL1A1, EIF2S2, PSMA3, NAA10, DNM1L, TUBA4A, KIF11, and KIF23.

CDH2, also known as N-cadherin, is a mesenchymal cell development regulator that was thought to be a discriminatory marker for interstitial cells in the human bladder and a critical biomarker of epithelial-mesenchymal transition (EMT) [1618]. The EMT commonly thought to be a dysregulation of wound healing mechanisms is essential in cancer invasion and metastasis. The CDH2 expression is increasing, indicating a shift toward a mesenchymal phenotype. In summary, CDH2 may be involved in the progression of BLCA through the EMT. In subnet 1, COL1A1 (collagen type I alpha 1) is one of the hub nodes, along with CDH2. COL1A1 may also contribute to tumor progression by promoting EMT [19]. Many studies have shown that COL1A1 is a crucial factor in the invasion and metastasis of BLCA. The COL1A1 expression was higher in MIBC than in NMIBC, and the increased expression was associated with a poor prognosis in NMIBC.

Meanwhile, the low COL1A1 expression was thought to inhibit tumor proliferation and metastasis [1921]. EIF2S2, PSMA3, and NAA10 were core proteins in “GSE1460_CD4_THYMOCYTE_VS_THYMIC_STROMAL_CELL_DN” subnet 2.

EIF2S2 is an RNA binding protein that regulates the gene expression. Its regulatory effects were reported to play a role in the occurrence and development of many cancers, but these did not involve BLCA. One of EIF2S2’s carcinogenic pathways is through long noncoding RNA. EIF2S2 activated the Wnt signaling pathways to drive cancer development by regulating the interaction of LINC01600 with Myc protein. The other is glucose metabolism regulation. Typically, cancer cells rely on aerobic glycolysis (the Warburg effect) for energy, and knocking out EIF2S2 reduces the expression of glycolysis-related genes [2224]. Although previous research has not addressed the carcinogenesis of BLCA, our findings suggest that EIF2S2 as a differential expression gene among three subtypes may play an essential role in BLCA development, which is supported by transcriptomics and proteomics analysis. PSMA3 participates in forming the 26S proteasome complex, which is involved in the degradation of several proteins [25]. There is currently no direct evidence that PSMA3 participates in tumor invasion and metastasis; however, some studies have suggested that the proteasome complex may play a role in cancer aggravation [26, 27]. PSMA3 was found to correlate with BLCA in our study, implying that we may have discovered an undiscovered biomarker that predicts BLCA progression, but further research is needed in follow-up studies. NAA10 catalyzes acetylation and is involved in cell proliferation [28]. More importantly, the NAA10 expression has been found in tumors from various organs, including the urinary bladder, cervix, liver, bone, lung, breast, colon, and prostate [29, 30]. However, the pathway by which NAA10 induces tumorigenesis varies depending on the target proteins in different cancer tissues. The NAA10–AR (androgen receptor) axis is essential for prostate cancer cell growth [31]. Considering the gender differences in BLCA epidemiology, we made the bold assumption that the NAA10–AR axis also promotes BLCA progression.

DNM1L, also known as dynamin-related protein 1, is a GTPase that functions in the cytosol of dynamins [32]. Mitochondria are the primary sites of cellular respiration, and their fragmentation is a typical phenotype in many cancers. DNM1L is an essential regulator of mitochondrial fission [33]. This could be why DNM1L has been linked to a poor prognosis.

The PPI network of “MODULE_253” is centered on TUBA4A, KIF11, and KIF23. They could be related to intracellular substance transport. TUBA4A (tubulin alpha 4a) is the gene that encodes -tubulin. TUBA4A mutations have been linked to neurodegenerative diseases such as amyotrophic lateral sclerosis (ALS) and frontotemporal dementia (FTD) [34, 35]. Little articles concern TUBA4A in cancer, and only lung cancer was mentioned [3638]. TUBA4A could be a potential biomarker for BLCA prognosis prediction, but specific mechanisms must be investigated. The kinesin superfamily (KIFs) was a group of proteins that functioned as microtubule-based motors and served as the foundation for intracellular substance transport [39]. KIF11, also known as kinesin spindle protein, is important during mitogenesis and cell proliferation [40]. Furthermore, KIF11 is upregulated in various human cancers, including bladder cancer, renal clear cell carcinomas, prostate carcinomas, meningiomas, breast cancer, and gastric ccancer [41, 42]. Because high levels of the KIF11 expression indicate robust cell proliferation, anti-KIF11 therapy is emerging as a promising cancer therapy approach. KIF11 inhibitors were more effective at inhibiting the growth of gemcitabine-resistant bladder cancer cell lines [43]. Similarly, by regulating mitogenesis, KIF23 can influence cancer cell proliferation. Cell and animal experiments [44] confirmed the link between KIF23 and bladder cancer. According to the GSVA analysis and consensus clustering algorithm of “CancerSubtypes,” the high expression of gene set “GSE1460_CD4_THYMOCYTE_VS_THYMIC_STROMAL_CELL_DN” contributes to the poor prognosis of BLCA patients. As a result, we used the CIBERSORT algorithm to determine the composition of TIICs. We discovered that the fraction of resting memory CD4 T cells, memory B cells, plasma cells, follicular helper T cells, regulatory T cells, and monocytes was lower in the low-RS groups than in the high-RS groups. This finding suggests that these TIICs may play a role in antitumor immunity.

5. Conclusion

In this study, we identified two gene sets related to the prognosis of BLCA, and GO and KEGG analyses were performed based on the genes they contained. Further PPI analysis confirmed the network’s core proteins as the most known therapeutic targets of BLCA or other cancers, while others were newly discovered biomarkers. In the HPA database, these proteins were partially validated. Given that BLCA is frequently multifocal, the tumor could have been embedded within the adjacent normal tissues. Further experimental validation will necessitate the collection of valuable donor transplant samples to rule out such effects.

Data Availability

The raw data can be obtained from public TCGA database (https://portal.gdc.cancer.gov/), GEO database (GSE13507) (https://www.ncbi.nlm.nih.gov/geo/), HPA database (https://www.proteinatlas.org/), and GSVA database (http://www.gsea-msigdb.org/gsea/index.jsp). Further inquiries can be directed to the corresponding author.

Conflicts of Interest

All authors declare no conflict of interests.

Authors’ Contributions

Hanwen Li and Shaohua Chen have contributed equally to this work and share first authorship.


This work was supported by the National Natural Science Foundation of China (81860142), the National Key Research and Development Program of China (2017YFC0908000), the Natural Key Research and Development Project (2020YFA0113200), the Major Project of Guangxi Innovation Driven (AA18118016), the Guangxi key Laboratory for Genomic and Personalized Medicine (20-065-33), and Natural Science Foundation of China (81770759, 82060145, 31970814).

Supplementary Materials

Supplementary 1. Figure S1: classification of TCGA samples by the consensus clustering algorithm. Several clusters . (b) Cluster plot of BLCA.

Supplementary 2. Figure S2: classification of GEO samples by the consensus clustering algorithm. Many clusters . (b) Cluster plot of BLCA.