Abstract

Introduction and Aims. Hepatocellular carcinoma (HCC) is one of the most lethal tumors of the digestive system, but its mechanisms remain unclear. The purpose of this study was to study HCC-related genes, build a survival prognosis prediction model, and provide references for treatment and mechanism research. Methods. Transcriptome data and clinical data of HCC were downloaded from the TCGA database. Screen important genes based on the random forest method, combined with differential expression genes (DEGs) to screen out important DEGs. The Kaplan‒Meier curve was used to evaluate its prognostic significance. Cox regression analysis was used to construct a survival prognosis prediction model, and the ROC curve was used to verify it. Finally, the mechanism of action was explored through GO and KEGG pathway enrichment and GeneMANIA coexpression analyses. Results. Seven important DEGs were identified, three were highly expressed and four were lowly expressed. Among them, GPRIN1, MYBL2, and GSTM5 were closely related to prognosis (). After the survival prognosis prediction model was established, the survival analysis showed that the survival time of the high-risk group was significantly shortened (), but the ROC analysis indicated that the model was not superior to staging. Twenty coexpressed genes were screened, and enrichment analysis indicated that glutathione metabolism was an important mechanism for these genes to regulate HCC progression. Conclusion. This study revealed the important DEGs affecting HCC progression and provided references for clinical assessment of patient prognosis and exploration of HCC progression mechanisms through the construction of predictive models and gene enrichment analysis.

1. Introduction

Liver cancer is the fifth most common cancer and the fourth leading cause of cancer-related death worldwide [1]. HCC is the most common type of liver cancer, accounting for about 75%–85%, and has the characteristics of a high mortality and high metastasis and recurrence rate [2]. Studies have shown that genetic mutations, chromosomal aberrations, molecular signaling pathways, and epigenetic dysregulation are all associated with the development of HCC [3]. At present, in addition to traditional surgical resection, radiofrequency ablation, transarterial chemotherapy, and other methods have been developed for the treatment of liver cancer [4]. Undeniably, surgical resection is still the most effective treatment for HCC. However, due to the insidious onset of HCC, many patients have already lost the opportunity for surgery when they come to the clinic. Even with surgical resection, the 5-year recurrence rate is as high as 70% [5], and the 5-year overall survival (OS) rate is only 15%–19% [6].

It is well known that the tumor microenvironment plays a crucial role in the occurrence and development of tumors. The interaction between various signaling molecules in the microenvironment is also a hot topic in tumor-related research. The occurrence of HCC is closely related to the inflammatory response of the environment, and 90% of HCC is associated with inflammation [7]. In the state of liver inflammation, the dysregulation of the interaction between cytokines, chemokines, and growth factors is an important cause of liver cancer [8, 9]. The original research on the IL-6, IL-1, and TGF-beta inflammatory molecules based on recent years for immune checkpoint research to further explore the development mechanism of HCC provided new insights. Studies have found that immune checkpoint molecules, such as programmed death-1(PD-1), cytotoxic T-lymphocyte antigen 4 (CTLA4), lymphocyte activating gene 3 protein (LAG-3), and mucin domain molecule 3 (TIM-3), are upregulated on liver cancer cells and tumor-specific T cells, which can lead to CD8+ T cell apoptosis and poor prognosis in patients [10]. At the same time, we cannot ignore that there are a large number of proangiogenic factors produced by cancer cells or tumor-infiltrating lymphocytes or macrophages in the tumor microenvironment, such as vascular endothelial growth factor (VEGF), which can promote tumor angiogenesis [11]. Angiogenesis is indispensable for tumor development, invasion, and metastasis [12]. On this basis, targeted drugs such as sorafenib, lenvatinib, VEGF inhibitors, and immune checkpoint inhibitors (ICIs) have significantly improved the prognosis of patients in recent years, but the overall treatment effect is still poor due to the changes in HCC heterogeneity and the continuous emergence of phenotypic drug resistance [1319]. Therefore, it is particularly urgent to find a way to evaluate the disease early and take personalized treatment measures to improve the prognosis of patients.

The development of liver cancer is a multistep process caused by changes in signaling pathways triggered by multiple genes, and it shows high heterogeneity within tumors, between tumors, and between patients [2024]. DEGs play an important role in this process. Therefore, considering the high heterogeneity of HCC, the limited treatment methods, and the poor prognosis of patients, it is more urgent to further explore the development mechanism of HCC and new survival and prognostic models. Nault et al. first identified a genetic marker associated with the development of HCC in 2013 [25]. Subsequent studies have also shown that gene mutations occurring in HCC can be used as biological markers for targeted therapies [26]. Although it has been demonstrated that programmed death ligand-1 (PD-L1) inhibitors combined with antiangiogenesis therapy can significantly improve the prognosis of patients with HCC [27], intervention-related toxicity and difficulty in determining the optimal dosing phase have hindered further benefit for patients [28, 29]. Therefore, the search for HCC-related dysfunctional genes is particularly important to elucidate the mechanisms underlying the development of the disease and to improve the prognosis of patients. Thanks to the rapid development of sequencing technology, many disease-related marker genes have been identified one after another, which has laid a solid foundation for the screening of HCC-related genes and the establishment of prognostic models. Public databases such as The Cancer Genome Atlas (TCGA) are useful tools to screen microarray data for DEGs associated with the development of HCC [30, 31].

In this study, we used a random forest algorithm to identify key genes expressed in HCC in the TCGA database and screened DEGs between HCC and normal samples. On this basis, 7 important DEGs were finally screened. Subsequently, we performed enrichment analysis on these 7 important DEGs and analyzed the expression levels of these genes in different clinical states. Furthermore, we performed survival analysis and COX regression analysis, constructed a prognostic risk score model, and plotted the receiver operating characteristic (ROC) curve. Finally, 20 coexpressed genes were screened by GeneMANIA, and GO and KEGG enrichment analyses were performed to further explore their biological functions and molecular pathways. The DEGs of HCC discovered in this study, as well as the constructed survival prognosis prediction model, are expected to provide new insights into the clinical treatment and biological mechanisms of HCC.

2. Materials and Methods

2.1. Data Source

The data for our study were extracted from The Cancer Genome Atlas (TCGA; https://portal.gdc.cancer.gov, up to July 31, 2022) database, which contains transcriptomic data of 374 HCC tumor samples and 50 normal samples, as well as 370 clinical samples and related data. We used data obtained from the TCGA database using Illumina HiSeq Systems, and the sequencing data format was a Counts file.

2.2. Random Forest Screening for Important Genes

Build a random forest model using the random forest package [32]. First, calculate the average model false positive rate based on out-of-band data for all genes and select 400 as the optimal number of trees to include in the random forest. Next, build a random forest model and use the Gini coefficient method to obtain dimension importance values for the random forest model. The genes with the top 30 importance values were selected for subsequent analysis.

2.3. Identification of DEGs

Expression data downloaded from TCGA were analyzed using the Limma package of R version 4.2.0 [33], and fold differential expression was calculated after removing or averaging probe sets without corresponding gene symbols or genes with multiple probe sets, respectively. The criteria for setting the DEG were as follows: genes with adjusted value <0.05 and |logFC (fold change)| ≥ 1. Plot volcano plots using the ggplot2 package. Next, DEGs and genes screened by randomForest were combined to extract important differentially expressed genes.

2.4. Expression of DEGs in Different Clinical States

We further investigated the expression of DEGs and their association with different clinical states of HCC: event, age, gender, and stage. Violin plots were drawn using the ggplot2 package of the R software. Differences in gene expression among different groups were analyzed using SPSS 27.0. Definitions: , , .

2.5. Construction and Validation of an HCC Prognostic Risk Scoring Model

Kaplan‒Meier survival analysis was performed on the important DEGs using the R software survival and survminer packages, and the related survival curves were drawn.

Based on univariate and multivariate Cox regression analyses, a prognostic risk score model was constructed. According to the risk score grouping, prognostic analysis and Cox regression analysis were performed to verify whether the risk score could be used as an independent risk factor for evaluating the survival and prognosis of HCC patients. The specificity and sensitivity of the risk scoring model were verified using the R software pROC package, and the ROC was drawn using the ggplot2 package.

2.6. Enrichment Analyses of Important DEGs

Analysis of Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment plays a very important role in the annotation of gene products and the study of molecular mechanisms [34, 35]. We used enrichGO and enrichKEGG packages in the R language for enrichment, and adjusted values <0.05 were considered significant; enrichment point plots were drawn using the ggplot2 package.

2.7. Analysis of the Relationship between DEGs Genes

We constructed a coexpression network of these genes using GeneMANIA (https://www.genemania.org/) [36] and identified associations within them. Subsequently, we further carried out enrichment analysis on the coexpressed genes of important DEGs, intending to explore their biological functions and molecular mechanisms.

3. Results

3.1. Identification of Important DEGs

The research flowchart of this study is shown in Figure 1. First, we downloaded the expression profiling data of LIHC from the TCGA database along with clinical data. To find the genes with the greatest influence on the phenotype, we used the random forest method to screen. The relationship between the error of the reference model and the number of decision trees is shown in Figure 2(a). We selected 400 trees as the parameters of the final model, and the model error was stable at this time. We evaluated the final results using the Gini coefficient method and selected the top 30 genes as candidate genes (Table 1 and Figure 2(b)).

Next, we screened 1,564 DEGs using the Limma method and plotted the volcano (Figure 2(c)). After interacting with 30 candidate genes, 7 important DEGs were finally screened (Table 2). Analysis of the expression profiles of these seven DEGs revealed that MAFG-DT, GPRIN1, and MYBL2 were highly expressed in the tumor group, and LINC00907, GSTZ1, CCN1, and GSTM5 were lowly expressed in the tumor group (Figure 2(d)).

3.2. Expression and Clinical Parameters of Important DEGs in Patients

To further analyze the relationship between these seven important DEGs and clinical status, we used SPSS to compare the expression differences of each gene under different groups, and used the ggplot2 package of R software to draw a violin plot (Figures 36).

According to the outcome of patience, we divided the patients into the normal group, live group, and dead group. The results showed that compared with the normal group, the live and death groups showed significant differences in each gene (). In addition, GSTZ1 expression was significantly higher in the live group than that in the death group (), while MAFG (), GPRIN1 (), and MYBL2 () were significantly lower (Figure 3).

Stratified analysis by age revealed that GSTZ1 expression was significantly lower in the 41–60 years group than that in the 61–80 years group (), while the opposite was true for MYBL2 (). The expression of LINC00907 in the ≤20 years group was significantly higher than that in the 21–40 years group (). In addition, the expression of CCN1 was significantly lower in the group of 41–60 years () and 61–80 years () than in that of the over 80-year-old group. There was no significant difference in the expression of the remaining genes among the age groups () (Figure 4).

By analyzing the expression levels of each gene in patients of different genders, we found that the expression level of GSTZ1 () in female patients was significantly lower than that in male patients, while LINC00907 () was just the opposite (Figure 5).

The expression level of MAFG-DT was significantly lower in stage I than in stages II () and III () when grouped according to the HCC stage. However, the expression level of GSTZ1 in stage I was significantly higher than that in stage III (). In addition, GPRIN1 expression was significantly lower in stage I compared with stage II () and stage III (), and the same difference was observed with MYBL2 (). In contrast, the expression level of GSTM5 in stage III was significantly lower than that in stages I () and II () (Figure 6).

3.3. Impact of Important DEGs on Patients’ OS

We performed survival analysis using the survival and survminer packages of R software and used survival curves combined with log-rank tests to assess the impact of important DEGs on patient OS. As shown in Figure 7, high expression of GPRIN1 and MYBL2 indicated better prognosis of patients (; ), while patients with low expression of GSTM5 had better prognosis (). However, MAFG-DT, GSTZ1, LINC00907, and CCN1 were not associated with patient prognosis (all ).

3.4. Construction and Validation of a Prognostic Risk Model for HCC Patients

First, the selected seven important DEGs were combined with survival time and survival status and then included in a multivariate Cox regression analysis (Table 3 and Figure 8(a)), and a prognostic risk score model was constructed based on the results. The risk score calculation method is 7 important as the product of DEGs expression level and risk coefficient. The specific risk score model is risk score = “MAFG-DT” 0.069645 − “GSTZ1” 0.070909 + “GPRIN1” 0.009885 + “MYBL2” 0.461418 + “LINC00907” 0.010721 + “CCN1” 0.227363 − “GSTM5” 0.116514.

We divided 346 HCC samples into a high-risk group and a low-risk group, with 173 cases in each group, using the median risk value (2.4211) as the cutoff value. Survival analysis showed that there was a significant difference between the two groups (), and the survival time of the low-risk group was significantly longer than that of the high-risk group (Figure 8(b)).

The results of univariate Cox regression analysis showed that the risk score model was significantly correlated with survival time and survival status (hazard ratio = 0.5066, 95% Confidence interval = 0.353–0.727, ); further multivariate Cox regression analysis results also showed that the risk score model was closely related to the prognosis status (hazard ratio = 0.5549, 95% confidence interval = 0.3767–0.8173, ) (Table 4 and Figure 8(c)). The subsequent proportional hazard analysis also confirmed that patients with a lower risk score had a better prognosis (Figure 8(d)).

Finally, using ROC to evaluate the relationship between clinical characteristics and the prognosis of HCC patients, the results showed that the risk scoring model (AUC = 0.582) was the second most important factor for prognosis after stage (AUC = 0.659) (Figure 9).

3.5. Enrichment Analysis and Interaction Analysis

To further explore the mechanism of action and signaling pathway of important DEGs, we performed GO and KEGG enrichment analyses on them. GO enrichment analysis found that important DEGs were mainly enriched in the glutathione metabolic process (), glutathione transferase activity (), and transferase activity, transferring alkyl or aryl (other than methyl) groups () (Table 5 and Figure 10(a)). KEGG enrichment analysis showed that important DEGs were mainly enriched in the following seven pathways: tyrosine metabolism (), glutathione metabolism (), chemical carcinogenesis–DNA adducts (), drug metabolism–cytochrome P450 (), platinum drug resistance (), metabolism of xenobiotics by cytochrome P450 (), and drug metabolism–other enzymes () (Table 6 and Figure 10(b)).

To understand the interaction network of these genes, we used the GeneMANIA database for analysis. The results showed that these genes were in a complex PPI network, with physical Interactions of 77.64%, coexpression of 8.01%, predicted of 5.37%, colocalization of 3.63%, genetic Interactions of 2.87%, pathway of 1.88% and shared protein domains of 0.60% (Figure 10(c)). GO enrichment analysis of these 25 coexpressed genes found that, for biological processes, they were mainly enriched in the glutathione metabolic process (), benzene-containing compound metabolic process (), and cellular modified amino acid metabolic process (); in terms of cell composition, it is mainly located in the intercellular bridge (), transcription regulator complex (), and transcription repressor complex (); and its molecular functions are mainly involved in glutathione transferase activity (), glutathione binding (), and oligopeptide binding () (Table 7 and Figure 10(d)). KEGG enrichment analysis showed that these genes were mainly enriched in the following pathways: cellular senescence (), glutathione metabolism (), chemical carcinogenesis-DNA adducts (), drug metabolism–cytochrome P450 (), platinum drug resistance (), metabolism of xenobiotics by cytochrome P450 (), drug metabolism-other enzymes (), and tyrosine metabolism () (Table 8 and Figure 10(e)).

4. Discussion

At present, many studies have found many genes that affect HCC, but the mechanism of HCC occurrence and development is still not very clear, and there is an urgent need to further explore the factors affecting its development and prognosis. Although several previous studies have analyzed gene signatures related to HCC prognosis [30, 3739], these studies did not further screen genes that are more closely related to patient survival after screening DEGs. Therefore, in this study, we used the random forest and limma algorithms to screen out 30 important genes and 1,564 DEGs, respectively, and then took the intersection of the two to further screen out 7 important DEGs: MAFG-DT, GSTZ1, GPRIN1, MYBL2, LINC00907, CCN1, and GSTM5. Subsequent enrichment analysis, expression profiling analysis, survival analysis, and the constructed prognostic prediction model indicated that they are closely related to the occurrence and prognosis of HCC.

Among the seven important DEGs, MAFG-DT (logFC = 2.295817), GPRIN1 (logFC = 2.444281), and MYBL2 (logFC = 3.861042) were all significantly elevated in liver cancer samples (Figure 2(d)). MAFG-DT is an oncogenic long noncoding RNA (lncRNA), and many previous studies have shown that overexpression of MAFG-DT can promote the proliferation and metastasis of cancer cells [4043]. High expression of MAFG-DT is significantly associated with poor prognosis in bladder and breast cancer patients [44, 45]. In this study, MAFG-DT was highly expressed in liver cancer patients, and the normal group was significantly different from the liver cancer group after grouping by age, gender, and stage (Figures 36). In addition, after grouping according to the high and low expression of MAFG-DT, the survival time of the low expression group was higher than that of the high expression group, although there was no significant difference () (Figure 7). As a member of the GPRIN family, GPRIN1 acts on the classical G protein-coupled receptor pathway [46]. GPRIN1 is closely related to cancer, and it can promote the proliferation and metastasis of lung cancer by promoting the epithelial-mesenchymal transition of lung cancer cells [47]. But interestingly, Zhou et al. found that GPRIN1 is downregulated in gastric cancer cells and tissues, and it can inhibit the invasion and metastasis of gastric cancer [48]. Our results showed that GPRIN1 was highly expressed in liver cancer tissues, and patients with high GPRIN1 expression had a worse prognosis () (Figure 7). Therefore, we speculate that due to the different molecular biological mechanisms of different cancers, GPRIN1 promotes the malignant behavior of tumors in lung cancer and liver cancer but plays an inhibitory role in gastric cancer. MYBL2, a transcription factor in the myeloblastosis family, plays a key role in cell proliferation, differentiation, and cell cycle. Previous studies have shown that MYBL2 is highly expressed in cancers such as ovarian cancer and breast cancer and affects the prognosis of patients [49, 50]. Our study on liver cancer also proved that MYBL2 is highly expressed in cancer tissues, and the high-expression group has a poor prognosis () (Figure 7).

In addition, LINC00907 (logFC = −2.6057), CCN1 (logFC = −2.05925), GSTZ1 (logFC = −2.34197), and GSTM5 (logFC = −2.8954) were lowly expressed in liver cancer tissues. Through survival analysis, only GSTM5 was found to be associated with patient prognosis () (Figure 7). The glutathione S-transferase (GST) gene family plays an important role in detoxification in the body, protecting cells from damage by poisons and charged particles. The GST family is closely related to the occurrence and development of many tumors, and the same GSTM5 as a member is abnormally expressed in ovarian cancer and nonsmall cell lung cancer [51, 52]. In this study, the expression of GSTM5 was decreased in HCC tissues and suggested a poor prognosis, which was also consistent with the previous findings, further indicating that GSTM5 plays a key role in tumorigenesis.

Next, we constructed a risk-scoring model based on the multivariate Cox regression analysis of 7 important DEGs. Kaplan‒Meier survival analysis showed that the high-risk group had a significantly lower survival time () (Figure 8(b)). However, subsequent ROC analysis showed that the risk-scoring model was not dominant compared to the stage (AUC = 0.659 vs 0.582) (Figure 9). However, in the ROC analysis, the AUC of the risk model is second only to the stage. Combined with the results of the Kaplan‒Meier survival analysis, we believe that the risk model still has a certain significance. Especially when the patient is in the early stage of the disease, and the stage cannot indicate the disease progression, early individualized treatment according to the risk score has extremely important value for improving the prognosis of patients.

To further study the molecular signaling pathways of important DEGs, we performed enrichment analysis and coexpression analysis. As we discussed before, both GSTZ1 and GSTM5 belong to the GST family, so the screened important DEGs were mainly enriched in the glutathione metabolic process and glutathione transferase activity (; ) (Table 5), and further enrichment analysis of their interacting proteins showed the same results (; ) (Table 7). The KEGG enrichment analysis showed that glutathione metabolism () was still a major enrichment pathway, and interestingly, we found that more genes were enriched in cellular senescence () (Table 8). In addition to the central gene MYBL2, the cellular senescence pathway has its associated genes: LIN9, LIN37, LIN54, E2F4, RBL1, and FOXM1, which together constitute the DREAM complex and play an important role in cell cycle regulation [53]. This proves that MYBL2 can also promote the progression of HCC by interfering with cell cycle regulation.

However, this study still has some limitations. Firstly, the data for model construction and validation were obtained from the TCGA database. This public database contains incomplete information, and the data are all retrospective. Therefore, prospective real-world studies are necessary to verify the reliability of the model. It should be noted that in the process of research, it is necessary to comprehensively collect data at all stages of disease progression, such as blood samples and imaging data, etc., to eliminate information distortion caused by incomplete data collection as far as possible. To improve the representativeness of the results, multicenter studies are also necessary. Secondly, the data used in this study were only from the TCGA database, which may make the results lack representative. Therefore, data from other databases, such as GEO and Oncomine databases, can be selected for subsequent analysis and cross-validation. Thirdly, the diagnostic efficacy of the risk score model constructed in this study was not superior to staging, although the results of its KM analysis were beneficial. One of the reasons for this may be due to the bias caused by the data analysis using only the TCGA database. However, compared with the stage, the risk scoring model has more important significance in the evaluation of patients in the early stage of the disease to improve the prognosis. The performance of this model will be tested in clinical cohorts in the future. Finally, the seven important DEGs screened in this study are currently only data demonstrations. We will carry out in vitro and in vivo experiments to further explore the specific molecular pathways of these genes in HCC.

5. Conclusion

In conclusion, we innovatively used a combination of random forest and Limma to screen out the important DEGs for HCC development. Expression analysis and survival analysis were performed, indicating that these genes are closely related to the survival of HCC patients. The subsequently constructed prognostic risk scoring model has good predictive value for the prognosis of HCC patients, and combining it with other clinical indicators can provide an effective reference for clinical treatment. Subsequent enrichment analysis and coexpression analysis showed that seven important DEGs were closely related to cellular senescence and glutathione metabolism, which also provided new ideas for further research on the molecular mechanism of HCC occurrence and development. In brief, the early risk score model provided in this study can be used as a reference for subsequent personalized treatment of patients and ultimately help to improve prognosis.

Abbreviations

CTLA4:Cytotoxic T-lymphocyte antigen 4
DEGs:Differentially expressed genes
GO:Gene ontology
HCC:Hepatocellular carcinoma
ICIs:Immune checkpoint inhibitors
KEGG:Kyoto Encyclopedia of Genes and Genomes
LAG-3:Lymphocyte activating gene 3 protein
OS:Overall survival
PD-1:Programmed death-1
PD-L1:Programmed death ligand-1
ROC:Receiver operating characteristic
TCGA:The Cancer Genome Atlas
TIM-3:Mucin domain molecule 3
VEGF:Vascular endothelial growth factor.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Authors’ Contributions

Yikai Wang and Xiongtao Liu conceived and designed the study. Yikai Wang analyzed data, created graphs, and wrote the manuscript. Le Ma, Pengjun Xue, and Bianni Qin checked the analysis results. Ting Wang, Bo Li, and Lina Wu helped collect data and references. Liyan Zhao checked the article’s format and language. Xiongtao Liu is responsible for the overall content as a guarantor.

Acknowledgments

This article was supported by the Key Research and Development Program of Shaanxi (grant number 2021SF-227).