Hepatocellular carcinoma (LIHC) is the fifth common cancer worldwide, and it requires effective diagnosis and treatment to prevent aggressive metastasis. The purpose of this study was to construct a machine learning-based diagnostic model for the diagnosis of liver cancer. Using weighted correlation network analysis (WGCNA), univariate analysis, and Lasso-Cox regression analysis, protein-protein interactions network analysis is used to construct gene networks from transcriptome data of hepatocellular carcinoma patients and find hub genes for machine learning. The five models, including gradient boosting, random forest, support vector machine, logistic regression, and integrated learning, were to identify a multigene prediction model of patients. Immunological assessment, TP53 gene mutation and promoter methylation level analysis, and KEGG pathway analysis were performed on these groups. Potential drug molecular targets for the corresponding hepatocellular carcinomas were obtained by molecular docking for analysis, resulting in the screening of 2 modules that may be relevant to the survival of hepatocellular carcinoma patients, and the construction of 5 diagnostic models and multiple interaction networks. The modes of action of drug-molecule interactions that may be effective against hepatocellular carcinoma core genes CCNA2, CCNB1, and CDK1 were investigated. This study is expected to provide research ideas for early diagnosis of hepatocellular carcinoma.

1. Introduction

Liver cancer is one of the most common cancers worldwide. The incidence of liver cancer accounts for 8.2% of total cancer cases and 4.7% of total cancer deaths. In some regions, the incidence rate is still increasing. Despite significant improvements in the diagnosis and treatment of liver cancer, the long-term prognosis remains poor, and liver cancer remains an important global clinical challenge [1, 2]. Therefore, the development of more sensitive diagnostic methods, the use of new biomarkers, and the construction of effective prognostic models are important to improve the survival time of patients.

Image-based algorithms drive the diagnosis of disease [3, 4]. HCC can be diagnosed based on imaging features alone, and the noninvasive nature and wide availability have led many HCC guidelines to recommend image-based diagnosis [5]. Several studies have shown good results in classifying liver cancer images by using a machine learning approach [6, 7]. Serum α-fetoprotein (AFP) has been widely used as a predictive and prognostic biomarker for HCC, but the sensitivity of AFP for detecting early stage HCC is limited [8, 9]. The construction of machine learning models by some candidate markers such as cell-free DNA (cfDNA) provides room for improvement in the diagnosis of HCC [10]. In recent years, the use of other samples for prediction has also shown good promise [11, 12]. However, the selection of these characteristics does not explain well the mechanism of hepatocarcinogenesis and prognostic therapies at the genetic level. RNA sequencing (RNA-seq) can reveal gene fusions, splice variants, mutations/insertions deletions, and differential gene expression, thus providing a more complete genetic map than DNA sequencing [13]. Some current statistical analyses of the hepatocellular carcinoma transcriptome suffer from the problem of focusing only on statistical results and detaching from the biological context. Machine learning is better able to deal with complex nonlinear relationships in the data than some conventional statistical tools; however, part of the models lacks explanatory power [14].

Weighted correlation network analysis (WGCNA) aggregates genes with similar expression patterns in the same genetic module and identifies relationships between gene modules and phenotypes to identify potential candidate biomarkers or novel therapeutic targets [15]. The competing endogenous RNA (ceRNA) hypothesis addresses a complex posttranscriptional regulatory network. When two transcripts contain the same miRNA response element (MRE), they can compete with a shared miRNA. This means that the upregulation of one transcript leads to the segregation of more copies of the shared miRNA, which reduces the expression of the other transcript and vice versa [16].

In this study, we screened genes with survival and other indicators by WGCNA and used univariate—Lasso—multiple Cox regression analysis and survival analysis and protein-protein interactions (PPI) to screen for signature genes. Training was performed using random forest, gradient boosting, support vector machine, logistic regression, and integrated learning. The gene characteristics are also explained by ceRNA network, pathway enrichment and TP53 mutations, promoter methylation, and immune cell infiltration. Our study is aimed at combining biology with statistics and machine learning to provide new insights into potential targets for hepatocellular carcinoma and to promote precision therapy for HCC.

2. Materials and Methods

2.1. Data Collection

The main procedure of our study is shown in Figure 1. The RNA-seq data associated with HCC were downloaded from TCGA. There were 424 lncRNA and mRNA samples, including 50 normal samples and 374 tumor samples. And 425 miRNA samples were also downloaded, including 50 normal samples and 375 tumor samples. The relevant clinical characterization data were downloaded through UCSC Xena (http://xena.ucsc.edu/). After screening, we retained 294 samples of the lncRNA and mRNA dataset (including 50 normal samples and 244 tumor samples). In machine learning, the training data is GSE76427 (including 52 normal samples and 115 tumor samples) from the GPL10558 platform, while the validation set GSE102079 (including 105 normal samples and 152 tumor samples) is from the GPL570 platform.

2.2. Data Preprocessing and Analysis of Differentially Expressed Genes

DEGs between tumors and adjacent tissues were identified using the R package “edgeR”. and were considered statistically significant. Cluster heat maps were generated using the pheatmap R package. Principal component analysis was performed by the factoextra package to visualize the data. Volcano map visualization of differential genes was performed using the ggplot package.

2.3. Construction of Coexpressed Gene Networks Based on WGCNA

WGCNA is an analysis method to identify gene coexpression networks based on topological overlap. The final input of 6,500 genes was made by descending order of the expression values of the dataset. WGCNA was then used to construct coexpression networks of differential genes. Coexpression gene networks were created with a soft threshold of minimum value of to cluster genes with high correlation into the same module. Correlations between modules and clinical data were calculated to screen for modules associated with survival time and other prognosis or diagnosis of HCC. Finally, correlated mRNA and lncRNA were obtained.

2.4. Construction of RNA Network in Hepatocellular Carcinoma

The interactions between lncRNA and miRNA and miRNA and mRNA in the WGCNA module were predicted using ENCORI (http://starbase.sysu.edu.cn/index.php;version3.0). The predicted miRNAs intersected with the differential miRNAs in TCGA. In addition, the predicted RBPs that interact with the lncRNAs and mRNAs in the module were taken to intersect and merge. Finally, data on the interactions between these RBPs and mRNAs in different organs and cancer types were retrieved. These networks were visualized by Cytoscape (3.7.2). Finally, we obtained lncRNA-miRNA-mRNA, lncRNA-RBP-mRNA, and RBP-mRNA-tissue-disease networks.

We used the database search tool for retrieval of interacting genes/proteins (STRING) (https://www.string-db.org/) to evaluate protein-protein interactions (PPI) information and Cytoscape (3.7.2) for visualization. The R package “clusterProfiler” was used to perform Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses [17]. GSEA (version 4.1.0) was used for gene set enrichment analysis (GSEA) [18], where we defined enrichment markers as , NOM value <0.05, and .

2.5. Construction and Validation

To minimize the risk of overfitting, Lasso-penalized Cox regression analysis was applied to construct the prognostic model. The Lasso algorithm was used for variable selection and shrinkage using the “glmnet” R package. Patients were divided into high-risk and low-risk groups based on the median risk score. The risk score was calculated as follows: . Differences in OS time between risk groups were analyzed using Kaplan-Meier (KM) survival analysis and log-rank test.

2.6. HCC Diagnostic Model Construction and Core Gene Screening

Training and test sets for model construction were divided from the dataset generated by combining the TCGA and GSE76427 datasets from the GPL10558 platform. To validate the stability of the model, we used the GSE102079 dataset from the GPL570 platform as the validation set. These data were normalized, and missing values were replaced using the average of the same genes. Random forest, gradient boosting, support vector machine, logistic regression, and integrated learning were used to train the models using the python package “scikit-learn” [19]. These models were validated by the 5-fold cross-validation method and the leave-one-out method.

2.7. Analysis of Key Genes and Immunological Characteristics

Analysis of TP53 mutation status (TP53 mutation status) and promoter methylation levels (promoter methylation) for normal data and hepatocellular carcinoma data was performed through the UALCAN website [20].

An immunization file on LIHC patients in TCGA was downloaded through the TIMER 2.0 website [21]. This file includes TIMER, CIBERSORT, quanTIseq, xCell, and MCP-counter [2226]. Infiltration estimation of TCGA patients was performed with these 5 tools. RNA data were processed using the R package DESeq2 package [27], and the association between core genes and immunity was analyzed using the R package psych. Finally, heat map presentation was performed by pheatmap of R package.

2.8. Core Genes Targeting Corresponding Drug Candidates

Interaction studies of obtained hub genes with relevant drugs were performed to analyze their targeting effects, where candidate antihepatocellular carcinoma drugs were obtained from the DGIdb (http://www.dgidb.org/) database, an online tool containing information on drug-gene interactions from more than 30 libraries. These molecules were mainly from the ZINC library (https://zinc.docking.org/), while others were drawn by marvinsketch (version 21.9) [28], and the lowest energy conformation was selected. The conversion of mol2 files was performed using the open source chemistry toolbox open Babel (version 2.3.2) [29]. The 3D structures of the proteins expressed by the target genes were obtained from the RCSB PDB library (https://http://www.rcsb.org/). PyMOL (version 2.3) [30] removes hydrogen bonds and other ligands from the target protein. Autodocktools (version 1.5.6) [31] adds hydrogen atoms, binds nonpolar hydrogen atoms, calculates the charge number of the protein, and detects the docking sites. Finally, the target protein in pdbqt format is docked to the drug candidate by AutoDock Vina 1.1.2 [32] with a threshold of affinity -7.0 kcal/mol. The final 2D structure is drawn using ligplot (version 4.0) [33].

3. Results

3.1. Analysis of Data in TCGA-LIHC Samples

After the differential analysis of TCGA-LIHC samples, 3529 differential lncRNAs, including 3008 upregulated genes and 521 downregulated genes, were screened; 2183 differential mRNAs, including 4074 upregulated genes and 1109 downregulated genes, were obtained; 330 differential genes of miRNAs included 287 upregulated genes and 43 downregulated genes (Figure 2).

Next, GO and KEGG enrichment analyses were performed on the up- and downregulated differential mRNAs. We found that the expression of upregulated genes was mainly focused on the pathways with cell morphology, channels, and receptors. The downregulated differential genes, on the other hand, were mainly distributed in pathways related to metabolism and degradation (Figure 3). These changes may be related to disruption of liver function.

3.2. Cox Regression Analysis of Clinical Indicators

Cox regression analyses were performed on clinical data. The clinical information included fetoprotein outcome value, total bilirubin upper limit, bilirubin lower limit, bilirubin upper limit, age at initial pathologic diagnosis, sample type, gender, histological type, neoplasm histologic grade, platelet result count, platelet result lower limit, platelet result upper limit, and weight. The ROC curve and KM survival analysis combined with landmark analysis showed that the upper limit of total bilirubin could effectively distinguish between high-risk and low-risk patients with (Table 1, Figure 4). Methemoglobin is often used as a diagnostic indicator for HCC. Therefore, we selected age at initial pathologic diagnosis, bilirubin lower limit, bilirubin upper limit, survival time, fetoprotein outcome value, and total bilirubin upper limit as the clinical indicators associated with WGCNA.

3.3. WGCNA Expression Module Analysis

Initial screening using WGCNA selected as the soft threshold of the network, and 21 gene coexpression modules were obtained by WGCNA (Table 2).

The correlation between modules and clinical characteristics was analyzed according to the clinical profile of UCSC Xena. Turquoise module was found to be highly correlated with survival time and survival status (). Also, the blue module was highly correlated with other clinical information (Figure 5(b)). The correlation between blue and turquoise modules was low (Figure 5(c)). And the topological overlap matrix (TOM) plot showed a strong coexpression relationship of genes in both modules (Figure 5(d)). Two coexpression modules were obtained, in which lncRNAs and mRNAs were HCC survival-related genes.

Genes in the turquoise and blue modules were enriched by GO and KEGG analyses. GO enrichment analysis of the turquoise module showed that it was mainly enriched to cell cycle processes related to proliferation and metastasis of HCC: BP enrichment resulted in nuclear division, organelle fission, and mitotic nuclear division; CC enrichment results in condensed chromosome, chromosomal region, chromosomal, and centromeric region; MF enrichment results in DNA-dependent ATPase activity (KEGG enrichment results in cell cycle (cell cycle), DNA replication (DNA replication), and neuroactive ligand-receptor interaction (neuroactive ligand-receptor interaction)). In contrast, the blue module reflects the characteristics of liver function itself; BP enrichment results in organic acid catabolic process, carboxylic acid catabolic process, and small molecule catabolic process; CC enrichment results in mitochondrial matrix, peroxisomal matrix, and microbody lumen; MF enriched heme binding, tetrapyrrole binding, and monooxygenase activity; KEGG enrichment results in leucine and isoleucine degradation, fatty acid degradation, and retinol metabolism (Figure 6). The values of the above enrichment analysis results were <0.05.

3.4. PPI Network Construction and Core Gene Extraction

Based on MM values, 2000 genes of turquoise were analyzed by STRING. The degree values were visualized by Network Analyzer of Cytoscape software, and the eight genes with the most significant degree values were screened: PLK1, CDK1, CDC20, CCNB2, CCNB1, CCNA2, BUB1B, and BUB1 (Figure 7). Similarly, PPI analysis of genes from the blue module was performed to screen the six genes with the most significant degree values: CAT, ACADM, EHHADH, AGXT, HMGCS2, and CYP3A4 (Figure 7). Interestingly, the PPI analysis of the blue module contained a network of 10 mitochondrial genes with higher degree values.

3.5. Six Networks Were Constructed Based on ENCORI

A total of 114 genes were used in the turquoise module to construct the networks, including 106 genes screened in the turquoise module using as the criterion and 8 genes obtained by PPI analysis. Three turquoise-related networks were established: a lncRNA-miRNA-mRNA coexpression network consisting of 1 lncRNA, 8 miRNAs, and 10 mRNAs; a lncRNA-RBP-mRNA network consisting of 5 lncRNAs, 121 RNA binding proteins (RBPs), and 18 mRNAs; a RBP-mRNA-Tissue-Disease network consists of 104 RBPs, 13 mRNAs, 14 tissue, and 29 diseases (Figure 8).

A total of 128 genes from the blue module were used to construct the network, including 122 genes (106 lncRNAs and 112 mRNAs) screened by and 6 genes obtained by PPI analysis. Three turquoise-related networks were established: the lncRNA-miRNA-mRNA coexpression network consisted of 3 lncRNAs, 22 miRNAs, and 29 mRNAs; the lncRNA-RBP-mRNA network consisted of 4 lncRNAs, 128 RNA binding proteins (RBPs), and 112 mRNAs; the RBP-mRNA-Tissue-Disease network consisted of 48 RBPs, 9 mRNAs, 14 tissue, and 22 diseases (Figure 9).

3.6. The ROC Curves Showed the Good Performance of the Prognostic Model

Univariate Cox regression analysis was performed on mRNAs and lncRNAs in all differentially expressed genes, turquoise module, and blue module to explore their relationship with prognosis of hepatocellular carcinoma patients. Among all mRNAs, WISP3 () and STK32B () were screened. The AUC values of 3- and 5-year disease-free survival were 0.775 and 0.769, respectively. Among all lncRNAs, AL359853.1 (), AC110285.3 (), and FGF14-AS2 () were screened. The AUC values of 3- and 5-year disease-free survival were 0.793 and 0.751, respectively. Among lncRNAs in the turquoise module, FGF14-AS2 () was screened. The AUC values of 3- and 5-year disease-free survival were 0.699 and 0.67, respectively. Among mRNAs in the turquoise module, SOX11 (), HOXC8 (), GAGE2A (), and ETV4 () were screened. The AUC values of 3- and 5-year disease-free survival were 0.85 and 0.801, respectively. Among mRNAs in the blue module, CISH () was screened. The AUC values of 3- and 5-year disease-free survival were 0.826 and 0.834, respectively. The same survival analysis was performed for the core genes PLK1, CDK1, CDC20, CCNB2, CCNB1, CCNA2, BUB1B, and BUB1 of PPI in the turquoise module (Figure 10, Figure 11).

KM survival curves were used to observe the relationship between genes and survival. If the survival curves intersected, landmark analysis was performed. Finally, 11 prognosis-related genes were screened for the first time (): WISP3, STK32B, AL359853.1, AC110285.3, FGF14-AS2, HOXC8, GAGE2A, CDK1, CDC20, CCNA2, and BUB1. Higher levels of these genes were associated with poorer prognosis and may be poor prognostic liver cancer factors (Figure 12).

While landmark analysis showed that the higher expression levels of ETV4, SOX11, CCNB2, PLK1, BUB1B, and CCNB1 are the worse prognosis, which may be a poor prognostic factor for hepatocellular carcinoma, the higher expression levels of CISH are the better prognosis, which may be a protective factor for HCC prognosis (Figure 13).

3.7. GSEA Analysis Revealed 11 Genes Closely Associated with Cell Cycle and Translation

The median expression values of the survival genes screened were divided into two expression level groups. GSEA was then performed to detect the set of genes enriched in the gene classes of both groups to identify their expression levels and pathway associations. In an analysis, mostly, genes were enriched in cell cycle and protein replication-related pathways (Figure 14), further suggesting that these genes proceed to be associated with survival.

3.8. Five Machine Learning Models Demonstrated the Importance of 21 Genes in HCC

Gradient boosting, random forest, support vector machine, logistic regression, and integrated learning of data featuring 21 genes, which include the 15 survival-related mRNAs mentioned above and 6 PPI. The mRNAs belong to blue module are as follows: CAT, ACADM, EHHADH, AGXT, HMGCS2, and CYP3A4. The four models for judging HCC were finally generated through training, testing, and validation (Figure 15). The gradient boosting machine’s feature contribution degree bar graph showed that CCNB1, GAGE2A, and CYP3A4 were the more important features for judging HCC (Figure 15(c)). And the bar graph of feature contribution degree of random forest showed that CCNB1, BUB1, and CYP3A4 were the more important features for judging HCC (Figure 15(f)). In all metrics, random forest performed the best, and logistic regression was poor, but all these models reflected good training (Table 3). Therefore, we included them in the integrated learning, which uses a voting mechanism; if the voting ratio is 2 : 2, we use the value of random forest. Finally, the accuracy of the test set was improved to 0.97, the accuracy of the validation set was improved to 0.92, and other metrics were also improved (Table 3).

3.9. Key Genes Associated with TP53 Mutations, Promoter Methylation, and Immune Cell Infiltration

TP53 is a frequently mutated gene in many cancers. According to the analysis of the UALCAN database, among these genes in hepatocellular carcinoma, all genes were significantly elevated in TP53-mutated tumors except CIHS, where gene expression was significantly decreased in TP53-mutated tumors (Figure 16).

The methylation level of promoter region is closely related to tumor development. Therefore, we analyzed the methylation levels of the promoter regions of these core genes in hepatocellular carcinoma tissues. We found that the promoter methylation levels of CDK1, BUB1, ETV4, PLK1, WISP3, CCNB2, CISH, CCNB1, CCNA2, and GAGE2A were significantly decreased in hepatocellular carcinoma compared with normal tissues, while the promoter methylation levels of STK32B, SOX11, HOXC8, and BUB1B were significantly increased compared with normal tissues (Figure 17).

We analyzed the relationship between these core genes and immune cell infiltration through the TIMER 2.0 website. The heat map showed that all these genes were associated with immune cell infiltration and were mainly distributed on CD4 T cells, macrophage, regulatory T cells, and monocyte. As a whole, these genes clustered into two different parts, and their relationship with immune cell infiltration was almost completely opposite. The clustering results also showed that these genes could be divided into two clusters. The first cluster contains ACADM, AGXT, CAT, CISH, CYP3A4, EHHADH, FGF14-AS2, HMGCS2, and WISP3 genes, which are negatively associated with CD4 T cells, macrophages, and regulatory T cells, and positively associated with monocytes. Interestingly, in addition to the seven genes in the blue module, WISP3 and FGF14-AS2 in the turquoise module were also clustered in the first cluster and were clustered together. The second cluster is almost the exact opposite of the first cluster in terms of immune cell infiltration relationship (Figure 18). The correlation between them was further confirmed by the relationship map between genes (Figure 19).

KEGG pathway enrichment showed that the first cluster was mainly metabolism-related pathways, and since the previous machine learning results showed that CYP3A4 was an important feature-contributing gene in the hepatocellular carcinoma diagnostic model, we expanded the adjust value to 0.56 to include linoleic acid metabolism (LAM) containing CYP3A4 as a pathway of interest, in addition to some pathways of metabolism, PPAR signaling pathway, and terpenoid backbone biosynthesis. In the second cluster, besides cell cycle-related pathways, progesterone-mediated oocyte maturation, p53 signaling pathway, FOXO signaling pathway, and immune abnormalities-related pathways were also enriched (Figure 20).

3.10. Screening and Molecular Docking of Four Gene Candidates

Proteins without structure in the RCSB PDB library were excluded. A total of 257 small molecules related to CCNA2, CCNB1, CDK1, and PLK1 with high feature contribution in machine learning were obtained by DGIdb online tool. The PDB structure 2IW8 for CCNA2, 4YC3 for CCNB1, 6GU6 for CDK1, and 3DB6 for PLK1 were obtained from the RCSB PDB library were obtained. Under the condition of docking affinity score greater than -7.0 kcal/mol, 8 small molecules were found to have high affinity with cyclin A2 expressed by CCNA2, 44 small molecules had high affinity with G2/mitotic-specific cyclin-B1 expressed by CCNB1, 184 small molecules had high affinity to CDK1-expressed cyclin-dependent kinase 1 (CDK1), and 191 small molecules had high affinity to PLK1-expressed polo-like kinase 1 (PLK1) (Table 4). Among them, ZINC40393428, ZINC3973984, and ZINC20149014 had high affinity in all four (Figure 21).

4. Discussion

HCC is a high mortality disease among cancers worldwide and has a poor prognosis. Not all patients are suitable for surgical treatment [34]. Alpha-fetoprotein (AFP) is a tumor marker secreted by different levels of hepatocellular carcinoma and therefore is often used as one of the few means to detect hepatocellular carcinoma [35]. However, some literature suggests that it is controversial [3638]. Competitive binding to miRNA in lncRNA-miRNA-mRNA plays an important role in cancer development and regulation [39, 40]. We obtained turquoise and blue modules by WGCNA. Pathway analysis showed that the turquoise module is associated with cell cycle and DNA replication, and the blue module is mostly distributed in metabolism-related pathways. Accordingly, the ceRNA networks of both were constructed. Many of these genes have been shown to be associated with liver cancer. For example, in the turquoise module, the SNHG1 gene has been shown to be involved in cancer regulation, including hepatocellular carcinoma, through multiple ceRNA pathways [41]. The lncRNA-RBP-mRNA network was also constructed based on the theory that lncRNA recruitment of RBP affects mRNA [42]. Interestingly, SNHG1 was still included. We constructed an RBP-mRNA-Tissue-Disease network and showed that these combinations are frequently found in breast and liver and are mostly malignant epithelial tumors. Indeed, breast cancer has a very subtle relationship with the liver [43]. The results of our study also confirm the existence of this close relationship.

By univariate—Lasso—multiple Cox regression analysis and PPI, we obtained 11 genes according to survival time: WISP3, STK32B, AL359853.1, AC110285.3, FGF14-AS2, HOXC8, GAGE2A, CDK1, CDC20, CCNA2, and BUB1. These genes may be associated with prognostic relevance. WISP3 is a member of the CCN family, a family of cysteine-rich glycosylated proteins that are expressed in development and disease onset [44]. The results of our survival analysis were similar to other results where survival curves showed it to be a poor prognostic factor for hepatocellular carcinoma [45]. However, some studies have reported that WISP3 has the potential to inhibit the development of hepatocellular carcinoma [4648]. Indeed, the role of WISP3 seems to be different in different cancers [49], suggesting its complex regulatory mechanisms. STK32B is mainly associated with idiopathic tremor and anxiety [50, 51], but Parris et al. found that it may be a marker for oral squamous cell carcinoma [52]. AL359853.1, AC110285.3, and GAGE2 have also been noted to be possibly associated with the prognosis of HCC [5355]. HOXC8 is a potential driver gene for many cancer cells and is associated with cell proliferation, adhesion, migration, and metabolism-related processes and can be considered as a global regulator of growth and differentiation [5659]. FGF14-AS2 has inhibitory effects on breast and colorectal cancers [60, 61] but has a promotive effect on gliomas [62], and its role in hepatocellular carcinoma has not been reported. CDK1 is required for mammalian cell proliferation. It is the only CDK that can initiate mitosis (i.e., M phase) [63], but tumor cells may also require specific interphase CDKs to proliferate. Therefore, selective CDK inhibition may provide therapeutic benefit in some human tumors [64]. CDC20 exerts its biological functions mainly by targeting its downstream substrates for ubiquitination and subsequent degradation [65] and plays a role in the cell cycle and apoptosis [66, 67]. In hepatocellular carcinoma, inhibition of CDC20 decreases cell proliferation in hepatocellular carcinoma cells [68]. CDC20 acts by various mechanisms, such as involvement in the p53-related pathway [69]. Ubiquitination of CCNA2 is associated with CDC20; the late promotion complex of ubiquitinated CCNA2 is activated by CDC20 [70], and its overexpression is frequently observed in hepatocellular carcinoma [71] and is a more recognized marker [72]. The spindle assembly checkpoint is an important monitoring mechanism to ensure high-fidelity mitotic chromosome segregation. This is achieved by monitoring whether sister chromatids lack tension or are attached to spindle microtubules. It is mediated by checkpoint complexes or individual proteins that inhibit late promoting complex/loop (APC/C) ubiquitin ligase activity by targeting CDC20 regulatory subunits. BUB1 kinase is a key spindle checkpoint regulatory protein [73]. BUB1 may promote proliferation of hepatocellular carcinoma cells by activating phosphorylation of SMAD2 [74]. Not surprisingly, among them, CDC20, CCNA2, and BUB1 are all associated with APC-related processes, which may be key nodes in the prognosis and occurrence of hepatocellular carcinoma. GSEA enrichment results suggest that these genes are associated with replication, translation, and chromosome formation.

To improve the accuracy and enrich the means of hepatocellular carcinoma diagnosis, we performed integrated learning to build a hepatocellular carcinoma discriminative model using the core genetic composition features of the survival time module and other modules with higher span and tried to improve the diagnosis rate by machine learning, and the results showed that our model had high accuracy and AUC value.

In general, predictions are mainly based on the compositional features of genes and metabolites [75, 76]. CCNB1 is the gene with a large feature contribution in the model. During cytokinesis, CCNB1 binds to CDK1 to transition the cell from G2 phase to mitosis. After mid mitosis, cell cycle proteins are separated from CDK, and in the presence of APC, M phase cyclin A and cyclin B are degraded by the proteasome through ubiquitination-dependent pathway [75], and CCNB1 is also known to promote cancer development [76]. PLK1 is an inhibitor of the regulatory late promoting complex/cyclosome (APC/C) and can synergistically promote cell cycle protein B/Cdk1-mediated APC/C activation [77].

These related genes obtained by analysis of TP53 mutations, promoter methylation, and immune cell infiltration influence the progression of hepatocellular carcinoma in these aspects. By looking at the clustering results from an immune perspective, it is easy to distinguish between the blue module and other genes outside of it. And the results of KEGG pathway analysis also suggest that these core genes are involved in some immune-related pathways.

Recent studies have shown that dysregulation of propionate metabolism produces a prometastatic profile in breast and lung cancer cells, promoting cancer progression [78]. In contrast, linoleic acid and butyric acid have therapeutic potential for cancer, and they are both implicated in intestinal flora metabolism [79, 80], and intestinal flora and related metabolite molecules, which translocate through the portal vein to the liver and affect liver function, portend a potential pathway for intestinal flora to treat liver cancer. Peroxisome proliferator-activated receptor (PPAR) belongs to a class of nuclear hormone receptors activated by fatty acids and their derivatives, which have been shown to have cell cycle and metabolic regulatory effects. Some evidence suggests that it has a promotive effect on hepatocellular carcinoma and can be used as a target for drugs [81, 82].

Branched-chain amino acid metabolism is the most significant pathway obtained in the analysis, and studies have shown that the supplementation of valine, leucine, and isoleucine in branched-chain amino acids has a preventive effect on hepatocellular carcinoma [83]. In cluster II, in addition to the pathways of senescence and cell proliferation that often accompany cancer, we identified two interesting pathways—the progesterone-mediated oocyte maturation pathway and the oocyte meiotic pathway.

Although the role of progesterone-mediated oocyte maturation pathway and oocyte meiosis pathway in hepatocellular carcinoma is not clear, it has been reported in the literature that glioblastoma, lung cancer, etc. appear to be genetically enriched in these two pathways [84, 85]. Hepatocellular carcinoma is a sex-specific cancer, and in general, men are two to four times more likely to develop HCC than women [86], which predicts that some hormonal changes may contribute to this difference, among which progesterone receptor expression can affect the proliferation of hepatocellular carcinoma [87]. Enriched pathway, in which the FOXO transcription factor family plays an important role in tumor proliferation and apoptosis [88], FOXO1 was shown to play a repressive role in hepatocellular carcinoma [89], and other FOXO transcription factors have been shown to be associated with hepatocellular carcinoma [90, 91]. These evidences suggest that the genes we screened affect the progression of hepatocellular carcinoma by influencing the metabolic, immune, and other pathways.

Finally, we screened for potential drug candidates that might have an effect on HCC. To our surprise, many of these drugs have been found to have therapeutic effects on hepatocellular carcinoma, although the association with the genes we identified is not yet clear. For example, ZINC9566782 (hygromycin) has the highest docking affinity for cyclin A2. It suppresses stemness and malignancy of HCC cells by destroying CD133 in the LCSC population [92]. ZINC40393428 (SNS-314) has also been shown to be efficacious in hepatocellular carcinoma [93]. Our study provides recommendations for the diagnosis and treatment of hepatocellular carcinoma. However, due to the limitations of TCGA and GEO libraries, our results may need to be demonstrated by follow-up experiments.

Only a few machine learning algorithms are used in this study. In addition to the methods used in this paper, some of the most representative computational intelligence algorithms can be used to solve the problem, such as monarch butterfly optimization (MBO), earthworm optimization algorithm (EWA), elephant herding optimization (EHO), moth search (MS) algorithm, slime mold algorithm (SMA), hunger games search (HGS), Runge Kutta optimizer (RUN), colony predation algorithm (CPA), and Harris hawks optimization (HHO). These algorithms have the potential to provide better choices for our models. Many learning techniques have been developed to improve the performance of metaheuristic algorithms, such as the dynamic learning evolution algorithm (DLEA) [94] and the learning-based intelligent optimization algorithm (LIOA) [95]. This may be another effective way to improve model performance, and we will consider them in our subsequent studies. In addition, some biological experiments are also worth drawing on to demonstrate the reliability of the model and the importance of biomarkers [96].

5. Conclusions

In conclusion, we screened 2 modules, 6 networks, and 24 genes for hepatocellular carcinoma. Five machine learning models were constructed and screened for drug candidates for the core genes. This suggests that hepatocarcinogenesis is a dynamic network with multiple mechanisms. Treating only one pathway or one type of gene is not appropriate, especially since the liver is involved in various metabolic pathways. Combining dynamic therapies may be the hope for a complete cure of liver cancer in the future.

Data Availability

The datasets provided in this study can be found in an online repository (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE76427https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE102079https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga).

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.


This work was supported by the Chongqing Postgraduate Research Innovation Program (Project No. CYS21324) and the Natural Science Foundation of Chongqing (Project No. cstc2021jcyj-msxmX0834).