Abstract

The incidence of liver cancer (hepatocellular carcinoma; HCC) is rising and with poor clinical outcome expected, a more accurate judgment of tumor tissues and adjacent nontumor tissues is necessary. The aim of this study was to construct a diagnostic model based on random forest (RF) and artificial neural network (ANN). It can be used to aid in the identification of diseased tissue such as cancerous tissue, for HCC clinical diagnosis and surgical guidance. GSE36376 and GSE121248 from Gene Expression Omnibus (GEO) were used as training sets in this investigation. R package “limma” and WGCNA were used to filter the training set for statistically significant differential genes. To better understand the biological function and characteristics, R software was used to perform GO and KEGG enrichment analyses. To pick out and further understand the key genes, we performed PPI analysis and random forest tree analysis. Next, we built the ANN to predict training sets and validation set (GSE84402), and ROC curve was plotted to calculate area under curve (AUC). Then immune cell infiltration indicated difference of immune cell subsets between control and case groups. Finally, the survival analysis of key genes was also carried out based on data in TCGA database. Based on the expression of these 9 genes, we built the artificial neural network (ANN) and the accuracy of the final models was assessed with an ROC curve. The areas under the ROC curve were 0.984 (95% CI 0.972–0.993) in training sets. Its predictive capability was further assessed using the validation set. And the areas under the ROC curve were 0.929 (95% CI 0.786–1.000). In summary, this method effectively classifies hepatocellular carcinoma tissues and the corresponding noncancerous tissues and provides reasonable new ideas for the early diagnosis of liver cancer in the future.

1. Introduction

Every year, more than 850 000 new cancer cases are diagnosed in the world’s livers, with hepatocellular carcinoma accounting for over 90% of them [1]. The burden of liver cancer is projected to be over 1 million cases by 2030 [2]. Chronic hepatitis B and C virus infection, dietary toxin exposure (such as aflatoxin and aristolochic acid), metabolic illnesses (such as fatty liver disease and diabetes), and alcohol addiction are all key risk factors for HCC [3]. The neoplastic genesis of HCC is a multistep histological process. Hepatocellular necrosis is followed by hepatocyte growth after a hepatic injury. Chronic liver disease develops as a result of continuous destructive-regenerative cycles, resulting in liver cirrhosis, which is characterized by fibrosis and aberrant nodule development. Then hyperplastic and dysplastic nodules appear, leading to the development of HCC. HCC is further divided into three types: well-differentiated, moderately differentiated, and poorly differentiated tumors, with the last being the most dangerous type of primary HCC [4].

Hepatocellular carcinoma monitoring, diagnosis, and therapy have all improved significantly over the last decade [3]. Despite the disease’s declining incidence rates, disease-specific death rates remain high [5], and early diagnosis is important to improving outcomes [6]. Biannual ultrasound (US) with or without alpha-fetoprotein (AFP) testing is recommended by international hepatic associations for screening for HCC in at-risk individuals [79]. The purpose of these guidelines is to enhance the possibility of detecting early-stage HCC that could be treated successfully [10]. However, alpha-fetoprotein (AFP), dynamic magnetic resonance imaging, and computed tomography have low sensitivity and specificity: the combination of US and AFP has a sensitivity of just 63 percent for early-stage HCC identification [11], and more than 5% of MRI-diagnosed HCC may be false positive or non-HCC lesions [12], calling into question their value as primary screening tools for hepatocellular carcinoma [9]. Furthermore, early diagnosis is more challenging due to confounding variables such as the presence of inflammation and cirrhosis [9].

Therefore, novel diagnostic models for HCC with higher diagnostic accuracy are still urgently required.

Recently, the availability of omics data for illnesses, such as tumors and HCC, is rapidly increasing [13, 14]. However, because of the high dimensionality of this data, detecting biologically relevant patterns might be difficult. This circumstance needs the creation of new analytical methods, such as using artificial intelligence (AI) and random forest to analyze data. Now the most popular example of artificial intelligence methods is the artificial neural network (ANN) [15], which enables computer systems to improve forecast accuracy by creating a probabilistic or statistical model based on current data [16]. For data-driven precision medicine, a hypothesis-free strategy to integrating huge data is essential. In the field of liver diseases, which is associated with multifactorial and complex characteristics, approaches based on ANN to combine multiple factors using available data appear to improve performance in diagnostic tasks and decision-making tasks based on treatment response or prognostic prediction [17].

ANNs, with all of their variants, are now the main tools in machine learning tasks, such as disease diagnosis and classification [18]. Allahverdi et al. [19] created a heart disease classification system that used an artificial neural network and attained an accuracy of 82.4%. Dongfang Jia et al. [20] proposed a methodology based on artificial neuronal networks that can accurately classify cancer tissues and normal tissues and provides reasonable new directions for the early diagnosis of cancer. In this context, we suggested a novel diagnostic model that overcomes some of the disadvantages of standard diagnostic approaches, fully utilizing the advances in omics technologies and achieving a more comprehensive optimization of diagnosis of HCC, allowing for a faster, more sensitive, and radiation-free HCC detection.

2. Materials and Methods

2.1. Materials
2.1.1. Data Collection

The study’s workflow is depicted in Figure 1. The keywords “hepatocellular carcinoma/liver cancer”, “normal”, “Expression profiling by array”, “Series”, and “Homo sapiens” were used to search the Gene Expression Omnibus (GEO) databases for liver cancer gene expression profiles. Datasets collected from human case-control studies, with hepatocellular carcinoma tissues as the case group and corresponding noncancerous tissues as the control group, were included in our training set. Therefore, three datasets (GSE36376, GSE121248, and GSE84402) met the screening criteria. GSE36376 contained 240 tumor and 193 adjacent nontumor liver samples, and GSE121248 contained 70 tumor and 37 adjacent nontumor liver samples. GSE36376 and GSE121248 were combined as the training set to screen for the key genes to build the ANN model. GSE84402 (contained 14 tumor and 14 corresponding noncancerous samples) served as an independent validation set to verify the accuracy of the ANN model.

Flow diagram of the study. Data collection, analysis, key gene selection and validation.

2.1.2. Differentially Expressed Genes (DEGs) Screening

The differentially expressed genes (DEGs) were screened by limma [21]. The R software package’s limma contains a solution for DEA of microarray data. The DEGs between tumor and neighboring nonneoplastic liver were screened using limma in the GSE36376 and GSE121248 datasets, respectively. Both |logFC| ≥ 2.0 and adjusted were used as the thresholds for DEGs. All DEGs were visualized by a volcano plot.

2.1.3. Weighted Gene Coexpression Networks Construction and Module Selection

The gene modules were screened by WGCNA. After obtaining the gene expression profile, the WGCNA software tool in R [22] was used to create a gene coexpression network using the gene expression data of DEGs.

First, the appropriate soft-thresholding power (β) was selected by using the ‘‘pickSoftThreshold’’ function with the default parameters (herein, β = 7). Subsequently, Pearson’s correlation matrix was calculated to evaluate the similarity among all the pairwise genes by using the ‘‘cor’’ function with the default parameters. Then, the adjacency was calculated based on β and Pearson’s correlation matrix by using the ‘‘TOMsimilarity’’ function with the default parameters, and the corresponding dissimilarity (dissTOM) was also calculated. Finally, average linkage hierarchical clustering was conducted according to the dissTOM value with a minimum size of 48 for each gene dendrogram.

Module eigengenes (MEs), considered the first principal component (PC) of gene expression patterns of a corresponding module, were obtained for each module. To further strengthen the reliability of the modules, a cut line was set at 0.25 so that modules bearing <0.25 would be merged [23]. The module with the highest MS was considered as the key module related with liver cancer.

2.1.4. Candidate Gene Selection from the Most Significant Module

The intensity of intramodular interconnectivity (also known as module membership (MM)) was computed using the absolute value of Pearson’s correlation coefficient between module eigengene and expression values to define candidate genes. Candidate genes with an MM of r ≥ 0.80 and a substantial gene connection with liver cancer (at ) were prioritized for further investigation. A volcano plot and heatmap were used to illustrate all of the candidate genes.

2.1.5. Functional Enrichment Analysis

The functional analysis was performed by Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) analyses. The KEGG and GO analyses were conducted by R package “clusterProfiler.” [24].

2.1.6. Protein-Protein Interaction (PPI) Network Analysis

The information of the interaction of proteins and neighborhood, gene fusions were provided using the Search Tool for the Retrieval of Interacting Genes (STRING) database (a publicly available database; https://string-db.org/) [25]. In the present study, the input gene sets were 66 candidate genes and the species was Homo sapiens. To further explore the potential relevance of the candidate genes, the minimum required interaction score was 0.4.

2.1.7. Random Forest for Key Genes Screening

The random forest package in R was used with 500 trees and default parameters to do the random forest analysis [26]. To minimize overfitting, we used a decision tree-based method that included internal cross-validation and took into account nonlinear data. We then obtained the best trees with the fewest cross-validated errors for crucial gene screening.

2.1.8. Gene Scores for Removing Batch Effects

For each key genes screening by random forest, we calculated the median value of that gene across all arrays. Among the upregulated genes, whose expression is greater than the median value, we marked 1; otherwise mark 0. Among the downregulated genes, whose expression is greater than the median value, we marked 0; otherwise mark 1. Then we get the scores for these key genes in each sample to remove batch effects of training and validation set.

2.1.9. Correlation Analysis and Expression Difference Analysis of Key Genes

Spearman correlation analysis was used to analyze the correlation among key genes. Then we used the Wilcox test to perform a differential analysis of the key genes between control and treat groups and ggplot2 (version 2.2.1) were used to construct boxplots of gene expression.

2.1.10. The Establishment of ANN Models and Test

In general, artificial neural networks (ANN) consist of three layers, namely, input, hidden, and output layers. We used key genes scores as input layers. And the hidden layer’s main job is to extract categorized data from existing data. The output layer displays the network’s ultimate output. The outputs of one layer’s nodes are a weighted linear combination that has been altered by a nonlinear function. This nonlinear function enables the neural network to understand complex relationships between independent variables, hence improving the effectiveness of data-driven machine learning techniques [2729]. In validation set, among the upregulated genes, whose expression is greater than the median value, we marked 1; otherwise mark 0. Among the downregulated genes, whose expression is greater than the median value, we marked 0; otherwise mark 1. Then we get the genes scores so that we can examine the accuracy of the ANN model in the validation set. At last, we plotted ROC curve to calculate area under curve (AUC) to show model accuracy. An area under the curve (AUC) value between 0.8 and 0.9 is considered an excellent classification, while greater than 0.9 is considered as outstanding discrimination [30].

2.1.11. Immune Infiltration by CIBERSORT Analysis

To forecast the infiltration of 22 different kinds of immune cells in each tissue sample, the CIBERSORT algorithm is often utilized [31]. Seven types of T cells [CD8 + T cells, naïve CD4 + T cells, resting memory CD4 + T cells, activated memory CD4 + T cells, follicular helper T cells, regulatory T cells (Tregs), and gamma delta T cells], three types of macrophages (M0, M1, and M2), naïve B cells, memory B cells, plasma cells, resting natural killer (NK) cells, activated NK cells, monocytes, resting dendritic cells, activated dendritic cells, resting mast cells, activated mast cells, eosinophils, and neutrophils are among the 22 immune cells identified. The CIBERSORT method was used to transform a normalized gene expression matrix into 22 different types of immune cell matrix. The immune cell matrix was filtered using P0.05 criteria, and the relative expression of 22 categories of immune cells was determined using R packages between tumor and neighboring nontumor samples. The difference between tumor and neighboring nontumor samples was also determined using principal component analysis (PCA).

2.1.12. Survival Analysis

All the expressions of key genes were calculated, and patients were separated by the median expression level of each gene (highly expressed group and lowly expressed group). The Kaplan–Meier (KM) survival analyses were used to compare the survival difference between lowly and highly expressed groups based on each key gene group, with log-rank test.

3. Results

3.1. Screening the Candidate Genes

Weighted gene coexpression network analysis can be used to screen out the gene modules related to cancer tissues. First, we checked outliers in the sample, which were found and deleted from all samples (Figure 2(a)). The proper power value was then determined. Scale independence reached 0.8 and mean connection was more than zero when the soft threshold power value was equal to 7 (Figure 2(b)). As a result, the soft threshold power value for further analysis was set at 7. And 16 coexpression modules were discovered, with the gray module representing a gene that was not allocated to any module (Figure 2(c)). The genes of each part had been matched with different colors. The eigengene adjacency heatmap was used to identify correlations between different modules (Figure 2(d)). Modules that were grouped together into a single branch may have functionalities that are comparable. As shown in Figure 2(d), the highest connection with liver cancer was seen in the salmon module, which contained 66 genes. The heatmap of these 66 candidate genes was shown in Figure 3(a), indicating that these 66 candidate genes are differentially expressed between tumor and neighboring nontumor samples. In the heat plot, each cell represents the degree of gene expression, red represents upregulation, and green represents downregulation. We take log |FC| as the horizontal axis and −log10 (adj. -value) as the vertical axis to make volcano plots (Figure 3(b)), where the red and green dots represent the upregulated and downregulated genes, respectively.

3.2. Functional Enrichment of Candidate Genes

To explore the functions of candidate genes, the GO and KEGG enrichment analyses were conducted by R packages. GO analysis was applied from 3 aspects: biological process (BP), cellular component (CC), and molecular function (MF). In the BP part, the upregulated robust DEGs were mainly enriched in terpenoid metabolic process, cellular response to cadmium ion, and isoprenoid metabolic process. For CC, the upregulated genes were particularly enriched in blood microparticle, pore complex, and high-density lipoprotein particle. The top three significantly enriched terms were oxidoreductase activity, steroid hydroxylase activity, and aromatase activity in the MF group (Figure 4 and Table 1). The result of KEGG pathway enrichment analysis is shown in Figure 5 and Table 2. Retinol metabolism, chemical carcinogenesis, and drug metabolism-cytochrome P450 were highly associated with tumor progression.

3.3. PPI Network Analysis

To pick out and further understand the key genes, PPI network analysis was performed using STRING. The PPI network of candidate genes is shown in Figure 6. A total of 65 nodes and 118 interaction pairs were included in the network.

Using the STRING online database, a total of 65 nodes and 118 interaction pairs were included in the network.

3.4. Key Genes Screened by Random Forest Tree to Build ANN Model

Figure 7(a) shows the relationship between error rate and the number of classification trees. Genes with mean decrease Gini greater than 10 are identified as the final signature shown in Figure 7(b) (9 key genes). The heatmap of these 9 key genes was shown in Figure 7(c). Then we used these key genes scores to build ANN models, as shown in Figure 7(d), which consists of 9 input nodes, 5 hidden nodes, and 2 output nodes. The output layer can judge the properties of the samples and divide them into control groups and case groups.

3.5. Correlation Analysis

The major key genes were utilized to explore the relationships among these genes by a correlation analysis. Most of the genes had previously shown a strong positive correlation, while CDC20 was negatively correlated with the rest (Figure 8).

Blue represents negative correlation and red represents positive correlation. The depth of color indicates the intensity of the correlation between covariates. The darker the color, the higher the correlation.

3.6. Expression Difference Analysis of Key Genes

Among them, CDC20 showed significantly high expression in tumor tissues, while the other eight genes were low expressed in tumor tissues (Figure 9).

3.7. Model Accuracy on Training and Validation Sets

The ROC curve was displayed to validate the predictive accuracy of the model. The area under the receiver-operating characteristic (ROC) curve (AUC) of the training and validation set was 0.984 and 0.929, respectively (Figure 10).

3.8. Difference of Immune Cell Subsets between Control and Case Groups

Figure 11(a) shows the infiltration of 22 types of immune cells in all samples in the training set using the CIBERSORT algorithm. Macrophages M0 account for a large proportion of case group immune cell infiltration. The changes in immune cells between control and case samples are further examined in Figure 11(b). T cells CD4 memory activated and T cells CD8 showed the strongest positive correlation (Pearson correlation = 0.65), while macrophages M0 and mast cells resting showed the strongest negative correlation (Pearson correlation = −0.63) in the target database at a CIBERSORT (Figure 11(c)).

3.9. Survival Analysis

We explored the prognostic roles of each key gene by KM curves with the log-rank test. APOF (log-rank ), CDC20 (log-rank ), CLEC1B (log-rank ), CLEC4G (log-rank p = 0.0095), CYP1A2 (log-rank ), FCN3 (log-rank ), IGFALS (log-rank ), LCAT (log-rank ), and MT1H (log-rank ) were identified as prognostic genes in the TCGA database (Figure 12).

4. Discussion

This work innovatively combined comprehensive biological information analysis and artificial neural network (ANN) to classify hepatocellular carcinoma tissues and the corresponding noncancerous tissues. In this study, WGCNA was performed to reveal key modules with clinical significance, and salmon module was screened out through preservation evaluation. The GO and KEGG analyses revealed that the genes in salmon module were significantly enriched in the biological processes of the cell cycle, cell division, and liver-related functions. All these biological functions are closely related to liver cancer.

The basic unit of the ANN is neuron. To get better performance, the weight and bias of each neuron were constantly updated during training. Classification results of ANN indicated that the average accuracy is 0.929 in validation set which showed that the model in this paper is highly accurate.

Through immune cell infiltration analysis, we compared the infiltration of immune cells in tumor and corresponding noncancerous samples. The results show that a high proportion of macrophages M0 infiltration existed in tumor samples. Finally, survival analysis of hub genes based on the TCGA database was also performed in our study. The results imply that these key genes are potentially associated with the prognosis of HCC as well.

This model can be applied to the early diagnosis of cancer. In this method, probes are firstly used to measure gene expression, and then deep learning methods are used to classify cancer samples. There is no instrument contact during the whole diagnosis process, so there is no risk of radiation compared with CT, ultrasonic imaging, X-ray examination, and PET. In addition, CT and MRI scans for individuals with HCC have a number of limitations. Contrast-related allergies, respiratory movements, renal impairment, and cumulative radiation doses are all problems to consider when getting a CT scan, especially in young patients. Ultrasonography (US) is frequently used for screening, sometimes almost entirely, and other times in combination with CT or MRI, depending on the particular patient’s risk factors, doctor preference, and advanced imaging approved by health insurance. If a lesion is discovered on screening US, a contrast-enhanced scan is necessary to better characterize the lesion [32]. Despite the use of dynamic contrast-enhanced MRI (DCE-MRI), the imaging diagnosis of HCC is difficult due to atypical imaging presentations and the variety of liver tumors [33]. Other diagnostic approaches, such as α-fetoprotein, are expensive and lack sensitivity in HCC detection [34]. This article’s categorization is computer-assisted, and while pathological diagnosis necessitates manual operation throughout the procedure, this technique is more suited for quick diagnosis.

In addition, tumor individualized medical care is becoming increasingly important, relying on accurate risk stratification systems. These systems aid in the selection of the most appropriate therapy and the evaluation of the treatment’s effectiveness. Gene sequencing is useful for predicting therapy effects and assisting doctors in selecting the best customized treatment approach for patients.

At the same time, preventing early recurrence is an important issue in the management of HCC. Early recurrence, defined as recurrence within 1-2 years after resection or ablation, is a significant predictor of poor prognosis in HCC patients [33]. By detecting recurrent lesions early, the first‐line treatment for recurrences can be initiated early. The diagnostic model based on ANN and random forest analysis can distinguish the tumor tissue from the normal adjacent tissue, so as to predict early recurrence after surgery or curative ablation in HCC patients.

A vast amount of single-cell sequencing data will need to be investigated in the future. Classification and clustering challenges are common research subjects. We can actually construct small-scale quick diagnosis equipment for cancer with the advancement of sequencing data and deep learning. It gives people the chance to avoid and treat cancer.

Data Availability

The expression profiling by array data used to support the findings of this study has been deposited in the GEO repository (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36376, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE121248, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE84402).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Ziyi Cao contributed to the supervision of this study.