Machine Learning and Network Methods for Biology and Medicine 2021View this Special Issue
Identification of Hub Genes Associated with Lung Adenocarcinoma Based on Bioinformatics Analysis
Lung adenocarcinoma (LUAD) is one of the malignant lung tumors. However, its pathology has not been fully understood. The purpose of this study is to identify the hub genes associated with LUAD by bioinformatics methods. Three gene expression datasets including GSE116959, GSE74706, and GSE85841 downloaded from the Gene Expression Omnibus (GEO) database were used in this study. The differentially expressed genes (DEGs) related to LUAD were screened by using the limma package. Gene Ontology (GO) and KEGG analysis of DEGs were carried out through the DAVID website. The protein-protein interaction (PPI) of differentially expressed genes was drawn by the STRING website, and the results were imported into Cytoscape for visualization. Then, the PPI network was analyzed by using MCODE, and the modules with a score greater than 5 were found by using cytoHubba. Finally, the GEPIA database and UALCAN database were used to verify and analyze the survival of hub genes. We identified 67 upregulated genes and 277 downregulated genes from three LUAD datasets. The results of GO analysis showed that the downregulated genes were significantly enriched in matrix adhesion and angiogenesis and upregulated differential genes were significantly enriched in cell adhesion and vascular development. KEGG pathway analysis showed that the differential genes of LUAD were significantly enriched in viral carcinogenesis and adhesion spots. The PPI network of differentially expressed genes consists of 269 nodes and 625 interactions. In addition, three modules with scores greater than 5 and seven hub genes, namely, MCM4, BIRC5, CDC20, CDC25C, FOXM1, GTSE1, and RFC4, playing an important role in the PPI network were screened out. In this study, we obtained the hub genes and pathways related to LUAD, revealing the molecular mechanism and pathogenesis of LUAD, which is helpful for the early detection of LUAD and provides a new idea for the treatment of LUAD.
Lung cancer is one of the malignant tumors for which no effective cure has been found so far. Adenocarcinoma is the most common histological subtype, accounting for about half of all lung cancers . Due to the fact that most lung cancers are diagnosed as advanced after regional or distant spread of the malignancy, this disease has a high mortality rate worldwide . Despite the availability of various therapies including surgery, radiation, and targeted therapies, unfortunately, there are significant barriers to effective treatment of LUAD and an overall survival of less than five years in patients with lung adenocarcinoma . If the disease can be detected at an early stage, the survival rate of patients will be greatly improved. Some studies have pointed out that many biomarkers are closely related to the occurrence and development of tumors and can be used for early screening of tumors. However, many biomarkers are highly expressed in various tumors without good specificity . Therefore, it is necessary to explore specific biomarkers for early diagnosis of LUAD. In recent years, gene expression profiling chips have been widely used in the research fields of various diseases to reveal the association between diseases and genes and provide valuable clues for the pathogenesis of diseases to determine the diagnosis and prognosis of cancer . The bioinformatics methods are widely used to study lung adenocarcinoma, and MCODE  is used to build clustering in the vast network of gene function modules. The connection of these modules in the PPI network is essential for the forecast of a molecular complex. In order to further find key genes of a complex network, this study proposed the research methods combining MCODE and cytoHubba. Based on the attributes of the nodes in the network, the genes associated with the disease will be found.
In this study, three lung adenocarcinoma datasets were downloaded from the GEO. Differentially expressed genes were obtained by comparing gene expression between lung adenocarcinoma samples and normal samples. Then, GO and KEGG signal pathway enrichment analyses were used to conduct functional annotation and signal pathway analysis for DEGs. Finally, the mechanism of the occurrence and development with LUAD was studied at the molecular level. The gene expression level was verified through GEPIA , and the prognosis was analyzed with UALCAN . It may provide valuable ideas for the diagnosis and prognosis of LUAD in the future.
2. Materials and Methods
2.1. Data Preprocessing and Correlation Analysis
In this paper, the limma package was used to obtain differentially expressed genes. The data was divided into the normal group and the diseased group, and difference analysis was conducted to extract different differential genes on the three datasets, respectively. The standard to screen differentially expressed genes was and in this paper. The results of difference analysis from the three datasets were intersected, and the genes shared by the three datasets were the genes used for the following analysis in this study. The Venn diagram online tool is used to make a Venn diagram of cross genes. DAVID (http://david.abcc.ncifcrf.gov/) is a database which is used for differences in gene functional annotation and pathway enrichment analysis online. GO analysis describes genes from three parts including molecular function (MF), biological process (BP), and cellular component (CC) and finds out the significantly enriched GO term. KEGG pathway enrichment analysis enriches differential genes into pathways, which is conducive to discovering the degree of association among genes, pathways, and diseases. In this study, we select in GO analysis and KEGG pathway enrichment analysis.
2.2. PPI Network Structure and Prognosis Analysis of Hub Genes
By constructing protein-protein interaction networks, the relationships between protein and protein can be visualized, and functional correlations among proteins can be deduced. It is done through the STRING database (http://string-db.org), selecting the interaction with a combined score greater than 0.5. Cytoscape software was used to visualize PPI networks, and the molecular complexity detection (MCODE) method was used to screen the modules of PPI networks with the degree cutoff value of 2, node score cutoff value of 0.2, -core value of 2, and maximum depth of 100, respectively. The networks with a score greater than 5 were screened out, which were analyzed using cytoHubba plug-ins to screen out hub genes with scores greater than 10. GEPIA (http://gepia.cancer-pku.cn/) is a free online visualization platform for analyzing differences in specific gene expression levels between cancer tissues and normal tissues. UALCAN (http://ualcan.path.uab.edu) is a very effective and convenient website which is used to analyze the cancer data. In this study, survival analysis was carried out on the hub genes using this website to verify the correlation between genes and the disease. In order to make the results of this study more credible, the evaluation of hub genes for cancer is utilized for the very famous online website Kaplan-Meier (https://kmplot.com/analysis/) which can be used to evaluate the effect of mRNA, miRNA, and protein on the survival rate of 21 types of cancer and the effects of gene expression information. By integrating clinical prognostic value of meta-analysis and survival research, it can discover and validate relevant molecular markers.
3. Results and Discussion
3.1. Gene Expression Profile Data
This study used data from a GEO database (https://www.ncbi.nlm.nih.gov/geo/) which includes various high-throughput gene expression data submitted by research institutions around the world and are free for the experts and scholars. GSE116959 was based on the GPL17077 platform (Agilent-039494 Sure Print G3 Human GE V2 8x60K Microarray 039381) and consisted of 68 samples, including 11 normal lung tissues and 57 LUAD samples. GSE74706 was based on the GPL13497 platform (Agilent-026652 Whole Human Genome Microarray 4x44K V2), with 36 samples including 18 tumor-free lung tissues and 18 tumor tissues. GSE85841 was based on the GPL20115 platform (Agilent-067406 Human CBC lncRNA + mRNA microarray V4.0) and contained 8 LUAD samples and 8 adjacent nontumor tissues. The data selected in this study all met three conditions: (1) the samples were all from human lung tissue, (2) the sample size is ≥16, and (3) it contained a control group and an experimental group. The platform and a series of matrix files were downloaded as TXT files, and the dataset information is shown in Table 1.
3.2. Identification of DEGs
Three datasets in the GEO website used the limma package  to do gap analysis; we set up less than 0.05 and greater than 1 as the standard. It obtained a total of 3051 upregulated genes and 3817 downregulated genes as shown in Figure 1. The number of the overlapping upregulated genes among three LUAD datasets was 67 as shown in Figure 2(a), and that of the overlapping downregulated genes among three datasets was 277 as shown in Figure 2(b).
3.3. GO Analysis of DEGs and KEGG Pathway Enrichment Analysis
In order to understand the biological function of differential genes in LUAD, the GO analysis and KEGG pathway analysis of differential genes were conducted on the website of DAVID, and the interpretation was made from the two focuses of gene function and pathway analysis. Results of GO analysis showed that the first 15 GO terms of upregulated genes and downregulated genes are listed in Tables 2(a) and 2(b), respectively. As for molecular function (MF), the downregulated DEGs play a major role in protein kinase inhibitor activity (GO: 0004860), integrin binding (GO: 0005178), and ion channel binding (GO: 0044325). The upregulation of DEGs played a role in protein isomerization activity (GO: 0046982), DNA binding (GO: 0003677), and serine endopeptidase activity (GO: 0004252). In terms of biological function (BP), downregulated DEGs were significantly enriched in cell matrix adhesion (GO: 0007160), positive regulation of RNA polymerase II promoter transcription (GO: 0045944), and angiogenesis (GO: 0001525). Upregulated DEGs were particularly abundant in pulmonary vascular development (GO: 0060426), cell adhesion (GO: 0007155), and extracellular matrix tissue (GO: 0030198). Cell component (CC) analysis showed that downregulated DEGs were concentrated in vesicles (GO: 0031982), lysosomes (GO: 0005764), and transcription factor complexes (GO: 0005667). The upregulated DEGs were mainly concentrated in the extracellular space (GO: 0005615), protein extracellular matrix (GO: 0005578), and extracellular matrix (GO: 0031012).
According to the KEGG pathway analysis, Table 3 shows that DEGs are mainly concentrated in the carcinogenic effect of virus (HSA05203), adhesion plaque (HSA04510), regulation of actin cytoskeleton (HSA04810), cAMP signaling pathway (HSA04024), and cGMP-PKG signaling pathway (HSA04022).
3.4. PPI Network Construction and Module Analysis
The STRING database was used to construct the PPI network, and then, Cytoscape software was used to visualize the DEGs of the three datasets. The MCODE plug-in in Cytoscape was used to further analyze the PPI network, and finally, three modules with scores greater than 5 were selected. The construction of the PPI network involves 269 nodes and 625 interactions as shown in Figure 3. The MCODE screened out three modules including a total of 38 nodes and 140 interactions as shown in Figure 4. Ten of the 344 differentially expressed genes were out of the PPI network. When cytoHubba was applied to these three modules, the genes with scores greater than 10 were selected, which indicated that these genes including MCM4, CDC20, CDC25C, BIRC5, FOXM1, GTSE1, and RFC4 might play an important role in the pathogenesis of LUAD.
3.5. Expression Level and Prognosis Analysis of Hub Genes
According to the results of the GEPIA database, seven hub genes were all upregulated in LUAD (Figure 5). Then, the UALCAN database was used for prognosis analysis of hub genes, and the results showed that the expression of MCM4, CDC20, CDC25C, and GTSE1 significantly reduced the survival time of LUAD patients (Figure 6). At last, the survival curve of Kaplan-Meier was also used to prove that the survival rate of LUAD patients with the low expression group was dramatically higher relative to that in the highly expressed group in the UALCAN database (Figure 7). It can be found to be consistent with the results obtained in Figure 6.
Although the world medical level has been greatly improved, there is still no effective way to completely cure lung adenocarcinoma. In addition, due to its rapid development and no obvious symptoms in the early stage of disease, it is of great significance to study the pathogenic genes related to lung adenocarcinoma. With the rapid development of bioinformatics, microarray technology has been increasingly applied to study the diagnosis, treatment, and prognosis of cancer .
In this study, three LUAD gene expression datasets were download from the GEO database. A total of 67 upregulated and 277 downregulated differential genes were screened and used for subsequent analysis. To further understand the biological function of these DEGs, we performed GO analysis and KEGG pathway analysis on these differential genes.
GO analysis shows that in BP, significant genes involved in cell adhesion and extracellular matrix organization and the development of the skeletal system are raised, and the difference of genes highly involved in nervous system development, angiogenesis, and cell matrix adhesion is lowered. Existing literature showed that the reduceness of cell adhesion is the key factor in the cancer metastasis , which are consistent with our GO results. Genetic variations and CC results show that the increase in the main role in the extracellular space, extracellular area, and extracellular matrix protein class, by the difference of the gene in the cytoskeleton, lysosome, and small bubble, plays a role, and the cytoplasmic matrix in the tumor actually adjusts all aspects of the tumor cells and cancer-related stromal cells . In terms of MF, upregulated differential genes are significantly enriched in DNA binding, protein heterodimer activity, and growth factor binding, while downregulated differential genes are significantly enriched in transcription factor binding, integrin binding, and ion channel binding, and insulin growth factor plays an important role in the occurrence and development of cancer [13, 14]. Heterodimerization is an important mechanism, and its activated erbB receptors are highly expressed in many cancers, especially breast cancer, ovarian cancer, and non-small-cell lung cancer , which further confirms the accuracy of the results of our GO analysis.
KEGG pathway analysis showed that the differentially expressed genes were significantly enriched in the viral oncogenic pathway and adhesion plaques. Oncogenic viruses select cellular processes to replicate and disrupt immune recognition, promoting oncogeny through shared host cell targets and pathways . Among the many microenvironmental factors that influence drug resistance in cancer cells, cell adhesion to the extracellular matrix (ECM) has been identified as a key determinant .
In order to better understand the relationships and interactions between these DEGs, we used Cytoscape software to visualize the PPI network of DEG-encoded proteins and highly screened out seven hub genes, namely, MCM4, CDC20, CDC25C, BIRC5, FOXM1, GTSE1, and RFC4. MCM4 mutations disrupt normal replication control and affect the mitotic phase, which in turn affects DNA replication and leads to DNA breakage. Genomic incompleteness greatly increases cancer susceptibility; for example, mice carrying the MCM4 Chaos3 subtype allele eventually develop breast cancer . BIRC5, also known as survivin, is a well-known cancer therapeutic target. By looking for survivin-partner protein interaction destruction inhibitors and survivin homodimer destruction inhibitors, it was revealed that some of the inhibitors can reduce survivin expression in cancer, including breast cancer, non-small-cell lung cancer, and pancreatic cancer .CDC20, the cell division cycle 20 homologue, is an important activator of anaphase cell division. Cdc20 works by activating the APC E3 ligase, which disrupts key cell cycle regulators during the mid- to late transition. Moreover, depletion of Cdc20 will also lead to mitosis stagnation, leading to cell death . CDC20 is upregulated not only in non-small-cell lung cancer but also in other malignancies and is also associated with patient prognosis . Studies have shown that MCM4, BIRC5, and CDC20 are potential lung cancer driver genes, and these genes are amplified in the genomes of patients with lung cancer, and when their number decreases, the number of cancer cells decreases as well . CDC25C is a typical bispecific phosphatase that regulates the dephosphorylation of CDK1 and the nuclear entry of the cyclin B1/CDK1 complex, which is a key mechanism regulating the initiation of cell division and plays an important role in the control of the cell cycle . The regulation of CDC25C in the cell cycle is affected by a variety of signaling pathways and is closely related to tumorigenesis and tumor development. FOXM1, an oncogenic transcription factor of the forkhead family, is expressed through tissue damage or oxidative stress and is induced by mitogenic stimulation in mature cells. Abnormal expression of FOXM1 has been shown to play a very important role in the progression of proliferation and cycle of various tumor cells . Studies have shown that FOXM1 exists in most human cancers by regulating the expression of target genes at the transcriptional level . GTSE1 is a cell cycle-related protein encoded by the human GTSE1 gene, which mainly regulates G1/S cell cycle transition. Previous studies have shown that high expression of GTSE1 is associated with resistance to a variety of cancers, and nonexpression of GTSE1 can significantly promote apoptosis of NSCLC cells after IR . RFC4 is one of the genes in the C family of replicators (RFC). Previous studies have shown that in a variety of malignant tumors, bioactive RFCs may play an important role in the proliferation, progression, invasion, and metastasis of cancer cells, confirming that RFCs can be used as effective markers for the diagnosis and prognosis prediction of cancer .
In this study, seven hub genes have been identified by bioinformatics. To verify whether these genes are related to LUAD, we used the GEPIA database and UALCAN database for validation and survival analysis, respectively. Results showed that all of these genes were upregulated compared with normal lung tissue and that MCM4, CDC20, CDC25C, and BIRC5 upregulated significantly reduced survival time in LUAD patients. Therefore, we believe that these hub genes play a key role in the occurrence and development of tumors, especially the four genes MCM4, CDC20, CDC25C, and BIRC5, which seem to be an important basis for future studies on the diagnosis and prognosis of LUAD.
Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Shuaiqun Wang contributed to data analysis, methodology, and article writing. Xiaoling Xu contributed to investigation, figure construction, and validation who contributed equally with Shuaiqun Wang. Wei Kong contributed to the supervision and writing of the article. All authors contributed to the article and approved the submitted version.
This research was supported by the National Natural Science Foundation of China (No. 61803257) and Natural Science Foundation of Shanghai (No. 18ZR1417200).
T. K. P. Blee, N. K. Gray, and M. Brook, “Modulation of the cytoplasmic functions of mammalian post-transcriptional regulatory proteins by methylation and acetylation: a key layer of regulation waiting to be uncovered?” Biochemical Society Transactions, vol. 43, no. 6, pp. 1285–1295, 2015.View at: Publisher Site | Google Scholar
E. Gumuslu, N. Cine, M. E. Gökbayrak, O. Mutlu, I. K. Celikyurt, and G. Ulak, “Exenatide alters gene expression of neural cell adhesion molecule (NCAM), intercellular cell adhesion molecule (ICAM), and vascular cell adhesion molecule (VCAM) in the hippocampus of type 2 diabetic model mice,” Medical Science Monitor, vol. 22, pp. 2664–2669, 2016.View at: Publisher Site | Google Scholar
Y. Li, S. Gan, L. Ren et al., “Multifaceted regulation and functions of replication factor C family in human cancers,” American Journal of Cancer Research, vol. 8, no. 8, pp. 1343–1355, 2018.View at: Google Scholar