Integrated Analysis of DNA Methylation and mRNA Expression Profiles Data to Identify Key Genes in Lung Adenocarcinoma
Introduction. Lung adenocarcinoma (LAC) is the most frequent type of lung cancer and has a high metastatic rate at an early stage. This study is aimed at identifying LAC-associated genes. Materials and Methods. GSE62950 downloaded from Gene Expression Omnibus included a DNA methylation dataset and an mRNA expression profiles dataset, both of which included 28 LAC tissue samples and 28 adjacent normal tissue samples. The differentially expressed genes (DEGs) were screened by Limma package in R, and their functions were predicted by enrichment analysis using TargetMine online tool. Then, protein-protein interaction (PPI) network was constructed using STRING and Cytoscape. Finally, LAC-associated methylation sites were identified by CpGassoc package in R and mapped to the DEGs to obtain LAC-associated DEGs. Results. Total 913 DEGs were identified in LAC tissues. In the PPI networks, MAD2L1, AURKB, CCNB2, CDC20, and WNT3A had higher degrees, and the first four genes might be involved in LAC through interaction. Total 8856 LAC-associated methylation sites were identified and mapped to the DEGs. And there were 29 LAC-associated methylation sites located in 27 DEGs (e.g., SH3GL2, BAI3, CDH13, JAM2, MT1A, LHX6, and IGFBP3). Conclusions. These key genes might play a role in pathogenesis of LAC.
As the most common type of lung cancer and a non-small cell lung carcinoma (NSCLC) , lung adenocarcinoma (LAC) is characterized by gland or duct formation and massive mucus production . LAC generally originates in peripheral lung tissue, and this is in contrast with squamous cell lung cancer and small cell lung cancer (SCLC), which are both apt to be located more centrally in lungs [3, 4]. In US, LAC accounts for approximately 40% of lung cancers . Smoking is the main cause of LAC, and the disease has a high metastasis rate even at an early stage . Therefore, it is necessary to identify key genes in LAC and develop effective therapeutic schedule.
The WNT/TCF signaling can promote osseous and brain metastasis of LAC cells through targeting HOXB9 and LEF1 which mediates chemotactic invasion and colony outgrowth . Coexpression of Oct4 and Nanog, which are homeobox transcription factors important for self-renewal of stem cells, commands epithelial-mesenchymal transdifferentiation, mediates tumor-initiating ability, and contributes to metastasis of LAC . As a noncoding RNA, MALAT-1 enhances motility of LAC cells via regulating motility-associated gene expression in transcriptional and posttranscriptional levels [8, 9]. BRAF mutation is frequently detected in human LAC, indicating that BRAF may serve as a therapeutic target for a subset of patients with the disease [10, 11]. Overexpression of caveolin-1 is essential for mediating filopodia formation, which may promote invasion of LAC cells . In spite of the above researches, the mechanisms of LAC still remain unclear.
Recently, along with the development of chip technology, microarray data have been obtained and uploaded to Gene Expression Omnibus (GEO) for us to study . Using microarray data GSE62950, we screened the differentially expressed genes (DEGs) between LAC tissue samples and adjacent normal tissue samples. And their potential functions were predicted by Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and Disease Ontology (DO) enrichment analysis. Then, the interrelationships between these DEGs were analyzed by protein-protein interaction (PPI) network and module analysis. At last, LAC-associated methylation sites were identified and mapped to the DEGs to obtain LAC-associated DEGs.
2. Materials and Methods
2.1. Microarray Data
Microarray data GSE62950 was downloaded from the database of GEO (http://www.ncbi.nlm.nih.gov/geo/), which included a DNA methylation dataset and an mRNA expression profiles dataset. The DNA methylation dataset and the mRNA expression profiles dataset separately were based on the platform of GPL8432 Illumina HumanRef-8 WG-DASL v3.0 and GPL8490 Illumina HumanMethylation27 BeadChip (HumanMethylation27_270596_v.1.2), and both of them included 28 LAC tissue samples and 28 adjacent normal tissue samples.
2.2. Data Preprocessing
Normalized series matrix file of mRNA expression profiles was downloaded directly. After beta matrix of DNA methylation data was downloaded, primary methylation signals were preprocessed by Methylation Module V1.9  in BeadStudio V126.96.36.199 to obtain normalized beta matrix. Those methylation sites which had no signal values in one or more samples were removed.
2.3. DEGs Screening
Using Limma package  in R, the DEGs between LAC tissue samples and adjacent normal tissue samples were identified. The values of DEGs were adjusted by Benjamini-Hochberg (BH) method . The adjusted value < 0.05 and were taken as the thresholds.
2.4. GO, KEGG, and DO Enrichment Analysis
GO is utilized for predicting potential functions of gene products in three categories (biological process, BP; cellular component, CC; and molecular function, MF) . The KEGG database can be used for systematic analysis of gene functions, which connects genomic information with corresponding functional information . The DO database contains a comprehensive knowledge of human diseases and is applied to annotate diseases . Using TargetMine online tool , GO, KEGG, and DO enrichment analyses were performed for the DEGs. The values of enriched terms were corrected by Holm-Bonferroni . The adjusted value < 0.05 was used as the cut-off criterion.
2.5. PPI Network and Module Construction
The STRING  database was utilized to search PPI pairs among the DEGs. And combined score > 0.4 was used as the cut-off criterion. Then, the PPI network of the DEGs was visualized by Cytoscape (http://www.cytoscape.org/) . Subsequently, igraph package  in R was used to calculate connectivity degrees of nodes (proteins) in the PPI network, and nodes with higher degrees were taken as hub nodes.
2.6. Screening of LAC-Associated Methylation Sites
Using genefilter package  in R, the methylation sites which had higher beta value variations within groups compared with variations among groups were deleted. And the value < 0.05 was used as the cut-off criterion. Then, the methylation sites located in sex chromosomes were removed. Finally, the associations between the screened methylation sites and LAC were analyzed by CpGassoc package  in R. The value < 0.05 was taken as the threshold.
2.7. Correlation Analysis of Methylation Sites and DEGs
According to the annotation information of the DNA methylation profiles, the nearest genes to the LAC-associated methylation sites were obtained and then mapped to the DEGs. At last, LAC-associated methylation sites of the DEGs were gained.
2.8. Validation of Methylation Sites and DEGs in LAC Tissue Samples
The RNASeqV2 data of LAC were downloaded from The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/) database, which included 515 LAC tissue samples and 59 adjacent normal tissue samples. Using Limma package  in R, the DEGs with the adjusted value < 0.05 and ≥ 1 were also screened. Meanwhile, the methylation data of LAC were also downloaded from TCGA database, which included 459 LAC tissue samples and 32 adjacent normal tissue samples. Moreover, LAC-associated methylation sites were also identified and mapped to the DEGs using the same methods as the above.
3.1. Data Preprocessing and DEGs Screening
After preprocessing, total 18626 beta values of DNA methylation data and 18626 gene expression values of mRNA expression profiles separately were obtained. Compared with adjacent normal tissue samples, there were a total of 913 DEGs (including 409 upregulated and 504 downregulated genes) in LAC tissue samples. In the heat map of the DEGs, LAC tissues could be definitely separated from adjacent normal tissues by the DEGs (Figure 1).
3.2. GO, KEGG, and DO Enrichment Analysis
Using TargetMine online tool, functions of the DEGs were predicted by GO, KEGG, and DO enrichment analyses. For the upregulated genes, the enriched GO functions included multicellular organismal catabolic process ( value = ) in BP category, as well as extracellular region ( value = ) and extracellular space ( value = ) in CC category (Table 1(a)). The enriched KEGG pathways for upregulated genes included protein digestion and absorption ( value = 0.001244) and cell cycle ( value = 0.026338, which involved cell division cycle 20, CDC20; cyclin B2, CCNB2; and mitotic arrest deficient 2-like 1, MAD2L1) (Table 1(b)). The enriched DO terms for upregulated genes included cell type cancer (adjust. value = ), germ cell cancer (adjust. value = 0.001326), embryoma (adjust. value = 0.008727), and embryonal cancer (adjust. value = 0.009481) (Table 1(c)). And all of the DO terms involved MAD2L1 and aurora B kinase (AURKB). The enriched GO functions for downregulated genes were listed in Table 1(d), including single-organism process ( value = ) and single-multicellular organism process ( value = , which involved wingless-related MMTV integration site 3A, WNT3A) in BP category, as well as cell periphery in CC category ( value = , which involved WNT3A). In addition, there were nonsignificant KEGG pathways and DO terms enriched for the downregulated genes.
3.3. PPI Network and Module Analysis
PPI networks for upregulated and downregulated genes were constructed, respectively. The PPI network for upregulated genes had 229 nodes and 725 interactions (Figure 2(a)). Particularly, MAD2L1 (degree = 36), AURKB (degree = 38), CCNB2 (degree = 40), and CDC20 (degree = 42) had higher degrees, and they can interact with each other in the PPI network. Several modules were screened from the PPI network for upregulated genes, and the largest module (module 1) is also showed in Figure 2(b). GO functional enrichment analysis showed that the genes in module 1 were involved in mitosis-associated terms.
The PPI network for downregulated genes had 233 nodes and 368 interactions (Figure 3). Connectivity degrees of the nodes in the PPI network were calculated, and cGMP-dependent protein kinase II (PRKG2, degree = 11), VE-cadherin (CDH5, degree = 12), WNT3A (degree = 15), adenylyl cyclase 8 (ADCY8, degree = 16), and adenylyl cyclase 4 (ADCY4, degree = 17) were the top 5 nodes which had higher degrees. What is more, nonsignificant modules were screened from the PPI network for downregulated genes.
3.4. Screening of LAC-Associated Methylation Sites
After screening, total 8856 methylation sites were obtained. Then, the associations between the screened methylation sites and LAC were analyzed under the threshold of value < 0.05. As a result, 230 LAC-associated methylation sites were identified.
3.5. Correlation Analysis of Methylation Sites and DEGs
There were 29 LAC-associated methylation sites located in 27 DEGs (e.g., Src homology 3 domain GRB2-like 2, SH3GL2; brain-specific angiogenesis inhibitor 3, BAI3; H-cadherin, CDH13; junctional adhesion molecule 2, JAM2; metallothionein 1A, MT1A; LIM-homeobox containing 6, LHX6; and insulin-like growth factor binding protein-3, IGFBP3) (Table 2). And all of the methylation sites were within 2 kb from transcriptional start sites (TSSs) of the DEGs. For the 29 methylation sites, their methylation indexes in LAC tissue samples were compared with that in adjacent normal tissue samples. The methylation indexes of 19 methylation sites were significantly increased, and the downstream genes mediated by those 19 methylation sites were downregulated. Nevertheless, 1 methylation site had significantly decreased methylation index, and its downstream genes were upregulated.
3.6. Validation of Methylation Sites and DEGs in LAC Tissue Samples
Total 4893 DEGs (including 2191 upregulated genes and 2702 downregulated genes) were identified in the LAC sample data downloaded from TCGA database. There were 691 common DEGs (including 310 upregulated genes and 381 downregulated genes) between the 4893 DEGs identified in the LAC sample data downloaded from TCGA database and the 913 DEGs identified in the microarray data of GSE62950. The common DEGs included genes such as WNT3A, MAD2L1, AURKB, CCNB2, CDC20, SH3GL2, BAI3, CDH13, JAM2, MT1A, and IGFBP3. In addition, the 29 LAC-associated methylation sites identified in the microarray data of GSE62950 were also detected in the methylation data downloaded from TCGA database.
In this study, 913 DEGs including 409 upregulated and 504 downregulated genes were identified in LAC tissue samples compared with adjacent normal tissue samples. Total 230 LAC-associated methylation sites were identified, among which 29 LAC-associated methylation sites were located in 27 DEGs. Afterwards, the RNASeqV2 data and methylation data of LAC were downloaded from TCGA database to validate the obtained methylation sites and DEGs. There were 691 common DEGs (such as WNT3A, MAD2L1, AURKB, CCNB2, CDC20, SH3GL2, BAI3, CDH13, JAM2, MT1A, and IGFBP3) between the 913 DEGs identified in the microarray data of GSE62950 and the 4893 DEGs identified in the LAC sample data downloaded from TCGA database. In addition, the 29 LAC-associated methylation sites identified in the microarray data of GSE62950 were also detected in the methylation data downloaded from TCGA database. Functional enrichment indicated that WNT3A was involved in single-multicellular organism process and cell periphery. Overexpression of Wnt5a, which belongs to Wnt family that encodes signaling glycoproteins, promotes invasion of NSCLC during tumor progression [29, 30]. Via activating JNK pathway, Wnt-7a and Fzd-9 signaling plays role in inducing the receptor tyrosine kinase inhibitor Sprouty-4 and cadherin proteins and is essential for maintaining epithelial differentiation and inhibiting transformed cell growth in some NSCLC patients . In the PPI network for downregulated genes, WNT3A had higher degrees. These suggested that WNT3A might play a role in LAC.
Besides, CDC20, CCNB2, and MAD2L1 were enriched in the pathway of cell cycle. Meanwhile, MAD2L1 and AURKB were involved in DO terms of type cancer, germ cell cancer, embryoma, and embryonal cancer. Results of immunohistochemistry suggest that CDC20 can be a negative marker in prognosis of patients with resected NSCLC, especially adenocarcinoma . Overexpressed CDK5RAP3 and CCNB2, as well as suppressed RAGE, may be promising biomarkers in lung adenocarcinoma . The 5-year overall survival rates of LAC patients with low CCNB2 mRNA levels were significantly higher than that with high levels, and overexpressed CCNB2 mRNA can independently predict a poor prognosis in patients with LAC [34, 35]. Immunohistochemical analysis indicates AURKB, which mediates chromosome segregation during mitosis, is frequently overexpressed in primary lung carcinomas [36, 37]. Immunohistochemistry shows that overexpression of cell division cycle associated 8 (CDCA8) and AURKB can result in bad outcome of lung cancer patients; thus, suppression of the CDCA8-AURKB pathway is a potential therapeutic strategy for lung cancer . Semiquantitative RT-PCR shows that mitotic arrest defective protein 2 (MAD2) is overexpressed in a high percentage of lung cancers, and multivariate analysis suggests that high-level MAD2 can be a prognostic marker independently . In the PPI network for upregulated genes, MAD2L1, AURKB, CCNB2, and CDC20 had higher degrees, and they can interact with each other. Therefore, MAD2L1, AURKB, CCNB2, and CDC20 might be implicated in LAC by interacting with each other.
Additionally, LAC-associated methylation sites were identified and mapped to the DEGs. And there were 29 LAC-associated methylation sites located in 27 DEGs (e.g., SH3GL2, BAI3, CDH13, JAM2, MT1A, LHX6, and IGFBP3). Loss of SH3GL2 is frequently detected in NSCLC and SH3GL2 can mediate cellular growth and invasion through interacting with EGFR . CDX2, VIL1, and BAI3 levels have significant differences in SCLC and large-cell neuroendocrine lung carcinoma (LCNEC); therefore, they can be diagnostic markers of these tumor types . Tumor suppressor gene CDH13, located on chromosome 16q24.2–3, is downregulated in lung cancer and its aberrant methylation may be a potential marker for cancer detection [42–44]. Via mediating β1 integrin subunit and ERK activation in human dermal lymphatic endothelial cells (HDLEC), junctional adhesion molecule-C (JAM-C) contributes to lymphangiogenesis and nodal metastasis, suggesting that JAM-C may be a target for treating lymphatic metastases in NSCLC . Overexpression of metallothionein (MT) can be used as an independent predictor of short-term survival in SCLC patients enduring chemotherapy [46, 47]. Previous study indicates that LHX6 is a candidate tumor suppressor gene that has epigenetic silencing in patients with lung cancer . In NSCLC, methylation status of IGFBP-3 before cisplatin therapy seems to be a biomarker of prognosis, helping to select appropriate therapeutic method for patients [49, 50]. These declared that SH3GL2, BAI3, CDH13, JAM2, MT1A, LHX6, and IGFBP3 might relate to LAC.
In conclusion, we carried out a comprehensive bioinformatics analysis to screen LAC-associated genes. We identified 913 DEGs and 8856 methylation sites in LAC tissue samples. Besides, LAC might correlate with several key genes (e.g., WNT3A, MAD2L1, AURKB, CCNB2, CDC20, SH3GL2, BAI3, CDH13, JAM2, MT1A, LHX6, and IGFBP3). However, these bioinformatic findings need to be validated by further researches.
Highlights. (1) We screened 913 DEGs and 8856 methylation sites in LAC tissue samples. (2) In the PPI networks, MAD2L1, AURKB, CCNB2, CDC20, and WNT3A had higher degrees. (3) There were 29 LAC-associated methylation sites located in 27 DEGs.
All authors declare that they have no conflict of interests to state.
W. D. Travis, E. Brambilla, H. K. Muller-Hermelink, and C. C. Harris, Pathology and Genetics of Tumours of the Lung, Pleura, Thymus and Heart, World Health Organization, 2004.
R. S. Mitchell, V. Kumar, A. K. Abbas, and N. Fausto, “Box on morphology of adenocarcinoma,” in Robbins Basic Pathology, chapter 13, Elsevier, 2006.View at: Google Scholar
International Agency for Research on Cancer, World Cancer Report 2014, WHO, Geneva, Switzerland, 2014.
S.-H. Chiou, M.-L. Wang, Y.-T. Chou et al., “Coexpression of Oct4 and Nanog enhances malignancy in lung adenocarcinoma by inducing cancer stem cell–like properties and epithelial–mesenchymal transdifferentiation,” Cancer Research, vol. 70, no. 24, pp. 10433–10444, 2010.View at: Publisher Site | Google Scholar
S. Ren, F. Wang, J. Shen et al., “Long non-coding RNA metastasis associated in lung adenocarcinoma transcript 1 derived miniRNA as a novel plasma-based biomarker for diagnosing prostate cancer,” European Journal of Cancer, vol. 49, no. 13, pp. 2949–2959, 2013.View at: Publisher Site | Google Scholar
K. Naoki, T.-H. Chen, W. G. Richards, D. J. Sugarbaker, and M. Meyerson, “Missense mutations of the BRAF gene in human lung adenocarcinoma,” Cancer Research, vol. 62, no. 23, pp. 7001–7003, 2002.View at: Google Scholar
C.-C. Ho, P.-H. Huang, H.-Y. Huang, Y.-H. Chen, P.-C. Yang, and S.-M. Hsu, “Up-regulated caveolin-1 accentuates the metastasis capability of lung adenocarcinoma by inducing filopodia formation,” The American Journal of Pathology, vol. 161, no. 5, pp. 1647–1656, 2002.View at: Publisher Site | Google Scholar
G. Smyth, “Limma: linear models for microarray data,” in Bioinformatics and Computational Biology Solutions using R and Bioconductor, R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, and W. Huber, Eds., pp. 397–420, Springer, New York, NY, USA, 2005.View at: Google Scholar
G. Van Belle, L. D. Fisher, P. J. Heagerty, and T. Lumley, Biostatistics: A Methodology for the Health Sciences, John Wiley & Sons, New York, NY, USA, 2004.
G. Csardi and T. Nepusz, “The igraph software package for complex network research,” International Journal of Complex Systems, vol. 1695, no. 5, pp. 1–9, 2006.View at: Google Scholar
R. Gentleman, V. Carey, W. Huber, and F. Hahne, “Genefilter: methods for filtering genes from microarray experiments,” R Package Version, vol. 1, 2011.View at: Google Scholar
C.-L. Huang, D. Liu, J. Nakano et al., “Wnt5a expression is associated with the tumor proliferation and the stromal vascular endothelial growth factor—an expression in non-small-cell lung cancer,” Journal of Clinical Oncology, vol. 23, no. 34, pp. 8765–8773, 2005.View at: Publisher Site | Google Scholar
R. A. Winn, L. Marek, S.-Y. Han et al., “Restoration of Wnt-7a expression reverses non-small cell lung cancer cellular transformation through Frizzled-9-mediated growth inhibition and promotion of cell differentiation,” The Journal of Biological Chemistry, vol. 280, no. 20, pp. 19625–19634, 2005.View at: Publisher Site | Google Scholar
D. Stav, I. Bar, and J. Sandbank, “Usefulness of CDK5RAP3, CCNB2, and RAGE genes for the diagnosis of lung adenocarcinoma,” The International Journal of Biological Markers, vol. 22, no. 2, pp. 108–113, 2007.View at: Google Scholar
S. L. Smith, N. L. Bowers, D. C. Betticher et al., “Overexpression of aurora B kinase (AURKB) in primary non-small cell lung carcinoma is frequent, generally driven from one allele, and correlates with the level of genetic instability,” British Journal of Cancer, vol. 93, no. 6, pp. 719–729, 2005.View at: Publisher Site | Google Scholar
K. O. Toyooka, S. Toyooka, A. K. Virmani et al., “Loss of expression and aberrant methylation of the CDH13 (H-cadherin) gene in breast and lung carcinomas,” Cancer Research, vol. 61, no. 11, pp. 4556–4560, 2001.View at: Google Scholar
M. G. Joseph, D. Banerjee, W. Kocha, R. Feld, L. W. Stitt, and M. G. Cherian, “Metallothionein expression in patients with small cell carcinoma of the lung: correlation with other molecular markers and clinical outcome,” Cancer, vol. 92, no. 4, pp. 836–842, 2001.View at: Google Scholar