Abstract

Integration of multi-omics data of cancer can help people to explore cancers comprehensively. However, with a large volume of different omics and functional data being generated, there is a major challenge to distinguish functional driver genes from a sea of inconsequential passenger genes that accrue stochastically but do not contribute to cancer development. In this paper, we present a gene length-based network method, named DriverFinder, to identify driver genes by integrating somatic mutations, copy number variations, gene-gene interaction network, tumor expression, and normal expression data. To illustrate the performance of DriverFinder, it is applied to four cancer types from The Cancer Genome Atlas including breast cancer, head and neck squamous cell carcinoma, thyroid carcinoma, and kidney renal clear cell carcinoma. Compared with some conventional methods, the results demonstrate that the proposed method is effective. Moreover, it can decrease the influence of gene length in identifying driver genes and identify some rare mutated driver genes.

1. Background

At present, understanding the mechanisms of cancer development and uncovering actionable target genes for cancer treatment are still difficult challenges. With rapid advances in high-throughput sequencing technologies, some large-scale cancer genomics projects, such as The Cancer Genome Atlas (TCGA) [1] and International Cancer Genome Consortium (ICGC) [2], have produced different omics data including a rich dataset of whole-exome and RNA sequence data [3, 4], which provides chances to allow us to accurately infer tumor-specific alterations [5] and help in precision medicine in cancers treatment [6, 7]. However, many of genetic changes represent neutral variations that do not contribute to cancer development which are called passenger mutations [6, 8]. Only a few alterations are causally implicated in the process of oncogenesis and provide a selection growth advantage which are referred to as driver mutations [8, 9]. Hence, it is a major challenge to distinguish pathogenic driver mutations from the so-called random mutated passenger mutations [10].

Previously, there were multiple computational methods to identify driver genes based on gene mutational frequency (termed as frequency-based method) in a large cohort of cancer patients [1113]. However, the infrequently mutated drivers are inclined to be ignored by frequency-based methods. Also mutational heterogeneity in cancer genomes is an important factor affecting the performance of frequency-based methods [14]. In addition, further studies have realized that driver mutations or genes disrupt some cellular signaling or regulatory pathways which promote the progression of cancer [15, 16]. In fact, genes affect various biological processes by related complex networks instead of acting in isolation in cancer [17]. In addition, the cancer is a result of interplay of various types of genetic changes which form complex and dynamic networks [18]. Thus, many network-based and pathway-based approaches have been proposed to prioritize driver mutations and genes. For instance, Dendrix was a pathway-based algorithm for discovery of mutated driver pathways in cancer using somatic mutation data [19]. After that, Multi-Dendrix algorithm was proposed to extend Dendrix method in order to guarantee yielding the optimal set of pathways [20]. MDPFinder was also a pathway-based method to solve the so-called maximum weight submatrix problem proposed in Dendrix method [19] which was aimed at identifying mutated driver pathways from mutation data in cancer [21]. And Zhang et al. proposed CoMDP method which focused on cooccurring driver pathways rather than single pathway [22]. In addition, iMCMC was a network-based method by integrating somatic mutation, CNVs, and gene expressions without any prior information [6]. Another method, DawnRank, was also a network-based algorithm to discover personalized causal driver mutations by ranking mutated genes according to their potential to be drivers based on PageRank algorithm [23]. Bashashati et al. developed a method called DriverNet which comprehensively analyzed genomes and transcriptomes datasets to identify likely driver genes in population-level by virtue of their effect on mRNA expression networks and also reveal the infrequent but important genes and patterns of pathway [10]. VarWalker was a personalized network-assisted approach to prioritize well-known, infrequently mutated genes and interpret mutation data in NGS studies [24].

Although some proposed methods can determine potential drivers, most of them do not consider the influence of gene length to the results; in other words, they may identify some likely false positive driver genes according to known driver genes datasets. And it has been indicated that driver genes are related to not only mutation frequency, but also mutation context or gene length [25] and variants tend to arise more frequently in long genes [26]. For example, TTN, the longest gene in human genome, accumulates many variants just due to its length [24, 26]. TTN may be selected in many computational methods; however, it usually serves as passenger gene [27]. This phenomenon indicates that many current methods have a strong preference towards identifying long genes [24]. So it is essential to filter those frequently mutated genes due to long length. VarWalker takes into account the gene length; however, it does not consider the influence of mutation to expression. In addition, some genomic variations in a gene may lead to extreme changes in some outlying genes expression level which are associated with the mutated gene through gene-gene interaction network or pathways and these outlying genes are often called outliers [10]. And, it has been proved that cancer-associated genes are more effectively detected by interindividual variation analysis rather than only calculating differences in the mean expression across different samples [28]. That is, the outliers are related to not only tumor expression distribution but also the corresponding normal expression distribution. Moreover, various cellular processes are often affected by genes in complex networks rather than genes acting in isolation [17] and cancer is also related to a set of genes interacting together in a molecular network [29]. So networks usually provide a convenient way to explore the context within which single gene operates [30]. It should be noted that prior knowledge such as protein-protein interaction (PPI) network can provide some useful information; however, prior knowledge is limited and may lead to discarding some important information in some instances [22]. In our previous work [31], we only consider the prior information of gene-gene interaction network. So it is essential to extend gene-gene interaction network.

In this study, we proposed an integrated framework named DriverFinder to identify driver genes by integrating somatic mutation data, copy number variations (CNVs), tumor and normal expression data, and gene-gene interaction network. Firstly, the gene length is taken into consideration to filter some frequently genes because of long length. Moreover, in this method, we integrated tumor expression and normal expression to construct outlier matrix rather than only using tumor expression. Furthermore, to increase accuracy of identifying drivers, we calculated Pearson correlation coefficients (PCC) of genes and combined them with PPI network to construct a new dynamic interaction network for each cancer type. In order to estimate the performance of DriverFinder method, we applied it to four different large-scale TCGA datasets, including breast cancer (BRCA), head and neck squamous cell carcinoma (HNSC), thyroid carcinoma (THCA), and kidney renal clear cell carcinoma (KIRC), and compared it with MUFFINN [32], DriverNet [10], and frequency-based method. The results demonstrated that DriverFinder can identify drivers effectively and decrease false positive, that is, filtering some long and frequently mutated but functionally neutral genes.

2. Materials and Methods

We proposed DriverFinder to identify cancer driver genes by integrating multiomics data (Figure 1). The detailed description of it is shown in Figure 1.

As is shown in Figure 1, the first step is to estimate the occurrence of mutation events in the genome by fitting them into a generalized additive model [24]. Then a weighted resample-based test is used to filter long passenger genes according to approximated probabilities based on benchmark of coding gene length. Secondly, for gene expression, we compare expression of tumors with normal samples to determine the outlying genes. Then a gene-gene interaction network combined with prior knowledge and PCC among expression data is used to relate mutations to their consequent effect on gene expression. The associations between mutated and outlying genes are formulated using a bipartite graph where the left nodes indicate the genes mutation status and the right nodes indicate the outlying expression status in each patient. For each patient, there is an edge between and if the left partition gene is mutated and the right gene is an outlier in some samples and they also have high correlations in gene-gene interaction network. Secondly, greedy algorithm is used to prioritize mutated genes based on the coverage. In each iteration of the greedy algorithm, the mutated gene on the left partition of the bipartite graph which relates to the most outlying expression genes is chosen. Until all the outlying expression genes are covered by the least mutated genes on the left, iteration is stopped. Then the mutated genes are ranked based on their coverage. So genes with the most outlying expression are appointed as candidate driver genes. Finally, the statistical significance test based on null distribution is applied to these putative driver genes.

2.1. Construction of Mutation Matrix

In terms of somatic mutation, we downloaded it from TCGA data portal (https://cancergenome.nih.gov/) and only considered the data of level 2. As a matter of fact, Jia et al. explored long genes in two datasets and examined gene length effects by plotting the proportion of mutated genes versus their complementary DNA (cDNA) length [24]. They discovered that two sets of mutated genes were positively correlated with cDNA length, and longer genes were more likely to be mutated genes [24]. Hence, frequency-based methods may be inclined to select long genes as drivers. So, it is necessary to perform gene length-based filtration. In this study, in order to accurately estimate the mutation rate for each gene, generalized additive model was adopted to compute a probability weight vector (PWV) for the mutation genes of each sample [24].

Here, only somatic mutated genes mapped onto the benchmark of consensus coding sequences (CCDS) dataset [33] which contains a core set of consistently annotated and high quality human and mouse protein coding regions are reserved. And those mapped genes in this study have been allocated cDNA length based on their coding sequences [24]. Assuming the vector as the cDNA gene length and the following model is used to assess the probability of a gene to be mutated, where represents an unspecified smooth function and represents the proportion of mutant genes in the specific samples [24]. Each gene then would be assigned a PWV value. Afterwards, a resampling test based on the probability of each gene is performed 1000 times in each sample and the null distribution is that in genes mutations occur at random. Then we define mutation frequency aswhere represents mutation frequency. Next we filter out genes with frequencies ≥ 5% in random datasets unless they are Cancer Gene Census (CGC) genes. Then a list of significant mutant genes is obtained.

As for CNVs, which have been processed by GISTIC 2.0, they are acquired from http://gdac.broadinstitute.org/runs/ (v2014_10_17). There are five types of copy number including amplification, gain, diploid, heterozygous deletion, and homozygous deletion in this dataset. Here, we only screen out amplifications and homozygous deletions to construct CNV matrix . Finally, the significant mutated gene list and CNV matrix are combined to generate the patient-mutation binary matrix , in which indicates there is genetic alteration, that is, mutation, amplification, or homozygous deletion, in the jth gene of the th sample. Otherwise, .

2.2. Construction of Expression Outlier Matrix

Gene expression data (level 3) including tumor and normal expression data is also downloaded from TCGA data portal. Moreover, some studies have shown that assessment of interindividual variation of gene expression performs well in predicting cancer-associated genes [28]. So in this study, the outlying matrix is determined based on analysis of interindividual variation in tumor and normal expression rather than differences in mean expression levels or only tumor expression distribution [28]. For each type of cancers, there are two expression datasets and which indicate the real-valued expression measure of gene in sample of tumor and normal datasets, respectively. For each gene, the outliers in this study are defined as tumors whose expression levels are outside the four-standard deviation range of the expression values of the gene across all the normal samples [28]. It is formularized asin which is the mean expression and indicates the standard deviation of gene expression in normal samples. Then the binary patient-outlier matrix is constructed and the value of indicated whether gene in patient is an outlier among the population-level distribution for that gene. If the expression of gene is an outlier in patient , ; otherwise, .

2.3. Gene-Gene Interaction Network

It is noteworthy that most prior knowledge such as PPI network or pathways is incomplete and a great deal of knowledge about biological pathways remains unclear [22]. In our previous study [31], we relied on prior knowledge about gene influence graph integrated from known gene networks [10] which often leaved out some likely important nodes. So in this study, we constructed a new dynamic gene-gene interaction network by incorporating gene-gene correlation coefficients with prior knowledge. That is, firstly, PCCs between pairwise genes are obtained by normalized tumor expression. Acceptable correlations with PCC > 0.75 are often considered high correlated and selected [34]. Here, we choose 0.8 as threshold to ensure selected pairwise genes being higher correlated and increase reliability. The edges with PCC > 0.8 are selected and set to 1, otherwise 0. In order to retrieve some important prior knowledge simultaneously, the known gene network (termed the influence graph in DriverNet [10]) is mapped onto the binary matrix obtained from correlation coefficients matrix. So a new and dynamic gene-gene interaction network (termed as after) included prior knowledge and deduced knowledge is established. When there is a correlation, that is, PCC > 0.8 or 1 in influence graph between gene and gene , ; otherwise, .

2.4. Significance Estimation

With the aim of testing the statistical significance of the driver candidates, we apply a randomization framework. The algorithm is run on the random permuted original datasets (mutation data and outlier data). Then we assess the significance by seeing if the results on real data are significantly different from the results on random datasets and obtain the value of each candidate drivers. The statistical significance of is defined as follows [10]:where is permutation times and is the number of candidate drivers in the th run of the approach. is the coverage of calculated from our method. Here we choose . The statistical significance of means that the times of the observed driver genes with coverage are more than . Finally, genes with value less than 0.05 are nominated as candidate drivers.

3. Results

3.1. Datasets

In this work, four TCGA datasets, BRCA, HNSC, KIRC, and THCA, were applied to our method. For each cancer type, four different omics data consisting of somatic mutation, tumor expression, normal expression, and CNV were used. The BRCA dataset includes copy number, 531 tumor samples’ and 62 normal samples’ expression data, and 962 samples’ somatic mutation data. KIRC dataset contains copy number variations, 417 samples’ somatic mutations data, and accompanying 534 tumor and 72 normal samples’ expression data. The HNSC dataset contains 509 patients’ somatic mutation data, 522 tumor and 44 normal samples’ expression data, and copy number data. For THCA, it includes copy number variations, 435 patients’ somatic mutation, and 513 tumor patients’ and 59 normal samples’ expression data. For each type of cancer, we only consider the samples which are common in tumor expression dataset and somatic mutation dataset.

3.2. Performance Evaluation

To evaluate the performance of our method on identifying known driver genes, we used annotated cancer-related genes datasets CGC database (15/7/2015) [35] and 20/20 rule [25] as approximate benchmarks. CGC is a database which catalogues 571 genes whose mutations have been causally involved in cancer [35]. 20/20 rule contains 138 driver genes in which 125 genes are affected by subtle mutations and 13 are affected by amplification or homozygous deletion [25]. We compared our method with frequency-based method, DriverNet [10], and MUFFINN [32] based on these two benchmarks.

In this work, we first counted the number of known drivers according to CGC genes of these four methods. For comparison, three measures based on top genes including precision, recall, and 1 score are used which are defined as follows:where TP indicates the number of overlapping genes found in our method and annotated genes associated with cancers in CGC. FP means the number of genes identified in our method, however not cataloged by CGC. FN is the number of genes in CGC but not contained in our method.

In general, DriverFinder almost outperforms other three methods in the top ranking genes of all the four cancer datasets (Figure 2; results of DriverFinder are shown in Supplementary File 1, in Supplementary Material available online at https://doi.org/10.1155/2017/4826206). Although, after approximately ranking top 30 genes in KIRC, MUFFINN performs comparably with DriverFinder, it has poorer performance in the top 30 genes. And the same phenomenon arises after top 60 genes in THCA. Analogous to it, the cumulative number of retrieved cancer genes annotated by 20/20 rule in KIRC by MUFFINN (Figure 3(c)) is also more than DriverFinder. A potential explanation of them may be that the total numbers of mutations in KIRC (10359 genes with 21089 mutations) and THCA (8899 genes with 16497 mutations) are remarkably less than in BRCA (16717 genes with 118098 mutations) and HNSC (14830 genes with 57164 mutations). So the gene-gene interaction network in KIRC and THCA may be simpler than in BRCA and HNSC; that is, genes may easily correlate directly with each other. And MUFFINN consider mutations only in direct neighbors [32]. On the one hand, this difference may help MUFFINN retrieve more genes; on the other hand, the number of mutations indicates that there may be more passengers (i.e., noise) in BRCA and HNSC and DriverFinder is more stable with noises.

Moreover, DriverFinder outperforms other three methods in BRCA, HNSC, and THCA by the cumulative numbers with 20/20 rule (Figure 3). In KIRC, it also has a better performance than DriverNet and frequency-based method within the top 100 genes.

3.3. DriverFinder Decreases the Effect of Gene Length

It is worth noting that, from the results of the DriverFinder, we can find that it can decrease the false positive because it has the good performance on filtering randomly mutated genes due to long length. The longest gene across human genome is TTN and it has been proven that higher mutation rate in it is likely to be artifacts [23, 36]. For example, in BRCA, TTN ranked 4 and 6 in frequency-based method and in MUFFINN, respectively, due to high mutation rate. Also it ranked 51 ( value = 0.031) as a candidate driver in DriverNet algorithm; however, it was filtered out with DriverFinder. In KIRC and HNSC, it ranked 4 and 3, respectively, based on mutation frequency and 22 and 18 in DriverNet, respectively. In addition, it is also at top 4 and 3 with MUFFINN in KIRC and HNSC, respectively, but it does not rank among the top 160 or 790 separately according to DriverFinder. Furthermore, for THCA, frequency-based method and MUFFINN ranked TTN as top 4 and 5, respectively. However, it is not identified by DriverFinder. These results proved the better performance of DriverFinder on filtering randomly mutated genes with long length than DriverNet, MUFFINN, and frequency-based method.

3.4. Pathway Enrichment Analysis

In order to investigate cancer-related pathways among the significant candidate drivers, Kyoto Encyclopedia of Genes and Genomes (KEGG) [37] pathway enrichment analysis was performed by statistics significantly candidate driver genes with values less than 0.05 (see Supplementary File 2). The top 20 significant pathways are shown in Figure 4. We observed that the most enriched terms are cancer-related pathways in four cancer datasets. Moreover, ErbB signaling pathway, which is significantly enriched in BRCA ( value = 4.79E − 07) and KIRC ( value = 6.23E − 07), has been reported to play important roles in many tumors and ErbB2/ErbB3 heterodimer functioned as an oncogenic unit in breast cancer [38]. Also, the significantly enriched VEGF signaling pathway (4.94E − 05) in HNSC plays a pivotal role in tumor angiogenesis [39].

3.5. Discovering Rare Driver Genes

In this subsection, we exhibited that DriverFinder can identify rarely mutated but significant candidate driver genes which are defined as genes whose mutation frequency < 2% across all patients cohort. Here, we only selected highly ranked (top 30) rare genes for further analysis.

In BRCA, 3 rare genes (PIK3R1, CREBBP, and PRKACB) are ranked in top 30. In them, for PIK3R1 (ranked 23) which is not ranked top 30 in the other three methods, underexpression might possibly lead to PI3K pathway activation and confer tumor development and progression in humans and it is a clinically useful independent prognostic marker in breast cancer [40]. Due to its low frequency mutations, any further statistical analyses concerning a possible association between PIK3R1 mutations and clinical parameters are not allowed [40] and it is easily ignored by frequency-based methods. Moreover, for CREBBP (ranked 25) which also was not ranked top 30 in the other three methods, it has been occasionally reported in breast cancer [41]. PRKACB downregulates in non-small cell lung cancer and the effect of its upregulation on cell proliferation, apoptosis, and invasion also has been investigated [42].

In HNSC, also 3 rare genes are selected and one of them (UGT2B4, ranked 30) has shown potential to be a novel driver. UGT2B4 genotypes associated with decreased enzyme activities are found to increase the risk of esophageal squamous cell carcinoma [43].

For KIRC and THCA, there are some important rare genes not identified by other methods. For example, CLTC in KIRC, which is contained in CGC and ranked 11 with low mutated frequency 1.68% in DriverFinder, encodes a major subunit of clathrin and is a fusion partner of TFE3. And CLTC-TFE3 is the fifth gene fusion involving TFE3 in pediatric renal cell carcinomas [44]. Another example is AKT1 (0.69% of cases) in THCA, which is identified by DriverFinder (ranked 30) and contained in CGC; it is a serine/threonine protein kinase and its downstream proteins have been reported to be frequently activated in human cancers [45].

4. Discussion

Cancer is a complex disease and difficult to treat, and distinguishing driver genes from a mass of neutral passenger genes is extremely important to understand the mechanism of the cancer and design targeted treatments. In this study, we introduced a comprehensive framework DriverFinder to identify driver genes by incorporating genomes, transcriptomes, and gene-gene interaction information. We implemented gene-based filtering model to exclude genes that were mutated largely due to random events. The method was applied to four independent cancer datasets from TCGA and the results demonstrated that the power of it across multiple tumor types was mainly better than DriverNet, MUFFINN, and frequency-based methods. In summary, this method has advantages in both filtering random mutated genes and identifying driver genes regardless of their mutation frequencies. We expect that it can also be applied to other complex cancer types.

However, in this work, we only explained the changes of the expression by somatic mutations, although other molecular or genetic changes such as transcription factors, methylations, and microRNAs also affect expression of other genes and play important roles in the development of cancer [46]. Therefore, it is necessary to extend the method so that driver genes can be determined by not only somatic alterations but also other different types of molecular changes. Also, we can extend our aim of identifying drivers by some methods such as machine learning methods [4751].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (no. 61672037), the Key Project of Anhui Provincial Education Department (no. KJ2017ZD01), and the Anhui Provincial Natural Science Foundation (nos. 1508085QF135 and 1608085MF136).

Supplementary Materials

Supplementary File 1: Results of four datasets BRCA, HNSC, KIRC, and THCA in DriverFinder.

Supplementary File 2: The results of KEGG pathway enrichment of four datasets BRCA, HNSC, KIRC, and THCA.

  1. Supplementary File 1
  2. Supplementary File 2