Abstract

Aberrations of a gene can alter both its own function and the functions of its neighbouring genes in the gene interaction network, driving the carcinogenesis of normal cells. Treating the gene interaction network as a complex network, previous studies have filled in the driver attributes of genes using the network properties of nodes and the network propagation of mutations. However, two obstacles still limit the discovery of cancer driver genes: the small size of cancer sample cohorts and the existence of drivers that show no distinctive network-neighbour properties. To address these obstacles, we propose an efficient modularity subspace based concept learning model. Through dimension reduction in the task of attribute concept learning, our model overcomes the curse of dimensionality caused by small samples, and through the modularity subspace it explores features of genes beyond their network neighbours. An evaluation on two gene interaction networks demonstrates the superiority of our model in the task of driver attribute filling. Overall, our model shows promise for interaction network analysis of tumorigenesis.

1. Introduction

A gene performs its function in human cells through the protein it encodes, and interactions between proteins lead to functional cooperation between different genes [1]. Consequently, an aberration of a gene can not only alter the function of the gene itself but also influence the functions of the genes that interact with it, and both effects can lead to the carcinogenesis of normal cells [2, 3]. Unlike the straightforward functional abnormality caused by the aberration itself, the functional abnormality that arises from interactions among several genes is implicit and harder to understand [2]. To investigate the roles of genes in their interactions, the protein-protein interaction network has been established to systematically describe all interactions between gene pairs that have been collected so far [1]. Investigation of the network topology shows that the gene interaction network is scale-free, which fits the definition of a complex network [4]. Consequently, network properties used in complex network analytics, such as centrality and betweenness [5, 6], offer an unprecedented opportunity to fill in the attributes of genes by investigating their roles in the interaction network.

To determine whether a gene has the attributes of a cancer driver, that is, the capability of turning a normal cell into a malignant cancer cell [3], a direct approach is to fill in the attributes of nodes through their network properties. For a given complex network, a number of numerical measurements have been established to describe the properties of its nodes, such as degree centrality, betweenness centrality, closeness centrality, eigenvector centrality, and Bonacich centrality [7]. With the advance of complex network based analytical methods [8, 9], the cancer driver attributes of genes can be discovered through the properties of their corresponding nodes in the network [10, 11]. Nevertheless, recent studies have demonstrated that a noticeable number of cancer driver genes do not show explicitly high centrality in the gene interaction network [12, 13]. Therefore, despite the success of previously published complex network based methods, approaches that rely directly on network properties are very likely to miss these cancer drivers in the task of node attribute filling.

To identify cancer drivers without high centrality, an intuitive idea is to supplement the gene interaction network with additional information [2]. Fortunately, high-throughput sequencing data of cancer samples provide another important source for discovering the cancer driver attributes of genes [14]. Sequencing data reveal the mutations of each gene in the tested cancer samples [15, 16] and thus contain mutation profiles across cancer samples, information that lies beyond the network itself [3, 17]. Accordingly, a number of studies have incorporated information from both the interaction network and mutation data via network propagation of mutation frequencies [2, 12, 13]. Nevertheless, recent studies have shown that some cancer driver genes are neither frequently mutated nor close to other highly mutated genes in the interaction network [18]. Furthermore, when the number of cancer samples is small, the mutation frequencies may not be representative enough to identify highly mutated cancer driver genes [19]. Since both the scale of the gene interaction network and the dimension of the mutation data are on the order of ten thousand, a sample number that is usually less than one hundred for practical reasons is also likely to result in the curse of dimensionality [18]. In short, there is still no driver attribute filling method for genes in the interaction network that can compensate with mutation data from small samples.

To discover the cancer driver attributes of genes in the interaction network from small samples of mutation data, we propose a novel modularity subspace based concept learning method that can efficiently identify driver genes in the interaction network. To circumvent the curse of dimensionality resulting from the small sample size, we introduce a dimension reduction paradigm. To detect driver genes that are neither frequently mutated nor connected to other highly mutated genes, we use the network modularity of genes as subspace features to reduce the dimension of the gene interaction network; here the network modularity represents the membership of genes in different network modules [20, 21] and is independent of mutation frequencies and direct connections of genes. After obtaining the feature subspace learnt jointly from the interaction network and the mutation data, we apply a supervised concept learning technique to learn the rules in the modularity feature space that are associated with the concept of cancer drivers [22, 23]. A systematic assessment on two interaction networks from distinct sources illustrates the superiority of our proposed model over existing network based methods for driver attribute filling. In summary, our proposed modularity subspace based concept learning model is efficient for driver attribute filling of genes in the interaction network with small samples of mutation data.

2. Materials and Methods

2.1. Modularity of Interaction Network

A gene interaction network contains modules, defined as groups of gene nodes within which the interactions are denser than the interactions between different modules [24, 25]. Because the membership of genes in network modules is independent of mutation frequencies and of direct connections between genes, we use the network modules as features for discriminating cancer driver genes. Furthermore, in most situations, the dimension of the features defined by network modules is far smaller than that defined by network nodes or neighbours [24, 26], which helps to alleviate the curse of dimensionality caused by the small sample problem. Technically, when the gene interaction network is described by its adjacency matrix, a module can be defined as a partition of the matrix in which the sum of the elements of the submatrix of the selected nodes is expected to be large [20]. The partition of a module can be depicted by a vector whose dimension equals the number of gene nodes (suppose the total number of genes is p), where coefficients larger than zero indicate that the corresponding genes belong to the module.

When the total number of network modules is a predefined number k, we can use k independent vectors to delineate the partitions of the p genes into the k modules, denoted in matrix form as . According to the derivation by Newman [20], the density of gene nodes within the k modules can be approximately calculated through the modularity score, which is defined via a quadratic form of the vectors , where matrix is the normalized adjacency matrix of the gene interaction network and vector is the degree vector of the network, whose i-th coefficient is the sum of the i-th row of matrix (or, equivalently, of the i-th column, since matrix is symmetric). For brevity, the modularity matrix is used to represent the result of the matrix subtraction .
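For readers who want the score in explicit form, a modularity score of the kind described above is commonly written as shown below. This is a sketch under stated assumptions: the symbols U (the p by k partition matrix), A (the adjacency matrix, assumed here to be normalized so that its entries sum to one), d (the degree vector), and B (the modularity matrix) are names chosen for illustration and may differ from the notation of the original equation.

```latex
Q(U) = \operatorname{Tr}\!\left( U^{\top} \left( A - d\,d^{\top} \right) U \right)
     = \operatorname{Tr}\!\left( U^{\top} B\, U \right),
\qquad B = A - d\,d^{\top},
\qquad d_i = \sum_{j=1}^{p} A_{ij}
```

Under this normalization, the product of d with its transpose gives the edge weight expected between each pair of genes in Newman's null model, so Q is large when the genes selected by the columns of U are connected more densely than expected by chance.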

For matrix , its j-th column represents the partition of gene nodes inside or outside the j-th network module, where the relative values of the coefficients represent the inclusion or exclusion of the corresponding genes. Since the trace of matrix is equivalent to the sum of the quadratic forms for the k module partitions, by finding the partitions that maximize the modularity score, we obtain the feature vectors that best reflect the memberships of genes in different modules [21]. Meanwhile, the i-th row of matrix can be regarded as the feature vector of the i-th gene, denoted as . When we regard vector as the feature vector of the i-th gene instead of its neighbourhood relations, the feature dimension k is much smaller than the original dimension p. Therefore, we refer to the feature space of as the modularity subspace for the i-th gene.
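One common way to approximately maximize a quadratic modularity score is a spectral relaxation that takes the leading eigenvectors of the modularity matrix. The sketch below illustrates this idea for a single module vector (k = 1) on a hypothetical six-gene toy network. The function names, the toy edges, and the use of power iteration are assumptions made for illustration; this is not the update rule of the proposed model.

```python
def modularity_matrix(n_nodes, edges):
    """Return B = A - d d^T, with A normalized so that its entries sum to one."""
    total = 2.0 * len(edges)  # each undirected edge fills two symmetric entries
    A = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1.0 / total
    d = [sum(row) for row in A]  # degree vector (row sums of A)
    return [[A[i][j] - d[i] * d[j] for j in range(n_nodes)] for i in range(n_nodes)]


def leading_eigenvector(B, n_iter=500):
    """Power iteration on B + shift*I, which targets the largest eigenvalue of B."""
    n = len(B)
    # The largest absolute row sum bounds every |eigenvalue| of B, so the
    # shifted matrix has no negative eigenvalues.
    shift = max(sum(abs(x) for x in row) for row in B)
    v = [float(i + 1) for i in range(n)]  # deterministic, non-symmetric start
    for _ in range(n_iter):
        w = [sum(B[i][j] * v[j] for j in range(n)) + shift * v[i] for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Toy network: two triangles (genes 0 to 2 and genes 3 to 5) joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
B = modularity_matrix(6, edges)
u = leading_eigenvector(B)
# The sign of each coefficient assigns the gene to one of two modules.
modules = [0 if x > 0 else 1 for x in u]
```

In this toy case the signs of the eigenvector coefficients separate the two triangles. For k modules, the first k eigenvectors would play the role of the columns of the partition matrix, and each row would give the k-dimensional feature vector of one gene.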

2.2. Dimension Reduction for Small Samples

Beyond the gene interaction network, the mutation data collected from sequencing of cancer samples are another crucial source for cancer driver attribute filling. Owing to the cost of sequencing, the number of sequenced cancer samples is still relatively small [27]. While the number of genes is over twenty thousand, the number of cancer samples is usually fifty to one hundred [27]. A direct consequence of this small sample problem is the curse of dimensionality in the task of driver attribute discovery. Hence, analogous to the dimension reduction of the interaction network via the modularity subspace, we introduce a matrix decomposition architecture to learn a low-dimension representation of the p genes, where matrix is the n by p matrix of mutation data (n is the number of samples, n<<p). Matrix is a p by k matrix whose rows are the representation vectors of genes, and matrix is a nonnegative n by k matrix.

A typical strategy for learning the low-dimension representation from the data matrix is to apply the alternating update rules of nonnegative matrix factorization (NMF) [28]. When the iteration converges, the product of the two learnt matrices and effectively approximates the original data matrix . Here matrix is the coefficient matrix of the linear transformation from the low-dimension representation to the original data . The rows of matrix are the low-dimension representations learnt from the mutation data, which preserve the cancer mutation information beyond the interaction network. With this matrix decomposition architecture, we can thus achieve dimension reduction of the mutation data from small samples.
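The alternating updates mentioned above can be illustrated with the standard Lee-Seung multiplicative rules for NMF. The sketch below is a minimal plain-Python version under stated assumptions: the names X (n by p mutation matrix), W (n by k coefficient matrix), and V (p by k gene representation matrix) are chosen here for illustration, both factors are kept nonnegative, and the toy mutation matrix is hypothetical. It covers only the mutation data term, not the joint objective of the proposed model.

```python
import random


def matmul(A, B):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(X, W, V):
    """Squared Frobenius norm of X - W V^T."""
    R = matmul(W, transpose(V))
    return sum((x - r) ** 2 for xr, rr in zip(X, R) for x, r in zip(xr, rr))


def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Multiplicative updates for X (n x p) ~ W (n x k) V^T (k x p)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    V = [[rng.random() + 0.1 for _ in range(k)] for _ in range(p)]
    errors = [frobenius_error(X, W, V)]
    for _ in range(n_iter):
        # Update W with V fixed.
        XV, WVtV = matmul(X, V), matmul(W, matmul(transpose(V), V))
        W = [[W[i][j] * XV[i][j] / (WVtV[i][j] + eps) for j in range(k)] for i in range(n)]
        # Update V with W fixed.
        XtW, VWtW = matmul(transpose(X), W), matmul(V, matmul(transpose(W), W))
        V = [[V[i][j] * XtW[i][j] / (VWtW[i][j] + eps) for j in range(k)] for i in range(p)]
        errors.append(frobenius_error(X, W, V))
    return W, V, errors


# Hypothetical binary mutation matrix: 4 samples (rows) by 6 genes (columns).
X = [
    [1.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0, 1.0, 0.0],
]
W, V, errors = nmf(X, k=2)
```

Each row of V is then a k-dimensional representation of one gene, which is the role the low-dimension features play in the text.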

2.3. Modularity Subspace-Based Dimension Reduction

To combine the information from the interaction network and the mutation data of cancer samples, we further propose a joint dimension reduction framework that efficiently fuses the feature vectors of the modularity subspace with the low-dimension features of the mutation data. When there are p genes in the interaction network and n samples of mutation data, the objective function of the joint dimension reduction framework is , where k is the predefined number of low dimensions, which is set to 64 empirically. Here the dimensions of matrices and are p by p and n by p, respectively. The p by k matrix represents the modularity features of genes, and the p by k matrix represents the low-dimension features from mutation data. To join the two types of features, we also introduce the residual matrix , which is defined by their difference . When the elements of matrix are close to zero, matrix is then approximately equal to . The parameters , , and are tuning parameters that control the weights of the different terms in the objective function.

In detail, the joint dimension reduction framework is composed of four terms, denoted as the modularity term, the sample data term, the feature regularization term, and the approximate residual term. First, the modularity term is introduced to reduce the dimension of the gene interaction network via modularity subspace feature learning. Second, the sample data term with tuning parameter is incorporated to bring in the mutation information from cancer samples, through which the model learns the low-dimension features of each investigated gene . Third, the feature regularization term with tuning parameter is used to avoid extreme values in the learnt feature vectors of genes during the joint dimension reduction procedure. Fourth, the approximate residual term with tuning parameter aims to bridge the modularity subspace features from the network and the low-dimension features from the mutation data , via a small-valued residual matrix . Here we set and to 1.0 and 100 empirically (see details in Supplementary Materials). Specifically, since matrix serves as the residual in the fusion of the two types of features and , we set the parameter to be 100 times that of to ensure the dominance of and a strong penalty on at scale.

Note that there are two inequality constraints and ; we further introduce two Lagrange multipliers, matrix and matrix , respectively. Subsequently, we can derive the Lagrange function for the optimization of the joint dimension reduction framework:

For the three variables , , and , we can obtain their partial derivatives, respectively:

We further employ the Karush-Kuhn-Tucker (KKT) conditions [28], where the three partial derivatives are all equal to zero: , , and ; and the complementary slackness conditions are also equal to zero: and . Through these conditions, we can derive the three following equations:

Through the above equations, we can reach the updating rules of the three variables:

To learn the three variables, we alternately apply the three updating rules until the iteration converges, which yields the final results of the three matrices , , and . For the i-th gene, the row vector in matrix is the feature vector in the modularity subspace compensated with mutation samples.

2.4. Modularity Subspace Concept Learning

To infer the cancer driver attributes of genes from the feature vectors in the modularity subspace compensated with mutation samples, we adopt the idea of concept learning instead of the network property based strategy. Through the learnt rules, concept learning can efficiently identify the associated attributes of a node if its features match the rules [22, 23], which is an advantage in recognition tasks over the network property based strategy. Using the feature vectors with information from both the modularity subspace and the mutation samples, and assuming independence of the k module dimensions, we adopt Bayesian concept learning [23, 29] to establish the probabilistic rules that are associated with the cancer driver attributes of genes, where the attribute index is denoted as c, with a value of 1 representing the driver attribute and a value of 0 denoting the nondriver attribute. The probability is the prior probability of the attributes, estimated from the distribution of the known driver attributes among genes. The conditional probability is estimated from the distribution between feature matrix and the known driver attributes among genes. Once these probabilities are estimated, for the i-th tested gene, we can substitute its feature vector into the probabilistic rules to infer the attribute of the gene. The overall pipeline of the proposed modularity subspace based concept learning model is illustrated schematically in Figure 1.
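The Bayesian rule described above, with a prior over the attribute c and conditional probabilities that factorize over the k feature dimensions, can be sketched as a naive Bayes classifier. The example below assumes Gaussian conditional densities for each dimension, which is an assumption made here for illustration since this section does not state the density family. The function names and the toy feature vectors are hypothetical.

```python
import math


def fit_naive_bayes(features, labels, var_floor=1e-6):
    """Estimate the prior and per-dimension Gaussian parameters for c = 0 and c = 1."""
    model = {}
    n_total = len(labels)
    for c in (0, 1):
        rows = [f for f, y in zip(features, labels) if y == c]
        k = len(rows[0])
        means = [sum(r[j] for r in rows) / len(rows) for j in range(k)]
        variances = [
            max(sum((r[j] - means[j]) ** 2 for r in rows) / len(rows), var_floor)
            for j in range(k)
        ]
        model[c] = (len(rows) / n_total, means, variances)
    return model


def driver_posterior(model, x):
    """Posterior probability of c = 1 given feature vector x."""
    log_joint = {}
    for c, (prior, means, variances) in model.items():
        total = math.log(prior)
        # Independence across dimensions turns the likelihood into a sum of logs.
        for xj, m, v in zip(x, means, variances):
            total += -0.5 * math.log(2.0 * math.pi * v) - (xj - m) ** 2 / (2.0 * v)
        log_joint[c] = total
    top = max(log_joint.values())  # subtract the maximum for numerical stability
    weights = {c: math.exp(v - top) for c, v in log_joint.items()}
    return weights[1] / (weights[0] + weights[1])


# Hypothetical two-dimensional features: 3 known drivers and 4 known nondrivers.
features = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.15],
            [0.1, 0.8], [0.2, 0.9], [0.15, 1.0], [0.05, 0.85]]
labels = [1, 1, 1, 0, 0, 0, 0]
model = fit_naive_bayes(features, labels)
p_driver_like = driver_posterior(model, [0.85, 0.1])
p_nondriver_like = driver_posterior(model, [0.1, 0.9])
```

Ranking all tested genes by this posterior gives the kind of probability-ordered gene list that the evaluation in the Results section relies on.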

3. Results

To evaluate the performance of our proposed driver attribute filling model, we apply it to two gene interaction networks from independent sources. The first is the STRING network [30], which integrates critically assessed interactions, including both direct physical interactions and indirect functional associations between the proteins of the related genes. The second is the iRefIndex network [31], which curates protein interaction data from a variety of sources and carefully consolidates redundant interactions. In addition to the network data, we incorporate mutation data of sequenced samples from two distinct types of cancer, prostate cancer [32] and thyroid cancer [33]. The mutation data of both cancer types are obtained from the cBioPortal database [34], a web resource for cancer genomics data. The cancer driver annotations of genes are collected from the COSMIC Cancer Gene Census database [35], which provides well-curated cancer driver genes that are widely accepted in the field of tumorigenesis.

3.1. Result Evaluation for STRING Network

When applied to the 12233 genes in the STRING network, our model first yields the joint features of the modularity subspace and the reduced dimensions of the mutation data by combining the network data and the cancer mutation data. The learnt joint features are then used to train the probabilistic concept learning model for driver attributes. Here, we use tenfold cross validation, which splits the driver annotations of genes approximately evenly into ten groups without any overlap [36]. The average performance of the driver attribute filling results across the folds is then used in the evaluation. As compared methods, we apply two previously published methods, ReMIC [12] and MUFFINN [13], both of which are based on network propagation of mutation frequencies and take the interaction network and cancer mutation data as input. In detail, we compare our model with three versions of ReMIC, whose diffusion parameters are 0.01, 0.02, and 0.03 as suggested in its applications [12]. We also include the two recommended versions of MUFFINN, DNmax and DNsum [13], which calculate the effect of a gene's direct neighbours by maximum and summation, respectively. Finally, we compare the results of our model with those of the two compared methods.
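The tenfold splitting step can be sketched as follows. This is an illustration only: the function names, the random seed, and the 103 placeholder gene identifiers are assumptions, and the section does not specify how the actual split was drawn.

```python
import random


def tenfold_split(items, n_folds=10, seed=0):
    """Shuffle items and deal them into n_folds disjoint, near-equal groups."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def cross_validation_rounds(folds):
    """Yield (training items, held-out items) with each fold held out once."""
    for held_out_index, held_out in enumerate(folds):
        training = [g for i, fold in enumerate(folds) if i != held_out_index for g in fold]
        yield training, held_out


genes = [f"gene_{i}" for i in range(103)]  # hypothetical annotated genes
folds = tenfold_split(genes)
rounds = list(cross_validation_rounds(folds))
```

In each round the model would be trained on the annotations in the nine training folds and scored on the held-out fold, and the per-fold scores would then be averaged.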

Here we use the Receiver Operating Characteristic (ROC) curve [37] to evaluate the performance of our model and the compared methods. In a ROC curve, the y-axis represents the sensitivity under different thresholds, where sensitivity is the fraction of identified drivers among all known drivers, while the x-axis denotes 1 minus the specificity under different thresholds, where specificity is the fraction of identified nondrivers among all known nondrivers. For the STRING network and prostate cancer data, the ROC curve of our model lies to the top left of those of the other methods (see Figure 2(a)), which indicates that our model achieves the best performance in the driver attribute filling task. For example, when the specificities of the compared methods are fixed to 0.10, the sensitivities of MUFFINN-DNmax and MUFFINN-DNsum are 21.67% and 23.33%, respectively, and the sensitivities of the three versions of ReMIC range from 31.67% to 36.67%. In contrast, the sensitivity yielded by our proposed model is 76.67%, higher than those of all competing methods. Overall, the ROC curves show a distinct advantage of our method over the others in driver attribute filling of genes for prostate cancer.
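The definitions of sensitivity and specificity above translate directly into a small routine that traces the ROC curve and reads off one operating point. The sketch below is an illustration with hypothetical scores and labels. The rule used to pick the operating point (the best sensitivity among thresholds whose specificity is at least a target) is an assumption, since this section does not state how the fixed operating points were selected.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per score threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is called a driver
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def sensitivity_at_specificity(points, min_specificity):
    """Best sensitivity among thresholds whose specificity is at least the target."""
    return max(tpr for fpr, tpr in points if 1.0 - fpr >= min_specificity)


# Hypothetical driver probabilities and known labels (1 = driver, 0 = nondriver).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
best = sensitivity_at_specificity(points, min_specificity=0.6)
```

Plotting the returned pairs with the first coordinate on the x-axis and the second on the y-axis gives the curve described in the text; a curve closer to the top left corner corresponds to better separation of drivers from nondrivers.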

For the STRING network with thyroid cancer data, the ROC curves likewise show that our proposed model yields better results than the compared methods (see Figure 2(b)). For example, with the specificities fixed to 0.10, the sensitivities of MUFFINN-DNmax and MUFFINN-DNsum are 26.67% and 28.33%, respectively, and those of the three versions of ReMIC range from 46.67% to 55.00%. In comparison, the sensitivity achieved by our proposed model is 56.67%, demonstrating that our model also outperforms the others on thyroid cancer samples. Moreover, we examine the specificities of these methods when their sensitivities are fixed to 0.10. In this case, the specificities yielded by ReMIC (Beta = 0.01), ReMIC (Beta = 0.02), and ReMIC (Beta = 0.03) are 53.12%, 50.75%, and 50.09%, respectively. MUFFINN-DNmax achieves a specificity of 48.57%, and MUFFINN-DNsum obtains 29.57%. Our proposed modularity subspace based concept learning model attains a specificity of 70.38%, which is also greater than those of the compared methods. Thus, our model shows a distinct advantage in driver attribute filling over the existing methods.

Since the Area Under the Curve (AUC) of the ROC curve is a comprehensive performance measurement, we also compare the AUCs of our proposed modularity subspace based concept learning model and the other methods for both prostate cancer (see Figure 2(c)) and thyroid cancer (see Figure 2(d)). For prostate cancer, the AUC of MUFFINN-DNmax is 71.07% and that of MUFFINN-DNsum is 61.96%, and the AUCs of ReMIC with Beta = 0.01, 0.02, and 0.03 are 69.30%, 71.42%, and 72.31%, respectively. The AUC of our proposed model is 87.43%, which is at least 20.9% higher in relative terms than those of the other methods. For thyroid cancer, the AUCs of MUFFINN-DNmax, MUFFINN-DNsum, ReMIC (Beta = 0.01), ReMIC (Beta = 0.02), and ReMIC (Beta = 0.03) are 75.03%, 64.84%, 80.10%, 80.39%, and 81.63%, respectively. In comparison, the AUC of our proposed model reaches 86.10%, which is higher than the AUCs of all the other methods. In addition, we conduct an ablation study on different modules in our model for the two cancer types (see Figure S1 in the Supplementary Materials). Consequently, on the STRING network with mutation data of two distinct types of cancer, we observe the superiority of our model over the existing network based methods.
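As a worked check of the figures in this paragraph, the sketch below computes an AUC by pair counting (the fraction of driver and nondriver pairs that are ranked correctly, which equals the area under the ROC curve) and then reproduces the relative improvement from the AUC values reported above for the STRING network and prostate cancer. The function names and the toy scores are assumptions; only the percentages are taken from the text.

```python
def auc_by_ranking(scores, labels):
    """AUC as the fraction of (driver, nondriver) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def relative_gain_percent(ours, baseline):
    """Improvement of ours over baseline, as a percentage of the baseline."""
    return 100.0 * (ours - baseline) / baseline


# Toy example with hypothetical scores (1 = driver, 0 = nondriver).
toy_auc = auc_by_ranking([0.9, 0.8, 0.7, 0.6, 0.4, 0.2], [1, 1, 0, 1, 0, 0])

# AUCs in percent as reported in the text for STRING and prostate cancer.
baselines = {
    "MUFFINN-DNmax": 71.07,
    "MUFFINN-DNsum": 61.96,
    "ReMIC (Beta = 0.01)": 69.30,
    "ReMIC (Beta = 0.02)": 71.42,
    "ReMIC (Beta = 0.03)": 72.31,
}
smallest_gain = min(relative_gain_percent(87.43, b) for b in baselines.values())
```

The smallest relative gain is obtained against the strongest baseline, ReMIC with Beta = 0.03, and rounds to 20.9%, which corresponds to an absolute difference of about 15.1 percentage points.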

3.2. Result Evaluation for iRefIndex Network

We also apply our model to the 12129 genes in the iRefIndex network. For the mutation data of prostate cancer, the ROC curve of our model is again closer to the top left corner than those of the other methods (see Figure 3(a)), showing better performance than the compared methods. In detail, when the specificities are fixed to 0.10, the corresponding sensitivities of the different versions of ReMIC are around 34.17%, and those of MUFFINN-DNmax and MUFFINN-DNsum are around 22.50%. In contrast, the sensitivity of our proposed model is 76.67%, which is distinctly larger than those of the compared methods. When the sensitivities are fixed to 0.10, the specificities achieved by the other methods range from 30.49% to 45.78%, whereas our model attains the highest specificity of 55.85%. Overall, our model outperforms the existing methods on the iRefIndex network with mutation data of prostate cancer.

For the mutation data of thyroid cancer, the ROC curves similarly show that our model has a clear advantage over the compared methods (see Figure 3(b)). As in the assessment above, when the specificities are fixed to 0.10, our proposed method obtains the largest sensitivity among all methods. Specifically, the sensitivity of our method is 56.67%, compared to 26.67% of MUFFINN-DNmax, 28.33% of MUFFINN-DNsum, 46.67% of ReMIC (Beta = 0.01), ReMIC (Beta = 0.02), and 55.00% of ReMIC (Beta = 0.03). Likewise, when the sensitivities of the competing methods are fixed to 0.10, the specificity of our proposed modularity subspace based concept learning model is 70.38%, which is also the highest among the compared methods. Since the specificities of the other methods range from 29.57% to 53.12%, the value achieved by our model is at least 32.4% higher in relative terms than those of the others. Consequently, with mutation data from thyroid cancer samples, our method also shows a distinct advantage on the iRefIndex network.

As for the comprehensive measurement AUC, the results for both prostate cancer (see Figure 3(c)) and thyroid cancer (see Figure 3(d)) also demonstrate that our proposed modularity subspace based concept learning model is superior to the compared methods. For prostate cancer, the AUCs obtained by MUFFINN-DNmax and MUFFINN-DNsum are 63.78% and 52.08%, respectively, and the AUCs yielded by the different versions of ReMIC range from 65.79% to 67.57%. In contrast, our model achieves an AUC of 84.72%, the largest among all methods. For thyroid cancer, our model also surpasses the competing methods with an AUC of 85.04%, whereas the AUCs of MUFFINN-DNmax, MUFFINN-DNsum, ReMIC (Beta = 0.01), ReMIC (Beta = 0.02), and ReMIC (Beta = 0.03) are 70.71%, 55.81%, 73.34%, 74.41%, and 75.15%, respectively. The ablation study for the two cancer types also shows the roles of different modules in our model (see Figure S2 in the Supplementary Materials). Therefore, we conclude that, on the iRefIndex network, our modularity subspace based concept learning model exhibits superior performance compared to the existing network based methods.

3.3. Functional Enrichment Analysis

To further examine the capability of our proposed modularity subspace based concept learning model in cancer driver filling, we inspect its attribute filling results on the data of the two cancer types. Applying the trained model to known drivers and mutation data of cancer samples yields a list of genes with their probabilities of having driver attributes. Since the top ranked genes attract more attention from tumorigenesis researchers than the other genes, we investigate the top twenty genes yielded by our model (see Table S1 in the Supplementary Materials for the predicted gene list). For the gene list of prostate cancer, we perform functional enrichment analysis on the STRING network to find the related biological functions [38]. The results show that the genes obtained by our model are significantly enriched for a number of cancer related functions (see Table 1). In detail, the genes in the results for the STRING network and prostate cancer data are highly involved in prostate cancer related functions, such as transcriptional misregulation in cancer, enzyme binding, sequence-specific DNA binding, and prostate cancer. Therefore, the gene list yielded by our model shows significant associations with known functions of prostate cancer.

Similarly, for the results of our model on the STRING network with mutation data of thyroid cancer, we perform functional enrichment analysis [38] on the predicted gene list (see Table S1 in the Supplementary Materials for the gene list) and examine the highly related function terms (see Table 1). The results show that the top ranked function term is the thyroid cancer pathway, indicating the affinity of the predicted genes to the corresponding cancer. The predicted genes are also significantly enriched for cancer related functions such as MAPK cascade, pathways in cancer, and central carbon metabolism in cancer, illustrating the strong relation between the gene list and cancer progression. Functional enrichment analysis of the results on the iRefIndex network gives similar findings for samples of both prostate cancer and thyroid cancer. The gene lists (Table S1 in Supplementary Materials) and the enriched function terms (Table S2 in Supplementary Materials) on the iRefIndex network also show high association with functions of the two types of cancer.
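Enrichment of a gene list for a function term is often assessed with a one-sided hypergeometric test, and the sketch below shows that calculation. It is an illustration under stated assumptions: the exact statistic and multiple-testing correction used by the enrichment tool cited above are not described in this section, the function name is hypothetical, and the term size (100 genes) and overlap (5 genes) are made-up numbers. Only the background size of 12233 genes and the list length of twenty are taken from the text.

```python
from math import comb


def enrichment_p_value(n_background, n_annotated, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random."""
    total = comb(n_background, n_selected)
    upper = min(n_annotated, n_selected)
    tail = sum(
        comb(n_annotated, i) * comb(n_background - n_annotated, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical case: a term annotating 100 of 12233 background genes, with
# 5 of the top twenty predicted genes carrying that annotation.
p_value = enrichment_p_value(12233, 100, 20, 5)
```

A small p-value indicates that the overlap between the predicted genes and the function term is larger than expected from a random draw of twenty genes. In practice the p-values of all tested terms would also need a multiple-testing correction.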

4. Discussion and Conclusions

In the gene interaction network, the aberrations of a gene can influence both its own function and the functions of its interacting genes, and both effects contribute to the development of cancer. Although previous methods based on network properties of gene nodes and network propagation of mutation frequencies have achieved many successes, obstacles remain in the task of cancer driver attribute filling. Specifically, the number of mutation samples from cancer patients is small compared with the large scale of the network, which causes the curse of dimensionality. In addition, driver genes without distinctive network properties or high propagation influence from neighbours are missed by previous methods. To tackle these obstacles, we propose a novel modularity subspace based concept learning model, which learns the modularity features of gene nodes beyond their network neighbours and reduces the feature dimensions to circumvent the curse of dimensionality. An evaluation of our model and the compared methods on two gene interaction networks from independent sources shows a distinct advantage of our proposed model in the driver attribute filling task. The enrichment analysis also shows a high correlation between the results of our model and cancer related functions.

We attribute the improvement achieved by our proposed modularity subspace based concept learning model to three factors: information compensation, modularity features, and dimension reduction. First, we combine information from both the gene interaction network and the mutation data of cancer samples, which allows us to exploit the advantage of fusing information from distinct independent sources. Second, we extend the features of genes from network neighbours and mutation frequencies to the modularity memberships of genes, which overcomes the limitation of the existing features of gene nodes in the interaction network. Third, we apply dimension reduction to both the gene interaction network data and the mutation data of cancer samples, which circumvents the negative effect of the curse of dimensionality caused by the small sample problem. Owing to these three factors, our proposed model demonstrates superior performance compared to the existing methods, indicating its effectiveness in the task of driver attribute filling of genes.

There is still room for improvement of our proposed model in future work. One promising improvement is to incorporate information beyond the interaction network and genomic mutation data, that is, integrated information from multiomics data such as the transcriptome [39, 40], epigenome [41], and proteome [42]. Another is to consider the effect of deleterious synonymous variants within the framework of our model, that is, to treat mutation types at a finer resolution [43, 44]. In addition to genes in coding regions, our model is potentially applicable to attribute filling of noncoding regions in bioinformatics [45]. Furthermore, incorporating recent advanced artificial intelligence techniques is also a promising direction [46, 47]. The framework of our model also shows potential for application in fields beyond gene interaction network analysis, such as hydrological models [48, 49], catalytic activity models [50, 51], and spectroscopy analysis [52, 53]. Last but not least, incorporating larger cohorts of cancer samples into the analysis of the roles of genes in the interaction network is also a promising direction for driver attribute filling of gene nodes [54].

In conclusion, our proposed modularity subspace based concept learning model effectively combines the information of the gene interaction network and cancer mutation data and reduces the feature dimensions to circumvent the curse of dimensionality resulting from the small sample problem in the attribute concept learning of cancer driver genes. The effectiveness of our model in attribute filling of gene nodes has been systematically evaluated on two interaction networks. Given its distinct performance, our model shows promising potential for the analysis of cancer driver genes from the interaction network, facilitating a comprehensive understanding of tumorigenesis.

Data Availability

The data repositories and deposition codes are freely available at https://github.com/JianingXi/ModularitySubspaceConceptLearning.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors would like to thank Dr. Zhenhua Yu for his helpful suggestions. This work was supported in part by the National Natural Science Foundation of China under grant nos. 61876145, 61973249, 61973250, 62003279, and 61901322 and in part by Education Department of Shaanxi Provincial Government (project nos. 19JC041 and 19JC038).

Supplementary Materials

Table S1: predicted gene lists of our proposed model on STRING network and iRefIndex network for data of prostate cancer samples and thyroid cancer samples. Table S2: functional enrichment analysis results of our proposed model on iRefIndex network. (a) Enrichment results of prostate cancer. (b) Enrichment results of thyroid cancer. Figure S1: ablation study on modules in our model for STRING network. The red-dashed lines represent the AUCs for cases of all modules, and the four bars denote the AUCs for cases of random ablation of three-fourths of modules. (a) Prostate cancer samples. (b) Thyroid cancer samples. Figure S2: ablation study on modules in our model for iRefIndex network. The red-dashed lines represent the AUCs for cases of all modules, and the four bars denote the AUCs for cases of random ablation of three-fourths of modules. (a) Prostate cancer samples. (b) Thyroid cancer samples. Supplementary text: details of hyperparameter settings of our proposed model. (Supplementary Materials)