Abstract

In this paper, we propose a novel method, SeekFun, to predict protein function based on a weighted mapping of domains and GO terms. First, a weighted mapping of domains and GO terms is constructed from the GO annotations and domain composition of proteins. The association strength between a domain and a GO term is weighted by symmetrical conditional probability. Second, the mapping is extended along the true paths of the terms in the GO hierarchy. Finally, the terms associated with resident domains are transferred to the host protein, and the real annotations of the host protein are determined by the association strengths. Our comparisons show that SeekFun outperforms the compared methods in most cases. SeekFun provides a flexible and effective way to predict protein function. It benefits from the well-constructed mapping of domains and GO terms, as well as from a reasonable strategy for inferring the annotations of a protein from those of its domains.

1. Introduction

An increasing number of protein sequences are available owing to advanced sequencing technologies, but the biological roles and functions of most of these proteins are hardly known. As reported by [1], less than one percent of proteins have been functionally characterized by experiments. In other words, proteins are sequenced faster than they are annotated. To fill this gap, a large number of computational methods have been developed to predict protein functions. These methods exploit biological information including amino acid sequence [2–9], genomic context [10–14], protein interaction networks [15–17], protein structure [18–23], microarray data [24], and literature [25, 26]. However, newly sequenced proteins often have little biological information other than their amino acid sequences. Thus, the development of sequence-based methods is crucial and useful for directing further experimental work.

In the past few years, several sequence-based methods [2–9] have been proposed to infer protein functions. These methods annotate a protein with the representative annotations of its homologues and are therefore also called homology-based methods. Usually, homology-based methods include two stages: searching for homologues with BLAST or PSI-BLAST and selecting representative Gene Ontology (GO) terms from the annotations of the homologues of the unannotated protein. More specifically, Goblet [2] determined the homologues by a predefined threshold on the BLAST e-value and annotated the unannotated protein with the GO terms of its homologues. GoFigure [3], OntoBlast [4], and Gotcha [5] weighted the GO terms by the BLAST e-values and chose GO terms by their weights. PFP [6, 7] made use of both strongly and weakly similar sequences of the query sequence to increase the coverage of functional annotation. ESG [8] exploited cascading homologues of the unannotated protein iteratively to improve the precision of prediction. ConFunc [9] split the homologues into subgroups according to their annotations and then inferred the annotations of the unannotated protein from these subgroups. These methods have had a positive impact on protein function prediction. However, homology-based methods may not work when the unannotated protein has low sequence similarity to annotated sequences or when none of its homologues is annotated. Furthermore, it has also been reported that transferring annotations among homologues may easily produce erroneous results [27].

As is known, a domain is a conserved unit of sequence and structure in protein evolution that acts as a stable and independent functional block of proteins [28]. Besides the detailed sequence, a domain also carries important structural information, such as the active site, which is closely related to biological function [21]. Thus, a domain may be a suitable clue for discovering the function of proteins. Statistics on the UniProt database (released in May 2013) show that more than sixty percent of proteins have domains. Moreover, domain databases and tools for efficient domain recognition have been developed, including Pfam [29], SCOP [30], RPS-BLAST [31], and HMMER [32]. These databases and tools accelerate the analysis of domains in proteins. In general, inferring functions from the resident domains of a protein appears feasible and reasonable.

So far, many efforts have been made to discover the functional signals carried by domains. Schug et al. [33] generated rules for function-domain associations based on the intersection of functions assigned to gene products that contain domains at varying levels of sequence similarity. Hayete and Bienkowska [34] designed an automated decision-tree predictor to assign functions to domains. Mulder et al. [35] mapped a GO term to a domain if no protein with the given domain appears in the set of proteins without the given GO term. Song et al. [36] transferred functions based on the alignment of domain content. By analogy with [35], Forslund and Sonnhammer [37] assigned a GO term to a domain set if and only if all proteins containing the domain set are also annotated with the given GO term. Rentzsch and Orengo [38] transferred annotations within single profile-based sequence clusters. These methods are easy to understand and implement, but spurious and missing protein annotations readily mislead them into erroneous predictions. Even a single protein missing a valid GO term is enough to mislead the functional inference about its domains.

In addition, Zhao et al. [39] used protein-domain features, domain-domain interactions, and domain coexistence features to predict domain function. Their work effectively extended the coverage of domain annotation and provided a solid foundation for predicting protein function. However, it mainly addressed domain function rather than how the annotation of a domain affects protein function. In our work, we focus on how to predict protein function based on domain annotation.

Recently, probabilistic models have become increasingly popular for their remarkable performance in inference under uncertainty. Forslund and Sonnhammer [37] used a Naïve Bayesian (NB) model to assign terms to domain sets. Nevertheless, the Naïve Bayesian model requires that domain sets occur independently, which does not hold in practice. Thus, Forslund and Sonnhammer attempted to reduce the dependencies between domain subsets by using an averaged contribution from each domain subset. However, the conditional independence assumption may still not hold. Subsequently, Messih et al. [40] designed two models based on NB: DRDO, which addresses the dependency problem with an averaged contribution from each subset of sequentially neighboring domains, and DRDO-NB, which takes the recurrence and order of domains into consideration. Although the computational complexity of DRDO is lower than that of NB, it may still not satisfy the conditional independence assumption. Moreover, all of these methods prune the GO terms of resident domains before assigning GO terms to the host protein. Thus, some weak functional signals, which may be amplified by dependencies between domains, are likely to be neglected.

Fang and Gough [41] developed the dcGO predictor for inferring GO terms associated with individual domains and supradomains based on protein-level GO annotation (GOA) and families of proteins. dcGO used the P value to evaluate the association strength (referred to as relevance in the following sections for simplicity) between a domain and a GO term. Since the P value only represents the probability of error involved in rejecting the null hypothesis, it may not be reasonable to estimate the relevance between a domain and a GO term by the P value. In other words, the P value can be used to determine, from a statistical perspective, which GO terms are related to a given domain, but it is not sufficient to measure the degree of their relatedness. Thus, an appropriate metric is needed for weighting the relevance between a GO term and a domain objectively.

In this paper, we design a method to seek functions for proteins (SeekFun) effectively. Under this method, a mapping of GO terms and domains is constructed based on protein-level GOA and domain compositions of proteins. The relevance between domain and GO term is measured by symmetrical conditional probability. Based on the relevance of resident domains and terms, the relevance between host protein and GO terms is computed. Finally, the GO terms with relevance above a predefined threshold are used to annotate the host protein. The performance of SeekFun is validated by a series of experiments. The results suggest that our method is effective and reliable for protein function prediction.

3. Methods

3.1. Step 1: Construct and Weight Mapping of Domains and GO Terms

It is assumed that the resident domains of a protein may be associated with the GO terms of the host protein. This is a rough assumption about the relationship between domains and GO terms and may result in a large number of false associations. To differentiate the true associations from the false ones, the relevance between each domain and GO term needs to be measured. Under a suitable measure, the true associations will have higher relevance while the false ones will have lower relevance.

As mentioned earlier, the P value can be used to determine whether a domain is related to a GO term. When the P value for a domain and a GO term is below the given significance threshold, the domain is considered to be annotated with the GO term, and vice versa. However, a smaller P value does not mean a tighter relationship between the domain and the GO term. In simple words, the P value may not be suitable for measuring the relevance between a domain and a GO term. Suppose that $D$ denotes the event that a protein contains domain $d$ and $T$ denotes the event that the protein performs the function described by GO term $t$. The conditional probability $P(T \mid D)$ is the probability that a protein containing $d$ is annotated with $t$; it reflects the dependence of $t$ on $d$. Likewise, $P(D \mid T)$ is the probability that a protein annotated with $t$ contains the domain $d$; it reflects the dependence of $d$ on $t$. Thus, a single conditional probability reflects the relevance between a domain and a GO term only partly. As in (1), the symmetrical conditional probability may be appropriate to measure the relevance between GO term $t$ and domain $d$, $R(d, t)$:

$$R(d, t) = P(T \mid D)\, P(D \mid T). \tag{1}$$

Equation (1) means that the relevance between $d$ and $t$ is determined jointly by the two conditional probabilities between $D$ and $T$. The larger the probabilities are, the stronger the relevance between them is. The relevance ranges from 0 to 1. A higher relevance means that the domain is more likely to be annotated with the term.

Suppose that #prot($t$) is the number of proteins that are annotated with $t$, #prot($d$) is the number of proteins that contain $d$, and #prot($d, t$) is the number of proteins that both contain $d$ and are annotated with $t$. Accordingly, (1) can be transformed into (2):

$$R(d, t) = \frac{\#\mathrm{prot}(d, t)}{\#\mathrm{prot}(d)} \cdot \frac{\#\mathrm{prot}(d, t)}{\#\mathrm{prot}(t)} = \frac{\#\mathrm{prot}(d, t)^{2}}{\#\mathrm{prot}(d)\,\#\mathrm{prot}(t)}. \tag{2}$$
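As an illustration, the counting form of the relevance can be sketched in Python. This is a minimal sketch rather than the implementation used in our experiments: it assumes that the symmetrical conditional probability takes its usual form, the product of the two conditional probabilities, estimated from protein counts. The function name `scp_mapping`, the input dictionaries, and the toy identifiers are hypothetical.

```python
from collections import defaultdict


def scp_mapping(protein_domains, protein_terms):
    """Weight every co-occurring (domain, term) pair.

    protein_domains: dict, protein id -> set of domain ids
    protein_terms:   dict, protein id -> set of GO term ids
    Only proteins present in protein_terms are counted (assumption).
    """
    n_domain = defaultdict(int)  # number of proteins containing the domain
    n_term = defaultdict(int)    # number of proteins annotated with the term
    n_both = defaultdict(int)    # number of proteins with both

    for protein, terms in protein_terms.items():
        domains = protein_domains.get(protein, set())
        for d in domains:
            n_domain[d] += 1
        for t in terms:
            n_term[t] += 1
        for d in domains:
            for t in terms:
                n_both[(d, t)] += 1

    # relevance = (both / n_domain) * (both / n_term)
    return {
        (d, t): both * both / (n_domain[d] * n_term[t])
        for (d, t), both in n_both.items()
    }


# Toy data with hypothetical identifiers.
toy_domains = {"P1": {"PF1"}, "P2": {"PF1", "PF2"}, "P3": {"PF2"}}
toy_terms = {"P1": {"GO:A"}, "P2": {"GO:A", "GO:B"}, "P3": {"GO:B"}}
mapping = scp_mapping(toy_domains, toy_terms)
```

In this toy example, PF1 and GO:A always occur together, so their relevance reaches the maximum of 1, whereas PF1 and GO:B co-occur in only one of the two proteins on each side and receive a relevance of 0.25.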

3.2. Step 2: Transfer GO Terms of Resident Domains to the Host Protein

As is known, GO terms are organized as a directed acyclic graph and may be related to each other. Thus, predicting the functions of proteins should take the relationships between GO terms into consideration. GO has a rule called the “true path rule”: if a protein is annotated with a given term, then all terms along the pathway from that term to the root term must also annotate the protein. A path upward from the given term to the root term in the GO hierarchy is regarded as a true path of the term. Considering the true path rule, the mapping of GO terms and domains is extended along the true paths of the GO terms in our method. Traditionally, if a domain is associated with a GO term, it is also associated with all ancestral terms of the GO term with equal relevance. However, it has been reported that GO terms differ in semantics even when they are in a parent-child relationship. Thus, the relevance between the domain and each ancestor of the GO term may be different, and the semantic differences between GO terms should be considered.
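The true path rule can be sketched as a closure over the ancestors of each annotated term. The following is a minimal sketch on a hypothetical toy hierarchy; the dictionary `parents` (term to set of parent terms) and the function names are assumptions made for illustration, not part of any GO tool.

```python
def ancestors(term, parents):
    """Return all ancestors of `term` in a DAG given as term -> set of parents."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def apply_true_path_rule(terms, parents):
    """Extend a set of terms with every term on their true paths to the root."""
    closed = set(terms)
    for term in terms:
        closed |= ancestors(term, parents)
    return closed


# Hypothetical toy hierarchy: "c" has two parents, both children of "root".
toy_parents = {"c": {"b1", "b2"}, "b1": {"root"}, "b2": {"root"}}
closed_terms = apply_true_path_rule({"c"}, toy_parents)
```

Because GO is a directed acyclic graph rather than a tree, a term may have several true paths; the closure above collects the terms on all of them.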

In fact, the organization of GO terms can be regarded as a split-flow semantic system (SFSS). In SFSS, the root term is the source of semantics and describes general functions, while the other terms represent semantic branches of the root term and describe specific functions. Therefore, the terms along the true path of a given term differ in their capability to describe functions. Generally, for a given function, an ancestral term is more likely to describe the function than its descendants because its semantics is more general and thus has more power to describe functions. This can be explained by the semantic coverage of a GO term, which can be roughly estimated by the number of its descendants [42].
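The rough estimate of semantic coverage by the number of descendants can be sketched as follows. This is only an illustrative sketch on a hypothetical toy hierarchy; the function name `descendant_counts` and the input format (term to set of parent terms) are assumptions.

```python
def descendant_counts(parents):
    """Count the descendants of every term in a DAG given as term -> set of parents."""
    children = {}
    terms = set(parents)
    for term, term_parents in parents.items():
        for parent in term_parents:
            terms.add(parent)
            children.setdefault(parent, set()).add(term)

    counts = {}
    for term in terms:
        seen = set()
        stack = [term]
        while stack:
            for child in children.get(stack.pop(), ()):
                if child not in seen:  # count shared descendants only once
                    seen.add(child)
                    stack.append(child)
        counts[term] = len(seen)
    return counts


# Hypothetical toy hierarchy.
toy_parents = {"c": {"b1", "b2"}, "d": {"b1"}, "b1": {"root"}, "b2": {"root"}}
coverage = descendant_counts(toy_parents)
```

In the toy hierarchy the root covers four descendants, `b1` covers two, and the leaves cover none, which matches the intuition that terms nearer to the root have broader semantics.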

Based on these analyses, we propose a novel strategy, namely RSC, to measure the relevance between a domain and an ancestral term based on semantic coverage. That is, given a term that is related to the domain with a known relevance, the relevance between the domain and each ancestral term of that term can be calculated by (3), which is expressed in terms of the descendant set of a given term and the set of ancestors of the related term. Naturally, along the true path, a term nearer to the root has a larger relevance with the given domain than the others and is more likely to annotate the host protein.

It is supposed that a protein is associated with all GO terms that are related to its resident domains. The relevance between a protein and a GO term can be derived from the relevance between the term and the resident domains of the protein. For example, if a protein contains a set of domains, each with its own relevance to a given term, then the relevance between the protein and the term can be computed from these domain relevances by (4).

After the extension, each protein is associated with a group of GO terms with strong or weak relevance. To facilitate comparison, the relevance between proteins and terms needs to be normalized. Each GO category should be analyzed separately, as the categories have different biological meanings. For each protein, the relevance between the protein and the root of the subontology (GO:0003674 for molecular function, GO:0008150 for biological process, and GO:0005575 for cellular component) is used as the baseline because the real annotations of a protein must branch from the root in the GO hierarchy. The normalized relevance between a protein and a term is measured against this baseline by (5), so that the relevance is scaled from 0 to 1. A higher relevance means that the protein is more likely to be annotated with the term.

Through the above steps, the relevance between proteins and GO terms has been measured. To select the real annotations from the candidate annotations, a threshold of relevance needs to be defined. If the relevance between a protein and a term is above the predefined threshold t, the term is assigned to the protein; otherwise it is not. In our study, the threshold t is about 0.6~0.7, as the proposed model performs well on the given datasets in this range.
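The final selection step can be sketched as a simple filter over the normalized relevance. This is a minimal sketch under stated assumptions: the input is taken to be already normalized to the range from 0 to 1, the default threshold of 0.65 is merely the midpoint of the reported range, a term whose relevance equals the threshold is kept, and the function name and the non-root term identifiers are hypothetical.

```python
def select_annotations(normalized_relevance, threshold=0.65):
    """Return the GO terms whose normalized relevance reaches the threshold.

    normalized_relevance: dict, GO term id -> relevance in [0, 1]
    threshold: assumed default, midpoint of the 0.6~0.7 range
    """
    selected = [
        (term, relevance)
        for term, relevance in normalized_relevance.items()
        if relevance >= threshold
    ]
    # Strongest associations first, ties broken by term id for determinism.
    selected.sort(key=lambda item: (-item[1], item[0]))
    return [term for term, _ in selected]


# GO:0003674 is the molecular function root; the other ids are hypothetical.
toy_relevance = {"GO:0003674": 1.0, "GO:X": 0.72, "GO:Y": 0.41}
annotations = select_annotations(toy_relevance)
```

A lower threshold admits more candidate terms and therefore trades precision for recall.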

4. Results and Discussion

4.1. Experimental Datasets

Three up-to-date protein datasets from UniProt (Uniref50, SwissProt, and TrEMBL) are selected to evaluate SeekFun. Proteins that are annotated only with GO terms inferred from electronic annotation are excluded from the experimental datasets. The SwissPfam database is used to determine the detailed domain composition of proteins. All the datasets were downloaded on May 20, 2013. The details of the experimental datasets are listed in Table 1.

4.2. Evaluation Metrics

Consistent with the Critical Assessment of Functional Annotation (CAFA) experiments [42], precision, recall, and f-measure are used to judge the performance of the methods in our experiments. Given a target protein $p$ and the set $T_p$ of known (true) annotations of $p$, the precision of a method at threshold $t$, $pr(t)$, can be calculated as

$$pr(t) = \frac{1}{m(t)} \sum_{p \in S,\; P_p(t) \neq \emptyset} \frac{|P_p(t) \cap T_p|}{|P_p(t)|}. \tag{6}$$

In (6), $P_p(t)$ is the set of predicted annotations whose relevance with $p$ is above $t$, $S$ is the target set for testing, and $m(t)$ is the number of proteins in $S$ that have at least one predicted GO term at the given $t$. Similarly, the recall of a method at threshold $t$, $rc(t)$, can be computed by

$$rc(t) = \frac{1}{|S|} \sum_{p \in S} \frac{|P_p(t) \cap T_p|}{|T_p|}. \tag{7}$$

The f-measure (the harmonic mean of precision and recall) gives a single intuitive number for comparing the methods. For each method, the maximal value of the f-measure over all thresholds of relevance, $F_{\max}$, is calculated as

$$F_{\max} = \max_{t} \frac{2\, pr(t)\, rc(t)}{pr(t) + rc(t)}. \tag{8}$$

Considering the relationships between GO terms, the comparisons are guided by the true path rule. That is, $T_p$ and $P_p(t)$ are extended by adding all ancestors of their members to them before comparing.
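The evaluation can be sketched in Python following the protein-centric CAFA definitions of precision, recall, and maximal f-measure. This is a minimal sketch, not the evaluation script of our experiments: it assumes that the true and predicted term sets have already been extended with their ancestors, that every target protein has at least one true term, and that a term whose relevance equals the threshold counts as predicted. All function and variable names are hypothetical.

```python
def precision_recall(predictions, truth, t):
    """predictions: protein -> {term: relevance}; truth: protein -> set of true terms."""
    precision_sum, n_with_prediction, recall_sum = 0.0, 0, 0.0
    for protein, true_terms in truth.items():
        scores = predictions.get(protein, {})
        predicted = {term for term, score in scores.items() if score >= t}
        hits = len(predicted & true_terms)
        if predicted:
            # Precision is averaged only over proteins with a prediction.
            precision_sum += hits / len(predicted)
            n_with_prediction += 1
        # Recall is averaged over all target proteins.
        recall_sum += hits / len(true_terms)
    precision = precision_sum / n_with_prediction if n_with_prediction else 0.0
    recall = recall_sum / len(truth)
    return precision, recall


def f_max(predictions, truth, thresholds):
    """Maximal harmonic mean of precision and recall over the given thresholds."""
    best = 0.0
    for t in thresholds:
        precision, recall = precision_recall(predictions, truth, t)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy data with hypothetical identifiers.
toy_truth = {"P1": {"a", "b"}, "P2": {"c"}}
toy_predictions = {"P1": {"a": 0.9, "x": 0.5}, "P2": {"c": 0.4}}
best_f = f_max(toy_predictions, toy_truth, [0.4, 0.5, 0.8])
```

On the toy data the best f-measure is reached at the lowest threshold, where both proteins receive predictions and precision and recall are both 0.75.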

4.3. Comparisons of Relevance Computed by Different Strategies

To illustrate the rationality of the weighting strategies, the relevance weighted by symmetrical conditional probability is compared with the relevance measured by the P value and by traditional conditional probability. In fact, it is hard to evaluate the relevance between a domain and a GO term for lack of a gold standard. To determine an appropriate strategy for weighting relevance, some properties of relevance are analysed. A little random noise may make a difference between the observed and the real datasets, and the relevance should be robust across such similar datasets. To simulate similar datasets, a series of subsets of Uniref50, SwissProt, and TrEMBL is constructed by randomly taking nine of their ten equal-size partitions at a time. The relevance is calculated by the different strategies on these subdatasets. How the distributions of relevance vary across the different datasets may be good evidence of which strategy is more suitable for weighting relevance.
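The construction of similar datasets can be sketched as follows. This is a minimal sketch that enumerates all ten subsets obtained by leaving out one of ten random partitions; the function name, the fixed random seed, and the protein identifiers are assumptions made for illustration.

```python
import random


def nine_of_ten_subsets(proteins, n_parts=10, seed=0):
    """Split proteins into n_parts random partitions and return every
    subset formed by leaving exactly one partition out."""
    items = sorted(proteins)
    random.Random(seed).shuffle(items)  # seeded for reproducibility (assumption)
    parts = [items[i::n_parts] for i in range(n_parts)]
    subsets = []
    for left_out in range(n_parts):
        subset = [p for i, part in enumerate(parts) if i != left_out for p in part]
        subsets.append(subset)
    return subsets


# Hypothetical protein identifiers.
toy_proteins = [f"P{i}" for i in range(100)]
toy_subsets = nine_of_ten_subsets(toy_proteins)
```

Any two of the resulting subsets share eight of the ten partitions, so the relevance computed on them should differ only slightly if the weighting strategy is robust.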

The distributions of relevance derived from different strategies are displayed in Figure 1. In order to facilitate comparison, without loss of meanings, the logarithmic transformation and Z-score transformation are performed on , which are represented by in Figure 1. Observed the figure, it can be found that is the most changeful while the distribution curves of both and have similar trends. All of those suggest that, as for robustness on tiny different datasets, the and are more proper than . What is more, the curves of and appear to have obvious monotonicity that is beneficial for assigning GO terms to the domain.

Meanwhile, the curves of are steeper than those of on each dataset, which imply that the resolution of is lower than . In this paper, the resolution describes how sensitive the relevance is to distinguish true positive association between domain and GO term from other negative ones. The resolution of relevance is inversely proportional to the average density of relevance in their range, which is just indicated by the steepness of the curves in the figures. In simple words, the larger the average density of relevance in their range is, the harder the true association between domain and GO term is determined.

On the other hand, the relevance derived from two significantly different datasets may vary more dramatically than those from the similar datasets. Statistically, the SwissProt and TrEMBL have no intersection while they have 5031 and 6929 common proteins with Uniref50, about up to their 30% and 36% separately. Consequently, the difference between the curves of relevance on SwissProt and TrEMBL should be larger than those of others. Observing the distributions of relevance on these datasets, as displayed by Figure 2, it can be found that the and vary as expected but the still suffers from low resolution. Generally speaking, it can be concluded that is a more suitable measure of relevance between domain and GO term.

4.4. The Impact of Symmetrical Conditional Probability on Protein Function Prediction

To validate its impact on protein function prediction, the symmetrical conditional probability is tested on the experimental datasets Uniref50, SwissProt, and TrEMBL. The comparison is performed separately on the three subontologies of GO: molecular function (MF), biological process (BP), and cellular component (CC). The comparison includes two steps: constructing the mapping of domains and GO terms and annotating proteins based on the mapping.

In our experiment, the mapping of Pfam domains and GO terms (pfam2go) is downloaded from the Gene Ontology website in May, 2013. Based on this reliable mapping, all annotations which are associated with the resident domains are assigned to the host protein. This method is named in this paper. Meanwhile, the mapping of Pfam domains and GO terms which is weighted by is also used for prediction, namely, . In the comparisons, and are validated by performing the same task in the same framework on the basis of different mappings of domains and GO terms. To avoid the influence of domain coverage, the weighted mapping with just includes the domains in pfam2go when it is applied. Here, to compare the influence of the strategy and RSC, the method which is the combination of them is also used to perform the same task and marked with . Their performances are illustrated in Table 2.

As displayed in Table 2, has higher recall than while the latter achieves better precision than the former. These results suggest that the could improve the specificity of annotations but it is at the cost of precision.

It also can be found from Table 2 that is superior to others in general. Compared to , outperforms on both precision and recall. In contrast to , significantly improved the precision while it does as well as on recall. Thus, it can be concluded that tend to select specific terms for the proteins and RSC balances this bias by propagating in the GO hierarchy. It may be the reason that shows higher performances.

4.5. The Impact of RSC on Protein Function Prediction

In order to validate the effectiveness of RSC, it is compared with the traditional strategy, which sets the relevance between a domain and every term along a true path as equal (RPE). The two strategies are applied to predict protein functions based on the mapping of domains and GO terms weighted by symmetrical conditional probability. Their best performances are listed in Table 3.

As displayed, RPE gives a better recall while RSC has higher precision and . In general, RSC may be more beneficial to protein function prediction than RPE. It may be because the resolution of is effectively promoted by different relevance between protein and each term along a true path. On the contrary, RPE considered that protein has equal relatedness to every term along the true path, which makes it harder to determine the true positive associations between terms and the host protein. Even if the threshold of RPE is 1, its precision is still lower than the other one and recall goes down. It confirms that the differences of GO terms have significant influence on their relevance with protein.

4.6. Comparison of the Concerned Methods

To assess the effectiveness of SeekFun, it is compared with NB, DRDO, DRDO-NB, and dcGO on the three benchmark datasets. The performances of the compared methods on the different datasets are shown in Table 4. To provide a simple number for comparison between methods, the averages of the metrics over the datasets are also listed.

In terms of precision, SeekFun is superior to the others, while NB, DRDO, and DRDO-NB follow in turn. The precision of dcGO is significantly lower than that of the others. As mentioned above, dcGO measures the relevance between a domain and a GO term by the P value, while the other methods calculate it based on conditional probability. These results may indicate again that the relevance estimated by the P value is not sensitive enough to determine the true positive associations between domains and GO terms. In other words, the P value has low resolution for distinguishing the real annotations of a protein. By contrast, conditional probability is more suitable for estimating relevance.

As for recall, SeekFun performs better than the others, and dcGO follows. The performances of NB, DRDO, and DRDO-NB are not as good as those of the other methods. In detail, NB, DRDO, and DRDO-NB infer the functions of a protein from the annotations of domain combinations, which enhances the precision of function prediction. However, in the process of discovering domain combinations, some slightly weak associations between domains and GO terms may be neglected. The resident domains of the host protein may interplay as different combinations to perform different functions. Nevertheless, these methods accept a domain combination only if its members exist in the protein and the value of the combination is above a predefined threshold. This may miss information carried by the potential domain combinations and by the domains themselves. We suppose that this may be the reason why these methods show lower recall.

Overall, SeekFun performs better than the other methods. This can be attributed to the weighted mapping of domains and GO terms and to the strategy for transferring the annotations of resident domains to the host proteins. The weighted mapping properly reflects the relationship between domains and GO terms. The transferring strategy takes both the differences and the connections between terms into consideration, which greatly improves its capability of distinguishing real associations between domains and terms from false ones.

5. Conclusions

In this paper, SeekFun is developed for protein function prediction. Instead of using the amino acid sequence of a protein directly, SeekFun takes the resident domains of proteins and protein-level GOA as clues to annotate proteins. We tested the overall performance of SeekFun, and the results suggest that SeekFun is generally superior to the compared methods (NB, DRDO, DRDO-NB, and dcGO) in precision and recall.

Meanwhile, the effects of the relevance computed by symmetrical conditional probability and of the strategy for inferring the annotations of a protein from those of its resident domains (RSC) are validated separately. The results of these experiments confirm that both are effective and can improve the performance of protein function prediction. In the proposed method, the symmetrical conditional probability tends to discover specific functions of a protein but cannot ensure precision, and RSC is used to compensate for this shortcoming. Therefore, their combination achieves high performance. The main idea of SeekFun could easily be used to acquire knowledge from other functional ontologies based on different domain resources. SeekFun will facilitate the discovery of protein functions and provide insights into the biological roles of proteins.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Authors’ Contribution

The experiments are conceived and designed by Zhixia Teng and Maozu Guo. The experiments are performed by Zhixia Teng and Chunyu Wang. The data are analyzed by Zhixia Teng, Qiguo Dai, and Jin Li. The paper is prepared by Zhixia Teng, Maozu Guo, Qiguo Dai, and Xiaoyan Liu.

Acknowledgments

Maozu Guo is supported by Natural Science Foundation of China (61271346) and Specialized Research Fund for the Doctoral Program of Higher Education of China (20112302110040). Xiaoyan Liu is supported by Natural Science Foundation of China (61172098 and 91335112).