Abstract

Nature often brings several domains together to form multidomain and multifunctional proteins with a vast number of possibilities. In our previous study, we showed that protein function prediction is naturally and inherently a Multi-Instance Multilabel (MIML) learning task. Automated protein function prediction is typically implemented under the assumption that the functions of labeled proteins are complete; that is, there are no missing labels. In practice, however, only a subset of the functions of a protein is known, and whether the protein has other functions is unknown. Protein function prediction therefore suffers from the weak-label problem, and prediction with incomplete annotation matches well with the MIML with weak-label learning framework. In this paper, we apply the state-of-the-art MIML with weak-label learning algorithm MIMLwel to predict protein functions in two typical real-world electricigens that have been widely used in microbial fuel cell (MFC) research. Our experimental results validate the effectiveness of the MIMLwel algorithm in predicting protein functions with incomplete annotation.

1. Introduction

Automated annotation of protein functions is challenging in the postgenomic era. With the rapid growth in the number of sequenced genomes, the overwhelming majority of protein products can only be annotated by computational approaches [1]. Nature usually brings multiple domains together to construct multidomain and multifunctional proteins with a vast number of possibilities [2]. A large share of genomic proteins, two-thirds in unicellular organisms and more than 80% in Metazoa, are multidomain proteins [3]. In a multidomain protein, each domain can fulfill its own function independently or in a coordinated manner with its neighbors [4]. Zhou and Zhang [5] proposed the Multi-Instance Multilabel learning (MIML) framework, in which one object is represented by a bag of instances and may have several labels simultaneously. Labels of training examples are known; however, labels of instances are unknown. We can regard each domain as an input instance and represent each biological function with an output label. In our previous study, we showed that protein function prediction is naturally and inherently an MIML learning task [6]. Previously, prediction of protein functions was typically carried out under the assumption that the functions of labeled proteins are complete; that is, there are no missing labels [7, 8]. In practice, however, we know only a part of the functions of a protein, and whether the protein has other functions is unknown. Namely, these proteins have an incomplete annotation of their functions [9]. This kind of protein function prediction problem with incomplete annotation can be referred to as the Multi-Instance Multilabel with weak-label learning task.

During the past several years, many Multi-Instance Multilabel learning algorithms have been developed [5, 10–12]. In our previous study, we proposed an ensemble MIML learning framework, EnMIMLNN, and designed three algorithms for protein function prediction tasks by combining the advantages of three kinds of Hausdorff distance metrics [6]. Meanwhile, several algorithms have been proposed in the past few years for the weak-label learning problem. Sun et al. studied the weak-label learning problem in multilabel learning and proposed a method called weak-label learning (WELL) [13]. WELL assumes that the classification boundary for each label should go across low-density regions and that any given label will not be associated with the majority of instances [13]. Bucak et al. [14] studied the incomplete class assignment task for annotating images and proposed an approach called MLR-GR, which optimizes the ranking errors and a group Lasso loss by a convex optimization approach. Qi et al. [15] applied the Hierarchical Dirichlet Process to append missing labels for a set of images. In addition, Wang et al. [16] designed an approach for annotating weakly labeled facial images.

Although the underlying nature of predicting protein functions with incomplete annotation matches well with the Multi-Instance Multilabel with weak-label learning framework, no attempt has so far been made under this learning framework. Jiang proposed a multilabel semisupervised learning algorithm, PfunBG, to predict protein functions by employing a birelational graph (BG) of proteins and function annotations [17]. Yu et al. [7, 8] proposed a protein function prediction method with multilabel weak-label learning (ProWL) and a variant of ProWL (ProWL-IF) in order to complete the partial annotation of proteins. Both ProWL and ProWL-IF replenish the functions of proteins under the assumption that proteins are partially annotated [7, 8]. However, the multilabel learning framework is a degenerated version of the MIML learning framework [5, 12]. Such degenerated strategies may lose useful information in the instance space, which in turn hurts prediction performance [5, 12]. Recently, Yang et al. [18] proposed the MIMLwel (MIML with weak-label) approach, which works by assuming that highly relevant labels share some common instances and that the underlying class means of bags for each label are separated with a large margin. MIMLwel makes use of the label relationship, and experiments have validated its effectiveness in handling the Multi-Instance Multilabel with weak-label learning problem [18].

Microbial fuel cells (MFCs) are devices that can use bacterial metabolism to produce an electrical current from a wide range of organic substrates [19]. Due to the promise of sustainable energy production from organic wastes, research in the MFC field has intensified in the last few years [19]. In this paper, we apply the MIMLwel algorithm to annotate protein functions in two typical real-world electricigen genomes (i.e., Geobacter sulfurreducens and Shewanella loihica PV-4) that have been widely used in MFC research. Our experimental results validate the effectiveness of the MIMLwel algorithm in predicting functions of proteins in electricigen genomes with incomplete annotation. In addition, it is worth mentioning that our approach is a general method for predicting protein functions with incomplete annotation.

2. The Formulation of the Protein Function Prediction Task with Incomplete Annotation

Nature often assembles multiple domains together to form multidomain and multifunctional proteins with a vast number of possibilities, and each domain may implement its own function independently or in cooperation with its neighbors. We can regard each domain as an input instance and take each biological function as an output label. Labels of the training examples are known; however, labels of instances are unknown. In our previous work, we showed that protein function prediction is naturally and inherently a Multi-Instance Multilabel (MIML) learning task [6]. Previous studies typically predict the functions of proteins under the assumption that the functions of labeled proteins are complete; that is, there are no missing labels. In most real cases, however, we know only a subset of the functions of a protein, and whether the protein has other functions is unknown. Namely, these proteins have an incomplete annotation for molecular functions [9]. This type of protein function prediction problem with incomplete annotation can be referred to as the Multi-Instance Multilabel with weak-label learning task.

We study the Multi-Instance Multilabel weak-label learning framework for protein function prediction with incomplete annotation in two tasks, as illustrated in Table 1. In the tables, each row gives the function annotation of a protein, and each column denotes a function label. Table 1(a) presents the completely annotated proteins, with 1 and 0 showing the function annotations (F1–F5) of the six proteins P1–P6. In Task 1, shown in Table 1(b), 1 denotes a known relevant function, “?” represents a missing function that is set to 0, and all the 0s are candidates for being predicted as relevant. In Task 2, shown in Table 1(c), the definitions of 1 and 0 are the same as in Table 1(b). Here, however, the aim of weak-label learning is to make use of the incompletely annotated proteins (P1–P4) to predict the functions of proteins P5 and P6, which are completely unlabeled.

Formally, we represent by $\{(X_i, Y_i)\}_{i=1}^{m}$ the training dataset with $m$ examples. $X_i$ is the $i$th protein in the training dataset, and is a bag with $n_i$ instances $\{x_{i,1}, \ldots, x_{i,n_i}\}$. $Y_i$ denotes the Gene Ontology terms which are assigned to $X_i$, and is a label vector with $L$ labels, where $y_{i,l} = 1$ if the $l$th label is positive for $X_i$, and 0 otherwise. Note that the labels of the instances $x_{i,j}$ are untagged. In the MIML weak-label setting, the full label matrix $Y$ is unknown and instead we are just given a partial label matrix $\hat{Y}$. Specifically, for $X_i$, a label vector $\hat{Y}_i$ is given, where $\hat{y}_{i,l} = 1$ if the $l$th label is assigned for $X_i$, and 0 otherwise. Different from the full label matrix, $\hat{y}_{i,l} = 0$ tells us nothing about whether the $l$th label is relevant. The goal is to predict all the positive labels for unseen bags [18].
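The bag-and-partial-label representation described above can be sketched in a few lines of Python. This is only an illustration of the data layout: the class name `ProteinBag`, the helper functions, and the toy protein are hypothetical and do not come from the MIMLwel implementation in [18].

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProteinBag:
    """One MIML example: a protein represented as a bag of domain instances."""
    protein_id: str
    instances: List[List[float]]  # one feature vector per domain (instance labels are unknown)
    observed_labels: List[int]    # partial label vector: 1 = known function, 0 = unknown


def known_functions(bag: ProteinBag, go_terms: List[str]) -> List[str]:
    """GO terms that are annotated (observed as 1) for this protein."""
    return [t for t, y in zip(go_terms, bag.observed_labels) if y == 1]


def candidate_functions(bag: ProteinBag, go_terms: List[str]) -> List[str]:
    """GO terms observed as 0: in the weak-label setting these may still be relevant."""
    return [t for t, y in zip(go_terms, bag.observed_labels) if y == 0]


# Toy example with three hypothetical function labels and a two-domain protein.
GO_TERMS = ["F1", "F2", "F3"]
p1 = ProteinBag("P1", instances=[[0.1, 0.9], [0.4, 0.6]], observed_labels=[1, 0, 0])
```

Under the weak-label setting, `candidate_functions(p1, GO_TERMS)` lists labels that a learner must treat as "not yet known" rather than as confirmed negatives.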

3. Datasets and Methods

3.1. Data and Feature Extraction

Microbial fuel cells (MFCs) are devices that can make use of bacterial metabolism to obtain an electrical current from a wide range of organic substrates [19]. Due to the promise of sustainable energy production from organic wastes, research in this field has been booming during the last few years [19]. Recently, the increased interest in MFC technology was highlighted by the discovery of Geobacter sulfurreducens, a bacterial strain capable of high current production [19]. In addition, the genome-wide sequences of multiple Shewanella strains have been completed and annotated, opening the door to exploring the diversity of their extracellular electron transfer mechanisms [20]. In this paper, two typical real-world electricigens that have been widely used in MFC research (i.e., Geobacter sulfurreducens and Shewanella loihica PV-4) are considered for protein function prediction. For each organism, the complete proteome with manually annotated functions was downloaded from the Universal Protein Resource (UniProt) databank [21] (release of April 2014) by querying the terms {“organism name” AND “reviewed: yes” AND “keyword: Complete proteome”}.

Redundancy among the protein sequences of each organism is removed by clustering with the blastclust executable program in the BLAST package [22] from NCBI, using a sequence identity threshold of 90%, and a nonredundant dataset is obtained by keeping only the longest sequence in each cluster for each organism [23]. Then, each nonredundant dataset is uploaded as a txt file to the Batch CD-Search server [24] of NCBI to obtain the conserved domains of each protein. Each domain is represented by a frequency vector with 216 dimensions, where each element indicates the frequency of a triad type [25]. Protein function can be annotated in several ways, and the most well-known and widely used scheme is given by the Gene Ontology Consortium [26], which offers an ontology in three aspects: molecular function, biological process, and cellular component. In this study, we concentrate on the molecular function aspect. We obtain the manually annotated GO molecular function terms of a protein from the downloaded UniProt format text file. Then, the same scheme as in [27] is adopted to produce label vectors for a protein based on the hierarchical directed acyclic graph (DAG) of GO molecular function, and the latest version (December 2006) of the GO function ontology is adopted as the basis of the functional terms and their relations in this work.
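As a rough illustration of the 216-dimensional triad representation, the sketch below groups the 20 amino acids into six classes (6 × 6 × 6 = 216 triad types) and counts the class triads formed by each window of three consecutive residues. The particular six-class grouping, the normalization by the total triad count, and all names are assumptions made for this example; the exact encoding is the one defined in [25].

```python
from itertools import product

# Hypothetical grouping of the 20 amino acids into 6 classes (illustration only).
AA_CLASSES = {
    0: "AGV",
    1: "ILFP",
    2: "YMTS",
    3: "HNQW",
    4: "RK",
    5: "DEC",
}
AA_TO_CLASS = {aa: c for c, members in AA_CLASSES.items() for aa in members}

# All 6^3 = 216 ordered class triads, each mapped to a fixed vector position.
TRIADS = list(product(range(6), repeat=3))
TRIAD_INDEX = {triad: i for i, triad in enumerate(TRIADS)}


def triad_frequencies(domain_sequence: str) -> list:
    """Return a 216-dimensional vector of relative triad frequencies for one domain."""
    classes = [AA_TO_CLASS.get(aa) for aa in domain_sequence.upper()]
    counts = [0] * len(TRIADS)
    total = 0
    for i in range(len(classes) - 2):
        triad = tuple(classes[i:i + 3])
        if None in triad:  # skip windows containing non-standard residues
            continue
        counts[TRIAD_INDEX[triad]] += 1
        total += 1
    if total == 0:
        return [0.0] * len(TRIADS)
    return [c / total for c in counts]


vec = triad_frequencies("AGVILFRKDE")
```

Each conserved domain of a protein would yield one such vector, so a protein with several domains becomes a bag of 216-dimensional instances.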

Under the MIML learning framework, each protein is described as a bag of instances, where each instance represents a domain, and the protein as a whole is tagged with a set of GO molecular function terms (multiple labels). Detailed descriptions of the datasets, that is, the complete proteomes of the two organisms above, are shown in Table 2. For example, there are 373 proteins (examples) with a total of 344 Gene Ontology terms (label classes) on molecular function in the Shewanella loihica PV-4 dataset (Table 2). The average number of instances (domains) per bag (protein) and the average number of labels (GO terms) per example (protein) are also given in Table 2.

3.2. The MIMLwel Approach

In this paper, the MIMLwel (MIML with weak-label) approach is adopted for the weak-label setting [18]. MIMLwel assumes that highly relevant labels usually share common instances, and the underlying class means of bags for each label are separated with a large margin [18].

Formally, the training dataset with $m$ examples can be represented by $\{(X_i, Y_i)\}_{i=1}^{m}$. $X_i$ corresponds to the $i$th example in the training dataset, and is a bag with $n_i$ instances $\{x_{i,1}, \ldots, x_{i,n_i}\}$. $Y_i$ denotes the labels which are assigned to $X_i$, and is a label vector with $L$ labels, where $y_{i,l} = 1$ if the $l$th label is positive for $X_i$, and 0 otherwise. Notice that the labels of the instances $x_{i,j}$ are unknown. In the MIML weak-label setting, however, only a subset of labels are tagged. Specifically, for $X_i$, a label vector $\hat{Y}_i$ is given, where $\hat{y}_{i,l} = 1$ if the $l$th label is assigned for $X_i$, and 0 otherwise. The goal is to predict all the positive labels for unseen bags [18].

For simplicity, linear models are employed, one for each label; that is, $f_l(x) = w_l^{\top} x$, where each $w_l$ denotes a d-dimensional linear predictor and $w_l^{\top}$ is the transpose of $w_l$. To make use of the label relationship, a label relation matrix $R$ is considered, where $R_{l,l'} = 1$ if the two labels $l$ and $l'$ are related, and 0 otherwise. Let $W_{l,l'} = [w_l, w_{l'}]$ for the pair of related labels $(l, l')$. MIMLwel assumes that highly related labels usually share common instances, indicating that many rows of $W_{l,l'}$ should be equal to zero; this can be characterized by a row-sparsity term on $W_{l,l'}$ in its convexly relaxed form. Thus, the goal of MIMLwel is to obtain the predictors $w_1, \ldots, w_L$ and an output matrix that minimize the regularized objective given in [18], which combines a loss function for each label with norm-based regularization; one parameter controls the sparsity of the predictors, and another trades off the empirical risk and model complexity.
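The idea of penalizing non-zero rows can be made concrete with a small sketch. It assumes, as is standard for row sparsity, that the relaxed term is the ℓ2,1 norm (the sum of row-wise Euclidean norms) standing in for the ℓ2,0 "norm" (the number of non-zero rows); the function names and the toy matrix are illustrative, and the exact regularizer used by MIMLwel is the one specified in [18].

```python
import math


def l21_norm(W):
    """Sum of the Euclidean norms of the rows of W (convex surrogate for row sparsity)."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in W)


def l20_norm(W, tol=1e-12):
    """Number of rows of W whose Euclidean norm exceeds tol (non-convex row count)."""
    return sum(1 for row in W if math.sqrt(sum(v * v for v in row)) > tol)


# Toy paired predictor matrix for two related labels: each row is one feature,
# each column one label. The second feature is unused by both labels.
W_pair = [
    [3.0, 4.0],
    [0.0, 0.0],
    [0.0, 5.0],
]
```

For this toy matrix the ℓ2,1 value is 5 + 0 + 5 = 10, while only two rows are non-zero; minimizing the ℓ2,1 term pushes entire rows toward zero, so related labels end up relying on a shared, small set of features.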

3.3. Experimental Configuration

In this paper, we adopt three popular multilabel learning evaluation criteria, that is, Hamming loss (HL), macro-F1 (maF1), and micro-F1 (miF1) [28–30]. Hamming loss assesses how many times, on average, a bag-label pair is wrongly predicted; the smaller the Hamming loss, the better the performance. Macro-F1 first computes the F1 measure on each class label and then averages over all class labels, so it is more influenced by the performance on classes with fewer examples. Micro-F1 calculates the F1 measure globally over all bags and all class labels, so it is more affected by the performance on classes with more examples. For both macro-F1 and micro-F1, the larger the value, the better the performance. The definitions of these criteria can be found in [30]. We repeat 10-fold cross validation ten times for each dataset and report the mean ± std. performance of the proposed and compared methods.
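The three criteria can be computed from binary label matrices as in the following minimal sketch. The function names and the toy matrices are illustrative, and the convention of returning an F1 of 0 when a label has no true or predicted positives is an assumption of this example; the formal definitions are those in [30].

```python
def hamming_loss(y_true, y_pred):
    """Fraction of bag-label pairs that are predicted wrongly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    return wrong / total


def _confusion(y_true, y_pred, label):
    """True positives, false positives, and false negatives for one label column."""
    tp = fp = fn = 0
    for rt, rp in zip(y_true, y_pred):
        t, p = rt[label], rp[label]
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
    return tp, fp, fn


def _f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumed convention for empty labels


def macro_f1(y_true, y_pred):
    """Average of per-label F1 scores: every label counts equally."""
    n_labels = len(y_true[0])
    return sum(_f1(*_confusion(y_true, y_pred, l)) for l in range(n_labels)) / n_labels


def micro_f1(y_true, y_pred):
    """F1 over counts pooled across all labels: frequent labels dominate."""
    tp = fp = fn = 0
    for l in range(len(y_true[0])):
        a, b, c = _confusion(y_true, y_pred, l)
        tp, fp, fn = tp + a, fp + b, fn + c
    return _f1(tp, fp, fn)


# Toy example: two proteins, three function labels.
Y_TRUE = [[1, 0, 1], [0, 1, 0]]
Y_PRED = [[1, 0, 0], [0, 1, 1]]
```

In the toy example two of the six bag-label pairs are wrong, so the Hamming loss is 2/6; the third label is never predicted correctly, which lowers macro-F1 and micro-F1 to 2/3 each.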

4. Results and Discussion

4.1. Performance of the MIMLwel Method

In our experiments we consider four weak-label ratios (W.L.R.) [18], where the W.L.R. is the proportion of a protein's positive labels that are retained as known annotations, from 20% to 80% at intervals of 20%. Table 3 shows the performance of MIMLwel at each W.L.R. on the Geobacter sulfurreducens and Shewanella loihica PV-4 datasets. For each evaluation criterion, ↑ (↓) indicates that larger (smaller) values mean better performance; the best results on each evaluation criterion are highlighted in boldface. As indicated in Table 3, the performance of MIMLwel improves markedly as the W.L.R. rises.
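A weak-label training set at a given W.L.R. can be simulated by hiding part of each protein's positive labels. The sketch below assumes that the W.L.R. is the fraction of positive labels kept per protein, that the kept count is rounded up, and that at least one positive label is always retained; these choices and the function name are illustrative and may differ from the protocol in [18].

```python
import math
import random


def mask_labels(label_vector, ratio, rng):
    """Keep roughly `ratio` of the positive labels; hidden positives are set to 0."""
    positives = [i for i, y in enumerate(label_vector) if y == 1]
    if not positives:
        return list(label_vector)
    # Assumed rounding rule: round up, and never hide every positive label.
    keep = min(len(positives), max(1, math.ceil(ratio * len(positives))))
    kept = set(rng.sample(positives, keep))
    return [1 if i in kept else 0 for i in range(len(label_vector))]


rng = random.Random(0)          # fixed seed so the masking is reproducible
full = [1, 1, 0, 1, 1]          # complete annotation of one protein
observed = mask_labels(full, 0.5, rng)
```

After masking, a 0 in `observed` no longer distinguishes a true negative from a hidden positive, which is exactly the ambiguity a weak-label learner has to handle.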

The MIMLwel approach [18] involves two parameters, that is, the scaling factor and the fraction parameter. Figure 1 shows how the MIMLwel algorithm performs on the two datasets with an 80% weak-label ratio (W.L.R.) under different parameter configurations, where the performance is measured in terms of HL, maF1, and miF1. Here, the scaling factor varies from 0.2 to 1.0 at intervals of 0.2 while the fraction parameter is fixed at 0.1, and the fraction parameter increases from 0.02 to 0.1 at intervals of 0.02 while the scaling factor is fixed at 1.0. The performance of MIMLwel peaks in most cases when the scaling factor is set to 1.0 and the fraction parameter to 0.1, and these settings are used for MIMLwel throughout this paper.

4.2. Performance Comparison

In this paper, we compare the MIMLwel algorithm with four state-of-the-art MIML algorithms, that is, MIMLkNN [31], MIMLNN [12], MIMLRBF [32], and MIMLSVM [5], under different weak-label ratios (W.L.R.) on the Geobacter sulfurreducens dataset (Table 4) and the Shewanella loihica PV-4 dataset (Table 5). The codes of the compared MIML algorithms are shared by their authors, and these algorithms are run with the best parameters reported in the respective papers. Specifically, for MIMLkNN, the number of nearest neighbors and the number of citers are set to 10 and 20, respectively [31]; for MIMLNN, the number of clusters is set to 40% of the training bags, and the regularization parameter used to compute the matrix inverse is set to 1 [12]; for MIMLRBF, the scaling factor and the fraction parameter are set to 0.6 and 0.1, respectively [32]; for MIMLSVM, the number of clusters is set to 20% of the training bags and the Gaussian kernel width is set to 0.2 [5]. For each evaluation criterion, “↓” indicates “the smaller the better,” while “↑” indicates “the bigger the better,” and the best results on each evaluation criterion are highlighted in boldface. The MIMLwel algorithm performs well in terms of most criteria on both datasets (Tables 4 and 5). Specifically, paired t-tests at the 95% confidence level indicate that MIMLwel achieves significantly better performance than the compared methods in most cases, as shown by the predominance of ● marks in Tables 4 and 5.
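A paired t-test of this kind compares two methods on the same cross-validation runs. The sketch below computes the paired t statistic from per-run scores and compares it with the two-sided 5% critical value for 9 degrees of freedom (2.262), which applies when there are ten paired runs; the number of paired samples actually used in the paper, the function names, and any scores passed in are assumptions of this example.

```python
import math
import statistics

# Two-sided critical value of Student's t at the 5% level for 9 degrees of freedom,
# i.e. for ten paired runs (assumed sample size).
T_CRIT_DF9 = 2.262


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-run differences between two methods."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both methods must be scored on the same runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


def significantly_different(scores_a, scores_b, t_crit=T_CRIT_DF9):
    """True if the absolute t statistic exceeds the supplied critical value."""
    return abs(paired_t_statistic(scores_a, scores_b)) > t_crit
```

The critical value must match the number of paired runs, so `t_crit` should be replaced when the comparison is based on a different number of repetitions.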

4.3. Case Study

Table 6 presents two example results. The first protein, with the UniProt ID “Q74BW7” from Geobacter sulfurreducens, has seven ground-truth labels: {GO:0008270, GO:0046872, GO:0000287, GO:0051539, GO:0030145, GO:0005506, GO:0004160}. After each MIML method is trained on examples with an 80% weak-label ratio, the trained model is used to predict the GO molecular function labels of this protein. The GO molecular function labels correctly predicted by each method are highlighted in boldface. Table 6 shows that MIMLwel successfully predicts most of the ground-truth labels (6/7); however, it predicts one additional label, GO:0005524, which is not in the ground-truth list. Nevertheless, the label GO:0005524, which denotes “ATP binding,” does not necessarily conflict with the true molecular functions recorded in UniProt. MIMLRBF and EnMIMLNN{metric} each predict two ground-truth labels but miss the other five (5/7). MIMLNN reports no prediction result, and MIMLSVM reports only a wrong GO molecular function label. A similar situation occurs in the second example, with the UniProt ID “A3QFX5” from Shewanella loihica PV-4, as indicated in Table 6.

5. Conclusion

In our previous study, we showed that protein function prediction is naturally and inherently a Multi-Instance Multilabel (MIML) learning task. Automated protein function prediction has typically been implemented under the assumption that the functions of labeled proteins are complete; that is, there are no missing labels. In practice, however, only a subset of the functions of a protein is known, and whether the protein has additional functions is unknown. Protein function prediction therefore suffers from the weak-label problem, and in this paper we show that prediction of protein functions with incomplete annotation matches well with the MIML with weak-label learning framework. We have applied the state-of-the-art MIML with weak-label learning algorithm MIMLwel to predict protein functions in two typical real-world electricigens that have been widely used in microbial fuel cell (MFC) research. Our experimental results show that MIMLwel outperforms the compared state-of-the-art MIML algorithms in most cases, which validates the effectiveness of the MIMLwel algorithm in predicting protein functions with incomplete annotation.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was supported by the National Science Foundation of China (61203289, 61071092, and 61205057), China Postdoctoral Science Foundation (20110490129, 2013T60523), and Natural Science Foundation of the Higher Education Institutions of Jiangsu Province, China (12KJB520010).