Abstract

Topology prediction of transmembrane proteins is an active research field in bioinformatics and molecular biology, and it is a typical pattern recognition problem. Because experimental techniques are restricted by many stringent conditions, various algorithms have been developed to predict transmembrane protein topology. These individual prediction algorithms usually depend on different principles, such as the hydrophobicity or the charges of residues. In this paper, an evidential topology prediction method for transmembrane proteins, called TOPPER (topology prediction of transmembrane protein based on evidential reasoning), is proposed. In the proposed method, the prediction results of multiple individual prediction algorithms are transformed into BPAs (basic probability assignments) according to the confusion matrix of each algorithm. The final prediction result is then obtained by combining the individual predictions with Dempster’s rule of combination. The experimental results show that the proposed method is superior to the individual prediction algorithms, which illustrates its effectiveness.

1. Introduction

According to the present genome data, roughly 20–30% of the genes in a typical organism code for α-helical transmembrane (TM) proteins [1–3]. Transmembrane proteins are the principal executors of the biomembrane’s functions and play many important roles in the cell, such as substance transportation and energy conversion. In order to explore the structure, function, and transmembrane mechanism of transmembrane proteins, topology prediction has become an active field in bioinformatics and molecular biology [1, 2, 4].

The topology of a transmembrane protein [5], that is, the number and position of the transmembrane helixes and the in/out location of the N and C termini of the protein sequence, is an important issue in the research on transmembrane proteins. For a protein sequence, if both the transmembrane helixes and the locations of the N and C termini have been predicted correctly, the topology of the protein sequence is said to be predicted correctly. Recently, information science and technology have been widely used in biology and medicine [6–8]. In essence, the topology prediction of transmembrane proteins is a typical pattern recognition problem. As shown in Figure 1, given a protein sequence, the task is to determine the class label of each residue among the three classes “i” (intracellular), “M” (transmembrane), and “o” (extracellular). At present, the most accurate methods to determine the topology of a transmembrane protein are experimental techniques, such as nuclear magnetic resonance (NMR) and X-ray crystal diffraction. However, these experimental techniques usually require strict conditions, so they cannot be applied on a large scale and cannot keep pace with the growing number of protein sequences. Therefore, various computational methods have been developed to predict the topology of transmembrane proteins [9–11].

Generally speaking, previous studies have produced three primary kinds of algorithms to predict the topology of transmembrane proteins. The first kind is based on the chemical or physical properties of amino acids, for example, the hydrophobicity of residues or the charges of residues in different locations. Classical examples include TopPred [2] and others [12, 13]. The second kind is based on the statistical analysis of a large number of transmembrane proteins with known structure, such as MEMSAT [14], TMAP [10], and PRED-TMR [15]. In the third kind, machine learning technologies such as the hidden Markov model (HMM) and the support vector machine (SVM) have been introduced to the prediction of transmembrane protein topology. A series of algorithms has been developed, for example, HMMTOP [11], PHDhtm [16, 17], and so forth [18–21].

As described above, many algorithms exist for the prediction of transmembrane protein topology, but different algorithms depend on different principles, and their applicable scopes are different. For a prediction system, if more information is taken into consideration, the prediction ability of the system is generally expected to be stronger. Essentially, this is the viewpoint of ensemble learning [22–25]. Applying this idea to the topology prediction of transmembrane proteins, the various prediction algorithms are treated as basic predictors, and the task is to combine multiple predictors into a combination predictor that performs better than the basic predictors. Within this process, there are two critical problems, that is, the representation of each predictor’s prediction results and the method used to combine multiple predictors. In regard to the representation of a predictor’s prediction results, as Xu et al. [23] pointed out, three types of output information can be utilized for different prediction algorithms, namely, information at the abstract level, the rank level, and the measurement level. As to the combination method, traditional methodologies are usually based on the framework of probability theory, which is very effective for handling randomness. However, in the real world there are various uncertainties, including not only randomness but also fuzziness, incompleteness, and so forth [26, 27].

As a theory of evidential reasoning under uncertainty, the Dempster-Shafer theory of evidence [28, 29] has the advantage of directly expressing various uncertainties and has been widely used in many fields [30–37]. It provides a general and effective framework for the representation and combination of multiple individual algorithms. In this paper, a new topology prediction method for transmembrane proteins based on the evidential reasoning approach, called TOPPER, is proposed. In the proposed TOPPER method, the prediction results of each basic predictor are represented by a basic probability assignment (BPA) constructed from the confusion matrix of the predictor. Then, the basic predictors are combined by using Dempster’s rule of combination. Finally, the topology of a transmembrane protein sequence is determined according to the combined prediction results. An experiment demonstrates the effectiveness of the proposed prediction method.

The rest of this paper is organized as follows. Section 2 introduces some basic concepts about the Dempster-Shafer theory of evidence. In Section 3 the proposed method is presented. Section 4 gives experimental verification to demonstrate the effectiveness of the proposed method. Conclusions are given in Section 5.

2. Preliminaries

In this section, a few concepts commonly used in the Dempster-Shafer theory of evidence are introduced.

The Dempster-Shafer theory of evidence [28, 29], also called the Dempster-Shafer theory or evidence theory, is used to deal with uncertain information. As an effective theory of evidential reasoning, the Dempster-Shafer theory has the advantage of directly expressing various uncertainties. This theory needs weaker conditions than the Bayesian theory of probability, so it is often regarded as an extension of the Bayesian theory. For completeness of the explanation, a few basic concepts are introduced as follows.

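Definition 1. Let $\Omega$ be a set of $N$ mutually exclusive and collectively exhaustive events, indicated by $\Omega = \{E_1, E_2, \ldots, E_N\}$. The set $\Omega$ is called the frame of discernment. The power set of $\Omega$ is indicated by $2^{\Omega}$, where $2^{\Omega} = \{\emptyset, \{E_1\}, \ldots, \{E_N\}, \{E_1, E_2\}, \ldots, \Omega\}$. If $A \in 2^{\Omega}$, $A$ is called a proposition.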

Definition 2. For a frame of discernment $\Omega$, a mass function is a mapping $m$ from $2^{\Omega}$ to $[0, 1]$, formally defined by
$$m : 2^{\Omega} \to [0, 1],$$
which satisfies the following condition:
$$m(\emptyset) = 0 \quad \text{and} \quad \sum_{A \in 2^{\Omega}} m(A) = 1.$$
In the Dempster-Shafer theory, a mass function is also called a basic probability assignment (BPA). If $m(A) > 0$, $A$ is called a focal element; the union of all focal elements is called the core of the mass function.

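Definition 3. For a proposition $A \subseteq \Omega$, the belief function $Bel : 2^{\Omega} \to [0, 1]$ is defined as
$$Bel(A) = \sum_{B \subseteq A} m(B).$$
The plausibility function $Pl : 2^{\Omega} \to [0, 1]$ is defined as
$$Pl(A) = 1 - Bel(\bar{A}) = \sum_{B \cap A \neq \emptyset} m(B),$$
where $\bar{A} = \Omega - A$.
Obviously, $Pl(A) \geq Bel(A)$; the functions $Bel$ and $Pl$ are the lower limit function and upper limit function of proposition $A$, respectively.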
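
As a small illustration of Definitions 2 and 3, the following Python sketch stores a BPA as a dictionary from focal elements (frozensets) to masses and evaluates the belief and plausibility of a proposition. The data structure, the function names, and the numerical masses are assumptions chosen for illustration; they are not taken from the paper.

```python
# Illustrative sketch: a BPA stored as {frozenset: mass}; the masses are made up.
FRAME = frozenset({"i", "M", "o"})


def belief(m, a):
    """Bel(A): total mass of the focal elements contained in A."""
    return sum(mass for b, mass in m.items() if b <= a)


def plausibility(m, a):
    """Pl(A): total mass of the focal elements that intersect A."""
    return sum(mass for b, mass in m.items() if b & a)


m = {frozenset({"M"}): 0.6, frozenset({"i", "o"}): 0.3, FRAME: 0.1}
a = frozenset({"M"})
bel_a = belief(m, a)        # only {M} is contained in {M}
pl_a = plausibility(m, a)   # {M} and the whole frame intersect {M}
```

Here the interval between `bel_a` and `pl_a` is the part of the mass that is assigned to the whole frame, that is, the mass that neither supports nor contradicts the proposition.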
Consider two pieces of evidence indicated by two BPAs $m_1$ and $m_2$ on the frame of discernment $\Omega$; Dempster’s rule of combination is used to combine them. This rule assumes that these BPAs are independent.

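Definition 4. Dempster’s rule of combination, also called the orthogonal sum and denoted by $m = m_1 \oplus m_2$, is defined as follows:
$$m(A) = \begin{cases} \dfrac{1}{1 - K} \displaystyle\sum_{B \cap C = A} m_1(B)\, m_2(C), & A \neq \emptyset, \\ 0, & A = \emptyset, \end{cases}$$
with
$$K = \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C).$$
Note that Dempster’s rule of combination is only applicable to two BPAs that satisfy the condition $K < 1$.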
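
The orthogonal sum can be sketched in a few lines of Python. This is a minimal illustration that assumes BPAs are stored as dictionaries from frozensets to masses; the function name `dempster` and the example masses are hypothetical.

```python
from itertools import product


def dempster(m1, m2):
    """Combine two BPAs ({frozenset: mass}) with Dempster's rule."""
    combined = {}
    conflict = 0.0  # mass of all pairs with an empty intersection
    for (b, mass_b), (c, mass_c) in product(m1.items(), m2.items()):
        inter = b & c
        if inter:
            combined[inter] = combined.get(inter, 0.0) + mass_b * mass_c
        else:
            conflict += mass_b * mass_c
    if conflict >= 1.0:
        raise ValueError("total conflict: the rule is undefined")
    # Renormalize the non-conflicting mass so that it sums to one.
    return {a: mass / (1.0 - conflict) for a, mass in combined.items()}


# Two made-up pieces of evidence about one residue.
m1 = {frozenset({"M"}): 0.8, frozenset({"i", "o"}): 0.2}
m2 = {frozenset({"M"}): 0.7, frozenset({"i", "o"}): 0.3}
m12 = dempster(m1, m2)
```

In this example the conflicting mass is 0.8 × 0.3 + 0.2 × 0.7 = 0.38, so the agreeing mass on {M}, 0.56, is rescaled to 0.56 / 0.62.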

3. Proposed Method

In this section, a new transmembrane protein topology prediction method based on evidential reasoning is proposed. For convenience, it is abbreviated as TOPPER (topology prediction of transmembrane protein based on evidential reasoning). TOPPER is based on the combination of multiple individual prediction algorithms. The construction of the combination predictor is presented step by step as follows.

3.1. The Selection of Basic Predictor

Because the proposed topology prediction method is a combination of multiple individual prediction methods, the basic predictors should be constructed first. Here, five individual prediction algorithms, OCTOPUS [3], PRO-TMHMM and PRODIV-TMHMM [38], SCAMPI-msa, and SCAMPI-seq [13], have been selected to construct the basic predictors. In pattern recognition, the prediction performance of a predictor is expressed by its confusion matrix. In the topology prediction of transmembrane proteins, since there are only three classes, “i” (intracellular), “M” (transmembrane), and “o” (extracellular), the confusion matrix of the basic predictor $e_k$ is formulated by
$$CM^{k} = \begin{pmatrix} n^{k}_{ii} & n^{k}_{iM} & n^{k}_{io} \\ n^{k}_{Mi} & n^{k}_{MM} & n^{k}_{Mo} \\ n^{k}_{oi} & n^{k}_{oM} & n^{k}_{oo} \end{pmatrix},$$
where each item $n^{k}_{pq}$, with $p, q \in \{i, M, o\}$, is the number of residues belonging to class $p$ but predicted as class $q$ by the basic predictor $e_k$.
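
To make the definition concrete, the sketch below counts a residue-level confusion matrix from an observed and a predicted label string. The nested-dictionary layout (rows are true classes, columns are predicted classes), the function name, and the two short label strings are assumptions made for illustration only.

```python
CLASSES = ("i", "M", "o")


def confusion_matrix(observed, predicted):
    """Count residues of true class p (row) that were predicted as class q (column)."""
    if len(observed) != len(predicted):
        raise ValueError("label strings must have the same length")
    counts = {p: {q: 0 for q in CLASSES} for p in CLASSES}
    for p, q in zip(observed, predicted):
        counts[p][q] += 1
    return counts


# Made-up labels for a 12-residue toy sequence.
observed = "iiiMMMMMoooo"
predicted = "iiMMMMMooooo"
cm = confusion_matrix(observed, predicted)
```

In practice the counts would be accumulated over all residues of all proteins in the data used to characterize a predictor.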

3.2. The Representation of the Basic Predictor’s Prediction Results

In the combination of multiple predictors, the representation of each basic predictor’s prediction results is a critical problem. In this paper, BPAs are used to represent these prediction results. The remaining question is how to construct the BPAs. For example, suppose that a basic predictor predicts that a residue in a protein sequence belongs to a transmembrane helix (i.e., class “M”). Because the prediction is not 100% reliable, this uncertainty must be represented. Here, a classical and effective method proposed by Xu et al. [23] is adopted to construct the BPAs. In Xu et al.’s method, the output is treated as a single class label, and the source of evidence for the propositions of interest is defined on the basis of the performance of the predictor in terms of recognition, substitution, and rejection rates, which are generated from the confusion matrix. In brief, it is a BPA construction method based on the confusion matrix.

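For a predictor of transmembrane protein topology with confusion matrix $CM$, according to Xu et al.’s method [23], a BPA $m_j$ can be constructed for each class $j$ by
$$m_j(A) = \begin{cases} \varepsilon_r^{j}, & A = \{j\}, \\ \varepsilon_s^{j}, & A = \Omega \setminus \{j\}, \\ 0, & \text{otherwise}, \end{cases}$$
with
$$\varepsilon_r^{j} = \frac{n_{jj}}{\sum_{p} n_{pj}}, \qquad \varepsilon_s^{j} = \frac{\sum_{p \neq j} n_{pj}}{\sum_{p} n_{pj}},$$
where $j, p \in \{i, M, o\}$ and $\Omega = \{i, M, o\}$.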

For a residue in a protein sequence, the constructed BPA is $m_M$ if the prediction result shows that the residue belongs to class “M”. In the two other situations, “i” and “o”, the constructed BPAs are $m_i$ and $m_o$, respectively.
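
The construction can be sketched as follows. This is one possible reading of a confusion-matrix-based BPA, not necessarily the exact formula of Xu et al. [23]: it assumes that the recognition rate of class j is the fraction of residues predicted as j that truly belong to j, that the remaining fraction (the substitution rate) is assigned to the complement of {j}, and that there is no rejection output. All names and counts are hypothetical.

```python
CLASSES = ("i", "M", "o")
FRAME = frozenset(CLASSES)


def bpa_for_predicted_class(cm, j):
    """BPA used when the predictor outputs class j.

    cm[p][q] is the number of residues of true class p predicted as class q.
    Assumption: rates are normalized over the residues predicted as j.
    """
    predicted_as_j = sum(cm[p][j] for p in CLASSES)
    if predicted_as_j == 0:
        return {FRAME: 1.0}  # no evidence available: total ignorance
    recognition = cm[j][j] / predicted_as_j
    substitution = 1.0 - recognition
    bpa = {frozenset({j}): recognition}
    if substitution > 0:
        bpa[FRAME - {j}] = substitution
    return bpa


# Hypothetical confusion matrix (made-up counts).
cm = {
    "i": {"i": 80, "M": 10, "o": 10},
    "M": {"i": 5, "M": 90, "o": 5},
    "o": {"i": 15, "M": 0, "o": 85},
}
bpas = {j: bpa_for_predicted_class(cm, j) for j in CLASSES}
```

With these counts, a residue predicted as “M” receives a mass of 0.9 on {M} and 0.1 on {i, o}, which expresses that the predictor is usually, but not always, right when it outputs “M”.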

3.3. The Combination of Multiple Predictors

Once all BPAs of each predictor have been constructed, the prediction results of multiple predictors can be combined. In this paper, the prediction results of the basic predictors are treated as pieces of evidence coming from different sources. These prediction results can be combined by using Dempster’s rule of combination, as shown in Figure 2.

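Assume there are $S$ basic predictors $e_1, e_2, \ldots, e_S$ in the evidential prediction system, $\mathcal{M}_k$ is the set of constructed BPAs for all classes from basic predictor $e_k$, and $\mathcal{M}_k = \{m^{k}_{i}, m^{k}_{M}, m^{k}_{o}\}$. $f(\mathcal{M}_k, x)$ is an operation used to obtain the matched BPA for a residue $x$ predicted by $e_k$. The combination of multiple predictors to predict the class of residue $x$ can be expressed by
$$m_x = f(\mathcal{M}_1, x) \oplus f(\mathcal{M}_2, x) \oplus \cdots \oplus f(\mathcal{M}_S, x).$$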
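
The combination step for a single residue can be sketched as a lookup followed by a fold with Dempster’s rule. The sketch below is an illustration under simplifying assumptions: each of three hypothetical predictors has one made-up reliability rate shared by all classes, and the helper names are not part of the original method.

```python
from functools import reduce
from itertools import product

CLASSES = ("i", "M", "o")
FRAME = frozenset(CLASSES)


def dempster(m1, m2):
    """Dempster's rule for two BPAs stored as {frozenset: mass}."""
    combined, conflict = {}, 0.0
    for (b, mass_b), (c, mass_c) in product(m1.items(), m2.items()):
        if b & c:
            combined[b & c] = combined.get(b & c, 0.0) + mass_b * mass_c
        else:
            conflict += mass_b * mass_c
    if conflict >= 1.0:
        raise ValueError("total conflict: the rule is undefined")
    return {a: mass / (1.0 - conflict) for a, mass in combined.items()}


def simple_bpa(label, rate):
    """Hypothetical BPA: `rate` on {label}, the rest on the two other classes."""
    return {frozenset({label}): rate, FRAME - {label}: 1.0 - rate}


# One BPA set per predictor, keyed by the class that the predictor outputs.
bpa_sets = [{j: simple_bpa(j, rate) for j in CLASSES} for rate in (0.9, 0.8, 0.7)]


def combine_residue(bpa_sets, predicted_labels):
    """Pick the matched BPA of each predictor, then fold with Dempster's rule."""
    matched = [bpas[label] for bpas, label in zip(bpa_sets, predicted_labels)]
    return reduce(dempster, matched)


# The first two predictors say "M", the third says "i".
m_x = combine_residue(bpa_sets, ("M", "M", "i"))
```

Because the rule is associative, folding the BPAs one after another gives the same result as any other grouping; here the two agreeing predictors outweigh the dissenting one and most of the combined mass falls on {M}.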

3.4. The Determination of Topology

Through the above steps, the combined prediction result is derived for each residue in a transmembrane protein sequence. It is indicated by a BPA $m$. In order to obtain the final class of the residue, the BPA is translated into a probability distribution by using the pignistic probability transformation (PPT) function, proposed by Smets and Kennes in the transferable belief model (TBM) [39]. The PPT function [39] is defined as follows.

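Let $m$ be a BPA on a frame of discernment $\Omega$; the pignistic probability transformation function $BetP_m : \Omega \to [0, 1]$ corresponding to $m$ is
$$BetP_m(\omega) = \sum_{A \subseteq \Omega,\ \omega \in A} \frac{1}{|A|} \cdot \frac{m(A)}{1 - m(\emptyset)},$$
where $|A|$ is the cardinality of proposition $A$.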

By using the PPT function, the BPA $m$ can be translated into a probability distribution $BetP_m$. The class of the residue is then the class with the maximum value of $BetP_m$. Finally, the topology of a transmembrane protein is determined once the classes of all residues in the protein sequence have been determined. For each protein, the transmembrane orientation is determined by the location of the first residue, and each transmembrane region consists of a run of residues labelled as class “M” whose length exceeds a threshold. From this topology, all transmembrane helixes and the orientation of each transmembrane helix can be derived.
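
The decision step can be sketched as follows. The BPA values, the label string, and the length threshold of 3 are made-up illustrations; the threshold actually used is not stated in the text, so it is left as a parameter.

```python
from itertools import groupby

FRAME = frozenset({"i", "M", "o"})


def pignistic(m):
    """BetP: share the mass of each focal element equally among its members."""
    empty_mass = m.get(frozenset(), 0.0)
    betp = {}
    for a, mass in m.items():
        for w in a:
            betp[w] = betp.get(w, 0.0) + mass / (len(a) * (1.0 - empty_mass))
    return betp


def decide(m):
    """Return the class with the largest pignistic probability."""
    betp = pignistic(m)
    return max(betp, key=betp.get)


def tm_regions(labels, threshold):
    """Half-open (start, end) spans of runs of 'M' longer than `threshold`."""
    regions, pos = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label == "M" and length > threshold:
            regions.append((pos, pos + length))
        pos += length
    return regions


m = {frozenset({"M"}): 0.5, frozenset({"i", "o"}): 0.2, FRAME: 0.3}
betp = pignistic(m)
label = decide(m)

labels = "iiMMMMMMooMMoo"   # hypothetical per-residue decisions
regions = tm_regions(labels, threshold=3)
orientation = labels[0]     # side of the membrane of the first residue
```

In the toy label string, the run of six “M” residues is kept as a transmembrane region, while the run of two is discarded because it does not exceed the threshold.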

4. Experimental Verification

In this paper, a data set of 125 transmembrane protein sequences with known topology is collected from the data set of MPtopo [40] to verify the effectiveness of the proposed method TOPPER.

In order to reflect the performance of the combination predictor faithfully and to avoid overfitting, the experiment is performed using tenfold cross-validation. Each fold contains 12 or 13 transmembrane proteins, and the homology among the sequences has been reduced to below 30% by using the cd-hit program [41].
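
A tenfold split of 125 proteins can be sketched as below. The shuffling seed, the function name, and the assumption that the confusion matrices are estimated on the nine training folds and applied to the held-out fold are illustrative choices that the text does not specify.

```python
import random


def tenfold_splits(n_items, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Striding distributes the items so that fold sizes differ by at most one.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]


splits = list(tenfold_splits(125))
fold_sizes = sorted(len(test) for _, test in splits)
```

With 125 proteins this gives five folds of 12 and five folds of 13, which matches the fold sizes reported above.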

In order to assess how well different algorithms predict transmembrane regions (i.e., transmembrane helixes without considering orientation), an evaluation method developed by Tusnády and Simon [11] is adopted in this paper. The prediction of a transmembrane region is considered successful when the overlap between the predicted and the observed transmembrane region contains at least 9 amino acids. The total numbers of predicted and observed transmembrane regions are indicated by $N_{prd}$ and $N_{obs}$, respectively. The number of overlapping predicted and observed transmembrane regions is indicated by $N_{cor}$. The efficiency of the transmembrane region prediction is measured by $M = N_{cor} / N_{obs}$ and $C = N_{cor} / N_{prd}$. The overall prediction power is defined by
$$Q = \sqrt{M \times C}.$$
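
The region-level evaluation can be sketched as follows, assuming that the overall prediction power is the geometric mean of the two efficiency measures and that each predicted region may be matched to at most one observed region. The region coordinates are made up, and the half-open (start, end) representation is an assumption of this sketch.

```python
from math import sqrt


def overlap(a, b):
    """Number of shared positions of two half-open (start, end) regions."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def region_scores(observed, predicted, min_overlap=9):
    """Return (n_cor, M, C, Q) for lists of observed and predicted regions."""
    used = set()  # predicted regions that are already matched
    n_cor = 0
    for obs in observed:
        for idx, prd in enumerate(predicted):
            if idx not in used and overlap(obs, prd) >= min_overlap:
                used.add(idx)
                n_cor += 1
                break
    m = n_cor / len(observed) if observed else 0.0
    c = n_cor / len(predicted) if predicted else 0.0
    return n_cor, m, c, sqrt(m * c)


# Hypothetical regions: two of three observed helixes are found,
# and two of the four predicted regions are spurious.
observed = [(10, 30), (45, 66), (80, 100)]
predicted = [(12, 31), (50, 70), (120, 140), (150, 170)]
n_cor, m, c, q = region_scores(observed, predicted)
```

Taking both ratios into account penalizes a predictor that finds every observed helix only by predicting many spurious ones, as well as one that predicts few regions and misses observed helixes.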

Besides, if all transmembrane regions and orientation of a transmembrane protein sequence have been predicted correctly, the topology of the transmembrane protein is said to be predicted correctly.

In the rest of this section, the prediction algorithms are compared at three levels, namely, the residue level, the transmembrane region level, and the topology level.

At the level of residue prediction, the confusion matrix of each algorithm is shown in Table 1. Based on these confusion matrices, Table 2 lists several indexes that measure the performance of residue prediction, including the recall rate, precision rate, and F score of each class, and the prediction accuracy of residues. The residue prediction accuracy of TOPPER is higher than that of each of the five individual algorithms (Table 2), as shown in Figure 3. TOPPER also has the highest F score for each of the classes “i”, “M”, and “o”, as shown in Figure 4. Hence, at the residue level the proposed TOPPER outperforms the other algorithms.
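
The residue-level indexes can be derived from a confusion matrix as sketched below. The counts are made up, and the sketch assumes that the F score is the usual harmonic mean of precision and recall, which the text does not state explicitly.

```python
CLASSES = ("i", "M", "o")


def residue_metrics(cm):
    """Return (accuracy, {class: (recall, precision, f_score)}).

    cm[p][q] is the number of residues of true class p predicted as class q.
    """
    total = sum(cm[p][q] for p in CLASSES for q in CLASSES)
    accuracy = sum(cm[c][c] for c in CLASSES) / total
    per_class = {}
    for c in CLASSES:
        observed_c = sum(cm[c][q] for q in CLASSES)    # row sum
        predicted_c = sum(cm[p][c] for p in CLASSES)   # column sum
        recall = cm[c][c] / observed_c if observed_c else 0.0
        precision = cm[c][c] / predicted_c if predicted_c else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        per_class[c] = (recall, precision, f_score)
    return accuracy, per_class


# Hypothetical confusion matrix (made-up counts).
cm = {
    "i": {"i": 80, "M": 10, "o": 10},
    "M": {"i": 5, "M": 90, "o": 5},
    "o": {"i": 15, "M": 0, "o": 85},
}
accuracy, per_class = residue_metrics(cm)
```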

At the level of transmembrane region prediction, Table 3 shows the performance of the algorithms in predicting transmembrane regions. In terms of the overall prediction power defined in [11], the value obtained by TOPPER is the highest among all the algorithms (Table 3), as shown in Figure 5. TOPPER is therefore superior to the other algorithms at this level.

At the level of topology prediction, Table 4 shows the topology prediction accuracy of each algorithm. The topology prediction accuracy of TOPPER is the highest among these algorithms (Table 4), as shown in Figure 6. Therefore, the proposed TOPPER is superior to the other algorithms at this level as well.

In summary, the proposed TOPPER outperforms the other algorithms at the levels of residue prediction, transmembrane region prediction, and topology prediction. Hence, the effectiveness of the proposed method has been demonstrated.

5. Conclusions

Transmembrane proteins are a special and important class of proteins in cells, and topology prediction is a foundation of the research on them. In this paper, a new topology prediction method for transmembrane proteins is proposed based on evidential reasoning. The proposed method is a combination of multiple individual prediction algorithms, in which the Dempster-Shafer theory is used to represent and combine the results of the basic predictors. Experimental results show that the proposed method is superior to the individual prediction algorithms, which demonstrates its effectiveness.

Acknowledgments

The work is partially supported by the National Natural Science Foundation of China, Grant nos. 30400067, 61174022 and 71271061, the Chongqing Natural Science Foundation, Grant no. CSCT, 2010BA2003, the Fundamental Research Funds for the Central Universities Grant no. XDJK2010C030, and the Doctor Funding of Southwest University Grant no. SWU110021.