Abstract

Protein-protein interactions (PPIs) are the basis of biological functions, and studying these interactions at the molecular level is crucial for understanding the functionality of a living cell. During the past decade, biosensors have emerged as an important tool for the high-throughput identification of proteins and their interactions. However, high-throughput experimental methods for identifying PPIs are both time-consuming and expensive. In addition, high-throughput PPI data are often associated with high false-positive and false-negative rates. To address these problems, we propose a method for PPI detection that integrates biosensor-based PPI data with a novel computational model. The method is based on the extreme learning machine algorithm combined with a novel protein sequence descriptor. When applied to a large-scale human protein interaction dataset, the proposed method achieved 84.8% prediction accuracy with 84.08% sensitivity at a specificity of 85.53%. We conducted further experiments to compare the proposed method with a state-of-the-art technique, the support vector machine. The results demonstrate that our approach is very promising for detecting new PPIs and can be a helpful supplement to biosensor-based PPI detection.

1. Introduction

Proteins play crucial roles in cellular biology, including signaling cascades, metabolic cycles, and DNA transcription. Proteins rarely perform their functions alone; instead, they cooperate with other proteins by forming protein-protein interaction (PPI) networks. PPIs are responsible for the majority of cellular functions. Over the past decades, many innovative techniques and systems for identifying protein interactions have been developed [1]. For example, high-throughput experimental technologies such as yeast two-hybrid (Y2H) screens [2], tandem affinity purification (TAP) [3], mass spectrometric protein complex identification (MS-PCI) [4], and other large-scale biological techniques for PPI detection have accumulated a large amount of PPI data for different species [5–11]. However, the experimental methods are costly and time-consuming; therefore, the PPI pairs currently obtained from biological experiments cover only a small fraction of the complete PPI networks [12–14]. In addition, large-scale experimental methods usually suffer from high rates of both false positives and false negatives [12, 15–20]. Hence, it is of great practical significance to build low-cost protein detection systems and to establish reliable computational methods that facilitate the detection of PPIs [21–25].

A number of computational methods have been proposed for the prediction of PPIs based on different data types, including phylogenetic profiles, gene neighborhood, gene fusion, sequence conservation between interacting proteins, and literature mining knowledge [12, 26–33]. There are also methods that combine interaction information from several different data sources [27]. However, these methods cannot be applied if such biological information about the proteins is not available. Recently, methods that derive information directly from the protein sequence have attracted particular interest [26, 28–30]. Researchers have worked to develop sequence-based methods for discovering new PPIs, and the experimental results showed that the amino acid sequences of proteins alone are sufficient to predict PPIs. Among them, one notable work is a support vector machine (SVM) based method developed by Shen et al. [29]. In that study, the twenty amino acids were first clustered into 7 classes according to the volumes and dipoles of their side chains. The conjoint triad approach then extracts the features of protein pairs based on this classification of amino acids. When applied to predict human PPIs, this method yields a high prediction accuracy of about 84%.

Because the conjoint triad approach did not take neighboring effects into account, and because interactions usually occur in discontinuous amino acid segments of the sequence, Guo et al. developed a method based on SVM and autocovariance to extract the interaction information in the discontinuous amino acid segments [26]. Their method yielded a prediction accuracy of 86.55% when applied to predicting Saccharomyces cerevisiae PPIs. More recently, Pan et al. proposed a novel hierarchical LDA-RF model to predict human PPIs directly from protein primary sequences. In that study, the local sequential features represented by conjoint triads are first extracted from the sequences. The generative LDA model is then used to project the original feature space into a latent semantic space to obtain low-dimensional latent topic features. Finally, a random forest model is used to predict the interactions between two proteins. The experimental results show that it is a very promising scheme for PPI prediction [28].

Current studies on predicting PPIs have generally focused on high accuracy but have not considered the running time needed to train the classification model. Training time should be an important factor in developing a sequence-based method for predicting PPIs because the total number of possible PPIs is very large. For example, if we assume that the human genome consists of 22,500 protein-coding genes, then the total number of possible PPIs is estimated to be around 253,113,750 (22,500 × 22,499/2 unordered pairs). This indicates that some classification models with high classification accuracy may not be satisfactory when the tradeoff between classification accuracy and training time is considered. Here, in addition to exploring local and global descriptors to mine interaction information from multiscale amino acid segments at the same time, we also investigate the use of a novel learning paradigm called the extreme learning machine (ELM) [34], in order to obtain a balance between high classification accuracy and short training time.

In the present work, we report a novel sequence-based method for the prediction of interacting protein pairs using ELM combined with local and global descriptors. More specifically, we first represent each protein sequence as a vector by using the novel representation of local and global protein sequence descriptors, which allows us to mine interaction information from multiscale amino acid segments at the same time. Then we characterize a protein pair by a feature vector obtained by combining the vectors of the two proteins in the pair. Finally, an ELM model is constructed using these feature vectors of the protein pairs as input. To evaluate its performance, the proposed method was applied to a human PPI dataset. The experimental results show that our method achieved 84.8% prediction accuracy with 84.08% sensitivity at a specificity of 85.53%.

2. Materials and Methodology

In this section, we outline the main idea behind the proposed method. A flowchart showing how large-scale PPIs are mapped by integrating biosensor-based PPI data with the computational model is given in Figure 1. First, we discuss the PPI dataset used in the study to evaluate the performance of the proposed method. Next, we introduce the novel sequence-based protein representation method. Finally, we briefly describe the computational model, ELM, used in this study.

2.1. Golden Standard Datasets

We evaluated the proposed method with the human PPI dataset, which was downloaded from the Human Protein Reference Database (HPRD). After self-interactions and duplicate interactions were removed, the remaining 36,630 PPI pairs between 9,630 different human proteins comprise the final positive dataset.

The choice of golden negative dataset affects the prediction performance, and that performance can be artificially inflated by a bias towards dominant samples in the positive data. For the golden negative set, we followed the previous work [28] in assuming that proteins in separate subcellular compartments do not interact with each other. In this study, the golden negative dataset is generated from Swiss-Prot database version 57.3 according to four criteria: (1) protein sequences annotated with uncertain subcellular location terms were removed; (2) protein sequences annotated with multiple locations were removed because their location is not unique; (3) protein sequences annotated with “fragment” were removed; (4) protein sequences with fewer than 50 amino acid residues were also removed because they might be fragments. After strictly following the above steps, we finally obtained 1,773 human proteins from six subcellular localizations. The noninteracting protein pairs were then constructed by randomly pairing proteins from separate subcellular compartments.
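The random pairing step can be illustrated with a minimal Python sketch. The function name `sample_negative_pairs`, the protein IDs, and the compartment labels are hypothetical placeholders, not the identifiers or the six localizations used in the study; the sketch only assumes that each protein carries exactly one compartment label after the filtering criteria above.

```python
import random


def sample_negative_pairs(location_of, n_pairs, seed=0):
    """Randomly pair proteins annotated with different subcellular compartments.

    location_of: dict mapping a protein ID to its single compartment label.
    Returns a sorted list of unique (protein_a, protein_b) tuples with a < b.
    """
    rng = random.Random(seed)
    proteins = sorted(location_of)
    # Count the distinct cross-compartment pairs so that an impossible
    # request fails early instead of looping forever.
    max_pairs = sum(
        1
        for i, a in enumerate(proteins)
        for b in proteins[i + 1:]
        if location_of[a] != location_of[b]
    )
    if n_pairs > max_pairs:
        raise ValueError("not enough cross-compartment pairs")
    pairs = set()
    while len(pairs) < n_pairs:
        a, b = rng.sample(proteins, 2)
        if location_of[a] != location_of[b]:  # keep only separate compartments
            pairs.add((a, b) if a < b else (b, a))
    return sorted(pairs)
```

Rejection sampling is adequate here because most random pairs fall in different compartments; for a heavily skewed compartment distribution, enumerating the candidate pairs first would be the safer choice.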

We also downloaded the golden negative dataset of human proteins with experimental evidence used in the study of Smialowski et al. [35]. Combining the above two negative datasets gives a final golden negative dataset of 36,480 noninteracting protein pairs. The whole dataset therefore consists of 73,110 protein pairs, of which nearly half are from the positive dataset and half are from the negative dataset. Four-fifths of the protein pairs from the positive and negative datasets, respectively, were randomly selected as the training dataset, and the remaining one-fifth was used as the testing dataset.

2.2. Representing Proteins with Descriptors from Primary Protein Sequences

To successfully use machine learning methods to identify PPIs from primary protein amino acid sequences, one of the most important computational challenges is how to effectively represent a protein sequence by a fixed-length feature vector in which the important information content of the protein is fully encoded [36, 37]. In this study, two kinds of sequence representation are used to transform the protein sequences into feature vectors: amino acid composition and a novel local descriptor. For amino acid composition, 20 descriptors reflecting the fraction of each kind of amino acid in a protein sequence are calculated directly. Then, a local multiscale decomposition technique is used to divide the protein sequence into multiple sequence segments of varying length to describe local regions. Here, the continuous sequence segments are composed of residues which are local in the polypeptide sequence [38].
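As an illustration of the global descriptor, the amino acid composition can be sketched in a few lines of Python. The function name and the alphabetical ordering of the 20 residues are assumptions made for this example, as is the choice to ignore non-standard residue codes when computing the fractions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid as a 20-element list."""
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)  # non-standard symbols are ignored (assumption)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [c / total for c in counts]
```

For the example sequence “CCYGGGYYCYYYCGGCCYYCG” used later in this section, only the entries for C, G, and Y are non-zero.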

In order to extract local information, we first divided the entire protein sequence into seven equal-length fractions. Then a novel binary coding scheme was adopted to construct a set of continuous regions on the basis of this partition. For example, consider a protein sequence “CCYGGGYYCYYYCGGCCYYCG” containing 21 residues. To represent the sequence by a feature vector, we first divide the protein sequence into multiple regions. For simplicity of illustration, the protein sequence is here divided into four equal-length segments (denoted as S1, S2, S3, and S4). Each region is then encoded as a sequence of 1’s and 0’s in 4-bit binary form. In binary format, the possible combinations are written as 0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001, 1010, 1011, 1100, 1101, 1110, and 1111. The number of states of a group of bits is given by the expression $2^n$, where $n$ is the number of bits. Note that each bit corresponds to one of the four equal-length segments S1–S4, and a 0 or 1 indicates that the segment is excluded from or included in the region, respectively. For example, 1100 denotes a continuous region constructed from S1 and S2 (the first 50% of the sequence). Similarly, 0011 represents a continuous region constructed from S3 and S4 (the final 50% of the sequence).

The proposed representation can be edited simply and conveniently at multiple scales, which offers a simple and unified way of representing a protein sequence. For a given number of bits, each protein sequence can take on only a finite number of continuous or discontinuous regions. This limits the resolution of the sequence. If more bits are used for each protein sequence, a higher degree of resolution is obtained. In this study, the protein sequence is encoded in 7-bit binary form, so each protein sequence may take on 126 ($2^7 - 2$) different regions. Higher-bit encoding requires more storage for data and more computing resources to process. In this study, only the continuous regions are used and the discontinuous regions are discarded.
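The selection of continuous regions from the binary codes can be sketched as follows. This is an illustrative Python sketch rather than the authors' implementation: the function names are hypothetical, and the rule used to place segment boundaries when the sequence length is not a multiple of the number of bits is an assumption. Excluding the all-zeros and all-ones codes, a 7-bit encoding yields 27 continuous regions, which agrees with the number of regions used in this study.

```python
def continuous_regions(n_bits):
    """Return the n_bits-long codes whose 1s form a single unbroken run,
    excluding the all-zeros and all-ones codes (2**n_bits - 2 candidates)."""
    regions = []
    for code in range(1, 2 ** n_bits - 1):  # skips 00...0 and 11...1
        bits = format(code, "0{}b".format(n_bits))
        # After stripping the outer zeros, a continuous region has no zero left.
        if "0" not in bits.strip("0"):
            regions.append(bits)
    return regions


def region_slice(bits, seq_len):
    """Map a continuous-region code to (start, end) residue indices, end exclusive.

    Segment k is assumed to cover [k * seq_len // n, (k + 1) * seq_len // n).
    """
    n = len(bits)
    first = bits.index("1")
    last = bits.rindex("1")
    return first * seq_len // n, (last + 1) * seq_len // n
```

With four bits, `continuous_regions(4)` keeps codes such as 1100 and 0011 and drops discontinuous ones such as 1010; `region_slice("1100", 20)` then selects the first half of a 20-residue sequence.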

For each continuous region, three types of descriptors, composition (C), transition (T), and distribution (D), are used to represent its characteristics. C denotes the number of amino acids of a particular property (e.g., hydrophobicity) divided by the total number of amino acids in a local region. T is the percentage frequency with which amino acids of a particular property are followed by amino acids of another property. D characterizes the chain length within which the first, 25 percent, 50 percent, 75 percent, and 100 percent of the amino acids of a particular property are located, respectively [39].

The three descriptors are calculated as follows. First, in order to reduce the complexity inherent in the representation of the 20 standard amino acids, we clustered them into seven groups based on the volumes and dipoles of the side chains. Amino acids within the same group likely involve synonymous mutations because of their similar characteristics [29]. The amino acids belonging to each group are shown in Table 1.

Then, every amino acid in each protein sequence is replaced by the index of its group. For example, the protein sequence “CCYGGGYYCYYYCGGCCYYCG” is replaced by 773111337333711773371 based on this classification of amino acids (see Figure 2). There are six “1,” eight “3,” and seven “7” in this protein sequence. The composition for these three symbols is 6/(6 + 8 + 7) × 100% = 28.57%, 8/(6 + 8 + 7) × 100% = 38.10%, and 7/(6 + 8 + 7) × 100% = 33.33%, respectively. There are 2 transitions from “1” to “3” or from “3” to “1” among the 20 adjacent residue pairs in this sequence, so the percentage frequency of these transitions is (2/20) × 100% = 10%. The transitions from “1” to “7” or from “7” to “1” can similarly be calculated as (3/20) × 100% = 15%. The transitions from “3” to “7” or from “7” to “3” can also similarly be calculated as (6/20) × 100% = 30%.

For distribution D, there are 6 residues encoded as “1” in the example of Figure 2. The positions of the first residue “1,” the 2nd residue “1” (25% × 6 = 1.5, rounded to 2), the 3rd residue “1” (50% × 6 = 3), the 5th residue “1” (75% × 6 = 4.5, rounded to 5), and the 6th residue “1” (100% × 6 = 6) in the encoded sequence are 4, 5, 6, 15, and 21, respectively, so the D descriptors for “1” are (4/21) × 100% = 19.05%, (5/21) × 100% = 23.81%, (6/21) × 100% = 28.57%, (15/21) × 100% = 71.43%, and (21/21) × 100% = 100%, respectively. Similarly, the D descriptors for “3” are 14.29%, 33.33%, 47.62%, 57.14%, and 90.48%, and those for “7” are 4.76%, 9.52%, 61.9%, 76.19%, and 95.24%.
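The worked example can be reproduced with the following Python sketch of the composition, transition, and distribution descriptors. Two points are assumptions rather than statements from the text: the seven-group mapping in `GROUPS` (Table 1 is not reproduced here, so the grouping commonly attributed to Shen et al. [29] is used, which maps G to 1, Y to 3, and C to 7 as in the example) and the round-half-up rule for locating the 25%, 50%, and 75% residues, which is inferred from the positions and percentages quoted in the example. The function names are hypothetical.

```python
from itertools import combinations

# Assumed seven-group clustering by side-chain dipole and volume.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
GROUP_OF = {aa: str(i + 1) for i, members in enumerate(GROUPS) for aa in members}


def encode(sequence):
    """Replace every residue by the index (1-7) of its group."""
    return "".join(GROUP_OF[aa] for aa in sequence.upper())


def ctd(encoded, symbols="1234567"):
    """Return (composition, transition, distribution) lists of sizes 7, 21, 35."""
    n = len(encoded)
    if n < 2:
        raise ValueError("need at least two residues")
    # Composition: fraction of residues belonging to each group.
    comp = [encoded.count(s) / n for s in symbols]
    # Transition: fraction of adjacent pairs that switch between groups a and b.
    pairs = list(zip(encoded, encoded[1:]))
    trans = [
        sum(1 for p, q in pairs if {p, q} == {a, b}) / len(pairs)
        for a, b in combinations(symbols, 2)
    ]
    # Distribution: relative position of the first, 25%, 50%, 75%, 100% residue.
    dist = []
    for s in symbols:
        positions = [i + 1 for i, ch in enumerate(encoded) if ch == s]
        m = len(positions)
        if m == 0:
            dist.extend([0.0] * 5)  # group absent from this region
            continue
        dist.append(positions[0] / n)
        for pct in (25, 50, 75, 100):
            k = max(1, (2 * pct * m + 100) // 200)  # round half up (assumed rule)
            dist.append(positions[k - 1] / n)
    return comp, trans, dist
```

Calling `ctd(encode("CCYGGGYYCYYYCGGCCYYCG"))` returns fractions rather than percentages; multiplying by 100 gives the values quoted in the example, and concatenating the three lists gives the 63 descriptors of one region.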

For each continuous local region, the three descriptors (C, T, and D) were calculated and concatenated, and a total of 63 descriptors are generated: 7 for C, 21 for T, and 35 for D. Then, the local descriptors from the 27 regions (7-bit encoding) were concatenated, and a 1701-dimensional vector was built to represent each protein sequence. Finally, the PPI pair is characterized by concatenating the local and global descriptors of the two individual proteins. Thus, a 3442-dimensional vector was constructed to represent each protein pair and was used as the input feature vector for the classifier.
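The dimension counts can be checked with a short calculation, assuming that the 21 transition descriptors correspond to the unordered pairs of the seven groups and that the global descriptor is the 20-dimensional amino acid composition:

```latex
\begin{aligned}
\text{descriptors per region:}\quad & 7 + \binom{7}{2} + 7 \times 5 = 7 + 21 + 35 = 63,\\
\text{local descriptor per protein:}\quad & 27 \times 63 = 1701,\\
\text{local plus global per protein:}\quad & 1701 + 20 = 1721,\\
\text{feature vector per protein pair:}\quad & 2 \times 1721 = 3442.
\end{aligned}
```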

2.3. Extreme Learning Machine

By virtue of their approximation capabilities for nonlinear mappings, feed-forward neural networks (FNNs) have become ideal classifiers in many applications. Huang et al. proved that a single-hidden-layer FNN can exactly learn $N$ distinct observations for almost any nonlinear activation function with at most $N$ hidden nodes [34, 40, 41]. However, the hidden layer biases and input weights of an FNN usually have to be tuned using some parameter adjustment approach, which is generally time-consuming because inappropriate learning steps lead to slow convergence or to convergence to local minima. Therefore, the slow learning speed of FNNs has been a major bottleneck in different applications.

The extreme learning machine (ELM) was originally developed for the single-hidden-layer feed-forward neural network (SLFNN) and then extended to the generalized SLFNN, where the hidden layer need not be neuron-like [34, 40]. As shown in Figure 3, its architecture is similar to that of a SLFNN. Recently, the ELM algorithm has become increasingly popular in classification tasks due to its high generalization ability and fast learning speed. In contrast to the common view that all network parameters need to be adjusted, in ELM the input weights and hidden layer biases are randomly assigned and need not be adjusted. It has been proved that the ELM algorithm learns at an extremely fast speed and achieves good generalization performance with activation functions that are infinitely differentiable in the hidden layer [40, 42, 43].

The ELM algorithm transforms the learning problem into a simple linear system; that is, the output weights of ELM can be determined analytically through a generalized inverse operation on the hidden layer output matrix. Compared with traditional learning frameworks, such a learning scheme can operate at a much faster speed. The improved generalization performance of ELM with the smallest training error shows its strong classification capability for real-time applications without a learning bottleneck [44].

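The basic idea behind the ELM algorithm is briefly described as follows. Suppose we are learning $N$ arbitrary distinct samples $(\mathbf{x}_i, \mathbf{t}_i)$, where $\mathbf{x}_i \in \mathbb{R}^n$ and $\mathbf{t}_i \in \mathbb{R}^m$. A standard ELM with $L$ hidden neurons and activation function $g(x)$ is mathematically modeled by

$$\sum_{j=1}^{L} \boldsymbol{\beta}_j \, g(\mathbf{w}_j \cdot \mathbf{x}_i + b_j) = \mathbf{t}_i, \quad i = 1, \ldots, N, \tag{1}$$

where $\mathbf{w}_j$ represents the weight vector connecting the $j$th hidden node and the input nodes, $\boldsymbol{\beta}_j$ represents the weight vector connecting the $j$th hidden neuron and the output neurons, and $b_j$ is the bias of the $j$th hidden neuron. $\mathbf{w}_j \cdot \mathbf{x}_i$ denotes the inner product of $\mathbf{w}_j$ and $\mathbf{x}_i$. A wide variety of functions can be selected as the activation function, including the sigmoid function, radial basis function, sine function, hardlim function, and triangular basis function. The architecture of ELM is shown in Figure 3. Equation (1) can be written compactly as

$$\mathbf{H} \boldsymbol{\beta} = \mathbf{T}, \tag{2}$$

where $\mathbf{H}$, with entries $H_{ij} = g(\mathbf{w}_j \cdot \mathbf{x}_i + b_j)$, is termed the hidden layer output matrix of the SLFNN; the $j$th column of $\mathbf{H}$ is the $j$th hidden neuron’s output vector with respect to inputs $\mathbf{x}_1, \ldots, \mathbf{x}_N$. Here $\boldsymbol{\beta}$ is the $L \times m$ matrix whose rows are $\boldsymbol{\beta}_j^{T}$, and $\mathbf{T}$ is the $N \times m$ matrix whose rows are $\mathbf{t}_i^{T}$. Hence, for fixed arbitrary input weights $\mathbf{w}_j$ and hidden layer biases $b_j$, training a SLFNN is equivalent to finding a least-squares solution $\hat{\boldsymbol{\beta}}$ of the linear system $\mathbf{H} \boldsymbol{\beta} = \mathbf{T}$; that is, (2) becomes a linear system and the solution is estimated as

$$\hat{\boldsymbol{\beta}} = \mathbf{H}^{\dagger} \mathbf{T}, \tag{3}$$

where $\mathbf{H}^{\dagger}$ is the Moore-Penrose generalized inverse of the hidden layer output matrix $\mathbf{H}$.

In summary, given a training dataset $\{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=1}^{N}$, activation function $g(x)$, and hidden neuron number $L$, the ELM-based learning procedure can be summarized as follows.

Step 1. Assign arbitrary input weights $\mathbf{w}_j$ and biases $b_j$, $j = 1, \ldots, L$.

Step 2. Calculate the hidden layer output matrix $\mathbf{H}$.

Step 3. According to (3), calculate the output weight $\hat{\boldsymbol{\beta}} = \mathbf{H}^{\dagger} \mathbf{T}$.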
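The three steps can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the MATLAB implementation used in the experiments: the function names `elm_train` and `elm_predict`, the uniform range [-1, 1] for the random weights and biases, and the fixed sigmoid activation are all choices made for this example.

```python
import numpy as np


def elm_train(X, T, n_hidden, seed=0):
    """Train a single-hidden-layer ELM with a sigmoid activation.

    X: (N, n) inputs; T: (N, m) targets; n_hidden: number of hidden neurons.
    Returns the random input weights W, biases b, and output weights beta.
    """
    rng = np.random.default_rng(seed)
    # Step 1: random input weights and hidden biases, never tuned afterwards.
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    # Step 2: hidden layer output matrix, one column per hidden neuron.
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    # Step 3: least-squares output weights via the Moore-Penrose pseudoinverse.
    beta = np.linalg.pinv(H) @ T
    return W, b, beta


def elm_predict(X, W, b, beta):
    """Apply the trained ELM: recompute the hidden outputs and map them to outputs."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

For binary classification, one option is to encode the two classes as -1 and +1 in `T` and take the sign of `elm_predict(...)` as the predicted label. Because the only fitted quantity is `beta`, training reduces to a single pseudoinverse computation.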

3. Results and Discussion

In this section, we describe our simulation methodology and present the experimental results that evaluate the effectiveness of our schemes. The proposed sequence-based PPI predictor was implemented on the MATLAB platform. For the ELM algorithm, the implementation by Zhu and Huang available from http://www.ntu.edu.sg/home/egbhuang was used. For SVM, the LIBSVM implementation available from http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html, originally developed by Chang et al. [33], was used. Three kinds of kernel functions were chosen, and the optimized parameters were obtained with a grid search approach. All the simulations were carried out on a computer with a 3.1 GHz 2-core CPU, 8 GB of memory, and the Windows operating system.

3.1. Cross Validation and Performance Evaluation

In this study, fivefold cross-validation was employed to evaluate the performance of the proposed model. In fivefold cross-validation, the whole dataset is randomly divided into five subsets, where each subset consists of a nearly equal number of interacting and noninteracting protein pairs. Four subsets are used for training and the remaining subset for testing. This process is repeated five times so that each subset is used once for testing. The reported performance of the method is its average performance over the five test subsets.
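A stratified fivefold split of this kind can be sketched in plain Python. The function names are hypothetical, and dealing the shuffled indices of each class round-robin into the folds is one simple way, assumed here, to keep the folds balanced between interacting and noninteracting pairs.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with nearly equal class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def cross_validation_splits(labels, k=5, seed=0):
    """Yield (train_indices, test_indices); each fold is the test set exactly once."""
    folds = stratified_folds(labels, k, seed)
    for t in range(k):
        test = sorted(folds[t])
        train = sorted(i for f, fold in enumerate(folds) if f != t for i in fold)
        yield train, test
```

Averaging a metric over the five `(train, test)` pairs gives the cross-validated estimate described above.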

Seven metrics were used in this study to measure the predictive ability of the proposed method: (1) the overall prediction accuracy (ACC) is the percentage of correctly identified interacting and noninteracting protein pairs; (2) the sensitivity (SN) is the percentage of correctly identified interacting protein pairs; (3) the specificity (SP) is the percentage of correctly identified noninteracting protein pairs; (4) the positive predictive value (PPV) is the percentage of predicted interacting pairs that are true PPIs; (5) the negative predictive value (NPV) is the percentage of predicted noninteracting pairs that are true noninteracting pairs; (6) the F1-score is a weighted average (the harmonic mean) of the PPV and sensitivity, reaching its best value at 1 and its worst at 0; (7) Matthew’s correlation coefficient (MCC) is a more stringent measure of prediction accuracy that accounts for both under- and overpredictions. These parameters are defined as follows:

$$\begin{aligned}
\mathrm{ACC} &= \frac{TP + TN}{TP + FP + TN + FN}, \\
\mathrm{SN} &= \frac{TP}{TP + FN}, \\
\mathrm{SP} &= \frac{TN}{TN + FP}, \\
\mathrm{PPV} &= \frac{TP}{TP + FP}, \\
\mathrm{NPV} &= \frac{TN}{TN + FN}, \\
F_1 &= \frac{2 \times \mathrm{SN} \times \mathrm{PPV}}{\mathrm{SN} + \mathrm{PPV}}, \\
\mathrm{MCC} &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}},
\end{aligned}$$

where true positive (TP) is the number of true PPIs that are predicted correctly; false negative (FN) is the number of true PPIs that are predicted to be noninteracting pairs; false positive (FP) is the number of true noninteracting pairs that are predicted to be PPIs; and true negative (TN) is the number of true noninteracting pairs that are predicted correctly.
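As an illustration, the seven metrics can be computed from the four confusion-matrix counts with a short Python function. The sketch uses the standard textbook definitions of these metrics and assumes that every denominator is non-zero; the function name and dictionary keys are chosen for this example.

```python
import math


def classification_metrics(tp, fn, fp, tn):
    """Compute ACC, SN, SP, PPV, NPV, F1, and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)    # sensitivity: true PPIs that were recovered
    sp = tn / (tn + fp)    # specificity: noninteracting pairs that were recovered
    ppv = tp / (tp + fp)   # predicted PPIs that are true PPIs
    npv = tn / (tn + fn)   # predicted noninteracting pairs that are correct
    acc = (tp + tn) / (tp + fn + fp + tn)
    f1 = 2 * sn * ppv / (sn + ppv)  # harmonic mean of SN and PPV
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    return {"ACC": acc, "SN": sn, "SP": sp, "PPV": ppv,
            "NPV": npv, "F1": f1, "MCC": mcc}
```

All values are returned as fractions in [0, 1] (MCC in [-1, 1]); multiply by 100 to express them as percentages.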

The above-mentioned parameters rely on the selected threshold. The area under the ROC curve (AUC), which is a threshold-independent measure of performance, can be easily calculated according to the following formula [45]:

$$\mathrm{AUC} = \frac{S_0 - n_0 (n_0 + 1)/2}{n_0 n_1},$$

where $n_0$ and $n_1$ denote the number of positive and negative samples, respectively, and $S_0 = \sum_i r_i$ is the sum of the ranks $r_i$ of all positive samples in the list of all samples ranked in increasing order by their estimated probability of belonging to the positive class. AUC values give good insight into the performance comparison of different prediction methods. Although the AUC is threshold-independent, an appropriate threshold must be selected for the final decision. For a classifier that outputs a continuous numeric value representing the confidence or probability that a sample belongs to the predicted class, adjusting the classification threshold leads to different confusion matrices, which determine different ROC points [29].
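The rank-based AUC calculation can be sketched in Python as follows. Assigning tied scores their average rank is an assumption of this sketch (the text does not say how ties are handled), and the function name is hypothetical.

```python
def auc_from_ranks(scores, labels):
    """AUC from the rank sum of the positive samples (labels: 1 = positive, 0 = negative)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores starting at sorted position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    s_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (s_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

With scores `[0.1, 0.4, 0.35, 0.8]` and labels `[0, 0, 1, 1]`, the positives receive ranks 2 and 4, so the rank sum is 6 and the function returns (6 - 3)/4 = 0.75.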

3.2. Determination of ELM Parameter

The number of hidden nodes is a critical factor for the generalization of ELM. To determine this parameter, four-fifths of the whole dataset were randomly chosen to train ELM classifiers with different numbers of hidden nodes, while the remaining one-fifth of the dataset was used as the validation set to compute the accuracy.

Here the sigmoid function was used as the activation function of the ELM classifier. The results are plotted in Figure 4, which shows that the accuracy reaches about 0.9 and then increases only slowly once the number of hidden neurons is set to 9 percent of the number of samples. Based on Figure 4, we set the number of hidden neurons of the ELM classifier to 9 percent of the number of samples. The second experiment examined how the running time scales with the number of hidden neurons. We increased the number of hidden neurons from 1 to 11 percent of the number of samples and measured the average time overhead. Figure 5 shows that the running time of the proposed ELM model scales nearly linearly with the number of hidden neurons.

3.3. Prediction Performance of Proposed Model

We evaluated the performance of the proposed model using the PPIs dataset as described in the aforementioned section. To guarantee that the experimental results are valid and can be generalized for making predictions regarding new data, we adopted the fivefold cross-validation in this study. The advantages of cross-validation are that the impact of data dependency is minimized and the reliability of the results can be improved.

The prediction performance of the ELM predictor with the novel protein sequence representation across five runs is shown in Table 2. It can be observed from Table 2 that a high prediction accuracy of 84.8% is achieved by the ELM model with the sigmoid function. To better investigate the prediction ability of our model, we also calculated the values of sensitivity, specificity, PPV, NPV, F1-score, MCC, and AUC. From Table 2, we can see that our model gives good prediction performance, with an average sensitivity of 84.08%, specificity of 85.53%, PPV of 85.47%, NPV of 84.15%, F1-score of 84.77%, MCC of 74.22%, and AUC of 0.9232. Further, it can also be seen in Table 2 that the standard deviations of accuracy, sensitivity, specificity, PPV, NPV, F1-score, MCC, and AUC are as low as 0.0022, 0.0019, 0.0028, 0.0040, 0.0038, 0.0029, 0.0030, and 0.0028, respectively.

To demonstrate the performance of the proposed model, we further compared our method with the state-of-the-art SVM predictor. Table 2 shows the performance of the ELM and SVM models. As observed from Table 2, the testing time of the SVM algorithm (2794.29 s) is roughly 38 times the testing time of the ELM algorithm (72.7901 s) with the sigmoid activation function. In addition, the prediction performance of ELM is also promising. The AUC of the SVM algorithm is 0.8878, which is lower than that of ELM. The overall accuracy, sensitivity, specificity, PPV, NPV, F1-score, and MCC of the SVM algorithm are, respectively, 81.77%, 81.19%, 82.32%, 82.15%, 81.44%, 81.65%, and 70.18%, as shown in Table 2. Hence, almost all evaluation measures of the ELM algorithm are somewhat better than those of the SVM algorithm, while ELM is much faster than SVM.

We also conducted an experiment to characterize the sensitivity (i.e., the fraction of true positives that can be detected by our method) and specificity (i.e., 1 − false positive rate) of the proposed approach for different activation functions (Figure 6). The results in Figure 6 are reported using receiver operating characteristic (ROC) curves, which plot the achievable sensitivity at a given specificity (1 − false positive rate). Good performance is reflected in curves with a stronger bend towards the upper-left corner of the ROC graph (i.e., high sensitivity is achieved with a low false positive rate). We found that the proposed method achieved a detection rate of over 83 percent with a false positive rate of less than 10 percent. The results demonstrate that the proposed ELM can successfully classify positive and negative samples with all five activation functions that we investigated. Our algorithm correctly classifies the large majority of interacting and noninteracting protein pairs.

To sum up, considering its high efficiency as well as its good performance, we conclude that the proposed approach generally outperforms the state-of-the-art SVM model, with higher discrimination power for predicting PPIs based on protein sequence information. Our model is therefore a more appropriate method for predicting new protein interactions than the SVM-based alternative considered here. Consequently, we are more convinced that the proposed method can be very helpful in assisting biologists in the design and validation of experimental studies and in the prediction of interaction partners.

4. Conclusions

In this paper, we have developed an efficient and fast learning technique, which utilizes global and local information of the protein amino acid sequence, for accurate identification of PPIs at considerably high speed in both the training and testing phases. The first contribution of this work is a novel protein amino acid sequence representation that uses amino acid composition and a local descriptor to represent the global and local information of a protein sequence, respectively. The second is the application of the extreme learning machine, which ensures reliable recognition with minimum error and a learning speed approximately thousands of times faster than the state-of-the-art classification method SVM. Experimental results demonstrated that the proposed method performed well in distinguishing interacting and noninteracting protein pairs. The proposed method achieved a mean classification accuracy of 84.8% using fivefold cross-validation. A comparative study was also conducted between the proposed method and the state-of-the-art SVM. The experimental results showed that our method significantly outperformed SVM in terms of classification accuracy, with a shorter running time.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Science Foundation of China, under Grants 61102119, 61373086, 61133010, U1201256, and 61171125. The authors would like to thank all the guest editors and anonymous reviewers for their constructive advice.