Abstract

Protein-protein interactions (PPIs) play a vital role in a number of biological activities such as signal transduction, transcriptional regulation, and apoptosis. Although advances in high-throughput technologies have generated large amounts of PPI data for different species, these data cover only a small part of the entire PPI network. Furthermore, traditional experimental methods are generally expensive, time-consuming, tedious, and prone to high false-positive rates. It is therefore necessary to develop computational methods for predicting PPIs. In this article, we propose an efficient computational method that detects protein-protein interactions using only protein sequence information by integrating the MatPCA feature extraction algorithm with the weighted sparse representation classifier. When predicting PPIs on the yeast, human, and H. pylori datasets, the proposed method achieves average accuracies of 94.55%, 97.48%, and 83.64%, respectively. These experimental results illustrate that the proposed method is reliable and robust in predicting PPIs and can be regarded as a useful complement to experimental methods.

1. Introduction

Proteins are an important part of all organisms and are among the most versatile organic macromolecules in living systems. Generally, proteins interact with each other to perform their functions [1]. Therefore, predicting protein-protein interactions (PPIs) is critical to elucidating the function of proteins and exploring the pathogenesis of diseases. Researchers have developed many experimental methods to identify protein interactions by employing high-throughput technologies, for example, mass spectrometry [2], immunoprecipitation [3], yeast two-hybrid systems [4], and protein chips [5]. Although these experimental methods have achieved some results in detecting PPIs, the interactions identified so far account for only a small part of the entire PPI network. Furthermore, these methods generalize poorly, suffer from high false-negative and false-positive rates, and are both costly and time-consuming [6, 7]. Thus, it is indispensable to develop reliable computational methods that complement experimental methods and reduce the cost of predicting PPIs [8].

To date, a series of PPI detection methods based on different data types have been suggested, including gene fusion [9], protein domains [10], amino acid index distribution [11], phylogenetic profile [12], and protein structure information [13]. Nevertheless, the disadvantage of these methods is that they require pre-existing knowledge about the protein pairs, such as protein functional domains and 3D structure information. In practice, relatively few proteins have a known 3D structure, which limits the extensive use of these methods [14]. Moreover, with the development of genomic technologies, protein sequence data have grown explosively and are readily available. As an alternative to these data types, researchers have developed many methods that infer potential PPIs from the amino acid sequences of proteins. Experimental results confirm that using only protein amino acid sequences is feasible for predicting PPIs [15–17].

A growing number of computational approaches have been developed to predict PPIs based on protein sequence information [18, 19]. For example, Martin et al. [20] suggested a new signature descriptor to predict PPIs by combining the sequence-based protein description with the experimental information collected by the protein-protein interaction screen. The method achieves 70% to 80% accuracy on publicly available yeast and H. pylori datasets. Qi et al. [21] compared the prediction results of six different classifiers in detecting PPIs, namely, decision tree, random forest (RF), naive Bayes, logistic regression, RF similarity-based k-nearest neighbors, and support vector machine. Guo et al. [22] used autocovariance to characterize protein sequences and combined it with support vector machine classifiers to predict PPIs. The method reached 88.09% prediction accuracy on the Saccharomyces cerevisiae dataset. You et al. [23] designed a computational method named PCA-EELM (principal component analysis-ensemble extreme learning machine) to detect PPIs, which used only protein sequence information. The prediction accuracy, precision, and sensitivity of the method on the yeast dataset were 87.00%, 87.59%, and 86.15%, respectively. Huang et al. [24] applied a global encoding feature extraction algorithm and a weighted sparse representation classifier to protein sequences to predict PPIs and obtained good prediction accuracy. Du et al. [25] adopted a deep neural network method to predict PPIs. This method uses multiple integrated feature descriptors to represent the feature information of the amino acid sequence. When predicting PPIs on the yeast dataset, this model yields 92.5% prediction accuracy. Although these machine learning methods have achieved good results in predicting PPIs, the prediction accuracy of the algorithms still needs to be improved to obtain efficient and accurate prediction models.

In this study, we present a novel protein sequence-based computational method to detect protein-protein interactions by employing a matrix representation of protein sequences. Specifically, we first convert each protein sequence into a numerical substitution matrix representation (SMR). Secondly, we adopt MatPCA to extract features from the SMR to characterize protein amino acid sequences. By doing this, we can represent each protein pair as an 800-dimensional feature vector. Finally, we feed the obtained feature vectors into a weighted sparse representation-based classifier (WSRC) model, which performs the classification task for protein-protein interaction prediction. In this work, the proposed method was applied to three different biological datasets, namely, yeast, H. pylori, and human, for identifying protein-protein interactions. To further evaluate the prediction performance of our model, we compared the proposed method with an SVM-based method and with other existing methods. The experimental results show that the proposed method achieves higher accuracy in predicting PPIs than the existing methods.

2. Materials and Methods

2.1. Dataset Construction

In this experiment, we used three available PPI datasets, namely, yeast, human, and H. pylori, to evaluate the predictive performance of the proposed method. Table 1 shows the summary of the datasets used in this experiment. The first dataset is a high-confidence Saccharomyces cerevisiae PPI dataset, which was collected from the freely available Database of Interacting Proteins (DIP) [26] and provided by Guo et al. [22]. We preprocessed the protein data as follows. Protein pairs with sequences shorter than fifty residues were deleted because these could be fragments. In addition, sequence pairs with more than 40% sequence identity are considered homologous, so we also removed these protein sequence pairs to eliminate the bias of homologous sequence pairs. In this way, we obtained the remaining 5594 protein pairs, which formed the positive dataset. To construct negative samples, we used 5594 additional protein pairs whose proteins come from different subcellular localizations as the negative dataset. As a result, the yeast PPI dataset contained 11,188 protein pairs, half of which were negative samples and the other half positive samples. The second dataset is the human dataset, which is derived from the Human Protein Reference Database (HPRD) [27]. We obtained 3899 interacting pairs and 4262 noninteracting pairs after removing protein pairs with more than 25% sequence identity. Specifically, the interacting protein pairs were screened from 2502 different human proteins. Assuming that proteins in different subcellular compartments do not interact with each other, noninteracting protein pairs were obtained from 661 different human proteins [28]. Finally, the human dataset consisted of 8161 protein pairs (3899 positive samples and 4262 negative samples). The third dataset is the Helicobacter pylori dataset introduced by Martin et al. [20]. The dataset contains a total of 2916 samples, including 1458 interacting pairs and 1458 noninteracting pairs.

2.2. Substitution Matrix Representation

It is vital to efficiently represent the intrinsic information of protein sequences when using computational methods to predict protein-protein interactions because an accurate and reliable protein sequence feature representation method will directly influence the prediction results of PPI predictors. Previous studies have shown those interacting or functionally related proteins tend to exhibit similarities in molecular phylogenetic trees during coevolution [29]. In this section, we present a novel feature representation method for predicting PPIs by transforming the evolutionary information of protein sequences into a matrix representation.

The proposed substitution matrix representation (SMR) method is a variant of the method described by Yu et al. [30], which retains the evolutionary information of protein sequences. For any given protein sequence of length N, this representation converts the sequence into an $N \times 20$ matrix. In this experiment, we used BLOSUM62, a very popular substitution scoring matrix for sequence alignment. In this transformation, SMR can be expressed as follows:

$$\mathrm{SMR}(i, j) = M(p_i, j), \quad i = 1, \ldots, N, \; j = 1, \ldots, 20,$$

where $M$ stands for BLOSUM62, which is a substitution matrix with 20 rows and 20 columns, and $M(i, j)$ denotes the score reflecting the probability of the ith amino acid mutating to the jth amino acid in the evolutionary process. Here, $P = (p_1, p_2, \ldots, p_N)$ is a given protein sequence consisting of N amino acids.
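The construction of the SMR can be illustrated with a minimal sketch. It assumes that row i of the SMR is simply the substitution-matrix row of the ith residue, which is our reading of the description above rather than the authors' code. The function name, the three-letter alphabet, and the small score table are hypothetical and are used only to keep the example short; a real run would use all 20 amino acids and the full BLOSUM62 matrix.

```python
def substitution_matrix_representation(sequence, alphabet, scores):
    """Return the SMR of `sequence` as a list of rows, one row per residue."""
    # Map each residue symbol to its row index in the substitution matrix.
    index = {residue: k for k, residue in enumerate(alphabet)}
    # Row i of the SMR is the substitution-score row of the i-th residue.
    return [list(scores[index[residue]]) for residue in sequence]


# Toy three-letter alphabet and score table (illustrative assumption only).
TOY_ALPHABET = "ACD"
TOY_SCORES = [
    [4, 0, -2],   # A scored against A, C, D
    [0, 9, -3],   # C scored against A, C, D
    [-2, -3, 6],  # D scored against A, C, D
]

smr = substitution_matrix_representation("ACDA", TOY_ALPHABET, TOY_SCORES)
print(len(smr), len(smr[0]))  # number of residues, alphabet size
```

With the full 20-letter alphabet, the same function would yield an N × 20 matrix for a sequence of N residues.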

2.3. Matrix Principal Component Analysis (MatPCA)

As an effective feature extraction algorithm, MatPCA can deal with both vector pattern and matrix pattern, which was originally introduced by Chen et al. [31]. The idea of MatPCA is derived from the 2-dimensional principal component analysis or image principal component analysis, which mainly performs operations on the image matrix. The description of MatPCA is as follows.

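Suppose a training sample set $\{A_1, A_2, \ldots, A_M\}$ consisting of $M$ samples, and their mean is

$$\bar{A} = \frac{1}{M} \sum_{j=1}^{M} A_j,$$

where $A_j$ is an $m \times n$ matrix and represents the jth training sample. Let $X = [x_1, x_2, \ldots, x_d]$ be an $m \times d$ projection matrix and $x_i$ be a projection vector with $m$ components. MatPCA uses this projection matrix for feature extraction by performing the following linear transformation on an arbitrary $A$:

$$Y = X^{T}(A - \bar{A}),$$

where $Y$ is the $d \times n$ feature matrix and $d$ is the number of projection directions, with each row vector $y_i$ of $Y$ satisfying the equation

$$y_i = x_i^{T}(A - \bar{A}), \quad i = 1, \ldots, d.$$

Hence, for each $A_j$ in the training set, we have

$$Y_j = X^{T}(A_j - \bar{A}).$$

In order to obtain the optimal projection vectors and to retain more of the original information of the training set in the projected space, we next construct the reconstructed error (Re) criterion of all training samples, specifically, minimizing the following criterion:

$$\mathrm{Re}(X) = \frac{1}{M} \sum_{j=1}^{M} \left\| A_j - \hat{A}_j \right\|_F^{2}.$$

Here, $\hat{A}_j = \bar{A} + X Y_j$ is the reconstructed representation for $A_j$ and $\|\cdot\|_F$ is the matrix Frobenius norm, which can also be written as $\|A\|_F^{2} = \mathrm{tr}(A^{T}A)$, where tr() denotes a matrix trace operation. In this way, Re can be simplified to

$$\mathrm{Re}(X) = \frac{1}{M} \sum_{j=1}^{M} \mathrm{tr}\left[(A_j - \bar{A})^{T}(A_j - \bar{A})\right] - \mathrm{tr}\left(X^{T} S_t X\right),$$

and specifically,

$$S_t = \frac{1}{M} \sum_{j=1}^{M} (A_j - \bar{A})(A_j - \bar{A})^{T},$$

where $S_t$ is called the total covariance matrix composed of the given sample matrices, and it is easy to prove that $S_t$ is a positive semidefinite matrix without negative eigenvalues. It can be seen that the first term of $\mathrm{Re}(X)$ is a constant. Therefore, minimizing $\mathrm{Re}(X)$ is equivalent to maximizing $J(X)$ as follows:

$$J(X) = \mathrm{tr}\left(X^{T} S_t X\right).$$

Additionally, under the constraint condition $X^{T}X = I_d$, $J(X)$ is optimized to obtain the following eigenvalue-eigenvector matrix equation:

$$S_t X = X \Lambda,$$

where $I_d$ is a $d \times d$ identity matrix and $X^{T}$ is the transpose of the matrix $X$. Here, $\Lambda$ is a diagonal matrix whose diagonal elements $\lambda_1 \geq \lambda_2 \geq \cdots$ are the nonnegative eigenvalues of $S_t$, and $X$ is a matrix composed of the corresponding eigenvectors. Then, the following formula is used to select an appropriate $d$:

$$\frac{\sum_{i=1}^{d} \lambda_i}{\sum_{i=1}^{m} \lambda_i} \geq \theta,$$

where $\theta$ is a preset threshold.

Finally, the corresponding fusion features are obtained by determining an appropriate $d$. In this experiment, we obtained 400 features by using MatPCA to analyze the substitution matrix representation of a given protein sequence.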
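The MatPCA procedure can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: all sample matrices are assumed to share the same size m × n, the total covariance matrix is taken as the average of (A_j − mean)(A_j − mean)^T, and the number of projection directions is chosen as the smallest d whose leading eigenvalues reach a preset share of the eigenvalue sum. The function names and the `threshold` parameter are hypothetical.

```python
import numpy as np


def matpca_fit(samples, threshold=0.95):
    """Estimate the sample mean and projection matrix from equally sized matrices."""
    A = np.asarray(samples, dtype=float)              # shape (M, m, n)
    mean = A.mean(axis=0)
    centered = A - mean
    # Total covariance: (1/M) * sum_j (A_j - mean)(A_j - mean)^T, shape (m, m).
    s_t = np.einsum("jik,jlk->il", centered, centered) / A.shape[0]
    eigvals, eigvecs = np.linalg.eigh(s_t)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]                 # re-sort in descending order
    eigvals = np.clip(eigvals[order], 0.0, None)      # guard against tiny negative round-off
    eigvecs = eigvecs[:, order]
    # Smallest d whose cumulative eigenvalue share reaches the threshold.
    ratio = np.cumsum(eigvals) / eigvals.sum()
    d = min(int(np.searchsorted(ratio, threshold)) + 1, len(eigvals))
    return mean, eigvecs[:, :d]


def matpca_transform(A, mean, X):
    """Project one sample matrix onto the selected directions: Y = X^T (A - mean)."""
    return X.T @ (np.asarray(A, dtype=float) - mean)


# Usage on synthetic data (random matrices stand in for real feature matrices).
rng = np.random.default_rng(0)
samples = rng.normal(size=(30, 6, 4))
mean, X = matpca_fit(samples, threshold=0.9)
Y = matpca_transform(samples[0], mean, X)
print(X.shape, Y.shape)
```

Flattening the projected matrix `Y` gives a fixed-length feature vector for one sample. How variable-length sequences are brought to a common matrix size before this step is not specified here and is left open in the sketch.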

2.4. Weighted Sparse Representation-Based Classification (WSRC)

Sparse representation classifier (SRC) is a popular nonparametric data mining method, which was originally proposed by Wright et al. [32]. The main purpose of this method is to sparsely represent the test set by linearly combining the training set in the original sample data. Finally, the new test set is assigned to the class with minimal residue. So far, the sparse representation classifier has been widely used in various fields to solve different practical problems, such as face recognition [33], speaker recognition system [34], text classification [35], and diabetes detection based on facial block color features [36]. The SRC algorithm is described as follows.

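Consider a training sample matrix $X \in \mathbb{R}^{d \times n}$ consisting of $n$ samples with $d$-dimensional feature vectors. The SRC algorithm assumes that the lth sample of $X$ is $x_l$ and that there are enough samples corresponding to each of the $K$ object classes. Then, the samples of the ith class can be expressed as $X_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,n_i}]$, where $n_i$ represents the number of samples that belong to the ith class. Here, the training sample matrix can be further written as $X = [X_1, X_2, \ldots, X_K]$. For a new test sample $y \in \mathbb{R}^{d}$ that belongs to the ith class, SRC is used to find a column vector $\alpha = [\alpha_{i,1}, \alpha_{i,2}, \ldots, \alpha_{i,n_i}]^{T}$ such that

$$y = \alpha_{i,1} x_{i,1} + \alpha_{i,2} x_{i,2} + \cdots + \alpha_{i,n_i} x_{i,n_i}.$$

When using a linear combination of all training samples, $y$ can be described as follows:

$$y = X \alpha_0.$$

It should be noted that nonzero entries in $\alpha_0$ are only relevant for the ith class. Then, we have the following:

$$\alpha_0 = [0, \ldots, 0, \alpha_{i,1}, \alpha_{i,2}, \ldots, \alpha_{i,n_i}, 0, \ldots, 0]^{T}.$$

Next, to find the sparsest solution, this can be written as the $\ell_0$ minimization problem

$$\hat{\alpha}_0 = \arg\min \|\alpha\|_0 \quad \text{subject to} \quad y = X\alpha.$$

However, solving this $\ell_0$ minimization problem is NP-hard. Based on the theory of compressive sensing [37], we know that when $\alpha$ is sufficiently sparse, the problem can be transformed into solving the $\ell_1$ minimization:

$$\hat{\alpha}_1 = \arg\min \|\alpha\|_1 \quad \text{subject to} \quad y = X\alpha.$$

In order to deal with occlusion, this $\ell_1$ minimization should be further transformed into the following stable $\ell_1$ minimization problem:

$$\hat{\alpha}_1 = \arg\min \|\alpha\|_1 \quad \text{subject to} \quad \|y - X\alpha\| \leq \varepsilon,$$

where $\varepsilon > 0$ denotes the threshold of the residue. Here, we can adopt standard linear programming approaches to solve this minimization problem [38]. Subsequently, the SRC algorithm classifies the given test sample based on the following rule:

$$\min_{c} r_c(y) = \left\| y - X \hat{\alpha}_1^{c} \right\|, \quad c = 1, \ldots, K,$$

where $X \hat{\alpha}_1^{c}$ is the reconstructed value, which is formed by the training samples of class $c$, and $r_c(y)$ represents the residual. When $y$ belongs to class $c$, those entries of $\hat{\alpha}_1$ related to class $c$ have nonzero values. Finally, the test sample is assigned to the class with the smallest residual.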

However, previous research results have demonstrated that, in some cases, the local structure is more crucial than the sparsity of the data [39, 40]. Therefore, in this paper, we employ the weighted sparse representation-based classification model, which combines the local structure of the data with the sparse representation. The WSRC algorithm needs an appropriate way to assess the relationship between testing and training samples. Here, the distance based on the Gaussian kernel is applied to the WSRC algorithm to evaluate the similarity between two samples because this distance can capture the nonlinear relationship in the dataset. The Gaussian-based distance can be formulated as follows:

$$d_G(x_i, y) = e^{-\|x_i - y\|^{2} / 2\sigma^{2}},$$

where $x_i$ and $y$ represent training and testing samples, respectively, and $\sigma$ denotes the Gaussian kernel width. Unlike the SRC method, WSRC solves the following weighted $\ell_1$ minimization problem over the coefficient vector $\alpha$ and the training sample matrix $X$:

$$\hat{\alpha}_1 = \arg\min \|W\alpha\|_1 \quad \text{subject to} \quad y = X\alpha,$$

and specifically,

$$\mathrm{diag}(W) = \left[ d_G(y, x_{1,1}), \ldots, d_G(y, x_{K,n_K}) \right]^{T},$$

where $W$ is a diagonal-weighted matrix, $x_{k,j}$ is the jth training sample of class $k$, and $n_k$ represents the number of training samples in class $k$. Similarly, when dealing with occlusion, the WSRC algorithm needs to solve the stable $\ell_1$ minimization problem:

$$\hat{\alpha}_1 = \arg\min \|W\alpha\|_1 \quad \text{subject to} \quad \|y - X\alpha\| \leq \varepsilon,$$

where $\varepsilon > 0$ is the threshold of the residue.
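The classification rule can be illustrated with a small NumPy sketch. It deliberately swaps in a different solver from the one described in the text: instead of solving the constrained problem by linear programming, it minimises the penalised form 0.5·‖y − Xα‖² + λ·Σ wᵢ|αᵢ| with the iterative soft-thresholding algorithm (ISTA). The per-sample weights are taken as an input, so deriving them from the Gaussian-based distance is left to the caller; uniform weights reduce the procedure to plain SRC. The function names, `lam`, and `n_iter` are assumptions made for illustration.

```python
import numpy as np


def weighted_l1_ista(X, y, weights, lam=0.01, n_iter=500):
    """Approximately minimise 0.5*||y - X a||^2 + lam * sum_i weights[i] * |a_i|."""
    step = 1.0 / (np.linalg.norm(X, 2) ** 2)    # 1 / Lipschitz constant of the gradient
    alpha = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Gradient step on the least-squares term.
        z = alpha - step * (X.T @ (X @ alpha - y))
        # Weighted soft-thresholding step on the l1 term.
        alpha = np.sign(z) * np.maximum(np.abs(z) - step * lam * weights, 0.0)
    return alpha


def classify_by_residual(X, labels, y, weights=None, lam=0.01):
    """Assign y to the class whose training columns reconstruct it with the smallest residual."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    labels = np.asarray(labels)
    if weights is None:
        weights = np.ones(X.shape[1])           # uniform weights: plain SRC
    alpha = weighted_l1_ista(X, y, np.asarray(weights, dtype=float), lam)
    residuals = {}
    for c in np.unique(labels):
        mask = labels == c
        # Keep only the coefficients of class c when reconstructing y.
        residuals[int(c)] = float(np.linalg.norm(y - X[:, mask] @ alpha[mask]))
    return min(residuals, key=residuals.get), residuals


# Toy example: columns are training samples, two per class.
X_train = np.array([[1.0, 0.8, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.8],
                    [0.0, 0.6, 0.0, 0.6]])
y_train = [0, 0, 1, 1]
predicted, residuals = classify_by_residual(X_train, y_train, [0.9, 0.0, 0.3])
print(predicted, residuals)
```

In this toy case the test sample lies in the span of the class 0 columns, so the class 0 residual is close to zero and the sample is assigned to class 0.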

2.5. Procedure of the Proposed Method

In this experiment, the workflow diagram of the proposed method is presented in Figure 1. More specifically, the positive samples of the datasets used by this study were experimentally identified PPI data. Each protein sequence is converted into a numerical substitution matrix representation, respectively. Following this, MatPCA algorithm is employed to obtain numerous valuable feature vectors for PPI prediction. Finally, we feed the obtained feature vectors into a WSRC model, and then five-fold cross-validation is performed to evaluate the proposed method.

3. Results and Discussion

3.1. Evaluation Measures

In order to assess the reliability and robustness of the proposed method from different perspectives, we used the following four criteria, overall prediction accuracy (ACC), sensitivity (SN), precision (PE), and Matthews correlation coefficient (MCC), to comprehensively evaluate the model. The definitions of these evaluation indexes are as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN},$$

$$\mathrm{SN} = \frac{TP}{TP + FN},$$

$$\mathrm{PE} = \frac{TP}{TP + FP},$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}},$$

where true positive (TP) is the count of samples that are correctly detected as positive by the model, false negative (FN) is the number of samples that are wrongly predicted as negative by the model, true negative (TN) is the count of samples that are correctly detected as negative by the model, and false positive (FP) is the number of samples that are wrongly predicted as positive by the model. Meanwhile, we also constructed the receiver operating characteristic (ROC) curves [41] and calculated the area under the ROC curve (AUC) to show the predictive performance of the proposed model. In general, different predictors can be compared through their AUC values: a larger AUC indicates a better predictor.
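As a worked illustration of these four measures, the following sketch computes them from confusion-matrix counts using the standard definitions. The function name and the example counts are hypothetical and are not taken from the experiments.

```python
import math


def evaluation_measures(tp, fn, tn, fp):
    """Compute ACC, SN, PE, and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sn = tp / (tp + fn)                      # sensitivity (recall)
    pe = tp / (tp + fp)                      # precision
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom
    return {"ACC": acc, "SN": sn, "PE": pe, "MCC": mcc}


# Hypothetical counts for illustration only.
print(evaluation_measures(tp=50, fn=10, tn=30, fp=10))
```

For these counts, ACC is 0.8, SN and PE are both 50/60, and MCC is 1400/2400, or about 0.583. The sketch does not guard against a zero denominator, which occurs when a whole row or column of the confusion matrix is empty.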

3.2. Assessment of Prediction Ability

For the sake of fairness, we used the same values of the two main parameters of the weighted sparse representation-based classifier, $\sigma$ and $\varepsilon$, which were set to 1.5 and 0.00005, respectively, when performing PPI prediction on the three benchmark datasets. In addition, to avoid overfitting, we used a five-fold cross-validation method to verify the stability of the proposed model by dividing the entire dataset into a training set and an independent test set. More specifically, each dataset is divided into five parts; in each round, one part is selected as the test set, and the remaining four parts are used as the training set. In this way, we obtained five models, and the final prediction result is the average over the five separate experiments. Tables 2–4 list the prediction results on the three datasets based on the proposed model combined with the five-fold cross-validation method.
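The five-fold protocol can be sketched as follows. This is a generic illustration that assumes a random shuffle before splitting; whether the original experiments shuffled or stratified the folds is not stated. The names `five_fold_splits`, `cross_validate`, and the `evaluate` callback are hypothetical.

```python
import random


def five_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]       # k nearly equal parts
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, evaluate, k=5, seed=0):
    """Average a user-supplied score over the k train/test splits."""
    scores = [evaluate(train, test) for train, test in five_fold_splits(n_samples, k, seed)]
    return sum(scores) / len(scores), scores


# Usage with a placeholder scoring function (returns the test-set fraction).
mean_score, fold_scores = cross_validate(10, lambda train, test: len(test) / 10)
print(mean_score, fold_scores)
```

In practice, `evaluate` would train the classifier on the `train` indices and return a measure such as accuracy on the `test` indices; the mean and the spread of `fold_scores` correspond to the averages and standard deviations reported per dataset.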

As shown in Table 2, when using the proposed method to predict the PPIs of the yeast dataset, the average accuracy, precision, sensitivity, and MCC were 94.55%, 92.33%, 97.15%, and 89.68%, respectively, with corresponding standard deviations of 0.63%, 0.91%, 0.42%, and 1.12%. From Table 3, which gives the five-fold cross-validation PPI prediction results of the proposed method on the human dataset, we can see that the average accuracy, precision, sensitivity, and MCC of the proposed method are 97.48%, 96.27%, 99.01%, and 95.06%, respectively, with corresponding standard deviations of 0.31%, 0.55%, 0.32%, and 0.60%. It can be seen from Table 4 that when detecting the PPIs of the H. pylori dataset by using the five-fold cross-validation method, the proposed method achieved an average accuracy, precision, sensitivity, and MCC of 83.64%, 89.71%, 75.98%, and 72.26%, respectively, with corresponding standard deviations of 1.15%, 0.79%, 1.63%, and 1.49%. To further demonstrate the reliability of our model, we also plotted the ROC curves of the three datasets and calculated the AUC value, which denotes the area under the ROC curve. Figures 2–4 show the ROC curves obtained when predicting PPIs on the yeast, human, and H. pylori datasets with the proposed method. The average AUC values of the proposed model on the yeast, human, and H. pylori datasets are 97.04%, 99.05%, and 86.35%, respectively. In summary, these experimental results show that our method is effective and reasonable in predicting protein-protein interactions.

3.3. Comparison with the Support Vector Machine Classification Model

To predict the interactions between proteins by using computational methods, researchers have proposed a variety of different machine learning models to detect PPIs. Among them, using the support vector machine (SVM) to predict protein interactions has become one of the most popular approaches in this field. In this section, we use the same feature extraction method combined with the SVM to perform PPI classification experiments on the same dataset to further evaluate the predictive performance of the proposed method. Here, we adopt the radial basis function as the kernel function, and we also optimize the two parameters BoxConstraint and KernelScale of the support vector machine, where BoxConstraint and KernelScale are 10 and 2 ^ 1.5 respectively. Finally, the prediction results obtained by combining the five-fold cross-validation method and the SVM classifier are shown in Table 5.

As can be seen from Table 5, when we detected the PPIs by using the support vector machine model on the yeast dataset, the average values of accuracy, sensitivity, precision, MCC, and AUC were 87.25%, 88.22%, 86.58%, 77.75%, and 93.96%, respectively. The average accuracy is about 7.30 percentage points lower than that of the WSRC classifier. When we predicted the PPIs by applying the support vector machine model on the human dataset, the average values of accuracy, sensitivity, precision, MCC, and AUC were 92.88%, 95.97%, 90.91%, 86.70%, and 97.56%, respectively. The average accuracy is about 4.60 percentage points lower than that of the WSRC classifier. When we detected the PPIs by utilizing the support vector machine model on the H. pylori dataset, the average values of accuracy, sensitivity, precision, MCC, and AUC were 82.64%, 94.33%, 76.51%, 70.53%, and 91.60%, respectively. The average accuracy is about 1.00 percentage point lower than that of the WSRC classifier. In addition, we also plotted the ROC curves of the SVM classifiers on the three datasets, which are shown in Figures 5–7, respectively. The comparison results show that the WSRC-based method achieves higher accuracy than the SVM-based method in predicting PPIs on the three datasets.

3.4. Prediction Performance on the Independent Dataset

Although the proposed method achieved good prediction accuracy on the yeast, human, and H. pylori datasets, it is worth considering whether the trained model can be adapted to predicting PPIs from other species. Therefore, four independent datasets were constructed from the DIP database to validate the performance of the trained model for predicting protein-protein interactions from other species. All 11,188 samples of the yeast dataset were used as the training set, while the C. elegans, E. coli, H. sapiens, and M. musculus datasets were treated as separate test sets. All samples in the test sets are positive. Here, we utilize the same SMR method and MatPCA feature extraction algorithm to convert the protein sequences of the four datasets into feature vectors and send them to the WSRC to perform PPI classification. The experimental results of our method are listed in Table 6. It can be seen that when predicting the PPIs of the C. elegans, E. coli, H. sapiens, and M. musculus datasets by using the proposed method, the accuracy is 96.84%, 90.14%, 97.10%, and 95.21%, respectively. These results indicate that our trained model also performs well in detecting PPIs from other species.

3.5. Comparison with the Previous Methods

Currently, a large number of computational methods have been proposed for predicting PPIs. Here, to further validate the effectiveness of the proposed method, we compared it with several state-of-the-art approaches on two benchmark datasets, namely, yeast and H. pylori, respectively. Table 7 shows the prediction results performed by using five different methods on the yeast dataset. We can clearly observe that the accuracy achieved by the other four existing methods is between 75.08% and 89.33%, which is significantly lower than the average accuracy of 94.55% by the proposed method. At the same time, the proposed method also obtains relatively low standard deviations, which further shows the stability and robustness of our model.

Similarly, we also compared the experimental results of the proposed method with five other existing methods on the H. pylori dataset. From Table 8, we can see that the proposed method achieves good prediction results in terms of accuracy, precision, and sensitivity. Specifically, the signature products, boosting, and phylogenetic bootstrap methods obtained relatively high precision, namely, 85.70%, 81.69%, and 80.20%, respectively. The HKNN and ensemble of HKNN methods obtained relatively high sensitivity, namely, 86.00% and 86.70%, respectively. In comparison, the proposed model achieved 83.64% accuracy, 89.71% precision, 75.98% sensitivity, and 72.26% MCC. The above results show that using MatPCA and weighted sparse representation-based classification can effectively predict protein-protein interactions with good prediction performance.

4. Conclusions

In the postgenomic era, it is essential to employ computational methods to predict the interactions between protein pairs, as this is important for explaining the molecular basis of complex cellular processes. In this paper, a novel computational model using solely protein sequence information was proposed for protein-protein interaction prediction. The proposed model first transforms the original protein sequence into a substitution matrix representation. Secondly, MatPCA, as a feature extraction algorithm, is applied to the SMR to capture the evolutionary information of protein sequences. Finally, a weighted sparse representation-based classifier is used to detect whether protein pairs interact. We evaluated the model with five-fold cross-validation on three highly credible benchmark datasets, namely, yeast, human, and H. pylori. Furthermore, we compared the experimental results of the proposed method with those of the SVM and of different existing models. The experimental results confirm that our method obtains good prediction results. These results demonstrate the effectiveness and feasibility of the proposed model, which can serve as a useful tool for detecting potential PPIs.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Disclosure

Zheng Wang and Yang Li are the co-first authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Authors’ Contributions

Zheng Wang and Yang Li contributed equally to this work.

Acknowledgments

This research was funded by the National Natural Science Foundation of China (Grant no. 61572506).