Abstract

Protein-protein interactions (PPIs) underlie many important molecular processes within cells and are of pivotal importance for cellular biochemical function. Although high-throughput experimental techniques have matured, enabling researchers to detect large numbers of PPIs, they have unavoidable disadvantages, such as high cost and long running times. Recent studies have demonstrated that PPIs can be efficiently detected by computational methods. Therefore, in this study, we propose a novel computational method to predict PPIs using only protein sequence information. The method is built on a deep learning algorithm, the stacked sparse autoencoder (SSAE), combined with a Legendre moment (LM) feature extraction technique. Finally, a probabilistic classification vector machine (PCVM) classifier is used to perform PPI prediction. The proposed method was evaluated on the human, unbalanced-human, H. pylori, and S. cerevisiae datasets with 5-fold cross-validation and yielded very high predictive accuracies of 98.58%, 97.71%, 93.76%, and 96.55%, respectively. To further evaluate its performance, we compared it with a support vector machine- (SVM-) based method. The experimental results indicate that the PCVM-based method is clearly preferable to the SVM-based method. Our results show that the proposed method is practical, effective, and robust.

1. Introduction

Most important molecular processes in cells are carried out by different types of protein interactions. Thus, one of the main objectives of functional proteomics is to determine the protein-protein interactions of organisms. With continued research and technical development, it is now possible to detect protein interactions on a large scale using high-throughput experimental techniques. Such research is clearly important, because PPIs are closely related to many functions of complex living systems, and these functions are not determined by the characteristics of the individual components. For example, molecular cell signaling is carried out through protein interactions; this process is not only the basis of many life functions but is also related to many diseases. In addition, the study of protein interactions has been of great value in the development of new drugs and in the prevention and diagnosis of disease.

As high-throughput experimental techniques have been successfully applied to PPI research tasks in the postgenomic era, a large number of PPI data for different species have been collected, and several databases have been created to systematically collect and store experimentally determined PPIs [1–3]. Even though experimentally validated PPI data drive proteomics research and development, they often contain high rates of false positives and false negatives [4–7]. In addition, because experimental methods have unavoidable defects, such as high cost and long running times, researchers have verified only a small part of the whole PPI network even after a long period of effort. With advances in mathematical and computational methods [8–12], computer technology has been applied in more and more fields. Vlachakis et al. proposed computational methods to simulate catalytic mechanisms, perform drug design, and model protein three-dimensional structures [13–17]. Vlamos et al. developed several intelligent disease diagnosis applications and hybrid models for vulnerability detection [18–25]. Other researchers have also introduced computational methods into the medical field and developed several automated diagnostic models [26, 27]. Therefore, using machine learning algorithms to develop an efficient and accurate automatic discriminative system for predicting new protein interactions has important practical significance.

To date, a variety of protein information has been used to build PPI prediction models based on machine learning algorithms. Usable protein information includes, but is not limited to, physicochemical information, structural information, evolutionary information, and protein domains. However, these methods have some limitations. For example, some computational methods using genomic information predict protein interactions by calculating patterns of gene presence or absence; the main factor limiting these methods is that they can only be applied to fully sequenced genomes [28, 29]. Recently, methods that extract information directly from the protein primary sequence have attracted much attention. Methods that use only protein sequence information are more general than methods that rely on additional information about the proteins. Many researchers are working on sequence-based computational models to predict new PPIs. Hamp and Rost developed a computational method for predicting PPIs based on profile-kernel support vector machines combined with evolutionary profiles [30]. An et al. proposed a PPI prediction method that combines local phase quantization and the relevance vector machine [31]. Yang et al. used a new local descriptor to describe the interactions between the contiguous and discontinuous regions of the protein sequence, which can capture more interaction information from sequences [32]. Zhang et al. introduced two ensemble methods to predict PPIs, based on undersampling techniques and fused classifiers [33]. You et al. proposed a prediction framework for detecting PPIs using a low-rank approximation-kernel extreme learning machine [34]. Several other sequence-based computational methods have been reported in previous work [35–38]. These sequence-based methods show that the amino acid sequence alone contains enough information to determine protein interactions. However, these methods usually use physical, chemical, or structural information, or even a fusion of all of these types of information, as features of the protein sequence; consequently, their feature extraction steps are not efficient. In addition, such information can only represent each specific protein sequence and does not contain knowledge related to protein interactions. Therefore, even when combined with advanced classification algorithms, these methods have difficulty achieving sufficient accuracy.

Compared with physicochemical information, the evolutionary information of proteins can reflect the potential interactions between proteins. Therefore, we consider the evolutionary information of the protein as the feature of the protein sequence. Extracting this evolutionary information is challenging, as there is currently no strategy that can obtain it efficiently. We hypothesize that there is a potential relationship between the conservation of amino acid residues during evolution and the interaction of proteins. Based on this hypothesis, we propose an efficient protein evolutionary feature extraction scheme, which uses a deep learning algorithm combined with Legendre moments (LMs) and the position weight matrix (PWM). Specifically, we first convert the protein sequence into a PWM containing the conservation scores of the amino acid residues. Then, we use LMs to extract important evolutionary information from the PWM and generate the feature vector. Finally, this feature is further optimized with an SSAE deep neural network to eliminate noise, retain the primary information, and reduce the feature dimension. In addition, in response to the challenges posed by big data and imbalanced datasets, a sparse model, the PCVM, is used to perform classification. Our contributions can be summarized as follows:
(1) We propose a method to predict PPIs quickly, efficiently, and accurately.
(2) We abandon traditional physicochemical and structural information, consider the evolutionary information associated with PPIs as the feature of the protein sequence, and propose a feature extraction strategy that quickly and efficiently extracts the evolutionary information of the protein and improves prediction performance.
(3) We confirm that sparse classification algorithms can greatly benefit PPI prediction and present results showing that they can help when dealing with large-scale and unbalanced data (as is the case with the PCVM).

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 describes the datasets and methods used in this paper. Section 4 shows the results of the experiment. Section 5 concludes the paper.

2. Related Work

The study of PPI prediction models is mainly divided into two parts: the development of protein sequence feature extraction strategies and the application of classification algorithms. This section briefly reviews related research.

2.1. Sequence-Based Feature Extraction Algorithm

Previous methods of extracting sequence features mainly made direct use of the physicochemical information, structural information, or evolutionary information of proteins. Since the amino acid composition model was proposed, many subsequent works have built on it. Chou [39] proposed a feature extraction method called pseudoamino acid composition, which greatly increases the information content of the amino acid sequence captured by the features: it considers not only the composition of amino acids but also their position information. Another notable study was conducted by Shen et al. [40], in which the 20 amino acids were clustered into 7 classes based on their dipole moments and side chain volumes, and the features of protein pairs were then extracted based on these amino acid classes. Combined with an SVM classifier, this method achieved a prediction accuracy of 83.9% on human PPIs. In a study by Guo et al. [41], an autocovariance-based method was developed to extract the interaction information of discontinuous amino acid fragments in a sequence. The method replaces the protein sequence with a digital sequence based on physicochemical properties, and the resulting digital sequence is then analyzed as a whole.

Different from previous classical computational methods, we do not use a traditional sequence-coding scheme and do not consider the physicochemical information of the protein sequence. Our method uses the evolutionary information of the protein sequence indirectly (Legendre moments extract feature vectors from the PSSM containing evolutionary information), applying image-processing ideas to the task of PPI prediction; this is a direction that only a few researchers are currently exploring. The introduction of our method and the satisfactory results produced on several gold standard datasets should encourage further exploration in this direction. The advantage of this method is that the feature extraction strategy is simple and efficient, requires no complicated sequence coding, and does not need to consider the physicochemical information of the protein. Compared with traditional feature extraction methods, this method greatly improves the accuracy of PPI prediction and saves time and computational overhead.

In addition, deep learning algorithms have shown extraordinary performance in many fields, but their ability has not been effectively verified on the PPI prediction task. In our work, a deep learning algorithm, the stacked sparse autoencoder, was used to reconstruct the protein feature vectors. This algorithm uses a sparse network structure and imposes sparseness constraints on the neurons, which not only allows us to obtain low-dimensional, low-noise protein feature vectors but also improves the efficiency of the network. The results of our method on the test sets demonstrate once again that deep learning algorithms can be used to help solve bioinformatics problems.

2.2. Classifier

The support vector machine (SVM) is one of the most commonly used classification algorithms in PPI prediction models [42–44]. However, the SVM approach has some obvious drawbacks: (1) as the dataset becomes larger, the number of support vectors increases rapidly, and (2) the cross-validation-based kernel parameter optimization strategy consumes a large amount of computing resources. Another widely used classifier is the relevance vector machine (RVM) [45–47], which effectively avoids these disadvantages of the SVM. It was developed within a Bayesian inference framework, with the weights assigned a zero-mean Gaussian prior. However, the RVM has the potential to produce some unreliable vectors that lead to erroneous decisions: because the weights of both the negative and the positive class are drawn from the same zero-mean Gaussian prior, some noninteracting training samples might be assigned confident weights, or vice versa.

In order to avoid the problems of the above classifiers, we used the probabilistic classification vector machine (PCVM) to perform PPI classification, which provides different priors for different classes of samples: the positive class is associated with a right-truncated Gaussian and the negative class with a left-truncated Gaussian. The PCVM method has the following advantages: (1) PCVM produces sparse predictive models and is therefore more efficient in the testing phase; (2) PCVM provides a probabilistic result for each output; (3) PCVM uses the EM algorithm to automatically find the optimal initial point, which saves time and improves the performance of the system.

3. Materials and Methodology

3.1. Datasets

To evaluate the performance of the proposed method, four different PPI datasets were used in our experiments: two human datasets, one S. cerevisiae dataset, and one H. pylori dataset.

The first human PPI dataset we used was from Pan et al. [48] and was downloaded from the Human Protein Reference Database (HPRD). After self-interactions and repetitive interactions were removed, the remaining 36,630 PPI pairs formed the final gold standard positive (GSP) dataset. For the gold standard negative (GSN) dataset, we followed previous work [48] and generated it from the Swiss-Prot version 57.3 database according to the following criteria: (1) protein sequences annotated with uncertain terms are removed; (2) protein sequences with multiple or no subcellular localization annotations are removed; (3) protein sequences that may be only “fragments” or that contain “fragments” are removed.

After strictly following the above steps, 1773 human proteins were obtained. Noninteracting protein pairs were then constructed by randomly pairing proteins located in different subcellular compartments. In addition, another gold standard negative dataset, used in the study by Smialowski et al. [49], was also included. The final GSN dataset, constructed by combining the two negative datasets above, consists of 36,480 noninteracting protein pairs. The entire gold standard dataset (GSD) therefore consists of 73,110 protein pairs, roughly half positive and half negative.

Because datasets in real-world tasks are often seriously imbalanced, which can lead to failure of a PPI prediction model, we constructed another human dataset with an unbalanced number of positive and negative samples to evaluate the stability and robustness of the proposed method. This unbalanced human PPI dataset consists of 3899 positive samples and 13,000 negative samples.

The third gold standard PPI dataset was downloaded from the S. cerevisiae core subset of the Database of Interacting Proteins (DIP). We strictly followed the work of Guo et al. [41] to construct the S. cerevisiae dataset. Finally, we obtained a gold standard dataset containing 11,188 protein pairs, of which 5594 positive pairs form the GSP dataset and 5594 negative pairs form the GSN dataset.

The last PPI dataset consists of the H. pylori protein pairs described by Martin et al. [50], which include 1458 positive sample pairs and 1458 negative sample pairs.

3.2. Position Weight Matrix

In this article, we use the position weight matrix (PWM) to derive evolutionary information from protein sequences. A PWM for a query protein is an $L \times 20$ matrix $M = [m_{i,j}]$, where $L$ is the length of the protein sequence and the 20 columns correspond to the 20 standard amino acids. To construct the PWM, a position frequency matrix $F = [f_{i,k}]$ is first created by counting the occurrences of each amino acid at each position, where $i$ indexes the sequence position and $k$ indexes the amino acid. The PWM can then be expressed as $m_{i,j} = \sum_{k=1}^{20} f_{i,k} \, q_{k,j}$, where $Q = [q_{k,j}]$ is a $20 \times 20$ matrix whose elements represent the mutation value between two different amino acids. Consequently, high scores indicate highly conserved positions, and low scores indicate weakly conserved positions. The PWM is an extremely useful tool for predicting protein disulfide connectivity, protein structural classes, subnuclear localization, and DNA or RNA binding sites. Here, we also employ PWMs to detect PPIs. In this paper, each protein is converted into a PWM using position-specific iterated BLAST (PSI-BLAST). PSI-BLAST has two important parameters, the e-value and the number of iterations, which were set to 0.001 and 3, respectively [51–53].
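
The construction above amounts to multiplying a position frequency matrix by a substitution matrix. The following Python fragment is a minimal illustrative sketch of that reading, not the authors' code: the toy alignment, the identity substitution matrix standing in for a real mutation-score matrix such as BLOSUM62, and the helper names are all assumptions; in the paper, the PWM is actually produced by PSI-BLAST with the parameters given above.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_frequency_matrix(aligned_seqs):
    """Relative frequency of each amino acid at each alignment position (L x 20)."""
    L = len(aligned_seqs[0])
    F = np.zeros((L, 20))
    for seq in aligned_seqs:
        for i, aa in enumerate(seq):
            if aa in AMINO_ACIDS:
                F[i, AMINO_ACIDS.index(aa)] += 1
    return F / len(aligned_seqs)

def position_weight_matrix(F, Q):
    """PWM entry (i, j) = sum_k F[i, k] * Q[k, j]."""
    return F @ Q  # (L x 20) = (L x 20) @ (20 x 20)

# Toy usage: three aligned sequences and an identity matrix as a stand-in
# for a real 20 x 20 mutation-score matrix (e.g., BLOSUM62).
F = position_frequency_matrix(["MKTAY", "MKSAY", "MRTAY"])
M = position_weight_matrix(F, np.eye(20))
print(M.shape)  # (5, 20)
```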

3.3. Legendre Moments

Legendre moments (LMs) are typical orthogonal moments whose kernel functions are the Legendre polynomials. They have been widely used in applications such as image analysis, computer vision, and remote sensing [54–58]. Here, we use Legendre moments to extract the evolutionary information of the protein indirectly from the PWM and generate a 961-dimensional feature vector (i.e., all moments of order $0 \le p, q \le 30$, giving $31 \times 31 = 961$ values). Treating the PWM as a two-dimensional function $f(i, j)$ with $M$ rows and $N$ columns, the two-dimensional discrete form of the LM of order $(p, q)$ is
$$\lambda_{pq} = \frac{(2p+1)(2q+1)}{4} \sum_{i=1}^{M} \sum_{j=1}^{N} f(i,j) \int_{x_i - \Delta x / 2}^{x_i + \Delta x / 2} \int_{y_j - \Delta y / 2}^{y_j + \Delta y / 2} P_p(x) \, P_q(y) \, dy \, dx,$$
where $P_p(\cdot)$ is the Legendre polynomial of order $p$, the discrete points $x_i = (2i - M - 1)/M$ and $y_j = (2j - N - 1)/N$ map the row and column indices onto the interval $[-1, 1]$, and $\Delta x = 2/M$ and $\Delta y = 2/N$ are the corresponding sampling intervals; $N$ is the number of columns of the PWM and $M$ the number of rows.

The integral terms in this definition are frequently estimated by a zeroth-order approximation; in other words, the values of the Legendre polynomials are assumed to be constant over the intervals $[x_i - \Delta x/2, x_i + \Delta x/2]$ and $[y_j - \Delta y/2, y_j + \Delta y/2]$. In this case, the set of approximated LMs is defined as
$$\tilde{\lambda}_{pq} = \frac{(2p+1)(2q+1)}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} P_p(x_i) \, P_q(y_j) \, f(i,j).$$
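
A direct implementation of this zeroth-order approximation is short. The sketch below is an illustration rather than the authors' code; it assumes moments up to order 30 in each direction (giving the 961-dimensional vector mentioned earlier) and uses scipy's eval_legendre for the Legendre polynomial values.

```python
import numpy as np
from scipy.special import eval_legendre

def legendre_moments(pwm, max_order=30):
    """Approximated 2D Legendre moments of an M x N matrix such as a PWM."""
    M, N = pwm.shape
    # Map row and column indices onto [-1, 1].
    x = (2.0 * np.arange(1, M + 1) - M - 1) / M
    y = (2.0 * np.arange(1, N + 1) - N - 1) / N
    # Legendre polynomial values P_p(x_i) and P_q(y_j).
    Px = np.array([eval_legendre(p, x) for p in range(max_order + 1)])  # (31, M)
    Py = np.array([eval_legendre(q, y) for q in range(max_order + 1)])  # (31, N)
    # lambda_pq = (2p+1)(2q+1)/(MN) * sum_ij P_p(x_i) P_q(y_j) f(i, j)
    lam = Px @ pwm @ Py.T
    norm = np.outer(2 * np.arange(max_order + 1) + 1,
                    2 * np.arange(max_order + 1) + 1) / (M * N)
    return (norm * lam).ravel()  # 961-dimensional feature vector

features = legendre_moments(np.random.rand(120, 20))
print(features.shape)  # (961,)
```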

3.4. Stacked Sparse Autoencoder

Deep learning is a new field of machine learning research. Its motivation lies in building and training neural networks that imitate the mechanisms of the human brain for interpreting data. In this paper, the deep-structured stacked sparse autoencoder (SSAE) is adopted for feature reduction and reconstruction [59–62]. The SSAE forms more abstract high-level feature representations by combining low-level features to discover distributed representations of the protein feature data.

The SSAE is an unsupervised network: a large-scale nonlinear system composed of multiple layers of neurons, in which the outputs of one layer are fed to the next layer. In this work, the aim of the SSAE is to learn a distinctive representation of the Legendre moment (LM) feature, with the underlying purposes of noise elimination and dimensionality reduction. Feature reconstruction proceeds layer by layer in the SSAE: the first layer performs a rough integration of the original input, the second layer extracts and integrates the features learned earlier, and higher successive layers tend to produce low-dimensional, low-noise, and highly cohesive features. In this paper, the SSAE was used to reduce the LM feature to 200 dimensions.

An SSAE, or stacked sparse autoencoder network, is mainly made up of two parts, an encoding part and a decoding part [63]. The encoding network compresses high-dimensional inputs into low-dimensional attributes, while the decoding network is responsible for restoring the original input layer by layer; its structure is symmetric with that of the encoding network. In the encoding stage, the input data $x$ is mapped onto a hidden layer. This process can be represented as
$$h = f(W_1 x + b_1).$$

Here, $f(\cdot)$ is a nonlinear activation function, $W_1$ is the weight matrix of the encoding part, and $b_1$ is the bias. After that, the original data is reconstructed by the decoding network:
$$\hat{x} = f(W_2 h + b_2),$$
where $W_2$ is the weight matrix of the decoding network and $b_2$ is the bias. The purpose of the SAE is to make the output as close as possible to the input by minimizing the loss function
$$J = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \left\| \hat{x}^{(i)} - x^{(i)} \right\|^{2} + \beta \sum_{j=1}^{s} \mathrm{KL}\left(\rho \,\|\, \hat{\rho}_j\right), \qquad \mathrm{KL}\left(\rho \,\|\, \hat{\rho}_j\right) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j},$$
where $n$ is the number of training samples, $s$ is the number of hidden-layer nodes, $\beta$ is the weight of the sparse penalty term, $\hat{\rho}_j$ represents the average activation value of the $j$-th hidden-layer unit, and $\rho$ is the sparsity parameter.
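
For concreteness, the following PyTorch sketch trains a single sparse autoencoder layer with the reconstruction loss and KL sparsity penalty described above. It is an illustrative example rather than the authors' implementation: the input size (961), hidden size (200), sparsity target, and penalty weight are assumed values, and a full SSAE would stack a second such layer on the 200-dimensional codes.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, n_in=961, n_hidden=200):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_in), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)
        return self.decoder(h), h

def kl_sparsity(rho_hat, rho=0.05, eps=1e-8):
    """KL(rho || rho_hat) summed over the hidden units."""
    rho_hat = rho_hat.clamp(eps, 1 - eps)
    return (rho * torch.log(rho / rho_hat)
            + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()

model = SparseAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 961)  # a batch of LM feature vectors scaled to [0, 1]

for _ in range(100):  # toy training loop
    x_hat, h = model(x)
    loss = nn.functional.mse_loss(x_hat, x) + 1e-3 * kl_sparsity(h.mean(dim=0))
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    features_200d = model.encoder(x)  # reduced 200-dimensional representation
```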

Figure 1 shows an SSAE network with two hidden layers; the decoding part is not shown, in order to highlight the feature-reduction function of the network. As with the sparse autoencoder (SAE), the key to training the model is to learn the weight and bias parameters of each layer so that the deviation between input and output is minimized. Once the optimal parameters are obtained, the SSAE yields a function that transforms the original data into a low-dimensional space.

3.5. Probabilistic Classification Vector Machines

The design of the feature extraction strategy and the selection of the classifier are two crucial parts of developing an excellent PPI prediction model. In the preceding sections, we developed a new deep learning-based amino acid sequence feature extraction method. Here, we use the stronger PCVM classifier to replace the Softmax layer of the stacked sparse autoencoder and produce the output of our model. Like most classification models, a PCVM [64–66] learns a model from a set of labeled data $\{(x_i, y_i)\}_{i=1}^{N}$ with $y_i \in \{-1, +1\}$. The model is determined by the learned parameters and expressed as
$$f(x; w, b) = \sum_{i=1}^{N} w_i \phi_i(x) + b,$$
where $w = (w_1, \dots, w_N)^{T}$ denotes the weight parameters of the model, $\{\phi_i(x)\}$ is a set of basis functions, and $b$ represents the bias. A Gaussian cumulative distribution function is used to obtain binary outputs. This function is defined as
$$\Psi(z) = \int_{-\infty}^{z} \mathcal{N}(t \mid 0, 1)\, dt.$$

After incorporating the Gaussian cumulative distribution function into the model, it becomes
$$l(x; w, b) = \Psi\left( \sum_{i=1}^{N} w_i \phi_i(x) + b \right).$$

Each weight $w_i$ is assigned a truncated Gaussian prior,
$$p(w_i \mid \alpha_i) = 2\,\mathcal{N}(w_i \mid 0, \alpha_i^{-1}) \cdot \mathbb{1}(y_i w_i > 0),$$
and the bias $b$ is assigned a zero-mean Gaussian prior,
$$p(b \mid \beta) = \mathcal{N}(b \mid 0, \beta^{-1}),$$
where $\mathbb{1}(\cdot)$ is the indicator function that truncates the Gaussian (a left-truncated Gaussian for the negative class and a right-truncated Gaussian for the positive class), and $\alpha_i$ and $\beta$ denote the inverses of the corresponding variances. The EM algorithm is used to obtain all parameters of the PCVM model [67].
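
The sketch below illustrates only the PCVM decision function described above (an RBF basis over the training points, a weighted sum plus bias, and the probit link), not the EM training procedure; all parameter values are made up for illustration.

```python
import numpy as np
from scipy.stats import norm

def rbf_basis(X, centers, theta=1.0):
    """phi_i(x) = exp(-theta * ||x - x_i||^2) for each kept training point x_i."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-theta * d2)

def pcvm_predict_proba(X, centers, w, b, theta=1.0):
    f = rbf_basis(X, centers, theta) @ w + b  # f(x; w, b)
    return norm.cdf(f)                        # probit link -> P(y = +1 | x)

# Toy usage with made-up "learned" parameters: zero weights correspond to
# basis functions pruned away by the truncated priors during training.
centers = np.random.randn(5, 200)
w = np.array([0.8, -0.5, 0.0, 1.2, 0.0])
proba = pcvm_predict_proba(np.random.randn(3, 200), centers, w, b=0.1)
print(proba)  # probabilistic outputs in (0, 1)
```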

4. Results

4.1. Evaluation Criteria

In this work, the following criteria are used to assess the proposed method: accuracy (Accu), precision (Prec), sensitivity (Sens), and the Matthews correlation coefficient (Mcc). Accuracy describes the overall system error. Since the key task of PPI prediction is to correctly predict the interacting protein pairs, the sensitivity and precision indicators are used to assess the model’s ability to predict positive data. In addition, data imbalance exists in real PPI prediction tasks, and in view of this we used an unbalanced PPI dataset in this paper. Mcc is therefore used to evaluate the reliability and stability of the model when dealing with unbalanced data: when the model exhibits “preference prediction” (i.e., on a very unbalanced dataset it can only correctly predict negative data), the Mcc score is low, whereas when the model is strong and robust, this score is high. These indicators are defined as
$$\mathrm{Accu} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{Prec} = \frac{TP}{TP + FP}, \quad \mathrm{Sens} = \frac{TP}{TP + FN},$$
$$\mathrm{Mcc} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$
where $TP$ denotes the samples that truly interact and are predicted correctly, $FP$ denotes the samples that do not interact but are judged to be interacting, $TN$ denotes the samples that do not interact and are predicted correctly, and $FN$ denotes the samples that truly interact but are judged to be noninteracting. Furthermore, the receiver operating characteristic (ROC) curve is plotted to appraise the performance of a set of classification results [68], and the area under the ROC curve (AUC) is computed as an important evaluation indicator.
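
These criteria translate directly into code. The following Python snippet is a small self-contained example of computing them from binary labels (1 = interacting, 0 = noninteracting) and predictions.

```python
import numpy as np

def ppi_metrics(y_true, y_pred):
    """Accu, Prec, Sens, and Mcc from binary labels and predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accu = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp)
    sens = tp / (tp + fn)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"Accu": accu, "Prec": prec, "Sens": sens, "Mcc": mcc}

print(ppi_metrics([1, 1, 0, 0, 1], [1, 0, 0, 0, 1]))
```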

4.2. Assessment of Prediction

In this paper, the proposed sequence-based PPI predictor is implemented on the MATLAB platform. All simulations were carried out on a computer with a 3.1 GHz 8-core CPU, 16 GB of memory, and a Windows operating system. In order to make the prediction system independent of the training data, each PPI dataset is split into five parts using five-fold cross-validation. The performance of the PCVM-based method on the human, unbalanced-human, H. pylori, and S. cerevisiae datasets is reported in Tables 1–4, and the corresponding ROC curves are depicted in Figures 2–6, respectively.
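
The cross-validation protocol can be sketched as follows. The feature matrix, labels, and the logistic-regression stand-in for the PCVM (which is not available in scikit-learn) are placeholders, so this is an illustration of the evaluation setup rather than the authors' pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(1000, 400)            # placeholder protein-pair features
y = np.random.randint(0, 2, 1000)        # placeholder interaction labels

fold_acc = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])  # stand-in for PCVM
    fold_acc.append(clf.score(X[te], y[te]))

print("mean accuracy: %.4f +/- %.4f" % (np.mean(fold_acc), np.std(fold_acc)))
```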

Analysis of Table 1 leads to the conclusion that the PCVM-based method yields a satisfactory result on the human dataset: the accuracy of each fold is above 98%, and the standard deviation of the accuracy over the five experiments is only 0.2%. The corresponding average sensitivity, precision, and Mcc are 98.47%, 98.67%, and 97.19%, with standard deviations of 0.3%, 0.5%, and 0.3%, respectively. The average AUC (Figure 2) over the five experiments reached 0.9984. The high accuracies and AUC show that the PCVM-based approach has a strong ability to identify PPIs, and the low standard deviations illustrate that the model is robust and stable.

When predicting PPIs on the unbalanced-human dataset (Table 2), the method produced an average accuracy of 97.71%, a sensitivity of 91.87%, a precision of 98.10%, and an AUC of 0.9971.

When applied to the H. pylori dataset, which has the smallest training set, the PCVM-based method also yielded a high average prediction accuracy of 93.76%, a precision of 92.10%, a sensitivity of 95.77%, and an Mcc of 88.26% (Table 3). The standard deviations of Accu, Sens, Prec, and Mcc over the five experiments are 0.1%, 2.0%, 1.9%, and 0.2%, respectively. Moreover, the average AUC on the H. pylori dataset reached 0.9860.

4.3. Comparison with the SVM-Based Approach

In order to demonstrate the advantage of our classifier, the state-of-the-art SVM classifier was compared with the PCVM. To make the comparison fair, the same feature extraction scheme and the same S. cerevisiae dataset were used in this experiment. The LIBSVM tool [69] was used for SVM classification, and a grid search approach was adopted to optimize the SVM parameters $C$ and $\gamma$.
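
As an illustration of this baseline (not the authors' exact setup), the sketch below runs an RBF-kernel SVM grid search over $C$ and $\gamma$ using scikit-learn's SVC, which wraps LIBSVM; the feature matrix, labels, and grid values are placeholders for those used in the paper.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X = np.random.rand(300, 400)             # placeholder features
y = np.random.randint(0, 2, 300)         # placeholder labels

param_grid = {"C": [2 ** k for k in range(-5, 6, 2)],
              "gamma": [2 ** k for k in range(-9, 2, 2)]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```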

The classification results of the PCVM and SVM classifiers on the S. cerevisiae dataset are listed in Table 4, and the ROC curves of the SVM are displayed in Figures 5 and 6. As shown, the PCVM method achieved an average of 96.55% Accu, 97.23% Sens, 95.84% Prec, and 93.25% Mcc, with standard deviations over the five experiments of 0.2%, 0.3%, 0.5%, and 0.3%, respectively. The SVM method yielded 93.67% Accu, 92.29% Sens, 94.91% Prec, and 88.13% Mcc, with standard deviations of 0.6%, 1.2%, 1.2%, and 1.0%, respectively. In comparison with the SVM, the PCVM classifier achieves significantly better results on this gold standard dataset. From Figures 5 and 6, the average AUC of the SVM classifier is 0.9856, which is clearly lower than the 0.9963 achieved by the PCVM. The higher AUC values illustrate that the PCVM method is more accurate and more reliable for detecting PPIs. The improved classification performance of the PCVM relative to the SVM can be explained by two factors: (1) the number of PCVM basis functions is smaller than the number of training points, reducing the computational effort involved; (2) the PCVM uses truncated Gaussian priors to flexibly assign prior information about the weights, thus ensuring that reliable support vectors are generated.

4.4. Comparison with Previous Studies

Several other computational approaches for predicting PPIs have been reported in previous studies. To highlight the advantages of the proposed approach, it was compared with widely cited existing approaches on the same PPI datasets. We can see from Table 5 that our method produces better results than the other existing methods. The performance of several different approaches on the H. pylori dataset is presented in Table 6; our proposed approach outperforms the four other main methods, and its 93.76% prediction accuracy is much higher than that of any of them. Table 7 compares several other methods on the S. cerevisiae dataset, all of which achieved an average prediction accuracy below 93.92%, while our PCVM-based approach obtained an average prediction accuracy of 96.55% with the lowest standard deviation of 0.2%. Meanwhile, its sensitivity of 97.23% is also far better than those of the other methods. These extensive experiments indicate that the proposed method can sufficiently meet the needs of large-scale protein interaction detection and can serve as a meaningful adjunct for proteomics investigation.

5. Conclusion

The function and activity of a protein are usually regulated by other proteins that interact with it. In order to understand biological processes, we need tools that give us insight into protein interactions. Although many efforts have been made to develop methods for detecting PPIs, the accuracy and robustness of most existing methods still leave room for improvement. Hence, we explored a fresh and efficient sequence-based computational system using a PCVM classifier combined with Legendre moments and a stacked sparse autoencoder. Four strictly screened PPI datasets were used to assess the prediction ability of the proposed approach, and the prediction outcomes show that it provides practical predictive capability for PPI detection. In subsequent comparative experiments, the prediction performance of our approach was clearly better than that of an SVM-based method and of previous methods. We also found that prediction quality continues to improve with increasing dataset size. This finding underscores the value of training and applying this model on very large datasets and suggests that further performance gains may be obtained by increasing the data size. The proposed method is therefore a reliable, efficient, and powerful PPI prediction model. It can be adopted to guide the validation of relevant experiments and to serve as an auxiliary tool for proteomics research.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Authors’ Contributions

Yanbin Wang, Zhuhong You, Liping Li, and Li Cheng considered the algorithm, arranged the datasets, and performed the analyses. Xi Zhou, Libo Zhang, Xiao Li, and Tonghai Jiang wrote the manuscript. All authors read and approved the final manuscript.

Acknowledgments

This work is supported in part by the National Science Foundation of China (Grant nos. 61722212 and 61572506) and in part by the Pioneer Hundred Talents Program of the Chinese Academy of Sciences. The authors would like to thank the editors and anonymous reviewers for their constructive advice.