Abstract
Protein-protein interactions (PPIs) play a crucial role in understanding disease pathogenesis and genetic mechanisms, in guiding drug design, and in other biochemical processes; the identification of PPIs is therefore of great importance. With the rapid development of high-throughput sequencing technology, a large amount of PPI sequence data has accumulated, and researchers have designed many experimental methods to detect PPIs, making PPI prediction a research hotspot in proteomics. However, because traditional experimental methods are both time-consuming and costly, it is difficult to analyze and predict the massive amount of PPI data quickly and accurately. To address these issues, many computational systems based on machine learning have been applied to PPI prediction, improving the overall recognition rate. In this paper, a novel and efficient computational method is presented that predicts protein interactions using only protein sequence information. First, the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) was employed to generate a position-specific scoring matrix (PSSM) containing protein evolutionary information from the initial protein sequence. Second, we used a novel feature representation scheme, MatFLDA, to extract the essential information from the PSSM of each protein sequence and obtained five training and five testing datasets by adopting five-fold cross-validation. Finally, the random ferns (RFs) classifier was employed to infer the interactions among proteins, and a model called MatFLDA_RFs was developed. The proposed MatFLDA_RFs model achieved 95.03% average accuracy on the Yeast dataset and 85.35% average accuracy on the H. pylori dataset, outperforming the other computational methods it was compared with. 
The experimental results indicate that the proposed method is capable of yielding better prediction results of PPIs, which provides an effective tool for the detection of new PPIs and the in-depth study of proteomics. Finally, we also developed a web server for the proposed model to predict protein-protein interactions, which is freely accessible online at http://120.77.11.78:5001/webserver/MatFLDA_RFs.
1. Introduction
Recognition of protein-protein interactions (PPIs) is distinctly important for understanding various cellular biological activities [1]. The knowledge of PPIs can help us to explore and elucidate the functions of proteins, drug design, new drug development, and the mechanisms of biological activity and related proteins in cells [2]. Additionally, it can also provide new ideas for other studies, such as the ranking of disease genes [3], functional module identification [4], and human disease prevention and treatment. In general, the research approaches for PPIs mainly include two categories: computational-based methods and biological experimental-based methods. In the last decades, many different experimental techniques have been used for large-scale PPIs validation, such as yeast two-hybrid (Y2H) screens [5], coimmunoprecipitation (Co-IP) [6], nuclear magnetic resonance (NMR) [7], protein chip [8], and other high-throughput biological techniques. However, there are some inevitable disadvantages of these methods: they are not only time-consuming and expensive but also suffer from high false-positive rates and weak generalization ability. Thus, it has great practical significance to develop a new effective machine learning approach for PPIs prediction in order to save cost and time, thereby ultimately improving the prediction accuracy of protein interactions. To date, numerous computational approaches have been suggested to detect PPIs based on different data types, including protein domains, genomic information, evolutionary knowledge, structure information, gene fusion, and phylogenetic profiles [9–14]. Although these methods can be used to detect PPIs, the abovementioned methods are not universally applicable unless prior knowledge of the protein is known. 
Although amino acid sequence information is readily available for a large number of proteins, the 3D structural information of many proteins is still unclear, and the known and available PPIs for most species are still incomplete or very sparse. Consequently, it is particularly important to design novel computational methods for PPI prediction utilizing only protein amino acid sequence information, so as to better employ these abundant protein sequence data.
Numerous previous works have shown that using protein amino acid sequence information alone is sufficient to predict PPIs. So far, many different computational methods based on sequence information have been presented to implement this pattern in PPI prediction, such as combining average blocks with relevance vector machine [15], combining principal component analysis with ensemble extreme learning machine [16], combining conventional auto covariance with support vector machine [17], local descriptors using k-nearest neighbor [18], discrete cosine transformation using weighted sparse representation model [19], and so on. In 2017, Wang et al. [20] proposed a PCVMZM method based on protein sequence. The Zernike moments (ZM) are used as the feature extraction method. ZM can capture multiangle useful and representative information. Probabilistic classification vector machines (PCVM) are a sparse classification model that optimizes the kernel parameters by the expectation-maximization (EM) algorithm, which not only improves the prediction performance of PPIs but also reduces the computational time in the testing phase. The average prediction accuracy achieved by the PCVMZM method was 94.48% on the Yeast dataset. In the same year, Du et al. [21] proposed a method called DeepPPI from the angle of deep learning technology by using amphiphilic pseudo amino acid composition feature extraction algorithm to extract features from amino acid sequences, which opens a new way for studying PPIs. This DeepPPI method reached a prediction accuracy of 94.43% on the Saccharomyces cerevisiae dataset. In 2018, Göktepe and Kodaz [22] applied a new technique called weighted skip-sequential conjoint triads to predict PPIs. 
The method adopts principal component analysis (PCA) to remove noise, captures protein sequence information by combining the Bi-gram representation and pseudo-amino acid composition, and finally uses a support vector machine (SVM) as the classifier to identify interactions between proteins. In the same year, Song et al. [23] presented a novel feature fusion scheme based on a random projection ensemble method, which separately used three algorithms (fast Fourier transform, discrete cosine transform, and singular value decomposition) to explore and represent the patterns of interactions between amino acids. In 2019, Chen et al. [1] developed an end-to-end framework, called PIPR, to predict PPIs using only protein sequences. They effectively capture the local significant features and sequential features of protein sequence pairs by using a deep residual recurrent convolutional neural network. Experimental results demonstrate that the framework scales well across different datasets. In the same year, Beltran et al. [24] used five feature extraction methods, namely, dipeptide composition, tripeptide composition, autocovariance, amino acid composition, and pseudo-amino-acid composition, to represent amino acid sequences. They then employed SVM, random forest (RF), and extreme gradient boosting (XGBoost) to predict PPIs and achieved good prediction performance. More recently, Jha and Saha [25] presented a deep-learning-based predictor to identify PPIs. They introduced two deep learning algorithms, ResNet50 and a stacked autoencoder, to extract features from the autocovariance and conjoint triad representations of protein sequences. An LSTM-based classifier was then constructed for each feature encoding scheme. The experimental results show that the introduced deep learning scheme can learn valuable features from multimodal protein information. 
Although a number of computational-based methods have achieved good progress and application prospects, the accuracy and efficiency of PPIs prediction still need to be further enhanced so as to provide a supplementary tool for proteomics research and other bioinformatics tasks.
In this paper, an efficient computational method for detecting PPIs from amino acid sequences is presented, which combines an evolutionary matrix representation of protein sequences with an ensemble classifier. An important improvement of the proposed model is a more accurate numerical representation of protein sequences. Specifically, we applied the MatFLDA feature extraction algorithm to a position-specific scoring matrix (PSSM) to extract the evolutionary information of protein sequences and used a random ferns (RFs) classifier to predict the PPIs. First, each protein sequence is represented as a PSSM numerical matrix. Subsequently, to obtain more representative information, we use the MatFLDA descriptor to extract the feature information in each PSSM, yielding a 400-dimensional feature vector for each protein and thus an 800-dimensional feature vector for each protein pair. Finally, we use the feature vectors of protein pairs as the input of the RFs ensemble model to accomplish the PPI classification task. The proposed method is evaluated on the Yeast and H. pylori PPI datasets, with prediction accuracies of 95.03% and 85.35%, respectively. Comparison with a series of previous computational methods shows that the proposed model has good generalization performance in predicting PPIs.
2. Materials and Methodology
2.1. Datasets
So far, a number of PPI databases have been created, including the HAPPI database [26], the Molecular Interaction Database (MINT) [27], the APID database [28], the Biomolecular Interaction Network Database (BIND) [29], and the Database of Interacting Proteins (DIP) [30]. In this study, we use two high-quality benchmark datasets extracted from DIP to test the generality of the model and assess the performance of the proposed method. The first is the Yeast dataset collected by Guo et al. [17]. To avoid the bias introduced by homologous sequence pairs, a preprocessing step removed protein pairs with greater than 40% sequence identity as well as pairs containing a protein of fewer than 50 residues. After this step, 5594 protein pairs remained and formed the gold standard positive dataset. An additional 5594 protein pairs, retained after excluding pairs that share the same subcellular localization, formed the gold standard negative dataset. The second dataset is the H. pylori dataset, which was validated by yeast two-hybrid technology [31] and collected by Martin et al. [32]. It contains 1458 positive protein pairs and 1458 negative protein pairs, which are regarded as the positive and negative datasets, respectively. Consequently, the Yeast and H. pylori datasets comprise a total of 11,188 and 2916 protein pairs, respectively.
2.2. Numerical Characterization of Protein Sequences
The Position-Specific Scoring Matrix (PSSM) is a scoring matrix that captures the evolutionary information of protein sequences, which is crucial in proteomics. PSSM was originally introduced by Gribskov et al. [33] in 1987 and is commonly used to detect distantly related proteins and protein folding patterns [34]. PSSM encoding has since been applied in many fields of bioinformatics, such as the identification of DNA binding proteins [35], the identification of drug-target interactions [36], the prediction of membrane protein types [37], and protein-protein interaction site prediction [38]. In this experiment, we employed the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) [39] to convert each protein sequence into a PSSM, a numerical representation of protein sequences that is widely adopted in PPI detection tasks. A PSSM is a matrix composed of $L$ rows and 20 columns, where $L$ is the length of the protein sequence and the 20 columns correspond to the 20 native amino acids. The PSSM can be described as follows:
$$\mathrm{PSSM} = \{P_{i,j} : i = 1, \dots, L;\ j = 1, \dots, 20\}$$
The elements of this matrix are usually positive or negative integers, where the element $P_{i,j}$ scores the likelihood that the $i$-th amino acid of the sequence mutates into the $j$-th amino acid in the process of biological evolution. Positive scores mean that the substitution occurs more frequently in the alignment than expected, whereas negative scores mean that it occurs less frequently.
In our study, we set the e-value and the number of iterations of PSI-BLAST to 0.001 and 3, respectively, to obtain highly and broadly homologous protein sequences. Consequently, each protein sequence is represented as an $L \times 20$ matrix, where $L$ is the length of the given protein sequence and 20 is the number of amino acids. PSI-BLAST is available at http://blast.ncbi.nlm.nih.gov/Blast.cgi [40, 41].
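As an illustration of this step, the sketch below reads the score block of an ASCII PSSM file into an L × 20 integer array. It assumes the PSSM was written by the BLAST+ `psiblast` program with its `-out_ascii_pssm` option; the file names and the database placeholder in the comment are hypothetical, since the text does not state which database was searched. The parser itself is a minimal assumption-based reader, not the authors' code.

```python
import numpy as np

# Assumed command (file names and database are placeholders, not from the paper):
#   psiblast -query protein.fasta -db <database> -num_iterations 3 \
#            -evalue 0.001 -out_ascii_pssm protein.pssm

def parse_ascii_pssm(text):
    """Return (sequence, L x 20 integer score matrix) from ASCII PSSM text."""
    residues, rows = [], []
    for line in text.splitlines():
        tokens = line.split()
        # Data rows look like: <position> <residue> <20 scores> [further columns]
        if (len(tokens) >= 22 and tokens[0].isdigit()
                and len(tokens[1]) == 1 and tokens[1].isalpha()):
            residues.append(tokens[1])
            rows.append([int(t) for t in tokens[2:22]])
    return "".join(residues), np.array(rows, dtype=int).reshape(len(rows), 20)

# Tiny hand-written example in the same layout (values are illustrative only).
example = """
Last position-specific scoring matrix computed
            A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
    1 M    -1  -2  -2  -3  -2  -1  -2  -3  -2   1   2  -2   6   0  -3  -2  -1  -2  -1   1
    2 K    -1   2   0  -1  -3   1   1  -2  -1  -3  -3   5  -2  -3  -1   0  -1  -3  -2  -3
"""
sequence, pssm = parse_ascii_pssm(example)
```

Header and footer lines are skipped because they do not start with a position number followed by a one-letter residue code.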
2.3. Matrix Fisher Linear Discriminant Analysis (MatFLDA)
Fisher linear discriminant analysis (FLDA), a popular feature extraction method [42], has recently gained considerable attention in data mining and pattern recognition, in tasks such as software fault prediction [43], Arabic text classification [44], and face recognition [45]. As described in Section 2.2, each PSSM can be denoted as $A$, which is an $m \times n$ matrix. To construct the FLDA of the matrix pattern, we denote the matrix patterns of the $i$-th class, which contains $N_i$ samples, as $A_j^{(i)}$ ($j = 1, 2, \dots, N_i$), where $N = \sum_{i=1}^{C} N_i$ represents the number of PSSMs, and the total sample mean is defined as
$$\bar{A} = \frac{1}{N} \sum_{i=1}^{C} \sum_{j=1}^{N_i} A_j^{(i)}$$
For Matrix Fisher Linear Discriminant Analysis (MatFLDA), assume that matrix patterns from $C$ classes are given, where $C$ represents the 20 classes of amino acids, and their class means are
$$\bar{A}^{(i)} = \frac{1}{N_i} \sum_{j=1}^{N_i} A_j^{(i)}, \quad i = 1, \dots, C$$
Let $X$ be a projection vector with $m$ components; $d$ such vectors form an $m \times d$ projection matrix, also written $X$. MatFLDA projects a matrix pattern $A$ onto $X$, subject to a normalization constraint on $X$, and a $d \times n$ feature matrix is generated by the following linear transformation:
$$Y = X^{T} A$$
where $Y$ is the extracted feature matrix, or projected value. Hence, for each matrix pattern $A_j^{(i)}$, the feature matrices are projected as follows:
$$Y_j^{(i)} = X^{T} A_j^{(i)}, \quad i = 1, \dots, C;\ j = 1, \dots, N_i$$
To find the optimal projection $X$, we maximize the following criterion function:
$$J(X) = \frac{\operatorname{tr}(X^{T} S_b X)}{\operatorname{tr}(X^{T} S_w X)}$$
where $S_b$ is the total between-class scatter matrix, defined as
$$S_b = \sum_{i=1}^{C} N_i \left(\bar{A}^{(i)} - \bar{A}\right)\left(\bar{A}^{(i)} - \bar{A}\right)^{T}$$
and $S_w$ is the total within-class scatter matrix, defined as
$$S_w = \sum_{i=1}^{C} \sum_{j=1}^{N_i} \left(A_j^{(i)} - \bar{A}^{(i)}\right)\left(A_j^{(i)} - \bar{A}^{(i)}\right)^{T}$$
In the MatFLDA algorithm, by maximizing $J(X)$ we keep the between-class scatter as large as possible and the within-class scatter as small as possible in the projection space. Furthermore, under the constraint on $X$, this optimization problem is equivalent to solving the following eigenvalue-eigenvector matrix equation:
$$S_b X = \lambda S_w X$$
Finally, the new features are obtained by determining the appropriate projection $X$, and they are used in the subsequent classification task. In this experiment, the PSSM of each protein sequence, of size $L \times 20$ (where $L$ is the sequence length), was used as input to the MatFLDA algorithm on the Yeast and H. pylori datasets, and MatFLDA was used only for feature extraction. In this way, we obtained a $20 \times 20$ feature matrix from the original PSSM of each protein sequence; in other words, we obtained a 400-dimensional feature vector from each PSSM. The output for every PSSM therefore has a fixed size, and each protein pair is described by 800 features. To illustrate how the MatFLDA algorithm is used for feature extraction of protein sequences, Figure 1 gives a schematic diagram of MatFLDA feature extraction for one protein pair, Histone H4 and Regulatory protein SIR3, from the Saccharomyces cerevisiae dataset.
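The projection step can be sketched in NumPy as follows. This is a minimal illustration of matrix-pattern Fisher discriminant analysis in its commonly published form, not the authors' implementation: the function names, the toy data, the class labels, and the use of a pseudo-inverse to solve the generalized eigenproblem are all assumptions. It also assumes fixed-size matrix patterns, whereas the text does not say how PSSMs of different lengths are brought to a common size.

```python
import numpy as np

def scatter_matrices(patterns, labels):
    """Between-class and within-class scatter of m x n matrix patterns."""
    patterns = np.asarray(patterns, dtype=float)      # shape (N, m, n)
    labels = np.asarray(labels)
    total_mean = patterns.mean(axis=0)
    m = patterns.shape[1]
    S_b, S_w = np.zeros((m, m)), np.zeros((m, m))
    for c in np.unique(labels):
        members = patterns[labels == c]
        class_mean = members.mean(axis=0)
        diff = class_mean - total_mean
        S_b += len(members) * (diff @ diff.T)
        centered = members - class_mean
        # Sum over samples k of centered[k] @ centered[k].T
        S_w += np.einsum("kij,klj->il", centered, centered)
    return S_b, S_w

def matflda_fit(patterns, labels, d):
    """Return an m x d projection matrix from the top-d generalized eigenvectors."""
    S_b, S_w = scatter_matrices(patterns, labels)
    # Solve S_b x = lambda S_w x via pinv(S_w) @ S_b (assumed solver choice).
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(-eigvals.real)[:d]
    return eigvecs[:, order].real

def matflda_transform(A, X):
    """Project one m x n pattern to a d x n feature matrix."""
    return X.T @ np.asarray(A, dtype=float)

# Toy data: 30 patterns of size 6 x 4 in two hypothetical classes.
rng = np.random.default_rng(0)
patterns = rng.normal(size=(30, 6, 4))
labels = np.array([0] * 15 + [1] * 15)
patterns[labels == 1, 0, :] += 2.0                    # shift one row in class 1

X = matflda_fit(patterns, labels, d=2)
Y_a = matflda_transform(patterns[0], X)
Y_b = matflda_transform(patterns[20], X)
# A protein pair is described by concatenating the two flattened feature matrices.
pair_features = np.concatenate([Y_a.ravel(), Y_b.ravel()])
```

With a 20 × 20 feature matrix per protein, the same concatenation would give the 800 features per pair described in the text.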

2.4. Random Ferns (RFs)
The random ferns classifier was developed from random forests but differs from them [46, 47]. Given the PSSM-derived representation of a sample, our main task is to assign it to the most likely class. Let $c_i$, $i = 1, 2$, be the set of classes, where 1 indicates an interacting protein pair and 2 a noninteracting pair. Let $f_j$, $j = 1, \dots, N$, be the set of normalized $N$-dimensional features calculated by applying the MatFLDA algorithm to the PSSM that we are trying to classify. Formally, we are looking for [48]
$$\hat{c}_i = \arg\max_{c_i} P(C = c_i \mid f_1, f_2, \dots, f_N)$$
where the random variable $C$ represents the class. The aim is to model the posterior class probability given a set of features. This can be expressed with Bayes' formula as
$$P(C = c_i \mid f_1, \dots, f_N) = \frac{P(f_1, \dots, f_N \mid C = c_i)\, P(C = c_i)}{P(f_1, \dots, f_N)}$$
We assume a uniform prior $P(C)$. Because the denominator is just a scale factor that is independent of the class and common to all classes, the problem reduces to finding
$$\hat{c}_i = \arg\max_{c_i} P(f_1, f_2, \dots, f_N \mid C = c_i)$$
However, learning the complete joint probability of all features is intractable. Under the naive Bayes assumption, all features are treated as completely independent given the class, that is,
$$P(f_1, \dots, f_N \mid C = c_i) = \prod_{j=1}^{N} P(f_j \mid C = c_i)$$
This independence assumption is usually wrong, because it completely ignores the correlation between features. To account for the dependencies between features while keeping the problem tractable, a better compromise is to divide the features into $M$ groups of size $S$. These groups are what we define as ferns, and we calculate the joint probability of the features in each fern. The conditional probability is expressed as follows:
$$P(f_1, \dots, f_N \mid C = c_i) = \prod_{m=1}^{M} P(F_m \mid C = c_i)$$
where $F_m = \{f_{\sigma(m,1)}, f_{\sigma(m,2)}, \dots, f_{\sigma(m,S)}\}$ refers to the $m$-th fern, and $\sigma(m, j)$ is a random permutation function with range $1, \dots, N$. We therefore follow a semi-naive Bayesian method that models only some of the dependencies between features. The class conditional probabilities are estimated for each fern and class in the training phase. For each fern $F_m$, these terms can be described as
$$p_{k,c_i} = P(F_m = k \mid C = c_i) = \frac{N_{k,c_i}}{N_{c_i}}$$
where $N_{k,c_i}$ represents the number of training samples of class $c_i$ that evaluate to fern value $k$. Here, $\sum_{k} p_{k,c_i} = 1$, and $N_{c_i}$ represents the total number of samples for class $c_i$. However, when the number of training samples is not infinitely large, both $N_{k,c_i}$ and $p_{k,c_i}$ may be zero. To overcome this problem, $p_{k,c_i}$ is rewritten as
$$p_{k,c_i} = \frac{N_{k,c_i} + N_r}{N_{c_i} + K \times N_r}$$
where $K$ is the number of possible fern values and $N_r$ is a regularization term, which behaves as a uniform Dirichlet prior over feature values and guarantees that the results stay above zero. In this experiment, we set two important parameters of the random ferns classifier: $S$ (the depth of the ferns) was set to 20 and $M$ (the number of ferns) was set to 140. Finally, the features extracted by the MatFLDA algorithm are normalized and then fed into the random ferns classifier to predict whether the two proteins of each pair interact.
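The semi-naive Bayes idea behind random ferns can be sketched as below. This is an illustrative toy classifier, not the implementation used in the experiments: the class name `RandomFernsSketch`, the way binary tests are built (a randomly chosen feature compared with a threshold drawn from the training data), and the toy data are assumptions. Much smaller settings than the depth of 20 and 140 ferns reported above are used so that the probability tables stay small.

```python
import numpy as np

class RandomFernsSketch:
    """Toy random ferns classifier with Dirichlet-regularized fern tables."""

    def __init__(self, n_ferns=20, depth=4, n_r=1.0, seed=0):
        self.n_ferns, self.depth, self.n_r = n_ferns, depth, n_r
        self.rng = np.random.default_rng(seed)

    def _fern_values(self, X):
        # Each fern turns `depth` binary tests into one integer fern value k.
        bits = X[:, self.features] > self.thresholds      # (n, M, S)
        weights = 1 << np.arange(self.depth)              # 1, 2, 4, ...
        return (bits * weights).sum(axis=2)               # (n, M)

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        n, n_features = X.shape
        K = 2 ** self.depth                               # possible fern values
        self.classes = np.unique(y)
        self.features = self.rng.integers(0, n_features, size=(self.n_ferns, self.depth))
        rows = self.rng.integers(0, n, size=(self.n_ferns, self.depth))
        self.thresholds = X[rows, self.features]          # thresholds taken from data
        values = self._fern_values(X)
        self.log_p = np.empty((len(self.classes), self.n_ferns, K))
        for ci, c in enumerate(self.classes):
            vc = values[y == c]
            for m in range(self.n_ferns):
                counts = np.bincount(vc[:, m], minlength=K)        # N_{k,c}
                p = (counts + self.n_r) / (len(vc) + K * self.n_r)  # regularized
                self.log_p[ci, m] = np.log(p)
        return self

    def predict(self, X):
        values = self._fern_values(np.asarray(X, dtype=float))
        ferns = np.arange(self.n_ferns)
        # Sum log-probabilities over ferns per class, then pick the best class.
        scores = self.log_p[:, ferns, values].sum(axis=2)  # (classes, n)
        return self.classes[np.argmax(scores, axis=0)]

# Toy data: class 1 (interacting) and class 2 (noninteracting), 10 features each.
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0, 1, (100, 10)), rng.normal(3, 1, (100, 10))])
y_demo = np.array([1] * 100 + [2] * 100)
model = RandomFernsSketch().fit(X_demo, y_demo)
train_acc = np.mean(model.predict(X_demo) == y_demo)
```

Working with sums of log-probabilities rather than products avoids numerical underflow when many ferns are combined.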
3. Results and Discussion
3.1. Evaluation Criteria
In this paper, to ensure the robustness of the proposed model and to reduce overfitting and data dependency, we adopted five-fold cross-validation to assess the effectiveness of the method in predicting PPIs. Specifically, we divide the experimental dataset into five parts; in each of five rounds, four parts serve as the training dataset and the remaining part serves as the testing dataset. The average values over the five experiments are used as the prediction results. The following measures are used: overall prediction accuracy (ACC), precision (PE), sensitivity (SN), and Matthews correlation coefficient (MCC), which are defined as follows:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{PE} = \frac{TP}{TP + FP}$$
$$\mathrm{SN} = \frac{TP}{TP + FN}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively. TP is the number of true interacting pairs that are predicted as interacting, and TN is the number of true noninteracting pairs that are predicted as noninteracting. FP is the number of noninteracting pairs that are incorrectly predicted as interacting, and FN is the number of interacting pairs that are incorrectly predicted as noninteracting. MCC is a balanced indicator of the quality of binary classification; its value ranges between -1 and +1 and represents the correlation coefficient between the observed and predicted results. In this experiment, the receiver operating characteristic (ROC) curve [49] and the area under the ROC curve (AUC) [50] are also employed to evaluate the prediction performance of the proposed model. The larger the AUC value of a classifier, the better its prediction performance and the more stable the constructed model. The flow of the proposed scheme is shown in Figure 2.
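The four measures named above, in their standard confusion-matrix forms, and the AUC can be computed as in the following sketch. The function names and the counts and scores in the example are hypothetical; the AUC is computed here by direct pairwise comparison of positive and negative scores, which is equivalent to the area under the ROC curve but is only practical for small samples.

```python
import math

def ppi_metrics(tp, tn, fp, fn):
    """ACC, PE, SN and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    pe = tp / (tp + fp)
    sn = tp / (tp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom
    return {"ACC": acc, "PE": pe, "SN": sn, "MCC": mcc}

def auc_by_pairs(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical counts and scores, for illustration only.
metrics = ppi_metrics(tp=90, tn=95, fp=5, fn=10)
auc = auc_by_pairs([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

In a cross-validation setting, these functions would be applied to each test fold and the five results averaged.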

3.2. Prediction Performance of Proposed Model
To assess the effectiveness and stability of the model combining MatFLDA and RFs for PPI prediction, we tested it on the Yeast and H. pylori datasets. To reduce bias and avoid overfitting, five-fold cross-validation was performed. Specifically, each dataset was divided into five parts, giving five training sets and five testing sets, and five models were obtained by running a separate experiment for each split. The average values over the five models were taken as the prediction results. To obtain reliable and comparable results, the fern size and the number of ferns of the random ferns classifier were set to the same values on the Yeast and H. pylori datasets: a fern size (depth) of 20 and 140 ferns. The five-fold cross-validation results of the RFs prediction model with MatFLDA features on the two benchmark datasets are shown in Tables 1 and 2.
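A five-fold split of the kind described here can be sketched as follows. The function name, the random shuffling, and the seed are assumptions; the text does not state how the folds were drawn or whether they were stratified by class.

```python
import numpy as np

def five_fold_splits(n_samples, seed=0):
    """Yield (train_idx, test_idx) for five disjoint test folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), 5)
    for k in range(5):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
        yield train_idx, test_idx

# The Yeast dataset has 11,188 protein pairs in total.
splits = list(five_fold_splits(11188))
fold_sizes = [len(test_idx) for _, test_idx in splits]
```

Each pair appears in exactly one test fold, so every sample is used for testing once and for training four times.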
As can be seen from Table 1, the accuracies of the five experiments on the Yeast dataset were 95.26%, 94.99%, 94.81%, 94.77%, and 95.31%. The precisions are ≥98.81%, the sensitivities are ≥90.27%, and the MCCs are ≥90.05%. The standard deviations of accuracy, precision, sensitivity, and MCC are 0.25%, 0.26%, 0.47%, and 0.45%, respectively, all of which are relatively low. Similarly, on the H. pylori dataset, the average values of accuracy, precision, sensitivity, and MCC were 85.35%, 79.27%, 95.72%, and 74.41%, and Table 2 shows that the corresponding standard deviations are 0.64%, 0.81%, 0.92%, and 1.14%, respectively. To better visualize the performance of combining RFs and MatFLDA to predict PPIs, we plot the ROC curves on the two benchmark datasets. MCC and AUC values were also calculated to better quantify the predictive performance of the proposed model. The ROC curves for the two benchmark datasets are shown in Figures 3 and 4.
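The reported Yeast averages can be checked directly from the five fold accuracies quoted above. The short sketch below assumes that the standard deviation is the sample standard deviation (dividing by n - 1), which is the variant that reproduces the reported 0.25%.

```python
import statistics

# Fold accuracies (%) on the Yeast dataset, as listed in the text.
yeast_acc = [95.26, 94.99, 94.81, 94.77, 95.31]

mean_acc = statistics.mean(yeast_acc)    # average over the five folds
std_acc = statistics.stdev(yeast_acc)    # sample standard deviation (n - 1)
```

Rounded to two decimals, these give 95.03 and 0.25, matching the average accuracy and standard deviation stated for the Yeast dataset.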


From Figures 3 and 4, we can see that the average AUC values obtained by the proposed method were 94.27% and 94.12% on the Yeast and H. pylori datasets, respectively. These promising results show that the proposed method is feasible, effective, and practical for detecting PPIs. The prediction performance depends mainly on the choice of feature extraction algorithm and classification model. The results suggest that the MatFLDA feature extraction descriptor effectively retains useful information from the original protein sequences. Moreover, the high prediction accuracies and low standard deviations further indicate that the proposed method is robust for predicting PPIs.
3.3. Comparison of the Four Methods Using the Same Feature Representation
Generally, when machine-learning-based methods are used to predict PPIs, the same feature extraction approach combined with different classifiers yields different prediction results. In this section, to further evaluate the prediction performance of the proposed model, we performed PPI experiments using the same feature extraction method with the state-of-the-art individual classifier support vector machine (SVM) and with the proposed ensemble learning classifier random ferns (RFs). The LIBSVM toolbox, which was downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvm/ [51], was employed to carry out the SVM classification task. In our experiment, a polynomial function is used as the kernel function, and the initial SVM parameter values were set for five-fold cross-validation on the Yeast and H. pylori datasets, respectively. For the SVM and RFs classifiers, all input feature vectors are normalized by the zero-mean normalization method.
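The normalization step can be sketched as below, assuming that "zero-mean normalization" means the usual z-score transform (subtract the per-feature mean and divide by the per-feature standard deviation). Fitting the statistics on the training fold only and reusing them on the test fold is a common practice assumed here, not something the text specifies; the function names and toy data are hypothetical.

```python
import numpy as np

def zero_mean_fit(train):
    """Per-feature mean and standard deviation estimated on training data."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0          # avoid division by zero for constant features
    return mean, std

def zero_mean_apply(X, mean, std):
    """Apply z-score normalization with previously fitted statistics."""
    return (X - mean) / std

# Toy 800-dimensional pair features (random values, for illustration only).
rng = np.random.default_rng(0)
train = rng.normal(5.0, 2.0, size=(50, 800))
test = rng.normal(5.0, 2.0, size=(10, 800))

mean, std = zero_mean_fit(train)
train_n = zero_mean_apply(train, mean, std)
test_n = zero_mean_apply(test, mean, std)
```

Using the same fitted statistics for both classifiers keeps their inputs identical, so differences in the results reflect the classifiers rather than the preprocessing.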
The prediction results of the RFs and SVM classifiers on the Yeast and H. pylori datasets are presented in Tables 3 and 4, respectively. From Table 3, the average accuracy, precision, sensitivity, and MCC of the RFs method on the Yeast dataset are 95.03%, 99.14%, 90.84%, and 90.52%, respectively. With the SVM classifier, we obtained poorer prediction results, with average accuracy, precision, sensitivity, and MCC of 80.39%, 83.01%, 76.44%, and 68.38%, respectively. The maximum accuracy obtained by the SVM classifier is 81.63%, which is about 13 percentage points lower than the minimum accuracy obtained by the RFs method (94.77%). Similarly, as presented in Table 4, the average accuracy of the SVM method on the H. pylori dataset is 82.09%, with the five models achieving 82.85%, 82.33%, 79.42%, 82.33%, and 83.53%, respectively. For further evaluation, the ROC (receiver operating characteristic) curves and AUC values of the SVM method were also calculated (see Figures 5 and 6). The average AUC values obtained with the same feature extraction method on the Yeast and H. pylori datasets were 85.78% and 88.94%, respectively. In addition, we also evaluated Random Forest and XGBoost classifiers using the same features. Comparing the proposed model with these three models, we can see that the proposed model achieves good performance in the prediction of PPIs. Thus, the proposed model can provide a useful tool for detecting PPIs and for other bioinformatics tasks.


3.4. Comparison with other PPI Prediction Methods
Currently, many computational methods that are based on data mining knowledge have been presented for predicting sequence-based PPIs. In this section, to verify the performance of the proposed model, we measure the proposed method by comparing with several other state-of-the-art methods on the Yeast and H. pylori datasets. Specifically, we compared the proposed method with previous work on PPI prediction presented by Guo et al., Yang et al., Zhou et al., You et al., Du et al., and Wong et al. on the Yeast dataset. Table 5 lists the PPI prediction results of the above methods on the same Yeast dataset.
As shown in Table 5, the accuracy, sensitivity, precision, and MCC of the MatFLDA_RFs method are 95.03%, 90.84%, 99.14%, and 90.52%, respectively. Compared with the other methods listed, the accuracy of the proposed method is higher by 0.60% to 8.88%. The ACC of the MatFLDA_RFs method is higher than those of the compared methods by 7.67% (the AC method), 8.88%, 6.47%, 3.67%, 0.89%, 0.60% (the DeepPPI method), and 1.11%. The PE of the MatFLDA_RFs method is higher by 11.32% (the AC method), 8.90%, 9.64%, 7.20%, 2.04%, 2.49% (the DeepPPI method), and 2.69%. The MCC of the MatFLDA_RFs method is higher by 13.37%, 6.31%, 1.56%, 1.55% (the DeepPPI method), and 1.96%.
Similarly, Table 6 presents the PPI prediction results of other existing methods on the same H. pylori dataset. As shown in Table 6, the accuracy of the proposed method is higher than that of the other existing methods. The obtained values of ACC, SN, PE, and MCC are 85.35%, 95.72%, 79.27%, and 74.41%, respectively. In terms of ACC, the MatFLDA_RFs method is 0.44% to 9.55% higher than the other methods: the margins are 1.95%, 0.44%, 1.65% (the WSR method), 9.55% (the Phylogenetic Bootstrap method), 2.35% (the LDC method), and 5.83% (the Boosting method). These results indicate that the proposed method is an effective computational tool for predicting PPIs.
4. Conclusion
The study of proteins and their interactions is essential to understanding most biological activities in living cells, such as development, signal transduction, and apoptosis. The successful prediction of PPIs will therefore facilitate the study of related problems in biomedical science. In this work, we present a novel computational approach to detect PPIs that combines the MatFLDA algorithm, the random ferns (RFs) classifier, and the PSSM, which preserves protein evolutionary information. More specifically, MatFLDA is used to obtain the feature representation from the PSSM, an evolutionary matrix of protein sequences that contains a great deal of valuable knowledge for PPI prediction. The RFs classifier is then applied to detect novel PPIs. To measure the PPI identification ability of the developed method, we conducted computational experiments on the Yeast and H. pylori benchmark datasets. The experimental results indicate that the proposed MatFLDA_RFs method has a higher identification rate of PPIs than the other existing methods compared and than the SVM-based approach. Consequently, the proposed method is reliable and effective for identifying PPIs and can serve as a practical complement to experimental methods, facilitating further research on related problems in bioinformatics.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare no conflict of interest.
Authors’ Contributions
Yang Li and Zheng Wang contributed equally to this work.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (no. 61873212).