Ens-PPI: A Novel Ensemble Classifier for Predicting the Interactions of Proteins Using Autocovariance Transformation from PSSM

Gao, Zhen-Guo; Wang, Lei; Xia, Shi-Xiong; You, Zhu-Hong; Yan, Xin; Zhou, Yong

doi:https://doi.org/10.1155/2016/4563524

BioMed Research International

Special Issue

Current Computational Models for Prediction of the Varied Interactions Related to Noncoding RNAs

Research Article | Open Access

Volume 2016 | Article ID 4563524 | https://doi.org/10.1155/2016/4563524

Zhen-Guo Gao,¹ Lei Wang,¹,² Shi-Xiong Xia,¹ Zhu-Hong You,¹ Xin Yan,³ and Yong Zhou¹

Academic Editor: Zheng Yin

Received 09 Apr 2016

Accepted 08 May 2016

Published 29 Jun 2016

Abstract

Protein-Protein Interactions (PPIs) play vital roles in most biological activities. Although the development of high-throughput biological technologies has generated considerable PPI data for various organisms, many problems are still far from being solved. A number of computational methods based on machine learning have been developed to facilitate the identification of novel PPIs. In this study, a novel predictor was designed using the Rotation Forest (RF) algorithm combined with Autocovariance (AC) features extracted from the Position-Specific Scoring Matrix (PSSM). More specifically, the PSSMs are generated from the amino acid sequences of the proteins. Then, an effective sequence-based feature representation, Autocovariance, is employed to extract features from the PSSMs. Finally, the RF model is used as a classifier to distinguish between interacting and noninteracting protein pairs. The proposed method achieves promising prediction performance on the PPIs of Yeast, H. pylori, and independent datasets. These results suggest that the proposed model is suitable for PPI prediction and could also provide a useful supplementary tool for solving other bioinformatics problems.

1. Introduction

Proteins are the most versatile and important macromolecules in life. They are vital for nearly all of the activity in the cell, including signaling cascades, metabolic cycles, and DNA transcription and replication [1]. Researchers have found that proteins rarely act as isolated agents to perform their functions. Instead, proteins interact with one another, forming a huge and complex network of Protein-Protein Interactions (PPIs) [2]. Therefore, research on PPIs has become a core issue of systems biology [3, 4].

So far, a variety of experimental techniques have been developed for the detection of PPIs. The high-throughput techniques, including Yeast Two-Hybrid (Y2H) screen [5–7], Tandem Affinity Purification (TAP) [2], and Mass Spectrometric Protein Complex Identification (MS-PCI) [6], require considerable amounts of time, money, and manpower to detect PPIs. In addition, the PPIs obtained by biological experiments so far cover only a small part of the whole PPI network [8]. Therefore, developing reliable computational methods that improve detection efficiency is of great importance [9–11].

A large number of in silico methods for predicting PPIs have emerged [12–14]. These methods are usually based on information such as gene neighboring [15], gene coexpression [15], phylogenetic relationship [16], gene fusion events [17], three-dimensional structural information [18], and so on [19]. However, the application of these methods is limited [20, 21], because they rely on prior knowledge of the protein. Recently, methods that detect PPIs from the amino acid sequence information of proteins have been proposed [22–24]. For example, You et al. [25] used only protein sequence information to predict PPIs, proposing a method called PCA-EELM (Principal Component Analysis-Ensemble Extreme Learning Machine). On the PPI data of Saccharomyces cerevisiae, this model yields 87.00% prediction accuracy, 86.15% sensitivity, and 87.59% precision. Martin et al. [26] designed a model to detect PPIs by using the extended signature descriptor, which was extended to protein pairs. Under 10-fold cross-validation on the H. pylori and Yeast datasets, the accuracy of this method ranged from 70% to 80%. Shen et al. [11] considered the local environments of residues and designed the conjoint triad method. On a human PPI dataset, this method yielded 83.9% accuracy. Guo et al. [9] combined a Support Vector Machine classifier with Autocovariance features extracted from the protein sequences to predict PPIs in Saccharomyces cerevisiae. The average prediction accuracy of the method reached 86.55%.

In this study, we present a sequence-based method which combines the Rotation Forest (RF) classifier and the Autocovariance (AC) algorithm to predict interacting protein pairs [9, 27, 28]. A novel protein feature representation is derived from the Position-Specific Scoring Matrix (PSSM) [29], which gives the log-odds score of each residue substitution at each position based on evolutionary information. Then, an effective sequence-based protein representation, Autocovariance, is employed to extract features from the PSSMs. The AC algorithm captures the correlation between residues that lie a certain distance apart in the sequence. Thus, the model takes the neighboring effect into account and makes it possible to find patterns throughout the sequence. Finally, an ensemble RF classifier is built that takes the PSSM-derived features as input. In the experiments, the proposed model was evaluated on the Yeast and H. pylori PPI datasets. The experimental results show that our model achieved 97.77% and 84.84% prediction accuracy with 95.57% and 82.77% sensitivity on these two datasets. In addition, we evaluated the proposed model on independent datasets of C. elegans, E. coli, H. sapiens, and M. musculus PPIs and achieved 96.01%, 97.73%, 98.30%, and 96.81% prediction accuracy, respectively.

2. Materials and Methodology

2.1. Data Sources

In the experiments, we used nonredundant Yeast data gathered from the Saccharomyces cerevisiae core subset of the Database of Interacting Proteins (DIP) [30], version DIP 20070219, as prepared by Guo et al. [9]. Two methods, the Paralogous Verification Method (PVM) and Expression Profile Reliability (EPR) [31], have proven the reliability of the core subset. The core subset contains 5966 interaction pairs. Pairs containing a sequence with fewer than 50 amino acid residues were removed because such sequences might just be fragments, which left 5943 protein pairs. The CD-Hit [32, 33] algorithm was then applied with a 40 percent identity threshold to decrease pairwise sequence redundancy. The remaining 5594 protein pairs constituted the positive dataset. We chose 5594 additional protein pairs whose members have different subcellular localizations to construct the negative dataset. Finally, the complete dataset was composed of 11188 protein pairs, half of which were positive and the other half negative.

We also tested our method using the two-hybrid measurements of H. pylori introduced by Rain et al. [34]. The H. pylori dataset (available at http://www.cs.sandia.gov/~smartin/software.html) contains 2916 protein pairs, half of which are interacting pairs and half noninteracting pairs. This dataset provides a platform for comparing our approach with other approaches [25, 26, 35–38].

2.2. Position-Specific Scoring Matrix (PSSM)

The Position-Specific Scoring Matrix, proposed by Gribskov et al. [29], was first used to detect distantly related proteins. Its feasibility has since been verified in protein secondary structure prediction [39], prediction of disordered regions [40], and protein binding site prediction [41]. A PSSM has $n$ rows and 20 columns. Suppose that $\mathrm{PSSM} = \{P_{i,j} : i = 1, \ldots, n;\ j = 1, \ldots, 20\}$. Rows of the matrix represent the protein residues and columns represent the native amino acids. Each matrix can be represented by the following formula:

$$\mathrm{PSSM} = \begin{bmatrix} P_{1,1} & P_{1,2} & \cdots & P_{1,20} \\ P_{2,1} & P_{2,2} & \cdots & P_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ P_{n,1} & P_{n,2} & \cdots & P_{n,20} \end{bmatrix},$$

where $n$ is the length of the corresponding protein sequence and $P_{i,j}$ in the $i$th row of the PSSM represents the probability of the $i$th residue being mutated into type $j$ of the 20 native amino acids during evolution, as derived from multiple sequence alignments.
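
As an illustration of the matrix layout described above, the following minimal sketch reads a PSI-BLAST ASCII PSSM file into a list of rows with 20 scores each. It is not the authors' code. The function name `parse_ascii_pssm` is hypothetical, and the sketch assumes that every data row starts with the residue position and the residue letter, followed by 20 whitespace-separated integer scores.

```python
def parse_ascii_pssm(text):
    """Return (sequence, matrix) parsed from ASCII PSSM text.

    matrix is a list with one row per residue; each row holds the 20
    integer scores of that position (one per amino acid column).
    Assumption: scores are separated by whitespace.
    """
    residues = []
    matrix = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows look like: "<position> <residue> <20 scores> <extra columns>".
        is_data_row = (
            len(fields) >= 22
            and fields[0].isdigit()
            and len(fields[1]) == 1
            and fields[1].isalpha()
        )
        if not is_data_row:
            continue  # skip header, column titles, and footer statistics
        residues.append(fields[1])
        matrix.append([int(value) for value in fields[2:22]])
    return "".join(residues), matrix
```

The returned matrix has as many rows as the protein has residues, which matches the $n \times 20$ structure of the PSSM.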

In this experiment, we ran the Position-Specific Iterated BLAST (PSI-BLAST) program [42] against the SwissProt database on a local machine to produce the PSSMs. PSI-BLAST is more sensitive than BLAST, particularly in discovering new members of a protein family. To generate a PSSM, PSI-BLAST needs a highly sensitive sequence comparison between the input protein and the proteins in the database. All sequence entries in the SwissProt database have been carefully verified with computational tools and checked against the relevant literature by experienced molecular biologists and protein chemists, so we chose SwissProt as the comparison database in the experiment. To obtain broad and highly homologous sequences, we set the e-value to 0.001 and the number of iterations to 3 and held the other parameters constant. PSI-BLAST and the SwissProt database can be downloaded from http://blast.ncbi.nlm.nih.gov/Blast.cgi.
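
The settings above could be passed to PSI-BLAST roughly as sketched below. This is an assumption-laden illustration: it uses the `psiblast` executable from the BLAST+ suite, whereas the text does not say which PSI-BLAST implementation was used. The helper name `psiblast_command`, the file paths, and the database name `swissprot` are hypothetical placeholders.

```python
def psiblast_command(fasta_path, pssm_path, db="swissprot",
                     evalue=0.001, iterations=3):
    """Build the argument list for one PSI-BLAST run that writes an ASCII PSSM.

    Assumes the BLAST+ `psiblast` program is installed and that `db` names
    a locally formatted protein database (placeholder name).
    """
    return [
        "psiblast",
        "-query", fasta_path,               # single-sequence FASTA file
        "-db", db,                          # local database to search
        "-evalue", str(evalue),             # e-value threshold from the text
        "-num_iterations", str(iterations), # three iterations, as in the text
        "-out_ascii_pssm", pssm_path,       # where the PSSM is written
    ]


command = psiblast_command("protein.fasta", "protein.pssm")
print(" ".join(command))
```

The list can be handed to `subprocess.run(command, check=True)` on a machine where BLAST+ and the database are available.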

2.3. Autocovariance (AC)

As one of the most effective statistical tools for analyzing sequences of vectors, AC has been widely used in protein family classification [43, 44], prediction of secondary structure content [45, 46], and protein interaction prediction [9]. An AC variable describes the average correlation between two residues separated by a given distance along a protein sequence, and it can be calculated by

$$AC(lag, j) = \frac{1}{n - lag} \sum_{i=1}^{n-lag} \left( P_{i,j} - \frac{1}{n} \sum_{k=1}^{n} P_{k,j} \right) \left( P_{i+lag,j} - \frac{1}{n} \sum_{k=1}^{n} P_{k,j} \right),$$

where $lag$ is the distance between residues, $j$ represents the $j$th amino acid, $n$ denotes the length of the protein sequence, and $P_{i,j}$ indicates the matrix score of amino acid $j$ at position $i$.

Using the above expression, the number of AC variables can be figured out: $D = lg \times q$, where $lg$ is the maximum value of $lag$ and $q$ is the number of descriptors. After this operation has been applied to all the data, each protein sequence is represented as a vector of AC variables, and a protein pair is characterized by concatenating the vectors of its two proteins.
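
The AC transformation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation. The function names `autocovariance` and `pair_features` are hypothetical, and the PSSM is assumed to be an array with one row per residue and one column per amino acid type.

```python
import numpy as np


def autocovariance(pssm, max_lag):
    """Return the AC feature vector of one protein.

    pssm    : array of shape (n, q), one row per residue
    max_lag : largest residue distance to use (must be smaller than n)
    The result has max_lag * q entries, ordered lag by lag.
    """
    pssm = np.asarray(pssm, dtype=float)
    n = pssm.shape[0]
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    # Subtract the column mean, i.e. the average score of each amino acid type.
    centered = pssm - pssm.mean(axis=0)
    features = []
    for lag in range(1, max_lag + 1):
        # Average product of centered scores that lie `lag` positions apart.
        products = centered[: n - lag] * centered[lag:]
        features.append(products.sum(axis=0) / (n - lag))
    return np.concatenate(features)


def pair_features(pssm_a, pssm_b, max_lag):
    """Concatenate the AC vectors of the two proteins in a pair."""
    return np.concatenate(
        [autocovariance(pssm_a, max_lag), autocovariance(pssm_b, max_lag)]
    )
```

With 20 columns and a maximum lag of 40, each protein yields 800 AC variables, so a protein pair is described by 1600 features.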

2.4. Rotation Forest Classifier

Rotation Forest (RF) is a popular ensemble classifier whose idea originated from the Random Forest classifier. Each decision tree in Rotation Forest is trained on the dataset in a rotated feature space. Because a decision tree learning algorithm builds classification regions using hyperplanes parallel to the feature axes, a small rotation of the axes may produce an entirely different tree, so the transformation ensures the diversity of the ensemble. Thus, the RF model can enhance the accuracy of the individual classifiers and the diversity within the ensemble at the same time. It is more robust than previously proposed ensemble systems, such as Random Forest [32, 47], Bagging [33, 48], and Boosting [49]. The RF algorithm is described as follows.

Assume that $\{x_i, y_i\}$ contains $N$ training samples, where $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is an $n$-dimensional feature vector. Let $X$ be the training sample set (an $N \times n$ matrix) composed of the $N$ observed feature vectors, let $F$ denote the feature set, and let $Y$ denote the corresponding labels, so that $X = (x_1, x_2, \ldots, x_N)^T$ and $Y = (y_1, y_2, \ldots, y_N)^T$. Assume that the feature set is randomly divided into $K$ subsets of the same size, where $K$ is an appropriate factor of $n$; the $L$ decision trees in the forest can then be expressed as $D_1, D_2, \ldots, D_L$, respectively. The execution steps of the training set for a single classifier $D_i$ are shown below:
(1) Select an appropriate parameter $K$ which is a factor of $n$; let $F$ be randomly divided into $K$ disjoint subsets; each subset contains $M = n/K$ features.
(2) From the training dataset $X$, select the columns corresponding to the features in the subset $F_{i,j}$ and form a new matrix $X_{i,j}$; then draw a bootstrap subset of 75 percent of the objects in $X_{i,j}$ to constitute a new training set $X'_{i,j}$.
(3) Apply the feature transform to the matrix $X'_{i,j}$ to produce the coefficients in a matrix $C_{i,j}$, whose $m$th column holds the coefficients of the $m$th component.
(4) Arrange the coefficients obtained in the matrices $C_{i,j}$ into a sparse rotation matrix $R_i$, which is expressed as follows:

$$R_i = \begin{bmatrix} C_{i,1} & 0 & \cdots & 0 \\ 0 & C_{i,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & C_{i,K} \end{bmatrix}.$$

In the prediction period, for a test sample $x$, let $d_{i,j}(x R_i^a)$ be the probability produced by the classifier $D_i$ that $x$ belongs to class $\omega_j$, where $R_i^a$ is $R_i$ rearranged to match the original feature order. Next, the confidence for each class is calculated by means of the average combination, and the formula is as follows:

$$\mu_j(x) = \frac{1}{L} \sum_{i=1}^{L} d_{i,j}(x R_i^a).$$

Then, assign $x$ to the category with the largest $\mu_j(x)$ value.
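
Steps (1) to (4) above can be sketched for a single rotation matrix as follows. This is a simplified NumPy illustration under stated assumptions, not the WEKA implementation used in the experiments. It takes PCA as the feature transform, omits the training of the decision tree itself, and the function name `build_rotation_matrix` and its parameters are hypothetical.

```python
import numpy as np


def build_rotation_matrix(X, K, rng, sample_fraction=0.75):
    """Build one rotation matrix for an (N, n) training matrix X.

    The n features are split into K disjoint subsets; PCA is fitted on a
    bootstrap sample of each subset and its coefficients are written into
    the rows and columns of the features in that subset, so the result is
    already aligned with the original feature order.
    """
    N, n = X.shape
    if n % K != 0:
        raise ValueError("K must be a factor of the number of features")
    # Step 1: random split of the feature indices into K subsets of size n / K.
    subsets = np.split(rng.permutation(n), K)
    R = np.zeros((n, n))
    for cols in subsets:
        # Step 2: bootstrap sample (75% of the objects) restricted to `cols`.
        size = max(2, int(sample_fraction * N))
        rows = rng.choice(N, size=size, replace=True)
        subset_data = X[np.ix_(rows, cols)]
        # Step 3: PCA coefficients are the eigenvectors of the covariance matrix.
        covariance = np.atleast_2d(np.cov(subset_data, rowvar=False))
        _, coefficients = np.linalg.eigh(covariance)
        # Step 4: place the coefficient block into the sparse rotation matrix.
        R[np.ix_(cols, cols)] = coefficients
    return R


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
R = build_rotation_matrix(X, K=3, rng=rng)
X_rotated = X @ R  # rotated training data for one base classifier
```

In a full ensemble, this construction would be repeated once per tree with a fresh random split, each tree would be trained on its own rotated data, and the class probabilities of all trees would be averaged as in the formula for $\mu_j(x)$.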

3. Results and Discussions

3.1. Evaluation Measures

In this section, 5-fold cross-validation is used to evaluate the performance of the proposed method: all samples are split into five subsets, and in each round one subset serves as the test set while the remaining four form the training set. Evaluation criteria used in our study include overall prediction accuracy (Accu.), sensitivity (Sen.), precision (Prec.), and Matthews correlation coefficient (MCC). The calculation formulas are listed below:

$$\mathrm{Accu.} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$\mathrm{Sen.} = \frac{TP}{TP + FN},$$

$$\mathrm{Prec.} = \frac{TP}{TP + FP},$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where True Positive (TP) represents the number of samples that are correctly detected as positive, True Negative (TN) represents the number of samples that are correctly detected as negative, False Positive (FP) represents the number of samples that are incorrectly detected as positive, and False Negative (FN) represents the number of samples that are incorrectly detected as negative. We also produce Receiver Operating Characteristic (ROC) [50] curves to assess the capability of the classifier. Typically, the decision threshold of the classifier is 0.5 by default. As the threshold is varied, the True Positive Rate and the False Positive Rate change, and plotting the former against the latter yields the ROC curve. In addition, the Area Under the Curve (AUC), which ranges from 0 to 1, can be computed from the ROC curve. A predictor with a greater AUC value than another is regarded as the better one. The workflow of our method is shown in Figure 1.
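
The four evaluation measures can be computed directly from the confusion-matrix counts, as in the minimal sketch below. The function name `evaluate` is hypothetical, the counts in the usage line are made-up illustration values, and the sketch assumes that both classes occur among the true labels and among the predictions so that no denominator is zero.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Return accuracy, sensitivity, precision, and MCC from the four counts."""
    total = tp + tn + fp + fn
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # fraction of true pairs recovered
        "precision": tp / (tp + fp),     # fraction of predicted pairs that are true
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Illustrative counts only, not results from the paper.
scores = evaluate(tp=40, tn=45, fp=5, fn=10)
print(scores)
```

In a 5-fold setting, the function would be applied to the counts of each test fold and the five results averaged.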

3.2. Assessment of Prediction Ability

In order to achieve better results in the experiment, we used the grid search method to explore the parameters of the proposed model, specifically the maximum lag $lg$ for AC and the parameters $K$ and $L$ for RF. Firstly, we discuss the parameter of AC; the maximal possible $lg$ is bounded by the shortest sequence length (50 amino acids) in the Yeast dataset. In this experiment, several values of $lg$ between 5 and 45 were evaluated in order to find the best representation of the protein sequences. The prediction results are shown in Figure 2. As seen from the curve, the prediction accuracy gradually increases as $lg$ changes from 5 to 40, and it decreases as $lg$ changes from 40 to 45. There is a peak with an average accuracy of 95.86% when $lg$ is 40. We can conclude that when $lg$ is less than 40, the AC features lose some useful information about the protein sequences, whereas a larger $lg$ may introduce noise rather than improving the performance of the model. So we set $lg$ to 40.

Secondly, we discuss the parameters of the RF. Based on previous studies, we chose PCA as the feature transform of Rotation Forest. Additionally, the J48 decision tree from the WEKA toolkit was selected as the base classifier. In this experiment, two parameters (the number of feature subsets $K$ and the number of decision trees $L$) were tested by the grid search method over a range of values to achieve better performance. Figure 3 shows the prediction results for different parameters. We can see that the accuracy fluctuates at the beginning and then slowly improves as one of the two parameters increases, while it does not seem to be closely related to the other. Considering both the accuracy and the time cost of the algorithm, we chose $K = 20$ and $L = 3$ as the optimal parameters. For the H. pylori dataset, we used AC to extract features and RF for classification with the same parameters as for the Yeast dataset.
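
A grid search of this kind can be sketched generically as below. The helper `grid_search` and the scoring callback are hypothetical: in a real run the callback would train the model with the given parameters and return the mean cross-validation accuracy, whereas the toy function used here only stands in for that step. The candidate values in the grid are illustrative, since the ranges searched in the experiments are not given in the text.

```python
from itertools import product


def grid_search(score_fn, grid):
    """Evaluate score_fn on every parameter combination and keep the best.

    score_fn : callable taking the parameters as keyword arguments and
               returning a score where larger is better
    grid     : dict mapping each parameter name to its candidate values
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(K, L):
    """Stand-in for cross-validated accuracy; peaks at K=20, L=3 by design."""
    return -((K - 20) ** 2) - (L - 3) ** 2


best, score = grid_search(toy_score, {"K": [5, 10, 20, 40], "L": [1, 3, 5]})
print(best, score)
```

Because every combination is trained and evaluated, the cost grows with the product of the grid sizes, which is why time cost enters the final parameter choice.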

The 5-fold cross-validation method was used to reduce the dependence of the results on any particular split of the data [51–55]. Table 1 lists all of the prediction results; the prediction accuracies were greater than 97.54%, the precisions were greater than 99.82%, and the sensitivities were greater than 95.01%. Our proposed method yields an average prediction accuracy of 97.77%. The ROC curves obtained on the Yeast dataset are shown in Figure 4. In this figure, the x-axis depicts the False Positive Rate (FPR) while the y-axis depicts the True Positive Rate (TPR).
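
For readers who want to reproduce curves of this kind, the sketch below computes ROC points and the AUC from true labels and classifier scores by sweeping the decision threshold. It is a plain-Python illustration with hypothetical function names (`roc_points`, `area_under_curve`) and made-up example data. It assumes labels are 1 for interacting and 0 for noninteracting pairs and that both classes are present.

```python
def roc_points(labels, scores):
    """Return the (FPR, TPR) points obtained by lowering the threshold."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    index = 0
    while index < len(ranked):
        threshold = ranked[index][0]
        # Samples with tied scores cross the threshold together.
        while index < len(ranked) and ranked[index][0] == threshold:
            if ranked[index][1] == 1:
                tp += 1
            else:
                fp += 1
            index += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Illustrative data only, not results from the paper.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(area_under_curve(roc_points(labels, scores)))
```

The AUC equals the fraction of positive and negative sample pairs in which the positive sample receives the higher score, which is 8 out of 9 for the example data.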

3.3. Comparison with the Proposed Method on H. pylori Dataset

To further analyze the ability of the proposed method to predict PPIs, we tested it on a different dataset, the H. pylori dataset. This dataset contains a total of 2916 protein pairs, half of which are interacting pairs and the other half noninteracting pairs. Our prediction results are shown in Table 2. The method achieved an accuracy, precision, sensitivity, and MCC of 84.84%, 86.36%, 82.77%, and 74.30%, respectively. The ROC curves obtained on the H. pylori dataset are shown in Figure 5.

3.4. Comparison with Previous Method

In order to assess the proposed method more clearly, we compared its results with those of previous models on the Yeast dataset. As a classic classification algorithm, the Support Vector Machine (SVM) performs very well in distinguishing interacting from noninteracting protein pairs. For example, Guo et al. [9] combined a Support Vector Machine with Autocovariance to predict Protein-Protein Interactions in the Yeast dataset, and the results demonstrated its ability. Specifically, we used the same feature extraction method (AC applied to the PSSMs) to compare the classification performance of Rotation Forest and SVM on the same dataset. We used the grid search method to optimize the two parameters of the Support Vector Machine. The LIBSVM tools we adopted can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvm/. As can be seen from Table 3, when using SVM to predict the PPIs of the Yeast dataset, we obtained an accuracy, precision, sensitivity, and MCC of 95.86%, 96.46%, 95.21%, and 92.06%, respectively. Most of the SVM-based methods produced average values lower than those of our method on the Yeast dataset.

In addition, we compared our method with other existing methods on the Yeast and H. pylori datasets. Table 3 shows the average results of six other methods on the Yeast dataset; the accuracies obtained by these methods are between 75.08% and 89.33%. Their average accuracy, precision, sensitivity, and MCC values are lower than those of our method, which are 97.77%, 99.96%, 95.57%, and 95.64%, respectively. Table 4 shows the average results of six methods on the H. pylori dataset. The accuracies obtained by these methods are between 75.80% and 87.50%, and the accuracy of our proposed method is 84.84%, which lies within this range and is competitive with them.

3.5. Performance on Independent Dataset

Having achieved reasonably good results on the Yeast and H. pylori datasets, we tested the proposed method’s performance on independent datasets. We built our final prediction model using all 11188 pairs of the Yeast dataset as the training set with the parameters obtained by the grid search method: the maximum lag $lg$ is 40 in AC, and the number of feature subsets $K$ is 20 and the number of decision trees $L$ is 3 in RF. The independent test data consist of four datasets (C. elegans, E. coli, H. sapiens, and M. musculus) collected from the DIP database. Feature vectors were extracted from these four datasets with the same AC method applied to the PSSMs and used as the test input of the RF model. The results of our model are listed in Table 5; the prediction accuracies on C. elegans, E. coli, H. sapiens, and M. musculus are 96.01%, 97.73%, 98.30%, and 96.81%, respectively. These results show the strong performance of our approach in predicting the interactions of other species.

4. Conclusions

In this study, a stable and robust computational method based on features extracted from the PSSM has been proposed to predict PPIs. The main computational challenge for sequence-based PPI prediction is to find a suitable feature representation that fully describes the important information of protein interactions. To address this problem, we first extracted features from the Position-Specific Scoring Matrices (PSSMs) using the Autocovariance (AC) method. Then, the Rotation Forest (RF) model was employed as an accurate classifier for PPI prediction, and it performed better than the state-of-the-art SVM classifier. In order to evaluate the performance of the proposed method, five PPI datasets, that is, C. elegans, E. coli, H. pylori, H. sapiens, and M. musculus, were used to perform the comparisons. The experimental results showed that the proposed method outperforms the compared methods on the Yeast dataset and performs competitively on the H. pylori dataset. Consequently, the proposed approach can be considered a powerful tool for predicting PPIs.

Competing Interests

The authors declare that there are no competing interests regarding the publication of this paper.

Authors’ Contributions

Zhen-Guo Gao and Lei Wang contributed equally to this work.

Acknowledgments

This work is supported by the National Science Foundation of China, under Grants 61373086, 61572506, and 61401385, in part by the Guangdong Natural Science Foundation under Grant 2014A030313555, and in part by the Shenzhen Scientific Research and Development Funding Program under Grant JCYJ20140418095735569.

References

Z. Yin, T. Deng, L. E. Peterson et al., “Transcriptome analysis of human adipocytes implicates the NOD-like receptor pathway in obesity-induced adipose inflammation,” Molecular and Cellular Endocrinology, vol. 394, no. 1-2, pp. 80–87, 2014.
A.-C. Gavin, M. Bösche, R. Krause et al., “Functional organization of the yeast proteome by systematic analysis of protein complexes,” Nature, vol. 415, no. 6868, pp. 141–147, 2002.
K. A. Theofilatos, C. M. Dimitrakopoulos, A. K. Tsakalidis, S. D. Likothanassis, S. T. Papadimitriou, and S. P. Mavroudi, “Computational approaches for the prediction of protein-protein interactions: a survey,” Current Bioinformatics, vol. 6, no. 4, pp. 398–414, 2011.
N. Tuncbag, G. Kar, O. Keskin, A. Gursoy, and R. Nussinov, “A survey of available tools and web servers for analysis of protein-protein interactions and interfaces,” Briefings in Bioinformatics, vol. 10, no. 3, pp. 217–232, 2009.
T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki, “A comprehensive two-hybrid analysis to explore the yeast protein interactome,” Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 8, pp. 4569–4574, 2001.
Y. Ho, A. Gruhler, A. Heilbut et al., “Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry,” Nature, vol. 415, no. 6868, pp. 180–183, 2002.
N. J. Krogan, G. Cagney, H. Yu et al., “Global landscape of protein complexes in the yeast Saccharomyces cerevisiae,” Nature, vol. 440, no. 7084, pp. 637–643, 2006.
J.-D. J. Han, D. Dupuy, N. Bertin, M. E. Cusick, and M. Vidal, “Effect of sampling on topology predictions of protein-protein interaction networks,” Nature Biotechnology, vol. 23, no. 7, pp. 839–844, 2005.
Y. Guo, L. Yu, Z. Wen, and M. Li, “Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences,” Nucleic Acids Research, vol. 36, no. 9, pp. 3025–3030, 2008.
Z.-H. You, Z. Yin, K. Han, D.-S. Huang, and X. Zhou, “A semi-supervised learning approach to predict synthetic genetic interactions by combining functional and topological properties of functional gene network,” BMC Bioinformatics, vol. 11, article 343, 2010.
J. Shen, J. Zhang, X. Luo et al., “Predicting protein-protein interactions based only on sequences information,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 11, pp. 4337–4341, 2007.
Z. Ji, B. Wang, S. P. Deng, and Z. You, “Predicting dynamic deformation of retaining structure by LSSVR-based time series method,” Neurocomputing, vol. 137, pp. 165–172, 2014.
L. Zhu, Z.-H. You, D.-S. Huang, and B. Wang, “t-LSE: a novel robust geometric approach for modeling protein-protein interaction networks,” PLoS ONE, vol. 8, no. 4, Article ID e58368, 2013.
L. Zhu, Z.-H. You, and D.-S. Huang, “Increasing the reliability of protein-protein interaction networks via non-convex semantic embedding,” Neurocomputing, vol. 121, pp. 99–107, 2013.
T. Ideker, O. Ozier, B. Schwikowski, and A. F. Siegel, “Discovering regulatory and signalling circuits in molecular interaction networks,” Bioinformatics, vol. 18, no. 1, pp. S233–S240, 2002.
R. Jothi, M. G. Kann, and T. M. Przytycka, “Predicting protein-protein interaction by searching evolutionary tree automorphism space,” Bioinformatics, vol. 21, no. 1, pp. I241–I250, 2005.
A. J. Enright, I. Illopoulos, N. C. Kyrpides, and C. A. Ouzounis, “Protein interaction maps for complete genomes based on gene fusion events,” Nature, vol. 402, no. 6757, pp. 86–90, 1999.
Q. C. Zhang, D. Petrey, L. Deng et al., “Structure-based prediction of protein-protein interactions on a genome-wide scale,” Nature, vol. 490, no. 7421, pp. 556–560, 2012.
Z. Yin, A. Sadok, H. Sailem et al., “A screen for morphological complexity identifies regulators of switch-like transitions between discrete cell shapes,” Nature Cell Biology, vol. 15, no. 7, pp. 860–871, 2013.
Z. Yin, X. Zhou, C. Bakal et al., “Using iterative cluster merging with improved gap statistics to perform online phenotype discovery in the context of high-throughput RNAi screens,” BMC Bioinformatics, vol. 9, no. 1, article 264, 2008.
Y. Mao, Z. Xia, Z. Yin, Y. Sun, and Z. Wan, “Fault diagnosis based on fuzzy support vector machine with parameter tuning and feature selection,” Chinese Journal of Chemical Engineering, vol. 15, no. 2, pp. 233–239, 2007.
Z.-H. You, Y.-K. Lei, J. Gui, D.-S. Huang, and X. Zhou, “Using manifold embedding for assessing and predicting protein interactions from high-throughput experimental data,” Bioinformatics, vol. 26, no. 21, pp. 2744–2751, 2010.
Y.-K. Lei, Z.-H. You, Z. Ji, L. Zhu, and D.-S. Huang, “Assessing and predicting protein interactions by combining manifold embedding with multiple information integration,” BMC Bioinformatics, vol. 13, supplement 7, article S3, 2012.
Y.-K. Lei, Z.-H. You, T. Dong, Y.-X. Jiang, and J.-A. Yang, “Increasing reliability of protein interactome by fast manifold embedding,” Pattern Recognition Letters, vol. 34, no. 4, pp. 372–379, 2013.
Z.-H. You, Y.-K. Lei, L. Zhu, J. Xia, and B. Wang, “Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis,” BMC Bioinformatics, vol. 14, supplement 18, article S10, 2013.
S. Martin, D. Roe, and J.-L. Faulon, “Predicting protein-protein interactions using signature products,” Bioinformatics, vol. 21, no. 2, pp. 218–226, 2005.
J. J. Rodríguez, L. I. Kuncheva, and C. J. Alonso, “Rotation forest: a new classifier ensemble method,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1619–1630, 2006.
L. Nanni and A. Lumini, “Ensemble generation and feature selection for the identification of students with learning disabilities,” Expert Systems with Applications, vol. 36, no. 2, pp. 3896–3900, 2009.
M. Gribskov, A. D. McLachlan, and D. Eisenberg, “Profile analysis: detection of distantly related proteins,” Proceedings of the National Academy of Sciences of the United States of America, vol. 84, no. 13, pp. 4355–4358, 1987.
I. Xenarios, Ł. Salwínski, X. J. Duan, P. Higney, S.-M. Kim, and D. Eisenberg, “DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions,” Nucleic Acids Research, vol. 30, no. 1, pp. 303–305, 2002.
C. M. Deane, Ł. Salwiński, I. Xenarios, and D. Eisenberg, “Protein interactions: two methods for assessment of the reliability of high throughput observations,” Molecular & Cellular Proteomics, vol. 1, no. 5, pp. 349–356, 2002.
D. R. Cutler, T. C. Edwards Jr., K. H. Beard et al., “Random forests for classification in ecology,” Ecology, vol. 88, no. 11, pp. 2783–2792, 2007.
P. L. Braga, A. L. I. Oliveira, G. H. T. Ribeiro, and S. R. L. Meira, “Bagging predictors for estimation of software project effort,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN '07), pp. 1595–1600, Orlando, Fla, USA, August 2007.
J. C. Rain, L. Selig, H. De Reuse et al., “The protein-protein interaction map of Helicobacter pylori,” Nature, vol. 409, no. 6817, pp. 211–215, 2001, Erratum to Nature, vol. 409, no. 6821, article 743, 2001.
J. R. Bock and D. A. Gough, “Whole-proteome interaction mining,” Bioinformatics, vol. 19, no. 1, pp. 125–135, 2003.
L. Nanni, “Hyperplanes for predicting protein-protein interactions,” Neurocomputing, vol. 69, no. 1–3, pp. 257–263, 2005.
L. Nanni and A. Lumini, “An ensemble of K-local hyperplanes for predicting protein-protein interactions,” Bioinformatics, vol. 22, no. 10, pp. 1207–1210, 2006.
B. Liu, J. Yi, A. Sv et al., “QChIPat: a quantitative method to identify distinct binding patterns for two biological ChIP-seq samples in different experimental conditions,” BMC Genomics, vol. 14, no. 8, article S3, 2013.
D. T. Jones, “Protein secondary structure prediction based on position-specific scoring matrices,” Journal of Molecular Biology, vol. 292, no. 2, pp. 195–202, 1999.
D. T. Jones and J. J. Ward, “Prediction of disordered regions in proteins from position specific score matrices,” Proteins: Structure, Function and Genetics, vol. 53, no. 6, pp. 573–578, 2003.
X.-W. Chen and J. C. Jeong, “Sequence-based prediction of protein interaction sites with an integrative method,” Bioinformatics, vol. 25, no. 5, pp. 585–591, 2009.
S. F. Altschul, T. L. Madden, A. A. Schäffer et al., “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, 1997.
Y. Guo, M. Li, M. Lu, Z. Wen, and Z. Huang, “Predicting G-protein coupled receptors-G-protein coupling specificity based on autocross-covariance transform,” Proteins: Structure, Function, and Bioinformatics, vol. 65, no. 1, pp. 55–60, 2006.
M. Lapinsh, A. Gutcaits, P. Prusis, C. Post, T. Lundstedt, and J. E. S. Wikberg, “Classification of G-protein coupled receptors by alignment-independent extraction of principal chemical properties of primary amino acid sequences,” Protein Science, vol. 11, no. 4, pp. 795–805, 2002.
Z. Lin and X.-M. Pan, “Accurate prediction of protein secondary structural content,” Journal of Protein Chemistry, vol. 20, no. 3, pp. 217–220, 2001.
C.-T. Zhang, Z.-S. Lin, Z. Zhang, and M. Yan, “Prediction of the helix/strand content of globular proteins based on their primary sequences,” Protein Engineering, vol. 11, no. 11, pp. 971–979, 1998.
L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
T.-H. Lee and Y. Yang, “Bagging binary and quantile predictors for time series,” Journal of Econometrics, vol. 135, no. 1-2, pp. 465–497, 2006.
G. Rätsch, S. Mika, B. Schölkopf, and K.-R. Müller, “Constructing boosting algorithms from SVMs: an application to one-class classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 9, pp. 1184–1199, 2002.
M. H. Zweig and G. Campbell, “Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine,” Clinical Chemistry, vol. 39, no. 4, pp. 561–577, 1993.
Z.-H. You, J. Li, X. Gao et al., “Detecting protein-protein interactions with a novel matrix-based protein sequence representation and support vector machines,” BioMed Research International, vol. 2015, Article ID 867516, 9 pages, 2015.
Z.-H. You, K. C. C. Chan, and P. Hu, “Predicting protein-protein interactions from primary protein sequences using a novel multi-scale local feature representation scheme and the random forest,” PLoS ONE, vol. 10, no. 5, article e0125811, 2015.
Y. Huang, Z. You, X. Gao, L. Wong, and L. Wang, “Using weighted sparse representation model combined with discrete cosine transformation to predict protein-protein interactions from protein sequence,” BioMed Research International, vol. 2015, Article ID 902198, 10 pages, 2015.
Z.-H. You, J.-Z. Yu, L. Zhu, S. Li, and Z.-K. Wen, “A MapReduce based parallel SVM for large-scale predicting protein-protein interactions,” Neurocomputing, vol. 145, pp. 37–43, 2014.
Z.-H. You, S. Li, X. Gao, X. Luo, and Z. Ji, “Large-scale protein-protein interactions detection by integrating big biosensing data with computational model,” BioMed Research International, vol. 2014, Article ID 598129, 9 pages, 2014.
L. Yang, J.-F. Xia, and J. Gui, “Prediction of protein-protein interactions from protein sequence using local descriptors,” Protein and Peptide Letters, vol. 17, no. 9, pp. 1085–1090, 2010.
Y. Z. Zhou, Y. Gao, and Y. Y. Zheng, “Prediction of protein-protein interactions using local description of amino acid sequence,” in Advances in Computer Science and Education Applications, Part II, vol. 02, pp. 254–262, Springer, Berlin, Germany, 2011.

Copyright

Copyright © 2016 Zhen-Guo Gao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

