BioMed Research International
Volume 2017 (2017), Article ID 4590609, 10 pages
https://doi.org/10.1155/2017/4590609
Research Article

HMMBinder: DNA-Binding Protein Prediction Using HMM Profile Based Features

1Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
2School of Computing, Information and Mathematical Sciences, The University of the South Pacific, Suva, Fiji
3Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, QLD, Australia
4School of Engineering and Physics, The University of the South Pacific, Suva, Fiji
5RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
6Department of Computer Science, Morgan State University, Baltimore, MD, USA

Correspondence should be addressed to Swakkhar Shatabda

Received 29 August 2017; Accepted 22 October 2017; Published 14 November 2017

Academic Editor: Paul Harrison

Copyright © 2017 Rianon Zaman et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

DNA-binding proteins play important roles in various processes within the cell. Over the last decade, a wide range of classification algorithms and feature extraction techniques have been used to predict them computationally. In this paper, we propose a novel DNA-binding protein prediction method called HMMBinder. HMMBinder uses monogram and bigram features extracted from the HMM profiles of the protein sequences. To the best of our knowledge, this is the first application of HMM profile based features to the DNA-binding protein prediction problem. We applied Support Vector Machines (SVM) as the classification technique in HMMBinder. Our method was tested on standard benchmark datasets. We experimentally show that our method outperforms the state-of-the-art methods found in the literature.

1. Introduction

DNA-binding proteins play a vital role in various cellular processes. They are essential in transcriptional regulation, recombination, genome rearrangements, replication, repair, and DNA modification [1]. DNA-binding proteins are the proteins, in both eukaryotes and prokaryotes, that bind to DNA while acting as activators or repressors. It has been observed that only 2-3% of prokaryotic and 4-5% of eukaryotic proteins can bind to DNA [2, 3]. A wide variety of experimental methods, such as in vitro methods [4, 5] like filter binding assays, chromatin immunoprecipitation on microarrays (ChIP-chip), genetic analysis, and X-ray crystallography, have been used to identify DNA-binding proteins. However, these methods have proven to be expensive and time consuming. Therefore, there is a growing demand for fast and cost-effective computational methods to solve this problem.

Most of the computational methods used in the literature to predict DNA-binding proteins formulate the problem as a supervised learning problem. In practice, the number of known DNA-binding proteins is very small compared to the large number of non-DNA-binding and unknown proteins. DNA-binding protein prediction is often modeled as a binary classification problem where, given a protein sequence as input, the task is to predict whether the protein is DNA-binding or not. Note that the challenge here is to select a proper dataset for training and testing that accounts for this class imbalance. Many supervised learning algorithms have been used in the literature to solve the problem. Among them, Artificial Neural Networks (ANN) [6], Support Vector Machines (SVM) [7, 8], ensemble methods [9], the Naïve Bayes classifier [10], Random Forest [11], Convolutional Neural Networks [12], Logistic Regression [13], and the AdaBoost classifier [5] are well-regarded. Support Vector Machines (SVM) are among the best performing classifiers used for DNA-binding protein identification [7, 8, 14, 15].

A great number of web based tools and methods have been developed for DNA-binding protein prediction and are available for use. Several of them are mentioned here: DNABinder [7], DNA-Prot [16], iDNA-Prot [11], iDNA-Prot|dis [14], DBPPred [17], iDNAPro-PseAAC [8], PseDNA-Pro [18], Kmer1 + ACC [19], Local-DPP [20], SVM-PSSM-DT [21], PNImodeler [22], CNNsite [12], and BindUP [23]. Most of these methods use sequence, profile, or structure based features. Structural feature based methods in the literature have used features such as structural motifs, electrostatic potential, the dipole moment, and α-carbon-only models [13, 24, 25]. On the other hand, sequence based methods often depend on PSSM profile based information or pseudo-amino-acid compositions [8, 14, 15, 17, 20, 26, 27]. In [28], HMM based profiles were used to generate features for protein fold recognition.

In this paper, we propose HMMBinder, a novel DNA-binding protein prediction tool using HMM profile based features of a protein sequence. Our method uses monogram and bigram features derived from the HMM profile, which prove effective compared to PSSM or sequence based features. We also use SVM as the classifier and standard benchmark datasets to test our method. Under the standard evaluation metrics, our method significantly improves over the state-of-the-art methods and the features used in the literature. We have also developed a web server that is publicly available at http://brl.uiu.ac.bd/HMMBinder.

The rest of the paper is organized following the general 5-step guideline suggested in [29] for protein attribute prediction. First, the benchmark datasets selected for this problem are described, followed by a description of the protein representation through feature extraction. Then we describe the classification algorithm selected for our approach, followed by the performance evaluation techniques deployed in this paper. Lastly, we describe the web server that we developed for this problem. The results section presents the details of the experimental results, followed by an analytical discussion. The paper concludes with a summary and an indication of future work.

2. Methods and Materials

In this section, we provide the details of the materials and methods of this paper. Figure 1 provides a system diagram of our proposed method. For the training phase, all the protein sequences are fed to HHBlits [30], an iterative sequence search tool based on HMM-HMM alignment, run against the latest UniProt database. HHBlits produces an HMM profile file as output, which is then used by our feature extraction method to generate monogram and bigram features. The monogram and bigram features are concatenated and used as the training feature set to train the classifier. We use SVM with a linear kernel as the classification algorithm, and a trained model is stored for the testing phase. The testing phase is similar to the training phase; however, the labels of the test dataset are not given to the classifier. The stored model is also used for the web server implementation of HMMBinder.
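
To make the profile-generation step concrete, the sketch below shows one way to invoke HHBlits from Python; the database prefix, file names, and wrapper function are assumptions and must be adapted to the local HHBlits/UniProt installation (the iteration count and E-value cutoff follow Section 2.2).

```python
import subprocess

def run_hhblits(fasta_path, hhm_path, db_prefix="uniprot_hhsuite_db",
                n_iter=4, evalue=1e-3):
    """Run HHBlits on one protein sequence and write its HMM profile (.hhm).
    The database prefix is a placeholder for a locally built UniProt database."""
    cmd = [
        "hhblits",
        "-i", fasta_path,     # input sequence in FASTA format
        "-d", db_prefix,      # HH-suite database prefix
        "-ohhm", hhm_path,    # output HMM profile file
        "-n", str(n_iter),    # number of search iterations
        "-e", str(evalue),    # E-value inclusion threshold
    ]
    subprocess.run(cmd, check=True)
```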

Figure 1: System diagram of HMMBinder.
2.1. Datasets

Selection of benchmark datasets is essential in classifier and predictor design. In this paper, we use a popular benchmark dataset called benchmark1075 to train our model. Later, we test the performance using cross validation and on a separate independent test set known as the independent186 dataset. This section provides a brief overview of these two datasets. Both of them are widely used in the DNA-binding protein prediction literature [8, 14, 18, 20, 31].

2.1.1. Dataset Benchmark1075

This dataset was first introduced in [14]. It consists of 1075 protein sequences: 525 DNA-binding and 550 non-DNA-binding. All the protein sequences were taken from the PDB [32]. This dataset is one of the largest DNA-binding protein prediction datasets and is thus suitable for training purposes.

2.1.2. Dataset Independent186

Lou et al. [17] constructed this independent dataset consisting of 93 DNA-binding and 93 non-DNA-binding protein sequences. They used BLASTCLUST [33] on the benchmark dataset to remove sequences that have more than 25% similarity.

2.2. Feature Extraction

The training dataset used for a binary classification problem consists of two types of instances: positive and negative. Formally, $D = D^{+} \cup D^{-}$, where $D^{+}$ contains the DNA-binding (positive) protein instances and $D^{-}$ the non-DNA-binding (negative) ones. Next, the task is to represent each protein instance as a feature vector suitable for training.

Here, a protein $P$ is represented as a feature vector of dimension $d$. Most of the methods in the literature of DNA-binding protein prediction use either sequence and PSSM profile based features or structure based features. To the best of our knowledge, there has been no application of features derived from HMM profiles. In this paper, we have used HHBlits [30] to generate HMM profiles. HMM profiles are comparatively more effective [30, 34] for remote homology detection. HMM profiles were generated using four iterations of HHBlits with an E-value cutoff of 0.001 against the latest UniProt database [35]. An HMM profile is an $L \times 20$ matrix produced by HHBlits, where $L$ is the length of the protein sequence. The 20 values in each row are the substitution probabilities of each type of amino acid residue at that position of the protein sequence, stored in a log-scaled form. These values are first converted to linear probabilities using the following formula: $p = 2^{-s/1000}$, where $s$ is the score reported by HHBlits.
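
The sketch below illustrates one way to read the 20 emission scores per residue from an HHBlits .hhm file and apply the conversion above; it assumes the standard HH-suite text layout and should be checked against the HH-suite version in use.

```python
import numpy as np

def read_hhm_matrix(hhm_path):
    """Parse an HHBlits .hhm file into an L x 20 matrix of linear probabilities.

    Assumes the standard HH-suite layout: emission scores are stored as
    -1000 * log2(p), with '*' standing for probability 0, on lines of the
    form '<residue> <position> <20 scores> ...'."""
    with open(hhm_path) as fh:
        lines = fh.readlines()

    # Skip ahead to the per-residue block, which starts after the 'HMM' header.
    start = next(i for i, ln in enumerate(lines) if ln.startswith("HMM")) + 1
    rows = []
    for ln in lines[start:]:
        if ln.startswith("//"):            # end of the profile
            break
        fields = ln.split()
        # Emission lines begin with a single amino-acid letter and the residue
        # index, followed by 20 scores; transition/header lines are skipped.
        if (len(fields) >= 22 and len(fields[0]) == 1
                and fields[0].isalpha() and fields[1].isdigit()):
            scores = fields[2:22]
            rows.append([0.0 if s == "*" else 2.0 ** (-int(s) / 1000.0)
                         for s in scores])
    return np.array(rows)                  # shape (L, 20)
```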

We generated two types of features, monogram and bigram, from the generated HMM profile matrix, denoted here as $S$. We provide a brief description of the monogram and bigram features extracted from the HMM profile matrix below.

2.2.1. Monogram Features

Monogram features [36] are calculated by taking the normalized sum of the substitution probability values in each column. The size of this feature group is 20, corresponding to the 20 different amino acids. The feature can be defined formally as follows: $m_j = \frac{1}{L}\sum_{i=1}^{L} S_{i,j}$.

Note that the values of $m_j$ depend on the column; that is, $1 \le j \le 20$. Here, $S_{i,j}$ is the value in the $i$th row and $j$th column of the matrix $S$. We denote the monogram features as $M$, which is a vector of the form $M = [m_1, m_2, \ldots, m_{20}]$.
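
A minimal sketch of the monogram computation, assuming the $L \times 20$ probability matrix from the previous step is available as a NumPy array:

```python
import numpy as np

def monogram_features(S):
    """Monogram features: the normalized column-wise sum of the L x 20
    substitution-probability matrix S, i.e. m_j = (1/L) * sum_i S[i, j].
    Returns a 20-dimensional vector, one value per amino-acid column."""
    return S.sum(axis=0) / S.shape[0]
```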

2.2.2. Bigram Features

Bigram features have been successfully used in the literature for protein attribute prediction [37]. Bigram features are normalized bigrams taken over all pairs of columns. Hence the total number of features generated from this group is $20 \times 20 = 400$. Bigram features are generated using the following formula: $b_{j,k} = \frac{1}{L-1}\sum_{i=1}^{L-1} S_{i,j}\,S_{i+1,k}$.

Here $j$ and $k$ denote the column pair for which the bigram is calculated and are in the ranges $1 \le j \le 20$ and $1 \le k \le 20$. We denote this feature vector as $B$, where $B$ has the form $B = [b_{1,1}, b_{1,2}, \ldots, b_{20,20}]$.
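
A corresponding sketch for the bigram features, again assuming $S$ is the $L \times 20$ probability matrix; the division by $(L-1)$ follows the "normalized bigrams" description above:

```python
import numpy as np

def bigram_features(S):
    """Bigram features over all 20 x 20 column pairs of the L x 20 matrix S:
    b[j, k] accumulates S[i, j] * S[i+1, k] over consecutive rows,
    normalized by (L - 1). Returns a flattened 400-dimensional vector."""
    L = S.shape[0]
    B = S[:-1].T @ S[1:] / (L - 1)   # entry [j, k] = (1/(L-1)) * sum_i S[i, j] * S[i+1, k]
    return B.ravel()
```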

We also generate Position Specific Scoring Matrix (PSSM) profiles for each of the protein sequences using PSI-BLAST [38]. PSSMs were generated using three iterations of PSI-BLAST against the nr database with an E-value cutoff of 0.001. PSSM profiles have a form similar to HMM profiles: an $L \times 20$ matrix whose entries likewise describe per-position amino acid substitution preferences. We generate monogram and bigram features from the PSSM profiles as well. These PSSM based monogram and bigram features are well established in the literature [36, 37, 39–42]. Note that all the monogram features are vectors of size 20 and all the bigram features are vectors of size 400. We also use a combination of the monogram and bigram features, which is a vector of size 420.
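
For completeness, a sketch of how the 420-dimensional representation is assembled from the two feature groups, reusing the monogram and bigram helpers sketched above (a PSSM-derived matrix is handled the same way once loaded):

```python
import numpy as np

def mono_bigram_vector(S):
    """Concatenate the 20 monogram and 400 bigram features of one profile
    matrix S into a single 420-dimensional feature vector."""
    return np.concatenate([monogram_features(S), bigram_features(S)])
```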

2.3. Support Vector Machine

We have used Support Vector Machines (SVM) as our classification technique. SVM has been successfully used in protein attribute prediction in general [28, 39, 43] and particularly in DNA-binding protein prediction [7, 8]. SVM is a maximum margin classifier that attempts to learn a hyperplane from the training samples that separates the positive and negative data points of a binary classification problem. The selected hyperplane is the one for which the separation width, or margin, is maximum, and the nature of the hyperplane depends on the kernel function used. SVM training generally optimizes the following Lagrange multiplier (dual) objective: $$\max_{\alpha}\; \sum_{i} \alpha_i - \frac{1}{2}\sum_{i}\sum_{j} \alpha_i \alpha_j y_i y_j K(x_i, x_j), \quad \text{subject to } \sum_{i} \alpha_i y_i = 0,\; \alpha_i \ge 0.$$

The prediction of an SVM classifier is defined as follows: $$f(x) = \operatorname{sign}\Big(\sum_{i} \alpha_i y_i K(x_i, x) + b\Big).$$

Here the transformation of the data points by the mapping $\phi$, with $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$, could be linear, polynomial, or any other kernel function. In this paper, we explored linear and radial basis function (RBF) kernels. The linear kernel is of the following form: $$K(x_i, x_j) = x_i^{T} x_j.$$

Here $\phi(x) = x$ for the linear kernel. RBF kernels follow the following definition: $$K(x_i, x_j) = \exp\big(-\gamma \lVert x_i - x_j \rVert^2\big).$$

Often slack variables are used along with the maximum margin SVM classifier to allow some generalization error, controlled by a regularization parameter $C$.
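
As a rough illustration of the classification step, the following sketch trains scikit-learn SVMs with the two kernels discussed above; the values of C and gamma are placeholders, not the settings used in the paper.

```python
from sklearn.svm import SVC

def train_svm(X, y, kernel="linear", C=1.0, gamma="scale"):
    """Fit an SVM on the feature matrix X (n_proteins x 420) and labels y
    (1 = DNA-binding, 0 = non-DNA-binding). probability=True enables the
    class probabilities needed later for auROC/auPR."""
    clf = SVC(kernel=kernel, C=C, gamma=gamma, probability=True)
    return clf.fit(X, y)
```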

2.4. Performance Evaluation

A good number of effective evaluation metrics have been suggested for single valued and multivalued classification and prediction [29, 44]. In the literature of DNA-binding protein prediction, the most widely used metrics are accuracy, sensitivity, specificity, MCC, auROC, and auPR. In this section, we first describe these evaluation metrics as used in this paper.

The first measure, accuracy, is the ratio (or percentage) of correctly classified instances, positive or negative, among a given number of protein instances: $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$ Here TP is the total number of true positives, or correctly classified positive samples, and TN is the number of correctly classified negative samples. FP and FN are the numbers of instances incorrectly classified as positive and as negative, respectively. Sensitivity is the true positive rate, or the ratio of true positives to the total number of positive examples: $$\text{Sensitivity} = \frac{TP}{TP + FN}.$$ Specificity, on the other hand, is the true negative rate and can be defined as: $$\text{Specificity} = \frac{TN}{TN + FP}.$$ All three of these measures have a maximum value of 1, indicating the best classifier, and a minimum value of 0, meaning the worst classifier. The Matthews Correlation Coefficient (MCC) denotes how well a binary classifier is working. The value of MCC is in the range $[-1, +1]$; a perfect classifier has the maximum MCC value of +1. MCC is defined as: $$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.$$

Note that all these metrics depend, for probabilistic outputs, on the threshold set for the classifier. Two other metrics that do not depend on thresholds are the area under the receiver operating characteristic curve (auROC) and the area under the precision-recall curve (auPR). Both auROC and auPR have a maximum value of 1 for a perfect classifier. The ROC curve plots the true positive rate against the false positive rate at different threshold values, and the precision-recall curve plots precision against recall.
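
Assuming hard predictions and positive-class probabilities are available, all six metrics can be computed with scikit-learn as in this sketch:

```python
from sklearn.metrics import (accuracy_score, average_precision_score,
                             confusion_matrix, matthews_corrcoef,
                             recall_score, roc_auc_score)

def evaluate(y_true, y_pred, y_score):
    """Evaluation metrics used in this paper; y_pred holds 0/1 predictions
    and y_score the predicted probability of the positive class."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "sensitivity": recall_score(y_true, y_pred),   # TP / (TP + FN)
        "specificity": tn / (tn + fp),                 # TN / (TN + FP)
        "MCC": matthews_corrcoef(y_true, y_pred),
        "auROC": roc_auc_score(y_true, y_score),
        "auPR": average_precision_score(y_true, y_score),
    }
```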

To reduce the training bias, several sampling methods are proposed in the literature [45] and widely used for protein attribute prediction [29]. In this paper, we have used 10-fold cross validation and jack-knife tests which are widely used in the literature of DNA-binding protein prediction [8, 11, 14, 17].
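
Both protocols are available in scikit-learn; a minimal sketch follows, where the jack-knife test is taken to be leave-one-out cross validation:

```python
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_predict
from sklearn.svm import SVC

def cv_predictions(X, y, protocol="10fold", seed=0):
    """Out-of-fold predictions under 10-fold cross validation or the
    jack-knife (leave-one-out) test, using a linear-kernel SVM."""
    clf = SVC(kernel="linear")
    if protocol == "jackknife":
        cv = LeaveOneOut()
    else:
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    return cross_val_predict(clf, X, y, cv=cv)
```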

3. Results and Discussion

In this section, we present the results of the experiments that were carried out in this study. All the methods were implemented in the Python 3.4 programming language. The scikit-learn library [46] for Python was used for implementing the machine learning algorithms. All experiments were conducted on computing services provided by CITS, United International University.

3.1. Effect of HMM Based Features

We ran a number of experiments to test the effectiveness of the HMM profile based features on the benchmark dataset. We extracted six groups of features for this experiment: PSSM-Monogram, PSSM-Bigram, PSSM-Mono + Bigram, HMM-Monogram, HMM-Bigram, and HMM-Mono + Bigram. Each of these feature sets was tested with SVM classifiers using linear and RBF kernels. We further tested the performance of these features using two ensemble classifiers: Random Forest and AdaBoost. For these experiments we performed 10-fold cross validation. The results in terms of accuracy, sensitivity, specificity, auPR, auROC, and MCC are reported; only the averages of these values are given in Table 1. Note that the best results were found using the combination of HMM-Monogram and HMM-Bigram features, where 82.87% accuracy was achieved with the SVM linear kernel. In every case with the SVM linear kernel, HMM based features achieved better accuracy than the corresponding PSSM based features. Similar trends can be noticed for auROC, MCC, and sensitivity. Specificity, auROC, and auPR are slightly improved in the experiments with the SVM RBF kernel. We also show the ROC curves for each of these experiments in Figures 2, 3, and 4.
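
The experimental grid behind Table 1 can be reproduced with a loop of the following shape; the feature-set dictionary, classifier settings, and accuracy-only scoring are simplifications, not the authors' original script:

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def compare_feature_sets(feature_sets, y):
    """feature_sets maps a name such as 'HMM-Mono + Bigram' to an
    (n_proteins, d) matrix; returns mean 10-fold CV accuracy per
    (feature set, classifier) pair."""
    classifiers = {
        "SVM-linear": SVC(kernel="linear"),
        "SVM-RBF": SVC(kernel="rbf", gamma="scale"),
        "RandomForest": RandomForestClassifier(n_estimators=100),
        "AdaBoost": AdaBoostClassifier(),
    }
    results = {}
    for feat_name, X in feature_sets.items():
        for clf_name, clf in classifiers.items():
            scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
            results[(feat_name, clf_name)] = scores.mean()
    return results
```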

Table 1: Comparison of performances of different features and SVM kernels on the benchmark dataset using 10-fold cross validation.
Figure 2: Using monogram features. Receiver operating characteristic curves for (a) SVM linear kernel classifier using HMM-Monogram features, (b) SVM linear kernel classifier using PSSM-Monogram features, (c) SVM RBF kernel classifier using HMM-Monogram features, and (d) SVM RBF kernel classifier using PSSM-Monogram features.
Figure 3: Using bigram features. Receiver operating characteristic curves for (a) SVM linear kernel classifier using HMM-Bigram features, (b) SVM linear kernel classifier using PSSM-Bigram features, (c) SVM RBF kernel classifier using HMM-Bigram features, and (d) SVM RBF kernel classifier using PSSM-Bigram features.
Figure 4: Using (Mono + Bi)gram features. Receiver operating characteristic curves for (a) SVM linear kernel classifier using HMM-Mono + Bigram features, (b) SVM linear kernel classifier using PSSM-Mono + Bigram features, (c) SVM RBF kernel classifier using HMM-Mono + Bigram features, and (d) SVM RBF kernel classifier using PSSM-Mono + Bigram features.
3.2. Comparison with Other Methods

We have compared the performance of HMMBinder on the benchmark1075 dataset with several previous methods and tools for DNA-binding protein prediction: DNABinder [7], DNA-Prot [16], iDNA-Prot [11], iDNA-Prot|dis [14], DBPPred [17], iDNAPro-PseAAC [8], PseDNA-Pro [18], Kmer1 + ACC [19], and Local-DPP [20]. The results reported here for these methods are taken from [8, 20]. The comparisons were made in terms of accuracy, sensitivity, specificity, MCC, and auROC. To make a fair comparison with the other methods, we performed the jack-knife test as done in earlier studies, and the results are reported in Table 2.

Table 2: Comparison of performance of the proposed method with other state-of-the-art predictors using jack-knife test on the benchmark dataset.

The best values in Table 2 are shown in boldface. The results show a clear margin of more than 7% improvement in accuracy over the previous best method, Local-DPP [20]. Similar improvements were found in the other metrics too. In particular, MCC is increased by 22% compared to the previous best method.

We further experimented to test the effectiveness of HMMBinder on the independent test set. These results are shown in Table 3. Here our results are not the best but are among the best. In terms of accuracy, our results are very close to those of iDNAPro-PseAAC [8]; their results were strong on the benchmark dataset and similar to ours on the independent dataset. The specificity of HMMBinder was among the best, second only to DNA-Threader, which performed poorly in terms of accuracy. Considering the difficulty of the independent dataset, we believe that our method has not been overtrained on the benchmark dataset and that its performance is promising enough to be considered a generalized method. Based on these results, we decided to build the web application on the model trained on the benchmark dataset.

Table 3: Comparison of performance of the proposed method with other state-of-the-art predictors on the independent dataset.

Note that the results on the independent dataset are comparable to, but do not improve upon, the state-of-the-art methods. The main focus of this research was to build a classifier based on HMM profiles instead of PSSM profile based features, and we experimentally showed the effectiveness of the HMM profile based features over PSSM. In the future, we aim to improve performance on the independent dataset as well.

Additionally, we would like to highlight two points. Firstly, the datasets that we used were filtered using BLASTCLUST. It is important to remove sequences with more than 25% similarity from the dataset before applying the training and testing methods. We used the dataset proposed by Lou et al. [17], a widely accepted standard independent test dataset from which sequences with 25% or more similarity to other sequences had been removed. We believe it would be interesting to see the effect of other clustering tools such as CLUSTALW [47]. Secondly, feature selection methods are gaining popularity for bioinformatics data and supervised machine learning. We believe that using sophisticated feature selection methods, such as maximum relevance minimum redundancy (mRMR) [48] and maximum relevance maximum distance (MRMD) [49], could improve the results further.

3.3. Web Server Implementation

We have implemented a web based application, called HMMBinder, based on the proposed method. It is readily available for use at http://brl.uiu.ac.bd/HMMBinder. The server was implemented with PHP at the front end and a Python based prediction engine at the back end. The tool requires an HMM profile as input, which can be generated by HHBlits. The features are extracted automatically by the Python program, and the value predicted by the trained model is shown in the web form. The web site contains a "read me" guide and the necessary information required to run the application.
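
A minimal sketch of what such a back end does for an uploaded profile, reusing the helper functions sketched in Section 2.2; the model file name is hypothetical and the deployed server may differ in detail:

```python
import joblib  # for loading the stored scikit-learn model

def predict_from_hhm(hhm_path, model_path="hmmbinder_svm.joblib"):
    """Classify one protein from its HHBlits .hhm profile using the stored
    SVM model (model_path is a hypothetical file name)."""
    S = read_hhm_matrix(hhm_path)               # L x 20 probability matrix
    x = mono_bigram_vector(S).reshape(1, -1)    # 420-dimensional feature vector
    model = joblib.load(model_path)
    label = model.predict(x)[0]
    return "DNA-binding" if label == 1 else "non-DNA-binding"
```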

4. Conclusion

In this paper, we have introduced HMMBinder, an HMM profile based method for the DNA-binding protein prediction problem. We have used monogram and bigram features extracted from the HMM profiles generated by HHBlits, together with an SVM classification algorithm, trained on a standard benchmark dataset. Our method achieves a considerable improvement over the other state-of-the-art methods on this dataset and performs comparably well on the independent dataset. We have also established a web based application for our method, trained on the benchmark dataset. In the future, we wish to extract more effective features and assemble a larger dataset to train our model so that the results on the independent dataset improve as well. We believe there is scope for improvement.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

References

  1. H. M. Berman, J. M. Thornton, N. M. Luscombe, and S. E. Austin, "An overview of the structures of protein-DNA complexes," Genome Biology, vol. 1, 2000.
  2. E. W. Stawiski, L. M. Gregoret, and Y. Mandel-Gutfreund, "Annotating nucleic acid-binding function based on protein structure," Journal of Molecular Biology, vol. 326, no. 4, pp. 1065–1079, 2003.
  3. S. Jones, J. M. Thornton, H. P. Shanahan, and M. A. Garcia, "Identifying DNA-binding proteins using structural motifs and the electrostatic potential," Nucleic Acids Research, vol. 32, no. 16, pp. 4732–4741, 2004.
  4. R. Jaiswal, S. K. Singh, D. Bastia, and C. R. Escalante, "Crystallization and preliminary X-ray characterization of the eukaryotic replication terminator Reb1-Ter DNA complex," Acta Crystallographica Section F: Structural Biology Communications, vol. 71, pp. 414–418, 2015.
  5. R. E. Langlois and H. Lu, "Boosting the prediction and understanding of DNA-binding domains from sequence," Nucleic Acids Research, vol. 38, no. 10, Article ID gkq061, pp. 3149–3158, 2010.
  6. S. Ahmad, M. M. Gromiha, and A. Sarai, "Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information," Bioinformatics, vol. 20, no. 4, pp. 477–486, 2004.
  7. M. Kumar, M. M. Gromiha, and G. P. S. Raghava, "Identification of DNA-binding proteins using support vector machines and evolutionary profiles," BMC Bioinformatics, vol. 8, no. 1, article 463, 2007.
  8. B. Liu, S. Wang, and X. Wang, "DNA binding protein identification by combining pseudo amino acid composition and profile-based protein representation," Scientific Reports, vol. 5, Article ID 15479, 2015.
  9. L. Song, D. Li, X. Zeng, Y. Wu, L. Guo, and Q. Zou, "nDNA-prot: identification of DNA-binding proteins based on unbalanced classification," BMC Bioinformatics, vol. 15, no. 1, article 298, 2014.
  10. C. Yan, M. Terribilini, F. Wu, R. L. Jernigan, D. Dobbs, and V. Honavar, "Predicting DNA-binding sites of proteins from amino acid sequence," BMC Bioinformatics, vol. 7, no. 1, article 262, 2006.
  11. W.-Z. Lin, J.-A. Fang, X. Xiao, and K.-C. Chou, "iDNA-Prot: identification of DNA binding proteins using random forest with grey model," PLoS ONE, vol. 6, no. 9, Article ID e24756, 2011.
  12. J. Zhou, Q. Lu, R. Xu, L. Gui, and H. Wang, "CNNsite: prediction of DNA-binding residues in proteins using Convolutional Neural Network with sequence features," in Proceedings of the 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2016), pp. 78–85, China, December 2016.
  13. A. Szilágyi and J. Skolnick, "Efficient prediction of nucleic acid binding function from low-resolution protein structures," Journal of Molecular Biology, vol. 358, no. 3, pp. 922–933, 2006.
  14. B. Liu, J. Xu, and X. Lan, "iDNA-Prot|dis: identifying DNA-binding proteins by incorporating amino acid distance-pairs and reduced alphabet profile into the general pseudo amino acid composition," PLoS ONE, vol. 9, no. 9, Article ID e106691, 2014.
  15. Y. Fang, Y. Guo, Y. Feng, and M. Li, "Predicting DNA-binding proteins: approached from Chou's pseudo amino acid composition and other specific sequence features," Amino Acids, vol. 34, no. 1, pp. 103–109, 2008.
  16. K. K. Kumar, G. Pugalenthi, and P. N. Suganthan, "DNA-Prot: identification of DNA binding proteins from protein sequence information using random forest," Journal of Biomolecular Structure and Dynamics, vol. 26, no. 6, pp. 679–686, 2009.
  17. W. Lou, X. Wang, F. Chen, Y. Chen, B. Jiang, and H. Zhang, "Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naïve Bayes," PLoS ONE, vol. 9, no. 1, Article ID e86703, 2014.
  18. B. Liu, J. Xu, S. Fan, R. Xu, J. Zhou, and X. Wang, "PseDNA-Pro: DNA-binding protein identification by combining Chou's PseAAC and physicochemical distance transformation," Molecular Informatics, vol. 34, no. 1, pp. 8–17, 2015.
  19. Q. Dong, S. Wang, K. Wang, X. Liu, and B. Liu, "Identification of DNA-binding proteins by auto-cross covariance transformation," in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2015), pp. 470–475, USA, November 2015.
  20. L. Wei, J. Tang, and Q. Zou, "Local-DPP: an improved DNA-binding protein prediction method by exploring local evolutionary information," Information Sciences, vol. 384, pp. 135–144, 2017.
  21. R. Xu, J. Zhou, H. Wang, Y. He, X. Wang, and B. Liu, "Identifying DNA-binding proteins by combining support vector machine and PSSM distance transformation," BMC Systems Biology, vol. 9, no. 1, article S10, 2015.
  22. J. Im, N. Tuvshinjargal, B. Park, W. Lee, D.-S. Huang, and K. Han, "PNImodeler: web server for inferring protein-binding nucleotides from sequence data," BMC Genomics, vol. 16, no. 3, article S6, 2015.
  23. I. Paz, E. Kligun, B. Bengad, and Y. Mandel-Gutfreund, "BindUP: a web server for non-homology-based prediction of DNA and RNA binding proteins," Nucleic Acids Research, vol. 44, no. W1, pp. W568–W574, 2016.
  24. H. P. Shanahan, M. A. Garcia, S. Jones, and J. M. Thornton, "Identifying DNA-binding proteins using structural motifs and the electrostatic potential," Nucleic Acids Research, vol. 32, no. 16, pp. 4732–4741, 2004.
  25. G. Nimrod, M. Schushan, A. Szilágyi, C. Leslie, and N. Ben-Tal, "iDBPs: a web server for the identification of DNA binding proteins," Bioinformatics, vol. 26, no. 5, Article ID btq019, pp. 692–693, 2010.
  26. R. Xu, J. Zhou, B. Liu et al., "Identification of DNA-binding proteins by incorporating evolutionary information into pseudo amino acid composition via the top-n-gram approach," Journal of Biomolecular Structure and Dynamics, vol. 33, no. 8, pp. 1720–1730, 2015.
  27. X.-W. Zhao, X.-T. Li, Z.-Q. Ma, and M.-H. Yin, "Identify DNA-binding proteins with optimal Chou's amino acid composition," Protein and Peptide Letters, vol. 19, no. 4, pp. 398–405, 2012.
  28. J. Lyons, A. Dehzangi, R. Heffernan et al., "Advancing the accuracy of protein fold recognition by utilizing profiles from hidden Markov models," IEEE Transactions on NanoBioscience, vol. 14, no. 7, pp. 761–772, 2015.
  29. K.-C. Chou, "Some remarks on protein attribute prediction and pseudo amino acid composition," Journal of Theoretical Biology, vol. 273, pp. 236–247, 2011.
  30. M. Remmert, A. Biegert, A. Hauser, and J. Söding, "HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment," Nature Methods, vol. 9, no. 2, pp. 173–175, 2012.
  31. B. Liu, L. Fang, F. Liu et al., "Identification of real microRNA precursors with a pseudo structure status composition approach," PLoS ONE, vol. 10, no. 3, Article ID e0121501, 2015.
  32. H. M. Berman, J. Westbrook, Z. Feng et al., "The Protein Data Bank," in International Tables for Crystallography Volume F: Crystallography of Biological Macromolecules, pp. 675–684, 2006.
  33. I. Dondoshansky and Y. Wolf, Blastclust (NCBI Software Development Toolkit), NCBI, Maryland, Md, USA, 2002.
  34. D. B. Kuchibhatla, W. A. Sherman, B. Y. W. Chung et al., "Powerful sequence similarity search methods and in-depth manual analyses can identify remote homologs in many apparently 'orphan' viral proteins," Journal of Virology, vol. 88, no. 1, pp. 10–20, 2014.
  35. The UniProt Consortium, "UniProt: the universal protein knowledgebase," Nucleic Acids Research, vol. 45, no. D1, pp. D158–D169, 2017.
  36. Y.-H. Taguchi and M. M. Gromiha, "Application of amino acid occurrence for discriminating different folding types of globular proteins," BMC Bioinformatics, vol. 8, no. 1, article 404, 2007.
  37. A. Sharma, J. Lyons, A. Dehzangi, and K. K. Paliwal, "A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition," Journal of Theoretical Biology, vol. 320, pp. 41–46, 2013.
  38. S. F. Altschul, T. L. Madden, A. A. Schäffer et al., "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs," Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, 1997.
  39. R. Sharma, A. Dehzangi, J. Lyons, K. Paliwal, T. Tsunoda, and A. Sharma, "Predict gram-positive and gram-negative subcellular localization via incorporating evolutionary information and physicochemical features into Chou's general PseAAC," IEEE Transactions on NanoBioscience, vol. 14, no. 8, pp. 915–926, 2015.
  40. A. Sharma, K. K. Paliwal, A. Dehzangi, J. Lyons, S. Imoto, and S. Miyano, "A strategy to select suitable physicochemical attributes of amino acids for protein fold recognition," BMC Bioinformatics, vol. 14, no. 1, article 233, 2013.
  41. A. Dehzangi, S. Sohrabi, R. Heffernan et al., "Gram-positive and gram-negative subcellular localization using rotation forest and physicochemical-based features," BMC Bioinformatics, vol. 16, no. 4, article S1, 2015.
  42. A. Dehzangi, A. Sharma, J. Lyons, K. K. Paliwal, and A. Sattar, "A mixture of physicochemical and evolutionary-based feature extraction approaches for protein fold recognition," International Journal of Data Mining and Bioinformatics, vol. 11, no. 1, pp. 115–138, 2014.
  43. A. Dehzangi, K. Paliwal, J. Lyons, A. Sharma, and A. Sattar, "Enhancing protein fold prediction accuracy using evolutionary and structural features," in Proceedings of the IAPR International Conference on Pattern Recognition in Bioinformatics, pp. 196–207, Springer, Berlin, Germany, 2013.
  44. D. M. Powers, "Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation," 2011.
  45. B. Efron and G. Gong, "A leisurely look at the bootstrap, the jackknife, and cross-validation," The American Statistician, vol. 37, no. 1, pp. 36–48, 1983.
  46. F. Pedregosa, G. Varoquaux, A. Gramfort et al., "Scikit-learn: machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  47. J. D. Thompson, T. J. Gibson, D. G. Higgins et al., "Multiple sequence alignment using ClustalW and ClustalX," Current Protocols in Bioinformatics, pp. 2-3, 2002.
  48. H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
  49. Q. Zou, J. Zeng, L. Cao, and R. Ji, "A novel features ranking metric with application to scalable visual and bioinformatics data classification," Neurocomputing, vol. 173, part 2, pp. 346–354, 2016.