Abstract

Knowledge of DNA-binding proteins helps to better understand the functions of proteins in cellular biological processes. Research on the prediction of DNA-binding proteins can also promote drug protein research and computer-aided drug design. In recent years, machine learning methods have usually been used to predict these proteins. Although current methods achieve good predictive performance, there is still room for improvement. In this study, the prediction of DNA-binding proteins is investigated from the perspective of evolutionary information and the support vector machine (SVM) method. A machine learning model for predicting DNA-binding proteins based on evolutionary features is put forward by following Chou's 5-step rule. The results show that good predictive performance is obtained on the benchmark dataset PDB1075 and the independent dataset PDB186, achieving accuracies of 86.05% and 75.30%, respectively. Thus, the proposed method is comparable to, and in some respects better than, existing methods.

1. Introduction

DNA-related life activities are an indispensable part of the life activities of biological cells; they include the detection of DNA damage, DNA replication, and gene transcription and regulation. These activities cannot occur without the assistance of specific proteins, and they are regulated by protein-DNA interactions. To realize this regulation, specific or nonspecific binding between proteins and the DNA chain is essential. Proteins that participate in the life activities of DNA and regulate them are known as DNA-binding proteins (DbPs) [1, 2], also called helix-destabilizing proteins. They are a kind of protein that can bind with DNA to form complexes. Because of their crucial role in biological activities, research on DbP recognition has developed.

With the rapid development of society, the demand for medical health keeps increasing. Thus, it is urgent to understand the structure and function of more proteins to explain more of the meaning of life and to promote the development of biomedicine and other fields. However, one difficulty in current bioinformatics research is how to predict proteins effectively from their sequence information. Although traditional recognition of protein structure and function via physical, chemical, and biological experiments (such as filter-binding assays and genetic analysis) [3] is effective, these methods are expensive and time-consuming.

Besides, the requirements on the experimental environment are very strict. Thus, identifying all DbPs via experimental methods is unrealistic. Given this problem, and to reduce time costs, many computation-based methods have been proposed. The methods for protein prediction fall into two categories: methods based on the sequence information and methods based on the structural information of proteins [4–6].

The performance of methods that use protein structural information is usually better, but structural information is hard to obtain, so such methods are partly hard to develop. In contrast, methods based on protein sequence information only need the sequence itself to identify DbPs, without complex structural information. Thus, they have developed well in the post-genomic era with its massive amount of sequence information.

Compared with traditional protein recognition methods, DNA-binding protein recognition based on sequence information is simpler and cheaper. It is a high-throughput protein prediction method. Therefore, more potential DbPs can be extracted from massive protein data by this method, and more precise biochemical methods can then be used to verify the true DbPs. This not only saves human, material, and financial resources but also makes better use of limited resources. Hence, recognition methods based on sequence information are significant for economic development and resource utilization. In addition, they can promote the recognition of other types of proteins and the prediction of nucleic acid sequences [7, 8], and they can further advance the development of bioinformatics.

At present, various sequence-based methods exist for DNA-binding protein prediction, but their performance can be further improved. For this improvement, protein representation is a key challenge that requires further research [9–11]. To address this problem, a model is proposed to predict DbPs based on evolutionary information and the support vector machine (SVM) method by using Chou's 5-step rule [8, 12–14]. Firstly, we process the datasets with PSI-BLAST [15]. To further improve prediction performance, we extract three evolutionary features via feature extraction methods: PsePSSM, PSSM-AB, and PSSM-DWT. We splice the PSSM features end-to-end and then input them into the prediction model. Next, the SVM classifier is used to make the prediction. Finally, experiments via the jackknife cross-validation test and the independent test are performed to evaluate the performance. The results show that good predictive performance can be achieved in the prediction of DbPs by the proposed method. Figure 1 shows the main research sketch of the paper.

2. Materials and Methods

The research for the prediction of DbPs can be divided into three stages: building a model for prediction, training and testing the model, and prediction and analysis. To begin with, three evolutionary features are determined and extracted from the processed datasets and then integrated into the machine learning model for prediction. Next, the model is trained and tested to verify its availability and reliability. In the end, the representation algorithm with evolutionary features is used to represent the protein sequence information, and the model is used to predict the proteins. Figure 2 shows the framework of the method.

2.1. Datasets

In this study, datasets PDB1075 [16] and PDB186 [17], which are widely used in the prediction of DbPs, are used as the basic data for the experiments. The protein sequences originate from the international protein database PDB (https://www.rcsb.org/). Liu et al. created the dataset PDB1075, and the dataset PDB186 was built by Lou et al. The details of the two datasets are shown in Table 1. In this study, dataset PDB1075 is used as the training set, and dataset PDB186 is used as the dataset for the independent test.

2.2. Evolutionary Features
2.2.1. PSSM

PSSM stands for "Position-Specific Scoring Matrix." It stores the evolutionary information of a protein sequence, and it is used in protein prediction to reflect this evolutionary information. For a protein sequence $P$, its PSSM can be formed by three iterations of PSI-BLAST [18] (the purpose of PSI-BLAST is to search for the optimal result through multiple iterations: the result of the previous search is used to form a PSSM, which is then used as the input of the next search until the best result is obtained; experiments show that the result is best after three iterations). The E-value is set to 0.001. Suppose the length of $P$ is $L$. The PSSM of the protein can then be expressed as a matrix of size $L \times 20$:
$$\mathrm{PSSM} = \begin{bmatrix} E_{1,1} & E_{1,2} & \cdots & E_{1,20} \\ E_{2,1} & E_{2,2} & \cdots & E_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ E_{L,1} & E_{L,2} & \cdots & E_{L,20} \end{bmatrix},$$
where the rows correspond to the positions of $P$, the columns correspond to the 20 amino acid types, and $E_{i,j}$ is the score that the $i$-th position of $P$ is mutated into residue type $j$ during the process of evolution. Generally, the higher the score is, the more frequent the mutation is.
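As a rough illustration (not necessarily the authors' exact pipeline), the following Python sketch shows how such a PSSM could be generated with the NCBI BLAST+ psiblast tool (three iterations, E-value 0.001) and read into an $L \times 20$ matrix; the database name and file paths are placeholders.

```python
# A minimal sketch, assuming NCBI BLAST+ and a local sequence database are installed;
# "swissprot" and the file paths below are placeholders, not values from the paper.
import subprocess
import numpy as np

def run_psiblast(fasta_path, db="swissprot", pssm_path="query.pssm"):
    """Run three PSI-BLAST iterations (E-value 0.001) and write an ASCII PSSM."""
    subprocess.run([
        "psiblast",
        "-query", fasta_path,
        "-db", db,
        "-num_iterations", "3",
        "-evalue", "0.001",
        "-out_ascii_pssm", pssm_path,
    ], check=True)
    return pssm_path

def parse_ascii_pssm(pssm_path):
    """Read the L x 20 log-odds part of an ASCII PSSM into a NumPy array."""
    rows = []
    with open(pssm_path) as f:
        for line in f:
            parts = line.split()
            # Data lines start with the residue index followed by the amino acid letter.
            if len(parts) >= 22 and parts[0].isdigit():
                rows.append([int(v) for v in parts[2:22]])  # first 20 columns: log-odds scores
    return np.array(rows)  # shape (L, 20)
```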

Besides, the following formula shows the representation of $E_{i,j}$:
$$E_{i,j} = \sum_{k=1}^{20} w_{i,k} \times M_{k,j}, \quad i = 1, \ldots, L,\ \ j = 1, \ldots, 20,$$
where $w_{i,k}$ is the frequency of the $k$-th amino acid type at position $i$, and $M_{k,j}$ refers to the mutation rate from the $k$-th amino acid to the $j$-th amino acid in the substitution matrix. The larger the value is, the more conservative the position is; otherwise, the opposite holds.

2.2.2. PsePSSM

The PsePSSM feature has usually been used for membrane protein prediction. It was inspired by Chou's pseudo amino acid composition (PseAAC) [19]. The PSSM matrix is widely used in protein description [20]. The original PSSM of a protein should be further normalized for later calculation.

The normalization is as follows:
$$E'_{i,j} = \frac{E_{i,j} - \frac{1}{20}\sum_{k=1}^{20} E_{i,k}}{\sqrt{\frac{1}{20}\sum_{k=1}^{20}\left(E_{i,k} - \frac{1}{20}\sum_{u=1}^{20} E_{i,u}\right)^{2}}},$$
where $E'_{i,j}$ is the normalized PSSM score, whose average over the 20 amino acids is 0, and $E_{i,j}$ is the original score. A positive score indicates that the corresponding homologous mutation occurs in the multiple alignment more frequently than expected by chance, and a negative score indicates the opposite.
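A minimal Python sketch of this row-wise standardization, assuming the raw PSSM has already been loaded as an $L \times 20$ NumPy array, might look like this:

```python
import numpy as np

def normalize_pssm(pssm):
    """Standardize each PSSM row over its 20 amino acid scores (zero mean, unit variance)."""
    pssm = pssm.astype(float)
    mean = pssm.mean(axis=1, keepdims=True)
    std = pssm.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # guard against constant rows
    return (pssm - mean) / std
```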

2.2.3. PSSM-DWT

DWT is the discrete wavelet transform, which reflects both frequency and location information; Nanni et al. first introduced it into protein representation [17, 21]. By treating the protein sequence as a particular picture expressed by a matrix, the matrix is decomposed into coefficients at different levels by the DWT.

Furthermore, the wavelet transform (WT) is the projection of a signal onto the wavelet function. The formulation is as follows:
$$T(a,b) = \frac{1}{\sqrt{a}} \int_{0}^{t} f(t)\, \psi\!\left(\frac{t-b}{a}\right) dt,$$
where $a$ denotes the scale variable, $b$ is the translation variable, and $\psi$ is the wavelet analysis function. $T(a,b)$ refers to the transform coefficient found at a specific wavelet scale and a specific position of the signal. An efficient DWT algorithm was proposed by Nanni et al. [17]; they assumed that the DWT is performed on a discrete signal $x[n]$. The coefficients are calculated as follows:
$$y_{\mathrm{low}}[n] = \sum_{k=1}^{L} x[k]\, g[2n-k], \qquad y_{\mathrm{high}}[n] = \sum_{k=1}^{L} x[k]\, h[2n-k],$$
where $L$ is the length of the discrete signal, and $g$ and $h$ denote the low-pass filter and the high-pass filter, respectively. $y_{\mathrm{low}}[n]$ is the approximation coefficient of the signal, while $y_{\mathrm{high}}[n]$ is the detail coefficient; the former contains the low-frequency components, and the latter contains the high-frequency components. The maximum, minimum, mean, and standard deviation of these coefficients are calculated with a 4-level DWT in this study. In addition, the discrete signals of the PSSM, which consist of 20 discrete signals (one per column), are analyzed over 4 levels of the discrete wavelet transform. Figure 3 shows the structure of the 4-level DWT.
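For illustration, a hedged sketch of the PSSM-DWT descriptor using the PyWavelets library is shown below: each of the 20 PSSM columns is decomposed with a 4-level DWT, and the maximum, minimum, mean, and standard deviation of every sub-band are collected. The choice of wavelet family ("db1") is an assumption, since the text does not specify it.

```python
import numpy as np
import pywt

def pssm_dwt_features(pssm, wavelet="db1", level=4):
    """4-level DWT statistics per PSSM column; wavelet family is an assumption."""
    feats = []
    for col in range(pssm.shape[1]):  # 20 discrete signals, one per amino acid column
        coeffs = pywt.wavedec(pssm[:, col].astype(float), wavelet, level=level)
        for band in coeffs:           # one approximation band + 4 detail bands
            feats.extend([band.max(), band.min(), band.mean(), band.std()])
    return np.array(feats)
```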

2.2.4. PSSM-AB

The full name of the AB method is the Average Block method [22], which was first presented by Huang et al. [23]. Because the number of amino acids in each protein differs, the size of the feature vector varies when a PSSM is transformed into a feature vector directly. To address this problem, features are averaged over local regions of the PSSM; this is referred to as the AB method. Every block contains 5% of the protein sequence. Here, the AB method is applied to the PSSM regardless of the length of the protein sequence. Each matrix is divided into 20 blocks by rows, and the size of every block is $(L/20) \times 20$. Therefore, the protein sequence is divided into 20 blocks, and every block is represented by 20 features that originate from the 20 columns of the PSSM. Its expression is as follows:
$$AB_{k} = \frac{1}{N} \sum_{i=1}^{N} F_{k}(i), \quad k = 1, 2, \ldots, 20,$$
where $N$ is the size of a block and $F_{k}(i)$ is a vector of size $1 \times 20$ extracted from position $i$ of the $k$-th block of the PSSM.
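A short Python sketch of the Average Block idea, under the assumption that blocks are formed by splitting the PSSM rows into 20 roughly equal segments, is given below; it yields a 20 x 20 = 400-dimensional vector regardless of sequence length.

```python
import numpy as np

def pssm_ab_features(pssm, n_blocks=20):
    """Average each of 20 row blocks (about 5% of the sequence each) column-wise."""
    L = pssm.shape[0]
    edges = np.linspace(0, L, n_blocks + 1, dtype=int)  # block boundaries along the rows
    blocks = []
    for k in range(n_blocks):
        block = pssm[edges[k]:edges[k + 1], :]
        blocks.append(block.mean(axis=0) if len(block) else np.zeros(pssm.shape[1]))
    return np.concatenate(blocks)  # shape (20 * 20,) = (400,)
```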

2.3. Classification Algorithm

The support vector machine (SVM) [24] is a classification and regression paradigm; it is a supervised machine learning method based on statistical theory that follows structural risk minimization. In pattern recognition, SVM is usually used to solve classification problems. When using SVM, samples are marked as positive or negative and then projected into a high-dimensional feature space via kernels. The hyperplane in the feature space is optimized so that the margin between positive and negative samples is maximized. In this study, we use LIBSVM to build a model with a radial basis function (RBF) kernel. To obtain the optimum parameters, the grid search method is used [25].

Three kernel functions are commonly used in the construction of SVMs: the polynomial kernel, the radial basis function (RBF), and the sigmoid kernel. RBF is the most commonly used kernel function in related studies. In this study, the RBF kernel handles the nonlinear transformation well, and because it has few parameters, it greatly reduces the complexity and difficulty of the calculation. The RBF kernel expression is as follows:
$$K(x_{i}, x_{j}) = \exp\left(-\gamma \left\| x_{i} - x_{j} \right\|^{2}\right),$$
where $x$ is the feature vector and $\gamma$ denotes the width of the RBF kernel.

Suppose a training dataset of instance-label pairs is $\{(x_{i}, y_{i})\}_{i=1}^{N}$ with $x_{i} \in \mathbb{R}^{n}$ and $y_{i} \in \{+1, -1\}$. The decision function is
$$f(x) = \mathrm{sgn}\left(\sum_{i=1}^{N} \alpha_{i} y_{i} K(x_{i}, x) + b\right).$$

The coefficients $\alpha_{i}$ can be obtained by solving the following quadratic programming problem:
$$\max_{\alpha}\ \sum_{i=1}^{N} \alpha_{i} - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_{i} \alpha_{j} y_{i} y_{j} K(x_{i}, x_{j}) \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_{i} y_{i} = 0,\ \ 0 \le \alpha_{i} \le C,$$
where $x_{i}$ is called a support vector only when $\alpha_{i} > 0$, and $C$ is the regularization parameter that balances the margin and the misclassification error.
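As an illustrative sketch (using scikit-learn's LIBSVM-based SVC rather than the LIBSVM command-line tools), the RBF-kernel classifier could be trained on the concatenated feature vectors as follows; the default C and gamma values are taken from the optimal pair reported in Section 3.2.

```python
from sklearn.svm import SVC

def train_rbf_svm(X, y, C=2.0, gamma=0.0313):
    """X: (n_samples, n_features) concatenated PSSM features; y: +1/-1 labels."""
    clf = SVC(kernel="rbf", C=C, gamma=gamma, probability=True)
    clf.fit(X, y)
    return clf
```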

3. Experiment Results

The steps of the experiments are as follows:
(1) Firstly, a model for the prediction of DbPs based on evolutionary information and SVM is built, and the benchmark dataset PDB1075 and the independent dataset PDB186 are selected as experimental data.
(2) Secondly, the evolutionary features used in the experiments are determined. In order to further improve the prediction performance of the model, we use a variety of feature extraction methods to extract PSSM features and then integrate them into the machine learning model. The results show that the model with integrated features has better prediction performance. Besides, to better evaluate the performance of this model, appropriate evaluation indicators need to be selected.
(3) Thirdly, the performance of combinations of different features is compared on the PDB1075 dataset via the jackknife test, commonly called the LOOCV test. Then, the performance of several different methods is compared on datasets PDB1075 and PDB186 successively; finally, the prediction performance of the model is analyzed and compared to demonstrate its validity, advantages, and disadvantages.

3.1. Measurements

In the experiment, the jackknife test is used to analyze the quality of the predictor. The jackknife test has a high utilization rate of samples, is suitable for small-sample datasets, and gives deterministic experimental results. Compared with the hold-out method, there are no random factors in the experimental process, so the results are more convincing. Thus, the jackknife test is widely used when testing the performance of a predictor. In the test, almost all data in the benchmark dataset are used for training: each sample is used as the test set in turn, and the remaining samples are used for training [26, 27].
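A minimal sketch of this jackknife (leave-one-out) evaluation with scikit-learn, assuming the feature matrix X and labels y for PDB1075 have already been constructed, could look like this:

```python
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.svm import SVC

def jackknife_predict(X, y, C=2.0, gamma=0.0313):
    """Each sample is held out once as the test set; the rest are used for training."""
    clf = SVC(kernel="rbf", C=C, gamma=gamma)
    return cross_val_predict(clf, X, y, cv=LeaveOneOut())
```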

To better evaluate the performance of this method, accuracy (ACC), Matthews Correlation Coefficient (MCC), Sensitivity (SN) and Specificity (SP) are used for the evaluation of indicators. In the study of biological sequence classification, these indicators are widely used [7, 28].

The definitions are as follows:
$$\mathrm{SN} = \frac{TP}{TP + FN}, \qquad \mathrm{SP} = \frac{TN}{TN + FP},$$
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$
where TP is the number of positive samples predicted correctly and FP is the number of negative samples predicted as positive; TN is the number of negative samples predicted correctly and FN is the number of positive samples predicted as negative. SN and SP denote the percentage of correctly predicted samples among the positive and negative samples, respectively. ACC represents the proportion of all samples predicted correctly. MCC reflects the overall quality of the prediction model, and its value ranges over [-1, 1]. The larger the values of these evaluation indicators are, the better the performance of the prediction model is.
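These four indicators can be computed directly from the confusion counts; a simple Python implementation (assuming labels are coded as +1 for DbPs and -1 for non-DbPs) is shown below.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Compute ACC, MCC, SN, and SP from +1/-1 true and predicted labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "MCC": mcc, "SN": sn, "SP": sp}
```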

3.2. Parameter Optimization

To obtain the highest prediction accuracy, two parameters need to be optimized when building the support vector machine with the RBF kernel: $C$ (the penalty parameter) and $g$ (gamma, the RBF kernel parameter). Since their values are unknown before training, the two parameters must be selected and optimized, and different $(C, g)$ pairs yield different prediction accuracies. To achieve the optimal parameters, the grid search method is used to adjust and optimize $C$ and $g$: various possible $(C, g)$ pairs are tried, and the performance of each pair is tested via 5-fold cross-validation to find the pair with the best accuracy. In this way, global optimization can be achieved, and the grid search is highly parallel because each pair is evaluated independently. The search range of $\log_2 C$ and $\log_2 g$ is [-5, 5] with a step length of 1, the kernel function is the RBF, and probability estimates of the training model are enabled. Finally, the optimal parameters $C$ and $g$ are 2 and 0.0313, respectively, achieving accuracies of 86.05% and 75.30% after training and testing on the benchmark dataset PDB1075 and the independent dataset PDB186, respectively.
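A hedged sketch of this grid search with scikit-learn is given below; the exponent range [-5, 5] for both parameters and the 5-fold cross-validation follow the description above, while the accuracy scoring and parallel settings are implementation choices.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def optimize_svm(X, y):
    """Grid search over C and gamma with exponents in [-5, 5], scored by 5-fold CV accuracy."""
    grid = {
        "C": [2.0 ** p for p in range(-5, 6)],
        "gamma": [2.0 ** p for p in range(-5, 6)],
    }
    search = GridSearchCV(SVC(kernel="rbf", probability=True),
                          grid, cv=5, scoring="accuracy", n_jobs=-1)
    search.fit(X, y)
    return search.best_params_, search.best_score_
```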

3.3. Experimental Results and Analysis
3.3.1. The Performance of Different Features on Benchmark Dataset PDB1075

The PSSM of a sequence is the main information used to predict the binding sites of proteins. The conservation or variability of the sequence depends on many factors in the process of evolution, such as maintaining the 3D structure and stability, reducing amyloid aggregation, and conserving function. These factors affect the binding of proteins with other proteins, nucleotides, lipids, etc. Therefore, the PSSM (which encodes evolutionary information) may pick up important signals/features for ligand binding, which supports the validity of methods based on PSSM evolutionary information.

In this study, we first determine the evolutionary features to be PsePSSM, PSSM-AB, and PSSM-DWT, then combine the features and test them with the SVM prediction model on the benchmark dataset PDB1075 via the jackknife test. In the end, the best combination of features is identified, and it also achieves the highest prediction result.

Table 2 provides the size, the computing time, and the performance of different combinations of the features. It can be seen that the test performance improves obviously when features are combined, and the best performance, with ACC (86.05%), MCC (0.7208), SN (85.14%), SP (86.91%), and AUC (0.9324), is obtained when all the features are combined.

To evaluate the prediction performance effectively, the ROC curve and AUC are used for the analysis of classification in this study. The ROC curve (Receiver Operating Characteristic curve) together with the AUC (Area Under the Curve) forms the AUROC measure. In general, the curve lies above the diagonal line, and the AUC value ranges over [0.5, 1]. The closer the curve is to the upper-left corner, the better the performance of the classifier is. The AUC refers to the area enclosed by the ROC curve and the x-axis; the larger the AUC value is, the better the classifier performs. Figure 4 shows the comparison of seven combinations of different features on dataset PDB1075.

From Figure 4, two conclusions can be drawn: (1) When the three features are combined, the ROC curve leans most toward the upper-left corner; at that point, the largest AUC value is obtained and the performance is the best. (2) The performance of the combination of PsePSSM and PSSM-AB is only slightly lower than that of the combination of PsePSSM, PSSM-AB, and PSSM-DWT. Although adding another PSSM-based feature improves the predicted performance of the model to a certain extent, the improvement is not obvious: the features are redundant to a certain degree, and features based on PSSM information have an upper performance limit, so the improvement remains limited even when more PSSM-based features are added.

3.3.2. The Performance of Different Methods Compared on Benchmark Dataset PDB1075

In this section, the performance of the method described in this study is evaluated on the benchmark dataset PDB1075 and compared with other methods [29–34], such as IDNA-Prot|dis [16], DNA binder [29, 30], and IDNA-Prot [31]. Table 3 provides the performance of the compared methods on dataset PDB1075 via jackknife test evaluation.

As shown in Table 3, the performance of our method is clearly higher than that of the other methods. The SVM-based method achieves the highest ACC (86.05%), MCC (0.72), SN (85.14%), and SP (86.91%). The ACC, MCC, SN, and SP values are improved by 3.63%, 0.07, 1.33%, and 5.82%, respectively. This proves the superiority and validity of the SVM-based method for identifying DbPs.

The SVM algorithm selected in the experiment is based on small-sample statistical theory. Compared with other methods, it can obtain better results on small-sample datasets, and it has excellent generalization ability. Because the traditional process from induction to deduction is avoided, the classification problem is simplified effectively.

Besides, the final decision function of the SVM algorithm depends on only a small number of support vectors. The number of support vectors determines the computational complexity, which is independent of the dimension of the whole sample space; this avoids the "curse of dimensionality."

3.3.3. The Performance of Different Methods Compared on Independent Dataset PDB186

In the independent test, dataset PDB1075 is used for training and dataset PDB186 is used for testing. Table 4 provides the performance of the compared methods on the independent dataset PDB186 for the purpose of analyzing robustness. The SVM-based method achieves an ACC of 75.3%, an MCC of 0.560, an SN of 96.8%, and an SP of 53.8%. With a certain degree of credibility, the SVM-based method performs better than most of the existing methods compared in this study. Combined with the previous tests, it can be concluded that the method can identify DbPs effectively and accurately.

4. Conclusion

In this study, a model for predicting DbPs based on evolutionary information and the support vector machine method, following Chou's 5-step rule, is proposed. Firstly, the datasets are processed by PSI-BLAST, and then three evolutionary features used for the experiments are extracted by feature extraction algorithms. To integrate them, we splice the PSSM features end-to-end. Next, they are input into the machine learning model to predict DbPs. Finally, the validity and reliability of the SVM-based method are verified by experiments.

In this model, the Pse and AB methods, as well as the DWT method that is seldom used in bioinformatics, are applied to make the model achieve better performance on datasets PDB1075 and PDB186. In the jackknife test, the prediction performance of the method is evidently better than that of the other methods; in the independent test, the performance is better than that of most methods. The experimental results demonstrate that the proposed prediction model and method are effective and rational and can predict DbPs effectively.

In future work, the feature representation and classification algorithm ought to be refined to improve the prediction performance. For the former, we are going to combine some other biologically related features; for the latter, we will use deep learning and other technologies to optimize the prediction performance.

Data Availability

The related materials are available at http://eie.usts.edu.cn/prj/PSSMEI/index.html.

Conflicts of Interest

The authors declare that they have no conflict of interest.

Acknowledgments

The work is supported by a grant from the National Natural Science Foundation of China (Nos. 61772357, 61902272, 61672371, 61876217, 61902271, and 61750110519) and the Suzhou Science and Technology Project (SYG201704, SNG201610, and SZS201609).