Abstract

Owing to the abuse of antibiotics, drug resistance of pathogenic bacteria becomes more and more serious. Therefore, it is interesting to develop a more reasonable way to solve this issue. Because they can destroy the bacterial cell structure and then kill the infectious bacterium, the bacterial cell wall lyases are suitable candidates of antibacteria sources. Thus, it is urgent to develop an accurate and efficient computational method to predict the lyases. Based on the consideration, in this paper, a set of objective and rigorous data was collected by searching through the Universal Protein Resource (the UniProt database), whereafter a feature selection technique based on the analysis of variance (ANOVA) was used to acquire optimal feature subset. Finally, the support vector machine (SVM) was used to perform prediction. The jackknife cross-validated results showed that the optimal average accuracy of 84.82% was achieved with the sensitivity of 76.47% and the specificity of 93.16%. For the convenience of other scholars, we built a free online server called Lypred. We believe that Lypred will become a practical tool for the research of cell wall lyases and development of antimicrobial agents.

1. Introduction

Bacteria are widely distributed on the earth, a significant proportion of which can cause disease. The antibiotic can efficiently treat infectious diseases caused by pathogens. However, antibiotics abuse may cause bacterial drug resistance. Thus, there is an ever-increasing need to find new ways to address this important issue [1, 2]. In the search for more effective therapeutic strategies, great effort has been placed on the study and development of lyases, which benefits from high potency activity toward drug-resistant strains and a low inherent susceptibility to emergence of new resistance phenotypes [37].

In 1896, the British bacteriologist Hankin found that the bacteriophage has antibacterial activity [3]. Subsequently, in 1921, Brunoghe and Maisin used bacteriophage to treat staphylococcal skin disease in France, which was the first reported application of bacteriophage to treat infectious diseases [8]. Maxted [9], Krause [10], and Fischetti et al. [11] found that the lysates of Group C streptococci infected with C1 bacteriophage contain an enzyme which has the ability to lyse streptococci and their isolated cell walls. The enzyme is called endolysin which is encoded by bacteriophage gene. It can cause bacteria death by degrading cell wall. It has been reported that 10 ng endolysins can lead to 107 bacteria’s lysis within 30 seconds [4, 12].

Autolysins are another kind of lyases that are functionally similar to endolysins except they are bacteria-encoded enzymes [13]. It has been reported that autolysins play important roles in several fundamental biological phenomena, such as cell wall enlargement, genetic transformation, flagella extrusion, cell division, and lysis induced by fl-lactam antibiotics, as well as in the “suicidal tendencies” of pneumococci [1416].

Due to their special biological activity, lyases have been applied in antibacteria drug development. Thus, it is necessary to perform intensive research on lyases to understand the antibacterial mechanism. Although wet experiments are an objective approach for accurately recognizing the lyases, they are often time-consuming and costly. Due to the convenience and high efficiency, computational methods have attracted more and more attention. Many algorithms such as common support vector machine (SVM) [1719], structured SVM [20], artificial neural network (ANN) [21], Random Forest (RF) [22], -nearest neighbor (KNN) [2325], Bayesian classifier [26, 27], Mahalanobis discriminant [28, 29], LibD3C [30], genetic algorithm [31], imbalanced classifier [32], learning to rank [33], and ensemble learning [34, 35] have been developed for protein function prediction. Various sequence features descriptors such as amino acid composition [36, 37], pseudo amino acid composition (PseAAC) [38], physicochemical properties [39], secondary structure features [40], and N-peptide composition [41] were proposed to represent protein sequences [42].

To deal with the problem about lyases prediction, recently, a method was developed to identify cell wall enzymes by using PseAAC and Fisher discriminant [43]. A maximum overall accuracy of 80.4% was obtained with the sensitivity of 66.7% and the specificity of 88.6% [43]. However, further work is needed due to the following reasons. (i) The prediction quality can be further improved. (ii) No web server for the prediction method in [43] was provided, and hence its usage is quite limited, especially for the majority of experimental scientists.

The present study was devoted to development of a new predictor for identifying lyases. For this purpose, an objective and strict benchmark dataset was constructed for training and testing the proposed model in which protein sequences were formulated by using an improved PseAAC. For the convenience of other scholars, a free online server called Lypred (at http://lin.uestc.edu.cn/server/Lypred/) was established.

2. Material and Method

2.1. Benchmark Dataset

A high quality dataset is the key to building a robust and accurate predictor. The lyases in bacteria or bacteriophage were regarded as positive samples which were derived from the UniProt [44]. Negative samples, namely, the nonlyases, were also derived from bacteriophage and downloaded from the UniProt. In order to guarantee the reliability of the benchmark dataset, we optimized the data according to the following standards: firstly, the sequences whose protein was with annotations of “Inferred from homology” or “Predicted” were excluded; secondly, we removed the sequences which are the fragments of other proteins; thirdly, the protein sequences containing unknown residues, such as “,” “,” “,” “,” “,” and “,” were eliminated; fourthly, to avoid overestimation of prediction model that resulted from the high sequence identity, the CD-HIT program [45] was adopted to eliminate redundant sequence by setting the cutoff of sequence identity to 40%. As a result, a total of 68 lyases and 307 nonlyases were obtained to form the final benchmark dataset.

2.2. Features Extraction

A sequence can be represented by two different forms: one is the sequential form and the other is the discrete form [46]. The most common and straightforward way to characterize a protein is to use all the residues in its sequence written as follows:where , , and are the 1st, 2nd, and th amino acid residue of protein , respectively. Based on the information, a query protein can be predicted by the BLAST or FASTA program. The results are always good for the query sequence which has high similar sequences in benchmark dataset; however, it fails to work when the similar sequences for the query sequence are not found in the training dataset [47]. Therefore, the similarity-based method is not suitable for the case that no homologue was found in the benchmark dataset. The discrete form can overcome the shortcoming and is easy to be treated in statistical prediction. Thus, it has been widely used in protein and DNA formulation [48, 49]. The PseAAC is a typical discrete form that has been widely used for protein function prediction [46, 50, 51].

It is well known that the polypeptide chains fold to tertiary structures based on the physicochemical properties of residues. Thus, it is not enough to analyze the residue compositions of protein molecules. Hence, we proposed to represent protein samples by using an improved PseAAC which includes not only -gap dipeptide composition, but also correlation of physicochemical property between two residues.

According to the concept of PseAAC, a protein with the length of can be formulated in a dimension space as given bywherewhere denotes the normalized occurrence frequency of the th kind of -gap dipeptide in protein formulated aswhere () denotes the number of the th -gap dipeptide in .

in (3) is the -tier sequence correlation factor calculated by the following formulas:

The correlation of physicochemical property between two residues is given bywhere denotes the th physicochemical value of amino acid residue . The value is obtained bywhere is the th physicochemical original value of amino acid .

Thus, each protein sample can be expressed by kinds of features according to (2)–(7).

2.3. Feature Selection

Some features are noise or redundant information which will reduce the predictive performance of classification models. Thus, it is very important to develop a method to evaluate the contribution of every feature to the classification. Here, we used ANOVA [52] to rank features defined aswhere represents the -score of the th feature type, is the feature value of the th feature type of the th sample in the th protein type, and is the number of samples in the th protein type. It is obvious that the larger the value, the better the discriminative capability the th feature has.

In order to eliminate the redundant features, we firstly ranked all features according to their -score from high to low. The first feature subset only contained the feature with the largest -score; then, a new feature subset was generated when the feature with the second largest -score was added. The process was repeated until all features were added. The SVM was used to evaluate the performance for each feature subset. The feature subset with the best performance is deemed the optimal feature subset which does not contain redundant features.

2.4. Support Vector Machine

The SVM is a linear-classifier-based supervised machine learning method, which has been successfully used in many bioinformatics fields [4851, 5357]. To attain the goal of classification, SVM utilizes the kernel function to deal with the nonlinear transformation, and thus linear inseparable can be converted to a linear problem in high-dimension Hilbert space. In this work, the software LIBSVM [58] was used to execute SVM.

2.5. Performance Standard

To provide a more intuitive and easier-to-understand method to evaluate the prediction performance, we used the following criteria: the sensitivity (), the specificity (), Mathew’s correlation coefficient (), the overall accuracy (), and the average accuracy (), which were defined aswhere is the number of lyases that were correctly predicted, denotes the number of lyases that were predicted as the nonlyases, is the number of nonlyases that were correctly predicted, and denotes the number of nonlyases that were predicted as the lyases.

In addition, we also chose the receiver operating characteristic curve (ROC curve) to measure the performance of the proposed model. ROC curve is a kind of comprehensive index that is drawn by using as the abscissa and as the ordinate. Thus, it reveals the continuous variable of and . Generally, we only need to calculate the area under the ROC curve (). The greater the is, the better the discriminate capability the prediction model has is.

3. Results and Discussion

3.1. Forecasting Accuracy

In this work, 9 kinds of physicochemical properties were selected in improved PseAAC [47]. The nine physicochemical properties are hydrophobicity, hydrophilicity, rigidity, flexibility, irreplaceability, side chain mass, pI at 25°C, pK of the α-COOH group, and pK of the α-N group [47], respectively. The original values of the physicochemical properties for 20 amino acids were all listed in Table 1. According to (2)–(7), each protein sample can be formulated by a () dimension vector including   -gap dipeptide compositions and correlation factors based on physicochemical properties between two residues. From (3)–(5), the prediction performance of our method was influenced by two parameters, namely, and , where describes the local sequence-order effect and represents the global sequence-order effect. The current study searched for the optimal values for the two parameters according to the following standard:

In cross-validation test, n-fold cross-validation, jackknife cross-validation, and independent dataset test are often used for measuring the performance of prediction model. Although the jackknife cross-validation is deemed the most objective because it can always yield a unique result for benchmark dataset given [59, 60] and it has been more and more widely used, it also has obvious drawbacks, such as the large calculation and being time-consuming. Hence, the 5-fold cross-validation was adopted in this work for searching the optimal parameters and the optimal feature subset. Once the optimal feature subset was determined, we used jackknife cross-validation for verification ulteriorly.

Based on (10), a total of 10 × 10 = 100 groups of parameters () were investigated. For each parameter group (), there are feature subsets. Then, we used feature selection technique defined in (8) to find out the best one in each parameter group. Thus, we obtained the 100 highest s for 100 groups of parameters (). To provide an overall and intuitive analysis, the best s were drawn into a heat map, where the column and row of the heat map represent the parameters and , respectively. Each element in the heat map represents one of the 100 groups of parameters () and was colorized according to its highest overall accuracy in feature selection process. From Figure 1, we noticed that several elements are red indicating the maximum overall accuracy of 91.73% when equals 0 or 4 and equals 7, 8, 9, and 10. Generally, a model with a small number of features can reduce the risk of overfitting. After checking the feature selected results, we found that when using feature selection technique to optimize parameter group ( = 4 and = 7), the optimal feature subset contains 63 features, which is less than the optimal feature subset in other groups. Thus, the final model was established based on the 63 features from parameter group ( and ).

Because there is imbalance in our benchmark dataset, the average accuracy and ROC curve were employed to evaluate the model. Thus, we set a series of different classification thresholds to seek the maximum of average accuracy. The maximum and corresponding , , , and were listed in Table 2. The ROC curve can demonstrate the predictive capability of the proposed method across the entire range of SVM decision values. Thus, we plotted the ROC curve in Figure 2. It shows that is 0.926, demonstrating that our model has capability to predict cell wall lyases.

To investigate whether other algorithms have the same or higher discriminate capability in the same feature space, the performances of Random Forest, Naïve Bayes, and LibD3C were examined by using jackknife cross-validation. Random Forest and Naïve Bayes were executed by using free package WEKA [61]. The LibD3C, a new selective ensemble algorithm, is a hybrid model of ensemble pruning that is based on -means clustering and the framework of dynamic selection and circulating in combination with a sequential search method [30]. We used default parameters in LibD3C to perform classification.

The jackknife cross-validated results were also recorded in Table 2 for clear comparison. Note that the result for each algorithm in Table 2 was calculated with the maximum . As can be seen from the table, although ’s of Random Forest and Naïve Bayes are no lower than SVM, other indicators (, , , , and ) of SVM are the best.

3.2. Online-Server Guide

A user-friendly online server called Lypred was established. A simple guide about the server was given below in order to further make it easier for the users.

Lypred has five pages. Users can browse the server at http://lin.uestc.edu.cn/server/Lypred/ and see the home page on the screen as shown in Figure 3. The Read Me page provides a brief introduction about Lypred and the caveat when being used. The Data page shows a brief description about the benchmark dataset and the optimal feature subset used in this work and provides links for downloading. The relevant paper about the detailed development and algorithm of Lypred can be seen by clicking the Citation button. Example sequences in FASTA format can be found by clicking the Example button right above the input box.

Users can not only type or copy/paste the query protein sequences into the input box, but also upload FASTA/txt file containing the query protein sequences at the center of the home page of Lypred. Note that Lypred also has some constraints so as to guarantee the reliability of the results: firstly, protein sequences must be in FASTA format consisting of a single initial line beginning with a greater-than symbol (“”) in the first column, followed by lines of sequence data, and the sequence is deemed to end if there is another line starting with “”; secondly, the query protein sequence should only contain 20 kinds of amino acids; thirdly, the length of a query protein sequence should be no less than eight.

4. Conclusions

With growing drug resistance of pathogenic bacteria, great effort has been placed on the study and development of lyases. Effective identification of lyases will provide convenience for development of new antimicrobials. In this work, we used an improved PseAAC including -gap dipeptide compositions and correlation factors of the physicochemical properties to extract the characteristics of protein sequences. A feature selection technique based on ANOVA was used to optimize features. The results of of 84.82% and of 0.926 make us believe that Lypred will become a powerful and useful tool for the experimental study of bacterial cell wall lyase.

Competing Interests

The authors declare that there are no competing interests regarding the publication of this paper.

Acknowledgments

This work was supported by the Applied Basic Research Program of Sichuan Province (nos. 2015JY0100 and LZ-LY-45), the Scientific Research Foundation of the Education Department of Sichuan Province (11ZB122), the Nature Scientific Foundation of Hebei Province (no. C2013209105), the Fundamental Research Funds for the Central Universities of China (nos. ZYGX2015J144 and ZYGX2015Z006), and the Program for the Top Young Innovative Talents of Higher Learning Institutions of Hebei Province (no. BJ2014028).