BioMed Research International

Volume 2017, Article ID 3267325, 4 pages

https://doi.org/10.1155/2017/3267325

## Predicting Presynaptic and Postsynaptic Neurotoxins by Developing Feature Selection Technique

^{1}Department of Pathophysiology, Southwest Medical University, Luzhou 646000, China

^{2}Department of Anesthesiology, The Affiliated Traditional Chinese Medical Hospital of Southwest Medical University, Luzhou 646000, China

Correspondence should be addressed to Hua Tang; tanghua771211@aliyun.com and Ping Zou; lyzouping@163.com

Received 17 November 2016; Accepted 18 December 2016; Published 12 February 2017

Academic Editor: Ren-Zhi Cao

Copyright © 2017 Hua Tang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Presynaptic and postsynaptic neurotoxins are proteins that act at the presynaptic and postsynaptic membranes, respectively. Correctly predicting presynaptic and postsynaptic neurotoxins will provide important clues for drug-target discovery and drug design. In this study, we developed a theoretical method to discriminate presynaptic neurotoxins from postsynaptic neurotoxins. A strict and objective benchmark dataset was constructed to train and test the proposed model. The dipeptide composition was used to formulate neurotoxin samples. Analysis of variance (ANOVA) was used to find the optimal feature subset that produces the maximum accuracy. In the jackknife cross-validation test, an overall accuracy of 94.9% was achieved. We believe that the proposed model will provide important information for the study of neurotoxins.

#### 1. Introduction

Neurotoxins typically act on ion channels to block or enhance synaptic transmission. According to their mechanism of action, neurotoxins can be classified as presynaptic or postsynaptic [1]. Presynaptic neurotoxins act at the presynaptic membrane [2]. They usually block neuromuscular transmission and inhibit neurotransmitter release through their specific enzymatic activities [3]. Postsynaptic neurotoxins bind to the postsynaptic membrane and acetylcholine receptors [4]. Thus, the study of presynaptic and postsynaptic neurotoxins will provide important clues for drug-target discovery and drug design.

The function and structure of neurotoxins can be accurately determined by biochemical experiments; however, such experiments are time-consuming and costly. The huge number of protein sequences generated in the postgenomic age provides an important opportunity to design computational methods for timely and precise prediction of protein function. Thus, it is important to develop machine learning approaches to predict presynaptic and postsynaptic neurotoxins. Recently, Yang and Li developed an increment-of-diversity-based method to identify presynaptic and postsynaptic neurotoxins [5]. Their benchmark dataset, including 78 presynaptic neurotoxins and 69 postsynaptic neurotoxins, was downloaded from the Animal Toxin Database (ATDB) [6]. The overall accuracy was 90.39% in jackknife cross-validation, which is far from satisfactory. Subsequently, Song proposed a bilayer support vector machine (SVM) to improve the prediction accuracy based on a new benchmark dataset [7]. Although the overall accuracy was dramatically improved, the sequence identity of the dataset was so high that the results were overestimated.

To overcome the shortcomings mentioned above, in this study we developed a new method based on a feature selection technique to predict presynaptic and postsynaptic neurotoxins. In the following, we introduce how to construct a new benchmark dataset, how to formulate neurotoxin samples from peptide sequences, and how to obtain the result produced by the best feature subset.

#### 2. Materials and Methods

##### 2.1. Benchmark Dataset Construction

A high-quality benchmark dataset is fundamental for building a reliable and accurate model. The Universal Protein Resource (UniProt) provides the scientific community with a single, centralized, authoritative resource for protein sequences and functional information [8]. Thus, we downloaded presynaptic and postsynaptic neurotoxins from UniProt. Ambiguous information reduces the quality of a benchmark dataset and makes the prediction model unreliable. Thus, we excluded protein sequences that contain ambiguous residues (such as "X," "B," and "Z") or that are fragments of other proteins. Highly similar sequences in a benchmark dataset lead to overestimated results. Thus, the CD-HIT program was used to remove highly similar sequences, with the sequence-identity cutoff set to 80% [9]. According to the above screening procedure, the final benchmark dataset included 256 neurotoxin samples, which can be formulated as

$$\mathbb{S} = \mathbb{S}_{\mathrm{Pre}} \cup \mathbb{S}_{\mathrm{Post}},$$

where the subset $\mathbb{S}_{\mathrm{Pre}}$ contains 91 presynaptic neurotoxins and $\mathbb{S}_{\mathrm{Post}}$ contains 165 postsynaptic neurotoxins.
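The screening step can be sketched in Python. The `screen` helper and its `(header, sequence)` input format are illustrative assumptions rather than part of any published tool, and removal of highly similar sequences would still be delegated to the external CD-HIT program:

```python
# Sketch of the sequence-screening step; the (header, sequence) input
# format and the fragment check are illustrative assumptions.
# Redundancy removal at 80% identity is left to the CD-HIT program.
AMBIGUOUS = set("XBZ")

def screen(records):
    """Keep only full-length sequences free of ambiguous residues.

    `records` is an iterable of (header, sequence) pairs; entries whose
    header marks them as fragments are discarded.
    """
    kept = []
    for header, seq in records:
        if "Fragment" in header:          # drop partial sequences
            continue
        if AMBIGUOUS & set(seq.upper()):  # drop X/B/Z residues
            continue
        kept.append((header, seq))
    return kept
```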

##### 2.2. The Dipeptide Composition

One of the most important steps in the prediction problem is to formulate neurotoxin sequences with an effective mathematical expression. Generally, we may formulate a neurotoxin by its entire residue sequence as follows:

$$P = R_1 R_2 R_3 \cdots R_L,$$

where $R_i$ denotes the $i$-th residue of the neurotoxin $P$ and the subscript $L$ is the number of residues of the neurotoxin $P$. We may use straightforward and intuitive tools, such as BLAST or FASTA, to find similar sequences. However, these tools are only suitable when the query sequence has highly similar counterparts in the search dataset. If there are no similar sequences in the training dataset, they do not work well.

Machine learning approaches can overcome this problem and correctly identify presynaptic and postsynaptic neurotoxins. To apply them, we must convert neurotoxin sequences into discrete vectors. The simplest representation of a neurotoxin is its residue composition, a 20-dimensional vector. However, sequence-order information would then be completely lost, limiting the prediction quality [10–13]. Thus, the dipeptide composition was used in this study. Accordingly, each neurotoxin sample in our benchmark dataset can be expressed as a 400-dimensional vector and formulated as

$$P = \left[ f_1, f_2, \ldots, f_{400} \right]^{\mathsf{T}},$$

where $f_i$ ($i = 1, 2, \ldots, 400$) is the occurrence frequency of the $i$-th dipeptide $d_i$, with

$$d_i \in \{\mathrm{AA}, \mathrm{AC}, \ldots, \mathrm{YY}\},$$

where A, C, …, Y are the single-letter codes of the 20 native amino acids. $f_i$ can be calculated by

$$f_i = \frac{n_i}{L - 1},$$

where $n_i$ denotes the number of occurrences of the $i$-th dipeptide in the neurotoxin $P$ and $L - 1$ is the total number of dipeptides in a sequence of length $L$.
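The dipeptide composition can be computed with a short Python sketch (the function name `dipeptide_composition` is our own illustration, not part of any published package):

```python
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"  # single-letter codes of the 20 native amino acids
DIPEPTIDES = ["".join(p) for p in product(AA, repeat=2)]  # AA, AC, ..., YY (400 total)

def dipeptide_composition(seq):
    """Return the 400-dimensional dipeptide-frequency vector of `seq`.

    f_i = n_i / (L - 1), where n_i counts the i-th dipeptide and L - 1
    is the number of overlapping dipeptides in a length-L sequence.
    Assumes `seq` contains only the 20 native amino acids (the benchmark
    dataset excludes ambiguous residues).
    """
    counts = {d: 0 for d in DIPEPTIDES}
    for a, b in zip(seq, seq[1:]):  # slide a window of width 2 over the sequence
        counts[a + b] += 1
    total = len(seq) - 1
    return [counts[d] / total for d in DIPEPTIDES]
```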

##### 2.3. Support Vector Machine

SVM is a very popular machine learning method and has been widely used in bioinformatics [7, 14–18]. The basic idea of SVM is to map the input vector into a high-dimensional Hilbert space and determine a separating hyperplane in that space. In this study, we used the LibSVM package 3.18 (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) to implement the SVM. Because it is more suitable for nonlinear classification, the radial basis function (RBF), defined as

$$K\!\left(\mathbf{x}_i, \mathbf{x}_j\right) = \exp\!\left(-\gamma \left\| \mathbf{x}_i - \mathbf{x}_j \right\|^2\right),$$

was used as the kernel function. In the SVM model construction, a grid search strategy with cross-validation was used to optimize the regularization parameter $C$ and the kernel parameter $\gamma$.
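A minimal sketch of the RBF-SVM grid search, assuming scikit-learn (whose `SVC` class wraps LIBSVM) instead of the LibSVM command-line tools used in the paper; the grid ranges and the toy data are illustrative assumptions, not the authors' settings:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Illustrative exponential grids for C and gamma (not the paper's values).
param_grid = {
    "C": [2.0 ** k for k in range(-5, 16, 2)],      # regularization parameter
    "gamma": [2.0 ** k for k in range(-15, 4, 2)],  # RBF kernel parameter
}

def fit_rbf_svm(X, y, folds=5):
    """Optimize C and gamma by cross-validated grid search and refit."""
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=folds)
    search.fit(X, y)
    return search.best_estimator_

# Toy data standing in for the 400-dimensional dipeptide vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = (X[:, 0] > 0).astype(int)
model = fit_rbf_svm(X, y)
```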

##### 2.4. Performance Evaluation

In this study, we used the jackknife cross-validation test to evaluate the prediction. In the jackknife test, each protein sample in the dataset is in turn singled out as an independent test sample, and all the model parameters are calculated from the remaining proteins, without including the one being identified. The performance of the proposed method was estimated by three indexes, sensitivity ($\mathrm{Sn}$), specificity ($\mathrm{Sp}$), and overall accuracy ($\mathrm{Acc}$), which can be expressed as

$$\mathrm{Sn} = 1 - \frac{N_{-}^{+}}{N^{+}}, \qquad \mathrm{Sp} = 1 - \frac{N_{+}^{-}}{N^{-}}, \qquad \mathrm{Acc} = 1 - \frac{N_{-}^{+} + N_{+}^{-}}{N^{+} + N^{-}},$$

where $N^{+}$ and $N^{-}$ are the total numbers of presynaptic and postsynaptic neurotoxins, respectively, $N_{-}^{+}$ is the number of presynaptic neurotoxins incorrectly predicted as postsynaptic neurotoxins, and $N_{+}^{-}$ is the number of postsynaptic neurotoxins incorrectly predicted as presynaptic neurotoxins.
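The jackknife test and the three indexes can be sketched as follows; the `fit_predict` argument is a hypothetical stand-in for any classifier, with label 1 for presynaptic and 0 for postsynaptic samples:

```python
def jackknife_metrics(X, y, fit_predict):
    """Leave-one-out (jackknife) evaluation.

    `y` holds 1 for presynaptic (+) and 0 for postsynaptic (-) samples;
    `fit_predict(X_train, y_train, x_test)` is a hypothetical helper that
    trains on the remaining samples and returns one predicted label.
    Returns (Sn, Sp, Acc) as defined in the text.
    """
    n_plus = sum(1 for t in y if t == 1)
    n_minus = len(y) - n_plus
    miss_plus = miss_minus = 0  # N^+_- and N^-_+
    for i in range(len(y)):
        X_tr = X[:i] + X[i + 1:]  # leave sample i out
        y_tr = y[:i] + y[i + 1:]
        pred = fit_predict(X_tr, y_tr, X[i])
        if y[i] == 1 and pred == 0:
            miss_plus += 1    # presynaptic predicted as postsynaptic
        elif y[i] == 0 and pred == 1:
            miss_minus += 1   # postsynaptic predicted as presynaptic
    sn = 1 - miss_plus / n_plus
    sp = 1 - miss_minus / n_minus
    acc = 1 - (miss_plus + miss_minus) / (n_plus + n_minus)
    return sn, sp, acc
```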

#### 3. Results and Discussion

Many published papers have demonstrated that optimized features can improve predictive accuracy [19–25]. For high-dimensional data, some features are noise or redundant information that contributes negatively to the prediction. Thus, it is very important to develop a feature selection technique to exclude such uninformative features. This study introduces a feature selection technique based on the principle of analysis of variance (ANOVA).

For the $i$-th feature, two quantities can be defined as

$$SS_B(i) = \sum_{g} m_g \left( \frac{1}{m_g} \sum_{j=1}^{m_g} f_g(i, j) - \frac{\sum_{g} \sum_{j=1}^{m_g} f_g(i, j)}{\sum_{g} m_g} \right)^2,$$

$$SS_W(i) = \sum_{g} \sum_{j=1}^{m_g} \left( f_g(i, j) - \frac{1}{m_g} \sum_{j'=1}^{m_g} f_g(i, j') \right)^2,$$

where $f_g(i, j)$ denotes the frequency of the $i$-th feature of the $j$-th sample in the $g$-th group ($g = \mathrm{Pre}$ or $\mathrm{Post}$) and $m_g$ denotes the number of samples in the $g$-th group. $SS_B(i)$ and $SS_W(i)$ are called the sum of squares between groups and the sum of squares within groups, respectively. If the samples within each group are close to their group means, $SS_W(i)$ will be small. If the sample means are close between the two groups, $SS_B(i)$ will be small. Then the sample variance between groups and the sample variance within groups can be given by

$$s_B^2(i) = \frac{SS_B(i)}{df_B}, \qquad s_W^2(i) = \frac{SS_W(i)}{df_W},$$

where $df_B = K - 1$ and $df_W = N - K$ are called degrees of freedom in statistics, with $K$ the number of groups and $N$ the total number of samples. In this study, $df_B = 2 - 1 = 1$ and $df_W = 256 - 2 = 254$, respectively.

According to statistical theory, the ratio between $s_B^2(i)$ and $s_W^2(i)$ obeys an $F$ sampling distribution with $(df_B, df_W)$ degrees of freedom under the null hypothesis. Thus, we used the $F$ ratio to measure the contribution of each feature, defined as follows:

$$F(i) = \frac{s_B^2(i)}{s_W^2(i)}.$$
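The F ratio can be computed per feature with a few lines of Python; the `f_score` helper is our own illustration of the standard one-way ANOVA computation:

```python
def f_score(groups):
    """One-way ANOVA F ratio for a single feature.

    `groups` is a list of per-group value lists (here: the feature's
    frequencies in the Pre and Post samples). Computes
    F = s_B^2 / s_W^2 with df_B = K - 1 and df_W = N - K.
    """
    n = sum(len(g) for g in groups)           # total number of samples N
    k = len(groups)                           # number of groups K
    grand_mean = sum(sum(g) for g in groups) / n
    # sum of squares between groups
    ss_b = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # sum of squares within groups
    ss_w = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_b / (k - 1)) / (ss_w / (n - k))
```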

$F(i)$ reveals how strongly the $i$-th feature is related to the group variable. Accordingly, the 400 dipeptide features were ranked according to their $F(i)$ values. Subsequently, the incremental feature selection (IFS) strategy was used to find the optimal feature subset. In the IFS procedure, we first examined the performance of the best feature, the one with the highest $F(i)$, using cross-validation. Then the feature with the second highest $F(i)$ was added to form a new feature subset, which was also input into the SVM, and the accuracy was calculated. This process was repeated until all 400 feature subsets had been examined. Plotting the number of features on the abscissa and the Acc on the ordinate gives the IFS curve shown in Figure 1. From the figure, we observe that, in the jackknife cross-validation, the maximum Acc of 94.9% is obtained with the top 190 features, which are regarded as the optimal feature subset.
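The IFS procedure can be sketched as follows; `evaluate` is a hypothetical helper that returns the cross-validation accuracy of a given feature subset (in the paper, a jackknife-tested SVM):

```python
def incremental_feature_selection(f_scores, evaluate):
    """IFS sketch: rank features by their F score, grow the subset one
    feature at a time, and keep the subset with the best accuracy.

    `f_scores[i]` is the F ratio of feature i; `evaluate(indices)` is a
    hypothetical helper returning cross-validation accuracy for the
    subset of features given by `indices`.
    """
    ranked = sorted(range(len(f_scores)), key=lambda i: f_scores[i], reverse=True)
    best_acc, best_subset = 0.0, []
    for m in range(1, len(ranked) + 1):
        subset = ranked[:m]          # top-m features by F score
        acc = evaluate(subset)
        if acc > best_acc:
            best_acc, best_subset = acc, subset
    return best_subset, best_acc
```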