Abstract

Epitopes are antigenic determinants that are useful because they induce B-cell antibody production and stimulate T-cell activation. Bioinformatics can enable rapid, efficient prediction of potential epitopes. Here, we designed a novel B-cell linear epitope prediction system called LEPS (Linear Epitope Prediction by Propensities and Support Vector Machine) that combines physico-chemical propensity identification with support vector machine (SVM) classification. We tested LEPS on four datasets: AntiJen, HIV, a newly generated PC dataset, and AHP, a combination of the other three datasets. Peptides with globally or locally high physico-chemical propensities were first identified as primitive linear epitope (LE) candidates. Then, candidates were classified with the SVM based on the unique features of amino acid segments. This reduced the number of predicted epitopes and enhanced the positive predictive value (PPV). Compared to four other well-known LE prediction systems, LEPS achieved the highest accuracy (72.52%), specificity (84.22%), PPV (32.07%), and Matthews' correlation coefficient (10.36%).

1. Introduction

Epitopes, also called antigenic determinants, are clusters of amino acid segments located on the surfaces of an antigen. Epitopes can elicit the immune response and are recognized by specific antibodies [1]. Basically, B-cell epitopes are categorized into two types: linear and conformational. Linear epitopes (LEs) are composed of contiguous amino acid residues within a continuous stretch of a primary protein sequence. Conformational epitopes (CEs) consist of amino acids that are dispersed among discontinuous regions but become aggregated on the protein surface [2, 3]. In general, over 90% of B-cell epitopes are discontinuous [4, 5]; thus, CEs play critical roles in biological and biomedical applications, including the prevention and neutralization of pathogen infections, and the design of therapeutic drugs. However, the prediction and identification of CEs within a protein depend on resolved three-dimensional structural information. One major, generally accepted concept is that conformational epitopes cannot be properly formed without binding to a corresponding antibody [6]. Therefore, antigen-antibody cocrystallographic information is a major concern in CE prediction. On the other hand, because CEs are discontinuous epitopes, it is difficult to design a peptide that forms the same conformation as the predicted CE. Thus, CEs that are predicted by computational analysis may not be verifiable in biochemical experiments, except with the cocrystallographic approach. Although B-cell LEs occupy a small part of the entire epitope group, they are important in biochemistry [7], virology [8], immunology [9], and vaccine research [10]. Therefore, research and development of accurate computational approaches for LE prediction remains a critical challenge in bioinformatics and computational biology [6]. 
Most published B-cell LE predictors have been based on the characteristics of amino acids, like hydrophobicity, surface accessibility, mobility, protrusion area, physico-chemical properties, antigenicity, and pocket characteristics [1, 3, 11–16]. For example, BcePred [16], BEPITOPE [17], PEOPLE [11], VaxiJen [18], and LEP [12] are bioinformatics tools that use various mathematical approaches to predict LEs according to the physico-chemical propensities of amino acids. Nevertheless, in 2005, Blythe and Flower led a group that evaluated the physico-chemical propensities of amino acids to predict LEs in proteins; they reported that even the best physico-chemical propensity scales available performed only slightly better than a random model [19]. Hence, it was proposed that, instead of using the antigenicity scale alone, LE prediction may be improved by integration with other computational approaches.

Several machine learning methods have been applied to improve the accuracy of LE prediction. For example, BepiPred combined a hydrophilicity scale with a hidden Markov model [20]; BCPred [21] and FBCPred [22] employed SVM with a subsequence kernel; Söllner and Mayer utilized a molecular operating environment with the decision tree and nearest neighbour approaches [6]. However, these machine learning approaches were mostly set to predict peptides of fixed lengths. This makes it difficult to analyze true LEs, which generally range from 8 to 20 amino acid residues in length [11, 23–25]. Epitopes with fixed lengths are not typically sufficient to represent the whole region of antigenic determinants. To overcome the drawbacks of training and/or predicting fixed-length epitopes, ABCPred used two artificial neural network methods, the feed-forward network and the recurrent neural network, for the prediction of B-cell LEs [26]. Both networks were used with different window lengths from 10 to 20 amino acids and a two-residue interval.

Although bioinformaticians have expended great effort on developing LE predictors, there remains much room for improvement. Theoretically, an epitope identified by experimental immunological or biochemical methods must possess biological antigenicity that can induce antibody production in animals. However, when computational methods are used for prediction, some experimentally identified epitopes may be missed or ignored. This raised the interesting question of how to retrieve these missed epitopes and enhance their antigenicity scores in silico.

In 2008, LEP was developed for predicting LEs based on physico-chemical propensities combined with a mathematical morphology approach. LEP could retrieve some of the LEs that were locally embedded in the noise signals of the antigenic index [12]. We reasoned that combining LEP with machine learning technologies could further improve prediction accuracy while retaining the advantage of variable-length prediction.

As mentioned above, the machine learning methods used in previous LE prediction methods were often trained to predict epitopes with fixed lengths. Chen’s study showed that the frequencies of occurrence for some amino acid pairs in the epitope dataset were significantly higher than in non-epitope datasets, or vice versa [23]. We noticed this important statistical feature and applied it to enhance the performance of LE prediction systems. Hence, in order to explore the statistical advantages of verified epitopes and retain the antigenic characteristics of candidate peptides, we decided to extend the concept of amino acid pairs from Chen’s study, which only considered peptides with 2 residues.

In this study, we developed a novel B-cell LE prediction system called LEPS (Linear Epitope Prediction by Propensities and Support Vector Machine). The LEPS is freely available for academic use at http://leps.cs.ntou.edu.tw. We adopted the library for SVM (LIBSVM) tool and trained it to recognize features of amino acid segments (AASs) with lengths from 2 to 4 residues. Then, SVM was used to characterize those patterns as epitope and non-epitope clusters [27]. Accordingly, the LEPS approach first performed physico-chemical propensities and mathematical morphology approaches and then used the AAS features to cluster the predicted LE candidates and remove the less probable LEs.

2. Materials and Methods

2.1. Testing Datasets and Predictors

Four datasets were used in this study. The AntiJen dataset was recommended at an international meeting sponsored by the National Institute of Allergy and Infectious Diseases [6] and contained 171 protein sequences with 691 verified, nonoverlapping epitopes [19]. The HIV dataset was a collection of the antigenic determinants located on 10 HIV proteins with 54 nonoverlapping, verified epitopes [39]. The PC dataset, generated in this study, was a collection of 12 protein sequences with 98 nonoverlapping, verified epitopes (Table 1). In order to balance out the variation of each dataset in quantity and antigen diversity, these three datasets were merged into one comprehensive dataset called the “AHP dataset.” These datasets were analyzed with different LE predictors, including BepiPred [20], ABCPred [26], BCPred [21], and FBCPred [22], to compare their performances with that of the LEPS developed here.

2.2. System Flow

The proposed system was divided into three main steps (Figure 1(a)). The first step retrieved primitive epitope candidates from a query protein sequence with LEP [12], which was developed in our previous work and was used with the default settings. Then, an SVM classifier was applied to remove less probable epitope candidates and improve prediction accuracies. In the final step, the predicted epitope residues were highlighted in the query sequence and visualized in a predicted structure. The virtual structure was generated from Modeller 9.9, based on homologous protein structure modeling approaches [40].

2.3. Training Datasets and SVM Model

The process of training the SVM model comprised two major steps (Figure 1(b)). The first step (step 1(b)) evaluated the statistical characteristics that determined the frequencies of occurrence of AASs with various lengths from an independent B-cell epitope dataset (Bcipep [41]) and a non-epitope dataset (Chen et al. [23]). The second step (step 2(b)) produced an SVM model that recognized the epitopes and non-epitopes of the Chen dataset based on the statistical features derived from step 1(b).

The Bcipep dataset comprised 1230 experimentally verified, nonredundant B-cell LEs, ranging from 3 to 56 residues in length, that were identified in over 1000 antigen proteins. This dataset was used in step 1(b) to analyze the statistical characteristics associated with the frequencies of occurrence of AASs of 2 to 4 residues in length that represented epitopes.

The Chen dataset contained 872 epitopes and 872 non-epitopes. All epitopes and non-epitopes within this dataset were restricted to a length of 20 residues. These verified epitopes were retrieved from the Bcipep dataset by applying a “truncation-extension treatment.” That is, when the length of an LE was longer than 20 residues, an equal number of superfluous residues were truncated from both the N- and C-termini to preserve the central 20 residues. Conversely, when the length of an LE was shorter than 20 residues, an equal number of residues were added to both the N- and C-termini until the epitope comprised 20 residues. On the other hand, the 872 non-epitopes were generated by randomly selecting peptide segments from the Swiss-Prot database [42], with the stipulation that none was the same as any of the 872 epitopes. The 872 non-epitopes were used to analyze the statistical characteristics of AASs for non-epitopes in step 1(b). After determining the statistical features that were associated with frequencies of occurrence, the proposed system applied these features (step 2(b)) to produce an SVM model in a 5-fold cross-validation on the Chen dataset.
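The truncation-extension treatment described above can be illustrated with a minimal Python sketch. It assumes that the added residues are the flanking residues of the parent antigen sequence, that an odd surplus removes the extra residue from the C-terminus, and that the window is shifted when the epitope lies too close to a sequence end; none of these details is stated in the text, and the function name `truncate_or_extend` is hypothetical.

```python
def truncate_or_extend(antigen: str, start: int, end: int, target: int = 20) -> str:
    """Return a `target`-residue window centred on the epitope antigen[start:end].

    `start` is 0-based and inclusive, `end` is exclusive.
    """
    length = end - start
    if length >= target:
        # Truncation: drop surplus residues from both termini
        # (assumption: an odd surplus loses the extra residue at the C-terminus).
        surplus = length - target
        new_start = start + surplus // 2
        return antigen[new_start:new_start + target]

    # Extension: add flanking residues from the parent sequence on both sides
    # (assumption: the flanks come from the antigen itself).
    missing = target - length
    new_start = max(0, start - missing // 2)
    new_end = new_start + target
    if new_end > len(antigen):
        # Assumption: shift the window back when the C-terminus is reached.
        new_end = len(antigen)
        new_start = max(0, new_end - target)
    return antigen[new_start:new_end]
```

An antigen shorter than 20 residues is returned whole by this sketch, because no further residues are available for extension.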

2.4. Statistical Analysis of AASs and Epitope Indexes

For LE verification, we considered the statistical features to be AASs of 2 (AAS2), 3 (AAS3), and 4 (AAS4) residues in length for both epitopes and non-epitopes. For AAS2, 400 possible combinations of residue pairs were analyzed for occurrence frequencies within both the epitope and non-epitope datasets. The epitope index ($\mathrm{Epidex}^{2}_{i}$) of the $i$th pattern ($\mathrm{AAS}^{2}_{i}$) was calculated by taking the logarithm of the ratio of the proportion of $\mathrm{AAS}^{2}_{i}$ among all epitope AAS2s to the same proportion in the non-epitope AAS2 group, with the following equation:

$$\mathrm{Epidex}^{2}_{i} = \log\left(\frac{f^{2+}_{i} \big/ \sum_{j} f^{2+}_{j}}{f^{2-}_{i} \big/ \sum_{j} f^{2-}_{j}}\right), \quad i = 1, 2, \ldots, 400, \tag{1}$$

where $f^{2+}_{i}$ and $f^{2-}_{i}$ were the numbers of $\mathrm{AAS}^{2}_{i}$ in the epitope and non-epitope datasets, respectively, and $\sum_{j} f^{2+}_{j}$ and $\sum_{j} f^{2-}_{j}$ denoted the total number of AAS2s in the corresponding dataset. Finally, the values of $\mathrm{Epidex}^{2}_{i}$ were normalized to the range of [0,1] to avoid dominance of any individual $\mathrm{Epidex}^{2}_{i}$ in the classifier learning processes.
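As an illustration of equation (1), the following Python sketch counts overlapping residue pairs in two peptide lists, forms the log ratio of their relative frequencies, and rescales the result to [0,1]. The pseudocount for unseen pairs, the natural logarithm, and the min-max normalization are assumptions made for this sketch, because the text does not specify how zero counts, the log base, or the normalization were handled; all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def count_segments(peptides, k):
    """Count every overlapping k-residue segment across a list of peptides."""
    counts = Counter()
    for pep in peptides:
        for i in range(len(pep) - k + 1):
            counts[pep[i:i + k]] += 1
    return counts


def min_max_normalize(scores):
    """Rescale scores linearly to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    if span == 0:
        return {seg: 0.0 for seg in scores}
    return {seg: (value - lo) / span for seg, value in scores.items()}


def epidex2(epitopes, non_epitopes, pseudocount=1.0):
    """Normalized log ratio of pair frequencies in epitopes vs non-epitopes."""
    pos = count_segments(epitopes, 2)
    neg = count_segments(non_epitopes, 2)
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs
    # Assumption: a pseudocount keeps unseen pairs from producing log(0)
    # or a zero denominator.
    pos_total = sum(pos[p] + pseudocount for p in pairs)
    neg_total = sum(neg[p] + pseudocount for p in pairs)
    raw = {
        p: math.log(((pos[p] + pseudocount) / pos_total)
                    / ((neg[p] + pseudocount) / neg_total))
        for p in pairs
    }
    return min_max_normalize(raw)
```

Pairs that are enriched in epitopes end up close to 1, pairs that are enriched in non-epitopes end up close to 0, and pairs with similar relative frequencies in both sets fall near the middle of the range.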

There were a total of 8000 and 160,000 possible combinations for AAS3 and AAS4, respectively. A large portion of AAS3 or AAS4 did not appear in the non-epitope dataset; this would cause a problem, because it could lead to a zero in the denominator. Hence, the definitions of $\mathrm{Epidex}^{3}_{i}$ and $\mathrm{Epidex}^{4}_{i}$ were modified from the definition for $\mathrm{Epidex}^{2}_{i}$, and the corresponding epitope indexes for AAS3 and AAS4 were defined as follows:

$$\mathrm{Epidex}^{l}_{i} = \frac{f^{l+}_{i}}{\sum_{j} f^{l+}_{j}}, \tag{2}$$

where $l$ was equal to 3 or 4. Again, the values of $\mathrm{Epidex}^{3}_{i}$ and $\mathrm{Epidex}^{4}_{i}$ were normalized to the range of [0,1].
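The frequency-only index of equation (2) can be sketched in a few lines of Python. The sketch assumes min-max normalization over all possible segments, so that unobserved segments (frequency 0) map to 0 and the most frequent segment maps to 1; the normalization scheme and the function name `epidex_l` are assumptions, not details given in the text.

```python
from collections import Counter


def epidex_l(epitopes, l):
    """Relative frequency of each l-residue segment among epitopes, rescaled to [0, 1].

    Only observed segments are stored; look up others with `.get(seg, 0.0)`.
    """
    counts = Counter(
        pep[i:i + l] for pep in epitopes for i in range(len(pep) - l + 1)
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    freqs = {seg: n / total for seg, n in counts.items()}
    # Assumption: unobserved segments have frequency 0, so min-max scaling
    # reduces to dividing by the largest observed frequency.
    top = max(freqs.values())
    return {seg: f / top for seg, f in freqs.items()}
```

Storing only the observed segments keeps the table small, which matters for AAS4, where most of the 160,000 combinations never occur.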

2.5. SVM Features and Model Selection

In this study, we adopted the SVM as a learning method to classify the epitope and non-epitope peptides. We employed the open source LIBSVM toolbox for executing this classification. In LIBSVM, each instance in the training set possesses one target value (class label) and several features (attributes). In the testing set, only the features are required for each instance. The objective of the SVM is to generate a model from the training set that facilitates the prediction of the target value of each instance in the testing set. In this study, a peptide corresponded to an instance, and the target value (1 or −1) represented whether that peptide was an epitope. Each peptide contained three feature values based on $\mathrm{Epidex}^{2}_{i}$, $\mathrm{Epidex}^{3}_{i}$, and $\mathrm{Epidex}^{4}_{i}$. For example, a 20-mer peptide was decomposed into 19 overlapping AAS2 subsegments, and the corresponding feature value of this peptide was obtained by taking the average of the 19 $\mathrm{Epidex}^{2}_{i}$ values of these subsegments. Similarly, the feature values for AAS3 and AAS4 were obtained by averaging the $\mathrm{Epidex}^{3}_{i}$ values of the 18 AAS3 subsegments and the $\mathrm{Epidex}^{4}_{i}$ values of the 17 AAS4 subsegments, respectively.
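The feature construction described above can be sketched as follows. The three lookup tables are assumed to be plain dictionaries that map a segment to its normalized epitope index, and segments missing from a table are assumed to score 0; the function names are hypothetical.

```python
def segments(peptide, k):
    """All overlapping k-residue subsegments of a peptide."""
    return [peptide[i:i + k] for i in range(len(peptide) - k + 1)]


def mean_index(peptide, table, k):
    """Average epitope index over the k-residue subsegments of a peptide."""
    segs = segments(peptide, k)
    if not segs:
        return 0.0
    # Assumption: segments absent from the table contribute an index of 0.
    return sum(table.get(seg, 0.0) for seg in segs) / len(segs)


def peptide_features(peptide, table2, table3, table4):
    """Three-dimensional feature vector used as the SVM input for one peptide."""
    return [
        mean_index(peptide, table2, 2),
        mean_index(peptide, table3, 3),
        mean_index(peptide, table4, 4),
    ]
```

Because each feature is an average, the vector has the same three dimensions for candidates of any length, which is what allows variable-length LE candidates to be passed to a single classifier.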

The Chen dataset was used to construct an SVM model based on three feature values and the target values of each epitope and non-epitope. There were four common kernel functions provided by LIBSVM, including linear, polynomial, radial basis function (RBF), and sigmoid. We examined these four kernel functions with a 5-fold cross-validation. The training dataset was equally divided into 5 subsets; in each round, four of the subsets were used for training the model, and the remaining one was used for testing it. This process was repeated five times so that each subset served once as the testing subset. Here, the RBF kernel was selected as the default kernel function, because it provided the best cross-validation accuracy with the training data. Subsequently, the RBF kernel function was applied to the whole training dataset to construct the final SVM classifier in the LEPS.
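The 5-fold protocol can be illustrated with a dependency-free Python sketch. It is not the LIBSVM workflow used in this study: the `train_fn` argument is a hypothetical placeholder where an SVM trainer with a given kernel would be plugged in, and `train_threshold` is a deliberately simple stand-in classifier so that the example runs on its own. The folds are contiguous, so the data are assumed to have been shuffled beforehand.

```python
def k_fold_indices(n, k=5):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for j in range(k):
        size = base + (1 if j < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_accuracy(features, labels, train_fn, k=5):
    """Train on k-1 folds, test on the held-out fold, and pool the results."""
    correct = 0
    for held_out in k_fold_indices(len(labels), k):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(model(features[i]) == labels[i] for i in held_out)
    return correct / len(labels)


def train_threshold(features, labels):
    """Stand-in classifier (NOT an SVM): threshold the mean feature value
    at the midpoint between the two class means."""
    pos = [sum(x) / len(x) for x, y in zip(features, labels) if y == 1]
    neg = [sum(x) / len(x) for x, y in zip(features, labels) if y == -1]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if sum(x) / len(x) > cut else -1
```

Calling `cross_val_accuracy` once per candidate kernel, each time with a trainer configured for that kernel, and keeping the kernel with the highest accuracy mirrors the selection step described above.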

2.6. Performance Measurement

To evaluate the performance of the LEPS at the level of the amino acid residue, five indicators were used to measure effectiveness at the default settings. These indicators were (1) sensitivity (SEN), defined as the percentage of epitopes that were correctly predicted as epitopes; (2) specificity (SPE), defined as the percentage of non-epitopes that were correctly predicted as non-epitopes; (3) positive predictive value (PPV), defined as the probability that a predicted epitope was, in fact, an epitope; (4) accuracy (ACC), defined as the proportion of correctly predicted peptides; and (5) Matthews’ correlation coefficient (MCC), which was a measure of the predictive performance that incorporated both SEN and SPE into a single value between −1 and +1 [26]. These parameters were calculated with the following equations:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \tag{3}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP}, \tag{4}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \tag{5}$$

$$\mathrm{PPV} = \frac{TP}{TP + FP}, \tag{6}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \tag{7}$$

where TP represented the true positive; TN, the true negative; FP, the false positive; FN, the false negative.
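Equations (3) to (7) can be computed from residue-level labels as in the following Python sketch. Each residue is represented by a truthy or falsy flag for "epitope", and returning 0 when a denominator is zero is a convention chosen for this sketch, not one stated in the text; the function names are hypothetical.

```python
import math


def confusion_counts(truth, predicted):
    """Residue-level TP, TN, FP, FN from two equally long label sequences."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    return tp, tn, fp, fn


def _ratio(num, den):
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.
    return num / den if den else 0.0


def performance(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, PPV, and MCC as fractions."""
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SEN": _ratio(tp, tp + fn),
        "SPE": _ratio(tn, tn + fp),
        "ACC": _ratio(tp + tn, tp + fp + tn + fn),
        "PPV": _ratio(tp, tp + fp),
        "MCC": _ratio(tp * tn - fp * fn, mcc_den),
    }
```

The values are returned as fractions; multiplying by 100 gives the percentage form used in the results.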

3. Results and Discussion

3.1. A New Linear Epitope Dataset: PC

The new dataset, called the PC dataset (collected by Pai and Chang), contained 12 sequences that did not overlap with other datasets. It was generated and analyzed in this study. The experimental epitopes in the PC dataset were identified with the peptide scan methodology, a conventional method for epitope determination. The average length of the identified epitopes in the PC dataset was 18.9 residues. This was considered a practical length for an epitope to be used in peptide vaccine development or antibody generation. The average epitope lengths in the HIV and AntiJen datasets were 26.4 and 16.3 residues, respectively. All sequences in the PC dataset were analyzed with the LEPS, and the predicted and experimentally verified epitopes are listed in Table 1.

3.2. The Performance of LEPS

The epitope information collected from the PC, AntiJen, and HIV datasets was utilized to verify the performance of LEPS. The PC dataset was described in the previous section. The original AntiJen dataset comprised 3619 epitopes, of which 3168 were found in the Swiss-Prot database. As in our previous report, we regenerated the original AntiJen dataset by removing the repeated epitopes [12]. The HIV dataset focused on one infectious pathogen and was recognized as a useful tool in the field of HIV immunology [39]. The AHP dataset combined these three datasets to balance the variations in each dataset, including variations in epitope length and the physico-chemical properties of antigens. With these 4 datasets, we compared the performance of five LE predictors, including LEPS, BepiPred [20], ABCPred [26], BCPred [21], and FBCPred [22].

As expected, LEPS provided favorable results in all four datasets (Figure 2). Table 2 shows that LEPS displayed the best specificity (SPE), with values of 88.33%, 84.48%, 74.84%, and 84.22% in the PC, AntiJen, HIV, and AHP datasets, respectively. Moreover, LEPS showed the best PPVs, with values of 45.12%, 28.85%, 71.44%, and 32.07% in the PC, AntiJen, HIV, and AHP datasets, respectively. The PPV indicates the proportion of real epitopes among all positively predicted candidates. It is one of the most important factors in vaccine development, because reducing the number of false positive candidates improves the effectiveness and efficiency of identifying the real epitopes. Therefore, the LEPS should outperform the other predictors in terms of the cost effectiveness of biological experiments. In the field of computational science, prediction accuracy is one of the most important factors in system evaluation. Except in the HIV dataset, LEPS displayed the best ACCs, with values of 61.66%, 73.81%, and 72.52% for the PC, AntiJen, and AHP datasets, respectively. These results showed that LEPS displayed excellent performance for LE prediction. The LEPS also showed the best MCC for the AntiJen and AHP datasets (10.10% and 10.36%), and its MCC for the HIV dataset (22.76%) was only a little lower than those of BCPred (29.80%) and FBCPred (27.81%). Taken together, LEPS displayed excellent performance in SPE and PPV for all four datasets; it also showed the best or equivalent ACCs for all datasets. However, it showed relatively low SEN compared to the other predictors, mainly because it predicted fewer LEs.

3.3. The LEPS Platform

The LEPS provides a user-friendly interface for biologists to predict linear epitope candidates (Figure 3(a)). LEPS accepts input in either FASTA or plain-text format, and the default parameters are set as indicated. In this system, several physico-chemical propensities can be dynamically modified by users, including secondary structures, hydropathy, surface accessibility, flexibility, polarity, and other factors. The scanning window size for each parameter is also adjustable. After executing the prediction, the overall antigenicity of the query protein and the predicted LE candidates are displayed. For example, Figure 3(b) shows the LEs in HIV integrase predicted by LEPS. Seventeen candidates were initially predicted by LEP based on the global and local distributions of antigenicity. These candidates were further filtered by SVM selection, which left only 9 candidates. Within these 9 epitope candidates, number 1 (residues 5–19), number 2 (residues 41–50), numbers 7 and 8 (residues 227–239 and residues 243–247), and number 9 (residues 261–266) overlapped with the experimental epitopes at residues 1–16, residues 42–55, residues 228–252, and residues 262–271, respectively. To verify the surface conditions of the predicted LEs within the query protein sequence, a protein structure was simulated based on homology modeling approaches. This structure can be viewed and analyzed by clicking on the button labeled “predicted structure.”
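The overlaps quoted for HIV integrase can be checked with a short Python sketch that treats each epitope as an inclusive, 1-based residue range. Only the five predicted candidates whose ranges are given in the text are included; the ranges of the other four candidates are not stated and are therefore omitted. The function names are hypothetical.

```python
def overlap_length(a, b):
    """Number of residues shared by two inclusive residue ranges."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start + 1)


def shared_residues(predicted, experimental):
    """Residues of each predicted candidate that fall inside any experimental
    epitope (the experimental ranges are assumed not to overlap each other)."""
    return {
        number: sum(overlap_length(rng, exp) for exp in experimental)
        for number, rng in predicted.items()
    }


# Candidate number -> (first residue, last residue), as listed in the text.
predicted = {1: (5, 19), 2: (41, 50), 7: (227, 239), 8: (243, 247), 9: (261, 266)}
experimental = [(1, 16), (42, 55), (228, 252), (262, 271)]

print(shared_residues(predicted, experimental))
# -> {1: 12, 2: 9, 7: 12, 8: 5, 9: 5}
```

Counting shared residues in this way corresponds to the residue-level evaluation used for the performance indicators.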

3.4. Visualization of the Predicted LEs on 3D Structures

Predicted structures of the query sequences can be rendered by Jmol (http://www.jmol.org/) in LEPS, and the corresponding PDBs and PyMOL script files (http://www.pymol.org/) are available for download on request. For example, Figure 4 shows the simulated structure of HIV integrase as predicted by Modeller, with the predicted epitope segments displayed as yellow solid spheres. Because there is a high probability that true epitopes will be exposed on the protein surfaces for binding with antibodies, visualization of the predicted LEs on 3D structures can facilitate the selection of suitable epitopes from predicted candidates according to their surface distributions. Figure 5 shows an example of the experimentally verified epitopes and predicted epitopes for the 10 kDa chaperonin protein in the AntiJen dataset. The yellow spheres in Figures 5(a) and 5(b) show the true and predicted epitope atoms, respectively. The remainder of the protein is shown as red and blue solid spheres in the two simulated structures. In both cases, most of the epitope residues are located on the protein surface.

3.5. Acceptability of Low Sensitivities

Although LEPS can provide a highly accurate prediction of LEs, its low sensitivity is an issue that remains to be investigated. In general, epitope datasets face the challenge that biological experiments do not cover all the true epitopes within an individual antigen. Peptide scanning data can only identify potential epitopes that are recognized by a specific antibody. However, different antibodies to the same antigen might recognize different epitopes. These biological variations cause low coverage of epitopes within an antigen [43]. This situation implies that the sensitivities of an LE predictor should generally be low. Alternatively, an LE predictor might predict more epitopes across the whole sequence to regain sensitivity at the expense of specificity. This would generally lead to higher experimental costs. Nevertheless, to persuade biologists to conduct in vitro experiments on the predicted potential LEs, the accuracy and MCC values could provide balanced statistics for evaluating the performance of a prediction system.

In this study, LEPS displayed high accuracy, MCC, specificity, and PPV, although the sensitivity was a little low. However, the reduced sensitivity was offset by the high PPV. Therefore, the LEPS provides a high probability of success for molecular biologists in predicting and selecting functional epitopes effectively and efficiently.

Acknowledgments

This work was supported by the National Science Council, Taiwan (NSC-98-2311-B-039-003-MY3 and NSC-99-2627-B-039-002 to H.-T. Chang and NSC100-2321-B-019-004, NSC 99-2627-B-019-007, and NSC98-2221-E-019-031-MY2 to T.-W. Pai) and by the Taiwan Department of Health Clinical Trial and Research Center of Excellence (DOH100-TD-B-111-004).