Abstract

In eukaryotic genes, exons are generally interrupted by introns. Accurately removing introns and joining exons together are essential processes in eukaryotic gene expression. With the avalanche of genome sequences generated in the postgenomic age, it is highly desired to develop automated methods for rapid and effective detection of splice sites that play important roles in gene structure annotation and even in RNA splicing. Although a series of computational methods were proposed for splice site identification, most of them neglected the intrinsic local structural properties. In the present study, a predictor called “iSS-PseDNC” was developed for identifying splice sites. In the new predictor, the sequences were formulated by a novel feature-vector called “pseudo dinucleotide composition” (PseDNC) into which six DNA local structural properties were incorporated. It was observed by the rigorous cross-validation tests on two benchmark datasets that the overall success rates achieved by iSS-PseDNC in identifying splice donor site and splice acceptor site were 85.45% and 87.73%, respectively. It is anticipated that iSS-PseDNC may become a useful tool for identifying splice sites and that the six DNA local structural properties described in this paper may provide novel insights for in-depth investigations into the mechanism of RNA splicing.

1. Introduction

In eukaryotic genomes, exons that code for proteins are typically interrupted by introns termed as protein noncoding regions. The borders between exons and introns are called splice sites (Figure 1). A splice site can be located at either the upstream or the downstream part of an intron. For the former, it is called the 5′ splice site or donor site; for the latter, it is called the 3′ splice site or acceptor site. The vast majority of the donor and acceptor sites are canonical or regular splice sites that are characterized by the presence of the GT and AG, respectively. During RNA splicing, both the donor and acceptor sites will be recognized by a large macromolecule called spliceosome that is comprised of more than 300 proteins and five small nuclear RNAs (snRNAs U1, U2, U4, U5, and U6) [1]. Once the splice sites are recognized, the spliceosome will remove introns through two sequential transesterification reactions (Figure 1). Removing introns from precursor messenger RNA (pre-mRNA) so that exons can be joined together to form mature mRNA is an essential step of gene expression. Therefore, to better understand the splicing process and mechanism, it is important to accurately detect the splice sites in the genome.

Although biochemical experimental approaches can provide some details about the splice sites, it is both time-consuming and expensive to rely on the biochemical experimental techniques alone. Hence, it is a big challenge and also highly desirable to develop computational methods for timely and effectively identifying the splice sites. In view of this, the present study was initiated in an attempt to develop a computational method for predicting splice sites.

According to a comprehensive review [2] and demonstrated by a series of recent publications [39], to establish a really useful statistical predictor for a biological system, we need to consider the following procedures: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the biological samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (iii) introduce or develop a powerful algorithm to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor. Below, let us describe how to deal with these procedures one by one.

2. Materials and Methods

2.1. Benchmark Dataset

The human splice site-containing sequences were obtained from the database HS3D (http://www.sci.unisannio.it/docenti/rampone/), which contained the sequences of exons, introns, and splice regions extracted from GenBank Rel.123. All the splice site-containing sequences in HS3D obey the GT-AG rule; that is, begin with the dinucleotides GT (GU in RNA) and end with the dinucleotides AG, and their lengths are of 140 nucleotides with the splice donor site GT (or acceptor site AG) in the middle positions.

At present, there are 2,796 (2,880) true splice donor (acceptor) site-containing sequences and 271,937 (329,374) false splice donor (acceptor) site-containing sequences in HS3D. To balance the number of the true and false splice site-containing sequences and to avoid the overfitting problem in the model-training processes, we randomly selected out 2,800 false splice donor (acceptor) site-containing sequences from the 271,937 (329,374) false splice donor (acceptor) site-containing sequences.

As pointed out in a comprehensive review [10], there is no need to separate a benchmark dataset into a training dataset and a testing dataset for examining the performance of a prediction method if it is tested by the jackknife test or subsampling cross-validation test.

Finally, we obtained two benchmark datasets, one for the splice donor site-containing sequence, while the other for the splice acceptor, as can be formulated by where the positive dataset contains 2,796 true splice donor site-containing sequences while the negative dataset contains 2,800 false splice donor site-containing sequences; contains 2,880 true splice acceptor site-containing sequences, while contains 2,800 false splice acceptor site-containing sequences, and the symbol means the union in the set theory. The detailed sequences in the two benchmark datasets and are given in Supplementary Information S1 and Supplementary Information S2, respectively; see Supplementary Material available online at http://dx.doi.org/10.1155/2014/623149.

2.2. DNA Sample Formulation

Given a DNA sample with nucleic acid residues, the most straightforward way to express the sample is to use the following sequential model: where represents the first nucleic acid residue at position 1, represents the second nucleic acid residue at position 2, and so forth. Although the sequential formulation of (2) contains the complete information of the DNA sample, it is difficult to be handled for statistical prediction. This is because all the existing operation engines, such as optimization approach [11], covariance discriminant (CD) [12], neural network [13], support vector machine (SVM) [1416], random forest [17, 18], conditional random field [8], nearest neighbor (NN) [19], K-nearest neighbor (KNN) [20], OET-KNN [21], fuzzy K-nearest neighbor [2224], ML-KNN algorithm [25], and SLLE algorithm [26], can only handle vector but not sequence samples. Although some sequence-similarity-search-based tools, such as BLAST [27], can be used to directly search for those sequences with high similarity to the query sample, unfortunately, this kind of straightforward and intuitive approach failed to work when the query sample did not have significant similarity to any of the character-known sequences. Therefore, various nonsequential or discrete models to represent the DNA samples were proposed in hopes of establishing some sort of correlation or cluster manner through which the prediction could be more effectively carried out.

The simplest discrete model used to represent a DNA sample is its nucleic acid composition or NAC, as given below: where , , , and are the normalized occurrence frequencies of adenine , cytosine , guanine , and thymine in the DNA sequence, respectively; the symbol is the transpose operator. However, as we can see from (3), all its sequence-order information is completely lost if using NAC to represent a DNA sample. Actually, one of the most important but also most difficult problems in computational biology is how to effectively formulate a biological sequence with a discrete model or a vector, yet still keep considerable sequence-order information.

One way to cope with such a problem is to represent the DNA segment with the -tuple nucleotide composition, a vector with components; that is, where is the normalized occurrence frequency of the th -tuple nucleotide in the DNA segment. As we can see from (4), the dimension of the vector is indicating that by increasing the value of , although the coverage scope of sequence order will be gradually increased, the dimension of the vector will be rapidly increased as well. This will cause the high-dimension disaster [28] as reflected by the following disadvantages: (i) the overfitting problem that will make the predictor with a serious bias and extremely low capacity for generalization; (ii) the information redundancy or noise that will bring about the error of misrepresentation resulting in very poor prediction accuracy; and (iii) unnecessarily increasing the computational time.

To avoid the high-dimension disaster, here, the dinucleotide composition (DNC) was used to formulate the DNA sample, as given by where is the normalized occurrence frequency of AA in the DNA sequence, is that of AC, is that of AG, and so forth. By doing so, we can only incorporate the local sequence-order information between the most contiguous nucleotides, but none of the global or long-range sequence-order information can be reflected.

Actually, similar problem also occurred in computational proteomics, where, in order to incorporate the global or long-range sequence-order information for proteins, the pseudo amino acid composition [29] or Chou’s PseAAC [30] was proposed. Since the concept of PseAAC was proposed in 2001 [29], it has been penetrating into almost all the fields of protein attribute predictions (see, e.g., [3173]). Because it has been widely used, recently two types of open access software, called “PseAAC-Builder” [51] and “propy” [74], were established for generating various modes of PseAAC.

Encouraged by the successes of introducing the PseAAC approach into computational proteomics, Chen et al. [4] proposed the “pseudo dinucleotide composition” or PseDNC to identify recombination spots of DNA. The formulation of PseDNC is given by where where () have the same meaning as those in (6), while is the th tire correlation factor that reflects the sequence-order correlation between all the th most contiguous dinucleotides along a DNA sequence (see Figure 2), as formulated by In the above two equations, is the number of the total counted ranks or tiers of the correlations along a DNA sequence, and is the weight factor. Their concrete values as well as the final value for will be further discussed later. The correlation function in (9) is defined by where is the number of local DNA structural properties considered that is equal to 6 in the current study as will be explained below, is the numerical value of the th DNA local structural property for the dinucleotide at position , and is the corresponding value for the dinucleotide at position , as will be given below.

2.3. DNA Local Structural Property Parameters

A lot of evidences have shown that DNA local structural properties play important roles in many biological processes, such as protein-DNA interactions [75], formation of chromosomes [76], and meiotic recombination [4]. Generally speaking, the spatial arrangements of two successive base pairs can be characterized by six parameters, of which three are the local translational ones and the other three are the local angular ones (Figure 3), as formulated by

The six structural parameters of dinucleotides have been calculated by Goñi et al. [75] based on the long atomistic molecular dynamics (MD) simulations in water, and their concrete values are given in Table 1, which will be used to calculate the global or long-range sequence-order effects for the DNA sequences via (9) and (10).

Note that before substituting the values of physicochemical property into (10), they were all subjected to a standard conversion as described by the following equation: where the symbols mean taking the average of the quantity therein over the 16 different combinations of , , , for and SD means the corresponding standard deviation [10]. The converted values obtained by (12) will have a zero mean value over the 16 different dinucleotides and will remain unchanged if going through the same conversion procedure again. Listed in Table 2 are the values of () obtained via the standard conversion of (12) from those of Table 1.

2.4. Support Vector Machine (SVM)

Support vector machine (SVM) is an effective method for supervised pattern recognition and has been widely used in the realm of bioinformatics [4, 14, 77, 78]. The basic idea of SVM is to transform the data into a high dimensional feature space and then determine the optimal separating hyperplane. A brief introduction about the formulation of SVM has been given in [14]. In this study, the SVM implementation was based on the freely available package LIBSVM 2.84 written by Chang and Lin [79], which can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/. Because of its effectiveness and speed in training process, the radial basis kernel function (RBF) was used to obtain the best classification hyperplane. The regularization parameter and the kernel width parameter were tuned via the grid search method in the 10-fold cross-validation.

The predictor obtained via the above procedures is called iSS-PseDNC, where “i” stands for “identifying,” “SS” for “splice site,” “Pse” for “pseudo,” “D” for “di,” “N” for “nucleotide,” and “C” for “composition.”

2.5. Criteria for Performance Evaluation

To provide a more intuitive and easier-to-understand method to measure the prediction quality, the following set of four metrics based on the formulation used by Chou [80] in studying signal peptide prediction was adopted. According to Chou’s formulation, the sensitivity (Sn), specificity (Sp), overall accuracy (Acc), and Matthew’s correlation coefficient (MCC) can be expressed as follows [4, 79]: where is the total number of the true splice site-containing sequences investigated, while is the number of true splice site-containing sequences incorrectly predicted as the false splice site-containing sequences; is the total number of the false splice site-containing sequences investigated, while is the number of the false splice site-containing sequences incorrectly predicted as true splice site-containing sequences. From (13), we can easily see the following. When meaning that none of the true splice site-containing sequences was incorrectly predicted to be a false splice site-containing sequence, we have the sensitivity . When meaning that all the true splice site-containing sequences were incorrectly predicted to be the false splice site-containing sequences, we have the sensitivity . Likewise, when meaning that none of the false splice site-containing sequences was incorrectly predicted to be a true splice site-containing sequence, we have the specificity , whereas when meaning that all the false splice site-containing sequences were incorrectly predicted to be the true splice site-containing sequences, we have the specificity . When meaning that none of the true splice site-containing sequences and none of the false splice site-containing sequences were incorrectly predicted, we have the overall accuracy and Mathew’s correlation coefficient ; when and meaning that all the false splice site-containing sequences and all the true splice site-containing sequences were incorrectly predicted, we have and , whereas when and , we have and meaning no better than random prediction. As we can see from the above discussion based on (13), the meanings of the four metrics have become much more intuitive and easier to understand than the conventional formulation often used in the literature, particularly for Mathew’s correlation coefficient, which is usually used for measuring the quality of binary (two-class) classifications as in the case of the current study. However, it is instructive to point out that the set of the metrics in (13) is valid only for the single-label systems. For the multilabel systems whose existence has become more frequent in system biology [8183] and system medicine [24, 84], a completely different set of metrics as defined in [25] is needed.

3. Results and Discussions

3.1. Graphic Profiles of True and False Splice Site-Containing Sequences

It has been reported that the DNA local structural properties, that is, angular parameters (twist, tilt, and roll) and translational parameters (shift, slide, and rise), play important roles in prokaryotic transcription initiation, protein-DNA interactions, and meiotic recombination [4, 75, 76, 85]. Accordingly, it is quite natural to ask whether these DNA structural properties may also play some role in regulating RNA splicing. Here, let us use the graphic approach to address this question. This is because using graphical approaches to study biological problems can provide an intuitive picture or useful insights for helping in analyzing complicated relations in these systems [30], as demonstrated by many previous studies on a series of important biological topics, such as enzyme-catalyzed reactions [8689], inhibition of HIV-1 reverse transcriptase [9093], inhibition kinetics of processive nucleic acid polymerases and nucleases [94], protein folding kinetics [95], drug metabolism systems [96], protein sequence evolutionary analysis [97], protein remote homology detection [5], and using Wenxiang diagram or graph [98] to study protein-protein interactions [99102]. Shown in Figure 4 is a comparison of the graphic profiles between the true and false splice site-containing sequences. As we can see there, the divergence between the true and false splice site-containing sequence profiles is remarkable, clearly indicating that the six structural property parameters can indeed play important roles in RNA splicing. That was why we used them to calculate the global sequence-order effects as elaborated in Section 2.3.

3.2. Cross-Validation

How to properly evaluate the anticipated accuracy is an important step in developing a new predictor. Generally speaking, to avoid the “memory effect” [10] of the resubstitution test in which a same dataset was used to train and test a predictor, the following three cross-validation methods are often used to examine a predictor for its effectiveness in practical application: independent dataset test, subsampling or -fold (such as 5-fold, 7-fold, or 10-fold) test, and jackknife test. However, as elaborated by a penetrating analysis in [2], considerable arbitrariness exists in the independent dataset test. Also, as demonstrated by (28)–(30) in [2], the subsampling test (or -fold cross-validation) cannot avoid arbitrariness either. Only the jackknife test is the least arbitrary that can always yield a unique result for a given benchmark dataset. Therefore, the jackknife test has been widely recognized and increasingly adopted by investigators to examine the quality of various predictors (see, e.g., [42, 58, 59, 62, 64, 66, 67, 70, 103107]). Therefore, in this study, the jackknife test was also used to examine the performance of the predictor. During the jackknife test, each sequence in the benchmark dataset (or ) was in turn singled out as an independent test sample and all the rule-parameters were derived based on the remaining data without including the one under the prediction.

3.3. Parameter Optimization

As we can see from (8), the predictive accuracy of the present model depends on the two parameters and , where is the weight factor which was usually within the range from 0 to 1 and is the number of the correlation tiers to be counted for the global sequence-order information. Generally speaking, the greater the is, the more global sequence-order information the model will contain. However, if is too large, it would reduce the cluster-tolerant capacity [108] so as to lower down the cross-validation accuracy due to overfitting or “high dimension disaster” [28] problem. Therefore, our searching for the optimal values of the two parameters was confined in the range Furthermore, to reduce the computational time during the search process, the 10-fold cross-validation approach was adopted. Once the optimal values thus obtained for the two parameters were determined, the rigorous jackknife test was utilized to evaluate the anticipated accuracy of the predictor.

Listed in Table 3 are the jackknife test results of the iSS-PseDNC predictor in identifying the splice donor site-containing sequences and the splice acceptor site-containing sequences on the benchmark datasets and , respectively, where the optimal values for and are also explicitly given.

To further show the power of the iSS-PseDNC predictor, we also did some comparison calculations as described below.

First, based on the sequence similarity principle, we used BLAST [109] to conduct the jackknife test on the same benchmark dataset as used by the iSS-PseDNC predictor. The results thus obtained are given in Table 4, from which we can see that the percentage rates for Sn, Sp, and Acc by BLAST are about 40% lower than those by iSS-PseDNC and that the rates of MCC by BLAST are about 0.5 lower than those by iSS-PseDNA, for the cases of both donor and acceptor.

Second, rather than pseudo dinucleotide composition (7), we used the dinucleotide compositions (6) to represent the DNA samples for prediction. The corresponding results thus obtained are given in Table 5, from which we can see that the rates for Sn, Sp, Acc, and MCC are all lower than those reported in Table 3, clearly implying that the additional components in the pseudo nucleotide composition did play a role in enhancing the prediction quality.

All these results indicate that the iSS-PseDNC model as proposed in this paper is quite promising and may become a useful tool in identifying splice sites.

4. Conclusions

RNA splicing is a complicated biological process that involves interactions among DNA, RNA, and proteins. Hence, it is reasonable to analyze the structural properties that can be used to describe these interactions. In view of this, we firstly plotted the profiles of the six DNA structural properties (twist, tilt, roll, shift, slide, and rise) for splice site-containing sequences and found the differences between true and false splice site-containing sequences. The structural divergences surrounding splice sites may facilitate the removal of the introns by spliceosome.

By defining PseDNC using the above six DNA structural properties, we proposed a model, namely, iSS-PseDNC, for identifying splice sites. The predictive performance demonstrated that our model is helpful for splice site recognitions. Since user-friendly and publicly accessible web-servers represent the direction of developing practically more useful models [110], simulated methods, or predictors, we will make efforts in our future work to provide a web-server for the approach presented in this paper.

It has not escaped our notice that the web-server PseKNC (pseudo -tuple nucleotide composition) developed very recently [111] will be very useful for further improving the prediction quality in identifying the splicing sites.

Conflict of Interests

The authors declare no conflict of interests.

Acknowledgments

The authors wish to thank the editor for taking time to edit this paper and the anonymous reviewers for the constructive comments, which were very helpful for strengthening the presentation of this paper. This work was supported by the National Nature Scientific Foundation of China (nos. 61100092 and 61202256) and the Nature Scientific Foundation of Hebei Province (no. C2013209105).

Supplementary Materials

Supporting Information S1. The benchmark dataset for splice donor sites.

Supporting Information S2. The benchmark dataset for splice acceptor sites.

  1. Supplementary Materials