Abstract

The identification of protein coding regions (exons) plays a critical role in eukaryotic gene structure prediction. Many techniques have been introduced for discriminating between the exons and the introns in the eukaryotic DNA sequences, such as the discrete Fourier transform (DFT) based techniques, but these DFT-based methods rapidly lose their effectiveness in the case of short DNA sequences. In this paper, a novel integrated algorithm based on autoregressive spectrum analysis and wavelet packets transform is presented to improve the efficiency and accuracy of the coding regions identification. The experimental results show that the new algorithm outperforms the conventional DFT-based approaches in improving the prediction accuracy of protein coding regions distinctly by testing GENSCAN65, HMR195, and BG570 benchmark datasets.

1. Introduction

Deoxyribonucleicacid (DNA) sequence consists of genic and intergenic regions. Identification of protein coding regions is an elementary but very important problem in bioinformatics because the exonic regions code for amino acids. So learning the primary structure of a protein leads to studying and analyzing the secondary and tertiary structures of a protein in addition to protein function. Once we could clearly know the structure and function of a protein, we can design drugs, cure diseases, improve crop productivity, and synthesize biofuel. In addition, coding regions represent the conserved part of genomes. On the other hand, predicting conserved regions is also important to study evolution and predict phylogenetic trees [1, 2]. Nowadays, the rapid growth of raw genome sequence data requires efficient biological interpretations, but biological experiments for gene identification in DNA sequences are costly to conduct, so there is still a real demand for accurate and fast tools to analyze these sequences, especially to find genes and determine their functions [3, 4].

All living organisms can be divided into two categories according to their fundamental cell structures: prokaryotes and eukaryotes. In prokaryotes, the coding genes, which are in charge of protein synthesis, are long and continuous (that is open reading frames (ORFs)). But in eukaryotes, genes consist of coding segments interrupted by long noncoding segments. These coding segments are termed as exons and noncoding segments as introns (Figure 1). In case of human eukaryotes only of DNA sequence is coding [5, 6], so it is a challenging task to identify the protein coding regions (exons).

Genomic sequence processing has been an active area of research for the past twenty years and has increasingly attracted the attention of many researchers, and a number of methods have been proposed to predict the protein coding regions [2]. These methods can be divided into two groups in comprehensive categorization [79]: model-dependent methods and model-independent methods (or filter-based methods). Model-dependent methods are built upon some a priori information usually gathered from database of previously known organisms’ genomics, while model-independent methods do not assume such a priori information. Some model-dependent methods such as hidden Markov model (HMM) [10, 11], support vector machine (SVM), and neural network [12] have been successfully used to find splice sites and identify protein coding regions.

Though these model-dependent methods perform more precisely by means of a priori information to train the classifiers, nevertheless, the coding regions may not be represented on the available datasets but exist in the sequenced organism [7, 9]. In such situations, model-independent methods based on digital spectral analysis, which convert sequences text into numeric signal, have been proposed in recent years to detect the coding regions [2]. For most of DNA sequences, one of the principal features is the fact that the dominant signal in coding regions of genomic sequences exhibits a three-base periodicity (TBP) which is evidenced as a sharp peak at frequency in the PSD [1315], but this behavior is not found in other parts of the DNA (including the introns) (Figure 2). The origin of the TBP in protein coding sequences derives from the triplet nature of the codon, and the reasons for this distinction lie in the unequal usage of codons (codon bias) in coding regions [14, 16]. Meanwhile this phenomenon caused lots of background noise, which leads to more difficulty in finding exons in DNA sequences [7, 1719]. Using this property, several model-independent techniques, mainly DFT-based methods, have been mentioned in the recent literatures. Tiwari et al. [14] firstly used the DFT to calculate the PSD; then a fixed-length window was used to move on to the numerical sequence to determine the exonic regions. This technique is also named sliding discrete Fourier transform (SDFT). Rao et al. [20] proposed an efficient sliding window strategy based on the SDFT (also named the periodogram method) to identify the accurate location of the protein coding regions, and their algorithm could increase the location accuracy greatly than Tiwari’s [14]. Digital filters such as antinotch filter [21] and notch filter [19] with the central frequency of were used to remove the background noise, and then SDFT technique was utilized to find exons.

The aforementioned DFT-based spectrum analysis techniques may be roughly categorized as one kind of nonparametric (also named classical spectrum estimation) methods, that is, the periodogram method. This method has the advantage of possible implementation using the fast Fourier transform (FFT) and has made obvious progress in exons finding area [22]. But these methods rapidly lose effectiveness in the case of short DNA sequences because they lead to many false alarms and false dismissal errors [15]. So parametric spectrum methods such as AR modeling were developed for detection of coding regions in small DNA sequences [15, 2327]. And AR modeling has been proved to be able to produce stronger PSD and weaker artifacts than DFT-based methods [2628], especially performing well in finding the small size coding regions.

In this paper, a novel technique based on Marple algorithm of AR PSD and wavelet packets transform is presented to identify the protein coding regions in eukaryotic DNA sequences. This method firstly employs a mapping method to convert the DNA sequences into numerical sequences; then the sequences are passed through a bandpass filter to enhance the TBP characteristics. After that, by taking the numerical sequence as the observed signal of an AR model, the efficient Marple algorithm is utilized to estimate the PSD of the AR model by calculating the parameters of the Yule-Walker equations. Then wavelet packets transform (WPT) technique is employed to reduce the the background noise of the PSD. Finally, similar to the SDFT [14], the PSD at frequency is used to identify the protein coding regions after denoising by WPT. We show that the new algorithm yields comparable performance in improving the exonic identification accuracy than the conventional approaches.

The remainder of the paper is organized as follows. In Section 2, the datasets used are described, together with a brief explanation of the methods that are involved in this study. The principles of the Marple algorithm for AR PSD estimation and the WPT for noise reduction are expressed in detail. The numerical representations of a DNA sequence and the bandpass filter are also detailed in this section, as well as the evaluation criteria at nucleonic level. Section 3 presents the results of the benchmark datasets tests that demonstrate the performance of the proposed algorithm. Also the results are analyzed in this section. Finally, the most significant findings that emerge from this study are summarized in Section 4.

2. Materials and Methods

2.1. Datasets

In this subsection, several widely used benchmark datasets will be described for the purpose of comparing the performance of different algorithms in identifying exonic regions. They are listed in the following paragraphs.

The gene sequence F56F11.4 (Genbank old number AF009962, new number FO081497, http://www.ncbi.nlm.nih.gov/nuccore/FO081497) is on chromosome III of Caenorhabditis elegans which is a free living nematode, about 1 mm in length, and lives in temperate soil environment. It has five distinct exons with the nucleotide position in the complete sequence: 7949–8059, 9548–9877, 11134–11397, 12485–12664, and 14275–14625. As former literatures [7, 29], we select one part from the complete sequence from nucleotide positions 7021 to 10580, so it just covers the aforementioned five exons. For convenience, in the following analysis, we all use the position that is relative to the first nucleotide position of the selected part sequence, not to the real position in the complete sequence.

In order to demonstrate the performance of our proposed algorithm, we also apply it on three benchmark datasets HMR195 [30], BG570 [31], and GENSCAN65 [32]. HMR195 consists of 195 mammalian (including human, mouse, and rat) sequences, totally 2649 exons, with exactly one complete either single-exon or multiexon gene. BG570 is a genomic test datasets of 570 single gene vertebrate sequences, totally 948 exons, prepared by Burset and Guigó [31]. Datasets HMR195 and BG570 can be available from http://www.imtech.res.in/raghava/genebench/datasets.html. GENSCAN65 contains 65 selected human genome sequences which comprise 381 exons from 2 bp to 1210 bp [32]. For convenience, all sequences contain exactly one gene which starts with the “ATG” initial codon and end with a stop codon (TAA, TAG, or TGA) [30]. GENSCAN65 dataset can be available from http://www.imtech.res.in/raghava/genebench/datasets/Kulp-Reese/Human/.

2.2. Identification of Exonic Regions Based on Marple Algorithm and Wavelet Packets Transform

In this section, a novel integrated algorithm using Marple algorithm and WPT denoising technique is proposed for the identification of protein coding regions. We divide the integrated algorithm into five steps (Figure 3). The block diagram of our proposed universal integrated algorithm is shown in Figure 3, and the procedure of our algorithm is as follows.(S1)Convert the DNA sequence into numerical sequence using Code13 mapping method.(S2)Enhance the TBP characteristics of the numerical sequence using an FIR band pass filter.(S3)Extract the TBP components using the Marple algorithm with proper model order. Similar to the SDFT [14], firstly, a sliding window with length N is determined, and in the window the Marple algorithm is used to calculate the PSD of the windowed sequence. Then the PSD at frequency , that is, the PSD at , is extracted and then divided by the mean PSD of the windowed sequence, so we obtain a ratio, which is referred to as the signal to noise ratio (SNR). Sliding the window along the sequence one by one (i.e., one position by one time), this successive progression and the plot of SNR exhibit the coding regions in DNA.(S4)Remove the noise effect of SNR by wavelet packets transform.(S5)Classify (or predict) the protein coding regions according to the optimal threshold.

The following subsections give the detailed presentation of the aforementioned steps, respectively.

2.2.1. Numerical Representation of DNA Sequences

It is an important foundation to convert the DNA sequences into digital signals because it opens the possibility to employ all kinds of powerful DSP techniques for analyzing of genomic data and reveals features of chromosomes [7]. For example, once the DNA sequence has been converted into digital signal while retaining the biological meaning of the represented information, we can utilize the spectral analysis technique to find the exons. That is, coding regions exhibit the three-base periodicity property in the spectral domain which is less apparent in sequences other than exon sequences and can therefore be used to detect exon sequences and to distinguish exonic regions from intronic regions in genomic sequences [33].

There are a number of representations for nucleotide sequences [2, 33, 34]. Kwan et al. [33] reviewed the former widely used numerical representation methods and developed several novel mapping methods. Totally 17 numerical represented methods were compared for exon finding purpose in [33], and they concluded that the Code13 (named K-Quaternary Code I) method offered an attractive performance. Abo-Zahhad et al. [34] also reviewed the published mapping techniques and broadly classified them into two major groups: fixed mapping techniques and physicochemical property based mapping techniques. The former group include the famous Voss method [17] and the following developed methods, such as the tetrahedron [35], the complex [36], and the integer [37] methods. And the latter group is comprised of the electron-ion interaction potential (EIIP) [38], the paired numeric (PN) [39], the Z-curve [40], the structure profile (SP) [41] representations, and so forth.

In this paper, we use the K-Quaternary Code I (denoted as Code13) technique to convert sequence into numerical signal. In the mean time, for comparison purpose, we select four conventional mapping methods from the aforementioned methods for the following spectral analysis, that is, in shorthand form, the Voss method, the EIIP method, the SP method, and the PN method. The detailed representation methods are described as follows.

Voss Method. Perhaps the earliest and most popular mapping of DNA is the binary or Voss method [17, 39], which represents DNA with four binary indicator sequences , and . The presense of a nucleotide at a particular base pair position is represented by 1, and the absence of it is represented by 0. For example, the binary representation of DNA sequence S = {gctatctatc} is given by , , , and .

EIIP Method. In EIIP method [38], the electron-ion-interaction potential associated with each nucleotide is used for mapping of the DNA sequence. The EIIP values for the nucleotides are , , , and . The aforementioned sequence can be converted into one numerical sequence , , , , , , , , , .

SP Method. In the structure profile method [41, 42], the structural information of physical properties of DNA molecule are utilized for mapping nucleotide sequence to numerical sequence. These properties are DNA-bending stiffness, duplex-free energy, duplex disrupt energy, and propeller twist, etc. These structural profiles are calculated according to the conversion tables [42] with step size of one along the DNA sequence, which transforms two nucleotides at a step. So a DNA sequence can be converted into four numerical sequences, but the lengths of the numerical sequences will be one unit shorter than the DNA sequence. Also taking the sequence as an example, one of whose four profiles, the DNA-bending stiffness profile, is given by . The other three numerical sequences can be obtained according to [42].

PN Method. The paired numeric method [39, 42] is based on the statistical evidence that exons are rich in nucleotides and , while introns are rich in nucleotides and . The PN technique assigns the values +1 and −1 to the presence of the - and - nucleotides. So the sequence is mapped into a single sequence .

Code13 Method. The Code13 method [33, 39, 42] is a kind of 1-sequence complex-value numerical representations, and in this representation, the features of the nucleotides have been retained by translating them into numerical properties. The Code13 method assigns the values 1, −1, , and to the presence of the four nucleotides , , , and , respectively, where is the imaginary unit ( ). Then according to the Code13 method, the sequence is mapped into a single sequence .

2.2.2. Emphasizing TBP of the Numerical Sequences by FIR Bandpass Filter

In order to emphasize the three-base property in the protein coding regions, the numerical sequences are passed through an FIR bandpass filter with a Hamming window, whose order is 8 and central frequency is . Lack of distortions in FIR filters is one reason for their preferred use over IIR filters in medical applications [22, 29].

2.2.3. Autoregressive Spectrum Estimation Using Marple Algorithm

The spectrum estimation techniques available may be categorized as nonparametric (also named classical spectrum estimation) and parametric. The nonparametric methods include the periodogram, the Bartlett and Welch modified periodogram, and the Blackman-Tukey methods. All these methods have the advantage of possible implementation using the fast Fourier transform (FFT), but with the disadvantage in the case of short data lengths of limited frequency resolution, and the requirement for windowing to reduce the spectral leakage. Parametric methods on the other hand can provide high resolution, applicability to short data lengths, and avoidance of spectral leakage, scalloping loss, spectral smearing, and window biasing effects [22]. Its disadvantage lies in being computationally efficient, and parameter methods mainly include three methods, Yule-Walker autoregressive method, Burg method, and the Marple algorithm.

The idea of the AR spectrum analysis is that the digitized signal is modeled as an AR time series plus a white noise error term. The spectrum is then obtained from the AR model parameters and the variance of the error term. The model parameters are found by solving a set of linear equations obtained by minimizing the mean squared error term (the white noise power) over all the data.

The process of the AR spectrum analysis using Marple-WPT method is described as follows. Firstly, an important consideration is the choice of the number of terms in the AR model. This is known as its order. If the order is too low the power density estimate will be excessively smoothed, so some peaks may be obscured. If the order is too high, spurious peaks may be introduced. Hence, it is important to determine the appropriate model order for each set of data.

In an AR model of a time series the current value of the series, , is expressed as a linear function of previous values plus an error term, ; thus

This equation incorporates previous terms and represents a model of order . It is more compactly written as where is the back-shift operator which denotes a delay of sampling intervals. So (2) can be rewritten as Then we obtain AR model where is interpretable as the -transform of an all-pole IIR digital filter with coefficients, . This filter is called an AR filter. In (4) the may be regarded as the output of this filter caused by random inputs, . represents the error between the value predicted by the model, , and the true datum value, . is usually assumed to have the properties of white noise; that is, it is assumed to have a Gaussian probability density distribution and a uniform power density spectrum. Thus may be regarded as having been generated by the AR filter from a white noise source.

The power spectrum density, , of the AR series is as follows:

It can be found that the parameters in the right-hand side of (5) and the autoregressive function of , , have the following relationship [22]:

The model parameters, , may now be obtained from (6), which are the famous Yule-Walker (YW) equations. If we obtain these parameters, then we can calculate the PSD of .

There are several methods to solve YW equation, such as autocorrelation method (also named the Levinson-Durbin algorithm) [43], the Burg method, and the Marple method. Here we choose the Marple method because it can yield statistically stable spectral estimates of high resolution. This method minimizes the forward and backward prediction errors in the least squares sense.

In the Marple method, the YW equations (6) have the equivalent formulation where

The matrix is Hermitian and positive semidefinite, and (8) may be solved using the Cholensky decomposition method [44]. So we can obtain the PSD of signal after we solved the YW equations. And then the spectral estimation for the AR model given from (8) at frequency is utilized to predict the exons in the eukaryotic DNA sequences.

Order Selection of AR Model. The order of the AR model depends on the statistical properties of the data, so it should be carefully chosen for the data to fit well. Models of low order are preferred on the ground that fewer parameters have to be fitted. However, if the order is too small, the resulting spectral estimate will be smoothed and will have poor resolution. On the other hand, if the model order is too large, the spectral estimate may contain spurious peaks and lead to spectral line splitting. Two of the most commonly used order estimation parameters were developed by Akaike. These are the final prediction error, FPE [45], given by and the Akaike information criterion, AIC [46], which is where is the modeling error, is the data record length, and is the order of the model.

Generally, AIC is particularly recommended for short data records, while FPE is recommended for longer data records. A practical approach is to attempt to select to minimize both FPE and AIC . To guarantee a valid output, Lang and McClellan [47] recommended to set the estimation order parameter to be less than or equal to two-thirds the input vector length. Zhao et al. [48] suggested that the best order could be reached between . In this paper, we will make a comprehensive consideration of the aforementioned criterions and choose the best order to improve the prediction accuracy.

2.2.4. Extracting the TBP Components Using the Marple Algorithm

After the mapping and TBP enhancement steps, the next critical step of our algorithm is to extract the TBP components, which can be implemented similarly to the SDFT [14]. As for a numerical sequence , , firstly, a sliding window with length is determined; for example, . In the th 351-length window, the Marple algorithm is used to calculate the PSD . Then the PSD at frequency ; that is, the PSD at position [14] is extracted, referred to as . In order to make a fair comparison of the frequency spectrum in different windows, we introduce the following signal to noise ratio (SNR): where , is the mean of the total PSD of the th windowed sequence.

Sliding the window along the sequence one by one position, this successive progression and the SNR curve exhibit the coding regions in DNA. It is expected that in the SNR curve, the protein coding regions have high SNR, while the noncoding regions have low SNR. So we can identify those exonic regions by proper threshold; that is, if the SNR curve of a region is above the threshold horizonal line, this region may be the exonic region while the region which is under the threshold horizonal line may be noncoding region.

There are several assistant strategies for the identification algorithm.(1)The values on SNR curve will be normalized by dividing by their max value, which contributes to the following comparisons.(2)Different mapping methods and the sliding window technique will make the obtained SNRs have different lengths, so we will use the mirror-symmetric boundary-extension method [49] to overcome this and make the SNRs have the same length as the numerical sequence.(3)It should be noted that before we use SNR curve and the threshold to determine the exons, the background noise in SNR curve should be reduced. The noise reduction technique by WPT and the optimal threshold selection method are described in the following two subsections in detail.

2.2.5. Noise Reduction Using Wavelet Packets Transform

Wavelet packets transform (WPT) is a generalization of wavelet decomposition that offers a richer signal analysis. In the decomposition of a signal by using discrete wavelet transform (DWT), only the lower frequency band is decomposed, giving a right recursive binary tree structure, where its right lobe represents the lower frequency band. Its left lobe represents the higher frequency band. In the corresponding decomposition by using WPT, the lower, as well as the higher, frequency bands are decomposed giving a balanced binary tree structure [5052]. That is, a single wavelet packet decomposition gives a lot of bases from which you can look for the best representation with respect to a design objective. This can be done by finding the “best tree” based on an entropy criterion. Such a tree is given in Figure 4 (MATLAB R2011a Wavelet Toolbox).

Denoising is an important application of WPT and its main idea is to reconstruct the useful frequency contents after the decomposition. The WPT denoising procedure of MATLAB toolbox (MATLAB R2011a Wavelet Toolbox) involves four steps.

Decomposition. For a given wavelet, compute the wavelet packet decomposition of signal at level .

Computation of the Best Tree. For a given entropy, compute the optimal wavelet packet tree.

Threshold of Wavelet Packet Coefficients. For each packet (except for the approximation), select a threshold and apply threshold to coefficients. The graphical tools from MATLAB toolbox automatically provide an initial threshold based on balancing the amount of compression and retained energy. This threshold is a reasonable first approximation for most cases. However, in general you will have to refine your threshold by trial and error so as to optimize the results to fit your particular analysis and design criteria. The tools facilitate experimentation with different thresholds and make it easy to alter the tradeoff between amount of compression and retained signal energy.

Reconstruction. Compute wavelet packet reconstruction based on the original approximation coefficients at level and the modified coefficients.

2.2.6. Threshold Selection

Threshold selection plays an important role in discriminating between coding and noncoding regions based on the SNR curve. The proper threshold can help to optimize the accuracy of the identification. Xu et al. [28, 53] developed a novel method based on the bootstrap algorithm and Rao’s sliding window strategy [20] to infer organism-specific optimal thresholds for different eukaryotes, and this integrate algorithm has improved the prediction accuracy than the conventional universal threshold based methods. In this paper, we use the threshold selection method developed by Kwan et al. [29, 54]. The mean and standard deviations of the TBP values determined from a training set of exon and intron sequences are used to calculate the threshold level , which is defined as where mean and represent the mean and standard deviations of the TBP values obtained from the exon sequences of a training set, respectively, and mean and represent, respectively, the mean and standard deviations of the TBP values obtained from the intron sequences of the same training set.

2.2.7. Evaluation Criteria at Nucleotide Level

In these evaluations, results of different methods are compared at the nucleotide level. At this level, we evaluate the accuracy of a prediction on a test sequence by comparing the predicted coding value (coding or noncoding) with the true coding value for each nucleotide along the test sequence [31]. For this purpose, the following measures are employed [55].

Sensitivity, Specificity, and AC. Sensitivity and specificity are probably the most widely used measures for prediction accuracy evaluation. Similar to [31], a figure of the two measures is utilized to explain them (Figure 5). In Figure 5 true positive (TP) is the number of coding nucleotides correctly predicted as coding, false negative (FN) is the number of coding nucleotides predicted as noncoding, true negative (TN) is the number of noncoding nucleotides correctly predicted as noncoding, and false positive (FP) is the number of noncoding nucleotides predicted as coding. Based on the aforementioned four quantities, sensitivity and specificity are defined as

That is, gives the measure of the proportion of coding nucleotides that have been correctly predicted as coding, and is the proportion of coding nucleotides that are actually coding.

Neither nor is sufficient by itself because perfect sensitivity of 1 can be obtained if all the nucleotides were predicted as coding, and perfect specificity can be obtained if all nucleotides were predicted as noncoding [30]. So accuracy defined as is a widely used compound measure which considers both sides of and from the global perspective [28].

Here we introduce several other global measures as the previous researchers [30, 31]. Correlation coefficient (CC) is a single scalar value summarizing both and as a measure of global accuracy, with the definition CC appears to be particularly appropriate as a measure of overall prediction accuracy, but CC has many shorcomings [31], as an ideal alternative measure of CC. Approximate correlation (AC) was firstly proposed by Burset and Guigó [31]. Its definition is as follows: where average conditional probability (ACP) is

According to Burset and Guigó [31], AC can be used to measure the association between prediction and reality appropriately. AC is not only a measure of gene structure prediction accuracy, but also a measure to optimize when developing gene structure prediction programs. Unlike the CC, it has a probabilistic interpretation, and it can be computed in any circumstances. As ACP is the average of four conditional probabilities [31], it ranges from 0 to 1, so AC ranges from −1 to 1, which can be compared to CC. So AC can be looked upon as approximate measure of CC. And according to [31], if an algorithm has the larger AC value (strictly speaking, the absolute value of AC), that means this algorithm has the better accuracy. So we will calculate the three measures , , and AC in the algorithm evaluation.

Receiver Operating Characteristics (ROC) Curves. The ROC curves were developed in the 1950s as a technique for visualizing, organizing, and selecting classifiers based on their performance [55, 56]. In the exon identification problems, an ROC curve can help to explore the effects on TP and FP as the position of an arbitrary decision threshold is varied. Also, the curve can be characterized as a single number using the area under the ROC curve (AUC). The larger AUC leads to the better performance of the tested technique.

3. Results and Discussion

In this section, the results of the proposed algorithm are compared with those of existing techniques, such as sliding Fourier transform spectrum (SDFT) (referred to as VossDFT) [14], and those improved techniques are based on DFT, such as EIIPDFT [38], PNDFT [39], and SPDFT [41, 42].

The outline of this section is as follows. Firstly, the denoising performance of WPT technique is given by comparing the SNR of a short benchmark data. Then we compare our Code13 mapping method with four widely used mapping methods selected from the aforementioned mapping approaches, that is, Voss, EIIP, SP, and PN mapping methods. It should be noted that in the comparison only the mapping method is different; that means in Figure 3 procedures only the first step is different, the following steps are totally identical. Finally, we compare our proposed algorithm with other existing techniques, and three widely used benchmark datasets will be utilized for comparison purpose. To evaluate and compare the results, the aforementioned measures such as , AC, ROC, and AUC are calculated.

Firstly, we use the DNA sequence F56F11.4 to test the noise reduction performance of WPT. The SNR curve of sequence F56F11.4 calculated by our algorithm is shown in Figure 6. During the calculation process, we determine the window length as 351 [14]; the order of the AR model is 9 according to the order selection strategy (see Section 2.2.3). The soft threshold function is utilized to process the data, entropy is sure criterion, and symlets wavelet is selected as the orthogonal wavelet which will be decomposed into 5 levels according to the signal. It can be seen that the burrs shape noise in original SNR (Figure 6(a)) is distinctly reduced by WPT technique, and the “smoothed” SNR (Figure 6(b)) will contribute to the following exonic identification accuracy.

Secondly, the Code13 and the other four aforementioned mapping methods (Voss, EIIP, SP, and PN) are, respectively, utilized to map the F56F11.4 sequence into five different numerical sequences. Then according to the procedure of our proposed algorithm (Figure 3), the following techniques of the integrated algorithm, including the Marple algorithm and WPT, are used to identify the exons with the same parameters settings in the whole calculate process. So it can be looked upon as five different algorithms whose difference only lies in the mapping methods, and we compare their final identification performance. Similar to the aforementioned WPT denoising section, we determine the following settings: the window length is 351, the order of the AR model is 9, and the soft threshold function, sure entropy criterion, and symlets wavelet with 5-level decomposition are utilized to reduce the noise of the SNR.

The performance measures of five mapping methods for gene sequence F56F11.4 are represented in Table 1. Here, the sensitivities, specificities, and approximate correlation are calculated under the use of optimal threshold according to (14), and the results show that the Code13 method has the largest measures , and AC ). Figure 7 shows the exonic regions identification performance of sequence F56F11.4 using five mapping methods combing with the Marple-WPT technique. As can be seen, there are five exonic regions with the relative positions 928–1039, 2528–2857, 4114–4377, 5465–5644, and 7255–7605 in sequence F56F11.4 to be identified (red bold line segments). The Code13 mapping based algorithm (Figure 7(e)) is able to produce distinct SNR peaks for all the exonic regions, and the SNR in noncoding regions is restricted well. So it plays a critical role in the following identification, and the optimal threshold helps to determine the exact start and end position of the predicted exonic regions; the black thin line segments represent the predicted candidate exons. It is worth mentioning that the first exon of sequence F56F11.4 is a typical short sequence, whose length is 112 bp, and it is hard to be identified in many conventional techniques [2]. But in our Code13 based algorithm, it is easy to be identified. However, the first four mapping methods produce poor performance, either miss the true exon (those square shadows in Figures 7(a), 7(b), 7(c), and 7(d)) or give a false forecast (rectangle shadows in Figures 7(c) and 7(d)). So it means that the Voss, EIIP, SP, and PN methods based algorithms cannot distinguish the exonic regions from the noncoding regions precisely. The advantage of Code13 mapping method may lie in the fact that it is a kind of complex number representation, and it reflects the complementary nature of nucleotide - and - pairs relationship [33].

Finally, our proposed algorithm is applied to the three widely used benchmark datasets: GENSCAN65, HMR195, and BG570. For comparison purpose, several conventional exonic identification techniques are employed on the aforementioned datasets in the mean time, and the performance criteria measures such as AC, ROC curves, and AUC are utilized in the comparison process.

Taking HMR195 as an example, this benchmark datasets contains 195 sequences with exactly one complete either single-exon or multiexon gene (including 43 single-exon genes and 152 multiexon genes) [30]. HMR195 has the following characteristics: the ratio of human : mouse : rat sequences is 103 : 82 : 10; the mean length of the sequences in the set is 7096 bp; there are 948 exons in the datasets with the total length 199176 bp; the minimum exon length is 12 bp; the mean length of the exons is 208 bp; and the mean intron length is 678 bp. Figure 8 shows the distribution of exonic regions length in HMR195, and most exonic regions length concentrate about 210 bp.

Five identification techniques are utilized for the aforementioned three datasets, that is, VossDFT [14], EIIPDFT [38], SPDFT [41, 42], PNDFT [39], and our proposed Code13-Marple. The output results are represented in Table 2 and Figure 9. As can be seen, the proposed Code13-Marple method achieves the largest AC values for all the three datasets (that is, 0.2324, 0.2508, and 0.2131), which means that this method outperforms the other four traditional DFT-based methods in general. From Table 2 and Figure 9, it also can be found that the Code13-Marple algorithm has the largest AUC values for the three test datasets, that is, 0.6801, 0.7179, and 0.6522. Also taking HMR195 datasets as an example, our proposed algorithm achieves relative improvements of 27.6%, 28.7%, 16.9%, and over the VossDFT, EIIPDFT, SPDFT, and PNDFT techniques, respectively, in terms of the AUC values. That is, our proposed algorithm achieves more accuracy than the other four methods. The ROC curves of the proposed algorithm in Figure 9 are all distinctly “close” to the top left corner which visually verifies that the Code13-Marple-WPT method is more effective than the other techniques. The reason of the aforementioned output may lie in the fact that those conventional nonparametric methods especially those DFT-based techniques have the advantage of possible implementation using the FFT, but with the disadvantage in the case of short data lengths of limited frequency resolution, and the requirement for windowing to reduce the spectral leakage. Parametric methods on the other hand can provide high resolution, applicability to short data lengths, and avoidance of spectral leakage, scalloping loss, spectral smearing, and window biasing effects [22].

4. Conclusions

In this paper, we propose a new technique based on Marple algorithm and wavelet packets transform with the Code13 numerical mapping approach to improve the accuracy of identification of the protein coding regions in the eukaryotic DNA sequences. The outputs of the test by many benchmark datasets show that the proposed algorithm outperforms some well-known DFT-based methods. There are several reasons attributed to the improvement of the identification accuracy: first, the FIR filters help to enhance the TBP characteristics of the numerical sequences before PSD calculation; second, the Marple algorithm can calculate the PSD more efficiently and accurately than those conventional methods because it can yield statistically stable spectral estimates of high resolution; third, the WPT can reduce the noise in SNR curves, which attributes to the following identification of exonic regions distinctly; finally, those assistant strategies such as threshold selection, normalization of SNR curves, the mirror-symmetric boundary-extension method also can help to improve the final accuracy of the whole algorithm.

In the same time, it should be noted that there are still some shortcomings in our proposed algorithm, such as the order selection of the AR model when using Marple algorithm and the Marple algorithm being a little more time-consuming.

Also there are still two important and challengeable problems which deserve further study. First, how can we obtain the precise exons location information [14, 20]? Second, how can we utilize the algorithm to identify the dual coding genes in eukaryote? The dual coding genes are the phenomenon where there are two overlapped open reading frames (ORFs) in the same direction of a protein coding region (such as three known humans GNAS1, XBP1, and INK4a) [57, 58]. And whether or not we can take the original overlapped sequence as two equal length sequences with partial overlapped signal, and convert it into a blind signal separation problem, is unknown and deserves further study in the future. So it is our next target to overcome the aforementioned tasks for more effective and accuracy algorithm and to help the related biologists to identify the DNA structure more clearly.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank Professor Fengzhu Sun from the University of Southern California deeply for the interest in the project and useful discussion about the coding region discriminating criteria. The authors also thank all the anonymous reviewers for their valuable suggestions and support. This work is supported by the Natural Science Foundation of China Grants 11371227 and 10921101 and Graduate Independent Innovation Foundation of Shandong University (GIIFSDU) (yzc12098).