Journal of Applied Mathematics

Volume 2014, Article ID 176943, 9 pages

http://dx.doi.org/10.1155/2014/176943

## An Exploration of the Triplet Periodicity in Nucleotide Sequences with a Mature Self-Adaptive Spectral Rotation Approach

^{1}College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350116, China

^{2}Fujian Provincial Key Laboratory of Networking Computing and Intelligent Information Processing, Fuzhou 350116, China

^{3}Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University, Kowloon, Hong Kong

Received 19 April 2014; Revised 20 July 2014; Accepted 25 July 2014; Published 12 August 2014

Academic Editor: Ning Hu

Copyright © 2014 Bo Chen and Ping Ji. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Previously, a self-adaptive spectral rotation (SASR) method was developed for predicting coding regions in nucleotide sequences, based on a universal statistical feature of coding regions named triplet periodicity (TP). It outputs a random walk, that is, the TP walk, in the complex plane for the query sequence. Each step in the walk corresponds to a position in the sequence and is generated from a long-term statistic of the TP in the sequence. The coding regions (TP intensive) are then visually discriminated from the noncoding ones (without TP) in the TP walk. In this paper, the behaviors of the walks for random nucleotide sequences are further investigated qualitatively. A slightly leftward trend (a negative noise) in such walks is observed, which was not reported in the previous SASR literature. An improved SASR, named the mature SASR, is proposed in order to eliminate the noise and correct the TP walks. Furthermore, a potential sequence pattern opposite to the TP persistent pattern, that is, the TP antipersistent pattern, is explored. Applications of the algorithms to simulated datasets show their capability to detect such a potential sequence pattern.

#### 1. Introduction

Coding region prediction for nucleotide sequences is an active topic in the field of computational biology [1–10]. Techniques including dynamic programming (DP) and the Hidden Markov Model (HMM) have been adopted to process information collected from *ab initio* experiments and predict potential coding regions. In addition, researchers suggest that codon usage is highly nonrandom in coding regions [11], and the biased appearance of codons gives rise to a universal property of coding regions, called the “triplet periodicity (TP).” Investigating the TP property is therefore of interest for developing coding region detection algorithms [12, 13], as well as for some other significant gene-related issues.

The TP property was first presented by Fickett [14]. It is said to be a simple and universal difference between coding and noncoding regions. After Fickett’s work, the TP property was analyzed with various theoretical tools, such as hidden Markov chains [15, 16], time series [17, 18], information theory [11, 12], and the Fourier transform [19–25]. Studies on the TP property aim at predicting coding regions [26] and, especially, detecting frame shift points in nucleotide sequences [27, 28]. Among such methods, the self-adaptive spectral rotation (SASR) provides a visualization of the TP property hidden in nucleotide sequences and can be employed for training-free coding region prediction [24, 25]. This method takes only the query sequence as its input and outputs a random walk in the complex plane, called the TP walk, which conveniently presents the locations of coding (TP intensive) regions as well as frame shifts. Here, a “frame shift” is related to the length of the interregion gap (the non-TP region between two TP intensive regions), and it is defined as that length mod 3.

In Chen and Ji’s work [24], it was claimed that, for simple random sequences, the TP walks should move randomly around the zero point, and for simple TP intensive sequences, the walks should clearly move rightward. A measure, named rightward rate (RR), was used to verify this point and qualitatively discriminate TP intensive sequences from random ones. However, the RR measure is defined in the positive domain and cannot indicate a potential leftward walk trend. In this work, we propose a new measure, that is, the symmetrical rightward rate (SRR), to qualitatively investigate the behavior of the TP walks for sequences, considering both the leftward trend and the rightward trend. With this measure, a slightly leftward trend (a negative noise) in TP walks is observed, which was not reported in the previous SASR literature. The cause of this anomaly is discussed using probability theory, and a modification of the original method, named the mature SASR, is given to correct it. Furthermore, a potential sequence pattern opposite to the TP persistent pattern, that is, the TP antipersistent pattern, is explored. Applications of the algorithms to simulated datasets show their capability to detect such a potential sequence pattern.

All the methods involved in this work are introduced in Section 2, together with the preparation of the experimental data. Section 3 presents our experiments, findings, and discussions. Conclusions are given at the end of the paper.

#### 2. Methods and Materials

##### 2.1. Previous Work: The Original SASR

For a given base sequence, there is a TP profile describing the preferred usage of the codons. The TP profile was presented, in Frenkel and Korotkov’s work [11, 12], using a triplet periodicity matrix (TPM). The TPM is a 4 × 3 matrix in which each row stands for a nucleotide base Λ, each column stands for a “period position,” and each entry is the count of occurrences of the base Λ at that period position. As in the previous work [24], the SASR calculates the TPM of the posterior subsequence at each position recursively, from the last position of the sequence back to position 1, with the recurrence formula and initial value

Here, stands for the posterior subsequence of the complete sequence at position (excluding position ). is the row vector in the TPM of this posterior subsequence for each base . The operation “” means times right cyclic shift (RCS) on the triplet row vector :

Then, for each position , a triplet vector , called TP vector, is selected from the TPM of the posterior subsequence, according to the base at the position, that is, . It follows that . A sequence of TP vectors is generated as , called the TP sequence. Figure 1 gives an example of generating the TP sequence for a given nucleotide sequence.
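The recursive bookkeeping described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`rcs`, `direct_tpm`, `recursive_tpms`, `tp_sequence`) are hypothetical, positions are 0-based, and column `j` of the TPM at position `n` is assumed to count the bases found at positions `m > n` with `(m - n) mod 3 == j`. Under that convention, stepping back by one position amounts to counting the current base in the first column and then applying one RCS to every row; the published recurrence formula may be arranged differently.

```python
BASES = "ACGT"

def rcs(v, k=1):
    """Right cyclic shift of a triplet, applied k times: (a, b, c) -> (c, a, b)."""
    v = list(v)
    k %= 3
    return v[-k:] + v[:-k] if k else v

def direct_tpm(seq, n):
    """TPM of the posterior subsequence seq[n+1:], counted directly.

    Assumed convention: column j counts positions m > n with (m - n) % 3 == j.
    """
    tpm = {b: [0, 0, 0] for b in BASES}
    for m in range(n + 1, len(seq)):
        tpm[seq[m]][(m - n) % 3] += 1
    return tpm

def recursive_tpms(seq):
    """Yield (n, TPM of the posterior subsequence at n), from the last position to the first."""
    tpm = {b: [0, 0, 0] for b in BASES}  # empty posterior subsequence at the last position
    for n in range(len(seq) - 1, -1, -1):
        yield n, {b: list(row) for b, row in tpm.items()}
        tpm[seq[n]][0] += 1                            # count the base at position n
        tpm = {b: rcs(row) for b, row in tpm.items()}  # re-align all rows to position n - 1

def tp_sequence(seq):
    """TP vectors ordered by position: the TPM row of the base found at each position."""
    vectors = [None] * len(seq)
    for n, tpm in recursive_tpms(seq):
        vectors[n] = tpm[seq[n]]
    return vectors
```

The recursive version touches each position once, whereas recounting the posterior subsequence at every position would cost quadratic time; `direct_tpm` is kept only as a cross-check.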

The TP walk is then defined as a random walk in the complex plane, generating a moving trace according to the TP sequence. The trace is a sequence with the initial value , and for each step ,

Here, the function maps the triplet vector into a complex number by

The above process, which generates a TP walk from the query sequence, is called an SASR process. The TP walk generated from (3) can provide a good visualization of the TP property: for TP intensive regions, the TP walk shows obvious moving trends, while the walk in non-TP regions moves much more slowly or randomly around stable points. These walk patterns are clues for discriminating between TP intensive and non-TP regions [24]. Moreover, the walk changes direction between one TP intensive region and a neighboring one, and the angle of the resulting “corner” indicates the frame shift between the two regions, following a “corner rule” [24].

##### 2.2. The Symmetrical Rightward Rate

According to Chen and Ji [24], the TP walks for simple TP intensive sequences have an obvious trend to move rightward and those for random sequences move randomly around the zero point. To quantitatively verify this principle in practice, a rightward rate (RR) measure has been presented in Chen and Ji’s work. For a given nucleotide sequence, an RR measure is calculated from its TP walk :

Here, $\operatorname{Re}(z)$ stands for the real part of a complex number $z$. This measure is used to reveal the average speed at which the walk moves rightward in the complex plane.

According to the above definition, an RR measure is never less than 0 and therefore cannot reveal a leftward walk trend. However, in some cases, a leftward trend should also be considered. So a symmetrical rightward rate (SRR) is further presented here:

If a walk has an obvious trend to move rightward, its SRR measure tends to be positive, while a walk that moves leftward yields a negative SRR measure. A walk that moves randomly around the zero point has an SRR measure close to 0. The SRR considers both the leftward and rightward trends and is employed to reveal the true behavior of the TP walks in this work.

##### 2.3. Improvement: The Mature SASR

A modification of the original SASR is proposed here, called the mature SASR. In the original SASR, at each position, the TPM of the posterior subsequence is calculated and the TP vector is selected directly from this TPM, as mentioned previously. In this modification, the TP vector is selected from a “mature” TPM, instead of from the original matrix. Here, “mature” means that the TPM satisfies

A mature TPM is maintained with a simple recurrence formula involving only an RCS: . In addition, the original TPM is still maintained as described before, so that the mature TPM can be refreshed by copying it whenever the original TPM becomes “mature,” which happens once every three steps. Figure 2 shows a simple example of generating a TP sequence with this new algorithm.
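The two-matrix scheme described above can be sketched as follows. This is an illustration only: the "mature" condition itself is not reproduced in the text above, so the sketch assumes that a TPM is mature when its posterior subsequence has a length that is a multiple of 3 (so that every period position is covered equally often). The names `mature_tpms` and `mature_tp_sequence`, the 0-based indexing, and the column convention (column `j` counts positions at distance `d` with `d mod 3 == j`) are likewise assumptions.

```python
BASES = "ACGT"

def rcs(v):
    """One right cyclic shift of a triplet: (a, b, c) -> (c, a, b)."""
    return [v[2], v[0], v[1]]

def mature_tpms(seq):
    """Yield (n, mature TPM at position n), from the last position to the first."""
    original = {b: [0, 0, 0] for b in BASES}
    mature = {b: [0, 0, 0] for b in BASES}
    for n in range(len(seq) - 1, -1, -1):
        posterior_len = len(seq) - 1 - n
        if posterior_len % 3 == 0:
            # Assumed maturity condition: the original TPM covers whole
            # triplets only, so the mature TPM is refreshed by copying it.
            mature = {b: list(row) for b, row in original.items()}
        yield n, {b: list(row) for b, row in mature.items()}
        # Original TPM: count the current base, then shift every row.
        original[seq[n]][0] += 1
        original = {b: rcs(row) for b, row in original.items()}
        # Mature TPM: shift only, no new counts are added.
        mature = {b: rcs(row) for b, row in mature.items()}

def mature_tp_sequence(seq):
    """TP vectors selected from the mature TPM instead of the original one."""
    vectors = [None] * len(seq)
    for n, tpm in mature_tpms(seq):
        vectors[n] = tpm[seq[n]]
    return vectors
```

Under the assumed condition, the three column totals of the mature TPM are equal at every position, which is the property the modification relies on.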

With this improved method for obtaining a TP sequence, the complete algorithm for generating a TP walk is described in Pseudocode 1. Its usage and advantages are shown in Section 3.

##### 2.4. Simulating Random Sequences

In this work, a random sequence dataset is generated, containing 2,000 nucleotide sequences with lengths from 300 bp to 5,000 bp. These sequences are unbiasedly random, without any periodicity, and are obtained by simply assigning to each site in a sequence one of the four nucleotide bases with equal probability 1/4.
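A dataset of this kind could be generated along the following lines. The sketch assumes that each base is drawn independently with probability 1/4, as "unbiasedly random" suggests, and that sequence lengths are drawn uniformly from the stated range; the length distribution, the seed, and the function names are assumptions made for illustration.

```python
import random

BASES = "ACGT"

def random_sequence(length, rng):
    """Draw each site independently and uniformly from the four bases."""
    return "".join(rng.choices(BASES, k=length))

def random_dataset(n_sequences=2000, min_len=300, max_len=5000, seed=0):
    """Build a list of random sequences; lengths are uniform on [min_len, max_len] (assumed)."""
    rng = random.Random(seed)  # fixed seed so that the dataset is reproducible
    return [random_sequence(rng.randint(min_len, max_len), rng)
            for _ in range(n_sequences)]
```

Calling `random_dataset()` with the defaults mirrors the dataset size and length range given above; a smaller `n_sequences` is convenient for quick checks.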

##### 2.5. Simulating TP Antipersistent Sequences

Besides the random sequence dataset, another sequence dataset is generated, containing 2,000 simulated TP antipersistent sequences with lengths from 300 bp to 5,000 bp (see the elaboration on TP antipersistency in Section 3). To generate a simulated TP antipersistent DNA sequence with a length of , the flow chart in Figure 3 is followed. First, a short subsequence at the end (the “seed”), that is, , is randomly generated. Here, we use the seed length . The TPM of the complete sequence is calculated as follows:

Here, “count” means get the number of the elements in the following set, and “” denotes “.”

Then, the bases in the anterior part are assigned recursively from position to 1. For each given position , , is assigned to be base Λ with a probability:

Here, “” denotes “.”

After assigning the base at each position, the TPM of the complete sequence is immediately updated following equation (8), with the newly assigned .

#### 3. Results and Discussions

##### 3.1. Application of the Original SASR to Random Sequences

The original SASR is applied to the simulated random sequences. The distribution of the SRR values of the TP walks is plotted in Figure 4 in the form of its probability density function (PDF). The distribution is close to a normal distribution with a slight shift toward negative values.

The sample mean of the SRR values is and the sample standard deviation is . A one-sample $t$-test with the hypothetical mean $\mu_0 = 0$ obtains a $P$ value of 0. Here, the $P$ value in a one-sample $t$-test is a statistical term indicating the likelihood of obtaining the observed sample if the population has the hypothetical mean $\mu_0$. In practice, a $t$ statistic is first calculated:

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}},$$

where $\bar{x}$ and $s$ are the sample mean and the sample standard deviation, and the sample size is $n = 2{,}000$, as mentioned before. Once the $t$ statistic is determined, a $P$ value can be found using a table of values from “Student’s $t$-distribution”. A $P$ value of 0 indicates that the distribution is significantly different from the unbiased (with the expectation of 0) normal distribution. So the TP walks for the random sequences move slightly leftward, rather than being unbiasedly random as expected in Chen and Ji’s work [24].
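The one-sample test described above can be reproduced with a few lines of standard-library Python. This is a sketch, not the authors' code: the function name `one_sample_t` is hypothetical, and instead of looking up Student's t-distribution the two-sided P value is approximated with the normal distribution, which is a reasonable substitute only for large samples such as the 2,000 SRR values used here.

```python
import math
import statistics

def one_sample_t(values, mu0=0.0):
    """Return (t statistic, approximate two-sided P value) for H0: mean == mu0."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    t = (mean - mu0) / (sd / math.sqrt(n))
    # Normal approximation to the two-sided tail probability (assumes large n).
    p_two_sided = math.erfc(abs(t) / math.sqrt(2.0))
    return t, p_two_sided
```

Given a list `srr_values` of SRR measures, `one_sample_t(srr_values)` tests them against a hypothetical mean of 0. For small samples the exact t-distribution should be used instead of the normal approximation.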

The reason for the slightly leftward trend is discussed below. Consider a random sequence . At any position, a certain base appears with a fixed probability and . Suppose a base Λ appears at position ; according to Chen and Ji [24] (also find the original SASR in Section 2), we have the step :

It is easy to find that the random variable follows the Binomial distribution:

Here, is the count of the positions that satisfy and . And the expected value . So the expected value of the step is

According to the definition of , although the differences among , , and are no more than 1, is always the minimum in the three. According to the mapping from the triplet to the complex number (4), it causes the walk to move leftward slightly for each step and further produces a slightly negative SRR value.
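The counting argument can be checked numerically. The sketch below assumes that the first component of a TP vector counts posterior positions whose distance from the current position is a multiple of 3, and that the other two components count the remaining two residues; this reading, the 0-based column order, and the function names are assumptions made for illustration.

```python
def positions_per_offset(posterior_len):
    """Count posterior positions by (distance from the current position) mod 3."""
    counts = [0, 0, 0]
    for distance in range(1, posterior_len + 1):
        counts[distance % 3] += 1
    return counts

def expected_tp_vector(posterior_len, p=0.25):
    """Expected TP vector when a base appears at every site with fixed probability p."""
    return [p * k for k in positions_per_offset(posterior_len)]
```

For a posterior subsequence of length 7 the counts are `[2, 3, 2]`, and for length 8 they are `[2, 3, 3]`: the first count is never larger than the other two, and all three coincide only when the length is a multiple of 3.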

As discussed above, the slightly leftward trend is caused by a negative noise introduced by the original SASR method itself. The noise may be present in all TP walks, so an improved method is needed to eliminate it.

##### 3.2. Application of the Mature SASR to Random Sequences

As mentioned in Section 2, the mature SASR uses a mature TPM instead of the original matrix. The mature TPM is always derived when , and it is supposed to eliminate the noise.

The mature SASR is applied to the random sequences and the distribution of the SRR values is plotted in Figure 5. The distribution is close to a normal distribution with an expected value of 0. The sample mean of the SRR values is and the sample standard deviation is . The one-sample $t$-test obtains a $P$ value of 22.5% (two-tailed), which shows no significant difference from the unbiased distribution. So it is verified that, by using the mature SASR, the TP walks for random sequences are unbiasedly random around the zero point in the complex plane. The negative noise is eliminated by this modification.

It should be pointed out that, compared with the original SASR method, the mature SASR equally eliminates the negative noise that originally exists in all TP walks for both non-TP and TP intensive sequences. Therefore, this modification of the method does not impact the capability of the method in detecting the TP intensive pattern.

##### 3.3. The TP Antipersistent Sequences

The TP profile was presented in Frenkel and Korotkov’s work [11, 12] using a triplet periodicity matrix (TPM) as mentioned in Section 2. The TP profiles in the parts of a non-TP sequence have no correlation with each other. It shows a “Brownian pattern” in the sequence. On the other hand, in a simple TP intensive sequence , a certain base Λ appears at position in the 3 bp period with a probability:

Kotlar and Lavner’s finding [23] suggests that, in coding regions of a given organism, the TP profile, by which nucleotide bases appear in the triplet period, tends to keep unchanged. It can be considered as a “persistent pattern” in the sequence. That is, for any position ,

Besides the “Brownian pattern” and the “persistent pattern” mentioned above, a theoretically possible pattern is considered, namely, the “antipersistent pattern.” For the antipersistent pattern, any part of the sequence has a TP profile opposite to that of the remaining parts. In other words, a certain base Λ avoids appearing at the position in the 3 bp period that it prefers in other parts of the sequence. An ideal probability model is built here as a simple case of the TP antipersistency. That is, at any given position in the sequence, a certain base appears with a probability:

So that, for any position presenting in the sequence,

Therefore, this model is found to be opposite to the “persistent pattern” of (15). In practice, we simulate such TP antipersistent sequences by the method mentioned in Section 2.

##### 3.4. Applications of the Algorithms to TP Antipersistent Sequences

The original SASR is first applied to the simulated TP antipersistent sequences (see Section 2). The distribution of the SRR values of the TP walks is plotted in Figure 6, compared with that for the random sequences. It shows an obvious difference between these two distributions.

The simulation above reveals a leftward moving trend of the TP walks for TP antipersistent sequences. The reason for this behavior is discussed below. Consider any short section containing three sequential positions , , and ( is a multiple of 3; i.e., mod 3 = 0) in a sequence with the TP antipersistent probability model mentioned previously. The posterior subsequences at these three positions share a similar TPM with a shift:

Meanwhile, according to (9), base Λ appears at these positions with a probability:

Here, stands for the entry in the TPM of the posterior subsequence at position ; that is, . Meanwhile, we have

Hence, these three steps in the walk move to where

Obviously, in this case, we have . Therefore, in (21), the first two elements of the expected vector dominate the third one. According to (4), it causes the TP walk to move leftward in the complex plane.

However, since all TP walks from the original SASR contain a negative noise, as mentioned before, it is difficult to determine to what extent the noise has affected the gap between the two distributions in Figure 6. Therefore, to visualize the real gap between the two patterns, the mature SASR, in which the noise has been eliminated, is needed.

The mature SASR is then applied to the simulated TP antipersistent sequences. The distribution of the SRR values of the TP walks is plotted in Figure 7, compared with that for the random sequences. The PDF curve for the simulated TP antipersistent sequences lies to the left of that for the random sequences, and the cumulative distribution function (CDF) curves indicate that 85% of the simulated sequences have negative SRR values, while the SRR values of the random sequences are split evenly between the negative and positive areas. It is found that the sample mean and the sample standard deviation of the 2,000 SRR values for the simulated TP antipersistent dataset are and , respectively. A $P$ value of 0 indicates a significant difference between this distribution and that for random sequences. It should be noted that, although the gap between the two distributions is smaller than that in Figure 6, this gap is entirely due to the difference between the sequence patterns, without any noise. So the mature SASR is more suitable than the original SASR for visualizing the TP antipersistency.

The results from the simulations and the discussions above indicate that the mature SASR is able to discriminate TP antipersistent sequences from random sequences. The antipersistent pattern can be identified according to a leftward moving trend in the TP walk.

#### 4. Conclusions

In this work, a new measure, that is, the SRR, is presented to qualitatively investigate the behavior of the original SASR’s outputs, that is, the TP walks, considering both the leftward trend and the rightward trend. With this measure, an anomalous behavior of the walks from the original SASR is revealed for random sequences: the TP walks move slightly leftward, rather than being unbiasedly random as expected in Chen and Ji’s work [24]. This anomaly is caused by a negative noise introduced by the original SASR method itself, and the noise is present in all TP walks.

A modification of the original SASR, that is, the mature SASR, is then given in order to eliminate the noise and correct the behavior of the TP walks, without impacting the capability of the method in detecting the TP intensive pattern. The application to the simulated random sequences verifies that, by using the mature SASR, the TP walks for random sequences are unbiasedly random around the zero point in the complex plane.

Furthermore, a potential sequence pattern opposite to the TP persistent pattern, that is, the TP antipersistent pattern, is explored. Applications of the algorithms to simulated datasets show their capability to detect such a potential sequence pattern. The mature SASR thus appears to be an effective tool for the visualization of TP-related features, including non-TP, TP persistency, and TP antipersistency.

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgments

The project was supported by the Natural Science Foundation of Fujian Province, China (Grant no. 2012J05114), and Fujian Provincial Key Laboratory of Networking Computing and Intelligent Information Processing (Fuzhou University).

#### References

1. K. Song, Z. Zhang, T. Tong, and F. Wu, “Classifier assessment and feature selection for recognizing short coding sequences of human genes,” *Journal of Computational Biology*, vol. 19, no. 3, pp. 251–260, 2012.
2. P. K. Sree and I. R. Babu, “AIS-INMACA: a novel integrated MACA based clonal classifier for protein coding and promoter region prediction,” *Journal of Bioinformatics and Comparative Genomics*, vol. 1, pp. 1–7, 2014.
3. J. Khatun, Y. Yu, J. A. Wrobel et al., “Whole human genome proteogenomic mapping for ENCODE cell line data: Identifying protein-coding regions,” *BMC Genomics*, vol. 14, no. 1, article 141, 2013.
4. J. Mena-Chalco, H. Carrer, Y. Zana, and R. M. Cesar Jr., “Identification of protein coding regions using the modified Gabor-wavelet transform,” *IEEE/ACM Transactions on Computational Biology and Bioinformatics*, vol. 5, no. 2, pp. 198–207, 2008.
5. A. D. Haimovich, B. Byrne, R. Ramaswamy, and W. J. Welsh, “Wavelet analysis of DNA walks,” *Journal of Computational Biology*, vol. 13, no. 7, pp. 1289–1298, 2006.
6. Y. L. Orlov, R. Te Boekhorst, and I. I. Abnizova, “Statistical measures of the structure of genomic sequences: entropy, complexity, and position information,” *Journal of Bioinformatics and Computational Biology*, vol. 4, no. 2, pp. 523–536, 2006.
7. J. H. Do and D. K. Choi, “Computational approaches to gene prediction,” *The Journal of Microbiology*, vol. 44, no. 2, pp. 137–144, 2006.
8. M. Stanke and S. Waack, “Gene prediction with a hidden Markov model and a new intron submodel,” *Bioinformatics*, vol. 19, no. 2, pp. ii215–ii225, 2003.
9. C. T. Zhang and J. Wang, “Recognition of protein coding genes in the yeast genome at better than 95% accuracy based on the Z curve,” *Nucleic Acids Research*, vol. 28, no. 14, pp. 2804–2814, 2000.
10. W. Li, “The complexity of DNA,” *Complexity*, vol. 3, pp. 33–37, 1997.
11. F. E. Frenkel and E. V. Korotkov, “Classification analysis of triplet periodicity in protein-coding regions of genes,” *Gene*, vol. 421, no. 1-2, pp. 52–60, 2008.
12. F. E. Frenkel and E. V. Korotkov, “Using triplet periodicity of nucleotide sequences for finding potential reading frame shifts in genes,” *DNA Research*, vol. 16, no. 2, pp. 105–114, 2009.
13. J. W. Fickett, “The gene identification problem: an overview for developers,” *Computers and Chemistry*, vol. 20, no. 1, pp. 103–118, 1996.
14. J. W. Fickett, “Recognition of protein coding regions in DNA sequences,” *Nucleic Acids Research*, vol. 10, no. 17, pp. 5303–5318, 1982.
15. R. K. Azad and M. Borodovsky, “Probabilistic methods of identifying genes in prokaryotic genomes: connections to the HMM theory,” *Briefings in Bioinformatics*, vol. 5, no. 2, pp. 118–130, 2004.
16. J. Henderson, “Finding genes in DNA with a Hidden Markov model,” *Journal of Computational Biology*, vol. 4, no. 2, pp. 127–141, 1997.
17. Y. H. Cao, W. W. Tung, J. B. Gao, and Y. Qi, “Recurrence time statistics: versatile tools for genomic DNA sequence analysis,” *Journal of Bioinformatics and Computational Biology*, vol. 3, no. 3, pp. 677–696, 2005.
18. J. B. Gao, Y. Qi, Y. H. Cao, and W. Tung, “Protein coding sequence identification by simultaneously characterizing the periodic and random features of DNA sequences,” *Journal of Biomedicine and Biotechnology*, vol. 2005, no. 2, pp. 139–146, 2005.
19. X. Y. Jiang, D. Lavenier, and S. S. Yau, “Coding region prediction based on a universal DNA sequence representation method,” *Journal of Computational Biology*, vol. 15, no. 10, pp. 1237–1256, 2008.
20. G. Dodin, P. Vandergheynst, P. Levoir, C. Cordier, and L. Marcourt, “Fourier and wavelet transform analysis, a tool for visualizing regular patterns in DNA sequences,” *Journal of Theoretical Biology*, vol. 206, no. 3, pp. 323–326, 2000.
21. D. Anastassiou, “Frequency-domain analysis of biomolecular sequences,” *Bioinformatics*, vol. 16, no. 12, pp. 1073–1081, 2000.
22. D. Anastassiou, “Genomic signal processing,” *IEEE Signal Processing Magazine*, vol. 18, no. 4, pp. 8–20, 2001.
23. D. Kotlar and Y. Lavner, “Gene prediction by spectral rotation measure: a new method for identifying protein-coding regions,” *Genome Research*, vol. 13, no. 8, pp. 1930–1937, 2003.
24. B. Chen and P. Ji, “Visualization of the protein-coding regions with a self adaptive spectral rotation approach,” *Nucleic Acids Research*, vol. 39, no. 1, article e3, 2011.
25. B. Chen and P. Ji, “Numericalization of the self adaptive spectral rotation method for coding region prediction,” *Journal of Theoretical Biology*, vol. 296, pp. 95–102, 2012.
26. M. Bellani, J. Epps, and G. A. Huttley, “A comparison of periodicity profile methods for sequence analysis,” in *Proceedings of the IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '12)*, pp. 78–81, Washington, DC, USA, December 2012.
27. Y. M. Suvorova, V. M. Rudenko, and E. V. Korotkov, “Detection change points of triplet periodicity of gene,” *Gene*, vol. 491, no. 1, pp. 58–64, 2012.
28. A. M. Michel, K. R. Choudhury, A. E. Firth, N. T. Ingolia, J. F. Atkins, and P. V. Baranov, “Observation of dually decoded regions of the human genome using ribosome profiling data,” *Genome Research*, vol. 22, no. 11, pp. 2219–2229, 2012.