Abstract

The Chaos Game is an algorithm that produces pictures of fractal structures. Considering that the four bases A, G, C, and T of DNA sequences can be classified in three different ways according to their chemical structure, we propose three kinds of CGR-walk sequences. Based on the CGR coordinates of the resulting random sequences, we introduce some invariants for the DNA primary sequences. As an application, we examine the similarity/dissimilarity among the first exons of the β-globin gene of different species. The results indicate that our method is efficient and can capture more biological information.

1. Introduction

A DNA sequence is composed of four different nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T). Since the DNA molecule contains plentiful biological, physical, and chemical information, it has become very important to analyze DNA sequences statistically. The number of nucleotides stored in GenBank now exceeds hundreds of millions of bases and is growing rapidly. Therefore, biologists, physicists, mathematicians, and computer specialists have adopted different techniques to study DNA sequences in recent years, including statistical methods and mapping rules for the bases.

A great number of studies have proposed converting DNA sequences into digital sequences before downstream analysis. Many statistical methods, such as random walk, Lévy walk, entropy near method, root-mean-square fluctuation, wavelet transform, and Fourier transform [1–12], can be used as effective tools to process DNA sequences. The one-dimensional DNA walk was first proposed by Peng et al. [1]. Bai et al. [13] later discussed the representation of DNA primary sequences by the same walk. Meanwhile, some investigators proposed several kinds of graphical representation of DNA sequences from different perspectives. For example, the G-curve and H-curve were first proposed by Hamori and Ruskin in 1983 [14]. R. Zhang and C. T. Zhang [15] proposed a representation of the DNA primary sequence termed the Z-curve. Several researchers in their recent studies have outlined different kinds of graphical representation of DNA sequences based on 2D [16–21], 3D [22–25], 4D [26], 5D [27], and 6D [28] spaces. Here we stress the Chaos Game Representation (CGR), which was proposed as a scale-independent representation for genomic sequences by Jeffrey [3] in 1990. Gao and Xu [29] pointed out that the CGR-walk model can easily generate a model sequence that is fitted reasonably well by a long-memory ARFIMA model. However, they treated the four bases equally and ignored the hidden chemical classification of nucleotides.

Motivated by the above work, we consider in this paper different classifications of the four bases according to their chemical structure and the strength of the hydrogen bond, that is, purine $R = \{A, G\}$ and pyrimidine $Y = \{C, T\}$; amino group $M = \{A, C\}$ and keto group $K = \{G, T\}$; weak H-bonds $W = \{A, T\}$ and strong H-bonds $S = \{C, G\}$. We then give three kinds of mapping from the four bases A, C, G, and T to the continuous space and reconstruct CGR-walk sequences based on CGR coordinates. In this way we convert a DNA sequence into a random numeric sequence and select some numerical characterizations of the random sequence as new invariants for the DNA sequence. As an application, we compare the similarity and dissimilarity of the first exon of the β-globin gene sequences derived from nine species.

2. CGR-Walk Based on Three Kinds of Classification and Primary Sequences

2.1. The CGR Space Proposed by Jeffrey

During the past several years, a new field of physics has developed, known as “nonlinear dynamics,” “chaotic dynamical systems,” or simply “chaos.” In fact, the technique of CGR, formally an iterative mapping, can be traced further back to the foundation of statistical mechanics, in particular, to chaos theory [2]. Based on the technique from chaotic dynamics, CGR produces a picture of gene sequence which displays both local and global patterns. The Chaos Game is an algorithm which allows one to produce pictures of fractal structures. Mathematically, it is described by an iterated function system (IFS).

The CGR space can be viewed as a continuous reference system in which every possible sequence of any length occupies a unique position. The position is determined by the four possible nucleotides, which are treated as the vertices of a binary square, so the representation is planar. Since a genetic sequence can be treated formally as a string composed of the four letters “A,” “C,” “G,” and “T” (or “U”), the binary CGR vertices are assigned to the four nucleotides as $A = (0, 0)$, $C = (0, 1)$, $G = (1, 1)$, $T = (1, 0)$. The CGR coordinates are calculated iteratively by moving a pointer to half the distance between the previous position and the vertex of the current base. For example, if “G” is the next base, then a point is plotted halfway between the previous point and the “G” corner. The iterated function can be given by
$$\mathrm{CGR}_i = \mathrm{CGR}_{i-1} + \frac{1}{2}\left(g_i - \mathrm{CGR}_{i-1}\right), \quad i = 1, \ldots, N, \quad \mathrm{CGR}_0 = (0.5, 0.5),$$
where
$$g_i = \begin{cases} (0, 0) & \text{if the } i\text{th base is A}, \\ (0, 1) & \text{if the } i\text{th base is C}, \\ (1, 1) & \text{if the } i\text{th base is G}, \\ (1, 0) & \text{if the } i\text{th base is T}. \end{cases}$$
We take the first 6 bases of the sequence of human β-globin in Table 1 as an example and present the above procedure in Figure 1.
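The iteration described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the usual corner assignment A = (0, 0), C = (0, 1), G = (1, 1), T = (1, 0) and a starting point at the centre of the square, and the function name `cgr_coordinates` and the short input string are hypothetical.

```python
# Assumed corner assignment of the CGR square (A, C, G, T).
VERTICES = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_coordinates(sequence, start=(0.5, 0.5)):
    """Return one CGR point per base of a DNA string.

    Each point lies halfway between the previous point and the corner
    of the current base. The default start (centre of the square) is
    an assumption of this sketch.
    """
    x, y = start
    points = []
    for base in sequence.upper():
        vx, vy = VERTICES[base]  # raises KeyError for non-ACGT symbols
        x = x + 0.5 * (vx - x)
        y = y + 0.5 * (vy - y)
        points.append((x, y))
    return points


if __name__ == "__main__":
    # Hypothetical three-base input, used only to show the mechanics.
    for base, point in zip("ATG", cgr_coordinates("ATG")):
        print(base, point)
```

Because every step halves the distance to a corner, the last point encodes the whole prefix read so far, which is why sequences of any length occupy distinct positions in the square.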

2.2. The Newly Proposed CGR Space

The aforementioned work treats the four nucleic acid bases equally. In this paper, however, we take the chemical structures of the four nucleic acid bases into consideration and adjust the classification based on the elements of the minor diagonal. In the CGR space proposed by Jeffrey, the elements of the minor diagonal are purine $R = \{A, G\}$ and the leading diagonal elements are pyrimidine $Y = \{C, T\}$. Considering amino group $M = \{A, C\}$ and keto group $K = \{G, T\}$, we get the second CGR space as shown in Figure 2. In the same way, according to the strength of the hydrogen bond, the bases can also be classified into weak H-bonds $W = \{A, T\}$ and strong H-bonds $S = \{C, G\}$, so the third kind of CGR space is obtained in Figure 3.
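The three spaces differ only in which pair of bases shares a diagonal of the square. The sketch below encodes one possible set of corner layouts and checks that property. The exact corner positions in Figures 2 and 3 are not recoverable from the text, so the `LAYOUTS` table is an assumption (A is kept at the origin and its class partner is placed at the opposite corner); `CLASSES`, `share_diagonal`, and `cgr_points` are hypothetical names.

```python
# Standard groupings of the four bases by chemical property.
CLASSES = {
    "RY": ({"A", "G"}, {"C", "T"}),  # purine / pyrimidine
    "MK": ({"A", "C"}, {"G", "T"}),  # amino / keto
    "WS": ({"A", "T"}, {"C", "G"}),  # weak / strong hydrogen bonds
}

# Assumed corner layouts: A stays at (0, 0); its class partner sits at (1, 1).
LAYOUTS = {
    "RY": {"A": (0, 0), "G": (1, 1), "C": (0, 1), "T": (1, 0)},
    "MK": {"A": (0, 0), "C": (1, 1), "G": (0, 1), "T": (1, 0)},
    "WS": {"A": (0, 0), "T": (1, 1), "C": (0, 1), "G": (1, 0)},
}


def share_diagonal(layout, pair):
    """True if the two bases in `pair` occupy opposite corners."""
    first, second = sorted(pair)
    (x1, y1), (x2, y2) = layout[first], layout[second]
    return x1 != x2 and y1 != y2


def cgr_points(sequence, layout, start=(0.5, 0.5)):
    """Half-distance CGR iteration for a given corner layout."""
    x, y = start
    points = []
    for base in sequence.upper():
        vx, vy = layout[base]
        x, y = x + 0.5 * (vx - x), y + 0.5 * (vy - y)
        points.append((x, y))
    return points


if __name__ == "__main__":
    for name, groups in CLASSES.items():
        ok = all(share_diagonal(LAYOUTS[name], group) for group in groups)
        print(name, "classes lie on diagonals:", ok)
        print(name, cgr_points("AC", LAYOUTS[name]))
```

The same input string traces a different path in each layout, which is what lets the three walks carry different class information.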

2.3. CGR-Walk Digital Sequence

Now we can obtain the mapping between DNA sequences and the CGR coordinates in a rectangular coordinate plane. For a DNA sequence, we define an equation as follows: where and are the x-coordinate and y-coordinate of CGR, respectively. Then we can get a data sequence . In this way, we convert a DNA sequence into a random walk sequence under three different patterns. Consistent with the above three figures, we call them CGR-RY-, CGR-MK-, and CGR-WS-walk sequences, respectively.

3. Numerical Characterization of DNA Sequences

Researchers from computer science and mathematics have been attracted to the comparison of DNA sequences. As pointed out in [13, 16–28], related work has made progress.

Now we may represent a DNA sequence by a random numerical sequence based on the CGR-walk technique. Gao and Xu [29] also substantially corroborated that long-range correlations are present in such data. In this paper, we explore the tendency of a data series by calculating the Hurst exponent [30]. Some work has also been done to study the relation between long-range correlation and the Hurst exponent [31]. In order to numerically characterize a DNA sequence given by the CGR, we treat the Hurst exponent as an efficient invariant that is sensitive to this kind of graphical representation.

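Because a DNA sequence can be regarded as an ordered string over the alphabet $\{A, C, G, T\}$, we represent a DNA sequence as a finite set with $N$ elements, denoted as $s_1 s_2 \cdots s_N$. For any time series $\{x_k\}_{k=1}^{N}$, one can define several quantities as follows [30]:
(i) the partial mean
$$\langle x \rangle_n = \frac{1}{n} \sum_{i=1}^{n} x_i,$$
(ii) the partial difference
$$X(i, n) = \sum_{u=1}^{i} \left[ x_u - \langle x \rangle_n \right],$$
(iii) the difference
$$R(n) = \max_{1 \le i \le n} X(i, n) - \min_{1 \le i \le n} X(i, n),$$
(iv) and the standard deviation
$$S(n) = \left[ \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \langle x \rangle_n \right)^2 \right]^{1/2}.$$

Hurst found that the rescaled range obeys the relation
$$\frac{R(n)}{S(n)} \sim \left( \frac{n}{2} \right)^{H},$$
where $H$ is called the Hurst exponent.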
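A rough way to estimate the Hurst exponent from a numerical sequence is sketched below. It assumes the standard rescaled range (R/S) definitions: the range of the cumulative deviations from the mean divided by the standard deviation, with the exponent read off as the slope of log(R/S) against log(n/2). The use of growing prefixes of the series, the least-squares fit, and the parameter `n_min` are choices of this sketch and may differ from the procedure actually used for Table 2.

```python
import math


def rescaled_range(series):
    """Return R(n)/S(n) for a series, or None if it is constant."""
    n = len(series)
    mean = sum(series) / n
    running, cumulative = 0.0, []
    for value in series:
        running += value - mean          # cumulative deviation X(i, n)
        cumulative.append(running)
    r = max(cumulative) - min(cumulative)
    s = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    return r / s if s > 0 else None


def hurst_exponent(series, n_min=8):
    """Least-squares slope of log(R/S) versus log(n/2) over prefixes.

    `n_min` (assumed default 8) skips very short prefixes, where the
    R/S statistic is unstable.
    """
    xs, ys = [], []
    for n in range(n_min, len(series) + 1):
        rs = rescaled_range(series[:n])
        if rs is not None and rs > 0:
            xs.append(math.log(n / 2))
            ys.append(math.log(rs))
    if len(xs) < 2:
        raise ValueError("series too short or constant")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


if __name__ == "__main__":
    ramp = [float(i) for i in range(1, 201)]  # strongly trending toy series
    print(hurst_exponent(ramp))
```

A steadily trending series such as the ramp gives a slope close to 1, the persistent end of the scale, whereas uncorrelated noise would be expected to give a smaller value.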

We can therefore compute the Hurst exponent of the RY-, MK-, and WS-CGR-walk sequences and characterize the coding sequences of the first exon of the β-globin gene of the nine species in Table 1. The results are listed in Table 2.

Besides, there are other numerical characterizations of random sequences, such as the mean, variance, mean square deviation, and so on. Here we choose the mean square deviation of CGR-walk sequence as follows: In (9) means the classification of RY-, MK-, and WS-sequences, and is the mean [13]. We then present the mean square deviations of three kinds of the CGR-walk sequences in Table 3.
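For illustration, a mean square deviation can be computed as below. This sketch assumes the population form, that is, the average squared deviation from the mean; whether the authors normalise by the sequence length or take a square root is not recoverable from the text here, so the function name and the normalisation are assumptions.

```python
def mean_square_deviation(series):
    """Average squared deviation from the mean (assumed population form)."""
    n = len(series)
    if n == 0:
        raise ValueError("empty series")
    mean = sum(series) / n
    return sum((value - mean) ** 2 for value in series) / n


if __name__ == "__main__":
    # Toy input only; real inputs would be the three CGR-walk sequences.
    print(mean_square_deviation([1.0, 2.0, 3.0, 4.0]))
```

Applying the function to the RY-, MK-, and WS-walk sequences of one gene would give the three numbers that make up that gene's descriptor.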

4. Similarity and Dissimilarity among the Coding Sequences of the First Exon of the β-Globin Gene of Nine Different Species

Here we construct two three-component vectors for each sequence, whose components are, respectively, the values of the Hurst exponent and of the mean square deviation for the RY-, MK-, and WS-CGR-walk sequences. The analysis of similarity/dissimilarity among DNA sequences represented by the three-component vectors is based on the assumption that two DNA sequences are similar if the corresponding vectors point in similar directions in 3D space. Alternatively, we can investigate the similarity among the vectors by calculating the Euclidean distance between their end points. The smaller the Euclidean distance is, the more similar the two corresponding DNA sequences are.
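Both similarity criteria are easy to state in code. The sketch below computes the Euclidean distance between end points and the cosine of the angle between two three-component vectors, then builds a pairwise distance table. The vectors in `EXAMPLE` are placeholder values chosen for illustration and are not taken from Tables 2 to 5; all function names are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Distance between the end points of two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_distances(vectors):
    """Map each unordered pair of names to its Euclidean distance."""
    names = sorted(vectors)
    return {
        (p, q): euclidean_distance(vectors[p], vectors[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }


# Placeholder descriptors (RY, MK, WS components); not data from the paper.
EXAMPLE = {
    "species_a": (0.60, 0.55, 0.70),
    "species_b": (0.61, 0.56, 0.69),
    "species_c": (0.80, 0.40, 0.50),
}

if __name__ == "__main__":
    table = pairwise_distances(EXAMPLE)
    closest = min(table.items(), key=lambda item: item[1])
    print("most similar pair:", closest[0])
```

The Euclidean distance is sensitive to both direction and magnitude, while the cosine compares direction only, so the two criteria need not rank pairs identically.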

In Tables 4 and 5, we list the Euclidean distances between the three-component vectors built from the Hurst exponent and from the mean square deviation, respectively. We observe that the smallest entry is always the human-gorilla pair. Furthermore, the largest entries are associated with the rows belonging to opossum (the most remote species from the remaining mammals) and gallus (the only nonmammalian representative). We believe that these results are not accidental, and they coincide with other results in [13, 16–28].

5. Conclusion

DNA sequences play an important role in modern biological research because all the hereditary and evolutionary information of a species is contained in these macromolecules. How to gain more information from these DNA sequences is still a very challenging question. Description, comparison, and similarity analysis of DNA sequences remain important tasks.

In this paper, we first construct three kinds of CGR spaces according to the elements of the minor diagonal, because the four bases can be classified into R-Y, M-K, and W-S according to their chemical structures. Then we describe a DNA sequence by a CGR-walk and convert it to a digital sequence, and we outline some efficient invariants of DNA sequences. As an application, we compare the similarity/dissimilarity of exon 1 of the β-globin genes of nine species. From the above tables, we can conclude that the results are consistent with known evolutionary facts. Therefore, the method proposed in the paper is visual and efficient.

On the one hand, our work can be treated as an effective application of CGR. On the other hand, our method is a valid supplement to the graphical representation of DNA sequences. In comparison with other graphical representations of biological sequences, our approach has the following advantages. (1) Our graphical representation based on CGR considers the chemical structure classification of the nucleotides and thus may provide more biological information. (2) It provides a simpler way of viewing, sorting, and comparing various gene structures, even for longer DNA sequences. (3) Our graph is more sensitive, so it can numerically characterize the DNA sequences in a more exact way.

Acknowledgments

The authors thank all the anonymous reviewers for their valuable suggestions and support. This research is supported by the National Science Foundation of China under Grants 11071146 and 10921101.