The Scientific World Journal

Volume 2012 (2012), Article ID 104269, 6 pages

http://dx.doi.org/10.1100/2012/104269

## Numerical Characterization of DNA Sequence Based on Dinucleotides

^{1}School of Mathematics and Statistics, Shandong University at Weihai, Weihai 264209, China

^{2}Department of Mathematics, West Virginia University, Morgantown, WV 26506, USA

^{3}School of IOT Engineering, Jiangnan University, Wuxi 214122, China

Received 4 November 2011; Accepted 26 December 2011

Academic Editors: S. Cacchione and A. Pask

Copyright © 2012 Xingqin Qi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Sequence comparison is a primary technique for the analysis of DNA sequences. To make quantitative comparisons, one devises mathematical descriptors that capture the essence of the base composition and distribution of the sequence. Alignment methods and graphical techniques (where each sequence is represented by a curve in high-dimensional Euclidean space) have long been popular. In this contribution we introduce a new nongraphical and nonalignment approach based on the frequencies of the dinucleotides $XY$ in DNA sequences. The most important feature of this method is that it identifies not only adjacent $XY$ pairs but also nonadjacent ones, where $X$ and $Y$ are separated by some number of nucleotides. This methodology preserves information in the DNA sequence that is ignored by other methods. We test our method on the coding regions of exon-1 of the β-globin gene for 11 species, and the utility of this new method is demonstrated.

#### 1. Introduction

The number of identifiable DNA sequences responsible for various physiological structures is rapidly increasing as more and more collected DNA sequences are added to scientific databases. It is, however, difficult to obtain information directly from sequences, since handling the sheer volume of data is computationally demanding. Analyzing the large volume of genomic DNA sequence data mathematically is one of the challenges for biologists. Many schemes have been proposed to numerically characterize DNA sequences.

Sequence alignment has been used as a very powerful tool for comparison of two closely related genomes at the base-by-base nucleotide sequence level. This method relies heavily on the orderings of nucleotides appearing in the sequence. With the divergence of species over time, though, genomic rearrangements and in particular genetic shuffling make sequence alignment unreliable or impossible.

Graphical techniques are another powerful tool for the analysis and visualization of DNA sequences. Graphical approaches can provide intuitive pictures or useful insights that assist the analysis of complicated relations between DNA sequences. This methodology starts with a graphical representation of the DNA sequence, which could be based on 2D, 3D, 4D, 5D, or 6D spaces, and represents DNA as matrices associated with the selected geometrical objects. Vectors composed of the invariants of these matrices are then used to compare DNA sequences; see [1–10]. Such schemes have the advantage that they offer an instant, albeit visual and qualitative, summary of lengthy DNA sequences. This approach also involves many unresolved questions. For example, how does one obtain suitable matrices to characterize DNA sequences, and how does one select invariants suitable for sequence comparison? In many cases, the calculation of the matrices or the invariants becomes more and more difficult as the length of the sequence grows. There are also approaches that arrive at a mathematical representation of DNA sequences in nongraphical ways; see [11–13]. More recently, a representation based on symbolic dynamics [14] and a representation based on a digital signal method [15] have also been proposed.

In this contribution, we introduce a novel nongraphical and nonalignment approach for DNA sequence comparison. We use the DNA sequence directly by considering the frequencies of dinucleotides. We represent each DNA sequence by a dinucleotide frequency matrix or by a dinucleotide frequency vector, and define one distance measurement on each. Comparisons between DNA sequences can then be carried out by calculating the distances between these mathematical descriptors. The most important feature of this method is that the mathematical descriptors take into consideration the frequencies not only of adjacent $XY$ pairs but also of nonadjacent $XY$ pairs. In this way, information contained in the relative spacing of nucleotides is preserved. The method is very simple and fast, and it requires neither sequence alignment nor a graphical representation of the sequence, both of which would involve complex calculations. It can be used to analyze both short and long DNA sequences. As an application, this method is tested on the exon-1 coding sequences of the β-globin gene for 11 species, and the results are consistent with what has been reported previously [5, 9, 12, 14, 15], which supports the utility of this new method.

#### 2. Dinucleotide Frequency Matrix and Dinucleotide Frequency Vector

Typically, DNA sequence data is represented as a string of the letters A, C, G, and T, which signify the four nucleotides adenine, cytosine, guanine, and thymine, respectively. There are 16 possible dinucleotides, that is, Ω = {AT, AA, AC, AG, TT, TA, TC, TG, GT, GA, GC, GG, CT, CA, CC, CG}. In the following, we always use $XY$ to represent dinucleotides, and note that the dinucleotide $XY$ is distinguished from $YX$.

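Let $S$ be a sequence of length $n$ and denote the number of occurrences of the adjacent dinucleotide $XY$ in $S$ by $N_{XY}$. Clearly, if $S$ is a sequence of length $n$, then $\sum_{XY \in \Omega} N_{XY} = n - 1$. The occurrence frequency of $XY$ is defined as

$$f_{XY} = \frac{N_{XY}}{n - 1}.$$

We get one 16-dimensional vector associated with sequence $S$ based on adjacent dinucleotides, with the components listed in the order of Ω:

$$F(S) = \left(f_{AT}, f_{AA}, f_{AC}, \ldots, f_{CG}\right).$$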

Notice that there would be a loss of information when one condenses a sequence to a single 16-dimensional vector. A way to recover some of the lost information is to introduce additional 16-dimensional vectors that store the frequency information of $XY$ pairs when $X$ and $Y$ are not adjacent but are separated at various distances. For example, if $S =$ ATCGATC, the *adjacent* dinucleotides are AT, TC, CG, GA with occurrence frequencies $2/6$, $2/6$, $1/6$, and $1/6$, respectively. The dinucleotides *at distance 2* (i.e., separated by one nucleotide) in $S$ are AC, TG, CA, GT with occurrence frequencies $2/5$, $1/5$, $1/5$, and $1/5$, respectively. Such additional 16-dimensional vectors contain information beyond that found in the initial dinucleotide vector.

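Generally, let $S$ be a sequence of length $n$. Denote by $N^{k}_{XY}$ the number of occurrences of $XY$ in $S$ when $X$ and $Y$ are at distance $k$, that is, separated by $k-1$ nucleotides. Clearly, $\sum_{XY \in \Omega} N^{k}_{XY} = n - k$. Define $f^{k}_{XY} = N^{k}_{XY}/(n-k)$ as the occurrence frequency. For each given integer $k$, we get one 16-dimensional vector associated with sequence $S$:

$$F^{k}(S) = \left(f^{k}_{AT}, f^{k}_{AA}, f^{k}_{AC}, \ldots, f^{k}_{CG}\right).$$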
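The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `distance_k_frequencies` and `nonzero` are hypothetical, the input is assumed to be an uppercase string over A, C, G, T, and `k` is taken to be the offset between the two positions (so `k=1` means adjacent).

```python
from collections import Counter
from fractions import Fraction

# The 16 dinucleotides, in the order in which the text lists the set Omega.
OMEGA = ["AT", "AA", "AC", "AG", "TT", "TA", "TC", "TG",
         "GT", "GA", "GC", "GG", "CT", "CA", "CC", "CG"]

def distance_k_frequencies(seq, k=1):
    """Frequencies of the pairs (seq[i], seq[i + k]), ordered as in OMEGA."""
    n = len(seq)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(seq)")
    # There are n - k pairs of positions at offset k.
    counts = Counter(seq[i] + seq[i + k] for i in range(n - k))
    return [Fraction(counts[xy], n - k) for xy in OMEGA]

def nonzero(seq, k):
    """Only the dinucleotides that actually occur, for easier reading."""
    freqs = distance_k_frequencies(seq, k)
    return {xy: f for xy, f in zip(OMEGA, freqs) if f}

print(nonzero("ATCGATC", 1))  # adjacent pairs
print(nonzero("ATCGATC", 2))  # pairs separated by one nucleotide
```

Exact fractions are used here only so that the small example can be checked by hand; floating-point division would serve equally well for real sequences.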

The distance between $X$ and $Y$ could be 1, 2, or even larger integers. When we scan sequence $S$ to count the occurrences of dinucleotides $XY$ at distance $k$, the nucleotides of $S$ from position 1 to $n-k$ are counted as “$X$”, while the nucleotides of $S$ from position $k+1$ to $n$ are counted as “$Y$”. When $k < n/2$, the two intervals $[1, n-k]$ and $[k+1, n]$ overlap, which means the nucleotides in the overlapping interval $[k+1, n-k]$ are counted as both $X$ and $Y$; but if $k > n/2$, the two intervals are disjoint, and the information of the nucleotides in the interval $[n-k+1, k]$ is lost. So in the following, to avoid loss of information, $k$ must not be larger than $n/2$, that is, $k \le n/2$. Furthermore, to make the information in $F^{k}(S)$ more accurate, we want the overlapping interval, which contains $n-2k$ nucleotides, to be large enough. Based on this intuition, we prefer those $k$ such that $k \le n/4$, which guarantees that at least half of the nucleotides in sequence $S$ are counted as both $X$ and $Y$. So $k$ is restricted to $1 \le k \le n/4$ for each DNA sequence of length $n$.
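The interval argument can be checked numerically. The sketch below is illustrative only (the helper names `scan_ranges` and `largest_k_with_half_overlap` are hypothetical); it uses 1-based inclusive positions and assumes the pair at distance k occupies positions i and i + k.

```python
def scan_ranges(n, k):
    """Return the position ranges counted as X and as Y at distance k,
    the size of their overlap, and the number of positions used as neither."""
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    x_range = (1, n - k)          # first letters of the pairs
    y_range = (k + 1, n)          # second letters of the pairs
    overlap = max(0, n - 2 * k)   # positions k+1 .. n-k, used as X and as Y
    unused = max(0, 2 * k - n)    # positions n-k+1 .. k, used as neither
    return x_range, y_range, overlap, unused

def largest_k_with_half_overlap(n):
    """Largest k whose overlap covers at least half the sequence (0 if none)."""
    return max((k for k in range(1, n) if 2 * scan_ranges(n, k)[2] >= n),
               default=0)

print(scan_ranges(7, 2))   # small k: ranges overlap, nothing is skipped
print(scan_ranges(7, 5))   # k > n/2: ranges are disjoint, middle is skipped
print(largest_k_with_half_overlap(92), 92 // 4)
```

The brute-force search agrees with integer division by 4, which is the bound on $k$ that the half-overlap condition leads to.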

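Let $S$ be a DNA sequence of length $n$. For a given integer $K$ with $1 \le K \le n/4$, the *dinucleotide frequency matrix* associated with $S$ is defined as

$$M_K(S) = \begin{pmatrix} F^{1}(S) \\ F^{2}(S) \\ \vdots \\ F^{K}(S) \end{pmatrix},$$

where $F^{k}(S)$ is the 16-dimensional occurrence frequency vector when $X$ and $Y$ are at distance $k$ (separated by $k-1$ nucleotides). The size of matrix $M_K(S)$ is $K \times 16$.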

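We also present another mathematical descriptor associated with $S$, named the *dinucleotide frequency vector*, which is defined as

$$V_K(S) = \left(F^{1}(S), F^{2}(S), \ldots, F^{K}(S)\right);$$

then $V_K(S)$ is a $16K$-dimensional row vector.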
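As a rough sketch of the two descriptors, the matrix can be built as a list of K rows of 16 frequencies and the vector as the concatenation of those rows. The row-per-distance layout and the names `frequency_row`, `frequency_matrix`, and `frequency_vector` are assumptions made for this illustration.

```python
from collections import Counter

# The 16 dinucleotides, in the order in which the text lists the set Omega.
OMEGA = ["AT", "AA", "AC", "AG", "TT", "TA", "TC", "TG",
         "GT", "GA", "GC", "GG", "CT", "CA", "CC", "CG"]

def frequency_row(seq, k):
    """16 frequencies of the pairs at positions i and i + k."""
    pairs = len(seq) - k
    counts = Counter(seq[i] + seq[i + k] for i in range(pairs))
    return [counts[xy] / pairs for xy in OMEGA]

def frequency_matrix(seq, K):
    """K x 16 nested list: row k - 1 holds the frequencies at distance k."""
    if not 1 <= K < len(seq):
        raise ValueError("K must satisfy 1 <= K < len(seq)")
    return [frequency_row(seq, k) for k in range(1, K + 1)]

def frequency_vector(seq, K):
    """Flat list of length 16 * K: the rows of the matrix, concatenated."""
    return [f for row in frequency_matrix(seq, K) for f in row]

# A made-up sequence of length 10, so K = 2 respects K <= n / 4.
matrix = frequency_matrix("ATCGATCGGA", 2)
vector = frequency_vector("ATCGATCGGA", 2)
print(len(matrix), len(matrix[0]), len(vector))
```

Both descriptors carry the same numbers; they differ only in shape, which is why each can be paired with its own distance measure.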

#### 3. Two Distance Measurements Based on Dinucleotide Frequency

From Section 2, we get correspondences between a DNA sequence $S$ and its dinucleotide frequency matrix $M_K(S)$ and dinucleotide frequency vector $V_K(S)$. Note that the sizes of $M_K(S)$ and $V_K(S)$ both depend on $K$. To make comparisons within a set of DNA sequences meaningful, we should use an identical $K$ for all of them. Denote the set of DNA sequences by $\{S_1, S_2, \ldots, S_m\}$. Following the discussion in Section 2, we define the identical $K$ as

$$K = \left\lfloor \frac{1}{4} \min_{1 \le i \le m} n_i \right\rfloor, \tag{7}$$

where $n_i$ is the length of $S_i$. This choice of $K$ guarantees that both the frequency matrix and the frequency vector carry sufficiently accurate information, and that the dinucleotide frequency matrices and dinucleotide frequency vectors associated with the sequences in the set all have the same size. Comparisons of DNA sequences can then be completed by studying their corresponding matrices and vectors. In the following we introduce two different distance measurements, based on the dinucleotide frequency matrix and the dinucleotide frequency vector, respectively.
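A small sketch of the choice of a common $K$ follows. It assumes that $K$ is the integer part of a quarter of the shortest sequence length, as the restriction $k \le n/4$ from Section 2 suggests; the function name `common_K` and the example lengths are hypothetical and are not taken from the paper's data.

```python
def common_K(sequences):
    """Largest K that satisfies K <= n / 4 for every sequence in the set."""
    if not sequences:
        raise ValueError("need at least one sequence")
    return min(len(s) for s in sequences) // 4

# Placeholder sequences that only stand in for lengths 86, 92 and 105.
toy_set = ["A" * 86, "A" * 92, "A" * 105]
K = common_K(toy_set)
print(K, 16 * K)  # common K and the resulting vector dimension 16 * K
```

Because the shortest sequence decides $K$, every longer sequence is described with fewer distances than its own length would allow; that is the price of descriptors of equal size.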

##### 3.1. City Block Distance for Dinucleotide Frequency Matrix

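Given two DNA sequences $S_1$ and $S_2$, we get the dinucleotide frequency matrices $M_K(S_1)$ and $M_K(S_2)$ as in Section 2, so that the comparison between $S_1$ and $S_2$ becomes a comparison between $M_K(S_1)$ and $M_K(S_2)$. Using this, we define the city block distance between $S_1$ and $S_2$ as

$$d_1(S_1, S_2) = \sum_{k=1}^{K} \sum_{XY \in \Omega} \left| f^{k}_{XY}(S_1) - f^{k}_{XY}(S_2) \right|.$$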
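The city block distance is the sum of absolute entrywise differences. A minimal sketch, assuming the two matrices are given as equally shaped nested lists (the function name is hypothetical, and the toy matrices below are far smaller than a real $K \times 16$ descriptor):

```python
def city_block_distance(m1, m2):
    """Sum of |a - b| over all corresponding entries of two matrices."""
    same_shape = len(m1) == len(m2) and all(
        len(r1) == len(r2) for r1, r2 in zip(m1, m2))
    if not same_shape:
        raise ValueError("matrices must have the same shape")
    return sum(abs(a - b)
               for r1, r2 in zip(m1, m2)
               for a, b in zip(r1, r2))

# Toy 2 x 2 "frequency matrices": each row sums to 1, as a frequency row does.
m_a = [[0.5, 0.5], [1.0, 0.0]]
m_b = [[0.25, 0.75], [0.0, 1.0]]
print(city_block_distance(m_a, m_b))  # 0.25 + 0.25 + 1.0 + 1.0 = 2.5
```

Since every row of a frequency matrix sums to 1, each row contributes at most 2 to the total, so the distance between two $K$-row descriptors lies between 0 and $2K$.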

##### 3.2. Cosine Distance for Dinucleotide Frequency Vector

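We also obtain a mapping from a DNA sequence to a vector in the $16K$-dimensional linear space. Comparison between DNA sequences can thus also become comparison between these $16K$-dimensional vectors. This is based on the assumption that two DNA sequences are similar if the corresponding vectors in the $16K$-dimensional space have similar directions. Given two DNA sequences $S_1$ and $S_2$ with dinucleotide frequency vectors $V_K(S_1)$ and $V_K(S_2)$, we define the cosine distance between $S_1$ and $S_2$ as

$$d_2(S_1, S_2) = 1 - \cos\left(V_K(S_1), V_K(S_2)\right), \qquad \cos\left(V_K(S_1), V_K(S_2)\right) = \frac{V_K(S_1) \cdot V_K(S_2)}{\lVert V_K(S_1) \rVert \, \lVert V_K(S_2) \rVert},$$

where $\cos\left(V_K(S_1), V_K(S_2)\right)$ is the cosine of the included angle between the vectors $V_K(S_1)$ and $V_K(S_2)$.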
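A minimal sketch of a cosine distance, assuming the common convention of one minus the cosine of the angle between the two vectors; the function name is hypothetical and the inputs are plain lists of numbers.

```python
import math

def cosine_distance(v1, v2):
    """1 - cos(angle between v1 and v2); 0 means identical direction."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return 1.0 - dot / (norm1 * norm2)

print(cosine_distance([1, 2, 3], [2, 4, 6]))  # parallel vectors, close to 0
print(cosine_distance([1, 0], [0, 1]))        # orthogonal vectors
```

Frequency vectors have no negative entries, so the cosine lies between 0 and 1 and this distance stays in the same range. It is also unaffected by rescaling either vector, unlike the city block distance.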

#### 4. Applications and Experimental Results

##### 4.1. Experimental Results

A comparison between a pair of DNA sequences to judge their similarity or dissimilarity can be carried out by calculating the distance $d_1$ or $d_2$. The smaller the distance, the more similar the two DNA sequences are. (The code is available on request.)
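Building a similarity/dissimilarity table from precomputed descriptors might look like the sketch below. This is not the authors' code: the names `pairwise_table` and `closest_pair` are hypothetical, the three descriptors are tiny made-up vectors, and any distance function taking two descriptors can be plugged in.

```python
from itertools import combinations

def city_block(u, v):
    """City block distance between two flat descriptors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def pairwise_table(descriptors, dist):
    """Symmetric table of distances between all named descriptors."""
    names = list(descriptors)
    table = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        table[a][b] = table[b][a] = dist(descriptors[a], descriptors[b])
    return table

def closest_pair(table):
    """The pair of distinct names with the smallest distance."""
    return min((table[a][b], a, b) for a, b in combinations(table, 2))[1:]

# Made-up descriptors; real ones would have 16 * K entries per sequence.
toy = {"s1": [0.5, 0.5, 0.0], "s2": [0.5, 0.25, 0.25], "s3": [0.0, 0.0, 1.0]}
table = pairwise_table(toy, city_block)
print(table["s1"]["s2"], table["s1"]["s3"], table["s2"]["s3"])
print(closest_pair(table))
```

Only the upper triangle is computed, and the table is mirrored, because both distance measures are symmetric.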

To test the utility of the above method, we compare the coding regions of exon-1 of the β-globin gene for 11 different species, which were also studied by Randić et al. in [12]. Table 1 presents their accession numbers in the NCBI database, while Table 2 lists the 11 coding sequences themselves.

First, we present the similarity/dissimilarity matrix based on the distance measurement $d_1$; see Table 3. Examining this table, we notice that small entries are consistently associated with the pairs (human, chimpanzee), (human, gorilla), and (gorilla, chimpanzee). That means the more similar species pairs are human-gorilla, human-chimpanzee, and gorilla-chimpanzee. We also observe that the largest entry is associated with gallus and lemur, and that the larger entries appear in the rows belonging to gallus and opossum. This is consistent with the facts that gallus is the only nonmammalian species among these 11 species and that opossum is the most remote species from the remaining mammals. These observations are consistent with the results reported in previous studies [5, 9, 12], which were determined by matrix invariant techniques, and also with the results reported from nongraphical means [14, 15]. More interestingly, the distance between goat and bovine is actually the smallest entry in Table 3. This implies that goat and bovine are regarded as very similar to each other by our method, which is consistent with their biological taxonomy: bovine and goat are both even-toed ungulates and belong to the family Bovidae.

Table 4 presents the similarity/dissimilarity matrix based on the distance measurement $d_2$. Small entries are again associated with the pairs (human, chimpanzee), (human, gorilla), and (gorilla, chimpanzee). We find that the largest entry is associated with (gallus, lemur), and the rows corresponding to gallus and opossum have larger entries, which is also consistent with the facts that gallus is the only nonmammalian species among these 11 species and that opossum is the most remote species from the remaining mammals. The observations from Table 4 are consistent with the previously reported results in [5, 9, 12, 14, 15] as well. The distance between goat and bovine is also very small, as we expect.

We can see that there is an overall qualitative agreement between Tables 3 and 4. To see this visually, we set the degree of dissimilarity/similarity of the pair human-gorilla to 1 in each table; the resulting degrees of dissimilarity/similarity between human and several other species under the two distance measurements are shown in Figure 1. The trends of the two curves are almost the same, which demonstrates the overall agreement between the dissimilarities/similarities obtained by these two distance methods.
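The rescaling behind such a comparison can be sketched as follows, assuming that "setting a pair to 1" means dividing every distance by the distance of the reference pair. The function name, the labels, and all numbers below are placeholders chosen for illustration; they are not values from Tables 3 or 4.

```python
def relative_to_reference(distances, reference):
    """Divide every distance by the distance of the reference entry."""
    ref = distances[reference]
    if ref == 0:
        raise ValueError("reference distance must be nonzero")
    return {name: d / ref for name, d in distances.items()}

# Placeholder distances from one query sequence under two measures.
under_d1 = {"reference": 0.5, "species_a": 1.0, "species_b": 2.5}
under_d2 = {"reference": 0.25, "species_a": 0.5, "species_b": 1.5}

print(relative_to_reference(under_d1, "reference"))
print(relative_to_reference(under_d2, "reference"))
```

After rescaling, the two measures share a common unit, so their curves can be drawn on one plot even though their raw ranges differ.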

##### 4.2. Discussion

For the above exon-1 coding data of 11 species, $K$ is chosen to be 21 according to (7). A 336-dimensional vector ($16 \times 21 = 336$) is therefore used to characterize each DNA sequence under the second distance measure. To confirm the efficacy of the vectors constructed in this high-dimensional data representation, we perform principal component analysis (PCA) on these 336 parameters. Figure 2(a) shows the projection of the 11 vectors onto a 2D property space spanned by the top two principal components, PC1 and PC2. We can see that in the 2D space, gallus and opossum are furthest from the other 9 species, while human, chimpanzee, and gorilla are very close to each other. These results are consistent with what we obtained from Table 4. Note that these top two principal components account for 48% of the total information (see Figure 2(b)). Some information is lost in the projection; for example, bovine seems much closer to rabbit than to goat in the 2D projection, but Table 4 shows this is not the case when all 336 parameters are considered. However, this rough approximation confirms that our mathematical descriptor characterizes DNA sequence structure effectively.
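The projection step can be sketched without external libraries. The code below assumes ordinary PCA on mean-centered descriptor vectors and, purely as a choice for this dependency-free example, finds the leading components by power iteration on the Gram matrix of the samples; it is not necessarily how the authors computed Figure 2, and all function names are hypothetical.

```python
import math

def center(rows):
    """Subtract the column means so that every coordinate has mean zero."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def gram(rows):
    """Matrix of dot products between samples (n x n for n samples)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]

def power_iteration(mat, iters=1000):
    """Dominant eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    vec = [1.0 / (i + 1) for i in range(len(mat))]  # arbitrary start vector
    for _ in range(iters):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in mat]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            return 0.0, vec
        vec = [x / norm for x in nxt]
    applied = [sum(m * v for m, v in zip(row, vec)) for row in mat]
    return sum(v * w for v, w in zip(vec, applied)), vec

def pca_scores(rows, n_components=2):
    """Sample coordinates on the top components and their variance shares."""
    g = gram(center(rows))
    n = len(g)
    total = sum(g[i][i] for i in range(n))
    scores, ratios = [], []
    for _ in range(n_components):
        lam, u = power_iteration(g)
        scores.append([math.sqrt(max(lam, 0.0)) * x for x in u])
        ratios.append(lam / total if total else 0.0)
        # Deflation: remove the component just found before the next pass.
        g = [[g[i][j] - lam * u[i] * u[j] for j in range(n)] for i in range(n)]
    return scores, ratios

# Made-up 3-dimensional descriptors for four samples.
toy_rows = [[0.2, 0.5, 0.3], [0.1, 0.6, 0.3], [0.4, 0.4, 0.2], [0.3, 0.3, 0.4]]
scores, ratios = pca_scores(toy_rows)
print(ratios)  # share of total variance carried by PC1 and PC2
```

Working with the Gram matrix keeps the eigenproblem at the number of sequences rather than the descriptor dimension, which is convenient when there are few samples and many coordinates. Power iteration converges slowly when two eigenvalues are close, so a library routine would be preferable for real data.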

#### 5. Conclusion

In this paper, we have presented a new method based on dinucleotide frequencies for DNA sequence comparison. The dinucleotide frequency matrix and the dinucleotide frequency vector are used to mathematically characterize a DNA sequence. The most important feature of this method is that the mathematical descriptors involve the frequencies not only of adjacent $XY$ pairs but also of nonadjacent $XY$ pairs (i.e., when $X$ and $Y$ are separated by various numbers of nucleotides), so that much important information is retained rather than lost. This new method requires neither sequence alignment nor a graphical representation of the sequence, and so avoids the complex calculations that either would involve. The method is very simple and fast, and it can be used to analyze both short and long DNA sequences efficiently.

#### Acknowledgments

This work is supported partly by Shandong Province Natural Science Foundation of China with no. ZR2010AQ018 and no. ZR2011FQ010 and partly by Independent Innovation Foundation of Shandong University with no. 2010ZRJQ005. This project also has been partially supported by a WV EPSCoR Grant and an NSA Grant H98230-12-1-0233.

#### References

1. E. Hamori and J. Ruskin, “H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences,” *Journal of Biological Chemistry*, vol. 258, no. 2, pp. 1318–1327, 1983.
2. A. Nandy, “A new graphical representation and analysis of DNA sequence structure: I. Methodology and Application to Globin Genes,” *Current Science*, vol. 66, pp. 309–314, 1994.
3. M. Randić, M. Vračko, A. Nandy, and S. C. Basak, “On 3-D graphical representation of DNA primary sequences and their numerical characterization,” *Journal of Chemical Information and Computer Sciences*, vol. 40, no. 5, pp. 1235–1244, 2000.
4. Y. Zhang, B. Liao, and K. Ding, “On 2D graphical representation of DNA sequence of nondegeneracy,” *Chemical Physics Letters*, vol. 411, no. 1–3, pp. 28–32, 2005.
5. M. Randić, M. Vračko, N. Lerš, and D. Plavšić, “Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation,” *Chemical Physics Letters*, vol. 371, no. 1–2, pp. 202–207, 2003.
6. B. Liao and T. M. Wang, “3-D graphical representation of DNA sequences and their numerical characterization,” *Journal of Molecular Structure: THEOCHEM*, vol. 681, no. 1–3, pp. 209–212, 2004.
7. Y. Zhang, B. Liao, and K. Ding, “On 3DD-curves of DNA sequences,” *Molecular Simulation*, vol. 32, no. 1, pp. 29–34, 2006.
8. R. Chi and K. Ding, “Novel 4D numerical representation of DNA sequences,” *Chemical Physics Letters*, vol. 407, no. 1–3, pp. 63–67, 2005.
9. B. Liao, R. Li, W. Zhu, and X. Xiang, “On the similarity of DNA primary sequences based on 5-D representation,” *Journal of Mathematical Chemistry*, vol. 42, no. 1, pp. 47–57, 2007.
10. B. Liao and T. M. Wang, “Analysis of similarity/dissimilarity of DNA sequences based on nonoverlapping triplets of nucleotide bases,” *Journal of Chemical Information and Computer Sciences*, vol. 44, no. 5, pp. 1666–1670, 2004.
11. M. Randić, “Condensed representation of DNA primary sequences,” *Journal of Chemical Information and Computer Sciences*, vol. 40, no. 1, pp. 50–56, 2000.
12. M. Randić, X. Guo, and S. C. Basak, “On the characterization of DNA primary sequences by triplet of nucleic acid bases,” *Journal of Chemical Information and Computer Sciences*, vol. 41, no. 3, pp. 619–626, 2001.
13. Y. Zhang, “A simple method to construct the similarity matrices of DNA sequences,” *Match*, vol. 60, no. 2, pp. 313–324, 2008.
14. S. Wang, F. Tian, W. Feng, and X. Liu, “Applications of representation method for DNA sequences based on symbolic dynamics,” *Journal of Molecular Structure: THEOCHEM*, vol. 909, no. 1–3, pp. 33–42, 2009.
15. Z. H. Qi and X. Q. Qi, “Numerical characterization of DNA sequences based on digital signal method,” *Computers in Biology and Medicine*, vol. 39, no. 4, pp. 388–391, 2009.