Abstract

Sequence comparison is a primary technique for the analysis of DNA sequences. In order to make quantitative comparisons, one devises mathematical descriptors that capture the essence of the base composition and distribution of the sequence. Alignment methods and graphical techniques (where each sequence is represented by a curve in high-dimension Euclidean space) have been used popularly for a long time. In this contribution we will introduce a new nongraphical and nonalignment approach based on the frequencies of the dinucleotide 𝑋𝑌 in DNA sequences. The most important feature of this method is that it not only identifies adjacent 𝑋𝑌 pairs but also nonadjacent 𝑋𝑌 ones where 𝑋 and 𝑌 are separated by some number of nucleotides. This methodology preserves information in DNA sequence that is ignored by other methods. We test our method on the coding regions of exon-1 of 𝛽–globin for 11 species, and the utility of this new method is demonstrated.

1. Introduction

The number of identifiable DNA sequences responsible for various physiological structures is rapidly increasing as more and more collected DNA sequences are added to scientific databases. It is, however, difficult to obtain information directly from sequences since the sheer volume of data is computational demanding. It is one of the challenges for biologists to analyze mathematically the large volume of genomic DNA sequence data. Many schemes have been proposed to numerically characterize DNA sequences.

Sequence alignment has been used as a very powerful tool for comparison of two closely related genomes at the base-by-base nucleotide sequence level. This method relies heavily on the orderings of nucleotides appearing in the sequence. With the divergence of species over time, though, genomic rearrangements and in particular genetic shuffling make sequence alignment unreliable or impossible.

Graphical techniques are another powerful tool for the analysis and visualization of DNA sequences. Using graphical approaches can provide intuitive pictures or useful insights that assist the analysis of complicated relations between DNA sequences. This methodology starts with a graphical representation of DNA sequence which could be based on 2D, 3D, 4D, 5D, and 6D spaces and represents DNA as matrices by associating with the selected geometrical objects, then vectors composed of the invariants of matrices will be used to compare DNA sequences, see [110]. Such schemes have an advantage in that they offer an instant, though, visual and qualitative summary of the lengthy DNA sequences. This approach also involves many unresolved questions. For example, how does one obtain suitable matrices to characterize DNA sequences and how are invariants selected suitable for sequence comparisons? In many cases, the calculation of the matrices or the invariants will become more and more difficult with the length of the sequence. There are also approaches which could arrive a mathematical representation of DNA sequences by nongraphical ways, see [1113]. And more recently, a new representation based on symbolic dynamics [14] and a new representation based on digital signal method [15] are also illustrated.

In this contribution, we introduce a novel nongraphical and nonalignment approach for DNA sequence comparison. We use DNA sequence directly by considering the frequencies of dinucleotide. We represent each DNA sequence by a dinucleotide frequency matrix or by a dinucleotide frequency vector, based on which two distance measurements are defined, respectively. Then comparisons between DNA sequences could be carried out by calculating the distances between these mathematical descriptors. The most important feature of this method is that the mathematical descriptors not only take into consideration the frequencies of adjacent 𝑋𝑌 pairs but also of nonadjacent 𝑋𝑌 pairs. In this way, information contained in the relative spacing of nucleotides is preserved. The method is very simple and fast, and does not require sequence alignment or sequence graphical representation which would cause complex calculations. It can be used to analyze both short and long DNA sequences. As an application, this method is tested on the exon-1 coding sequences of 𝛽-globin for 11 species and the results are consistent with what have been reported previously [5, 9, 12, 14, 15], which prove the utility of this new method.

2. Dinucleotide Frequency Matrix and Dinucleotide Frequency Vector

Typically, DNA sequence data is represented as a string of letters A, C, G, and T, which signify the four nucleotides: adenine, cytosine, guanine, and thymine, respectively. There are 16 possible dinucleotides, that is, Ω = { AT, AA, AC, AG, TT, TA, TC, TG, GT, GA, GC, GG, CT, CA, CC, CG}. In the following, we always use 𝑋𝑌 to represent dinucleotides, and note that dinucleotide 𝑋𝑌 is distinguished from.

Let 𝑠 be a sequence of length 𝑛 and denote the number of occurrences of adjacent 𝑋𝑌 in 𝑠 by 𝑌(1). Clearly, if 𝑠 is a sequence of length, then 𝑋𝑌Ω𝑋𝑌(1)=𝑛1. The occurrence frequency for 𝑋𝑌 is defined as𝑓(1)𝑋𝑌=𝑋𝑌(1)(𝑛1).(1) We get one 16-dimensional vector 𝑓(1) associated with sequence 𝑠 based on adjacent dinucleotides:𝑓(1)=𝑓(1)AT,𝑓(1)AA,𝑓(1)AC,,𝑓(1)CT,𝑓(1)CA,𝑓(1)CC,𝑓(1)CG.(2)

Notice that there would be a loss of information when one condenses sequence 𝑠 to a single 16-dimensional vector. A way to recover some of the lost information associated with a sequence 𝑠 to a single 16-vector is to introduce additional 16 vectors to store the frequency information of pairs 𝑋𝑌 when 𝑋 and 𝑌 are not adjacent but are separated at various distance. For example, if 𝑠=ATCGATC, the adjacent dinucleotides are AT, TC, CG, GA with occurrence frequency 2/6, 2/6, 1/6, and 1/6, respectively. The dinucleotides at distance 2 (i.e., separated by one nucleotide) in 𝑠 are AC, TG, CA, GT, AC with occurrence frequency 2/5, 1/5, 1/5, and 1/5, respectively. These two 16-dimensional vectors will contain additional information beyond that found in the initial dinucleotide vector.

Generally, let 𝑠 be a sequence of length. Denote 𝑋𝑌(𝑑) as the number of occurrence of 𝑋𝑌 in 𝑠 when 𝑋 and 𝑌 are separated by 𝑑1 nucleotides. Clearly, 𝑋𝑌Ω𝑋𝑌(𝑑)=𝑛𝑑. Define𝑓(𝑑)𝑋𝑌=𝑋𝑌(𝑑)(𝑛𝑑),(3) as the occurrence frequency. For each given integer, we could get one 16-dimensional vector 𝑓(𝑑) associated with sequence 𝑠:𝑓(𝑑)=𝑓(𝑑)AT,𝑓(𝑑)AA,𝑓(𝑑)AC,,𝑓(𝑑)CT,𝑓(𝑑)CA,𝑓(𝑑)CC,𝑓(𝑑)CG.(4)

The distance 𝑑 between 𝑋 and 𝑌 could be 1, 2 or even larger integers. When we scan sequence 𝑠 to count the occurrence of dinucleotides 𝑋𝑌 at distance, the nucleotides of 𝑠 from position 1 to (𝑛𝑑) are counted as “𝑋”, while the nucleotides of 𝑠 from position (𝑑+1) to 𝑛 are counted as “𝑌”. When 𝑑(𝑛1)/2, there is an overlapping interval [𝑑+1,𝑛𝑑] between the two intervals [1,𝑛𝑑] and [𝑑+1,𝑛], which means the nucleotides in the overlapping interval will counted as both 𝑋 and 𝑌; but if 𝑑>(𝑛1)/2, the two intervals [1,𝑛𝑑] and [𝑑+1,𝑛] will disjoint, and the information of these nucleotides in the interval [𝑛𝑑+1,𝑑] will be lost. So in the following, to avoid loss of information, 𝑑 must not be larger than (𝑛1)/2, that is, 𝑑(𝑛1)/2. Furthermore, to make the information in 𝑓(𝑑) more accurate, we hope that the overlapping interval [𝑑+1,𝑛𝑑] will be large enough. Based on this intuition, we would prefer to these 𝑑 such that (𝑛2𝑑)/𝑛50%, which guarantees that more than half of the nucleotides in sequence 𝑠 will be counted as both 𝑋 and 𝑌. So 𝑑 is restricted to 𝑑𝑛/4 for each DNA sequence 𝑠 with length.

Let 𝑠 be a DNA sequence of length, for a given 𝑑𝑛/4, the dinucleotide frequency matrix associated with 𝑠 is defined as𝑓𝐹(𝑠)=(1)𝑓(2)𝑓(3)𝑓(𝑑),(5) where 𝑓(𝑖) is the 16-dimensional occurrence frequency vector when 𝑋 and 𝑌 are separated by (𝑖1) nucleotides. The size of matrix 𝐹(𝑠) is 𝑑×16.

We also present another mathematical descriptor associated with 𝑠 named dinucleotide frequency vector which is defined as𝑓𝐹(𝑠)=(1),𝑓(2),𝑓(3)𝑓,,(𝑑),(6) then 𝐹(𝑠) is a 1×16𝑑 row vector.

3. Two Distance Measurements Based on Dinucleotide Frequency

From Section 2, we get correspondences between one DNA sequence 𝑠 and the dinucleotide frequency matrix 𝐹(𝑠) and the dinucleotide frequency vector 𝐹(𝑠). Note that the sizes of 𝐹(𝑠) and 𝐹(𝑠) all depend on. To make the comparisons for a set of DNA sequences meaningful, we should use an identical 𝑑 for all these DNA sequences. Denote the set of DNA sequences by, by the discussion in Section 2, we define the identical 𝑑0 as𝑑0=min𝑠𝑆(|𝑠|)4,(7) where |𝑠| is the length of 𝑠. The choice of 𝑑0 will guarantee that either the frequency matrix or the frequency vector will involve enough accurate information, and the dinucleotide frequency matrices and dinucleotide frequency vectors associated with sequences in 𝑆 all have the same size. DNA sequences comparisons could be completed by studying their corresponding matrices and vectors. In the following we will introduce two different distance measurements based on dinucleotide frequencies matrix and dinucleotide frequency vector, respectively.

3.1. City Block Distance for Dinucleotide Frequency Matrix

Given two DNA sequences 𝑠 and , then we get the dinucleotide frequency matrix 𝐹(𝑠) and 𝐹() as in Section 2, comparison between 𝑠 and becomes comparison between 𝐹(𝑠) and 𝐹(). Using this, we define the city block distance 𝑑1(𝑠,) between 𝑠 and as𝑑1(𝑠,)=1𝑖𝑑0,1𝑗16||𝐹𝑖𝑗(𝑠)𝐹𝑖𝑗(||).(8)

3.2. Cosine Distance for Dinucleotide Frequency Vector

We also obtain a mapping from a DNA sequence 𝑠 to a vector 𝐹(𝑠) in the 16𝑑0-dimensional linear space. Comparison between DNA sequences also could become comparison between these 16𝑑0-dimensional vectors. This is based on the assumption that two DNA sequences are similar if the corresponding 16𝑑0-dimensional vectors in the 16𝑑0-dimensional space have similar directions. Given two DNA sequences 𝑠 and , the dinucleotide frequency vectors are 𝐹(𝑠) and 𝐹(), we define the cosine distance 𝑑2(𝑠,) between 𝑠 and as𝑑2(𝑠,)=1cos𝐹(𝑠),𝐹(),(9) where cos(𝐹(𝑠),𝐹()) is the cosine value of the included angle between vectors 𝐹(𝑠) and 𝐹().

4. Applications and Experimental Results

4.1. Experimental Results

A comparison between a pair of DNA sequences to judge their similarity or dissimilarity could be carried out by calculating the distance 𝑑1(𝑠,) or 𝑑2(𝑠,). The smaller is the distance, the much similar are the two DNA sequences (The code is available on request).

To test the utility of above method, we make a comparison for the coding regions of exon-1 of 𝛽-globin gene for 11 different species, which were also studied by Randić et al. in [12]. Table 1 presents their accession numbers in NCBI database, while Table 2 lists these 11 coding sequences concretely.

At first, we present the similarity/dissimilarity matrix based on distance measurement 𝑑1, see Table 3. When we examine this table, we notice that smallest entries are always associated with the pairs (human, chimpanzee) with 𝑑1=2.5567, (human, gorilla) with 𝑑1=2.4026, and (gorilla, chimpanzee) with 𝑑1=2.7338. That means the more similar species pairs are human-gorilla, human-chimpanzee, and gorilla-chimpanzee. We also observe that the largest entry 𝑑1=9.0347 is associated with gallus and lemur and the larger entries appear in the rows belonging to gallus and opossum, which is consistent with the facts that gallus is the only nonmammalian species among these 11 species and opossum is the most remote species from the remaining mammals. These observed facts are consistent with the results reported in previous studies [5, 9, 12] determined by matrix invariants techniques, and also consistent with the reported results from nongraphical means [14, 15]. More interesting, in Table 3, the distance between goat and bovine is 𝑑1=2.3438, which is actually the smallest entry in Table 3. That implies goat and bovine are regarded to be much similar to each other by our method, which is consistent with their biology taxonomy that bovine and goat are both even-toed ungulates and belong to the family of “Bovidae”.

Table 4 presents the similarity/dissimilarity matrix based on the distance measurement 𝑑2. The smallest entries are also associated with the pairs (human, chimpanzee) with 𝑑2=0.0087, (human, gorilla), with 𝑑2=0.0074, and (gorilla, chimpanzee), and with 𝑑2=0.0112. We find that the largest entry (𝑑2=0.1139 ) is associated with (gallus, lemur), and the rows corresponding to gallus and opossum have larger entries, which is also consistent with the facts that gallus is the only nonmammalian species among these 11 species and opossum is the most remote species from the remaining mammals. The observed facts in Table 4 are consistent with the previously reported results in [5, 9, 12, 14, 15] as well. And the distance between goat and bovine (𝑑2=0.0109 ) is also much smaller as we expect.

We can see that there is an overall qualitative agreement between Tables 3 and 4. To see it visually, we denote the degree of dissimilarity/similarity of the pair human-gorilla as 1 in each table, then the results of the examination of the degree of dissimilarity/similarity between human and other several species under the two distance measurements are shown in Figure 1. We can see that the curvilinear trend of these two curves are almost the same, which demonstrates the overall agreement among dissimilarity/similarities obtained by these two distance methods.

4.2. Discussion

For the above exon-1 coding data of 11 species, 𝑑0 is chosen to be 21 followed by (7). A 336-dimensional vector is used to characterize each DNA sequence under the second distance measure. To confirm the efficacy of the vectors constructed in this high-dimensional data representation, we perform principal component analysis (PCA) on these 336 parameters. Figure 2(a) shows the projection of the 11 vectors on a 2D property space composed of the top two principal components PC1, PC2. We can see that in the 2D space, gallus (labeled by “”) and opossum (labeled by “”) are furthest from the other 9 species, and human, chimpanzee, and gorilla are very close to each other. These result are consistent with what we have got from Table 4. Note that these top two principal components contribute 48% (see Figure 2(b)) to the total information. Some information is lost when we do the projection, for example, bovine seems much closer to rabbit than goat in the 2D projection, but we know this is not true in Table 4 when all 336 parameters are considered. However, this rough approximation confirms that our mathematical descriptor characterizes DNA sequence structure effectively.

5. Conclusion

In this paper, we have presented a new method based on dinucleotide frequencies for DNA sequence comparison. The dinucleotide frequency matrix and dinucleotide frequency vector are used to mathematically characterize a DNA sequence. The most important feature of this method is that the mathematical descriptors not only involve the frequencies of adjacent 𝑋𝑌 pairs but also nonadjacent 𝑋𝑌 pairs (i.e., when 𝑋 and 𝑌 are separated by various number of nucleotides), such that a lot of important information is avoided to lose. This new method does not require sequence alignment or sequence graphical representation, which avoids the complex calculation found in either sequence alignment or sequence graphical representation. The method is very simple and fast, and it can be used to analyze both short and long DNA sequences with high efficiencies.

Acknowledgments

This work is supported partly by Shandong Province Natural Science Foundation of China with no. ZR2010AQ018 and no. ZR2011FQ010 and partly by Independent Innovation Foundation of Shandong University with no. 2010ZRJQ005. This project also has been partially supported by a WV EPSCoR Grant and an NSA Grant H98230-12-1-0233.