Circular Helix-Like Curve: An Effective Tool of Biological Sequence Analysis and Comparison
This paper constructs a novel injection from a DNA sequence to a 3D graph, named the circular helix-like curve (CHC). The graphical representation can be used to visualize the characteristics of a single DNA sequence and to identify similarities and differences among several DNA sequences. A 12-dimensional vector extracted from CHC, as a numerical characterization of CHC, was applied to analyze phylogenetic relationships of 11 species, 74 ribosomal RNAs, 48 Hepatitis E viruses, and 18 eutherian mammals, respectively. The successful experiments illustrate that CHC is an effective tool for biological sequence analysis and comparison.
The analysis and comparison of complex biological sequences play important roles in molecular studies. Effective tools are needed to gain a better understanding of the ever-increasing volume of biological sequences. Graphical representation is one such tool; it assists researchers in studying genomes in a perceivable form.
The original contributions to DNA graphical representation were the compact H-curves created over 30 years ago by Hamori and coworkers [1–4]. The H-curve is a path in 3D space obtained by mapping the four nucleotides (adenine A, cytosine C, guanine G, and thymine T) to four unit vectors pointing in four directions (NW, NE, SE, and SW). The basic rule for constructing an H-curve is, for each nucleotide, to move one unit in the corresponding direction and one unit in the z-direction. The H-curve not only characterizes complex genetic messages but also embodies important parameters concerning the distribution of nucleotides. Hamori’s achievements encouraged many researchers to study graphical bioinformatics. Here we focus on 3D graphical representations. Among the representations following Hamori’s idea, the most noteworthy is the model of DNA sequences constructed in 2000 by Randić and coworkers, which is also based on a path in 3D space. The difference from the H-curve is that the four vectors corresponding to the four bases are located along tetrahedral directions. Moreover, Randić et al. described a particular scheme for transforming the spatial model into a numerical matrix representation. This was an important breakthrough that led to the expansion of the graphical technique from a visual discipline to a quantitative discipline. Another successful model is the Z-curve, created by Zhang et al. The construction of the Z-curve combines three classifications of the DNA bases, purines/pyrimidines (A, G)/(C, T), amino/keto groups (A, C)/(G, T), and weak/strong hydrogen bonds (A, T)/(G, C), and assigns A(), T(), C(), and G(), respectively. The Z-curve is well known for its extensive applications in comparative genomics, gene prediction, computation of G + C content with a windowless technique, and prediction of replication origins and termini of bacterial and archaeal genomes.
Regrettably, the spatial curve in the tetrahedral representation of Randić et al. contains crossings and overlaps, and the Z-curve may form a loop if the four bases occur with equal frequencies in the sequence, as pointed out by Tang et al. To overcome the degeneracy appearing in the above representations, various improvements and transformations have been proposed [8–14]. Recently, Pesek and Zerovnik presented a modified Hamori curve by using an analogous embedding into the strong product of graphs, $K_4 \boxtimes P_n$ ($K_4$ is the complete graph of order 4 and $P_n$ is the path of order $n$), with weighted edges. Xie and Mo also considered three classifications of the DNA bases, assigned three types of vectors to the four bases, respectively, and derived three 3D graphical representations.
The above models are all based on individual nucleotides, so it is easy to inspect the composition and distribution of the four bases directly but difficult to do so for dinucleotides or trinucleotides in DNA sequences. Some researchers solved this problem by assigning different vectors to each dinucleotide or to each trinucleotide in 3D space. For example, in 2007 Qi and Fan assigned 16 vectors to the 16 dinucleotides and then defined a map from a DNA sequence to a characteristic plot set, while the corresponding curves extended along axes. Subsequently, Qi et al. presented another 3D graphical representation based on a similar research object. The two papers differ substantially in the following aspects: the methods and contents of research, the map used to construct the graphical representation, the graphical curve, and the numerical invariants characterizing DNA sequences. Other 3D models [19, 20] based on dinucleotides have also been proposed. In 2009 Yu et al. presented a novel 3D graphical representation based on trinucleotides, the TN-curve, which is the first model that can display the information of trinucleotides within 3D space. Recently, Jafarzadeh and Iranmanesh proposed a 3D model, the C-curve, also based on trinucleotides.
Almost all the works mentioned above involve sequence comparison. The most popular tools for comparing sequences fall into two groups: alignment-based methods and alignment-free methods. In general, most alignment-free methods take less computational time than alignment-based ones. Moreover, they are more sensitive to short or partial sequences and more efficient in comparing gene regulatory regions. In this paper, we introduce a new 3D graphical representation of DNA sequences, namely, the circular helix-like curve (CHC), which differs considerably from the techniques referred to above. It is composed of four characteristic curves (CHC-A, CHC-C, CHC-G, and CHC-T), which correspond to the four bases (A, C, G, and T) in DNA. The novel injection from a DNA sequence to a point set in 3D space ensures that CHC loses no information. A 12-dimensional vector extracted from CHC, as its numerical characterization, makes alignment-free sequence comparison possible.
The paper is organized as follows: in Section 2, we describe the construction of the CHC, its several properties, and its numerical characterization; in Section 3, we exhibit applications of CHC by analyzing phylogenetic relationships of 11 species, 74 ribosomal RNAs, 48 Hepatitis E viruses, and 18 eutherian mammals, respectively. Finally, a conclusion ends the paper.
2. Circular Helix-Like Curve
2.1. Construction of Circular Helix-Like Curve
Given a DNA sequence with length , define the map as follows: for , if is even, then If is odd, then
The function maps each nucleotide in the sequence to one point in 3D space. Let . Similarly define , , and . Connecting the adjacent points in by line segments yields a circular helix-like curve in 3D space representing the trail of base A in the sequence, called circular helix-like curve-A (CHC-A) for convenience. In the same way, we can obtain CHC-C, the symmetric curve about plane of a circular helix-like curve; CHC-G, the symmetric curve about plane of a circular helix-like curve; CHC-T, the symmetric curve about -axis of a circular helix-like curve. Clearly, projective points on plane of points in four curves are all assigned over the circumference of a unit circle. Figure 1 shows the circular helix-like curve (CHC) of the first exon of the beta-globin gene of Gallus: ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGCAAGGTCAATGTGGCCGAATGTGGGGCCGAAGCCCTGGCCAG.
2.2. Properties of CHC
Property 1. The map defined in Section 2.1 is an injection; thus no information of DNA sequence is lost.
It is sufficient to prove that if . In fact, it is to prove that for all , when is even, , and when is odd, That is, if is even and if is odd. It is clear because is odd when is even, and is also odd when is odd. This completes the proof.
Property 2. CHC could reflect base composition and distribution of a DNA sequence.
On the one hand, the base composition is easily read from the point density of the corresponding CHC. Taking Figure 1 as an example, CHC-G has a high point density, which implies that the first exon of the beta-globin gene of Gallus has a high G-content. Conversely, from CHC-T one can infer that this Gallus exon has a low T-content. From CHC-A and CHC-C it is also clear that the exon has similar contents of bases A and C. On the other hand, the base distribution can be identified from the arrangement of points on each CHC. As shown in Figure 1, special regions of the curves can be found, such as dense regions and sparse regions. One can also easily see the spacing between successive occurrences of each kind of base.
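As a numerical cross-check of the density reading above, the base counts of the Gallus exon can be tallied directly from the sequence given in Section 2.1. The following is a minimal Python sketch; the function name `base_composition` is introduced here for illustration only and is not part of the original method.

```python
from collections import Counter

# First exon of the beta-globin gene of Gallus, as given in Section 2.1.
GALLUS = (
    "ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGCAAGGTCAAT"
    "GTGGCCGAATGTGGGGCCGAAGCCCTGGCCAG"
)


def base_composition(seq):
    """Return the number of occurrences of each base A, C, G, T in seq."""
    counts = Counter(seq.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


composition = base_composition(GALLUS)
for base, count in composition.items():
    # Print the absolute count and the fraction of the whole exon.
    print(f"{base}: {count} ({count / len(GALLUS):.1%})")
```

In this exon G is the most frequent base and T the least frequent, in line with the point densities of CHC-G and CHC-T described above.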
Property 3. CHC is an effective tool for identifying dissimilarities (similarities) among equal-length sequences.
Take the first exons of the β-globin genes of Gallus and Duck as an example (both have length 92).
Gallus: ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGCAAGGTCAATGTGGCCGAATGTGGGGCCGAAGCCCTGGCCAG.
Duck: ATGGTGCACTGGACAGCCGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGCAAGGTCAATGTGGCCGACTGTGGAGCTGAGGCCCTGGCCAG.
The two sequences have only six mismatches at the sequence level, and both have a similar composition and distribution of nucleic bases, as Figure 2 shows. In the CG-comparison in particular (see Figure 2(b)), their respective CHCs nearly coincide. Their differences are nevertheless apparent in Figure 2(a), because both the CHC-Ts and the CHC-As show obvious deviations.
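The mismatch count quoted above can be checked directly on the two exon sequences. The sketch below is a plain position-by-position comparison in Python, not a CHC computation; the helper name `mismatch_positions` is introduced here for illustration.

```python
# First exons of the beta-globin genes of Gallus and Duck, as listed above.
GALLUS = (
    "ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGCAAGGTCAAT"
    "GTGGCCGAATGTGGGGCCGAAGCCCTGGCCAG"
)
DUCK = (
    "ATGGTGCACTGGACAGCCGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGCAAGGTCAAT"
    "GTGGCCGACTGTGGAGCTGAGGCCCTGGCCAG"
)


def mismatch_positions(a, b):
    """Return the 1-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


positions = mismatch_positions(GALLUS, DUCK)
for pos in positions:
    # Show which base is replaced by which at each mismatch.
    print(pos, GALLUS[pos - 1], "->", DUCK[pos - 1])
```

Every mismatch involves an A or a T in at least one of the two sequences, which agrees with the observation that the deviations are concentrated in CHC-A and CHC-T.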
Discussion. Property 3 shows that CHC is convenient for visually comparing sequences of equal length; for sequences of unequal length, however, some difficulties may arise. For example, the lengths of the first exons of the β-globin genes of Human and Gorilla are 92 and 93, respectively, and their corresponding bases match completely from position 1 to 92, the only difference being the extra last base in Gorilla; nevertheless, their CHCs deviate from each other (see Figures 3(a) and 3(b)) because of the different lengths. This behavior is undesirable. To resolve it, one can slightly change the function defined in Section 2.1 for the two sequences so as to avoid the influence of the different lengths. Without loss of generality, suppose sequence Seq1 has bases, sequence Seq2 has bases, and
(i) If and are both even, that is, is even, assign in (1) to for Seq1 and to for Seq2, respectively; thus are both odd.
(ii) If and are both odd, that is, is even, assign in (2) to for Seq1 and to for Seq2, respectively; thus are both odd.
(iii) If is even and is odd, that is, is odd, assign in (1) to for Seq1 and assign in (2) to for Seq2, respectively; thus are both odd.
(iv) If is odd and is even, that is, is odd, assign in (2) to for Seq1 and assign in (1) to for Seq2, respectively; thus are both odd.
It is not difficult to see that the above four modifications still preserve Property 1. Apply rule (iii) to compare the first exons of the β-globin genes of Human and Gorilla; that is, assign in (1) to 92 + 1 + 2 for Human and assign in (2) to 93 + 2 for Gorilla. This yields a much better CHC comparison of Human and Gorilla (see Figures 4(a) and 4(b)). Their corresponding bases match very well from position 1 to 92, and the last base G in Gorilla stands out clearly.
The above procedures handle the CHC comparison of DNA sequences from front to back. In fact, the CHC comparison of DNA sequences from back to front can be handled by a similar method. Besides making the lengths of the compared sequences “equal” by changing the assignment of in (1) or in (2), one needs to adjust “the start location” of the comparison by changing the assignment of in (1) or (2) for the shorter sequence, such that the two compared sequences have the same “end locations.” For example, consider the following two sequences:
ATTTGGCACCTAAAACGTCGTATATAAAGGGGTCTCA
GGCACCTAAAACGTCGTATATAAAGGGGTCTCA
The lengths of the first and second sequences are 37 and 33, respectively, and the second sequence exactly matches the fragment of the first from position 5 to position 37. With the function modified as in (3), Figures 5(a) and 5(b) show the CHC comparison of the two sequences.
2.3. Numerical Characterization of CHC
As we have seen, CHC offers an intuitive visual means of characterizing a single DNA sequence and of comparing DNA sequences. More important is the numerical characterization derived from CHC, a 12-dimensional vector. It not only captures the essence of the base composition and distribution in a DNA sequence, but also allows one to estimate the similarity or dissimilarity between different DNAs quantitatively. Given a sequence with length , let where is the coordinate of the th base A in the sequence, and the others are defined analogously. Define the 12-dimensional vector as a numerical characterization of CHC. Note that because
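The structure of such a 12-dimensional vector (four curves, three coordinates each) can be sketched as follows. This is only an illustration under stated assumptions: the geometric center of each curve is taken to be the arithmetic mean of its points, a base that does not occur is assigned the origin, and `toy_map` is a placeholder point map that is NOT the map of Section 2.1. The names `toy_map` and `center_vector` are hypothetical.

```python
import math

BASES = "ACGT"


def toy_map(i, n, base):
    """Placeholder point map (an assumption, not the map of Section 2.1).

    Places position i of an n-base sequence on a helix over the unit circle,
    ignoring the base-specific symmetries of the real CHC.
    """
    angle = 2.0 * math.pi * i / n
    return (math.cos(angle), math.sin(angle), i / n)


def center_vector(seq, point_map=toy_map):
    """Concatenate the mean (x, y, z) of the points of each base curve."""
    n = len(seq)
    vector = []
    for base in BASES:
        points = [
            point_map(i, n, base)
            for i, s in enumerate(seq, start=1)
            if s == base
        ]
        if points:
            center = [sum(p[k] for p in points) / len(points) for k in range(3)]
        else:
            center = [0.0, 0.0, 0.0]  # assumption: an absent base maps to the origin
        vector.extend(center)
    return vector


v = center_vector("ATGGTGCACTGG")
print(len(v))
```

Passing the actual map of Section 2.1 as `point_map` would give the corresponding vector for CHC, provided the averaging convention assumed here matches the definition above.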
3. Applications of CHC
In this section we use the numerical characterization of CHC to compare and analyze complete coding sequences of β-globin genes of 11 species, 74 sequences from 16S ribosomal RNA, 48 Hepatitis E viruses, and whole mitochondrial genomes of 18 eutherian mammals. The average sequence lengths in the four experiments are 444, 1471, 7214, and 16572, respectively (see Tables 1, 2, 3, and 4 in Supplementary Materials available online at http://dx.doi.org/10.1155/2016/3262813). Here we choose the Euclidean distance as the distance measure. The basis of sequence comparison is that the smaller the Euclidean distance between two numerical characterizations is, the more similar the two corresponding sequences are. We first calculate the similarity/dissimilarity matrix of the sequences by computing their pairwise Euclidean distances and then use this matrix to construct a phylogenetic tree by the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) in the Molecular Evolutionary Genetics Analysis (MEGA) software package. Comparisons with existing results confirm that the presented method is an effective classification tool for DNA sequences.
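The two-step pipeline described above (pairwise Euclidean distances, then UPGMA clustering) can be sketched in a few lines of Python. This is a simplified stand-in for illustration, not the MEGA implementation: it returns only the tree topology as nested tuples, without branch lengths, and the input vectors are toy values rather than data from this paper. The function names are hypothetical.

```python
import math


def distance_matrix(vectors):
    """Pairwise Euclidean distances between named, equal-length vectors."""
    return {
        (a, b): math.dist(vectors[a], vectors[b])
        for a in vectors
        for b in vectors
    }


def upgma(names, dist):
    """Average-linkage (UPGMA) clustering; returns the topology as nested tuples."""
    sizes = {name: 1 for name in names}  # cluster (as a tree) -> number of leaves
    d = {(a, b): dist[(a, b)] for a in names for b in names if a != b}
    while len(sizes) > 1:
        a, b = min(d, key=d.get)  # closest pair of current clusters
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        for c in sizes:
            # The size-weighted mean equals the average over all leaf pairs.
            avg = (size_a * d[(a, c)] + size_b * d[(b, c)]) / (size_a + size_b)
            d[(merged, c)] = d[(c, merged)] = avg
        # Drop every distance that still refers to the two merged clusters.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


# Toy 3-dimensional vectors (hypothetical values, not data from this paper).
toy_vectors = {
    "s1": [0.0, 0.0, 0.0],
    "s2": [0.1, 0.0, 0.0],
    "s3": [1.0, 1.0, 0.0],
}
matrix = distance_matrix(toy_vectors)
tree = upgma(list(toy_vectors), matrix)
print(tree)
```

In practice the 12-dimensional vectors of the compared sequences would replace `toy_vectors`; any equal-length numeric vectors work with the same code.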
3.1. Similarity/Dissimilarity Analysis of the Complete Coding Sequences of β-Globin Genes of 11 Species
Table 1 exhibits the similarity/dissimilarity matrix of the complete coding sequences of β-globin genes of 11 species (see Table 1 in Supplementary Materials) based on the Euclidean distance. The conclusions are in agreement with known facts of evolution. The most similar species pair is Gorilla-Chimpanzee; the species pairs Human-Chimpanzee, Human-Gorilla, Goat-Bovine, and Rat-Mouse are closely related to each other, while Opossum and Gallus tend to be significantly different from the others. Our phylogenetic tree (see Figure 6) is in good agreement with the commonly accepted structure, except that Lemur, a primitive quadrumana, does not share a branch with (Human, Gorilla, and Chimpanzee). This phenomenon is possible because a gene may begin to differentiate before the divergence of its corresponding species, and thus the differentiation time of the gene may be earlier than that of the species.
To check the validity of the presented technique, we compared the results in [20, 25–27] with ours (they all used the same test data). Since different methods generate indexes of different magnitudes, the indexes of each method are normalized to its own Human-Gallus value. Table 2 shows the comparison of similarity/dissimilarity indexes for the 11 species, and Figure 7 is the line chart of Table 2. Clearly, [20, 25] and our method roughly capture the right trend, but [26, 27] both show some unreasonable or contradictory results. For example, the Human-Lemur value in  is 1.1974 and the Human-Opossum value is 1.2931, both larger than 1, the Human-Gallus value. Two values in , Human-Mouse 0.1025 and Human-Goat 0.0434, are both smaller than Human-Chimpanzee 0.1204.
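The normalization used for this comparison amounts to dividing every index of a method by that method's Human-Gallus index, so that Human-Gallus becomes 1 for all methods. A minimal Python sketch follows; the raw values are hypothetical placeholders chosen for illustration and are not taken from Table 2, and the function name is introduced here.

```python
def normalize_to_reference(indexes, reference_pair):
    """Divide every index by the index of the reference species pair."""
    reference = indexes[reference_pair]
    if reference == 0:
        raise ValueError("reference index must be nonzero")
    return {pair: value / reference for pair, value in indexes.items()}


# Hypothetical raw indexes of one method (placeholders, not data from Table 2).
raw = {
    ("Human", "Chimpanzee"): 0.02,
    ("Human", "Mouse"): 0.25,
    ("Human", "Gallus"): 0.50,
}
normalized = normalize_to_reference(raw, ("Human", "Gallus"))
for pair, value in normalized.items():
    print(pair, round(value, 4))
```

After normalization, a value above 1 for any pair means the method rates that pair as more dissimilar than Human-Gallus, which is the kind of anomaly pointed out above.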
3.2. Similarity/Dissimilarity Analysis of 74 Sequences from 16S Ribosomal RNA
The 16S ribosomal RNA gene is a DNA sequence encoding rRNA in bacteria and is highly conserved and specific. In this subsection we analyze 74 sequences from 16S ribosomal RNA. The data set consists of 10 Buchnera aphidicola, 9 Coxiella burnetii, 9 Fibrobacter succinogenes, 9 Klebsiella oxytoca, 8 Azoarcus tolulyticus, 7 Borrelia burgdorferi, 7 Helicobacter sp., 5 Aggregatibacter actinomycetemcomitans, 5 Alloprevotella tannerae, and 5 Clostridium scindens. Detailed information is given in Table 2 in Supplementary Materials. Using the similarity/dissimilarity matrix of the 74 sequences (see Table 5 in Supplementary Materials), we construct the phylogenetic tree (see Figure 8), which is consistent with the result in . As anticipated, the ten genotypes correspond exactly to ten branches of the tree.
3.3. Similarity/Dissimilarity Analysis of 48 Hepatitis E Viruses
Hepatitis E viruses (HEV) are nonenveloped, positive-sense, single-stranded RNA viruses and belong to the Hepevirus genus . Hepatitis E is considered a public health problem and has caused much concern. Several classifications of HEV have been proposed so far; the most widely accepted one is the classification into four major genotypes [28–32]. Genotypes I–IV are represented by the Burmese isolates, the Mexican isolate, the US isolates, and the new Chinese isolates, respectively. Here we construct the phylogenetic tree (see Figure 9) of 48 Hepatitis E viruses (see Table 3 in Supplementary Materials) based on the similarity/dissimilarity matrix (see Table 6 in Supplementary Materials), which is basically in agreement with the results presented in [28–32]. The 48 HEVs are clearly divided into four genotypes: 16 HEVs fall in genotype I, 17 in genotype III, and 14 in genotype IV; M1 is the only member of genotype II and lies far from genotype I, which is consistent with the structure in . Moreover, some divergences in subtype classification from the result  are highly consistent with the result : T1, which is of subtype IVc in , is closer to subtype IVa in  and in ours. Also, subtype IIIc is closer to subtype IIIa.
3.4. Similarity/Dissimilarity Analysis of Whole Mitochondrial Genomes of 18 Eutherian Mammals
We choose the whole mitochondrial genomes of 18 eutherian mammals as a long-sequence data set, which has been studied in [28, 32–34]; the longest and the shortest sequences have lengths 17019 and 16295, respectively. Table 4 in Supplementary Materials gives the detailed information. The 18 eutherian mammals can be divided into two classes: placental mammals and a nonplacental mammal. The placental mammals can be further divided into three groups: Primates, Ferungulates, and Rodents. We construct the phylogenetic tree (see Figure 10) of the 18 eutherian mammals based on the similarity/dissimilarity matrix (see Table 7 in Supplementary Materials). In our phylogenetic tree, Platypus, the only nonplacental mammal, is significantly different from the others and lies on the outside of the tree. Rodents first cluster with Ferungulates, and then they cluster with Primates; that is, our result supports the topology (Primates, (Rodents, Ferungulates)), which is consistent with the structures in [28, 33, 34] and is slightly different from the result in .
Generally speaking, the CHC of a DNA sequence depends on the order in which the four bases are mapped onto the graph. Changing the map order yields a different graphical representation of the same DNA sequence. Even so, whichever of these graphical representations is used to compare DNA sequences, the same conclusions are drawn.
Proposition 1. Both geometric center vectors and Euclidean distances ensure together that similarities between DNA sequences are independent of the map order of four bases.
Proof. Take two DNA sequences and as an example. Without loss of generality, suppose their lengths are and , respectively, and both even (other cases are similar). Their corresponding geometric center vectors are and , respectively. Then the Euclidean distance between and is From (1), Then Here , Note that is only determined by the sets and (the sets of positions of base A in sequences and , respectively) and is independent of the map order of base A. This observation is also valid for the other bases. In conclusion, the Euclidean distance between and is invariant under any change of the map order of the four bases.
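One ingredient of this argument can be illustrated numerically. Section 2.1 describes CHC-C, CHC-G, and CHC-T as symmetric images of a circular helix-like curve, and such symmetries preserve distances and commute with taking geometric centers. The Python sketch below assumes, for illustration only, that a change of map order applies the same reflection to the points of a given base in both sequences; the point sets and the choice of the plane y = 0 are hypothetical, and the names are introduced here.

```python
import math


def centroid(points):
    """Arithmetic mean of a list of 3D points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def reflect_about_y0(point):
    """Mirror a point about the plane y = 0 (an example symmetry)."""
    x, y, z = point
    return (x, -y, z)


# Hypothetical points of one base in two sequences (not computed from real data).
points_s = [(1.0, 0.0, 0.1), (0.0, 1.0, 0.2), (-1.0, 0.0, 0.3)]
points_t = [(0.0, 1.0, 0.2), (0.0, -1.0, 0.4)]

before = math.dist(centroid(points_s), centroid(points_t))
after = math.dist(
    centroid([reflect_about_y0(p) for p in points_s]),
    centroid([reflect_about_y0(p) for p in points_t]),
)
print(math.isclose(before, after))
```

Because the squared Euclidean distance between two 12-dimensional vectors is the sum of four such per-base squared distances, leaving each term unchanged leaves the total unchanged.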
CHC, based on a novel one-to-one mapping from nucleic bases in a DNA sequence to the points in 3D space, characterizes graphically a DNA sequence and reflects base composition and distribution of the sequence. As a consequence, DNA comparison with identical or different lengths intuitively transforms into CHC comparison whether in the normal order or in the reversed order.
The 12-dimensional vector extracted from CHC, as a numerical characterization of CHC, captures the essence of the base composition and distribution in DNA sequence, avoids the trouble of different sequence lengths, and allows quantitative estimates of the degree of similarity or dissimilarity among different DNAs. Reasonable phylogenetic analyses of four experiments illustrate that CHC technique is an effective tool for investigating biological structure and inferring evolutionary relationship. We expect that the presented method could help us explore more information hidden in the biological sequences.
The authors declare that they have no competing interests.
This work is supported in part by the National Natural Science Foundation of China (no. 11201409), Natural Science Foundation of Hebei Province (no. A2013203009), and Young Talents Plan of Higher School in Hebei Province (no. BJ2014060).
The Supplementary Materials include seven tables in total: Table 1, Table 2, Table 3, and Table 4 give the detailed information of the complete coding sequences of β-globin genes of 11 species, 74 sequences of 16S ribosomal RNA, 48 sequences of Hepatitis E viruses, and whole mitochondrial genomes of 18 species, respectively. Table 5 and Table 6 show the similarity/dissimilarity matrices of the 74 sequences from 16S ribosomal RNA and the 48 sequences of Hepatitis E viruses. Table 7 shows the similarity/dissimilarity matrix of the whole mitochondrial genomes of 18 species.
E. Hamori and J. Ruskin, “H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences,” Journal of Biological Chemistry, vol. 258, no. 2, pp. 1318–1327, 1983.
E. Hamori, “Graphic representation of long DNA sequences by the method of H curves—current results and future aspects,” BioTechniques, vol. 7, no. 7, pp. 710–720, 1989.
J. F. Yu, J. H. Wang, and X. Sun, “Analysis of similarities/dissimilarities of DNA sequences based on a novel graphical presentation,” MATCH Communications in Mathematical and in Computer Chemistry, vol. 63, pp. 493–512, 2010.