Computational and Mathematical Methods in Medicine

Volume 2016, Article ID 3262813, 12 pages

http://dx.doi.org/10.1155/2016/3262813

## Circular Helix-Like Curve: An Effective Tool of Biological Sequence Analysis and Comparison

College of Science, Yanshan University, Qinhuangdao 066004, China

Received 18 February 2016; Accepted 19 April 2016

Academic Editor: Nadia A. Chuzhanova

Copyright © 2016 Yushuang Li and Wenli Xiao. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

This paper constructs a novel injection from a DNA sequence to a 3D graph, named the circular helix-like curve (CHC). The graphical representation can be used to visualize the characteristics of a single DNA sequence and to identify similarities and differences among several DNA sequences. A 12-dimensional vector extracted from CHC, serving as its numerical characterization, was applied to analyze the phylogenetic relationships of 11 species, 74 ribosomal RNAs, 48 Hepatitis E viruses, and 18 eutherian mammals, respectively. These successful experiments illustrate that CHC is an effective tool for biological sequence analysis and comparison.

#### 1. Introduction

The analysis and comparison of complex biological sequences play important roles in molecular studies. It is important to find effective tools for better understanding the ever-increasing volume of biological sequences. Graphical representation is one such tool, and it helps researchers study genomes in a perceivable form.

The original contributions to DNA graphical representation were the compact H-curves created over 30 years ago by Hamori and coworkers [1–4]. The H-curve is a path in 3D space obtained by mapping the four nucleotides (adenine A, cytosine C, guanine G, and thymine T) to four unit vectors pointing in four directions (NW, NE, SE, and SW). The basic rule for constructing the H-curve is, for each nucleotide, to move one unit in the corresponding direction and one unit in the $z$-direction. The H-curve not only characterizes complex genetic messages but also embodies important parameters concerning the distribution of nucleotides. Hamori's achievements encouraged many researchers to study graphical bioinformatics. Here we focus on 3D graphical representations. Among the representations following Hamori's idea, the most noteworthy is the model of DNA sequences constructed in 2000 by Randić and coworkers [5], which is also based on a path in 3D space. It differs from the H-curve in that the four vectors corresponding to the four bases are located along tetrahedral directions. Moreover, Randić et al. described a particular scheme for transforming the spatial model into a numerical matrix representation. This was an important breakthrough that led to the expansion of graphical techniques from a visual discipline to a quantitative discipline. Another successful model is the Z-curve, created by Zhang et al. [6]. The construction of the Z-curve combines three classifications of the DNA bases, purines/pyrimidines (A, G)/(C, T), amino/keto groups (A, C)/(G, T), and weak/strong hydrogen bonds (A, T)/(G, C), and assigns a coordinate vector to each of A, T, C, and G accordingly. The Z-curve is well known for its extensive applications in comparative genomics, gene prediction, computation of G + C content with a windowless technique, and prediction of replication origins and termini of bacterial and archaeal genomes.
Regrettably, the spatial curve in the representation of [5] contains crossings and overlaps, and the Z-curve may form a loop if the four bases occur with equal frequencies in the sequence, as pointed out by Tang et al. [7]. To overcome the degeneracy appearing in the above representations, various other improvements or transformations were proposed [8–14]. Recently, Pesek and Zerovnik [15] presented a modified Hamori curve by using an analogous embedding into the strong product of graphs $K_4 \boxtimes P_n$ ($K_4$ is the complete graph of order 4 and $P_n$ is the path of order $n$), with weighted edges. Xie and Mo [16] also considered the three classifications of the DNA bases, assigned three types of vectors to the four bases, respectively, and derived three 3D graphical representations.
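The Z-curve construction and the loop degeneracy noted by Tang et al. can be illustrated with a minimal Python sketch. The sign convention in `Z_STEP` is the commonly used one (purine, amino, and weak hydrogen bond each count as +1) and is an assumption here, since the text does not spell out the coordinates; the function name `z_curve` is likewise only illustrative.

```python
# Per-base step on the three Z-curve axes (assumed sign convention):
#   x: purine (A, G) = +1, pyrimidine (C, T) = -1
#   y: amino  (A, C) = +1, keto       (G, T) = -1
#   z: weak H-bond (A, T) = +1, strong H-bond (G, C) = -1
Z_STEP = {
    "A": (1, 1, 1),
    "C": (-1, 1, -1),
    "G": (1, -1, -1),
    "T": (-1, -1, 1),
}


def z_curve(seq):
    """Return the cumulative 3D path of a DNA sequence, starting at the origin."""
    x = y = z = 0
    points = [(0, 0, 0)]
    for base in seq.upper():
        dx, dy, dz = Z_STEP[base]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


# Equal base frequencies bring the path back to its starting point (a loop).
path = z_curve("ACGT")
print(path[-1] == path[0])
```

Because each coordinate is a difference of base counts, any sequence in which A, C, G, and T occur equally often ends where it started, whatever the order of the bases.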

The above models are all based on individual nucleotides, so it is easy to inspect the composition and distribution of the four bases directly, but difficult to do the same for dinucleotides or trinucleotides in DNA sequences. Some researchers solved this problem by assigning different vectors to each dinucleotide or to each trinucleotide in 3D space. For example, in 2007 Qi and Fan [17] assigned 16 vectors to the 16 dinucleotides and then defined a map from a DNA sequence to a characteristic plot set, while the corresponding curves extended along axes. Subsequently, based on a similar research object, Qi et al. [18] presented another 3D graphical representation. The two papers differ considerably in the following aspects: the methods and contents of research, the map used to construct the graphical representation, the graphical curve, and the numerical invariants characterizing DNA sequences. Other 3D models [19, 20] based on dinucleotides have also been proposed. In 2009, Yu et al. [21] presented a novel 3D graphical representation based on trinucleotides, the TN-curve, which is the first model that can display the information of trinucleotides within 3D space. Recently, Jafarzadeh and Iranmanesh [22] proposed a 3D model, the C-curve, also based on trinucleotides.

Almost all the works mentioned above involve sequence comparison. The most popular tools for comparing sequences fall into two classes: alignment-based methods and alignment-free methods. In general, most alignment-free methods take less computational time than alignment-based ones. Moreover, they are more sensitive for short or partial sequences [23] and more efficient in comparing gene regulatory regions [24]. In this paper, we introduce a new 3D graphical representation of DNA sequences, namely, the circular helix-like curve (CHC), which differs considerably from the techniques referred to above. It is composed of four characteristic curves (CHC-A, CHC-C, CHC-G, and CHC-T), which correspond to the four bases (A, C, G, and T) in DNA. The novel injection from a DNA sequence to a point set in 3D space ensures that CHC loses no information. A 12-dimensional vector extracted from CHC, serving as its numerical characterization, provides an effective basis for alignment-free sequence comparison.
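As a rough illustration of the alignment-free idea, the sketch below reduces each sequence to a fixed-length numerical vector and compares the vectors directly. It deliberately uses a stand-in feature, the 16 overlapping dinucleotide frequencies, rather than the 12-dimensional CHC vector; the function names are hypothetical and chosen only for this example.

```python
from itertools import product
from math import sqrt

# The 16 dinucleotides in a fixed order: AA, AC, AG, AT, CA, ...
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def dinucleotide_frequencies(seq):
    """Return the 16 overlapping-dinucleotide frequencies of a DNA sequence."""
    seq = seq.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip windows containing non-ACGT symbols
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[d] / total if total else 0.0 for d in DINUCLEOTIDES]


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Sequences of different lengths are compared without any alignment step.
d = euclidean_distance(
    dinucleotide_frequencies("ATGGTGCACTGG"),
    dinucleotide_frequencies("ATGGTGCATCTGACTCC"),
)
print(d)
```

Any feature vector of fixed length can be substituted for `dinucleotide_frequencies`; the pairwise distances can then be passed to a standard tree-building procedure.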

The paper is organized as follows: in Section 2, we describe the construction of the CHC, its several properties, and its numerical characterization; in Section 3, we exhibit applications of CHC by analyzing phylogenetic relationships of 11 species, 74 ribosomal RNAs, 48 Hepatitis E viruses, and 18 eutherian mammals, respectively. Finally, a conclusion ends the paper.

#### 2. Circular Helix-Like Curve

##### 2.1. Construction of Circular Helix-Like Curve

Given a DNA sequence with length , define the map as follows: for , if is even, then If is odd, then

The function maps each nucleotide in the sequence to one point in 3D space. Let . Similarly define , , and . Connect the adjacent points in by lines and then obtain a circular helix-like curve in 3D space representing the trail of base A in the sequence, namely, circular helix-like curve-A (CHC-A) for convenience. In the same way, we can obtain CHC-C, the symmetric curve about plane of a circular helix-like curve; CHC-G, the symmetric curve about plane of a circular helix-like curve; CHC-T, the symmetric curve about -axis of a circular helix-like curve. Clearly, projective points on plane of points in four curves are all assigned over the circumference of a unit circle. Figure 1 shows the circular helix-like curve (CHC) of the first exon of beta-globin gene of Gallus: ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGCAAGGTCAATGTGGCCGAATGTGGGGCCGAAGCCCTGGCCAG.
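The bookkeeping behind the four characteristic curves can be sketched in Python: every position of the sequence yields one 3D point, and the points are grouped by base so that consecutive points within a group can be joined by line segments. The coordinate map used below, `placeholder_point`, is a hypothetical stand-in that simply places position $i$ on a unit-radius helix; it is not the map defined in this section, and the reflections that distinguish CHC-C, CHC-G, and CHC-T are omitted.

```python
from math import cos, pi, sin

# First exon of the beta-globin gene of Gallus, as quoted in the text.
GALLUS_EXON1 = (
    "ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGCAAGGTCAAT"
    "GTGGCCGAATGTGGGGCCGAAGCCCTGGCCAG"
)


def placeholder_point(i, n):
    """Hypothetical stand-in map (NOT the paper's formula).

    Places position i of a length-n sequence on a helix of radius 1
    whose axis is the third coordinate.
    """
    angle = 2 * pi * i / n
    return (cos(angle), sin(angle), i)


def characteristic_point_sets(seq, point_map=placeholder_point):
    """Group the per-position 3D points by base, preserving sequence order."""
    n = len(seq)
    sets = {base: [] for base in "ACGT"}
    for i, base in enumerate(seq.upper(), start=1):
        sets[base].append(point_map(i, n))
    return sets


sets = characteristic_point_sets(GALLUS_EXON1)
for base, points in sets.items():
    # Joining consecutive points of one set gives that base's curve.
    print(base, len(points))
```

Passing a different `point_map` changes the geometry without touching the grouping logic, so the same function can be reused once the actual map is available.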