Abstract

This paper deals with the complex unit roots representation of archaea DNA sequences and the analysis of symmetries in the wavelet coefficients of the digitized sequence. It is shown that even for extremophile archaea, the distribution of nucleotides has to fulfill some (mathematical) constraints in such a way that the wavelet coefficients are symmetrically distributed with respect to the nucleotide distribution.

1. Introduction

In some recent papers the existence of symmetries in nucleotide distribution has been studied for several living organisms [16], including mammals, fungi [14], and viruses [5, 6]. These studies show that any (investigated) DNA sequence, when converted into a digital sequence, features a fractal-shaped DNA walk and an apparently random distribution. However, when the short wavelet transform maps the digital sequence into the space of wavelet coefficients and these coefficients are clustered, they are located along some symmetrical shapes.

One of the main tasks of this paper is to show that, although the distribution of nucleotides in any DNA sequence can be considered as random, a comparison between a random sequence (and the corresponding random walk) and a DNA sequence (and walk) reveals some distinctions. The nucleotide distribution thus behaves like a random distribution subject to some constraints. These constraints (rules) are singled out in the following by showing the existence of a hidden geometry which underlies the structure of a DNA sequence.

In other words, nucleotides appear at first to be randomly distributed along any DNA sequence, but a closer analysis shows that they obey some (statistical) mathematical constraints which do not allow a given nucleotide to be arbitrarily followed by any of the remaining nucleotides.

It is interesting to notice that even the DNA of primitive organisms, which colonized the Earth billions of years ago under extreme conditions of life, has to fulfill the same constraints as the DNA of more evolved organisms.

In order to achieve this goal some fundamental steps have to be taken into consideration and discussed.
(1) Since DNA is a sequence of symbols, a map of these symbols into numbers has to be defined. In the following we will consider the complex unit roots map, which has the advantage of being unitary and distributed along the unit circle.
(2) The indicator matrix is defined on the indicator map. This matrix is important in order to draw the dot plot of the DNA sequence, and from this plot we can see that nucleotides seem to be randomly distributed. However, we will show by wavelet analysis that they only look randomly distributed, while they are not.
(3) The Ulam spiral adapted to DNA sequences is defined in order to single out some geometrical patterns.
(4) Random walks on DNA, or DNA walks for short, show that these walks look like fractals.
(5) The analysis of clusters of wavelet coefficients shows that DNA walks have to fulfill some geometrical constraints.

In all DNA sequences analyzed so far, for different kinds of living organisms, this geometrical symmetry [16] has been detected. In the following this analysis is extended to archaea, since they might be considered as belonging to an early stage of life, and their DNA is compared with that of more evolved microorganisms such as bacteria.

It will be shown that, in spite of the many similarities with random sequences, only the wavelet analysis makes it possible to single out some distinctions. In particular, the wavelet coefficients of all (analyzed) organisms tend to fulfill a minimum principle for the energy of the signal. Archaea, which often live in extreme environments, also have to fulfill the same geometrical rule as any other living organism.

The analysis of DNA by wavelets [7-9], as seen in [8-12], helps to single out local behavior and singularities [7, 13] or to express the scale invariance of coefficients [14]. The multifractal nature of the time series [15-17] can also be easily detected by wavelet analysis.

Some previous papers have studied various DNA sequences, such as leukemia tet variants, influenza viruses such as the A (H1N1) variant, mammals, and a fungus (see [13, 14]), provided by the National Center for Biotechnology Information [18-21]. In all these papers it was observed that DNA has to fulfill not only some chemical steady state given by the chemical ligands but also some symmetrical distribution of nucleotides along the sequence. In other words, base pairs have to be placed exactly in some positions.

In agreement with previous results, it will be shown that, like any other living organism, these elementary organisms also have DNA walks with fractal shape and wavelet coefficients bounded under a short-range wavelet transform. In other words, even anaerobic organisms, which should be understood as the most elementary ones at the first step of life, have the same symmetries in their wavelet coefficients as more evolved organisms, so that life has to fulfill some constrained distribution of nucleotides in order to give rise to an organism even at the most elementary step.

In particular, in Section 2, some remarks about the analysed data are given. Section 3 deals with some elementary plots which can easily visualize the distribution of nucleotides. The Ulam spiral plot is also proposed for the first time, and a different distribution of weak/strong hydrogen bonds is observed. Section 4 provides some definitions of parameters of complexity. We will notice that all these parameters give rise to the same classification of organisms. Section 5 proposes a complex numerical representation of DNA chains and random walks, while in the final Section 6 the short wavelet transform is given in order to single out some symmetries at the lower order of transform.

2. Materials and Methods

In the following we will take into consideration some genomes, complete sequences of DNA, concerning the following archaea:
h1: Aeropyrum pernix K1, complete genome. DNA, circular, 1669696 bp, [18-21], accession BA000002.3. Lineage: Archaea; Crenarchaeota; Thermoprotei; Desulfurococcales; Desulfurococcaceae; Aeropyrum; Aeropyrum pernix; Aeropyrum pernix K1. This organism, which was the first strictly aerobic hyperthermophilic archaeon sequenced, was isolated from sulfuric gases in Kodakara-Jima Island, Japan, in 1993.
h2: Acidianus hospitalis W1, complete genome. DNA, circular, 2137654 bp, [18-21], accession CP002535. Lineage: Archaea; Crenarchaeota; Thermoprotei; Sulfolobales; Sulfolobaceae; Acidianus; Acidianus hospitalis; Acidianus hospitalis W1.
h3: Acidilobus saccharovorans 345-15, complete genome. DNA, circular, 2137654 bp, [18-21], accession CP001742.1. Lineage: Archaea; Crenarchaeota; Thermoprotei; Acidilobales; Acidilobaceae; Acidilobus; Acidilobus saccharovorans; Acidilobus saccharovorans 345-15. Anaerobic archaeon found in hot springs.

to be compared with the following (aerobic/anaerobic) bacteria/fungi:
b1: Mycoplasma putrefaciens KS1 chromosome, complete genome. DNA, circular, length 832603 bp, [18-21], accession NC_015946. Lineage: Bacteria; Tenericutes; Mollicutes; Mycoplasmatales; Mycoplasmataceae; Mycoplasma; Mycoplasma putrefaciens; Mycoplasma putrefaciens KS1.
b2: Mortierella verticillata mitochondrion, complete genome. dsDNA, circular, length 58745 bp, [18-21], accession NC_006838. Lineage: Eukaryota; Opisthokonta; Fungi; Fungi incertae sedis; Basal fungal lineages; Mucoromycotina; Mortierellales; Mortierellaceae; Mortierella; Mortierella verticillata.
b3: Blattabacterium sp. (Periplaneta americana) str. BPLAN, complete genome. DNA, circular, length 636994 nt, [18-21], accession NC_013418. Lineage: Bacteria; Bacteroidetes/Chlorobi group; Bacteroidetes; Flavobacteria; Flavobacteriales; Blattabacteriaceae; Blattabacterium; Blattabacterium sp. (Periplaneta americana); Blattabacterium sp. (Periplaneta americana) str. BPLAN.

Moreover, we will compare DNA sequences with artificial sequences of randomly chosen nucleotides (see Section 3.2).

2.1. Archaea

Archaea are a group of elementary single-cell microorganisms, having no cell nucleus or any other membrane-bound organelles within their cells. They are similar to bacteria, since they have the same size and shape (apart from a few exceptions) and a generally similar cell structure. However, the evolutionary history of archaea and their biochemistry have significant differences with regard to other forms of life. Therefore they are considered as members of a phylogenetic group distinct from bacteria and eukaryota.

Archaea during their evolution have spread all over the Earth, existing in a broad range of habitats [22, 23] and being one of the major contributors to Earth’s biomass. The most peculiar feature of archaea is that they can live in some environments with extreme life conditions (thus being considered as extremophiles [22, 24]). Indeed, some archaea survive at high temperatures, over 100°C, while others can live in very cold habitats or in highly saline, acidic, or alkaline water. Nevertheless, some archaea live in mild conditions.

It has also been recognized that archaea may be the most ancient organisms on the Earth, so that archaea and eukaryotes probably diverged early from an ancestral colony of organisms.

We will see, in the following, that archaea DNA looks very close to random sequences, so that we can assume that the ancestral organisms were evolving by random permutations from a primitive assembly of nucleotides. Evolution can thus be seen as a tendency toward a steady state far from randomness. Therefore, the bacteria’s DNA (and that of eukaryotes [16]), as a result of evolution, shows the existence of some hidden stability.

3. Correlation Plots

In this section we will consider some elementary plots from which it is possible to visualize autocorrelation and the distribution law of nucleotides, and to measure some fundamental parameters by using frequency counts.

Let $\mathcal{A} = \{A, C, G, T\}$ be the finite set (alphabet) of nucleotides (nucleic acids): adenine $A$, cytosine $C$, guanine $G$, thymine $T$, and $x \in \mathcal{A}$ any member of the alphabet. Nucleic acids are further grouped according to their ligand properties as
(a) purine $\{A, G\}$, pyrimidine $\{C, T\}$,
(b) amino $\{A, C\}$, keto $\{G, T\}$,
(c) weak hydrogen bonds $\{A, T\}$, strong hydrogen bonds $\{C, G\}$.

A DNA sequence is the finite symbolic sequence $\mathcal{S} = \{x_h\}_{h=1,\ldots,N}$, $N < \infty$, with $x_h \in \mathcal{A}$ being the nucleotide at the position $h$.

In general we can define an $\ell$-length alphabet as follows: let the $\ell$-length DNA word be defined by the $\ell$-combination of the 4 nucleotides (1). For each fixed length $\ell$ there are $4^{\ell}$ words; however, not all of them can be considered, from a biological point of view, as independent instances (see, e.g., Table 1). For this reason we define the $\ell$-length alphabet $\mathcal{A}_{\ell}$ as the set of $\ell$-length independent words, $\mathcal{A}_{\ell} = \{\omega_1, \ldots, \omega_{M_{\ell}}\}$, with $M_{\ell}$ the cardinality of the set and $M_{\ell} \le 4^{\ell}$. For instance, with $\ell = 1$ the alphabet is $\mathcal{A}_1 = \mathcal{A}$; with $\ell = 3$ the alphabet $\mathcal{A}_3$ is given by the 20 amino acids, each amino acid being represented by a 3-length word of Table 1.

Let $\mathcal{S}$ be an $N$-length ordered sequence of nucleotides and $\mathcal{A}_{\ell}$ the chosen alphabet. A DNA sequence of words is the finite symbolic sequence $\{\omega_h\}_{h=1,\ldots,M}$, $M < \infty$, with $\omega_h \in \mathcal{A}_{\ell}$ being the word at the position $h$.

3.1. Indicator Matrix

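The 2D indicator function, based on the 1D definition given in [25], is the map $u : \mathcal{S} \times \mathcal{S} \to \{0, 1\}$ on the DNA sequence $\mathcal{S} = \{x_h\}_{h=1,\ldots,N}$ such that

$$u(x_h, x_k) = \begin{cases} 1 & \text{if } x_h = x_k, \\ 0 & \text{if } x_h \neq x_k, \end{cases}$$

with $x_h, x_k \in \mathcal{S}$, where, for short, we have assumed $u_{hk} = u(x_h, x_k)$. According to (12), the indicator of an $N$-length sequence can be easily represented by the $N \times N$ sparse symmetric matrix of binary values $\{0, 1\}$ which results from the indicator matrix (see also [35]) $u_{hk}$, $h, k = 1, \ldots, N$.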

This square matrix can be plotted in 2 dimensions by putting a black dot where $u_{hk} = 1$ and a white spot where $u_{hk} = 0$ (Figure 1), thus giving rise to the two-dimensional dot plot, which is a special case of the recurrence plot [26].
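The construction of the dot plot can be illustrated by a minimal sketch. The function names `indicator_matrix` and `dot_plot` and the toy sequence are assumptions introduced here for illustration only; they are not taken from the original analysis.

```python
def indicator_matrix(seq):
    """Return the N x N binary matrix u[h][k] = 1 if seq[h] == seq[k], else 0."""
    return [[1 if a == b else 0 for b in seq] for a in seq]


def dot_plot(seq, black="#", white="."):
    """Render the indicator matrix as text, one character per matrix entry."""
    return "\n".join(
        "".join(black if value else white for value in row)
        for row in indicator_matrix(seq)
    )


# Toy 8-length sequence (hypothetical, not one of the analysed genomes).
print(dot_plot("ACGTAACG"))
```

By construction the matrix is symmetric and has ones on its main diagonal, which is the symmetry axis of the dot plot.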

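A simple generalization of this matrix can be considered for the alphabets of longer words, as follows. By choosing the 3-alphabet of amino acids, the 2D indicator function is the map into $\{0, 1\}$ such that $u(\omega_h, \omega_k) = 1$ if $\omega_h = \omega_k$ and $u(\omega_h, \omega_k) = 0$ otherwise, with $\omega_h$ and $\omega_k$ being the words (amino acids) at the positions $h$ and $k$.

According to (12), the indicator on the 3-alphabet of amino acids of an $N$-length sequence can be easily represented by the sparse symmetric matrix of binary values $\{0, 1\}$ with entries $u_{hk} = u(\omega_h, \omega_k)$. With the graphical representation of this matrix we can also show the correlation of amino acids.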

3.2. Test Sequences

In the following, in order to single out the main features of biological sequences, we will compare the DNA sequence with some test sequences.
(1) A pseudorandom $N$-length sequence of nucleotides is the sequence $\{r_h\}_{h=1,\ldots,N}$ where each $r_h$ is a symbol randomly chosen in the alphabet $\{A, C, G, T\}$.
(2) A pseudoperiodic $N$-length sequence of nucleotides with period $P$ is the direct sum of copies of a given $P$-length pseudorandom sequence, such that $r_{h+P} = r_h$ and $P \le N$. When $P = N$ we have a pseudorandom sequence.
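The two test sequences can be generated as in the following sketch. The names `pseudorandom_sequence` and `pseudoperiodic_sequence`, and the use of Python's default generator with uniform weights, are assumptions made here; the generator actually used for the figures is not specified in the text.

```python
import random

ALPHABET = "ACGT"


def pseudorandom_sequence(n, rng=None):
    """n symbols drawn uniformly and independently from the alphabet."""
    if rng is None:
        rng = random.Random()
    return "".join(rng.choice(ALPHABET) for _ in range(n))


def pseudoperiodic_sequence(n, period, rng=None):
    """Repeat one pseudorandom motif of the given period, truncated to n symbols."""
    motif = pseudorandom_sequence(period, rng)
    repeats = -(-n // period)  # ceiling division
    return (motif * repeats)[:n]


rng = random.Random(0)  # fixed seed only to make the example reproducible
print(pseudorandom_sequence(20, rng))
print(pseudoperiodic_sequence(20, 5, rng))
```

When `period` equals `n` the motif is never repeated, so the pseudoperiodic sequence reduces to a pseudorandom one.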

If we plot the indicator matrix of some bacteria and compare it with a pseudorandom and a periodic sequence, we can see that (Figure 1)
(1) the main diagonal is a symmetry axis for the plot;
(2) there are some motifs which are repeated at different scales like in a fractal;
(3) periodicity is detected by lines parallel to the main diagonal (Figure 1(a2));
(4) empty spaces are more distributed than filled spaces, in the sense that the matrix is a sparse matrix (having more 0’s than 1’s);
(5) it seems that there are some square-like islands where black spots are more concentrated; these islands show the persistence of a nucleotide (Figures 1(a2) and 1(b1));
(6) the dot plot of archaea is very similar to the dot plot of a random sequence (Figures 1(a1) and 1(h3)).

It can be noticed that DNA sequences of a living organism resemble (Figure 1) random sequences, with some short-range influence, built on the same alphabet. This has been taken as an axiom of nucleotide distribution, so that DNA sequences are often considered as Markov chains [27]. However, there are some hidden rules in combining the nucleotides, and these rules lead, during evolution, to a steady distribution. In fact, the more primitive the sequence is, the more randomly distributed the nucleotides are. It seems that, as a consequence of evolution, nucleotides move from a disordered aggregation toward a more organized structure, shown by the growing islands in the dot plot. Biological evolution is such that self-organization might follow from random permutations of a primitive disordered sequence, so that the organization, that is, the complexity, is only the result of many arbitrary permutations of randomness. During this drive toward complexity, the DNA sequence becomes “less random” and loses some kind of energy.

From the graphical representation of the indicator matrix for bacteria and amino acids we can see a more sparse matrix, but with some typical plots (Figure 2).

3.3. Spiral Plot

In this section we consider a 2D distribution of nucleotides, following the idea given by Ulam for the distribution of primes, along an Ulam-like spiral [28]. In order to find some patterns in their distribution, nucleotides are arranged along a rectangular spiral. This is equivalent to mapping the 1D sequence of integers into a 2D sequence as follows:

For instance, a sequence distributed along the spiral looks like Figure 3.
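The arrangement along the spiral can be sketched as follows. The starting point (the origin), the first direction (to the right), and the counterclockwise turning rule are assumptions made for this illustration, as are the function names; the orientation actually used in the figures may differ.

```python
def spiral_coords(n):
    """Integer coordinates of the first n positions along a square spiral.

    Assumed convention: start at the origin, first step to the right,
    then turn counterclockwise, with leg lengths 1, 1, 2, 2, 3, 3, ...
    """
    coords = []
    x = y = 0
    dx, dy = 1, 0
    step_len = 1
    while len(coords) < n:
        for _ in range(2):  # two legs for each step length
            for _ in range(step_len):
                if len(coords) == n:
                    return coords
                coords.append((x, y))
                x, y = x + dx, y + dy
            dx, dy = -dy, dx  # rotate the direction by 90 degrees
        step_len += 1
    return coords


def nucleotide_spiral(seq, nucleotide):
    """Spiral coordinates of the positions occupied by one given nucleotide."""
    return [p for p, s in zip(spiral_coords(len(seq)), seq) if s == nucleotide]


print(nucleotide_spiral("ACGTAACGT", "A"))
```

Each position of the sequence is assigned exactly one lattice point, so plotting the four lists for A, C, G, and T gives the four spirals of one organism.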

For each nucleotide we can draw a spiral containing the distribution of only one nucleic acid. To each organism there correspond four plots, for $A$, $C$, $G$, $T$, respectively.

Let us first note that on a random sequence (Figure 4) the four distributions are equivalent.

By comparing the spirals of bacteria, random sequences, and archaea (Figures 4, 5, 6, 7, 8, 9, 10) we can see that there is a different distribution of each nucleotide. However, the more evolved organism tends to have a higher percentage of weak hydrogen bonds (Figures 5, 6 and 7), so that we can assume the following.

Conjecture 1. During the evolution, the distribution of nucleotides changes in such a way that strong hydrogen bonds tend to become weak.

It should be noticed that along these spirals there is a one-to-one map between $\mathbb{N}$ and the points of the spiral (with integer coordinates) in $\mathbb{Z}^2$, so that $n \leftrightarrow (x_n, y_n)$. This bijective map can be considered also between $\mathbb{N}$ and the complex plane $\mathbb{C}$, so that each natural number corresponds to a complex number (with integer coefficients) $z_n = x_n + i\, y_n$.

Since these spirals seem to fill in a finite region of the plane we can evaluate the complexity of each curve by typical fractal measures.

4. Parameters of Complexity

In this section we define some parameters, based on frequency distribution, which can measure the complexity of a DNA by computing the complexity of its representation in the complex plane (for a more detailed analysis see [29] and references therein).

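Let $\mathcal{S}$ be an $N$-length ordered sequence of nucleotides, and $p_x(h)$ be the probability to find the nucleotide $x \in \{A, C, G, T\}$ at the position $h$. According to (12) we define $a_x(n)$ as the number of nucleotides $x$ in the $n$-length segment of $\mathcal{S}$, so that $\sum_{x} a_x(n) = n$. The corresponding frequencies are $\nu_x(n) = a_x(n)/n$, so that $\sum_{x} \nu_x(n) = 1$.

We can assume that for large sequences $p_x \cong \nu_x(n)$.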
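The frequency count over an initial segment can be sketched as below. The function names and the `n` parameter (length of the initial segment) are hypothetical and introduced only for this example.

```python
from collections import Counter


def nucleotide_counts(seq, n=None, alphabet="ACGT"):
    """Number of occurrences of each nucleotide in the first n symbols of seq."""
    segment = seq if n is None else seq[:n]
    counts = Counter(segment)
    return {x: counts.get(x, 0) for x in alphabet}


def nucleotide_frequencies(seq, n=None, alphabet="ACGT"):
    """Relative frequencies of each nucleotide; they sum to 1 by construction."""
    counts = nucleotide_counts(seq, n, alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("the segment contains no symbol of the alphabet")
    return {x: counts[x] / total for x in alphabet}


print(nucleotide_frequencies("AACGTTTT", n=4))
```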

4.1. Randomness

Since for a random sequence the frequencies of nucleotides coincide for large $N$, we can define a randomness index in terms of the variance $\sigma^2$, which takes one reference value for a random sequence and a different value for a nonrandom sequence. Over the first 10000 nucleotides we have the randomness values of Table 2.

However, if we compute the randomness index over the frequencies of amino acids in the 3-length alphabet, then we can observe a different distribution of values. Over the first 30000 nucleotides, corresponding to 10000 amino acids, we have the randomness values of Table 3.

We can thus observe that the arising complexity of the words and alphabets shows a different randomness in each alphabet.

4.2. Complexity

As a simple measure of complexity [30-32], for an $N$-length sequence, the following has been proposed [33]:

In Table 4 the complexity of the first 100-length segment of the DNA sequences is computed. It is interesting to notice that the archaeon Acidilobus is more similar to the pseudorandom sequence than to the pseudoperiodic one. Nucleotide distribution in primitive biosequences is more likely random than pseudodeterministic. Moreover, evolution reduces the complexity of the sequence.

4.3. Fractal Dimension

The fractal dimension is computed on the dot plot, by the box counting algorithm [34, 35], as the average of the number of 1’s in the randomly taken minors of the indicator matrix or equivalently the number of black dots in the randomly taken squares over the dot plot
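As an illustration of box counting on a dot plot, the following sketch uses the textbook regular-grid estimator: the number of boxes containing at least one black dot is counted for box sizes 1, 2, 4, ..., and the dimension is the least-squares slope of the log-log plot. This is a standard variant swapped in for illustration, not necessarily the averaging over randomly taken minors described above, and all function names are assumptions.

```python
import math


def indicator_matrix(seq):
    """Binary matrix u[h][k] = 1 if seq[h] == seq[k], else 0."""
    return [[1 if a == b else 0 for b in seq] for a in seq]


def box_count(matrix, size):
    """Number of size x size grid boxes that contain at least one 1."""
    n = len(matrix)
    count = 0
    for r in range(0, n, size):
        for c in range(0, n, size):
            if any(matrix[i][j]
                   for i in range(r, min(r + size, n))
                   for j in range(c, min(c + size, n))):
                count += 1
    return count


def box_counting_dimension(matrix):
    """Least-squares slope of log N(s) against log(1/s) for s = 1, 2, 4, ..."""
    n = len(matrix)
    sizes = []
    s = 1
    while s <= n // 2:
        sizes.append(s)
        s *= 2
    if len(sizes) < 2:
        raise ValueError("matrix too small for a slope estimate")
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(box_count(matrix, s)) for s in sizes]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


print(box_counting_dimension(indicator_matrix("ACGTAACGTTGCAACG")))
```

Two limiting cases help to read the result: a dot plot with black dots only on the main diagonal gives dimension 1, and a completely filled dot plot gives dimension 2.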

The explicit computation enables us to compare the fractal dimension on the first 100-length segments of DNA chains, with an approximation up to (see Table 5).

If we compare the fractal dimensions of the bacteria with pseudorandom and pseudoperiodic we can see that the fractal dimension of nucleotide distribution ranges, for all variants, in the interval . As expected, the more “random” sequences have higher fractal dimension.

4.4. Entropy

Another fundamental parameter, related to the information content of a sequence and measuring the heterogeneity of data, is the information entropy (or Shannon entropy) [36-42]. Based on the axiom that less information implies a larger uncertainty and, vice versa, that more information leads to a more deterministic model, the entropy concept has recently offered some interesting interpretations about uncertainty in DNA. In fact, DNA, as any other signal, has been considered as a sequence of symbols carrying chemical-functional information.

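The normalized Shannon entropy [39, 40, 42] is defined, over the alphabet $\{A, C, G, T\}$, as

$$H = -\frac{1}{\log 4} \sum_{x \in \{A, C, G, T\}} p_x \log p_x,$$

where the probabilities $p_x$ should be computed for large sequences. According to (32), (34), we will approximate its value with

$$H(n) \cong -\frac{1}{\log 4} \sum_{x \in \{A, C, G, T\}} \nu_x(n) \log \nu_x(n),$$

$\nu_x(n)$ being the frequency of the nucleotide $x$ in the $n$-length segment.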
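A minimal sketch of the frequency-based approximation is given below. It assumes that the normalization is the division by the logarithm of the alphabet size, so that the entropy ranges between 0 and 1; the function name is hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(seq, alphabet="ACGT"):
    """Shannon entropy of the symbol frequencies, divided by log(len(alphabet))."""
    n = len(seq)
    counts = Counter(seq)
    h = 0.0
    for x in alphabet:
        nu = counts.get(x, 0) / n
        if nu > 0:  # the term 0 * log 0 is taken as 0
            h -= nu * math.log(nu)
    return h / math.log(len(alphabet))


print(normalized_entropy("AACC"))
```

With this normalization a sequence with equal frequencies of the four nucleotides has entropy 1, while a sequence made of a single repeated nucleotide has entropy 0.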

However, the entropy is a parameter very similar to the complexity. In fact, it can be easily seen (for the proof see [29]) that the entropy and the measure of complexity differ by a factor. It follows that the entropy does not give any new information compared with the previous parameters. As expected, the table of entropies also classifies bacteria and archaea in the same way (Table 6).

5. Complex Root Representation of DNA Words

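The complex (digital) representation of a DNA sequence of words is the map of the symbolic sequence of words into a set of complex numbers, and it is defined as $\rho : \mathcal{A}_{\ell} \to \mathbb{C}$ such that for each word $\omega_h$ it is $y_h = \rho(\omega_h)$, where $\mathcal{A}_{\ell}$ denotes the chosen alphabet of $\ell$-length words and $M_{\ell}$ its cardinality.

The complex root representation of the sequence is the sequence of complex numbers $\{y_h\}$ defined as

$$y_h = \rho(\omega_h) = e^{2\pi i (k-1)/M_{\ell}}, \qquad k = 1, \ldots, M_{\ell},$$

where $k$ is the index of the word $\omega_h$ in the alphabet and $i = \sqrt{-1}$ is the imaginary unit. It follows that, independently of the alphabet, it is $|y_h| = 1$, all the values being complex roots of unity located on the unit circle of the complex plane $\mathbb{C}$.

For instance, with $\ell = 1$, the cardinality of the alphabet is $M_1 = 4$ and the nucleotides are mapped onto the four roots $\{1, i, -1, -i\}$.

Analogously, with $\ell = 3$ it is $M_3 = 20$, and the amino acids are mapped onto the 20 complex roots of unity $e^{2\pi i (k-1)/20}$, $k = 1, \ldots, 20$. Therefore the complex representation of a DNA sequence is a sequence of complex numbers $\{y_h\}$ with $y_h$ given by (42).

An $N$-length pseudorandom (white noise) complex sequence belonging to the unit circle can be defined directly by using some random exponents, $y_h = e^{2\pi i\, r_h / M_{\ell}}$, with $r_h$ being random values in the set $\{0, 1, \ldots, M_{\ell} - 1\}$.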
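The root-of-unity map can be sketched in a few lines. The assignment of the $k$-th symbol of the alphabet to the root with exponent $k-1$, and hence the ordering A, C, G, T on the roots 1, i, -1, -i, is an assumption of this example, as are the function names.

```python
import cmath
import random


def root_of_unity_map(alphabet):
    """Map the k-th symbol (k = 0, 1, ...) to exp(2*pi*i*k/M), M = len(alphabet)."""
    m = len(alphabet)
    return {w: cmath.exp(2j * cmath.pi * k / m) for k, w in enumerate(alphabet)}


def complex_representation(words, alphabet="ACGT"):
    """Complex sequence obtained by mapping each word to its root of unity."""
    rho = root_of_unity_map(alphabet)
    return [rho[w] for w in words]


def random_unit_sequence(n, m=4, rng=None):
    """White-noise sequence on the unit circle built from random exponents."""
    if rng is None:
        rng = random.Random()
    return [cmath.exp(2j * cmath.pi * rng.randrange(m) / m) for _ in range(n)]


for symbol, value in zip("ACGT", complex_representation("ACGT")):
    print(symbol, complex(round(value.real, 12), round(value.imag, 12)))
```

The same `root_of_unity_map` applies to the 20-symbol alphabet of amino acids by passing a sequence of 20 distinct labels as `alphabet`.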

5.1. Random Walks

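The random walk on the complex sequence $\{y_h\}_{h=1,\ldots,N}$ is defined as the series $\{Z_n\}_{n=1,\ldots,N}$ which is the cumulative sum

$$Z_n = \sum_{h=1}^{n} y_h, \qquad n = 1, \ldots, N.$$

When the $y_h$ are the complex representation of a DNA sequence, we will properly call these walks DNA walks. When the $y_h$ are randomly generated, we will call them random walks.

By remembering the definition of frequencies, the DNA walk is the complex-valued signal

$$Z_n = \sum_{x \in \{A, C, G, T\}} a_x(n)\, \rho(x),$$

with $\rho(x)$ the complex root assigned to the nucleotide $x$, where the coefficients $a_x(n)$ given by (12) fulfill the condition (31).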
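The cumulative sum and its expression through the nucleotide counts can be checked on a toy sequence. The dictionary `RHO` encodes an assumed assignment of the four roots of unity to A, C, G, T; the function name and the toy sequence are likewise hypothetical.

```python
from itertools import accumulate

# Assumed assignment of the four complex roots of unity to the nucleotides.
RHO = {"A": 1 + 0j, "C": 1j, "G": -1 + 0j, "T": -1j}


def dna_walk(seq, rho=RHO):
    """Cumulative sums Z_1, ..., Z_N of the complex representation of seq."""
    return list(accumulate(rho[x] for x in seq))


seq = "AACGTTA"  # toy sequence, not one of the analysed genomes
walk = dna_walk(seq)

# The end point equals the sum over nucleotides of count times root.
end_point = sum(seq.count(x) * RHO[x] for x in "ACGT")
print(walk[-1], end_point)
```

Under this assignment the real part of the walk is the excess of A over G and the imaginary part is the excess of C over T in the segment considered.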

If we compare the DNA walks (Figure 11), some primitive archaea such as h3 are very similar to a random walk (Figure 13). In particular, the walks of archaea seem to grow less than those of bacteria (with the exception of b2).

It is interesting also to notice that the random walks on amino acids (Figure 12) show that more evolved organisms have some “periodic” behavior, while the absolute value of walks on archaea is growing fast.

6. Wavelet Analysis

Wavelet analysis is a powerful method extensively applied to the analysis of biological signals [12, 43-45], aiming to single out the most significant parameters of complexity and heterogeneity in a time series and, in particular, in a DNA sequence. This method is based on the analysis of wavelet coefficients which are obtained by the wavelet transform.

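We will consider in the following the Haar wavelet basis (see, e.g., [3, 4, 29]) made by the scaling functions

$$\varphi_k^n(x) = 2^{n/2} \varphi(2^n x - k), \qquad \varphi(2^n x - k) = \begin{cases} 1, & x \in \left[\dfrac{k}{2^n}, \dfrac{k+1}{2^n}\right), \\ 0, & \text{elsewhere}, \end{cases}$$

and the Haar wavelets

$$\psi_k^n(x) = 2^{n/2} \psi(2^n x - k), \qquad \psi(2^n x - k) = \begin{cases} -1, & x \in \left[\dfrac{k}{2^n}, \dfrac{k + 1/2}{2^n}\right), \\ 1, & x \in \left[\dfrac{k + 1/2}{2^n}, \dfrac{k+1}{2^n}\right), \\ 0, & \text{elsewhere}. \end{cases}$$

The discrete Haar wavelet transform is the $N \times N$ matrix $\mathcal{W}^N$ which maps the vector $\mathbf{Y} = (Y_0, Y_1, \ldots, Y_{N-1})$ into the vector of wavelet coefficients $\boldsymbol{\beta} = (\alpha, \beta_0^0, \beta_0^1, \beta_1^1, \ldots)$:

$$\mathcal{W}^N \mathbf{Y} = \boldsymbol{\beta}.$$

The matrix $\mathcal{W}^N$ can be easily computed by some recursive product [3, 4, 13, 29, 46] so that with $N = 4$, we have [3, 4, 29]

$$\mathcal{W}^4 = \begin{pmatrix} \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \\ -\frac{1}{2} & -\frac{1}{2} & \frac{1}{2} & \frac{1}{2} \\ -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 & 0 \\ 0 & 0 & -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix}.$$

From (55) with $N = 4$, by explicit computation, we have [13, 14]

$$\alpha = \frac{1}{4}(Y_0 + Y_1 + Y_2 + Y_3), \quad \beta_0^0 = \frac{1}{2}(Y_2 + Y_3 - Y_0 - Y_1), \quad \beta_0^1 = \frac{1}{\sqrt{2}}(Y_1 - Y_0), \quad \beta_1^1 = \frac{1}{\sqrt{2}}(Y_3 - Y_2).$$

Thus the first wavelet coefficient $\alpha$ represents the average value of the sequence and the other coefficients $\beta$ the finite differences. The wavelet coefficients $\beta$, also called detail coefficients, are strictly connected with the first-order properties of the discrete time series.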
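A 4-point Haar transform with these properties can be sketched as a plain matrix product. The matrix `W4` below is an assumption of this example: its first row is chosen so that the first coefficient is the average, as stated in the text, while the signs and the normalization of the remaining rows follow one common convention and may differ from the one used for the figures.

```python
import math

S = 1 / math.sqrt(2)

# Assumed 4 x 4 Haar matrix: first row gives the average of the segment,
# the other rows give (scaled) finite differences.
W4 = [
    [0.25, 0.25, 0.25, 0.25],
    [-0.5, -0.5, 0.5, 0.5],
    [-S, S, 0.0, 0.0],
    [0.0, 0.0, -S, S],
]


def haar4(y):
    """Return [alpha, beta_00, beta_01, beta_11] for a 4-length vector y."""
    if len(y) != 4:
        raise ValueError("haar4 expects exactly 4 values")
    return [sum(w * v for w, v in zip(row, y)) for row in W4]


print(haar4([0, 2, 4, 6]))
print(haar4([1, 1j, -1, -1j]))  # complex input is handled in the same way
```

For a constant segment all detail coefficients vanish and only the average survives, which is the sense in which the detail coefficients measure the local variation of the signal.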

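In the following we will consider the short wavelet transform, which consists in the subdivision of the DNA sequence into 4-length segments and in the application of the wavelet transform to each segment. As a result, from the $N$-length complex vector $\mathbf{Y}$, which is subdivided into $N/4$ segments, the 4-parameter short Haar wavelet transform gives a cluster of $N/4$ points in the 8-dimensional space $\mathbb{R}^8$, that is, one point for each segment, collecting the real and imaginary parts of its four wavelet coefficients: $(\mathrm{Re}\,\alpha, \mathrm{Re}\,\beta_0^0, \mathrm{Re}\,\beta_0^1, \mathrm{Re}\,\beta_1^1, \mathrm{Im}\,\alpha, \mathrm{Im}\,\beta_0^0, \mathrm{Im}\,\beta_0^1, \mathrm{Im}\,\beta_1^1)$.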
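The segmentation step can be sketched as follows. The coefficient formulas inside `haar4` (average, half-difference of the two halves, and scaled differences within each half) and the ordering of the eight coordinates are assumptions of this example, and incomplete trailing segments are simply dropped.

```python
import math


def haar4(y0, y1, y2, y3):
    """Assumed 4-point Haar coefficients: average plus three scaled differences."""
    s = 1 / math.sqrt(2)
    return (
        (y0 + y1 + y2 + y3) / 4,
        (y2 + y3 - y0 - y1) / 2,
        s * (y1 - y0),
        s * (y3 - y2),
    )


def short_haar_transform(y):
    """One 8-dimensional point (4 real parts, then 4 imaginary parts) per segment."""
    points = []
    usable = len(y) - len(y) % 4  # drop an incomplete last segment
    for start in range(0, usable, 4):
        segment = [complex(v) for v in y[start:start + 4]]
        coeffs = haar4(*segment)
        points.append(tuple(c.real for c in coeffs) + tuple(c.imag for c in coeffs))
    return points


y = [1, 1j, -1, -1j, 1, 1, 1, 1, 1j]  # toy complex sequence on the unit circle
for point in short_haar_transform(y):
    print(point)
```

Since each entry of a segment can take only four values, there are at most $4^4 = 256$ distinct segments, so the points of the cluster can only fall on a finite set of locations.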

This algorithm enables us to construct clusters of wavelet coefficients and to study the correlation between the real and imaginary coefficients of the DNA representation and DNA walk. It has been observed [3, 4, 29] that some symmetry arises from the plots of wavelet coefficients of DNA walks.

6.1. Cluster Analysis of the Wavelet Coefficients of the Complex DNA Representation

Let us first compute the clusters of wavelet coefficients for the random sequence (48). As can be seen the wavelet coefficients both for the sequence and for its series range in some discrete set of values (see Figure 13).

The cluster algorithm applied to the complex representation sequence shows that the values of the wavelet coefficients belong to some discrete finite sets (Figure 14).

It should be noticed that this symmetry on detail coefficients is lost for wavelet transform on longer segments (Figures 15, 16 and 17).

It follows that DNA sequences have to be considered as Markov chains with short-range dependence; in other words, any nucleic acid is attached to the chain on the basis of a correlation with the previous nucleic acid. If we look for a dependence rule on the DNA nucleotides, this dependence might be summarized by a function of the preceding nucleotides.

7. Conclusions

In this paper archaea DNAs have been studied by focusing on the main parameters of complexity. It has been shown that the main indices of complexity and heterogeneity, such as entropy, fractal dimension, and complexity, do not differ much when we have to classify the complexity of the sequence. However, some DNA sequences look closer to random sequences than others, thus suggesting that evolution involves a process of complexity reduction: the more evolved a sequence is, the farther from a random distribution it is. In any case, with these indices it seems to be impossible to distinguish between a random sequence and a DNA chain. By using the short wavelet transform, instead, we have shown that on a short range (4 nucleotides) a DNA sequence shows some symmetries that slowly disappear by increasing the length of the analysed segment. Moreover, more evolved organisms have a more symmetrical distribution of wavelet coefficients.