Abstract

This paper studies the information content of the chromosomes of twenty-three species. Several statistics, considering different numbers of bases for alphabet character encoding, are derived. Based on the resulting histograms, word delimiters and character relative frequencies are identified. Knowledge of these data allows moving along each chromosome while evaluating the flow of characters and words. The resulting flux of information is captured by means of Shannon entropy. The results are explored from the perspective of power law relationships, allowing a quantitative evaluation of the DNA of the species.

1. Introduction

In recent years genome sequencing projects have produced a large volume of data that is presently available for computational processing [1–14]. Researchers have been tackling the information content of deoxyribonucleic acid (DNA), but interesting questions remain open [15–21].

This paper addresses the information flow along each DNA strand. For this purpose several statistics are developed, and the relative frequencies of distinct types of symbol associations are evaluated. The concepts of character, word, word delimiter, and phrase are defined, and the information content of each chromosome message is quantified. Power law (PL) relationships emerge in the information locus. PL distributions, often known as heavy tail distributions, Pareto laws, Zipf laws, or others, have been widely reported in the modeling of distinct real phenomena [22–31]. It was recognized [11, 32–34] that DNA has an information structure that reveals long range behavior, somewhat in line with systems whose dynamics are described by the tools of Fractional Calculus (FC) [35–37]. A strong relationship between FC and PL is known to exist; nevertheless, to date no formal demonstration supports that observation, which rests on empirical and experimental measurements. Therefore, it is not a surprise that both FC and PL descriptions emerge when analyzing DNA with distinct mathematical tools. In the present study PL descriptions are applied for condensing the charts characterizing the chromosomes of twenty-three species.

Having these ideas in mind, this paper is organized as follows. Section 2 presents the DNA sequence decoding concepts, the mathematical tools and formulates the algorithm that computes the information for each chromosome and species. Section 3 analyzes the DNA information dynamical content of 463 chromosomes corresponding to a set of twenty-three species. Finally, Section 4 outlines the main conclusions.

2. Preliminary Notes on the DNA Information

In the DNA double helix there are four distinct nitrogenous bases, namely, thymine, cytosine, adenine, and guanine, denoted by the symbols {T, C, A, G}. Each type of base on one strand connects with only one type of base on the other strand, forming the base pairings A-T and G-C. Besides the four symbols {T, C, A, G}, the available chromosome data includes a fifth symbol “N” which is believed to have no practical meaning for the DNA decoding.

For processing the DNA information, one possible technique is to convert the symbols into numerical values. Previous papers adopted a direct translation of each of the five symbols into a numerical value. We can move along the DNA strip, one symbol (base) at a time. The resulting values form a “signal” $x(t)$, where $t$ can be interpreted as a pseudotime. The signal can be treated by the Fourier transform $X(\omega) = \mathcal{F}\{x(t)\}$, where $\omega$ represents the angular frequency.

Figure 1 shows one example with the amplitude of the Fourier transform for human chromosome 1. A PL approximation is superimposed over the adopted frequency interval, revealing a strong correlation.

This technique has, however, one drawback, namely the initial assignment of numerical values to the DNA symbols. Therefore, it is important to design an alternative method of analysis that avoids this problem while remaining capable of revealing fractional order phenomena. With this strategy in mind, this paper adopts an approach based on the histograms of symbol alignment, information theory, and PL approximations.

This study focuses on twenty-three species, yielding a space of 463 chromosomes. Listing each species as {name, tag, number of chromosomes}, with the species index as a subscript, we consider the set given by {Mosquito (Anopheles gambiae), Ag, 5}$_{1}$, {Honeybee (Apis mellifera), Am, 16}$_{2}$, {Caenorhabditis briggsae, Cb, 6}$_{3}$, {Caenorhabditis elegans, Ce, 6}$_{4}$, {Chimpanzee, Ch, 25}$_{5}$, {Dog, Dg, 39}$_{6}$, {Drosophila simulans, Ds, 6}$_{7}$, {Drosophila yakuba, Dy, 10}$_{8}$, {Horse, Eq, 32}$_{9}$, {Chicken, Ga, 31}$_{10}$, {Human, Ho, 24}$_{11}$, {Medaka, Me, 24}$_{12}$, {Mouse, Mm, 21}$_{13}$, {Opossum, Op, 9}$_{14}$, {Orangutan, Or, 24}$_{15}$, {Cow, Ox, 30}$_{16}$, {Pig, Po, 19}$_{17}$, {Rat, Rn, 21}$_{18}$, {Yeast (Saccharomyces cerevisiae), Sc, 16}$_{19}$, {Stickleback, St, 21}$_{20}$, {Zebra Finch, Tg, 32}$_{21}$, {Tetraodon, Tn, 21}$_{22}$, and {Zebrafish, Zf, 25}$_{23}$.

The DNA information decoding is addressed in this paper, and we start by defining the underlying concepts. The fundamental unit is the “symbol” that, in our case, consists of one of the four possibilities {T, C, A, G}, while “N” is simply disregarded. Each “character” is represented by an $n$-tuple association of the 4 symbols, resulting in a total of $4^n$ possible characters. For example, with $n = 2$ we get a maximum of $4^2 = 16$ characters, represented by the two-symbol sequences {TT, TC, TA, TG, CT, CC, CA, CG, AT, AC, AA, AG, GT, GC, GA, GG}. The sequences are obtained when moving sequentially along the DNA. The characters may have different significance and are divided into two classes, namely, characters with relevant information, to be denoted in the sequel as “word characters,” and delimiters denoted as “spaces.” Therefore, joining consecutive “word characters” yields a “word,” which ends in the presence of one or more consecutive “spaces” (i.e., multiple spaces are considered as a single space). When the complete association of consecutive words is fulfilled, we obtain a “message.”

Figure 2 depicts a simple example of a message with 21 symbols and 3 words. The message {ACTACGTTGGGTTCAGAAACC} is processed according to the proposed scheme for $n = 2$, considering the 2 sequences {TT, AA} as spaces and the 14 sequences {TC, TA, TG, CT, CC, CA, CG, AT, AC, AG, GT, GC, GA, GG} as word characters. Therefore, the resulting words are {AC TA CG}, {GG GT TC AG} and {CC}.
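The word-splitting rule can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `split_words` is hypothetical, characters are read as disjoint $n$-tuples, and the input is an assumed 22-symbol message with a doubled AA delimiter, chosen so that the pairs tile the string exactly and the collapsing of repeated spaces is visible.

```python
def split_words(message, n, spaces):
    """Split a symbol string into words made of n-symbol characters.

    Assumptions of this sketch: characters are disjoint n-tuples read from
    the first symbol, and a trailing remainder shorter than n is dropped.
    """
    chars = [message[i:i + n] for i in range(0, len(message) - n + 1, n)]
    words, current = [], []
    for ch in chars:
        if ch in spaces:
            if current:            # a space closes the current word;
                words.append(current)
                current = []       # further consecutive spaces add nothing
        else:
            current.append(ch)
    if current:                    # the message may end without a space
        words.append(current)
    return words


# Hypothetical 22-symbol message: ... AG | AA | AA | CC
words = split_words("ACTACGTTGGGTTCAGAAAACC", n=2, spaces={"TT", "AA"})
print(words)
```

The two consecutive AA characters act as a single delimiter, so the message yields three words of lengths 3, 4, and 1.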

We verify that words may have different lengths and that any repetition of spaces is considered as a single space. The message finishes when the end of the DNA strand is reached and, therefore, the case of multiple messages per chromosome is not considered.

After defining the concepts of symbol, character (with the categories of word character and space), and message, we need to establish the numerical value to be adopted for $n$ and the method for measuring the information. Concerning $n$, no a priori optimal value is considered. Therefore, the experiments analyze the influence of going from $n = 1$ up to $n = 12$ or, correspondingly, from $4^1$ up to $4^{12}$ possible characters. This evaluation is performed for one chromosome. Based on this first assessment, and given the huge computational load required by high values of $n$, the set of twenty-three species, totaling 463 chromosomes, is analyzed for a reduced set of values of $n$. Concerning the information measurement, the Shannon information [38–49] $I(x_i) = -\log p(x_i)$ is adopted, where $I(x_i)$ represents the quantity of information of event $x_i$ that has probability $p(x_i)$. On this topic we can refer to [50], which also calculates the Shannon information for short DNA words of differing lengths and where the authors find that genomes share universal statistical properties. It is also worth mentioning that other entropies, such as the Rényi, Tsallis, and Ubriaco definitions [51, 52], were tested. Nevertheless, experiments with these expressions and distinct numerical values of the parameters did not reveal any significant conceptual difference. Therefore, for simplicity, only the Shannon definition is adopted in the sequel.

In our case, for an $n$-tuple symbol encoding, the occurrence of the $i$th character within the set of $4^n$ characters has probability $p_i$, leading to information $I_i = -\log p_i$, and, therefore, the total information content of a word is $I_w = \sum_{i=1}^{L} I_i$, where $L$ represents the total number of word characters including the first space. In fact, the effect of including, or not, the space information was evaluated numerically but, due to its low importance, the final effect is negligible. Therefore, one space is included as the information for delimiting the word, while further consecutive repetitions of spaces are disregarded.

The message information is the sum of all word information: $I_m = \sum_{k=1}^{W} I_{w,k}$, where $I_{w,k}$ is the information of the $k$th word and $W$ denotes the total number of words included in the message (i.e., the chromosome).
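The two sums can be sketched numerically as follows. This is only an illustration under stated assumptions: the natural logarithm is used (the base is an assumption of this sketch), the toy words are hypothetical, each word is listed together with its single closing space "TT", and probabilities are approximated by relative frequencies over the toy data itself.

```python
import math
from collections import Counter


def word_information(words, probs):
    """Information of each word: the sum of -ln(p) over its characters."""
    return [sum(-math.log(probs[ch]) for ch in word) for word in words]


# Hypothetical toy message: three words, each closed by one space "TT".
words = [["AC", "TA", "TT"], ["GG", "TT"], ["AC", "TT"]]

# Relative frequencies used as approximations of the probabilities.
counts = Counter(ch for word in words for ch in word)
total = sum(counts.values())
probs = {ch: c / total for ch, c in counts.items()}

info_per_word = word_information(words, probs)  # one value per word
message_info = sum(info_per_word)               # message = sum over words
print(info_per_word, message_info)
```

Frequent characters, such as the space, contribute little information to each word, which is consistent with the remark that including or excluding the space has a negligible effect.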

The information measurement requires the knowledge of $p_i$. While we can expect an equilibrium of probabilities for $n = 1$, that may not be true for larger values of $n$. Therefore, in the sequel a numerical procedure is adopted that starts by reading the chromosome message based on the $n$-tuple character setup, leading to the construction of one histogram per chromosome. From the set of $4^n$ bins, those that are more frequent (and have smaller information content) are chosen, by inspection, for the role of spaces. In a second phase, the relative frequencies, which are adopted as approximants to the probabilities, and the information values (2.1) and (2.2) are calculated numerically while traveling along the DNA strand.
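The first phase of this procedure can be sketched as below. The sketch makes several assumptions that are not stated in the text: $n$-tuples are counted while sliding one symbol at a time, any tuple containing a symbol outside {T, C, A, G} is skipped, and the choice "by inspection" is replaced by simply taking the `k` most frequent bins. The function names and the toy sequence are hypothetical.

```python
from collections import Counter

BASES = set("TCAG")


def sliding_histogram(seq, n):
    """Relative frequency of each n-tuple, sliding one symbol at a time.

    Tuples containing any symbol outside T, C, A, G are skipped
    (an assumption of this sketch).
    """
    counts = Counter(
        seq[i:i + n]
        for i in range(len(seq) - n + 1)
        if set(seq[i:i + n]) <= BASES
    )
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()}


def pick_spaces(freqs, k=2):
    """Stand-in for the choice by inspection: the k most frequent bins."""
    ranked = sorted(freqs, key=freqs.get, reverse=True)
    return set(ranked[:k])


seq = "AAAAAATTTTTCGNAAAA"   # hypothetical toy sequence with one non-base symbol
freqs = sliding_histogram(seq, n=2)
spaces = pick_spaces(freqs)
print(freqs, spaces)
```

On a real chromosome the histogram would still need to be inspected, since a fixed `k` is only reasonable when a few bins clearly dominate.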

This strategy does not assume any a priori optimal value of $n$. Therefore, as mentioned previously, several distinct values of $n$ will be studied before establishing any conclusions.

3. Capturing the DNA Information

We start by considering Human chromosome 12 (Ho12) and $n = 1, \ldots, 12$. This chromosome is represented by a medium-size file (130 Mbytes) and may be considered a good compromise between length and computational load.

Figure 3 depicts the resulting histograms where, to simplify the visualization, the characters are ordered by decreasing relative frequency. For the construction of the histograms two counting methods were envisaged: (i) counting with disjoint sets of symbols and (ii) counting the sets while sliding one symbol at a time. At first sight (i) seems the most straightforward, but since we do not have reliable information for starting and synchronizing the counting, method (ii) is more robust and, therefore, is adopted in the sequel.
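The difference between the two counting methods can be seen in a short sketch. The function names and the toy sequence are hypothetical; the point is only that disjoint counting depends on the starting offset, whereas sliding counting does not.

```python
from collections import Counter


def count_disjoint(seq, n, offset=0):
    """Method (i): count non-overlapping n-tuples starting at `offset`."""
    return Counter(seq[i:i + n] for i in range(offset, len(seq) - n + 1, n))


def count_sliding(seq, n):
    """Method (ii): count n-tuples while sliding one symbol at a time."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


seq = "ACACACACAC"                       # hypothetical toy sequence
disjoint_0 = count_disjoint(seq, 2, 0)   # sees only AC
disjoint_1 = count_disjoint(seq, 2, 1)   # sees only CA
sliding = count_sliding(seq, 2)          # sees both AC and CA
print(disjoint_0, disjoint_1, sliding)
```

Shifting the start by one symbol changes the disjoint histogram completely, while the sliding count equals the sum of the disjoint counts over all $n$ possible offsets and so needs no synchronization.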

Figure 4 shows the word information dynamics when travelling along the Ho12 strand for distinct values of $n$. We observe the existence of quantum information levels that gradually vanish when $n$ increases. This is due to the finite number of quantization levels of information that occur before a space terminates a word. The number of quantum levels increases with $n$ while the length of each word increases. Besides this interesting effect, we also note considerable randomness and a uniform behavior along the whole length of the strand.

The total chromosome information, the number of words, and the average word information versus $n$ are depicted in Figures 5(a) and 5(b). The total chromosome information reaches a maximum, and for larger values of $n$ it decreases slightly due to the effect of dropping repeated consecutive spaces. Therefore, we can say that large values of $n$ seem to lead to a slightly better estimate of the total information content, while the smallest values of $n$ lead to an inferior measurement process. We also observe that the number of words decreases with $n$ but its average information varies in the opposite way. Therefore, it is relevant to plot one variable against the other, with $n$ as parameter (Figure 5(c)). A PL trendline approximation, with parameters obtained numerically, demonstrates that the two quantities are inversely proportional. A similar type of behavior was observed for the rest of the chromosomes, but with different numerical values for the parameters.
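A PL trendline of this kind is commonly obtained by a linear least-squares fit on log-log axes. The sketch below assumes the form $y = a x^{b}$; the symbols $a$ and $b$, the function name, and the synthetic data are assumptions for illustration and are not values from the paper.

```python
import math


def fit_power_law(x, y):
    """Least-squares fit of y = a * x**b on log-log axes; returns (a, b)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    # Slope and intercept of the straight line ln(y) = ln(a) + b * ln(x).
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = math.exp(my - b * mx)
    return a, b


# Synthetic data (hypothetical): number of words vs. average word information.
num_words = [1e3, 1e4, 1e5, 1e6]
avg_info = [5.0 * w ** -0.5 for w in num_words]
a, b = fit_power_law(num_words, avg_info)
print(a, b)
```

A negative exponent $b$ corresponds to the observed trend in which the average word information grows as the number of words shrinks.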

For other values of $n$ the resulting histograms reveal identical characteristics, namely, two characters with a very large relative frequency (depicted at the left part of the histograms of Figure 3). Furthermore, experiments with other chromosomes lead to similar results. The two characters are simply a succession of symbols A or T, and the corresponding $n$-tuples (i.e., AA...A and TT...T) are adopted in the sequel as “spaces.”

Figure 6 shows the total information of each species, that is, the sum of the information of all its chromosomes, versus the corresponding number of chromosomes, for the adopted character encoding. We observe a weak correlation between both variables.

Figure 7 shows the length of each chromosome versus its information content, estimated by the proposed method. In this case we observe a strong correlation between both variables, meaning that the implementation of the DNA code is very similar across all species. In fact, a PL trendline can be calculated over the 463 chromosomes.

With these ideas in mind, it was decided to explore the PL behavior, that is, the relation $y = a x^{b}$, with parameters $a$ and $b$, of the average word information $y$ versus the number of words $x$ (with $n$ as parameter) per chromosome. The extensive evaluation of the 463 chromosomes leads to the locus $(a, b)$ of the PL trendline parameters depicted in Figure 8. The point for chromosome DyYh is not included, to allow a better visualization of the remaining set of points. Moreover, the individual chromosome labels are not included, to make the plot more readable.

We verify that the map produces clear patterns, not only by grouping the chromosomes of each species but also by the relative positioning of the different species. Nevertheless, the large number of points complicates the visualization. Therefore, it was decided to represent each species by a single point having as coordinates the geometric and arithmetic averages of the PL parameters $a$ and $b$, respectively. Figure 9 depicts the resulting locus, where it is now easier to analyze the previously mentioned relations. The microchromosomes Ga32 and Tg16, which have a very small base pair count, were not included in the calculations because they significantly disturb the results.
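The condensation of each species into a single point can be sketched as follows, assuming the PL form $y = a x^{b}$ so that the geometric mean applies to the factor $a$ and the arithmetic mean to the exponent $b$. The function name and the sample parameter values are hypothetical.

```python
import math


def species_point(params):
    """Condense per-chromosome (a, b) pairs into one point per species.

    Returns (geometric mean of a, arithmetic mean of b). Averaging ln(a)
    arithmetically is equivalent to taking the geometric mean of a.
    """
    a_geo = math.exp(sum(math.log(a) for a, _ in params) / len(params))
    b_ari = sum(b for _, b in params) / len(params)
    return a_geo, b_ari


# Hypothetical PL parameters for two chromosomes of one species.
point = species_point([(2.0, -0.4), (8.0, -0.6)])
print(point)
```

Pairing the two means in this way amounts to averaging the fitted straight lines in log-log coordinates, which is one plausible reading of the choice made in the text.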

We verify the emergence of clusters that are in reasonable accordance with phylogenetics, going from the less “complex” species at the left up to the most “complex” species at the right. The cluster of mammals is at the right and includes the subcluster of primates {Ho, Ch, Or}, with Ch closer to Ho than Or. Among the rest of the mammals it is interesting to see Po close to the primates and the position of the marsupial Op relatively distant from the placental mammals. Concerning the rest of the points, we notice Cb close to Ce and, in a middle position, the clusters of birds {Ga, Tg}, fishes {Tn, St, Me, Zf}, and insects {Dy, Ds, Am, Ag}.

In conclusion, the proposed information measure leads to an assertive and quantitative classification of chromosomes and species. Furthermore, it can be explored for decoding other aspects of the DNA code in more detail, in association with the FC tools.

4. Conclusions

Chromosomes have a code based on a four-symbol alphabet, and it can be analyzed with methods usually adopted in information processing. The information structure resembles those occurring in systems characterized by fractional dynamics. Nevertheless, schemes based on assigning numerical values to the DNA symbols may deform the information, and alternative methods that avoid this problem need to be implemented. This paper proposed a scheme based on the Shannon information theory. With these ideas in mind, the chromosomes were processed from the perspective of a PL relationship between the average information and the total number of words, for distinct values of character encoding. For condensing the information, an averaging of the PL parameters was also adopted. The resulting locus revealed the emergence of clearly interpretable patterns in accordance with current knowledge in phylogenetics. The proposed methodology opens new directions of research for DNA information processing and supports the recent discoveries that fractional phenomena are present in this biological structure.

Acknowledgments

The authors thank the following organizations for allowing access to genome data: Gambiae Mosquito (The International Anopheles Genome Project), Honeybee (The Baylor College of Medicine Human Genome Sequencing Center, http://www.hgsc.bcm.tmc.edu/projects/honeybee/), Briggsae nematode (Genome Sequencing Center at Washington University in St. Louis School of Medicine), Elegans nematode (Wormbase, http://www.wormbase.org/), Common Chimpanzee (Chimpanzee Genome Sequencing Consortium), Dog (http://www.broad.mit.edu/mammals/dog/, Lindblad-Toh K., et al. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature. 2005 Dec 8; 438: 803-19), Drosophila simulans (http://genome.wustl.edu/genomes/view/drosophila_simulans_white_501), Drosophila yakuba (http://genome.wustl.edu/genomes/view/drosophila_yakuba), Horse (http://www.broad.mit.edu/mammals/horse/), Chicken (International Chicken Genome Sequencing Consortium. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature. 2004 Dec 9; 432(7018): 695-716. PMID: 15592404), Human (Genome Reference Consortium, http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/), Medaka (http://dolphin.lab.nig.ac.jp/medaka/), Mouse (Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520–562 (2002), http://www.hgsc.bcm.tmc.edu/projects/mouse/), Opossum (The Broad Institute, http://www.broad.mit.edu/mammals/opossum/), Orangutan (Genome Sequencing Center at WUSTL, http://genome.wustl.edu/genome.cgiGENOME=Pongo%20abelii), Cow (The Baylor College of Medicine Human Genome Sequencing Center, http://www.hgsc.bcm.tmc.edu/projects/bovine/), Pig (The Swine Genome Sequencing Consortium, http://piggenome.org/), Rat (The Baylor College of Medicine Human Genome Sequencing Center, http://www.hgsc.bcm.tmc.edu/projects/rat/, Rat Genome Sequencing Project Consortium. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428(6982), 493–521 (2004)), Yeast (Saccharomyces Genome Database, http://www.yeastgenome.org/), Stickleback (http://www.broadinstitute.org/scientific-community/science/projects/mammals-models/vertebrates-invertebrates/stickleback/stickleba), Zebra Finch (Genome Sequencing Center at Washington University St. Louis School of Medicine), Tetraodon (Genoscope, http://www.genoscope.cns.fr/), and Zebrafish (The Wellcome Trust Sanger Institute, http://www.sanger.ac.uk/Projects/D_rerio/).