Abstract

This paper aims to study the relationships between chromosomal DNA sequences of twenty species. We propose a methodology combining DNA-based word frequency histograms, correlation methods, and an MDS technique to visualize structural information underlying chromosomes (CRs) and species. Four statistical measures are tested (Minkowski, Cosine, Pearson product-moment, and Kendall rank correlations) to analyze the information content of 421 nuclear CRs from twenty species. The proposed methodology is built on mathematical tools and allows the analysis and visualization of very large amounts of stream data, like DNA sequences, with almost no assumptions other than the predefined DNA “word length.” This methodology is able to produce comprehensible three-dimensional visualizations of CR clustering and related spatial and structural patterns. The results of the four test correlation scenarios show that the high-level information clusterings produced by the MDS tool are qualitatively similar, with small variations due to each correlation method characteristics, and that the clusterings are a consequence of the input data and not method’s artifacts.

1. Introduction

DNA related information can be analyzed in many different ways, including by methods based on “word frequency” histograms derived from DNA sequences [1]. Histograms are a condensed representation of the original information and allow further processing methods, like correlation, which are not viable in the original data. The correlation between histograms can be computed, producing a correlation matrix that can serve as input to other methods for high-level information extraction and tabular/graphical analysis like the multidimensional scaling (MDS) technique, which is able to create low-dimensional representations of complex data while preserving similarities between data points. In [2], the authors describe how the Kendall rank correlation method [3] is used to generate the correlation matrix and how a Multidimensional Scaling (MDS) tool [4] is able to generate three-dimensional representations of spatial and structural relationships of the chromosomes and species. In that paper, only one correlation method is applied to the generation of correlation matrices, but many other correlation methods exist and can be used for studying chromosomal/species relationships. As such, we compare and evaluate a set of correlation methods in order to determine if those relationships show up in all methods and are similar. Our main goals are to find out if, for each of several correlation methods and word lengths used in the processing of DNA sequences,

(a)the MDS tool generates three-dimensional representations featuring spatial and structural patterns;(b)its spatial and structural patterns denote significant differences for distinct correlation methods;(c)the results from MDS tool are qualitatively similar, independently of the correlation method used.

It should be noted that important contributions in this topic [1, 5] were proposed using alignment-free sequence comparison methods, but the proposed method is based on different concepts.

Bearing these ideas in mind, this paper is organized as follows. Section 2 presents the biological concepts and mathematical tools, formulating its application in the context of DNA sequence decoding. Section 3 analyzes the correlation between CRs and several species, by investigating data representation using MDS applied to twenty species and their CRs. Finally, Section 4 outlines the main conclusions and open issues.

2. Mathematical Tools and DNA Decoding

The chromosomal DNA code of the twenty species was downloaded from the DNA sequence database of the University of California Santa Cruz Genome Bioinformatics site [6]. In each CR, repeated strings of more than 12 symbols were previously masked and replaced by “N” symbols, in order to ignore the nongenomic and nonregulatory sequence data. In consequence, we are handling an alphabet composed of symbols, namely, . In terms of DNA data, an option was made for a set of twenty species, aiming to explore the dynamic analysis by changing the sequence length in the range . The twenty species include eleven mammals , two birds, Chicken and Zebra finch , two fishes, Zebrafish and Tetraodon , two insects, Gambiae mosquito and Honeybee , two nematodes, Caenorhabditis elegans and Caenorhabditis briggsae , and one fungus, Yeast , with a grand total of CRs. The characteristics of chosen species and its DNA are presented in Table 1.

The DNA implements an alphabet composed by the symbols . Any simple translation to a numerical counterpart may impose bias and destroy intrinsic information. Consequently, it was decided to directly process the non-numerical code. Due to the immense volume of information, a histogram-based measure was adopted. Nevertheless, in general, histograms do not capture dynamics. In order to overcome this limitation, a flexible pattern detection algorithm based on counting the sequence of symbols was considered [1]. By “flexible” we mean that the algorithm can count sequences of length items, each one composed by one of the four base symbols.

With the exception of Yeast , the available CR data includes a fifth symbol (“N”), corresponding to masked DNA symbols not belonging to the genome, which typically appear in large contiguous sequences. For example, in the Human Y CR file there are 59373566 base pairs, of which 33710000 are “N” (56.78%) arranged in 17 sequences, the largest one with 30000000 symbols. Another example is the Chicken Ga25 CR, with 2051775 base pairs, of which 663879 are “N” (32.67%) arranged in 274 sequences, the largest one with 500000 symbols. HoY and Ga25 are just two examples of CRs with a percentage of “N” symbols greater than 10%, but most of the CRs have smaller percentages.

We decided not to use “N” in sequences as a fifth symbol or not to replace it by any of the symbols , because that would introduce an unknown bias in the sequence processing. We then considered the following two approaches:(a)remove all “N” symbols in a preprocessing step or,(b)process sequences but ignoring any sequence with an “N”.Although (a) and (b) may seem different, we concluded that differences were minimal and that (a) could be advantageously used without compromising the quality of results and conclusions.

Using as examples nuclear CRs, and a sequence length of in Table 2, the rightmost column synthesizes the differences for the (a) and (b) approaches. For Ga25 the Pearson correlation coefficient between (a) and (b) sequences with length yields , while for HoY the corresponding coefficient is >0.9999999. We conclude that both approaches are statistically equivalent for the envisaged DNA decoding. Therefore, we opted to discard the “N” symbol before histogram construction.

We have different statistics when considering the length ranging from , representing merely a static counting of states, up to , representing a system with (65536) states. During the bin counting two possible approaches may be considered, namely, windows without any overlapping and windows with a partial overlapping of the base sequence. Therefore, two extreme opposite cases were tested, namely, successive counting windows with zero and with adjacent bases in the DNA. In the first case, for a DNA strand of length and sequences of length , results a total of approximately counting windows, while for the second it yields counting windows. Previous tests revealed that both schemes lead to similar qualitative results, with some slight differences in the smaller CRs [2]. In order to get a more robust counting, we adopted the one-base sliding window (i.e., overlapping of consecutive bases).

Having generated the histograms, the second step in the analysis consists in evaluating their similarities by means of suitable correlation indices. In this study, we evaluate four correlation methods [3, 7, 8], namely, the Minkowski , Cosine , Pearson product-moment , and Kendall rank as given by

where and represent the relative frequencies of histograms and for bin and denotes the total number of bins. If represents a set of joint observations from two variables, any pair of observations are said to be concordant (discordant) if the ranks for both elements agree (disagree), while for identical rank the pair is neither concordant nor discordant.

For the purpose of visualizing the correlation results, the multidimensional scaling (MDS) technique is adopted [911]. The MDS is a mathematical tool that represents, in a low-dimensional map, a set of data points whose similarities (or, alternatively, distances) are defined in a higher dimensional space by means of a symmetric matrix . In case of similarities (or, alternatively, distances) and classical MDS, the main diagonal is composed of ones (or, alternatively, zeros), while the rest of the matrix elements must obey the restriction (), , where is the total number of cases under comparison [12]. It should be noted that MDS works with relative measurements. Therefore, MDS maps are not sensitive to translations or rotations. The axes have only the meaning and units (if any) of the measuring index and packages usually apply a heuristic procedure for centering the chart. In practical terms, this means that MDS maps are analyzed on the basis of proximity of (or, alternatively, distance between) points and comparison of the resulting “cloud” of points. Usually, in order to improve the graphical representation, 2-D and 3-D MDS plots are used and its consistency verified by means of Shepard and/or stress charts [13].

3. Analysis of DNA Sequence Histograms

In this section, we start by analyzing a limited part of the global information by means of direct methods. We verify some limitations due to the huge volume of data. This fact motivates the adoption of a more efficient visualization method, namely, the MDS, that is applied to the complete set of data.

3.1. Analysis of Six Species Using a Diagram Visualization Method

In this subsection, we compare six mammals, namely, Human, Common Chimpanzee, Orangutan, Rhesus monkey, Pig, and Opossum, denoted by . In this preliminary analysis, it is adopted that in the histogram construction and the correlation expression (2), leading to a matrix with considerable information. Considering a threshold value of 99.5% for selecting the “most similar CRs” (i.e., smaller values are ignored) we get the groups presented in Figure 10. We observe that some CRs with distinct numbering are very similar as, for example, Rm16 is clearly correlated with Hu17, Ch17, and Or17, while others are very different from the rest, such as, for example, HuY, ChY, Rm19, Pi12, and OpX. In terms of species we conclude that:(i)Hu has twenty CRs correlated in the first place with Ch and two with Or,(ii)Ch has eighteen CRs correlated in the first place with Hu and six with Or,(iii)Or has twenty one CRs correlated in the first place with Ch and three with Hu,(iv)Rm has one CR correlated in the first place with Hu and zero with Ch and Or,(v)Pi and Op have zero CRs correlated with the rest of the species.Therefore, we conclude that Ch is the species closest to the Hu, Rm is far from the trio , and have no proximity with the rest.

This information can be depicted graphically. Figure 1 shows visualization graphs generated by Graphviz [14], an open-source software for representing structural information as diagrams of abstract graphs and networks. The most correlated CRs for the group reveals clearly, for example, triplets of CRs 19, 20, and 22, groups of CRs 13 and 4, groups of CRs 16, 17, and Rm20.

For the trio , Figure 2 shows the chart for the cases of and .

These tests reveal that even for a limited set of data directed graph methods lead to complicated representations.

3.2. Analysis of Twenty Species Using the MDS Visualization Method

In this subsection, we compare the complete set of species using the MDS method. Therefore, after computing all the chromosomal histograms of the twenty species for , the GGobi package [4] is used for generating the MDS plots corresponding to the correlation methods described in (1)–(4). In Figures 3 to 6, we depict MDS plots, using a classical metric dissimilarity analysis, for each correlation method when . The choice for the aforementioned values of n was motivated by the following considerations:(i); it is the smallest value of for reasonably discriminating DNA-based frequency histograms;(ii); the protein coding machinery in CRs uses triplets (3) of bases to specify amino acids [15];(iii); a larger value of that is also multiple of 3 and potentially sensitive to the protein coding mechanisms;(iv); a larger value of that is not multiple of 3 and is computationally tractable.

The MDS maps for the remaining values of are not depicted due to space limitation. Values of were not considered because they impose an increasingly greater computational burden: the number of frequency bins in a histogram is , each correlation depends on operations and a complete correlation study requires approximately correlations.

Figure 3 presents MDS plots for the Minkowski correlation revealing the emergence of chromosomal patterns for all values of . We note that the MDS plots vary progressively and smoothly from up to . When , we observe that mammals’ CRs are more spatially separated and that the primates’ CRs “diverge” from the rest of the mammals. The Minkowski correlation depends on the value of the parameter . For and , it yields the commonly known Manhattan (or City) and Euclidean distances, while for the limiting case we obtain the Chebyshev distance. After testing the MDS plots for several values of was adopted as representative of this method.

Figure 4 presents MDS plots for the cosine correlation demonstrating clear chromosomal patterns. Again we conclude that the MDS plots evolve from up to . Moreover, mammals’ CRs become more separated as reaches larger values such as and . It is clearly noticeable that the MDS plots in Figures 3 and 4 are geometrically very distinct but depict chromosomal patterns and structures that lead to conclusions of the same type. This visual effect is common in MDS maps, namely, with the conclusions being drawn in relative terms rather than in an absolute perspective, with the patterns and not the absolute coordinates of points being important.

Figure 5 presents MDS plots for the Pearson product-moment correlation . Again, chromosomal patterns are clearly observable for all values of and the smooth evolution from up to . We note that the Pearson correlation method is based on the product of moments, which means that it is invariant to separate changes in location and scale of the two histogram sequences. The MDS plots of Figure 5 also depict chromosomal patterns and structures, but geometrically distinct from the MDS plots represented in previous figures.

Finally, Figure 6 presents MDS plots for the Kendall rank correlation leading to similar conclusions.

Comparing the four indices that feed the MDS plots, we conclude that the Kendal correlation reveals more distinct transitions between MDS plots and, consequently, the chart for and seems to be the one that depicts more noticeable chromosomal patterns and geometrical structures.

A standard assessment tool in MDS analysis is the Shepard plot, which provides a qualitative measure of the goodness of fit. Considering and the row and column indexes of matrix , the Shepard plot represents the dissimilarities against the fitted distances (where represents the inner product for classical scaling), or the residuals (where is the transformation of dissimilarities and is a power for metric scaling). In terms of MDS qualitative analysis in this paper, the goodness of fit is very high for all values of and all types of correlation methods. Being the MDS quantitative assessment described by the stress value, Figure 5 depicts the stress plots for the Kendall correlation method and the limit cases of and , showing the usual monotone decreasing shape. For other correlation methods, the stress plots show a similar behavior.

Although the chart of Figure 7 supports the adequacy of adopting a two-dimensional representation for the MDS output, it also shows that a three-dimensional map can lead to a slightly improved rendering of MDS plots. In this line of thought, Figures 8 and 9 show two “visually enhanced” three-dimensional MDS maps for , corresponding to the Minkowski index with and the Kendall correlation . Both figures include visual cues (like perspective effects, shadows on objects/on the floor, and three coordinate axis) to help in the spatial and structural understanding of chromosomal relationships.

In Figure 8, it is clearly noticeable that a primate species’ cluster near to a mammals’ cluster is having next to it the aves’ cluster. The mammals and aves’ cluster depict a “linear” disposition of CRs, which is confirmed by the corresponding shadows on the floor. A “linear” chromosomal disposition is also observed in species like , but not in species like fishes and nematodes . It is also noteworthy to mention the “parallelism” between the linear dispositions of the mammal species (excluding primates) and the aves .

In Figure 9, we can observe that mammal species are organized in a cluster, all of them depicting a “linear” spatial disposition. The aves also cluster together, near the mammals, each one with a clear “linear” disposition. The shadows over the floor (a visual cue) help understanding these spatial and structural relationships. For the remaining species, the fishes are spatially far apart, only Tn depicting a “linear” spatial disposition. This same disposition somewhat exists in the , but not in the nematodes .

As mentioned in Section 2, MDS works with data that characterized the relative distance between the objects. Therefore, in MDS maps, rotation and translation have no special meaning and user can adopt the one that is more useful to visualize the clusters. Identically, MDS charts with different number of points or with distinct measuring indices cannot be compared, neither with each other, nor in the perspective of the coordinates of the points. Therefore, a “good” MDS representation is simply the one that adopts a measuring technique that for the phenomenon under study and for the number of objects leads to a map where user can visualize easily clusters that make sense for that particular application. In this line of thought in this paper, the association of several correlation measures for the 421 CRs proved to lead to a comprehensive pattern easily visualized and assertively interpretable under the light of present-day knowledge.

In this study, the nuclear genomic information used is still incomplete, as explained in [6]. For many of the species referred in Table 1, there is a considerable amount of nuclear DNA sequence data not yet attached to CRs or with an unknown placement. This undesirable uncertainty may contribute to misleading results, not caused by the mathematical and computational tools adopted. While the focus of this paper was mainly an interspecies comparison, the same methodology can be used for revealing intraspecies chromosomal patterns. We also foresee the application of the described methodology to the study of mitochondrial DNA sequences. These issues will be the subject of further research.

4. Conclusions

Chromosomes have a code based on a four-symbol alphabet and the information can be analyzed with mathematical tools usually adopted in the analysis of complex systems [16]. In this paper, it was applied a histogram-based conversion scheme for establishing a numerical signal and the resulting information was studied by means of four distinct correlation measures. The application to the CRs of twenty species, with a grand total of 421 CRs, revealed that the combination of sequence lengths of eight symbols, the Kendall rank correlation, and the MDS visualization is the most promising one, leading to the emergence of patterns that can be easily and assertively interpreted and compared.

The three-dimensional patterns of CRs depicted in Figures 6 and 7 “point” to a high level of genomic structuring in each species (“linear” and “spherical” arrangements) and between species (“parallelism” between mammals and aves). Although we do not have an immediate explanation for this noticeable multidimensional structuring, it may be related to higher levels of information structure underlying CRs.

Acknowledgments

The authors thank the following organizations for allowing access to genome data: Human—Genome Reference Consortium, http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/; Common Chimpanzee—Chimpanzee Genome Sequencing Consortium; Orangutan—Genome Sequencing Center at WUSTL, http://genome.wustl.edu/genome.cgi GENOME=Pongo%20abelii. Rhesus—Macaque Genome Sequencing Consortium, http://www.hgsc.bcm.tmc.edu/projects/rmacaque/; Pig—The Swine Genome Sequencing Consortium, http://piggenome.org/; Ox—The Baylor College of Medicine Human Genome Sequencing Center, http://www.hgsc.bcm.tmc.edu/projects/bovine/; Dog Genome Sequencing Project—http://www.broad.mit.edu/mammals/dog/, Lindblad-Toh K, et al. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature. 2005 Dec 8;438:803-19; Horse—The Broad Institute, http://www.broad.mit.edu/mammals/horse/; Mouse—Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520-562 (2002), http://www.hgsc.bcm.tmc.edu/projects/mouse/; Rat—The Baylor College of Medicine Human Genome Sequencing Center, http://www.hgsc.bcm.tmc.edu/projects/rat/, Rat Genome Sequencing Project Consortium. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428(6982), 493-521 (2004); Opossum—The Broad Institute, http://www.broad.mit.edu/mammals/opossum/; Chicken—International Chicken Genome Sequencing Consortium Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature. 2004 Dec 9; 432(7018): 695-716. PMID: 15592404; Zebra Finch—Genome Sequencing Center at Washington University St. Louis School of Medicine; Zebrafish—The Wellcome Trust Sanger Institute, http://www.sanger.ac.uk/Projects/D_rerio/; Tetraodon—Genoscope, http://www.genoscope.cns.fr/; Honeybee—The Baylor College of Medicine Human Genome Sequencing Center, http://www.hgsc.bcm.tmc.edu/projects/honeybee/; Gambiae Mosquito—The International Anopheles Genome Project; C. elegans nematode—Wormbase, http://www.wormbase.org/; C. briggsae nematode—Genome Sequencing Center at Washington University in St. Louis School of Medicine; Yeast—Sacchromyces Genome Database, http://www.yeastgenome.org/.