Abstract

The vast majority of methods available for sequence comparison rely on a first sequence alignment step, which requires a number of assumptions on evolutionary history and is sometimes very difficult or impossible to perform due to the abundance of gaps (insertions/deletions). In such cases, an alternative alignment-free method would prove valuable. Our method starts by a computation of a generalized suffix tree of all sequences, which is completed in linear time. Using this tree, the frequency of all possible words with a preset length LL-words—in each sequence is rapidly calculated. Based on the L-words frequency profile of each sequence, a pairwise standard Euclidean distance is then computed producing a symmetric genetic distance matrix, which can be used to generate a neighbor joining dendrogram or a multidimensional scaling graph. We present an improvement to word counting alignment-free approaches for sequence comparison, by determining a single optimal word length and combining suffix tree structures to the word counting tasks. Our approach is, thus, a fast and simple application that proved to be efficient and powerful when applied to mitochondrial genomes. The algorithm was implemented in Python language and is freely available on the web.

1. Introduction

During the last decades many sequence comparison methods have been developed in order to recover evolutionary and phylogenetic signals as well as for the discovery of pathogenic mutations [1, 2].

The most common approaches are based on sequence alignments [3, 4]. However, alignment quality depends on the penalties attributed to observed differences between sequences during the alignment process [5, 6]. Alternatively, many alignment-free methods have also been proposed [5, 79] which, being based on word frequencies or on match lengths, are algorithmically simple and computationally faster than alignment methods.

The basis of word frequency tasks is the determination of the optimal word length, L, which should be computed a priori. The L-words counting in a sequence is usually performed considering a one base sliding window, overlapping consecutive bases, that is, shifting one base each time until , m being the sequence length [7, 8].

Here, we present a new approach that determines a single optimal word length, L, and generates L-words frequency profiles using suffix tree theory. The algorithm was applied to a variety of mtDNA sequences that are particularly difficult to handle by automated alignment methods and the performance was compared to the available word counting alignment-free methodologies.

2. Methods

2.1. Algorithm

We present here a new algorithm representing an improvement of word counting alignment free methodologies. The algorithm is described in Supplementary Material available online at http://dx.doi.org/10.1100/2012/450124 and each step is summarized below.

2.1.1. Suffix Tree Approach

The first step of the method is the construction of a generalized suffix tree, T, of n sequences, , where every suffix in the data set is represented only once. Therefore, the memory requirements when using these structures are much more modest than when considering the original complete sequences. The construction of a generalized suffix tree is based on Ukkonen’s algorithm, described with detail by Gusfield [10]. Function GST in the Supplementary Algorithm 1 automates the construction of this structure.

Generalized suffix trees are potent structures, having the useful property that each prefix of paths leading from the root to any internal node points to all occurrences of this prefix in the data set [10]. Thus, when aiming to determine the number of times that a word w occurs in each sequence, we only need to traverse the generalized suffix tree leading from the root in the direction of the branch labeled by a prefix of . If such branch does not exist, we conclude that w does not occur in the data set. Otherwise, we must always skip from a node to its descendant until the end of w. The indexes of all descendant leafs from the last node reached, or from its descendant nodes, are used to determine the sequences in the data set which contain w as well as the number of occurrences of w in each sequence. Each leaf indexes the sequences and the corresponding starting positions of the associated suffixes labeled in the path that leads from the root to this leaf.

An alternative approach, using a k-truncated suffix tree deserves consideration, due to reduction in both memory requirements and running time [11].

2.1.2. L-Words Frequencies

In the next step, we determine all words in the DNA alphabet with length L—WL—determined a priori, following the method of Sims et al. [7]. According to these authors, there is an optimal resolution range in which any integer value should be considered as the length of L. Any value inside this interval is equally good. So, in order to increase the speed of the process we start by considering only the lower limit of resolution, which is given by the expression , where m is the sequence length. Considering n sequences with different lengths m, the expression can obviously generate different values. In order to find a value applicable to all sequences under analysis, we choose m as the length of the greater sequence and L as the smaller integer greater than . Thus, in the present study, we work with the following values: where is the ceiling function of x, defined as the smallest integer is not less than x.

Notice that the total number of possible L-words is and . For example, if then and the following result is obtained: Using the generalized suffix tree we can efficiently determine the number of occurrences of each in each sequence just by traversing the branch with path label from the root towards the leafs only one time, as was thoroughly explained in the previous section: .

Finally, we can determine the relative frequency of each word in each sequence as the following: The resulting matrix with dimension and entries represents a global profile of L-words frequencies of all input sequences. The determination of each element is automated with function LwF in the Supplementary Algorithm 1.

2.1.3. Genetic Distance

The generated frequencies matrix may then be used to assign a pairwise correlation or a metric distance between each pair or sequences. In this work we calculate the pairwise standard Euclidean distance, which is defined as where w represents the L-words and means the relative frequency of w in the sequence Z.

Function Distance described in Supplementary Material automates this procedure.

2.2. Software

The algorithm was written in Python, version 2.5.2, and tested on a Windows 7 x32 system and on a Linux platform with a processor Intel (R) Pentium (R) Dual CPU, T3400 @ 2.16 GHz and 4 Gb of RAM. It is freely available on the web at http://www.portugene.com/SupMat/SuffixTree&Lwords.rar.

3. Results and Discussion

3.1. Phylogenetic Reconstructions

The developed algorithm was tested in different datasets of mtDNA sequences, proving to be a simple and fast way to identify phylogenetic relationships in the different sets of mitochondrial genomes.

The algorithm was first tested in a dataset composed of 29 complete primate mtDNA sequences representing genomes of different families, ranging from 15467 bp to 17036 bp long. Taking into account these lengths, we determined , as explained in the Methods Section. This value has proven to be a good choice, allowing the program to run quickly, while still producing a genetic distance matrix that, when used to construct a dendrogram, exhibits a clustering that is in agreement with consensus primate phylogeny (http://tolweb.org/Primates/15963).

In order to confirm that the algorithm was also able to produce a correct phylogeny with closely related sequences we tested it with mtDNA sequences from the same species, in which the sequence length is more homogeneous. The observed clusterings are in general agreement with those published in the literature, grouping mtDNA genomes in the same clades previously published methodologies (Supplementary Material).

Aiming to check the performance of our algorithm as well as to compare the quality of the results obtained by our approach and Costa’s methodology, we compare the topology of the resulting phylogenies. The dendrograms constructed using the genetic distance matrixes generated by our algorithm are consistent with consensus phylogenies (Supplementary Material), in contrast with the results obtained by Costa et al. [8] methodology, which show some discrepancies, namely, in the clade Platyrrhini, which is clustered with Tarsii and Strepsirrhini ((http://tolweb.org/Primates/15963) and [12]).

3.2. Running Time

Our algorithm takes a linear execution time to determine the words frequencies and a quadratic time to compute the pairwise distances, an improvement to previous word counting alignment-free methodologies.

Our approach was compared to the method developed by Costa et al. [8] in what concerns the running time (the word counting alignment-free methodology proposed by Sims et al. [7] could not be tested because it has not been made available). While our approach computes the optimal word length to determine the word frequency profiles and generates a genetic distance matrix just by inputting a fasta file with mtDNA sequences, the methodology proposed by Costa et al. [8] involves four steps/algorithms: (1) converting a fasta file containing n mtDNA sequences into n fasta files with a single sequence; (2) converting each file into a fa file, a simplified version of fasta files; running two additional algorithms to (3) generate the histograms files and (4) create a correlation similarity matrix. These last two algorithms must be tested in increasingly longer windows until a conserved correlation matrix is obtained.

Our approach was designed to be run in Windows x32 operative system but it was also tested in a Linux platform in order to be compared to the alternative methodology under the same operative system. We thus could demonstrate that, independently of the operating system, the use of suffix tree structures to compute the words frequency profiles enables our methodology to run in a much shorter time. Although for small sets of sequences the running time required by Costa’s (2011) methodology [8] is shorter, when increasing the number of sequences to over a hundred, the performance of our method is clearly better (Table 1, Figure 1, Supplementary Table 5).

3.3. Final Remarks

The algorithm described here has demonstrated to be an improvement of word counting alignment-free methods for sequence clustering, showing to be computationally very fast, particularly with large datasets, while still producing good quality results. In fact, by combining suffix tree structures with word counting tasks, as well as automating the determination of a single optimal word length, a significant decrease in running time and memory requirements for L-words frequencies determination was obtained.

The method proved to be efficient and powerful when applied to complete mitochondrial genomes, either from different species or intraspecifically, being able to quickly cluster the sequences in accordance to acknowledged phylogenetic relationships.

Authors’ Contribution

I. Soares developed the algorithm, performed the tests, and wrote the paper. A. Goios participated in the design of the study and wrote the paper. A. Amorim conceived the study, and participated in its design and coordination and wrote the paper. All authors read and approved the final paper.

Acknowledgments

The authors want to thank to António Guedes de Oliveira and Pedro Silva for their helpful suggestions and the three anonymous reviewers for helpful comments that greatly improved the manuscript. I. Soares has a doctoral Grant (SFRH/BD/38171/2007) and A. Goios has a postdoctoral Grant (SFRH/BPD/43646/2008) from Fundação para a Ciência e Tecnologia. IPATIMUP is an Associate Laboratory of the Portuguese Ministry of Science, Technology and Higher Education and is partly supported by Fundação para a Ciência e Technology. CMUP was funded by the European Regional Development Fund through the programme COMPETE and by the Portuguese Government through the FCT under the Project PEST-C/MAT/UI0144/2011. The funding sources had no involvement in any part of this study.

Supplementary Materials

Complementary text.

  1. Supplementary Materials