Abstract
The evolution of the next generation sequencing technology increases the demand for efficient solutions, in terms of space and time, for several bioinformatics problems. This paper presents a practical and easytoimplement solution for one of these problems, namely, the allpairs suffixprefix problem, using a compact prefix tree. The paper demonstrates an efficient construction of this timeefficient and spaceeconomical tree data structure. The paper presents techniques for parallel implementations of the proposed solution. Experimental evaluation indicates superior results in terms of space and time over existing solutions. Results also show that the proposed technique is highly scalable in a parallel execution environment.
1. Introduction
The next generation sequencing (NGS) technology created new types of DNA sequencing challenges. The great advent of this new technology eliminates the high cost of the Sanger method. Therefore, a lab with modest equipment is currently able to sequence a modest size genome (e.g., bacterial genome). The resulting output for this technology is a group of fragments (reads), each of which is 50–1000 base pairs representing a piece of multiple copies of the genome.
This kind of output presents a challenge since these pieces should be reordered in order to obtain the complete sequence of a genome (de novo assembly). Therefore, to harvest the benefits of utilizing NGS technology, the development of space and timeefficient algorithms to complete the assembly process has become inevitable.
Allpairs suffixprefix (APSP) matching is one of the wellknown computer science problems that has effective applications in the assembly process, especially in de novo assembly. Finding the allpairs suffixprefix matches can also help solve the popular shortest common superstring problem that has an important role in sequencing and mapping DNA [1]. In addition, it has applications in data compression [2].
Gusfield et al. [1] presented an optimal solution for APSP using a suffix tree [3]. It is optimal since it consumes time where is the total length of all strings and is the count of the strings. The suffix tree is a robust data structure that is used to solve many string matching problems. A suffix tree for a string is a tree of all suffixes of . Each suffix in is represented by a path from the root to a leaf in the suffix tree. Despite the optimal performance of the suffix tree, it has the drawback of high memory requirements and poor locality of memory references [4].
The suffix array has been used as a substitute for suffix tree to avoid its two disadvantages [5]. A suffix array of a string is an array whose size is equal to the length of . Each element in array contains the position of a suffix in where all suffixes are sorted lexicographically. Abouelhoda et al. [6] showed that any problem that can be solved using a suffix tree can also be solved using an enhanced suffix array with the same time complexity. Ohlebusch and Gog [7] presented a solution for APSP using an enhanced suffix array. This solution has the same time complexity as the solution obtained using a suffix tree, but it is faster and consumes much less space.
With the advent of next generation sequencing technology, these solutions may be considered expensive in terms of space since the required space for the text itself is bits, where is the alphabet size and is the length of text, while the space that is required to store the data structure (suffix tree or suffix array) is bits. Compressed data structures were developed to solve bioinformatics problems using much less space with an acceptable slowdown in performance. FM index [8] is an example of such data structure that is used to solve APSP.
Dinh and Rajasekaran presented memoryefficient data structure to represent exactmatch overlap graphs [9]. They mentioned that APSP can be solved using the presented data structure in where is the length of one read and is the number of reads assuming that all reads have the same length; otherwise is a maximum length of a read. Dinh and Rajasekaran [9] used a customized compact prefix tree in the process of building the targeted data structure.
APSP has been used in the overlap stage of the genome assembly process. Two important modern assemblers are SGA [10] and Readjoiner [11]. In SGA, the FM index [8] is used to solve the problem in an indirect way as follows. The index is constructed for all strings after concatenating them in one string. Then the index is queried by the reads to find prefixsuffix matches. Other compressed versions of suffix tree and suffix array, such as RLCSA ([12–14]) and Sadakane suffix tree [15], are also used to solve APSP [16, 17].
Readjoiner is a very efficient genome assembler that, in the overlap stage, finds suffixprefix matches with a minimal length by grouping all relevant suffixes in buckets. Each bucket is identified by a common prefix for all suffixes inside it. Then, after sorting suffixes inside each bucket, it finds suffixprefix matches using the lcpintervals concept which is introduced in [6].
This work presents a simple, efficient, spaceeconomical, and scalable solution to APSP using a compact prefix tree. The version of prefix tree which we are using is presented as an enhancement for Btree in [18]. This additional data structure can significantly decrease the time required to retrieve, in a set of strings , the ones whose prefix is a pattern . We demonstrate how to construct a compact prefix tree in Section 3. In Section 4, we explain the process of finding the longest suffixprefix matches for each ordered pair of reads. In Section 5, we show how to parallelize our solution and describe different ways to distribute the load between threads. In Section 6, we compare our solution with previously presented solutions for APSP in terms of space and time. Finally, we present our concluding remarks in Section 7.
2. Preliminaries
Let denote an ordered alphabet. A string is a sequence of characters over . A suffix of a string is a substring of that starts with a character in and ends with the last character of , where can be any character in . A prefix of a string is a substring of that starts with the first character of and ends with a character , where can be any character in . Given two strings and , a suffixprefix match is a suffix of which is also a prefix of . Finding allpairs suffixprefix matches (APSP) for a group of strings is finding the largest suffixprefix match for each ordered pair of strings in . In this paper, denotes the number of strings and denotes the total length of all strings.
3. Compact Prefix Tree
We define a compact prefix tree for a group of strings as a tree having the following properties:(i)Each string should have an identification number that represents its index in the lexicographical order of the strings in group .(ii)Each string in is represented by a path from the root to a leaf. Many strings will share partial path. If the two strings are the same, they will have the same path from the root to the leaf.(iii)The edge between each node and its parent node in the tree has a label which starts with one of the four characters .(iv)Each node has an interval where are identification numbers for some strings in . Since the strings are sorted, the range represents all the strings which have a common prefix represented by a path from the root to this node.(v)It is much better to store the length of a substring in the corresponding node instead of building the whole substring as a path in the tree. We will call this stored value . Each node has its own value.
An example for such tree is shown in Figure 1. The tree has 5 leaves since 5 strings are involved. The first child of the root has the range since three strings start with character A, with = 0 since these strings differ in the second character (A, C, and G).
Accordingly, each node has at least two children since the one child case is not possible as the substring which is represented by this child would be included in the parent node.
3.1. Constructing the Prefix Tree
We present two methods to construct the compact prefix tree. The first constructs the tree with the assumption that the strings are sorted, while the second method constructs the tree without sorting the strings.
3.1.1. Constructing the Tree after Sorting
In this method, we assume that the strings are sorted in lexicographical order. The tree construction starts with a root node. Nodes are added to the tree as the strings through are scanned in order. The first string can be inserted in one step in a node. The interval for this node will clearly be where 1 is the identification number of . For every other string in group , where , we match the string , character by character, with a path in the tree. Let denote the current character in to be compared with. In the matching process, the following variables are required:(i)A variable is used as a pointer to the current tree node.(ii)A variable is used to indicate the position of the character inside the node, where is going to be compared with the current character in the text. If is bigger than the value of the current node, will be advanced to point to the appropriate branch of current node, which is labeled by .(iii)A variable is equal to the total length of all edges in the path plus the total length of all values for all nodes in this path. This variable is important for calculating the value for new leaves.(iv)To find exactly what character is pointing to, we find the character in the position of the original text.
A match may occur in two cases:(i) is less than or equal to of the current node, and then we are still within the same node and matches the character which is pointing to.(ii)Otherwise, a match occurs if there is a branch for the current node, labeled by . In this case, should point to that branch node which becomes now the current node. An update to the interval of this current node should be done by simply changing the upper bound of its interval to .
A mismatch may occur in two cases:(i)It occurs when comparing within a node; that is, is less than or equal to of the current node. In this case, the following steps should be executed in order:(a)The current node should be split into two nodes and where is the parent of .(b) of becomes equal to 1 and the interval for becomes the same as .(c) of becomes equal to of . The lower bound of becomes the same as .(d)The upper bound of becomes equal to the upper bound of .(e)Create a new node which will be the new branch of , labeled with . of becomes equal to the length of and the interval for is .(ii)It occurs when is greater than of the current node, and then should point to the appropriate branch which is labeled by . If no such branch exists, then we have a mismatch. In this case we just create a node with a range of and a which is equal to length of .
Figure 2 demonstrates the stages of constructing the tree. The character which is on the left side of a node is the label of the node. The interval above the node denotes its string interval. The number shown on the right side of the node denotes of the node. Algorithm 1 demonstrates the pseudocode for constructing the tree.

The following refers to the example in Figure 2. The line numbers refer to the pseudocode illustrated in Algorithm 1. The first string is inserted in one step (Figure 2(a)); the branch A for the root is created with an interval and which is the length of excluding the first character A which represents the branch (lines 26–29). Considering the second string , we have a match with the first character (lines 6–8), while the second character “C” causes a mismatch. Since the mismatch occurs when is still less than or equal to the of the current node (mismatch inside the node), we split the current node into two nodes: with and an interval and with and an interval . Then we add as a branch to with and an interval (lines 9–19) (Figure 2(b)). Regarding the third string , we have a match with the first character (lines 6–8) and a mismatch with the second character. Since of the current node, and the current node does not have a “G” branch, we simply add a node with and an interval (lines 27–30) (Figure 2(c)). Both strings and get a mismatch with the first character. Therefore, they are inserted directly in one step each (lines 26–29) (Figures 2(d) and 2(e)).
Since processing each character in every string is done in constant time, the tree can be constructed in time. Since sorting the strings consumes also time using radix sorting, the time complexity stands. Since each internal node has at least two children, and the number of leaves in the tree is , the number of internal nodes is at most . Accordingly, the tree can be constructed using space. Therefore, the space requirement of the solution is determined by the space needed to store the text, which is bits since is much bigger than .
3.1.2. Constructing the Tree without Sorting
This section presents a method for constructing the prefix tree without the need for sorting the strings. This method has two stages:(i)Constructing the tree with no consideration for the intervals.(ii)Traversing the tree in a depthfirst search fashion and updating the intervals.
In the first stage, we use the same construction method which is used for the sorted input in the previous section but with one difference: we ignore the intervals for internal nodes since the identification numbers for the strings are not known. For leaves, we use the current index of the string in the list of unordered input strings as their identification numbers (e.g., for , the interval is used).
In the second stage, a depthfirst traversal for the constructed tree is required to update the intervals. The intervals of each internal node are updated after updating the intervals of its children. A counter is used to assign identification numbers for leaves. When a leaf is visited, the current value of the counter is assigned to it and the counter is incremented. For example, in Figure 2, the interval of the first branch of the root will be updated to be after updating the leaves with the intervals , , and . There is one case that should be considered: when two strings are exactly the same, they will have the same exact path in the tree. Since we ignore intervals during the insertion stage, there will be no way to distinguish one of these strings from another (here all strings are kept as input for the problem; in fact such a string is typically filtered out in a genome assembler). This issue can be handled using k lists in the insertion stage; for each string , we add into the list of if and are exactly matched, assuming is processed before . These lists are used later by assigning a new consecutive identification number for each string. Clearly, strings in the same list will have sequential numbers. The pseudocode for the traversal stage is shown in Algorithm 2.

4. Finding AllPairs SuffixPrefix
Both methods in Section 3 produce the same output, which is an efficient prefix tree to be used to solve APSP. In this section, we present an effective technique for finding a solution for APSP.
In this method, every suffix in every string is tested, starting from the largest proper suffix (i.e, the suffix which starts at position 2). If a suffix in string matches a path in the tree (there is a path which starts with the root and ends in a node with a range in the tree), then represents the longest suffixprefix match between and every string included in the range . For each string, every suffix should be processed. Accordingly, processing the suffixes of each string consumes time where is the maximal length of a string (which is typically less than 1000 in the genome assembly context). Therefore, the time complexity for this method is where is the number of strings.
We write to denote the suffix of the string . For each string in group , we check if the current suffix exactly matches a path in the tree where and is the length of . Let denote the current character in . We distinguish three cases:(i)We reached the last character in , which means that exactly matches a path in the tree. In this case, will be the starting position for the longest suffixprefix match between and every string included in the interval of the current node.(ii)If is less than or equal to of the current node, then we are still within the same node and the current character in suffix either matches the character which is pointing to or does not match it and accordingly does not represent any suffixprefix match.(iii)If is greater than the of the current node, then should point to the appropriate branch which is labeled by , where is the current character to be compared in . If no such branch exists, then does not represent any suffixprefix match. This method is easy to implement. The pseudocode is shown in Algorithm 3. The variables , , and defined in Section 3.1 are used in this algorithm.

4.1. Prefiltering the Reads
In our previous discussion, we used the original strings (reads) as an input for our overlap solution; therefore, the size of the output is . However, in the context of genome assembly, some filtration is applied on the input reads and some redundant reads are removed before finding the overlap. Our solution can easily and efficiently perform such filtration:(i)If a string matches a prefix of another string , then can be removed. The removal procedure can be handled in the construction process. We have 2 cases:(a) is processed first when constructing the prefix tree. In this case, will match a path in the prefix tree and it is simply removed. This case is possible only when construction is done with no sorting.(b) is processed first when constructing the prefix tree. If the strings are sorted, then we assume that sorting will filter . Otherwise (with no sorting case), processing will reach the leaf which has the interval [# #] where # is the identification number of , and there is no need for executing the procedure in Algorithm 2. Instead, the interval of the leaf will be updated to identification number.(ii)If a string matches a suffix of another string , then will match a path in the prefix tree ending with a leaf. In this case, should be removed. A vector of bits can be used to indicate if a string is removed.
5. Parallelizing the Algorithm
In this section, we show how to parallelize the tree construction, and then we explain different techniques to parallelize the solution. In our discussion, we assume the availability of a shared memory multicore computer.
5.1. Parallelizing the Construction of Prefix Tree
A quick and fast way to parallelize our solution is to let each processor work on strings which start with a specific character in the alphabet (A,C,G,T). For example, processor 1 constructs the part of tree that corresponds to strings that start with “A”. Processors 2, 3, and 4 construct the parts of the tree corresponding to the strings starting with C, G, and T, respectively.
The concept can easily be extended to more than 4 processors. For 16 processors, as an example, the load can be distributed based on the first two characters. In other words, the tree construction is distributed such that each processor is responsible for processing strings starting with the prefix XY, where X and Y are characters in the alphabet. Accordingly, processor 1 works on strings starting with “AA”, processor 2 works on strings starting with “AC”, and so on. The advantage of this method is the absence of any communication between processors.
5.2. Parallelizing Finding the APSP Matches
In this section, we show several techniques to parallelize our solution. The first direct technique simply divides the strings among processors so each processor gets equal number of strings. This method requires almost no modification to the sequential version of the algorithm and it scales very well. The problem is that it does not acknowledge the differences in length between the strings which may decrease the efficiency because of load imbalance.
Another way to parallelize the solution is to estimate the required load and therefore the amount of work that each processor should optimally have. Then, strings are assigned one by one to a processor until the load exceeds its estimated share. Since processing each string requires processing every suffix in it, the total amount of work (number of comparisons) for each string can be estimated as follows:where is the required work for processing and is the length of . Accordingly, a processor’s optimal share can be estimated as follows:where is the number of processors. An array, , with the size of is used, where is the number of used processors (threads). It contains the number of the first string to be processed by a processor.
To illustrate the concept, a simple example is shown. Let G = {ACC, AATC, CGTC, TTA, TGA, CCAT} be a group of strings that 3 processors are working on. The number of steps to process these strings is 6, 10, 10, 6, 6, and 10. Accordingly, the share for each processor is 16. Processor 1 gets strings 1 and 2, processor 2 gets strings 3 and 4, and processor 3 gets strings 5 and 6. However, this may not be the case in practice. The pseudocode is shown in Algorithm 4.

A third technique which may be used to parallelize the solution is to assign an initial load, which is a range of strings, for each processor. A shared pointer is used to indicate the starting point of a new range for a free processor. When a processor finishes executing its initial load, it gets a new range of strings using the shared pointer, then it updates the value of the shared pointer. This technique requires mutual exclusion for updating the shared pointer.
The solution can also be parallelized (fourth technique) using a greedy algorithm. An array of size (number of strings) is used where is the processor number which is going to handle string . Another array with the size (number of processors) is also used to maintain the current shares for processors. For every string , is assigned to a processor which has the least share (i.e., , ).
Notice that the granularity of load distribution over processors is the string. In other words, a single string is not processed on more than one processor. For a small number of strings, this may be a problem. For example, consider one huge string . The performance will be limited by the time for processing this string on one processor. However, in practice, this should not be a problem since is much bigger than where is the number of strings, and is the maximum length of a sequence.
6. Experimental Evaluation
In this section, we evaluate the performance of our solution and its scalability in a parallel execution environment.
6.1. Experimental Setup
Our solution has been implemented in C++. We use String Overlap Finder (SOF) to refer to this solution, which is available for download from http://confluence.qu.edu.qa/download/attachments/9240580/Prefix.tgz.
In this section we compare SOF with several previously presented solutions for APSP: suffix tree, enhanced suffix array, Sadakane suffix tree, and the overlap stage of SGA and Readjoiner (version 1.2).
These solutions have been downloaded from the sources shown in Table 1.
LEAP is another efficient genome assembler which is implemented in [9]. We did not use LEAP in our comparisons since LEAP, unlike Readjoiner and SGA, does not offer a seperate stage for finding overlaps which makes estimating the time required for finding overlap matches out of the overall time very hard. Nevertheless, it has been demonstrated in [11] that Readjoiner has better time and space consumption than LEAP.
We use the OpenMP flag to support multithreading. The program takes few optional parameters: sorting option, minimal length for a suffixprefix match, number of threads, type of output, and load distribution. The sorting option enables or disables sorting before constructing the tree. We distinguish three different types of output: outputting all matches, outputting only maximum matches using a twodimensional array, and no output (the results are not printed or stored in any data structure).
Our results are obtained using the following options for SOF: no sorting, dividing strings equally between threads, and output = 2 (which means outputting all suffixprefix matches, not only the longest), except for the large data sets where output = 0 (no output) option is used (the size of output is ). Since many of our samples have a small read length, the minimal match length which is used is 30 unless another value is mentioned.
We used two types of data sets: random and real. The random data are generated by a program that outputs random strings with random lengths but with a total length of where and are specified by the user. The random values were drawn from a uniform distribution.
As with most other solutions like Readjoiner, the input file is encoded in SOF using fixedlength encoding. This step lowers our space requirement dramatically but increases the processing time, only when a small number of reads are used.
The real data are the complete EST database of C. elegans which is downloaded from http://www.uniulm.de/in/theo/research/seqana. We also obtained three complete EST databases of Citrus clementina, Citrus sinensis, and Citrus trifoliata from http://www.citrusgenomedb.org/ and the complete EST database of Atta cephalotes from antgenomes.org. We also obtained 4 large real data sets from NCBI website. Table 2 shows our data sets.
Tests are performed in two environments:(i)A spacelimited environment, which is a modest machine: Linux Ubuntu version 11.10, 32bit with 3 GB RAM, Intel 2.67 GHZ CPU with 4 cores, and 250 GB hard disk. We refer to it as machine A. It is used to evaluate SOF time and space requirements when limited resources are available.(ii)An AWS instance with 16 cores to evaluate the parallelization of our solution and compare SOF with Readjoiner. We refer to it as machine B. It is used to evaluate SOF time and space requirements when dealing with large data sets.
6.2. Experimental Results
6.2.1. Evaluating SOF with Limited Resources (Machine A)
The time required for all 6 solutions, running on machine A (the modest machine), is shown in Figure 3. Since not all solutions support multithreading, we evaluate and compare their performance on a single processor machine. SGA and Readjoiner do not perform well with minimal match which is less than 10, so we test all solutions, except suffix tree and enhanced suffix array (the only choice for both regarding minimal match is 1), with minimal match length = 15. Due to high space consumption, we could not run the suffix tree and enhanced suffix array with more than 90 MB. Our results show clearly that Readjoiner and SOF outperform other solutions. The advantage for Readjoiner over our solution is surmounted by the fact that we ignore the time for prefiltering, which is a prerequisite for the overlap stage in Readjoiner but is not needed for SOF.
The space requirements for these tests are shown in Figure 4. Clearly, Readjoiner and SOF are the most effective in terms of space.
The shown results are expected. SGA and Sadakane suffix tree use compressed fulltext data structures. To build them, a considerable construction time is required, in the worst case. In addition, the time required for some operations, such as the operation, may not be constant, which is the case in a standard suffix tree/array. On the other side, the overlap technique in SGA is not optimal, unlike the case for suffix tree/array which requires time. SGA has a good space consumption, but it is still at least with a higher constant than both SOF ( for constructing a prefix tree and for storing the text) and Readjoiner.
Using real data, SGA, Readjoiner, and SOF were tested on machine A using 4 threads (the maximum number of threads on machine A). We ignore other solutions since they do not support multithreading or they are remarkably slow. The time and space consumptions are shown in Figures 5 and 6. SOF had the best performance when using multithreading in most cases. In these results, the prefiltering time for Readjoiner is ignored. Both SOF and Readjoiner performed much better than SGA. We attribute the impressive performance and low space requirement of Readjoiner when testing with Atta cephalotes to the low number of strings in this data set. This is due to the fact that Readjoiner finds distinct prefixes which can be candidate for suffixprefix matches. This procedure is related to the number of strings in the data set.
6.2.2. Evaluating SOF with Large Data Sets (Machine B)
The parallelization of SOF and its performance on real and large data sets with large numbers of strings are evaluated and compared with Readjoiner and SGA using machine B (an AWS 16core node). Results for real data are shown in Figure 7. While we were able to run SOF using 16 threads with all data sets, we could not run Readjoiner with any of Citrus data sets using more than 10 threads. Unlike SOF whose space requirement does not change with the number of threads, the space requirements for Readjoiner increase as the number of threads increases. Table 3 shows the space consumption of Readjoiner and SOF using different numbers of threads. In the first sample (Citrus clementina), for example, the space consumption for Readjoiner increases more than 100% when 4 threads are used and more than 250% when 9 threads are used. As a result, the space consumption of Readjoiner exceeds that of SOF for large number of threads. Unfortunately, the error which occurs when running Readjoiner with more threads prevented us from showing even a higher difference in space consumption. Table 4 shows the time consumption for SOF and Readjoiner. SOF demonstrates better scalability in most cases.
Readjoiner finds overlaps in several steps. In each step, it uses buffers in order to prepare the output for the next step. When the data in a buffer is processed, the buffer is refilled again and a new chunk of data is processed. This is repeated until the whole set of data is processed. In a multithreading environment, these buffers are most probably created for each thread in order to process multiple chunks at the same time, which may explain the increase in the space consumption when more threads are used.
The results for testing SOF using large data sets with large numbers of strings are shown in Table 5. These datasets are equal to or bigger than the ones which are tested with LEAP [9] in terms of size and number of strings.
We excluded LEAP in our comparison since LEAP does not offer the ability to investigate each stage of the assembly process separately. Therefore, we cannot single out the performance of the relevant overlap stage. However, we tested LEAP's ability to handle our datasets. The program receives a signal 11 as an indication for a segmentation fault when running with datasets 1, 2, and 4 from Table 4 and SRR500004 from Table 5. However, it finishes executing when running with other datasets but with very long times (3.5 hours for SRR866986 and more than 17 hours for SRR098909).
We could not run Readjoiner with any of the data sets in Table 5. For example, we received an “assertion failed” message when running Readjoiner with SRR500004. The prefiltering stage shows a segmentation fault when running Readjoiner with ERR1257766, SRR866986, and SRR098909. Accordingly, the overlap stage is not reached with these data sets. We received the same messages in single and multicore environments. Other solutions (Sadakane suffix tree and SGA) consume a large amount of time (more than 6 hours for the 10 GB file).
7. Conclusion
Both Readjoiner and SOF are fast and spaceeconomical techniques for solving APSP when compared to other solutions. Despite the advantage for Readjoiner in terms of space and time when no multithreading is used, SOF is simple and easy to implement and performs well on a simple machine. In addition, on multicore and parallel machines, SOF exhibits better performance and scalability as compared to Readjoiner. Unlike SOF, Readjoiner’s space consumption increases when using more threads. As a result, SOF can consume less space and time than Readjoiner when both are using multithreading. SOF can also be efficiently used with huge data sets and large numbers of strings beyond the problem sizes and number of strings that Readjoiner can support.
Disclaimer
The statements made herein are solely the responsibility of the authors.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This paper was made possible by NPRP Grant no. 414541233 from the Qatar National Research Fund (a member of Qatar Foundation). The authors thank the reviewers for their valuable remarks.