Abstract

The next-generation sequencing (NGS) technology outputs a huge number of sequences (reads) that require further processing. After applying prefiltering techniques in order to eliminate redundancy and to correct erroneous reads, an overlap-based assembler typically finds the longest exact suffix-prefix match between each ordered pair of the input reads. However, another trend has been evolving for the purpose of solving an approximate version of the overlap problem. The main benefit of this direction is the ability to skip time-consuming error-detecting techniques which are applied in the prefiltering stage. In this work, we present and compare two techniques to solve the approximate overlap problem. The first adapts a compact prefix tree to efficiently solve the approximate all-pairs suffix-prefix problem, while the other utilizes a well-known principle, namely, the pigeonhole principle, to identify a potential overlap match in order to ultimately solve the same problem. Our results show that our solution using the pigeonhole principle has better space and time consumption over an FM-based solution, while our solution based on prefix tree has the best space consumption between all three solutions. The number of mismatches (hamming distance) is used to define the approximate matching between strings in our work.

1. Introduction

The next-generation sequencing (NGS) technology creates a new type of challenges. The output of NGS is a huge number of sequences (reads) which require further processing. The generated sequences represent segments from multiple copies of the original genome. An overlap-based assembler, such as SGA [1] and Readjoiner [2], finds overlaps between these reads in order to build a string graph which will be the input for the assembly stage. The problem of finding overlaps between each ordered pair of reads is commonly called the all-pairs suffix-prefix problem (APSP).

For a given group of strings , solving APSP is to find the largest (longest) suffix-prefix match for each ordered pair in . Gusfield et al. presented a solution for APSP in an optimal time using a generalized suffix tree [3]. For a text , a suffix tree ST is a tree in which every suffix in is represented by a path from the root to a leaf. The drawback for this solution is the high consumption of memory since the best implementation for suffix tree consumes 20 bytes for a text of size characters [4]. Ohlebusch and Gog [5] presented a solution for APSP using an enhanced generalized suffix array with an optimal time but with much less space than the one utilizing a suffix tree. The algorithm was practically improved [6].

A suffix array SA of a text is an array containing values which range from 1 to and represent the text positions of the lexicographically sorted suffixes of . An array is an array storing the length of the largest common prefix between every two consecutive lexicographically sorted suffixes in . An enhanced suffix array is a suffix array and an array. In both solutions (suffix tree and enhanced suffix array), reads are first concatenated in one string . In , every two consecutive reads are separated by a distinct character which is not repeated anywhere in ; then the data structure is built for the resulting text. The word “generalized” indicates that the data structure is built from all reads.

Compressed data structures have also been utilized to solve APSP. Simpson and Durbin used FM index [7] to solve APSP. Sadakane suffix tree [8] and run-length compressed suffix array (RLCSA) are also utilized to solve APSP ([9] and [10], resp.). A very recent work [11] showed that a compact prefix tree can be used to solve APSP efficiently in terms of time and space.

One of the most important fields for applying APSP is genome assembly. Assemblers can be classified depending on the type of graph they are building as follows:(i)They either build a string graph in which a node represents a read and an edge represents an overlap between two reads. Such assembler is called an overlap-based assembler. Since finding the original genome using a string graph by finding a path that visits every node exactly once is an NP-problem, such assemblers use techniques to solve a reduced version of this problem in the assembly stage [12].(ii)Or they build a de Bruijn graph in which nodes are b-1 mers of the reads and the edges correspond to the overlaps of size b-2 between two b-1 mers, where b is a fixed value less than the length of a read. The assembler then finds the original genome using the de Bruijn graph by finding a path which visits every edge in the graph only once. However, this graph may have many alternative paths, and therefore it is the first step in creating a good draft assembly [13].

A traditional overlap-based assembler would first filter the set of reads by removing redundant reads and applying approximate string-matching methods to detect and to correct errors in these reads. Then it would find exact overlaps between the prefiltered reads by solving APSP. However, a few researches tackled the approximate version of APSP (AAPSP) such as the work of [14]. The advantage of this trend is to avoid the error-correction preprocessing steps which use time-consuming approximate string-matching techniques. Valimaki et al. utilized a compressed data structure (FM index) with the backward backtracking technique to solve AAPSP. It also takes advantage of suffix filters which were introduced by [15] and improved by [16].

2. Objectives

In this work, we present two techniques to solve AAPSP. The first utilizes a compact prefix tree in solving AAPSP, while our second technique takes advantage of the well-known pigeonhole principle and the minimal length for an overlap in order to identify potential overlap matches. We compare our work with the work presented by [14].

We first explain our methods in Section 3. Section 4 demonstrates our experiments and discusses our results. We draw our conclusion in Section 5.

3. Methods

3.1. Definitions
3.1.1. Compact Prefix Tree

The words “read” and “sequence” are used interchangeably in this work. A read is a string of characters over an alphabet . We define a prefix tree for a group of reads as a tree in which every read in is represented by a path from the root to a leaf. Every edge in is labeled with one of the four characters: A, C, G, or T. Every node has an interval [] where are the identification numbers of the reads which share the same prefix up to (assuming that the reads are sorted). Since an edge is labeled by one character and reads in one range [] may share a substring with a length > 1, every node has also a value = , which represents the length of a common substring between all reads in the range []. The benefit of this value is to avoid building unnecessary nodes for substrings which are shared by reads in one range. Figure 1 shows an example for a compact prefix tree.

The compact prefix tree can be built in time where is the total length of all reads. The space consumption for building this tree is where is the count of the reads. Clearly, the presence of the input reads is required since this data structure is not a self-index data structure.

3.2. Approximate Matching

An approximate matching between two strings can be expressed by the edit distance. The edit distance between strings and is defined as the minimum number of insertions, deletions, and replacements of symbols to transform string into [17]. Hamming distance is another way to describe an approximate match. The hamming distance between strings and is the number of mismatching symbols between strings and . A string 1 is considered an approximate match to 2 if the edit distance (or the hamming distance) between the two strings is ≤ , where is the number of allowed insertions, deletions, and replacements (or mismatches when hamming distance is used) to transform to .

3.2.1. Pigeonhole Principle

The basic idea behind the pigeonhole principle is that if there is an approximate matching between two strings , then there must be an exact matching between them with a smaller size. The size of the available exact matching between the two strings is determined by the hamming distance (or edit distance) between them. Given that differs from by characters, if we divide into parts, then one of these parts will exactly match a part in . The principle can easily be proved by contradiction.

The crucial benefit of the pigeonhole principle is the identification of candidates for an approximate matching between the two strings. As a result, extremely fast exact matching algorithms and techniques are used to find these candidates and the time-consuming dynamic programming technique is only used to verify if a candidate is a part of an approximate matching between the two aligned strings.

The principle is also known as seed-and-extend. It is the base of many genome analysis algorithms such as Basic Local Alignment Search Tool (BLAST) and many works such as [18] utilized this concept in sequence alignment. In this work, we employ this principle to find overlaps between reads.

3.2.2. AAPSP

It is easy to define APSP because of its exact nature; however, defining AAPSP may not be very clear since the preference may differ between the length of the match and the number of mismatches; that is, a suffix-prefix match with a length of 20 and 7 mismatches may be better than one with a length of 10 and 4 mismatches. Reference [19] demonstrated different interests in approximate matching for a bioinformatician. In this study, we adopt the following interest. For each ordered pair of the input reads, we target the largest suffix-prefix match with a maximum of allowed mismatches. Nevertheless, we show that our solution can be easily modified to cover most other definitions.

3.3. Solving AAPSP Using a Compact Prefix Tree

The work of [11] describes the technique to solve APSP using a compact prefix tree. Every suffix in every read is matched with a path in the prefix tree (if there is one). The algorithm presented in the work of [11] takes advantage of the minimal length for a suffix-prefix match by ignoring all suffixes which are shorter than . The time consumption for solving APSP is where is the number of reads and is the maximum size of a read. While may vary in practice from hundreds of thousands to hundreds of millions, is usually less than a thousand.

Our first technique employs a compact prefix tree to solve AAPSP. When mismatches are allowed, every suffix in every read is aligned with every path in the tree, where is a path from the root to a leaf. In the attempt to match a suffix with a path in the tree, mismatches are counted up to a threshold. If a threshold is reached before reaching the end of , then does not represent a match between a suffix and a prefix. The pseudocode is shown in Algorithm 1. The following variables are used in Algorithm 2.(i) is the number of allowed mismatches.(ii) is the so-far number of mismatches.(iii) is the current character in (current suffix).(iv) is the current character to compare with in the prefix tree. It can be found by calculating the length of the path from the root to the current node. Accordingly, the current read mentioned in the algorithm is one of the reads which are included in the range of current node.(v) is a pointer to the current node in the prefix tree.(vi) is used to check if the comparison is done inside a node by comparing with the of the .(vii) returns true if the current node has a child with a character , that is, there is an edge coming from current node towards another node and labeled by a .

procedure
for every read in the input reads do
   for every candidate suffix in do
     
   end for
end for
end procedure
procedure  ,  ,
    ,
  while true do
     if the end of is reached then
        Report all reads which have a prefix that matches
        Return
     end if
     if    then
        if then
           
        end if
        if   then
           Return
        end if
        
         next character in
         next character in the current read
     else
        for Every character in Alphabet do
           
           if   then
            
           end if
           if and
                   then
            the character next to in
            the character next to current_char in the
                 current read
           ,
              )
           end if
        end for
        Return
     end if
  end while
end procedure

The code is simple. For every suffix in every read, the procedure is called. If the end of is reached, then represents a suffix-prefix match between the read that contains it and every read which is included in the interval of the current node in the prefix tree (lines ). If this is not the case, then we distinguish two cases:(i)The comparison is done between a character in and a character indicated by the value of in the current node in the tree. It happens when the (lines ).(ii)A character in is compared with a label of an edge. In this case, exceeds the value of of the current node. In this case, we test every child of the current node using a recursive call unless the allowed mismatches are exhausted (lines ).

Let us try to match suffix GGC from string 6 with a path in Figure 1 assuming that the number of allowed mismatches is 1. The first path starts with AA, so there is no match. The second path starts with AGG, so we can see that the comparison will fail when reaching the third character since we will have 2 mismatches. The third path starts with GGG which is a match since it differs from GGC by only one character. Accordingly, GGC is a suffix-prefix match, but it is not relevant since it involves the same read. The fourth path is GGT which is a match since it differs from GGC by only one character. Accordingly, GGC represents a suffix-prefix match between string 6 and each of string 2 and string 4 (which are identified as 4 and 5 after sorting). Finally, the last path does not present any match.

The difference between the two usages of compact prefix tree in solving APSP and AAPSP is clear. In APSP, every suffix is matched with a path in the tree, while in AAPSP, every suffix is tested against every path in the tree.

Given that there are suffixes in a read, all suffixes of a read can be processed using a compact prefix tree in where is the number of reads. The time complexity for the solution is in the worst case where is the total size of all reads. However, in practice, the solution runs much faster than the worst case. The space complexity is bounded by the size of the text which is , since the construction of the prefix tree requires only space [11].

3.3.1. Solving AAPSP Using Pigeonhole Principle

Our second technique takes advantage of the pigeonhole principle and the minimal length for an overlap in order to identify the candidate suffixes which can be approximate suffix-prefix matches between pairs of reads. Let be a minimal length for an overlap (i.e., a suffix-prefix match with a length will not be considered). If a suffix is an approximate suffix-prefix match with a threshold , then its prefix of length has to have a hamming distance when aligned with a prefix with the same length of some read . Accordingly, if is divided into parts, then one of these parts exactly matches a corresponding part of a prefix of a read . We then compare all remaining parts of with their corresponding parts in .

Accordingly, the technique can be summarized as follows:(i)Divide the prefix of length for each read into parts.(ii)Add each part from the prefix of read to an index which has entries of type where is a list of reads. Accordingly, if is already in the index, will be added to an existed entry (in its list); otherwise, a new entry will be added to the index.(iii)Every suffix that can be an acceptable overlap match will be tested. is divided into parts and every part is searched for in the index. If a part is found at position in the index, we investigate every read in . We compare all characters that precede in with their corresponding characters in . If the threshold is not reached, we compare all characters after in with their corresponding characters in until the threshold of mismatches is reached. If the end of is reached without exceeding the limit of mismatches, then is reported as an approximate overlap between (the read which contains ) and .

Figure 2 demonstrates the basic concept. The pseudocode is shown in Algorithm 3. The worst case time complexity is , but, in practice, the solution runs much faster. The index can have () entries. All entries may have up to () values (in all lists). Therefore, the space complexity is bounded by the size of the text which is where is the total length of all reads and is the size of the alphabet.

for Every read in input reads do
  for Every candidate suffix in do
   for Every part of length in ’s prefix do
     if p is found in position in the index then
       for Every read found in do
        compare all characters before with
            their corresponding characters in
         if threshold is not reached then
           compare all characters after with
               their corresponding characters in
               until the end of is reached
         end if
         if threshold is not reached then
           report as a suffix prefix match between and
         end if
       end for
     end if
   end for
  end for
end for

One drawback for this algorithm is the repetition in reporting the overlaps. The same overlap may be reported more than once since each matching part may end up reporting an overlap. We used a hash table to keep track for the overlaps which are reported. The hash table is cleared for every candidate suffix .

3.3.2. Implementation Notes

We used an unordered map to build our index, where is a part of a prefix of size in a read and is a list of reads. To make the implementation simple, we used an index for each part. Accordingly, if , we use 4 indices and each part in every prefix of size in every read is inserted into its appropriate index (i.e., part 1 into index 1). The prefix of length in each suffix is divided into parts and each part is searched for in its appropriate index. If a match is found, the list of reads which is associated with the key () is retrieved and we start comparing whatever before and after part in .

3.3.3. Supporting Other Flavors of Approximate Matching

In this study, we use the hamming distance concept to define an approximate overlap match. However, our solutions support other matching types such as spaced seeds, subset seeds, and edit distance [19]. Spaced seeds can be described as a mask M which is represented by a string over the alphabet where 0 indicates an allowed mismatch position. For example, string ACGCTATTG with a mask 011 accepts GTG, CTG, and GTACTG as suffix matches (we apply the mask cyclically in GTACTG twice since GTACTG is longer than the mask).

A slight modification in Algorithm 1 is sufficient to fulfill the spaced seed type. A variable to track the current element of the mask would be needed. The condition in lines can be easily extended in order to validate the comparison and a return statement should be executed if a mismatching occurs and the current element of the mask is 1. A similar modification should be done to the condition in line ; no recursive call is executed if the current element of the mask is 1. In Algorithm 3, similar modifications are required for comparisons in lines and .

With subset seeds, we specify the types of mismatches that are allowed at each position. For instance, allows a,t mismatches and mismatches only. This form is also easy to incorporate in our solutions by adding more checking before considering the case as a mismatch. If the case is not included in the group of allowed mismatches, we ignore the candidate suffix and move on to the next one.

The edit distance is not supported in our solution.

4. Experimental Evaluation

4.1. Experimental Setup

We implemented two C++ solutions to solve AAPSP using our two techniques. They use openMP to support multithreading. The used parallelizing technique is based on dividing reads equally between threads. We used /usr/bin/time to measure the time and space. The source code for our solutions can be downloaded from: https://github.com/maanrachid/Codes/blob/master/AAPSP.tar.

The implementation is tested on two types of machines:(i)A modest 2-core virtual machine with 1 GB RAM and less than 10 GB hard disk running on 2.00 GHZ CPU: we ran randomly generated data sets on the machine. The random data is generated by a program which creates reads with a total of length . and are inputs from the user. Testing on such machine is to demonstrate our solutions’ ability to find overlaps even with limited resources. The minimal length of an approximate overlap in our experiments is 30.(ii)An 8-core AWS node for testing large real samples: testing our solutions on this node evaluates their ability to handle large data set. Eight threads are used in all experiments. The minimal length of an approximate overlap in our experiments is 30.

Table 1 describes data sets used by the experiments. We obtained our real data from PubMed (http://www.ncbi.nlm.nih.gov/pubmed) and Citrus Genome databases (http://www.citrusgenomedb.org).

4.2. Experiments Results

We compare the time and space consumption for prefix tree (PT), pigeonhole (PH), and FM [14] solutions when used to solve AAPSP on a modest machine with randomly generated data. Tables 2 and 3 show the result of our testing. The number of threads which are used in our experiments is 2, the minimal length of an overlap is 30, and the output option is set on (produce output).

It is very clear that PH has the best results in terms of time. PT has better results than FM in the first two data sets and worse performance with the last one. This is due to the fact that the read in the last data set has a length of 500 (while the length of a read in the first two data sets is 100). PT favors short reads since the minimal length of an overlap can be utilized better (more comparisons can be skipped when the reads are short). In terms of space, PT has the best consumption with a clear advantage for PH over FM.

We test our solutions with real and large data sets on an AWS node with high capabilities (8 cores, 60 GB RAM, and 200 GB hard disk). The required time and space are shown in Tables 4 and 5. The minimal length for an overlap is 30, the number of threads is 8, and the output option is set on.

Clearly, pigeonhole solution demonstrates superior results in terms of time; however, prefix tree consumes less space in all data sets. Despite its low-space requirement, the brute force nature of the prefix tree solution causes a high time consumption.

We should mention that FM performs some additional tasks such as handling the N character or finding the overlaps for the reverse complement. That may affect the time and space consumption. However, it will not doubt the advantage of PH over FM in terms of time and space and the advantage of PT over FM in terms of space since the differences are too big to be interpreted by such factors.

5. Conclusion

Both our solutions can be used efficiently to solve AAPSP with a relatively small number of mismatches. It has been shown that the pigeonhole solution is superior in terms of time and has better space consumption than FM, while the prefix tree achieves the best space consumption between all three solutions. Both our solutions can efficiently contain other flavors of approximate matching with the exception of edit distance (deletion and insertion).

It would be great if these tools can be extended to find overlaps using the edit distance. It may also be interesting to find out how efficient our tools are when integrated with other components in an overlap assembler.

Competing Interests

The author declares that he has no competing interests.