Novel Computational Technologies for Next-Generation Sequencing Data Analysis and Their Applications (Special Issue)
SimpLiFiCPM: A Simple and Lightweight Filter-Based Algorithm for Circular Pattern Matching
This paper deals with the circular pattern matching (CPM) problem, which arises in many biological contexts. CPM consists in finding all occurrences of the rotations of a pattern of length $m$ in a text of length $n$. In this paper, we present SimpLiFiCPM (pronounced “Simplify CPM”), a simple and lightweight filter-based algorithm to solve the problem. We compare our algorithm with the state-of-the-art algorithms; in our experiments it achieves at least a threefold speed-up. Much of the speed of our algorithm comes from the fact that our filters are effective but extremely simple and lightweight.
1. Introduction
The classical pattern matching problem is to find all the occurrences of a given pattern $P$ of length $m$ in a text $T$ of length $n$, both being sequences of characters drawn from a finite character set $\Sigma$. This problem is interesting as a fundamental computer science problem and is a basic requirement of many practical applications. The circular pattern, denoted by $\mathcal{C}(P)$, corresponding to a given pattern $P = P[1 : m]$, is formed by connecting $P[1]$ with $P[m]$ and forming a sort of a cycle; under this notion, the same circular pattern can be seen as $m$ different linear patterns, all of which are considered equivalent. In the circular pattern matching (CPM) problem, we are interested in pattern matching between the text $T$ and the circular pattern $\mathcal{C}(P)$ of a given pattern $P$. We can view $\mathcal{C}(P)$ as a set of $m$ patterns starting at positions $j \in [1 : m]$ and wrapping around the end. In other words, in CPM, we search for all “conjugates” (two words $x$, $y$ are conjugate if there exist words $u$, $v$ such that $x = uv$ and $y = vu$) of a given pattern in a given text.
The problem of circular pattern matching has been considered in , where an -time algorithm is presented. A naive solution with quadratic complexity consists in applying a classical algorithm for searching a finite set of strings after having built the trie of rotations of . The approach presented in  consists in preprocessing by constructing a suffix automaton of the string , by noting that every rotation of is a factor of . Then, by feeding into the automaton, the lengths of the longest factors of occurring in can be found by the links followed in the automaton in time . In , the authors have presented an optimal average-case algorithm for CPM, by also showing that the average-case lower bound for the (linear) pattern matching of also holds for CPM, where . Recently, in , the authors have presented two fast average-case algorithms based on word-level parallelism. The first algorithm requires average-case time , where is the number of bits in the computer word. The second one is based on a mixture of word-level parallelism and -grams. The authors have shown that with the addition of -grams, and by setting , an optimal average-case time of can be achieved. Very recently in , the authors have presented an efficient algorithm for CPM that runs in time on average. To the best of our knowledge, this is the fastest running algorithm for CPM in practice to date.
Notably, indexing circular patterns  and variations of approximate circular pattern matching under the edit distance model  have also been considered in the literature. Approximate circular pattern matching has also been studied recently in [4, 7]. In this paper however, we focus on the exact version of CPM.
Apart from being interesting from the pure combinatorial point of view, CPM has applications in areas like geometry, astronomy, computational biology, and so forth. For example, the following application in geometry was discussed in . A polygon may be encoded by spelling out its coordinates. Now, given the data stream of a number of polygons, we may need to find out whether a desired polygon exists in the data stream. The difficulty in this situation lies in the fact that the same polygon may be encoded differently depending on its “starting” coordinate and hence, there exist $k$ possible encodings where $k$ is the number of vertices of the polygon. Therefore, instead of traditional pattern matching, we need to resort to the CPM problem. This problem seems to be useful in computer graphics as well and hence may be used as a built-in function in graphics cards handling polygon rendering.
CPM in fact appears in many biological contexts. This type of circular pattern occurs in the DNA of viruses [9, 10], bacteria , eukaryotic cells , and archaea . As a result, as has been noted in , algorithms on circular strings seem to be important in the analysis of organisms with such structures. Circular strings have also been studied in the context of sequence alignment. In , basic algorithms for pairwise and multiple circular sequence alignment have been presented. These results have later been improved in , where an additional preprocessing stage is added to speed up the execution time of the algorithm. In , the authors also have presented efficient algorithms for finding the optimal alignment and consensus sequence of circular sequences under the Hamming distance metric.
Furthermore, as has been mentioned in , this problem seems to be related to the much studied swap matching problem (in CPM, the patterns can be thought of as having a swap of two parts of it)  and also to the problem of pattern matching with address error (the circular pattern can be thought of as having a special type of address error) [19, 20]. For further details on the motivation and applications of this problem in computational biology and other areas the readers are kindly referred to [9–17] and references therein.
In this paper, we present SimpLiFiCPM (pronounced Simplify CPM), which is a fast and efficient algorithm for the circular pattern matching problem based on some filtering techniques. In particular, we employ a number of simple and effective filters to preprocess the given pattern and the text. After this preprocessing, we get a text of reduced length on which we can apply any existing state-of-the-art algorithms to get the occurrences of the circular pattern. So, as the name sounds, SimpLiFiCPM, in some sense, simplifies the search space of the circular pattern matching problem.
We have conducted extensive experiments to compare our algorithm with the state-of-the-art algorithms. Our algorithm turns out to be much faster in practice because of the huge reduction in the search space through filtering: as reported in Section 4, it achieves at least a threefold speed-up. Also, the filtering techniques we use are simple and lightweight but, as the results show, extremely effective.
The rest of the paper is organized as follows. Section 2 gives a preliminary description of some terminologies and concepts related to stringology that will be used throughout this paper. In Section 3 we describe our filtering algorithms. Section 4 presents the experimental results. Section 5 draws conclusions and outlines some future research directions.
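2. Preliminaries
Let $\Sigma$ be a finite alphabet. An element of $\Sigma^*$ is called a string. The length of a string $w$ is denoted by $|w|$. The empty string $\varepsilon$ is a string of length 0; that is, $|\varepsilon| = 0$. Let $\Sigma^+ = \Sigma^* \setminus \{\varepsilon\}$. For a string $w = xyz$, $x$, $y$, and $z$ are called a prefix, factor (or, equivalently, substring), and suffix of $w$, respectively. The $i$th character of a string $w$ is denoted by $w[i]$ for $1 \le i \le |w|$, and the factor of a string $w$ that begins at position $i$ and ends at position $j$ is denoted by $w[i : j]$ for $1 \le i \le j \le |w|$. For convenience, we assume $w[i : j] = \varepsilon$ if $j < i$. A $k$-factor is a factor of length $k$.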
A circular string of length can be viewed as a traditional linear string which has the leftmost and rightmost symbols wrapped around and stuck together in some way. Under this notion, the same circular string can be seen as different linear strings, which would all be considered equivalent. Given a string of length , we denote by , , the th rotation of and .
Example 1. Suppose we have a pattern . The pattern has the following rotations (i.e., conjugates): , , , , , and .
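As a small illustration of rotations (conjugates), the following Python sketch enumerates them for a given string. The function name `rotations` and the 0-based indexing are choices made here for illustration only and are not taken from the paper.

```python
def rotations(w):
    """Return all rotations (conjugates) of w, starting with w itself."""
    # The i-th rotation moves the first i characters of w to its end.
    return [w[i:] + w[:i] for i in range(len(w))]


# A string of length m has m rotations (not necessarily all distinct).
for r in rotations("acgt"):
    print(r)
```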
Here we consider the problem of finding occurrences of a pattern string $P$ of length $m$ with circular structure in a text string $T$ of length $n$ with linear structure. For instance, the DNA sequence of many viruses has a circular structure. So if a biologist wishes to find occurrences of a particular virus in a carrier’s DNA sequence, which may not be circular, (s)he must locate all positions in $T$ where at least one rotation of $P$ occurs. This is the problem of circular pattern matching (CPM).
We consider the DNA alphabet, that is, . In our approach, each character of the alphabet is associated with a numeric value as follows. Each character is assigned a unique number from the range . Although this is not essential, we conveniently assign the numbers from the range to the characters of following their inherent lexicographical order. We use , to denote the numeric value of the character . So, we have , , , and . For a string , we use the notation to denote the numeric representation of the string ; denotes the numeric value of the character . So, if then . The concept of circular strings and their rotations also applies naturally on their numeric representations as is illustrated in Example 2 below.
Example 2. Suppose we have a pattern . The numeric representation of is . And this numeric representation has the following rotations: , , , , , and .
The problem we handle in this paper can be formally defined as follows.
Problem 3 (circular pattern matching (CPM)). Given a pattern of length and a text of length , find all factors of such that , for some . And if we have for some , then we say that the circular pattern matches at position .
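To make the problem statement concrete, here is a minimal brute-force sketch in Python. It relies on the standard fact that a string of the same length as the pattern is a rotation of the pattern exactly when it is a factor of the pattern concatenated with itself. The function name `naive_cpm` and the 0-based positions are assumptions made for this illustration; this is a reference baseline, not the algorithm proposed in the paper.

```python
def naive_cpm(pattern, text):
    """Return 0-based positions in text where some rotation of pattern occurs."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    doubled = pattern + pattern  # every rotation is a length-m factor of this string
    # Check every window of length m; this is quadratic in the worst case.
    return [i for i in range(len(text) - m + 1) if text[i:i + m] in doubled]
```

Such a baseline is mainly useful for checking the output of faster algorithms on small inputs.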
In the context of our filter-based algorithm the concept of false positives and negatives is important. So, we briefly discuss this concept here. Suppose we have an algorithm $\mathcal{A}$ to solve a problem $\mathcal{B}$. Now suppose that $S_{true}$ represents the set of true solutions for problem $\mathcal{B}$. Further suppose that $\mathcal{A}$ computes the set $S_{\mathcal{A}}$ as the set of solutions for $\mathcal{B}$. Now assume that $S_{true} \neq S_{\mathcal{A}}$. Then, the set of false positives can be computed as follows: $S_{\mathcal{A}} \setminus S_{true}$, where “$\setminus$” refers to the set difference operation. In other words, the set computed by $\mathcal{A}$ contains some solutions that are not true solutions for problem $\mathcal{B}$. And these are the false positives, because $\mathcal{A}$ falsely marked these as solutions (i.e., positive). On the other hand, the set of false negatives can be computed as follows: $S_{true} \setminus S_{\mathcal{A}}$. In other words, false negatives are those members in $S_{true}$ that are absent in $S_{\mathcal{A}}$. These are false negatives because $\mathcal{A}$ falsely marked these as nonsolutions (i.e., negative).
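The two set differences described above can be sketched in Python as follows. The function names and the sample position sets are hypothetical and serve only to illustrate the definitions.

```python
def false_positives(computed, true):
    """Members reported by the algorithm that are not true solutions."""
    return set(computed) - set(true)


def false_negatives(computed, true):
    """True solutions that the algorithm failed to report."""
    return set(true) - set(computed)


# Hypothetical example: positions reported by a filter vs. actual match positions.
reported = {1, 3, 5}
actual = {1, 5}
print(false_positives(reported, actual), false_negatives(reported, actual))
```

A filter that never produces false negatives, as required here, always reports a superset of the actual match positions.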
3. Our Approach
As has been mentioned above, our algorithm is based on some filtering techniques. Suppose we are given a pattern and a text . We will frequently and conveniently use the expression “ matches at position ” (or, equivalently, “ circularly matches at position ”) to indicate that one of the conjugates of matches at position . We start with a brief overview of our approach below.
3.1. Overview of SimpLiFiCPM
In SimpLiFiCPM, we first employ a number of filters to compute a set of indexes of such that matches at position . As will be clear shortly, our filters are unable to compute the true set of indexes and hence may have false positives. However, our filters are designed in such a way that there are no false negatives. Hence, for all , we can be sure that there is no match. On the other hand, for all , we may or may not have a match; that is, we may have false positives. So, after we have computed , we compute , a reduced version of concatenating all the factors , , putting a special character in between the factors. One essential detail is as follows. There can be , such that . In other words, there can exist overlapping factors matching with . However, this can be handled easily through simple bookkeeping as will be evident from our algorithm in later sections. Clearly, once we have computed the reduced text we can employ any state-of-the-art algorithm to solve CPM on to get the actual occurrences. So the most essential and useful feature of SimpLiFiCPM is the application of filters to get a reduced text on which any existing algorithm can be applied to solve CPM.
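The construction of the reduced text can be sketched as follows, assuming the filters have already produced a set of candidate start positions. The function name `build_reduced_text`, the 0-based positions, and the choice of `$` as the special separator character are assumptions made for this illustration; overlapping (or directly adjacent) candidate windows are merged so that no character is emitted twice.

```python
def build_reduced_text(text, candidates, m, sep="$"):
    """Concatenate the length-m windows starting at the candidate positions.

    Overlapping or adjacent windows are merged; windows separated by a gap
    are joined with the separator character.
    """
    out = []
    end = None  # exclusive end of the stretch of text emitted so far
    for i in sorted(candidates):
        if end is not None and i <= end:
            out.append(text[end:i + m])  # emit only the not-yet-emitted characters
        else:
            if out:
                out.append(sep)  # boundary between non-consecutive factors
            out.append(text[i:i + m])
        end = i + m
    return "".join(out)
```

For instance, with windows of length 3 at positions 0, 1, and 6, the first two windows overlap and are merged into a single stretch, while the third is appended after a separator.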
3.2. Filters of SimpLiFiCPM
In SimpLiFiCPM, we employ six filters. In this section we describe these filters. We also discuss the related notions and notations needed to describe these filters. In what follows we describe our filters in the context of two strings of equal length , namely, and , where the former is a circular string and the latter is linear. We will devise and apply different functions on these strings and present observations related to these functions which in the sequel will lead us to our desired filters. The key to our observations and the resulting filters is the fact that each function we devise yields the same output for every rotation of a circular string. For example, consider a hypothetical function . We will always have the relation that for all . Recall that actually denotes . For the sake of conciseness, for such functions, we will abuse the notation a bit and use to represent for all .
3.2.1. Filter 1
We define the function on a string of length as follows: . Our first filter, Filter 1, is based on this function. We have the following observation.
Observation 1. Consider a circular string and a linear string both having length . If matches , then we must have .
Example 4. Consider . As can be easily verified, here circularly matches . In fact the match is due to the conjugate . Now we have and . Then, according to Observation 1, we must have . This can indeed be verified easily.
Now consider another string , which is slightly different from . It can be easily verified that does not match . Now, and hence here also we have . This is an example of a false positive with respect to Filter 1.
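A minimal sketch of the sum-based check behind Filter 1 is given below. The numeric values assigned to the characters (a = 1, c = 2, g = 3, t = 4), the function names, and the sample strings are assumptions made for this illustration and are not the strings used in Example 4.

```python
NUM = {"a": 1, "c": 2, "g": 3, "t": 4}  # assumed numeric values, for illustration


def char_sum(s):
    """Sum of the numeric values of the characters of s."""
    return sum(NUM[ch] for ch in s)


def is_rotation(a, b):
    """True if b is a rotation of a."""
    return len(a) == len(b) and b in a + a


pattern = "acgt"
# "gtac" is a rotation of the pattern, so the sums must agree.
print(char_sum(pattern) == char_sum("gtac"), is_rotation(pattern, "gtac"))
# "actg" is not a rotation but has the same sum: a false positive for this check.
print(char_sum(pattern) == char_sum("actg"), is_rotation(pattern, "actg"))
```

Equal sums are therefore necessary but not sufficient for a circular match, which is why further filters and a final verification step are needed.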
3.2.2. Filters 2 and 3
Our second and third filters, that is, Filters 2 and 3, depend on a notion of distance between consecutive characters of a string. The distance between two consecutive characters of a string of length is defined by , where . We define . We also define an absolute version of it: , where returns the magnitude of ignoring the sign. Before we apply these two functions on our strings to get our filters, we need to do a simple preprocessing on the respective string, that is, in this case as follows. We extend the string by concatenating the first character of at its end. We use to denote the resultant string. So, we have . Since can simply be treated as another string, we can easily extend the notation and concept of over and we continue to abuse the notation a bit for the sake of conciseness as mentioned at the beginning of Section 3.2 (just before Section 3.2.1).
Now we have the following observation which is the basis of our Filter 2.
Observation 2. Consider a circular string and a linear string both having length and assume that and . If matches , then, we must have . Note carefully that the function has been applied on the extended strings.
Example 5. Consider the same two strings of Example 4, that is, . Here circularly matches (due to the conjugate ). Now consider the extended strings and assume that and . We have . Hence . Hence, . It can be easily verified that is also 14.
Now consider another string of the same length, which is slightly different from . It can easily be checked that does not match . However, assuming that we find that is still 14. So, this is an example of a false positive with respect to Filter 2.
Now we present the following related observation, which is the basis of our Filter 3. Note that Observation 2 differs from Observation 3 only in that it uses the absolute version of the function used in the latter.
Observation 3. Consider a circular string and a linear string both having length and assume that and . If matches , then, we must have . Note carefully that the function has been applied on the extended strings.
Example 6. Consider the same two strings of previous examples, that is, . Here circularly matches (due to the conjugate ). Now consider the extended strings and assume that and . We have . Hence . Hence, . It can be easily verified that is also 0.
Now consider another string of the same length, which is slightly different from . It can easily be checked that does not match . However, assuming that we find that is still 0. So, this is an example of a false positive with respect to Filter 3.
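The two distance-based functions can be sketched as follows on extended strings. The numeric mapping, the function names, and the sample strings are assumptions made for this illustration. Note that on an extended string the signed sum telescopes to zero, which is consistent with the value 0 appearing in the examples above.

```python
NUM = {"a": 1, "c": 2, "g": 3, "t": 4}  # assumed numeric values, for illustration


def extended(s):
    """Numeric representation of s with its first character appended at the end."""
    vals = [NUM[ch] for ch in s]
    return vals + vals[:1]


def total_distance(s):
    """Signed sum of differences between consecutive characters of the extended string."""
    e = extended(s)
    return sum(x - y for x, y in zip(e, e[1:]))


def abs_total_distance(s):
    """Sum of absolute differences between consecutive characters of the extended string."""
    e = extended(s)
    return sum(abs(x - y) for x, y in zip(e, e[1:]))


# "gtac" is a rotation of "acgt"; "actg" is not, yet both values still agree.
for s in ("acgt", "gtac", "actg"):
    print(s, total_distance(s), abs_total_distance(s))
```

Because the extended string closes the cycle, every rotation sees the same multiset of consecutive pairs, so both values are rotation invariant.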
3.2.3. Filter 4
Filter 4 uses the function used by Filter 1, albeit in a slightly different way. In particular, it applies the function on individual characters. So, for we define . Now we have the following observation.
Observation 4. Consider a circular string and a linear string both having length . If matches , then, we must have for all .
Example 7. Consider the same two strings of previous examples, that is, . Recall that circularly matches (due to the conjugate ). It is easy to calculate that , , , and . Hence according to Observation 4, the individual sum values for all the conjugates of must also match this. It can be easily verified that this is indeed the case.
Now consider the other string of the same length, which is slightly different from . It can easily be checked that does not match . However, as we can see, still we have , , , and . This is an example of a false positive with respect to Filter 4.
Notably, a similar idea has been used by Kahveci et al. in  for indexing large strings with a goal to achieve fast local alignment of large genomes. In particular, for a DNA string, Kahveci et al. compute the so-called frequency vector that keeps track of the frequency of each character of the DNA alphabet in the string.
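The per-character sums used by Filter 4 can be sketched as below. The numeric mapping, the function name `individual_sums`, and the sample strings are assumptions made for this illustration.

```python
NUM = {"a": 1, "c": 2, "g": 3, "t": 4}  # assumed numeric values, for illustration


def individual_sums(s):
    """For each alphabet character, the sum of its numeric value over its occurrences in s."""
    sums = dict.fromkeys(NUM, 0)
    for ch in s:
        sums[ch] += NUM[ch]
    return sums


# Rotations always agree; "actg" is not a rotation of "acgt" but also agrees.
print(individual_sums("acgt") == individual_sums("gtac"))
print(individual_sums("acgt") == individual_sums("actg"))
# A string with a different character composition is rejected.
print(individual_sums("acgt") == individual_sums("aacg"))
```

Since each per-character sum is just the character frequency scaled by a fixed value, this check carries the same information as a frequency vector.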
3.2.4. Filter 5
Filter 5 depends on modulo operation between two consecutive characters. A modulo operation between two consecutive characters of a string of length is defined as follows: , where . We define to be the summation of the results of the modulo operations on the consecutive characters of . More formally, . Now we present the following observation which is the basis of Filter 5. Note that this observation is applied on the extended versions of the respective strings.
Observation 5. Consider a circular string and a linear string both having length and assume that and . If matches , then, we must have . Note carefully that the function has been applied on the extended strings.
Example 8. Consider the same two strings of previous examples, that is, . Recall that circularly matches (due to the conjugate ). Now consider the extended strings and assume that and . We have . Hence . Hence, . Now according to Observation 5, we must also have . This is indeed true.
Now consider another string of the same length, which is different from . It can easily be checked that does not match . However, assuming that we find that is still 5. So, this is an example of a false positive with respect to Filter 5.
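A sketch of the modulo-based sum on extended strings follows. The numeric mapping (chosen here so that no value is zero), the operand order (left character modulo right character), the function name, and the sample strings are all assumptions made for this illustration.

```python
NUM = {"a": 1, "c": 2, "g": 3, "t": 4}  # assumed nonzero numeric values, for illustration


def sum_modulo(s):
    """Sum of (x mod y) over consecutive characters x, y of the extended string."""
    vals = [NUM[ch] for ch in s]
    ext = vals + vals[:1]  # extended string: first character appended at the end
    return sum(x % y for x, y in zip(ext, ext[1:]))


# The rotation "gtac" agrees with "acgt"; the non-rotation "actg" is rejected here,
# although it passes the sum- and distance-based checks.
for s in ("acgt", "gtac", "actg"):
    print(s, sum_modulo(s))
```

Because the modulo operation is not symmetric, this check is sensitive to the order of neighbouring characters, not just to their composition.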
3.2.5. Filter 6
In Filter 6 we employ the operation. A bitwise exclusive-OR () operation between two consecutive characters of a string of length is defined as follows: , where . We define to be the summation of the results of the xor operations on the consecutive characters of . More formally, . Now we present the following observation which is the basis of Filter 6. Note that this observation is applied on the extended versions of the respective strings.
Observation 6. Consider a circular string and a linear string both having length and assume that and . If matches , then, we must have . Note carefully that the function has been applied on the extended strings.
Example 9. Consider the same two strings of previous examples, that is, . Recall that circularly matches (due to the conjugate ). Now consider the extended strings and assume that and . We have . Hence . Hence, . Now according to Observation 6, we must also have . As can be verified easily, this is indeed the case.
Now consider another string of the same length, which is different from . It can easily be checked that does not match . However, assuming that we find that is still 28. So, this is an example of a false positive with respect to Filter 6.
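The xor-based sum can be sketched in the same way. The numeric mapping, the function name `sum_xor`, and the sample strings are assumptions made for this illustration and differ from the strings of Example 9.

```python
NUM = {"a": 1, "c": 2, "g": 3, "t": 4}  # assumed numeric values, for illustration


def sum_xor(s):
    """Sum of (x XOR y) over consecutive characters x, y of the extended string."""
    vals = [NUM[ch] for ch in s]
    ext = vals + vals[:1]  # extended string: first character appended at the end
    return sum(x ^ y for x, y in zip(ext, ext[1:]))


# The rotation "gtac" agrees with "acgt"; the non-rotation "actg" is rejected.
for s in ("acgt", "gtac", "actg"):
    print(s, sum_xor(s))
```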
3.2.6. Discussion with respect to 
At this point a brief discussion with respect to our preliminary work in  is in order. To reduce the text , we also employed six filters in . While Filter 1 and Filter 4 remain identical, in SimpLiFiCPM, we have changed and improved Filters 2, 3, 5, and 6 to get better results. In particular, we have introduced the concept of extended string here and modified the filters accordingly. Much of the efficiency of these new filters comes from the fact that in the preliminary version, without the extended strings, we had to deal with a set of values as the output of the functions, creating a small bottleneck. In contrast, SimpLiFiCPM now needs to deal with only one value as the output of the functions of Filters 2, 3, 5, and 6. This makes SimpLiFiCPM even faster than its predecessor. This is evident from the experimental results presented later. Notably, this has brought some more changes in the overall algorithm. In particular, in the searching phase of the algorithm we now need to adapt accordingly to apply the corresponding filters on the extended strings. But the overall improvement outweighs this extra work by a wide margin.
3.3. Circular Pattern Signature Using the Filters
In this section, we discuss an $O(m)$-time algorithm that SimpLiFiCPM uses to compute the signature of the circular pattern corresponding to a pattern $P$ of length $m$. This signature is used at a later stage to filter the text. Here, we need five variables to save the output of the functions used for Filters 1, 2, 3, 5, and 6 (based on Observations 1, 2, 3, 5, and 6). And we need a list of size 4 to save the values of the function used in Filter 4 (Observation 4). We start with the extended string of $P$ and compute the values according to Observations 1 to 6. The algorithm makes a single pass over the extended string and hence the overall runtime of the algorithm is $O(m)$. The algorithm is presented as Algorithm 1.
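The single-pass signature computation described above might look roughly like the following Python sketch. The numeric mapping, the function name `signature`, the order of the fields in the returned tuple, and the operand order of the modulo operation are assumptions made for this illustration; the sketch is not the paper's Algorithm 1.

```python
NUM = {"a": 1, "c": 2, "g": 3, "t": 4}  # assumed numeric values, for illustration


def signature(p):
    """Compute all six filter values of p in one pass over its extended string."""
    vals = [NUM[ch] for ch in p]
    ext = vals + vals[:1]  # extended string: first character appended at the end
    total = abs_dist = dist = mod_sum = xor_sum = 0  # five scalar values
    ind = [0, 0, 0, 0]  # list of size 4 for the per-character sums
    for x, y in zip(ext, ext[1:]):  # one iteration per character of p
        total += x
        ind[x - 1] += x
        abs_dist += abs(x - y)
        dist += x - y
        mod_sum += x % y
        xor_sum += x ^ y
    return (total, abs_dist, dist, tuple(ind), mod_sum, xor_sum)


# Every rotation of a pattern yields the same signature.
print(signature("acgt") == signature("gtac"))
```

Keeping the signature as a single tuple makes the later comparison against each text window a single equality test.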
3.4. Reduction of Search Space in the Text
Now we present an $O(n)$-time algorithm that SimpLiFiCPM uses to reduce the search space of the text by applying the six filters presented above. It takes as input the pattern $P$ of length $m$ and the text $T$ of length $n$. It first calls the signature procedure (Algorithm 1) with $P$ as parameter and uses the output. It then applies to the text the same technique that is applied in Algorithm 1. We apply a sliding window approach with a window length of $m$ and calculate the values applying the functions according to Observations 1–6 on the factor of $T$ captured by the window. Note that, for Observations 2, 3, 5, and 6, we need to consider the extended string, and hence the factor of $T$ within the window needs to be extended accordingly for calculating the values. After we calculate the values for a factor of $T$, we check them against the values returned by Algorithm 1. If they match, then we output the factor to a file. Note that, in case of overlapping factors (e.g., when consecutive windows need to output their factors to the file), the procedure outputs only the nonoverlapped characters. It also uses a special marker character to mark the boundaries of nonconsecutive factors.
Now note that we can compute the values of consecutive factors of $T$ using the sliding window approach quite efficiently as follows. For the first factor, that is, the first window of length $m$, we exactly follow the strategy of Algorithm 1. When it is done, we slide the window by one character and we only need to remove the contribution of the leftmost character of the previous window and add the contribution of the rightmost character of the new window. The functions are such that this can be done very easily using simple constant time operations. The only other issue that needs to be taken care of is due to the use of the extended string in four of the filters. But this too does not need more than simple constant time operations. Therefore, the overall runtime of the algorithm is $O(n)$. The algorithm is presented as Algorithm 2.
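The constant-time window update can be sketched as follows. For the four filters that use the extended string, the sketch keeps the sum over pairs strictly inside the window and adds the wrap-around pair (last character, first character) separately for each window. The numeric mapping, all names, the field order of the signature, and the 0-based positions are assumptions made for this illustration; the sketch returns candidate positions instead of writing factors to a file.

```python
NUM = {"a": 1, "c": 2, "g": 3, "t": 4}  # assumed numeric values, for illustration

# Pairwise functions applied to consecutive characters of the extended string.
PAIR_FUNCS = (
    lambda x, y: abs(x - y),
    lambda x, y: x - y,
    lambda x, y: x % y,
    lambda x, y: x ^ y,
)


def signature(s):
    """Directly computed filter values of s (used for the pattern)."""
    v = [NUM[ch] for ch in s]
    ext = v + v[:1]
    ind = [0] * 5  # per-character sums, index 0 unused
    for x in v:
        ind[x] += x
    pairs = tuple(sum(f(x, y) for x, y in zip(ext, ext[1:])) for f in PAIR_FUNCS)
    return (sum(v), tuple(ind)) + pairs


def candidate_positions(pattern, text):
    """Positions whose length-m window has the same signature as the pattern."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    target = signature(pattern)
    t = [NUM[ch] for ch in text]
    total = sum(t[:m])
    ind = [0] * 5
    for x in t[:m]:
        ind[x] += x
    # Sums over pairs strictly inside the first window (no wrap-around pair yet).
    inner = [sum(f(x, y) for x, y in zip(t[:m], t[1:m])) for f in PAIR_FUNCS]
    hits = []
    for i in range(n - m + 1):
        if i > 0:
            out, inc = t[i - 1], t[i + m - 1]  # character leaving / entering
            total += inc - out
            ind[out] -= out
            ind[inc] += inc
            for k, f in enumerate(PAIR_FUNCS):
                # Drop the leftmost old pair, add the rightmost new pair.
                inner[k] += f(t[i + m - 2], inc) - f(out, t[i])
        first, last = t[i], t[i + m - 1]
        pairs = tuple(inner[k] + f(last, first) for k, f in enumerate(PAIR_FUNCS))
        if (total, tuple(ind)) + pairs == target:
            hits.append(i)
    return hits
```

Each slide touches a constant number of values, so the whole scan is linear in the text length for a fixed alphabet.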
3.5. The Combined SimpLiFiCPM Algorithm
In this section we combine the algorithms presented so far and present the complete view of SimpLiFiCPM. We have already described the two main components of SimpLiFiCPM, namely, Procedure and Procedure , that in fact calls the former. Now Procedure provides a reduced text (say) after filtering. At this point SimpLiFiCPM can use any algorithm that can solve CPM and apply it over and output the occurrences. Now, suppose SimpLiFiCPM uses algorithm at this stage which runs in time. Then, clearly, the overall running time of SimpLiFiCPM is . For example, if SimpLiFiCPM uses the linear time algorithm of , then clearly the overall theoretical running time of SimpLiFiCPM will be .
In our implementation, however, we have used the recent algorithm of , which is a linear time algorithm on average and the fastest algorithm in practice to the best of our knowledge. In particular, in , the authors have presented an approximate circular string matching algorithm with $k$-mismatches (ACSMF-Simple) via filtering. They have built a library for the ACSMF-Simple algorithm. The library is freely available and can be found in . In this algorithm, if we set $k = 0$, then ACSMF-Simple works for the exact matching case. In what follows, we will refer to this algorithm as ACSMF-SimpleZero. We have implemented SimpLiFiCPM using ACSMF-SimpleZero; that is, we have used the ACSMF-Simple algorithm simply by setting $k = 0$.
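The final verification stage can be sketched as follows, assuming the candidate positions produced by the filters are already available. In this sketch a brute-force rotation check stands in for ACSMF-SimpleZero, and merged candidate segments stand in for the separator-delimited reduced text; the function name `verify_candidates` and the 0-based positions are likewise assumptions made for illustration.

```python
def verify_candidates(pattern, text, candidates):
    """Run an exact CPM check only on the parts of text covered by candidate windows."""
    m = len(pattern)
    if m == 0:
        return []
    doubled = pattern + pattern  # every rotation is a length-m factor of this string
    # Merge overlapping or adjacent candidate windows into maximal segments.
    segments = []
    for i in sorted(candidates):
        if segments and i <= segments[-1][1]:
            segments[-1][1] = max(segments[-1][1], i + m)
        else:
            segments.append([i, i + m])
    matches = []
    for start, end in segments:
        piece = text[start:end]
        # Stand-in verifier: any exact CPM algorithm could be used on each piece.
        for j in range(len(piece) - m + 1):
            if piece[j:j + m] in doubled:
                matches.append(start + j)  # map back to a position in the full text
    return matches
```

Keeping the start offset of each segment is the bookkeeping that lets positions found in the reduced text be reported as positions in the original text.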
3.6. An Illustrative Example
Now that we have fully described SimpLiFiCPM, in this section we present the simulation of SimpLiFiCPM on a particular example. We only show the simulation up to the output of Procedure , that is, the output of the reduced text, because afterwards we can employ any state-of-the-art algorithm within SimpLiFiCPM. Consider a pattern . The values computed by Procedure according to Observations 1 through 6 are as follows, respectively: , , , , , and .
Again consider a text string . For the first sliding window we need to calculate the observation values from . The observation values according to Procedure are as follows for : , , , , , and .
The length of is . And the length of is . So, the algorithm iterates exactly = times. Each iteration is illustrated in Table 1.
4. Experimental Results
We have implemented SimpLiFiCPM and conducted extensive experiments to analyze its performance. We have coded SimpLiFiCPM in C++ using a GNU compiler with General Public License (GPL). Our code is available at . As has been mentioned already above, our implementation of SimpLiFiCPM uses the ACSMF-SimpleZero . ACSMF-Simple  has been implemented as library functions in the C programming language under the GNU/Linux operating system. The library implementation is distributed under the GNU General Public License (GPL). It takes as input the pattern $P$ of length $m$, the text $T$ of length $n$, and the integer threshold $k$ and returns the list of starting positions of the occurrences of the rotations of $P$ in $T$ with $k$-mismatches as output. In our case we use $k = 0$.
We have used real genome data in our experiments as the text string, . This data has been collected from . Here, we have taken 299 MB of data for our experiments. We have generated random patterns of different length by a random indexing technique in these 299 MB of text string.
We have conducted our experiments on a PowerEdge R820 rack server with 6-core Intel Xeon processor E5-4600 product family and 64 GB of RAM under GNU/Linux. With the help of the library used in , we have compared the running time of our preliminary work in  (referred to as Filter-CPM henceforth), ACSMF-SimpleZero of , and SimpLiFiCPM. Table 2 reports the elapsed time and speed-up comparisons for various pattern sizes . As can be seen from Table 2, Filter-CPM  runs faster than ACSMF-SimpleZero in all cases. And in fact Filter-CPM  achieves a minimum of twofold speed-up for all the pattern sizes. Again, referring to the same table, SimpLiFiCPM runs even faster than ACSMF-SimpleZero in all cases. And in fact SimpLiFiCPM achieves a minimum of threefold speed-up for all the pattern sizes.
In order to analyze and understand the effect of our filters we have run a second set of experiments as follows. We have run experiments on three variants of SimpLiFiCPM where the first variant (SimpLiFiCPM-) only employs Filters 1 through 3, the second variant (SimpLiFiCPM-) only employs Filters 1 through 4, and finally the third variant (SimpLiFiCPM-) employs Filters 1 through 5. Table 3 reports the elapsed time and speed-up comparisons considering various pattern sizes for ACSMF-SimpleZero and the above-mentioned three variants of SimpLiFiCPM. As can be seen from Table 3, ACSMF-SimpleZero is able to beat SimpLiFiCPM- in a number of cases. However, SimpLiFiCPM- and SimpLiFiCPM- run significantly faster than ACSMF-SimpleZero in all cases.
5. Conclusions
In this paper, we have employed some effective lightweight filtering techniques to reduce the search space of the circular pattern matching (CPM) problem. We have presented SimpLiFiCPM, an extremely fast algorithm based on the above-mentioned filters. We have conducted extensive experimental studies to show the effectiveness of SimpLiFiCPM. In our experiments, SimpLiFiCPM has achieved a minimum of threefold speed-up compared to the state-of-the-art algorithms. Much of the speed of our algorithm comes from the fact that our filters are effective but extremely simple and lightweight. The most intriguing feature of SimpLiFiCPM is perhaps its capability to plug in any algorithm to solve CPM and take advantage of it. We are now working towards adapting the filters so that they work for the approximate version of CPM.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
Part of this research has been supported by an INSPIRE Strategic Partnership Award, administered by the British Council, Bangladesh, for the project titled “Advances in Algorithms for Next Generation Biological Sequences.” M. Sohel Rahman is a Commonwealth Academic Fellow funded by the UK Government who is currently on a sabbatical leave from BUET.
References
M. Lothaire, Applied Combinatorics on Words, Cambridge University Press, New York, NY, USA, 2005.
R. Susik, S. Grabowski, and S. Deorowicz, “Fast and simple circular pattern matching,” in Man-Machine Interactions 3: Proceedings of the 3rd International Conference on Man-Machine Interactions (ICMMI '13), vol. 242, pp. 537–544, Springer International Publishing, 2014.
M. A. R. Azim, C. S. Iliopoulos, M. S. Rahman, and M. Samiruzzaman, “A fast and lightweight filter-based algorithm for circular pattern matching,” in Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (BCB '14), pp. 621–622, ACM, Newport Beach, Calif, USA, September 2014.
G. Lipps, Plasmids: Current Research and Future Trends, Caister Academic Press, Norfolk, UK, 2008.
A. Mosig, I. Hofacker, P. Stadler, and A. Zell, “Comparative analysis of cyclic sequences: viroids and other small circular RNAs,” in Proceedings of the German Conference on Bioinformatics, vol. 83 of Lecture Notes in Informatics, pp. 93–102, 2006.
T. Lee, J. Na, H. Park, K. Park, and J. Sim, “Finding optimal alignment and consensus of circular strings,” in Combinatorial Pattern Matching: 21st Annual Symposium, CPM 2010, New York, NY, USA, June 21–23, 2010, vol. 6129 of Lecture Notes in Computer Science, pp. 310–322, Springer, Berlin, Germany, 2010.