Research Article  Open Access
Maleeha Najam, Raihan Ur Rasool, Hafiz Farooq Ahmad, Usman Ashraf, Asad Waqar Malik, "Pattern Matching for DNA Sequencing Data Using Multiple Bloom Filters", BioMed Research International, vol. 2019, Article ID 7074387, 9 pages, 2019. https://doi.org/10.1155/2019/7074387
Pattern Matching for DNA Sequencing Data Using Multiple Bloom Filters
Abstract
Storing and processing of large DNA sequences has always been a major problem due to increasing volume of DNA sequence data. However, a number of solutions have been proposed but they require significant computation and memory. Therefore, an efficient storage and pattern matching solution is required for DNA sequencing data. Bloom filters (BFs) represent an efficient data structure, which is mostly used in the domain of bioinformatics for classification of DNA sequences. In this paper, we explore more dimensions where BFs can be used other than classification. A proposed solution is based on Multiple Bloom Filters (MBFs) that finds all the locations and number of repetitions of the specified pattern inside a DNA sequence. Both of these factors are extremely important in determining the type and intensity of any disease. This paper serves as a first effort towards optimizing the search for location and frequency of substrings in DNA sequences using MBFs. We expect that further optimizations in the proposed solution can bring remarkable results as this paper presents a proof of concept implementation for a given set of data using proposed MBFs technique. Performance evaluation shows improved accuracy and time efficiency of the proposed approach.
1. Introduction
DNA sequencing is paving the way in practical applications including life sciences and agricultural fields. However, massive size of DNA sequence data sets available today poses significant computational challenges. Several algorithms have been proposed over the years to address these issues. The compressionbased algorithms have gained the attention of researchers recently due to their efficient data storage and processing capabilities for DNA sequencing. Bloom filters (BFs) are such techniques used in bioinformatics domain for classification where a set of known DNA sequences facilitate the classification of the unknown sequences. It is due to the inherent capability of BFs that allows it to answer any query in either yes or no. Suppose a bloom filter is created for a set of elements and the set is used to answer the query if a certain element is present in the set or not. Further, to determine if the queried element is part of the set, m hash functions are applied to that element. The index positions obtained through hashing are checked in the BF. If all those indexes (or bits) of BF are set to 1, this means that the element is most probably part of the set (false positives can arise in BF), else the element is certainly not in the set. The query response time is fast as it takes O(m), where m is the number of hash functions applied to the element. Likewise, a BF created for a known DNA sequence can determine if the given unknown sequence has any similarity with the known DNA sequence (for which BF is created) and this can quickly be answered in a series of yes or no. Thus, BFs highly assist in determining the origin of the given unknown sequence. In addition to their use as classifiers, some work exists on compression side as well. However, there are certain problems that limit the use of this fast and memory efficient data structure in other areas of bioinformatics.
As mentioned earlier, a bloom filter can only answer any query in yes or no. For example, if there exists a BF for a human chromosome 16 and if a certain disease pattern needs to be checked whether it is present in that chromosome or not, the BF can answer the query in “yes” if the pattern is present or “no” if the pattern is not present. However, in medical science, it is significant to determine the presence of the pattern in the certain chromosome but also the position of that pattern in the chromosome. It is because the location of the disease pattern can also determine the type of the disease; e.g., Maturity Onset Diabetes of the Young (MODY), which is an inherited form of diabetes mellitus, has more than ten different types. Based on the location of the pattern, the type of the disease can be found. Moreover, the number of times a certain pattern is reported in the DNA sequence is also important as it shows the intensity of the disease. For example, the high repetition of the pattern CTG/CAG is often linked with the Huntington’s disease. However, algorithms have been previously proposed where compression is achieved through BFs but to retrieve the original sequence it must be decompressed. Through this paper, we addressed some of these problems due to which BFs are not particularly preferred for pattern matching and sequence compression. Following are the main contributions of this work:(i)We propose an algorithm using multiple BFs to compress DNA sequences which eliminates the need to store the original DNA sequence for pattern searching tasks(ii)We propose and integrate techniques for identifying the exact location of the pattern as well as the number of times the pattern is repeated in the sequence, within the compressed sequence(iii)We provide initial experimental results for evaluating the accuracy and time efficiency of the proposed approach along with deficiencies observed and identify potential future research directions
The remainder of this paper is organized as follows: Section 2 covers the related work. The proposed algorithm is discussed in Section 3. Section 4 presents the experimental results and finally the last section presents the conclusions.
2. Literature Review
Compression algorithms for DNA or genome generally fall into four categories. The first category is naivebit encoding where one or more characters are represented by a certain codeword. For example, the simplest encoding for DNA sequence can be obtained by assigning 2 unique pair of bits to each of the unique alphabet present in a DNA sequence like A=00, C=01, G=10, and T=11 [1–3]. The second category is a dictionarybased or substitutional compression. It is mostly observed that a DNA sequence consists of a lot of repeated sequences. As a result, the repeated sequences can be replaced by references to a dictionary that is either built offline or is maintained at runtime [4]. The most common dictionarybased algorithms include LZ77 and LZ78.
Further, there are statistical methods that achieve extremely good compression rates by generating probabilistic models based on genome datasets [5–7]. The fourth category is referential compression where any repeated sequence in an input dataset is replaced with a reference to one or more external DNA sequences [8–11]. A data structure plays a critical role in any algorithm designed for achieving good compression ratios, fast searching of patterns inside sequences, or both. Many algorithms proposed in the area of bioinformatics make use of selfindexbased data structures to achieve above mention goals. They can be used to avoid the need of keeping large files of text along with the index [12–15]. The index itself contains sufficient information that any part of the text can be recreated. Some of the examples of selfindexes include compressed suffix array (CSA), Succinct Suffix Array (SSA), and FMindex [16–19]. In addition to this, there are other variations of indexes as well [20] and some of them are shown under different categories in Table 1.

Bloom filter (BF), a probabilistic data structure, is often seen being used in different algorithms. In [21], BF is used to store the DeBruijn graph created for storing genome; otherwise DeBruijn itself consumes lot of memory to store the data. However, this leads to false nodes and branching due to false positives induced by BF but the authors have proposed a mechanism to reduce them. Rozov et al. [22] proposed BARCODE that makes use of reference sequence as mentioned previously to achieve compression. The bloom filters are core part of their algorithm; they hash all the reads in the BF and decode them later for querying. Stranneheim et al. [23] proposed FACS that uses BF to classify sequences and the classification speed is high as the searching time is independent of the BF size. In [24], Sequence Bloom Tree (SBT) is proposed that also makes use of BF for fast classification of sequences.
Similarly, in bioinformatics, DNA error correction is another interesting problem where blood filter is used. Due to largesized data sets, error correction is a timeconsuming process. Therefore, a number of algorithms have been proposed for error correction. A. Ramachandran et al. [25] have proposed an FPGA based accelerated errorcorrection algorithm. It is designed to improve the throughput of DNA. The proposed algorithm is based on BLESS which used bloom filter as a main data structure due to its memory efficiency.
The genomic process generates a huge amount of data which is often difficult to store. Therefore, many techniques have been proposed to store the data in compressed form. In [26], Sebastian proposed a compression algorithm to process single and paired reads. The proposed technique is based on partial matching and dynamic Markov coder algorithm. However, the main drawback of the proposed work is slow processing and large memory usage. The slow process is due to the query searching process typically for incomplete kmers. Similarly, in [27], the authors proposed a new mechanism to generate an index of similar strings. The proposed technique can also manage redundancies. Tomasz et al. [28] proposed a new technique to store and index the NGS reads in memory. The proposed technique is based on the traditional technique of counting and locating kmers. The proposed technique is more compact compared to GkA and CGkA solutions.
Heo. Y. et al. [29] have proposed bloom filter based error correction for Next Generation Sequencing (NGS) data. With tremendous increase in the size of data, the NGS data contains far more errors compared to traditional sequencing methods. The traditional algorithms are not memory efficient. Therefore, the authors proposed an algorithm based on bloom filters to reduce the memory overhead and produce much more accurate results. Bloom filters are widely used for pattern matching; however, there are many applications where bloom filters are not suitable, e.g., deleting one attribute based on the other. Jiangbo Qian et al. [30] proposed bloom filter based associative deletion algorithms. Reem Khairy et al. [31] presented the accelerated bloom filter approach using high level synthesis. The authors presented the portable solution that can be incorporated in many applications to take benefit of bloom filter performance. In bioinformatics, most of the algorithms proposed using bloom filter belong to the category of classification. The BF can easily determine if a sequence belongs to the text being searched or not but it cannot accurately determine the exact location of the pattern and the number of times the pattern is repeated in the sequence. These are important factors to be determined if the exact disease or intensity of the disease is to be identified as discussed in the previous section. The next section covers the proposed approach which is based on Multiple Bloom Filters to identify the location of pattern in a given sequence.
Similarly, an efficient storing collection of large populations of genomes and timely searching of data is a challenging problem. Techniques exist to reduce the space requirement such as referencebased compression. However, such storing techniques adversely impact the search techniques. Therefore, Sebastian et al. proposed a framework based on MultiReference Compressed Search Indexes (MRCSI) [32]. The proposed MRCSI has achieved increased compression rates and support string searching. In genomes, aligning sequencing reads is one of the key operations performed against a large collection of genome data. The aligning process required a significant amount of memory. In [33] Danek et al. proposed a Multiple Genome Index (MuGI) to find the occurrences in a large collection of genomes. The MuGI used small size customizable indexes and easily runs on commodity systems with 8 GB RAM.
The next section covers the multiple bloom filter based proposed approach for storing chromosome data and searching patterns.
3. Multiple Bloom Filters (MBFs) Based Proposed Approach
This section presents the proposed method of storing chromosome data and searching patterns in the data as shown in Figure 1. First, we describe the method of using MBFs or an array of BFs to store chromosome data in the compressed format. Subsequently, we describe the procedure to search patterns in the DNA sequences that are stored in MBFs without decompressing them.
3.1. Compression
This phase comprises of three important steps:(1)Formation of kmers of the given chromosome(2)Storing locations of each unique kmer in the KeyValue (KV) store on disk(3)Storing location data for each unique kmer of chromosome in a separate BF
The first step is critical to determine the length (k) of DNA words where a DNA word is simply a sequence of characters (A,C,G,T) of length k. If the length of DNA word “k” is chosen as 4, 5, or 6, the associated words of a given DNA sequence is referred to as 4mers, 5mers, and 6mers, respectively. If the length of the sequence (N) is segmented into words is already known, then the number of total words formed can easily be determined based on the selected kmer size (k), for example, given sequence: ACGTTCCACGTTCA, total characters: N=14, kmer size k=5, total no. of words= Nk+1 =145+1= 10.
Figure 2 shows the formation of words for the above sequence. As we increase kmer size, the number of unique kmers increases by the factor 4k. When k is increased, the number of unique kmers or words increases; on the other hand kmer occurrences in the sequence reduce. We carried an experiment for different chromosomes; Figure 3 shows this trend particularly for chromosome 1 (Chr 1). Instead of displaying each kmer with characters (e.g., AAAA, AAAC, and AAAG for k=4) on xaxis, we have just shown their numbers like kmer 1, kmer 2, and kmer 3. The yaxis shows the number of occurrences of each particular kmer in the sequence. It is an important observation that increasing kmer size increases number of unique kmers whereas the frequency of occurrence of the kmers gets reduced in the sequence as length of the DNA words is increased. Thus, kmer size is a critical parameter and has different impact on the size of KV store and MBF. This is further discussed in the experimental results section.
Next, a KeyValue (KV) storage is required to store all the locations of each unique kmer in the given chromosome. Here, Key is the kmer (or DNA word), e.g., ACGTT, and Value is the sequence of locations where that specific kmer is located in the chromosome. Table 2 shows how location data is saved in KV store corresponding to each unique kmer and it resides on the disk. The left side of the table maintains kmers and right side stores the indexes or locations of each corresponding kmer in the sequence that needs to be compressed. This step is performed only once and in offline mode before pattern matching process starts. It has special significance in this proposed solution as it is used to answer a single query per pattern. This is further discussed in the pattern matching section.

The next step is the construction of Multiple Bloom Filters (MBFs) for each single chromosome. The MBFs are an integral part of the proposed scheme. Bloom filter (BF), a probabilistic data structure, is used here to store all the location data of each unique kmer in a separate BF. As a result, there exists a BF for each distinctive kmer present in the chromosome. This consequently leads to the formation of MBFs as shown in Figure 4. A BF is just like a vector that stores Boolean values: 0 and 1. By default, all the values are set to false (0) and to store the locations of the specified kmer in BF, first hash function is applied to the location (or index value) of the kmer and the resultant value is used to find that index of the BF, which is then set to true . Thus, there exists a BF for each unique kmer and the BF has all of its indexes set to true that are found after applying a hash function to each location (or index) where that kmer is present in the chromosome.
The size of the BF plays a significant role in achieving compression. To determine the accurate size of the BF, following parameters should be known: n= number of items to be stored in the BF: p= probability of false positives (FPs); m= number of bits in the filter; k= number of hash functions (used for index finding in BF).Here, false positives refer to the situation where a kmer is not present at a particular location in the chromosome whereas the hash function results in indexes that are set to true in BF, which means kmer is present at that location. For instance, we need to store one of the locations (which is a 20th index of chromosome string) of kmer TTTT, a hash function is applied on 20, and the result is an index 10 of BF corresponding to TTTT. Index 10 of BF is then set to true. Further, a query needs to be answered if TTTT is present at index 30 of the chromosome string or not, a hash function is applied to it (i.e., 30), and the result comes out to be index 10, which is earlier set to true. Results show that TTTT is present at index 30 of the chromosome string whereas BF is set to true at index 10 due to 20. In this way, false positives can create a discrepancy in results. The false positive can be reduced if “p”, i.e., FP, is kept low and large number of hash functions (i.e., “k”) is used in (1) and (2), respectively. In the proposed solution, we kept FP probability neither too high nor too low, i.e., 0.01 but n varies as each kmer has a different number of occurrences in the chromosome. Consequently, m that is the size of the bloom filter in bits varies for each unique kmer of the specified chromosome.
3.2. Pattern Searching
A pattern is a sequence of DNA characters (A,C,G,T) that is to be searched in a DNA sequence or chromosome. Before the pattern matching process starts, a pattern is decomposed into kmers of size k. DNA word or kmer size is set to 4 when KV store is built as discussed in the previous section. The 4 is selected as the final value of k and a pattern is decomposed into (Nk+1) kmers where N is the length of the pattern sequence. Afterwards, one kmer that is part of the pattern is selected such that it has the least number of occurrences in the chromosome (target string to be searched) among all other kmers that are part of a pattern. For example, a KV store and MBFs are created for a DNA sequence. Now, a pattern ACGTTGCA is to be searched inside that sequence. As k is already set to 4 when KV store is built, therefore 84+1=5 words of size 4 each is formed for the pattern, which is as follows: ACGT CGTT GTTG TTGC TGCA
The next step is to select the kmer from the pattern with the least number of occurrences in the sequence. In case, there is a tie between two or more kmers then the one that comes first in the sorted list of kmers is selected. Suppose ACGT is one such kmer in the present example with only three occurrences in the sequence. Then, a query is sent to KV store to provide the location data of ACGT that comprises of all the index positions of DNA sequence where ACGT has occurred (indexes 10, 30, and 40). Afterwards, this data has to be reduced to make pattern matching process faster. For that quick check step is performed, that requires the bloom filter of the last or the first kmer of the pattern depending on the position of the kmer chosen in the previous step.
In the given example, BF of TGCA, which is the last kmer of pattern, is extracted. Since the distance between ACGT and TGCA is 4; therefore, for each location index of ACGT present in the location data extracted from KV store, 4 is added, e.g., 10+4, 30+4, and 40+4. The resultant value (i.e., 14 first) is hashed and the result of the hash function (index) is checked in the BF of TGCA if it is set to 1 or not. If it is set to 1 then there is a high possibility that the given pattern exists in the chromosome; otherwise pattern does not exist. For each location of ACGT for which the BF of TGCA gives back a positive response, the location is preserved and for the ones that give a negative response, those locations are discarded from the temporary array. In this way, this array often gets reduced in size as the pattern matching process proceeds as illustrated in Figure 4 as well. The usefulness of this step is highlighted when patterns are longer and it can also be observed in the experimental section that large length patterns give many accurate results compared to smaller patterns.
Now, the 10th index is the only index left in the temporary location array after quick check process. So, this index, i.e., 10, is incremented by 1 and the resultant index 11 is hashed. The result of the hash is used to find out if that place in the bloom filter of CGTT (which is the second kmer of the pattern) is set to 1 or 0. Since it is set to 1, this means ACGTT (ACGT and CGTT) is present in the sequence. Next, the same procedure is performed for GTTG and TTGC with location indexes 12 and 13, respectively. In the given example, BFs of both GTTG and TTGC give positive results. However, if any BF returns a negative response in the form of 0 then pattern matching process does not need to proceed further as the pattern does not exist. By the end of this pattern matching process, the temporary array of ACGT is left with only those indexes where the pattern exists.
4. Experimental Results
First, an experiment is performed to determine the impact of kmer size on KeyValue (KV) store. As mentioned earlier, KV store is an integral part of the proposed solution, which stores all the locations of each unique kmer present in the chromosome. We used LevelDB [34] for this purpose, which is a fast ondisk KV store that stores sorted data on the basis of the key in the compressed form (“LevelDB”, 2011). Table 3 shows the size of KV store for Chr 1, Chr 12, and Chr 21 for the different kmer sizes (4, 5, and 6). The chromosome datasets are obtained from NCBI and all the “N” characters are removed from the datasets before conducting the experiments on them [35].

The results show that, by incrementing the kmer size by one, the size of the KV store increases drastically. Therefore, we chose kmer of size 4 for the main experiments that are discussed in this section. The KV store can be created offline and stored on disk before the start of pattern matching process as a result construction time is not discussed here. Further, some experiments are performed to see the impact of kmer size on the size of Multiple Bloom Filters (MBFs) for each corresponding chromosome. The false positive probability of each of the bloom filter is set to 0.01. In Table 4, it is observed that incrementing the kmer size keeps the MBF size nearly constant for each corresponding chromosome. For example, MBF size of Chr 1 is 268 MB, 270 MB, and 271 MB for kmer sizes 4, 5, and 6, respectively. Moreover, construction of MBF for chromosomes is fast and is nearly the same for a certain chromosome irrespective of its kmer size. The false positive (FP) probability of a bloom filter plays a significant role in achieving compression; see Table 5. We determined the size of MBF for Chr 1, Chr 12, and Chr 21 on three different FP probabilities: 0.1, 0.01, and 0.001. It is observed that high FP probability achieves good compression as compared to low FP probability. However, choosing high FP probability can have a negative impact on pattern matching results whereas the increase in pattern size can reduce the effect on results as discussed later in this section.


Next, we evaluated the performance of the proposed solution by measuring the time it takes to find the pattern inside the chromosome. The results are listed in Table 6, where the first column represents the pattern used for evaluation, followed by the length of the pattern and afterward the time taken to identify the stated number of occurrences of that pattern inside the string of the specific chromosome. In all these experiments, kmer size is kept 4 and FP probability is 0.01. If we increase FP probability, the MBFs get highly compressed; however at the same time the number of false positives in the pattern matching process increases that leads to increase in pattern matching time. The number of false positives for Chr 1, Chr 12, and Chr 21 is reported in Table 7 for two cases. In the first case, FP probability is set to 0.1 and for the second case, it is 0.01. It shows that increasing FP probability drastically increases the number of false positives; on the other hand, as we increase the length of the pattern, the number of FPs gets highly reduced as shown for the pattern TTTATTGGAAATATGGGAT present in Chr 12. Thus, this problem can be resolved with large length of patterns (2025) used for identification. In addition to this, a restriction can be placed on the number of occurrences of the pattern to be identified. We compared the results with another pattern matching tool named Seeq [36] that is based on a Levenshtein distance metric. The results are shown in Table 8. It is observed that MBF based pattern matching is fast but its speed is not comparable with other pattern matching tools. Thus, some optimizations are required to make this pattern matching process faster.



5. Conclusion
Problems related to storage and processing are not unknown in the field of bioinformatics due to the large volume of DNA sequence data. In this paper, bloom filters are explored to provide a solution other than the simple classification of DNA sequences. The proposed approach using MBFs allows finding the exact location of the patterns and the number of times the pattern is repeated in the compressed sequence, without using original DNA sequence in its decompressed form. The experiments are carried out on human chromosomes to provide a proof of concept application of the proposed MBFs technique in real life data. Optimizations on various levels are still required and in future, we aim to work on all such areas that lack efficiency, e.g., KV store size and search time of patterns. We expect that this solution can bring highly promising results if optimizations are taken in the right direction. The ability to look for patterns inside compressed DNA sequences makes the proposed solution feasible for realworld applications and is anticipated to be highly sought after in the near future.
Appendix
The code and dataset used to benchmark the proposed technique is available at https://github.com/mrhua2019/PatternMatchingforDNASequencingDatausingMultipleBloomFilters.git.
Data Availability
No data is associated with this article to report.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
References
 V. Yanovsky, “ReCoil  an algorithm for compression of extremely large datasets of DNA data,” Algorithms for Molecular Biology, vol. 6, no. 1, p. 23, 2011. View at: Google Scholar
 S. Wandelt, M. Bux, and U. Leser, “Trends in genome compression,” Current Bioinformatics, vol. 9, no. 3, pp. 315–326, 2014. View at: Publisher Site  Google Scholar
 S. Deorowicz and S. Grabowski, “Data compression for sequencing data,” Algorithms for Molecular Biology, vol. 8, no. 1, article 25, 2013. View at: Google Scholar
 N. S. Bakr and A. A. Sharawi, “DNA lossless compression algorithms,” American Journal of Bioinformatics Research, vol. 3, no. 3, pp. 72–81, 2013. View at: Google Scholar
 D. Pratas and A. J. Pinho, “Compressing the human genome using exclusively markov models,” in 5th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2011), vol. 93 of Advances in Intelligent and Soft Computing, pp. 213–220, Springer, Berlin, Germany, 2011. View at: Publisher Site  Google Scholar
 B. Chor, D. Horn, N. Goldman, Y. Levy, and T. Massingham, “Genomic DNA kmer spectra: models and modalities,” Genome Biology, vol. 10, no. 10, article R108, 2009. View at: Publisher Site  Google Scholar
 M. D. Cao, T. I. Dix, L. Allison, and C. Mears, “A simple statistical algorithm for biological sequence compression,” in Proceedings of the Data Compression Conference (DCC '07), pp. 43–52, IEEE, Snowbird, Utah, USA, March 2007. View at: Publisher Site  Google Scholar
 S. Wandelt and U. Leser, “String searching in referentially compressed genomes,” in Proceedings of the 4th International Conference on Knowledge Discovery and Information Retrieval, KDIR 2012, pp. 95–102, Spain, October 2012. View at: Google Scholar
 P. Pande and D. Matani, “Compressing the human genome against a reference,” Tech. Rep., Stony Brook University, 2011. View at: Google Scholar
 J. Sirén, “Compressed suffix arrays for massive data,” in String Processing and Information Retrieval, J. Karlgren, J. Tarhio, and H. Hyyrö, Eds., vol. 5721 of Lecture Notes in Computer Science, pp. 63–74, Springer, Berlin, Germany, 2009. View at: Publisher Site  Google Scholar
 F. Alves, V. Cogo, S. Wandelt, U. Leser, and A. Bessani, “Ondemand indexing for referential compression of DNA sequences,” PLoS ONE, vol. 10, no. 7, Article ID e0132460, 2015. View at: Google Scholar
 V. Mäkinen, G. Navarro, J. Sirffien, and N. Välimki, “Storage and retrieval of individual genomes,” in Research in Computational Molecular Biology: 13th Annual International Conference, RECOMB 2009, Tucson, Ariz, USA, May 18–21, 2009. Proceedings, S. Batzoglou, Ed., vol. 5541 of Lecture Notes in Computer Science, pp. 121–137, Springer, Berlin, Germany, 2009. View at: Publisher Site  Google Scholar
 V. Mäkinen and G. Navarro, “Succinct suffix arrays based on runlength encoding,” Nordic Journal of Computing, vol. 12, no. 1, pp. 40–66, 2005. View at: Google Scholar  MathSciNet
 A. Fariña, N. R. Brisaboa, G. Navarro, F. Claude, Á. S. Places, and E. Rodríguez, “Wordbased selfindexes for natural language text,” ACM Transactions on Information and System Security, vol. 30, no. 1, pp. 1–34, 2012. View at: Publisher Site  Google Scholar
 N. R. Brisaboa, A. Fariña, G. Navarro et al., “Selfindexing natural language,” in String processing and information retrieval, vol. 5280 of Lecture Notes in Comput. Sci., pp. 121–132, Springer, Berlin, 2008. View at: Publisher Site  Google Scholar  MathSciNet
 R. Grossi and J. S. Vitter, “Compressed suffix arrays and suffix trees with applications to text indexing and string matching,” SIAM Journal on Computing, vol. 35, no. 2, pp. 378–407, 2005. View at: Publisher Site  Google Scholar  MathSciNet
 W. K. Hon, T. W. Lam, W. K. Sung, W. L. Tse, C. K. Wong, and S. M. Yiu, “Practical aspects of compressed suffix arrays and FMindex in searching DNA sequences,” in ALENEX/ANALC, pp. 31–38, 2004. View at: Google Scholar
 P. Ferragina, G. Manzini, M. Veli, and G. Navarro, “An alphabetfriendly FMindex,” in String Processing and Information Retrieval, pp. 228–228, Springer, Berlin, Germany, 2004. View at: Google Scholar
 G. Navarro, “Indexing text using the ZivLempel trie,” Journal of Discrete Algorithms, vol. 2, no. 1, pp. 87–114, 2004. View at: Publisher Site  Google Scholar  MathSciNet
 M. Vyverman, B. De Baets, V. Fack, and P. Dawyndt, “Prospects and limitations of fulltext index structures in genome analysis,” Nucleic Acids Research, vol. 40, no. 15, pp. 6993–7015, 2012. View at: Publisher Site  Google Scholar
 R. Chikhi and G. Rizk, “Spaceefficient and exact de Bruijn graph representation based on a Bloom filter,” Algorithms for Molecular Biology, vol. 8, no. 1, p. 22, 2013. View at: Google Scholar
 R. Rozov, R. Shamir, and E. Halperin, “Fast lossless compression via cascading Bloom filters,” BMC Bioinformatics, vol. 15, no. 9, p. S7, 2014. View at: Google Scholar
 H. Stranneheim, M. Käller, T. Allander, B. Andersson, L. Arvestad, and J. Lundeberg, “Classification of DNA sequences using Bloom filters,” Bioinformatics, vol. 26, no. 13, pp. 1595–1600, 2010. View at: Publisher Site  Google Scholar
 B. Solomon and C. Kingsford, “Fast search of thousands of shortread sequencing experiments,” Nature Biotechnology, vol. 34, no. 3, pp. 300–302, 2016. View at: Publisher Site  Google Scholar
 A. Ramachandran, Y. Heo, W.M. Hwu, J. Ma, and D. Chen, “FPGA accelerated DNA error correction,” in Proceedings of the 2015 Design, Automation and Test in Europe Conference and Exhibition, DATE 2015, pp. 1371–1376, France, March 2015. View at: Google Scholar
 S. Deorowicz, FQSqueezer: kMerBased Compression of Sequencing Data, bioRxiv, 2019.
 S. Wandelt and U. Leser, “QGramProjector: QGram projection for indexing highlysimilar strings,” in Advances in Databases and Information Systems, vol. 8133 of Lecture Notes in Computer Science, pp. 260–273, Springer Berlin Heidelberg, Berlin, Germany, 2013. View at: Publisher Site  Google Scholar
 T. Kowalski, S. Grabowski, and S. Deorowicz, “Indexing arbitrarylength kmers in sequencing reads,” PLoS ONE, vol. 10, no. 7, 2015. View at: Google Scholar
 Y. Heo, X.L. Wu, D. Chen, J. Ma, and W.M. Hwu, “BLESS: bloom filterbased error correction solution for highthroughput sequencing reads,” Bioinformatics, vol. 30, no. 10, pp. 1354–1362, 2014. View at: Publisher Site  Google Scholar
 J. Qian, Q. Zhu, and Y. Wang, “Bloom filter based associative deletion,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 8, pp. 1986–1998, 2014. View at: Publisher Site  Google Scholar
 R. Khairy, M. Safar, and M. W. ElKharashi, “Bloom filter acceleration: A high level synthesis approach,” in Proceedings of the 2017 IEEE 30th Canadian Conference on Electrical and Computer Engineering (CCECE), pp. 1–6, Windsor, Canada, April 2017. View at: Publisher Site  Google Scholar
 S. Wandelt and U. Leser, “MRCSI: compressing and searching string collections with multiple references,” in Proceedings of the 3rd Workshop on SpatioTemporal Database Management, STDBM 2006, Colocated with the 32nd International Conference on Very Large Data Bases, VLDB 2006, pp. 461–472, Republic of Korea, September 2006. View at: Google Scholar
 A. Danek, S. Deorowicz, and S. Grabowski, “Indexes of large genome collections on a PC,” PLoS ONE, vol. 9, no. 10, 2014. View at: Google Scholar
 LevelDB, http://leveldb.org/, 2018.
 NCBI, https://ftp.ncbi.nlm.nih.gov/, 2018.
 Seeq, https://github.com/ezorita/seeqfrom.
 S. Ladra, Algorithms and Compressed Data Structures for Information Retrieval, 2011.
Copyright
Copyright © 2019 Maleeha Najam et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.