Advances in Bioinformatics

Volume 2016, Article ID 3528406, 7 pages

http://dx.doi.org/10.1155/2016/3528406

## An Optimal Seed Based Compression Algorithm for DNA Sequences

^{1}Department of Information Science and Engineering, Rajiv Gandhi Institute of Technology, Bangalore 560032, India

^{2}Department of Computer Science and Engineering, National Institute of Technology Calicut, Kerala 673601, India

Received 28 November 2015; Revised 9 May 2016; Accepted 19 June 2016

Academic Editor: Frank M. You

Copyright © 2016 Pamela Vinitha Eric et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

This paper proposes a seed based lossless compression algorithm for DNA sequences that uses a substitution method similar to the Lempel-Ziv compression scheme. The proposed method exploits the repetition structures inherent in DNA sequences by creating an offline dictionary that contains all such repeats along with the details of their mismatches. By ensuring that only promising mismatches are allowed, the method achieves a compression ratio that is on par with or better than that of existing lossless DNA sequence compression algorithms.

#### 1. Introduction

The amount of DNA being sequenced is increasing exponentially, leading to problems in storage, comprehension, and transmission. The cost of storage has fallen dramatically in the past few years, but the exponential growth in sequenced DNA leads to a tremendous increase in the amount of data that must be stored online, making storage one of the biggest cost elements. Another challenge is how to make sense of this huge mass of data. With whole genomes, we now have to deal with millions or billions of base pairs. When we have a database of such genomes, as is typically the case, the problem is compounded. Thus, new and more effective techniques are needed for the compression of biological sequence data, particularly DNA sequences.

DNA sequences are expected to be nonrandom, and hence it is possible to remove redundancy, resulting in compression. It is estimated that more than 50% of the human genome is repeat DNA [1]. Compression addresses the storage problem and can also improve the understanding of these sequences. Chen et al. [2] showed that compressibility is a good measure of relatedness between sequences and can be used effectively in sequence alignment and evolutionary tree construction. According to Allison et al. [3], compression of DNA sequences also supports the intelligent analysis of these sequences. Compression also plays an important role in efficient sequence classification [4].

DNA sequences consist of four nucleotide bases, A (adenine), C (cytosine), G (guanine), and T (thymine), and two bits are sufficient to represent each of these bases. Moreover, the repeats found in DNA sequences are not always exact; they can be of different types such as approximate, reverse, complementary, reverse complementary, and tandem. These repeats also tend to be long and infrequent. Traditional text compression algorithms are only effective in capturing short and frequent repeats; hence using them to compress DNA sequences often expands the sequences instead. Therefore, finding all the different types of repeats in a DNA sequence and encoding them so as to achieve a good compression ratio is a challenging task.
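To make the two-bit representation and the repeat types concrete, the following is a minimal sketch in Python. The particular code assignment (A=0, C=1, G=2, T=3) and the function names are assumptions chosen for illustration; the paper does not specify them.

```python
# Illustrative only: this 2-bit code assignment is an assumption,
# not one specified by the paper.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def pack(seq: str) -> bytes:
    """Pack a DNA string into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from packed data."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:n])


def reverse(seq: str) -> str:
    """Reverse repeat: the same bases read backwards."""
    return seq[::-1]


def complement(seq: str) -> str:
    """Complementary repeat: A<->T and C<->G, position by position."""
    return "".join(COMPLEMENT[b] for b in seq)


def reverse_complement(seq: str) -> str:
    """Reverse complementary repeat: complement, then read backwards."""
    return complement(seq)[::-1]
```

The sequence length has to be stored alongside the packed bytes, because a final byte may carry fewer than four bases.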

This paper proposes a seed based algorithm that uses a substitution method in line with the Lempel-Ziv [5, 6] compression scheme to compress DNA sequences. The proposed algorithm captures the various types of repeats, namely exact, tandem, approximate, reverse, complemented, and reverse complemented, and stores them in an offline dictionary. These repeats are then removed from the original sequence to form the final parsed sequence. The offline dictionary together with the final parsed sequence forms the compressed sequence. Mismatches that give a good compression gain are tolerated and recorded along with the repeat substrings in the offline dictionary.

This paper is organized as follows. Section 2 reviews the various DNA compression algorithms. Section 3 describes the proposed method and Section 4 analyzes the results obtained. Section 5 concludes the paper.

#### 2. Related Work

Compression of biological sequences can be either horizontal or vertical, as proposed by Grumbach and Tahi [7, 8]. Horizontal mode compresses a biological sequence by making use of information contained within it, such as references to its own substrings, whereas vertical mode takes a set of biological sequences and compresses each sequence based on information derived from this set. Horizontal mode is of interest for the reduction of storage and transmission costs [9] and uses compression techniques such as substitution, statistical coding, or a combination of the two [10]. Statistical compression uses a statistical model of the data to assign variable-sized codes, and the quality of compression obtained depends on the data model [11]. A substitution or dictionary based method selects several strings of symbols that occur frequently and encodes each string as a token, which is a pointer to the string in a dictionary. The dictionary itself can be static or dynamic. Compression algorithms based on the LZ method use an online dictionary, whereas in methods using an offline dictionary compression occurs in two passes: the first pass identifies all repeats and stores them in a dictionary, and the second pass encodes these repeats as pointers to the dictionary [5, 6]. A third category of compression is the hybrid technique, which combines substitution and statistical techniques to compress data.
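The two-pass structure of an offline dictionary method can be sketched as follows. This is a deliberately simplified illustration, not any of the cited algorithms: it assumes fixed-length exact repeats, and the names `build_dictionary`, `encode`, and `decode` as well as the token format are hypothetical.

```python
from collections import Counter


def build_dictionary(text, k, min_count=2):
    """Pass 1: collect every length-k substring seen at least min_count times."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    return sorted(s for s, c in counts.items() if c >= min_count)


def encode(text, dictionary, k):
    """Pass 2: greedy left-to-right parse into literals and dictionary pointers."""
    index = {s: j for j, s in enumerate(dictionary)}
    tokens, i = [], 0
    while i < len(text):
        piece = text[i:i + k]
        if piece in index:
            tokens.append(("ptr", index[piece]))  # pointer into the dictionary
            i += k
        else:
            tokens.append(("lit", text[i]))       # unmatched symbol, kept as is
            i += 1
    return tokens


def decode(tokens, dictionary):
    """Expand pointers back into dictionary strings."""
    return "".join(dictionary[v] if kind == "ptr" else v for kind, v in tokens)
```

A short usage example under these assumptions:

```python
text = "ACGTACGTTTACGT"
d = build_dictionary(text, 4)
assert decode(encode(text, d, 4), d) == text
```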

Most of the compression methods available for biological sequences, such as [2, 7, 8, 12], use substitution methods. The earliest special purpose DNA compression algorithm found in the literature is Biocompress, developed by Grumbach and Tahi [7, 8]. They proposed Biocompress and Biocompress 2, which detect repeats of substrings that occurred earlier in the sequence and encode them as the length of the repeat and the position of its previous occurrence. They also employ order 2 arithmetic coding to encode nonrepeat regions. Chen et al. [12] developed DNACompress, which uses the software utility Pattern Hunter [13] to identify significant approximate repeat regions and then encodes these regions by a pointer to their earlier occurrence. The nonrepeat regions are also encoded using arithmetic coding. The offline approach by Apostolico and Lonardi [14] iteratively selects the repeated substrings whose encoding would gain maximum compression. A similar substitution approach is used in GenCompress by Chen et al. [2], which concentrates on finding an optimal prefix that can be encoded economically and exploits approximate repeats. Adjeroh et al. [15, 16] create an offline dictionary of short repeats and code all occurrences of a given repeat with reference to the position of that repeat in the dictionary. Cfact, developed by Rivals et al. [17], constructs a suffix tree in the first pass and uses this data structure to search for the longest exact matching repeat in the second pass.

A few methods, such as XM, CDNA, and ARM, employ statistical techniques. The expert model (XM) proposed by Cao et al. [18] uses an order 2 Markov expert and a copy expert to predict the probability of occurrence of a symbol. It also employs adaptive coding for correct or incorrect predictions. The CDNA algorithm by Loewenstern and Yianilos [19] is a purely statistical algorithm, where the probability distribution of each symbol is obtained from approximate partial matches in the history. Each approximate match is with a previous subsequence having a small Hamming distance to the context preceding the symbol to be encoded. The ARM algorithm by Allison et al. [3] is also a purely statistical algorithm; it forms the probability of a subsequence by summing the probabilities over all explanations of how the subsequence is generated.
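As a rough illustration of what an order 2 statistical model does, the sketch below estimates the ideal code length of a sequence when each symbol is predicted from the two symbols before it. It is a plain adaptive context model with an add-one prior, which is an assumption made here for simplicity; it is not XM, CDNA, or ARM, and the function name is hypothetical.

```python
import math
from collections import defaultdict


def order2_code_length(seq, alphabet="ACGT"):
    """Ideal code length in bits under an adaptive order 2 context model."""
    # Each context starts with a count of 1 per symbol (add-one prior).
    counts = defaultdict(lambda: dict.fromkeys(alphabet, 1))
    bits = 0.0
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - 2):i]          # up to two preceding symbols
        table = counts[ctx]
        p = table[sym] / sum(table.values())
        bits += -math.log2(p)               # cost an arithmetic coder approaches
        table[sym] += 1                     # update the model after coding
    return bits
```

On a highly repetitive input such as `"ACGT" * 50` the estimate falls well below the two bits per base of the trivial encoding, because each context quickly learns which symbol follows it.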

A method that employs a hybrid technique was introduced by Korodi and Tabus [20, 21], where encoding is done using a simple normalized maximum likelihood model for discrete regression, through reference to preceding approximate matching blocks, which are encoded by first-order context coding. In its improvement, GeNML by Korodi and Tabus [20, 21], the DNA sequence is split into fixed size blocks. The bit mask is encoded using a probability distribution estimated by the normalized maximum likelihood of similarity between the regressor and the block. Matsumoto et al. [22] use a combination of LZ [5, 6] and CTW [23]. They first identify approximate repeat regions using hashing and dynamic programming and then replace these repeat regions with an offset and a length. Edit operations are encoded using arithmetic coding and nonrepeat areas by order 32 context tree weighting.

#### 3. Optimal Seed Based Compression Algorithm for DNA Sequences

The proposed method consists of a seed based algorithm that identifies potentially good matches. The matching substrings so identified are later extended in both directions, that is, to the left and to the right.

Let be the DNA sequence to be compressed and the length of the DNA sequence. represents the th symbol of the given DNA sequence, where and is a substring of of length where . The seed is also a substring of of length . The initial seed is and the first substring to be matched is . The values of are incremented until a repeat substring is identified such that . If no such matching substring is encountered, are incremented and the search is continued until for some , where ranges from to and from to . Now the length of the match is initialized to and , the number of mismatches, is initialized to 0.
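The seed search can be sketched as below. Because the exact index ranges are not recoverable from the text above, this is only one plausible reading: the seed is a window of length `k`, it is matched exactly against non-overlapping positions to its right, and the seed advances by one position when no match is found. The function name, the parameter `k`, and the 0-based indexing are assumptions.

```python
def find_seed_repeat(seq, k):
    """Return (seed_pos, repeat_pos) for the first seed that has a later
    exact, non-overlapping copy of length k, or None if there is none."""
    for seed_pos in range(len(seq) - 2 * k + 1):
        seed = seq[seed_pos:seed_pos + k]
        # Search only to the right of the seed so the two copies do not overlap.
        repeat_pos = seq.find(seed, seed_pos + k)
        if repeat_pos != -1:
            return seed_pos, repeat_pos
    return None
```

Under this reading the match length would start at `k` and the mismatch count at 0, matching the initialization described above.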

The repeat substring and the seed are extended and compared. The extension is done first to the left and then to the right. The length of the match is incremented for each symbol matched. If a mismatch occurs while extending the repeat substring and the seed, the decision to permit it is based on the total number of mismatches so far and on whether permitting this mismatch would result in a compression gain. If the number of mismatches is greater than the threshold, repeat extension in the direction in which the mismatch occurred is temporarily terminated. Extension to the left is also stopped whenever there is an overlap between the extended seed and the extended repeat substring.
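A minimal sketch of the extension step follows. It makes several simplifying assumptions that go beyond the description above: only substitutions are tolerated, a single budget `max_mismatches` stands in for the threshold, the compression gain test is omitted, overlap is also checked during right extension, and a match is never allowed to end on a mismatch. All names are hypothetical.

```python
def extend_match(seq, seed_pos, repeat_pos, k, max_mismatches):
    """Extend a length-k exact match left, then right, allowing substitutions.

    Returns (seed_start, repeat_start, length, offsets), where offsets lists
    the positions, relative to the two starts, at which the copies differ.
    """
    if seed_pos + k > repeat_pos:
        raise ValueError("seed and repeat must not overlap")
    mismatches = []  # offsets relative to seed_pos; negative means left of it

    # Left extension: stop at the sequence start or when the extended repeat
    # would run into the seed.
    left = 0
    while seed_pos - left - 1 >= 0 and repeat_pos - left - 1 >= seed_pos + k:
        if seq[seed_pos - left - 1] != seq[repeat_pos - left - 1]:
            if len(mismatches) >= max_mismatches:
                break
            mismatches.append(-(left + 1))
        left += 1
    while mismatches and mismatches[-1] == -left:  # trim a mismatched edge
        mismatches.pop()
        left -= 1

    # Right extension: stop at the sequence end or when the extended seed
    # would run into the extended repeat.
    n_left = len(mismatches)
    right = 0
    while (repeat_pos + k + right < len(seq)
           and seed_pos + k + right < repeat_pos - left):
        if seq[seed_pos + k + right] != seq[repeat_pos + k + right]:
            if len(mismatches) >= max_mismatches:
                break
            mismatches.append(k + right)
        right += 1
    while len(mismatches) > n_left and mismatches[-1] == k + right - 1:
        mismatches.pop()
        right -= 1

    offsets = sorted(m + left for m in mismatches)
    return seed_pos - left, repeat_pos - left, left + k + right, offsets
```

For example, in `"GATTACAGCCCGATAACAG"` the seed `ACA` at position 4 and its copy at position 15 extend, with one mismatch allowed, to the two 8-symbol substrings starting at positions 0 and 11, which differ only at offset 3.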

Assume that the substrings and are the extended repeat and the extended seed so obtained. An offline dictionary, as shown in Table 1, stores the extended seed , position of occurrence of repeat , length of the repeat , type of repeat, and the details of mismatches that have occurred if any. is then removed from and the remaining symbols of are concatenated to form the next sequence . The process is repeated on sequence until all approximate repeats of are identified and stored in the offline dictionary. This offline dictionary is similar to the one created by Adjeroh et al. [15, 16, 25].
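The dictionary entry and the removal step can be sketched as follows. The field names mirror the items listed above (extended seed, repeat position, repeat length, repeat type, mismatch details), but the concrete layout, the assumption that it is the repeat copy that is cut out of the sequence, and the helper names are illustrative guesses rather than the paper's format. Only direct repeats with substitutions are handled.

```python
from dataclasses import dataclass, field


@dataclass
class DictionaryEntry:
    seed: str                  # text of the extended seed
    repeat_pos: int            # start of the repeat in the current sequence
    length: int                # length of the repeat
    repeat_type: str = "direct"
    mismatches: list = field(default_factory=list)  # (offset, symbol in repeat)


def remove_repeat(seq, seed_start, repeat_start, length, mismatch_offsets):
    """Record one repeat in a dictionary entry and cut it out of the sequence."""
    entry = DictionaryEntry(
        seed=seq[seed_start:seed_start + length],
        repeat_pos=repeat_start,
        length=length,
        mismatches=[(o, seq[repeat_start + o]) for o in mismatch_offsets],
    )
    # Concatenate the symbols on either side of the repeat.
    parsed = seq[:repeat_start] + seq[repeat_start + length:]
    return entry, parsed


def restore_repeat(parsed, entry):
    """Inverse step: rebuild the repeat from the seed and reinsert it."""
    repeat = list(entry.seed)
    for offset, symbol in entry.mismatches:
        repeat[offset] = symbol  # reapply each recorded substitution
    return (parsed[:entry.repeat_pos] + "".join(repeat)
            + parsed[entry.repeat_pos:])
```

When several repeats are removed in turn, each position refers to the sequence as it stood at that step, so decoding would have to replay the entries in reverse order.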