BioMed Research International

Volume 2016, Article ID 4986707, 10 pages

http://dx.doi.org/10.1155/2016/4986707

## PairMotifChIP: A Fast Algorithm for Discovery of Patterns Conserved in Large ChIP-seq Data Sets

^{1}School of Computer Science and Technology, Xidian University, Xi’an 710071, China

^{2}School of Electronic Engineering, Xidian University, Xi’an 710071, China

Received 22 June 2016; Revised 4 September 2016; Accepted 27 September 2016

Academic Editor: Yudong Cai

Copyright © 2016 Qiang Yu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Identifying conserved patterns in DNA sequences, namely, motif discovery, is an important and challenging computational task. A high-throughput sequencing data set, which contains hundreds of sequences or more, helps improve the identification accuracy of motif discovery but demands higher computing performance. To efficiently identify motifs in large DNA data sets, a new algorithm called PairMotifChIP is proposed; it extracts and combines pairs of $l$-mers in the input that have a relatively small Hamming distance. In particular, a method for rapidly extracting pairs of $l$-mers is designed, which can be used not only for PairMotifChIP but also for other DNA data mining tasks with the same requirement. Experimental results on simulated data show that the proposed algorithm finds motifs successfully and runs faster than state-of-the-art motif discovery algorithms. Furthermore, the validity of the proposed algorithm has been verified on real data.

#### 1. Introduction

A DNA motif is a conserved pattern occurring in the regulatory region of DNA sequences with small mutations [1]. All occurrences of the motif in the sequences are called motif instances or motif sites, which are usually the sequence fragments with specific biological functions such as transcription factor binding sites (TFBSs) [2]. TFBSs are important regulatory elements that control transcription initiation and transcription efficiency of the associated genes. Identifying motifs in a given set of DNA sequences is the basis for analysis of gene expression regulation [3] and the precursor to identifying disease-associated regulatory variations [4].

Though very important, motif discovery is a challenging computational task. Given a set of DNA sequences, (i) the motif and its instances are unknown; (ii) each input DNA sequence is long, with hundreds of bases, while the motif is short, generally 5 to 25 bases [5]; (iii) a portion of the input sequences may not contain motif instances; and (iv) the input sequences typically contain random overrepresented substrings that act as noise. In 2003, Evans et al. proved that motif discovery is NP-complete [6]. In addition, with the development of biological experimental techniques, the data used for motif discovery have shifted from traditional promoter sequence data sets to high-throughput sequencing data sets [7]. A traditional data set typically contains only a few to dozens of sequences. A high-throughput sequencing data set is a set of peak regions containing TFBSs obtained through ChIP-seq experiments [8], read mapping [9], and peak calling [10]. It contains hundreds or more sequences, thus forming a large DNA sequence data set, which further increases the difficulty of identifying motifs rapidly and accurately.

Currently, many motif discovery algorithms exist for small-scale data sets, such as Weeder [11], PairMotif [12], PairMotif+ [13], MEME [14], PMS8 [15], and qPMS9 [16]; for more algorithms, refer to [7, 17]. Because of their high time or space complexity, these algorithms cannot be applied directly to motif discovery in high-throughput sequencing data sets.

This paper mainly focuses on motif discovery algorithms for high-throughput sequencing data sets. According to motif representation, the algorithms can be divided into two categories. The algorithms in the first category represent motifs as words. Some of these algorithms, such as F-motif [18] and weeder2 [19], use pattern-driven ideas. They exhaustively verify all possible strings of the motif length over the DNA alphabet and then output the strings that satisfy the specified motif property. When verifying motifs, F-motif and weeder2 use the suffix tree and De Bruijn graph techniques, respectively. Other algorithms, such as RSAT [20], CisFinder [21], and MCES [22], adopt word counting ideas; namely, they mine the substrings that occur with high frequency in the input sequences and then combine them into motifs. Besides a test set, these algorithms often require a control set to eliminate the noise from random overrepresented substrings.
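The word counting idea with a control set can be illustrated by a minimal sketch. It is a generic toy, not the procedure of RSAT, CisFinder, or MCES: the function names, the per-sequence counting, the pseudocount, and the `min_ratio` threshold are all assumptions made for illustration.

```python
from collections import Counter


def count_words(sequences, k):
    """Count, for each word of length k, the number of sequences containing it."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each word is counted at most once per sequence.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def enriched_words(test, control, k, min_ratio=2.0, pseudocount=1.0):
    """Return (word, ratio) for words more frequent in test than in control.

    The smoothed frequency ratio is an illustrative score (assumption),
    not a statistic taken from any published tool.
    """
    test_counts = count_words(test, k)
    control_counts = count_words(control, k)
    result = []
    for word, n in test_counts.items():
        test_freq = (n + pseudocount) / (len(test) + pseudocount)
        ctrl_freq = (control_counts[word] + pseudocount) / (len(control) + pseudocount)
        ratio = test_freq / ctrl_freq
        if ratio >= min_ratio:
            result.append((word, ratio))
    # Highest ratio first; ties broken alphabetically for determinism.
    return sorted(result, key=lambda item: (-item[1], item[0]))


# Toy data: "TACG" occurs in every test sequence and in no control sequence.
test_set = ["AATACGTT", "GGTACGAA", "CTACGCCC"]
control_set = ["AAAAAAAA", "GGGGCCCC", "TTTTAAGG"]
top_words = enriched_words(test_set, control_set, k=4, min_ratio=3.0)
print(top_words)
```

In this toy, words that appear in only one test sequence fall below the threshold, which mirrors how a control set helps suppress randomly overrepresented substrings.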

The second category covers the discovery algorithms that represent motifs as position weight matrices (PWMs). A set of aligned sites of the same length in the input sequences can form a PWM. These algorithms often select some initial PWMs by some means and then update each PWM iteratively until it reaches the maximum score. MEME-ChIP [23] is a well-known motif discovery algorithm for ChIP-seq data sets, which updates initial PWMs using the expectation maximization method. STEME [24], another discovery algorithm based on expectation maximization, uses suffix trees to speed up the expectation maximization step. Currently, no discovery algorithm is completely superior to the others; thus, to tackle the false positives produced by individual discovery algorithms, ensemble algorithms [25] integrate multiple existing discovery algorithms to improve the quality of the identified motifs.
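As a small illustration of how aligned sites form a matrix, the sketch below builds per-position base probabilities from a handful of made-up sites. It stores smoothed frequencies rather than log-odds weights, and the pseudocount, the function names, and the example sites are assumptions for illustration; it does not reproduce the scoring used by MEME-ChIP or STEME.

```python
from collections import Counter

ALPHABET = "ACGT"


def build_pwm(sites, pseudocount=1.0):
    """Return one {base: probability} dict per aligned position."""
    if not sites:
        raise ValueError("need at least one site")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        pwm.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return pwm


def consensus(pwm):
    """Pick the most probable base at each position."""
    return "".join(max(ALPHABET, key=lambda b: column[b]) for column in pwm)


# Hypothetical aligned sites of length 6.
example_sites = ["TACGAT", "TACGTT", "TATGAT", "AACGAT"]
example_pwm = build_pwm(example_sites)
print(consensus(example_pwm))  # prints TACGAT
```

An iterative method would start from a matrix like this, re-score candidate sites against it, and rebuild the matrix from the best-scoring sites until the score stops improving.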

To efficiently identify motifs in large DNA data sets, we propose a new algorithm that identifies motifs by extracting and combining pairs of $l$-mers in the input that have a relatively small Hamming distance. Comparisons with state-of-the-art motif discovery algorithms show that the proposed algorithm finds motifs successfully and has the shortest running time. Also, the validity of the proposed algorithm has been verified on real data.

#### 2. Materials and Methods

##### 2.1. Algorithm Overview

The notations frequently used in this paper are summarized in the Notations section. A pair of $l$-mers refers to two $l$-mers that come from two distinct sequences.

Almost all de novo motif discovery algorithms make identification based on the fact that the motif instances of the same motif are similar to each other. In other words, the motif information contained in the input sequences is presented by the similarity among motif instances. In addition to the degree of similarity among motif instances, the motif information also depends on the number of pairs of motif instances contained in the input sequences, denoted by . It is calculated by
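To give a feel for how this quantity grows with data set size, the sketch below counts cross-sequence pairs under a simplifying assumption that is ours and not necessarily the paper's expression: a given fraction of the input sequences each carry exactly one motif instance, so any two such sequences contribute one pair. The names `motif_instance_pairs`, `num_sequences`, and `fraction_with_site` are hypothetical.

```python
from math import comb


def motif_instance_pairs(num_sequences, fraction_with_site=1.0):
    """Number of pairs of motif instances from distinct sequences.

    Assumption: each sequence that carries the motif holds exactly one instance.
    """
    carriers = round(num_sequences * fraction_with_site)
    return comb(carriers, 2)


# A promoter-sized set versus ChIP-seq-sized sets.
for size in (20, 500, 5000):
    print(size, motif_instance_pairs(size))
# prints:
# 20 190
# 500 124750
# 5000 12497500
```

Under this assumption the count grows quadratically with the number of sequences, which is why a large data set carries far more pairwise motif information than a traditional promoter data set.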

In our previous work, PairMotif [12] and PairMotif+ [13], we mainly processed promoter sequences, which contain only a small number of pairs of motif instances. The basic idea is to extract some pairs of $l$-mers in the input such that the extracted pairs include at least one pair of motif instances and then to refine each pair of $l$-mers to get motifs. Because the number of pairs of motif instances is small, only limited motif information can be obtained, while a large amount of noise is retained. Thus, to ensure good identification accuracy, exhaustive methods based on pattern-driven ideas are used for refinement, which results in poor time performance.

In the current work, we propose a new algorithm called PairMotifChIP for processing large DNA data sets. Our basic idea is still to extract pairs of $l$-mers in the input. Since the number of pairs of motif instances in large data sets is significantly greater than that in traditional promoter data sets, there are two advantages: (i) the extracted pairs of $l$-mers contain sufficient pairs of motif instances and (ii) it is easier to filter out most of the random overrepresented pairs of $l$-mers; namely, we can distinguish most of the pairs of motif instances from random overrepresented pairs of $l$-mers by probabilistic analysis (see Section 3.1). Therefore, after extracting pairs of $l$-mers, we filter out most of the random overrepresented pairs and then combine the remaining $l$-mers using clustering methods to obtain motifs while eliminating the other random overrepresented $l$-mers.
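The extract, filter, and combine idea can be sketched on toy data as follows. This is only an illustration under stated assumptions, not the PairMotifChIP implementation: extraction is done by brute force instead of the rapid method the paper designs, a simple support count (`min_support`) stands in for the probabilistic analysis, and connected components stand in for the clustering method. All function names, parameters, and sequences are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def extract_pairs(seqs, mer_len, max_dist):
    """Step 1 (brute force): close substring pairs from distinct sequences."""
    mers = [{s[i:i + mer_len] for i in range(len(s) - mer_len + 1)} for s in seqs]
    pairs = []
    for i, j in combinations(range(len(seqs)), 2):
        for x in mers[i]:
            for y in mers[j]:
                if hamming(x, y) <= max_dist:
                    pairs.append((x, y))
    return pairs


def filter_pairs(pairs, min_support):
    """Step 2 (assumed stand-in): keep pairs whose members recur in many pairs."""
    support = Counter(mer for pair in pairs for mer in pair)
    return [(x, y) for x, y in pairs
            if support[x] >= min_support and support[y] >= min_support]


def combine(pairs):
    """Step 3 (assumed stand-in): connected components plus a consensus string."""
    graph = defaultdict(set)
    for x, y in pairs:
        graph[x].add(y)
        graph[y].add(x)
    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            members.append(node)
            stack.extend(graph[node] - seen)
        members.sort()
        cons = ""
        for column in zip(*members):
            counts = Counter(column)
            # Ties are broken alphabetically so the result is deterministic.
            cons += max(sorted(counts), key=counts.get)
        clusters.append((cons, members))
    return clusters


# Toy input: each sequence hides a variant of TACGAT between distinct flanks.
toy_seqs = ["GGTACGATGG", "CCTACGTTCC", "AATACGATAA", "TTTAGGATTT"]
raw_pairs = extract_pairs(toy_seqs, mer_len=6, max_dist=1)
kept_pairs = filter_pairs(raw_pairs, min_support=2)
motifs = combine(kept_pairs)
print(motifs)
```

In this toy run, the two pairs formed by shifted windows around identical sites are supported only once and are dropped by the filter, while the planted variants remain connected and yield a single consensus.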

The overall algorithm of PairMotifChIP is shown in Algorithm 1. It contains three steps: extracting pairs of $l$-mers (lines (2)–(4)), filtering pairs of $l$-mers (lines (5)–(9)), and combining $l$-mers (lines (10)–(13)). The technical details of the three steps are described next.