BioMed Research International

Volume 2015, Article ID 218068, 10 pages

http://dx.doi.org/10.1155/2015/218068

## A Fast Cluster Motif Finding Algorithm for ChIP-Seq Data Sets

Department of Automation, School of Electronics and Control Engineering, Chang’An University, Xi’an 710064, China

Received 8 April 2015; Accepted 4 June 2015

Academic Editor: Andre Van Wijnen

Copyright © 2015 Yipu Zhang and Ping Wang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

ChIP-seq, a new high-throughput technique that couples chromatin immunoprecipitation with high-throughput sequencing, has extended the identification of transcription factor binding locations to genome-wide regions. However, most existing motif discovery algorithms are time-consuming and of limited use for identifying binding motifs in ChIP-seq data, which are typically large in scale. To improve efficiency, we propose a fast cluster motif finding algorithm, named FCmotif, to identify motifs in large-scale ChIP-seq data sets. It is inspired by the emerging substrings mining strategy: it first finds the enriched substrings and then searches their neighborhood instances to construct PWMs and cluster motifs of different lengths. FCmotif is not bound by the OOPS model constraint and can find long motifs. The effectiveness of the proposed algorithm has been demonstrated by experiments on ChIP-seq data sets from mouse ES cells. Detection of the real binding motifs and processing of the full-size data of several megabytes finished in a few minutes. The experimental results show that FCmotif is well suited to motif finding in ChIP-seq data and that it performs better than other widely used algorithms such as MEME, Weeder, ChIPMunk, and DREME.

#### 1. Introduction

A Transcription Factor (TF) binds to specific DNA sequences, which carry information about transcription and gene expression. Hence, locating Transcription Factor Binding Sites (TFBSs) is crucial for uncovering the underlying relationships of transcriptional regulation and for understanding the evolutionary mechanisms of living organisms. The identification of TFBSs, or so-called motif discovery, is the problem of discovering short similar nucleotide segments with a common biological function. Planted motif discovery is the best-known formulation of motif discovery [1] and can be stated as follows: given a set of $n$-length DNA sequences over the alphabet $\Sigma = \{A, C, G, T\}$ and two nonnegative integers $l$ and $d$, where $l$ is the length of a motif and $d$ is the maximum number of mutations between the motif and a predicted binding site, the task is to find an $l$-length motif $m$ that occurs in most of the sequences with up to $d$ mutations. $m$ is called an $(l, d)$ motif, and each occurrence of $m$ is called a motif instance. Various motif discovery algorithms have been developed to locate motifs in promoter sequences from coregulated or homologous genes, based on either a consensus or a Position Weight Matrix (PWM) [2].
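The planted motif formulation above can be made concrete with a short sketch. The Python code below is only an illustration of the definition, not part of FCmotif: the function names (`hamming`, `find_instances`, `support`) are hypothetical, and "mutation" is assumed to mean a substitution, so that instances are compared by Hamming distance.

```python
from __future__ import annotations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_instances(motif: str, sequence: str, d: int) -> list[tuple[int, str]]:
    """Return (offset, substring) for every l-mer within d mutations of motif."""
    l = len(motif)
    return [
        (i, sequence[i:i + l])
        for i in range(len(sequence) - l + 1)
        if hamming(motif, sequence[i:i + l]) <= d
    ]


def support(motif: str, sequences: list[str], d: int) -> int:
    """Count the sequences that contain at least one instance of motif."""
    return sum(1 for seq in sequences if find_instances(motif, seq, d))


# Toy data (made up for illustration): a (4, 1) motif "ACGT".
sequences = ["TTACGTAA", "GGACCTGG", "CCCCCCCC"]
print(find_instances("ACGT", sequences[1], d=1))
print(support("ACGT", sequences, d=1))
```

A candidate is a plausible $(l, d)$ motif when its support covers most of the input sequences; exhaustive checking of all $4^l$ candidates in this way is what becomes too expensive on large data sets.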

In recent years, the high-throughput technique ChIP-seq [3, 4], which couples chromatin immunoprecipitation [5] with high-throughput sequencing, has extended the identification of binding locations of a given TF to genome-wide regions. A genome-wide ChIP experiment generally produces thousands of sequences of a few hundred bps each (ChIP-seq peaks), which yields a data set one or two orders of magnitude larger than a typical motif discovery data set, with sequences at high resolution. The newer technique ChIP-exo can locate binding sites at an even higher resolution, but the binding regions identified by ChIP-seq or ChIP-exo experiments may still be dozens of bps away from the true binding sites [6]. Computational motif discovery methods are therefore still needed to identify the binding locations of a TF in ChIP-seq or ChIP-exo data sets [7] with high accuracy.

To detect motifs in large-scale ChIP-seq data, ChIP-tailored versions of some traditional motif discovery algorithms have been proposed, such as MDscan [8] and MEME-ChIP [9]. These algorithms normally find motifs by using a limited subset of the sequences while ignoring the remaining unselected sequences, which decreases the chance of discovering motifs related to infrequent cofactors. Meanwhile, PWM-based methods have also been developed. For instance, STEME [10] applies suffix trees to accelerate the EM steps. This strategy works well for short motifs, but it runs much more slowly as the motif width increases on large data sets. HMS [11] is an improved version of Gibbs sampling that combines sampling algorithms with greedy search steps. ChIPMunk [12] combines the EM algorithm with a greedy approach and applies a more complex statistical model. These algorithms aim to optimize a PWM over the ChIP-enriched regions. They still suffer from the unsolved problem of local optima, and their iterative training is computationally expensive. Additionally, consensus-based algorithms built on word enumeration, such as RAST [13] and CisFinder [14], can process a whole ChIP-seq data set by contrasting two data sets. However, both RAST and CisFinder are limited to finding short motifs and may miss useful information contained in the sequences.

To overcome these shortcomings, in this paper we propose a fast cluster motif finding algorithm, named FCmotif, to solve the motif identification problem in large-scale ChIP data sets. FCmotif first uses the emerging substrings mining strategy to find the enriched substrings and takes each emerging substring as a reference core to construct a PWM. The algorithm then uses the constructed PWMs to cluster motifs of different lengths, and it accounts for intramotif dependency in the statistical model when calculating the information content (IC) and false discovery rate (FDR) used to optimize the outputs. FCmotif processes the whole data set and is not limited by the OOPS (one occurrence of the motif instance per sequence) constraint. The experimental results show that FCmotif is well suited to motif finding in ChIP-seq data and that it performs better than other widely used algorithms such as MEME, Weeder, ChIPMunk, and DREME.

#### 2. Materials and Methods

A ChIP-seq data set is characterized by a large number of relatively short sequences; both the amount and the quality of ChIP-seq data have increased dramatically. Each sequence in a ChIP-seq data set contains less background information, and several instances of a motif can be expected to occur across the thousands of sequences. From this point of view, our main objective is to handle the whole data set and to distinguish the motif instances from the relatively "cleaner" background sequence.

##### 2.1. Motif Representation

Generally, a motif can be represented by a PWM $\Theta$, of which each column stores the occurrence frequencies of the four types of nucleotides (A, C, G, T). Let $\Theta = (\theta_1, \theta_2, \ldots, \theta_l)$, where $\theta_j = (\theta_{jA}, \theta_{jC}, \theta_{jG}, \theta_{jT})$ represents the probability of nucleotide preference at the $j$th position of the motif, and let $\theta_0 = (\theta_{0A}, \theta_{0C}, \theta_{0G}, \theta_{0T})$ be the probability of observing each nucleotide at the nonmotif positions in the sequences. For each substring of length $l$ (also called an $l$-mer) $s = s_1 s_2 \cdots s_l$, the log-likelihood of letter $s_j$ at position $j$ is given by

$$\log \frac{\theta_{j s_j}}{\theta_{0 s_j}}, \tag{1}$$

where $\theta_{j s_j}$ is the probability of observing letter $s_j$ at position $j$ and $\theta_{0 s_j}$ is the background probability of letter $s_j$. This classical product-multinomial model proposed by Liu et al. [15] has been widely used in *de novo* statistical algorithms such as EM and Gibbs sampling. It assumes that the positions within the motif are independent of each other [16]. However, recent research implies that the commonly used product-multinomial model may be too simplistic for identifying binding motifs, because some positions of a TF binding motif exert an interdependent effect on the binding affinity of the TF [17–19].
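As an illustration of the product-multinomial model described above, the following Python sketch estimates a PWM from aligned instances and scores an $l$-mer by summing the per-position log-likelihood terms. It is a minimal sketch under stated assumptions: the function names (`build_pwm`, `log_likelihood_ratio`), the pseudocount of 1, and the uniform background are illustrative choices, not values taken from the paper.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_pwm(instances, pseudocount=1.0):
    """Estimate one probability column per motif position from aligned instances.

    The pseudocount (an assumption of this sketch) keeps every probability
    strictly positive so that the logarithm below is always defined.
    """
    length = len(instances[0])
    total = len(instances) + pseudocount * len(ALPHABET)
    pwm = []
    for j in range(length):
        counts = Counter(inst[j] for inst in instances)
        pwm.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return pwm


def log_likelihood_ratio(kmer, pwm, background):
    """Sum of log(theta_j[s_j] / theta_0[s_j]) over positions (independent model)."""
    return sum(math.log(pwm[j][b] / background[b]) for j, b in enumerate(kmer))


# Toy aligned instances and a uniform background (both made up for illustration).
instances = ["ACGT", "ACGT", "ACGA"]
background = {b: 0.25 for b in ALPHABET}
pwm = build_pwm(instances)
print(log_likelihood_ratio("ACGT", pwm, background))
print(log_likelihood_ratio("TTTT", pwm, background))
```

Because the score is a plain sum over positions, this model cannot capture interactions between positions, which is the limitation the dependency model is meant to address.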

To provide a better-fitting model and increase the quality of motifs identified from ChIP-seq data, a more sophisticated model that accounts for intramotif dependency should be considered. Here, "intramotif dependency" means that the frequency of nucleotide combinations spanning several positions deviates from the frequency expected under the independent motif distribution [11]. For instance, if the frequency of the two nucleotides "GT" in a pair of positions is much higher or lower than the product of the frequency of "G" in the first position and the frequency of "T" in the second position, we infer that these two positions are dependent. Here, we implement a 16-component dependent multinomial model that scans each pair of positions within the motif to determine the intramotif dependency. Let $\theta_{jk, ab}$ represent the probability of observing the nucleotide pair $(a, b)$ at the $j$th and $k$th positions of the motif. For each pair of positions, there are $4^2 - 1$ free parameters of the dependent multinomial distribution to be estimated. The log-likelihood of letters $s_j$, $s_k$ at positions $j$ and $k$ is

$$\log \frac{\theta_{jk, s_j s_k}}{\theta_{0, s_j s_k}}, \tag{2}$$

where $\theta_{0, s_j s_k}$ represents the background probability of the nucleotide pair. The Log-Likelihood Ratio (LLR) of the $l$-mer $s$ is then given by formulas (3)–(5). Here, formula (4) represents the joint probability of the independent nucleotides in the motif, and formula (5) represents the joint probability of the nucleotide pairs in the motif. Formula (3) is the LLR of $s$ under the corresponding background distribution $\theta_0$.

For the background (nonmotif) regions, we employ a high-order Markov model to capture the weak dependency in background DNA sequences. Compared with a uniform or random background distribution, a high-order Markov model can improve the sensitivity and specificity of motif identification. In this study, we use a third-order Markov model to characterize the background sequence. As an example, the probability of an $l$-mer $s$ ($s = s_1 s_2 \cdots s_l$) in the background under a third-order Markov model can be represented by

$$P(s \mid \theta_0) = P(s_1 s_2 s_3) \prod_{i=4}^{l} P(s_i \mid s_{i-3} s_{i-2} s_{i-1}). \tag{6}$$

The Information Content (IC) of the motif is then derived from these quantities.
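The third-order Markov background described above can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's implementation: the helper names (`train_markov`, `background_log_prob`), the pseudocount, and the choice to score the first three bases uniformly are all assumptions of the sketch.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"
ORDER = 3  # third-order model: each base is conditioned on the previous three


def train_markov(sequences, order=ORDER, pseudocount=1.0):
    """Count (context -> next base) transitions and return a smoothed P(base | context)."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def cond_prob(context, base):
        row = counts.get(context, {})
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return (row.get(base, 0.0) + pseudocount) / total

    return cond_prob


def background_log_prob(kmer, cond_prob, order=ORDER):
    """Log-probability of kmer under the background model.

    Assumption of this sketch: the first `order` bases are scored uniformly
    (1/4 each) instead of using an estimated starting distribution.
    """
    logp = min(order, len(kmer)) * math.log(1.0 / len(ALPHABET))
    for i in range(order, len(kmer)):
        logp += math.log(cond_prob(kmer[i - order:i], kmer[i]))
    return logp


# Toy background sequence (made up for illustration).
cond_prob = train_markov(["ACGTACGTACGT"])
print(cond_prob("ACG", "T"))
print(background_log_prob("ACGT", cond_prob))
```

In practice the model would be trained on the control (background) sequences, and the resulting log-probability would serve as the denominator term when scoring candidate instances.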

##### 2.2. Emerging Substrings Mining

For a large-scale data set, calculating the likelihood score of every substring is too costly, which makes probabilistic training methods impractical. A pattern-driven strategy can count the substrings with higher occurrence frequencies in less time. Since each instance differs from the motif in at most $d$ positions (the maximum number of mutations), we expect some instances to occur multiple times across thousands of sequences, and we aim to reduce the disturbance caused by randomly overrepresented substrings. With these considerations, we use both a test set and a control set of DNA sequences to search for possible motif instances. Generally, the test set consists of the sequences containing motifs, while the control set contains background sequences. The substrings of interest are those that are present in the test set and absent from the control set; we call such substrings emerging substrings. The task thus becomes solving the emerging substrings mining problem [20] and then identifying motif instances among the emerging substrings. The emerging substrings mining problem is defined as follows.

Given a test set $D_t$ and a control set $D_c$ of sequences over the alphabet $\Sigma$, a frequency threshold $\rho_f$, and a growth rate threshold $\rho_g$, the task is to find all substrings $s$ that simultaneously satisfy the conditions $f(s, D_t) \ge \rho_f$ and $g(s) \ge \rho_g$. Such substrings are called emerging substrings. Here, $f(s, D)$ represents the frequency of substring $s$ occurring in set $D$, and $g(s) = f(s, D_t) / f(s, D_c)$ is the growth rate of substring $s$ from set $D_c$ to set $D_t$. A large value of $g(s)$ means that substring $s$ is highly discriminative between the two input data sets.
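The definition above can be illustrated with a small Python sketch that enumerates fixed-length substrings in both sets and keeps those passing the two thresholds. Several choices here are assumptions made for illustration and are not taken from the paper: the function names, measuring frequency as the fraction of sequences that contain the substring, and treating a substring absent from the control set as having infinite growth rate.

```python
from collections import Counter


def kmer_frequencies(sequences, k):
    """Fraction of sequences that contain each k-mer at least once."""
    counts = Counter()
    for seq in sequences:
        # A set is used so that each sequence contributes at most once per k-mer.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    n = len(sequences)
    return {kmer: c / n for kmer, c in counts.items()}


def emerging_substrings(test, control, k, min_freq, min_growth):
    """Return {kmer: (test frequency, growth rate)} for k-mers passing both thresholds."""
    f_test = kmer_frequencies(test, k)
    f_ctrl = kmer_frequencies(control, k)
    result = {}
    for kmer, ft in f_test.items():
        if ft < min_freq:
            continue
        fc = f_ctrl.get(kmer, 0.0)
        growth = float("inf") if fc == 0.0 else ft / fc
        if growth >= min_growth:
            result[kmer] = (ft, growth)
    return result


# Toy data (made up for illustration): "CAGCGA" is planted in three test sequences.
test = ["AACAGCGATT", "GGCAGCGACC", "TTCAGCGAGG", "ACGTACGTAC"]
control = ["ACGTACGTAC", "TTTTACGTGG", "GGGGGGGGGG", "CCCCACGTCC"]
print(emerging_substrings(test, control, k=6, min_freq=0.5, min_growth=2.0))
```

Counting exact substrings is linear in the total input length, which is why this step scales to full ChIP-seq data sets; mutated instances are not captured by exact counting and have to be recovered in a later step.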

With the above definitions, our algorithm can be summarized in two main procedures. First, we compare the substrings in the test set and the control set to obtain the emerging substrings. Second, we calculate a score for each emerging substring to find the true motif instances. Nevertheless, some key problems remain to be solved: (i) as the exact motif length is unknown, we need to select a range of emerging substring lengths to search; (ii) the emerging substrings of interest include true motifs, mutated instances, and random disturbances, so the influence of the spurious instances must be reduced; (iii) we need to choose one model from OOPS, ZOOPS (zero or one motif occurrence per sequence), and TCM (two-component mixture) to find motif instances in each sequence. Our algorithm is therefore designed to process the emerging substrings further and handle these problems.

##### 2.3. FCmotif Algorithm

*Step 1 (searching emerging substrings). *An essential assumption is that the evidence for a binding motif is strong in the test set and weak in the control set. To streamline site prediction and handle ChIP-seq data, our algorithm uses a pattern-driven word enumeration strategy to search for the emerging substrings. Assuming the motif length is $l$, we first count all possible $l$-mers in both the test set and the control set and then select the enriched ones. The frequency threshold and the growth rate threshold are the two important parameters employed in this step.

Following previous studies [21, 22], the probability of the occurrence of a random mutated instance of a reference motif at random positions is a function of the mutating probability $p$, which can also represent the conservation of the motif. We set $p$ to 0.2, 0.5, and 0.8 to represent high conservation, intermediate conservation, and low conservation, respectively.

Then, according to the definition of motif, the probability of a random instance of motif occurring in a sequence can be calculated by

Moreover, because each sequence contains a different number of motif instances under different models, the frequency threshold is set according to the chosen model (OOPS, ZOOPS, or TCM). Meanwhile, the default value of the growth rate threshold is 2. Table 1 shows an example of searching the emerging substrings of length 6 in 600 sequences for the ZOOPS model. In this example, the emerging substring "CAGCGA" satisfies both the frequency condition and the growth rate condition. However, an emerging substring alone cannot fully indicate a motif; exact matching may miss the mutated instances, especially for larger values of the motif length $l$ and the mutation number $d$.