Abstract

New high-throughput technique ChIP-seq, coupling chromatin immunoprecipitation experiment with high-throughput sequencing technologies, has extended the identification of binding locations of a transcription factor to the genome-wide regions. However, the most existing motif discovery algorithms are time-consuming and limited to identify binding motifs in ChIP-seq data which normally has the significant characteristics of large scale data. In order to improve the efficiency, we propose a fast cluster motif finding algorithm, named as FCmotif, to identify the motifs in large scale ChIP-seq data set. It is inspired by the emerging substrings mining strategy to find the enriched substrings and then searching the neighborhood instances to construct PWM and cluster motifs in different length. FCmotif is not following the OOPS model constraint and can find long motifs. The effectiveness of proposed algorithm has been proved by experiments on the ChIP-seq data sets from mouse ES cells. The whole detection of the real binding motifs and processing of the full size data of several megabytes finished in a few minutes. The experimental results show that FCmotif has advantageous to deal with the motif finding in the ChIP-seq data; meanwhile it also demonstrates better performance than other current widely-used algorithms such as MEME, Weeder, ChIPMunk, and DREME.

1. Introduction

A Transcription Factor (TF) binds to the specific DNA sequences, which carries the information of the transcription and gene expressions. Hence locating the Transcription Factor Binding Sites (TFBSs) is crucial for uncovering the underlying relationship of regulating transcription and comprehending evolutionary mechanism of living organisms. The identification of TFBSs, or socalled motif discovery, is an issue of discovering short similar nucleotide segments with a common biological function. The planted motif discovery is the famous version for motif discovery [1], which can be formulated as follows: given a set of -length DNA sequences over the alphabet , two nonnegative integers and   , where is the length of a motif and is the maximum number of mutations between the motif and a predicted binding site. The task is to find a -length motif occurring in most of the sequences including up to mutations. is called an motif and each occurrence of is called a motif instance. Various motif discovery algorithms have been developed to locate motifs in promoter sequences from coregulated or homologous genes based on either Consensus or Position Weight Matrix (PWM) [2].

In recent years, high-throughput technique ChIP-seq [3, 4], which couples chromatin immunoprecipitation experiment [5] with high-throughput sequencing technologies, has extended the identification of binding locations of a given TF to that of the genome-wide regions. The genome-wide ChIP experiment generally produces thousands of sequences of a few hundred bps (ChIP-seq peaks), which provides data set of one or two magnitudes larger than a typical motif discovery data set and sequences with a high resolution. The novel ChIP technique ChIP-exo can locate binding sites at a higher resolution, but its binding regions identified by ChIP-seq or ChIP-exo experiments may be dozens of bps away from the true binding sites [6]. Computational motif discovery methods are still needed to identify the binding locations of a TF in ChIP-seq or ChIP-exo data sets [7] in the high accuracy.

In order to detect motifs in large-scale ChIP-seq data, some traditional motifs discovery algorithms have been proposed in their ChIP-tailored versions, such as MDscan [8] and MEME-ChIP [9]. These algorithms normally find motifs by using a limited part of the sequences, while ignoring the remaining unselected sequences. That decreases the chance of discovering motifs related to infrequent cofactors. Meanwhile, PWM-based methods also have been developed. For instance, STEME [10] applies suffix trees to accelerate EM steps. This strategy acts well in case of finding short motifs. However, it executes much slower when the width of motif increases in the large data set. HMS [11] is an improved version of Gibbs that combines sampling algorithms with greedy search steps. ChIPMunk [12] introduces EM algorithms with a greedy approach and applies a more complex statistic model. These algorithms aim to optimize a PWM of ChIP-enriched region. They still have an unsolved problems of local optimum and the iteratively training also costs too much. Additionally, consensus-based algorithms are designed based on word-enumeration methods, such as RAST [13] and CisFinder [14], which can process whole ChIP-seq data set by two contrastive data sets. Both RAST and CisFinder are limited to find short motifs and may miss the useful information contained in the sequences.

To overcome these shortcomings, in this paper, we propose a fast cluster motif finding algorithm, named FCmotif, to solve the motif identification problem in large scale ChIP data set. FCmotif utilizes the emerging substrings mining strategy to find the enriched substrings at first and makes each emerging substring as a reference core to construct PWM. Then our algorithm uses the constructed PWMs to cluster the motifs in different length, and we consider intramotif dependency in statistics model to calculate information content (IC) and false discovery rate (FDR) to optimize the outputs. FCmotif achieves to deal with the whole data set that does not limit to the OOPS (one occurrence of the motif instance per sequence) constraint. The experimental results show that FCmotif is advantageous to deal with the motif finding in the ChIP-seq data, and it also demonstrates better performance than other current widely-used algorithms such as MEME, Weeder, ChIPMunk, and DREME.

2. Materials and Methods

We know that the characteristic of a ChIP-seq data set is a large scale set of relative shorter sequences. That is, the amount and quality of ChIP-seq data have been dramatically increased. Each sequence of ChIP-seq data set contains less “the background information,” and several instances of the motifs could be expected to exist in thousands of sequences. From this point of view, our main objective is to handle the whole data set and distinguish the motif instance from the relative “cleaner” background sequence.

2.1. Motif Representation

Generally, a motif can be represented by a PWM , of which each column stores the occurring frequency of the four types of nucleotides (). Let , where represents the probability of nucleotide preference at the th position of the motif, and let be the probability of nucleotide observing at the nonmotif positions in the sequences. For each substring of length (we also call -mer) , the log-likelihood of letter at position is given bywhere is the probability of observing letter at position and is the background probability of letter . This classical product-multinomial model proposed by Liu et al. [15] has been widely used in de novo statistic algorithms such as EM and Gibbs algorithm. It assumed that the positions within the motif are independent of each other [16]. However, recent researches imply that the commonly used product-multinomial model may be too simplistic in identifying the binding motifs, while some positions of TF binding motif exert an interdependent effect on binding affinities of TFs [1719].

To provide a better fit model to increase the quality of motifs identified by ChIP-Seq, a more sophisticated model that involves the intramotif dependency should be considered. Here, “intramotif dependency” means that the frequency of nucleotide combinations spanning several positions deviates from the expected frequency under the independent motif distribution [11]. For instance, if the frequency of two nucleotides, “GT,” in a pair of positions is much higher or lower than the product of frequency of “G” in the first position and frequency of “T” in the second position, we infer that these two positions are dependent. Here, we implement a 16-component dependent multinomial model to scan each pair of positions within the motif to determine the intramotif dependency. Let represent the probability of observing nucleotide pair at th and th position of the motif. For each pair of positions, there are -1 dependent multinomial distributions to be estimated. The log-likelihood of letters , at position and iswhere represents the background probability of the nucleotide pair. The Log-Likelihood Ratio (LLR) of -mer is thenHere, formula (4) represents the joint probability of the independent nucleotides in motif, and formula (5) represents the joint probability of the nucleotide pair in motif. Formula (3) is the LLR of under the corresponding background distribution . For the background (nonmotif) regions, we employ a high-order Markov model to obtain the weak dependency in background DNA sequences. Compared with the uniform distribution or random distribution background, the high-order Markov model can improve the sensitivity and specificity of identifying motifs. In this study, we use a third-order Markov model to characterize the background sequence. As an example, the probability of an -mer () in the background under a third-order Markov model can be represented byThereby, the Information Content of motif can be represented as

2.2. Emerging Substrings Mining

For the large-scale data set, calculating the likelihood score of each substring costs too much, which makes probabilistic training methods unpractical. Pattern-driven strategy can use shorter time to count the substrings that have higher occurrence frequencies. Since each instance differs from motif at most positions, we expect to find some instances occurring multiple times in thousands of sequences and reduce the disturbance of random overrepresented substrings. With the above considerations, we utilize both a test set and a control set of DNA sequences to search the possible motif instances. Generally, the test set consists of the sequences with motifs, while the control set contains the background sequences. The interested substrings are the ones that present in the test set and absent in the control set, and we call such substrings emerging substrings. The task converts to solve emerging substrings mining problem [20] and then identifies motif instances from the emerging substrings. The emerging substrings mining problem is defined as follows.

Given a test set and a control set of sequences over the alphabet , frequency threshold   , and growth rate threshold   , the task is to find all substrings    satisfying the conditions and at the meantime. Such substrings are called emerging substrings. Here, represents the frequency of substring occurring in set , and , that is, the growth rate of substring from set to set . Large value means that substring is highly discriminative for two input data sets.

With the above material, our algorithm can be summarized as the following main procedures. First, we compare the substrings in both test set and control to obtain the emerging substrings. Second, calculate measure score of the emerging substrings to find the true motif instances. Nevertheless, there are still some key problems needed to be solved: (i) As the exact motif length is unknown, we need to select a range of emerging substring length to find motif. (ii) The interested emerging substrings contain true motifs, the instances of both mutation and random disturbance, how to reduce the influence of the unreal instances. (iii) We need to choose one model from OOPS, ZOOPS (zero- or one-motif occurrences per sequence), and TCM (two-component mixture) to find motif instances in each sequences. Therefore, our algorithm is designed in detail to further process the emerging substrings and handle these problems.

2.3. FCmotif Algorithm

Step 1 (searching emerging substrings). An essential assumption is that the evidence for binding motif is large in test set and small in the control set. To streamline the predicting sites algorithm and handle the ChIP-Seq data, our algorithm utilizes pattern-driven word enumeration strategy to search the emerging substrings. Assume motif length is ; we first count the amount of all possible   -mers in both test set and control set; then we select the rich ones. The threshold frequency and growth rate are two important parameters employed in this step.
As previous studies [21, 22], we knew the probability of the occurrence of a random mutated instance of a reference motif at random positions iswhere is the mutating probability, and it can also represent the conservation of motif. We set as 0.2, 0.5, and 0.8 to represent high conservation, intermediate conservation, and low conservation, respectively.

Then, according to the definition of motif, the probability of a random instance of motif occurring in a sequence can be calculated by

Moreover, for the different models, each sequence contains different amount of motif instances, so the value of can be set by different models and . We set when model is OOPS, when model is ZOOPS, and for TCM. Meanwhile, the default value of that we set is 2. Table 1 shows an example of searching the emerging substrings of length 6 in 600 sequences for ZOOPS model; and . From the example, we can find that the emerging substring “CAGCGA” satisfies both and . However, only the emerging substring cannot indicate motif; it may miss the mutated instances especially for larger value of and .

Step 2 (constructing the corresponding PWM). The emerging substrings can represent a part of the enriched (overrepresented) motifs. However, it still contains the fake instances made up by the background sequences and cannot reflect the true mutated motif instances. It is necessary to measure the statistics scores of all possible instances and find the instances among the emerging substring. The PWM indicates the distributions of each character at each position of the motif, and it is the core of measuring the statistical significance, so we construct the corresponding PWM of each emerging substring by the mutating.
Assume each emerging substring is a reference motif; the motif instances should exist in the mutated -mers at most positions from the reference motif (called the “neighborhood” instances), which have a larger amount in the test set than that in the control set. When we find out the mutated instances from the reference motif, we can use them to construct the core PWM and measure the statistical scores. In this way, see each emerging substring as a reference; we first search its “neighborhood” instances and evaluate the -score of each one. -score is a statistical measurement of a score’s relationship to the mean in a group of scores which is estimated based on the hypergeometric probability distribution [14]:where , , and . and represent the number of occurrences of -mer in and , while and are the total number of -mers in and , respectively.
For convenience of description, here we give an example to explain the searching process. Consider a specific reference -mer “ACCACGTG,” which has 119 matches in the test set and 46 matches (9 matches after adjusting test set and control set with the same size) in the control set. As the previous study [21], for the length , we use to search the “neighborhood” instances. Therefore, we find the -mers that have mutated to the other three characters at one position from the reference -mer. Note that using (8, 1) model, we only need to search 24 -mers to find the “neighborhood” ones but not the whole searching space of   (65536, ) -mers. Table 2 shows each neighborhood instance of the reference -mer “ACCACGTG,” -score, and the number of occurrences in the test and the control sets.
Use each emerging substring as a reference center and incorporate its neighborhood instances; two Position Count Matrices (PCMs) of size , and can be formed. is composed of the qualified neighborhood instances in , and is similarly composed of the qualified neighborhood instances in adjusted for length (rescaled by ). Here, the qualified neighborhood instances refer to the instances with or the instances with the maximum positive -score if there is no instances with [23]. While the PCMs and are constructed by adding up the counts of “” at each position of the qualified neighborhood instances.
It is worthy to note that the test set is a mix of motif instances and random disturbances of background, and the control set is full of background noise. Hence, we can use the background distribution found in the control set to recorrect the contribution and estimate the PWM in the test set. That is, can be regarded as the expected count matrix constructed from false positive motifs in . In this way, let the PCM ; we can get PWM of the reference emerging substring by normalizing each row into probability distribution. In order to avoid zero frequency, 5% pseudo-counts to each position are added. As we also concerned the intramotif distribution in the probabilistic model, Φ can be estimated in the same way. Figure 1 shows an example of constructing PWM of the corresponding emerging substring in Table 2.

Step 3 (clustering longer motifs). See each emerging substring as a seed; its PWM can be obtained by the steps above, while the corresponding motif with high IC score can also be computed. However, the PWMs may represent many similar motifs with a few letters varying as previous studies [24, 25]. In order to eliminate redundant motif information and expand the short motif to form longer motif, we cluster the similar motifs and combine the motifs having the long common-overlap segments by utilizing a metric of computing the Euclidean distance between two -mers as described below:where is motif length and and are the estimate probabilities of observing letter at position of -mers and , respectively. Since the length of predicted motifs may be different, we actually use the minimum distance between motifs among all possible overlaps of motifs and induced by shifts that the minimum overlap is 7 bases or two bases fewer when the motifs are even shorter. Hence, we use the Harbison similarity score [26]:where and correspond to all possible overlaps of -mers and . In this way, two -mers and are considered similar if the PWMs of and have the Harbison similarity score ≥ 0.75. In practice, as the motif length is unknown, we use    in a proper range and cluster the PWMs of different length which satisfy the Harbison similarity score constraint. So in this step, a longer motif can be obtained by the corresponding PWM that is combined by clustering the PWMs of different .

Step 4 (output). With the combined PWMs, we employ two measures to optimize the motifs; first we compute IC and then utilize the False Discovery Rate (FDR) to control the final outputs. The False Discovery Rate as a function of the threshold can be intuitively defined aswhere is the number of -mers found in and   is the number of -mers found in . can be calculated by formula (3). Here, we define as an integer satisfying which leads to (FDR value changes with different data sets). Once is determined, the -mers in with are the predicted motif instances. In practice, we finally generate at least 50 top IC score motifs by formula (7) satisfying the FDR constraint.
Main algorithm of FCmotif is shown in Algorithm 1.
In Step 1, lines (6) to (9), we find the emerging substrings enriched in the test set; then lines (10) to (15) are the step to construct the PWM and the intramotif distribution for each emerging substring. Lines (16) to (19) are the step to cluster PWM with the similar Harbison similarity score. Lines (20) to (24) are the last step to compute IC and FDR and finally output the result.

Input: a test set and a control set .
Output: the set of motifs
()  // the set of motifs
()  // the set of emerging substrings
()  // the set of the qualified neighborhood instances
()  // the set of PWMs
()  // the set of intra-motif distributions
() For   to   do
()   For  each -mer of substrings:   do
()    if  (, ) ≥  && (, , ) ≥   then
()    add to
()    For  each -mer of :   do
()   For  each to   do
()   calculate -score of each neighborhood instance
()   if  z() > 1.643  then
()    Add to
()  use and to construct and
()   add to set and add to set
()   For  each of   do
()   if  sim(, ) ≥ 0.75 ()  then
()   cluster with and delete from .
()   if  FDR() > 0.2 then
()  delete from
()   use and corresponding to compute IC.
()   add formed by of top 50 IC score to
()   return  

3. Results and Discussion

We use the ChIP-seq data sets of 12 TFs profiled in mouse ES cells [27] to test the validity of our algorithm. These 12 data sets are key to the maintenance of pluripotency, in which Nanog, Oct4, Sox2, Esrrb, and Zfx are regulators of self-renewal; Klf4, cMyc, and mMyc are the crucial reprogramming factors [28, 29]; Tcfcp2l1 is preferentially upregulated in ES cells [30]; Smad1 and STAT3 have the significant meaning to the signalling pathways, and CTCF is a key component for transcriptional insulation [31]. For the test set, we extract 200 bps sequence segments centered at a peak of TF location. For the control set, we extract 500 bps sequence segments starting from nucleotide positions 400 bps away from both ends of 200 bps positive sequence segments. The total sizes range from 1 Mb to 50 Mb. Table 3 is the statistics information about the 12 mES ChIP-seq data sets.

Our algorithm runs by using the following parameters: word enumeration analysis is performed with the length ranging from 6 to 12. when model is OOPS, when model is ZOOPS, and for TCM. while , 0.5, and 0.8 to represent high conservation, intermediate conservation, and low conservation, respectively. The settings we used include , , , , , , and . The default value of threshold for selecting qualified neighbourhood instances is 1.643, and the FDR constraint is 0.2. FCmotif is implemented in Matlab under the experiment environment: 2.67 Hz CPU and 4 G memory.

3.1. Results on 12 TF Binding Sites in mES Cells

To evaluate the performance of our algorithm, we compare the primary motifs of 12 TFs in mES ChIP-seq data sets discovered by our algorithm with the motifs found by Chen et al. with Weeder. The motif comparison is performed by comparing matrices [32], which supports various scoring metrics and shows the results as the logo of aligned words, in order to grasp the similarities between a predicted motif and the known motifs. Figure 2 shows all 12 motifs identified by FCmotif and motifs found by Chen et al., indicating that the quality of results is comparable. Moreover, we also compare the running time on our algorithm with that of the popular motif discovery algorithms, MEME [33], Weeder [34], ChIPMunk [12], and DREME [35]. Figure 3 shows the running time of above algorithms of 12 mES ChIP-seq data sets. Note that both MEME and Weeder are too slow to deal with ChIP-seq data sets containing thousands of sequences and often fail after running for many days. For this reason, the results in Figure 3 for MEME and Weeder are obtained on the reduced-size data sets. ChIPMunk is an iterative algorithm which can process up to tens of thousands of sequences but with enormous computation at the same time. DREME can predict more accurate motifs than the traditional motif discovery algorithm but was restricted to 500 top-scoring peaks; it cannot analyze the full size data sets. For our algorithm, we found that the computing time scales up efficiently with sequences size and our algorithm outperforms all the compared motif discovery algorithms. Data of several megabytes can be processed in a few minutes. In addition, it is worth to point out that the word length is another factor to influence the computational efficiency. Because the number of possible -mers and the number of neighborhood instances are increasing dramatically with increasing. For example, for Sox2 data set, running time when is 536 s but 620 s when . CisFinder [14] is another algorithm that uses the idea of word enumeration and compares the word enrichment of two input sets; it works extremely fast. However, CisFinder outputs the motifs using -value as a measure, which cannot reflect the significance of the motif but only a single word matching the motif [35].

Although the analysis from 50 to 200 top-scoring binding sites is sufficient to extract the primary motif, yet this data size usually used by Weeder or MEME is not enough to examine alternative motifs. For instance, in Sox2 and Oct4 data sets, Chen et al. report only a single motif with Weeder, respectively. In contrast, FCmotif can find multiple motifs for each TF using the same data sets. As shown in Figure 4(a), our algorithm predict not only the Oct-Sox composite motif bound by Sox2 and Oct4 complex [27] but also the characteristic motifs Sox2 (CCATTGTT) and Oct4 (TATGCAAAT). Meanwhile, some predicted motifs, with no similarity with the prevalent consensus of the data set, often have a high significance and may reveal alternative consensus. Such as the motif (ATATGCGCATGC) in the Oct4 data set, it corresponds to an alternative Oct4 motif reported in other studies [13, 35].

Nanog and Smad1 data sets also have nearly the same binding regions like Sox2 and Oct4 data sets as discussed by Chen et al. As shown in Figure 2, the significant motif found in Nanog is Oct-Sox, which means these similarity binding regions may cause Nanog motif to bind indirectly via one or both of Sox2 and Oct4 TFs and raise the difficulty to identify motifs of the TFs. Therefore, the approach to find the enriched motifs in Nanog data set relative to Sox2 or Oct4 data sets is needed. Here, we use Nanog ChIP-seq data set as the test input and either of Sox2 and Oct4 data sets as the control input. From the results shown in Figure 4(b), the significant predicted motifs of these two compared data sets are similar, and both of them are also similar to the previously reported motifs “CCATCA” by [23, 35] as an alternative Nanog motif.

In addition, for Smad1 data set, our algorithm not only discovers a motif “AACAAAGC” matching the published Smad1 motif “AAACAAAG” but also finds other motifs, like “CCTTTGTC”, which matches a Sox2 motif (Figure 4(c)). And these discovered motifs demonstrate the frequent cobinding relationship of Smad1 and Sox2 TFs.

In contrast with the traditional analysis of transcriptional regulation that motifs commonly bind to a DNA-binding TF, genome-wide locations for a specific TF usually do not carry the primary or alternative binding motif but the binding motifs for other TFs. To explore this issue, we employ the ChIP-seq approach to characterize these TFs that bind to DNA indirectly through binding to a cofactor. We use a histone acetyltransferase generally found at enhancer regions [27], P300, to reveal the interaction of cofactors and DNA-binding TFs, and hope to infer the potential tissues of transcription regulation. From the results, we found that P300 does not only associate with the notable Oct4, Sox2, Nanog, and Smad1 TFs but also cooccurs with Oct-Sox complex and other abundant TFs including a TEF motif “AGGATTGCT” and the core of AP4-L motif “CAGCAGG.” In addition, there are still several motifs found by our algorithm in relative low probabilities, such as Esrrb motif “GAgGGTgA,” Klf4 motif “GGTGTGGg,” and Tcfcp2l1 motif “CCAGTTgcA.”

The results of experiment show that our algorithm makes a good trade-off between accuracy and efficiency. It shows better performance than the other compared algorithms. Data of several megabytes can be handled in a few minutes; when data is up to 50 Mb, it can be handled in several hours. We can see in the word enumeration that FCmotif counts all the -mers in both test and control sets. Suppose the both sets have the same size: the sequence length and the number of the sequences , and the motif length is , so the computational complexity of counting all the -mers is which is completely acceptable. The step of searching emerging substrings dramatically reduce the number of potential motif instances; generally, the order of magnitude of emerging substrings is . Moreover, note that the range of is from 6 to 12 bps, the values include , , , , , , and , and the number of the neighborhood instances isSo the maximum is 6570 for , which means our algorithm does not need to search thousands of possible instances to form PWM and can limit the number of enriched -mers in several dozens. For a few megabytes data, our algorithm can search out a fixed length motif in a few minutes with the amount of computation in or (for a large ), which is more faster than that of the probability training methods. In addition, recent studies indicate that many regulatory regions are located in transposable elements which are commonly not conserved [36].

4. Conclusions

In this paper, we propose a fast cluster motif finding algorithm, named FCmotif, to solve the motif identification problem in large scale ChIP data set. FCmotif overcomes drawbacks of traditional algorithms which are time-consuming and cannot handle the full size data; it guarantees to find all potential motif instances. FCmotif utilizes a word enumeration strategy and searches the neighborhood instances to form the PWM of enriched substrings. It breaks up constrain of the OOPS model and can find long motifs by clustering the PWM in different lengths. The experiments of the ChIP-seq data sets from mouse ES cells confirm that FCmotif can find not only the primary motif but also the exceptional motifs, which uses less time compared to popular motif discovery algorithms. Meanwhile, the potential cobinding relationship can be also detected by our algorithm. It is worth noting that our algorithm is easy to parallel because the calculation of motif of each length is independent.

In summary, it can be seen that FCmotif is a competitive algorithm to deal with the motif finding in the ChIP-seq data. The functions of some motifs found by our algorithm are still unknown. The functions of some motifs found by our algorithm are still unknown, the further experimental validation is needed to prove that these motifs are indeed functional. The analysis of motifs in these complex transcriptional regions is the key issue for the future study.

Disclosure

Ping Wang is the second author of this paper.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the Fundamental Research Funds for the Central Universities (Grant nos. 310824151037, 310832151088, and 310832151091) and the Natural Science Foundation of Shaanxi (2015JM6280).