Abstract

The planted motif search (PMS) is one of the fundamental problems in bioinformatics and plays an important role in locating transcription factor binding sites (TFBSs) in DNA sequences. Identifying weak motifs and reducing the effect of local optima remain important but challenging tasks in motif discovery. To address these tasks, we propose a new algorithm, APMotif, which first applies Affinity Propagation (AP) clustering to the DNA sequences to produce informative candidate motifs and then employs Expectation Maximization (EM) refinement to obtain the optimal motifs from the candidate motifs. Experimental results on both simulated data sets and real biological data sets show that APMotif usually outperforms four other widely used algorithms in terms of prediction accuracy.

1. Introduction

Transcription factor binding sites (TFBSs) are short and conserved nucleotide fragments (usually ≤ 30 bps) in the cis-regulatory regions of genes in DNA sequences. They interact with transcription factors (TFs) and thereby affect gene expression. Identification of TFBSs, that is, motif discovery [1], is a fundamental problem because of its importance for understanding the structure and function of gene expression.

In this paper, we focus on the planted motif search (PMS) problem [2], a widely accepted formulation of the motif discovery problem. Given a set of t input DNA sequences, each of length n, and two nonnegative integers l and d, the aim of the PMS is to find an l-mer (an l-length string) that occurs in each of the t sequences with up to d mutations. The l-mer is called a motif, and each of its mutated occurrences is called a motif instance.
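To make the (l, d) formulation concrete, the following short Python sketch (our illustration; the function names are not from the paper) checks whether a given l-mer is a planted (l, d) motif for a set of sequences.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def occurs_with_mutations(motif: str, sequence: str, d: int) -> bool:
    """True if some l-mer of `sequence` differs from `motif` in at most d positions."""
    l = len(motif)
    return any(hamming(motif, sequence[j:j + l]) <= d
               for j in range(len(sequence) - l + 1))

def is_planted_motif(motif: str, sequences: list, d: int) -> bool:
    """(l, d) test: the motif must occur in every sequence with up to d mutations."""
    return all(occurs_with_mutations(motif, s, d) for s in sequences)

# Toy example: the 5-mer "ACGTA" occurs in each sequence with at most 1 mutation.
seqs = ["TTACGTATT", "GGACGAAGG", "CCTCGTACC"]
print(is_planted_motif("ACGTA", seqs, d=1))  # True
```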

The existing algorithms for solving the PMS problem fall into two main categories. One is exact algorithms, most of which use consensus sequences [3] to represent motifs. The exact algorithms are guaranteed to obtain the optimal motif. Recently, research on exact algorithms has mainly concentrated on pattern-driven algorithms. All the l-length string patterns are taken as candidate motifs, and the string patterns occurring in all input sequences with up to d mutations are the motifs. Typical pattern-driven algorithms use various means to reduce time complexity [4–10]. PairMotif [4] selects multiple pairs of l-mers with relatively large distance from the input sequences to restrict the search space. Compared with recently proposed algorithms, PairMotif requires less storage space and runs faster on most PMS problems. PMS5 [7] computes the common d-neighbors of three l-mers using an integer programming formulation and is an efficient algorithm for solving the difficult instances of PMS: (21, 8) and (23, 9). Some other pattern-driven algorithms index the input sequences with a suffix tree to speed up the search for candidate motifs [11–14]. RISOTTO [11] is the fastest algorithm in the family of suffix tree algorithms for the PMS problem and can solve the instance (15, 5) in 100 minutes. The initial search space of pattern-driven algorithms is 4^l, the number of possible l-length patterns over the DNA alphabet. Therefore, pattern-driven algorithms are feasible for small motif lengths, but they take long running times or have high space requirements as the motif length increases.
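To illustrate why pattern-driven search becomes impractical for long motifs, the short calculation below (our own illustration, not code from the cited algorithms) prints the number of candidate patterns, 4^l, and the size of the d-neighborhood of a single l-mer, sum over i = 0..d of C(l, i)·3^i, for a few typical (l, d) instances.

```python
from math import comb

def num_patterns(l: int) -> int:
    """Number of candidate l-length patterns over the DNA alphabet {A, C, G, T}."""
    return 4 ** l

def neighborhood_size(l: int, d: int) -> int:
    """Number of l-mers within Hamming distance d of a fixed l-mer."""
    return sum(comb(l, i) * 3 ** i for i in range(d + 1))

for l, d in [(11, 3), (15, 4), (21, 8)]:
    print(f"(l, d) = ({l}, {d}): 4^l = {num_patterns(l):,}, "
          f"d-neighborhood size = {neighborhood_size(l, d):,}")
```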

The other category is approximate algorithms, which commonly use position weight matrices (PWMs) [15] to represent motifs. They can report results in a short time but often get trapped in locally optimal solutions. Most approximate algorithms attempt to maximize a score function measuring how likely a subsequence of an input sequence is to be a motif instance, using statistical analysis [16–23]. MEME [18] and Gibbs sampling [20] are well-known approximate algorithms. MEME finds motifs by optimizing the PWMs using Expectation Maximization (EM). Based on MEME, there are extension algorithms such as Projection [21] and MCEMDA [22]. Projection hashes all l-mers from the input sequences into buckets and then derives consensus sequences to select some valid buckets; after this effective initialization step, the EM algorithm is used for refinement. MCEMDA is a modification of the EM algorithm in which the expectation in the E-step is computed numerically through Monte Carlo simulation. Gibbs sampling is a Markov Chain Monte Carlo (MCMC) approach, and several modifications based on the Gibbs sampling strategy have also been described [24, 25]. One that stands out is AlignACE [25], a Gibbs sampling algorithm for identifying overrepresented motifs in a set of DNA sequences. Furthermore, some graph-theoretic methods, based either on clustering or on heuristic search, have also been introduced in the field of motif discovery [26–28]. CRMD [26] uses an entropy-based clustering to find good starting candidate motifs from the input sequences and then employs an effective greedy refinement to search for optimal motifs from the candidate motifs. VINE [28] is a graph clustering algorithm for motif discovery that finds cliques in a multipartite graph in polynomial time. Generally, approximate algorithms have short runtimes and low memory consumption; sometimes, however, they cannot converge to the global optimum.

In this paper, we propose a new algorithm, APMotif, to solve motif discovery problem. APMotif first applies Affinity Propagation (AP) [29] clustering in DNA sequences to find highly conserved candidate motifs. APMotif then employs an effective EM refinement to search for optimal motifs from the candidate motifs. Experimental results show that APMotif has competitive prediction accuracy compared to that of previously developed algorithms.

2. Materials and Method

Here, we first briefly describe the original Affinity Propagation clustering and Expectation Maximization algorithms used in the remainder of the paper. We then construct the similarity matrix for motif discovery. Finally, we describe the APMotif algorithm.

2.1. Affinity Propagation (AP)

Compared with other clustering approaches, AP clustering is an effective and fast clustering algorithm, especially for large data sets. Given a set of data points {x_1, x_2, ..., x_N}, AP clustering takes as input a collection of real-valued similarities s(i, k) between pairs of points x_i and x_k, i ≠ k. According to the similarities between data points, AP clustering recursively updates two types of messages: the responsibility r(i, k), reflecting the suitability of point x_k to serve as the exemplar for point x_i, and the availability a(i, k), indicating how appropriate it would be for point x_i to choose point x_k as its exemplar:

r(i, k) = s(i, k) − max_{k′ ≠ k} { a(i, k′) + s(i, k′) },   (1)

a(i, k) = min{ 0, r(k, k) + Σ_{i′ ∉ {i, k}} max{0, r(i′, k)} } for i ≠ k,  and  a(k, k) = Σ_{i′ ≠ k} max{0, r(i′, k)}.   (2)

Upon convergence, AP clustering selects a subset of the data points as exemplars and assigns every nonexemplar point to exactly one exemplar. The exemplar c_i associated with point x_i is finally determined as follows:

c_i = arg max_k { a(i, k) + r(i, k) }.   (3)

The AP clustering is terminated when the exemplar remains unchanged for a user-set number of iterations.
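For completeness, the sketch below implements the standard responsibility and availability updates (1)-(2) with damping, following Frey and Dueck [29]; it is a generic, numpy-based illustration rather than the exact implementation used in APMotif, and the damping value, iteration count, and preference choice in the example are our own.

```python
import numpy as np

def affinity_propagation(S: np.ndarray, max_iter: int = 200, damping: float = 0.5) -> np.ndarray:
    """Basic Affinity Propagation on a similarity matrix S.

    The diagonal of S holds the preferences; returns, for each point, the index
    of its exemplar according to the final responsibilities and availabilities.
    """
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities  a(i, k)
    rows = np.arange(n)

    for _ in range(max_iter):
        # Responsibility update: r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')].
        AS = A + S
        first_idx = np.argmax(AS, axis=1)
        first_max = AS[rows, first_idx]
        AS[rows, first_idx] = -np.inf
        second_max = AS.max(axis=1)
        R_new = S - first_max[:, None]
        R_new[rows, first_idx] = S[rows, first_idx] - second_max
        R = damping * R + (1 - damping) * R_new

        # Availability update: a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        # and a(k,k) = sum_{i' != k} max(0, r(i',k)).
        Rp = np.maximum(R, 0)
        Rp[rows, rows] = R[rows, rows]            # keep r(k,k) itself in the column sums
        A_new = Rp.sum(axis=0)[None, :] - Rp      # exclude the own term r(i,k)
        diag = A_new[rows, rows].copy()
        A_new = np.minimum(A_new, 0)
        A_new[rows, rows] = diag
        A = damping * A + (1 - damping) * A_new

    return np.argmax(A + R, axis=1)

# Tiny usage example: five 1-D points, similarity = negative squared distance.
x = np.array([1.0, 1.1, 1.2, 5.0, 5.1])
S = -(x[:, None] - x[None, :]) ** 2
np.fill_diagonal(S, np.median(S))   # shared preference controls the number of exemplars
print(affinity_propagation(S))      # points 0-2 and points 3-4 each share an exemplar
```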

2.2. Expectation Maximization (EM)

For the EM algorithm, given the DNA sequences S = {S_1, S_2, ..., S_t}, each sequence consists of two components, which model the motif and nonmotif ("background") positions in the sequence. The starting positions of the motif in each sequence are unknown and are represented by the variables ("missing data") Z = {Z_{ij}}, where Z_{ij} = 1 if a motif starts at position j in the sequence S_i, and Z_{ij} = 0 otherwise.

The EM algorithm attempts to maximize the expectation of the logarithm of the joint likelihood of the model.

The main procedure of the EM algorithm iteratively repeats the following two steps:

E-step: compute the expected log-likelihood

Q(θ | θ^(q)) = E_{Z | S, θ^(q)} [ log P(S, Z | θ) ],   (4)

M-step: update the parameters

θ^(q+1) = arg max_θ Q(θ | θ^(q)).   (5)

In (4), the logarithm of the joint likelihood of the model is defined as follows:

log P(S, Z | θ) = Σ_{i=1}^{t} Σ_{j=1}^{n−l+1} Z_{ij} log P(S_i | Z_{ij} = 1, θ),   (6)

where θ is the vector containing all the parameters of the model and θ_{c,k} is the probability of the character c occurring at either a background position (k = 0) or a motif position k (1 ≤ k ≤ l).

The conditional probability of a sequence containing a motif, which appears in (6), is defined as follows:

P(S_i | Z_{ij} = 1, θ) = ∏_{k ∈ Δ_i} θ_0^{I_{ik}} · ∏_{k=1}^{l} θ_k^{I_{i,j+k−1}},   (7)

where I_{ik} indicates a vector whose entries are all zeros except the one corresponding to the character at position k in the sequence S_i, and Δ_i is the set of positions of the background in the sequence S_i.

2.3. Construction of Similarity Matrix for Motif Discovery

In the original AP clustering, given two random l-mers x and y from the DNA sequences S, the similarity is set as the negative Hamming distance between x and y, that is, s(x, y) = −d_H(x, y) [29], which cannot describe the properties of DNA sequence clustering effectively. According to the feature of PMS that two motif instances of the same motif cannot differ in more than 2d positions, and according to the maximum similarity principle, we employ pairwise constraints and a variable-similarity measure [30] to modify the similarity, as given in (8), where α ∈ (1, +∞) and β ∈ (0, 1] are parameters and x ∈ S_i denotes that x is an l-mer of the sequence S_i.

Based on the similarity in (8), the similarity between data points is described more accurately and only a small subset of the pairs of data points is required to exchange messages, so AP clustering can not only increase clustering accuracy but also decrease runtime. The corresponding theoretical analysis is presented in Section 3.1.
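Because we do not restate the exact form of (8) here, the sketch below only illustrates the structural part of the construction that the text describes explicitly: similarities start from the negative Hamming distance, and pairs of l-mers that come from the same sequence or differ in more than 2d positions are excluded from message passing by setting s = −∞. The function name, the preference choice, and the omission of the α/β weighting of (8) are assumptions of this illustration.

```python
import numpy as np

def build_similarity_matrix(lmers, seq_ids, d):
    """Illustrative similarity matrix for AP clustering of l-mers.

    lmers   : list of equal-length strings (the l-mers of one candidate cluster)
    seq_ids : index of the source sequence of each l-mer
    d       : maximum number of mutations of the PMS instance

    Pairs that cannot be instances of the same motif (Hamming distance > 2d)
    or that lie in the same sequence (OOPS constraint) get s = -inf, so message
    passing simply skips them; other pairs use the negative Hamming distance.
    """
    n = len(lmers)
    S = np.full((n, n), -np.inf)
    for i in range(n):
        for k in range(n):
            if i == k:
                continue
            dist = sum(a != b for a, b in zip(lmers[i], lmers[k]))
            if seq_ids[i] != seq_ids[k] and dist <= 2 * d:
                S[i, k] = -float(dist)
    # Preferences on the diagonal control how many exemplars emerge.
    finite = S[np.isfinite(S)]
    np.fill_diagonal(S, np.median(finite) if finite.size else 0.0)
    return S

# Example: l-mers from three sequences for the (l = 5, d = 1) case.
lmers = ["ACGTA", "ACGTT", "TTTTT", "ACGAA"]
seqids = [0, 1, 1, 2]
print(build_similarity_matrix(lmers, seqids, d=1))
```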

Taking the PMS instance (15, 4) with 20 sequences of lengths varying between 100 and 1000 as an example, Figure 1 compares the runtime and clustering accuracy obtained with the original similarity s(x, y) = −d_H(x, y) and with the modified similarity in (8).

2.4. APMotif Algorithm

Under the assumption of exactly one occurrence of a motif instance per sequence (OOPS) [1], to find the motif instances from the input DNA sequences S = {S_1, S_2, ..., S_t}, the APMotif algorithm consists of the following stages:

(1) Constructing Clusters. Select the sequence S_1 as the reference sequence; for each l-mer x in S_1 (the reference subsequence), construct the cluster C(x), which is the set composed of the l-mer x together with all the l-mers y in the other sequences such that d_H(x, y) ≤ 2d.

(2) Extracting Clusters. For each cluster C(x), use AP clustering and a filtering rule to generate a highly conserved cluster C′(x).

(3) Refining Clusters. For each filtered cluster C′(x), use EM refinement to obtain the distribution θ(x) and the objective function IC(x) of the cluster.

(4) Verifying Motif Instances. With the distribution θ* and the objective function IC* of maximum value, the l-mer having the maximum log-likelihood in each sequence is verified as a motif instance.

Based on these four stages, the APMotif algorithm is presented in Algorithm 1.

Input: DNA sequences S = {S_1, S_2, ..., S_t}, motif length l, maximum number of mutations d
Output: motif instance set M
(1)  M ← ∅
(2)  for each l-mer x ∈ S_1 do
(3)    Construct cluster C(x)
(4)  for each cluster C(x) do
(5)    Use AP clustering and a filtering rule to generate cluster C′(x)
(6)  for each cluster C′(x) do
(7)    Use EM algorithm to generate θ(x) and IC(x)
(8)  Calculate θ* and IC* of maximum value
(9)  for i ← 1 to t do
(10)   for each l-mer y ∈ S_i do
(11)     Calculate the log-likelihood of y under θ*
(12)   Add the l-mer with the maximum log-likelihood in S_i to M
(13) Output M

In line (1), the set M of the motif instances is initialized to an empty set. Lines (2)-(3) show the stage of constructing clusters. Lines (4)-(5) show the stage of extracting clusters. Lines (6)-(7) show the stage of refining clusters. Lines (8)-(12) show the stage of verifying motif instances. APMotif can discover the motif instances with high prediction accuracy and outputs them in line (13).

Next, we explain each stage in detail.

Stage 1 (construct clusters). The construction of clusters relies on the simple observation that the Hamming distance between two motif instances of the same motif must be less than or equal to 2d. Generally, we choose the first sequence S_1 as the reference sequence. As we do not know in advance which l-mer in S_1 is the motif instance, all the l-mers in S_1 are regarded as reference subsequences. Given an l-mer x in S_1, the selected l-mers y in the other sequences should satisfy d_H(x, y) ≤ 2d, where y ∈ S_i denotes that y is an l-mer of S_i. The cluster corresponding to the reference subsequence x is denoted as

C(x) = {x} ∪ { y : y ∈ S_i, 2 ≤ i ≤ t, d_H(x, y) ≤ 2d }.

The average number of l-mers in the cluster is (n − l + 1)(t − 1)p + 1, where

p = Σ_{i=0}^{2d} C(l, i) (3/4)^i (1/4)^{l−i}

is the probability that the Hamming distance between two random l-mers is at most 2d.
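This estimate can be checked numerically with the following sketch; the binomial expression for p assumes independent, uniformly random characters, and the particular values n = 600 and t = 20 are our example choices, not values fixed by the paper.

```python
from math import comb

def prob_within(l: int, r: int) -> float:
    """P(d_H(x, y) <= r) for two random l-mers; positions match with probability 1/4."""
    return sum(comb(l, i) * (3 / 4) ** i * (1 / 4) ** (l - i) for i in range(r + 1))

def expected_cluster_size(l: int, d: int, n: int, t: int) -> float:
    """Expected size of a cluster C(x): the reference l-mer plus, from each of the
    other t - 1 sequences, (n - l + 1) l-mers kept with probability p = P(d_H <= 2d)."""
    p = prob_within(l, 2 * d)
    return 1 + (n - l + 1) * (t - 1) * p

# Example: the classic (15, 4) challenge with 20 sequences of length 600.
print(round(prob_within(15, 8), 4))                     # ~0.0566
print(round(expected_cluster_size(15, 4, 600, 20), 1))  # expected cluster size
```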

Stage 2 (extract clusters). For each cluster C(x), we use AP clustering to produce a highly conserved cluster C′(x) that contains the reference subsequence x. If one of the reference subsequences is a motif instance, the corresponding cluster may represent the true motif model.
For each cluster produced by the AP clustering, two types of metrics, the information content (IC) and the complexity score [31], are employed to assess the quality of the cluster. The information content of the cluster is defined as

IC = Σ_{k=1}^{l} Σ_{c ∈ {A, C, G, T}} f_{k,c} log_2 ( f_{k,c} / b_c ),

where f_{k,c} represents the probability of character c appearing at position k of the l-mers in the cluster and b_c is the background probability of character c. A higher IC value indicates a stronger potential of a cluster to be the true motif model.
The complexity score of the cluster is computed following the definition in [31].

Note that the IC value cannot completely reflect the conservation of the motif model. The reason is that many noninformative repeated l-mers may lead to a higher IC value. Fortunately, these false positive clusters have lower complexity scores and they can be effectively filtered out.

Taking these into account, we propose the following rule to filter out some unqualified clusters.

Rule. If both the IC value and the complexity score of a cluster exceed their respective thresholds, the cluster will be stored as a candidate motif model.

According to the rule, a few clusters with high IC value and high complexity scores are stored for EM refinement in Stage 3.
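To make the two metrics and the filtering rule concrete, the following sketch scores a cluster of aligned l-mers. The information content follows the relative-entropy form given above; the complexity score here uses a common definition from the motif literature based on the multinomial coefficient of nucleotide counts, which may differ in detail from the score of [31]; and the threshold values are placeholders, not the values used by APMotif.

```python
from collections import Counter
from math import log2, log, factorial

BASES = "ACGT"

def information_content(cluster, background=None):
    """Relative-entropy IC of a set of aligned l-mers against a background model."""
    background = background or {b: 0.25 for b in BASES}
    l, n = len(cluster[0]), len(cluster)
    ic = 0.0
    for k in range(l):
        counts = Counter(s[k] for s in cluster)
        for b in BASES:
            f = counts[b] / n
            if f > 0:
                ic += f * log2(f / background[b])
    return ic

def complexity_score(lmer):
    """A common per-l-mer complexity score: log_4( l! / prod_c n_c! ) / l.
    Low-complexity repeats such as 'AAAAAAAA' score near 0."""
    l = len(lmer)
    counts = Counter(lmer)
    value = factorial(l)
    for b in BASES:
        value //= factorial(counts[b])
    return log(value, 4) / l

def keep_cluster(cluster, ic_threshold=8.0, complexity_threshold=0.3):
    """Filtering rule of Stage 2 with placeholder thresholds."""
    mean_complexity = sum(complexity_score(s) for s in cluster) / len(cluster)
    return information_content(cluster) >= ic_threshold and mean_complexity >= complexity_threshold

print(complexity_score("AAAAAAAAAAAAAAA"))  # 0.0  (low complexity)
print(complexity_score("ACGTACGTACGTACG"))  # ~0.8 (high complexity)
```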

Stage 3 (refine clusters). It is important to note that AP clustering is primarily an initialization strategy that produces starting points for EM refinement. Taking each cluster as a starting point, we use the modified EM refinement to search for a motif model.
The E-step of EM calculates the expected value of the missing information Z_{ij}, which is the probability that a motif starts at position j of sequence S_i.
E-Step. Consider

Z_{ij} = P(S_i | Z_{ij} = 1, θ) / Σ_{j′=1}^{n−l+1} P(S_i | Z_{ij′} = 1, θ).

The M-step of EM reestimates the distribution θ by maximizing the expected log-likelihood.
M-Step. Consider

θ_{c,k} = (n_{c,k} + ε) / Σ_{c′ ∈ {A, C, G, T}} (n_{c′,k} + ε),   k = 0, 1, ..., l,

where n_{c,k} is the expected count of character c at motif position k (for 1 ≤ k ≤ l) and n_{c,0} = n_c − Σ_{k=1}^{l} n_{c,k}, ε is the pseudocount used to deal with zero frequencies, and n_c is the total number of occurrences of the character c in all sequences.
The EM algorithm is terminated when the objective function, that is, the information content, remains unchanged. After EM refinement, we obtain the distribution θ(x) and the objective function IC(x) of each cluster C′(x).
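The following sketch implements an OOPS-style E-step and M-step of the kind described above, with a uniform background and a single pseudocount; the function name, starting model, fixed iteration count, and other details are our simplifications, not the exact refinement used by APMotif.

```python
import numpy as np

BASES = "ACGT"
IDX = {b: i for i, b in enumerate(BASES)}

def em_refine(sequences, l, theta0, background=None, pseudocount=0.1, n_iter=50):
    """OOPS-style EM refinement of a motif model (simplified sketch).

    sequences : list of DNA strings, one motif occurrence assumed per sequence
    theta0    : initial l x 4 position weight matrix (rows sum to 1)
    Returns the refined PWM and the per-sequence start-position posteriors Z.
    """
    bg = background if background is not None else np.full(4, 0.25)
    theta = np.array(theta0, dtype=float)
    Z = None
    for _ in range(n_iter):
        # E-step: posterior probability that the motif starts at each position.
        Z = []
        for s in sequences:
            m = len(s) - l + 1
            loglik = np.array([
                sum(np.log(theta[k, IDX[s[j + k]]]) - np.log(bg[IDX[s[j + k]]])
                    for k in range(l))
                for j in range(m)
            ])
            w = np.exp(loglik - loglik.max())
            Z.append(w / w.sum())
        # M-step: re-estimate the PWM from expected counts plus a pseudocount.
        counts = np.full((l, 4), pseudocount)
        for s, z in zip(sequences, Z):
            for j, zj in enumerate(z):
                for k in range(l):
                    counts[k, IDX[s[j + k]]] += zj
        theta = counts / counts.sum(axis=1, keepdims=True)
    return theta, Z

# Tiny usage example: refine a rough starting model seeded by the 5-mer "ACGTA".
seqs = ["TTACGTATT", "GGACGAAGG", "CCTCGTACC"]
start = np.full((5, 4), 0.1)
for k, c in enumerate("ACGTA"):
    start[k, IDX[c]] = 0.7
pwm, posteriors = em_refine(seqs, l=5, theta0=start, n_iter=20)
print(np.round(pwm, 2))
```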

Stage 4 (select motif instances). Comparing the distributions of all candidate clusters, we find the one with the maximum objective function, denoted θ*. Under the distribution θ*, the l-mer y = y_1 y_2 ⋯ y_l in each sequence that maximizes the log-likelihood

Σ_{k=1}^{l} log θ*_{y_k, k}

is considered as a candidate motif instance. Meanwhile, a candidate motif instance y should satisfy d_H(y, M_c) ≤ d, where M_c is the motif obtained by using θ* as the consensus.
Thus, the l-mer that has the maximum log-likelihood under the distribution θ* and satisfies d_H(y, M_c) ≤ d is stored in the set of motif instances M.
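A sketch of this verification step is given below: each l-mer of each sequence is scored under the selected model, and the top-scoring l-mer per sequence is kept if it lies within Hamming distance d of the consensus. The function names and the example model are ours.

```python
import numpy as np

BASES = "ACGT"
IDX = {b: i for i, b in enumerate(BASES)}

def consensus(theta):
    """Consensus string of a PWM: the most probable base at each position."""
    return "".join(BASES[int(k)] for k in np.argmax(theta, axis=1))

def verify_instances(sequences, theta, d):
    """For each sequence, return the l-mer with maximum log-likelihood under theta,
    kept only if it lies within Hamming distance d of the consensus."""
    l = theta.shape[0]
    cons = consensus(theta)
    instances = []
    for s in sequences:
        best_j = max(range(len(s) - l + 1),
                     key=lambda j: sum(np.log(theta[k, IDX[s[j + k]]]) for k in range(l)))
        lmer = s[best_j:best_j + l]
        if sum(a != b for a, b in zip(lmer, cons)) <= d:
            instances.append((best_j, lmer))
    return instances

# Example with a sharply peaked 3-mer model whose consensus is "ACG".
theta = np.array([[0.85, 0.05, 0.05, 0.05],
                  [0.05, 0.85, 0.05, 0.05],
                  [0.05, 0.05, 0.85, 0.05]])
print(verify_instances(["TTACGTT", "GGAAGGG", "CACGAAC"], theta, d=1))
```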

3. Results and Discussion

Here, we first theoretically analyze the probability of s(x, y) = −∞ and give its formula. We then show the experimental results of APMotif on both simulated data sets and real biological data sets.

3.1. Analysis of Similarity Matrix

It has been pointed out in [29] that the sparsity of the similarity matrix leads to fast computation, since message propagation need not be performed between two points i and k if s(i, k) = −∞.

Consider two random l-mers x and y, coming from different sequences, that both differ from the same l-mer z in up to d positions; the distance relationships between x, y, and z satisfy d_H(x, z) ≤ d and d_H(y, z) ≤ d. Let p_1 denote the probability of d_H(x, z) ≤ d and d_H(y, z) ≤ d over the corresponding sample space. Because x and y are independent of each other, p_1 can be calculated as in (18).

Let p_2 represent the probability that the Hamming distance between two random l-mers x and y is more than 2d.

Using the theorem of total probability, we have (19), where the last factor represents the conditional probability given d_H(x, z) ≤ d and d_H(y, z) ≤ d.

Next, we discuss how to calculate this conditional probability.

According to d_H(x, z) ≤ d and d_H(y, z) ≤ d, we can obtain (20).

For given values of d_H(x, z) and d_H(y, z), (20) can be written as (21).

For each possible value, we can find all the 2-tuples that satisfy (21).

For each such 2-tuple, the conditional probability can be calculated as in (22).

Considering all the possible values and all the corresponding 2-tuples, we calculate the conditional probability as in (23).

According to (18) and (23), we can obtain (24).

The probability obtained in (24) is also the probability of s(x, y) = −∞ for the case in which x and y come from different sequences.

Meanwhile, when the two l-mers x and y are in the same sequence, the probability of s(x, y) = −∞ in the similarity matrix depends on t, the number of sequences.

For the (15, 4) problem instance with t = 20 sequences, the probability of s(x, y) = −∞ obtained from (24) together with the same-sequence case is 0.8405 when the sequence length varies from 100 to 1000.

In Table 1, by enumerating all the entries of the similarity matrix, the empirical result shows that the entries with s(x, y) = −∞ account for roughly 84% of the similarity matrix, which is consistent with the theoretical analysis.

3.2. Results on Synthetic Data Sets

We generate the synthetic data sets as follows: first, we generate a motif of length l and t independent and identically distributed (i.i.d.) sequences of length n. Then, into a random position in each sequence, we implant an instance that differs from the motif in up to d positions.
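A sketch of this generation procedure is given below; the helper names and the random seed are ours.

```python
import random

def mutate(motif: str, d: int, rng: random.Random) -> str:
    """Return a copy of `motif` with up to d randomly placed substitutions."""
    s = list(motif)
    for pos in rng.sample(range(len(s)), rng.randint(0, d)):
        s[pos] = rng.choice([b for b in "ACGT" if b != s[pos]])
    return "".join(s)

def make_dataset(l: int, d: int, n: int, t: int, seed: int = 0):
    """Generate t i.i.d. sequences of length n, each with one planted (l, d) instance."""
    rng = random.Random(seed)
    motif = "".join(rng.choice("ACGT") for _ in range(l))
    sequences, positions = [], []
    for _ in range(t):
        seq = [rng.choice("ACGT") for _ in range(n)]
        start = rng.randrange(n - l + 1)
        seq[start:start + l] = mutate(motif, d, rng)
        sequences.append("".join(seq))
        positions.append(start)
    return motif, sequences, positions

motif, seqs, starts = make_dataset(l=15, d=4, n=600, t=20)
print(motif, starts[:3])
```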

The nucleotide level performance coefficient (nPC) defined by Pevzner and Sze [2] is used to evaluate the motif prediction accuracy: nPC = |K ∩ P| / |K ∪ P|, where K is the set of base positions in the known motif instances and P is the corresponding set of base positions in the predicted motif instances. The value of nPC is between 0 and 1; the larger the value of nPC, the higher the accuracy of the predicted motif.
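The coefficient can be computed directly from the known and predicted occurrence positions, as in the following sketch; the helper that expands sites into base positions is our own.

```python
def positions(sites, l):
    """Expand (sequence index, start position) pairs into the set of covered base positions."""
    return {(i, start + k) for i, start in sites for k in range(l)}

def nPC(known_sites, predicted_sites, l):
    """Nucleotide-level performance coefficient |K ∩ P| / |K ∪ P| (Pevzner and Sze)."""
    K, P = positions(known_sites, l), positions(predicted_sites, l)
    return len(K & P) / len(K | P) if K | P else 1.0

# Two sequences, motif length 6; the prediction in sequence 1 is shifted by two bases.
known = [(0, 10), (1, 40)]
predicted = [(0, 10), (1, 42)]
print(round(nPC(known, predicted, l=6), 3))  # ~0.714
```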

Table 2 shows the comparison of the mean nPC obtained by APMotif and four other representative algorithms: MEME [16–18], Gibbs sampling [20], Projection [21], and VINE [28]. For each (l, d) combination, all five algorithms are run once on each of 10 randomly generated sets of input sequences. APMotif is a simple and effective method that groups the significant l-mers to form the optimal clusters, so that the motif instances can be predicted with high accuracy. APMotif has the highest mean nPC on the instances (11, 3), (12, 3), (18, 6), and (19, 7) and the second highest mean nPC on the instances (15, 4) and (16, 5), which shows that APMotif is relatively robust across various problem instances.

In Table 3, we compare the nPC of APMotif on problem instances with longer background sequences. The longer a sequence is, the more noisy l-mers it yields, which makes it more difficult to discover the true motifs. We fix the instance as (15, 4), one of the most popular benchmarks for the motif discovery problem, and vary the sequence length from 100 to 1000. For each setting, 10 i.i.d. data sets are generated, each containing 20 sequences. The nPC of APMotif is over 95% for sequences of various lengths between 100 and 1000, much greater than that of Projection, MEME, Gibbs sampling, and VINE. The performance of APMotif is stable over the sequence length because APMotif has a strong ability to filter out noisy l-mers. As the sequence length increases, APMotif still maintains its advantage in prediction accuracy. For example, when the sequence length is 1000 bps, the nPC of Projection, MEME, VINE, and Gibbs sampling is 88%, 76%, 91%, and 8%, respectively, while APMotif shows its advantage with an nPC of 95%.

3.3. Results on Real Biological Data Sets

First, the performance of APMotif is evaluated on the five widely used real data sets discussed in [21]: preproinsulin, dihydrofolate reductase (DHFR), c-fos, metallothionein, and Yeast ECB. Because no information about the length l and the mutation number d of the true motif is known in advance, we select the values used for each real data set as follows: the value of l is fixed as the length of the reference motif, and the value of d is the minimum that ensures that the predicted instance contains the reference motif. Table 4 shows that the predicted motifs returned by the APMotif algorithm are almost consistent with the reference motifs. In Figure 2, the software WebLogo [32] is used to show the sequence logos of the predicted motifs, which graphically display the degree of motif conservation measured by relative entropy.

For the five real data sets, Figure 3 compares the nucleotide level performance coefficient of APMotif with that of other popular algorithms. For Yeast ECB, the nPC of APMotif, MEME, Projection, and VINE is 1, which indicates that the prediction results are completely correct. For c-fos, preproinsulin, DHFR, and metallothionein, the nPC of APMotif is 0.28, 0.68, 0.37, and 0.82, respectively, greater than that of the three other widely used motif finding algorithms.

In addition, we show the prediction performance of APMotif on the Tompa data sets [33], which serve as a benchmark for testing motif discovery algorithms. For most Tompa data sets, the distribution of motifs across the sequences makes it difficult to report the motif occurrence positions, so we select a subset of the Tompa data sets. When a sequence contains more than one motif, it is difficult to discover all the motifs; when some sequences do not contain any motif, it is difficult to discover motifs in the remaining sequences. Overall, most algorithms have very low prediction accuracy on the Tompa data. To improve the prediction accuracy, different algorithms should be executed together to complement each other.

Figure 4 shows the prediction accuracy (nPC) of APMotif and MEME on each selected Tompa data set. We observe that, for some data sets, such as hm08r, hm19r, mus03r, dm04r, and yst02r, the nPC of APMotif is better than that of MEME, but for some other data sets, such as hm23r, mus04r, dm01r, and yst06r, the nPC of MEME is better than that of APMotif. This phenomenon illustrates the practical significance of combining the results of APMotif and MEME to improve the ability of motif discovery algorithms to identify TFBSs in higher eukaryotes [33].

4. Conclusions

The planted motif search (PMS) problem arises from the need to find transcription factor binding sites (TFBSs) in DNA sequences. In this paper, we propose a new approximate algorithm, APMotif, which to some extent overcomes the local optimum drawback inherent in EM motif-finding algorithms and guarantees that most motifs can be discovered for specific settings. APMotif first constructs clusters by computing the Hamming distance between each l-mer in the reference sequence and all the l-mers in the other sequences, and then it uses AP clustering combined with two metrics to select the highly conserved clusters for the EM refinement process. After the EM refinement, the cluster with the maximum information content provides the verified motif instances. The experimental results on the synthetic data sets show that APMotif indeed removes most useless background information to obtain motifs with high accuracy. The experimental results on the real biological data show that APMotif can discover all or a large part of the motif instances. In summary, the APMotif algorithm outperforms the compared algorithms with a significant improvement in prediction accuracy.

In recent years, the introduction of ChIP-Seq data has raised new challenges for the motif discovery problem from the perspective of data scale. Most existing motif discovery algorithms proposed for small data sets are inefficient in dealing with ChIP-Seq data. Since APMotif performs AP clustering multiple times, with each clustering run independent of the others, APMotif is well suited to parallel computing. In our future work, we plan to parallelize the APMotif algorithm to speed it up, so that the improved algorithm can deal with large data sets efficiently.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported in part by the National Natural Science Foundation of China (61173205 and 61373044).