BioMed Research International

Volume 2015 (2015), Article ID 853461, 10 pages

http://dx.doi.org/10.1155/2015/853461

## An Affinity Propagation-Based DNA Motif Discovery Algorithm

School of Computer Science and Technology, Xidian University, Xi’an 710071, China

Received 3 January 2015; Revised 10 June 2015; Accepted 11 June 2015

Academic Editor: Graziano Pesole

Copyright © 2015 Chunxiao Sun et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Planted motif search (PMS) is one of the fundamental problems in bioinformatics and plays an important role in locating transcription factor binding sites (TFBSs) in DNA sequences. Identifying weak motifs and reducing the effect of local optima remain important but challenging tasks for motif discovery. To address these tasks, we propose a new algorithm, APMotif, which first applies Affinity Propagation (AP) clustering to DNA sequences to produce informative candidate motifs and then employs Expectation Maximization (EM) refinement to obtain the optimal motifs from these candidates. Experimental results on both simulated and real biological data sets show that APMotif usually outperforms four other widely used algorithms in prediction accuracy.

#### 1. Introduction

Transcription factor binding sites (TFBSs) are short, conserved nucleotide fragments (usually ≤ 30 bp) in the cis-regulatory regions of genes in DNA sequences. They interact with transcription factors (TFs) and affect gene expression. Identification of TFBSs, that is, motif discovery [1], is a fundamental problem because of its importance for understanding the structure and function of gene expression.

In this paper, we focus on the planted motif search (PMS) problem [2], a widely accepted formulation of the motif discovery problem. Given a set of input DNA sequences of length *n* and two nonnegative integers *l* and *d*, the aim of PMS is to find an *l*-mer (a string of length *l*) that occurs in each of the sequences with up to *d* mutations. This *l*-mer is called a motif, and each of its mutated occurrences is called a motif instance.
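The PMS condition can be checked directly for a given candidate. The following is a minimal sketch, assuming mutations are substitutions measured by Hamming distance; the function names and the toy sequences are made up for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def occurs_with_mutations(motif, seq, d):
    """True if some l-mer of seq is within Hamming distance d of motif."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))


def is_planted_motif(motif, seqs, d):
    """True if motif occurs in every sequence with at most d mutations."""
    return all(occurs_with_mutations(motif, s, d) for s in seqs)


# Toy input: ACGT appears exactly in the first sequence and with one
# substitution in the other two (AGGT and ACTT).
seqs = ["TTACGTTT", "GGAGGTCC", "CACTTAAA"]
print(is_planted_motif("ACGT", seqs, d=1))
print(is_planted_motif("ACGT", seqs, d=0))
```

With `d=1` every sequence contains an instance, whereas with `d=0` the second and third sequences do not.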

Existing algorithms for the PMS problem fall into two main categories. The first comprises exact algorithms, most of which use consensus sequences [3] to represent motifs and are guaranteed to obtain the optimal motif. Recent research on exact algorithms concentrates mainly on pattern-driven algorithms: all string patterns of length *l* are taken as candidate motifs, and the patterns that occur in all input sequences with up to *d* mutations are reported as motifs. Typical pattern-driven algorithms use various means to reduce time complexity [4–10]. PairMotif [4] selects multiple pairs of *l*-mers with relatively large distance from the input sequences to restrict the search space; compared with recently proposed algorithms, it requires less storage space and runs faster on most PMS problems. PMS5 [7] computes the common *d*-neighbors of three *l*-mers using an integer programming formulation and is efficient on the difficult (*l*, *d*) instances (21, 8) and (23, 9). Other pattern-driven algorithms index the input sequences with a suffix tree to speed up the search for candidate motifs [11–14]. RISOTTO [11] is the fastest algorithm in the family of suffix tree algorithms for the PMS problem and can solve the instance (15, 5) in 100 minutes. The initial search space of pattern-driven algorithms is $4^l$. Pattern-driven algorithms are therefore feasible for small motif lengths *l*, but their running time and space requirements grow quickly as the motif length increases.
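To make the size of the pattern-driven search space concrete, here is a brute-force sketch that enumerates all $4^l$ patterns. It is not any of the cited algorithms, which exist precisely to avoid this exhaustive scan; the function names and toy sequences are illustrative assumptions.

```python
from itertools import product


def min_distance(pattern, seq):
    """Smallest Hamming distance between pattern and any l-mer of seq."""
    l = len(pattern)
    return min(sum(a != b for a, b in zip(pattern, seq[i:i + l]))
               for i in range(len(seq) - l + 1))


def brute_force_pms(seqs, l, d):
    """Test every one of the 4**l patterns against every sequence."""
    motifs = []
    for chars in product("ACGT", repeat=l):
        pattern = "".join(chars)
        if all(min_distance(pattern, s) <= d for s in seqs):
            motifs.append(pattern)
    return motifs


seqs = ["TTACGTTT", "GGAGGTCC", "CACTTAAA"]
motifs = brute_force_pms(seqs, l=4, d=1)
print(4 ** 4, "candidate patterns examined")
print("ACGT" in motifs)
```

The number of candidates is already about $10^9$ for *l* = 15, which is why the exact algorithms above invest in pruning and indexing.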

The second category comprises approximate algorithms, which commonly use position weight matrices (PWMs) [15] to represent motifs. They report results quickly but often get trapped in locally optimal solutions. Most approximate algorithms use statistical analysis to maximize a score function that measures how likely a subsequence of an input sequence is to be a motif instance [16–23]. MEME [18] and Gibbs sampling [20] are well-known approximate algorithms. MEME finds motifs by optimizing the PWMs using Expectation Maximization (EM). Extensions of MEME include Projection [21] and MCEMDA [22]. Projection hashes all *l*-mers from the input sequences into many buckets and then derives consensus sequences to select valid buckets; after this initialization step, the EM algorithm is used for refinement. MCEMDA modifies the EM algorithm by computing the expectation in the E-step numerically through Monte Carlo simulation. Gibbs sampling is a Markov Chain Monte Carlo (MCMC) approach, and several modifications of the Gibbs sampling strategy have been described [24, 25]. One that stands out is AlignACE [25], a Gibbs sampling algorithm for identifying overrepresented motifs in a set of DNA sequences. Furthermore, graph-theoretic methods based on either clustering or heuristic search have been introduced to motif discovery [26–28]. CRMD [26] uses entropy-based clustering to find good starting candidate motifs from the input sequences and then employs an effective greedy refinement to search for optimal motifs among the candidates. VINE [28] is a graph clustering algorithm that discovers motifs by finding cliques in a graph in polynomial time. In general, approximate algorithms run quickly and use little memory, but they sometimes fail to converge to the global optimum.
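As an illustration of the PWM representation, the sketch below builds a column-wise frequency matrix from aligned motif instances and scores an *l*-mer by its log-odds against a uniform background. The pseudocount, the uniform background of 0.25, and the toy instances are assumptions made for the example, not values from the paper.

```python
import math

ALPHABET = "ACGT"


def build_pwm(instances, pseudocount=1.0):
    """Return a list of columns; pwm[pos][ch] is the probability of ch at pos."""
    pwm = []
    for pos in range(len(instances[0])):
        counts = {ch: pseudocount for ch in ALPHABET}
        for inst in instances:
            counts[inst[pos]] += 1
        total = sum(counts.values())
        pwm.append({ch: counts[ch] / total for ch in ALPHABET})
    return pwm


def log_odds(pwm, lmer, background=0.25):
    """Log-likelihood ratio of lmer under the PWM versus the background."""
    return sum(math.log(col[ch] / background) for col, ch in zip(pwm, lmer))


def consensus(pwm):
    """Most probable character in each column."""
    return "".join(max(ALPHABET, key=col.get) for col in pwm)


instances = ["ACGT", "ACGA", "ACCT", "TCGT"]  # made-up aligned instances
pwm = build_pwm(instances)
print(consensus(pwm))
print(log_odds(pwm, "ACGT") > log_odds(pwm, "TTTT"))
```

Unlike a consensus string, the PWM keeps the per-position variability of the instances, which is what EM-based methods such as MEME optimize.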

In this paper, we propose a new algorithm, APMotif, to solve the motif discovery problem. APMotif first applies Affinity Propagation (AP) [29] clustering to DNA sequences to find highly conserved candidate motifs and then employs EM refinement to search for optimal motifs among the candidates. Experimental results show that APMotif has competitive prediction accuracy compared with previously developed algorithms.

#### 2. Materials and Method

Here, we first briefly describe the original Affinity Propagation clustering and Expectation Maximization algorithms used in the remainder of the paper. We then construct the similarity matrix for motif discovery. Finally, we describe the APMotif algorithm.

##### 2.1. Affinity Propagation (AP)

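Compared with other clustering approaches, AP clustering is effective and fast, especially for large data sets. Given a set of data points $x_1, \ldots, x_N$, AP clustering takes as input a collection of real-valued similarities $s(i, k)$ between pairs of points $x_i$ and $x_k$. Based on these similarities, AP clustering iteratively calculates two types of messages: the responsibility $r(i, k)$, reflecting the suitability of point $k$ as the exemplar for point $i$, and the availability $a(i, k)$, indicating how appropriate it would be for point $i$ to choose point $k$ as its exemplar. In the standard formulation of [29], the updates are

$$r(i,k) \leftarrow s(i,k) - \max_{k' \neq k} \{a(i,k') + s(i,k')\}, \tag{1}$$

$$a(i,k) \leftarrow \min\Bigl\{0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0, r(i',k)\}\Bigr\}, \quad i \neq k, \tag{2}$$

with the self-availability updated as $a(k,k) \leftarrow \sum_{i' \neq k} \max\{0, r(i',k)\}$.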

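Upon convergence, AP clustering selects a subset of data points as exemplars and assigns every nonexemplar point to exactly one exemplar. With $r(i,k)$ and $a(i,k)$ denoting the responsibility and availability messages, the exemplar $c_i$ associated with point $i$ is finally defined as follows:

$$c_i = \arg\max_{k} \{a(i,k) + r(i,k)\}. \tag{3}$$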

AP clustering terminates when the exemplars remain unchanged for a user-set number of iterations.
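The message-passing loop described in this subsection can be sketched in a few lines of plain Python. This follows the standard AP updates of [29] and is not the APMotif implementation; the damping factor, the stopping parameters, the preference value placed on the diagonal, and the toy one-dimensional points are all assumptions chosen for illustration.

```python
def affinity_propagation(S, damping=0.5, max_iter=200, stable_iters=15):
    """Cluster n >= 2 points given an n x n similarity matrix S.

    S[k][k] holds the "preference" of point k to become an exemplar.
    Returns labels, where labels[i] is the index of the exemplar of point i.
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    labels, unchanged = None, 0
    for _ in range(max_iter):
        # r(i,k) <- s(i,k) - max_{k' != k} [a(i,k') + s(i,k')], with damping
        for i in range(n):
            for k in range(n):
                best_other = max(A[i][j] + S[i][j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best_other)
        # a(i,k) <- min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        # a(k,k) <- sum_{i' != k} max(0, r(i',k))
        for k in range(n):
            pos = [max(0.0, R[j][k]) for j in range(n)]
            col_sum = sum(pos) - pos[k]
            for i in range(n):
                if i == k:
                    target = col_sum
                else:
                    target = min(0.0, R[k][k] + col_sum - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * target
        # exemplar of point i: argmax_k [a(i,k) + r(i,k)]
        new_labels = [max(range(n), key=lambda k: A[i][k] + R[i][k])
                      for i in range(n)]
        unchanged = unchanged + 1 if new_labels == labels else 0
        labels = new_labels
        if unchanged >= stable_iters:  # exemplars stable: stop
            break
    return labels


# Toy data: two well-separated groups on a line, similarity = -squared distance.
points = [0, 1, 2, 50, 51, 52]
S = [[-float((p - q) ** 2) for q in points] for p in points]
for k in range(len(points)):
    S[k][k] = -100.0  # preference (assumed value); lower gives fewer clusters
labels = affinity_propagation(S)
print(labels)
```

The number of clusters is not given in advance; it emerges from the preference values on the diagonal of `S`.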

##### 2.2. Expectation Maximization (EM)

In the EM algorithm, each of the given DNA sequences $S = \{S_i\}$ is modeled by two components, one for the motif positions and one for the nonmotif (“background”) positions in the sequence. The starting positions of the motif in each sequence are unknown and are represented by the variables (“missing data”) $Z = \{z_{ij}\}$, where $z_{ij} = 1$ if a motif starts at position $j$ in the sequence $S_i$, and $z_{ij} = 0$ otherwise.

The EM algorithm attempts to maximize the expectation of the logarithm of the joint likelihood of the model.

The main procedure of the EM algorithm iteratively repeats the following two steps:
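In generic EM notation, the two steps are commonly written as below. This is the textbook form, given as a sketch rather than the exact expressions numbered (4) and (5); the symbols $S$ (sequences), $Z$ (missing data), $\Theta$ (model parameters), and the iteration index $t$ are notational assumptions.

$$\text{E-step:}\quad Q(\Theta \mid \Theta^{(t)}) = E_{Z \mid S, \Theta^{(t)}}\bigl[\log P(S, Z \mid \Theta)\bigr]$$

$$\text{M-step:}\quad \Theta^{(t+1)} = \arg\max_{\Theta} Q(\Theta \mid \Theta^{(t)})$$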

In (4), the logarithm of the joint likelihood of the model is defined in terms of the vector containing all the parameters of the model, whose entries give the probability of each character occurring at either a background position or a motif position.

In (5), the conditional probability for a sequence containing a motif is defined in terms of an indicator vector, whose entries are all zeros except the one corresponding to the character at a given position in the sequence, and the set of background positions in the sequence.
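The EM refinement can be illustrated with a small sketch under simplifying assumptions: exactly one motif occurrence per sequence with a uniform prior over start positions, a fixed uniform background, and a PWM seeded from a candidate *l*-mer. These choices, the seed weights (0.7 and 0.1), the pseudocount, and the toy sequences are assumptions for the example and are not taken from the paper.

```python
ALPHABET = "ACGT"


def seed_pwm(seed, p_match=0.7):
    """Initial PWM that favours the seed character at every position."""
    p_other = (1.0 - p_match) / 3
    return [{ch: (p_match if ch == c else p_other) for ch in ALPHABET}
            for c in seed]


def em_refine(seqs, pwm, background, n_iter=20, pseudocount=0.1):
    """Alternate E- and M-steps; returns (pwm, z) with z[i][j] = P(start j in seq i)."""
    l = len(pwm)
    z = []
    for _ in range(n_iter):
        # E-step: posterior over start positions, proportional to the
        # motif-versus-background likelihood ratio of each window.
        z = []
        for s in seqs:
            weights = []
            for j in range(len(s) - l + 1):
                ratio = 1.0
                for k in range(l):
                    ratio *= pwm[k][s[j + k]] / background[s[j + k]]
                weights.append(ratio)
            total = sum(weights)
            z.append([w / total for w in weights])
        # M-step: re-estimate each PWM column from expected character counts.
        new_pwm = []
        for k in range(l):
            counts = {ch: pseudocount for ch in ALPHABET}
            for s, zi in zip(seqs, z):
                for j, p in enumerate(zi):
                    counts[s[j + k]] += p
            total = sum(counts.values())
            new_pwm.append({ch: counts[ch] / total for ch in ALPHABET})
        pwm = new_pwm
    return pwm, z


# Toy sequences with ACGTAC planted at positions 4, 5, 0 and 6; the last
# two instances carry one substitution each (ACGAAC and ACCTAC).
seqs = ["TTGGACGTACTTGG", "CCTTTACGTACAAA", "ACGAACTTTTGGCC", "GGTGTGACCTACGG"]
background = {ch: 0.25 for ch in ALPHABET}
pwm, z = em_refine(seqs, seed_pwm("ACGTAC"), background)
starts = [max(range(len(zi)), key=zi.__getitem__) for zi in z]
print(starts)
print("".join(max(ALPHABET, key=col.get) for col in pwm))
```

Because EM only climbs to a nearby local optimum, the quality of the seed matters, which is the motivation for producing good candidate motifs before refinement.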

##### 2.3. Construction of Similarity Matrix for Motif Discovery

In the original AP clustering, given two *l*-mers $x_i$ and $x_k$ from the DNA sequences, the similarity $s(i, k)$ is set as the negative Hamming distance between them; that is, $s(i, k) = -d_H(x_i, x_k)$ [29]. This measure cannot describe the properties of DNA sequence clustering effectively. Based on the feature of PMS that two motif instances of the same motif cannot differ in more than $2d$ positions, and on the maximum similarity principle, we employ pairwise constraints and a variable-similarity measure [30] to modify the similarity. The modified similarity (8) involves two parameters, one taking values in (1, +∞) and the other in (0, 1], and depends on which input sequence each *l*-mer belongs to.
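The baseline similarity and the distance constraint can be sketched as follows. The code computes the negative Hamming distance between all pairs of *l*-mers and discards pairs that differ in more than 2*d* positions, since two instances that each lie within *d* substitutions of the same motif cannot be further apart. It is a hypothetical illustration of the pairwise constraint only and does not reproduce the modified similarity of (8); the function names and toy sequences are assumptions.

```python
NEG_INF = float("-inf")


def extract_lmers(seqs, l):
    """All l-mers as (sequence index, start position, string) triples."""
    return [(i, j, s[j:j + l])
            for i, s in enumerate(seqs)
            for j in range(len(s) - l + 1)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def constrained_similarity(lmers, d):
    """Negative Hamming distance, with pairs further apart than 2*d excluded.

    Excluded pairs get -inf so they never exchange messages. The diagonal
    (the AP preferences) is left at -inf for the caller to set.
    """
    n = len(lmers)
    S = [[NEG_INF] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a != b:
                dist = hamming(lmers[a][2], lmers[b][2])
                if dist <= 2 * d:
                    S[a][b] = -float(dist)
    return S


lmers = extract_lmers(["ACGTA", "ACCTT"], l=4)  # ACGT, CGTA, ACCT, CCTT
S = constrained_similarity(lmers, d=1)
kept = sum(v != NEG_INF for row in S for v in row)
print(kept, "of", len(lmers) * (len(lmers) - 1), "pairs kept")
```

On real inputs most pairs of *l*-mers are further apart than 2*d*, so the constraint leaves only a small fraction of pairs to process.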

With the similarity in (8), the similarities between data points are more accurate, and only small subsets of the data points need to exchange messages, so AP clustering both gains clustering accuracy and takes less time. The theoretical analysis is given in Section 3.1.

For the two similarities, the original negative Hamming distance and the modified similarity in (8), Figure 1 compares runtime and clustering accuracy on the PMS instance (15, 4) with 20 sequences whose lengths range from 100 to 1000.