Advances in Artificial Intelligence

Volume 2009 (2009), Article ID 219743, 11 pages

http://dx.doi.org/10.1155/2009/219743

## Bayesian Unsupervised Learning of DNA Regulatory Binding Regions

^{1}Department of Mathematics, Åbo Akademi University, 20500 Turku, Finland^{2}Department of Mathematics, University of Linköping, 58183 Linköping, Sweden^{3}Department of Mathematics, The Royal Institute of Technology, 100 44 Stockholm, Sweden

Received 13 February 2009; Revised 6 June 2009; Accepted 2 July 2009

Academic Editor: Djamel Bouchaffra

Copyright © 2009 Jukka Corander et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Identification of regulatory binding motifs, that is, short specific words, within DNA sequences is a commonly occurring problem in computational bioinformatics. A wide variety of probabilistic approaches have been proposed in the literature to either scan for previously known motif types or to attempt de novo identification of a fixed number (typically one) of putative motifs. Most approaches assume the existence of reliable biodatabase information to build probabilistic a priori description of the motif classes. Examples of attempts to do probabilistic unsupervised learning about the number of putative de novo motif types and their positions within a set of DNA sequences are very rare in the literature. Here we show how such a learning problem can be formulated using a Bayesian model that targets to simultaneously maximize the marginal likelihood of sequence data arising under multiple motif types as well as under the background DNA model, which equals a variable length Markov chain. It is demonstrated how the adopted Bayesian modelling strategy combined with recently introduced nonstandard stochastic computation tools yields a more tractable learning procedure than is possible with the standard Monte Carlo approaches. Improvements and extensions of the proposed approach are also discussed.

#### 1. Introduction

A major body of genomic information coded in the DNA is represented by the regulatory regions, which control the chronology, extent, and way in which genes are expressed. A promoter is the most important regulatory region as it controls the first step of gene expression, that is, the transcription of messenger RNA. A brief summary of and introduction to the biology of promoters are given in [1]. The basic characteristic of a promoter is that it contains binding sites for proteins called transcription factors (TFs). The binding sites in the DNA reside near the gene controlled by the promoters. The interaction of several TFs with their corresponding binding sites regulates the activation or repression of the gene. Hence, the promoter architecture is fundamental for understanding gene expression patterns, regulation networks, and cell specificity. The binding sites of one single TF are not identical, but contain (i.e., tolerate) some variation. The shared content of the binding sites representing the same TF is typically summarized by specifying statistically the degree of conservation in a short (5–20 bases) DNA pattern. The DNA pattern can be understood as a word or a string on a 4-letter alphabet which may contain some variation over its positions when multiple realizations from the same generating source are considered. The substrings of DNA that correspond to the multiple realizations are here called *motif instances* from a gene regulatory binding *motif*. A set of long DNA strings may simultaneously harbor multiple different such motifs, that is, TF binding sites for different genes. In [2, 3] motifs are defined as sets of words of certain length, such that the number of mismatches to a consensus word is smaller than a prescribed threshold. Here, we use instead a probabilistic classification framework to specify the motifs by assigning them to classes in an unsupervised manner.

An important task in computational biology is to detect novel motifs using algorithms that are capable of reasoning from noisy measurements in terms of artificial intelligence. The core of the motif detection problem is to discover novel words whose length is within a biologically meaningful range, such that multiple copies of the words are hidden inside the promoter regions of a set of genes, when the genes have shared properties of, for example, the expression profile. Putative motifs found in a computational manner could thus correspond to binding sites of some common transcription factors regulating a particular set of genes. The algorithms for motif discovery must also be able to avoid suggesting as novel motifs such DNA patterns that represent the background variation in the sequences. In a nutshell, the discovery problem consists of detecting multiple copies of only partially conserved short words in a long string of the characters , such that the words are highly unlikely to occur under a stochastic null model describing general variation over consecutive letters within the string.

The problem may appear as rather straightforward at first sight. However, several of the involved computational challenges make motif discovery and identification a sincerely complex task [4]. The main obstacles can be listed as follows. A motif exhibits a certain degree of random variation (lack of conservation) in its contents, word lengths are unknown, the background (i.e., areas not containing the target words) structure of the DNA sequences is not random, and the space of putative candidates has an astronomic size. In addition, the motifs may appear in composite patterns, for example, as pairs of words with a fixed distance between them.

A considerable amount of effort has been devoted to solve such problems by statistical model families and by scanning algorithms detecting DNA pattern overrepresentation under a null background model; see, for example, [5–17]. Several additional related methodological papers can be found from the references of the quoted papers. In broad terms the methodological efforts for the above-mentioned problem can be divided into two categories, one where the target is to scan for the existence of *a priori* determined motifs within a set of sequences, and the other where one attempts to identify novel motifs in a given set of sequences. Our approach is solely concerned with the latter category, that is, we consider an unsupervised pattern recognition problem, although some brief remarks on the possibility of extending our method to a partially supervised situation are given in the final section.

Most statistical models for motif discovery use a Markovian probabilistic machinery, for example, ordinary Markov chains or hidden Markov models, to describe the observed patterns of the background DNA and to improve the motif specificity. However, some of the recent works, such as [11, 14], have demonstrated the particular potential of so-called variable-length (or variable-order) Markov (VLMC) models. Such models exemplify a rich class of probabilistic structures for which statistical learning theory has been developed much earlier in the general context [18–20], and which are able to compress information efficiently through a relatively sparse parameterization, while not bargaining the expressiveness of the probabilistic characterization.

In the current paper, the VLMC model is used in two ways. Firstly, candidates of motif instances are obtained by fitting the VLMC model to the sequence data using the algorithm from [20] with the implementation due to [19], and then calculating probabilities for word counts using a compound Poisson approximation. The words which are most improbable under the null background model represent natural candidates of motif instances and can thus be used for an efficient initialization of an unsupervised learning process. Special attention has been devoted in the motif identification literature to the calculation of word count probabilities associated with any particular substrings within the investigated sequences under a null background DNA model see; [17, 21–23]. However, the earlier works have not considered the calculation of such probabilities under a VLMC model, but only under ordinary Markov models of fixed length. Secondly, sufficient statistics (multinomial counts) are obtained from the sequence data under the VLMC context tree identified by the algorithm of [19], and these are used in the Bayesian unsupervised model learning to identify different types of motifs. However, it should be noticed that the sufficient statistics under the VLMC model are recalculated in the unsupervised stochastic learning steps when putative motif instances are shifted along the sequence and when they are reinserted to the background model.

Many Bayesian pattern recognition methods are principally based on the concept of latent class models, which were already early used in motif detection [5, 8]. Standard Bayesian Markov chain Monte Carlo (MCMC) computation tools, such as the Gibbs sampler [24], are available for models with *a priori* fixed numbers of classes. However, numerical convergence and mixing problems for such methods are notorious for challenging applications and motif discovery is not an exception in this respect, for example, see the discussion in [8]. The computational problems may arise already in a situation where only a single motif type is considered in the *de novo* detection context. Such problems are further accentuated in the currently considered framework, where the aim is to make formal statistical inference about the simultaneous presence of multiple putative *a priori* unknown motif types. Bayesian model-based inference in such a setting has been previously considered only by [15], where a reversible jump MCMC algorithm [25] was used for model learning. Motivated by the findings of [26] on general convergence issues with reversible Markov chains, we apply here instead the nonreversible Metropolis-Hastings algorithm they introduced for complex model learning applications.

The structure of this paper is as follows. First, we introduce the background DNA model based on a variable-length Markov process and consider a framework for obtaining putative candidates of motif instances by calculating DNA pattern probabilities under the background model. Thereafter, the unsupervised classification model is derived and in Section 4 we develop a parallel MCMC algorithm for making posterior inferences using a general class of nonreversible Metropolis-Hastings algorithms. Section 5 provides some illustrations of the developed method, and some remarks are given in the final section.

#### 2. A Variable Length Markov Process as the Background DNA Model

Markov chains of higher-order memory were proposed a while ago for computing the expected frequencies of the nucleotides observed outside the motifs, that is, the background information in DNA; see [27] and the references therein. In particular, a third order stationary Markov chain is often chosen; see, for example, [8]. Here we consider a more general model that can handle longer Markov memories in a parsimonious manner. The background model will be formulated by means of the notion of a context of a symbol. This has been introduced to define a variable-length Markov chain (VLMC) as well as a universal finite memory source [18–20, 28]. Only brief details of VLMC models are given here, for a comprehensive treatment we refer the reader to the original references.

We consider a DNA sequence , which can be a long concatenation of strings of possibly varying length. Let denote a finite substring in . The values of this string are denoted as , where refers to an arbitrary position within the string , and refers to the set of bases in the DNA. Let be a string in , which starts at position and ends at position . Notice that, for the string is written in the reverse order. In the sequel both and will be used as generic indices of sequence positions, when defining random processes. A repeated use of the chain rule for a random process yields (written using the short notation ) We introduce next the idea of a memory of variable order, in the sense that the conditional probabilities in the above product can depend only on a context relevant for them. Formally, we define the context as a (variable projection) function which maps the whole sequence past into a possibly shorter string.

*Definition 1. *A context is a map , where (the context length) is defined by

Note that the superscript in and refers to the first element of any string that is given as an input argument to the function . Such an input string itself may in the sequel be indexed with arbitrary sub- and superscripts, such as , when it refers to a particular part of a string . It should also be noted that the context is independent of , and it can thus be called a stationary context.

*Definition 2. *Let
If , then is a stationary variable-length Markov chain (VLMC) of order .

Since the context is stationary, (1) can be written as

We recall the th-order Markov property defined by the equality of the conditional probabilities for all . If the context function turns out to be for all , then the process is a full Markov chain of order . In other words, a VLMC of finite order is then a full Markov chain of order with a variable memory, and the minimal state space represented by the VLMC contexts has no impact. However, in practice, the minimal state space represented by the VLMC contexts has a considerable impact, since in the genomic applications an estimated VLMC typically has an order of 7 while the size of the set of contexts is small. In contrast, the memory requirements for the implementation of a full 7th-order ordinary Markov chain are enormous, which hampers the estimation and use of such models.

Using the contexts from a VLMC as such may lead to certain problems for a practical implementation with regard to predicting the next context based on the last one. Such problems are illustrated by the following small example.

*Example 1. *Let . If and , then . However, we cannot determine the probability of transition from to . Note that the superscript in the argument of refers to a position in the original string, whereas the superscript in the Definition 1 refers to the first element in an arbitrary input argument.

A definition that holds for finite data is also needed to extend the VLMC model to a finite string, such that probability transitions between the contexts can be calculated. The finiteness of real data and transition problems in Example 1 is handled by replacing the context length function with according to the following definition.

*Definition 3. *Let
such that corresponding to replaces .

Note that for the largest contexts, implying that can be constructed recursively starting from the largest contexts. The presence of in (6) ensures that the context length is well defined for finite sequences and the part with the maximum ensures that the next context can be predicted from the current context .

The motif instance candidates are identified by assuming the corresponding words to be overrepresented in the string with respect to the background VLMC model. Let be a stochastic number of nonoverlapping instances of a word calculated from the left, that is, more specifically equals the number of , such that no prefix of a is a suffix to another in . We then seek to compute for the observed occurrences , within the range and .

While it is theoretically possible to calculate exactly for small as shown in [29], this does not work for the values of encountered in practice. Firstly, this is due to computational difficulties, and more importantly, the numerical difficulties of the techniques currently available. Instead, we approximate (7) using a compound Poisson distribution by invoking the results in [30]. The compound Poisson distribution refers to the probability distribution of the sum of random number of random variables where are independent, and is independent of , for any positive integer and every , and is Poisson distributed, that is, .

The compound Poisson approximation is here implemented for sequences that are similar to those considered in [31]. We recall that the background model can be seen as a th-order Markov chain. Any th-order Markov chain with state space can be transformed into a first-order Markov chain on the state space . In the actual implementation the state space is designed under the target to keep the number of states low. However, for the discussion below, we assume simply that the model is represented with a state space such that the corresponding r.v. can be constructed to imply a first-order Markov chain with respect to (see Example 2 below).

It is assumed that any putative motif instance can be unambiguously defined as a sequence of states (i.e., strings), where any individual element . Additional states explicitly representing the transitions in the path are added to the state space , which leads to the state space . The probability of generating the sequence from other states than these special states is set to . Then, every time the first-order Markov chain visits the end state of the path, a motif instance candidate has been visited in the DNA sequence. This generic principle is illustrated in Example 2 where the candidate string equals .

*Example 2. *Suppose that and . Then , and the original transition probability matrix
is transformed into
where and .

As seen from the above, the compound Poisson random variable approximates the number of occurrences of in a sequence. In the implemented model the First success distribution (Fs) is used for , that is, and . Let and set the compound Poisson approximation parameters as in [30], that is, and . In general can be chosen rather freely, for example, to minimize the bound in [30]. In the practical implementation is instead set to the random variable representing the starting context of the motif, as the execution time for the currently available algorithm for evaluating the approximation bound in [30] is 1 day on a standard computer, and in this context the typical number of putative motif instance candidates to be evaluated is larger than 10000.

#### 3. An Unsupervised Stochastic Learning Framework for Motif Discovery

The present learning model for motif detection operates by sequentially inserting randomly positioned contiguous windows onto the considered total DNA string and classifying their contents into groups representing putative motif types, such that the number of groups is unknown *a priori*. From an evolutionary perspective, the substrings inside the windows can also be viewed as fragments inserted into specific positions of a previously existing DNA sequence.

We now summarize the concepts for the motif detection task as follows.

(i)A putative gene regulatory*motif*can be interpreted as a statistical model of a binding site which defines for any particular DNA string of an appropriate length how well it fits to that site.(ii)A

*motif instance*can be interpreted as an actual realization of the DNA under a particular regulatory motif model, which specifies a probability distribution over the bases at the positions belonging to the motif. The unknown (multinomial) parameters that specify this joint distribution are nuisance parameters and they are thus integrated out when making Bayesian inference.(iii)A

*motif type*refers here to any particular motif. The type can be represented by a class of motif instances that are judged by the Bayesian model to correspond to the same motif. We assume that the number of motif models discoverable from the considered DNA sequence is

*a priori*unknown and a primary target of the statistical learning. A putative gene regulatory motif type will be here indexed by , , such that is a stochastic number of motif types, each of which corresponds to an

*a priori*unknown number of motif instances that are localized in .

A representation of the components included in our unsupervised model is shown in Figure 1, with the used notation explained in detail below. Now, given , let be a stochastic number, or the abundance, of motif instances representing the motif type , governed by the probability . Further, we let the common length of the motif instances be within the range , such that and are positive integers with . Here , are the lower and upper bounds, respectively, on the motif length that can be modified using biological domain knowledge; see, for example, [11].

Given the outcomes of , there are in total motif instances to be allocated onto , such that an arbitrary motif instance of type contains contiguous positions. Let index an arbitrary motif instance of type . Within , a motif instance is represented by the pair identifying a substring which contains the motif instance of the length .

The substring representing the motif instance is explicitly denoted as The random variables corresponding to the bases observed within a motif instance are governed by the probabilities , such that represents the probability of observing base at position among the instances of type . Generally, this parametric construction is referred to as a weight matrix in the bioinformatics literature. Given the motif types with the unknown underlying distributions, the bases observed at each position are here modeled as conditionally independent multinomial trials with respect to any type.

An allocation of the motif instances onto implies that the sequence is split into two distinct parts, one containing the strings , and the other containing the remainder of representing the background. Hence, we consider the sequence structure in terms of a model of randomly inserted motif instances as in [32].

Let now denote a particular configuration of a motif allocation model as described above. Thus, designates motif instances of type , allocated in random positions in a generic sequence. In probabilistic terms, partitions the data into conditionally independent groups. Let be the space of all eligible such partitions. The partitions are defined using a set of random variables according to Both the motif instance classification structure and the sequence realizations are thus embedded into the set of . An arbitrary realization of these is denoted in the sequel by . Note here that the motif data depend only on the motif type and position in the motif instance, not the surrounding data. The background model is restarted, with the same probability law, after every motif instance, but we do not burden the notation further by this.

The unsupervised motif discovery model is now formally defined by the joint probability where is a the marginal distribution (likelihood) of the data given model and is a probability distribution representing our prior uncertainty about different models.

Here, motif discovery task is formulated as joint maximization of the posterior distribution of the background model and the partition of the motif instance candidates into classes representing distinct motif types, which leads to where The joint maximization reflects the fact that the marginal likelihood of the sequence data under the background model increases when substrings that have a low probability of occurring are instead considered as motif instances. On the other hand, increasing the number of motif types and their lengths increases the number of multinomial parameters that are needed to represent the probabilities of observing the different bases over the motifs. Thus, the model must target a balance between these two aspects when aiming at a simultaneous identification of the posterior optimal number of motif types , as well as allocation and alignment of motifs into the classes representing the different types.

To enable efficient identification of the posterior optimal model structure we will use a constant prior for the structural layer of our model. Notice that the space of eligible model structures depends on the *a priori* limits specified for motif lengths and number of motif types. A constant prior can be described to conceptually arise through the following generating model for a set of fragments randomly inserted within a DNA sequence. Consider an initial sequence of length . Given a stochastic total number of fragments (i.e., all motif instances), one can choose the first positions for the fragments in ways, such that for each of them, a corresponding motif instance of a certain length will be inserted into the sequence between that position and the subsequent one. The probability for any particular arrangement of the fragments is thus . Furthermore, every fragment can now be chosen with the probability from the alternative sources, which specifies the generating model for the prior probabilities of the structural layer. By specifying suitable distributions for and , a uniform distribution for the structure is implied, similarly to the random urn model considered in [33].

Next we derive an expression for the marginal likelihood under the classification model using an approach similar to that adopted in [33]. Let be an indicator function that has value one if the value is found at position in the motif and has value zero otherwise. Then is the number of times the symbol appears at position in the motifs in class and is the total number of motif instances of type . Similarly, let and be the number of times appears after the context denoted by and the number of times the context appears, respectively, where the context can have parts belonging to both the background and one or several motif instances. Recall that the context refers to a particular string with elements from the alphabet . Now we can continue with the actual calculation of .

We use different priors for the probabilities in the background and the motifs. Both are Dirichlet, however, with hyperparameters (for the background) and (for the motifs). Dirichlet distribution is the standard choice of a prior in Bayesian modeling of DNA sequences, and in addition to the computational convenience related to this prior, there are theoretical arguments supporting it in a variety of different contexts, see, for example, the discussion in [33, 34]. Let and let us next introduce and the Dirichlet density on . Another prior is considered for the motifs on the simplex Let us set as well as Consider now the conditional probability , which can be factorized using the chain rule of probabilities as Here (12) and the VLMC property in the sense of (4) are used, resulting in By combining the above distributions, it is possible to derive the marginal data distribution analytically for our classification model. The formula for the marginal distribution is given in Theorem 1, for which a proof is provided in appendix.

Theorem 1. *If is given in (25) and is as in (23), then the marginal likelihood under the classification model equals
*

#### 4. Learning Sequence Partitions by Stochastic Search Operators

Standard MCMC algorithms such as the Gibbs sampler or Metropolis-Hastings algorithm [24] have regularly been used for the earlier inferential problems associated with motif discovery [5]. However, numerical convergence and mixing problems for such methods are burdening the motif discovery task. Furthermore, as the objectives here are even more challenging due to the *a priori* unknown number of motif types, standard stochastic computation is not expected to provide a feasible strategy. The unsupervised learning model for motif discovery developed in the previous section has the characteristics that enable the use of the nonstandard Bayesian computational strategy as developed in [26]. We will now describe the search operators embedded in the nonreversible algorithm.

Let denote a generic fixed distribution that assigns probabilities on the space , conditional on the model structure . A nonreversible Metropolis-Hastings algorithm may then be constructed by defining its transition kernel according to the acceptance probability where is a candidate state (model structure), generated from a current state under . The important difference between this algorithm and the ordinary reversible Metropolis-Hastings is the lack of the ratio of the proposal probabilities which ensures that the stationary distribution of the generated Markov chain equals the sought posterior distribution. However, as shown by [26], consistent estimates of the model posterior probabilities can still be constructed from a realization of the nonreversible chain. The major strength of this algorithm is that it can use more complex proposal mechanisms than the ordinary reversible MCMC algorithms, when it is not feasible in practice to calculate analytically the proposal probabilities. Also, the direct utilization of the analytically calculated marginal likelihood avoids the effects of Monte Carlo error in the estimation of the underlying parameters, for example, as compared to ordinary Gibbs sampler estimation where values for all model parameters are sequentially sampled. The latter advantage has been discussed also in the earlier motif discovery literature, where collapsed Gibbs samplers have been used for the model learning [5, 8].

In addition to the nonreversible transition kernel of the algorithm, we utilize parallel interacting search processes analogously to [26]. The search process as a whole can be defined as follows.

Let and be stochastic processes defined as follows.

(1)Define a sequence of strictly decreasing probabilities , such that , and as .(2)Define the stochastic process as , and , , independently for .(3)Let , , be arbitrary initial states of . Given a realization , the transition mechanism of the processes depends on values of according to steps (4) and (5).(4)For each , such that , transition from to the next state is determined according to the probability (27) under the proposal distribution , for .(5)For each , such that , transition from to the next state is determined according to the following distribution over the set of candidate states: independently for .The processes defined above are not time homogeneous Markov chains. However, as , their transition probabilities converge to those of the time homogeneous Markov chain defined in (27). The parallel interacting processes are defined to yield an efficient, yet consistent scheme for exchange of information between them. The processes have a tendency to coalesce towards the states which are associated with higher marginal likelihoods. Also, when multiple model structures with roughly equal marginal likelihoods are present, the probabilities (28) lead to a more dispersed proposal distribution.

One of the major challenges in the MCMC computation is the need to design proposal operators for new models. Therefore, our emphasis is on developing intelligent proposals that are able to discriminate the information content of different segments of the data. We now specify the search operators that generate candidate states according to the distribution , which in fact consists of several components as typically in MCMC. Some of these are intelligent operators similar to those used in protein sequence classification by [35], and some are locally random operators similar to those used in [26].

Assume first that a nonempty set of initial candidates of motif instances has been made available out of the total sequence . Such candidates could also be continuously chosen and/or discarded as a part of the nonreversible algorithm according to a stochastic search operator under our learning model. However, to limit the computational burden, we chose in our experiments to utilize a number of results from renewal theory and compound Poisson approximations to provide a set of putative motif instances (as explained in Section 2).

To identify an optimal classification of the candidate motif instances we use for each process repeatedly the following four search operators in the transition kernel: Merge, Split, Slide, and Move. These are defined as follows.

(1)*Merge*. Let be the motif type with the largest index. For each pair of motif types find the optimal alignment of the motif instances with respect to the marginal likelihood, when the instances are merged into a single group. When the lengths of the motif instances in the two classes are distinct, the bases in the shorter strings that are lacking in a column of any particular alignment are treated as missing data. The marginal likelihood can then still be calculated as in (26), because the expression is based on the sufficient statistics (counts) arising for each column in the alignment. If any of the mergings results in a higher marginal likelihood than , use that model structure as the proposal value , else use .(2)

*Split*. Choose a motif type randomly. Calculate the pairwise Hamming distances of the corresponding motif instances and cluster them using the standard hierarchical single linkage algorithm. Split the group of motif instances optimally into two subgroups according to the hierarchical clustering.(3)

*Slide*. Choose a motif type randomly. Slide the corresponding motif instances randomly backwards or forwards and change their length randomly. Both the sliding and length change are preformed with respect to a simple uniform proposal distribution. The sliding step relocates the motif instance maximally 5 bases backwards or forwards with equal probabilities for all possible configurations, apart from those that would place the motif instance outside of the sequence. The length is changed by 1 or 2 bases, by randomly either shortening or extending the motif instances with equal probabilities. If the resulting sequence does not satisfy the prior limits , the proposal will be discarded and a new value is generated from the original motif instance.(4)

*Move*. Choose a motif instance randomly among all instances and choose a motif type randomly among the remaining types. Slide the motif instance randomly backwards or forwards and group it with the motif type .

In addition to the above operators, both randomly chosen single motif instances and motif types are proposed to be reinserted into the background model. Since there is a strictly positive probability of associating each motif instance with any type or the background by a successive use of the proposal steps, the MCMC framework is irreducible. Consequently, the following result holds for the stochastic learning, as the algorithm introduced here satisfies the general conditions stated in [26].

Theorem 2. *Let be a model space for any given set of motif instance candidates, and let be the part of the space explored at time by the search process defined by the parallel nonreversible Metropolis-Hastings algorithm defined above. Let
**
Then
**
as .*

#### 5. Empirical Illustration

To briefly illustrate the unsupervised learning framework we have generated simulated data for the motifs reb1, matalpha2, pdr1/3 from the SCPD database [36]. The simulation was performed by first estimating a VLMC based on the SCPD background sequence for these motifs. Then, a sequence was generated using the estimated VLMC model, into which motif data was randomly inserted. Each inserted motif instance was generated using the positional weight matrices of the corresponding motifs [36]. This resulted in a sequence of approximately 10000 nucleotides.

To obtain a set of motif instance candidates, we applied the following procedure. First, we neglected the words that occur in less than 4 times, because they are deemed to have very small chances of being conclusively judged not to belong to the background DNA sequence. Then, probabilities for the remaining substrings were ranked. Those among the 100 most improbable according to the background model, with respect to , were finally considered as candidates for being motifs. The prior limits of motif length were set according to the interval [7, 15] and the upper bound was used for the number of motif types.

The behavior of the MCMC optimization for parallel 50 processes is shown visually in Figure 2. Out of the 80 motif instances (30 reb1, 30 matalpha2, 20 pdr1/3) embedded in the SCPD-like sequence of length approximately 10000 nucleotides, our algorithms were able to identify 64. The adjusted Rand index (see, [37, 38]) between the optimal unsupervised classification of the candidates and the underlying true classification is 0.9623, which indicates a high fidelity of the results for those motif instances that were found.

We also tested the algorithm on real data under comparable prior settings for the motifs MCB and PDR1/3 from SCPD. This was done by providing the upstream regulating regions containing MCB and PDR1/3 (with a total length of 6149 nucleotides) as an input data set for the algorithm. This resulted in 7/32 motifs identified (2 MCB and 5 PDR1/3), which were all correctly grouped with respect to the classification in SCPD.

#### 6. Discussion

The unsupervised classification model was developed under the assumption that the bases at the different positions of a motif instance are conditionally independent given the motif type. This assumption is analogous to the strategy used in a majority of motif identification methods, however, a more elaborate model structure could also be developed by a factorization of the marginal likelihood under a low-order dependence structure, such as a sparse Bayesian network or a context tree similar to that utilized in the construction of the VLMC. This type of a factorization principle has earlier been successfully used in the ordinary likelihood-based motif scanning; see, for example, [11, 13]. The advantage of such likelihood factorizations would in the Bayesian setting be that the marginal likelihoods could still be analytically calculated under suitable Dirichlet priors, thus enabling the utilization of the nonreversible MCMC computation for model learning. However, such a strategy would nevertheless significantly increase the computational complexity associated with the learning procedure, as the size of the model space would increase considerably.

The statistical learning method described here could be modified to incorporate more specific biological knowledge in terms of informative prior distributions for occurrence of certain nucleotides in any given part of a motif instance. Such priors may be designed from well curated biological databases using the basic properties of the Dirichlet distribution. To reduce the computational intensity of our experiments, we used the stochastic search initialized only by the candidate motif instances. Also, the VLMC model for the background was first fitted to the data, whereafter the context tree was not re-estimated during the motif discovery phase. In a fully coherent Bayesian learning procedure it would necessary to consider both aspects entirely as parts of the structural layer of the model and to learn them in parallel with the motif discovery. Similarly, it would be preferable to also monitor the learning process by creating an extensive set of posterior summaries of the motif configurations and visited segments of the sequence as in [15], instead of solely targeting the posterior mode of the model structure.

The fact that sequence scanning yields a large number of motifs that cannot be explained with the knowledge built from the current databases is an open problem in the scanning field, as confirmed by similar studies using YMF [39] and Weeder [40], as well as metastudies such as [41] and even supervised clustering algorithms such as LOGOS [10]. Despite such high “noise” level, we have illustrated that a model-based approach has potential to simultaneously discover multiple motif types hidden in the sequences. From an intuitive perspective such an approach should make more sense than screening for a single motif type at a time. The latter can be suboptimal, for example, in cases where separate motif types are present in the data, such that they are closely related in an evolutionary sense. Here our focus has been more on the mathematical formulation of a statistical framework that has potential for a simultaneous discovery of multiple motif types. We wish to emphasize the advantages of formulating the statistical problem using a Bayesian learning framework which enables the use of nonstandard MCMC computation. In future our aim is to develop more concrete practical tools extending the basic model and the implementation of the learning algorithm, by exploiting a massively parallel computation architecture to pursue large-scale validatory and exploratory experiments.

#### Appendix

Mathematical details of the derivation of the result in Theorem 1 are given below. For convenience of reference we recall the definition of the Dirichlet integral where and for all .

A marginal distribution for given is Now (23) and (25) yield that (A.2) is equal to The Dirichlet integral can now be used to establish (for each context ) and similarly for each motif type this yields Further, using (A.4) and (A.5) the stated result is established.

#### Acknowledgments

The authors thank Dr. Torkel Erhardsson, Linköpings Universitet, for valuable discussions about compound Poisson approximation. The authors would also like to remember their late friend and colleague Björn Larsson, graduate student in Linköping, who participated in the initial stage of this work before losing his life in a traffic accident. The work was supported by grant no. 4042140 from the Swedish Research Council (VR/NT) and by grant no. 121301 from Academy of Finland.

#### References

- T. Werner, “Models for prediction and recognition of eukaryotic promoters,”
*Mammalian Genome*, vol. 10, no. 2, pp. 168–175, 1999. View at Publisher · View at Google Scholar - E. Eskin and P. A. Pevzner, “Finding composite regulatory patterns in DNA sequences,”
*Bioinformatics*, vol. 18, supplement 1, pp. S354–363, 2002. View at Google Scholar - L. Marsan and M.-F. Sagot, “Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification,”
*Journal of Computational Biology*, vol. 7, no. 3-4, pp. 345–362, 2000. View at Publisher · View at Google Scholar - U. Ohler and H. Niemann, “Identification and analysis of eukaryotic promoters: recent computational approaches,”
*Trends in Genetics*, vol. 17, no. 2, pp. 56–60, 2001. View at Publisher · View at Google Scholar - J. S. Liu, A. F. Neuwald, and C. E. Lawrence, “Bayesian models for multiple local sequence alignment and Gibbs sampling strategies,”
*Journal of the American Statistical Association*, vol. 90, pp. 1156–1170, 1995. View at Google Scholar - W. Thompson, E. C. Rouchka, and C. E. Lawrence, “Gibbs recursive sampler: finding transcription factor binding sites,”
*Nucleic Acids Research*, vol. 31, no. 13, pp. 3580–3585, 2003. View at Publisher · View at Google Scholar - M. Gupta and J. S. Liu, “Discovery of conserved sequence patterns using a stochastic dictionary model,”
*Journal of the American Statistical Association*, vol. 98, no. 461, pp. 55–66, 2003. View at Publisher · View at Google Scholar - S. T. Jensen, X. S. Liu, Q. Zhou, and J. S. Liu, “Computational discovery of gene regulatory binding motifs: a Bayesian perspective,”
*Statistical Science*, vol. 19, no. 1, pp. 188–204, 2004. View at Publisher · View at Google Scholar - S. T. Jensen and J. S. Liu, “BioOptimizer: a Bayesian scoring function approach to motif discovery,”
*Bioinformatics*, vol. 20, no. 10, pp. 1557–1564, 2004. View at Publisher · View at Google Scholar - E. P. Xing, W. Wu, M. I. Jordan, and R. M. Karp, “Logos: a modular Bayesian model for de novo motif detection,”
*Journal of Bioinformatics and Computational Biology*, vol. 2, no. 1, pp. 127–154, 2004. View at Publisher · View at Google Scholar - I. Ben-Gal, A. Shani, A. Gohr et al., “Identification of transcription factor binding sites with variable-order Bayesian networks,”
*Bioinformatics*, vol. 21, no. 11, pp. 2657–2666, 2005. View at Publisher · View at Google Scholar - L. Hertzberg, O. Zuk, G. Getz, and E. Domany, “Finding motifs in promoter regions,”
*Journal of Computational Biology*, vol. 12, no. 3, pp. 314–330, 2005. View at Publisher · View at Google Scholar - Y. Barash, G. Elidan, N. Friedman, and T. Kaplan, “Modeling dependencies in protein-DNA binding sites,” in
*Proceedings of the 7th Annual International Conference on Computational Molecular Biology (RECOMB '03)*, pp. 28–37, ACM Press, Berlin, Germany, April 2003. - X. Zhang, H. Huang, M. Li, and T. Speed, “Finding short DNA motifs using permuted Markov models,”
*Bioinformatics*, vol. 21, pp. 894–906, 2005. View at Google Scholar - S. M. Li, J. Wakefield, and S. Self, “A transdimensional Bayesian model for pattern recognition in DNA sequences,”
*Biostatistics*, vol. 9, no. 4, pp. 668–685, 2008. View at Publisher · View at Google Scholar - J. Hawkins, C. Grant, W. S. Noble, and T. L. Bailey, “Assessing phylogenetic motif models for predicting transcription factor binding sites,”
*Bioinformatics*, vol. 25, no. 12, pp. i339–i347, 2009. View at Publisher · View at Google Scholar - T. Marschall and S. Rahmann, “Efficient exact motif discovery,”
*Bioinformatics*, vol. 25, no. 12, pp. i356–i364, 2009. View at Publisher · View at Google Scholar - P. Bühlmann and A. J. Wyner, “Variable length Markov chains,”
*Annals of Statistics*, vol. 27, no. 2, pp. 480–513, 1999. View at Google Scholar - M. Mächler and P. Bühlmann, “Variable length Markov chains: methodology, computing, and software,”
*Journal of Computational and Graphical Statistics*, vol. 13, no. 2, pp. 435–455, 2004. View at Publisher · View at Google Scholar - J. Rissanen, “A universal data compression system,”
*IEEE Transactions on Information Theory*, vol. 29, no. 5, pp. 656–664, 1983. View at Google Scholar - I. Abnizova and W. R. Gilks, “Studying statistical properties of regulatory DNA sequences, and their use in predicting regulatory regions in the eukaryotic genomes,”
*Briefings in Bioinformatics*, vol. 7, no. 1, pp. 48–54, 2006. View at Publisher · View at Google Scholar - M. C. Frith, Y. Fu, L. Yu, J.-F. Chen, U. Hansen, and Z. Weng, “Detection of functional DNA motifs via statistical over-representation,”
*Nucleic Acids Research*, vol. 32, no. 4, pp. 1372–1381, 2004. View at Publisher · View at Google Scholar - J. Zhang, B. Jiang, M. Li, J. Tromp, X. Zhang, and M. Q. Zhang, “Computing exact P-values for DNA motifs,”
*Bioinformatics*, vol. 23, no. 5, pp. 531–537, 2007. View at Publisher · View at Google Scholar - C. P. Robert and G. Casella,
*Monte Carlo Statistical Methods*, Springer, New York, NY, USA, 1999. - P. Green, “Reversible jump Markov chain Monte Carlo computation and Bayesian model determination,”
*Biometrika*, vol. 82, pp. 711–732, 1995. View at Google Scholar - J. Corander, M. Gyllenberg, and T. Koski, “Bayesian model learning based on a parallel MCMC strategy,”
*Statistics and Computing*, vol. 16, no. 4, pp. 355–362, 2006. View at Publisher · View at Google Scholar - E. E. Stückle, C. Emmrich, U. Grob, and P. J. Nielsen, “Statistical analysis of nucleotide sequences,”
*Nucleic Acids Research*, vol. 18, no. 22, pp. 6641–6647, 1990. View at Google Scholar - B. Ron, Y. Singer, and N. Tishby, “The power of amnesia: learning of probabilistic automata with variable memory lengths,”
*Machine Learning*, vol. 25, pp. 117–149, 1996. View at Google Scholar - M. Régnier, “A unified approach to word occurrence probabilities,”
*Discrete Applied Mathematics*, vol. 104, pp. 259–280, 2000. View at Google Scholar - T. Erhardsson, “Compound Poisson approximation for Markov chains using Stein's method,”
*Annals of Probability*, vol. 27, no. 1, pp. 565–596, 1999. View at Google Scholar - T. Erhardsson, “Compound Poisson approximation for counts of rare patterns in Markov chains and extreme sojourns in birth-death chains,”
*Annals of Applied Probability*, vol. 10, no. 2, pp. 573–591, 2000. View at Google Scholar - J. L. Thorne, H. Kishino, and J. Felsenstein, “Inching towards reality: an improved likelihood model for sequence evolution,”
*Journal of Molecular Evolution*, vol. 34, pp. 3–16, 1992. View at Google Scholar - J. Corander, M. Gyllenberg, and T. Koski, “Random partition models and exchangeability for bayesian identification of population structure,”
*Bulletin of Mathematical Biology*, vol. 69, no. 3, pp. 797–815, 2007. View at Publisher · View at Google Scholar - D. Geiger and D. Heckerman, “A characterization of the Dirichlet distribution through global and local parameter independence,”
*Annals of Statistics*, vol. 25, no. 3, pp. 1344–1369, 1997. View at Google Scholar - P. Marttinen, J. Corander, P. Törönen, and L. Holm, “Bayesian search of functionally divergent protein subgroups and their function specific residues,”
*Bioinformatics*, vol. 22, no. 20, pp. 2466–2474, 2006. View at Publisher · View at Google Scholar - J. Zhu and M. Q. Zhang, “SCPD: a promoter database of the yeast
*Saccharomyces cerevisiae*,”*Bioinformatics*, vol. 15, no. 7-8, pp. 607–611, 1999. View at Google Scholar - B. G. Mirkin and L. B. Chernyi, “Measurement of the distance between distinct partitions of a finite set of objects,”
*Automation and Remote Control*, vol. 31, pp. 786–792, 1970. View at Google Scholar - L. Hubert and P. Arabie, “Comparing partitions,”
*Journal of Classification*, vol. 2, no. 1, pp. 193–218, 1985. View at Publisher · View at Google Scholar - S. Sinha and M. Tompa, “Discovery of novel transcription factor binding sites by statistical overrepresentation,”
*Nucleic Acids Research*, vol. 30, no. 24, pp. 5549–5560, 2002. View at Publisher · View at Google Scholar - G. Pavesi, P. Mereghetti, G. Mauri, and G. Pesole, “Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes,”
*Nucleic Acids Research*, vol. 32, web server issue, pp. W199–W203, 2004. View at Publisher · View at Google Scholar - M. Tompa, N. Li, T. L. Bailey et al., “Assessing computational tools for the discovery of transcription factor binding sites,”
*Nature Biotechnology*, vol. 23, no. 1, pp. 137–144, 2005. View at Publisher · View at Google Scholar