Abstract

Identification of regulatory binding motifs, that is, short specific words, within DNA sequences is a commonly occurring problem in computational bioinformatics. A wide variety of probabilistic approaches have been proposed in the literature, either to scan for previously known motif types or to attempt de novo identification of a fixed number (typically one) of putative motifs. Most approaches assume the existence of reliable biodatabase information to build a probabilistic a priori description of the motif classes. Examples of probabilistic unsupervised learning of the number of putative de novo motif types and their positions within a set of DNA sequences are very rare in the literature. Here we show how such a learning problem can be formulated using a Bayesian model that aims to simultaneously maximize the marginal likelihood of the sequence data arising under multiple motif types as well as under the background DNA model, which is a variable-length Markov chain. It is demonstrated how the adopted Bayesian modelling strategy, combined with recently introduced nonstandard stochastic computation tools, yields a more tractable learning procedure than is possible with standard Monte Carlo approaches. Improvements and extensions of the proposed approach are also discussed.

1. Introduction

A major body of genomic information coded in the DNA is represented by the regulatory regions, which control the chronology, extent, and way in which genes are expressed. A promoter is the most important regulatory region as it controls the first step of gene expression, that is, the transcription of messenger RNA. A brief summary of and introduction to the biology of promoters are given in [1]. The basic characteristic of a promoter is that it contains binding sites for proteins called transcription factors (TFs). The binding sites in the DNA reside near the gene controlled by the promoters. The interaction of several TFs with their corresponding binding sites regulates the activation or repression of the gene. Hence, the promoter architecture is fundamental for understanding gene expression patterns, regulation networks, and cell specificity. The binding sites of one single TF are not identical, but contain (i.e., tolerate) some variation. The shared content of the binding sites representing the same TF is typically summarized by specifying statistically the degree of conservation in a short (5–20 bases) DNA pattern. The DNA pattern can be understood as a word or a string on a 4-letter alphabet which may contain some variation over its positions when multiple realizations from the same generating source are considered. The substrings of DNA that correspond to the multiple realizations are here called motif instances from a gene regulatory binding motif. A set of long DNA strings may simultaneously harbor multiple different such motifs, that is, TF binding sites for different genes. In [2, 3] motifs are defined as sets of words of certain length, such that the number of mismatches to a consensus word is smaller than a prescribed threshold. Here, we use instead a probabilistic classification framework to specify the motifs by assigning them to classes in an unsupervised manner.

An important task in computational biology is to detect novel motifs using algorithms that can reason from noisy measurements in the spirit of artificial intelligence. The core of the motif detection problem is to discover novel words, with lengths within a biologically meaningful range, such that multiple copies of the words are hidden inside the promoter regions of a set of genes that share properties such as their expression profile. Putative motifs found computationally could thus correspond to binding sites of common transcription factors regulating a particular set of genes. The algorithms for motif discovery must also be able to avoid suggesting as novel motifs those DNA patterns that merely represent the background variation in the sequences. In a nutshell, the discovery problem consists of detecting multiple copies of only partially conserved short words in a long string over the characters {𝐴,𝐢,𝐺,𝑇}, such that the words are highly unlikely to occur under a stochastic null model describing the general variation of consecutive letters within the string.

The problem may appear rather straightforward at first sight. However, several of the involved computational challenges make motif discovery and identification a genuinely complex task [4]. The main obstacles can be listed as follows: a motif exhibits a certain degree of random variation (lack of conservation) in its contents, word lengths are unknown, the background structure of the DNA sequences (i.e., the areas not containing the target words) is not random, and the space of putative candidates is of astronomical size. In addition, the motifs may appear in composite patterns, for example, as pairs of words with a fixed distance between them.

A considerable amount of effort has been devoted to solving such problems with statistical model families and with scanning algorithms that detect DNA pattern overrepresentation under a null background model; see, for example, [5–17]. Several additional related methodological papers can be found in the references of the quoted papers. In broad terms, the methodological efforts for the above-mentioned problem can be divided into two categories: one where the target is to scan for the existence of a priori determined motifs within a set of sequences, and the other where one attempts to identify novel motifs in a given set of sequences. Our approach is solely concerned with the latter category, that is, we consider an unsupervised pattern recognition problem, although some brief remarks on the possibility of extending our method to a partially supervised situation are given in the final section.

Most statistical models for motif discovery use a Markovian probabilistic machinery, for example, ordinary Markov chains or hidden Markov models, to describe the observed patterns of the background DNA and to improve the motif specificity. However, some of the recent works, such as [11, 14], have demonstrated the particular potential of so-called variable-length (or variable-order) Markov chain (VLMC) models. Such models exemplify a rich class of probabilistic structures for which statistical learning theory was developed much earlier in a general context [18–20], and which are able to compress information efficiently through a relatively sparse parameterization without sacrificing the expressiveness of the probabilistic characterization.

In the current paper, the VLMC model is used in two ways. Firstly, candidates of motif instances are obtained by fitting the VLMC model to the sequence data using the algorithm from [20] with the implementation due to [19], and then calculating probabilities for word counts using a compound Poisson approximation. The words which are most improbable under the null background model represent natural candidates of motif instances and can thus be used for an efficient initialization of an unsupervised learning process. Special attention has been devoted in the motif identification literature to the calculation of word count probabilities associated with any particular substrings within the investigated sequences under a null background DNA model; see [17, 21–23]. However, the earlier works have not considered the calculation of such probabilities under a VLMC model, but only under ordinary Markov models of fixed order. Secondly, sufficient statistics (multinomial counts) are obtained from the sequence data under the VLMC context tree identified by the algorithm of [19], and these are used in the Bayesian unsupervised model learning to identify different types of motifs. However, it should be noted that the sufficient statistics under the VLMC model are recalculated in the unsupervised stochastic learning steps when putative motif instances are shifted along the sequence and when they are reinserted into the background model.

Many Bayesian pattern recognition methods are principally based on the concept of latent class models, which were used early on in motif detection [5, 8]. Standard Bayesian Markov chain Monte Carlo (MCMC) computation tools, such as the Gibbs sampler [24], are available for models with a priori fixed numbers of classes. However, numerical convergence and mixing problems of such methods are notorious in challenging applications, and motif discovery is not an exception in this respect; see, for example, the discussion in [8]. The computational problems may arise already in a situation where only a single motif type is considered in the de novo detection context. Such problems are further accentuated in the currently considered framework, where the aim is to make formal statistical inference about the simultaneous presence of multiple putative a priori unknown motif types. Bayesian model-based inference in such a setting has been previously considered only in [15], where a reversible jump MCMC algorithm [25] was used for model learning. Motivated by the findings of [26] on general convergence issues with reversible Markov chains, we apply here instead the nonreversible Metropolis-Hastings algorithm they introduced for complex model learning applications.

The structure of this paper is as follows. First, we introduce the background DNA model based on a variable-length Markov process and consider a framework for obtaining putative candidates of motif instances by calculating DNA pattern probabilities under the background model. Thereafter, the unsupervised classification model is derived and in Section 4 we develop a parallel MCMC algorithm for making posterior inferences using a general class of nonreversible Metropolis-Hastings algorithms. Section 5 provides some illustrations of the developed method, and some remarks are given in the final section.

2. A Variable Length Markov Process as the Background DNA Model

Markov chains with higher-order memory were proposed some time ago for computing the expected frequencies of the nucleotides observed outside the motifs, that is, the background information in DNA; see [27] and the references therein. In particular, a third-order stationary Markov chain is often chosen; see, for example, [8]. Here we consider a more general model that can handle longer Markov memories in a parsimonious manner. The background model will be formulated by means of the notion of a context of a symbol, which has been introduced to define a variable-length Markov chain (VLMC) as well as a universal finite memory source [18–20, 28]. Only brief details of VLMC models are given here; for a comprehensive treatment we refer the reader to the original references.

We consider a DNA sequence 𝐱, which can be a long concatenation of strings of possibly varying length. Let π‘₯π‘Ÿ denote a finite substring of 𝐱. The values of this string are denoted by π‘₯π‘Ÿπ‘—βˆˆ{1,2,3,4}=𝒳, where 𝑗 refers to an arbitrary position within the string π‘₯π‘Ÿ and 𝒳 refers to the set of bases in the DNA. Let π‘§π‘šπ‘›=π‘§π‘š,…,𝑧𝑛 be a string in π’³π‘šβˆ’π‘›+1, which starts at position 𝑑=𝑛 and ends at position 𝑑=π‘š. Notice that for π‘š>𝑛 the string is written in reverse order. In the sequel both 𝑑 and π‘Ÿ will be used as generic indices of sequence positions when defining random processes. A repeated use of the chain rule for a random process yields (written using the short notation 𝑃(𝑍=𝑧)=𝑃(𝑧))

$$P\left(z^{n}_{1} \mid z^{0}_{-\infty}\right) = \prod_{t=1}^{n} P\left(z_{t} \mid z^{t-1}_{-\infty}\right). \qquad (1)$$

We next introduce the idea of a memory of variable order, in the sense that the conditional probabilities in the above product may depend only on a context relevant for them. Formally, we define the context as a (variable projection) function which maps the whole past of the sequence into a possibly shorter string.

Definition 1. A context is a map β„ŽβˆΆπ‘§0βˆ’βˆžβ†’π‘§0βˆ’π‘ž+1, where π‘ž (the context length) is defined by

$$q = q\left(z^{0}_{-\infty}\right) = \min\left\{k : P\left(z_{1} \mid z^{0}_{-\infty}\right) = P\left(z_{1} \mid z^{0}_{-k+1}\right) \text{ for all } z_{1} \in \mathcal{X}\right\}.$$

Note that the superscript in 𝑧0βˆ’βˆž and 𝑧0βˆ’π‘ž+1 refers to the first element of any string that is given as an input argument to the function β„Ž. Such an input string itself may in the sequel be indexed with arbitrary sub- and superscripts, such as π‘§π‘Ÿβˆ’1βˆ’βˆž, when it refers to a particular part of a string 𝑧. It should also be noted that the context is independent of 𝑑, and it can thus be called a stationary context.

Definition 2. Let $o = \sup_{z^{0}_{-\infty}} q\left(z^{0}_{-\infty}\right)$. If π‘œ<∞, then {𝑍𝑑}π‘‘βˆˆβ„€ is a stationary variable-length Markov chain (VLMC) of order π‘œ.

Since the context is stationary, (1) can be written as

$$P\left(z^{n}_{1} \mid z^{0}_{-\infty}\right) = \prod_{t=1}^{n} P\left(z_{t} \mid h\left(z^{t-1}_{-\infty}\right)\right).$$

We recall the π‘˜th-order Markov property, defined by the equality $P\left(z_{1} \mid z^{0}_{-\infty}\right) = P\left(z_{1} \mid z^{0}_{-k+1}\right)$ for all 𝑧0βˆ’βˆž. If the context function turns out to be β„Ž(𝑧0βˆ’βˆž)=𝑧0βˆ’π‘˜+1 for all 𝑧0βˆ’βˆž, then the process {𝑍𝑑}π‘‘βˆˆβ„€ is a full Markov chain of order π‘˜. In other words, a VLMC of finite order is a full Markov chain of order π‘œ with a variable memory, so the context representation might seem to make no difference. In practice, however, the minimal state space represented by the VLMC contexts has a considerable impact, since in genomic applications an estimated VLMC typically has an order of about 7 while the size of the set of contexts |{β„Ž(𝑧0βˆ’βˆž)}| remains small. In contrast, the memory requirements for the implementation of a full 7th-order ordinary Markov chain are enormous, which hampers the estimation and use of such models.
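As a toy illustration of the parsimony of the context representation (this is not the fitted model of [19, 20]; the contexts and probabilities below are invented), the following sketch stores next-symbol distributions only for a handful of contexts and retrieves the relevant one by longest-suffix matching, whereas a full 7th-order chain would require 4^7 = 16384 rows.

```python
# Toy VLMC lookup: contexts are strings over {'A','C','G','T'}, read left to
# right with the most recent symbol last, and map to next-symbol probabilities.
contexts = {
    "A":  {"A": 0.40, "C": 0.20, "G": 0.20, "T": 0.20},
    "CA": {"A": 0.10, "C": 0.30, "G": 0.30, "T": 0.30},
    "C":  {"A": 0.30, "C": 0.30, "G": 0.20, "T": 0.20},
    "G":  {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "T":  {"A": 0.20, "C": 0.20, "G": 0.30, "T": 0.30},
}

def next_symbol_distribution(past):
    """past: the observed string, most recent symbol last. The matching
    context is the longest suffix of past that is stored in the tree."""
    for length in range(len(past), 0, -1):
        suffix = past[-length:]
        if suffix in contexts:
            return suffix, contexts[suffix]
    raise KeyError("no matching context; a real context tree also stores the root")

print(next_symbol_distribution("GGCA"))   # the longest stored suffix of 'GGCA' is 'CA'
```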

Using the contexts {β„Ž} of a VLMC as such may lead to certain problems in a practical implementation with regard to predicting the next context from the current one. Such problems are illustrated by the following small example.

Example 1. Let 𝑧51=43412. If π‘ž(𝑧4βˆ’βˆž)=1 and π‘ž(𝑧5βˆ’βˆž)=4, then β„Ž(𝑧4βˆ’βˆž)=3 (i.e., 3412β†’3). However, we cannot determine the probability of transition from β„Ž(𝑧4βˆ’βˆž)=3 to β„Ž(𝑧5βˆ’βˆž)=4341. Note that the superscript in the argument of π‘ž(β‹…) refers to a position in the original string, whereas the superscript in Definition 1 refers to the first element of an arbitrary input argument.

A definition that holds for finite data is also needed to extend the VLMC model to a finite string, such that transition probabilities between the contexts can be calculated. The finiteness of real data and the transition problem of Example 1 are handled by replacing the context length function π‘ž with Μƒπ‘ž according to the following definition.

Definition 3. Let such that Μƒβ„Ž corresponding to Μƒπ‘ž(𝑧𝛾1) replaces β„Ž.

Note that Μƒπ‘ž(𝑧𝛾1)=π‘ž(𝑧𝛾1) for the largest contexts, implying that Μƒπ‘ž(𝑧𝛾1) can be constructed recursively starting from the largest contexts. The presence of 𝛾 in (6) ensures that the context length is well defined for finite sequences, and the maximum part ensures that the next context Μƒβ„Ž(𝑧𝛾+11) can be predicted from the current context Μƒβ„Ž(𝑧𝛾1).

The motif instance candidates are identified by assuming the corresponding words to be overrepresented in the string 𝐱 with respect to the background VLMC model. Let 𝑅𝑛 denote the stochastic number of nonoverlapping instances of a word, counted from the left; more specifically, 𝑅𝑛(̃𝑧𝑙1) equals the number of occurrences of ̃𝑧𝑙1 in 𝑍𝑛1 such that no prefix of one occurrence is a suffix of another. We then seek to compute 𝑃(𝑅𝑛(̃𝑧𝑙1)β©Ύπ‘Ÿπ‘›) for the observed occurrence counts π‘Ÿπ‘›=π‘Ÿπ‘›(𝑧𝑗𝑗+π‘™βˆ’1), within the range π‘™βˆˆ[𝑙min,𝑙max] and 1⩽𝑗⩽𝑛+1βˆ’π‘™.

While it is theoretically possible to calculate 𝑃(𝑅𝑛(̃𝑧𝑙1)β©Ύπ‘Ÿπ‘›) exactly for small 𝑛, as shown in [29], this does not work for the values of 𝑛 encountered in practice, firstly because of the computational cost and, more importantly, because of the numerical difficulties of the currently available techniques. Instead, we approximate (7) using a compound Poisson distribution by invoking the results in [30]. The compound Poisson distribution CP(πœ†1,πœ†2,…) refers to the probability distribution of the sum of a random number 𝑇 of random variables 𝑇𝑖, where the {𝑇𝑖}𝑖⩾1 are independent, 𝑇 is independent of the 𝑇𝑖, 𝑃(𝑇𝑖=π‘˜)=πœ†π‘˜/πœ† for every positive integer π‘˜ and every 𝑖, and 𝑇 is Poisson distributed, that is, π‘‡βˆΌPo(πœ†) with πœ†=βˆ‘π‘˜πœ†π‘˜.
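For concreteness, the following minimal sketch draws from a CP(πœ†1,πœ†2,…) distribution and estimates a tail probability of the type in (7) by Monte Carlo; the intensities πœ†π‘˜ and the observed count are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_compound_poisson(lambdas, size, rng=rng):
    """Draw from CP(lambda_1, lambda_2, ...): T ~ Po(sum(lambdas)) and each
    summand T_i takes value k with probability lambda_k / lambda."""
    lambdas = np.asarray(lambdas, dtype=float)
    lam = lambdas.sum()
    probs = lambdas / lam
    out = np.empty(size, dtype=int)
    for s in range(size):
        t = rng.poisson(lam)                                # number of clumps
        out[s] = rng.choice(np.arange(1, len(lambdas) + 1), size=t, p=probs).sum()
    return out

lambdas = [2.0, 0.5, 0.1]            # hypothetical clump-size intensities
draws = sample_compound_poisson(lambdas, size=10_000)
print(draws.mean())                  # compare with the analytical mean sum_k k * lambda_k
r_obs = 8                            # hypothetical observed occurrence count
print((draws >= r_obs).mean())       # Monte Carlo estimate of P(R_n >= r_obs)
```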

The compound Poisson approximation is here implemented for sequences similar to those considered in [31]. We recall that the background model can be seen as an π‘œth-order Markov chain. Any π‘œth-order Markov chain with state space Ξ© can be transformed into a first-order Markov chain on the state space Ξ©π‘œ. In the actual implementation the state space Ω={β„Ž(𝑧0βˆ’βˆž)} is designed with the aim of keeping the number of states low. For the discussion below, however, we simply assume that the model is represented with a state space Ξ© such that the corresponding random variable π‘Œβˆˆ{1,…,|Ξ©|} can be constructed to imply a first-order Markov chain {π‘Œ1,…,π‘Œπ‘›} with respect to Ξ© (see Example 2 below).

It is assumed that any putative motif instance ̃𝑧𝑙1 can be unambiguously defined as a sequence of states π‘€π‘Ž1 (i.e., strings), where every individual element π‘€π‘–βˆˆΞ©. Additional states explicitly representing the transitions in the path 𝑀1→𝑀2,𝑀2→𝑀3,…,π‘€π‘Žβˆ’1β†’π‘€π‘Ž are added to the state space Ξ©, which leads to the extended state space Ω′. The probability of generating the sequence 𝑀1→𝑀2,𝑀2→𝑀3,…,π‘€π‘Žβˆ’1β†’π‘€π‘Ž from states other than these special states is set to 0. Then, every time the first-order Markov chain visits the end state of the path, a motif instance candidate has been visited in the DNA sequence. This generic principle is illustrated in Example 2, where the candidate string equals ̃𝑧21=11.

Example 2. Suppose that Ξ©={0,1} and ̃𝑧21=11. Then Ω′={0,1,11}, and the original transition probability matrix is transformed accordingly, where 𝑃(0∣11)=𝑃(0∣1) and 𝑃(1∣11)=𝑃(11∣1)=𝑃(1∣1).
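The transformed transition matrix of Example 2 is not reproduced in this copy; the following sketch constructs one version of it that satisfies the stated constraints, using hypothetical values for the original two-state chain. Every visit to the special state 11 then marks an occurrence of the word 11.

```python
import numpy as np

# Hypothetical transition probabilities of the original two-state chain
p = {("0", "0"): 0.7, ("0", "1"): 0.3,
     ("1", "0"): 0.4, ("1", "1"): 0.6}

states = ["0", "1", "11"]
P = np.zeros((3, 3))

# From state 0 the word '11' cannot be completed in a single step.
P[0, 0] = p[("0", "0")]
P[0, 1] = p[("0", "1")]

# From state 1, emitting another 1 completes the word, so the chain
# moves to the special state 11 instead of back to state 1.
P[1, 0] = p[("1", "0")]
P[1, 2] = p[("1", "1")]          # P(11 | 1) = P(1 | 1)

# From state 11 the constraints P(0|11)=P(0|1) and P(1|11)=P(1|1) apply;
# another 1 yields a further occurrence of the word.
P[2, 0] = p[("1", "0")]
P[2, 2] = p[("1", "1")]

print(P)
print(P.sum(axis=1))             # every row sums to one
```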

As seen from the above, the compound Poisson random variable 𝑅𝑛=βˆ‘π‘‡π‘–=1𝑇𝑖 approximates the number of occurrences of ̃𝑧𝑙1 in a sequence. In the implemented model the first success distribution (Fs) is used for the 𝑇𝑖, that is, π‘‡π‘–βˆΌFs(πœƒ) and π‘‡βˆΌPo(πœ†βˆ—). Let πœπ‘¦β€²π‘–=inf{𝑗>1βˆ£π‘Œβ€²π‘—=𝑦′𝑖} and set the compound Poisson approximation parameters as in [30], that is, πœƒβˆΆ=𝑃(πœπ‘¦β€²π‘™<πœπ‘¦β€²π‘˜βˆ£π‘Œβ€²1=π‘¦β€²π‘˜) and πœ†βˆ—=𝑛𝑃(π‘Œβ€²1=π‘¦β€²π‘˜)𝑃(πœπ‘¦β€²π‘™<πœπ‘¦β€²π‘˜βˆ£π‘Œβ€²1=π‘¦β€²π‘˜). In general π‘¦β€²π‘™ can be chosen rather freely, for example, to minimize the bound in [30]. In the practical implementation π‘¦β€²π‘™ is instead set to the random variable representing the starting context of the motif, because the execution time of the currently available algorithm for evaluating the approximation bound in [30] is about one day on a standard computer, and the typical number of putative motif instance candidates to be evaluated in this context exceeds 10000.
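As an illustration of the parameter choice described above, the following sketch estimates πœƒ and πœ†βˆ— by direct simulation of a small hypothetical first-order chain; it is only a Monte Carlo stand-in, not the algorithm of [30].

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical first-order chain on three states; state 2 plays the role of
# y'_l (the state marking the start of a motif occurrence) and state 0 of y'_k.
P = np.array([[0.6, 0.3, 0.1],
              [0.5, 0.4, 0.1],
              [0.7, 0.2, 0.1]])
y_k, y_l, n = 0, 2, 10_000

def hits_l_before_k(P, y_l, y_k, rng):
    """Simulate from Y'_1 = y'_k and return True if y'_l is reached
    at some j > 1 before the chain returns to y'_k."""
    state = rng.choice(len(P), p=P[y_k])
    while state not in (y_l, y_k):
        state = rng.choice(len(P), p=P[state])
    return state == y_l

reps = 20_000
theta = np.mean([hits_l_before_k(P, y_l, y_k, rng) for _ in range(reps)])

pi = np.linalg.matrix_power(P, 200)[0]    # stationary distribution of the chain
lambda_star = n * pi[y_k] * theta         # n * P(Y'_1 = y'_k) * theta
print(theta, lambda_star)
```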

3. An Unsupervised Stochastic Learning Framework for Motif Discovery

The present learning model for motif detection operates by sequentially inserting randomly positioned contiguous windows onto the considered total DNA string and classifying their contents into groups representing putative motif types, such that the number of groups is unknown a priori. From an evolutionary perspective, the substrings inside the windows can also be viewed as fragments inserted into specific positions of a previously existing DNA sequence.

We now summarize the concepts for the motif detection task as follows.

(i) A putative gene regulatory motif can be interpreted as a statistical model of a binding site which defines, for any particular DNA string of an appropriate length, how well it fits to that site.

(ii) A motif instance can be interpreted as an actual realization of the DNA under a particular regulatory motif model, which specifies a probability distribution over the bases at the positions belonging to the motif. The unknown (multinomial) parameters that specify this joint distribution are nuisance parameters and they are thus integrated out when making Bayesian inference.

(iii) A motif type refers here to any particular motif. The type can be represented by a class of motif instances that are judged by the Bayesian model to correspond to the same motif. We assume that the number of motif models discoverable from the considered DNA sequence is a priori unknown and a primary target of the statistical learning. A putative gene regulatory motif type will be here indexed by 𝑐, 𝑐=1,…,𝐾, such that 𝐾∈[0,π‘˜max] is a stochastic number of motif types, each of which corresponds to an a priori unknown number of motif instances that are localized in 𝐱.

A representation of the components included in our unsupervised model is shown in Figure 1, with the used notation explained in detail below. Now, given 𝐾=π‘˜, let π‘ˆπ‘,𝑐=1,…,π‘˜, be a stochastic number, or the abundance, of motif instances representing the motif type 𝑐, governed by the probability 𝑃(π‘ˆπ‘=𝑒𝑐). Further, we let the common length 𝑙𝑐 of the π‘ˆπ‘ motif instances be within the range [𝑙min,…,𝑙max], such that 𝑙min and 𝑙max are positive integers with 𝑙min<𝑙max. Here 𝑙min, 𝑙max are the lower and upper bounds, respectively, on the motif length that can be modified using biological domain knowledge; see, for example, [11].

Given the π‘˜ outcomes of (𝑒𝑐,𝑙𝑐), there are in total βˆ‘π‘˜π‘=1𝑒𝑐 motif instances to be allocated onto 𝐱, such that an arbitrary motif instance of type 𝑐 contains 𝑙𝑐 contiguous positions. Let π‘–π‘βˆˆ{1,…,𝑒𝑐} index an arbitrary motif instance of type 𝑐. Within 𝐱, a motif instance is represented by the pair (π‘Ÿπ‘–π‘,𝑙𝑐) identifying a substring π‘₯π‘Ÿ which contains the motif instance 𝑖𝑐 of the length 𝑙𝑐.

The substring representing the motif instance is explicitly denoted as π‘₯𝑖𝑐=π‘₯π‘Ÿπ‘–π‘π‘₯π‘Ÿπ‘–π‘+1β‹―π‘₯π‘Ÿπ‘–π‘+π‘™π‘βˆ’1. The random variables corresponding to the bases observed within a motif instance are governed by the probabilities 𝐩𝑐𝑗=(π‘π‘π‘—π‘Ž)π‘Žβˆˆπ’³, such that π‘π‘π‘—π‘Ž represents the probability of observing base π‘Ž at position 𝑗 among the instances of type 𝑐. Generally, this parametric construction is referred to as a weight matrix in the bioinformatics literature. Given the motif types with the unknown underlying distributions, the bases observed at each position are here modeled as conditionally independent multinomial trials with respect to any type.
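A minimal sketch of the weight-matrix representation described above, with invented values: each motif type is a matrix holding one multinomial distribution per position, and under the conditional independence assumption the probability of an instance is the product over its positions.

```python
import numpy as np

BASES = "ACGT"

# Hypothetical weight matrix for one motif type of length l_c = 4:
# row j gives the probabilities p_{cja} of the bases A, C, G, T at position j.
p_c = np.array([[0.70, 0.10, 0.10, 0.10],
                [0.05, 0.05, 0.85, 0.05],
                [0.10, 0.40, 0.40, 0.10],
                [0.25, 0.25, 0.25, 0.25]])

def instance_probability(instance, weight_matrix):
    """Probability of a motif instance under conditional independence
    of the positions given the motif type."""
    idx = [BASES.index(b) for b in instance]
    return float(np.prod(weight_matrix[np.arange(len(instance)), idx]))

print(instance_probability("AGCT", p_c))   # 0.70 * 0.85 * 0.40 * 0.25
```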

An allocation of the motif instances onto 𝐱 implies that the sequence 𝐱 is split into two distinct parts, one containing the βˆ‘π‘˜π‘=1𝑒𝑐 strings π‘₯𝑖𝑐, and the other containing the remainder of 𝐱 representing the background. Hence, we consider the sequence structure in terms of a model of randomly inserted motif instances as in [32].

Let now π‘š denote a particular configuration of a motif allocation model as described above. Thus, π‘š designates 𝑒𝑐 motif instances of type 𝑐, 𝑐=1,…,π‘˜, allocated in random positions in a generic sequence. In probabilistic terms, π‘š partitions the data 𝐱 into conditionally independent groups. Let β„³ be the space of all eligible such partitions. The partitions are defined using a set of random variables π‘Œπ‘– according to Both the motif instance classification structure and the sequence realizations are thus embedded into the set of π‘Œπ‘–. An arbitrary realization of these is denoted in the sequel by 𝐲. Note here that the motif data depend only on the motif type and position in the motif instance, not the surrounding data. The background model is restarted, with the same probability law, after every motif instance, but we do not burden the notation further by this.

The unsupervised motif discovery model is now formally defined by the joint probability 𝑃(𝐲,π‘š)=𝑃(π²βˆ£π‘š)𝑃(π‘š), where 𝑃(π²βˆ£π‘š) is the marginal distribution (likelihood) of the data given the model π‘š and 𝑃(π‘š) is a probability distribution representing our prior uncertainty about the different models.

Here, the motif discovery task is formulated as the joint maximization of the posterior distribution of the background model and of the partition of the motif instance candidates into classes representing distinct motif types. The joint maximization reflects the fact that the marginal likelihood of the sequence data under the background model increases when substrings that have a low probability of occurring are instead considered as motif instances. On the other hand, increasing the number of motif types and their lengths increases the number of multinomial parameters needed to represent the probabilities of observing the different bases over the motifs. Thus, the model must strike a balance between these two aspects when aiming at a simultaneous identification of the posterior optimal number of motif types π‘˜, as well as the allocation and alignment of motifs into the classes representing the different types.

To enable efficient identification of the posterior optimal model structure we use a constant prior 𝑃(π‘š)=1/|β„³| for the structural layer of our model. Notice that the space of eligible model structures depends on the a priori limits specified for the motif lengths and the number of motif types. A constant prior can be seen to arise conceptually through the following generating model for a set of fragments randomly inserted within a DNA sequence. Consider an initial sequence of length 𝑀. Given a stochastic total number π‘ˆ of fragments (i.e., all motif instances), one can choose the first positions of the fragments in $\binom{M}{U}$ ways, such that for each of them a corresponding motif instance of a certain length 𝑙𝑐 is inserted into the sequence between that position and the subsequent one. The probability of any particular arrangement of the fragments is thus $\binom{M}{U}^{-1}$. Furthermore, every fragment can be chosen with probability 1/π‘˜ from the π‘˜ alternative sources, which specifies the generating model for the prior probabilities of the structural layer. By specifying suitable distributions for π‘ˆ and π‘˜, a uniform distribution for the structure π‘š is implied, similarly to the random urn model considered in [33].

Next we derive an expression for the marginal likelihood under the classification model using an approach similar to that adopted in [33]. Let 𝐼π‘₯π‘Ÿπ‘–π‘,𝑗(π‘Ž) be an indicator function that takes the value one if the base π‘Žβˆˆπ’³ is found at position 𝑗 of the motif instance π‘₯π‘Ÿπ‘–π‘ and the value zero otherwise. Then $n_{cja} = \sum_{i_c=1}^{u_c} I_{x_{r_{i_c}},j}(a)$ is the number of times the symbol π‘Ž appears at position 𝑗 in the motif instances of class 𝑐, and 𝑛𝑐𝑗=𝑛𝑐 is the total number of motif instances of type 𝑐 (the same for every position 𝑗). Similarly, let π‘›π‘”π‘Ž and 𝑛𝑔 be the number of times π‘Ž appears after the context denoted by 𝑔 and the number of times the context 𝑔 appears, respectively, where the context can have parts belonging to both the background and one or several motif instances. Recall that the context 𝑔 refers to a particular string with elements from the alphabet 𝒳. Now we can continue with the actual calculation of 𝑃(π²βˆ£π‘š).
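The counts defined above can be accumulated directly from a set of aligned instances; a minimal sketch for a single motif class, with invented instances, is the following.

```python
import numpy as np

BASES = "ACGT"

def motif_counts(instances):
    """n_{cja}: for one motif class c, count how many times base a occurs
    at position j over the aligned instances (all of common length l_c)."""
    l_c = len(instances[0])
    counts = np.zeros((l_c, 4), dtype=int)
    for inst in instances:
        for j, base in enumerate(inst):
            counts[j, BASES.index(base)] += 1
    return counts

instances_c = ["ACGTA", "ACGTT", "AAGTA", "ACGGA"]   # hypothetical aligned instances
n_cja = motif_counts(instances_c)
print(n_cja)               # rows: positions j, columns: A, C, G, T
print(n_cja.sum(axis=1))   # each row sums to n_c, the number of instances
```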

We use different priors for the probabilities in the background and in the motifs. Both are Dirichlet, with hyperparameters 𝛼 (for the background) and πœ† (for the motifs), respectively. The Dirichlet distribution is the standard choice of prior in Bayesian modeling of DNA sequences, and in addition to the computational convenience related to this prior, there are theoretical arguments supporting it in a variety of different contexts; see, for example, the discussion in [33, 34]. Let and let us next introduce and the Dirichlet density on 𝒱𝑔. Another prior is considered for the motifs on the simplex Let us set as well as Consider now the conditional probability 𝑃(π‘Œπ‘›π‘œ+1=π‘§π‘›π‘œ+1βˆ£π‘§π‘œ1,𝜷,𝐩), which can be factorized using the chain rule of probabilities as Here (12) and the VLMC property in the sense of (4) are used, resulting in By combining the above distributions, it is possible to derive the marginal data distribution analytically for our classification model. The formula for the marginal distribution is given in Theorem 1, for which a proof is provided in the appendix.

Theorem 1. If 𝑃(π‘Œπ‘›π‘œ+1=π‘§π‘›π‘œ+1βˆ£π‘§π‘œ1,𝜷,𝐩) is given in (25) and πœ™(𝜷,𝐩) is as in (23), then the marginal likelihood under the classification model equals
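Formula (26) is not reproduced in this copy. As an illustrative stand-in under the stated Dirichlet priors, the following sketch evaluates the standard Dirichlet–multinomial marginal likelihood term from which such a product formula is typically built, once per background context 𝑔 and once per motif column (𝑐,𝑗); the counts and hyperparameter values are invented.

```python
import numpy as np
from scipy.special import gammaln

def log_dirichlet_multinomial(counts, alphas):
    """log of Gamma(sum a)/Gamma(n + sum a) * prod_a Gamma(n_a + a_a)/Gamma(a_a),
    i.e. the multinomial likelihood with the probabilities integrated out
    against a Dirichlet(alphas) prior."""
    counts = np.asarray(counts, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    return (gammaln(alphas.sum()) - gammaln(counts.sum() + alphas.sum())
            + np.sum(gammaln(counts + alphas) - gammaln(alphas)))

# One motif column with counts n_{cja} over A, C, G, T and a symmetric prior
print(log_dirichlet_multinomial([12, 1, 2, 0], [0.5, 0.5, 0.5, 0.5]))

# One background context g with counts n_{ga}; a full marginal likelihood of
# this type is a sum of such terms over contexts g, columns j, and classes c.
print(log_dirichlet_multinomial([40, 35, 60, 52], [1.0, 1.0, 1.0, 1.0]))
```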

4. Learning Sequence Partitions by Stochastic Search Operators

Standard MCMC algorithms, such as the Gibbs sampler or the Metropolis-Hastings algorithm [24], have regularly been used for the earlier inferential problems associated with motif discovery [5]. However, numerical convergence and mixing problems of such methods burden the motif discovery task. Furthermore, as the objectives here are even more challenging due to the a priori unknown number of motif types, standard stochastic computation is not expected to provide a feasible strategy. The unsupervised learning model for motif discovery developed in the previous section has characteristics that enable the use of the nonstandard Bayesian computational strategy developed in [26]. We now describe the search operators embedded in the nonreversible algorithm.

Let 𝑝prop(β‹…βˆ£π‘š) denote a generic fixed distribution that assigns probabilities on the space β„³, conditional on the model structure π‘š. A nonreversible Metropolis-Hastings algorithm may then be constructed by defining its transition kernel according to the acceptance probability (27), where π‘šβˆ— is a candidate state (model structure) generated from a current state π‘š under 𝑝prop(β‹…βˆ£π‘š). The important difference between this algorithm and the ordinary reversible Metropolis-Hastings algorithm is the absence of the ratio of the proposal probabilities 𝑝prop(π‘šβˆ£π‘šβˆ—)/𝑝prop(π‘šβˆ—βˆ£π‘š), which in the reversible case ensures that the stationary distribution of the generated Markov chain equals the sought posterior distribution. However, as shown in [26], consistent estimates of the model posterior probabilities can still be constructed from a realization of the nonreversible chain. The major strength of this algorithm is that it can use more complex proposal mechanisms than ordinary reversible MCMC algorithms, in particular proposals whose probabilities are not feasible to calculate analytically in practice. Also, the direct utilization of the analytically calculated marginal likelihood 𝑃(π²βˆ£π‘š) avoids the effects of Monte Carlo error in the estimation of the underlying parameters, as compared, for example, to ordinary Gibbs sampler estimation where values of all model parameters are sequentially sampled. The latter advantage has also been discussed in the earlier motif discovery literature, where collapsed Gibbs samplers have been used for model learning [5, 8].
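The acceptance probability (27) is not reproduced in this copy; the sketch below assumes the form min{1, 𝑃(π²βˆ£π‘šβˆ—)/𝑃(π²βˆ£π‘š)} suggested by the uniform prior 𝑃(π‘š) and the absence of the proposal ratio, with hypothetical placeholder functions for the proposal and the marginal likelihood.

```python
import math
import random

def accept(log_marginal_current, log_marginal_proposed, rng=random):
    """Nonreversible Metropolis-Hastings acceptance: only the ratio of the
    marginal likelihoods enters; no proposal-density correction is applied."""
    log_ratio = log_marginal_proposed - log_marginal_current
    return log_ratio >= 0 or rng.random() < math.exp(log_ratio)

def mh_step(m, propose, log_marginal, rng=random):
    """One transition of the chain: draw m* ~ p_prop(. | m), then accept or stay."""
    m_star = propose(m)
    return m_star if accept(log_marginal(m), log_marginal(m_star), rng) else m

# Example call with hypothetical log marginal likelihood values
print(accept(log_marginal_current=-1520.4, log_marginal_proposed=-1518.9))
```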

In addition to the nonreversible transition kernel of the algorithm, we utilize 𝑛 parallel interacting search processes analogously to [26]. The search process as a whole can be defined as follows.

Let {π‘šπ‘‘π‘—,𝑑=0,1,…;𝑗=1,…,𝑛} and {𝐼𝑑,𝑑=0,1,…} be 𝑛+1 stochastic processes defined as follows.

(1) Define a sequence of strictly decreasing probabilities {𝛼𝑑,𝑑=1,2,…}, such that 𝛼𝑑>𝛼𝑑+1, and 𝛼𝑑→0 as π‘‘β†’βˆž.

(2) Define the stochastic process {𝐼𝑑,𝑑=0,1,…} as 𝐼0=0, and 𝑃(𝐼𝑑=1)=𝛼𝑑, 𝑃(𝐼𝑑=0)=1βˆ’π›Όπ‘‘, independently for 𝑑=1,2,….

(3) Let π‘š0𝑗, 𝑗=1,…,𝑛, be arbitrary initial states of {π‘šπ‘‘π‘—,𝑑=0,1,…;𝑗=1,…,𝑛}. Given a realization {𝐼𝑑,𝑑=0,1,…}, the transition mechanism of the processes {π‘šπ‘‘π‘—,𝑑>0;𝑗=1,…,𝑛} depends on the values of 𝐼𝑑 according to steps (4) and (5).

(4) For each 𝑑 such that 𝐼𝑑=0, the transition from π‘šπ‘‘π‘— to the next state π‘š(𝑑+1)𝑗 is determined according to the probability (27) under the proposal distribution 𝑝prop(β‹…βˆ£π‘š), for 𝑗=1,…,𝑛.

(5) For each 𝑑 such that 𝐼𝑑=1, the transition from π‘šπ‘‘π‘— to the next state π‘š(𝑑+1)𝑗 is determined according to the following distribution over the set {π‘šπ‘‘π‘—,𝑗=1,…,𝑛} of candidate states: independently for 𝑗=1,…,𝑛.

The 𝑛 processes {π‘šπ‘‘π‘—,𝑑=0,1,…;𝑗=1,…,𝑛} defined above are not time homogeneous Markov chains. However, as π‘‘β†’βˆž, their transition probabilities converge to those of the time homogeneous Markov chain defined in (27). The parallel interacting processes are defined to yield an efficient, yet consistent scheme for exchange of information between them. The processes have a tendency to coalesce towards the states which are associated with higher marginal likelihoods. Also, when multiple model structures with roughly equal marginal likelihoods are present, the probabilities (28) lead to a more dispersed proposal distribution.
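The distribution (28) is described above only verbally; the following sketch of the whole scheme assumes that, in the interaction step, each process resamples its state from the current pool with probabilities proportional to the marginal likelihoods, which matches the described coalescence behaviour but need not be the exact form of (28). All functions and values are placeholders.

```python
import math
import random

def parallel_search(initial_states, propose, log_marginal, steps, rng=random):
    """Sketch of n interacting processes: with probability alpha_t the states
    are resampled from the current pool (interaction step, I_t = 1), otherwise
    each process performs an independent nonreversible MH transition."""
    states = list(initial_states)
    for t in range(1, steps + 1):
        alpha_t = 1.0 / (t + 1)                     # strictly decreasing, -> 0
        if rng.random() < alpha_t:                  # I_t = 1: interaction
            logs = [log_marginal(m) for m in states]
            mx = max(logs)
            weights = [math.exp(l - mx) for l in logs]
            states = [rng.choices(states, weights=weights)[0]
                      for _ in range(len(states))]
        else:                                       # I_t = 0: MH transitions
            new_states = []
            for m in states:
                m_star = propose(m)
                log_ratio = log_marginal(m_star) - log_marginal(m)
                ok = log_ratio >= 0 or rng.random() < math.exp(log_ratio)
                new_states.append(m_star if ok else m)
            states = new_states
    return states

# Toy usage with a one-dimensional dummy "model structure"
final = parallel_search(initial_states=[0] * 10,
                        propose=lambda m: m + random.choice([-1, 1]),
                        log_marginal=lambda m: -0.5 * (m - 3) ** 2,
                        steps=500)
print(final)   # the processes tend to coalesce near the mode
```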

One of the major challenges in MCMC computation is the need to design proposal operators for new models. Therefore, our emphasis is on developing intelligent proposals that are able to discriminate the information content of different segments of the data. We now specify the search operators that generate candidate states according to the distribution 𝑝prop(β‹…βˆ£π‘š), which, as is typical in MCMC, in fact consists of several components. Some of these are intelligent operators similar to those used for protein sequence classification in [35], and some are locally random operators similar to those used in [26].

Assume first that a nonempty set {π±π‘Ÿπ‘–,𝑖=1,2,…} of initial candidates of motif instances has been made available out of the total sequence 𝐱. Such candidates could also be continuously chosen and/or discarded as a part of the nonreversible algorithm according to a stochastic search operator under our learning model. However, to limit the computational burden, we chose in our experiments to utilize a number of results from renewal theory and compound Poisson approximations to provide a set of putative motif instances (as explained in Section 2).

To identify an optimal classification of the candidate motif instances we use for each process repeatedly the following four search operators in the transition kernel: Merge, Split, Slide, and Move. These are defined as follows.

(1) Merge. Let 𝑐max be the motif type with the largest index. For each pair of motif types, find the optimal alignment of the motif instances with respect to the marginal likelihood when the instances are merged into a single group. When the lengths of the motif instances in the two classes differ, the bases in the shorter strings that are lacking in a column of any particular alignment are treated as missing data. The marginal likelihood 𝑃(π²βˆ£π‘šπ‘‘π‘—) can then still be calculated as in (26), because the expression is based on the sufficient statistics (counts) arising for each column in the alignment. If any of the $\binom{c_{\max}}{2}$ mergings results in a higher marginal likelihood than 𝑃(π²βˆ£π‘šπ‘‘π‘—), use that model structure as the proposal value π‘šβˆ—π‘—, else use π‘šπ‘‘π‘—.

(2) Split. Choose a motif type 𝑐 randomly. Calculate the pairwise Hamming distances of the corresponding motif instances and cluster them using the standard hierarchical single linkage algorithm. Split the group of motif instances optimally into two subgroups according to the hierarchical clustering.

(3) Slide. Choose a motif type 𝑐 randomly. Slide the corresponding motif instances randomly backwards or forwards and change their length randomly. Both the sliding and the length change are performed with respect to a simple uniform proposal distribution (a sketch of this operator is given after this list). The sliding step relocates the motif instance maximally 5 bases backwards or forwards with equal probabilities for all possible configurations, apart from those that would place the motif instance outside of the sequence. The length is changed by 1 or 2 bases, by randomly either shortening or extending the motif instances with equal probabilities. If the resulting sequence does not satisfy the prior limits 𝑙min, 𝑙max, the proposal is discarded and a new value is generated from the original motif instance.

(4) Move. Choose a motif instance randomly among all instances and choose a motif type 𝑐 randomly among the remaining 𝑐maxβˆ’1 types. Slide the motif instance randomly backwards or forwards and group it with the motif type 𝑐.
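As an illustration of one of the operators, a minimal sketch of Slide for a single motif instance under the stated proposal (a shift of at most 5 bases, a length change of 1 or 2 bases, uniform choices, and redrawing of proposals that violate the prior limits or the sequence boundaries); the concrete values in the example call are hypothetical.

```python
import random

L_MIN, L_MAX = 7, 15   # prior limits on the motif length (as in Section 5)

def slide(start, length, seq_len, rng=random):
    """Slide proposal for one motif instance given by (start, length):
    shift by at most 5 positions and change the length by 1 or 2 bases,
    both uniformly; invalid proposals are redrawn from the original value."""
    while True:
        shift = rng.randint(-5, 5)
        delta = rng.choice([-2, -1, 1, 2])
        new_start = start + shift
        new_length = length + delta
        if (L_MIN <= new_length <= L_MAX
                and new_start >= 0
                and new_start + new_length <= seq_len):
            return new_start, new_length

# Propose a new placement for a hypothetical instance of length 10 at position 523
print(slide(start=523, length=10, seq_len=10_000))
```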

In addition to the above operators, both randomly chosen single motif instances and motif types are proposed to be reinserted into the background model. Since there is a strictly positive probability of associating each motif instance with any type or the background by a successive use of the proposal steps, the MCMC framework is irreducible. Consequently, the following result holds for the stochastic learning, as the algorithm introduced here satisfies the general conditions stated in [26].

Theorem 2. Let ℳ′ be a model space for any given set of motif instance candidates, and let β„³π‘‘βŠ†β„³β€² be the part of the space explored at time 𝑑 by the search process defined by the parallel nonreversible Metropolis-Hastings algorithm defined above. Let Then as π‘‘β†’βˆž.

5. Empirical Illustration

To briefly illustrate the unsupervised learning framework we have generated simulated data for the motifs reb1, matalpha2, pdr1/3 from the SCPD database [36]. The simulation was performed by first estimating a VLMC based on the SCPD background sequence for these motifs. Then, a sequence was generated using the estimated VLMC model, into which motif data was randomly inserted. Each inserted motif instance was generated using the positional weight matrices of the corresponding motifs [36]. This resulted in a sequence of approximately 10000 nucleotides.

To obtain a set of motif instance candidates, we applied the following procedure. First, we neglected the words 𝑧𝑗𝑗+π‘™βˆ’1 that occur in 𝐱 fewer than 4 times, because they are deemed to have very small chances of being conclusively judged not to belong to the background DNA sequence. Then, the probabilities of the remaining substrings 𝑧𝑗𝑗+π‘™βˆ’1 were ranked. Those among the 100 most improbable under the background model, with respect to 𝑃(𝑅𝑛(𝑧𝑗𝑗+π‘™βˆ’1)β©Ύπ‘Ÿπ‘›), were finally considered as candidates for being motifs. The prior limits of the motif length were set according to the interval [7, 15] and the upper bound π‘˜max=10 was used for the number of motif types.
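The candidate-selection procedure can be summarized by the following sketch; the function standing in for the compound Poisson tail probability of Section 2 is a dummy placeholder so that the example runs on its own, and, unlike 𝑅𝑛, the sketch counts overlapping occurrences.

```python
from collections import Counter
from itertools import chain

def candidate_motif_instances(x, tail_probability, l_min=7, l_max=15,
                              min_count=4, n_candidates=100):
    """Rank all words of length l_min..l_max occurring at least min_count
    times in x by their tail probability under the background model and
    return the n_candidates most improbable ones."""
    counts = Counter(chain.from_iterable(
        (x[j:j + l] for j in range(len(x) - l + 1))
        for l in range(l_min, l_max + 1)))
    frequent = {w: r for w, r in counts.items() if r >= min_count}
    ranked = sorted(frequent, key=lambda w: tail_probability(w, frequent[w]))
    return ranked[:n_candidates]

def dummy_tail_probability(word, r):
    # Stand-in for the compound Poisson estimate of P(R_n(word) >= r).
    return 0.25 ** len(word) * r

print(candidate_motif_instances("ACGT" * 300, dummy_tail_probability)[:5])
```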

The behavior of the MCMC optimization for 50 parallel processes is shown in Figure 2. Out of the 80 motif instances (30 reb1, 30 matalpha2, 20 pdr1/3) embedded in the SCPD-like sequence of length approximately 10000 nucleotides, our algorithm was able to identify 64. The adjusted Rand index (see [37, 38]) between the optimal unsupervised classification of the candidates and the underlying true classification is 0.9623, which indicates a high fidelity of the results for those motif instances that were found.
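The adjusted Rand index can be computed, for example, with scikit-learn; the label vectors below are invented and serve only to show the call.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels: true motif types versus the classes found by the model
true_types  = ["reb1"] * 5 + ["matalpha2"] * 4 + ["pdr1/3"] * 3
found_types = [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3]

print(adjusted_rand_score(true_types, found_types))
```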

We also tested the algorithm on real data under comparable prior settings for the motifs MCB and PDR1/3 from SCPD. This was done by providing the upstream regulating regions containing MCB and PDR1/3 (with a total length of 6149 nucleotides) as an input data set for the algorithm. This resulted in 7/32 motifs identified (2 MCB and 5 PDR1/3), which were all correctly grouped with respect to the classification in SCPD.

6. Discussion

The unsupervised classification model was developed under the assumption that the bases at the different positions of a motif instance are conditionally independent given the motif type. This assumption is analogous to the strategy used in a majority of motif identification methods; however, a more elaborate model structure could also be developed by factorizing the marginal likelihood under a low-order dependence structure, such as a sparse Bayesian network or a context tree similar to that utilized in the construction of the VLMC. This type of factorization principle has earlier been used successfully in ordinary likelihood-based motif scanning; see, for example, [11, 13]. The advantage of such likelihood factorizations in the Bayesian setting would be that the marginal likelihoods could still be calculated analytically under suitable Dirichlet priors, thus enabling the utilization of the nonreversible MCMC computation for model learning. However, such a strategy would significantly increase the computational complexity associated with the learning procedure, as the size of the model space would increase considerably.

The statistical learning method described here could be modified to incorporate more specific biological knowledge in terms of informative prior distributions for the occurrence of certain nucleotides in any given part of a motif instance. Such priors may be designed from well-curated biological databases using the basic properties of the Dirichlet distribution. To reduce the computational burden of our experiments, we used the stochastic search initialized only by the candidate motif instances. Also, the VLMC model for the background was first fitted to the data, whereafter the context tree was not re-estimated during the motif discovery phase. In a fully coherent Bayesian learning procedure it would be necessary to consider both aspects entirely as parts of the structural layer of the model and to learn them in parallel with the motif discovery. Similarly, it would be preferable to also monitor the learning process by creating an extensive set of posterior summaries of the motif configurations and visited segments of the sequence, as in [15], instead of solely targeting the posterior mode of the model structure.

The fact that sequence scanning yields a large number of motifs that cannot be explained with the knowledge built from the current databases is an open problem in the scanning field, as confirmed by similar studies using YMF [39] and Weeder [40], as well as by metastudies such as [41] and even supervised clustering algorithms such as LOGOS [10]. Despite such a high β€œnoise” level, we have illustrated that a model-based approach has the potential to simultaneously discover multiple motif types hidden in the sequences. From an intuitive perspective such an approach should make more sense than screening for a single motif type at a time; the latter can be suboptimal, for example, when separate motif types that are closely related in an evolutionary sense are present in the data. Here our focus has been on the mathematical formulation of a statistical framework that has the potential for a simultaneous discovery of multiple motif types. We wish to emphasize the advantages of formulating the statistical problem using a Bayesian learning framework which enables the use of nonstandard MCMC computation. In the future our aim is to develop more concrete practical tools extending the basic model and the implementation of the learning algorithm, by exploiting a massively parallel computation architecture to pursue large-scale validatory and exploratory experiments.

Appendix

Mathematical details of the derivation of the result in Theorem 1 are given below. For convenience of reference we recall the definition of the Dirichlet integral,

$$\int \prod_{i=1}^{n} v_{i}^{\alpha_{i}-1}\, dv = \frac{\prod_{i=1}^{n} \Gamma(\alpha_{i})}{\Gamma\left(\sum_{i=1}^{n} \alpha_{i}\right)},$$

where the integral is taken over the simplex of probability vectors (𝑣1,…,𝑣𝑛), π›Όπ‘–βˆˆβ„, and 𝛼𝑖>0 for all 𝑖.

A marginal distribution for π‘§π‘›π‘œ+1 given π‘§π‘œ1 is Now (23) and (25) yield that (A.2) is equal to The Dirichlet integral can now be used to establish (for each context 𝑔) and similarly for each motif type 𝑐 this yields Further, using (A.4) and (A.5) the stated result is established.

Acknowledgments

The authors thank Dr. Torkel Erhardsson, LinkΓΆpings Universitet, for valuable discussions about the compound Poisson approximation. The authors would also like to remember their late friend and colleague BjΓΆrn Larsson, a graduate student in LinkΓΆping, who participated in the initial stage of this work before losing his life in a traffic accident. The work was supported by grant no. 4042140 from the Swedish Research Council (VR/NT) and by grant no. 121301 from the Academy of Finland.