Abstract

Identifying transcription factor binding sites with experimental methods is often expensive and time consuming. Although many computational approaches and tools have been developed for this problem, the prediction accuracy is not satisfactory. In this paper, we develop a new computational approach that can model the relationships among all short sequence segments in the promoter regions with a graph theoretic model. Based on this model, finding the locations of transcription factor binding site is reduced to computing maximum weighted cliques in a graph with weighted edges. We have implemented this approach and used it to predict the binding sites in two organisms, Caenorhabditis elegans and mus musculus. We compared the prediction accuracy with that of the Gibbs Motif Sampler. We found that the accuracy of our approach is higher than or comparable with that of the Gibbs Motif Sampler for most of tested data and can accurately identify binding sites in cases where the Gibbs Motif Sampler has difficulty to predict their locations.

1. Introduction

Gene regulation is one of the most important biological processes at molecular level. Recent work [1] has shown that gene regulations precisely control the expression levels of genes, which in turn controls many biological processes. In general, the regulation is performed by a binding process, where a protein called transcription factor binds to the promoter region of a gene. The nucleotides where the binding occurs form a binding site. Methods that can accurately identify the locations of binding sites in the promoter regions of genes are thus crucial for understanding the process of gene regulation.

Traditionally, TF binding sites have been characterized by a variety of different experimental approaches [2, 3]. However, predicting transcription factor binding sites with experimental approaches is often expensive and time consuming. Recently, with the large amount of sequence data, computational approaches have become an important alternative to predicting the transcription factor binding sites [49]. Most of the available computational approaches process the promoter regions of a set of homologous genes and recognize binding sites by identifying subsequences that are similar in sequence content. Most of these approaches employ a randomized sampling procedure to identify these subsequences and thus cannot guarantee the prediction accuracy.

For example, approaches based on Gibbs sampling randomly select a candidate subsequence of a fixed length from each sequence. A sequence is arbitrarily selected and each subsequence of the same length in the sequence is aligned to a profiling model obtained from the subsequences selected on other sequences, and each subsequence in the selected sequence is thus associated with a probability value which is the alignment score. One of the subsequences is then selected based on the distribution of the probability values of these subsequences, and a new set of subsequences is thus obtained. The procedure can be repeated until the maximum allowed number of iterations has been reached or a set of satisfying local optimal subsequences have been found [4, 5, 10]. Consensus used a greedy algorithm to align functionally related sequences and applied the algorithm to identify the binding sites for the E. coli CRP protein [11]. Bailey and Elkan [12, 13] used the Expectation maximization technique to fit a two-component mixture model to find binding sites and developed a software MEME+. MEME+ performed better than their previous MEME software [13]. However, its accuracy for identifying transcription factor binding sites is far from being satisfactory.

Genetic algorithms simulate the process of Darwin evolutionary process to find a local optimal solution for an optimization problem. Approaches based on GAs start with an initial population of a certain size. Each individual in the population is a valid solution for the problem. The individuals in the population then go through selection, crossover, and mutation based on certain methods and probabilities to evolve to the next generation. This evolutionary procedure keeps going until the maximum allowed number of generations is reached or the difference between the results falls into a small threshold. Genetic algorithms have been applied to predict the transcription factor binding sites such as FMGA [14]. FMGA was declared to have better performance than Gibbs Motif Sampler [5] in both accuracy and computation time. MDGA [15] is another program that used genetic algorithms to find motifs in homologous sequences. It used information content to evaluate the fitness of an individual during the evolution. MDGA achieved higher accuracy than Gibbs sampling algorithm based approaches while requiring less computation time [15].

In this paper, we develop a new approach that can predict the transcription factor binding sites without using a sampling procedure to select subsequences. Our approach uses a graph to model all subsequences in the promoter regions of the homologous genes and the similarity between any pair of subsequences that are from different promoter regions. In particular, each subsequence is represented by a vertex in the graph, and two vertices are joined by an edge if the two corresponding subsequences are from different promoter regions and their similarity is higher than a threshold. The threshold can be determined using the base compositions of the promoter regions and is guaranteed to be statistically significant. Each edge in the graph is associated with a weight value which is the similarity of the two corresponding subsequences. We then compute the maximum weighted clique in the graph, and the subsequences represented by the vertices in the clique are the transcription factor binding sites.

In order to efficiently compute the maximum weighted clique in the graph, we developed an iterative approach to preprocess the vertices in the graph and remove the vertices that cannot be in the clique from the graph, the size of the graph can thus be significantly reduced by applying this technique to it. After the preprocessing stage, the algorithm exhaustively enumerates all cliques that contain as many vertices as the number of promoter region sequences in the data set and returns the one with the largest weight value.

We have implemented this approach in a computer program with C++ programming language and used the program to predict the binding sites for two organisms, Caenorhabditis elegans and mus musculus. We evaluated the prediction accuracy of our approach based on the testing data sets and compared it with that of the Gibbs Motif Sampler [5]. Our testing results showed that our approach can achieve an average developer score of 0.90 on the test data, which is higher than or comparable with that of the Gibbs Sampler. In addition, our approach can accurately identify the binding sites in some cases where the Gibbs Motif Sampler has difficulty to locate the binding sites, which is one important advantage of our approach over the Gibbs Motif Sampler.

2. Method

2.1. Notations and Problem Description

Given a graph , a clique in is a vertex subset such that every pair of vertices in is joined by an edge in . If each edge in is associated with a weight value, a maximum weighted clique is a clique where the sum of the weight values of all edges in the clique is the maximum of all cliques in . The degree of a vertex is the number of vertices that are adjacent to it in .

Given the sequences of the promoter regions of a set of homologous genes, , the goal of the problem. The goal of the problem is to select a subsequence of a given length from each sequence such that the sum of the similarities between all pairs of selected sequences is maximized. To compare two subsequences and evaluate their similarity, we perform a pair wise alignment between them, and the alignment score is used as a measure of the similarity. We show later in the paper that this problem can be solved by computing the maximum weighted clique in a graph.

2.2. The Algorithm

Given a set of DNA sequences with similar transcription factor binding sites, we take each where and divide it into - subsequences of length as described in Figure 1, where is the number of nucleotides in . In particular, we choose the subsequence that starts from each nucleotide in and contains nucleotides. It is not difficult to see that the number of such subsequences is -.

We then construct a graph such that each vertex of the graph represents a subsequence. Subsequences from the same sequence are placed in one column, and all vertices in can thus be partitioned into disjoint columns. A sample of randomly generated and normally distributed sequence comparison scores is used to calculate a threshold which is then used to determine which sequences are similar. The algorithm starts with the first column and selects a vertex in that column and aligns its corresponding subsequence to that of every other vertex that is from a different column. If the alignment score of two subsequences is higher than the threshold, we make their vertices adjacent in . The algorithm repeats the above process for each vertex until all vertices in the graph have been processed.

After all the edges have been added to the graph, we proceed to preprocess the graph and remove the vertices that cannot be in a clique of size . In particular, we examine the degree of each vertex, and if the degree of a vertex is less than , we remove it from . This procedure is applied iteratively to all vertices in the graph until the size of the graph cannot be further reduced.

The algorithm then starts enumerating all -cliques in the graph and computes the weight of each clique. To this end, the algorithm assigns an integer id between 1 and to each column in the graph and starts with columns 1 and 2. In particular, all edges that connect a vertex from column 1 and a vertex from column 2 are included in a set . is maintained by the algorithm to store cliques. Initially, contains a set of edges, which are in fact cliques on two vertices. The algorithm then proceeds to examine vertices in columns 3 through . After the algorithm completes processing column , contains a set of cliques of size in . For every vertex in column , the algorithm checks each -clique in to examine whether and can together form a -clique. In other words, the algorithm examines whether is adjacent to every vertex in or not. If it is the case, and are combined into a single -clique and included in . After every vertex in column has been processed, the algorithm examines the cliques in and removes all -cliques from . It is not difficult to see that after all columns have been processed, contains a set of -cliques in . It then computes the weight value of each clique in by adding up the weight values of all edges in the clique and outputs the clique with the largest weight value.

The matrix used to generate the comparison score is a log odds ratio matrix for comparing DNA nucleotide sequences. This matrix was developed by Chiromante et al. [16]. They followed a common approach used in protein alignment and determined substitution score by using a set of trusted aligned symbol pairs and using log odds ratio [16]. The matrix used for the alignment of short subsequences is shown in the Table 1.

In order to compute the threshold of similarity value that is used to construct the edges in , we first compute the base composition of all promoter region sequences. Based on the percentage of each nucleotide in the base composition, we randomly generate two subsequences, each of which contains nucleotides. We then perform a pair wise alignment between the two generated subsequences. The alignment can generate an alignment score. We repeat the procedure for a sufficiently large number of times, and we thus can obtain a large collection of alignment scores. This collection of alignment scores in fact describes the distribution of alignment scores between two subsequences from the given base composition. We then choose a confidence value and find the smallest value such that the percentage of the alignment scores higher than in the collection is not larger than . The value of is then used as the threshold of similarity to construct graph .

2.3. Time Complexity

Each iterative step in the preprocessing stage of the algorithm may need up to computation time, where is the number of nucleotides in each promoter region sequence. Since each iterative step removes at least one vertex from the graph. The preprocessing stage may need up to computation time. We use to denote the maximum number of vertices in a column in the graph after the graph is preprocessed. The number of -cliques in the graph is at most . The computation time needed to check whether a vertex can be added into an existing clique to form a larger clique is at most . The computation time needed to compute the maximum weighted clique in the preprocessed graph is thus at most . Putting the preprocessing stage and the clique enumeration stage together, the algorithm needs at most computation time.

2.4. Experimental Results

We have implemented our approach with C++ programming language and used it to predict the transcription factor binding sites in the promoter regions of three genes that were selected from two organisms Caenorhabditis elegans and mus musculus. One data set is selected from the promoter region of a gene of Caenorhabditis elegans, and the remaining two are selected from the promoter regions of two genes of mus musculus. Table 2 provides a description of the testing data we have used to test the accuracy of our approach.

In order to evaluate the accuracy of our approach, we compare the binding sites predicted by our approach with the correct binding sites for all sequences in the data set. A developer score is computed to provide a measure of the accuracy. The developer score is computed based on the deviation of the predicted starting position of a binding site from its real starting position. In particular, for a binding site of length , if the deviation is , the developer score of the prediction result is then computed as follows: It is not difficult to see that the developer score is a measure of the accuracy of a predicted binding site. If the predicted binding site is completely correct, the deviation is 0, and the developer score is thus 1.0. On the other hand, if the predicted binding site is completely incorrect, it does not even intersect the real binding site, the deviation is at least , and the developer score is thus 0.0. In general, the developer score of a predicted binding site is a positive real number between 0.0 and 1.0. A higher value of the developer score indicates a more accurate prediction result.

We also used the Gibbs Motif Sampler [5] to predict the binding sites on the same data sets and evaluate the accuracy of its prediction results with the developer score. We compare the accuracy of our approach with that of the Gibbs Motif Sampler. Tables 3, 4, and 5 provide the accuracy of our approach and that of the Gibbs Motif Sampler on three tested data sets in terms of the developer score.

It is evident from the testing results that our program can achieve an average developer score of 0.90 for the testing data sets. This indicates that our approach can identify the binding sites with accuracy comparable with or better than that of the Gibbs Motif Sampler for most of the tested sequences. In addition, in some cases where the Gibbs Motif Sampler has difficulty to locate the binding sites, our program can identify the binding sites with high accuracy. For example, for sequences 2, 3, and 18 in the data set for mus musculus that contains binding sites of length 11, the Gibbs Motif Sampler fails to identify the correct locations of binding sites while our approach identifies the binding sites for all sequences with high accuracy. The Gibbs Motif Sampler significantly outperforms our approach only in sequence 16 in the data set for Caenorhabditis elegans, where our approach fails to identify the correct location of binding site while the Gibbs Motif Sampler identifies its correct location without any error. It is worth mentioning that our approach achieves significantly higher prediction accuracy for all sequences in the data set for mus musculus that contains binding sites of length 11.

3. Conclusions

In this paper, we developed a novel approach that can efficiently and accurately predict the transcription binding sites in the promoter regions of genes. Our approach uses a graph model to describe the subsequences in the promoter regions of homologous genes and their relationships. The problem is then reduced to a graph optimization problem. In order to efficiently compute the optimal solution of the problem, we developed a preprocessing technique that can significantly reduce the size of the graph. Our testing results on the sequence data from two organisms Caenorhabditis elegans and mus musculus showed that our approach can achieve prediction accuracy higher than or comparable with that of the Gibbs Motif Sampler.

We believe the performance of our approach can be further improved if we employ a weighted scoring scheme that can assign different relative weight values to the pair wise matching scores obtained on different positions in the subsequences. It is well known that mutation is much more likely to occur in nucleotides near the boundary of the binding sites than those near the center of the binding sites. Lower values of relative weights thus should be assigned to the matching scores obtained on nucleotides near the boundary of the binding sites. Determining the relative weights that can maximize the accuracy of prediction is an interesting problem and would be a part of our future work.

Acknowledgment

The work was supported by the Natural Science Foundation of Jiangsu Province (BK2011319).