Abstract

Novel computational methods for finding transcription factor binding motifs have long been sought due to tedious work of experimentally identifying them. However, the current prevailing methods yield a large number of false positive predictions due to the short, variable nature of transcriptional factor binding sites (TFBSs). We proposed here a method that combines sequence overrepresentation and cross-species sequence conservation to detect TFBSs in upstream regions of a given set of coregulated genes. We applied the method to 35 S. cerevisiae transcriptional factors with known DNA binding motifs (with the support of orthologous sequences from genomes of S. mikatae, S. bayanus, and S. paradoxus), and the proposed method outperformed the single-genome-based motif finding methods MEME and AlignACE as well as the multiple-genome-based methods PHYME and Footprinter for the majority of these transcriptional factors. Compared with the prevailing motif finding software, our method has some advantages in finding transcriptional factor binding motifs for potential coregulated genes if the gene upstream sequences of multiple closely related species are available. Although we used yeast genomes to assess our method in this study, it might also be applied to other organisms if suitable related species are available and the upstream sequences of coregulated genes can be obtained for the multiple closely related species.

1. Introduction

To understand the mechanisms that regulate the gene expression in eukaryotes is a major challenge in modern molecular biology. Gene regulation is accomplished by a number of regulatory proteins called transcriptional factors (TFs), which bind to specific DNA motifs in the promoter region of the target gene. TFs and their binding motifs interact with each other and help cells to respond to diverse stimuli. Identifying TFBSs in the upstream region of coregulated genes (genes regulated by a common TF) is crucial for inferring gene regulatory networks, since these motifs might be the building blocks of the regulatory network structures [1]. Most DNA binding motifs contain 6–25 bps and have a range of variability. Regulatory systems can take advantage of the variability in the binding sites to better control transcription [2]. Classical computational motif finding methods can be classified into two major categories: (1) enumerative methods, which explore all possible motifs up to a certain length; (2) local search algorithms using statistic approaches such as EM [36] and Gibbs sampling [79]. Under the second category, MEME [1012] and AlignACE [13, 14] are two computer programs used frequently in finding motifs in a group of related DNA sequences. Recently, comparative genomics approaches such as phylogenetic footprinting have been developed for identifying TFBSs based on the premise that selective pressure causes functional elements to evolve at a slower rate than that of nonfunctional sequences [15]. Phylogenetic footprinting is mostly applied to finding well-conserved regions in a set of orthologous sequences from multiple species [1518]. Although substantial progresses have been made in developing computational methods for predicting transcription factor binding motifs, currently available motif finding tools still yield many false positives due to relatively short length and high variability of DNA binding motifs. These motif finding tools with standard parameter settings usually report one putative TFBS out of 500 to 5000 bps, whereas only 0.1% of the predictions is likely to be functional [19]. Recently, gene expression profile analysis using microarray data and statistical clustering has resulted in numerous sets of potential co-regulated genes. Furthermore, the complete sequencing of more and more eukaryotic genomes makes it easier to obtain the upstream sequences of these co-regulated genes. Hence the development of a novel method with improved specificity in predicting transcription factor binding motifs for co-regulated genes becomes necessary and feasible.

We proposed here a method of finding TF binding motifs by considering both sequence overrepresentation in promoter regions and their conservation across closely related species. DNA binding motifs are believed to appear more frequently in the upstream regions of the genes being regulated, and these motifs are usually conserved across multiple closely related species. We use the degree of sequence conservation among multiple species as an additional constraint to reduce the false positive predictions. For a given set of co-regulated genes from a certain organism, we collect orthologous sequences from multiple closely related species and align them using multiple alignment programs such as ClustalW [20]. The statistically overrepresented sequences will be firstly selected as initial motif candidates, and then we evaluate their conservation in the alignments of orthologous upstream sequences of coexpressed genes. A statistic procedure based on the above principles was designed to scan for potential motifs, and a Perl script was written to conduct the procedure. To evaluate the proposed method, we collected 35 yeast TFs with known DNA binding motifs, and for genes co-regulated by each of these TFs, we searched the upstream regions for potential binding motifs. We compared our method with single-genome-based motif finding methods such as MEME and AlignACE, as well as with the multiple-genome based methods such as PHYME and Footprinter; the results suggested that the rank of the known binding motifs among the predictions of our method are generally higher than that using other methods.

2. Results

We used 35 well-studied yeast transcription factors (see Table 1) to evaluate the proposed method. The criteria for selecting the TFs are (1) their true DNA binding motifs are known; (2) the orthologous genes are available in all the four yeast species, and the upstream sequences of these genes are also available. For each TF, we built two sets of genes, namely, the positive set (PS) and the negative set (NS). The PS consisted of all the genes that are known to be co-regulated by the TF (see Table 1), whereas the NS consisted of randomly selected genes from the S. cerevisiae genome. The NS was used to introduce the background information and serve as a control in our motif finding process. For each gene in both PS and NS, we extracted its promoter region sequences from the genomes of four yeast species, namely, S. cerevisiae, S. mikatae, S. bayanus, and S. paradoxus. We took S. cerevisiae as the principal species in our study.

The method was implemented using a PERL script to find potential binding motifs in the upstream sequences of the genes co-regulated by a given TF (see Table 1 for the 35 TFs considered in this study). We found the known binding motifs for 25 out of the 35 TFs. In Table 2, we listed the known DNA binding motif and the motif found using our method for each TF.

We compared our method with the single-genome-based motif finding methods MEME and AlignACE, as well as with the multiple-genome-based methods PHYME and Footprinter for the majority of these transcription factors. We used the upstream sequences of S. cerevisiae genes in the PS of each TF as the input of MEME and AlignACE. All the parameters were set to default when we used AlignACE to find motifs. To apply MEME to the motif finding, we set the minimum length of the potential motif to 6, and we set the number of motifs expected to be found to the same as the number of motifs predicted by our method. The results are listed in Tables 3 and 4, respectively. Since our method takes into account the conservation of candidate motif sequences among multiple species, the number of predicted motifs found for each TF is in general less than that found by AlignACE (Table 3) or MEME (Table 4). Tables 3 and 4 showed that our method is more efficient in finding the true motifs than AlignACE or MEME, in the sense that it returned less predicted motifs, and the ranks of the known motifs are also generally higher than those in the output of AlignACE or MEME. For example, there were 11 potential motifs found by AlignACE for STE12, and the known motif of STE12 ranked second in the output; however, using our method only one motif was found, and it was the known motif. The results for other TFs showed the same tendency. AlignACE and MEME could only find the known binding motifs for 14 and 12 TFs, respectively, out of the 25 TFs whose known binding motifs were found using our method. Our method could not find the known binding motifs for 10 TFs among the 35 (Table 5) with any of the three parameter threshold settings. Out of these 10 TFs, using AlignACE and MEME we can find known binding motifs for 5 and 3 TFs, respectively.

Unlike single-genome-based motif finding methods such as AlignACE and MEME, our method uses the sequence information from multigenomes, so it is more reasonable to compare it with PHYME and Footprinter, which are two popular multiple-genome-based methods. For a given TF, we found that Footprinter usually yields overwhelming number of predictions, and this makes it difficult to do a comparison. To apply PHYME to the motif finding, we set the motif length limit to 17, which is the maximum length of all known binding motifs of the 35 TFs. For each regulon, the number of motifs predicted was set to 10 and the motifs were searched on both strands. The results were listed in Table 6. Using PHYME we found known motifs for 23 TFs, among them there were 6 TFs whose known binding motifs were not found using our method. From Table 6, we can see that our method and PHYME nearly have the same power in motif finding; however, the ranks of the known motifs found using our method are generally higher than those found by PHYME. Table 7 gives a list of the TFs whose known binding motifs could be found using our method but could not be found by MEME, AlignACE, and PHYME. Our method could not find the known binding motifs for 10 TFs out of the 35 (Table 5). For these 10 TFs, AlignACE and MEME can find known binding motifs for 5 and 3 TFs, respectively. With PHYME, we can find known binding motifs for 6 TFs out of these 10 (Table 6).

3. Discussion

Transcription factors and their DNA binding sites are two of the most important functional elements in eukaryotic genomes. A thorough study of the interactions of the two is important for mapping the regulatory pathways and understanding the potential function of the genes regulated by the TFs [21]. In the past decade, clustering of gene expression profiles obtained from large-scale DNA microarray experiments has been successfully used in identifying coexpressed genes [22, 23], and we believed that these coexpressed genes may share common regulators that bind to their upstream regions. Finding the TF binding motifs of these potentially co-regulated genes becomes critical for understanding the interaction of the genes and their regulators [2427]. So far the binding specificities have been well characterized only for a small number of TFs [19, 21]. TFBSs are usually quite short (around 6–25 bp) and degenerate, which leads to the difficulties in finding them reliably using current motif finding tools. Even though the ab initio motif finding tools have been used successfully in many cases, their performances are far from satisfying. The major drawback of these tools is that they produce many false positive predictions. Under default parameter settings, they yield usually tens or hundreds of putative motifs, and it is difficult to judge which candidate motifs out of them are functional [19]. Phylogenetic footprinting methods have been proposed recently [1518], by which the interspecies comparative sequence information is used for helping to signal the presence of TF binding sites that might not have been predicted using sequences from a single genome. For example, binding sites found in human sequences that are also found in orthologous mouse or other mammalian sequences are far more likely to be functional than those found only in human [28]. We refer to these short orthologous sequences that are conserved over 6 bp or more as phylogenetic footprints.

Our method proposed here considers both overrepresentation and cross-species conservation of potential binding motifs. We used binomial test to determine the statistically overrepresented candidate sequences, and the average relative entropy of the aligned sequence block was used to measure the cross-species conservation of these candidates. The relative entropy is a popular measure of the degree of conservation at a site in a DNA or protein sequence alignment [29]. In our method, the input data are the upstream sequences of two groups of genes, namely, the co-regulated genes of a TF (PS) and the control genes (NS) selected randomly from the genome of the principal species under study, as well as the orthologous sequences from other species, which are closely related to the principal species. Usually the co-regulated genes are collected through wet lab experiments or predicted through gene expression profile analysis using microarray data. The upstream sequences of genes in PS and NS could be extracted from the genome of the principal species, and the corresponding upstream sequences from other species could be obtained by doing BLAST [30] or by downloading from the publicly available databases.

Three parameters are considered in our method: (1) P value, which is used to evaluate the overrepresentation of a candidate sequence, (2) average relative entropy ARE𝑃 of 𝑆OP, which gives the degree of conservation of a candidate motif, (3) Z-value, which is used to assess the statistical significance of the conservation. In order to have a balanced consideration of the sensitivity and the specificity and to cope with different situations, we applied three different parameter threshold settings to scan for candidate motifs, and they are (a) P-value in a magnitude of 10−6 (after Bonferroni correction), ARE𝑃= 1.0, and Z-value = 2.0; (b) P-value = 0.01 (without Bonferroni correction), ARE𝑃 = 1.0, and Z-value = 2.0; (c) P-value = 0.01 (without Bonferroni correction), ARE𝑃= 0.8, and Z-value = 2.0. Theoretically, we can find most of the known motifs as long as we make the criteria for overrepresentation and conservation loose enough, but the less strict criteria may result in numerous putative motifs that are actually false positives. Considering the high cost of verifying a predicted motif through lab experiment, we used firstly a strict criterion for candidate motif screening, so parameter setting (a) was set as default in our method. Using this strict parameter threshold setting we may miss some true TF binding motifs (see Tables 3 and 4), especially those without very high-level statistical significance of overrepresentation, and the method may not be able to return any predictions. We loosen the criteria by using setting (b) or setting (c) in actual motif finding process, if using the default threshold setting, we can find no hit at all. Setting (b) has a moderate criterion for overrepresentation, so it allows more candidate motif to pass the screening. With setting (c), we loosen the criterion of the degree of conservation, since there do exist some known TF binding motifs with ARE𝑃 less than 1.0 (see Table 2).

The method proposed here is, nevertheless, not a replacement of the prevailing motif tools such as MEME and AlignACE. The major limitation of our method is its strong prerequisite. Multiple closely related species and the upstream sequences of each co-regulated gene for all species under study are requested, and in many cases these prerequisites may not be satisfied, so the method is, therefore, not generally applicable. Another problem is how to choose the appropriate species to evaluate the cross-species conservation. In principle, the species selected in the study should be close enough so that the conservation of motif sequences could be detected in a multiple alignment, in the meanwhile their evolutionary distances should not be too close, so that the signals could be distinguished from the noises [31]. The number of species used in the method is also a factor that may need to be considered. We recommended three or four, since using too many species may bring up strong noise and reduce the detection power of the method.

After comparing with the motif finding software such as MEME, AlignACE, and PHYME, we can reach the following conclusions: (1) Our method screens for candidate motifs in terms of both overrepresentation and conservation, therefore, it gives relatively less predicted motifs for a group of co-regulated genes (Tables 3 and 4), hence it is helpful for reducing false positive predictions; (2) The rank of known motif in the output of our method is in general higher (Tables 3 and 4), and this is of practical importance, since we usually focus only on putative binding motifs with high ranks despite the large number of predicted motifs; (3) unlike the most common motif finding tools, our method requests no prior inputs such as the length of the motifs or the number of predictions. Although we used yeast genomes to assess our method, it could also be applied to other organisms if suitable related species are available and the upstream sequences of co-regulated genes could be obtained for the multiple species.

4. Materials and Methods

4.1. Materials

In this study, we considered gene promoter regions of four yeast species, namely, S. cerevisiae, S. mikatae, S. bayanus, and S. paradoxus. All these four are members of the Saccharomyces sensu stricto group. The last three are believed to be separated from S. cerevisiae by an estimated 5–20 million years of evolution and are found to have sufficient sequence similarity to S. cerevisiae such that orthologous regions can be aligned reliably [32].

We obtained the information about gene regulation network of S. cerevisiae from the database SCPD (The Promoter Database of Saccharomyces cerevisiae) [33], which contained TFs and genes co-regulated by them. The upstream region sequences of the co-regulated genes of each TF for all the four yeast species were downloaded from http://www.broad.mit.edu/.

The genes known to be co-regulated by specific TFs such as STE12 and GAL4 were used to evaluate the method. We let PS (positive set) denote the collection of S. cerevisiae genes co-regulated by a common TF, and we built an NS (negative set) by randomly selected S. cerevisiae genes. For each gene in both PS and NS, we extracted the promoter region sequences for all the four species and aligned them using multiple sequence alignment program ClustalW.

4.2. Methods

The method proposed here requests promoter region sequences from multiple closely related ortholog species. Usually we are interested in motif finding for only one of the species, namely, the principal species, whereas the sequences from other species are helpful for the reduction of false positives. For a given TF, we need two sets of genes, namely, positive set (PS) and negative set (NS). PS consists of the genes co-regulated by the TF, whereas NS consists of genes randomly collected from the genome of the principal species.

4.3. Finding Overrepresented Sequences

We only consider the principal species for finding overrepresented sequences. We first search the promoter regions of the genes in NS for each possible sequence pattern of length M (6𝑀17) that satisfying the following constraints: the first three nucleotides in the left flank and the last three nucleotides in the right flank are the core elements and fixed, between the two core elements there might be 𝑀0 nucleotides (𝑀0=0,1,2,,11) and each of them could be any of the nucleotides A, C, T, and G. So within the L-bp upstream region of a gene, there are 𝐿𝑀+1 possible locations that can be occupied by a sequence pattern of length M. We call the fraction as the background probability of the given sequence pattern 𝑐𝑝=𝑛𝑐,(4.1) where 𝑐𝑛 is the number of total occurrences of the pattern in the promoter regions of genes in NS, and c is the total number of possible locations for an M-bp sequence in promoter regions of the genes in NS. In the same way, we can obtain the number of the pattern occurrences K and the total number (N) of possible M-bp locations in the promoter regions of the genes in PS. Using binomial distribution, we can calculate the probability of the pattern occurring more than K times as following [27, 34]: 𝑃=𝑁𝑘=𝐾𝑁!𝑝(𝑁𝑘)!𝑘!𝑘(1𝑝)𝑁𝑘,(4.2) where p is the background probability. We choose the sequence patterns with P less than a threshold 𝑝 (usually in magnitude of 106 after Bonferroni adjustment) for further analysis. If the overlap of two sequences is longer than 80% of one of the two, we eliminate the sequence with larger P-value from the collection of overrepresented sequences. Both DNA strands are considered when we calculate the number of occurrences for a given sequence in the upstream region. If the sequence is a palindrome, we just use the count in one strand as the total occurrence.

4.4. Bonferroni Adjustment

We used the Bonferroni adjustment to the multiple statistical tests for determining overrepresented sequences, so that it was more “difficult” for any single test to be significant. The adjustment was accomplished by setting the P-value threshold at the common significant level (usually 0.05 or 0.01) divided by the number of tests being performed. In our case, the 𝑝 was set as 0.05 divided by the number of all possible sequences in the form of 𝑁𝑁𝑁𝑛𝑛𝑛𝑛𝑁𝑁𝑁, where NNN stands for three fixed nucleotides and 𝑛𝑛𝑛𝑛 stands for unfixed number (from 0 to 11) of nucleotides.

4.4.1. Relative Entropy and Conservation Criteria

Let α be the background nucleotide distribution and β the nucleotide distribution at a given position in a multiple sequence alignment. For the two probability distributions 𝛼 and β, the relative entropy (also known as Kullback-Leibler “distance”) is defined by [29, 35] 𝐻(𝛽||𝛼)=4𝑖=1𝛽𝑖𝛽log𝑖𝛼𝑖.(4.3)

We can prove that relative entropy is always a nonnegative value, and it reflects the extent of the deviation of actual nucleotide distribution from background distribution at a given site in the alignment. The larger the value, the greater the deviation between the actual distribution and the background distribution at that site [29].

Given an overrepresented sequence O, we search for its occurrences in the alignment of upstream sequences from the four species for each gene in PS and NS, respectively. If we find an occurrence in the alignment of a gene in PS, we extract the corresponding sequence block from the alignment and put the four segments that form this block to a sequence set 𝑆OP. Similarly, we also build a sequence set 𝑆ON for genes in NS.

We further align all the sequences in 𝑆OP and 𝑆ON, respectively. These two alignments are used to evaluate the degree of conservation of O across closely related species. We define the average relative entropy (ARE𝑃) of 𝑆OP as ARE𝑃=𝑀𝑖=1𝐸𝑃𝑖𝑀,(4.4) where 𝐸𝑃𝑖 is the relative entropy at the position 𝑖 of the alignment of the sequences in 𝑆OP, and M is the length of 𝑂. If 𝑂 is not found in the alignment of upstream sequences for any gene in NS, then we deposit 𝑂 to the collectionof candidate motifs for further consideration. Otherwise, we could also calculate the average relative entropy ARE𝑁 for the sequences in nonempty set 𝑆ON. We define a Z-score as 𝑍=ARE𝑃ARE𝑁𝑠2𝑁,/𝑀(4.5) where 𝑠𝑁 is the standard deviation of the relative entropies at different positions of the multiple upstream sequence (across multiple species) alignments of genes in NS.

Binding motifs tend to be conserved in the orthologous species (see Figures 1, 2, and 3), so we remove the sequences that are overrepresented but not conserved from our collection of candidate sequences. We set the Z-score threshold as 2, such that the sequences with 𝑍>2 are kept as the candidate sequences for further consideration.

4.5. Building a Profile for a Candidate Sequence

Each candidate sequence will be searched for in the alignment of upstream sequences (from the multiple species) of each gene in PS. If an instance is found in any of the species, we extract the corresponding alignment block for further consideration. We use 𝑒𝐵 to denote the average of the relative entropies at M different positions of an alignment block of length M. For each block, we set 𝑃=𝜇𝑃+2(𝜎𝑃/𝑀) as our cutoff value for block selection, where 𝜇𝑃 and 𝜎𝑃 are the mean and the standard deviation of the relative entropies, respectively, at different positions in the alignments of upstream sequences (from multiple species) of the genes in PS. For a given candidate sequence, we use all the blocks with 𝑒𝐵 greater than 𝑃 to build a profile to represent the candidate motif. For example, we search for a candidate sequence GTTTCA in the alignments of upstream sequences of genes in PS. If we can find it in any species in the alignments, we extract the corresponding alignment block, calculate 𝑒𝐵, and compare it with 𝑃 to decide whether we keep this block for profile building. Using all the blocks selected, we calculate the base frequencies at each position and create thereafter the profile to represent the initial candidate motif. Both strands are considered when we build the profile.

4.6. Species-Specific PSSM Building

The profile obtained above represents the initial candidate motif derived from all the ortholog species. Usually we are only interested in the motif finding for one species, which is named as principal species in our analysis, and it is necessary to build a species-specific PSSM (Position Specific Score Matrix) for the candidate motif [36]. For the genes in PS, which are from the principal species, we search for the candidate motif in their upstream sequences in terms of the initial motif profile, and all the significant hits found are used in building the final species-specific PSSM. The profile search is performed as follows. For each M-bp segment of upstream sequences of the genes in PS, we calculate a score 𝑆𝑐=𝑀𝑖=1𝑞𝑖,(4.6) where 𝑞𝑖 is the probability of observing the ith nucleotide of the segment, which is defined by the position-specific nucleotide distribution in the initial profile of the candidate motif. To determine the significance criterion, we calculate Scs for all the possible M-bp segments of the upstream sequences (for principal species only) of genes in NS and rank these scores in the descending order. We use the 0.001-quantile of these ranked scores, denoted as 𝑆𝑐, as the threshold value to determine whether a match is significant in the profile search. For example, if there are 1000 genes in the NS and the length of each promoter region is L-bp, then there are totally 1000(𝐿𝑀+1) possible segments, so we have 1000(𝐿𝑀+1) scores. We sort the scores in the descending order and set the nth value as the cutoff score 𝑆𝑐 with 𝑛=𝐿𝑀+1. We calculate 𝑆𝑐 for each possible segment in the upstream sequences (principal species only) of the genes in PS. If 𝑆𝑐𝑆𝑐, we deposit the segment into 𝐼, which is the set of the incidences of the candidate motif.

4.7. Optimal Motif Length

Let k be the number of sequence segments in I. In order to determine the optimal length of the potential motif, we extend 0 to 5 bp in both flanks of each M-bp segment in I according to its mother sequence in the gene upstream region. So we have totally 36 possible combinations (left flank extended by 𝑀𝐿=0,1,2,3,4,or5 bp; right flank extended by 𝑀𝑅 = 0, 1, 2, 3, 4, or 5 bp). For each possible combination (𝑀𝐿,𝑀𝑅), we put the newly added flanks into a block with k rows and 𝑀𝐿+𝑀𝑅 columns. We calculate the average relative entropies of all 36 blocks and choose the combination (𝑀𝐿, 𝑀𝑅) that delivers the maximum average relative entropy 𝑒𝐵 for further consideration. In the meanwhile, we randomly generate 1000 sequence blocks, each with k rows and 𝑀𝐿+𝑀𝑅 columns, in terms of the background nucleotide distribution 𝛼. We calculate the mean 𝑒rand and the standard deviation 𝑠rand of the average relative entropies of these 1000 blocks. If 𝑒𝐵 is greater than 𝑒rand+2𝑠rand, then we accept the extension (𝑀𝐿,𝑀𝑅) and set the final motif length at 𝑀+𝑀𝐿+𝑀𝑅; otherwise, we still keep the original motif length 𝑀. The extended sequences (𝑀𝐿 bp in left flank and 𝑀𝑅 bp in right flank) of the segments in I form a new sequence set 𝐼𝑒, which is the set of the incidences of the extended motif. Using all the sequences in 𝐼𝑒, we build the PSSM for a general representation of the final motif.

4.8. Implementation

We used a PERL script to implement the method. The script and the example of input data are available upon request.

Abbreviations

TFBS:Transcription factor binding sites
TF:Transcription factor
bp:Base pair
EM:Expectation maximization.

Conflict of Interests

The authors have declared that no competing interests exist.

Acknowledgments

This work is financially supported by the Nanyang Technological University Research Grant RG64/06 (to JMLI) and a start-up grant from Southern Medical University and Guangdong Province (to JMLI).