Abstract

Obtaining unique oligos from an EST database is a problem of great importance in bioinformatics, particularly in the discovery of new genes and the mapping of the human genome. Many algorithms have been developed to find unique oligos, many of which are much less time consuming than the traditional brute force approach. An algorithm was presented by Zheng et al. (2004) which finds the solution of the unique oligos search problem efficiently. We implement this algorithm as well as several new algorithms based on some theorems included in this paper. We demonstrate how, with these new algorithms, we can obtain unique oligos much faster than with previous ones. We parallelize these new algorithms to further improve the time of finding unique oligos. All algorithms are run on ESTs obtained from a Barley EST database.

1. Introduction

Expressed Sequence Tags (or ESTs) are fragments of DNA that are about 200–800 bases long generated from the sequencing of complementary DNA. ESTs have many applications. They were used in the Human Genome Project in the discovery of new genes and are often used in the mapping of genomic libraries. They can be used to infer functions of newly discovered genes based on comparison to known genes [1].

An oligonucleotide (or oligo) is a subsequence of an EST. Oligos are short, since they are typically no longer than 50 nucleotide bases. Oligos are often referred to in the context of their length by adding the suffix “mer”. For example, an oligo of length 9 would be referred to as a 9-mer. The importance of oligos in relation to EST databases is quite significant. An oligo that is unique in an EST database serves as a representative of its EST sequence. The oligonucleotides (or simply oligos) contained in these EST databases have applications in many areas such as PCR primer design, microarrays, and probing genomic libraries [24].

In this paper we will improve on the algorithms presented in [2] to solve the unique oligos search problem. This problem requires us to determine all oligos that appear in one EST sequence but not in any of the others. In addition, we will consider two oligos to be virtually identical if they fall within a certain number of mismatches from each other. In the appendix we include all the algorithms used and developed in this paper.

2. The Unique Oligos Search Problem

In this paper we use the notation to denote the Hamming Distance between the strings and . Given an EST database , where is a string over the alphabet , integers and , and -mer , we say that occurs approximately in if there exists a substring of some EST such that . We also say that an -mutant list of a string is a list of all possible strings, , of length over the alphabet such that . Such a string is referred to as an -mutant of . A unique oligo of is defined as an -mer such that occurs exactly in one EST and does not occur approximately in any other EST. The unique oligos search problem is the problem of finding all unique oligos in an EST database.

Many algorithms have been presented to solve this problem [5, 6]. The algorithm presented in [2] relies on an observation that if two -mers agree within a specific Hamming Distance, then they must share a certain substring. These observations are presented in this paper as theorems.

Theorem 1. Suppose one has two -mers and such that . If one divides them both into substrings, and , and each , except possibly , has length , then there exists at least one , such that .

Proof. Suppose by contradiction that for any , and have at least 2 mismatches. Then which is a contradiction to the fact that .

Using this observation, an algorithm was presented in [2] which solves the unique oligos search problem in time . The algorithm can be thought of as a two-phase method. In the first phase we record the position of each -mer in the database into a hash table of size . We do so in such a way that for each -mer over the alphabet we have that whereby is an EST sequence, is the position of within that sequence, and is the number of occurrences of in the database. In the second phase, we extend every pair of identical -mers into -mers and compare these -mers for nonuniqueness. We also do the same for pairs that have a Hamming Distance of 1. If they are nonunique, we mark them accordingly. Theorem 1 guarantees that if an -mer is nonunique, then it must share a -mer substring that differs by at most one character with another -mer substring from another -mer. Hence, if an -mer is nonunique, it will be marked during phase two.

Assuming there are symbols in our EST database, the filing of the -mers into the hash table takes time . In phase two, we assume that the distribution of -mers in the database is uniform; in other words, that each table contains entries. Thus we have comparisons within each table entry. Each -mer also has a 1-mutant list of size , so, we have comparisons for each entry in the table. Also, the time required to extend each pair of -mers to -mers is . Given that we have entries in the hash table, we have a total time complexity of where

In [7], several variations of Theorem 1 are presented. We can use these theorems to generate similar algorithms with slightly different time complexities.

Theorem 2. Suppose one has two -mers and such that . If one divides them both into substrings, and , and each , except possibly , has length , then there exists at least one , such that .

Proof. Suppose by contradiction that we cannot find any such that . Then there exists at least one mismatch between and for each , and thus we have at least mismatches which contradicts the fact that .

Based on Theorem 2 we can design a second algorithm that works in a similar way to Algorithm 1. The major difference between these algorithms is that in Algorithm 2 we are not required to do comparisons with each hash table entries mutant list. This means we have comparisons within each table entry which yields a total time complexity of where

Require: EST database , integer (length of unique oligos) and integer
    (maximum number of mismatches between non-unique oligos)
Ensure: All unique -mers in
(1)
(2) findqmers( ) hashtable of positions of all qmers in
(3) for   to split loop iterations among processors}  do
(4) as a base 4 integer of length
(5) list of base 4 integers of length mismatching by 1 digit
(6) the numbers in in base 10
(7) list of each for all
(8) goo2( )
(9) end for

Require: EST database , integer (length of unique oligos) and integer
     (maximum number of mismatches between non-unique oligos)
Ensure: All unique -mers in
(1)
(2) findqmers( ) (hashtable of positions of all qmers in )
(3) for   to split loop iterations among processors}  do
(4) goo( )
(5) end for

A third theorem was also briefly mentioned [7]; however, it was not implemented in an algorithm. We use this theorem to create a third algorithm to solve the unique oligos search problem.

Theorem 3. Suppose one has two -mers and such that . If one divides them both into substrings, and , and each , except possibly , has length , then there exists at least one , such that .

Proof. Suppose by contradiction that for any , and have at least 3 mismatches. Then which is a contradiction to the fact that .

The algorithm is somewhat similar to Algorithm 1. The main difference is that we compare every -mer to -mers in its corresponding 2-mutant list, rather than its 1-mutant list. Each -mer has 2-mutants, so we have comparisons for each entry in the hash table yielding a total time complexity of where

It is important to note the term in the denominator of our time complexity expressions. Since this term is exponential, it will have the largest impact on the time taken to run our algorithms. Based on this observation, we expect Algorithm 3 to run the fastest, followed by Algorithm 1 and then Algorithm 2.

Require: EST database , integer (length of unique oligos) and integer
     (maximum number of mismatches between non-unique oligos)
Ensure: All unique -mers in
(1)
(2) findqmers( ) (hashtable of positions of all qmers in )
(3) for   to split loop iterations among processors}  do
(4) as a base 4 integer of length
(5) list of base 4 integers of length mismatching by at most 2 digits
(6) the numbers in in base 10
(7) list of each for all
(8) goo2( )
(9) end for

3. Implementation

We implement these algorithms using C on a machine with 12 Intel Core i7 CPU 80 @ 3.33 GHz processors and 12 GB of memory. The datasets we use in this implementation are Barley ESTs taken from the genetic software HarvEST by Steve Wanamaker and Timothy Close of the University of California, Riverside (http://harvest.ucr.edu/). We use two different EST databases, one with 78 ESTs and another with 2838. In our experiments we search for oligos of lengths 27 and 28 since they are common lengths for oligonucleotides. As we increase the size of the database, we see that Algorithm 3 is the most efficient as anticipated (data shown in Tables 1 and 2).

One important thing to note about all of these algorithms is the fact that the main portion of them is a for loop which iterates through each index of the hash table. It is also obvious that loop iterations are independent of each other. These two factors make the algorithms perfect candidates for parallelism. Rather than process the hash table one index at a time, our parallel algorithms process groups of indices simultaneously. Ignoring the communication between processors, our algorithms optimally parallelize our three serial algorithms.

There are many APIs in different programming languages that aid in the task of parallel programming. Some examples of this in the C programming language are OpenMP and POSIX Pthreads. OpenMP allows one to easily parallelize a C program amongst multiple cores of a multicore machine [8]. OpenMP also has an extension called Cluster OpenMP which allows one to parallelize across multiple machines in a computing cluster.

A new trend in parallel programming is in the use of GPUs. GPUs are the processing units inside computers graphics card. C has several APIs which allow one to carry out GPU programming. The two such APIs are OpenCL and CUDA [9, 10].

In the second implementation of our algorithms we use OpenMP to parallelize our algorithms throughout the 12 cores of our machine. We can easily see that we achieve near optimal parallelization with our parallel algorithms; that is, the time taken by the parallel algorithms is approximately that of the serial algorithms divided by the number of processors.

4. Conclusion

In this paper we used three algorithms to solve the unique oligos search problem which are extensions of the algorithm presented in [2]. We observed that we can achieve a significant performance improvement by parallelizing our algorithms. We can also see that Algorithm 3 yields the best results for larger databases. For smaller databases, however, the time difference between each pair of algorithms is negligible, but results in Algorithm 3 being the slowest, and this is due to the time required to compute the mismatches of each -mer. Other algorithms can be obtained by setting to different values. See Algorithms 1, 2, 3, 4, 5, 6, 7, and 8.

Require: EST database , integer
Ensure: A hashtable of all positions.
(1) a hashtable of all positions in
(2) for   to   do
(3)  for   to length   do
(4)   map( )
(5)   [ ] Append(hashtable[ ], )
(6)  end for
(7) end for

(1) substring( , , )
(2) under the transformation
(3) return  

(1) substring of from character to character
(2) return  

(1) posi a list of positions of a specified qmer in D
 ( where corresponds to position of sequence )
(2) mut a list of positions of qmers in D that mismatch this qmer by either 1 or 2 characters
  (depending on the filtration algorithm using this function)
(3) for   to length( )  do
(4)  for   to length( )  do
(5)   if     then
(6)     list of -mers generated from the extension of the in position
(7)     list of -mers generated from the extension of the in position
(8)    for   to length( )  do
(9)     for   to length( )  do
(10)     if     then
(11)       mark the as non-unique
(12)     end if
(13)     end for
(14)    end for
(15)   end if
(16)  end for
(17)  for   to length( )  do
(18)   if     then
(19)     list of -mers generated from the extension of the in position
(20)     list of -mers generated from the extension of the in position
(21)    for   to length( )  do
(22)     for   to length( )  do
(23)     if     then
(24)       mark the as non-unique
(25)     end if
(26)     end for
(27)    end for
(28)    end if
(29)  end for
(30) end for

(1) posi a list of positions of qmer in D
 ( where corresponds to position of sequence )
(2) for   to length( )  do
(3) for   to length( )  do
(4)  if     then
(5)    list of -mers generated from the extension of the in position
(6)    list of -mers generated from the extension of the in position
(7)   for   to length( )  do
(8)    for   to length( )  do
(9)     if     then
(10)      mark the as non-unique
(11)      end if
(12)    end for
(13)   end for
(14)  end if
(15) end for
(16) end for