Research Article | Open Access
Manrique Mata-Montero, Nabil Shalaby, Bradley Sheppard, "Efficient Serial and Parallel Algorithms for Selection of Unique Oligos in EST Databases", Advances in Bioinformatics, vol. 2013, Article ID 793130, 6 pages, 2013. https://doi.org/10.1155/2013/793130
Efficient Serial and Parallel Algorithms for Selection of Unique Oligos in EST Databases
Obtaining unique oligos from an EST database is a problem of great importance in bioinformatics, particularly in the discovery of new genes and the mapping of the human genome. Many algorithms have been developed to find unique oligos, many of which are much less time consuming than the traditional brute force approach. An algorithm was presented by Zheng et al. (2004) which finds the solution of the unique oligos search problem efficiently. We implement this algorithm as well as several new algorithms based on some theorems included in this paper. We demonstrate how, with these new algorithms, we can obtain unique oligos much faster than with previous ones. We parallelize these new algorithms to further improve the time of finding unique oligos. All algorithms are run on ESTs obtained from a Barley EST database.
Expressed Sequence Tags (or ESTs) are fragments of DNA that are about 200–800 bases long generated from the sequencing of complementary DNA. ESTs have many applications. They were used in the Human Genome Project in the discovery of new genes and are often used in the mapping of genomic libraries. They can be used to infer functions of newly discovered genes based on comparison to known genes .
An oligonucleotide (or oligo) is a subsequence of an EST. Oligos are short, since they are typically no longer than 50 nucleotide bases. Oligos are often referred to in the context of their length by adding the suffix “mer”. For example, an oligo of length 9 would be referred to as a 9-mer. The importance of oligos in relation to EST databases is quite significant. An oligo that is unique in an EST database serves as a representative of its EST sequence. The oligonucleotides (or simply oligos) contained in these EST databases have applications in many areas such as PCR primer design, microarrays, and probing genomic libraries [2–4].
In this paper we will improve on the algorithms presented in  to solve the unique oligos search problem. This problem requires us to determine all oligos that appear in one EST sequence but not in any of the others. In addition, we will consider two oligos to be virtually identical if they fall within a certain number of mismatches from each other. In the appendix we include all the algorithms used and developed in this paper.
2. The Unique Oligos Search Problem
In this paper we use the notation to denote the Hamming Distance between the strings and . Given an EST database , where is a string over the alphabet , integers and , and -mer , we say that occurs approximately in if there exists a substring of some EST such that . We also say that an -mutant list of a string is a list of all possible strings, , of length over the alphabet such that . Such a string is referred to as an -mutant of . A unique oligo of is defined as an -mer such that occurs exactly in one EST and does not occur approximately in any other EST. The unique oligos search problem is the problem of finding all unique oligos in an EST database.
Many algorithms have been presented to solve this problem [5, 6]. The algorithm presented in  relies on an observation that if two -mers agree within a specific Hamming Distance, then they must share a certain substring. These observations are presented in this paper as theorems.
Theorem 1. Suppose one has two -mers and such that . If one divides them both into substrings, and , and each , except possibly , has length , then there exists at least one , such that .
Proof. Suppose by contradiction that for any , and have at least 2 mismatches. Then which is a contradiction to the fact that .
Using this observation, an algorithm was presented in  which solves the unique oligos search problem in time . The algorithm can be thought of as a two-phase method. In the first phase we record the position of each -mer in the database into a hash table of size . We do so in such a way that for each -mer over the alphabet we have that whereby is an EST sequence, is the position of within that sequence, and is the number of occurrences of in the database. In the second phase, we extend every pair of identical -mers into -mers and compare these -mers for nonuniqueness. We also do the same for pairs that have a Hamming Distance of 1. If they are nonunique, we mark them accordingly. Theorem 1 guarantees that if an -mer is nonunique, then it must share a -mer substring that differs by at most one character with another -mer substring from another -mer. Hence, if an -mer is nonunique, it will be marked during phase two.
Assuming there are symbols in our EST database, the filing of the -mers into the hash table takes time . In phase two, we assume that the distribution of -mers in the database is uniform; in other words, that each table contains entries. Thus we have comparisons within each table entry. Each -mer also has a 1-mutant list of size , so, we have comparisons for each entry in the table. Also, the time required to extend each pair of -mers to -mers is . Given that we have entries in the hash table, we have a total time complexity of where
Theorem 2. Suppose one has two -mers and such that . If one divides them both into substrings, and , and each , except possibly , has length , then there exists at least one , such that .
Proof. Suppose by contradiction that we cannot find any such that . Then there exists at least one mismatch between and for each , and thus we have at least mismatches which contradicts the fact that .
Based on Theorem 2 we can design a second algorithm that works in a similar way to Algorithm 1. The major difference between these algorithms is that in Algorithm 2 we are not required to do comparisons with each hash table entries mutant list. This means we have comparisons within each table entry which yields a total time complexity of where
A third theorem was also briefly mentioned ; however, it was not implemented in an algorithm. We use this theorem to create a third algorithm to solve the unique oligos search problem.
Theorem 3. Suppose one has two -mers and such that . If one divides them both into substrings, and , and each , except possibly , has length , then there exists at least one , such that .
Proof. Suppose by contradiction that for any , and have at least 3 mismatches. Then which is a contradiction to the fact that .
The algorithm is somewhat similar to Algorithm 1. The main difference is that we compare every -mer to -mers in its corresponding 2-mutant list, rather than its 1-mutant list. Each -mer has 2-mutants, so we have comparisons for each entry in the hash table yielding a total time complexity of where
It is important to note the term in the denominator of our time complexity expressions. Since this term is exponential, it will have the largest impact on the time taken to run our algorithms. Based on this observation, we expect Algorithm 3 to run the fastest, followed by Algorithm 1 and then Algorithm 2.
We implement these algorithms using C on a machine with 12 Intel Core i7 CPU 80 @ 3.33 GHz processors and 12 GB of memory. The datasets we use in this implementation are Barley ESTs taken from the genetic software HarvEST by Steve Wanamaker and Timothy Close of the University of California, Riverside (http://harvest.ucr.edu/). We use two different EST databases, one with 78 ESTs and another with 2838. In our experiments we search for oligos of lengths 27 and 28 since they are common lengths for oligonucleotides. As we increase the size of the database, we see that Algorithm 3 is the most efficient as anticipated (data shown in Tables 1 and 2).
One important thing to note about all of these algorithms is the fact that the main portion of them is a for loop which iterates through each index of the hash table. It is also obvious that loop iterations are independent of each other. These two factors make the algorithms perfect candidates for parallelism. Rather than process the hash table one index at a time, our parallel algorithms process groups of indices simultaneously. Ignoring the communication between processors, our algorithms optimally parallelize our three serial algorithms.
There are many APIs in different programming languages that aid in the task of parallel programming. Some examples of this in the C programming language are OpenMP and POSIX Pthreads. OpenMP allows one to easily parallelize a C program amongst multiple cores of a multicore machine . OpenMP also has an extension called Cluster OpenMP which allows one to parallelize across multiple machines in a computing cluster.
A new trend in parallel programming is in the use of GPUs. GPUs are the processing units inside computers graphics card. C has several APIs which allow one to carry out GPU programming. The two such APIs are OpenCL and CUDA [9, 10].
In the second implementation of our algorithms we use OpenMP to parallelize our algorithms throughout the 12 cores of our machine. We can easily see that we achieve near optimal parallelization with our parallel algorithms; that is, the time taken by the parallel algorithms is approximately that of the serial algorithms divided by the number of processors.
In this paper we used three algorithms to solve the unique oligos search problem which are extensions of the algorithm presented in . We observed that we can achieve a significant performance improvement by parallelizing our algorithms. We can also see that Algorithm 3 yields the best results for larger databases. For smaller databases, however, the time difference between each pair of algorithms is negligible, but results in Algorithm 3 being the slowest, and this is due to the time required to compute the mismatches of each -mer. Other algorithms can be obtained by setting to different values. See Algorithms 1, 2, 3, 4, 5, 6, 7, and 8.
- M. D. Adams, J. M. Kelley, J. D. Gocayne et al., “Complementary DNA sequencing: expressed sequence tags and human genome project,” Science, vol. 252, no. 5013, pp. 1651–1656, 1991.
- J. Zheng, T. J. Close, T. Jiang, and S. Lonardi, “Efficient selection of unique and popular oligos for large EST databases,” Bioinformatics, vol. 20, no. 13, pp. 2101–2112, 2004.
- S. H. Nagaraj, R. B. Gasser, and S. Ranganathan, “A hitchhiker's guide to expressed sequence tag (EST) analysis,” Briefings in Bioinformatics, vol. 8, no. 1, pp. 6–21, 2007.
- W. Klug, M. Cummings, and C. Spencer, Concepts of Genetics, Prentice-Hall, Upper Saddle River, NJ, USA, 8th edition, 2006.
- F. Li and G. D. Stormo, “Selection of optimal DNA oligos for gene expression arrays,” Bioinformatics, vol. 17, no. 11, pp. 1067–1076, 2001.
- S. Rahmann, “Rapid large-scale oligonucleotide selection for microarrays,” in Proceedings of the 1st IEEE Computer Society Bioinformatics Conference (CSB '02), pp. 54–63, IEEE Press, Stanford, Calif, USA, 2002.
- S. Go, Combinatorics and its applications in DNA analysis [M.S. thesis], Department of Mathematics and Statistics, Memorial University of Newfoundland, 2009.
- OpenMP.org, 2012, http://openmp.org/wp/.
- Khronos Group, “OpenCL—The open standard for parallel programming of heterogeneous systems,” 2012, http://www.khronos.org/opencl/.
- Nvidia, “Parallel Programming and Computing Platform—Cuda—Nvidia,” 2012, http://www.nvidia.com/object/cuda_home_new.html.
Copyright © 2013 Manrique Mata-Montero et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.