FACC: A Novel Finite Automaton Based on Cloud Computing for the Multiple Longest Common Subsequences Search

Li, Yanni; Wang, Yuping; Bao, Liang

doi:https://doi.org/10.1155/2012/310328

Mathematical Problems in Engineering


Special Issue

Selected Papers from the 7th International Conference on Computational Intelligence and Security (CIS'2011)


Research Article | Open Access

Volume 2012 | Article ID 310328 | https://doi.org/10.1155/2012/310328


Yanni Li,¹,² Yuping Wang,¹ and Liang Bao²

Academic Editor: Hailin Liu

Received 14 Apr 2012

Accepted 30 Aug 2012

Published 30 Oct 2012

Abstract

Searching for the multiple longest common subsequences (MLCS) has significant applications in areas such as bioinformatics, information processing, and data mining. Although a few parallel MLCS algorithms have been proposed, their efficiency and effectiveness are not satisfactory given the increasing complexity and size of biological data. To overcome the shortcomings of the existing MLCS algorithms, and considering that the MapReduce parallel framework of cloud computing is a promising technology for cost-effective high performance parallel computing, a novel finite automaton (FA) based on cloud computing, called FACC, is proposed under the MapReduce parallel framework so as to obtain a more efficient and effective general parallel MLCS algorithm. FACC adopts the ideas of matched pairs and finite automaton by preprocessing sequences, constructing successor tables, and constructing a common subsequence finite automaton to search for the MLCS. Simulation experiments on a set of benchmarks from both real DNA and amino acid sequences have been conducted, and the results show that the proposed FACC algorithm outperforms the current leading parallel MLCS algorithm FAST-LCS.

1. Introduction

Searching for MLCS is a classic dynamic programming problem. Let $\Sigma$ be a finite or infinite alphabet, and let $X = x_1 x_2 \cdots x_n$ be a finite sequence of symbols drawn from $\Sigma$, that is, $x_i \in \Sigma$, $i = 1, 2, \ldots, n$. A sequence $Z = z_1 z_2 \cdots z_k$ is called a subsequence of $X$ if it satisfies $z_j = x_{i_j}$ and $1 \le i_1 < i_2 < \cdots < i_k \le n$, that is, $Z = x_{i_1} x_{i_2} \cdots x_{i_k}$. For two given sequences $X$ and $Y$, $Z$ is called a common subsequence (CS) of $X$ and $Y$ if and only if $Z$ is simultaneously a subsequence of both $X$ and $Y$. When no other common subsequence is longer than $Z$, $Z$ is named the longest common subsequence (LCS) of $X$ and $Y$. Similarly, given a set of sequences drawn from $\Sigma$, a sequence is called their multiple longest common subsequence (MLCS) if it is a subsequence of all of them (such a sequence is called a common subsequence) and no other common subsequence is longer than it.

Searching for the MLCS is significant for a number of applications in areas such as bioinformatics, information processing, data mining, and pattern recognition [1]. In these application areas, information can usually be represented as a sequence over a finite or infinite alphabet. For instance, in biology a protein can be expressed as a sequence over twenty different symbols (amino acids), and a DNA sequence can be described as a sequence over the four symbols A, C, G, and T [2]. Program source code can be represented as a sequence over an alphabet such as ASCII or Unicode. Similarly, in a specific domain, web information can be regarded as a word sequence over the domain ontology. We can discover valuable information by comparing and analyzing such sequences. For example, in the biological field, sequence comparison has been successfully used to establish a link between cancer-causing genes and normal genes, and by finding the similarity between gene sequences we can also obtain valuable information on genetic diseases [3–5]. In web information search and data mining, by determining the MLCS of word sequences of the domain ontology, we can not only increase the accuracy of information retrieval but also mine valuable information. In program analysis, by discovering the MLCS among program source codes, we can measure their redundancy and similarity, eliminate the redundancy, or detect cloned code.

Searching for the LCS of an arbitrary number of sequences is a classic NP-hard problem [6]. Wagner and Fischer [7] first introduced a classic dynamic programming (DP) algorithm for solving the LCS problem of two sequences, and both its time and space complexity are $O(mn)$, where $m$ and $n$ are the lengths of the two sequences, respectively (the same notations are used in what follows). Its main drawback is this high time and space complexity: it can find the LCS of two relatively short sequences but can hardly deal efficiently with longer ones. Hirschberg [8] presented a new LCS algorithm based on the “divide and conquer” approach, which reduces the space complexity to $O(m + n)$ and gives a better solution for longer sequences. Using a decision tree model, Ullman et al. [9] (1976) devised a better LCS algorithm with a lower bound on the time complexity. Based on the effective detection of all the major matched points between compared sequences, Hunt and Szymanski [10] designed a new LCS algorithm with a time complexity of $O((r + n)\log n)$, where $n$ is the maximum length of the compared sequences and $r$ is the number of matched points. The algorithm was later simplified by Bespamyatnikh and Segal [11]. Masek and Paterson [12] put forward an improved dynamic programming algorithm by using a fast algorithm to compute the sequence edit distance, which reduces the time complexity to $O(n^2/\log n)$.
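
As an illustration of the two-sequence dynamic programming approach discussed above, the following minimal sketch fills the table of LCS lengths for all prefix pairs and traces back one LCS. It is the textbook recurrence, not necessarily the exact formulation of [7]; the function names `lcs_length_table` and `one_lcs` are illustrative assumptions.

```python
def lcs_length_table(x: str, y: str) -> list[list[int]]:
    """table[i][j] = LCS length of the prefixes x[:i] and y[:j]."""
    m, n = len(x), len(y)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                # Matching characters extend the LCS of the shorter prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def one_lcs(x: str, y: str) -> str:
    """Trace back through the table to recover one longest common subsequence."""
    table = lcs_length_table(x, y)
    i, j = len(x), len(y)
    out = []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    print(one_lcs("ACGT", "AGT"))
```

The table has $(m+1)(n+1)$ entries, which is where the quadratic time and space cost of this approach comes from.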

To further improve the time complexity of LCS algorithms, researchers have begun to study parallel LCS algorithms. Based on a CREW-PRAM model, Aggarwal and Park [13] and Apostolico et al. [14] proposed parallel LCS algorithms with the time complexity of by using processors, respectively. Freschi and Bogliolo [15] presented another parallel LCS algorithm that uses packed arrays and processors and is based on the run-length-encoded (RLE) string. The time complexity of the algorithm is , where and are the numbers of runs in their RLE representation, respectively. Liu and Chen [16] presented a specific parallel MLCS algorithm over the alphabet {A, C, G, T}, FAST-LCS, based on proposed pruning rules, which is more efficient than the previous works. The space and time complexity of the algorithm are and , respectively, where is the number of identical character pairs and is the length of the LCS of and . Wang et al. [17–19] developed the efficient MLCS algorithms parMLCS and Quick-DP, respectively, based on the dominant points approach, which have reached a near-linear speedup for a large number of sequences. It is worth mentioning that Yang et al. [20], as a new attempt, developed an efficient parallel algorithm on GPUs for the LCS problem. Regrettably, the algorithm is not suitable for the general MLCS problem.

To meet the needs of practical applications, some researchers have also studied variations of the LCS problem, such as the longest common increasing subsequence (LCIS) problem, the longest increasing subsequence (LIS) problem, and the common increasing subsequence (CIS) problem. Fredman [21] proposed an algorithm for LIS problems. The optimal time complexity of the algorithm is when the average length of sequences equals . By combining LCS with LIS, Yang et al. [22] defined a common increasing subsequence (CIS) and designed a dynamic programming algorithm for the CIS problem of two sequences. The space complexity of the algorithm is . Brodal et al. [23] presented an algorithm for finding an LCIS of two or more input sequences. For two sequences of lengths and , where , the time complexity and space complexity of the algorithm are and , respectively, where is the length of an LCIS, is the size of the alphabet, and Sort is the time to sort each input sequence.

Nevertheless, the aforementioned algorithms have the following disadvantages: (1) most of them are inapplicable to problems with more than two sequences (especially a considerable number of sequences, a large alphabet, and a long average sequence length); (2) the efficiency and effectiveness of the few parallel algorithms remain to be improved; (3) the parallel implementations of the algorithms are difficult due to their complicated concurrency, synchronization, and mutual exclusion, that is, none of the existing algorithms employs a simple and cost-effective high performance parallel computing framework such as MapReduce; (4) most of the algorithms do not provide an abstract and formal description that reveals the inherent properties of the MLCS problem. To overcome these shortcomings, a novel finite automaton based on cloud computing for the MLCS problem is proposed in this paper. The main contributions of this paper are as follows.
(1) All common subsequences (CSs) of the sequences are abstracted as a language over their common alphabet $\Sigma$, that is, every CS of the sequences is a sequence over $\Sigma$.
(2) Based on the ideas of the matched pair and finite automaton, a novel finite automaton which can recognize/accept all of the CSs of the sequences is presented.
(3) A formal definition of the finite automaton is introduced, and its properties are verified.
(4) A novel parallel algorithm, FACC, is proposed. FACC abstracts the MLCS problem as that of searching for the longest paths on the finite automaton, and is implemented on the MapReduce parallel framework of cloud computing with a variety of optimization techniques.
(5) A quantitative analysis of the time and space complexity of FACC is conducted.
(6) The algorithm FACC is validated on DNA and amino acid sample sequences from the NCBI and DIP databases. The time performance of FACC is then compared with that of the leading algorithm FAST-LCS [16]. The experimental results show that the proposed FACC outperforms FAST-LCS.

The rest of this paper is organized as follows. Section 2 introduces some notations and concepts for convenience of discussion. Section 3 presents the finite automaton Atm for recognizing common subsequences and its basic properties. Section 4 proposes a new algorithm called finite automaton based on cloud computing (FACC) and describes its implementation in detail. Section 5 analyzes the time and space complexity of FACC. The experiments and the analysis of their results are presented in Section 6. Finally, Section 7 concludes the research.

2. Notations and Basic Concepts

For convenience, the following notations are adopted in Table 1.

Note that a sequence over some alphabet is a finite sequence of symbols drawn from that alphabet. According to the formal language and automaton theory, we can view the common subsequences of all sequences on set as a language of the common alphabet , and then regard the MLCS of the over sequences as the one or several longest statements of the language . Based on this idea, a novel finite automaton Atm which can recognize/accept the is designed, and the Atm for MLCS was constructed quickly based on a new constructing-searching algorithm and the MapReduce parallel framework of cloud computing proposed in this paper, meanwhile the MLCS can be easily achieved.

For easy understanding, some basic concepts in the following are introduced, and the properties of the Atm are discussed.

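Definition 2.1. Suppose $S_i$ is a sequence over $\Sigma$ for $i = 1, 2, \ldots, d$. Let $S_i[j]$ denote the $j$th character in sequence $S_i$. For the sequences $S_1, S_2, \ldots, S_d$ from $T$, the vector $p = [p_1, p_2, \ldots, p_d]$ is called a matched pair of the sequences if and only if $S_1[p_1] = S_2[p_2] = \cdots = S_d[p_d] = \sigma$ ($\sigma \in \Sigma$), where $\sigma$ is the character corresponding to the matched pair $p$, denoted as $ch(p)$.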

For example, for two sequences and , one can get the following matched pairs by Definition 2.1: [1,2], [3,3], [4,5], [2,1], and [5,4] with their corresponding ch’s being , , , , and , respectively.

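Definition 2.2. For the sequences from $T$, let $p = [p_1, p_2, \ldots, p_d]$ and $q = [q_1, q_2, \ldots, q_d]$ be two matched pairs, where $d$ is the number of sequences. One calls $p = q$ if and only if $p_i = q_i$ for $i = 1, 2, \ldots, d$. If $p_i < q_i$ for $i = 1, 2, \ldots, d$, one calls $q$ a successive matched pair of $p$ and denotes it as $p < q$. If $p < q$ and there does not exist a matched pair $r$ such that $p < r < q$, one calls $q$ a direct successor matched pair of $p$, denoted as $p \prec q$.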

Definition 2.3. For the sequences from $T$, let $p$ be a matched pair. If there does not exist a matched pair $q$ such that $q < p$ (that is, such that $p$ is a successive matched pair of $q$), one terms $p$ an initial matched pair. In general, there may be more than one initial matched pair from $T$.

Taking above sequences and as an example, we can see that the matched pairs [3,3], [4,5], and [5,4] are the successive matched pairs of matched pairs [1,2] and [2,1]. Moreover, the matched pair [3,3] is a direct successive matched pair of the matched pairs [1,2] and [2,1], wherein [1,2] and [2,1] are two initial matched pairs in total for sequences and .

Based on the above definitions, the following conclusion can be easily inferred.

Lemma 2.4. The total number of all possible initial matched pairs is less than or equal to regardless of .
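
The notions of matched pair, successive matched pair, direct successor, and initial matched pair can be made concrete with a small brute-force sketch. The sequences `"ACGT"` and `"CAGT"` and all function names below are hypothetical and chosen only for illustration; positions are 1-based, and "precedes" is read as strictly smaller in every component, which is an assumption about the ordering in Definition 2.2.

```python
from itertools import product


def matched_pairs(seqs):
    """Map every matched pair (tuple of 1-based positions) to its character."""
    pairs = {}
    common = set(seqs[0]).intersection(*seqs[1:])
    for ch in common:
        # All positions of ch in each sequence; every combination is a matched pair.
        positions = [[i + 1 for i, c in enumerate(s) if c == ch] for s in seqs]
        for p in product(*positions):
            pairs[p] = ch
    return pairs


def precedes(p, q):
    """True if q is a successive matched pair of p (every component strictly larger)."""
    return all(a < b for a, b in zip(p, q))


def initial_pairs(pairs):
    """Matched pairs that are not preceded by any other matched pair."""
    return sorted(q for q in pairs if not any(precedes(p, q) for p in pairs))


def direct_successors(p, pairs):
    """Successors of p with no other matched pair strictly in between."""
    succ = [q for q in pairs if precedes(p, q)]
    return sorted(q for q in succ if not any(precedes(r, q) for r in succ))


if __name__ == "__main__":
    pairs = matched_pairs(["ACGT", "CAGT"])
    print(pairs)
    print(initial_pairs(pairs))
    print(direct_successors((1, 2), pairs))
```

For each character only the pair of its first occurrences can be initial, so the number of initial matched pairs in this sketch never exceeds the size of the common alphabet.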

3. Finite Automaton Atm for Recognizing Common Subsequence and Its Basic Properties

It can be seen from the above discussion that the characters in the MLCS from must be the characters corresponding to their matched pairs. In the following, based on some ideas and concepts from finite automaton (Atm), we can see the matched pairs of as states of the Atm and construct the Atm which can recognize/accept all the CSs of by defining a specific transition function. The formal definition of the Atm is as follows.

Definition 3.1. The common subsequence finite automaton Atm is a 5-tuple whose components are as follows:
(1) a finite set of states of the Atm, in which each state corresponds to one matched pair of $T$;
(2) the common alphabet $\Sigma$ of the sequence set $T$;
(3) the transition function, which maps a state and a character of $\Sigma$ to the state of a direct successive matched pair whose corresponding character is that character;
(4) the initial state of the Atm;
(5) the set of final states of the Atm.

It can be seen from Definition 3.1 that the Atm is a deterministic finite automaton (DFA), but it is different from the normal DFA. The Atm can be partial, that is, every state in the Atm can be initial or final state.

What follows are the formal definitions of and MLCS recognized/accepted by the Atm.

Definition 3.2. For defined by Definition 3.1, a character sequence is called to be recognized/accepted by the Atm if and only if for , where, , . is called a language recognized/accepted by Atm.

Based on Definition 3.2, we can easily deduce the following conclusion.

Lemma 3.3. For , if for all and , then That is, MLCS is the set of the longest sequences recognized/accepted by .

With Definition 3.2, we can obtain the following properties.

Theorem 3.4. For all , let be the corresponding matched pair of the th () character and be the corresponding matched pair of the ()th character in sequence , then, must be the direct successive matched pair of . Furthermore, the first character of must belong to the character of an initial matched pair.

Proof. We give the proof by contradiction in the following.
For all , assuming that is not the direct successive matched pair of , there will be a matched pair ( corresponds to character ) such that . So we can insert between and to get a longer common subsequence, denoted as MLCS’, which is contradiction to the fact that is the longest common subsequence.
Then, let us consider the matched pair which corresponds to the first character of the . If is not an initial matched pair, according to Definition 2.3, there must exist a matched pair , and . So we can get a longer MLCS’ by inserting the character corresponding to into the header of the which contradicts to the fact that is the longest common subsequence.

Theorem 3.5. The Atm is a directed acyclic graph (DAG).

Proof. The theorem will be proved by contradiction.
Suppose that there was a series of states , , and forming a cycle in Atm, that is, state is the successive state of state , is the successive state of , and is the successive state of . Due to the fact that matched pairs , , and correspond to the states , , and , we can get the results, , and , which contradicts to Definition 2.2. Therefore, Atm is a directed acyclic graph.

Theorem 3.6. is an upper bound of the number of Atm ’s states, where represents the occurrence number of in sequence .

Proof. Because the number of times the character appears in the sequence is less than or equal to , is an upper bound for the number of the Atm’s states for the character . Thus, is an upper bound for all elements in . By considering the initial state of the Atm, we get an upper bound of the number of the Atm’s states.
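
The counting argument in the proof can be checked numerically. The sketch below assumes one reading of the bound: one state per matched pair, that is, for each common character the product of its occurrence counts over all sequences, summed over the alphabet, plus one for the initial state. The function name `state_upper_bound` and the sample sequences are hypothetical.

```python
from collections import Counter
from math import prod


def state_upper_bound(seqs):
    """1 (initial state) + number of matched pairs, summed over the common characters."""
    counts = [Counter(s) for s in seqs]
    common = set(seqs[0]).intersection(*seqs[1:])
    # A character occurring c_i times in sequence i contributes c_1 * ... * c_d matched pairs.
    return 1 + sum(prod(c[ch] for c in counts) for ch in common)


if __name__ == "__main__":
    print(state_upper_bound(["ACGT", "CAGT"]))
    print(state_upper_bound(["AAB", "ABA"]))
```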

4. FACC Algorithm Based on MapReduce Parallel Framework of Cloud Computing

4.1. The Overview of the MapReduce Parallel Framework of Cloud Computing

For convenience, we first give a brief overview of the MapReduce parallel framework of cloud computing.

Cloud computing is a new computing model [24], which is a development of distributed computing, parallel computing and grid computing. MapReduce is a parallel framework of cloud computing and an associated implementation for processing and generating large datasets. It is amenable to a broad variety of real-world tasks and has been widely recognized as a cost-effective high performance parallel computing model. In the model, based on “divide and conquer” technology, users only specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules intermachine communication to make efficient use of the network and disks [25, 26]. Figure 1 shows the execution framework of MapReduce [26].
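
The programming model described above can be imitated in a single process. The sketch below is not Hadoop and not the implementation used in this paper; it is a minimal stand-in, assuming a thread pool for the map phase, that shows how a user supplies only a map function and a reduce function. The names `map_reduce`, `chars_once`, and `total` are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_reduce(inputs, map_fn, reduce_fn, workers=4):
    """Run map_fn on every input in parallel, group by key, then reduce each group."""
    # Map phase: each input yields a list of (key, value) pairs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_fn, inputs))
    # Shuffle phase: collect all values that share a key.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    # Reduce phase: combine the values of each key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def chars_once(sequence):
    """Map function: emit (character, 1) once per distinct character of a sequence."""
    return [(ch, 1) for ch in set(sequence)]


def total(key, values):
    """Reduce function: number of sequences that contain the character."""
    return sum(values)


if __name__ == "__main__":
    print(map_reduce(["ACGT", "AGT"], chars_once, total))
```

A real MapReduce runtime additionally distributes the work across machines and handles failures, which this sketch does not attempt.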

4.2. The Proposed Algorithm: FACC

In this subsection, before describing the details of the proposed FACC, we first give its framework as follows.
(1) Preprocessing. Determine the common alphabet $\Sigma$ of the sequence set $T$, and preprocess every sequence of $T$ as follows: the redundant characters in each sequence are filtered out, that is, the characters that are missing from at least one sequence in $T$ are removed. This process ensures a quicker search for the MLCS of $T$ later.
(2) Construction of successor tables. Based on the MapReduce parallel framework of cloud computing, the successor tables (see Definition 4.1 in Section 4.2.2) for every preprocessed sequence of $T$ are constructed in parallel. Since any character in the MLCS should be a character corresponding to a matched pair, we construct the successor table for each sequence so that all possible matched pairs of $T$ can be found quickly.
(3) Construction of the finite automaton Atm for recognizing/accepting common subsequences. Based on the MapReduce parallel framework, and using the matched pairs as the states and adding an initial state to the Atm, we construct the Atm which can recognize/accept all the CSs of the sequences of $T$ according to the transition function of the Atm. Each state holds all of the states of its parent nodes during the construction of the Atm.
(4) Traversal of the Atm and output of all of the MLCS. Search for the MLCS by traversing the Atm with the depth-first method.

In the following, the proposed FACC and its implementation based on MapReduce will be described step by step in detail.

4.2.1. Preprocessing

Recall that the MLCS of $T$ consists of sequences over the common alphabet $\Sigma$. The goal of the preprocessing is to reduce the searching time by filtering out of each sequence the redundant characters that do not appear in $\Sigma$. After preprocessing, we obtain sequences that retain only the characters in $\Sigma$. Since preprocessing itself costs some time and space, the proposed FACC adopts this procedure only when the alphabet is large or unknown.

The idea of the preprocessing is to design a Key-Value table (that is, a Map data structure), where the Key represents a character and the Value is the total number of sequences containing that character. The Value of a Key can never exceed the number of sequences, and it equals the number of sequences if and only if every sequence contains the character. Accordingly, all characters whose Value equals the number of sequences constitute the common alphabet $\Sigma$. Then, all sequences are filtered in parallel using the MapReduce of cloud computing, and only the characters in $\Sigma$ are retained.

Algorithm 1 shows the pseudocode of the preprocessing algorithm.

Algorithm Pre-processing (InitStringSet)
Input: InitStringSet: an initial string set.
Output: Σ: common alphabet over InitStringSet
        StringSet: the result of pre-processing InitStringSet
// Count, for every character, the number of strings that contain it.
for each string s in InitStringSet
    tempSet = ∅
    for each character s[j] in s
        if (s[j] ∉ tempSet)
            tempSet = tempSet ∪ {s[j]}
            if (s[j] ∈ map)
                map(s[j]) = map(s[j]) + 1
            else
                map(s[j]) = 1
    endfor
endfor
// Characters contained in every string form the common alphabet.
for each character char[i] in map
    if (map(char[i]) == size(InitStringSet))
        Σ = Σ ∪ {char[i]}
endfor
// Delete every character that is not in Σ.
StringSet = InitStringSet
for each string s in StringSet
    for each character s[j] in s
        if (s[j] ∉ Σ)
            delete(s[j])
    endfor
endfor
return Σ and StringSet

For example, sequences and , and are the alphabets of these two sequences, respectively. After preprocessing to and , we can get , and is converted into , and is converted into .
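
The preprocessing step can be sketched sequentially as follows. This is an illustrative, non-parallel reading of Algorithm 1, not the MapReduce implementation; the function name `preprocess` and the sample strings are hypothetical.

```python
def preprocess(init_strings):
    """Return (common alphabet, filtered strings) following the outline of Algorithm 1."""
    # Count, for every character, how many strings contain it (once per string).
    seen_in = {}
    for s in init_strings:
        for ch in set(s):
            seen_in[ch] = seen_in.get(ch, 0) + 1
    # A character belongs to the common alphabet if every string contains it.
    sigma = {ch for ch, n in seen_in.items() if n == len(init_strings)}
    # Keep only the characters of the common alphabet, preserving their order.
    filtered = ["".join(ch for ch in s if ch in sigma) for s in init_strings]
    return sigma, filtered


if __name__ == "__main__":
    sigma, strings = preprocess(["ABCDE", "BXDAY"])
    print(sorted(sigma), strings)
```

Because only the characters outside the common alphabet are deleted, the common subsequences of the filtered strings are the same as those of the original strings.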

4.2.2. Successor Table

Let denote the successor table of the sequence and its definition is as follows.

Definition 4.1. For a sequence $S$ drawn from alphabet $\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_{|\Sigma|}\}$, the successor table of $S$ is an irregular two-dimensional table, where the element in the $i$th row and $j$th column of the table is denoted as $Tab(i, j)$ and is defined as follows:

$$Tab(i, j) = \min\{k \mid k > j,\ S[k] = \sigma_i\}.$$

The value of $Tab(i, j)$ indicates the minimal position of the sequence $S$ at which the character $\sigma_i$ occurs after position $j$.

For the two sequences and , Tables 2 and 3 show the successor tables of them.

To construct the successor tables for $T$, we dispatch Map functions to construct the successor table of each sequence of $T$ in parallel, and then employ a Reduce function to aggregate the successor tables of the sequences of $T$.

Because the irregular successor tables only store the useful information and are constructed in parallel, it can considerably reduce the time and space complexity of searching for the MLCS.

With the constructed successor tables, the direct successive matched pairs of a matched pair can be obtained quickly. Take Tab₁ and Tab₂ as examples. When searching for the successive matched pairs of a matched pair $[i, j]$, all we need to do is look up the $i$th column of Tab₁ and the $j$th column of Tab₂, pair the entries that lie in the same row (that is, that belong to the same character), and then remove all of the matched pairs which are not direct successive matched pairs. For example, by checking Tab₁ and Tab₂ based on Definition 4.1, we can get the following successive matched pairs of matched pair [1,1]: [4,4], [2,3], and [3,2]. Since [4,4] is itself a successive matched pair of both [2,3] and [3,2], we remove the indirect successive matched pair [4,4] and finally get the expected direct successive matched pairs [2,3] and [3,2] of the matched pair [1,1].
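
The lookup procedure described above can be sketched as follows. The sequences `"ACGT"` and `"AGCT"` are hypothetical (they are not the sequences behind Tables 2 and 3) and were picked so that the pair numbers agree with the example in the text; the table layout (one row per character, one column per position starting at 0, `None` for a missing successor) and the function names are likewise assumptions.

```python
def successor_table(seq, alphabet):
    """table[ch][j] = smallest 1-based position k > j with seq[k-1] == ch, else None."""
    n = len(seq)
    table = {ch: [None] * (n + 1) for ch in alphabet}
    for ch in alphabet:
        nxt = None
        for j in range(n, -1, -1):  # scan from right to left
            table[ch][j] = nxt
            if j >= 1 and seq[j - 1] == ch:
                nxt = j
    return table


def direct_successors(pair, tables, alphabet):
    """Read one candidate per character off the tables, then drop the indirect ones."""
    candidates = []
    for ch in alphabet:
        q = tuple(t[ch][p] for t, p in zip(tables, pair))
        if None not in q:  # ch still occurs later in every sequence
            candidates.append((q, ch))

    def precedes(a, b):
        return all(x < y for x, y in zip(a, b))

    # A candidate is indirect if another candidate lies strictly before it.
    return sorted((q, ch) for q, ch in candidates
                  if not any(precedes(r, q) for r, _ in candidates))


if __name__ == "__main__":
    sigma = "ACGT"
    tables = [successor_table(s, sigma) for s in ("ACGT", "AGCT")]
    print(direct_successors((1, 1), tables, sigma))
```

Each lookup inspects at most one entry per character and per sequence, so finding the candidates of a matched pair does not require rescanning the sequences.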

4.2.3. Constructing the Common Subsequence Atm of

By Definition 3.1 and the properties of the Atm, we can build the common subsequence Atm of the sequence set $T$ in parallel by MapReduce. The algorithm Build-Atm is shown in Algorithm 2.

Algorithm Build-Atm(Pos, dsucSet)
Input: : the initial state of the Atm
tabSet: the set of successor tables for T drawn from
Output: Atm: a Atm of T for MLCS
(1) Atm = NULL,
(2) Build-Atm(, tabSet)
(3) dispatch map functions in parallel do
(4) for each , for dispatch a Map function
(5)
(6) if and
(7)
(8) end for
(9) reduce all the results of the Map functions
(10) reduce all the results of the Map functions
(11) Build-Atm(, tabSet)
(12) return Atm
(13) end

With the algorithm shown in Algorithm 2, taking two sequences as an example, the main construction process of the Atm is as follows (the process also applies to multiple sequences of $T$). (1) Construct a virtual initial state (0,0) corresponding to the matched pair [0,0]. (2) Determine all of the direct successive matched pairs of the matched pair [0,0]. (3) Use the depth-first search (DFS) method to construct the Atm.

Notice that each state in the Atm must remember all of the states of its parent nodes during the construction of the Atm. For example, the two sequences illustrated in Figure 2 have a direct successive matched pair [1,1] of the matched pair [0,0]. Because the matched pair [1,1] corresponds to a character, we get a state transition from state (0,0) to state (1,1) labeled with that character. By the algorithm shown in Algorithm 2, the final common subsequence Atm constructed in the example of Figure 2 is shown in Figure 3, where the states (0,0) and (7,7) are the initial and final states of the Atm of the two sequences, respectively.
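
The construction process can be sketched sequentially as below. This is an illustrative reading of the steps above, not Algorithm 2 itself: it runs in one process instead of under MapReduce, it does not record parent states, and the names `next_positions` and `build_atm` as well as the sample sequences are hypothetical.

```python
def next_positions(seq):
    """nxt[j][ch] = smallest 1-based position k > j with seq[k-1] == ch (absent if none)."""
    nxt = [dict() for _ in range(len(seq) + 1)]
    for j in range(len(seq) - 1, -1, -1):
        nxt[j] = dict(nxt[j + 1])
        nxt[j][seq[j]] = j + 1
    return nxt


def build_atm(seqs):
    """Return (initial state, transitions), where transitions[state][ch] = next state."""
    tables = [next_positions(s) for s in seqs]
    start = tuple(0 for _ in seqs)  # virtual initial state (0, ..., 0)
    atm = {}
    stack = [start]  # depth-first construction
    while stack:
        state = stack.pop()
        if state in atm:
            continue
        # One candidate per character that still occurs later in every sequence.
        common = set.intersection(*(set(t[p]) for t, p in zip(tables, state)))
        cands = {ch: tuple(t[p][ch] for t, p in zip(tables, state)) for ch in common}
        # Keep only candidates that no other candidate strictly precedes.
        atm[state] = {
            ch: q for ch, q in cands.items()
            if not any(all(a < b for a, b in zip(r, q)) for r in cands.values())
        }
        stack.extend(atm[state].values())
    return start, atm


if __name__ == "__main__":
    start, atm = build_atm(["ACGT", "AGCT"])
    for state in sorted(atm):
        print(state, atm[state])
```

Every transition leads to a state whose components are all strictly larger, so the resulting graph cannot contain a cycle.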

4.2.4. Traversing Atm and Finding the MLCS

By finite automata theory and Definitions 3.1 and 3.2, each path from the initial state to a state of the Atm yields a character sequence, which is a candidate longest common subsequence of $T$. Hence, the longest of all the candidate sequences are exactly the elements of the MLCS of $T$. We first design a specific set named resultSet to store the expected MLCS. Then, by the depth-first search method, the Atm is traversed from the initial state to every state in parallel by MapReduce schemes. Once we get a candidate, we perform the following operations: if the candidate is shorter than the elements of resultSet, ignore it; if it has the same length, insert it into resultSet; if it is longer, clear resultSet and then insert the candidate. Because resultSet is a set, it eliminates redundant elements automatically; hence, we eventually obtain the MLCS from resultSet. In the example for the sequences and the all MLCSs are .
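
The traversal with a result set can be sketched as follows. The automaton literal is a hypothetical five-state example (for the made-up sequences `"ACGT"` and `"AGCT"`), the traversal is sequential rather than parallel, and the function name `traverse` is an assumption. Candidates are only compared at states without outgoing transitions, since any shorter path can still be extended.

```python
def traverse(start, atm):
    """Depth-first traversal that keeps only the longest candidate sequences."""
    result_set = set()
    best_len = 0
    stack = [(start, "")]  # (current state, characters read so far)
    while stack:
        state, prefix = stack.pop()
        if not atm[state]:  # no outgoing transition: prefix is a candidate
            if len(prefix) > best_len:
                result_set = {prefix}  # longer candidate: clear and insert
                best_len = len(prefix)
            elif len(prefix) == best_len:
                result_set.add(prefix)  # equally long candidate: insert
            continue
        for ch, nxt in atm[state].items():
            stack.append((nxt, prefix + ch))
    return result_set


if __name__ == "__main__":
    atm = {
        (0, 0): {"A": (1, 1)},
        (1, 1): {"C": (2, 3), "G": (3, 2)},
        (2, 3): {"T": (4, 4)},
        (3, 2): {"T": (4, 4)},
        (4, 4): {},
    }
    print(sorted(traverse((0, 0), atm)))
```

This sketch enumerates every path, which can be expensive on large automata; memoizing the longest suffixes per state is one possible refinement.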

5. The Time and Space Complexity Analysis of FACC

5.1. Time Complexity

In what follows, we first give the time complexity of the FACC in every stage, and then the total time complexity.

Preprocessing
It is necessary to traverse every sequence of once in order to find its common alphabet, therefore the step is time, where is the length of a sequence and is the number of total sequences of , that is, . It is also necessary to traverse every sequence of to filter the characters not in , which also requires time complexity. Assuming all the sequences in with the same length and the number of Map functions is , the total time complexity is in the stage based on MapReduce schemes.

Constructing the Successor Table
To build successor tables for all of the sequences pre-processed, it is necessary to traverse these sequences once again. It turns out that the same time complexity is required as the above preprocessing stage based on MapReduce schemes.

Constructing Atm
According to Theorem 3.6, it is known that the upper bound for the number of the Atm’s states is , where stands for the number of times character appears in the sequence . Because the time complexity for constructing the Atm is proportional to the number of the Atm’s states , the time complexity for constructing the Atm is in this stage based on MapReduce schemes.

Thus, the total time complexity of FACC is equal to .

5.2. Space Complexity

Because the storage space of sequences and successor tables is static and proportional to the size of , the space complexity of their storage is . For building the Atm, the storage space is proportional to the number of states, hence the space complexity for building the Atm is . On the other hand, in recursively constructing the Atm, on the average, the recursion depth is , and then the space complexity required temporary space is for constructing and traversing the Atm. It happens that the space complexity of FACC is

6. Experiments and Analysis of Experimental Results

6.1. Dataset and Experimental Results

In this paper, to test the time performance of FACC and FAST-LCS fairly, we ran the two algorithms on the same hardware platform. Using the DNA and amino acid sample sequence datasets provided by NCBI [27] and DIP [28], we tested the proposed algorithm FACC on a Hadoop cluster with 4 worker nodes, each of which contains 2 Intel X5550 CPUs (2.67 GHz), 8 GB of RAM, and 24 GB of local disk allocated to HDFS. Each node in the cluster ran Hadoop version 0.20.0 on RedHat Enterprise Linux Server release 5.3 and was connected by 100 M Ethernet to a commodity switch. The FAST-LCS algorithm was run on Intel X5550 CPUs (2.67 GHz) with 8 GB of RAM, using the same datasets and operating system. The programming environment of both algorithms is JDK 1.7. The comparisons of the time performance between FACC and FAST-LCS are shown in Tables 4 and 5 and Figures 4 and 5.

6.2. Discussion of Experimental Results

Tables 4 and 5 compare various performance indices of FAST-LCS and FACC on 20 DNA sequences ($|\Sigma| = 4$) and 20 amino acid sequences ($|\Sigma| = 20$) with different lengths of input sequences. It can be seen that the performance indices of FACC are superior to those of FAST-LCS: the precision of FACC reaches 100% (see the Precision column of Tables 4 and 5, which gives the ratio of the number of found MLCSs to the total number of MLCSs), while that of FAST-LCS reaches 95% due to its incorrect pruning operation 2. Moreover, Tables 4 and 5 and Figures 4 and 5 show that the time performance of FACC considerably outperforms that of FAST-LCS, and the advantage of FACC over FAST-LCS grows significantly as the lengths of the input sequences increase.

Figure 6 shows the time performance of the proposed FACC with and without preprocessing. It can be seen from Figure 6 that the time performance with preprocessing is clearly superior to that without preprocessing, especially for a large alphabet and a long average sequence length. Furthermore, the larger the number of input sequences, the more efficient the proposed FACC.

In summary, the time performance of the proposed algorithm FACC is much better than that of FAST-LCS.

7. Conclusions

The efficiency and effectiveness of the existing parallel algorithms for searching for the MLCS are not satisfactory given the increasing complexity and size of biological data; moreover, these algorithms do not give an abstract and formal description of the MLCS problem, and they adopt complicated parallel schemes. To overcome these shortcomings, we propose a novel finite automaton based on the MapReduce parallel framework of cloud computing, called FACC. The proposed algorithm builds on the MapReduce parallel programming framework, the matched pair, and the finite automaton (FA), and uses efficient techniques such as preprocessing, constructing the successor tables and the common subsequence Atm, and traversing the Atm to find the MLCS. The theoretical analysis of the proposed algorithm shows that its time and space complexity are linear, which is superior to the leading parallel MLCS algorithms. Moreover, simulation experiments on real DNA and amino acid sequence sample datasets were conducted, and the performance of the proposed algorithm was compared with that of one of the current leading algorithms, FAST-LCS. The experimental results show that the proposed algorithm is efficient and effective and that its performance is much better than that of FAST-LCS, especially for a large alphabet, a considerable number of sequences, and a long average sequence length. The experimental results also verify the correctness of our theoretical analysis.

Acknowledgment

This work is supported by the National Natural Science Foundation of China (no. 61272119).

Copyright

Copyright © 2012 Yanni Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
