Research Article | Open Access
An Efficient Algorithm for LCS Problem between Two Arbitrary Sequences
The longest common subsequence (LCS) problem is a classic computer science problem. For the essential problem of computing the LCS between two arbitrary sequences, this paper proposes an algorithm whose space and time bounds depend on the total number of position pairs at which the two sequences hold the same symbol. The algorithm can be more efficient than the relevant classical algorithms in specific ranges of that number.
The longest common subsequence (LCS) problem is a classic computer science problem that still attracts continuous attention [1–4]. It is the basis of data comparison programs and is widely used by revision control systems for reconciling multiple changes made to a revision-controlled collection of files. It also has applications in bioinformatics and in many other problems, such as those in [5–7]. For the general case of an arbitrary number of input sequences, the problem is NP-hard [8]. When the number of sequences is constant, the problem is solvable in polynomial time. For the essential problem of computing the LCS between two arbitrary sequences, the complexity is at least proportional to the product of the lengths of the sequences, according to the following conclusion.
It is shown that unless a bound on the total number of distinct symbols [author’s note: the size of alphabet] is assumed, every solution to the problem can consume an amount of time that is proportional to the product of the lengths of the two strings.
The lengths of the sequences make quadratic-time algorithms impractical in many applications. Hence, it is important to design algorithms that are more efficient in practice. This paper is confined to the case of two input sequences and presents an algorithm that can be more efficient than the relevant classical algorithms in specific scenarios.
The following introduction is also confined to the case of two input sequences. Chvátal and Sankoff (1975) proposed a dynamic programming (DP) algorithm with quadratic space and time [10]. It is the basis of the algorithms for the LCS problem. Later the same year, D. S. Hirschberg (1975) presented a divide and conquer (DC) algorithm, a variation of the DP algorithm, taking linear space and quadratic time [11]. In 2000, Bergroth, Hakonen, and Raita contributed a survey [12] showing that no theoretically improved algorithm based on Hirschberg’s DC algorithm [11] had appeared in the preceding decades. In 1977, Hirschberg additionally proposed two algorithms whose running times depend on p, the length of the LCS [13]. The first is efficient when p is small, while the other is efficient when p is close to the sequence length. Both algorithms are more suitable when the length of the LCS can be estimated beforehand. Then, Nakatsu, Kambayashi, and Yajima (1982) in [14] presented an algorithm, with its own time and space bounds, that is suitable for similar sequences. Also in 1977, Hunt and Szymanski [15] proposed an algorithm taking O(r + n) space and O((r + n) log n) time, where n is the sequence length and r is the total number of position pairs at which the two sequences hold the same symbol. The algorithm reduces the LCS problem to the longest increasing subsequence (LIS) problem. Apostolico and Guerra (1987) in [16] proposed an algorithm based on [15] whose running time is expressed in terms of the number of dominant matches (as defined by Hirschberg [13]) and the minimum of a sequence length and the alphabet size. Further, based on [16], Eppstein et al. (1992) in [17] proposed an algorithm for the case where the problem is sparse. If the alphabet size is constant, Masek and Paterson (1980) in [18] proposed an algorithm utilizing the method of four Russians (1970) [19]; Abboud, Backurs, and Williams (2015) in [20] showed an algorithm subject to a further parameter condition. Further algorithms were also proposed by Bille and Farach-Colton (2008) in [21] and Grabowski (2014) in [22], each of which has its own prerequisite.
Constrained by the conclusions of [9, 20], an extensive body of research over these decades has sought complexity below the quadratic bound for computing the LCS between two sequences under application-specific conditions, as can also be found in the survey [12]. For computing the length of the LCS between two sequences over a constant alphabet size, Allison and Dix (1986) presented an algorithm of O(mn/w) time for sequences of lengths m and n, where w is the word length of the computer [23]. This algorithm uses a bit-vector formula with six bitwise operations. Although it falls into the same complexity class as simple DP algorithms, it is faster in practice. Crochemore, Iliopoulos, Pinzon, and Reid (2001) in [24] proposed a similar approach with the same complexity. Because its bit-vector formula uses only four bitwise operations, this approach gives a practical speedup over Allison and Dix’s algorithm.
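The bit-vector idea can be illustrated with a minimal sketch. It is not the code of either cited paper: it assumes the commonly cited four-operation update V' = (V + (V AND M)) OR (V AND NOT M), where M marks the positions of the current symbol, and it uses Python's arbitrary-precision integers as a stand-in for machine words. The function name `lcs_length_bitvector` is hypothetical.

```python
def lcs_length_bitvector(s1, s2):
    """Length of an LCS of s1 and s2 using a bit-vector recurrence (sketch)."""
    m = len(s1)
    mask = (1 << m) - 1  # keeps the state to exactly m bits

    # match[c] has bit i set exactly when s1[i] == c.
    match = {}
    for i, c in enumerate(s1):
        match[c] = match.get(c, 0) | (1 << i)

    v = mask  # all ones: no matches recorded yet
    for c in s2:
        mc = match.get(c, 0)
        # Assumed four-operation update: one AND, one addition, one AND-NOT, one OR.
        v = ((v + (v & mc)) | (v & ~mc)) & mask

    # Each zero bit left in v corresponds to one element of the LCS.
    return m - bin(v).count("1")
```

On real hardware the state would be split into words of w bits with carries propagated between them, which is where the division by w in the running time comes from.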
Compared with the Chvátal-Sankoff algorithm [10], the Hirschberg algorithm [11], and the Hunt-Szymanski algorithm [15], most of the other algorithms for the LCS problem between two sequences rest on additional assumptions, such as the following: the length of the LCS can be estimated beforehand [13, 14], the two input sequences are similar [14, 16], the problem is sparse enough [17], or the alphabet size is finite [16, 18, 20]. Some algorithms give a speedup over the classical algorithms in engineering practice [23, 24]. In this paper, an algorithm is proposed for two arbitrary sequences whose space and time bounds depend on the total number of position pairs at which the two sequences hold the same symbol. Like the Hunt-Szymanski algorithm, it reduces the LCS problem to the longest increasing subsequence (LIS) problem. Compared with the relevant classical algorithms, it can be more efficient in a specific range of that number.
This paper is organized as follows. Section 1 introduces the current state of algorithms for the LCS problem between two sequences. Section 2 presents and exemplifies the proposed algorithm, together with the preliminary terminology needed to understand most of the paper and the theoretical basis of the algorithm. Section 3 analyzes the efficiency of the proposed algorithm.
The longest common subsequence (LCS) is the longest subsequence common to all sequences in a set of sequences. This subsequence is not necessarily unique and is not required to occupy consecutive positions within the original sequences. A function is defined that returns the set of all LCSs between two sequences. Similarly, the longest increasing subsequence (LIS) is a subsequence of a given sequence whose elements are in sorted order, lowest to highest, and which is as long as possible. This subsequence is not necessarily contiguous or unique. A second function is defined that returns the set of all LISs of a sequence.
Given the two input sequences, a new sequence is constructed whose elements are vectors pairing two positions (see Figure 1). The left part of an element is the position of a symbol in one input sequence, and the right part is the position of the same symbol in the other input sequence. The constructed sequence is sorted with one part as the first key in ascending order and the other part as the second key in descending order. Under this construction there is a bijective mapping between the LCSs of the two input sequences and the LISs of the constructed sequence. Hence, the LCS problem can be reduced to the LIS problem. On this theoretical basis, Algorithm 1 is proposed; it is designed to reduce the LCS problem to the LIS problem.
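The reduction from LCS to LIS can be sketched as follows. This is an illustration of the general technique and not a reproduction of Algorithm 1: it assumes the match pairs are ordered by position in the first sequence ascending and by position in the second sequence descending, and it locates the write position with a binary search (`bisect_left`), as in the Hunt-Szymanski style. The name `lcs_via_lis` and all variable names are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def lcs_via_lis(s1, s2):
    """Return one LCS of s1 and s2 (as a list) by reducing LCS to LIS."""
    # Positions of every symbol of s2, stored in descending order.
    positions = defaultdict(list)
    for j in range(len(s2) - 1, -1, -1):
        positions[s2[j]].append(j)

    # Match pairs (i, j) with s1[i] == s2[j]: i ascending, j descending.
    matches = [(i, j) for i, a in enumerate(s1) for j in positions.get(a, ())]

    tails = []     # tails[k]: smallest j that ends an increasing run of length k + 1
    tail_idx = []  # index into matches of the pair currently stored in tails[k]
    prev = [None] * len(matches)  # back-pointers used to recover one LIS
    for idx, (_, j) in enumerate(matches):
        k = bisect_left(tails, j)  # strictly increasing in j
        if k == len(tails):
            tails.append(j)
            tail_idx.append(idx)
        else:
            tails[k] = j
            tail_idx[k] = idx
        prev[idx] = tail_idx[k - 1] if k > 0 else None

    # Follow the back-pointers from the end of the longest run.
    out = []
    idx = tail_idx[-1] if tail_idx else None
    while idx is not None:
        out.append(s1[matches[idx][0]])
        idx = prev[idx]
    return out[::-1]
```

Because pairs that share a position in the first sequence appear with descending second positions, a strictly increasing run of second positions can never use two pairs from the same first position, which is what makes every increasing run a valid common subsequence.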
Figure 2: (a) the two sequences; (b) new data constructed from the two sequences; (c) write operations; (d) final result.
Scan from left to right. The right part of is 3, ; then is going to be computed. ; is the position of in ; therefore .
The right part of is 0; then .
records the information of .
Then, ; therefore ; ; therefore .
For , the right part of is 4; then is going to be computed. ; is the position of in ; therefore .
The right part of is ; then .
records the information of .
Then, ; therefore .
For , ; is the position of in ; therefore .
The right part of is 2; then .
records the information of .
Then, ; therefore . , is kept unchanged, and the rest of the elements and are not going to be checked.
The rest of the elements of can be computed in the same way. Figure 2(d) is the final result of and .
From the auxiliary data , it can be seen that there is only one LIS in . The length of the LIS is 4.
points to ; therefore the last element of the LIS is .
and ; then the last two elements of the LIS are .
and ; then .
and ; then .
is null. Then the LIS is .
Since it is bijective mapping between and , . is the only LCS between and .
According to the conclusion of [15] (paragraph 3, page 4), we have the following.
Step 1 [author’s note: similar to step 1 of Algorithm 1 of this paper] can be implemented by sorting each sequence while keeping track of each element’s original position. We may then merge the sorted sequences creating the MATCHLISTs [author’s note: similar to the corresponding array of this paper] as we go. This step takes a total of O(n log n) time and O(n) space.
Assume is the number of match vectors between and . Step 1 of Algorithm 1 is a process of space and time. As the length of LCS is , step 3 is a process of space and time. Step 4 takes space and time. Write operations in for all element of are listed together in Figure 2(c). In and (see Figure 2(d)), the time of write operation is . In , the time of write operation of dark gray block is ; the time of write operation of light gray block is at most , which is illustrated in Figure 3. Therefore, step 2 takes space and time. Complexities of every step of Algorithm 1 are listed in Table 1. The whole algorithm takes space and time, which is dominated by step 2.
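The sort-and-merge construction described in the quoted passage can be sketched as below. This is an illustration under assumptions, not the paper's step 1: symbols are assumed to be mutually comparable so that they can be sorted, positions are 0-based, and the name `build_matchlists` is hypothetical. Positions holding the same symbol share one list object, which keeps the extra space linear in the sequence length.

```python
def build_matchlists(s1, s2):
    """For each position i of s1, list the positions j of s2 with s1[i] == s2[j].

    Each list is in descending order of j. Built by sorting both sequences
    with their original positions and merging the two sorted runs.
    """
    a = sorted((sym, i) for i, sym in enumerate(s1))
    b = sorted((sym, j) for j, sym in enumerate(s2))
    matchlist = [[] for _ in s1]

    p = q = 0
    while p < len(a) and q < len(b):
        if a[p][0] < b[q][0]:
            p += 1
        elif a[p][0] > b[q][0]:
            q += 1
        else:
            sym = a[p][0]
            # Collect the whole run of this symbol in s2.
            q_end = q
            while q_end < len(b) and b[q_end][0] == sym:
                q_end += 1
            js = [b[t][1] for t in range(q_end - 1, q - 1, -1)]  # descending j
            # Every position of s1 holding this symbol shares the same list.
            while p < len(a) and a[p][0] == sym:
                matchlist[a[p][1]] = js
                p += 1
            q = q_end
    return matchlist
```

For example, `build_matchlists("abcb", "bab")` gives `[[1], [2, 0], [], [2, 0]]`: the symbol `a` matches position 1, both occurrences of `b` match positions 2 and 0, and `c` has no match.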
The algorithm proposed in this paper is designed to compute the LCS between two arbitrary sequences, which is the same as the original intention of the classical algorithms: the Chvátal-Sankoff algorithm [10], the Hirschberg algorithm [11], and the Hunt-Szymanski algorithm [15]. Compared with these classical algorithms, the proposed algorithm can be more efficient in a specific range of the total number of position pairs at which the two sequences hold the same symbol.
3.1. Comparison with Hunt-Szymanski Algorithm
As the original position in of each element of is not used in the process of computing, in Figure 4 Hunt-Szymanski algorithm needs to utilize binary search to locate the position in for write operation for each element of . The time of binary search in of Hunt-Szymanski algorithm is at most , which is illustrated in Figure 5. Using Stirling’s approximation [26–28], . If the demand is only returning one LCS or the length of LCS, array of the algorithm proposed in this paper can be replaced with the MATCHLIST that is used in Hunt-Szymanski algorithm. Therefore, the algorithm proposed in this paper can take space that is the same as the one Hunt-Szymanski algorithm takes. The main difference between them is the time consumed in . In Figure 3, the total time of write operation of both dark gray and light gray blocks is at most . As , if , the algorithm proposed in this paper is more efficient in time than Hunt-Szymanski algorithm (see Figure 7).
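The role of Stirling's approximation can be checked numerically. The exact expression it was applied to is not reproduced in this text, so the sketch below assumes the usual setting in which a sum of logarithms, log 1 + log 2 + ... + log n = log(n!), is bounded by a quantity of order n log n. The function names are hypothetical.

```python
import math


def log_factorial(n):
    """Natural logarithm of n!, computed with the log-gamma function."""
    return math.lgamma(n + 1)


def stirling_log_factorial(n):
    """Stirling's approximation: ln(n!) ~ n ln n - n + 0.5 ln(2 pi n)."""
    return n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n)


for n in (10, 1_000, 100_000):
    exact = log_factorial(n)
    approx = stirling_log_factorial(n)
    # The last column shows ln(n!) as a fraction of n ln n.
    print(n, exact, approx, exact / (n * math.log(n)))
```

The printed ratio stays below 1 and approaches 1 slowly as n grows, which is consistent with treating the sum of logarithms as growing like n log n.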
3.2. Comparison with Chvátal-Sankoff Algorithm
Chvátal-Sankoff algorithm needs times of comparison in space, which is illustrated in Figure 6. To simplify the analysis, only the time consumed in of the algorithm proposed in this paper is going to be compared with the time of Chvátal-Sankoff algorithm. As , if , the algorithm proposed in this paper is more efficient in time than Chvátal-Sankoff algorithm (see Figure 7). In this case of , the proposed algorithm is also more efficient in space than Chvátal-Sankoff algorithm.
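For reference, the quadratic table-filling scheme that this comparison refers to can be sketched as the textbook LCS dynamic program. The code is a generic illustration and not taken from [10]; the name `lcs_length_dp` is hypothetical.

```python
def lcs_length_dp(s1, s2):
    """Length of an LCS of s1 and s2 using the full DP table."""
    m, n = len(s1), len(s2)
    # table[i][j] is the LCS length of s1[:i] and s2[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]
```

Every one of the m × n cells is visited once and performs one symbol comparison, regardless of how many positions actually match, which is the cost that a match-driven algorithm avoids when matches are few.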
3.3. Comparison with Hirschberg Algorithm
Hirschberg algorithm takes space and time. As , the algorithm proposed in this paper takes space and time. Therefore, the proposed algorithm has lower time complexity than Hirschberg algorithm.
This submission concerns an algorithm for an engineering problem. The efficiency of the algorithm is proven mathematically.
Conflicts of Interest
The author declares that they have no conflicts of interest.
References
[1] D. Zhu, L. Wang, and X. Wang, “An improved O(Rlog log n + n) time algorithm for computing the longest common subsequence,” IAENG International Journal of Computer Science (IJCS), vol. 44, no. 2, pp. 166–171, 2017.
[2] J. Yang, Y. Xu, Y. Shang, and G. Chen, “A space-bounded anytime algorithm for the multiple longest common subsequence problem,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 11, pp. 2599–2609, 2014.
[3] Y. Sakai, “A fast On-Line algorithm for the longest common subsequence problem with constant alphabet,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E-95-A, no. 1, pp. 354–361, 2012.
[4] M. Sazvar, M. Naghibzadeh, and N. Saadati, “Quick-MLCS: A new algorithm for the multiple longest common subsequence problem,” in Proceedings of the 5th International C Conference on Computer Science and Software Engineering, C3S2E 2012, pp. 61–66, Canada, June 2012.
[5] R. F. Rahmat, F. Nicholas, S. Purnamawati, and O. S. Sitompul, “File Type Identification of File Fragments using Longest Common Subsequence (LCS),” Journal of Physics: Conference Series, IOP Publishing, p. 012054, 2017.
[6] A. Sorokin, “Using longest common subsequence and character models to predict word forms,” in Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pp. 54–61, Berlin, Germany, August 2016.
[7] X. Xie, W. Liao, H. Aghajan, P. Veelaert, and W. Philips, “Detecting road intersections from GPS traces using longest common subsequence algorithm,” ISPRS International Journal of Geo-Information, vol. 6, no. 1, 2017.
[8] D. Maier, “The complexity of some problems on subsequences and supersequences,” Journal of the ACM, vol. 25, no. 2, pp. 322–336, 1978.
[9] A. V. Aho, D. S. Hirschberg, and J. D. Ullman, “Bounds on the complexity of the longest common subsequence problem,” Journal of the ACM, vol. 23, no. 1, pp. 1–12, 1976.
[10] V. Chvátal and D. Sankoff, “Longest common subsequences of two random sequences,” Journal of Applied Probability, vol. 12, pp. 306–315, 1975.
[11] D. S. Hirschberg, “A linear space algorithm for computing maximal common subsequences,” Communications of the ACM, vol. 18, pp. 341–343, 1975.
[12] L. Bergroth, H. Hakonen, and T. Raita, “A survey of longest common subsequence algorithms,” in Proceedings of the 7th International Symposium on String Processing and Information Retrieval, SPIRE 2000, pp. 39–48, Spain, September 2000.
[13] D. S. Hirschberg, “Algorithms for the longest common subsequence problem,” Journal of the ACM, vol. 24, no. 4, pp. 664–675, 1977.
[14] N. Nakatsu, Y. Kambayashi, and S. Yajima, “A longest common subsequence algorithm suitable for similar test strings,” Acta Informatica, vol. 18, no. 2, pp. 171–179, 1982/83.
[15] J. W. Hunt and T. G. Szymanski, “A fast algorithm for computing longest common subsequences,” Communications of the ACM, vol. 20, no. 5, pp. 350–353, 1977.
[16] A. Apostolico and C. Guerra, “The longest common subsequence problem revisited,” Algorithmica, vol. 2, no. 1–4, pp. 315–336, 1987.
[17] D. Eppstein, Z. Galil, R. Giancarlo, and G. F. Italiano, “Sparse dynamic programming. I. Linear cost functions,” Journal of the ACM, vol. 39, no. 3, pp. 519–545, 1992.
[18] W. J. Masek and M. S. Paterson, “A faster algorithm computing string edit distances,” Journal of Computer and System Sciences, vol. 20, no. 1, pp. 18–31, 1980.
[19] V. L. Arlazarov, E. A. Dinic, M. A. Kronrod, and I. A. Faradzhe, “The economical construction of the transitive closure of an oriented graph,” Doklady Akademii Nauk SSSR, vol. 194, pp. 487–488, 1970.
[20] A. Abboud, A. Backurs, and V. V. Williams, “Tight Hardness Results for LCS and Other Sequence Similarity Measures,” in Proceedings of the 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS), pp. 59–78, Berkeley, CA, USA, October 2015.
[21] P. Bille and M. Farach-Colton, “Fast and compact regular expression matching,” Theoretical Computer Science, vol. 409, no. 3, pp. 486–496, 2008.
[22] S. Grabowski, “New tabulation and sparse dynamic programming based techniques for sequence similarity problems,” Discrete Applied Mathematics: The Journal of Combinatorial Algorithms, Informatics and Computational Sciences, vol. 212, pp. 96–103, 2016.
[23] L. Allison and T. I. Dix, “A bit-string longest-common-subsequence algorithm,” Information Processing Letters, vol. 23, no. 6, pp. 305–310, 1986.
[24] M. Crochemore, C. S. Iliopoulos, Y. Pinzon, and J. F. Reid, “A fast and practical bit-vector algorithm for the longest common subsequence problem,” Information Processing Letters, vol. 80, no. 6, pp. 279–285, 2001.
[25] D. Gusfield, Algorithms on strings, trees and sequences: computer science and computational biology, Cambridge University Press, New York, NY, USA, 1997.
[26] J. Dutka, “The early history of the factorial function,” Archive for History of Exact Sciences, vol. 43, no. 3, pp. 225–249, 1991.
[27] L. Le Cam, “The central limit theorem around 1935,” Statistical Science, vol. 1, no. 1, pp. 78–91, 1986.
[28] K. Pearson, “Historical Note on the Origin of the Normal Curve of Errors,” Biometrika, vol. 16, no. 3-4, pp. 402–404, 1924.
Copyright © 2018 Yubo Li. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.