Abstract
The longest common subsequence (LCS) problem is a classic computer science problem. For the essential problem of computing the LCS between two arbitrary sequences A = a_1a_2...a_n and B = b_1b_2...b_m, this paper proposes an algorithm whose space and time bounds are expressed in terms of R, the total number of elements in the match set {(i, j) | a_i = b_j}. The algorithm can be more efficient than the relevant classical algorithms in specific ranges of R.
1. Introduction
The longest common subsequence (LCS) problem is a classic computer science problem that still attracts continuous attention [1–4]. It is the basis of data comparison programs and is widely used by revision control systems for reconciling multiple changes made to a revision-controlled collection of files. It also has applications in bioinformatics and many other problems [5–7]. For the general case of an arbitrary number of input sequences, the problem is NP-hard [8]. When the number of sequences is constant, the problem is solvable in polynomial time [9]. For the essential problem of computing the LCS between two arbitrary sequences, the complexity is at least proportional to the product of the lengths of the sequences, according to the following conclusion.
It is shown that unless a bound on the total number of distinct symbols [author's note: the size of the alphabet] is assumed, every solution to the problem can consume an amount of time that is proportional to the product of the lengths of the two strings [9].
The lengths of the input sequences make quadratic-time algorithms impractical in many applications. Hence, it is worthwhile to design algorithms that are more efficient in practice. This paper is confined to the case of two input sequences and presents an algorithm that can be more efficient than the relevant classical algorithms in specific scenarios.
The following introduction is also confined to the case of two input sequences, of lengths n and m. Chvátal and Sankoff (1975) described a dynamic programming (DP) algorithm of O(nm) space and O(nm) time [10]. It is the basis of most subsequent algorithms for the LCS problem. Soon after, in the same year, D. S. Hirschberg (1975) published a divide-and-conquer (DC) variation of the DP algorithm taking O(n + m) space and O(nm) time [11]. In 2000, Bergroth, Hakonen, and Raita contributed a survey [12] showing that in the intervening decades Hirschberg's DC algorithm [11] had not been theoretically improved upon. In 1977, Hirschberg additionally proposed an O(pn + n log n) algorithm and an O(p(m + 1 - p) log n) algorithm, where p is the length of the LCS [13]. The first is efficient when p is small, while the other is efficient when p is close to m. Both algorithms are more suitable when the length of the LCS can be estimated beforehand. Nakatsu, Kambayashi, and Yajima (1982) [14] then presented an algorithm suitable for similar sequences, with a time bound of O(n(m - p)). Let the two sequences be A = a_1a_2...a_n and B = b_1b_2...b_m. Also in 1977, Hunt and Szymanski proposed an algorithm taking O(R + n) space and O((R + n) log n) time, where R is the total number of elements in the set {(i, j) | a_i = b_j} [15]. The algorithm reduces LCS to the longest increasing subsequence (LIS) problem. Apostolico and Guerra (1987) [16] proposed an algorithm based on [15] whose running time is parameterized by the number of dominant matches (as defined by Hirschberg [13]) and by the minimum of m and the alphabet size. Further, based on [16], Eppstein (1992) [17] proposed an algorithm that is faster when the problem is sparse. If the alphabet size is constant, Masek and Paterson (1980) [18] proposed an O(n^2 / log n) algorithm utilizing the method of Four Russians (1970) [19]; Abboud, Backurs, and Williams (2015) [20] showed that an O(n^(2 - epsilon)) algorithm, for any constant epsilon > 0, would falsify the Strong Exponential Time Hypothesis. Further algorithms are proposed by Bille and Farach-Colton (2008) in [21] and Grabowski (2014) in [22], each of which has its own prerequisites.
Restrained by the conclusions of [9, 20], over the past decades an extensive amount of research has tried to achieve complexity lower than O(nm) for computing the LCS between two condition-specific sequences for different applications, as can also be found in the survey [12]. For computing the length of the LCS between two sequences over a constant-size alphabet, Allison and Dix (1986) presented an algorithm of O(nm / w), where w is the word length of the computer [23]. This algorithm uses a bit-vector formula with 6 bitwise operations. Although it falls into the same complexity class as the simple DP algorithm, it is faster in practice. Crochemore, Iliopoulos, Pinzon, and Reid (2001) [24] proposed a similar approach whose complexity is also O(nm / w). Because only 4 operations are used by its bit-vector formula, this approach gives a practical speedup over Allison and Dix's algorithm.
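A 4-operation bit-vector recurrence for the LCS length can be sketched as follows. This is an illustrative reimplementation in the spirit of [23, 24], not the authors' code or necessarily their exact formula; Python's unbounded integers play the role of the w-bit machine words, so the constant-factor speedup of real bit-vectors is not reproduced here.

```python
def lcs_length_bitvector(a, b):
    """Length of the LCS of strings a and b via a 4-operation
    bit-vector recurrence (in the spirit of [23, 24]).

    Bit i of v is cleared once row i of the implicit DP table has
    absorbed a match; the LCS length is the number of zero bits in v.
    """
    n = len(a)
    mask = (1 << n) - 1
    # match[c]: bit i is set iff a[i] == c.
    match = {}
    for i, c in enumerate(a):
        match[c] = match.get(c, 0) | (1 << i)
    v = mask  # all ones: no matches consumed yet
    for c in b:
        y = match.get(c, 0)
        # The four bitwise/arithmetic operations per column of the DP table:
        v = ((v + (v & y)) | (v & ~y)) & mask
    return n - bin(v).count("1")
```

For example, `lcs_length_bitvector("ABCBDAB", "BDCABA")` returns 4, the length of the LCS BCBA.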
Compared with the Chvátal-Sankoff algorithm [10], Hirschberg's algorithm [11], and the Hunt-Szymanski algorithm [15], most of the other algorithms for the LCS problem between two sequences have stronger prerequisites, such as the following: the length of the LCS is estimable beforehand [13, 14], the two input sequences are similar [14, 16], the problem is sparse enough [17], or the alphabet size is finite [16, 18, 20]. Some algorithms give an engineering speedup over the classical ones [23, 24]. In this paper, an algorithm is proposed whose space and time bounds are functions of R, the total number of elements in the set {(i, j) | a_i = b_j}, assuming the two arbitrary sequences are A = a_1a_2...a_n and B = b_1b_2...b_m. The algorithm also reduces LCS to the longest increasing subsequence (LIS) problem. Compared with the relevant classical algorithms, it can be more efficient in specific ranges of R.
This paper is organized as follows. In Section 1, the current state of algorithms for the LCS problem between two sequences is reviewed. The proposed algorithm is presented and exemplified in Section 2, which also gives the preliminary terminology needed to understand the paper and the theoretical basis of the proposed algorithm. In Section 3, the efficiency of the proposed algorithm is analyzed.
2. Algorithm
The longest common subsequence (LCS) is the longest subsequence common to all sequences in a set of sequences. This subsequence is not necessarily unique, nor is it required to occupy consecutive positions within the original sequences (e.g., BCBA is a longest common subsequence of ABCBDAB and BDCABA). LCS(A, B) is defined as a function that returns the set containing all the LCSs between two sequences. The longest increasing subsequence (LIS) of a given sequence is a subsequence whose elements are in sorted order, lowest to highest, and which is as long as possible. This subsequence is likewise not necessarily contiguous or unique (e.g., (0, 2, 6, 9, 11, 15) is a longest increasing subsequence of (0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15)). LIS(S) is defined as a function that returns the set containing all the LISs of a sequence. Assume A = a_1a_2...a_n and B = b_1b_2...b_m, and assume there is a sequence M whose elements are the match vectors, written in the form (i, j) with a_i = b_j (see Figure 1). The left part i of an element of M is the position of a symbol in A, and the right part j is the position of the same symbol in B. M is sorted with i as the first key in ascending order and j as the second key in descending order. Define S as the sequence of the right parts of the elements of M. Associating each element of S with its match vector, there is a bijective mapping between LCS(A, B) and LIS(S); hence, LCS can be reduced to LIS [25]. According to this theoretical basis, Algorithm 1 is proposed. The algorithm is designed to reduce the LCS problem to the LIS problem.
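The construction described above can be sketched in a few lines (the function name is illustrative and positions are zero-based here). The decisive detail is the sort order, ascending in the left part and descending in the right part for equal left parts, which prevents two matches that share a position of either sequence from appearing together in one strictly increasing subsequence.

```python
def match_sequence(a, b):
    """Build the match sequence of vectors (i, j) with a[i] == b[j],
    sorted by i ascending and, within equal i, by j descending, plus
    the sequence s of the right parts (the j-components).  Every
    longest strictly increasing subsequence of s then corresponds to
    a longest common subsequence of a and b."""
    m = [(i, j)
         for i in range(len(a))
         for j in sorted(range(len(b)), reverse=True)
         if a[i] == b[j]]
    s = [j for _, j in m]
    return m, s
```

For a = ABCBDAB and b = BDCABA this yields s = (5, 3, 4, 0, 2, 4, 0, 1, 5, 3, 4, 0); the increasing subsequence (0, 2, 4, 5) of s picks out the matches (1, 0), (2, 2), (3, 4), (5, 5), i.e., the LCS BCBA.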

2.1. Example
Reuse the sequences A and B that were given previously. The process of computing the LCSs between A and B using Algorithm 1 is illustrated in Figure 2 and described as follows.
(a) The two sequences A and B
(b) The new data constructed from A and B
(c) Write operations
(d) The final result
Scan from left to right. The right part of is 3, ; then is going to be computed. ; is the position of in ; therefore .
The right part of is 0; then .
records the information of .
Then, ; therefore ; ; therefore .
For , the right part of is 4; then is going to be computed. ; is the position of in ; therefore .
The right part of is ; then .
records the information of .
Then, ; therefore .
For , ; is the position of in ; therefore .
The right part of is 2; then .
records the information of .
Then, ; therefore . , is kept unchanged, and the rest of the elements and are not going to be checked.
The rest of the elements of can be computed in the same way. Figure 2(d) is the final result of and .
From the auxiliary data , it can be seen that there is only one LIS in . The length of the LIS is 4.
points to ; therefore the last element of the LIS is .
and ; then the last two elements of the LIS are .
and ; then .
and ; then .
is null. Then the LIS is .
Since the mapping between the LCSs and the LISs is bijective, the LIS found above corresponds to the only LCS between A and B.
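As a cross-check on the example, the reduction can be run end to end. The sketch below is an illustrative reconstruction, not the paper's Algorithm 1 verbatim: in particular, it locates the write position in the threshold array with binary search (the Hunt-Szymanski strategy discussed in Section 3.1), all names are made up, and positions are zero-based.

```python
import bisect

def lcs_via_lis(a, b):
    """Recover one LCS of a and b by the LCS-to-LIS reduction:
    list the match vectors (i, j) with i ascending and j descending
    for equal i, then take a longest strictly increasing subsequence
    of the j-components."""
    # Step 1: match vectors, i ascending, j descending within equal i.
    pos = {}
    for j, c in enumerate(b):
        pos.setdefault(c, []).append(j)
    matches = []
    for i, c in enumerate(a):
        for j in reversed(pos.get(c, [])):
            matches.append((i, j))
    # Step 2: LIS over the j-components with a threshold array;
    # thresh[k] is the smallest tail j of an increasing subsequence of
    # length k + 1, and link[] records predecessors for backtracking.
    thresh, ends, link = [], [], [None] * len(matches)
    for idx, (i, j) in enumerate(matches):
        k = bisect.bisect_left(thresh, j)
        if k == len(thresh):
            thresh.append(j)
            ends.append(idx)
        else:
            thresh[k] = j
            ends[k] = idx
        link[idx] = ends[k - 1] if k > 0 else None
    # Steps 3-4: backtrack the recorded chain into one LCS string.
    out, idx = [], ends[-1] if ends else None
    while idx is not None:
        out.append(a[matches[idx][0]])
        idx = link[idx]
    return "".join(reversed(out))
```

On A = ABCBDAB and B = BDCABA this returns one LCS of length 4.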
2.2. Complexity
According to the conclusion of [15] (paragraph 3, page 4), we have the following.
Step 1 [author's note: similar to step 1 of Algorithm 1 of this paper] can be implemented by sorting each sequence while keeping track of each element's original position. We may then merge the sorted sequences, creating the MATCHLISTs [author's note: similar to the array of this paper] as we go. This step takes a total of O(n log n) time.
Assume R is the number of match vectors between A and B, and let p be the length of the LCS. Step 1 of Algorithm 1 constructs and sorts M; step 3 recovers the LCS from the recorded information; step 4 outputs it. The write operations for all elements of M are listed together in Figure 2(c); the total time of the write operations for the dark gray and light gray blocks is bounded as illustrated in Figure 3. The complexities of every step of Algorithm 1 are listed in Table 1. The cost of the whole algorithm is dominated by step 2.
3. Efficiency
The algorithm proposed in this paper is designed to compute the LCS between two arbitrary sequences, which is the same original intention as that of the classical algorithms: the Chvátal-Sankoff algorithm [10], Hirschberg's algorithm [11], and the Hunt-Szymanski algorithm [15]. The proposed algorithm can be more efficient than the classical algorithms in specific ranges of R, where R is the total number of elements in the set {(i, j) | a_i = b_j}, assuming the two arbitrary sequences are A = a_1a_2...a_n and B = b_1b_2...b_m.
3.1. Comparison with HuntSzymanski Algorithm
Because the original position of each element of M is not retained during the computation, the Hunt-Szymanski algorithm (Figure 4) must use binary search to locate the position for the write operation for each element of M. The total time of binary search in the Hunt-Szymanski algorithm is therefore at most log(1) + log(2) + ... + log(R) = log(R!), as illustrated in Figure 5. Using Stirling's approximation [26–28], log(R!) = R log R - O(R). If the demand is only to return one LCS or the length of the LCS, the array used by the algorithm proposed in this paper can be replaced with the MATCHLIST used in the Hunt-Szymanski algorithm; therefore, the proposed algorithm can take the same space as the Hunt-Szymanski algorithm. The main difference between them is the time consumed in the write operations. In Figure 3, the total time of the write operations for both the dark gray and the light gray blocks is bounded; hence, in the corresponding range of R, the algorithm proposed in this paper is more efficient in time than the Hunt-Szymanski algorithm (see Figure 7).
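The Stirling-type bound invoked here is easy to check numerically. The sketch below (illustrative; the function names are not from the paper) compares log2(R!), i.e., the sum log2(1) + ... + log2(R) that upper-bounds the total binary-search cost, with the leading terms of its Stirling expansion.

```python
import math

def log2_factorial(r):
    """Exact log2(R!) computed via the log-gamma function:
    log2(R!) = lgamma(R + 1) / ln 2.  This equals the sum
    log2(1) + log2(2) + ... + log2(R), the worst-case total cost of
    R successive binary searches over lists of growing length."""
    return math.lgamma(r + 1) / math.log(2)

def stirling_leading(r):
    """Leading terms of Stirling's approximation in base 2:
    log2(R!) = R*log2(R) - R/ln 2 + O(log R)."""
    return r * math.log2(r) - r / math.log(2)
```

For R = 1000 the two values differ by about 6 (under 0.1 percent), so log2(R!) may be treated as Theta(R log R) in the comparisons above.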
3.2. Comparison with ChvátalSankoff Algorithm
The Chvátal-Sankoff algorithm performs nm comparisons in O(nm) space, as illustrated in Figure 6. To simplify the analysis, only the time consumed in the write operations of the algorithm proposed in this paper is compared with the O(nm) time of the Chvátal-Sankoff algorithm. If that cost falls below nm, i.e., in the corresponding range of R, the algorithm proposed in this paper is more efficient in time than the Chvátal-Sankoff algorithm (see Figure 7). In that case, the proposed algorithm is also more efficient in space than the Chvátal-Sankoff algorithm.
3.3. Comparison with Hirschberg Algorithm
Hirschberg's algorithm takes O(n + m) space and O(nm) time. In the range of R identified above, the algorithm proposed in this paper takes less time; therefore, in that range, the proposed algorithm has a lower time bound than Hirschberg's algorithm.
Data Availability
This submission is about an algorithm for an engineering problem. The efficiency of the algorithm is proven mathematically.
Conflicts of Interest
The author declares that they have no conflicts of interest.