1. Introduction
In this paper, we consider a generalized longest common subsequence problem. The longest common subsequence (LCS) is a well-known measure of the similarity of two strings. The LCS problem can be widely applied in diverse areas, such as file comparison, pattern matching and computational biology [1].
A sequence is an ordered list of characters over an alphabet, Σ. A subsequence of a sequence, X, is obtained by deleting zero or more characters (not necessarily contiguous) from X. A substring of a sequence, X, is a subsequence of successive characters within X.
For a given sequence, $X={x}_{1}{x}_{2}\cdots {x}_{n}$, of length n, the ith character of X is denoted ${x}_{i}\in \Sigma $, for any $i=1,\cdots ,n$. A substring of X from position i to j can be denoted as $X[i:j]={x}_{i}{x}_{i+1}\cdots {x}_{j}$. A substring, $X[i:j]$, is called a prefix of X if $i=1$ and a suffix of X if $j=n$.
Given two sequences, X and Y, the LCS problem is finding a subsequence of X and Y whose length is the longest among all common subsequences of the two given sequences.
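For orientation, the unconstrained LCS problem defined above can be solved by the classic quadratic dynamic program. The following is a minimal sketch, not taken from the paper; the function name `lcs_length` is an assumption made for illustration.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of a longest common subsequence of x and y (classic O(nm) DP)."""
    n, m = len(x), len(y)
    # dp[i][j] holds the LCS length of the prefixes x[:i] and y[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                # Matching characters extend an LCS of the shorter prefixes.
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                # Otherwise drop the last character of x or of y.
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]


print(lcs_length("abbb", "aab"))  # 2
```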
For some biological applications, some constraints must be applied to the LCS problem. These types of variants of the LCS problem are called constrained LCS (CLCS) problems [2].
A recent variant of the LCS problem, which was first addressed in [2], has received considerable attention. The most cited algorithms solve the CLCS problem by dynamic programming. Some improved algorithms have also been proposed in [3,4]. The LCS and CLCS problems on indeterminate strings were also discussed in [4]. A bit-parallel algorithm for solving the CLCS problem was proposed in [3]. The problem was extended to have weighted constraints, a more generalized problem, in [5]. A variant of the CLCS problem with multiple constraints, the restricted LCS problem, which excludes the given constraint as a subsequence of the answer, was proposed in [6]. This restricted LCS problem becomes nondeterministic polynomial-time hard (NP-hard) when the number of constraints is not fixed [6].
Recently, Chen and Chao [7] proposed a more generalized form of the CLCS problem, the generalized constrained LCS (GCLCS) problem. For the two input sequences, X and Y, of lengths n and m, respectively, and a constraint string, P, of length r, the GCLCS problem is a set of four problems that find the LCS of X and Y that includes/excludes P as a subsequence/substring. The four generalized constrained LCSs are summarized in Table 1 [7].
Table 1. The generalized constrained longest common subsequence (GCLCS) problems. STREC, string-excluding.
| Problem | Input | Output |
|---|---|---|
| SEQICLCS | X, Y, and P | The LCS of X and Y that includes P as a subsequence |
| STRICLCS | X, Y, and P | The LCS of X and Y that includes P as a substring |
| SEQECLCS | X, Y, and P | The LCS of X and Y that excludes P as a subsequence |
| STRECLCS | X, Y, and P | The LCS of X and Y that excludes P as a substring |
We will discuss the STRECLCS problem in this paper. We found that a previously proposed dynamic programming algorithm for the STRECLCS problem [7] cannot correctly solve the problem. Let $L(i,j,k)$ denote the length of an LCS of $X[1:i]$ and $Y[1:j]$, excluding $P[1:k]$ as a substring. Chen and Chao [7] gave a recursive formula, Formula (1), for computing $L(i,j,k)$.
The boundary conditions of this recursive formula are $L(i,0,k)=L(0,j,k)=0$ for any $0\le i\le n$, $0\le j\le m$ and $0\le k\le r$.
The algorithm presented in [7] was stated without a rigorous proof. Thus, the correctness of the proposed algorithm cannot be guaranteed. For example, if $X=abbb$, $Y=aab$ and $P=ab$, the values of $L(i,j,k)$, $1\le i\le 4$, $1\le j\le 3$, $0\le k\le 2$, computed by recursive Formula (1) are listed in Table 2.
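Table 2. $L(i,j,k)$ computed by recursive Formula (1).
|  | $k=0,j=1$ | $k=0,j=2$ | $k=0,j=3$ | $k=1,j=1$ | $k=1,j=2$ | $k=1,j=3$ | $k=2,j=1$ | $k=2,j=2$ | $k=2,j=3$ |
|---|---|---|---|---|---|---|---|---|---|
| $i=1$ | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |
| $i=2$ | 1 | 1 | 2 | 0 | 0 | 1 | 1 | 1 | 2 |
| $i=3$ | 1 | 1 | 2 | 0 | 0 | 1 | 1 | 1 | 2 |
| $i=4$ | 1 | 1 | 2 | 0 | 0 | 1 | 1 | 1 | 2 |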
From Table 2, we know that the final answer is $L(4,3,2)=2$, which is computed by the formula $L(4,3,2)=1+L(3,2,2)$, since, in this case, $k\ge 2$ and ${x}_{4}={y}_{3}={p}_{2}=b$. However, this is a wrong answer: the only common subsequence of X and Y of length two is $ab$, which is P itself, so the correct answer should be one.
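The counterexample is small enough to check exhaustively. The sketch below is an illustration only and not part of the algorithm in [7] or of the new algorithm: it enumerates every common subsequence of X and Y and keeps the longest one that does not contain P. The helper names `subsequences` and `brute_force_strec_lcs` are assumptions, the constraint string is assumed to be non-empty, and the running time is exponential, so it is only suitable for tiny inputs.

```python
from itertools import combinations


def subsequences(s: str) -> set:
    """All distinct subsequences of s (exponential; tiny inputs only)."""
    return {
        "".join(s[i] for i in idx)
        for size in range(len(s) + 1)
        for idx in combinations(range(len(s)), size)
    }


def brute_force_strec_lcs(x: str, y: str, p: str) -> int:
    """Length of a longest common subsequence of x and y without p as a substring.

    Assumes p is non-empty, so the empty subsequence is always admissible.
    """
    common = subsequences(x) & subsequences(y)
    return max(len(z) for z in common if p not in z)


print(brute_force_strec_lcs("abbb", "aab", "ab"))  # 1
```

The common subsequences of $abbb$ and $aab$ are the empty string, $a$, $b$ and $ab$, so excluding $ab$ leaves a maximum length of one.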
A new dynamic programming solution for the STRECLCS problem is presented in this paper, and the correctness of the new algorithm is proven. The time complexity of the new algorithm is $O(nmr)$.
The organization of the paper is as follows.
In the following sections, we describe our dynamic programming algorithm for the STRECLCS problem.
In Section 2, we present a new dynamic programming solution for the STRECLCS problem with time complexity $O(nmr)$ from a novel perspective. In Section 3, we discuss the issues involved in implementing the algorithm efficiently. Some concluding remarks are provided in Section 4.
2. A Simple Dynamic Programming Solution
For the two input sequences, $X={x}_{1}{x}_{2}\cdots {x}_{n}$ and $Y={y}_{1}{y}_{2}\cdots {y}_{m}$, of lengths n and m, respectively, and a constraint string, $P={p}_{1}{p}_{2}\cdots {p}_{r}$, of length r, we want to find an LCS of X and Y that excludes P as a substring.
In the description of our new algorithm, a function, σ, will be mentioned frequently. For any string, S, and a fixed constraint string, P, the length of the longest suffix of S that is also a prefix of P is denoted by the function, $\sigma (S)$.
The function, σ, refers to both P and S. Because the string, S, is a variable and the constraint string, P, is fixed, the notation, $\sigma (S)$, will not cause confusion, even though it does not reflect its dependence on P.
The symbol ⊕ is used to denote string concatenation.
For example, if $P=aaba$ and $S=aabaaab$, then substring $aab$ is the longest suffix of S that is also a prefix of P; therefore, $\sigma (S)=3$.
It is readily seen that $S\oplus P=aabaaabaaba$.
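The definition of σ can be written down directly. The following is a naive sketch that follows the definition literally rather than any efficient method; the function name `sigma` and its two-argument form, with P passed explicitly, are assumptions made for illustration.

```python
def sigma(s: str, p: str) -> int:
    """Length of the longest suffix of s that is also a prefix of p."""
    # Try candidate lengths from longest to shortest; the first hit is the answer.
    for length in range(min(len(s), len(p)), 0, -1):
        if s.endswith(p[:length]):
            return length
    return 0


s, p = "aabaaab", "aaba"
print(sigma(s, p))  # 3
print(s + p)        # aabaaabaaba
```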
Let $Z(i,j,k)$ denote the set of all longest common subsequences, z, of $X[i:n]$ and $Y[j:m]$ such that $P[1:k]\oplus z$ excludes P as a substring, where $1\le i\le n$, $1\le j\le m$ and $0\le k\le r$. $P[1:k]$ is an empty string if $k=0$. The length of an LCS in $Z(i,j,k)$ is denoted $f(i,j,k)$.
If we can compute $f(i,j,k)$ for any $1\le i\le n,1\le j\le m$ and $0\le k<r$ efficiently, then the length of an LCS of X and Y that excludes P as a substring must be $f(1,1,0)$.
We can obtain a recursive formula for computing $f(i,j,k)$ with the following theorem.
Theorem 1 For the two input sequences, $X={x}_{1}{x}_{2}\cdots {x}_{n}$ and $Y={y}_{1}{y}_{2}\cdots {y}_{m}$, of lengths n and m, respectively, and a constraint string, $P={p}_{1}{p}_{2}\cdots {p}_{r}$, of length r, let $Z(i,j,k)$ denote the set of all LCSs of $X[i:n]$ and $Y[j:m]$ that exclude P as a substring of $P[1:k]\oplus z$ for each $z\in Z(i,j,k)$.
The length of an LCS in $Z(i,j,k)$ is denoted, $f(i,j,k)$.
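For any $1\le i\le n,1\le j\le m$ and $0\le k<r$, $f(i,j,k)$ can be computed with the following recursive Formula (2):
$$f(i,j,k)=\begin{cases}\max \{f(i+1,j+1,k),\,1+f(i+1,j+1,q)\} & \text{if } {x}_{i}={y}_{j} \text{ and } q<r,\\ \max \{f(i+1,j,k),\,f(i,j+1,k)\} & \text{otherwise,}\end{cases}\qquad (2)$$
where $q=\sigma (P[1:k]\oplus {x}_{i})$, and the boundary conditions are $f(i,m+1,k)=f(n+1,j,k)=0$ for any $1\le i\le n+1,1\le j\le m+1$ and $0\le k\le r$.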
Proof. For any $1\le i\le n,1\le j\le m$ and $0\le k<r$, suppose $f(i,j,k)=t$ and $z={z}_{1},\cdots ,{z}_{t}\in Z(i,j,k)$.
First, we note that for each pair, $({i}^{\prime},{j}^{\prime}),1\le {i}^{\prime}\le n,1\le {j}^{\prime}\le m$, such that ${i}^{\prime}\ge i$ and ${j}^{\prime}\ge j$, we have $f({i}^{\prime},{j}^{\prime},k)\le f(i,j,k)$, because a common subsequence, z, of $X[{i}^{\prime}:n]$ and $Y[{j}^{\prime}:m]$ that excludes P as a substring of $P[1:k]\oplus z$ is also a common subsequence of $X[i:n]$ and $Y[j:m]$ that excludes P as a substring of $P[1:k]\oplus z$.
(1) When ${x}_{i}\ne {y}_{j}$, we have ${x}_{i}\ne {z}_{1}$ or ${y}_{j}\ne {z}_{1}$.
(1.1) If ${x}_{i}\ne {z}_{1}$, then $z={z}_{1},\cdots ,{z}_{t}$ is a common subsequence of $X[i+1:n]$ and $Y[j:m]$ that excludes P as a substring of $P[1:k]\oplus z$; thus, $f(i+1,j,k)\ge t$. Conversely, $f(i+1,j,k)\le f(i,j,k)=t$. Therefore, in this case, we have $f(i,j,k)=f(i+1,j,k)$.
(1.2) If ${y}_{j}\ne {z}_{1}$, then in a similar manner, we can prove that $f(i,j,k)=f(i,j+1,k)$ in this case.
Combining the two subcases, we conclude that when ${x}_{i}\ne {y}_{j}$, we have
$$f(i,j,k)=\max \{f(i+1,j,k),\,f(i,j+1,k)\}.$$
(2) When ${x}_{i}={y}_{j}$ and $q<r$, there are also two subcases to be distinguished.
(2.1) If ${x}_{i}={y}_{j}\ne {z}_{1}$, then $z={z}_{1},\cdots ,{z}_{t}$ is also a common subsequence of $X[i+1:n]$ and $Y[j+1:m]$ that excludes P as a substring of $P[1:k]\oplus z$ and, thus, $f(i+1,j+1,k)\ge t$. Conversely, $f(i+1,j+1,k)\le f(i,j,k)=t$. Therefore, we have $f(i,j,k)=f(i+1,j+1,k)$ in this case.
(2.2) If ${x}_{i}={y}_{j}={z}_{1}$, then $f(i,j,k)=t>0$ and $z={z}_{1},\cdots ,{z}_{t}$ is an LCS of $X[i:n]$ and $Y[j:m]$ that excludes P as a substring of $P[1:k]\oplus z$, and thus, ${z}^{\prime}={z}_{2},\cdots ,{z}_{t}$ is a common subsequence of $X[i+1:n]$ and $Y[j+1:m]$ that excludes P as a substring of $P[1:k]\oplus {x}_{i}\oplus {z}^{\prime}$.
Let $q=\sigma (P[1:k]\oplus {x}_{i})$. Then $P[1:q]$ is the longest suffix of $P[1:k]\oplus {x}_{i}$ that is also a prefix of P. It follows that $P[1:q]\oplus {z}^{\prime}$ is a suffix of $P[1:k]\oplus {x}_{i}\oplus {z}^{\prime}$. Therefore, a sequence that excludes P as a substring of $P[1:k]\oplus {x}_{i}\oplus {z}^{\prime}$ is also a sequence that excludes P as a substring of $P[1:q]\oplus {z}^{\prime}$. Since ${z}^{\prime}={z}_{2},\cdots ,{z}_{t}$ is a common subsequence of $X[i+1:n]$ and $Y[j+1:m]$ that excludes P as a substring of $P[1:k]\oplus {x}_{i}\oplus {z}^{\prime}$, it is also a common subsequence of $X[i+1:n]$ and $Y[j+1:m]$ that excludes P as a substring of $P[1:q]\oplus {z}^{\prime}$. In other words:
$$f(i+1,j+1,q)\ge t-1=f(i,j,k)-1\qquad (3)$$
Conversely, suppose $f(i+1,j+1,q)=s$ and $v={v}_{1},\cdots ,{v}_{s}\in Z(i+1,j+1,q)$. Then v is an LCS of $X[i+1:n]$ and $Y[j+1:m]$ that excludes P as a substring of $P[1:q]\oplus v$. In this case, ${v}^{\prime}={x}_{i}\oplus v$ is a common subsequence of $X[i:n]$ and $Y[j:m]$ that excludes P as a substring of $P[1:k]\oplus {v}^{\prime}=P[1:k]\oplus {x}_{i}\oplus v$, because $P[1:q]$ is the longest suffix of $P[1:k]\oplus {x}_{i}$ that is also a prefix of P and $q<r$. Therefore:
$$f(i,j,k)\ge 1+s=1+f(i+1,j+1,q)\qquad (4)$$
Combining (3) and (4), we have:
$$f(i,j,k)=1+f(i+1,j+1,q)\qquad (5)$$
Combining the two subcases, where ${x}_{i}={y}_{j}$ and $q<r$, we conclude that $f(i,j,k)=\max \{f(i+1,j+1,k),\,1+f(i+1,j+1,q)\}$, so the recursive Formula (2) is correct for this case.
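(3) When ${x}_{i}={y}_{j}$ and $q=r$, we must have ${x}_{i}={y}_{j}\ne {z}_{1}$; otherwise, $P[1:k]\oplus z$ will include the string $P[1:k]\oplus {x}_{i}=P$. Similar to Subcase (2.1), we can conclude that in this case,
$$f(i,j,k)=f(i+1,j+1,k).$$
Because $f(i+1,j+1,k)\le f(i+1,j,k)\le f(i,j,k)$ and $f(i,j+1,k)\le f(i,j,k)$ by the monotonicity noted at the beginning of the proof, it follows that $f(i,j,k)=\max \{f(i+1,j,k),\,f(i,j+1,k)\}$ in this case.
The proof is complete. ■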
3. Implementation of the Algorithm
According to Theorem 1, our new algorithm for computing $f(i,j,k)$ is a standard dynamic programming algorithm. With the recursive Formula (2), the new dynamic programming algorithm for computing $f(i,j,k)$ can be implemented as the following Algorithm 1.
To implement our new algorithm efficiently, it is important to compute $\sigma (P[1:k]\oplus {x}_{i})$ in line 8 of Algorithm 1 efficiently for each $0\le k<r$ and each ${x}_{i}$, $1\le i\le n$.
It is clear that $\sigma (P[1:k]\oplus {x}_{i})=k+1$ when ${x}_{i}={p}_{k+1}$. It is more complex to compute $\sigma (P[1:k]\oplus {x}_{i})$ when ${x}_{i}\ne {p}_{k+1}$. In this case, the length of the matched prefix of P must be shortened to the largest t with $0\le t<k$, such that ${p}_{k-t+1}\cdots {p}_{k}={p}_{1}\cdots {p}_{t}$ and ${x}_{i}={p}_{t+1}$. Therefore, in this case, $\sigma (P[1:k]\oplus {x}_{i})=t+1$. If no such t exists, then $\sigma (P[1:k]\oplus {x}_{i})=0$.
This computation is very similar to the computation of the prefix function in the Knuth-Morris-Pratt string searching algorithm (KMP algorithm) for solving the string matching problem [8].
Algorithm 1 STRECLCS.
Input: Strings X = x_{1} … x_{n}, Y = y_{1} … y_{m} of lengths n and m, respectively, and a constraint string, P = p_{1} … p_{r}, of length r
Output: The length of an LCS of X and Y that excludes P as a substring
1: for all i, j, k, 1 ≤ i ≤ n + 1, 1 ≤ j ≤ m + 1, and 0 ≤ k ≤ r do
2:     f(i, m + 1, k) ← 0, f(n + 1, j, k) ← 0 {boundary condition}
3: end for
4: for i = n down to 1 do
5:     for j = m down to 1 do
6:         for k = 0 to r − 1 do
7:             f(i, j, k) ← max{f(i + 1, j, k), f(i, j + 1, k)}
8:             q ← σ(P[1 : k] ⊕ x_{i})
9:             if x_{i} = y_{j} and q < r then
10:                 f(i, j, k) ← max{f(i + 1, j + 1, k), 1 + f(i + 1, j + 1, q)}
11:             end if
12:         end for
13:     end for
14: end for
15: return f(1, 1, 0)
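A direct transcription of Algorithm 1 into Python might look as follows. This is a sketch under stated assumptions rather than the authors' implementation: the names `sigma` and `strec_lcs_length` are hypothetical, σ is evaluated naively from its definition instead of from a precomputed table, the constraint string is assumed to be non-empty, and the table is allocated with one extra row and column so that $f(n+1,m+1,k)$ is also zero.

```python
def sigma(s: str, p: str) -> int:
    """Naive sigma: longest suffix of s that is also a prefix of p."""
    for length in range(min(len(s), len(p)), 0, -1):
        if s.endswith(p[:length]):
            return length
    return 0


def strec_lcs_length(x: str, y: str, p: str) -> int:
    """Length of an LCS of x and y that excludes p as a substring."""
    n, m, r = len(x), len(y), len(p)
    if r == 0:
        raise ValueError("the constraint string p is assumed to be non-empty")
    # f[i][j][k] uses the 1-based indices of the text; every boundary cell,
    # including f[n+1][m+1][k], starts at 0.
    f = [[[0] * (r + 1) for _ in range(m + 2)] for _ in range(n + 2)]
    for i in range(n, 0, -1):
        for j in range(m, 0, -1):
            for k in range(r):
                f[i][j][k] = max(f[i + 1][j][k], f[i][j + 1][k])
                q = sigma(p[:k] + x[i - 1], p)
                if x[i - 1] == y[j - 1] and q < r:
                    f[i][j][k] = max(f[i + 1][j + 1][k],
                                     1 + f[i + 1][j + 1][q])
    return f[1][1][0]


print(strec_lcs_length("abbb", "aab", "ab"))  # 1
```

With the naive `sigma` the loop body is not constant time; replacing it by a table lookup is what brings the running time down to $O(nmr)$.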

For a given string, $S={s}_{1}\cdots {s}_{n}$, the prefix function, $kmp(i)$, denotes the length of the longest prefix of ${s}_{1}\cdots {s}_{i-1}$ that matches a suffix of ${s}_{1}\cdots {s}_{i}$. For example, if $S=ababaa$, then $kmp(1),\cdots ,kmp(6)=0,0,1,2,3,1$.
For the constraint string, $P={p}_{1}\cdots {p}_{r}$, of length r, its prefix function, $kmp$, can be precomputed in $O(r)$ time by Algorithm 2.
Algorithm 2 Prefix Function. 
Input: String P = p_{1} … p_{r}
Output: The prefix function kmp of P
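The body of Algorithm 2 is not reproduced here, so the following is only a sketch of the standard KMP prefix-function computation, which matches the definition of $kmp$ given above. The function name `prefix_function` and the padding entry at index 0 are assumptions made so that list indices agree with the 1-based indices of the text.

```python
def prefix_function(p: str) -> list:
    """Return kmp with kmp[i] defined for 1 <= i <= len(p); kmp[0] is padding."""
    r = len(p)
    kmp = [0] * (r + 1)
    k = 0  # length of the currently matched border
    for i in range(2, r + 1):
        # Fall back to shorter borders until the next character matches.
        while k > 0 and p[k] != p[i - 1]:
            k = kmp[k]
        if p[k] == p[i - 1]:
            k += 1
        kmp[i] = k
    return kmp


print(prefix_function("ababaa")[1:])  # [0, 0, 1, 2, 3, 1]
```

Each iteration increases `k` by at most one, and every fallback decreases it, so the total work is $O(r)$.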
With this precomputed prefix function, $kmp$, the function $\sigma (P[1:k]\oplus ch)$, for each character $ch\in \Sigma $ and $0\le k<r$, can be described as Algorithm 3.
To accelerate the processing, we can precompute a table, $\lambda (k,ch)$, of the function $\sigma (P[1:k]\oplus ch)$ for each character $ch\in \Sigma $ and $0\le k<r$. It is clear that $\lambda (k-1,{p}_{k})=k$ for each $1\le k\le r$. The other values of the table, λ, can be computed by using the prefix function, $kmp$, in the following recursive Algorithm 4.
Algorithm 3 σ(k, ch). 
Input: String P = p_{1} … p_{r}, integer k and character ch
Output: σ(P[1 : k] ⊕ ch)
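The body of Algorithm 3 is not reproduced here. One plausible reading, following the fallback rule described earlier in this section, is the usual KMP transition step sketched below. The names `prefix_function` and `sigma_step` are assumptions, and the sketch assumes $0\le k<r$.

```python
def prefix_function(p: str) -> list:
    """Standard KMP prefix function, 1-based with kmp[0] as padding."""
    kmp = [0] * (len(p) + 1)
    k = 0
    for i in range(2, len(p) + 1):
        while k > 0 and p[k] != p[i - 1]:
            k = kmp[k]
        if p[k] == p[i - 1]:
            k += 1
        kmp[i] = k
    return kmp


def sigma_step(p: str, kmp: list, k: int, ch: str) -> int:
    """sigma(P[1:k] + ch) for 0 <= k < len(p), using the prefix function."""
    # Shorten the matched prefix until ch extends it or nothing is left.
    while k > 0 and p[k] != ch:
        k = kmp[k]
    return k + 1 if p[k] == ch else 0


p = "aaba"
kmp = prefix_function(p)
print(sigma_step(p, kmp, 2, "a"))  # 2
print(sigma_step(p, kmp, 3, "b"))  # 0
```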

Algorithm 4 λ(k, ch). 
Input: Integer k, character ch Output: Value of λ(k, ch) 
The time cost of the above preprocessing algorithm is clearly $O(r|\Sigma |)$. By using this precomputed table, λ, the value of the function $\sigma (P[1:k]\oplus ch)$ for each character $ch\in \Sigma $ and $0\le k<r$ can be computed readily in $O(1)$ time.
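The body of Algorithm 4 is not reproduced here. The table can be filled in $O(r|\Sigma |)$ time by the usual construction of the KMP automaton, sketched below in iterative rather than recursive form. The names `prefix_function` and `build_lambda`, and the representation of λ as a list of dictionaries, are assumptions made for illustration.

```python
def prefix_function(p: str) -> list:
    """Standard KMP prefix function, 1-based with kmp[0] as padding."""
    kmp = [0] * (len(p) + 1)
    k = 0
    for i in range(2, len(p) + 1):
        while k > 0 and p[k] != p[i - 1]:
            k = kmp[k]
        if p[k] == p[i - 1]:
            k += 1
        kmp[i] = k
    return kmp


def build_lambda(p: str, alphabet: str) -> list:
    """lam[k][ch] = sigma(P[1:k] + ch) for 0 <= k < len(p) and ch in alphabet."""
    kmp = prefix_function(p)
    lam = [dict() for _ in range(len(p))]
    for k in range(len(p)):
        for ch in alphabet:
            if ch == p[k]:
                lam[k][ch] = k + 1             # the match is extended by one
            elif k == 0:
                lam[k][ch] = 0                 # nothing matched, nothing to reuse
            else:
                lam[k][ch] = lam[kmp[k]][ch]   # reuse the row of the shorter border
    return lam


lam = build_lambda("aaba", "ab")
print(lam[2]["a"], lam[2]["b"])  # 2 3
```

Each entry is filled with a constant number of operations, because row `kmp[k]` has a smaller index than row `k` and is therefore already complete.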
With this precomputed table, λ, the loop body of the above Algorithm 1 requires only $O(1)$ time, because $\lambda (k,{x}_{i})$ can be computed in $O(1)$ time for each ${x}_{i},1\le i\le n$ and any $0\le k<r$. Therefore, our new algorithm for computing the length of an LCS of X and Y that excludes P as a substring requires $O(nmr)$ time and $O(r|\Sigma |)$ preprocessing time.
If we want to obtain the actual LCS of X and Y that excludes P as a substring, not only its length, we can use a simple recursive backtracking algorithm for this purpose, given as Algorithm 5.
At the end of our new algorithm, a function call, $back(1,1,0)$, will produce the resultant LCS accordingly.
Because the cost of the computation of $\lambda (k,{x}_{i})$ is $O(1)$, the algorithm, $back(i,j,k)$, will cost $O(n+m)$ in the worst case.
Finally, we summarize our results in the following theorem:
Theorem 2 Algorithm 1 solves the STRECLCS problem correctly in $O(nmr)$ time and $O(nmr)$ space, with preprocessing time $O(r|\Sigma |)$.
Algorithm 5 back(i, j, k).
Comments: A recursive backtracking algorithm to construct the actual LCS.
if i > n or j > m then
    return
end if
if x_{i} = y_{j} and λ(k, x_{i}) < r and f(i, j, k) = 1 + f(i + 1, j + 1, λ(k, x_{i})) then
    print x_{i}
    back(i + 1, j + 1, λ(k, x_{i}))
else if f(i + 1, j, k) > f(i, j + 1, k) then
    back(i + 1, j, k)
else
    back(i, j + 1, k)
end if
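The table computation and the backtracking step can be combined into one routine that returns an actual string. The sketch below is an illustration under stated assumptions, not the authors' code: the name `strec_lcs` is hypothetical, λ is built naively from the definition of σ for brevity, the constraint string is assumed to be non-empty, and the backtracking is written as a loop instead of a recursion to avoid Python's recursion limit. It also checks explicitly that the new state is smaller than r before a character is taken.

```python
def strec_lcs(x: str, y: str, p: str) -> str:
    """Return one LCS of x and y that excludes p as a substring."""
    n, m, r = len(x), len(y), len(p)
    if r == 0:
        raise ValueError("the constraint string p is assumed to be non-empty")

    def sigma(s: str) -> int:
        # Naive sigma: longest suffix of s that is also a prefix of p.
        for length in range(min(len(s), r), 0, -1):
            if s.endswith(p[:length]):
                return length
        return 0

    # lam[k][ch] = sigma(P[1:k] + ch), only for characters that occur in x.
    lam = [{ch: sigma(p[:k] + ch) for ch in set(x)} for k in range(r)]

    # Fill f exactly as in Algorithm 1 (1-based indices, zero boundary).
    f = [[[0] * (r + 1) for _ in range(m + 2)] for _ in range(n + 2)]
    for i in range(n, 0, -1):
        for j in range(m, 0, -1):
            for k in range(r):
                f[i][j][k] = max(f[i + 1][j][k], f[i][j + 1][k])
                if x[i - 1] == y[j - 1]:
                    q = lam[k][x[i - 1]]
                    if q < r:
                        f[i][j][k] = max(f[i + 1][j + 1][k],
                                         1 + f[i + 1][j + 1][q])

    # Iterative version of back(1, 1, 0).
    out = []
    i, j, k = 1, 1, 0
    while i <= n and j <= m:
        # q = r marks "this character cannot be taken here".
        q = lam[k][x[i - 1]] if x[i - 1] == y[j - 1] else r
        if q < r and f[i][j][k] == 1 + f[i + 1][j + 1][q]:
            out.append(x[i - 1])
            i, j, k = i + 1, j + 1, q
        elif f[i + 1][j][k] > f[i][j + 1][k]:
            i += 1
        else:
            j += 1
    return "".join(out)


print(strec_lcs("abbb", "aab", "ab"))  # a
print(strec_lcs("abc", "abc", "b"))    # ac
```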

4. Conclusions
We have suggested a new dynamic programming solution for the STRECLCS problem. The new algorithm corrects a previously presented dynamic programming algorithm with the same time and space complexities.
The STRICLCS problem is another interesting GCLCS, which is very similar to the STRECLCS problem.
The STRICLCS problem, introduced in [7], is to find an LCS of two main sequences, in which a constraining sequence of length r must be included as its substring. In [7], an $O(nmr)$-time algorithm was presented to solve this problem. Almost immediately, the presented algorithm was improved to a quadratic-time algorithm and to accept many main input sequences [9,10].
It is not clear whether the same improvement can be applied to our presented $O(nmr)$-time algorithm for the STRECLCS problem to achieve a quadratic-time algorithm. We will investigate the problem further.