Abstract

Global sequence alignment is one of the most basic pairwise sequence alignment procedures used in molecular biology to understand the similarity in structure, function, or evolutionary relationship between two nucleotide sequences. The general algorithm associated with global sequence alignment is the dynamic programming algorithm of Needleman and Wunsch. In this paper, patterns in the score matrix of the Needleman–Wunsch algorithm are investigated. With the help of some examples, the patterns observed are formulated as new a priori propositions and corollaries that are established for both equal and unequal length comparisons of any two arbitrary sequences.

1. Introduction

Sequence alignment is the matching of strings or sequences of characters to identify patterns that may reveal structural or functional relationships between the sequences matched. Many problems call for sequence alignment, and examples abound from computer science to molecular biology, where sequences are usually aligned to make more meaning out of them. Bare [1] has discussed how researchers have over the years considered genetic sequences as strings of characters, instead of focusing on their intrinsic properties, to be able to compute the similarity among two or more related sequences. Global and local sequence alignment are used for matching two sequences, in which case they are categorized as pairwise sequence alignment; an alignment that matches more than two sequences is categorized as multiple sequence alignment [2–4]. The Needleman–Wunsch algorithm [5–7] and the Smith–Waterman algorithm [8] are the general algorithms associated with pairwise global and local alignment, respectively. These two algorithms utilize a procedure called the dynamic programming approach. Dynamic programming, a term coined by Bellman in the 1940s, is simply the process of solving a bigger problem by finding optimal solutions to its smaller nested problems [9–11]. Thus, for a problem to be tackled by dynamic programming, it must admit a recurrence. In [12, 13], an algorithm is described as a well-defined computational procedure that accepts input values and produces output values by following a systematic method. Mathematical theory is thus a prerequisite for designing functional programs [14, 15], and algorithm design specializes in solving such problems. Global sequence alignment is mentioned as one of the many practical applications of dynamic programming [16–18].

More recently, Ouzounis and Baichoo [19] have stated that even though pairwise sequence alignment has been studied for a long time, concerns still remain in resolving exact evolutionary distances that demand very specific estimates. They have suggested the existence of theoretical relationships within alignments, algorithms, and data that are yet to be found. Motivated by the display of position-dependent arrays for affine gaps using the Needleman–Wunsch algorithm in [20, 21], we consider the more basic linear gap penalties using arbitrary sequences and look for patterns in the score matrix, which were absent in that presentation. This approach was taken because it is missing from the available literature, although it can be likened to the edit graph illustration in [1]. To the best of our knowledge, there has been no theoretical exposition of this kind for the fundamental concept of constant linear gap penalties, which constitute an implicit part of affine gap penalties. Thus, we focus on finding patterns in global sequence alignment using constant linear gap penalties and display completed score matrices of the Needleman–Wunsch algorithm with distinct positional arrays that can be inspected. We confirm the patterns with basic proofs and suggest how predictions can be made. For the reader's convenience, useful concepts are recalled in brief, with references to comprehensive works that give more in-depth information.

2. Needleman–Wunsch Algorithm

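Let the recursive formulation for the Needleman–Wunsch algorithm be as follows.

(1) Initialization. Let $F(0,0) = 0$, $F(i,0) = F(i-1,0)$ + gap penalty, and $F(0,j) = F(0,j-1)$ + gap penalty, for $i, j \geq 1$, where $F(0,0)$ is the initialization pivot for the score matrix, $F(i,0)$ is the initial row pivot for the score matrix, and $F(0,j)$ is the initial column pivot for the score matrix.

(2) Cell Box Calculation. Let
$$F(i,j) = \max\{F(i-1,j-1) + s(x_i, y_j),\; F(i-1,j) + \text{gap penalty},\; F(i,j-1) + \text{gap penalty}\},$$
where $F(i,j)$ is the pivot for each cell box calculation, $F(i-1,j-1) + s(x_i, y_j)$ is the diagonal value of a cell box, $F(i-1,j)$ + gap penalty is the right value of a cell box, $F(i,j-1)$ + gap penalty is the left value of a cell box, and $s(x_i, y_j)$ is the score for aligning the sequence characters $x_i$ and $y_j$.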

Specifically, the procedure for the Needleman–Wunsch algorithm follows the dynamic programming approach of the score matrix, traceback, and alignment as outlined in the subsequent sections.

2.1. Score Matrix

The score matrix is a tabular box constructed to keep count of score results. The score matrix for the Needleman–Wunsch algorithm begins with an initialization process and ends with the calculation of cell boxes.

2.1.1. Initialization

A sequence matrix is created with $m+1$ columns and $n+1$ rows, so that the initial gap can be accommodated, where $m$ and $n$ are the lengths of the arbitrary sequences $x$ and $y$, respectively. The letters of the sequence $x$ fill in the horizontal axis, and similarly, the characters of $y$ fill in the vertical axis of the sequence matrix created. Before the scoring begins from the upper left corner of the initialized matrix to the lower right corner of the matrix, the value 0 is assigned to the intersection of the first row and the first column of the matrix (i.e., the initial gap). A gap penalty is needed in an alignment because a mutation may insert or delete a character in one of the sequences. Arrows that point in the direction of positional movement (diagonal, left, or right) are placed in each cell box of the matrix, and the process terminates only when all the cell boxes are completely filled.

2.1.2. Calculation of Cell Boxes in the Score Matrix

(1) Fill the initialized gap values first on both the horizontal axis and the vertical axis with a defined constant gap value score
(2) Follow with the calculation of each cell box having the three position-dependent arrays (left/beside, right/bottom, or diagonal)
(3) Allow only a match/mismatch value for a “diagonal” position, and allow the “bottom” or “beside” positions to take linear gap values only
(4) For each computed cell box, find the maximum of its values and let that be the pivot
(5) The pivot of a computed cell box directly affects the next cell boxes in the row or the column
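The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fill_cell_boxes`, the (row, column) indexing with the first sequence on the horizontal axis, and the tuple order (diagonal, from-above, from-left) are all hypothetical choices, while the scores +5, −1, and −2 are the ones adopted in this study and the two sequences are those of Example 1.

```python
def fill_cell_boxes(x, y, match=5, mismatch=-1, gap=-2):
    """Return (pivot, boxes) for x on the horizontal axis and y on the vertical axis.

    pivot[r][c] is the maximum kept in each cell box; boxes[(r, c)] holds the
    three candidate values (diagonal, from_above, from_left) of that cell box.
    """
    n_rows, n_cols = len(y) + 1, len(x) + 1
    pivot = [[0] * n_cols for _ in range(n_rows)]
    boxes = {}

    # Step (1): initial gap values along both axes.
    for c in range(1, n_cols):
        pivot[0][c] = pivot[0][c - 1] + gap
    for r in range(1, n_rows):
        pivot[r][0] = pivot[r - 1][0] + gap

    # Steps (2)-(5): three candidates per cell box, the maximum is the pivot.
    for r in range(1, n_rows):
        for c in range(1, n_cols):
            s = match if x[c - 1] == y[r - 1] else mismatch
            diagonal = pivot[r - 1][c - 1] + s   # match/mismatch only
            from_above = pivot[r - 1][c] + gap   # linear gap only
            from_left = pivot[r][c - 1] + gap    # linear gap only
            boxes[(r, c)] = (diagonal, from_above, from_left)
            pivot[r][c] = max(diagonal, from_above, from_left)
    return pivot, boxes


pivot, boxes = fill_cell_boxes("CTTGA", "CTAGA")
print(boxes[(1, 1)])   # (5, -4, -4)
print(pivot[-1][-1])   # 19
```

Keeping all three candidates per cell box, rather than only the pivot, is what makes the positional patterns discussed later inspectable.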

2.2. Traceback

A simpler score matrix table that contains only the pivot of each cell box calculation is constructed from the original position-dependent arrays of the score matrix table. Arrow pointers are used to direct a path from the optimal value in the matrix (which occurs at the lower right corner of the matrix) back through the largest predecessor values until we reach the intersection of the first row and the first column, which holds the initial gap.
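A traceback of this kind might look like the sketch below. The function name `traceback_alignment`, the preference for the diagonal move when several predecessors tie, and the convention that a horizontal move consumes a character of the horizontal sequence are assumptions made for illustration; the arrow naming in the figures may differ.

```python
def traceback_alignment(x, y, match=5, mismatch=-1, gap=-2):
    """Fill the pivot matrix, then walk back from the lower right corner."""
    rows, cols = len(y) + 1, len(x) + 1
    F = [[0] * cols for _ in range(rows)]
    for c in range(1, cols):
        F[0][c] = c * gap
    for r in range(1, rows):
        F[r][0] = r * gap
    for r in range(1, rows):
        for c in range(1, cols):
            s = match if x[c - 1] == y[r - 1] else mismatch
            F[r][c] = max(F[r - 1][c - 1] + s, F[r - 1][c] + gap, F[r][c - 1] + gap)

    r, c = rows - 1, cols - 1
    top, bottom = [], []
    while r > 0 or c > 0:
        if r > 0 and c > 0:
            s = match if x[c - 1] == y[r - 1] else mismatch
            if F[r][c] == F[r - 1][c - 1] + s:       # diagonal predecessor
                top.append(x[c - 1])
                bottom.append(y[r - 1])
                r, c = r - 1, c - 1
                continue
        if c > 0 and F[r][c] == F[r][c - 1] + gap:   # predecessor on the left
            top.append(x[c - 1])
            bottom.append("-")
            c -= 1
        else:                                        # predecessor above
            top.append("-")
            bottom.append(y[r - 1])
            r -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), F[-1][-1]


print(traceback_alignment("CTTGA", "CTAGA"))  # ('CTTGA', 'CTAGA', 19)
```

A different tie-breaking rule can return a different alignment with the same optimal score.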

2.3. Alignment
2.3.1. Alignment Generation from a Traceback

To write the sequence characters that appear from the optimal alignment path of the traceback stage, the following steps are followed:
(1) When the arrow is diagonal, write both characters in the alignment
(2) When the arrow is vertical, write the corresponding horizontal character, and in place of the vertical character, leave a gap
(3) When the arrow is horizontal, write the corresponding vertical character, and in place of the horizontal character, leave a gap

Thus, for both the vertical and horizontal arrow positions, one character and a gap are written for the alignment, where the gap explicitly replaces a character position in the alignment. An alignment can only be inferred as the best if the optimal value from the score matrix table corresponds to the alignment score calculation (based on the scoring scheme defined).
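The consistency check described here, comparing the alignment score with the optimal value, can be sketched as below. The helper name `alignment_score` is hypothetical, and the gapped alignment in the second call is only one possible alignment of the Example 2 sequences, chosen for illustration.

```python
def alignment_score(top, bottom, match=5, mismatch=-1, gap=-2):
    """Score two aligned strings of equal length, where '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # one character paired with a gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(alignment_score("CTTGA", "CTAGA"))  # 19
print(alignment_score("AGCTG", "-TCAG"))  # 6
```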

2.3.2. The Problem of Aligning Any Two Sequences

The problem of aligning any two sequences can be stated as finding the optimal way of aligning two arbitrary sequences, say $x$ and $y$, such that the character “−”, denoting a gap, is inserted into $x$, into $y$, or into both, where
(1) Any single character in $x$ matches a single character in $y$ or a gap
(2) The final sum of scores from the scoring function over the aligned pairs and the gap penalties as given by the gap penalty function is maximized

2.3.3. Letter Choice for Arbitrary Sequences

The letter choice of this study shall be that of DNA, considered as a string over four characters: adenine (A), guanine (G), cytosine (C), and thymine (T).

3. Scheme for Scoring

Definition 1. The constant gap penalty is $w(k) = dk$, where $d$ is an assigned constant and $k$ is the sequence length count, $i$ for the score matrix row and $j$ for the score matrix column. This is the penalty awarded to gaps and is also known as the linear gap function.

Definition 2. The affine gap penalty is the penalty awarded to gaps where a greatest consecutive sequence of $k$ gaps is given as $w(k) = o + (k-1)e$, where $o$ is the penalty charged for opening the gap and $e$ is the penalty charged for extending it.
Thus, more formally we define
(1) A constant linear gap as $w(k) = dk$
(2) An affine gap as $w(k) = o + (k-1)e$
Altschul’s theory: assign positive scores to identities and conserved replacements, and assign negative scores to less likely replacements.

Remark 1. Needleman and Wunsch use the “identity matrix” in scoring, with 1 for a match and 0 for a mismatch. This scoring was criticised for not reflecting observations from nature, because transitions (purine–purine or pyrimidine–pyrimidine substitutions) are observed more often than transversions (purine–pyrimidine substitutions). Because there is no definitive score for DNA alignment, the choice of scoring for this study is +5 for a match, −1 for a mismatch, and −2 for a gap, as proposed in [18] following the above theory.
More formally, let
$$s(x_i, y_j) = \begin{cases} +5, & x_i = y_j, \\ -1, & x_i \neq y_j. \end{cases}$$
Also, let the linear gap penalty be $w(k) = dk$, where $d$ is a constant of “−2” and $k$ is the sequence length count, $i$ for the score matrix row and $j$ for the score matrix column.
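The scoring scheme chosen above can be written down as two small functions. The names `s` and `linear_gap` are illustrative choices, not notation confirmed by the source.

```python
def s(a, b, match=5, mismatch=-1):
    """Score for aligning character a with character b."""
    return match if a == b else mismatch


def linear_gap(k, d=-2):
    """Linear gap penalty for a count of k gaps, with constant d."""
    return d * k


print(s("A", "A"), s("A", "G"))  # 5 -1
print(linear_gap(3))             # -6
```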

4. Results and Discussion

4.1. Pattern Investigations in the Score Matrix of Needleman–Wunsch Algorithm

The sequences $x$ and $y$ that will be used are arbitrary sample sequences chosen to cover the following cases:
(1) $|x| = |y|$ (equal character length of sequences)
(2) $|x| \neq |y|$ (unequal character length of sequences)

4.1.1. Equal Character Length

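Example 1. Let $x = \text{CTTGA}$ and $y = \text{CTAGA}$. Because the sequences $x$ and $y$ have lengths $m = 5$ and $n = 5$, respectively, we align $x$ and $y$ in a score matrix following the above formulation.
(1) Initialization: $F(0,0) = 0$ for score matrix initialization. For the initial row pivot, $F(i,0) = F(i-1,0)$ + gap penalty, where $i = 1, \ldots, 5$, the gap constant of −2 gives $F(1,0) = -2$, $F(2,0) = -4$, $F(3,0) = -6$, $F(4,0) = -8$, and $F(5,0) = -10$. For the initial column pivot, $F(0,j) = F(0,j-1)$ + gap penalty, $j = 1, \ldots, 5$, it likewise gives $F(0,1) = -2$, $F(0,2) = -4$, $F(0,3) = -6$, $F(0,4) = -8$, and $F(0,5) = -10$.
Before following with further calculation, some identification is placed on each cell box for easy noting as shown in Table 1.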

(2)Cell box calculations:

Remark 2. in was “5” because that is the assigned score for matched characters. .

Remark 3. in was “−1” because that is the assigned score for mismatched characters. .

Remark 4. in was “−1” because that is the assigned score for mismatched characters. .
Consequently, the recursion follows similarly until all the cell boxes are filled. In the event of a pivot tie in a cell box, one tie value is picked. For the tabular display, these notations are used interchangeably; is the same as , is the same as , is the same as , and is the same as the pivot.
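As a worked illustration of a single cell box, consider the first one for CTTGA and CTAGA, where both leading characters are C. The symbols $F(i,j)$ for the pivot and $s$ for the match/mismatch score are assumed notation here; the numbers follow from the scoring scheme (+5, −1, −2) and the initial gap values of −2 along both axes.

```latex
\begin{aligned}
F(1,1) &= \max\{\,F(0,0) + s(\mathrm{C},\mathrm{C}),\; F(0,1) + (-2),\; F(1,0) + (-2)\,\} \\
       &= \max\{\,0 + 5,\; -2 - 2,\; -2 - 2\,\} \\
       &= \max\{\,5,\; -4,\; -4\,\} = 5 .
\end{aligned}
```

The diagonal value wins, so the pivot of this cell box is 5 and the arrow for this cell is diagonal.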

4.1.2. Pattern Results for Equal Character Length

With reference to Tables 1 and 2 and Figure 1,
(1) The filled-in cell boxes of column 1 have values coinciding with those of the cell boxes of row 1
(2) The leading value (pivot) of each cell box in both row 1 and column 1 remains the same
(3) The bottom value in column 1 switches to become the beside value in row 1 and vice versa
(4) For each 3 pointed arrow intersecting any 4 cell boxes, the beside value of a 2nd cell box also coincides with the bottom value of a 3rd cell box
(5) For each cell box, the preceding bottom value is less than the immediate next adjacent diagonal value
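Patterns (1) to (3) can be checked numerically for CTTGA and CTAGA with a short sketch. It assumes that the "bottom" value is the candidate computed from the cell above and the "beside" value is the candidate computed from the cell on the left; the function name `pivots` and the (row, column) indexing are likewise illustrative.

```python
GAP = -2


def pivots(x, y, match=5, mismatch=-1, gap=GAP):
    """Pivot matrix with x on the horizontal axis and y on the vertical axis."""
    rows, cols = len(y) + 1, len(x) + 1
    F = [[(r + c) * gap if r == 0 or c == 0 else 0 for c in range(cols)]
         for r in range(rows)]
    for r in range(1, rows):
        for c in range(1, cols):
            s = match if x[c - 1] == y[r - 1] else mismatch
            F[r][c] = max(F[r - 1][c - 1] + s, F[r - 1][c] + gap, F[r][c - 1] + gap)
    return F


F = pivots("CTTGA", "CTAGA")
n = 5
row1 = [F[1][k] for k in range(1, n + 1)]
col1 = [F[k][1] for k in range(1, n + 1)]
print(row1)  # [5, 3, 1, -1, -3]
print(col1)  # [5, 3, 1, -1, -3]

# Pattern (3): bottom value in column 1 equals beside value in row 1, and vice versa.
swapped = all(
    F[k - 1][1] + GAP == F[1][k - 1] + GAP      # bottom of (k, 1) vs beside of (1, k)
    and F[k][0] + GAP == F[0][k] + GAP          # beside of (k, 1) vs bottom of (1, k)
    for k in range(1, n + 1)
)
print(swapped)  # True
```

For this pair the symmetry arises because the first character of each sequence matches or mismatches the characters of the other sequence at the same positions.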

4.1.3. Traceback and Alignment for Equal Character Length

The traceback and alignment stages for equal character length of sequences are, respectively, shown in this section.

Figure 2 shows the pivot of each cell box calculation of the score matrix table of CTTGA and CTAGA in Figure 1. Thus, Figure 2 is a simpler score matrix table constructed from the original score matrix table of CTTGA and CTAGA in Figure 1. The arrow pointers direct the path from the optimal value and trace back to the initialization value of zero. Based on the traceback and the diagonal direction of the arrow pointers, the alignment is written as

C T T G A
C T A G A

To confirm the correctness of the alignment, we check using calculations. Recall the scoring scheme: match = +5, mismatch = −1, and gap = −2. We have four alignment matches: (C, C), (T, T), (G, G), and (A, A) and one mismatched alignment: (T, A). Hence, $4(5) + 1(-1) = 19$, which is the same as the optimal value from the score matrix table. The alignment is thus optimal.

4.1.4. Unequal Character Length

Example 2. Suppose $x = \text{AGCTG}$ and $y = \text{TCAG}$; then, to fill in the score matrix values of the cell boxes, the prior-stated recursive formulation is used. This results in Figures 3 and 4.

4.1.5. Pattern Results for Unequal Character Length

(1) The filled-in cell boxes of column 1 have values consistent with those of row 1 only up to a certain cell box, after which an inconsistency appears
(2) Except at the cell boxes where this inconsistency is noted, the pivot values for each cell box of row 1 and column 1 remain the same
(3) For each 3 pointed arrow intersecting any 4 cell boxes, the beside value of a cell box also coincides with the bottom value of a cell box
(4) For each cell box, the preceding bottom value is always less than the immediate next adjacent diagonal value

4.1.6. Traceback and Alignment for Unequal Character Length

The traceback and alignment stages for the example on unequal character length of sequences are, respectively, shown below. Figure 5 shows the pivot of each cell box calculation of the score matrix table of AGCTG and TCAG of Table 3. Based on the traceback and the direction of the arrow pointers, the alignment is written as

We check the correctness of the alignment using calculations. Recall the scoring scheme: match = +5, mismatch = −1, and gap = −2. We have two alignment matches: (C, C) and (G, G), two alignment mismatches, and one gapped alignment. Hence, $2(5) + 2(-1) + 1(-2) = 6$, which is the same as the optimal value from the score matrix table. The alignment is thus optimal.
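The optimal values quoted for both examples can be reproduced without storing the whole score matrix. The sketch below keeps only two rows at a time; the function name `optimal_score` is hypothetical, and this space-saving variant is a swapped-in technique for illustration, not the tabular procedure of the paper.

```python
def optimal_score(x, y, match=5, mismatch=-1, gap=-2):
    """Optimal global alignment score using two rolling rows of pivots."""
    prev = [c * gap for c in range(len(x) + 1)]      # initial gap row
    for r, b in enumerate(y, start=1):
        curr = [r * gap]                             # initial gap column entry
        for c, a in enumerate(x, start=1):
            s = match if a == b else mismatch
            curr.append(max(prev[c - 1] + s, prev[c] + gap, curr[c - 1] + gap))
        prev = curr
    return prev[-1]


print(optimal_score("CTTGA", "CTAGA"))  # 19
print(optimal_score("AGCTG", "TCAG"))   # 6
```

Because only the score is returned, this variant cannot produce the traceback; it is useful purely as a check on the value in the lower right corner.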

4.2. Propositions and Proofs for Equal Sequence Length

The following propositions are a priori results deduced from the pattern results of equal character length of sequences.

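Definition 3. Let the linear gap penalty be $w(k) = dk$, where $d$ is a constant of $-2$ and $k$ is the sequence length count, $i$ for the score matrix row and $j$ for the score matrix column.

Definition 4. Define $F(0,0) = 0$ to be the initialization pivot for the score matrix.

Definition 5. Define $F(i,0) = F(i-1,0)$ + gap penalty, $i \geq 1$, to be the initial row pivot for the score matrix.

Definition 6. Define $F(0,j) = F(0,j-1)$ + gap penalty, $j \geq 1$, to be the initial column pivot for the score matrix.

Definition 7. Define
$$F(i,j) = \max\{F(i-1,j-1) + s(x_i, y_j),\; F(i-1,j) + \text{gap penalty},\; F(i,j-1) + \text{gap penalty}\}$$
and
$$s(x_i, y_j) = \begin{cases} +5, & x_i = y_j, \\ -1, & x_i \neq y_j, \end{cases}$$
where $F(i,j)$ is the pivot for each cell box calculation, $F(i-1,j-1) + s(x_i, y_j)$ is the diagonal value of a cell box, $F(i-1,j)$ + gap penalty is the right value of a cell box, $F(i,j-1)$ + gap penalty is the left value of a cell box, and $s(x_i, y_j)$ is the score for aligning the sequence characters $x_i$ and $y_j$.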

Definition 8. Define and .

Proposition 2. Filled-in cell boxes for are analogous to filled-in cell boxes for for equal length comparison of sequences.

Proof. By Definition 7, we have since . Again, by Definitions 5 and 6,also,Using Definitions 5 and 6 and applying induction, and , thus is true. Assume is always true, thenRecall that the gap is constant, so we write , thus which completes the first part of our proof.
Again, from , , and by induction, let , thenAssume , thenwhich completes the second part of our proof.
Recall,It follows that is obvious. Thus, and . We are left to show that .
For , since and match. For , and mismatch and and also mismatch. Thus,Hence, since .

Corollary 3. The right value of filled-in cell boxes for becomes the left value of filled-in cell boxes for and vice versa.

Proof. From Proposition 2 and Definitions 5–7, it is clear that

Corollary 4. The pivot values, and , are the same for comparison of equal length of sequences.

Proof. By Definition 7,and by Proposition 2, we can write that . Hence, it is proved.

Proposition 5. For each three pointed arrows intersecting any four cell boxes, the left value of a 2nd cell box corresponds to the right value of a 3rd cell box.

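Proof. Let $F(i-1,j-1)$, $F(i-1,j)$, $F(i,j-1)$, and $F(i,j)$ be any four cell boxes; then we are to show that the left value of $F(i-1,j)$ equals the right value of $F(i,j-1)$. By Definition 7, the left value of $F(i-1,j)$ is $F(i-1,j-1)$ + gap penalty, and the right value of $F(i,j-1)$ is $F(i-1,j-1)$ + gap penalty. Since the linear gap penalty is equal in both cases and it is obvious that $F(i-1,j-1)$ is the same in both cases, the two values coincide.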
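The statement of Proposition 5 can also be checked numerically on every block of four neighbouring cell boxes. In this sketch the "left value" is taken to be the candidate computed from the neighbour on the left and the "right value" the candidate computed from the neighbour above; this reading, the (row, column) indexing, and the names `left_right_values` and `proposition5_holds` are assumptions.

```python
def left_right_values(x, y, match=5, mismatch=-1, gap=-2):
    """Return dictionaries of the two gap candidates stored in every cell box."""
    rows, cols = len(y) + 1, len(x) + 1
    F = [[(r + c) * gap if r == 0 or c == 0 else 0 for c in range(cols)]
         for r in range(rows)]
    left, right = {}, {}
    for r in range(1, rows):
        for c in range(1, cols):
            s = match if x[c - 1] == y[r - 1] else mismatch
            left[(r, c)] = F[r][c - 1] + gap     # from the neighbour on the left
            right[(r, c)] = F[r - 1][c] + gap    # from the neighbour above
            F[r][c] = max(F[r - 1][c - 1] + s, left[(r, c)], right[(r, c)])
    return left, right


def proposition5_holds(x, y):
    """Compare the stored candidates of the two off-diagonal boxes of each 2x2 block."""
    left, right = left_right_values(x, y)
    return all(
        left[(r, c + 1)] == right[(r + 1, c)]
        for r in range(1, len(y))
        for c in range(1, len(x))
    )


print(proposition5_holds("CTTGA", "CTAGA"))  # True
print(proposition5_holds("AGCTG", "TCAG"))   # True
```

The check passes for sequences of equal and unequal length alike, since both candidates are derived from the same pivot plus the same gap constant.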

Corollary 6. For any two adjacent row cell boxes, the right value of a preceding cell box is less than the diagonal value of the next cell box.

Proof. Let and be any two adjacent cell boxes. Then, by Proposition 5, we writeWe are left to show that .

Remark 5. By Altschul’s theory, the scores for a match and a mismatch are chosen to be greater than the gap penalty. We recall the scoring scheme: the constant in the linear gap penalty is “−2”, and the diagonal score is “−1” when there is a mismatch and “+5” when there is a match.
For any count $k \geq 1$, the linear gap penalty $-2k$ decreases, and thus the gap penalty can never be greater than or equal to any of the diagonal scores allocated for match and mismatch. Thus, $-2k \leq -2 < -1 < 5$. Hence, the right value of the preceding cell box is less than the diagonal value of the next cell box.
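The inequality of Corollary 6 can be tested directly on a filled score matrix. As a sketch under assumptions: the "right value" of a cell box is read as the candidate computed from the cell above, the "next" cell box is its neighbour in the same row, and the function name `corollary6_holds` is hypothetical.

```python
def corollary6_holds(x, y, match=5, mismatch=-1, gap=-2):
    """Check right value of cell (r, c) < diagonal value of cell (r, c + 1)."""
    rows, cols = len(y) + 1, len(x) + 1
    F = [[(r + c) * gap if r == 0 or c == 0 else 0 for c in range(cols)]
         for r in range(rows)]
    for r in range(1, rows):
        for c in range(1, cols):
            s = match if x[c - 1] == y[r - 1] else mismatch
            F[r][c] = max(F[r - 1][c - 1] + s, F[r - 1][c] + gap, F[r][c - 1] + gap)

    for r in range(1, rows):
        for c in range(1, cols - 1):
            right_value = F[r - 1][c] + gap
            # The next cell box (r, c + 1) pairs x[c] with y[r - 1].
            s_next = match if x[c] == y[r - 1] else mismatch
            diagonal_next = F[r - 1][c] + s_next
            if not right_value < diagonal_next:
                return False
    return True


print(corollary6_holds("CTTGA", "CTAGA"))          # True
print(corollary6_holds("AGCTG", "TCAG"))           # True
print(corollary6_holds("CTTGA", "CTAGA", gap=-1))  # False
```

The last call shows that the strict inequality depends on the scoring scheme: once the gap constant equals the mismatch score, the two values tie wherever the next pair of characters mismatches.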

4.3. Proposition and Proof for Unequal Sequence Length

The following proposition holds a priori from the pattern results of unequal character length of sequences. We state and prove the following general results by adhering to the same definitions stated earlier under the equal character length of sequences.

Proposition 7. For each three pointed arrows intersecting any four cell boxes, the left value of a 2nd cell box corresponds to the right value of a 3rd cell box.

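Proof. Let $F(i-1,j-1)$, $F(i-1,j)$, $F(i,j-1)$, and $F(i,j)$ be any four cell boxes; then we are to show that the left value of $F(i-1,j)$ equals the right value of $F(i,j-1)$. Refer to the proof of Proposition 5. Despite the unequal lengths of the two sequences, the disparity in length has no bearing on the proof.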

Corollary 8. For any two adjacent row cell boxes, the right value of a preceding cell box is less than the diagonal value of the next cell box.

Proof. Refer to the proof of Corollary 6. Again, despite the unequal lengths of the two sequences, the disparity in length has no bearing on the proof.

5. Conclusion

In this paper, the score matrix of the Needleman–Wunsch algorithm was exploited for possible patterns. Given any two arbitrary sequences of equal or unequal length, a general pattern was formulated as new a priori propositions and corollaries. These new formulated propositions and corollaries are justified with their corresponding proofs.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.