Mathematical Problems in Engineering

Research Article | Open Access

Volume 2012 | Article ID 490649 | 9 pages | https://doi.org/10.1155/2012/490649

An Improved Scoring Matrix for Multiple Sequence Alignment

Academic Editor: Gerhard-Wilhelm Weber
Received 16 May 2011
Revised 05 Oct 2011
Accepted 05 Oct 2011
Published 11 Jan 2012

Abstract

Multiple sequence alignment is commonly performed by maximizing the information content computed from a weight matrix, but two or more alignments may share the same highest score, which leads to ambiguity in selecting the best alignment. This paper addresses this issue by introducing the concept of a joint weight matrix to eliminate the randomness in selecting the best multiple sequence alignment. Alignments with equal scores are iteratively rescored with joint weight matrices of increasing level (nucleotide pairs, triplets, and so on) until a single best alignment is found. This method for resolving ambiguity in multiple sequence alignment can be easily implemented by use of the improved scoring matrix.

1. Introduction

In the search for DNA regulatory elements such as binding sites, promoters, donor sites, TATA boxes, and genes, the multiple sequences containing these elements have to be aligned against each other. These elements are highly but not absolutely conserved, and a weight matrix is used to represent and score the multiple sequences [1]. However, the current motif discovery algorithms that use the weight matrix technique to score multiple sequence alignments in terms of information content have their limitations [2]. In an analysis of these algorithms, the highest performance coefficient at the binding-site level of search was only 30.2%, obtained with Motif Sampler [3], an algorithm modified from the widely adopted Gibbs sampling method [4]. This may be a result of randomness in selecting the best alignment in cases where there are multiple peaks. Hence, there is room for improvement, as is evident from the many different approaches that have been developed [5–9].

In this paper, a method of removing the randomness in selection is proposed. Randomness in selection occurs when more than one alignment has the highest information content [10]. If one peak is selected at random, the accuracy of multiple sequence alignment is compromised. This may be the reason that methods based on applied information theory cannot achieve much higher sensitivity, specificity, and performance. For example, when one of two peaks of similar information content is selected at random, there is a 50% chance of selecting the wrong peak.

To overcome this problem, a simple method based on the joint weight matrix (JWM) is proposed to eliminate the randomness of peak selection and to provide the best alignment. The method is flexible in that a higher-level JWM can be used for cases with multiple peaks. The higher the level of JWM used, the fewer peaks remain, until eventually a single peak is obtained. In this paper, JWM is shown to reduce the number of peaks in multiple sequence alignment.

2. Systems and Methods

The concept of JWM is presented here to demonstrate how two or more ambiguous selections can be reduced. Two sequences are used in this example. The longer one represents the DNA sequence and the shorter one represents a motif sequence, which is aligned to the former. The motif is assumed to be a perfect weight matrix with 100% base weightage at each position. The score is then either 1 for match or 0 for mismatch at each position for simplicity of demonstration.

Since the sequence is 7 bp (base pairs) long and the motif is 4 bp, the total number of possible shift positions without introducing gaps is 7 − 4 + 1 = 4 (Figure 1). Table 1 shows the sequence alignment.


Table 1

Position  1  2  3  4  5  6  7
DNA       A  T  T  G  T  T  C
Motif     T  T  A  G
Score     2  2  1  1

The scores of the four possible alignments leave an ambiguous choice between positions 1 and 2, which share the highest score of 2. Since there is more than one peak, the second-level JWM is used to rescore the alignments. Table 2 shows the result of the second-level comparison.


Table 2

Position  1  2  3  4  5  6  7
DNA       A  T  T  G  T  T  C
Motif     T  T  A  G
Score     0  1  0  0

The result clearly shows that, of positions 1 and 2, position 2 is the better match of the motif to the DNA, with a second-level score of 1 compared with 0 for position 1.
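The toy example of Tables 1 and 2 can be reproduced with a short sketch. The function name `kmer_match_scores` and its `level` parameter are illustrative and not from the paper; the sketch assumes the perfect 0/1 motif described above, so that a level-1 score counts matching bases and a level-2 score counts matching adjacent base pairs.

```python
def kmer_match_scores(dna, motif, level=1):
    """Score every ungapped placement of `motif` on `dna`.

    For each shift position, count how many runs of `level` consecutive
    bases in the motif match the DNA exactly (level 1 = single bases,
    level 2 = adjacent pairs, and so on).
    """
    n_shifts = len(dna) - len(motif) + 1
    scores = []
    for start in range(n_shifts):
        window = dna[start:start + len(motif)]
        matches = sum(
            window[j:j + level] == motif[j:j + level]
            for j in range(len(motif) - level + 1)
        )
        scores.append(matches)
    return scores


level1 = kmer_match_scores("ATTGTTC", "TTAG", level=1)
level2 = kmer_match_scores("ATTGTTC", "TTAG", level=2)
print(level1)  # scores for shift positions 1 to 4, as in Table 1
print(level2)  # scores for shift positions 1 to 4, as in Table 2
```

The level-1 scores tie at positions 1 and 2, while the level-2 scores single out position 2, which is the behaviour the tables illustrate.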

3. Algorithm

This section shows how JWM can be integrated with a sequence alignment tool to remove the randomness of selection during the alignment process. The additional steps introduced by JWM are as follows.

Step 1. Determine a weight matrix
$$M(b,i) = \frac{n(b,i)}{\sum_{b \in \{A,T,G,C\}} n(b,i)}, \tag{3.1}$$
where $n(b,i)$ is the number of occurrences of each base $b \in \{A,T,G,C\}$ at each position $i$.

Step 2. Calculate the second-level JWM
$$M_2(b_1 b_2, i) = M(b_1, i)\, M(b_2, i+1). \tag{3.2}$$
For the second-level JWM, the number of possible combinations of the four bases is $4^2 = 16$. Hence, the JWM is a matrix of size 16 by the window length.

Step 3. From the weight matrix, the uncertainty of each combination of bases is
$$H_s(i) = -\sum_{j=1}^{m} \sum_{b_j \in \{A,T,G,C\}} M_m(b_1 b_2 \cdots b_m, i) \log_2 M_m(b_1 b_2 \cdots b_m, i). \tag{3.3}$$
For the second-level JWM, $m = 2$.

Step 4. The information content for each base is then
$$R(i) = 2m - H_s(i) - e[n(i)], \tag{3.4}$$
where $e[n(i)]$ is a small sample correction for $H_s(i)$ [11].

Step 5. The score for one shifting position is then
$$R_{\mathrm{shift}}(\mathrm{sp}) = \sum_{i} R(i). \tag{3.5}$$
The shift position (sp) ranges from the negative to the positive shifting parameter.

Step 6. Shift the JWM over the predetermined range to obtain the alignment score plot of information content versus shifting position. From this plot, the highest peak is chosen from among the ambiguous peaks found at the previous level.

Step 7. If there is still ambiguity after using the second-level JWM, a higher-level JWM (three or higher) should be calculated:
$$M_m(b_1 b_2 \cdots b_m, i) = M(b_1, i)\, M(b_2, i+1) \cdots M(b_m, i+m-1). \tag{3.6}$$
Repeat Steps 3 to 6 with the higher-level JWM in (3.6) whenever a lower-level JWM leaves the peak selection ambiguous.
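Steps 1 to 5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the sequences are assumed to be equal-length strings over A, T, G, and C, the term 0·log2(0) is taken as 0, and the small sample correction e[n(i)] is left as a caller-supplied number (default 0) rather than computed.

```python
import math
from itertools import product

BASES = "ATGC"


def weight_matrix(seqs):
    """Eq. (3.1): per-position base frequencies of aligned sequences."""
    matrix = []
    for i in range(len(seqs[0])):
        column = [s[i] for s in seqs]
        matrix.append({b: column.count(b) / len(column) for b in BASES})
    return matrix


def joint_weight_matrix(matrix, m):
    """Eqs. (3.2) and (3.6): product of m consecutive single-base weights."""
    jwm = []
    for i in range(len(matrix) - m + 1):
        jwm.append({
            combo: math.prod(matrix[i + k][b] for k, b in enumerate(combo))
            for combo in product(BASES, repeat=m)
        })
    return jwm


def information(column, m, correction=0.0):
    """Eqs. (3.3) and (3.4): R(i) = 2m - H_s(i) - e[n(i)]."""
    # Zero-probability combinations contribute nothing to the uncertainty.
    uncertainty = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2 * m - uncertainty - correction


def shift_score(seqs, m, correction=0.0):
    """Eq. (3.5): sum of R(i) over all positions of the level-m JWM."""
    jwm = joint_weight_matrix(weight_matrix(seqs), m)
    return sum(information(col, m, correction) for col in jwm)


# Perfectly conserved toy alignment (assumed data, not from the paper).
aligned = ["ACGT", "ACGT", "ACGT"]
print(shift_score(aligned, m=1), shift_score(aligned, m=2))
```

In an alignment tool, `shift_score` would be evaluated once per shift position to build the score plot of Step 6, with `m` increased as in Step 7 when ties remain.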

4. Implementation

An example of how JWM is used to eliminate or reduce ambiguity is shown using data from 16 randomly generated sequences of 15 bp (Tables 3 and 4) that bind to OxyR [12]. For illustration, the centre 9th base is taken to be the start site of transcription and is labeled as position 0. The alignment score is obtained by using a window of 5 bases from −1 to +3, with the range of shifting positions set from −8 to +6 with respect to the start site. The sequences are shifted one base at a time, and the new alignment score is recalculated based on the simplified sequence logo [13] in Figure 2(a).


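Table 3

No.    −8   −7   −6   −5   −4   −3   −2   −1    0   +1   +2   +3   +4   +5   +6
1       T    C    A    C    A    C    C    G    A    C    T    T    G    T    G
2       A    C    T    T    A    T    C    G    A    T    C    C    G    C    A
3       C    A    T    T    A    A    C    A    A    T    A    G    G    G    C
4       T    A    C    G    A    T    A    A    T    A    G    G    C    A    A
5       C    G    T    A    C    A    T    T    A    T    C    C    A    T    A
6       C    T    A    T    T    A    T    T    G    T    A    A    C    A    G
7       A    C    T    T    T    C    C    C    A    G    A    G    T    T    C
8       C    A    G    A    G    A    T    C    G    C    T    C    T    A    A
9       A    C    T    A    A    A    C    T    T    C    T    G    A    T    A
10      A    G    T    T    A    T    C    G    G    T    A    T    A    A    T
11      A    C    G    A    T    G    G    A    A    T    C    C    A    T    A
12      C    A    G    A    G    A    T    C    G    C    T    C    T    A    A
13      A    T    C    A    C    T    G    A    C    T    A    C    A    A    T
14      A    T    T    A    G    C    G    A    T    T    A    C    C    G    T
15      A    T    T    A    C    C    T    A    T    C    G    C    T    G    C
16      C    T    A    T    T    A    T    T    G    T    A    A    C    A    G

A       8    4    3    8    6    7    1    6    6    1    7    2    5    7    7
C       6    5    2    1    3    4    6    3    1    5    3    8    4    1    3
T       2    5    8    6    4    4    6    4    4    9    4    2    4    5    3
G       0    2    3    1    3    1    3    3    5    1    2    4    3    3    3

%A   0.50 0.25 0.19 0.50 0.38 0.44 0.06 0.38 0.38 0.06 0.44 0.13 0.31 0.44 0.44
%C   0.38 0.31 0.13 0.06 0.19 0.25 0.38 0.19 0.06 0.31 0.19 0.50 0.25 0.06 0.19
%T   0.13 0.31 0.50 0.38 0.25 0.25 0.38 0.25 0.25 0.56 0.25 0.13 0.25 0.31 0.19
%G   0.00 0.13 0.19 0.06 0.19 0.06 0.19 0.19 0.31 0.06 0.13 0.25 0.19 0.19 0.19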


Table 4

Pair      −7      −5      −3      −1       1       3
AA    0.0469  0.2109  0.0469  0.0977  0.0273  0.0469
AC    0.0469  0.1055  0.1172  0.0391  0.0117  0.0313
AT    0.1094  0.1406  0.1406  0.0781  0.0156  0.0313
AG    0.0469  0.1055  0.0703  0.0977  0.0078  0.0156
CA    0.0586  0.0234  0.0313  0.0781  0.1641  0.2109
CC    0.0586  0.0117  0.0781  0.0313  0.0703  0.1406
CT    0.1367  0.0156  0.0938  0.0625  0.0938  0.1406
CG    0.0586  0.0117  0.0469  0.0781  0.0469  0.0703
TA    0.0586  0.1172  0.0391  0.0781  0.2188  0.0469
TC    0.0586  0.0586  0.0977  0.0313  0.0938  0.0313
TT    0.1367  0.0781  0.1172  0.0625  0.1250  0.0313
TG    0.0586  0.0586  0.0586  0.0781  0.0625  0.0156
GA    0.0234  0.0234  0.0078  0.0586  0.0273  0.0703
GC    0.0234  0.0117  0.0195  0.0234  0.0117  0.0469
GT    0.0547  0.0156  0.0234  0.0469  0.0156  0.0469
GG    0.0234  0.0117  0.0117  0.0586  0.0078  0.0234

The window and shifting parameters are chosen to illustrate how an ambiguous choice between two or more peaks is resolved. When one of the sequences is shifted from −8 to +6, the alignment score based on the window from −1 to +3 shows two peaks, at shift positions −5 and 0, in Figure 2(b). From the simplified sequence logo, the information content prior to shifting any sequence is
$$R_{\mathrm{shift}}(0) = 0.0637 + 0.1950 + 0.5087 + 0.1504 + 0.2500 = 1.1678\ \text{bits}. \tag{4.1}$$
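The value in (4.1) can be checked against the 16 sequences of Table 3 with the sketch below. It assumes the "simplified sequence logo" uses 2 − H per position with no small sample correction, an assumption that is consistent with the five terms listed in (4.1); the helper name `column_information` is illustrative.

```python
import math

# The 16 sequences of Table 3, left to right from position -8 to +6.
SEQUENCES = [
    "TCACACCGACTTGTG", "ACTTATCGATCCGCA", "CATTAACAATAGGGC", "TACGATAATAGGCAA",
    "CGTACATTATCCATA", "CTATTATTGTAACAG", "ACTTTCCCAGAGTTC", "CAGAGATCGCTCTAA",
    "ACTAAACTTCTGATA", "AGTTATCGGTATAAT", "ACGATGGAATCCATA", "CAGAGATCGCTCTAA",
    "ACTGACTACAAT".join(["ATC", ""])[:3] + "ACTGACTACAAT",  # placeholder replaced below
    "ATTAGCGATTACCGT", "ATTACCTATCGCTGC", "CTATTATTGTAACAG",
]
SEQUENCES[12] = "ATCACTGACTACAAT"  # sequence 13 of Table 3
FIRST_POSITION = -8  # label of the leftmost column


def column_information(seqs, position):
    """Return 2 - H (bits) for one column, assuming no sample correction."""
    idx = position - FIRST_POSITION
    column = [s[idx] for s in seqs]
    freqs = [column.count(b) / len(column) for b in "ATGC"]
    # Bases absent from the column contribute nothing to the uncertainty.
    return 2 + sum(p * math.log2(p) for p in freqs if p > 0)


window = range(-1, 4)  # window positions -1 to +3
r_values = [column_information(SEQUENCES, p) for p in window]
r_shift_0 = sum(r_values)
print([round(r, 4) for r in r_values], round(r_shift_0, 4))
```

The per-position values and their sum should agree with (4.1) to about four decimal places.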

One of the sequences is randomly selected and shifted about its position. The weight matrix is calculated for each new position, and a set of $R_{\mathrm{shift}}(\mathrm{sp})$ values is obtained by the end of the shift. The resulting $R_{\mathrm{shift}}$ for the 16 sequences is plotted against the amount of shift in Figure 2(b), where two peaks are located at shift positions −5 and 0. The situation is ambiguous, and a higher-level search using JWM is required. The weight matrix is replaced by the second-level JWM in the new search. The new $R_{\mathrm{shift}}$ plot based on the higher-level JWM is shown in Figure 2(d).

The new alignment score using JWM shows clearly that shift position 0 has higher information content than shift position −5. Hence, the best alignment is the original position 0. Instead of randomly selecting one of the peaks, it is rational to select the peak with higher information content.
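The overall selection rule (rescore only the tied peaks at the next JWM level until one remains) can be sketched as below. The function `resolve_peaks`, its `tol` tolerance for treating floating-point scores as tied, and the callback `score_at_level` are all assumptions made for illustration; the demonstration data are the scores of the toy example in Tables 1 and 2, not the OxyR data.

```python
def resolve_peaks(candidates, score_at_level, max_level, tol=1e-9):
    """Keep only the top-scoring candidates, raising the JWM level on ties.

    `score_at_level(candidate, level)` returns the alignment score of a
    candidate shift position at the given JWM level. Candidates whose score
    is within `tol` of the best are treated as tied and rescored at the
    next level. Returns the list of candidates still tied at the end.
    """
    remaining = list(candidates)
    for level in range(1, max_level + 1):
        scores = {c: score_at_level(c, level) for c in remaining}
        best = max(scores.values())
        remaining = [c for c in remaining if best - scores[c] <= tol]
        if len(remaining) == 1:
            break
    return remaining


# Scores per level and shift position, taken from Tables 1 and 2.
TOY_SCORES = {
    1: {1: 2, 2: 2, 3: 1, 4: 1},
    2: {1: 0, 2: 1, 3: 0, 4: 0},
}
winners = resolve_peaks(
    [1, 2, 3, 4],
    lambda position, level: TOY_SCORES[level][position],
    max_level=2,
)
print(winners)
```

If more than one candidate is returned, the ambiguity persists at `max_level` and a higher level would be needed.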

5. Discussion and Conclusions

In the selection of the best multiple sequence alignment using the conventional weight matrix, it is assumed that the probability of each base is independent of its neighboring base. The output of a multiple sequence alignment program is not always the same. This can be attributed to several factors, an important one being the conventional scoring matrix. The best alignment at each stage is decided by the highest score under the conventional scoring matrix. However, there are cases in which more than one alignment attains that score. This creates an ambiguity in selecting the best alignment. A random choice can be made, but it may result in a less than optimal alignment.

The following shows examples of ambiguities found using the conventional scoring matrix. The benchmark database (Table 5) consists of DNA sequences containing amelogenin protein, taken from a study of its origin and evolutionary path [14]. Cases of ambiguity during multiple sequence alignment using the conventional scoring matrix are shown in Figure 3. For example, for sequence 2 (DMSPARC), an ambiguity is found between positions 0 and 3 of the window when the window is placed at the 18th base from the start (the first base on the left).


Table 5

No.  Name        Sequence                                                       Window position  Ambiguity positions
2    DMSPARC     ATGCGCTCCCTTTGGCTGCTGCTCGGCTTGGGCCTGCTGGCTGTGAGCCACGTCCAGGCCT  18               0, 3
4    RATSC1      ATGAAGGCCGTGCTTCTCCTCCTGTATGCCTTGGGGATCGCTGCTGCAGTCCCG         18               0, 3
5    MOUSESC1    ATGAAGGCTGTGCTTCTCCTCCTGTGCGCCTTGGGAACCGCTGTGGCAATCCCG         18               0, 3
6    HUMANHEVIN  ATGAAGACTGGGCCTTTTTTCCTATGTCTCTTGGGAACTGCAGCTGCAATCCCG         11               −5, −4
7    BOVINAMEX   ATGGGGACCTGGATTTTGTTTGCCTGCCTCCTGGGAGCAGCCTTCTCTATGCCT         20               −4, 0
10   XENOPUS2AM  ATGAGGCCATTGGTAATGCTAACAGCTCTCATTGGAGCAGCCTTTTCTCTTCCT         7                −1, 0
12   MOUSEAMEX   ATGGGGACCTGGATTTTGTTTGCCTGCCTCCTGGGAGCAGCTTTTGCTATGCCC         20               −4, 0
13   RATAMEX     ATGGGGACCTGGATTTTGTTTGCCTGCCTCCTGGGAGCAGCTTTTGCTATGCCC         20               −4, 0
16   HUMANAMEY   ATGGGGACCTGGATTTTGTTTGCCTGCCTTGTGGGAGCAGCTTTTGCCATGCCT         20               −4, 0
17   HUMAMAMEX   ATGGGGACCTGGATTTTATTTGCCTGCCTCCTGGGAGCAGCTTTTGCCATGCCT         20               −4, 0
18   CAVIAAMEX   ATGGGAACCTGGATTTTGTTTGCCTGCCTCTTGGGAACAGCCTTTGCTATGCCT         20               −4, 0
23   CHICKENSPA  ATGAGAACCTGGATTTTCTTCTTCCTCTGCCTGGCAGGCAAAGCCCTGGCAGCTCCG      16               −3, 0
24   QUAILSPARC  ATGAGAGCCTGGATTTTCTTCCTCCTCTGCCTGGCAGGCAAAGCCCTGGCAGCCCCG      19               −3, 0
25   ZEBRAFSPAR  ATGAGGGTTTGGATCTTCTTCCTGTTCTGCCTCGCTGGCAAGACTCTGGCAGCTCCA      16               0, 3
26   TROUTSPARC  ATGAGGGTGTGGATTGTCTTCCTCCTGTGCCTAGCTGGTCAGGCATTCACCGCTTCC      7                −4, 5
27   XENOPUSSPA  ATGAGGGTCTGGGTCTTCTTCGTCTTGTGCCTGGCTGGCAAAGCACTAGCTGCCCCT      16               0, 3

The examples above show that ambiguities are frequent enough to be of concern during multiple sequence alignment, where they may result in a suboptimal alignment. This problem can be overcome by using the proposed joint weight matrix for scoring. The proposed scoring matrix allows a closer look at each alignment by considering two or more bases for each scoring element. When two bases are compared at a time, the probability of the next base is affected by what appears before it. There are 16 probabilities for a pair of bases, compared with just 4 if only one base is considered. This increases the depth of search and reduces the number of peaks. Section 4 shows how the second-level JWM can identify the highest peak when a conventional weight matrix cannot. This reduces the error that may occur when “conflicts are resolved” by making a “pseudorandom choice” [10].

A higher-level JWM can be used depending on the level of accuracy required. For example, the second-level JWM may reduce the number of peaks from 5 to 3. The randomness is reduced when the best peak is chosen from 3 rather than 5 possible sites. However, if the application requires a more accurate match, a higher-level JWM may be needed. A higher-level JWM can filter out further peaks until only one obvious choice is left. Although a higher-level JWM may require more computation time and an additional scan, this may be compensated for by faster convergence, since a better alignment is selected early in the iterations. This is especially true in cases where a large number of iterations is required before satisfactory convergence is reached [15]. JWM can be used to improve bioinformatics applications that use the conventional weight matrix. Besides aligning DNA sequences, JWM can also be implemented in protein sequence alignment.

References

  1. G. D. Stormo and G. W. Hartzell, “Identifying protein-binding sites from unaligned DNA fragments,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 4, pp. 1183–1187, 1989.
  2. J. Hu, B. Li, and D. Kihara, “Limitations and potentials of current motif discovery algorithms,” Nucleic Acids Research, vol. 33, no. 15, pp. 4899–4913, 2005.
  3. G. Thijs, K. Marchal, M. Lescot et al., “A Gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes,” Journal of Computational Biology, vol. 9, no. 2, pp. 447–464, 2002.
  4. C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton, “Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment,” Science, vol. 262, no. 5131, pp. 208–214, 1993.
  5. Y. Liu, X. S. Liu, L. Wei, R. B. Altman, and S. Batzoglou, “Eukaryotic regulatory element conservation analysis and identification using comparative genomics,” Genome Research, vol. 14, no. 3, pp. 451–458, 2004.
  6. J. J. Shu and L. S. Ouw, “Pairwise alignment of the DNA sequence using hypercomplex number representation,” Bulletin of Mathematical Biology, vol. 66, no. 5, pp. 1423–1438, 2004.
  7. A. V. Favorov, M. S. Gelfand, A. V. Gerasimova, D. A. Ravcheev, A. A. Mironov, and V. J. Makeev, “A Gibbs sampler for identification of symmetrically structured, spaced DNA motifs with improved estimation of the signal length,” Bioinformatics, vol. 21, no. 10, pp. 2240–2245, 2005.
  8. L. Kuo and T. Y. Yang, “An improved collapsed Gibbs sampler for Dirichlet process mixing models,” Computational Statistics & Data Analysis, vol. 50, no. 3, pp. 659–674, 2006.
  9. J. J. Shu and Y. Li, “Hypercomplex cross-correlation of DNA sequences,” Journal of Biological Systems, vol. 18, no. 4, pp. 711–725, 2010.
  10. T. D. Schneider and D. N. Mastronarde, “Fast multiple alignment of ungapped DNA sequences using information theory and a relaxation method,” Discrete Applied Mathematics, vol. 71, no. 1–3, pp. 259–268, 1996.
  11. T. D. Schneider, G. D. Stormo, L. Gold, and A. Ehrenfeucht, “Information content of binding sites on nucleotide sequences,” Journal of Molecular Biology, vol. 188, no. 3, pp. 415–431, 1986.
  12. M. B. Toledano, I. Kullik, F. Trinh, P. T. Baird, T. D. Schneider, and G. Storz, “Redox-dependent shift of OxyR-DNA contacts along an extended DNA-binding site: a mechanism for differential promoter selection,” Cell, vol. 78, no. 5, pp. 897–909, 1994.
  13. T. D. Schneider and R. M. Stephens, “Sequence logos: a new way to display consensus sequences,” Nucleic Acids Research, vol. 18, no. 20, pp. 6097–6100, 1990.
  14. S. Delgado, D. Casane, L. Bonnaud, M. Laurin, J. Y. Sire, and M. Girondot, “Molecular evidence for Precambrian origin of amelogenin, the major protein of vertebrate enamel,” Molecular Biology and Evolution, vol. 18, no. 12, pp. 2146–2153, 2001.
  15. J.-J. Shu, Q.-W. Wang, and K.-Y. Yong, “DNA-based computing of strategic assignment problems,” Physical Review Letters, vol. 106, no. 18, Article ID 188702, 2011.

Copyright © 2012 Jian-Jun Shu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

