Research Article  Open Access
Mauricio Arriagada, Aleksandar Poleksic, "On the Difference in Quality between Current Heuristic and Optimal Solutions to the Protein Structure Alignment Problem", BioMed Research International, vol. 2013, Article ID 459248, 8 pages, 2013. https://doi.org/10.1155/2013/459248
On the Difference in Quality between Current Heuristic and Optimal Solutions to the Protein Structure Alignment Problem
Abstract
The importance of pairwise protein structural comparison in biomedical research is fueling the search for algorithms capable of finding a more accurate structural match of two input proteins in a timely manner. In recent years, we have witnessed rapid advances in the development of methods for approximate and optimal solutions to the protein structure matching problem. Albeit slow, these methods can be extremely useful in assessing the accuracy of more efficient, heuristic algorithms. We utilize a recently developed approximation algorithm for protein structure matching to demonstrate that a deep search of the protein superposition space leads to increased alignment accuracy with respect to many well-established measures of alignment quality. The results of our study suggest that a large and important part of the protein superposition space remains unexplored by current techniques for protein structure alignment.
1. Introduction
Pairwise protein structure alignment is one of the most important problems in computational molecular biology. At the same time, protein structure alignment is a very difficult problem, because there are infinitely many ways to position a pair of proteins in three-dimensional space. Because of the enormous size of the search space, research into protein structure alignment has traditionally focused on developing methods with better objective functions that explore a relatively small but representative set of the proteins’ spatial superpositions.
In this paper, we take a different approach and study the benefits of searching the proteins’ superpositions in a more detailed manner. We demonstrate a significant increase in the alignment accuracy of several well-known distance-based alignment methods, obtained by utilizing the superpositions that rigorously optimize a very simple and intuitive alignment metric, defined as the largest number of residues from the input proteins that can be fit under a predefined distance cutoff.
The size of the gap between the accuracy of current heuristic solutions and optimal solutions, observed in this study, suggests that the protein structure alignment problem will likely remain a hot topic in years to come.
2. Materials and Methods
Our study is carried out using two protein structure alignment benchmarks: Sisyphus and FSSP. In both benchmarks, an in-house algorithm, MaxPairs [1], is applied to compute the superpositions that closely approximate the optimum of a simple measure, defined as the largest number of pairs of residues from the input proteins that can be fit under $d$ Ångströms. The MaxPairs algorithm is based on the approximation algorithm EPSILON-OPTIMAL [1], which is capable of finding a superposition of the input proteins that fits at least as many pairs of residues under the distance $d + \varepsilon$ as an optimal superposition fits under the distance $d$, for any accuracy threshold $\varepsilon > 0$. As an approximation algorithm, EPSILON-OPTIMAL suffers from high computational complexity. The algorithm’s run time is a high-degree polynomial in the lengths of the structures being compared. To circumvent the high computational cost, the present study utilizes MaxPairs, a heuristic version of EPSILON-OPTIMAL that searches through a relatively small subset of the space of all superpositions of the input proteins inspected by EPSILON-OPTIMAL. Although still not practical for routine use, MaxPairs achieves, as demonstrated in [1], accuracy superior to that of some widely used alignment programs and, as such, is an indispensable tool for assessing the precision of more efficient and more popular algorithms. In the present study, we set the distance cutoff to $d = 3$ Å and use the smallest accuracy threshold $\varepsilon$ that our computing infrastructure allows; going below that value proves to be computationally prohibitive.
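To make the measure concrete, the following minimal Python sketch counts, for one fixed superposition, the largest number of residue pairs that lie within the cutoff. It assumes that the alignment must be sequential (order preserving) and one-to-one, which is an assumption of this sketch rather than a statement about the MaxPairs implementation; the function name and the coordinate format are hypothetical. The hard part of the problem, the search over superpositions, is not shown.

```python
from math import dist

def max_pairs_under_cutoff(a, b, d=3.0):
    """Largest number of order-preserving residue pairs (a[i], b[j])
    whose distance is at most d, for coordinates that are already superimposed."""
    m, n = len(a), len(b)
    # best[i][j]: maximum number of pairs using the prefixes a[:i] and b[:j]
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            close = 1 if dist(a[i - 1], b[j - 1]) <= d else 0
            best[i][j] = max(best[i - 1][j - 1] + close,  # pair a[i-1] with b[j-1]
                             best[i - 1][j],              # leave a[i-1] unpaired
                             best[i][j - 1])              # leave b[j-1] unpaired
    return best[m][n]
```

An exact or approximate method would maximize this count over all rigid transformations of one protein, whereas a heuristic method evaluates it only on the few superpositions it happens to visit.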
We evaluated the performance of three well-known methods for protein structure comparison, STRUCTAL [2–4], TM-align [5], and LOCK2 [6, 7], before and after replacing their original superpositions with superpositions that optimize this measure.
It is important to emphasize that our experiment is not designed to compare these three methods head-to-head, but rather to assess the extent of improvements in the accuracy of each method that can be made by exploring the search space in a more thorough manner.
In choosing the methods for our study, we only considered the availability of software and the simplicity of implementing the alignment scoring functions (see the Results section). An overview of the three algorithms is given below.
STRUCTAL
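The STRUCTAL algorithm [2–4] employs iterative dynamic programming to balance the cRMS score with the lengths of aligned regions. In each iteration, the algorithm computes an optimal residue-residue correspondence (alignment) of the input proteins $a$ and $b$ and then finds a superposition that minimizes cRMS of the aligned subchains $a' = (a'_1, \dots, a'_N)$ and $b' = (b'_1, \dots, b'_N)$. The cRMS score is given by
$$\mathrm{cRMS}(a', b') = \sqrt{\frac{1}{N} \sum_{k=1}^{N} \lVert a'_k - T(b'_k) \rVert^2},$$
where $T$ is the rigid transformation applied to protein $b$ and $N$ is the number of aligned residue pairs.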
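The alignment step in STRUCTAL is carried out using a dynamic programming routine, which implements the following recurrence formula:
$$M_{i,j} = \max \{ M_{i-1,j-1} + S_{i,j},\; M_{i-1,j} - g,\; M_{i,j-1} - g \},$$
where
$$S_{i,j} = \frac{20}{1 + (d_{i,j}/2.24)^2},$$
$d_{i,j}$ is the distance (in Å) between the $i$th residue of $a$ and the $j$th residue of the transformed protein $b$, and $g = 10$ is the gap penalty.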
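The outputs of STRUCTAL are the subchains $a'$ of $a$ and $b'$ of $b$, along with the rigidly transformed protein $b$, denoted by $T(b)$, and a residue-residue correspondence that maximizes the STRUCTAL score
$$\mathrm{STRUCTAL}(a', b') = \sum_{k=1}^{N} \frac{20}{1 + (d_k/2.24)^2} - 10\, N_{\mathrm{gap}},$$
where $d_k$ is the distance between the $k$th pair of aligned residues and $N_{\mathrm{gap}}$ denotes the total number of gaps in the alignment. The STRUCTAL program used in our analysis was downloaded from http://csb.stanford.edu/levitt/Structal/.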
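The alignment step described above can be sketched in Python as follows. This is a minimal illustration, not the STRUCTAL source code: it assumes the commonly cited constants (a maximum pair score of 20, a distance scale of 2.24 Å, and a penalty of 10 per gapped position), it penalizes end gaps like any other gap, and the function names are hypothetical. The coordinates of the second protein are assumed to be already transformed.

```python
from math import dist

def structal_like_scores(a, b_moved, m_const=20.0, d0=2.24):
    """Pair score m_const / (1 + (d/d0)^2) for every residue pair (assumed constants)."""
    return [[m_const / (1.0 + (dist(p, q) / d0) ** 2) for q in b_moved] for p in a]

def align_linear_gaps(score, gap=10.0):
    """Global dynamic programming with a linear gap penalty.
    Returns the optimal total score and the list of aligned index pairs."""
    m, n = len(score), len(score[0])
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = -gap * i
    for j in range(1, n + 1):
        dp[0][j] = -gap * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + score[i - 1][j - 1],  # align i with j
                           dp[i - 1][j] - gap,                      # gap in b
                           dp[i][j - 1] - gap)                      # gap in a
    # Trace back from the corner to recover the aligned pairs.
    pairs, i, j = [], m, n
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dp[m][n], pairs
```

In an iterative scheme of the kind described in the text, the pairs returned here would be used to recompute a cRMS-minimizing superposition, and the two steps would be repeated until the alignment stops changing.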
TM-align
TM-align is another popular protein structure alignment program, widely used in many applications, in particular for assessing the quality of protein models generated by comparative modeling or ab initio techniques. The score matrix in TM-align is protein-length specific and is defined as
$$S_{i,j} = \frac{1}{1 + d_{i,j}^2 / d_0(L_{\min})^2},$$
where $d_0(L_{\min}) = 1.24 \sqrt[3]{L_{\min} - 15} - 1.8$, $d_{i,j}$ is the distance between the $i$th residue of one protein and the $j$th residue of the other, and $L_{\min}$ is the length of the shorter structure [5]. In contrast to the linear gap penalties employed by STRUCTAL, the gap penalties in TM-align are affine and are set to 0.6 for gap opening and 0.0 for gap extension [5]. An improved version of the algorithm, called Fr-TM-align, has been published [8]. The TM-align software used in this study was downloaded from http://zhanglab.ccmb.med.umich.edu/TMalign/.
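The length dependence of the TM-align score matrix can be illustrated with a short Python sketch. The formula for the distance scale follows the published TM-score definition; the floor of 0.5 Å applied to very short chains is an assumption of this sketch, and the function names are hypothetical.

```python
def tm_d0(l_min, floor=0.5):
    """Distance scale d0 (in Angstroms) for a shorter-chain length l_min.
    The floor for very short chains is an assumption of this sketch."""
    if l_min <= 15:
        return floor
    return max(floor, 1.24 * (l_min - 15) ** (1.0 / 3.0) - 1.8)

def tm_pair_score(d, l_min):
    """Entry of the score matrix for two residues at distance d."""
    d0 = tm_d0(l_min)
    return 1.0 / (1.0 + (d / d0) ** 2)
```

Because the distance scale grows with the chain length, the same residue-residue distance contributes more to the score for long proteins than for short ones, which is what makes the matrix protein-length specific.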
LOCK2
LOCK2 [6] is an improved version of the original LOCK program [7]. It incorporates secondary structure information into the alignment process. An initial superposition is obtained by comparing the vectors of secondary structure elements. An iterative procedure is then applied to minimize RMSD between aligned subchains of the input proteins, using the threshold distance of 3 Å for atomic superposition. Rigid body motions for RMSD minimization are realized using quaternion transformations [9, 10].
The alignment returned by LOCK2 is a sequence of pairs of points $(a_{i_1}, b_{j_1}), \dots, (a_{i_k}, b_{j_k})$, where $a_{i_t}$ and $b_{j_t}$ are each other’s nearest neighbors. More specifically, for every $t$, the point $b_{j_t}$ is the closest point in protein $b$ to the point $a_{i_t}$ and vice versa. The final alignment is generated through a two-step process. First, for every atom $a_i$ from protein $a$, the algorithm finds the nearest atom $b_j$ from protein $b$ that is at distance ≤3 Å from $a_i$. In the second step, the algorithm selects the maximum number of aligned pairs in sequential order, by removing pairs that violate colinearity.
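The two-step pairing described above can be sketched in Python as follows. This is an illustration of the idea rather than the LOCK2 implementation: it keeps mutual nearest neighbors within the cutoff and then extracts the longest subset of pairs whose indices increase in both proteins. The function names are hypothetical, and both coordinate lists are assumed to be non-empty and already superimposed.

```python
from math import dist

def nearest_within(p, pts, cutoff):
    """Index of the point in pts nearest to p, or None if it is farther than cutoff."""
    j = min(range(len(pts)), key=lambda k: dist(p, pts[k]))
    return j if dist(p, pts[j]) <= cutoff else None

def lock2_style_pairs(a, b, cutoff=3.0):
    """Mutual nearest-neighbor pairs within cutoff, filtered to sequential order."""
    # Step 1: candidate pairs that are each other's nearest neighbors.
    cand = []
    for i, p in enumerate(a):
        j = nearest_within(p, b, cutoff)
        if j is not None and nearest_within(b[j], a, cutoff) == i:
            cand.append((i, j))
    if not cand:
        return []
    # Step 2: longest subsequence of candidates with strictly increasing j
    # (the i indices are already increasing), found by quadratic DP.
    length = [1] * len(cand)
    prev = [-1] * len(cand)
    for k in range(len(cand)):
        for h in range(k):
            if cand[h][1] < cand[k][1] and length[h] + 1 > length[k]:
                length[k] = length[h] + 1
                prev[k] = h
    k = max(range(len(cand)), key=length.__getitem__)
    out = []
    while k != -1:
        out.append(cand[k])
        k = prev[k]
    return out[::-1]
```

The quadratic search in the second step keeps the sketch short; a production implementation would more likely use an O(k log k) longest-increasing-subsequence routine.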
The LOCK2 software can be downloaded from http://lock2.stanford.edu.
3. Results
3.1. Sisyphus Benchmark
The Sisyphus test [11] is frequently used to assess the accuracy of automated methods for protein structure comparison [1, 12]. This sophisticated benchmark utilizes 125 alignments of structurally related proteins, created by experts in the field of protein structure analysis. The reference alignments can be downloaded from http://sisyphus.mrccpe.cam.ac.uk.
In the present study, we (like Rocha et al. [12]) utilize only a subset of the Sisyphus test set, containing 106 alignments between single-chain proteins. The two-step process is illustrated in Figure 1. In the first step, STRUCTAL, TM-align, and LOCK2 are run with default parameters to generate the method-specific alignments between proteins from the Sisyphus set. These alignments are then compared to the reference (“gold-standard”) alignments to compute the percentage of correctly aligned residue pairs [1, 12].
In the second step, the MaxPairs algorithm is run to compute the set of (near-)optimal superpositions, namely, the superpositions that rigorously maximize the number of pairs of atoms that can be fit under 3 Å. We used our own implementations of the STRUCTAL, TM-align, and LOCK2 alignment procedures to compute an optimal residue-residue correspondence (alignment) between the newly superimposed proteins. The percentage agreement with the reference alignments is recorded again and compared to the agreement obtained in the first step.
The agreement with reference alignments in the Sisyphus test is defined as a function of the magnitude of the alignment error. More specifically, for the alignment tolerance shift $s$, the agreement is defined as $(I_s / L_{\mathrm{ref}}) \times 100\%$, where $I_s$ is the number of aligned residues that are shifted by no more than $s$ positions in the reference alignment and $L_{\mathrm{ref}}$ is the length of the reference alignment [12]. The perfect agreement is the one that corresponds to zero shift, $s = 0$.
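One plausible reading of this definition is sketched below in Python. An alignment is represented as a list of index pairs (i, j), and a reference pair counts as recovered when the predicted alignment matches residue i to a residue at most a given number of positions away from j. The representation and the function name are assumptions of this sketch, not part of the benchmark's published code.

```python
def agreement(reference, predicted, shift=0):
    """Percentage of reference pairs (i, j) for which the predicted alignment
    pairs residue i with a residue j' such that |j' - j| <= shift."""
    pred = dict(predicted)  # maps residue index in the first protein to its partner
    hits = sum(1 for i, j in reference
               if i in pred and abs(pred[i] - j) <= shift)
    return 100.0 * hits / len(reference)
```

For example, with reference pairs (0,0), (1,1), (2,2), (3,3) and predicted pairs (0,0), (1,2), (2,3), the agreement is 25% at zero shift and 75% when a shift of one position is tolerated.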
The dashed lines in Figures 2, 3, and 4 track the performance of the original STRUCTAL, TM-align, and LOCK2 methods. The solid lines show the performance of the same methods when run on the superpositions that maximize the number of residues under 3 Å. As seen in these figures, there is a significant boost in the methods’ accuracy resulting from the “fine-tooth comb” search of the superposition space. More precisely, the new superpositions improve absolute agreement with the reference alignments for STRUCTAL, TM-align, and LOCK2 by 11%, 5%, and 5%, respectively, with a similar trend continuing for nonzero shift.
The increase in the number of correctly aligned residues, obtained by switching to MaxPairs superpositions, varies from one pair of structures to another (Figures 5, 6, and 7). For some pairs, the difference is striking. However, it should be emphasized that, in some of these cases, such a large difference might be due to the unavailability of information in the PDB files used by the methods in our study. For instance, the LOCK2 method is built to take advantage of the residues’ secondary structure assignment. Hence, it is reasonable to assume that the lack of secondary structure information in the PDB file for one or both structures will often decrease the accuracy of the LOCK2 alignment of those structures.
A more detailed analysis shows that, when MaxPairs superpositions are used, the number of residue pairs correctly aligned by STRUCTAL increases by more than 10 for 31 out of 106 test pairs. The corresponding numbers of test pairs for which the same magnitude of increase is observed for TM-align and LOCK2 are 14 and 13, respectively. For comparison, the original STRUCTAL superpositions have such an advantage in only 3 out of 106 test pairs. For TM-align and LOCK2, the corresponding numbers are 5 and 4.
The value added by the deep search of the superposition space makes some of the methods analyzed here comparable to the best methods evaluated to date in the Sisyphus test. A slight accuracy advantage of algorithms such as Matt [13], PPM [14], and ProtDeform [12] is due to the fact that these methods consider proteins as flexible, rather than rigid, objects. In other words, unlike STRUCTAL, TM-align, and LOCK2, which all utilize single transformations of the input proteins to compute final alignments, the new generation of protein structure alignment methods considers sequences of different rigid transformations at different sites. It should be emphasized that the methods based on sequences of local transformations can themselves benefit from incorporating the “fine-tooth comb” search to detect fragments of local similarity. This would lead to further improvements in their overall accuracy, but the true extent of these improvements can only be assessed through a carefully designed study.
3.2. FSSP Benchmark
Our second benchmarking set utilizes 183 representative pairs of proteins, related at various levels according to the FSSP structural classification [15]. This test set consists of 55 family pairs, 68 superfamily pairs, and 60 fold pairs (see Supplementary Material available online at doi:10.1155/2012/459248).
In contrast to the Sisyphus benchmark, which compares alignments returned by automated methods to those generated by human experts, the alignment precision in the FSSP benchmark is assessed using a set of well-known alignment quality measures:
(i) NumPairs($d$) represents the number of aligned pairs of residues in two proteins that are at distance ≤ $d$ Ångströms from each other. We note that, unlike the measure optimized by MaxPairs, which is a globally optimal metric representing the maximum number of pairs of residues in the superimposed structures that can be placed under $d$ Ångströms, NumPairs($d$) represents the method-specific count of pairs of aligned residues at distance ≤ $d$.
(ii) Similarity Index, denoted by SI, is defined as $\mathrm{SI} = \mathrm{cRMS} \times \min(L_a, L_b) / N$, where $N$ is the number of aligned residues in proteins $a$ and $b$, and $L_a$ and $L_b$ are the lengths of $a$ and $b$, respectively [16]. The cRMS score, used in the formula for SI, is computed based upon the method-specific alignments.
(iii) The Percentage of Structural Similarity, PSI($d$), is defined as $\mathrm{PSI}(d) = 100 \times \mathrm{NumPairs}(d) / \min(L_a, L_b)$ (see, for example, [8]).
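The three measures can be computed from a list of aligned coordinate pairs as in the following Python sketch. It assumes the commonly used definitions (SI as cRMS scaled by the shorter chain length over the number of aligned pairs, and PSI as the percentage of the shorter chain placed within the cutoff); the function names and the input format are hypothetical, and the coordinates are assumed to be already superimposed.

```python
from math import dist, sqrt

def num_pairs(aligned, d):
    """Number of aligned pairs (p, q) whose distance is at most d."""
    return sum(1 for p, q in aligned if dist(p, q) <= d)

def crms(aligned):
    """Coordinate root mean square deviation over the aligned pairs."""
    return sqrt(sum(dist(p, q) ** 2 for p, q in aligned) / len(aligned))

def similarity_index(aligned, len_a, len_b):
    """SI = cRMS * min(len_a, len_b) / N; lower values are better."""
    return crms(aligned) * min(len_a, len_b) / len(aligned)

def psi(aligned, len_a, len_b, d):
    """PSI(d) = 100 * NumPairs(d) / min(len_a, len_b); higher values are better."""
    return 100.0 * num_pairs(aligned, d) / min(len_a, len_b)
```

Note that the measures point in opposite directions: a better alignment raises NumPairs and PSI but lowers SI, because SI rewards a small cRMS over many aligned residues.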
As seen in Table 1, a more detailed search of the superposition space increases both NumPairs and PSI scores for all three methods in our study. An improvement in SI scores is also seen for both STRUCTAL and LOCK2. It is interesting to note, though, that the original TM-align superpositions yield better SI scores than the optimal superpositions.

The FSSP level-specific results of our benchmarking analysis are summarized in Tables 2, 3, and 4.



Figure 8 shows the alignment-independent PSI scores computed from superpositions generated by STRUCTAL, TM-align, and LOCK2. For reference, a near-optimal PSI score, averaged across the FSSP test set and computed by the MaxPairs algorithm, is also provided in this figure.
The data used in Figure 8 show that (on average) STRUCTAL, TM-align, and LOCK2 fail to place 8%, 7%, and 11% of residue pairs within the distance cutoff, respectively. As expected, the best performance of these methods is observed at the FSSP family level (STRUCTAL fails to place 5%, TM-align 5%, and LOCK2 6%) and the worst at the FSSP fold level (STRUCTAL 15%, TM-align 12%, and LOCK2 17%).
3.3. Illustrative Examples
Several examples illustrating the advantage of the deep search of superposition space are given in Figures 9, 10, 11, 12, and 13.
While the examples in Figures 9–13 are striking, it should be noted that they represent rather isolated cases. In fact (as the reader can conclude from Figures 5, 6, and 7), there are several examples where the output of heuristic methods compares favorably to that of MaxPairs (although the difference in quality is not as obvious as that shown in Figures 9–13). As emphasized before, in many instances, the inaccuracy of the alignment generated by heuristic methods is due to insufficient structural information stored in the PDB file relied upon by these methods.
4. Discussion
Recent years have witnessed advances in the development of methods for approximate and exact solutions to the protein structure alignment problem. One of the first such methods is Umeyama’s algorithm for finding the transformation that gives the least mean squared error between two point patterns [17]. Since then, several algorithms have been published for finding a near-optimal solution to the structure alignment problem under distance constraints. The procedure by Akutsu, for example, returns a superposition of the input proteins that fits at least as many pairs of residues under a relaxed distance cutoff as an optimal alignment fits under the original cutoff [18]. This algorithm runs in time polynomial in the protein length. A procedure with an improved running time for the same problem has also been published [19]. The EPSILON-OPTIMAL algorithm, used in the present study, is able to place at least as many pairs of residues under the distance $d + \varepsilon$ as an optimal superposition places under the distance $d$. The asymptotic cost of EPSILON-OPTIMAL differs for globular and non-globular proteins [1].
Polynomial time approximation schemes (PTASs) have been designed for selected non-sequential protein structure alignment measures [20] as well as for the class of measures satisfying the so-called Lipschitz condition [21]. Moreover, methods exist that rigorously minimize proteins’ intra-atomic distances, including the algorithm by Caprara et al., which is capable of approximating the “Contact Map Overlap” (CMO) measure with great accuracy [22]. Finally, algorithms for the absolute optimum, with respect to selected alignment metrics, have also been published [1, 23], but they are computationally too expensive for everyday use.
Although inefficient for large-scale analysis, the algorithms for exact solution are indispensable tools for assessing the accuracy of more commonly used heuristic methods. The present study utilizes a set of precomputed superpositions to evaluate the improvements in accuracy of three well-known protein structure alignment algorithms, obtained by the deep search of the superposition space. In the Sisyphus benchmark, these superpositions increase the accuracy of alignments generated by STRUCTAL, TM-align, and LOCK2 by 11%, 7%, and 6%, respectively. An improvement of similar magnitude is seen after allowing for alignment errors (residue shifts). In the FSSP benchmark, the new superpositions increase NumPairs and PSI scores for STRUCTAL, TM-align, and LOCK2 by ~7%, ~5%, and ~13%, respectively. A particularly noticeable improvement is seen in the Similarity Index scores of alignments generated by LOCK2 (from 8.35 to 5.69).
We emphasize that our analysis provides an estimate of the lower bound on the difference between optimal and heuristic solutions, since alignments generated by MaxPairs are not always optimal (in the strict sense).
Finally, it is reasonable to expect that a more thorough exploration of the superposition space, coupled with fragment-based alignment techniques, can be used to further improve the precision of methods based on sequences of local transformations, such as Matt [13], PPM [14], and ProtDeform [12].
5. Conclusions
A typical distance-based protein structure alignment method explores the space of proteins’ spatial superpositions, computing an optimal residue-residue correspondence (alignment) each time a new superposition is generated. Because of the large search space, current methods for protein structure alignment must trade precision for speed and explore only a small but representative set of superpositions.
We utilize an algorithm capable of finding an alignment of any specified accuracy to demonstrate a significant increase in the alignment quality of solutions generated by three popular protein structure alignment methods, obtained through the deep search of the superposition space. The large lower bound on the size of the gap between optimal and heuristic solutions, observed in this study, suggests that the protein structure alignment problem will likely remain an attractive research area throughout the next decade.
Acknowledgment
A. Poleksic was supported, in part, by a Professional Development Assignment from the University of Northern Iowa.
Supplementary Materials
FSSP benchmark consisting of a set of 183 representative pairs of proteins extracted from the FSSP database (ftp://ftp.ebi.ac.uk/pub/databases/fssp/).
References
1. A. Poleksic, “Algorithms for optimal protein structure alignment,” Bioinformatics, vol. 25, no. 21, pp. 2751–2756, 2009.
2. S. Subbiah, D. V. Laurents, and M. Levitt, “Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core,” Current Biology, vol. 3, no. 3, pp. 141–148, 1993.
3. M. Gerstein and M. Levitt, “Using iterative dynamic programming to obtain accurate pairwise and multiple alignments of protein structures,” in Proceedings of the 4th International Conference on Intelligent Systems for Molecular Biology, pp. 59–67, AAAI Press, Menlo Park, Calif, USA, 1996.
4. M. Levitt and M. Gerstein, “A unified statistical framework for sequence comparison and structure comparison,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 11, pp. 5913–5920, 1998.
5. Y. Zhang and J. Skolnick, “TM-align: a protein structure alignment algorithm based on the TM-score,” Nucleic Acids Research, vol. 33, no. 7, pp. 2302–2309, 2005.
6. J. Shapiro and D. Brutlag, “FoldMiner: structural motif discovery using an improved superposition algorithm,” Protein Science, vol. 13, no. 1, pp. 278–294, 2004.
7. A. P. Singh and D. L. Brutlag, “Hierarchical protein structure superposition using both secondary structure and atomic representations,” in Proceedings of the International Conference on Intelligent Systems for Molecular Biology, vol. 5, pp. 284–293, 1997.
8. S. B. Pandit and J. Skolnick, “Fr-TM-align: a new protein structural alignment method based on fragment alignments and the TM-score,” BMC Bioinformatics, vol. 9, article 531, 2008.
9. B. K. P. Horn, “Closed-form solution of absolute orientation using unit quaternions,” Journal of the Optical Society of America, vol. 4, pp. 629–642, 1987.
10. B. K. P. Horn and H. M. Hilden, “Closed-form solution of absolute orientation using orthonormal matrices,” Journal of the Optical Society of America, vol. 5, pp. 1127–1135, 1988.
11. A. Andreeva, A. Prlić, T. J. P. Hubbard, and A. G. Murzin, “SISYPHUS—structural alignments for proteins with nontrivial relationships,” Nucleic Acids Research, vol. 35, no. 1, pp. D253–D259, 2007.
12. J. Rocha, J. Segura, R. C. Wilson, and S. Dasgupta, “Flexible structural protein alignment by a sequence of local transformations,” Bioinformatics, vol. 25, no. 13, pp. 1625–1631, 2009.
13. M. Menke, B. Berger, and L. Cowen, “Matt: local flexibility aids protein multiple structure alignment,” PLoS Computational Biology, vol. 4, no. 1, article e10, 2008.
14. G. Csaba, F. Birzele, and R. Zimmer, “Protein structure alignment considering phenotypic plasticity,” Bioinformatics, vol. 24, no. 16, pp. i98–i104, 2008.
15. L. Holm, C. Ouzounis, C. Sander, G. Tuparev, and G. Vriend, “A database of protein structure families with common folding motifs,” Protein Science, vol. 1, no. 12, pp. 1691–1698, 1992.
16. R. Kolodny, P. Koehl, and M. Levitt, “Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures,” Journal of Molecular Biology, vol. 346, no. 4, pp. 1173–1188, 2005.
17. S. Umeyama, “Least-squares estimation of transformation parameters between two point patterns,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 4, pp. 376–380, 1991.
18. T. Akutsu, “Protein structure alignment using dynamic programming and iterative improvement,” IEICE Transactions on Information and Systems, vol. E79-D, no. 12, pp. 1629–1636, 1996.
19. S. C. Li and Y. K. Ng, “On protein structure alignment under distance constraint,” in Proceedings of ISAAC, pp. 65–76, 2009.
20. J. Xu, F. Jiao, and B. Berger, “A parameterized algorithm for protein structure alignment,” Journal of Computational Biology, vol. 14, no. 5, pp. 564–577, 2007.
21. R. Kolodny and N. Linial, “Approximate protein structural alignment in polynomial time,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 33, pp. 12201–12206, 2004.
22. A. Caprara, R. Carr, S. Istrail, G. Lancia, and B. Walenz, “1001 Optimal PDB structure alignments: integer programming methods for finding the maximum contact map overlap,” Journal of Computational Biology, vol. 11, no. 1, pp. 27–52, 2004.
23. C. Ambühl, S. Chakraborty, and B. Gärtner, “Computing largest common point sets under approximate congruence,” in Proceedings of the ESA, vol. 1879 of Lecture Notes in Computer Science, pp. 52–64, 2000.
Copyright
Copyright © 2013 Mauricio Arriagada and Aleksandar Poleksic. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.