Scientific Programming

Volume 2015, Article ID 279715, 12 pages

http://dx.doi.org/10.1155/2015/279715

## Parallel Seed-Based Approach to Multiple Protein Structure Similarities Detection

^{1}INRIA/IRISA and University of Rennes 1, Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France

^{2}Los Alamos National Laboratory, Information Sciences, P.O. Box 1663, MS B256, Los Alamos, NM 87545, USA

Received 15 April 2014; Accepted 2 November 2014

Academic Editor: Ewa Deelman

Copyright © 2015 Guillaume Chapuis et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Finding similarities between protein structures is a crucial task in molecular biology. Most of the existing tools require proteins to be aligned in an order-preserving way and only find single alignments even when multiple similar regions exist. We propose a new seed-based approach that discovers multiple pairs of similar regions. Its computational complexity is polynomial and it comes with a quality guarantee: the returned alignments have both root mean squared deviations (coordinate-based as well as internal-distances based) lower than a given threshold, if such alignments exist. We do not require the alignments to be order preserving (i.e., we consider nonsequential alignments), which makes our algorithm suitable for detecting similar domains when comparing multidomain proteins as well as for detecting structural repetitions within a single protein. Because the search space for nonsequential alignments is much larger than for sequential ones, the computational burden is addressed by extensive use of parallel computing techniques: a coarse-grain level parallelism making use of available CPU cores for computation and a fine-grain level parallelism exploiting bit-level concurrency as well as vector instructions.

#### 1. Introduction

A protein’s three-dimensional structure tends to be evolutionarily better preserved than its sequence. Therefore, finding structural similarities between two proteins can give insights into whether these proteins share a common function or whether they are evolutionarily related. Structural similarity between two proteins is usually defined by two functions: a one-to-one mapping (also called alignment or correspondence [1]) between two subchains of their three-dimensional representations and a specific scoring function that assesses the alignment quality. The structural alignment problem is to find the mapping that is optimal with respect to the scoring function. Hence, the complexity of the protein structural alignment problem and the quality of the solution found strongly depend on the way the scoring function is defined.

The most commonly used among the various measures of alignment similarity are the internal-distances root mean squared deviation ($\mathrm{RMSD}_d$) and the coordinate root mean squared deviation ($\mathrm{RMSD}_c$) (see (3) and (2), respectively, for the exact definitions). Tightly related to these measures are the two main approaches for solving the structural alignment problem. The similarity score in the first approach is based on the internal distances matrix, where a set of distances between elements in the first protein is matched with a set of distances in the second protein. The second approach uses the actual Euclidean distances between corresponding atoms in two proteins and aims to determine the rigid transformation that superimposes the two structures.

The vast majority of the algorithms representing these approaches are heuristics [2–6] (excellent reviews can be found in [7, 8]) and as such do not guarantee finding an optimal alignment with respect to any scoring function. Finding exact solutions in this field is computationally hard because computing the longest alignment of protein structures is typically modeled as an NP-hard problem, for example, the protein threading problem [9], the problem of enumerating all maximal cliques [10, 11], or the problem of finding a maximum clique [12–14].

These results have been generalized by Kolodny and Linial [1], who showed that protein structural alignment is NP-hard if the similarity score is distance based. They also point out that a correct and efficient solution of the structural alignment problem must exploit the fact that the proteins lie in three-dimensional Euclidean space.

In this paper we present an algorithm that avoids the fundamental intractabilities pointed out in [1]. Our algorithm is both internal-distances based and Euclidean-coordinates based (i.e., it uses a rigid transformation to optimally superimpose the two structures). Its computational complexity is polynomial and it comes with a quality guarantee: for a given threshold $\tau$, it guarantees to return alignments that have $\mathrm{RMSD}_c$ as well as $\mathrm{RMSD}_d$ less than $\tau$, if such alignments exist.

Our algorithms are motivated by a class of exact structural-alignment algorithms that look for the largest clique in the so-called product (or alignment) graphs [12–14]. The edges in such graphs encode information about pairs of residues in the two proteins that match based on the internal distances between them, namely, if the difference between corresponding distances does not exceed some fixed parameter $\tau$. A clique of size $k$ would then correspond to subsets of $k$ residues in each protein that match.
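The edge condition of a product graph can be sketched in a few lines. The following is a minimal illustration, not the implementation used in this work: the function names, the parameter name `tau` for the fixed tolerance, and the toy coordinates are all assumptions made for the example.

```python
import math
from itertools import combinations

def compatible(P, Q, v, w, tau):
    """Return True if vertices v = (i, k) and w = (j, l) may share an edge.

    A vertex (i, k) stands for matching residue i of P with residue k of Q.
    Two vertices are joined when the corresponding internal distances
    differ by at most tau (hypothetical name for the fixed parameter).
    """
    (i, k), (j, l) = v, w
    if i == j or k == l:  # keep the mapping one-to-one
        return False
    return abs(math.dist(P[i], P[j]) - math.dist(Q[k], Q[l])) <= tau

def is_clique(P, Q, vertices, tau):
    """Check that every pair of the given vertices is joined by an edge."""
    return all(compatible(P, Q, v, w, tau) for v, w in combinations(vertices, 2))

# Toy coordinates (hypothetical): Q is P translated, with one point slightly distorted.
P = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
Q = [(1.0, 1.0, 1.0), (4.0, 1.0, 1.0), (1.0, 5.0, 1.5)]
matching = [(0, 0), (1, 1), (2, 2)]

print(is_clique(P, Q, matching, tau=0.1))   # loose tolerance
print(is_clique(P, Q, matching, tau=0.01))  # strict tolerance
```

With the loose tolerance the three matched pairs form a clique; with the strict one the distorted point breaks at least one edge, so the same vertex set is no longer a clique.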

Here, we relax this condition and accept cliques whose edges correspond to matchings of similar internal distances up to a looser tolerance. For this relaxed problem, we propose a polynomial algorithm that takes advantage of internal-distance similarities among both proteins to search for an optimal transformation to superimpose their structures. We also replace the goal of finding the largest clique by that of returning several very dense “near-clique” subgraphs. This choice is strongly justified by the observation that distinct solutions to the structural alignment problem that are close to the optimum are all equally viable from the biological perspective and hence are all equally interesting from the computational standpoint [1, 15].

To the best of our knowledge, our tool is unique in its capacity to generate multiple alignments with “guaranteed good” $\mathrm{RMSD}_c$ and $\mathrm{RMSD}_d$ values. We do not require the alignments to be order preserving, which makes our algorithm suitable for detecting similar domains when comparing multidomain proteins. Thanks to this property, the tool is able to find both sequential and nonsequential alignments, as well as to detect structural repetitions within a single protein and between related proteins.

However, exhaustively enumerating multiple similar regions requires a more systematic approach than those developed in other existing heuristic-based tools. The computational burden is addressed by extensive use of parallel computing techniques: a coarse-grain level parallelism making use of available CPU cores for computation and a fine-grain level parallelism exploiting bit-level optimization as well as vector instructions.

Other nonsequential structure alignment methods have been proposed recently (an excellent review on this topic can be found in the very recent reference [16]). None of them is close to the approach proposed here. As they are all heuristic and do not guarantee finding an optimal alignment, a detailed comparison with algorithms based on different concepts requires extensive numerical experiments and is outside the scope of this study.

Here we present a significantly improved and expanded version of a paper originally presented at the PPAM 2013 conference [17]. In comparison to [17], the current version contains a detailed description and explanation of all steps of the algorithms, all pseudocode, supplementary figures illustrating the algorithms and the experimental results, and an extended reference section. Additional sections have been added, including a comparison between the straightforward and the bit-vector implementations based on complexity analysis, as well as a detailed analysis of the work from the point of view of future performance improvements and additional possible applications.

#### 2. Preliminaries

##### 2.1. Measures for Protein Alignments

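Consider a protein $P$ of $n$ atoms, $P = (p_1, \ldots, p_n)$, with $p_i \in \mathbb{R}^3$. Many measures have been proposed to assess the quality of a protein alignment. These measures include additive scores based on the distance between aligned residues, such as the TM-score [18], the DALI score [19, 20], the PAUL score [21], and the STRUCTAL score [22], and root mean square deviation (RMSD) based scores, such as RMSD100, SAS, and GSAS [23]. Given a set of deviations $S = \{s_1, \ldots, s_N\}$, its root mean square deviation is

$$\mathrm{RMSD}(S) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} s_i^2}. \tag{1}$$

Two different RMSD measures are used for protein structure comparison. The first one, $\mathrm{RMSD}_c$, takes into account deviations consisting of the Euclidean distances between matched residues after optimal superposition of the two structures. For an alignment of $N$ matched pairs $i \leftrightarrow k$ between $P$ and a second protein $Q = (q_1, \ldots, q_m)$, it is defined as

$$\mathrm{RMSD}_c = \min_{\hat{Q}} \sqrt{\frac{1}{N} \sum_{i \leftrightarrow k} \lVert p_i - \hat{q}_k \rVert^2}, \tag{2}$$

where $\hat{Q} = (\hat{q}_1, \ldots, \hat{q}_m)$ is the image of protein $Q$ under a rigid transformation.

The second one, denoted here by $\mathrm{RMSD}_d$, takes into account deviations consisting of absolute differences of internal distances within the matched structures. The measured deviations are $|d(p_i, p_j) - d(q_k, q_l)|$, where $d$ denotes the Euclidean distance, for all couples of matching pairs “$i \leftrightarrow k$, $j \leftrightarrow l$.” Let $M$ be the latter set and $|M|$ its cardinality. We have that

$$\mathrm{RMSD}_d = \sqrt{\frac{1}{|M|} \sum_{(i \leftrightarrow k,\ j \leftrightarrow l) \in M} \bigl(d(p_i, p_j) - d(q_k, q_l)\bigr)^2}. \tag{3}$$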
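As an illustration of these definitions, the sketch below computes the RMSD of a set of deviations and the internal-distances RMSD of an alignment. It is a minimal example under stated assumptions: the function names, the representation of an alignment as a list of index pairs, and the toy coordinates are hypothetical, and couples of matching pairs are counted once each (unordered), which does not change the mean. The coordinate-based measure is omitted because it additionally requires computing an optimal rigid superposition.

```python
import math
from itertools import combinations

def rmsd(deviations):
    """Root mean square of a non-empty sequence of deviations."""
    return math.sqrt(sum(s * s for s in deviations) / len(deviations))

def rmsd_d(P, Q, alignment):
    """Internal-distances RMSD of an alignment given as index pairs (i, k).

    P and Q are lists of 3D points. Each couple of matching pairs
    (i <-> k, j <-> l) contributes the deviation |d(p_i, p_j) - d(q_k, q_l)|.
    The alignment must contain at least two pairs.
    """
    deviations = [
        abs(math.dist(P[i], P[j]) - math.dist(Q[k], Q[l]))
        for (i, k), (j, l) in combinations(alignment, 2)
    ]
    return rmsd(deviations)

# Toy data (hypothetical): Q is P rotated by 90 degrees about the z-axis, then shifted.
P = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0), (0.0, 2.0, 3.0)]
Q = [(-y + 5.0, x - 1.0, z + 2.0) for (x, y, z) in P]
alignment = [(0, 0), (1, 1), (2, 2), (3, 3)]

print(rmsd_d(P, Q, alignment))  # internal distances are preserved by rigid motions

# Moving one atom of Q changes some internal distances, so the value becomes positive.
Q_distorted = Q[:3] + [(Q[3][0], Q[3][1], Q[3][2] + 1.0)]
print(rmsd_d(P, Q_distorted, alignment))
```

Because internal distances are invariant under rotations and translations, the first value is zero up to floating-point error without any superposition step, which is why this measure can be evaluated directly from the two distance matrices.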

##### 2.2. Alignment Graphs

An undirected graph $G = (V, E)$ is represented by a set $V$ of vertices and a set $E$ of edges between these vertices. In this paper, we focus on a subset of graphs consisting of grid-like graphs, referred to as alignment graphs.

An *$m \times n$ alignment graph* $G = (V, E)$ is a graph in which the vertex set $V$ is depicted by an $m \times n$ array $T$, where each cell $T[i][k]$ contains at most one vertex from $V$. An example of such an alignment graph for protein comparison is given in Figure 1.
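One simple way to hold such a grid-like graph in memory is sketched below. This is an illustrative data structure only: the class name, its methods, and the convention that rows index residues of the first protein and columns index residues of the second are assumptions made for the example, not details taken from the text.

```python
class AlignmentGraph:
    """Grid-like graph: cell (i, k) of an m x n array holds at most one vertex."""

    def __init__(self, m, n):
        self.m, self.n = m, n
        # present[i][k] is True when the vertex of cell (i, k) exists.
        self.present = [[False] * n for _ in range(m)]
        # Adjacency sets, keyed by the (row, column) position of each vertex.
        self.adj = {}

    def add_vertex(self, i, k):
        if not (0 <= i < self.m and 0 <= k < self.n):
            raise IndexError(f"cell ({i}, {k}) lies outside the {self.m} x {self.n} grid")
        self.present[i][k] = True
        self.adj.setdefault((i, k), set())

    def add_edge(self, v, w):
        if v == w or v not in self.adj or w not in self.adj:
            raise ValueError("an edge must join two distinct existing vertices")
        self.adj[v].add(w)
        self.adj[w].add(v)

    def row(self, i):
        """Vertices located in row i, in column order."""
        return [(i, k) for k in range(self.n) if self.present[i][k]]

    def column(self, k):
        """Vertices located in column k, in row order."""
        return [(i, k) for i in range(self.m) if self.present[i][k]]

g = AlignmentGraph(3, 4)
for cell in [(0, 0), (0, 2), (1, 1), (2, 3)]:
    g.add_vertex(*cell)
g.add_edge((0, 0), (1, 1))
g.add_edge((1, 1), (2, 3))

print(g.row(0))            # [(0, 0), (0, 2)]
print(sorted(g.adj[(1, 1)]))  # [(0, 0), (2, 3)]
```

Storing presence in a dense array keeps row and column lookups trivial, while the adjacency sets stay sparse; other layouts are equally valid for the same definition.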