BioMed Research International

Volume 2014, Article ID 731325, 13 pages

http://dx.doi.org/10.1155/2014/731325

## Finding Semirigid Domains in Biomolecules by Clustering Pair-Distance Variations

^{1}Section of Biosimulation and Bioinformatics, Center for Medical Statistics, Informatics, and Intelligent Systems (CeMSIIS),
Medical University of Vienna, Spitalgasse 23, 1090 Vienna, Austria^{2}Institute for Nuclear Research and Nuclear Energy (INRNE), Bulgarian Academy of Sciences, 72, Tzarigradsko Chaussee, 1784 Sofia, Bulgaria

Received 31 January 2014; Accepted 10 March 2014; Published 15 May 2014

Academic Editor: Francesco Pappalardo

Copyright © 2014 Michael Kenn et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Dynamic variations in the distances between pairs of atoms are used for clustering subdomains of biomolecules. We draw on a well-known target function for clustering and first show mathematically that the assignment of atoms to clusters has to be crisp, not fuzzy, as hitherto assumed. This reduces the computational load of clustering drastically, and we demonstrate results for several biomolecules relevant in immunoinformatics. Results are evaluated regarding the number of clusters, cluster size, cluster stability, and the evolution of clusters over time. Crisp clustering lends itself as an efficient tool to locate semirigid domains in the simulation of biomolecules. Such domains seem crucial for an optimum performance of subsequent statistical analyses, aiming at detecting minute motional patterns related to antigen recognition and signal transduction.

#### 1. Introduction

Molecular dynamics (MD) can be used to investigate functional elements in biomolecules [1–5]. In addition to static structures (such as crystal structures stored in the protein data bank (PDB) [6]) molecular dynamics yields information on dynamic properties [7, 8], lending themselves for evaluation of, for example, signal transduction. However, key patterns of motion related to such a functional element may be hidden among a large amount of “other” movements, reflecting no more than ordinary thermal motility of the biomolecule. Molecular dynamics itself can be carried out along relatively standardized protocols [9, 10]. However, recognizing specific patterns of motion, which are deemed crucial for a functional element, remains a tricky task, requiring sophisticated statistical methods [11], such as principal component analysis [12, 13] or normal mode analysis [14].

For all the mentioned approaches, an initial and essential step is the “fitting” of the molecular structure of each time step of an MD trajectory (henceforward called frame) to a reference structure, [15]. A given frame is first translated to let its centre of mass coincide with that of the reference frame. Then is rotated (around its centre of mass) to minimize square deviations between corresponding atoms of and : In many approaches, RMSD has been used not only for fitting but also for directly (and successfully) quantifying molecular deformations [15], including structural changes, drifts, and trends [16]. In many cases, however, even sophisticated statistical methods fail, when applied to MD-frames after fitting. The suspicion is that the process of fitting itself might cause this failure. How does this come about?

By default, the GROMACS [17] fitting procedure uses atomic masses as weights for superposition of a structure’s atomic coordinates to a reference structure. Accordingly, the fitting of “as a whole” is being optimized. In some cases, fitting the whole molecule may be inadequate and even conceal what one is searching for. For example, consider a molecule with one or more flexible loops. While the body of such a molecule behaves like a slightly deformable, rigid body, a loop may be conformationally flexible exhibiting largely uncorrelated movements with respect to the rest of the molecule. In the fitting criterion, however, atoms within the loop and those in the body may have equal weights. Since all deviations enter quadratic into (1), large movements of an even small number of loop-atoms may generate dominant contributions to the RMSD. In such a case, in order to minimize total RMSD, is rotated predominantly to accommodate for the few atoms within the loop. As a result, the large remaining body of the molecule has to “follow its own loop,” as if the tail chases the dog [maybe this was not the primary intention of fitting]. Needless to say, due to such movements caused by fitting, that minute motile elements may become totally submerged, without any chance of being retrieved from the trajectory, not even by sophisticated statistics.

The described situation is typical and demands more elaborate fitting methods. Choosing unequal weights suggests itself as a nearby and convenient solution. The more rigid parts of the molecule should receive more weight, the flexible ones less. However, how should one know, prior to fitting, which parts are semirigid and which are flexible?

One possibility would be a two-pass procedure, in the first pass fitting to the whole molecule with uniform weights () and evaluating the : where denotes the average over a trajectory and denotes a reference position of atom , not changing over time. Note that will highly depend on the choice of the reference position, which is usually the mean coordinate of atom over the whole trajectory. Then, in a second pass of fitting, weights are chosen inversely proportional to the , as reported by [18]. Highly motile atoms receive less weight and lose their role in shaking the remaining main parts of the molecule. However, this method suffers from the fact that depends on the selection of in the first pass of fitting; that is, the correction procedure depends on the error it is supposed to correct.

Another possibility is the identification of semirigid domains (clusters) within the molecule, as reported by [18]. In particular, the definition of clusters may be based on distances between pairs of atoms rather than coordinates computed in the trajectory. The standard deviation of distance variation (STDDV) between an atomic pair () is given by where is the number of atoms, denotes the average over a trajectory, and is measured in nm. Evidently, pair-distances are unaffected by any kind of arbitrariness due to fitting.

Given a number of clusters (), let denote the partial class membership of atom in cluster . For normalization we require The following criterion has been proposed to identify an optimum decomposition into a given number () of clusters [18]. Minimize the target function: under the constraints of (4). Once identified, any such cluster may be used as “primary fitting domain,” by assigning large weights to the atoms therein. With little motion within such a cluster, the motility of the remaining atoms of the molecule will appear relative to that cluster. This generally increases the chance of tracing relevant patterns of motion outside the cluster, for many statistical methods being applied.

The important difference from the known structure-analyzing tools of GROMACS, whether based on RMS deviation after fitting or RMS deviation of atom-pair distances, is that there is no need of a reference structure here. Also the clustering algorithms themselves, though with different criteria, assign conformations to a cluster for the molecule as a whole, while in our work, groups of atoms are assigned to the same cluster if their mutual distances vary little over time (a spatial clustering within the molecule).

MD trajectories for protein complexes were analyzed by clustering of averaged standard deviations of distance variation (STDDV); see below. Obtaining the most rigid cluster of atoms can be seen as the first step to facilitate the search for protein motions.

#### 2. Methods

##### 2.1. Construction of Complexes for Molecular Dynamics Simulation

We applied the clustering algorithm to a series of molecular systems as follows, see Table 1.

LC13 T cell receptor (TCR) in complex with major histocompatibility complex (MHC) HLA-44:05 and the ABCD3 peptide (EEYLQAFTY) has been successfully crystallized by Macdonald et al. [20] and its structure is accessible on http://www.pdb.org/ assigned the PDB ID* 3KPS*. However, there are no structure files of LC13 TCR in complex with HLA-44:02 and HLA-44:03. Therefore, we applied homology modelling to create these structures.

In-silico mutagenesis was carried out using Swiss PDB Viewer [21]. For generation of LC13/ABCD3/HLA-44:03, we used PDB structure* 3KPS* as a template and introduced mutations Y116D (numbers according to PDB numbering) and D156L to the MHC thus changing the HLA type from 44:05 to 44:03. For generation of LC13/ABCD3/HLA-44:02 we used PDB structure* 3KPS* as a template and introduced mutation Y116D to the MHC thus changing the HLA type from 44:05 to 44:02; see Figures 1 and 2.

##### 2.2. Molecular Dynamics Simulation Protocol

The workflow of the molecular dynamics simulation of the penta-L-alanine system is closely related to that in the work of Bernhard and Noé [18]. MD simulation of penta-L-alanine was performed using GROMACS 4.0.7 [17] according to the following protocol.

First, we immersed penta-L-alanine in an explicit SPC [22] artificial water bath (cubic box) allowing for a minimum distance of 1 nm between peptide and box boundaries. Second, we minimized the solvated system using a steepest descent method. Next, we warmed up the system to 293 K during a 100 ps position restraint MD simulation. Finally, we carried out the MD production run with LINCS constraint algorithm acting on bonds with hydrogen atoms using an integration step of 2 fs and the GROMOS96 53a6 force field [23]. Coordinates were written to the trajectory every 2 ps. Coulomb interactions were computed using Particle Mesh Ewald (PME) with a maximum grid spacing of 0.12 nm and interpolation order 4. Both, Van der Waals and Coulomb interactions were computed with a cut-off at 1.4 nm. Berendsen temperature coupling to 293 K and Berendsen isotropic pressure coupling to 1 bar were used. All further parameters were set in accordance with Omasits et al. [24].

MD simulation of TCR/pMHC systems was performed using GROMACS 4.0.7 [17] according to the following protocol. First, we immersed the TCR/pMHC complex in SPC [22] artificial water bath (cubic box) allowing for a minimum distance of 2 nm between complex and box boundaries. Second, we added sodium and chloride ions to a concentration of 0.15 mol/L, and at the same time neutralizing the net charge of the system. Third, we minimized the energy of the solvated system using a steepest descent method. Next, we warmed up the system to 310 K during a 100 ps position restraints MD simulation. Finally, we carried out MD production runs with LINCS constraint algorithm acting on all bonds and using the GROMOS96 53a6 force field [23]. Hydrogen motions were removed allowing for an integration step of 5 fs. Coordinates were written to the trajectory every 50 ps. Coulomb interactions were computed using Particle Mesh Ewald (PME) with a maximum grid spacing of 0.12 nm and interpolation order 4. Both, Van der Waals and Coulomb interactions were computed with a cut-off at 1.4 nm. Velocity rescale temperature coupling to 310 K and Berendsen isotropic pressure coupling to 1 bar were used. All further parameters were set in accordance with Omasits et al. [24].

##### 2.3. Optimization of Cluster Membership

Each atom in a molecular dynamics simulation may be uniquely assigned to one of the clusters considered [7, 25, 26], represented by a “crisp” vector of cluster-membership; for example, if atom belongs to cluster 4 out of clusters. Alternatively, each atom may be considered to belong to several clusters simultaneously, represented by fuzzy, noninteger memberships, with normalization condition see (4). Fuzzy memberships are the more general case, it seems that they might yield lower minima of the target function than crisp memberships and should therefore be preferred. Interestingly, Bernhard and Noé [18] report that fuzzy memberships, upon optimization with a gradient method, tend to end up as crisp, that is, either 0 or 1. We have scrutinized this issue and will demonstrate how this comes about. Even more, as one of the main results of this work, we will prove that the solution has to be crisp. The proof is given via mathematical arguments; see results section. This finding allows us to restrict the search space to crisp memberships, without diminishing the generality of the optimization problem posed.

###### 2.3.1. Optimization of Crisp Cluster Memberships by a Two-Stage Monte Carlo Method

Initially, the number of clusters, , is chosen and the target function (5) has to be minimized. We will show that, under certain assumptions (see Section 3.1), it is sufficient to search in In a less formal formulation, the objective is to assign each of the atoms to one particular cluster (crisp memberships).

It may become quite tricky to attack such a problem with an analytic gradient approach since the boundary conditions are usually difficult to handle and the search domain consists of isolated points. We have therefore chosen a two-step search, in which a random process is succeeded by an exhaustive search.

Every constellation (i.e., Monte Carlo trial for improvement) which cannot be improved by a single move of one of the atoms from one cluster to another is considered a result (minimum constellation). The result with the lowest is the ground state, but all other minimum constellations should also be included in the further analysis.

###### 2.3.2. Search Algorithm

*(i) Start-Up*. Initially, each of the components is randomly assigned to one (of the ) cluster.

*(ii) Random Search*. In the first step, each of the components (atoms) is moved from its current cluster to another randomly chosen cluster with probability . If this mutation yields a reduction in the new constellation is preserved; otherwise, it is rejected. This process is repeated times. A very rudimentary benchmarking analysis has shown that and are reasonable values to use.

*(iii) Exhaustive Search*. In the second step, the algorithm tries to improve by single step moves for each component separately. If there is no possible move to improve , the constellation is necessarily a local minimum in our sense.

*(iv) Ground State*. Usually there is a large number of minimum constellations in the above sense. For a matrix with significant structure, the ground state will be reached after only a few trials. For ill-conditioned matrices (those without structure), a minimum constellation very close to the ground state will also be found after a few trials, although the absolute ground state might be difficult to find.

#### 3. Results

##### 3.1. Crisp Cluster Membership as a Necessary Consequence

We formulate and prove a lemma that crisp memberships are a necessary consequence of the topology of the multidimensional space of pair-distance standard deviations.

Due to its definition, is a symmetric, nonsingular adjacency matrix whose entries are the standard deviations explained earlier (3); thus, for and . Let The objective is to find for .

###### 3.1.1. Lemma

If is a symmetric, nonsingular matrix with nonnegative entries and , for , then .

To prove this lemma one uses Lagrange multipliers: Since is a polynomial of order 2, the derivatives with respect to and yield a system of linear equations of the form: with being identity matrix and . The determinant of the matrix is and therefore, under the assumption , there must be a unique solution, which is given by Unfortunately, this solution yields the maximum of for . However, since is convex and bounded, one knows that any argmin must be on . Therefore, there must be at least one . Reducing the above system of linear equations by this constraint, one sees that the revolving system again has a unique solution: which again gives a maximum of . The derivation of the value of is rather involved and not shown.

With the same argument as before, one can proceed by setting all equal to zero with the exception of one for each . This iterative procedure shows that clustering with respect to leads to a unique assignment of each of the components to one particular cluster.

##### 3.2. Heuristic Evaluation of Solution Space around the Minima

Our theoretical result, that memberships are crisp, can be illustrated very intuitively; see Figure 3. Left and right end of -axis correspond to constellations with minimum target function, located at the boundary, and no other minimum is found in between.

##### 3.3. Clusters of Atomic Motions in MD Trajectory

For above mentioned TCR/pMHC complexes B4402 and B4403, STDDV matrices were computed from the MD trajectories; see Figure 4. Only the second parts of the trajectories (corresponding to approx. 125 ns simulation time) were considered to exclude relaxation effects; see also Section 3.6.1.

STDDV matrices were clustered as described above for to . After computation, clusters are renumbered according to size (=number of atoms), the largest one always being labelled as cluster 1. For cluster memberships for B4402 and B4403 were remapped onto the protein structure and displayed in VMD [27]; see Figure 5.

Note that NO information whatsoever about secondary structural elements, such as -helices and -sheets, has entered the clustering procedure. Still, clusters more or less seem to retrieve some of these structural elements; see Figure 5. This could be related to extensive hydrogen bonding in -helices and -sheets stabilizing these secondary protein structure elements. How could bond constraints in MD simulations influence the resulting clusters? In our calculations, we just considered the protein backbone, because amino acids side chains show larger spatial fluctuations. The backbone atoms are separated by a planar and rigid amid bond, so neighboring atoms will experience less variation in distance.

##### 3.4. More Clusters Improve Target Function

The number of clusters has to be preselected in our approach. If we had just one cluster, the total distance variability contained in matrix** S** would be part of that cluster. Increasing the number of clusters generally reduces the fraction of variability contained within clusters, expressed as percentage of total in Figure 6.

##### 3.5. Larger Clusters Turn Out to Be More Rigid

Clusters were constructed to achieve maximum internal “rigidity,” that is, a minimum sum of pair-distance standard deviations. One might expect that large clusters, since they accommodate many atoms within larger spatial domains, should turn out to be less rigid then smaller clusters. However, the opposite is true: larger clusters turn out to be more rigid; see the declining trend of normalized STDDV with increasing cluster size in Figure 7. This demonstrates again that structures in motility are captured via clustering. If there is no structure within matrix , normalized cluster sizes would result nearly equal, that is, centered around 1.

Clearly, clusters do not result identical for different subsections of a MD trajectory. In Figure 7, data are shown for two trajectories and 50 subsections each, each clustered for and 6. For details, see legend of Figure 7. 100 Monte Carlo attempts were performed for each clustering, out of which the optimum (smallest target function) was adopted. These results confirm the general trend that larger clusters are more rigid.

An overview of dispersions within and between clusters is given in Figure 8, the numerical results for 6 clusters being given in Table 2.

##### 3.6. Stability of Clusters

Clusters have been evaluated regarding stability, in order to check whether they lend themselves as reliable semirigid domains for fitting MD-configurations. Of note that (at least) two sources of variability of cluster memberships need to be scrutinized as follows:(i)variability due to the stochastic nature of the Monte Carlo clustering method and(ii)variability due to different parts of an MD trajectory being clustered.We will demonstrate that variability due to our Monte Carlo clustering method is negligible. As opposed to that, the “adequate” choice and preparation of the MD trajectory has tremendous impact and remains an issue of a never ending debate [28–32].

###### 3.6.1. Variability between Different Parts of a Single MD Trajectory

Adequate sampling of phase space is essential regarding MD-simulations [33]. Much work has been done to detect changes, drifts, and trends as markers for inadequate sampling [16, 29, 34]. Block averaging was proposed as one of the remedies [35].

In this work, the clustering presented above was based on matrices computed from whole 250 ns MD trajectories. Clearly, the matrices for each subset of a trajectory would be different, entailing different results for clustering. The question is which is the most reliable clustering for a given molecule?

To answer this question, we shall quantify the variability of clusters for subsets, relate it to the result for the whole trajectory, and derive a “stiff kernel,” that is, those atoms which do not (or very rarely) change clusters between subsets of the trajectory.

First, we inspect the total dispersion contained within , evaluated separately for 50 subsets of 5 ns each (i.e., 100 frames out of 5000 frames in a whole trajectory); see Figure 9. In this trajectory we observe an irregular oscillating time behavior, starting with a declining tendency. The existence of such a substantial initial phase indicates that the influence of the starting configuration does not die out until after (roughly) half of the total simulation time has passed by. To account for this fact in a heuristic way, straight lines were fitted to the first and second half of the trajectory, allowing for some overlap. This accommodates with the finding of our previous work [29] that only the second half of the trajectory can be considered an unbiased sample from phase space and should be taken for further evaluations.

###### 3.6.2. The Path along Most Stable Clusterings

For a preselected (number of clusters) , clustering was performed for each of the 50 subset-trajectories (5 ns each), yielding an assignment for each atom to one of the clusters. Note that clusters were primarily labeled (cluster 1, 2, etc.) according to their size in that particular clustering (see Figure 10). Thus, the following situation may occur. Given a clustering of a subset-trajectory with first and second cluster about equal in size. Then, when clustering the following subset-trajectory, a few atoms from cluster 1 may end up in (i.e., “migrate” into) cluster 2, which may suffice to make cluster 2 now the largest cluster and therefore receiving the label “cluster 1” in the second clustering. This would yield a very peculiar result. The few “migrating atoms” would formally belong to the same cluster (in this example cluster 1), while the majority of atoms would switch between clusters 1 and 2. To avoid this misleading and undesired artifact, we refined the procedure as follows. In the first clustering, clusters were assigned labels (1, 2, etc.) according to decreasing size. After each subsequent clustering, we evaluated all permutations of cluster labels regarding the number of “migrating atoms” with respect to the first clustering. That permutation of labels, which yielded a minimum of migrating atoms, was finally adopted. As an example, the resulting number of migrating atoms is shown for B4402 and in Figure 11.

###### 3.6.3. Quantifying the Stability of Cluster Assignments

In Figure 11 the number of migrating atoms was considered. Now, we evaluate which atoms migrate. For quantification, we resort to the Kullback-Leibler-distance [36, 37]: represents the assumed background probability if the assignment of atom to any of the clusters were equally probable. is the actual probability for atom to belong to cluster , estimated from an average over cluster memberships obtained from clustering subsets of the trajectory: Large values of indicate that atom stays predominantly in the same cluster throughout the trajectory. On the contrary, values of close to zero indicate a random distribution of an atom between all clusters. Figure 12 shows for B4402 and .

#### 4. Discussion

##### 4.1. Clustering Reflects Structure within STDDV-Matrix

Target function (5) only counts distance variability within the clusters, not between atoms belonging to different clusters. Thus, if we reorder atoms according to their cluster membership, clusters appear as squares along the main diagonal of the matrix ; see Figure 15. If we hypothetically assume that elements are more or less homogenously distributed across the matrix, the “area” of each cluster in the matrix will roughly correspond to the variability within that cluster. Clearly, these squares have to be of equal size to make their joint area minimum (and thus minimize the total distance variation within clusters).

We have verified this prognosis by randomly rearranging elements of matrix and then performing the clustering procedure. Cluster sizes resulted almost equal (for illustration see left panel of Figure 13) in each of 20 trials of random rearrangement and clustering. This finding was verified for (data not shown).

As opposed, clusters of very different sizes resulted for matrices derived from MD simulation (illustrated in right panel of Figure 13). This clearly indicates that clusters of unequal size do not result by chance but reflect distinct dependencies within . The sensitivity of clustering to existing structures within is also reflected in target function , as can be seen by comparing matrices from MD and their randomly rearranged counterparts; see Figure 14. In both cases averaged STDDV decreases (improves) with increasing number of clusters. However, in presence of real dependencies between atomic mobility, clustering achieves much more improvement.

##### 4.2. Many Small Clusters Are More Rigid but Cover Less Distance Variation

One might ask why a larger cluster number is less favorable. Obviously, the extreme case of defining each single pair of atoms as a separate cluster would lead to minimum STDDV within clusters but would represent a trivial and useless solution. Note that STTDV within clusters decreases with increasing , which makes clusters more homogeneous and is a desired effect. However, at the same time the overall amount of variability caught within clusters also decreases, which is an undesired effect, since larger portions of the molecule are disregarded. These trends are reflected by the total area of coloured squares along the diagonal in Figure 13. The smaller the squares (with increasing ), the smaller their total area, even if there are more squares (the area decreases quadratic with the side lengths of squares).

This tendency has been quantitatively demonstrated for real MD-data in Figure 6. On top of that, Figure 14 shows that the same trend also holds for unstructured matrices, as mentioned above. However, results also show that matrices without structure (randomly rearranged elements) allow for very little reduction in the STDDV covered within clusters, as compared to structured matrices and that this difference further increases with .

##### 4.3. Clustering Is Stable and Self-Contained

Powerful statistics on MD-trajectories (which is, e.g., able to detect small motions related to signaling) needs careful fitting of configuration frames as a prerequisite. Fitting to a domain which should be as rigid and large as possible is one of the options.

Hence, finding large clusters is desirable. However, large clusters in general might prove unstable. Therefore, we have carefully investigated this issue for the target function proposed by Bernhard and Noé [18] in conjunction with our clustering method.

It turned out that larger clusters are even more stable (see Figure 7), which is a strong indicator of stability and self-containment of our results.

##### 4.4. Ergodicity

Atomic motions in a large molecule constitute a highly dimensional phase space, and MD-simulations can in most cases explore only part of the total phase space. At least, one can never be sure about the exact fraction of phase space actually visited in a specific simulation run. As a consequence we use only the second half of our MD trajectory for final clustering (see Figure 5), in accordance with our previous work [29].

##### 4.5. Computational Resources

Clustering takes very little iteration steps for A_{5}, due to monotonous, continuous relationships between elements of** S**; see the appendix. Note that the rapidity of clustering in this case is not only a primary consequence of the small number of atoms but a matter of simplicity in structure.

Randomized matrices** S** take significantly more iterations for clustering, since many minima are almost equal regarding the target function, rendering solutions ambiguous. As opposed to this, clustering large molecules with structured internal motion, such as B4402 and B4403, yields well-defined minima after a reasonable number of trials.

##### 4.6. Cluster Interpretation

There is an obvious difference in visual appearance between STDDV matrices for B4402 and B4403 as seen in Figure 4. Trajectory B4402 yields an unstructured, rather flat STDDV matrix (upper panel in Figure 4), while B4403 shows a distinctly structured STDDV matrix (lower panel in Figure 4). The relation between cluster rigidity and size is illustrated in Figure 7 and shows a clear trend: cluster rigidity increases with increasing cluster size, reflected in a decreasing STDDV within clusters. However, this does not mean that the largest cluster is always the most rigid cluster (i.e., has lowest STDDV). For , we see that in B4403 the largest cluster is at the same time the most rigid one. For in B4402 the second largest cluster is the most rigid one. For the situation is inverted: in B4402 the largest cluster is the most rigid one. In B4403 the second largest cluster is most rigid.

All in all, the most rigid cluster was always found among the two largest clusters. The mappings of the clustering results for and for B4402 and B4403 have been displayed in Figure 5.

##### 4.7. Conclusion and Prospects

A main result of the paper is the finding that the target function proposed by others [18] has only crisp solutions; that is, each atom belongs to one single cluster only (and not to several clusters in a fuzzy sense). This finding allows for a much more efficient search for optimum clustering.

Based on this new finding, the process of clustering was evaluated regarding various aspects to provide concomitant information for possible application by other investigators. Applicability was demonstrated for two trajectories of 250 ns each for large biomolecular complexes whose dynamics is of key importance for the understanding of immune reactions.

Further improvements can be expected from a more detailed investigation of the Kullback-Leibler distance [36, 37]. In this work it has only been reported as a means for assessing the quality of clustering by some given method. In future work, the Kullback-Leibler-distance may enter the clustering procedure itself and render clusters even more stable between subtrajectories and over time.

#### Appendices

#### A. Pair-Distance Variations in a Small Molecule

For comparison, we applied our clustering method also to the A_{5} penta-L-alanine peptide already analyzed by Bernhard and Noé [18]. They choose penta-L-alanine to evaluate their clustering algorithm on MD simulations for reasons of simplicity of this molecule. A_{5} has four amide bonds, each comprising four atoms (CONH). The delocalization of the nitrogen’s free electron between conjugated carbonyl and amine groups poses planarity and rigidity onto this structure.

Mu et al. [38] showed that A_{5} does not remain in an alpha-helical conformation, but rather exhibits repeated folding and unfolding events. As expected for such a molecule, our clustering of the averaged STDDV matrix indicates that atoms close together show little fluctuations of their mutual distances, such as the atoms in the middle of the pentapeptide; they are, so to speak, in the centre of the storm. Conversely, atoms near the edges show large distance fluctuations with respect to all other atoms.

Clustering such a homogeneous matrix does not reveal structural elements within the molecule, but rather splits it into equal parts, according to the number of clusters preselected. Note, however, that this behavior is a property of the equally distributed matrix values and not a general rule.

#### B. Software Availability

The software used for clustering into crisp domains of preselected number is currently being implemented in Java and will be made available for free download from http://www.meduniwien.ac.at/msi/biosim/index.php?lang=en&seite=en_forLehreIt_pairDistanceClustering.

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgments

The MD trajectories used in the present work were generated on the IBM-BlueGene computer facility at Bulgarian National Centre for Supercomputing Applications (NCSA). The work was supported in part by BSF and OeAD under Grants DCVP 02/1/2009, DNTS-A 01-2/2013, and WTA-BG 06/2013. We gratefully acknowledge mathematical advice and helpful discussions from Professor Rudolf Karch, PhD and Peicho Petkov, PhD, and technical assistance by Michael Cibena.

#### References

- G. Bao, “Mechanics of biomolecules,”
*Journal of the Mechanics and Physics of Solids*, vol. 50, no. 11, pp. 2237–2274, 2002. View at Publisher · View at Google Scholar · View at Scopus - R. Lavery, A. Lebrun, J.-F. Allemand, D. Bensimon, and V. Croquette, “Structure and mechanics of single biomolecules: experiment and simulation,”
*Journal of Physics Condensed Matter*, vol. 14, no. 14, pp. R383–R414, 2002. View at Publisher · View at Google Scholar · View at Scopus - D. Gordon, R. Chen, and S. H. Chung, “Computational methods of studying the binding of toxins from venomous animals to biological ion channels: theory and applications,”
*Physiological Reviews*, vol. 93, pp. 767–802, 2013. View at Google Scholar - Y. Cui, “Using molecular simulations to probe pharmaceutical materials,”
*Journal of Pharmaceutical Sciences*, vol. 100, no. 6, pp. 2000–2019, 2011. View at Publisher · View at Google Scholar · View at Scopus - S. A. Adcock and J. A. McCammon, “Molecular dynamics: survey of methods for simulating the activity of proteins,”
*Chemical Reviews*, vol. 106, no. 5, pp. 1589–1615, 2006. View at Publisher · View at Google Scholar · View at Scopus - H. M. Berman, J. Westbrook, Z. Feng et al., “The protein data bank,”
*Nucleic Acids Research*, vol. 28, no. 1, pp. 235–242, 2000. View at Google Scholar · View at Scopus - W. Wriggers and K. Schulten, “Protein domain movements: detection of rigid domains and visualization of hinges in comparisons of atomic coordinates,”
*Proteins*, vol. 29, pp. 1–14, 1997. View at Google Scholar - W. R. Taylor, “Protein structural domain identification,”
*Protein Engineering*, vol. 12, no. 3, pp. 203–216, 1999. View at Google Scholar · View at Scopus - J. M. Haile, “Fundamentals,” in
*Molecular Dynamics Simulation: Elementary Methods*, John Wiley & Sons, 1992. View at Google Scholar - J. M. Haile, “Hard spheres,” in
*Molecular Dynamics Simulation: Elementary Methods*, John Wiley & Sons, 1992. View at Google Scholar - H. J. Berendsen and S. Hayward, “Collective protein dynamics in relation to function,”
*Current Opinion in Structural Biology*, vol. 10, no. 2, pp. 165–169, 2000. View at Publisher · View at Google Scholar · View at Scopus - A. Amadei, A. B. M. Linssen, and H. J. C. Berendsen, “Essential dynamics of proteins,”
*Proteins: Structure, Function and Genetics*, vol. 17, no. 4, pp. 412–425, 1993. View at Publisher · View at Google Scholar · View at Scopus - M. A. Balsera, W. Wriggers, Y. Oono, and K. Schulten, “Principal component analysis and long time protein dynamics,”
*Journal of Physical Chemistry*, vol. 100, no. 7, pp. 2567–2572, 1996. View at Google Scholar · View at Scopus - S. Hayward and B. L. De Groot, “Normal modes and essential dynamics,”
*Methods in Molecular Biology*, vol. 443, pp. 89–106, 2008. View at Publisher · View at Google Scholar · View at Scopus - A. E. García, “Large-amplitude nonlinear motions in proteins,”
*Physical Review Letters*, vol. 68, no. 17, pp. 2696–2699, 1992. View at Google Scholar · View at Scopus - X. Zhang, D. Bhatt, and D. M. Zuckerman, “Automated sampling assessment for molecular simulations using the effective sample size,”
*Journal of Chemical Theory and Computation*, vol. 6, no. 10, pp. 3048–3057, 2010. View at Publisher · View at Google Scholar · View at Scopus - B. Hess, C. Kutzner, D. Van Der Spoel, and E. Lindahl, “GRGMACS 4: algorithms for highly efficient, load-balanced, and scalable molecular simulation,”
*Journal of Chemical Theory and Computation*, vol. 4, no. 3, pp. 435–447, 2008. View at Publisher · View at Google Scholar · View at Scopus - S. Bernhard and F. Noé, “Optimal identification of semi-rigid domains in macromolecules from molecular dynamics Simulation,”
*PLoS ONE*, vol. 5, no. 5, Article ID e10491, 2010. View at Publisher · View at Google Scholar · View at Scopus - J. Robinson, J. A. Halliwell, H. McWilliam, R. Lopez, P. Parham, and S. G. Marsh, “The IMGT/HLA database,”
*Nucleic Acids Research*, vol. 41, pp. D1222–D1227, 2013. View at Google Scholar - W. A. Macdonald, Z. Chen, S. Gras et al., “T cell allorecognition via molecular mimicry,”
*Immunity*, vol. 31, no. 6, pp. 897–908, 2009. View at Publisher · View at Google Scholar · View at Scopus - N. Guex and M. C. Peitsch, “SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling,”
*Electrophoresis*, vol. 18, no. 15, pp. 2714–2723, 1997. View at Publisher · View at Google Scholar · View at Scopus - H. J. Berendsen, J. P. M. Postma, W. F. Van Gunsteren, and J. Hermans, “Interaction models for water in relation to protein hydration,”
*Intermolecular Forces*, vol. 14, pp. 331–342, 1981. View at Google Scholar - C. Oostenbrink, A. Villa, A. E. Mark, and W. F. Van Gunsteren, “A biomolecular force field based on the free enthalpy of hydration and solvation: the GROMOS force-field parameter sets 53A5 and 53A6,”
*Journal of Computational Chemistry*, vol. 25, no. 13, pp. 1656–1676, 2004. View at Publisher · View at Google Scholar · View at Scopus - U. Omasits, B. Knapp, M. Neumann et al., “Analysis of key parameters for molecular dynamics of pMHC molecules,”
*Molecular Simulation*, vol. 34, no. 8, pp. 781–793, 2008. View at Publisher · View at Google Scholar · View at Scopus - S. O. Yesylevskyy, V. N. Kharkyanen, and A. P. Demchenko, “Hierarchical clustering of the correlation patterns: new method of domain identification in proteins,”
*Biophysical Chemistry*, vol. 119, no. 1, pp. 84–93, 2006. View at Publisher · View at Google Scholar · View at Scopus - T. Shibuya, “Fast hinge detection algorithms for flexible protein structures,”
*IEEE/ACM Transactions on Computational Biology and Bioinformatics*, vol. 7, no. 2, pp. 333–341, 2010. View at Publisher · View at Google Scholar · View at Scopus - W. Humphrey, A. Dalke, and K. Schulten, “VMD: visual molecular dynamics,”
*Journal of Molecular Graphics*, vol. 14, no. 1, pp. 33–38, 1996. View at Publisher · View at Google Scholar · View at Scopus - M. P. Allen,
*Introduction to Molecular Dynamics Simulation*, John von Neumann Institute for Computing, 2004. - W. Schreiner, R. Karch, B. Knapp, and N. Ilieva, “Relaxation estimation of RMSD in molecular dynamics immuno-simulations,”
*Computational and Mathematical Methods in Medicine*, vol. 2012, Article ID 173521, 9 pages, 2012. View at Publisher · View at Google Scholar - J. B. Clarage, T. Romo, B. K. Andrews, B. M. Pettitt, and G. N. Phillips Jr., “A sampling problem in molecular dynamics simulations of macromolecules,”
*Proceedings of the National Academy of Sciences of the United States of America*, vol. 92, no. 8, pp. 3288–3292, 1995. View at Google Scholar · View at Scopus - V. Daggett, “Long timescale simulations,”
*Current Opinion in Structural Biology*, vol. 10, no. 2, pp. 160–164, 2000. View at Publisher · View at Google Scholar · View at Scopus - B. Hess, “Convergence of sampling in protein simulations,”
*Physical Review E. Statistical, Nonlinear, and Soft Matter Physics*, vol. 65, no. 3, Article ID 031910, 2002. View at Publisher · View at Google Scholar · View at Scopus - L. J. Smith, X. Daura, and W. F. Van Gunsteren, “Assessing equilibration and convergence in biomolecular simulations,”
*Proteins. Structure, Function and Genetics*, vol. 48, no. 3, pp. 487–496, 2002. View at Publisher · View at Google Scholar · View at Scopus - A. Grossfield and D. M. Zuckerman, “Quantifying uncertainty and sampling quality in biomolecular simulations,”
*Annual Reports in Computational Chemistry*, vol. 5, pp. 23–48, 2009. View at Publisher · View at Google Scholar · View at Scopus - H. Flyvbjerg and H. G. Petersen, “Error estimates on averages of correlated data,”
*The Journal of Chemical Physics*, vol. 91, no. 1, pp. 461–466, 1989. View at Google Scholar · View at Scopus - C. L. McClendon, L. Hua, A. Barreiro, and M. P. Jacobson, “Comparing conformational ensembles using the Kullback-Leibler divergence expansion,”
*Journal of Chemical Theory and Computation*, vol. 8, pp. 2115–2126, 2012. View at Google Scholar - S. Kullback and R. A. Leibler, “On information and sufficiency,”
*Annals of Mathematical Statistics*, vol. 22, pp. 79–86, 1951. View at Google Scholar - Y. Mu, P. H. Nguyen, and G. Stock, “Energy landscape of a small peptide revealed by dihedral angle principal component analysis,”
*Proteins. Structure, Function and Genetics*, vol. 58, no. 1, pp. 45–52, 2005. View at Publisher · View at Google Scholar · View at Scopus