Advances in Physical Chemistry

Volume 2016 (2016), Article ID 3240674, 10 pages

http://dx.doi.org/10.1155/2016/3240674

## Energy Landscape of Pentapeptides in a Higher-Order Conformational Subspace

^{1}York Centre for Complex Systems Analysis (YCCSA), University of York, York YO10 5GE, UK^{2}Department of Chemistry, College of Science, Qassim University, Buraydah 52571, Saudi Arabia

Received 8 February 2016; Accepted 4 April 2016

Academic Editor: Dennis Salahub

Copyright © 2016 Karim M. ElSawy. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The potential energy landscape of pentapeptides was mapped in a collective coordinate principal conformational subspace derived from principal component analysis of a nonredundant representative set of protein structures from the PDB. Three pentapeptide sequences that are known to be distinct in terms of their secondary structure characteristics, (Ala)_{5}, (Gly)_{5}, and Val.Asn.Thr.Phe.Val, were considered. Partitioning the landscapes into different energy valleys allowed for calculation of the relative propensities of the peptide secondary structures in a statistical mechanical framework. The distribution of the observed conformations of pentapeptide data showed good correspondence to the topology of the energy landscape of the (Ala)_{5} sequence where, in accord with reported trends, the *α*-helix showed a predominant propensity at 298 K. The topography of the landscapes indicates that the stabilization of the *α*-helix in the (Ala)_{5} sequence is enthalpic in nature while entropic factors are important for stabilization of the *β*-sheet in the Val.Asn.Thr.Phe.Val sequence. The results indicate that local interactions within small pentapeptide segments can lead to conformational preference of one secondary structure over the other where account of conformational entropy is important in order to reveal such preference. The method, therefore, can provide critical structural information for* ab initio* protein folding methods.

#### 1. Introduction

Our understanding of sequence-structure relationships of proteins is increasing in both depth and breadth [1, 2]. Deeper insight is provided by detailed studies of protein folding and stability via sequence mutations and/or* de novo* protein design [3]. Breadth has been gained through a great tradition of serendipitous discovery [4, 5], supported in recent years by the systematised sampling strategies of the structural genomics programmes [6]. Our ability to predict protein structure from sequence is contingent on characteristics of the sequence: if a sequence possesses even weak homology to proteins of known structure, then a variety of methods can exploit this information to infer a structural model at a given level of detail with some (statistical) confidence [7]. If no homologies exist, our empirical knowledge of protein structure (and interactions) can yield good predictions for many (as yet) small proteins. However, pure (“physics-based”)* ab initio* predictions of the native fold of a protein from sequence information alone are still formidably difficult. There are two principal challenges that must be met in order to faithfully model the thermodynamics and kinetics of protein folding: the ability to accurately compute the energetics of solvated proteins and the difficulty of generating structural models which bridge the unfolded and folded states. The latter difficulty arises from the very large number of degrees of freedom in proteins that makes a systematic search of conformational space computationally intractable and thus enforces a conformational sampling approach [8].

The conformational space of a single amino acid in a peptide can be effectively described by the backbone torsion angles and the sterically “allowed” space represented on the 2D “Ramachandran plot” [9]. Can information on single pairs be combined to represent the conformational space of polypeptides? Flory’s isolated pair hypothesis [10] assumes that the conformational state of each pair of in a polypeptide is independent of the adjacent pairs and therefore the number of conformational states available to a polypeptide segment is simply the product of possible conformational states of the pairs of individual residues. The validity of this model for polypeptides and proteins is debatable [11, 12] as it appears to overestimate the number of conformational states and thus implies that nonlocal effects—manifesting in terms of coupling between pairs—are important. However, for longer polypeptides, with increasing number of pairs, a Ramachandran-type conformational analysis is not generally possible due to the high dimensionality. As a consequence, the predominant focus of conformational analysis of polypeptides has been in terms of single pairs [13, 14].

A more complete understanding of sequence-structure relationships in polypeptides requires us to move beyond statistical/geometrical considerations to include physical interactions [15]. From the perspective of energetics, it is the interactions between different residues within the polypeptide chain which distinguish the behaviour of one sequence from another. In recent years the energy landscape concept has emerged as a unifying language for experimentalists and theorists [16–18]. The energy landscape describes how the energy changes within the conformational space. It is the topographical and topological features of the energy landscape which determine the thermodynamic and kinetic properties of the system. Knowledge of the protein energy landscape is therefore indispensable for studying a variety of important system properties such as stable states, dynamics, and folding/unfolding kinetics [19]. Several simulation studies have addressed the problem of evaluating the energy landscape for polypeptides [15, 20–22]. A common element in such studies is the detection of the energy minima and their connectivity based on extensive sampling of the energy landscape.

The relatively small number of folds adopted by experimentally determined protein structures suggests that native folds are confined to a low dimensional manifold (subspace) of the nominally high dimensional protein fold space [23, 24]. Also, it appears that the intrinsic conformational dynamics of a given protein fold may be confined to a relatively low dimensional space [25–27]. In support of this view, a multivariate analysis of the geometry of polypeptide segments from protein crystal structures has revealed that the conformational space available over multiple (“higher-order”) segments is dramatically restricted over that expected from a consideration of individual pairs [28]. The origin of this low intrinsic dimensionality lies in the coupling of pairs within the segments as a consequence of steric packing constraints along the polypeptide chain.

In a previous work [29], we employed a similar multivariate approach to construct a low dimensional principal conformational subspace of single strand dinucleotide fragments from duplex DNA crystal structures. The low dimensionality was a consequence of the use of collective coordinates representing the coupled displacements of backbone torsion angles. The collective coordinates were used to map the underlying energy surface within the conformational subspace using an empirical interaction potential. A key aspect of this approach is that the energy landscape underlies a relevant region of conformational space representing the statistical distribution of the observed crystal structures. The method proved successful in predicting a new stable conformer of DNA and relative conformational propensities of different dinucleotide fragments and describing the dynamical behaviour of an oligomeric DNA sequence.

In this paper, we build on the approach developed in our previous work [29] in order to calculate the potential energy landscape of pentapeptides within the principal conformational space of known crystal structures [28]. Our aim is to gain a better physical understanding of the important sequence-structure relationships in polypeptides within a statistical mechanical framework. To achieve this, we generate a principal conformational subspace from a representative set of pentapeptide fragments extracted from structures from the protein databank. The resulting collective coordinates are then used to map the underlying potential energy surface for selected pentapeptide sequences. The topography and topology of the resulting energy surfaces are characterised and then used to derive the thermodynamic distribution of different bound conformational states. As a proof of principle we consider three pentapeptide sequences that are distinct in terms of their secondary structure characteristics: (Ala)_{5}, (Gly)_{5}, and Val.Asn.Thr.Phe.Val. The (Ala)_{5} sequence represents the canonical sequence for sequence-structure studies; (Gly)_{5} represents the extreme case of removing local steric hindrance along the polypeptide chain; and the Val.Asn.Thr.Phe.Val sequence is context dependent [30] as it appears once as part of an *α*-helix and once as part of a *β*-sheet. For the sequences considered, our model predicts relative conformational propensities that are comparable to those derived from the statistics of the sample crystal structure dataset. Thus, we present a framework for computing sequence-specific thermodynamic and kinetic properties of polypeptide fragments which we hope will have wider utility in studies concerned with the relationship of protein sequence, structure, and dynamics.

#### 2. Methods

##### 2.1. Data Acquisition and Preparation

Representative structures from the Protein Data Bank were created using PDBSELECT [31]. The dataset corresponds to structures of high resolution (≤1.0 ) with <25% sequence identity. The structures were disassembled into pentapeptide segments using a five residue sliding windows over contiguous chain segments with a lag of one residue. The descriptors of the backbone conformation for each segment were described by 10 variables representing the torsion angle pairs of each residue. A data matrix () was constructed from 9607 observations arranged row wise. In order to eliminate any potential bias towards conformers with high frequencies, matrix () was reduced to matrix () which comprised equally represented conformers. To achieve this, segments were categorized based on their secondary structure, in the full-length peptide chain, and the frequency of each distinct secondary structure category was determined. Secondary structure classification was carried out using the Sanders algorithm [32] as implemented in PROMOTIF [33]. Secondary structure categories with frequency less than 10 were discarded. For each distinct secondary structure category, ten conformers were selected at random and the corresponding torsions pairs were fed into matrix resulting in 1572 conformers.

##### 2.2. Principal Component Analysis (PCA)

In order to reduce the dimensionality of the conformational space of the pentapeptide segments, the data matrix () was subjected to principal component analysis (PCA) [34]. PCA and multidimensional scaling (MDS) are widely used as methods for reducing the dimensionality of multidimensional molecular structural data [35–37].

To avoid problems, such as periodicity and nonlinearity, in the calculation of variance for angular data [38], the data in the columns of the data matrix representing individual torsion angles were adjusted in an iterative fashion such that differences from their mean values lie in the range −180 to +180° using the Bio3d [39] package in R 3.1 [40]. This transformed the data matrix into a new data matrix . Principal component analysis was then conducted on matrix using the Bio3d package [39]. The principal components (or PCs) are mutually orthogonal collective variables that maximally describe the sample variance. The elements of the PCs (eigenvectors) are the coefficients of the linear combinations of the 10 torsion angles from which they are derived. Thus, the PCs (analogous to the normal coordinates derived from harmonic vibration analysis [41]) can be considered as collective coordinates for describing the pentapeptide conformational distribution via displacements from the mean torsion angles of the data matrix. The corresponding eigenvalues represent the variance of the conformational distribution along each of the PCs. The extent to which a given observation lies along the th principal component is given by the projection . These projections (or scores) may be used for visualising the distribution of observations in the principal conformational subspace (PCS).

Despite the duality of PCA and MDS in terms of the scores of the data matrix [42, 43], PCA offers advantages over MDS (as used by Sims et al. [28]) in view of the aims of this study. The principal components (eigenvectors) can be characterised directly as concerted atomic displacements [44] such that the magnitude and sign of the coefficients of the linear combinations reflect the correlations of the variables. Displacements along the principal components generate trajectories representing the structural displacements spanning the PCS. Also, PCA results in an analytical basis set (the principal components) which is used in the construction of structures within the PCS as required for the mapping of energy surface (see below).

##### 2.3. Mapping the Potential Energy Surface (PES)

The potential energy surface (PES) of a pentapeptide segment within the PCS was mapped via systematic energy evaluation on a grid defined by discrete points along the first three PCs. The coordinates of a grid point correspond to a set of torsion angles, . Therefore, represents the geometrical construction of structures within the subspace (whose projections along higher principal components are initially zero). Cartesian coordinates for the full pentapeptide fragment were reconstructed from these torsion angles supplemented by other standard internal coordinates (bond lengths and angles). The potential energy of the structures was calculated using the extended-atom CHARMM27 force field [45, 46]. No nonbonded interaction truncation was performed. A distance-dependent dielectric constant (, where is the interatomic distance in Å) [47] was used for the electrostatic terms. Use of distance-dependent dielectric constant is a computationally cheap approximation for the damping of the electrostatic interaction by the solvent. Incorporating the solvent effect into the energetics of the PES using more advanced methods such as Generalized Born [48] or APBS [49] is, however, computationally prohibitive due to the large number of structures used to construct the PES. The PES was mapped in 2D slices along PC3. A grid resolution of 20° in the subspace was used over the range −300° to 300° with respect to an origin corresponding to the mean structure. These search limits were chosen so as to encompass the range of scores spanned by the projected data matrix.

At each grid point, the backbone torsion angles were restrained to their desired values using a force constant of 100 kcal mol^{−1} degree^{−2} and the system was energy minimised using 200 steps of steepest descent followed by 2000 steps of the ABNR method [45]. The side chain conformation was then generated using SCWRL 3.0 program [50] and the system was minimised using 200 steps of steepest descent.

As a consequence of the discretization, the structure and energy of any point in the PCS are considered to be the structure and energy corresponding to the closest grid point within the resolution limit of the grid. Characterisation of the features of the PES and calculation of the thermodynamic properties for the different conformational states were conducted as described previously [29]. For the sake of clarity, we describe it briefly in here: the PES within the PCS was partitioned into a set of distinct energy valleys as follows: starting from a local minimum, a set of 3D isoenergetic contours are generated at increasing discrete energy levels. An energy valley is defined by the isoenergetic surface contained within the contour level preceding the one which encapsulates another energy minimum. The energy valley, therefore, extends up to the (saddle) point which separates it from other minima. This process continues until no further minima are detected. The volume of each energy valley was calculated using the convex hull algorithm [51] as implemented in MATLAB 7.1. The fractional population of an energy valley can be estimated from the ratio where is the NVT partition function given by

The energy levels, , within each energy valley were delineated by a set of concentric isoenergetic closed surfaces in the 3D PCS corresponding to consecutive energies of 0.5 kcal mol^{−1}. The degeneracy of the energy level was estimated by the volume enclosed between the two surfaces corresponding to and kcal mol^{−1}. The (local) partition function for each valley is then estimated via a summation over its discrete energy levels. Here, we use the term “energy level” in an* ad hoc* way since, in the classical regime, energy is continuous and not quantized. Enthalpic contribution to the fractional population of the PES valleys could be approximated by the relative energetics of their minima. The relative volumes of the valleys, however, are indicative of their relative entropic contribution [29, 52].

PCA, therefore, provides a basis set of linear collective coordinates describing the variance of observed structures within the PCS. The PCs allow for the construction of conformers within the PCS, interpolating and, optionally, extrapolating from the observed conformational distribution. Calculating the energy of conformers within the PCS results in a smooth continuous energy landscape. The collective nature of the basis set allows also for a straightforward definition of conformational reaction coordinates for complex biomolecules in contrast to the use of proxy coordinates (“order parameters"), which, are arbitrarily or heuristically chosen [53] and may not have a direct relationship to system geometry limiting interpretability and thus insight [54, 55]. Methods other than PCA could have also been used for reducing the dimensionality of the PES, such as principal coordinate analysis (PCoA) [56, 57] and locally nonlinear embedding (LLE) [58, 59]. In contrast to PCoA, PCA generates an orthogonal basis set that can be easily used as collective coordinates of the PES. Each of these collective coordinates is simply a linear combination of the original structural variables (e.g., torsion angles). Nonlinearity of the basis set generated by LLE, however, complicates the construction of the PES and makes interpretation of the PES topological features not straightforward.

#### 3. Results and Discussion

##### 3.1. A Principal Conformational Subspace for Pentapeptides

We generated a conformational space for pentapeptide segments extracted from protein crystal structures achieving a similar result to that of Sims et al. [28]. The collective coordinates (PCs) describing the conformational space naturally incorporate the coupling of pairs within the segment, which result in a reduction of dimensionality. The first three PCs were used as collective coordinates for defining a principal conformational subspace (PCS). Almost 60% of the total variance of the data matrix was captured by the first three PCs (34.5% along PC1, 18.4% along PC2, and 7.4% along PC3). Restricting the PCS to 3D is supported by the observation that the distribution of projections (scores) along higher components is virtually unimodal (data not shown). For geometrical characterisation of the collective coordinates refer to Supplementary Material available online at http://dx.doi.org/10.1155/2016/3240674. Projection of the pentapeptide data matrix () into the PCS reveals a clear separation of *α* and *β* secondary structures with intermediate conformational states lying in between (see Figure 1). The detailed description of the distribution of the projection of the data matrix within the PCS is discussed in relation to the features of the PES below.