Abstract

Recent advances in understanding protein folding have benefitted from coarse-grained representations of protein structures. Empirical energy functions derived from these techniques occasionally succeed in distinguishing native structures from their corresponding ensembles of nonnative folds or decoys which display varying degrees of structural dissimilarity to the native proteins. Here we utilized atomic coordinates of single protein chains, comprising a large diverse training set, to develop and evaluate twelve all-atom four-body statistical potentials obtained by exploring alternative values for a pair of inherent parameters. Delaunay tessellation was performed on the atomic coordinates of each protein to objectively identify all quadruplets of interacting atoms, and atomic potentials were generated via statistical analysis of the data and implementation of the inverted Boltzmann principle. Our potentials were evaluated using benchmarking datasets from Decoys-‘R’-Us, and comparisons were made with twelve other physics- and knowledge-based potentials. Ranking 3rd, our best potential tied CHARMM19 and surpassed AMBER force field potentials. We illustrate how a generalized version of our potential can be used to empirically calculate binding energies for target-ligand complexes, using HIV-1 protease-inhibitor complexes for a practical application. The combined results suggest an accurate and efficient atomic four-body statistical potential for protein structure prediction and assessment.

1. Introduction

Over recent years, exponential growth of the Protein Data Bank (PDB) [1] has facilitated the selection of larger, nonredundant subsets of experimentally solved protein structures at higher resolutions, which in turn have provided the data used in developing more effective knowledge-based statistical potentials for improved structure prediction. In contrast to physics-based energy functions, statistical potentials generally perform better and are more computationally efficient at identifying the native structure as a global minimum [2, 3]. Distance-dependent statistical potentials often focus on pairwise atomic contacts within macromolecular structures [4, 5]; however, such energy functions fail to take into consideration important higher-order contributions based on multibody interactions [6, 7]. Indeed, use of an “atomic environment potential” for which neighborhood sizes vary by atom previously demonstrated improved performance at discriminating between native and near-native protein structures [3]. In the present work we employed the well-established computational geometry tiling technique of Delaunay tessellation [8], for objectively identifying all quadruplets of nearest neighbor atoms in order to develop, evaluate, and apply all-atom four-body statistical potentials for protein structure prediction.

Four-body statistical potentials were derived based on PDB atomic coordinate file data corresponding to single chains selected from over 1400 diverse protein structures. Delaunay tessellation was applied to the three-dimensional (3D) atomic coordinates of each protein chain, whereby atoms were treated as vertices to generate a convex hull encompassing thousands of space-filling, nonoverlapping, irregular tetrahedra (Figure 1). For assurance that each tetrahedron identifies at its four vertices a quadruplet of atoms that are pairwise all within a prescribed distance from one another, a subsequent edge-length cutoff parameter may be introduced; removal of a tetrahedral edge between a pair of atoms longer than this cutoff eliminates from the tessellation all tetrahedra sharing that edge. Depending on the size of the atomic alphabet used for labeling points, the four atoms appearing at vertices of any particular tetrahedron in these tessellations represent one of 35 ( letters), 330 (), or 8855 () possible distinct atomic quadruplet types. For each cutoff (if any) and alphabet size, statistical data obtained from the protein chains and their tessellations included the following: () observed relative frequencies of interaction for each type of atomic quadruplet, based on their rates of occurrence as tetrahedral vertices; and () rates expected by chance for each atomic quadruplet type, based on relative frequencies of individual atom types in the protein chains and use of a multinomial reference distribution. Through application of the inverted Boltzmann principle [9, 10], the negative logarithm of the ratio of observed to expected rates of occurrence was used to calculate an empirical energy of interaction for each atomic quadruplet, which collectively form an atomic four-body statistical potential.

The approach implemented here at the atomic level was motivated by its prior successful application at the residue level [1116]. All atomic 3D coordinates in proteins are considered in this work to generate the all-atom four-body statistical potentials, while previously developed residue-based four-body potentials used only a single point per amino acid (e.g., or residue center of mass). Clearly, there is a degree of information loss with the coarser-grained residue representation of proteins relative to the finer all-atom representation. Both approaches implement the Delaunay tessellation algorithm, which uses the respective point-sets to serve as vertices for generating a tetrahedral tiling of the protein structure that objectively identifies quadruplets of nearest neighbors (i.e., either residues or atoms). Given its significantly more sparse point-set, a residue-based (i.e., one point per residue) tessellation typically yields a few hundred tetrahedra, whereas tessellation applied to all the atoms in the same protein structure has on the order of a few thousand tetrahedra.

Upon selecting an atomic alphabet and edge-length cutoff as parameters, the energy of any folded protein chain would subsequently be calculated with the atomic four-body potential as follows: label and tessellate the 3D atomic coordinates of the structure according to the same parameters, refer to the previously derived atomic four-body potential under those parameters to assign a score to each tetrahedron in the tessellation equal to the interaction energy of the atomic quadruplet found at its four vertices, and add up the scores of all the tetrahedra in the tessellation. The all-atom four-body statistical potentials that we developed were each evaluated by scoring multiple decoy directories in the Decoys-‘R’-Us benchmarking database [17]. We compared these four-body potentials to one another, based on standard performance metrics, as well as to the knowledge-based potentials of Fogolari et al. [18] and Summa et al. [3]; the latter study detailed performance results for 10 diverse physics- and knowledge-based potentials to conduct their own comparisons, hence providing us an opportunity to assess our four-body potentials relative to a dozen other methods in total. Lastly, we report on a practical application, related to predicting target-inhibitor binding energy, by implementing a modification of our best performing four-body potential.

2. Methods

2.1. Protein Training Set

A nonredundant set of 1417 high-resolution (≤2.2 Å) crystallographic structures, with atomic coordinate files deposited in the PDB, were culled using the PISCES server [28] with the constraint that the single protein chains selected from the structures shared low (<30%) sequence identity (http://binf2.gmu.edu/automute/tessellatable1417.txt). The ensemble of structure files is diverse, consisting of single- and multichain proteins, the vast majority of which are additionally complexed to small molecular or peptide ligands. Coordinates of hydrogen atoms and water molecules were removed from all files prior to proceeding with the analyses.

2.2. Designation of Atoms

For each of the 1417 protein chains, three alphabets were explored for defining atom types and labeling points corresponding to their 3D atomic coordinates. In the first instance, a simple four-letter alphabet (C, N, O, and S) accounts for all atoms and ensures sufficient frequency data are collected for all possible atomic quadruplets observed at the four vertices of tetrahedra in Delaunay tessellations (Table 1). Clearly, the same atom type may appear at more than one of the four vertices of any tetrahedron in a protein tessellation, and given that those vertices are unordered, all permutations of the four atomic letters at the vertices of a tetrahedron refer to the same quadruplet, so that an alphabetical ordering (e.g., COON) of the atoms can be used as a singular representation. In this case, a combinatorial argument [29] shows that the number of distinct subsets of size letters that can be formed from an atomic alphabet of size is given byHence, letters admit distinct atomic quadruplets. Next, an atomic alphabet consisting of letters (amino acid backbone: , , , and ; side-chain: , , , and ) differentiates between backbone alpha- and carbonyl-carbon atoms, distinguishes residue backbone atoms from those in side-chains, and can form distinct atomic quadruplets. Lastly, we explored a maximum diversity of quadruplet atomic interactions with letters as described in Summa et al. [3], which groups atoms based on common traits, including bonding pattern, partial charge, and hydrophobicity, and generates distinct atomic quadruplets.

2.3. Derivation of the Atomic Four-Body Statistical Potentials

Delaunay tessellations for the 1417 single protein chains were generated by submitting their respective atomic coordinates as input to the Qhull program [30], visualizations of the tessellated structures were obtained by utilizing the output data from Qhull to create plots within Matlab, and molecular graphics were produced with Chimera [31] (Figure 1). An in-house suite of Perl programs was used for all data formatting and analyses related to the tessellated structures (Table 1). In particular, for each atomic alphabet of size the relative frequencies of occurrence for all types of atomic quadruplets (, , , ) were calculated as the proportion of tetrahedra among all the tessellations for which the four atoms appear on the vertices. Four separate sets of relative frequencies were calculated for each of the three atomic alphabets explored, based on the original protein tessellations (no edge-length cutoff applied), as well as tessellations modified by introducing cutoffs of length 12 Å, 8 Å, and 4.8 Å. The use of an 8 Å cutoff is consistent with that used by other researchers to generate atomic pair potentials [32], while the other two cutoffs were also selected to identify the appropriate choice for an atomic four-body potential.

For each of the three atomic alphabets, we additionally computed relative frequencies of occurrence , for the atom types in all 1417 single protein chains. These frequencies, in turn, were needed for calculating the rate expected by chance for all types of atomic quadruplets (, , , ) obtained using a multinomial reference distribution, given byIn the formula above, represents the number of occurrences of atom type in the quadruplet. For each pair of parameters selected (i.e., alphabet size and cutoff), we applied the inverted Boltzmann principle to calculate a score for quantifying the interaction energy for all types of atomic quadruplets (, , , ), as described by Sippl [9, 10], thus defining a particular atomic four-body statistical potential function (Table 2 and Figure 2). A total of 12 four-body potentials were generated and evaluated in this study (3 atomic alphabets × 4 edge-length cutoffs, which includes the case where no cutoff is applied to the original tessellations).

In any atomic tessellation of a protein structure, two adjacent tetrahedra may share a common vertex (1 atom), a common edge (2 atoms), or a common triangular face (3 atoms). Although the two adjacent tetrahedra represent two sets of atomic quadruplets that may share up to 3 atoms in common, those quadruplets are distinct by virtue of the atom(s) that the two tetrahedra do not share. The collective interaction of a quadruplet of atoms as a fundamental unit in this four-body scenario is analogous to the interaction of two atoms in the development of a pair potential, whereby a given atom may be considered to interact with each of several neighboring atoms by virtue of satisfying a prescribed distance cutoff between itself and each of the neighbors, and therefore the atom is shared by all of those pairs; likewise, two atomic quadruplets from adjacent tetrahedra in a tessellation may share up to 3 atoms and yet remain fundamentally distinct quadruplets. Moreover, since Delaunay tessellation does not distinguish types of bonds and generates a tetrahedral tiling by objectively identifying quadruplets of nearest neighbor atoms based solely on their six collective pairwise distances from each other, all covalent bonds as well as noncovalent interactions between particular pairs of atoms are included together in these tetrahedral atomic quadruplets without the need to explicitly identify and segregate them. Recent studies suggest that covalent interactions are informative when combined with nonbonded interactions [33, 34].

2.4. Decoy Database

A significant collection of models provided in the Decoys-‘R’-Us database (http://compbio.buffalo.edu/dd/) form a well-established and challenging standard for benchmarking the performance of energy functions. Several categories are located under the heading “The multiple decoy sets,” each containing a number of decoy model directories. Each such directory is named after the PDB accession code of the native crystallographic protein structure and contains coordinate files for that native structure as well as for numerous decoy model structures (i.e., alternative conformations for a given native structure); additionally, the directory includes a file that provides the root mean square deviations (rmsds) for all the alternative models relative to the native structure. For this work, we focused on the following decoy set categories: 4_state_reduced, fisa, fisa_casp3, hg_structal, ig_structal, ig_structal_hires, lattice_ssfit, and lmds.

3. Results

3.1. Energy Calculations and Benchmark Evaluation Measures

Energy calculations were made for 145 native protein structures as well as for all of their respective decoy models downloaded from the 8 decoy set categories in the Decoys-‘R’-Us database. To this end, all native and decoy structures were tessellated, and their energies were repeatedly computed using all twelve four-body potentials under their respective parameters of atomic alphabet size and tessellation edge-length cutoff. Given the energy scores for a native structure and its collection of decoys, all calculated using the same four-body potential, the following measures of performance were evaluated.

(1) Native Rank. Among the native protein and all decoys, the structures are ranked in ascending order according to increasing energy (i.e., lowest energy structure has rank 1).

(2) Z-Score. This measurement is defined aswhere is energy of the native structure, is mean energy over all decoy models, and is standard deviation of the distribution of decoy energies [3]. A large positive -score indicates a wide gap between the energy of the native protein and the mean decoy energy.

(3) Correlation Coefficient (r). It is the linear correlation between calculated energy and rmsd. For decoys with low rmsds relative to the native structure, good correlation is preferable; however, this is unlikely if decoys are significantly misfolded with high rmsds.

(4) Fractional Enrichment (FE). It is the proportion of decoy structures corresponding to the lowest 10% of rmsds that are also found among those corresponding to the lowest 10% of calculated energy scores.

The raw performance data obtained with the four-body potential derived using a 4-letter atomic alphabet and a 12 Å tessellation edge-length cutoff as parameters (Table 2) are presented in Table 3. As such, we employed the same 4-letter alphabet to label the atomic vertices in the tessellations of the 145 native structures and all of their respective decoys, and edges longer than 12 Å were removed from all tessellations prior to calculating total energies as described in the last paragraph of the Introduction. Data analogous to that of Table 3 were obtained using each of the 11 other four-body potentials generated for this study under alternative parameter-pair values for atomic alphabet size and tessellation edge-length cutoff (raw data not shown).

The plots of energy versus rmsd in Figure 3, based on 4 native proteins and their collections of decoys evaluated with a varied selection of four-body potentials that we generated, are illustrative of the strengths and weaknesses of the performance measures defined above. In particular, since 4state_reduced is known to contain native-like alternative conformations for each protein in the set, reasonably good correlation () and fractional enrichment (FE > 10%) are expected from a reliable energy function [18, 35], and this is illustrated by the plot for 4pti. Next, ig_structal_hires and hg-structal contain decoys built by homology modeling for immunoglobulin (ig) and globin (hg) proteins, all of which are native-like structures with very low rmsds relative to native [18]. The plots for 1fvc and 1hdaB reflect the expected strong correlation and fractional enrichment; additionally, despite the fact that the native protein and very low rmsd decoys all have a good chance of achieving the lowest energy conformation, both native proteins rank 1 for these examples. Finally, the set lattice_ssfit consists of decoys selected with an all-atom energy function and refined using coarse lattice models, and rmsd > 4 Å for all decoys in this set relative to their native proteins [18, 36]. The plot for 1beo shows that, as expected in such cases of significantly misfolded decoys, there is no correlation between energy and rmsd relative to the native structure; furthermore, the fractional enrichment is low and the -score is relatively large, as commonly encountered by such decoys and suggested by the plot.

3.2. Four-Body Potentials: Relative Performance

To effectively rank all twelve four-body potentials generated for this study, first we identified, for each of the 145 native proteins and their respective decoys, the best native rank and largest -score, correlation coefficient, and fractional enrichment values obtained, without regard to which potential yielded those optimal values of the performance measures. Next, for each potential separately, we counted the number of times (out of 145) that the potential either matched or singularly provided each optimal value recorded for a performance measure concerning a native protein and its set of decoys (Table 4, numbers above parentheses). For each performance measure, we then ranked these counts across all the potentials (Table 4, numbers in parentheses); subsequently for each potential separately, we averaged its rankings across the four performance measures (Table 4, next to bottom row). Finally, those averaged ranks were used to generate an overall ranking of the twelve four-body potentials (Table 4, bottom row). The ranking approach based on relative performance employed here and in the subsequent section was inspired by the technique described in Summa et al. [3] for comparing the performance of their potential to other related methods.

In general, four-body potentials derived using a 4-letter atomic alphabet ranked highest, followed by those based on letters, while potentials generated using 8 atom types ranked poorly over all four choices of tessellation edge-length cutoff parameter values. Among the 4-letter alphabet potentials, the one based on full structure tessellations (i.e., no edge-length cutoff) outperformed that using a 12 Å cutoff; however, the latter case is preferable since, without a fixed cutoff, false-positive atomic quadruplet interactions are admitted into the analyses based on those tessellations. A satisfying solution to this dilemma is revealed in the subsequent section as these four-body potentials are compared to those developed by other research groups.

3.3. Relative Performance: Comparisons with Related Methods

Next, an approach similar to that described in the previous section is used to individually compare each of our 12 four-body potentials to those of a dozen related methods. Using the Decoys-‘R’-Us database, Summa et al. [3] compared their “atomic environment potential” to ten other well-known physics- and knowledge-based potentials, providing us with valuable raw data to make comparisons with our four-body potentials. In addition, Fogolari et al. [18] developed an energy function employing two centers of interactions per amino acid. They also used the Decoys-‘R’-Us database to evaluate its performance, and these data are also included in our evaluations. Out of the 145 decoy sets that we used for benchmarking our four-body potentials relative to one another, 129 sets overlap with those used by both of those studies and form the basis of comparisons reported here. Lastly, the ten related methods investigated by Summa et al. [3] for comparing relative performance and used by us for a similar purpose include the following: three taken from the AMBER force field (a simple van der Waals potential, a pairwise electrostatic potential term, and the sum of these two terms, the latter representing the entire nonbonded contact energy of a typical molecular mechanics force field without either an explicit or implicit solvent model) [19]; two taken from CHARMM19 (both a van der Waals and a coulombic term) [20]; the and potentials of Delarue and Koehl [21], and the potential of Koehl and Delarue [22]; and distance-dependent atomic potentials RAPDF [23] and DFIRE [24].

For each of the 129 decoy sets common to the studies, we obtained the raw performance data (i.e., native rank, -score, correlation, coefficient, and fractional enrichment as presented in Table 3 for one of the four-body potentials) generated by each of the twelve methods described above. Next, we selected one of our four-body potentials and included its raw performance data, for a total of 13 methods to be compared. With every decoy set, we identified the best native rank achieved and the largest values obtained for -score, correlation coefficient, and fractional enrichment, without regard to which of the 13 methods was responsible for each optimal measurement. For each of these 13 methods separately, we counted the number of times (out of 129) that the method either matched or singularly provided each optimal value recorded for a performance measure concerning a native protein and its set of decoys (Table 5, numbers to the left of those in parentheses). For each performance measure, we then ranked these counts across all 13 methods (Table 5, numbers in parentheses); subsequently for each method separately, we averaged its rankings across the four performance measures (Table 5, next to last column). Finally, those averaged ranks were used to generate an overall ranking of the 13 methods (Table 5, last column).

The overall rankings in Table 5 reveal that our four-body potential, derived using a 4-letter alphabet and 12 Å cutoff as parameters, outperformed 9 other methods, tied in overall ranking (3rd) with the coulombic term from CHARMM19, and was outperformed by DFIRE and the “atomic environment potential” of Summa et al. The methodology described in the previous paragraph was repeated separately for each of the twelve four-body potentials that we investigated, and the overall rankings in each case are reported in the columns of Table 6. Four-body potentials employing a 4-letter alphabet again appear to be the most competitive, and in particular those derived using unmodified tessellations (i.e., no edge-length cutoff) and a 12 Å cutoff achieved the highest overall rankings (3rd) among all of the four-body potentials, each in comparison to the 12 related state-of-the-art methods. As mentioned in the prior section, the introduction of false-positive atomic quadruplet interactions into the analyses is a concern when edge-length cutoffs are not considered after tessellation. Given that both of these potentials are equally competitive when compared with the other related methods, we conclude that a 4-letter alphabet and 12 Å tessellation edge-length cutoff provide the best pair of parameters with which to derive a four-body potential for calculating protein structure energies and effectively distinguishing native folds from nonnative decoy structures.

4. Discussion

Energy calculations for single protein chains have been the sole focus up to this point, so it has been appropriate to consider atomic alphabets based only on the four heavy atom types found in proteins: carbon, nitrogen, oxygen, and sulfur. As mentioned earlier in Methods, a training set of 1417 diverse structures of single chain proteins were used for deriving the four-body potentials; however, the atomic coordinate for these protein chains was each obtained from a distinct PDB coordinate file, and the vast majority of these files are for structures of proteins complexed to small molecular or peptide ligands. Therefore, in order to tessellate the entirety of each of these PDB files, an expansion of the atomic alphabet is necessary to accommodate all atom types. The fact that such tessellations have an important function will become apparent as an application is introduced for predicting target-ligand binding affinities.

4.1. Generalized Four-Body Potential: An Alphabet Incorporating All Atom Types

Given the impressive performance on protein structures by the four-body potential derived using a 4-letter atomic alphabet and 12 Å tessellation edge-length cutoff, we simply expanded the alphabet to 6 letters in order to include atoms found exclusively in molecular ligands: M = all metals and X = all nonmetals other than (, , , ). The atomic frequency data and total number of tetrahedra generated by tessellating the totality of the atomic coordinate data in the 1417 PDB structure files (hydrogen atoms and water molecules excluded, as discussed in Methods), after filtering out edges longer than 12 Å, are provided in Table 7. Since we are now working with a = 6 letter alphabet, the atoms at the four vertices of each tetrahedron of a tessellation represent one of = 126 atomic quadruplets. A retracing of the steps described in Methods yields the all-atom four-body statistical potential presented in Table 8. Note that 11 of the 126 atomic quadruplet types are not represented by any of the 36,406,467 tetrahedra obtained from the tessellations.

4.2. Application: Target-Ligand Binding Affinity Prediction

The four-body potential derived in the previous section can be used to calculate the energy of any macromolecular structure. First, the 3D atomic coordinates of the structure are each labeled using the 6-letter alphabet and those points are tessellated subject to a 12 Å cutoff, then each tetrahedron in the tessellation is scored according to the atomic quadruplet identified at its four vertices by referring to the four-body potential previously derived in Table 8, and finally, the scores of all the tetrahedra are added up to determine the energy of the structure. Using the notation (i.e., total potential) to refer to the energy of a structure calculated in this way, we empirically calculate target-ligand binding affinity in the following manner (Figure 4):(1)Tessellate the entire macromolecular complex and calculate .(2)Tessellate only atomic coordinates for the target protein and calculate .(3)The calculated target-ligand binding affinity is given by the difference The above formula is a simplified model that is valid in the case of small ligands for which tetrahedra formed at the protein interface dominate any purely internal quadruplet atomic interactions within the ligand; hence, the relative energy contribution of the ligand is negligible [37].

4.3. Example: Predicting HIV-1 Protease-Inhibitor Binding Energy

To validate the approach for empirically calculating binding affinity, PDB accession codes and experimental binding energies were obtained from Jenwitheesuk and Samudrala [25] for twenty-five HIV-1 protease-inhibitor complexes (Table 9); they converted experimental inhibition constants () to experimental binding energies (, Gibbs free energy of binding, in units of kcal/mol) by applying the equation , where is the gas constant (1.987 cal K−1 mol−1) and is the absolute temperature (room temperature, 300 K). By following the steps outlined in the previous section, we determined for each of these complexes and used it as the calculated binding energy. As shown in Figure 5, the experimental and calculated binding energies for these complexes were highly correlated (). From a subsequent search of the Binding MOAD database [26, 27], we identified 115 additional HIV-1 protease-inhibitor complexes with experimental structures in PDB, for which experimental inhibition constants are available (Table 9). As before, values were obtained and used for representing the calculated binding energy, and values were converted to experimental binding energies. The correlation remained robust () when these data were combined with those of the initial plot (Figure 5).

5. Conclusion

In this study, we derived and evaluated twelve distinct atomic four-body knowledge-based statistical potentials for protein structure prediction, by altering two parameter values: atom type (4-, 8-, or 20-letter alphabets) and distance cutoff for atomic interactions (none, 12 Å, 8 Å, or 4.8 Å). The best potential employed a simple 4-letter atomic alphabet and considered any quadruplet of atoms to be interacting when they were all pairwise within 12 Å of each other. In a head-to-head comparison of methods using 129 benchmarks from the Decoys-‘R’-Us database, our potential ranked 3rd and was outperformed by only two out of twelve other state-of-the-art methods. In addition to its simplicity and relative accuracy, our method is faster and more efficient in general, with some of the other physics- and knowledge-based potentials used for comparison employing well over one hundred different atom types. Future plans for improvement include combining this four-body potential together with other knowledge-based potentials, as well as subsequently implementing them together in conjunction with statistical machine learning tools.

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this paper.