Research Article  Open Access
Majid Masso, "AllAtom FourBody KnowledgeBased Statistical Potentials to Distinguish Native Protein Structures from Nonnative Folds", BioMed Research International, vol. 2017, Article ID 5760612, 17 pages, 2017. https://doi.org/10.1155/2017/5760612
AllAtom FourBody KnowledgeBased Statistical Potentials to Distinguish Native Protein Structures from Nonnative Folds
Abstract
Recent advances in understanding protein folding have benefitted from coarsegrained representations of protein structures. Empirical energy functions derived from these techniques occasionally succeed in distinguishing native structures from their corresponding ensembles of nonnative folds or decoys which display varying degrees of structural dissimilarity to the native proteins. Here we utilized atomic coordinates of single protein chains, comprising a large diverse training set, to develop and evaluate twelve allatom fourbody statistical potentials obtained by exploring alternative values for a pair of inherent parameters. Delaunay tessellation was performed on the atomic coordinates of each protein to objectively identify all quadruplets of interacting atoms, and atomic potentials were generated via statistical analysis of the data and implementation of the inverted Boltzmann principle. Our potentials were evaluated using benchmarking datasets from Decoys‘R’Us, and comparisons were made with twelve other physics and knowledgebased potentials. Ranking 3rd, our best potential tied CHARMM19 and surpassed AMBER force field potentials. We illustrate how a generalized version of our potential can be used to empirically calculate binding energies for targetligand complexes, using HIV1 proteaseinhibitor complexes for a practical application. The combined results suggest an accurate and efficient atomic fourbody statistical potential for protein structure prediction and assessment.
1. Introduction
Over recent years, exponential growth of the Protein Data Bank (PDB) [1] has facilitated the selection of larger, nonredundant subsets of experimentally solved protein structures at higher resolutions, which in turn have provided the data used in developing more effective knowledgebased statistical potentials for improved structure prediction. In contrast to physicsbased energy functions, statistical potentials generally perform better and are more computationally efficient at identifying the native structure as a global minimum [2, 3]. Distancedependent statistical potentials often focus on pairwise atomic contacts within macromolecular structures [4, 5]; however, such energy functions fail to take into consideration important higherorder contributions based on multibody interactions [6, 7]. Indeed, use of an “atomic environment potential” for which neighborhood sizes vary by atom previously demonstrated improved performance at discriminating between native and nearnative protein structures [3]. In the present work we employed the wellestablished computational geometry tiling technique of Delaunay tessellation [8], for objectively identifying all quadruplets of nearest neighbor atoms in order to develop, evaluate, and apply allatom fourbody statistical potentials for protein structure prediction.
Fourbody statistical potentials were derived based on PDB atomic coordinate file data corresponding to single chains selected from over 1400 diverse protein structures. Delaunay tessellation was applied to the threedimensional (3D) atomic coordinates of each protein chain, whereby atoms were treated as vertices to generate a convex hull encompassing thousands of spacefilling, nonoverlapping, irregular tetrahedra (Figure 1). For assurance that each tetrahedron identifies at its four vertices a quadruplet of atoms that are pairwise all within a prescribed distance from one another, a subsequent edgelength cutoff parameter may be introduced; removal of a tetrahedral edge between a pair of atoms longer than this cutoff eliminates from the tessellation all tetrahedra sharing that edge. Depending on the size of the atomic alphabet used for labeling points, the four atoms appearing at vertices of any particular tetrahedron in these tessellations represent one of 35 ( letters), 330 (), or 8855 () possible distinct atomic quadruplet types. For each cutoff (if any) and alphabet size, statistical data obtained from the protein chains and their tessellations included the following: () observed relative frequencies of interaction for each type of atomic quadruplet, based on their rates of occurrence as tetrahedral vertices; and () rates expected by chance for each atomic quadruplet type, based on relative frequencies of individual atom types in the protein chains and use of a multinomial reference distribution. Through application of the inverted Boltzmann principle [9, 10], the negative logarithm of the ratio of observed to expected rates of occurrence was used to calculate an empirical energy of interaction for each atomic quadruplet, which collectively form an atomic fourbody statistical potential.
The approach implemented here at the atomic level was motivated by its prior successful application at the residue level [11–16]. All atomic 3D coordinates in proteins are considered in this work to generate the allatom fourbody statistical potentials, while previously developed residuebased fourbody potentials used only a single point per amino acid (e.g., or residue center of mass). Clearly, there is a degree of information loss with the coarsergrained residue representation of proteins relative to the finer allatom representation. Both approaches implement the Delaunay tessellation algorithm, which uses the respective pointsets to serve as vertices for generating a tetrahedral tiling of the protein structure that objectively identifies quadruplets of nearest neighbors (i.e., either residues or atoms). Given its significantly more sparse pointset, a residuebased (i.e., one point per residue) tessellation typically yields a few hundred tetrahedra, whereas tessellation applied to all the atoms in the same protein structure has on the order of a few thousand tetrahedra.
Upon selecting an atomic alphabet and edgelength cutoff as parameters, the energy of any folded protein chain would subsequently be calculated with the atomic fourbody potential as follows: label and tessellate the 3D atomic coordinates of the structure according to the same parameters, refer to the previously derived atomic fourbody potential under those parameters to assign a score to each tetrahedron in the tessellation equal to the interaction energy of the atomic quadruplet found at its four vertices, and add up the scores of all the tetrahedra in the tessellation. The allatom fourbody statistical potentials that we developed were each evaluated by scoring multiple decoy directories in the Decoys‘R’Us benchmarking database [17]. We compared these fourbody potentials to one another, based on standard performance metrics, as well as to the knowledgebased potentials of Fogolari et al. [18] and Summa et al. [3]; the latter study detailed performance results for 10 diverse physics and knowledgebased potentials to conduct their own comparisons, hence providing us an opportunity to assess our fourbody potentials relative to a dozen other methods in total. Lastly, we report on a practical application, related to predicting targetinhibitor binding energy, by implementing a modification of our best performing fourbody potential.
2. Methods
2.1. Protein Training Set
A nonredundant set of 1417 highresolution (≤2.2 Å) crystallographic structures, with atomic coordinate files deposited in the PDB, were culled using the PISCES server [28] with the constraint that the single protein chains selected from the structures shared low (<30%) sequence identity (http://binf2.gmu.edu/automute/tessellatable1417.txt). The ensemble of structure files is diverse, consisting of single and multichain proteins, the vast majority of which are additionally complexed to small molecular or peptide ligands. Coordinates of hydrogen atoms and water molecules were removed from all files prior to proceeding with the analyses.
2.2. Designation of Atoms
For each of the 1417 protein chains, three alphabets were explored for defining atom types and labeling points corresponding to their 3D atomic coordinates. In the first instance, a simple fourletter alphabet (C, N, O, and S) accounts for all atoms and ensures sufficient frequency data are collected for all possible atomic quadruplets observed at the four vertices of tetrahedra in Delaunay tessellations (Table 1). Clearly, the same atom type may appear at more than one of the four vertices of any tetrahedron in a protein tessellation, and given that those vertices are unordered, all permutations of the four atomic letters at the vertices of a tetrahedron refer to the same quadruplet, so that an alphabetical ordering (e.g., COON) of the atoms can be used as a singular representation. In this case, a combinatorial argument [29] shows that the number of distinct subsets of size letters that can be formed from an atomic alphabet of size is given byHence, letters admit distinct atomic quadruplets. Next, an atomic alphabet consisting of letters (amino acid backbone: , , , and ; sidechain: , , , and ) differentiates between backbone alpha and carbonylcarbon atoms, distinguishes residue backbone atoms from those in sidechains, and can form distinct atomic quadruplets. Lastly, we explored a maximum diversity of quadruplet atomic interactions with letters as described in Summa et al. [3], which groups atoms based on common traits, including bonding pattern, partial charge, and hydrophobicity, and generates distinct atomic quadruplets.
 
Same counts regardless of atomic alphabet size. 
2.3. Derivation of the Atomic FourBody Statistical Potentials
Delaunay tessellations for the 1417 single protein chains were generated by submitting their respective atomic coordinates as input to the Qhull program [30], visualizations of the tessellated structures were obtained by utilizing the output data from Qhull to create plots within Matlab, and molecular graphics were produced with Chimera [31] (Figure 1). An inhouse suite of Perl programs was used for all data formatting and analyses related to the tessellated structures (Table 1). In particular, for each atomic alphabet of size the relative frequencies of occurrence for all types of atomic quadruplets (, , , ) were calculated as the proportion of tetrahedra among all the tessellations for which the four atoms appear on the vertices. Four separate sets of relative frequencies were calculated for each of the three atomic alphabets explored, based on the original protein tessellations (no edgelength cutoff applied), as well as tessellations modified by introducing cutoffs of length 12 Å, 8 Å, and 4.8 Å. The use of an 8 Å cutoff is consistent with that used by other researchers to generate atomic pair potentials [32], while the other two cutoffs were also selected to identify the appropriate choice for an atomic fourbody potential.
For each of the three atomic alphabets, we additionally computed relative frequencies of occurrence , for the atom types in all 1417 single protein chains. These frequencies, in turn, were needed for calculating the rate expected by chance for all types of atomic quadruplets (, , , ) obtained using a multinomial reference distribution, given byIn the formula above, represents the number of occurrences of atom type in the quadruplet. For each pair of parameters selected (i.e., alphabet size and cutoff), we applied the inverted Boltzmann principle to calculate a score for quantifying the interaction energy for all types of atomic quadruplets (, , , ), as described by Sippl [9, 10], thus defining a particular atomic fourbody statistical potential function (Table 2 and Figure 2). A total of 12 fourbody potentials were generated and evaluated in this study (3 atomic alphabets × 4 edgelength cutoffs, which includes the case where no cutoff is applied to the original tessellations).

(a)
(b)
In any atomic tessellation of a protein structure, two adjacent tetrahedra may share a common vertex (1 atom), a common edge (2 atoms), or a common triangular face (3 atoms). Although the two adjacent tetrahedra represent two sets of atomic quadruplets that may share up to 3 atoms in common, those quadruplets are distinct by virtue of the atom(s) that the two tetrahedra do not share. The collective interaction of a quadruplet of atoms as a fundamental unit in this fourbody scenario is analogous to the interaction of two atoms in the development of a pair potential, whereby a given atom may be considered to interact with each of several neighboring atoms by virtue of satisfying a prescribed distance cutoff between itself and each of the neighbors, and therefore the atom is shared by all of those pairs; likewise, two atomic quadruplets from adjacent tetrahedra in a tessellation may share up to 3 atoms and yet remain fundamentally distinct quadruplets. Moreover, since Delaunay tessellation does not distinguish types of bonds and generates a tetrahedral tiling by objectively identifying quadruplets of nearest neighbor atoms based solely on their six collective pairwise distances from each other, all covalent bonds as well as noncovalent interactions between particular pairs of atoms are included together in these tetrahedral atomic quadruplets without the need to explicitly identify and segregate them. Recent studies suggest that covalent interactions are informative when combined with nonbonded interactions [33, 34].
2.4. Decoy Database
A significant collection of models provided in the Decoys‘R’Us database (http://compbio.buffalo.edu/dd/) form a wellestablished and challenging standard for benchmarking the performance of energy functions. Several categories are located under the heading “The multiple decoy sets,” each containing a number of decoy model directories. Each such directory is named after the PDB accession code of the native crystallographic protein structure and contains coordinate files for that native structure as well as for numerous decoy model structures (i.e., alternative conformations for a given native structure); additionally, the directory includes a file that provides the root mean square deviations (rmsds) for all the alternative models relative to the native structure. For this work, we focused on the following decoy set categories: 4_state_reduced, fisa, fisa_casp3, hg_structal, ig_structal, ig_structal_hires, lattice_ssfit, and lmds.
3. Results
3.1. Energy Calculations and Benchmark Evaluation Measures
Energy calculations were made for 145 native protein structures as well as for all of their respective decoy models downloaded from the 8 decoy set categories in the Decoys‘R’Us database. To this end, all native and decoy structures were tessellated, and their energies were repeatedly computed using all twelve fourbody potentials under their respective parameters of atomic alphabet size and tessellation edgelength cutoff. Given the energy scores for a native structure and its collection of decoys, all calculated using the same fourbody potential, the following measures of performance were evaluated.
(1) Native Rank. Among the native protein and all decoys, the structures are ranked in ascending order according to increasing energy (i.e., lowest energy structure has rank 1).
(2) ZScore. This measurement is defined aswhere is energy of the native structure, is mean energy over all decoy models, and is standard deviation of the distribution of decoy energies [3]. A large positive score indicates a wide gap between the energy of the native protein and the mean decoy energy.
(3) Correlation Coefficient (r). It is the linear correlation between calculated energy and rmsd. For decoys with low rmsds relative to the native structure, good correlation is preferable; however, this is unlikely if decoys are significantly misfolded with high rmsds.
(4) Fractional Enrichment (FE). It is the proportion of decoy structures corresponding to the lowest 10% of rmsds that are also found among those corresponding to the lowest 10% of calculated energy scores.
The raw performance data obtained with the fourbody potential derived using a 4letter atomic alphabet and a 12 Å tessellation edgelength cutoff as parameters (Table 2) are presented in Table 3. As such, we employed the same 4letter alphabet to label the atomic vertices in the tessellations of the 145 native structures and all of their respective decoys, and edges longer than 12 Å were removed from all tessellations prior to calculating total energies as described in the last paragraph of the Introduction. Data analogous to that of Table 3 were obtained using each of the 11 other fourbody potentials generated for this study under alternative parameterpair values for atomic alphabet size and tessellation edgelength cutoff (raw data not shown).
 
Native rank = (rank of native structure with given PDB ID)/(total number of decoys); a rank of 1 is optimal and means the calculated energy of the native structure is lower than that of all its decoys; = correlation coefficient; FE = fractional enrichment. 
The plots of energy versus rmsd in Figure 3, based on 4 native proteins and their collections of decoys evaluated with a varied selection of fourbody potentials that we generated, are illustrative of the strengths and weaknesses of the performance measures defined above. In particular, since 4state_reduced is known to contain nativelike alternative conformations for each protein in the set, reasonably good correlation () and fractional enrichment (FE > 10%) are expected from a reliable energy function [18, 35], and this is illustrated by the plot for 4pti. Next, ig_structal_hires and hgstructal contain decoys built by homology modeling for immunoglobulin (ig) and globin (hg) proteins, all of which are nativelike structures with very low rmsds relative to native [18]. The plots for 1fvc and 1hdaB reflect the expected strong correlation and fractional enrichment; additionally, despite the fact that the native protein and very low rmsd decoys all have a good chance of achieving the lowest energy conformation, both native proteins rank 1 for these examples. Finally, the set lattice_ssfit consists of decoys selected with an allatom energy function and refined using coarse lattice models, and rmsd > 4 Å for all decoys in this set relative to their native proteins [18, 36]. The plot for 1beo shows that, as expected in such cases of significantly misfolded decoys, there is no correlation between energy and rmsd relative to the native structure; furthermore, the fractional enrichment is low and the score is relatively large, as commonly encountered by such decoys and suggested by the plot.
3.2. FourBody Potentials: Relative Performance
To effectively rank all twelve fourbody potentials generated for this study, first we identified, for each of the 145 native proteins and their respective decoys, the best native rank and largest score, correlation coefficient, and fractional enrichment values obtained, without regard to which potential yielded those optimal values of the performance measures. Next, for each potential separately, we counted the number of times (out of 145) that the potential either matched or singularly provided each optimal value recorded for a performance measure concerning a native protein and its set of decoys (Table 4, numbers above parentheses). For each performance measure, we then ranked these counts across all the potentials (Table 4, numbers in parentheses); subsequently for each potential separately, we averaged its rankings across the four performance measures (Table 4, next to bottom row). Finally, those averaged ranks were used to generate an overall ranking of the twelve fourbody potentials (Table 4, bottom row). The ranking approach based on relative performance employed here and in the subsequent section was inspired by the technique described in Summa et al. [3] for comparing the performance of their potential to other related methods.
 
Numbers above parentheses in each row reflect how many decoy sets (out of 145) for which the given potential matches the best performance value achieved among all 12 potentials tested; numbers in parentheses are the rankings of the counts in that row; = correlation coefficient; FE = fractional enrichment. 
In general, fourbody potentials derived using a 4letter atomic alphabet ranked highest, followed by those based on letters, while potentials generated using 8 atom types ranked poorly over all four choices of tessellation edgelength cutoff parameter values. Among the 4letter alphabet potentials, the one based on full structure tessellations (i.e., no edgelength cutoff) outperformed that using a 12 Å cutoff; however, the latter case is preferable since, without a fixed cutoff, falsepositive atomic quadruplet interactions are admitted into the analyses based on those tessellations. A satisfying solution to this dilemma is revealed in the subsequent section as these fourbody potentials are compared to those developed by other research groups.
3.3. Relative Performance: Comparisons with Related Methods
Next, an approach similar to that described in the previous section is used to individually compare each of our 12 fourbody potentials to those of a dozen related methods. Using the Decoys‘R’Us database, Summa et al. [3] compared their “atomic environment potential” to ten other wellknown physics and knowledgebased potentials, providing us with valuable raw data to make comparisons with our fourbody potentials. In addition, Fogolari et al. [18] developed an energy function employing two centers of interactions per amino acid. They also used the Decoys‘R’Us database to evaluate its performance, and these data are also included in our evaluations. Out of the 145 decoy sets that we used for benchmarking our fourbody potentials relative to one another, 129 sets overlap with those used by both of those studies and form the basis of comparisons reported here. Lastly, the ten related methods investigated by Summa et al. [3] for comparing relative performance and used by us for a similar purpose include the following: three taken from the AMBER force field (a simple van der Waals potential, a pairwise electrostatic potential term, and the sum of these two terms, the latter representing the entire nonbonded contact energy of a typical molecular mechanics force field without either an explicit or implicit solvent model) [19]; two taken from CHARMM19 (both a van der Waals and a coulombic term) [20]; the and potentials of Delarue and Koehl [21], and the potential of Koehl and Delarue [22]; and distancedependent atomic potentials RAPDF [23] and DFIRE [24].
For each of the 129 decoy sets common to the studies, we obtained the raw performance data (i.e., native rank, score, correlation, coefficient, and fractional enrichment as presented in Table 3 for one of the fourbody potentials) generated by each of the twelve methods described above. Next, we selected one of our fourbody potentials and included its raw performance data, for a total of 13 methods to be compared. With every decoy set, we identified the best native rank achieved and the largest values obtained for score, correlation coefficient, and fractional enrichment, without regard to which of the 13 methods was responsible for each optimal measurement. For each of these 13 methods separately, we counted the number of times (out of 129) that the method either matched or singularly provided each optimal value recorded for a performance measure concerning a native protein and its set of decoys (Table 5, numbers to the left of those in parentheses). For each performance measure, we then ranked these counts across all 13 methods (Table 5, numbers in parentheses); subsequently for each method separately, we averaged its rankings across the four performance measures (Table 5, next to last column). Finally, those averaged ranks were used to generate an overall ranking of the 13 methods (Table 5, last column).
 
Numbers not in parentheses in each column reflect how many decoy sets (out of 129) for which the given method matches the best performance value achieved among all 13 methods tested; numbers in parentheses are the rankings of the counts in that column; = correlation coefficient; FE = fractional enrichment. 
The overall rankings in Table 5 reveal that our fourbody potential, derived using a 4letter alphabet and 12 Å cutoff as parameters, outperformed 9 other methods, tied in overall ranking (3rd) with the coulombic term from CHARMM19, and was outperformed by DFIRE and the “atomic environment potential” of Summa et al. The methodology described in the previous paragraph was repeated separately for each of the twelve fourbody potentials that we investigated, and the overall rankings in each case are reported in the columns of Table 6. Fourbody potentials employing a 4letter alphabet again appear to be the most competitive, and in particular those derived using unmodified tessellations (i.e., no edgelength cutoff) and a 12 Å cutoff achieved the highest overall rankings (3rd) among all of the fourbody potentials, each in comparison to the 12 related stateoftheart methods. As mentioned in the prior section, the introduction of falsepositive atomic quadruplet interactions into the analyses is a concern when edgelength cutoffs are not considered after tessellation. Given that both of these potentials are equally competitive when compared with the other related methods, we conclude that a 4letter alphabet and 12 Å tessellation edgelength cutoff provide the best pair of parameters with which to derive a fourbody potential for calculating protein structure energies and effectively distinguishing native folds from nonnative decoy structures.
 
Note that the overall rankings in this column correspond to those in the final column of Table 5; the remaining columns in this table were obtained by repeating the data analyses that generated Table 5 with respect to each of the other 11 fourbody potentials. 
4. Discussion
Energy calculations for single protein chains have been the sole focus up to this point, so it has been appropriate to consider atomic alphabets based only on the four heavy atom types found in proteins: carbon, nitrogen, oxygen, and sulfur. As mentioned earlier in Methods, a training set of 1417 diverse structures of single chain proteins were used for deriving the fourbody potentials; however, the atomic coordinate for these protein chains was each obtained from a distinct PDB coordinate file, and the vast majority of these files are for structures of proteins complexed to small molecular or peptide ligands. Therefore, in order to tessellate the entirety of each of these PDB files, an expansion of the atomic alphabet is necessary to accommodate all atom types. The fact that such tessellations have an important function will become apparent as an application is introduced for predicting targetligand binding affinities.
4.1. Generalized FourBody Potential: An Alphabet Incorporating All Atom Types
Given the impressive performance on protein structures by the fourbody potential derived using a 4letter atomic alphabet and 12 Å tessellation edgelength cutoff, we simply expanded the alphabet to 6 letters in order to include atoms found exclusively in molecular ligands: M = all metals and X = all nonmetals other than (, , , ). The atomic frequency data and total number of tetrahedra generated by tessellating the totality of the atomic coordinate data in the 1417 PDB structure files (hydrogen atoms and water molecules excluded, as discussed in Methods), after filtering out edges longer than 12 Å, are provided in Table 7. Since we are now working with a = 6 letter alphabet, the atoms at the four vertices of each tetrahedron of a tessellation represent one of = 126 atomic quadruplets. A retracing of the steps described in Methods yields the allatom fourbody statistical potential presented in Table 8. Note that 11 of the 126 atomic quadruplet types are not represented by any of the 36,406,467 tetrahedra obtained from the tessellations.

