Abstract

Several secondary structures, such as π-helix and left-handed helix, have been frequently identified at protein ligand-binding sites. A secondary structure is considered to be constrained to a specific region of dihedral angles. However, a comprehensive analysis of the correlation between main chain dihedral angles and ligand-binding sites has not been performed. We undertook an extensive analysis of the relationship between dihedral angles in proteins and their distance to ligand-binding sites, frequency of occurrence, molecular potential energy, amino acid composition, van der Waals contacts, and hydrogen bonds with ligands. The results showed that the values of dihedral angles have a strong preference for ligand-binding sites at certain regions in the Ramachandran plot. We discovered that amino acids preceding the ligand-prefer ϕ/ψ box residues are exposed more to solvents, whereas amino acids following ligand-prefer ϕ/ψ box residues form more hydrogen bonds and van der Waals contacts with ligands. Our method exhibited a similar performance compared with the program Ligsite-csc for both ligand-bound structures and ligand-free structures when just one ligand-binding site was predicted. These results should be useful for the prediction of protein ligand-binding sites and for analysing the relationship between structure and function.

1. Introduction

The two main chain dihedral torsion angles that describe the rotations of the polypeptide backbone around the bonds between N-Cα () and Cα-C () were identified by Ramakrishnan and Ramachandran [1]. These two torsion angles provide flexibility for the polypeptide backbone to adopt a fixed fold because the third possible torsion angle between C-N () is almost flat and fixed at 180°. The application of these two torsion angles, which describe the protein backbone conformation approach, has been widespread. The Ramachandran plot, which has remained unchanged for fifty years, provides a simple view of the distribution of the two torsion angles in protein structures [2]. The two dihedral torsion angles have also been applied in fields such as secondary structure assignment and protein structure refinement [35].

Secondary structure refers to highly regular local sub-substructures in proteins, of which α-helix and β-sheet are the two main types. In 1951, Pauling et al. first defined these secondary structures using the hydrogen pattern between the main chain backbone amino (NH) and carbonyl (CO) groups [6]. Although initially defined by a hydrogen pattern, a secondary structure exhibits a regular geometry that is constrained to a specific region of dihedral angles in the Ramachandran plot [7]. With the exception of the secondary structure assignment program DSSP, which only employs hydrogen bonding information [8], a dozen structure assignments programs using geometric features of local substructures and Cα atoms have been proposed [3, 9] and programs that employ hydrogen bonds and geometrical restraints have also been applied [7]. Although every program has its benefit, DSSP and STRIDE are the most popular secondary structure assignment programs [7, 8]. Several categories of secondary structures have been noted to occur more frequently in the functional site, especially in the ligand site, including the π-helix [1012], the left-handed helix [13], and the 310-helix in membrane proteins [14], stretches of amino acids with unusual backbone conformations are also frequently observed at ligand-binding sites [15]. These provided insightful heuristics for predicting protein ligand-binding site, but previous research did not explore the correlation between the local amino acid geometric features and ligand-binding site in detail.

Proteins perform their biological functions by binding to other molecules. The binding partner, which is commonly referred to as a ligand, may consist of small organic/inorganic molecules, metals and macromolecules, such as protein or DNA. In this paper, we only consider organic molecules as ligands. The identification of ligand-binding site, especially the primary residues in ligand-binding site, is an important step towards the characterization of their molecular function and rational drug design [16]. Numerous methods have been developed to address this problem; they can be categorized into two groups: sequence-based methods and structure-based methods [17]. Sequence-based methods explore the sequence conservation in proteins under the assumption that ligand-binding site sequences are conversed in the evolution process. Structure-based methods employ geometry criteria to detect a concave region on the surfaces of proteins that forms a surface-solvent-surface event. Due to an increase in the number of known protein ligand complexes in the Protein Data Bank (PDB) [18], some programs utilize known protein ligand structures as templates. Previous studies have revealed that specific backbone conformations are likely to be a part of ligand-binding site and that the magnitude of dihedral angles may undergo slight changes after ligand binding [13, 15, 19]; however, no method has considered the conformation of amino acids in ligand-binding site prediction. Therefore, an extensive survey of the correlations between the value of amino acid dihedral angles and ligand-binding sites was conducted. This information was also employed in the prediction of ligand-binding sites. The discovery of the preference of certain dihedral angle values not only provides a comprehensive overview of amino acid conformation features in protein ligand-binding sites but also facilitates the design of binding site prediction methods.

2. Method and Materials

2.1. Hydrogen Bond

The hydrogen bonds in a structure were calculated using the program HBPLUS [20]. To identify the hydrogen bonds, this program locates all proximal donor (D) and acceptor (A) pairs that satisfy specified geometrical criteria for hydrogen bond formation. The current criteria are as follows: dist (H-A) < 2.7 Å, dist (D-A) < 3.35 Å, angle (D-H-A) > 90°, angle (H-A-AA) > 90°, where AA is the atom attached to the acceptor.

2.2. van der Waals (vdW) Contact

If the distance between the nonhydrogen atom A1 and the nonhydrogen atom A2 satisfies the following criteria,where vdW(A) is the van der Waals radius of A, then A1 and A2 are considered to be in vdW contact.

2.3. Solvent-Accessible Surface Area (ASA)

The ASA was calculated using the program NACCESS [21]. The default probe size was employed, and any water molecules, hydrogen, or remaining HET groups in the PDB files were disregarded (including the default behaviour). As dihedral angles are determined by the main chain atoms, side chain atoms were not involved in our ASA calculation. The relative accessibility of the main chain of each residue was calculated as the percentage accessibility and was compared with the accessibility of the residue type in an extended ALA-x-ALA tripeptide (for amino acids).

2.4. Molecular Potential Energy for Residues

The molecular potential energy for different dihedral angle residues was calculated by the program Open Babel Obenergy (using the AMBER force field) [21]. The positions of the amino acid atoms were directly extracted from the PDB files, and the theoretical hydrogen atom positions of the residues were calculated with the REDUCE program [22]. The equation for calculating the energy for each residue is as follows:where the variables correspond to the bond, angle, torsion, vdW force, and electrostatic force in the mechanical force field, which were evaluated with a nonbonded cut-off.

2.5. Three Residue Levels

We classified the binding site residues into three levels. Level 1 residues are strict binding site residues, which are defined as residues that are in direct contact with the ligand, that is, at least one pair of nonhydrogen atoms—an atom from the residue and an atom from the ligand—is positioned with 4 Å distance. If any atom of a residue is positioned at a distance that is less than 6 Å from any atom of the ligand, the residue can be identified as a larger-scale ligand-binding site residue, that is, level 2 residues. Residues are assigned as level 3 residues if the nearest distance between a ligand and residue is less than 15 Å. A binding site secondary structure corresponds to a secondary structure that contains at least one level 2 residue.

2.6. Database of Protein Structure

To analyse the protein structures and evaluate the performance of our ligand-binding site prediction method, we downloaded a set of ligand-bound proteins that were determined by X-ray crystallography at a maximum resolution of 2.0 Å; each structure has a maximum identity of 70%. For multichain proteins, the chains share a maximum sequence identity of 30%; otherwise, only one chain is retained. For the selection of ligands, we searched the PDB file for structures with ligands, which are listed in the HETATM (hetero atom) records. We excluded metal ions and inorganic anions, such as Na+, Ca2+, Cl, , and , from our definition of ligands. Of the 8,189 chain structures, we randomly selected set , which contains 1000 chain structures as our prediction method testing set and defined set as the remaining 7,189 chain structures. The length of the chains in set varies from 22 residues to 1083 residues. To prevent the influence of the geometric size of structures, especially very large complexes, that is, to achieve greater uniformity among the structures, we only retained residues with at least one nonhydrogen atom that is less than 15 Å from the nearest ligand. Nagy and Oostenbrink [4] classified the Ramachandran map into 19 distinct regions on plots based on the observed cluster centre and the density map of (). These regions were used to classify our ligand-prefer boxes.

3. Result and Discussion

Left-handed helices and π-helices are typical secondary structure types that prefer to stay in the ligand-binding site. Of 31 verified left-handed helices (a minimum of four consecutive residues), Novotoy reported that 27 of the 31 left-handed helices perform an important role either for stability or for the function of the protein [13]. π-helices were tended to be associated with a function and ligand-binding site as they were evolutionarily derived from the insertion of a single residue into an α-helix [11].

We employed the left-handed helix assignment criteria ( of the residues in the left-handed helix fell between 30° and 130°, and of the residues lay between −50° and 100°) and the hydrogen bond information calculated by DSSP (version 2.2.1) to detect left-handed helices [13]. SECSTR, a program specifically developed to improve the detection of π-helix, was employed to assign π-helix in this paper [12].

A total of 6,238 π-helix residues (assigned by SECSTR) with at least one atom less than 15 Å from its nearest ligand were detected. The red-edged box (−90° < < −45°, −65° < < −37°), which is centred in the α1 region defined by the DISICL program, contains 3,263 residues (Figure S1 in Supplementary Material available online at http://dx.doi.org/10.1155/2015/757495). A total of 692 residues (21.2% of 3,263) in the red box were detected at a ligand-binding site compared with 813 (27.3% of 2,975) residues outside the red box, which have at least an atom at a distance less than 6 Å from the ligand. Although the probability for both of these two regions reside at a ligand-binding site exceeds the average level (19.8%), the difference between them is significant (21.2% compared with 27.3%). After searching all structures in the set , 88.2% of the residues in the region in the red-edged box were assigned as α-helices. Although the α-helix is the most common secondary structure, existing data have not determined a correlation with protein functions. The divergence suggests that π-helix residues have variant preferences at binding sites with different backbone dihedral angles.

A comparison between left-handed helix residues and non-left-handed helix residues in the same region as left-handed helix residues is provided. First, we determined the probability for left-handed helix residues in different boxes observed at the ligand-binding site, which are denoted as coloured boxes in Figure 1; boxes with fewer than five residues were excluded. Second, we calculated the probability for non-left-handed residues with dihedral angles in the same region as left-handed helix residues in the Ramachandran plot, which are denoted by the identically coloured boxes observed in the ligand-binding site (Figure 1(b)). Of the 1,328 left-handed helix residues detected in set , 426 were located at a ligand-binding site compared with 4,557 out of the total of 15,263 non-left-handed helix residues (78% of the non-left-handed helix residues were assigned as “Turn” or “Bend” by DSSP). A higher probability at the ligand-binding site was observed for the left-handed helix residues (32.1% compared with 29.8%) because a left-handed helix requires two consecutive amino acid dihedral angles that are positioned in the coloured region. Figure 1 shows the detailed probabilities for left-handed helix and non-left-handed helix residues in the same region.

The examples of π-helices and left-handed helices suggest that residues in a π-helix exhibit different performances, which correlate with a ligand as their dihedral angle changes. The residues in a specified region (coloured boxes in Figure 1) yield similar probabilities of detection at a ligand-binding site. These findings inspired us to search the database to determine whether other ligand-prefer Ramachandran regions exist, such as left-handed helix dihedral angle regions, which have a preference for protein ligand-binding sites, instead of focusing on the secondary structure level, as noted in previous studies [1014].

We employ the probability as a measure for a Ramachandran box that is observed at a ligand-binding site. is calculated by the number of level 2 residues divided by the total number of level 3 residues, and the dihedral angles of both of these level 2 residues are located in the Ramachandran box. Boxes that consist of less than 20 level 3 residues were excluded. A total of 972,773 level 3 residues, of which 192,606 are level 2 binding site residues, with an average probability for residues of 19.8%, were detected at a ligand-binding site. The probability for every Ramachandran box is shown in Figure 2(a).

To prevent random occurrence of Ramachandran boxes that have high values themselves but low neighbours, we also considered the neighbours of the boxes. The top 35% of boxes according to value (with %) were selected as central high value boxes, and the neighbours’ average values must be in the top 45% in terms of value for all boxes. Thus, ligand-prefer Ramachandran boxes are defined as follows: if the Ramachandran box % and the average probability for the four neighbouring boxes (up, down, left, and right boxes) exceeds 26%, the five boxes are defined as ligand-prefer Ramachandran boxes. Combined with Ramachandran regions, DISICL has defined our ligand-prefer Ramachandran boxes as distributed among nine Ramachandran regions (Figure 2(b)). Among all 1884 boxes in Figure 2(a), 827 (43.3%) boxes have a probability greater than 26%, whereas level 2 binding site residues in these 827 boxes comprise 12.4% of all level 2 residues. The distributions of the number of boxes and the level 2 binding site residues for different probability levels are shown in Figure 3.

When examining the composition of the level 2 residues in nine ligand-prefer regions, the total number of level 2 residues in ligand-prefer Ramachandran region VIII is twice as large or more in terms of the other eight regions (Table 1). Asp occurs most frequently in regions I, II, III, and IV, with a minimum probability of 35% in the ligand-binding site in these four regions. Level 2 His demonstrates the second, third, third, and second largest contributions to region I to IV; its propensity at the ligand-binding site is always in the top three in all nine regions. Gly is notable because it accounts for 25% of regions VII and VIII and 72.1% of region IX. Cys has the largest probability detected at a ligand-binding site from regions VI to IX; however, the number of Cys from regions VII to IX is relatively low. Ala, Lys, and Pro have relatively low propensities for ligand-binding site occurrence, with the exception of Ala in region IX with a probability of 37.3% at the ligand-binding site. The secondary structure assigned by DISICL suggests that the secondary structures for the ligand-prefer boxes are promiscuous and do not show a preference for specific secondary structures. We define the probability value as a measure of propensity at a ligand-binding site, where AA is the amino acid and is the ligand-prefer Ramachandran region index; will be employed in our ligand-binding site prediction scoring function. Figure S2 shows the distribution of 20 amino acids in each region and the probabilities observed at ligand-binding sites. For statistical analysis, ligand-preference of different regions is compared using a two-tailed Wilcoxon Rank-Sum test; the values for ligand-preference for any two ligand-prefer Ramachandran regions are available in Table S3. Nine ligand-prefer Ramachandran regions all demonstrate significant difference with the other region (the region except for the nine regions in the Ramachandran plot); however, only a few values are less than 0.05 within any two of the nine regions

3.1. Molecular Potential Energy

In 1991, Herzberg reported that sterically strained () residues are energetically unfavourable by calculating the energy for N-acetyl-N′-methylalanyl amide with geometry optimization of bonds, bond angles, and torsions [23]. A detailed energy comparison was also performed for ligand-prefer region residues in each region, and the average level was calculated using all residues from set . The program Open Babel Obenergy was employed for the energy calculations [24]. The exploration of a specific amino acid in different regions with slight variations in energy is insignificant because the potential energy function is empirical and several limitations produce inaccuracies in the calculated potential energy. As shown in Table 2, residues in the nine regions have similar or slightly higher potential energy compared with the average level, with several exceptions. Ala, Gly, Ser, His, and Lys have molecular potential energy within 10 KJ/mol compared with the average level, which shows that these residues have low divergence in energy in different regions. The highest molecular energy for Val, Leu, Phe, Gly, and Asp is observed in region IV, whereas the highest energy for Met, Try, His, Glu, and Asn is observed in region VII. Region IX consists of the four highest energy amino acids: Cys, Ser, Gln, and Arg. The energy increases by 45.9 KJ/mol in region VII and 32.2 KJ/mol in region IX compared with the average level for Met.

3.2. Solvent-Accessible Surface Area (ASA)

The two dihedral angles are determined by the backbone atoms of proteins. To understand solvent-accessible areas for ligand-prefer box residues and their neighbours, we calculated the relative accessibility for the backbone of ligand-prefer box residues () and their neighbours (, ). As depicted in Figure 4, the ASA for the () amino acid has a significantly higher relative accessibility surface compared with the remaining two positions, which indicates that ligand-prefer box residues are buried more and their previous residues are exposed more to solvent.

3.3. Hydrogen Bonds and vdW Contacts

Hydrogen bonds and vdW contacts are the two main noncovalent contacts between ligands and amino acids. Figure 5 shows the average vdW contacts and hydrogen bonds formed by residues (, , ) and ligands; ligand-prefer box residues are at position . All amino acids have the most number of vdW contacts with a ligand at position , with an additional 1.5 and 1.6 contacts/residue for Asp at position compared to position and position . For the amino acids in positions and , the number of vdW contacts with ligands is similar, with the exception of Trp and Arg. Residues at position have more than 0.8 vdW contacts/residue with ligand than the other two positions, with Ile, Val, Met, Asp and Arg, and Lys having 1.4 more vdW contacts/residue with ligand. Obviously, residues at position are capable of providing more contacts with ligand than positions and .

Figure 5(b) delineates the average number of hydrogen bonds established by ligand and residues at three positions as mentioned above. Almost all residues at position formed more hydrogen bonds/residue with ligand, while only Pro at position established the greatest number of hydrogen bonds/residue with compound. Two irregular amino acids are Pro and His, with fewer hydrogen bonds at position . Half of the 20 amino acids at position have twice or more hydrogen bonds/residue with ligand than the other two positions.

In contrast with our previous assumptions, these ligand-prefer box residues do not achieve greater direct interaction with ligands, and the special geometry conformation results in the amino acids following ligand-prefer box residues forming more hydrogen bonds and vdW contacts with ligands.

3.4. Prediction of Protein Ligand-Binding Sites Based on Dihedral Angles

We demonstrated the ability of dihedral angle-based prediction, as previously discussed, in the context of blind prediction and the Ligsite-csc program [25]. The prediction comparisons were made using the set , which consists of 1000 protein structures with a maximum identity of 70%. If any two chains in a protein had an identity greater than 70%, we only employed the first occurrence chain in the PDB file. Ligands that bind to other chains were excluded.

The prediction procedure followed a sequence of three steps. First, we employed the Ligsite-csc program to locate solvent grids in a protein. Ligsite-csc calculates all grids that encompass the protein structure. These grids are divided into three categories: protein grids, surface grids, and solvent grids. A protein grid has at least one protein atom within 1.6 Å. Surface grids have a Connolly vertex within 1.0 Å, and all other grids are characterized as solvent grids [25]. Our method and Ligsite-csc only used the solvent grid points in ligand-binding site prediction. Second, we assigned the score to each grid point :where is defined in (3) and is the total number of ligand-prefer Ramachandran box residues that are positioned less than 6 Å from the grid point . Last, we sorted the grid points in descending order based on the scoring . The prediction results corresponded to the top-scoring grid point.

The predictions were evaluated based on the distance between the top-scoring grid and the actual position of the ligand; that is, a prediction was assumed to be correct if the distance was less than the cut-off threshold value, which varies from 1 to 10 Å. For a given protein structure, we only considered the top-scoring grid point. If the values of any atoms in the ligand were less than the cut-off threshold value from the point, the prediction was assumed to be correct. The success rate was defined as the number of correctly predicted proteins divided by the total number of dataset structures.

Our method was compared with Ligsite-csc (an extension of Ligsite), which identifies pockets based on the notion of surface-solvent-surface events and the degree of conservation of the involved surface residues [25]. Ligsite-csc performs slightly better than other predictors, such as Ligsite, CAST, PASS, and SURFNET [25]. We also implemented a baseline predictor by randomly selecting a grid point that was indicated as a solvent grid by Ligsite-csc.

We also created a set of 362 structurally distinct ligand-free proteins that share more than 95% structural similarity with the ligand-bound form. This was achieved by examination of the ligand-bound-free pairs from the Comsin database [26]. The ligand-bound complexes were superimposed onto their corresponding ligand-free proteins. The ligand coordinates were extracted for ligand-free structure ligand-binding site prediction.

Gunasekaran and Nussinov reported that the magnitude of the dihedral angle changes is minimal after ligand binding [19], which explains why our method has a performance similar to that of Ligsite-csc for both ligand-bound and ligand-free protein ligand-binding site prediction when only one potential pocket is predicted, as shown in Figure 6. The success rate of our method is even higher than that of Ligsite-csc when the distance threshold value is set to 2 Å; however, the success rate of our method increases at a slower rate than that of Ligsite-csc when the distance increases. Our method and Ligsite-csc are both superior to random selection. A total of 82 ligand-binding sites (cut-off threshold = 4 Å) were successfully predicted by our method but could not be detected by Ligsite-csc even when three potential pocket sites were predicted (Table S4). An example is shown in Figure 7; the top prediction by Ligsite-csc is 13 Å from the ligand, and the distance between our prediction grid and the compound is only 1.7 Å. For the remaining top five grids predicted by Ligsite-csc, the shortest distance between the grids and the small molecule was 18 Å. We also provided a comparison for a particular binding site: “HEM” (protoporphyrin IX containing Fe) binding site (Table S5). The results indicate that our method is a useful tool for ligand-binding site prediction, especially for predicting a site that is less than 2 Å from a ligand.

4. Conclusions

We have enumerated the ligand-prefer Ramachandran boxes, for which residues have a high probability of being observed in the ligand-binding site, and classified these boxes into nine regions. Instead of direct contact with ligands, residues preceding ligand-prefer Ramachandran boxes are exposed more to solvent compared with the residues following ligand-prefer Ramachandran boxes, which form more vdW contacts and hydrogen bonds with ligands. This pattern suggests that residues in ligand-prefer Ramachandran boxes and their preceding amino acids facilitate subsequent residue contact with ligands. Residues in ligand-prefer Ramachandran boxes are irregular; common secondary elements for these residues are “undefined” as assigned by DISICL. The relative propensity observed at ligands for residues with specific values should aid in the identification of binding sites in proteins.

Our score function in ligand-binding site prediction is based on the propensities of amino acids. Typically, Cys is heavily weighted when its angle is located in a region V Ramachandran ligand-prefer boxes due to its high propensity (73.2%). Cys has a high propensity in all nine ligand-prefer Ramachandran regions, varying from 41.8% to 69.7%. Several algorithms have been published for predicting ligand-binding sites, and critical information, such as information about geometry, amino acid composition, physical potential, and ligand-binding residues that are conserved in the evolutionary process, has been employed for predictions. We first demonstrate a practical application using the residue angle in the context of blind prediction and the program Ligsite-csc. Our analysis reveals that a scanning method based on the simple propensity of the angle performs as well as Ligsite-csc when one ligand-binding site is predicted. The use of the angle to predict ligand-binding sites can be a useful tool for various aspects of drug discovery.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Supplementary Materials

Supplementary material consists of five tables, and two figures: Figure S1 shows the distribution of π-helix residue dihedral angles while the distribution of 20 amino acids in each region and the probabilities observed at ligand-binding sites is shown in Figure S2. Table S1 illustrates the ϕ/ψ boundaries for the nine regions and Table S2 describes abbreviation codes for secondary structures assigned by DISISL. P-values of Wilcoxon Rank-Sum test for the ligand-preferences on the ten Ramachandran regions are shown in Table S3. Table S4 lists 82 ligand-binding sites that Ligsite-csc fails to detect but our method successfully predicts, and Table S5 displays a comparison of Ligsite-csc and our method at the “HEM” binding site.

  1. Supplementary Material