BioMed Research International

Volume 2016 (2016), Article ID 4354901, 9 pages

http://dx.doi.org/10.1155/2016/4354901

## Identification of Hot Spots in Protein Structures Using Gaussian Network Model and Gaussian Naive Bayes

^{1}School of Computer and Information Engineering, Zhejiang Gongshang University, Hangzhou, Zhejiang 310018, China
^{2}School of Statistics and Mathematics, Zhejiang Gongshang University, Hangzhou, Zhejiang 310018, China
^{3}School of Community Health Sciences, University of Nevada Las Vegas, Las Vegas, NV 89154, USA

Received 21 August 2016; Revised 2 October 2016; Accepted 11 October 2016

Academic Editor: Guang Hu

Copyright © 2016 Hua Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Residue fluctuations in protein structures have been shown to be highly associated with various protein functions. The Gaussian network model (GNM), a simple and representative coarse-grained model, has been widely adopted to reveal function-related protein dynamics. We directly utilized the high frequency modes generated by GNM and further applied Gaussian Naive Bayes (GNB) to identify hot spot residues. Two feature-coding schemes were implemented with varying distance cutoffs for GNM and sliding window sizes for GNB based on tenfold cross validations: one using only a single high frequency mode and the other combining multiple modes with the highest frequencies. Our proposed methods outperformed previous work that did not directly utilize the high frequency modes generated by GNM, with regard to overall performance evaluated by the F-measure. Moreover, we found that including more high frequency modes in a GNB classifier can significantly improve the sensitivity. The present study provides additional valuable insights into the relation between hot spots and residue fluctuations.

#### 1. Introduction

Flexibility and dynamics play key roles for proteins in implementing various biological processes and functions [1, 2]. Residue fluctuations or atomic motions, contributing to large-scale conformational changes of protein structures, are shown to be closely related to functions of native proteins [3–5].

Two methods, molecular dynamics (MD) simulation and normal mode analysis (NMA), are widely used to investigate the dynamic link between protein structures and functions. The main drawback of MD simulations is their computational cost [6, 7]. Coarse-grained NMA, such as the elastic network model (ENM) [7], has been increasingly used in recent years as a powerful tool to elucidate the structure-encoded dynamics of biomolecules [2]. The ENMs, including the isotropic Gaussian network model (GNM) [8, 9] and the anisotropic network model [10], define spring-like interactions between residues that are within a certain cutoff distance. They simplify the computationally costly all-atom potentials into a quadratic function in the vicinity of the native state, which allows the decomposition of the motions into vibrational modes with different frequencies, often known as normal modes. Being simple and efficient, ENM and GNM have been validated in numerous applications that resulted in reasonable agreement with a wealth of experimental data, including prediction of X-ray crystallographic B-factors for amino acids [9, 11], identification of hot spots [12–14], catalytic sites [15], core amino acids stabilizing rhodopsin [16], and important residues of HLA proteins [17], elucidation of the molecular mechanisms of motor-protein motions [18], and general conformational changes and functions [3, 4, 19–31].

Previous studies have shown in many cases that the normal modes produced by GNM, including the high frequency (fast) modes and the low frequency (slow) modes, are very useful for recognizing several specific types of protein functions. In particular, the highest frequency modes, which reflect local events at the residue level, can be utilized to identify core residues or binding sites [16, 17, 20, 32], while the lowest frequency modes are usually responsible for the collective functional dynamics of the global protein motions [23, 33]. In the area of protein-protein interactions, several studies, such as Ozbek et al. [12], Haliloglu et al. [13], and Demirel et al. [14], utilized GNM to identify hot spots, defined as residues contributing more than 2 kcal/mol to the binding energy. Their results suggested that hot spots are predefined in the dynamics of protein structures and form the binding core of interfaces. However, in the work by Ozbek et al. [12] and by Haliloglu et al. [13] and Demirel et al. [14], it was the mean square distance fluctuations of residue pairs and the mean square fluctuations of residues calculated from the highest frequency modes, rather than the highest frequency modes themselves, that were applied to detect the hot spots.

In addition, several computational methods utilizing machine learning tools have been developed to predict hot spots from protein sequences and structures [34–37]. The advantage of learning methods is their ability to achieve higher prediction quality by integrating feature information extracted from protein structures. In this paper, we follow the work by Ozbek et al. [12] but focus on the direct usage of the highest frequency modes to investigate the relation between residue fluctuations and hot spots. The top 20 highest frequency modes from GNM were used as an original feature set and input into Gaussian Naive Bayes (GNB), as a representative learning method, to identify hot spots. The main purpose of this study is to examine whether the raw fast modes can be directly used to differentiate hot spots from non-hot spots and whether the utilization of learning methods can improve the identification quality of hot spots for unbound protein structures.

#### 2. Material and Methods

##### 2.1. Dataset

We used the dataset collected by Ozbek et al. [12]. This set was filtered with the PISCES culling server [38] at a sequence identity of 25% and was originally composed of 33 unbound protein structures. We had to remove one protein, with PDB ID 1lrp, from the dataset since its structure can no longer be found in the Protein Data Bank (PDB) [39]. Therefore, the final dataset had 32 unbound protein structures with a total of 4270 residues, of which 171 are hot spot residues. The dataset, including detailed information about the hot spot residues, can be obtained from Ozbek et al. [12].

##### 2.2. Gaussian Network Model and Its Applications to Identification of the Hot Spots

GNM describes each protein as an elastic network, where the springs connecting the nodes represent the bonded and nonbonded interactions between pairs of residues located within a cutoff distance [8, 9]. Assuming that the springs are harmonic and the residue fluctuations are isotropic and Gaussian, the network potential of the $N$ nodes (residues) in a protein structure is

$$V=\frac{\gamma}{2}\,\Delta\mathbf{R}^{T}\,\boldsymbol{\Gamma}\,\Delta\mathbf{R},$$

where $\Delta\mathbf{R}$ collects the fluctuations $\mathbf{R}_{ij}-\mathbf{R}_{ij}^{0}$, with $\mathbf{R}_{ij}$ and $\mathbf{R}_{ij}^{0}$ the instantaneous and original distance vectors between residues $i$ and $j$, respectively, $\gamma$ is the force constant assumed to be uniform for all network springs, and $\boldsymbol{\Gamma}$ is the Kirchhoff connectivity matrix defined as

$$\Gamma_{ij}=\begin{cases}-1, & i\neq j \text{ and } R_{ij}\le r_{c},\\[2pt] 0, & i\neq j \text{ and } R_{ij}> r_{c},\\[2pt] -\sum_{k,\,k\neq i}\Gamma_{ik}, & i=j,\end{cases}$$

where $R_{ij}$ is the distance between residues $i$ and $j$ and $r_{c}$ is a given cutoff. Then, the mean correlation between residue fluctuations is calculated as

$$\left\langle \Delta\mathbf{R}_{i}\cdot\Delta\mathbf{R}_{j}\right\rangle=\frac{3k_{B}T}{\gamma}\left[\boldsymbol{\Gamma}^{-1}\right]_{ij}=\frac{3k_{B}T}{\gamma}\left[\mathbf{U}\boldsymbol{\Lambda}^{-1}\mathbf{U}^{T}\right]_{ij},$$

where $\mathbf{U}$ is the orthogonal matrix of eigenvectors ($\mathbf{u}_{1},\ldots,\mathbf{u}_{N}$), $\boldsymbol{\Lambda}$ is the diagonal matrix of eigenvalues ($\lambda_{1},\ldots,\lambda_{N}$), $k_{B}$ is the Boltzmann constant, and $T$ is the absolute temperature.
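As a concrete illustration of this construction, the Kirchhoff matrix and its modes can be computed from C-alpha coordinates in a few lines. This is a minimal numpy sketch under stated assumptions: the function names are ours, the force constant is folded into the units, and a simple pairwise loop stands in for whatever neighbor search the original implementation used.

```python
import numpy as np

def kirchhoff(coords, cutoff=7.0):
    """Build the GNM Kirchhoff (connectivity) matrix from C-alpha coordinates.

    Off-diagonal entries are -1 when the residue pair lies within `cutoff`
    (angstroms) and 0 otherwise; each diagonal entry is minus the sum of
    the off-diagonal entries in its row, so every row sums to zero."""
    n = len(coords)
    gamma = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(coords[i] - coords[j]) <= cutoff:
                gamma[i, j] = gamma[j, i] = -1.0
    np.fill_diagonal(gamma, -gamma.sum(axis=1))
    return gamma

def gnm_modes(gamma):
    """Eigendecompose the Kirchhoff matrix and drop the trivial zero mode.

    Returns eigenvalues in ascending order and eigenvectors as columns;
    the last columns are the highest-frequency (fast) modes."""
    evals, evecs = np.linalg.eigh(gamma)
    return evals[1:], evecs[:, 1:]
```

For a connected chain the matrix has exactly one zero eigenvalue (the rigid-body mode), which is why the first eigenpair is discarded.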

To identify hot spot residues, Ozbek et al. [12] used the mean square distance fluctuations (MSDF), $\langle \Delta R_{ij}^{2}\rangle$, of residues $i$ and $j$, given as

$$\left\langle \Delta R_{ij}^{2}\right\rangle=\left\langle \Delta\mathbf{R}_{i}\cdot\Delta\mathbf{R}_{i}\right\rangle+\left\langle \Delta\mathbf{R}_{j}\cdot\Delta\mathbf{R}_{j}\right\rangle-2\left\langle \Delta\mathbf{R}_{i}\cdot\Delta\mathbf{R}_{j}\right\rangle,$$

which were calculated using the high frequency modes of GNM based on a cutoff of 6.5 Å. Residues with relatively high MSDF values were considered functionally probable; see Ozbek et al. [12] for more details.
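The MSDF restricted to the fastest modes can be sketched as follows; this is our own illustration, with the constant prefactor $3k_{B}T/\gamma$ dropped since only relative values matter when ranking residues, and the eigendecomposition assumed to be given in ascending eigenvalue order.

```python
import numpy as np

def msdf_fast_modes(evals, evecs, m=10):
    """Mean square distance fluctuations <dR_ij^2> restricted to the m
    highest-frequency GNM modes (prefactor 3*kB*T/gamma set to 1).

    evals/evecs: nonzero eigenvalues (ascending) and the matching
    eigenvectors as columns, i.e. the output of an eigendecomposition
    of the Kirchhoff matrix with the zero mode removed."""
    u = evecs[:, -m:]              # fast modes: the m largest eigenvalues
    w = 1.0 / evals[-m:]           # each mode weighted by 1/lambda_k
    cov = (u * w) @ u.T            # sum_k u_k u_k^T / lambda_k
    d = np.diag(cov)
    # <dR_i^2> + <dR_j^2> - 2<dR_i . dR_j>, assembled by broadcasting
    return d[:, None] + d[None, :] - 2.0 * cov
```

By construction the result is symmetric with a zero diagonal, and every entry is a nonnegative weighted sum of squared eigenvector differences.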

In addition, both Haliloglu et al. [13] and Demirel et al. [14] similarly defined the mean square fluctuation (or vibration) (MSF) of residues as a weighted average over several high frequency modes, based on a cutoff of 7.0 Å, to identify hot spot residues. The MSF of residue $i$ weighted by a subset $S$ of modes is given as

$$\left\langle \Delta\mathbf{R}_{i}\cdot\Delta\mathbf{R}_{i}\right\rangle_{S}=\frac{3k_{B}T}{\gamma}\sum_{k\in S}\lambda_{k}^{-1}\left[\mathbf{u}_{k}\mathbf{u}_{k}^{T}\right]_{ii}.$$

Then, a residue was predicted as a hot spot if its normalized MSF (i.e., the measure expressed in (5) after normalization over the chain) was larger than a given threshold. The main difference between the work by Haliloglu et al. [13] and that by Demirel et al. [14] lies in the thresholds adopted: Haliloglu et al. [13] used a constant threshold of 0.005, while Demirel et al. [14] used a threshold depending on $N$, the number of residues in a protein sequence.
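A sketch of this MSF-based criterion follows. The normalization (dividing the per-residue profile by its sum over the chain, so the prefactor cancels) and the function names are our assumptions; the constant threshold 0.005 is the one reported for Haliloglu et al. [13].

```python
import numpy as np

def normalized_msf(evals, evecs, mode_idx):
    """Per-residue mean square fluctuation over a subset of fast modes,
    normalized so the profile sums to 1 over the chain (the 3*kB*T/gamma
    prefactor cancels under this normalization)."""
    u = evecs[:, mode_idx]             # selected modes as columns
    w = 1.0 / evals[mode_idx]          # 1/lambda_k weights
    msf = (u ** 2 * w).sum(axis=1)     # [u_k u_k^T]_ii summed over k in S
    return msf / msf.sum()

def predict_hot_spots(evals, evecs, mode_idx, threshold=0.005):
    """Label residue i a hot spot when its normalized MSF exceeds the
    threshold (constant threshold as in Haliloglu et al. [13])."""
    return normalized_msf(evals, evecs, mode_idx) > threshold
```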

##### 2.3. Gaussian Naive Bayes

A Naive Bayes (NB) classifier calculates the probability of a given instance (example) belonging to a certain class [40]. Given an instance described by its feature vector $\mathbf{x}=(x_{1},\ldots,x_{n})$ and a class target $y$, the conditional probability $P(y\mid x_{1},\ldots,x_{n})$ can be expressed as a product of simpler probabilities using the naive independence assumption according to Bayes' theorem:

$$P\left(y\mid x_{1},\ldots,x_{n}\right)=\frac{P(y)\prod_{i=1}^{n}P\left(x_{i}\mid y\right)}{P\left(x_{1},\ldots,x_{n}\right)}.$$

Here, the target $y$ may take two values, where $y=1$ means a hot spot residue and $y=0$ represents a non-hot spot residue. For one residue (one instance), $\mathbf{x}$ is a feature vector of fixed size describing its characteristics using the high frequency modes generated by GNM. For example, when only one high frequency mode $\mathbf{u}_{k}$ is used, $\mathbf{x}$ is equal to $(u_{k,j})$, the $j$th component of the mode for the $j$th residue in a sequence. If three high frequency modes, denoted by $\mathbf{u}_{k_{1}}$, $\mathbf{u}_{k_{2}}$, and $\mathbf{u}_{k_{3}}$, are taken into account, the vector becomes $(u_{k_{1},j},u_{k_{2},j},u_{k_{3},j})$ for residue $j$ in a protein sequence. Moreover, if a window of size 3 centered on residue $j$ is adopted, $\mathbf{x}$ becomes $(u_{k,j-1},u_{k,j},u_{k,j+1})$.
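One possible way to assemble such window-based feature vectors is sketched below. The zero-padding at chain termini is our own assumption (the paper does not specify how window positions outside the chain are handled), and the function name is illustrative.

```python
import numpy as np

def window_features(modes, j, window=3):
    """Feature vector for residue j: components of the selected fast modes
    over a sliding window centered at j.

    `modes` is an (n_residues, n_modes) array with one column per selected
    high-frequency mode; positions outside the chain are zero-padded."""
    half = window // 2
    n, k = modes.shape
    feats = []
    for offset in range(-half, half + 1):
        p = j + offset
        if 0 <= p < n:
            feats.extend(modes[p])      # mode components of the neighbor
        else:
            feats.extend([0.0] * k)     # pad beyond the chain termini
    return np.array(feats)
```

With one mode and window size 3 this reproduces the $(u_{k,j-1},u_{k,j},u_{k,j+1})$ layout described above; with several modes the per-position components are simply concatenated.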

Since $P(x_{1},\ldots,x_{n})$ is constant for a given instance, the following rule is adopted to classify an instance whose class is unknown:

$$\hat{y}=\arg\max_{y}\,P(y)\prod_{i=1}^{n}P\left(x_{i}\mid y\right),$$

where "$\arg\max$" selects the value of $y$ for which the expression is maximized; that is, if $P(y=1)\prod_{i}P(x_{i}\mid y=1)$ is larger than $P(y=0)\prod_{i}P(x_{i}\mid y=0)$, then $\hat{y}=1$; otherwise, $\hat{y}=0$.

Moreover, when the likelihood of the features, $P(x_{i}\mid y)$, is assumed to be Gaussian,

$$P\left(x_{i}\mid y\right)=\frac{1}{\sqrt{2\pi\sigma_{y,i}^{2}}}\exp\left(-\frac{\left(x_{i}-\mu_{y,i}\right)^{2}}{2\sigma_{y,i}^{2}}\right),$$

with $\mu_{y,i}$ and $\sigma_{y,i}^{2}$ the class-conditional mean and variance of feature $i$, an NB classifier is called Gaussian Naive Bayes (GNB). Owing to its simplicity and low computational cost compared with more sophisticated methods, GNB has been widely applied to prediction problems in bioinformatics [41, 42]. In this study, GNB was used to train models on the highest frequency modes as input in order to identify hot spot residues.
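The scheme above amounts to estimating a prior plus a per-feature Gaussian for each class and comparing log-scores. A minimal, self-contained sketch (the class name and the small variance-smoothing term `1e-9` are our own choices, not from the paper):

```python
import numpy as np

class GaussianNB:
    """Minimal Gaussian Naive Bayes: class priors plus an independent
    Gaussian per feature. A sketch, not a tuned implementation."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior, self.mu, self.var = {}, {}, {}
        for c in self.classes:
            Xc = X[y == c]
            self.prior[c] = len(Xc) / len(X)
            self.mu[c] = Xc.mean(axis=0)
            self.var[c] = Xc.var(axis=0) + 1e-9   # guard against zero variance
        return self

    def predict(self, X):
        scores = []
        for c in self.classes:
            # log P(y) + sum_i log N(x_i; mu_yi, var_yi) -- the NB decision rule
            log_lik = -0.5 * (np.log(2 * np.pi * self.var[c])
                              + (X - self.mu[c]) ** 2 / self.var[c]).sum(axis=1)
            scores.append(np.log(self.prior[c]) + log_lik)
        return self.classes[np.argmax(scores, axis=0)]
```

Working in log-space avoids underflow when many window features are multiplied together, which matters for the larger window sizes explored later in the paper.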

##### 2.4. Performance Evaluation

In a classification task, the following quality indices, including sensitivity (also known as recall), specificity, precision, and overall accuracy, are generally used to assess prediction performance:

$$\mathrm{sen}=\frac{TP}{TP+FN},\qquad \mathrm{spe}=\frac{TN}{TN+FP},\qquad \mathrm{pre}=\frac{TP}{TP+FP},\qquad \mathrm{acc}=\frac{TP+TN}{TP+TN+FP+FN},$$

where true positives (TP) and true negatives (TN) correspond to correctly predicted hot spot residues and non-hot spot residues, respectively, false positives (FP) denote non-hot spot residues predicted as hot spot residues, and false negatives (FN) denote hot spot residues predicted as non-hot spot residues.

Obviously, the dataset used in this study is extremely unbalanced, with a very high proportion of non-hot spot residues. For this reason, the accuracy value is not a good choice for evaluating overall performance. When a dataset includes 95% negative samples and 5% positive samples, a classifier may label all of them as negative, resulting in 95% overall accuracy and 100% specificity. Such a result looks excellent on paper but fails to identify the positive samples that we actually need to pay close attention to. Moreover, two indices, sensitivity and precision, both measure the classification correctness for positive samples. Ideally these two indices would simultaneously reach high values, but in general there is a trade-off between them. Therefore, we used the F-measure to evaluate the overall prediction performance:

$$F=\frac{2\times \mathrm{sen}\times \mathrm{pre}}{\mathrm{sen}+\mathrm{pre}},$$

which balances the sensitivity and the precision in the case of an unbalanced dataset. When both sen and pre are strictly larger than zero, the formula can be rewritten as $F=2/(1/\mathrm{sen}+1/\mathrm{pre})$. Thus, the F-measure can be viewed as an increasing function of sen and pre. The minimum of $F$ is 0 when sen = 0 or pre = 0, and the maximum of $F$ is 1 when sen = 1 and pre = 1.
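The F-measure and its boundary convention ($F=0$ whenever sensitivity or precision is 0) can be written directly from the confusion-matrix counts; the function name is ours.

```python
def f_measure(tp, fp, fn):
    """F-measure: harmonic mean of sensitivity (recall) and precision,
    defined as 0 when either index is 0 or undefined."""
    sen = tp / (tp + fn) if tp + fn else 0.0   # sensitivity = TP/(TP+FN)
    pre = tp / (tp + fp) if tp + fp else 0.0   # precision   = TP/(TP+FP)
    return 2 * sen * pre / (sen + pre) if sen and pre else 0.0
```

For example, 3 true positives with 1 false positive and 1 false negative give sen = pre = 0.75 and hence F = 0.75.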

##### 2.5. Identification of Hot Spots Using GNM and GNB

The experimental performance on identification of hot spot residues was tested using $k$-fold cross validation (CV) on the dataset composed of 32 unbound protein structures. In the CV procedure, the chains are randomly divided into $k$ subsets with approximately the same numbers of sequences, and the test is repeated $k$ times. Each time, $k-1$ subsets are used to build the model, and the remaining subset is tested by the prediction model.
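A chain-level split of this kind can be sketched as follows; the fixed seed and the interleaved partition are our own choices for illustration, not details from the paper.

```python
import random

def kfold_splits(chain_ids, k=10, seed=0):
    """Randomly partition protein chains into k near-equal subsets.

    Yields (train, test) pairs: each subset serves once as the test set
    while the remaining k-1 subsets are used to build the model."""
    ids = list(chain_ids)
    random.Random(seed).shuffle(ids)        # reproducible random split
    folds = [ids[i::k] for i in range(k)]   # interleave into k folds
    for i in range(k):
        test = folds[i]
        train = [c for j, f in enumerate(folds) if j != i for c in f]
        yield train, test
```

Splitting by chain rather than by residue keeps all residues of one protein on the same side of the split, so the test proteins are genuinely unseen.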

In the present study, we performed tenfold cross validation (10CV) based on Gaussian Naive Bayes using the highest frequency modes from the GNM outputs as features in different ways. We mainly implemented two feature-coding schemes for investigating the relations between the highest frequency modes and the hot spot residues. Firstly, a classifier was modeled by directly using a single one of the top 20 high frequency modes (i.e., the eigenvectors $\mathbf{u}_{k}$ that correspond to the 20 largest eigenvalues $\lambda_{k}$). Meanwhile, a sliding window centered on each residue, with sizes ranging from 1 to 21, was utilized to examine the impact of the neighboring residues' fluctuations, and the GNM computation was carried out with multiple distance cutoffs ranging from 6.0 to 8.0 Å with a step size of 0.1 Å. Secondly, we combined the top $m$ modes with the highest frequencies and utilized the same scheme for the distance cutoff of the GNM computation and the sliding window of the central residue to establish models for identifying hot spot residues.

#### 3. Results and Discussion

##### 3.1. Identification of Hot Spot Residues Using a Single One of the Highest Frequency Modes

In this work, the overall performance was evaluated by the F-measure in (9), which balances the sensitivity and the precision. Table 1 lists twenty computational outcomes of the prediction performance, ordered by F-measure, where the feature vector for a GNB classifier was extracted from a single mode, that is, the $k$th highest mode ($1\le k\le 20$), the distance cutoff in GNM varied from 6.0 to 8.0 Å with a step size of 0.1 Å, and the sliding window for one mode ranged from 1 to 21 with a step size of 2. As shown in Table 1, the highest performance, an F-measure of 0.1517, was achieved when the distance cutoff is 7.1 Å and the size of the sliding window is 3, in the case of the 8th highest mode.