BioMed Research International

Volume 2017, Article ID 7294519, 9 pages

https://doi.org/10.1155/2017/7294519

## Identify High-Quality Protein Structural Models by Enhanced K-Means

^{1}School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China

^{2}School of Computer Science and Technology, Soochow University, Suzhou 215006, China

^{3}The First Affiliated Hospital of Soochow University, Suzhou 215006, China

Correspondence should be addressed to Hongjie Wu; moc.qq@uw.eijgnoh and Min Jiang; moc.361@111_nimgnaij

Received 23 December 2016; Revised 9 February 2017; Accepted 19 February 2017; Published 22 March 2017

Academic Editor: Ren-Zhi Cao

Copyright © 2017 Hongjie Wu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

*Background.* One critical issue in protein three-dimensional structure prediction using either ab initio or comparative modeling involves identification of high-quality protein structural models from generated decoys. Currently, clustering algorithms are widely used to identify near-native models; however, their performance depends on the conformational decoys, and, for some algorithms, the accuracy declines when the decoy population increases. *Results.* Here, we proposed two enhanced K-means clustering algorithms capable of robustly identifying high-quality protein structural models. The first one employs the clustering algorithm SPICKER to determine the initial centroids for basic K-means clustering (SK-means), whereas the other employs squared distance to optimize the initial centroids (K-means++). Our results showed that SK-means and K-means++ were more robust as compared with SPICKER alone, detecting 33 (59%) and 42 (75%) of 56 targets, respectively, with template modeling scores better than or equal to those of SPICKER. *Conclusions.* We observed that the classic K-means algorithm showed a similar performance to that of SPICKER, which is a widely used algorithm for protein-structure identification. Both SK-means and K-means++ demonstrated substantial improvements relative to results from SPICKER and classical K-means.

#### 1. Background

A critical issue in protein three-dimensional (3D) structure prediction using either ab initio or comparative modeling involves identification of high-quality protein structural models from generated decoys [1–4]. According to the first principle of predicting protein folding, the native structure of the target sequence should be the conformation exhibiting minimal free energy [5]. According to this methodology, large-scale protein-candidate conformations are generated using ab initio or comparative methods [6–10]. Because accurate calculation of free energy remains unclear in theory [11–13], a protein-structure clustering algorithm is employed, and the structure located at the center of the largest cluster is considered the conformation exhibiting minimal free energy. In clustering algorithms, the 3D-structural similarity between two proteins is used as the distance metric. Currently, root mean square deviation (RMSD) and template modeling (TM)-scores [14] constitute the two most common metrics for determining 3D-structural similarity between candidates. Subsequent refinement steps are also performed based on the conformations detected by protein-structure clustering; however, the quality of the clustering algorithm directly affects the final results of protein prediction.

SPICKER is a simple, widely used, and efficient program for identifying near-native folds. In this algorithm, clustering is performed in a one-step procedure using a shrunken, but representative, set of decoy conformations, with a pairwise RMSD cut-off determined by a self-adjusting iteration proposed by Zhang and Skolnick [15]. After benchmarking using a set of 1489 nonhomologous proteins representing all protein structures in the PDB ≤ 200 residues, Xu and Zhang [14] proposed a fast algorithm for population-based protein structural model analysis. Two new distance metrics, Dscore1 and Dscore2, were developed based on the comparison of protein-distance matrices to describe the differences and similarities among models [1]. Compared with existing methods, whose calculation times are quadratic in the number of models, Dscore1-based clustering achieves linear time complexity while obtaining almost the same accuracy for near-native model selection.

Clusco [16] is a fast and easy-to-use program that allows high-throughput comparison of protein models using different similarity measures (coordinate-based RMSD [cRMSD], distance-based RMSD [dRMSD], global distance test total score [GDT-TS] [17], TM-score, MaxSub [18], and contact map overlap) and clusters the comparison results using standard methods, such as K-means clustering or hierarchical agglomerative clustering. The application is highly optimized, is written in C/C++, and includes code for parallel execution, which results in a significant speed increase relative to similar clustering and scoring algorithms. Berenger et al. [19] proposed a fast method that works on large decoy sets and is implemented in a software package called Durandal, which is consistently faster than other software packages in performing rapid and accurate clustering. In some cases, Durandal outperforms approximate methods in speed by using the triangle inequality to accelerate accurate clustering without compromising the distance function.

However, most of these methods are data sensitive: different protein targets and different modeling algorithms can lead to large differences in the detected cluster centers [20, 21]. One possible reason is that the free energy distribution varies greatly between decoy-generation algorithms, such as those relying on ab initio and comparative modeling. Identifying the near-native conformation is also a memory- and time-intensive task [22–24]. The K-means [25, 26] clustering algorithm is popular and has been successfully employed in many scientific fields because of its robust performance in previous applications [27, 28] and its relative simplicity. However, the efficacy of K-means clustering in protein-structure prediction has not been extensively studied.

In this paper, we propose two enhanced K-means clustering algorithms to identify near-native structures. The first employs SPICKER to determine the initial centroids for the basic K-means algorithm. The second employs squared distance to optimize the initial centroids.

#### 2. Methods

##### 2.1. Data Sets of Benchmark

To comprehensively evaluate the methodology, we applied the algorithms to two representative datasets. The first dataset is I-TASSER SPICKER Set-II (http://zhanglab.ccmb.med.umich.edu/decoys/decoy2.html), which is widely used for evaluating the performance of protein-decoy clustering algorithms [29, 30]. I-TASSER SPICKER Set-II contains the whole-set atomic structure decoys of 56 nonhomologous small proteins ranging from 47 to 118 residues, with an average of 80.88 residues. Each decoy set contains 439.20 conformations on average.

The second benchmark consists of CASP11 experimental targets whose decoys were generated by Zhang-Server and QUARK. We chose 12 hard and very hard targets from the 64 CASP11 targets published at http://zhanglab.ccmb.med.umich.edu/decoys/casp11/. Hard and very hard targets have lower similarity among the structures in a decoy set and more structures per decoy set. Targets without Zhang-Server and QUARK server results and with a Zhang-Server TM-score less than 0.6 were removed from the dataset. Each decoy set contains around 1200–1500 conformations, with an average of 1520.83 conformations. These proteins range from 68 to 204 residues, with an average of 135.90 residues.

##### 2.2. Classical K-Means Algorithm and 3D Distance Metrics

###### 2.2.1. Classical K-Means Algorithm

The K-means algorithm is a typical distance-based clustering algorithm that uses the Euclidean metric as the similarity measure: the closer two objects are, the greater their similarity. K-means regards a cluster as a set of objects that are close to one another, so its final goal is to find compact and independent clusters. The selection of the initial cluster centers has great influence on the clustering results, because in the first step K-means randomly selects arbitrary objects as the initial cluster centers, each representing an initial cluster. In each iteration, the remaining objects are reassigned to the nearest cluster according to distance. An iteration finishes when all remaining objects have been assigned and the new cluster centers have been calculated. The algorithm terminates when the new cluster centers equal the previous ones or when the change is less than a specified threshold. The Euclidean metric is defined as follows:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2},$$

where $n$ is the number of corresponding atoms between two objects $x$ and $y$.
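The iteration described above can be sketched in a few lines of Python. This is a minimal illustration on generic feature vectors, not the authors' implementation; the function name `kmeans` and the parameters `max_iter`, `tol`, and `seed` are assumptions introduced here.

```python
import numpy as np

def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Classical K-means with random initial centers (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Step 1: choose k distinct objects at random as the initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest center (Euclidean metric).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its members;
        # an empty cluster keeps its previous center.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop when no center moved by more than the threshold.
        shift = np.linalg.norm(new_centers - centers, axis=1).max()
        centers = new_centers
        if shift < tol:
            break
    return centers, labels
```

Because the initial centers are drawn at random, different values of `seed` can produce different final clusters on harder data, which is the sensitivity the enhanced algorithms are meant to reduce.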

###### 2.2.2. Root Mean Square Deviation and Template Modeling Score

The similarity between two models is usually assessed by the root mean square deviation (RMSD) between equivalent atoms in the model and native structures after the optimal superimposition [31, 32].
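As an illustration of RMSD after optimal superimposition, the sketch below uses the Kabsch algorithm (an SVD-based least-squares rotation), which is one common way to compute the optimal superposition; the source does not state which superposition method was used. The function name `rmsd_after_superposition` is hypothetical, and the two inputs are assumed to be arrays of already-matched atom coordinates with shape `(n, 3)`.

```python
import numpy as np

def rmsd_after_superposition(P, Q):
    """RMSD between matched coordinate sets P and Q after optimal superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translation by centering both structures on their centroids.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Kabsch: SVD of the covariance matrix gives the least-squares rotation.
    U, _, Vt = np.linalg.svd(P.T @ Q)
    # Correct for a possible reflection so that R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = P @ R.T
    return float(np.sqrt(((P_rot - Q) ** 2).sum() / len(P)))
```

A structure compared with a rotated and translated copy of itself should give an RMSD of essentially zero, which is a quick sanity check for any implementation.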

RMSD alone is not sufficient for globally estimating the similarity between two proteins, because the alignment coverage can differ greatly between approaches. A template with a 2 Å RMSD to native and 50% alignment coverage is not necessarily better for structure modeling than one with an RMSD of 3 Å but 80% alignment coverage. Although the template-aligned regions are better in the former, fewer residues are aligned, so the resulting full-length model might be of poorer quality. The Template Modeling Score (TM-score) function is a variation on the Levitt–Gerstein (LG) score [1, 33], which was first used for sequence-independent structure alignments. TM-score is defined as follows:

$$\text{TM-score} = \mathrm{Max}\left[\frac{1}{L_N}\sum_{i=1}^{L_T}\frac{1}{1 + (d_i/d_0)^2}\right],$$

where $L_N$ is the length of the native structure, $L_T$ is the length of the aligned residues to the template structure, $d_i$ is the distance between the $i$th pair of aligned residues, and $d_0$ is a scale to normalize the match difference. “Max” denotes the maximum value after optimal spatial superposition. RMSD, TM-score, and other metrics, such as the GDT-TS (Global Distance Test) score and Qprob [34], can be used to evaluate the distance between two structures.

##### 2.3. SPICKER Enhanced the Initial Centers of the Classical K-Means Algorithm

One of the key limitations of the K-means algorithm concerns the positioning of the initial cluster centers. As a heuristic algorithm, it is not guaranteed to converge to the global optimum, and the results may depend on the initial cluster positions. In the classical K-means algorithm, the initial centers are randomly generated, and different initial positions can result in entirely different final cluster centers. SPICKER represents a simple and efficient strategy for identifying near-native folds by clustering protein structures generated during computer simulations. SPICKER performs this in a one-step procedure using a shrunken, but representative, set of decoy conformations, with the pairwise RMSD cut-off determined by self-adjusting iterations.

We propose the first enhanced K-means algorithm, SK-means, which integrates SPICKER with K-means as shown in Algorithm 1. Line 1 calculates the pairwise similarity of all proteins. Line 2 executes the SPICKER program to obtain the initial cluster centers. In line 6, each protein is assigned to the nearest cluster center according to the distance matrix. In line 10, the new center of the current cluster is calculated. Line 19 copies the new cluster centers to the current cluster centers. The flow chart of SK-means is depicted in Figure 1(a).
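The steps listed above (precomputed distance matrix, externally supplied initial centers, assignment, center update, and copy-back) can be sketched as follows. This is a hedged illustration, not a reproduction of Algorithm 1: SPICKER is an external program, so its output is represented here by the hypothetical parameter `init_centers` (indices of the initial center structures), and updating each center to the cluster medoid is an assumption made because the centers must be actual decoy structures when only a distance matrix is available.

```python
import numpy as np

def kmeans_from_initial_centers(dist, init_centers, max_iter=100):
    """K-means-style clustering over a precomputed distance matrix.

    dist         : (n, n) symmetric matrix of pairwise structural distances
    init_centers : indices of the initial centers (e.g. taken from SPICKER)
    """
    dist = np.asarray(dist, dtype=float)
    centers = [int(c) for c in init_centers]
    for _ in range(max_iter):
        # Assign each structure to the nearest current center.
        labels = dist[:, centers].argmin(axis=1)
        new_centers = []
        for j, c in enumerate(centers):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                new_centers.append(c)  # keep the center of an empty cluster
                continue
            # Assumed update rule: the medoid, i.e. the member with the
            # smallest total distance to all other members of the cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_centers.append(int(members[within.argmin()]))
        if new_centers == centers:
            break  # centers unchanged: converged
        centers = new_centers  # copy the new centers to the current centers
    # Final assignment, consistent with the returned centers.
    labels = dist[:, centers].argmin(axis=1)
    return centers, labels
```

In practice `dist` would hold RMSD-based or TM-score-based distances between all decoys, and the largest resulting cluster would supply the candidate near-native model.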