Advances in Bioinformatics

Volume 2018, Article ID 7607384, 17 pages

https://doi.org/10.1155/2018/7607384

## A Novel Framework for *Ab Initio* Coarse Protein Structure Prediction

^{1}Department of Computer Science & Eng., Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka 576104, India

^{2}Department of Biotechnology, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka 576104, India

^{3}Department of ECE, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka 576104, India

Correspondence should be addressed to S. Balaji; moc.liamg@igalaboib

Received 18 January 2018; Revised 26 April 2018; Accepted 27 May 2018; Published 20 June 2018

Academic Editor: Gilbert Deleage

Copyright © 2018 Sandhya Parasnath Dubey et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The Hydrophobic-Polar (HP) model is a simplified representation of the Protein Structure Prediction (PSP) problem. However, even with the HP model, the PSP problem remains NP-complete. This work proposes a systematic, problem-specific design for the operators of an evolutionary program hybridized with hill-climbing local search, in order to explore the PSP search space efficiently and thereby obtain an optimum conformation. The proposed algorithm achieves this through the following novel features: (i) a new initialization method that generates only valid individuals with better (rather than random) fitness values; (ii) probability-based selection operators that limit local convergence; (iii) a secondary-structure-based mutation operator that brings the structure closer to the laboratory-determined structure; and (iv) the integration of all the above features into a complete two-tier framework. The developed framework builds the protein conformation on the square and triangular lattices. The framework has been tested on benchmark sequences and compared with various state-of-the-art algorithms. In addition to hypothetical test sequences, we have tested protein sequences deposited in a protein database repository. The proposed framework shows superior performance in terms of accuracy (fitness value) and speed (number of generations needed to attain the final conformation). The concepts used to enhance performance are generic and can be used with any other population-based search algorithm, such as the genetic algorithm, ant colony optimization, and the immune algorithm.

#### 1. Introduction

The Protein Structure Prediction (PSP) problem is one of the major problems in the field of computational biology. Predicting the native conformation of a protein from its amino acid sequence is called the PSP problem [1–4]. A solution to the PSP problem helps in understanding the molecular foundation of proteins in regulating life [5]. In the wet lab, Nuclear Magnetic Resonance (NMR), X-ray crystallography (XC), and cryoelectron microscopy are the methods used to solve the PSP problem [6]. Owing to the complexities and limitations of these experimental methods, there is a huge gap between the number of reported protein sequences and the number of known structures: structures are known for only 1% of the reported protein sequences [7]. This calls for a computational solution to the PSP problem.

Computationally, the PSP problem is addressed by comparative modelling, threading, and *ab initio* approaches [8]. The first two approaches depend on a known reference structure (template) to solve the PSP problem. Thus, the success of these methods is limited by template availability. The third approach, termed *ab initio*, predicts the native structure of the protein from its sequence alone. This method does not require a template and is better than the template-based methods [9]. One of the earliest *ab initio* approaches uses molecular mechanics to model protein structure. This method is based on simulating the forces involved at the atomic level of amino acids, such as covalent bonds, ionic bonds, hydrogen bonding, van der Waals interactions, and hydrophobic interactions. These molecular mechanics are applied in the CHARMM, AMBER, and ECEPP [10, 11] energy functions and are found superior in modelling the fine conformation of proteins. However, these energy functions are expensive in terms of both complexity and computational resources, making them infeasible even for the smallest protein sequence [12]. There are also lower-complexity methods based on interactions between specific pairs of amino acids, such as the Miyazawa-Jernigan (MJ) model [13–15] and the Berrera energy model (BM model). Both models use a 20 × 20 energy matrix; MJ uses effective contacts between all amino acid pairs, whereas BM uses the van der Waals radii of alpha carbon and side-chain heavy atoms. Although these methods reduce the computational complexity, a great amount of computing time is still required, so such elaborate energy functions remain computationally intractable [16].

In 1985, Ken Dill [17] proposed a new computational perspective to lessen the modelling complexity and accelerate structure prediction to fill the sequence-structure gap, namely, the Hydrophobic-Polar (HP) model. Contrary to molecular mechanics and other modelling methods, the HP model considers hydrophobic interaction as the primary force involved in the folding process and yet is able to depict the natural folding patterns [18]. This is the most widely accepted approach to solve the PSP problem [19–21]. Although the HP model reduces the computation time and complexity, the problem still falls in the class of *NP-complete* problems [22]. To address this *NP-hardness*, numerous researchers have worked with various population-based search algorithms such as Monte Carlo [23–25], genetic algorithm [19, 26–30], evolutionary programming [31], ant colony optimization [32–34], immune algorithm [35], constraint-based chain growth algorithm [36], and hybrids of local search and evolutionary algorithms, commonly called memetic algorithms [37–41]. Details on these algorithms and their performance are available in various review papers [18, 42].

Population diversity and selective pressure are two important parameters that control the performance of the aforementioned algorithms. However, it has been observed that, more often than not, only one of these parameters is considered when trying to reduce the computing time of the PSP problem. Because the PSP problem presents a rugged search space with many opportunities to get trapped in a local solution, considering both population diversity and selective pressure may yield a better conformation with less computation (in fewer generations). Hence, this work integrates both factors (i.e., population diversity and selective pressure) through a hybrid approach (evolutionary programming coupled with a hill-climbing local search algorithm) for addressing the PSP problem.

#### 2. Material and Method

##### 2.1. Hydrophobic-Polar Model

In the HP model, amino acids are classified into two major groups, namely, hydrophobic (H) and polar (P) [17]. Protein folding happens in an aqueous environment (cytoplasm); the hydrophobic amino acids of the protein repel water, and this creates the driving force for folding. The hydrophobic amino acids arrange themselves to form a central hydrophobic core, and the polar amino acids are left on the surface to interact with water molecules. This can be explained by oil-water behaviour [3], which the HP model mimics.

The HP approach decomposes the PSP problem into three subproblems: first, defining a model to represent the protein structure, referred to as a conformation; second, defining an energy quantification, based on amino acid properties, that evaluates the modelled conformation; and, third, developing a search algorithm that can efficiently find an optimal conformation in a huge modelled space. In 1989, Lau and Dill proposed a lattice statistical mechanics model [3] to represent the protein conformation, commonly known as the lattice model or low-resolution model. In the lattice model, each amino acid is represented as a node of the lattice, and two consecutive amino acids are connected through a lattice edge.

The modelled conformations are quantified using the hydrophobic interactions (H-H contacts) present in the lattice diagram. A hydrophobic interaction is defined by the topological distribution of the hydrophobic amino acids: each pair of H residues that are adjacent on the lattice (at the least distance) but nonconsecutive in the protein sequence forms one H-H contact and contributes one unit to the free energy value. The free energy value is the negative of the number of H-H contacts. In terms of H-H contacts, PSP is a maximization problem, whereas in terms of free energy it is a minimization problem. Formally, the PSP problem over the HP model is defined as a triple $(S, f, \Omega)$, where $S$ is a given search space (the collection of different possible conformations), $f$ is the objective function (in terms of hydrophobic contacts or free energy value), which should be maximized or minimized, and $\Omega$ is the set of constraints that have to be fulfilled to obtain feasible solutions. The goal is to find a globally optimal solution, that is, the solution with the largest or smallest objective value under the condition that all constraints are fulfilled. The triple $(S, f, \Omega)$ is defined as follows:

$S$: the set of lattice conformations for a given HP sequence.

$f$: the objective function, to be maximized for hydrophobic contacts or minimized for the free energy value; here PSP is defined in terms of the free energy value of a conformation $c \in S$ with $n$ residues as

$$f(c) = -\sum_{i=1}^{n-2} \sum_{j=i+2}^{n} h_i \, h_j \, \delta(r_i, r_j),$$

where $h_i = 1$ if residue $i$ is hydrophobic and $0$ otherwise, $r_i$ is the lattice node occupied by residue $i$, and $\delta(r_i, r_j) = 1$ if $r_i$ and $r_j$ are adjacent lattice nodes and $0$ otherwise.

$\Omega$: the set of the following constraints, which must be satisfied when modelling the PSP on a lattice:

(i) Self-avoiding walk (SAW): each amino acid must occupy only one lattice point, which no other amino acid can share.

(ii) Adjacent amino acids of the primary sequence must be at unit distance on the lattice.
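The objective function and the two constraints above can be illustrated with a minimal Python sketch. It assumes a two-dimensional square lattice on which adjacency means a unit step along one axis; the names `SQUARE_NEIGHBOURS`, `is_feasible`, and `free_energy` are hypothetical and are not taken from the framework described in this paper.

```python
from itertools import combinations

# Assumption: 2D square lattice, neighbours are one unit step along x or y.
SQUARE_NEIGHBOURS = {(1, 0), (-1, 0), (0, 1), (0, -1)}


def is_feasible(coords, neighbours=SQUARE_NEIGHBOURS):
    """Check both constraints: self-avoiding walk and chain connectivity."""
    # (i) Self-avoiding walk: no lattice point is occupied twice.
    if len(set(coords)) != len(coords):
        return False
    # (ii) Consecutive residues must sit on adjacent lattice points.
    return all(
        (x2 - x1, y2 - y1) in neighbours
        for (x1, y1), (x2, y2) in zip(coords, coords[1:])
    )


def free_energy(sequence, coords, neighbours=SQUARE_NEIGHBOURS):
    """Return the negative of the number of H-H contacts."""
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    contacts = 0
    for i, j in combinations(range(len(sequence)), 2):
        if j - i < 2:
            continue  # residues consecutive in the sequence never count
        if sequence[i] == "H" and sequence[j] == "H":
            dx = coords[j][0] - coords[i][0]
            dy = coords[j][1] - coords[i][1]
            if (dx, dy) in neighbours:
                contacts += 1
    return -contacts


# "HPPH" folded into a unit square: the two H residues become lattice neighbours.
square_fold = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(is_feasible(square_fold))          # True
print(free_energy("HPPH", square_fold))  # -1
```

Passing a different `neighbours` set is one possible way to reuse the same functions for another lattice type, provided that set matches the lattice's definition of adjacency.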

##### 2.2. Protein Structure Encoding

In this work, protein conformations are modelled using the nonisomorphic encoding proposed by Hoque et al. [43]. This encoding avoids the generation of isomorphic conformations; hence, the search space is free from the degeneracy problem, in which two different encodings correspond to the same conformation. In such cases, if one conformation is kept at the center of the axes and the other is rotated around it, the two structures overlap in one quadrant. The existence of such conformations wastes computational resources, as they correspond to the same structure, and also makes the search stagnate. A conformation C of n residues can be expressed as a sequence of movement directions, with one direction set for the two-dimensional square lattice and another for the triangular lattice. F, B, U, D, FU, BU, BD, and FD are the movement directions to be followed on the Cartesian coordinate system, corresponding to the *forward*, *backward*, *up*, *down*, *forward-up*, *backward-up*, *backward-down*, and *forward-down* directions. In Cartesian coordinates, these directional movements represent single-step moves of (1, 0), (-1, 0), (0, 1), (0, -1), (1, 1), (-1, 1), (-1, -1), and (1, -1), respectively (Figure 1).
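As a rough illustration of how a direction string maps to lattice coordinates, the following Python sketch applies the step vectors listed above one after another. It only decodes directions into positions and does not implement the nonisomorphic restrictions of Hoque et al. [43]; the names `MOVES` and `decode`, and the choice of the origin as the starting point, are assumptions made for this example.

```python
# Step vectors for each movement direction, as listed in the text.
MOVES = {
    "F": (1, 0),     # forward
    "B": (-1, 0),    # backward
    "U": (0, 1),     # up
    "D": (0, -1),    # down
    "FU": (1, 1),    # forward-up
    "BU": (-1, 1),   # backward-up
    "BD": (-1, -1),  # backward-down
    "FD": (1, -1),   # forward-down
}


def decode(directions, start=(0, 0)):
    """Turn n - 1 movement directions into the n coordinates of a conformation."""
    x, y = start  # assumption: the first residue is placed at the origin
    coords = [(x, y)]
    for direction in directions:
        dx, dy = MOVES[direction]
        x, y = x + dx, y + dy
        coords.append((x, y))
    return coords


# Three moves place four residues on the corners of a unit square.
print(decode(["F", "U", "B"]))  # [(0, 0), (1, 0), (1, 1), (0, 1)]
```

The decoded coordinates still need to be checked against the self-avoiding walk constraint, since an arbitrary direction sequence can revisit a lattice point.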