Advances in Bioinformatics

Volume 2016 (2016), Article ID 7357123, 16 pages

http://dx.doi.org/10.1155/2016/7357123

## Multiphase Simulated Annealing Based on Boltzmann and Bose-Einstein Distribution Applied to Protein Folding Problem

^{1}Instituto Tecnológico de Ciudad Madero, Tecnológico Nacional de México, Avenida Sor Juana Inés de la Cruz s/n, Colonia los Mangos, 89440 Ciudad Madero, TAMPS, Mexico^{2}Universidad Autónoma de Coahuila, Ciudad Universitaria, 25280 Arteaga, COAH, Mexico^{3}UPEMOR, Boulevard Cuauhnáhuac 566, Jiutepec, 62550 Mor México, CP, Mexico

Received 24 November 2015; Revised 5 April 2016; Accepted 19 April 2016

Academic Editor: F. Fdez-Riverola

Copyright © 2016 Juan Frausto-Solis et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

A new hybrid Multiphase Simulated Annealing Algorithm using Boltzmann and Bose-Einstein distributions (MPSABBE) is proposed. MPSABBE was designed for solving the Protein Folding Problem (PFP) instances. This new approach has four phases: (i) Multiquenching Phase (MQP), (ii) Boltzmann Annealing Phase (BAP), (iii) Bose-Einstein Annealing Phase (BEAP), and (iv) Dynamical Equilibrium Phase (DEP). BAP and BEAP are simulated annealing searching procedures based on Boltzmann and Bose-Einstein distributions, respectively. DEP is also a simulated annealing search procedure, which is applied at the final temperature of the fourth phase, which can be seen as a second Bose-Einstein phase. MQP is a search process that ranges from extremely high to high temperatures, applying a very fast cooling process, and is not very restrictive to accept new solutions. However, BAP and BEAP range from high to low and from low to very low temperatures, respectively. They are more restrictive for accepting new solutions. DEP uses a particular heuristic to detect the stochastic equilibrium by applying a least squares method during its execution. MPSABBE parameters are tuned with an analytical method, which considers the maximal and minimal deterioration of problem instances. MPSABBE was tested with several instances of PFP, showing that the use of both distributions is better than using only the Boltzmann distribution on the classical SA.

#### 1. Introduction

In genetics DNA, RNA, and proteins are the basic elements for many researches. DNA is a molecule that contains genetic instructions, which are involved in protein synthesis process [1]. This molecule represents a complete set of hereditary information of any organism. DNA has four different nucleotides, which are adenine, cytosine, guanine, and thymine. This molecule is divided into genes, and a gene is a sequence of nucleotides that express a protein. A functional protein is conformed in an approximated geometrical model of the global minimum energy [2, 3]. This is a dinamic process where the lowest free energy of the protein plus the solvent can be reasonably approximated by the minimum free energy found by Monte Carlo, conformational space annealing, genetic algorithms, and some deterministic methods [3, 4]. In fact, there are some examples, such as insulin alphalytic [5, 6] with natural conformations whose energy is not minimal. This structure is usually named Native Structure (NS). In addition, the free energy of an NS conformation depends on the interaction among the atoms and their relative positions.

Protein Folding Problem (PFP) is an enormous challenge and important problem in bioinformatics, medicine, and other areas [7]. The function of a protein is directly related to its three-dimensional structure, and misfolded proteins can cause a variety of diseases. The aim of this problem is to find the natural tertiary structure of a protein using only a target sequence. A protein can take a high number of different conformational structures from its primary structure to its NS. The computational problem involved to find the NS is known as Protein Folding Problem. Because PFP is an NP-hard problem [8], heuristic methods avoiding the generation of all possible states of the protein are commonly used. In order to find an NS, computational methods search structures on a huge space of possible solutions. These methods can obtain several structures very close to the NS. A particular class of these methods is known to be* ab initio* which looks for the NS using only the protein’s amino acid sequence.

As a consequence, to solve PFP, new metaheuristics are applied, where simulated annealing (SA) [9, 10] is one of the most successful [11–13]. Currently, classical SA applies a Boltzmann distribution in order to accept bad solutions and escape from local minima. However, to generate high-quality solutions for PFP, new and more efficient SA have been designed; one of them, named Chaotic Multiquenching Annealing Algorithm (CMQA), has obtained very good results for proteins such as -enkephalin, proinsulin, T0549, T0335, and T0281 or 1PLXW, 1T0C, 2K5E, SR384, and 1A19, in PDB format, respectively. There are three central phases of this algorithm [14]: (i) Multiquenching Phase (MQP), (ii) Annealing Phase (AP), and (iii) Dynamical Equilibrium Phase (DEP). All of these phases are explained in the paper; for this introduction all we need to know is that each phase is designed with an annealing approach looking for finding the best configuration of the previous one. At the beginning of the process, MQP improves a random configuration through an annealing procedure executed at extremely very high temperatures; AP searches for a better solution than that of MQP with an annealing search applied at high temperatures, and, finally, DEP is applied at low temperatures looking for a better solution than that obtained by AP. As the classical SA, all of these phases apply Boltzmann distribution for accept bad solutions. However, Bose-Einstein distribution can also be used for escape from local minima [15]. Nevertheless, algorithms using these two distributions in different ranges of temperatures have not been published for PFP.

In this paper, a new SA algorithm named MPSABBE (Multiphase Simulated Annealing based on Boltzmann and Bose-Einstein distributions) is introduced. MPSABBE applies the Boltzmann and Bose-Einstein distributions at high and low temperatures, respectively. The paper shows that using both distributions the quality solution is improved. This paper is organized as follows. In Section 2, PFP is described. In Section 3, the classical SA and MPSABBE algorithms are explained. In Section 4, the SA applied for solving PFP is detailed. In Section 4, all the four MPSABBE’s phases are presented. In Section 5, analytical tuning methods SA and MPSABBE are described. In Section 6, experimental results are shown. Finally, in Section 7, the conclusions of this research are discussed.

#### 2. Protein Folding Problem

PFP is related to the questions of how and why a protein is folded into its NS. The proteins adopt an extreme number of possible conformations [16], which depends on the number of amino acids and the number of conformations by each amino acid. The essential concept introduced by Levinthal is that the PFP is a random search problem. This general idea means that all conformations of a protein (except the native state) are equally likely. Thus, it is more efficient to find the native state by a random search. PFP is an interdisciplinary problem that involves molecular biology, biophysics, computational biology, and computer science. In the* ab initio* case, NS prediction requires different mechanisms that lead the searching process to a biological three-dimensional structure. As was previously mentioned, this process requires only the amino acids’ sequence. PFP is an enormous challenge and is very hard to find the NS of a protein because the space of possible conformations of the protein is in general extremely large. For all practical purposes, PFP can be defined as follows.

Given(i)a sequence of amino acids that represents the primary structure of a protein,(ii)an energy function , where the variables represent dihedral angles,

find the following:(i)the Native Structure such that represents the lowest energy value, where(ii)the solution defines the best three-dimensional configuration.

Force fields are used to represent the energy of a protein; some of the most common are AMBER [17], CHARMM [18], ECEPP/2 [19–21], ECEPP/3 [22], and GROMACS [23]. These force fields compute energy components, for instance, the electrostatic energy, the torsion energy, the hydrogen bond energy, and the Lennard-Jones energy. In this paper ECEPP/2 force field is used.

The atoms of a protein are represented in three-dimensional cartesian coordinates. There are four types of torsion angles or dihedral angles as follows:(i)The angle between the amino group and the alpha carbon is referred to as Phi (). This angle represents the angle between the amino group (or NH_{2}) of the amino acid and the alpha Carbon in the sequence; specifically, it represents the bond angle between atom of amino group and the central carbon ().(ii)The dihedral angle between the alpha carbon and the carboxyl group is referred to as Psi (). Psi represents the angle between the carboxyl () group of the amino acid and the central carbon () of the same amino acid. In particular, Psi measures the angle of the covalent bond between of the carboxyl group and the central carbon ().(iii)For every amino acids sequence, an omega angle () is defined for each two consecutive amino acids ; specifically, it is the angle of the covalent bond between the atom of amino acid and carbon of the carboxyl group amino acid .(iv)And, finally, each Chi angle () is defined between the two planes conformed by two consecutive carbon atoms in the radical group.

The variables of the problem are all of these four angles which are in the range . In the simulations conducted in this research work, these angles are set with discrete values. Some variables have well-defined ranges as is the case of Psi and Phi angles whose ranges are defined by the Ramachandran plot [24]. The Phi angle is defined in the ranges and . The Psi angle is defined in three ranges , , and . Finally, the omega angle is fixed at 180 degrees.

#### 3. Simulated Annealing Algorithm

##### 3.1. Simulated Annealing Based on Boltzmann Distribution

Simulated Annealing (SA) Algorithm is a probabilistic method proposed by Kirkpatrick et al. [9] and Černý [10] and is an adaptation of the Metropolis algorithm, which is a Monte Carlo method [25]. SA is based on the gradual metal cooling for crystallization. This algorithm works by emulating the physical process where a metal is heating at very high temperature and then cooled very slowly until its frozen state. When this process happens, the metal is crystallized with the lowest energy configuration. SA is an algorithm that has been used for finding the optimal solution or close to it for different NP-hard problems including biological problems such as sequence alignment [26–28], phylogenetic trees [29], and PFP [30]. From a theoretical point of view, SA converges to the optimal solution or close to the lowest free energy [31]. However, classical SA is not able to find the lowest energy because energy barriers are too high for SA and cannot escape from local minima. As a consequence, variants of this method are proposed [14, 30].

Simulated annealing usually starts at a very high initial temperature (). Through a cooling function, the temperature value is gradually reduced from to , which usually is very close to zero [9, 10]. There are several cooling functions used in SA [31–36], for example,

The most common function is (1). This function reduces the temperature parameter by factor, which is commonly in the range of . A slow cooling is applied when is very close to 1, while a fast cooling is applied when is around 0.70.

The classical SA has two cycles as is shown in Algorithm 1; the first one is named temperature cycle and is used to decrease the value of the temperature with a specific cooling function. The second cycle is named metropolis cycle and it generates, accepts, or rejects solutions of the problem to be optimized. The initial and final temperature values are set (see lines (1)-(2)). These values are obtained by an analytical (see Section 5) or experimental way: should be as high as possible, while should be close to zero. An initial solution () is required in SA; this solution is generated (see line (3)) and is set to . At the beginning of the process, the parameter is set at the initial temperature (see line (4)). The temperature cycle is executed from to (see lines (5)–(19)). Then the metropolis cycle is repeated (see lines (6)–(17)) a certain number of times until a stop condition, which is explained later in this paper. A new solution () is generated within the metropolis cycle by applying a small perturbation to the current solution (see line (7)). The difference between these two solutions ( and ) is calculated (see line (8)). In practice, SA can be stopped when the probability of accepting a new solution is negligible. For a minimization problem, if this difference is less than or equal to zero (see line (9)), the new solution is accepted (see line (10)). When this difference is greater than zero, the Boltzmann distribution is applied. Then, a Boltzmann probability is calculated using (4) in line (12). If this probability is higher than a random value between 0 and 1 (see line (13)), then the new solution is accepted (see line (14)):