Abstract

Protein structure prediction is computationally a very challenging problem. A large number of existing search algorithms attempt to solve the problem by exploring possible structures and finding the one with the minimum free energy. However, these algorithms perform poorly on large sized proteins due to an astronomically wide search space. In this paper, we present a multipoint spiral search framework that uses parallel processing techniques to expedite exploration by starting from different points. In our approach, a set of random initial solutions are generated and distributed to different threads. We allow each thread to run for a predefined period of time. The improved solutions are stored threadwise. When the threads finish, the solutions are merged together and the duplicates are removed. A selected distinct set of solutions are then split to different threads again. In our ab initio protein structure prediction method, we use the three-dimensional face-centred-cubic lattice for structure-backbone mapping. We use both the low resolution hydrophobic-polar energy model and the high-resolution energy model for search guiding. The experimental results show that our new parallel framework significantly improves the results obtained by the state-of-the-art single-point search approaches for both energy models on three-dimensional face-centred-cubic lattice. We also experimentally show the effectiveness of mixing energy models within parallel threads.

1. Introduction

Proteins are essentially linear chain of amino acids. They adopt specific folded three-dimensional structures to perform specific tasks. The function of a given protein is determined by its native structure, which has the lowest possible free energy level. Nevertheless, misfolded proteins cause many critical diseases such as Alzheimer’s disease, Parkinson’s disease, and cancer [1, 2]. Protein structures are important in drug design and biotechnology.

Protein structure prediction (PSP) is computationally a very hard problem [3]. Given a protein’s amino acid sequence, the problem is to find a three-dimensional structure of the protein such that the total interaction energy amongst the amino acids in the sequence is minimised. The protein folding process that leads to such structures involves very complex molecular dynamics [4] and unknown energy factors. To deal with the complexity in a hierarchical fashion, researchers have used discretised lattice-based structures and simplified energy models [57] for PSP. However, the complexity of the simplified problem still remains challenging.

There are a large number of existing search algorithms that attempt to solve the PSP problem by exploring feasible structures called conformations. For population-based approaches, a genetic algorithm ( [8]) reportedly produces the state-of-the-art results using hydrophobic-polar (HP) energy model. On the other hand, for local search approaches, spiral search (SS-Tabu) [9], which is a tabu-based local search, produces the best results using HP model. Both algorithms use three-dimensional (3D) face-centred-cubic (FCC) lattice for conformation representation.

The approaches used in [1013] produced the state-of-the-art results using the high resolution Berrera energy matrix (henceforth referred to as BM energy model). Nevertheless, the challenges in PSP largely remain in the fact that the energy function that needs to be minimised in order to obtain the native structure of a given protein is not clearly known. A high resolution energy model (such as BM) could better capture the behaviour of the actual energy function than a low resolution energy model (such as HP). However, the fine grained details of the high resolution interaction energy matrix are often not very informative for guiding the search. Pairwise contributions that have low magnitudes could be dominated by the accumulated pairwise contributions having large magnitudes. In contrast, a low resolution energy model could effectively bias the search towards certain promising directions particularly emphasising on the pairwise contributions with large magnitudes.

In a collaborative human team, each member may work individually on his/her own way to solve a problem. They may meet together occasionally to discuss the possible ways they could find and may then refocus only on the more viable options in the next iteration. We envisage this approach to be useful in finding a suitable solution when there are enormously many alternatives that are very close to each other. We therefore try this in the context of conformational search for protein structure prediction.

In this paper, we present a multithreaded search technique that runs SS-Tabu in each thread that is guided by either HP energy or by BM energy model. The search starts with a set of random initial solutions by distributing these solutions to different threads. We allow each thread to run for a predefined period of time. The interim improved solutions are stored threadwise and merged together when all threads have finished their execution. After removing the duplicates from the merged solutions, a selected distinct set of solutions is then considered for next iteration. In our approach, multipoint start first helps find some promising results. For the next set of solutions to be distributed, the most promising solutions from the merged list are selected. Therefore, multipoint parallelism reduces the search space by exploring the vicinities of the promising solutions recursively. In our parallel local search, we use both the HP energy model and BM energy model on the 3D FCC lattice space. The experimental results show that our new approach significantly improves over the results obtained by the state-of-the-art single-point search approaches for the similar models.

The rest of the paper is organized as follows: Section 2 describes the background of the protein structure; Section 3 presents the related work; Section 4.1 presents the SS-Tabu algorithm used in the parallel search approach; Section 4 describes our parallel framework in detail; Section 5 discusses and analyses the experimental results; and finally, Section 6 presents the conclusions and outlines the future work.

2. Background

There are three computational approaches for protein structure prediction. These are homology modeling [14], protein threading [15, 16], and ab initio methods [17, 18]. Prediction quality of homology modeling and protein threading depends on the sequential similarity of previously known protein structures. However, our work is based on the ab initio approach that only depends on the amino acid sequence of the target protein. Levinthal’s paradox [19] and Anfinsen’s hypothesis [20] are the basis of ab initio methods for PSP. The idea was originated in 1970 when it was demonstrated that all information needed to fold a protein resides in its amino acid sequence. In our simplified protein structure prediction model, we use 3D FCC lattice for conformation mapping, HP and BM energy models for conformation evaluation, and the spiral search algorithm [9] (SS-Tabu) in a parallel framework for conformation search. The simplified models (lattice model and energy models) and local search are described below.

2.1. Simplified Model

In this research, we use 3D FCC lattice points for conformation mapping to generate backbone of protein structures. We use the HP and BM energy model for conformation evaluation. The 3D FCC lattice, the HP energy model, and BM energy model are briefly described below.

2.1.1. 3D FCC Lattice

The FCC lattice has the highest packing density compared to the other existing lattices [21]. The hexagonal close packed (HCP) lattice, also known as cuboctahedron, was used in [22]. In HCP, each lattice point has 12 neighbours that correspond to 12 basis vertices with real-numbered coordinates, which causes the loss of structural precision for PSP. In FCC, each lattice point has neighbours as shown in Figure 1.

Figure 1 shows the 12 basis vectors with respect to the origin. The basis vectors are presented below denoting as :

In simplified PSP, conformations are mapped on the lattice by a sequence of basis vectors or by the relative vectors that are relative to the previous basis vectors in the sequence.

2.1.2. HP Energy Model

The amino acid monomers are the building block of protein polymers. These amino acids are broadly divided into two categories based on their hydrophobicity: (a) hydrophobic amino acids (Gly, Ala, Pro, Val, Leu, Ile, Met, Phe, Tyr, Trp) denoted by H; and (b) hydrophilic or polar amino acids (Ser, Thr, Cys, Asn, Gln, Lys, His, Arg, Asp, Glu) denoted by P. In the HP model [23], when two nonconsecutive hydrophobic amino acids become topologically neighbours, they contribute a certain amount of negative energy, which for simplicity is shown as in Table 1. The total energy () of a conformation based on the HP model becomes the sum of the contributions of all pairs of nonconsecutive hydrophobic amino acids as follows: Here, if amino acids and are nonconsecutive neighbours on the lattice, otherwise ; and if th and th amino acids are hydrophobic, otherwise .

2.2. BM Energy Model

By analysing crystallised protein structures, Miyazawa and Jernigan [24] in 1985 statistically deduced a energy matrix that considers residue contact propensities between the amino acids. By calculating empirical contact energies on the basis of information available from selected protein structures and following the quasichemical approximation Berrera et al. [25] in 2003 deduced another energy matrix. In this work, we use the latter model and denote it by BM energy model. Table 2 shows the BM energy model with amino acid names at the left-most column and the bottom-most row and the interaction energy values in the cells. The amino acid names that have boldface are hydrophobic. We draw lines in Table 2 to show groupings based on H-H, H-P, and P-P interactions. In the context of this work, it is worth noting that most energy contributions that have large magnitudes are from H-H interactions followed by those from H-P interactions.

The total energy (shown in (3)) of a conformation based on the BM energy model is the sum of the contributions over all pairs of nonconsecutive amino acids that are one unit lattice distance apart: Here, if amino acids at positions and in the sequence are nonconsecutive neighbours on the lattice, otherwise ; and is the empirical energy value between the and amino acid pair specified in the matrix for the BM model.

2.3. Local Search

Starting from an initial solution, local search algorithms move from one solution to another to find a better solution. Local search algorithms are well known for efficiently producing high quality solutions [9, 26, 27], which are difficult for systematic search approaches. However, they are incomplete [28] and suffer from revisitation and stagnation. Restarting the whole or parts of a solution remains the typical approach to deal with such situations.

2.4. Tabu Metaheuristic

Tabu metaheuristic [29, 30] enhances the performance of local search algorithms. It maintains a short-term memory storage to remember the local changes of a solution. Then any further local changes for those stored positions are forbidden for a certain number of subsequent iterations (known as tabu tenure).

There are a large number of existing search algorithms that attempt to solve the PSP problem by exploring feasible structures on different energy models. In this section we explore the works related to HP and energy models as below.

3.1. HP Energy-Based Approaches

Different types of metaheuristic have been used in solving the simplified PSP problem. These include Monte Carlo Simulation [31], Simulated Annealing [32], Genetic Algorithms (GA) [33, 34], Tabu Search with GA [35], Tabu Search with Hill Climbing [36], Ant Colony Optimisation [37], Immune Algorithms [38], Tabu-based Stochastic Local Search [26, 27], and Constraint Programming [39].

The Bioinformatics Group, headed by Rolf Backofen, applied Constraint Programming [4042] using exact and complete algorithms. Their exact and complete algorithms work efficiently if similar hydrophobic core exists in the repository.

Cebrián et al. [26] used tabu-based local search, and Shatabda et al. [27] used memory-based local search with tabu heuristic and achieved the state-of-the-art results. However, Dotu et al. [39] used constraint programming and found promising results but only for smaller sized ( amino acids) proteins. Besides local search, Unger and Moult [33] applied population-based genetic algorithms to PSP and found their method to be more promising than the Monte Carlo-based methods [31]. They used absolute encodings on the square and cubic lattices for HP energy model. Later, Patton [43] used relative encodings to represent conformations and a penalty method to enforce the self-avoiding walk constraint. GAs have been used by Hoque et al. [22] for cubic and 3D HCP lattices. They used DFS-generated pathways [44] in GA crossover for protein structure prediction. They also introduced a twin-removal operator [45] to remove duplicates from the population to prevent the search from stalling. Ullah et al. in [12, 46] combined local search with constraint programming. They used a energy model [25] on FCC lattice and found promising results. In another hybrid approach [47], tabu metaheuristic was combined with genetic algorithms in two-dimensional HP model to observe crossover and mutation rates over time.

However, for the simplified model (HP energy model and 3D FCC lattice) that is used in this paper, a new genetic algorithm [8] and a tabu-based local search algorithm Spiral Search [9] currently produce the state-of-the-art results.

3.2. Empirical 20 × 20 Matrix Energy Based Approaches

A constraint programming technique was used in [48] by Dal Palù et al. to predict tertiary structures of real proteins using secondary structure information. They also used constraint programming with different heuristics in [49] and a constraint solver named COLA [50] that is highly optimized for protein structure prediction. In another work [51], a fragment assembly method was utilised with empirical energy potentials to optimise protein structures. Among other successful approaches, a population-based local search [52] and a population-based genetic algorithm [13] were used with empirical energy functions.

In a hybrid approach, Ullah and Steinöfel [12] applied a constraint programming-based large neighbourhood search technique on top of the output of COLA solver. The hybrid approach produced the state-of-the-art results for several small sized (less than amino acids) benchmark proteins.

In another work, Ullah et al. [46] proposed a two stage optimisation approach combining constraint programming and local search. The first stage of the approach produced compact optimal structures by using the CPSP tools based on the HP model. In the second stage, those compact structures were used as the input of a simulated annealing-based local search that is guided by the BM energy model.

In a recent work [10], Shatabda et al. presented a mixed heuristic local search algorithm for PSP and produced the state-of-the-art results using BM energy model on 3D FCC lattice. The mixed heuristic local search in each iteration randomly selects a heuristic from a given number of heuristics designed by the authors. The selected heuristics are then used in evaluating the generated neighbouring solutions of the current solution. Although the heuristics themselves are weaker than the BM energy, their collective use in the random mixing fashion produces results better than the BM energy itself.

3.3. Parallel Approaches

Vargas and Lopes [53] proposed an Artificial Bee Colony algorithm based on two parallel approaches (master slave and a hybrid hierarchical) for protein structure prediction using the 3D HP model with sidechains. They showed that the parallel methods achieved a good level of efficiency while compared with the sequential version. A comparative study of parallel metaheuristics was conducted by Trantar et al. [54] using a genetic algorithm, a simulated annealing algorithm, and a random search method in grid environments for protein structure prediction. In another work [55], they applied a parallel hybrid genetic algorithm in order to efficiently deal with the PSP problem using the computational grid. They experimentally showed the effectiveness of a computational grid-based approach. All-atom force field-based protein structure prediction using parallel particle swarm optimization approach was proposed by Kandov in [56]. He showed that asynchronous parallelisation speeds up the simulation better than the synchronous one and reduces the effective time for predictions significantly. Among others, Calvo et al. in [57, 58] applied a parallel multiobjective evolutionary approach and found linear speedups in structure prediction for benchmark proteins and Robles et al. in [59] applied parallel approach in local search to predict secondary structure of a protein from its amino acid sequence.

4. Our Approach

The driving force of our parallel search framework is SS-Tabu [9] that has two versions: (i) the existing algorithm, designed for HP model (as shown in Algorithm 1 and described in Section 4.1) and (ii) the customised spiral search algorithm, designed for BM energy model (as shown in Algorithm 5 and described in Section 4.2). We feed the two versions of spiral search algorithms in different threads in different combinations. The variations are described in the experimental results section.

(1) //H and P are hydrophobic and polar amino acids.
(2) //maxIter terminates the iteration
(3) //maxRetry sets the time of relay-restart
(4) //maxRW sets the time of random-walk
(5) initTabuList()
(6) for (   to   maxIter) do
(7)   mv     selectMoveForH()
(8)   if (mv != null) then
(9)     applyMove(mv)
(10)     updateTabuList(i)
(11) else
(12)     mv     selectMoveForP()
(13)     if (mv != null) then
(14)      applyMove(mv)
(15)  evaluate(AA) //AA—amino acid array
(16)  if (!improved) then
(17)     retry++
(18) else
(19)     improvedList     addTopOfList()
(20)     retry = 0
(21)     rw = 0
(22) if   retry      maxRetry  then
(23)     relayRestart(improvedList)
(24)     resetTabuList()
(25)     rw++;
(26) if   rw      maxRW  then
(27)     randomWalk(maxPull)
(28)     resetTabuList()

4.1. SS-Tabu: Spiral Search

SS-Tabu is a hydrophobic core directed local search [9] that works in a spiral fashion. This algorithm (the pseudocode in Algorithm 1) is the basis of the proposed parallel local search framework. SS-Tabu is composed of H and P move selections, random-walk [60], and relay-restart [9]. However, this algorithm is further customised for detailed energy model as described in Section 4.2. Both versions of SS-Tabu are used in parallel threads with different combinations within the parallel framework. The features of existing SS-Tabu are described in Algorithm 1.

4.1.1. Applying Diagonal Move

In a tabu-guided local search (see Algorithm 1), we use the diagonal move operator (shown in Figure 2) to build H-core. A diagonal move displaces th amino acid from its position to another position on the lattice without changing the position of its succeeding th and preceding th amino acids in the sequence. The move is just a corner-flip to an unoccupied lattice point.

4.1.2. Forming H-Core

Protein structures have hydrophobic cores (H-core) that hide the hydrophobic amino acids from water and expose the polar amino acids to the surface to be in contact with the surrounding water molecules [61]. H-core formation is an important objective for HP-based protein structure prediction models. In our work, we repeatedly use the diagonal-move to aid forming the H-core. We maintain a tabu list to control the amino acids from getting involved in the diagonal moves. SS-Tabu performs a series of diagonal moves on a given conformation to build the H-core around the hydrophobic core centre (HCC) as shown in Figure 3. The Cartesian distance between the HCC and the current position or a new position is denoted by and , respectively. The diagonal move squeezes the conformation and quickly forms the H-core in a spiral fashion.

4.1.3. Selecting Moves for HP Model

In H-move selection algorithm (Algorithm 2), the HCC is calculated (Line 2) by finding arithmetic means of , , and coordinates of all hydrophobic amino acids using (4). The selection is guided by the Cartesian distance (as shown in (5)) between HCC and the hydrophobic amino acids in the sequence. For the th hydrophobic amino acid, the common topological neighbours of the th and th amino acids are computed. The topological neighbours (TN) of a lattice point are the points at unit lattice-distance apart from it. From the common neighbours, the unoccupied points are identified. The Cartesian distance of all unoccupied common neighbours is calculated from the HCC using (5). Then the point with the shortest distance is picked. This point is listed in the possible H-move list for th hydrophobic amino acid if its current distance from HCC is greater than that of the selected point. When all hydrophobic amino acids are traversed and the feasible shortest distances are listed in H-move list, the amino acid having the shortest distance in H-move list is chosen to apply the diagonal move on it (Algorithm 1 Line 9). A tabu list is maintained for each hydrophobic amino acid to control the selection priority amongst them. For each successful move, the tabu list is updated for the respective amino acid. The process stops when no H-move is found. In this situation, the control is transferred to select and apply P-moves. Consider where is the number of H amino acids in the protein. Consider

(1) //H denotes hydrophobic amino acids.
(2) hcc findHCoreCentre(S)
(3) for (   to   seqLength) do
(4)  if ((type[i]=“H”) and (¬tabuList[i])) then
(5)    cfn findCommonFreeNeigh( , )
(6)    mvlocal findShortestMove(hcc, cfn)
(7)    moveList·add(mv)
(8) mvglobal findShortestMove(moveList)
(9) return mv

However, in P-move selection (Algorithm 1 Line 12), the same kind of diagonal moves is applied as H-move. For each th polar amino acid, all free lattice points that are common neighbours of lattice points occupied by th and th amino acids are listed. From the list, a point is selected randomly to complete a diagonal move (Algorithm 1, Line 14) for the respective polar amino acid. No hydrophobic-core-center is calculated, no Cartesian distance is measured, and no tabu list is maintained for P-move. After one try for each polar amino acid the control is returned to select and apply H-moves.

4.1.4. Handling Stagnation

For hard optimisation problems such as protein structure prediction, local search algorithms often face stagnation. In HP model-based conformational search, stagnation is encountered when a premature H-core is formed. Handling the stagnations is a challenging issue for conformational search algorithms (e.g., GA, LS). Thus, handling such situation intelligently is important to proceed further. To deal with stagnation, in SS-Tabu, random-walk [60] and relay-restart techniques are used on an on-demand basis.

Random-Walk. Premature H-cores are observed at local minima. To escape local minima, a random-walk [60] algorithm (Algorithm 1, Line 27) is applied. This algorithm uses pull moves [62] to break the premature H-cores and to create diversity.

Relay-Restart. When the search stagnation situation arises, a new relay-restart technique (Algorithm 1 Line 23) is applied instead of a fresh restart or restarting from the current best solution [26, 27]. We use relay-restart when random-walk fails to escape from the local minima. The relay-restart starts from an improving solution. We maintain an improving solution list that contains all the improving solutions after the initialisation.

4.1.5. Further Implementation Details

Like other search algorithms, SS-Tabu requires initialisation. It also needs evaluation of the solution in each iteration. It starts with a randomly generated or parameterised initial solution and enhances it in a spiral fashion. Further, it needs to maintain a tabu meta-heuristic to guide the local search.

Tabu Tenure. Intuitively we use different tabu-tenure values based on the number of hydrophobic amino acids (hCount) in the sequence. We calculate tabu-tenure using the following formula: The tabu-tenure calculated using (6) is used at Lines 5, 24, and 28 in Algorithm 1 during initialising and resetting tabu-list.

Evaluation. After each iteration, the conformation is evaluated by counting the H-H contacts (topological neighbours) where the two amino acids are nonconsecutive. The pseudocode in Algorithm 3 presents the algorithm of calculating the free energy of a given conformation. Note that the energy value is negation of the H-H contact count. For BM energy model the pairwise contact potentials are found in matrix presented in Table 2.

(1) for (   to   seqLength − 1) do
(2)  for (   to   seqLength − 1) do
(3)    if   AAType[ ] = AAType[ ] = H then
(4)       nodeI     AA[ ]
(5)       nodeJ     AA[ ]
(6)       sqrD     getSqrDist(nodeI, nodeJ)
(7)       if   sqrD = 2 then
(8)         fitness     fitness – 1
(9) return fitness

Initialisation. Our algorithm starts with a feasible set of conformations known as population. We generate each initial conformation following a randomly generated self-avoiding walk (SAW) on FCC lattice points. The pseudocode of the algorithm is presented in Algorithm 4. It places the first amino acid at . It then randomly selects a basis vector to place the successive amino acid at a neighbouring free lattice point. The mapping proceeds until a self-avoiding walk is found for the whole protein sequence.

(1) //AA—amino acid array of the protein
(2) //SAW—Self-avoiding-walk
(3) basisVec 12   getTwelveBasisVectors()
(4) AA 0   AminoAcid(0, 0, 0)
(5) while (!SAW) do
(6)  for (   to   seqLength − 1) do
(7)        getRandom(12)
(8)    basis      basisVec[ ]
(9)    node      AA[ ] + basis
(10)   if   isFree(node) then
(11)      AA[ ]      AminoAcid(node)
(12)   else
(13)      SAW      false
(14)      break
(15) return AA   

4.2. BM Model Adopted Spiral Search

The basic difference between the HP energy based original spiral search (SS-Tabu [9]) and the BM energy guided adopted spiral search lies on the move selection criteria. In former version of spiral search, the amino acids are divided into two groups (H and P). The moves are selected based on these two properties of the amino acids that are guided by the distance of H amino acid from the HCC. However, to adopt BM energy model, all 20 amino acids need to be taken into consideration and the move selection criteria are guided by the distance of any amino acid from the core centre (CC) of the current structure (Algorithm 5). The CC and the distance are calculated using (7) and (8), respectively.

(1) //maxIter terminates the iteration
(2) //maxRetry sets the time of relay-restart
(3) //maxRW sets the time of random-walk
(4) initTabuList()
(5) for (   to   maxIter) do
(6)  mv      selectMove()
(7)  if (mv != null) then
(8)    applyMove(mv)
(9)    updateTabuList(i)
(10)  evalute(AA) //AA—amino acid array
(11)  if  (!improved) then
(12)    retry++
(13)  else
(14)    improvedList      addTopOfList()
(15)    retry = 0
(16)    rw = 0
(17)  if   retry      maxRetry  then
(18)    relayRestart(improvedList)
(19)    resetTabuList()
(20)    rw++;
(21)  if   rw      maxRW  then
(22)    randomWalk(maxPull)
(23)    resetTabuList()

4.2.1. Selecting Moves for BM (20 × 20) Model

In move selection (Algorithm 5 Line 6), the CC is calculated by finding arithmetic means of , , and coordinates of all amino acids using (7). The selection is guided by the Cartesian distance (as shown in (5)) between CC and the amino acids in the sequence. For the th amino acid, the common topological neighbours of the th and th amino acids are computed. The topological neighbours (TN) of a lattice point are the points at unit lattice distance apart from it. From the common neighbours, the unoccupied points are identified. The Cartesian distance of all unoccupied common neighbours is calculated from the CC using (8). Then the point with the shortest distance is picked. This point is listed in the possible move list for th amino acid if its current distance from CC is greater than that of the selected point. When all amino acids are traversed and the feasible shortest distances are listed in move list, the amino acid having the shortest distance in move list is chosen to apply the diagonal move on it (Algorithm 5, Line 8). A tabu list is maintained for each amino acid to control the selection priority amongst them. For each successful move, the tabu list is updated (Algorithm 5, Line 9) for the respective amino acid: where is the number of amino acids in the protein. Consider

4.3. Parallel Framework

In our implemented prototype, we use four parallel threads. The two versions of SS-Tabu are distributed amongst the four threads as shown in Table 3.

Figure 4 shows the architecture of our parallel search algorithm. In this framework, the search starts with a set of randomly generated initial solutions (Line 2 in Algorithm 6). The solutions are then divided in subsets (Line 4 in Algorithm 6) and are distributed to different threads.

(1) //thr—Thread
(2) currSet     initialise()
(3) for (   to   repeat) do
(4)  subSet      genSubSet(currSet)
(5)  for (   to   thCount) do
(6)   thr[ ] = createSSThread(subSet[i], time)
(7)   thr[ start()
(8)  if (noAliveThread) then
(9)   mrgLst  =   mergeImprovedLists()
(10)    distinctLst  =   removeDuplicate(mrgLst)
(11)    currSet     genCurrSet(distinctLst)

We allow each thread to run for a predefined period of time. The improved solutions are stored threadwise and are merged together (Line 9 in Algorithm 6) when all threads finish. After removing the duplicates (Line 10 in Algorithm 6) from the merged solutions, a selected distinct set of solutions are taken (Line 11 in Algorithm 6) for the next iteration. The iterative process continues until the terminating criteria (Line 3 in Algorithm 6) are satisfied.

5. Experimental Results and Analyses

We conduct our experiments on two different sets of benchmark proteins: HP benchmarks and benchmarks. The rest of this section will present the experimental results in detail.

5.1. Experiment Setup
5.1.1. Implementation

The parallel spiral search framework has been implemented in Java 6.0 using Java standard APIs. Currently the source code is not available publicly due to the legal bindings. However, an executable version of the application could be requested to the corresponding author.

5.1.2. Execution

We ran our experiments on the NICTA (NICTA website: http://www.nicta.com.au/) cluster. The cluster consists of a number of identical Dell PowerEdge R415 computers, each equipped with 6-Core Opteron 4184 processors, 2.8 GHz clock speed, 3M L2/6M L3 Cache, GB memory, and running Rocks OS (a Linux variant for cluster). The experimental results presented in this paper are obtained from different runs of identical settings for each protein when using HP benchmarks and different runs of identical settings for each protein when using benchmarks.

5.2. Experimental Results on HP Benchmark

The experimental results on HP benchmarks are presented in Tables 4 and 5. Amongst the sequences, F90, S, F180, and R instances are taken from Peter Clote laboratory website (Peter Clote Lab: http://bioinformatics.bc.edu/clotelab/FCCproteinStructure/). These instances have been used in [8, 9, 26, 27, 39] for evaluating different algorithms. Moreover, we use other six larger sequences that are taken from the CASP (CASP website: http://predictioncenter.org/casp9/targetlist.cgi) competition. The corresponding CASP target IDs for proteins 3mse, 3mr7, 3mqz, 3no6, 3no3, and 3on7 are T0521, T0520, T0525, T0516, T0570, and T0563. These CASP targets are also used in [27]. To fit in the HP model, the CASP targets are converted to HP sequences based on the hydrophobic properties of the constituent amino acids. The lower bounds of the free energy values (in Column LBFEof Tables 4 and 5) are obtained from [26, 27]; however, there are some unknown values (presented as ) of lower bounds of free energy for large sequences.

5.2.1. Results on Medium Sized HP Benchmark Proteins

In Table 4, we present three different sets of result obtained from (i) our parallel local search framework that runs on four parallel threads (30 minutes/run), (ii) a local search (SS-Tabu) that runs on a single thread (2 hours/run), and (iii) a genetic algorithm () that runs on a single thread (2 hours/run). In the table, the Size column presents the number of amino acids in the sequences, and the column shows the known lower bounds of free energy for the corresponding protein sequences in Column . The best and average free energy values for three different algorithms are presented in the table under the specific column headers (PSS, SS-Tabu, and ). The RI Columns present the relative improvements of parallel local search over the single-thread local search and the genetic algorithm. The bold-faced values indicate better performance in comparison to the other algorithms for corresponding proteins.

5.2.2. Results on Large Sized HP Benchmark Proteins

In Table 5, we present three different sets of result obtained from (i) our parallel local search framework that runs on four parallel threads (1 hour 15 minutes/run), (ii) a local search (SS-Tabu) that runs on a single thread (5 hours/run), and (iii) a genetic algorithm () that runs on a single thread (5 hours/run). In the table, the Size column presents the number of amino acids in the sequences, and the LBFEcolumn shows the known lower bounds of free energy for the corresponding protein sequences in Column ID. However, a lower bound of free energy for protein 3on7 is not known. The best and average free energy values for three different algorithms are presented in the table under the specific column headers (PSS, SS-Tabu, and ). The RI Columns present the relative improvements of parallel local search over the single-thread local search and the genetic algorithm. The bold-faced values indicate better performance in comparison to the other algorithms for corresponding proteins.

5.2.3. Relative Improvement on HP Benchmark

The difficulty of improving energy level is increased as the improved energy level approaches to the lower bound of free energy. For example, if the lower bound of free energy of a protein is , the efforts to improve energy level from to are much less than that to improve energy level from to though the change in energy is the same (). Relative Improvement (RI) explains how close our predicted results are to the lower bound of free energy with respect to the energy obtained from the state-of-the-art approaches:

In Tables 4 and 5, we also present a comparison of improvements (%) on average conformation quality (in terms of free energy levels). We compare PSS (target) with SS-Tabu and (references). For each protein, the RI of the target () with respect to the reference () is calculated using the formula in (9), where and denote the average energy values achieved by the target and the reference, respectively, and is the lower bound of free energy for the protein in the HP model. We present the relative improvements only for the proteins having known lower bounds of free energy values. We test our new approach on different proteins of various lengths. The bold-faced values are the minimum and the maximum improvements for the same column.

Improvement with respect to SS-Tabu. The experimental results in Tables 4 and 5, at column RI under SS-Tabu, show that our PSS is able to improve the search quality in terms of minimising the free energy level over all the 16 proteins considered for the test. The relative improvements with respect to SS-Tabu range from to .

Improvement with respect to . The experimental results in Tables 4 and 5, at column RI (relative improvement) under , show that our PSS is able to improve the search quality in terms of minimising the free energy level over all proteins considered for the test. The relative improvements with respect to range from to .

5.2.4. Search Progress

We compare the search progresses of SS-Tabu, , and PSS on the basis of real execution time. Figure 5(a) shows the average energy values obtained with times by the algorithms for protein R1. The graph shows that the progress of PSS stops at 75 minutes (1.25 hours). As we mentioned earlier, we run parallel threads (four threads) in our PSS for 1.25 hours to keep total CPU time equal to five () hours. From the graph, it is clear that multipoint local search with four parallel threads dramatically outperforms the local search and genetic algorithms within (1/4)th of the execution time.

However, in Figure 5(b), we compare the search progresses of SS-Tabu, , and PSS over CPU time. The CPU time of PSS is calculated by summing up the individual times of all threads (time per thread × 4) in different instances.

5.2.5. Comments on Our HP-Based Method

In Tables 4 and 5, the Columns LBFE represent the lower bound of free energy. Some of these values are taken from the literatures and others are obtained running exact and complete algorithms based CPSP-tools [42]. However, we do not compare our experimental results with results obtained from CPSP tools because of a fundamental conceptual difference between our approaches and Will and Backofen [63, 64]. Will’s HPstruct algorithm [65] proceeds with threading an input HP sequence onto hydrophobic cores from a collection of precomputed and stored H-cores. On the other hand, our algorithms compute H-cores on the fly like Yue-Dill CHCC method [61, 66]. HPstruct requires a precomputed set of H-cores for the number of H amino acids in the given sequence. Therefore, CPSP tools cannot find structure without the availability of a precomputed optimal H-core.

5.3. Experimental Results on 20 × 20 Benchmark

Besides HP energy model, we apply our parallel framework on standard benchmark proteins. The protein instances used in our experiments are taken from the literature (as shown in Table 6). The first seven proteins 4RXN, 1ENH, 4PTI, 2IGD, 1YPA, 1R69, and 1CTF are taken from [12] and the next five proteins 3MX7, 3NBM, CMQO, 3MRO, and 3PNX from [10]. In Table 7, we present eight sets of experimental results. The approaches are described below.(1) LS-Tabu is heuristically guided local search based on tabu metaheuristic. The result presented in Table 7 under Column LS-Tabu is the output of 20 different runs of LS-Tabu [10] in an identical setting over 60 minutes duration. The algorithm runs on a single thread using Berrera et al. energy model.(2) SS-Tabu is core directed local search based on tabu metaheuristic works in an spiral fashion. The result presented in Table 7 under Column SS-Tabu is the output of 20 different runs of LS-Tabu [9] in an identical setting over 60 minutes duration. The algorithm runs on a single thread using Berrera et al. energy model.(3) PSSB4H0 is a variant of parallel spiral search running in 4 threads. In this variant of PSS, in all 4 threads, the SS-Tabu is guided by Berrera et al. energy model. The parallel threads are terminated after 15 minutes. Therefore, the total CPU time remains (-threads) the same as the SS-Tabu or LS-Tabu.(4) PSSB3H1 is a variant of parallel spiral search running in 4 threads. In this variant of PSS, in 3 threads, the SS-Tabu is guided by Berrera et al. energy model and in other threads, the SS-Tabu is guided by HP energy model. The parallel threads are terminated after 15 minutes. Therefore, the total CPU time remains (-threads) the same as the SS-Tabu or LS-Tabu.(5) PSSB2H2 is a variant of parallel spiral search running in 4 threads. In this variant of PSS, in 3 threads, the SS-Tabu is guided by Berrera et al. energy model and in other 2 threads, the SS-Tabu is guided by HP energy model. The parallel threads are terminated after 15 minutes. Therefore, the total CPU time remains (-threads) the same as the SS-Tabu or LS-Tabu.(6) PSSB1H3 is a variant of parallel spiral search running in 4 threads. In this variant of PSS, in 3 threads, the SS-Tabu is guided by Berrera et al. energy model and in other 3 threads, the SS-Tabu is guided by HP energy model. The parallel threads are terminated after 15 minutes. Therefore, the total CPU time remains (-threads) the same as the SS-Tabu or LS-Tabu.(7) PSSB0H4 is a variant of parallel spiral search running in 4 threads. In this variant of PSS, in all 4 threads, the SS-Tabu is guided by HP energy model. The parallel threads are terminated after 15 minutes. Therefore, the total CPU time remains (-threads) the same as the SS-Tabu or LS-Tabu.(8) is population-based genetic algorithm that uses hydrophobic-core directed macromutation operator and random-walk-based stagnation recovery technique in addition to the regular GA operators. The result presented in Table 7 under Column is the output of 20 different runs of [11] in an identical setting over 60 minutes duration. The algorithm runs on a single thread using both HP and BM energy models in a mixing manner.

5.4. Energy Values on 20 × 20 Benchmark

In Table 7, the energy columns show the energy values obtained from different approaches on 12 benchmark proteins (Table 6). Although the searches are guided by both HP and BM energy models, the energy values are calculated by applying Berrera et al. energy matrix. The experimental results show that amongst the parallel spiral search variants, PSSB1H3 ( out of proteins) and PSSB0H4 ( out of proteins) produce better results in comparison to the other variants in terms of lowest interaction energies. However, the performs better in comparison to the parallel spiral search variants for out of proteins.

5.5. RMSD Values on 20 × 20 Benchmark

The RMSD is frequently used to measure the differences between values predicted by a model and the values actually observed. We compare the predicted structures obtained by our approach with the state-of-the-art approaches by measuring the RMSD with respect to the native structures from PDB. For any given structure, the RMSD is calculated using (10). The average distance between two -Carbons in a native structure is  Å. To calculate RMSD, the distance between two neighbour lattice points ( for FCC lattice) is considered as  Å. Consider where and denote the distances between th and th amino acids, respectively, in the predicted structure and the native structure of the protein.

In Table 7, the RMSD columns show the root-mean-square deviation (RMSD) values obtained from different approaches on 12 benchmark proteins (Table 6). The experimental results show that amongst the parallel spiral search variants, PSSB1H3 ( out of proteins) produces better results in comparison to other variants in terms of lowest RMSD values. However, when compared with , the parallel variants perform better for out of proteins.

5.6. Effect of Mixing Energy Models

The best hydrophobic cores do not always correspond to the best structures in terms of RMSD values [67, 68]. These observations inspired us to mix the energy models. The approaches presented in Table 7 are guided by BM, HP, or both energy models. However, the conformations are always evaluated using BM model. The experimental results show that when the variants are guided by HP or both BM and HP models (such as PSSB3H1, PSSB2H2, PSSB1H3, and PSSB0H4) it performs better than the variant guided by BM model (such as PSSB4H0). Therefore, from the observation of RMSD values, it is clear that HP model works as a better guidance heuristic, whereas BM model works as better model for evaluating conformations.

6. Conclusion

In this paper, we present a multipoint parallel local search framework that runs tabu-based local search (spiral search [9]) in parallel threads. In our ab initio protein structure prediction method, we develop two versions of SS-Tabu that uses hydrophobic-polar energy model and Berrera et al. [25] energy model separately on face-centred-cubic lattice. Collaboration and negotiation play vital roles in dealing with real world challenges. In our research, we try to adopt this analogy by considering each thread as a collaborator. We allow each thread to run for a predefined period of time. The threads are met in an assembly point when they finish their execution and donate or accept better solutions to proceed with. The PSS starts with a set of random initial solutions by distributing a subset of solutions to different threads which are running different combinations of two versions of SS-Tabu. The interim improved solutions are stored threadwise and merged together when the threads finish. After removing the duplicates from the merged solutions, a selected distinct set of solutions is considered for the next iteration. In our approach, multipoint start helps find some promising solutions. For the next working set of solutions from the merged list, the most promising solutions are selected. Therefore, multipoint parallelism reduces the search space by exploring around the promising solutions in every iteration. The experimental results show that our new approach significantly improves over the results obtained by the state-of-the-art single-point search approaches.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Authors’ Contribution

Mahmood A. Rashid conceived the idea of applying Spiral Search in a parallel framework. M. A. Hakim Newton, Swakkhar Shatabda, Md Tamjidul Hoque, and Abdul Sattar helped Mahmood A. Rashid in modeling, implementing, and testing the approach. All authors equally participated in analysing the test results to improve the approach and were significantly involved in the process of writing and reviewing the paper.

Acknowledgments

The authors would like to express their great appreciation to the people managing the Cluster Computing Services at National ICT Australia (NICTA) and Griffith University. Md Tamjidul Hoque acknowledges the Louisiana Board of Regents through the Board of Regents Support Fund, LEQSF (2013-16)-RD-A-19. NICTA, the sponsor of the article for publication, is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.