Abstract

A new algorithm for solving sequence alignment problem is proposed, which is named SAPS (Simulated Annealing with Previous Solutions). This algorithm is based on the classical Simulated Annealing (SA). SAPS is implemented in order to obtain results of pair and multiple sequence alignment. SA is a simulation of heating and cooling of a metal to solve an optimization problem. In order to select randomly a current solution, SAPS algorithm chooses a solution from solutions that have been previously generated within the Metropolis Cycle. This simple change has led to increase the quality of the solution to the problem of aligning genomic sequences with respect to the classical Simulated Annealing algorithm. The parameters of SAPS, for certain instances, are tuned by an analytical method, and some parameters have experimentally been tuned. SAPS has generated high-quality results in comparison with the classical SA. The instances used are specific genes of the AIDS virus.

1. Introduction

Sequence alignment is one of the most important and challenging problems in computational biology and bioinformatics [1, 2]. Finding the optimal alignment of a set of sequences is known as a NP-complete problem [3]. Alignment of sequences can be an important tool to measure the similarity of two or more sequences. Sequence Alignment is classified as a combinatorial optimization problem [4], which is solved by using computer algorithms. These algorithms lead to represent, to process, and to compare genetic information to determine evolutionary relationships among living beings [3]. The sequence alignment highlights areas of similarity among sequences. The similarities among sequences may indicate functional or evolutionary relationships among genes or proteins [5].

The problem of sequence alignment is to obtain the maximum alignment of a set of genomic sequences, which is denoted as ; each sequence of this set is formed by the alphabet . The solution to this problem is represented by , which denotes a set with the alphabet . represents the optimal alignment of .

Exact algorithms have been applied to solve the sequence alignment problem. For example, dynamic programming has been one of the most used to solve the sequence alignment problem [6, 7]. The disadvantage of using exact algorithms is that these generate optimal solutions for small problems, but for large problems, exact algorithms become inefficient. For this reason, several metaheuristic methods have been designed to obtain suboptimal alignments. Metaheuristics have also been applied to solve this problem [8], for example, Ant Colony Algorithm [9], Simulated Annealing [10, 11], Genetic Algorithms [12], among others. The disadvantage is that metaheuristics do not guarantee optimal solutions, but solutions generated can be very close to optimal solution in a reasonable processing time.

The proposed algorithm is a modified version of classical Simulated Annealing. SAPS includes a new way to select a current solution after the Metropolis Cycle is finished. In general, SAPS generates better solutions to sequence alignment problem than the classical Simulated Annealing. SAPS was tested with different genes of AIDS virus.

This paper is organized as follows: in Section 2, classical simulated annealing algorithm is described. In Section 3, the SASP algorithm is explained in detail. In Section 4, the analytical tuning method is described. In Section 5, the implementation of the SASP is detailed. In Section 6, the experimentation and results are described. Finally, Section 7 discusses the conclusions.

2. Classical Simulated Annealing

The classical Simulated Annealing is an algorithmic process that simulates the gradual metal cooling for crystallization. This algorithm usually starts at high value of temperature, and then this parameter is decreased until a final temperature is reached. The final temperature typically is very close to zero [13, 14]. Through a cooling function, the temperature value is decreased from the initial temperature to the final temperature. There are cooling functions that have been used in the simulated annealing algorithm [1518]; the most common cooling function is defined by . This function decreases the temperature value by a factor, which does a range of . A gradual cooling is applied when is very close to 1, and a fast cooling is applied when is very close to 0.70.

The classical Simulated Annealing has two cycles; the first cycle is named Cycle of Temperature. Into this cycle, value temperature is decreased by a cooling function. The second cycle is named Metropolis Cycle, and it is applied to generate, to accept, or to reject solutions for the problem to be optimized. Algorithm 1 shows the pseudo code of the classical Simulated Annealing. The initial and final temperature values are set (see line 1). These values are obtained by an analytical (see Section 4) or experimental way. It is recommended that the initial temperature is as high as possible, and the final temperature is as close to zero. The initial solution of the problem to be optimized is created (see line 2). The current solution is set to . Set T to initial temperature (see line 3). The temperature cycle is executed from the initial temperature to the final temperature (see lines 4–18). The Metropolis Cycle gets started (see lines 5–16). This cycle takes a number of times specified in the stop criterion. A new solution is created within the Metropolis Cycle by creating a small perturbation to the current solution (see line 6). The difference between these two solutions ( and ) is obtained. If the difference is less or equal than zero (see line 8), the new solution is accepted (see line 9). If the difference is greater than zero, the Boltzmann probability is calculated (see line 11). If the Boltzmann probability is higher than a random value between 0 and 1 (see line 12) then the new solution is accepted (see line 13). After the Metropolis Cycle is completed, the temperature value is decreased (see line 17).

1: Setting initial and final temperatures
2: Create from Initial solution
3: =
4: While ( > ) do // Temperature Cycle
5:  While (stop criteria) // Metropolis Cycle
6:   Create using a perturbation to S current
7:   Obtain difference similarity between and
8:   If (difference 0) then
9:    Accept
10:   else
11:    Boltzmann probability = exp (−difference/T)
12:    If (Boltzmann probability) > random (0,1) then
13:     Accept
14:    end if
15:  end if
16:  end while
17:  Decrease
18: end while

Algorithm 2 shows the pseudo code of the SA, which is applied to obtain solutions to the problem of aligning two or more genomic sequences. The Simulated Annealing algorithm is modified then it can be implemented to solve the problem of alignment sequence. The values of initial and final temperatures are tuned by using an analytical method (see lines 1-2). The cooling factor value is set to a value very close to 1 ( ) (see line 3). The current solution is set to the original solution (see line 4). The similarity of this solution is calculated by comparing base by base (see line 5). The variable is set to the initial temperature (see line 6). The Metropolis Cycle length is set to an initial value (see line 7). This cycle has an increasing length, at high temperature, it has a low value, and it is increased as the temperature is decreased. The length of Metropolis Cycle is increased by a factor , where must be greater than 1. Temperature cycle is executed (see lines 8–29) with a logic condition that T is greater than . Within this cycle, the variable is updated with value 1 (see line 9), and within the metropolis cycle, this variable is incremented (see line 25).

1: Tune initial temperature ( )
2: Tune final temperature ( )
: Setting cooling factor
4: Create from Initial solution
5: Calculate similarity of
6: =
7: Setting
8: While ( > ) do
9:   = 1
10:  While ( > )
11:   Create adding or removing gaps to
12:   Calculate similarity of
13:   Obtain difference similarity between and
14:    If (difference 0) then
15:     Scurrent = S new
16:     If similarity ( ) > similarity ( ) then
:        =
:    end if
19:   else
20:    Boltzmann probability = exp(−difference/ )
21:    If (Boltzmann probability) > random(0,1) then
22:      =
23:   end if
24:   end if
25:    = + 1
26:  end while
27:  Descrease
28:  Increase
29: end while

The Metropolis Cycle is executed (see lines 10–26). At the end of the Metropolis Cycle, the temperature is decreased (see line 27), and the Metropolis Cycle length is increased (see line 28). Within the Metropolis Cycle, new solutions are generated by modifying the current solution . This is done by adding or removing gaps into DNA sequences (see line 11). The similarity of new solutions is calculated (see line 12), and the difference of similarities between and is calculated (see line 13). This difference is denoted by . The new solutions are accepted when these are better than current solutions, so current solutions are replaced by new solutions (see line 15). When new solutions are of low quality (worse solutions) than current solutions, then new solutions are accepted using the Boltzmann probability (see line 22). This probability is directly related to the current value of the temperature and the quality difference between and . The Boltzmann probability is calculated by the following equation . As the temperature value is decreased, the probability of is decreased, which is of range .

3. Simulated Annealing with Previous Solutions

In order to generate high-quality solutions to sequence alignment, the classical SA was modified, so the SAPS algorithm is a modified version of the classical SA. After the Metropolis Cycle execution is done, the selection of a current solution is done. During the execution of Metropolis Cycle, the best solutions are stored in a set named .

The best of all solutions created in this cycle is stored in . The original sequence is stored in . After the Metropolis cycle is finished, a current solution is randomly selected from , , or . So .

The Metropolis Cycle length of SAPS is growing, which ranges from an initial value to a final value. At high temperature, is set to a small value and as the temperature value is increased, the value of the Metropolis Cycle length is increased until . when is reached, is reached too. Thus, an increasing number of solutions are created as the temperature is decreased. At high temperatures, a small number of solutions are created and as the temperature is decreased, the number of solutions is increased with a factor , where .

Algorithm 3 shows the pseudo code of SAPS, some lines of code were added to SA, for example, at line 5, and are set with . At line 19, is added to . At line 31, is chosen from , , or .

1: Tune initial temperature ( )
2: Tune final temperature ( )
3: Setting the factor cooling
4: Setting with
5: Add to and to
6: Calculate similarity of
7: =
8: Setting
9: While ( > ) do
10:   = 1
11:  While ( > )
12:    Create adding or removing gap to
13:    Calculate similarity of
14:    Obtain difference of similarity between and
15:    If (difference 0) then
16:      =
17:     If similarity ( ) > similarity ( ) then
18:       =
19:      Add S better to SS betters
20:     end if
21:    else
22:     Boltzmann probability = exp(-difference/ )
23:     If (Boltzmann probability) > random(0,1) then
24:      S current = S new
25:     end if
26:    end if
27:     = + 1
28:   end while
29:   Descrease
30:   Increase
31:   Randomly Choose S current of SS betters, S better o S original
32: end while

4. Analytical Tuning Method

Some parameters of SAPS are tuned by the analytical method [1922]. For example, in order to calculate the initial temperature, the maximum deterioration (defined by ) of the instance is applied. The probability of accepting a solution is applied at high temperature. On other hand, the final temperature is calculated by applying the minimum deterioration (defined by ) of the instance and the probability of accepting a Solution at low temperature.

The analytical tuning based on Boltzmann distribution can be helpful for setting up the initial temperature [21]. The probability of accepting any new solution is very close to 1 ( ) at high temperatures, so the deterioration of cost function is maximal. The initial temperature ( ) is associated with the maximum deterioration admitted and the defined acceptance probability .

Let be the current solution and a new proposed one, and and are the costs associated to and , respectively; the maximum and minimum deteriorations are expressed as and , respectively. Then, the probability of accepting a new solution with the maximum deterioration is defined by ( ). This equation basically is the Boltzmann Distribution, which is applied for calculating the . This temperature value is defined by . Similarly, the final temperature ( ) is established according to the probability of accepting a new solution with the minimum deterioration. The equation to calculate the final temperature is defined by .

There are other parameters of SAPS that are calculated by applying a particular cooling function; for example, the Metropolis Cycle length is calculated by applying . The incremental factor of this cycle is also calculated and defined by .

The analytical method determines the Metropolis Cycle lenght with a simple Markov model [22]; at high temperatures, only a few iterations are required because the stochastic equilibrium is quickly reached; nevertheless, at low temperatures a more exhaustive exploration is needed, so a larger is used. Let be at and let be the maximum Metropolis Cycle length.

Let be decreased by the cooling function ( ), and the be calculated by the follow equation , where is the rate of increment of Metropolis Cycle (>1); so and have an initial value, and the last Metropolis Cycle is equal to . The functions and are applied successively in Simulated Annealing from to ; consequently, and are obtained by and , respectively. is the step number from to ; so we can get the and as follows: and .

5. Implementation

SASP was tested with all of the most HIV virus genes of human and simian. The nine genes of the human virus were compared with the nine genes of simian virus; for example, the gen named “env” of HIV human was aligned with the gen “env” of HIV simian, the gen named “gag” of HIV human was aligned with the gen “gag” of HIV simian, and so successively. The information of the virus genes is shown in Table 1. The parameters , , , and are tuned by analytical method. The factor is higher than 1, and it is very close to 1. The values of these parameters are shown in Table 2. In this table, the values of initial temperatures are high; these values are related to the maximum deterioration and the probability of accepting solutions at high temperatures. It is observed that the final temperature has a value very close to zero (0.43); this is because the minimum deterioration is equal to 1.0. The parameters and have the values 2, and 300, respectively.

6. Experimentation and Results

In Table 3, the results of the experiments are shown. The information shown is the average similarity and the standard deviation of the genes of both viruses (HIV Human and HIV Simian). The results show that the average obtained by SASP is of better quality than the average obtained by the classical SA. Table 4 shows that the SAPS processing time generally is better than the processing time of SA.

7. Conclusions

In this paper, a new approach is to make efficient the classical Simulated Annealing algorithm proposed to solve the problem of aligning genomic sequences. This approach is called SAPS. After completing the Metropolis Cycle, a current solution is selected randomly from the best solutions’ set, the best solution and the initial solution. This change in the classical simulated annealing resulted in an improved efficiency to solve the problem of aligning sequences. The parameters of the algorithms SA and SAPS were tuned using a tuning method, specifically the initial temperature, final temperature, and Metropolis Cycle length.

This approach to tune the parameters depends directly on the instance to test. With a preprocessing of the instance, the minimum and maximum deteriorations are calculated. With these values and the probability of acceptance, the initial and final temperatures are calculated.