Abstract

Prediction of RNA structure is useful for creating new drugs and understanding genetic diseases. In this paper, we propose a particle swarm optimization (PSO) and ant colony optimization (ACO) based framework (PAF) for RNA secondary structure prediction. PAF consists of crucial stem searching (CSS) and global sequence building (GSB). In CSS, a modified ACO (MACO) is used to search for the crucial stems, generating a set of stems. In GSB, a modified PSO (MPSO) is used to construct all the stems in one sequence. We evaluated the performance of PAF on ten sequences ranging in length from 122 to 1494 nucleotides. We also compared PAF with six well-known existing methods: SARNA-Predict, RnaPredict, ACRNA, PSOfold, IPSO, and mfold. The comparison shows that PAF not only predicts structures with higher accuracy but also finds crucial stems.

1. Introduction

RNA functions as an information carrier, catalyst, and regulatory element, perhaps reflecting its importance in the earliest stages of evolution. The structures of RNAs provide insight into the mechanisms behind these functions. Determining sequence is the first step in determining structure, and many billions of nucleotide sequences are now known. The second step is determining secondary structure, and relatively few classes of RNAs currently have known secondary structures [1]. The RNA secondary structure prediction problem is therefore a critical one in molecular biology. Secondary structure as well as tertiary structure can be determined by X-ray crystallography and Nuclear Magnetic Resonance (NMR) spectroscopy. Data analysis tools used for prediction of RNA structure are mainly based on dynamic programming [2]. Recently, metaheuristic methods have been widely used to predict RNA secondary structure. These methods include genetic algorithm (GA) [3], particle swarm optimization (PSO) [4], ant colony optimization (ACO) [5], and simulated annealing (SA) [6]. The state-of-the-art methods are reviewed below.

For SA. Shapiro and Wu [7] later modified the algorithm by introducing an annealing mutation operator. Tsang and Wiese [8] presented SARNA-Predict, a permutation-based algorithm for RNA secondary structure prediction based on SA with a simple thermodynamic model, and studied mainly its convergence behavior. They evaluated the prediction accuracy of SARNA-Predict by comparison with eight state-of-the-art RNA prediction algorithms. Their results demonstrate that SARNA-Predict can outperform other state-of-the-art algorithms in terms of prediction accuracy and that incorporating a more sophisticated thermodynamic model substantially improves prediction accuracy.

For GA. Benedetti and Morosetti [9] compared the accuracy of an evolutionary algorithm (EA) against known RNA structures with the objective of finding optimal and suboptimal structures that were similar. Shapiro et al. [10] modified their EA to study folding pathways using a massively parallel genetic algorithm. Wiese et al. designed a serial EA, RnaPredict [11], which encodes RNA secondary structures as permutations. RnaPredict was later parallelized as a coarse-grained distributed EA for RNA secondary structure prediction [12].

For ACO. Yu et al. [13] put forward ACRNA based on ACO for RNA secondary structure prediction. For a given RNA sequence, the set of all possible stems is obtained, and the energy of each stem is calculated and stored at the initial stage. Furthermore, a more realistic formula is used to compute the energy of multibranch loop in the following iteration. Then a folding pathway is simulated, including such processes as construction of the heuristic information, the rule of initializing the pheromone, the mechanism of choosing the initial and next stem, and the strategy of updating the pheromone between two different stems.

For PSO. Geis and Middendorf [14] introduced HelixPSO for finding minimum energy RNA secondary structures. Neethling and Engelbrecht [15] proposed a set-based particle swarm optimization algorithm to optimize the structure of an RNA molecule, using an advanced thermodynamic model. Liu et al. [16] proposed an improved PSO (IPSO), with an objective function designed according to the minimum free energy, the number of selected stems, and the average length of selected stems. Promising experimental results were obtained, showing the effectiveness and practicability of IPSO for RNA secondary structure prediction. Xing et al. [17, 18] proposed PSOfold based on IPSO. An adaptive parameter controller based on fuzzy logic is used to improve the balance between exploration and exploitation, and a solution conversion strategy (SCS) is designed to enhance PSO performance on discrete problems such as stem combination.

The metaheuristic methods mentioned above [19, 20] focus on the structure of the sequence but neglect the search capability of the algorithm. PSOfold, our previous work, attempted to widen the search range of PSO but did not consider the crucial stems. In this paper, we propose PAF to improve the exploration ability and focus on the influence of the crucial stems.

The main contributions of this paper are described as follows.

(1) A framework, namely, PAF, was proposed for RNA secondary structure prediction, which includes CSS and GSB.
(2) In CSS, MACO was proposed to search the crucial stems.
(3) In GSB, MPSO was designed to construct all the stems in one sequence.

The rest of the paper is organized as follows. Section 2 briefly introduces the theory of ACO and PSO. In Section 3, we present the proposed PAF framework and describe the implementation of CSS and GSB. In Section 4, the experiment results on various public sequences are discussed. Finally, we draw the conclusions of this paper in Section 5.

2. Basic Theory

2.1. ACO

The ACO algorithm is inspired by the behavior of real ant colonies, in particular how they forage for food. ACO has been formalized into a metaheuristic for combinatorial optimization problems by Dorigo and coworkers [21].

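In ACO, an ant $k$ located at node $i$ chooses the next node $j$ with a probability given by the random proportional rule defined as follows [22].

(a) State Transition Rule. Consider

$$p_{ij}^{k} = \frac{[\tau_{ij}]^{\alpha}\,[\eta_{ij}]^{\beta}}{\sum_{l \in N_i^k} [\tau_{il}]^{\alpha}\,[\eta_{il}]^{\beta}}, \quad j \in N_i^k, \tag{1}$$

where $\tau_{ij}$ is the pheromone on edge $(i, j)$, $\eta_{ij}$ is the heuristic information, $\alpha$ and $\beta$ weight their relative influence, and $N_i^k$ is its feasible neighborhood. The feasible neighborhood excludes nodes already visited in the partial tour of ant $k$, and it may be further restricted to a candidate set of the nearest neighbors of a city $i$. Once an ant has visited all nodes, it returns to its starting node.

(b) State Updating Rule. After all ants have completed their solutions, pheromone evaporation on all nodes is triggered according to (2). The pheromone on each edge is updated according to the following equation:

$$\tau_{ij} \leftarrow (1 - \rho)\,\tau_{ij} + \sum_{k=1}^{m} \Delta\tau_{ij}^{k}, \tag{2}$$

where $m$ is the number of ants at each iteration and $\rho$ is the pheromone evaporation rate. Consider

$$\Delta\tau_{ij}^{k} = \begin{cases} Q / L_k, & \text{if ant } k \text{ used edge } (i, j) \text{ in its tour}, \\ 0, & \text{otherwise}, \end{cases} \tag{3}$$

where $L_k$ denotes the tour length and $Q$ is a predefined constant.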
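
As an illustration, the random proportional rule and the evaporate-then-deposit pheromone update described in this subsection can be sketched in Python for a small symmetric tour problem. This is a minimal sketch under stated assumptions, not the implementation used in this paper: the function names, the heuristic $\eta_{ij} = 1/d_{ij}$, and the default values of `alpha`, `beta`, `rho`, and `q` are hypothetical, and the deposit $Q/L_k$ follows the standard Ant System form.

```python
import random


def transition_probabilities(i, unvisited, tau, eta, alpha=1.0, beta=2.0):
    """Random proportional rule: probability of moving from node i to each unvisited node."""
    weights = {j: (tau[i][j] ** alpha) * (eta[i][j] ** beta) for j in unvisited}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


def build_tour(start, dist, tau, alpha=1.0, beta=2.0, rng=random):
    """Let one ant build a closed tour; the return to `start` is implicit."""
    n = len(dist)
    # Assumed heuristic: inverse distance (not stated in the text).
    eta = [[0.0 if a == b else 1.0 / dist[a][b] for b in range(n)] for a in range(n)]
    tour, unvisited = [start], set(range(n)) - {start}
    while unvisited:
        probs = transition_probabilities(tour[-1], unvisited, tau, eta, alpha, beta)
        nodes = list(probs)
        nxt = rng.choices(nodes, weights=[probs[j] for j in nodes], k=1)[0]
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour


def tour_length(tour, dist):
    """Length of the closed tour, including the edge back to the first node."""
    return sum(dist[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))


def update_pheromone(tau, tours, dist, rho=0.5, q=1.0):
    """Evaporate pheromone on every edge, then deposit q / L_k on the edges each ant used."""
    n = len(tau)
    for a in range(n):
        for b in range(n):
            tau[a][b] *= 1.0 - rho
    for tour in tours:
        deposit = q / tour_length(tour, dist)
        for a, b in zip(tour, tour[1:] + tour[:1]):
            tau[a][b] += deposit
            tau[b][a] += deposit  # symmetric problem assumed
```

With four nodes and uniform initial pheromone, `build_tour(0, dist, tau)` returns a permutation of the nodes starting at 0, and one call to `update_pheromone` leaves more pheromone on the edges of the supplied tours than on unused edges.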

2.2. PSO

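PSO originated from the simulation of social behavior of birds in a flock [23, 24]. In PSO, each particle flies in the search space with a velocity adjusted by its own flying memory and its companions' flying experience. All particles have objective function values which are decided by a fitness function. Let $x_{id}$ and $v_{id}$ denote the position and velocity of particle $i$ in dimension $d$, $p_{id}$ its own best position, and $p_{gd}$ the best position found by the swarm. Consider the following:

$$v_{id}^{t+1} = v_{id}^{t} + c_1 r_1 \left(p_{id}^{t} - x_{id}^{t}\right) + c_2 r_2 \left(p_{gd}^{t} - x_{id}^{t}\right), \tag{4}$$

where $c_1$ indicates the cognition learning factor, $c_2$ indicates the social learning factor, and $r_1$ and $r_2$ are random numbers uniformly distributed in $[0, 1]$. Each particle then moves to a new potential solution based on the following equation:

$$x_{id}^{t+1} = x_{id}^{t} + v_{id}^{t+1}. \tag{5}$$

Kennedy and Eberhart [25] proposed a binary PSO in which a particle moves in a state space restricted to 0 and 1 on each dimension, in terms of the changes in probabilities that a bit will be in one state or the other. Consider the following:

$$x_{id}^{t+1} = \begin{cases} 1, & \text{if } \mathrm{rand} < S\!\left(v_{id}^{t+1}\right), \\ 0, & \text{otherwise}, \end{cases} \qquad S(v) = \frac{1}{1 + e^{-v}}. \tag{6}$$

The function $S(v)$ is a sigmoid limiting transformation and rand is a random number selected from a uniform distribution in $[0, 1]$.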
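
The continuous and binary update steps described in this subsection can be sketched in Python as follows. This is a minimal sketch, not the code used in this paper: the function names and the defaults `c1 = c2 = 2.0` are hypothetical, no inertia weight is included because the text does not mention one, and the velocity limit `v_max = 4.0` is an assumed common choice that keeps the sigmoid away from saturation.

```python
import math
import random


def update_velocity(v, x, pbest, gbest, c1=2.0, c2=2.0, rng=random):
    """Previous velocity plus a cognitive pull toward pbest and a social pull toward gbest."""
    return [
        vd + c1 * rng.random() * (pd - xd) + c2 * rng.random() * (gd - xd)
        for vd, xd, pd, gd in zip(v, x, pbest, gbest)
    ]


def update_position(x, v):
    """Continuous PSO: move each coordinate by its velocity."""
    return [xd + vd for xd, vd in zip(x, v)]


def clamp(v, v_max=4.0):
    """Restrict every velocity component to [-v_max, v_max] (assumed limit)."""
    return [max(-v_max, min(v_max, vd)) for vd in v]


def sigmoid(v):
    """Sigmoid limiting transformation mapping a velocity to a probability."""
    return 1.0 / (1.0 + math.exp(-v))


def update_binary_position(v, rng=random):
    """Binary PSO: each bit becomes 1 with probability sigmoid(velocity), else 0."""
    return [1 if rng.random() < sigmoid(vd) else 0 for vd in v]
```

In a binary encoding of stem selection, each dimension could stand for one candidate stem, so that a bit value of 1 means the stem is included in the structure; that mapping is an assumption made here for illustration.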

3. PAF

3.1. CSS
3.1.1. MACO

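(1) State Transition Rule. The task of each ant is to build a set of stems. The ants find an RNA secondary structure via a probabilistic decision rule to move through adjacent states. An ant $k$ selects a stem $i$ as follows:

$$p_{i}^{k} = \frac{[\tau_{i}]^{\alpha}\,[\eta_{i}]^{\beta}}{\sum_{j \in \mathrm{allowed}_k} [\tau_{j}]^{\alpha}\,[\eta_{j}]^{\beta}}, \quad i \in \mathrm{allowed}_k, \tag{7}$$

where $\alpha$ and $\beta$ are the regulatory factors, $\tau_i$ is the amount of pheromone trail on stem $i$, $\eta_i$ is the a priori available heuristic information, and $\mathrm{allowed}_k$ is the set of remaining stems.

(2) State Update Rule. The pheromone trails are updated according to (8) and (9). Consider

$$\tau_{i} \leftarrow (1 - \rho)\,\tau_{i} + \sum_{k} \Delta\tau_{i}^{k}, \tag{8}$$

where $\rho$ is the pheromone trail evaporation rate and $\Delta\tau_i^k$ is the quantity per unit of length of the trail substance that is laid on stem $i$ by the $k$th ant. Also, $\Delta\tau_i^k$ is given by (9), in which Energy represents the quality of the $k$th ant's solution. If stem $i$ is not included in that solution, zero is returned.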

3.1.2. Algorithm of MACO

See Algorithm 1.

Initialize the parameters of MACO
Randomly initialize the solutions for all the ants
While current number of iterations < Max iteration
  For each ant in the population
    For each stem
     Decide whether to select current stem according to (7)
    End for
    Evaluate the solution according to the energy.
  End for
  For each stem in the set
    Update the pheromones according to (8) and (9)
  End for
End while
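
One possible reading of Algorithm 1 is sketched below in Python on a toy problem. Everything specific in it is an assumption made for illustration and is not taken from this paper: a stem is a dictionary holding the set of bases it occupies and a hypothetical free energy, the heuristic is the number of bases in the stem, two stems conflict when they share a base, and each ant deposits the negated total energy of its solution on every stem it selected.

```python
import random

# Hypothetical toy stems: stem 0 and stem 1 share base 3, stem 2 conflicts with neither.
TOY_STEMS = [
    {"bases": {1, 2, 3, 20, 21, 22}, "energy": -5.0},
    {"bases": {3, 4, 5, 15, 16, 17}, "energy": -2.0},
    {"bases": {8, 9, 12, 13}, "energy": -3.0},
]


def compatible(stem, chosen):
    """A stem can be added only if it shares no base with the stems already chosen."""
    return all(stem["bases"].isdisjoint(other["bases"]) for other in chosen)


def build_solution(stems, tau, alpha, beta, rng):
    """One ant picks stems one at a time with probability proportional to tau^alpha * eta^beta."""
    chosen, chosen_ids = [], []
    remaining = list(range(len(stems)))
    while True:
        remaining = [i for i in remaining if compatible(stems[i], chosen)]
        if not remaining:
            return chosen_ids
        weights = [(tau[i] ** alpha) * (len(stems[i]["bases"]) ** beta) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(stems[pick])
        chosen_ids.append(pick)
        remaining.remove(pick)


def maco(stems, n_ants=10, n_iter=50, alpha=1.0, beta=1.0, rho=0.2, seed=0):
    """Iterate: all ants build solutions, pheromone evaporates, then ants deposit."""
    rng = random.Random(seed)
    tau = [1.0] * len(stems)
    best_ids, best_energy = [], 0.0
    for _ in range(n_iter):
        solutions = [build_solution(stems, tau, alpha, beta, rng) for _ in range(n_ants)]
        tau = [(1.0 - rho) * t for t in tau]  # evaporation on every stem
        for ids in solutions:
            energy = sum(stems[i]["energy"] for i in ids)
            if energy < best_energy:
                best_ids, best_energy = sorted(ids), energy
            for i in ids:
                tau[i] += max(0.0, -energy)  # assumed deposit rule: lower energy, more pheromone
    return best_ids, best_energy, tau
```

Calling `maco(TOY_STEMS)` returns the lowest-energy stem set found, its total energy, and the final pheromone values; stems that keep appearing in good solutions accumulate pheromone and can be read off as candidate crucial stems.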

3.2. GSB
3.2.1. MPSO

MPSO was modified based on our previous studies IPSO [16] and PSOfold [18] which could predict the RNA secondary structure with excellent performance. The objective function is improved according to the size of stem and the number of pseudoknots. Consider the following: where , , is the weight; is the free energy for the secondary structure in the th particle; is the number of pairs of the th stem; is the length of possible pairs; is the size of stem which is higher than 4; is the number of selected stems; is the total number of stems which are higher than 4; is the size of pseudoknots; is the size of possible pairs.

3.2.2. Algorithm of MPSO

See Algorithm 2.

Initialize all the parameters of MPSO
While current number of iterations < Max iteration
  For each particle
    Update its velocity
    Update its position
    Restrict position and velocity
    Calculate fitness and Update local best
  End for
  Update the global best
  Tune the parameters of MPSO via fuzzy logic controllers
End while

4. Results

The parameter details of ACRNA are number of ants = 100, number of iterations = 600, = 0.2, = 1, and = 1. For IPSO, number of particles = 100, number of iterations = 600, = 0.9, = 2, and = 2. For PSOfold, number of particles = 100, number of iterations = 600, = 0.9, = 2, and = 2. For CSS, number of ants = 100, number of iterations = 600, , , and . For GSB, number of particles = 100, number of iterations = 600, , = 2, and . To generate the mfold results presented here, the mfold Web server version 3.1 was used with default settings. One noteworthy setting is the percentage of suboptimality. This percentage allows the user to control the number of suboptimal structures predicted by mfold. In this experiment, the value was set to return the 5 percent lowest energy structures.

The measures most commonly used for prediction accuracy in the current literature are sensitivity, specificity, and F-measure. In RNA secondary structure prediction, TP (true positive) indicates the number of base pairs predicted correctly; FN (false negative) denotes the number of base pairs that exist in the real structure but were not predicted; FP (false positive) represents the number of base pairs that do not exist in the real structure but were mistakenly predicted; TN (true negative) stands for the number of possible base pairs that are absent from the real structure and were correctly not predicted. TN is rarely used in actual measurement because it is generally much larger than TP, FN, and FP. Sensitivity (Se) is the percentage of base pairs in the real structure that were correctly predicted; specificity (Sp) is the percentage of predicted base pairs that are correct. It is difficult for a predictor to score well on both measures, and predictions are usually biased in favor of one of them. The F-measure combines sensitivity and specificity into one metric and can be used as a single performance measure for a predictor. The main results of this paper are reported in terms of sensitivity and specificity [18]. The formulas are as follows:

$$\mathrm{Se} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TP}{TP + FP}, \qquad F\text{-measure} = \frac{2 \cdot \mathrm{Se} \cdot \mathrm{Sp}}{\mathrm{Se} + \mathrm{Sp}}.$$
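
The three measures can be computed directly from the predicted and real base pairs, as the following Python sketch shows. It assumes that a structure is given as a collection of `(i, j)` base-pair index tuples and that the measures take the usual forms Se = TP / (TP + FN), Sp = TP / (TP + FP), and F = 2 · Se · Sp / (Se + Sp); the function name and the example pairs are hypothetical.

```python
def accuracy_measures(predicted_pairs, real_pairs):
    """Return TP, FP, FN, sensitivity, specificity, and F-measure for one prediction."""
    predicted, real = set(predicted_pairs), set(real_pairs)
    tp = len(predicted & real)   # predicted pairs that are in the real structure
    fp = len(predicted - real)   # predicted pairs that are not in the real structure
    fn = len(real - predicted)   # real pairs that were missed
    se = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * se * sp / (se + sp) if se + sp else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "Se": se, "Sp": sp, "F": f}


# Hypothetical example: four real pairs, three predicted pairs, two of them correct.
real = [(1, 20), (2, 19), (3, 18), (5, 15)]
predicted = [(1, 20), (2, 19), (4, 17)]
scores = accuracy_measures(predicted, real)
```

In this example TP = 2, FP = 1, and FN = 2, which gives Se = 0.5, Sp = 2/3, and an F-measure of 4/7.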

Ten sequences from the comparative RNA website are selected for evaluation of the proposed method, and the details of the sequence are described in Table 1. For these sequences the natural secondary structures are also available from the comparative RNA website. These sequences were chosen as they represent different sequence lengths and come from various genomes of organisms that are exposed to a range of physiological conditions. They represent four RNA classes: 5S rRNA, Group I intron 16S rRNA, 16S rRNA, and Group I intron 23S rRNA. Due to space constraints in some tables, we refer to these specific RNA sequences by an abbreviation of the name of the organism from which they originated [18].

Table 2 compares the highest matching base pair structures of PAF and mfold with regard to sensitivity, specificity, and F-measure. PAF predicts fewer base pairs on 8 sequences. PAF obtained a higher TP in 7 cases and a lower FP in all cases. The FN values of PAF are also lower than those of mfold in 7 out of 10 cases. For sensitivity, specificity, and F-measure, PAF won in 7, 10, and 10 cases, respectively. Overall, PAF outperforms mfold with respect to sensitivity, specificity, and F-measure on most of the ten sequences.

Table 3 gives a more detailed comparison of the highest matching base pair structures from SARNA-Predict, RnaPredict, ACRNA, PAF, PSOfold, and mfold according to sensitivity and specificity. The results generated by SARNA-Predict and RnaPredict were taken from the literature [8, 11]. From Table 3, it can be seen that PAF achieves the best sensitivity on 6 sequences and the best specificity in 8 cases. SARNA-Predict gets the best results in 2 cases in terms of sensitivity and in 2 cases with regard to specificity. ACRNA wins on 1 sequence for sensitivity and on 1 sequence for specificity. PSOfold and mfold obtained the best sensitivity on one sequence and two sequences, respectively. Among the six methods, PAF gets the best results in most cases. In addition, the average sensitivity and specificity of PAF exceed those of the other methods. These results indicate that PAF outperforms the other five methods on these sequences.

To validate the stability of the proposed method, we ran ACRNA, IPSO, PAF, and PSOfold ten times each and calculated the average sensitivity and specificity of the highest matching base pair structures for each algorithm. The detailed results are shown in Table 4. PAF obtains the best sensitivity on 8 sequences and the best specificity on 7 sequences. This indicates that the proposed method performs stably across multiple sequences and surpasses the other methods on most of them. The convergence progress is shown in Figures 1, 2, and 3. The figures show that the proposed method avoids becoming trapped in local optima during the iterations, which we attribute to the crucial stems guiding the algorithm in the right direction.

5. Conclusion

In this paper, a framework, PAF, was proposed for RNA secondary structure prediction; it consists of CSS and GSB. To preserve crucial structures, MACO is used in CSS to find the important stems. MPSO is used in GSB to generate the predicted structures while reducing the search space. The experimental results show that the proposed method performs better than the other metaheuristic methods in terms of sensitivity, specificity, and F-measure. In future work, we will try to improve convergence and reduce time complexity.

Acknowledgment

This research was supported by The Chinese Government’s Executive Program “Instrumentation development and field experimentation” (SinoProbe-09).