Advances in Bioinformatics

Volume 2016, Article ID 9654921, 10 pages

http://dx.doi.org/10.1155/2016/9654921

## Random versus Deterministic Descent in RNA Energy Landscape Analysis

^{1}Informatics Department, King’s College London, London WC2R 2LS, UK^{2}School of Science and Technology, Middlesex University, London NW4 4BT, UK

Received 12 June 2015; Revised 18 November 2015; Accepted 15 December 2015

Academic Editor: Rolf Backofen

Copyright © 2016 Luke Day et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Identifying sets of metastable conformations is a major research topic in RNA energy landscape analysis, and recently several methods have been proposed for finding local minima in landscapes spawned by RNA secondary structures. An important and time-critical component of such methods is steepest, or gradient, descent in attraction basins of local minima. We analyse the speed-up achievable by randomised descent in attraction basins in the context of large sample sets where the size has an order of magnitude in the region of ~10^{6}. While the gain for each individual sample might be marginal, the overall run-time improvement can be significant. Moreover, for the two nongradient methods we analysed for partial energy landscapes induced by ten different RNA sequences, we obtained that the number of observed local minima is on average larger by 7.3% and 3.5%, respectively. The run-time improvement is approximately 16.6% and 6.8% on average over the ten partial energy landscapes. For the large sample size we selected for descent procedures, the coverage of local minima is very high up to energy values of the region where the samples were randomly selected from the partial energy landscapes; that is, the difference to the total set of local minima is mainly due to the upper area of the energy landscapes.

#### 1. Introduction

There is a great diversity in recent research on RNA secondary structure predictions, including refinements of well-established methods such as Mfold [1] and RNAfold [2], kinetic folding simulations, modelling of cotranscriptional folding, and sampling techniques focussing on approximations of the partition function over all secondary structures or specifically for metastable conformations. We briefly recall various aspects of RNA folding simulation and energy landscape analysis, in particular those that inspired the work presented in this paper.

Flamm and Hofacker provide an overview of methods for kinetic folding simulations in [3]; see also the detailed summary by Schuster [4]. While basic kinetic moves are addition and deletion of single base pairs, Flamm et al. [5] introduced the shift move, which is a combination of a base pair removal and a base pair addition where one position remains invariant. The shift move aims at the simulation of “defect diffusion” reported in [6], which tries to capture the process where the position of a bulge in a helix may move along a helix as a result of rapid base pair formation and dissociation.

Cotranscriptional folding is generally acknowledged as describing the process of how RNA folding happens* in vivo* [7]. As pointed out in [3, 8], RNA is transcribed at a rate of only ≈ 30–40 nucleotides per second, where the nascent chain starts folding as soon as it leaves the ribosome. Since helices formed by the incomplete chain may be too stable to refold later on, cotranscriptional folding may drive the folding process to a well-defined folded state that is different from a minimum free energy conformation. In a recent experimental study, Solomatin et al. [9] argue in favour of multiple RNA folding pathways to different biologically active conformations (where the authors include the wider perspective of protein folding).

RNA energy landscape analysis in the context of metastable conformations is presented, for example, in [10–15]. The Barriers tool [10] processes the output of the RNAsubopt program by Wuchty et al. [16] and returns all metastable conformations located in an energy range above the minimum free energy conformation, along with a variety of additional information about the distribution of local minima. We utilise RNAsubopt plus Barriers for generating the information about local minima in partial energy landscapes, which includes recording the run-time. However, the run-time is not compared to descent methods, since the reduction of RNAsubopt to the essential steps of generating the set of local minima would certainly be faster than the recorded times for RNAsubopt plus Barriers execution. Modifying RNAsubopt for such a task, where indeed the sets of local minima are identical to the current RNAsubopt plus Barriers results, is beyond the scope of the present paper.

Lorenz and Clote introduce in [14] the RNAlocopt tool for sampling and approximating the total number of metastable conformations using the partition function. However, currently the RNAlocopt tool has only been implemented by using the Turner 1999 energy model without dangling ends.

Li and Zhang [13] focus on the computation of the set of all possible locally optimal stack configurations over the ensemble of putative stacks, where a new heuristic procedure is utilised for the pathway analysis between local minima. The method targets conformations within a predefined energy range above the minimum free energy conformation and the authors expect the method to be applicable to sequences of up to 250 nt. Saffarian et al. [15] consider the generation of all locally optimal secondary structures assembled from a set of thermodynamically stable helices. The computational experiments for six sequences of length up to 405 nt indicate a relatively short run-time. Huang et al. [11] propose a helix-based heuristic for capturing at least significant subsets of local minima of an RNA folding space. Helices are classified by five loop types that are closed by a given helix. The construction of folding pathways utilises dynamic programming that ensures the correct nesting and juxtaposition of structural elements, where a number of best candidates is considered at each step of the construction of a folding pathway (breadth first search). For fixed values of , the run-time is estimated by energy function evaluations. Kucharík et al. [12] introduce a new connectivity model of attraction basins within energy landscapes, along with the new tool RNAlocmin that is designed for generating sets of local minima based upon modified Boltzmann sampling and steepest descent within RNA energy landscapes. The authors present various comparisons to RNAlocopt [14] regarding the coverage of local minima within a given time frame, which turn out to be in favour of RNAlocmin, partly with large differences in the number of detected local minima.

While RNAlocmin is already relatively fast, we are looking in the present paper at run-time improvements by randomising the descent within attraction basins. Furthermore, we are interested in the coverage of local minima by deterministic and random descent methods. We note that, by using randomised strategies, the completion of steepest descent is not further guaranteed. For large samples even a moderate time improvement of the descent procedure for each individual sample can result in a significant speed-up of the overall processing time. In the present paper, we take RNAlocmin [12] as deterministic steepest descent benchmark method for comparison. The implementations utilised in the present paper are accessible via http://kks.inf.kcl.ac.uk/projects/RandomDescent.php.

#### 2. Materials and Methods

##### 2.1. RNA Structure

In formal terms, a nested secondary structure of an RNA sequence of length is a node-labelled, undirected graph , where , , and , such that(1);(2) (backbone bonds);(3)for , there exists* at most* one , such that , where and comply with Watson-Crick pairs or ();(4), , and imply .

##### 2.2. RNA Folding Landscapes

The energy landscape of an RNA sequence , denoted by , can be described by three components: a set of secondary structure conformations , a neighbourhood function , and a free energy evaluation function . The conformation space consists of secondary structures as defined above and computed by tools such as RNAsubopt [2]. It is important to distinguish between two types of conformation spaces: noncanonical and more restricted canonical spaces. A conformation is canonical, if for every base pairing there exists another base pairing adjacent to at position and/or . In noncanonical conformations, isolated base pairs are admitted. Here, we consider canonical conformation spaces only.

The neighbourhood function of a secondary structure defines the adjacency of the conformation space . For the secondary structure , its neighbourhood is a set of structures that are reachable from by applying a single operation from a move set, . Flamm et al. [5] describe two move sets for RNA folding, a basic move set consisting of insertion and deletion of base pairs and a move set where a shift move to facilitate chain sliding is included. In the present work, we consider the insertion and deletion move set, with the reason being that the Barriers implementation of the two move sets generates shift moves only for noncanonical structures. The basic move set is therefore defined in the following way:(1)Single or double insertion:(a)A single base pair may be inserted at position , if an existing helix is extended; that is, and/or are paired.(b)Two base pairings may be inserted at positions and , if or is not adjacent to an existing base pair belonging to the same helix; that is, and or and are unpaired.(2)Single or double deletion:(a)A single base pairing may be deleted, if its removal does not result in a noncanonical structure.(b)Two base pairings and may be deleted,(i)if positions and are unpaired,(ii)if position is the closing base of a different helix and is unpaired,(iii)if is unpaired and is unpaired,(iv)if is the opening base of a different helix and is unpaired. Additionally, the moves must also satisfy the secondary structure rules as described above, including a minimum length of hairpins of 3. The number of possible neighbours is bounded by , where is the length of the structure. The implementation RNAbor for studying statistics of RNA structural neighbours has been introduced in [17]. RNAbor computes the number and Boltzmann probabilities of all structures having base pair distance to an input structure . RNAbor uses dynamic programming and has a complexity of . Currently, RNAbor works for noncanonical neighbour spaces and uses an older version of the nearest neighbour energy model.

The energy function calculates the free energy of secondary structures and can be calculated by using, for example, the RNAeval tool from the Vienna RNA Package [2]. Finally, a structure is metastable (or a local minimum) of the landscape if all its neighbours have higher or equal energy; that is, .

##### 2.3. Main Features of RNAlocmin

Here, we briefly describe the main features of RNAlocmin as presented in [12]. The RNAlocmin tool [12] from the Vienna RNA Group accepts as input a set of RNA secondary structure conformations and calculates for each structure its corresponding local minimum conformation that defines the attraction basin to which belongs. The underlying method implemented by RNAlocmin is a descent algorithm. RNAlocmin implements three types of descent: (1) a gradient or steepest descent, (2) a first-lower descent, and (3) a random first-lower descent. Along with the local minima structures and their free energies , RNAlocmin counts the total number of input structures that fold into each particular local minimum . As the number of input structures is typically much larger than the number of local minima, some local minima must be reached by multiple input structures. The values therefore provide some insight into the number of structures belonging to attraction basins of the energy landscapes, and consequently the potential size of those basins.

The input conformations are converted into a numerical representation, where for each base pairing the opening position is stored at its closing position and the closing position is stored at its opening position. All unpaired positions are set to 0. For example, the structure .(((...))).. of length 12 is represented numerically by

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | |

Structure | · | ( | ( | ( | · | · | · | ) | ) | ) | · | · |

0 | 10 | 9 | 8 | 0 | 0 | 0 | 4 | 3 | 2 | 0 | 0 |

The numerical representation supports the efficient search for potential closing positions of an unpaired open position . Figure 1 illustrates typical scenarios for finding a suitable -position, given position : (a) the search starts from . If , then is the first position of a helix and is updated to . For example, as for the structure .(((...))).., if and , then and updated to 11; (b) position is the closing of a base pairing within a hairpin region where . As indicated in the figure, insertion checks only positions where a potential pairing is possible according to the current structure. In the first case (a), a base pair cannot be inserted between and for a number of values ; that is, the search for a suitable “jumps over helices.” The second case (b) occurs if is within the hairpin region of a helix, which is recognised from .