Predicting Pseudoknots from the Base Sequence
---------------------------------------------
This lecture is based on two papers:
[RE] A dynamic programming algorithm for RNA structure prediction including
pseudoknots, E. Rivas and S. R. Eddy, J. Mol. Biol. (1999), 225, 2053-2068.
[LP] Pseudoknot prediction in energy based models, R.B. Lyngso and
C.N.S. Pedersen, Journal of Computational Biology (2000), 7(3),
409-427. Available at
http://www.daimi.au.dk/~cstorm/papers/index.html.
Introduction
------------
Although the structure of many RNA molecules with pseudoknots has been
determined, there are no widely agreed-upon models for calculating the
free energy of an RNA secondary structure with pseudoknots, comparable
to the thermodynamic (sum-of-loops) model used for pseudoknot-free
structures. (This thermodynamic model is credited to Tinoco et al.,
1971, and is the basis for the algorithms of Zuker and others.)
In light of this, the approach of Rivas and Eddy was to develop a
plausible model, along with an algorithm for this model, and test its
performance on both pseudoknotted and pseudoknot-free RNA
structures. Their algorithm is based on dynamic programming. Lyngso
and Pedersen note in their paper that "developing an algorithm and
deciding on a model are closely connected processes", and this is
indeed the case with the development in the Rivas and Eddy paper. They
use very nice diagrams to depict the recurrences, so we'll start by
describing the recurrences from the algorithm of the last lecture
using the new notation.
Pseudoknot free algorithm, revisited
------------------------------------
We'll depict a secondary structure for a strand s1...sn as a horizontal
line, with dots (as needed) indicating bases, and wavy lines indicating
base pairs in the structure. (See Figure 2 of RE; note multiloop.)
In the pseudoknot free algorithm, we used W(i) to denote the min free
energy of a secondary structure for s_1 ... s_i. We'll generalize this
now to W(i,j), denoting the min free energy of a secondary structure
for s_i ... s_j. Note that
W(i,j) = min{ V(i,j), min_{k in [i,j]}(W(i,k) + W(k+1,j)) }
with the base case being W(i,i) =0. We can represent this pictorially
as in slide.
We used V(i,j) to denote the min free energy of a secondary structure
for s_i ...s_j (i <= j), taken only over structure. v(i,j) is the min
over the possibilities that i.j closes a hairpin, internal loop
(bulge, interior, or stacked pair), or multiloop. We depict these
possibilities as in Figure 4 of RE. Note the way that multiloops are
being handled; we did not discuss this in the previous lecture.
The recurrence for W_I is the same as for W, except for the base case,
and a penalty for adding a branch.
Show how pseudoknot-free part of Figure 2 example consists of nested
W's and V's.
Thermodynamic parameters:
eH(i,j) (hairpin)
eL(i,j) (internal loop (bulge, stacked pair, internal)
a (multiloop initiation) [called M in RE]
b (multiloop pair (branch)) [called P in RE]
c (multiloop unpaired base) [called Q_I in RE]
[RE also has Q and P but acknowledges that they are typically 0. RE also
accounts for dangles and coaxial stacking, which we ignore here.]
Generalizing to pseudoknots
---------------------------
In order to handle pseudoknots, it is easiest to see pictorially how
things generalize (see slides). We generalize W to Wh (where "h"
stands for hole). Wh(i,j:k,l) denotes the min energy of a secondary
structure for s_i ... s_j, in which the bases s_k, ..., s_l remain
unpaired; some or all of these bases can form a pseudoknot.
Draw picture that corresponds to pseudoknot of Figure 2.
It turns out to be convenient to have four recurrences, one for each
of the possibilities depending on whether i must be paired with j or
not, and whether k must be paired with l or not. These are called Wh,
Vh, Yh, and Zh (see slide). In total, we need recurrences for all for
of these, plus for W and V, generalized to allow pseudoknots.
New thermodynamic parameters:
eH(i,j:k,l) (hairpin from i to j where there is a gap from k to l)
eL(i,j:k,l) (internal loop from i to j, with gap from k to l)
a' (non-nested multiloop initiation) [called tilde{M} in RE]
b' (non-nested multiloop pair (branch)) [called P in RE]
c' (non-nested multiloop unpaired base) [called Q_I in RE]
g (pseudoknot initiation penalty)
[called G_{w_I} in RE]
g_I (pseudoknot initiation penalty, when pseudoknot is inside a multiloop)
[called G_{w_I} in RE]
--------------------------------------------------------------------------------
Examples
--------
NP-hardness
-----------
Lyngso and Pedersen showed that for a simple thermodynamic model, in
which the cost of a base pair depends on whether it forms a pseudoknot
or not, the secondary structure prediction problem is NP-hard. They
claim that their simple model is a restriction of the model of Lyngso
and Pedersen, and so implies that the problem of RNA secondary
structure prediction in their model is also NP-hard. However, in
reducing SAT to the prediction problem, they use nearest neighbour
energy values that grow with the size of the SAT instance. Thus, in
some sense, their hardness result only implies a lower bound (pending
P not equal to NP) for algorithms whose (worst case) running times
don't depend on the actual free energy values.
Last words on RNA secondary structure prediction
------------------------------------------------
- So far, we have only considered the problem of predicting the
optimal secondary structure, with respect to a particular
thermodynamic model. In reality, however, several different structures
may have similarly low free energy, and more than one fold may be
likely. Applying principles from statistical mechanics, it is
possible to assign a probability to each secondary structure, with the
probability of a structure being proportional to an exponential in its
free energy. The so-called partition function of statistical mechanics
assigns this probability to each potential structure. The partition
function confers on each potential base pair i.j the probability that
i.j is in the secondary structure. McCaskill has described an efficient
dynamic programming algorithm for calculating these base pair
probabilities efficiently.
- Isambert and Siggia have a somewhat different model of the free energy
of RNA structures. See "Modeling RNA folding paths with pseudoknots:
Application to hepatitis delta virus ribozeme" (June 6, 2000), PNAS,
97:12, 6515-6520. See also http://online.itp.ucsb.edu/online/infobio01/isambert/
for slides and audio of a talk by Isambert on the topic. Notation in slide:
h is the stem length for one turn of a helix, s1, s2 are number of unpaired
bases in loops, t1, t2 are helix lengths. [Note that for unpseudoknotted
structure, if you take logs on both sides, you get ln(s1) and ln(s2). These
are usually approximated by linear expressions in algorithms.] What
are the differences between the IS model and the RE model?