Predicting Pseudoknots from the Base Sequence --------------------------------------------- This lecture is based on two papers: [RE] A dynamic programming algorithm for RNA structure prediction including pseudoknots, E. Rivas and S. R. Eddy, J. Mol. Biol. (1999), 225, 2053-2068. [LP] Pseudoknot prediction in energy based models, R.B. Lyngso and C.N.S. Pedersen, Journal of Computational Biology (2000), 7(3), 409-427. Available at http://www.daimi.au.dk/~cstorm/papers/index.html. Introduction ------------ Although the structure of many RNA molecules with pseudoknots has been determined, there are no widely agreed-upon models for calculating the free energy of an RNA secondary structure with pseudoknots, comparable to the thermodynamic (sum-of-loops) model used for pseudoknot-free structures. (This thermodynamic model is credited to Tinoco et al., 1971, and is the basis for the algorithms of Zuker and others.) In light of this, the approach of Rivas and Eddy was to develop a plausible model, along with an algorithm for this model, and test its performance on both pseudoknotted and pseudoknot-free RNA structures. Their algorithm is based on dynamic programming. Lyngso and Pedersen note in their paper that "developing an algorithm and deciding on a model are closely connected processes", and this is indeed the case with the development in the Rivas and Eddy paper. They use very nice diagrams to depict the recurrences, so we'll start by describing the recurrences from the algorithm of the last lecture using the new notation. Pseudoknot free algorithm, revisited ------------------------------------ We'll depict a secondary structure for a strand s1...sn as a horizontal line, with dots (as needed) indicating bases, and wavy lines indicating base pairs in the structure. (See Figure 2 of RE; note multiloop.) In the pseudoknot free algorithm, we used W(i) to denote the min free energy of a secondary structure for s_1 ... s_i. We'll generalize this now to W(i,j), denoting the min free energy of a secondary structure for s_i ... s_j. Note that W(i,j) = min{ V(i,j), min_{k in [i,j]}(W(i,k) + W(k+1,j)) } with the base case being W(i,i) =0. We can represent this pictorially as in slide. We used V(i,j) to denote the min free energy of a secondary structure for s_i ...s_j (i <= j), taken only over structure. v(i,j) is the min over the possibilities that i.j closes a hairpin, internal loop (bulge, interior, or stacked pair), or multiloop. We depict these possibilities as in Figure 4 of RE. Note the way that multiloops are being handled; we did not discuss this in the previous lecture. The recurrence for W_I is the same as for W, except for the base case, and a penalty for adding a branch. Show how pseudoknot-free part of Figure 2 example consists of nested W's and V's. Thermodynamic parameters: eH(i,j) (hairpin) eL(i,j) (internal loop (bulge, stacked pair, internal) a (multiloop initiation) [called M in RE] b (multiloop pair (branch)) [called P in RE] c (multiloop unpaired base) [called Q_I in RE] [RE also has Q and P but acknowledges that they are typically 0. RE also accounts for dangles and coaxial stacking, which we ignore here.] Generalizing to pseudoknots --------------------------- In order to handle pseudoknots, it is easiest to see pictorially how things generalize (see slides). We generalize W to Wh (where "h" stands for hole). Wh(i,j:k,l) denotes the min energy of a secondary structure for s_i ... s_j, in which the bases s_k, ..., s_l remain unpaired; some or all of these bases can form a pseudoknot. Draw picture that corresponds to pseudoknot of Figure 2. It turns out to be convenient to have four recurrences, one for each of the possibilities depending on whether i must be paired with j or not, and whether k must be paired with l or not. These are called Wh, Vh, Yh, and Zh (see slide). In total, we need recurrences for all for of these, plus for W and V, generalized to allow pseudoknots. New thermodynamic parameters: eH(i,j:k,l) (hairpin from i to j where there is a gap from k to l) eL(i,j:k,l) (internal loop from i to j, with gap from k to l) a' (non-nested multiloop initiation) [called tilde{M} in RE] b' (non-nested multiloop pair (branch)) [called P in RE] c' (non-nested multiloop unpaired base) [called Q_I in RE] g (pseudoknot initiation penalty) [called G_{w_I} in RE] g_I (pseudoknot initiation penalty, when pseudoknot is inside a multiloop) [called G_{w_I} in RE] -------------------------------------------------------------------------------- Examples -------- NP-hardness ----------- Lyngso and Pedersen showed that for a simple thermodynamic model, in which the cost of a base pair depends on whether it forms a pseudoknot or not, the secondary structure prediction problem is NP-hard. They claim that their simple model is a restriction of the model of Lyngso and Pedersen, and so implies that the problem of RNA secondary structure prediction in their model is also NP-hard. However, in reducing SAT to the prediction problem, they use nearest neighbour energy values that grow with the size of the SAT instance. Thus, in some sense, their hardness result only implies a lower bound (pending P not equal to NP) for algorithms whose (worst case) running times don't depend on the actual free energy values. Last words on RNA secondary structure prediction ------------------------------------------------ - So far, we have only considered the problem of predicting the optimal secondary structure, with respect to a particular thermodynamic model. In reality, however, several different structures may have similarly low free energy, and more than one fold may be likely. Applying principles from statistical mechanics, it is possible to assign a probability to each secondary structure, with the probability of a structure being proportional to an exponential in its free energy. The so-called partition function of statistical mechanics assigns this probability to each potential structure. The partition function confers on each potential base pair i.j the probability that i.j is in the secondary structure. McCaskill has described an efficient dynamic programming algorithm for calculating these base pair probabilities efficiently. - Isambert and Siggia have a somewhat different model of the free energy of RNA structures. See "Modeling RNA folding paths with pseudoknots: Application to hepatitis delta virus ribozeme" (June 6, 2000), PNAS, 97:12, 6515-6520. See also http://online.itp.ucsb.edu/online/infobio01/isambert/ for slides and audio of a talk by Isambert on the topic. Notation in slide: h is the stem length for one turn of a helix, s1, s2 are number of unpaired bases in loops, t1, t2 are helix lengths. [Note that for unpseudoknotted structure, if you take logs on both sides, you get ln(s1) and ln(s2). These are usually approximated by linear expressions in algorithms.] What are the differences between the IS model and the RE model?