CPSC536a - Assignment 3 (covers Modules 4+5)

handed out Mon, 02/03/25; due Mon, 02/04/08

1 RNA Secondary Structure [8 marks]

Use the mfold RNA secondary structure predication algorithm of Michael Zuker to answer this question. You can find mfold at http://bioinfo.math.rpi.edu/~mfold/rna/form1.cgi - this web page is self-explanatory.

(a) What is the predicted secondary structure of the RNA sequence GGCCAAGGCC? (Note: we use the convention that the left end of this string is the 5' end and the right end is the 3' end, as does Zuker's program.) [2 marks]

(b) Can you find a sequence that folds into the secondary structure described by ((((*((((***))))*((((***))))*((((***))))*))))? (Note that, using set notation, this structure, which is for a string of length 45, is {(1,45), (2,44), (3,43), (4,42), (6,16), (7,15), (8,14), (9,13), (18, 28), (19,,27), (20, 26), (21, 25), (30, 40), (31, 39), (32,38), (33,37)}.) [6 marks]

2 Gene Finding / Hidden Markov Models [12 marks]

Suppose we model a stretch of DNA containing two types of regions. Region 1 contains bases A,T,C, and G with equal frequency. Region 2 has a higher frequency of G and C than of A and T. We also have some knowledge about the length of these regions and the overall length of the modelled sequence.

Consider the following simple 4-state Hidden Markov Model (HMM) for this stretch of DNA:

states: B (Begin), E (End), 1 (Region 1), 2 (Region 2).
transition probabilities: P(B->1) = P(B->2) = 0.5; P(1->1)=0.7; P(1->2)=0.1; P(2->1)=0.5; P(2->2)=0.3; P(1->E)=P(2->E)=0.2.
emission probabilities: P(emit X in State 1) = 0.25 for X = A,T,G, or C; P(emit X in State 2) = 0.1 for X = A,T; P(emit G in State 2) = 0.5; P(emit C in State 2) = 0.3.

Note: B and E are special begin and end states, respectively. They don't emit symbols. When generating output, the HMM is initiated in state B and terminates in state E.

(a) Compute P(obs=ATGCG, path=B11111E), the probability of following state path B11111E and emitting the sequence ATGCG. [2 marks]

(b) For the observation sequence ATGCG, compute alpha_2(2), the probability of being in state 2 at time 2 and having emitted the sequence AT. Show the steps of your calculation. [4 marks]

(c) For the observation sequence ATGCG, compute the most probable path that accounts for the sequence AT and ends in state 2; give delta_2(2), the probability for that path. Show the steps of your calculation. [6 marks]

Bonus Question (no marks) How does the given HMM specify the length of Regions 1 and 2 and the numbers of occurrences of Regions 1 and 2? How could this be done more explicitly, e.g., if length distributions for Regions 1 and 2, and distributions of the number of occurrences of these regions are given?

Hint: If you are uncertain how to solve this problem, consult Chapter 7 of "Bioinformatics - The Machine Learning Approach" by Baldi and Brunak (available from the Reading Room).

3 Motif Discovery / Relative Entropy [10 marks]

Suppose that in 16 DNA sequences, you are interested in finding a perfectly conserved pattern of length 5 that is in 6 of the sequences. Suppose also that the background distribution is uniform (each base A,C,G, and T occur with probability 1/4).

(a) What would be the relative entropy of a set of 18 strings of length 5, 6 of which are the perfectly conserved motif (i.e. all are equal) and the remaining 12 of which are "perfectly random", with 3 A's, C's, G's, and T's in each position? [4 marks]

(b) What would be the relative entropy of a set of 18 strings of length 5, where in each position, three of the bases occur with equal frequency but one base is absent? [4 marks]

(c) Would an algorithm that finds a motif of maximum relative entropy prefer the hidden perfectly conserved motif (a) over the motif described in (b), or vice versa? [2 marks]

General remarks:

While cooperation between students - especially between CS and non-CS students - is encouraged, each student is expected to work out the actual solutions to the problems individually and hand in their own assignment. In other words: help each other, but do not copy solutions.
Feel always free to contact Anne or Holger if you feel you need further help than can be provided by your fellow students.
The assignment has to be handed in on the date it is due before or at the beginning of class.
This assignment should take you about 1.5-3 hours of work, if you have good knowledge of the topics covered and did all reading assignments. However, don't wait until the last minute relying on this estimate - it might not apply to you (or anyone at all), you might need additional time to consult the literature, ...