CPSC536a - Assignment 3 (covers Modules 4+5)
handed out Mon, 02/03/25; due Mon, 02/04/08
1 RNA Secondary Structure [8 marks]
Use the mfold RNA secondary structure predication algorithm of Michael
Zuker to answer this question. You can find mfold at
http://bioinfo.math.rpi.edu/~mfold/rna/form1.cgi - this web page is
self-explanatory.
(a) What is the predicted secondary structure of the RNA sequence
GGCCAAGGCC? (Note: we use the convention that the left end of this string
is the 5' end and the right end is the 3' end, as does Zuker's program.)
[2 marks]
(b) Can you find a sequence that folds into the secondary structure
described by ((((*((((***))))*((((***))))*((((***))))*))))? (Note that,
using set notation, this structure, which is for a string of length 45, is
{(1,45), (2,44), (3,43), (4,42), (6,16), (7,15), (8,14), (9,13), (18, 28),
(19,,27), (20, 26), (21, 25), (30, 40), (31, 39), (32,38), (33,37)}.)
[6 marks]
2 Gene Finding / Hidden Markov Models [12 marks]
Suppose we model a stretch of DNA containing two types of regions.
Region 1 contains bases A,T,C, and G with equal frequency.
Region 2 has a higher frequency of G and C than of A and T.
We also have some knowledge about the length of these regions
and the overall length of the modelled sequence.
Consider the following simple 4-state Hidden Markov Model (HMM)
for this stretch of DNA:
- states: B (Begin), E (End), 1 (Region 1), 2 (Region 2).
- transition probabilities: P(B->1) = P(B->2) = 0.5;
P(1->1)=0.7; P(1->2)=0.1; P(2->1)=0.5; P(2->2)=0.3;
P(1->E)=P(2->E)=0.2.
- emission probabilities: P(emit X in State 1) = 0.25 for
X = A,T,G, or C;
P(emit X in State 2) = 0.1 for X = A,T;
P(emit G in State 2) = 0.5;
P(emit C in State 2) = 0.3.
Note: B and E are special begin and end states, respectively. They
don't emit symbols. When generating output, the HMM is initiated
in state B and terminates in state E.
(a)
Compute P(obs=ATGCG, path=B11111E), the probability
of following state path B11111E and emitting the sequence ATGCG.
[2 marks]
(b)
For the observation sequence ATGCG, compute alpha_2(2), the probability
of being in state 2 at time 2 and having emitted the sequence AT.
Show the steps of your calculation.
[4 marks]
(c)
For the observation sequence ATGCG, compute the most probable path that accounts for the sequence AT and ends in state 2;
give delta_2(2), the probability for that path.
Show the steps of your calculation.
[6 marks]
Bonus Question (no marks)
How does the given HMM specify the length of Regions 1 and 2
and the numbers of occurrences of Regions 1 and 2?
How could this be done more explicitly, e.g., if length distributions
for Regions 1 and 2, and distributions of
the number of occurrences of these regions are given?
Hint: If you are uncertain how to solve this problem, consult Chapter 7
of "Bioinformatics - The Machine Learning Approach" by Baldi and Brunak
(available from the Reading Room).
3 Motif Discovery / Relative Entropy [10 marks]
Suppose that in 16 DNA sequences, you are interested in finding a perfectly
conserved pattern of length 5 that is in 6 of the sequences. Suppose
also that the background distribution is uniform (each base A,C,G, and
T occur with probability 1/4).
(a) What would be the relative entropy of a set of 18 strings of
length 5, 6 of which are the perfectly conserved motif (i.e. all are
equal) and the remaining 12 of which are "perfectly random", with 3
A's, C's, G's, and T's in each position?
[4 marks]
(b) What would be the relative entropy of a set of 18 strings of length 5, where in each position, three of the bases occur with equal frequency
but one base is absent?
[4 marks]
(c) Would an algorithm that finds a motif of maximum relative entropy
prefer the hidden perfectly conserved motif (a) over the motif
described in (b), or vice versa?
[2 marks]
General remarks:
-
While cooperation between students - especially between CS and non-CS students - is
encouraged, each student is expected to work out the actual solutions to the problems individually and hand in their own assignment. In other words: help each other, but do not copy solutions.
-
Feel always free to contact Anne or Holger if you feel you need further help than
can be provided by your fellow students.
- The assignment has to be handed in on the date it is due before or at the beginning of class.
- This assignment should take you about 1.5-3 hours of work, if you have good knowledge of the topics covered and did all reading assignments. However, don't wait until the last minute relying on this estimate - it might not apply to you (or anyone at all), you might need additional time to consult the literature, ...