CPSC536a - Assignment 3 (covers Modules 4+5)

handed out Mon, 02/03/25; due Mon, 02/04/08

1 RNA Secondary Structure [8 marks]

Use the mfold RNA secondary structure predication algorithm of Michael Zuker to answer this question. You can find mfold at http://bioinfo.math.rpi.edu/~mfold/rna/form1.cgi - this web page is self-explanatory.

(a) What is the predicted secondary structure of the RNA sequence GGCCAAGGCC? (Note: we use the convention that the left end of this string is the 5' end and the right end is the 3' end, as does Zuker's program.) [2 marks]

(b) Can you find a sequence that folds into the secondary structure described by ((((*((((***))))*((((***))))*((((***))))*))))? (Note that, using set notation, this structure, which is for a string of length 45, is {(1,45), (2,44), (3,43), (4,42), (6,16), (7,15), (8,14), (9,13), (18, 28), (19,,27), (20, 26), (21, 25), (30, 40), (31, 39), (32,38), (33,37)}.) [6 marks]

2 Gene Finding / Hidden Markov Models [12 marks]

Suppose we model a stretch of DNA containing two types of regions. Region 1 contains bases A,T,C, and G with equal frequency. Region 2 has a higher frequency of G and C than of A and T. We also have some knowledge about the length of these regions and the overall length of the modelled sequence.

Consider the following simple 4-state Hidden Markov Model (HMM) for this stretch of DNA:

Note: B and E are special begin and end states, respectively. They don't emit symbols. When generating output, the HMM is initiated in state B and terminates in state E.

(a) Compute P(obs=ATGCG, path=B11111E), the probability of following state path B11111E and emitting the sequence ATGCG. [2 marks]

(b) For the observation sequence ATGCG, compute alpha_2(2), the probability of being in state 2 at time 2 and having emitted the sequence AT. Show the steps of your calculation. [4 marks]

(c) For the observation sequence ATGCG, compute the most probable path that accounts for the sequence AT and ends in state 2; give delta_2(2), the probability for that path. Show the steps of your calculation. [6 marks]

Bonus Question (no marks) How does the given HMM specify the length of Regions 1 and 2 and the numbers of occurrences of Regions 1 and 2? How could this be done more explicitly, e.g., if length distributions for Regions 1 and 2, and distributions of the number of occurrences of these regions are given?

Hint: If you are uncertain how to solve this problem, consult Chapter 7 of "Bioinformatics - The Machine Learning Approach" by Baldi and Brunak (available from the Reading Room).

3 Motif Discovery / Relative Entropy [10 marks]

Suppose that in 16 DNA sequences, you are interested in finding a perfectly conserved pattern of length 5 that is in 6 of the sequences. Suppose also that the background distribution is uniform (each base A,C,G, and T occur with probability 1/4).

(a) What would be the relative entropy of a set of 18 strings of length 5, 6 of which are the perfectly conserved motif (i.e. all are equal) and the remaining 12 of which are "perfectly random", with 3 A's, C's, G's, and T's in each position? [4 marks]

(b) What would be the relative entropy of a set of 18 strings of length 5, where in each position, three of the bases occur with equal frequency but one base is absent? [4 marks]

(c) Would an algorithm that finds a motif of maximum relative entropy prefer the hidden perfectly conserved motif (a) over the motif described in (b), or vice versa? [2 marks]

General remarks: