# CPSC536a - Assignment 3 (covers Modules 4+5)

handed out Mon, 02/03/25; due Mon, 02/04/08

#### 1 RNA Secondary Structure [8 marks]

Use the mfold RNA secondary structure predication algorithm of Michael Zuker to answer this question. You can find mfold at http://bioinfo.math.rpi.edu/~mfold/rna/form1.cgi - this web page is self-explanatory.

(a) What is the predicted secondary structure of the RNA sequence GGCCAAGGCC? (Note: we use the convention that the left end of this string is the 5' end and the right end is the 3' end, as does Zuker's program.) [2 marks]

(b) Can you find a sequence that folds into the secondary structure described by ((((*((((***))))*((((***))))*((((***))))*))))? (Note that, using set notation, this structure, which is for a string of length 45, is {(1,45), (2,44), (3,43), (4,42), (6,16), (7,15), (8,14), (9,13), (18, 28), (19,,27), (20, 26), (21, 25), (30, 40), (31, 39), (32,38), (33,37)}.) [6 marks]

#### 2 Gene Finding / Hidden Markov Models [12 marks]

Suppose we model a stretch of DNA containing two types of regions. Region 1 contains bases A,T,C, and G with equal frequency. Region 2 has a higher frequency of G and C than of A and T. We also have some knowledge about the length of these regions and the overall length of the modelled sequence.

Consider the following simple 4-state Hidden Markov Model (HMM) for this stretch of DNA:

• states: B (Begin), E (End), 1 (Region 1), 2 (Region 2).
• transition probabilities: P(B->1) = P(B->2) = 0.5; P(1->1)=0.7; P(1->2)=0.1; P(2->1)=0.5; P(2->2)=0.3; P(1->E)=P(2->E)=0.2.
• emission probabilities: P(emit X in State 1) = 0.25 for X = A,T,G, or C; P(emit X in State 2) = 0.1 for X = A,T; P(emit G in State 2) = 0.5; P(emit C in State 2) = 0.3.
Note: B and E are special begin and end states, respectively. They don't emit symbols. When generating output, the HMM is initiated in state B and terminates in state E.

(a) Compute P(obs=ATGCG, path=B11111E), the probability of following state path B11111E and emitting the sequence ATGCG. [2 marks]

(b) For the observation sequence ATGCG, compute alpha_2(2), the probability of being in state 2 at time 2 and having emitted the sequence AT. Show the steps of your calculation. [4 marks]

(c) For the observation sequence ATGCG, compute the most probable path that accounts for the sequence AT and ends in state 2; give delta_2(2), the probability for that path. Show the steps of your calculation. [6 marks]

Bonus Question (no marks) How does the given HMM specify the length of Regions 1 and 2 and the numbers of occurrences of Regions 1 and 2? How could this be done more explicitly, e.g., if length distributions for Regions 1 and 2, and distributions of the number of occurrences of these regions are given?

Hint: If you are uncertain how to solve this problem, consult Chapter 7 of "Bioinformatics - The Machine Learning Approach" by Baldi and Brunak (available from the Reading Room).

#### 3 Motif Discovery / Relative Entropy [10 marks]

Suppose that in 16 DNA sequences, you are interested in finding a perfectly conserved pattern of length 5 that is in 6 of the sequences. Suppose also that the background distribution is uniform (each base A,C,G, and T occur with probability 1/4).

(a) What would be the relative entropy of a set of 18 strings of length 5, 6 of which are the perfectly conserved motif (i.e. all are equal) and the remaining 12 of which are "perfectly random", with 3 A's, C's, G's, and T's in each position? [4 marks]

(b) What would be the relative entropy of a set of 18 strings of length 5, where in each position, three of the bases occur with equal frequency but one base is absent? [4 marks]

(c) Would an algorithm that finds a motif of maximum relative entropy prefer the hidden perfectly conserved motif (a) over the motif described in (b), or vice versa? [2 marks]

#### General remarks:

• While cooperation between students - especially between CS and non-CS students - is encouraged, each student is expected to work out the actual solutions to the problems individually and hand in their own assignment. In other words: help each other, but do not copy solutions.
• Feel always free to contact Anne or Holger if you feel you need further help than can be provided by your fellow students.
• The assignment has to be handed in on the date it is due before or at the beginning of class.
• This assignment should take you about 1.5-3 hours of work, if you have good knowledge of the topics covered and did all reading assignments. However, don't wait until the last minute relying on this estimate - it might not apply to you (or anyone at all), you might need additional time to consult the literature, ...