**(a)** What is the predicted secondary structure of the RNA sequence
GGCCAAGGCC? (Note: we use the convention that the left end of this string
is the 5' end and the right end is the 3' end, as does Zuker's program.)
[2 marks]

**(b)** Can you find a sequence that folds into the secondary structure
described by ((((*((((***))))*((((***))))*((((***))))*))))? (Note that,
using set notation, this structure, which is for a string of length 45, is
{(1,45), (2,44), (3,43), (4,42), (6,16), (7,15), (8,14), (9,13), (18, 28),
(19,,27), (20, 26), (21, 25), (30, 40), (31, 39), (32,38), (33,37)}.)
[6 marks]

Consider the following simple 4-state Hidden Markov Model (HMM) for this stretch of DNA:

- states: B (Begin), E (End), 1 (Region 1), 2 (Region 2).
- transition probabilities: P(B->1) = P(B->2) = 0.5; P(1->1)=0.7; P(1->2)=0.1; P(2->1)=0.5; P(2->2)=0.3; P(1->E)=P(2->E)=0.2.
- emission probabilities: P(emit X in State 1) = 0.25 for X = A,T,G, or C; P(emit X in State 2) = 0.1 for X = A,T; P(emit G in State 2) = 0.5; P(emit C in State 2) = 0.3.

**(a)**
Compute P(obs=ATGCG, path=B11111E), the probability
of following state path B11111E and emitting the sequence ATGCG.
[2 marks]

**(b)**
For the observation sequence ATGCG, compute alpha_2(2), the probability
of being in state 2 at time 2 and having emitted the sequence AT.
Show the steps of your calculation.
[4 marks]

**(c)**
For the observation sequence ATGCG, compute the most probable path that accounts for the sequence AT and ends in state 2;
give delta_2(2), the probability for that path.
Show the steps of your calculation.
[6 marks]

**Bonus Question** (no marks)
How does the given HMM specify the length of Regions 1 and 2
and the numbers of occurrences of Regions 1 and 2?
How could this be done more explicitly, e.g., if length distributions
for Regions 1 and 2, and distributions of
the number of occurrences of these regions are given?

*Hint:* If you are uncertain how to solve this problem, consult Chapter 7
of "Bioinformatics - The Machine Learning Approach" by Baldi and Brunak
(available from the Reading Room).

**(a)** What would be the relative entropy of a set of 18 strings of
length 5, 6 of which are the perfectly conserved motif (i.e. all are
equal) and the remaining 12 of which are "perfectly random", with 3
A's, C's, G's, and T's in each position?
[4 marks]

**(b)** What would be the relative entropy of a set of 18 strings of length 5, where in each position, three of the bases occur with equal frequency
but one base is absent?
[4 marks]

**(c)** Would an algorithm that finds a motif of maximum relative entropy
prefer the hidden perfectly conserved motif (a) over the motif
described in (b), or vice versa?
[2 marks]

- While cooperation between students - especially between CS and non-CS students - is encouraged, each student is expected to work out the actual solutions to the problems individually and hand in their own assignment. In other words: help each other, but do not copy solutions.
- Feel always free to contact Anne or Holger if you feel you need further help than can be provided by your fellow students.
- The assignment has to be handed in on the date it is due before or at the beginning of class.
- This assignment should take you about 1.5-3 hours of work, if you have good knowledge of the topics covered and did all reading assignments. However, don't wait until the last minute relying on this estimate - it might not apply to you (or anyone at all), you might need additional time to consult the literature, ...