CPSC 445 - Assignment 3

released: Tue, 08/03/18; due: Tue, 08/04/01, 9:30 (just before the beginning of class)
Include your name, student ID, and all classmates who you worked with.

1. Phylogenetic Trees / Parsimony [30 marks]

(a) Calculate the parsimony score of the two (rooted) trees shown below, using Fitch's algorithm. (Show all steps of your work) [6 marks]

(b) For tree 2, show all the possible internal label combinations that produce the same parsimony score as that calculated in (a). Label which combinations can not be produced by Fitch's algorithm (if any exist). [9 marks]

(c) Calculate the parsimony score of the (rooted) tree shown below, using Fitch's algorithm (Show all your work) [10 marks]

(d) Given n species and m characters for each species, what is the time-complexity of solving the small parsimony problem using Fitch's algorithm, assuming that each set operation takes one time unit? Explain your answer. [5 marks]

2. Motif Discovery [18 marks]

(a) Briefly outline the two different ways a motif can be defined formally that were covered in class (<= 50 words, in your own words). [4 marks]

(b) For each of the two definitions from part (a), briefly outline an algorithm for finding all instances of the motif in a given set of DNA sequences. [6 marks]

(c) Give the definition of relative entropy between two probability distributions. [1 mark]

(d) What is the smallest (absolute) value for the relative entropy of a distribution P w.r.t. another distribution Q? For which P and Q does this value occur? [2 marks]

(e) How is relative entropy used to evaluate candidate motifs? (<= 50 words, in your own words) [2 marks]

(f) Can the Gibbs Sampling Algorithm by Lawrence et al. find the maximum relative entropy motif for any given instance of the Relative Entropy Site Selection Problem with some probability > 0 in at most k iterations, where k is the number of sequences? Briefly justify your answer. [3 marks]

3. Constructing Phylogenetic Trees using UPGMA and Neighbour-Joining [25 marks]

(a) Visit the following java applet http://www.itu.dk/~sestoft/bsa/Match7Applet.html. Using 5 sequences, input random distances (by using the Random data button), and Build Trees. You should get 2 trees, produced by UPGMA and Neighbour joining. Submit the distance matrix and both trees produced (take screenshots or draw them if you need to). If the trees produced are identical, resubmit new random data until the trees are different. [2 marks]

(b) Using the UPGMA tree from (a), show all the steps involved in producing the tree from UPGMA. [8 marks]

(c) Using the Neighbour-Joining tree from (a), show all the steps involved in producing the tree from Neighbour Joining. [10 marks]

(d) If the trees produced are different, explain where they are different and why this difference occurs. you may refer to parts (b) and (c). [5 marks]

General remarks:

The assignment has to be handed in on the date it is due before 9:30. To ensure fairness, late hand-ins will generally not be accepted (exceptions can only be made for officially documented medical reasons). Please hand your solution to Holger at the beginning of class.
Please do not staple the pages of your assignment and put your name and/or student number on each sheet.
This assignment should take you no longer than about 2 hours to complete, if you have good knowledge of the topics covered. However, don't wait until the last minute relying on this estimate - it might not apply to you (or anyone at all), you might need additional time to consult the literature, etc.
While cooperation between students - especially between CS and non-CS students - is encouraged, each student is expected to work out the actual solutions to the problems individually and hand in their own assignment. In other words: help each other, but do not copy solutions.
Feel always free to contact Holger or Andrew if you feel you need further help than can be provided by your fellow students.