CPSC445/545 - Assignment 3 (covers Module 3 & 4)

released: Fri, 03/10/24; due: Tue, 03/11/04 at 9:30 AM

NOTE: CPSC 545 students are only required to solve problems 1,2, and 4; CPSC 445 students are expected to solve all problems.

1  Phylogenetic Trees / Distance Based Methods [10 marks]

Read Section 7.3 "Making a tree from pairwise distances" in Durbin et al. (available from the box in the ICICS/CS Reading Room) and answer the following questions:

(a) Show all steps of the UPGMA algorithm as applied to the following five sequences, where the distance between two sequences is defined as the number of base positions in which they differ. [3 marks]

• CGGGGCUGAUGAGGC
• CGAAAGGCCGAAGGC
• AUCCUUGAUGGCACA
• CUAUGCGCGAUUGCA
• CCUGCGUUGUUUACC

(b) Explain the Jukes-Cantor distance and discuss why may it be more appropriate than the simple distance measure used in part (a)? (<= 50 words, in your own words). [2 marks]
Note: We are not asking you to compute Jukes-Cantor distances for the sequences from part (a).

(c) Describe the role of "arithmetic averaging" in UPGMA. (<= 50 words, in your own words) [2 marks]

(d) Prove that Equation (7.2) gives the correct distances dkl between a merged cluster Ck = Ci +  Cj ('+' denotes set union) and every other cluster Cl according to the general definition of distance between clusters as given in Equation (7.1). [3 marks]

2  Phylogenetic Trees / Maximum Likelihood Method [8 marks]

(a) Calculate the likelihood of the following two (rooted) trees, assuming the Jukes-Cantor model with parameter m = 1/2, and uniform distribution of bases at the root. What do you notice? [4 marks]

(b) Describe in your own words how a stochastic local search approach over candidate trees for the maximum likelihood approach could work. (<= 100 words) [4 marks]

3  Multiple Sequence Alignment and Phylogenetic Tree Construction (Hands-on Problem)) [445 students only, 5 marks]

Consider the following partial sequences from E.Coli clone vectors in FASTA format. (Source: http://www.cf.ac.uk/biosi/staff/ehrmann/tools/dnasequences.htm)

>pBR322
TTCTCATGTTTGACAGCTTATCATCGATAAGCTTTAATGCGGTAGTTTAT
CACAGTTAAATTGCTAACGCAGTCAGGCACCGTGTATGAAATCTAACAAT
GCGCTCATCGTCATCCTCGGCACCGTCACCCTGGATGCTGTAGGCATAGG
CTTGGTTATGCCGGTACTGCCGGGCCTCTTGCGGGATATCGTCCATTCCG
>pBR325
aggccatgtttgacagcttatcatcgataagctttaatgcggtagtttat
cacagttaaattgctaacgcagtcaggcaccgtgtatgaaatctaacaat
gcgctcatcgtcatcctcggcaccgtcaccctggatgctgtaggcatagg
cttggttatgccggtactgccgggcctcttgcgggatatcgtccattccg

>pBR327
TTCTCATGTTTGACAGCTTATCATCGATAAGCTTTAATGCGGTAGTTTAT
CACAGTTAAATTGCTAACGCAGTCAGGCACCGTGTATGAAATCTAACAAT
GCGCTCATCGTCATCCTCGGCACCGTCACCCTGGATGCTGTAGGCATAGG
CTTGGTTATGCCGGTACTGCCGGGCCTCTTGCGGGATATCGTCCATTCCG
>pACYC184
GAATTCCGGATGAGCATTCATCAGGCGGGCAAGAATGTGAATAAAGGCCG
GATAAAACTTGTGCTTATTTTTCTTTACGGTCTTTAAAAAGGCCGTAATA
TCCAGCTGAACGGTCTGGTTATAGGTACATTGAGCAACTGACTGAAATGC
CTCAAAATGTTCTTTACGATGCCATTGGGATATATCAACGGTGGTATATC
>pHSG575
TGATGTCCGGCGGTGCTTTTGCCGTTACGCACCACCCCGTCAGTAGCTGA
ACAGGAGGGACAGCTGATAGAAACAGAAGCCACTGGAGCACCTCAAAAAC
ACCATCATACACTAAATCACTAAGTTGGCAGCATCACCCGACGCACTTTG
CGCCGAATAAATACCTGTGACGGAAGATCACTTCGCAGAATAAATAAATC
>pGEX2T
acgttatcgactgcacggtgcaccaatgcttctggcgtcaggcagccatc
ggaagctgtggtatggctgtgcaggtcgtaaatcactgcataattcgtgt
cgctcaaggcgcactcccgttctggataatgttttttgcgccgacatcat
aacggttctggcaaatattctgaaatgagctgttgacaattaatcatcgg

(a) Use ClustalW (http://www.ch.embnet.org/software/ClustalW.html) to obtain a multiple sequence alignment of these sequences, using default parameters. Show the result in Phylip format. [2 marks]

(b) Use Phylip (http://www.cbr.nrc.ca/cgi-bin/WebPhylip/index.html) to calculate evolutionary trees from the alignment from part (a). Report the two trees you obtain from parsimony and maximum likelihood runs of Phylip and briefly comment on the differences between these (if any). [3 marks]

4  Gene Finding / Hidden Markov Models  [6 marks]

Consider the following HMM, which models an intronic sequence with GC-rich regions. The model consists of 8 states:

• the begin (B) and end (E) states corresponding to the 5' and 3' ends, which are silent states (don't emit symbols);
• 3 states (S1, S2, S3) following (B)
• 3 states (S6, S7, S8) preceding (E)
• two intermediary states (S4, S5).
The states and transition probabilities are indicated in the following diagram:

The emission probabilities for the non-silent states are as follows:

 Base S1 S2 S3 S4 S5 S6 S7 S8 A 0 0 0.33 0.60 0.49 0.71 0 1 C 0 0 0.37 0.13 0.03 0.07 1 0 G 1 0 0.18 0.14 0.45 0.12 0 0 T 0 1 0.12 0.13 0.03 0.09 0 0

(a) What is the probability of observing GTAATACA along the state path p = B-S1-S2-S3-S5-S4-S6-S7-S8-E? [1 mark ]

(b) What is the probability of seeing GTCGCGCA, given the current HMM? (Show the steps of your calculation.) [3 marks]

(c) Assume an expert biologist has given you the following profile matrices for the 5' and 3' ends of introns:

 5' +1 +2 3 4 5 A 0 0 0.33 0.08 0.15 C 0 0 0.37 0.04 0.19 G 1 0 0.18 0.81 0.2 T 0 1 0.12 0.07 0.46
 3' -5 -4 -3 -2 -1 A 0.71 0.26 0.06 0 1 C 0.07 0.3 0.05 1 0 G 0.12 0.31 0.84 0 0 T 0.09 0.13 0.05 0 0

Extend the given HMM model based on this information. [2 marks]

Hint: If you are uncertain how to solve this problem, consult Chapter 7 of "Bioinformatics - The Machine Learning Approach" by Baldi and Brunak (available from the Reading Room).

5  Motif Discovery [445 students only; 11 marks]

(a) Give the definition of relative entropy between two distributions. [1 mark]

(b) What is the smallest (absolute) value for the relative entropy of a distribution P w.r.t. another distribution Q? For which P and Q does this value occur? [2 marks]

(c) What is the largest value for the relative entropy of a distribution P w.r.t. another distribution Q? For which P and Q does this value occur? [2 marks]

(d) What is the difference between a motif and a site (i.e., an occurrence of the motif)? How is a motif defined formally? [2 marks]

(e) Why is relative entropy a good measure for scoring candidate motifs? (<= 50 words, in your own words) [2 marks]

(f) Can the Gibbs Sampling Algorithm by Lawrence et al. find the maximum relative entropy motif for any given instance of the Relative Entropy Site Selection Problem with some probability > 0 in at most k iterations, where k is the number of sequences? [2 marks]

General remarks:

• While cooperation between students - especially between CS and non-CS students - is encouraged, each student is expected to work out the actual solutions to the problems individually and hand in their own assignment. In other words: help each other, but do not copy solutions.
• Feel always free to contact Holger, Alena, or Dan if you feel you need further help than can be provided by your fellow students.
• The assignment has to be handed in on the date it is due before or at the beginning of class.
• This assignment should take you about 4-5 hours of work, if you have good knowledge of the topics covered and did all reading assignments. However, don't wait until the last minute relying on this estimate - it might not apply to you (or anyone at all), you might need additional time to consult the literature, ...
• Electronic Handin: Please hand in the assignment electronically by sending a PDF file (sorry, no other formats are supported) using a tool called "handin". The tool is available on the grad and undergrad machines. The name of the course is cs445 for undergrads or cs545 for grads. The name of the assignment is a3 (Unix is case sensitive, so make sure you use lowercase only). See also "man handin" is you need additional information on how to use handin.
• Late Assignments / Multiple Submissions: Late assignments will not be accepted by the system and cannot be marked (i.e. 0 marks by default). If you submit an assignment multiple times (before the due date), the latest version automatically replaces any previous versions and only the latest version will be marked.