CPSC 445 - Assignment 2

released: Fri, 15/02/2008; due: Thr, 28/02/2008, 9:30 (just before the beginning of class)


1  Multiple Sequence Alignment (Hands-on Problem) [20 marks]

Consider the following partial sequences from E.Coli clone vectors in FASTA format. (Source: http://www.cf.ac.uk/biosi/staff/ehrmann/tools/dnasequences.htm)

>pBR322
TTCTCATGTTTGACAGCTTATCATCGATAAGCTTTAATGCGGTAGTTTAT 
CACAGTTAAATTGCTAACGCAGTCAGGCACCGTGTATGAAATCTAACAAT 
GCGCTCATCGTCATCCTCGGCACCGTCACCCTGGATGCTGTAGGCATAGG 
CTTGGTTATGCCGGTACTGCCGGGCCTCTTGCGGGATATCGTCCATTCCG
>pBR325
aggccatgtttgacagcttatcatcgataagctttaatgcggtagtttat 
cacagttaaattgctaacgcagtcaggcaccgtgtatgaaatctaacaat 
gcgctcatcgtcatcctcggcaccgtcaccctggatgctgtaggcatagg 
cttggttatgccggtactgccgggcctcttgcgggatatcgtccattccg

>pBR327
TTCTCATGTTTGACAGCTTATCATCGATAAGCTTTAATGCGGTAGTTTAT 
CACAGTTAAATTGCTAACGCAGTCAGGCACCGTGTATGAAATCTAACAAT 
GCGCTCATCGTCATCCTCGGCACCGTCACCCTGGATGCTGTAGGCATAGG 
CTTGGTTATGCCGGTACTGCCGGGCCTCTTGCGGGATATCGTCCATTCCG
>pACYC184
GAATTCCGGATGAGCATTCATCAGGCGGGCAAGAATGTGAATAAAGGCCG 
GATAAAACTTGTGCTTATTTTTCTTTACGGTCTTTAAAAAGGCCGTAATA 
TCCAGCTGAACGGTCTGGTTATAGGTACATTGAGCAACTGACTGAAATGC 
CTCAAAATGTTCTTTACGATGCCATTGGGATATATCAACGGTGGTATATC
>pHSG575
TGATGTCCGGCGGTGCTTTTGCCGTTACGCACCACCCCGTCAGTAGCTGA 
ACAGGAGGGACAGCTGATAGAAACAGAAGCCACTGGAGCACCTCAAAAAC 
ACCATCATACACTAAATCACTAAGTTGGCAGCATCACCCGACGCACTTTG 
CGCCGAATAAATACCTGTGACGGAAGATCACTTCGCAGAATAAATAAATC
>pGEX2T
acgttatcgactgcacggtgcaccaatgcttctggcgtcaggcagccatc 
ggaagctgtggtatggctgtgcaggtcgtaaatcactgcataattcgtgt 
cgctcaaggcgcactcccgttctggataatgttttttgcgccgacatcat 
aacggttctggcaaatattctgaaatgagctgttgacaattaatcatcgg

(a) Use ClustalW2 (http://www.ebi.ac.uk/Tools/clustalw2/index.html) to obtain a multiple sequence alignment of these sequences. Report the multiple sequence alignment and the guide tree used for the alignment. [5 marks]

(b) Obtain another multiple sequence alignment for the same sequeces using the progressive multiple sequence alignment program MULTI-LAGAN (http://lagan.stanford.edu/lagan_web/index.shtml). Report the multiple sequence alignment and the guide tree used for constructing it (the alignment is accessed by clicking a TextBrowser link and then the MFA multiple sequence alignment). [5 marks]

(c) Recalculate the MULTI-LAGAN alignment using the guide tree produced by ClustalW2. The phylogenetic tree can be entered into the MULTI-LAGAN program at the bottom of the form by using a string input. MULT-LAGAN only takes a binary tree, and the result of ClustalW2 might contain a branch with more then 2 children. If this happens, convert the tree into any binary tree. Report the resulting multiple sequence alignment and guide tree. [5 marks]

(d) Comment on the differences between the multiple sequence alignments from (a), (b) and (c). Keep your answer as concise as possible. [5 marks]


2  Sum-of-pairs Scoring of Multiple Sequence Alignments (Programming Problem) [40 marks]

Important notes:
  • Your programs should be written either in C, Java or C++.
  • When you are done, send an email to acarbo@cs.ubc.ca with subject 'CPSC445-hw2' and attach your programs sources.
  • The name of your programs should be [your-student-id]-sp.{c,cpp,java}, e.g. 80132322-sp.cpp. Feel free to add any prefix to your file name in case needed. We are fine as long as your student id appears in the file name. If you would like to attach a readme text file, it should be named [your-student-id]-readme.txt.
  • Your programs should be well documented and you should explain the purpose of every function that you write [up to 10 marks will be deducted for code that is not commented/documented].
  • Your programs should output their results to standard out (stdout).

We are interested in finding the sum-of-pairs score for a given alignment. We will use the following scoring function for this program: 4 points for a match, -1 points for a mismatch, -2 for a s(-,base) or s(base,-) and 0 for a s(-,-).

(a) (Hand in this part with your written assignment) Compute (by hand) the sum-of-pairs score for the following alignment using the above score.
[5 marks]

A-G
A--
TCG


Write a program that computes the sum-of-pairs score for an alignment. The input for your program will be a file with an alignment names (asst2.in). The alignment will be a set of sequences separated by line breaks. Each sequence will have a length of up to 500 bases, and contain anywhere from 3 to 10 sequences.

(b) Using your program, compute the sum-of-pairs score for the alignment from part (a). [10 marks]

(c) Using your program, compute the sum-of-pairs score for the following alignment:

CTCT--CTCCACGGGC
CCAAA-ATTTACAGAC
CCCTAGGTTCGCAGAC
CCCTAATCCCGCAGGG
[10 marks]

(d) Compute the sum-of-pairs scores for the multiple sequence alignments from from problems 1a), 1b) and 1c). Can you make any additional comments about the success of these programs? Note: You will need to modify the output multiple sequence alignments of these programs before using them as input for your program.
[15 marks]


3  Scoring Models [10 marks]

The following questions should be answered after carefully reading section 8.1 and 8.2 of Durbin et al.

(a) What is the Jukes-Cantor distance model and why is it more appropriate than a simple model that merely counts the number of mismatches? (<= 50 words, in your own words). [3 marks]

(b) Why might the 2-parameter Kimura model be even more appropriate than the Jukes-Cantor model? (<= 50 words, in your own words). [3 marks]

(c) All three of the above models are less then realistic. Give 3 reasons or examples where all three of the models would not, or could not model real-life cases. [4 marks]


4  Phylogenetic Trees / Distance Based Methods [25 marks]

(a) Show all steps of the UPGMA algorithm as applied to the following five sequences, where the distance between two sequences is defined as the number of base positions in which they differ (for example, the first two sequences have a distance of 6 unmatched base pairs). [10 marks]

(b) Briefly describe the role of "arithmetic averaging" in UPGMA. (<= 50 words, in your own words) [5 marks]

(c) Prove that Equation (7.2) from Durbin et al. gives the correct distances dkl between a merged cluster Ck = Ci +  Cj (where '+' denotes set union) and every other cluster Cl according to the general definition of distance between clusters as given in Equation (7.1). [10 marks]


General remarks: