CPSC 445 - Assignment 2

released: Fri, 15/02/2008; due: Thu, 28/02/2008, 9:30 (just before the beginning of class)

1 Multiple Sequence Alignment (Hands-on Problem) [20 marks]

Consider the following partial sequences from E. coli cloning vectors in FASTA format. (Source: http://www.cf.ac.uk/biosi/staff/ehrmann/tools/dnasequences.htm)

>pBR322
TTCTCATGTTTGACAGCTTATCATCGATAAGCTTTAATGCGGTAGTTTAT
CACAGTTAAATTGCTAACGCAGTCAGGCACCGTGTATGAAATCTAACAAT
GCGCTCATCGTCATCCTCGGCACCGTCACCCTGGATGCTGTAGGCATAGG
CTTGGTTATGCCGGTACTGCCGGGCCTCTTGCGGGATATCGTCCATTCCG
>pBR325
aggccatgtttgacagcttatcatcgataagctttaatgcggtagtttat
cacagttaaattgctaacgcagtcaggcaccgtgtatgaaatctaacaat
gcgctcatcgtcatcctcggcaccgtcaccctggatgctgtaggcatagg
cttggttatgccggtactgccgggcctcttgcgggatatcgtccattccg
>pBR327
TTCTCATGTTTGACAGCTTATCATCGATAAGCTTTAATGCGGTAGTTTAT
CACAGTTAAATTGCTAACGCAGTCAGGCACCGTGTATGAAATCTAACAAT
GCGCTCATCGTCATCCTCGGCACCGTCACCCTGGATGCTGTAGGCATAGG
CTTGGTTATGCCGGTACTGCCGGGCCTCTTGCGGGATATCGTCCATTCCG
>pACYC184
GAATTCCGGATGAGCATTCATCAGGCGGGCAAGAATGTGAATAAAGGCCG
GATAAAACTTGTGCTTATTTTTCTTTACGGTCTTTAAAAAGGCCGTAATA
TCCAGCTGAACGGTCTGGTTATAGGTACATTGAGCAACTGACTGAAATGC
CTCAAAATGTTCTTTACGATGCCATTGGGATATATCAACGGTGGTATATC
>pHSG575
TGATGTCCGGCGGTGCTTTTGCCGTTACGCACCACCCCGTCAGTAGCTGA
ACAGGAGGGACAGCTGATAGAAACAGAAGCCACTGGAGCACCTCAAAAAC
ACCATCATACACTAAATCACTAAGTTGGCAGCATCACCCGACGCACTTTG
CGCCGAATAAATACCTGTGACGGAAGATCACTTCGCAGAATAAATAAATC
>pGEX2T
acgttatcgactgcacggtgcaccaatgcttctggcgtcaggcagccatc
ggaagctgtggtatggctgtgcaggtcgtaaatcactgcataattcgtgt
cgctcaaggcgcactcccgttctggataatgttttttgcgccgacatcat
aacggttctggcaaatattctgaaatgagctgttgacaattaatcatcgg

(a) Use ClustalW2 (http://www.ebi.ac.uk/Tools/clustalw2/index.html) to obtain a multiple sequence alignment of these sequences. Report the multiple sequence alignment and the guide tree used for the alignment. [5 marks]

(b) Obtain another multiple sequence alignment for the same sequences using the progressive multiple sequence alignment program MULTI-LAGAN (http://lagan.stanford.edu/lagan_web/index.shtml). Report the multiple sequence alignment and the guide tree used for constructing it (the alignment is accessed by clicking a TextBrowser link and then the MFA multiple sequence alignment). [5 marks]

(c) Recalculate the MULTI-LAGAN alignment using the guide tree produced by ClustalW2. The phylogenetic tree can be entered into the MULTI-LAGAN program at the bottom of the form by using a string input. MULTI-LAGAN only takes a binary tree, and the result of ClustalW2 might contain a branch with more than 2 children. If this happens, convert the tree into any binary tree. Report the resulting multiple sequence alignment and guide tree. [5 marks]

(d) Comment on the differences between the multiple sequence alignments from (a), (b) and (c). Keep your answer as concise as possible. [5 marks]

2 Sum-of-pairs Scoring of Multiple Sequence Alignments (Programming Problem) [40 marks]

Important notes:

Your programs should be written either in C, Java or C++.
When you are done, send an email to acarbo@cs.ubc.ca with subject 'CPSC445-hw2' and attach your program sources.
The name of your program should be [your-student-id]-sp.{c,cpp,java}, e.g. 80132322-sp.cpp. Feel free to add a prefix to the file name if needed; any name is fine as long as your student id appears in it. If you would like to attach a readme text file, it should be named [your-student-id]-readme.txt.
Your programs should be well documented and you should explain the purpose of every function that you write [up to 10 marks will be deducted for code that is not commented/documented].
Your programs should output their results to standard out (stdout).

We are interested in finding the sum-of-pairs score for a given alignment. We will use the following scoring function for this program: 4 points for a match, -1 point for a mismatch, -2 points for a base aligned with a gap, i.e., s(-,base) or s(base,-), and 0 points for a gap aligned with a gap, s(-,-).
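
As a starting point, the scoring scheme above can be sketched in Java as follows. This is only an illustration under stated assumptions, not a reference solution: the class name SumOfPairsSketch, the method names, and the toy alignment are made up for this example, and the case-insensitive comparison of bases is an assumption that the assignment does not specify.

```java
/** Illustrative sketch of sum-of-pairs scoring; all names here are hypothetical. */
final class SumOfPairsSketch {
    static final int MATCH = 4;      // identical bases
    static final int MISMATCH = -1;  // two different bases
    static final int GAP = -2;       // s(-,base) or s(base,-)
    static final int GAP_GAP = 0;    // s(-,-)

    /** Score of one aligned pair of symbols. */
    static int pairScore(char a, char b) {
        if (a == '-' && b == '-') return GAP_GAP;
        if (a == '-' || b == '-') return GAP;
        // Assumption: upper- and lower-case letters denote the same base.
        return Character.toUpperCase(a) == Character.toUpperCase(b) ? MATCH : MISMATCH;
    }

    /** Sum, over every column, of the scores of all unordered pairs of rows. */
    static long sumOfPairs(String[] rows) {
        int width = rows[0].length();
        for (String row : rows) {
            if (row.length() != width) {
                throw new IllegalArgumentException("all rows must have the same length");
            }
        }
        long total = 0;
        for (int col = 0; col < width; col++) {
            for (int i = 0; i < rows.length; i++) {
                for (int j = i + 1; j < rows.length; j++) {
                    total += pairScore(rows[i].charAt(col), rows[j].charAt(col));
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Toy alignment (not one of the assignment's): column scores are 2, 0 and 0.
        String[] toy = {"AC-", "A-T", "GCT"};
        System.out.println(sumOfPairs(toy)); // prints 2
    }
}
```

Each column is scored independently, so an alignment with n rows contributes n(n-1)/2 pair scores per column.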

(a) (Hand in this part with your written assignment) Compute (by hand) the sum-of-pairs score for the following alignment using the above score.
[5 marks]

A-G
A--
TCG

Write a program that computes the sum-of-pairs score for an alignment. The input for your program will be a file named asst2.in containing an alignment. The alignment will be a set of sequences separated by line breaks. Each sequence will have a length of up to 500 bases, and the alignment will contain anywhere from 3 to 10 sequences.
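
One possible way to read and sanity-check such an input file is sketched below in Java. The class AlignmentInputSketch and its method names are hypothetical. The sketch also makes three assumptions that are not spelled out above: blank lines are ignored, every row of the alignment has the same number of columns, and the 500-base limit counts non-gap characters only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of reading asst2.in; all names here are hypothetical. */
final class AlignmentInputSketch {
    static final int MIN_ROWS = 3;
    static final int MAX_ROWS = 10;
    static final int MAX_BASES = 500;

    /** Turns raw file lines into alignment rows, rejecting malformed input. */
    static List<String> parseAlignment(List<String> lines) {
        List<String> rows = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (!line.isEmpty()) rows.add(line); // assumption: blank lines are ignored
        }
        if (rows.size() < MIN_ROWS || rows.size() > MAX_ROWS) {
            throw new IllegalArgumentException("expected 3 to 10 sequences, got " + rows.size());
        }
        int width = rows.get(0).length();
        for (String row : rows) {
            if (row.length() != width) {
                throw new IllegalArgumentException("rows differ in length");
            }
            // Assumption: the 500-base limit applies to non-gap characters.
            if (row.replace("-", "").length() > MAX_BASES) {
                throw new IllegalArgumentException("sequence longer than 500 bases");
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "asst2.in";
        List<String> rows = parseAlignment(Files.readAllLines(Paths.get(path)));
        System.out.println(rows.size() + " sequences, " + rows.get(0).length() + " columns");
    }
}
```

Keeping the parsing separate from the file access makes the validation easy to exercise with hand-written lines before running it on a real file.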

(b) Using your program, compute the sum-of-pairs score for the alignment from part (a). [10 marks]

(c) Using your program, compute the sum-of-pairs score for the following alignment:

CTCT--CTCCACGGGC
CCAAA-ATTTACAGAC
CCCTAGGTTCGCAGAC
CCCTAATCCCGCAGGG
[10 marks]

(d) Compute the sum-of-pairs scores for the multiple sequence alignments from problems 1(a), 1(b) and 1(c). Can you make any additional comments about the success of these programs? Note: You will need to modify the output multiple sequence alignments of these programs before using them as input for your program.
[15 marks]
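
The kind of modification meant in the note might look like the following Java sketch, which collapses a FASTA-style alignment (header lines starting with '>', sequence text possibly wrapped over several lines) into one row per sequence. The class FastaToRowsSketch is hypothetical, and the sketch assumes the alignment is available in FASTA-style form; the default ClustalW2 output is an interleaved block format and would need different handling.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch: FASTA-style alignment to one row per sequence. Names are hypothetical. */
final class FastaToRowsSketch {
    static List<String> fastaToRows(List<String> lines) {
        List<String> rows = new ArrayList<>();
        StringBuilder current = null; // row being assembled; null before the first header
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) continue;
            if (line.startsWith(">")) {
                // A header closes the previous record and opens a new one.
                if (current != null) rows.add(current.toString());
                current = new StringBuilder();
            } else if (current != null) {
                // Drop any whitespace inside wrapped sequence lines.
                current.append(line.replaceAll("\\s+", ""));
            }
        }
        if (current != null) rows.add(current.toString());
        return rows;
    }

    public static void main(String[] args) {
        List<String> demo = List.of(">a", "AC-", "GT", ">b", "ACGGT");
        for (String row : fastaToRows(demo)) {
            System.out.println(row);
        }
    }
}
```

The sequence names are discarded here because the scoring program only needs the aligned rows.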

3 Scoring Models [10 marks]

The following questions should be answered after carefully reading Sections 8.1 and 8.2 of Durbin et al.

(a) What is the Jukes-Cantor distance model and why is it more appropriate than a simple model that merely counts the number of mismatches? (<= 50 words, in your own words). [3 marks]

(b) Why might the 2-parameter Kimura model be even more appropriate than the Jukes-Cantor model? (<= 50 words, in your own words). [3 marks]

(c) All three of the above models are less than realistic. Give 3 reasons or examples where all three of the models would not, or could not, model real-life cases. [4 marks]

4 Phylogenetic Trees / Distance Based Methods [25 marks]

(a) Show all steps of the UPGMA algorithm as applied to the following five sequences, where the distance between two sequences is defined as the number of base positions in which they differ (for example, the first two sequences have a distance of 6 unmatched base pairs). [10 marks]

GTTAAACATCTCCTC
GTGAAACAACATGAC
GTTAAACATGTGGAC
GCACGGAACTCGCCT
GTCTTACTGGCATGA
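
The distance defined in (a) is the Hamming distance between equal-length sequences. A minimal Java sketch for checking hand-computed distances is given below; the class HammingSketch and its method names are hypothetical, and the sketch deliberately prints only the one distance already stated in the problem.

```java
/** Illustrative sketch of the mismatch-count distance; all names here are hypothetical. */
final class HammingSketch {
    /** Number of positions at which two equal-length sequences differ. */
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("sequences must have equal length");
        }
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) distance++;
        }
        return distance;
    }

    /** Symmetric matrix of pairwise distances with a zero diagonal. */
    static int[][] distanceMatrix(String[] seqs) {
        int[][] d = new int[seqs.length][seqs.length];
        for (int i = 0; i < seqs.length; i++) {
            for (int j = i + 1; j < seqs.length; j++) {
                d[i][j] = hamming(seqs[i], seqs[j]);
                d[j][i] = d[i][j];
            }
        }
        return d;
    }

    public static void main(String[] args) {
        // The first two sequences from part (a).
        System.out.println(hamming("GTTAAACATCTCCTC", "GTGAAACAACATGAC")); // prints 6
    }
}
```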

(b) Briefly describe the role of "arithmetic averaging" in UPGMA. (<= 50 words, in your own words) [5 marks]

(c) Prove that Equation (7.2) from Durbin et al. gives the correct distances d_kl between a merged cluster C_k = C_i + C_j (where '+' denotes set union) and every other cluster C_l according to the general definition of distance between clusters as given in Equation (7.1). [10 marks]
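
For convenience, the two equations referred to in (c) are usually stated as below. This is a transcription from memory of the standard UPGMA definitions, so please check it against your copy of Durbin et al.; |C| denotes the number of sequences in cluster C and d_pq is the distance between individual sequences p and q.

```latex
% (7.1): average distance between two clusters
d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}

% (7.2): update rule after merging C_k = C_i \cup C_j
d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}
```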

General remarks:

The assignment has to be handed in on the date it is due before 9:30. To ensure fairness, late hand-ins will generally not be accepted (exceptions can only be made for officially documented medical reasons). Please hand your solution to Holger at the beginning of class.
This assignment should take you no longer than about 2 hours to complete, if you have good knowledge of the topics covered. However, don't wait until the last minute relying on this estimate - it might not apply to you (or anyone at all), you might need additional time to consult the literature, etc.
While cooperation between students - especially between CS and non-CS students - is encouraged, each student is expected to work out the actual solutions to the problems individually and hand in their own assignment. List the names of all students you worked with.
Always feel free to contact Holger or Andrew if you need more help than your fellow students can provide.