Consider the following partial sequences from E. coli cloning vectors in FASTA format. (Source: http://www.cf.ac.uk/biosi/staff/ehrmann/tools/dnasequences.htm)
(a) Use ClustalW2 (http://www.ebi.ac.uk/Tools/clustalw2/index.html) to obtain a multiple sequence alignment of these sequences. Report the multiple sequence alignment and the guide tree used for the alignment. [5 marks]
(b) Obtain another multiple sequence alignment for the same sequences using the progressive multiple sequence alignment program MULTI-LAGAN (http://lagan.stanford.edu/lagan_web/index.shtml). Report the multiple sequence alignment and the guide tree used for constructing it (the alignment is accessed by clicking a TextBrowser link and then the MFA multiple sequence alignment). [5 marks]
(c) Recalculate the MULTI-LAGAN alignment using the guide tree produced by ClustalW2. The phylogenetic tree can be entered as a string at the bottom of the MULTI-LAGAN form. MULTI-LAGAN only accepts a binary tree, and the ClustalW2 result might contain a node with more than 2 children. If this happens, convert the tree into any binary tree. Report the resulting multiple sequence alignment and guide tree. [5 marks]
(d) Comment on the differences between the multiple sequence alignments from (a), (b) and (c). Keep your answer as concise as possible. [5 marks]
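The conversion mentioned in part (c), from a guide tree with a multifurcating node to a binary tree, can be sketched in Python as follows. This is only an illustration: the nested-tuple representation and the function name `binarize` are hypothetical, and the sketch does not parse ClustalW2 output or produce the exact string syntax that MULTI-LAGAN expects. It assumes every internal node has at least one child.

```python
def binarize(tree):
    """Resolve nodes with more than two children into nested pairs.

    Assumed representation: a leaf is a string (sequence name) and an
    internal node is a non-empty tuple or list of child subtrees.
    """
    if isinstance(tree, str):
        return tree
    children = [binarize(child) for child in tree]
    # Fold the children left to right: (a, b, c) becomes ((a, b), c).
    node = children[0]
    for child in children[1:]:
        node = (node, child)
    return node


if __name__ == "__main__":
    # Hypothetical three-way split at the root.
    print(binarize(("seqA", "seqB", ("seqC", "seqD"))))
```

Folding left to right is an arbitrary choice; the question allows any binary resolution, so a different grouping of the children would be equally acceptable.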
We are interested in finding the sum-of-pairs score for a given alignment. We will use the following scoring function for this problem: 4 points for a match, -1 for a mismatch, -2 for a base aligned with a gap, s(-,base) or s(base,-), and 0 for two aligned gaps, s(-,-).
(a) (Hand in this part with your written assignment)
Compute (by hand) the sum-of-pairs score for the following alignment using the above score.
Write a program that computes the sum-of-pairs score for an alignment. The input for your program will be a file containing an alignment, named asst2.in. The alignment will be a set of sequences separated by line breaks. Each alignment will contain anywhere from 3 to 10 sequences, and each sequence will have a length of up to 500 bases.
(b) Using your program, compute the sum-of-pairs score for the alignment from part (a). [10 marks]
(c) Using your program, compute the sum-of-pairs score for the following alignment:
(d) Compute the sum-of-pairs scores for the multiple sequence alignments from problems 1(a), 1(b) and 1(c). Can you make any additional comments about the success of these programs? Note: You will need to modify the output multiple sequence alignments of these programs before using them as input for your program.
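As a rough illustration of the scoring scheme in this problem, the following Python sketch computes a sum-of-pairs score. The function names (`pair_score`, `sum_of_pairs`, `read_alignment`) are hypothetical, and the file layout (one aligned sequence per non-empty line, with `-` as the gap character) is an assumption based on the description of asst2.in; it is one possible approach, not a reference solution.

```python
from itertools import combinations

MATCH, MISMATCH, GAP = 4, -1, -2  # scores stated in the problem


def pair_score(a: str, b: str) -> int:
    """Score one pair of aligned symbols; '-' is assumed to mark a gap."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def sum_of_pairs(rows: list[str]) -> int:
    """Add pair_score over every column and every unordered pair of rows."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    return sum(
        pair_score(a, b)
        for column in zip(*rows)
        for a, b in combinations(column, 2)
    )


def read_alignment(path: str) -> list[str]:
    """Read one aligned sequence per non-empty line (assumed layout)."""
    with open(path) as handle:
        return [line.strip().upper() for line in handle if line.strip()]


if __name__ == "__main__":
    # Toy alignment made up for illustration, not taken from the assignment.
    print(sum_of_pairs(["AC-T", "A-GT", "ACGT"]))
```

For the toy alignment shown, `sum_of_pairs` returns 24: the first and last columns each contribute three matches (12 each), while each middle column contributes one match and two base-gap pairs (4 - 2 - 2 = 0).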
The following questions should be answered after carefully reading Sections 8.1 and 8.2 of Durbin et al.
(a) What is the Jukes-Cantor distance model and why is it more appropriate than a simple model that merely counts the number of mismatches? (<= 50 words, in your own words). [3 marks]
(b) Why might the 2-parameter Kimura model be even more appropriate than the Jukes-Cantor model? (<= 50 words, in your own words). [3 marks]
(c) All three of the above models are less than realistic. Give 3 reasons or examples where none of the three models would, or could, describe real-life cases. [4 marks]
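For orientation, the three distance measures discussed in this question can be compared numerically. The Python sketch below uses the standard textbook forms of the Jukes-Cantor and Kimura 2-parameter distance corrections; the notation may differ from that in Durbin et al., and the function names are hypothetical. It assumes two aligned DNA sequences of equal length with at least one ungapped position, and it skips gapped positions, which is a simplifying assumption.

```python
import math

PURINES = {"A", "G"}


def is_transition(a: str, b: str) -> bool:
    """True for A<->G or C<->T (both purines or both pyrimidines)."""
    return (a in PURINES) == (b in PURINES)


def mismatch_fractions(x: str, y: str) -> tuple[float, float]:
    """Return (P, Q): fractions of transition and transversion sites."""
    pairs = [(a, b) for a, b in zip(x, y) if a != "-" and b != "-"]
    diffs = [(a, b) for a, b in pairs if a != b]
    transitions = sum(1 for a, b in diffs if is_transition(a, b))
    n = len(pairs)
    return transitions / n, (len(diffs) - transitions) / n


def jukes_cantor(p: float) -> float:
    """d = -(3/4) ln(1 - 4p/3), where p is the fraction of mismatches."""
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(P: float, Q: float) -> float:
    """d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q)."""
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


if __name__ == "__main__":
    # Toy sequences made up for illustration.
    P, Q = mismatch_fractions("AAAAAAAAAA", "GACAAAAAAA")
    print(P + Q, jukes_cantor(P + Q), kimura_2p(P, Q))
```

Both corrected distances exceed the raw mismatch fraction, and the Kimura distance coincides with the Jukes-Cantor distance when transitions make up exactly one third of the observed differences.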
(a) Show all steps of the UPGMA algorithm as applied to the following five sequences, where the distance between two sequences is defined as the number of base positions at which they differ (for example, the first two sequences differ at 6 positions, so their distance is 6). [10 marks]
(b) Briefly describe the role of "arithmetic averaging" in UPGMA. (<= 50 words, in your own words) [5 marks]
(c) Prove that Equation (7.2) from Durbin et al. gives the correct distances dkl between a merged cluster Ck = Ci + Cj (where '+' denotes set union) and every other cluster Cl according to the general definition of distance between clusters as given in Equation (7.1). [10 marks]
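The UPGMA procedure in this question can be sketched in Python as below. This is a minimal illustration under stated assumptions, not a reference solution: the function names are hypothetical, the toy sequences are made up (the five sequences from part (a) are not used), ties are broken by taking the first closest pair found, and each new node is placed at half the distance between the two clusters being merged.

```python
from itertools import combinations


def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))


def upgma(seqs: dict[str, str]) -> list[tuple[tuple, tuple, float]]:
    """Return the merges in order as (cluster_i, cluster_j, node_height)."""
    clusters = [(name,) for name in seqs]  # every sequence starts alone
    dist = {
        frozenset((a, b)): float(hamming(seqs[a[0]], seqs[b[0]]))
        for a, b in combinations(clusters, 2)
    }
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        ci, cj = min(combinations(clusters, 2),
                     key=lambda pair: dist[frozenset(pair)])
        ck = ci + cj  # tuple concatenation stands in for set union
        rest = [c for c in clusters if c not in (ci, cj)]
        for cl in rest:
            # Size-weighted average of the two old distances to cl.
            dist[frozenset((ck, cl))] = (
                dist[frozenset((ci, cl))] * len(ci)
                + dist[frozenset((cj, cl))] * len(cj)
            ) / (len(ci) + len(cj))
        merges.append((ci, cj, dist[frozenset((ci, cj))] / 2))
        clusters = rest + [ck]
    return merges


if __name__ == "__main__":
    toy = {"s1": "AAAA", "s2": "AAAT", "s3": "TTTT"}  # hypothetical data
    for step in upgma(toy):
        print(step)
```

On the toy input, s1 and s2 (distance 1) merge first at height 0.5; the merged cluster is then at distance (4 + 3) / 2 = 3.5 from s3, so the final node sits at height 1.75.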
Reading Questions: In graduate-style seminars, you will often be asked to read academic papers before class and have a set of questions prepared for discussion with your peers during the seminar period. These questions should not simply be limited to questions of the form, "I didn't understand X, how does it work?" but should demonstrate that you have made an effort to re-read and understand the paper to come up with more piercing, critical analysis, or suggestions for future improvements or directions to the work. Working out these questions will prepare you for such seminars and discussions in the future.
(a) Reading Question 1: Read the landmark paper, "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice," ( http://www.ncbi.nlm.nih.gov/pmc/articles/PMC308517/pdf/nar00046-0131.pdf ) and formulate two (2) reading questions according to the note above. Clarification questions are OK as long as you have made an effort to answer them by discussing with your peers. To get you started, try to describe and justify the four steps taken in the paper to improve the sensitivity of the multiple sequence alignment method.
(b) Reading Question 2: Read the paper, "MUSCLE: multiple sequence alignment with high accuracy and high throughput," available here: ( http://nar.oxfordjournals.org/content/32/5/1792.full.pdf+html ). Describe and motivate the decisions the authors took for the selection of their scoring model.
(c) Reading Question 3: The paper, "How well do evolutionary trees describe genetic relationships among populations?," takes a critical view of the descriptive power of constructed trees from a biological perspective ( http://www.nature.com/hdy/journal/v102/n5/pdf/hdy2008136a.pdf ). Describe how the authors evaluate the results generated by tree construction methods, including UPGMA, through comparison with actual biological hereditary data. What trends do they find? What conclusions do they draw?