CPSC536a - Assignment 2 (covers Modules 3)

available Tues, 02/02/12; due Wed, 02/02/27

1 Parsimony [10 marks]

Recall that Fitch's algorithm associates with each node u of a rooted tree the set R_u of characters that is defined recursively as follows:

Base case (leaf): R_u contains the single character at leaf node u.
General case: node u with children v, w:

 R_u = R_v intersection R_w, if this intersection is non-empty, and R_v union R_w, otherwise.

(a) Prove that for all nodes u in the tree T, if T_u is the subtree of T rooted at u, the bases in the set R_u are exactly those that could label u in an optimal (most parsimonious) labeling of tree T_u. (You can use induction on the depth of the subtree rooted at u.)

(b) This question was raised by Daniel in class. Is it the case that, for all input trees T, there is an optimal labeled tree in which every node along some path from the root to a leaf has the same character? (Justify your answer.)

(c) Bonus points: In class, Mirela pointed out that it is possible to construct optimal labelings of trees in which the character at a node u of the tree is NOT in R_u. Can you design an efficient (polynomial time) algorithm that, given as input a tree T with labels just at the leaves, calculates, for each node u of the tree the set of ALL bases that could label that node in an optimal labeling of the tree?

2 ClustalW and Phylip [6 marks]

Below is the DNA-directed RNA polymerase alpha subunit/40kD subunit in Fasta format. (Graeme Campbell pointed me to this; I pulled it from the web at http://www.ncbi.nlm.nih.gov/cgi-bin/COG/palox?seq=rpoA.) The data is for the organisms:

Archaeoglobus fulgidus
Thermoplasma acidophilum
Thermoplasma volcanium
Mycobacterium tuberculosis
Mycobacterium leprae
Haemophilus influenzae

(a) Use ClustalW to align these sequences. You can use the web-based interface for ClustalW at http://www.cbr.nrc.ca/newdocs/services/clustalw_form.html. Choose Phylip format for the output.

(b) Next, use Phylip to form a tree from these sequences. You can use the web-based interface for Phylip at http://www.cbr.nrc.ca/cgi-bin/WebPhylip/index.html. What tree do you get? (You may use only part of the aligned sequences if needed.)

-----------------------------------------------------------------------

>AF2282
MMPEIEILEEKDFKIKFILKNASPALANSFRRAMKAEVPAMAVDYVDIYLNSSYFYDEVIAHRLAMLPIK
TYLDRFNMQSECSCGGEGCPNCQISFRLNVEGPKVVYSGDFISDDPDVVFAIDNIPVLELFEGQQLMLEA
VARLGTGREHAKFQPVSVCVYKIIPEIVVNENCNGCGDCIEACPRNVFEKDGDKVRVKNVMACSMCGECV
EVCEMNAISVNETNNFLFTVEGTGALPVREVMKKALEILRSKAEEMNKIIEEIQ
>Ta1030
MDVKIFKLSDKYIRFEIDGITPSQANALRRTLINDIPKLAIENVTFHHGEIRDSEGNVYDSSLPLFDEMV
AHRLGLIPLKTDLTMNFRDQCSCGGKGCSLCTVTYSINKLGPSTVFSSDLQAVSHPDLIPVDGEIPIVKL
GPKQAILITAEAILGTAKEHAKWQVTSGVSYKYHREFHVSKKDFEDWQKLKGACPKSVMSETDTEIVFTD
DFGCNDLNVLFESDGVNMIEDDSRFIFQFETDGSLTAKETLLYALNRLKDRWDILVESLSE
>TVN0565
MEVKIFKLSDKYMSFEIDGITPSQANALRRTLINDIPKLAIENVTFHHGEIRDAEGNVYDSSLPLFDEMV
AHRLGLIPLKTDLSLNFRDQCSCGGKGCSLCTVTYSINKIGPASVMSGDIQAISHPDLVPVDPDIPIVKL
GAKQAILITAEAILGTAKEHAKWQVTSGVAYKYHREFEVNKKLFEDWAKIKERCPKSVLSEDENTIVFTD
DYGCNDLSILFESDGVQIKEDDSRFIFHFETDGSLTAEETLSYALNRLMDRWGILVESLSE
>Rv3457c
MLISQRPTLSEDVLTDNRSQFVIEPLEPGFGYTLGNSLRRTLLSSIPGAAVTSIRIDGVLHEFTTVPGVK
EDVTEIILNLKSLVVSSEEDEPVTMYLRKQGPGEVTAGDIVPPAGVTVHNPGMHIATLNDKGKLEVELVV
ERGRGYVPAVQNRASGAEIGRIPVDSIYSPVLKVTYKVDATRVEQRTDFDKLILDVETKNSISPRDALAS
SDLLDIRNFGQKSIDEVKIKLHQLGLSLKDSPPSFDPSEVAGYDVATGTWSTEGAYDEQDYAETEQL
>ML1957
MLISQRPTLSEDILTDNRSQFVIEPLEPGFGYTLGNSLRRTLLSSIPGAAVTSIRIDGVLHEFTTVPGVK
EDVTEIILNLKGLVVSSEEDEPVTMYLRKQGPGEVTAGDIVPPAGVTLHNPGMRIATLNDKGKIEAELVV
ERGRGYVPAVQNRALGAEIGRIPVDSIYSPVLKVTYKVDATRVEQRTDFDKLILDVETKSSITPRDALAS