CPSC 545/445 (Autumn 2003) - Class 9: Phylogenetic Tree Construction
Module 3, Part 3
---
recap (very briefly)
- parsimony, particularly big parsimony problem = searching over tree topologies
- simultaneous alignment and phylogenetic tree inference
---
Maximum Likelihood Approach for Phylogenetic Tree Construction
Key ideas:
- use a probabilistic model of evaluation
- for given sequences X, find a tree T with maximal P(X | T)
Use similar approach as above, i.e., solve the following two problems:
- for given sequences X and tree T, determine P(X | T) under given model of evolution
- search over trees in order to find overall optimum
Note: unlike above, in this section we will consider rooted trees
How to determine P(X | T):
- use Felsenstein's algorithm [1981],
which iteratively computes for nodes of tree,
sweeping tree from leaves to root (similar to parsimony algorithm)
- detailes later
How to search over trees in order to find overall optimum:
- use same methods as for large parsimony problem (see above)
(but have to additionally find optimal edge lengths for given topology)
- can also use probabilistic sampling methods, e.g., Metropolis algorithm
(see Durbin et al., p.207ff)
Before we provide details on Felsentein's algorithm ...
--
Probabilistic models of evoluation
Assumptions:
- each site evolves independently
- no insertions and deletions
- DNA sequences
want: probabilities P(x|y,t) = prob that base y evolves into base x in time t
consider substitution matrix S(t) = [P(a_i|a_j,t)]_ij
assume: S(t+s) = S(t)*(s)
(this is satisfied if the underlying probalistic process is Markovian and stationary)
Jukes-Cantor Model [1969]:
P(X|X,t) = 1/4 * (1 + 3*exp(-4*alpha t)
P(X|Y,t) = 1/4 * (1 - exp(-4*alpha t)
(derived from the assumption that rate of X->Y is equal for all X,Y with X!=Y;
implies nucleotide equilibrium frequencies of 1/4 for all bases)
More realistic models exist, e.g. Kimura model [1980] that allows
different rates (and hence probabilities) for transitions (A<->G, C<->T)
and transversions (all others)
For protein sequences, a slightly generalised form (for arbitrary time t)
of the PAM matrices can be used.
--
How to determine P(X | T) - for two sequences:
tree T with only two leaves x1,x2, two edges (length t1,t2):
x1 <-t1- a -t2-> x2
P(x1,x2,a|T,t1,t2) = q_a * P(x1|a,t1) * P(x2|a,t2)
q_a = equilibrium probability of a (from substitution matrix)
P(x1|a,t1) = prob that a evolves into x1 in time t1
P(x1,x2|T,t1,t2) = Sum_a P(x1,x2,a|T,t1,t2)
= Sum (q_a * P(x1|a,t1) * P(x2|a,t2))
(sum over all possibilities for residue a)
With N sites, take product over each individual site
(assumption: all sites evolve independently)
[example? see Durbin et al., p.198f]
--
How to determine P(X | T) - for n sequences:
given: sequences x1...xn, tree T, edge lengths t
label leaf nodes 1...n, non-leaf nodes n+1...2n-1
pi(i) = parent of node i
obtain P(x1,...,xn|T,t) by summing probability of specific
instantiation of tree (i.e., assignmennt of residues to internal
node) under given evolutionary model
over all possible assignments of residues to non-leaf nodes
P(x1,...,xn|T,t) = sum_{assignment of non-leaf nodes n+1...2n-1}
q_{root residue} * product_{non-leaf nodes i} P(a_i | a_pi(i),t_i)
* product_{leaf nodes i} P(x_i | a_pi(i),t_i)
Note: this works because given an instantiation of all internal
node residues, evolution along each each is independent of
evolution along all other edges
To compute this efficiently, use post-order traversal of tree (bottom to top,
starting from leaves) to obtain P(L_k|a) = probability of all leaves
beneath node k given node k is assigned a
Felsenstein's Algorithm:
1. k=2n-1
2. recursively compute P(L_k|a) for all a as follows
if k is a leaf node then
P(L_k|a) := 1 if a=x_k, 0 otherwise
else
compute P(L_i|a), P(L_j|a) for all a at daughter nodes i,j of k
P(L_k|a) := sum_{b,c} P(b|a,t_i) * P(L_i|b) * P(c|a,t_j) * P(L_j|c)
3. P(x|T,t) := sum_a (q_a * P(L_{2n-1}|a))
Note: P(b|a,t_i) and P(c|a,t_j) are calculated according to the given
model of evolution
again, with N sites, take product over each individual site
(assumption: all sites evolve independently)
[example, see Durbin et al., p.192f, p.201f]
--
How to search over trees in order to find overall optimum:
- use same methods as for large parsimony problem (see above)
but: have to additionally find optimal edge lengths for given topology
for a given tree, this can be done using numerical optimisation techniques
(newton's method, conjugate gradient methods, expectation-maximisation)
overall maximum likelihood procedure in this case:
- search over tree topologies
- for each candidate tology, find edge lengths that maximise likelihood
(this will typically need a subroutine for determining the
likelihood of the given taxa given the tree topology and edge lengths
-> may use felsenstein's algorithm)
- alternative: use probabilistic sampling methods, e.g., Metropolis algorithm
(see Durbin et al., p.207ff)
fundamental idea (can be seen as special case of SLS):
start with abritrary tree T,t
iterate:
given a tree T,t probabilistically generate a new candidate tree T',t'
P := P(T,t|x);
P' := P(T',t'|x);
if P' > P then T,t := T',t'
else T,t := T',t' with probability P'/P
T' is generated from T according to randomised local transformation
of edge lengths and topology
calculation of P(T,t|x) is done via Bayes rule from
P(x|T,t), prior P(T,t) - this can be, e.g., uniform over all trees
(don't need P(x), since this is just a normalisation factor)
--
the max-likelihood approach can be generalised to more realistic models
of evolution, specifically, site dependent mutation rates and insertions/deletions,
but this is beyond the scope of this course. (But: see Durbin et al., Section 8.5)
---
Resources:
Durbin et al., Ch.8, Sections 8.1-8.3, beginning of 8.4
Additional material:
@incollection{ cottamoscato02:phylotrees:ppsn,
author = "Cotta, C. and Moscato, P.",
title = "Inferring Phylogenetic Trees Using Evolutionary Algorithms",
booktitle = "Parallel Problem Solving From Nature VII",
editor = "Merelo, J.J. and Adamidis, P. and Beyer, H.-G. and Fern\'andez-Villaca\~nas, J.-L. and Schwefel, H.-P.",
series = "Lecture Notes in Computer Science",
volume = "2439",
pages = "720--729",
publisher = "Springer-Verlag",
address = "Berlin",
year = "2002",
url = "citeseer.nj.nec.com/cotta02inferring.html" }
@misc{ salter-algorithms,
author = "L. A. Salter",
title = "Algorithms for Phylogenetic Tree Reconstruction",
url = "citeseer.nj.nec.com/salter00algorithms.html" }
http://biohpc.learn.in.th/contents2002/Worawit/worawit.pdf
- nice slide set, but contains some errors
http://artedi.ebc.uu.se/course/BioInfo-10p-2001/Phylogeny/Phylogeny-TreeSearch/Phylogeny-Search.html
---