CPSC 545/445 (Autumn 2003) - Class 9: Phylogenetic Tree Construction
Module 3, Part 2
---
recap:
Parsimony:
find a tree relating given sequences such that
number of substitutions (=mutations) is minimised
Small Parsimony Problem:
Given tree T with just the leaves labeled, find the labels for the
internal nodes such that score S(T) is minimised.
S(T) = number of substitutions (from parents to children)
Large Parsimony Problem:
Given n sequences, find the tree T (completely labeled) with the lowest
score S(T).
Note: Theta(n!) many different trees with n nodes, problem is NP-hard
---
Approaches for solving the large parsimony problem
(Note: enumaration is infeasible because number of possible trees grows to quickly
- see above)
Note: Consider unrooted trees (slightly smaller search space)
--
1) Greedy Constructive Search
Idea: iteratively build tree by adding edges one at a time
Algorithm (greedy construction):
1. Randomly choose 3 sequences and place on unrooted tree T
(Note: for given sequences there is only one such tree)
2. Repeat
Randomly select new sequence x and add to T resulting
in larger tree T' such that S(T') is minimal over all T' that
can be obtained by adding x to T
(Note: the new edge to x can be branched off any of the existing edges,
dividing it into two)
Until tree is complete.
Note: This algorithm is not guaranteed to construct optimal tree
Running multiple times can give different results
-> selecting best of these leads to improving solution quality over time
[illustration of search step]
--
2) Branch & Bound Search
Idea: Iteratively build trees as in method 1), but
- use backtracking to systematically enumerate all trees
- abandon particular construction whenever score of partial tree
exceeds score of best complete tree obtained so far
Algorithm (branch & bound):
function bb-ptc(T,X)
input: partial tree T with leaves labelled by sequences not in X,
set of sequences X
output: unrooted tree with leaves labelled by the elements of X
if T = empty then
select three sequences x1,x2,x2 \in X
X' := X \ {x1,x2,x3};
T' := unique unrooted tree with leaves x1,x2,x3
return bb_ptc(T',X');
else if X != empty and S(T) <= bound then
select sequence x \in X
X' := X \ {x};
for n := 1 to number of edges in T do
obtain T'_n from T by adding a new edge that connects x to edge n in T
s_n := S(T'_n);
T' := T'_n with min s_n
if X' = empty then bound := min{bound,s_1,...,s_n};
end for
end if
end bb-ptc
bound := \infty
T_opt := bb-ptc(empty,X)
Note: can substantially improve performance of algorithm by
using a better choice for the initial bound
(e.g., a value obtained by methods 1 or 3 below)
(For another description of the same idea, see Durbin et al., p.178)
[illustration of search tree(?)]
Problem: Much more efficient then exhaustive enumeration, but
still impractical for large number of sequences (# > 20)
--
3) Stochastic Local Search (SLS)
Idea: Start with any tree (e.g., constructed as in 1),
iteratively reduce cost by swapping branches
Various types of branch swapping:
- Nearest Neighbour Interchange, NNI
select internal edge e in T
note that e has two subtrees attached to each of its two incident nodes
interchange two of these (two possibilities: AB-CD -> AC-BD, AD-BC)
- Subtree Pruning and Regrafting, SPR:
cut off a subtree T' from T
reconnect the edge that connected T' to rest of T
to another edge in T
(eliminate node from which T' was cut off)
- Tree Bisection and reconnection, TBR
delete edge from T
reconnect resulting two subtrees T', T'' by new edge between
arbitrary edge in T' and arbitrary edge in T''
(eliminate node from which T' was cut off)
[see http://artedi.ebc.uu.se/course/BioInfo-10p-2001/Phylogeny/Phylogeny-TreeSearch/Phylogeny-Search.html,
http://www.hyphy.org/docs/analyses/methods/nni.html]
[illustration of search steps]
Algorithm (best improvement):
1. Construct a tree T (e.g., using method 1)
2. Repeat
N(T) := set of all trees that can be obtained by NNI from T
TT := set of T' from N(T)+{T} with minimal S(T')
randomly choose T' from TT
Until T'=T
Note: A variant of this where the trees T' in N(T) are generated in some order
and the first with S(T') < S(T) is accepted can give better performance
(first improvement method)
Problem: Can easily get stuck in local optima
Solution: Allow worsening swaps
(-> Randomised Iterative Improvement, Simulated Annealing, ...)
Can also use other SLS methods, e.g., Evolutionary Algorithms, ...
Note: Methods 1) and 3) can be combined - do SLS after every construction
step in 1)
This is the method underlying many, if not most, phylogenetic tree
construction algorithms used in practice for larger numbers of sequences.
---
Simultaneous Alignment and Phylogenetic Tree Construction
(here: motivation + idea only)
Note: Quality of phylogenetic tree depends on quality of underlying
multiple sequence alignment
Idea: simultaneously construct tree and alignment
Basic approach (analogous to small + large parsimony problems):
- find optimal alignment given tree
- search over trees to find overall optimum
How to find optimal alignment given tree:
- linear gap score:
use a simple variant of a multidimensional dynamic programming algorithm
for multiple sequence alignment (see Durbin et al., p. 141):
Instead of the standard subst score for individual residues (+ gaps),
use a parsimony score for any partial alignment
that can be computed by an upward pass through the given tree
using a standard parsimony algorithm (e.g., Durbin et al., p.174 or p.175)
[Sankoff & Cedergren, 1983]
- affine gap score:
algorithm by Hein [1989], uses variant of DP method for pairwise
sequence alignment with affine gaps that is generalised to deal
with multiple progressive alignments (for details, see Durbin et al., p.181ff)
Note: Hein's algorithm works for modest size sets of sequence (tens of),
but does not guarantee optimal results
How to search over trees to find overall optimum:
- use same methods as described above for the large parsimony problem.
---
Resources:
Durbin et al., Ch.7
Additional material:
@incollection{ cottamoscato02:phylotrees:ppsn,
author = "Cotta, C. and Moscato, P.",
title = "Inferring Phylogenetic Trees Using Evolutionary Algorithms",
booktitle = "Parallel Problem Solving From Nature VII",
editor = "Merelo, J.J. and Adamidis, P. and Beyer, H.-G. and Fern\'andez-Villaca\~nas, J.-L. and Schwefel, H.-P.",
series = "Lecture Notes in Computer Science",
volume = "2439",
pages = "720--729",
publisher = "Springer-Verlag",
address = "Berlin",
year = "2002",
url = "citeseer.nj.nec.com/cotta02inferring.html" }
@misc{ salter-algorithms,
author = "L. A. Salter",
title = "Algorithms for Phylogenetic Tree Reconstruction",
url = "citeseer.nj.nec.com/salter00algorithms.html" }
http://biohpc.learn.in.th/contents2002/Worawit/worawit.pdf
- nice slide set, but contains some errors
http://artedi.ebc.uu.se/course/BioInfo-10p-2001/Phylogeny/Phylogeny-TreeSearch/Phylogeny-Search.html
- nice illustration of neighbourhoods for branch swapping