notes for class on sls algorithms for phylogenetic tree inference
-----------------------------------------------------------------
(version 0.1, developed for trento-05)
---
outline:
1. brief intro: evolution and phylogenetic trees
2. phylogenetic tree inference via parsimony
3. sls methods for the large parsimony problem
4. related problems
5. further reading and related work
---
1. Introduction: Evolution and Phylogenetic Trees
* Phylogeny = (evolutionary) relationships between a set of species
Hypothesis:
All organisms on Earth are evolutionarily related via a common ancestor
Evidence:
similarity of many molecular mechanisms and genetic material
(e.g., ribosomal RNA)
Assumption:
Phylogeny can be represented as a tree.
(believed to be true for higher organisms, problematic for microorganisms
and early stages of evolution)
* Phylogenetics = study of phylogeny
two approaches:
- classic phylogenetics: based on morphological features
(e.g., shape of body/organs)
- modern phylogenetics: based on molecular features
(e.g., information extracted from DNA, RNA or protein sequence data,
in particular, gene sequences)
nomenclature:
character = feature
taxon = (complete) set of features for one species
challenges:
- features may have evolved independently in several branches of the evolutionary
tree (e.g., eyes of octopuses and vertebrates; the same holds for genes)
- gene duplication (common in nature) leads to sequence divergence
(paralogues vs. orthologues) that can be misleading when
trying to infer phylogeny
- organisms and even genes within an organism evolve at different rates
* Phylogenetic Trees
nodes = taxa, i.e., sets of feature values for species
(e.g., sequence of a given gene)
edges = evolutionary relationships between nodes
leaves = observed species
internal nodes = (hypothetical) ancestors
edge lengths = measure of evolutionary distance between nodes
(~ evolutionary time)
typically restricted to binary trees
(without loss of generality if edges of length 0 are allowed)
phylogenetic trees can be rooted or unrooted
(root represents the ultimate ancestor of the group of sequences)
here: focus on unrooted trees (slight simplification)
[illustration: sample-tree.pdf = fig 4 from Lewis98]
* Phylogenetic Tree Inference Problem
Given:
n taxa
(i.e., for each of n species, values for each of m characters)
Objective:
find fully labelled phylogenetic tree that 'best' explains the given data
(i.e., optimises a suitable target function = score)
Assumptions:
- characters are mutually independent
- after two species diverged, their further evolution is independent
- taxa are sequences (e.g., DNA or protein sequences)
- without loss of generality, mainly for convenience
Note:
for n taxa, there are Prod_{i=2}^{n} (2i-3) = (2n-3)!!
different rooted binary trees (this grows faster than n!)
e.g., n = 20 -> approx. 8.2 * 10^21 trees
=> completely impracticable to search exhaustively except for very small n
(number of unrooted trees is (2n-5)!!, i.e., smaller only by a factor of 2n-3)
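The growth of the search space can be checked with a few lines of Python. This is a minimal sketch that assumes binary trees with labelled leaves and uses the standard counts (2n-3)!! for rooted and (2n-5)!! for unrooted trees; the function names are chosen here for illustration only.

```python
def double_factorial_odd(k):
    """Product of all odd integers from 1 up to k (returns 1 if k < 3)."""
    result = 1
    for i in range(3, k + 1, 2):
        result *= i
    return result


def num_rooted_trees(n):
    """Number of rooted binary trees with n labelled leaves (n >= 2): (2n-3)!!"""
    return double_factorial_odd(2 * n - 3)


def num_unrooted_trees(n):
    """Number of unrooted binary trees with n labelled leaves (n >= 3): (2n-5)!!"""
    return double_factorial_odd(2 * n - 5)


for n in (4, 10, 20):
    print(n, num_rooted_trees(n), num_unrooted_trees(n))
```

For n = 20 the two counts differ only by the factor 2n-3 = 37, so restricting the search to unrooted trees does not make exhaustive search feasible.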
the quality of phylogenetic trees can be assessed using various methods (target functions),
including:
- distance-based: given distance metric on sequences, the best
tree is the one most consistent with the observed distances
between sequences
-> tree inference problem often solved using clustering methods
- parsimony: the best tree is the one with the fewest substitutions
(mutations) between directly related sequences
- maximum likelihood: given a probabilistic model of sequence evolution,
the best tree is the one with the highest likelihood under that model
here: focus on parsimony
---
2. Phylogenetic Tree Inference via Parsimony
* The Parsimony Problem
Parsimony Problem:
Given n sequences, find a tree relating the sequences such that the
number of substitutions (=mutations) between directly related sequences
(nodes connected by an edge) is minimised
label edges with # of substitutions => find tree with minimal sum of edge labels
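For a tree in which all nodes already carry sequences, the score above is simply a sum of Hamming distances over the edges. The following sketch assumes equal-length (aligned) sequences; the edge-list representation, the node names and the example labels are hypothetical and only serve as illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def tree_score(edges, labels):
    """Sum of edge labels, each edge being labelled with its number of substitutions."""
    return sum(hamming(labels[u], labels[v]) for u, v in edges)


# Hypothetical example: four observed leaves (a-d) and two inferred ancestors (x, y).
labels = {"a": "AAG", "b": "AAA", "c": "GGA", "d": "AGA", "x": "AAA", "y": "AGA"}
edges = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("d", "y")]
print(tree_score(edges, labels))  # 3
```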
The parsimony problem is usually solved in two stages:
Small Parsimony Problem:
Given tree T with just the leaves labeled, find the labels for the
internal nodes such that score S(T) is minimised.
S(T) = number of substitutions (from parents to children)
Can be solved efficiently using Fitch's Algorithm
(time = linear in size of tree * sequence length)
Large Parsimony Problem:
Given n sequences, find the tree T (completely labeled) with the lowest
score S(T).
(search over trees with only leaves labeled, label internal nodes
and compute parsimony score using Fitch's algorithm)
Note: super-exponentially many ((2n-5)!!) different unrooted trees with n leaves;
the problem is NP-hard
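The bottom-up pass of Fitch's algorithm, which yields the score S(T) for a tree with only the leaves labelled, can be sketched as follows. This is an illustrative sketch, not a reference implementation: it assumes a rooted binary tree given as nested pairs with sequences (strings) at the leaves, unit cost per substitution, and aligned sequences of equal length. An unrooted tree can be rooted on an arbitrary edge for this purpose, since the score does not depend on the root position. The top-down pass that assigns concrete labels to the internal nodes is omitted.

```python
def fitch_score(tree):
    """Return (state_sets, score) for a rooted binary tree.

    tree is either a string (the sequence at a leaf) or a pair (left, right).
    state_sets holds, per character, the set of candidate states at the root.
    """
    if isinstance(tree, str):
        return [frozenset(ch) for ch in tree], 0
    left_sets, left_score = fitch_score(tree[0])
    right_sets, right_score = fitch_score(tree[1])
    sets = []
    score = left_score + right_score
    for a, b in zip(left_sets, right_sets):
        common = a & b
        if common:
            sets.append(common)      # children agree on some state: no substitution
        else:
            sets.append(a | b)       # children disagree: one substitution needed
            score += 1
    return sets, score


# Two of the three possible topologies for four hypothetical sequences.
print(fitch_score((("AAG", "AAA"), ("GGA", "AGA")))[1])  # 3
print(fitch_score((("AAG", "GGA"), ("AAA", "AGA")))[1])  # 4
```

Each node is visited once and does work proportional to the sequence length, which matches the linear running time stated above.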
* Greedy Constructive Search
Idea: iteratively build tree by adding sequences (leaves) one at a time
Algorithm (greedy construction):
1. Randomly choose 3 sequences and place on unrooted tree T
(Note: for given sequences there is only one such tree)
2. Repeat
Randomly select new sequence x and add to T resulting
in larger tree T' such that S(T') is minimal over all T' that
can be obtained by adding x to T
(Note: the new edge to x can be branched off any of the existing edges,
dividing it into two)
Until tree is complete.
Note: This algorithm is not guaranteed to construct an optimal tree
Running multiple times can give different results
-> selecting best of these leads to improving solution quality over time
[illustration of search step]
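The greedy construction above might be sketched as follows. The sketch makes several assumptions that are not part of the notes: trees are nested pairs with sequences at the leaves, the unrooted tree is stored rooted on an arbitrary edge (which does not change the parsimony score), ties between equally good insertion points are broken at random, and all function names are hypothetical. Attaching x above the root and above either child of the root gives the same unrooted tree, so a few candidates are evaluated redundantly.

```python
import random


def parsimony(tree):
    """Fitch parsimony score of a rooted binary tree of nested pairs."""
    def visit(node):
        if isinstance(node, str):
            return [frozenset(ch) for ch in node], 0
        (ls, lc), (rs, rc) = visit(node[0]), visit(node[1])
        sets = [(a & b) or (a | b) for a, b in zip(ls, rs)]
        cost = lc + rc + sum(1 for a, b in zip(ls, rs) if not a & b)
        return sets, cost
    return visit(tree)[1]


def insertions(tree, x):
    """Yield every tree obtained by branching leaf x off one edge of tree."""
    yield (tree, x)                          # edge above this subtree
    if not isinstance(tree, str):
        left, right = tree
        for new_left in insertions(left, x):
            yield (new_left, right)
        for new_right in insertions(right, x):
            yield (left, new_right)


def leaves(tree):
    """List of leaf sequences of a tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def greedy_construction(seqs, rng=random):
    """Add sequences in random order, each at a score-minimising position."""
    if len(seqs) < 3:
        raise ValueError("need at least three sequences")
    order = list(seqs)
    rng.shuffle(order)
    tree = (order[0], (order[1], order[2]))  # the only tree on three leaves
    for x in order[3:]:
        scored = [(parsimony(t), t) for t in insertions(tree, x)]
        best = min(s for s, _ in scored)
        tree = rng.choice([t for s, t in scored if s == best])
    return tree


tree = greedy_construction(["AAG", "AAA", "GGA", "AGA"], random.Random(1))
print(tree, parsimony(tree))
```

Calling greedy_construction repeatedly with different random generators and keeping the best tree corresponds to the multiple-run variant mentioned in the note.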
---
3. SLS Methods for the Large Parsimony Problem
* Stochastic Local Search (SLS)
search space: phylogenetic trees with given sequences at leaves
initialisation: e.g., with tree obtained from construction heuristic
types of search steps (i.e., neighbourhood relations):
- Nearest Neighbour Interchange, NNI
select internal edge e in T
note that e has two subtrees attached to each of its two incident nodes
interchange two of these across e (two possibilities: AB|CD -> AC|BD or AD|BC)
- Subtree Pruning and Regrafting, SPR:
cut off a subtree T' from T
reconnect the edge that connected T' to the rest of T
to another edge in the rest of T
(eliminate node from which T' was cut off)
- Tree Bisection and Reconnection, TBR
delete edge from T
reconnect resulting two subtrees T', T'' by new edge between
arbitrary edge in T' and arbitrary edge in T''
(eliminate the two nodes at the ends of the deleted edge)
[see http://artedi.ebc.uu.se/course/BioInfo-10p-2001/Phylogeny/Phylogeny-TreeSearch/Phylogeny-Search.html,
http://www.hyphy.org/docs/analyses/methods/nni.html]
objective function: parsimony score, i.e., # substitutions in
the (completely labeled) tree
[illustration of search steps]
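The NNI neighbourhood described above can be enumerated as in the following sketch. It assumes an unrooted binary tree stored as an adjacency dictionary (node name -> set of neighbouring node names) in which all node names are strings; this representation and the function names are assumptions made for illustration. Every internal edge contributes two neighbours, so a tree with n leaves has 2(n-3) NNI neighbours.

```python
def swap_subtrees(adj, u, b, v, c):
    """Return a copy of adj in which subtree b (at u) and subtree c (at v) are exchanged."""
    new = {node: set(nbrs) for node, nbrs in adj.items()}
    new[u].remove(b)
    new[b].remove(u)
    new[v].remove(c)
    new[c].remove(v)
    new[u].add(c)
    new[c].add(u)
    new[v].add(b)
    new[b].add(v)
    return new


def nni_neighbours(adj):
    """All trees obtainable from adj by one nearest neighbour interchange."""
    result = []
    for u in adj:
        for v in adj[u]:
            # internal edge = both end points have degree 3; u < v visits each edge once
            if u < v and len(adj[u]) == 3 and len(adj[v]) == 3:
                a, b = sorted(adj[u] - {v})
                c, d = sorted(adj[v] - {u})
                result.append(swap_subtrees(adj, u, b, v, c))  # AB|CD -> AC|BD
                result.append(swap_subtrees(adj, u, b, v, d))  # AB|CD -> AD|BC
    return result


# Quartet tree ab|cd with internal nodes x and y.
quartet = {"a": {"x"}, "b": {"x"}, "c": {"y"}, "d": {"y"},
           "x": {"a", "b", "y"}, "y": {"c", "d", "x"}}
for neighbour in nni_neighbours(quartet):
    print(sorted(neighbour["x"]), sorted(neighbour["y"]))
```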
Iterative Best Improvement Algorithm:
1. Construct initial tree T (e.g., randomly or using greedy construction heuristic)
2. Repeat
N(T) := set of all trees that can be obtained by NNI from T
[use Fitch's algorithm to compute internal node labels + S(T)]
TT := set of T' from N(T)+{T} with minimal S(T')
randomly choose T' from TT
Until T'=T (local minimum)
Note: A variant of this where the trees T' in N(T) are generated in some order
and the first with S(T') < S(T) is accepted can give better performance
(first improvement method)
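The iterative best improvement loop above does not depend on the tree representation, so it can be sketched generically. In this sketch, neighbours and score are caller-supplied functions (for example an NNI enumeration and a Fitch-based parsimony score); these parameter names are assumptions. As in the pseudocode, T itself is a candidate, so the search may move sideways between equally good trees before it stops.

```python
import random


def iterative_best_improvement(start, neighbours, score, rng=random):
    """Repeat best-improvement steps until the current solution is chosen again."""
    current = start
    while True:
        candidates = list(neighbours(current)) + [current]            # N(T) + {T}
        scores = [score(t) for t in candidates]
        best = min(scores)
        best_idx = [i for i, s in enumerate(scores) if s == best]     # TT
        chosen = rng.choice(best_idx)
        if chosen == len(candidates) - 1:                              # T' = T
            return current
        current = candidates[chosen]


# Toy usage on integers instead of trees: minimise (x - 3)^2 with steps of +/- 1.
result = iterative_best_improvement(
    10, lambda x: [x - 1, x + 1], lambda x: (x - 3) ** 2, random.Random(0))
print(result)  # 3
```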
Problem: Can easily get stuck in local optima
Solution: Allow worsening swaps
(-> Randomised Iterative Improvement, Simulated Annealing, ...)
Can also use other SLS methods, e.g., Evolutionary Algorithms, ...
LVB [Barker, 2004]:
- initialisation: choose tree with random topology
- neighbourhood: NNI + SPR (used alternatingly)
- evaluation function: homoplasy index
HI(T) := 1 - (sum_{c} m_c) / (sum_{c} S_c(T))
where m_c = minimum possible number of substitutions for character c
and S_c(T) = number of substitutions for character c in T
[see http://www.biol.lu.se/mibiol/research/wachen/phylogentics/BI3-2-MP.pdf]
- step function: standard simulated annealing mechanism:
- randomly select neighbour (alternatingly using NNI, SPR)
- accept according to Metropolis condition
- annealing schedule: geometric decay
schedule + termination criterion controlled by 6 parameters
(including decay rate, # search steps per temperature value, etc.)
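The ingredients listed for LVB (homoplasy index, Metropolis acceptance, geometric cooling) can be combined as in the generic sketch below. This is not LVB's actual implementation: the parameter names t0, decay, steps_per_temp and t_min and their default values are placeholders chosen here and do not correspond to LVB's six parameters, and random_neighbour and evaluate are caller-supplied functions (the caller could, for instance, alternate between NNI and SPR inside random_neighbour).

```python
import math
import random


def homoplasy_index(min_steps, tree_steps):
    """1 - (sum of per-character minimum steps) / (sum of per-character steps on T)."""
    return 1 - sum(min_steps) / sum(tree_steps)


def metropolis_accept(delta, temperature, rng=random):
    """Accept improvements always, a worsening by delta with probability exp(-delta/T)."""
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


def simulated_annealing(start, random_neighbour, evaluate, t0=1.0, decay=0.95,
                        steps_per_temp=100, t_min=1e-3, rng=random):
    """Minimise evaluate() with a geometric cooling schedule; return (best, best_value)."""
    current, current_val = start, evaluate(start)
    best, best_val = current, current_val
    temperature = t0
    while temperature > t_min:
        for _ in range(steps_per_temp):
            candidate = random_neighbour(current, rng)
            candidate_val = evaluate(candidate)
            if metropolis_accept(candidate_val - current_val, temperature, rng):
                current, current_val = candidate, candidate_val
                if current_val < best_val:
                    best, best_val = current, current_val
        temperature *= decay                 # geometric decay
    return best, best_val


# Toy usage on integers instead of trees.
best, best_val = simulated_annealing(
    10, lambda x, r: x + r.choice([-1, 1]), lambda x: (x - 3) ** 2,
    rng=random.Random(0))
print(best, best_val)
print(homoplasy_index([1, 1, 1], [1, 2, 1]))  # 0.25
```

Note that homoplasy_index divides by the total number of substitutions on T, so it is undefined for data in which no character varies.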
Other SLS Algorithms:
- Memetic algorithm by Ribeiro and Vianna (2003)
- initialisation by randomised construction heuristic
- local search phase based on iterative first improvement
on the SPR neighbourhood (applied to all candidate solutions produced
by crossover every 7 generations)
- elitist mechanisms for selection and crossover
- crossover is based on path relinking
general idea: modify one parent tree guided by the other
- no mutation
- GRASP+VND algorithm by Ribeiro and Vianna (2005)
- uses randomised greedy construction heuristic
+ VND in 'k-step' extensions of SPR neighbourhood
(appears to be outperformed by previously mentioned MA)
Note:
Many phylogenetic tree inference algorithms used in practice for
larger numbers of sequences are based on a construction heuristic
that is augmented by (limited) perturbative local search after each
construction step.
---
4. Related Problems
- different approaches for phylogenetic tree inference,
in particular:
+ distance-based methods
+ maximum likelihood approach
-> give rise to problems similar to large parsimony;
SLS methods (including evolutionary algorithms and simulated annealing)
have been applied very successfully to these
- simultaneous alignment and phylogenetic tree construction
(note: quality of phylogenetic tree depends on quality of underlying
multiple sequence alignment and vice versa)
---
5. Further Reading and Related Work
- Salter00: Algorithms for Phylogenetic Tree Reconstruction
[good overview/introduction]
*- Barker04: LVB: parsimony and simulated annealing in the search
for phylogenetic trees
*- RibVia03: A genetic algorithm for the phylogeny problem
using an optimized crossover strategy based on path-relinking
[parsimony]
- RibVia05: A GRASP/VND heuristic for the phylogeny problem using
a new neighborhood structure
[parsimony]
- CotMos02: Inferring Phylogenetic Trees Using Evolutionary Algorithms
[distance-based]
- Lewis98: A Genetic Algorithm for Maximum-Likelihood Phylogeny Inference
Using Nucleotide Sequence Data
- LemMin02: The metapopulation genetic algorithm: An efficient
solution for the problem of large phylogeny estimation
[max likelihood]
- Stamatakis05: An Efficient Program for Phylogenetic Inference Using
Simulated Annealing
[max likelihood]
other methods for solving phylogenetic tree inference problems (see Salter, 2000)
- 'star decomposition' methods, such as neighbour joining
(closely related to constructive search)
- 'branch-swapping' methods (= construction heuristics + local search
on partial / complete trees) and other sls methods
- divide-and-conquer methods (divide into subproblems, solve these,
reassemble solutions)
- branch & bound methods
For references to some medical and other applications, see RibVia05.
-----------------------------------------------------------------