notes for class on sls algorithms for phylogenetic tree inference
-----------------------------------------------------------------

(version 0.1, developed for trento-05)

---
outline:

1. brief intro: evolution and phylogenetic trees

2. phylogenetic tree inference via parsimony

3. sls methods for the large parsimony problem

4. related problems

5. further reading and related work


---
1. Introduction: Evolution and Phylogenetic Trees

* Phylogeny = (evolutionary) relationships between a set of species

Hypothesis:
All organisms on Earth are evolutionarily related via a common ancestor

Evidence:   
similarity of many molecular mechanisms and genetic material
(e.g., ribosomal RNA)

Assumption:
Phylogeny can be represented as a tree.
(believed to be true for higher organisms, problematic for microorganisms
and early stages of evolution)


* Phylogenetics = study of phylogeny

two approaches:

- classic phylogenetics: based on morphological features
	(e.g., shape of body/organs)

- modern phylogenetics: based on molecular features
	(e.g., information extracted from DNA, RNA or protein sequence data,
	in particular, gene sequences)


nomenclature:

character = feature
taxon = (complete) set of features for one species


challenges: 
- features may have evolved independently in different branches of the
	evolutionary tree (e.g., eyes of octopuses and vertebrates; the same
	holds for genes)
- gene duplication (common in nature) leads to sequence divergence
	(paralogues vs. orthologues) that can be misleading when
	trying to infer phylogeny
- organisms and even genes within an organism evolve at different rates


* Phylogenetic Trees

nodes = taxa, i.e., set of feature values for species 
	(e.g., sequence of a given gene)
edges = evolutionary relationships between nodes

leaves = observed species
internal nodes = (hypothetical) ancestors 

edge lengths = measure of evolutionary distance between nodes 
	(~ evolutionary time) 

typically restricted to binary trees
	(without loss of generality if edges of length 0 are allowed)

phylogenetic trees can be rooted or unrooted
	(root represents the common ancestor of the group of sequences)

here: focus on unrooted trees (slight simplification)


[illustration: sample-tree.pdf = fig 4 from Lewis98]


* Phylogenetic Tree Inference Problem

Given:
  n taxa 
  (i.e., for each of n species, values for each of m characters)

Objective: 
  find fully labelled phylogenetic tree that 'best' explains the given data 
  (i.e., optimises a suitable target function = score)

Assumptions:
- characters are mutually independent
- after two species diverged, their further evolution is independent 
- taxa are sequences (e.g., DNA or protein sequences) 
	- without loss of generality, mainly for convenience

Note: 
  for n taxa, there are Prod_{i=2}^{n} (2i-3) = (2n-3)!!
	different rooted binary trees (this grows faster than n!)
  e.g., n = 20 -> approx. 8 * 10^21 trees

  => completely impracticable to search exhaustively except for very small n

  (number of unrooted trees is (2n-5)!!, i.e., smaller only by a factor of 2n-3)
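
  A small Python sketch for checking these counts, assuming the standard
  double-factorial formulas (2n-3)!! for rooted and (2n-5)!! for unrooted
  binary trees with n labelled leaves; the function names are illustrative only:

```python
def odd_double_factorial(k: int) -> int:
    """Return k!! = k * (k-2) * ... * 3 * 1 for odd k (1 if k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_rooted_trees(n: int) -> int:
    """Number of rooted binary trees with n labelled leaves: (2n-3)!!."""
    return odd_double_factorial(2 * n - 3)


def num_unrooted_trees(n: int) -> int:
    """Number of unrooted binary trees with n labelled leaves: (2n-5)!!."""
    return odd_double_factorial(2 * n - 5)


for n in (4, 10, 20):
    print(n, num_rooted_trees(n), num_unrooted_trees(n))
```

  Python integers have arbitrary precision, so the counts are exact even
  for large n.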


the quality of phylogenetic trees can be assessed using various methods (target functions),
including:

- distance-based: given distance metric on sequences, the best 
	tree is the one most consistent with the observed distances 
	between sequences
  -> tree inference problem often solved using clustering methods

- parsimony: the best tree is the one with the fewest substitutions
	(mutations) between directly related sequences 

- maximum likelihood: given a probabilistic model of sequence evolution,
	the best tree is the one with the highest likelihood under that model

here: focus on parsimony


---
2. Phylogenetic Tree Inference via Parsimony


* The Parsimony Problem

Parsimony Problem: 
  Given n sequences, find a tree relating the sequences such that the
  number of substitutions (=mutations) between directly related sequences
  (nodes connected by an edge) is minimised

  label edges with # of substitutions => find tree with minimal sum of edge labels
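
  A minimal Python sketch of this score, assuming the tree is given as a
  list of edges and every node (including the hypothetical ancestors)
  already carries an aligned sequence; the names hamming, tree_score,
  edges and labels are made up for illustration:

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def tree_score(edges, labels) -> int:
    """Sum of edge labels, i.e., substitutions between directly related sequences.

    edges:  iterable of (u, v) node pairs
    labels: dict mapping every node to its (aligned) sequence
    """
    return sum(hamming(labels[u], labels[v]) for u, v in edges)


# Toy example: unrooted tree on 4 leaves with hypothetical ancestors u and v.
labels = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CAT",
          "u": "AAG", "v": "CAT"}
edges = [("A", "u"), ("B", "u"), ("u", "v"), ("C", "v"), ("D", "v")]
print(tree_score(edges, labels))  # 2 (both substitutions on edge u-v)
```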

The parsimony problem is usually solved in two stages:

Small Parsimony Problem:
  Given tree T with just the leaves labeled, find the labels for the 
  internal nodes such that score S(T) is minimised.
  S(T) = number of substitutions (from parents to children)

  Can be solved efficiently using Fitch's Algorithm 
	(time = linear in size of tree * sequence length)
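
  A sketch of the bottom-up pass of Fitch's algorithm in Python, which
  is enough to obtain S(T); the top-down pass that fixes concrete internal
  labels is omitted. Assumptions (not from the notes): the tree is rooted
  and written as nested pairs with leaf names as strings, and seqs maps
  leaf names to aligned sequences. An unrooted tree can be scored by
  rooting it on an arbitrary edge, since the score does not depend on the
  root position.

```python
def fitch_score(tree, seqs):
    """Parsimony score of a rooted binary tree (bottom-up Fitch pass).

    tree: a leaf name (str) or a pair (left, right) of subtrees
    seqs: dict mapping leaf name -> aligned sequence
    """
    score = 0

    def state_sets(node):
        # Returns one set of candidate states per sequence position.
        nonlocal score
        if isinstance(node, str):                 # leaf: observed sequence
            return [{ch} for ch in seqs[node]]
        left, right = (state_sets(child) for child in node)
        sets = []
        for a, b in zip(left, right):
            if a & b:                             # children agree on some state
                sets.append(a & b)
            else:                                 # disagreement: one substitution
                sets.append(a | b)
                score += 1
        return sets

    state_sets(tree)
    return score


seqs = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CAT"}
print(fitch_score((("A", "B"), ("C", "D")), seqs))  # 2
print(fitch_score((("A", "C"), ("B", "D")), seqs))  # 4
```

  The recursion visits every node once and does a constant number of set
  operations per sequence position, which matches the linear running
  time stated above (for a fixed alphabet).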

Large Parsimony Problem:
  Given n sequences, find the tree T (completely labeled) with the lowest 
  score S(T).

  (search over trees with only leaves labeled, label internal nodes
  and compute parsimony score using Fitch's algorithm)

  Note: (2n-5)!! different unrooted binary trees with n leaves; problem is NP-hard


* Greedy Constructive Search

Idea: iteratively build tree by adding sequences (leaves) one at a time

Algorithm (greedy construction):

  1. Randomly choose 3 sequences and place on unrooted tree T
	(Note: for given sequences there is only one such tree)
  2. Repeat
	Randomly select new sequence x and add to T resulting
	in larger tree T' such that S(T') is minimal over all T' that 
	can be obtained by adding x to T
	(Note: the new edge to x can be branched off any of the existing edges,
		dividing it into two)
     Until tree is complete.

Note: This algorithm is not guaranteed to construct optimal tree  

Because of the random choices, multiple runs can give different trees 
-> keeping the best tree found over repeated runs improves solution quality over time
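
The greedy construction above can be sketched in Python as follows.
Assumptions (for illustration only): trees are nested pairs rooted on an
arbitrary edge, S(T) is computed by a bottom-up Fitch pass, and ties
between equally good insertion points are broken by taking the first one
found. Because of the arbitrary rooting, a few insertion points near the
root describe the same unrooted tree; this only costs some redundant
evaluations.

```python
import random


def fitch_score(tree, seqs):
    """Parsimony score of a rooted binary tree (bottom-up Fitch pass)."""
    score = 0

    def state_sets(node):
        nonlocal score
        if isinstance(node, str):
            return [{ch} for ch in seqs[node]]
        left, right = (state_sets(child) for child in node)
        sets = []
        for a, b in zip(left, right):
            if a & b:
                sets.append(a & b)
            else:
                sets.append(a | b)
                score += 1
        return sets

    state_sets(tree)
    return score


def insertions(tree, leaf):
    """Yield every tree obtained by branching `leaf` off one edge of `tree`."""
    yield (tree, leaf)                    # split the edge above this node
    if not isinstance(tree, str):
        left, right = tree
        for t in insertions(left, leaf):
            yield (t, right)
        for t in insertions(right, leaf):
            yield (left, t)


def greedy_tree(seqs, rng):
    """Greedy stepwise addition; expects at least 3 sequences."""
    names = list(seqs)
    rng.shuffle(names)                    # random insertion order
    tree = (names[0], (names[1], names[2]))
    for name in names[3:]:
        # keep the extension T' with minimal S(T')
        tree = min(insertions(tree, name), key=lambda t: fitch_score(t, seqs))
    return tree


seqs = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CAT", "E": "CAG"}
best = min((greedy_tree(seqs, random.Random(seed)) for seed in range(10)),
           key=lambda t: fitch_score(t, seqs))
print(best, fitch_score(best, seqs))
```

The last lines show the multi-run idea: run the randomised construction
several times and keep the best tree.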

[illustration of search step]


---
3. SLS Methods for the Large Parsimony Problem


* Stochastic Local Search (SLS)

search space: phylogenetic trees with given sequences at leaves

initialisation: e.g., with tree obtained from construction heuristic

types of search steps (i.e., neighbourhood relations):
  - Nearest Neighbour Interchange, NNI:
	select internal edge e in T
	note that e has two subtrees attached to each of its two incident nodes
	interchange two of these, one from each end of e
	(two possibilities: AB-CD -> AC-BD, AD-BC)
  - Subtree Pruning and Regrafting, SPR:
	cut off a subtree T' from T
	reconnect the edge that connected T' to rest of T
	to another edge in T
	(eliminate node from which T' was cut off)
  - Tree Bisection and Reconnection, TBR:
	delete edge from T
	reconnect resulting two subtrees T', T'' by new edge between
	arbitrary edge in T' and arbitrary edge in T''
	(eliminate the two nodes that were incident to the deleted edge)
	
  [see http://artedi.ebc.uu.se/course/BioInfo-10p-2001/Phylogeny/Phylogeny-TreeSearch/Phylogeny-Search.html,
	http://www.hyphy.org/docs/analyses/methods/nni.html]
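
  A Python sketch of the NNI neighbourhood, assuming an unrooted binary
  tree stored as an adjacency dict (node name -> set of neighbour names;
  leaves have one neighbour, internal nodes three). This representation
  and the function name are assumptions made for illustration; SPR and
  TBR are not covered.

```python
def nni_neighbours(adj):
    """Yield all trees that are one NNI step away from `adj`.

    adj is not modified; every neighbour is a fresh adjacency dict.
    """
    for u in sorted(adj):
        for v in sorted(adj[u]):
            # consider each internal edge (u, v) exactly once
            if len(adj[u]) != 3 or len(adj[v]) != 3 or u >= v:
                continue
            a, b = sorted(adj[u] - {v})   # subtrees hanging off u (a stays)
            c, d = sorted(adj[v] - {u})   # subtrees hanging off v
            for x in (c, d):              # two possible interchanges
                new = {node: set(nbrs) for node, nbrs in adj.items()}
                # move subtree b from u to v, and subtree x from v to u
                new[u].remove(b)
                new[u].add(x)
                new[v].remove(x)
                new[v].add(b)
                new[b].remove(u)
                new[b].add(v)
                new[x].remove(v)
                new[x].add(u)
                yield new


# Quartet AB-CD with internal edge u-v
quartet = {"A": {"u"}, "B": {"u"}, "C": {"v"}, "D": {"v"},
           "u": {"A", "B", "v"}, "v": {"C", "D", "u"}}
for t in nni_neighbours(quartet):
    print(sorted(t["u"] - {"v"}), sorted(t["v"] - {"u"}))
```

  A tree with n leaves has n-3 internal edges, so this generator yields
  2(n-3) neighbours.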

objective function: parsimony score, i.e., # substitutions in 
	the (completely labeled) tree

[illustration of search steps]


Iterative Best Improvement Algorithm:

  1. Construct initial tree T (e.g., randomly or using greedy construction heuristic)
  2. Repeat
	T_old := T
	N(T) := set of all trees that can be obtained by NNI from T
		[use Fitch's algorithm to compute internal node labels + S(T') for each T']
	TT := set of T' from N(T)+{T} with minimal S(T')
	randomly choose T' from TT; T := T'
     Until T = T_old (local minimum)

Note: A variant of this where the trees T' in N(T) are generated in some order
	and the first with S(T') < S(T) is accepted can give better performance
	(first improvement method)
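
The best-improvement loop described above can be sketched generically in
Python. To keep the sketch short, the neighbourhood and the score are
passed in as functions (hypothetical parameters neighbours and score);
for the parsimony problem these would be an NNI generator and a Fitch
scorer. The toy call at the end uses integers instead of trees.

```python
import random


def iterative_best_improvement(start, neighbours, score, rng):
    """Move to a best candidate in N(T) + {T} until T itself is chosen."""
    current = start
    while True:
        candidates = list(neighbours(current)) + [current]
        scored = [(score(t), t) for t in candidates]
        best = min(s for s, _ in scored)
        # TT: all candidates with minimal score; ties broken at random
        chosen = rng.choice([t for s, t in scored if s == best])
        if chosen == current:             # T' = T: local minimum reached
            return current
        current = chosen


# Toy landscape: states are integers, score has its minimum at x = 3.
result = iterative_best_improvement(
    start=10,
    neighbours=lambda x: [x - 1, x + 1],
    score=lambda x: (x - 3) ** 2,
    rng=random.Random(0),
)
print(result)  # 3
```

Note that if T has neighbours with exactly the same score, the random
tie-breaking may take some sideways steps before T itself is chosen.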

Problem: Can easily get stuck in local optima
Solution: Allow worsening swaps 
	(-> Randomised Iterative Improvement, Simulated Annealing, ...)

Can also use other SLS methods, e.g., Evolutionary Algorithms, ...


LVB [Barker, 2004]:

- initialisation: choose tree with random topology

- neighbourhood: NNI + SPR (used alternatingly)

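- evaluation function: homoplasy index 
	HI(T) := 1 - (sum_{chars c} m_c) / (sum_{chars c} s_c(T))
	where m_c = minimum possible number of substitutions for character c,
	      s_c(T) = number of substitutions for character c on tree T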

	[see http://www.biol.lu.se/mibiol/research/wachen/phylogentics/BI3-2-MP.pdf]

- step function: standard simulated annealing mechanism:
	- randomly select neighbour (alternatingly using NNI, SPR)
	- accept according to Metropolis condition

- annealing schedule: geometric decay
	schedule + termination criterion controlled by 6 parameters
	(incl decay rate, # search steps per temp value, etc.)
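
A generic Python sketch of simulated annealing with the Metropolis
condition and geometric cooling. This is not LVB's implementation: the
parameters t0, alpha, steps_per_temp and t_min are hypothetical stand-ins
for its schedule parameters, and the neighbour and score functions are
passed in (for LVB these would be a random NNI or SPR step and the
evaluation function above). The sketch also remembers the best state
seen, which is a common addition rather than part of the basic scheme.

```python
import math
import random


def simulated_annealing(start, random_neighbour, score, rng,
                        t0=1.0, alpha=0.9, steps_per_temp=50, t_min=1e-3):
    """Minimise `score`; returns (best state seen, its score)."""
    current, current_score = start, score(start)
    best, best_score = current, current_score
    t = t0
    while t > t_min:
        for _ in range(steps_per_temp):
            cand = random_neighbour(current, rng)
            cand_score = score(cand)
            delta = cand_score - current_score
            # Metropolis condition: always accept improvements, accept
            # worsening steps with probability exp(-delta / t)
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_score = cand, cand_score
                if current_score < best_score:
                    best, best_score = current, current_score
        t *= alpha                        # geometric decay of temperature
    return best, best_score


# Toy landscape: states are integers, score has its minimum at x = 3.
best, best_score = simulated_annealing(
    start=10,
    random_neighbour=lambda x, rng: x + rng.choice([-1, 1]),
    score=lambda x: (x - 3) ** 2,
    rng=random.Random(0),
)
print(best, best_score)
```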


Other SLS Algorithms:

- Memetic algorithm by Ribeiro and Vianna (2003)
	- initialisation by randomised construction heuristic
	- local search phase based on iterative first improvement
	  on the SPR neighbourhood (applied to all candidate solutions
	  produced by crossover every 7 generations)
	- elitist mechanisms for selection and crossover
	- crossover is based on path relinking
	  (general idea: modify one parent tree guided by the other)
	- no mutation

- GRASP+VND algorithm by Ribeiro and Vianna (2005)
	- uses randomised greedy construction heuristic
		+ VND in 'k-step' extensions of SPR neighbourhood
	
	(appears to be outperformed by previously mentioned MA)


Note:

Many phylogenetic tree inference algorithms used in practice for 
larger numbers of sequences are based on a construction heuristic
that is augmented by (limited) perturbative local search after each
construction step.


---
4. Related Problems

- different approaches for phylogenetic tree inference,
  in particular:
  + distance-based methods
  + maximum likelihood approach

  -> give rise to problems similar to large parsimony,
	SLS methods (including evolutionary algorithms and simulated
	annealing) have been applied very successfully to these

- simultaneous alignment and phylogenetic tree construction
  (note: quality of phylogenetic tree depends on quality of underlying
	multiple sequence alignment and vice versa)


---
5. Further Reading and Related Work

- Salter00: Algorithms for Phylogenetic Tree Reconstruction
  [good overview/introduction]


*- Barker04: LVB: parsimony and simulated annealing in the search 
	for phylogenetic trees

*- RibVia03: A genetic algorithm for the phylogeny problem
	using an optimized crossover strategy based on path-relinking
  [parsimony]

- RibVia05: A GRASP/VND heuristic for the phylogeny problem using
	a new neighborhood structure
  [parsimony]


- CotMos02: Inferring Phylogenetic Trees Using Evolutionary Algorithms
  [distance-based]


- Lewis98: A Genetic Algorithm for Maximum-Likelihood Phylogeny Inference 
	Using Nucleotide Sequence Data

- LemMin02: The metapopulation genetic algorithm: An efficient
	solution for the problem of large phylogeny estimation
  [max likelihood]

- Stamatakis05: An Efficient Program for Phylogenetic Inference Using 
	Simulated Annealing
  [max likelihood]


other methods for solving phylogenetic tree inference problems (see Salter00):
- 'star decomposition' methods, such as neighbour joining
	(closely related to constructive search)
- 'branch-swapping' methods (= construction heuristics + local search
	on partial / complete trees) and other sls methods
- divide-and-conquer methods (divide into subproblems, solve these,
	reassemble solutions)

- branch & bound methods


For references to some medical and other applications, see RibVia05.


-----------------------------------------------------------------