notes for class on sls algorithms for phylogenetic tree inference
-----------------------------------------------------------------
(version 0.1, developed for trento-05)

---

outline:

1. brief intro: evolution and phylogenetic trees
2. phylogenetic tree inference via parsimony
3. sls methods for the large parsimony problem
4. related problems
5. further reading and related work

---

1. Introduction: Evolution and Phylogenetic Trees

* Phylogeny = (evolutionary) relationships between a set of species

  Hypothesis: All organisms on Earth are evolutionarily related via a common ancestor

  Evidence: similarity of many molecular mechanisms and genetic material
  (e.g., ribosomal RNA)

  Assumption: Phylogeny can be represented as a tree.
  (believed to be true for higher organisms, problematic for microorganisms
  and early stages of evolution)

* Phylogenetics = study of phylogeny

  two approaches:
  - classic phylogenetics: based on morphological features (e.g., shape of body/organs)
  - modern phylogenetics: based on molecular features (e.g., information extracted
    from DNA, RNA or protein sequence data, in particular, gene sequences)

  nomenclature:
    character = feature
    taxon = (complete) set of features for one species

  challenges:
  - features may have evolved independently in various branches of evolutionary tree
    (e.g., eyes of octopi, vertebrates - same for genes)
  - gene duplication (common in nature) leads to sequence divergence (paralogues vs.
    orthologues) that can be misleading when trying to infer phylogeny
  - organisms, and even genes within an organism, evolve at different rates

* Phylogenetic Trees

  nodes = taxa, i.e., set of feature values for species (e.g., sequence of a given gene)
  edges = evolutionary relationships between nodes
  leaves = observed species
  internal nodes = (hypothetical) ancestors
  edge lengths = measure of evolutionary distance between nodes (~ evolutionary time)

  typically restricted to binary trees
  (without loss of generality when edges of length 0 are allowed)

  phylogenetic trees can be rooted or unrooted
  (root represents the ultimate ancestor of the group of sequences)

  here: focus on unrooted trees (slight simplification)

  [illustration: sample-tree.pdf = fig 4 from Lewis98]

* Phylogenetic Tree Inference Problem

  Given: n taxa (i.e., for each of n species, values for each of m characters)

  Objective: find fully labelled phylogenetic tree that 'best' explains the given data
  (i.e., optimises a suitable target function = score)

  Assumptions:
  - characters are mutually independent
  - after two species diverged, their further evolution is independent
  - taxa are sequences (e.g., DNA or protein sequences);
    this is without loss of generality, mainly for convenience

  Note: for n taxa, there are Prod_{i=2}^{n} (2i-3) = (2n-3)!! = 2^{\Theta(n log n)}
  different rooted binary trees, e.g., n = 20 -> approx. 8 * 10^21 trees
  => completely impracticable to search exhaustively except for very small n
  (the number of unrooted binary trees, (2n-5)!!, is smaller only by a factor of 2n-3)

  the quality of phylogenetic trees can be assessed using various methods
  (target functions), including:
  - distance-based: given distance metric on sequences, the best tree is the one
    most consistent with the observed distances between sequences
    -> tree inference problem often solved using clustering methods
  - parsimony: the best tree is the one with the fewest substitutions (mutations)
    between directly related sequences
  - maximum likelihood: given a probabilistic model of sequence evolution,
    the best tree is the one with the highest likelihood under that model

  here: focus on parsimony

---

2. Phylogenetic Tree Inference via Parsimony

* The Parsimony Problem

  Parsimony Problem: Given n sequences, find a tree relating the sequences such that
  the number of substitutions (= mutations) between directly related sequences
  (nodes connected by an edge) is minimised

  label edges with # of substitutions => find tree with minimal sum of edge labels

  The parsimony problem is usually solved in two stages:

  Small Parsimony Problem: Given tree T with just the leaves labeled, find the labels
  for the internal nodes such that score S(T) is minimised.
  S(T) = number of substitutions (from parents to children)

  Can be solved efficiently using Fitch's Algorithm
  (time = linear in size of tree * sequence length)

  Large Parsimony Problem: Given n sequences, find the tree T (completely labeled)
  with the lowest score S(T).
  (search over trees with only leaves labeled, label internal nodes and compute
  parsimony score using Fitch's algorithm)

  Note: super-exponentially many different trees with n leaves (see Section 1),
  problem is NP-hard

* Greedy Constructive Search

  Idea: iteratively build tree by adding sequences (each via a new edge) one at a time

  Algorithm (greedy construction):

  1.
     Randomly choose 3 sequences and place on unrooted tree T
     (Note: for given sequences there is only one such tree)

  2. Repeat
       Randomly select new sequence x and add to T, resulting in larger tree T'
       such that S(T') is minimal over all T' that can be obtained by adding x to T
       (Note: the new edge to x can be branched off any of the existing edges,
       dividing it into two)
     Until tree is complete.

  Note: This algorithm is not guaranteed to construct an optimal tree
  Running it multiple times can give different results
  -> keeping the best of these improves solution quality over time

  [illustration of search step]

---

3. SLS Methods for the Large Parsimony Problem

* Stochastic Local Search (SLS)

  search space: phylogenetic trees with given sequences at leaves

  initialisation: e.g., with tree obtained from construction heuristic

  types of search steps (i.e., neighbourhood relations):

  - Nearest Neighbour Interchange, NNI:
      select internal edge e in T
      note that e has two subtrees attached to each of its two incident nodes
      interchange two of these (two possibilities: AB-CD -> AC-BD, AD-BC)

  - Subtree Pruning and Regrafting, SPR:
      cut off a subtree T' from T
      reconnect the edge that connected T' to rest of T to another edge in T
      (eliminate node from which T' was cut off)

  - Tree Bisection and Reconnection, TBR:
      delete edge from T
      reconnect resulting two subtrees T', T'' by new edge between
      arbitrary edge in T' and arbitrary edge in T''
      (eliminate any degree-2 nodes left by the deleted edge)

  [see http://artedi.ebc.uu.se/course/BioInfo-10p-2001/Phylogeny/Phylogeny-TreeSearch/Phylogeny-Search.html,
   http://www.hyphy.org/docs/analyses/methods/nni.html]

  objective function: parsimony score, i.e., # substitutions in the (completely labeled) tree

  [illustration of search steps]

  Iterative Best Improvement Algorithm:

  1. Construct initial tree T (e.g., randomly or using greedy construction heuristic)

  2.
     Repeat
       N(T) := set of all trees that can be obtained by NNI from T
       [use Fitch's algorithm to compute internal node labels + S(T') for each T' in N(T)]
       TT := set of T' from N(T)+{T} with minimal S(T')
       T_prev := T
       T := tree chosen uniformly at random from TT
     Until T = T_prev (local minimum)

  Note: A variant of this, where the trees T' in N(T) are generated in some order
  and the first with S(T') < S(T) is accepted, can give better performance
  (first improvement method)

  Problem: Can easily get stuck in local optima

  Solution: Allow worsening swaps (-> Randomised Iterative Improvement, Simulated Annealing, ...)

  Can also use other SLS methods, e.g., Evolutionary Algorithms, ...

  LVB [Barker, 2004]:

  - initialisation: choose tree with random topology
  - neighbourhood: NNI + SPR (used alternatingly)
  - evaluation function: homoplasy index
      := 1 - (sum_{chars c} m_c) / (sum_{chars c} s_c(T))
      where m_c = minimum possible number of substitutions for character c,
      s_c(T) = number of substitutions for character c in T
      (so that sum_{chars c} s_c(T) = S(T))
    [see http://www.biol.lu.se/mibiol/research/wachen/phylogentics/BI3-2-MP.pdf]
  - step function: standard simulated annealing mechanism:
    - randomly select neighbour (alternatingly using NNI, SPR)
    - accept according to Metropolis condition
  - annealing schedule: geometric decay schedule + termination criterion
    controlled by 6 parameters (incl. decay rate, # search steps per temp value, etc.)
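
  Sketch (not LVB's actual code): the Metropolis condition and a geometric decay
  schedule from the step function above can be illustrated on an abstract search
  space. All names and parameters here (metropolis_accept, simulated_annealing,
  t_start, decay, steps_per_temp, t_min) are hypothetical and do not correspond to
  LVB's six parameters; the neighbour and score functions are supplied by the caller
  and would, for trees, be something like a random NNI/SPR step and the parsimony score.

```python
import math
import random


def metropolis_accept(delta, temperature, rng):
    """Metropolis condition for minimisation.

    Non-worsening steps (delta <= 0) are always accepted; worsening steps
    are accepted with probability exp(-delta / temperature).
    """
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


def simulated_annealing(initial, neighbour, score, t_start=1.0, decay=0.95,
                        steps_per_temp=50, t_min=1e-3, seed=0):
    """Generic simulated annealing with a geometric cooling schedule.

    neighbour(candidate, rng) returns a random neighbour; score(candidate)
    returns the value to be minimised. Parameter defaults are illustrative
    assumptions, not tuned values.
    """
    rng = random.Random(seed)
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    temperature = t_start
    while temperature > t_min:           # simple termination criterion (assumed)
        for _ in range(steps_per_temp):  # fixed number of steps per temperature
            candidate = neighbour(current, rng)
            candidate_score = score(candidate)
            if metropolis_accept(candidate_score - current_score, temperature, rng):
                current, current_score = candidate, candidate_score
                if current_score < best_score:
                    best, best_score = current, current_score
        temperature *= decay             # geometric decay
    return best, best_score


if __name__ == "__main__":
    # Toy stand-in for a tree search: minimise (x - 7)^2 over the integers.
    best, best_score = simulated_annealing(
        40, lambda x, rng: x + rng.choice((-1, 1)), lambda x: (x - 7) ** 2)
    print(best, best_score)
```

  The incumbent (best, best_score) is tracked separately because the current
  candidate may get worse after an accepted worsening step.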
  Other SLS Algorithms:

  - Memetic algorithm by Ribeiro and Vianna (2003)
    - initialisation by randomised construction heuristic
    - local search phase based on iterative first improvement on the SPR neighbourhood
      (applied to all candidate solutions produced by crossover every 7 generations)
    - elitist mechanisms for selection and crossover
    - crossover is based on path relinking
      general idea: modify one parent tree guided by the other
    - no mutation

  - GRASP+VND algorithm by Ribeiro and Vianna (2005)
    - uses randomised greedy construction heuristic + VND in 'k-step' extensions
      of the SPR neighbourhood
      (appears to be outperformed by the previously mentioned memetic algorithm)

  Note: Many phylogenetic tree inference algorithms used in practice for larger
  numbers of sequences are based on a construction heuristic that is augmented by
  (limited) perturbative local search after each construction step.

---

4. Related Problems

- different approaches for phylogenetic tree inference, in particular:
  + distance-based methods
  + maximum likelihood approach
  -> give rise to problems similar to large parsimony;
     SLS methods (including EA, SA) have been applied very successfully to these

- simultaneous alignment and phylogenetic tree construction
  (note: quality of phylogenetic tree depends on quality of underlying
  multiple sequence alignment and vice versa)

---

5.
Further Reading and Related Work:

- Salter00: Algorithms for Phylogenetic Tree Reconstruction
  [good overview/introduction]
*- Barker04: LVB: parsimony and simulated annealing in the search for phylogenetic trees
*- RibVia03: A genetic algorithm for the phylogeny problem using an optimized
  crossover strategy based on path-relinking
  [parsimony]
- RibVia05: A GRASP/VND heuristic for the phylogeny problem using a new
  neighborhood structure
  [parsimony]
- CotMos02: Inferring Phylogenetic Trees Using Evolutionary Algorithms
  [distance-based]
- Lewis98: A Genetic Algorithm for Maximum-Likelihood Phylogeny Inference
  Using Nucleotide Sequence Data
- LemMin02: The metapopulation genetic algorithm: An efficient solution for the
  problem of large phylogeny estimation
  [max likelihood]
- Stamatakis05: An Efficient Program for Phylogenetic Inference Using Simulated Annealing
  [max likelihood]

other methods for solving phylogenetic tree inference problems (see Salter, 2000)
- 'star decomposition' methods, such as neighbour joining
  (closely related to constructive search)
- 'branch-swapping' methods (= construction heuristics + local search on
  partial / complete trees) and other sls methods
- divide-and-conquer methods (divide into subproblems, solve these, reassemble solutions)
- branch & bound methods

For references to some medical and other applications, see RibVia05.

-----------------------------------------------------------------
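
supplementary sketch for Section 2 (small parsimony):

The bottom-up pass of Fitch's algorithm, which yields the parsimony score S(T),
can be sketched as follows. This is a minimal illustration under stated
assumptions, not code from any of the cited papers: the tree is given as nested
2-tuples of leaf names (an unrooted tree is rooted at an arbitrary edge, which
does not change the score), all sequences have equal length, and the names
fitch_score and state_sets are hypothetical. Recovering the internal node labels
would need an additional top-down pass, which is omitted here.

```python
def fitch_score(tree, sequences):
    """Parsimony score of a rooted binary tree via Fitch's bottom-up pass.

    tree: a leaf name (str) or a 2-tuple (left_subtree, right_subtree).
    sequences: dict mapping leaf names to equal-length strings.
    """
    def state_sets(node):
        # Returns (list of candidate state sets per site, substitutions so far).
        if isinstance(node, str):
            return [{ch} for ch in sequences[node]], 0
        left_sets, left_cost = state_sets(node[0])
        right_sets, right_cost = state_sets(node[1])
        sets, cost = [], left_cost + right_cost
        for a, b in zip(left_sets, right_sets):
            common = a & b
            if common:
                sets.append(common)   # children agree: no substitution needed
            else:
                sets.append(a | b)    # children disagree: one substitution
                cost += 1
        return sets, cost

    return state_sets(tree)[1]


if __name__ == "__main__":
    seqs = {"w": "AAG", "x": "AAA", "y": "GGA", "z": "AGA"}
    print(fitch_score((("w", "x"), ("y", "z")), seqs))  # 3
    print(fitch_score((("w", "y"), ("x", "z")), seqs))  # 4
```

Each site is processed independently, which matches the assumption that
characters are mutually independent; the work per internal node is proportional
to the sequence length for a fixed alphabet.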