Notes from Feb 4 class on Maximum Likelihood
Anne Condon

Maximum likelihood
------------------

This method takes a statistical approach to tree construction, given
sequence data. It addresses limitations of previous methods:
distance-based methods don't use complete information about sequence
composition, only distances, and parsimony does not work well in the
face of long evolutionary distances, where homoplasies (sites that
mutate and then revert to their original value) can occur.

The idea is intuitive: we assume that the data (the sequences at the
leaves of the tree) was generated at the leaves of some tree T
according to some probabilistic model M. The model M determines how
sequences evolve from a node to its child in the tree. We would like
to find the tree T that best explains the data with respect to M.

More specifically, consider a tree T that has distances (weights) on
its edges and no labels on internal nodes, except that data D labels
the leaves. The likelihood of this tree is the probability that the
tree generates the data at the leaves, under the probabilistic model.
We denote this probability by P[D|T,M]. We use L(T) to denote the
likelihood of tree T, i.e. L(T) = P[D|T,M].

[This approach circumvents the problem of defining a general model of
evolution that accounts for the process of branching. Such a model
would have to account for speciation and extinction processes, as
well as the process by which the species under study have been
selected from among those potentially available. This seems
impossible to model adequately. With the general model, one could
obtain prior probabilities on the form of the trees, but with the
maximum likelihood approach this is ignored; the tree is the unknown
entity to be estimated.]

Model of evolution
------------------

In order to compute the likelihood of a tree, our model of evolution
must tell us the probability that a sequence S1 changes to S2 during
evolution along an edge of the tree with evolutionary distance t.
We'll assume a very simple model M:

 (i) Each site evolves independently; there are no insertion or
     deletion events.

     The independence assumption implies that P[D|T,M] is the product
     of P[Di|T,M], where Di is the data at site i. So, let's focus on
     a single site.

 (ii) Two branches of the tree evolve independently.

With these assumptions we can calculate the likelihood of a tree if
we know P[y|x,t]: the probability that a site which initially has
value x will have value y after t units of time have elapsed. We'll
come back to how P[y|x,t] is calculated later, and first show how it
can be used to calculate the likelihood of a tree.

Likelihood of a given tree
--------------------------

As a step towards the larger goal of finding the maximum likelihood
tree, let's first consider the problem of calculating the likelihood
of a given tree T. With the assumptions above, we can write an
expression for this likelihood. An example will make this clear:

Example tree T:

          .
       t1/ \t2
        .   C
     t3/ \t4
      A   C

First suppose for a moment that the internal nodes have labels:

          r
       t1/ \t2
        v   C
     t3/ \t4
      A   C

The likelihood of this labeled tree (i.e. Pr(data, including internal
labels | tree)) is:

    P[v|r,t1] P[C|r,t2] P[A|v,t3] P[C|v,t4]

To get an expression for the likelihood of the tree T (which has no
labels on the internal nodes), we assume some initial distribution
{P(r)} of characters at the root, and then sum over all r and v:

    sum_r sum_v P(r) P[v|r,t1] P[C|r,t2] P[A|v,t3] P[C|v,t4]

[Note that we are using assumption (ii) of the model here.]

In general, the number of terms in the sum is exponential in the
number of nodes in the tree. Luckily there is an efficient method to
calculate this sum. To describe this method, we use the following
notation: let L(T|r) be the likelihood of tree T when the value of
the site at the root is r. Then

    L(T) = sum_r P(r) L(T|r).

Moreover, L(T|r) can be expressed as the product of two terms.
Let t' and t'' be the distances to the children of the root of T, and
let these children be the roots of subtrees T' and T''. Then

    L(T|r) = (sum_v P[v|r,t'] L(T'|v)) (sum_v P[v|r,t''] L(T''|v)).

If T is simply a leaf, then L(T|r) = 1 if r is the label of the leaf,
and 0 otherwise.

Going back to our example, we can rearrange the terms and the
summations to match the structure of the recurrences above. (How?)
The quantities L(T'|r) can thus be calculated in a bottom-up fashion
for all subtrees T' and all four values of r, resulting in an
algorithm for calculating the likelihood of the tree that runs in
time linear in the size of the tree.

The Jukes-Cantor Model
----------------------

We now return to our model of evolution, which specifies how a
character x can evolve to y in time t. The simplest model, and one
which is widely used, is the Jukes-Cantor model. This model has an
associated parameter m, which is the rate of base substitutions
(i.e. the expected number per unit time). A base substitution happens
when a base is replaced by a *distinct* new base. (Note that I stated
this differently in class, but prefer this definition because it is
consistent with other usages.) If dt is a small amount of time, then
the probability of a mutation from x to a particular distinct base y
in time dt is m(dt)/3, and m(dt) is the probability of some base
substitution.

Let p(t) denote the probability that the base does NOT change in t
units of time. We can express p() recursively as:

    p(t) = (1 - m(dt)) p(t-dt) + (m(dt)/3)(1 - p(t-dt)),    p(0) = 1.

Solving this equation (the iteration method will work), we get:

    p(t) = 3/4 (1 - 4m(dt)/3)^{t/dt} + 1/4.

Thus, the probability that the base changes in t units of time,
namely 1 - p(t), is

    3/4 - 3/4 (1 - 4m(dt)/3)^{t/dt}.

The probability of a particular base change, e.g. A -> T, is one
third of this:

    1/4 - 1/4 (1 - 4m(dt)/3)^{t/dt}.

Taking the limit of this as dt goes to 0, we get:

    1/4 - 1/4 e^{-4mt/3}.

(This uses the fact that lim_{n -> infty} (1 - 1/n)^n = e^{-1}.)
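The closed forms just derived can be sketched in a few lines of
Python (the function name and the sanity checks are mine, not part of
the notes):

```python
from math import exp

def jc_transition_prob(x, y, t, m=1.0):
    """Jukes-Cantor P[y|x,t]: probability that a site with value x
    has value y after t units of time, with substitution rate m."""
    if x == y:
        # p(t): probability the base is unchanged after time t
        return 0.25 + 0.75 * exp(-4.0 * m * t / 3.0)
    # probability of one particular change, e.g. A -> T
    return 0.25 - 0.25 * exp(-4.0 * m * t / 3.0)

# Sanity checks: at t = 0 the base surely stays put, the four outcomes
# always sum to 1, and as t grows each outcome approaches 1/4.
assert jc_transition_prob("A", "A", 0.0) == 1.0
assert abs(sum(jc_transition_prob("A", y, 0.7) for y in "ACGT") - 1.0) < 1e-12
```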
Therefore, for y distinct from x,

    P[y|x,t] = (1 - e^{-4mt/3})/4,

and P[x|x,t] = p(t) = (1 + 3e^{-4mt/3})/4. Since these probabilities
depend only on the product mt, one can adopt the convention that
m = 1, so that t measures the expected number of substitutions. We
can then think of the edge distances as time measured on a molecular
clock which may run at different rates in different segments of the
tree.

The Pulley Principle
--------------------

Although the likelihood estimation algorithm appears to produce a
rooted tree, it turns out that the specific location of the root does
not change the likelihood of the tree. First, it is possible to show
that if the root n has children n' and n'', with distances t' and t''
to these children, then subtracting an amount x from t' (while
keeping it positive) and adding it to t'' yields a tree with the same
likelihood. Felsenstein calls this the Pulley Principle. (Prove it!)
More generally, if you turn the rooted tree into an unrooted one (by
removing n and placing distance t'+t'' between n' and n''), and you
add a root anywhere along one of the edges of the resulting unrooted
tree (dividing that edge and its distance into two), then you get a
tree with the same likelihood.

Finding optimum edge distances for a given tree T
-------------------------------------------------

We can exploit the pulley principle to find optimum edge distances in
practice, in the following way: set some initial values for the edge
distances. Fix all but one of them, and find the optimal value of
that one, relative to all of the others. Repeat for each edge in
turn, until the distances converge. The distances found in this way
could be local optima, but in practice the method appears to reliably
find global optima.
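Putting the pieces together, here is a minimal Python sketch (the
names and the tuple encoding of trees are my own) of the bottom-up
likelihood computation with the Jukes-Cantor model (taking m = 1 and
a uniform root distribution), including a numerical check of the
pulley principle on the example tree:

```python
from math import exp

BASES = "ACGT"

def jc(x, y, t):
    """Jukes-Cantor P[y|x,t] with rate m = 1 (the convention above)."""
    same = 0.25 + 0.75 * exp(-4.0 * t / 3.0)
    return same if x == y else (1.0 - same) / 3.0

def cond_likelihood(tree):
    """Return {r: L(T|r)}. A tree is either a leaf label ('A', 'C',
    'G' or 'T') or a tuple (left, t_left, right, t_right)."""
    if isinstance(tree, str):                      # leaf case
        return {r: 1.0 if r == tree else 0.0 for r in BASES}
    left, tl, right, tr = tree
    Ll, Lr = cond_likelihood(left), cond_likelihood(right)
    # L(T|r) = (sum_v P[v|r,t'] L(T'|v)) (sum_v P[v|r,t''] L(T''|v))
    return {r: sum(jc(r, v, tl) * Ll[v] for v in BASES) *
               sum(jc(r, v, tr) * Lr[v] for v in BASES)
            for r in BASES}

def likelihood(tree):
    """L(T) = sum_r P(r) L(T|r), with uniform root distribution."""
    return sum(0.25 * lr for lr in cond_likelihood(tree).values())

# The example tree from the notes, with some made-up edge distances.
t = (("A", 0.3, "C", 0.4), 0.1, "C", 0.2)

# Pulley principle: sliding the root along the edge between its
# children (keeping t1 + t2 = 0.3 fixed) leaves L(T) unchanged.
t_slid = (("A", 0.3, "C", 0.4), 0.25, "C", 0.05)
assert abs(likelihood(t) - likelihood(t_slid)) < 1e-12
```

Because each subtree's table of four values is computed once, the
running time is linear in the size of the tree, as claimed above.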
To find the optimal value of one edge distance, t, relative to the
others, it is possible (by application of the pulley principle) to
express the optimal value as the solution of a fixed-point equation,
t = f(t), where f involves other (constant) terms based on likelihood
values for other parts of the tree. Starting with an initial value
t_0 and iterating t_k = f(t_{k-1}), one can usually quickly converge
on the right value of t.

Finding the maximum likelihood tree
-----------------------------------

Felsenstein's original algorithm used the stepwise addition heuristic
for exploring the space of trees. I don't know how it is currently
done. Could be interesting to look at.

Extensions
----------

The method, as described above, can easily be extended in the
following ways:

- Ambiguity in the original data: at some site, the character could
  be, e.g., adenine or guanine (a purine), but you are not sure
  which. Simply set the likelihood of either value to 1 at the leaf.

- Use three different substitution rates at the three positions of a
  codon.

- Generalizations of the Jukes-Cantor model, in which substitution
  probabilities are not necessarily equal. For example, the Kimura
  2-parameter model uses different probabilities for transversions
  versus transitions, while keeping each nucleotide equally frequent.

However, there does not appear to be a practical generalization of
the maximum likelihood algorithm that handles gaps in alignment data.
Columns in alignment data which contain gaps are apparently typically
omitted from analysis.

-------------------------------------------------------------------------

Current research directions in phylogeny
----------------------------------------

All of the algorithms we have described have been around for decades,
along with many, many more approaches. What are the current
challenges in phylogeny? Here is my take, based on somewhat limited
knowledge.

- Improved algorithms for large data sets.
  With the rapid rate at which new sequence data is becoming
  available, there is a need for phylogenetic analysis of larger data
  sets across longer evolutionary distances than ever before.
  Experimental analysis of current methods indicates that the fast,
  distance-based methods are not sufficiently accurate on data sets
  with large evolutionary distances, and methods based on NP-hard
  optimization problems, such as MP or ML, are too slow on data sets
  with large numbers of taxa. Faster and better methods would have
  important applications.

  o Statistical and algorithmic analysis can shed light on when
    various methods are expected to perform well or poorly. For
    example, one would expect that in order to accurately reconstruct
    a tree on a given number of taxa, the length of the sequences
    needed would grow as the maximum evolutionary distance between
    the sequences increases. A recent result of Erdos et al. (1999)
    shows that under probabilistic models such as Jukes-Cantor, the
    sequence length that suffices to guarantee accuracy of standard
    polynomial time algorithms is exponential in the maximum
    evolutionary distance in the data set. This is supported by
    experimental evidence.

  o Such insights can lead to improved algorithms. For example, a
    divide and conquer approach that groups taxa into closely related
    subgroups is well motivated by the analysis cited above. There
    may be good biological reasons for this too; different parts of a
    sequence may be appropriate for phylogenetic analysis of
    different subspecies. See the work of Tandy Warnow et al.,
    http://www.cs.utexas.edu/users/tandy/, for examples of this type
    of work.

- Algorithms for improved models of evolution.

  o Ribosomal RNA is used for reconstruction of trees involving
    organisms which are evolutionarily distant, because it is a
    universal molecule and highly conserved.
    Since the structure of this molecule is a large determinant of
    its function, and the structure is determined by pairing of
    complementary bases, one would not expect all sites to evolve
    independently. Can improved analyses be obtained by taking into
    account pairwise correlations between bases?

  o Modeling gaps. Thorne, Kishino, and Felsenstein (1991) proposed a
    model of evolution with insertions and deletions (indels). This
    model uses three classes of variables. First, a model of
    substitution that does not allow for indels. Second, associated
    with each nucleotide is a random variable D that is exponentially
    distributed with parameter mu; if this variable fires, the site
    is deleted. Third, to the right of every nucleotide is a random
    variable I, exponentially distributed with parameter lambda < mu;
    if I fires, a nucleotide is chosen at random and inserted in the
    location corresponding to I. There is also a similar random
    variable to the left of the complete sequence. See
    http://www.math.canterbury.ac.nz/~mathmas/ for recent work on
    this model.

- Bayesian inference. In a review of a new book, "Phylogenetic trees
  made easy: a how-to manual for molecular biologists" by Barry G.
  Hall (Sinauer, 2001), it was noted that Bayesian inference is a new
  method for constructing phylogenetic trees that is rapidly gaining
  popularity. The method calculates a so-called posterior probability
  of candidate trees, which depends on the joint probabilities of the
  tree, branch lengths, and the model of substitution. This method
  allows for the possibility of including prior information, such as
  the common origin of a group of species.

- Algorithms for other data types. As an example, there is increasing
  interest in using gene order data for phylogenetic tree
  construction. In a chromosome, genes have order and orientation.
  Inversions cause the orientation of a contiguous subsequence of the
  genes to be reversed.
  Transpositions cause a contiguous subsequence of genes to be
  removed from their current location and inserted elsewhere in the
  chromosome. These and other operations on genes happen as organisms
  evolve, and provide clues as to the phylogenetic ordering of
  certain organisms (plant chloroplasts and animal mitochondria, for
  example).

On a completely different note, phylogenetic tree construction
algorithms have been applied to construction of language trees.
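The inversion and transposition operations on gene orders described
above can be illustrated on a signed permutation, the usual encoding
in which the sign records a gene's orientation (this sketch and its
function names are mine, not from any particular paper):

```python
def inversion(genes, i, j):
    """Reverse the block genes[i:j] and flip each gene's
    orientation (its sign)."""
    return genes[:i] + [-g for g in reversed(genes[i:j])] + genes[j:]

def transposition(genes, i, j, k):
    """Remove the block genes[i:j] and reinsert it before position k
    of the original list (k must lie outside the block)."""
    block = genes[i:j]
    rest = genes[:i] + genes[j:]
    k2 = k if k <= i else k - (j - i)   # adjust for the removed block
    return rest[:k2] + block + rest[k2:]

order = [1, 2, 3, 4, 5]                 # a signed gene order
assert inversion(order, 1, 4) == [1, -4, -3, -2, 5]
assert transposition(order, 1, 3, 5) == [1, 4, 5, 2, 3]
```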