Notes from Feb 4 class on Maximum Likelihood
Anne Condon

Maximum likelihood
------------------

This method takes a statistical approach to tree construction, given
sequence data. It addresses limitations of previous methods:
distance-based methods don't use complete information about sequence
composition, only distances, and parsimony does not work well in the
face of long evolutionary distances, where homoplasies (sites that
mutate and then revert to their original value) can occur.

The idea is intuitive: we assume that the data (the sequences at the
leaves of the tree) was generated at the leaves of some tree T
according to some probabilistic model M. The model M determines how
sequences evolve from a node to its child in the tree. We would like
to find the tree T that best explains the data with respect to M.

More specifically, consider a tree T that has distances (weights) on
its edges and no labels on internal nodes, except that data D labels
the leaves. The likelihood of this tree is the probability that the
tree generates the data at the leaves, under the probabilistic model.
We denote this probability by P[D|T,M]. We use L(T) to denote the
likelihood of tree T, i.e. L(T) = P[D|T,M].

[This approach circumvents the problem of defining a general model of
evolution that accounts for the process of branching. Such a model
would have to account for speciation and extinction processes, as
well as the process by which the species under study have been
selected from among those potentially available. This seems
impossible to model adequately. With the general model, one could
obtain prior probabilities on the form of the trees, but with the
maximum likelihood approach this is ignored; the tree is the unknown
entity to be estimated.]

Model of evolution
------------------

In order to compute the likelihood of a tree, our model of evolution
must tell us the probability that a sequence S1 changes to S2 during
evolution along an edge of the tree with evolutionary distance t.
We'll assume a very simple model M:

 (i) Each site evolves independently; there are no insertion or
     deletion events.

     The independence assumption implies that P[D|T,M] is the product
     of P[Di|T,M], where Di is the data at site i. So, let's focus on
     a single site.

 (ii) Two branches of the tree evolve independently.

With these assumptions we can calculate the likelihood of a tree if
we know P[y|x,t]: the probability that a site which initially has
value x will have value y after t units of time have elapsed. We'll
come back to how P[y|x,t] is calculated later, and first show how it
can be used to calculate the likelihood of a tree.

Likelihood of a given tree
--------------------------

As a step towards the larger goal of finding the maximum likelihood
tree, let's first consider the problem of calculating the likelihood
of a given tree T. With the assumptions above, we can write an
expression for this likelihood. An example will make this clear:

Example tree T:

          .
       t1/ \t2
        .   C
     t3/ \t4
      A   C

First suppose for a moment that the internal nodes have labels:

          r
       t1/ \t2
        v   C
     t3/ \t4
      A   C

The likelihood of this labeled tree (i.e. Pr(data, including internal
labels | tree)) is:

    P[v|r,t1] P[C|r,t2] P[A|v,t3] P[C|v,t4]

To get an expression for the likelihood of the tree T (which has no
labels on the internal nodes), we assume some initial distribution
{P(r)} of characters at the root, and then sum over all r and v:

    sum_r sum_v P(r) P[v|r,t1] P[C|r,t2] P[A|v,t3] P[C|v,t4]

[Note that we are using assumption (ii) of the model here.]

In general, the number of terms in the sum is exponential in the
number of nodes in the tree. Luckily there is an efficient method to
calculate this sum. To describe this method, we use the following
notation: let L(T|r) be the likelihood of tree T when the value of
the site at the root is r. Then

    L(T) = sum_r P(r) L(T|r).

Moreover, L(T|r) can be expressed as the product of two terms.
Let t' and t'' be the distances to the children of the root of T, and
let these children be the roots of subtrees T' and T''. Then

    L(T|r) = (sum_v P[v|r,t'] L(T'|v)) (sum_v P[v|r,t''] L(T''|v)).

If T is simply a leaf, then L(T|r) = 1 if r is the label of the leaf,
and 0 otherwise.

Going back to our example, we can rearrange the terms and the
summations to match the structure of the recurrences above. (How?)
The quantities L(T'|r) can thus be calculated in a bottom-up fashion
for all subtrees T' and all four values of r, resulting in an
algorithm for calculating the likelihood of the tree that runs in
time linear in the size of the tree.

The Jukes-Cantor Model
----------------------

We now return to our model of evolution, which specifies how a
character x can evolve to y in time t. The simplest model, and one
which is widely used, is the Jukes-Cantor model. This model has an
associated parameter m, which is the rate of base substitutions
(i.e. the expected number per unit time). A base substitution happens
when a base is replaced by a *distinct* new base. (Note that I stated
this differently in class, but prefer this definition because it is
consistent with other usages.) If dt is a small amount of time, then
the probability of a mutation from x to a particular distinct base y
in time dt is m(dt)/3, and m(dt) is the probability of some base
substitution.

Let p(t) denote the probability that the base does NOT change in t
units of time. We can express p() recursively as:

    p(t) = (1 - m(dt)) p(t-dt) + (m(dt)/3)(1 - p(t-dt)),    p(0) = 1.

Solving this equation (the iteration method will work), we get:

    p(t) = 3/4 (1 - 4m(dt)/3)^{t/dt} + 1/4.

Thus, the probability that the base changes in t units of time,
namely 1 - p(t), is

    3/4 - 3/4 (1 - 4m(dt)/3)^{t/dt}.

The probability of a particular base change, e.g. A -> T, is one
third of this:

    1/4 - 1/4 (1 - 4m(dt)/3)^{t/dt}.

Taking the limit of this as dt goes to 0, we get:

    1/4 - 1/4 e^{-4mt/3}.

(This uses the fact that lim_{n -> infty} (1 - 1/n)^n = e^{-1}.)
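The closed forms just derived can be sketched in a few lines of
Python (the function name and the sanity checks are mine, not part of
the notes):

```python
from math import exp

def jc_transition_prob(x, y, t, m=1.0):
    """Jukes-Cantor P[y|x,t]: probability that a site with value x
    has value y after t units of time, with substitution rate m."""
    if x == y:
        # p(t): probability the base is unchanged after time t
        return 0.25 + 0.75 * exp(-4.0 * m * t / 3.0)
    # probability of one particular change, e.g. A -> T
    return 0.25 - 0.25 * exp(-4.0 * m * t / 3.0)

# Sanity checks: at t = 0 the base surely stays put, the four outcomes
# always sum to 1, and as t grows each outcome approaches 1/4.
assert jc_transition_prob("A", "A", 0.0) == 1.0
assert abs(sum(jc_transition_prob("A", y, 0.7) for y in "ACGT") - 1.0) < 1e-12
```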
Therefore, for y distinct from x,

    P[y|x,t] = (1 - e^{-4mt/3})/4,

and P[x|x,t] = p(t) = (1 + 3e^{-4mt/3})/4. Since these probabilities
depend only on the product mt, one can adopt the convention that
m = 1, so that t measures the expected number of substitutions. We
can then think of the edge distances as time measured on a molecular
clock which may run at different rates in different segments of the
tree.

The Pulley Principle
--------------------

Although the likelihood estimation algorithm appears to produce a
rooted tree, it turns out that the specific location of the root does
not change the likelihood of the tree. First, it is possible to show
that if the root n has children n' and n'', with distances t' and t''
to these children, then subtracting an amount x from t' (while
keeping it positive) and adding it to t'' yields a tree with the same
likelihood. Felsenstein calls this the Pulley Principle. (Prove it!)
More generally, if you turn the rooted tree into an unrooted one (by
removing n and placing distance t'+t'' between n' and n''), and you
add a root anywhere along one of the edges of the resulting unrooted
tree (dividing that edge and its distance into two), then you get a
tree with the same likelihood.

Finding optimum edge distances for a given tree T
-------------------------------------------------

We can exploit the pulley principle to find optimum edge distances in
practice, in the following way: set some initial values for the edge
distances. Fix all but one of them, and find the optimal value of
that one, relative to all of the others. Repeat for each edge in
turn, until the distances converge. The distances found in this way
could be local optima, but in practice the method appears to reliably
find global optima.
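Putting the pieces together, here is a minimal Python sketch (the
names and the tuple encoding of trees are my own) of the bottom-up
likelihood computation with the Jukes-Cantor model (taking m = 1 and
a uniform root distribution), including a numerical check of the
pulley principle on the example tree:

```python
from math import exp

BASES = "ACGT"

def jc(x, y, t):
    """Jukes-Cantor P[y|x,t] with rate m = 1 (the convention above)."""
    same = 0.25 + 0.75 * exp(-4.0 * t / 3.0)
    return same if x == y else (1.0 - same) / 3.0

def cond_likelihood(tree):
    """Return {r: L(T|r)}. A tree is either a leaf label ('A', 'C',
    'G' or 'T') or a tuple (left, t_left, right, t_right)."""
    if isinstance(tree, str):                      # leaf case
        return {r: 1.0 if r == tree else 0.0 for r in BASES}
    left, tl, right, tr = tree
    Ll, Lr = cond_likelihood(left), cond_likelihood(right)
    # L(T|r) = (sum_v P[v|r,t'] L(T'|v)) (sum_v P[v|r,t''] L(T''|v))
    return {r: sum(jc(r, v, tl) * Ll[v] for v in BASES) *
               sum(jc(r, v, tr) * Lr[v] for v in BASES)
            for r in BASES}

def likelihood(tree):
    """L(T) = sum_r P(r) L(T|r), with uniform root distribution."""
    return sum(0.25 * lr for lr in cond_likelihood(tree).values())

# The example tree from the notes, with some made-up edge distances.
t = (("A", 0.3, "C", 0.4), 0.1, "C", 0.2)

# Pulley principle: sliding the root along the edge between its
# children (keeping t1 + t2 = 0.3 fixed) leaves L(T) unchanged.
t_slid = (("A", 0.3, "C", 0.4), 0.25, "C", 0.05)
assert abs(likelihood(t) - likelihood(t_slid)) < 1e-12
```

Because each subtree's table of four values is computed once, the
running time is linear in the size of the tree, as claimed above.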
To find the optimal value of one edge distance, t, relative to the
others, it is possible (by application of the pulley principle) to
express the optimal value as the solution of a fixed-point equation,
t = f(t), where f involves other (constant) terms based on likelihood
values for other parts of the tree. Starting with an initial value
t_0 and iterating t_k = f(t_{k-1}), one can usually quickly converge
on the right value of t.

Finding the maximum likelihood tree
-----------------------------------

Felsenstein's original algorithm used the stepwise addition heuristic
for exploring the space of trees. I don't know how it is currently
done. Could be interesting to look at.

Extensions
----------

The method, as described above, can easily be extended in the
following ways:

- Ambiguity in the original data: at some site, the character could
  be, e.g., adenine or guanine (a purine), but you are not sure
  which. Simply set the likelihood of either value to 1 at the leaf.

- Use three different substitution rates at the three positions of a
  codon.

- Generalizations of the Jukes-Cantor model, in which substitution
  probabilities are not necessarily equal. For example, the Kimura
  2-parameter model uses different probabilities for transversions
  versus transitions, while keeping each nucleotide equally frequent.

However, there does not appear to be a practical generalization of
the maximum likelihood algorithm that handles gaps in alignment data.
Columns in alignment data which contain gaps are apparently typically
omitted from analysis.

-------------------------------------------------------------------------

Current research directions in phylogeny
----------------------------------------

All of the algorithms we have described have been around for decades,
along with many, many more approaches. What are the current
challenges in phylogeny? Here is my take, based on somewhat limited
knowledge.

- Improved algorithms for large data sets.
  With the rapid rate at which new sequence data is becoming
  available, there is a need for phylogenetic analysis of larger data
  sets across longer evolutionary distances than ever before.
  Experimental analysis of current methods indicates that the fast,
  distance-based methods are not sufficiently accurate on data sets
  with large evolutionary distances, and methods based on NP-hard
  optimization problems, such as MP or ML, are too slow on data sets
  with large numbers of taxa. Faster and better methods would have
  important applications.

  o Statistical and algorithmic analysis can shed light on when
    various methods are expected to perform well or poorly. For
    example, one would expect that in order to accurately reconstruct
    a tree on a given number of taxa, the length of the sequences
    needed would grow as the maximum evolutionary distance between
    the sequences increases. A recent result of Erdos et al. (1999)
    shows that under probabilistic models such as Jukes-Cantor, the
    sequence length that suffices to guarantee accuracy of standard
    polynomial time algorithms is exponential in the maximum
    evolutionary distance in the data set. This is supported by
    experimental evidence.

  o Such insights can lead to improved algorithms. For example, a
    divide and conquer approach that groups taxa into closely related
    subgroups is well motivated by the analysis cited above. There
    may be good biological reasons for this too; different parts of a
    sequence may be appropriate for phylogenetic analysis of
    different subspecies. See the work of Tandy Warnow et al.,
    http://www.cs.utexas.edu/users/tandy/, for examples of this type
    of work.

- Algorithms for improved models of evolution.

  o Ribosomal RNA is used for reconstruction of trees involving
    organisms which are evolutionarily distant, because it is a
    universal molecule and highly conserved.
    Since the structure of this molecule is a large determinant of
    its function, and the structure is determined by pairing of
    complementary bases, one would not expect all sites to evolve
    independently. Can improved analyses be obtained by taking into
    account pairwise correlations between bases?

  o Modeling gaps. Thorne, Kishino, and Felsenstein (1991) proposed a
    model of evolution with insertions and deletions (indels). This
    model uses three classes of variables. First, a model of
    substitution that does not allow for indels. Second, associated
    with each nucleotide is a random variable D that is exponentially
    distributed with parameter mu; if this variable fires, the site
    is deleted. Third, to the right of every nucleotide is a random
    variable I, exponentially distributed with parameter lambda < mu;
    if I fires, a nucleotide is chosen at random and inserted in the
    location corresponding to I. There is also a similar random
    variable to the left of the complete sequence. See
    http://www.math.canterbury.ac.nz/~mathmas/ for recent work on
    this model.

- Bayesian inference. In a review of a new book, "Phylogenetic trees
  made easy: a how-to manual for molecular biologists" by Barry G.
  Hall (Sinauer, 2001), it was noted that Bayesian inference is a new
  method for constructing phylogenetic trees that is rapidly gaining
  popularity. The method calculates a so-called posterior probability
  of candidate trees, which depends on the joint probabilities of the
  tree, branch lengths, and the model of substitution. This method
  allows for the possibility of including prior information, such as
  the common origin of a group of species.

- Algorithms for other data types. As an example, there is increasing
  interest in using gene order data for phylogenetic tree
  construction. In a chromosome, genes have order and orientation.
  Inversions cause the orientation of a contiguous subsequence of the
  genes to be reversed.
  Transpositions cause a contiguous subsequence of genes to be
  removed from their current location and inserted elsewhere in the
  chromosome. These and other operations on genes happen as organisms
  evolve, and provide clues as to the phylogenetic ordering of
  certain organisms (plant chloroplasts and animal mitochondria, for
  example).

On a completely different note, phylogenetic tree construction
algorithms have been applied to construction of language trees.
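The inversion and transposition operations on gene orders described
above can be illustrated on a signed permutation, the usual encoding
in which the sign records a gene's orientation (this sketch and its
function names are mine, not from any particular paper):

```python
def inversion(genes, i, j):
    """Reverse the block genes[i:j] and flip each gene's
    orientation (its sign)."""
    return genes[:i] + [-g for g in reversed(genes[i:j])] + genes[j:]

def transposition(genes, i, j, k):
    """Remove the block genes[i:j] and reinsert it before position k
    of the original list (k must lie outside the block)."""
    block = genes[i:j]
    rest = genes[:i] + genes[j:]
    k2 = k if k <= i else k - (j - i)   # adjust for the removed block
    return rest[:k2] + block + rest[k2:]

order = [1, 2, 3, 4, 5]                 # a signed gene order
assert inversion(order, 1, 4) == [1, -4, -3, -2, 5]
assert transposition(order, 1, 3, 5) == [1, 4, 5, 2, 3]
```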