# Sequence Similarity

Sequence similarity analysis compares two sequences to find similarities between them. It is used for:
- Phylogenetic tree reconstruction
- Inferring function: similar sequences => similar structure => similar function

## Global Pairwise Sequence Alignment Problem

Given strings *x = x1 x2 ... xn* and *y = y1 y2 ... yn*, find an __optimal alignment__ of *x* and *y*.

### Alignment

An alignment is based on the changes that happen to molecules as they evolve (i.e. substitutions, and insertions or deletions, which show up as gaps).
For example:

    HEAGAWGHE-E
    --P-AW-HEAE

Formally, an alignment of *x* and *y* is a pair (*x', y'*) where:

- *x'* with gaps removed is *x*, and similarly *y'* with gaps removed is *y*
- |*x'*| = |*y'*|
- *xi' = yi' = gap* never happens
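The three conditions above can be checked mechanically. The following is a minimal sketch, assuming gaps are written as `-` and sequences are plain Python strings; the function name `is_valid_alignment` is made up for this illustration.

```python
GAP = "-"  # assumed gap symbol, as in the example alignment


def is_valid_alignment(x, y, x_aligned, y_aligned):
    """Return True if (x_aligned, y_aligned) is an alignment of x and y."""
    # Condition 1: removing gaps must give back the original sequences.
    if x_aligned.replace(GAP, "") != x or y_aligned.replace(GAP, "") != y:
        return False
    # Condition 2: both rows must have the same length.
    if len(x_aligned) != len(y_aligned):
        return False
    # Condition 3: no column may hold a gap in both rows.
    return all(not (a == GAP and b == GAP)
               for a, b in zip(x_aligned, y_aligned))


print(is_valid_alignment("HEAGAWGHEE", "PAWHEAE",
                         "HEAGAWGHE-E", "--P-AW-HEAE"))
```

Here the example alignment shown earlier is checked against its two ungapped sequences.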

We assign a score to an alignment (*x', y'*) additively: sum of scores
of non-gap (*xi', yi'*) pairs + scores for regions containing gaps.
Score matrices are used to assign scores to non-gapped pairs.

### Developing Score Matrices

Matrices are derived according to the following probabilistic interpretation:
- Assume no gaps.
- We want the __score__ assigned to (*x', y'*) to be a measure of the likelihood that *x'* and *y'* are related.
- We consider a score for (*x', y'*) relative to a __random__ model *R* and a __match__ model *M*.
- Random model: assume each symbol *a* occurs independently with probability *q(a)*.

  *P(x', y' | R) = product q(xi') × product q(yi')*
- Match model: assigns a probability *p(a, b)* to each pair (*a, b*) of symbols.

  *P(x', y' | M) = product p(xi', yi')*
- Take the odds ratio:

  *P(x', y' | M) / P(x', y' | R) = product [ p(xi', yi') / (q(xi') q(yi')) ]*
- Log-odds ratio:

  *sum s(xi', yi')*, where *s(a, b) = log [ p(a, b) / (q(a) q(b)) ]*
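To make the log-odds interpretation concrete, the sketch below computes pair scores from a toy random model and match model and confirms that the summed scores equal the log of the odds ratio. The two-letter alphabet and all probability values are hypothetical and chosen only for illustration; real score matrices are estimated from data.

```python
import math

# Hypothetical toy models over a two-letter alphabet (illustration only).
q = {"A": 0.5, "B": 0.5}                    # random model: q(a)
p = {("A", "A"): 0.4, ("B", "B"): 0.4,
     ("A", "B"): 0.1, ("B", "A"): 0.1}      # match model: p(a, b)


def s(a, b):
    """Log-odds score of one aligned pair: log(p(a, b) / (q(a) q(b)))."""
    return math.log(p[(a, b)] / (q[a] * q[b]))


def log_odds(x_aligned, y_aligned):
    """Additive score of a gap-free alignment: sum of per-pair scores."""
    return sum(s(a, b) for a, b in zip(x_aligned, y_aligned))


def odds_ratio(x_aligned, y_aligned):
    """P(x', y' | M) / P(x', y' | R) computed directly from the two models."""
    p_match = math.prod(p[(a, b)] for a, b in zip(x_aligned, y_aligned))
    p_random = (math.prod(q[a] for a in x_aligned)
                * math.prod(q[b] for b in y_aligned))
    return p_match / p_random


print(log_odds("AAB", "ABB"), math.log(odds_ratio("AAB", "ABB")))
```

With these toy values, identical pairs get a positive score and mismatched pairs a negative one, which is the behaviour one wants from a score matrix.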

### Assigning Scores to Gaps

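- Linear gap scoring system for a gap of length *g*: *gamma(g) = -dg*, where *d* is the penalty per gap position.
- Affine gap scoring system for a gap of length *g*: *gamma(g) = -d - e(g-1)*, where *d* is the gap-open penalty and *e* is the gap-extension penalty.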
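The two gap functions, and the additive scoring of a whole alignment, can be sketched as follows. The default penalty values, the `simple_s` match/mismatch scores, and the helper names are assumptions made for this example, not values prescribed by these notes.

```python
from itertools import groupby

GAP = "-"  # assumed gap symbol


def gamma_linear(g, d=8):
    """Linear gap score: each of the g gap positions costs d."""
    return -d * g


def gamma_affine(g, d=12, e=2):
    """Affine gap score: d to open the gap, e for each further position."""
    return -d - e * (g - 1)


def simple_s(a, b):
    """Hypothetical substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def score_alignment(x_aligned, y_aligned, s, gamma):
    """Sum pair scores over gap-free columns plus gamma(g) per gap run."""
    def kind(col):
        a, b = col
        return "gap_in_x" if a == GAP else "gap_in_y" if b == GAP else "pair"

    total = 0
    # groupby yields maximal runs of consecutive columns of the same kind.
    for k, run in groupby(zip(x_aligned, y_aligned), key=kind):
        run = list(run)
        if k == "pair":
            total += sum(s(a, b) for a, b in run)
        else:
            total += gamma(len(run))
    return total


print(gamma_linear(3), gamma_affine(3))  # -24 -16
print(score_alignment("HEAGAWGHE-E", "--P-AW-HEAE", simple_s, gamma_linear))
```

Under the linear scheme a gap of length 3 costs as much as three separate gaps of length 1, while the affine scheme makes one long gap cheaper than several short ones.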

## Algorithms for Sequence Similarity

Given *x, y* both of length *n*, how many alignments are there?
The number grows exponentially, so enumerating every alignment is infeasible
(by Stirling's formula, it is *2n choose n ~= 2^2n / sqrt(pi n)*).
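A quick numerical check shows how fast this count grows. The sketch below compares the exact binomial coefficient with the Stirling estimate *2^2n / sqrt(pi n)*; the function names are made up for this example.

```python
import math


def central_binomial(n):
    """Exact value of (2n choose n)."""
    return math.comb(2 * n, n)


def stirling_estimate(n):
    """Stirling approximation of (2n choose n): 2^(2n) / sqrt(pi * n)."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)


for n in (5, 10, 50, 100):
    exact = central_binomial(n)
    print(n, f"{exact:.3e}", f"{stirling_estimate(n):.3e}")
```

Even for two sequences of length 100 the count is far beyond what could be enumerated, which motivates the dynamic programming approach.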
### Dynamic Programming Approach

This method is *O(n^2)*.
The optimal alignment of *x, y* up to the *i*th and *j*th position, respectively, looks like one of the following:

- (optimal alignment of *x1 ... xi-1, y1 ... yj-1*) and *xi* matched with *yj*
- (optimal alignment of *x1 ... xi-1, y1 ... yj*) and *xi* matched with a gap
- (optimal alignment of *x1 ... xi, y1 ... yj-1*) and *yj* matched with a gap

So the optimal score is defined as:

*F(i, j) = max { F(i-1, j-1) + s(xi, yj), F(i-1, j) - d, F(i, j-1) - d }*

*F(i, j)* is the optimal score of an alignment of the prefixes
*x1 x2 ... xi* and *y1 y2 ... yj*.

The base cases are *F(0, 0) = 0*, *F(i, 0) = -di* (since we must match
*x1 ... xi* with *i* gaps) and *F(0, j) = -dj*.

See *Biological Sequence Analysis* by Durbin, Eddy, Krogh and
Mitchison, page 21 for a diagram of how to fill in the table of *F*
values and how to retrieve the optimal alignment. See also Anne's pseudo-code
for the method.
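As a rough illustration of the recurrence and base cases, here is a minimal Python sketch of the table fill and traceback with a linear gap penalty. It is not the pseudo-code referenced above; the function name, the `simple_s` scoring function and the penalty *d = 2* are assumptions made for this example.

```python
def needleman_wunsch(x, y, s, d):
    """Global alignment with linear gap penalty d.

    Returns (optimal score, aligned x, aligned y).
    """
    n, m = len(x), len(y)
    # F[i][j] = optimal score for aligning x[:i] with y[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -d * i          # x[:i] aligned against i gaps
    for j in range(1, m + 1):
        F[0][j] = -d * j          # y[:j] aligned against j gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # xi matched with yj
                F[i - 1][j] - d,                          # xi matched with a gap
                F[i][j - 1] - d,                          # yj matched with a gap
            )
    # Traceback: walk from (n, m) to (0, 0), re-deriving which case gave F[i][j].
    xa, ya = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            xa.append(x[i - 1]); ya.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            xa.append(x[i - 1]); ya.append("-"); i -= 1
        else:
            xa.append("-"); ya.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(xa)), "".join(reversed(ya))


def simple_s(a, b):
    """Hypothetical substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


print(needleman_wunsch("ABC", "AC", simple_s, 2))  # (0, 'ABC', 'A-C')
```

The table has *(n+1)(m+1)* cells and each is filled in constant time, which is where the quadratic running time comes from. When several alignments tie for the optimal score, this traceback returns just one of them.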