# Sequence Similarity

Tests two sequences to find similarities. Used to:
• Phylogenetic tree reconstruction
• Similar sequences => similar structure => similar function

## Global Pairwise Sequence Alignment Problem

Given strings x, y : x = x1 x2 ... xn and y = y1 y2 ... yn find an optimal alignment of x and y.

### Alignment

Based on changes that happen to molecules as they evolve. (i.e. substitutions, gaps). For example,

HEAGAWGHE-E

--P-AW-HEAE

Formally, an alignment of x and y is a pair x', y' where

• x' with gaps removed is x, similarly for y', y
• |x'| = |y'|
• xi' = yi' = gap never happens

We assign a score to an alignment (x', y') additively: sum of scores of non-gap (xi', yi') pairs + scores for regions containing gaps. Score matrices are used to assign scores to non-gapped pairs.

### Developing Score Matrices

Matrices are derived according to the following probabilistic interpretation:
• Assume no gaps
• Want score assigned to x',y' to be measure of likelyhood that x', y' are related
• We consider a score for x', y' relative to a random model and a match model
• Random model: assume each symbol a occurs with probability q(a).
P(x', y'|R) = product q(xi') * product q(yi')
• Match model: assigns probablity to pairs (a, b) of symbols.
P(x', y')|M) = product p(xi', yi')
• Take the odds ratio:
P(x', y')|M)/P(x', y')|R)
• Log odds ratio:
sum s(xi', yi')

### Assigning Scores to Gaps

Linear gap scoring system with a gap length g: gamma(g) = -dg.

Affine gap scoring system with a gap length g: gamma(g) = -d - e(g-1).

## Algorithms for Sequence Similarity

Given x, y both of length n, how many alignments are there? The number grows exponentially! (actually, = 2n choose n ~= 2^2n / sqrt(2 pi n)).

### Dynamic Programming Approach

This method is O(n^2).

The optimal alignment of x, y up to the ith and jth position, respectively, looks like one of the following:

• (optimal alignment of x1 ... xi-1, y1 ... yj-1) and xi matched with yj
• (optimal alignment of x1 ... xi-1, y1 ... yj ) and xi matched with a gap
• (optimal alignment of x1 ... xi , y1 ... yj-1) and yj matched with a gap

So the optimal score is defined as:
F(i, j) = max { F(i-1, j-1) + s(xi, yi), F(i-1, j ) - d, F( i , j-1) - d }

F is the optimal score for any prefix x1 x2 ... xi, y1 y2 ... yj.

The base cases occur at F(i, 0) = -di (since we must match with i gaps) and F(0, j) = -dj.

See Biological Sequence Analysis by Durbin, Eddy, Krogh and Mitchison, page 21 for diagram of how to fill in the table of F values and how to retrieve the optimal sequence. See also Anne's pseudo-code for the method.