Sequence Similarity

Tests two sequences to find similarities. Used to:

Global Pairwise Sequence Alignment Problem

Given two strings x = x1 x2 ... xn and y = y1 y2 ... yn, find an optimal alignment of x and y.

Alignment

An alignment models the changes that molecules undergo as they evolve: substitutions, and insertions or deletions, which appear as gaps (written "-"). For example,

HEAGAWGHE-E

--P-AW-HEAE

Formally, an alignment of x and y is a pair x', y' where

We assign a score to an alignment (x', y') additively: the sum of the scores of the non-gap pairs (xi', yi') plus the scores of the regions containing gaps. Score matrices assign the scores of the non-gap pairs.
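The additive scoring rule can be sketched in a few lines of Python. This is only an illustration: the function name score_alignment, the callables pair_score and gap_score, and the toy match/mismatch and gap values are assumptions and do not come from these notes. A gap region is taken here to be a maximal run of consecutive gap characters in one of the two aligned strings.

```python
def score_alignment(xa, ya, pair_score, gap_score):
    """Score an alignment (xa, ya) additively.

    pair_score(a, b) scores one non-gap column; gap_score(g) scores one
    gap region of length g. Both are supplied by the caller.
    """
    if len(xa) != len(ya):
        raise ValueError("aligned strings must have equal length")
    total = 0
    run_len = 0      # length of the gap region currently being read
    run_side = None  # which string holds the gap: "x", "y", or None
    for a, b in zip(xa, ya):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        side = "x" if a == "-" else "y" if b == "-" else None
        if side != run_side and run_len:
            total += gap_score(run_len)  # the previous gap region ended
            run_len = 0
        if side is None:
            total += pair_score(a, b)
        else:
            run_len += 1
        run_side = side
    if run_len:
        total += gap_score(run_len)  # gap region at the very end
    return total


# Toy scoring (assumed values): +1 match, -1 mismatch, -2 per gap position.
def toy_pair(a, b):
    return 1 if a == b else -1


print(score_alignment("HEAGAWGHE-E", "--P-AW-HEAE", toy_pair, lambda g: -2 * g))  # -6
```

In practice pair_score would look up a real score matrix rather than the toy match/mismatch rule used here.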

Developing Score Matrices

Matrices are derived according to the following probabilistic interpretation:

Assigning Scores to Gaps

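Linear gap scoring system with a gap length g: gamma(g) = -dg, where d is the penalty per gap position.

Affine gap scoring system with a gap length g: gamma(g) = -d - e(g-1), where d is the gap-open penalty and e is the gap-extension penalty. Usually e < d, so extending an existing gap costs less than opening a new one.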
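As a small illustration, the two gap scoring systems can be written as Python functions. The function names and the penalty values in the example calls are assumptions chosen for the sketch, not values from these notes.

```python
def linear_gap(g, d):
    """gamma(g) = -d*g: every gap position costs d."""
    return -d * g


def affine_gap(g, d, e):
    """gamma(g) = -d - e*(g-1): d opens the gap, e extends it."""
    return -d - e * (g - 1)


# Assumed penalties, for illustration only.
for g in (1, 2, 3):
    print(g, linear_gap(g, d=8), affine_gap(g, d=12, e=2))
```

With e = d the affine score reduces to the linear one, so the linear system is a special case of the affine system.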

Algorithms for Sequence Similarity

Given x, y both of length n, how many alignments are there? The number grows exponentially: it is (2n choose n) = (2n)! / (n!)^2 ~= 2^(2n) / sqrt(pi n), so enumerating every alignment is infeasible for all but very short sequences.
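To get a feel for the growth, the count 2n choose n can be computed exactly and compared with the Stirling estimate 4^n / sqrt(pi n). This is a minimal sketch; the function names are illustrative, and math.comb requires Python 3.8 or later.

```python
import math


def count_alignments(n):
    """Exact value of 2n choose n."""
    return math.comb(2 * n, n)


def stirling_estimate(n):
    """Asymptotic estimate 4^n / sqrt(pi * n) of 2n choose n."""
    return 4 ** n / math.sqrt(math.pi * n)


for n in (5, 10, 20, 50):
    exact = count_alignments(n)
    print(n, exact, round(stirling_estimate(n) / exact, 4))
```

The printed ratio approaches 1 as n grows, while the count itself roughly quadruples with every added position.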

Dynamic Programming Approach

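This method takes O(n^2) time and space for two sequences of length n, because it fills in a table with one entry per pair of prefixes instead of enumerating alignments.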

The optimal alignment of x, y up to the ith and jth position, respectively, must end in one of three ways: xi is aligned to yj, xi is aligned to a gap, or yj is aligned to a gap.

So the optimal score is defined as:
F(i, j) = max { F(i-1, j-1) + s(xi, yj), F(i-1, j) - d, F(i, j-1) - d }

F(i, j) is the optimal score for aligning the prefixes x1 x2 ... xi and y1 y2 ... yj.

The base cases are F(0, 0) = 0, F(i, 0) = -di (since x1 ... xi must be aligned to i gaps) and F(0, j) = -dj.
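The recurrence and base cases can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the pseudo-code referred to in these notes: it uses the linear gap penalty d, takes the substitution score s as a caller-supplied function, and recovers one optimal alignment by tracing back through the table. The function name and the toy scores in the example are illustrative.

```python
def needleman_wunsch(x, y, s, d):
    """Global alignment with linear gap penalty d.

    Returns (optimal score, aligned x, aligned y).
    """
    n, m = len(x), len(y)
    # F[i][j] = optimal score for aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -d * i  # x[:i] aligned entirely to gaps
    for j in range(1, m + 1):
        F[0][j] = -d * j  # y[:j] aligned entirely to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # xi aligned to yj
                F[i - 1][j] - d,                          # xi aligned to a gap
                F[i][j - 1] - d,                          # yj aligned to a gap
            )
    # Traceback: walk from F[n][m] back to F[0][0], re-checking which
    # case produced each entry (ties are broken in favor of the diagonal).
    xa, ya = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            xa.append(x[i - 1])
            ya.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            xa.append(x[i - 1])
            ya.append("-")
            i -= 1
        else:
            xa.append("-")
            ya.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(xa)), "".join(reversed(ya))


# Toy scores (assumed): +1 match, -1 mismatch, gap penalty d = 2.
def toy_s(a, b):
    return 1 if a == b else -1


print(needleman_wunsch("ACGT", "AGT", toy_s, 2))  # (1, 'ACGT', 'A-GT')
```

Note that Python strings are 0-indexed, so xi in the recurrence corresponds to x[i - 1] in the code. When several alignments share the optimal score, this traceback returns only one of them.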

See Biological Sequence Analysis by Durbin, Eddy, Krogh and Mitchison, page 21, for a diagram of how to fill in the table of F values and how to recover the optimal alignment. See also Anne's pseudo-code for the method.