CPSC536A  Notes for Class 7  Multiple Sequence Alignment
Jan 25, 2001
Index
0. Some notes
1. Introduction to Multiple Sequence Alignment
1.1 What is multiple sequence alignment?
1.2 Why do we need it? Motivations
1.3 An example
1.4 Definition: Global Multiple Sequence Alignment
2. What is a good alignment? Scoring Systems for MSA
2.1 Some assumptions
2.2 The ideal solution (and why we can't use it)
2.3 The practical solution: The SP score
2.3.1 Assumptions
2.3.2 Definition: The Sum of Pairs score (SP score)
2.3.3 Critique of the SP score
3. Constructing MSAs using the SP score
3.1 The Algorithm
3.2 Complexity
3.3 An Example
3.4 NP-completeness and how to get around it...
Next lecture, we'll be looking at...
References
0. Some notes

This is a huge topic, which will only be covered in part during our lecture.
For further reading, Holger recommends the books
[Gusfield] and [Durbin et al.].

Most of the time when we talk of sequence alignment from now on, we will
mean amino-acid sequences, i.e. proteins.
1. Introduction to Multiple Sequence Alignment
1.1 What is multiple sequence alignment?

Multiple Sequence Alignment (MSA) can be seen as a generalization of Pairwise
Sequence Alignment: instead of aligning two sequences, k sequences are
aligned simultaneously, where k is any number greater than two. (See below
for a more precise definition.)
1.2 Why do we need it? Motivations

Computer scientists take it as a challenge, simply because it's possible:
generalization cannot be a bad thing

Biologists think beyond that and wonder if it is of any practical relevance.
And in fact, it is:

MSA allows us to extract and represent biologically important but faint or
widely dispersed sequence similarities; this can give us hints about the
evolutionary history of certain sequences, for example

MSA can help us to elucidate biological facts about proteins etc., such as:

analysis of the secondary/tertiary structure of, for example, proteins

critical consensus motifs (DNA/proteins)

MSA can be seen as the inverse of Pairwise Sequence Alignment (PSA):

When an alignment is good using PSA, we usually conclude that there exists
a functional relationship between
the sequences

In MSA, we already know that there exists a functional similarity, and
want to find out where exactly it comes from
1.3 An example

see [Durbin et al.], fig. 6.1

The figure shows the alignment of ten subsequences of a particular family
of proteins
1.4 Definition: Global Multiple Sequence Alignment

A Global Multiple Sequence Alignment of N > 2 sequences
is obtained by inserting gaps ("_") into the sequences
(note: these gaps can be inserted at any position, i.e. also at the
beginning or end) such that the sequences obtained this way all have
length L and can be arranged in a matrix of N rows and L columns
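
The definition can be checked mechanically. The following is a minimal Python sketch, not part of the lecture: the function name is_global_msa and the example strings are made up for illustration, and "_" is used as the gap symbol as above.

```python
GAP = "_"

def is_global_msa(rows, sequences):
    """Return True if `rows` is a global multiple alignment of `sequences`.

    rows      -- the N gapped strings (the rows of the alignment matrix)
    sequences -- the N original, ungapped sequences, in the same order
    """
    if not rows or len(rows) != len(sequences):
        return False
    length = len(rows[0])
    # All rows must have the same length L.
    if any(len(row) != length for row in rows):
        return False
    # Removing the gaps from row i must give back sequence i unchanged.
    return all(row.replace(GAP, "") == seq for row, seq in zip(rows, sequences))


if __name__ == "__main__":
    seqs = ["ACGT", "ACGT", "ACG"]
    print(is_global_msa(["AC_GT", "A_CGT", "_ACG_"], seqs))
    print(is_global_msa(["AC_GT", "ACGT", "ACG__"], seqs))
```

The first call describes a valid 3 x 5 alignment matrix; the second fails because the rows have different lengths.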

The ability to actually determine a good alignment highly depends on how
diverged the sequences are

An effect that is often encountered in practice is that in proteins, some
regions within all sequences can be well aligned without spending
much effort, while other regions can't be meaningfully aligned at all

A plausible explanation for this is that not all residues within a protein
are important; some have little or no function at all.
Thus, there is little evolutionary pressure to conserve the structure
of these residues.

Typical sets of sequences only have a sequence similarity of about 30%
2. What is a good alignment? Scoring Systems for MSA
2.1 Some assumptions

Usually, the sequences are not independent: they have some sort
of evolutionary relationship with each other, recorded in their
phylogenetic tree (which we don't know)

Some positions within the sequences are more conserved than others.
This means that when there's a high level of similarity at one position,
"alignment candidates" for this position that deviate from the others
should be given a high penalty.
E.g. all sequences agree on N at some position, except for one, which has
P there: here, the "importance" of N seems obvious, and so P should be
given a high penalty
2.2 The ideal solution (and why we can't use it)

What we would like to have is a complete and precise model M of molecular
sequence evolution, such that given
the correct phylogenetic tree T for our set,
2.3 The practical solution: The SP score
2.3.1 Assumptions

the columns of the alignment matrix are statistically independent, and

we don't make use of phylogenetic trees (for now)
2.3.2 Definition: The Sum of Pairs score (SP score)

given:
 a substitution matrix like PAM or BLOSUM that gives us the score
s(x,y) for aligning two characters x and y
 an (N x L) MSA matrix M (N rows, L columns)

The SP score for the ith column $m_i$ of the MSA matrix
is calculated as
\[ S(m_i) = \sum_{1 \le j < k \le N} s(m_i^j, m_i^k) \]
($m_i^j$ being the jth entry in the ith column), and the score for the
whole matrix M is
\[ S(M) = \sum_{1 \le i \le L} S(m_i) \]
Simply speaking, the SP score is calculated by first adding up all possible
(pairwise alignment) scores for one column and
then summing up the scores for all columns.
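
As an illustration, here is a minimal Python sketch of the SP score. It is not the lecture's code. The toy function pair_score (match +1, mismatch -1, letter against gap -2, gap against gap 0) is an assumption standing in for a real PAM/BLOSUM matrix plus a gap penalty, and all names are made up for this example.

```python
from itertools import combinations

GAP = "_"

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Toy stand-in for s(x, y); a real application would look up PAM/BLOSUM."""
    if a == GAP and b == GAP:
        return 0          # assumption: two gaps score nothing
    if a == GAP or b == GAP:
        return gap        # assumption: fixed penalty for letter vs. gap
    return match if a == b else mismatch

def column_score(column, s=pair_score):
    """SP score of one column: sum of s over all pairs j < k."""
    return sum(s(a, b) for a, b in combinations(column, 2))

def sp_score(rows, s=pair_score):
    """SP score of the whole alignment: sum of the column scores."""
    return sum(column_score(col, s) for col in zip(*rows))


if __name__ == "__main__":
    print(column_score("NNNNP"))             # 6 N-N pairs, 4 N-P pairs -> 2
    print(sp_score(["AC_", "A_G", "ACG"]))   # 3 + (-3) + (-3) -> -3
```

Note that zip(*rows) walks the alignment column by column, which mirrors the assumption that columns are scored independently.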
2.3.3 Critique of the SP score

Problems:

There's no probabilistic justification for the SP score

Each sequence is treated as if it were directly evolutionarily related to
all other N-1 sequences, whereas in fact it is very probably only directly
related to one of them; this problem arises because we don't use a
phylogenetic tree

Nevertheless:

The SP score is easy to work with, and widely used

The results are reasonably good

Other (theoretically better founded) methods that are efficient are not
known
3. Constructing MSAs using the SP score
3.1 The Algorithm

Idea: generalize the dynamic programming approach for pairwise alignment
(Needleman-Wunsch algorithm)

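Let's do this for an alignment of three sequences:
The score $F(i,j,k)$ of an optimal alignment of the sequences x, y and z
up to positions i, j and k is calculated recursively
(same principle as for pairwise alignment) as
\[
F(i,j,k) = \max \begin{cases}
F(i-1,j-1,k-1) + S(x_i, y_j, z_k) \\
F(i-1,j-1,k) + S(x_i, y_j, \_) \\
F(i-1,j,k-1) + S(x_i, \_, z_k) \\
F(i,j-1,k-1) + S(\_, y_j, z_k) \\
F(i-1,j,k) + S(x_i, \_, \_) \\
F(i,j-1,k) + S(\_, y_j, \_) \\
F(i,j,k-1) + S(\_, \_, z_k)
\end{cases}
\]
where $S(a,b,c) = s(a,b) + s(a,c) + s(b,c)$ is the SP score of a single
column, "_" is a gap, and s(x,y) is the score we get from the
PAM/BLOSUM/... matrix (s must also assign a score to pairs involving a gap)

For three sequences, seven cases have to be considered; generally there
are $2^N - 1$ cases (N = number of sequences):
at the (current) end of each sequence, we can add either a gap or the
next character, which gives $2^N$ combinations,
minus one because adding only gaps is not allowed

for initialization, we use $F(0,0,0) = 0$; cells with an index of zero are
filled by the same recursion, using only the cases whose predecessor exists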
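
The recursion for three sequences can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the lecture's implementation: pair_score is a toy scoring function (match +1, mismatch -1, letter against gap -2, gap against gap 0) standing in for PAM/BLOSUM, the table is started with the value 0 for the empty alignment, and only the optimal score is returned (no traceback). All names are made up for this example.

```python
from itertools import product

GAP = "_"

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Toy stand-in for s(x, y); assumption, not a real substitution matrix."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch

def column_score(a, b, c):
    """SP score of one column of three characters."""
    return pair_score(a, b) + pair_score(a, c) + pair_score(b, c)

def align3_score(x, y, z):
    """Optimal SP score of a global alignment of x, y and z."""
    F = {(0, 0, 0): 0}  # empty alignment
    # The seven allowed cases: every 0/1 step vector except (0, 0, 0).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    cells = product(range(len(x) + 1), range(len(y) + 1), range(len(z) + 1))
    for i, j, k in cells:  # lexicographic order: predecessors come first
        if (i, j, k) == (0, 0, 0):
            continue
        best = None
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue  # this case would step outside the cube
            # A step of 1 consumes the next character, a step of 0 adds a gap.
            a = x[pi] if di else GAP
            b = y[pj] if dj else GAP
            c = z[pk] if dk else GAP
            candidate = F[pi, pj, pk] + column_score(a, b, c)
            if best is None or candidate > best:
                best = candidate
        F[i, j, k] = best
    return F[len(x), len(y), len(z)]


if __name__ == "__main__":
    print(align3_score("AC", "AC", "AC"))  # two perfect columns -> 6
    print(align3_score("AC", "A", "AC"))   # AC / A_ / AC -> 3 + (-3) = 0
```

Generating the moves with product((0, 1), repeat=N) instead of repeat=3 is what produces the 2^N - 1 cases in the general setting.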
3.2 Complexity
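For N sequences of length L, the above algorithm
needs $\Theta(2^N L^N)$ time, counting the scoring of one column as a
single step
(see [Gusfield] for an explanation of $\Theta$;
intuitively it means that the complexity is quite exactly $2^N L^N$)
and $\Theta(L^N)$ space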
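
To get a feeling for these numbers, here is a small Python sketch (an illustration, not from the lecture). It assumes the N-dimensional table has (L+1)^N cells and that each cell considers 2^N - 1 cases, as in section 3.1; the function name dp_work is made up.

```python
def dp_work(n_seqs, length):
    """Return (cells, case evaluations) for naive N-dimensional DP.

    Assumes n_seqs sequences that all have the given length.
    """
    cells = (length + 1) ** n_seqs   # size of the N-dimensional table
    cases = 2 ** n_seqs - 1          # predecessor cases per cell
    return cells, cells * cases


if __name__ == "__main__":
    for n in (2, 3, 5, 7):
        cells, steps = dp_work(n, 200)
        print(f"N={n}: {cells:.3e} cells, {steps:.3e} case evaluations")
```

For three sequences of length 200 this gives 8,120,601 cells and 56,844,207 case evaluations; every additional sequence multiplies the table size by another factor of 201.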
3.3 An Example: Fig. 6.3, [Durbin et al.?]
It shows you the cube you get for an alignment of three sequences.
It turns out that not all cells of this cube (and in general, of the
N-dimensional matrix) need to be computed, and the order of computation
can also be heuristically optimised. The rather sophisticated
algorithm by Carrillo and Lipman exploits these ideas and thus achieves
"slightly" improved time and space usage over "naive" N-dimensional
dynamic programming.
In 1989, Lipman, Altschul, and Kececioglu implemented a (further refined)
version of this algorithm in their program "MSA" (still in use).
MSA is practically restricted to 5-7 protein
sequences of a typical length of 200-300 residues.
3.4 NP-completeness and how to get around it...
Bad news:
The problem we are trying to solve here is intrinsically very hard:
"NP-complete", as computer scientists say. For NP-complete problems,
there is (almost) no hope of finding an algorithm whose complexity is
not exponential.

...But
Most of the time, we're happy with a close approximation to the ideal
solution which can be efficiently computed:

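Bounded Error Approximation
(see [Gusfield], Chapter 14, Section 6.2 for more)

do the alignment consistent with a tree (not necessarily a phylogenetic
tree) that relates the sequences to each other

use the so-called "centre star method"

this gives us, in polynomial time, an alignment whose SP score SP satisfies
\[ SP \le \left(2 - \frac{2}{N}\right) SP_{opt} < 2 \, SP_{opt} \]
(here the SP score is a distance to be minimised, built from pairwise
scores that obey the triangle inequality, and $SP_{opt}$ is the optimal
SP score),
which seems to be nothing to be excited about, but it works reasonably well in
practice, and typically only deviates 2-16% from the optimal SP score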

still, this method isn't good enough to be used as a standalone method,
but

the theory behind it is interesting

it can be very useful for constructing the "really good" algorithms (which
are heuristics)
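
The centre star idea can be sketched in Python as follows. This is a minimal illustration, not the lecture's or Gusfield's code. It assumes unit-cost edit distance as the pairwise score (chosen here because it is a distance that obeys the triangle inequality) and merges the pairwise alignments with the centre using the "once a gap, always a gap" rule. The names align_pair, merge and center_star are made up.

```python
GAP = "_"

def align_pair(a, b):
    """Unit-cost edit distance of a and b plus one optimal alignment."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                          D[i - 1][j] + 1, D[i][j - 1] + 1)
    ra, rb, i, j = [], [], n, m
    while i > 0 or j > 0:  # traceback from the last cell
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ra.append(a[i - 1]); rb.append(GAP); i -= 1
        else:
            ra.append(GAP); rb.append(b[j - 1]); j -= 1
    return D[n][m], "".join(reversed(ra)), "".join(reversed(rb))

def merge(msa, ca, sa):
    """Add row sa to msa; msa[0] and ca are both gapped copies of the centre."""
    cur, out, new, i, j = msa[0], [[] for _ in msa], [], 0, 0
    while i < len(cur) or j < len(ca):
        if j < len(ca) and ca[j] == GAP:       # new gap in the centre
            for row in out:
                row.append(GAP)
            new.append(sa[j]); j += 1
        elif i < len(cur) and cur[i] == GAP:   # gap that is already there
            for row, old in zip(out, msa):
                row.append(old[i])
            new.append(GAP); i += 1
        else:                                  # same centre character
            for row, old in zip(out, msa):
                row.append(old[i])
            new.append(sa[j]); i += 1; j += 1
    return ["".join(r) for r in out] + ["".join(new)]

def center_star(seqs):
    """Return aligned rows, in input order, built around the centre sequence."""
    c = min(range(len(seqs)),
            key=lambda i: sum(align_pair(seqs[i], s)[0] for s in seqs))
    msa, order = [seqs[c]], [c]
    for k, s in enumerate(seqs):
        if k != c:
            _, ca, sa = align_pair(seqs[c], s)
            msa = merge(msa, ca, sa)
            order.append(k)
    rows = [None] * len(seqs)
    for row, k in zip(msa, order):
        rows[k] = row
    return rows


if __name__ == "__main__":
    for row in center_star(["ACT", "ACGT", "AGCT"]):
        print(row)
```

Only N-1 pairwise alignments with the centre are merged (plus the pairwise distances needed to pick the centre), which is where the polynomial running time comes from.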
Next lecture, we'll be looking at...
Heuristics and phylogenetic trees
References
[Gusfield] D. Gusfield: Algorithms on Strings, Trees, and Sequences:
Computer Science and Computational Biology.
Cambridge University Press, 1997.
[Durbin et al.] R. Durbin, S. Eddy, A. Krogh, G. Mitchison:
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic
Acids.
Cambridge University Press, 1998. (Available from CICSR Reading Room)