Jan 25, 2001

# Index

0. Some notes
1.1 What is multiple sequence alignment?
1.2 Why do we need it? - Motivations
1.3 An example
1.4 Definition: Global Multiple Sequence Alignment
2. What is a good alignment? - Scoring Systems for MSA
2.1 some assumptions
2.2 The ideal solution (and why we can't use it)
2.3 The practical solution: The SP score
2.3.1 Assumptions
2.3.2 Definition : The Sum of Pairs score (SP-score)
2.3.3 Critique of the SP score
3.1 The Algorithm
3.2 Complexity
3.3 An Example
3.4 NP-completeness and how to get around it...
Next lecture, we'll be looking at...

0. Some notes

• This is a huge topic, which will only be covered in part during our lecture.

• For further reading, Holger recommends the books [Gusfield] and [Durbin et al.].
• Most of the time when we talk of sequence alignment from now on, we will mean amino-acid sequences, i.e. proteins
1. Introduction to Multiple Sequence Alignment

1.1 What is multiple sequence alignment?

• Multiple Sequence Alignment (MSA) can be seen as a generalization of Pairwise Sequence Alignment: instead of aligning two sequences, k sequences are aligned simultaneously, where k is any number greater than two. (See below for a more precise definition.)
1.2 Why do we need it? - Motivations
• Computer scientists take it as a challenge, simply because it's possible - generalization cannot be a bad thing
• Biologists think beyond that and wonder if it is of any practical relevance. And in fact, it is:
• MSA allows us to extract and represent biologically important but faint or widely dispersed sequence similarities - this can give us hints about the evolutionary history of certain sequences, for example
• MSA can help us to elucidate biological facts about proteins, etc.
• analysis of the secondary/tertiary structure of, for example, proteins
• critical consensus motifs (DNA/proteins)
• MSA can be seen as inverse to Pairwise Sequence Alignment(PSA):
• When an alignment is good using PSA, we usually conclude that there exists a functional relationship between the sequences
• In MSA, we already know that there exists a functional similarity, and want to find out where exactly it comes from
1.3 An example
• see [Durbin et al.], fig. 6.1
• The figure shows the alignment of ten subsequences of a particular family of proteins
1.4 Definition: Global Multiple Sequence Alignment
• A Global Multiple Sequence Alignment of N > 2 sequences $x_1, \dots, x_N$ is obtained by inserting gaps ("_") into the $x_i$ (note: these gaps can be inserted at any position, i.e. also at the beginning or end) such that the sequences obtained this way all have length L and can be arranged in a matrix of N rows and L columns
• The ability to actually determine a good alignment highly depends on how diverged the $x_i$ are
• An effect that is often encountered in practice is that in proteins, some regions within all sequences can be aligned well without much effort, while other regions can't be meaningfully aligned at all
• A plausible explanation for this is that not all residues within a protein are important - some have little or no function at all. Thus, there is little evolutionary pressure to conserve these residues.
• Typical sets of sequences only have a sequence similarity of about 30%
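
The definition in 1.4 can be checked mechanically. The following is a minimal sketch, not part of the lecture: the function name `is_global_msa` and the toy DNA-like strings are assumptions made for illustration only.

```python
GAP = "_"  # gap character, as in the definition above


def is_global_msa(rows, sequences):
    """Return True if `rows` is a global MSA of `sequences`.

    Checks the two conditions from the definition: all rows have the
    same length L, and removing the gaps from row i gives sequence i.
    """
    if not rows or len(rows) != len(sequences):
        return False
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        return False
    return all(row.replace(GAP, "") == seq for row, seq in zip(rows, sequences))


if __name__ == "__main__":
    seqs = ["ACGT", "AGT", "ACT"]
    print(is_global_msa(["ACGT", "A_GT", "AC_T"], seqs))  # valid: N = 3 rows, L = 4 columns
    print(is_global_msa(["ACGT", "AGT_", "ACT"], seqs))   # invalid: rows differ in length
```

Note that the check says nothing about whether the alignment is good - that is the job of the scoring systems in section 2.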
2. What is a good alignment? - Scoring Systems for MSA

2.1 some assumptions

• Usually, the sequences are not independent - they have some sort of evolutionary relationship with each other, recorded in their phylogenetic tree (that we don't know)
• Some positions within the sequences are more conserved than others.

• This means that when there's a high level of similarity at one position, "alignment candidates" for this position that deviate from the others should be given a high penalty.
E.g. if all residues in a column agree (N), except for one (P):
=> here, the "importance" of N seems obvious, and so P should be given a high penalty

2.2 The ideal solution (and why we can't use it)
• What we would like to have is a complete and precise model M of molecular sequence evolution, such that given the correct phylogenetic tree T for our set,
P(msa) = P(T|M)

i.e. we could compute the correct model and use it as a basis for scoring

• However, this model would have to be VERY complex, simply because life and evolution are so complex and full of exceptions to rules that we haven't even understood yet

2.3 The practical solution: The SP score

2.3.1 Assumptions

• the columns of the alignment matrix are statistically independent, and
• we don't make use of phylogenetic trees (for now)
2.3.2 Definition : The Sum of Pairs score (SP-score)
• given:
- a substitution matrix like PAM or BLOSUM that gives us the score s(x,y) for aligning two characters x and y
- an (N x L) MSA matrix M
• The SP-score for the i-th column $m_i$ of the MSA matrix is calculated as

$$S(m_i) = \sum_{1 \le j < k \le N} s(m_i^j, m_i^k)$$

($m_i^j$ being the j-th entry in the i-th column), and the score for the whole matrix M is

$$S(M) = \sum_{1 \le i \le L} S(m_i)$$

Simply speaking, the SP score is calculated by first adding up all possible (pairwise alignment) scores for one column and
then summing up the scores for all columns.
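
The two sums can be sketched in a few lines of Python. This is an illustration only: the toy `pair_score` (match +1, mismatch -1, gap -2, and 0 for a gap aligned with a gap) stands in for a real PAM/BLOSUM matrix, and all function names are assumptions, not something from the lecture.

```python
from itertools import combinations

GAP = "_"


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Toy stand-in for s(x, y); a real run would look up PAM/BLOSUM."""
    if a == GAP and b == GAP:
        return 0  # assumed convention: gap against gap costs nothing
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch


def column_sp(column, s=pair_score):
    """Sum s over all pairs j < k of entries in one column."""
    return sum(s(a, b) for a, b in combinations(column, 2))


def sp_score(rows, s=pair_score):
    """Sum the column scores over all L columns of the alignment."""
    return sum(column_sp(col, s) for col in zip(*rows))


if __name__ == "__main__":
    msa = ["AC_", "ACG", "A_G"]
    print([column_sp(col) for col in zip(*msa)])  # per-column scores
    print(sp_score(msa))                          # total SP score
```

With these toy scores the three columns contribute 3, -3 and -3, so the whole alignment scores -3.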
2.3.3 Critique of the SP score
• Problems:
• There's no probabilistic justification for the SP score
• Each sequence is treated as if it were directly evolutionarily related to all other N-1 sequences, where in fact it is very probably only directly related to one of them - this problem arises because we don't use a phylogenetic tree
• Nevertheless:
• The SP score is easy to work with, and widely used
• The results are reasonably good
• Other (theoretically better founded) methods that are efficient are not known
3. Constructing MSAs using the SP-score

3.1 The Algorithm

• Idea: generalize the dynamic programming approach for pairwise alignment (Needleman-Wunsch Algorithm)
• Let's do this for an alignment of three sequences:

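• The score $F(i,j,k)$ of an optimal alignment of the sequences x, y and z up to positions i, j and k is calculated recursively (same principle as for pairwise alignment) as

$$F(i,j,k) = \max \begin{cases}
F(i-1,j-1,k-1) + S(x_i, y_j, z_k) \\
F(i-1,j-1,k) + S(x_i, y_j, \_) \\
F(i-1,j,k-1) + S(x_i, \_, z_k) \\
F(i,j-1,k-1) + S(\_, y_j, z_k) \\
F(i-1,j,k) + S(x_i, \_, \_) \\
F(i,j-1,k) + S(\_, y_j, \_) \\
F(i,j,k-1) + S(\_, \_, z_k)
\end{cases}$$

where $S(a,b,c) = s(a,b) + s(a,c) + s(b,c)$ is the SP score of the column (a, b, c)
• For three sequences, seven cases have to be considered - generally there are $2^N - 1$ cases (N = number of sequences): at the (current) end of each sequence, we can add either a gap or the next character, which gives $2^N$ combinations, minus one because adding only gaps is not allowed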
• for initialization, we use

(s(x,y) is the score we get from the PAM/BLOSUM/... matrix)
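
The dynamic programming for three sequences can be sketched as follows. This is a minimal sketch under stated assumptions, not the lecture's own code: it uses a toy `pair_score` instead of PAM/BLOSUM, a linear gap cost, and the simplest initialization (score 0 for the empty alignment, with border cells filled by the same recursion). It returns only the optimal score, not the alignment itself.

```python
from itertools import product

GAP = "_"


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Toy stand-in for s(x, y); gap against gap is assumed to score 0."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch


def align3(x, y, z, s=pair_score):
    """Optimal SP score of a global alignment of x, y and z."""
    seqs = (x, y, z)
    # The 2^3 - 1 = 7 cases: each sequence contributes its next
    # character (1) or a gap (0); the all-gap column is excluded.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    table = {(0, 0, 0): 0}  # assumed initialization: empty alignment scores 0
    # product() yields cells in lexicographic order, so every
    # predecessor of a cell is already in the table when it is needed.
    for cell in product(*(range(len(seq) + 1) for seq in seqs)):
        if cell == (0, 0, 0):
            continue
        best = None
        for move in moves:
            prev = tuple(c - m for c, m in zip(cell, move))
            if min(prev) < 0:
                continue  # this case would read past the start of a sequence
            a, b, c = (seq[p] if m else GAP for seq, p, m in zip(seqs, prev, move))
            score = table[prev] + s(a, b) + s(a, c) + s(b, c)
            if best is None or score > best:
                best = score
        table[cell] = best
    return table[(len(x), len(y), len(z))]


if __name__ == "__main__":
    print(align3("ACGT", "AGT", "ACT"))
```

The table holds one entry per cell of the cube, which is exactly why the approach does not scale beyond a handful of sequences.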

3.2 Complexity

• For N sequences of length L, the above algorithm needs $\Theta(2^N L^N)$ time (see [Gusfield] for an explanation of $\Theta$ - intuitively it means that the complexity is quite exactly $2^N L^N$) and $\Theta(L^N)$ space
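
To get a feeling for how quickly this grows, here is a rough count. It assumes the full table of $(L+1)^N$ cells is filled and that each of the $2^N - 1$ cases per cell counts as one step; the function name `dp_cost` is made up for this illustration.

```python
def dp_cost(n_seqs, length):
    """Return (cells, steps) for naive N-dimensional dynamic programming.

    cells = (L + 1)^N table entries,
    steps = cells * (2^N - 1) cases examined in total.
    """
    cells = (length + 1) ** n_seqs
    cases_per_cell = 2 ** n_seqs - 1
    return cells, cells * cases_per_cell


if __name__ == "__main__":
    for n in (2, 3, 5, 7):
        cells, steps = dp_cost(n, 200)
        print(f"N={n}: {cells:.2e} cells, {steps:.2e} steps")
```

For three sequences of length 100 this gives 1,030,301 cells and 7,212,107 steps; each further sequence multiplies the number of cells by L + 1.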

3.3 An Example

• Fig. 6.3, [Durbin et al.?]: it shows the cube you get for an alignment of three sequences.
• It turns out that not all cells of this cube (and in general, of the N-dimensional matrix) need to be computed, and the order of computation can also be heuristically optimised.
• The rather sophisticated algorithm by Carrillo and Lipman exploits these ideas and thus achieves "slightly" improved time and space usage over "naive" N-dimensional dynamic programming.
• In 1989, Lipman, Altschul, and Kececioglu implemented a (further refined) version of this algorithm in their program "MSA" (still in use).
• MSA is practically restricted to 5-7 protein sequences of typical length (200-300 residues).

3.4 NP-completeness and how to get around it...

• The problem we are trying to solve here is intrinsically very hard - "NP-complete", as computer scientists say. For NP-complete problems, there is (almost) no hope of finding an algorithm whose running time is not exponential.
• ...But most of the time, we're happy with a close approximation to the ideal solution which can be efficiently computed:
• Bounded Error Approximation (see [Gusfield], Chapter 14, Section 6.2 for more)
• Ideas:
• do the alignment consistent with a tree ( not necessarily a phylogenetic tree) that relates the sequences to each other
• use the so-called "centre-star method"
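• this gives us, in polynomial time, an SP-score SP that satisfies

$$SP \le \left(2 - \frac{2}{N}\right) SP_{opt} < 2 \, SP_{opt}$$

(with scores measured as distances, i.e. smaller is better, and assuming the pairwise scores satisfy the triangle inequality), which seems to be nothing to be excited about, but it works reasonably well in practice, and typically only deviates 2-16% from the optimal SP-score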

• still, this method isn't good enough to be used as a standalone method, but
• the theory behind it is interesting
• it can be very useful for constructing the "really good" algorithms (which are heuristics)
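
The first step of the centre-star method, choosing the centre, can be sketched as follows. This covers only that step and rests on assumptions: unit-cost edit distance is used as the pairwise distance (it satisfies the triangle inequality), and the names `edit_distance` and `choose_center` are made up for this illustration. The remaining step, merging the pairwise alignments with the centre into one MSA, is described in [Gusfield].

```python
def edit_distance(a, b):
    """Unit-cost edit distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def choose_center(seqs):
    """Return (index, totals): the sequence with the smallest summed
    distance to all sequences, and the list of all summed distances."""
    totals = [sum(edit_distance(s, t) for t in seqs) for s in seqs]
    center = min(range(len(seqs)), key=totals.__getitem__)
    return center, totals


if __name__ == "__main__":
    print(choose_center(["ACGT", "AGGT", "ACGA"]))
```

In this toy example the first sequence is one edit away from each of the other two, so it is chosen as the centre.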
Next lecture, we'll be looking at...

• Heuristics and phylogenetic trees

References

[Gusfield] D. Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology.
Cambridge University Press, 1997.

[Durbin et al.] Durbin, Eddy, Krogh, Mitchison: Biological sequence analysis: Probabilistic models of proteins and nucleic acids.
Cambridge University Press, 1998. (Available from CICSR Reading Room)