CPSC536A - Notes for Class 7 - Multiple Sequence Alignment

Jan 25, 2001

Index

0. Some notes

1. Introduction to Multiple Sequence Alignment

1.1 What is multiple sequence alignment?
1.2 Why do we need it? - Motivations
1.3 An example
1.4 Definition: Global Multiple Sequence Alignment
2. What is a good alignment? - Scoring Systems for MSA
2.1 some assumptions
2.2 The ideal solution (and why we can't use it)
2.3 The practical solution: The SP score
2.3.1 Assumptions
2.3.2 Definition : The Sum of Pairs score (SP-score)
2.3.3 Critique of the SP score


3. Constructing MSA's using the SP-score

3.1 The Algorithm
3.2 Complexity
3.3 An Example
3.4 NP-completeness and how to get around it...
Next lecture, we'll be looking at...

References


0. Some notes

1. Introduction to Multiple Sequence Alignment

1.1 What is multiple sequence alignment?

1.2 Why do we need it? - Motivations

1.3 An example

1.4 Definition: Global Multiple Sequence Alignment

2. What is a good alignment? - Scoring Systems for MSA

2.1 some assumptions

E.g., all entries agree, except for one:

=> Here, the "importance" of N seems obvious, so the P should be given a high penalty.

2.2 The ideal solution (and why we can't use it)


2.3 The practical solution: The SP score

2.3.1 Assumptions

2.3.2 Definition : The Sum of Pairs score (SP-score)
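The SP score of the i-th column $m_i$ of an alignment of N sequences is

$$SP(m_i) = \sum_{1 \le j < k \le N} s(m_i^j, m_i^k)$$

($m_i^j$ being the j-th entry in the i-th column, and s(x,y) the pairwise score of the entries x and y), and the score for the whole matrix M with L columns is

$$SP(M) = \sum_{1 \le i \le L} SP(m_i)$$

Simply speaking, the SP score is calculated by first adding up all possible (pairwise alignment) scores for one column and
then summing up the scores for all columns.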
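
As a minimal sketch of the "pairs within a column, then sum over columns" calculation described above: the function names and the toy scoring values below (match +1, mismatch -1, gap -2, gap against gap 0) are assumptions made for illustration only. A real scorer would look up s(x,y) in a PAM or BLOSUM matrix.

```python
from itertools import combinations

GAP = "-"

def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Toy stand-in for s(x, y); the numeric values are illustrative assumptions."""
    if x == GAP and y == GAP:
        return 0          # assumption: a gap aligned with a gap costs nothing
    if x == GAP or y == GAP:
        return gap
    return match if x == y else mismatch

def column_score(column, s=pair_score):
    """Sum s over all pairs 1 <= j < k <= N of one column."""
    return sum(s(x, y) for x, y in combinations(column, 2))

def sp_score(rows, s=pair_score):
    """Sum the column scores over all columns of the alignment."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    # zip(*rows) turns the list of rows into a sequence of columns.
    return sum(column_score(col, s) for col in zip(*rows))
```

For example, `sp_score(["AC-", "ACG", "A-G"])` adds the scores of three columns, each of which contributes three pairwise terms.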
2.3.3 Critique of the SP score

3. Constructing MSA's using the SP-score

3.1 The Algorithm

For each cell, we have to consider $2^N - 1$ cases (N = number of sequences): at the (current) end of each sequence, we can add either a gap or the next character, which gives $2^N$ combinations,
minus one because adding only gaps is not allowed.

(s(x,y) is the score we get from the PAM/BLOSUM/... matrix)
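
The following is a hedged sketch of N-dimensional dynamic programming with the SP score, not necessarily the exact formulation used in class. Each cell stands for one tuple of prefix lengths, and every move either adds the next character or a gap to each sequence, with the all-gaps move excluded. The names `msa_sp` and `pair_score` and the toy scoring values are assumptions for illustration; a real s(x,y) would come from a PAM/BLOSUM matrix.

```python
from itertools import combinations, product

GAP = "-"

def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Toy stand-in for s(x, y); the numeric values are illustrative assumptions."""
    if x == GAP and y == GAP:
        return 0
    if x == GAP or y == GAP:
        return gap
    return match if x == y else mismatch

def msa_sp(seqs, s=pair_score):
    """Return (best SP score, aligned rows) for a global alignment of seqs."""
    n = len(seqs)
    # Each sequence either advances by one character (1) or receives a gap (0);
    # the all-zero move (a column consisting only of gaps) is not allowed.
    moves = [m for m in product((0, 1), repeat=n) if any(m)]
    best = {(0,) * n: 0}   # cell -> best score for aligning these prefixes
    back = {}              # cell -> move that achieved the best score
    # product() yields cells in lexicographic order, so every predecessor
    # of a cell has already been computed when the cell is reached.
    for cell in product(*(range(len(q) + 1) for q in seqs)):
        if not any(cell):
            continue
        for move in moves:
            prev = tuple(c - d for c, d in zip(cell, move))
            if min(prev) < 0:
                continue   # this move would step outside the matrix
            column = [q[c - 1] if d else GAP for q, c, d in zip(seqs, cell, move)]
            score = best[prev] + sum(s(x, y) for x, y in combinations(column, 2))
            if cell not in best or score > best[cell]:
                best[cell], back[cell] = score, move
    # Trace back from the final cell to recover the alignment columns.
    end = tuple(len(q) for q in seqs)
    cell, columns = end, []
    while any(cell):
        move = back[cell]
        columns.append([q[c - 1] if d else GAP for q, c, d in zip(seqs, cell, move)])
        cell = tuple(c - d for c, d in zip(cell, move))
    rows = ["".join(col[i] for col in reversed(columns)) for i in range(n)]
    return best[end], rows
```

Calling `msa_sp(["ACGT", "AGT", "ACT"])` returns the best score together with three gapped rows of equal length. The dictionary-based table keeps the sketch short, and it is only practical for a handful of short sequences.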
 

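3.2 Complexity

For N sequences of length L, the above algorithm
needs

time $\Theta(2^N L^N)$, counting the cases examined over all cells

(see [Gusfield] for an explanation of $\Theta$; intuitively, it means that the complexity is quite exactly $2^N L^N$)

and

space $\Theta(L^N)$.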
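
To get a feeling for these numbers, here is a small arithmetic sketch. It assumes one cell per tuple of prefix lengths, i.e. (L+1)^N cells, and at most 2^N - 1 cases per cell. The helper name `dp_work` is hypothetical, and the count is an upper bound because boundary cells have fewer valid cases.

```python
def dp_work(n_seqs, length):
    """Return (cells, case_evaluations) for naive N-dimensional DP (upper bound)."""
    cells = (length + 1) ** n_seqs   # one cell per tuple of prefix lengths
    cases = 2 ** n_seqs - 1          # gap-or-character choices, minus all-gaps
    return cells, cells * cases

for n in (2, 3, 5, 7):
    cells, updates = dp_work(n, 200)
    print(f"N={n}: {cells:.2e} cells, {updates:.2e} case evaluations")
```

Every additional sequence of length 200 multiplies the number of cells by roughly 200 and doubles the number of cases per cell, which is why the naive approach stops being usable after a few sequences.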

 
3.3 An Example

Fig. 6.3, [Durbin et al.?] shows the cube you get for an alignment of three sequences. It turns out that not all cells of this cube (and, in general, of the N-dimensional matrix) need to be computed, and the order of computation can also be heuristically optimised. The rather sophisticated algorithm by Carrillo and Lipman exploits these ideas and thus achieves "slightly" improved time and space usage over "naive" N-dimensional dynamic programming. In 1989, Lipman, Altschul, and Kececioglu implemented a (further refined) version of this algorithm in their program "MSA" (still in use). In practice, MSA is restricted to 5-7 protein sequences of typical length 200-300 residues.

3.4 NP-completeness and how to get around it...
  • Bad news:

  • The problem we are trying to solve here is intrinsically very hard: it is "NP-complete", as computer scientists say. For NP-complete problems,
    there is (almost) no hope of finding an algorithm whose complexity is not exponential.

Next lecture, we'll be looking at...

Heuristics and phylogenetic trees
     



References

[Gusfield] D. Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology.
Cambridge University Press, 1997.

[Durbin et al.] R. Durbin, S. Eddy, A. Krogh, G. Mitchison: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.
Cambridge University Press, 1998. (Available from CICSR Reading Room)