CPSC 545/445 (Autumn 2003) - Class 6: Pairwise Sequence Alignment
Module 2, Part 2
---
2.4 psa with affine gap scores
reminder: affine gap scores: gamma(g) = -d - (g-1)*e
here: global psa (loccal psa is analogous)
general approach:
modify take the standard global psa (needleman-wunsch) algorithm as follows:
M(i,j) := max {
M(i-1,j-1) + s(xi ,yj)
M(k,j) + gamma(i-k) k := 0...i-1
M(i,k) + gamma(j-k) k := 0...j-1 }
problem: additional iteration over k (= position of last non-gap character)
-> O(m*n*max{m,n}) time; in practice, this is typically too slow
but note: this works for arbitrary gap scores
better: exploit additive structure of affine gap score
(this will alow us to achieve O(m*n) time complexity)
idea: use a help structure to capture information on start of gaps
and accumulated gap penalties
-> three cases are pursued simultaneously:
"no gap" - best score M(i,j),
"(extending) gap in y" = "insertion in x" - best score Ix(i,j)
"(extending) gap in x" = "insertion in y" - best scost Iy(i,j)
M(i,j) := max {
M(i-1, j-1) + s(xi,yj)
Ix(i-1, j-1) + s(xi,yj) % end gap in y
Iy(i-1, j-1) + s(xi,yj) % end gap in x
}
Ix(i,j) := max {
M(i-1,j) -d % start new gap in y
Ix(i-1,j) -e % extend existing gap in y
}
Iy(i,j) := max {
M(i, j-1) -d % start new gap in x
Iy(i, j-1) -e % extend existing gap in x
}
initialization: M(k,0) = Ix(0,k) = Iy(k,0) = M(0,k) = -d-k*e
this is based on the assumption that deletion is never directly followed by insertion
(true for optimal alignment if -d-e < min s(a,b) for any a,b
---
2.5 other sequence alignment problems:
- local alignments with repeated matches
useful for finding multiple genes (e.g., tRNA genes)
- alignments with overlap matches
special case of local alignment
can be solves using smith-waterman algorithm,
but there is a better algorithm (simple variation of needleman-wunsch)
- alignment with hybrid(complex) match conditions
these can all be solved using variants of dynamic programming,
time/space complexity O(m*n)
-> BSA, Ch.2; reading assignment
- heuristic alignment -> 2.6
- multiple sequence alignment -> 2.7
---
2.6 Heuristic alignment
idea: reduce complexity of psa algorithms
by using heuristic method
-> loose guarantee for finding optimal alignment (w.r.t. given scoring model)
[why may this not be a big problem in practice?]
Here: BLAST (Basic Local Alignment Search Tool; Altschul et al., 1990)
goal: find high-scoring local alignment between query sequence y
(e.g., specific protein sequence)
and long target sequence x / database (e.g., genome, chromosome)
basic idea:
- a true match alignment likely contains very similar or identical subsequences
- find such subsequences and use them as seeds from which to extend
longer, weaker but still good alignment
- short seeds allows use of table of all possible seeds with starting points
(index) constructed from query during preprocessing
realisation in BLAST:
- parameters w = seed length; t, C = score thresholds (seed / extension)
(prot: w=3, dna: w=11)
- find all length w substrings of query y that align to a particular substring
of sequence x with score > t
-> seed table S
- scan through sequence x (or database),
for each match of subsequence in x with seed in S ("hit"):
extend ungapped local alignment in both directions
until alignment score drops below best score for shorter extensions - C
(implementation uses additional heuristics, special techniques to increase
efficiency)
different versions for BLAST for DNA and protein alignments;
various improvements of basic algorithm, including:
- gapped BLAST (for gapped alignment - uses dynamic programming to extend hits),
- PSI-BLAST (position-specific iterated BLAST, often detects weaker seq similarities)
other heuristic alignment algorithms: FASTA (see BSA, 2.5)
---
Resources:
BSA, Ch.2 [good introduction; contains further references]