CPSC 545/445 (Autumn 2003) - Class 12: Gene Finding (2)
Module 4, Part 2
---
recap:
- probabilistic models for gene finding, WMMs and WAMs
---
4.4 Hidden Markov Models (HMMs)
generalisation of WMMs, WAMs, and various other probabilistic
models.
HMMs are graphical probabilistic models used in
- modelling time series
- speech recognition
- optical character recognition
- ion channel recordings
in biology (bioinformatics), HMMs can be used for modelling:
- coding/non-coding regions in DNA
- protein binding sites in DNA
- protein superfamilies
--
since the mid-1990s, HMMs + machine learning techniques
have been used for:
- modelling
- aligning
- analysing
DNA regions and protein families.
--
HMMs are closely related to other formal models:
- neural networks
- stochastic grammars
- Bayesian networks
[example: fig 7.1]
--
An HMM is given by:
- S: finite set of states (e.g., intron, exon)
  convention: use begin + end states B,E
- A: discrete alphabet of symbols (A,T,C,G)
- T=(t_ij): probability transition matrix
-> transition probabilities t_ij between states i,j
- E=(e_ix) probability emission matrix
-> emission probabilities e_ix for symbols x in state i
[slide: Figure 7.1]
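for concreteness, a minimal sketch of these ingredients as Python
data structures (a toy two-state exon/intron model; all probability
values are made up, and the begin/end states are folded into a start
distribution rather than modelled explicitly):

  # S: the hidden states (begin/end handled via start probs below)
  states = ['exon', 'intron']
  # T = (t_ij): transition probabilities between states i and j
  trans = {'exon':   {'exon': 0.9, 'intron': 0.1},
           'intron': {'exon': 0.2, 'intron': 0.8}}
  # E = (e_ix): emission probabilities for symbol x in state i
  emit = {'exon':   {'A': 0.2, 'C': 0.3, 'G': 0.3, 'T': 0.2},
          'intron': {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}}
  # t_{B,k}: transitions out of the begin state B
  start = {'exon': 0.5, 'intron': 0.5}

the sketches further below all reuse this toy model.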
--
How does it work?
- system randomly evolves from state to state
while emitting symbols from the alphabet
- emission and transition depend on current state only
(Markov assumption)
- only emitted symbols are observable, states are hidden
[example 1]
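as a sketch, the generative process on the toy model above -- note that
both the emission and the next-state choice look only at the current
state (the Markov assumption):

  import random

  def sample(start, trans, emit, length):
      # first state: drawn from the begin-state transitions t_{B,k}
      state = random.choices(list(start), weights=list(start.values()))[0]
      seq, path = [], []
      for _ in range(length):
          path.append(state)
          # emit a symbol; distribution depends on current state only
          xs = list(emit[state])
          seq.append(random.choices(xs, weights=[emit[state][x] for x in xs])[0])
          # move to the next state; also depends on current state only
          ks = list(trans[state])
          state = random.choices(ks, weights=[trans[state][k] for k in ks])[0]
      return ''.join(seq), path

only the returned seq would be observable; path is returned here purely
to make the hidden states visible.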
--
problems related to HMMs:
- how likely is a given sequence given an HMM?
-> given model, obs sequence O, prob of O? (likelihood)
- what is the most probable sequence of
transitions and emissions given a sequence?
-> given model, obs seq O, most prob sequence of states & emissions? (decoding)
- if the transition and emission probabilities are not known,
how can they be learned?
-> given uncertain model m, obs seq O, how should m be revised based on O? (learning)
--
computation of likelihood P(O | model)
naive method:
can easily determine P(O,path | model) for given O, path, model
O = o_1 ... o_L
p = p_0 p_1 p_2 ... p_L p_{L+1}
p_0 = B, p_{L+1} = E
P(O,p | model) = t_{B,p_1} \prod_{i=1}^{L} e_{p_i,o_i} t_{p_i,p_{i+1}}
based on this:
P(O | model) = \sum_{all paths p} P(O,p | model)
-> summing over all paths leads to exponential worst case complexity
(why? there are N^L possible state paths of length L for N states)
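a direct transcription of this sum as a sketch (no explicit end state,
so the final t_{p_L,E} factor is dropped); fine for tiny L, hopeless
for realistic sequence lengths:

  from itertools import product

  def likelihood_naive(O, start, trans, emit):
      total = 0.0
      for p in product(list(start), repeat=len(O)):   # all N^L state paths
          pr = start[p[0]] * emit[p[0]][O[0]]         # t_{B,p_1} * e_{p_1,o_1}
          for i in range(1, len(O)):
              pr *= trans[p[i-1]][p[i]] * emit[p[i]][O[i]]
          total += pr
      return total                                    # P(O | model)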
forward algorithm (for determining likelihood)
- key insight: to 'reconstruct' the observation sequence,
  we only need to compute/store the prob of being in each state k
  after emitting the prefix o_1 ... o_i
- recursively compute f_k(i) = P(o_1...o_i, p_i=k | model)
  using f_l(i+1) = e_{l,o_{i+1}} \sum_{all states k} f_k(i) t_{k,l}
- compute this in one forward pass through O
- note: f_E(L+1) = P(o_1...o_L, p_{L+1}=E | model)
- complexity: O(L * N^2) (L = length of O, N = number of states;
  an N-term sum for each of N states at each position)
[example 2]
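the same quantity in one pass, as a sketch (again without an explicit
end state, so P(O | model) is the sum of f_k(L) over final states k
rather than f_E(L+1)):

  def forward(O, start, trans, emit):
      states = list(start)
      # f[k] = f_k(i) = P(o_1..o_i, p_i = k | model)
      f = {k: start[k] * emit[k][O[0]] for k in states}
      for x in O[1:]:
          # f_l(i+1) = e_{l,o_{i+1}} * sum_k f_k(i) t_{k,l}
          f = {l: emit[l][x] * sum(f[k] * trans[k][l] for k in states)
               for l in states}
      return sum(f.values())

on short toy sequences, forward and likelihood_naive agree -- a handy
sanity check.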
--
most likely state at any given position i: P(p_i=k | O, model)
note: P(p_i=k | O, model) = P(p_i=k, O | model) / P(O | model)
[according to Bayes' rule]
obtain P(O | model) from fwd alg
P(p_i=k, O | model) = P(o_1...o_i, p_i=k | model) * P(o_{i+1}...o_L | p_i=k, model)
note:
P(o_1...o_i, p_i=k | model) = f_k(i) from fwd algorithm above
can compute b_k(i) := P(o_{i+1}...o_L | p_i=k, model) from backward sweep
through O, very similarly to f_k(i)
-> reverse forward algorithm (backward algorithm)
[see Durbin et al., p.59]
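a sketch of both sweeps and the resulting posteriors (the forward pass
is repeated here in a table-keeping variant so the f_k(i) are available
for all i; python indices are 0-based where the notes are 1-based):

  def forward_table(O, start, trans, emit):
      states = list(start)
      f = [{k: start[k] * emit[k][O[0]] for k in states}]
      for x in O[1:]:
          f.append({l: emit[l][x] * sum(f[-1][k] * trans[k][l] for k in states)
                    for l in states})
      return f

  def backward_table(O, start, trans, emit):
      states = list(start)
      b = [{k: 1.0 for k in states}]        # b_k(L) = 1 (no end state modelled)
      for x in reversed(O[1:]):             # x plays the role of o_{i+1}
          nxt = b[0]
          # b_k(i) = sum_l t_{k,l} e_{l,o_{i+1}} b_l(i+1)
          b.insert(0, {k: sum(trans[k][l] * emit[l][x] * nxt[l] for l in states)
                       for k in states})
      return b

  def posteriors(O, start, trans, emit):
      f = forward_table(O, start, trans, emit)
      b = backward_table(O, start, trans, emit)
      PO = sum(f[-1].values())              # P(O | model)
      # P(p_i = k | O, model) = f_k(i) * b_k(i) / P(O | model)
      return [{k: f[i][k] * b[i][k] / PO for k in f[i]} for i in range(len(O))]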
--
decoding (given HMM and observation sequence, find most probable state path):
-> viterbi algorithm
(dynamic programming algorithm, conceptually similar to
sequence alignment algorithms from Module 2
and very similar to fwd algorithm above, see also Durbin et al., p.55ff)
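a sketch of viterbi on the toy model: structurally identical to the
forward algorithm, with max in place of sum, plus a traceback through
backpointers (just like the traceback in Module 2's alignment DPs):

  def viterbi(O, start, trans, emit):
      states = list(start)
      # v[k] = prob of the most probable path ending in state k after o_1..o_i
      v = {k: start[k] * emit[k][O[0]] for k in states}
      back = []                             # backpointers for each position
      for x in O[1:]:
          best = {l: max(states, key=lambda k: v[k] * trans[k][l])
                  for l in states}
          v = {l: emit[l][x] * v[best[l]] * trans[best[l]][l] for l in states}
          back.append(best)
      state = max(v, key=v.get)             # most probable final state
      path = [state]
      for best in reversed(back):           # follow backpointers right to left
          state = best[state]
          path.insert(0, state)
      return path

(a real implementation works in log space to avoid underflow on long
sequences)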
--
learning:
- expectation maximisation (baum-welch)
initialise with arbitrary model parameters
uses fwd and backwd alg to iteratively modify model (sketch after this list)
(see Durbin et al., Ch.3)
-> maximise likelihood of model given data
- gradient descent
- viterbi learning
similar to baum-welch, but re-estimates parameters
based on the most probable paths (as determined by the viterbi alg);
typically faster, but gives slightly worse results than em/baum-welch
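a sketch of one baum-welch re-estimation step, reusing forward_table
and backward_table from the posterior sketch above (single training
sequence, start probabilities not re-estimated, no pseudocounts --
all of which a real implementation would add):

  def baum_welch_step(O, start, trans, emit):
      states = list(start)
      f = forward_table(O, start, trans, emit)
      b = backward_table(O, start, trans, emit)
      PO = sum(f[-1].values())
      # E-step: expected transition counts A and emission counts C
      A = {k: {l: 0.0 for l in states} for k in states}
      C = {k: {x: 0.0 for x in emit[k]} for k in states}
      for i in range(len(O)):
          for k in states:
              C[k][O[i]] += f[i][k] * b[i][k] / PO    # gamma_i(k)
              if i + 1 < len(O):
                  for l in states:                    # xi_i(k,l)
                      A[k][l] += (f[i][k] * trans[k][l]
                                  * emit[l][O[i+1]] * b[i+1][l] / PO)
      # M-step: normalise expected counts into new probabilities
      trans2 = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
      emit2 = {k: {x: C[k][x] / sum(C[k].values()) for x in emit[k]} for k in states}
      return trans2, emit2

iterating this step never decreases P(O | model) and converges to a
local maximum.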
--
The standard HMM architecture
[time permitting]
- used for modelling related families, e.g., families of proteins or RNAs,
but also introns/exons, ...
- use one state for each sequence position
- model insertions and deletions by separate states
- deletion ('gap') states are silent (no emissions)
- self loops in insertion states can model variable length insertions
- direct transitions between deletion states model variable length deletions
-> large number of parameters; training can be difficult if relatively
few sequences are given
[slide: Figure 7.2]
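a sketch that merely enumerates the states and allowed transitions of
such an architecture (match/insert/delete per model position; begin/end
wiring and the rarer I-D transitions are omitted for brevity):

  def profile_hmm_topology(length):
      # one match (M), insert (I) and silent delete (D) state per position
      states = [f'{t}{i}' for i in range(1, length + 1) for t in 'MID']
      arcs = []
      for i in range(1, length + 1):
          arcs += [(f'M{i}', f'I{i}'),
                   (f'I{i}', f'I{i}')]      # self loop: variable length insertions
      for i in range(1, length):
          arcs += [(f'M{i}', f'M{i+1}'),
                   (f'M{i}', f'D{i+1}'),    # enter a deletion
                   (f'D{i}', f'D{i+1}'),    # chained D states: long deletions
                   (f'D{i}', f'M{i+1}'),
                   (f'I{i}', f'M{i+1}')]
      return states, arcs

counting the arcs and the per-state emission tables makes the 'large
number of parameters' point above concrete.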
---
4.5 Using HMMs for Gene Finding
instead of relying on individual signal detectors in isolation,
HMMs allow probabilistic integration of various signals:
- promoter regions
- translation start & stop context sequences
- reading frame periodicity
- polyadenylation signals
- intron splicing signals
- compositional contrast between introns / exons
- differences in nucleosome positioning signals
- sequence determinants of topological domains
(scaffold attachment regions, SARs)
---
Genscan (Burge & Karlin, 1997)
- based on probabilistic model of gene structure
in human genomic sequence
- emphasis on features recognised by general
transcriptional, splicing, and translational machinery,
e.g., TATA box, cap site in eukaryotic promoters
(rather than signals specific to particular genes)
- does not use similarity search
- overall model similar to generalised HMM
(explicit state duration HMM)
- uses an explicitly double-stranded genomic sequence model
-> potential genes on both strands are analysed
simultaneously
- covers cases where the input sequence contains
no gene, a partial gene, a complete gene, or multiple genes
- uses WMM and maximal dependency decomposition (MDD) to model
functional signals
- cannot handle overlapping transcription units
- does not address alternative splicing
[Figure GS]
E_k, I_k = phase k internal exons, introns
--
signal models used by Genscan:
- WMM for transcriptional and translational signals
(translation initiation, polyadenylation signals, TATA
box etc.)
probabilities estimated from annotated GenBank data (scoring sketch after this list)
- maximal dependency decomposition for splice signals
(WMM and WAM inadequate)
probabilistic composition of conditional WMMs
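for concreteness, a sketch of how a WMM scores a candidate signal
(log-odds against a uniform background; the position-specific
probabilities here are toy values, not Genscan's):

  import math

  def wmm_score(site, pos_probs, bg=0.25):
      # independent positions: sum the position-specific log-odds
      return sum(math.log(pos_probs[i][x] / bg) for i, x in enumerate(site))

  # toy 3-position signal model
  pos_probs = [{'A': 0.7, 'C': 0.1, 'G': 0.1, 'T': 0.1},
               {'A': 0.1, 'C': 0.1, 'G': 0.7, 'T': 0.1},
               {'A': 0.1, 'C': 0.1, 'G': 0.1, 'T': 0.7}]
  print(wmm_score('AGT', pos_probs))        # consensus-like site scores high

MDD goes beyond this by recursively splitting the training sites on the
position showing the strongest dependence and fitting a separate
(conditional) WMM to each subset.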
--
exon models and non-coding state models used by Genscan:
- probabilistic models based on conditional hexamer frequency (sketch below)
- consistent reading frame is maintained throughout a gene
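what 'conditional hexamer frequency' amounts to: a 5th-order Markov
model in which each base is scored conditional on the preceding five
(cond_prob is a hypothetical table mapping 5-mers to next-base
probabilities; Genscan's tables are additionally frame-dependent):

  import math

  def hexamer_score(seq, cond_prob):
      # sum of log P(x_i | x_{i-5} .. x_{i-1}) over the sequence
      return sum(math.log(cond_prob[seq[i-5:i]][seq[i]])
                 for i in range(5, len(seq)))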
---
HMMgene (Krogh, 1997)
- different approach from Genscan:
rather than modelling individual functional elements
and combining them into a big model,
the combined model is estimated directly from labelled sequence
data
- based on class HMMs (CHMMs - HMMs where states are labelled
and emit a symbol + label)
- uses a clever machine learning algorithm for
estimating the CHMM from sequence data such that the probability
of correct labelling is maximised
important features:
- emission probabilities of states can depend on the n previous
symbols (higher-order states)
- allows states to share emission/transition probabilities
(tying)
---
Resources:
Durbin et al., Ch. 3 [introduction to HMMs]
Baldi & Brunak, Ch. 7 [introduction to HMMs]
C. Burge, S. Karlin. Prediction of complete gene structures in human
genomic DNA. J. Mol. Biol. 268(1): 78-94, 1997.
A. Krogh. Two methods for improving performance of an HMM and their
application for gene finding.
Proc. 5th Int. Conf. Intelligent Systems for Mol. Biol., 179-186, 1997.
D. Haussler. Computational Genefinding.
[online, from his webpage: www.cse.ucsc.edu/~haussler]
J.W. Fickett. The gene identification problem: an overview for developers.
Computers Chem. 20(1): 103-118, 1996.
R. Guigo. Computational gene identification: an open problem.
Computers Chem. 21(4): 215-222, 1997.
---