CPSC 536A
Notes for March 13, 2001

Class 17: Gene Finding (2)

## Hidden Markov Models (HMMs)

#### HMMs are graphical probabilistic models used in:

- modelling time series
- speech recognition
- optical character recognition
- ion channel recordings

#### In biology (bioinformatics), HMMs can be used for modelling:

- coding/non-coding regions in DNA
- protein binding sites in DNA
- protein superfamilies

#### Since the mid-1990s, HMMs + machine learning techniques have been used for:

- modelling
- aligning
- analysing

DNA regions and protein families.

#### HMMs are closely related to other formal models:

- neural networks
- stochastic grammars
- Bayesian networks

#### An HMM is given by:

- a finite set of states (e.g., intron, exon)
- a discrete alphabet of symbols (A, C, G, T)
- a probability transition matrix (the "transition die")
- a probability emission matrix (the "emission die")

#### EXAMPLE

A: alphabet, A = {A, C, G, T}
S: set of states, S = {S_begin, S_1, S_2, ..., S_end}
T = (t_ij): probability of transition from S_i to S_j
E = (e_ix): probability of emitting letter x in state i
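These four components can be written down directly as a small concrete example. The two states and all probabilities below are hypothetical values chosen only for illustration, not estimates from real data.

```python
# A toy HMM over the DNA alphabet with two hidden states.
# All probabilities are made-up illustrative values.
alphabet = "ACGT"
states = ["S1", "S2"]

start = {"S1": 0.6, "S2": 0.4}                    # P(first state)
trans = {"S1": {"S1": 0.8, "S2": 0.2},            # t_ij = P(next = S_j | S_i)
         "S2": {"S1": 0.3, "S2": 0.7}}
emit = {"S1": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},   # e_ix = P(x | S_i)
        "S2": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

# sanity check: the start vector and each row of T and E must be a
# probability distribution (entries summing to 1)
for row in [start] + list(trans.values()) + list(emit.values()):
    assert abs(sum(row.values()) - 1.0) < 1e-9
```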

#### Fundamental questions:

1. given model H and observation sequence O, what is the probability of generating O using H, P(O|H)? (likelihood)
2. given model H and observation sequence O, what is the most probable state path of H generating O? (decoding)
3. given model H and observation sequence O, how can we update H, based on O, to make it a better model? (learning)

#### Naive computation of likelihood:

- P(O, path | model): joint probability of O and one particular state path
- P(O | model) = sum_{all paths p} P(O, p | model)
-> summing over all paths leads to exponential complexity (|S|^L paths for a sequence of length L)
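The naive sum can be written straight from the definition; the exhaustive enumeration over |S|^L paths is exactly what makes it exponential. The two-state model below uses hypothetical parameters for illustration.

```python
from itertools import product

# Toy two-state DNA HMM (hypothetical parameters, for illustration only).
states = ["S1", "S2"]
start = {"S1": 0.5, "S2": 0.5}
trans = {"S1": {"S1": 0.9, "S2": 0.1},
         "S2": {"S1": 0.2, "S2": 0.8}}
emit = {"S1": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "S2": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

def naive_likelihood(obs):
    """P(obs | model) by brute force: sum P(obs, path) over all |S|^L paths."""
    total = 0.0
    for path in product(states, repeat=len(obs)):   # exponential in len(obs)!
        p = start[path[0]] * emit[path[0]][obs[0]]
        for i in range(1, len(obs)):
            p *= trans[path[i - 1]][path[i]] * emit[path[i]][obs[i]]
        total += p
    return total
```

As a sanity check, summing this likelihood over every possible observation sequence of a fixed length gives 1, since the model defines a distribution over sequences.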

#### Forward algorithm (for determining likelihood)

- key insight: to 'reconstruct' the observation sequence, we only need to compute/store the probability of being in state i after emitting the prefix o_1 ... o_i
- complexity: O(L * |S|^2) time (DP table of size L * |S|, each entry summing over |S| predecessor states)
(see [BML, Section 7.3] for details)
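A minimal sketch of the forward pass, again on a hypothetical two-state DNA model: the row f holds, for each state, the probability of emitting the prefix seen so far and ending in that state, so only |S| numbers are carried between steps.

```python
# Toy two-state DNA HMM (hypothetical parameters, for illustration only).
states = ["S1", "S2"]
start = {"S1": 0.5, "S2": 0.5}
trans = {"S1": {"S1": 0.9, "S2": 0.1},
         "S2": {"S1": 0.2, "S2": 0.8}}
emit = {"S1": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "S2": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

def forward(obs):
    """P(obs | model) via the forward algorithm, O(L * |S|^2) time."""
    # f[s] = P(o_1 ... o_i, state after step i = s); initialise with o_1
    f = {s: start[s] * emit[s][obs[0]] for s in states}
    for x in obs[1:]:
        # sum over predecessor states instead of over whole paths
        f = {s: emit[s][x] * sum(f[r] * trans[r][s] for r in states)
             for s in states}
    return sum(f.values())
```

The result agrees with the naive path sum, but the work per symbol is constant in the sequence length.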

#### Viterbi algorithm (for decoding)

- key idea: to 'reconstruct' the observation sequence, for each state s keep only the most probable state sequence leading to s consistent with o_1 ... o_i
- DP algorithm very similar to the forward algorithm, but uses max instead of sum in each step (and stores backpointers for traceback)
- complexity: O(L * |S|^2) time, as for the forward algorithm
(see [BML, Section 7.3] for details)
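The same toy two-state model (hypothetical parameters) gives a compact Viterbi sketch; note that only the sum is replaced by a max, plus a backpointer per cell for the final traceback.

```python
# Toy two-state DNA HMM (hypothetical parameters, for illustration only).
states = ["S1", "S2"]
start = {"S1": 0.5, "S2": 0.5}
trans = {"S1": {"S1": 0.9, "S2": 0.1},
         "S2": {"S1": 0.2, "S2": 0.8}}
emit = {"S1": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "S2": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

def viterbi(obs):
    """Most probable state path for obs: max instead of sum, with backpointers."""
    v = {s: start[s] * emit[s][obs[0]] for s in states}
    backptrs = []
    for x in obs[1:]:
        nv, ptr = {}, {}
        for s in states:
            # best predecessor state for s at this step
            best = max(states, key=lambda r: v[r] * trans[r][s])
            ptr[s] = best
            nv[s] = v[best] * trans[best][s] * emit[s][x]
        v = nv
        backptrs.append(ptr)
    # trace back from the most probable final state
    path = [max(states, key=v.get)]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path
```

In this model S1 favours A/T and S2 favours C/G, so a run of As decodes to S1 and a run of Gs to S2.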

#### Learning Problem

- EM (Expectation Maximisation)
- Viterbi Learning

## Evaluation of Gene Finders

### Using HMMs for Gene Finding

#### Individual Signal Detectors

Probabilistic integration of various signals:
- promoter regions
- translation start & stop context sequences
- reading frame periodicity
- intron splicing signals
- compositional contrast between introns / exons
- differences in nucleosome positioning signals
- sequence determinants of topological domains
(scaffold attachment regions, SARs)

### Two State-of-the-Art Gene Finders

#### Genscan (Burge & Karlin, 1997):

- based on probabilistic model of gene structure in human genomic sequence
- emphasis on features recognised by the general transcriptional, splicing, and translational machinery, e.g., TATA box, cap site in eukaryotic promoters (rather than signals specific to particular genes)
- does not use similarity search
- overall model similar to generalised HMM (explicit state duration HMM)
- uses explicitly double stranded genomic sequence model
-> potential genes on both strands are analysed simultaneously
- covers cases where input sequence contains no gene, partial gene, complete gene, multiple genes
- uses WMM (weight matrix models) and maximal dependency decomposition (MDD) to model functional signals
- cannot handle overlapping transcription units
- does not address alternative splicing

signal models used by Genscan:
- WMM for transcriptional and translational signals (translation initiation, polyadenylation signals, TATA box, etc.); probabilities estimated from annotated GenBank data
- maximal dependency decomposition for splice signals (WMM and WAM inadequate)
- probabilistic composition of conditional WMMs

exon models and non-coding state models used by Genscan:
- probabilistic models based on conditional hexamer frequency
- consistent reading frame is maintained throughout a gene

#### HMMgene (Krogh, 1997):

- different approach from Genscan:
rather than modelling individual functional elements and combining them into a big model, the combined model is estimated directly from labelled sequence data
- based on class HMMs (CHMMs - HMMs where states are labeled and emit symbol + label)
- uses clever machine learning algorithm for estimating CHMM from sequence data such that probability of correct labeling is maximised
important features:
- emission probabilities of states can depend on n previous states
- allows states to share emission/transition probabilities (tying)

## Evaluating Gene Finding Programs

#### How do we know how good a gene finder is?

- define performance measures for evaluation
- test on standardised test sets of sequence data

#### general performance measures:

true positive TP: correctly predicted feature (e.g., exon/intron boundary)
false positive FP: incorrectly predicted feature
false negative FN: missed feature
true negative TN: correctly predicted absence of a feature

note:
T = TP+FN, true number of features present
P = TP+FP, number of features predicted

sensitivity: SN = TP/T, fraction of true features that are correctly predicted
specificity: SP = TP/P, fraction of predictions that are correct
base-level: SN, SP for annotation of individual bases as coding, non-coding
exon-level: SN, SP for complete exons (both splice sites correct)

want: high specificity and sensitivity
combined measures:
- (SN+SP)/2
- approximate correlation AC: high=good, low=bad
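These definitions translate directly into code when features (e.g., exact exons) are compared as sets; the feature names in the example below are made up for illustration.

```python
def sn_sp(true_features, predicted):
    """Sensitivity SN = TP/T and specificity SP = TP/P over sets of features."""
    tp = len(true_features & predicted)   # correctly predicted features
    sn = tp / len(true_features)          # T = TP + FN, features present
    sp = tp / len(predicted)              # P = TP + FP, features predicted
    return sn, sp

# hypothetical example: 4 true exons, 3 predicted, 2 of them correct
sn, sp = sn_sp({"e1", "e2", "e3", "e4"}, {"e3", "e4", "e5"})
```

Here SN = 2/4 = 0.5 and SP = 2/3: half the real exons were found, and two thirds of the predictions were right.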

### Results for comparing several gene finders (Rogic et al., 2000; Burset and Guigo, 1996):

- HMMgene and Genscan typically perform better than the five other programs tested
- quality of prediction varies with
- exon length
- exon type (initial, internal, terminal, single)
- signal type (acceptor, donor, start, stop)
- similar prediction quality for human and murine genes

correlation between programs is not perfect, e.g., Genscan sometimes misses exons that HMMgene finds and vice versa
-> combination of programs can yield improved prediction accuracy!

### Combining Gene Finding Programs

various methods, here: Exon Union-Intersection (Rogic et al., 2000)

Observations:

- if either HMMgene or Genscan predict a high score, it's usually correct
- if only one program predicts an exon with a low score, the prediction tends to be incorrect, but if both predict it with a low score, it tends to be correct
- for most of the false positives, only one program predicts the exon, and the probability score is low

#### idea:

- accept prediction only if one program gives a high score or if _both_ programs predict with a low score

#### EUI algorithm:

- consider all Genscan and HMMgene exons with probability score >= threshold p'
- of these, predict all exons that are predicted by at least one program
- consider all Genscan and HMMgene exons with probability score < threshold p'
- of these, predict only exons that are predicted by both programs
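Assuming each program reports a probability score per predicted exon, the EUI rule can be sketched as a set operation; the exon coordinates, scores, and threshold below are hypothetical.

```python
def eui(genscan, hmmgene, threshold):
    """Exon Union-Intersection: union of high-scoring predictions,
    intersection of low-scoring ones. Inputs map exon -> probability score."""
    high_g = {e for e, p in genscan.items() if p >= threshold}
    high_h = {e for e, p in hmmgene.items() if p >= threshold}
    low_g = set(genscan) - high_g
    low_h = set(hmmgene) - high_h
    # high scores: trust either program; low scores: require agreement
    return (high_g | high_h) | (low_g & low_h)

# hypothetical predictions: exon = (start, end), value = probability score
genscan = {(1, 100): 0.9, (200, 300): 0.3}
hmmgene = {(200, 300): 0.2, (400, 500): 0.4}
combined = eui(genscan, hmmgene, threshold=0.5)
```

In this example (1, 100) survives because Genscan scores it highly, (200, 300) survives because both programs predict it, and the low-scoring (400, 500), predicted by HMMgene alone, is dropped.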

#### Results:

false positives significantly reduced
-> significantly increased specificity, sensitivity almost unaffected.

### Conclusions:

- gene prediction is an increasingly relevant problem
- it is hard to do in a fully automated way (in reality, lab work is required to check predictions)
- complex probabilistic models integrating biological knowledge (signal detection) and computer science techniques (machine learning, algorithms) provide the basis for modern gene finders
- empirical methods for evaluating algorithms can be used to improve prediction accuracy by combining gene finding programs
- significant further progress needed to achieve fully automated gene finding with acceptable accuracy