CPSC 545/445 (Autumn 2003) - Class 13: Motif Discovery
Module 4, Part 3
[partly based on Anne Condon's lecture notes from CPSC 536A, 2001/2002
and on lecture notes from UW, CS527, 2000]
---
4.8 The Motif Discovery Problem
goal: identify and characterise unknown, shared motifs (short, imperfectly
conserved subsequences) in a set of unaligned sequences.
motif: pattern common to a set of short sequences, or sites, where the sites
share some signal associated with their common biological property.
examples: binding site for a regulatory protein, transcription start and stop
sites, ribosome binding sites in prokaryotes, or intron splice sites
[slide:
For example, the cyclic AMP receptor protein (CRP) is a transcription
factor in E. Coli. Its binding sites are DNA sequences of length
approximately 22. The following table, taken from Stormo and
Hartzell (1989), shows just positions 3-9 (out of the 22 sequence positions)
in 23 bonafide CRP binding sites.
TTGTGGC
TTTTGAT
AAGTGTC
ATTTGCA
CTGTGAG
ATGCAAA
GTGTTAA
ATTTGAA
TTGTGAT
ATTTATT
ACGTGAT
ATGTGAG
TTGTGAG
CTGTAAC
CTGTGAA
TTGTGAC
GCCTGAC
TTGTGAT
TTGTGAT
GTGTGAA
CTGTGAC
ATGAGAC
TTGTGAG
Notice that in the second column T predominates and in the third
column G predominates, for example.
]
motifs can be summarised in a sequence profile = weight array
[slide: weight array for example from above:
A 0.35 0.043 0 0.043 0.13 0.83 0.26
C 0.17 0.087 0.043 0.043 0 0.043 0.3
G 0.13 0 0.78 0 0.83 0.043 0.17
T 0.35 0.87 0.17 0.91 0.043 0.087 0.26
]
Simplifying assumptions:
- consider only motifs with same length
- do not allow gaps
- consider DNA sequences
--
How to quantify the "amount of information" in a motif
- could use information content [def: ...]
- better: relative entropy of set of sequences
(allows to take into account background distribution)
assumption: each position is independent
Given:
- background distribution on nucleotide frequencies
Q(N) = prob of N, N \in {A,C,G,T}
Q(N) is position independent
- profile matrix = weight array
P_j(N) = prob of N at pos j, N \in {A,C,G,T}
Note: P_j(N) is a probability distribution over bases
Relative entropy of P_j w.r.t. Q is defined as
D(P_j||Q) := sum_{N in {A,C,G,T}} P_j(N) log_2 (P_j(N)/Q(N)).
Relative entropy of motif given by profile matrix [P_j(N)]_{N,j}, length L:
D(P||Q) := sum_{j=1}^{L} D(P_j||Q)
For the example above, the positional relative entropy values, measured
in base 2, are
0.12 1.3 1.1 1.5 1.2 1.1 0.027
total relative entropy = 6.347 (bits)
Note:
Relative entropy is a measure for the difference between two distributions;
it is also called "Kulback Leibler distance"
(but: it is not a proper distance metric, since it doesn't satisfy the
triangle inequality)
Note:
- D(P||Q) >= 0
- P = Q => D(P||Q) = 0
- D(P||Q) <= 2*L (when does this happen?)
--
The Relative Entropy Site Selection Problem:
Given:
- k sequences s1, s2, ..., sk
- a target length L for the sites
- background distribution Q on nucleotide frequencies
Goal: Find a set T of sites (contiguous subsequences) of equal length L,
one per sequence, such that the associated motif
has maximum relative entropy
Unfortunately, this problem is NP-complete [1].
(Several variants of this problem have been used to formalise
and solve motif discovery problems.)
--
A Greedy Algorithm for solving the Relative Entropy Site Selection Problem
key idea:
- construct profiles by iteratively selecting length L subsequence from
each given sequence, starting from single subsequence,
greedily adding one subsequence at a time.
Input:
- sequences s1, ..., sk
- L: length of motif
- Q: background distribution
- d: number of motifs (profiles) to keep at each step
Output:
- d motifs
Algorithm [Hertz and Stormo, 1999]:
1. Create a singleton set (i.e., only one member) for each possible length L
substring of each of the k input sequences.
Repeat
2. For each set S retained so far, add each possible length L
substring from an input sequence s_i not represented in S
3. For each new set, compute the profile and relative entropy with respect Q
each new set.
4. Select and retain the d sets with highest entropy
Until each set covers all sequences
Note: this algorithm cannot be expected to find optimal solutions,
but was shown to perform well on some practical problem instances.
--
Gibbs Sampling Approach
[Lawrence et al., 1993]
basic idea:
- to start with complete set of k substrings (candidate sites),
in each step, remove one sequence at random, then add a new one at random
with probability proportional to its score
Input:
- sequences s1, ..., sk
- L: length of motif
- Q: background distribution
- stopping criterion (e.g., max number of iterations or max CPU time allowed)
1. Randomly create set T := {t_1, ..., t_k} by
choosing t_i from s_i uniformly at random
Repeat
2. Choose i uniformly at random from 1,...,k and remove t_i from T
3. For every j in {1, ..., |s_i|-L+1} do:
4. t_ij := length L substring of s_i starting in position j
5. compute D_j := rel entropy of T \union {t_ij} w.r.t. Q
6. Randomly choose t from the t_ij such that every t_ij is
selected with probability proportional to D_j
7. T := T \union {t}
Until stopping criterion met
Note: Like the greedy algorithm, this algorithm is not guaranteed to
find optimal solutions, but was shown to perform well on some
practical problem instances.
General Note:
Gibbs sampling is a very simple SLS algorithm.
Since any stochastic local search algorithm can be applied to
solving this problem, it should not be hard to come up with
an SLS algorithm that performs better.
---
State-of-the-Art approach: Random Projections Method by Buhler & Tompa (2001):
slightly different problem formulation:
Planted (L,d)-Motif Problem: given k DNA sequences of length n,
each of which contains a variant M' of a fixed, unknown DNA sequence M (motif)
of length L as a subsequence, where M' differs from M in exactly d positions
(d point mutations).
Instances of this problem are known to be very challenging,
e.g., (18,6) for k=20, n=600
Can be solved using previously mentioned methods.
basic idea behind PROJECTIONS algorithm:
- use random projections of input sequences' L-mers,
i.e., choose M of the L positions at random and
use these as a hash function h(x)
-> planted motif will collect in one bucket for suitably chosen M
- extract planted motif from buckets containing "many" elements
using expectation-maximisation (or other refinement technique)
(For details, wait for project report of Alam et al. and/or
read Buhler and Tompa, 2001.)
PROJECTION has been shown to work well for artificial and real
motif discovery problems.
---
Resources:
Finding instances of unknown sites, Lecture 10, CS527, Winter
2000, CS&E, U. Washington. See
http://www.cs.washington.edu/education/courses/527/00wi/
G. Z. Hertz and G. D. Stormo. Identifying DNA and protein
patterns with statistically significant alignments of multiple
sequences. Bioinformatics, 15(7/8):563-577, July/August 1999. See
http://ural.wustl.edu/publications.html
J. Buhler and M. Tompa. Finding Motifs Using Random Projections.
In RECOMB'01, pages 69--76. ACM-, 2001. Proc.RECOMB'01, Montreal.
http://citeseer.nj.nec.com/buhler01finding.html
Further Reading:
C. E. Lawrence and A. A. Reilly. An expectation maximization (EM)
algorithm for the identification and characterization of common sites
in unaligned biopolymer sequences. Proteins: structure, function and
genetics, 7:41-51, 1990.
G. D. Stormo and G. W. Hartzell III. Identifying protein-binding
sites from unaligned DNA fragments. Proc. Nat. Acad. Science, U.S.A.,
86:1183-1187, 1989.
T. L. Bailey and C. Elkan. Unsupervised learning of multiple motifs
in biopolymers using expectation maximization. Machine Learning,
21(1-2):51-80, 1995.
http://meme.sdsc.edu/meme/website/papers.html
T. Akutsu. Hardness results on gapless local multiple sequence
alignment, By Technical report 98-MPS-42-2, Information Processing
Society of Japan, 1998.
L. P. Lim and C. B. Burge. A computational analysis of sequence
features involved in recognition of short introns, PNAS,
98(20):11193-11198, September 25, 2001.
See www.pnas.org.
---