CPSC 445/545 (Autumn 2004) 
Module 4, Part 1 - Motif Discovery

[partly based on Anne Condon's lecture notes from CPSC 536A, 2001/2002
 and on lecture notes from UW, CS527, 2000]

---
4.1 The Motif Discovery Problem

goal: identify and characterise unknown, shared motifs (short, imperfectly
	conserved subsequences) in a set of unaligned sequences.

motif: pattern common to a set of short sequences, or sites, where the sites 
	share some signal associated with their common biological property.

examples: binding site for a regulatory protein, transcription start and stop
	sites, ribosome binding sites in prokaryotes, or intron splice sites

[slide: 
For example, the cyclic AMP receptor protein (CRP) is a transcription
factor in E. Coli. Its binding sites are DNA sequences of length
approximately 22. The following table, taken from Stormo and
Hartzell (1989), shows just positions 3-9 (out of the 22 sequence positions)
in 23 bonafide CRP binding sites.

TTGTGGC
TTTTGAT
AAGTGTC
ATTTGCA
CTGTGAG
ATGCAAA
GTGTTAA
ATTTGAA
TTGTGAT
ATTTATT
ACGTGAT
ATGTGAG
TTGTGAG
CTGTAAC
CTGTGAA
TTGTGAC
GCCTGAC
TTGTGAT
TTGTGAT
GTGTGAA
CTGTGAC
ATGAGAC
TTGTGAG

Notice that in the second column T predominates and in the third
column G predominates, for example.
]

motifs can be summarised in a sequence profile = weight array

[slide: weight array for example from above:

A 0.35 0.043 0     0.043 0.13  0.83  0.26
C 0.17 0.087 0.043 0.043 0     0.043 0.3
G 0.13 0     0.78  0     0.83  0.043 0.17
T 0.35 0.87  0.17  0.91  0.043 0.087 0.26
]

Simplifying assumptions:
- consider only motifs with same length
- do not allow gaps
- consider DNA sequences


--
How to quantify the "amount of information" in a motif

- could use information content [def: ...]
- better: relative entropy of set of sequences
	(allows to take into account background distribution)

assumption: each position is independent

Given: 
- background distribution on nucleotide frequencies 
	Q(N) = prob of N, N \in {A,C,G,T}
	Q(N) is position independent
- profile matrix = weight array
	P_j(N) = prob of N at pos j, N \in {A,C,G,T}

Note: P_j(N) is a probability distribution over bases

Relative entropy of P_j w.r.t. Q is defined as
	D(P_j||Q) := sum_{N in {A,C,G,T}} P_j(N) log_2 (P_j(N)/Q(N)).

Relative entropy of motif given by profile matrix [P_j(N)]_{N,j}, length L:
	D(P||Q) := sum_{j=1}^{L} D(P_j||Q)

For the example above, the positional relative entropy values, measured
in base 2, are

0.12 1.3 1.1 1.5 1.2 1.1 0.027

total relative entropy = 6.347 (bits)


Note: 
Relative entropy is a measure for the difference between two distributions;
it is also called "Kulback Leibler distance" 
(but: it is not a proper distance metric, since it doesn't satisfy the 
triangle inequality)

Note:
- D(P||Q) >= 0
- P = Q => D(P||Q) = 0
- D(P||Q) <= 2*L	(when does this happen?)


--
The Relative Entropy Site Selection Problem:

Given: 
- k sequences s1, s2, ..., sk
- a target length L for the sites
- background distribution Q on nucleotide frequencies

Goal: Find a set T of sites (contiguous subsequences) of equal length L, 
	one per sequence, such that the associated motif 
	has maximum relative entropy

Unfortunately, this problem is NP-complete [1].

(Several variants of this problem have been used to formalise
and solve motif discovery problems.)


--
A Greedy Algorithm for solving the Relative Entropy Site Selection Problem

key idea: 
- construct profiles by iteratively selecting length L subsequence from 
	each given sequence, starting from single subsequence,
	greedily adding one subsequence at a time.

Input: 
- sequences s1, ..., sk
- L: length of motif
- Q: background distribution
- d: number of motifs (profiles) to keep at each step

Output:
- d motifs


Algorithm [Hertz and Stormo, 1999]:

1. Create a singleton set (i.e., only one member) for each possible length L
	substring of each of the k input sequences.

Repeat 

2.  For each set S retained so far, add each possible length L
	substring from an input sequence s_i not represented in S

3.  For each new set, compute the profile and relative entropy with respect Q
	each new set. 

4.  Select and retain the d sets with highest entropy

Until each set covers all sequences


Note: this algorithm cannot be expected to find optimal solutions,
	but was shown to perform well on some practical problem instances.

--
Gibbs Sampling Approach
[Lawrence et al., 1993]

basic idea:
- to start with complete set of k substrings (candidate sites), 
  in each step, remove one sequence at random, then add a new one at random 
  with probability proportional to its score

Input: 
- sequences s1, ..., sk
- L: length of motif
- Q: background distribution
- stopping criterion (e.g., max number of iterations or max CPU time allowed)

1. Randomly create set T := {t_1, ..., t_k} by
	choosing t_i from s_i uniformly at random

Repeat

2.   Choose i uniformly at random from 1,...,k and remove t_i from T
3.   For every j in {1, ..., |s_i|-L+1} do:
4.	t_ij := length L substring of s_i starting in position j
5.	compute D_j := rel entropy of T \union {t_ij} w.r.t. Q
6.   Randomly choose t from the t_ij such that every t_ij is 
	selected with probability proportional to D_j
7.   T := T \union {t}

Until stopping criterion met


Note: Like the greedy algorithm, this algorithm is not guaranteed to 
	find optimal solutions, but was shown to perform well on some 
	practical problem instances.


General Note: 
Gibbs sampling is a very simple SLS algorithm.
Since any stochastic local search algorithm can be applied to 
solving this problem, it should not be hard to come up with 
an SLS algorithm that performs better.


---
State-of-the-Art approach: Random Projections Method by Buhler & Tompa (2001):

slightly different problem formulation:

Planted (L,d)-Motif Problem: given k DNA sequences of length n,
each of which contains a variant M' of a fixed, unknown DNA sequence M (motif)
of length L as a subsequence, where M' differs from M in exactly d positions
(d point mutations).

Instances of this problem are known to be very challenging, 
e.g., (18,6) for k=20, n=600

Can be solved using previously mentioned methods.


basic idea behind PROJECTIONS algorithm:
- use random projections of input sequences' L-mers, 
	i.e., choose M of the L positions at random and
	use these as a hash function h(x)
	-> planted motif will collect in one bucket for suitably chosen M
- extract planted motif from buckets containing "many" elements
	using expectation-maximisation (or other refinement technique)

(For details, wait for project report of Alam et al. and/or 
read Buhler and Tompa, 2001.)

PROJECTION has been shown to work well for artificial and real
motif discovery problems.


---
Resources:

Finding instances of unknown sites, Lecture 10, CS527, Winter
2000, CS&E, U. Washington.  See
http://www.cs.washington.edu/education/courses/527/00wi/

G. Z. Hertz and G. D. Stormo.  Identifying DNA and protein
patterns with statistically significant alignments of multiple
sequences.  Bioinformatics, 15(7/8):563-577, July/August 1999.  See
http://ural.wustl.edu/publications.html

J. Buhler and M. Tompa. Finding Motifs Using Random Projections. 
In RECOMB'01, pages 69--76. ACM-, 2001. Proc.RECOMB'01, Montreal. 
http://citeseer.nj.nec.com/buhler01finding.html


Further Reading:

C. E. Lawrence and A. A. Reilly.  An expectation maximization (EM)
algorithm for the identification and characterization of common sites
in unaligned biopolymer sequences.  Proteins: structure, function and
genetics, 7:41-51, 1990.
 
G. D. Stormo and G. W. Hartzell III.  Identifying protein-binding
sites from unaligned DNA fragments.  Proc. Nat. Acad. Science, U.S.A.,
86:1183-1187, 1989.

T. L. Bailey and C. Elkan. Unsupervised learning of multiple motifs
in biopolymers using expectation maximization.  Machine Learning,
21(1-2):51-80, 1995.  
http://meme.sdsc.edu/meme/website/papers.html

T. Akutsu.  Hardness results on gapless local multiple sequence
alignment, By Technical report 98-MPS-42-2, Information Processing
Society of Japan, 1998.

L. P. Lim and C. B. Burge. A computational analysis of sequence
features involved in recognition of short introns, PNAS,
98(20):11193-11198, September 25, 2001. 
See www.pnas.org.


---