CPSC 445/545 (Autumn 2004) Module 4, Part 1 - Motif Discovery [partly based on Anne Condon's lecture notes from CPSC 536A, 2001/2002 and on lecture notes from UW, CS527, 2000] --- 4.1 The Motif Discovery Problem goal: identify and characterise unknown, shared motifs (short, imperfectly conserved subsequences) in a set of unaligned sequences. motif: pattern common to a set of short sequences, or sites, where the sites share some signal associated with their common biological property. examples: binding site for a regulatory protein, transcription start and stop sites, ribosome binding sites in prokaryotes, or intron splice sites [slide: For example, the cyclic AMP receptor protein (CRP) is a transcription factor in E. Coli. Its binding sites are DNA sequences of length approximately 22. The following table, taken from Stormo and Hartzell (1989), shows just positions 3-9 (out of the 22 sequence positions) in 23 bonafide CRP binding sites. TTGTGGC TTTTGAT AAGTGTC ATTTGCA CTGTGAG ATGCAAA GTGTTAA ATTTGAA TTGTGAT ATTTATT ACGTGAT ATGTGAG TTGTGAG CTGTAAC CTGTGAA TTGTGAC GCCTGAC TTGTGAT TTGTGAT GTGTGAA CTGTGAC ATGAGAC TTGTGAG Notice that in the second column T predominates and in the third column G predominates, for example. ] motifs can be summarised in a sequence profile = weight array [slide: weight array for example from above: A 0.35 0.043 0 0.043 0.13 0.83 0.26 C 0.17 0.087 0.043 0.043 0 0.043 0.3 G 0.13 0 0.78 0 0.83 0.043 0.17 T 0.35 0.87 0.17 0.91 0.043 0.087 0.26 ] Simplifying assumptions: - consider only motifs with same length - do not allow gaps - consider DNA sequences -- How to quantify the "amount of information" in a motif - could use information content [def: ...] - better: relative entropy of set of sequences (allows to take into account background distribution) assumption: each position is independent Given: - background distribution on nucleotide frequencies Q(N) = prob of N, N \in {A,C,G,T} Q(N) is position independent - profile matrix = weight array P_j(N) = prob of N at pos j, N \in {A,C,G,T} Note: P_j(N) is a probability distribution over bases Relative entropy of P_j w.r.t. Q is defined as D(P_j||Q) := sum_{N in {A,C,G,T}} P_j(N) log_2 (P_j(N)/Q(N)). Relative entropy of motif given by profile matrix [P_j(N)]_{N,j}, length L: D(P||Q) := sum_{j=1}^{L} D(P_j||Q) For the example above, the positional relative entropy values, measured in base 2, are 0.12 1.3 1.1 1.5 1.2 1.1 0.027 total relative entropy = 6.347 (bits) Note: Relative entropy is a measure for the difference between two distributions; it is also called "Kulback Leibler distance" (but: it is not a proper distance metric, since it doesn't satisfy the triangle inequality) Note: - D(P||Q) >= 0 - P = Q => D(P||Q) = 0 - D(P||Q) <= 2*L (when does this happen?) -- The Relative Entropy Site Selection Problem: Given: - k sequences s1, s2, ..., sk - a target length L for the sites - background distribution Q on nucleotide frequencies Goal: Find a set T of sites (contiguous subsequences) of equal length L, one per sequence, such that the associated motif has maximum relative entropy Unfortunately, this problem is NP-complete [1]. (Several variants of this problem have been used to formalise and solve motif discovery problems.) -- A Greedy Algorithm for solving the Relative Entropy Site Selection Problem key idea: - construct profiles by iteratively selecting length L subsequence from each given sequence, starting from single subsequence, greedily adding one subsequence at a time. Input: - sequences s1, ..., sk - L: length of motif - Q: background distribution - d: number of motifs (profiles) to keep at each step Output: - d motifs Algorithm [Hertz and Stormo, 1999]: 1. Create a singleton set (i.e., only one member) for each possible length L substring of each of the k input sequences. Repeat 2. For each set S retained so far, add each possible length L substring from an input sequence s_i not represented in S 3. For each new set, compute the profile and relative entropy with respect Q each new set. 4. Select and retain the d sets with highest entropy Until each set covers all sequences Note: this algorithm cannot be expected to find optimal solutions, but was shown to perform well on some practical problem instances. -- Gibbs Sampling Approach [Lawrence et al., 1993] basic idea: - to start with complete set of k substrings (candidate sites), in each step, remove one sequence at random, then add a new one at random with probability proportional to its score Input: - sequences s1, ..., sk - L: length of motif - Q: background distribution - stopping criterion (e.g., max number of iterations or max CPU time allowed) 1. Randomly create set T := {t_1, ..., t_k} by choosing t_i from s_i uniformly at random Repeat 2. Choose i uniformly at random from 1,...,k and remove t_i from T 3. For every j in {1, ..., |s_i|-L+1} do: 4. t_ij := length L substring of s_i starting in position j 5. compute D_j := rel entropy of T \union {t_ij} w.r.t. Q 6. Randomly choose t from the t_ij such that every t_ij is selected with probability proportional to D_j 7. T := T \union {t} Until stopping criterion met Note: Like the greedy algorithm, this algorithm is not guaranteed to find optimal solutions, but was shown to perform well on some practical problem instances. General Note: Gibbs sampling is a very simple SLS algorithm. Since any stochastic local search algorithm can be applied to solving this problem, it should not be hard to come up with an SLS algorithm that performs better. --- State-of-the-Art approach: Random Projections Method by Buhler & Tompa (2001): slightly different problem formulation: Planted (L,d)-Motif Problem: given k DNA sequences of length n, each of which contains a variant M' of a fixed, unknown DNA sequence M (motif) of length L as a subsequence, where M' differs from M in exactly d positions (d point mutations). Instances of this problem are known to be very challenging, e.g., (18,6) for k=20, n=600 Can be solved using previously mentioned methods. basic idea behind PROJECTIONS algorithm: - use random projections of input sequences' L-mers, i.e., choose M of the L positions at random and use these as a hash function h(x) -> planted motif will collect in one bucket for suitably chosen M - extract planted motif from buckets containing "many" elements using expectation-maximisation (or other refinement technique) (For details, wait for project report of Alam et al. and/or read Buhler and Tompa, 2001.) PROJECTION has been shown to work well for artificial and real motif discovery problems. --- Resources: Finding instances of unknown sites, Lecture 10, CS527, Winter 2000, CS&E, U. Washington. See http://www.cs.washington.edu/education/courses/527/00wi/ G. Z. Hertz and G. D. Stormo. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15(7/8):563-577, July/August 1999. See http://ural.wustl.edu/publications.html J. Buhler and M. Tompa. Finding Motifs Using Random Projections. In RECOMB'01, pages 69--76. ACM-, 2001. Proc.RECOMB'01, Montreal. http://citeseer.nj.nec.com/buhler01finding.html Further Reading: C. E. Lawrence and A. A. Reilly. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: structure, function and genetics, 7:41-51, 1990. G. D. Stormo and G. W. Hartzell III. Identifying protein-binding sites from unaligned DNA fragments. Proc. Nat. Acad. Science, U.S.A., 86:1183-1187, 1989. T. L. Bailey and C. Elkan. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning, 21(1-2):51-80, 1995. http://meme.sdsc.edu/meme/website/papers.html T. Akutsu. Hardness results on gapless local multiple sequence alignment, By Technical report 98-MPS-42-2, Information Processing Society of Japan, 1998. L. P. Lim and C. B. Burge. A computational analysis of sequence features involved in recognition of short introns, PNAS, 98(20):11193-11198, September 25, 2001. See www.pnas.org. ---