Finding Instances of Unknown Sites
----------------------------------

The problem considered in this lecture is that of "identifying and characterizing shared motifs in a set of unaligned sequences" [2]. Loosely speaking, a motif is a pattern common to a set of short sequences, or sites, that share some signal associated with a common biological property. Examples of such sites include binding sites for a regulatory protein, transcription start and stop sites, ribosome binding sites in prokaryotes, and intron splice sites.

For example, the cyclic AMP receptor protein (CRP) is a transcription factor in E. coli. Its binding sites are DNA sequences of length approximately 22. The following table, taken from Stormo and Hartzell [5], shows just positions 3-9 (out of the 22 sequence positions) in 23 bona fide CRP binding sites.

    TTGTGGC
    TTTTGAT
    AAGTGTC
    ATTTGCA
    CTGTGAG
    ATGCAAA
    GTGTTAA
    ATTTGAA
    TTGTGAT
    ATTTATT
    ACGTGAT
    ATGTGAG
    TTGTGAG
    CTGTAAC
    CTGTGAA
    TTGTGAC
    GCCTGAC
    TTGTGAT
    TTGTGAT
    GTGTGAA
    CTGTGAC
    ATGAGAC
    TTGTGAG

Notice, for example, that T predominates in the second column and G predominates in the third. Recall that the content of these sites can be summarized in a profile, or weight, matrix:

    A   0.35   0.043  0      0.043  0.13   0.83   0.26
    C   0.17   0.087  0.043  0.043  0      0.043  0.30
    G   0.13   0      0.78   0      0.83   0.043  0.17
    T   0.35   0.87   0.17   0.91   0.043  0.087  0.26

Since our goal will be to find unknown motifs in unaligned sequences, we need a measure of the "amount of signal" in a motif. We will consider only motifs whose sites all have the same length and contain no gaps. One good option would be the information content of the set of sites. We'll use a closely related measure that allows us to take the background distribution of the sequences into account, namely the relative entropy of the set of sites. We can express the relative entropy of the set of sites as the sum of the positional relative entropies, assuming that the positions are independent.
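The profile matrix for the CRP sites can be reproduced by counting, column by column, how often each nucleotide occurs among the 23 sites. The following is a minimal Python sketch; the names CRP_SITES and profile_matrix are illustrative choices and do not come from the cited papers.

```python
from collections import Counter

ALPHABET = "ACGT"

# Positions 3-9 of the 23 CRP binding sites from the table in the text.
CRP_SITES = [
    "TTGTGGC", "TTTTGAT", "AAGTGTC", "ATTTGCA", "CTGTGAG", "ATGCAAA",
    "GTGTTAA", "ATTTGAA", "TTGTGAT", "ATTTATT", "ACGTGAT", "ATGTGAG",
    "TTGTGAG", "CTGTAAC", "CTGTGAA", "TTGTGAC", "GCCTGAC", "TTGTGAT",
    "TTGTGAT", "GTGTGAA", "CTGTGAC", "ATGAGAC", "TTGTGAG",
]


def profile_matrix(sites):
    """Return one dict per column, mapping nucleotide -> relative frequency."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length (no gaps)")
    profile = []
    for j in range(length):
        # Count the nucleotides that appear at position j across all sites.
        counts = Counter(site[j] for site in sites)
        profile.append({n: counts[n] / len(sites) for n in ALPHABET})
    return profile


if __name__ == "__main__":
    matrix = profile_matrix(CRP_SITES)
    # Print one row per nucleotide, one column per site position.
    for n in ALPHABET:
        print(n, " ".join(f"{column[n]:.3f}" for column in matrix))
```

Each column of the result is a probability distribution over {A,C,G,T}, which is the form the relative entropy measure operates on.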
The positional relative entropy is defined as follows. Suppose we have a background distribution Q on the nucleotide frequencies, Q(N) for N in {A,C,G,T}. Let P be the profile matrix for our set of sites. Then each column P_j of P is also a probability distribution over {A,C,G,T}. The relative entropy for position j (relative to Q), measured in base b, is defined as

    D_b(P_j || Q) = sum_{N in {A,C,G,T}} P_j(N) log_b( P_j(N) / Q(N) ).

For the example above, the positional relative entropies, measured in base 2, are

    0.12   1.3   1.1   1.5   1.2   1.1   0.027

We can now make the problem we are addressing precise: Given k sequences s_1, s_2, ..., s_k, a target length l for the sites, and a background distribution on the nucleotide frequencies, find a set T of sites (contiguous subsequences) of length l, one per sequence, that has maximum relative entropy. Call this the relative entropy site selection problem. Unfortunately, this problem is NP-complete [1].

From here we follow the lecture notes [6] exactly.

Another technique that has been used to solve a similarly formulated site selection problem is expectation maximization. See the work of Lawrence and Reilly [4], and that of Bailey and Elkan [2], who developed the MEME program.

Application to analysis of sequence features in short introns
-------------------------------------------------------------

Lim and Burge [7] applied some of the methods described here to develop tools for recognition of short introns. Introns are characterized by their 5' and 3' splice sites, as well as by the branch site involved in the splicing process. For several organisms, using known intron sequences, they built a simple weight matrix model of the 5' and 3' splice sites. It was not possible to determine the branch site from the aligned sequences, so they used the Gibbs sampling method to determine the most likely branch site for each organism. (See slide for pictograms of these three sites.) They calculated the relative entropy of each of these sites.
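The relative entropy of a set of aligned sites, the quantity used here to score the splice and branch sites, can be computed directly from the definition of D_b(P_j || Q). The following is a minimal Python sketch and not the code of Lim and Burge; the function names and the uniform default background are assumptions made for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

# Assumed background distribution Q: uniform over the four nucleotides.
UNIFORM = {n: 0.25 for n in ALPHABET}


def positional_relative_entropy(column, background=UNIFORM, base=2):
    """D_b(P_j || Q) for one column of aligned nucleotides."""
    counts = Counter(column)
    total = len(column)
    d = 0.0
    for n in ALPHABET:
        p = counts[n] / total
        if p > 0:  # a term with P_j(N) = 0 contributes 0 by convention
            d += p * math.log(p / background[n], base)
    return d


def relative_entropy(sites, background=UNIFORM, base=2):
    """Sum of positional relative entropies, treating positions as independent."""
    # zip(*sites) turns the list of equal-length sites into its columns.
    return sum(
        positional_relative_entropy(column, background, base)
        for column in zip(*sites)
    )


if __name__ == "__main__":
    # Two toy sites: three fully conserved columns (2 bits each) and one
    # column split evenly between T and A (1 bit).
    print(relative_entropy(["ACGT", "ACGA"]))
```

With the uniform background assumed above, a fully conserved column scores 2 bits, the maximum for a four-letter alphabet, and a column that matches the background scores 0.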
Their PAIRSCAN program attempts to recognize introns using the 5' and 3' splice sites alone. TRIPLESCAN also uses the branch site. INTRONSCAN further uses pentamer composition and length.

In other experiments, they generated randomized artificial introns with splice sites and branch sites of varying relative entropy, and plotted the accuracy of their programs on the artificial data. They found that approximately 30 bits of relative entropy are needed to achieve 98% accuracy. They conclude that for some organisms, the three motif sites examined do not provide enough information for high accuracy of detection.

References
----------

1. T. Akutsu. Hardness results on gapless local multiple sequence alignment. Technical report 98-MPS-42-2, Information Processing Society of Japan, 1998.
2. T. L. Bailey and C. Elkan. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning, 21(1-2):51-80, 1995. See http://meme.sdsc.edu/meme/website/papers.html
3. G. Z. Hertz and G. D. Stormo. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15(7/8):563-577, July/August 1999. See http://ural.wustl.edu/publications.html
4. C. E. Lawrence and A. A. Reilly. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: Structure, Function and Genetics, 7:41-51, 1990.
5. G. D. Stormo and G. W. Hartzell III. Identifying protein-binding sites from unaligned DNA fragments. Proc. Nat. Acad. Science, U.S.A., 86:1183-1187, 1989.
6. Finding instances of unknown sites, Lecture 10, CS527, Winter 2000, CS&E, U. Washington. See http://www.cs.washington.edu/education/courses/527/00wi/
7. L. P. Lim and C. B. Burge. A computational analysis of sequence features involved in recognition of short introns. PNAS, 98(20):11193-11198, September 25, 2001. See www.pnas.org.