Finding Instances of Unknown Sites
----------------------------------

The problem considered in this lecture is that of "identifying and
characterizing shared motifs in a set of unaligned sequences" [2].
Loosely speaking, a motif is a pattern common to a set of short
sequences, or sites, where the sites share some signal associated with
their common biological property. Examples of such sites include the
binding site for a regulatory protein, transcription start and stop
sites, ribosome binding sites in prokaryotes, or intron splice sites.

For example, the cyclic AMP receptor protein (CRP) is a transcription
factor in E. coli. Its binding sites are DNA sequences of length
approximately 22. The following table, taken from Stormo and
Hartzell [5], shows just positions 3-9 (out of the 22 sequence positions)
in 23 bona fide CRP binding sites.

TTGTGGC
TTTTGAT
AAGTGTC
ATTTGCA
CTGTGAG
ATGCAAA
GTGTTAA
ATTTGAA
TTGTGAT
ATTTATT
ACGTGAT
ATGTGAG
TTGTGAG
CTGTAAC
CTGTGAA
TTGTGAC
GCCTGAC
TTGTGAT
TTGTGAT
GTGTGAA
CTGTGAC
ATGAGAC
TTGTGAG

Notice that in the second column T predominates and in the third
column G predominates, for example.  Recall that the content of these
sites can be summarized in a profile, or weight, matrix:

A 0.35 0.043 0     0.043 0.13  0.83  0.26
C 0.17 0.087 0.043 0.043 0     0.043 0.3
G 0.13 0     0.78  0     0.83  0.043 0.17
T 0.35 0.87  0.17  0.91  0.043 0.087 0.26
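
As a minimal sketch of how such a profile matrix can be computed from
aligned sites, the following Python fragment counts nucleotides column
by column and divides by the number of sites. The names CRP_SITES and
profile_matrix are illustrative choices, not part of the original notes.

```python
from collections import Counter

# The 23 CRP site fragments (positions 3-9) from the table above.
CRP_SITES = """
TTGTGGC TTTTGAT AAGTGTC ATTTGCA CTGTGAG ATGCAAA GTGTTAA ATTTGAA
TTGTGAT ATTTATT ACGTGAT ATGTGAG TTGTGAG CTGTAAC CTGTGAA TTGTGAC
GCCTGAC TTGTGAT TTGTGAT GTGTGAA CTGTGAC ATGAGAC TTGTGAG
""".split()


def profile_matrix(sites, alphabet="ACGT"):
    """Return {nucleotide: [frequency at each position]} for aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    profile = {n: [] for n in alphabet}
    for j in range(length):
        # Count how often each nucleotide occurs in column j.
        counts = Counter(s[j] for s in sites)
        for n in alphabet:
            profile[n].append(counts[n] / len(sites))
    return profile


if __name__ == "__main__":
    # Print one row per nucleotide, rounded to two significant figures.
    for n, row in profile_matrix(CRP_SITES).items():
        print(n, " ".join(f"{p:.2g}" for p in row))
```

Each column of the result sums to 1, so it can be read directly as the
distribution over {A,C,G,T} at that position.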

Since our goal will be to find unknown motifs in unaligned sequences,
we need a measure of the "amount of signal" in a motif.  We will
only consider here motifs whose sites all have the same length and
contain no gaps. One good option would be the information content of
the set of sites. We'll use a closely related measure that allows us
to take the background distribution of the sequences into account,
namely the relative entropy of the set of sites.

We can express the relative entropy of the set of sites as the sum of
the positional relative entropies, assuming that the positions are
independent. The positional relative entropy is defined as follows.
Suppose we have a background distribution Q(N), N in {A,C,G,T}, giving
the frequency of each nucleotide.  Let P be the profile matrix for our
set of sites. Then each column P_j of P is also a probability
distribution over {A,C,G,T}. The relative entropy for position j
(relative to Q) is defined as:

D_b(P_j || Q) = sum_{N in {A,C,G,T}} P_j(N) log_b (P_j(N) / Q(N)).

For the example above, the positional relative entropies, measured
in base 2 against a uniform background (Q(N) = 1/4), are

0.12 1.3 1.1 1.5 1.2 1.1 0.027
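
The positional values can be recomputed directly from the 23 sites. The
sketch below assumes a uniform background, Q(N) = 1/4, and treats terms
with P_j(N) = 0 as contributing 0 (the usual 0 log 0 = 0 convention);
under those assumptions it reproduces the numbers quoted above to two
significant figures. Function and variable names are illustrative.

```python
import math

# The 23 CRP site fragments (positions 3-9) from the table above.
CRP_SITES = """
TTGTGGC TTTTGAT AAGTGTC ATTTGCA CTGTGAG ATGCAAA GTGTTAA ATTTGAA
TTGTGAT ATTTATT ACGTGAT ATGTGAG TTGTGAG CTGTAAC CTGTGAA TTGTGAC
GCCTGAC TTGTGAT TTGTGAT GTGTGAA CTGTGAC ATGAGAC TTGTGAG
""".split()

# Assumed background distribution: uniform over the four nucleotides.
UNIFORM = {n: 0.25 for n in "ACGT"}


def column_distribution(sites, j, alphabet="ACGT"):
    """Observed nucleotide frequencies P_j at position j of the sites."""
    column = [s[j] for s in sites]
    return {n: column.count(n) / len(column) for n in alphabet}


def relative_entropy(p, q, base=2.0):
    """D_b(p || q); terms with p[n] == 0 are skipped (0 log 0 = 0)."""
    return sum(p[n] * math.log(p[n] / q[n], base) for n in p if p[n] > 0)


def positional_relative_entropies(sites, q=UNIFORM, base=2.0):
    """One relative entropy value per column of the aligned sites."""
    return [
        relative_entropy(column_distribution(sites, j), q, base)
        for j in range(len(sites[0]))
    ]


if __name__ == "__main__":
    values = positional_relative_entropies(CRP_SITES)
    print(" ".join(f"{d:.2g}" for d in values))
    print("total:", round(sum(values), 2))
```

Summing the per-position values gives the relative entropy of the whole
set of sites under the independence assumption.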

We can now make the problem we are addressing precise: Given k
sequences s1, s2, ..., sk, a target length l for the sites, and
a background distribution on the nucleotide frequencies, find a set T
of sites (contiguous subsequences), each of length l, one per
sequence, that has maximum relative entropy.  Call this the relative
entropy site selection problem.
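
To make the problem statement concrete, here is a brute-force sketch
that tries every combination of one start position per sequence and
keeps the combination with maximum relative entropy. It is exponential
in k and only usable on toy inputs; it is meant as an illustration of
the definition, not as a practical method. The function names and the
toy sequences are made up for this example.

```python
import itertools
import math


def total_relative_entropy(sites, q, base=2.0):
    """Sum of positional relative entropies of equal-length sites."""
    k = len(sites)
    total = 0.0
    for column in zip(*sites):
        # Only nucleotides that occur contribute (0 log 0 = 0).
        for n in set(column):
            p = column.count(n) / k
            total += p * math.log(p / q[n], base)
    return total


def best_sites_brute_force(seqs, l, q):
    """Exhaustively choose one length-l site per sequence.

    Returns (score, sites). Tries prod(len(s) - l + 1) combinations.
    """
    starts_per_seq = [range(len(s) - l + 1) for s in seqs]
    best_score, best_sites = -math.inf, None
    for starts in itertools.product(*starts_per_seq):
        sites = [s[i:i + l] for s, i in zip(seqs, starts)]
        score = total_relative_entropy(sites, q)
        if score > best_score:
            best_score, best_sites = score, sites
    return best_score, best_sites


if __name__ == "__main__":
    uniform = {n: 0.25 for n in "ACGT"}          # assumed background
    toy = ["ACGTTGAC", "GGTTGACA", "TTGACCCC"]   # made-up sequences
    print(best_sites_brute_force(toy, 5, uniform))
```

With m candidate positions per sequence the loop visits m^k
combinations, which is why the exhaustive approach does not scale.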

Unfortunately, this problem is NP-complete [1].

We now follow exactly the lecture notes [6].

We note that another technique that has been used to solve a similarly
formulated site selection problem is expectation maximization. See
the work of Lawrence and Reilly [4] and of Bailey and Elkan [2], who
developed the MEME program.

Application to analysis of sequence features in short introns
-------------------------------------------------------------

Lim and Burge [7] applied some of the methods described here to
develop tools for recognition of short introns. Introns are characterized
by their 5' and 3' splice sites, as well as by a branch site involved
in the splicing process. For several organisms, using known intron sequences,
they used a simple weight matrix model of the 5' and 3' splice sites.
It was not possible to determine the branch site from the aligned
sequences, so they used the Gibbs sampling method to determine the
most likely branch site for each organism. (See slide for pictograms
of these three sites.) They calculated the relative entropy of each
of these sites.
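
As a rough illustration of how a weight matrix model can be used to
recognize a site, the sketch below builds a log-odds matrix from a few
aligned example sites and scans a sequence for the best-scoring window.
This is not Lim and Burge's implementation: the training sites, the
pseudocount of 1, the uniform background, and all function names are
assumptions made for this example.

```python
import math


def log_odds_matrix(sites, q, pseudocount=1.0, alphabet="ACGT"):
    """Per-position log2(P_j(N) / Q(N)), with pseudocounts to avoid log 0."""
    k = len(sites)
    denom = k + pseudocount * len(alphabet)
    return [
        {n: math.log2((column.count(n) + pseudocount) / denom / q[n])
         for n in alphabet}
        for column in zip(*sites)
    ]


def best_window(seq, matrix):
    """Return (score, start) of the highest-scoring window in seq."""
    l = len(matrix)
    scored = [
        (sum(matrix[j][seq[i + j]] for j in range(l)), i)
        for i in range(len(seq) - l + 1)
    ]
    return max(scored)


if __name__ == "__main__":
    uniform = {n: 0.25 for n in "ACGT"}                   # assumed background
    toy_sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]  # made-up sites
    matrix = log_odds_matrix(toy_sites, uniform)
    score, start = best_window("CCATGGTAAGTCC", matrix)
    print(f"best window starts at {start} with score {score:.2f} bits")
```

A window's score is the sum of independent per-position terms, which
mirrors the independence assumption used for relative entropy above.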

Their PAIRSCAN program attempts to recognize introns using the 5' and 3'
splice sites alone. TRIPLESCAN also uses the branch site. INTRONSCAN
further uses pentamer composition and length. 

In other experiments, they generated randomized artificial
introns with splice sites and branch site of varying relative entropy,
and plotted the accuracy of their programs on the artificial data.
They found that approximately 30 bits of relative entropy are needed
to achieve 98% accuracy. They conclude that for some organisms, the
three motif sites examined do not provide enough information for high
accuracy of detection.


References
----------

1. T. Akutsu.  Hardness results on gapless local multiple sequence
alignment.  Technical report 98-MPS-42-2, Information Processing
Society of Japan, 1998.

2. T. L. Bailey and C. Elkan. Unsupervised learning of multiple motifs
in biopolymers using expectation maximization.  Machine Learning,
21(1-2):51-80, 1995.  See
http://meme.sdsc.edu/meme/website/papers.html

3.  G. Z. Hertz and G. D. Stormo.  Identifying DNA and protein
patterns with statistically significant alignments of multiple
sequences.  Bioinformatics, 15(7/8):563-577, July/August 1999.  See
http://ural.wustl.edu/publications.html

4. C. E. Lawrence and A. A. Reilly.  An expectation maximization (EM)
algorithm for the identification and characterization of common sites
in unaligned biopolymer sequences.  Proteins: structure, function and
genetics, 7:41-51, 1990.
 
5. G. D. Stormo and G. W. Hartzell III.  Identifying protein-binding
sites from unaligned DNA fragments.  Proc. Nat. Acad. Science, U.S.A.,
86:1183-1187, 1989.

6.  Finding instances of unknown sites, Lecture 10, CS527, Winter
2000, CS&E, U. Washington.  See
http://www.cs.washington.edu/education/courses/527/00wi/

7. L. P. Lim and C. B. Burge. A computational analysis of sequence
features involved in recognition of short introns, PNAS,
98(20):11193-11198, September 25, 2001. See www.pnas.org.