CPSC 536A - Bioinformatics: Project Ideas

# CPSC 536A: Project Ideas

## General Remarks:

• We expect that most projects will be worked on by groups of three students. Ideally, these three students should cover a wide range of backgrounds, in particular, the few non-CS students we have should be well distributed between different groups.
• The following are just project ideas. If you have other ideas for potential projects, come talk to Anne and/or Holger - we look forward to hearing about your ideas!
• Start discussing project ideas you are interested in with your fellow students and with us (Anne and Holger) as soon as possible - you need to have selected a project by January 16th and a proposal has to be submitted by January 25th.

1. Testing for absence of secondary structure in combinatorial sets of DNA strands.

Background.
-----------

A single-stranded DNA molecule tends to fold on itself so that
complementary base pairs bond, thereby lowering the so-called free
energy'' of the molecule.  Think of the secondary structure of a DNA
strand 5'-w1 w2 w3 ... wn-3' as a set of pairs (i,j), 1
<= i < j <= n.  Simple energy models associate a free energy with a
secondary structure of a molecule. Roughly, adjacent complementary
base pairs in the structure tend to lower the free energy of the
structure, thereby stabilizing it, whereas loops formed from unpaired
bases tend to raise the free energy.
Dynamic programming algorithms are known that can predict the
minmimum-energy pseudoknot-free secondary structure of a DNA
strand. (A structure is pseudo-knot free if no two base pairs (i,j)
and (i',j') are such that i < j < i' < j'.)

Motivation.
-----------

In DNA computing applications, large combinatorial sets of DNA strands
are constructed from short strands called DNA words. For
example, from 2n strands s1,
s1', s2, s2', sn, sn', the following
set T can be constructed that represents all binary strings
of length n:

T = { w1w2...wn |  wi is either si or si' }.

Many heuristic methods have been proposed and used to design the short
DNA words (i.e. s1, s1', ... in the above example).  These
methods rely on "local" properties of the words, such as high
Hamming distance between pairs of distinct words, for example, but the
details are not important here.  A central design goal is that the
long strands in T are structure-free, that is, that the minimum free
energy secondary structure is the empty set.

Project.
--------

The purpose of this project is to understand the complexity of the
following question: given s1, s1', sn, sn',
is it the case that all 2^n strands in T are structure-free?

Specifically, is there a way to extend the dynamic programming
structure prediction algorithms to obtain a polynomial time algorithm
for the above question?

References.
-----------

See notes from Algorithms in Molecular Biology'' (Karp and Ruzzo,
U. Washington, 1998) for RNA secondary structure
prediction.
(http://www.cs.washington.edu/education/courses/590bi/98w/)

M. Zuker and P. Stielger, Optimal computer folding of large RNA
sequences usin g thermodynamics and auxiliary information,'' Nucleic
Acids Res 9,(1981), pages 133-148.

M. Zuker, Algorithms, thermodynamics, and databases for RNA
secondary structur e.'' 6 September 2000.
http://www.ibc.wustl.edu/\~zuker/rna/.

See http://www.cs.ubc.ca/~condon/536/wordsurvey.ps for a survey of
DNA word design.

2. Combinatorial Design of Universal DNA Tag Systems.

Background.
-----------

Short DNA tags are used to label molecules in a
chemical library or to anchor DNA molecules to DNA arrays. A
combinatorial approach to design of DNA tags was proposed by Ben-Dor
et al. (RECOMB, 2000), based on the {\em melting temperature} of
strand duplexes. However, the measure of melting temperature used in
the paper is less than ideal; more accurate measures, based on nearest
neighbour calculations, are preferred.

Project.
--------

Can the methods of Ben-Dor et al. for DNA tag design be
extended to work for the nearest neighbour measure of melting
temperature? If not, how well can heuristic methods do in this case?

This project can combine theoretical and experimental algorithm design
and analysis work.

Reference.
----------

A. Ben-Dor, R. Karp, B. Schwikowski, and Z. Yakhini,
Universal DNA tag systems: a combinatorial design scheme,''
Proc. RECOMB 2000, ACM, pages 65-75.
(See http://www.cs.ubc.ca/~condon/536/ben-doretal00.pdf)

3. A resource bounded theory of DNA self-assembly.

Background.
-----------

Molecular self-assembly naturally gives rise to
numerous complex forms.  Self-assembly of DNA molecules with
programmable interactions can be used to construct structures at a
nano-metre scale.

Rothemund and Winfree propose a formal model for studying
self-assembled objects from the point of view of computational
complexity, and study the complexity of self-assembly of a
square in this model.

Project.
--------

What can you say about the complexity of self-assembly
of other interesting'' structures, using the model of Rothemund and
Winfree? What are natural candidate structures to study?

Reference.
----------

The Program-Size Complexity of Self-Assembled Squares.
P.W.K. Rothemund and E. Winfree, Proc. Thirty-Second Annual ACM
Symposium on Theory of Computing, May, 2000.
(See http://waggle.gg.caltech.edu/~winfree/old\_html/DNAresearch.html
for this paper and others by Winfree.)

4. Phylogenetic classification of organisms based on sequence tags.

Background.
-----------

Phylogenetic classification based on biomolecular sequence data
(DNA, RNA, or protein) is widely used for determining evolutionary
relationships between organisms or their components.

Recently, phylogenetic classification techniques are being considered
in the context of studying microbial communities that potentially
contain huge numbers of yet unknown organisms. On this scale,
only small fragments (tags) of the sequences used for phylogenetic tree
reconstruction (typically ribosomal RNA) can be extracted and analysed.

Project.
--------

Given a (potentially partial) phylogenetic tree, develop strategies
and algorithms for placing sequences in that tree based on knowledge
of a short subsequence (tag) only.

This project is potentially of immediate and significant practical relevance
in the context of a collaboration we have with Bill Mohn and Michael Murphy
from the UBC Dept of Microbiology and Immunology.

References.
-----------

T.Brock, M.Madigan, J.Martinko, J.Parker: Biology of Microorganisms.
Prentice Hall, Seventh Edition, 1994. (Sections 18.4-18.5)

Durbin, Eddy, Krogh, Mitchison: Biological sequence analysis:
Probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.
(Chapters 7 and 8)

W. Ludwig, K.-H. Schleifer: Phylogeny of Becteria beyond the 16S rRNA Standard.
ASM News, Volume 65, Number 11, 1999.

5. RNA Secondary Structure Prediction.

Background.
-----------

Many classes of RNA molecules, such as tRNA or ribosomal RNA,
fold into particular structures (by bonding between bases within the
same strand) which determine their function. Algorithms exist for
predicting RNA secondary structures, however most of them are restricted
in the type of structure they can predict.
In particular, almost all existing RNA structure prediction programmes
cannot deal with pseudoknots or multiple, interacting strands.

Project.
--------

Develop and test a (heuristic) algorithm that can predict RNA secondary structures
with pseudoknots. Since the problem is combinatorially hard, we expect that
stochastic local seach techniques might be particularly effective for solving
it algorithmically.

References.
-----------

See notes from Algorithms in Molecular Biology'' (Karp and Ruzzo,
U. Washington, 1998) for RNA secondary structure
prediction.
(http://www.cs.washington.edu/education/courses/590bi/98w/)

M. Zuker and P. Stielger, Optimal computer folding of large RNA
sequences usin g thermodynamics and auxiliary information,'' Nucleic
Acids Res 9,(1981), pages 133-148.

M. Zuker, Algorithms, thermodynamics, and databases for RNA
secondary structur e.'' 6 September 2000.
http://www.ibc.wustl.edu/\~zuker/rna/.

M. Zuker: RNA Folding by Energy Minimization.
http://bioinfo.math.rpi.edu/~zukerm/Bio-5495/RNAfold-html/node3.html

I. Hofacker: Vienna RNA Package. http://www.tbi.univie.ac.at/~ivo/RNA/

6. Multiple digestion restriction site mapping.

Background.
-----------

A fairly old but still relevant method for mapping genomes
is to do multiple digests with different restriction enzymes
(single enzymes and combinations) and inferring the position
of the restriction sites from the fragments obtained from these
reactions. This is a hard combinatorial problem, and various approaches
have been tried for it.

Project.
--------

Implement a few known techniques for solving this problem,
try to improve them with modern heuristic techniques, and
do a comparative analysis of their performance on real genomic data.

References.
-----------

Andrew Kwon: Restriction Enzyme Mapping. CPSC 502 Project Report,
University of BC, 2000. (Contains lots of further references.)

Goldstein, Waterman: Mapping DNA by stochastic relaxation.
Advances in Applied Mathematics, Vol. 8, pp.194-207, 1997.

Grigorjev, Mironov: Mapping DNA by stochastic relaxation: a new
approach to fragment sizes. Computer Applications in Biosciences,
Vol.6, pp.107-111, 1990.

R.M.Karp: Mapping the Genome: Some Combinatorial Problems Arising
in Molecular Biology. 25th ACM STOC, pp.278-285, 1993.