CPSC 536A - Bioinformatics: Project Ideas

CPSC 536A: Project Ideas


General Remarks:


1. Testing for absence of secondary structure in combinatorial sets of DNA strands. Background. ----------- A single-stranded DNA molecule tends to fold on itself so that complementary base pairs bond, thereby lowering the so-called ``free energy'' of the molecule. Think of the secondary structure of a DNA strand 5'-w1 w2 w3 ... wn-3' as a set of pairs (i,j), 1 <= i < j <= n. Simple energy models associate a free energy with a secondary structure of a molecule. Roughly, adjacent complementary base pairs in the structure tend to lower the free energy of the structure, thereby stabilizing it, whereas loops formed from unpaired bases tend to raise the free energy. Dynamic programming algorithms are known that can predict the minmimum-energy pseudoknot-free secondary structure of a DNA strand. (A structure is pseudo-knot free if no two base pairs (i,j) and (i',j') are such that i < j < i' < j'.) Motivation. ----------- In DNA computing applications, large combinatorial sets of DNA strands are constructed from short strands called DNA words. For example, from 2n strands s1, s1', s2, s2', sn, sn', the following set T can be constructed that represents all binary strings of length n: T = { w1w2...wn | wi is either si or si' }. Many heuristic methods have been proposed and used to design the short DNA words (i.e. s1, s1', ... in the above example). These methods rely on "local" properties of the words, such as high Hamming distance between pairs of distinct words, for example, but the details are not important here. A central design goal is that the long strands in T are structure-free, that is, that the minimum free energy secondary structure is the empty set. Project. -------- The purpose of this project is to understand the complexity of the following question: given s1, s1', sn, sn', is it the case that all 2^n strands in T are structure-free? Specifically, is there a way to extend the dynamic programming structure prediction algorithms to obtain a polynomial time algorithm for the above question? References. ----------- See notes from ``Algorithms in Molecular Biology'' (Karp and Ruzzo, U. Washington, 1998) for RNA secondary structure prediction. (http://www.cs.washington.edu/education/courses/590bi/98w/) M. Zuker and P. Stielger, ``Optimal computer folding of large RNA sequences usin g thermodynamics and auxiliary information,'' Nucleic Acids Res 9,(1981), pages 133-148. M. Zuker, ``Algorithms, thermodynamics, and databases for RNA secondary structur e.'' 6 September 2000. http://www.ibc.wustl.edu/\~zuker/rna/. See http://www.cs.ubc.ca/~condon/536/wordsurvey.ps for a survey of DNA word design.
2. Combinatorial Design of Universal DNA Tag Systems. Background. ----------- Short DNA tags are used to label molecules in a chemical library or to anchor DNA molecules to DNA arrays. A combinatorial approach to design of DNA tags was proposed by Ben-Dor et al. (RECOMB, 2000), based on the {\em melting temperature} of strand duplexes. However, the measure of melting temperature used in the paper is less than ideal; more accurate measures, based on nearest neighbour calculations, are preferred. Project. -------- Can the methods of Ben-Dor et al. for DNA tag design be extended to work for the nearest neighbour measure of melting temperature? If not, how well can heuristic methods do in this case? This project can combine theoretical and experimental algorithm design and analysis work. Reference. ---------- A. Ben-Dor, R. Karp, B. Schwikowski, and Z. Yakhini, ``Universal DNA tag systems: a combinatorial design scheme,'' Proc. RECOMB 2000, ACM, pages 65-75. (See http://www.cs.ubc.ca/~condon/536/ben-doretal00.pdf)
3. A resource bounded theory of DNA self-assembly. Background. ----------- Molecular self-assembly naturally gives rise to numerous complex forms. Self-assembly of DNA molecules with programmable interactions can be used to construct structures at a nano-metre scale. Rothemund and Winfree propose a formal model for studying self-assembled objects from the point of view of computational complexity, and study the complexity of self-assembly of a square in this model. Project. -------- What can you say about the complexity of self-assembly of other ``interesting'' structures, using the model of Rothemund and Winfree? What are natural candidate structures to study? Reference. ---------- The Program-Size Complexity of Self-Assembled Squares. P.W.K. Rothemund and E. Winfree, Proc. Thirty-Second Annual ACM Symposium on Theory of Computing, May, 2000. (See http://waggle.gg.caltech.edu/~winfree/old\_html/DNAresearch.html for this paper and others by Winfree.)
4. Phylogenetic classification of organisms based on sequence tags. Background. ----------- Phylogenetic classification based on biomolecular sequence data (DNA, RNA, or protein) is widely used for determining evolutionary relationships between organisms or their components. Recently, phylogenetic classification techniques are being considered in the context of studying microbial communities that potentially contain huge numbers of yet unknown organisms. On this scale, only small fragments (tags) of the sequences used for phylogenetic tree reconstruction (typically ribosomal RNA) can be extracted and analysed. Project. -------- Given a (potentially partial) phylogenetic tree, develop strategies and algorithms for placing sequences in that tree based on knowledge of a short subsequence (tag) only. This project is potentially of immediate and significant practical relevance in the context of a collaboration we have with Bill Mohn and Michael Murphy from the UBC Dept of Microbiology and Immunology. References. ----------- T.Brock, M.Madigan, J.Martinko, J.Parker: Biology of Microorganisms. Prentice Hall, Seventh Edition, 1994. (Sections 18.4-18.5) Durbin, Eddy, Krogh, Mitchison: Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998. (Chapters 7 and 8) W. Ludwig, K.-H. Schleifer: Phylogeny of Becteria beyond the 16S rRNA Standard. ASM News, Volume 65, Number 11, 1999.
5. RNA Secondary Structure Prediction. Background. ----------- Many classes of RNA molecules, such as tRNA or ribosomal RNA, fold into particular structures (by bonding between bases within the same strand) which determine their function. Algorithms exist for predicting RNA secondary structures, however most of them are restricted in the type of structure they can predict. In particular, almost all existing RNA structure prediction programmes cannot deal with pseudoknots or multiple, interacting strands. Project. -------- Develop and test a (heuristic) algorithm that can predict RNA secondary structures with pseudoknots. Since the problem is combinatorially hard, we expect that stochastic local seach techniques might be particularly effective for solving it algorithmically. References. ----------- See notes from ``Algorithms in Molecular Biology'' (Karp and Ruzzo, U. Washington, 1998) for RNA secondary structure prediction. (http://www.cs.washington.edu/education/courses/590bi/98w/) M. Zuker and P. Stielger, ``Optimal computer folding of large RNA sequences usin g thermodynamics and auxiliary information,'' Nucleic Acids Res 9,(1981), pages 133-148. M. Zuker, ``Algorithms, thermodynamics, and databases for RNA secondary structur e.'' 6 September 2000. http://www.ibc.wustl.edu/\~zuker/rna/. M. Zuker: RNA Folding by Energy Minimization. http://bioinfo.math.rpi.edu/~zukerm/Bio-5495/RNAfold-html/node3.html I. Hofacker: Vienna RNA Package. http://www.tbi.univie.ac.at/~ivo/RNA/
6. Multiple digestion restriction site mapping. Background. ----------- A fairly old but still relevant method for mapping genomes is to do multiple digests with different restriction enzymes (single enzymes and combinations) and inferring the position of the restriction sites from the fragments obtained from these reactions. This is a hard combinatorial problem, and various approaches have been tried for it. Project. -------- Implement a few known techniques for solving this problem, try to improve them with modern heuristic techniques, and do a comparative analysis of their performance on real genomic data. References. ----------- Andrew Kwon: Restriction Enzyme Mapping. CPSC 502 Project Report, University of BC, 2000. (Contains lots of further references.) Goldstein, Waterman: Mapping DNA by stochastic relaxation. Advances in Applied Mathematics, Vol. 8, pp.194-207, 1997. Grigorjev, Mironov: Mapping DNA by stochastic relaxation: a new approach to fragment sizes. Computer Applications in Biosciences, Vol.6, pp.107-111, 1990. R.M.Karp: Mapping the Genome: Some Combinatorial Problems Arising in Molecular Biology. 25th ACM STOC, pp.278-285, 1993.