CPSC 545 - Project Ideas


General Remarks:


1. Multiple Sequence Alignment

The problem of finding alignments of multiple sequences (DNA or protein) arises in many bioinformatics applications and often forms the basis for further computations, e.g., in phylogenetic tree construction. Unfortunately, determining optimal multiple sequence alignments is computationally hard (under certain, not realistic conditions, the problem has been shown to be NP-hard). A variety of heuristic techniques have been applied to this problem, including Genetic Algorithms and Simulated Annealing.

The goal of this project is to implement and empirically study the behaviour of iterative refinement methods for multiple sequence alignment using the SP (sum of pairs) score.

After familiarising yourself with the general approach and conducting a literature survey, you might want to implement two or three different methods (these could be all rather closely related) and study their performance on a set of homologous biological sequences. You might also consider additional tests on sets of sequences obtained from applying simulated evolution (according to one of the commonly used sequence evolution models, e.g., the Jukes-Cantor model).

References:


2. Protein Design in the 2D HP Model

De novo design of proteins is an approach for designing novel proteins with predetermined properties. It gives rise to the following protein design problem or inverse protein folding problem: Given a protein structure find an amino-acid sequence that folds into the specified structure as its native state. The two-dimensional Hydrophobic-Hydrophilic (2D HP) model is an extremely simplified model of protein structure that has been extensively studied by biochemists and physicists, and models certain properties of real proteins.

The project is to implement an algorithms for protein design in the 2D HP model (ideally, a new algorithm that has not been previously applied to the problem) and to carefully evaluate its performance against results known from the literature. Furthermore, given protein structures can be investigated in terms of "designability", e.g., by determining the number of sequences that have a given structure as their native structure. Implementations of high-performance algorithms for 2D HP protein structure prediction will be provided.

References:


3. Combinatorial DNA Design: A Coding Theory Approach

A DNA strand can be seen as a string over a quaternary alphabet with the letters A,C,G,T.

Combinatorial methods for designing DNA strands have been used to obtain sets of DNA strands that do not hybridize between themselves and only with perfect complements.

Some methods used for this purpose can be found in the coding theory literature [1,2,3]. Nevertheless, few publications report results on time complexity and algorithmic behavior for their methods. The project is to implement and possibly expand one (or more) of the algorithms presented in [1,2] and/or [3], and to characterise its performance, with the goal of gaining a better understanding of the empirical performance of these methods and possibly achieving improved results.

References:

  1. K.U. Koshnick (1991), "Some new constant weight codes," IEEE Transactions on Information Theory, 37, 370-371.
  2. G. Dueck and T. Scheuer (1990), "Threshold accepting: a general purpose optimization algorithm appearing superior to simulated annealing," Journal of Computational Physics, 90, 161-175
  3. F. Comellas and R. Roca (1993), "Using genetic algorithms to design constant weight codes," Proceedings of the International Workshop on Applications of Neural Networks to Telecomunication, LAwrence Erlbaum, Hillsdale, NJ, 119-124.
  4. A.E. Brouwer, J.B. Shearer, N.J.A. Sloane, W.D. Smith (1990), "A new table of constant weight codes," IEEE Transactions on Information Theory, 36, 1334-1380.

4. Design of DNA Strand Sets with Unequal Length

The accurate design of DNA molecules for biomolecular computations and nano-assemblies is of great interest to many researchers today. Many approaches to the problem of designing sets of DNA strand with desirable properties have been applied so far. Most of these methods, however, are designed to generate sets of strands that are all of equal length.

The project is to develop an algorithm (or eventually to modify a currently existing method [1]) to design DNA strands of different length. All strands must fullfil similar constraints (combinatorial and thermodynamic). The algorithm should be able to build new sets from scratch as well as to extend a initial set of DNA strands and/or to generate new DNA strands without modifying the existing ones, such that the entire resulting set fulfils all given constraints. The new algorithm should be analysed empirically and its performance should be compared to existing methods.

References:

  1. Dan C. Tulpan, Holger H. Hoos, Anne Condon, "Stochastic Local Search Algorithms for DNA Word Design," DNA 8 2002: 229-241.

last update 04/09/20, hh