CPSC 545 - Project Ideas
General Remarks:
- Projects should be worked on by groups of two or three students
all of whom are taking the course for credit.
Each group needs to include at least one CS or ECE graduate students.
- The following are just project ideas. If you have other ideas for potential
projects, come talk to Holger - I look forward to hearing
about your ideas!
- Start discussing project ideas you are interested in with your fellow students 
and with me (Holger) as soon as possible - you need to have selected
a project by September 27th and a proposal has to be submitted by 
October 11th. The proposal should be about 2-3 pages long and needs to reflect
your understanding of the problem, briefly discuss existing
and related work, describe the proposed work,
specify a work schedule and distribution of work between the group members,
and list relevant literature.
1. Multiple Sequence Alignment
The problem of finding alignments of multiple sequences (DNA or protein)
arises in many bioinformatics applications and often forms the basis
for further computations, e.g., in phylogenetic tree construction.
Unfortunately, determining optimal multiple sequence alignments
is computationally hard (under certain, not realistic conditions,
the problem has been shown to be NP-hard). A variety of 
heuristic techniques have been applied to this problem, including
Genetic Algorithms and Simulated Annealing.
The goal of this project is to implement and empirically
study the behaviour of iterative refinement methods 
for multiple sequence alignment 
using the SP (sum of pairs) score.
After familiarising yourself with the 
general approach and conducting a literature survey,
you might want to implement two or three different methods 
(these could be all rather closely related) and study their
performance on a set of homologous biological sequences.
You might also consider additional tests on sets of sequences
obtained from applying simulated evolution (according to
one of the commonly used sequence evolution models, e.g., the
Jukes-Cantor model).
References:
- 
Durbin, Eddy, Krogh, Mitchison: Biological Sequence Analysis.
Cambridge University Press, 1998; Ch.6. 
(This book is available from the Reading Room; it contains
a good introduction to the problem and contains further references.)
- 
O. Gotoh: Multiple sequence alignment:  algorithms and applications.
    Adv. Biophys. 36, 159-206, 1999.
- Udo Tönges, Sören W. Perrey, Jens Stoye, Andreas W. M. Dres:
A General Method for Fast Multiple Sequence Alignment;
Preprint, Universität Bielefeld, 1996.
[ electronic version ]
                
- Multiple Alignment Resource Page of the VSNS BioComputing Division:
www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html
2. Protein Design in the 2D HP Model
De novo design of proteins is an approach for designing
novel proteins with predetermined properties.
It gives rise to the following protein design problem or 
inverse protein folding problem: 
Given a protein structure find an amino-acid sequence 
that folds into the specified structure as its native state. The 
two-dimensional Hydrophobic-Hydrophilic (2D HP) model is an extremely 
simplified model of protein structure that has been extensively studied by 
biochemists and physicists, and models certain properties of real proteins.
The project is to implement an algorithms for protein design in the 2D HP 
model (ideally, a new algorithm that has not been previously applied to the 
problem) and to carefully evaluate its performance against results known 
from the literature. Furthermore, given protein structures 
can be investigated in terms of "designability", e.g., by determining
the number of sequences that have a given structure as 
their native structure. 
Implementations of high-performance algorithms for 2D HP protein structure
prediction will be provided.
References:
    -  J. Deutsch, and T. Kurosky: New Algorithm for Protein Design. 
Physical Review Letters, Vol 76, Num 2, pp. 323-326, 1996.	
    
-  A. Irback, C. Peterson, F. Pottharst, and E. Sandelin: Monte Carlo 
procedure for protein design. Physical Review E, Vol 58. Num 5, pp. 
5249-5252, 1998.
    
-  C. Micheletti, A. Maritan, J. Banavar: A comparative study of 
existing and new design techniques for protein models. J. of Chem. 
Physics, Vol 110, Num 19, pp. 9730-9738.
    
-  Brian Hayes: Prototeins. American Scientist, Volume 86, No. 3, 1998. 
[ electronic version ]
    
-  2D HP Benchmark Instances 
3. Combinatorial DNA Design: A Coding Theory Approach
A DNA strand can be seen as a string over a quaternary alphabet
with the letters A,C,G,T.
Combinatorial methods for designing DNA strands have been used
to obtain sets of DNA strands that do not hybridize between 
themselves and only with perfect complements.
Some methods used for this purpose can be found in the coding theory 
literature [1,2,3]. Nevertheless, few publications report 
results on time complexity and algorithmic behavior for their methods. 
The project is to implement and possibly expand
one (or more) of the algorithms presented in [1,2] and/or [3],
and to characterise its performance,
with the goal of gaining a better understanding of the empirical
performance of these methods and
possibly achieving improved results.
References:
- K.U. Koshnick (1991), "Some new constant weight codes," IEEE 
	Transactions on Information Theory, 37, 370-371.
- G. Dueck and T. Scheuer (1990), "Threshold accepting: a general 
	purpose optimization algorithm appearing superior to simulated 
	annealing," Journal of Computational Physics, 90, 161-175
- F. Comellas and R. Roca (1993), "Using genetic algorithms to 
	design constant weight codes," Proceedings of the International 
	Workshop on Applications of Neural Networks to Telecomunication, 
	LAwrence Erlbaum, Hillsdale, NJ, 119-124. 
- A.E. Brouwer, J.B. Shearer, N.J.A. Sloane, W.D. Smith (1990), 
	"A new table of constant weight codes," IEEE Transactions on 
	Information Theory, 36, 1334-1380.
4. Design of DNA Strand Sets with Unequal Length
The accurate design of DNA molecules for biomolecular computations and 
nano-assemblies is of great interest to many researchers today. 
Many approaches to the problem of designing sets of DNA strand with 
desirable properties have been applied so far. Most of these
methods, however, are designed to generate sets of strands that
are all of equal length.
The project is to develop an algorithm (or 
eventually to modify a currently existing method [1]) to design DNA 
strands of different length. All strands must fullfil 
similar constraints (combinatorial and thermodynamic). The 
algorithm should be able to build new sets 
from scratch as well as to extend a initial set of DNA 
strands and/or to generate new DNA strands without modifying the 
existing ones, such that the entire resulting set
fulfils all given constraints.
The new algorithm should be analysed empirically 
and its performance should be compared to existing methods.
References:
- 	Dan C. Tulpan, Holger H. Hoos, Anne Condon, "Stochastic Local 
	Search Algorithms for DNA Word Design," DNA 8 2002: 229-241.
last update 04/09/20, hh