MITACS Project Website

The work of the MITACS project on prediction of bio-molecular structure and interaction focuses on bio-molecules: proteins, DNA and RNA. How can a computer predict molecular structure? Broadly speaking, there are two approaches, which can often be combined. Given a protein, DNA, or RNA sequence, one approach aims to infer its structure from known structures of similar sequences. The other approach uses physics-based models of structure formation, such as energy minimization models or kinetic models of folding. Energy minimization approaches are often computationally intractable (NP-hard), and so optimization techniques are necessary. Below, describe such approaches in more detail, to illustrate the types of optimization problems we are focusing on, and ongoing challenges in structure prediction.

1. State-of-the-art algorithms for protein structure prediction

Proteins have long been recognized as the activists of the cell, involved in virtually all cellular processes, including transport of molecules, signaling within and between cells, scaffolding, and immune system responses. Better methods for protein structure prediction and an improved understanding of protein folding form the basis for studying important biological processes in the cell and disease mechanisms as well as for developing new diagnostic tools and treatments.

Currently, the most practical way to predict protein 3D structures is via threading, whereby a given protein sequence is threaded onto a known protein structure, guided by energy functions. The associated optimization problem is computationally intractable (NP-hard). For 3D protein structure prediction, we have developed RAPTOR, a (polynomial time) linear programming formulation for protein structure prediction. RAPTOR ranked first in one category of the annual protein folding competition, CAFASP3. We are currently working on a multi-step approach that will significantly enhance the quality of predictions obtained by RAPTOR. RAPTOR has already been licensed to pharmaceutical companies, where it is applied for pre-screening of proteins for drug targets and protein annotation.

We are also investigating algorithms for protein structure prediction on the 2D HP model. We have also developed a substantially improved version of our previous ACO algorithm for the 2D HP model and extended it to the 3D model. Furthermore, we have developed a replica-exchange Monte Carlo algorithm that uses the pull-move neighbourhood to solve protein structure prediction problems in the HP model even more efficiently. Finally, we have developed a novel approach for solving protein structure prediction problems based on the use of a bin framework for adaptively storing and retrieving promising locally optimal solutions.

We have proposed a version of the inverse protein-folding problem that is more realistic than previous work. We have shown that it is possible to closely approximate any given structure in 2D and find a protein with a native fold compatible with this approximation. For a number of basic structures, we gave a formal proof that our proteins are stable (their native fold is unique) in the HP model.

An iconic structure is the DNA double helix, which results from pairing of complementary bases (A-T, G-C base pairs) in two DNA strands. RNA structure formation arises from similar principles: bonds between pairs of bases (primarily A-U, G-C, and G-U pairs) in a single RNA strand bring subsequences of the strand together to form helical stems.

Good methods for RNA structure prediction would be of great help in elucidating the functions of RNA molecules. While computational prediction of RNA three-dimensional structure is not yet possible, there has been significant success in predicting the set of base pairs, called the secondary structure, from the base sequence of an RNA molecule, and tools for doing this are widely used by biologists. The prediction problem may be cast, in optimization terms, as the task of finding the secondary structure which has minimum free energy (MFE).

We have made progress on three fronts: inference of improved thermodynamic parameters, development of an alternative thermodynamic approach involving the notion of free energy density, and development of approaches that take into account the folding pathway. In order to infer parameters for the thermodynamic energy model for nucleic acid structures, we are using maximum-likelihood techniques, as well as constraint-based approach in which linear constraints on the parameters are derived from known sequences and their structures.

A key limitation of the thermodynamic approach is that it ignores the fact that much of the secondary structure is established during the transcription process. We have shown that delocalizing the thermodynamic cost of forming an RNA substructure by considering the energy density of the substructure can improve on the thermodynamic approach. An alternative objective function which is a linear combination of the free energy and the energy density, can be optimized through a heuristic search procedure which starts with an estimate of the energy density component and iteratively predicts the structure through standard linear optimization - which in turn provides a new estimate for the energy density contribution. The resulting algorithm seems to outperform all available alternatives in predicting the secondary structures of all the pseudoknot free ncRNA families available via the Rfam database.

Co-transcriptional folding of RNA induces 5'-to-3' asymmetries into the structure formation process, and this has important implications for theoretical RNA structure prediction. Most existing RNA structure prediction methods ignore co-transcriptional folding, but instead, fold an already synthesized molecule in a symmetric fashion and mostly aim to predict the minimum free energy RNA structure, i.e. the thermodynamic structure, rather than the functional RNA structure which exerts the molecule's functional role in the cell and which forms during during co-transcriptional folding. We would like to study sets of evolutionarily related RNA genes in order to detect conserved, local RNA structural elements (e.g. hairpin-like structure) which are required to define the correct folding pathway, but which are not necessarily part of the functional structure. We aim to accomplish this (1) by extending the theoretical framework introduced by Meyer (2004) in order to assign asymmetry-values to local structure elements and (2) by developing a novel theoretical framework and methods in order to measure the type and extend of evolutionary conservation of these features.

We have already developed a combinatorial approach for solving the RNA-RNA interaction prediction (RIP) problem between two RNA molecules (e.g. an antisense RNA and its target mRNA). We now have algorithms that minimize the joint free energy between the two RNA molecules. We also describe how to accelerate our technique through several heuristic shortcuts while experimentally maintaining the original accuracy.

We plan to work on a new variant of the nucleic acid design problem, motivated by the goal of gene synthesis. Current methods for gene synthesis assemble genes from short oligos, which can be synthesized, for example, on an array using light-directed methods. The challenge is to determine the best way to segment a long gene into short oligos, so that both the oligo synthesis and assembly phases will proceed smoothly. Criteria for oligo selection include structural criteria (the oligos should not fold on themselves to form secondary structure) and also similarity criteria (two distinct oligos which participate in the same assembly should not be too similar). Current methods for design of oligos, such as the Gene2Oligo approach of Rouillard et al. use somewhat inefficient back-tracking methods for oligo selection, and do not check for some desired structural criteria of the oligos. We will apply our structure prediction algorithms, along with new dynamic programming techniques, to improve the selection of oligos for gene synthesis.

In addition, we hope to resolve the complexity of the inverse RNA secondary structure design problem. It is still open whether the problem is in P or is NP-complete.

Our HomoStRscan program, which detects structural homologues of a known structured RNA in a given genomic sequence, shows very good performance with high sensitivity and specificity. However, its speed is not ideal in searching genomes. Therefore we developed a new filtering algorithm to be a preprocessing step. As such the main requirement for this algorithm is that it should have very low false negative ratio. Combining the two algorithms, that is, running our filtering algorithm first and then running HomoStRscan on the filtered candidate segments, gives us a faster tool for searching RNA structures in a genome. A test run of the combined approach on searching tRNA from a genome with 1.7M bases took about 16 minutes. We will continue to further improve the combined approach.

Another direction of our work is to develop better scoring measures to increase sensitivity and specificity. We also plan to develop statistical ways to evaluate the quality of the motifs found. This evaluation depends on the similarity measure used. We are developing some theoretical properties of good similarity measures, along with methods for normalizing similarity measures so that the result of normalization is sill a good similarity measure.