Subject: Pre-mRNA Secondary Structure Prediction Aids Splice Site Prediction: Data mining and machine learning teach us something about biology?
Presenter: Larry Ruzzo
Abstract:

While DNA is often called the "blueprint for life," in humans and other so-called higher organisms only a few percent of DNA actually encodes information of any known functional relevance. Distinguishing these functionally important parts from the mass of "junk" DNA is an increasingly important computational task, given the floods of genomic sequence data now being produced. As one example of this problem, most protein-coding genes in humans are interrupted by "junk" segments called introns that must be "spliced out" (deleted) by the cell before the sequence can be used to direct the production of its corresponding protein. Consequently, accurate prediction of the sites at which splicing takes place is a critical component of any computational approach to gene prediction in higher organisms.

Existing approaches generally use sequence-based models that capture local dependencies among nucleotides in a small window around the splice site. However, these models are clearly inadequate to fully explain the known biology of splicing. What else might be involved? These RNA molecules are known to fold into complex shapes, and various authors have speculated that these shapes may also play a role in splicing. We present evidence that this is indeed the case. Specifically, computationally predicted secondary structure of moderate-length pre-mRNA subsequences contains information that can be exploited to improve acceptor splice site prediction beyond that possible with conventional sequence-based approaches. Both decision tree and support vector machine classifiers, using folding energy and structure metrics characterizing helix formation near the splice site, achieve a 5-10% reduction in error rate on a human data set. Based on our data, we hypothesize that acceptors preferentially exhibit short helices at the splice site.
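As a rough illustration of what "structure metrics characterizing helix formation near the splice site" could look like, the Python sketch below derives a few helix features from a predicted structure in dot-bracket notation. This is not the authors' feature set: the abstract does not specify the metrics, so the function names, the three features, and the `flank` window parameter are all hypothetical, and the structure string is assumed to come from some external folding program.

```python
def pair_table(structure):
    """Map each paired position to its partner in a dot-bracket string."""
    stack, partner = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            j = stack.pop()
            partner[i], partner[j] = j, i
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return partner


def helices(structure):
    """Return helices as lists of directly stacked base pairs (i, j), i < j."""
    partner = pair_table(structure)
    seen, result = set(), []
    for i in sorted(partner):
        j = partner[i]
        if i > j or i in seen:
            continue  # visit each pair once, from its 5' side
        helix = [(i, j)]
        seen.add(i)
        # Extend inward while the next pair stacks directly on this one.
        while i + 1 < j - 1 and partner.get(i + 1) == j - 1:
            i, j = i + 1, j - 1
            helix.append((i, j))
            seen.add(i)
        result.append(helix)
    return result


def helix_features(structure, site, flank=5):
    """Hypothetical helix metrics for a window of +/- flank around `site`."""
    lo = max(0, site - flank)
    hi = min(len(structure) - 1, site + flank)
    # A helix "touches" the window if any of its bases falls inside it.
    touching = [h for h in helices(structure)
                if any(lo <= p <= hi for pair in h for p in pair)]
    paired = sum(1 for k in range(lo, hi + 1) if structure[k] != ".")
    return {
        "paired_fraction": paired / (hi - lo + 1),
        "helix_count": len(touching),
        "max_helix_len": max((len(h) for h in touching), default=0),
    }


if __name__ == "__main__":
    # Toy structure, not real data: two hairpins of 3 and 2 stacked pairs.
    toy = "..(((...)))..((..)).."
    print(helix_features(toy, site=9, flank=2))
```

Features of this kind, together with a folding energy reported by the folding program, could then be given to a decision tree or support vector machine alongside conventional sequence-based features.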

Reference: Patterson DJ, Yasuhara K, Ruzzo WL. Pre-mRNA secondary structure prediction aids splice site prediction. Pacific Symposium on Biocomputing, Kauai, Hawaii, Jan., 2002, pp. 223-234.