Combining Gene-Finding Programs
Sanja Rogic
Francis Ouellette
Alan Mackworth
Current gene-finding programs are complex integrated systems that incorporate a number of different methods for gene-finding. The set of methods used and the way they are integrated vary between individual programs. It has been observed [1,2] that these different techniques often correctly predict different elements of the gene, suggesting that programs could complement each other, yielding better predictions.
To test this hypothesis, we explored different methods for combining predictions from two gene-finding programs. After extensive evaluation of current eukaryotic gene-finding programs, we chose Genscan [3] and HMMgene [4] for their high prediction accuracy and for the reliable accuracy estimates they attach to each predicted exon. The predictions were combined at the exon level using three separate techniques: decision trees, modified set operations and probabilistic networks.
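As a rough illustration of exon-level combination with set operations, the sketch below merges two prediction lists. It is not the authors' method: the abstract does not specify how the set operations were modified, so the rule used here (keep exons on which both programs agree exactly, and keep single-program exons only when their reported probability reaches a cutoff) is an assumption, as are the `Exon` type, the `combine` function and the `threshold` value.

```python
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    """A predicted exon (hypothetical record layout for this sketch)."""
    start: int
    end: int
    prob: float  # exon probability reported by the gene-finding program


def combine(preds_a: List[Exon], preds_b: List[Exon],
            threshold: float = 0.9) -> List[Tuple[int, int]]:
    """Combine two exon prediction lists.

    Assumed rule (not taken from the abstract):
      * intersection: exons with identical boundaries in both lists are kept;
      * thresholded union: exons predicted by only one program are kept
        only if that program's probability is at least `threshold`.
    """
    a = {(e.start, e.end): e.prob for e in preds_a}
    b = {(e.start, e.end): e.prob for e in preds_b}

    # Exons both programs agree on exactly.
    kept = set(a.keys() & b.keys())

    # Exons predicted by exactly one program, filtered by probability.
    for coords in a.keys() ^ b.keys():
        prob = a[coords] if coords in a else b[coords]
        if prob >= threshold:
            kept.add(coords)

    return sorted(kept)
```

Matching exons by exact boundaries keeps the example short; a fuller treatment would also have to decide what to do with overlapping exons whose boundaries differ.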
Some of these methods yielded notable improvements in prediction accuracy, especially at the exon level: sensitivity increased from 0.76 to 0.79 (a relative gain of 4.0%) and specificity increased from 0.77 to 0.86 (11.7%), compared with the best exon-level accuracy measures achieved by any single program. The successful methods were tested on three independent datasets, each time outperforming every individual gene-finding program. The results were especially good for the dataset containing sequences with several genes, where exon accuracy measures improved by 30% compared with Genscan's results.
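For readers unfamiliar with the measures, the sketch below computes exon-level sensitivity and specificity, assuming the convention common in gene-finding evaluations such as [1]: an exon counts as correct only if both boundaries match the annotation, sensitivity is the fraction of annotated exons predicted correctly, and specificity is the fraction of predicted exons that are correct. The function names are illustrative, and the abstract does not state which exact definitions were used.

```python
from typing import Iterable, Tuple

Coords = Tuple[int, int]  # (start, end) of an exon


def exon_accuracy(predicted: Iterable[Coords],
                  annotated: Iterable[Coords]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the exon level.

    An exon is correct only when both boundaries match exactly.
    """
    predicted, annotated = set(predicted), set(annotated)
    correct = len(predicted & annotated)
    sensitivity = correct / len(annotated) if annotated else 0.0
    specificity = correct / len(predicted) if predicted else 0.0
    return sensitivity, specificity


def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return 100.0 * (after - before) / before
```

Applied to the rounded figures quoted above, `relative_gain(0.76, 0.79)` is roughly 3.9 and `relative_gain(0.77, 0.86)` is roughly 11.7. The small difference from the reported 4.0% is presumably because the quoted accuracy values are themselves rounded.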
An important part of our analysis was the generation of a non-redundant dataset that excludes sequences used to train previously developed gene-finding programs, including Genscan and HMMgene. The set contains 195 human, mouse and rat sequences, each containing one complete single- or multi-exon gene. All sequences in the dataset passed the standard filtering steps used to exclude anomalous sequences. We also verified the exon annotation in the GenBank flat files for these sequences by using the sim4 [5] program to align genomic and mRNA sequences and confirm the exon-intron boundaries.
References
[1] Burset, M. and Guigo, R. (1996) Evaluation of gene structure prediction programs. Genomics 34: 352-367.
[2] Murakami, K. and Takagi, T. (1998) Gene recognition by combination of several gene-finding programs. Bioinformatics 14(8): 665-675.
[3] Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology 268: 78-94.
[4] Krogh, A. (1997) Two methods for improving performance of an HMM and their application for gene finding. Proceedings of Fifth ISMB Conference, edited by T. Gaasterland et al., Menlo Park, CA: AAAI Press, pp. 179-186.
[5] Florea, L. et al. (1998) A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 8(9): 967-974.