STRIN dataset


Characteristics of the dataset

We constructed our dataset from introns whose annotations are consistent between at least two of three databases. Because the vast majority of intron-containing yeast genes contain a single intron and only a few contain two, we included only the single-intron genes and left the two-intron genes for later consideration. This restriction serves solely to make the dataset more uniform. In total, 227 introns had a consistent annotation in at least two databases. Eleven of these were excluded because they were not supported by the latest comparative genomic study ([1]) and were marked as possible misannotations. Two more introns, belonging to the genes YLR202C and YOR318C, were excluded because they are labeled 'dubious' in SGD.
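The selection rule described above (keep an intron only if at least two sources agree on it, then drop flagged genes) can be illustrated with a minimal sketch. This is not the procedure actually used to build the dataset: the function name `consensus_introns`, the database labels `db_a`, `db_b`, `db_c`, the gene names, and the coordinates are all hypothetical toy values, and "consistent annotation" is assumed here to mean identical (start, end) coordinates.

```python
from collections import Counter


def consensus_introns(annotations, excluded=frozenset(), min_sources=2):
    """Return {gene: (start, end)} for introns annotated identically
    in at least `min_sources` databases, skipping genes in `excluded`.

    `annotations` maps a database label to {gene: (start, end)}.
    Assumption: each gene has at most one intron per database, which
    mirrors the single-intron restriction described in the text.
    """
    votes = Counter()
    for introns in annotations.values():
        for gene, coords in introns.items():
            votes[(gene, coords)] += 1
    # With three sources, at most one coordinate pair per gene can
    # reach two votes, so the result has one entry per gene.
    return {
        gene: coords
        for (gene, coords), n in votes.items()
        if n >= min_sources and gene not in excluded
    }


# Toy data (hypothetical): three sources that partly disagree.
annotations = {
    "db_a": {"GENE1": (10, 110), "GENE2": (5, 300)},
    "db_b": {"GENE1": (10, 110), "GENE2": (5, 310), "GENE3": (40, 90)},
    "db_c": {"GENE1": (10, 110), "GENE3": (40, 90), "GENE4": (7, 70)},
}

# GENE3 stands in for an intron flagged as a possible misannotation.
result = consensus_introns(annotations, excluded={"GENE3"})
print(result)
```

In this toy run GENE2 is dropped because the two sources disagree on its end coordinate, GENE4 because only one source lists it, and GENE3 because it is on the exclusion list even though two sources agree.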

Consequently, our final yeast intron dataset contains 214 pre-mRNA introns. Because the annotations are consistent, intron information from two sources can be combined in a classic database join operation: for example, the intron sequence is taken from the AYID database (it is not available in YIDB), and the branchpoint sequence is taken from the YIDB database (it is not available in AYID). All of these introns are part of protein-coding genes: 95 code for ribosomal proteins, 84 code for proteins with other known cellular functions, and 35 code for proteins of unknown function. The dataset contains 159 experimentally verified and 55 putative introns. The vast majority of introns are located in the translated portion of a gene, while 12 introns are located in the 5' untranslated region (UTR). This information was collected in January 2006, when the dataset was last updated.
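The join mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the function name `join_on_gene`, the field names `intron_sequence` and `branchpoint`, and the gene names and sequences are hypothetical toy values, not the actual AYID or YIDB formats.

```python
def join_on_gene(sequences, branchpoints):
    """Inner join of two {gene: value} tables on the gene name.

    `sequences` plays the role of the source that provides intron
    sequences, `branchpoints` the source that provides branchpoint
    sequences. Only genes present in both tables are returned.
    """
    shared = sorted(sequences.keys() & branchpoints.keys())
    return [
        {
            "gene": gene,
            "intron_sequence": sequences[gene],
            "branchpoint": branchpoints[gene],
        }
        for gene in shared
    ]


# Toy records (hypothetical values, not taken from either database).
sequences = {
    "GENE_A": "GTATGTTTACTAACAAAG",
    "GENE_B": "GTACGTATACTAACTTAG",
}
branchpoints = {
    "GENE_A": "TACTAAC",
    "GENE_C": "TACTAAC",
}

rows = join_on_gene(sequences, branchpoints)
for row in rows:
    print(row["gene"], row["intron_sequence"], row["branchpoint"])
```

An inner join is used here because a combined record is only meaningful when both sources describe the same intron, which is what the consistency requirement is meant to ensure.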

[1] "Sequencing and comparison of yeast species to identify genes and regulatory elements", Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES. Nature. 2003 May 15;423:241-54.