Title: Efficient data structures in DNA sequence alignment
Speaker: Jay Zhang
UBC
Abstract

The invention of Next Generation Sequencing (NGS) machines has made sequencing DNA cheaper and faster. Consequently, scientists are now able to sequence whole genomes at a fraction of the effort. However, NGS machines tend to produce large numbers of short reads (typically fewer than a hundred bases), whereas traditional Sanger sequencing methods produce small numbers of long reads. The short read lengths, coupled with the massive amounts of data, make efficient alignment and subsequent analysis difficult for traditional sequence aligners, and new techniques must be developed to compensate.

This talk will focus in detail on two data structures implemented in the NGS Alignment library, which is currently being developed in the BETA lab at UBC. The first data structure is a memory-efficient, scalable hash table that allows for quick filtering of candidate alignments. The second is a trie-like index, the FM Index, that is linear in both space and time (for exact matching). We will then address the strengths and weaknesses of these structures and evaluate use cases for each.
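
To make the exact-matching claim concrete, the following is a minimal Python sketch of FM Index backward search. It is not the NGS Alignment library's implementation: the function names (bwt, build_fm_index, count_occurrences) are hypothetical, the Burrows-Wheeler transform is built by sorting rotations (suitable only for small inputs), and the occurrence table is stored in full rather than sampled, so the sketch does not achieve the space efficiency of a real index. It assumes the text does not contain the sentinel character "$".

```python
from collections import Counter


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (small inputs only)."""
    text += "$"  # sentinel; assumed absent from the text, sorts before A/C/G/T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm_index(text):
    """Return (C, occ, n) for the text, where n includes the sentinel."""
    last = bwt(text)
    counts = Counter(last)
    # C[c]: number of characters in the text strictly smaller than c.
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i]: number of occurrences of c in last[:i].
    occ = {c: [0] * (len(last) + 1) for c in counts}
    for i, ch in enumerate(last):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ, len(last)


def count_occurrences(index, pattern):
    """Count exact matches of pattern using one step per pattern character."""
    C, occ, n = index
    lo, hi = 0, n  # half-open range of suffix-array rows matching the suffix so far
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("GATTACAGATTACA")
print(count_occurrences(index, "GATTACA"))
print(count_occurrences(index, "CAT"))
```

Each loop iteration in count_occurrences does a constant number of table lookups, so the query time grows with the pattern length rather than with the length of the indexed text. Production indexes typically sample the occurrence table and store the transform in compressed form to keep memory use down.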