Title: Single nucleotide variant (SNV) prediction for identifying somatic mutations in next generation sequencing (NGS) data
Speaker: Jiarui Ding
Department of Computer Science, UBC / BC Cancer Agency
Abstract

Single nucleotide variant (SNV) prediction for identifying somatic mutations in next generation sequencing (NGS) data is critical to defining mutational landscapes in cancer and ultimately to furthering our understanding of tumour biology. A host of SNV prediction tools for NGS data are available, such as Samtools and GATK, but few are tailored specifically to the characteristics of cancer genomes, such as tumour-normal admixture and segmental aneuploidy, which can dramatically alter the allele frequencies present in the data. Moreover, after SNV prediction, most approaches rely on elaborate heuristic filters to remove false predictions. These heuristics are based largely on a combination of intuition and ad hoc rules rather than on empirical data from experimentally revalidated mutations. We propose that principled methods, based on sufficient ground truth data, are needed to train robust classifiers that distinguish true from false positives and simultaneously learn which characteristics of the input data lead to false positive predictions. These classifiers might then inform the development of the next generation of alignment algorithms.

We present supervised machine learning algorithms for somatic point mutation prediction in tumour/normal NGS experiments. Specifically, we construct 80 features to represent each candidate somatic mutation and train logistic regression models on a validated dataset consisting of 990 true somatic mutations and 2433 non-somatic mutation positions. In addition, we use feature selection methods to rank the importance of features for somatic mutation prediction. In ROC comparisons, the classifiers predict somatic mutations more accurately than Samtools and GATK alone. We find that 24 features achieve the best compromise between the number of features and prediction accuracy.
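
As a rough illustration of the workflow described above (train a logistic regression classifier on labelled candidate positions, evaluate it by ROC, and rank features), the following is a minimal sketch and not the implementation used in this work. Everything in it is an assumption made for illustration: the data are synthetic (3 informative and 5 noise features rather than the 80 real ones), the function names are hypothetical, and ranking by absolute weight is a simple stand-in for the feature selection methods mentioned above, valid only when features are on comparable scales.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(X, y, lr=0.1, epochs=200, l2=0.01):
    """Fit weights by batch gradient descent on L2-regularised log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, X):
    # Predicted probability that each candidate is a true somatic mutation.
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]

def roc_auc(scores, labels):
    """Area under the ROC curve: the probability that a random positive
    outscores a random negative, with ties counted as 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def make_synthetic(n, n_informative=3, n_noise=5, seed=0):
    # Hypothetical stand-in for validated positions: informative features
    # shift with the label, noise features do not.
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.3 else 0
        centre = 1.0 if label else -1.0
        informative = [rng.gauss(centre, 1.0) for _ in range(n_informative)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        X.append(informative + noise)
        y.append(label)
    return X, y

X_train, y_train = make_synthetic(400, seed=1)
X_test, y_test = make_synthetic(200, seed=2)

w, b = train_logistic(X_train, y_train)
auc = roc_auc(predict_proba(w, b, X_test), y_test)

# Naive importance ranking: largest absolute weight first.
ranking = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)

print("held-out AUC:", round(auc, 3))
print("features ranked by |weight|:", ranking)
```

Evaluating on positions held out from training, as in this sketch, is what makes an ROC comparison against other callers meaningful; the same held-out AUC can be recomputed on progressively smaller feature subsets to look for a trade-off between feature count and accuracy.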

We show that supervised machine learning algorithms can accurately predict somatic point mutations and provide an alternative to heuristic filters. Because machine learning algorithms automatically find patterns in the training data and use those patterns to classify new candidate somatic point mutations, the classifiers can be regarded as adaptive optimal filters for mutation prediction.