Genomic array data analysis

Title:	Statistical models for genomic array data analysis
Speaker:	Sohrab Shah

Abstract:	DNA copy number alterations (CNAs) are a hallmark of somatic mutation in tumor genomes and of congenital abnormalities that lead to diseases such as mental retardation. A CNA is a region of a chromosome in which the DNA has been deleted or amplified. Accurately identifying the locations of CNAs in an individual sample has applications in understanding the molecular mechanisms of disease as well as in the development of diagnostic and prognostic tools. Furthermore, identifying the pattern of recurrent CNAs that occur in a set of samples exhibiting a common phenotype has compelling implications for medical advances. Recent progress in array comparative genomic hybridization (aCGH) has enabled researchers to measure CNAs at high resolution across the entire human genome. Unfortunately, the observed copy number changes are often corrupted by various sources of noise, making the CNAs hard to detect. In this talk I will explore model-based approaches to the detection of CNAs in aCGH data. I will describe four main areas of research: CNA detection given a sample from one individual; joint analysis of aCGH data from a set of samples to detect recurrent CNAs; unsupervised clustering of aCGH data; and integration of aCGH data with methylation arrays, a promising new technique for detecting so-called epigenomic changes. I will systematically describe how novel extensions to hidden Markov models (HMMs), applied to the first two of these research goals, lead to improved results over baseline models on both cell line and clinical data. Furthermore, I will show how the work to date provides a robust statistical framework upon which to develop our novel ideas for the latter two research goals.