Title: Improved algorithms for DNA motif detection and their application
Speaker: Christina Boucher
University of Waterloo, ON, Canada

Improving the accuracy and efficiency of motif recognition is an important computational challenge that has application to detecting transcription factor binding sites in genomic data. Closely related to motif recognition is the closest string decision problem that asks, given a parameter $d$ and a set of $\ell$-length strings $S = \{s_1, \ldots, s_n\}$, whether there exists a consensus string that has Hamming distance at most $d$ from any string in $S$. We focus on the development of a method for solving the closest string problem quickly with a small probability of error. We apply this heuristic to develop a new motif recognition program, sMCL-WMR, which has impressive accuracy and efficiency.

We discuss the use sMCL-WMR to uncover similarities in the promoter region of the genetic data of canola, and identify more than 40 motifs in a set of promoters conjectured to be responsible for the seed coat-specificity. These identified motifs are used to synthesized a promoter DNA sequence, which is presently being tested for its biological expression.

Lastly, we propose a refined closest string model, which asks for a center string $s$ that is within Hamming distance $d$ to at least $n-k$ of the $n$ input strings, where $k$ is a parameter describing the maximum number of outliers. A solution to this problem not only provides the center string as a representative for the set of strings but also reveals the outliers of the set. We provide fixed parameter algorithms when $d$ and $k$ are parameters for both bounded and unbounded alphabets. We also show that when the alphabet is unbounded the problem is W[1]-hard with respect to $n-k$, $\ell$, and $d$.