# CPSC 445 - Assignment 4

released: 2016/03/25, thu, 15:30; due: 2016/04/01, fri, 18:00 (via e-mail to Robbie and Holger)

#### 1. Motif Discovery [18 marks]

(a) Briefly outline the two different ways a motif can be defined formally that were covered in class (<= 50 words, in your own words). [4 marks]

(b) For each of the two definitions from part (a), briefly outline an algorithm for finding all instances of the motif in a given set of DNA sequences. [6 marks]

(c) Give the definition of relative entropy between two probability distributions. [1 mark]

(d) What is the smallest (absolute) value for the relative entropy of a distribution P w.r.t. another distribution Q? For which P and Q does this value occur? [2 marks]

(e) How is relative entropy used to evaluate candidate motifs? (<= 50 words, in your own words) [2 marks]

(f) Can the Gibbs Sampling Algorithm by Lawrence et al. find the maximum relative entropy motif for any given instance of the Relative Entropy Site Selection Problem with some probability > 0 in at most k iterations, where k is the number of sequences? Briefly justify your answer. [3 marks]

#### 2.  [Programming] Sequence Motif Finding [40 marks]

Consider the following sequences as signals for the trancription factor CREB1:

CATCATGACGTC
CTTCATGACGTG
GTGGATGACGTA
CACGAGGACGCC
CCGGATGACGCA
CCTAATGACGCA
GGGGATGACGTG
CACGATGACGTG
TTGGATGACGCT
GGTGATGACGTC
GCTCATGACGTA
TAGGATGACGCT
GGGGATGACGTC
CCTGATGACGAC
TCTAATGACGTA
TTCGATGACGTC

(a) Build a position weight matrix for the transcription factor CREB1. It is recommended to write a script to answer this question but only the position weight matrix needs to be submitted.
[5 marks]

(b) Write a program that computes the WMM scores for all windows (of the same length as the weight matrix). Run this program on the following sequence from the human genome chromosome 22:

AGCAGTATCAGGCACACTACCAGACCCAGGTAGAGACCGAGCACCTGCTG
GTGGGACATCAGCGACCATGCGACAGGGTCTGCCCAAGCGAGGTGTTGTC
AGATCCTCCAAGTCAGTAGGTCAGAGGCTCAGTAGCAATTGATGGTACAA
TGAAAAAGGAAGCCTTGGGCTTGGACCCAAGCGTGTCTAGTGGGTAACAG
TTATTTACAGGAAGAGGTGGCAGTGTGGCCCCACATCTGCTTTGCACTGA
TATTTCTCCCTTAGCAGATAAGCATTCTGGTCTGGTTCGTATATAGGTCA
GTTTGATGTGTTGATATGAACTGAAAATGAACCATAGGTACACTACAGTC
CCATTCAAGGTGACCCTAAAGGACGACAATAGGAAATCCTCCATGGGGCT
GAGCTTCCAGCAGTTCGTGTGGATATCCACTTTGTATGGAGGCAGTGGAC
AGAGTAGTGGCCTGAGGGAGGGACGCATATGGACTTCTGGGTTGTGACGT
CCTGCTGGCTGGTCAGGGACCTGAAAAGAGCAAGAGGGGAAGATGGACCT
ACAGGAGTGGCCACACAATATGTGCATCTCTCTGCCTTGTGTTAATACTG
CAGAGGAGTTGGTGAACAGCAGGATGGATGGGATGTCAGTCAGCTGTGCC
CCTGGCTACCCCTGTGCTTGAACAGTGGACTGTGAGTGGCGGCGTCATGC
AGAAGGAGCACAGGTTAGCGTCCACCAGCACAGGCCTTCTTTCTCAAGGC
TTGTCTCATGATTACTCCTGCTGAAAGCGATCAGTGCTGAGCCCCTGCTG
AGATACCATCCCTAGAGCACCCCAACTAGTTACTTAGTGGCAAGTTGGTG
ACAGCCCTTCATCCTTGCTGGAGTTGACACCTGCTCTGGATAAGGGTTTG
TCTTCTGTATCCACAGGGTCCCAGGCCAGCATCACTATCCAGGAGCTTCA
GGCATGCTCCGTGTACCAACATGGGATCACGAATCACATCGCCTCAGGCC

The length of this sequence is 1000 and, hence, there should be 989 windows for each the + and - strand directions. Write the window start positions (starting the position count at 0 for the first position in the sequence), the strand (+/-) and the scores to stdout (the console), such that each line contains a start position followed by a '+' or '-' followed by a score value, separated by a single space. Plot the scores (y axis) over window start positions (x axis) using any software that you like (GNUPlot, MS EXCEL, ..) and mark up (by hand or electronically) the regions that intuitively represent good hits. Hand in the graph with the rest of your written assignment. Note: The gene CHKB (chr 22) is known to be regulated by CREB1 transcription factor, but the CHKB gene is located on the reverse (-) strand. [35 marks]

 Important notes: Your programs should be written either in Python, Java, C or C++. When you are done, send an email to hoos@cs.ubc.ca and rolin@cs.ubc.ca with subject 'CPSC445-hw4' and attach your program source. All files submitted should contain your student id # in the title e.g. 80132322.cpp and 80132322-readme.txt. If you decided to provide your files in a zip or tarball archive, please include your student number on the archive title. You programs should be well documented  and you should explain the purpose of every function that you write. You may lose marks for ill documented code. Readme files are encouraged, and should include your name, colleague that you worked with, and a sample compile and execution of the code if done by the command line.

#### 3.  RNA secondary structure [25 marks]

(a) Given the following RNA secondary structure, name all of the secondary structure elements and specify their positions in terms of the respective exterior and interior base-pairs. [10 marks]

Parts (b),(c) and (d) use the following sequence:
AACCCUUUCAAAAAGGGAGGUCAGU

(b) Fill in the dynamic programming matrix produced from running the Nussinov algorithm on the following sequence (there are 6 blanks to fill in); [7 marks]

Nussinov score matrix:
 A A C C C U U U C A A A A A G G G A G G U C A G U A 0 0 0 0 0 1 2 2 2 3 3 3 3 4 5 6 6 6 6 7 8 8 8 9 A 0 0 0 0 0 1 1 1 1 2 3 3 3 3 4 5 6 6 6 6 7 8 8 8 9 C 0 0 0 0 0 0 0 0 0 1 2 3 3 3 4 5 6 6 6 6 7 8 8 8 9 C 0 0 0 0 0 0 0 0 0 1 2 3 3 3 4 5 5 5 5 6 6 7 7 8 C 0 0 0 0 0 0 0 0 0 1 2 3 3 3 4 4 4 4 5 5 6 6 7 7 U 0 0 0 0 0 0 0 0 0 1 2 3 3 3 3 3 3 4 4 4 5 6 6 6 7 U 0 0 0 0 0 0 0 0 0 1 2 2 2 2 2 2 3 3 3 3 4 5 5 5 6 U 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 3 4 4 4 5 C 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 2 3 3 3 4 A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 2 3 A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 2 3 A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 2 3 A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 2 3 A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 2 3 G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 2 3 G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 2 3 G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 2 3 A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 2 3 G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 2 3 G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 2 2 U 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2 C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 U 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

(c) Given the following traceback report derived from the above matrix, fill in the empty labels as each step is popped from the stack as either paired, unpaired, bifurcation or NA. [5 marks]

 rows columns label 1 25 unpaired 2 25 unpaired 3 25 bifurcation 3 17 paired 4 16 5 15 6 14 6 13 6 12 7 11 8 10 9 9 NA 18 25 19 25 20 24 20 21 21 20 22 24 23 23

(d) Draw an optimal secondary structure for the sequence. You should use the work provided in (b) and (c). [3 marks]

#### 4. Challenge Problems [no marks, but highly recommended to broaded and deepen your knowledge]

Special Note - In each assignment, there will be a Challenge Problems section. These problems are not for credit. However, they will give you a greater depth of knowledge in the subject matter for this section, and will help you prepare for, and impress in, the Final Oral Examination. If you have any questions regarding these problems, since some are quite open-ended in nature, feel free to discuss them with Holger or Robbie.

Reading Questions: In graduate-style seminars, you will often be asked to read academic papers before class and have a set of questions prepared for discussion with your peers during the seminar period.  These questions should not simply be limited to questions of the form, "I didn't understand X, how does it work?" but should demonstrate that you have made an effort to re-read and understand the paper to come up with more piercing, critical analysis, or suggestions for future improvements or directions to the work.  Working out these questions will prepare you for such seminars and discussions in the future.

(a) Reading Question 1: Describe the method employed in "Expanded Sequence Dependence of Thermodynamic Parameters Improves Prediction of RNA Secondary Structure" (http://bio.gsnu.ac.kr/research/miRNA/RNAss/Mfold/miRNAss_JMB_1999_Mathews.pdf) Why is the method an improvement at previous attempts at RNA Secondary Structure Predicition? When does the method have difficulty predicting structure accurately? What assumptions about the biological data does the method make?

(b) Reading Question 2: Read "Transat - A Method for Detecting the Conserved Helices of Functional RNA Structures, Including Transient, Pseudo-Knotted and Alternative Structures" (http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000823) and describe the method employed to predict functional RNA structures. Why is the method effective? When does the method have difficulty predicting structure accurately? What assumptions about the biological data does the method make?

#### General remarks:

• The assignment has to be handed in on the date it is due before 18:00. To ensure fairness, late hand-ins will generally not be accepted (exceptions can only be made for officially documented medical reasons). Please send your solution via e-mail (as plain text or - possibly scanned - PDF) to Robbie and Holger by this deadline.
• This assignment should take you no longer than about 4-5 hours to complete, if you have good knowledge of the topics covered. However, don't wait until the last minute relying on this estimate - it might not apply to you (or anyone at all), you might need additional time to consult the literature, etc.
• While cooperation between students - especially between CS and non-CS students - is encouraged, each student is expected to work out the actual solutions to the problems individually and hand in their own assignment. In other words: help each other, but do not copy solutions.
• Feel always free to contact Holger or Robbie if you feel you need further help than can be provided by your fellow students.