CPSC 445 - Assignment 4

released: Tue, Apr 1st, 2008; due: Thu, Apr 10th, 9:30 (just before the beginning of class)

1  [Programming] Sequence Motif Finding [40 marks]

Consider the following sequences as signals for the transcription factor CREB1:

CATCATGACGTC
CTTCATGACGTG
GTGGATGACGTA
CACGAGGACGCC
CCGGATGACGCA
CCTAATGACGCA
GGGGATGACGTG
CACGATGACGTG
TTGGATGACGCT
GGTGATGACGTC
GCTCATGACGTA
TAGGATGACGCT
GGGGATGACGTC
CCTGATGACGAC
TCTAATGACGTA
TTCGATGACGTC

(a) Build a position weight matrix for the transcription factor CREB1.
[5 marks]
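
One common way to build such a matrix is to count the nucleotides in each column of the aligned sites, add a pseudocount, and take the log-ratio against a background frequency. The following is a minimal sketch of that idea, not a prescribed solution: the class name PwmSketch, the pseudocount of 1, the uniform background of 0.25 and the base-2 logarithm are all assumptions that the assignment does not specify.

```java
final class PwmSketch {
    static final String ALPHABET = "ACGT";

    /** Counts each nucleotide at each motif position; the result is [4][motifLength]. */
    static int[][] countMatrix(String[] sites) {
        int length = sites[0].length();
        int[][] counts = new int[4][length];
        for (String site : sites) {
            for (int pos = 0; pos < length; pos++) {
                counts[ALPHABET.indexOf(site.charAt(pos))][pos]++;
            }
        }
        return counts;
    }

    /** Converts counts to log2-odds weights; pseudocount and background are assumed values. */
    static double[][] logOdds(int[][] counts, int numSites, double pseudocount, double background) {
        int length = counts[0].length;
        double[][] weights = new double[4][length];
        for (int b = 0; b < 4; b++) {
            for (int pos = 0; pos < length; pos++) {
                double freq = (counts[b][pos] + pseudocount) / (numSites + 4 * pseudocount);
                weights[b][pos] = Math.log(freq / background) / Math.log(2);
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        // First four CREB1 sites from the list above, used only to keep the demo short.
        String[] sites = {"CATCATGACGTC", "CTTCATGACGTG", "GTGGATGACGTA", "CACGAGGACGCC"};
        int[][] counts = countMatrix(sites);
        double[][] weights = logOdds(counts, sites.length, 1.0, 0.25);
        for (int b = 0; b < 4; b++) {
            StringBuilder row = new StringBuilder().append(ALPHABET.charAt(b));
            for (double w : weights[b]) {
                row.append(String.format(" %6.2f", w));
            }
            System.out.println(row);
        }
    }
}
```

The pseudocount keeps the logarithm finite for nucleotides that never occur in a column; a different pseudocount or background changes the weights but not the overall approach.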

(b) Write a program that computes the WMM scores for all windows (of the same length as the weight matrix). Run this program on the following sequence from the human genome chromosome 22:

AGCAGTATCAGGCACACTACCAGACCCAGGTAGAGACCGAGCACCTGCTG
GTGGGACATCAGCGACCATGCGACAGGGTCTGCCCAAGCGAGGTGTTGTC
AGATCCTCCAAGTCAGTAGGTCAGAGGCTCAGTAGCAATTGATGGTACAA
TGAAAAAGGAAGCCTTGGGCTTGGACCCAAGCGTGTCTAGTGGGTAACAG
TTATTTACAGGAAGAGGTGGCAGTGTGGCCCCACATCTGCTTTGCACTGA
TATTTCTCCCTTAGCAGATAAGCATTCTGGTCTGGTTCGTATATAGGTCA
GTTTGATGTGTTGATATGAACTGAAAATGAACCATAGGTACACTACAGTC
CCATTCAAGGTGACCCTAAAGGACGACAATAGGAAATCCTCCATGGGGCT
GAGCTTCCAGCAGTTCGTGTGGATATCCACTTTGTATGGAGGCAGTGGAC
AGAGTAGTGGCCTGAGGGAGGGACGCATATGGACTTCTGGGTTGTGACGT
CCTGCTGGCTGGTCAGGGACCTGAAAAGAGCAAGAGGGGAAGATGGACCT
ACAGGAGTGGCCACACAATATGTGCATCTCTCTGCCTTGTGTTAATACTG
CAGAGGAGTTGGTGAACAGCAGGATGGATGGGATGTCAGTCAGCTGTGCC
CCTGGCTACCCCTGTGCTTGAACAGTGGACTGTGAGTGGCGGCGTCATGC
AGAAGGAGCACAGGTTAGCGTCCACCAGCACAGGCCTTCTTTCTCAAGGC
TTGTCTCATGATTACTCCTGCTGAAAGCGATCAGTGCTGAGCCCCTGCTG
AGATACCATCCCTAGAGCACCCCAACTAGTTACTTAGTGGCAAGTTGGTG
ACAGCCCTTCATCCTTGCTGGAGTTGACACCTGCTCTGGATAAGGGTTTG
TCTTCTGTATCCACAGGGTCCCAGGCCAGCATCACTATCCAGGAGCTTCA
GGCATGCTCCGTGTACCAACATGGGATCACGAATCACATCGCCTCAGGCC

The sequence has length 1000 and the weight matrix has width 12, so there are 1000 - 12 + 1 = 989 windows for each of the + and - strands. Write the window start position (counting from 0 for the first position in the sequence), the strand (+ or -) and the score to stdout (the console), one window per line: a start position, then '+' or '-', then the score, separated by a single space. Plot the scores (y axis) against the window start positions (x axis) using any software you like (gnuplot, MS Excel, etc.) and mark (by hand or electronically) the regions that intuitively represent good hits. Hand in the graph with the rest of your written assignment. Note: The gene CHKB (chr 22) is known to be regulated by the CREB1 transcription factor, but CHKB is located on the reverse (-) strand. [35 marks]
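
The window scan and the output format described above can be sketched as follows. This is an illustration under stated assumptions, not the required implementation: the class name WindowScanSketch and the 2-column toy matrix are hypothetical, and the sketch assumes that a '-' window is reported at its forward-strand start position and scored on its reverse complement. The assignment does not fix that coordinate convention, so state whichever one you use.

```java
import java.util.ArrayList;
import java.util.List;

final class WindowScanSketch {
    static final String ALPHABET = "ACGT";

    /** Returns the reverse complement of a DNA string over A, C, G, T. */
    static String reverseComplement(String dna) {
        StringBuilder sb = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            // In "ACGT" the complement of the letter at index k sits at index 3 - k.
            sb.append(ALPHABET.charAt(3 - ALPHABET.indexOf(dna.charAt(i))));
        }
        return sb.toString();
    }

    /** Sums the matrix weights of the window's nucleotides, position by position. */
    static double scoreWindow(double[][] weights, String window) {
        double score = 0.0;
        for (int pos = 0; pos < window.length(); pos++) {
            score += weights[ALPHABET.indexOf(window.charAt(pos))][pos];
        }
        return score;
    }

    /** Produces one "start strand score" line per window and strand. */
    static List<String> scan(double[][] weights, String sequence) {
        int width = weights[0].length;
        List<String> lines = new ArrayList<>();
        for (int start = 0; start + width <= sequence.length(); start++) {
            String window = sequence.substring(start, start + width);
            lines.add(start + " + " + scoreWindow(weights, window));
            // Assumed convention: same start index, scored on the reverse complement.
            lines.add(start + " - " + scoreWindow(weights, reverseComplement(window)));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical toy matrix of width 2 (rows A, C, G, T) that rewards "AG".
        double[][] toy = {{1, 0}, {0, 0}, {0, 2}, {0, 0}};
        for (String line : scan(toy, "AGCT")) {
            System.out.println(line);
        }
    }
}
```

A sequence of length n and a matrix of width w give n - w + 1 start positions, which is where the 989 windows per strand come from.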

Important notes:
  • Your programs should be written either in Java, C or C++.
  • When you are done, send an email to acarbo@cs.ubc.ca with subject 'CPSC445-hw4' and attach your program source.
  • All submitted files should contain your student ID number in the file name, e.g. 80132322.cpp and 80132322-readme.txt. If you provide your files in a zip or tarball archive, please include your student number in the archive name.
  • Your programs should be well documented, and you should explain the purpose of every function that you write. You may lose marks for poorly documented code.
  • Readme files are encouraged, and should include your name, the name of any colleague that you worked with, and a sample compile and run of the code if it is built from the command line.


2  RNA secondary structure [25 marks]

(a) Given the following RNA secondary structure, name all of the secondary structure elements and specify their positions in terms of the respective exterior and interior base-pairs. [10 marks]

Parts (b), (c) and (d) use the following sequence:
   AACCCUUUCAAAAAGGGAGGUCACC

(b) Fill in the dynamic programming matrix produced by running the Nussinov algorithm on the sequence above (there are 6 blanks to fill in). [7 marks]

Nussinov score matrix:
  A A C C C U U U C A A A A A G G G A G G U C A C C
A 0 0 0 0 0 1 2 2 2 _ 3 3 3 3 4 5 6 6 6 6 7 8 8 8 9
A 0 0 0 0 0 1 1 1 1 2 3 3 3 3 4 5 6 6 6 6 7 8 8 8 9
C 0 0 0 0 0 0 0 0 0 1 2 3 3 3 4 5 6 6 6 6 7 8 8 8 9
C 0 0 0 0 0 0 0 0 0 1 2 3 3 3 4 5 5 5 5 6 6 7 7 _ 8
C 0 0 0 0 0 0 0 0 0 1 2 3 3 3 4 4 4 4 5 5 6 6 _ 7 8
U 0 0 0 0 0 0 0 0 0 1 2 3 3 3 3 3 3 4 4 4 5 6 6 6 7
U 0 0 0 0 0 0 0 0 0 1 2 2 2 2 2 2 3 3 3 3 4 5 5 5 6
U 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 3 4 4 5 5
C 0 0 0 0 0 0 0 0 0 0 0 0 0 _ _ _ 1 1 1 1 2 3 3 4 5
A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 3 4
A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 3 4
A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 3 4
A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 3 4
A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 3 4
G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 3 4
G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 3 4
G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 3 3
A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 2 3
G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 2 3
G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 2 2
U 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
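
For reference, the Nussinov recurrence takes the maximum over four cases for each cell (i, j): i unpaired, j unpaired, i paired with j, and a bifurcation at some k between them. The following is a minimal sketch of the fill step, assuming a score of 1 for A-U, G-C and G-U pairs and no minimum loop length. That scoring is inferred from the printed matrix (which has a 1 for the adjacent G and U at row 20, column 21) and is not stated in the assignment; the class name NussinovSketch is hypothetical, and indices in the code are 0-based while the matrix above is 1-based.

```java
final class NussinovSketch {
    /** Assumed pairing rule: Watson-Crick pairs plus the G-U wobble pair. */
    static boolean canPair(char a, char b) {
        String pair = "" + a + b;
        return pair.equals("AU") || pair.equals("UA")
            || pair.equals("GC") || pair.equals("CG")
            || pair.equals("GU") || pair.equals("UG");
    }

    /** Fills the upper triangle; score[i][j] is the maximum number of pairs in seq[i..j]. */
    static int[][] fill(String seq) {
        int n = seq.length();
        int[][] score = new int[n][n];
        for (int span = 1; span < n; span++) {
            for (int i = 0; i + span < n; i++) {
                int j = i + span;
                // Cases 1 and 2: position i or position j is left unpaired.
                int best = Math.max(score[i + 1][j], score[i][j - 1]);
                // Case 3: i pairs with j around the enclosed subsequence.
                if (canPair(seq.charAt(i), seq.charAt(j))) {
                    int inner = (i + 1 <= j - 1) ? score[i + 1][j - 1] : 0;
                    best = Math.max(best, inner + 1);
                }
                // Case 4: bifurcation into two independent substructures.
                for (int k = i + 1; k < j; k++) {
                    best = Math.max(best, score[i][k] + score[k + 1][j]);
                }
                score[i][j] = best;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        String seq = "GGGAAACCC"; // small example, not the assignment sequence
        int[][] score = fill(seq);
        System.out.println(score[0][seq.length() - 1]);
    }
}
```

Cells are filled in order of increasing span so that every value a cell depends on has already been computed.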

(c) Given the following traceback report derived from the above matrix, fill in the empty labels as each step is popped from the stack as either paired, unpaired, bifurcation or NA. [5 marks]

row  column  label
1    25      unpaired
2    25      unpaired
3    25      bifurcation
3    17      paired
4    16
5    15
6    14
6    13
6    12
7    11
8    10
9    9       NA
18   25
19   25
20   24
21   23
22   22

(d) Draw an optimal secondary structure for the sequence. You should use the work provided in (b) and (c). [3 marks]
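
A stack-based traceback can turn a filled matrix into a structure in dot-bracket notation, which is a convenient starting point for a drawing. The sketch below is one possible version under the same assumed scoring as the matrix (A-U, G-C and G-U pairs score 1, no minimum loop length). The class name NussinovTracebackSketch is hypothetical, and the order in which the cases are tested (i unpaired, j unpaired, paired, bifurcation) is an assumption; a different order may yield a different but equally optimal structure.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

final class NussinovTracebackSketch {
    static boolean canPair(char a, char b) {
        String pair = "" + a + b;
        return pair.equals("AU") || pair.equals("UA")
            || pair.equals("GC") || pair.equals("CG")
            || pair.equals("GU") || pair.equals("UG");
    }

    /** Standard Nussinov fill; score[i][j] is the maximum number of pairs in seq[i..j]. */
    static int[][] fill(String seq) {
        int n = seq.length();
        int[][] score = new int[n][n];
        for (int span = 1; span < n; span++) {
            for (int i = 0; i + span < n; i++) {
                int j = i + span;
                int best = Math.max(score[i + 1][j], score[i][j - 1]);
                if (canPair(seq.charAt(i), seq.charAt(j))) {
                    best = Math.max(best, score[i + 1][j - 1] + 1);
                }
                for (int k = i + 1; k < j; k++) {
                    best = Math.max(best, score[i][k] + score[k + 1][j]);
                }
                score[i][j] = best;
            }
        }
        return score;
    }

    /** Pops cells from a stack and records one optimal structure in dot-bracket form. */
    static String traceback(String seq, int[][] score) {
        int n = seq.length();
        char[] structure = new char[n];
        Arrays.fill(structure, '.');
        Deque<int[]> stack = new ArrayDeque<>();
        stack.push(new int[] {0, n - 1});
        while (!stack.isEmpty()) {
            int[] cell = stack.pop();
            int i = cell[0], j = cell[1];
            if (i >= j) {
                continue; // nothing left to pair in this interval
            }
            if (score[i + 1][j] == score[i][j]) {
                stack.push(new int[] {i + 1, j});          // i is unpaired
            } else if (score[i][j - 1] == score[i][j]) {
                stack.push(new int[] {i, j - 1});          // j is unpaired
            } else if (canPair(seq.charAt(i), seq.charAt(j))
                    && score[i + 1][j - 1] + 1 == score[i][j]) {
                structure[i] = '(';                        // i pairs with j
                structure[j] = ')';
                stack.push(new int[] {i + 1, j - 1});
            } else {
                for (int k = i + 1; k < j; k++) {          // bifurcation at k
                    if (score[i][k] + score[k + 1][j] == score[i][j]) {
                        stack.push(new int[] {k + 1, j});
                        stack.push(new int[] {i, k});
                        break;
                    }
                }
            }
        }
        return new String(structure);
    }

    public static void main(String[] args) {
        String seq = "GGGAAACCC"; // small example, not the assignment sequence
        System.out.println(seq);
        System.out.println(traceback(seq, fill(seq)));
    }
}
```

Each matching pair of brackets in the output corresponds to one base pair, and dots mark unpaired positions.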


General remarks: