CPSC 445 - Assignment 4

released: Tue, Apr 1st, 2008; due: Thu, Apr 10th, 9:30 (just before the beginning of class)

1 [Programming] Sequence Motif Finding [40 marks]

Consider the following sequences as signals for the transcription factor CREB1:

CATCATGACGTC
CTTCATGACGTG
GTGGATGACGTA
CACGAGGACGCC
CCGGATGACGCA
CCTAATGACGCA
GGGGATGACGTG
CACGATGACGTG
TTGGATGACGCT
GGTGATGACGTC
GCTCATGACGTA
TAGGATGACGCT
GGGGATGACGTC
CCTGATGACGAC
TCTAATGACGTA
TTCGATGACGTC

(a) Build a position weight matrix for the transcription factor CREB1. [5 marks]
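One common way to turn aligned sites into a weight matrix is to count bases per column, add a pseudocount, and take the log-odds against a background frequency. The sketch below illustrates that idea on arbitrary input. The pseudocount of 1, the uniform background of 0.25, the base-2 logarithm and the names baseIndex, Column and buildWeightMatrix are all assumptions made for illustration; the definition used in class may differ.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// One matrix column: scores for A, C, G, T (in that order).
using Column = std::array<double, 4>;

// Map a nucleotide to a row index (A=0, C=1, G=2, T=3); -1 for anything else.
int baseIndex(char b) {
    switch (b) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Build a log-odds weight matrix from aligned sites of equal length.
// ASSUMPTIONS (not from the assignment): pseudocount of 1 per base,
// uniform background frequency 0.25, logarithm base 2.
std::vector<Column> buildWeightMatrix(const std::vector<std::string>& sites,
                                      double pseudocount = 1.0,
                                      double background = 0.25) {
    assert(!sites.empty());
    const std::size_t width = sites[0].size();
    std::vector<Column> pwm(width, Column{});  // all counts start at zero

    // Count how often each base occurs in each column.
    for (const std::string& s : sites) {
        assert(s.size() == width);
        for (std::size_t j = 0; j < width; ++j) {
            const int b = baseIndex(s[j]);
            assert(b >= 0);
            pwm[j][b] += 1.0;
        }
    }

    // Convert counts to smoothed frequencies, then to log-odds scores.
    const double total = static_cast<double>(sites.size()) + 4.0 * pseudocount;
    for (Column& col : pwm)
        for (double& v : col)
            v = std::log2(((v + pseudocount) / total) / background);
    return pwm;
}
```

The pseudocount keeps the logarithm finite for bases that never occur in a column; without it, a single unseen base would give a window a score of minus infinity.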

(b) Write a program that computes the WMM scores for all windows (of the same length as the weight matrix). Run this program on the following sequence from the human genome chromosome 22:

AGCAGTATCAGGCACACTACCAGACCCAGGTAGAGACCGAGCACCTGCTG GTGGGACATCAGCGACCATGCGACAGGGTCTGCCCAAGCGAGGTGTTGTC AGATCCTCCAAGTCAGTAGGTCAGAGGCTCAGTAGCAATTGATGGTACAA TGAAAAAGGAAGCCTTGGGCTTGGACCCAAGCGTGTCTAGTGGGTAACAG TTATTTACAGGAAGAGGTGGCAGTGTGGCCCCACATCTGCTTTGCACTGA TATTTCTCCCTTAGCAGATAAGCATTCTGGTCTGGTTCGTATATAGGTCA GTTTGATGTGTTGATATGAACTGAAAATGAACCATAGGTACACTACAGTC CCATTCAAGGTGACCCTAAAGGACGACAATAGGAAATCCTCCATGGGGCT GAGCTTCCAGCAGTTCGTGTGGATATCCACTTTGTATGGAGGCAGTGGAC AGAGTAGTGGCCTGAGGGAGGGACGCATATGGACTTCTGGGTTGTGACGT CCTGCTGGCTGGTCAGGGACCTGAAAAGAGCAAGAGGGGAAGATGGACCT ACAGGAGTGGCCACACAATATGTGCATCTCTCTGCCTTGTGTTAATACTG CAGAGGAGTTGGTGAACAGCAGGATGGATGGGATGTCAGTCAGCTGTGCC CCTGGCTACCCCTGTGCTTGAACAGTGGACTGTGAGTGGCGGCGTCATGC AGAAGGAGCACAGGTTAGCGTCCACCAGCACAGGCCTTCTTTCTCAAGGC TTGTCTCATGATTACTCCTGCTGAAAGCGATCAGTGCTGAGCCCCTGCTG AGATACCATCCCTAGAGCACCCCAACTAGTTACTTAGTGGCAAGTTGGTG ACAGCCCTTCATCCTTGCTGGAGTTGACACCTGCTCTGGATAAGGGTTTG TCTTCTGTATCCACAGGGTCCCAGGCCAGCATCACTATCCAGGAGCTTCA GGCATGCTCCGTGTACCAACATGGGATCACGAATCACATCGCCTCAGGCC

The length of this sequence is 1000 and, hence, there should be 989 windows for each of the + and - strands. Write the window start positions (counting from 0 for the first position in the sequence), the strand (+/-) and the scores to stdout (the console), such that each line contains a start position, followed by a '+' or '-', followed by a score value, separated by single spaces. Plot the scores (y axis) over window start positions (x axis) using any software that you like (GNUPlot, MS Excel, ...) and mark up (by hand or electronically) the regions that intuitively represent good hits. Hand in the graph with the rest of your written assignment. Note: The gene CHKB (chr 22) is known to be regulated by the CREB1 transcription factor, but the CHKB gene is located on the reverse (-) strand. [35 marks]
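The window scan and the output format described above can be sketched as follows. This is only an illustration under stated assumptions: the weight matrix is passed in as a vector of per-position scores for A, C, G, T; the - strand score of a window is taken to be the score of the reverse complement of that window, reported at the same + strand start position; and the names Column, Hit, scoreWindow, scanBothStrands and printHits are hypothetical. Check the strand convention against the course notes before relying on it.

```cpp
#include <array>
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

using Column = std::array<double, 4>;  // scores for A, C, G, T

int baseIndex(char b) {
    switch (b) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Reverse the string and complement every base (unknown characters are kept).
std::string reverseComplement(const std::string& s) {
    std::string rc(s.rbegin(), s.rend());
    for (char& c : rc) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            case 'T': c = 'A'; break;
            default: break;
        }
    }
    return rc;
}

// Sum the matrix entries selected by the bases of one window.
double scoreWindow(const std::vector<Column>& pwm, const std::string& window) {
    assert(window.size() == pwm.size());
    double score = 0.0;
    for (std::size_t j = 0; j < pwm.size(); ++j) {
        const int b = baseIndex(window[j]);
        if (b >= 0) score += pwm[j][b];  // characters other than ACGT add nothing
    }
    return score;
}

struct Hit { std::size_t start; char strand; double score; };

// Score every window on both strands.
// ASSUMPTION: '-' hits are reported at the + strand start of the window.
std::vector<Hit> scanBothStrands(const std::vector<Column>& pwm,
                                 const std::string& seq) {
    std::vector<Hit> hits;
    const std::size_t w = pwm.size();
    if (w == 0 || seq.size() < w) return hits;
    for (std::size_t i = 0; i + w <= seq.size(); ++i) {
        const std::string window = seq.substr(i, w);
        hits.push_back({i, '+', scoreWindow(pwm, window)});
        hits.push_back({i, '-', scoreWindow(pwm, reverseComplement(window))});
    }
    return hits;
}

// One hit per line: start position, strand, score, separated by single spaces.
void printHits(const std::vector<Hit>& hits) {
    for (const Hit& h : hits)
        std::printf("%zu %c %.4f\n", h.start, h.strand, h.score);
}
```

With a matrix of width 12 and a sequence of length 1000, the loop runs for start positions 0 to 988, which gives the 989 windows per strand mentioned above.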

Important notes:

Your programs should be written either in Java, C or C++.

When you are done, send an email to acarbo@cs.ubc.ca with subject 'CPSC445-hw4' and attach your program source.

All submitted files should contain your student ID number in the file name, e.g. 80132322.cpp and 80132322-readme.txt. If you decide to provide your files in a zip or tarball archive, please include your student number in the archive name.
Your programs should be well documented, and you should explain the purpose of every function that you write. You may lose marks for poorly documented code.

Readme files are encouraged and should include your name, the names of any colleagues you worked with, and, if the code is run from the command line, a sample compilation and execution.

2 RNA secondary structure [25 marks]

(a) Given the following RNA secondary structure, name all of the secondary structure elements and specify their positions in terms of the respective exterior and interior base-pairs. [10 marks]

Parts (b),(c) and (d) use the following sequence:
AACCCUUUCAAAAAGGGAGGUCACC

(b) Fill in the dynamic programming matrix produced by running the Nussinov algorithm on the sequence above (there are 6 blanks to fill in). [7 marks]

Nussinov score matrix (cells to be filled in are marked with _):

  A A C C C U U U C A A A A A G G G A G G U C A C C
A 0 0 0 0 0 1 2 2 2 _ 3 3 3 3 4 5 6 6 6 6 7 8 8 8 9
A 0 0 0 0 0 1 1 1 1 2 3 3 3 3 4 5 6 6 6 6 7 8 8 8 9
C 0 0 0 0 0 0 0 0 0 1 2 3 3 3 4 5 6 6 6 6 7 8 8 8 9
C 0 0 0 0 0 0 0 0 0 1 2 3 3 3 4 5 5 5 5 6 6 7 7 _ 8
C 0 0 0 0 0 0 0 0 0 1 2 3 3 3 4 4 4 4 5 5 6 6 _ 7 8
U 0 0 0 0 0 0 0 0 0 1 2 3 3 3 3 3 3 4 4 4 5 6 6 6 7
U 0 0 0 0 0 0 0 0 0 1 2 2 2 2 2 2 3 3 3 3 4 5 5 5 6
U 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 3 4 4 5 5
C 0 0 0 0 0 0 0 0 0 0 0 0 0 _ _ _ 1 1 1 1 2 3 3 4 5
A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 3 4
A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 3 4
A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 3 4
A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 3 4
A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 3 4
G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 3 4
G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 3 4
G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 3 3
A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 2 3
G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 2 3
G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 2 2
U 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
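For reference, the Nussinov recurrence is usually stated in the textbook form below, where N(i,j) is the maximum number of base pairs in the subsequence from position i to position j. The symbol names N and \delta are notation chosen here, and which pairs count as complementary (for example, whether G-U pairs are allowed) is a convention that should be taken from the course notes rather than from this sketch. The block assumes the amsmath package.

```latex
\[
N(i,j) = \max
\begin{cases}
N(i+1,\,j) & \text{($i$ unpaired)}\\
N(i,\,j-1) & \text{($j$ unpaired)}\\
N(i+1,\,j-1) + \delta(i,j) & \text{($i$ paired with $j$)}\\
\max_{i<k<j}\,\bigl[N(i,k) + N(k+1,j)\bigr] & \text{(bifurcation)}
\end{cases}
\qquad N(i,i) = N(i,i-1) = 0,
\]
\[
\delta(i,j) =
\begin{cases}
1 & \text{if bases $i$ and $j$ are complementary}\\
0 & \text{otherwise.}
\end{cases}
\]
```

Each cell depends only on cells for shorter subsequences, so the matrix can be filled diagonal by diagonal, starting next to the main diagonal and ending in the top-right corner.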

(c) Given the following traceback report derived from the above matrix, fill in the empty labels: label each step, as it is popped from the stack, as either paired, unpaired, bifurcation or NA. [5 marks]

rows columns label
1 25 unpaired
2 25 unpaired
3 25 bifurcation
3 17 paired
4 16
5 15
6 14
6 13
6 12
7 11
8 10
9 9 NA
18 25
19 25
20 24
21 23
22 22
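A stack-based traceback of the kind reported above can be sketched as follows, using a toy sequence rather than the assignment sequence. The sketch makes several assumptions that may not match the course convention: A-U, G-C and G-U pairs each score 1, there is no minimum loop length, indices are 0-based, and the cases are tested in the order "i unpaired, j unpaired, paired, bifurcation". A different order can produce a different but equally optimal structure. The names Matrix, canPair, fill and traceback are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <stack>
#include <string>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// ASSUMPTION: Watson-Crick pairs plus G-U wobble pairs are allowed.
bool canPair(char a, char b) {
    const std::string p{a, b};
    return p == "AU" || p == "UA" || p == "GC" || p == "CG" ||
           p == "GU" || p == "UG";
}

// Fill the score matrix diagonal by diagonal (len = j - i).
Matrix fill(const std::string& s) {
    const int n = static_cast<int>(s.size());
    Matrix N(n, std::vector<int>(n, 0));
    for (int len = 1; len < n; ++len) {
        for (int i = 0; i + len < n; ++i) {
            const int j = i + len;
            int best = std::max(N[i + 1][j], N[i][j - 1]);
            if (canPair(s[i], s[j]))
                best = std::max(best, N[i + 1][j - 1] + 1);
            for (int k = i + 1; k < j; ++k)  // bifurcation
                best = std::max(best, N[i][k] + N[k + 1][j]);
            N[i][j] = best;
        }
    }
    return N;
}

// Recover one optimal set of base pairs (0-based positions) from the matrix.
std::vector<std::pair<int, int>> traceback(const std::string& s, const Matrix& N) {
    assert(N.size() == s.size());
    std::vector<std::pair<int, int>> pairs;
    std::stack<std::pair<int, int>> todo;
    if (!s.empty()) todo.push({0, static_cast<int>(s.size()) - 1});
    while (!todo.empty()) {
        const std::pair<int, int> cur = todo.top();
        todo.pop();
        const int i = cur.first, j = cur.second;
        if (i >= j) continue;                        // nothing left to explain
        if (N[i + 1][j] == N[i][j]) {
            todo.push({i + 1, j});                   // i unpaired
        } else if (N[i][j - 1] == N[i][j]) {
            todo.push({i, j - 1});                   // j unpaired
        } else if (canPair(s[i], s[j]) && N[i + 1][j - 1] + 1 == N[i][j]) {
            pairs.push_back({i, j});                 // i paired with j
            todo.push({i + 1, j - 1});
        } else {
            for (int k = i + 1; k < j; ++k) {        // bifurcation at k
                if (N[i][k] + N[k + 1][j] == N[i][j]) {
                    todo.push({k + 1, j});
                    todo.push({i, k});               // left part is popped first
                    break;
                }
            }
        }
    }
    return pairs;
}
```

For the toy input GGGAAACCC, fill gives a top-right score of 3 and traceback returns the three nested pairs (0,8), (1,7) and (2,6).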

(d) Draw an optimal secondary structure for the sequence. You should use the work provided in (b) and (c). [3 marks]

General remarks:

The assignment has to be handed in on the date it is due before 9:30. To ensure fairness, late hand-ins will generally not be accepted (exceptions can only be made for officially documented medical reasons). Please hand your solution to Holger at the beginning of class.
Please include your name and student number, and the names of all persons you worked with on the assignment.
This assignment should take you no longer than about 4-5 hours to complete, if you have good knowledge of the topics covered. However, don't wait until the last minute relying on this estimate - it might not apply to you (or anyone at all), you might need additional time to consult the literature, etc.
While cooperation between students - especially between CS and non-CS students - is encouraged, each student is expected to work out the actual solutions to the problems individually and hand in their own assignment. In other words: help each other, but do not copy solutions.
Always feel free to contact Holger or Andrew if you need more help than your fellow students can provide.

