CPSC 445 - Assignment 4 (covers Module 4)

released: Thu, 06/03/16; due: Thu, 06/03/23, 9:30 (just before the beginning of class)

1 Gene Finding / General Questions [15 marks]

(a) Briefly explain how the additive scores in a position weight matrix (such as the one in Problem 2a) can be obtained from probabilities of observing certain bases at a given position within a signal. (In your own words, max. 100 words, be as concise and accurate as possible.) [5 marks]

(b) Briefly explain how a WMM can be seen as a special case of an HMM, using the WMM from Problem 2a as an illustrative example. (In your own words, max. 100 words, be as concise and accurate as possible.) [10 marks]

2 [Programming] Gene Finding / Weight Matrix Methods

[40 marks]

(a) Consider the position weight matrix for a transcription factor binding site in yeast:

Pos.	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18
A	11	5	15	0	0	16	0	2	9	3	11	8	23	0	0	9	10	8
C	1	6	0	0	23	1	12	3	5	4	5	3	0	23	0	0	3	5
G	5	4	8	1	0	4	0	4	3	3	3	6	0	0	23	6	5	5
T	6	8	0	22	0	2	11	14	6	13	4	6	0	0	0	8	5	5

Write a program that computes WMM scores for all windows (of the same length as the given weight matrix). Run this program on the following genomic yeast sequence (also available from http://www.cs.ubc.ca/~safari/cpsc445/hw4.in:

TAAGAAACCGGGACTTATATATTTATAAATATAAATCTAACTTAATTAATAATTTAAATAATATACTTTATATTTTATAAATAAAAATAATTAT AACCTTTTTTATAATTATATATAATAATAATATATATTATCAAATAATTATTATTTCTTTTTTTTCTTTAATTAATTAATTAATTAATATTTTATAA AAATATATTTCTCCTTACGGGGTTCCGGCTCCCGTAGCCGGGGCCCGAAACTAAATAAAATATATTATTAATAATATTATATAATATAATA ATAATATAATAATTTTATATAAATATATATTTATATATTAAATTAAATTATAATTTTATTATGAAAATTATATCTTTTTTTTATATTTTTATATAAT AAAAATATGTTATATATATATTAATAAATAAAAATAATTATAACCTTTTTTATAATTATATATAATAATAATATATATTATCAAATAATTATTAT TTCTTTTTTTTCTTTAATTAATTAATTAATTAATATTTTATAAAAATATATTTCTCCTTACGGGGTTCCGGCTCCCGTAGCCGGGGCCCGAAA CTAAATAAAATATATTATTAATAATATTATATAATATAATAATAATATAATAATTTTATATAAATATATATTTATATATTAAATTAAATTATAAT TTTATTATGAAAATTATATCTTTTTTTTATATTTTTATATAATAAAAATATGTTATATATATATTAATAATAAAAGGTAGTGAGGATTAAATAAA TTATATAATAATTATAACTCTTAATTATAAAATAAATATATATATATAT

The length of this sequence is 806 and, hence, there are 788 windows (starting at positions 1, 2, etc.). Write the window start positions (starting the position count at 1 for the first position in the sequence) and the scores in the output file, 4.out, such that each line contains a start position followed by a score value, separated by a single space. Plot the scores (y axis) over window start positions (x axis) using any software that you like (GNUPlot, MS EXCEL, ..) and mark up (by hand or electronically) the regions that intuitively represent good hits. [30 marks]

Important notes:

Your programs should be written either in Java or C++.

When you are done, send an email to safari@cs.ubc.ca with subject 'CPSC445-hw4' and attach your program source.

The name of your program should be w[your-student-id].{c,cpp,java}, e.g. w80132322.cpp. No other file name is accepted.

You programs should be well documented and you should explain the purpose of every function that you write.

Since the marking process is partly automated, you must strictly follow the format of all output files as specified below. You won't be given any chance to modify your program after the assignment is due.

No zip file; Only attach your program source without any extra unnecessary file such as the input files.

Output file: 4.out

(b) Given a number of sequences for a signal, such as the transcription factor binding site above, explain a general method for computing a position weight matrix. Illustrate your method with pseudo-code. Notice that DNAs in the sequence might have different lengths. [10 marks]

3 Gene Finding / Hidden Markov Models [45 marks]

Consider the following HMM, which models an intronic sequence with GC-rich regions. The model consists of 8 states:

the begin (B) and end (E) states corresponding to the 5' and 3' ends, which are silent states (don't emit symbols);
3 states (S1, S2, S3) following (B)
3 states (S6, S7, S8) preceding (E)
two intermediary states (S4, S5).

The states and transition probabilities are indicated in the following diagram:

The emission probabilities for the non-silent states are as follows:

Base	S1	S2	S3	S4	S5	S6	S7	S8
A	0	0	0.35	0.60	0.50	0.75	0	1
C	0	0	0.37	0.13	0.03	0.08	1	0
G	1	0	0.18	0.14	0.44	0.10	0	0
T	0	1	0.10	0.13	0.03	0.07	0	0

Please answer the following questions and show your work (for some parts, this may be easiest to do by writing short programs that compute the answers and print out the steps):

(a) What is the probability of observing GTAATACA along the state path p = B-S1-S2-S3-S5-S4-S6-S7-S8-E? [5 mark ]

(b) What is the probability of seeing GTCGCGCA, given the current HMM? (Show the steps of your calculation.) [15 marks]

(c) Use the backward algorithm to determine the most likely state in the HMM shown above for every position in the observation sequence GTCGCGCA? (Show the steps of your calculation.) [15 marks]

(d) Assume an expert biologist has given you the following profile matrices for the 5' and 3' ends of introns:

5'	+1	+2	+3	+4	+5
A	0	0	0.35	0.08	0.15
C	0	0	0.37	0.04	0.19
G	1	0	0.18	0.81	0.20
T	0	1	0.10	0.07	0.46

3'	-5	-4	-3	-2	-1
A	0.75	0.26	0.06	0	1
C	0.08	0.30	0.05	1	0
G	0.10	0.31	0.84	0	0
T	0.07	0.13	0.05	0	0

Extend the given HMM model based on this information. [10 marks]

Hint: If you are uncertain how to solve this problem, consult Chapter 7 of "Bioinformatics - The Machine Learning Approach" by Baldi and Brunak (available from the Reading Room).

General remarks:

The assignment has to be handed in on the date it is due before 9:30. To ensure fairness, late hand-ins will generally not be accepted (exceptions can only be made for officially documented medical reasons). Please hand your solution to Holger at the beginning of class.
Please do not staple the pages of your assignment and put your name and/or student number on each sheet.
This assignment should take you no longer than about 4-5 hours to complete, if you have good knowledge of the topics covered. However, don't wait until the last minute relying on this estimate - it might not apply to you (or anyone at all), you might need additional time to consult the literature, etc.
While cooperation between students - especially between CS and non-CS students - is encouraged, each student is expected to work out the actual solutions to the problems individually and hand in their own assignment. In other words: help each other, but do not copy solutions.
Feel always free to contact Holger, Baharak or Mohammad if you feel you need further help than can be provided by your fellow students.