CPSC 445 - Assignment 4 (covers Module 4)

released: Thu, 06/03/16; due: Thu, 06/03/23, 9:30 (just before the beginning of class)

1  Gene Finding / General Questions [15 marks]

(a) Briefly explain how the additive scores in a position weight matrix (such as the one in Problem 2a) can be obtained from probabilities of observing certain bases at a given position within a signal. (In your own words, max. 100 words, be as concise and accurate as possible.) [5 marks]

(b) Briefly explain how a WMM can be seen as a special case of an HMM, using the WMM from Problem 2a as an illustrative example. (In your own words, max. 100 words, be as concise and accurate as possible.) [10 marks]


2  [Programming] Gene Finding / Weight Matrix Methods

[40 marks]

(a) Consider the position weight matrix for a transcription factor binding site in yeast:

Pos. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
A 11 5 15 0 0 16 0 2 9 3 11 8 23 0 0 9 10 8
C 1 6 0 0 23 1 12 3 5 4 5 3 0 23 0 0 3 5
G 5 4 8 1 0 4 0 4 3 3 3 6 0 0 23 6 5 5
T 6 8 0 22 0 2 11 14 6 13 4 6 0 0 0 8 5 5

Write a program that computes WMM scores for all windows (of the same length as the given weight matrix). Run this program on the following genomic yeast sequence (also available from http://www.cs.ubc.ca/~safari/cpsc445/hw4.in:

TAAGAAACCGGGACTTATATATTTATAAATATAAATCTAACTTAATTAATAATTTAAATAATATACTTTATATTTTATAAATAAAAATAATTAT AACCTTTTTTATAATTATATATAATAATAATATATATTATCAAATAATTATTATTTCTTTTTTTTCTTTAATTAATTAATTAATTAATATTTTATAA AAATATATTTCTCCTTACGGGGTTCCGGCTCCCGTAGCCGGGGCCCGAAACTAAATAAAATATATTATTAATAATATTATATAATATAATA ATAATATAATAATTTTATATAAATATATATTTATATATTAAATTAAATTATAATTTTATTATGAAAATTATATCTTTTTTTTATATTTTTATATAAT AAAAATATGTTATATATATATTAATAAATAAAAATAATTATAACCTTTTTTATAATTATATATAATAATAATATATATTATCAAATAATTATTAT TTCTTTTTTTTCTTTAATTAATTAATTAATTAATATTTTATAAAAATATATTTCTCCTTACGGGGTTCCGGCTCCCGTAGCCGGGGCCCGAAA CTAAATAAAATATATTATTAATAATATTATATAATATAATAATAATATAATAATTTTATATAAATATATATTTATATATTAAATTAAATTATAAT TTTATTATGAAAATTATATCTTTTTTTTATATTTTTATATAATAAAAATATGTTATATATATATTAATAATAAAAGGTAGTGAGGATTAAATAAA TTATATAATAATTATAACTCTTAATTATAAAATAAATATATATATATAT

The length of this sequence is 806 and, hence, there are 788 windows (starting at positions 1, 2, etc.). Write the window start positions (starting the position count at 1 for the first position in the sequence) and the scores in the output file, 4.out, such that each line contains a start position followed by a score value, separated by a single space. Plot the scores (y axis) over window start positions (x axis) using any software that you like (GNUPlot, MS EXCEL, ..) and mark up (by hand or electronically) the regions that intuitively represent good hits. [30 marks]

Important notes:
  • Your programs should be written either in Java or C++.
  • When you are done, send an email to safari@cs.ubc.ca with subject 'CPSC445-hw4' and attach your program source.
  • The name of your program should be w[your-student-id].{c,cpp,java}, e.g. w80132322.cpp. No other file name is accepted.
  • You programs should be well documented  and you should explain the purpose of every function that you write.
  • Since the marking process is partly automated, you must strictly follow the format of all output files as specified below. You won't be given  any chance to modify your program after the assignment is due.
  • No zip file; Only attach your program source without any extra unnecessary file such as the input files.
  • Output file: 4.out

(b) Given a number of sequences for a signal, such as the transcription factor binding site above, explain a general method for computing a position weight matrix. Illustrate your method with pseudo-code. Notice that DNAs in the sequence might have different lengths. [10 marks]


3  Gene Finding / Hidden Markov Models [45 marks]

Consider the following HMM, which models an intronic sequence with GC-rich regions. The model consists of 8 states:

The states and transition probabilities are indicated in the following diagram:

The emission probabilities for the non-silent states are as follows:

Base S1 S2 S3 S4 S5 S6 S7 S8
A 0 0 0.35 0.60 0.50 0.75 0 1
C 0 0 0.37 0.13 0.03 0.08 1 0
G 1 0 0.18 0.14 0.44 0.10 0 0
T 0 1 0.10 0.13 0.03 0.07 0 0

Please answer the following questions and show your work (for some parts, this may be easiest to do by writing short programs that compute the answers and print out the steps):

(a) What is the probability of observing GTAATACA along the state path p = B-S1-S2-S3-S5-S4-S6-S7-S8-E? [5 mark ]

(b) What is the probability of seeing GTCGCGCA, given the current HMM? (Show the steps of your calculation.) [15 marks]

(c) Use the backward algorithm to determine the most likely state in the HMM shown above for every position in the observation sequence GTCGCGCA? (Show the steps of your calculation.) [15 marks]

(d) Assume an expert biologist has given you the following profile matrices for the 5' and 3' ends of introns:

5' +1 +2 +3 +4 +5
A 0 0 0.35 0.08 0.15
C 0 0 0.37 0.04 0.19
G 1 0 0.18 0.81 0.20
T 0 1 0.10 0.07 0.46
3' -5 -4 -3 -2 -1
A 0.75 0.26 0.06 0 1
C 0.08 0.30 0.05 1 0
G 0.10 0.31 0.84 0 0
T 0.07 0.13 0.05 0 0

Extend the given HMM model based on this information. [10 marks]

Hint: If you are uncertain how to solve this problem, consult Chapter 7 of "Bioinformatics - The Machine Learning Approach" by Baldi and Brunak (available from the Reading Room).


General remarks: