CPSC445/545 - Assignment 2 (covers Module 2)

released: Tue, 04/10/12; due: Tue, 04/10/19

NOTE: CPSC 545 students are only required to solve problems 1 and 2; CPSC 445 students are expected to solve all problems.


1 Local Sequence Alignment [7 marks]

Here is the dynamic programming matrix resulting from a run of the standard pairwise local sequence alignment algorithm (with linear gap penalty d=8) on the protein sequences DEWDEH and NDWEHK, using the BLOSUM50 matrix:

        N   D   W   E   H   K
    0   0   0   0   0   0   0
D   0   2   8   0   2   0   0
E   0   0   4   5   6   2   1
W   ?   0   0  19  11   3   0
D   0   2   8   ?  21  13   5
E   0   0   4   5  17  21  14
H   0   1   0   1   ?   ?  21

(a) Complete the table, by replacing the ?'s with the appropriate numbers. [4 marks]

(b) From the table, give the optimal local alignment for these two sequences and its score. [3 marks]
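
For reference, the recurrence behind a table like the one above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name local_alignment_matrix and the toy match/mismatch scores are assumptions made for the example, and the toy scores are not BLOSUM50. Each cell is the maximum of 0, the diagonal cell plus the substitution score, and the cell above or to the left minus the linear gap penalty d.

```python
def local_alignment_matrix(x, y, score, d):
    """Fill the local-alignment DP matrix for x (rows) and y (columns).

    score(a, b) returns a substitution score; d is the positive linear
    gap penalty. Row 0 and column 0 stay at 0.
    """
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            F[i][j] = max(
                0,                                            # start a new local alignment
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # align the two residues
                F[i - 1][j] - d,                              # residue of x against a gap
                F[i][j - 1] - d,                              # residue of y against a gap
            )
    return F


def toy_score(a, b):
    # Hypothetical match/mismatch scores for illustration, NOT BLOSUM50.
    return 2 if a == b else -1


F = local_alignment_matrix("ACGT", "CG", toy_score, d=2)
best = max(max(row) for row in F)  # score of the best local alignment
for row in F:
    print(row)
```

The optimal local alignment is then recovered by tracing back from the cell holding the maximum value until a cell with value 0 is reached.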


2 Multiple Sequence Alignment [10 marks]

(a) Determine the sum-of-pairs score for the following multiple sequence alignment of DNA sequences, using the scoring matrix in which a match gets a value of +4, a mismatch gets a value of -1, and a (base,gap) pair gets a value of -2. (A (gap,gap) pair gets a value of 0.) [3 marks]

GCAA
GT-A
C--A
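
As an illustration of the scoring scheme described in (a), the following sketch computes a sum-of-pairs score column by column. The function names and the example alignment are assumptions made for this sketch; the example is deliberately a different alignment from the one given above.

```python
from itertools import combinations


def pair_score(a, b, match=4, mismatch=-1, gap=-2):
    """Score one pair of symbols taken from the same alignment column."""
    if a == "-" and b == "-":
        return 0                     # (gap,gap) pair
    if a == "-" or b == "-":
        return gap                   # (base,gap) pair
    return match if a == b else mismatch


def sum_of_pairs(alignment):
    """Sum pair_score over all pairs of rows, for every column."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*alignment):
        total += sum(pair_score(a, b) for a, b in combinations(column, 2))
    return total


# Hypothetical alignment, not the one from the problem statement.
example = ["AC-T", "A-GT", "ACGA"]
print(sum_of_pairs(example))
```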

(b) Align the following two alignments (i.e., profiles) of protein sequences. Use the BLOSUM50 matrix and linear gap penalties with d=-8. [4 marks]
Hint: Make sure you understand the description of profile alignment in Durbin et al. (pages 146-147).

Profile 1:
RWCH
R-CY

Profile 2:
RIWY
RVW-
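
One common way to score a column of one profile against a column of the other, in the spirit of the sum-of-pairs profile alignment referred to in the hint, is to add up the scores of all cross pairs of symbols, counting a (residue,gap) pair as -d and a (gap,gap) pair as 0. The sketch below assumes this convention; the function name cross_column_score, the toy substitution scores, and the example columns are hypothetical and are not taken from BLOSUM50 or from the profiles above.

```python
def cross_column_score(col1, col2, score, d):
    """Sum of pairwise scores between every symbol of col1 and every symbol of col2.

    score(a, b) is a substitution score for two residues; d is the positive
    linear gap penalty. Pairs within the same profile are not counted here,
    since they do not depend on how the two profiles are aligned.
    """
    total = 0
    for a in col1:
        for b in col2:
            if a == "-" and b == "-":
                continue             # (gap,gap) pair contributes 0
            if a == "-" or b == "-":
                total -= d           # (residue,gap) pair
            else:
                total += score(a, b)
    return total


def toy_score(a, b):
    # Hypothetical match/mismatch scores for illustration only.
    return 2 if a == b else -1


# A column "A-" from one profile against a column "AC" from the other.
print(cross_column_score("A-", "AC", toy_score, d=2))
```

Aligning a profile column against an inserted all-gap column is the same computation with col2 made up entirely of "-" symbols.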

(c) Briefly explain the role of guide trees in progressive multiple sequence alignment algorithms. What do the leaf and internal nodes of a guide tree represent? [3 marks]
Hint: Review the section on progressive alignment methods in Durbin et al.


3 Similarity Search using BLAST (Hands-on Problem) [CPSC 445 students only; 12 marks]

Run a Protein-Protein Blast search (BLASTP) at the NCBI web site in order to find proteins similar to the Matrix glycoprotein of Human coronavirus OC43 in the SWISSPROT database; the sequence of this protein in FASTA format is as follows:

>gi|267362|sp|Q01455|VME1_CVHOC E1 glycoprotein (Matrix glycoprotein) (Membrane glycoprotein)
MSSKTTPAPVYIWTADEAIKFLKEWNFSLGIILLFITIILQFGYTSRSMFVYVIKMIILWLMWPLTIILT
IFNCVYALNNVYLGLSIVFTIVAIIMWIVYFVNSIRLFIRTGSFWSFNPETNNLMCIDMKGTMYVRPIIE
DYHTLTVTIIRGHLYIQGIKLGTGYSWADLPAYMTVAKVTHLCTYKRGFLDRISDTSGFAVYVKSKVGNY
RLPSTQKGSGMDTALLRNNI

Perform the search using the BLOSUM62 scoring matrix, with gap opening cost = 11, gap extension cost = 1, Expectation = 10 and word size = 3.

(a) Answer the following questions:

  1. What is the number of hits reported by BLASTP? [1 mark]
  2. What is the number of hits with a bit score between 80 and 200? [1 mark]
  3. Which types of organisms do the proteins from the previous question belong to? (Hint: Use the taxonomy report link on the BLAST result page) [1 mark]
  4. Some of these organisms are parasitic. In which hosts are these organisms found? (This can be easily inferred from the taxonomy report information) [1 mark]
  5. What types of disorders are caused by the viruses whose proteins received a bit score between 80 and 200 in this search? [1 mark]
  6. Give the alignment of the Human coronavirus matrix protein sequence with the SARS coronavirus sequence. Specify the number of identical residues and % sequence identity, the number of similar residues and % similarity (in BLAST terminology: % positive), and the number and % of gaps. [1 mark]
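
The identity and gap figures asked for in question 6 are counted over the columns of the pairwise alignment. A minimal sketch of that bookkeeping is given below; the function name alignment_stats and the two short aligned strings are hypothetical. Counting similar residues ("positives") additionally requires the substitution matrix and is not shown.

```python
def alignment_stats(aligned_query, aligned_subject):
    """Count identities and gap columns in a pairwise alignment.

    Both strings must have the same length, with '-' marking gaps.
    Percentages are taken over the full alignment length, gaps included.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    length = len(aligned_query)
    pairs = list(zip(aligned_query, aligned_subject))
    identities = sum(1 for a, b in pairs if a == b and a != "-")
    gaps = sum(1 for a, b in pairs if a == "-" or b == "-")
    return {
        "length": length,
        "identities": identities,
        "percent_identity": 100.0 * identities / length,
        "gaps": gaps,
        "percent_gaps": 100.0 * gaps / length,
    }


# Hypothetical aligned fragments, not taken from any BLAST output.
stats = alignment_stats("MKT-AY", "MRTLAY")
print(stats)
```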

(b) Are the scores for the hits with bit scores between 80 and 200 significantly affected when using the PAM70 scoring matrix instead of the BLOSUM62 matrix? [1 mark]

(c) Does the alignment between Human coronavirus and SARS proteins change when using the PAM70 scoring matrix instead of the BLOSUM62 matrix? If so, how? [1 mark]

(d) What does the expectation parameter mean? What will happen if the expectation value is increased from its default value of 10 to 100? [2 marks]
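
As background for (d), the E-value of a hit is related to its bit score S' by E = m * n * 2^(-S'), where m and n are the (effective) lengths of the query and the database. The sketch below only illustrates this relationship; the function names are made up for the example, the database size is a hypothetical number, and raw lengths are used where BLAST would use effective lengths.

```python
def e_value_from_bit_score(bit_score, query_length, database_length):
    """Expected number of chance hits with at least this bit score: m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


def passes_threshold(bit_score, query_length, database_length, expect=10.0):
    """True if a hit with this bit score has an E-value at or below the threshold."""
    return e_value_from_bit_score(bit_score, query_length, database_length) <= expect


m = 230            # assumed query length in residues (raw, not effective)
n = 100_000_000    # hypothetical database size in residues
for s in (20, 30, 40, 80):
    e = e_value_from_bit_score(s, m, n)
    print(s, e, passes_threshold(s, m, n, expect=10.0), passes_threshold(s, m, n, expect=100.0))
```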

(e) Based on the respective alignments, would you say that the Human coronavirus sequence is very similar to the sequence for the SARS matrix protein? [1 mark]

(f) Human coronavirus OC43 belongs to group 2 of the mammalian coronaviruses. One of the BLAST hits in the search you have performed is a matrix protein from a group 1 human coronavirus (229E). Which of these two matrix proteins, which belong to two different groups of human coronaviruses, is more similar to the SARS matrix protein? (Hint: you will have to perform another BLAST search using the FASTA sequence of the group 1 human coronavirus protein; click on the web link for that protein, which will take you to its GenBank record.) [1 mark]


General remarks: