CPSC445/545 - Assignment 2 (covers Module 2)

released: Tue, 03/10/07; due: Tue, 03/10/14

NOTE: CPSC 545 students are only required to solve problems 1 and 2; CPSC 445 students are expected to solve all problems.


1 Local Sequence Alignment [6 marks]

Here is the dynamic programming matrix resulting from a run of the standard pairwise local sequence alignment algorithm (with linear gap penalty with d=-8) on the protein sequences DEWDEH and NDWEHK, using the BLOSUM50 matrix:

          N   D   W   E   H   K
      0   0   0   0   0   0   0
  D   0   2   8   0   2   0   0
  E   0   0   4   5   6   2   1
  W   0   0   0  19  11   3   0
  D   0   2   8   ?  21  13   5
  E   0   0   4   5  17  21   ?
  H   0   1   0   ?   9  27  21

(a) Complete the table, by replacing the ?'s with the appropriate numbers. [3 marks]

(b) From the table, give the optimal local alignment for these two sequences and its score. [3 marks]
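
As a reminder of how such a matrix is filled, the following is a minimal Python sketch of the standard local alignment recurrence with a linear gap score. The function names, the toy match/mismatch scores and the example sequences are illustrative assumptions and are not part of the assignment; for the problem above, the substitution scores would come from BLOSUM50 and the gap score would be -8.

```python
def local_alignment_matrix(x, y, score, gap):
    """Fill the local alignment DP matrix F for x (rows) and y (columns).

    score(a, b) returns the substitution score for residues a and b.
    gap is the (negative) score added for each gap position.
    """
    # Row 0 and column 0 stay at 0: a local alignment may start anywhere.
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            F[i][j] = max(
                0,                                            # start a new alignment
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # align x[i-1] with y[j-1]
                F[i - 1][j] + gap,                            # x[i-1] aligned to a gap
                F[i][j - 1] + gap,                            # y[j-1] aligned to a gap
            )
    return F


def toy_score(a, b):
    # Hypothetical scoring for illustration only: +2 match, -1 mismatch.
    return 2 if a == b else -1


F = local_alignment_matrix("ACGT", "CG", toy_score, gap=-2)
best = max(max(row) for row in F)
```

The largest entry in F is the score of the best local alignment; the alignment itself is recovered by tracing back from that cell until a cell with value 0 is reached.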


2 Multiple Sequence Alignment [10 marks]

(a) Determine the sum-of-pairs score for the following multiple sequence alignment of DNA sequences, using the scoring scheme in which a match gets a value of +4, a mismatch gets a value of -1, and a (base,gap) pair gets a value of -2. (A (gap,gap) pair gets a value of 0.) [3 marks]

GCAA
GT-A
C--A
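
The scoring scheme described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `sp_score` and the toy alignment are hypothetical, and the toy alignment is deliberately different from the one in the problem.

```python
from itertools import combinations


def sp_score(alignment, match=4, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length strings.

    Every pair of rows is scored in every column; '-' denotes a gap.
    """
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue            # (gap,gap) pairs contribute 0
            elif a == "-" or b == "-":
                total += gap        # (base,gap) pair
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


# Toy alignment (not the one from the assignment).
toy = ["ACG",
       "A-G",
       "T-G"]
toy_total = sp_score(toy)
```

For the toy alignment, the three columns contribute 2, -4 and 12 respectively.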

(b) Align the following two alignments (i.e., profiles) of protein sequences. Use the BLOSUM50 matrix and linear gap penalties with d=-8. [4 marks]
Hint: Make sure you understand the description of profile alignment in Durbin et al. (pages 146-147).

Profile 1:
RWCH
R-CY

Profile 2:
RIWY
RVW-

(c) Briefly explain the role of guide trees in progressive multiple sequence alignment algorithms. What do the leaf and internal nodes of a guide tree represent? [3 marks]
Hint: Review the section on progressive alignment methods in Durbin et al.


3 Similarity Search using BLAST (Hands-on Problem) [CPSC 445 students only; 11 marks]

Run a protein-protein BLAST search (BLASTP) at the NCBI web site to find proteins in the SWISSPROT database that are similar to the matrix protein of SARS. The sequence of this protein in FASTA format is as follows:

>SARS - NP_828855
MADNGTITVEELKQLLEQWNLVIGFLFLAWIMLLQFAYSNRNRFLYIIKLVFLWLLWPVTLACFVLAAVY
RINWVTGGIAIAMACIVGLMWLSYFVASFRLFARTRSMWSFNPETNILLNVPLRGTIVTRPLMESELVIG
AVIIRGHLRMAGHSLGRCDIKDLPKEITVATSRTLSYYKLGASQRVGTDSGFAAYNRYRIGNYKLNTDHA
GSNDNIALLVQ

(a) Use the BLOSUM62 scoring matrix (with gap opening cost = 11, gap extension cost = 1, Expectation = 10, word size = 3) to answer the following questions:

  1. What is the number of hits reported by BLASTP? [1 mark]
  2. What is the number of hits with a bit score between 80 and 200? [1 mark]
  3. Which types of organisms do the proteins from Question 2 belong to? (Hint: Use the taxonomy report link on the BLAST result page) [1 mark]
  4. Some of the organisms from Question 3 are parasitic. In which hosts are these organisms found? (This can be easily inferred from the taxonomy report information) [1 mark]
  5. What types of disorders are caused by the viruses whose proteins received a bit score between 80 and 200 in this search? [1 mark]
  6. Give the alignment of the SARS matrix protein sequence with two Human coronavirus sequences (HCoV-OC43 and HCoV-229E). Specify the number of identical residues and % sequence identity, the number of similar residues and % similarity (in BLAST terminology: % positive), and the number and % of gaps. [1 mark]

(b) Are the scores of the top hits from the search in part (a), i.e., from hits with bit scores between 80 and 200, significantly affected when using the PAM70 scoring matrix instead of the BLOSUM62 matrix? [1 mark]

(c) Does the alignment between the SARS sequence and each Human coronavirus sequence change when using the PAM70 scoring matrix instead of the BLOSUM62 matrix? If so, how? [1 mark]

(d) What does the expectation parameter mean? What will happen if the expectation value is increased from its default value of 10 to 100? [2 marks]

(e) Based on the respective alignments, would you say that the two Human coronavirus sequences are very similar to the sequence for the SARS matrix protein? [1 mark]


General remarks: