# CPSC445/545 - Assignment 2 (covers Module 2)

released: Tue, 04/10/12; due: Tue, 04/10/19

NOTE: CPSC 545 students are only required to solve problems 1 and 2; CPSC 445 students are expected to solve all problems.

#### 1 Local Sequence Alignment [7 marks]

Here is the dynamic programming matrix resulting from a run of the standard pairwise local sequence alignment algorithm (with linear gap penalty d=8) on the protein sequences DEWDEH and NDWEHK, using the BLOSUM50 matrix:

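|   |   | N | D | W | E | H | K |
|---|---|---|---|---|---|---|---|
|   | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| D | 0 | 2 | 8 | 0 | 2 | 0 | 0 |
| E | 0 | 0 | 4 | 5 | 6 | 2 | 1 |
| W | ? | 0 | 0 | 19 | 11 | 3 | 0 |
| D | 0 | 2 | 8 | ? | 21 | 13 | 5 |
| E | 0 | 0 | 4 | 5 | 17 | 21 | 14 |
| H | 0 | 1 | 0 | 1 | ? | ? | 21 |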

(a) Complete the table, by replacing the ?'s with the appropriate numbers. [4 marks]

(b) From the table, give the optimal local alignment for these two sequences and its score. [3 marks]
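As a reminder of how such a table is filled in, here is a minimal Python sketch of the local alignment recurrence with a linear gap penalty. It is only an illustration under stated assumptions: the function names, the toy DNA sequences, and the toy match/mismatch scores are made up for this example and are not the BLOSUM50 values needed for the problem above.

```python
def local_alignment_matrix(x, y, score, d):
    """Fill the local alignment DP matrix F (rows follow x, columns follow y).

    score(a, b) is the substitution score; d is the linear gap penalty.
    """
    rows, cols = len(x) + 1, len(y) + 1
    # First row and first column stay 0 for local alignment.
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                0,                                        # start a new alignment
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # align x_i with y_j
                F[i - 1][j] - d,                          # x_i aligned to a gap
                F[i][j - 1] - d,                          # y_j aligned to a gap
            )
    return F


def toy_score(a, b):
    # Hypothetical scoring scheme, used only for this illustration.
    return 2 if a == b else -1


F = local_alignment_matrix("ACGT", "CG", toy_score, d=2)
best = max(max(row) for row in F)  # score of the best local alignment
```

The optimal local alignment is then recovered by tracing back from the cell holding the maximum value until a cell with value 0 is reached.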

#### 2 Multiple Sequence Alignment [10 marks]

(a) Determine the sum-of-pairs scores for the following multiple sequence alignment of DNA sequences, using the scoring matrix in which a match gets a value of +4, a mismatch gets a value of -1, and a (base,gap) pair gets a value of -2. (A (gap,gap) pair gets a value of 0.) [3 marks]

GCAA
GT-A
C--A
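The sum-of-pairs score adds up, column by column, the scores of all pairs of symbols in that column. The following Python sketch illustrates this with the scoring scheme given above; the function names and the toy alignment are hypothetical and deliberately different from the alignment in this problem.

```python
from itertools import combinations


def pair_score(a, b, match=4, mismatch=-1, gap=-2):
    """Score one pair of symbols from the same alignment column."""
    if a == "-" and b == "-":
        return 0          # (gap, gap) pair
    if a == "-" or b == "-":
        return gap        # (base, gap) pair
    return match if a == b else mismatch


def column_scores(alignment):
    """Return the sum-of-pairs score of each column."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return [
        sum(pair_score(a, b) for a, b in combinations(column, 2))
        for column in zip(*alignment)
    ]


def sum_of_pairs(alignment):
    """Return the total sum-of-pairs score of the alignment."""
    return sum(column_scores(alignment))


toy_alignment = ["A-T", "A-C", "GGT"]   # made-up example
per_column = column_scores(toy_alignment)
total = sum_of_pairs(toy_alignment)
```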

(b) Align the following two alignments (i.e., profiles) of protein sequences. Use the BLOSUM50 matrix and a linear gap penalty of d=8 (i.e., each gap position scores -8). [4 marks]
Hint: Make sure you understand the description of profile alignment in Durbin et al. (pages 146-147).

Profile 1:
RWCH
R-CY

Profile 2:
RIWY
RVW-

(c) Briefly explain the role of guide trees in progressive multiple sequence alignment algorithms. What do the leaf and internal nodes of a guide tree represent? [3 marks]
Hint: Review the section on progressive alignment methods in Durbin et al.

#### 3 Similarity Search using BLAST (Hands-on Problem) [CPSC 445 students only; 12 marks]

Run a protein-protein BLAST search (BLASTP) at the NCBI web site in order to find proteins similar to the Matrix glycoprotein of Human coronavirus OC43 in the SWISSPROT database; the sequence of this protein in FASTA format is as follows:

>gi|267362|sp|Q01455|VME1_CVHOC E1 glycoprotein (Matrix glycoprotein) (Membrane glycoprotein)
IFNCVYALNNVYLGLSIVFTIVAIIMWIVYFVNSIRLFIRTGSFWSFNPETNNLMCIDMKGTMYVRPIIE
RLPSTQKGSGMDTALLRNNI

(a) Perform the search using the BLOSUM62 scoring matrix, with gap opening cost = 11, gap extension cost = 1, Expectation = 10 and word size = 3.

1. What is the number of hits reported by BLASTP? [1 mark]
2. What is the number of hits with a bit score between 80 and 200? [1 mark]
3. Which types of organisms do the proteins from the previous question belong to? (Hint: Use the taxonomy report link on the BLAST result page) [1 mark]
4. Some of these organisms are parasitic. In which hosts are these organisms found? (This can be easily inferred from the taxonomy report information) [1 mark]
5. What types of disorders are caused by the viruses whose proteins received a bit score between 80 and 200 in this search? [1 mark]
6. Give the alignment of the Human coronavirus matrix protein sequence with the SARS coronavirus sequence. Specify the number of identical residues and % sequence identity, the number of similar residues and % similarity (in BLAST terminology: % positive), and the number and % of gaps. [1 mark]
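For reference, identity and gap statistics can be counted directly from the two rows of a pairwise alignment, with percentages taken relative to the alignment length (including gap columns), as in BLAST reports. The Python sketch below is only an illustration: the function name and the toy rows are hypothetical, and counting similar ("positive") residues is left out because it additionally requires the scoring matrix.

```python
def alignment_stats(row_a, row_b):
    """Count identities and gap columns in a pairwise alignment.

    Both rows must have the same length; '-' denotes a gap.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    length = len(row_a)
    columns = list(zip(row_a, row_b))
    identities = sum(1 for a, b in columns if a == b and a != "-")
    gaps = sum(1 for a, b in columns if a == "-" or b == "-")
    return {
        "length": length,
        "identities": identities,
        "pct_identity": 100.0 * identities / length,
        "gaps": gaps,
        "pct_gaps": 100.0 * gaps / length,
    }


# Made-up rows, not taken from any BLAST result.
stats = alignment_stats("MKV-LT", "MRVALT")
```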

(b) Are the scores for the hits with bit scores between 80 and 200 significantly affected when using the PAM70 scoring matrix instead of the BLOSUM62 matrix? [1 mark]

(c) Does the alignment between Human coronavirus and SARS proteins change when using the PAM70 scoring matrix instead of the BLOSUM62 matrix? If so, how? [1 mark]

(d) What does the expectation parameter mean? What will happen if the expectation value is increased from its default value of 10 to 100? [2 marks]
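As background for this question, the bit score S' and the E-value of a hit are related by E = m · n · 2^(-S'), where m and n are the lengths of the query and of the database. The Python sketch below evaluates this relation; the function name and the example lengths are hypothetical, and BLAST itself uses length-corrected ("effective") values of m and n, so the numbers it reports will differ somewhat.

```python
def expect_from_bit_score(bit_score, query_len, db_len):
    """E-value implied by a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical lengths, chosen only for illustration.
e_30 = expect_from_bit_score(30, query_len=100, db_len=10**6)
e_31 = expect_from_bit_score(31, query_len=100, db_len=10**6)

# Each additional bit of score halves the expected number of chance hits.
ratio = e_31 / e_30
```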

(e) Based on the respective alignments, would you say that the Human coronavirus sequence is very similar to the sequence for the SARS matrix protein? [1 mark]

(f) Human coronavirus OC43 belongs to group 2 of the mammalian coronaviruses. One of the BLAST hits in the search you have performed is a matrix protein from a group 1 human coronavirus (229E). Which of these two matrix proteins, belonging to two different groups of human coronaviruses, is more similar to the SARS matrix protein? (Hint: you will have to perform another BLAST search using the FASTA sequence of the group 1 human coronavirus protein; click on the web link for that protein, which will take you to its GenBank record.) [1 mark]

#### General remarks:

• While cooperation between students - especially between CS and non-CS students - is encouraged, each student is expected to work out the actual solutions to the problems individually and hand in their own assignment. In other words: help each other, but do not copy solutions.
• Always feel free to contact Holger, Baharak, or Sanja if you need more help than your fellow students can provide.
• The assignment must be handed in on the due date, before or at the beginning of class.
• This assignment should take you about 1.5-3 hours of work, if you have good knowledge of the topics covered and did all reading assignments. However, don't wait until the last minute relying on this estimate: it might not apply to you (or anyone at all), you might need additional time to consult the literature, ...