# CPSC445/545 - Assignment 2 (covers Module 2)

released: Tue, 03/10/07; due: Tue, 03/10/14

NOTE: CPSC 545 students are only required to solve problems 1 and 2; CPSC 445 students are expected to solve all problems.

#### 1 Local Sequence Alignment [6 marks]

Here is the dynamic programming matrix resulting from a run of the standard pairwise local sequence alignment algorithm (with linear gap penalty with d=-8) on the protein sequences DEWDEH and NDWEHK, using the BLOSUM50 matrix:

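|   |   | N | D | W | E | H | K |
|---|---|---|---|---|---|---|---|
|   | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| D | 0 | 2 | 8 | 0 | 2 | 0 | 0 |
| E | 0 | 0 | 4 | 5 | 6 | 2 | 1 |
| W | 0 | 0 | 0 | 19 | 11 | 3 | 0 |
| D | 0 | 2 | 8 | ? | 21 | 13 | 5 |
| E | 0 | 0 | 4 | 5 | 17 | 21 | ? |
| H | 0 | 1 | 0 | ? | 9 | 27 | 21 |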

(a) Complete the table, by replacing the ?'s with the appropriate numbers. [3 marks]

(b) From the table, give the optimal local alignment for these two sequences and its score. [3 marks]
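
As a reminder of how such a table is filled, here is a minimal sketch of the local alignment recurrence with a linear gap penalty. It is not a solution to this problem: `toy_score` is a hypothetical match/mismatch function standing in for BLOSUM50, the function names are illustrative, and the sketch assumes the gap penalty is passed as a positive magnitude `d` that is subtracted from the score.

```python
def local_alignment_matrix(x, y, score, d):
    """Fill the local alignment matrix F with x along the rows and y along the columns.

    score(a, b) returns a substitution score; d is a positive linear gap
    penalty (an assumption of this sketch) that is subtracted per gap position.
    """
    rows, cols = len(x) + 1, len(y) + 1
    # Row 0 and column 0 stay at 0: a local alignment may start anywhere.
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                0,                                       # start a new alignment
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # align x_i with y_j
                F[i - 1][j] - d,                         # x_i aligned to a gap
                F[i][j - 1] - d,                         # y_j aligned to a gap
            )
    return F


def toy_score(a, b):
    """Hypothetical scoring for illustration only: +2 match, -1 mismatch."""
    return 2 if a == b else -1


F = local_alignment_matrix("ACGT", "CG", toy_score, d=2)
best = max(max(row) for row in F)  # score of the best local alignment
```

The optimal local alignment is then recovered by tracing back from the cell holding the maximum value until a cell with value 0 is reached.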

#### 2 Multiple Sequence Alignment [10 marks]

(a) Determine the sum-of-pairs scores for the following multiple sequence alignment of DNA sequences, using the scoring matrix in which a match gets a value of +4, a mismatch gets a value of -1, and a (base,gap) pair gets a value of -2. (A (gap,gap) pair gets a value of 0.) [3 marks]

GCAA
GT-A
C--A
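
The sum-of-pairs computation can be sketched as follows. The default scores mirror the scheme stated in part (a), while the function names and the example alignment are hypothetical and deliberately different from the one given above. The sketch assumes all rows have the same length.

```python
from itertools import combinations


def pair_score(a, b, match=4, mismatch=-1, gap=-2):
    """Score one pair of symbols; '-' denotes a gap."""
    if a == "-" and b == "-":
        return 0            # (gap, gap) pair
    if a == "-" or b == "-":
        return gap          # (base, gap) pair
    return match if a == b else mismatch


def sum_of_pairs(alignment):
    """Add up the pair scores over every column and every pair of rows."""
    total = 0
    for column in zip(*alignment):          # iterate column by column
        for a, b in combinations(column, 2):  # every unordered pair of rows
            total += pair_score(a, b)
    return total


# Hypothetical three-sequence alignment, used only to exercise the function.
example = ["ACG", "A-G", "TCG"]
sp = sum_of_pairs(example)
```

For this example the three columns contribute 2, 0, and 12, giving a total of 14.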

(b) Align the following two alignments (i.e., profiles) of protein sequences. Use the BLOSUM50 matrix and linear gap penalties with d=-8. [4 marks]
Hint: Make sure you understand the description of profile alignment in Durbin et al. (pages 146-147).

Profile 1:
RWCH
R-CY

Profile 2:
RIWY
RVW-
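
One common way to score a column of one profile against a column of the other is to add up the pair scores across the two columns, which then lets the profiles be aligned like two ordinary sequences. The sketch below illustrates only that column score, under stated assumptions: `toy_score` is a hypothetical stand-in for BLOSUM50, `d` is taken as a positive magnitude, a (residue, gap) pair costs `d`, and a (gap, gap) pair scores 0. Check these conventions against the textbook description before relying on them.

```python
def cross_column_score(col1, col2, score, d):
    """Sum the pair scores between every symbol of col1 and every symbol of col2.

    col1 and col2 are strings holding one column of each profile.
    Assumptions of this sketch: a (residue, gap) pair costs d and a
    (gap, gap) pair scores 0.
    """
    total = 0
    for a in col1:
        for b in col2:
            if a == "-" and b == "-":
                continue                 # (gap, gap) contributes nothing
            if a == "-" or b == "-":
                total -= d               # residue against a gap
            else:
                total += score(a, b)     # residue against a residue
    return total


def toy_score(a, b):
    """Hypothetical scoring for illustration only: +2 match, -1 mismatch."""
    return 2 if a == b else -1


col_vs_col = cross_column_score("A-", "AC", toy_score, d=2)
# Aligning a column to an inserted gap is the same call with an all-gap column.
col_vs_gap = cross_column_score("A-", "--", toy_score, d=2)
```

With this column score in place of the substitution score, the usual pairwise dynamic programming recurrence can be run over the columns of the two profiles.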

(c) Briefly explain the role of guide trees in progressive multiple sequence alignment algorithms. What do the leaf and internal nodes of a guide tree represent? [3 marks]
Hint: Review the section on progressive alignment methods in Durbin et al.

#### 3 Similarity Search using BLAST (Hands-on Problem) [CPSC 445 students only; 11 marks]

Run a Protein-Protein BLAST search (BLASTP) at the NCBI web site in order to find proteins similar to the matrix protein of SARS in the SWISSPROT database; the sequence of this protein in FASTA format is as follows:

>SARS - NP_828855
RINWVTGGIAIAMACIVGLMWLSYFVASFRLFARTRSMWSFNPETNILLNVPLRGTIVTRPLMESELVIG
AVIIRGHLRMAGHSLGRCDIKDLPKEITVATSRTLSYYKLGASQRVGTDSGFAAYNRYRIGNYKLNTDHA
GSNDNIALLVQ

(a) Use the BLOSUM62 scoring matrix (with gap opening cost = 11, gap extension cost = 1, Expectation = 10, word size = 3) to answer the following questions:

1. What is the number of hits reported by BLASTP? [1 mark]
2. What is the number of hits with a bit score between 80 and 200? [1 mark]
3. Which types of organisms do the proteins from Question 2 belong to? (Hint: Use the taxonomy report link on the BLAST result page) [1 mark]
4. Some of the organisms from Question 3 are parasitic. In which hosts are these organisms found? (This can be easily inferred from the taxonomy report information) [1 mark]
5. What types of disorders are caused by the viruses whose proteins received a bit score between 80 and 200 in this search? [1 mark]
6. Give the alignment of the SARS matrix protein sequence with two Human coronavirus sequences (HCoV-OC43 and HCoV-229E). Specify the number of identical residues and % sequence identity, the number of similar residues and % similarity (in BLAST terminology: % positive), and the number and % of gaps. [1 mark]

(b) Are the scores of the top hits from the search in part (a), i.e., from hits with bit scores between 80 and 200, significantly affected when using the PAM70 scoring matrix instead of the BLOSUM62 matrix? [1 mark]

(c) Does the alignment between the SARS sequence and each Human coronavirus sequence change when using the PAM70 scoring matrix instead of the BLOSUM62 matrix? If so, how? [1 mark]

(d) What does the expectation parameter mean? What will happen if the expectation value is increased from its default value of 10 to 100? [2 marks]

(e) Based on the respective alignments, would you say that the two Human coronavirus sequences are very similar to the sequence for the SARS matrix protein? [1 mark]
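
The identity and gap figures asked for in question 6 of part (a) can be read directly off the BLAST output. For checking them by hand, a minimal sketch is given below. It assumes the two aligned strings have equal length and use `-` for gaps, and that percentages are taken relative to the alignment length. The function name and the example strings are hypothetical. Counting similar (positive) residues would additionally require the substitution matrix and is not covered here.

```python
def alignment_stats(query, subject):
    """Count identities and gap positions in a pairwise alignment.

    query and subject are aligned strings of equal length with '-' for gaps.
    Percentages are relative to the alignment length (an assumption of
    this sketch).
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    length = len(query)
    pairs = list(zip(query, subject))
    identities = sum(1 for a, b in pairs if a == b and a != "-")
    gaps = sum(1 for a, b in pairs if a == "-" or b == "-")
    return {
        "length": length,
        "identities": identities,
        "percent_identity": 100 * identities / length,
        "gaps": gaps,
        "percent_gaps": 100 * gaps / length,
    }


# Hypothetical aligned fragments, not taken from any BLAST result.
stats = alignment_stats("ACD-F", "ACEGF")
```

Here three of the five alignment positions are identical and one contains a gap, so the sketch reports 60% identity and 20% gaps.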

#### General remarks:

• While cooperation between students - especially between CS and non-CS students - is encouraged, each student is expected to work out the actual solutions to the problems individually and hand in their own assignment. In other words: help each other, but do not copy solutions.
• Always feel free to contact Holger, Alena, or Dan if you need more help than your fellow students can provide.
• The assignment must be handed in before or at the beginning of class on the due date.
• This assignment should take you about 2-3 hours of work if you have a good knowledge of the topics covered and have done all the reading assignments. However, don't wait until the last minute relying on this estimate - it might not apply to you (or anyone at all), you might need additional time to consult the literature, ...
• You may hand in the assignment electronically by sending a PDF or plain text file (sorry, no Word files!) via e-mail to Alena <oshmygel@cs.ubc.ca>. We are working on a more advanced electronic handin procedure.