CPSC445/545 - Assignment 2 (covers Module 2)
released: Tue, 04/10/12; due: Tue, 04/10/19
NOTE: CPSC 545 students are only required to solve problems 1 and 2;
CPSC 445 students are expected to solve all problems.
1 Local Sequence Alignment [7 marks]
Here is the dynamic programming matrix resulting from a run of
the standard pairwise local sequence alignment algorithm
(with linear gap penalty d=8)
on the protein sequences DEWDEH and NDWEHK, using the
BLOSUM50 matrix:
   |    |  N |  D |  W |  E |  H |  K |
   |  0 |  0 |  0 |  0 |  0 |  0 |  0 |
 D |  0 |  2 |  8 |  0 |  2 |  0 |  0 |
 E |  0 |  0 |  4 |  5 |  6 |  2 |  1 |
 W |  ? |  0 |  0 | 19 | 11 |  3 |  0 |
 D |  0 |  2 |  8 |  ? | 21 | 13 |  5 |
 E |  0 |  0 |  4 |  5 | 17 | 21 | 14 |
 H |  0 |  1 |  0 |  1 |  ? |  ? | 21 |
(a) Complete the table, by replacing the ?'s with the appropriate
numbers. [4 marks]
(b) From the table, give the optimal local alignment for these two
sequences and its score. [3 marks]
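For reference, the recurrence that fills such a table can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name local_alignment_matrix, the toy scoring function (match +2, mismatch -1) and the gap penalty d=1 are hypothetical stand-ins, not BLOSUM50 with d=8 as used in the problem.

```python
def local_alignment_matrix(x, y, score, d):
    """Fill the local-alignment matrix F for x (rows) and y (columns).

    F[i][j] = max(0,
                  F[i-1][j-1] + score(x[i-1], y[j-1]),
                  F[i-1][j] - d,
                  F[i][j-1] - d)
    """
    n, m = len(x), len(y)
    # Row 0 and column 0 are the boundary and stay at 0.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                0,                                            # start a new local alignment
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # align x[i-1] with y[j-1]
                F[i - 1][j] - d,                              # x[i-1] aligned to a gap
                F[i][j - 1] - d,                              # y[j-1] aligned to a gap
            )
    return F


def toy_score(a, b):
    # Hypothetical scoring for illustration only (NOT BLOSUM50).
    return 2 if a == b else -1


F = local_alignment_matrix("ACGT", "CG", toy_score, d=1)
best = max(max(row) for row in F)  # score of the best local alignment
```

The traceback for the optimal local alignment starts at the cell holding the maximum value and stops at the first cell with value 0.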
2 Multiple Sequence Alignment [10 marks]
(a) Determine the sum-of-pairs score for the following multiple sequence
alignment of DNA sequences, using the
scoring matrix in which a match gets a value of +4, a mismatch gets
a value of -1, and a (base,gap) pair gets a value of -2. (A
(gap,gap) pair gets a value of 0.) [3 marks]
GCAA
GT-A
C--A
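The sum-of-pairs computation can be sketched as below. This is a minimal illustration, assuming the scoring values stated in the problem; the function names pair_score and sum_of_pairs and the example alignment are hypothetical and deliberately differ from the alignment above.

```python
from itertools import combinations


def pair_score(a, b, match=4, mismatch=-1, gap=-2):
    """Score one pair of symbols taken from a single alignment column."""
    if a == "-" and b == "-":
        return 0          # (gap,gap) pair
    if a == "-" or b == "-":
        return gap        # (base,gap) pair
    return match if a == b else mismatch


def sum_of_pairs(alignment):
    """Sum pair_score over all pairs of rows in every column."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have equal length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total


# Hypothetical three-sequence alignment, for illustration only.
example = ["AC-", "A-G", "TCG"]
example_score = sum_of_pairs(example)
```

Each column contributes independently, so the total can be checked by hand one column at a time.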
(b) Align the following two alignments (i.e., profiles) of
protein sequences. Use the BLOSUM50 matrix and a linear gap
penalty with d=8 (i.e., a score of -8 per gap position). [4 marks]
Hint: Make sure you understand the description
of profile alignment in Durbin et al. (pages 146-147)
Profile 1:
RWCH
R-CY
Profile 2:
RIWY
RVW-
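One common way to score a column of one profile against a column of the other is to sum the pairwise scores across the two columns. The sketch below assumes that formulation, with a residue paired with a gap scoring -d and a gap paired with a gap scoring 0; the function name column_score and the toy scoring function are hypothetical, and the toy scores are not BLOSUM50.

```python
def column_score(col1, col2, score, d):
    """Sum of pairwise scores between a column of profile 1 and a column of profile 2."""
    total = 0
    for a in col1:
        for b in col2:
            if a == "-" and b == "-":
                continue            # (gap,gap) contributes 0
            if a == "-" or b == "-":
                total -= d          # (residue,gap) costs the gap penalty
            else:
                total += score(a, b)
    return total


def toy_score(a, b):
    # Hypothetical scoring for illustration only (NOT BLOSUM50).
    return 1 if a == b else -1


# Column ("A", "A") of one profile against column ("A", "-") of the other.
s_match = column_score(("A", "A"), ("A", "-"), toy_score, d=2)
# Aligning a column against an inserted all-gap column.
s_gap = column_score(("A", "C"), ("-", "-"), toy_score, d=2)
```

With column scores defined this way, the two profiles can be aligned with the usual pairwise dynamic programming, treating each column as a single symbol.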
(c) Briefly explain the role of guide trees in progressive
multiple sequence alignment algorithms. What do the leaf and
internal nodes of a guide tree represent? [3 marks]
Hint: Review the section on progressive alignment methods in Durbin et al.
3 Similarity Search using BLAST (Hands-on Problem) [CPSC 445 students only; 12 marks]
Run a Protein-Protein BLAST search (BLASTP) at the NCBI web site
in order to find proteins similar to the Matrix glycoprotein of
Human coronavirus OC43 in the SWISSPROT database;
the sequence of this protein in FASTA format is as follows:
>gi|267362|sp|Q01455|VME1_CVHOC E1 glycoprotein (Matrix glycoprotein) (Membrane glycoprotein)
MSSKTTPAPVYIWTADEAIKFLKEWNFSLGIILLFITIILQFGYTSRSMFVYVIKMIILWLMWPLTIILT
IFNCVYALNNVYLGLSIVFTIVAIIMWIVYFVNSIRLFIRTGSFWSFNPETNNLMCIDMKGTMYVRPIIE
DYHTLTVTIIRGHLYIQGIKLGTGYSWADLPAYMTVAKVTHLCTYKRGFLDRISDTSGFAVYVKSKVGNY
RLPSTQKGSGMDTALLRNNI
Perform the search using the BLOSUM62 scoring matrix, with gap opening cost = 11, gap extension cost = 1, Expectation = 10 and word size = 3.
(a) Answer the following questions:
- What is the number of hits reported by BLASTP? [1 mark]
- What is the number of hits with a bit score between 80 and 200? [1 mark]
- Which types of organisms do the proteins from the previous question belong to?
(Hint: Use the taxonomy report link on the BLAST result page) [1 mark]
- Some of these organisms are parasitic.
In which hosts are these organisms found?
(This can be easily inferred from the taxonomy report information) [1 mark]
- What types of disorders are caused by the viruses whose proteins
received a bit score between 80 and 200 in this search? [1 mark]
- Give the alignment of the Human coronavirus matrix protein sequence with the SARS coronavirus sequence.
Specify the number of identical residues and % sequence identity,
the number of similar residues and % similarity (in BLAST terminology: % positive),
and the number and % of gaps. [1 mark]
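The identity and gap statistics can also be recomputed from the two aligned strings, as in the sketch below. It assumes percentages are taken relative to the alignment length (including gap columns); the function name alignment_stats and the example strings are hypothetical. Counting similar residues ("positives") additionally requires the scoring matrix and is omitted here.

```python
def alignment_stats(query, subject):
    """Count identities and gaps in a pairwise alignment given as two gapped strings."""
    if len(query) != len(subject):
        raise ValueError("aligned strings must have equal length")
    length = len(query)
    pairs = list(zip(query, subject))
    identities = sum(1 for a, b in pairs if a == b and a != "-")
    gaps = sum(1 for a, b in pairs if a == "-" or b == "-")
    return {
        "length": length,
        "identities": identities,
        "gaps": gaps,
        "pct_identity": 100.0 * identities / length,
        "pct_gaps": 100.0 * gaps / length,
    }


# Hypothetical aligned fragments, for illustration only.
stats = alignment_stats("MKT-LV", "MRTALV")
```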
(b) Are the scores
for the hits with bit scores between 80 and 200
significantly affected when using the PAM70 scoring matrix
instead of the BLOSUM62 matrix? [1 mark]
(c) Does the alignment between Human coronavirus and SARS proteins
change when using the PAM70 scoring matrix instead of the BLOSUM62 matrix?
If so, how? [1 mark]
(d) What does the expectation parameter mean?
What will happen if the expectation value is increased from
its default value of 10 to 100? [2 marks]
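As background, the expected number of chance hits is commonly related to the bit score S' by E = m * n * 2^(-S'), where m and n are the query and database lengths. The sketch below assumes this relation and ignores the effective-length corrections that BLAST applies; the function name expected_hits and the numbers used are hypothetical.

```python
def expected_hits(bit_score, query_len, db_len):
    """Expected number of chance hits with at least this bit score: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical values: a 100-residue query against 10 million database residues.
e_at_30 = expected_hits(30, 100, 10_000_000)
e_at_31 = expected_hits(31, 100, 10_000_000)
```

Under this relation, each additional bit of score halves the expected number of chance hits.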
(e) Based on the respective alignments,
would you say that the Human coronavirus sequence is very similar
to the sequence for the SARS matrix protein? [1 mark]
(f) Human coronavirus OC43 belongs to group 2 of the mammalian
coronaviruses. One of the BLAST hits in the search you have
performed is a matrix protein from a group 1 human coronavirus
(229E). Which of these two matrix proteins, which belong to
different groups of human coronaviruses, is more similar to the
SARS matrix protein? (Hint: you will have to perform another
BLAST search using the FASTA sequence of the group 1 human
coronavirus protein; click on the web link for that protein,
which will take you to its GenBank record.) [1 mark]
General remarks:
- While cooperation between students, especially between CS and non-CS students,
is encouraged, each student is expected to work out the actual solutions
to the problems individually and hand in their own assignment.
In other words: help each other, but do not copy solutions.
- Always feel free to contact Holger, Baharak, or Sanja if you need more help than
your fellow students can provide.
- The assignment must be handed in on the due date, before or
at the beginning of class.
- This assignment should take about 1.5-3 hours of work if you have
good knowledge of the topics covered and did all reading assignments.
However, do not rely on this estimate and wait until the last minute:
it might not apply to you (or to anyone at all), and you might need additional
time to consult the literature.