Please include your name, student ID and the names of all students you worked with. For submission instructions, see general remarks at the end of the assignment.
Here is the dynamic programming matrix resulting from a run of the standard pairwise local sequence alignment algorithm (with linear gap penalty d=8) on the protein sequences DWHAEP and WHEDPA, using the BLOSUM50 matrix (found on page 16 of Durbin et al.):
(a) Complete the table, by replacing the ?'s with the appropriate numbers. [4 marks]
(b) From the table, give the optimal local alignment for these two sequences and its score. [6 marks]
This problem uses the following sequences:
A - TAATAATAACTA
B - TACTAATAATCG
and the following substitution matrix:
and an affine gap penalty gamma(g)= -d-(g-1)e where d is the gap-open penalty of 400 and e is the gap extension penalty of 30. The substitution matrix and affine gap penalty are the same as those used by the blastz global alignment program.
(a) Draw the corresponding FSA (include transition and state labels) and recurrence relation using the above sequences, substitution matrix and affine gap penalty. [4 marks]
(b) Give the best global alignment for sequence A and B and the resulting score. Show your work (hint: you may not need to fill in the entire DP matrix if you can justify why not). [4 marks]
(c) In coding sequences, as opposed to non-coding regions, there is a bias towards fewer gaps. Draw a FSA and write a recurrence relation that uses a gap-open penalty of 800 and a gap-extension penalty of 60 (per base) whenever the gaps are within coding regions, and a gap-open penalty of 400 with a gap-extension penalty of 30 (per base) otherwise. Explain your choices. [7 marks]
Given two DNA sequences, we are interested in finding the best pair-wise
sequence alignment between them. The scoring function uses 2 for any match, -1
for any mismatch, and -2 for any match against a gap.
More specifically, we are interested in the best local and global alignment, its corresponding alignment score, the possibility of existing multiple optimal alignments, and, if the answer is positive, the set of all optimal alignments.
The two DNA sequences are given to you in file 2.in in two lines, each line corresponds to one of the two DNA sequences. Each DNA sequence has length between 5 and 50.
Note that each part of the problem must be written in a separate output file (this simplifies partly automated marking).
(a) (output: 2.o1) - 10 marks
In this file, write the optimal score as an integer value.
(b) (output: 2.o2) - 10 marks
For this part, you should write the dynamic programming matrix.
Assumem the first DNA has length m and the second one has length n. Your output, 2.o2, should contain m+1 lines each containing n+1 integer values separated by a single space character. The value j in the line i corresponds the value in column j and row i of the dynamic programming matrix.
(c) (output: 2.o3) - 10 marks
For this part, we are interested in one optimal alignment. Write the best alignment of the two input sequences on two lines. For any gap write a single character '-'. Notice that there might be multiple optimal alignments and you are required to only report one.
(d) (output: 2.o4) - 10 marks
Are there multiple optimal alignments? Write either 'YES' or 'NO' (all in capital letters) into the output file.
(e) (output: 2.o5) - bonus question, no marks
In this part report all possible optimal alignments. In the first line write the number of optimal alignments and then put every alignment in two lines separated by an empty line.
(f) (hand in this part with your written assignment) - 10 marks
Which parts of the program did you need to change in order to perform local rather than global pairwise sequence alignment?
Examples for input and output file formats:
0 -2 -4 -6 -8 -10 -12 -14
-2 -1 -3 -5 -4 -6 -8 -10
-4 0 1 -1 -3 -5 -4 -6
-6 -2 -1 3 1 -1 -3 -2
-8 -4 -3 1 5 3 1 -1
-10 -6 -5 -1 3 7 5 3
-12 -8 -7 -3 1 5 6 4
-14 -10 -9 -5 -1 3 4 5
-16 -12 -8 -7 -3 1 5 3
-18 -14 -10 -6 -5 -1 3 7
0 0 0 0 0 0 0 0
0 0 0 0 2 0 0 0
0 2 2 0 0 1 2 0
0 0 1 4 2 0 0 4
0 0 0 2 6 4 2 2
0 0 0 0 4 8 6 4
0 0 0 0 2 6 7 5
0 0 0 0 0 4 5 6
0 2 2 0 0 2 6 4
0 0 1 4 2 0 4 8
>gi|267362|sp|Q01455|VME1_CVHOC E1 glycoprotein (Matrix glycoprotein)
(b) Are the scores for the hits with bit scores of 80 and above significantly affected when using the PAM70 scoring matrix instead of the BLOSUM62 matrix? [2 marks]
(c) Does the alignment between Human coronavirus and SARS proteins change when using the PAM70 scoring matrix instead of the BLOSUM62 matrix? If so, how? [4 marks]
(d) What does the expectation parameter mean? What will happen if the expectation value is increased from its default value of 10 to a 100? [4 marks]
(e) Based on the respective alignments, would you say that the Human coronovirus sequence is very similar to the sequence for the SARS matrix protein? Briefly justify your answer. [4 marks]
(f) Human coronavirus OC43 belongs to the group 2 of mammalian coronaviruses. One of the BLAST hits in the search you have performed is a matrix protein from a group 1 human coronavirus (229E). Which of these two matrix proteins belonging to two different groups of human coronaviruses is more similar to SARS matrix protein? (Hint: you will have to perform another BLAST search using the FASTA sequence of the group1 human coronavirus protein; click on the web link for that protein which will take you to its GenBank record.) [4 marks]
(b) Alignment Algorithms: Show that the number of ways of intercalating two sequences of lengths n and m to give a single sequence of length n+m, while preserving the order of symbos in each, is n+m choose m.(Problem 2.5 from Page 19 of Durbin et al.)
(c) Significance of Scores: Now that we know how to find an optimal alignment, how can we assess the significance of its score? Read Section 2.7 of Durbin et al., and briefly summarize the methods and justifications for how to assess the significance of alignment scores.
(d) Deriving Score Parameters from Alignment Data: So far we have not discussed in detail where scores come from - that is, how to determine the components of the scoring model, the substitution and gap scores. Read Section 2.8 of Durbin et al., and briefly summarize the methods and justifications for how to derive scoring models.
Reading Questions: In graduate-style seminars, you will often be asked to read academic papers before class and have a set of questions prepared for discussion with your peers during the seminar period. These questions should not simply be limited to questions of the form, "I didn't understand X, how does it work?" but should demonstrate that you have made an effort to re-read and understand the paper to come up with more piercing, critical analysis, or suggestions for future improvements or directions to the work. Working out these questions will prepare you for such seminars and discussions in the future.
(e) Reading Question 1: Read "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." (http://nar.oxfordjournals.org/content/25/17/3389.full/) and formulate two (2) reading questions according to the note above - clarification questions are OK as long as you have made an effort to understand them by discussing with your peers.
(f) Reading Question 2: Horizontal (lateral) gene transfer, the transfer of genes between different species, is an evolutionary phenomenon whose extent and even very existence have been the subject of a longtime debate that tends to become particularly vigorous when cases of horizontal gene transfer that involve eukaryotes are considered. Read, "Horizontal Gene Transfer in Prokaryotes: Quantification and Classification," ( http://www.ncbi.nlm.nih.gov/books/NBK2228/) and describe how the problem of alignment fits into this study. Many methods for describing horizontal gene transfer are now so-called "alignment free" methods. Describe why you think alignment free methods are now more popular. (See this reference for a new paper on alignment free methods: http://bioinformatics.oxfordjournals.org/content/27/11/1466.full.)