CPSC536a - Assignment 4 (covers Module 4)

handed out Tue, 01/03/20; due Tue, 01/03/27

1 RNA Secondary Structure [8 marks]

Use Michael Zuker's mfold RNA secondary structure prediction algorithm to answer this question. You can find mfold at http://bioinfo.math.rpi.edu/~mfold/rna/form1.cgi; the web page is self-explanatory.

(a) What is the predicted secondary structure of the RNA sequence GGCCAAGGCC? (Note: we use the convention that the left end of this string is the 5' end and the right end is the 3' end, as does Zuker's program.) [2 marks]

(b) Can you find a sequence that folds into the secondary structure described by ((((*((((***))))*((((***))))*((((***))))*))))? (Note that, using set notation, this structure, which is for a string of length 45, is {(1,45), (2,44), (3,43), (4,42), (6,16), (7,15), (8,14), (9,13), (18,28), (19,27), (20,26), (21,25), (30,40), (31,39), (32,38), (33,37)}.) [6 marks]
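
The relation between the bracket string and the set notation in part (b) can be checked mechanically. The following is a minimal Python sketch, not part of mfold; the function name pairs_from_brackets is hypothetical, and it assumes that positions are numbered from 1 and that any symbol other than '(' and ')' (here '*') marks an unpaired base.

```python
def pairs_from_brackets(structure):
    """Return the set of 1-based (i, j) base pairs encoded by a bracket string."""
    stack = []      # positions of '(' still waiting for a partner
    pairs = set()
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.add((stack.pop(), pos))
        # any other symbol (assumed: '*') is an unpaired base
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


structure = "((((*((((***))))*((((***))))*((((***))))*))))"
pairs = pairs_from_brackets(structure)
print(len(structure), len(pairs))
print(sorted(pairs))
```

The same function can be applied to the bracket output of a folding run to compare a predicted structure with the target pair set.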

2 Neural Networks and Secondary Structure Prediction [8 marks]

In Module 4 and in last week's reading assignment you learned about neural networks for protein secondary structure prediction.

(a) Design a simple, three-layer feed-forward neural network with two binary input units A and B and a binary output unit C such that C=1 exactly when (A=1 and B=0) or (A=0 and B=1), i.e., logical XOR. Use as few hidden units as possible. Specify the network structure, connection weights, and transfer functions for all units. [3 marks]
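
A candidate design for part (a) can be checked against the XOR truth table with a few lines of Python. The sketch below is only a checking aid under stated assumptions: all names (step, and_unit, truth_table, computes_xor) are hypothetical, a threshold (step) transfer function is assumed, and the example unit deliberately computes AND rather than XOR.

```python
from itertools import product


def step(x, threshold):
    """Threshold (step) transfer function: 1 if x exceeds the threshold, else 0."""
    return 1 if x > threshold else 0


def and_unit(a, b):
    # A single threshold unit with weights (1, 1) and threshold 1.5 computes AND.
    return step(1 * a + 1 * b, 1.5)


def truth_table(network):
    """Evaluate a two-input network on all four binary input combinations."""
    return {(a, b): network(a, b) for a, b in product((0, 1), repeat=2)}


def computes_xor(network):
    """True if the network output equals XOR for every binary input pair."""
    return all(out == (a ^ b) for (a, b), out in truth_table(network).items())


print(truth_table(and_unit))
print(computes_xor(and_unit))
```

To check your own design, write it as a function of a and b built from such units and pass it to computes_xor.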

(b) When using a simple multi-layer perceptron (MLP) for secondary structure prediction, how is the input sequence presented to the network? Illustrate your answer with a simple example. [2 marks]

(c) Explain how secondary structure prediction can benefit from using evolutionary information based on your knowledge of the PHD approach (max 200 words). [3 marks]
Hint: You might find the following paper useful as an additional reference: Burkhard Rost & Chris Sander, "3rd Generation Prediction Of Secondary Structure", http://www.columbia.edu/~rost/Papers/1999_humana/paper.html

BONUS QUESTION: Is it possible to construct a three-layer feed-forward neural network that computes the XOR function as specified in 2(a) using only units with linear transfer functions? Justify your answer!
(Might require further literature research!)

3 Application of Protein Secondary Structure Prediction [10 marks]

In this hands-on exercise you will use online tools and services for retrieving protein information from a databank and for predicting protein secondary structure.

(a) Retrieve the protein sequence for the plant seed protein Crambin (1CRN) from the Protein Data Bank (PDB) at www.pdb.org and report the primary sequence. [1 mark]
Hint: Use the search tool on the PDB front page and specify the PDB ID '1CRN'. Retrieve 'Sequence Details' and download the sequence in FASTA format.

(b) Check out the secondary structure annotation for the Crambin PDB entry (1CRN) and annotate the primary sequence from part (a) using the letters 'H' for alpha-helix and 'E' for beta-sheet. [3 marks]
Attention: This is not the secondary structure as listed under 'Sequence Details'!
Hint: Use 'Download/Display File' and select 'Display the Structure File', PDB / HTML format; click on 'HELIX' / 'SHEET' for explanation of annotations.
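
Once the residue ranges have been read off the HELIX and SHEET records, the annotation string for part (b) can be built as in the following Python sketch. The function name annotate, the '-' symbol for residues without annotation, and the toy sequence and ranges are all assumptions for illustration; they are not the Crambin data. The sketch also assumes that residue numbers in the PDB entry start at 1 and match positions in the sequence.

```python
def annotate(sequence, helices=(), strands=()):
    """Return one label per residue: 'H' (helix), 'E' (strand), '-' (neither).

    Ranges are (start, end) pairs, 1-based and inclusive.
    """
    labels = ["-"] * len(sequence)
    for symbol, ranges in (("H", helices), ("E", strands)):
        for start, end in ranges:
            if not 1 <= start <= end <= len(sequence):
                raise ValueError(f"range ({start}, {end}) is outside the sequence")
            for i in range(start, end + 1):
                labels[i - 1] = symbol
    return "".join(labels)


# Hypothetical 12-residue sequence and ranges, for illustration only.
toy_sequence = "ACDEFGHIKLMN"
annotation = annotate(toy_sequence, helices=[(3, 7)], strands=[(10, 11)])
print(toy_sequence)
print(annotation)
```

Printing the sequence and the annotation on consecutive lines keeps each label aligned with its residue.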

(c) Use the NNPREDICT program (based on a feed-forward neural network) available at http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html to predict the secondary structure of Crambin. Use the 'all alpha' and 'none' tertiary structure classes and compare the results to the secondary structure information from the PDB entry. [3 marks]

(d) Submit the Crambin sequence to the PredictProtein server at http://www.embl-heidelberg.de/predictprotein/predictprotein.html to perform a PHD secondary structure prediction. Compare the results to the PDB annotated secondary structure and to the NNPREDICT predictions from part (c) and discuss the differences you observe. [3 marks]

Warning: Part (d) of this exercise requires waiting for an automated e-mail response from the PredictProtein server. Although this can be very fast, to be safe you should allow at least 24 hours for processing.
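
For the comparisons in parts (c) and (d), a simple per-residue agreement count can support the discussion. The following Python sketch is one possible way to do this, not a prescribed method; the function names and both annotation strings are hypothetical, and it assumes that the two strings have equal length and use the same symbols (for example 'H', 'E', and '-').

```python
def agreement(reference, predicted):
    """Fraction of positions at which two annotation strings carry the same label."""
    if len(reference) != len(predicted):
        raise ValueError("annotation strings must have the same length")
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)


def differing_positions(reference, predicted):
    """1-based positions at which the two annotation strings differ."""
    return [i for i, (r, p) in enumerate(zip(reference, predicted), start=1) if r != p]


# Hypothetical annotation strings, for illustration only.
reference = "--HHHHH--EE-"
predicted = "--HHHH---EEE"
print(agreement(reference, predicted))
print(differing_positions(reference, predicted))
```

If a server reports coil or loop residues with a different symbol, map its output to a common alphabet before comparing.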

BONUS QUESTION: Perform the same analysis for the oxygen-binding protein Myohemerythrin (PDB ID '2MHR'). What do you observe?

4 Genetic Algorithms and Tertiary Structure Prediction [7 marks]

In Module 4 and in last week's reading assignment you learned about an evolutionary algorithm for protein tertiary structure prediction.

(a) Describe the difference between the MUTATE and the VARIATE operators in the evolutionary algorithm, and how the application of these operators changes over a run of the algorithm. Explain the motivation for the mechanism as proposed in Schulze-Kremer, Genetic Algorithms and Protein Folding, Section 1.2.1.4. [3 marks]

(b) Explain how a vector fitness function can be used to combine different fitness criteria. Furthermore, explain how, in this case, generation replacement differs from the elitist replacement used for simple energy minimisation. (Your answer should be based on the assigned reading and not exceed 200 words.) [2 marks]

(c) How can information on secondary structure (e.g., from a secondary structure prediction algorithm) be used for improving an evolutionary algorithm for tertiary structure prediction based on the energy minimisation approach? [2 marks]

General remarks:

If you want to avoid typing in the URLs appearing above, consider using the online version of this assignment on the course webpage.
While cooperation between students - especially between CS and non-CS students - is encouraged, each student is expected to work out the actual solutions to the problems individually and hand in their own assignment. In other words: help each other, but do not copy solutions.
Always feel free to contact Anne or Holger if you need more help than your fellow students can provide.
The assignment has to be handed in on its due date, before or at the beginning of class.
This assignment should take you about 1.5-3 hours of work if you have good knowledge of the topics covered and did all reading assignments. However, don't wait until the last minute relying on this estimate: it might not apply to you (or anyone at all), you might need additional time to consult the literature, ...