The CGpred directory for running CG with a specific prediction program
- by Mirela Andronescu, last modified Apr 12, 2009.
CG can essentially work with any RNA secondary structure prediction software, as long as the energy function is linear or quadratic in the parameter vector. You just need a prediction function and a few other functions for your model (see details below). Here's what you need:
- Configuration file
- Initial parameter file
- Training data set
- [optional] Validation data set
- [optional] Testing data set
- Code to create the structural constraints
- Code to predict and analyse results of new parameters
- [optional] Thermodynamic file and code to generate the thermodynamic constraints
- [optional] File that specifies which parameters are fixed and which are variable
- [optional] File with additional constraints
When you have all these, you can run CG.
Here's a sample directory, where I used Simfold as the prediction program: Simfold-template.tar.gz
1. The configuration file is where you specify the names (and paths) of all the other files described on this page, so read the rest of the document first. This file also contains some input options for CG. You should test several settings of these options for best performance.
Here's a configuration file example: config_sample.txt
2. Initial parameter file. This is a text file with the values of the initial parameters, one per line. Here's an example: turner_parameters_fm363_constrdangles.txt
3. Training data set. This is one text file, to be used as the "structural training set" (see paper). There are two options:
- Option 1. Use the S-Full-Train data set that I used in my thesis, or another training data set that I used before (see my PhD web page). This data set includes predictions with the Turner99 parameters.
- Option 2. If you want to start with predictions by different parameters, or you want to use your own training file, you need to create a C++ program, using the C++ example below. You need this program because you have to obtain predicted structures using your prediction program and your initial parameter set.
- Here are the functions you will need:
- A function that takes as input a file with a set of parameters (like the initial parameters file above), and fills your internal data structures
- A function that predicts the minimum free energy (or low free energy) secondary structure for a given sequence
- If your data has secondary structure constraints, you need a variant of the prediction function to accommodate constraints
- Here's what you have to do:
- If your prediction program is written in C or C++, I recommend you use the following C++ program as a model to get the training data set (see comments in the file): add_initial_predictions_simfold.cpp. The name of your file can be anything; just specify this name in the configuration file. (For example, if your prediction program is called foo, you might want to call this program add_initial_predictions_foo.cpp.)
- When running CG, your program will be called as follows, so make sure it works.
- add_initial_predictions_simfold.cpp input_initial_parameter_file.txt input_training_set_without_predictions.txt output_training_set.txt
- where input_initial_parameter_file.txt is the initial parameter file you provided at point 2
- input_training_set_without_predictions.txt has the format below (see example1_nopred.txt, example2_nopred.txt):
> some name of the molecule
sequence: should have only A, C, G and U. Otherwise please make sure your program can deal with all the base types.
known structure, in dot-parentheses format
optional, constrained structure
empty line
The training set should be comprehensive enough for good training. The
better it is, the better the quality of the estimated parameters.
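As an illustration of the record format described above, here is a minimal C++ sketch of a reader for a training set file without predictions. The names TrainingRecord, read_training_set and parse_training_set_text are hypothetical and are not taken from add_initial_predictions_simfold.cpp. Skipping malformed records, and requiring the structure to have the same length as the sequence, are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One entry of a training set file without predictions.
// The struct and function names are illustrative, not part of CG.
struct TrainingRecord {
    std::string name;         // text after the leading '>'
    std::string sequence;     // ideally only A, C, G and U
    std::string known;        // known structure, dot-parentheses format
    std::string constrained;  // optional constrained structure; empty if absent
};

// Remove trailing spaces, tabs and carriage returns (e.g. files saved on Windows).
static std::string rtrim(std::string s) {
    while (!s.empty() && (s.back() == ' ' || s.back() == '\t' || s.back() == '\r'))
        s.pop_back();
    return s;
}

// Read all well-formed records. Malformed records are skipped, which is an
// assumption of this sketch; a stricter reader might report an error instead.
std::vector<TrainingRecord> read_training_set(std::istream& in) {
    std::vector<TrainingRecord> records;
    std::string line;
    while (std::getline(in, line)) {
        line = rtrim(line);
        if (line.empty() || line[0] != '>') continue;  // find the next header line
        TrainingRecord rec;
        const std::size_t start = line.find_first_not_of(" \t", 1);
        if (start != std::string::npos) rec.name = line.substr(start);

        // Gather the lines up to the empty line (or end of file) closing the record.
        std::vector<std::string> body;
        while (std::getline(in, line)) {
            line = rtrim(line);
            if (line.empty()) break;
            body.push_back(line);
        }
        if (body.size() < 2 || body.size() > 3) continue;  // wrong number of lines
        if (body[0].size() != body[1].size()) continue;    // lengths must agree
        rec.sequence = body[0];
        rec.known = body[1];
        if (body.size() == 3) rec.constrained = body[2];
        assert(rec.known.size() == rec.sequence.size());
        records.push_back(rec);
    }
    return records;
}

// Convenience wrapper for text held in memory.
std::vector<TrainingRecord> parse_training_set_text(const std::string& text) {
    std::istringstream in(text);
    return read_training_set(in);
}
```

To read a file, open it with std::ifstream and pass the stream to read_training_set. A program modelled on add_initial_predictions_simfold.cpp would then predict a structure for each record and write the output training set.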
4. Validation data set. Exactly the same format as the training data set; you can use either of the two options above. The molecules in this set should be different from the ones in the training data set.
5. Testing data set. Exactly the same format as the training data set; you can use either of the two options above. The molecules in this set should be different from the ones in the training data set.
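One quick way to check that a validation or testing set does not share molecules with the training set is to compare the sequences directly. The sketch below is only an illustration. The function name shared_sequences is hypothetical, and treating "different molecules" as "non-identical sequence strings" is an assumption; you may prefer a stricter similarity criterion.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Return the sequences of `other` that also occur in `training`
// (exact string match), each reported once, in order of first appearance.
std::vector<std::string> shared_sequences(const std::vector<std::string>& training,
                                          const std::vector<std::string>& other) {
    const std::set<std::string> seen(training.begin(), training.end());
    std::set<std::string> reported;
    std::vector<std::string> overlap;
    for (const std::string& s : other) {
        // insert(...).second is true only the first time s is reported
        if (seen.count(s) != 0 && reported.insert(s).second)
            overlap.push_back(s);
    }
    return overlap;
}
```

An empty result means that no sequence of the validation or testing set appears verbatim in the training set.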
6. Code to create the structural constraints. You need to create an executable that takes as input a data set, and writes two output files (see details below). The minimum you need for this is:
- A function that returns the number of parameters used in the model
- A function that takes as input a file with a set of parameters, and fills your internal data structures
- A function that returns the number of times each feature occurs in a given structure. Here are some details:
- If your energy function is linear in the energy parameters, then your energy function can be written like this:
- deltaG = c' x + f
- where x is the vector of parameters
- c is the vector of how many times each parameter occurs in the given structure
- c' means c transposed
- f is a constant
- As a simple example, if your model has 3 parameters x1, x2 and x3, and deltaG for some structure is x1 + x3 + x3 + 0.5, then c' = (1,0,2) and f = 0.5.
- The Turner model underlying mfold, simfold, RNAstructure, the Vienna RNA package etc. is linear.
- If your energy function is quadratic in the energy parameters, then your energy function can be written like this:
- deltaG = x' P x + c' x + f
- where x is the vector of parameters
- P is a symmetric matrix of the coefficients for each quadratic term
- c is a vector of counts for each linear term
- c' means c transposed
- f is a constant
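- As a simple example, if your model has 3 parameters x1, x2 and x3, and deltaG for some structure is x1*x2 + 0.5*x2*x3 + 2*x2, then P = (0,0.5,0; 0.5,0,0.25; 0,0.25,0), c' = (0,2,0) and f = 0. Because P is symmetric, the coefficient of each product xi*xj is split evenly between P(i,j) and P(j,i).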
- The Dirks & Pierce and Rivas & Eddy models for pseudoknotted structures are quadratic.
- Optionally, a function that returns the free energy (under your model) of a sequence folded into a given structure.
- If you have your functions already written in C/C++, I recommend using the C/C++ model provided below. (Writing a Perl script with system calls will be much slower.) Just replace the necessary code, following the comments in the file:
- The executable will be run like this, so make sure it works.
- ./create_structural_constraints_simfold params.txt input_set_no_pseudoknots.txt constraints_output.txt num_constraints_per_molecule.txt
- Files to test that your code works well.
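The linear and quadratic energy forms described in this section can be sketched in C++ as follows. This is only an illustration of the formulas deltaG = c' x + f and deltaG = x' P x + c' x + f. The type aliases and function names are hypothetical and do not come from the CG or Simfold code. Computing the count vector c (and the matrix P) for a structure is model-specific and is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // row-major square matrix

// Linear model: deltaG = c' x + f
// c[i] is the number of times parameter i occurs in the structure.
double linear_energy(const Vec& c, const Vec& x, double f) {
    assert(c.size() == x.size());
    double g = f;
    for (std::size_t i = 0; i < x.size(); ++i)
        g += c[i] * x[i];
    return g;
}

// Quadratic model: deltaG = x' P x + c' x + f, with P symmetric.
// Under this formula the total coefficient of a product x_i * x_j (i != j)
// is P[i][j] + P[j][i].
double quadratic_energy(const Mat& P, const Vec& c, const Vec& x, double f) {
    assert(P.size() == x.size());
    double g = linear_energy(c, x, f);
    for (std::size_t i = 0; i < x.size(); ++i) {
        assert(P[i].size() == x.size());
        for (std::size_t j = 0; j < x.size(); ++j)
            g += x[i] * P[i][j] * x[j];
    }
    return g;
}
```

For instance, with c' = (1,0,2) and f = 0.5 (the linear example above) and x = (1, 2, 3), linear_energy returns 1 + 2*3 + 0.5 = 7.5. For a quadratic energy 3*x1*x2 + x3, one would pass P = (0,1.5,0; 1.5,0,0; 0,0,0), c' = (0,0,1) and f = 0, which gives 9 for the same x.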
7. Code to predict and analyse results of new parameters. You need to create an executable that takes as input a set of parameters compatible with your model and a data set file. The program predicts structures with the new parameters and computes the accuracy obtained. The functions you need are:
- A function that takes as input a file with a set of parameters (like the initial parameters file above), and fills your internal data structures
- A function that predicts the minimum free energy (or low free energy) secondary structure for a given sequence
- If your data has secondary structure constraints, you need a variant of the prediction function to accommodate constraints
- A function that computes the sensitivity of the prediction
- A function that computes the positive predictive value of the prediction
- If you have your functions written in C++, I recommend using the model below. Just replace the necessary code, following the comments in the file:
- The executable will be run like this, so make sure it works.
- ./predict_and_analyse_results_simfold params.txt input_set.txt output_predictions.txt output_accuracy.txt
- Files to test that your code works well.
- Example parameter set, one parameter value per line: params.txt
- Input data set, to test if your program works: input_set.txt
- First output file, containing predictions. Make sure you get something similar (don't worry about the details): output_predictions.txt
- Second output file, containing accuracy results. Make sure you get something similar (don't worry about the details): output_accuracy.txt
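The sensitivity and positive predictive value functions listed in this section can be sketched as below. The sketch assumes the commonly used base-pair definitions: sensitivity is the number of correctly predicted base pairs divided by the number of base pairs in the known structure, and positive predictive value is the number of correctly predicted base pairs divided by the number of predicted base pairs. The names are hypothetical. Only '(' and ')' are treated as pairing symbols (so pseudoknot brackets are not handled), and returning 0 when a denominator is zero is an assumption of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <stack>
#include <string>
#include <utility>

using PairSet = std::set<std::pair<std::size_t, std::size_t>>;

// Extract the base pairs (i, j), i < j, from a dot-parentheses string.
// Every character other than '(' and ')' is treated as unpaired.
// Returns false if the parentheses are unbalanced.
bool base_pairs(const std::string& s, PairSet& pairs) {
    std::stack<std::size_t> open;
    pairs.clear();
    for (std::size_t j = 0; j < s.size(); ++j) {
        if (s[j] == '(') {
            open.push(j);
        } else if (s[j] == ')') {
            if (open.empty()) return false;
            pairs.emplace(open.top(), j);
            open.pop();
        }
    }
    return open.empty();
}

struct Accuracy {
    double sensitivity;  // correct pairs / pairs in the known structure
    double ppv;          // correct pairs / pairs in the predicted structure
};

// Compare a predicted structure against the known one, pair by pair.
Accuracy compare_structures(const std::string& known, const std::string& predicted) {
    PairSet ref, pred;
    const bool ok_ref = base_pairs(known, ref);
    const bool ok_pred = base_pairs(predicted, pred);
    assert(ok_ref && ok_pred && known.size() == predicted.size());
    (void)ok_ref;
    (void)ok_pred;

    std::size_t correct = 0;
    for (const auto& p : pred)
        correct += ref.count(p);  // count() is 0 or 1 for a std::set

    Accuracy a;
    a.sensitivity = ref.empty() ? 0.0 : static_cast<double>(correct) / ref.size();
    a.ppv = pred.empty() ? 0.0 : static_cast<double>(correct) / pred.size();
    return a;
}
```

For example, comparing the known structure ((((....)))) with the prediction (((......))) gives 3 correct pairs out of 4 known and 3 predicted, so the sensitivity is 0.75 and the positive predictive value is 1.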
8. [optional] The thermodynamic file is a file in XML format, to be used for the constraints corresponding to the thermodynamic set (see paper).
- You can use my best XML file without pseudoknots, or my best XML file WITH pseudoknots.
- You also have to provide code that generates the constraints corresponding to this XML file. The code should be similar to the code to create the structural constraints. See my example here.
- If the XML file contains experimental errors, you can use them to weight different experiments differently, by setting "Use published errors: " in the configuration file to 1 (use them) or 0 (don't use them). In my experience, 0 gave better results than 1.
9. [optional] File that specifies which parameters are fixed and which are variable. Sometimes you might want to keep some parameters fixed to some values. If so, start from a file like the initial parameters file, and replace every value that you do NOT wish to keep fixed by the word "variable". Here's an example in which the parameters with indices 205 and 259 have fixed values, and all the others are variable: params_fix_205_259.txt
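A file of this kind (one entry per line, either a number or the word "variable") could be read with something like the sketch below. The names ParamSpec, read_fixed_variable and apply_fixed are hypothetical, and how CG itself consumes the file is not shown here. Skipping blank lines and rejecting any other token are assumptions of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct ParamSpec {
    bool fixed;    // true when the line held a number
    double value;  // meaningful only when fixed is true
};

// Read one entry per non-blank line: a number (fixed) or the word "variable".
// Returns false at the first line that is neither.
bool read_fixed_variable(std::istream& in, std::vector<ParamSpec>& out) {
    out.clear();
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string token;
        if (!(fields >> token)) continue;  // blank line: skipped (assumption)
        if (token == "variable") {
            out.push_back({false, 0.0});
            continue;
        }
        std::size_t used = 0;
        double v = 0.0;
        try {
            v = std::stod(token, &used);
        } catch (const std::exception&) {
            return false;  // not a number, or out of range
        }
        if (used != token.size()) return false;  // trailing junk such as "1.5x"
        out.push_back({true, v});
    }
    return true;
}

// Overwrite the entries of params that are marked as fixed,
// leaving the variable ones untouched.
void apply_fixed(const std::vector<ParamSpec>& specs, std::vector<double>& params) {
    assert(specs.size() == params.size());
    for (std::size_t i = 0; i < specs.size(); ++i)
        if (specs[i].fixed) params[i] = specs[i].value;
}
```

A helper like apply_fixed can be used to verify that a newly estimated parameter vector still carries the fixed values.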
10. [optional] File with additional constraints. Sometimes you need to specify some constraints for some variables. For example, in the following file we want all dangling end parameters to be negative or zero, and we want the 3' dangling ends to be less than or equal to the 5' dangling ends: constraints_dangling_ends_fm363.txt