Codon Optimizer

A software tool to remove forbidden motifs, add desirable motifs, and optimize codon usage of a protein sequence according to the CAI measure.

About the Software

This software serves as a reference implementation of a dynamic programming algorithm proposed by Anne Condon and Chris Thachuk for optimizing codon usage of a coding DNA sequence while simultaneously removing undesirable motifs and adding desirable motifs. See the conference slides for an overview of how the algorithm works and the journal paper for details.
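To give a feel for why dynamic programming suits this problem, here is a heavily simplified sketch in Python. It is not the algorithm from the papers and not the code in this repository: it handles forbidden motifs only, ignores desirable motifs, and the function name `optimize_codons` and the toy weights are hypothetical. The idea it illustrates is that only the last few bases of a partial design can take part in a future motif match, so it is enough to remember the best-scoring design for each possible suffix.

```python
import math

def optimize_codons(protein, weights, forbidden):
    """Return the best-scoring coding sequence for `protein` that contains
    no forbidden motif, or None if no such sequence exists.

    weights:   amino acid -> {codon: weight}, with 0 < weight <= 1
    forbidden: list of forbidden DNA motifs (non-empty strings)
    """
    forbidden = list(forbidden)
    # Only the last (longest motif - 1) bases can be part of a later match.
    keep = max((len(m) for m in forbidden), default=1) - 1

    # best: suffix of the design so far -> (log score, full design).
    best = {"": (0.0, "")}
    for aa in protein:
        nxt = {}
        for suffix, (score, seq) in best.items():
            for codon, w in weights[aa].items():
                window = suffix + codon
                if any(m in window for m in forbidden):
                    continue  # this codon would complete a forbidden motif
                new_suffix = window[-keep:] if keep > 0 else ""
                cand = (score + math.log(w), seq + codon)
                if new_suffix not in nxt or cand[0] > nxt[new_suffix][0]:
                    nxt[new_suffix] = cand
        best = nxt
    if not best:
        return None
    return max(best.values())[1]

# Toy example (made-up weights): the preferred codons CCC + GGC would
# create both CCG and CGG, so a different proline codon has to be used.
toy_weights = {
    "P": {"CCC": 1.0, "CCT": 0.5},
    "G": {"GGC": 1.0, "GGT": 0.4},
}
print(optimize_codons("PG", toy_weights, ["CCG", "CGG"]))  # CCTGGC
```

Maximizing the sum of log weights is equivalent to maximizing their geometric mean, which is how CAI is defined. For the actual algorithm, including the treatment of desirable motifs and the optimality guarantee, refer to the journal paper.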

Journal Paper

This work was published in a special issue of the Journal of Discrete Algorithms.

It is now common to add protein coding genes into cloning vectors for expression within non-native host organisms. Codon optimization supports translational efficiency of the desired protein product, by exchanging codons which are rarely found in the host organism with more frequently observed codons. Motif engineering, such as removal of restriction enzyme recognition sites or addition of immuno-stimulatory elements, is also often necessary. We present an algorithm for optimizing codon bias of a gene with respect to a well motivated measure of bias, while simultaneously performing motif engineering. The measure is the previously studied codon adaptation index, which favors the use, in the gene to be optimized, of the most abundant codons found in the host genome. We demonstrate the efficiency and effectiveness of our algorithm on the GENCODE dataset and provide a guarantee that the solution found is always optimal.
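As a concrete illustration of the measure, the codon adaptation index of a sequence is commonly defined as the geometric mean of the relative adaptiveness of its codons, where a codon's relative adaptiveness is its frequency in the host divided by the frequency of the most abundant synonymous codon. The Python sketch below follows that textbook definition. The codon counts are invented for the example, the function names are hypothetical, and conventions such as the handling of stop codons or single-codon amino acids may differ from what the tool does.

```python
import math

def relative_adaptiveness(codon_counts, synonymous):
    """w(c) = count(c) / highest count among the codons for the same amino acid.

    Assumes every count is positive, so that log(w) is defined.
    """
    weights = {}
    for codons in synonymous.values():
        top = max(codon_counts[c] for c in codons)
        for c in codons:
            weights[c] = codon_counts[c] / top
    return weights

def cai(sequence, weights):
    """Geometric mean of the codon weights; trailing bases that do not
    form a complete codon are ignored."""
    usable = len(sequence) - len(sequence) % 3
    codons = [sequence[i:i + 3] for i in range(0, usable, 3)]
    return math.exp(sum(math.log(weights[c]) for c in codons) / len(codons))

# Invented host codon counts for two amino acids.
synonymous = {"K": ["AAA", "AAG"], "F": ["TTT", "TTC"]}
counts = {"AAA": 25, "AAG": 100, "TTT": 50, "TTC": 100}
w = relative_adaptiveness(counts, synonymous)

print(cai("AAGTTC", w))  # uses only the most abundant codons
print(cai("AAATTT", w))  # uses only the rarer codons
```

A sequence built only from the most abundant codon for each amino acid scores 1, which is why optimizing CAI pushes a design towards the host's preferred codons.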

Conference Paper

A preliminary version appeared at the IWOCA conference.

Implementation

The software was implemented in C++. It is licensed under the GPL version 3 or higher and makes use of the wonderful Boost libraries.

Download the Software

A tarball and a zip file are available for the source code, which also contains the experimental data used in the conference and journal papers. A separate tarball and zip file containing only the experimental data are also available.


Building the Software

The software can be built in the typical Unix way. The configure script checks that the software can be built on your system and that the required Boost libraries (version 1.48 or later) are installed. You can optionally specify a prefix path where the software should be installed; otherwise, it is installed to the standard directories for your system. The software does not need to be installed to be used: after the make command, the codon-optimizer binary is available in the build directory.

$ ./configure --prefix=${HOME}
$ make
$ make install

Using the Software

The program assumes input sequence files are in FASTA format. After building the software, command line options and usage can be determined with:

./codon-optimizer

At present, the usage is:

Usage: codon-optimizer [options] <fasta_file>

Allowed options:

Generic:
  -h [ --help ]         produce this help message

Design Specifications:
  -s [ --start-index ] arg (=1)                              first index in FASTA file of sequences to optimize
  -e [ --end-index ] arg (=1)                                last index in FASTA file of sequences to optimize
  -f [ --forbidden-motif-file ] arg                          a newline separated file containing forbidden motifs
  -d [ --desired-motif-file ] arg                            a newline separated file containing desired motifs
                                                             
Other:                                                       
  -o [ --optimized-sequence-file ] arg (=optimized.fasta)    output file for optimized sequences
  -t [ --trace-file ] arg (=optimized.trace)                 trace file for optimized sequences
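The inputs described above are simple text formats. As a rough illustration, the following Python sketch reads FASTA records and a newline-separated motif file, and selects a range of records the way the -s and -e options appear to (counting from 1 and including both ends, which is inferred from the defaults and the example command rather than from the source code). The function names are hypothetical and the tool's own parser may treat edge cases differently.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)

def read_motifs(lines):
    """Return the non-empty lines of a newline-separated motif file."""
    return [line.strip() for line in lines if line.strip()]

def select_range(records, start, end):
    """Keep records start..end, counting from 1 and including both ends."""
    return [rec for i, rec in enumerate(records, start=1) if start <= i <= end]

# Small in-memory example; with real files, pass open(path) instead.
fasta_lines = [">a", "ACG", "T", ">b", "GG", ">c", "TT"]
records = list(read_fasta(fasta_lines))
print(select_range(records, 2, 3))
print(read_motifs(["CCG\n", "CGG\n", "\n"]))
```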

The data directory contains the experimental data used in the published manuscripts. To repeat the CAI optimization of all 3,157 sequences in the data/gencode_filtered.fasta file using the forbidden motifs in the data/motifs/forbidden.cpg file and the desirable motifs in the data/motifs/desirable.cpg file, issue the following command:

./codon-optimizer -s 1 -e 3157 -f data/motifs/forbidden.cpg -d data/motifs/desirable.cpg data/gencode_filtered.fasta

Progress is reported as the sequences are optimized:

Eliminating the following forbidden motifs:

CCG
CGG
<snip>

Adding the following desirable motifs:

AACGTT
AACGTTCG
<snip>

Optimizing 3157 sequence(s).

100 %
Warnings occurred while optimizing sequences.  See 'optimized.trace' for details.

If the sequences being optimized contain invalid bases, or if any other warnings are generated, this is reported at the end of the run. The trace file optimized.trace gives optimization statistics for each sequence and lists any warnings:

#Warning: sequence 1 length is not a multiple of 3.  It has been truncated.
#seq_id  length  CAI_before  Forbidden_before  Desirable_before  CAI_after  Forbidden_after  Desirable_after  CPU_runtime
1        351     0.588861    18                2                 0.868277   0                21               0.050000
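The trace excerpt above is easy to post-process. The Python sketch below assumes the layout seen in that excerpt: lines beginning with "#Warning:" are warnings, the other "#" line is the column header, and data rows are whitespace-separated. That layout is inferred from this single example, and `parse_trace` is a hypothetical helper, not part of the software.

```python
def parse_trace(lines):
    """Split trace lines into a list of warnings and a list of row dicts."""
    warnings, rows, columns = [], [], None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Warning:"):
            warnings.append(line[len("#Warning:"):].strip())
        elif line.startswith("#"):
            columns = line[1:].split()  # header row names the columns
        else:
            if columns is None:
                raise ValueError("data row found before the header line")
            rows.append(dict(zip(columns, line.split())))
    return warnings, rows

# The excerpt shown above, used as sample input.
sample = [
    "#Warning: sequence 1 length is not a multiple of 3.  It has been truncated.",
    "#seq_id  length  CAI_before  Forbidden_before  Desirable_before  "
    "CAI_after  Forbidden_after  Desirable_after  CPU_runtime",
    "1        351     0.588861    18                2                 "
    "0.868277   0                21               0.050000",
]
warnings, rows = parse_trace(sample)
for row in rows:
    gain = float(row["CAI_after"]) - float(row["CAI_before"])
    print(row["seq_id"], "CAI gain:", round(gain, 6))
```

Values are kept as strings so that the caller decides which columns to convert to numbers.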

The optimized sequences are written to the file optimized.fasta. Alternative filenames can be specified for the optimized sequences and the trace with the -o and -t options, respectively.

>hg17_chr7_26907301_26907654_+
GTGGGTGGCTCGCAGAGCGTTTAAGGTCGTCGTCCACGTGGACGTAACGCTGGTCGACGTGCAGGCCTGCTGTAAGGCTTTCTGGGCCTGCTGGTCGTTAGCTGGCGTGGTGGTCGTCGCGATGGCCAGGAGACGTTACTGCTGACGTTTCTGCTGCTGCATAGCGCTGTGCTGGAACCAGATCTGCATCTGGGCCTCGTAGAACTGCAGGGCTGCAGCGATCTGCATCCAGCAGGCGCTCGACAGGTGCTCGTAGAAGTGGAACTGCTGCTGCAGTTTCGCGAACTGCTGGGCAGCGAAGTGGGTGCGCATCGTGTGGGCCTGACCCAGGTCGCTGTGCTGAGCAACTTT
>hg17_chr7_26908119_26908771_+
CTGTTCTGGGAAGGCTTCTTCTAACTGTCGTCGTCTCAGAAATCAGCGCTGGAGAAGATGAGCCCAATGCGTGGCAGCGATCGATTACTGGGTGGCTGGCGTGGTGAAGGCACGCGTAGCTATTATACGTAACCAGGCCCAGGCTCTGGTGCGCCAGTGCATATGTCTGGCGAATGCATTGAAGCGAGCCCACCTCGACCACAGCATAATCCAGGTGGTGGTGGCGACGCTGGCCCATGGGAAATGCGTGATTTCCAGAGCAAACAGCGTGAACGTACAGGTGGCACCCATCATTTACGTTTACTGCCAGATTTAACTCGACGTGGCTGCAAAGCGCATTAGTCGTCTGTCTCGCATAGCCTCGACCATAATCTGTCTGGCTCTCGTACGCCACCAGGCTCTCGTAAATCTGGACGTTAACCAGCAGGTGGTGGCGATGGTGGTGGTGGTGGTGGTGGTGGCGCGAATCGTAGCGCTCCACCATGCCCACTGGGCTCTGGTCGTCGTCGACGTAATTGCTGGCGTTAACCTCGTACCACAGGCAAACTGTAAAGCTATGGCCCACGTGGACGTTTAGGCCTGTCTCGTAGCCCAAGTCGTCATTGCTAAGTGTGGGGTATTCCAGGACGTTCGTCGTTTCTGCATTGCCCA
>hg17_chr7_26912637_26912720_+
<snip>
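One way to sanity-check an optimized sequence independently of the tool is to count motif occurrences in it directly. The Python sketch below counts every occurrence, including overlapping ones. Whether the tool counts overlapping or nested motifs in the same way is an assumption, so the numbers may not match the Forbidden and Desirable columns of the trace file exactly. The function names are hypothetical.

```python
def count_occurrences(sequence, motif):
    """Count occurrences of motif in sequence, including overlapping ones."""
    count, pos = 0, sequence.find(motif)
    while pos != -1:
        count += 1
        pos = sequence.find(motif, pos + 1)  # step by one base to allow overlaps
    return count

def motif_report(sequence, motifs):
    """Map each motif to its number of occurrences in sequence."""
    return {m: count_occurrences(sequence, m) for m in motifs}

# Toy sequence, not taken from the experimental data.
toy = "AACGTTCGCCGG"
print(motif_report(toy, ["CCG", "CGG"]))          # forbidden motifs
print(motif_report(toy, ["AACGTT", "AACGTTCG"]))  # desirable motifs
```

Applied to each record of optimized.fasta with the motifs from the forbidden motif file, a report of all zeros indicates that no forbidden motif remains under this counting convention.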

Contact

For support using the software or to request new features, please contact Chris Thachuk.