RAST is a web-based environment that allows users to upload a genome, annotate it, edit the annotations, and compare the genome with other sequenced genomes in the SEED database. Since the initial publication of "The RAST Server: rapid annotations using subsystems technology" in 2008, over 150,000 requests for genome annotations have been processed, at a current rate of 1,200 jobs per week.
As the number of newly sequenced genomes grows and users expect ever more accurate annotations, there is increasing demand for "the next generation" of the RAST technology (RASTtk). This new version of the architecture makes it possible to construct custom pipelines, integrate new bioinformatic tools, and make the resulting pipelines easily accessible to a large user community.
In essence, RASTtk is an updated version of the RAST pipeline that users can modify and customize; it is not intended to replace the RAST web environment.
To run the RASTtk tools, you must either download the RASTtk app DMG (for Mac) here, or use the IRIS environment. An introduction to IRIS can be found here.
There are advantages to using either tool. The RASTtk app runs a bash shell on your machine with all of the necessary scripts installed and is therefore easier to use for processing batches of genomes. On the other hand, the IRIS environment requires no installation, is always up to date, and is ideal for small numbers of genomes.
If you are following this tutorial with the RASTtk app, no login is required. If you use IRIS, you must first log in by typing "login" followed by a user name of your choice. You do not need to register, but you should remember your user name so that you can retrieve your data later.
If you want to step through the tutorial, you can download the E. coli K-12 contig in fasta format from KBase. To download this genome, open a command window in the RASTtk app, or go to the IRIS command window, and type:
echo "kb|g.0" | genomes_to_contigs | contigs_to_sequences > E_coli.contig
"kb|g.0" is the KBase identifier for E. coli K-12. genomes_to_contigs provides the contig ids (there is only one for this genome),
and contigs_to_sequences returns the sequence in fasta format.
Many more scripts and database acquisition tools are listed in the "command list" sidebar in IRIS, and they can also be used in the RASTtk app. Tutorials for these scripts are available on the KBase website, and every script has a help (-h) option. When you are ready to work with your own genome, you can upload it to the IRIS environment by clicking the up-arrow button at the bottom left of the page, below the file browser window.
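Before building a genome object from E_coli.contig, it can be worth confirming that the download produced a plausible fasta file. The sketch below counts header lines (lines starting with ">"). The helper name count_fasta_records is hypothetical and is not part of RASTtk; it relies only on standard grep.

```shell
# Hypothetical helper (not a RASTtk command): count fasta records in a file
# by counting header lines, which start with ">".
count_fasta_records() {
    # Fail early with a message if the file cannot be read.
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 1; }
    # grep -c prints the number of matching lines. It exits non-zero when
    # that number is 0, so "|| true" keeps an empty result from being
    # reported as a failure.
    grep -c '^>' "$1" || true
}

# Example call for the file downloaded above:
# count_fasta_records E_coli.contig
```

Since the text above notes that this genome has only one contig, a count other than 1 would suggest that the download did not complete as expected.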
If you want to annotate batches of genomes, please refer to our tutorial on this topic.
The RASTtk environment is designed so that users can compose annotation pipelines, and then run those pipelines to annotate genomes. There is a rich and growing body of annotation tools that we have either built or imported from other groups, and these offer a framework for incrementally constructing annotations.
In some cases users would rather execute a minimal set of commands representing the currently recommended annotation pipeline. This pipeline is composed of three easy scripts that:
1. Format the contigs file
2. Annotate the genome
3. Export the genome
All of the individual commands available in the RASTtk pipeline add data to a special file type called a genome typed object (GTO). A GTO is a JSON file that is compatible with KBase. Annotations are incrementally appended to this file until it is ready for export. By creating the GTO and adding annotation data to it, it is possible to export the data in multiple file formats without having to reannotate the genome.
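Because a GTO is a JSON file, general-purpose JSON tools can be used to look inside it. The sketch below lists the top-level keys of a JSON object read from standard input, without assuming what those keys are called. The helper name gto_keys is hypothetical, and the example assumes that python3 is available on your PATH, which may not be the case in every RASTtk or IRIS setup.

```shell
# Hypothetical helper (not a RASTtk command): list the top-level keys of a
# JSON object read from standard input, one per line, in sorted order.
# Assumes python3 is installed; the function fails if the input is not JSON.
gto_keys() {
    python3 -c 'import json, sys
for key in sorted(json.load(sys.stdin)):
    print(key)'
}

# Example call, once a GTO such as E_coli.gto exists:
# gto_keys < E_coli.gto
```

Running this on a GTO before and after an annotation step is one way to see which kinds of data have been added to the file.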
To create a GTO from scratch we will use the command
rast-create-genome options:
    -o --output          file to which the output is to be written
    -h --help            print usage message and exit
    --url                URL for the genome annotation service
    --genome-id          Genome identifier
    --scientific-name    Scientific name (Genus species strain) for the genome
    --domain             Domain (Bacteria/Archaea/Virus/Eukaryota) for the genome
    --genetic-code       Genetic code for the genome (usually 11 for most organisms or 4 Mycoplasmas etc.)
    --source             Source (external database) name for this genome
    --source-id          Identifier for this genome in the source (external source)
    --contigs            Fasta file containing DNA contig data
rast-create-genome --scientific-name "Escherichia coli K-12" --genetic-code 11 --domain Bacteria --contigs E_coli.contig > E_coli.gto
Some processes in the pipeline require that you declare the scientific name, domain and genetic code, so these are required fields.
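Since the scientific name, domain and genetic code are all required, a small wrapper can refuse to run when one of them is missing. The following is only a sketch: create_gto and the RAST_CREATE variable are hypothetical names introduced for illustration, and the only RASTtk options used are the ones listed above.

```shell
# Hypothetical wrapper (not part of RASTtk): call rast-create-genome only
# when the three required fields and a non-empty contigs file are supplied.
# RAST_CREATE is an illustrative override for the executable name; it is
# not a RASTtk setting.
# usage: create_gto NAME DOMAIN GENETIC_CODE CONTIGS > genome.gto
create_gto() {
    if [ "$#" -ne 4 ] || [ -z "$1" ] || [ -z "$2" ] || [ -z "$3" ]; then
        echo "usage: create_gto NAME DOMAIN GENETIC_CODE CONTIGS" >&2
        return 2
    fi
    # -s is true only if the file exists and is not empty.
    if [ ! -s "$4" ]; then
        echo "contigs file '$4' is missing or empty" >&2
        return 1
    fi
    "${RAST_CREATE:-rast-create-genome}" \
        --scientific-name "$1" --domain "$2" --genetic-code "$3" \
        --contigs "$4"
}

# Example call, equivalent to the command shown above:
# create_gto "Escherichia coli K-12" Bacteria 11 E_coli.contig > E_coli.gto
```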
To run the default RASTtk pipeline tool, type:
rast-process-genome < E_coli.gto > E_coli.gto2
Here, "E_coli.gto2" is a second genome typed object with all of the RAST annotation data.
This is the RASTtk Default Pipeline:
1. Calls rRNAs with a custom BLAST-based tool
2. Calls tRNAs with tRNAscan
3. Calls large repeat regions
4. Calls selenoproteins
5. Calls pyrrolysyl proteins
6. Finds Streptococcus repeat regions (only if the genus is Streptococcus)
7. Calls CRISPRs
8. Calls the protein-encoding genes with Prodigal and Glimmer3
9. Annotates protein-encoding genes with k-mers (version 2)
10. Annotates remaining hypothetical proteins with k-mers (version 1)
11. Attempts to annotate remaining hypothetical proteins by blasting against close relatives (if possible)
12. Performs a basic gene overlap removal
To export the genome in a desired format we will use the command:
rast-export-genome options:
    -i --input        file from which the input is to be read
    -o --output       file to which the output is to be written
    -h --help         print usage message and exit
    --url             URL for the genome annotation service
    --feature-type    Include this feature type in output. If no feature-types specified, include all feature types

Supported formats:
    genbank           Genbank format
    genbank_merged    Genbank format as single merged locus, suitable for Artemis
    feature_data      Tabular form of feature data
    protein_fasta     Protein translations in fasta format
    contig_fasta      Contig DNA in fasta format
    feature_dna       Feature DNA sequences in fasta format
    gff               GFF format
    embl              EMBL format
To illustrate how rast-export-genome is used, we will export our genome in GenBank format. Type:
rast-export-genome genbank < E_coli.gto2 > E_coli.gbk
Using the "--feature-type" option, it is possible to filter the output. For instance, if we wanted a fasta file of RNA sequences, we would type:
rast-export-genome feature_dna --feature-type rna < E_coli.gto2 > E_coli.rna.fasta
Other feature types include "CDS", "repeat", "crispr_array", "crispr_repeat", and "crispr_spacer". We anticipate that the number of feature types will continue to grow as we add new functionality.
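Because the annotations live in the GTO, the same file can be exported repeatedly without reannotating. The sketch below loops over a list of format names and writes one output file per format. export_formats and the RAST_EXPORT variable are hypothetical names used only for illustration; the format names passed in should be taken from the supported formats listed above.

```shell
# Hypothetical helper (not part of RASTtk): export one annotated GTO in
# several formats, writing PREFIX.FORMAT for each format requested.
# RAST_EXPORT is an illustrative override for the executable name.
# usage: export_formats GENOME.gto2 PREFIX FORMAT [FORMAT...]
export_formats() {
    # Note: these variables are not function-local in plain sh.
    gto=$1
    prefix=$2
    shift 2
    [ -r "$gto" ] || { echo "cannot read $gto" >&2; return 1; }
    for fmt in "$@"; do
        # Stop at the first export that fails.
        "${RAST_EXPORT:-rast-export-genome}" "$fmt" < "$gto" > "$prefix.$fmt" || return 1
    done
}

# Example call: writes E_coli.genbank, E_coli.gff and E_coli.protein_fasta
# export_formats E_coli.gto2 E_coli genbank gff protein_fasta
```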
That's it! Three basic commands (rast-create-genome, rast-process-genome and rast-export-genome) give you the RASTtk default pathway. However, this is only a subset of the available RASTtk functions. We have designed RASTtk to be modular so that users can build custom annotation pipelines. To tap into this capability and to learn about the individual steps, please read the tutorial "RASTtk: The Incremental Commands".
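For convenience, the three commands from this tutorial can be wrapped in a single shell function. This is a minimal sketch rather than an official RASTtk script: annotate_genome is a hypothetical name, and the chaining with && assumes that each RASTtk command exits with a non-zero status when it fails.

```shell
# Hypothetical wrapper (not part of RASTtk) around the three commands from
# this tutorial: create a GTO, annotate it, and export it in genbank format.
# usage: annotate_genome CONTIGS NAME DOMAIN GENETIC_CODE PREFIX
annotate_genome() {
    if [ "$#" -ne 5 ]; then
        echo "usage: annotate_genome CONTIGS NAME DOMAIN GENETIC_CODE PREFIX" >&2
        return 2
    fi
    # Each step runs only if the previous one succeeded (&&).
    rast-create-genome --scientific-name "$2" --domain "$3" \
        --genetic-code "$4" --contigs "$1" > "$5.gto" &&
    rast-process-genome < "$5.gto" > "$5.gto2" &&
    rast-export-genome genbank < "$5.gto2" > "$5.gbk"
}

# Example call, reproducing the steps of this tutorial:
# annotate_genome E_coli.contig "Escherichia coli K-12" Bacteria 11 E_coli
```

Keeping the intermediate .gto and .gto2 files means that the genome can later be exported in other formats without repeating the annotation step.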