Annotation
Annotation is the process of identifying features of interest in a set of genomic DNA sequences and labeling them with information such as their function. The Metagenomics-Toolkit provides several tools for annotating genes, based on the gene prediction using Prodigal. Prodigal is also part of Prokka, which provides a fast functional annotation of our data in addition to gene prediction.
Annotation with larger databases is beyond the scope of this workshop due to runtime limitations, but we will explore it within EMGB.
Prokka
Prokka is an efficient, user-friendly and open source bioinformatics tool designed for the annotation of bacterial genomes. It automates the prediction of genes, tRNAs, rRNAs, and other genomic features, utilizing various databases and algorithms to ensure accurate annotations. Prokka supports standard output formats such as GenBank and GFF, facilitating further analysis with compatible tools.
See the Prokka homepage for more information.
The following snippet shows the Toolkit configuration for the annotation module when only Prokka is run:
Annotation Configuration File Snippet 1
annotation:
  prokka:
    defaultKingdom: false
    additionalParams: " --mincontiglen 500 "
Task 1
Run the following command to perform the annotation:
cd ~/mgcourse/
NXF_VER=24.10.4 nextflow run metagenomics/metagenomics-tk \
-profile standard \
-params-file https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/default/tutorials/tutorial1/fullpipeline_annotation.yml \
-ansi-log false \
-entry wFullPipeline \
-resume \
--input.paired.path https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/test_data/tutorials/tutorial1/reads.tsv \
--databases $(readlink -f databases) \
--logDir logs_annotation
For reference, your complete parameter file should look like this:
Parameter-file
tempdir: "tmp"
summary: false
s3SignIn: false
input:
  paired:
    path: "test_data/tutorials/tutorial1/reads.tsv"
    watch: false
output: output
logDir: log
runid: 1
databases: "/vol/scratch/databases"
publishDirMode: "symlink"
logLevel: 1
scratch: false
steps:
  qc:
    fastp:
      # For PE data, the adapter sequence auto-detection is disabled by default since the adapters can be trimmed by overlap analysis. However, you can specify --detect_adapter_for_pe to enable it.
      # For PE data, fastp will run a little slower if you specify the sequence adapters or enable adapter auto-detection, but usually result in a slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
      # -q, --qualified_quality_phred the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified.
      # --cut_front move a sliding window from front (5') to tail, drop the bases in the window if its mean quality is below cut_mean_quality, stop otherwise.
      # --length_required reads shorter than length_required will be discarded, default is 15. (int [=15])
      # PE data, the front/tail trimming settings are given with -f, --trim_front1 and -t, --trim_tail1
      additionalParams:
        fastp: " --detect_adapter_for_pe -q 20 --cut_front --trim_front1 3 --cut_tail --trim_tail1 3 --cut_mean_quality 10 --length_required 50 "
        reportOnly: false
      timeLimit: "AUTO"
    nonpareil:
      additionalParams: " -v 10 -r 1234 "
    filterHuman:
      additionalParams: " "
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/human_filter.db.20231218v2.gz
          md5sum: cc92c0f926656565b1156d66a0db5a3c
  assembly:
    megahit:
      # --mem-flag 0 to use minimum memory, --mem-flag 1 (default) moderate memory and --mem-flag 2 all memory.
      # meta-sensitive: '--min-count 1 --k-list 21,29,39,49,...,129,141'
      # meta-large: '--k-min 27 --k-max 127 --k-step 10' (large & complex metagenomes, like soil)
      additionalParams: " --min-contig-len 500 --presets meta-sensitive "
      fastg: true
      resources:
        RAM:
          mode: 'PREDICT'
          predictMinLabel: 'highmemLarge'
  binning:
    bwa2:
      additionalParams:
        bwa2: " "
        # samtools flags are used to filter resulting bam file
        samtoolsView: " -F 3584 "
    contigsCoverage:
      additionalParams: " --min-covered-fraction 0 --min-read-percent-identity 100 --min-read-aligned-percent 100 "
    genomeCoverage:
      additionalParams: " --min-covered-fraction 0 --min-read-percent-identity 100 --min-read-aligned-percent 100 "
    # Primary binning tool
    metabat:
      # Set --seed positive numbers to reproduce the result exactly. Otherwise, random seed will be set each time.
      additionalParams: " --seed 234234 "
  magAttributes:
    checkm2:
      database:
        download:
          source: "https://openstack.cebitec.uni-bielefeld.de:8080/databases/checkm2_v2.tar.gz"
          md5sum: a634cb3d31a1f56f2912b74005f25f09
      additionalParams: " "
    gtdb:
      buffer: 1000
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/gtdbtk_r214_data.tar.gz
          md5sum: 390e16b3f7b0c4463eb7a3b2149261d9
      additionalParams: " --min_af 0.65 --scratch_dir . "
  annotation:
    prokka:
      defaultKingdom: false
      additionalParams: " --mincontiglen 500 "
resources:
  highmemLarge:
    cpus: 28
    memory: 60
  highmemMedium:
    cpus: 14
    memory: 30
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1
Task 2
Locate the annotation files in the output directory and look for *.txt files that summarize the number of genes per category.
Can you report a bin that contains CRISPR sequences?
Solution
cd ~/mgcourse
cat output/data/1/annotation/1.0.0/prokka/data_bin.3.txt
organism: Genus species strain
contigs: 150
bases: 520975
CDS: 853
CRISPR: 3
tRNA: 12
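If you do not want to open every summary file by hand, the check can be scripted. The following is a minimal sketch, assuming that all Prokka summaries use the "key: value" layout shown above; the function name summaries_with_feature is a hypothetical helper and not part of the Toolkit.

```shell
# List Prokka summary files that report at least one feature of a given type.
# summaries_with_feature is a hypothetical helper name, not part of the Toolkit.
# Assumption: each summary line has the form "key: value", e.g. "CRISPR: 3".
summaries_with_feature() {
    feature="$1"
    shift
    for file in "$@"; do
        # Skip arguments that are not existing files (e.g. an unmatched glob).
        [ -f "$file" ] || continue
        # Print the file name and the count if the feature count is above zero.
        awk -F': *' -v feature="$feature" -v file="$file" \
            '$1 == feature && $2 > 0 { print file "\t" $2 }' "$file"
    done
}
```

Called as summaries_with_feature CRISPR output/data/1/annotation/1.0.0/prokka/*.txt, it should print one line per bin whose summary contains a CRISPR entry, together with the reported count.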
To see the descriptions of the annotated genes, look into the GFF (General Feature Format) files.
A GFF file is a tab-delimited text format used in bioinformatics to describe genomic features such as genes, exons, introns, and more. Each feature line consists of nine columns (the start and end positions are listed together below):
- Sequence Name: The identifier for the sequence or chromosome.
- Source: The source or database providing the annotation.
- Feature Type: The type of feature (e.g., gene, exon, CDS).
- Start and End Positions: The coordinates where the feature begins and ends.
- Score: A confidence score for the feature prediction.
- Strand: Indicates the directionality (forward or reverse) of the feature.
- Phase: For features like CDS, it indicates reading frame offsets.
- Attributes: Additional information such as gene IDs or product names.
Optional header lines starting with '##' provide metadata about the file, adding context to the annotations.
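To get a feel for the column layout, you can tally the feature types in column 3 of a GFF file. The sketch below assumes a gzip-compressed GFF file as produced in this tutorial, and it stops at the optional ##FASTA directive, after which a GFF3 file may embed the sequences themselves. The function name gff_feature_counts is a hypothetical helper chosen for illustration.

```shell
# Count features per type (column 3) in a gzip-compressed GFF file.
# gff_feature_counts is a hypothetical helper name used for illustration.
gff_feature_counts() {
    gzip -dc "$1" | awk -F'\t' '
        /^##FASTA/ { exit }          # stop at the optional embedded sequence section
        /^#/       { next }          # skip header and comment lines
        NF == 9    { count[$3]++ }   # a feature line has nine tab-separated columns
        END        { for (type in count) print type "\t" count[type] }
    ' | sort
}
```

For example, gff_feature_counts output/data/1/annotation/1.0.0/prokka/data_bin.3.gff.gz should print one line per feature type with its count, which you can compare with the numbers in the matching *.txt summary.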
Task 3
Find a GFF file of a bin that contains tRNAs in the same way as you did in the previous task. Search the corresponding annotation file (GFF) for tRNAs with zgrep -P "\ttRNA\t".
Can you tell the products of two tRNAs?
Solution
In our precomputed case, bin 3 contains tRNAs, which you can list with:
zgrep tRNA output/data/1/annotation/1.0.0/prokka/data_bin.3.gff.gz
The products of two of the tRNAs are tRNA-Thr(cgt) and tRNA-His(gtg).
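If you only want the product names rather than the full GFF lines, you can pull the product entry out of the attributes column. This is a minimal sketch, assuming that the attributes column holds semicolon-separated key=value pairs including product=..., as Prokka writes them; gff_products is a hypothetical helper name.

```shell
# Print the product of every feature of a given type (default: tRNA).
# gff_products is a hypothetical helper name used for illustration.
# Assumption: column 9 contains a "product=..." entry among ";"-separated pairs.
gff_products() {
    gzip -dc "$1" | awk -F'\t' -v type="${2:-tRNA}" '
        /^##FASTA/ { exit }          # ignore the optional embedded sequences
        $3 == type {
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^product=/)
                    print substr(attrs[i], length("product=") + 1)
        }
    '
}
```

Running gff_products output/data/1/annotation/1.0.0/prokka/data_bin.3.gff.gz prints one product per tRNA; passing a second argument such as CDS switches the feature type.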
EMGB
Using the Toolkit, you have now predicted and annotated hundreds of genes. If you were to use all the tools in the annotation module, you could annotate even more genes using databases such as KEGG. While you can of course use the command line to search for genes, MAGs, and pathways in this mass of data, you could also explore your dataset with EMGB.
Task 4
Let's imagine you are searching for MAGs that contain genes encoding the enzyme 5.3.1.9.
Hint: The enzyme is involved in the glycolysis pathway.
Click on the enzyme and a filter will be automatically created. Now if you click on the "MAGs" tab, you should see MAGs that contain that gene. How many MAGs contain that gene? Which MAG has the highest completeness?
In the next step, you might also be interested in the genes neighboring this gene in the MAG with the highest completeness. What is the name of the gene to its left?
Solution
Five MAGs contain the gene that encodes the enzyme 5.3.1.9. The MAG with the highest completeness is Bin 6. To the left of the gene encoding the enzyme is the gene for Proline--tRNA ligase.