Skip to content

Annotation

The annotation module is able to predict genes and annotate those based on Prokka and a set of user provided databases. A user can add additional formatted databases as part of the configuration by adding a key (Example: kegg ) with a possible download strategy. See database section for possible download strategies. In addition, the resistance gene identifier is executed by default.

Input

-entry wAnnotateLocal -params-file example_params/annotation.yml

Warning

The configuration file shown here is for demonstration and testing purposes only. Parameters that should be used in production can be viewed in the annotation section of one of the yaml files located in the default folder of the Toolkit's Github repository.

output: "output"
summary: false
runid: 1
s3SignIn: false
logDir: log
tempdir: "tmp"
scratch: "/vol/scratch"
databases: "/mnt/databases"
publishDirMode: "symlink"
steps:
   annotation:
      input: "test_data/annotation/input_small.tsv"
      mmseqs2:
        chunkSize: 20000
        kegg:
          additionalParams:
            search : ' -s 1 --max-seqs 100 --max-accept 50 --alignment-mode 1 --exact-kmer-matching 1 --db-load-mode 3'
            additionalColumns: ""
          database:
            extractedDBPath: '/vol/spool/toolkit/kegg-mirror-2021-01_mmseqs/sequenceDB'
#        bacmet20_experimental:
#          params: ' -s 1 --exact-kmer-matching 1 --db-load-mode 3'
#          database:
#            download:
#              source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/bacmet20_experimental.tar.zst
#              md5sum: 57a6d328486f0acd63f7e984f739e8fe
        bacmet20_predicted:
          database:
            download:
              source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/bacmet20_predicted.tar.zst
              md5sum: 55902401a765fc460c09994d839d9b64
          additionalParams:
            search : ' -s 1 --max-seqs 100 --max-accept 50 --alignment-mode 1 --exact-kmer-matching 1 --db-load-mode 3'
            additionalColumns: ""

#        vfdb:
#          params: ' -s 1 --max-seqs 100 --max-accept 50 --alignment-mode 1 --exact-kmer-matching 1 --db-load-mode 3'
#          database:
#            download:
#              source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/vfdb_full_2022_07_29.tar.zst
#              md5sum: 7e32aaed112d6e056fb8764b637bf49e
      keggFromMMseqs2:
         database:
           extractedDBPath: "/vol/spool/toolkit/kegg_2021-01/"
      rgi: 
        database:
          download:
            source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/card_20221209.tar.bz2
            md5sum: d7e627221a1d4584d1c1795cda774cdb
        additionalParams: ""
      mmseqs2_taxonomy:
        runOnMAGs: false
        gtdb:
          params: ' --orf-filter-s 1 -e 1e-15'
          ramMode: false
          initialMaxSensitivity: 1
          database:
            download:
              source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/gtdb_r214_1_mmseqs.tar.gz
              md5sum: 3c8f12c5c2dc55841a14dd30a0a4c718
      prokka:
        prodigalMode: "meta"
        defaultKingdom: false
        additionalParams: " --mincontiglen 200 "
resources:
  highmemLarge:
    cpus: 28
    memory: 230
  highmemMedium:
    cpus: 14
    memory: 113
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1
DATASET BIN_ID  PATH
test3   bin.1   https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.1.fa
test1   bin.2   https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.2.fa
test1   bin.8   https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.8.fasta
test2   bin.9   https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.9.fasta
test2   bin.32  https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.32.fa

Databases

MMseqs2

MMseqs2 needs a combination of different data, index and dbtype files as "one" database, be it in- or output. See MMseqs2 database for more information. As multiple and in most cases, big files are used, tar and zstd are utilized to compress and transport files. Input databases have to be compressed by these and need to end with .tar.zst. Naming inside an archive is irrelevant, as databases are picked automatically. Multiple databases per one archive are not supported, one archive, one database. If the database also includes a taxonomy as described here, it can also be used for taxonomic classifications with MMseqs2 - Taxonomy. See database section for possible download strategies. If you need credentials to access your files via S3 then please use the following command:

nextflow secrets set S3_db_ACCESS XXXXXXX
nextflow secrets set S3_db_SECRET XXXXXXX

where db is the name of the database that you use in your config file. Example:

....
      vfdb:
        params: ' -s 1 --max-seqs 100 --max-accept 50 --alignment-mode 1 --exact-kmer-matching 1 --db-load-mode 3'
        database:
          download:
            source: s3://databases/vfdb_full_2022_07_29.tar.zst
            md5sum: 7e32aaed112d6e056fb8764b637bf49e
            s5cmd:
              params: " --retry-count 30 --endpoint-url https://openstack.cebitec.uni-bielefeld.de:8080 " 
....

Based on these settings, you would set the following secret:

nextflow secrets set S3_vfdb_ACCESS XXXXXXX
nextflow secrets set S3_vfdb_SECRET XXXXXXX

KEGGFromBlast

KeGGFromBlast is only executed if genes are searched against a KEGG database. There must be a kegg identifier (see example configuration file) in the annotation section. KeGGFromBlast needs a kegg database as input which must be a tar.gz file. See database section for possible download strategies. If you need credentials to access your files via S3 then please use the following command:

nextflow secrets set S3_kegg_ACCESS XXXXXXX
nextflow secrets set S3_kegg_SECRET XXXXXXX

MMSeqs Taxonomy

If you need credentials to access your files via S3 then please use the following command:

nextflow secrets set S3_TAX_db_ACCESS XXXXXXX
nextflow secrets set S3_TAX_db_SECRET XXXXXXX

where db is the name of the database that you use in your config file. Example:

....
    mmseqs2_taxonomy:
      gtdb:
        params: ' --orf-filter-s 1 -e 1e-15'
        ramMode: false
        database:
          download:
            source: s3://databases/gtdb_r214_1_mmseqs.tar.gz
            md5sum: 3c8f12c5c2dc55841a14dd30a0a4c718
            s5cmd:
              params: " --retry-count 30 --endpoint-url https://openstack.cebitec.uni-bielefeld.de:8080 " 
....

Based on these settings, you would set the following secrets:

nextflow secrets set S3_TAX_gtdb_ACCESS XXXXXXX
nextflow secrets set S3_TAX_gtdb_SECRET XXXXXXX

RGI

RGI needs a CARD database which can be fetched via this link: https://card.mcmaster.ca/latest/data. The compressed database must be a tar.bz2 file. See database section for possible download strategies. If you need credentials to access your files via S3 then please use the following command:

nextflow secrets set S3_rgi_ACCESS XXXXXXX
nextflow secrets set S3_rgi_SECRET XXXXXXX

Output

MMseqs2

Calculated significant matches of a nucleotide/protein query which was compared against a user provided set of databases.

MMseqs2 - Taxonomy

By identifying homologous through searches against a provided MMseqs2 taxonomy-database, MMseqs2 can compute the lowest common ancestor. This lowest common ancestor is a robust taxonomic label for unknown sequences. These labels are presented in form of an *.taxonomy.tsv file, a *.krakenStyleTaxonomy.out formatted in accordance to the KRAKEN tool outputs and an interactive KRONA plot in form of a html website *.krona.html.

Prokka

Prokka computes *.err, *.faa, *.ffn, *.fna, *.fsa, *.gbk, *.gff, *.sqn, *.tbl, *.tbl for every bin. *.gbk and *.sqn are skipped per default, since tbl2asn runs for quite a while! If you need those files generated by prokka, include: --tbl2asn in the prokka parameters to enable it. Details of all files can be read on the Prokka page. In addition, it also computes a summary tsv file which adheres to the magAttributes specification.

KEGGFromBlast

Result *.tsv file filled with KEGG information (like modules, KO's, ...) which could be linked to the input hits.

Resistance Gene Identifier (rgi)

The *rgi.tsv files contain the found CARD genes.