Export

This module exports a set of results produced by Metagenomics-tk. Currently, the export for EMGB needs the results of the following tools:

  1. Assembly
  2. Binning
  3. CheckM (v1 or v2)
  4. Prokka output
  5. GTDB-Tk
  6. MMseqs Taxonomy (Database: GTDB)
  7. MMseqs (Database: UniRef90)

Input

-entry wExportPipeline -params-file example_params/export.yml

Warning

The configuration file shown here is for demonstration and testing purposes only. Parameters that should be used in production can be viewed in the export section of one of the YAML files located in the default folder of the Toolkit's GitHub repository.

tempdir: "tmp"
summary: false
s3SignIn: false
input: "output"
output: "output"
logDir: log
runid: 1
databases: "/mnt/databases"
logLevel: 1
scratch: "/vol/scratch"
publishDirMode: "symlink"
steps:
  export:
    emgb:
      additionalParams:
        blastDB: "bacmet20_predicted"
        taxonomyDB: "gtdb"
      titles:
        database:
          download:
            source: "https://openstack.cebitec.uni-bielefeld.de:8080/databases/uniref90.titles.tsv.gz"
            md5sum: aaf1dd9021243def8e6c4e438b4b3669
      kegg:
        database:
          download:
            source: s3://databases_internal/annotatedgenes2json_db_kegg-mirror-2022-12.tar.zst
            md5sum: 262dab8ca564fbc1f27500c22b5bc47b
            s5cmd:
              params: '--retry-count 30 --no-verify-ssl --endpoint-url https://openstack.cebitec.uni-bielefeld.de:8080'
resources:
  highmemLarge:
    cpus: 28
    memory: 230
  highmemMedium:
    cpus: 14
    memory: 113
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1
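
The `md5sum` entries in the configuration above allow a downloaded database to be checked for integrity. The following is a minimal sketch of such a check, assuming the checksum refers to the file exactly as downloaded (the toolkit may compute it differently). The function names and the local file path in the usage note are hypothetical and not part of the toolkit.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected):
    """Compare a file's MD5 digest with the expected value from the config."""
    return md5_of_file(path) == expected.strip().lower()
```

For example, `matches_md5("uniref90.titles.tsv.gz", "aaf1dd9021243def8e6c4e438b4b3669")` would return `True` only if a local copy at that (hypothetical) path has the digest listed in the configuration.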

Additional Parameters

  • blastDB: The toolkit runs MMseqs against multiple databases. Use this parameter to specify which BLAST output should be used. (Default: UniRef90)

  • taxonomyDB: MMseqs is executed against a specific taxonomy database. Use this parameter to specify which taxonomy output should be used. (Default: GTDB)

Output

The following files are produced as output:

  • SAMPLE.bins.json.gz
  • SAMPLE.contigs.json.gz
  • SAMPLE.genes.json.gz

where SAMPLE is the name of the sample.
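
The naming scheme above can be used to check whether an export is complete before importing it. The sketch below assumes that all three files for a sample sit in a single directory, which is an assumption about the output layout and not something stated here. The function names are hypothetical.

```python
from pathlib import Path

# Suffixes taken from the list of output files above.
EXPORT_SUFFIXES = ("bins.json.gz", "contigs.json.gz", "genes.json.gz")


def expected_export_files(sample, directory="."):
    """Build the expected export file paths for one sample."""
    return [Path(directory) / f"{sample}.{suffix}" for suffix in EXPORT_SUFFIXES]


def missing_export_files(sample, directory="."):
    """Return the expected export files that do not exist in the directory."""
    return [p for p in expected_export_files(sample, directory) if not p.is_file()]
```

An empty list from `missing_export_files("SAMPLE", "path/to/export")` would indicate that all three files are present for that sample.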

You can read more here about how to start EMGB and use these files to import a dataset.