
Full Pipeline

The full pipeline mode allows you to run the per-sample part and the aggregation part in one execution (see the schematic overview). In contrast to the quickstart section, this chapter describes how to execute the Toolkit on a cluster system.

Requirements

  • SLURM: The Toolkit was mainly developed for cloud-based clusters using SLURM as a resource orchestrator.

  • Docker: Docker must be available on all worker nodes.

  • Java: Java must be installed in order to run Nextflow.

  • Nextflow version: Please check the nextflow.config file for the supported Nextflow versions.

  • Resources: Every worker node must provide a local scratch directory with sufficient disk space (see the --scratch parameter below).
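You can verify these prerequisites on a node before starting. The following commands are a minimal sketch (the scratch path /vol/scratch is the example used throughout this page and may differ on your system):

sinfo --version        # SLURM is available
docker --version       # Docker is available on this node
java -version          # Java is required to run Nextflow
df -h /vol/scratch     # sufficient free scratch space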

Preparation

First check out the GitHub repository into a directory that is shared by all worker nodes:

git clone git@github.com:metagenomics/metagenomics-tk.git
cd metagenomics-tk
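If the Nextflow binary is not yet present, you can fetch it into the same shared directory with the official installer (this requires Java, see the requirements above):

curl -s https://get.nextflow.io | bash
./nextflow -version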

Run the Toolkit

In the following example the Nextflow binary is placed in a shared working directory:

./nextflow run main.nf -work-dir /vol/spool/work \
    -profile slurm \
    -entry wFullPipeline \
    -params-file default/fullPipeline_illumina_nanopore_getting_started.yml  \
    --s3SignIn false \
    --scratch /vol/scratch \
    --databases /vol/scratch/databases \
    --output output \
    --input.paired.path test_data/fullPipeline/reads_split.tsv

where

  • -work-dir points to a directory that is shared between multiple machines.
  • -profile defines the execution profile that should be used.
  • -entry is the entrypoint of the Toolkit.
  • -params-file sets the parameters file, which defines the parameters for all tools (see the input section below).
  • --s3SignIn defines if any S3 login for retrieving inputs is necessary. See the S3 configuration section for more information on how to configure the Toolkit for possible S3 input data.
  • --scratch is the directory on the worker node where all intermediate results are saved.
  • --databases is the directory on the worker node where all databases are saved. Databases that are already available on a shared file system can be configured in the database setting of the corresponding database section in the configuration file.
  • --output is the output directory where all results are saved. If you want to know more about which outputs are created, then please refer to the modules section.
  • --input.paired.path is the path to a TSV file that lists the datasets that should be processed. If you want to provide other parameters, please check the input section.

Parameter override

Any parameter specified on the command line with a double dash overrides the corresponding parameter in the YAML file.
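For example, if the parameters file already defines an input path, the command-line flag takes precedence for this run (my_samples.tsv is a placeholder for your own dataset list):

# The parameters file sets:
# input:
#   paired:
#     path: "test_data/fullPipeline/reads_split.tsv"
#
# The flag below overrides that value:
./nextflow run main.nf -work-dir /vol/spool/work \
    -profile slurm \
    -entry wFullPipeline \
    -params-file default/fullPipeline_illumina_nanopore_getting_started.yml \
    --input.paired.path my_samples.tsv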

Input

The input datasets are listed in a tab-separated file, for example:

SAMPLE  READS1  READS2
test1   https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read1_1.fq.gz  https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read2_1.fq.gz
test2   https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read1_1.fq.gz  https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read2_1.fq.gz

The file must include the columns SAMPLE, READS1 and READS2. SAMPLE must contain unique dataset identifiers without whitespace or special characters. READS1 and READS2 are the paired read files and can be HTTPS URLs, S3 links or local file paths.
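A quick way to sanity-check such a TSV is to confirm the header and that all sample identifiers are unique. A minimal sketch, assuming the file is named samples.tsv (a placeholder) and is tab-separated:

head -n 1 samples.tsv                                # should print: SAMPLE  READS1  READS2
cut -f1 samples.tsv | tail -n +2 | sort | uniq -d    # prints duplicated sample identifiers, if any

The following is a complete example parameters file for the full pipeline mode: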

tempdir: "tmp"
summary: false
s3SignIn: true
input:
  SRA:
    pattern:
      ont: ".+[^(_1|_2)].+fastq.gz$"
      illumina: ".+(_1|_2).+fastq.gz$"
    S3:
      path: test_data/fullPipeline/production.tsv
      bucket: "s3://ftp.era.ebi.ac.uk" 
      prefix: "/vol1/fastq/"
      watch: false
output: output
logDir: log
runid: 1
databases: "/vol/scratch/databases"
publishDirMode: "symlink"
logLevel: 1
scratch: "/vol/scratch"
steps:
  qc:
    fastp:
       # For PE data, the adapter sequence auto-detection is disabled by default since the adapters can be trimmed by overlap analysis. However, you can specify --detect_adapter_for_pe to enable it.
       # For PE data, fastp will run a little slower if you specify the sequence adapters or enable adapter auto-detection, but this usually results in a slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
       # -q, --qualified_quality_phred       the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified.
       # --cut_front move a sliding window from front (5') to tail, drop the bases in the window if its mean quality is below cut_mean_quality, stop otherwise.
       # --length_required  reads shorter than length_required will be discarded, default is 15. (int [=15])
       # For PE data, the front/tail trimming settings are given with -f, --trim_front1 and -t, --trim_tail1
       additionalParams: " --detect_adapter_for_pe -q 20 --cut_front --trim_front1 3 --cut_tail --trim_tail1 3 --cut_mean_quality 10 --length_required 50 "
       timeLimit: "AUTO"
    nonpareil:
      additionalParams: " -v 10 -r 1234 "
    kmc:
      timeLimit: "AUTO"
      additionalParams:
        # Computes k-mer distribution based on k-mer length 13 and 21
        #  -sm - use strict memory mode (memory limit from -m<n> switch will not be exceeded)
        #  -cs<value> - maximal value of a counter
        count: " -sm -cs10000 "
        histo: " -cx50000 "

  qcONT:
    porechop:
       additionalParams:
         porechop: ""
         # --keep_percent Throw out the worst 10% of reads. This is measured by bp, not by read count. So this option throws out the worst 10% of read bases.
         filtlong: " --min_length 1000  --keep_percent 90 "
    nanoplot:
      additionalParams: ""
  assembly:
    megahit:
      # --mem-flag 0 to use minimum memory, --mem-flag 1 (default) moderate memory and --mem-flag 2 all memory.
      # meta-sensitive: '--min-count 1 --k-list 21,29,39,49,...,129,141' 
      # meta-large:  '--k-min  27  --k-max 127 --k-step 10' (large & complex metagenomes, like soil)
      additionalParams: " --min-contig-len 1000 --presets meta-sensitive "
      fastg: true
      resources:
        RAM:
          mode: 'PREDICT'
          predictMinLabel: 'medium'
  assemblyONT:
    metaflye:
      additionalParams: " -i 1 "
      quality: "AUTO"
  binning:
    bwa2:
      additionalParams: 
        bwa2: " "
        # samtools flags are used to filter resulting bam file
        samtoolsView: " -F 3584 " 
    contigsCoverage:
      additionalParams: " --min-covered-fraction 0 --min-read-percent-identity 100 --min-read-aligned-percent 100 "
    genomeCoverage:
      additionalParams: " --min-covered-fraction 0 --min-read-percent-identity 100 --min-read-aligned-percent 100 "
    # Primary binning tool
    metabat:
      # Set --seed positive numbers to reproduce the result exactly. Otherwise, random seed will be set each time.
      additionalParams: " --seed 234234  "
  binningONT:
    minimap:
      additionalParams: 
        minimap: " "
        # samtools flags are used to filter resulting bam file
        samtoolsView: " -F 3584 " 
    contigsCoverage:
      additionalParams: " --min-covered-fraction 0  --min-read-aligned-percent 100 "
    genomeCoverage:
      additionalParams: " --min-covered-fraction 0  --min-read-aligned-percent 100 "
    metabat:
      additionalParams: " --seed 234234  "
  magAttributes:
    # gtdbtk classify_wf
    # --min_af minimum alignment fraction to assign genome to a species cluster (0.5)
    gtdb:
      buffer: 1000
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/gtdbtk_r214_data.tar.gz
          md5sum: 390e16b3f7b0c4463eb7a3b2149261d9
      additionalParams: " --min_af 0.65 --scratch_dir . "
    checkm2:
      database:
        download:
          source: "https://openstack.cebitec.uni-bielefeld.de:8080/databases/checkm2_v2.tar.gz"
          md5sum: a634cb3d31a1f56f2912b74005f25f09
      additionalParams: "  "
  fragmentRecruitment:
    mashScreen:
      genomes: test_data/fragmentRecruitment/mags.tsv
      unzip:
        timeLimit: "AUTO"
      additionalParams:
        mashSketch: " "
        mashScreen: " "
        bwa2: "  "
        # We demand that matches should be at least 70 percent to restrict the
        # search space. We further restrict all found matches in 'coveredBasesCutoff' parameter.
        coverm: " --min-covered-fraction 70 --min-read-percent-identity 95 --min-read-aligned-percent 95 "
        covermONT: " --min-covered-fraction 70 --min-read-aligned-percent 95 "
        minimap: "   "
        samtoolsViewBwa2: " -F 3584 " 
        samtoolsViewMinimap: " " 
      # Mash identity should be at least 95%
      mashDistCutoff: 0.95
      # Percentage of covered genome
      coveredBasesCutoff: 0.90
      # Minimum number of k-mer hashes to match
      mashHashCutoff: 700
    # For final reporting of the covered fraction, the --min-covered-fraction parameter is set to the same value as 
    # in the binning and other sections
    genomeCoverage:
      additionalParams: " --min-covered-fraction 0 "
    contigsCoverage:
      additionalParams: " --min-covered-fraction 0 "
  dereplication:
    bottomUpClustering:
      # stricter MIMAG medium quality
      minimumCompleteness: 50
      maximumContamination: 5
      ANIBuffer: 20
      mashBuffer: 2000
      method: 'ANI'
      additionalParams:
        mash_sketch: ""
        mash_dist: ""
        # cluster cutoff
        cluster: " -c 0.05 "
        pyani: " -m ANIb "
        representativeAniCutoff: 0.95
  readMapping:
    bwa2:
      additionalParams:
        bwa2_index: ""
        bwa2_mem: ""
    # This module produces two abundance tables:
    # one is based on relative abundance and the other on the trimmed mean.
    # Using relative abundance alone makes it difficult to tell whether a genome is part of a dataset.
    # That's why it makes sense to set at least a low min-covered-fraction parameter.
    coverm: " --min-covered-fraction 80  --min-read-percent-identity 95 --min-read-aligned-percent 95 "
    covermONT: " --min-covered-fraction 80  --min-read-aligned-percent 95 "
    minimap:
      additionalParams:
        minimap_index: ""
        minimap: ""
  cooccurrence:
    inference:
      additionalParams:
        method: 'correlation'
        rscript: " --mincovthreshold 0.9 --maxzero 60 --minweight 0.4 "
        timeLimit: "AUTO"
    metabolicAnnotation:
      additionalParams:
        metabolicEdgeBatches: 5
        metabolicEdgeReplicates: 10
        smetana: " --flavor bigg --molweight "
  plasmid:
    SCAPP:
      additionalParams:
        bwa2: "  "
        SCAPP: "  "
        coverm: " --min-covered-fraction 0 --min-read-percent-identity 100 --min-read-aligned-percent 100 "
        covermONT: " --min-covered-fraction 0 --min-read-aligned-percent 100 "
        minimap: " "
        samtoolsViewBwa2: " -F 3584 " 
        samtoolsViewMinimap: " " 
    ViralVerifyPlasmid:
      filter: true
      # Select sequences that are labeled as uncertain or plasmid in ViralVerify's output
      filterString: "Uncertain - plasmid or chromosomal|Plasmid"
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/pfam-A_35.0.hmm.gz
          md5sum: c80b75bd48ec41760bbca19c70616e36
      additionalParams: " --thr 7 "
    PlasClass:
      filter: true
      # A cutoff of 0.9 for longer (> 2 kb) sequences should pretty much yield only plasmids.
      threshold: 0.9
      additionalParams: "   "
    Filter:
      method: "OR"
      minLength: 1000 
    PLSDB:
      sharedKmerThreshold: 0
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/plasmids_plsdb_20220929.tar.bz2
          md5sum: 13c1078e6cd6a46e3f508c24ca07cc18
      additionalParams:
        # Parameters according to the PLSDB paper
        # -s: Sketch size
        # -k: k-mer size
        # -S: Seed to provide to the hash function
        # -d: Maximum distance to report
        # -v: Maximum p-value to report
        mashSketch: " -S 42 -k 21 -s 1000 "
        mashDist: " -v 0.1 -d 0.1 "
    MobTyper:
      minLength: 1000
      additionalParams: " --min_length 1000  "
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/mob_20220929.gz
          md5sum: 21fcaf9c3754a985d1d6875939d71e28
  annotation:
    prokka:
      defaultKingdom: false
      additionalParams: " --mincontiglen 500 "
    mmseqs2:
      chunkSize: 20000
      kegg:
        additionalParams:
          search : ' --max-seqs 300 --max-accept 50 -c 0 --cov-mode 2 --start-sens 4 --sens-steps 1 -s 6 --num-iterations 2 -e 0.001 --e-profile 0.01 --db-load-mode 3 '
          additionalColumns: ""
        database:
          download:
            source: s3://databases_internal/kegg-mirror-2021-01_mmseqs.tar.zst
            md5sum: 0d20db97b3e7ee6571ca1fd5ad3a87f1
            s5cmd:
              params: '--retry-count 30 --no-verify-ssl --endpoint-url https://openstack.cebitec.uni-bielefeld.de:8080'
      vfdb:
        additionalParams:
          search : ' --max-seqs 300 --max-accept 50 -c 0 --cov-mode 2 --start-sens 4 --sens-steps 1 -s 6 --num-iterations 2 -e 0.001 --e-profile 0.01 --db-load-mode 3 '
          additionalColumns: ""
        database:
          download:
            source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/vfdb_full_2022_07_29.tar.zst
            md5sum: 7e32aaed112d6e056fb8764b637bf49e
      bacmet20_experimental:
        additionalParams:
          search : ' --max-seqs 300 --max-accept 50 -c 0 --cov-mode 2 --start-sens 4 --sens-steps 1 -s 6 --num-iterations 2 -e 0.001 --e-profile 0.01 --db-load-mode 3 '
          additionalColumns: ""
        database:
          download:
            source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/bacmet20_experimental.tar.zst
            md5sum: 57a6d328486f0acd63f7e984f739e8fe
      bacmet20_predicted:
        additionalParams:
          search : ' --max-seqs 300 --max-accept 50 -c 0 --cov-mode 2 --start-sens 4 --sens-steps 1 -s 6 --num-iterations 2 -e 0.001 --e-profile 0.01 --db-load-mode 3 '
          additionalColumns: ""
        database:
          download:
            source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/bacmet20_predicted.tar.zst
            md5sum: 55902401a765fc460c09994d839d9b64
      uniref90:
        additionalParams:
          search : ' --max-seqs 300 --max-accept 50 -c 0 --cov-mode 2 --start-sens 4 --sens-steps 1 -s 6 --num-iterations 2 -e 0.001 --e-profile 0.01 --db-load-mode 3 '
          additionalColumns: ""
        database:
          download:
            source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/uniref90_20231108_mmseqs.tar.zst
            md5sum: 313f2c031361091af2d5f3c6f6f46013
    rgi:
      # --include_loose includes matches to more distant homologs of AMR genes, which may also report spurious partial matches.
      # --include_nudge Partial ORFs may not pass curated bitscore cut-offs, and novel samples may contain divergent alleles, so nudging
      #                 95% identity Loose matches to Strict matches may aid resistome annotation.
      additionalParams: " --include_loose --include_nudge "
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/card_20221209.tar.bz2
          md5sum: d7e627221a1d4584d1c1795cda774cdb
    mmseqs2_taxonomy:
      # Run taxonomy classification on MAGs and unbinnable contigs, or just the latter
      runOnMAGs: true
      gtdb:
        # The high-sensitivity mode parameters only work for high-quality MAGs; for unbinnable contigs the default parameters are used
        params: ' --lca-ranks superkingdom,phylum,class,order,family,genus,species,subspecies --max-seqs 300 --max-accept 50 --cov-mode 2 -e 0.001 --e-profile 0.01 '
        # Load database into memory to speed up classification, only works if --db-load-mode 3 is set in params
        ramMode: false
        initialMaxSensitivity: 6
        database:
          download:
            source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/gtdb_r214_1_mmseqs.tar.gz
            md5sum: 3c8f12c5c2dc55841a14dd30a0a4c718
    keggFromMMseqs2:
      database:
        download:
          source: s3://databases_internal/kegg-links-mirror-2021-01.tar.gz
          md5sum: 69538151ecf4128f4cb1e0ac21ff55b3
          s5cmd:
              params: '--retry-count 30 --no-verify-ssl --endpoint-url https://openstack.cebitec.uni-bielefeld.de:8080'
  metabolomics:
    filter:
      minCompleteness: 50
      maxContamination: 5
    carveme:
      additionalParams: " --solver scip "
    smetana:
      beforeProcessScript: ""
      global: false
      detailed: false
      additionalParams:
        detailed: ""
        global: ""
    memote:
      beforeProcessScript: ""
      additionalParams:
        run: ""
        report: ""

resources:
  highmemLarge:
    cpus: 28
    memory: 230
  highmemMedium:
    cpus: 14
    memory: 113
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1

Nextflow usually stores downloaded files in the work directory. If enough scratch disk space is available on the worker nodes, this can be avoided by specifying S3 links in the input TSV file and the download parameter in the qc part of the input YAML.

Currently the S3 links must point to files that are publicly available.

Example TSV with S3 links:

SAMPLE  READS1  READS2
test1   s3://meta_test/small/read1_1.fq.gz  s3://meta_test/small/read2_1.fq.gz
test2   s3://meta_test/small/read1_1.fq.gz  s3://meta_test/small/read2_1.fq.gz

Example input YAML with the download parameter in the qc part:

tempdir: "tmp"
summary: true
s3SignIn: true
input:
  paired: 
    path: "test_data/fullPipeline/reads_split_s3.tsv"
    watch: false
output: "output"
logDir: log
runid: 1
logLevel: 1
scratch: "/vol/scratch"
publishDirMode: "symlink"
steps:
  qc:
    interleaved: false
    fastp:
      download:
        s5cmdParams: " --retry-count 30 --no-verify-ssl --no-sign-request --endpoint-url https://openstack.cebitec.uni-bielefeld.de:8080 "
      additionalParams: " "
      timeLimit: "AUTO"
resources:
  highmemLarge:
    cpus: 28
    memory: 230
  highmemMedium:
    cpus: 14
    memory: 113
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1
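
Such a parameters file can be passed to the same entry point as before, for example (my_s3_params.yml is a placeholder for the YAML shown above, saved to a file):

./nextflow run main.nf -work-dir /vol/spool/work \
    -profile slurm \
    -entry wFullPipeline \
    -params-file my_s3_params.yml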

Output

The produced output can be inspected on the modules page.

Further Reading

  • Pipeline Configuration: If you want to configure and optimize the Toolkit for your data or your infrastructure you can continue with the configuration section.

  • In case you want to import the output to EMGB, please go on to the EMGB part.

  • If you want to add databases or modify the pre-configured ones, you can read here how to do this.

  • You might want to adjust the resource requirements of the Toolkit to your infrastructure.