
Aggregation

There are two ways to execute the Toolkit. You can either run all steps in one execution, or run the per-sample analysis (e.g., assembly, binning, and annotation) first and then combine the results (e.g., via dereplication and co-occurrence) in a second run. The second option allows you to process samples via independent Toolkit executions on different infrastructures and to combine all results afterwards, as sketched below.
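
For example, per-sample outputs produced on different machines can be collected into a single directory that then serves as the --input of the aggregation run. A minimal sketch, assuming the per-sample results are reachable via rsync; the host names and remote paths are placeholders:

# Collect per-sample Toolkit outputs from two machines into one local directory.
# clusterA/clusterB and the remote paths are hypothetical.
mkdir -p my_data_spades_output
rsync -av clusterA:/vol/output/ my_data_spades_output/
rsync -av clusterB:/vol/output/ my_data_spades_output/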

Requirements

  1. SLURM: The Toolkit was mainly developed for cloud-based clusters using SLURM as a resource orchestrator.
  2. Docker: Install Docker by following the official Docker installation instructions.
  3. Java: Nextflow requires Java, which can be installed on Debian-based systems via sudo apt install default-jre.
  4. Nextflow: Install Nextflow by following the official Nextflow instructions. A quick check of these requirements is shown after this list.
  5. This tutorial assumes that you have already executed the Toolkit as described in the full pipeline section.
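
To verify that the requirements are available on the machine from which you launch the workflow, you can print the version of each tool; every command below should succeed:

# Quick sanity check for the requirements listed above.
sinfo --version      # SLURM client tools
docker --version     # Docker engine
java -version        # Java runtime required by Nextflow
nextflow -version    # Nextflow itself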

Run the Toolkit

NXF_HOME=$PWD/.nextflow NXF_VER=24.10.4 nextflow run metagenomics/metagenomics-tk \
    -work-dir $(pwd)/work \
    -profile slurm \
    -ansi-log false \
    -entry wAggregatePipeline \
    -params-file  https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/default/fullPipelineAggregate.yml \
    --logDir logAggregate \
    --s3SignIn false \
    --scratch /vol/scratch \
    --databases /vol/scratch/databases \
    --input my_data_spades_output \
    --output output

where

  • -work-dir points to a working directory that must be shared between all machines involved in the run.
  • -profile defines the execution profile that should be used (local or cluster computing).
  • -entry is the entry point of the aggregation workflow.
  • -params-file sets the parameter file that defines the parameters for all tools (see the Input section below; a sketch for fetching and editing this file locally follows this list).
  • --logDir points to a directory where a trace TSV, a timeline HTML of the executed processes, and a report on the resource consumption of the workflow are saved.
  • --s3SignIn defines if any S3 login for retrieving inputs is necessary. See the S3 configuration section for more information on how to configure the Toolkit for possible S3 input data.
  • --scratch is the directory on the worker node where all intermediate results are saved.
  • --databases is the directory on the worker node where all databases are saved. Databases that are already available on a shared file system can be configured via the database setting of the corresponding section in the configuration file.
  • --output is the output directory where all results are saved. If you want to know more about which outputs are created, then please refer to the modules section.
  • --input points to the output directory of the per-sample workflow.
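
If you want to adjust tool parameters instead of using the hosted defaults, you can download the parameter file, edit it, and point -params-file at your local copy. A minimal sketch; the local file name myAggregate.yml is arbitrary:

# Fetch the default aggregation parameter file and keep an editable local copy.
wget -O myAggregate.yml \
  https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/default/fullPipelineAggregate.yml
# After editing, replace the remote URL in the run command with:
#   -params-file $(pwd)/myAggregate.yml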

Parameter override

Any parameter provided on the command line with a double dash overrides the corresponding parameter that is already specified in the YAML file.
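
The run command above already uses this mechanism: the parameter file shown in the Input section sets input: "fullPipelineOutput", while the command line passes --input my_data_spades_output, which takes precedence. As an abbreviated, illustrative sketch (the remaining flags of the full command still apply; the value aggregateRun is hypothetical):

# --input and --output override input: "fullPipelineOutput" and output: "output"
# from the parameter file.
NXF_HOME=$PWD/.nextflow NXF_VER=24.10.4 nextflow run metagenomics/metagenomics-tk \
    -entry wAggregatePipeline \
    -params-file https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/default/fullPipelineAggregate.yml \
    --input my_data_spades_output \
    --output aggregateRun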

Input

tempdir: "tmp"
s3SignIn: true
output: "output"
input: "fullPipelineOutput"
logDir: log
runid: 1
logLevel: 1
scratch: "/vol/scratch"
publishDirMode: "symlink"
steps:
  dereplication:
    bottomUpClustering:
      # stricter MIMAG medium quality
      minimumCompleteness: 50
      maximumContamination: 5
      ANIBuffer: 20
      mashBuffer: 2000
      method: 'ANI'
      additionalParams:
        mash_sketch: ""
        mash_dist: ""
        # cluster cutoff
        cluster: " -c 0.05 "
        pyani: " -m ANIb "
        representativeAniCutoff: 0.95
  readMapping:
    bwa2:
      additionalParams:
        bwa2_index: ""
        bwa2_mem: ""
    # This module produces two abundance tables:
    # one based on relative abundance and the other on the trimmed mean.
    # Relative abundance alone makes it difficult to tell whether a genome is part of a dataset.
    # That's why it makes sense to set at least a low minimum covered fraction.
    coverm: " --min-covered-fraction 80  --min-read-percent-identity 95 --min-read-aligned-percent 95 "
    covermONT: " --min-covered-fraction 80  --min-read-aligned-percent 95 "
    minimap:
      additionalParams:
        minimap_index: ""
        minimap: ""
  cooccurrence:
    inference:
      additionalParams:
        method: 'correlation'
        rscript: " --mincovthreshold 0.9 --maxzero 60 --minweight 0.4 "
        timeLimit: "AUTO"
    metabolicAnnotation:
      additionalParams:
        metabolicEdgeBatches: 5
        metabolicEdgeReplicates: 10
        smetana: " --flavor bigg --molweight "
resources:
  highmemLarge:
    cpus: 28
    memory: 230
  highmemMedium:
    cpus: 14
    memory: 113
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1
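
If you want to run the aggregation with different settings, change the corresponding values in your local copy of this file before starting the workflow. A minimal, illustrative sketch that adjusts the dereplication quality thresholds in the myAggregate.yml copy introduced above (the new values are examples, not recommendations):

# Tighten the dereplication quality thresholds in the local parameter file.
sed -i \
  -e 's/minimumCompleteness: 50/minimumCompleteness: 70/' \
  -e 's/maximumContamination: 5/maximumContamination: 3/' \
  myAggregate.yml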

Output

The meaning of the produced output is described on the respective module page. You can find the results in the AGGREGATED folder.

For example, you could check the species clusters created through dereplication:

cat  my_data_spades_output/AGGREGATED/1/dereplication/*/bottomUpClustering/clusters/clusters.tsv
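
To count the clusters rather than list them, you can count the distinct values in the cluster column. A minimal sketch, assuming the cluster identifier is stored in the first column of clusters.tsv and the glob matches a single file; check the file's header to confirm the layout:

# Count distinct species clusters (a header line, if present, adds one to the count).
cut -f1 my_data_spades_output/AGGREGATED/1/dereplication/*/bottomUpClustering/clusters/clusters.tsv \
  | sort -u | wc -l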

Note

Please note that the dereplication method produces more meaningful results when more than one sample is provided as input.

Further Reading

  • Pipeline Configuration: If you want to configure and optimize the Toolkit for your data or your infrastructure, you can continue with the configuration section.

  • In case you want to import the output to EMGB, please go on to the EMGB configuration section. Please keep in mind that for EMGB only the per-sample part is necessary.

  • You might want to adjust the resource requirements of the Toolkit to your infrastructure.