Aggregation
There are two ways to execute the Toolkit. You can either run all steps in one execution, or run the per‐sample analysis (e.g., assembly, binning, annotation, etc.) first and then combine the results (e.g., via dereplication and co‐occurrence) in a second run. The second option allows you to process multiple samples via independent Toolkit executions on different infrastructures and combine all results afterwards.
Requirements¶
- SLURM: The Toolkit was mainly developed for cloud-based clusters using SLURM as a resource orchestrator.
- Docker: Install Docker by following the official Docker installation instructions.
- Java: In order to run Nextflow, you need to install Java on your machine. This can be achieved via `sudo apt install default-jre`.
- Nextflow: Nextflow should be installed. Please check the official Nextflow installation instructions.
- This tutorial assumes that you have already executed the Toolkit as described in the full pipeline section.
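You can quickly verify that these requirements are available on your machine with the following checks (a simple sketch; the reported versions will differ depending on your setup):

java -version        # prints the installed Java runtime
docker --version     # prints the Docker version; Docker must be usable by your user
nextflow -version    # prints the Nextflow version banner
sinfo --version      # confirms that the SLURM client tools are available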
Run the Toolkit¶
NXF_HOME=$PWD/.nextflow NXF_VER=24.10.4 nextflow run metagenomics/metagenomics-tk \
-work-dir $(pwd)/work \
-profile slurm \
-ansi-log false \
-entry wAggregatePipeline \
-params-file https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/default/fullPipelineAggregate.yml \
--logDir logAggregate \
--s3SignIn false \
--scratch /vol/scratch \
--databases /vol/scratch/databases \
--input my_data_spades_output \
--output output
where

- `-work-dir` points to a directory that is shared between multiple machines.
- `-profile` defines the execution profile that should be used (local or cluster computing).
- `-entry` is the entry point of the aggregation workflow.
- `-params-file` sets the parameters file which defines the parameters for all tools (see the input section below).
- `--logDir` points to a directory where the trace TSV, a timeline HTML of the executed processes and a report on the resource consumption of the workflow are saved.
- `--s3SignIn` defines whether an S3 sign-in is necessary for retrieving inputs. See the S3 configuration section for more information on how to configure the Toolkit for possible S3 input data.
- `--scratch` is the directory on the worker node where all intermediate results are saved.
- `--databases` is the directory on the worker node where all databases are saved. Databases already downloaded to a shared file system can be configured in the database setting of the corresponding database section in the configuration file.
- `--output` is the output directory where all results are saved. If you want to know more about which outputs are created, please refer to the modules section.
- `--input` points to the output directory of the per-sample workflow.
Parameter override
Any parameter passed with a double dash on the command line overrides the corresponding parameter specified in the YAML file.
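As a minimal illustration using the values from this tutorial: the referenced parameters file sets the input parameter to fullPipelineOutput, while the command above passes `--input my_data_spades_output`, so the command-line value is the one that takes effect.

# Value defined in fullPipelineAggregate.yml
input: "fullPipelineOutput"
# Effective value after adding --input my_data_spades_output on the command line
input: "my_data_spades_output"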
Input¶
tempdir: "tmp"
s3SignIn: true
output: "output"
input: "fullPipelineOutput"
logDir: log
runid: 1
logLevel: 1
scratch: "/vol/scratch"
publishDirMode: "symlink"
steps:
  dereplication:
    bottomUpClustering:
      # stricter MIMAG medium quality
      minimumCompleteness: 50
      maximumContamination: 5
      ANIBuffer: 20
      mashBuffer: 2000
      method: 'ANI'
      additionalParams:
        mash_sketch: ""
        mash_dist: ""
        # cluster cutoff
        cluster: " -c 0.05 "
        pyani: " -m ANIb "
      representativeAniCutoff: 0.95
  readMapping:
    bwa2:
      additionalParams:
        bwa2_index: ""
        bwa2_mem: ""
    # This module produces two abundance tables.
    # One table is based on relative abundance and the second one on the trimmed mean.
    # Just using relative abundance makes it difficult to tell if a genome is part of a dataset.
    # That's why it makes sense to set at least a low min covered fraction parameter.
    coverm: " --min-covered-fraction 80 --min-read-percent-identity 95 --min-read-aligned-percent 95 "
    covermONT: " --min-covered-fraction 80 --min-read-aligned-percent 95 "
    minimap:
      additionalParams:
        minimap_index: ""
        minimap: ""
  cooccurrence:
    inference:
      additionalParams:
        method: 'correlation'
        rscript: " --mincovthreshold 0.9 --maxzero 60 --minweight 0.4 "
        timeLimit: "AUTO"
    metabolicAnnotation:
      additionalParams:
        metabolicEdgeBatches: 5
        metabolicEdgeReplicates: 10
        smetana: " --flavor bigg --molweight "
resources:
  highmemLarge:
    cpus: 28
    memory: 230
  highmemMedium:
    cpus: 14
    memory: 113
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1
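If you want to adjust any of the values shown above, you can save the YAML to a local file and point `-params-file` to that file instead of the remote URL. The sketch below assumes you saved it as aggregate.yml in the current directory; all other options stay the same as in the command above:

NXF_HOME=$PWD/.nextflow NXF_VER=24.10.4 nextflow run metagenomics/metagenomics-tk \
    -work-dir $(pwd)/work \
    -profile slurm \
    -ansi-log false \
    -entry wAggregatePipeline \
    -params-file ./aggregate.yml \
    --logDir logAggregate \
    --s3SignIn false \
    --scratch /vol/scratch \
    --databases /vol/scratch/databases \
    --input my_data_spades_output \
    --output output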
Output¶
The meaning of the produced output can be inspected on the respective module page.
You can find the results in the AGGREGATED folder.
For example, you could check the number of species clusters created through dereplication:
cat my_data_spades_output/AGGREGATED/1/dereplication/*/bottomUpClustering/clusters/clusters.tsv
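If you only need the number of distinct clusters, you can count the unique cluster identifiers. This is just a sketch: it assumes the cluster ID is stored in the first column of clusters.tsv and that the file starts with a header line; adjust the cut field if your output differs.

cut -f1 my_data_spades_output/AGGREGATED/1/dereplication/*/bottomUpClustering/clusters/clusters.tsv \
    | tail -n +2 | sort -u | wc -l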
Note
Please note that the dereplication method produces more meaningful results when more than one sample is provided as input.
Further Reading¶
- Pipeline Configuration: If you want to configure and optimize the Toolkit for your data or your infrastructure, then you can continue with the configuration section.
- In case you want to import the output to EMGB, please go on to the EMGB configuration section. Please keep in mind that for EMGB only the per-sample part is necessary.
- You might want to adjust the resource requirements of the Toolkit to your infrastructure.