Per-Sample and Aggregation
There are two ways to execute the Toolkit. You can either run all steps in one execution, or you can first run the per-sample analysis (e.g. assembly, binning, annotation) and afterwards combine the results (e.g. dereplication, co-occurrence) in a second run. The second option allows you to process multiple samples via independent Toolkit executions on different infrastructures and to combine all results afterwards.
You would first run the wFullPipeline mode as described in the full pipeline section, but without the dereplication, read mapping and co-occurrence modules, and afterwards run the aggregation as described below.
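The two-step approach looks roughly as follows. This is a sketch for illustration only: the per-sample parameter file name (default/fullPipeline.yml) and the omitted input flags are assumptions; use the command and parameter file described in the full pipeline section for the first step.

```bash
# Step 1: per-sample analysis (assembly, binning, annotation, ...),
# run independently for each sample set / infrastructure.
# Input flags are omitted here; see the full pipeline section.
./nextflow run main.nf -work-dir /vol/spool/work \
    -profile slurm \
    -entry wFullPipeline \
    -params-file default/fullPipeline.yml \
    --s3SignIn false \
    --scratch /vol/scratch \
    --databases /vol/scratch/databases \
    --output output

# Step 2: aggregation (dereplication, read mapping, co-occurrence),
# run once on the combined per-sample output.
# See the wAggregatePipeline command in the "Run the Toolkit" section below.
```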
Requirements

- SLURM: The Toolkit was mainly developed for cloud-based clusters that use SLURM as the resource orchestrator.
- Docker: Docker must be available on all worker nodes.
- Java: Java must be installed in order to run Nextflow.
- Nextflow version: Please check the nextflow.config file for the supported Nextflow versions.
- Resources: Scratch space must be available on the worker nodes (see the --scratch parameter below).
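A quick way to sanity-check these requirements on a node is sketched below (the commands are generic; adjust them to your environment):

```bash
# SLURM, Docker and Java must be available
sinfo --version                               # SLURM is reachable
docker info > /dev/null && echo "Docker OK"   # Docker works on this node
java -version                                 # required to run Nextflow
```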
Preparation

First, check out the GitHub repository in a directory that is shared by all worker nodes:
```bash
git clone git@github.com:metagenomics/metagenomics-tk.git
cd metagenomics-tk
```
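Since the supported Nextflow versions are declared in nextflow.config, you can inspect them right after cloning (a sketch; the exact attribute name in the manifest may differ):

```bash
# Show the Nextflow version constraint declared by the pipeline
grep -i "nextflowVersion" nextflow.config
```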
Run the Toolkit
```bash
./nextflow run main.nf -work-dir /vol/spool/work \
    -profile slurm \
    -entry wAggregatePipeline \
    -params-file default/fullPipelineAggregate.yml \
    --s3SignIn false \
    --scratch /vol/scratch \
    --databases /vol/scratch/databases \
    --input output \
    --output output
```
where

- `-work-dir` points to a directory that is shared between multiple machines.
- `-profile` defines the execution profile that should be used.
- `-entry` is the entry point of the Toolkit.
- `-params-file` sets the parameter file which defines the parameters for all tools (see the input section below).
- `--s3SignIn` defines whether a sign-in to S3 is necessary for retrieving inputs. See the S3 configuration section for more information on how to configure the Toolkit for possible S3 input data.
- `--scratch` is the directory on the worker node where all intermediate results are saved.
- `--databases` is the directory on the worker node where all databases are saved. Databases that have already been downloaded to a shared file system can be configured in the database setting of the corresponding database section in the configuration file.
- `--output` is the output directory where all results are saved. If you want to know more about which outputs are created, please refer to the modules section.
- `--input` points to the output directory of the per-sample workflow.

The output directory is the same directory that is used as input.
Parameter override
Any parameter specified with a double dash on the command line overrides the corresponding parameter that is already defined in the YAML file.
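For example, the parameter file below sets output: "output"; adding the double-dash flag on the command line takes precedence for that run. The path /vol/spool/aggregate_output is only an illustrative placeholder:

```bash
# Overrides output: "output" from default/fullPipelineAggregate.yml
# (remaining flags as in the command above).
./nextflow run main.nf -entry wAggregatePipeline \
    -params-file default/fullPipelineAggregate.yml \
    --output /vol/spool/aggregate_output
```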
Input

The parameter file passed via -params-file contains the following settings:
```yaml
tempdir: "tmp"
summary: false
s3SignIn: true
output: "output"
input: "fullPipelineOutput"
logDir: log
runid: 1
logLevel: 1
scratch: "/vol/scratch"
publishDirMode: "symlink"
steps:
  dereplication:
    bottomUpClustering:
      # stricter MIMAG medium quality
      minimumCompleteness: 50
      maximumContamination: 5
      ANIBuffer: 20
      mashBuffer: 2000
      method: 'ANI'
      additionalParams:
        mash_sketch: ""
        mash_dist: ""
        # cluster cutoff
        cluster: " -c 0.05 "
        pyani: " -m ANIb "
        representativeAniCutoff: 0.95
  readMapping:
    bwa2:
      additionalParams:
        bwa2_index: ""
        bwa2_mem: ""
    # This module produces two abundance tables.
    # One table is based on relative abundance and the second one on the trimmed mean.
    # Just using relative abundance makes it difficult to tell if a genome is part of a dataset.
    # That's why it makes sense to set at least a low minimum covered fraction parameter.
    coverm: " --min-covered-fraction 80 --min-read-percent-identity 95 --min-read-aligned-percent 95 "
    covermONT: " --min-covered-fraction 80 --min-read-aligned-percent 95 "
    minimap:
      additionalParams:
        minimap_index: ""
        minimap: ""
  cooccurrence:
    inference:
      additionalParams:
        method: 'correlation'
        rscript: " --mincovthreshold 0.9 --maxzero 60 --minweight 0.4 "
        timeLimit: "AUTO"
    metabolicAnnotation:
      additionalParams:
        metabolicEdgeBatches: 5
        metabolicEdgeReplicates: 10
        smetana: " --flavor bigg --molweight "
resources:
  highmemLarge:
    cpus: 28
    memory: 230
  highmemMedium:
    cpus: 14
    memory: 113
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1
```
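If you want to adjust any of these values, one way (a sketch; the file name myAggregate.yml is an arbitrary placeholder) is to copy the default parameter file, edit it, and point -params-file at your copy:

```bash
# Copy the default aggregation parameters and edit them, e.g. to require a
# stricter completeness threshold for dereplication (minimumCompleteness: 70).
cp default/fullPipelineAggregate.yml myAggregate.yml

./nextflow run main.nf -work-dir /vol/spool/work \
    -profile slurm \
    -entry wAggregatePipeline \
    -params-file myAggregate.yml \
    --s3SignIn false \
    --scratch /vol/scratch \
    --databases /vol/scratch/databases \
    --input output \
    --output output
```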
Output

The produced output can be inspected on the modules page.
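To get a quick overview of which result directories were produced, you can list the output directory. This is a generic sketch; the exact layout depends on the modules that were run:

```bash
# List result directories up to three levels below the output directory
find output -maxdepth 3 -type d | sort
```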
Further Reading
- Pipeline Configuration: If you want to configure and optimize the Toolkit for your data or your infrastructure, you can continue with the configuration section.
- In case you want to import the output to EMGB, please go on to the EMGB part. Please keep in mind that only the per-sample part is necessary for EMGB.
- If you want to add databases beyond the pre-configured ones, you can read here how to do this.
- You might want to adjust the resource requirements of the Toolkit to your infrastructure.