Per-Sample and Aggregation
There are two ways to execute the Toolkit. You can either run all steps in one execution, or you can first run the per-sample analysis (e.g. assembly, binning, annotation) and afterwards combine the results (e.g. dereplication, co-occurrence) in a second run. The second option allows you to process multiple samples via independent Toolkit executions on different infrastructures and to combine all results afterwards.
You would first run the wFullPipeline mode as described in the full pipeline section, but without the dereplication, read mapping and co-occurrence modules, and afterwards run the aggregation as described below.
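The two-step approach looks roughly as follows. This is a sketch for illustration only: the per-sample parameter file name (default/fullPipeline.yml) and the omitted input flags are assumptions; use the command and parameter file described in the full pipeline section for the first step.

```bash
# Step 1: per-sample analysis (assembly, binning, annotation, ...),
# run independently for each sample set / infrastructure.
# Input flags are omitted here; see the full pipeline section.
./nextflow run main.nf -work-dir /vol/spool/work \
    -profile slurm \
    -entry wFullPipeline \
    -params-file default/fullPipeline.yml \
    --s3SignIn false \
    --scratch /vol/scratch \
    --databases /vol/scratch/databases \
    --output output

# Step 2: aggregation (dereplication, read mapping, co-occurrence),
# run once on the combined per-sample output.
# See the wAggregatePipeline command in the "Run the Toolkit" section below.
```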
Requirements

- SLURM: The Toolkit was mainly developed for cloud-based clusters that use SLURM as the resource orchestrator.
- Docker: Docker must be available on all worker nodes.
- Java: Java must be installed in order to run Nextflow.
- Nextflow version: Please check the nextflow.config file for the supported Nextflow versions.
- Resources: Scratch space must be available on the worker nodes (see the --scratch parameter below).
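A quick way to sanity-check these requirements on a node is sketched below (the commands are generic; adjust them to your environment):

```bash
# SLURM, Docker and Java must be available
sinfo --version                               # SLURM is reachable
docker info > /dev/null && echo "Docker OK"   # Docker works on this node
java -version                                 # required to run Nextflow
```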
Preparation

First, check out the GitHub repository in a directory that is shared by all worker nodes:
```bash
git clone git@github.com:metagenomics/metagenomics-tk.git
cd metagenomics-tk
```
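Since the supported Nextflow versions are declared in nextflow.config, you can inspect them right after cloning (a sketch; the exact attribute name in the manifest may differ):

```bash
# Show the Nextflow version constraint declared by the pipeline
grep -i "nextflowVersion" nextflow.config
```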
Run the Toolkit
```bash
./nextflow run main.nf -work-dir /vol/spool/work \
    -profile slurm \
    -entry wAggregatePipeline \
    -params-file default/fullPipelineAggregate.yml \
    --s3SignIn false \
    --scratch /vol/scratch \
    --databases /vol/scratch/databases \
    --input output \
    --output output
```
where

- `-work-dir` points to a directory that is shared between multiple machines.
- `-profile` defines the execution profile that should be used.
- `-entry` is the entry point of the Toolkit.
- `-params-file` sets the parameter file which defines the parameters for all tools (see the input section below).
- `--s3SignIn` defines whether a sign-in to S3 is necessary for retrieving inputs. See the S3 configuration section for more information on how to configure the Toolkit for possible S3 input data.
- `--scratch` is the directory on the worker node where all intermediate results are saved.
- `--databases` is the directory on the worker node where all databases are saved. Databases that have already been downloaded to a shared file system can be configured in the database setting of the corresponding database section in the configuration file.
- `--output` is the output directory where all results are saved. If you want to know more about which outputs are created, please refer to the modules section.
- `--input` points to the output directory of the per-sample workflow.

The output directory is the same directory that is used as input.
Parameter override
Any parameter specified with a double dash on the command line overrides the corresponding parameter that is already defined in the YAML file.
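For example, the parameter file below sets output: "output"; adding the double-dash flag on the command line takes precedence for that run. The path /vol/spool/aggregate_output is only an illustrative placeholder:

```bash
# Overrides output: "output" from default/fullPipelineAggregate.yml
# (remaining flags as in the command above).
./nextflow run main.nf -entry wAggregatePipeline \
    -params-file default/fullPipelineAggregate.yml \
    --output /vol/spool/aggregate_output
```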
Input

The parameter file passed via -params-file contains the following settings:
```yaml
tempdir: "tmp"
summary: false
s3SignIn: true
output: "output"
input: "fullPipelineOutput"
logDir: log
runid: 1
logLevel: 1
scratch: "/vol/scratch"
publishDirMode: "symlink"
steps:
  dereplication:
    bottomUpClustering:
      # stricter MIMAG medium quality
      minimumCompleteness: 50
      maximumContamination: 5
      ANIBuffer: 20
      mashBuffer: 2000
      method: 'ANI'
      additionalParams:
        mash_sketch: ""
        mash_dist: ""
        # cluster cutoff
        cluster: " -c 0.05 "
        pyani: " -m ANIb "
        representativeAniCutoff: 0.95
  readMapping:
    bwa2:
      additionalParams:
        bwa2_index: ""
        bwa2_mem: ""
    # This module produces two abundance tables.
    # One table is based on relative abundance and the second one on the trimmed mean.
    # Just using relative abundance makes it difficult to tell if a genome is part of a dataset.
    # That's why it makes sense to set at least a low minimum covered fraction parameter.
    coverm: " --min-covered-fraction 80 --min-read-percent-identity 95 --min-read-aligned-percent 95 "
    covermONT: " --min-covered-fraction 80 --min-read-aligned-percent 95 "
    minimap:
      additionalParams:
        minimap_index: ""
        minimap: ""
  cooccurrence:
    inference:
      additionalParams:
        method: 'correlation'
        rscript: " --mincovthreshold 0.9 --maxzero 60 --minweight 0.4 "
        timeLimit: "AUTO"
    metabolicAnnotation:
      additionalParams:
        metabolicEdgeBatches: 5
        metabolicEdgeReplicates: 10
        smetana: " --flavor bigg --molweight "
resources:
  highmemLarge:
    cpus: 28
    memory: 230
  highmemMedium:
    cpus: 14
    memory: 113
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1
```
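If you want to adjust any of these values, one way (a sketch; the file name myAggregate.yml is an arbitrary placeholder) is to copy the default parameter file, edit it, and point -params-file at your copy:

```bash
# Copy the default aggregation parameters and edit them, e.g. to require a
# stricter completeness threshold for dereplication (minimumCompleteness: 70).
cp default/fullPipelineAggregate.yml myAggregate.yml

./nextflow run main.nf -work-dir /vol/spool/work \
    -profile slurm \
    -entry wAggregatePipeline \
    -params-file myAggregate.yml \
    --s3SignIn false \
    --scratch /vol/scratch \
    --databases /vol/scratch/databases \
    --input output \
    --output output
```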
Output

The produced output can be inspected on the modules page.
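To get a quick overview of which result directories were produced, you can list the output directory. This is a generic sketch; the exact layout depends on the modules that were run:

```bash
# List result directories up to three levels below the output directory
find output -maxdepth 3 -type d | sort
```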
Further Reading
- Pipeline Configuration: If you want to configure and optimize the Toolkit for your data or your infrastructure, you can continue with the configuration section.
- In case you want to import the output to EMGB, please go on to the EMGB part. Please keep in mind that only the per-sample part is necessary for EMGB.
- If you want to add databases beyond the pre-configured ones, you can read here how to do this.
- You might want to adjust the resource requirements of the Toolkit to your infrastructure.