Full Pipeline

The full pipeline mode allows you to run the per-sample part and the aggregation part in one execution (see schematic overview). In contrast to the Quickstart section, this chapter refers to the execution of the Toolkit on a cluster system. In this section, you will run the Toolkit with a dataset stored on a remote server and then learn how to replace it with your own local data. You will then learn how to add additional analyses steps to your pipeline configuration and how to replace the tools in a module with alternative ones.

Requirements¶

SLURM: The Toolkit was mainly developed for cloud-based clusters using SLURM as a resource orchestrator.
Docker: Install Docker by following the official Docker installation instructions.
Java: In order to run Nextflow, you need to install Java on your machine, which can be achieved via sudo apt install default-jre.
Nextflow should be installed. Please check the official Nextflow instructions
You will need at least 150 GB of scratch space on every worker node.
This tutorial requires at least one worker node with 28 cores and 230 GB of RAM.

Part 1: Run the Toolkit with a basic config¶

In this section you will learn how to run the Toolkit. The input data will be downloaded automatically. The following Toolkit analyses will be performed: quality control, assembly, binning, taxonomic classification, contamination and completeness of MAGs, and gene prediction and annotation via Prokka.

Default Configuration

Please note that in the following you will modify our default (best practice) configuration.
For most cases you don't need to modify our default configuration, you might only need to remove or add analyses.

NXF_HOME=$PWD/.nextflow NXF_VER=24.10.4 nextflow run metagenomics/metagenomics-tk -work-dir $(pwd)/work \
    -profile slurm \
    -entry wFullPipeline \
    -ansi-log false \
    -params-file  https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/default/fullPipeline_illumina_nanpore_getting_started_part1.yml \
    --logDir log1 \
    --s3SignIn false \
    --scratch /vol/scratch \
    --databases /vol/scratch/databases \
    --output output \
    --input.paired.path https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/test_data/fullPipeline/reads_split.tsv

where

NXF_HOME points to the directory where Nextflow internal files and additional configs are stored. The default location is your home directory. However, it might be that your home directory is not shared among all worker nodes and is only available on the master node. In this example the variable points to your current working directory ($PWD/.nextflow).
-work-dir points in this example to your current working directory and should point to a directory that is shared between all worker nodes.
-profile defines the execution profile that should be used (local or cluster computing).
-entry is the entrypoint of the Toolkit.
-params-file sets the parameters file which defines the parameters for all tools. (see input section below)
--logDir points to a directory where your trace TSV, a timeline HTML of the executed processes and a report regarding the resource consumption of the workflow is saved.
--s3SignIn defines if any S3 login for retrieving inputs is necessary. See the S3 configuration section for more information on how to configure the Toolkit for possible S3 input data.
--scratch is the directory on the worker node where all intermediate results are saved.
--databases is the directory on the worker node where all databases are saved. Already downloaded and extracted databases on a shared file system can be configured in the database setting of the corresponding database section in the configuration file.
--output is the output directory where all results are saved. If you want to know more about which outputs are created, then please refer to the modules section.
--input.paired.path is the path to a TSV file that lists the datasets that should be processed. Besides paired-end data there are also other input types. Please check the input section.

Parameter override

Any parameters defined with a double dash are parameters that override parameters that are already specified in the YAML file.

Input¶

Here you can see the actual input TSV and YAML which was used by the previous command and automatically downloaded by Nextflow. The TSV file only describes the input data, while the YAML file represents the Toolkit configuration.

TSV TableConfiguration File

SAMPLE  READS1  READS2
test1   https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read1_1.fq.gz  https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read2_1.fq.gz
test2   https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read1_1.fq.gz  https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read2_1.fq.gz

Must include the columns SAMPLE, READS1 and READS2. SAMPLE must contain unique dataset identifiers without whitespaces or special characters. READS1 and READS2 are paired reads and can be HTTPS URLs, S3 links or files.

tempdir: "tmp"
s3SignIn: false
input:
  paired:
    path: "test_data/fullPipeline/reads_split.tsv"
    watch: false
output: output
logDir: log
runid: 1
databases: "/vol/scratch/databases"
publishDirMode: "symlink"
logLevel: 1
scratch: "/vol/scratch"
steps:
  qc:
    fastp:
       # For PE data, the adapter sequence auto-detection is disabled by default since the adapters can be trimmed by overlap analysis. However, you can specify --detect_adapter_for_pe to enable it.
       # For PE data, fastp will run a little slower if you specify the sequence adapters or enable adapter auto-detection, but usually result in a slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
       # -q, --qualified_quality_phred       the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified.
       # --cut_front move a sliding window from front (5') to tail, drop the bases in the window if its mean quality is below cut_mean_quality, stop otherwise.
       # --length_required  reads shorter than length_required will be discarded, default is 15. (int [=15])
       # PE data, the front/tail trimming settings are given with -f, --trim_front1 and -t, --trim_tail1
       additionalParams:
         fastp: " --detect_adapter_for_pe -q 20 --cut_front --trim_front1 3 --cut_tail --trim_tail1 3 --cut_mean_quality 10 --length_required 50 "
         reportOnly: false 
       timeLimit: "AUTO"
    nonpareil:
      additionalParams: " -v 10 -r 1234 "
    kmc:
      timeLimit: "AUTO"
      additionalParams:
        # Computes k-mer distribution based on k-mer length 13 and 21
        #  -sm - use strict memory mode (memory limit from -m<n> switch will not be exceeded)
        #  -cs<value> - maximal value of a counter
        count: " -sm -cs10000 "
        histo: " -cx50000 "

  qcONT:
    porechop:
       additionalParams:
         # Input files are split into chunks because of RAM issues
         chunkSize: 450000
         porechop: ""
        # --keep_percent Throw out the worst 10% of reads. This is measured by bp, not by read count. So this option throws out the worst 10% of read bases. 
        # 
         filtlong: " --min_length 1000  --keep_percent 90 "
    nanoplot:
      additionalParams: ""
  assembly:
    megahit:
      # --mem-flag 0 to use minimum memory, --mem-flag 1 (default) moderate memory and --mem-flag 2 all memory.
      # meta-sensitive: '--min-count 1 --k-list 21,29,39,49,...,129,141' 
      # meta-large:  '--k-min  27  --k-max 127 --k-step 10' (large & complex metagenomes, like soil)
      additionalParams: " --min-contig-len 1000 --presets meta-sensitive "
      fastg: true
      resources:
        RAM:
          mode: 'PREDICT'
          predictMinLabel: 'medium'
  assemblyONT:
    metaflye:
      additionalParams: " -i 1 "
      quality: "AUTO"
  binning:
    bwa2:
      additionalParams: 
        bwa2: " "
        # samtools flags are used to filter resulting bam file
        samtoolsView: " -F 3584 " 
    contigsCoverage:
      additionalParams: " --min-covered-fraction 0 --min-read-percent-identity 100 --min-read-aligned-percent 100 "
    genomeCoverage:
      additionalParams: " --min-covered-fraction 0 --min-read-percent-identity 100 --min-read-aligned-percent 100 "
    # Primary binning tool
    metabat:
      # Set --seed positive numbers to reproduce the result exactly. Otherwise, random seed will be set each time.
      additionalParams: " --seed 234234  "
    # Secondary binning tool for use with MAGscot
  binningONT:
    minimap:
      additionalParams: 
        minimap: " "
        # samtools flags are used to filter resulting bam file
        samtoolsView: " -F 3584 " 
    contigsCoverage:
      additionalParams: " --min-covered-fraction 0  --min-read-aligned-percent 100 "
    genomeCoverage:
      additionalParams: " --min-covered-fraction 0  --min-read-aligned-percent 100 "
    metabat:
      additionalParams: " --seed 234234  "
  magAttributes:
    # gtdbtk classify_wf
    # --min_af minimum alignment fraction to assign genome to a species cluster (0.5)
    gtdb:
      buffer: 1000
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/gtdbtk_r214_data.tar.gz
          md5sum: 390e16b3f7b0c4463eb7a3b2149261d9
      additionalParams: " --min_af 0.65 --scratch_dir . "
    checkm2:
      database:
        download:
          source: "https://openstack.cebitec.uni-bielefeld.de:8080/databases/checkm2_v2.tar.gz"
          md5sum: a634cb3d31a1f56f2912b74005f25f09
      additionalParams: "  "
  annotation:
    prokka:
      defaultKingdom: false
      additionalParams: " --mincontiglen 500 "
resources:
  highmemLarge:
    cpus: 28
    memory: 230
  highmemMedium:
    cpus: 14
    memory: 113
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1

Output¶

You can read more about the produced output on the respective modules page.

Part 2: Run the Toolkit with your own data¶

Now that you have been able to run the Toolkit on your system, let's use the same configuration, but with your own data that may already be locally available . We will simulate the provisioning of local files by first downloading sample paired-end FASTQ files (size: 1.2 GB). Please note that these are the same fastq files as in the previous part. The only difference to previous part is that you will download the data beforehand.

mkdir inputFiles
wget -P inputFiles https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read1_1.fq.gz 
wget -P inputFiles https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read2_1.fq.gz

In addition, you have to tell the Toolkit the name of the sample and the path to the files that are the subject of the analysis. For this reason, we will create a file that contains three columns: sample, path to left reads, and path to right reads.

INPUT_FILES=$(pwd)/inputFiles/input.tsv
echo -e "SAMPLE\tREADS1\tREADS2" > $INPUT_FILES
echo -e "MYDATA\t$(readlink -f inputFiles/read1*)\t$(readlink -f inputFiles/read2*)" >> $INPUT_FILES

Now that the files are created, you are ready to execute the Toolkit on your data but with a modified command of the first part. The only difference is that you modify the --input.paired.path variable.

NXF_HOME=$PWD/.nextflow NXF_VER=24.10.4 nextflow run metagenomics/metagenomics-tk -work-dir $(pwd)/work \
    -profile slurm \
    -entry wFullPipeline \
    -ansi-log false \
    -params-file  https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/default/fullPipeline_illumina_nanpore_getting_started_part1.yml \
    --logDir log2 \
    --s3SignIn false \
    --scratch /vol/scratch \
    --databases /vol/scratch/databases \
    --output my_data_output \
    --input.paired.path inputFiles/input.tsv

Now you can go through the my_data_output folder and check the results. The next section describes how to modify the analysis that was performed.

Part 3: Exchange tools of a module¶

In some cases, you may also be interested in replacing a tool from one module with another. For example, you might be interested in comparing the assembler that is set as default with another one like MetaSpades.

In this case you could replace the MEGAHIT part with the MetaSpades config.

NXF_HOME=$PWD/.nextflow NXF_VER=24.10.4 nextflow run metagenomics/metagenomics-tk -work-dir $(pwd)/work \
    -profile slurm \
    -ansi-log false \
    -entry wFullPipeline \
    -params-file  https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/default/fullPipeline_illumina_nanpore_getting_started_part3.yml \
    --logDir log3 \
    --s3SignIn false \
    --scratch /vol/scratch \
    --databases /vol/scratch/databases \
    --output my_data_spades_output \
    --input.paired.path inputFiles/input.tsv

This is the MetaSpades part that was used by the previous command instead of the MEGAHIT configuration:

  assembly:
    metaspades:
      additionalParams: "  "
      fastg: true

If you now compare the contigs of the two assemblers with the following command, you will notice that the MetaSpades assembly has a higher N50 than the MEGAHIT one.

cat my_data_spades_output/MYDATA/1/assembly/*/metaspades/MYDATA_contigs_stats.tsv
#SAMPLE  file                    format  type    num_seqs  sum_len    min_len avg_len max_len Q1      Q2      Q3      sum_gap N50     Q20(%)  Q30(%)  GC(%)
#MYDATA  MYDATA_contigs.fa.gz    FASTA   DNA     1761      4521427    1000    2567.5  26470   1149.0  1408.0  2119.0  0       3799    0.00    0.00    55.07

cat my_data_output/MYDATA/1/assembly/*/megahit/MYDATA_contigs_stats.tsv
#SAMPLE  file                    format  type    num_seqs  sum_len    min_len avg_len max_len Q1      Q2      Q3      sum_gap N50     Q20(%)  Q30(%)  GC(%)
#MYDATA  MYDATA_contigs.fa.gz    FASTA   DNA     95227     60110517   56      631.2   346664  234.0   286.0   439.0   0       1229    0.00    0.00    57.34

Part 4: Add further analyses¶

Now, suppose you also want to check the presence of your genes in other databases. With the help of the Toolkit, you could create your own database with the syntax as described in the wiki. In this example, we will use the bacmet database. What you need to do here is to add the following part to the annotation section of the Toolkit configuration.

The bacmet database snippet is the following:

    mmseqs2:
      chunkSize: 20000
      bacmet20_experimental:
        additionalParams:
          search : ' --max-seqs 300 --max-accept 50 -c 0.8 --cov-mode 0 --start-sens 4 --sens-steps 1 -s 6 --num-iterations 2 -e 0.001 --e-profile 0.01 --db-load-mode 3 '
          additionalColumns: ""
        database:
          download:
            source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/bacmet20_experimental.tar.zst
            md5sum: 57a6d328486f0acd63f7e984f739e8fe

By re-running the Toolkit with this configuration, you will see that the previous results were cached (see next snippet) and only the annotation part is re-executed.

...
[a7/d8e782] Cached process > wFullPipeline:_wProcessIllumina:wShortReadBinningList:_wBinning:pMetabat (MYDATA)
[c3/537cbc] Cached process > wFullPipeline:_wProcessIllumina:wShortReadBinningList:_wBinning:pCovermContigsCoverage (Sample: MYDATA)
[bb/fcd1aa] Cached process > wFullPipeline:_wProcessIllumina:wShortReadBinningList:_wBinning:pCovermGenomeCoverage (Sample: MYDATA)
[8c/08e157] Cached process > wFullPipeline:wMagAttributesList:_wMagAttributes:pGtdbtk (Sample: MYDATA)
[4b/286c49] Cached process > wFullPipeline:wMagAttributesList:_wMagAttributes:pCheckM2 (Sample: MYDATA)
...