Introduction

This tutorial is a short introduction to the Metagenomics-Toolkit which shows the main steps in analysing metagenomics data using the Metagenomics-Toolkit. A more detailed introduction can be found here. In this part you will learn how to configure and run the Toolkit and what the output of a Toolkit run looks like.

Tutorial Scope and Requirements¶

The Metagenomics-Toolkit allows you to run either the full pipeline of assembly, binning, and many other downstream analysis tasks or the individual analyses separately. In this tutorial you will only use the full pipeline mode. The full pipeline mode itself is structured into two parts. The first part runs the Toolkit on each sample separately (per-sample), and the second part runs a combined downstream analysis on the output of the per-sample runs; this step is called aggregation. While there are several optimizations for running the Toolkit on a cloud-based setup, during this workshop you will run the Toolkit on a single machine.

Requirements¶

Course Participants

In case you are a course participant, it is very likely that a machine has been prepared for you and you can ignore this section. If in doubt, ask the course organizers.

Basic Linux command-line usage
This tutorial has been tested on a machine with 28 CPUs and 64 GB of RAM with Ubuntu installed on it.
Docker: Install Docker by following the official Docker installation instructions.
Java: In order to run Nextflow, you need to install Java on your machine, which can be achieved via sudo apt install default-jre.
Nextflow should be installed. Please check the official Nextflow instructions.

Metagenomics-Toolkit Introduction¶

Execution¶

The Toolkit is based on Nextflow and can be executed with the following command-line pattern:

NXF_VER=NEXTFLOW_VERSION nextflow run metagenomics/metagenomics-tk NEXTFLOW_OPTIONS TOOLKIT_OPTIONS

NEXTFLOW_VERSION is the Nextflow version supported (or required) by the Toolkit. Every code snippet in this tutorial has a hard-coded version number. If you ever choose the wrong version, then the Toolkit will print out the versions that are supported.
NEXTFLOW_OPTIONS are options that are implemented by Nextflow:
- -profile determines the technology which is used to execute the Toolkit. Here we support standard for running the workflow on a single machine and slurm for running the Toolkit on a cluster which uses SLURM to distribute jobs.
- -params-file points to a configuration file that tells the Toolkit which analyses to run and which resources it should use. An example configuration file will be explained in the next section.
- -resume In some cases, you may want to resume the workflow execution, such as when you add an analysis. Resuming the workflow forces Nextflow to reuse the results of the previous analyses that the new analysis depends on, rather than starting from scratch.
- -ansi-log accepts a boolean (default: true) that, when set to true, tells Nextflow to print every update as a new line on the terminal. If false then Nextflow prints a line for every process and updates the specific line on an update. We recommend setting -ansi-log to false because it is not possible to print all possible processes on a terminal at once when running the Toolkit.
- -entry specifies which entrypoint Nextflow should use to run the workflow. To run the full pipeline that you will use in this workshop, use the wFullPipeline entrypoint. If you ever want to run separate modules, you can check on the modules-specific page (e.g. assembly).
TOOLKIT_OPTIONS are options that are provided by the Toolkit. All Toolkit options are either in a configuration file or can be provided on the command line which will be explained in the following section.

Task 2

Open the Metagenomics-Toolkit wiki in a second browser tab by clicking this link. Imagine you need to run the quality-control part separately. Can you tell the name of the entrypoint? Use the wiki page you have opened on another tab to answer the question.

Solution

If you go to the quality control part, then you will find the wShortReadQualityControl entrypoint for short reads and the wOntQualityControl entrypoint for long reads.

Configuration¶

The Toolkit uses a YAML configuration file that specifies global parameters, the analyses that will be executed and the computational resources that can be used.

The configuration file is divided into three parts:

Part 1: Global Workflow Parameters¶

The following snippet shows parameters that affect the whole execution of the workflow. All parameters are explained in a dedicated Toolkit wiki section.

Example Configuration File Snippet 1
summary: false
s3SignIn: false 
input:
  paired:
    sheet: "test_data/tutorials/tutorial1/reads.tsv"
    watch: false
output: output
logDir: log
runid: 1
databases: "/vol/scratch/databases"
publishDirMode: "symlink"
logLevel: 1
scratch: false 

Computational Resources

Please note that computational resources are also global parameters and will be handled in the third part of this configuration section.

Input Field¶

The input field (line 3, snippet 1) specifies the type of input data to process (Nanopore, Illumina, data hosted on SRA or a mirror) and you can find a dedicated wiki section here. Regardless of which input type is used, the user must provide a file containing a list of datasets to be processed. The list can be a list of remote or local files and in the case of SRA, a list of SRA run IDs.

Since you will work with short read data in this tutorial, your input file looks like this:

Sample Sheet
SAMPLE  READS1  READS2
SRR492183   https://openstack.cebitec.uni-bielefeld.de:8080/ftp.era.ebi.ac.uk/vol1/fastq/SRR492/SRR492183/SRR492183_1.fastq.gz  https://openstack.cebitec.uni-bielefeld.de:8080/ftp.era.ebi.ac.uk/vol1/fastq/SRR492/SRR492183/SRR492183_2.fastq.gz
SRR492065   https://openstack.cebitec.uni-bielefeld.de:8080/ftp.era.ebi.ac.uk/vol1/fastq/SRR492/SRR492065/SRR492065_1.fastq.gz  https://openstack.cebitec.uni-bielefeld.de:8080/ftp.era.ebi.ac.uk/vol1/fastq/SRR492/SRR492065/SRR492065_2.fastq.gz

The first column (SAMPLE) specifies the name of the dataset. The second (READS1) and third (READS2) columns specify the files containing the forward and reverse reads.

Part 2: Toolkit Analyses Steps¶

Analyses (also called modules) that the Toolkit executes are placed directly under the steps attribute in the configuration file. In the example below, the modules qc and assembly are placed directly under the steps attribute. Any tools or methods that are used as part of the module can be considered a property of the module. For example, MEGAHIT is executed as part of the assembly module. The level below the tool names is for configuring the tools and methods. Each analysis is listed on the modules page.

Example Configuration File Snippet 2
steps:
  qc:
    fastp:
       # For PE data, the adapter sequence auto-detection is disabled by default since the adapters can be trimmed by overlap analysis. However, you can specify --detect_adapter_for_pe to enable it.
       # For PE data, fastp will run a little slower if you specify the sequence adapters or enable adapter auto-detection, but usually result in a slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
       # -q, --qualified_quality_phred       the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified.
       # --cut_front move a sliding window from front (5') to tail, drop the bases in the window if its mean quality is below cut_mean_quality, stop otherwise.
       # --length_required  reads shorter than length_required will be discarded, default is 15. (int [=15])
       # PE data, the front/tail trimming settings are given with -f, --trim_front1 and -t, --trim_tail1
       additionalParams:
         fastp: " --detect_adapter_for_pe -q 20 --cut_front --trim_front1 3 --cut_tail --trim_tail1 3 --cut_mean_quality 10 --length_required 50 "
         reportOnly: false
       timeLimit: "AUTO"
    nonpareil:
      additionalParams: " -v 10 -r 1234 "
    filterHuman:
      additionalParams: "  "
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/human_filter.db.20231218v2.gz
          md5sum: cc92c0f926656565b1156d66a0db5a3c
  assembly:
    megahit:
      # --mem-flag 0 to use minimum memory, --mem-flag 1 (default) moderate memory and --mem-flag 2 all memory.
      # meta-sensitive: '--min-count 1 --k-list 21,29,39,49,...,129,141' 
      # meta-large:  '--k-min  27  --k-max 127 --k-step 10' (large & complex metagenomes, like soil)
      additionalParams: " --min-contig-len 500 --presets meta-sensitive "
      fastg: true
      resources:
        RAM:
          mode: 'PREDICT'
          predictMinLabel: 'highmemLarge'

Part 3: Computational Resources¶

The third part of a Toolkit configuration file is the resources attribute. The resources attribute lists computational resource configurations, where each configuration has a label and consists of the number of CPUs and amount of RAM assigned to it. Predefined labels are listed in the following example snippet. These labels are assigned to the processes that run the workflow-specific tools. You can read more about resource parameters here.

Example Configuration File Snippet 3
resources:
  highmemLarge:
    cpus: 28
    memory: 60
  highmemMedium:
    cpus: 14
    memory: 30
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1

Task 3

One of the first checks before running the Toolkit is to adjust the resource labels to the resources of your machine. You can run nproc to get the number of CPUs and free -h --mega to get the amount of RAM (row name: Mem, column name: total) available on your machine. Is there enough RAM on your machine to run the Toolkit?

Solution

If you are using a machine as described in the "Requirements" section, then yes, there is enough RAM available for the workflow (more than 60 GB).

Configuration File vs. Command-line Parameters¶

All parameters defined in the YAML configuration file can also be supplied as command-line arguments. To do this, prefix each parameter with a double dash (--). If a parameter is nested within the hierarchy of the YAML file, represent it as a command-line argument by connecting each level of the hierarchy using a dot (.).

For example, consider the CPU count of the highmemLarge resource label in the previous snippet. The corresponding command-line argument would be --resources.highmemLarge.cpus.

Task 4

Let`s say you want to specify a path to a different input TSV file (see Example Configuration File Snippet 1) that contains a different set of input datasets. How would you specify the parameter on the command line?

Solution

--input.paired.sheet

Command-line arguments supersede the configuration file. This is a quick way to change variables without touching files.

Output¶

The Toolkit output fulfills the following schema:

SAMPLE_NAME/RUN_ID/MODULE/MODULE_VERSION/TOOL

RUN_ID: The run ID will be part of the output path and allows to distinguish between different pipeline configurations that were used for the same dataset.
MODULE is the analysis that is executed by the Toolkit (e.g. qc, assembly, etc.).
MODULE_VERSION is the version number of the module.
TOOL is the tool or method that is executed by the Toolkit.

Below you can see an example output structure. Every output folder includes four log files:

.command.err: Contains the standard error.
.command.out: Contains the standard output.
.command.log: Contains the combined standard error and standard output.
.command.sh: Contains the command that was executed.

Example Output Directory
output/
└── sample
    └── 1
        └── qc
            └── 0.3.0
                ├── fastp
                │   ├── .command.err
                │   ├── .command.log
                │   ├── .command.out
                │   ├── .command.sh
                │   ├── sample_fastp.json
                │   ├── sample_fastp_summary_after.tsv
                │   ├── sample_fastp_summary_before.tsv
                │   ├── sample_interleaved.qc.fq.gz
                │   ├── sample_report.html
                │   ├── sample_unpaired.qc.fq.gz
                │   └── sample_unpaired_summary.tsv
                ├── filterHuman
                │   ├── .command.err
                │   ├── .command.log
                │   ├── .command.out
                │   ├── .command.sh
                │   ├── sample_interleaved.filtered.fq.gz
                │   ├── sample_interleaved.removed.fq.gz
                │   ├── sample_interleaved_summary_after.tsv
                │   ├── sample_interleaved_summary_before.tsv
                │   ├── sample_unpaired.filtered.fq.gz
                │   ├── sample_unpaired_summary_after.tsv
                │   └── sample_unpaired_summary_before.tsv
                ├── kmc
                │   ├── .command.err
                │   ├── .command.log
                │   ├── .command.out
                │   ├── .command.sh
                │   ├── sample.13.histo.tsv
                │   ├── sample.13.kmc.json
                │   ├── sample.21.histo.tsv
                │   ├── sample.21.kmc.json
                │   ├── sample.71.histo.tsv
                │   └── sample.71.kmc.json
                └── nonpareil
                    ├── .command.err
                    ├── .command.log
                    ├── .command.out
                    ├── .command.sh
                    ├── sample.npa
                    ├── sample.npc
                    ├── sample.npl
                    ├── sample_nonpareil_curves.pdf
                    └── sample_nonpareil_index.tsv

➡️ Continue to: From Reads To Genomes