Full Pipeline
The full pipeline mode allows you to run the per-sample part and the aggregation part in one execution (see the schematic overview). In contrast to the Quickstart section, this chapter covers running the Toolkit on a cluster system. In this section, you will run the Toolkit on a dataset stored on a remote server and then learn how to replace it with your own local data. Afterwards, you will learn how to add further analysis steps to your pipeline configuration and how to replace the tools of a module with alternative ones.
Requirements¶
- SLURM: The Toolkit was mainly developed for cloud-based clusters using SLURM as a resource orchestrator.
- Docker: Install Docker by following the official Docker installation instructions.
- Java: In order to run Nextflow, you need to install Java on your machine, which can be achieved via sudo apt install default-jre.
- Nextflow: Nextflow should be installed. Please check the official Nextflow instructions (a minimal installation sketch follows this list).
- You will need at least 150 GB of scratch space on every worker node.
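As a quick sketch of these prerequisites (the official Nextflow instructions remain authoritative), Nextflow can be installed via the official self-installing script, and the scratch space can be checked as follows; adapt the paths to your setup:

# install Nextflow into the current directory (official self-installing script)
curl -s https://get.nextflow.io | bash
chmod +x nextflow && sudo mv nextflow /usr/local/bin/   # optional: put it on the PATH

# verify the Java and Nextflow versions
java -version
nextflow -version

# check the free scratch space on a worker node (at least 150 GB are required)
df -h /vol/scratch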
Part 1: Run the Toolkit with a basic config¶
In this section you will learn how to run the Toolkit. The input data will be downloaded automatically. The following Toolkit analyses will be performed: quality control, assembly, binning, taxonomic classification, contamination and completeness assessment of MAGs, and gene prediction and annotation via Prokka.
Default Configuration
Please note that in the following you will modify our default (best practice) configuration.
In most cases you don't need to modify the default configuration; you might only need to remove or add analyses.
NXF_HOME=$PWD/.nextflow NXF_VER=24.10.4 nextflow run metagenomics/metagenomics-tk -work-dir $(pwd)/work \
-profile slurm \
-entry wFullPipeline \
-ansi-log false \
-params-file https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/default/fullPipeline_illumina_nanpore_getting_started_part1.yml \
--logDir log1 \
--s3SignIn false \
--scratch /vol/scratch \
--databases /vol/scratch/databases \
--output output \
--input.paired.path https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/test_data/fullPipeline/reads_split.tsv
where
- NXF_HOME points to the directory where Nextflow internal files and additional configs are stored. The default location is your home directory. However, it might be that your home directory is not shared among all worker nodes and is only available on the master node. In this example the variable points to your current working directory ($PWD/.nextflow).
- -work-dir points in this example to your current working directory and should point to a directory that is shared between all worker nodes.
- -profile defines the execution profile that should be used (local or cluster computing).
- -entry is the entry point of the Toolkit.
- -params-file sets the parameter file which defines the parameters for all tools (see the Input section below).
- --logDir points to a directory where the trace TSV, a timeline HTML of the executed processes and a report on the resource consumption of the workflow are saved.
- --s3SignIn defines whether an S3 login is necessary for retrieving input data. See the S3 configuration section for more information on how to configure the Toolkit for possible S3 input data.
- --scratch is the directory on the worker node where all intermediate results are saved.
- --databases is the directory on the worker node where all databases are saved. Databases that are already downloaded and extracted on a shared file system can be configured in the database setting of the corresponding database section in the configuration file.
- --output is the output directory where all results are saved. If you want to know more about which outputs are created, please refer to the modules section.
- --input.paired.path is the path to a TSV file that lists the datasets that should be processed. Besides paired-end data, other input types are also supported. Please check the input section.
Parameter override
Any parameter provided with a double dash on the command line overrides the corresponding parameter that is already specified in the YAML file.
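For example, the parameter file shown in the Input section sets logDir: log, while the part 1 command passes --logDir log1:

# parameter file:   logDir: log
# command line:     --logDir log1
# effective value:  log1   (nested keys such as input.paired.path are addressed with dots)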
Input¶
Here you can see the actual input TSV and YAML files that were used by the previous command and were automatically downloaded by Nextflow. The TSV file describes only the input data, while the YAML file represents the Toolkit configuration.
SAMPLE READS1 READS2
test1 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read1_1.fq.gz https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read2_1.fq.gz
test2 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read1_1.fq.gz https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read2_1.fq.gz
The TSV file must include the columns SAMPLE, READS1 and READS2. SAMPLE must contain unique dataset identifiers without whitespace or special characters. READS1 and READS2 point to the paired reads and can be HTTPS URLs, S3 links or local file paths.
tempdir: "tmp"
s3SignIn: false
input:
paired:
path: "test_data/fullPipeline/reads_split.tsv"
watch: false
output: output
logDir: log
runid: 1
databases: "/vol/scratch/databases"
publishDirMode: "symlink"
logLevel: 1
scratch: "/vol/scratch"
steps:
qc:
fastp:
# For PE data, the adapter sequence auto-detection is disabled by default since the adapters can be trimmed by overlap analysis. However, you can specify --detect_adapter_for_pe to enable it.
# For PE data, fastp will run a little slower if you specify the sequence adapters or enable adapter auto-detection, but usually result in a slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
# -q, --qualified_quality_phred the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified.
# --cut_front move a sliding window from front (5') to tail, drop the bases in the window if its mean quality is below cut_mean_quality, stop otherwise.
# --length_required reads shorter than length_required will be discarded, default is 15. (int [=15])
      # For PE data, the front/tail trimming settings are given with -f, --trim_front1 and -t, --trim_tail1
additionalParams:
fastp: " --detect_adapter_for_pe -q 20 --cut_front --trim_front1 3 --cut_tail --trim_tail1 3 --cut_mean_quality 10 --length_required 50 "
reportOnly: false
timeLimit: "AUTO"
nonpareil:
additionalParams: " -v 10 -r 1234 "
kmc:
timeLimit: "AUTO"
additionalParams:
# Computes k-mer distribution based on k-mer length 13 and 21
# -sm - use strict memory mode (memory limit from -m<n> switch will not be exceeded)
# -cs<value> - maximal value of a counter
count: " -sm -cs10000 "
histo: " -cx50000 "
qcONT:
porechop:
additionalParams:
# Input files are split into chunks because of RAM issues
chunkSize: 450000
porechop: ""
# --keep_percent Throw out the worst 10% of reads. This is measured by bp, not by read count. So this option throws out the worst 10% of read bases.
#
filtlong: " --min_length 1000 --keep_percent 90 "
nanoplot:
additionalParams: ""
assembly:
megahit:
# --mem-flag 0 to use minimum memory, --mem-flag 1 (default) moderate memory and --mem-flag 2 all memory.
# meta-sensitive: '--min-count 1 --k-list 21,29,39,49,...,129,141'
# meta-large: '--k-min 27 --k-max 127 --k-step 10' (large & complex metagenomes, like soil)
additionalParams: " --min-contig-len 1000 --presets meta-sensitive "
fastg: true
resources:
RAM:
mode: 'PREDICT'
predictMinLabel: 'medium'
assemblyONT:
metaflye:
additionalParams: " -i 1 "
quality: "AUTO"
binning:
bwa2:
additionalParams:
bwa2: " "
# samtools flags are used to filter resulting bam file
samtoolsView: " -F 3584 "
contigsCoverage:
additionalParams: " --min-covered-fraction 0 --min-read-percent-identity 100 --min-read-aligned-percent 100 "
genomeCoverage:
additionalParams: " --min-covered-fraction 0 --min-read-percent-identity 100 --min-read-aligned-percent 100 "
# Primary binning tool
metabat:
# Set --seed positive numbers to reproduce the result exactly. Otherwise, random seed will be set each time.
additionalParams: " --seed 234234 "
# Secondary binning tool for use with MAGscot
binningONT:
minimap:
additionalParams:
minimap: " "
# samtools flags are used to filter resulting bam file
samtoolsView: " -F 3584 "
contigsCoverage:
additionalParams: " --min-covered-fraction 0 --min-read-aligned-percent 100 "
genomeCoverage:
additionalParams: " --min-covered-fraction 0 --min-read-aligned-percent 100 "
metabat:
additionalParams: " --seed 234234 "
magAttributes:
# gtdbtk classify_wf
# --min_af minimum alignment fraction to assign genome to a species cluster (0.5)
gtdb:
buffer: 1000
database:
download:
source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/gtdbtk_r214_data.tar.gz
md5sum: 390e16b3f7b0c4463eb7a3b2149261d9
additionalParams: " --min_af 0.65 --scratch_dir . "
checkm2:
database:
download:
source: "https://openstack.cebitec.uni-bielefeld.de:8080/databases/checkm2_v2.tar.gz"
md5sum: a634cb3d31a1f56f2912b74005f25f09
additionalParams: " "
annotation:
prokka:
defaultKingdom: false
additionalParams: " --mincontiglen 500 "
resources:
highmemLarge:
cpus: 28
memory: 230
highmemMedium:
cpus: 14
memory: 113
large:
cpus: 28
memory: 58
medium:
cpus: 14
memory: 29
small:
cpus: 7
memory: 14
tiny:
cpus: 1
memory: 1
Output¶
You can read more about the produced output on the respective modules page.
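The per-sample results are organized as <output>/<SAMPLE>/<run id>/<module>/... (compare the result paths used in Part 3 below). To get a quick overview of what was produced for the first test sample from Part 1, you could run something like the following sketch; the sample name test1 and run id 1 follow the input TSV and the runid setting in the configuration:

# list the module and version directories created for sample "test1"
find output/test1 -maxdepth 3 -type d | sort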
Part 2: Run the Toolkit with your own data¶
Now that you have been able to run the Toolkit on your system, let's use the same configuration, but with your own data that may already be available locally. We will simulate the provisioning of local files by first downloading sample paired-end FASTQ files (size: 1.2 GB). Please note that these are the same FASTQ files as in the previous part; the only difference to the previous part is that you download the data beforehand.
mkdir inputFiles
wget -P inputFiles https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read1_1.fq.gz
wget -P inputFiles https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read2_1.fq.gz
In addition, you have to tell the Toolkit the name of the sample and the path to the files that are the subject of the analysis. For this reason, we will create a file that contains three columns: sample, path to left reads, and path to right reads.
INPUT_FILES=$(pwd)/inputFiles/input.tsv
echo -e "SAMPLE\tREADS1\tREADS2" > $INPUT_FILES
echo -e "MYDATA\t$(readlink -f inputFiles/read1*)\t$(readlink -f inputFiles/read2*)" >> $INPUT_FILES
Now that the files are created, you are ready to execute the Toolkit on your own data with a slightly modified command from the first part.
Apart from the new --logDir and --output values, the only difference is that the --input.paired.path parameter now points to your local TSV file.
NXF_HOME=$PWD/.nextflow NXF_VER=24.10.4 nextflow run metagenomics/metagenomics-tk -work-dir $(pwd)/work \
-profile slurm \
-entry wFullPipeline \
-ansi-log false \
-params-file https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/default/fullPipeline_illumina_nanpore_getting_started_part1.yml \
--logDir log2 \
--s3SignIn false \
--scratch /vol/scratch \
--databases /vol/scratch/databases \
--output my_data_output \
--input.paired.path inputFiles/input.tsv
Now you can go through the my_data_output
folder and check the results.
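For example, the MAG completeness, contamination and taxonomy reports of the magAttributes module can be located as follows (the directory pattern is an assumption derived from the step names in the configuration; exact file names may differ):

# locate the CheckM2 and GTDB-Tk result tables for the MYDATA sample
find my_data_output/MYDATA -path '*magAttributes*' -name '*.tsv'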
The next section describes how to modify the analysis that was performed.
Part 3: Exchange tools of a module¶
In some cases, you may also be interested in replacing a tool of a module with another one. For example, you might want to compare the default assembler with an alternative such as MetaSpades.
In this case you could replace the MEGAHIT part with the MetaSpades config.
NXF_HOME=$PWD/.nextflow NXF_VER=24.10.4 nextflow run metagenomics/metagenomics-tk -work-dir $(pwd)/work \
-profile slurm \
-ansi-log false \
-entry wFullPipeline \
-params-file https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/default/fullPipeline_illumina_nanpore_getting_started_part3.yml \
--logDir log3 \
--s3SignIn false \
--scratch /vol/scratch \
--databases /vol/scratch/databases \
--output my_data_spades_output \
--input.paired.path inputFiles/input.tsv
This is the MetaSpades part that was used by the previous command instead of the MEGAHIT configuration:
additionalParams: ""
assembly:
metaspades:
additionalParams: " "
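If you would rather maintain such a configuration yourself instead of using the hosted part 3 parameter file, a minimal sketch (my_params.yml is just an assumed name for a local working copy) is:

# download the part 1 parameter file as a local working copy
wget -O my_params.yml https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/default/fullPipeline_illumina_nanpore_getting_started_part1.yml
# in my_params.yml, replace the megahit block below "assembly:" with
#   metaspades:
#     additionalParams: " "
# and start the run with "-params-file my_params.yml" instead of the hosted URL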
If you now compare the contig statistics of the two assemblers with the following commands, you will notice that the MetaSpades assembly has a higher N50 than the MEGAHIT one.
cat my_data_spades_output/MYDATA/1/assembly/*/metaspades/MYDATA_contigs_stats.tsv
#SAMPLE file format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 Q20(%) Q30(%) GC(%)
#MYDATA MYDATA_contigs.fa.gz FASTA DNA 1761 4521427 1000 2567.5 26470 1149.0 1408.0 2119.0 0 3799 0.00 0.00 55.07
cat my_data_output/MYDATA/1/assembly/*/megahit/MYDATA_contigs_stats.tsv
#SAMPLE file format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 Q20(%) Q30(%) GC(%)
#MYDATA MYDATA_contigs.fa.gz FASTA DNA 95227 60110517 56 631.2 346664 234.0 286.0 439.0 0 1229 0.00 0.00 57.34
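A quick way to extract just the N50 values from both stats tables, assuming the column layout shown above (N50 is column 14 of the tab-separated file), is:

# print the stats file path and its N50 for both runs
for f in my_data_spades_output/MYDATA/1/assembly/*/metaspades/MYDATA_contigs_stats.tsv \
my_data_output/MYDATA/1/assembly/*/megahit/MYDATA_contigs_stats.tsv; do
awk -F '\t' 'NR==2 {print FILENAME "\tN50=" $14}' "$f"
done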
Part 4: Add further analyses¶
Now, suppose you also want to check the presence of your genes in other databases. With the help of the Toolkit, you could create your own database with the syntax described in the wiki. In this example, we will use the BacMet database. All you need to do is add the following part to the annotation section of the Toolkit configuration.
The BacMet database snippet is the following:
additionalParams: " --mincontiglen 500 "
mmseqs2:
chunkSize: 20000
bacmet20_experimental:
additionalParams:
search : ' --max-seqs 300 --max-accept 50 -c 0.8 --cov-mode 0 --start-sens 4 --sens-steps 1 -s 6 --num-iterations 2 -e 0.001 --e-profile 0.01 --db-load-mode 3 '
additionalColumns: ""
database:
download:
source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/bacmet20_experimental.tar.zst
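A corresponding re-run command could look like the sketch below; my_params_annotation.yml stands for a local copy of the part 1 configuration extended with the snippet above (the exact file name is an assumption), and Nextflow's -resume flag is what allows already computed results to be reused:

NXF_HOME=$PWD/.nextflow NXF_VER=24.10.4 nextflow run metagenomics/metagenomics-tk -work-dir $(pwd)/work \
-profile slurm \
-entry wFullPipeline \
-ansi-log false \
-resume \
-params-file my_params_annotation.yml \
--logDir log4 \
--s3SignIn false \
--scratch /vol/scratch \
--databases /vol/scratch/databases \
--output my_data_output \
--input.paired.path inputFiles/input.tsv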
By re-running the Toolkit with this configuration, you will see that the previous results were cached (see next snippet) and only the annotation part is re-executed.
...
[a7/d8e782] Cached process > wFullPipeline:_wProcessIllumina:wShortReadBinningList:_wBinning:pMetabat (MYDATA)
[c3/537cbc] Cached process > wFullPipeline:_wProcessIllumina:wShortReadBinningList:_wBinning:pCovermContigsCoverage (Sample: MYDATA)
[bb/fcd1aa] Cached process > wFullPipeline:_wProcessIllumina:wShortReadBinningList:_wBinning:pCovermGenomeCoverage (Sample: MYDATA)
[8c/08e157] Cached process > wFullPipeline:wMagAttributesList:_wMagAttributes:pGtdbtk (Sample: MYDATA)
[4b/286c49] Cached process > wFullPipeline:wMagAttributesList:_wMagAttributes:pCheckM2 (Sample: MYDATA)
...
Further Reading¶
- Continue to the aggregation part of the Getting Started tutorial to learn how to aggregate data of multiple samples.
- Check our configuration section for how to further adapt the Toolkit to your infrastructure.
- In case you want to import the output into EMGB, please visit the EMGB section.
- If you want to add your own sequence databases, you can read here how to do this.
- You might want to adjust the resource requirements of the Toolkit to your infrastructure.