Dereplication, and Coverage Inspection of MAGs
In this final part of the tutorial we aggregate the MAGs recovered from multiple samples, dereplicate them, and briefly inspect both the resulting clusters and the read-mapping depth (coverage) of some MAGs in the small example dataset.
Dereplication reduces a large set of MAGs to a non-redundant genome catalogue by grouping nearly identical genomes (typically using Average Nucleotide Identity, ANI) and selecting the highest-quality MAG as the representative for each cluster. This avoids double-counting strains, simplifies downstream analyses and prevents ambiguous read mapping to many almost-identical genomes.
Within the Metagenomics-Toolkit, dereplication is implemented as a separate module that:
- takes the previous run or a table describing all MAGs (including completeness, contamination and coverage),
- clusters similar genomes in a bottom-up fashion (using Mash and ANI distances), and
- selects the best representative per cluster based on quality and assembly statistics.
Metagenomics-Toolkit Execution¶
The following lines represent the part of the configuration that tells the Toolkit to run the Dereplication module on your existing results.
| Dereplication Configuration File Snippet | |
|---|---|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 | |
For reference, your complete parameter file for this final step looks like this:
Parameter-file
tempdir: "tmp"
summary: false
s3SignIn: false
output: "output"
input: "output"
logDir: log
runid: 1
logLevel: 1
scratch: false
publishDirMode: "copy"
steps:
dereplication:
bottomUpClustering:
# stricter MIMAG medium quality
minimumCompleteness: 50
maximumContamination: 5
ANIBuffer: 20
mashBuffer: 2000
method: 'ANI'
additionalParams:
mash_sketch: ""
mash_dist: ""
# cluster cutoff
cluster: " -c 0.05 "
pyani: " -m ANIb "
representativeAniCutoff: 0.95
readMapping:
bwa2:
additionalParams:
bwa2_index: ""
bwa2_mem: ""
# This module produces two abundance tables.
# One table is based on relative abundance and the second one on the trimmed mean.
# Just using relative abundance makes it difficult to tell if a genome is part of a dataset.
# Thats why it makes sense to set at leat a low min covered fraction parameter.
coverm: " --min-covered-fraction 80 --min-read-percent-identity 95 --min-read-aligned-percent 95 "
covermONT: " --min-covered-fraction 80 --min-read-aligned-percent 95 "
minimap:
additionalParams:
minimap_index: ""
minimap: ""
resources:
highmemLarge:
cpus: 28
memory: 230
highmemMedium:
cpus: 14
memory: 113
large:
cpus: 28
memory: 58
medium:
cpus: 14
memory: 29
small:
cpus: 7
memory: 14
tiny:
cpus: 1
memory: 1
We assume that all required steps by the previous tutorials are done.
If you run the Toolkit again with the -resume flag and the above dereplication configuration, it will reuse all previous results (QC, assembly, binning, MagAttributes) and only execute the new dereplication step.
Task 1
Change into the directory of the Toolkit session.
cd /vol/volume/sessions/metagenomics_metagenomics-tk
Run the dereplication module on your results using the following command:
NXF_VER=25.04.2 nextflow run metagenomics/metagenomics-tk \
-profile standard \
-params-file https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/doc/exhibition-day/tutorial/default/tutorials/tutorial2/fullPipeline_dereplication.yml \
-ansi-log false \
-entry wAggregatePipeline \
-resume \
--input output \
--logDir dereplication
This will:
- collect all the necessary results from your MAGs to create the summary table needed as input,
- cluster similar genomes based on ANI, and
- write a
clusters.tsvfile describing the dereplicated genome clusters.
Output¶
Dereplication Clusters¶
The example dereplication input table looks like this (one MAG per row):
BIN_ID COMPLETENESS COVERAGE CONTAMINATION HETEROGENEITY PATH N50
SRR492065_bin.3.fa 36.29 171.259 0.39 0 metagenomics-tk/work/58/267ffbad749945cb7dfd98b23a7449/SRR492065_bin.3.fa 73239
SRR492183_bin.2.fa 100.0 142.023 0.1 0 metagenomics-tk/work/90/b5b80ef93e064e7d723eca9bd92241/SRR492183_bin.2.fa 175163
SRR492065_bin.1.fa 85.99 14.7729 18.95 0 metagenomics-tk/work/58/267ffbad749945cb7dfd98b23a7449/SRR492065_bin.1.fa 7308
SRR492065_bin.5.fa 64.75 152.837 0.54 0 metagenomics-tk/work/58/267ffbad749945cb7dfd98b23a7449/SRR492065_bin.5.fa 161674
SRR492065_bin.4.fa 96.32 70.5657 0.14 0 metagenomics-tk/work/58/267ffbad749945cb7dfd98b23a7449/SRR492065_bin.4.fa 25145
SRR492065_bin.2.fa 94.64 23.2949 0.11 0 metagenomics-tk/work/58/267ffbad749945cb7dfd98b23a7449/SRR492065_bin.2.fa 16802
SRR492183_bin.3.fa 92.84 27.7866 0.08 0 metagenomics-tk/work/90/b5b80ef93e064e7d723eca9bd92241/SRR492183_bin.3.fa 21121
SRR492183_bin.1.fa 99.96 45.5264 0.01 0 metagenomics-tk/work/90/b5b80ef93e064e7d723eca9bd92241/SRR492183_bin.1.fa 46795
The dereplication module uses these columns as follows:
COMPLETENESS,CONTAMINATION– to filter MAGs before clustering.COVERAGE,N50,HETEROGENEITY– to decide which genome becomes the representative of each cluster.
After running Task 1, dereplication produces a clusters.tsv file with one row per MAG and three columns:
CLUSTER– cluster ID (all genomes in the same cluster are near-identical),GENOME– path or identifier of the genome,REPRESENTATIVE–1if the genome is the chosen representative,0otherwise.
A typical file will look like this:
CLUSTER GENOME REPRESENTATIVE
1 bin.1.fa 1
1 bin.2.fa 0
2 bin.8.fasta 1
2 bin.9.fasta 0
3 bin.10.fasta 0
3 bin.32.fa 1
(The exact clustering in your run may differ; this is just an illustrative example.)
Task 2
Locate the dereplication clusters file and inspect how many clusters were formed and which genomes were chosen as representatives. Hint: There is a new directory in your output folder.
Solution
Change into the Toolkit session directory (if you are not already there):
cd /vol/volume/sessions/metagenomics_metagenomics-tk
First, locate the new directory in your output folder:
ls -la output/
clusters.tsv in the similarly named directory.
cat output/AGGREGATED/1/dereplication/0.1.1/bottomUpClustering/clusters/clusters.tsv
It should look similar to this:
CLUSTER GENOME REPRESENTATIVE
1 SRR492065_bin.2.fa 0.0
3 SRR492065_bin.4.fa 1.0
2 SRR492065_bin.5.fa 0.0
2 SRR492183_bin.1.fa 1.0
4 SRR492183_bin.2.fa 1.0
1 SRR492183_bin.3.fa 1.0
This shows you:
* how many clusters exist (number of unique values in the CLUSTER column), and
* which MAG(s) per cluster have REPRESENTATIVE=1. Two MAGs where redundant and are now part of clusters represented only by its best MAG.
Read Mapping Depth (Coverage) of Example MAGs¶
Coverage values are derived from mapping reads from all input samples back to each representative MAG.
Task 3
Let's have a look to examine the coverage (read mapping depth) of our representative MAGs.
Explore the new directory, find the readMapping folder and look for the relative abundance coverage of our representatives with all SRR492065 reads.
Solution
Change again into the Toolkit session directory:
cd /vol/volume/sessions/metagenomics_metagenomics-tk
Explore the AGGREGATED directory:
ls -la output/AGGREGATED/1/
There is a readMapping directory.
If we explore further:
ls -la output/AGGREGATED/1/readMapping/0.1.0/
genomeCoverage one.
If we look at this, we see something like:
-rw-r--r-- 1 ubuntu ubuntu 111 Dec 9 18:50 SRR492065_count.tsv
-rw-r--r-- 1 ubuntu ubuntu 120 Dec 9 18:50 SRR492065_mean.tsv
-rw-r--r-- 1 ubuntu ubuntu 138 Dec 9 18:50 SRR492065_relative_abundance.tsv
-rw-r--r-- 1 ubuntu ubuntu 120 Dec 9 18:50 SRR492065_rpkm.tsv
-rw-r--r-- 1 ubuntu ubuntu 116 Dec 9 18:50 SRR492065_tpm.tsv
-rw-r--r-- 1 ubuntu ubuntu 118 Dec 9 18:50 SRR492065_trimmed_mean.tsv
-rw-r--r-- 1 ubuntu ubuntu 110 Dec 9 18:50 SRR492183_count.tsv
-rw-r--r-- 1 ubuntu ubuntu 113 Dec 9 18:50 SRR492183_mean.tsv
-rw-r--r-- 1 ubuntu ubuntu 130 Dec 9 18:50 SRR492183_relative_abundance.tsv
-rw-r--r-- 1 ubuntu ubuntu 110 Dec 9 18:50 SRR492183_rpkm.tsv
-rw-r--r-- 1 ubuntu ubuntu 112 Dec 9 18:50 SRR492183_tpm.tsv
-rw-r--r-- 1 ubuntu ubuntu 113 Dec 9 18:50 SRR492183_trimmed_mean.tsv
drwxr-xr-x 2 ubuntu ubuntu 4096 Dec 9 18:50 logs
SRR492065_relative_abundance.tsv we find the right abundances.
cat output/AGGREGATED/1/readMapping/0.1.0/genomeCoverage/SRR492065_relative_abundance.tsv
Genome SRR492065
unmapped 21.920425
SRR492065_bin.4 22.608734
SRR492183_bin.1 45.45554
SRR492183_bin.2 2.6764407
SRR492183_bin.3 7.338865