Assessing Bin Quality
To gauge the quality of our bins, we compute their completeness and contamination values.
Computing Completeness and Contamination using CheckM2
CheckM 1 and 2 provide a set of tools to assess the quality of genomes recovered from isolates, single cells, or metagenomes. The first version relies on collocated sets of genes that are ubiquitous and single-copy within a phylogenetic lineage. The second version uses a machine-learning approach that was trained on all available genomic information, including multi-copy genes, biological pathways and modules. The Metagenomics-Toolkit supports both versions; we will use the second version for this part of the tutorial.
The following lines represent the part of the configuration that tells the Toolkit to run the MagAttributes module that also includes the CheckM2 tool:
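MagAttributes Configuration File Snippet 1
  magAttributes:
    checkm2:
      database:
        download:
          source: "https://openstack.cebitec.uni-bielefeld.de:8080/databases/checkm2_v2.tar.gz"
          md5sum: a634cb3d31a1f56f2912b74005f25f09
      additionalParams: " "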
For reference, your complete parameter file looks like this:
Parameter-file
tempdir: "tmp"
summary: false
s3SignIn: false
input:
  paired:
    path: "test_data/tutorials/tutorial1/reads.tsv"
    watch: false
output: output
logDir: log
runid: 1
databases: "/vol/scratch/databases"
publishDirMode: "symlink"
logLevel: 1
scratch: false
steps:
  qc:
    fastp:
      # For PE data, the adapter sequence auto-detection is disabled by default since the adapters can be trimmed by overlap analysis. However, you can specify --detect_adapter_for_pe to enable it.
      # For PE data, fastp will run a little slower if you specify the sequence adapters or enable adapter auto-detection, but usually result in a slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
      # -q, --qualified_quality_phred the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified.
      # --cut_front move a sliding window from front (5') to tail, drop the bases in the window if its mean quality is below cut_mean_quality, stop otherwise.
      # --length_required reads shorter than length_required will be discarded, default is 15. (int [=15])
      # PE data, the front/tail trimming settings are given with -f, --trim_front1 and -t, --trim_tail1
      additionalParams:
        fastp: " --detect_adapter_for_pe -q 20 --cut_front --trim_front1 3 --cut_tail --trim_tail1 3 --cut_mean_quality 10 --length_required 50 "
        reportOnly: false
      timeLimit: "AUTO"
    nonpareil:
      additionalParams: " -v 10 -r 1234 "
    filterHuman:
      additionalParams: " "
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/human_filter.db.20231218v2.gz
          md5sum: cc92c0f926656565b1156d66a0db5a3c
  assembly:
    megahit:
      # --mem-flag 0 to use minimum memory, --mem-flag 1 (default) moderate memory and --mem-flag 2 all memory.
      # meta-sensitive: '--min-count 1 --k-list 21,29,39,49,...,129,141'
      # meta-large: '--k-min 27 --k-max 127 --k-step 10' (large & complex metagenomes, like soil)
      additionalParams: " --min-contig-len 500 --presets meta-sensitive "
      fastg: true
      resources:
        RAM:
          mode: 'PREDICT'
          predictMinLabel: 'highmemLarge'
  binning:
    bwa2:
      additionalParams:
        bwa2: " "
        # samtools flags are used to filter resulting bam file
        samtoolsView: " -F 3584 "
    contigsCoverage:
      additionalParams: " --min-covered-fraction 0 --min-read-percent-identity 100 --min-read-aligned-percent 100 "
    genomeCoverage:
      additionalParams: " --min-covered-fraction 0 --min-read-percent-identity 100 --min-read-aligned-percent 100 "
    # Primary binning tool
    metabat:
      # Set --seed positive numbers to reproduce the result exactly. Otherwise, random seed will be set each time.
      additionalParams: " --seed 234234 "
  magAttributes:
    checkm2:
      database:
        download:
          source: "https://openstack.cebitec.uni-bielefeld.de:8080/databases/checkm2_v2.tar.gz"
          md5sum: a634cb3d31a1f56f2912b74005f25f09
      additionalParams: " "
resources:
  highmemLarge:
    cpus: 28
    memory: 60
  highmemMedium:
    cpus: 14
    memory: 30
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1
Task 2
Run CheckM2 with the Metagenomics-Toolkit using the following command:
cd ~/mgcourse/
NXF_VER=24.10.4 nextflow run metagenomics/metagenomics-tk \
-profile standard \
-params-file https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/default/tutorials/tutorial1/fullpipeline_bin_quality.yml \
-ansi-log false \
-entry wFullPipeline \
-resume \
--databases $(readlink -f databases) \
--input.paired.path https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/test_data/tutorials/tutorial1/reads.tsv \
--logDir logs_bin_quality
Task 3
Locate the CheckM2 results inside the output directory and find out the completeness and contamination values for all of our bins.
Solution
cd ~/mgcourse
cut -f2,3,4 output/data/1/magAttributes/*/checkm2/data_checkm2_generated.tsv
BIN_ID COMPLETENESS CONTAMINATION
data_bin.1.fa 16.88 0.0
data_bin.10.fa 24.04 0.04
data_bin.2.fa 12.81 0.12
data_bin.3.fa 20.57 0.01
data_bin.4.fa 52.73 0.3
data_bin.5.fa 11.09 0.0
data_bin.6.fa 81.91 0.46
data_bin.7.fa 20.68 0.75
data_bin.8.fa 85.4 0.28
data_bin.9.fa 39.66 0.62
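The table above is ordered by file name, so data_bin.10.fa follows data_bin.1.fa. For the comparison in the next task it can help to rank the bins by completeness instead. The following is a minimal sketch, not part of the Toolkit: the helper name rank_by_completeness is hypothetical, and it assumes tab-separated input with the three columns shown above and a header line.

```shell
# Sort a tab-separated BIN_ID/COMPLETENESS/CONTAMINATION table (read from stdin)
# by descending completeness, keeping the header line on top.
# rank_by_completeness is a hypothetical helper name, not a Toolkit command.
rank_by_completeness() {
  IFS= read -r header                          # consume only the header line
  printf '%s\n' "$header"
  LC_ALL=C sort -t "$(printf '\t')" -k2,2nr    # numeric, descending on column 2
}

# Demo with three rows copied from the output above
printf 'BIN_ID\tCOMPLETENESS\tCONTAMINATION\ndata_bin.1.fa\t16.88\t0.0\ndata_bin.4.fa\t52.73\t0.3\ndata_bin.8.fa\t85.4\t0.28\n' |
  rank_by_completeness
```

With the real results, the same helper could be fed from the command used in the solution: cut -f2,3,4 output/data/1/magAttributes/*/checkm2/data_checkm2_generated.tsv | rank_by_completeness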
Task 4
Compare the CheckM2 completeness results with the genome fraction results from QUAST. Do they match your expectations?
Solution
No. The completeness values are much lower than the genome fraction we observed in the QUAST results would suggest (see Assembly evaluation part).
We will now run QUAST again, but this time we specify the bins and the unbinned contigs instead of the assemblies as input. By doing this, QUAST can tell us which reference genome each bin belongs to.
Task 5
Run metaquast with the individual bins as input:
cd ~/mgcourse/
metaquast.py --threads 28 --gene-finding \
-R genomes/Aquifex_aeolicus_VF5.fna,\
genomes/Bdellovibrio_bacteriovorus_HD100.fna,\
genomes/Chlamydia_psittaci_MN.fna,\
genomes/Chlamydophila_pneumoniae_CWL029.fna,\
genomes/Chlamydophila_pneumoniae_J138.fna,\
genomes/Chlamydophila_pneumoniae_LPCoLN.fna,\
genomes/Chlamydophila_pneumoniae_TW_183.fna,\
genomes/Chlamydophila_psittaci_C19_98.fna,\
genomes/Finegoldia_magna_ATCC_29328.fna,\
genomes/Fusobacterium_nucleatum_ATCC_25586.fna,\
genomes/Helicobacter_pylori_26695.fna,\
genomes/Lawsonia_intracellularis_PHE_MN1_00.fna,\
genomes/Mycobacterium_leprae_TN.fna,\
genomes/Porphyromonas_gingivalis_W83.fna,\
genomes/Wigglesworthia_glossinidia.fna \
-o quast_bins \
-l BIN1,BIN2,BIN3,BIN4,BIN5,BIN6,BIN7,BIN8,BIN9,BIN10,UNBINNED \
output/data/1/binning/*/metabat/data_bin.1.fa \
output/data/1/binning/*/metabat/data_bin.2.fa \
output/data/1/binning/*/metabat/data_bin.3.fa \
output/data/1/binning/*/metabat/data_bin.4.fa \
output/data/1/binning/*/metabat/data_bin.5.fa \
output/data/1/binning/*/metabat/data_bin.6.fa \
output/data/1/binning/*/metabat/data_bin.7.fa \
output/data/1/binning/*/metabat/data_bin.8.fa \
output/data/1/binning/*/metabat/data_bin.9.fa \
output/data/1/binning/*/metabat/data_bin.10.fa \
output/data/1/binning/*/metabat/data_notBinned.fa
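The labels passed with -l have to appear in the same order as the input files. A plain data_bin.*.fa glob would sort data_bin.10.fa before data_bin.2.fa, which is why the command above lists every file explicitly. As an alternative, the two lists could be generated in matching order. This is only a sketch: bin_labels and bin_paths are hypothetical helper names, and the bin count (10) and directory pattern are taken from the command above.

```shell
# Print the comma-separated label list BIN1,...,BIN<n>,UNBINNED.
# The function body runs in a subshell, so its variables stay local.
bin_labels() (
  n=$1
  labels=""
  i=1
  while [ "$i" -le "$n" ]; do
    labels="${labels}BIN${i},"
    i=$((i + 1))
  done
  printf '%sUNBINNED\n' "$labels"
)

# Print the bin FASTA paths in the same numeric order, one per line,
# followed by the unbinned contigs. $2 is the directory (may contain a glob).
bin_paths() (
  n=$1
  dir=$2
  i=1
  while [ "$i" -le "$n" ]; do
    printf '%s/data_bin.%d.fa\n' "$dir" "$i"
    i=$((i + 1))
  done
  printf '%s/data_notBinned.fa\n' "$dir"
)

bin_labels 10
bin_paths 10 'output/data/1/binning/*/metabat'
```

In the metaquast call, the label list could then be given as -l "$(bin_labels 10)" and the file list as an unquoted $(bin_paths 10 'output/data/1/binning/*/metabat'), so that the shell expands the * in each path. This relies on paths without spaces and on the pattern matching exactly one directory.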
Task 6
Now inspect the QUAST report with:
firefox quast_bins/report.html
Solution
Large parts of the contigs have been assigned to the unbinned fraction. The reason for that is the small contig size. MetaBAT2 uses 2500 bp as the default cutoff for contigs that can be assigned to genome bins; all shorter contigs are automatically assigned to the unbinned fraction. This cutoff can be lowered to 1500 bp, but doing so would not improve the results very much here, since so many contigs are smaller than that.
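To check how many contigs fall below such a cutoff, the sequence lengths in a FASTA file can be counted with awk. The sketch below is an illustration only: count_short_contigs is a hypothetical helper name, the 2500 bp default mirrors the cutoff mentioned above, and the demo sequences are toy data.

```shell
# Report how many sequences in a FASTA file are shorter than a cutoff.
# $1: FASTA file, $2: cutoff in bp (defaults to 2500).
# count_short_contigs is a hypothetical helper name.
count_short_contigs() {
  awk -v cutoff="${2:-2500}" '
    /^>/ {                       # header line: close the previous record
      if (seen) { total++; if (len < cutoff) short++ }
      len = 0; seen = 1; next
    }
    { len += length($0) }        # sequence line: add to the current length
    END {
      if (seen) { total++; if (len < cutoff) short++ }
      printf "%d of %d contigs are shorter than %d bp\n", short, total, cutoff
    }
  ' "$1"
}

# Toy demo: two sequences of 4 bp and 12 bp, checked against a 5 bp cutoff
demo=$(mktemp)
printf '>a\nACGT\n>b\nACGTACGTAC\nAC\n' > "$demo"
count_short_contigs "$demo" 5    # prints "1 of 2 contigs are shorter than 5 bp"
rm -f "$demo"
```

Applied to the tutorial data, the helper could be pointed at the unbinned contigs, for example count_short_contigs output/data/1/binning/*/metabat/data_notBinned.fa, provided the pattern matches a single file.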
In summary, three of your bins meet at least the completeness and contamination criteria for medium quality (completeness ≥ 50% and contamination < 10%) according to the Minimum Information about a Metagenome-Assembled Genome (MIMAG) standard. In the next section we will examine their taxonomy.
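The completeness and contamination check can also be applied to the CheckM2 table directly. The sketch below keeps bins with a completeness of at least 50% and a contamination below 10%. The helper name medium_quality is hypothetical, and the input is assumed to be the tab-separated three-column table produced by the cut command in Task 3. It does not test any other MIMAG criteria.

```shell
# Keep the header plus every bin with completeness >= 50 and contamination < 10.
# Input: tab-separated BIN_ID, COMPLETENESS, CONTAMINATION (stdin or file arguments).
# medium_quality is a hypothetical helper name, not a Toolkit command.
medium_quality() {
  awk -F '\t' 'NR == 1 { print; next } $2 >= 50 && $3 < 10' "$@"
}

# Demo with four rows copied from the CheckM2 output shown earlier
printf 'BIN_ID\tCOMPLETENESS\tCONTAMINATION\ndata_bin.1.fa\t16.88\t0.0\ndata_bin.4.fa\t52.73\t0.3\ndata_bin.6.fa\t81.91\t0.46\ndata_bin.9.fa\t39.66\t0.62\n' |
  medium_quality
```

On the demo rows this keeps data_bin.4.fa and data_bin.6.fa. With the real results, the input could come from cut -f2,3,4 output/data/1/magAttributes/*/checkm2/data_checkm2_generated.tsv | medium_quality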