Skip to content

Classification

Taxonomic classification tools assign taxonomic labels to reads or assembled contigs of metagenomic datasets. We will perform taxonomic classification of our genome bins using GTDB-Tk.

Genome Taxonomy Database (GTDB)

GTDB is an open-access database that provides a comprehensive taxonomy for bacterial and archaeal genomes. It uses genome-scale phylogenetic analysis to classify microbial genomes into a standardized taxonomic framework. The database is regularly updated with new genomes, ensuring it reflects the latest advancements in microbial taxonomy. GTDB-Tk is a software toolkit that enables users to classify their own genomic or metagenomic data against the GTDB taxonomy. It uses metrics like average nucleotide identity (ANI) and other phylogenetic markers to determine the taxonomic placement of query genomes. This tool is particularly useful for researchers who want to assign taxonomic labels to their sequences within a consistent and widely-accepted framework, facilitating better understanding of microbial diversity and evolution.

See the GTDB-Tk homepage for more information.

Next, let's assign taxonomic labels to our binning results using GTDB-Tk. The following snippet represents the Toolkit configuration for the classification part of the MagAttributes module:

Classification Configuration File Snippet 1
1
2
3
4
5
6
7
    gtdb:
      buffer: 1000
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/gtdbtk_r214_data.tar.gz
          md5sum: 390e16b3f7b0c4463eb7a3b2149261d9
      additionalParams: " --min_af 0.65 --scratch_dir . "

Task 1

Run the following command for the classification:

cd ~/mgcourse/

NXF_VER=24.10.4 nextflow run metagenomics/metagenomics-tk \
      -profile standard \
      -params-file https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/default/tutorials/tutorial1/fullpipeline_classification.yml \
      -ansi-log false \
      -entry wFullPipeline \
      -resume \
      --databases $(readlink -f databases) \
      --input.paired.path https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/test_data/tutorials/tutorial1/reads.tsv \
      --logDir logs_classification

For reference, your complete parameter file should look like this:

Parameter-file
tempdir: "tmp"
summary: false
s3SignIn: false
input:
  paired:
    path: "test_data/tutorials/tutorial1/reads.tsv"
    watch: false
output: output
logDir: log
runid: 1
databases: "/vol/scratch/databases"
publishDirMode: "symlink"
logLevel: 1
scratch: false
steps:
  qc:
    fastp:
       # For PE data, the adapter sequence auto-detection is disabled by default since the adapters can be trimmed by overlap analysis. However, you can specify --detect_adapter_for_pe to enable it.
       # For PE data, fastp will run a little slower if you specify the sequence adapters or enable adapter auto-detection, but usually result in a slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
       # -q, --qualified_quality_phred       the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified.
       # --cut_front move a sliding window from front (5') to tail, drop the bases in the window if its mean quality is below cut_mean_quality, stop otherwise.
       # --length_required  reads shorter than length_required will be discarded, default is 15. (int [=15])
       # PE data, the front/tail trimming settings are given with -f, --trim_front1 and -t, --trim_tail1
       additionalParams:
         fastp: " --detect_adapter_for_pe -q 20 --cut_front --trim_front1 3 --cut_tail --trim_tail1 3 --cut_mean_quality 10 --length_required 50 "
         reportOnly: false
       timeLimit: "AUTO"
    nonpareil:
      additionalParams: " -v 10 -r 1234 "
    filterHuman:
      additionalParams: "  "
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/human_filter.db.20231218v2.gz
          md5sum: cc92c0f926656565b1156d66a0db5a3c
  assembly:
    megahit:
      # --mem-flag 0 to use minimum memory, --mem-flag 1 (default) moderate memory and --mem-flag 2 all memory.
      # meta-sensitive: '--min-count 1 --k-list 21,29,39,49,...,129,141' 
      # meta-large:  '--k-min  27  --k-max 127 --k-step 10' (large & complex metagenomes, like soil)
      additionalParams: " --min-contig-len 500 --presets meta-sensitive "
      fastg: true
      resources:
        RAM:
          mode: 'PREDICT'
          predictMinLabel: 'highmemLarge'
  binning:
    bwa2:
      additionalParams: 
        bwa2: " "
        # samtools flags are used to filter resulting bam file
        samtoolsView: " -F 3584 " 
    contigsCoverage:
      additionalParams: " --min-covered-fraction 0 --min-read-percent-identity 100 --min-read-aligned-percent 100 "
    genomeCoverage:
      additionalParams: " --min-covered-fraction 0 --min-read-percent-identity 100 --min-read-aligned-percent 100 "
    # Primary binning tool
    metabat:
      # Set --seed positive numbers to reproduce the result exactly. Otherwise, random seed will be set each time.
      additionalParams: " --seed 234234  "
  magAttributes:
    checkm2:
      database:
        download:
          source: "https://openstack.cebitec.uni-bielefeld.de:8080/databases/checkm2_v2.tar.gz"
          md5sum: a634cb3d31a1f56f2912b74005f25f09
      additionalParams: "  "
    gtdb:
      buffer: 1000
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/gtdbtk_r214_data.tar.gz
          md5sum: 390e16b3f7b0c4463eb7a3b2149261d9
      additionalParams: " --min_af 0.65 --scratch_dir . "
resources:
  highmemLarge:
    cpus: 28
    memory: 60
  highmemMedium:
    cpus: 14
    memory: 30
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1

The classification took some time, most of it is due to a mash database, that needs to be built before the actual classification. After that, the classification itself was quite fast. The fast execution can be explained by the fast identification of a genome, and thus a taxonomic label, in the GTDB via Mash and ANI. Classification based on marker genes was almost not necessary. In a real metagenome sample, the classification would usually take longer. You can read here about the different steps of the GTDB-Tk classification.

Task 2

Inspect the different columns of the GTDB-Tk output file with the following command to see how many genomes were processed by the ANI screening and how many were identified by the marker gene approach.

cd ~/mgcourse/
cat output/data/1/magAttributes/*/gtdb/data_gtdbtk_generated_combined.tsv  | column -s$'\t' -t | less -S
Solution

For example, the column fastani_ani reports that 9 out of 10 genomes were processed using ANI.

fastani_ani
98.52
97.43
98.4
98.7
99.23
N/A
99.51
98.58
99.67
98.82

Now, of course, you are interested in the classification of your genomes. The results can be affected by missing marker genes or contamination, as indicated on the GTDB-Tk website. You already have an estimate of the completeness and contamination of your genomes from the bin quality sections and can use this information for the next task.

Task 3

Check the classification and BIN_ID column of the gtdb output. What is the classification of genomes that are at least 50% complete and are at most 10% contaminated?

Solution
cd ~/mgcourse/

By executing the following two commands you get the classification and the quality values:

cut -f 1,5 output/data/1/magAttributes/*/gtdb/data_gtdbtk_generated_combined.tsv | column -s$'\t' -t

cut -f2,3,4 output/data/1/magAttributes/*/checkm2/data_checkm2_generated.tsv | column -s$'\t' -t

The output is the following:

BIN_ID          classification
data_bin.1.fa   d__Bacteria;p__Desulfobacterota;c__Desulfovibrionia;o__Desulfovibrionales;f__Desulfovibrionaceae;g__Lawsonia;s__Lawsonia intracellularis
data_bin.10.fa  d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Porphyromonadaceae;g__Porphyromonas;s__Porphyromonas gingivalis
data_bin.2.fa   d__Bacteria;p__Bdellovibrionota;c__Bdellovibrionia;o__Bdellovibrionales;f__Bdellovibrionaceae;g__Bdellovibrio;s__Bdellovibrio bacteriovorus
data_bin.3.fa   d__Bacteria;p__Aquificota;c__Aquificae;o__Aquificales;f__Aquificaceae;g__Aquifex;s__Aquifex aeolicus
data_bin.4.fa   d__Bacteria;p__Bacillota_A;c__Clostridia;o__Tissierellales;f__Peptoniphilaceae;g__Finegoldia;s__Finegoldia magna_H
data_bin.5.fa   d__Bacteria;p__Campylobacterota;c__Campylobacteria;o__Campylobacterales;f__Helicobacteraceae;g__Helicobacter;s__
data_bin.6.fa   d__Bacteria;p__Chlamydiota;c__Chlamydiia;o__Chlamydiales;f__Chlamydiaceae;g__Chlamydophila;s__Chlamydophila pneumoniae
data_bin.7.fa   d__Bacteria;p__Actinomycetota;c__Actinomycetia;o__Mycobacteriales;f__Mycobacteriaceae;g__Mycobacterium;s__Mycobacterium leprae
data_bin.8.fa   d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Wigglesworthia;s__Wigglesworthia glossinidia_B
data_bin.9.fa   d__Bacteria;p__Fusobacteriota;c__Fusobacteriia;o__Fusobacteriales;f__Fusobacteriaceae;g__Fusobacterium;s__Fusobacterium nucleatum

BIN_ID          COMPLETENESS  CONTAMINATION
data_bin.1.fa   16.88         0.0
data_bin.10.fa  24.04         0.04
data_bin.2.fa   12.81         0.12
data_bin.3.fa   20.57         0.01
data_bin.4.fa   52.73         0.3
data_bin.5.fa   11.09         0.0
data_bin.6.fa   81.91         0.46
data_bin.7.fa   20.68         0.75
data_bin.8.fa   85.4          0.28
data_bin.9.fa   39.66         0.62

Based on the output and the comparison of the BIN_ID columns, we can say that the following species could be detected that belong to bins that are at least 50% complete and at most 10% contaminated:

d__Bacteria;p__Bacillota_A;c__Clostridia;o__Tissierellales;f__Peptoniphilaceae;g__Finegoldia;s__Finegoldia magna_H
d__Bacteria;p__Chlamydiota;c__Chlamydiia;o__Chlamydiales;f__Chlamydiaceae;g__Chlamydophila;s__Chlamydophila pneumoniae
d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Wigglesworthia;s__Wigglesworthia glossinidia_B

Task 4

Inspect the results in EMGB - (1) filter for a specific species in the tree on the left side then inspect the MAG tab. (2) Do the same by filtering for a specific MAG (using the filter symbol near the bin id). What do you observe?

Solution

Not all genes are classified the same way as the corresponding MAG. Genes are classified using an LCA with mmseqs taxonomy, MAGs are classified using GTDBtk. If in doubt, the GTDBtk classification is probably more accurate.


➡️ Continue to: Annotation