Skip to content

Assembly

Once your raw reads have been quality-controlled, we will now perform an assembly using the Metagenomics-Toolkit. The Toolkit supports several assemblers, one of which is the MEGAHIT assembler, which we will run in this section.

Metagenomics-Toolkit

The following lines represent the part of the configuration that tells the Toolkit to run the assembly:

Assembly Configuration File Snippet 1
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
  assembly:
    megahit:
      # --mem-flag 0 to use minimum memory, --mem-flag 1 (default) moderate memory and --mem-flag 2 all memory.
      # meta-sensitive: '--min-count 1 --k-list 21,29,39,49,...,129,141' 
      # meta-large:  '--k-min  27  --k-max 127 --k-step 10' (large & complex metagenomes, like soil)
      additionalParams: " --min-contig-len 500 --presets meta-sensitive "
      fastg: true
      resources:
        RAM:
          mode: 'PREDICT'
          predictMinLabel: 'highmemLarge'

Task 1

Which tool is used for the assembly? What additional parameters are used?

Solution

MEGAHIT with the additional parameters: minimum contig length of 500 bp and the preset meta-sensitive. meta-sensitive sets the following parameters: --min-count 1 --k-list 21,29,39,49,...,129,141, which causes MEGAHIT to use a longer list of k-mers. MEGAHIT is a single-node assembler for large and complex metagenomics NGS reads, such as soil. It makes use of a succinct de Bruijn graph (SdBG) to achieve a low memory assembly. See the MEGAHIT home page for more info.

Task 2

Copy the following command to run the quality control and assembly module of the Toolkit on your machine

cd ~/mgcourse/

NXF_VER=24.10.4 nextflow run metagenomics/metagenomics-tk \
      -profile standard \
      -params-file https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/default/tutorials/tutorial1/fullpipeline_assembly.yml \
      -ansi-log false \
      -entry wFullPipeline \
      -resume \
      --input.paired.path https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/test_data/tutorials/tutorial1/reads.tsv \
      --logDir logs_assembly

Note that you have used the resume flag this time, which causes Nextflow to reuse the results of the QC analysis. That's why you'll see several processes labeled Cache process on the Toolkit execution screen.

For reference, your complete parameter file looks like this:

Parameter-file
tempdir: "tmp"
summary: false
s3SignIn: false 
input:
  paired:
    path: "test_data/tutorials/tutorial1/reads.tsv"
    watch: false
output: output
logDir: log
runid: 1
databases: "/vol/scratch/databases"
publishDirMode: "symlink"
logLevel: 1
scratch: false 
steps:
  qc:
    fastp:
       # For PE data, the adapter sequence auto-detection is disabled by default since the adapters can be trimmed by overlap analysis. However, you can specify --detect_adapter_for_pe to enable it.
       # For PE data, fastp will run a little slower if you specify the sequence adapters or enable adapter auto-detection, but usually result in a slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
       # -q, --qualified_quality_phred       the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified.
       # --cut_front move a sliding window from front (5') to tail, drop the bases in the window if its mean quality is below cut_mean_quality, stop otherwise.
       # --length_required  reads shorter than length_required will be discarded, default is 15. (int [=15])
       # PE data, the front/tail trimming settings are given with -f, --trim_front1 and -t, --trim_tail1
       additionalParams:
         fastp: " --detect_adapter_for_pe -q 20 --cut_front --trim_front1 3 --cut_tail --trim_tail1 3 --cut_mean_quality 10 --length_required 50 "
         reportOnly: false
       timeLimit: "AUTO"
    nonpareil:
      additionalParams: " -v 10 -r 1234 "
    filterHuman:
      additionalParams: "  "
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/human_filter.db.20231218v2.gz
          md5sum: cc92c0f926656565b1156d66a0db5a3c
  assembly:
    megahit:
      # --mem-flag 0 to use minimum memory, --mem-flag 1 (default) moderate memory and --mem-flag 2 all memory.
      # meta-sensitive: '--min-count 1 --k-list 21,29,39,49,...,129,141' 
      # meta-large:  '--k-min  27  --k-max 127 --k-step 10' (large & complex metagenomes, like soil)
      additionalParams: " --min-contig-len 500 --presets meta-sensitive "
      fastg: true
      resources:
        RAM:
          mode: 'PREDICT'
          predictMinLabel: 'highmemLarge'
resources:
  highmemLarge:
    cpus: 28
    memory: 60
  highmemMedium:
    cpus: 14
    memory: 30
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1

Assembly Results

We will now have a first look at some assembly statistics. First of all, locate your assembly results somewhere in your output directory.

Task 3

Where are the assembly results of MEGAHIT stored? And what files are generated?

Solution

The assembly results are stored in output/data/1/assembly/1.2.1/megahit/

cd ~/mgcourse/
ls -l output/data/1/assembly/1.2.1/megahit/
    data_contigs.fa.gz  # the assembled sequences as gzipped fasta
    data_contigs.fastg  # the assembly graph for inspection for example with Bandage
    data_contigs_stats.tsv  # some assembly statistics

Task 4

Have a look at the data_contigs_stats.tsv file - how large is the assembly and what is the N50?

Solution

cd ~/mgcourse/
cat output/data/1/assembly/1.2.1/megahit/data_contigs_stats.tsv | column -s$'\t' -t
SAMPLE  file                format  type  num_seqs  sum_len   min_len  avg_len  max_len  Q1     Q2      Q3      sum_gap  N50   Q20(%)  Q30(%)  GC(%)
data    data_contigs.fa.gz  FASTA   DNA   16675     25442506  500      1525.8   35033    711.0  1087.0  1769.0  0        1897  0.00    0.00    42.84
In this assembly run, the assembly length is 25,442,506 bp and the N50 is 1,897 bp.

Task 5

Open the results in the EMGB-Browser, find out, how many genes are annotated on the largest contig!

Solution

There are 41 genes annotated on the largest contig. Open the dataset data go to "Contigs", filter for contigs longer than 30kb and inspect the larger one.


➡️ Continue to: Assembly Evaluation