Assembly
Once your raw reads have been quality-controlled, we will now perform an assembly using the Metagenomics-Toolkit. The Toolkit supports several assemblers, one of which is the MEGAHIT assembler, which we will run in this section.
Metagenomics-Toolkit¶
The following lines represent the part of the configuration that tells the Toolkit to run the assembly:
Assembly Configuration File Snippet 1 | |
---|---|
1 2 3 4 5 6 7 8 9 10 11 |
|
Task 1
Which tool is used for the assembly? What additional parameters are used?
Solution
MEGAHIT with the additional parameters: minimum contig length of 500 bp
and the preset meta-sensitive
.
meta-sensitive
sets the following parameters: --min-count 1 --k-list 21,29,39,49,...,129,141
, which causes
MEGAHIT to use a longer list of k-mers.
MEGAHIT is a single-node assembler for large and complex metagenomics
NGS reads, such as soil. It makes use of a succinct de Bruijn graph
(SdBG) to achieve a low memory assembly. See the
MEGAHIT home page for more
info.
Task 2
Copy the following command to run the quality control and assembly module of the Toolkit on your machine
cd ~/mgcourse/
NXF_VER=24.10.4 nextflow run metagenomics/metagenomics-tk \
-profile standard \
-params-file https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/default/tutorials/tutorial1/fullpipeline_assembly.yml \
-ansi-log false \
-entry wFullPipeline \
-resume \
--input.paired.path https://raw.githubusercontent.com/metagenomics/metagenomics-tk/refs/heads/master/test_data/tutorials/tutorial1/reads.tsv \
--logDir logs_assembly
Note that you have used the resume
flag this time, which causes Nextflow to reuse the results of the QC analysis.
That's why you'll see several processes labeled Cache process on the Toolkit execution screen.
For reference, your complete parameter file looks like this:
Parameter-file
tempdir: "tmp"
summary: false
s3SignIn: false
input:
paired:
path: "test_data/tutorials/tutorial1/reads.tsv"
watch: false
output: output
logDir: log
runid: 1
databases: "/vol/scratch/databases"
publishDirMode: "symlink"
logLevel: 1
scratch: false
steps:
qc:
fastp:
# For PE data, the adapter sequence auto-detection is disabled by default since the adapters can be trimmed by overlap analysis. However, you can specify --detect_adapter_for_pe to enable it.
# For PE data, fastp will run a little slower if you specify the sequence adapters or enable adapter auto-detection, but usually result in a slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
# -q, --qualified_quality_phred the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified.
# --cut_front move a sliding window from front (5') to tail, drop the bases in the window if its mean quality is below cut_mean_quality, stop otherwise.
# --length_required reads shorter than length_required will be discarded, default is 15. (int [=15])
# PE data, the front/tail trimming settings are given with -f, --trim_front1 and -t, --trim_tail1
additionalParams:
fastp: " --detect_adapter_for_pe -q 20 --cut_front --trim_front1 3 --cut_tail --trim_tail1 3 --cut_mean_quality 10 --length_required 50 "
reportOnly: false
timeLimit: "AUTO"
nonpareil:
additionalParams: " -v 10 -r 1234 "
filterHuman:
additionalParams: " "
database:
download:
source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/human_filter.db.20231218v2.gz
md5sum: cc92c0f926656565b1156d66a0db5a3c
assembly:
megahit:
# --mem-flag 0 to use minimum memory, --mem-flag 1 (default) moderate memory and --mem-flag 2 all memory.
# meta-sensitive: '--min-count 1 --k-list 21,29,39,49,...,129,141'
# meta-large: '--k-min 27 --k-max 127 --k-step 10' (large & complex metagenomes, like soil)
additionalParams: " --min-contig-len 500 --presets meta-sensitive "
fastg: true
resources:
RAM:
mode: 'PREDICT'
predictMinLabel: 'highmemLarge'
resources:
highmemLarge:
cpus: 28
memory: 60
highmemMedium:
cpus: 14
memory: 30
large:
cpus: 28
memory: 58
medium:
cpus: 14
memory: 29
small:
cpus: 7
memory: 14
tiny:
cpus: 1
memory: 1
Assembly Results¶
We will now have a first look at some assembly statistics. First of all, locate your assembly results somewhere in your output
directory.
Task 3
Where are the assembly results of MEGAHIT stored? And what files are generated?
Solution
The assembly results are stored in output/data/1/assembly/1.2.1/megahit/
cd ~/mgcourse/
ls -l output/data/1/assembly/1.2.1/megahit/
data_contigs.fa.gz # the assembled sequences as gzipped fasta
data_contigs.fastg # the assembly graph for inspection for example with Bandage
data_contigs_stats.tsv # some assembly statistics
Task 4
Have a look at the data_contigs_stats.tsv
file - how large is the assembly and what is the N50?
Solution
cd ~/mgcourse/
cat output/data/1/assembly/1.2.1/megahit/data_contigs_stats.tsv | column -s$'\t' -t
SAMPLE file format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 Q20(%) Q30(%) GC(%)
data data_contigs.fa.gz FASTA DNA 16675 25442506 500 1525.8 35033 711.0 1087.0 1769.0 0 1897 0.00 0.00 42.84
Task 5
Open the results in the EMGB-Browser, find out, how many genes are annotated on the largest contig!
Solution
There are 41 genes annotated on the largest contig. Open the dataset data
go to "Contigs", filter for contigs longer than 30kb and inspect the larger one.