Plasmids¶
The plasmid module identifies contigs as plasmids and also assembles plasmids from the sample's FASTQ data. The module is executed in two parts. In the first part, the contigs of a metagenome assembler are scanned for plasmids. In the second part, a plasmid assembler is used to assemble circular plasmids from the raw reads. All plasmid detection tools are executed on the result of the circular assembly and on the contigs of the metagenome assembler. Only the filtered sequences are used for downstream analysis.
The identification of plasmids is based on the combined result of tools which have a filter property assigned. Results of all tools that have the filter property set to true are combined either by a logical OR or by a logical AND.
Example for the OR and AND operations:
Let's assume that we have three plasmid detection tools (t1, t2, t3) that have four contigs (c1, c2, c3, c4) as input. Let's further assume that c1 and c2 are detected as plasmids by all tools and c3 and c4 are only detected by t1 and t2. By using an AND, only c1 and c2 are finally reported by the module as plasmids. By using an OR, all contigs would be annotated as plasmids.
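The example above can be sketched in a few lines of Python. This is only an illustration of the set logic, not the module's implementation: the function name combine_filters and the dictionary layout are hypothetical, and only tools with filter set to true are assumed to be passed in.

```python
from functools import reduce


def combine_filters(detections, method="AND"):
    """Combine per-tool sets of contigs that were flagged as plasmids.

    detections: dict mapping a tool name to the set of contig ids it flagged
    (hypothetical layout, used for illustration only).
    """
    sets = list(detections.values())
    if not sets:
        return set()
    if method == "AND":
        # Keep contigs that every tool flagged.
        return reduce(set.intersection, sets)
    if method == "OR":
        # Keep contigs that at least one tool flagged.
        return reduce(set.union, sets)
    raise ValueError(f"unknown method: {method}")


# The scenario from the text: t3 only detects c1 and c2.
detections = {
    "t1": {"c1", "c2", "c3", "c4"},
    "t2": {"c1", "c2", "c3", "c4"},
    "t3": {"c1", "c2"},
}
and_result = combine_filters(detections, "AND")
or_result = combine_filters(detections, "OR")
print(sorted(and_result))  # ['c1', 'c2']
print(sorted(or_result))   # ['c1', 'c2', 'c3', 'c4']
```

With AND the result is the intersection of all sets, with OR it is their union, which matches the outcome described in the example.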
It is also possible to simply run a tool without using its result as a filter by setting filter to false. If a tool should not be executed, then the tool section should be removed.
Only the detected plasmids will be used for downstream analysis.
For running a plasmid assembly, we suggest running the full pipeline mode with the plasmids module enabled. See the input example configuration files. The read mapper can be either Bowtie or BWA for Illumina reads and minimap for long reads.
Input¶
-entry wPlasmidsPath -params-file example_params/plasmids.yml
Warning
The configuration file shown here is for demonstration and testing purposes only. Parameters that should be used in production can be viewed in the plasmids section of one of the YAML files located in the default folder of the Toolkit's GitHub repository.
tempdir: "tmp"
summary: false
s3SignIn: true
input:
  paired:
    path: "test_data/fullPipeline/reads_split.tsv"
    watch: false
output: "output"
runid: 1
scratch: "/vol/scratch"
databases: "/mnt/databases/"
logDir: log
publishDirMode: "symlink"
steps:
  qc:
    interleaved: false
    fastp:
      # Example params: " --cut_front --cut_tail --detect_adapter_for_pe "
      additionalParams: " "
      timeLimit: "AUTO"
  assembly:
    megahit:
      fastg: false
      additionalParams: " --min-contig-len 200 "
      resources:
        RAM:
          mode: 'DEFAULT'
          predictMinLabel: 'AUTO'
  binning:
    bowtie:
      additionalParams:
        bowtie: " --quiet --very-sensitive "
        samtoolsView: " -F 3584 "
    contigsCoverage:
      additionalParams: ""
    genomeCoverage:
      additionalParams: " "
    metabat:
      additionalParams: " "
  plasmid:
    SCAPP:
      additionalParams:
        SCAPP: " "
        bowtie: " "
        coverm: " "
        covermONT: " "
        minimap: " "
        samtoolsViewBowtie: " -F 3584 "
        samtoolsViewMinimap: " "
    ViralVerifyPlasmid:
      filter: true
      filterString: "Uncertain - plasmid or chromosomal|Plasmid"
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/pfam-A_35.0.hmm.gz
          md5sum: c80b75bd48ec41760bbca19c70616e36
      additionalParams: " --thr 7 "
    MobTyper:
      filter: true
      minLength: 5000
      additionalParams: " --min_length 9000 "
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/mob_20220929.gz
          md5sum: 21fcaf9c3754a985d1d6875939d71e28
    Platon:
      filter: false
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/platon_20220929.tar.gz
          md5sum: f6d1701704396182c6c9daca053eb9d6
      additionalParams: " "
    PlasClass:
      filter: true
      threshold: 0.5
      additionalParams: " "
    Filter:
      method: "AND"
      minLength: 0
    PLSDB:
      sharedKmerThreshold: 30
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/plasmids_plsdb_20220929.tar.bz2
          md5sum: 13c1078e6cd6a46e3f508c24ca07cc18
      additionalParams:
        mashSketch: " -S 42 -k 21 -s 1000 "
        mashDist: " -v 0.2 -d 0.2 "
resources:
  highmemLarge:
    cpus: 28
    memory: 230
  highmemMedium:
    cpus: 14
    memory: 113
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1
Warning
The configuration file shown here is for demonstration and testing purposes only. Parameters that should be used in production can be viewed in the plasmids section of one of the YAML files located in the default folder of the Toolkit's GitHub repository.
tempdir: "tmp"
summary: false
s3SignIn: false
output: "output"
logDir: log
runid: 1
databases: "/vol/scratch/databases/"
logLevel: 1
scratch: "/vol/scratch"
publishDirMode: "symlink"
steps:
  plasmid:
    input: "test_data/plasmid/input_contigs.tsv"
    ViralVerifyPlasmid:
      filter: true
      filterString: "Uncertain - plasmid or chromosomal|Plasmid"
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/pfam-A_35.0.hmm.gz
          md5sum: c80b75bd48ec41760bbca19c70616e36
      additionalParams: " --thr 7 "
    MobTyper:
      filter: true
      minLength: 5000
      additionalParams: " --min_length 9000 "
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/mob_20220929.gz
          md5sum: 21fcaf9c3754a985d1d6875939d71e28
    Platon:
      filter: false
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/platon_20220929.tar.gz
          md5sum: f6d1701704396182c6c9daca053eb9d6
      additionalParams: " "
    PlasClass:
      filter: true
      threshold: 0.5
      additionalParams: " "
    Filter:
      method: "AND"
      minLength: 0
    PLSDB:
      sharedKmerThreshold: 30
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/plasmids_plsdb_20220929.tar.bz2
          md5sum: 13c1078e6cd6a46e3f508c24ca07cc18
      additionalParams:
        mashSketch: " -S 42 -k 21 -s 1000 "
        mashDist: " -v 0.2 -d 0.2 "
resources:
  highmemLarge:
    cpus: 28
    memory: 230
  highmemMedium:
    cpus: 14
    memory: 113
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1
DATASET BIN_ID PATH
test3 bin.1 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.1.fa
test1 bin.2 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.2.fa
test1 bin.8 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.8.fasta
test2 bin.9 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.9.fasta
test2 bin.32 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.32.fa
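To check such an input table before starting a run, it can be read with a few lines of Python. This is a minimal sketch that assumes the columns DATASET, BIN_ID and PATH are tab-separated, as the .tsv extension suggests; the helper name read_sample_sheet is hypothetical and not part of the Toolkit.

```python
import csv
import io
from collections import defaultdict

# Inline copy of the table above, assuming tab-separated columns.
BASE = "https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/"
SAMPLE_SHEET = (
    "DATASET\tBIN_ID\tPATH\n"
    f"test3\tbin.1\t{BASE}bin.1.fa\n"
    f"test1\tbin.2\t{BASE}bin.2.fa\n"
    f"test1\tbin.8\t{BASE}bin.8.fasta\n"
    f"test2\tbin.9\t{BASE}bin.9.fasta\n"
    f"test2\tbin.32\t{BASE}bin.32.fa\n"
)


def read_sample_sheet(handle):
    """Group (BIN_ID, PATH) pairs by DATASET."""
    reader = csv.DictReader(handle, delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["DATASET"]].append((row["BIN_ID"], row["PATH"]))
    return dict(grouped)


bins_by_dataset = read_sample_sheet(io.StringIO(SAMPLE_SHEET))
for dataset, bins in sorted(bins_by_dataset.items()):
    print(dataset, [bin_id for bin_id, _ in bins])
```

For a file on disk, pass an open file handle instead of the io.StringIO object.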
Databases¶
The plasmid module needs the following databases, each provided as a compressed file:
ViralVerifyPlasmid¶
ViralVerifyPlasmid needs a recent Pfam-A database in .gz format. See the database section for possible download strategies. If you need credentials to access your files via S3, please use the following commands:
nextflow secrets set S3_ViralVerifyPlasmid_ACCESS XXXXXXX
nextflow secrets set S3_ViralVerifyPlasmid_SECRET XXXXXXX
MobTyper¶
The database was generated by gzipping the output of mob_init. See the database section for possible download strategies. If you need credentials to access your files via S3, please use the following commands:
nextflow secrets set S3_MobTyper_ACCESS XXXXXXX
nextflow secrets set S3_MobTyper_SECRET XXXXXXX
Platon¶
The tar-gzipped database for running Platon can be fetched from the Platon GitHub page. See the database section for possible download strategies. If you need credentials to access your files via S3, please use the following commands:
nextflow secrets set S3_Platon_ACCESS XXXXXXX
nextflow secrets set S3_Platon_SECRET XXXXXXX
PLSDB¶
The PLSDB database is available via this link: https://ccb-microbe.cs.uni-saarland.de/plsdb/plasmids/download/plasmids_meta.tar.bz2. All files except .tsv and .msh were deleted from the compressed package. See the database section for possible download strategies. The compressed database must be a tar.bz2 file. If you need credentials to access your files via S3, please use the following commands:
nextflow secrets set S3_PLSDB_ACCESS XXXXXXX
nextflow secrets set S3_PLSDB_SECRET XXXXXXX
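Each database entry in the example configuration carries an md5sum next to its source. When a database is fetched manually, the checksum can be compared with a short Python sketch like the one below. Whether the Toolkit hashes the compressed archive or the extracted content is not stated here, so the sketch simply hashes whatever file path it is given; the function names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected):
    """Compare a file's MD5 digest with the value from the configuration."""
    return md5_of_file(path) == expected.strip().lower()
```

A call such as matches_md5("platon_20220929.tar.gz", "f6d1701704396182c6c9daca053eb9d6") would then return True only if the local file has the checksum listed in the configuration.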
Output¶
SCAPP¶
SCAPP detects plasmid sequences in the sample's assembly graph. It reports sequences as gzipped FASTA files (*_plasmids.fasta.gz). Basic statistics per plasmid (*_plasmids_stats.tsv) and summary statistics over all plasmids (*_plasmids_summary_stats.tsv) are also generated. CoverM coverage metrics are generated for all plasmids. Gene coverage values are generated as part of the annotation module output.
PlasClass¶
PlasClass identifies plasmids by using a statistical model that was built using k-mer frequencies. It reports gzipped FASTA files and their probabilities (*_plasclass.tsv).
MobTyper and Platon¶
MobTyper and Platon both use replicon typing for plasmid detection (*_mobtyper_results.tsv, *_platon.tsv).
ViralVerifyPlasmid¶
ViralVerify applies a Naive Bayes classifier (*_viralverifyplasmid.tsv).
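The filterString value in the example configuration ("Uncertain - plasmid or chromosomal|Plasmid") reads like a pattern with two alternatives. The following Python sketch shows how such a pattern would select prediction labels, assuming it is applied as a case-sensitive regular expression to the classifier's prediction label; how the Toolkit actually applies the string is not documented here, and the label list is illustrative only.

```python
import re

# Value taken from the example configuration.
FILTER_STRING = "Uncertain - plasmid or chromosomal|Plasmid"


def keep_prediction(label, pattern=FILTER_STRING):
    """Return True if the label matches one alternative of the pattern.

    Assumption: the pattern is treated as a regular expression.
    """
    return re.search(pattern, label) is not None


# Illustrative prediction labels, not taken from a real result file.
labels = ["Plasmid", "Chromosome", "Virus", "Uncertain - plasmid or chromosomal"]
kept = [label for label in labels if keep_prediction(label)]
print(kept)
```

Under this assumption only the labels "Plasmid" and "Uncertain - plasmid or chromosomal" pass the filter.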
PLSDB¶
PLSDB includes a curated set of plasmid sequences that were extracted from databases like RefSeq. The metadata of found sequences are reported in *.tsv and the metadata of the filtered sequences in *_kmerThreshold_X.tsv.
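The sharedKmerThreshold setting and the *_kmerThreshold_X.tsv file name suggest that hits are filtered by the number of shared k-mers. A minimal Python sketch of such a filter is shown below. It assumes the threshold is compared against the fifth column of mash dist output (reference, query, distance, p-value, shared/total hashes, tab-separated) and that hits at or above the threshold are kept; the Toolkit's actual rule may differ, and the function names and sample hits are hypothetical.

```python
def shared_hashes(field):
    """Parse the 'shared/total' hashes column, e.g. '950/1000' -> 950."""
    shared, _total = field.split("/")
    return int(shared)


def filter_hits(lines, threshold=30):
    """Keep rows whose shared hash count is at least the threshold.

    Assumption: rows follow the five-column mash dist layout.
    """
    kept = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if shared_hashes(cols[4]) >= threshold:
            kept.append(cols)
    return kept


# Made-up hits for illustration only.
hits = [
    "plasmidA\tcontig_1\t0.01\t0\t950/1000",
    "plasmidB\tcontig_2\t0.2\t0.001\t12/1000",
]
filtered = filter_hits(hits, threshold=30)
print([row[0] for row in filtered])
```

With a threshold of 30, only the first hit (950 shared hashes) is kept.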