Plasmids¶

The plasmid module identifies contigs as plasmids and also assembles plasmids from a sample's FASTQ data. The module runs in two parts. In the first part, the contigs of a metagenome assembler are scanned for plasmids. In the second part, a plasmid assembler is used to assemble circular plasmids from raw reads. All plasmid detection tools are executed on the circular assembly result and on the contigs of the metagenome assembler. Only the filtered sequences are used for downstream analysis.

The identification of plasmids is based on the combined result of tools which have a filter property assigned. Results of all tools that have the filter property set to true are combined either by a logical OR or by a logical AND.

Example for the OR and AND operations: Assume that three plasmid detection tools (t1, t2, t3) receive four contigs (c1, c2, c3, c4) as input. Assume further that c1 and c2 are detected as plasmids by all tools, while c3 and c4 are detected only by t1 and t2. With AND, only c1 and c2 are reported by the module as plasmids. With OR, all four contigs are annotated as plasmids.
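The example above can be sketched with plain Python sets. This is only an illustration of the combination logic, not the Toolkit's implementation; the `detections` dictionary and the `combine` function are hypothetical names introduced here.

```python
# Hypothetical detection results: tool name -> contigs the tool calls plasmids.
detections = {
    "t1": {"c1", "c2", "c3", "c4"},
    "t2": {"c1", "c2", "c3", "c4"},
    "t3": {"c1", "c2"},
}


def combine(detections, method):
    """Combine per-tool plasmid calls with a logical AND or OR."""
    sets = list(detections.values())
    if not sets:
        return set()
    if method == "AND":
        # A contig is kept only if every tool reports it.
        return set.intersection(*sets)
    if method == "OR":
        # A contig is kept if at least one tool reports it.
        return set.union(*sets)
    raise ValueError(f"unknown method: {method}")


print(sorted(combine(detections, "AND")))  # ['c1', 'c2']
print(sorted(combine(detections, "OR")))   # ['c1', 'c2', 'c3', 'c4']
```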

It is also possible to run a tool without using its result as a filter by setting filter to false. If a tool should not be executed at all, remove its section from the configuration. Only the detected plasmids are used for downstream analysis.
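The difference between executing a tool and using it as a filter can be sketched as follows. The dictionary mirrors part of the plasmid section of the example configuration; the `split_tools` helper and the `NON_TOOL_SECTIONS` set are assumptions made for this illustration and are not part of the Toolkit.

```python
# Hypothetical excerpt of the "plasmid" configuration section as a Python dict.
plasmid_config = {
    "ViralVerifyPlasmid": {"filter": True},
    "MobTyper": {"filter": True},
    "Platon": {"filter": False},
    "PlasClass": {"filter": True},
    "Filter": {"method": "AND", "minLength": 0},
}

# Sections assumed here not to be detection tools with a filter property.
NON_TOOL_SECTIONS = {"Filter", "SCAPP", "PLSDB", "input"}


def split_tools(config):
    """Return (tools that are executed, tools whose result is used as filter)."""
    executed = [name for name in config if name not in NON_TOOL_SECTIONS]
    filtering = [name for name in executed if config[name].get("filter", False)]
    return executed, filtering


executed, filtering = split_tools(plasmid_config)
print(executed)   # ['ViralVerifyPlasmid', 'MobTyper', 'Platon', 'PlasClass']
print(filtering)  # ['ViralVerifyPlasmid', 'MobTyper', 'PlasClass']
```

In this sketch Platon still runs and produces output, but its calls do not take part in the AND/OR combination.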

To run a plasmid assembly, we suggest running the full pipeline mode with the plasmids module enabled. See the example input configuration files. The read mapper can be either Bowtie or BWA for Illumina reads and minimap for long reads.

Input¶

The following examples are given in this order: the command, a configuration file for the full pipeline mode with plasmid detection, a configuration file for the plasmids module only, and the input TSV table.

-entry wPlasmidsPath -params-file example_params/plasmids.yml

Warning

The configuration file shown here is for demonstration and testing purposes only. Parameters that should be used in production can be found in the plasmids section of one of the YAML files located in the default folder of the Toolkit's GitHub repository.

tempdir: "tmp"
s3SignIn: true
input:
  paired:
    path: "test_data/fullPipeline/reads_split.tsv"
    watch: false
output: "output"
runid: 1
scratch: "/vol/scratch"
databases: "/mnt/databases/"
logDir: log
publishDirMode: "symlink"
steps:
  qc:
    interleaved: false
    fastp:
       # Example params: " --cut_front --cut_tail --detect_adapter_for_pe  "
       additionalParams:
         fastp: "  "
         reportOnly: false
       timeLimit: "AUTO"
  assembly:
    megahit:
      fastg: false
      additionalParams: " --min-contig-len 200 "
      resources:
         RAM: 
            mode: 'DEFAULT'
            predictMinLabel: 'AUTO' 
  binning:
    bowtie:
      additionalParams: 
        bowtie: " --quiet --very-sensitive "
        samtoolsView: " -F 3584 " 
    contigsCoverage:
      additionalParams: ""
    genomeCoverage:
      additionalParams: " "
    metabat:
      additionalParams: "   "
  plasmid:
    SCAPP:
      additionalParams: 
        SCAPP: "  "
        bowtie: "  "
        coverm: "  "
        covermONT: "  "
        minimap: " "
        samtoolsViewBowtie: " -F 3584 " 
        samtoolsViewMinimap: " " 
    ViralVerifyPlasmid:
      filter: true
      filterString: "Uncertain - plasmid or chromosomal|Plasmid"
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/pfam-A_35.0.hmm.gz
          md5sum: c80b75bd48ec41760bbca19c70616e36
      additionalParams: " --thr 7 "
    MobTyper:
      filter: true
      minLength: 5000
      additionalParams: " --min_length 9000  "
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/mob_20220929.gz
          md5sum: 21fcaf9c3754a985d1d6875939d71e28
    Platon:
      filter: false
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/platon_20220929.tar.gz
          md5sum: f6d1701704396182c6c9daca053eb9d6
      additionalParams: "   "
    PlasClass:
      filter: true
      threshold: 0.5 
      additionalParams: "   "
    Filter:
      method: "AND"
      minLength: 0 
    PLSDB:
      sharedKmerThreshold: 30
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/plasmids_plsdb_20220929.tar.bz2
          md5sum: 13c1078e6cd6a46e3f508c24ca07cc18
      additionalParams:
        mashSketch: " -S 42 -k 21 -s 1000 "
        mashDist: " -v 0.2 -d 0.2 "
resources:
  highmemLarge:
    cpus: 28
    memory: 230
  highmemMedium:
    cpus: 14
    memory: 113
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1

Warning

The configuration file shown here is for demonstration and testing purposes only. Parameters that should be used in production can be found in the plasmids section of one of the YAML files located in the default folder of the Toolkit's GitHub repository.

tempdir: "tmp"
s3SignIn: false
output: "output"
logDir: log
runid: 1
databases: "/vol/scratch/databases/"
logLevel: 1
scratch: "/vol/scratch"
publishDirMode: "symlink"
steps:
  plasmid:
    input: "test_data/plasmid/input_contigs.tsv"
    ViralVerifyPlasmid:
      filter: true
      filterString: "Uncertain - plasmid or chromosomal|Plasmid"
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/pfam-A_35.0.hmm.gz
          md5sum: c80b75bd48ec41760bbca19c70616e36
      additionalParams: " --thr 7 "
    MobTyper:
      filter: true
      minLength: 5000
      additionalParams: " --min_length 9000  "
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/mob_20220929.gz
          md5sum: 21fcaf9c3754a985d1d6875939d71e28
    Platon:
      filter: false
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/platon_20220929.tar.gz
          md5sum: f6d1701704396182c6c9daca053eb9d6
      additionalParams: "   "
    PlasClass:
      filter: true
      threshold: 0.5 
      additionalParams: "   "
    Filter:
      method: "AND"
      minLength: 0 
    PLSDB:
      sharedKmerThreshold: 30
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/plasmids_plsdb_20220929.tar.bz2
          md5sum: 13c1078e6cd6a46e3f508c24ca07cc18
      additionalParams:
        mashSketch: " -S 42 -k 21 -s 1000 "
        mashDist: " -v 0.2 -d 0.2 "

resources:
  highmemLarge:
    cpus: 28
    memory: 230
  highmemMedium:
    cpus: 14
    memory: 113
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1

DATASET BIN_ID  PATH
test3   bin.1   https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.1.fa
test1   bin.2   https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.2.fa
test1   bin.8   https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.8.fasta
test2   bin.9   https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.9.fasta
test2   bin.32  https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.32.fa
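As a quick sanity check before a run, a table like the one above can be read with Python's csv module. This is a minimal sketch that assumes the columns are tab-separated; `read_input_table` is a hypothetical helper, and the rows are copied from the table above.

```python
import csv
import io
from collections import defaultdict

# Three rows copied from the table above; tab separation is an assumption.
tsv_text = (
    "DATASET\tBIN_ID\tPATH\n"
    "test3\tbin.1\thttps://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.1.fa\n"
    "test1\tbin.2\thttps://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.2.fa\n"
    "test1\tbin.8\thttps://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.8.fasta\n"
)


def read_input_table(handle):
    """Group (BIN_ID, PATH) pairs by DATASET."""
    by_dataset = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        by_dataset[row["DATASET"]].append((row["BIN_ID"], row["PATH"]))
    return dict(by_dataset)


table = read_input_table(io.StringIO(tsv_text))
print(sorted(table))        # ['test1', 'test3']
print(len(table["test1"]))  # 2
```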

Databases¶

The plasmid module needs the following compressed database file formats:

ViralVerifyPlasmid¶

ViralVerifyPlasmid needs a recent Pfam-A database in .gz format. See the database section for possible download strategies. If you need credentials to access your files via S3, please use the following commands:

nextflow secrets set S3_ViralVerifyPlasmid_ACCESS XXXXXXX
nextflow secrets set S3_ViralVerifyPlasmid_SECRET XXXXXXX

MobTyper¶

The database was generated by gzipping the output of mob_init. See the database section for possible download strategies. If you need credentials to access your files via S3, please use the following commands:

nextflow secrets set S3_MobTyper_ACCESS XXXXXXX
nextflow secrets set S3_MobTyper_SECRET XXXXXXX

Platon¶

The tar-gzipped database for running Platon can be fetched from the Platon GitHub page. See the database section for possible download strategies. If you need credentials to access your files via S3, please use the following commands:

nextflow secrets set S3_Platon_ACCESS XXXXXXX
nextflow secrets set S3_Platon_SECRET XXXXXXX

PLSDB¶

The PLSDB database is available via this link: https://ccb-microbe.cs.uni-saarland.de/plsdb/plasmids/download/plasmids_meta.tar.bz2. All files except .tsv and .msh were deleted from the compressed package. See the database section for possible download strategies. The compressed database must be a tar.bz2 file. If you need credentials to access your files via S3, please use the following commands:

nextflow secrets set S3_PLSDB_ACCESS XXXXXXX
nextflow secrets set S3_PLSDB_SECRET XXXXXXX
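The example configurations pair each database source with an md5sum value. If you host your own copy of a database, a checksum of this kind can be computed and compared with a few lines of Python. The helper names below are hypothetical, and how the Toolkit itself uses the md5sum field is not shown here.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected_md5.lower()
```

For example, `matches_checksum("pfam-A_35.0.hmm.gz", "c80b75bd48ec41760bbca19c70616e36")` would compare a locally downloaded file (the local file name is an assumption) against the value given in the example configuration.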

Output¶

SCAPP¶

SCAPP detects plasmid sequences from the sample's assembly graph. It reports sequences as gzipped FASTA files (*_plasmids.fasta.gz). Basic statistics per plasmid (*_plasmids_stats.tsv) and summary statistics over all plasmids (*_plasmids_summary_stats.tsv) are also generated. CoverM coverage metrics are generated for all plasmids. Gene coverage values are generated as part of the annotation module output.

PlasClass¶

PlasClass identifies plasmids using a statistical model that was built from k-mer frequencies. It reports gzipped FASTA files and the plasmid probabilities of the sequences (*_plasclass.tsv).
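The threshold parameter in the example configuration (0.5) suggests that sequences are selected by their probability. A minimal sketch of such a selection is shown below. It assumes a two-column, tab-separated table (sequence name, probability) without a header and an inclusive comparison; both are assumptions, and the contig names are made up.

```python
import csv
import io


def contigs_above_threshold(handle, threshold=0.5):
    """Return the names of sequences whose probability reaches the threshold."""
    selected = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip empty or malformed lines
        name, probability = row[0], float(row[1])
        if probability >= threshold:
            selected.append(name)
    return selected


# Made-up example content for a *_plasclass.tsv style table.
example = "contig_1\t0.91\ncontig_2\t0.12\ncontig_3\t0.50\n"
print(contigs_above_threshold(io.StringIO(example)))  # ['contig_1', 'contig_3']
```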

MobTyper and Platon¶

MobTyper and Platon both use replicon typing for plasmid detection (*_mobtyper_results.tsv, *_platon.tsv).

ViralVerifyPlasmid¶

ViralVerify applies a Naive Bayes classifier (*_viralverifyplasmid.tsv).
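The filterString in the example configuration ("Uncertain - plasmid or chromosomal|Plasmid") reads like a pattern with two alternatives separated by a pipe. The sketch below treats it as a regular expression applied to a prediction label. Whether the Toolkit evaluates the string in exactly this way is an assumption, and the labels in the loop are illustrative.

```python
import re

# Value taken from the example configuration.
FILTER_STRING = "Uncertain - plasmid or chromosomal|Plasmid"


def is_selected(prediction, filter_string=FILTER_STRING):
    """Return True if the prediction label matches one of the alternatives."""
    return re.search(filter_string, prediction) is not None


for label in ["Plasmid", "Chromosome", "Uncertain - plasmid or chromosomal", "Virus"]:
    print(label, is_selected(label))
```

With this pattern, the labels "Plasmid" and "Uncertain - plasmid or chromosomal" are selected, while "Chromosome" and "Virus" are not.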

PLSDB¶

PLSDB includes a curated set of plasmid sequences that were extracted from databases such as RefSeq. The metadata of the matched sequences is reported in *.tsv, and the metadata of the filtered sequences in *_kmerThreshold_X.tsv.
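The filtering step can be pictured as a threshold on the number of shared k-mers per hit. The sketch below assumes the tab-separated `mash dist` output format (reference, query, distance, p-value, shared hashes such as "800/1000") and assumes that sharedKmerThreshold is compared against the numerator of the last column. The identifiers in the example lines are made up, and the Toolkit's actual filter may differ.

```python
def filter_mash_hits(lines, shared_kmer_threshold=30):
    """Keep hits whose number of shared hashes reaches the threshold."""
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue  # skip lines that do not look like mash dist output
        reference, query, distance, _p_value, shared = fields
        matching, _, _sketch_size = shared.partition("/")
        if int(matching) >= shared_kmer_threshold:
            hits.append((reference, query, float(distance), int(matching)))
    return hits


# Made-up example lines in mash dist format.
example_lines = [
    "NZ_EXAMPLE1\tcontig_1\t0.01\t0\t800/1000\n",
    "NZ_EXAMPLE2\tcontig_2\t0.19\t1e-05\t12/1000\n",
]
print(filter_mash_hits(example_lines))
# [('NZ_EXAMPLE1', 'contig_1', 0.01, 800)]
```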