Dereplication

Input

Command | Configuration File | TSV Table

-entry wDereplication -params-file example_params/dereplication.yml

Warning

The configuration file shown here is for demonstration and testing purposes only. For production parameters, see the dereplication section of one of the YAML files in the default folder of the Toolkit's GitHub repository.

output: "output"
s3SignIn: false
runid: 1
logLevel: 1
logDir: log
scratch: "/vol/scratch"
publishDirMode: "symlink"
steps:
  dereplication:
    bottomUpClustering:
      input: "test_data/dereplication/input.tsv"
      minimumCompleteness: 0
      maximumContamination: 5000
      ANIBuffer: 20
      mashBuffer: 2000
      method: 'ANI'
      additionalParams:
        mash_sketch: ""
        mash_dist: ""
        #  cluster cutoff
        cluster: " -c 0.05 "
        pyani: " -m ANIb "
        representativeAniCutoff: 0.95
    sans:
      additionalParams: " -k 15 -f strict -w 25  -t 400 " 
resources:
  highmemLarge:
    cpus: 28
    memory: 230
  highmemMedium:
    cpus: 14
    memory: 113
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1

DATASET BIN_ID  PATH    COMPLETENESS    CONTAMINATION   COVERAGE    N50 HETEROGENEITY
test3   test3_bin.1 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.1.fa    100 0   10  5000    10
test1   test1_bin.2 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.2.fa    100 0   10  5000    10
test1   test1_bin.8 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.8.fasta 100 0   10  5000    10
test2   test2_bin.9 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.9.fasta 100 0   10  5000    10
test3   test2_bin.10    https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.9.fasta 100 0   10  5000    10
test2   test2_bin.32    https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.32.fa   100 0   10  5000    10

The TSV table must include the columns DATASET, BIN_ID, PATH, COMPLETENESS, CONTAMINATION, COVERAGE, N50 and HETEROGENEITY. COMPLETENESS and CONTAMINATION can be used for filtering (see params-file). N50, COVERAGE and HETEROGENEITY are used to select the representative of each cluster. You can set the values of these columns to zero if the data is not available or if you don't want them to influence the representative selection. Make sure that BIN_ID is a unique identifier.
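The column requirements above can be checked before starting a run. The following is a minimal sketch and not part of the Toolkit: the function names `validate_bins_table` and `filter_bins` are hypothetical, and the filter semantics (keep bins with completeness at or above the minimum and contamination at or below the maximum) are an assumption about how `minimumCompleteness` and `maximumContamination` are applied.

```python
import csv
import io

REQUIRED_COLUMNS = [
    "DATASET", "BIN_ID", "PATH", "COMPLETENESS",
    "CONTAMINATION", "COVERAGE", "N50", "HETEROGENEITY",
]


def validate_bins_table(tsv_text):
    """Parse a tab-separated bins table and return its rows.

    Raises ValueError if a required column is missing or a BIN_ID repeats.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")
    rows = list(reader)
    seen = set()
    for row in rows:
        bin_id = row["BIN_ID"]
        if bin_id in seen:
            raise ValueError(f"duplicate BIN_ID: {bin_id}")
        seen.add(bin_id)
    return rows


def filter_bins(rows, minimum_completeness, maximum_contamination):
    """Assumed semantics: keep completeness >= min and contamination <= max."""
    return [
        r for r in rows
        if float(r["COMPLETENESS"]) >= minimum_completeness
        and float(r["CONTAMINATION"]) <= maximum_contamination
    ]
```

To use it, read the table into a string (for example with `open(path).read()`) and pass it to `validate_bins_table`. The returned rows can then be given to `filter_bins` to preview which bins would pass a given pair of thresholds under the assumption stated above.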

Output

The output TSV file (clusters.tsv in the cluster's folder) contains the columns CLUSTER, GENOME and REPRESENTATIVE, where CLUSTER identifies a group of genomes, GENOME is the path or link of a genome, and REPRESENTATIVE is either 0 or 1 (1 means the genome was selected as the representative). If sans is specified in the configuration file (see examples folder), SANS is used to dereplicate the genomes of every cluster reported by the previous step. The SANS output can be found in the sans folder.
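As an illustration of how the cluster table might be consumed downstream, the sketch below collects the representative genome of each cluster. It assumes that clusters.tsv is tab-separated and starts with a header row naming the three columns; the function name `representatives_by_cluster` is hypothetical and not part of the Toolkit.

```python
import csv
import io
from collections import defaultdict


def representatives_by_cluster(tsv_text):
    """Map each CLUSTER id to the GENOME entries with REPRESENTATIVE == 1."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    reps = defaultdict(list)
    for row in reader:
        # REPRESENTATIVE is read as text, so compare against the string "1".
        if row["REPRESENTATIVE"].strip() == "1":
            reps[row["CLUSTER"]].append(row["GENOME"])
    return dict(reps)
```

Cluster ids are kept as strings, exactly as they appear in the file. A cluster with no row flagged as representative does not appear in the result.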