Dereplication
Input
-entry wDereplication -params-file example_params/dereplication.yml
Warning
The configuration file shown here is for demonstration and testing purposes only.
For parameters suited to production use, see the dereplication section of one of the YAML files
in the default folder of the Toolkit's GitHub repository.
output: "output"
s3SignIn: false
runid: 1
logLevel: 1
logDir: log
scratch: "/vol/scratch"
publishDirMode: "symlink"
steps:
  dereplication:
    bottomUpClustering:
      input: "test_data/dereplication/input.tsv"
      minimumCompleteness: 0
      maximumContamination: 5000
      ANIBuffer: 20
      mashBuffer: 2000
      method: 'ANI'
      additionalParams:
        mash_sketch: ""
        mash_dist: ""
        # cluster cutoff
        cluster: " -c 0.05 "
        pyani: " -m ANIb "
        representativeAniCutoff: 0.95
    sans:
      additionalParams: " -k 15 -f strict -w 25 -t 400 "
resources:
  highmemLarge:
    cpus: 28
    memory: 230
  highmemMedium:
    cpus: 14
    memory: 113
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1
DATASET BIN_ID PATH COMPLETENESS CONTAMINATION COVERAGE N50 HETEROGENEITY
test3 test3_bin.1 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.1.fa 100 0 10 5000 10
test1 test1_bin.2 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.2.fa 100 0 10 5000 10
test1 test1_bin.8 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.8.fasta 100 0 10 5000 10
test2 test2_bin.9 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.9.fasta 100 0 10 5000 10
test3 test2_bin.10 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.9.fasta 100 0 10 5000 10
test2 test2_bin.32 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.32.fa 100 0 10 5000 10
The input TSV file contains the columns DATASET, BIN_ID, PATH, COMPLETENESS, CONTAMINATION, COVERAGE, N50 and HETEROGENEITY.
COMPLETENESS and CONTAMINATION can be used for filtering (see minimumCompleteness and maximumContamination in the params file). N50, COVERAGE and HETEROGENEITY are used to select the representative of every cluster.
You can set the values of these columns to zero if the data is not available or if you do not want them to influence the representative selection. Make sure that BIN_ID is a unique identifier.
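The column requirements above can be checked before starting the workflow. The following is a minimal sketch and not part of the Toolkit: the function names `validate_input_table` and `filter_by_quality`, the shortened example paths, and the inclusive threshold comparison are all assumptions made for illustration. It also assumes the file is tab-separated with a header row.

```python
import csv
import io

REQUIRED_COLUMNS = [
    "DATASET", "BIN_ID", "PATH", "COMPLETENESS",
    "CONTAMINATION", "COVERAGE", "N50", "HETEROGENEITY",
]


def validate_input_table(handle):
    """Return all rows of a tab-separated input table after basic checks.

    Raises ValueError if a required column is missing or a BIN_ID repeats.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")

    rows = []
    seen = set()
    for row in reader:
        bin_id = row["BIN_ID"]
        if bin_id in seen:
            raise ValueError(f"duplicate BIN_ID: {bin_id}")
        seen.add(bin_id)
        rows.append(row)
    return rows


def filter_by_quality(rows, minimum_completeness, maximum_contamination):
    """Keep rows within both thresholds.

    Assumption: thresholds are inclusive; the Toolkit's own comparison
    may differ.
    """
    return [
        r for r in rows
        if float(r["COMPLETENESS"]) >= minimum_completeness
        and float(r["CONTAMINATION"]) <= maximum_contamination
    ]


# Hypothetical two-row table with shortened paths.
EXAMPLE = (
    "DATASET\tBIN_ID\tPATH\tCOMPLETENESS\tCONTAMINATION\tCOVERAGE\tN50\tHETEROGENEITY\n"
    "test1\ttest1_bin.2\tbins/bin.2.fa\t100\t0\t10\t5000\t10\n"
    "test2\ttest2_bin.9\tbins/bin.9.fasta\t100\t0\t10\t5000\t10\n"
)

rows = validate_input_table(io.StringIO(EXAMPLE))
kept = filter_by_quality(rows, minimum_completeness=0, maximum_contamination=5000)
print(len(rows), len(kept))
```

With the thresholds from the demonstration configuration (0 and 5000), no row of such a table is filtered out, which is consistent with those values being intended for testing only.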
Output
The output TSV file (clusters.tsv in the cluster's folder) contains the columns CLUSTER, GENOME and REPRESENTATIVE, where CLUSTER identifies a group of genomes, GENOME is the path or
link of a genome, and REPRESENTATIVE is either 0 or 1 (1 if the genome was selected as representative).
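As an illustration of how the three columns relate, the sketch below collects the representative genome of every cluster. It is not Toolkit code: the function name `representatives_by_cluster` and the example rows are hypothetical, and it assumes clusters.tsv is tab-separated with a header row naming the columns exactly as described above.

```python
import csv
import io
from collections import defaultdict


def representatives_by_cluster(handle):
    """Map each CLUSTER value to the GENOME entries flagged REPRESENTATIVE == 1."""
    reps = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["REPRESENTATIVE"].strip() == "1":
            reps[row["CLUSTER"]].append(row["GENOME"])
    return dict(reps)


# Hypothetical file content; real cluster identifiers and paths will differ.
EXAMPLE_CLUSTERS = (
    "CLUSTER\tGENOME\tREPRESENTATIVE\n"
    "1\tbins/bin.1.fa\t1\n"
    "1\tbins/bin.2.fa\t0\n"
    "2\tbins/bin.9.fasta\t1\n"
)

reps = representatives_by_cluster(io.StringIO(EXAMPLE_CLUSTERS))
for cluster, genomes in sorted(reps.items()):
    print(cluster, genomes)
```

To read a real result, pass an open file handle instead of the `io.StringIO` object, for example `open("clusters.tsv", newline="")`.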
If sans is specified in the configuration file (see examples folder), then SANS is used to dereplicate the genomes of every cluster that was reported by the previous step.
The SANS output can be found in the sans folder.