Dereplication
Input
-entry wDereplication -params-file example_params/dereplication.yml
Warning
The configuration file shown here is for demonstration and testing purposes only.
Parameters suitable for production are listed in the dereplication section
of the YAML files located in the default
folder of the Toolkit's GitHub repository.
output: "output"
summary: false
s3SignIn: false
runid: 1
logLevel: 1
logDir: log
scratch: "/vol/scratch"
publishDirMode: "symlink"
steps:
  dereplication:
    bottomUpClustering:
      input: "test_data/dereplication/input.tsv"
      minimumCompleteness: 0
      maximumContamination: 5000
      ANIBuffer: 20
      mashBuffer: 2000
      method: 'ANI'
      additionalParams:
        mash_sketch: ""
        mash_dist: ""
        # cluster cutoff
        cluster: " -c 0.05 "
        pyani: " -m ANIb "
        representativeAniCutoff: 0.95
    sans:
      additionalParams: " -k 15 -f strict -w 25 -t 400 "
resources:
  highmemLarge:
    cpus: 28
    memory: 230
  highmemMedium:
    cpus: 14
    memory: 113
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1
DATASET BIN_ID PATH COMPLETENESS CONTAMINATION COVERAGE N50 HETEROGENEITY
test3 test3_bin.1 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.1.fa 100 0 10 5000 10
test1 test1_bin.2 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.2.fa 100 0 10 5000 10
test1 test1_bin.8 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.8.fasta 100 0 10 5000 10
test2 test2_bin.9 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.9.fasta 100 0 10 5000 10
test3 test2_bin.10 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.9.fasta 100 0 10 5000 10
test2 test2_bin.32 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/bins/bin.32.fa 100 0 10 5000 10
The input TSV file must include the columns DATASET, BIN_ID, PATH, COMPLETENESS, CONTAMINATION, COVERAGE, N50 and HETEROGENEITY.
COMPLETENESS and CONTAMINATION can be used for filtering (see minimumCompleteness and maximumContamination in the params-file).
N50, COVERAGE and HETEROGENEITY are used for selecting the representative of every cluster.
You can set the values of these columns to zero if the data is not available or if you do not want them to influence the representative selection.
Make sure that BIN_ID is a unique identifier.
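The constraints on the input table can be checked before starting a run. The following is a minimal sketch, not part of the Toolkit: the function names `read_bins` and `filter_bins`, the shortened paths, the second example row and the thresholds 50 and 5 are all made up for illustration. It also assumes that filtering keeps bins whose completeness is at least the minimum and whose contamination is at most the maximum; the Toolkit's exact comparison may differ.

```python
import csv
import io

REQUIRED_COLUMNS = [
    "DATASET", "BIN_ID", "PATH", "COMPLETENESS",
    "CONTAMINATION", "COVERAGE", "N50", "HETEROGENEITY",
]


def read_bins(handle):
    """Read a tab-separated input table and check the documented constraints."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    rows = list(reader)
    seen = set()
    for row in rows:
        # BIN_ID has to be a unique identifier.
        if row["BIN_ID"] in seen:
            raise ValueError("duplicate BIN_ID: " + row["BIN_ID"])
        seen.add(row["BIN_ID"])
    return rows


def filter_bins(rows, minimum_completeness, maximum_contamination):
    """Hypothetical filter; inclusive bounds are an assumption of this sketch."""
    return [
        r for r in rows
        if float(r["COMPLETENESS"]) >= minimum_completeness
        and float(r["CONTAMINATION"]) <= maximum_contamination
    ]


# Illustrative data only: shortened paths, second row invented for the example.
EXAMPLE = (
    "DATASET\tBIN_ID\tPATH\tCOMPLETENESS\tCONTAMINATION\tCOVERAGE\tN50\tHETEROGENEITY\n"
    "test1\ttest1_bin.2\tbins/bin.2.fa\t100\t0\t10\t5000\t10\n"
    "test2\ttest2_bin.9\tbins/bin.9.fasta\t40\t12\t0\t0\t0\n"
)

bins = read_bins(io.StringIO(EXAMPLE))
kept = filter_bins(bins, minimum_completeness=50, maximum_contamination=5)
print([b["BIN_ID"] for b in kept])
```

With the example thresholds, only the first bin passes the filter. The demonstration configuration on this page uses 0 and 5000 instead, which effectively disables filtering.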
Output
The output TSV file (clusters.tsv in the cluster's folder) contains the columns CLUSTER, GENOME and REPRESENTATIVE,
where CLUSTER identifies a group of genomes, GENOME is the path or link of a genome
and REPRESENTATIVE is either 0 or 1 (1 means the genome was selected as representative).
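To work with the result downstream, the representative of each cluster can be extracted from the output table. The sketch below assumes that clusters.tsv is tab-separated and has a header row with the three column names listed above. The function name `representatives` and the example rows are hypothetical and not taken from an actual run.

```python
import csv
import io
from collections import defaultdict


def representatives(handle):
    """Map each CLUSTER to the GENOME entries flagged with REPRESENTATIVE == 1."""
    reps = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["REPRESENTATIVE"].strip() == "1":
            reps[row["CLUSTER"]].append(row["GENOME"])
    return dict(reps)


# Made-up rows for illustration; a real file is produced by the dereplication run.
EXAMPLE_CLUSTERS = (
    "CLUSTER\tGENOME\tREPRESENTATIVE\n"
    "1\tbins/bin.1.fa\t1\n"
    "1\tbins/bin.2.fa\t0\n"
    "2\tbins/bin.9.fasta\t1\n"
)

reps = representatives(io.StringIO(EXAMPLE_CLUSTERS))
print(reps)
```

For a real run, `io.StringIO(EXAMPLE_CLUSTERS)` would be replaced by an open file handle on clusters.tsv.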
If sans is specified in the configuration file (see examples folder), SANS is used to dereplicate the genomes of every cluster reported by the previous step.
The SANS output can be found in the sans folder.