Quality Control¶
The quality control module removes adapters and trims and filters short- and long-read data.
Since quality control is typically the first step in the processing of sequencing data, the Toolkit offers a way to
directly download the sequencing data (see the download flag). This allows the data to be downloaded in parallel on multiple machines,
as opposed to the usual Nextflow mechanism of downloading input data only on the VM running Nextflow.
In addition, the quality control module enables the filtering of human reads and, with Nonpareil, provides diversity estimation of input sequences.
Short Reads¶
For short reads, we offer a way to generate only a quality report using fastp. This approach eliminates the need for additional disk space to store quality-controlled reads (see the reportOnly flag in the configuration file below).
Input¶
-entry wShortReadQualityControl -params-file example_params/qc.yml
Warning
The configuration file shown here is for demonstration and testing purposes only.
Parameters that should be used in production can be viewed in the quality control section
of one of the YAML files located in the default
folder of the Toolkit's GitHub repository.
tempdir: "tmp"
s3SignIn: false
output: "output"
logDir: log
runid: 1
databases: "/mnt/databases"
logLevel: 1
scratch: "/vol/scratch"
steps:
  qc:
    input: "test_data/qc/reads_split.tsv"
    fastp:
      # Example params: " --cut_front --cut_tail --detect_adapter_for_pe "
      additionalParams:
        fastp: " "
        reportOnly: false
    filterHuman:
      additionalParams: " "
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/human_filter.db.20231218v2.gz
          md5sum: cc92c0f926656565b1156d66a0db5a3c
resources:
  highmemLarge:
    cpus: 28
    memory: 230
  highmemMedium:
    cpus: 14
    memory: 113
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1
SAMPLE READS1 READS2
test1 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read1_1.fq.gz https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read2_1.fq.gz
test2 https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read1_1.fq.gz https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read2_1.fq.gz
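Before starting a run, it can help to check that the sample sheet is well formed. The following is a minimal Python sketch, not part of the Toolkit, that assumes the sheet is tab-separated and uses the SAMPLE, READS1 and READS2 header shown above. The function name read_sample_sheet is hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ("SAMPLE", "READS1", "READS2")


def read_sample_sheet(text):
    """Parse a tab-separated sample sheet and check its required columns.

    Assumption: the sheet has a header row with SAMPLE, READS1 and READS2.
    Returns a list of dicts, one per sample row.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    rows = list(reader)
    # Line 1 is the header, so data rows start at line 2.
    for line_number, row in enumerate(rows, start=2):
        for col in REQUIRED_COLUMNS:
            if not (row[col] or "").strip():
                raise ValueError(f"line {line_number}: empty value for {col}")
    return rows


if __name__ == "__main__":
    example = (
        "SAMPLE\tREADS1\tREADS2\n"
        "test1\tread1_1.fq.gz\tread2_1.fq.gz\n"
    )
    for row in read_sample_sheet(example):
        print(row["SAMPLE"], row["READS1"], row["READS2"])
```

The check only covers the header and empty fields; it does not verify that the read files or URLs are reachable.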
Output¶
Fastp¶
SAMPLE_fastp.json - Contains quality statistics about the raw reads and the quality-controlled reads in JSON format.
SAMPLE_fastp_summary_after.tsv - Contains quality statistics about the quality-controlled reads in TSV format.
SAMPLE_fastp_summary_before.tsv - Contains quality statistics about the raw reads in TSV format.
SAMPLE_interleaved.qc.fq.gz - Quality-controlled reads.
SAMPLE_report.html - HTML report with plots summarizing the quality of the raw and quality-controlled reads.
SAMPLE_unpaired.qc.fq.gz - Unpaired reads whose mate was filtered out during quality control.
SAMPLE_unpaired_summary.tsv - TSV file that contains quality statistics about unpaired reads whose mate was filtered out during quality control.
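To get a quick sense of how many reads survive quality control, the JSON report can be inspected programmatically. The sketch below assumes the report follows fastp's usual layout, with read counts stored under summary.before_filtering.total_reads and summary.after_filtering.total_reads. The function name read_retention is hypothetical, and the numbers in the example are made up for illustration.

```python
import json


def read_retention(report):
    """Return the fraction of reads kept by fastp.

    Assumption: `report` is the parsed fastp JSON and contains the keys
    summary.before_filtering.total_reads and summary.after_filtering.total_reads.
    """
    summary = report["summary"]
    before = summary["before_filtering"]["total_reads"]
    after = summary["after_filtering"]["total_reads"]
    if before == 0:
        return 0.0
    return after / before


if __name__ == "__main__":
    # Illustrative report fragment; in practice load SAMPLE_fastp.json, e.g.
    #   with open("SAMPLE_fastp.json") as handle:
    #       report = json.load(handle)
    report = json.loads(
        '{"summary": {"before_filtering": {"total_reads": 1000},'
        ' "after_filtering": {"total_reads": 900}}}'
    )
    print(f"retained fraction: {read_retention(report):.2f}")
```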
KMC¶
SAMPLE.[13|21|71].kmc.json - K-mer statistics for k-mers of length 13, 21 and 71.
SAMPLE.[13|21|71].histo.tsv - K-mer frequency table with the columns FREQUENCY, COUNT and SAMPLE. FREQUENCY is the number of times a specific k-mer appears. COUNT is the number of distinct k-mers that occur exactly FREQUENCY times.
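The relationship between FREQUENCY and COUNT can be made concrete with a short Python sketch. It assumes the histogram is tab-separated with a header row naming the FREQUENCY, COUNT and SAMPLE columns. The function name summarize_histogram and the example values are hypothetical.

```python
import csv
import io


def summarize_histogram(text):
    """Summarize a k-mer frequency table.

    Assumption: tab-separated text with FREQUENCY and COUNT columns.
    - distinct:   number of different k-mers (sum of COUNT)
    - total:      number of k-mer occurrences (sum of FREQUENCY * COUNT)
    - singletons: number of k-mers that occur exactly once
    """
    distinct = 0
    total = 0
    singletons = 0
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        frequency = int(row["FREQUENCY"])
        count = int(row["COUNT"])
        distinct += count
        total += frequency * count
        if frequency == 1:
            singletons += count
    return {"distinct": distinct, "total": total, "singletons": singletons}


if __name__ == "__main__":
    example = (
        "FREQUENCY\tCOUNT\tSAMPLE\n"
        "1\t10\ttest1\n"
        "2\t5\ttest1\n"
        "3\t1\ttest1\n"
    )
    print(summarize_histogram(example))
```

In this made-up table, 10 k-mers occur once, 5 occur twice and 1 occurs three times, so there are 16 distinct k-mers and 23 k-mer occurrences in total.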
Nanopore Reads¶
Input¶
-entry wOntQualityControl -params-file example_params/qcONT.yml
Warning
The configuration file shown here is for demonstration and testing purposes only.
Parameters that should be used in production can be viewed in the quality control section
of one of the YAML files located in the default
folder of the Toolkit's GitHub repository.
tempdir: "tmp"
s3SignIn: false
output: "output"
logDir: log
runid: 1
databases: "/mnt/databases"
logLevel: 1
scratch: "/vol/scratch"
steps:
  qcONT:
    input: "test_data/qcONT/ont.tsv"
    porechop:
      additionalParams:
        chunkSize: 450000
        porechop: ""
        filtlong: " --min_length 1000 --keep_percent 90 "
    filterHumanONT:
      additionalParams: " "
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/human_filter.db.20231218v2.gz
          md5sum: cc92c0f926656565b1156d66a0db5a3c
    nanoplot:
      additionalParams: ""
resources:
  highmemLarge:
    cpus: 28
    memory: 230
  highmemMedium:
    cpus: 14
    memory: 113
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1
SAMPLE READS
nano https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/SRR16328449_qc.fq.gz
Output¶
Porechop¶
SAMPLE_qc.fq.gz - Gzipped quality-controlled reads.
Nanoplot¶
NanoStats.tsv - Statistics of the output reads, such as quality scores and read length.
plots - NanoPlot offers a variety of plots that show the quality, length and quantity of the reads.
Output¶
The following output is produced for short and long reads.
Nonpareil¶
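SAMPLE.npa - Nonpareil output file; see the Nonpareil documentation for details.
SAMPLE.npc - Nonpareil output file; see the Nonpareil documentation for details.
SAMPLE.npl - Nonpareil output file; see the Nonpareil documentation for details.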
SAMPLE_nonpareil_curves.pdf - Nonpareil curves visualize the estimated average coverage for the current sequencing effort.
SAMPLE_nonpareil_index.tsv - Nonpareil statistics including the Nonpareil diversity index. Columns:
- SAMPLE: Sample name.
- C: Average coverage of the entire dataset.
- diversity: Nonpareil diversity index.
- LR: Actual sequencing effort of the dataset.
- LRstar: Sequencing effort required for nearly complete coverage.
- modelR: Pearson's R coefficient between the rarefied data and the projected model.
- kappa: "Redundancy" value of the entire dataset.
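One way to use the LR and LRstar columns described above is to estimate how much more sequencing would be needed to reach nearly complete coverage. The sketch below assumes the index file is tab-separated with a header row using the listed column names and that LR and LRstar hold numeric values. The function name effort_ratio and the example values are hypothetical.

```python
import csv
import io


def effort_ratio(text):
    """Return LRstar / LR per sample.

    Assumption: tab-separated text with SAMPLE, LR and LRstar columns.
    A ratio of 4 means roughly four times the current sequencing effort
    would be needed for nearly complete coverage.
    Samples with a non-positive LR are skipped.
    """
    ratios = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        actual = float(row["LR"])
        required = float(row["LRstar"])
        if actual > 0:
            ratios[row["SAMPLE"]] = required / actual
    return ratios


if __name__ == "__main__":
    example = (
        "SAMPLE\tC\tdiversity\tLR\tLRstar\tmodelR\tkappa\n"
        "test1\t0.5\t18.0\t1e9\t4e9\t0.99\t0.6\n"
    )
    for sample, ratio in effort_ratio(example).items():
        print(f"{sample}: {ratio:.1f}x current effort")
```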
Filtered out Human Sequences¶
SAMPLE_filtered.fq.gz - Sequences without human DNA.
SAMPLE_removed.fq.gz - Sequences that were classified as human DNA.
SAMPLE_summary_[after|before].tsv - Statistics of reads before and after quality control.