Quality Control

The quality control module removes adapters and trims and filters short- and long-read data. Because quality control is typically the first step in processing sequencing data, the Toolkit can download the sequencing data directly (see the download flag). This allows the data to be downloaded in parallel on multiple machines, rather than only on the VM running Nextflow, which is the usual Nextflow mechanism for input data. In addition, the quality control module can filter out human reads and, using Nonpareil, estimates the diversity of the input sequences.

Short Reads

For short reads, the module can also generate only a quality report using fastp, which avoids the additional disk space needed to store quality-controlled reads (see the reportOnly flag in the configuration file below).

Input

The following three blocks show, in order, the command for short read data, the configuration file, and the TSV table for short reads.

-entry wShortReadQualityControl -params-file example_params/qc.yml

Warning

The configuration file shown here is for demonstration and testing purposes only. Parameters that should be used in production can be viewed in the quality control section of one of the YAML files located in the default folder of the Toolkit's GitHub repository.

tempdir: "tmp"
s3SignIn: false
output: "output"
logDir: log
runid: 1
databases: "/mnt/databases"
logLevel: 1
scratch: "/vol/scratch"
steps:
  qc:
    input: "test_data/qc/reads_split.tsv"
    fastp:
      # Example params: " --cut_front --cut_tail --detect_adapter_for_pe  "
      additionalParams:
        fastp: "  "
        reportOnly: false
    filterHuman:
      additionalParams: "  "
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/human_filter.db.20231218v2.gz
          md5sum: cc92c0f926656565b1156d66a0db5a3c
resources:
  highmemLarge:
    cpus: 28
    memory: 230
  highmemMedium:
    cpus: 14
    memory: 113
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1

SAMPLE  READS1  READS2
test1   https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read1_1.fq.gz  https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read2_1.fq.gz
test2   https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read1_1.fq.gz  https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read2_1.fq.gz
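The TSV table above can be checked before a run with a short script. The following is a minimal sketch, not part of the Toolkit: the function name `read_sample_sheet` is hypothetical, the file names in the example are shortened placeholders, and the uniqueness check on SAMPLE is an assumption rather than a documented requirement.

```python
import csv
import io

# Column names taken from the TSV table shown above.
REQUIRED_COLUMNS = ("SAMPLE", "READS1", "READS2")


def read_sample_sheet(handle):
    """Parse a tab-separated sample sheet and return a list of row dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")

    rows = []
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        cleaned = {c: (row[c] or "").strip() for c in REQUIRED_COLUMNS}
        for column, value in cleaned.items():
            if not value:
                raise ValueError(f"line {line_no}: empty {column}")
        # Assumption: sample names should be unique within one sheet.
        if cleaned["SAMPLE"] in seen:
            raise ValueError(f"line {line_no}: duplicate sample {cleaned['SAMPLE']}")
        seen.add(cleaned["SAMPLE"])
        rows.append(cleaned)
    return rows


# Placeholder file names; a real sheet would contain paths or URLs.
example = (
    "SAMPLE\tREADS1\tREADS2\n"
    "test1\tread1_1.fq.gz\tread2_1.fq.gz\n"
    "test2\tread1_1.fq.gz\tread2_1.fq.gz\n"
)
samples = read_sample_sheet(io.StringIO(example))
print([s["SAMPLE"] for s in samples])
```

To check a file on disk, pass an open file handle instead of the `io.StringIO` object.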

Output

Fastp

SAMPLE_fastp.json: Contains quality statistics about the raw reads and the quality-controlled reads in JSON format.
SAMPLE_fastp_summary_after.tsv: Contains quality statistics about the quality-controlled reads in TSV format.
SAMPLE_fastp_summary_before.tsv: Contains quality statistics about the raw reads in TSV format.
SAMPLE_interleaved.qc.fq.gz: Interleaved quality-controlled reads.
SAMPLE_report.html: HTML report with plots summarizing the quality of the raw and quality-controlled reads.
SAMPLE_unpaired.qc.fq.gz: Unpaired reads whose mate was filtered out during quality control.
SAMPLE_unpaired_summary.tsv: TSV file that contains quality statistics about unpaired reads whose mate was filtered out during quality control.
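As an illustration of how the JSON report can be used, the sketch below computes the fraction of reads that survived quality control. It assumes the report keeps fastp's usual layout, with `total_reads` under `summary.before_filtering` and `summary.after_filtering`. The function name `retained_fraction` and the numbers in the example are made up for demonstration.

```python
import json


def retained_fraction(report):
    """Return the fraction of reads that survived filtering in a fastp report."""
    before = report["summary"]["before_filtering"]["total_reads"]
    after = report["summary"]["after_filtering"]["total_reads"]
    if before == 0:
        raise ValueError("report contains no input reads")
    return after / before


# Tiny made-up report containing only the keys used above.
example = json.loads("""
{"summary": {"before_filtering": {"total_reads": 1000},
             "after_filtering": {"total_reads": 900}}}
""")
fraction = retained_fraction(example)
print(f"{fraction:.1%} of reads retained")
```

For a real run, the report would be loaded with `json.load` from the SAMPLE_fastp.json file of the sample in question.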

KMC

SAMPLE.[13|21|71].kmc.json

K-mer statistics for k-mers of length 13, 21 and 71.

SAMPLE.[13|21|71].histo.tsv

K-mer frequency table with the columns FREQUENCY, COUNT and SAMPLE. FREQUENCY is the number of times a specific k-mer appears. COUNT is the number of distinct k-mers that occur exactly FREQUENCY times.
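The relationship between FREQUENCY and COUNT can be made concrete with a small script. This is a minimal sketch that assumes the file is tab-separated with a header row naming the three columns. The function name `summarize_histogram` and the example values are hypothetical.

```python
import csv
import io


def summarize_histogram(handle):
    """Return (distinct_kmers, total_kmers, singleton_fraction) for one histogram."""
    distinct = 0    # number of different k-mers
    total = 0       # number of k-mer occurrences
    singletons = 0  # k-mers seen exactly once
    for row in csv.DictReader(handle, delimiter="\t"):
        frequency = int(row["FREQUENCY"])
        count = int(row["COUNT"])
        distinct += count
        total += frequency * count
        if frequency == 1:
            singletons += count
    if distinct == 0:
        raise ValueError("histogram is empty")
    return distinct, total, singletons / distinct


# Made-up histogram: 50 k-mers seen once, 30 seen twice, 20 seen three times.
example = (
    "FREQUENCY\tCOUNT\tSAMPLE\n"
    "1\t50\ttest1\n"
    "2\t30\ttest1\n"
    "3\t20\ttest1\n"
)
distinct, total, singleton_fraction = summarize_histogram(io.StringIO(example))
print(distinct, total, singleton_fraction)
```

The totals are only as complete as the table itself: if the highest frequency bin is capped by the k-mer counter, the occurrence total is a lower bound.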

Nanopore Reads

Input

The following three blocks show, in order, the command for nanopore data, the configuration file, and the TSV table for nanopore reads.

-entry wOntQualityControl -params-file example_params/qcONT.yml

Warning

The configuration file shown here is for demonstration and testing purposes only. Parameters that should be used in production can be viewed in the quality control section of one of the YAML files located in the default folder of the Toolkit's GitHub repository.

tempdir: "tmp"
s3SignIn: false
output: "output"
logDir: log
runid: 1
databases: "/mnt/databases"
logLevel: 1
scratch: "/vol/scratch"
steps:
  qcONT:
    input: "test_data/qcONT/ont.tsv"
    porechop:
      additionalParams:
        chunkSize: 450000
        porechop: ""
        filtlong: " --min_length 1000 --keep_percent 90  "
    filterHumanONT:
      additionalParams: "  "
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/human_filter.db.20231218v2.gz
          md5sum: cc92c0f926656565b1156d66a0db5a3c
    nanoplot:
      additionalParams: ""
resources:
  highmemLarge:
    cpus: 28
    memory: 230
  highmemMedium:
    cpus: 14
    memory: 113
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1

SAMPLE  READS
nano    https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/SRR16328449_qc.fq.gz

Output

Porechop

SAMPLE_qc.fq.gz: Gzipped quality-controlled reads.

NanoPlot

NanoStats.tsv: Statistics of the output reads, such as quality scores and read length.
plots: NanoPlot offers a variety of plots that show the quality, length and quantity of the reads.

Output

The following output is produced for short and long reads.

Nonpareil

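SAMPLE.npa: Nonpareil output file; see the Nonpareil documentation for details.
SAMPLE.npc: Nonpareil output file; see the Nonpareil documentation for details.
SAMPLE.npl: Nonpareil output file; see the Nonpareil documentation for details.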
SAMPLE_nonpareil_curves.pdf: Nonpareil curves visualize the estimated average coverage for the current sequencing effort.
SAMPLE_nonpareil_index.tsv: Nonpareil statistics including the Nonpareil diversity index.

Columns:

SAMPLE: Sample name.
C: Average coverage of the entire dataset.
diversity: Nonpareil diversity index.
LR: Actual sequencing effort of the dataset.
LRstar: Sequencing effort required for nearly complete coverage.
modelR: Pearson's R coefficient between the rarefied data and the projected model.
kappa: "Redundancy" value of the entire dataset.
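One way to read these columns together is to compare LRstar with LR, which gives the fold increase in sequencing effort needed to reach nearly complete coverage. The sketch below assumes a tab-separated file with a header row using the column names listed above and numeric values in LR and LRstar. The function name `additional_effort` and all example values are made up.

```python
import csv
import io


def additional_effort(handle):
    """Map each sample to LRstar / LR, the fold increase in sequencing effort."""
    result = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        current = float(row["LR"])      # actual sequencing effort
        target = float(row["LRstar"])   # effort for nearly complete coverage
        if current <= 0:
            raise ValueError(f"non-positive LR for sample {row['SAMPLE']}")
        result[row["SAMPLE"]] = target / current
    return result


# Made-up values for illustration only.
example = (
    "SAMPLE\tC\tdiversity\tLR\tLRstar\tmodelR\tkappa\n"
    "test1\t0.5\t18.2\t1000000.0\t4000000.0\t0.99\t0.6\n"
)
effort = additional_effort(io.StringIO(example))
print(effort)
```

A ratio close to 1 suggests the current sequencing effort already covers most of the community, while larger values indicate how much deeper sequencing would need to go.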

Filtered out Human Sequences

SAMPLE_filtered.fq.gz: Sequences without human DNA.
SAMPLE_removed.fq.gz: Sequences that were classified as human DNA.
SAMPLE_summary_[after|before].tsv: Statistics of reads before and after quality control.