Quality Control

The quality control module removes adapters and trims and filters short-read and long-read data. Since quality control is typically the first step in processing sequencing data, the Toolkit can download the sequencing data directly (see the download flag). This allows the data to be downloaded in parallel on multiple machines, unlike the usual Nextflow mechanism, which downloads input data only on the VM running Nextflow. In addition, the quality control module can filter out human reads and, with Nonpareil, estimates the diversity of the input sequences.

Short Reads

For short reads, the Toolkit can generate only a quality report using fastp. This avoids the additional disk space needed to store quality-controlled reads (see the reportOnly flag in the configuration file below).

Input

-entry wShortReadQualityControl -params-file example_params/qc.yml

Warning

The configuration file shown here is for demonstration and testing purposes only. Parameters that should be used in production can be viewed in the quality control section of one of the YAML files located in the default folder of the Toolkit's GitHub repository.

tempdir: "tmp"
s3SignIn: false
output: "output"
logDir: log
runid: 1
databases: "/mnt/databases"
logLevel: 1
scratch: "/vol/scratch"
steps:
  qc:
    input: "test_data/qc/reads_split.tsv"
    fastp:
      # Example params: " --cut_front --cut_tail --detect_adapter_for_pe  "
      additionalParams:
        fastp: "  "
        reportOnly: false
    filterHuman:
      additionalParams: "  "
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/human_filter.db.20231218v2.gz
          md5sum: cc92c0f926656565b1156d66a0db5a3c
resources:
  highmemLarge:
    cpus: 28
    memory: 230
  highmemMedium:
    cpus: 14
    memory: 113
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1
SAMPLE  READS1  READS2
test1   https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read1_1.fq.gz  https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read2_1.fq.gz
test2   https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read1_1.fq.gz  https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/read2_1.fq.gz
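The sample sheet above can be checked before a run is started. The following is a minimal sketch, assuming the file is tab-separated and begins with the header row shown above; the function name `read_sample_sheet` is hypothetical and not part of the Toolkit.

```python
import csv
import io

# Column names taken from the example sample sheet above.
REQUIRED_COLUMNS = ["SAMPLE", "READS1", "READS2"]


def read_sample_sheet(handle):
    """Parse a tab-separated sample sheet and return one dict per sample.

    Hypothetical helper: it only checks that the expected columns exist
    and that no field is empty.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    samples = []
    for line_number, row in enumerate(reader, start=2):
        for name in REQUIRED_COLUMNS:
            if not row[name]:
                raise ValueError(f"line {line_number}: empty field {name}")
        samples.append(row)
    return samples


# In-memory copy of the example sheet, so the sketch runs without a file.
BASE = "https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/small/"
example_sheet = (
    "SAMPLE\tREADS1\tREADS2\n"
    f"test1\t{BASE}read1_1.fq.gz\t{BASE}read2_1.fq.gz\n"
    f"test2\t{BASE}read1_1.fq.gz\t{BASE}read2_1.fq.gz\n"
)

samples = read_sample_sheet(io.StringIO(example_sheet))
for sample in samples:
    print(sample["SAMPLE"], sample["READS1"], sample["READS2"])
```

To check a real file, pass an open file handle instead of the in-memory example, for instance `open("test_data/qc/reads_split.tsv", newline="")`.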

Output

Fastp

SAMPLE_fastp.json

Contains quality statistics about the raw reads and the quality controlled reads in JSON format.

SAMPLE_fastp_summary_after.tsv

Contains quality statistics about the quality controlled reads in TSV format.

SAMPLE_fastp_summary_before.tsv

Contains quality statistics about the raw reads in TSV format.

SAMPLE_interleaved.qc.fq.gz

Quality-controlled reads.

SAMPLE_report.html

HTML report with plots summarizing the quality of the raw and quality controlled reads.

SAMPLE_unpaired.qc.fq.gz

Unpaired reads whose mate was filtered out during quality control.

SAMPLE_unpaired_summary.tsv

TSV file that contains quality statistics about unpaired reads whose mate was filtered out during quality control.
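The statistics in SAMPLE_fastp.json can be compared before and after quality control with a few lines of Python. The following is a minimal sketch, assuming the report contains a `summary` object with `before_filtering` and `after_filtering` entries that each hold `total_reads` and `total_bases`; check these key names against a report produced by your fastp version. The function name `summarize_fastp` and the numbers in the example report are made up for illustration.

```python
import json


def summarize_fastp(report):
    """Return read and base retention from a parsed fastp JSON report.

    Assumes the keys summary.before_filtering and summary.after_filtering
    exist, each with total_reads and total_bases.
    """
    before = report["summary"]["before_filtering"]
    after = report["summary"]["after_filtering"]
    return {
        "reads_before": before["total_reads"],
        "reads_after": after["total_reads"],
        "fraction_reads_kept": after["total_reads"] / before["total_reads"],
        "fraction_bases_kept": after["total_bases"] / before["total_bases"],
    }


# Made-up report fragment, only to keep the sketch self-contained.
example_report = json.loads(
    """
    {
      "summary": {
        "before_filtering": {"total_reads": 1000, "total_bases": 150000},
        "after_filtering": {"total_reads": 900, "total_bases": 120000}
      }
    }
    """
)

stats = summarize_fastp(example_report)
print(stats)
```

For a real report, load the file with `json.load(open("SAMPLE_fastp.json"))` and pass the result to the function.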

KMC

SAMPLE.[13|21|71].kmc.json

K-mer statistics for k-mers of length 13, 21 and 71.

SAMPLE.[13|21|71].histo.tsv

K-mer frequency table with the columns FREQUENCY, COUNT and SAMPLE. FREQUENCY is the number of times a specific k-mer appears. COUNT is the number of distinct k-mers that occur exactly FREQUENCY times.
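The relationship between FREQUENCY and COUNT can be made concrete with a short calculation: summing COUNT gives the number of distinct k-mers, and summing FREQUENCY times COUNT gives the total number of k-mer occurrences. The following is a minimal sketch, assuming the histogram is tab-separated with a header row naming the three columns; the function name `kmer_totals` and the example values are made up for illustration.

```python
import csv
import io


def kmer_totals(handle):
    """Summarize a k-mer histogram with FREQUENCY, COUNT and SAMPLE columns."""
    distinct = 0    # number of different k-mers
    total = 0       # number of k-mer occurrences
    singletons = 0  # k-mers that occur exactly once
    for row in csv.DictReader(handle, delimiter="\t"):
        frequency = int(row["FREQUENCY"])
        count = int(row["COUNT"])
        distinct += count
        total += frequency * count
        if frequency == 1:
            singletons += count
    return {"distinct": distinct, "total": total, "singletons": singletons}


# Made-up histogram: 500 k-mers seen once, 200 seen twice, 100 seen three times.
example_histogram = (
    "FREQUENCY\tCOUNT\tSAMPLE\n"
    "1\t500\ttest1\n"
    "2\t200\ttest1\n"
    "3\t100\ttest1\n"
)

totals = kmer_totals(io.StringIO(example_histogram))
print(totals)
```

With the made-up values, the table describes 800 distinct k-mers and 1200 k-mer occurrences, of which 500 k-mers are singletons.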

Nanopore Reads

Input

-entry wOntQualityControl -params-file example_params/qcONT.yml

Warning

The configuration file shown here is for demonstration and testing purposes only. Parameters that should be used in production can be viewed in the quality control section of one of the YAML files located in the default folder of the Toolkit's GitHub repository.

tempdir: "tmp"
s3SignIn: false
output: "output"
logDir: log
runid: 1
databases: "/mnt/databases"
logLevel: 1
scratch: "/vol/scratch"
steps:
  qcONT:
    input: "test_data/qcONT/ont.tsv"
    porechop:
      additionalParams:
        chunkSize: 450000
        porechop: ""
        filtlong: " --min_length 1000 --keep_percent 90  "
    filterHumanONT:
      additionalParams: "  "
      database:
        download:
          source: https://openstack.cebitec.uni-bielefeld.de:8080/databases/human_filter.db.20231218v2.gz
          md5sum: cc92c0f926656565b1156d66a0db5a3c
    nanoplot:
      additionalParams: ""
resources:
  highmemLarge:
    cpus: 28
    memory: 230
  highmemMedium:
    cpus: 14
    memory: 113
  large:
    cpus: 28
    memory: 58
  medium:
    cpus: 14
    memory: 29
  small:
    cpus: 7
    memory: 14
  tiny:
    cpus: 1
    memory: 1
SAMPLE  READS
nano    https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/meta_test/SRR16328449_qc.fq.gz

Output

Porechop

SAMPLE_qc.fq.gz

Gzipped quality-controlled reads.

Nanoplot

NanoStats.tsv

Statistics of the output reads, such as quality scores and read length.

plots

NanoPlot offers a variety of plots that show the quality, length and quantity of the reads.

Output

The following output is produced for short and long reads.

Nonpareil

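SAMPLE.npa

See the Nonpareil documentation for a description of this file.

SAMPLE.npc

See the Nonpareil documentation for a description of this file.

SAMPLE.npl

See the Nonpareil documentation for a description of this file.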

SAMPLE_nonpareil_curves.pdf

Nonpareil curves visualize the estimated average coverage for the current sequencing effort.

SAMPLE_nonpareil_index.tsv

Nonpareil statistics including the Nonpareil diversity index.

Columns:

  • SAMPLE: Sample name.

  • C: Average coverage of the entire dataset.

  • diversity: Nonpareil diversity index.

  • LR: Actual sequencing effort of the dataset.

  • LRstar: Sequencing effort required for nearly complete coverage.

  • modelR: Pearson’s R coefficient between the rarefied data and the projected model.

  • kappa: "Redundancy" value of the entire dataset.
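One way to use these columns is to compare LRstar with LR: the ratio indicates how much more sequencing would be needed to reach nearly complete coverage. The following is a minimal sketch, assuming SAMPLE_nonpareil_index.tsv is tab-separated with a header row containing the column names listed above; the function name `effort_ratio` and the example values are made up for illustration.

```python
import csv
import io


def effort_ratio(handle):
    """Map each sample to LRstar / LR.

    A ratio above 1 means the current sequencing effort is below the
    effort estimated for nearly complete coverage.
    """
    ratios = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        ratios[row["SAMPLE"]] = float(row["LRstar"]) / float(row["LR"])
    return ratios


# Made-up values, only to keep the sketch self-contained.
example_index = (
    "SAMPLE\tC\tdiversity\tLR\tLRstar\tmodelR\tkappa\n"
    "test1\t0.8\t17.5\t1000000000\t4000000000\t0.99\t0.9\n"
)

ratios = effort_ratio(io.StringIO(example_index))
for sample, ratio in ratios.items():
    print(f"{sample}: {ratio:.1f}x the current sequencing effort")
```

Because the ratio divides two values of the same column type, it does not depend on the unit in which the sequencing effort is reported.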

Filtered out Human Sequences

SAMPLE_filtered.fq.gz

Sequences without human DNA.

SAMPLE_removed.fq.gz

Sequences that were classified as human DNA.

SAMPLE_summary_[after|before].tsv

Statistics of reads before and after quality control.