Global Pipeline Configuration
Global parameter settings
- `tempdir`: Temporary directory used to collect intermediate files.
- `summary`: If true, a summary folder is created that stores the combined results of all samples.
- `output`: Output directory for storing pipeline results. If an S3 bucket is specified together with the corresponding S3 credentials (see the S3 Configuration section), the output is written to S3.
- `runid`: The run ID becomes part of the output path and allows you to distinguish between different pipeline configurations that were used for the same dataset.
- `logDir`: Path to a directory that is used to store log files.
- `scratch`: Can be either `false` or a path on a worker node. If a path is set, the Nextflow process in `slurm` mode is executed on the provided path. In standard mode this parameter is ignored.
- `steps`: Specifies the pipeline modules to run. There are two modes: you can either run a single tool of the pipeline or the whole pipeline with different configurations.
- `databases`: Specifies the place where database files are downloaded to. If the `slurm` profile is used and databases should be downloaded, the path should point to a folder that is not shared between the worker nodes (this reduces I/O on the shared folder and thus improves performance). If this parameter is provided, the toolkit will create the specified directory. If all your databases have already been extracted beforehand, you can simply omit this parameter.
- `publishDirMode`: (optional) By default, results are symlinked to the chosen `output` directory. This default can be changed with this parameter; a useful mode is `copy`, which copies results instead of just linking them. The other available modes are documented for Nextflow's `publishDir` directive.
- `skipVersionCheck`: The toolkit is regularly tested against a set of Nextflow versions. Setting `--skipVersionCheck` allows you to use the toolkit with Nextflow versions that were not tested.
- `s3SignIn`: If your input data (not the databases) is not publicly accessible via S3, you have to set the `s3SignIn` parameter to `true`.
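A minimal sketch of these settings in a configuration file might look as follows. The parameter names are the ones described above, while all values are purely illustrative; the `steps` section would in addition hold the configuration of the modules you want to run.
tempdir: /tmp
summary: true
output: output
runid: 1
logDir: log
scratch: /vol/scratch
databases: /vol/scratch/databases
publishDirMode: symlink
skipVersionCheck: false
s3SignIn: false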
S3 Configuration
All module inputs and outputs can be used in conjunction with S3. If you want to set a custom S3 configuration (e.g. a custom S3 endpoint), you have to provide the AWS client settings in an additional configuration file that is passed to Nextflow via `-c`.
Example:
aws {
    client {
        s3PathStyleAccess = true
        connectionTimeout = 120000
        maxParallelTransfers = 28
        maxErrorRetry = 10
        protocol = 'HTTPS'
        endpoint = 'https://openstack.cebitec.uni-bielefeld.de:8080'
        signerOverride = 'AWSS3V4SignerType'
    }
}
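Assuming the snippet above is saved in a file named `aws.config` (the file name is only an example), it can be passed to a run via the `-c` option in addition to your usual command line:
nextflow run <pipeline> -c aws.config
where `<pipeline>` stands for the pipeline entry point you normally use.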
In addition, you have to set Nextflow secrets with the following keys:
nextflow secrets set S3_ACCESS xxxxxxxxx
nextflow secrets set S3_SECRET xxxxxxxxx
`S3_ACCESS` corresponds to the AWS S3 access key ID and `S3_SECRET` to the AWS S3 secret access key.
If your input data (not the databases) is publicly available, set `s3SignIn: false` in your config file.
Please note that for using databases you have to set additional credentials (see the database section).
Configuration of Computational Resources used for Pipeline Runs
The toolkit uses the following machine types (flavors) for running tools. All flavors can optionally be adjusted by modifying their `cpus` and `memory` (in GB) parameters. If, for example, the largest flavor is not available in your infrastructure, the `cpus` and `memory` parameters can be reduced to fit the `highmemMedium` flavor. If larger flavors are available, it makes sense in particular to increase the `cpus` and `memory` values of the `large` flavor, to speed up, for example, assembly and read mapping.
Example Configuration:
resources:
highmemLarge:
cpus: 28
memory: 230
highmemMedium:
cpus: 14
memory: 113
large:
cpus: 28
memory: 58
medium:
cpus: 14
memory: 29
small:
cpus: 7
memory: 14
tiny:
cpus: 1
memory: 1
Additional flavors can be defined for use by methods that dynamically recompute resource requirements when a tool fails (for example, the assembly module).
Example:
resources:
xlarge:
cpus: 56
memory: 512
highmemLarge:
cpus: 28
memory: 230
highmemMedium:
cpus: 14
memory: 113
large:
cpus: 28
memory: 58
medium:
cpus: 14
memory: 29
small:
cpus: 7
memory: 14
tiny:
cpus: 1
memory: 1
The full pipeline mode is able to predict the memory consumption of some assemblers (see the assembly module). The prediction also considers additional flavors that have been added to the resources section of the configuration file.