Before Snakemake constructs the DAG or executes a single rule, the pipeline automatically runs `rules/scripts/validate_config.py` against your `config.yaml`. This early validation catches the main classes of misconfiguration (missing reference files, malformed sample sheets, incorrect parameter types, missing conda environment definitions) and surfaces them all at once with categorised, actionable error messages. A clean validation run prints `[CONFIG VALIDATION] OK` and allows the pipeline to proceed; any failure exits with code 1 and prints a formatted error report.
Automatic Invocation
The validator is called unconditionally from the Snakefile's top-level scope, before any rule is parsed or any target file is computed.
Manual Invocation
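You can also run the validator independently without starting the full pipeline, which is useful when editing `config.yaml` or debugging a new tool block. Pass the path to the config file as the first argument; the script reads it from `sys.argv[1]`.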
Exit Codes
| Code | Meaning |
|---|---|
| 0 | All checks passed. Pipeline may proceed. |
| 1 | One or more validation errors found. Inspect the printed report. |
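
The exit codes make the validator easy to drive from a wrapper script. The following is a minimal sketch, not the pipeline's own code: `run_validator` and `DEFAULT_COMMAND` are hypothetical names, and launching the script through `sys.executable` with the config path as its only argument is an assumption based on the description of `sys.argv[1]`.

```python
import subprocess
import sys

# Assumed command line: the script path comes from this page, and the
# config path is passed as the first argument (read via sys.argv[1]).
DEFAULT_COMMAND = [sys.executable, "rules/scripts/validate_config.py", "config.yaml"]


def run_validator(command=None):
    """Run the validator and return (ok, report).

    ok is True when the process exits with code 0 (all checks passed)
    and False for any other exit code, including 1 (errors found).
    """
    result = subprocess.run(
        command or DEFAULT_COMMAND,
        capture_output=True,  # keep the printed report for the caller
        text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr
```

A caller could print `report` and stop early when `ok` is false, which mirrors how the pipeline refuses to proceed after a failed validation.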
Validation Stages
The validator executes eight distinct checks in sequence. Errors are accumulated across every stage and printed together; the run does not stop at the first error.
Config file existence and YAML syntax
Checks that the path supplied as `sys.argv[1]` exists and that its contents can be parsed as valid YAML. A `yaml.YAMLError` is caught and reported as a human-readable message. The config root must be a YAML mapping (object), not a list or scalar.
Required config key discovery
The validator statically analyses the Snakefile and every `.smk` file under `rules/` to discover which config keys the pipeline actually reads. It uses the regular expression `config((?:\[['"][^'"]+['"]\])+)` to extract key paths such as `config['qc_gate']['params']['min_frip']`. Any key path present in a rule file but absent from `config.yaml` is reported as a Schema/Keys error. Parent-key errors suppress child-key errors to avoid noise (if `qc_gate` is missing, the validator does not also report `qc_gate.params.min_frip`).
Scalar type validation
The validator walks the entire config tree and checks the type of values whose key names match known suffixes.
- Positive integers (keys ending in any of the following): `threads`, `mem_mb`, `trim_front1`, `trim_front2`, `length_required`, `MAPQ`, `flags`, `min_length`, `max_length`, `max_fragment`, `upstream`, `downstream`, `bin_size`
- Non-negative floats (keys ending in any of the following): `min_frip`, `min_tss_enr`, `min_mapping_rate`, `max_duplicate_rate`, `qvalue`, `M`
- Non-empty strings (keys named exactly): `time`, `mito_chr`, `genome_size`, `feature_types`
`time` accepts a positive integer or float (minutes) as well as a non-empty string, so both `time: 120` and `time: "2:00:00"` are valid.
Sample sheet validation
The validator locates the sample sheet via `global.samples`, resolves it with an ordered search over base directories (config directory → workflow root → `cwd`), and then validates every row:
- Header check: all five columns from `SAMPLE_COLUMNS = ("sample", "fastq_r1", "fastq_r2", "replicate", "condition")` must be present.
- Sample name regex: each `sample` value must match `^[A-Za-z0-9._-]+$`.
- Duplicate sample IDs: the `sample` column must be globally unique.
- Replicate type: `replicate` must parse as a positive integer.
- Duplicate condition/replicate pairs: the combination of `condition` + `replicate` must be unique.
- FASTQ path existence: both `fastq_r1` and `fastq_r2` are resolved against four bases (sample sheet dir → config dir → workflow root → `cwd`). Missing files are reported individually.
- R1 ≠ R2: the two FASTQ paths must not be identical.
- Control cross-reference: if the optional `control` column is present, every non-`NONE` value must match a `sample` ID that appears elsewhere in the sheet.
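
The row-level checks can be sketched as follows. This is a simplified illustration under stated assumptions, not the validator's actual code: `check_sample_rows` is a hypothetical function, the sheet is assumed to be tab-delimited, and the FASTQ existence and `control` cross-reference checks are left out. Only `SAMPLE_COLUMNS` and the sample name pattern are taken from the description above.

```python
import csv
import io
import re

SAMPLE_COLUMNS = ("sample", "fastq_r1", "fastq_r2", "replicate", "condition")
SAMPLE_NAME_RE = re.compile(r"^[A-Za-z0-9._-]+$")


def check_sample_rows(text, delimiter="\t"):
    """Return a list of error strings for a sample sheet given as text."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    missing = [c for c in SAMPLE_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        # Without the required header there is nothing sensible to check per row.
        return [f"missing columns: {', '.join(missing)}"]

    errors, seen_ids, seen_pairs = [], set(), set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        name = (row["sample"] or "").strip()
        if not SAMPLE_NAME_RE.fullmatch(name):
            errors.append(f"line {line_no}: invalid sample ID {name!r}")
        if name in seen_ids:
            errors.append(f"line {line_no}: duplicate sample ID {name!r}")
        seen_ids.add(name)

        replicate = (row["replicate"] or "").strip()
        if not (replicate.isdecimal() and int(replicate) > 0):
            errors.append(f"line {line_no}: replicate must be a positive integer")

        pair = (row["condition"], replicate)
        if pair in seen_pairs:
            errors.append(f"line {line_no}: duplicate condition/replicate {pair}")
        seen_pairs.add(pair)

        if row["fastq_r1"] == row["fastq_r2"]:
            errors.append(f"line {line_no}: fastq_r1 and fastq_r2 are identical")
    return errors
```

Collecting messages in a list rather than raising on the first problem matches the validator's stated behaviour of reporting all errors together.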
fastp input mapping cross-check
If `fastp.input` is populated in `config.yaml` (it is absent by default, because FASTQ paths come from the sample sheet), the validator checks that the set of sample names in `fastp.input` exactly matches the set in the sample sheet, and that the R1/R2 paths agree between both sources.
Reference file path checks
The validator walks the `global.references` block and validates every file path it finds. The check is triggered for keys matching specific suffixes (`_fa`, `_bed`, `_gtf`, `_index`, `_sizes`, `_db`) or the literal key `blacklist`.
For `bowtie2_index`, existence is checked by globbing for `.bt2` or `.bt2l` files matching the prefix: a directory containing `genome.1.bt2` satisfies the check for prefix `genome`. All other reference paths must resolve to an existing regular file.
Sample sheet config-key usage check
The validator scans every `.smk` file under `rules/` for patterns that treat `config['samples']` as a list of sample names (e.g. `sample = config["samples"]`) rather than as the expected sample-sheet path string. If `global.samples` is a path string in `config.yaml` but any rule file reads it as a list, an error is reported identifying the offending rule files.
Error Categorisation
When validation fails, errors are grouped into four categories and printed with ANSI colour coding:
| Category | Triggered by |
|---|---|
| Reference Files | Messages containing `path not found`, `index prefix`, or `not a file` |
| Sample Sheet | Messages containing `Sample sheet`, `sample ID`, or `FASTQ` |
| Schema/Keys | Messages containing `Missing config key` |
| Parameters | All other messages (type errors, format errors) |
For missing paths under `data/`, the validator also prints an absolute-path hint to help locate the missing file on disk.
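
Substring-based grouping of this kind can be sketched in a few lines. The trigger strings below come from the table; the function name `categorise`, the order in which categories are tried, and the case-sensitive matching are assumptions of this sketch rather than confirmed details of the validator.

```python
# Trigger substrings from the table; the first matching category wins,
# which is an assumption about precedence.
CATEGORY_TRIGGERS = [
    ("Reference Files", ("path not found", "index prefix", "not a file")),
    ("Sample Sheet", ("Sample sheet", "sample ID", "FASTQ")),
    ("Schema/Keys", ("Missing config key",)),
]
FALLBACK_CATEGORY = "Parameters"


def categorise(messages):
    """Group error messages by category, preserving their original order."""
    groups = {name: [] for name, _ in CATEGORY_TRIGGERS}
    groups[FALLBACK_CATEGORY] = []
    for message in messages:
        for name, triggers in CATEGORY_TRIGGERS:
            if any(trigger in message for trigger in triggers):
                groups[name].append(message)
                break
        else:
            # No trigger matched: treat it as a type or format error.
            groups[FALLBACK_CATEGORY].append(message)
    return groups
```

With this precedence, a message that mentions both `FASTQ` and `path not found` would land under Reference Files; the real validator may order the checks differently.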
FASTQ Path Resolution Algorithm
Both the sample sheet validator and the reference path checker use the same multi-base resolution strategy. Relative paths are tried against each base in order, and the first existing match is used. Duplicate bases (for example, when `cwd` equals the config directory) are deduplicated via the `seen` set.
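
A minimal sketch of such a resolver is shown below. It is an illustration only: `resolve_path` is a hypothetical name, the handling of absolute paths is an assumption, and the caller is expected to supply the bases in the order described for each check (for FASTQ files: sample sheet dir, config dir, workflow root, `cwd`).

```python
from pathlib import Path


def resolve_path(raw, bases):
    """Return the first existing candidate for raw, or None if none exists."""
    path = Path(raw).expanduser()
    if path.is_absolute():
        # Assumption: absolute paths bypass the base search entirely.
        return path if path.exists() else None

    seen = set()
    for base in bases:
        base = Path(base).resolve()
        if base in seen:
            # The same directory can appear twice, e.g. when cwd is the
            # config directory; try it only once.
            continue
        seen.add(base)
        candidate = base / path
        if candidate.exists():
            return candidate
    return None
```

Returning `None` instead of raising lets the caller add one error message per missing file and continue with the remaining checks.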