validate_config.py is the first thing the pipeline runs. The Snakemake Snakefile invokes it automatically via subprocess.run at parse time, before Snakemake constructs the DAG, so configuration errors surface immediately with actionable messages rather than cryptic rule failures mid-run. You can also call it manually — for example, in CI before submitting a cluster job — to validate a configuration file without triggering a full pipeline run.
Path to the configuration YAML file to validate. Accepts absolute paths or paths relative to the current working directory. If not supplied, defaults to config.yaml in the current directory. The script also searches relative to the workflow root (rules/scripts/../../) when resolving relative paths.
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Configuration is valid. Sample sheet parsed successfully. |
| 1 | One or more validation errors found. Error report printed to stdout and the process exits with status 1. |
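
The exit codes above make the script easy to gate on from CI or a wrapper. The following is a minimal sketch, not the pipeline's own code: the helper name run_validator is hypothetical, and passing the config path as a single positional argument is an assumption, since the exact command-line form is not shown here.

```python
import subprocess
import sys
from pathlib import Path


def run_validator(script: Path, config: Path) -> bool:
    """Run the validator as a child process; return True when it exits 0."""
    # Assumption: the config path is passed as one positional argument.
    result = subprocess.run(
        [sys.executable, str(script), str(config)],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # The error report is printed to stdout, so relay it unchanged.
        print(result.stdout, end="")
    return result.returncode == 0


# Example use in a CI step (paths are illustrative):
# if not run_validator(Path("rules/scripts/validate_config.py"), Path("config.yaml")):
#     sys.exit(1)
```

Returning a boolean rather than exiting inside the helper leaves the caller free to decide whether a failed validation should abort the job.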
Error Output Format
Errors are grouped into four categories and printed with ANSI colour formatting (suitable for terminals). When errors reference paths under data/, a hint line shows the absolute path that was checked.
Validation Categories
Reference Files
Errors that reference a path and are classified as a file-system issue. Triggered when:
- A global.references.* path (genome FASTA, chromosome sizes, blacklist, GTF, motif database) does not exist on disk.
- The Bowtie2 index prefix has no matching .bt2 or .bt2l files in its parent directory.
- A script: or conda: path referenced in a .smk file cannot be resolved.
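The Bowtie2 index check can be illustrated with a short sketch. This is one plausible way to test for index files next to a prefix, not the script's actual implementation; the function name has_bowtie2_index is hypothetical.

```python
from pathlib import Path


def has_bowtie2_index(prefix: str) -> bool:
    """Return True if any <prefix>.*.bt2 or <prefix>.*.bt2l file exists."""
    prefix_path = Path(prefix)
    parent = prefix_path.parent
    if not parent.is_dir():
        return False
    stem = prefix_path.name
    # Index files look like genome.1.bt2 or genome.rev.1.bt2l, so match on
    # the prefix name followed by a dot and on the final file suffix.
    return any(
        entry.name.startswith(stem + ".") and entry.suffix in {".bt2", ".bt2l"}
        for entry in parent.iterdir()
    )
```

Because a Bowtie2 index is a prefix rather than a single file, the check has to look inside the parent directory instead of testing the prefix path itself.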
Sample Sheet
Errors relating to the TSV file declared at global.samples. Triggered when:
- The sample sheet file cannot be found at any candidate path (relative to the config file directory, workflow root, or CWD).
- The header row is missing required columns: sample, fastq_r1, fastq_r2, replicate, condition.
- A sample ID is empty, contains unsupported characters (only [A-Za-z0-9._-] are allowed), or is duplicated.
- A condition/replicate combination (e.g., control replicate 1) appears more than once.
- FASTQ R1 or R2 paths are missing, identical, or the files cannot be found on disk.
- A control column value references a sample ID that does not appear in the sample sheet.
- The sample sheet has no data rows after the header.
Schema / Keys
Missing required configuration keys discovered by scanning all Snakefile and rules/*.smk files for config["key"] access patterns. Any key accessed in a workflow file must be present in config.yaml; the first missing ancestor key in a nested path is reported.
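To make the scanning idea concrete, here is a minimal sketch that assumes a regular-expression approach; the real script may discover keys differently. The function name find_config_key_paths and both patterns are hypothetical.

```python
import re

# Matches chains such as config["macs2"]["params"] with either quote style.
CONFIG_ACCESS = re.compile(r"""\bconfig((?:\[\s*["'][^"']+["']\s*\])+)""")
# Extracts each quoted key from a matched chain of subscripts.
KEY = re.compile(r"""\[\s*["']([^"']+)["']\s*\]""")


def find_config_key_paths(source: str) -> set[tuple[str, ...]]:
    """Collect every config[...] access in the text as a tuple of keys."""
    paths = set()
    for match in CONFIG_ACCESS.finditer(source):
        paths.add(tuple(KEY.findall(match.group(1))))
    return paths
```

Applied to a line such as threads: config["macs2"]["threads"], the sketch yields the key path ("macs2", "threads"). A purely textual scan like this cannot see keys built dynamically at runtime.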
Parameters
Type and value constraint violations:
| Constraint type | Affected config suffixes |
|---|---|
| Positive integer | threads, mem_mb, trim_front1, trim_front2, length_required, MAPQ, flags, min_length, max_length, max_fragment, upstream, downstream, bin_size |
| Non-negative float | min_frip, min_tss_enr, min_mapping_rate, max_duplicate_rate, qvalue, M |
| Non-empty string or positive number | time, mito_chr, genome_size, feature_types |
Public Functions
The script is structured as a set of composable validation functions, each accepting the parsed config dict and an errors list that is accumulated and reported at the end.
load_config(config_path, errors)
Reads and parses the YAML file at config_path. Appends errors if the file is missing, cannot be parsed as YAML, or its root element is not a mapping. Returns an empty dict on failure so downstream validators can be skipped safely.
validate_required_config_paths(config, required_paths, errors)
Iterates over a list of tuple[str, ...] key paths (e.g., ("macs2", "params", "genome_size")) discovered by scanning .smk files and verifies each is present in the parsed config. Skips child keys whose parent has already been reported as missing to avoid cascading noise.
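The ancestor-first reporting described above can be sketched as follows. This is an illustrative reimplementation under assumptions, not the script's code: the name check_required_paths and the error message wording are hypothetical.

```python
def check_required_paths(config, required_paths, errors):
    """Report the first missing ancestor of each required key path once."""
    reported = set()  # key-path prefixes already reported as missing
    for path in sorted(required_paths):
        node = config
        for depth, key in enumerate(path, start=1):
            prefix = path[:depth]
            if prefix in reported:
                break  # an ancestor was already reported; skip the child
            if not isinstance(node, dict) or key not in node:
                reported.add(prefix)
                errors.append("Missing required config key: " + ".".join(prefix))
                break
            node = node[key]
```

With a config that lacks a fastp section entirely, requiring both ("fastp", "threads") and ("fastp", "params", "x") produces a single error for fastp rather than one per descendant.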
validate_scalar_config_values(config, errors)
Walks the entire config tree recursively and checks every key whose name matches a known suffix (e.g., any key named threads anywhere in the config) against its expected type and value constraint. Reports errors for non-positive integers, negative floats, or empty strings where they are not allowed.
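A recursive walk of this kind might look like the sketch below. It covers only a subset of the suffixes from the Parameters table, and the function name check_scalars and the message wording are hypothetical; the real validator may treat lists, strings, and numeric coercion differently.

```python
# Illustrative subsets of the suffixes listed in the Parameters table.
POSITIVE_INT_KEYS = {"threads", "mem_mb", "MAPQ", "bin_size"}
NON_NEGATIVE_FLOAT_KEYS = {"min_frip", "qvalue"}


def check_scalars(node, errors, trail=()):
    """Walk nested dicts and check keys whose names match a known suffix."""
    if not isinstance(node, dict):
        return
    for key, value in node.items():
        here = trail + (str(key),)
        label = ".".join(here)
        if isinstance(value, dict):
            check_scalars(value, errors, here)
        elif key in POSITIVE_INT_KEYS:
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                errors.append(f"{label}: expected a positive integer, got {value!r}")
        elif key in NON_NEGATIVE_FLOAT_KEYS:
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number or value < 0:
                errors.append(f"{label}: expected a non-negative number, got {value!r}")
```

Matching on the key name rather than the full path means a new rule that introduces its own threads key is checked without any change to the validator.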
validate_samples_sheet(config, config_path, root, errors)
Resolves the sample sheet path from global.samples, opens it as a tab-delimited CSV, validates all required columns are present, and iterates every row to check:
- Non-empty, pattern-safe sample IDs
- No duplicate sample IDs or condition/replicate pairs
- Positive-integer replicate values
- FASTQ R1 ≠ R2, both resolvable on disk
- Control references resolve to existing sample IDs
validate_fastp_input_mapping.
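The row-level checks can be sketched with the standard csv module. This is a simplified illustration, not the script's implementation: it works on in-memory text, skips the on-disk FASTQ and control-column checks, and the name check_sample_rows and all message strings are hypothetical.

```python
import csv
import io
import re

REQUIRED_COLUMNS = ("sample", "fastq_r1", "fastq_r2", "replicate", "condition")
SAMPLE_ID = re.compile(r"[A-Za-z0-9._-]+")


def check_sample_rows(tsv_text: str) -> list[str]:
    """Validate header and rows of a tab-delimited sample sheet."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["Missing required columns: " + ", ".join(missing)]
    rows = list(reader)
    if not rows:
        return ["Sample sheet has no data rows"]
    errors, seen_ids, seen_pairs = [], set(), set()
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        def cell(name):
            return (row.get(name) or "").strip()  # short rows yield None
        sample, replicate = cell("sample"), cell("replicate")
        if not SAMPLE_ID.fullmatch(sample):
            errors.append(f"line {line_no}: invalid sample ID {sample!r}")
        elif sample in seen_ids:
            errors.append(f"line {line_no}: duplicate sample ID {sample!r}")
        seen_ids.add(sample)
        if not (re.fullmatch(r"[0-9]+", replicate) and int(replicate) > 0):
            errors.append(f"line {line_no}: replicate must be a positive integer")
        pair = (cell("condition"), replicate)
        if pair in seen_pairs:
            errors.append(f"line {line_no}: duplicate condition/replicate {pair}")
        seen_pairs.add(pair)
        r1, r2 = cell("fastq_r1"), cell("fastq_r2")
        if not r1 or not r2:
            errors.append(f"line {line_no}: FASTQ R1 and R2 are both required")
        elif r1 == r2:
            errors.append(f"line {line_no}: FASTQ R1 and R2 are identical")
    return errors
```

Collecting every problem into a list, rather than stopping at the first, matches the accumulate-then-report pattern the validation functions share.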
validate_conda_environments(root, errors)
Scans every rules/*.smk file for conda: "path/to/env.yaml" directives and verifies each referenced file exists on disk, resolving paths relative to the .smk file’s parent directory (mirroring Snakemake’s own resolution behaviour). This catches missing or misnamed Conda environment YAML files before any rule attempts to build a Conda prefix.
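A scan of this kind could be sketched as below, assuming a regular expression over the raw file text. The name find_missing_conda_envs, the pattern, and the message format are hypothetical, and the sketch only recognises quoted string literals after conda:.

```python
import re
from pathlib import Path

# Matches conda: "envs/x.yaml" on one line or with the path on the next line.
CONDA_DIRECTIVE = re.compile(r"""^\s*conda:\s*["']([^"']+)["']""", re.MULTILINE)


def find_missing_conda_envs(rules_dir: Path) -> list[str]:
    """Return one message per conda: path that does not exist on disk."""
    errors = []
    for smk in sorted(rules_dir.glob("*.smk")):
        for match in CONDA_DIRECTIVE.finditer(smk.read_text()):
            # Resolve relative to the directory holding the .smk file.
            env_path = (smk.parent / match.group(1)).resolve()
            if not env_path.is_file():
                errors.append(f"{smk.name}: conda env not found: {match.group(1)}")
    return errors
```

Resolving against the .smk file's own directory, rather than the current working directory, is what keeps the result independent of where the check is launched from.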