The sample sheet is a tab-separated values (TSV) file that tells the pipeline which samples to process, where their FASTQ files live, and how they are grouped for downstream analysis such as IDR and differential accessibility. It is the authoritative record of your experiment’s design. Every aspect of the sheet (column names, sample identifiers, replicate numbering, and file paths) is validated before the Snakemake DAG is constructed, so errors surface immediately with a clear message rather than as mid-run failures.
Registering the Sample Sheet
Point the pipeline to your sample sheet by setting global.samples in config.yaml. A relative path is checked against several bases, including the location of config.yaml, the workflow root, and the current working directory. The first match wins.
Required Columns
The sample sheet must contain a header row with these five required column names (order does not matter):

| Column | Type | Description |
|---|---|---|
| sample | string | Unique sample identifier |
| fastq_r1 | string | Path to R1 (forward) FASTQ file |
| fastq_r2 | string | Path to R2 (reverse) FASTQ file |
| replicate | positive integer | Replicate number within the condition |
| condition | string | Experimental condition label |

These column names are enforced by validate_config.py.
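As a rough illustration of the header check, the sketch below compares a header row against the five required names from the table. The names REQUIRED_COLUMNS and missing_columns are hypothetical and are not taken from validate_config.py, whose actual implementation may differ.

```python
import csv
import io

# Hypothetical constant; the real validate_config.py may name or store this differently.
REQUIRED_COLUMNS = {"sample", "fastq_r1", "fastq_r2", "replicate", "condition"}


def missing_columns(tsv_text: str) -> set[str]:
    """Return the required column names that are absent from the header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = set(reader.fieldnames or [])
    # Set difference ignores column order and tolerates extra columns such as control.
    return REQUIRED_COLUMNS - header


good = "sample\tfastq_r1\tfastq_r2\treplicate\tcondition\n"
bad = "sample\tfastq_r1\tfastq_r2\tcondition\n"
print(missing_columns(good))  # empty set: every required column is present
print(missing_columns(bad))   # the replicate column is missing
```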
Optional Column
control: Name of the control (input) sample for this sample. Set it to NONE or leave the column absent if no matched control exists. When a control is provided, the validator checks that the referenced control sample ID exists elsewhere in the sheet.
Sample Naming Constraints
Sample identifiers are validated against the regular expression ^[A-Za-z0-9._-]+$. Allowed characters are ASCII letters (A–Z, a–z), digits (0–9), dots (.), underscores (_), and hyphens (-). Spaces, slashes, parentheses, and any other special characters trigger a validation error.
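The naming rule can be tried out with a few lines of Python. This is a minimal sketch that uses the pattern ^[A-Za-z0-9._-]+$ quoted in the validation error description on this page. The helper name is_valid_sample_id is hypothetical, and the use of fullmatch is an assumption about how strictly the validator treats trailing newlines.

```python
import re

# Pattern quoted from this page; the constant name is illustrative only.
SAMPLE_ID_PATTERN = re.compile(r"^[A-Za-z0-9._-]+$")


def is_valid_sample_id(sample_id: str) -> bool:
    """Return True when the identifier uses only the allowed characters."""
    # fullmatch also rejects a trailing newline, which a plain match with $ would accept.
    return SAMPLE_ID_PATTERN.fullmatch(sample_id) is not None


for candidate in ["ctrl_rep1.v2-a", "ctrl rep1", "batch/01", ""]:
    print(repr(candidate), is_valid_sample_id(candidate))
```

Only the first candidate passes. The others contain a space, contain a slash, or are empty.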
Duplicate Detection
The validator enforces two uniqueness rules:
- Duplicate sample IDs: the sample column value must be globally unique across all rows.
- Duplicate condition/replicate pairs: the combination of condition + replicate must be unique. Two different sample names cannot share the same condition and replicate number.
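Both uniqueness rules amount to counting keys across rows. The sketch below shows one way to do that with collections.Counter. The function name find_duplicates and the sample rows are hypothetical, and the validator itself may report these errors differently.

```python
from collections import Counter


def find_duplicates(rows: list[dict[str, str]]) -> tuple[list[str], list[tuple[str, str]]]:
    """Return duplicated sample IDs and duplicated (condition, replicate) pairs."""
    sample_counts = Counter(row["sample"] for row in rows)
    pair_counts = Counter((row["condition"], row["replicate"]) for row in rows)
    dup_samples = sorted(s for s, n in sample_counts.items() if n > 1)
    dup_pairs = sorted(p for p, n in pair_counts.items() if n > 1)
    return dup_samples, dup_pairs


# Placeholder rows: ctrl_1 and ctrl_2 both claim replicate 1 of the ctrl condition.
rows = [
    {"sample": "ctrl_1", "condition": "ctrl", "replicate": "1"},
    {"sample": "ctrl_2", "condition": "ctrl", "replicate": "1"},
    {"sample": "treat_1", "condition": "treat", "replicate": "1"},
]
print(find_duplicates(rows))  # ([], [('ctrl', '1')])
```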
Example Sample Sheet
A sheet for a design with two conditions and two replicates each, with no matched controls, has four data rows and either omits the control column or sets it to NONE in every row.
FASTQ Path Resolution
FASTQ paths in the fastq_r1 and fastq_r2 columns can be absolute or relative. The validator resolves relative paths against four bases in priority order. One of these bases is the workflow root: the repository root, found two directories up from the location of validate_config.py (Path(__file__).parents[2]).
Absolute paths (starting with /) bypass the resolution chain and are used as-is. This is the safest option when running the pipeline from varying working directories.
How the Snakefile Consumes the Sheet
The Snakefile reads the sample sheet once at startup using Python’s csv.DictReader and builds three global structures: the SAMPLES list and the FASTQ_R1 and FASTQ_R2 dictionaries.
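A minimal sketch of that loading step follows. It assumes SAMPLES is a list of sample IDs and that FASTQ_R1 and FASTQ_R2 map each sample ID to a path. The helper load_sheet, the in-memory sheet, and every sample name and path in it are hypothetical placeholders. The actual Snakefile may be organized differently.

```python
import csv
import io

# Hypothetical sheet: two conditions, two replicates each; names and paths are placeholders.
SHEET = (
    "sample\tfastq_r1\tfastq_r2\treplicate\tcondition\n"
    "ctrl_1\tdata/ctrl_1_R1.fastq.gz\tdata/ctrl_1_R2.fastq.gz\t1\tctrl\n"
    "ctrl_2\tdata/ctrl_2_R1.fastq.gz\tdata/ctrl_2_R2.fastq.gz\t2\tctrl\n"
    "treat_1\tdata/treat_1_R1.fastq.gz\tdata/treat_1_R2.fastq.gz\t1\ttreat\n"
    "treat_2\tdata/treat_2_R1.fastq.gz\tdata/treat_2_R2.fastq.gz\t2\ttreat\n"
)


def load_sheet(handle):
    """Read a TSV sample sheet and return (samples, fastq_r1, fastq_r2)."""
    samples, fastq_r1, fastq_r2 = [], {}, {}
    for row in csv.DictReader(handle, delimiter="\t"):
        samples.append(row["sample"])
        fastq_r1[row["sample"]] = row["fastq_r1"]
        fastq_r2[row["sample"]] = row["fastq_r2"]
    return samples, fastq_r1, fastq_r2


# A real workflow would open the file named by global.samples instead of a string.
SAMPLES, FASTQ_R1, FASTQ_R2 = load_sheet(io.StringIO(SHEET))
print(SAMPLES)
```

Keeping the sample order in a list and the paths in dictionaries keyed by sample ID lets a rule look up its inputs from the sample wildcard alone.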
The SAMPLES list drives all expand() calls that generate per-sample target files. FASTQ_R1 and FASTQ_R2 are passed as inputs to the fastp rule. Conditions and replicates from the sheet are also used to build IDR target pairs.
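One plausible way to derive IDR pairs from the condition and replicate columns is sketched below. Pairing every two replicates within a condition is an assumption made for illustration, since the pipeline's actual pairing logic is not shown on this page. The function name idr_pairs and the sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def idr_pairs(rows):
    """Group samples by condition and return every within-condition replicate pair."""
    by_condition = defaultdict(list)
    for row in rows:
        by_condition[row["condition"]].append((int(row["replicate"]), row["sample"]))
    pairs = []
    for condition in sorted(by_condition):
        # Order samples by replicate number so pairs come out deterministically.
        ordered = [sample for _, sample in sorted(by_condition[condition])]
        pairs.extend((condition, a, b) for a, b in combinations(ordered, 2))
    return pairs


# Placeholder rows for two conditions with two replicates each.
rows = [
    {"sample": "ctrl_1", "replicate": "1", "condition": "ctrl"},
    {"sample": "ctrl_2", "replicate": "2", "condition": "ctrl"},
    {"sample": "treat_1", "replicate": "1", "condition": "treat"},
    {"sample": "treat_2", "replicate": "2", "condition": "treat"},
]
print(idr_pairs(rows))
```

A condition with a single replicate yields no pairs under this scheme.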
Common Validation Errors
Sample sheet is missing required columns
The header row does not include one or more of sample, fastq_r1, fastq_r2, replicate, condition. Check for typos or extra whitespace in the header.
Sample contains unsupported characters
The sample value at a given row fails the ^[A-Za-z0-9._-]+$ regex. Remove spaces, slashes, or special characters from the identifier.
Duplicate sample ID
Two rows share the same sample value. Every sample must have a unique name.
Duplicate condition/replicate pair
Two samples share the same condition and replicate number. Assign distinct replicate numbers within each condition.
FASTQ R1/R2 not found
The validator could not locate a FASTQ file at any of the four resolution bases. Verify the path is correct or use an absolute path.
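The lookup behind this error can be pictured as a first-match search over an ordered list of base directories. The sketch below is an assumption about the general shape of that search, not the validator's code. The function resolve_fastq, the directory names, and the choice to check that absolute paths exist are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def resolve_fastq(raw: str, bases: Iterable[Path]) -> Optional[Path]:
    """Return the first existing candidate for raw, or None when nothing matches."""
    path = Path(raw)
    if path.is_absolute():
        # Absolute paths skip the base list entirely.
        return path if path.exists() else None
    for base in bases:
        candidate = base / path
        if candidate.exists():
            return candidate  # first match wins
    return None


# Demonstration with throwaway directories standing in for two resolution bases.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "first").mkdir()
    (root / "second").mkdir()
    (root / "second" / "reads_R1.fastq.gz").touch()
    bases = [root / "first", root / "second"]
    found = resolve_fastq("reads_R1.fastq.gz", bases)
    missing = resolve_fastq("absent.fastq.gz", bases)

print(found.parent.name if found else None)
print(missing)
```

A None result corresponds to the situation reported by this error: no base produced an existing file.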
Control sample not found in sheet
The control column references a sample name that does not appear in the sample column of any other row.
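The control check can be sketched as a membership test against the sample IDs of the other rows. The function name missing_controls and the sample rows are hypothetical. Treating an empty cell like NONE and matching NONE case-sensitively are assumptions that the page does not confirm.

```python
def missing_controls(rows: list[dict[str, str]]) -> list[tuple[str, str]]:
    """Return (sample, control) pairs whose control is not a sample ID in another row."""
    problems = []
    for row in rows:
        control = (row.get("control") or "").strip()
        if not control or control == "NONE":
            continue  # no matched control declared for this sample
        other_ids = {r["sample"] for r in rows if r is not row}
        if control not in other_ids:
            problems.append((row["sample"], control))
    return problems


# Placeholder rows: chip_2 points at a control that is not in the sheet.
rows = [
    {"sample": "chip_1", "control": "input_1"},
    {"sample": "input_1", "control": "NONE"},
    {"sample": "chip_2", "control": "input_9"},
]
print(missing_controls(rows))  # [('chip_2', 'input_9')]
```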