The BDB-Genomics ATAC-seq pipeline ships with two complementary testing strategies. The synthetic CI path requires no internet access and completes in minutes; it is the approach used by GitHub Actions on every push and pull request. The real ENCODE path downloads approximately 200 MB of publicly available data from ENCODE, GENCODE, and JASPAR, and exercises the full pipeline against biologically authentic reads. Choose synthetic data for rapid iteration during development and real data for pre-release validation.
Synthetic CI data
The script rules/scripts/generate_test_data.py builds a complete, self-contained dataset entirely in memory and writes it to the data/ directory. No network connection, SRA-tools, or pre-existing reference files are required.
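As a rough illustration of what "self-contained" can look like, the sketch below writes a random-sequence FASTA and a matching .fai index using only the Python standard library. It is not taken from generate_test_data.py: the function name, the fixed seed, and the 60-base line width are assumptions made for the example.

```python
import random


def write_fasta_with_index(path, chrom_lengths, line_width=60, seed=0):
    """Write a random-sequence FASTA to `path` and a faidx-style index to `path + '.fai'`.

    Hypothetical helper for illustration; not the pipeline's own code.
    Returns the index rows as (name, length, offset, line_bases, line_bytes).
    """
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    index_rows = []
    offset = 0  # running byte offset (content is pure ASCII with "\n" endings)
    with open(path, "w", newline="\n") as fasta:
        for name, length in chrom_lengths.items():
            header = f">{name}\n"
            fasta.write(header)
            offset += len(header)
            # The index records where the first base of each sequence starts.
            index_rows.append((name, length, offset, line_width, line_width + 1))
            seq = "".join(rng.choices("ACGT", k=length))
            for start in range(0, length, line_width):
                line = seq[start:start + line_width] + "\n"
                fasta.write(line)
                offset += len(line)
    with open(path + ".fai", "w", newline="\n") as fai:
        for row in index_rows:
            fai.write("\t".join(str(field) for field in row) + "\n")
    return index_rows
```

Calling write_fasta_with_index("genome.fa", {"chr1": 500_000, "chr2": 250_000, "chrMT": 16_569}) would produce files with the chromosome names and lengths used by the synthetic dataset, though the real script may lay out its sequences differently.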
What gets generated
Running the script produces the following artefacts in seven steps:

1. Reference genome (FASTA). A three-chromosome genome (chr1: 500 kbp, chr2: 250 kbp, chrMT: 16,569 bp) written to data/reference/genome.fa with a companion .fai index.
2. Chromosome sizes. data/reference/genome.chrom.sizes, required by bedGraphToBigWig and deeptools.
3. Gene annotation (GTF). 80 genes (50 on chr1, 30 on chr2) with gene, transcript, and exon entries written to data/reference/annotation.gtf.
4. ENCODE blacklist BED + motif database. A two-region blacklist (data/reference/ENCODE_blacklist.bed) covering known repeat-dense intervals, plus a minimal MEME motif database written to data/motifs/jaspar_vertebrates.meme. Both are generated in the same step.
5. Bowtie2 index. Built via bowtie2-build if available on $PATH; otherwise, valid placeholder .bt2 binary files are written to data/reference/index/.
6. Paired-end FASTQs. Four samples (sample1–sample4) × 7,500 read pairs each, written as gzip-compressed FASTQ to data/fastq/. 75% of reads are positioned within ±2,000 bp of a TSS to guarantee non-zero TSS enrichment scores.
7. Chromap index placeholder. A placeholder index file written to data/reference/chromap/genome.index for scATAC-seq compatibility. The sample sheet (data/fastp/samples.tsv) is also written in this final step.

tss_enrichment.R calls featureAlignedSignal(), which tiles each TSS into a ±2,000 bp window divided into 200 bins. Without sufficient coverage across those bins, the R function crashes with a names/length mismatch. By targeting 75% of reads near annotated TSSes across 80 genes, the synthetic data reliably passes this QC gate.
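The TSS-biased read placement can be sketched as follows. This is a simplified, single-chromosome illustration under assumed names (sample_read_starts, bin_coverage) and an assumed evenly spaced TSS layout; the actual logic in generate_test_data.py may differ. It places 75% of read starts within ±2,000 bp of a TSS, then counts read starts in each of the 200 bins (20 bp each) of the pooled TSS windows.

```python
import random


def sample_read_starts(tss_positions, chrom_length, n_reads,
                       near_fraction=0.75, flank=2000, seed=0):
    """Return read start positions, with `near_fraction` of them near a TSS."""
    rng = random.Random(seed)
    n_near = round(n_reads * near_fraction)
    starts = []
    for i in range(n_reads):
        if i < n_near:
            # Pick a TSS, then an offset inside its +/- flank window.
            pos = rng.choice(tss_positions) + rng.randint(-flank, flank)
        else:
            # Background read: uniform over the chromosome.
            pos = rng.randrange(chrom_length)
        starts.append(min(max(pos, 0), chrom_length - 1))
    return starts


def bin_coverage(starts, tss_positions, flank=2000, n_bins=200):
    """Count read starts per bin, pooled over all TSS windows."""
    bin_width = (2 * flank) // n_bins  # 4,000 bp / 200 bins = 20 bp
    counts = [0] * n_bins
    for pos in starts:
        for tss in tss_positions:
            rel = pos - (tss - flank)  # position relative to window start
            if 0 <= rel < 2 * flank:
                counts[rel // bin_width] += 1
    return counts
```

With a few thousand reads concentrated in the windows, every bin receives reads, which is the property the enrichment calculation depends on. Uniformly placed reads on a 500 kbp chromosome would leave the bins much more sparsely covered.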
Generate and run
Run with the test profile
The profile/test/ profile relaxes QC thresholds to values appropriate for tiny synthetic datasets, reduces default resource requests, and enables verbose shell output.
The profile is defined in profile/test/config.yaml. Its configfile key points to profile/test/config_test.yaml, which overrides the qc_gate thresholds so synthetic datasets are not incorrectly flagged as failures.
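The effect of layering an override file on top of the defaults can be pictured as a recursive dictionary merge. The sketch below is an illustration only: the default values are the QC gate thresholds quoted elsewhere on this page, the nesting under a qc_gate key is assumed, and the relaxed values in TEST_OVERRIDES are hypothetical placeholders, not the contents of profile/test/config_test.yaml.

```python
import copy

# Default thresholds as quoted on this page; nesting under "qc_gate" is assumed.
DEFAULT_CONFIG = {
    "qc_gate": {
        "min_frip": 0.2,
        "min_tss_enr": 7.0,
        "min_mapping_rate": 80.0,
        "max_duplicate_rate": 20.0,
    }
}

# Hypothetical relaxed values for tiny synthetic datasets (not the real file).
TEST_OVERRIDES = {"qc_gate": {"min_frip": 0.0, "min_tss_enr": 0.0}}


def deep_update(base, overrides):
    """Return a copy of `base` with `overrides` applied key by key, recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_update(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

Keys absent from the override (here min_mapping_rate and max_duplicate_rate) keep their default values, so a test config only needs to list the thresholds it relaxes.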
Real ENCODE data
For full-scale validation against real sequencing data, use the download script.
What gets downloaded
| Resource | Source | Destination |
|---|---|---|
| ENCSR356KRQ FASTQs (chr19 + chrM subsampled) | ENCODE Google Storage | data/fastq/sample{1,2,3}_R{1,2}.fastq.gz |
| hg38 chr19 FASTA | UCSC Golden Path | data/reference/genome.fa (merged with chrM) |
| hg38 chrM FASTA | UCSC Golden Path | merged into data/reference/genome.fa |
| GENCODE v44 basic GTF | EMBL-EBI / Ensembl | data/reference/annotation.gtf (chr19 + chrM only) |
| ENCODE blacklist (ENCFF356LFX) | ENCODE Project | data/reference/ENCODE_blacklist.bed (chr19 + chrM) |
| JASPAR 2024 vertebrate motifs | JASPAR ELIXIR | data/motifs/jaspar_vertebrates.meme |
After downloading, the script:
- Combines chr19.fa and chrM.fa into a single genome.fa
- Extracts only chr19 and chrM lines from the GENCODE GTF
- Filters the blacklist BED to those same chromosomes
- Writes genome.chrom.sizes (chr19: 58,617,616 bp, chrM: 16,569 bp)
- Builds a Bowtie2 index with bowtie2-build
- Attempts to build a Chromap index if chromap is available
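The chromosome filtering and chrom.sizes steps are simple enough to sketch in a few lines. The helper names below are hypothetical and the download script may implement this differently (for example with grep or awk). The point of the sketch is that matching on the exact first column, rather than on a prefix, keeps alternate contigs out of the subset.

```python
KEEP = ("chr19", "chrM")


def filter_by_chrom(lines, keep=KEEP):
    """Yield comment lines and tab-delimited records (BED or GTF) on kept chromosomes."""
    keep = set(keep)
    for line in lines:
        # Column 1 is the chromosome in both BED and GTF.
        if line.startswith("#") or line.split("\t", 1)[0] in keep:
            yield line


def format_chrom_sizes(sizes):
    """Render {name: length} as the two-column, tab-separated chrom.sizes text."""
    return "".join(f"{name}\t{length}\n" for name, length in sizes.items())
```

For the subset described above, format_chrom_sizes({"chr19": 58617616, "chrM": 16569}) yields the two rows expected in genome.chrom.sizes.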
Run the full pipeline
The full run uses the thresholds in config.yaml, including the QC gate (min_frip: 0.2, min_tss_enr: 7.0, min_mapping_rate: 80.0, max_duplicate_rate: 20.0). Real ENCODE data should pass all four gates.
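To make the four gates concrete, here is a minimal sketch of how such a check could be expressed. The threshold names and values come from the text above; the metric names (frip, tss_enr, mapping_rate, duplicate_rate), the function name, and the treatment of a value equal to the threshold as a pass are assumptions, and the pipeline's own QC gate may be implemented differently.

```python
QC_GATE = {
    "min_frip": 0.2,
    "min_tss_enr": 7.0,
    "min_mapping_rate": 80.0,
    "max_duplicate_rate": 20.0,
}


def qc_failures(metrics, gate=QC_GATE):
    """Return a list of human-readable failures; an empty list means the sample passes."""
    failures = []
    for key, threshold in gate.items():
        bound, metric = key.split("_", 1)  # e.g. "min_tss_enr" -> ("min", "tss_enr")
        value = metrics[metric]
        if bound == "min" and value < threshold:
            failures.append(f"{metric}={value} is below {key}={threshold}")
        elif bound == "max" and value > threshold:
            failures.append(f"{metric}={value} is above {key}={threshold}")
    return failures
```

A sample fails the gate if any one of the four checks fails, which is why the relaxed test profile is needed for tiny synthetic datasets.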
CI/CD with GitHub Actions
The repository’s continuous integration pipeline is defined in .github/workflows/lint.yml and runs on every push and pull request to main.
Workflow structure
The workflow has two sequential jobs:

1. lint. Generates synthetic test data, then runs snakemake --lint and pytest rules/scripts/test_validate_config.py to catch config errors and rule syntax issues before any computation starts. Finishes with a dry-run (snakemake -n) to validate the full DAG.
2. test. Depends on lint. Generates synthetic data and executes snakemake --use-conda --conda-frontend mamba --profile profile/test --cores 4. Results and logs are uploaded as GitHub Actions artifacts (retained 7 days).

The workflow uses micromamba (mamba-org/setup-micromamba@v1) instead of the standard conda frontend for significantly faster environment solve and install times. Conda environments are cached by a hash of all rules/envs/**/*.yaml files, so unchanged environments are restored from cache rather than rebuilt.
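The idea behind a content-hash cache key can be sketched in Python. This is not the algorithm GitHub Actions uses internally, and the function name and "conda-" prefix are assumptions. It only illustrates why the key stays stable while the environment files are unchanged and changes as soon as any of them is edited.

```python
import hashlib
from pathlib import Path


def env_cache_key(root=".", pattern="rules/envs/**/*.yaml", prefix="conda-"):
    """Derive a cache key from the paths and contents of all matching files."""
    root = Path(root)
    digest = hashlib.sha256()
    # Sort so the key does not depend on filesystem listing order.
    for path in sorted(root.glob(pattern)):
        if path.is_file():
            digest.update(path.relative_to(root).as_posix().encode())
            digest.update(b"\0")
            digest.update(path.read_bytes())
            digest.update(b"\0")
    return prefix + digest.hexdigest()
```

Files outside rules/envs/ do not contribute to the key, so ordinary code changes leave the cached environments valid.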
Running the test suite locally
To reproduce the CI checks on your own machine, run the same commands as the lint job: snakemake --lint, pytest rules/scripts/test_validate_config.py, and snakemake -n.
test_validate_config.py contains over 100 pytest assertions covering required keys, value types, resource floor values, path consistency, and YAML anchor resolution. It runs in under a second and catches the most common configuration mistakes before any Snakemake rules are evaluated.
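One way to run the lint job's checks in order and stop at the first failure is a small wrapper like the one below. The snakemake and pytest commands are those named in the lint job description; running the generator script with no arguments, and the wrapper itself (run_steps, CI_STEPS), are assumptions made for illustration and are not part of the repository.

```python
import subprocess
import sys

# Commands mirroring the lint job; the generator's no-argument call is assumed.
CI_STEPS = [
    [sys.executable, "rules/scripts/generate_test_data.py"],
    ["snakemake", "--lint"],
    ["pytest", "rules/scripts/test_validate_config.py"],
    ["snakemake", "-n"],
]


def run_steps(steps=CI_STEPS, runner=subprocess.call):
    """Run each command in order; return the first non-zero exit code, else 0."""
    for cmd in steps:
        print("+", " ".join(cmd))
        code = runner(cmd)
        if code != 0:
            return code
    return 0
```

Calling run_steps() from the repository root, inside an environment where snakemake and pytest are installed, runs the four steps in sequence. The runner parameter exists so the sequence can be exercised without executing anything.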