This guide gets you from a bare machine to a completed pipeline run as quickly as possible. You will create a Snakemake Conda environment, clone the repository, generate lightweight synthetic test data (no internet downloads required for the CI path), and execute both the bulk and single-cell modes. By the end you will have a full results/ directory tree with BAMs, peaks, QC reports, and footprinting outputs ready to inspect.
Create the Conda environment
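As a sketch of this step, the environment can be created with a single `conda create` call. The environment name `atacseq-snakemake` is a hypothetical choice, not one taken from the repository; the channels and version bounds follow the requirements stated in this section.

```shell
# Hypothetical environment name (an assumption); pick any name you like.
ENV_NAME="atacseq-snakemake"

if command -v conda >/dev/null 2>&1; then
    # conda-forge is listed first so that it takes priority over bioconda.
    conda create --yes --name "$ENV_NAME" \
        --channel conda-forge --channel bioconda \
        "snakemake>=8.0" "python>=3.9"
else
    echo "conda not found on PATH; install a Conda distribution first" >&2
fi
```

Afterwards, `conda activate atacseq-snakemake` followed by `snakemake --version` should report version 8.0 or later.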
The pipeline requires Snakemake ≥ 8.0 and Python ≥ 3.9 on the host. Create a dedicated environment using the conda-forge and bioconda channels.
All per-rule tool dependencies (Bowtie2, MACS2, TOBIAS, ArchR, etc.) are resolved automatically by Snakemake at runtime via the --use-conda flag. You do not need to install them manually.
Generate synthetic test data
For CI runs or offline environments, the pipeline ships a self-contained data generator that creates FASTQ files, a FASTA reference, a GTF annotation, and Bowtie2 indices — with TSS-targeted reads designed to pass all QC thresholds without requiring any external downloads:
(Optional) Download real ENCODE data
This script fetches ENCSR356KRQ chr19 + chrM from ENCODE (~200 MB), GENCODE v44 annotations, JASPAR motif files, and builds all required indices automatically:
This step requires an active internet connection and approximately 200 MB of disk space. It is not needed when using the synthetic data generated in the previous step.
Point config.yaml at your sample sheet
Open config.yaml and confirm that global.samples points at your sample manifest. For the synthetic data path it should read:
The sample sheet (samples.tsv) is a tab-separated file with columns sample, condition, replicate, fastq_r1, and fastq_r2. The synthetic data generator creates this file automatically; for real data, edit it to reference your own FASTQ paths.
Validate your configuration
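Before running the pipeline's own validator, a quick independent check of the sample sheet layout can catch formatting slips in a hand-edited file. The function below is a minimal sketch, not part of the pipeline: the name `check_sample_sheet` is hypothetical, and it assumes the sheet starts with a header row naming the five columns in the order listed above.

```shell
# Hypothetical helper (not shipped with the pipeline).
# Assumes a header row with the five documented columns, in this order.
check_sample_sheet() {
    sheet="$1"
    expected=$(printf 'sample\tcondition\treplicate\tfastq_r1\tfastq_r2')

    # Fail if the file cannot be read.
    header=$(head -n 1 "$sheet") || return 1

    if [ "$header" != "$expected" ]; then
        echo "unexpected header in $sheet" >&2
        return 1
    fi

    # Every non-empty data row should have exactly five tab-separated fields.
    awk -F '\t' '
        NR > 1 && NF > 0 && NF != 5 {
            printf "line %d: expected 5 fields, found %d\n", NR, NF
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$sheet" >&2
}
```

Calling `check_sample_sheet samples.tsv` returns 0 when the layout matches and a non-zero status otherwise. It does not verify that the FASTQ paths exist.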
Run the pre-flight validation script before launching Snakemake. It checks every required key and path in config.yaml and prints specific error messages for any missing or malformed values.
A clean run prints no errors and exits with code 0. The Snakemake workflow also runs this validation automatically on startup, but running it manually first saves time if there are configuration issues.
Run the bulk ATAC-seq pipeline
Launch the full bulk-mode pipeline using 8 CPU cores. Snakemake will automatically build and cache all per-rule Conda environments on the first run.
You will see a startup banner reporting the mode and number of detected samples.
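The launch described in this step can be sketched as follows. It assumes the command is run from the repository root with the Snakemake environment active and that the workflow uses the default Snakefile location. Only the --use-conda flag and the 8-core count come from the text above; the guard around the call is an addition for illustration.

```shell
# Core count named in this step.
CORES=8

if command -v snakemake >/dev/null 2>&1; then
    # --use-conda lets Snakemake build the per-rule environments on first use.
    snakemake --use-conda --cores "$CORES"
else
    echo "snakemake not found; activate the Conda environment first" >&2
fi
```

Adding `--dry-run` to the same command prints the planned jobs without executing them, which is a cheap way to confirm the workflow resolves before committing to a full run.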
(Optional) Run the single-cell ATAC-seq pipeline
Switch modalities by setting the ATAC_MODE environment variable. The pipeline will load the Chromap, ArchR, and Cicero rule sets instead of the bulk toolchain.
Single-cell mode replaces alignment (Chromap --preset atac), filtering (ArchR Arrow file creation and doublet removal), and peak calling (ArchR marker peaks). It also adds Cicero co-accessibility analysis with a 500 bp window and 250 kb distance cutoff.
Expected Outputs
On successful completion, all outputs are written to the results/ directory, organized by stage:
| Directory | Contents |
|---|---|
| results/alignment/ | Initial Bowtie2-aligned BAMs |
| results/post_alignment/ | Sorted, deduplicated, blacklist-filtered, and Tn5-shifted BAMs |
| results/metrics_qc/ | Picard, TSS enrichment, fragment size, and cross-correlation metrics |
| results/peak_calling/ | MACS2 narrowPeak files, consensus BEDs, and IDR replicate sets |
| results/peak_calling/differential_accessibility/ | DESeq2 tables, Volcano/MA/PCA plots |
| results/peak_calling/footprinting/ | HINT-ATAC footprint BEDs |
| results/peak_calling/tobias/ | TOBIAS bias-corrected BigWigs and BINDetect motif plots |
| results/reporting/ | pipeline_execution_summary.json and benchmark TSV |
| benchmarks/ | Per-rule CPU time and peak memory consumption |
The MultiQC report is written to results/reporting/multiqc/multiqc_report.html and is the recommended entry point for reviewing run quality.
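To confirm that a bulk run produced the stage directories listed in the table, a small check along these lines may help. It is a sketch rather than a pipeline feature: the function name `check_outputs` is hypothetical, and it only tests that the directories exist, not that their contents are complete.

```shell
# Hypothetical helper (not shipped with the pipeline).
# Reports which of the documented results/ subdirectories are present.
check_outputs() {
    root="${1:-results}"
    missing=0
    for sub in alignment post_alignment metrics_qc peak_calling \
               peak_calling/differential_accessibility \
               peak_calling/footprinting peak_calling/tobias reporting; do
        if [ -d "$root/$sub" ]; then
            echo "ok       $root/$sub"
        else
            echo "missing  $root/$sub"
            missing=1
        fi
    done
    # Non-zero status if any directory was absent.
    return "$missing"
}
```

Running `check_outputs results` from the repository root prints one line per directory and returns a non-zero status if any is absent.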