The BDB-Genomics ATAC-seq Framework is a production-grade, config-driven Snakemake workflow for end-to-end chromatin accessibility analysis. Starting from raw paired-end FASTQ files, it runs six ordered stages (Preprocessing, Alignment, Post-Alignment filtering, Metrics & QC, Peak Calling, and Visualization/Reporting) and produces an auditable set of BAMs, peak files, differential-accessibility tables, TF-footprint BigWigs, and an aggregated MultiQC HTML report. The entire pipeline is driven by a single config.yaml file; no Snakemake rule needs to be modified for routine use. Bulk and single-cell modalities share one codebase and are switched with the ATAC_MODE environment variable or, when it is unset, the global.mode key in config.yaml.

Pipeline Stage Map

| Stage | What it Does | Key Tools | Output Directory |
|---|---|---|---|
| Preprocessing | Adapter trimming, 5′-end trimming, QC reports | fastp, FastQC | results/preprocessing/ |
| Alignment | Map trimmed reads to reference genome | Bowtie2 (bulk) / Chromap (scATAC) | results/alignment/ |
| Post-Alignment | Dedup, MAPQ filter, blacklist removal, Tn5 shift | samtools, bedtools, deepTools | results/post_alignment/ |
| Metrics & QC | TSS enrichment, fragment sizes, QC gating | ATACseqQC (R), Picard, Preseq, Qualimap | results/metrics_qc/, results/qc_gate/ |
| Peak Calling | Peak calling, IDR, consensus peaks, DA, footprinting | MACS2, IDR, DESeq2, TOBIAS, HINT-ATAC, HOMER | results/peak_calling/ |
| Visualization | BigWigs, heatmaps, correlation, MultiQC report | deepTools, UCSC tools, MultiQC | results/visualization/, results/reporting/ |
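
The "Tn5 shift" in the Post-Alignment row compensates for the 9 bp duplication that Tn5 creates at each insertion site. A minimal sketch of the commonly used +4/-5 convention follows. The function name and the 0-based, half-open coordinates are assumptions for illustration; the offsets the pipeline actually applies depend on its own tooling and configuration, which this page does not state.

```python
def tn5_shift(start: int, end: int, plus_offset: int = 4, minus_offset: int = 5):
    """Shift both ends of a fragment toward the Tn5 insertion centre.

    Coordinates are 0-based, half-open [start, end). The forward-strand
    end moves +4 bp and the reverse-strand end moves -5 bp by default;
    both offsets are parameters because conventions differ between tools.
    """
    new_start = start + plus_offset
    new_end = end - minus_offset
    if new_end <= new_start:
        raise ValueError("fragment is too short to shift")
    return new_start, new_end


print(tn5_shift(100, 300))  # (104, 295)
```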

Six-Stage DAG

Raw FASTQs
         │
         ▼
┌───────────────────┐
│   Preprocessing   │  fastp (trim) → FastQC (report)
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│    Alignment      │  Bowtie2 --very-sensitive (bulk)
└────────┬──────────┘  Chromap --preset atac   (scATAC)
         │
         ▼
┌───────────────────┐
│  Post-Alignment   │  fixmate → markdup → mito remove →
│    Filtering      │  MAPQ ≥ 30 → blacklist → Tn5 shift
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│   Metrics & QC    │  TSS enrichment, FRiP, samtools stats
│     QC Gate       │  → PASS / WARN / FAIL per sample
└────────┬──────────┘
         │
     ┌───┴───────────┐
     ▼               ▼
┌─────────┐  ┌───────────────┐
│  Peak   │  │ Visualization │
│ Calling │  │  & Reporting  │
└─────────┘  └───────────────┘
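
The PASS / WARN / FAIL gate in the diagram can be pictured as a tiered threshold check. The sketch below is illustrative only: the metric names, the threshold values, and the rule that the worst metric decides the sample's tier are all assumptions, not the pipeline's actual gate logic.

```python
# Hypothetical thresholds per metric: (warn_below, fail_below).
THRESHOLDS = {
    "tss_enrichment": (7.0, 5.0),
    "frip": (0.2, 0.1),
}

# Severity order used to pick the worst tier across metrics.
SEVERITY = {"PASS": 0, "WARN": 1, "FAIL": 2}


def tier(value, warn_below, fail_below):
    """Classify one metric where higher values are better."""
    if value < fail_below:
        return "FAIL"
    if value < warn_below:
        return "WARN"
    return "PASS"


def gate(metrics, thresholds=THRESHOLDS):
    """Return (overall_tier, per_metric_tiers) for one sample."""
    tiers = {
        name: tier(metrics[name], warn_below, fail_below)
        for name, (warn_below, fail_below) in thresholds.items()
    }
    overall = max(tiers.values(), key=SEVERITY.__getitem__)
    return overall, tiers


overall, per_metric = gate({"tss_enrichment": 8.2, "frip": 0.15})
print(overall, per_metric)
```

Here the sample passes on TSS enrichment but only reaches WARN on FRiP, so the overall tier is WARN under the assumed worst-metric rule.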

Modality Switching (ATAC_MODE)

The pipeline determines which set of Snakemake rules to activate by reading the ATAC_MODE environment variable at startup. If the variable is not set, it falls back to config.yaml → global.mode (default: "bulk").

# Bulk ATAC-seq (default)
snakemake --use-conda --cores 8

# Single-cell ATAC-seq
ATAC_MODE=scatac snakemake --use-conda --cores 8

An invalid value raises a ValueError before any job is submitted:

# Snakefile
import os

# The environment variable wins; otherwise use config.yaml, then "bulk".
MODE = os.getenv("ATAC_MODE", config.get("global", {}).get("mode", "bulk"))
if MODE not in ("bulk", "scatac"):
    raise ValueError(
        f"Invalid mode '{MODE}'. Use 'bulk' or 'scatac'. "
        "Set via ATAC_MODE env var or config.yaml global.mode"
    )
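
The precedence described above (environment variable first, then config.yaml, then the "bulk" default) can be exercised outside Snakemake with a small helper. This is a sketch: resolve_mode is a hypothetical function, and its env and config arguments stand in for os.environ and the parsed config.yaml.

```python
import os

VALID_MODES = ("bulk", "scatac")


def resolve_mode(env=None, config=None):
    """Mirror the Snakefile lookup: ATAC_MODE, then global.mode, then 'bulk'."""
    env = os.environ if env is None else env
    config = config or {}
    mode = env.get("ATAC_MODE", config.get("global", {}).get("mode", "bulk"))
    if mode not in VALID_MODES:
        raise ValueError(
            f"Invalid mode '{mode}'. Use 'bulk' or 'scatac'. "
            "Set via ATAC_MODE env var or config.yaml global.mode"
        )
    return mode


# The environment variable overrides the config file.
print(resolve_mode({"ATAC_MODE": "scatac"}, {"global": {"mode": "bulk"}}))  # scatac
# With neither set, the default applies.
print(resolve_mode({}, {}))  # bulk
```

As in the Snakefile excerpt, the comparison is case-sensitive, and an ATAC_MODE that is set to an empty string is rejected rather than treated as unset.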

| Stage | Tool |
|---|---|
| Alignment | Bowtie2 (--very-sensitive) |
| Deduplication | samtools markdup (mark only, configurable) |
| Peak Calling | MACS2 (BAMPE, --nomodel) |
| Differential | DESeq2 (FDR 0.05, log2FC 1.0) |
| Footprinting | HINT-ATAC (RGT) + TOBIAS BINDetect |
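
The DESeq2 thresholds listed above (FDR 0.05, log2FC 1.0) amount to a two-condition filter on the results table. The sketch below assumes DESeq2's standard padj and log2FoldChange columns, a strict padj < 0.05 cutoff, and an inclusive |log2FC| >= 1.0 cutoff. Whether the pipeline uses strict or inclusive comparisons is not stated on this page, and the peak IDs and values are made up.

```python
def is_significant(row, fdr=0.05, min_abs_lfc=1.0):
    """Apply the FDR and effect-size cutoffs to one results row."""
    padj = row.get("padj")
    if padj is None:
        # DESeq2 reports NA for features removed by independent filtering.
        return False
    return padj < fdr and abs(row["log2FoldChange"]) >= min_abs_lfc


# Made-up rows standing in for a differential-accessibility table.
rows = [
    {"peak": "peak_1", "padj": 0.001, "log2FoldChange": 2.3},
    {"peak": "peak_2", "padj": 0.2, "log2FoldChange": 3.0},
    {"peak": "peak_3", "padj": 0.01, "log2FoldChange": -0.4},
    {"peak": "peak_4", "padj": None, "log2FoldChange": 1.5},
    {"peak": "peak_5", "padj": 0.03, "log2FoldChange": -1.2},
]

hits = [row["peak"] for row in rows if is_significant(row)]
print(hits)  # ['peak_1', 'peak_5']
```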

Lifecycle Hooks

The Snakefile registers three Snakemake lifecycle hooks, which run without any user intervention:

1. onstart

Prints the active mode and detected sample count to stdout once the DAG has been built, just before job execution begins. No files are written.

onstart:
    print("[START] BDB-Genomics ATAC-seq Framework")
    print(f"Mode: {MODE.upper()}")
    print(f"Samples: {len(SAMPLES)} samples detected")

2. onsuccess

Prints the path to the final MultiQC HTML report, then calls aggregate_logs.py success to write a machine-readable execution summary to results/reporting/pipeline_execution_summary.json.

onsuccess:
    print("[SUCCESS] Pipeline completed successfully!")
    print(f"Final MultiQC report: {config['multiqc']['output']}/multiqc_report.html")
    # Requires `import subprocess` at the top of the Snakefile.
    subprocess.run([
        "python3", "rules/scripts/aggregate_logs.py",
        "success", "results/reporting/pipeline_execution_summary.json"
    ])

3. onerror

Prints a pointer to the logs/ directory, then calls aggregate_logs.py error with the same JSON path. On failure, the JSON includes the last five error lines extracted from each log file, which gives post-mortem debugging a single starting point.

onerror:
    print("[ERROR] Pipeline encountered an error.")
    print("Please check the log files in 'logs/' for details.")
    subprocess.run([
        "python3", "rules/scripts/aggregate_logs.py",
        "error", "results/reporting/pipeline_execution_summary.json"
    ])
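
The error summary described for onerror can be approximated as follows. This is not the project's aggregate_logs.py: the function name, the *.log glob, the case-insensitive "error" match, and the JSON field names are all assumptions chosen for illustration.

```python
import json
from pathlib import Path


def summarize_logs(log_dir, status, out_path, max_lines=5):
    """Write a JSON summary; on error, keep the last few error lines per log."""
    summary = {"status": status, "logs": {}}
    if status == "error":
        for log_file in sorted(Path(log_dir).rglob("*.log")):
            lines = log_file.read_text(errors="replace").splitlines()
            # Assumed heuristic: any line mentioning "error" counts.
            error_lines = [line for line in lines if "error" in line.lower()]
            if error_lines:
                summary["logs"][str(log_file)] = error_lines[-max_lines:]
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(summary, indent=2))
    return summary
```

A call such as summarize_logs("logs", "error", "results/reporting/pipeline_execution_summary.json") would mirror the arguments the hook passes on the command line.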

Configuration & Profiles

All parameters live in config.yaml; routine analyses never require editing a .smk rule file. To override individual keys without modifying the main config, pass an additional config file on the command line. Files listed later take precedence over earlier ones:

# Merge a custom override file at runtime
snakemake --configfile config.yaml custom_override.yaml --cores 8
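
The effect of layering an override file on top of config.yaml can be pictured as a recursive dictionary merge. The sketch below is a plain-Python approximation, not Snakemake's own code, and it assumes nested mappings are merged key by key rather than replaced wholesale. The threads key is hypothetical; only global.mode appears on this page.

```python
from copy import deepcopy


def merge_config(base, override):
    """Return a new dict where values in `override` win over `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Descend into nested sections so sibling keys survive.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


base = {"global": {"mode": "bulk", "threads": 8}}   # stands in for config.yaml
override = {"global": {"threads": 4}}               # stands in for custom_override.yaml

merged = merge_config(base, override)
print(merged)  # {'global': {'mode': 'bulk', 'threads': 4}}
```

Only the overridden key changes; global.mode is carried over from the base file.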

Six execution profiles are available:

| Profile | Description |
|---|---|
| local | Up to 8 concurrent local jobs. Ideal for workstations. |
| slurm | Submits each rule as a SLURM job with per-rule memory/time limits. |
| low_resource | Caps all rules at 4 GB RAM. Designed for laptops. |
| test | Relaxed QC thresholds for synthetic CI datasets. |
| aws | AWS Batch + S3 + Tibanna executor. |
| kubernetes | Container-native K8s scaling for large cohorts. |

Output Manifest

results/
├── preprocessing/          # Trimmed FASTQs, fastp HTML/JSON, FastQC reports
├── alignment/              # Unsorted BAMs from Bowtie2 / Chromap
├── post_alignment/         # Sorted, dedup, filtered, Tn5-shifted BAMs
├── metrics_qc/             # TSS enrichment, fragment sizes, Picard metrics
├── qc_gate/                # Per-sample _qc_pass.txt and _qc_pass.json
├── peak_calling/           # MACS2 narrowPeaks, consensus BED, IDR, DESeq2
├── visualization/          # BigWigs, heatmaps, correlation heatmap
└── reporting/
    ├── multiqc/            # multiqc_report.html
    ├── benchmark_summary.tsv
    └── pipeline_execution_summary.json
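
After a run, the layout above can be checked programmatically. A minimal sketch, assuming the results/ root and the entry names shown in the manifest; missing_outputs is a hypothetical helper, not part of the pipeline.

```python
from pathlib import Path

# Entries taken from the manifest above, relative to the results root.
EXPECTED = [
    "preprocessing",
    "alignment",
    "post_alignment",
    "metrics_qc",
    "qc_gate",
    "peak_calling",
    "visualization",
    "reporting/multiqc",
    "reporting/benchmark_summary.tsv",
    "reporting/pipeline_execution_summary.json",
]


def missing_outputs(root="results"):
    """Return the manifest entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

An empty list from missing_outputs() means every top-level entry is present; it says nothing about whether the files inside are complete.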

Stage Pages

Preprocessing

fastp trimming parameters and FastQC quality reporting.

Alignment

Bowtie2 bulk alignment and coordinate sorting.

Post-Alignment

Eight-step filtering chain from fixmate to Tn5 shift.

QC Gating

Four-metric gate with PASS/WARN/FAIL tiers.

Peak Calling

MACS2, IDR, DESeq2, TOBIAS, HOMER, and more.

Visualization

BigWigs, heatmaps, MultiQC, and benchmark summaries.
