The bulk ATAC-seq mode is the default execution track of the BDB-Genomics framework. It drives a complete, production-grade chromatin accessibility analysis from raw FASTQ reads through bias-corrected transcription factor footprints, applying a hard quality-control gate before any compute-intensive downstream work begins. Every stage is configuration-driven: you declare your reference paths, resource limits, and QC thresholds once inDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/BDB-Genomics/atacseq-pipeline/llms.txt
Use this file to discover all available pages before exploring further.
config.yaml, and the Snakemake DAG handles all orchestration automatically.
Activating Bulk Mode
Bulk mode is the default. You do not need to set any environment variable or changeconfig.yaml to use it. All three forms below are equivalent:
Modality Comparison
The table below summarises how bulk mode differs from single-cell mode across every major analysis stage.| Stage | Bulk Mode | Single-Cell Mode (scatac) |
|---|---|---|
| Alignment | Bowtie2 (--very-sensitive) | Chromap (--preset atac) |
| Filtering | MAPQ > 30, Fixmate, ENCODE Blacklist removal, Tn5 Shift | ArchR Arrow creation & doublet removal |
| Peak Calling | MACS2, IDR replicate concordance | ArchR marker peak identification |
| Co-accessibility | N/A | Cicero (500 bp window, 250 kb distance) |
| Differential | DESeq2 (FDR 0.05, log2FC 1.0) | ArchR cluster markers |
| Footprinting | HINT-ATAC & TOBIAS BINDetect | chromVAR motif accessibility |
Sample Sheet
The bulk pipeline reads sample metadata from a tab-separated file whose path is set inconfig.yaml under global.samples (default: data/fastp/samples.tsv). Each row represents one biological replicate. The condition and replicate columns drive automatic IDR pairing.
Every sample requires both
fastq_r1 and fastq_r2. The pipeline is paired-end only. Provide absolute paths or paths relative to the repository root.How IDR Pairs Are Auto-Computed
The Snakefile reads thecondition and replicate columns and uses itertools.combinations to enumerate every possible replicate pair within each condition group. No manual configuration is needed:
ctrl_rep1_rep2_idr_peaks.bed and treat_rep1_rep2_idr_peaks.bed.
Full Pipeline Stages
Raw reads are quality-trimmed by fastp with a 5-base front trim on both mates and a minimum length filter of 30 bp, followed by FastQC for per-sample read-level QC metrics.
# config.yaml
fastp:
output: "results/preprocessing/fastp"
params:
trim_front1: 5
trim_front2: 5
length_required: 30
threads: 4
resources:
mem_mb: 8000
time: 120
Trimmed paired-end reads are aligned to the reference genome using Bowtie2 in
--very-sensitive mode.# config.yaml
bowtie2:
input: "results/preprocessing/fastp"
output: "results/alignment/bowtie2"
params:
index: "data/reference/index/genome" # set via YAML anchor *BOWTIE2_INDEX
sensitive: "--very-sensitive"
threads: 8
resources:
mem_mb: 16000
time: 240
The raw BAM passes through a multi-step filtering chain that removes low-quality alignments and common sources of ATAC-seq noise:
chrMT)3844The final output of this chain is a Tn5-shifted, blacklist-cleaned BAM at
results/post_alignment/tn5_shift/{sample}.filtered.shifted.bam.Before the quality gate, the pipeline collects a comprehensive set of alignment and library complexity metrics:
The pipeline implements a hard QC gate before peak calling. Samples that fail any threshold are recorded in
results/qc_gate/{sample}_qc_pass.txt and automatically bypassed for downstream footprinting and differential analysis to avoid wasting compute on poor-quality data.# config.yaml
qc_gate:
params:
min_frip: 0.2
min_tss_enr: 7.0
min_mapping_rate: 80.0
max_duplicate_rate: 20.0
The QC gate runs before MACS2 peak calling. Samples that fail will not appear in IDR, consensus peaks, DESeq2, or footprinting outputs. Review
results/qc_gate/ and the MultiQC report to identify failing samples.Samples that pass the QC gate proceed to MACS2 peak calling in BAMPE format (paired-end BAM). The genome effective size (
hs for human) and q-value threshold are configurable.# config.yaml
macs2:
params:
genome_size: "hs"
qvalue: 0.01
nomodel: "--nomodel"
format: "BAMPE"
threads: 8
resources:
mem_mb: 16000
time: 240
Called peaks are immediately filtered against the ENCODE blacklist to produce
results/peak_calling/filtered_peaks/{sample}_filtered_peaks.bed.For each condition, all replicate pairs are run through IDR to identify reproducibly called peaks. IDR pairs are computed automatically from the sample sheet (see above).
# config.yaml
idr:
output:
idr_peaks: "results/peak_calling/idr/idr_peaks"
optimal_peaks: "results/peak_calling/idr/optimal_peaks"
plots: "results/peak_calling/idr/plots"
params:
idr_threshold: 0.05
rank_column: "score"
High-confidence peaks are merged across all samples into a single consensus peak set. A peak must appear in at least 2 samples (
min_samples: 2) with a merge distance of 100 bp.# config.yaml
consensus_peaks:
output:
consensus: "results/peak_calling/consensus_peaks"
counts: "results/peak_calling/consensus_peaks"
params:
min_samples: 2
merge_distance: 100
The count matrix (
peak_sample_counts.txt) records per-sample read counts across the consensus peak set and feeds directly into DESeq2.DESeq2 uses the consensus peak count matrix to identify differentially accessible regions between conditions. Results, volcano, MA, and PCA plots are written to
results/peak_calling/differential_accessibility/.# config.yaml
differential_accessibility:
params:
fdr_threshold: 0.05
log2fc_threshold: 1.0
threads: 8
resources:
mem_mb: 16000
time: 240
# config.yaml
tobias:
output:
corrected_bw: "results/peak_calling/tobias/corrected_bw"
bias_track: "results/peak_calling/tobias/bias_track"
footprint_bw: "results/peak_calling/tobias/footprint_bw"
bindetect: "results/peak_calling/tobias/bindetect"
params:
genome_fa: "data/reference/genome.fa"
motif_db: "data/motifs/jaspar_vertebrates.meme"
conditions: "condition"
threads: 8
resources:
mem_mb: 16000
time: 240
HINT-ATAC (via RGT) predicts footprints per sample and writes BED files to
results/peak_calling/footprinting/footprints/{sample}_footprints.bed.chromVAR quantifies motif accessibility deviation scores across samples using the Tn5-shifted BAMs and the JASPAR motif database.
# config.yaml
chromvar_analysis:
output:
deviations: "results/peak_calling/chromvar/deviations"
bias_corrected: "results/peak_calling/chromvar/bias_corrected"
plots: "results/peak_calling/chromvar/plots"
params:
motif_db: "data/motifs/jaspar_vertebrates.meme"
genome_fa: "data/reference/genome.fa"
genome_sizes: "data/reference/genome.chrom.sizes"
threads: 8
resources:
mem_mb: 16000
time: 240
Output Structure
All outputs are written beneathresults/ and are organised by analysis stage:
Graceful Handling of Empty Peak Sets
If a sample passes the QC gate but yields zero peaks after blacklist filtering, downstream rules (ChIPseeker annotations, heatmaps, TOBIAS) detect the empty BED file and produce valid stub outputs — empty DataFrames, blank PDFs — rather than crashing the DAG. The overall pipeline continues processing all other samples.Aggregated Reporting
On every run, regardless of success or failure, the pipeline writes a machine-readable execution summary:logs/. The MultiQC report at results/reporting/multiqc/multiqc_report.html aggregates FastQC, Preseq, Picard, and QC gate metrics into a single browsable HTML file.