Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/BDB-Genomics/atacseq-pipeline/llms.txt

Use this file to discover all available pages before exploring further.

The bulk ATAC-seq mode is the default execution track of the BDB-Genomics framework. It drives a complete, production-grade chromatin accessibility analysis from raw FASTQ reads through bias-corrected transcription factor footprints, applying a hard quality-control gate before any compute-intensive downstream work begins. Every stage is configuration-driven: you declare your reference paths, resource limits, and QC thresholds once in config.yaml, and the Snakemake DAG handles all orchestration automatically.

Activating Bulk Mode

Bulk mode is the default. You do not need to set any environment variable or change config.yaml to use it. All three forms below are equivalent:
snakemake --use-conda --cores 8
If both the environment variable and the config key are set, the environment variable takes precedence.

Modality Comparison

The table below summarises how bulk mode differs from single-cell mode across every major analysis stage.
StageBulk ModeSingle-Cell Mode (scatac)
AlignmentBowtie2 (--very-sensitive)Chromap (--preset atac)
FilteringMAPQ > 30, Fixmate, ENCODE Blacklist removal, Tn5 ShiftArchR Arrow creation & doublet removal
Peak CallingMACS2, IDR replicate concordanceArchR marker peak identification
Co-accessibilityN/ACicero (500 bp window, 250 kb distance)
DifferentialDESeq2 (FDR 0.05, log2FC 1.0)ArchR cluster markers
FootprintingHINT-ATAC & TOBIAS BINDetectchromVAR motif accessibility

Sample Sheet

The bulk pipeline reads sample metadata from a tab-separated file whose path is set in config.yaml under global.samples (default: data/fastp/samples.tsv). Each row represents one biological replicate. The condition and replicate columns drive automatic IDR pairing.
sample          fastq_r1                         fastq_r2                         condition   replicate
SRR_ctrl_rep1   data/raw/SRR_ctrl_rep1_R1.fastq.gz   data/raw/SRR_ctrl_rep1_R2.fastq.gz   ctrl        1
SRR_ctrl_rep2   data/raw/SRR_ctrl_rep2_R1.fastq.gz   data/raw/SRR_ctrl_rep2_R2.fastq.gz   ctrl        2
SRR_treat_rep1  data/raw/SRR_treat_rep1_R1.fastq.gz  data/raw/SRR_treat_rep1_R2.fastq.gz  treat       1
SRR_treat_rep2  data/raw/SRR_treat_rep2_R1.fastq.gz  data/raw/SRR_treat_rep2_R2.fastq.gz  treat       2
Every sample requires both fastq_r1 and fastq_r2. The pipeline is paired-end only. Provide absolute paths or paths relative to the repository root.

How IDR Pairs Are Auto-Computed

The Snakefile reads the condition and replicate columns and uses itertools.combinations to enumerate every possible replicate pair within each condition group. No manual configuration is needed:
# Snakefile (excerpt — illustrative)
from itertools import combinations
from collections import defaultdict

_cond_reps = defaultdict(list)
for row in rows:
    _cond_reps[row["condition"]].append(row["replicate"])

IDR_TARGETS = [
    f"{idr_peaks_path}/{condition}_rep{r1}_rep{r2}_idr_peaks.bed"
    for cond, reps in _cond_reps.items()
    for (r1, r2) in combinations(sorted(set(reps)), 2)
]
For the four-sample sheet above this produces two IDR target files: ctrl_rep1_rep2_idr_peaks.bed and treat_rep1_rep2_idr_peaks.bed.

Full Pipeline Stages

1
Preprocessing — fastp + FastQC
2
Raw reads are quality-trimmed by fastp with a 5-base front trim on both mates and a minimum length filter of 30 bp, followed by FastQC for per-sample read-level QC metrics.
3
# config.yaml
fastp:
  output: "results/preprocessing/fastp"
  params:
    trim_front1: 5
    trim_front2: 5
    length_required: 30
  threads: 4
  resources:
    mem_mb: 8000
    time: 120
4
Alignment — Bowtie2
5
Trimmed paired-end reads are aligned to the reference genome using Bowtie2 in --very-sensitive mode.
6
# config.yaml
bowtie2:
  input: "results/preprocessing/fastp"
  output: "results/alignment/bowtie2"
  params:
    index: "data/reference/index/genome"   # set via YAML anchor *BOWTIE2_INDEX
    sensitive: "--very-sensitive"
  threads: 8
  resources:
    mem_mb: 16000
    time: 240
7
Post-Alignment Filtering
8
The raw BAM passes through a multi-step filtering chain that removes low-quality alignments and common sources of ATAC-seq noise:
9
  • samtools sort — coordinate-sort the raw BAM
  • mitoATAC / remove_mito_reads — compute and remove mitochondrial reads (chrMT)
  • samtools fixmate — fill in mate-score tags required for duplicate marking
  • samtools markdup — mark (but do not remove by default) PCR duplicates
  • samtools view — apply MAPQ ≥ 30 filter and bitflag mask 3844
  • remove_blacklist_reads — exclude reads overlapping the ENCODE blacklist BED
  • tn5_shift — apply the +4 / −5 bp Tn5 insertion shift to both mates
  • 10
    # config.yaml (samtools_view excerpt)
    samtools_view:
      params:
        MAPQ: 30
        flags: 3844
    
    11
    The final output of this chain is a Tn5-shifted, blacklist-cleaned BAM at results/post_alignment/tn5_shift/{sample}.filtered.shifted.bam.
    12
    QC Metrics Collection
    13
    Before the quality gate, the pipeline collects a comprehensive set of alignment and library complexity metrics:
    14
  • samtools stats — mapping rate, insert size distribution
  • Picard CollectAlignmentSummaryMetrics — alignment summary
  • Picard CollectInsertSizeMetrics — nucleosomal fragment-size histogram
  • TSS enrichment — enrichment score around transcription start sites
  • Preseq — library complexity curve
  • QualiMap bamqc — per-chromosome coverage
  • Cross-correlation — NSC/RSC strand cross-correlation
  • 15
    QC Gate
    16
    The pipeline implements a hard QC gate before peak calling. Samples that fail any threshold are recorded in results/qc_gate/{sample}_qc_pass.txt and automatically bypassed for downstream footprinting and differential analysis to avoid wasting compute on poor-quality data.
    17
    MetricThresholdFRiP Score≥ 0.2TSS Enrichment≥ 7.0Mapping Rate≥ 80.0 %Duplicate Rate≤ 20.0 %
    18
    # config.yaml
    qc_gate:
      params:
        min_frip: 0.2
        min_tss_enr: 7.0
        min_mapping_rate: 80.0
        max_duplicate_rate: 20.0
    
    19
    The QC gate runs before MACS2 peak calling. Samples that fail will not appear in IDR, consensus peaks, DESeq2, or footprinting outputs. Review results/qc_gate/ and the MultiQC report to identify failing samples.
    20
    Peak Calling — MACS2
    21
    Samples that pass the QC gate proceed to MACS2 peak calling in BAMPE format (paired-end BAM). The genome effective size (hs for human) and q-value threshold are configurable.
    22
    # config.yaml
    macs2:
      params:
        genome_size: "hs"
        qvalue: 0.01
        nomodel: "--nomodel"
        format: "BAMPE"
      threads: 8
      resources:
        mem_mb: 16000
        time: 240
    
    23
    Called peaks are immediately filtered against the ENCODE blacklist to produce results/peak_calling/filtered_peaks/{sample}_filtered_peaks.bed.
    24
    IDR — Irreproducible Discovery Rate
    25
    For each condition, all replicate pairs are run through IDR to identify reproducibly called peaks. IDR pairs are computed automatically from the sample sheet (see above).
    26
    # config.yaml
    idr:
      output:
        idr_peaks: "results/peak_calling/idr/idr_peaks"
        optimal_peaks: "results/peak_calling/idr/optimal_peaks"
        plots: "results/peak_calling/idr/plots"
      params:
        idr_threshold: 0.05
        rank_column: "score"
    
    27
    Consensus Peaks
    28
    High-confidence peaks are merged across all samples into a single consensus peak set. A peak must appear in at least 2 samples (min_samples: 2) with a merge distance of 100 bp.
    29
    # config.yaml
    consensus_peaks:
      output:
        consensus: "results/peak_calling/consensus_peaks"
        counts: "results/peak_calling/consensus_peaks"
      params:
        min_samples: 2
        merge_distance: 100
    
    30
    The count matrix (peak_sample_counts.txt) records per-sample read counts across the consensus peak set and feeds directly into DESeq2.
    31
    Differential Accessibility — DESeq2
    32
    DESeq2 uses the consensus peak count matrix to identify differentially accessible regions between conditions. Results, volcano, MA, and PCA plots are written to results/peak_calling/differential_accessibility/.
    33
    # config.yaml
    differential_accessibility:
      params:
        fdr_threshold: 0.05
        log2fc_threshold: 1.0
      threads: 8
      resources:
        mem_mb: 16000
        time: 240
    
    34
    Footprinting — HINT-ATAC + TOBIAS
    35
    Two complementary footprinting tools run in parallel to assess transcription factor occupancy:
    36
    TOBIAS (BINDetect) produces bias-corrected BigWigs and per-motif differential binding scores:
    37
    # config.yaml
    tobias:
      output:
        corrected_bw: "results/peak_calling/tobias/corrected_bw"
        bias_track:   "results/peak_calling/tobias/bias_track"
        footprint_bw: "results/peak_calling/tobias/footprint_bw"
        bindetect:    "results/peak_calling/tobias/bindetect"
      params:
        genome_fa:    "data/reference/genome.fa"
        motif_db:     "data/motifs/jaspar_vertebrates.meme"
        conditions:   "condition"
      threads: 8
      resources:
        mem_mb: 16000
        time: 240
    
    38
    HINT-ATAC (via RGT) predicts footprints per sample and writes BED files to results/peak_calling/footprinting/footprints/{sample}_footprints.bed.
    39
    chromVAR Motif Deviations
    40
    chromVAR quantifies motif accessibility deviation scores across samples using the Tn5-shifted BAMs and the JASPAR motif database.
    41
    # config.yaml
    chromvar_analysis:
      output:
        deviations:      "results/peak_calling/chromvar/deviations"
        bias_corrected:  "results/peak_calling/chromvar/bias_corrected"
        plots:           "results/peak_calling/chromvar/plots"
      params:
        motif_db:    "data/motifs/jaspar_vertebrates.meme"
        genome_fa:   "data/reference/genome.fa"
        genome_sizes: "data/reference/genome.chrom.sizes"
      threads: 8
      resources:
        mem_mb: 16000
        time: 240
    

    Output Structure

    All outputs are written beneath results/ and are organised by analysis stage:
    results/
    ├── preprocessing/
    │   ├── fastp/          # Trimmed FASTQs and fastp JSON reports
    │   └── fastqc/         # FastQC HTML reports
    ├── alignment/
    │   └── bowtie2/        # Raw BAMs from Bowtie2
    ├── post_alignment/
    │   ├── samtools_sort/  # Coordinate-sorted BAMs
    │   ├── samtools_markdup/   # Duplicate-marked BAMs
    │   ├── samtools_view/  # MAPQ-filtered, blacklist-cleaned BAMs
    │   └── tn5_shift/      # Tn5-shifted, analysis-ready BAMs
    ├── metrics_qc/
    │   ├── tss_enrichment/
    │   ├── picard/
    │   └── cross_correlation/
    ├── qc_gate/            # {sample}_qc_pass.txt per sample
    ├── peak_calling/
    │   ├── macs2_peakcall/         # Raw narrowPeak files
    │   ├── filtered_peaks/         # Blacklist-filtered BEDs
    │   ├── idr/                    # IDR peaks, optimal sets, plots
    │   ├── consensus_peaks/        # Merged consensus BED + count matrix
    │   ├── differential_accessibility/  # DESeq2 results + plots
    │   ├── tobias/                 # TOBIAS corrected BigWigs + BINDetect
    │   ├── footprinting/           # HINT-ATAC footprint BEDs
    │   └── chromvar/               # chromVAR deviation TSVs + plots
    ├── visualization/
    │   ├── bigwig/
    │   ├── normalized_coverage/
    │   ├── heatmap/
    │   └── correlation_analysis/
    └── reporting/
        ├── multiqc/multiqc_report.html
        ├── benchmark_summary.tsv
        └── pipeline_execution_summary.json
    
    The benchmarks/ directory at the repository root records wall-clock time, CPU time, and peak memory for every individual job — useful for capacity planning on HPC clusters.

    Graceful Handling of Empty Peak Sets

    If a sample passes the QC gate but yields zero peaks after blacklist filtering, downstream rules (ChIPseeker annotations, heatmaps, TOBIAS) detect the empty BED file and produce valid stub outputs — empty DataFrames, blank PDFs — rather than crashing the DAG. The overall pipeline continues processing all other samples.

    Aggregated Reporting

    On every run, regardless of success or failure, the pipeline writes a machine-readable execution summary:
    results/reporting/pipeline_execution_summary.json
    
    This JSON includes per-rule CPU time, peak memory usage, and — on failure — the last five error lines extracted from logs/. The MultiQC report at results/reporting/multiqc/multiqc_report.html aggregates FastQC, Preseq, Picard, and QC gate metrics into a single browsable HTML file.

    Build docs developers (and LLMs) love