Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/BDB-Genomics/atacseq-pipeline/llms.txt

Use this file to discover all available pages before exploring further.

The final stage of the BDB-Genomics ATAC-seq pipeline converts the processed BAM and peak files into genome-browser-ready tracks, publication-quality plots, and a comprehensive aggregated quality report. Every visualization rule is fully decoupled from the upstream analysis — it reads from clearly defined output directories and writes to results/visualization/ and results/reporting/. On pipeline completion, the onsuccess hook automatically calls aggregate_logs.py to produce a machine-readable execution summary that records per-rule CPU time, peak memory, and any error lines encountered during the run.

Bedgraph Generation

The visualization track begins with bedtools_genomecov, which converts the Tn5-shifted BAM into a bedgraph coverage file using the -bg flag (output only positions with non-zero coverage):
# config.yaml
bedtools_genomecov:
  input:
    shifted_bam: "results/post_alignment/tn5_shift"
  output:
    bedgraph: "results/visualization/bedtools_genomecov"
  params:
    extra: "-bg"
  threads: 4
  resources:
    mem_mb: 4000
    time: 60
Output: results/visualization/bedtools_genomecov/{sample}.bedGraph

Coordinate-Sorted Bedgraph

bedGraphToBigWig requires its input to be sorted by chromosome and start position. A dedicated sort rule uses the Unix sort -k1,1 -k2,2n convention to produce a coordinate-ordered bedgraph:
# config.yaml
sorted_bedgraph:
  input:
    bedgraph: "results/visualization/bedtools_genomecov"
  output:
    sorted_bedgraph: "results/visualization/sorted_bedgraph_file"
  threads: 4
  resources:
    mem_mb: 4000
    time: 60
Output: results/visualization/sorted_bedgraph_file/{sample}.sorted.bedGraph

BigWig Conversion

The sorted bedgraph is converted to the indexed BigWig binary format using UCSC’s bedGraphToBigWig. BigWig files are orders of magnitude faster to load in genome browsers (IGV, UCSC) because they store pre-computed summary statistics at multiple zoom levels:
# config.yaml
bigwig:
  input:
    sorted_bedgraph: "results/visualization/sorted_bedgraph_file"
  output:
    bigwig: "results/visualization/bigwig"
  params:
    genome: "data/reference/genome.chrom.sizes"   # *GENOME_SIZES anchor
  threads: 4
  resources:
    mem_mb: 4000
    time: 60
# rules/bigwig.smk
bedGraphToBigWig \
    {input.sorted_bedgraph} \
    {params.genome} \
    {output.bigwig} \
    2> {log}
Output: results/visualization/bigwig/{sample}.bw

CPM-Normalized Coverage

Raw read-depth BigWigs are not comparable across samples with different sequencing depths. The normalize_coverage rule uses deepTools bamCoverage to produce a CPM (counts per million) normalized BigWig directly from the Tn5-shifted BAM:
# config.yaml
normalized_coverage:
  input:
    shifted_bam: "results/post_alignment/tn5_shift"
  output:
    normalized_coverage: "results/visualization/normalized_coverage"
  params:
    method: "CPM"
  threads: 4
  resources:
    mem_mb: 8000
    time: 120
Output: results/visualization/normalized_coverage/{sample}_CPM.bw The method parameter maps directly to deepTools’ --normalizeUsing flag, so switching to RPKM or BPM requires only a config edit:
normalized_coverage:
  params:
    method: "RPKM"   # alternatives: BPM, RPKM, CPM, None

Sample Correlation Analysis

correlation_analysis uses deepTools multiBigwigSummary (binning the genome into 1 kb windows) and plotCorrelation to compute pairwise Pearson or Spearman correlation between all sample BigWigs and produce a heatmap:
# config.yaml
correlation_analysis:
  input:
    bigwig: "results/visualization/bigwig"
  output: "results/visualization/correlation_analysis"
  params:
    bin_size: 1000
  threads: 4
  resources:
    mem_mb: 8000
    time: 120
Output: results/visualization/correlation_analysis/correlation_heatmap.png This heatmap is one of the most informative QC outputs for multi-sample experiments — replicates should cluster together with correlation > 0.9 for high-quality ATAC-seq libraries.

TSS Enrichment Heatmap

The heatmap rule uses deepTools computeMatrix reference-point centered on TSS positions (± 3 kb) and plotHeatmap with the coolwarm colormap to visualize Tn5-cut-site enrichment around transcription start sites across all samples:
# config.yaml
heatmap:
  input:
    filtered_peaks: "results/peak_calling/filtered_peaks"
    bigwig:         "results/visualization/bigwig"
  output:
    plot:    "results/visualization/heatmap/plot"
    matrix:  "results/visualization/heatmap/matrix"
    regions: "results/visualization/heatmap"
  threads: 8
  resources:
    mem_mb: 16000
    time: 240
  params:
    color:      "coolwarm"
    upstream:   3000
    downstream: 3000
The rule is implemented as a Python script (rules/scripts/run_heatmap.py) that calls computeMatrix and plotHeatmap in sequence:
1

computeMatrix

Bins signal from the BigWig in ±3 kb windows centered on each peak:
computeMatrix reference-point \
    --referencePoint center \
    --upstream   3000 \
    --downstream 3000 \
    --scoreFileName {input.bigwig} \
    --regionsFileName {input.filtered_peaks} \
    --outFileName {output.matrix}
2

plotHeatmap

Renders the matrix as a PDF heatmap with the coolwarm colormap:
plotHeatmap \
    -m {output.matrix} \
    -o {output.plot} \
    --colorMap coolwarm
Output: results/visualization/heatmap/plot/{sample}_tss_heatmap.pdf
The heatmap rule takes {sample}_qc_pass.txt as an explicit input. If a sample fails the QC gate, the rule still triggers (Snakemake requires the output), but the script reads the sentinel file and produces a blank placeholder PDF so the DAG can complete without error.

MultiQC Aggregated Report

MultiQC aggregates outputs from every QC-producing tool in the pipeline into a single HTML report. The multiqc rule collects:
SourceFiles Collected
FastQC*_fastqc.zip (R1 and R2 per sample)
fastp{sample}.json
samtools stats{sample}_postFiltering.stats.txt
Picard CollectAlignmentSummaryMetrics{sample}.alignment_metrics.txt
Picard CollectInsertSizeMetrics{sample}.insert_metrics.txt, {sample}.insert_histogram.pdf
Preseq{sample}.ccurve.txt
Qualimap{sample}_qualimap_report/
QC Gate{sample}_qc_pass.json
# config.yaml
multiqc:
  output: "results/reporting/multiqc"
  threads: 2
  resources:
    mem_mb: 4000
    time: 30
  params:
    config: "rules/config/multiqc_config.yaml"
multiqc {input} \
    -f \
    -o {params.out_dir} \
    -c {params.config} \
    --title "ATAC-seq Pipeline QC Report" \
    --comment "Comprehensive quality control metrics for ATAC-seq analysis" \
    2> {log}
Output: results/reporting/multiqc/multiqc_report.html
The MultiQC config at rules/config/multiqc_config.yaml is pre-configured to parse the _qc_pass.json files from the QC gate and display them as a custom-content table in the report. No MultiQC plugin installation is required.

Benchmark Summary

The benchmark_summary rule collects Snakemake benchmark files from 16 core rules (plus IDR and mito-read rules) and compiles them into a single TSV with columns for rule name, sample, runtime in seconds, peak memory in MB, and CPU percentage:
# rules/benchmark_summary.smk — inputs
benchmarks_per_sample = expand(
    "benchmarks/{rule}/{sample}.txt",
    rule=[
        "fastp", "fastqc", "bowtie2", "samtools_sort",
        "samtools_markdup", "samtools_view", "tn5_shift",
        "macs2", "frip", "tss_enrichment", "cross_correlation",
        "motif_analysis", "bedtools_genomecov", "bigwig",
        "heatmap", "footprinting"
    ],
    sample=SAMPLES
)
echo -e "Rule\tSample\tRuntime_s\tMemory_MB\tCPU_Percent" > {output.summary}

for f in {input...}; do
    rule=$(echo "$f" | sed 's|benchmarks/||' | cut -d'/' -f1)
    sample=$(echo "$f" | sed 's|.txt||' | rev | cut -d'/' -f1 | rev)
    [ -f "$f" ] && {
        runtime=$(head -n 2 "$f" | tail -n 1 | cut -f1)
        mem=$(head -n 2 "$f" | tail -n 1 | cut -f2)
        cpu=$(head -n 2 "$f" | tail -n 1 | cut -f3)
        echo -e "${rule}\t${sample}\t${runtime}\t${mem}\t${cpu}" >> {output.summary}
    }
done
Output: results/reporting/benchmark_summary.tsv

Pipeline Execution Summary

On every run — success or failure — the onsuccess/onerror lifecycle hook calls aggregate_logs.py to write a structured JSON execution summary:
{
    "timestamp": "2026-01-15T14:32:07",
    "status": "success",
    "performance_metrics": [
        { "s": "342.1", "h:m:s": "0:05:42", "max_rss": "14823", "rule": "bowtie2" },
        { "s": "287.4", "h:m:s": "0:04:47", "max_rss": "13411", "rule": "macs2" }
    ],
    "errors": []
}
On failure, the errors array contains the last five error-matching lines from each affected log file (filtered to exclude false positives such as “0 errors”), enabling rapid post-mortem diagnosis without manually grep-ing through the logs/ directory. Output: results/reporting/pipeline_execution_summary.json

Complete Visualization Output Summary

Raw BigWig

results/visualization/bigwig/{sample}.bw

CPM-Normalized BigWig

results/visualization/normalized_coverage/{sample}_CPM.bw

Correlation Heatmap

results/visualization/correlation_analysis/correlation_heatmap.png

TSS Heatmap

results/visualization/heatmap/plot/{sample}_tss_heatmap.pdf

MultiQC Report

results/reporting/multiqc/multiqc_report.html

Benchmark Summary

results/reporting/benchmark_summary.tsv

Execution Summary

results/reporting/pipeline_execution_summary.json

Bedgraph

results/visualization/sorted_bedgraph_file/{sample}.sorted.bedGraph

Build docs developers (and LLMs) love