This changelog records all notable changes to the BDB-Genomics ATAC-seq framework. Versions follow a MAJOR.MINOR.PATCH scheme — major versions introduce breaking changes or large new feature sets, minor versions add backward-compatible functionality, and patch versions fix bugs without altering the public interface. Each release section is expanded below with full Added, Changed, and Fixed detail.

Added

  • Global Wildcard Constraints: Added regex constraints for {sample} ([^/]+), {condition} ([^/]+), and {replicate} ([0-9]+) at the global level in the Snakefile to prevent ambiguous path matching and ensure robust DAG resolution.
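
To illustrate why these constraints matter, here is a minimal Python sketch of how a wildcard template becomes a regular expression. The three constraint patterns come from the entry above. The path template and the `template_to_regex` helper are hypothetical and are not taken from the Snakefile. Snakemake's default for an unconstrained wildcard is `.+`, which can span directory separators.

```python
import re

# Constraint patterns quoted in the changelog entry above.
CONSTRAINTS = {"sample": r"[^/]+", "condition": r"[^/]+", "replicate": r"[0-9]+"}

# Hypothetical output template, used only for illustration.
TEMPLATE = "results/{condition}/{sample}_rep{replicate}.bam"


def template_to_regex(template, constraints=None):
    """Turn a '{wildcard}' path template into an anchored regex.

    Wildcards without a constraint fall back to '.+', which may match '/'.
    """
    constraints = constraints or {}
    # re.split with a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)  # literal path text
        else:
            pattern += f"(?P<{part}>{constraints.get(part, '.+')})"
    return re.compile(f"^{pattern}$")


loose = template_to_regex(TEMPLATE)
strict = template_to_regex(TEMPLATE, CONSTRAINTS)

nested = "results/treated/batch1/s1_rep2.bam"
print(loose.match(nested).groupdict())   # condition swallows a subdirectory
print(strict.match(nested))              # no match once '/' is excluded
```

With the constraints applied, a path can only match when each wildcard stays within a single path component and the replicate is numeric, so a given file maps to at most one set of wildcard values for this template.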

Changed

  • calculate_mito_reads.smk: Refactored to include .bam.bai as an official Snakemake input dependency instead of executing an inline indexing command in the shell directive, preventing potential job race conditions.
  • bedtools_genomecov.smk: Modified to dynamically ignore bulk QC (qc_pass file) when the pipeline is switched to scatac mode, allowing scATAC-seq target coverage rules to execute seamlessly.

Fixed

  • tobias.smk: Corrected TOBIAS BINDetect argument from --bam to --signals to prevent runtime crashes during bias-corrected TF footprinting.
  • bedtools.yaml: Added missing samtools dependency to the 03_post_alignment/bedtools conda environment to resolve runtime failures in FRiP calculation.
  • archr.smk: Restored the missing Galaxy Project Singularity container directive for the archr_doublet_detection rule.
  • fastp.smk: Restored missing whitespace in threads: config["fastp"]["threads"] directive.
  • samtools_fixmate.smk: Normalized log and benchmark extensions (.err and .txt) to maintain consistency with the global framework logging convention.
  • Snakefile: Standardized CI block to define SAMPLES_TSV = None under IS_CI mode, avoiding a potential NameError on empty sample checks.
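
The Snakefile fix in the last entry follows a general pattern: bind the name on every branch so later checks never hit an unbound variable. The sketch below shows that pattern in plain Python. The `CI` environment variable, the `samples_tsv` config key, and both function names are assumptions for illustration; the changelog only confirms the names `SAMPLES_TSV` and `IS_CI`.

```python
import os


def resolve_samples_tsv(config, environ=None):
    """Return (is_ci, samples_tsv) with samples_tsv bound on every branch."""
    environ = os.environ if environ is None else environ
    # Assumed CI detection; the real Snakefile may use a different signal.
    is_ci = environ.get("CI", "").lower() == "true"
    if is_ci:
        samples_tsv = None  # explicit placeholder instead of leaving it undefined
    else:
        samples_tsv = config["samples_tsv"]  # hypothetical config key
    return is_ci, samples_tsv


def has_sample_sheet(samples_tsv):
    """Safe emptiness check: works for None as well as for a real path."""
    return samples_tsv is not None and os.path.exists(samples_tsv)


is_ci, samples_tsv = resolve_samples_tsv({}, {"CI": "true"})
print(is_ci, samples_tsv, has_sample_sheet(samples_tsv))
```

Without the `samples_tsv = None` assignment, any later reference to the name in CI mode would raise a NameError, which is the failure the entry describes.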

Added

  • CLI Mode Switching: ATAC_MODE=bulk or ATAC_MODE=scatac environment variable switches the entire pipeline between bulk and single-cell modes — no manual file editing required.
  • Chromap Alignment: Fast single-cell ATAC-seq aligner with --preset atac for barcode-aware alignment of scATAC-seq reads.
  • ArchR Pipeline: Arrow file creation, doublet detection and filtering, iterative LSI dimensionality reduction, UMAP clustering, and marker gene identification.
  • Cicero Co-accessibility: Chromatin co-accessibility network analysis, CCAN identification, and connection scoring.
  • scATAC-seq Conda Environments: Dedicated chromap, archr, and cicero conda environments with all required dependencies.
  • scATAC-seq Config Blocks: New chromap, archr, and cicero configuration sections added to config.yaml.
  • global.mode: New top-level config key ("bulk" or "scatac") for declarative modality selection without an environment variable.
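
The two selection mechanisms above (the ATAC_MODE environment variable and the global.mode config key) could be reconciled as in the following sketch. The precedence order (environment variable first, then config, then a default of "bulk") and the `resolve_mode` function are assumptions; the changelog does not say how the Snakefile resolves a conflict between the two.

```python
import os

VALID_MODES = ("bulk", "scatac")


def resolve_mode(config, environ=None):
    """Pick the pipeline mode from ATAC_MODE or config['global']['mode'].

    Assumed precedence: environment variable, then config key, then 'bulk'.
    """
    environ = os.environ if environ is None else environ
    mode = (
        environ.get("ATAC_MODE")
        or config.get("global", {}).get("mode")
        or "bulk"
    )
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return mode


# Config asks for bulk, but the environment variable overrides it in this sketch.
print(resolve_mode({"global": {"mode": "bulk"}}, {"ATAC_MODE": "scatac"}))
```

Rejecting unknown values early keeps a typo such as "scATAC" from silently falling through to the wrong rule set.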

Changed

  • Snakefile: Conditional rule includes based on the MODE variable — bulk and scATAC-seq rule sets are now mutually exclusive, preventing DAG conflicts.
  • README: Complete rewrite of the scATAC-seq section — now a single command switch instead of requiring manual file edits.
  • Comparison Table: Added scATAC-seq, Cicero, and mode switching rows to the feature comparison table.

Added

  • TOBIAS Footprinting Suite: Full TOBIAS pipeline (ATACorrect → ScoreBigwig → BINDetect) for bias-corrected transcription factor footprinting and differential TF binding analysis across conditions.
  • Low-Resource Profile: profile/low_resource/ Snakemake profile for machines with ≤ 8 GB RAM and ≤ 4 CPU cores.
  • Sequential Sample Batching: run_batched.py script for ultra-low-resource machines (≤ 4 GB RAM) that processes samples one or a few at a time to avoid out-of-memory failures.
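
The idea behind sequential batching can be sketched in a few lines of Python. This is not the code of run_batched.py, whose interface is not shown in this changelog; `batch_samples` and the sample names are hypothetical.

```python
def batch_samples(samples, batch_size=1):
    """Yield consecutive slices of at most batch_size samples."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Hypothetical sample names; each batch would be processed to completion
# before the next one starts, which bounds peak memory use.
for batch in batch_samples(["s1", "s2", "s3"], batch_size=2):
    print("processing", batch)
```

Smaller batches trade total wall-clock time for a lower memory ceiling, since fewer samples are resident at any one time.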

Added

  • IDR Replicate Concordance: Irreproducible Discovery Rate analysis for validating peak reproducibility between biological replicates.
  • NSC/RSC Cross-Correlation: ENCODE-compliant strand cross-correlation metrics via phantompeakqualtools.
  • Consensus Peak Calling: Multi-sample peak merging with a configurable minimum sample threshold (min_samples: 2) and merge distance (merge_distance: 100 bp).
  • Differential Accessibility: DESeq2-based differential chromatin accessibility analysis with volcano plots, MA plots, PCA plots, and heatmaps.
  • Peak Count Matrix: bedtools-based read counting in consensus peaks for all samples, producing the count matrix required by DESeq2.
  • Benchmark Aggregation: Multi-rule performance summary aggregating wall-clock time, CPU time, and peak memory across all pipeline stages.
  • Test Profile: profile/test/ Snakemake profile for CI validation with auto-generated minimal test data.
  • Test Data Generator: generate_test_data.py creates minimal FASTQ, reference genome, and annotation files for integration testing.
  • CI/CD Pipeline: Two-stage GitHub Actions workflow (lint + test) with micromamba environment setup and artifact upload.
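
As an illustration of the Consensus Peak Calling entry above, the sketch below merges peaks that lie within merge_distance of each other and keeps merged regions supported by at least min_samples samples. It is a pure-Python approximation of the idea, not the pipeline's implementation; the function name, the input layout, and the example coordinates are all hypothetical.

```python
def consensus_peaks(peaks_by_sample, min_samples=2, merge_distance=100):
    """Merge nearby peaks across samples and filter by sample support.

    peaks_by_sample maps a sample name to a list of (chrom, start, end).
    Returns a list of (chrom, start, end, n_supporting_samples).
    """
    tagged = sorted(
        (chrom, start, end, sample)
        for sample, peaks in peaks_by_sample.items()
        for chrom, start, end in peaks
    )
    merged = []  # each entry: [chrom, start, end, set of supporting samples]
    for chrom, start, end, sample in tagged:
        last = merged[-1] if merged else None
        # Merge when on the same chromosome and the gap is within merge_distance.
        if last and last[0] == chrom and start - last[2] <= merge_distance:
            last[2] = max(last[2], end)
            last[3].add(sample)
        else:
            merged.append([chrom, start, end, {sample}])
    return [
        (chrom, start, end, len(samples))
        for chrom, start, end, samples in merged
        if len(samples) >= min_samples
    ]


example = {
    "s1": [("chr1", 100, 200), ("chr1", 1000, 1100)],
    "s2": [("chr1", 250, 350)],
    "s3": [("chr2", 100, 200)],
}
print(consensus_peaks(example))
```

In this example only the first chr1 region survives: the s1 and s2 peaks are 50 bp apart and merge into one region with two supporting samples, while the remaining peaks each have a single supporting sample.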

Changed

  • QC Gate Enforcement: Downstream rules (macs2, bedtools_genomecov, heatmap, peak_annotation, normalize_coverage) now declare {sample}_qc_pass.txt as a required input, enforcing the QC gate dependency throughout the DAG.
  • motif_analysis: Refactored to per-sample execution using the HOMER assembly name instead of a FASTA path.
  • cross_correlation: Promoted from optional to standard ENCODE-compliant QC metric included in every bulk run.
  • README: Complete rewrite with a feature comparison table covering all pipeline stages and analysis modes.
  • Version: Bumped to V2.0.0 for the production-grade feature set.
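
The QC gate in the first entry of this list can be pictured as a marker file that exists only for samples that pass QC, so any rule declaring it as an input cannot run without it. The sketch below models that in plain Python. The threshold values, the metric file contents, and both function names are placeholders; the changelog names only the {sample}_qc_pass.txt file and the TSS enrichment and FRiP metrics.

```python
from pathlib import Path

# Placeholder thresholds; the changelog does not state the real cut-offs.
MIN_TSS_ENRICHMENT = 5.0
MIN_FRIP = 0.2


def write_qc_marker(sample, tss_enrichment, frip, qc_dir):
    """Create <sample>_qc_pass.txt only when both metrics clear the thresholds."""
    marker = Path(qc_dir) / f"{sample}_qc_pass.txt"
    if tss_enrichment >= MIN_TSS_ENRICHMENT and frip >= MIN_FRIP:
        marker.write_text(f"tss_enrichment={tss_enrichment}\nfrip={frip}\n")
        return marker
    return None


def downstream_ready(sample, qc_dir):
    """A gated rule can only run once its qc_pass input exists."""
    return (Path(qc_dir) / f"{sample}_qc_pass.txt").exists()
```

For example, calling `write_qc_marker("s1", 8.1, 0.35, qc_dir)` would create the marker under these placeholder thresholds, while a sample with a FRiP of 0.05 would leave no file behind and `downstream_ready` would return False for it.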

Fixed

  • fastp.yaml: Invalid version 1.3.3 corrected to 0.24.0.
  • tss_enrichment.yaml: Added 7 missing Bioconductor packages required by the TSS enrichment R script.
  • fragment_size_analysis.smk: Corrected the conda environment reference, which pointed at the wrong environment (samtools → fragment_analysis with R).
  • frip_calculation.smk: Removed chromosome prefix normalisation (sed 's/^chr//g') that caused mismatches with MACS2 peak chromosome names.
  • preseq.smk: Removed || true failure silencing that was masking real preseq errors.
  • samtools_sort.smk: Moved log redirection to a separate line for correct shell behaviour.
  • samtools_fixmate.smk: Added set -o pipefail to catch errors in piped commands.
  • bowtie2.smk: Replaced hardcoded --very-sensitive flag with the config["bowtie2"]["params"]["sensitive"] config reference.
  • blacklist_filter.smk: Removed fragile awk chromosome-prefix normalisation logic.
  • remove_mito_reads.smk: Switched from regex matching to exact chromosome name matching to prevent accidental exclusion of non-mitochondrial contigs.
  • tss_enrichment.R: Removed leftover DEBUG print statement.
  • validate_config.py: Corrected "ChIP-seq" to "ATAC-seq" in the module docstring.
  • .gitignore: Removed invalid .../ directory glob syntax.
  • bedtools.yaml / samtools.yaml: Added missing bc dependency for shell arithmetic in FRiP calculation.
  • profile/slurm/config.yaml: Replaced the placeholder SLURM account name and added latency-wait for NFS-mounted shared filesystems.
  • .github/workflows/lint.yml: Fixed YAML indentation errors, added a micromamba setup step, and pinned the pulp version to resolve a dependency conflict.
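
The remove_mito_reads.smk entry above swaps pattern matching for exact name matching. The Python sketch below shows why that matters. The contig names, the loose pattern, and the set of mitochondrial names are hypothetical; the changelog does not show the original regex or the names the rule now matches.

```python
import re

# Hypothetical contig names, including one that merely resembles chrM.
CONTIGS = ["chr1", "chrM", "chrM_like_scaffold1", "chr2"]

# Assumed set of mitochondrial contig names.
MITO_NAMES = {"chrM", "MT"}


def keep_by_regex(names, pattern="chrM"):
    """Loose filter: drops every contig whose name contains the pattern."""
    return [name for name in names if not re.search(pattern, name)]


def keep_by_exact_name(names, mito_names=MITO_NAMES):
    """Strict filter: drops only contigs whose full name is mitochondrial."""
    return [name for name in names if name not in mito_names]


print(keep_by_regex(CONTIGS))       # also loses chrM_like_scaffold1
print(keep_by_exact_name(CONTIGS))  # removes chrM only
```

Exact matching makes the set of removed contigs explicit, so it cannot grow unexpectedly when a reference assembly adds similarly named scaffolds.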

Added

  • Production-Ready Architecture: Implemented a fully reactive and modular Snakemake framework with deterministic DAG resolution.
  • Dynamic Configuration: Migrated all target paths in the Snakefile to dynamic config.yaml references, ensuring complete portability across compute environments without hard-coded paths.
  • QC Gating: Integrated a biological checkpoint system to validate TSS enrichment and FRiP scores before expensive downstream analysis stages are triggered.
  • Lifecycle Hooks: Added onstart, onsuccess, and onerror Snakemake handlers for automated status reporting and JSON summary generation.
  • Proactive Validation: Integrated validate_config.py invocation at DAG-build time to surface schema errors before any jobs are submitted.
Core processes included in the first release:
  • Preprocessing: fastp, FastQC
  • Alignment: Bowtie2, samtools sorting, indexing, and deduplication
  • Post-Alignment QC: Mitochondrial read quantification, fragment size analysis, TSS enrichment, phantompeakqualtools, Preseq, Qualimap
  • Coverage and normalisation: Genome coverage, BigWig conversion, CPM normalisation
  • Peak calling and filtering: MACS2, ENCODE blacklist filtering
  • Visualisation: Heatmaps, motif analysis, correlation plots
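
The Proactive Validation entry above describes checking the configuration before any jobs are submitted. A minimal sketch of that kind of check follows. It is not the code of validate_config.py; the required key paths are simply ones mentioned elsewhere in this changelog, and the real schema is not shown here.

```python
# Key paths that appear elsewhere in this changelog; the real schema may differ.
REQUIRED_KEYS = [
    ("global", "mode"),
    ("fastp", "threads"),
    ("bowtie2", "params", "sensitive"),
]


def find_missing_keys(config, required=REQUIRED_KEYS):
    """Return dotted paths of required keys that are absent from config."""
    missing = []
    for path in required:
        node = config
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


def validate_or_exit(config):
    """Abort with a readable message before any job is scheduled."""
    missing = find_missing_keys(config)
    if missing:
        raise SystemExit("config.yaml is missing: " + ", ".join(missing))


partial = {"global": {"mode": "bulk"}, "fastp": {"threads": 4}}
print(find_missing_keys(partial))
```

Collecting every missing key before exiting lets a user fix the whole config in one pass instead of rerunning after each error.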

Changed

  • Standardized Directives: Enforced a uniform 10-directive layout within every rule across all 34 .smk rule files (input, output, params, log, benchmark, conda, container, threads, resources, shell).
  • Global Containerization: Switched all rules to use stable Singularity containers via Biocontainers so that every rule runs in the same software environment across compute environments.
  • Environment Hierarchy: Refactored rules/envs/ into a stage-based hierarchical directory structure matching the six pipeline stages.
  • Cleaned Root Directory: Removed legacy scripts, runtime artefacts, and unused directories (benchmarks/, scratch/, scripts/) from the repository root.

Fixed

  • Resolved redundant and missing include: statements in the main Snakefile.
  • Corrected motif_analysis output directory resolution issues causing downstream rules to fail.
  • Standardized log and benchmark file paths across the entire framework for consistent --report generation.
