config.yaml — Full ATAC-seq Pipeline Configuration

The config.yaml file is the single source of truth for every parameter in the BDB-Genomics ATAC-seq pipeline. Every tool path, resource limit, QC threshold, and reference genome pointer is declared here. Snakemake rules are intentionally stateless — they read from this config at runtime and contain no hard-coded values. This separation means you can tune, extend, or override any aspect of the pipeline without ever touching a rule file.

YAML Anchors

The config uses YAML anchors (&NAME) and aliases (*NAME) to centralise all reference file paths. A path is declared exactly once and then referenced everywhere it is needed. Changing a reference path is therefore a single-line edit, and switching genome builds means updating this one block rather than every tool block that uses it.

global:
  references:
    genome_fa:      &GENOME_FA      "data/reference/genome.fa"
    genome_sizes:   &GENOME_SIZES   "data/reference/genome.chrom.sizes"
    bowtie2_index:  &BOWTIE2_INDEX  "data/reference/index/genome"
    chromap_index:  &CHROMAP_INDEX  "data/reference/chromap/genome.index"
    blacklist:      &BLACKLIST      "data/reference/ENCODE_blacklist.bed"
    annotation_gtf: &ANNOTATION_GTF "data/reference/annotation.gtf"
    motif_db:       &MOTIF_DB       "data/motifs/jaspar_vertebrates.meme"

Downstream tool blocks dereference these anchors with the * alias syntax:

bowtie2:
  params:
    index: *BOWTIE2_INDEX   # expands to "data/reference/index/genome"

remove_blacklist_reads:
  input:
    blacklist: *BLACKLIST    # expands to "data/reference/ENCODE_blacklist.bed"

All seven anchor names — &GENOME_FA, &GENOME_SIZES, &BOWTIE2_INDEX, &CHROMAP_INDEX, &BLACKLIST, &ANNOTATION_GTF, and &MOTIF_DB — are validated by validate_config.py at startup. A missing or inaccessible path fails the run before the DAG is built.
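
A minimal sketch of what such a startup check could look like, assuming the references block has already been parsed into a plain dictionary. The function name find_missing_references and the PREFIX_KEYS set are hypothetical and are not taken from validate_config.py; the Bowtie2 entry is skipped here because it is a prefix rather than a single file.

```python
import os

# Hypothetical: entries that hold an index prefix instead of one file,
# so a plain existence check does not apply to them.
PREFIX_KEYS = {"bowtie2_index"}


def find_missing_references(references):
    """Return the sorted names of entries whose path is missing or unreadable."""
    missing = []
    for name, path in references.items():
        if name in PREFIX_KEYS:
            continue  # prefixes need a pattern-based check instead
        if not (os.path.exists(path) and os.access(path, os.R_OK)):
            missing.append(name)
    return sorted(missing)


def fail_fast(references):
    """Raise before any work is scheduled if a reference is unavailable."""
    missing = find_missing_references(references)
    if missing:
        raise SystemExit("Missing reference files: " + ", ".join(missing))
```

Calling fail_fast(config["global"]["references"]) before the workflow is assembled would reproduce the fail-early behaviour described above.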

The `global` Section

The global block holds project-wide settings and the reference file registry.

global.mode (string, default: "bulk")

Pipeline modality. Accepted values: "bulk" or "scatac". Can be overridden at runtime with the ATAC_MODE environment variable without editing this file.
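
The precedence between the environment variable and the config value can be sketched as follows. This is an illustration under the assumption that a non-empty ATAC_MODE always wins; resolve_mode is a hypothetical helper, not a function from the pipeline.

```python
import os

VALID_MODES = ("bulk", "scatac")


def resolve_mode(config, environ=None):
    """Pick the pipeline mode: ATAC_MODE first, then global.mode, then "bulk"."""
    environ = os.environ if environ is None else environ
    mode = environ.get("ATAC_MODE") or config.get("global", {}).get("mode", "bulk")
    if mode not in VALID_MODES:
        raise ValueError(f"Unsupported mode {mode!r}; expected one of {VALID_MODES}")
    return mode
```

With this sketch, running with ATAC_MODE=scatac set in the shell would select the scATAC-seq modality even when the file still says "bulk".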

global.samples (string, required)

Relative path to the TSV sample sheet. Resolved relative to the config file’s directory, then the workflow root, then the current working directory.
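
The three-step lookup order can be sketched like this. resolve_sample_sheet and its argument names are hypothetical; only the search order (config directory, workflow root, current working directory) comes from the description above.

```python
from pathlib import Path


def resolve_sample_sheet(sheet, config_dir, workflow_root, cwd=None):
    """Return the first existing candidate, searching in the documented order."""
    bases = [
        Path(config_dir),                  # 1. directory containing config.yaml
        Path(workflow_root),               # 2. workflow root
        Path(cwd) if cwd else Path.cwd(),  # 3. current working directory
    ]
    for base in bases:
        candidate = base / sheet
        if candidate.is_file():
            return candidate
    searched = ", ".join(str(base) for base in bases)
    raise FileNotFoundError(f"{sheet} not found in: {searched}")
```

Because the config directory is searched first, a sample sheet placed next to config.yaml shadows one with the same name elsewhere.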

global.references.genome_fa (string, required)

Path to the reference FASTA. Used by Picard alignment metrics, peak annotation, TOBIAS, chromVAR, and footprinting.

global.references.genome_sizes (string, required)

Two-column chromosome sizes file (chromosome name, length in bp). Required by bedGraphToBigWig and TOBIAS.

global.references.bowtie2_index (string, required)

Bowtie2 index prefix (bulk mode). The validator checks for .bt2 or .bt2l files matching this prefix.
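
Because the value is a prefix and not a file, an existence check has to match index files by pattern. A minimal sketch, assuming the usual Bowtie2 naming where index files look like <prefix>.1.bt2 or <prefix>.rev.1.bt2l; bowtie2_index_exists is a hypothetical name and may differ from what the validator actually does.

```python
import glob


def bowtie2_index_exists(prefix):
    """True if at least one .bt2 or .bt2l file shares the given prefix."""
    # glob.escape keeps special characters in the prefix from acting as wildcards.
    escaped = glob.escape(prefix)
    return any(glob.glob(escaped + ".*" + ext) for ext in (".bt2", ".bt2l"))
```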

global.references.chromap_index (string, required)

Chromap index file path (scATAC-seq mode only).

global.references.blacklist (string, required)

ENCODE blacklist BED file. Applied during post-alignment filtering and again during peak blacklist filtering.

global.references.annotation_gtf (string, required)

GTF annotation file used for TSS enrichment scoring and peak annotation.

global.references.motif_db (string, required)

MEME-format motif database (e.g., JASPAR vertebrates). Used by motif analysis, TOBIAS, and chromVAR.

Per-Tool Block Schema

Every tool block in the config follows a uniform five-field schema. This consistency makes the config self-documenting and allows the validation script to discover required keys automatically by scanning rule files.

<tool_name>:
  input:   ...   # source directory or file path(s)
  output:  ...   # destination directory
  params:  ...   # tool-specific flags and thresholds
  threads: ...   # CPU allocation (positive integer)
  resources:     # scheduler resource requests
    mem_mb: ...  # memory in megabytes (positive integer)
    time:   ...  # wall-clock limit in minutes (positive integer)

fastp has no input key: raw FASTQ paths are resolved dynamically from the sample sheet at runtime.
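
The type constraints in the schema comments (positive integers for threads, mem_mb, and time) can be checked with a few lines of code. The sketch below is an illustration only; check_tool_block is a hypothetical helper and does not reproduce the pipeline's validation script.

```python
def _is_positive_int(value):
    """bool is a subclass of int in Python, so exclude it explicitly."""
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def check_tool_block(name, block):
    """Return a list of human-readable problems found in one tool block."""
    errors = []
    if "output" not in block:
        errors.append(f"{name}.output is missing")
    if not _is_positive_int(block.get("threads")):
        errors.append(f"{name}.threads must be a positive integer")
    resources = block.get("resources") or {}
    for key in ("mem_mb", "time"):
        if not _is_positive_int(resources.get(key)):
            errors.append(f"{name}.resources.{key} must be a positive integer")
    return errors
```

Returning a list instead of raising on the first problem lets a validator report every misconfigured block in one pass.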

Preprocessing

fastp

fastp:
  output: "results/preprocessing/fastp"
  params:
    trim_front1: 5
    trim_front2: 5
    length_required: 30
  threads: 4
  resources:
    mem_mb: 8000
    time: 120

fastqc

fastqc:
  input:
    R1: "results/preprocessing/fastp"
    R2: "results/preprocessing/fastp"
  output: "results/preprocessing/fastqc"
  threads: 4
  resources:
    mem_mb: 2000
    time: 30

Alignment

bowtie2

bowtie2:
  input: "results/preprocessing/fastp"
  output: "results/alignment/bowtie2"
  params:
    index: *BOWTIE2_INDEX
    sensitive: "--very-sensitive"
  threads: 8
  resources:
    mem_mb: 16000
    time: 240

Post-alignment

Key post-alignment tools and their defaults:

samtools_markdup

samtools_markdup:
  input:
    sorted_bam_noMT_fixmate: "results/post_alignment/samtools_fixmate"
  output:
    markdup_bam: "results/post_alignment/samtools_markdup"
  params:
    remove_duplicates: false
  threads: 4
  resources:
    mem_mb: 8000
    time: 120

samtools_view (MAPQ + flag filtering)

samtools_view:
  input:
    noMT_sorted_bam: "results/post_alignment/remove_mito_reads"
  output:
    filtered_bam: "results/post_alignment/samtools_view"
  params:
    MAPQ: 30
    flags: 3844
  threads: 2
  resources:
    mem_mb: 2000
    time: 30
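
The value 3844 is a SAM flag bitmask, and decoding it shows which read categories it covers. The sketch below uses the bit values from the SAM specification; that the rule passes this mask to samtools view as an exclusion filter (-F) is an assumption, since the rule file is not shown here.

```python
# SAM flag bits as defined in the SAM specification.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flags(mask):
    """List the names of all SAM flag bits set in the mask, lowest bit first."""
    return [name for bit, name in SAM_FLAG_BITS.items() if mask & bit]


print(decode_flags(3844))
# ['unmapped', 'secondary', 'qc_fail', 'duplicate', 'supplementary']
```

In other words, 3844 = 4 + 256 + 512 + 1024 + 2048.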

tn5_shift

tn5_shift:
  input:
    filtered_bam: "results/post_alignment/samtools_view"
  output:
    shifted_bam: "results/post_alignment/tn5_shift"
    shifted_bam_index: "results/post_alignment/tn5_shift"
  threads: 4
  resources:
    mem_mb: 4000
    time: 60

QC Gate

The QC gate block is the most commonly tuned section. It gates downstream peak calling and visualisation on four per-sample thresholds:

qc_gate:
  input:
    frip: "results/peak_calling/frip_calculation"
    tss:  "results/metrics_qc/tss_enrichment"
    stats: "results/post_alignment/samtools_stats"
  output: "results/qc_gate"
  params:
    min_frip:           0.2
    min_tss_enr:        7.0
    min_mapping_rate:   80.0
    max_duplicate_rate: 20.0
  threads: 1
  resources:
    mem_mb: 1000
    time: 10

qc_gate.params.min_frip (float, default: 0.2)

Minimum Fraction of Reads in Peaks (FRiP). Samples below this threshold are flagged as failing QC. Must be a non-negative float; because FRiP is a fraction, meaningful values lie between 0 and 1.

qc_gate.params.min_tss_enr (float, default: 7.0)

Minimum TSS enrichment score. A score ≥ 7 is commonly treated as high quality for bulk ATAC-seq, though the appropriate cutoff depends on the genome build and TSS annotation in use.

qc_gate.params.min_mapping_rate (float, default: 80.0)

Minimum overall mapping rate (%) from samtools stats.

qc_gate.params.max_duplicate_rate (float, default: 20.0)

Maximum duplicate read rate (%). Samples with higher duplication are flagged.
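
Taken together, the four thresholds amount to three lower bounds and one upper bound per sample. A minimal sketch of that comparison, assuming per-sample metrics have already been collected into a dictionary; the metric key names and the failed_checks helper are hypothetical.

```python
def failed_checks(metrics, params):
    """Return the names of the QC metrics that violate their threshold."""
    failures = []
    if metrics["frip"] < params["min_frip"]:
        failures.append("frip")
    if metrics["tss_enrichment"] < params["min_tss_enr"]:
        failures.append("tss_enrichment")
    if metrics["mapping_rate"] < params["min_mapping_rate"]:
        failures.append("mapping_rate")
    # Duplicate rate is the only upper bound among the four thresholds.
    if metrics["duplicate_rate"] > params["max_duplicate_rate"]:
        failures.append("duplicate_rate")
    return failures
```

A sample passes the gate when the returned list is empty; values exactly equal to a threshold pass in this sketch.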

Peak Calling

macs2

macs2:
  input:
    shifted_bam: "results/post_alignment/tn5_shift"
  output:
    peaks: "results/peak_calling/macs2_peakcall"
  params:
    genome_size: "hs"
    qvalue: 0.01
    nomodel: "--nomodel"
    format: "BAMPE"
  threads: 8
  resources:
    mem_mb: 16000
    time: 240
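
For orientation, these parameters correspond to standard macs2 callpeak options (-f, -g, -q, --nomodel). The sketch below shows one way a rule might assemble the command from the params block; the macs2_command helper, the sample name, and the file paths are hypothetical, and the pipeline's own rule may build the call differently.

```python
def macs2_command(bam, name, outdir, params):
    """Build a macs2 callpeak argument list from a macs2.params mapping."""
    cmd = [
        "macs2", "callpeak",
        "-t", bam,
        "-f", params["format"],       # BAMPE: paired-end BAM input
        "-g", params["genome_size"],  # "hs" is the MACS2 shortcut for human
        "-q", str(params["qvalue"]),
        "-n", name,
        "--outdir", outdir,
    ]
    if params.get("nomodel"):
        cmd.append(params["nomodel"])  # the config stores the literal flag
    return cmd
```

Returning a list instead of a single string avoids shell-quoting problems if the command is later passed to subprocess.run.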

idr

idr:
  output:
    idr_peaks:     "results/peak_calling/idr/idr_peaks"
    optimal_peaks: "results/peak_calling/idr/optimal_peaks"
    plots:         "results/peak_calling/idr/plots"
  params:
    idr_threshold: 0.05
    rank_column:   "score"
  threads: 4
  resources:
    mem_mb: 4000
    time: 60

consensus_peaks

consensus_peaks:
  output:
    consensus: "results/peak_calling/consensus_peaks"
    counts:    "results/peak_calling/consensus_peaks"
  params:
    min_samples:     2
    merge_distance: 100
  threads: 4
  resources:
    mem_mb: 8000
    time: 120

differential_accessibility

differential_accessibility:
  output:
    results: "results/peak_calling/differential_accessibility"
    plots:   "results/peak_calling/differential_accessibility/plots"
  params:
    fdr_threshold:    0.05
    log2fc_threshold: 1.0
  threads: 8
  resources:
    mem_mb: 16000
    time: 240

scATAC-seq Tools

chromap

chromap:
  input: "results/preprocessing/fastp"
  output: "results/alignment/chromap"
  params:
    index: *CHROMAP_INDEX
    preset: "atac"
  threads: 16
  resources:
    mem_mb: 32000
    time: 120

archr

archr:
  input:
    bam: "results/alignment/chromap"
  output:
    arrow:          "results/scatac/archr/arrow"
    filtered_arrow: "results/scatac/archr/filtered_arrow"
    fragments:      "results/scatac/archr/fragments"
    clusters:       "results/scatac/archr/clusters"
    markers:        "results/scatac/archr/markers"
    doublets:       "results/scatac/archr/doublets"
    plots:          "results/scatac/archr/plots"
    qc_report:      "results/scatac/archr/qc_report"
  params:
    min_tss:                4.0
    min_frags:              1000
    max_frags:              100000
    tsse_method:            "ArchR"
    doublet_threshold:      0.2
    clustering_resolution:  0.8
    dims_to_use:            "1:30"  # quoted so YAML 1.1 parsers do not read 1:30 as a base-60 integer
    force_dim_reduction:    true
  threads: 16
  resources:
    mem_mb: 64000
    time: 240
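
The dims_to_use value follows R's inclusive range notation. If a Python-side script ever needs the individual dimensions, the string can be expanded as sketched below. This assumes the value arrives as the string "1:30"; parse_dims is a hypothetical helper and is not part of the pipeline.

```python
def parse_dims(spec):
    """Expand an R-style inclusive range such as "1:30" into a list of ints."""
    start_text, end_text = spec.split(":")
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError(f"Invalid dimension range: {spec!r}")
    return list(range(start, end + 1))
```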

Dynamic Config Overrides

Snakemake supports layered config loading: values in a second --configfile argument override matching keys in the first. This is useful for parameter sweeps, CI testing, or per-project adjustments without modifying the canonical config.yaml.

snakemake --configfile config.yaml custom_override.yaml

Example: Loosening QC Gate Thresholds for Synthetic Data

When running the pipeline on synthetic or downsampled CI data, FRiP scores and TSS enrichment values typically fall well below production thresholds. Create an override file to relax these gates without altering the main config:

# ci_override.yaml
qc_gate:
  params:
    min_frip:     0.01
    min_tss_enr:  1.0
    min_mapping_rate: 30.0
    max_duplicate_rate: 80.0

Then run:

snakemake --configfile config.yaml ci_override.yaml --profile profile/test

Only keys present in the override file are replaced. All other keys — including reference paths and tool parameters — remain exactly as declared in config.yaml. Never commit a relaxed override file as the default config.
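
The merge behaviour described here is a recursive dictionary update: nested mappings are merged key by key, and anything absent from the override is left alone. The sketch below illustrates those semantics with a hypothetical merge_config function; it is not Snakemake's own implementation.

```python
import copy


def merge_config(base, override):
    """Return a new config where override values replace matching base keys."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)  # descend into nested blocks
        else:
            merged[key] = copy.deepcopy(value)              # replace leaf values
    return merged


base = {"qc_gate": {"params": {"min_frip": 0.2, "min_tss_enr": 7.0}, "threads": 1}}
ci = {"qc_gate": {"params": {"min_frip": 0.01}}}
merged = merge_config(base, ci)
# merged["qc_gate"]["params"] now holds min_frip 0.01 and the untouched min_tss_enr 7.0
```

Copying instead of updating in place keeps the canonical config object unchanged, which is convenient when comparing several override files in one session.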

Adding a New Tool

The config ships a boilerplate template block at the bottom of config.yaml. Copy it, replace the placeholder names, and add the matching rule file and include: directive in the Snakefile:

template_category:
  template_tool:
    input:  "results/preprocessing/fastp"
    output: "results/template_category/template_tool"
    params:
      message: "This is a boilerplate template."
    threads: 1
    resources:
      mem_mb: 1000
      time: 10
