run_batched.py is a memory-efficiency wrapper for the BDB-Genomics ATAC-seq pipeline designed for machines where running all samples concurrently would exhaust available RAM. Instead of letting Snakemake schedule all samples in parallel, this script reads the sample sheet, divides the samples into sequential batches of a fixed size, and invokes Snakemake once per batch — each batch targeting a specific set of final output files for just those samples. Because each Snakemake invocation finishes completely before the next begins, peak memory usage is bounded by a single batch rather than the full sample set. After all batches complete, the script runs a final Snakemake invocation to generate the MultiQC aggregation report.
Arguments
Number of samples to include in each sequential Snakemake batch. A batch size of 1 runs one sample at a time (minimum memory footprint). A batch size of 2 runs two samples concurrently within each batch invocation.
Recommended values:
- ≤ 4 GB RAM: --batch-size 1
- 4–8 GB RAM: --batch-size 2
- 8–16 GB RAM: --batch-size 4
Number of CPU cores passed to each Snakemake invocation via --cores. Applies to every batch and to the final MultiQC aggregation run.
Memory limit in megabytes passed to each Snakemake invocation via --resources mem_mb={memory}. Snakemake uses this to prevent scheduling rules whose resources.mem_mb would exceed the limit.
Pipeline modality: "bulk" or "scatac". When not set, the script reads from the ATAC_MODE environment variable, then falls back to global.mode in config.yaml. Any other value causes the script to exit with an error.
Path to the pipeline configuration YAML file, relative to the pipeline root. The script loads this file to resolve output paths for each tool when constructing per-batch target file lists.
Path to the TSV sample sheet, relative to the pipeline root. The script reads the sample column to produce the ordered list of sample names that is then divided into batches.
Conda frontend passed to Snakemake via --conda-frontend. Accepted values: "conda" or "mamba".
When supplied, prints the batch plan (which samples are in each batch and how many batches total) without executing any Snakemake commands. Useful for verifying batch assignments before committing to a long run.
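As an illustration of the sample-sheet and dry-run behaviour described above, the following is a minimal sketch and not the script's actual code. The function names (read_samples, make_batches, format_plan), the fastq_1 column, the sample names, and the wording of the printed plan are all hypothetical; only the sample column and the consecutive, fixed-size grouping come from this page.

```python
import csv
import io


def read_samples(handle):
    """Return the values of the `sample` column in file order."""
    reader = csv.DictReader(handle, delimiter="\t")
    if "sample" not in (reader.fieldnames or []):
        raise ValueError("sample sheet has no 'sample' column")
    return [row["sample"] for row in reader]


def make_batches(samples, batch_size):
    """Split the ordered samples into consecutive groups of batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


def format_plan(batches):
    """Describe the batches without running anything (dry-run style)."""
    lines = [f"{len(batches)} batch(es) planned"]
    for number, batch in enumerate(batches, start=1):
        lines.append(f"  batch {number}: {', '.join(batch)}")
    return lines


# Hypothetical in-memory sheet; a real run would open the TSV file instead.
sheet = io.StringIO(
    "sample\tfastq_1\n"
    "ctrl_1\tctrl_1_R1.fastq.gz\n"
    "ctrl_2\tctrl_2_R1.fastq.gz\n"
    "treat_1\ttreat_1_R1.fastq.gz\n"
)
samples = read_samples(sheet)
print("\n".join(format_plan(make_batches(samples, batch_size=2))))
```

With three samples and a batch size of 2, the last batch holds a single sample, which matches the note that the final batch may be smaller than the others.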
Any additional arguments listed after all named arguments are passed verbatim to every Snakemake invocation (including the final MultiQC aggregation step). For example: -- --forcerun tn5_shift
How It Works
Load configuration
Reads the base config.yaml from the pipeline root. If --config points to a different file, that file is deep-merged on top of the base config so custom overrides take effect while all tool output paths remain fully resolved.
Read sample sheet
Opens the TSV sample sheet and extracts all values from the sample column, preserving their order.
Split into batches
Divides the ordered sample list into consecutive groups of --batch-size. The last batch may have fewer samples than the others.
Run each batch sequentially
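For each batch, constructs a list of concrete target files from the config's output paths and calls Snakemake with those targets, passing the options described under Arguments (--cores, --resources mem_mb, --conda-frontend, and any extra arguments) together with --rerun-incomplete. The ATAC_MODE environment variable is set to the resolved mode value for each invocation.
Track failures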
Batches that exit with a non-zero Snakemake return code are recorded in a failed_batches list. The script continues to the next batch regardless (--keep-going behaviour). A summary of failed batches is printed at the end.
Per-Batch Target Files
Bulk mode (global.mode: "bulk")
For each sample in the batch, the following target files are requested from Snakemake:
| File | Source config key |
|---|---|
| {fastp.output}/{sample}_R1_trimmed.fastq.gz | fastp.output |
| {bowtie2.output}/{sample}.bam | bowtie2.output |
| {samtools_sort.output.sorted_bam}/{sample}.sorted.bam | samtools_sort.output.sorted_bam |
| {samtools_markdup.output.markdup_bam}/{sample}.sorted.dedup.bam | samtools_markdup.output.markdup_bam |
| {tn5_shift.output.shifted_bam}/{sample}.filtered.shifted.bam | tn5_shift.output.shifted_bam |
| {macs2.output.peaks}/{sample}_peaks.narrowPeak | macs2.output.peaks |
| {blacklist_filter.output.filtered_peaks}/{sample}_filtered_peaks.bed | blacklist_filter.output.filtered_peaks |
| {qc_gate.output}/{sample}_qc_pass.txt | qc_gate.output |
| {bigwig.output.bigwig}/{sample}.bw | bigwig.output.bigwig |
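The table above can be read as a recipe: look up each dotted config key in the merged configuration, then append the sample name and a fixed suffix. The sketch below shows one way this could work, assuming the configuration is a nested dictionary. The helper names (deep_merge, lookup, batch_targets) and the directory values are hypothetical; the keys and suffixes are copied from the table.

```python
from functools import reduce

# (source config key, filename suffix) pairs copied from the bulk-mode table.
BULK_TARGETS = [
    ("fastp.output", "_R1_trimmed.fastq.gz"),
    ("bowtie2.output", ".bam"),
    ("samtools_sort.output.sorted_bam", ".sorted.bam"),
    ("samtools_markdup.output.markdup_bam", ".sorted.dedup.bam"),
    ("tn5_shift.output.shifted_bam", ".filtered.shifted.bam"),
    ("macs2.output.peaks", "_peaks.narrowPeak"),
    ("blacklist_filter.output.filtered_peaks", "_filtered_peaks.bed"),
    ("qc_gate.output", "_qc_pass.txt"),
    ("bigwig.output.bigwig", ".bw"),
]


def deep_merge(base, override):
    """Return a new dict with override merged recursively on top of base."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def lookup(config, dotted_key):
    """Resolve a dotted key such as 'macs2.output.peaks' in a nested dict."""
    return reduce(lambda node, part: node[part], dotted_key.split("."), config)


def batch_targets(config, batch, table=BULK_TARGETS):
    """List every target file for every sample in the batch."""
    return [
        f"{lookup(config, key)}/{sample}{suffix}"
        for sample in batch
        for key, suffix in table
    ]


# Hypothetical base config and override; only two tools are shown for brevity.
base = {"fastp": {"output": "results/fastp"}, "bowtie2": {"output": "results/bowtie2"}}
override = {"bowtie2": {"output": "scratch/bowtie2"}}
config = deep_merge(base, override)

for path in batch_targets(config, ["s1", "s2"], table=BULK_TARGETS[:2]):
    print(path)
```

Because the override is merged rather than substituted, fastp.output is still resolvable even though the override file only mentions bowtie2.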
scATAC-seq mode (global.mode: "scatac")
| File | Source config key |
|---|---|
| {fastp.output}/{sample}_R1_trimmed.fastq.gz | fastp.output |
| {chromap.output}/{sample}.bam | chromap.output |
| {bigwig.output.bigwig}/{sample}.bw | bigwig.output.bigwig |
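Which of the two tables applies depends on the resolved mode. The lookup order described under Arguments (command-line value, then the ATAC_MODE environment variable, then global.mode in the config) could be sketched as follows. The function name resolve_mode, its signature, and the error text are hypothetical; the script's real implementation may differ.

```python
import os

VALID_MODES = ("bulk", "scatac")


def resolve_mode(cli_mode, config, environ=os.environ):
    """Pick the mode: CLI value, then ATAC_MODE, then global.mode in the config."""
    mode = (
        cli_mode
        or environ.get("ATAC_MODE")
        or config.get("global", {}).get("mode")
    )
    if mode not in VALID_MODES:
        # Any other value (including a missing one) ends the run with an error.
        raise SystemExit(f"Invalid mode {mode!r}: must be 'bulk' or 'scatac'")
    return mode


# Hypothetical config; the environment is passed explicitly to keep the demo isolated.
config = {"global": {"mode": "bulk"}}
print(resolve_mode(None, config, environ={}))
print(resolve_mode(None, config, environ={"ATAC_MODE": "scatac"}))
print(resolve_mode("bulk", config, environ={"ATAC_MODE": "scatac"}))
```

The three calls exercise each level of the fallback chain in turn: config only, environment over config, and command line over both.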
Usage Examples
Minimal memory run: 1 sample at a time (--batch-size 1), 2 cores, 4 GB RAM.
Because each batch calls Snakemake with --rerun-incomplete, partial outputs from a crashed batch are automatically cleaned up and re-run on the next attempt. You can safely re-run run_batched.py after a failure: Snakemake skips target files that are already complete and picks up where the failure occurred.
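The per-batch invocation and failure tracking behind this behaviour can be sketched as follows. This is a minimal sketch under stated assumptions, not the script's actual code: build_command, snakemake_runner, and run_batches are hypothetical names, the target paths are made up, and 4000 MB is only an illustrative value. The Snakemake options used are the ones named on this page.

```python
import os
import subprocess


def build_command(targets, cores, memory_mb, conda_frontend, extra_args=()):
    """Assemble one Snakemake command line for a batch of target files."""
    # Targets go first: some Snakemake options (for example --forcerun) accept
    # a variable number of values and could otherwise absorb the target paths.
    return [
        "snakemake", *targets,
        "--cores", str(cores),
        "--resources", f"mem_mb={memory_mb}",
        "--conda-frontend", conda_frontend,
        "--rerun-incomplete",
        *extra_args,
    ]


def snakemake_runner(mode):
    """Return a function that runs one command with ATAC_MODE set to mode."""
    def run(command):
        env = {**os.environ, "ATAC_MODE": mode}
        return subprocess.run(command, env=env).returncode
    return run


def run_batches(batch_commands, run):
    """Run commands one after another; record failures instead of stopping."""
    failed_batches = []
    for number, command in enumerate(batch_commands, start=1):
        if run(command) != 0:
            failed_batches.append(number)
    return failed_batches


# One command per batch; sample names and paths are hypothetical.
commands = [
    build_command([f"results/{sample}.bw"], cores=2, memory_mb=4000, conda_frontend="conda")
    for sample in ("s1", "s2", "s3")
]


def fake_run(command):
    """Stand-in runner so the sketch executes without Snakemake installed."""
    return 1 if "results/s2.bw" in command else 0


failed = run_batches(commands, fake_run)
print(f"failed batches: {failed}")
```

In a real run, snakemake_runner(mode) would take the place of fake_run. Since a failed batch does not stop the loop, invoking the script again simply repeats the same commands and leaves it to Snakemake to decide which targets still need work.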