Snakemake profiles are directories containing a config.yaml that preconfigures the executor, resource defaults, and scheduler behaviour for a specific compute environment. Instead of passing dozens of flags on the command line, you select a profile with --profile profile/<name> and all of its settings load automatically. The BDB-Genomics ATAC-seq pipeline ships eight profiles covering execution targets from a laptop to multi-cloud Kubernetes clusters: local, slurm, low_resource, test, aws, gcp, azure, and kubernetes.

Every profile’s config keys override or extend Snakemake’s built-in defaults. Per-rule resource declarations in config.yaml (the pipeline config, not the profile config) take precedence over a profile’s default-resources block, which acts only as a fallback for rules that do not declare their own requirements. A profile’s set-resources block is the exception: it overrides the per-rule declarations for the rules it names, as the low_resource profile does.
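The fallback behaviour of default-resources can be pictured with a few lines of Python. This is a minimal sketch of the idea, not Snakemake's actual implementation: the function name effective_resources and the example rule declaring mem_mb: 16000 are illustrative assumptions, while the default values are copied from the local profile below.

```python
def effective_resources(rule_resources, default_resources):
    """Return the resources a job would request (illustrative only).

    Values declared by the rule win; the profile's default-resources
    block only fills in keys the rule left out.
    """
    merged = dict(default_resources)   # start from the profile fallback
    merged.update(rule_resources)      # rule-level declarations override it
    return merged


# default-resources block from profile/local/config.yaml
profile_defaults = {"mem_mb": 4000, "time": 60, "threads": 1}

# Hypothetical rule that declares only its memory requirement
rule_declared = {"mem_mb": 16000}

print(effective_resources(rule_declared, profile_defaults))
# {'mem_mb': 16000, 'time': 60, 'threads': 1}
```

A rule that declares nothing simply receives the three default values unchanged.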

local

The local profile is the recommended starting point for development, debugging, and single-machine production runs.
snakemake --profile profile/local
# profile/local/config.yaml
use-conda: true
jobs: 8
printshellcmds: true
show-failed-logs: true
keep-going: true
rerun-incomplete: true
restart-times: 0

default-resources:
  mem_mb: 4000
  time: 60
  threads: 1

jobs: 8

Up to 8 rules execute in parallel. Tune this to match your machine’s logical CPU count.

keep-going: true

Independent branches of the DAG continue even if one rule fails, maximising throughput on partial failures.

rerun-incomplete: true

Output files from interrupted rules are automatically re-generated on the next run.

restart-times: 0

Failed rules are not automatically retried locally. Investigate logs before re-running.
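One way to pick a jobs value is to derive it from the logical CPU count that Python reports. The sketch below assumes that a --jobs value given on the command line takes precedence over the jobs key in the profile. Leaving one core free is an illustrative choice rather than a pipeline recommendation, and local_command is a hypothetical helper name.

```python
import os
import shlex


def local_command(profile="profile/local", reserve=1):
    """Build a snakemake command whose --jobs value fits this machine."""
    cpus = os.cpu_count() or 1        # cpu_count() can return None
    jobs = max(1, cpus - reserve)     # keep `reserve` cores free, never below 1
    return ["snakemake", "--profile", profile, "--jobs", str(jobs)]


# Print the command instead of running it, so it can be inspected first.
print(shlex.join(local_command()))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run once the value looks sensible.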

slurm

The SLURM profile submits every rule as an independent batch job via the native snakemake-executor-plugin-slurm.
snakemake --profile profile/slurm
# profile/slurm/config.yaml
executor: slurm
use-conda: true
jobs: 100
printshellcmds: true
show-failed-logs: true
keep-going: true
rerun-incomplete: true
restart-times: 1

latency-wait: 60

default-resources:
  mem_mb: 4000
  time: 60
  threads: 1
  slurm_partition: "standard"
  slurm_account: "bdb_genomics"

executor (string, default: "slurm")
Selects the SLURM executor plugin. Requires snakemake-executor-plugin-slurm to be installed in the Snakemake environment.

jobs (integer, default: 100)
Maximum number of concurrently queued or running SLURM jobs. Set this to stay within your cluster’s fair-share policy.

latency-wait (integer, default: 60)
Seconds Snakemake waits for output files to appear on shared storage after a job completes. Increase this if your cluster uses a high-latency network filesystem (e.g., Lustre over WAN).

default-resources.slurm_partition (string, default: "standard")
Default SLURM partition for all jobs. Override per-rule in config.yaml resources if needed.

default-resources.slurm_account (string, default: "bdb_genomics")
SLURM account to charge compute time against. Change this to your institutional account name.

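restart-times: 1 allows each SLURM job one automatic retry on failure, which is useful for transient failures such as scheduler preemption. Before raising it, check that latency-wait is long enough for your filesystem: on NFS, a job whose outputs are slow to become visible can be reported as failed and retried even though it completed.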
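The interaction between latency-wait and restart-times is easier to reason about with a toy model. The following is a sketch of the two ideas under simplified assumptions, not the code Snakemake or the SLURM plugin runs; wait_for_outputs and run_with_restarts are hypothetical names.

```python
import os
import time


def wait_for_outputs(paths, latency_wait=60, poll_interval=1.0):
    """Poll until every path exists; give up after latency_wait seconds."""
    deadline = time.monotonic() + latency_wait
    while True:
        missing = [p for p in paths if not os.path.exists(p)]
        if not missing:
            return True            # all outputs are visible
        if time.monotonic() >= deadline:
            return False           # outputs never appeared: counts as a failure
        time.sleep(poll_interval)


def run_with_restarts(job, restart_times=1):
    """Call job() and retry up to restart_times times if it raises.

    Returns a (result, attempts_used) tuple.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return job(), attempt
        except Exception:
            if attempt > restart_times:
                raise              # retries exhausted, surface the error
```

With restart_times=1 a job runs at most twice. If the output check times out because the filesystem is slow, each retry repeats work that had already succeeded, which is why a longer latency-wait is usually the first thing to adjust.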

low_resource

Designed for workstations with ≤ 8 GB RAM and ≤ 4 CPU cores. This profile caps memory and thread allocations for every named rule via Snakemake’s set-resources directive, preventing out-of-memory kills on constrained hardware.
snakemake --profile profile/low_resource
# profile/low_resource/config.yaml
use-conda: true
jobs: 2
printshellcmds: true
show-failed-logs: true
keep-going: true
rerun-incomplete: true
restart-times: 0
latency-wait: 30

set-resources:
  bowtie2_align:
    mem_mb: 4000
    threads: 2
  samtools_sort:
    mem_mb: 3000
    threads: 2
  samtools_markdup:
    mem_mb: 4000
    threads: 2
  tn5_shift:
    mem_mb: 3000
    threads: 2
  macs2_peak_calling:
    mem_mb: 4000
    threads: 2
  tss_enrichment:
    mem_mb: 4000
    threads: 2
  heatmap:
    mem_mb: 4000
    threads: 2
  peak_annotation:
    mem_mb: 4000
    threads: 2
  motif_analysis:
    mem_mb: 4000
    threads: 2
  differential_accessibility:
    mem_mb: 4000
    threads: 2
  chromvar_analysis:
    mem_mb: 4000
    threads: 2
  footprinting:
    mem_mb: 4000
    threads: 2
  tobias_atacorrect:
    mem_mb: 4000
    threads: 2
  tobias_score_bigwig:
    mem_mb: 4000
    threads: 2
  tobias_bindetect:
    mem_mb: 4000
    threads: 2
  # ... (all rules explicitly capped — see profile file for the complete list)

default-resources:
  mem_mb: 2000
  time: 120
  threads: 1
Rules that declare mem_mb: 16000 in config.yaml (e.g., bowtie2, macs2, heatmap) will be overridden to 4 GB by this profile. Alignment of large genomes may fail or produce incomplete results. This profile is intended for testing and development only.
The profile sets jobs: 2 so that at most two rules run at once; with every listed rule capped at 4000 MB or less, the combined memory request stays at or below 8000 MB. The default-resources fallback (mem_mb: 2000, threads: 1) applies to any rule not listed explicitly in set-resources.
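The memory budget behind jobs: 2 is simple arithmetic, sketched here in Python. The caps are a subset of the set-resources values shown above; worst_case_mem_mb is a hypothetical helper, and the bound assumes that several instances of the most expensive rule could run at the same time (for example, one per sample).

```python
def worst_case_mem_mb(rule_caps, jobs):
    """Upper bound on memory requested when `jobs` rules run concurrently."""
    return jobs * max(rule_caps.values())


# A subset of the set-resources caps from profile/low_resource/config.yaml
caps = {
    "bowtie2_align": 4000,
    "samtools_sort": 3000,
    "samtools_markdup": 4000,
    "tn5_shift": 3000,
    "macs2_peak_calling": 4000,
}

print(worst_case_mem_mb(caps, jobs=2))  # 2 * 4000 = 8000
print(worst_case_mem_mb(caps, jobs=4))  # 4 * 4000 = 16000
```

Raising jobs to 4 would double the possible request to 16000 MB, beyond the 8 GB machines this profile targets.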

test

The test profile is used in CI pipelines and for verifying the installation on synthetic or downsampled data. It loads its own config_test.yaml, which relaxes QC thresholds for synthetic reads.
snakemake --profile profile/test
# profile/test/config.yaml
use-conda: true
jobs: 4
printshellcmds: true
show-failed-logs: true
rerun-incomplete: true
restart-times: 0
configfile: "profile/test/config_test.yaml"

default-resources:
  mem_mb: 2000
  time: 30
  threads: 2
The configfile: "profile/test/config_test.yaml" directive automatically overlays the test-specific overrides (such as relaxed qc_gate thresholds) on top of the main config.yaml. You do not need to pass a second --configfile argument when using this profile.
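Overlaying one config on another amounts to a recursive dictionary merge. The sketch below illustrates that idea only; it assumes nested keys are merged rather than replaced wholesale, and every key except qc_gate (including min_frip and min_tss_enrichment and their values) is a hypothetical placeholder, not taken from the pipeline's real config files.

```python
def overlay(base, override):
    """Return a copy of `base` with `override` merged in recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay(merged[key], value)   # merge nested sections
        else:
            merged[key] = value                         # replace scalars and lists
    return merged


# Hypothetical excerpt of the main config.yaml
main_config = {
    "genome": "hg38",
    "qc_gate": {"min_frip": 0.2, "min_tss_enrichment": 6.0},
}

# Hypothetical excerpt of profile/test/config_test.yaml
test_config = {"qc_gate": {"min_frip": 0.01}}

print(overlay(main_config, test_config))
# {'genome': 'hg38', 'qc_gate': {'min_frip': 0.01, 'min_tss_enrichment': 6.0}}
```

Only the key that the test file mentions changes; everything else in the main config is left as it was.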

aws

Runs the pipeline on AWS Batch via the Tibanna executor plugin, with intermediate files stored in Amazon S3.
snakemake --profile profile/aws
# profile/aws/config.yaml
executor: tibanna
use-conda: true
use-singularity: true
jobs: 50
printshellcmds: true
show-failed-logs: true
keep-going: true
rerun-incomplete: true
restart-times: 1
latency-wait: 120

default-resources:
  mem_mb: 4000
  time: 60
  threads: 1

tibanna-sfn: "tibanna_unicorn_atacseq"
default-remote-prefix: "YOUR_S3_BUCKET/atacseq-pipeline"
default-remote-provider: "S3"

1. Install the executor plugin

   pip install snakemake-executor-plugin-tibanna tibanna

2. Configure AWS credentials

   aws configure
   # or export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY

3. Deploy Tibanna unicorn (one-time)

   tibanna deploy_unicorn -g atacseq -b YOUR_S3_BUCKET

4. Create an S3 bucket for intermediate storage

   aws s3 mb s3://your-bucket-name

5. Update the profile config

   Replace YOUR_S3_BUCKET in profile/aws/config.yaml with your actual bucket name.

6. Run the pipeline

   snakemake --profile profile/aws

use-singularity: true is required because AWS Batch jobs run in container-isolated environments. Conda alone is insufficient for container-native execution.
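The placeholder substitution in the "Update the profile config" step can be scripted instead of edited by hand. This is a minimal sketch using only the standard library; fill_placeholder is a hypothetical helper, and the bucket name in the usage comment is made up.

```python
from pathlib import Path


def fill_placeholder(config_path, placeholder, value):
    """Replace every occurrence of `placeholder` in a profile config.

    Returns the number of replacements and raises if none were found,
    so a typo in the placeholder name does not pass silently.
    """
    path = Path(config_path)
    text = path.read_text()
    count = text.count(placeholder)
    if count == 0:
        raise ValueError(f"{placeholder!r} not found in {path}")
    path.write_text(text.replace(placeholder, value))
    return count


# Example call (hypothetical bucket name), run from the repository root:
# fill_placeholder("profile/aws/config.yaml", "YOUR_S3_BUCKET", "my-atacseq-bucket")
```

The same helper works for the other cloud profiles by passing their placeholder names, such as YOUR_GCP_PROJECT_ID or YOUR_BLOB_CONTAINER.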

gcp

Runs the pipeline on Google Cloud Life Sciences, with intermediate files in Google Cloud Storage.
snakemake --profile profile/gcp
# profile/gcp/config.yaml
executor: google-lifesciences
use-conda: true
use-singularity: true
jobs: 50
printshellcmds: true
show-failed-logs: true
keep-going: true
rerun-incomplete: true
restart-times: 1
latency-wait: 120

default-resources:
  mem_mb: 4000
  time: 60
  threads: 1
  machine_type: "n1-standard-4"

google-lifesciences-project: "YOUR_GCP_PROJECT_ID"
google-lifesciences-region: "us-central1"
default-remote-prefix: "YOUR_GCS_BUCKET/atacseq-pipeline"
default-remote-provider: "GS"

1. Install the executor plugin

   pip install snakemake-executor-plugin-google-lifesciences

2. Authenticate with Google Cloud

   gcloud auth application-default login

3. Create a GCS bucket

   gsutil mb gs://your-bucket-name

4. Update the profile config

   Replace YOUR_GCP_PROJECT_ID and YOUR_GCS_BUCKET in profile/gcp/config.yaml.

5. Run the pipeline

   snakemake --profile profile/gcp

default-resources.machine_type (string, default: "n1-standard-4")
Default GCE machine type. The n1-standard-4 provides 4 vCPUs and 15 GB RAM. Increase to n1-highmem-8 for memory-intensive rules like ArchR.

google-lifesciences-region (string, default: "us-central1")
GCP region where Life Sciences pipelines execute. Choose a region close to your GCS bucket to minimise egress costs.
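Choosing between the two machine types mentioned above can be expressed as a small lookup. This is only a sketch: pick_machine_type is a hypothetical helper, the n1-standard-4 figures come from this page, and the n1-highmem-8 figures (8 vCPUs, 52 GB) are an assumption that should be checked against Google's current machine-type listing.

```python
# (name, vCPUs, memory in GB), ordered from smallest to largest
MACHINE_TYPES = [
    ("n1-standard-4", 4, 15),
    ("n1-highmem-8", 8, 52),   # assumed figures, not stated on this page
]


def pick_machine_type(mem_mb, threads=1):
    """Return the first listed machine type that satisfies the request."""
    for name, vcpus, mem_gb in MACHINE_TYPES:
        if threads <= vcpus and mem_mb <= mem_gb * 1000:
            return name
    raise ValueError(f"no listed machine type fits mem_mb={mem_mb}, threads={threads}")


print(pick_machine_type(mem_mb=4000))              # n1-standard-4
print(pick_machine_type(mem_mb=16000, threads=4))  # n1-highmem-8
```

A rule requesting mem_mb: 16000 does not fit the 15 GB default, which is the situation where raising machine_type becomes necessary.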

azure

Runs the pipeline on Azure Batch, with intermediate files in Azure Blob Storage.
snakemake --profile profile/azure
# profile/azure/config.yaml
executor: azure-batch
use-conda: true
use-singularity: true
jobs: 50
printshellcmds: true
show-failed-logs: true
keep-going: true
rerun-incomplete: true
restart-times: 1
latency-wait: 120

default-resources:
  mem_mb: 4000
  time: 60
  threads: 1

az-batch-account-url: "https://YOUR_BATCH_ACCOUNT.eastus.batch.azure.com"
default-remote-prefix: "YOUR_BLOB_CONTAINER/atacseq-pipeline"
default-remote-provider: "AzBlob"

1. Install the executor plugin

   pip install snakemake-executor-plugin-azure-batch

2. Authenticate with Azure

   az login

3. Create a Batch account and storage account

   az batch account create \
     --name atacseqbatch \
     --resource-group YOUR_RG \
     --location eastus

   az storage account create \
     --name atacseqstorage \
     --resource-group YOUR_RG \
     --location eastus

4. Create a Blob container

   az storage container create \
     --name atacseq-pipeline \
     --account-name atacseqstorage

5. Update the profile config

   Replace YOUR_BATCH_ACCOUNT and YOUR_BLOB_CONTAINER in profile/azure/config.yaml.

6. Run the pipeline

   snakemake --profile profile/azure

kubernetes

Runs each rule as a Kubernetes Pod on any conformant cluster (GKE, EKS, AKS, or local Minikube).
snakemake --profile profile/kubernetes
# profile/kubernetes/config.yaml
executor: kubernetes
use-conda: true
use-singularity: true
jobs: 50
printshellcmds: true
show-failed-logs: true
keep-going: true
rerun-incomplete: true
restart-times: 1
latency-wait: 120

default-resources:
  mem_mb: 4000
  time: 60
  threads: 1

kubernetes-namespace: "default"
default-remote-prefix: "YOUR_BUCKET/atacseq-pipeline"
default-remote-provider: "GS"       # or "S3" for AWS, "AzBlob" for Azure

1. Install the executor plugin

   pip install snakemake-executor-plugin-kubernetes

2. Provision a cluster (example: GKE)

   gcloud container clusters create atacseq-cluster --num-nodes=2
   # EKS: eksctl create cluster --name atacseq-cluster --nodes=2
   # Local: minikube start --memory=8192 --cpus=4

3. Verify kubectl context

   kubectl config current-context

4. Create a cloud storage bucket and update the profile config

   Replace YOUR_BUCKET and set the correct default-remote-provider (GS, S3, or AzBlob) in profile/kubernetes/config.yaml.

5. Run the pipeline

   snakemake --profile profile/kubernetes

Always delete your cluster after the run to avoid ongoing infrastructure costs. For GKE: gcloud container clusters delete atacseq-cluster. For EKS: eksctl delete cluster --name atacseq-cluster.

Profile Comparison

local

jobs: 8 · executor: default (process fork) · restart-times: 0 · Best for: single-machine runs and development.

slurm

jobs: 100 · executor: slurm · restart-times: 1 · latency-wait: 60 s · Best for: university HPC clusters.

low_resource

jobs: 2 · executor: default · All rules capped at ≤ 4 GB RAM. Best for: laptops and constrained VMs.

test

jobs: 4 · configfile: config_test.yaml · Best for: CI pipelines and installation checks.

aws

jobs: 50 · executor: tibanna · provider: S3 · Best for: AWS Batch scale-out.

gcp

jobs: 50 · executor: google-lifesciences · provider: GS · Best for: Google Cloud Life Sciences.

azure

jobs: 50 · executor: azure-batch · provider: AzBlob · Best for: Azure Batch compute pools.

kubernetes

jobs: 50 · executor: kubernetes · provider: GS/S3/AzBlob · Best for: container-native, multi-cloud.
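For scripting, the comparison above can be held as plain data. The sketch below copies the jobs, executor, and restart-times values from the profile configs on this page; the PROFILES dictionary and the profile_flag helper are hypothetical names, and "default" stands for profiles that set no executor key.

```python
PROFILES = {
    "local":        {"jobs": 8,   "executor": "default",             "restart_times": 0},
    "slurm":        {"jobs": 100, "executor": "slurm",               "restart_times": 1},
    "low_resource": {"jobs": 2,   "executor": "default",             "restart_times": 0},
    "test":         {"jobs": 4,   "executor": "default",             "restart_times": 0},
    "aws":          {"jobs": 50,  "executor": "tibanna",             "restart_times": 1},
    "gcp":          {"jobs": 50,  "executor": "google-lifesciences", "restart_times": 1},
    "azure":        {"jobs": 50,  "executor": "azure-batch",         "restart_times": 1},
    "kubernetes":   {"jobs": 50,  "executor": "kubernetes",          "restart_times": 1},
}


def profile_flag(name):
    """Return the --profile argument for a known profile name."""
    if name not in PROFILES:
        raise KeyError(f"unknown profile {name!r}; expected one of {sorted(PROFILES)}")
    return f"--profile profile/{name}"


# Print a compact, aligned summary of all eight profiles.
for name, cfg in PROFILES.items():
    print(f"{name:<13}{cfg['jobs']:>4}  {cfg['executor']:<20}{cfg['restart_times']}")
```

Validating the name before building the command catches typos such as low-resource early, before Snakemake is invoked.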
