Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/BDB-Genomics/atacseq-pipeline/llms.txt

Use this file to discover all available pages before exploring further.

The single-cell ATAC-seq (scatac) mode replaces the bulk alignment and peak-calling stack with a purpose-built single-cell analysis chain. Chromap aligns reads at high speed against the reference index, ArchR constructs Arrow files, removes doublets, performs iterative LSI dimensionality reduction, and clusters cells via UMAP. Cicero then maps chromatin co-accessibility networks across a configurable genomic window. All QC decisions in this mode are delegated to ArchR’s internal thresholds — the pipeline-level QC gate that runs between alignment and peak calling in bulk mode is skipped entirely.

Activating scATAC Mode

Switch to single-cell mode with the ATAC_MODE environment variable or the global.mode key in config.yaml:
ATAC_MODE=scatac snakemake --use-conda --cores 8
The environment variable takes priority over config.yaml. If you set ATAC_MODE=scatac on the command line, it overrides any value written in global.mode.

Sample Sheet Format

The scATAC pipeline reads from the same tab-separated sample sheet as bulk mode (path set by global.samples, default data/fastp/samples.tsv). Each row represents one library (or one barcode manifest file, depending on your experimental design). The condition and replicate columns are still parsed but are used by ArchR for pseudo-bulk grouping rather than for IDR analysis.
sample          fastq_r1                              fastq_r2                              condition   replicate
SC_lib1         data/raw/SC_lib1_R1.fastq.gz          data/raw/SC_lib1_R2.fastq.gz          ctrl        1
SC_lib2         data/raw/SC_lib2_R1.fastq.gz          data/raw/SC_lib2_R2.fastq.gz          treat       1
For 10x Genomics data, point fastq_r1 to the read file and fastq_r2 to the barcode/insert file as produced by cellranger-atac mkfastq. The Chromap rule passes reads directly to the aligner without assuming a particular barcode structure.

Pipeline Stages

1
Preprocessing — fastp + FastQC
2
Identical to bulk mode: fastp trims 5 bases from the front of both mates and enforces a minimum read length of 30 bp before FastQC generates per-sample HTML reports.
3
# config.yaml
fastp:
  output: "results/preprocessing/fastp"
  params:
    trim_front1: 5
    trim_front2: 5
    length_required: 30
  threads: 4
  resources:
    mem_mb: 8000
    time: 120
4
Alignment — Chromap
5
Trimmed reads are aligned with Chromap using the --preset atac flag, which applies ATAC-seq–specific alignment parameters. The rule spawns 16 threads and requests 32 GB of memory to handle the large barcode-aware index:
6
# config.yaml
chromap:
  input: "results/preprocessing/fastp"
  output: "results/alignment/chromap"
  params:
    index: "data/reference/chromap/genome.index"   # *CHROMAP_INDEX
    preset: "atac"
  threads: 16
  resources:
    mem_mb: 32000
    time: 120
7
The Chromap rule produces two output files per sample:
8
FileDescriptionresults/alignment/chromap/{sample}.bamRaw aligned BAMresults/alignment/chromap/{sample}_tag.bamSAM-converted tagged BAM fed to ArchR
9
The shell command executed by the rule is:
10
chromap \
    --preset atac \
    -x data/reference/chromap/genome.index \
    -r data/reference/genome.fa \
    -1 {sample}_R1_trimmed.fastq.gz \
    -2 {sample}_R2_trimmed.fastq.gz \
    -t 16 \
    -o {sample}.bam \
    --SAM
11
The QC gate that runs in bulk mode (FRiP, TSS enrichment, mapping rate, duplicate rate checks) is not applied in scATAC mode. ArchR enforces its own per-cell QC thresholds during Arrow file creation.
12
ArchR — Arrow File Creation
13
All tagged BAMs are passed together to archr_pseudobulk, which creates ArchR Arrow files (one per sample). Cells below the minimum TSS enrichment or fragment count thresholds are excluded at this stage.
14
# config.yaml
archr:
  input:
    bam: "results/alignment/chromap"
  output:
    arrow: "results/scatac/archr/arrow"
  params:
    min_tss: 4.0
    min_frags: 1000
    max_frags: 100000
    tsse_method: "ArchR"
  threads: 16
  resources:
    mem_mb: 64000
    time: 240
15
ParameterValueMeaningmin_tss4.0Minimum TSS enrichment score per cellmin_frags1000Minimum unique fragments per cellmax_frags100000Maximum fragments (filters multiplets)tsse_methodArchRTSS enrichment algorithm
16
Arrow files are written to results/scatac/archr/arrow/.
17
ArchR — Doublet Detection and Filtering
18
archr_doublet_detection runs ArchR’s simulation-based doublet scoring on the raw Arrow files. Cells with a doublet enrichment score above doublet_threshold: 0.2 are removed. A diagnostic PDF is written to results/scatac/archr/doublets/doublet_enrichment.pdf and the cleaned Arrow files are saved to results/scatac/archr/filtered_arrow/.
19
# config.yaml (archr params excerpt)
archr:
  params:
    doublet_threshold: 0.2
20
ArchR — Iterative LSI, UMAP, and Clustering
21
The filtered Arrow files feed archr_clustering, which runs the full single-cell analysis:
22
  • Iterative LSI dimensionality reduction across dimensions 1:30
  • UMAP embedding of the LSI components
  • Leiden clustering at resolution 0.8
  • Marker gene and peak identification per cluster
  • 23
    # config.yaml (archr params excerpt)
    archr:
      params:
        clustering_resolution: 0.8
        dims_to_use: 1:30
        force_dim_reduction: true
    
    24

    Cell Clusters

    results/scatac/archr/clusters/cell_clusters.tsv — barcode-to-cluster assignment table

    UMAP Plot

    results/scatac/archr/plots/umap_clusters.pdf — UMAP coloured by cluster identity

    Marker Genes

    results/scatac/archr/markers/marker_genes.tsv — per-cluster marker gene table

    Full QC Report

    results/scatac/archr/qc_report/ArchR_full_report.pdf — comprehensive ArchR PDF report
    25
    Cicero — Chromatin Co-Accessibility
    26
    Cicero models co-accessibility between distal regulatory elements and gene promoters using the filtered Arrow files and the cluster assignments from ArchR. The analysis runs within a 500 bp tiling window and considers peak pairs up to 250 kb apart.
    27
    # config.yaml
    cicero:
      input:
        arrow:    "results/scatac/archr/filtered_arrow"
        clusters: "results/scatac/archr/clusters"
      output:
        connections: "results/scatac/cicero/connections"
        ccans:       "results/scatac/cicero/ccans"
        plots:       "results/scatac/cicero/plots"
      params:
        window_size:     500
        distance_cutoff: 250000
      threads: 8
      resources:
        mem_mb: 32000
        time: 240
    
    28
    ParameterValueMeaningwindow_size500Tiling window size in bpdistance_cutoff250000Maximum genomic distance for co-accessibility (250 kb)
    29
    Cicero outputs three primary files:
    30
    OutputPathDescriptionCo-accessibility networkconnections/coaccessibility_connections.rdsR object of full networkCo-accessibility tableconnections/coaccessibility_table.tsvPeak-pair scores as TSVCCAN BEDccans/ccans.bedCis-Coaccessibility Networks as BED intervals
    31
    chromVAR Motif Deviations
    32
    chromVAR computes per-cell motif deviation scores using the JASPAR vertebrate motif database and the Tn5-shifted signal. In scATAC mode, it operates on the Chromap-aligned BAMs alongside bigwig generation and correlation analysis.
    33
    # config.yaml
    chromvar_analysis:
      output:
        deviations:     "results/peak_calling/chromvar/deviations"
        bias_corrected: "results/peak_calling/chromvar/bias_corrected"
        plots:          "results/peak_calling/chromvar/plots"
      params:
        motif_db:     "data/motifs/jaspar_vertebrates.meme"
        genome_fa:    "data/reference/genome.fa"
        genome_sizes: "data/reference/genome.chrom.sizes"
      threads: 8
      resources:
        mem_mb: 16000
        time: 240
    
    34
    Per-sample deviation scores are written to results/peak_calling/chromvar/deviations/{sample}_deviations.tsv.

    Output Structure

    All scATAC outputs are written to results/scatac/ in addition to the shared results/visualization/ and results/peak_calling/chromvar/ directories:
    results/
    ├── preprocessing/
    │   ├── fastp/          # Trimmed FASTQs
    │   └── fastqc/         # FastQC HTML reports
    ├── alignment/
    │   └── chromap/        # {sample}.bam and {sample}_tag.bam
    ├── visualization/
    │   ├── bigwig/         # Per-sample BigWig tracks
    │   └── correlation_analysis/  # Correlation heatmap
    ├── peak_calling/
    │   ├── motif_analysis/ # Per-sample HOMER motif results
    │   └── chromvar/
    │       ├── deviations/ # {sample}_deviations.tsv
    │       └── plots/
    └── scatac/
        ├── archr/
        │   ├── arrow/              # Raw Arrow files
        │   ├── filtered_arrow/     # Doublet-filtered Arrow files
        │   ├── clusters/
        │   │   └── cell_clusters.tsv
        │   ├── markers/
        │   │   └── marker_genes.tsv
        │   ├── doublets/
        │   │   └── doublet_enrichment.pdf
        │   ├── plots/
        │   │   └── umap_clusters.pdf
        │   └── qc_report/
        │       └── ArchR_full_report.pdf
        └── cicero/
            ├── connections/
            │   ├── coaccessibility_connections.rds
            │   └── coaccessibility_table.tsv
            ├── ccans/
            │   └── ccans.bed
            └── plots/
                └── coaccessibility_plot.png
    

    Key Differences from Bulk Mode

    In bulk mode, rules/scripts/parse_qc_metrics.py applies hard thresholds for FRiP, TSS enrichment, mapping rate, and duplicate rate before peak calling. In scATAC mode this gate is skipped — the QC_GATE_TARGETS list is empty. ArchR enforces per-cell quality thresholds (min_tss, min_frags, max_frags) during Arrow file creation instead.
    Bulk mode runs samtools sort → fixmate → markdup → view (MAPQ ≥ 30) → blacklist removal → Tn5 shift as a sequential chain on every BAM. In scATAC mode, POST_FILTERING_TARGETS is empty; Chromap’s --preset atac handles alignment-level filtering internally and ArchR manages cell-level QC.
    IDR replicate concordance and DESeq2 differential accessibility are bulk-only steps. In scATAC mode, cluster-level differential accessibility is performed within ArchR using its marker peak identification routines.
    Transcription factor footprinting with TOBIAS and HINT-ATAC is not included in the scATAC rule graph. motif chromVAR provides single-cell–resolution motif deviation scores instead.

    Build docs developers (and LLMs) love