All pipeline outputs are written beneath the results/ directory at the project root. The directory hierarchy mirrors the six-stage DAG — Preprocessing → Alignment → Post-alignment → Metrics & QC → Peak Calling → Visualization — so individual stages can be inspected or restarted independently. Per-rule benchmark timings land in benchmarks/ and execution logs in logs/, both at the project root.
Paths containing {sample} expand to one file per entry in your data/fastp/samples.tsv sample sheet. Paths containing {condition} or {condition}_rep{N}_rep{M} expand according to the condition and replicate columns in that sheet.
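As a rough illustration of this expansion, the sketch below fills a {sample} path pattern once per row of a tab-separated sample sheet. The column name `sample`, the helper `expand_sample_paths`, and the example rows are assumptions made for illustration; the pipeline's actual sheet schema is not described here.

```python
import csv
import io


def expand_sample_paths(sheet_text, pattern, sample_column="sample"):
    """Return one path per sample-sheet row by filling the {sample} placeholder.

    `sample_column` is a hypothetical column name; adjust it to the real sheet.
    """
    reader = csv.DictReader(io.StringIO(sheet_text), delimiter="\t")
    return [pattern.format(sample=row[sample_column]) for row in reader]


# Made-up sample sheet with two replicates of one condition.
sheet = "sample\tcondition\treplicate\nctrl_1\tctrl\t1\nctrl_2\tctrl\t2\n"
paths = expand_sample_paths(
    sheet, "results/preprocessing/fastp/{sample}_R1_trimmed.fastq.gz"
)
for path in paths:
    print(path)
```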
Preprocessing
results/preprocessing/fastp/
Adapter-trimmed and quality-filtered FASTQ files plus fastp QC reports.
{sample}_R1_trimmed.fastq.gz
Trimmed Read 1 FASTQ (gzip-compressed). Produced by fastp with trim_front1: 5 and length_required: 30.
{sample}_R2_trimmed.fastq.gz
Trimmed Read 2 FASTQ (gzip-compressed).
Structured JSON QC report from fastp, ingested by MultiQC.
Interactive HTML QC report from fastp.
results/preprocessing/fastqc/
Post-trimming per-read quality reports from FastQC.
{sample}_R1_trimmed_fastqc.html
Interactive FastQC HTML report for trimmed Read 1.
{sample}_R2_trimmed_fastqc.html
Interactive FastQC HTML report for trimmed Read 2.
{sample}_R1_trimmed_fastqc.zip
Raw FastQC data archive for programmatic parsing.
Alignment
results/alignment/bowtie2/
Raw aligned BAM files from Bowtie2 (bulk ATAC-seq mode only). Unsorted BAM produced by bowtie2 --very-sensitive paired-end alignment to the reference genome.
results/alignment/chromap/
Raw aligned BAM files from Chromap (scATAC-seq mode only; requires global.mode: "scatac"). BAM produced by chromap --preset atac. Used as input to the ArchR and Cicero scATAC-seq analysis rules.
Post-Alignment Processing
results/post_alignment/samtools_sort/
Coordinate-sorted BAMs. Input to samtools_fixmate.
results/post_alignment/samtools_fixmate/
BAMs with filled mate-score tags, required before duplicate marking.
{sample}.sorted.fixmate.bam
Sorted BAM with mate-score tags filled by samtools fixmate -m.
results/post_alignment/samtools_markdup/
PCR-duplicate-marked BAMs (duplicates retained by default).
{sample}.sorted.dedup.bam
Duplicate-marked BAM. Despite the dedup suffix, duplicate reads are flagged but not removed (remove_duplicates: false). Input to Picard, TOBIAS, and Qualimap.
{sample}.sorted.dedup.bam.bai
BAM index produced by the samtools_index_post_markdup rule.
results/post_alignment/remove_mito_reads/
Mitochondrial-read-free sorted BAMs. BAM with all reads mapping to chrMT (configurable via remove_mito_reads.params.mito_chr) excluded. Input to samtools_view and samtools_stats.
results/post_alignment/samtools_view/
Quality-filtered, blacklist-cleaned BAMs. BAM after MAPQ ≥ 30 filtering (-q 30), flag-based exclusion (-F 3844), and ENCODE blacklist removal. Input to Tn5 shifting, TOBIAS, and cross-correlation analysis.
{sample}.filtered.bam.bai
BAM index placed alongside the filtered BAM by samtools_index_post_filter.
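To make the -q 30 and -F 3844 filters concrete, the sketch below decodes the exclusion mask into the standard SAM flag bits and applies both checks to a single read. The helper names `decode_flag` and `passes_filter` are illustrative and not part of the pipeline; blacklist removal is not modelled.

```python
# Standard SAM FLAG bits as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(value):
    """List the names of all flag bits set in `value`, lowest bit first."""
    return [name for bit, name in SAM_FLAGS.items() if value & bit]


def passes_filter(flag, mapq, exclude=3844, min_mapq=30):
    """Mimic `samtools view -q 30 -F 3844` for one read (illustrative only)."""
    return (flag & exclude) == 0 and mapq >= min_mapq


print(decode_flag(3844))
# ['unmapped', 'secondary', 'qc_fail', 'duplicate', 'supplementary']
print(passes_filter(99, 42))    # properly paired read 1, high MAPQ: True
print(passes_filter(1123, 42))  # same read with the duplicate bit set: False
```

Because the mask includes the duplicate bit (0x400), reads that were only marked at the markdup step are excluded here.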
results/post_alignment/tn5_shift/
Tn5-transposase-corrected BAMs: the primary analysis BAMs for peak calling and visualisation.
{sample}.filtered.shifted.bam
BAM with reads shifted +4 bp (forward strand) and −5 bp (reverse strand) to correct for Tn5 insertion bias. Input to MACS2, FRiP calculation, TSS enrichment, and coverage tracks.
{sample}.filtered.shifted.bam.bai
BAM index.
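The shift described above can be sketched as a simple coordinate adjustment. The function below is a hypothetical helper that covers only the 5' cut-site position of a single read; how the pipeline rewrites full alignments, mate positions, and template lengths is not specified here.

```python
def shift_cut_site(five_prime_pos, is_reverse, plus_shift=4, minus_shift=5):
    """Return the Tn5-corrected cut-site coordinate for one read.

    Forward-strand reads move +4 bp and reverse-strand reads move -5 bp,
    matching the offsets quoted above. Names and defaults are illustrative.
    """
    if is_reverse:
        return five_prime_pos - minus_shift
    return five_prime_pos + plus_shift


print(shift_cut_site(1000, is_reverse=False))  # 1004
print(shift_cut_site(1200, is_reverse=True))   # 1195
```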
results/post_alignment/mito-ATAC/
Mitochondrial read fraction statistics produced before deduplication. Tab-separated file containing total read count and mitochondrial read fraction for each sample.
results/post_alignment/samtools_stats/
Raw samtools stats output files consumed by the QC gate.
{sample}_postFiltering.stats.txt
Full samtools stats output. The QC gate script extracts sequences, reads duplicated, and percentage of properly paired reads from the SN section.
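A minimal sketch of how the SN (summary numbers) section could be read is shown below. This is not the pipeline's QC gate script; the function name `parse_sn_section` and the example values are made up, and only the tab-separated `SN` line layout of samtools stats is assumed.

```python
def parse_sn_section(stats_text):
    """Collect `SN` lines from samtools stats output into a {name: value} dict."""
    metrics = {}
    for line in stats_text.splitlines():
        if not line.startswith("SN\t"):
            continue  # skip comments and the other report sections
        fields = line.split("\t")
        key = fields[1].rstrip(":")
        try:
            metrics[key] = float(fields[2])
        except ValueError:
            continue  # ignore any non-numeric summary value
    return metrics


# Made-up excerpt in the samtools stats layout.
EXAMPLE = (
    "# This file was produced by samtools stats\n"
    "SN\tsequences:\t1000\n"
    "SN\treads duplicated:\t50\t# PCR or optical duplicate bit set\n"
    "SN\tpercentage of properly paired reads (%):\t98.5\n"
    "FFQ\t1\t0\t0\n"
)

sn = parse_sn_section(EXAMPLE)
duplication_rate = sn["reads duplicated"] / sn["sequences"]
print(sn["percentage of properly paired reads (%)"], duplication_rate)
```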
Metrics & QC
results/metrics_qc/tss_enrichment/
TSS enrichment scores computed by tss_enrichment.R.
{sample}_tss_enrichment.txt
Tab-separated file with two columns: sample name and TSS enrichment score. Consumed by parse_qc_metrics.py.
{sample}_tss_enrichment.pdf
TSS enrichment profile plot (signal ± 2 kb around TSSs).
results/metrics_qc/picard/
Picard tool outputs for alignment and insert-size QC.
CollectAlignmentSummaryMetrics/{sample}.alignment_metrics.txt
Picard alignment summary: total reads, mapped rate, strand balance, and chimeric read fraction.
CollectInsertSizeMetrics/{sample}.insert_metrics.txt
Picard insert-size summary statistics, including median and mean insert size.
CollectInsertSizeMetrics/{sample}.insert_size_histogram.pdf
Insert-size frequency histogram PDF showing nucleosome banding pattern.
results/metrics_qc/cross_correlation/
NSC/RSC strand cross-correlation outputs from phantompeakqualtools. Tab-separated file containing NSC, RSC, estimated fragment length, and phantom peak shift.
Cross-correlation profile plot.
results/metrics_qc/fragment_size_analysis/
Fragment size distribution plots and summary statistics.
{sample}_fragment_sizes.pdf
Histogram of fragment size distribution with nucleosomal banding annotations.
{sample}_fragment_stats.txt
Summary statistics (NFR fraction, mono-nucleosomal fraction, di-nucleosomal fraction).
QC Gate
Per-sample pass/fail trigger files and structured QC data. Downstream rules require {sample}_qc_pass.txt as an explicit Snakemake input. Single-line trigger file: {sample}\tPASSED or {sample}\tFAILED. Snakemake uses this file as a dependency checkpoint for all downstream rules.
Structured JSON QC report containing per-metric values, targets, and statuses. See the QC Thresholds reference for the full schema.
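For scripts that need to respect the gate outside Snakemake, the single-line trigger format described above can be read as follows. The function name `read_qc_trigger` is hypothetical; only the `{sample}\tPASSED` / `{sample}\tFAILED` layout comes from the description.

```python
def read_qc_trigger(line):
    """Parse a one-line QC trigger record into (sample, passed)."""
    sample, status = line.rstrip("\n").split("\t")
    if status not in {"PASSED", "FAILED"}:
        raise ValueError(f"unexpected QC status: {status!r}")
    return sample, status == "PASSED"


print(read_qc_trigger("ctrl_1\tPASSED\n"))  # ('ctrl_1', True)
print(read_qc_trigger("ctrl_2\tFAILED"))    # ('ctrl_2', False)
```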
Peak Calling
results/peak_calling/macs2_peakcall/
Raw peak calls from MACS2.
{sample}_peaks.narrowPeak
ENCODE narrowPeak format: chromosome, start, end, name, score, strand, fold-change, −log₁₀(p-value), −log₁₀(q-value), summit offset.
Single-base-pair peak summits BED file.
MACS2 peak spreadsheet with extended statistics.
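The ten narrowPeak columns listed above can be read with a few lines of Python. The sketch below parses one record and derives the absolute summit position from the summit offset, which is counted from the peak start. The class and function names are illustrative, and the example record is made up.

```python
from typing import NamedTuple


class NarrowPeak(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    score: int
    strand: str
    signal_value: float   # fold-change
    p_value: float        # -log10(p-value)
    q_value: float        # -log10(q-value)
    summit_offset: int    # offset of the summit from `start`

    @property
    def summit(self):
        """Absolute 0-based summit coordinate."""
        return self.start + self.summit_offset


def parse_narrowpeak_line(line):
    """Parse one tab-separated narrowPeak record."""
    f = line.rstrip("\n").split("\t")
    return NarrowPeak(
        f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
        float(f[6]), float(f[7]), float(f[8]), int(f[9]),
    )


peak = parse_narrowpeak_line("chr1\t100\t600\tpeak_1\t250\t.\t8.5\t30.2\t27.1\t240\n")
print(peak.chrom, peak.summit)  # chr1 340
```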
results/peak_calling/filtered_peaks/
Blacklist-filtered peaks, the primary peak set used by all downstream analyses.
{sample}_filtered_peaks.bed
narrowPeak file with ENCODE blacklist regions removed. Input to FRiP calculation, heatmap, peak annotation, motif analysis, and TOBIAS.
results/peak_calling/frip_calculation/
FRiP score output files consumed by the QC gate. Tab-separated file: sample name and FRiP score (e.g., SAMPLE\t0.342).
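FRiP is the fraction of reads that fall inside called peaks. The sketch below computes it from two counts and writes a line in the tab-separated layout shown above. The helper names and the three-decimal formatting are assumptions chosen to match the example, not the pipeline's actual code.

```python
def frip(reads_in_peaks, total_reads):
    """Fraction of Reads in Peaks: reads overlapping peaks / all counted reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return reads_in_peaks / total_reads


def format_frip_line(sample, score):
    """Render `sample<TAB>score`; three decimals is an assumed precision."""
    return f"{sample}\t{score:.3f}"


print(format_frip_line("SAMPLE", frip(342, 1000)))  # SAMPLE	0.342
```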
results/peak_calling/idr/
Irreproducible Discovery Rate outputs for replicate concordance analysis.
idr_peaks/{condition}_rep{N}_rep{M}_idr_peaks.bed
Peaks passing the IDR threshold (default 0.05) between replicate pairs.
optimal_peaks/{condition}_optimal_peaks.bed
Final optimal peak set selected by the IDR analysis.
plots/{condition}_rep{N}_rep{M}_idr_plot.png
IDR diagnostic scatter plot.
results/peak_calling/consensus_peaks/
Multi-sample merged consensus peak set. Non-redundant consensus peak set merging peaks present in at least min_samples (default: 2) samples, with peaks within merge_distance (default: 100 bp) collapsed.
Tab-separated matrix showing how many samples each consensus peak was called in.
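One plausible reading of the min_samples and merge_distance parameters is sketched below: peaks from all samples are pooled, intervals separated by at most merge_distance are collapsed, and a merged interval is kept only if enough distinct samples contributed to it. This is an illustrative interpretation, not the pipeline's actual algorithm, and the function name and input layout are assumptions.

```python
def consensus_peaks(peaks_by_sample, min_samples=2, merge_distance=100):
    """Merge nearby peaks across samples and keep those with enough support.

    `peaks_by_sample` maps sample name -> list of (chrom, start, end).
    Returns a list of (chrom, start, end, n_supporting_samples).
    """
    pooled = sorted(
        (chrom, start, end, sample)
        for sample, peaks in peaks_by_sample.items()
        for chrom, start, end in peaks
    )
    merged = []
    for chrom, start, end, sample in pooled:
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start - last[2] <= merge_distance:
            last[2] = max(last[2], end)   # extend the open interval
            last[3].add(sample)           # record the supporting sample
        else:
            merged.append([chrom, start, end, {sample}])
    return [
        (chrom, start, end, len(samples))
        for chrom, start, end, samples in merged
        if len(samples) >= min_samples
    ]


example = {
    "a": [("chr1", 100, 200), ("chr1", 1000, 1100)],
    "b": [("chr1", 250, 400)],
    "c": [("chr2", 100, 200)],
}
print(consensus_peaks(example))  # [('chr1', 100, 400, 2)]
```

The fourth element of each tuple corresponds to the kind of per-peak sample count that the accompanying matrix reports.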
results/peak_calling/count_peaks/
Read count matrix over consensus peaks for DESeq2 input. Tab-separated count matrix: rows are consensus peaks, columns are samples.
results/peak_calling/differential_accessibility/
DESeq2-based differential chromatin accessibility results.
diff_accessibility_results.tsv
Full DESeq2 results table: peak coordinates, base mean, log₂FC, standard error, Wald statistic, p-value, and adjusted p-value (FDR).
Volcano plot: −log₁₀(FDR) vs log₂ fold-change, with significant peaks highlighted.
MA plot: log₂FC vs mean accessibility, coloured by significance.
PCA of variance-stabilised count data across all samples.
results/peak_calling/tobias/
TOBIAS bias-corrected TF footprinting results.
corrected_bw/{sample}_corrected.bw
Tn5 bias-corrected ATAC-seq signal BigWig (ATACorrect output).
TOBIAS footprint score BigWig (ScoreBigwig output).
BINDetect output directory: per-TF binding scores, differential binding plots, and a summary table across conditions.
HINT-ATAC footprint calls via the RGT toolkit. BED file of predicted TF-bound footprint regions.
results/peak_calling/chromvar/
chromVAR TF motif accessibility deviation scores. Raw chromVAR deviation score matrices per TF motif (RDS and TSV formats).
GC-bias-corrected deviation scores.
Heatmaps and variability plots of TF deviation scores across samples.
results/peak_calling/peak_annotation/
Genomic feature annotations for filtered peaks.
{sample}_peak_annotation.txt
HOMER or ChIPseeker annotation table: peak coordinates + nearest gene, genomic feature category (promoter, intron, exon, intergenic), distance to TSS.
results/peak_calling/motif_analysis/
HOMER de novo and known motif enrichment results per sample.
{sample}/homerResults.html
HOMER de novo motif enrichment results page.
{sample}/knownResults.html
HOMER known motif enrichment results page.
scATAC-seq Outputs
ArchR single-cell ATAC-seq analysis outputs. Only generated when global.mode: "scatac". Raw ArchR Arrow files (one per sample), containing per-cell fragment matrices and metadata.
Arrow files after doublet removal and cell QC filtering (min_tss: 4.0, min_frags: 1000, max_frags: 100000).
clusters/cell_clusters.tsv
Tab-separated file mapping each cell barcode to its Leiden cluster assignment.
UMAP embedding coloured by cluster identity.
Differentially accessible peaks and marker genes per cluster.
doublets/doublet_enrichment.pdf
Doublet enrichment score distribution used to set the doublet_threshold: 0.2 cutoff.
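The cell QC thresholds quoted above for the filtered Arrow files (min_tss, min_frags, max_frags) amount to a simple per-cell predicate. The sketch below expresses it in Python for clarity; ArchR itself is an R package, the function name is hypothetical, and treating the bounds as inclusive is an assumption.

```python
def passes_cell_qc(tss_enrichment, n_frags,
                   min_tss=4.0, min_frags=1000, max_frags=100000):
    """Return True if a cell meets the TSS enrichment and fragment-count bounds.

    Inclusive comparisons are an assumption; check the tool's own behaviour.
    """
    return tss_enrichment >= min_tss and min_frags <= n_frags <= max_frags


print(passes_cell_qc(6.2, 15000))   # True
print(passes_cell_qc(3.1, 15000))   # False: TSS enrichment too low
print(passes_cell_qc(6.2, 250000))  # False: too many fragments
```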
Cicero chromatin co-accessibility outputs.
connections/coaccessibility_connections.rds
R RDS file containing the full co-accessibility connection object from Cicero.
connections/coaccessibility_table.tsv
Tab-separated connection table: Peak1, Peak2, co-accessibility score (0–1).
BED file of identified Cis-Co-Accessibility Networks (CCANs).
Visualization
results/visualization/bigwig/
Raw signal BigWig files converted from sorted bedGraph. BigWig coverage track from the Tn5-shifted BAM, for genome browser loading and deepTools analysis.
results/visualization/normalized_coverage/
CPM-normalised BigWig tracks for cross-sample comparability. Counts Per Million normalised BigWig, produced by bamCoverage --normalizeUsing CPM.
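CPM normalisation scales each bin's read count by the number of mapped reads in millions, which is what makes tracks comparable across samples of different depth. The sketch below shows the arithmetic only; the function name is illustrative and this is not how bamCoverage is implemented internally.

```python
def cpm(bin_counts, total_mapped_reads):
    """Scale per-bin read counts to Counts Per Million mapped reads."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    scale = 1_000_000 / total_mapped_reads
    return [count * scale for count in bin_counts]


# A library of 2 million mapped reads: every count is halved.
print(cpm([10, 0, 25], 2_000_000))  # [5.0, 0.0, 12.5]
```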
results/visualization/heatmap/
deepTools heatmap plots and data matrices centred on filtered peaks.
plot/{sample}_tss_heatmap.pdf
Read-density heatmap PDF, ±3 kb around peak centres, coloured by coolwarm palette.
matrix/{sample}_heatmap_matrix.gz
Compressed deepTools matrix file for replotting or downstream analysis.
results/visualization/correlation_analysis/
Inter-sample BigWig correlation analysis outputs. Pearson/Spearman correlation heatmap across all samples.
Raw correlation coefficient matrix in tab-separated format.
Reporting
results/reporting/multiqc/
Consolidated MultiQC report aggregating all tool QC outputs. Interactive HTML report combining fastp, FastQC, Picard, samtools, Qualimap, preseq, and QC gate metrics across all samples.
Raw JSON and TSV data files extracted by MultiQC for programmatic use.
results/reporting/pipeline_execution_summary.json
Structured JSON summary written by the Snakemake onsuccess lifecycle hook. Contains run metadata, sample list, mode, and completion timestamp. Consumed by atacseq_tool.py to return a structured status string to AI agents.
results/reporting/benchmark_summary.tsv
Aggregated benchmark table produced by the benchmark_summary rule. Columns include rule name, sample, wall-clock time (seconds), CPU time (seconds), and peak memory (MB) drawn from individual benchmarks/*.txt files.
Benchmarks
Per-rule, per-sample Snakemake benchmark files in tab-separated format. Each file records: s (wall-clock seconds), h:m:s (human-readable time), max_rss (peak RSS memory in MB), max_vms, max_uss, max_pss, io_in, io_out, mean_load, cpu_time.
benchmarks/
├── fastp/{sample}.txt
├── bowtie2/{sample}.txt
├── samtools_markdup/{sample}.txt
├── macs2/{sample}.txt
├── idr/{condition}_rep{N}_rep{M}.txt
└── multiqc/multiqc.txt
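A benchmark file with the columns listed above can be loaded with the standard csv module. The sketch below is not the benchmark_summary rule; the function name, output keys, and example values are made up for illustration.

```python
import csv
import io


def read_benchmark(text):
    """Parse a Snakemake benchmark file into a list of per-run dicts."""
    rows = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {
            "wall_s": float(row["s"]),
            "max_rss_mb": float(row["max_rss"]),
            "cpu_s": float(row["cpu_time"]),
        }
        for row in rows
    ]


# Made-up single-run benchmark in the column layout described above.
EXAMPLE = (
    "s\th:m:s\tmax_rss\tmax_vms\tmax_uss\tmax_pss\tio_in\tio_out\tmean_load\tcpu_time\n"
    "125.4\t0:02:05\t2048.5\t4096.0\t2000.1\t2010.2\t10.0\t20.0\t350.0\t430.2\n"
)

runs = read_benchmark(EXAMPLE)
print(runs[0]["wall_s"], runs[0]["max_rss_mb"], runs[0]["cpu_s"])
```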
Logs
Per-rule, per-sample stderr and stdout log files. Log file extensions follow the global convention: .err for stderr, .log for combined output.
logs/
├── fastp/{sample}.log
├── bowtie2/{sample}.log
├── samtools_markdup/{sample}.log
├── qc_gate/{sample}.log
├── macs2/{sample}.log
└── multiqc/multiqc.log