The sorted BAM that leaves the alignment stage is not yet suitable for peak calling. It contains reads from the mitochondrial genome (which are massively over-represented in ATAC-seq), PCR duplicates that inflate apparent fragment depth, reads that aligned poorly or to artifactual loci in the ENCODE blacklist, and, critically, reads whose 5′ ends have not been shifted to reflect the true Tn5 insertion site. The post-alignment stage corrects all of these problems in a strict eight-step chain. Each step is an independent Snakemake rule, so jobs can be parallelized across samples and an individual failure is isolated rather than crashing the entire pipeline.
Filtering Chain at a Glance
Step 1 — Calculate Mitochondrial Reads
Before removing mitochondrial reads, the pipeline quantifies how many exist. This produces a mito_stats.txt file that is consumed by the QC reporting layer.
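The page does not show how the mitochondrial statistic is computed, so the following is only a minimal sketch of one way to derive it, assuming the input is `samtools idxstats` output (tab-separated reference name, length, mapped reads, unmapped reads). The function name `mito_fraction` and the example counts are hypothetical and are not taken from the pipeline.

```python
def mito_fraction(idxstats_text: str, mito_chr: str = "chrMT") -> float:
    """Return the fraction of mapped reads assigned to mito_chr.

    Assumes `samtools idxstats` layout: name, length, mapped, unmapped.
    """
    mito = 0
    total = 0
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue
        name, _length, mapped, _unmapped = line.split("\t")
        total += int(mapped)
        if name == mito_chr:
            mito += int(mapped)
    # Guard against an empty BAM so the caller never divides by zero.
    return mito / total if total else 0.0


# Made-up counts, for illustration only.
example = "chr1\t248956422\t700\t0\nchrMT\t16569\t300\t0\n*\t0\t0\t25\n"
print(f"mitochondrial fraction: {mito_fraction(example):.2%}")
```

Passing `mito_chr="chrM"` would cover genomes that use the other naming convention.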
The mitochondrial chromosome is referenced as chrMT in this pipeline’s reference genome. If your genome uses chrM instead, update mitoATAC_calculate.params.mito_chr and remove_mito_reads.params.mito_chr in config.yaml.
Step 2 — samtools fixmate
samtools fixmate populates mate-pair information (insert size, mate coordinate, mate strand) in each BAM record. samtools markdup requires this information to correctly identify optical and PCR duplicates in paired-end data.
Step 3 — samtools markdup
Duplicate reads arise from PCR amplification of the same original DNA fragment. Rather than removing them outright (which would destroy information needed for library complexity estimates), the pipeline marks them with the SAM duplicate flag (0x400) by default. Downstream rules and the QC gate can then compute duplication rates from the marked BAM without losing data.
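The rule maps the remove_duplicates setting to the samtools markdup -r flag dynamically, so duplicates are removed outright only when that setting is enabled.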
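The mapping from the setting to the flag can be sketched as follows. This is an illustration rather than the pipeline's actual rule: it assumes remove_duplicates is a boolean value, and the helper names `markdup_args` and `markdup_command` are hypothetical. The only samtools behavior relied on is that `markdup -r` removes duplicates instead of flagging them.

```python
from typing import List


def markdup_args(remove_duplicates: bool) -> List[str]:
    """Translate the boolean setting into extra samtools markdup arguments."""
    # -r makes samtools markdup drop duplicates instead of only setting 0x400.
    return ["-r"] if remove_duplicates else []


def markdup_command(in_bam: str, out_bam: str, remove_duplicates: bool = False) -> List[str]:
    """Assemble the argument list for a samtools markdup call."""
    return ["samtools", "markdup", *markdup_args(remove_duplicates), in_bam, out_bam]


# Default behavior: mark only, so duplication rates can still be computed later.
print(" ".join(markdup_command("sample.sorted.fixmate.bam", "sample.sorted.dedup.bam")))
```

Keeping the default at mark-only matches the rationale above: the duplicate flag preserves the information needed for library complexity estimates.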
Step 4 — Remove Mitochondrial Reads
After duplicate marking, reads on the mitochondrial chromosome are excluded. The exclusion is performed by chromosome name (chrMT) using samtools view.
Output: results/post_alignment/remove_mito_reads/{sample}_noMT.sorted.bam
Step 5 — samtools view (MAPQ Filter)
Multi-mapping reads and reads with alignment ambiguity are removed by requiring a minimum mapping quality of MAPQ ≥ 30. In addition, bitwise flag filtering excludes unmapped reads, secondary alignments, supplementary alignments, reads failing platform QC, and improperly paired reads:

| Hex | Decimal | Meaning |
|---|---|---|
| 0x4 | 4 | Read unmapped |
| 0x100 | 256 | Secondary alignment |
| 0x200 | 512 | Not passing platform/vendor QC checks |
| 0x400 | 1024 | PCR or optical duplicate |
| 0x800 | 2048 | Supplementary alignment |

Output: results/post_alignment/samtools_view/{sample}.filtered.pre_blacklist.bam
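To make the bitwise logic concrete, the sketch below combines the five exclusion bits from the table into a single mask and applies it together with the MAPQ threshold. It covers only the bits listed in the table; a proper-pair requirement would be a separate required-flag check and is not modeled here. The names `EXCLUDE_MASK` and `passes_filter` are hypothetical.

```python
# Exclusion bits taken from the flag table in this step.
EXCLUDE_FLAGS = {
    0x4: "read unmapped",
    0x100: "secondary alignment",
    0x200: "not passing platform/vendor QC checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}

# OR the bits together; the result is the value a single exclusion mask would carry.
EXCLUDE_MASK = 0
for bit in EXCLUDE_FLAGS:
    EXCLUDE_MASK |= bit


def passes_filter(flag: int, mapq: int, min_mapq: int = 30) -> bool:
    """Keep a read only if no excluded bit is set and MAPQ meets the threshold."""
    return (flag & EXCLUDE_MASK) == 0 and mapq >= min_mapq


print(EXCLUDE_MASK, hex(EXCLUDE_MASK))
```

A read with flag 99 (paired, properly paired, mate reverse, first in pair) and MAPQ 42 passes, while the same read with the duplicate bit added (flag 1123) does not.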
Step 6 — Remove Blacklist Reads
The ENCODE blacklist contains genomic regions (centromeres, telomeres, low-complexity repeats) that consistently produce anomalously high read counts regardless of the biological experiment. These regions produce artifactual peaks that overwhelm downstream analysis. The pipeline uses bedtools intersect -v to exclude any read whose alignment overlaps a blacklist interval. The -v flag inverts the intersection, retaining only reads that do not overlap any blacklist region.
Output: results/post_alignment/samtools_view/{sample}.filtered.bam
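The effect of the inverted intersection can be illustrated in plain Python. This is a conceptual sketch, not how bedtools is implemented: it assumes BED-style 0-based, half-open intervals and treats any overlap of at least one base as a hit. The names `overlaps` and `drop_blacklisted` and the coordinates are hypothetical.

```python
from typing import List, Tuple

# (chrom, start, end) with 0-based, half-open coordinates, as in BED.
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def drop_blacklisted(reads: List[Interval], blacklist: List[Interval]) -> List[Interval]:
    """Keep only reads that overlap no blacklist interval (the -v behavior)."""
    return [r for r in reads if not any(overlaps(r, b) for b in blacklist)]


blacklist = [("chr1", 1000, 2000)]
reads = [("chr1", 100, 150), ("chr1", 990, 1040), ("chr2", 1100, 1150)]
print(drop_blacklisted(reads, blacklist))
```

The linear scan over the blacklist is fine for a demonstration; real tools use sorted or indexed intervals to avoid comparing every read against every region.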
Step 7 — Tn5 Insertion-Site Correction
This is the most ATAC-seq-specific step of the entire pipeline. Tn5 transposase cuts the DNA and inserts its adapter payload at the cut site, but the physical insertion spans 9 bp. After alignment, the 5′ end of each read is therefore offset from the true cut site, and the correction shifts it by +4 bp on the forward strand and −5 bp on the reverse strand. ATAC-seq signal analysis assumes that reads start exactly at the cut site; without this correction, footprints and TSS enrichment profiles are blurred by up to 9 bp. The pipeline applies the shift with alignmentSieve --ATACshift from the deepTools suite.
Output: results/post_alignment/tn5_shift/{sample}.filtered.shifted.bam
Step 8 — samtools stats (Post-Filter QC)
After the full filtering chain completes, samtools stats is run on the mito-removed BAM to collect alignment statistics (total reads, properly paired count, duplicate count) that are later consumed by parse_qc_metrics.py at the QC gate.
Output: results/post_alignment/samtools_stats/{sample}_postFiltering.stats.txt
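The statistics named above live in the summary-number (SN) section of the samtools stats report. The sketch below shows one way a QC script could read them; it is not the contents of parse_qc_metrics.py, the function names are hypothetical, and the counts in the example report are made up.

```python
from typing import Dict, Union

Number = Union[int, float, str]


def _to_number(text: str) -> Number:
    """Convert a field to int or float where possible, else return it unchanged."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def parse_summary_numbers(stats_text: str) -> Dict[str, Number]:
    """Collect the SN lines of a samtools stats report into a dictionary."""
    metrics: Dict[str, Number] = {}
    for line in stats_text.splitlines():
        if not line.startswith("SN\t"):
            continue
        fields = line.split("\t")
        # fields: "SN", "<key>:", "<value>", optional "# comment"
        metrics[fields[1].rstrip(":")] = _to_number(fields[2])
    return metrics


# Made-up report fragment, for illustration only.
report = (
    "# This file was produced by samtools stats\n"
    "SN\traw total sequences:\t1000\t# excluding supplementary and secondary reads\n"
    "SN\treads properly paired:\t950\t# proper-pair bit set\n"
    "SN\treads duplicated:\t120\t# PCR or optical duplicate bit set\n"
)
metrics = parse_summary_numbers(report)
duplication_rate = metrics["reads duplicated"] / metrics["raw total sequences"]
print(metrics, duplication_rate)
```

A QC gate could compare values such as `duplication_rate` against thresholds of its choosing; the thresholds themselves are outside the scope of this sketch.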
Complete Output Summary
| Rule | Output File | Next Consumed By |
|---|---|---|
| calculate_mito_reads | mito_stats.txt | QC reporting |
| samtools_fixmate | {sample}.sorted.fixmate.bam | samtools_markdup |
| samtools_markdup | {sample}.sorted.dedup.bam | Picard metrics, remove_mito_reads |
| remove_mito_reads | {sample}_noMT.sorted.bam | samtools_view, samtools_stats |
| samtools_view | {sample}.filtered.pre_blacklist.bam | remove_blacklist_reads |
| remove_blacklist_reads | {sample}.filtered.bam | tn5_shift, TOBIAS, HINT-ATAC |
| tn5_shift | {sample}.filtered.shifted.bam | MACS2, BigWig, TSS enrichment |
| samtools_stats | {sample}_postFiltering.stats.txt | QC gate (parse_qc_metrics.py) |