parse_qc_metrics.py is the QC gate engine. It is invoked once per sample by the qc_gate Snakemake rule after FRiP calculation, TSS enrichment scoring, and samtools stats have completed. The script reads the three metric files, evaluates each measurement against its configured threshold, applies WARN/FAIL tiering, and writes three output files: a plain-text log, a structured JSON report, and a single-line trigger file that downstream Snakemake rules consume as a mandatory input dependency.
python3 rules/scripts/parse_qc_metrics.py \
  --sample SAMPLE_NAME \
  --frip-file results/peak_calling/frip_calculation/SAMPLE_frip.txt \
  --tss-file results/metrics_qc/tss_enrichment/SAMPLE_tss_enrichment.txt \
  --stats-file results/post_alignment/samtools_stats/SAMPLE_samtools_stats.txt \
  --min-frip 0.2 \
  --min-tss 7.0 \
  --min-mapping-rate 80.0 \
  --max-duplicate-rate 20.0 \
  --log logs/qc_gate/SAMPLE.log \
  --output results/qc_gate/SAMPLE_qc_pass.txt \
  --json-output results/qc_gate/SAMPLE_qc_pass.json

Arguments

--sample
string
required
Sample name. Used as the key in all output files and in the human-readable log header (QC Report for SAMPLE_NAME). Must match the sample column value in the sample sheet.
--frip-file
string
required
Path to the FRiP score file produced by the frip_calculation rule. The parser handles three formats:
  • A single-line file containing only the numeric FRiP value.
  • A two-column tab-separated file (e.g., SAMPLE\t0.342) — value is taken from column 2.
  • A headered TSV (first line contains "sample") — value is taken from column 2 of the second line.
--tss-file
string
required
Path to the TSS enrichment score file produced by tss_enrichment.R. The parser handles:
  • A single-line file with just the numeric TSS score.
  • A two-column tab-separated file (e.g., SAMPLE\t9.17).
  • A headered TSV where the header contains "sample" or "tss".
--stats-file
string
required
Path to the samtools stats output file produced by the samtools_stats rule. The parser scans lines starting with SN and extracts:
  • sequences → total read count
  • properly paired → properly paired read count
  • percentage of properly paired reads → mapping rate (%)
  • reads duplicated → duplicate read count
--min-frip
float
required
Minimum FRiP threshold. Read from qc_gate.params.min_frip in config.yaml (default: 0.2). The sample FAILS if the measured FRiP is strictly less than this value.
--min-tss
float
required
Minimum TSS enrichment threshold. Read from qc_gate.params.min_tss_enr in config.yaml (default: 7.0). The sample FAILS if TSS enrichment is strictly less than this value.
--min-mapping-rate
float
required
Minimum mapping rate percentage. Read from qc_gate.params.min_mapping_rate in config.yaml (default: 80.0). The value compared is the percentage of properly paired reads field from samtools stats, which is already expressed as a percentage (0–100).
--max-duplicate-rate
float
required
Maximum duplicate rate percentage. Read from qc_gate.params.max_duplicate_rate in config.yaml (default: 20.0). Duplicate rate is calculated as (reads_duplicated / sequences) × 100. The sample FAILS if this derived value is strictly greater than the threshold.
--log
string
required
Path to write the plain-text QC log. Parent directories are created automatically. ANSI colour codes are stripped from the log file (they are preserved on stdout for terminal display).
--output
string
required
Path to write the Snakemake trigger file. Parent directories are created automatically. Downstream rules list this file under their input: directive, which enforces the QC gate dependency.
--json-output
string
required
Path to write the structured JSON QC report. Suitable for MultiQC custom content modules, pipeline dashboards, or programmatic post-processing.
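The argument list above maps naturally onto an argparse declaration. The following is a minimal sketch of how such an interface could be declared, not the script's actual source: the function name build_parser is hypothetical, and only the flag names, types, and required status are taken from this page.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical reconstruction of the CLI documented above."""
    parser = argparse.ArgumentParser(description="QC gate (illustrative sketch)")
    parser.add_argument("--sample", required=True)
    # Input metric files
    for flag in ("--frip-file", "--tss-file", "--stats-file"):
        parser.add_argument(flag, required=True)
    # Numeric thresholds, all parsed as floats
    for flag in ("--min-frip", "--min-tss",
                 "--min-mapping-rate", "--max-duplicate-rate"):
        parser.add_argument(flag, type=float, required=True)
    # Output paths
    for flag in ("--log", "--output", "--json-output"):
        parser.add_argument(flag, required=True)
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args([
    "--sample", "ctrl_rep1",
    "--frip-file", "frip.txt",
    "--tss-file", "tss.txt",
    "--stats-file", "stats.txt",
    "--min-frip", "0.2",
    "--min-tss", "7.0",
    "--min-mapping-rate", "80.0",
    "--max-duplicate-rate", "20.0",
    "--log", "qc.log",
    "--output", "qc_pass.txt",
    "--json-output", "qc_pass.json",
])
print(args.sample, args.min_frip, args.json_output)
```

argparse converts dashes in flag names to underscores, so --json-output is read back as args.json_output.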

WARN / FAIL Tiering Logic

For each metric the script applies a two-tier evaluation. A value that meets its threshold but lies within 10% of it is reported as WARN rather than PASS:
| Metric | Direction | FAIL condition | WARN condition |
| --- | --- | --- | --- |
| FRiP | >= | val < min_frip | min_frip ≤ val < min_frip × 1.1 |
| TSS Enrichment | >= | val < min_tss | min_tss ≤ val < min_tss × 1.1 |
| Mapping Rate | >= | val < min_mapping_rate | min_mapping_rate ≤ val < min_mapping_rate × 1.1 |
| Duplicate Rate | <= | val > max_duplicate_rate | max_duplicate_rate × 0.9 < val ≤ max_duplicate_rate |
WARN samples receive an overall "PASSED" result and proceed through the pipeline. FAIL samples receive an overall "FAILED" result. If any input file cannot be parsed, the affected metric defaults to 0.0 (or 100.0 for duplicate rate), the overall field is set to "FAILED", and the script continues rather than raising an exception — preventing a parsing error in one sample from blocking all others in the batch.
The script exits with code 0 even when a sample FAILS QC. This is intentional: Snakemake uses the existence of {sample}_qc_pass.txt (not the process exit code) as the rule completion signal. Downstream rules inspect the file contents to gate their own execution.
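The tiering rules in this section can be sketched as a small function. This is an illustration under stated assumptions rather than the script's own code: the names classify and overall_result are hypothetical, and the boundaries follow the FAIL and WARN conditions given above.

```python
def classify(val: float, target: float, higher_is_better: bool = True) -> str:
    """Return "PASS", "WARN", or "FAIL" for one metric (hypothetical helper)."""
    if higher_is_better:
        # FRiP, TSS enrichment, mapping rate: fail strictly below the minimum.
        if val < target:
            return "FAIL"
        return "WARN" if val < target * 1.1 else "PASS"
    # Duplicate rate: fail strictly above the maximum.
    if val > target:
        return "FAIL"
    return "WARN" if val > target * 0.9 else "PASS"


def overall_result(statuses) -> str:
    """WARN does not block a sample; any FAIL does."""
    return "FAILED" if "FAIL" in statuses else "PASSED"


statuses = [
    classify(0.342, 0.2),                          # FRiP
    classify(7.21, 7.0),                           # TSS enrichment, borderline
    classify(93.4, 80.0),                          # mapping rate
    classify(12.6, 20.0, higher_is_better=False),  # duplicate rate
]
print(statuses, overall_result(statuses))
```

With the default thresholds, a TSS enrichment of 7.21 lands in the WARN band because it is at least 7.0 but below 7.7, and the sample still receives an overall "PASSED".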

Output Files

Trigger File (--output)

A single line consumed by downstream Snakemake rules:
SAMPLE_NAME\tPASSED
or
SAMPLE_NAME\tFAILED
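A consumer of the trigger file only needs to split this one line. The sketch below assumes the separator is a literal tab character and that the file holds exactly one record; the helper names parse_trigger_line and sample_passed are hypothetical.

```python
from pathlib import Path


def parse_trigger_line(line: str) -> tuple[str, str]:
    """Split 'SAMPLE<TAB>STATUS' into its two fields (assumed layout)."""
    sample, status = line.strip().split("\t")
    if status not in ("PASSED", "FAILED"):
        raise ValueError(f"unexpected QC status: {status!r}")
    return sample, status


def sample_passed(trigger_path: str) -> bool:
    """Read a trigger file from disk and report whether the sample passed."""
    _, status = parse_trigger_line(Path(trigger_path).read_text())
    return status == "PASSED"


print(parse_trigger_line("ctrl_rep1\tPASSED\n"))
```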

JSON Output (--json-output)

{
    "sample": "ctrl_rep1",
    "metrics": {
        "frip": {
            "val": 0.342,
            "target": 0.2,
            "status": "PASS"
        },
        "tss": {
            "val": 9.17,
            "target": 7.0,
            "status": "PASS"
        },
        "mapping": {
            "val": 93.4,
            "target": 80.0,
            "status": "PASS"
        },
        "duplicates": {
            "val": 12.6,
            "target": 20.0,
            "status": "PASS"
        }
    },
    "overall": "PASSED"
}
Each status field is one of "PASS", "WARN", or "FAIL". The overall field is "PASSED" or "FAILED".
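For programmatic post-processing, the report can be loaded with the standard json module. The following sketch uses the example report from this section; the helper name flagged_metrics is hypothetical and not part of the pipeline.

```python
import json

# Example report copied from the JSON Output section.
REPORT_TEXT = """
{
    "sample": "ctrl_rep1",
    "metrics": {
        "frip": {"val": 0.342, "target": 0.2, "status": "PASS"},
        "tss": {"val": 9.17, "target": 7.0, "status": "PASS"},
        "mapping": {"val": 93.4, "target": 80.0, "status": "PASS"},
        "duplicates": {"val": 12.6, "target": 20.0, "status": "PASS"}
    },
    "overall": "PASSED"
}
"""


def flagged_metrics(report: dict) -> dict:
    """Map each metric that is not a clean PASS to its status."""
    return {
        name: entry["status"]
        for name, entry in report["metrics"].items()
        if entry["status"] != "PASS"
    }


report = json.loads(REPORT_TEXT)
print(report["sample"], report["overall"], flagged_metrics(report))
```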

Text Log (--log)

ANSI-stripped plain-text version of the console output:
QC Report for ctrl_rep1
-------------------------------
[PASS] FRIP: 0.342
[WARN] TSS: 7.210 (Borderline)
[PASS] MAPPING: 93.400
[PASS] DUPLICATES: 12.600
-------------------------------
OVERALL RESULT: PASSED
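The relationship between the coloured console output and the plain-text log can be sketched as follows. The escape codes, the regular expression, and the helper names are assumptions chosen for illustration; only the line layout ("[STATUS] NAME: value" with three decimals and a "(Borderline)" suffix on WARN) is taken from the log above.

```python
import re

# Matches ANSI SGR (colour/style) escape sequences such as "\x1b[33m".
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")

# Hypothetical colour choices; the real script's colours are not documented here.
COLOURS = {"PASS": "\x1b[32m", "WARN": "\x1b[33m", "FAIL": "\x1b[31m"}
RESET = "\x1b[0m"


def format_line(name: str, val: float, status: str) -> str:
    """Build one coloured console line in the layout shown in the log."""
    suffix = " (Borderline)" if status == "WARN" else ""
    return f"{COLOURS[status]}[{status}]{RESET} {name.upper()}: {val:.3f}{suffix}"


def strip_ansi(text: str) -> str:
    """Remove colour codes so the text is safe to write to a log file."""
    return ANSI_SGR.sub("", text)


console_line = format_line("tss", 7.21, "WARN")
print(console_line)              # coloured on a terminal
print(strip_ansi(console_line))  # plain text for the log file
```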

Metric Parsing Details

FRiP file: the parser opens the file and reads its non-empty lines. If the first line contains the string "sample" (case-insensitive), the parser uses the second line. Otherwise it uses the first line. For tab-separated lines, it returns column index 1 (0-based); for single-column lines, it returns the entire line as a float.
TSS file: the parser opens the file and reads its non-empty lines. If the first line contains "sample" or "tss", the parser uses the second line. For lines with two or more tab-separated columns, it returns column index 1; for single-column lines, it returns the entire value.
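The FRiP and TSS parsing rules described above share one shape, which can be sketched as a single function. This is an illustrative reading of the prose, not the script's code: parse_score and its header_keywords parameter are hypothetical, and the case-insensitive header check is assumed to apply to the TSS file as well.

```python
def parse_score(text: str, header_keywords=("sample",)) -> float:
    """Extract a numeric score from any of the three documented layouts."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("no non-empty lines in metric file")
    # A first line containing a header keyword means the data is on line 2.
    if any(key in lines[0].lower() for key in header_keywords):
        if len(lines) < 2:
            raise ValueError("header present but no data line")
        row = lines[1]
    else:
        row = lines[0]
    columns = row.split("\t")
    # Two or more columns: value is column index 1. Otherwise the whole line.
    return float(columns[1]) if len(columns) >= 2 else float(columns[0])


print(parse_score("0.342\n"))                              # single value
print(parse_score("ctrl_rep1\t0.342\n"))                   # two-column TSV
print(parse_score("sample\tfrip\nctrl_rep1\t0.342\n"))     # headered TSV
print(parse_score("TSS_score\n9.17\n", ("sample", "tss"))) # TSS header
```

Under this reading, a headerless two-column file whose sample name itself contains "sample" would be mistaken for a header row. Whether the real script guards against that case is not stated on this page.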
samtools stats file: the parser iterates only over lines that start with SN (the Summary Numbers section). For each SN line, it checks for substring matches against four keys ("sequences", "properly paired", "percentage of properly paired reads", "reads duplicated"), using a colon-agnostic approach that is robust across samtools versions. Values are parsed with a two-pass converter that tries int first and then float; it also strips % characters and handles scientific notation.
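A minimal sketch of the SN-line parsing and the derived duplicate rate follows. It deliberately departs from the substring matching described above: field names are compared exactly after removing a trailing colon, because samtools stats also emits lines such as "raw total sequences" that contain the substring "sequences". All function names, the KEYS mapping, and the 100.0 fallback for an empty file are assumptions made for illustration.

```python
def to_number(token: str):
    """Two-pass conversion: try int first, then float (handles '1.2e3')."""
    token = token.strip().rstrip("%")
    try:
        return int(token)
    except ValueError:
        return float(token)


# samtools field name (trailing colon removed) -> key used in this sketch
KEYS = {
    "sequences": "sequences",
    "reads properly paired": "properly_paired",
    "percentage of properly paired reads (%)": "mapping_rate",
    "reads duplicated": "reads_duplicated",
}


def parse_sn(text: str) -> dict:
    """Collect the four metrics from the Summary Numbers (SN) section."""
    stats = {}
    for line in text.splitlines():
        if not line.startswith("SN"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        name = fields[1].strip().rstrip(":")  # colon-agnostic comparison
        if name in KEYS:
            stats[KEYS[name]] = to_number(fields[2])
    return stats


def duplicate_rate(stats: dict) -> float:
    """(reads_duplicated / sequences) x 100, worst case if no reads (assumed)."""
    if not stats.get("sequences"):
        return 100.0
    return stats["reads_duplicated"] / stats["sequences"] * 100


EXAMPLE = (
    "# This file was produced by samtools stats\n"
    "SN\traw total sequences:\t2000\n"
    "SN\tsequences:\t2000\n"
    "SN\treads properly paired:\t1868\n"
    "SN\treads duplicated:\t252\t# PCR or optical duplicate bit set\n"
    "SN\tpercentage of properly paired reads (%):\t93.4\n"
)
stats = parse_sn(EXAMPLE)
print(stats["mapping_rate"], f"{duplicate_rate(stats):.1f}")
```

The example input is synthetic and sized so that the derived values line up with the mapping and duplicates figures in the JSON example on this page.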