The GEOAgent bridge (rules/scripts/geo_agent_bridge.py) is the integration layer between AI-assisted dataset discovery tools — such as GEOAgent or bioStream — and the BDB-Genomics ATAC-seq pipeline. Starting from a single metadata CSV exported by those tools, the bridge infers sample conditions and replicates, builds a pipeline-ready config_geo.yaml (a copy of config.yaml with the global.samples field pointing at the new sample sheet), and optionally downloads every SRA run and converts it to paired-end FASTQ. The result is a fully configured project directory that can be handed directly to snakemake.

Input CSV format

The bridge accepts comma- or tab-delimited CSVs. Column names are detected automatically and matched case-insensitively. The following columns are recognised:

| Column name(s) | Required | Description |
|---|---|---|
| Sample_ID, Sample_Name, or sample | ✅ Yes | Unique sample identifier used as the pipeline wildcard |
| SRR, Run, or SRR_ID | ✅ Yes | SRA run accession(s) to download |
| SRR_AWS_URL or aws_url | No | Direct S3 URL for faster download via the SRA AWS mirror |
| Title | No | Free-text sample title used for condition/replicate inference |
| GSE_ID | No | GEO series accession (metadata only, not used at runtime) |
| GSM_ID | No | GEO sample accession (metadata only, not used at runtime) |

A single Sample_ID may appear on multiple rows (one row per SRR run). The bridge merges all runs for a sample into a single pair of FASTQ files.
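
The column matching and per-sample run grouping described above can be sketched roughly as follows. This is an illustration only, not the bridge's actual code: the helper names `find_column` and `group_runs_by_sample` are hypothetical, and choosing the delimiter from the header line alone is an assumption.

```python
import csv
import io
from collections import defaultdict

# Accepted header aliases, taken from the column names listed above (lower-cased).
SAMPLE_COLUMNS = ("sample_id", "sample_name", "sample")
RUN_COLUMNS = ("srr", "run", "srr_id")


def find_column(fieldnames, aliases):
    """Return the first header matching one of the aliases, ignoring case."""
    lookup = {name.strip().lower(): name for name in fieldnames}
    for alias in aliases:
        if alias in lookup:
            return lookup[alias]
    return None


def group_runs_by_sample(text):
    """Map each sample ID to the list of its SRR runs, in file order."""
    # Assumption: the delimiter is picked by counting tabs vs commas in the header.
    header = text.splitlines()[0]
    delimiter = "\t" if header.count("\t") > header.count(",") else ","
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    sample_col = find_column(reader.fieldnames, SAMPLE_COLUMNS)
    run_col = find_column(reader.fieldnames, RUN_COLUMNS)
    if sample_col is None or run_col is None:
        raise ValueError("metadata needs a sample column and an SRR/Run column")
    runs = defaultdict(list)
    for row in reader:
        runs[row[sample_col].strip()].append(row[run_col].strip())
    return dict(runs)


example = (
    "Sample_Name,run,Title\n"
    "S1,SRR001,WT rep1\n"
    "S1,SRR002,WT rep1\n"
    "S2,SRR003,KO rep1\n"
)
print(group_runs_by_sample(example))
# {'S1': ['SRR001', 'SRR002'], 'S2': ['SRR003']}
```

Here sample S1 ends up with two runs, which is the case where the bridge would merge the runs into one pair of FASTQ files.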

Condition and replicate inference

The bridge calls infer_metadata(sample_id, title) on every row. It searches the concatenated Title and Sample_ID strings using case-insensitive pattern matching:

  • Replicate number: extracted by the regex (?:rep|replicate)[_-]?(\d+). If no match is found, replicate defaults to 1.
  • Condition: assigned by keyword scanning in priority order. Any match on control, ctrl, wildtype, wt, untreated, mock, input, or naive assigns condition = "control".
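
A minimal sketch of these inference rules is shown below. It is not the bridge's own infer_metadata: the function name `infer_metadata_sketch`, the return order, plain substring matching for keywords, and returning None when no control keyword matches are all assumptions made for illustration.

```python
import re

# Regex and keyword list as quoted in the text above.
REPLICATE_RE = re.compile(r"(?:rep|replicate)[_-]?(\d+)", re.IGNORECASE)
CONTROL_KEYWORDS = (
    "control", "ctrl", "wildtype", "wt",
    "untreated", "mock", "input", "naive",
)


def infer_metadata_sketch(sample_id, title=""):
    """Return (condition, replicate) inferred from the title and sample ID."""
    text = f"{title} {sample_id}".lower()

    # Replicate: first regex hit, otherwise the documented default of 1.
    match = REPLICATE_RE.search(text)
    replicate = int(match.group(1)) if match else 1

    # Condition: assumption -- plain substring scan; None means "no control keyword".
    condition = "control" if any(word in text for word in CONTROL_KEYWORDS) else None
    return condition, replicate


print(infer_metadata_sketch("GSM1_WT_rep2", "ATAC wildtype"))  # ('control', 2)
print(infer_metadata_sketch("KO_1", "knockout sample"))        # (None, 1)
```

Note that a plain substring scan is loose: a short keyword such as wt also occurs inside words like "growth", which is one reason to review the inferred assignments before running the pipeline.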

Usage

1. Prepare your metadata CSV

Export the metadata CSV from GEOAgent or bioStream. Ensure it contains at least the Sample_ID (or Sample_Name) and SRR (or Run) columns. Save it somewhere accessible, e.g., path/to/ATAC_meta.csv.

2. Generate config and sample sheet (no download)

    python3 rules/scripts/geo_agent_bridge.py path/to/ATAC_meta.csv

This writes two files:

  • config_geo.yaml: a copy of config.yaml with global.samples overridden to point at the new sample sheet
  • data/fastp/samples_geo.tsv: the pipeline sample sheet with inferred replicate and condition columns

FASTQ paths in the sample sheet are set to where the bridge expects files to be placed (data/fastq/<sample_id>_R1.fastq and data/fastq/<sample_id>_R2.fastq). You can populate those paths manually or proceed to the next step to download them automatically.
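
If you populate the FASTQ files manually, a quick check that every expected file is in place can save a failed run. The sketch below builds the expected paths from the naming pattern given above; the helper names `expected_fastqs` and `missing_fastqs` are hypothetical and not part of the bridge.

```python
from pathlib import Path


def expected_fastqs(sample_id, fastq_dir="data/fastq"):
    """Return the (R1, R2) paths the sample sheet points at for one sample."""
    base = Path(fastq_dir)
    return base / f"{sample_id}_R1.fastq", base / f"{sample_id}_R2.fastq"


def missing_fastqs(sample_ids, fastq_dir="data/fastq"):
    """List every expected FASTQ path that does not exist yet."""
    missing = []
    for sample_id in sample_ids:
        for path in expected_fastqs(sample_id, fastq_dir):
            if not path.is_file():
                missing.append(path)
    return missing


for path in missing_fastqs(["S1", "S2"]):
    print(f"missing: {path}")
```

An empty result means all paired-end files are where the sample sheet expects them.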

3. Generate config + download SRA FASTQs

    python3 rules/scripts/geo_agent_bridge.py path/to/ATAC_meta.csv --download

With --download the bridge:

  • Downloads each SRR via the SRR_AWS_URL column if present (using wget, then curl as a fallback)
  • Falls back to prefetch + fasterq-dump (or fastq-dump) from SRA-tools if no AWS URL is available
  • Merges multiple runs for the same sample using binary concatenation
  • Cleans up intermediate .sra files automatically to save disk space

SRA-tools (prefetch, fasterq-dump) must be installed and on $PATH if you are not supplying SRR_AWS_URL values. The bridge checks for each tool with shutil.which() before attempting the download.
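
The tool fallback order and the binary merge listed above might look roughly like this. It is a sketch under stated assumptions, not the bridge's implementation: `pick_fetch_tools` and `merge_runs` are hypothetical names, the `which` parameter exists only so the logic can be exercised without the tools installed, and falling through to SRA-tools when an AWS URL is present but neither wget nor curl is installed is an assumption.

```python
import shutil
from pathlib import Path


def pick_fetch_tools(aws_url, which=shutil.which):
    """Return the tool names to use for one run, following the documented order."""
    if aws_url:
        for tool in ("wget", "curl"):  # wget first, curl as fallback
            if which(tool):
                return [tool]
    if which("prefetch"):
        for dumper in ("fasterq-dump", "fastq-dump"):
            if which(dumper):
                return ["prefetch", dumper]
    raise RuntimeError("no usable download tool found on PATH")


def merge_runs(parts, merged):
    """Concatenate per-run FASTQ files byte for byte into one file."""
    with open(merged, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(merged)


# Simulate a machine that has only curl installed.
print(pick_fetch_tools("https://example.invalid/SRR001", which=lambda t: t == "curl"))
# ['curl']
```

Binary concatenation is safe for FASTQ because each record is self-contained, so appending one run's file to another yields a valid FASTQ file.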

4. Review the generated files

    # Inspect the sample sheet
    cat data/fastp/samples_geo.tsv

    # Spot-check the config override
    head -20 config_geo.yaml

Verify that the condition and replicate assignments match your experimental design. If any assignments need correcting, edit data/fastp/samples_geo.tsv directly; it is a plain tab-delimited file.
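
For larger sample sheets, a short summary of replicates per condition can make the review quicker than reading the TSV by eye. The sketch below assumes the sheet has columns named condition and replicate; the exact header names are not given here, so treat them (and the helper name `summarize_design`) as hypothetical and adjust them to the real file.

```python
import csv
import io
from collections import defaultdict


def summarize_design(tsv_text, condition_col="condition", replicate_col="replicate"):
    """Return {condition: sorted replicate numbers} from a tab-delimited sheet."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    design = defaultdict(list)
    for row in reader:
        design[row[condition_col]].append(int(row[replicate_col]))
    return {cond: sorted(reps) for cond, reps in design.items()}


# Hypothetical sheet contents; in practice read data/fastp/samples_geo.tsv instead.
sheet = (
    "sample\tcondition\treplicate\n"
    "S1\tcontrol\t1\n"
    "S2\tcontrol\t2\n"
    "S3\ttreated\t1\n"
)
print(summarize_design(sheet))
# {'control': [1, 2], 'treated': [1]}
```

Repeated replicate numbers within one condition, or a condition with a single replicate, are the usual signs that an inferred assignment needs a manual fix.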

5. Run the pipeline with the generated config

    snakemake --configfile config_geo.yaml --use-conda --cores 8

The --configfile flag merges config_geo.yaml on top of the base config.yaml, so only the global.samples key is overridden. All other tool parameters, resource limits, and reference paths remain exactly as defined in config.yaml.
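
The effect of layering one config on top of another can be illustrated with a small recursive merge. This is not Snakemake's own code, only a sketch of the idea; the function name `overlay` and every key other than global.samples are made up for the example.

```python
import copy


def overlay(base, override):
    """Return a copy of base with override merged in, key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay(merged[key], value)  # descend into nested sections
        else:
            merged[key] = copy.deepcopy(value)  # scalar or new key: replace
    return merged


# Hypothetical base config; only global.samples is named in this guide.
base = {
    "global": {"samples": "data/samples.tsv", "threads": 8},
    "peaks": {"qvalue": 0.05},
}
geo = {"global": {"samples": "data/fastp/samples_geo.tsv"}}

merged = overlay(base, geo)
print(merged["global"]["samples"])  # data/fastp/samples_geo.tsv
print(merged["global"]["threads"])  # 8
print(merged["peaks"]["qvalue"])    # 0.05
```

Only the overridden key changes; sibling keys in the same section and all other sections keep their base values.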

CLI reference

    python3 rules/scripts/geo_agent_bridge.py --help

| Flag | Default | Description |
|---|---|---|
| meta_csv | (positional) | Path to the GEOAgent / bioStream metadata CSV |
| --fastq-dir | data/fastq | Directory where FASTQ files are saved / expected |
| --sra-dir | data/sra | Temporary directory for intermediate .sra files |
| --out-samples | data/fastp/samples_geo.tsv | Output sample sheet path |
| --out-config | config_geo.yaml | Output config YAML path |
| --download | false | Download and extract SRA files automatically |
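
For reference, an argparse definition that would produce the interface in the table above could look like the following. It mirrors the documented flags and defaults, but it is a reconstruction for illustration: the function name `build_parser` is hypothetical, and treating --download as a store_true switch is an assumption based on its false default.

```python
import argparse


def build_parser():
    """Build a parser matching the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        description="Convert a GEOAgent / bioStream metadata CSV into pipeline inputs."
    )
    parser.add_argument("meta_csv", help="Path to the metadata CSV")
    parser.add_argument("--fastq-dir", default="data/fastq",
                        help="Directory where FASTQ files are saved / expected")
    parser.add_argument("--sra-dir", default="data/sra",
                        help="Temporary directory for intermediate .sra files")
    parser.add_argument("--out-samples", default="data/fastp/samples_geo.tsv",
                        help="Output sample sheet path")
    parser.add_argument("--out-config", default="config_geo.yaml",
                        help="Output config YAML path")
    # Assumption: a boolean switch that is off unless the flag is given.
    parser.add_argument("--download", action="store_true",
                        help="Download and extract SRA files automatically")
    return parser


args = build_parser().parse_args(["path/to/ATAC_meta.csv", "--download"])
print(args.fastq_dir, args.download)  # data/fastq True
```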

Zenodo publishing

After a successful pipeline run, you can archive and publish the results to Zenodo using the included deposit script.

    export ZENODO_TOKEN="your_sandbox_token_here"
    python3 rules/scripts/zenodo_deposit.py

This uploads a draft deposition to sandbox.zenodo.org, which is safe for testing: the draft is not publicly visible and does not mint a real DOI.

The script automatically reads title, version, keywords, abstract, license, and authors from CITATION.cff, packages the repository via git archive, and uploads the resulting zip to Zenodo’s bucket API.

Your token must have the deposit:write and deposit:actions scopes. Generate sandbox tokens at sandbox.zenodo.org/account/settings/applications and production tokens at zenodo.org/account/settings/applications.
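
To make the CITATION.cff step more concrete, the sketch below maps an already-parsed CITATION.cff dictionary onto a Zenodo deposition metadata payload. It is an assumption about how such a mapping could work, not the deposit script's code: the function name `cff_to_zenodo_metadata` is hypothetical, YAML parsing and the upload itself are left out, and the license value is passed through unchanged even though Zenodo may expect a different identifier form.

```python
def cff_to_zenodo_metadata(cff):
    """Build a Zenodo deposition payload from a parsed CITATION.cff dict."""
    creators = []
    for author in cff.get("authors", []):
        if "name" in author:  # entity authors (e.g. a consortium) carry a single name
            name = author["name"]
        else:  # person authors: Zenodo-style "Family, Given"
            family = author.get("family-names", "")
            given = author.get("given-names", "")
            name = f"{family}, {given}".strip(", ")
        creator = {"name": name}
        if author.get("affiliation"):
            creator["affiliation"] = author["affiliation"]
        creators.append(creator)

    metadata = {
        "upload_type": "software",
        "title": cff["title"],
        "description": cff.get("abstract", ""),
        "creators": creators,
    }
    # Optional fields are copied only when CITATION.cff provides them.
    for key in ("version", "keywords", "license"):
        if key in cff:
            metadata[key] = cff[key]
    return {"metadata": metadata}


# Hypothetical CITATION.cff contents after YAML parsing.
cff = {
    "title": "Example pipeline",
    "version": "1.0.0",
    "abstract": "An example abstract.",
    "authors": [{"family-names": "Doe", "given-names": "Jane"}],
}
print(cff_to_zenodo_metadata(cff)["metadata"]["creators"])
# [{'name': 'Doe, Jane'}]
```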
