The GEOAgent bridge (rules/scripts/geo_agent_bridge.py) is the integration layer between AI-assisted dataset discovery tools, such as GEOAgent or bioStream, and the BDB-Genomics ATAC-seq pipeline. Starting from a single metadata CSV exported by those tools, the bridge infers sample conditions and replicates, builds a pipeline-ready config_geo.yaml (a copy of config.yaml with the global.samples field pointing at the new sample sheet), and optionally downloads every SRA run and converts it to paired-end FASTQ. The result is a fully configured project directory that can be handed directly to snakemake.
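The config override described above can be pictured with a minimal sketch. It assumes the YAML has already been parsed into nested dictionaries; the function name `override_samples` and the `genome` key are hypothetical and not taken from the bridge's source.

```python
import copy


def override_samples(config, sample_sheet):
    """Return a deep copy of config with global.samples pointing at sample_sheet."""
    updated = copy.deepcopy(config)  # leave the original config untouched
    updated.setdefault("global", {})["samples"] = sample_sheet
    return updated


# Hypothetical stand-in for the parsed contents of config.yaml.
base = {"global": {"samples": "data/samples.tsv", "genome": "hg38"}}
geo = override_samples(base, "data/fastp/samples_geo.tsv")
print(geo["global"]["samples"])
```

Working on a copy keeps the original config.yaml settings intact, which matches the idea of writing a separate config_geo.yaml.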
Input CSV format
The bridge accepts comma- or tab-delimited CSVs. Column names are detected automatically and matched case-insensitively. The following columns are recognised:

| Column name(s) | Required | Description |
|---|---|---|
| Sample_ID, Sample_Name, or sample | ✅ Yes | Unique sample identifier used as the pipeline wildcard |
| SRR, Run, or SRR_ID | ✅ Yes | SRA run accession(s) to download |
| SRR_AWS_URL or aws_url | No | Direct S3 URL for faster download via the SRA AWS mirror |
| Title | No | Free-text sample title used for condition/replicate inference |
| GSE_ID | No | GEO series accession (metadata only, not used at runtime) |
| GSM_ID | No | GEO sample accession (metadata only, not used at runtime) |
Sample_ID may appear on multiple rows (one row per SRR run). The bridge merges all runs for a sample into a single pair of FASTQ files.
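To make the column matching and the one-row-per-run layout concrete, here is a minimal sketch of how such a CSV could be read. It is not the bridge's actual code: the helper names `find_column` and `group_runs_by_sample` are hypothetical, and the delimiter check (a tab in the header line means tab-delimited) is an assumption.

```python
import csv
import io

# Accepted header aliases from the table above, compared in lower case.
SAMPLE_ALIASES = ("sample_id", "sample_name", "sample")
RUN_ALIASES = ("srr", "run", "srr_id")


def find_column(fieldnames, aliases):
    """Return the first header that matches one of the aliases, ignoring case."""
    lookup = {name.strip().lower(): name for name in fieldnames}
    for alias in aliases:
        if alias in lookup:
            return lookup[alias]
    raise KeyError(f"none of {aliases} found in header {fieldnames}")


def group_runs_by_sample(text):
    """Map each sample ID to the list of its run accessions, in file order."""
    lines = text.splitlines()
    if not lines:
        return {}
    # Assumption: a tab in the header line means the file is tab-delimited.
    delimiter = "\t" if "\t" in lines[0] else ","
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    sample_col = find_column(reader.fieldnames, SAMPLE_ALIASES)
    run_col = find_column(reader.fieldnames, RUN_ALIASES)
    runs = {}
    for row in reader:
        runs.setdefault(row[sample_col].strip(), []).append(row[run_col].strip())
    return runs


# Made-up rows: sample S1 has two runs that would be merged downstream.
demo = "Sample_Name,RUN,Title\nS1,SRR001,WT rep1\nS1,SRR002,WT rep1\nS2,SRR003,KO rep1\n"
print(group_runs_by_sample(demo))
```

Grouping by sample first makes the later merge step straightforward: every accession in a sample's list contributes to the same pair of FASTQ files.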
Condition and replicate inference
The bridge calls infer_metadata(sample_id, title) on every row. It searches the concatenated Title and Sample_ID strings using case-insensitive pattern matching:
Replicate number — extracted by the regex (?:rep|replicate)[_-]?(\d+). If no match is found, replicate defaults to 1.
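The replicate rule can be sketched in a few lines. The regex is the one quoted above; the function name `infer_replicate` and the order in which the title and sample ID are joined are assumptions made for illustration.

```python
import re

# Regex quoted in the text, compiled case-insensitively.
REPLICATE_RE = re.compile(r"(?:rep|replicate)[_-]?(\d+)", re.IGNORECASE)


def infer_replicate(sample_id, title=""):
    """Return the replicate number found in the title or sample ID, else 1."""
    # Assumption: the two strings are joined with a space, title first.
    text = f"{title} {sample_id}"
    match = REPLICATE_RE.search(text)
    return int(match.group(1)) if match else 1


print(infer_replicate("GSM100_WT", "ATAC-seq WT Replicate_3"))
print(infer_replicate("sample_A", "untreated liver"))
```

Note that the pattern allows only an optional underscore or hyphen between the keyword and the digits, so a title such as "Rep 4" (with a space) does not match and falls back to the default of 1.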
Condition — assigned by keyword scanning in priority order:
- Control keywords: control, ctrl, wildtype, wt, untreated, mock, input, naive. Any match assigns condition = "control".
- Treated keywords: checked only when no control keyword matches.
- Default: applied when neither keyword set matches.

Usage
Export the metadata CSV from GEOAgent or bioStream. Ensure it contains at minimum the Sample_ID (or Sample_Name) and SRR (or Run) columns. Save it somewhere accessible, e.g., path/to/ATAC_meta.csv.

Running the bridge on that CSV writes two files:

- config_geo.yaml: a copy of config.yaml with global.samples overridden to point at the new sample sheet
- data/fastp/samples_geo.tsv: the pipeline sample sheet with inferred replicate and condition columns

FASTQ paths in the sample sheet are set to where the bridge expects files to be placed (data/fastq/<sample_id>_R1.fastq / _R2.fastq). You can populate those paths manually or proceed to the next step to download them automatically.

With --download, the bridge:

- downloads from the SRR_AWS_URL column if present (using wget, then curl as a fallback)
- uses prefetch + fasterq-dump (or fastq-dump) from SRA-tools if no AWS URL is available
- removes intermediate .sra files automatically to save disk space

SRA-tools (prefetch, fasterq-dump) must be installed and on $PATH if you are not supplying SRR_AWS_URL values. The bridge checks for each tool with shutil.which() before attempting the download.

```bash
# Inspect the sample sheet
cat data/fastp/samples_geo.tsv
# Spot-check the config override
head -20 config_geo.yaml
```
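Beyond eyeballing the file, a short Python sketch can flag suspicious assignments. It assumes the sheet has columns named `condition` and `replicate`; the `sample` column, the `treated` label and the demo rows are made up, and the helper names are hypothetical.

```python
import csv
import io
from collections import Counter


def count_condition_replicates(handle):
    """Count rows per (condition, replicate) pair in a tab-delimited sheet."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter((row["condition"], int(row["replicate"])) for row in reader)


def repeated_pairs(counts):
    """Return the (condition, replicate) pairs that occur more than once."""
    return sorted(pair for pair, n in counts.items() if n > 1)


# Stand-in for open("data/fastp/samples_geo.tsv"); the rows are made up.
demo = io.StringIO(
    "sample\tcondition\treplicate\n"
    "S1\tcontrol\t1\n"
    "S2\tcontrol\t1\n"
    "S3\ttreated\t1\n"
)
print(repeated_pairs(count_condition_replicates(demo)))
```

A repeated pair often means the replicate number could not be inferred and fell back to 1 for several samples, which is worth checking by hand.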
Verify that the condition and replicate assignments match your experimental design. Edit data/fastp/samples_geo.tsv directly if any assignments need correcting; the TSV is a plain tab-delimited file.

CLI reference
| Flag | Default | Description |
|---|---|---|
| meta_csv | (positional) | Path to the GEOAgent / bioStream metadata CSV |
| --fastq-dir | data/fastq | Directory where FASTQ files are saved / expected |
| --sra-dir | data/sra | Temporary directory for intermediate .sra files |
| --out-samples | data/fastp/samples_geo.tsv | Output sample sheet path |
| --out-config | config_geo.yaml | Output config YAML path |
| --download | false | Download and extract SRA files automatically |
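For readers who think in code, the table maps naturally onto an argparse parser. The sketch below mirrors the flags and defaults listed above; it is not the script's actual source, and treating --download as a boolean switch (`store_true`) is an assumption based on its default of false.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the flags and defaults in the table above."""
    parser = argparse.ArgumentParser(description="GEOAgent bridge (illustrative sketch)")
    parser.add_argument("meta_csv", help="Path to the GEOAgent / bioStream metadata CSV")
    parser.add_argument("--fastq-dir", default="data/fastq",
                        help="Directory where FASTQ files are saved / expected")
    parser.add_argument("--sra-dir", default="data/sra",
                        help="Temporary directory for intermediate .sra files")
    parser.add_argument("--out-samples", default="data/fastp/samples_geo.tsv",
                        help="Output sample sheet path")
    parser.add_argument("--out-config", default="config_geo.yaml",
                        help="Output config YAML path")
    # Assumption: a switch that is off unless the flag is given.
    parser.add_argument("--download", action="store_true",
                        help="Download and extract SRA files automatically")
    return parser


args = build_parser().parse_args(["path/to/ATAC_meta.csv", "--download"])
print(args.meta_csv, args.fastq_dir, args.download)
```

argparse converts the hyphenated flag names to underscored attributes, so --fastq-dir is read back as `args.fastq_dir`.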
Zenodo publishing
After a successful pipeline run, you can archive and publish the results to Zenodo using the included deposit script. It can be run in three modes:

- Sandbox (test)
- Production
- Auto-publish

The script reads title, version, keywords, abstract, license, and authors from CITATION.cff, packages the repository via git archive, and uploads the resulting zip to Zenodo's bucket API.