geo_agent_bridge.py connects GEOAgent or bioStream metadata exports directly to the BDB-Genomics ATAC-seq pipeline. Given a metadata CSV that describes GEO samples and their associated SRR accessions, the script infers conditions, replicates, and FASTQ paths; writes a pipeline-ready TSV sample sheet; and generates a config_geo.yaml config file that inherits all settings from the base config.yaml with only global.samples overridden to point at the generated sample sheet. If the --download flag is passed, it also fetches and extracts SRA FASTQ files automatically.
Arguments
Path to the GEOAgent or bioStream metadata CSV file (positional argument). Accepts comma- or tab-delimited files; the delimiter is auto-detected using Python’s csv.Sniffer. Column names are stripped of surrounding whitespace before parsing.
When --download is supplied, the script downloads SRA data and extracts paired FASTQ files for each sample before writing the sample sheet. Download proceeds via:
- wget or curl from the SRR_AWS_URL column, if present.
- Fallback to prefetch (SRA toolkit) if the AWS URL download fails or the column is absent.
- FASTQ extraction using fasterq-dump (preferred) or fastq-dump.
FASTQ file names follow either the SRR convention ({SRR}_1.fastq, {SRR}_2.fastq) or the sample ID convention ({sample_id}_R1.fastq, {sample_id}_R2.fastq).
Directory where downloaded FASTQ files are saved. Created automatically if it does not exist.
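The download order described for --download can be sketched roughly as follows. This is an illustrative sketch, not the bridge's actual code: the helper names (pick_tool, download_commands, extraction_command, run_first_success) are hypothetical, and the command-line options shown are the standard ones for each tool, not necessarily the ones the script passes.

```python
import shutil
import subprocess
from pathlib import Path


def pick_tool(candidates, which=shutil.which):
    """Return the first executable from candidates found on PATH, else None."""
    for name in candidates:
        if which(name):
            return name
    return None


def download_commands(srr, aws_url, tmp_dir, which=shutil.which):
    """List commands for fetching one run, in the order they should be tried."""
    target = str(Path(tmp_dir) / f"{srr}.sra")
    commands = []
    fetcher = pick_tool(["wget", "curl"], which)
    if aws_url and fetcher == "wget":
        commands.append(["wget", "-O", target, aws_url])
    elif aws_url and fetcher == "curl":
        commands.append(["curl", "-L", "-o", target, aws_url])
    # prefetch is tried last: it covers a missing URL and a failed URL download.
    commands.append(["prefetch", srr, "--output-directory", str(tmp_dir)])
    return commands


def extraction_command(sra_path, fastq_dir, which=shutil.which):
    """Build the FASTQ extraction command, preferring fasterq-dump."""
    tool = pick_tool(["fasterq-dump", "fastq-dump"], which)
    if tool is None:
        raise RuntimeError("neither fasterq-dump nor fastq-dump is on PATH")
    return [tool, "--split-files", "--outdir", str(fastq_dir), str(sra_path)]


def run_first_success(commands, run=subprocess.run):
    """Run commands in order and return the first one that exits with 0."""
    for command in commands:
        if run(command).returncode == 0:
            return command
    return None
```

Passing which and run as parameters keeps the selection logic testable without touching the network or requiring the SRA toolkit to be installed.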
Directory for temporary intermediate SRA archive files. Intermediate .sra files are deleted after successful FASTQ extraction to conserve disk space.
Output path for the generated TSV sample sheet (--out-samples). Parent directories are created automatically.
Output path for the generated pipeline config YAML (--out-config). The base config.yaml is loaded first, and only global.samples is overridden to point at --out-samples.
Input CSV Columns
The following column names are recognised. Column matching is case-sensitive and whitespace-stripped.

| Column name(s) | Required | Description |
|---|---|---|
| Sample_ID, Sample_Name, or sample | Yes | Unique sample identifier. The first non-null value among these three is used. |
| SRR, Run, or SRR_ID | Yes | SRA run accession (e.g., SRR12345678). Multiple SRRs per sample are automatically merged. |
| SRR_AWS_URL or aws_url | No | Direct HTTP(S) URL to the SRA file on AWS S3. Used by --download as the primary download source. |
| Title | No | Free-text sample description. Used by the condition/replicate inference heuristic. |
| GSE_ID | No | GEO Series accession. Stored in the bridge’s internal metadata but not written to the sample sheet. |
| GSM_ID | No | GEO Sample accession. Stored internally but not written to the sample sheet. |
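
As a rough illustration of how the column aliases in the table might be resolved, the sketch below sniffs the delimiter, strips header whitespace, takes the first non-empty identifier and run accession per row, and groups runs by sample. The function names are hypothetical and this is not the bridge's actual parser. Sniffing only the header line, and dropping rows that lack an identifier or a run, are simplifying assumptions of this sketch.

```python
import csv
import io

ID_COLUMNS = ("Sample_ID", "Sample_Name", "sample")
SRR_COLUMNS = ("SRR", "Run", "SRR_ID")


def first_value(row, columns):
    """Return the first non-empty value among the given columns, else None."""
    for column in columns:
        value = (row.get(column) or "").strip()
        if value:
            return value
    return None


def read_samples(text):
    """Map each sample ID to the list of SRR accessions found for it."""
    # Detect comma vs. tab from the header line only (assumption of this sketch).
    dialect = csv.Sniffer().sniff(text.splitlines()[0], delimiters=",\t")
    reader = csv.DictReader(io.StringIO(text), dialect=dialect)
    # Strip surrounding whitespace from column names before matching.
    reader.fieldnames = [name.strip() for name in reader.fieldnames]

    samples = {}
    for row in reader:
        sample_id = first_value(row, ID_COLUMNS)
        srr = first_value(row, SRR_COLUMNS)
        if not sample_id or not srr:
            continue  # incomplete row: nothing to download or map
        samples.setdefault(sample_id, []).append(srr)
    return samples
```

A sample that appears on several rows ends up with several accessions in its list, which corresponds to the "multiple SRRs per sample" case in the table.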
Rows that have no Sample_ID or no SRR value are silently skipped. Multiple rows with the same Sample_ID are merged into one sample entry with multiple SRR runs: the FASTQ files from those runs are concatenated (merged with shutil.copyfileobj) into a single R1 file and a single R2 file.
Condition and Replicate Inference
The bridge automatically infers condition and replicate values for each sample from its Title and Sample_ID strings. This removes the need to manually populate those columns before running the pipeline.
Replicate inference: the bridge runs a case-insensitive regex search for rep[licate][_-]?(\d+) on Title + "_" + Sample_ID. If a match is found, the captured integer is used; otherwise the replicate defaults to 1.
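
A minimal sketch of the replicate heuristic is shown below. It assumes the documented pattern is meant as "rep", optionally followed by "licate", so it is written here as rep(?:licate)?[_-]?(\d+). That reading is an assumption, and the function name infer_replicate is hypothetical.

```python
import re

# Assumption: "rep" or "replicate", an optional "_" or "-", then the number.
REPLICATE_RE = re.compile(r"rep(?:licate)?[_-]?(\d+)", re.IGNORECASE)


def infer_replicate(title: str, sample_id: str) -> int:
    """Return the replicate number found in Title + "_" + Sample_ID, else 1."""
    match = REPLICATE_RE.search(f"{title}_{sample_id}")
    return int(match.group(1)) if match else 1
```

Under this reading, a title such as "ATAC WT rep2" yields replicate 2, while a title with no "rep" token falls back to 1.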
Condition inference: The combined Title + "_" + Sample_ID string is lowercased and searched for keyword lists in order:
- Control keywords (matched first): control, ctrl, wildtype, wt, untreated, mock, input, naive → condition set to "control"
- Treated keywords: treated, knockout, ko, tg, transgenic, mutant, mut, stimulated, stim → condition set to "treated"
- Default (no keyword match): condition set to "GEO_sample"
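
The keyword lookup above can be sketched as follows. This sketch assumes plain substring matching on the lowercased string, which the description does not confirm, and the function name infer_condition is hypothetical.

```python
CONTROL_KEYWORDS = ("control", "ctrl", "wildtype", "wt", "untreated",
                    "mock", "input", "naive")
TREATED_KEYWORDS = ("treated", "knockout", "ko", "tg", "transgenic",
                    "mutant", "mut", "stimulated", "stim")


def infer_condition(title: str, sample_id: str) -> str:
    """Classify a sample as control, treated, or GEO_sample by keyword."""
    text = f"{title}_{sample_id}".lower()
    # Control is checked first, so "untreated" is not caught by "treated".
    if any(keyword in text for keyword in CONTROL_KEYWORDS):
        return "control"
    if any(keyword in text for keyword in TREATED_KEYWORDS):
        return "treated"
    return "GEO_sample"
```

The ordering matters because several treated keywords are substrings of control terms. With substring matching, short keywords such as "wt" or "ko" can also match inside unrelated words, so inferred conditions are worth reviewing in the generated sample sheet.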
Outputs
Sample Sheet (--out-samples)
A tab-separated (TSV) file conforming to the BDB pipeline’s required format.
Config File (--out-config)
A complete config.yaml-compatible YAML file with global.samples pointing at the generated sample sheet. All other settings inherit from config.yaml if it is present in the current working directory.
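
The override behaviour can be illustrated on an already-parsed config. The sketch below assumes global.samples refers to a samples key nested under a top-level global mapping. The function name build_geo_config and the other keys in the example are hypothetical, and YAML loading and dumping are left out to keep the example dependency-free.

```python
import copy


def build_geo_config(base_config: dict, sample_sheet_path: str) -> dict:
    """Return a copy of base_config with only global.samples replaced."""
    config = copy.deepcopy(base_config)  # leave the caller's dict untouched
    config.setdefault("global", {})["samples"] = sample_sheet_path
    return config


# Hypothetical base settings; only the "samples" entry is expected to change.
base = {
    "global": {"samples": "samples.tsv", "threads": 8},
    "peaks": {"caller": "macs2"},
}
geo = build_geo_config(base, "geo/samples_geo.tsv")
```

Copying before modifying means the base settings stay intact, and an empty base config still produces a usable global.samples entry.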
Running the Pipeline After Import
Once the bridge has completed, launch the pipeline with the generated config. Alternatively, use the run_batched.py script for low-resource environments.