atacseq_tool.py exposes the complete BDB-Genomics ATAC-seq pipeline as a callable LangChain tool, making the end-to-end genomic analysis workflow accessible to AI agents and automated orchestration systems. The run_atacseq_pipeline function is decorated with @tool("run_atacseq_pipeline"), which wraps it as a LangChain tool under that name so agents can invoke it with structured arguments. The function handles optional GEOAgent data import, pre-flight validation, and Snakemake execution (standard or batched), then returns a detailed status string that includes the structured execution summary if one is available.
LangChain Import and Graceful Fallback
The tool uses a graceful fallback so it can be imported and called in environments where LangChain is not installed. In that case, run_atacseq_pipeline behaves as a standard Python function and can be called directly without any agent framework.
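A minimal sketch of how such a fallback is commonly written. The import path (langchain_core.tools), the fallback_tool name, and the demo_tool function are assumptions for illustration; the actual module may import the decorator from a different location or structure the fallback differently.

```python
def fallback_tool(*args, **kwargs):
    """No-op stand-in for LangChain's @tool decorator (hypothetical helper)."""
    # Bare usage: @fallback_tool
    if len(args) == 1 and callable(args[0]) and not kwargs:
        return args[0]

    # Parameterized usage: @fallback_tool("some_name")
    def decorator(func):
        return func

    return decorator


try:
    # Assumed import path; the real module may use a different one.
    from langchain_core.tools import tool
except ImportError:
    # LangChain is absent: decorated functions stay plain Python functions.
    tool = fallback_tool


@tool("demo_tool")
def demo_tool(message: str) -> str:
    """Echo the message back (placeholder body for illustration)."""
    return f"received: {message}"
```

With the fallback active, demo_tool("hi") is an ordinary function call. When LangChain is installed, the decorated object is a LangChain tool and is normally called through its invoke method instead.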
Function Signature
Parameters
Execution environment profile. Corresponds to a subdirectory under profile/ in the pipeline root (e.g., profile/local/, profile/slurm/). The profile directory must contain a config.yaml defining Snakemake cluster settings. Supported values: "local", "slurm", "low_resource", "aws", "gcp", "azure", "kubernetes".
Number of CPU cores to pass to snakemake --cores. Applies to local and HPC runs. For cluster profiles (SLURM, cloud), this controls the maximum concurrent local threads; per-rule resources are governed by the profile’s config.yaml.
When true, passes --use-conda --conda-frontend conda to Snakemake so each rule resolves its dependencies via the Conda environment declared in its conda: directive. Requires Conda or Mamba to be available on the PATH.
When set to a positive integer, the tool invokes run_batched.py instead of directly calling snakemake, splitting the sample list into sequential batches of this size. Useful for machines with limited RAM where running all samples concurrently would cause out-of-memory failures. When null, all samples are processed in a single Snakemake invocation.
Optional path to a secondary YAML config file. When provided, it is appended to the Snakemake --configfile argument list, allowing partial overrides of config.yaml without modifying the base file. Only applied when batch_size is null (standard Snakemake mode).
Optional path to a GEOAgent or bioStream metadata CSV. When provided, the tool first invokes geo_agent_bridge.py to generate config_geo.yaml and a sample sheet, then uses config_geo.yaml as the pipeline config for the subsequent Snakemake run. See the geo-agent-bridge reference for CSV format details.
When true and geo_metadata_csv is set, passes --download to geo_agent_bridge.py to automatically fetch SRA FASTQ files before running the pipeline. Requires the SRA toolkit (prefetch, fasterq-dump) on the system PATH.
Root directory of the ATAC-seq pipeline repository. All internal script paths, the config file, and the Snakemake working directory are resolved relative to this path. Defaults to the current working directory.
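The profile requirements above (a supported name, and a profile/<name>/config.yaml under the pipeline root) can be checked before calling the tool. The sketch below is a hypothetical caller-side helper, not part of atacseq_tool.py; the name check_profile and its error wording are assumptions.

```python
from pathlib import Path
from typing import Optional

# Profile names listed in the parameter description above.
SUPPORTED_PROFILES = (
    "local", "slurm", "low_resource", "aws", "gcp", "azure", "kubernetes",
)


def check_profile(project_dir: str, profile: str) -> Optional[str]:
    """Return an error message, or None if the profile looks usable.

    Hypothetical helper: the tool itself may validate profiles differently.
    """
    if profile not in SUPPORTED_PROFILES:
        return f"Error: unsupported profile '{profile}'"
    # The profile directory must contain a config.yaml.
    config = Path(project_dir) / "profile" / profile / "config.yaml"
    if not config.is_file():
        return f"Error: missing profile config at {config}"
    return None
```

A caller could run check_profile(".", "slurm") and skip the pipeline invocation if the result is not None.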
Execution Steps
The function performs the following steps in order:
GEOAgent Bridge (optional)
If geo_metadata_csv is provided, runs python3 rules/scripts/geo_agent_bridge.py <csv_path> [--download] as a subprocess. On failure (non-zero exit code), returns an error string immediately without proceeding. On success, switches the active config path to the generated config_geo.yaml.
Pre-flight Configuration Validation
Runs python3 rules/scripts/validate_config.py <config_path> as a subprocess. On failure, returns a formatted error string containing both stderr and stdout from the validation script. This step mirrors what the Snakefile itself does at parse time, but the tool surfaces it explicitly so agents receive actionable feedback without having to parse Snakemake DAG error logs.
Build Execution Command
Constructs the Snakemake (or run_batched.py) command:
- Standard mode (batch_size = None): snakemake --profile profile/{profile} --cores {cores} [--use-conda --conda-frontend conda] [--configfile ...]
- Batched mode (batch_size set): python3 rules/scripts/run_batched.py --batch-size {batch_size} --cores {cores} --profile profile/{profile} [--conda-frontend conda]
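The two command shapes above can be sketched as a small builder that returns an argument list suitable for subprocess.run. The function name build_command and the parameter names use_conda and configfiles are hypothetical; how the real tool assembles its --configfile list is not shown in this page, so the sketch simply forwards whatever list it is given.

```python
from typing import List, Optional


def build_command(
    profile: str,
    cores: int,
    use_conda: bool,
    batch_size: Optional[int],
    configfiles: List[str],
) -> List[str]:
    """Build the argument list for standard or batched execution (sketch)."""
    if batch_size is not None:
        # Batched mode: delegate to run_batched.py; config overrides are not applied.
        cmd = [
            "python3", "rules/scripts/run_batched.py",
            "--batch-size", str(batch_size),
            "--cores", str(cores),
            "--profile", f"profile/{profile}",
        ]
        if use_conda:
            cmd += ["--conda-frontend", "conda"]
        return cmd

    # Standard mode: a single Snakemake invocation.
    cmd = ["snakemake", "--profile", f"profile/{profile}", "--cores", str(cores)]
    if use_conda:
        cmd += ["--use-conda", "--conda-frontend", "conda"]
    if configfiles:
        # Assumption: all config files are passed to one --configfile option.
        cmd += ["--configfile", *configfiles]
    return cmd
```

Returning a list rather than a shell string avoids quoting problems when paths contain spaces.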
Execute Pipeline
Runs the constructed command via subprocess.run with the project directory as the working directory. Captures both stdout and stderr. Any exception raised during subprocess launch (e.g., snakemake not found on PATH) is caught and returned as an error string.
Load Structured Summary
Checks for results/reporting/pipeline_execution_summary.json. If the file exists, its contents are JSON-parsed and appended to the return string as a formatted Structured Summary: block.
Return Value
The function always returns a str. The agent framework should treat a string beginning with "Pipeline completed successfully!" as a success indicator. All other strings (beginning with "Error:", "Configuration Validation Failed:", "GEOAgent Bridge Import Failed:", or "Pipeline execution failed") indicate failure conditions.
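The summary-loading step and the prefix convention above might be handled along these lines. Both helpers are hypothetical: summary_block assumes a simple "Structured Summary:" header followed by indented JSON, which may not match the tool's exact formatting, and failure_kind is a caller-side classifier rather than part of the tool.

```python
import json
from pathlib import Path
from typing import Optional

SUMMARY_RELPATH = "results/reporting/pipeline_execution_summary.json"
SUCCESS_PREFIX = "Pipeline completed successfully!"
FAILURE_PREFIXES = (
    "Error:",
    "Configuration Validation Failed:",
    "GEOAgent Bridge Import Failed:",
    "Pipeline execution failed",
)


def summary_block(project_dir: str) -> str:
    """Return a formatted summary block, or "" if no summary file exists."""
    path = Path(project_dir) / SUMMARY_RELPATH
    if not path.is_file():
        return ""
    with path.open() as handle:
        data = json.load(handle)
    # Assumed layout; the tool's real formatting may differ.
    return "\n\nStructured Summary:\n" + json.dumps(data, indent=2)


def failure_kind(result: str) -> Optional[str]:
    """Return None on success, otherwise a label for the failure type."""
    if result.startswith(SUCCESS_PREFIX):
        return None
    for prefix in FAILURE_PREFIXES:
        if result.startswith(prefix):
            return prefix.rstrip(":")
    # Anything outside the documented prefixes is still treated as a failure.
    return "Unrecognized"
```

An agent wrapper could branch on failure_kind(result) to decide whether to retry, fix the configuration, or report the summary.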