The batch_runner.py module lets you run many agent tasks in parallel by spawning a pool of AIAgent instances across multiple worker processes. It is purpose-built for generating training trajectories from a large prompt dataset — each prompt gets its own isolated agent, its own VM sandbox, and its own tool-usage statistics. Results are checkpointed after every batch so interrupted runs can be safely resumed.

When to Use Batch Runner vs. Interactive CLI

| Scenario | Use |
| --- | --- |
| Exploratory, interactive work | gauss (interactive CLI) |
| Single-prompt automation or scripting | AIAgent.chat() / run_conversation() directly |
| Processing hundreds or thousands of prompts | batch_runner.py |
| Generating fine-tuning trajectories | batch_runner.py |
| Parallel evaluation benchmarks | batch_runner.py |

Installation

The batch runner is included with the base gauss-agent package. No additional extras are required for basic use.
pip install gauss-agent
# or from source:
pip install -e .
The gauss-agent entry point (from pyproject.toml) maps to run_agent:main. The batch runner is invoked directly as a script:
python batch_runner.py --dataset_file=data.jsonl --batch_size=10 --run_name=my_run

Input Format

The batch runner reads a JSONL file — one JSON object per line. Each line must contain at minimum a "prompt" field:
{"prompt": "Write a Python function that computes the Fibonacci sequence recursively."}
{"prompt": "Explain the difference between a mutex and a semaphore."}
{"prompt": "Refactor this code to use async/await: ..."}
Optional per-prompt fields override default agent behavior for that row:
| Field | Type | Description |
| --- | --- | --- |
| prompt | str | Required. The task description sent to the agent. |
| image | str | Container image override for this task’s sandbox (Docker, Modal, Singularity, or Daytona). |
| docker_image | str | Alias for image. |
| cwd | str | Working directory override for the task’s terminal environment. |
Lines with missing "prompt" fields or invalid JSON are skipped with a warning; the run continues.
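
The skip-on-error rules above can be checked before launching a long run. The following is a minimal sketch, not the runner's own loader: `parse_prompt_lines` is a hypothetical helper that applies the same two rules (valid JSON, a "prompt" field present) to any iterable of lines and reports which lines would be skipped.

```python
import json


def parse_prompt_lines(lines):
    """Return (rows, skipped) for an iterable of JSONL lines.

    Hypothetical pre-flight check; batch_runner.py has its own loader.
    """
    rows, skipped = [], []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignoring blank lines is an assumption of this sketch
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            skipped.append((line_no, "invalid JSON"))
            continue
        if not isinstance(row, dict) or "prompt" not in row:
            skipped.append((line_no, "missing prompt"))
            continue
        rows.append(row)
    return rows, skipped
```

Passing an open file handle works because files iterate line by line, for example `parse_prompt_lines(open("data.jsonl", encoding="utf-8"))`.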

Output Format

All output is written to data/<run_name>/:
data/my_run/
├── batch_0.jsonl          # Results from batch 0 (kept for debugging)
├── batch_1.jsonl          # Results from batch 1
├── ...
├── trajectories.jsonl     # Combined output — all batches merged and filtered
├── checkpoint.json        # Resume state (completed prompt indices)
└── statistics.json        # Aggregate tool-usage and reasoning coverage stats

trajectories.jsonl — per-entry schema

Each line in trajectories.jsonl is a JSON object:
{
  "prompt_index": 42,
  "conversations": [
    {"from": "system", "value": "..."},
    {"from": "human",  "value": "Write a Fibonacci function"},
    {"from": "gpt",    "value": "<think>\n...</think>\n<tool_call>...</tool_call>"},
    {"from": "tool",   "value": "<tool_response>\n...</tool_response>"},
    {"from": "gpt",    "value": "<think>\n</think>\nHere is the function: ..."}
  ],
  "metadata": {
    "batch_num": 4,
    "timestamp": "2025-07-15T10:23:41.123456",
    "model": "anthropic/claude-opus-4.6"
  },
  "completed": true,
  "partial": false,
  "api_calls": 3,
  "toolsets_used": ["file", "terminal"],
  "tool_stats": {
    "read_file":  {"count": 2, "success": 2, "failure": 0},
    "terminal":   {"count": 1, "success": 1, "failure": 0},
    "...": "..."
  },
  "tool_error_counts": {
    "read_file": 0,
    "terminal":  0,
    "...": 0
  }
}
tool_stats and tool_error_counts always include all possible tools with zero defaults, ensuring a consistent schema for loading into HuggingFace Datasets or Apache Arrow/Parquet without schema mismatch errors.

Entries are automatically discarded if the agent produced zero reasoning across all assistant turns (no <REASONING_SCRATCHPAD> and no native thinking tokens). These samples are logged and counted in the summary but not written to trajectories.jsonl.
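
Because every entry carries the same tool_stats keys, per-tool totals can be summed across a run without special-casing missing tools. A minimal sketch, assuming the per-entry schema shown above; `aggregate_tool_stats` is a hypothetical helper and is not part of batch_runner.py.

```python
from collections import defaultdict


def aggregate_tool_stats(entries):
    """Sum count/success/failure per tool across trajectory entries."""
    totals = defaultdict(lambda: {"count": 0, "success": 0, "failure": 0})
    for entry in entries:
        for tool, stats in entry.get("tool_stats", {}).items():
            for key in ("count", "success", "failure"):
                totals[tool][key] += stats.get(key, 0)
    return dict(totals)
```

The run's own aggregate is written to statistics.json; a helper like this is mainly useful for slicing a subset of entries, such as only those with "completed": true.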

statistics.json

{
  "run_name": "my_run",
  "distribution": "default",
  "total_prompts": 500,
  "total_batches": 50,
  "batch_size": 10,
  "model": "anthropic/claude-opus-4.6",
  "completed_at": "2025-07-15T12:00:00",
  "duration_seconds": 3612.4,
  "tool_statistics": { "...": "..." },
  "reasoning_statistics": {
    "total_assistant_turns": 1850,
    "turns_with_reasoning": 1820,
    "turns_without_reasoning": 30
  }
}
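
The reasoning_statistics block makes it easy to compute how many assistant turns carried reasoning. A small sketch using the field names from the example above; `reasoning_coverage` is a hypothetical helper name.

```python
def reasoning_coverage(stats):
    """Fraction of assistant turns that contained reasoning."""
    reasoning = stats["reasoning_statistics"]
    total = reasoning["total_assistant_turns"]
    if total == 0:
        return 0.0
    return reasoning["turns_with_reasoning"] / total


example = {
    "reasoning_statistics": {
        "total_assistant_turns": 1850,
        "turns_with_reasoning": 1820,
        "turns_without_reasoning": 30,
    }
}
print(f"{reasoning_coverage(example):.1%}")  # 1820 / 1850, about 98.4%
```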

CLI Reference

python batch_runner.py [OPTIONS]

Required Arguments

--dataset_file
str
Path to the JSONL input file. Each line must have a "prompt" key.
--batch_size
int
Number of prompts processed per batch. Each batch runs its prompts sequentially inside a single worker process. Multiple batches execute in parallel across --num_workers processes.
--run_name
str
Identifier for this run. Determines the output directory (data/<run_name>/) and the checkpoint file name. Reuse the same name with --resume to continue an interrupted run.
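
As a rough illustration of how --batch_size partitions a dataset, the sketch below splits a prompt list into consecutive batches. It assumes the final batch simply holds the remainder; the runner's actual batching code is not shown in this page and may differ. `make_batches` is a hypothetical name.

```python
import math


def make_batches(prompts, batch_size):
    """Split prompts into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]


batches = make_batches([f"prompt {i}" for i in range(25)], batch_size=10)
print([len(b) for b in batches])  # three batches: 10, 10 and 5 prompts
print(math.ceil(500 / 10))        # 500 prompts at batch_size=10 give 50 batches
```

The second figure matches the total_prompts, batch_size, and total_batches values in the statistics.json example.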

Model and Provider Arguments

--model
str
default:"anthropic/claude-sonnet-4.6"
Model identifier in OpenRouter format passed to every AIAgent instance.
--api_key
str
API key for the model provider. Falls back to OPENROUTER_API_KEY (or provider-specific env vars) when not set.
--base_url
str
default:"https://openrouter.ai/api/v1"
Base URL for the LLM API.
--max_turns
int
default:"10"
Maximum tool-calling iterations per prompt (maps to AIAgent.max_iterations). Keep this low (10–20) for batch generation to control cost; the interactive CLI default is 90.
--max_tokens
int
Maximum tokens per model response. Uses the model’s native default when not set.
--reasoning_effort
str
OpenRouter reasoning effort level. Accepted values: "xhigh", "high", "medium", "low", "minimal", "none". Defaults to "medium" when not specified.
--reasoning_disabled
bool
default:"false"
Completely disable reasoning/thinking tokens. Equivalent to --reasoning_effort=none. Takes precedence over --reasoning_effort.

Concurrency Arguments

--num_workers
int
default:"4"
Number of parallel worker processes (using multiprocessing.Pool). Each worker handles one batch at a time. Set based on available CPU cores and API rate limits. Higher values increase throughput but also API concurrency.

Toolset Distribution Arguments

--distribution
str
default:"default"
Named toolset distribution used to sample which toolsets each prompt receives. Each prompt gets an independently sampled subset. List available distributions with --list_distributions.
--list_distributions
bool
default:"false"
Print all available toolset distributions and their descriptions, then exit.

Resume and Checkpointing Arguments

--resume
bool
default:"false"
Resume from a previous interrupted run. The runner scans all batch_*.jsonl files for completed prompts by matching prompt text content (not just indices), then rebuilds the batch list with only the remaining prompts.
--max_samples
int
Process only the first N samples from the dataset. Useful for quick test runs before committing to a full dataset.

Logging and Output Arguments

--verbose
bool
default:"false"
Enable verbose logging in worker processes. Prints full tracebacks on errors and shows per-prompt toolset selection.
--log_prefix_chars
int
default:"100"
Number of characters to show in log previews for tool arguments and responses.
--ephemeral_system_prompt
str
A system prompt injected into each agent during execution but not saved to output trajectories. Use this for task-framing instructions that should not appear in training data.

OpenRouter Provider Routing Arguments

--providers_allowed
str
Comma-separated list of OpenRouter providers to allow (e.g. "anthropic,google").
--providers_ignored
str
Comma-separated list of OpenRouter providers to exclude (e.g. "together,deepinfra").
--providers_order
str
Comma-separated provider preference order (e.g. "anthropic,openai,google").
--provider_sort
str
Sort providers by "price", "throughput", or "latency".

Prefill Arguments

--prefill_messages_file
str
Path to a JSON file containing an array of prefill messages ([{"role": "user", "content": "..."}, ...]). These messages are prepended to every conversation for few-shot priming.
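
A prefill file can be generated from Python rather than written by hand. The sketch below only checks the shape described above (a JSON array of objects with "role" and "content"); `prefill_json` is a hypothetical helper, and the "assistant" role in the example is an assumption, since only "user" appears in this page.

```python
import json


def prefill_json(messages):
    """Serialize few-shot messages as a JSON array string."""
    for index, message in enumerate(messages):
        if not {"role", "content"} <= set(message):
            raise ValueError(f"message {index} needs 'role' and 'content'")
    return json.dumps(messages, ensure_ascii=False, indent=2)


few_shot = [
    {"role": "user", "content": "Reverse the string 'abc'."},
    {"role": "assistant", "content": "'cba'"},  # role name is an assumption
]
text = prefill_json(few_shot)
```

Writing `text` to a file and passing that path as --prefill_messages_file completes the setup.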

Usage Examples

1. Basic run

python batch_runner.py \
  --dataset_file=data.jsonl \
  --batch_size=10 \
  --run_name=my_run
2. Resume an interrupted run

python batch_runner.py \
  --dataset_file=data.jsonl \
  --batch_size=10 \
  --run_name=my_run \
  --resume
The runner scans data/my_run/batch_*.jsonl for completed prompts using content-based matching and processes only the remainder.
3. Larger parallel run with custom model

python batch_runner.py \
  --dataset_file=prompts.jsonl \
  --batch_size=20 \
  --run_name=opus_run \
  --model="anthropic/claude-opus-4.6" \
  --num_workers=8 \
  --max_turns=15 \
  --reasoning_effort=high
4. Disabled reasoning with token cap

python batch_runner.py \
  --dataset_file=data.jsonl \
  --batch_size=10 \
  --run_name=no_think_run \
  --reasoning_disabled \
  --max_tokens=32000
5. With a prefill messages file

# configs/prefill_opus.json contains few-shot examples
python batch_runner.py \
  --dataset_file=data.jsonl \
  --batch_size=10 \
  --run_name=fewshot_run \
  --prefill_messages_file=configs/prefill_opus.json
6. List available toolset distributions

python batch_runner.py --list_distributions

Programmatic Use

You can also drive BatchRunner directly from Python for tighter integration with your pipeline:
from batch_runner import BatchRunner

runner = BatchRunner(
    dataset_file="data/prompts.jsonl",
    batch_size=10,
    run_name="my_run",
    distribution="default",
    max_iterations=10,
    model="anthropic/claude-opus-4.6",
    num_workers=4,
)

runner.run(resume=False)
Each worker process calls _process_single_prompt(), which instantiates a fresh AIAgent with skip_context_files=True and skip_memory=True hardcoded (always set in batch mode to prevent user-specific files from appearing in trajectories). These are internal defaults and are not exposed as BatchRunner constructor parameters.

Trajectory Format from Code

After a run, load the combined trajectories in Python:
import json

trajectories = []
with open("data/my_run/trajectories.jsonl", "r") as f:
    for line in f:
        trajectories.append(json.loads(line))

print(f"Loaded {len(trajectories)} trajectories")
print(f"First trajectory turns: {len(trajectories[0]['conversations'])}")
print(f"Completed: {trajectories[0]['completed']}")
Or load directly into a HuggingFace dataset:
from datasets import load_dataset

ds = load_dataset("json", data_files="data/my_run/trajectories.jsonl", split="train")
print(ds)
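
The conversations array uses "from"/"value" keys with the speaker names human, gpt, system, and tool. Some training pipelines expect "role"/"content" messages instead. The following sketch converts one entry; the target role names in `ROLE_MAP` are an assumption about the downstream format, not something batch_runner.py produces.

```python
# Assumed mapping from trajectory speaker names to chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}


def to_chat_messages(entry):
    """Convert one trajectory entry to a list of role/content messages."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in entry["conversations"]
    ]
```

An unknown speaker name raises KeyError, which is deliberate here: silently passing it through would hide schema drift.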

Concurrency Model

The batch runner uses Python’s multiprocessing.Pool — not threads — for parallelism. Each worker is a separate OS process with its own memory space.
Parent process
├── Creates Pool(num_workers)
├── Dispatches batch tasks via pool.imap_unordered()
├── Updates rich Progress bar as batches complete
└── Writes incremental checkpoints after each completed batch

Worker process (× num_workers)
└── Receives one batch (list of prompts)
    └── Processes each prompt sequentially
        └── Instantiates AIAgent, calls run_conversation()
Within a batch, prompts are processed sequentially in the worker. Parallelism comes from running multiple batches simultaneously across workers, so roughly num_workers agent conversations are in flight at any moment. Choose num_workers so that the resulting concurrent API calls stay within your rate limit; batch_size mainly controls how much work sits between checkpoints.
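
The dispatch pattern in the diagram can be sketched with the standard library alone. This is an illustration of Pool plus imap_unordered, not the runner's code: `process_batch` is a stand-in that returns placeholder results instead of instantiating AIAgent, and both function names are hypothetical.

```python
from multiprocessing import Pool


def process_batch(batch):
    """Stand-in worker: handle one batch sequentially, prompt by prompt."""
    batch_num, prompts = batch
    results = []
    for prompt in prompts:
        # The real worker would create an agent and run a conversation here.
        results.append({"batch_num": batch_num, "prompt": prompt, "completed": True})
    return batch_num, results


def run_batches(batches, num_workers=4):
    """Dispatch batches to a process pool and collect them as they finish."""
    completed = {}
    with Pool(num_workers) as pool:
        # imap_unordered yields each batch as soon as its worker is done,
        # which is the natural point to update a checkpoint.
        for batch_num, results in pool.imap_unordered(process_batch, batches):
            completed[batch_num] = results
    return completed
```

Call `run_batches` from a script guarded by `if __name__ == "__main__":`, since multiprocessing may re-import the main module in child processes.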
_last_resolved_tool_names is a process-global in model_tools.py. When subagent delegation (delegate_tool.py) is used inside a batch worker, spawned subagents may overwrite this global. Subsequent execute_code calls in the same worker process may then fail with missing tool import errors. Avoid toolsets that trigger subagent delegation in batch runs, or set --max_turns low enough that delegation is unlikely to occur.

Checkpointing and Fault Tolerance

The checkpoint file at data/<run_name>/checkpoint.json is updated incrementally — after each batch completes, not only at the end of the full run. This means:
  • A crash mid-run loses at most one batch worth of work.
  • On --resume, the runner performs content-based matching: it scans all batch_*.jsonl files and extracts the human prompt text from completed conversations. This is more robust than index-based matching and correctly handles dataset re-ordering or index shifts between runs.
Failed prompts (exceptions during run_conversation()) are not written to the batch output file, so they remain eligible for retry on resume.
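
Content-based matching can be sketched as follows. This is an illustration under stated assumptions, not the runner's implementation: it treats the first "human" turn of each saved conversation as the prompt text, and both helper names are hypothetical.

```python
def completed_prompt_texts(entries):
    """Collect the prompt text of every saved conversation."""
    done = set()
    for entry in entries:
        for turn in entry.get("conversations", []):
            if turn.get("from") == "human":
                done.add(turn["value"])
                break  # assumption: the first human turn is the prompt
    return done


def remaining_prompts(dataset, done):
    """Keep dataset rows whose prompt text has not been completed yet."""
    return [row for row in dataset if row["prompt"] not in done]
```

Matching on text rather than position means a re-ordered dataset still resumes correctly, at the cost of treating duplicate prompts as a single task.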

Trajectory Saving and Quality Filters

The batch runner handles all trajectory serialization itself via agent._convert_to_trajectory_format(). It always passes save_trajectories=False to each AIAgent instance to avoid double-writing. Two quality filters are applied before entries reach trajectories.jsonl:
  1. No-reasoning filter — Trajectories where zero assistant turns contain reasoning (no <REASONING_SCRATCHPAD> tag and no native thinking tokens) are discarded. The count appears in the run summary under “Samples discarded (zero reasoning)”.
  2. Invalid tool name filter — At combine time, entries containing tool names not in the master TOOL_TO_TOOLSET_MAP are filtered out. These result from model hallucinations and would break downstream schema validation.
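
The invalid tool name filter amounts to a set-membership check. A minimal sketch, assuming tool names are read from each entry's tool_stats keys and that `valid_tools` holds the known names (in the runner this would come from TOOL_TO_TOOLSET_MAP); `drop_unknown_tools` is a hypothetical helper.

```python
def drop_unknown_tools(entries, valid_tools):
    """Return (kept_entries, dropped_count) after removing unknown tool names."""
    valid = set(valid_tools)
    kept, dropped = [], 0
    for entry in entries:
        names = set(entry.get("tool_stats", {}))
        if names <= valid:
            kept.append(entry)
        else:
            dropped += 1  # at least one tool name is not recognised
    return kept, dropped
```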
Use --verbose only when debugging a single run. Verbose mode in workers produces very high log volume and can slow down multiprocessing output flushing. Separately, the batch runner always sets skip_context_files=True and skip_memory=True on each AIAgent instance internally to prevent user-specific context files from appearing in trajectories.
