Parallel Batch Processing with OpenGauss

The batch_runner.py module lets you run many agent tasks in parallel by spawning a pool of AIAgent instances across multiple worker processes. It is purpose-built for generating training trajectories from a large prompt dataset — each prompt gets its own isolated agent, its own VM sandbox, and its own tool-usage statistics. Results are checkpointed after every batch so interrupted runs can be safely resumed.

When to Use Batch Runner vs. Interactive CLI

Scenario	Use
Exploratory, interactive work	`gauss` (interactive CLI)
Single-prompt automation or scripting	`AIAgent.chat()` / `run_conversation()` directly
Processing hundreds or thousands of prompts	`batch_runner.py`
Generating fine-tuning trajectories	`batch_runner.py`
Parallel evaluation benchmarks	`batch_runner.py`

Installation

The batch runner is included with the base gauss-agent package. No additional extras are required for basic use.

pip install gauss-agent
# or from source:
pip install -e .

The gauss-agent entry point (from pyproject.toml) maps to run_agent:main. The batch runner is invoked directly as a script:

python batch_runner.py --dataset_file=data.jsonl --batch_size=10 --run_name=my_run

Input Format

The batch runner reads a JSONL file — one JSON object per line. Each line must contain at minimum a "prompt" field:

{"prompt": "Write a Python function that computes the Fibonacci sequence recursively."}
{"prompt": "Explain the difference between a mutex and a semaphore."}
{"prompt": "Refactor this code to use async/await: ..."}

Optional per-prompt fields override default agent behavior for that row:

Field	Type	Description
`prompt`	`str`	Required. The task description sent to the agent.
`image`	`str`	Container image override for this task’s sandbox (Docker, Modal, Singularity, or Daytona).
`docker_image`	`str`	Alias for `image`.
`cwd`	`str`	Working directory override for the task’s terminal environment.

Lines with missing "prompt" fields or invalid JSON are skipped with a warning; the run continues.
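
The skip-and-continue behavior can be pictured with a small loader. This is a minimal sketch, not the runner's actual code: the function name `load_prompts`, the logger name, and the warning text are assumptions made for illustration. Only the "prompt" requirement and the skip-with-warning rule come from the description above.

```python
import json
import logging

logger = logging.getLogger("batch_loader_sketch")  # hypothetical logger name


def load_prompts(lines):
    """Parse JSONL lines, keeping only JSON objects that carry a "prompt" key.

    Hypothetical helper that mirrors the documented skip-with-warning rule.
    """
    entries = []
    for line_no, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # blank lines carry no task
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError as exc:
            logger.warning("line %d: invalid JSON (%s), skipping", line_no, exc)
            continue
        if not isinstance(obj, dict) or "prompt" not in obj:
            logger.warning("line %d: missing 'prompt' field, skipping", line_no)
            continue
        entries.append(obj)
    return entries


sample = [
    '{"prompt": "Explain the difference between a mutex and a semaphore."}',
    '{"image": "python:3.11"}',  # no prompt: skipped
    '{not valid json}',          # invalid JSON: skipped
    '{"prompt": "Refactor this code", "cwd": "/workspace"}',
]
loaded = load_prompts(sample)
print(len(loaded), "usable prompts")
```

Running a check like this over a dataset before a long run shows how many rows would be skipped.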

Output Format

All output is written to data/<run_name>/:

data/my_run/
├── batch_0.jsonl          # Results from batch 0 (kept for debugging)
├── batch_1.jsonl          # Results from batch 1
├── ...
├── trajectories.jsonl     # Combined output — all batches merged and filtered
├── checkpoint.json        # Resume state (completed prompt indices)
└── statistics.json        # Aggregate tool-usage and reasoning coverage stats

`trajectories.jsonl` — per-entry schema

Each line in trajectories.jsonl is a JSON object:

{
  "prompt_index": 42,
  "conversations": [
    {"from": "system", "value": "..."},
    {"from": "human",  "value": "Write a Fibonacci function"},
    {"from": "gpt",    "value": "<think>\n...</think>\n<tool_call>...</tool_call>"},
    {"from": "tool",   "value": "<tool_response>\n...</tool_response>"},
    {"from": "gpt",    "value": "<think>\n</think>\nHere is the function: ..."}
  ],
  "metadata": {
    "batch_num": 4,
    "timestamp": "2025-07-15T10:23:41.123456",
    "model": "anthropic/claude-opus-4.6"
  },
  "completed": true,
  "partial": false,
  "api_calls": 3,
  "toolsets_used": ["file", "terminal"],
  "tool_stats": {
    "read_file":  {"count": 2, "success": 2, "failure": 0},
    "terminal":   {"count": 1, "success": 1, "failure": 0},
    "...": "..."
  },
  "tool_error_counts": {
    "read_file": 0,
    "terminal":  0,
    "...": 0
  }
}

tool_stats and tool_error_counts always include all possible tools with zero defaults, ensuring a consistent schema for loading into HuggingFace Datasets or Apache Arrow/Parquet without schema mismatch errors.

Entries are automatically discarded if the agent produced zero reasoning across all assistant turns (no <REASONING_SCRATCHPAD> and no native thinking tokens). These samples are logged and counted in the summary but not written to trajectories.jsonl.
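
The zero-default schema for tool_stats and tool_error_counts can be sketched as follows. This is an illustration under stated assumptions, not the runner's implementation: `ALL_TOOLS` is a hypothetical stand-in for the real tool list, `normalize_tool_stats` is an invented helper name, and treating the error count as equal to the failure count is an assumption.

```python
# Hypothetical stand-in for the runner's full list of tool names.
ALL_TOOLS = ["read_file", "terminal", "web_search"]


def normalize_tool_stats(raw_stats, all_tools=ALL_TOOLS):
    """Return stats for every known tool, filling unused tools with zeros."""
    normalized = {}
    for name in all_tools:
        stats = raw_stats.get(name, {})
        normalized[name] = {
            "count": stats.get("count", 0),
            "success": stats.get("success", 0),
            "failure": stats.get("failure", 0),
        }
    return normalized


def error_counts(normalized):
    """Assumption: the per-tool error count equals the failure count."""
    return {name: stats["failure"] for name, stats in normalized.items()}


# An agent that only used read_file still yields a row with all three tools.
observed = {"read_file": {"count": 2, "success": 2, "failure": 0}}
tool_stats = normalize_tool_stats(observed)
tool_error_counts = error_counts(tool_stats)
print(sorted(tool_stats))
```

Because every row carries the same keys, column-oriented loaders can infer one fixed schema from any row.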

`statistics.json`

{
  "run_name": "my_run",
  "distribution": "default",
  "total_prompts": 500,
  "total_batches": 50,
  "batch_size": 10,
  "model": "anthropic/claude-opus-4.6",
  "completed_at": "2025-07-15T12:00:00",
  "duration_seconds": 3612.4,
  "tool_statistics": { "...": "..." },
  "reasoning_statistics": {
    "total_assistant_turns": 1850,
    "turns_with_reasoning": 1820,
    "turns_without_reasoning": 30
  }
}
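
As a quick sanity check, reasoning coverage can be computed from the reasoning_statistics block. The snippet below is a small sketch that uses the example values shown above inline rather than reading the real file. The variable names are illustrative.

```python
import json

# Example values copied from the statistics.json sample above.
stats = json.loads("""
{
  "total_prompts": 500,
  "batch_size": 10,
  "reasoning_statistics": {
    "total_assistant_turns": 1850,
    "turns_with_reasoning": 1820,
    "turns_without_reasoning": 30
  }
}
""")

reasoning = stats["reasoning_statistics"]
total_turns = reasoning["total_assistant_turns"]

# The two buckets should account for every assistant turn.
accounted = reasoning["turns_with_reasoning"] + reasoning["turns_without_reasoning"]
coverage = reasoning["turns_with_reasoning"] / total_turns

print(f"Reasoning coverage: {coverage:.1%}")  # prints "Reasoning coverage: 98.4%"
```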

CLI Reference

python batch_runner.py [OPTIONS]

Required Arguments

--dataset_file

str

Path to the JSONL input file. Each line must have a "prompt" key.

--batch_size

int

Number of prompts processed per batch. Each batch runs its prompts sequentially inside a single worker process. Multiple batches execute in parallel across --num_workers processes.

--run_name

str

Identifier for this run. Determines the output directory (data/<run_name>/) and the checkpoint file name. Reuse the same name with --resume to continue an interrupted run.

Model and Provider Arguments

--model

str

default: "anthropic/claude-sonnet-4.6"

Model identifier in OpenRouter format passed to every AIAgent instance.

--api_key

str

API key for the model provider. Falls back to OPENROUTER_API_KEY (or provider-specific env vars) when not set.

--base_url

str

default: "https://openrouter.ai/api/v1"

Base URL for the LLM API.

--max_turns

int

default: 10

Maximum tool-calling iterations per prompt (maps to AIAgent.max_iterations). Keep this low (10–20) for batch generation to control cost; the interactive CLI default is 90.

--max_tokens

int

Maximum tokens per model response. Uses the model’s native default when not set.

--reasoning_effort

str

OpenRouter reasoning effort level. Accepted values: "xhigh", "high", "medium", "low", "minimal", "none". Defaults to "medium" when not specified.

--reasoning_disabled

bool

default: false

Completely disable reasoning/thinking tokens. Equivalent to --reasoning_effort=none. Takes precedence over --reasoning_effort.
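
The interaction between --reasoning_effort and --reasoning_disabled can be summarized in a few lines. This is a sketch of the documented precedence only. The function name `effective_reasoning_effort` is hypothetical, and how the runner forwards the value to the API is not shown here.

```python
VALID_EFFORTS = ("xhigh", "high", "medium", "low", "minimal", "none")


def effective_reasoning_effort(reasoning_effort=None, reasoning_disabled=False):
    """Resolve the effort level using the precedence described above."""
    if reasoning_disabled:
        return "none"  # --reasoning_disabled wins over any effort value
    if reasoning_effort is None:
        return "medium"  # documented default when the flag is omitted
    if reasoning_effort not in VALID_EFFORTS:
        raise ValueError(f"unknown reasoning effort: {reasoning_effort!r}")
    return reasoning_effort


print(effective_reasoning_effort())
print(effective_reasoning_effort("high"))
print(effective_reasoning_effort("high", reasoning_disabled=True))
```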

Concurrency Arguments

--num_workers

int

default: 4

Number of parallel worker processes (using multiprocessing.Pool). Each worker handles one batch at a time. Set based on available CPU cores and API rate limits. Higher values increase throughput but also API concurrency.

Toolset Distribution Arguments

--distribution

str

default: "default"

Named toolset distribution used to sample which toolsets each prompt receives. Each prompt gets an independently sampled subset. List available distributions with --list_distributions.
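
To make "independently sampled subset" concrete, here is one way per-prompt sampling could work. The distribution format is not documented on this page, so everything below is an assumption: the `HYPOTHETICAL_DISTRIBUTION` mapping of toolset name to inclusion probability, the toolset names, and the `sample_toolsets` helper are all invented for illustration.

```python
import random

# Assumed shape: toolset name -> probability of being enabled for a prompt.
HYPOTHETICAL_DISTRIBUTION = {"terminal": 1.0, "file": 0.8, "web": 0.3}


def sample_toolsets(distribution, rng):
    """Include each toolset independently with its own probability."""
    # rng.random() is in [0.0, 1.0), so p=1.0 always passes and p=0.0 never does.
    return sorted(name for name, p in distribution.items() if rng.random() < p)


rng = random.Random(0)  # seeded only to make the demo repeatable
for prompt_index in range(3):
    # Each prompt draws its own subset; draws do not depend on other prompts.
    print(prompt_index, sample_toolsets(HYPOTHETICAL_DISTRIBUTION, rng))
```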

--list_distributions

bool

default: false

Print all available toolset distributions and their descriptions, then exit.

Resume and Checkpointing Arguments

--resume

bool

default: false

Resume from a previous interrupted run. The runner scans all batch_*.jsonl files for completed prompts by matching prompt text content (not just indices), then rebuilds the batch list with only the remaining prompts.

--max_samples

int

Process only the first N samples from the dataset. Useful for quick test runs before committing to a full dataset.

Logging and Output Arguments

--verbose

bool

default: false

Enable verbose logging in worker processes. Prints full tracebacks on errors and shows per-prompt toolset selection.

--log_prefix_chars

int

default: 100

Number of characters to show in log previews for tool arguments and responses.

--ephemeral_system_prompt

str

A system prompt injected into each agent during execution but not saved to output trajectories. Use this for task-framing instructions that should not appear in training data.

OpenRouter Provider Routing Arguments

--providers_allowed

str

Comma-separated list of OpenRouter providers to allow (e.g. "anthropic,google").

--providers_ignored

str

Comma-separated list of OpenRouter providers to exclude (e.g. "together,deepinfra").

--providers_order

str

Comma-separated provider preference order (e.g. "anthropic,openai,google").

--provider_sort

str

Sort providers by "price", "throughput", or "latency".
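
The four routing flags take plain strings, so a pipeline that builds them programmatically may want to validate them first. The sketch below parses the comma-separated values into lists. The helper names and the keys of the returned dictionary are hypothetical, and how the runner translates these flags into the OpenRouter request is not shown on this page.

```python
VALID_SORTS = {"price", "throughput", "latency"}


def split_csv(value):
    """Turn "a, b,c" into ["a", "b", "c"]; empty or None gives []."""
    if not value:
        return []
    return [item.strip() for item in value.split(",") if item.strip()]


def build_provider_prefs(allowed=None, ignored=None, order=None, sort=None):
    """Collect routing preferences, dropping anything left unset.

    The dictionary keys are illustrative, not the runner's internal names.
    """
    if sort is not None and sort not in VALID_SORTS:
        raise ValueError(f"sort must be one of {sorted(VALID_SORTS)}")
    prefs = {
        "allowed": split_csv(allowed),
        "ignored": split_csv(ignored),
        "order": split_csv(order),
        "sort": sort,
    }
    return {key: value for key, value in prefs.items() if value}


prefs = build_provider_prefs(allowed="anthropic, google", sort="price")
print(prefs)
```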

Prefill Arguments

--prefill_messages_file

str

Path to a JSON file containing an array of prefill messages ([{"role": "user", "content": "..."}, ...]). These messages are prepended to every conversation for few-shot priming.
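
A malformed prefill file is easier to catch before a long run starts. The following sketch checks that a file body matches the array-of-messages shape described above. `validate_prefill_messages` is a hypothetical helper, and the runner's own validation (if any) may differ.

```python
import json


def validate_prefill_messages(text):
    """Parse a prefill file body and check the documented shape."""
    messages = json.loads(text)
    if not isinstance(messages, list):
        raise ValueError("prefill file must contain a JSON array")
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            raise ValueError(f"message {index} must be a JSON object")
        if "role" not in message or "content" not in message:
            raise ValueError(f"message {index} needs 'role' and 'content' keys")
    return messages


# Example file body with one few-shot exchange.
example = json.dumps([
    {"role": "user", "content": "List the files in the current directory."},
    {"role": "assistant", "content": "I will run `ls` in the terminal."},
])
messages = validate_prefill_messages(example)
print(len(messages), "prefill messages")
```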

Usage Examples

Basic run

python batch_runner.py \
  --dataset_file=data.jsonl \
  --batch_size=10 \
  --run_name=my_run

Resume an interrupted run

python batch_runner.py \
  --dataset_file=data.jsonl \
  --batch_size=10 \
  --run_name=my_run \
  --resume

The runner scans data/my_run/batch_*.jsonl for completed prompts using content-based matching and processes only the remainder.

Larger parallel run with custom model

python batch_runner.py \
  --dataset_file=prompts.jsonl \
  --batch_size=20 \
  --run_name=opus_run \
  --model="anthropic/claude-opus-4.6" \
  --num_workers=8 \
  --max_turns=15 \
  --reasoning_effort=high

Disabled reasoning with token cap

python batch_runner.py \
  --dataset_file=data.jsonl \
  --batch_size=10 \
  --run_name=no_think_run \
  --reasoning_disabled \
  --max_tokens=32000

With a prefill messages file

# configs/prefill_opus.json contains few-shot examples
python batch_runner.py \
  --dataset_file=data.jsonl \
  --batch_size=10 \
  --run_name=fewshot_run \
  --prefill_messages_file=configs/prefill_opus.json

List available toolset distributions

python batch_runner.py --list_distributions

Programmatic Use

You can also drive BatchRunner directly from Python for tighter integration with your pipeline:

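from batch_runner import BatchRunner


def main():
    runner = BatchRunner(
        dataset_file="data/prompts.jsonl",
        batch_size=10,
        run_name="my_run",
        distribution="default",
        max_iterations=10,
        model="anthropic/claude-opus-4.6",
        num_workers=4,
    )
    runner.run(resume=False)


# The runner starts worker processes via multiprocessing, so keep the entry
# point behind a __main__ guard. Platforms that use the "spawn" start method
# re-import this module in every worker.
if __name__ == "__main__":
    main()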

Each worker process calls _process_single_prompt(), which instantiates a fresh AIAgent with skip_context_files=True and skip_memory=True hardcoded (always set in batch mode to prevent user-specific files from appearing in trajectories). These are internal defaults and are not exposed as BatchRunner constructor parameters.

Trajectory Format from Code

After a run, load the combined trajectories in Python:

import json
from pathlib import Path

path = Path("data/my_run/trajectories.jsonl")

# One JSON object per line; skip blank lines defensively.
with path.open("r", encoding="utf-8") as f:
    trajectories = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(trajectories)} trajectories")
if trajectories:
    first = trajectories[0]
    print(f"First trajectory turns: {len(first['conversations'])}")
    print(f"Completed: {first['completed']}")

Or load directly into a HuggingFace dataset:

from datasets import load_dataset

ds = load_dataset("json", data_files="data/my_run/trajectories.jsonl", split="train")
print(ds)
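
Once the entries are loaded, the per-entry tool_stats can be rolled up with the standard library alone. This is a small sketch over two inline sample entries that follow the schema shown earlier. `summarize_tool_usage` is an illustrative helper, not part of batch_runner.py.

```python
from collections import Counter


def summarize_tool_usage(trajectories):
    """Sum call and failure counts per tool across all entries."""
    calls = Counter()
    failures = Counter()
    for entry in trajectories:
        for tool, stats in entry.get("tool_stats", {}).items():
            calls[tool] += stats["count"]
            failures[tool] += stats["failure"]
    return calls, failures


# Inline stand-ins for entries parsed from trajectories.jsonl.
sample = [
    {"completed": True, "tool_stats": {
        "read_file": {"count": 2, "success": 2, "failure": 0},
        "terminal": {"count": 1, "success": 1, "failure": 0}}},
    {"completed": False, "tool_stats": {
        "read_file": {"count": 0, "success": 0, "failure": 0},
        "terminal": {"count": 3, "success": 2, "failure": 1}}},
]

calls, failures = summarize_tool_usage(sample)
completed = sum(1 for entry in sample if entry["completed"])
print(f"{completed}/{len(sample)} completed")
print(calls.most_common())
```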

Concurrency Model

The batch runner uses Python’s multiprocessing.Pool — not threads — for parallelism. Each worker is a separate OS process with its own memory space.

Parent process
├── Creates Pool(num_workers)
├── Dispatches batch tasks via pool.imap_unordered()
├── Updates rich Progress bar as batches complete
└── Writes incremental checkpoints after each completed batch

Worker process (× num_workers)
└── Receives one batch (list of prompts)
    └── Processes each prompt sequentially
        └── Instantiates AIAgent, calls run_conversation()

Within a batch, prompts are processed sequentially in the worker. Parallelism comes from running multiple batches simultaneously across workers. Because each worker handles one prompt at a time, roughly num_workers agent conversations are in flight at any moment, so size num_workers against your API rate limit. batch_size mainly determines how much work sits between checkpoints.
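
The dispatch pattern described in this section can be sketched with the standard library. This is a simplified illustration, not the runner's code: `make_batches`, `process_batch`, and `run_all` are hypothetical names, and the agent call is replaced by a placeholder string so the example runs without any API access.

```python
from multiprocessing import Pool


def make_batches(prompts, batch_size):
    """Split prompts into (batch_num, prompts) chunks of at most batch_size."""
    return [
        (batch_num, prompts[start:start + batch_size])
        for batch_num, start in enumerate(range(0, len(prompts), batch_size))
    ]


def process_batch(batch):
    """Worker entry point: handle one batch, one prompt after another."""
    batch_num, prompts = batch
    # Placeholder for "instantiate an agent and run the conversation".
    results = [f"handled: {prompt}" for prompt in prompts]
    return batch_num, results


def run_all(prompts, batch_size, num_workers):
    """Fan batches out to a process pool and collect them as they finish."""
    completed = {}
    with Pool(num_workers) as pool:
        # imap_unordered yields each batch as soon as its worker is done,
        # which is where a checkpoint write would happen.
        for batch_num, results in pool.imap_unordered(process_batch, make_batches(prompts, batch_size)):
            completed[batch_num] = results
    return completed


if __name__ == "__main__":
    done = run_all([f"prompt {i}" for i in range(25)], batch_size=10, num_workers=4)
    print(sorted(done))
```

With 25 prompts and a batch size of 10, three batches are created and the last one holds the remaining five prompts.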

_last_resolved_tool_names is a process-global in model_tools.py. When subagent delegation (delegate_tool.py) is used inside a batch worker, spawned subagents may overwrite this global. Subsequent execute_code calls in the same worker process may then fail with missing tool import errors. Avoid toolsets that trigger subagent delegation in batch runs, or set --max_turns low enough that delegation is unlikely to occur.

Checkpointing and Fault Tolerance

The checkpoint file at data/<run_name>/checkpoint.json is updated incrementally — after each batch completes, not only at the end of the full run. This means:

A crash mid-run loses at most one batch's worth of work.
On --resume, the runner performs content-based matching: it scans all batch_*.jsonl files and extracts the human prompt text from completed conversations. This is more robust than index-based matching and correctly handles dataset re-ordering or index shifts between runs.

Failed prompts (exceptions during run_conversation()) are not written to the batch output file, so they remain eligible for retry on resume.
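
Content-based resume can be illustrated with a short sketch. It assumes that each line of a batch_*.jsonl file has the same "conversations" layout as trajectories.jsonl and that the first "human" turn holds the original prompt. Both are assumptions, as are the helper names `completed_prompt_texts` and `remaining_prompts`.

```python
import json
import tempfile
from pathlib import Path


def completed_prompt_texts(run_dir):
    """Collect the human prompt text of every entry found in batch files."""
    done = set()
    for batch_file in sorted(Path(run_dir).glob("batch_*.jsonl")):
        with batch_file.open("r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    entry = json.loads(line)
                except json.JSONDecodeError:
                    continue  # tolerate a half-written line left by a crash
                for turn in entry.get("conversations", []):
                    if turn.get("from") == "human":
                        done.add(turn["value"])
                        break  # assumption: first human turn is the prompt
    return done


def remaining_prompts(dataset, run_dir):
    """Keep dataset rows whose prompt text has no completed entry yet."""
    done = completed_prompt_texts(run_dir)
    return [row for row in dataset if row["prompt"] not in done]


with tempfile.TemporaryDirectory() as tmp:
    finished = {"conversations": [
        {"from": "system", "value": "..."},
        {"from": "human", "value": "task A"},
    ]}
    Path(tmp, "batch_0.jsonl").write_text(json.dumps(finished) + "\n", encoding="utf-8")
    # The dataset order differs from the first run; matching is by text.
    todo = remaining_prompts([{"prompt": "task B"}, {"prompt": "task A"}], tmp)

print(todo)
```

Matching on text rather than position is what lets the dataset be reordered between runs without repeating finished prompts.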

Trajectory Saving and Quality Filters

The batch runner handles all trajectory serialization itself via agent._convert_to_trajectory_format(). It always passes save_trajectories=False to each AIAgent instance to avoid double-writing. Two quality filters keep unusable entries out of the final output:

No-reasoning filter — Trajectories where zero assistant turns contain reasoning (no <REASONING_SCRATCHPAD> tag and no native thinking tokens) are discarded. The count appears in the run summary under “Samples discarded (zero reasoning)”.
Invalid tool name filter — At combine time, entries containing tool names not in the master TOOL_TO_TOOLSET_MAP are filtered out. These result from model hallucinations and would break downstream schema validation.
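
Both filters can be approximated on saved entries for offline inspection. This sketch rests on assumptions: it treats a non-empty <think> block in a "gpt" turn as evidence of reasoning (following the schema example earlier on this page), and `KNOWN_TOOLS` is a tiny hypothetical stand-in for TOOL_TO_TOOLSET_MAP. The runner's own checks may operate on different data.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

# Hypothetical stand-in for the master TOOL_TO_TOOLSET_MAP.
KNOWN_TOOLS = {"read_file": "file", "terminal": "terminal"}


def has_reasoning(entry):
    """True if any assistant turn contains a non-empty <think> block."""
    for turn in entry["conversations"]:
        if turn["from"] != "gpt":
            continue
        if any(block.strip() for block in THINK_RE.findall(turn["value"])):
            return True
    return False


def has_only_known_tools(entry, known=KNOWN_TOOLS):
    """True if every tool name recorded for the entry is a known tool."""
    return all(name in known for name in entry.get("tool_stats", {}))


def keep(entry):
    return has_reasoning(entry) and has_only_known_tools(entry)


good = {
    "conversations": [{"from": "gpt", "value": "<think>\nplan steps\n</think>\nDone."}],
    "tool_stats": {"terminal": {"count": 1, "success": 1, "failure": 0}},
}
no_reasoning = {
    "conversations": [{"from": "gpt", "value": "<think>\n</think>\nDone."}],
    "tool_stats": {},
}
hallucinated = {
    "conversations": [{"from": "gpt", "value": "<think>try it</think>"}],
    "tool_stats": {"teleport_file": {"count": 1, "success": 0, "failure": 1}},
}
kept = [entry for entry in (good, no_reasoning, hallucinated) if keep(entry)]
print(len(kept), "entry kept")
```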

Use --verbose only when debugging a single run. Verbose mode in workers produces very high log volume and can slow down multiprocessing output flushing.

The batch runner always sets skip_context_files=True and skip_memory=True on each AIAgent instance internally to prevent user-specific context files from appearing in trajectories.
