Z3Sandbox — Isolated Z3 Script Execution Environment

Z3Sandbox executes LLM-generated Z3Py scripts in an isolated subprocess with strict timeouts and resource limits. LLM-generated code is untrusted; the sandbox enforces both time and memory budgets and intercepts dangerous imports before any user code runs. All K voting candidates from LLMClient are executed in parallel via asyncio.gather, making the sandbox the primary throughput bottleneck of the pipeline.

Import

from agentic_afl.orchestrator.z3_sandbox import Z3Sandbox

Security Model

Every script execution goes through the following protection layers in order:

Subprocess isolation — Scripts run in a separate Python subprocess (asyncio.create_subprocess_exec). A crash, infinite loop, or sys.exit() inside the script cannot affect the main Agentic-AFL process.
RLIMIT_AS (512 MB memory cap) — Set via resource.setrlimit(resource.RLIMIT_AS, ...) inside the subprocess before any user code executes. A Z3 path explosion that allocates more than 512 MB kills the subprocess with MemoryError.
RLIMIT_CPU (CPU time limit) — The subprocess’s CPU time is bounded. This acts as a backstop if the asyncio timeout fires before the OS scheduler has a chance to kill the process.
Import whitelist — A _restricted_import hook is installed on __builtins__.__import__ after z3 has finished loading (Z3’s C extension loads os, ctypes, etc. internally at import time, so the hook is installed post-import to avoid breaking Z3 itself). Blocked modules include os, subprocess, socket, http, urllib, requests, threading, and multiprocessing.
Strict solver timeout — The execution harness passes s.set("timeout", <ms>) to the Z3 Solver instance before any LLM-generated constraints are added. This bounds s.check() to z3_timeout_seconds (default 5 s).

The import whitelist is a first-line defense. RLIMIT_AS and RLIMIT_CPU are the true security backstop. Do not run Agentic-AFL with root privileges, as resource limits may not be enforced correctly in some containerized environments.

Constructor

Z3Sandbox is constructed with no required arguments. All parameters have sensible defaults sourced from settings.

timeout_seconds

int

default:"settings.z3_timeout_seconds (5)"

Wall-clock seconds before the subprocess is killed with asyncio.TimeoutError. An additional 5-second overhead buffer is added for subprocess startup, so the actual asyncio.wait_for timeout is timeout_seconds + 5.

sandbox_dir

Path

default:"settings.z3_sandbox_dir"

Directory where temporary script files are written before execution. Each script is given a random UUID-based filename (z3_exec_<uuid8>.py) and deleted after execution unless keep_scripts=True.

keep_scripts

bool

default:"false"

When True, temporary script files are not deleted after execution. Useful for post-mortem debugging — combine with settings.debug_mode which also writes raw LLM completions and sandbox results to /tmp/agentic_afl_debug/.

`async execute(script: Z3Script) -> Z3Result`

Execute a Z3 script in an isolated subprocess. This is the only public method of Z3Sandbox. Steps performed:

Wrap the raw LLM script in the execution harness template (strip duplicate from z3 import *, s = Solver(), s.check(), and model extraction boilerplate that the harness already provides)
Write the wrapped script to a temporary file in sandbox_dir
Execute the file with asyncio.create_subprocess_exec(sys.executable, script_path)
Wait up to timeout_seconds + 5 seconds for the subprocess to finish
Parse the JSON printed between ===Z3_RESULT_JSON=== and ===Z3_RESULT_END=== delimiters
Construct and return a Z3Result
Delete the temporary file (unless keep_scripts=True)

If the subprocess exits without printing the JSON delimiters and returns a non-zero exit code, the result is classified as SYNTAX_ERROR. If the subprocess exits without the delimiters but with exit code 0, the result is classified as RUNTIME_ERROR.

Z3Verdict values

Verdict	Enum value	Meaning
`SAT`	`"sat"`	Z3 found a satisfying assignment — `model` dict is populated
`UNSAT`	`"unsat"`	Z3 proved no solution exists — constraints are structurally wrong
`TIMEOUT`	`"timeout"`	`s.check()` exceeded `timeout_seconds` — constraints may need concretization
`SYNTAX_ERROR`	`"syntax_error"`	Script failed to parse/compile before `_execute()` ran
`RUNTIME_ERROR`	`"runtime_error"`	Script executed but raised a Python exception inside `_execute()`
`UNKNOWN`	`"unknown"`	Z3 returned `unknown` (e.g., non-linear arithmetic beyond Z3’s decision procedures)

Z3 Model Format

When verdict is SAT, the model field is a dict[str, int] mapping Z3 variable names to concrete integer values. The naming convention enforced by LLMClient’s system prompt ensures variables are named byte_0, byte_1, etc., one per input byte position.

# Example model from a successful solve of a DNP3 checksum constraint:
{
    "byte_0": 5,    # first byte of input
    "byte_1": 1,    # second byte
    "byte_2": 0,    # third byte
    "byte_3": 1,    # fourth byte
    "byte_4": 100,  # CRC low byte
    "byte_5": 42,   # CRC high byte
}

AgentLoop._model_to_payload() reads this dict and overlays the solved values onto the original seed input. Positions not covered by the model retain their original seed bytes, allowing the solver to address only the constrained header while leaving the packet body intact.

Scripts that print nothing to stdout (due to early exit, uncaught exception before the _execute() wrapper, or a bare sys.exit()) are classified as RUNTIME_ERROR. The error message in Z3Result.error_message is capped at 2000 characters and is fed back to the LLM as the correction prompt in the next ReAct turn.

Parallel Execution

The sandbox is designed for parallel execution. AgentLoop runs all K Z3Sandbox.execute() calls concurrently with asyncio.gather, so total latency per ReAct turn is approximately the latency of the slowest of the K scripts rather than K times the average latency.

import asyncio
from agentic_afl.orchestrator.z3_sandbox import Z3Sandbox

sandbox = Z3Sandbox()
results = await asyncio.gather(*[
    sandbox.execute(script) for script in scripts
])
sat_results = [r for r in results if r.verdict.value == "sat"]

Each execute() call writes its own uniquely named temp file, so concurrent executions never collide on the filesystem. The subprocess isolation also means that a timeout or crash in one candidate does not affect the others.

In settings.debug_mode, AgentLoop writes each script and its result to /tmp/agentic_afl_debug/ as turn{N}_k{M}_{verdict}.py and turn{N}_k{M}_{verdict}.txt. Enable this to inspect exactly what the LLM generated and what the sandbox returned for each voting candidate.

CLI Reference

Python API

Data & Schema

Z3Sandbox — Isolated Z3 Script Execution Environment

Import

Security Model

Constructor

`async execute(script: Z3Script) -> Z3Result`

Z3Verdict values

Z3 Model Format

Parallel Execution

Build docs developers (and LLMs) love

CLI Reference

Python API

Data & Schema

Documentation Index

​Import

​Security Model

​Constructor

​async execute(script: Z3Script) -> Z3Result

​Z3Verdict values

​Z3 Model Format

​Parallel Execution

Build docs developers (and LLMs) love

Import

Security Model

Constructor

`async execute(script: Z3Script) -> Z3Result`

Z3Verdict values

Z3 Model Format

Parallel Execution