How Agentic-AFL Bypasses Math and Crypto Coverage Stalls

Agentic-AFL operates as an asynchronous co-process alongside AFL++. While AFL++ runs its normal mutation loop without interruption, Agentic-AFL watches for edge-count plateaus, and when one is detected it executes a five-stage pipeline: extract a P-Code slice of the stalling binary logic via Ghidra, profile the constraint structure, query a PostgreSQL-backed retrieval store for similar solved templates, ask the LLM to translate the P-Code into Z3Py syntax, and inject the concrete satisfying payload into AFL++‘s sync directory so that AFL++ discovers it on its next cycle. The whole pipeline runs asynchronously — AFL++ never pauses.

The Core Problem

AFL++‘s mutation strategies — bitflip, arithmetic increment, havoc, splice, and the REDQUEEN equality-matching extensions — are all fundamentally probabilistic. They work extremely well when a branch can be flipped by changing a single byte or a small constant, because even random mutation will occasionally produce the right value. But they break down on multivariate arithmetic constraints: constraints where the correct value of one byte depends on the values of many other bytes, all combined through operations like XOR, shift, modular addition, and table lookup. The canonical example in ICS/OT protocol parsing is CRC-16 verification. A Modbus RTU frame’s last two bytes are the CRC-16 of all preceding bytes. AFL++ must simultaneously get every payload byte right — a 1-in-2¹⁶ probability per frame — before it can pass the checksum gate and reach any downstream parsing logic. Since AFL++ cannot model this multivariate dependency, coverage stalls at the checksum address and stays there regardless of how long the campaign runs.

The Pipeline

Stall Detection

StallDetector polls AFL++‘s fuzzer_stats file every stall_poll_interval seconds (default 5 s), reading the edges_found counter. When no new edges have appeared for longer than --stall-minutes (default 5 minutes), the detector classifies the event as a stall.Once a stall is confirmed, the detector runs addr2line and GDB against the binary to identify the most likely stall address — the branch instruction where coverage has stopped advancing. It then selects the most relevant seed from the AFL++ queue (prioritizing by file size, since larger inputs tend to have reached deeper code paths) to use as the concrete context for P-Code extraction.Stalls are classified by StallSeverity (CRITICAL, HIGH, MEDIUM, LOW) and placed into a priority queue in AgentLoop. CRITICAL stalls — where edges have been completely flat since the campaign’s baseline — are processed first.

P-Code Extraction

PCodeSlicer invokes Ghidra in headless mode on the target binary, pointing it at the stall address. Ghidra decompiles the containing function and exports its full P-Code intermediate representation — a low-level, architecture-neutral instruction set that preserves all semantic information from the original machine code while abstracting away ISA-specific encoding.The slicer then performs taint-bounded backward slicing: starting from the stall address’s branch instruction, it walks backward through the control-flow graph and retains only the P-Code instructions that are data-dependent on the taint source — the fuzzer’s input buffer (e.g., the register carrying the input pointer on the target architecture). Instructions that read from global state or local variables not reachable from the input buffer are pruned.The result is a PCodeSlice: an ordered list of PCodeInstruction objects (each with an address, mnemonic, input varnodes, and output varnode), bounded to at most max_pcode_instructions (default 200) instructions to stay within the LLM’s practical token budget. The slice also includes Ghidra’s decompiled C pseudocode for the function as additional context.

Constraint Profiling

ConstraintProfiler analyzes the PCodeSlice and computes a ConstraintProfile — a structured characterization of the math the slice performs.The profiler counts mnemonics and structural patterns to assign a set of ConstraintTag enum values. Tags are structural, not algorithmic: they describe the shape of the computation (e.g., a loop whose body is dominated by XOR and shift operations) rather than naming a specific algorithm (e.g., CRC-16). This design ensures that a proprietary polynomial checksum and a standard CRC-16 produce overlapping tag sets, enabling CARM to match them even though they are different algorithms.Example tags assigned to a CRC-16 slice:

Tag	Why assigned
`BITWISE_LOOP`	Loop body dominated by XOR and right-shift P-Code ops
`CONSTANT_EQUALITY`	Final comparison of computed CRC against a field in the input
`INDEXED_LOOKUP`	CRC table lookup indexed by an input-derived byte
`BIT_FIELD_EXTRACTION`	Shift + mask to extract high/low nibbles of each byte

The profiler also computes scalar metrics: bitwise_density, arithmetic_density, loop_depth, register_count, and estimated_complexity (a heuristic 0–100 score). These metrics accompany the tags in the CARM query.

CARM Retrieval and Z3 Generation

CARM retrieval: CARMRetriever queries the PostgreSQL spec store using a server-side Jaccard similarity function on the integer arrays encoding the ConstraintTag sets. The query returns up to carm_max_results (default 5) previously stored specifications whose tag sets overlap with the current profile above the carm_similarity_threshold (default 0.3). Results are filtered to the same target architecture. Any matching specs that have an associated Z3 template (i.e., a script that returned SAT in a prior session) are extracted as template strings.Z3 generation: LLMClient sends a structured prompt to the configured LLM provider (OpenAI or Gemini). The prompt includes the raw P-Code text, the decompiled C pseudocode, the ConstraintTag list as strategy hints, any retrieved Z3 templates as positive examples, any correction history from prior failed attempts as negative examples, and the architecture identifier for BitVec width selection.The LLM returns K independent Z3Py scripts (K-way voting, default K=3), which are each submitted to Z3Sandbox for isolated subprocess execution. Z3Sandbox enforces a strict z3_timeout_seconds (default 5 s) per script to prevent Z3 path explosion on deeply nested crypto. If any script returns SAT, its model (a dictionary mapping variable names to concrete integer values) is extracted.PayloadInjector then converts the model to raw bytes via seed splicing: it overlays the Z3-solved byte positions onto the original seed input (preserving bytes the LLM did not constrain), writes the result atomically to AFL++‘s sync directory, and AFL++ ingests the file on its next queue cycle.

Payload Injection

PayloadInjector writes the solved payload as a file into AFL++‘s sync directory using an atomic rename to prevent AFL++ from reading a partially-written file. The filename encodes the originating spec ID, stall address, and timestamp (e.g., agentic_a3f2b1c0_0x08001a44_20250601_143022.bin).AFL++ natively monitors the sync directory and ingests any new files it finds on the next fuzzing cycle. This means the payload reaches AFL++‘s queue and starts being mutated within seconds of being written — there is no pause, no restart, and no shared memory protocol. The fuzzer bridge is entirely filesystem-based.After injection, AgentLoop monitors edges_found to detect whether the payload produced new coverage. In harvest mode this measurement gates whether the solved (P-Code, tags, Z3 script) triplet is committed to the CARM database for future reuse.

K-Way Voting and Self-Repair

When the LLM is asked to generate Z3Py code for a stall, it produces K independent scripts in a single API call rather than one (default K=3, controlled by K_VOTE_COUNT). All K scripts are executed in parallel by the Z3Sandbox. If any one of them returns SAT, that model is used — the other scripts are discarded. This strategy, drawn from LINC §2, is cheap (one API call, K parallel subprocess executions) and significantly increases the probability of getting at least one syntactically and semantically correct script per attempt. If all K scripts fail (returning UNSAT, TIMEOUT, SYNTAX_ERROR, RUNTIME_ERROR, or UNKNOWN), AgentLoop enters the ReAct self-repair loop:

_select_best_failure() picks the most informative failure. The preference order is UNSAT > RUNTIME_ERROR > SYNTAX_ERROR > TIMEOUT > UNKNOWN. An UNSAT result means the Z3 logic compiled correctly but is contradictory — the most useful signal for repair.
A CorrectionEntry is constructed from the failing script and its error message, and appended to the Z3GenerationRequest’s correction_history.
The updated request (now containing the failure as a negative example) is sent to the LLM again for a new set of K scripts.
This cycle repeats for up to max_react_turns turns (default 5).

If all ReAct turns are exhausted without a SAT result, the stall is deferred back to AFL++ for probabilistic mutation. The correction history is persisted in PostgreSQL so that future campaigns against the same binary benefit from the accumulated negative examples.

If the Z3 model returned is technically SAT but contains too few byte-indexed variables to represent a meaningful payload (fewer than min(8, input_length // 2) byte variables), the model is rejected and treated as a failure for repair purposes. The LLM is told exactly which byte positions are missing and instructed to model the hash/checksum computation explicitly to constrain the CRC field bytes. This prevents trivially incomplete solutions from being injected as payloads.

CARM — Constraint-Aware Retrieval Memory

CARM is the knowledge-transfer layer that allows Agentic-AFL to improve over time and generalize across binaries. Its core operation is Jaccard similarity on ConstraintTag sets:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}

where A and B are the ConstraintTag frozensets of two ConstraintProfile objects. A score of 1.0 means the two stall sites have identical structural tag sets; a score of 0.0 means they share no tags at all. The carm_similarity_threshold (default 0.3) is the minimum score for a match to be returned. The computation is performed server-side in PostgreSQL using a custom jaccard_similarity() SQL function that operates on integer arrays (each ConstraintTag enum value is stored as its integer ordinal). This avoids transferring all stored profiles to the Python process for comparison — the database engine filters and ranks them, and only the top-N results are returned to CARMRetriever. Queries are also filtered by target architecture: an ARM32 template is not retrieved for an X86_64 stall, even if the tag sets match, because BitVec widths in the generated Z3 scripts would be wrong. When a stall is successfully solved (a SAT payload is injected and, in harvest mode, confirmed by an edge delta), CARMRetriever.update_template() stores the winning Z3 script alongside the stall’s spec in PostgreSQL. On the next campaign — against the same binary or a different one with structurally similar constraints — that template is retrieved and injected into the LLM prompt as a positive example, dramatically reducing the number of ReAct turns needed.

The LLM is a translator, not a solver. Its role is to read the P-Code slice and write syntactically and semantically correct Z3Py code that encodes the constraint as a system of bit-vector equations. Z3 — the SMT solver — is what actually finds the concrete satisfying assignment by searching the solution space mathematically. This division of labor is intentional: LLMs are unreliable arithmetic reasoners but are excellent at structured code translation, while Z3 is a proven decision procedure for quantifier-free bit-vector logic.

Get Started

Configuration

Guides

Architecture

How Agentic-AFL Bypasses Math and Crypto Coverage Stalls

The Core Problem

The Pipeline

K-Way Voting and Self-Repair

CARM — Constraint-Aware Retrieval Memory

Build docs developers (and LLMs) love

Get Started

Configuration

Guides

Architecture

Documentation Index

​The Core Problem

​The Pipeline

​K-Way Voting and Self-Repair

​CARM — Constraint-Aware Retrieval Memory

Build docs developers (and LLMs) love

The Core Problem

The Pipeline

K-Way Voting and Self-Repair

CARM — Constraint-Aware Retrieval Memory