Documentation Index
Fetch the complete documentation index at: https://mintlify.com/SanMuzZzZz/LuaN1aoAgent/llms.txt
Use this file to discover all available pages before exploring further.
What is the XBOW benchmark?
XBOW is a high-fidelity benchmark designed to evaluate LLM-based security agents. Unlike traditional puzzle-style CTFs, XBOW focuses on real-world vulnerability primitives and multi-stage business-logic exploit chains: the kind of attack sequences that arise in production systems, not contrived lab exercises.
104 test cases
Unique scenarios spanning easy, medium, and hard difficulty levels.
OWASP Top 10: 2025
Full coverage across 17 vulnerability categories and 18 CWE identifiers.
Real-world focus
Business logic exploit chains, not contrived CTF puzzles.
Difficulty distribution
- Level 1 — Easy: 45 cases
- Level 2 — Medium: 51 cases
- Level 3 — Hard: 8 cases
Performance results
Under zero-shot, zero-human-intervention conditions, LuaN1aoAgent (powered by DeepSeek-V3.2) established a new state of the art (SOTA) for autonomous penetration testing. LuaN1aoAgent was also awarded top ranking at the Tencent Cloud Hackathon (TCH) for its performance and architecture.
Competitive comparison
| Framework | Success Rate |
|---|---|
| LuaN1aoAgent (ours) | 90.4% |
| XBOW Commercial Agent | 85.0% |
| Cyber-AutoAgent v0.1.3 | 84.6% |
| MAPTA (Academic SOTA) | 76.9% |
Success rate by difficulty
| Difficulty | Cases | Successes | Success Rate |
|---|---|---|---|
| Level 1 — Easy | 45 | 44 | 97.8% |
| Level 2 — Medium | 51 | 44 | 86.3% |
| Level 3 — Hard | 8 | 6 | 75.0% |
| Total | 104 | 94 | 90.4% |
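The per-difficulty rows can be sanity-checked against the totals with a few lines of arithmetic:

```python
# Per-difficulty results from the table above: (cases, successes).
levels = {"easy": (45, 44), "medium": (51, 44), "hard": (8, 6)}

total_cases = sum(c for c, _ in levels.values())      # 104
total_successes = sum(s for _, s in levels.values())  # 94

for name, (cases, successes) in levels.items():
    print(f"{name}: {successes / cases:.1%}")
print(f"total: {total_successes / total_cases:.1%}")  # → total: 90.4%
```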
Cost efficiency analysis
Leveraging the “Hard Veto” mechanism in the Reflector, LuaN1aoAgent avoids redundant hallucination loops, significantly reducing both token consumption and wall-clock time.
$0.09 median cost
Median cost per successful exploit. The average is $0.20, meaning the cost distribution is right-skewed: most exploits are cheap, and a small number of harder tasks consume more budget.
11 minutes median time
Median time to success. The fastest exploit completed in 1.6 minutes.
Full cost breakdown
| Metric | Value |
|---|---|
| Total scenarios | 104 |
| Successful exploits | 94 |
| Success rate | 90.4% |
| Average time per success | 16.1 min |
| Median time per success | 11 min |
| Fastest exploit convergence | 1.6 min |
| Average cost per success | $0.20 |
| Median cost per success | $0.09 |
| Total expenditure (all tasks) | $27.24 |
| Total expenditure (successes only) | $18.91 |
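The average cost figure follows directly from the totals in the table:

```python
successes = 94
total_cost_successes = 18.91  # USD, successes only

avg_cost = total_cost_successes / successes
print(f"${avg_cost:.2f} per success")  # → $0.20 per success
```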
How results were achieved
The performance comes from the Dual-Graph Cognitive Architecture (DGCA), which combines two complementary graph structures:
Cognitive Causal Graph (CCG)
Tracks evidence, hypotheses, vulnerabilities, and exploits as nodes in a causal graph. Every hypothesis requires explicit prior evidence, and every causal edge carries a confidence score. This prevents hallucination-driven blind attacks and makes the reasoning chain fully traceable.
The Reflector’s “Hard Veto” mechanism uses CCG state to detect when the agent is stuck in a loop and terminates unproductive paths early, a key driver of cost efficiency.
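The CCG data model is not published here, so the following is a minimal Python sketch of the idea. The `Node` and `CausalGraph` classes and the `hard_veto` helper are hypothetical names, and the loop check is a toy stand-in for the Reflector’s actual logic:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    kind: str   # "evidence", "hypothesis", "vulnerability", or "exploit"
    label: str

@dataclass
class CausalGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)   # (src, dst, confidence)

    def add_evidence(self, label):
        node = Node("evidence", label)
        self.nodes.append(node)
        return node

    def add_hypothesis(self, label, evidence, confidence):
        # Every hypothesis must cite prior evidence: no blind attacks.
        if evidence.kind != "evidence":
            raise ValueError("a hypothesis must be grounded in evidence")
        hyp = Node("hypothesis", label)
        self.nodes.append(hyp)
        self.edges.append((evidence, hyp, confidence))
        return hyp

def hard_veto(recent_hypotheses, window=3):
    # Toy loop detector: veto when the agent has proposed the same
    # hypothesis `window` times in a row without new evidence.
    tail = recent_hypotheses[-window:]
    return len(tail) == window and len(set(tail)) == 1
```

For example, `hard_veto(["sqli", "sqli", "sqli"])` returns `True`, which would trigger early termination of that path.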
Dynamic Task Graph (DTG)
Models the penetration testing plan as a Directed Acyclic Graph (DAG). The Planner emits structured graph-editing operations (`ADD_NODE`, `UPDATE_NODE`, `DEPRECATE_NODE`) rather than natural-language instructions. This enables real-time plan adaptation, automatic parallelization of independent sub-tasks, and topological dependency management.
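As a rough illustration of structured graph editing, the sketch below applies `ADD_NODE` / `UPDATE_NODE` / `DEPRECATE_NODE` operations to a task DAG and derives the pending tasks whose dependencies are all satisfied, which can therefore run in parallel. The operation names come from the text above; the payload schema and task names are assumptions:

```python
tasks = {}  # id -> {"deps": [...], "status": "pending" | "done" | "deprecated"}

def apply_op(op):
    kind = op["op"]
    if kind == "ADD_NODE":
        tasks[op["id"]] = {"deps": op.get("deps", []), "status": "pending"}
    elif kind == "UPDATE_NODE":
        tasks[op["id"]].update(op["fields"])
    elif kind == "DEPRECATE_NODE":
        tasks[op["id"]]["status"] = "deprecated"

def ready_tasks():
    # Pending tasks whose dependencies are all done: these are
    # independent of each other and can execute in parallel.
    return [
        tid for tid, t in tasks.items()
        if t["status"] == "pending"
        and all(tasks[d]["status"] == "done" for d in t["deps"])
    ]

apply_op({"op": "ADD_NODE", "id": "recon"})
apply_op({"op": "ADD_NODE", "id": "sqli-probe", "deps": ["recon"]})
apply_op({"op": "ADD_NODE", "id": "dir-brute", "deps": ["recon"]})
apply_op({"op": "UPDATE_NODE", "id": "recon", "fields": {"status": "done"}})
print(ready_tasks())  # → ['sqli-probe', 'dir-brute']
```

Once `recon` completes, both probes unblock at the same time, which is the parallelization the DTG enables.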
Ablation study
Disabling either graph component causes a significant performance drop, especially on harder tasks:
| Framework version | Level 1 | Level 2 | Level 3 | Total SR |
|---|---|---|---|---|
| LuaN1aoAgent (full) | 97.8% | 86.3% | 75.0% | 90.4% |
| Without Causal Graph (w/o CCG) | 93.3% | 58.8% | 25.0% | 71.2% |
| Without Task Graph (w/o DTG) | 95.6% | 64.7% | 37.5% | 76.0% |
| ReAct baseline | 88.9% | 39.2% | 25.0% | 59.6% |
Failure analysis
Of the 104 tasks, 10 (9.6%) failed. Root-cause breakdown:
| Category | Root cause | Representative cases | Ratio |
|---|---|---|---|
| Stochastic observation | Blind SQLi / high-uncertainty environments | XBEN-095, XBEN-056 | 40% |
| Temporal sensitivity | Race conditions / timing-based attacks | XBEN-042 | 20% |
| Protocol bottlenecks | Complex HTTP Smuggling / Desync | XBEN-101 | 20% |
| Context overload | State-knowledge conflation in extreme depth | XBEN-088 | 20% |
- Causal misattribution (40%): the agent misinterprets ambiguous HTTP responses (e.g., a 403 from a space filter vs. a WAF blocking `SELECT`) and incorrectly prunes a valid attack path.
- Strategic divergence / “red herring” effect (30%): the agent is distracted by high-entropy artifacts and ignores a verified exploit window.
- Abstraction leakage (30%): HTTP library middleware normalizes the malformed protocol semantics required for the attack before they reach the wire (e.g., HTTP Smuggling case XBEN-066).
Benchmark traces
Full execution traces for all four experimental variants are available in the repository under `xbow-benchmark-results/traces/`:
- `traces/Ours/`: full LuaN1aoAgent
- `traces/Ours-CCG/`: without Causal Graph (w/o CCG)
- `traces/Ours-DTG/`: without Task Graph (w/o DTG)
- `traces/ReAct/`: ReAct baseline