Production agents operate at the intersection of language models, external tools, and untrusted data — a combination that creates a distinct threat surface not covered by traditional application security. This page explains how to model those threats, layer guardrails against them, instrument traces that support debugging and auditing, design eval suites that measure harness behavior rather than just model quality, and define the gates an agent must pass before going to production.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/DenisSergeevitch/agents-best-practices/llms.txt
Use this file to discover all available pages before exploring further.
Threat model
Agents face threats that arise from the combination of language, tools, and external data. The primary categories are:- Prompt injection — malicious content in retrieved documents or tool results that attempts to redirect model behavior
- Data exfiltration — agent is manipulated into sending sensitive data to attacker-controlled destinations
- Over-broad tool access — tools expose more capability than the current task requires, increasing blast radius
- Runaway loops — the agent repeats expensive or destructive operations without a stop condition
- Secret leakage — credentials or PII are copied into context and exposed in traces or outputs
- Unsafe external communication — agent sends unapproved messages, emails, or API calls to third parties
- Financial or destructive side effects — unreviewed actions trigger purchases, deletions, or state changes
- Compaction state loss — context summarization drops the active objective or approval state
- Subagent miscoordination — parallel agents duplicate actions or pass conflicting state to each other
Guardrails
Guardrails should be layered, fast, specific, and independently testable. Each layer addresses a different attack surface:| Layer | What it does |
|---|---|
| Input guardrails | Reject or route unsafe user requests before they reach the model |
| Context guardrails | Label untrusted content and redact secrets before context assembly |
| Schema guardrails | Force structured tool arguments and outputs |
| Tool guardrails | Validate arguments and results around every tool execution |
| Permission guardrails | Approve, deny, or pause actions based on risk class |
| Output guardrails | Check the final answer before it becomes user-visible output |
| Trace guardrails | Grade tool calls and decisions after the run for offline review |
Sandboxing
Code execution, filesystem access, and network access must be isolated from the agent harness process. The model should never be able to reach the host filesystem, the harness credentials, or arbitrary network endpoints through a code execution tool. Use containerized or VM-level sandboxes with explicit egress allowlists. Treat every code execution result as untrusted output — validate its schema before using it as input to a subsequent tool call.Distributed tracing
Trace operational events, not private hidden reasoning. Every trace should be able to answer: what did the agent try to do; what data did it use; what tool changed state; who approved it; what failed; why did it stop; could the run be replayed. Capture the following fields on every run:Eval types
Evaluate the harness, not only the model. Evals should cover reliability under normal conditions, safety under adversarial conditions, and correct behavior at boundary conditions.Task success and reliability
Task success and reliability
Measure whether the agent completes the task using the right tools, in the right order, with valid arguments. Track tool selection precision (did it call the right tool?), unnecessary tool calls (did it call tools it did not need?), output format adherence, and cost and latency per task type. Run against a representative sample of historical tasks before launch.
Safety and permission correctness
Safety and permission correctness
Verify that permission checks fire correctly for every risk class. Test approval requests for external sends, financial actions, and destructive operations. Test that the model cannot approve its own actions. Track human intervention rate as a leading indicator of safety failures. Test failure recovery — when a connector returns an error, does the agent stop safely or loop?
Prompt injection probes
Prompt injection probes
Inject adversarial content into retrieved documents, tool results, and email bodies. Test that the agent does not change its objective, call unexpected tools, or exfiltrate data in response. Specific cases:
- Retrieved document says “ignore previous instructions”
- Email contains a request to forward data to an external address
- Tool result contains a fake system message
- Search result includes a hidden instruction in metadata
Budget exhaustion and runaway loops
Budget exhaustion and runaway loops
Verify that step budgets, token budgets, and cost budgets are enforced. The agent must stop and report failure cleanly when a budget is reached — not loop, not silently truncate, not fabricate a result. Test scenarios where the goal is vague or impossible to ensure the agent stops rather than running indefinitely.
Missing tool result and connector failure
Missing tool result and connector failure
Test behavior when a connector returns an auth error, a timeout, or malformed data. The agent must return a structured error observation and either retry with backoff or escalate to the user — it must not hallucinate a tool result or proceed as if the call succeeded.
Context compaction retention
Context compaction retention
Trigger compaction by filling the context to the configured threshold. Verify that the active objective, approval state, and active plan are present in the post-compaction context. Test that the agent can continue the task after compaction without repeating completed steps or losing durable state.
Trace grading
Trace grading
After each eval run, grade specific trace events:
- Did the agent use the right tool?
- Was the tool call necessary?
- Were arguments valid?
- Was permission checked?
- Was approval requested at the right time?
- Was the final answer grounded in tool results?
- Did compaction preserve the active objective?
Launch gates
An agent must pass all of the following criteria before receiving production traffic:- Narrow tool registry — only tools required for the production task are visible
- Local schema validation is active for all tool arguments
- Permission matrix is enforced in code, not only in prompt instructions
- Approval UX is live and tested for all risky action classes
- Prompt injection test suite passes with no regressions
- Compaction tests pass — active objective and approval state survive context reset
- Connector auth and revocation flows are tested end-to-end
- Trace logging is enabled and traces are reviewable
- Cost budgets are enforced with automatic stop on breach
- Rollback or incident path is documented and tested
- Evals have been run on both realistic and adversarial tasks
The first rollout should be limited-scope, shadow-mode, or monitored by a human reviewer. Expand traffic only after reviewing traces from real runs and confirming human acceptance rates are within target.
Incident response
When an agent takes an unexpected action in production:Pause risky tools
Immediately disable or rate-limit the tool classes involved in the unexpected action. Do not wait for a root-cause analysis before containing the blast radius.
Preserve traces and artifacts
Lock the run trace, all tool call records, approval records, and any produced artifacts. Do not allow compaction or cleanup to delete evidence.
Identify the failure class
Determine whether the failure originated in the instruction (prompt logic), a tool (schema, validation, or side-effect), a connector (auth, scope, or external behavior), or the model (unexpected reasoning). Each class has a different fix path.
Patch the relevant layer
Fix the policy, tool schema, context logic, or permission check that allowed the unexpected action. Do not rely on prompt-only fixes for failures that should be enforced mechanically.
Add a regression eval
Convert the incident scenario into a named adversarial eval case. It must pass before re-enabling the affected tools.