The Plugin Tutorial covers the basics of registering a model and implementingDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/simonw/LLM/llms.txt
Use this file to discover all available pages before exploring further.
execute(). This page goes further, covering the patterns you’ll need when building production-quality plugins for real model APIs — including API key management, async support, structured output, tools, attachments, rich streaming events, and token tracking.
Lazily loading expensive dependencies
If your plugin depends on a large library like PyTorch, don’t import it at the top level of your module. Top-level imports run every timellm starts — including for simple commands like llm --help — and can add multiple seconds of startup latency.
Move expensive imports inside the methods that need them:
Models that accept API keys
Models that call external API providers should subclassllm.KeyModel instead of llm.Model. This wires your model into LLM’s key management system, which reads from llm keys set, environment variables, and the --key CLI flag.
key= parameter to execute() — LLM populates it automatically:
key by checking, in order: the --key CLI option, the key registry entry for needs_key, and then the key_env_var environment variable.
Async models
Plugins can provide an async model for use with Python’sasyncio. The async class subclasses llm.AsyncModel and implements execute() as an async generator:
llm.AsyncKeyModel:
register_models():
Supporting schemas
If your model supports structured JSON output, declare this withsupports_schema = True and check for prompt.schema in execute():
prompt.schema is always a Python dictionary representing a JSON schema, regardless of what format the caller used.
Supporting tools
Adding tools support involves several coordinated steps:If
prompt.tools is populated, convert each llm.Tool object into your provider’s format and include it in the API request.For each tool call in the provider’s response, call
response.add_tool_call(). Pass the provider’s own ID if one is available:response.add_tool_call(
llm.ToolCall(
tool_call_id=tool_id, # omit to let LLM generate a tc_-prefixed id
name=tool_name,
arguments=parsed_args,
)
)
If
prompt.tool_results is populated, include those llm.ToolResult objects in the messages sent to the API.Some prompts carry only tool results, so
prompt.prompt may be None. Your code must handle that case.Attachments for multi-modal models
Models that accept images, audio, or other binary content declare the MIME types they support via anattachment_types class attribute:
MP3 files are detected as
audio/mpeg, not audio/mp3.Working with Attachment objects
Insideexecute(), the prompt.attachments list contains Attachment instances. Each has:
url— the original URL, if providedpath— the resolved file path, if providedtype— the declared content type, if providedcontent— raw bytes, if provided directly
attachment.resolve_type()— returns the content type, guessing from bytes if necessaryattachment.content_bytes()— returns binary content, fetching from a URL or reading from disk as neededattachment.base64_content()— returns the content as a base64-encoded string
prompt.prompt is None:
Structured messages and streaming events
Rather than yielding plain strings,execute() can yield StreamEvent objects for richer structured output. This enables separate handling of text, reasoning tokens, tool calls, and server-side tool results.
Plain string yields still work unchanged. Each string is internally wrapped as
StreamEvent(type="text", chunk=...). You only need StreamEvent when you need to emit non-text content types.Yielding StreamEvent objects
StreamEvent has these primary fields:
| Field | Description |
|---|---|
type | One of "text", "reasoning", "tool_call_name", "tool_call_args", "tool_result" |
chunk | The text fragment for this event |
tool_call_id | Provider’s ID for the tool call — used to group tool events |
provider_metadata | Optional dict[str, dict] for opaque provider data |
server_executed | Set True for server-side tool calls (e.g. Anthropic web search) |
tool_name | Set on tool_result events |
part_index | Override event grouping (leave None for automatic) |
How events group into Parts
The framework groups consecutive events intoPart objects automatically:
| Event stream | Resulting Parts |
|---|---|
text × N | one TextPart |
reasoning × N, then text × N | ReasoningPart, TextPart |
text, tool call events, text | TextPart, ToolCallPart, TextPart |
| Parallel tool calls (interleaved by id) | one ToolCallPart per distinct tool_call_id |
Reasoning tokens
YieldStreamEvent(type="reasoning", ...) for thinking tokens:
prompt.hide_reasoning — when True, do not request reasoning summaries from providers that require an explicit opt-in:
prompt.hide_reasoning is True.
Tool call events
Emit a pair of events sharing atool_call_id for each tool call:
response.add_tool_call():
Server-side tool calls
For tools the API runs internally (like Anthropic web search), setserver_executed=True:
response.add_tool_call() for server-side tool calls.
Opaque provider metadata
Some providers require you to echo back fields on subsequent requests for multi-turn continuity:Consuming prompt.messages in build_messages
prompt.messages is the complete input chain for the current turn — whether supplied explicitly via model.prompt(messages=[...]), assembled from keyword arguments, or built by a Conversation.
Do not also walk conversation.responses. History is already baked into prompt.messages and iterating the conversation would duplicate it.
Restoring opaque metadata on subsequent requests
When resuming a conversation, extract provider-specific metadata fromPart.provider_metadata and fold it back into your outgoing request body:
Tracking token usage
Callresponse.set_usage() at the end of execute() to record token counts. The data is stored in the SQLite log and available through the Python API:
input— integer count of input tokensoutput— integer count of output tokensdetails— dict for extra breakdown (e.g. cached tokens, reasoning tokens)
Tracking resolved model names
Some providers expose aliases likemodel-latest that resolve to different underlying models over time. If the API response includes the actual model used, record it:
resources.resolved_model in the log and shown in llm logs output.
LLM_RAISE_ERRORS
During plugin development, set theLLM_RAISE_ERRORS environment variable to make LLM raise exceptions instead of catching and logging them:
-i flag drops Python into an interactive shell if an error occurs. From there you can open a post-mortem debugger: