LLM Python API: Prompts, Responses, Tools, and Streaming

The LLM Python API lets you run prompts against any installed model directly from Python code — the same models accessible via the CLI are available programmatically. You can stream responses, pass multi-modal attachments, call tools, extract structured data with schemas, and manage API keys, all without leaving Python.

Getting a Model

Use llm.get_model() to retrieve a model by its ID or alias. The function is available as soon as you import llm — no separate configuration step is needed.

import llm

model = llm.get_model("gpt-4o-mini")

Calling llm.get_model() with no argument uses the currently configured default model, which is gpt-4o-mini unless you have changed it. Passing an invalid model ID raises llm.UnknownModelError.

Running a Prompt

Call model.prompt() to send a prompt to the model. The response uses lazy evaluation — the network call does not happen until you actually consume the response (by calling .text(), iterating it, or casting it to str).

import llm

model = llm.get_model("gpt-4o-mini")
response = model.prompt("Five surprising names for a pet pelican")
print(response.text())

Before the response is evaluated, its repr looks like:

<Response prompt='Five surprising names for a pet pelican' text='... not yet done ...'>

Because Response.__str__ returns the text, you can also write:

print(llm.get_model().prompt("Five surprising names for a pet pelican"))

System Prompts

Pass a system prompt using the system= keyword argument:

response = model.prompt(
    "Five surprising names for a pet pelican",
    system="Answer like GlaDOS"
)
print(response.text())

Attachments

Models that accept multi-modal input (images, audio, video, etc.) accept a list of llm.Attachment objects via the attachments= keyword argument.

import llm

model = llm.get_model("gpt-4o-mini")
response = model.prompt(
    "Describe these images",
    attachments=[
        llm.Attachment(path="pelican.jpg"),
        llm.Attachment(url="https://static.simonwillison.net/static/2024/pelicans.jpg"),
    ]
)
print(response.text())

llm.Attachment(path="photo.jpg")

To check which MIME types a model accepts, inspect model.attachment_types:

model = llm.get_model("gpt-4o-mini")
print(model.attachment_types)
# {'image/gif', 'image/png', 'image/jpeg', 'image/webp'}

if "image/jpeg" in model.attachment_types:
    # safe to send a JPEG
    ...

Tools

Tools are Python functions the model can call as part of generating a response. Define a function with a docstring (LLM uses it as the tool description), then pass it in the tools= list.

Basic tool call

import llm

def upper(text: str) -> str:
    """Convert text to uppercase."""
    return text.upper()

model = llm.get_model("gpt-4.1-mini")
response = model.prompt("Convert panda to upper", tools=[upper])

# Inspect what the model wants to call
tool_calls = response.tool_calls()
# [ToolCall(name='upper', arguments={'text': 'panda'}, tool_call_id='...')]

# Execute the calls and get results
tool_results = response.execute_tool_calls()
# [ToolResult(name='upper', output='PANDA', tool_call_id='...')]

# Get the model's follow-up reply (auto-executes tool calls)
follow_up = response.reply()
print(follow_up.text())
# The word "panda" converted to uppercase is "PANDA".

Every tool call has a guaranteed-unique tool_call_id. LLM synthesizes one of the form tc_01... for providers that do not supply their own, so you can always correlate calls with results.

Automatic chain loop

For an automatic loop that keeps calling tools until the model stops requesting them, use model.chain():

chain_response = model.chain(
    "Convert panda to upper",
    tools=[upper],
)
print(chain_response.text())
# The word "panda" converted to uppercase is "PANDA".

Stream the chain output token by token:

for chunk in model.chain("Convert panda to upper", tools=[upper]):
    print(chunk, end="", flush=True)

Iterate over individual responses inside the chain:

chain = model.chain("Convert panda to upper", tools=[upper])
for response in chain.responses():
    print(response.prompt)
    for chunk in response:
        print(chunk, end="", flush=True)

Tool debugging hooks

Pass before_call= and after_call= to model.chain() to run code around each tool invocation. Raise llm.CancelToolCall in before_call to abort a specific call:

import llm
from typing import Optional

def upper(text: str) -> str:
    "Convert text to uppercase."
    return text.upper()

def before_call(tool: Optional[llm.Tool], tool_call: llm.ToolCall):
    print(f"About to call {tool.name} with {tool_call.arguments}")
    if tool.name == "upper" and "bad" in repr(tool_call.arguments):
        raise llm.CancelToolCall("Not allowed on text containing 'bad'")

def after_call(tool: llm.Tool, tool_call: llm.ToolCall, tool_result: llm.ToolResult):
    print(f"{tool.name} returned {tool_result.output}")

model = llm.get_model("gpt-4.1-mini")
response = model.chain(
    "Convert panda to upper and badger to upper",
    tools=[upper],
    before_call=before_call,
    after_call=after_call,
)
print(response.text())

Pausing a chain

Raise llm.PauseChain inside a tool to stop the chain cleanly — for example, when human approval is required:

import llm

def delete_files(path: str) -> str:
    if not approval_already_recorded(path):
        record_approval_request(path)
        raise llm.PauseChain("waiting for approval to delete " + path)
    do_delete(path)
    return "deleted"

try:
    chain_response.text()
except llm.PauseChain as pause:
    print("Paused on", pause.tool_call.name, pause.tool_call.tool_call_id)

Unlike other exceptions, PauseChain does not produce an error tool result — the framework propagates it with pause.tool_call and pause.tool_results (sibling calls that completed) attached.

Tools that return attachments

Tools can return llm.ToolOutput to pass attachments back to the model alongside their text output:

import llm

def generate_image(prompt: str) -> llm.ToolOutput:
    """Generate an image based on the prompt."""
    image_content = generate_image_from_prompt(prompt)
    return llm.ToolOutput(
        output="Image generated successfully",
        attachments=[llm.Attachment(
            content=image_content,
            type="image/png"
        )],
    )

Toolbox classes

For tools that share state or configuration, subclass llm.Toolbox. All public methods become tools automatically:

import llm

class Memory(llm.Toolbox):
    _memory = None

    def _get_memory(self):
        if self._memory is None:
            self._memory = {}
        return self._memory

    def set(self, key: str, value: str):
        "Set something as a key"
        self._get_memory()[key] = value

    def get(self, key: str):
        "Get something from a key"
        return self._get_memory().get(key) or ""

    def keys(self):
        "Return a list of keys"
        return list(self._get_memory().keys())

model = llm.get_model("gpt-4.1-mini")
memory = Memory()
conversation = model.conversation(tools=[memory])
print(conversation.chain("Set name to Simon").text())
print(memory._memory)  # {'name': 'Simon'}

Use toolbox.add_tool(fn) to register additional tools after construction. Implement prepare() (or prepare_async()) on the class to run setup logic before the toolbox’s first use.

Schemas

Pass a JSON schema or a Pydantic BaseModel subclass via schema= to receive structured output:

import llm, json
from pydantic import BaseModel

class Dog(BaseModel):
    name: str
    age: int

model = llm.get_model("gpt-4o-mini")
response = model.prompt("Describe a nice dog", schema=Dog)
dog = json.loads(response.text())
print(dog)  # {"name": "Buddy", "age": 3}

Fragments

Pass pre-assembled text fragments to the prompt or system prompt using fragments= and system_fragments=:

response = model.prompt(
    "What do these documents say about dogs?",
    fragments=[
        open("dogs1.txt").read(),
        open("dogs2.txt").read(),
    ],
    system_fragments=[
        "You answer questions like Snoopy",
    ]
)

Fragments are especially useful for plugin authors who want to tap into LLM’s fragment caching system (e.g. the llm-anthropic plugin uses them for Claude’s prompt caching).

Model Options

Pass model-specific options (viewable with llm models --options) as a dictionary to options=:

model = llm.get_model()
print(model.prompt("Names for otters", options={"temperature": 0.2}))

Passing an API Key

Supply an API key directly to model.prompt() using key=:

model = llm.get_model("gpt-4o-mini")
print(model.prompt("Names for beavers", key="sk-..."))

If key= is not provided, LLM falls back to the OPENAI_API_KEY environment variable, and then to keys stored with llm keys set.

Models from Plugins

Any model installed via a plugin is available through the same llm.get_model() call:

pip install llm-anthropic

import llm

model = llm.get_model("claude-3.5-sonnet")
# Set the key if you haven't run 'llm keys set claude'
model.key = "YOUR_API_KEY_HERE"
response = model.prompt("Five surprising names for a pet pelican")
print(response.text())

Accessing the Underlying JSON

Most model plugins expose the raw provider JSON via response.json():

import llm
from pprint import pprint

model = llm.get_model("gpt-4o-mini")
response = model.prompt("3 names for an otter")
pprint(response.json())

Example output from GPT-4o mini:

{'content': 'Sure! Here are three fun names for an otter:\n\n1. **Splash**\n2. **Bubbles**\n3. **Otto**',
 'created': 1739291215,
 'finish_reason': 'stop',
 'id': 'chatcmpl-AznO31yxgBjZ4zrzBOwJvHEWgdTaf',
 'model': 'gpt-4o-mini-2024-07-18',
 ...}

The JSON structure differs between providers. Code that reads response.json() directly will only work with one specific model provider.

Token Usage

Call response.usage() to get a Usage object with input and output token counts:

from pprint import pprint

model = llm.get_model("gpt-4o-mini")
response = model.prompt("Name an otter")
pprint(response.usage())

Usage(input=5,
      output=2,
      details={'candidatesTokensDetails': [{'modality': 'TEXT', 'tokenCount': 2}],
               'promptTokensDetails': [{'modality': 'TEXT', 'tokenCount': 5}]})

The Usage dataclass has three fields: input (int or None), output (int or None), and details (dict or None with provider-specific breakdown).

Streaming Responses

Iterate the response object directly to receive text chunks as they stream in:

response = model.prompt("Five diabolical names for a pet goat")
for chunk in response:
    print(chunk, end="")

Calling response.text() after iteration continues to return the same complete string — it does not re-run the prompt.

Structured Messages and Streaming Events

LLM’s richer structured API lets you pass full message lists and inspect typed streaming events.

Prompting with a message list

Use the user(), assistant(), and system() helper functions to build multi-turn message histories:

import llm
from llm import user, assistant, system

model = llm.get_model("gpt-4o-mini")

response = model.prompt(messages=[
    system("You are a helpful pirate."),
    user("What is the capital of France?"),
    assistant("Paris, matey."),
    user("And Germany?"),
])
print(response.text())

Streaming typed events

response.stream_events() yields StreamEvent objects for every content block as it arrives — useful for UIs that show reasoning, tool calls, and text side by side:

response = model.prompt("Explain quantum computing briefly.")
for event in response.stream_events():
    if event.type == "reasoning":
        print(f"[thinking] {event.chunk}", end="", flush=True)
    elif event.type == "text":
        print(event.chunk, end="", flush=True)
    elif event.type == "tool_call_name":
        print(f"\n[calling tool: {event.chunk}]")
    elif event.type == "tool_call_args":
        print(event.chunk, end="", flush=True)

Event types: "text", "reasoning", "tool_call_name", "tool_call_args", "tool_result". Plain iteration (for chunk in response) yields only text strings.

Hiding reasoning output

Pass hide_reasoning=True to suppress visible reasoning tokens from supported models:

response = model.prompt(
    "Explain quantum computing briefly.",
    hide_reasoning=True,
)
print(response.text())

Inspecting assembled messages

response.messages() returns the list of Message objects produced by the model after the response completes:

response = model.prompt("What's 2+2?")
for message in response.messages():
    for part in message.parts:
        print(type(part).__name__, part.to_dict())

Persisting and resuming conversations

response.to_dict() serializes a response to a JSON-safe dict; Response.from_dict() rehydrates it. Use response.reply() to continue from a rehydrated response:

import json, llm

model = llm.get_model("gpt-4o-mini")
response = model.prompt("What's 2+2?")
print(response.text())

payload = json.dumps(response.to_dict())
# Save payload wherever you like, then later:

rebuilt = llm.Response.from_dict(json.loads(payload))
followup = rebuilt.reply("Add 3 to that")
print(followup.text())

Running Code When a Response Completes

Use response.on_done(callback) to run a function as soon as all tokens have been received — useful for token accounting, logging, or triggering downstream work:

import llm

model = llm.get_model("gpt-4o-mini")
response = model.prompt("a poem about a hippo")
response.on_done(lambda r: print(r.usage()))
print(response.text())

The callback signature is def callback(response). The callback is called synchronously at the end of the response stream. For async models, on_done must be awaited — see the Async API page.

Listing Models

import llm

# All synchronous models
for model in llm.get_models():
    print(model.model_id)

# All async models
for model in llm.get_async_models():
    print(model.model_id)

llm.get_models_with_aliases() returns a list of ModelWithAliases objects that pair each model with its registered aliases.

Other Utility Functions

set_alias / remove_alias

import llm

# Create an alias
llm.set_alias("mini", "gpt-4o-mini")

# Remove an alias (raises KeyError if it doesn't exist)
llm.remove_alias("turbo")

set_default_model / get_default_model

import llm

# Set the default model globally (persisted to disk)
llm.set_default_model("claude-3.5-sonnet")

# Get the current default (returns "gpt-4o-mini" if unset)
model_id = llm.get_default_model()

# Detect whether a default has actually been configured
if llm.get_default_model(default=None) is None:
    print("No default has been set")

set_default_model() writes to the LLM configuration folder and affects all programs using LLM on the system, including the llm CLI.

set_default_embedding_model / get_default_embedding_model

import llm

llm.set_default_embedding_model("text-embedding-3-small")
model_id = llm.get_default_embedding_model()

These work identically to set_default_model / get_default_model but target the default embedding model.

Reference

LLM Python API: Prompts, Responses, Tools, and Streaming

Getting a Model

Running a Prompt

System Prompts

Attachments

Tools

Basic tool call

Automatic chain loop

Tool debugging hooks

Pausing a chain

Tools that return attachments

Toolbox classes

Schemas

Fragments

Model Options

Passing an API Key

Models from Plugins

Accessing the Underlying JSON

Token Usage

Streaming Responses

Structured Messages and Streaming Events

Prompting with a message list

Streaming typed events

Hiding reasoning output

Inspecting assembled messages

Persisting and resuming conversations

Running Code When a Response Completes

Listing Models

Other Utility Functions

Build docs developers (and LLMs) love

Reference

Documentation Index

​Getting a Model

​Running a Prompt

​System Prompts

​Attachments

​Tools

​Basic tool call

​Automatic chain loop

​Tool debugging hooks

​Pausing a chain

​Tools that return attachments

​Toolbox classes

​Schemas

​Fragments

​Model Options

​Passing an API Key

​Models from Plugins

​Accessing the Underlying JSON

​Token Usage

​Streaming Responses

​Structured Messages and Streaming Events

​Prompting with a message list

​Streaming typed events

​Hiding reasoning output

​Inspecting assembled messages

​Persisting and resuming conversations

​Running Code When a Response Completes

​Listing Models

​Other Utility Functions

Build docs developers (and LLMs) love

Getting a Model

Running a Prompt

System Prompts

Attachments

Tools

Basic tool call

Automatic chain loop

Tool debugging hooks

Pausing a chain

Tools that return attachments

Toolbox classes

Schemas

Fragments

Model Options

Passing an API Key

Models from Plugins

Accessing the Underlying JSON

Token Usage

Streaming Responses

Structured Messages and Streaming Events

Prompting with a message list

Streaming typed events

Hiding reasoning output

Inspecting assembled messages

Persisting and resuming conversations

Running Code When a Response Completes

Listing Models

Other Utility Functions