LLM provides a fully asynchronous API for use with Python’s asyncio. Where the synchronous API blocks until the model responds, the async API yields control back to the event loop — making it a natural fit for web servers, CLI tools with concurrent requests, or any application that manages multiple I/O-bound tasks at once.
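
To make the concurrency benefit concrete, here is a minimal sketch that uses only asyncio. `fake_prompt` is a hypothetical stand-in for an awaited model call (it sleeps instead of contacting a provider) and is not part of LLM. The point is only that the three waits overlap rather than run back to back.

```python
import asyncio
import time

async def fake_prompt(prompt: str, delay: float = 0.2) -> str:
    """Hypothetical stand-in for an awaited model call; sleeps instead of calling an API."""
    await asyncio.sleep(delay)  # yields control to the event loop while "waiting"
    return f"response to {prompt!r}"

async def run_concurrently(prompts: list[str]) -> tuple[list[str], float]:
    start = time.perf_counter()
    # gather() schedules every coroutine at once and returns results in input order
    results = await asyncio.gather(*(fake_prompt(p) for p in prompts))
    return list(results), time.perf_counter() - start

if __name__ == "__main__":
    results, elapsed = asyncio.run(run_concurrently(["a", "b", "c"]))
    print(results)
    print(f"elapsed: {elapsed:.2f}s")  # roughly one delay, not three
```

With a real async model, each `fake_prompt(p)` call would presumably be replaced by an awaited `model.prompt(p).text()`, as shown in the sections below.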

Getting an Async Model

Use llm.get_async_model() instead of llm.get_model(). The function accepts the same model IDs and aliases:
import llm

model = llm.get_async_model("gpt-4o-mini")
llm.get_async_model() raises llm.UnknownModelError if the model ID is not found. If a synchronous model exists under that name but no async version does, the error message says "Unknown async model (sync model exists): ..." to help you diagnose the difference.

Running a Prompt

model.prompt() returns an AsyncResponse. Await it to get the full text, or stream it token by token with async for:
import asyncio, llm

async def run():
    model = llm.get_async_model("gpt-4o-mini")
    text = await model.prompt(
        "Five surprising names for a pet pelican"
    ).text()
    print(text)

asyncio.run(run())
model.prompt() accepts the same keyword arguments as the synchronous version — system=, attachments=, tools=, schema=, key=, options=, hide_reasoning=, etc.
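
The example above awaits the full text. The streaming form consumes the response with async for instead. The sketch below shows that consumption pattern; `fake_response` is a hypothetical async generator used so the snippet runs without an API key, and `collect_stream` is an illustrative helper, not an LLM function. With LLM you would pass the object returned by model.prompt(...) in place of `fake_response()`.

```python
import asyncio

async def fake_response():
    """Hypothetical stand-in for model.prompt(...): yields text chunks as they 'arrive'."""
    for chunk in ["Pel", "ican", "!"]:
        await asyncio.sleep(0)  # pretend to wait for the next token
        yield chunk

async def collect_stream(response) -> str:
    """Print each chunk as soon as it arrives and return the assembled text."""
    parts = []
    async for chunk in response:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts)

if __name__ == "__main__":
    asyncio.run(collect_stream(fake_response()))
```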

AsyncResponse

AsyncResponse is the async counterpart of Response. All methods that read the response body are coroutines and must be awaited.

await response.text()

Return the full response text, driving the stream to completion if needed.

await response.json()

Return the raw provider JSON dict, or None if unavailable.

await response.usage()

Return a Usage(input, output, details) dataclass.

await response.tool_calls()

Return the list of ToolCall objects requested by the model.
import asyncio, llm

async def run():
    model = llm.get_async_model("gpt-4o-mini")
    response = model.prompt("Name three rivers")
    text = await response.text()
    print(text)
    print(await response.usage())

asyncio.run(run())
Before the response is awaited, its repr shows:
<AsyncResponse prompt='Name three rivers' text='... not yet awaited ...'>

Streaming typed events

Use response.astream_events() to receive StreamEvent objects (text, reasoning, tool calls) as they arrive:
import asyncio, llm

async def run():
    model = llm.get_async_model("gpt-4o-mini")
    response = model.prompt("Explain quantum computing.")
    async for event in response.astream_events():
        if event.type == "reasoning":
            print(f"[thinking] {event.chunk}", end="", flush=True)
        elif event.type == "text":
            print(event.chunk, end="", flush=True)
        elif event.type == "tool_call_name":
            print(f"\n[calling tool: {event.chunk}]")

asyncio.run(run())

Inspecting assembled messages

await response.messages() returns the list of Message objects produced by the model:
# Run inside an async function, with model obtained from llm.get_async_model()
response = model.prompt("What's 2+2?")
for message in await response.messages():
    for part in message.parts:
        print(type(part).__name__, part.to_dict())

Tool Functions Can Be Sync or Async

Tool functions passed to tools= can be either regular functions or async def coroutines:
import asyncio, llm

async def hello(name: str) -> str:
    "Say hello to name"
    return "Hello there " + name

# Works in a sync context too — LLM wraps it in asyncio.run() automatically
model = llm.get_model("gpt-4.1-mini")
chain_response = model.chain("Say hello to Percival", tools=[hello])
print(chain_response.text())
When an async def tool is used in a synchronous context, LLM executes it via asyncio.run() in a thread pool. When used in an async context, it runs natively as a coroutine. Either way, the same function works in both contexts.
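
The mechanism described above can be approximated in a few lines of standard-library code. This is a sketch of the general technique, not LLM's actual implementation: `call_tool_from_sync_code` and `shout` are hypothetical names introduced for illustration.

```python
import asyncio
import inspect
from concurrent.futures import ThreadPoolExecutor

async def hello(name: str) -> str:
    "Say hello to name"
    return "Hello there " + name

def shout(text: str) -> str:
    "Plain synchronous tool"
    return text.upper()

def call_tool_from_sync_code(tool, *args):
    """Illustrative helper (not part of LLM): run a sync or async tool from sync code."""
    if inspect.iscoroutinefunction(tool):
        # Run the coroutine on a fresh event loop in a worker thread, so this
        # also works if the calling thread already has a loop running.
        with ThreadPoolExecutor(max_workers=1) as pool:
            return pool.submit(asyncio.run, tool(*args)).result()
    return tool(*args)

if __name__ == "__main__":
    print(call_tool_from_sync_code(hello, "Percival"))
    print(call_tool_from_sync_code(shout, "panda"))
```

Running the coroutine in a separate thread is what makes the helper safe to call from code that may itself be running inside an event loop, where calling asyncio.run() directly would raise an error.
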
In an async context, synchronous tool implementations block the event loop for their entire duration. Only use synchronous tools with async models if you are certain they are extremely fast (microsecond-scale computations, not I/O).
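
The blocking effect is easy to observe with plain asyncio. In the sketch below, `slow_sync_tool` is a hypothetical blocking function, and a background heartbeat task counts how often the event loop gets to run while the tool is in progress. Offloading with asyncio.to_thread() is one common asyncio workaround for blocking work inside an async def tool; it is an assumption here, not something LLM prescribes.

```python
import asyncio
import time

def slow_sync_tool() -> str:
    """Hypothetical synchronous tool that blocks for 0.2s (stands in for blocking I/O)."""
    time.sleep(0.2)
    return "done"

async def count_ticks_during(call) -> int:
    """Count how often a background task runs while `call` is in progress."""
    ticks = 0

    async def heartbeat():
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(heartbeat())
    await asyncio.sleep(0)  # let the heartbeat start
    await call()
    task.cancel()
    return ticks

async def call_directly():
    slow_sync_tool()  # blocks the event loop for the full 0.2s

async def call_in_thread():
    await asyncio.to_thread(slow_sync_tool)  # loop stays free while the thread sleeps

async def main() -> tuple[int, int]:
    blocked = await count_ticks_during(call_directly)
    free = await count_ticks_during(call_in_thread)
    return blocked, free

if __name__ == "__main__":
    blocked, free = asyncio.run(main())
    print(f"ticks while blocking: {blocked}, ticks while offloaded: {free}")
```

The heartbeat never advances during the direct call because nothing yields to the event loop until the blocking function returns.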

Tool Use for Async Models

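With async models, model.chain() itself is called without await; you await the chain's text() or iterate over the chain with async for. response.execute_tool_calls() and response.reply() are coroutines and must be awaited directly: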

Async chain loop

import asyncio, llm

def upper(string: str) -> str:
    "Converts string to uppercase"
    return string.upper()

async def run():
    model = llm.get_async_model("gpt-4.1")
    chain = model.chain(
        "Convert panda to uppercase then pelican to uppercase",
        tools=[upper],
        after_call=print,
    )
    print(await chain.text())

asyncio.run(run())
Stream the chain output as it is generated:
# Reuses upper() and the imports from the previous example
async def run():
    model = llm.get_async_model("gpt-4.1")
    async for chunk in model.chain(
        "Convert panda to uppercase then pelican to uppercase",
        tools=[upper],
    ):
        print(chunk, end="", flush=True)

asyncio.run(run())

Manual tool execution and reply

import asyncio, llm

def upper(text: str) -> str:
    "Convert text to uppercase"
    return text.upper()

async def run():
    model = llm.get_async_model("gpt-4.1")
    response = model.prompt("Convert panda to upper", tools=[upper])
    await response.text()                         # drain the stream first
    follow_up = await response.reply()            # auto-executes tool calls
    print(await follow_up.text())

asyncio.run(run())
Pass explicit tool_results= to reply() to skip auto-execution and supply your own results:
follow_up = await response.reply(
    tool_results=[llm.ToolResult(name="upper", output="PANDA", tool_call_id="...")]
)

Async before_call and after_call hooks

Both before_call and after_call hooks can be async def functions when used with async models:
import asyncio, llm
from typing import Optional

def upper(string: str) -> str:
    "Converts string to uppercase"
    return string.upper()

# Runs before each tool call; tool is typed Optional, so it may be None
async def before_call(tool: Optional[llm.Tool], tool_call: llm.ToolCall):
    print(f"[async] about to call {tool_call.name}")

# Runs after each tool call with the result the tool produced
async def after_call(tool: llm.Tool, tool_call: llm.ToolCall, result: llm.ToolResult):
    print(f"[async] {tool_call.name} returned {result.output!r}")

async def run():
    model = llm.get_async_model("gpt-4.1-mini")
    chain = model.chain(
        "Convert panda to uppercase",
        tools=[upper],
        before_call=before_call,
        after_call=after_call,
    )
    print(await chain.text())

asyncio.run(run())

AsyncConversation

model.conversation() on an async model returns an AsyncConversation. Call conversation.prompt() or conversation.chain() on it. Neither call is awaited directly; instead, await a method on the returned object, such as text(), to get the result:
import asyncio, llm

async def run():
    model = llm.get_async_model("gpt-4o-mini")
    conversation = model.conversation()

    r1 = await conversation.prompt("Five fun facts about pelicans").text()
    print(r1)

    r2 = await conversation.prompt("Now do skunks").text()
    print(r2)

asyncio.run(run())

Async conversation with tools

import asyncio, llm

async def search_web(query: str) -> str:
    "Search the web for information"
    # Replace with a real async HTTP call in production
    await asyncio.sleep(0.01)
    return f"Results for: {query}"

async def run():
    model = llm.get_async_model("gpt-4.1-mini")
    conversation = model.conversation(tools=[search_web])

    result1 = await conversation.chain("Find recent news about pelicans").text()
    print(result1)

    result2 = await conversation.chain("What about flamingos?").text()
    print(result2)

asyncio.run(run())

Running Code When a Response Completes

await response.on_done(callback) queues a function to run as soon as all tokens have been received. The callback receives the completed response and can be either sync or async:
import asyncio, llm

async def run():
    model = llm.get_async_model("gpt-4o-mini")
    response = model.prompt("a short poem about a brick")

    async def done(response):
        print("Usage:", await response.usage())

    await response.on_done(done)   # registers the callback
    print(await response.text())  # drives the stream; callback fires at end

asyncio.run(run())
on_done is useful for token accounting, logging, or triggering downstream work the moment a response finishes — without polling or restructuring your streaming loop.
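
To illustrate the callback contract (register first, fire once the stream is exhausted, accept sync or async callables), here is a self-contained sketch. `FakeResponse` is a hypothetical stand-in written for this example; it is not LLM's AsyncResponse and says nothing about how LLM implements on_done internally.

```python
import asyncio
import inspect

class FakeResponse:
    """Hypothetical stand-in (not LLM's AsyncResponse) showing the callback contract."""

    def __init__(self, chunks):
        self._chunks = list(chunks)
        self._callbacks = []
        self.chunks_seen = 0

    async def on_done(self, callback):
        # Register only; nothing runs until the stream is driven to completion
        self._callbacks.append(callback)

    async def text(self) -> str:
        parts = []
        for chunk in self._chunks:
            await asyncio.sleep(0)  # pretend to wait for the next token
            parts.append(chunk)
            self.chunks_seen += 1
        for callback in self._callbacks:
            result = callback(self)
            if inspect.isawaitable(result):  # an async def callback returns a coroutine
                await result
        return "".join(parts)

async def demo() -> list[str]:
    log = []

    def sync_done(response):
        log.append(f"sync saw {response.chunks_seen} chunks")

    async def async_done(response):
        log.append(f"async saw {response.chunks_seen} chunks")

    response = FakeResponse(["a ", "brick"])
    await response.on_done(sync_done)
    await response.on_done(async_done)
    log.append(await response.text())  # callbacks fire before text() returns
    return log

if __name__ == "__main__":
    print(asyncio.run(demo()))
```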

Persisting and Resuming Async Responses

AsyncResponse supports the same to_dict() / from_dict() round-trip as the sync Response, but requires the response to be awaited first:
import asyncio, json, llm

async def run():
    model = llm.get_async_model("gpt-4o-mini")
    response = model.prompt("What's 2+2?")
    await response  # must await before serializing

    payload = json.dumps(response.to_dict())

    # Later — rehydrate and continue
    rebuilt = llm.AsyncResponse.from_dict(json.loads(payload))
    followup = await rebuilt.reply("Add 3 to that")
    print(await followup.text())

asyncio.run(run())

Listing Async Models

import llm

for model in llm.get_async_models():
    print(model.model_id)
Use llm.get_models_with_aliases() to see both sync and async variants together — each ModelWithAliases entry has both .model and .async_model attributes (either may be None).

Complete Async Example

The following example brings together async prompting, tool use, streaming, and token tracking in a single script:
import asyncio
import llm

def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32

async def run():
    model = llm.get_async_model("gpt-4.1-mini")
    conversation = model.conversation(tools=[celsius_to_fahrenheit])

    # Turn 1: let the model decide to call the tool
    print("=== Turn 1 ===")
    async for chunk in conversation.chain("What is 100°C in Fahrenheit?"):
        print(chunk, end="", flush=True)
    print()

    # Turn 2: follow-up — model still has context
    print("\n=== Turn 2 ===")
    r2 = await conversation.chain("And -40°C?").text()
    print(r2)

    # Show token usage for each turn
    print("\n=== Usage ===")
    for i, response in enumerate(conversation.responses, 1):
        usage = await response.usage()
        print(f"Turn {i}: input={usage.input}, output={usage.output}")

asyncio.run(run())
