Async LLM Calls: AsyncModel, AsyncResponse, and asyncio
Use llm.get_async_model() to call LLMs asynchronously with Python asyncio. AsyncModel, AsyncResponse, and AsyncConversation support streaming and tools.
LLM provides a fully asynchronous API for use with Python’s asyncio. Where the synchronous API blocks until the model responds, the async API yields control back to the event loop — making it a natural fit for web servers, CLI tools with concurrent requests, or any application that manages multiple I/O-bound tasks at once.
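As a minimal sketch of what yielding to the event loop makes possible, the helper below sends several prompts concurrently with asyncio.gather(). The names prompt_many and main are hypothetical, and the sketch assumes only that model.prompt(...).text() is awaitable, as in the examples on this page.

```python
import asyncio


async def prompt_many(model, prompts):
    """Send every prompt concurrently and return the texts in prompt order.

    `model` is any object whose prompt(...) returns something with an
    awaitable .text(), such as the result of llm.get_async_model(...).
    """
    return await asyncio.gather(*(model.prompt(p).text() for p in prompts))


async def main():
    # Imported lazily so the helper above can be used without LLM installed.
    import llm

    model = llm.get_async_model("gpt-4o-mini")
    texts = await prompt_many(model, ["Name a pelican", "Name a skunk"])
    for text in texts:
        print(text)


# To try it against a real model: asyncio.run(main())
```

Because the requests overlap, the total wait is roughly that of the slowest response rather than the sum of all of them.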
llm.get_async_model() raises llm.UnknownModelError if the model ID is not found. If a synchronous model exists under that name but no async version does, the error message says "Unknown async model (sync model exists): ..." to help you diagnose the difference.
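One possible way to act on that difference is sketched below. async_model_or_reason is a hypothetical helper, and matching on the message text is an assumption based on the wording quoted above; it may need adjusting if the message changes.

```python
def async_model_or_reason(model_id):
    """Return (model, None) on success, or (None, reason) if lookup fails.

    reason is "sync-only" when only a synchronous model exists under that
    ID, and "unknown" when the ID matches nothing at all.
    """
    import llm  # imported lazily; assumes the llm package is installed

    try:
        return llm.get_async_model(model_id), None
    except llm.UnknownModelError as ex:
        # Assumption: the error text contains the phrase quoted in the docs.
        if "sync model exists" in str(ex):
            return None, "sync-only"
        return None, "unknown"
```

A caller could fall back to llm.get_model() when the reason is "sync-only" and report a typo otherwise.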
model.prompt() returns an AsyncResponse. Await its .text() method to get the full text, or iterate over the response with async for to stream it chunk by chunk:
    import asyncio, llm

    async def run():
        model = llm.get_async_model("gpt-4o-mini")
        text = await model.prompt(
            "Five surprising names for a pet pelican"
        ).text()
        print(text)

    asyncio.run(run())
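The streaming form can be sketched as follows. stream_to_stdout is a hypothetical helper, not part of LLM; it assumes only that the response can be iterated with async for and yields string chunks.

```python
import asyncio


async def stream_to_stdout(response):
    """Print chunks as they arrive and return the assembled text."""
    parts = []
    async for chunk in response:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


async def main():
    # Imported lazily so the helper above can be used without LLM installed.
    import llm

    model = llm.get_async_model("gpt-4o-mini")
    await stream_to_stdout(
        model.prompt("Five surprising names for a pet pelican")
    )


# To try it against a real model: asyncio.run(main())
```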
model.prompt() accepts the same keyword arguments as the synchronous version — system=, attachments=, tools=, schema=, key=, options=, hide_reasoning=, etc.
Tool functions passed to tools= can be either regular functions or async def coroutines:
    import asyncio, llm

    async def hello(name: str) -> str:
        "Say hello to name"
        return "Hello there " + name

    # Works in a sync context too: LLM wraps it in asyncio.run() automatically
    model = llm.get_model("gpt-4.1-mini")
    chain_response = model.chain("Say hello to Percival", tools=[hello])
    print(chain_response.text())
When an async def tool is used in a synchronous context, LLM executes it via asyncio.run() in a thread pool. When used in an async context, it runs natively as a coroutine. Either way, the same function works in both contexts.
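To illustrate the general idea of running a coroutine from synchronous code, here is a rough sketch using only the standard library. It is not LLM's actual implementation; call_tool_from_sync is a hypothetical helper showing one way asyncio.run() and a thread pool can be combined.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def hello(name: str) -> str:
    "Say hello to name"
    return "Hello there " + name


def call_tool_from_sync(tool, *args):
    """Run an async tool to completion from synchronous code."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # asyncio.run() starts a fresh event loop inside the worker thread,
        # so it does not conflict with any loop running in the caller.
        return pool.submit(asyncio.run, tool(*args)).result()


print(call_tool_from_sync(hello, "Percival"))
```

Running the coroutine in a separate thread matters because asyncio.run() refuses to start when the calling thread already has a running event loop.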
In an async context, synchronous tool implementations block the event loop for their entire duration. Only use synchronous tools with async models if you are certain they are extremely fast (microsecond-scale computations, not I/O).
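The effect can be seen without calling a model at all. The sketch below uses only the standard library: slow_lookup is a hypothetical stand-in for blocking work, and lookup wraps it in an async def tool that hands the work to a thread via asyncio.to_thread() (Python 3.9 or later). A background heartbeat task counts how often the event loop gets to run in each case.

```python
import asyncio
import time


def slow_lookup(key: str) -> str:
    """Stand-in for blocking work such as a synchronous HTTP request."""
    time.sleep(0.2)
    return key.upper()


async def lookup(key: str) -> str:
    """Async tool wrapper: runs the blocking call in a worker thread."""
    return await asyncio.to_thread(slow_lookup, key)


async def count_ticks_during(make_call) -> int:
    """Count how often a background task runs while make_call() executes."""
    ticks = 0

    async def heartbeat():
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(heartbeat())
    await make_call()
    task.cancel()
    return ticks


async def call_blocking():
    slow_lookup("pelican")  # holds the event loop for the whole call


async def call_threaded():
    await lookup("pelican")  # the event loop stays free while the thread works


print("ticks while blocking:", asyncio.run(count_ticks_during(call_blocking)))
print("ticks while threaded:", asyncio.run(count_ticks_during(call_threaded)))
```

The blocking variant starves the heartbeat entirely, while the threaded variant lets it keep ticking. Passing lookup rather than slow_lookup to tools= would apply the same pattern to a real tool.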
    import asyncio, llm

    def upper(string: str) -> str:
        "Converts string to uppercase"
        return string.upper()

    async def run():
        model = llm.get_async_model("gpt-4.1")
        chain = model.chain(
            "Convert panda to uppercase then pelican to uppercase",
            tools=[upper],
            after_call=print,
        )
        print(await chain.text())

    asyncio.run(run())
Stream the chain output as it is generated:
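    import asyncio, llm

    def upper(string: str) -> str:
        "Converts string to uppercase"
        return string.upper()

    async def run():
        model = llm.get_async_model("gpt-4.1")
        async for chunk in model.chain(
            "Convert panda to uppercase then pelican to uppercase",
            tools=[upper],
        ):
            print(chunk, end="", flush=True)

    asyncio.run(run())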
model.conversation() on an async model returns an AsyncConversation. Call conversation.prompt() or conversation.chain() on it. Neither call blocks: each returns a response object immediately, and you await its .text() method (or iterate over it with async for) to get the output:
    import asyncio, llm

    async def run():
        model = llm.get_async_model("gpt-4o-mini")
        conversation = model.conversation()
        r1 = await conversation.prompt("Five fun facts about pelicans").text()
        print(r1)
        r2 = await conversation.prompt("Now do skunks").text()
        print(r2)

    asyncio.run(run())
Pass tools= to model.conversation() to make the same tools available on every turn:

    import asyncio, llm

    async def search_web(query: str) -> str:
        "Search the web for information"
        # Replace with a real async HTTP call in production
        await asyncio.sleep(0.01)
        return f"Results for: {query}"

    async def run():
        model = llm.get_async_model("gpt-4.1-mini")
        conversation = model.conversation(tools=[search_web])
        result1 = await conversation.chain("Find recent news about pelicans").text()
        print(result1)
        result2 = await conversation.chain("What about flamingos?").text()
        print(result2)

    asyncio.run(run())
await response.on_done(callback) queues a function to run as soon as all tokens have been received. The callback receives the completed response and can be either sync or async:
    import asyncio, llm

    async def run():
        model = llm.get_async_model("gpt-4o-mini")
        response = model.prompt("a short poem about a brick")

        async def done(response):
            print("Usage:", await response.usage())

        await response.on_done(done)   # registers the callback
        print(await response.text())   # drives the stream; callback fires at end

    asyncio.run(run())
on_done is useful for token accounting, logging, or triggering downstream work the moment a response finishes — without polling or restructuring your streaming loop.
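For token accounting in particular, one possible shape is a small accumulator whose method is registered as the callback. TokenTotals is a hypothetical class, not part of LLM; it assumes response.usage() is awaitable and returns an object with input and output counts, as in the example above, and that either count may be missing.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class TokenTotals:
    """Running totals of input and output tokens across responses."""

    input: int = 0
    output: int = 0

    async def record(self, response):
        usage = await response.usage()
        # Treat a missing count as zero rather than failing.
        self.input += usage.input or 0
        self.output += usage.output or 0


async def main():
    # Imported lazily so TokenTotals can be used without LLM installed.
    import llm

    totals = TokenTotals()
    model = llm.get_async_model("gpt-4o-mini")
    for prompt in ["a short poem about a brick", "and one about a pelican"]:
        response = model.prompt(prompt)
        await response.on_done(totals.record)
        print(await response.text())
    print(f"Total: input={totals.input}, output={totals.output}")


# To try it against a real model: asyncio.run(main())
```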
To list every available async model, iterate over llm.get_async_models():

    import llm

    for model in llm.get_async_models():
        print(model.model_id)
Use llm.get_models_with_aliases() to see both sync and async variants together — each ModelWithAliases entry has both .model and .async_model attributes (either may be None).
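A short sketch of how those two attributes might be used to report which models have async support. async_support_table is a hypothetical helper; it assumes each entry exposes .model and .async_model as described above and that every model object has a model_id.

```python
def async_support_table():
    """Return (model_id, has_sync, has_async) for every registered model."""
    import llm  # imported lazily; assumes the llm package is installed

    rows = []
    for entry in llm.get_models_with_aliases():
        # Either side may be None, so take the ID from whichever exists.
        model = entry.model or entry.async_model
        rows.append(
            (model.model_id, entry.model is not None, entry.async_model is not None)
        )
    return rows
```

Printing the rows gives a quick view of which model IDs can be passed to llm.get_async_model().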
The following example brings together async prompting, tool use, streaming, and token tracking in a single script:
    import asyncio
    import llm

    def celsius_to_fahrenheit(celsius: float) -> float:
        """Convert a temperature from Celsius to Fahrenheit."""
        return celsius * 9 / 5 + 32

    async def run():
        model = llm.get_async_model("gpt-4.1-mini")
        conversation = model.conversation(tools=[celsius_to_fahrenheit])

        # Turn 1: let the model decide to call the tool
        print("=== Turn 1 ===")
        async for chunk in conversation.chain("What is 100°C in Fahrenheit?"):
            print(chunk, end="", flush=True)
        print()

        # Turn 2: follow-up, model still has context
        print("\n=== Turn 2 ===")
        r2 = await conversation.chain("And -40°C?").text()
        print(r2)

        # Show token usage for each turn
        print("\n=== Usage ===")
        for i, response in enumerate(conversation.responses, 1):
            usage = await response.usage()
            print(f"Turn {i}: input={usage.input}, output={usage.output}")

    asyncio.run(run())