LLMService is a thin, opinionated wrapper around LiteLLM that adds NorthStar tracing to every completion call with no boilerplate. Each method creates a MODEL span, records the input messages, captures the output message, and reports prompt and completion token counts along with the cost in USD, all automatically. You call llm.generate() the same way you would call litellm.completion().

Installation

LLMService depends on LiteLLM for token counting, cost lookups, and provider routing. Install the pricing extra:
uv add 'northstar-ai[pricing]'
LLMService requires an initialized NorthStar client: call northstar.init() or construct a Northstar client before calling any method. Each tracing span is created against the active NorthStar context, and calling an LLMService method without an initialized client raises a RuntimeError.

Constructor

from northstar.llm import LLMService

llm = LLMService(default_model="gpt-4o-mini")
default_model
str
The LiteLLM model string used when model is not passed to a generation method. Accepts any model identifier that LiteLLM supports, including provider-prefixed strings like "openrouter/deepseek/deepseek-v4-flash" or "anthropic/claude-3-5-sonnet-20241022". Defaults to "gpt-4o-mini".

Methods

generate() — Synchronous completion

Calls litellm.completion() synchronously and returns the full response object. A MODEL span is opened, input messages and the output message are recorded, and token usage is reported before the span closes.
response = llm.generate(
    messages=[{"role": "user", "content": "Summarize this document."}],
    tools=tool_schemas,
    temperature=0.2,
)
content = response.choices[0].message.content
messages
list[dict[str, Any]]
required
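The conversation history in OpenAI message format. Each entry has a "role" key ("system", "user", "assistant", or "tool") and normally a "content" key. In the OpenAI format, an assistant message that carries tool_calls may have null content, and a "tool" message also needs a tool_call_id.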
model
str | None
Override the model for this call. Falls back to default_model when None. Accepts any LiteLLM model string.
tools
list[dict[str, Any]] | None
Tool schemas in OpenAI function-calling format. When provided, tool_choice is also forwarded to the provider. When None, tool calling is disabled.
tool_choice
Any
Forwarded directly to LiteLLM when tools is provided. Ignored when tools is None. Defaults to "auto".
temperature
float
Sampling temperature. Lower values produce more deterministic outputs. Defaults to 0.3.
**kwargs
Any
Additional keyword arguments passed directly to litellm.completion(). Use this to pass max_tokens, top_p, stop, response_format, and any other provider-specific parameters.
Returns: A LiteLLM ModelResponse object (compatible with openai.ChatCompletion).
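The parameter rules above (model falls back to default_model, tool_choice travels only with tools, extra keyword arguments pass through) can be sketched as a small helper. This is a minimal illustration under stated assumptions, not LLMService's actual implementation: build_completion_kwargs is a hypothetical name, and the real wrapper may assemble its arguments differently.

```python
from typing import Any, Optional

DEFAULT_TEMPERATURE = 0.3  # the default documented for temperature


def build_completion_kwargs(
    messages: list[dict[str, Any]],
    default_model: str,
    model: Optional[str] = None,
    tools: Optional[list[dict[str, Any]]] = None,
    tool_choice: Any = "auto",
    temperature: float = DEFAULT_TEMPERATURE,
    **kwargs: Any,
) -> dict[str, Any]:
    """Hypothetical helper: assemble the arguments for litellm.completion()."""
    call_kwargs: dict[str, Any] = {
        # Fall back to the service-wide default when no override is given.
        "model": model if model is not None else default_model,
        "messages": messages,
        "temperature": temperature,
        **kwargs,  # e.g. max_tokens, top_p, stop, response_format
    }
    if tools is not None:
        # tool_choice is forwarded only together with tools.
        call_kwargs["tools"] = tools
        call_kwargs["tool_choice"] = tool_choice
    return call_kwargs


plain = build_completion_kwargs(
    [{"role": "user", "content": "Hi"}],
    default_model="gpt-4o-mini",
    tool_choice="required",  # dropped, because tools is None
    max_tokens=64,
)
print(sorted(plain))
```

In this sketch a tool_choice passed without tools is silently dropped, which mirrors the "Ignored when tools is None" rule in the parameter list.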

agenerate() — Async completion

Identical to generate(), but awaits litellm.acompletion(). Call it from inside an async def function.
response = await llm.agenerate(
    messages=[{"role": "user", "content": "Hello"}],
    model="gpt-4o",
)
messages
list[dict[str, Any]]
required
The conversation history in OpenAI message format.
model
str | None
Override the model for this call. Falls back to default_model when None.
tools
list[dict[str, Any]] | None
Tool schemas. When None, tool calling is disabled.
tool_choice
Any
Forwarded to LiteLLM when tools is provided. Defaults to "auto".
temperature
float
Sampling temperature. Defaults to 0.3.
**kwargs
Any
Additional keyword arguments forwarded to litellm.acompletion().
Returns: A LiteLLM ModelResponse object.
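One common reason to prefer agenerate() is issuing several completions concurrently. The sketch below shows the pattern with asyncio.gather. FakeLLM is a stand-in that only mimics the agenerate() call shape so the example runs without network access or API keys; how NorthStar spans behave across concurrent tasks is not covered by this page and is not modeled here.

```python
import asyncio
from types import SimpleNamespace
from typing import Any


class FakeLLM:
    """Stand-in with the same agenerate() call shape; not the real LLMService."""

    async def agenerate(self, messages: list[dict[str, Any]], **kwargs: Any) -> Any:
        await asyncio.sleep(0)  # yield to the event loop, as a network call would
        text = "echo: " + messages[-1]["content"]
        message = SimpleNamespace(content=text)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


async def ask_all(llm: Any, questions: list[str]) -> list[str]:
    """Issue one agenerate() call per question and await them together."""
    tasks = [
        llm.agenerate(messages=[{"role": "user", "content": q}])
        for q in questions
    ]
    responses = await asyncio.gather(*tasks)  # results keep the input order
    return [r.choices[0].message.content for r in responses]


answers = asyncio.run(ask_all(FakeLLM(), ["Hello", "Goodbye"]))
print(answers)
```

With a real LLMService instance, ask_all(llm, questions) would be called the same way, assuming the client has been initialized first.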

stream() — Synchronous streaming generator

Calls litellm.completion() with stream=True and yields each chunk as it arrives. Input messages are recorded before streaming begins. Token usage is captured from the final usage chunk (LiteLLM’s stream_options={"include_usage": True} is set automatically). The full aggregated content is recorded as the output message after the generator is exhausted.
for chunk in llm.stream(
    messages=[{"role": "user", "content": "Tell me a story"}]
):
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
messages
list[dict[str, Any]]
required
The conversation history in OpenAI message format.
model
str | None
Override the model for this call. Falls back to default_model when None.
tools
list[dict[str, Any]] | None
Tool schemas. When None, tool calling is disabled.
tool_choice
Any
Forwarded to LiteLLM when tools is provided. Defaults to "auto".
temperature
float
Sampling temperature. Defaults to 0.3.
**kwargs
Any
Additional keyword arguments forwarded to litellm.completion(). Note: stream_options is set automatically if not provided.
Yields: LiteLLM streaming chunk objects, each with a choices[0].delta attribute.
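The streaming behavior described above (chunks pass through unchanged, content is aggregated, usage is read from a late chunk, and the output is recorded after exhaustion) can be sketched as a pass-through generator. This is an illustration of the pattern only: traced_stream, the record dict, and the fake chunks are hypothetical and do not reflect LLMService internals. The check for an empty choices list is a defensive assumption, since some providers send the usage chunk without choices.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Iterator


def traced_stream(chunks: Iterable[Any], record: dict[str, Any]) -> Iterator[Any]:
    """Yield chunks unchanged while collecting content and usage into record."""
    parts: list[str] = []
    for chunk in chunks:
        usage = getattr(chunk, "usage", None)
        if usage is not None:
            record["usage"] = usage  # typically present on the last chunk only
        if chunk.choices:  # a usage-only chunk may carry no choices
            delta = chunk.choices[0].delta.content
            if delta:
                parts.append(delta)
        yield chunk
    # Reached only when the consumer drains the generator completely.
    record["output"] = "".join(parts)


def fake_chunk(text: Any = None, usage: Any = None, with_choice: bool = True) -> Any:
    """Build a chunk-shaped object for the demo."""
    delta = SimpleNamespace(content=text)
    choices = [SimpleNamespace(delta=delta)] if with_choice else []
    return SimpleNamespace(choices=choices, usage=usage)


fake_chunks = [
    fake_chunk("Once "),
    fake_chunk("upon a time"),
    fake_chunk(
        usage=SimpleNamespace(prompt_tokens=5, completion_tokens=4),
        with_choice=False,
    ),
]

record: dict[str, Any] = {}
for chunk in traced_stream(fake_chunks, record):
    pass  # a real consumer would print or forward each chunk here
print(record["output"])
```

In this sketch, breaking out of the loop early means the final recording step never runs, so draining the generator is the safe habit if you rely on the recorded output.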

astream() — Async streaming generator

Identical to stream() but uses litellm.acompletion() with stream=True and async for. Input messages are recorded before streaming, usage is captured from the final chunk, and the full content is recorded after the async generator is exhausted.
async for chunk in llm.astream(
    messages=[{"role": "user", "content": "Tell me a story"}]
):
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
messages
list[dict[str, Any]]
required
The conversation history in OpenAI message format.
model
str | None
Override the model for this call. Falls back to default_model when None.
tools
list[dict[str, Any]] | None
Tool schemas. When None, tool calling is disabled.
tool_choice
Any
Forwarded to LiteLLM when tools is provided. Defaults to "auto".
temperature
float
Sampling temperature. Defaults to 0.3.
**kwargs
Any
Additional keyword arguments forwarded to litellm.acompletion(). stream_options is set automatically if not provided.
Yields: LiteLLM async streaming chunk objects.
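On the consumer side, an async stream is often collected into a single string. The helper below is a hedged sketch: collect_text and fake_astream are hypothetical names, and the fake generator only imitates the chunk shape (choices[0].delta.content) so the example runs offline. It skips chunks with an empty choices list, which is an assumption about how a usage-only chunk may look.

```python
import asyncio
from types import SimpleNamespace
from typing import Any, AsyncIterator


async def collect_text(chunks: AsyncIterator[Any]) -> str:
    """Drain an async chunk stream and join the text deltas."""
    parts = []
    async for chunk in chunks:
        if not chunk.choices:  # e.g. a trailing usage-only chunk
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # role-only or final chunks may have no content
            parts.append(delta)
    return "".join(parts)


async def fake_astream() -> AsyncIterator[Any]:
    """Stand-in for llm.astream(...); yields chunk-shaped objects."""
    for text in ["Entangled ", "particles", None]:
        await asyncio.sleep(0)  # simulate waiting on the network
        delta = SimpleNamespace(content=text)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
    yield SimpleNamespace(choices=[])  # usage-only chunk without choices


story = asyncio.run(collect_text(fake_astream()))
print(story)
```

With the real service, the call would presumably look like await collect_text(llm.astream(messages=...)).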

Full usage example

import os
from northstar import Northstar, CaptureOptions
from northstar.llm import LLMService

client = Northstar(
    api_key=os.environ["NORTHSTAR_API_KEY"],
    project_id=os.environ["NORTHSTAR_PROJECT_ID"],
    capture=CaptureOptions(tool_arguments=True, tool_results=True, final_response=True),
)

llm = LLMService(default_model="gpt-4o-mini")

tool_schemas = [
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for a query.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }
]

with client.session() as session:
    with session.run("research-agent") as run:
        run.record_user_input("What is the capital of France?")

        response = llm.generate(
            messages=[
                {"role": "system", "content": "You are a helpful research assistant."},
                {"role": "user", "content": "What is the capital of France?"},
            ],
            tools=tool_schemas,
        )

        run.record_final_response(response.choices[0].message.content)
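The example above passes tools but reads only message.content. In the OpenAI message format, a model that decides to call a tool returns tool_calls on the assistant message and content may be None. The sketch below shows one way to execute those calls and build the follow-up "tool" messages. The search_web implementation, the TOOLS registry, and run_tool_calls are hypothetical, and the fake message only imitates the response shape so the example runs offline.

```python
import json
from types import SimpleNamespace
from typing import Any, Callable


def search_web(query: str) -> str:
    """Hypothetical local implementation of the search_web tool."""
    return f"results for {query!r}"


# Maps tool names from the schema to local callables (illustrative registry).
TOOLS: dict[str, Callable[..., str]] = {"search_web": search_web}


def run_tool_calls(message: Any) -> list[dict[str, Any]]:
    """Execute each requested tool and build the matching "tool" messages."""
    tool_messages = []
    for call in getattr(message, "tool_calls", None) or []:
        args = json.loads(call.function.arguments)  # arguments are a JSON string
        result = TOOLS[call.function.name](**args)
        tool_messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": result}
        )
    return tool_messages


# Fake assistant message shaped like an OpenAI-format tool call.
fake_message = SimpleNamespace(
    content=None,
    tool_calls=[
        SimpleNamespace(
            id="call_1",
            function=SimpleNamespace(
                name="search_web",
                arguments='{"query": "capital of France"}',
            ),
        )
    ],
)

tool_messages = run_tool_calls(fake_message)
print(tool_messages)
```

In a real loop, the assistant message and these tool messages would be appended to the conversation and sent through llm.generate() again until the model returns plain content.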

Streaming example

with client.session() as session:
    with session.run("story-agent") as run:
        full_text = ""
        for chunk in llm.stream(
            messages=[{"role": "user", "content": "Tell me a short story about a robot."}],
            model="gpt-4o",
            temperature=0.7,
        ):
            delta = chunk.choices[0].delta.content
            if delta:
                print(delta, end="", flush=True)
                full_text += delta

        run.record_final_response(full_text)

Async streaming example

import asyncio

async def run_agent():
    async with ... :  # your async session management
        async for chunk in llm.astream(
            messages=[{"role": "user", "content": "Explain quantum entanglement briefly."}],
        ):
            if chunk.choices[0].delta.content:
                print(chunk.choices[0].delta.content, end="", flush=True)

asyncio.run(run_agent())

What gets recorded automatically

Every LLMService method creates a MODEL span and records the following without any extra code:
| Recorded data | Source |
| --- | --- |
| MODEL span named "llm.generate" / "llm.agenerate" / "llm.stream" / "llm.astream" | api.model_call() |
| Input messages (per CaptureOptions) | span.record_input_messages() |
| Output message (per CaptureOptions) | span.record_output_message() |
| model, input_tokens, output_tokens, total_tokens | span.record_usage() |
| cost_usd in USD | NorthStar pricing module via LiteLLM pricing tables |
| Span status = ERROR + error dict | Automatic on any exception |
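Because LLMService calls span.record_usage() internally, you do not need to call model_call(), record_usage(), or record_input_messages() yourself. For generate() and agenerate(), the span is fully populated by the time the method returns; for stream() and astream(), the output message and token usage are complete once the generator is exhausted.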
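To make the recorded fields concrete, here is a toy model of a span that stores usage, derives a cost from per-token prices, and flips to ERROR when an exception escapes. Everything in it is an assumption for illustration: the dict layout, the "OK" status value, the shape of the error dict, and the prices are made up, and the real NorthStar span API and LiteLLM pricing tables are not used.

```python
from contextlib import contextmanager
from typing import Any, Iterator

# Hypothetical USD prices per one million tokens; not real pricing data.
PRICES_PER_MILLION = {"example-model": {"input": 1.0, "output": 2.0}}

SPANS: list[dict[str, Any]] = []  # stands in for the tracing backend


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost = tokens times price, summed over input and output."""
    price = PRICES_PER_MILLION[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


@contextmanager
def model_span(name: str) -> Iterator[dict[str, Any]]:
    """Toy MODEL span: marks ERROR on exceptions and is always stored."""
    span: dict[str, Any] = {"name": name, "kind": "MODEL", "status": "OK"}
    try:
        yield span
    except Exception as exc:
        span["status"] = "ERROR"
        span["error"] = {"type": type(exc).__name__, "message": str(exc)}
        raise  # the caller still sees the original exception
    finally:
        SPANS.append(span)


# Successful call: usage and cost are attached before the span closes.
with model_span("llm.generate") as span:
    span["usage"] = {
        "model": "example-model",
        "input_tokens": 1000,
        "output_tokens": 500,
        "total_tokens": 1500,
    }
    span["cost_usd"] = cost_usd("example-model", 1000, 500)

# Failing call: the span is stored with status ERROR and an error dict.
try:
    with model_span("llm.generate"):
        raise TimeoutError("provider timed out")
except TimeoutError:
    pass

print([s["status"] for s in SPANS])
```

The point of the sketch is the ordering: usage and cost are written inside the span, and the error path records the failure without swallowing the exception.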
