The Browser Barista Brain is a fully on-device AI assistant. LiquidAI’s LFM2.5-350M model is downloaded once to the browser cache and then executed locally over WebGPU using ONNX Runtime — no API keys, no cloud inference, and no data leaves the machine. The assistant is still connected to real coffee data: it calls the backend’s MCP HTTP endpoint to list the menu, inspect the queue, place orders, and advance ticket status.

No API keys required

The model runs locally inside the browser tab. The only network traffic after the initial model download is JSON-RPC to the backend at /mcp. There is no external AI provider involved.
The first run downloads approximately 350 MB of ONNX model assets from Hugging Face and caches them in the browser. Subsequent page loads skip the download. The DemoTranscript component adapts its submit label to reflect the cache state: "Download + chat" on a cold cache, "Send to local model" once the assets are fully cached.
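The cache check in lfm-cache.ts is not reproduced here, but the classification it feeds into the submit label can be sketched roughly as follows. The asset list and the isCached predicate signature are illustrative assumptions; the real module inspects the browser's cache storage for the Hugging Face assets.

```typescript
// Hypothetical sketch of cache-state classification, in the spirit of
// lfm-cache.ts. ASSET_PATHS and the isCached predicate are illustrative
// assumptions, not the real module's API.
type CacheStatus = "cold" | "partial" | "warm";

const ASSET_PATHS = [
  "onnx/model_q4.onnx",
  "onnx/model_q4.onnx_data",
  "tokenizer.json",
] as const;

async function detectCacheStatus(
  isCached: (path: string) => Promise<boolean>,
): Promise<CacheStatus> {
  // Probe every asset in parallel and count the hits.
  const hits = await Promise.all(ASSET_PATHS.map((path) => isCached(path)));
  const cachedCount = hits.filter(Boolean).length;
  if (cachedCount === 0) return "cold";
  if (cachedCount === ASSET_PATHS.length) return "warm";
  return "partial"; // e.g. an interrupted first download
}
```

A "partial" result still shows the download label, since missing assets must be re-fetched before generation can start.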

Model identity

The model is LiquidAI/LFM2.5-350M-ONNX, resolved in ui/src/features/assistant/lib/lfm-model.ts:
export const MODEL_ID = "LiquidAI/LFM2.5-350M-ONNX";
export const MODEL_BASE = `https://huggingface.co/${MODEL_ID}/resolve/main`;
export const MODEL_PATH = `${MODEL_BASE}/onnx/model_q4.onnx`;
export const MODEL_DATA_PATH = `${MODEL_BASE}/onnx/model_q4.onnx_data`;
The q4-quantised ONNX weights are loaded into an onnxruntime-web InferenceSession with the webgpu execution provider. Tokenisation is handled by @huggingface/transformers AutoTokenizer.

Feature structure

ui/src/features/assistant/
├── components/
│   ├── BrowserMcpLandingPage.tsx     # Stateful root — wires hook to view
│   ├── BrowserMcpLandingView.tsx     # Presentational layout (hero + transcript)
│   ├── DemoComposer.tsx              # Text input + submit button
│   ├── DemoStatusPanel.tsx           # Progress bar + phase label
│   ├── DemoTranscript.tsx            # Message list + composer
│   ├── ToolActivityCard.tsx          # Single tool-call/result card
│   └── ToolActivityDrawer.tsx        # Drawer listing all tool events
├── hooks/
│   ├── lfmAssistantSupport.ts        # State types, reducer, preset prompts
│   └── useLfmCoffeeAssistant.ts      # Main hook — orchestrates model + MCP
└── lib/
    ├── assistant-loop.ts             # processAssistantTurn() — the agentic loop
    ├── assistantTooling.ts           # Prompt formatting helpers
    ├── lfm-browser.ts                # LfmBrowserModel class (load + generate)
    ├── lfm-cache.ts                  # Browser cache status detection
    ├── lfm-generation.ts             # Token generation loop
    ├── lfm-model.ts                  # Model ID and asset URLs
    ├── mcp-client.ts                 # CoffeeMcpClient (JSON-RPC over HTTP)
    ├── ortTensor.ts                  # ORT tensor helpers
    └── tool-call-parser.ts           # Extracts tool calls from raw model output

Model loading: LfmBrowserModel

LfmBrowserModel (lib/lfm-browser.ts) manages the load lifecycle with a single loadPromise guard so concurrent callers share one download:
export class LfmBrowserModel {
  // Shared across callers: concurrent load() calls reuse the same promise.
  private loadPromise: Promise<void> | null = null;

  async load(onProgress?: (update: ModelProgressUpdate) => void): Promise<void> {
    if (this.isLoaded()) return;
    if (this.loadPromise !== null) return this.loadPromise;

    this.loadPromise = this.loadInternal(onProgress).finally(() => {
      this.loadPromise = null;
    });
    return this.loadPromise;
  }
}
The internal loader checks for WebGPU availability, dynamically imports @huggingface/transformers and onnxruntime-web/webgpu in parallel, initialises the tokenizer with progress callbacks, then opens an ONNX InferenceSession:
const session = await runtime.ort.InferenceSession.create(MODEL_PATH, {
  executionProviders: ["webgpu"],
  externalData: [{ data: MODEL_DATA_PATH, path: "model_q4.onnx_data" }],
});
If navigator.gpu is absent, load() rejects with the error message "WebGPU is unavailable in this browser."
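The guard itself is trivial; a minimal sketch (the function name is illustrative — the real check lives inline in the internal loader):

```typescript
// Sketch of the WebGPU availability guard described above.
// assertWebGpuAvailable is an illustrative name, not a real export.
function assertWebGpuAvailable(nav: { gpu?: unknown }): void {
  if (nav.gpu === undefined) {
    throw new Error("WebGPU is unavailable in this browser.");
  }
}

// In the browser this would run as assertWebGpuAvailable(navigator)
// before any onnxruntime-web session is created.
```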

MCP connection: CoffeeMcpClient

CoffeeMcpClient (lib/mcp-client.ts) is a minimal JSON-RPC 2.0 HTTP client that speaks the MCP protocol. It defaults to the /mcp path, which the Vite dev-server proxy forwards to the backend at http://localhost:3000/mcp.
export class CoffeeMcpClient {
  constructor(endpoint = "/mcp") { ... }

  async getPromptTools(): Promise<readonly PromptToolDefinition[]> { ... }

  async callTool(
    name: string,
    arguments_: Record<string, unknown>
  ): Promise<McpToolCallResult> { ... }
}
On first use, the client sends an MCP initialize handshake and negotiates the protocol version (2025-06-18). Tool definitions are cached in-memory after the first tools/list call. The session ID returned in the Mcp-Session-Id response header is attached to subsequent requests.
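The initialize envelope is plain JSON-RPC 2.0. A hedged sketch of what the client likely sends — only protocolVersion comes from the text above; the clientInfo values and builder name are illustrative:

```typescript
// Sketch of the MCP initialize request as a JSON-RPC 2.0 envelope.
// Everything except protocolVersion is an illustrative assumption.
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

function buildInitializeRequest(id: number): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "initialize",
    params: {
      protocolVersion: "2025-06-18",
      capabilities: {},
      clientInfo: { name: "coffee-mcp-client", version: "0.0.0" },
    },
  };
}
```

The response to this request carries the Mcp-Session-Id header mentioned above, which the client replays on every later call.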

The agentic loop: processAssistantTurn

processAssistantTurn (lib/assistant-loop.ts) drives a multi-step reasoning loop capped at three tool-use steps:
const MAX_TOOL_STEPS = 3;
const TOOL_TOKEN_LIMIT = 96;    // first generation (tool-call decision)
const ANSWER_TOKEN_LIMIT = 192; // subsequent generations (answer)
On each step the function generates tokens from the model and checks for a tool-call tag between <|tool_call_start|> and <|tool_call_end|>. If one is found, it executes the tool via CoffeeMcpClient.callTool() and appends the result as a user message before the next generation; when no tool call is detected (or the step cap is reached), the loop returns the final assistant text and the list of AssistantEvent records for the activity drawer. The system prompt identifies the assistant as Beanline and instructs it to always call list_menu before answering menu questions, and list_orders or get_order for queue questions:
function buildSystemPrompt(tools: readonly PromptToolDefinition[]): string {
  return [
    "You are Beanline, a browser-side coffee concierge for the Onion Coffee Shop.",
    "You do have live access to current Coffee Shop data through MCP tools.",
    "For menu questions, call list_menu before answering.",
    "For order status, queue, ready-ticket, or pickup questions, call list_orders or get_order before answering.",
    ...
  ].join("\n");
}
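tool-call-parser.ts is not reproduced here, but extracting a call from the tagged output reduces to string slicing plus JSON parsing. A sketch, under the assumption that the payload between the tags is a JSON object with name and arguments fields:

```typescript
// Illustrative sketch of tool-call extraction from raw model output.
// The payload shape ({ name, arguments }) is an assumption.
const TOOL_CALL_START = "<|tool_call_start|>";
const TOOL_CALL_END = "<|tool_call_end|>";

interface ParsedToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

function parseToolCall(raw: string): ParsedToolCall | null {
  const start = raw.indexOf(TOOL_CALL_START);
  const end = raw.indexOf(TOOL_CALL_END);
  if (start === -1 || end === -1 || end <= start) return null;
  const payload = raw.slice(start + TOOL_CALL_START.length, end).trim();
  try {
    const parsed = JSON.parse(payload);
    if (typeof parsed.name !== "string") return null;
    return { name: parsed.name, arguments: parsed.arguments ?? {} };
  } catch {
    return null; // malformed JSON between the tags: treat as plain text
  }
}
```

A null result is what lets the loop decide the model has produced a final answer rather than another tool request.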

State management: useLfmCoffeeAssistant

The useLfmCoffeeAssistant hook (hooks/useLfmCoffeeAssistant.ts) owns all assistant state through a reducer (reduceAssistantState) and an AssistantRuntime ref. The runtime ref holds the LfmBrowserModel, the CoffeeMcpClient, the current conversation history, and the cached tool definitions — none of which should trigger re-renders on mutation. Effect v4 is used to sequence async steps (load tools → load model → run turn) with structured error handling:
const submitAssistantPrompt = Effect.fn("LfmAssistant.submitAssistantPrompt")(
  function* (runtime, dispatch, prompt) {
    runtime.conversation = [...runtime.conversation, { content: prompt, role: "user" }];
    yield* dispatchMessage(dispatch, { type: "user-submitted", prompt });

    const tools = yield* ensureAssistantReady(runtime, dispatch);
    yield* dispatchMessage(dispatch, { type: "assistant-running" });

    const result = yield* Effect.tryPromise({
      try: async () => processAssistantTurn({ client, conversation, model, onDraft, tools }),
      catch: (cause) => asError(cause, "The local assistant could not complete that request."),
    });

    yield* dispatchMessage(dispatch, { type: "assistant-completed", ... });
  }
);
The hook exposes a stable public API:
return {
  assistantDraft,   // streaming token preview
  cacheStatus,      // "cold" | "partial" | "warm"
  errorMessage,
  events,           // tool-call + tool-result events for the activity drawer
  hasLoadedModel,
  input,
  isBusy,
  messages,         // rendered conversation
  prompts,          // three preset prompt buttons
  resetConversation,
  setInput,
  status,           // { phase, label, progress }
  submit,
  warmUp,
};

UI components

BrowserMcpLandingView

Top-level presentational layout. Two-column hero (copy + facts card) above the DemoTranscript. Includes a “Preload browser model” button that calls onWarmUp to download and cache model assets before the user’s first prompt.

DemoTranscript

Chat container. Renders DemoStatusPanel (progress bar), preset prompt buttons, a scrollable message list with TranscriptBubble entries for the user (“You”) and the assistant (“Beanline”), and DemoComposer at the bottom.

DemoStatusPanel

Displays the current DemoStatus phase (idle | loading | running | ready | error) with a progress bar and a text label such as "Thinking with local WebGPU" or "Model ready in this tab with N Coffee Shop tools."
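Putting the pieces together, the status value the panel consumes presumably looks like this sketch; the field names match the { phase, label, progress } shape from the hook's return value, but the readyStatus helper is an illustrative assumption:

```typescript
// Hypothetical shape of the DemoStatus value rendered by DemoStatusPanel.
// readyStatus is an illustrative helper, not a real export.
type DemoPhase = "idle" | "loading" | "running" | "ready" | "error";

interface DemoStatus {
  phase: DemoPhase;
  label: string;
  progress: number; // 0–1, drives the progress bar
}

function readyStatus(toolCount: number): DemoStatus {
  return {
    phase: "ready",
    label: `Model ready in this tab with ${toolCount} Coffee Shop tools.`,
    progress: 1,
  };
}
```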

ToolActivityDrawer

Opens a vaul drawer listing every AssistantEvent. Each event is rendered as a ToolActivityCard showing the tool name and formatted arguments or result. Allows the user to inspect exactly which MCP tools the model called and what they returned.

Preset prompts

Three starter prompts are defined in hooks/lfmAssistantSupport.ts and displayed as buttons in the transcript:
export const assistantPrompts = [
  "What drinks are on the menu right now?",
  "Place a medium oat latte for Maya with one extra shot.",
  "List open orders and tell me which tickets are ready to pick up.",
] as const;
Clicking a prompt bypasses the text input and submits directly to the agentic loop.
