oMLX delegates tool call detection to mlx-lm’s built-in TokenizerWrapper tool parser system, which automatically selects the correct parsing strategy based on the model’s chat template. This means you get native tool calling for every supported model family without any manual configuration: pass a tools array in your request and oMLX handles encoding, streaming suppression, and structured response assembly.
Supported Model Families
| Model Family | Tool Call Format |
|---|---|
| Llama, Qwen, DeepSeek, and most others | JSON <tool_call> |
| Qwen3.5 Series | XML <function=...> |
| Gemma | <start_function_call> / <|tool_call|> |
| GLM (4.7, 5) | <arg_key>/<arg_value> XML |
| MiniMax | Namespaced <minimax:tool_call> |
| Mistral / Devstral | [TOOL_CALLS] (one-sided marker) |
| Kimi K2 | <|tool_calls_section_begin|> |
| Longcat | <longcat_tool_call> |
For model families not listed above, tool calling still works when the chat template accepts tools and the model’s output uses a recognized <tool_call> XML format: oMLX falls back to a generic XML parser automatically.
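The generic fallback can be illustrated with a minimal sketch: scan the output for `<tool_call>` envelopes and decode the JSON payload inside each. The function name and regex below are illustrative, not oMLX’s actual implementation.

```python
import json
import re

# Hypothetical sketch of a generic <tool_call> parser; oMLX's real
# fallback lives inside mlx-lm's tokenizer wrapper and may differ.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_generic_tool_calls(text):
    """Extract JSON payloads wrapped in <tool_call>...</tool_call> tags."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed payloads rather than failing the turn
        if "name" in payload:
            calls.append(payload)
    return calls

output = 'Checking. <tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
print(parse_generic_tool_calls(output))
```

Prose before the envelope is untouched; only well-formed JSON payloads carrying a `name` field are returned.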
Tool calling requires the model’s chat template to accept a tools parameter. If the template does not support it, the tools array is silently ignored and the model generates plain text.
How Parsing Works
Template encoding
The tools array is passed to tokenizer.apply_chat_template() in the OpenAI format. mlx-lm’s tokenizer wrapper encodes the tool definitions according to the model’s native format.
Stream suppression
During streaming, ToolCallStreamFilter detects known tool-call opening envelopes (e.g. <tool_call>, [TOOL_CALLS], namespaced tags) and suppresses them from the streamed content deltas. Prose text before the tool call is emitted normally.
Turn completion and parsing
After the model finishes generating (EOS token), parse_tool_calls() runs the full parser on the complete output. Parsed calls are returned as structured ToolCall objects in the tool_calls field of the response, matching the OpenAI response schema exactly.
Example Request
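A request with a tools array follows the OpenAI chat completions format. The model name below is a placeholder; substitute whichever model you have loaded.

```python
import json

# OpenAI-format chat completions request with a tools array.
# The model name is a placeholder, not an oMLX default.
request = {
    "model": "my-local-model",
    "messages": [
        {"role": "user", "content": "What is the weather in Paris?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}
print(json.dumps(request, indent=2))
```

oMLX encodes the tools array via the model’s chat template and, if the model emits a tool call, returns it in the response’s tool_calls field.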
Streaming with Tool Calls
When stream: true, oMLX emits assistant prose tokens incrementally as they are generated. Tool-call control markup (the model-specific delimiters) is suppressed from the stream. Once the turn is complete, the structured tool_calls array is emitted in the final delta chunk with finish_reason: "tool_calls".
This matches the behavior of the OpenAI streaming API: your client receives text as it is generated, then receives the full parsed tool call at the end of the turn.
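Client-side, the pattern is to append content deltas as they arrive and read tool_calls from the final chunk. The chunk dicts below are mocked to show the shape; a real client would iterate an SSE stream instead.

```python
def consume_stream(chunks):
    """Accumulate streamed deltas: prose arrives incrementally,
    tool_calls arrive only in the final chunk."""
    text, tool_calls, finish_reason = "", None, None
    for chunk in chunks:
        choice = chunk["choices"][0]
        delta = choice.get("delta", {})
        text += delta.get("content") or ""
        if "tool_calls" in delta:
            tool_calls = delta["tool_calls"]
        if choice.get("finish_reason"):
            finish_reason = choice["finish_reason"]
    return text, tool_calls, finish_reason

# Mocked chunks illustrating the streaming shape described above.
chunks = [
    {"choices": [{"delta": {"content": "Checking the weather. "}}]},
    {"choices": [{"delta": {"tool_calls": [{"id": "call_0", "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}}]},
        "finish_reason": "tool_calls"}]},
]
text, calls, reason = consume_stream(chunks)
print(reason)  # tool_calls
```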
Structured Output (JSON Schema)
Pass response_format to request structured JSON output. With json_schema, the extracted JSON is validated against the provided schema. If validation fails, the error is surfaced in the response rather than silently returning malformed JSON.
| response_format.type | Behavior |
|---|---|
| text | Plain text output (default) |
| json_object | Extract and validate any JSON object from output |
| json_schema | Extract JSON and validate against provided schema |
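A json_schema request might look like the following; the schema name and fields are illustrative, and the payload mirrors the OpenAI response_format shape.

```python
import json

# response_format request with a JSON Schema (OpenAI-style shape).
# Schema name and fields are illustrative placeholders.
request = {
    "model": "my-local-model",
    "messages": [{"role": "user", "content": "Extract the city mentioned."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "city_extraction",
            "schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    },
}
print(json.dumps(request["response_format"], indent=2))
```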
Grammar-Constrained Decoding
For reliable structured output, oMLX supports grammar-constrained decoding via GrammarConstraintProcessor. This feature restricts the model’s sampling distribution at every token step so only tokens consistent with the target JSON schema can be generated, making schema violations structurally impossible rather than just unlikely.
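The core idea can be shown with a toy logits processor: at each step, mask every token the grammar does not currently allow to negative infinity before sampling. This is a conceptual sketch, not GrammarConstraintProcessor’s actual code.

```python
import math

def constrain_logits(logits, allowed_token_ids):
    """Mask disallowed tokens to -inf so they can never be sampled."""
    allowed = set(allowed_token_ids)
    return [x if i in allowed else -math.inf
            for i, x in enumerate(logits)]

def greedy_pick(logits):
    """Pick the highest-scoring token id."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy vocab of 4 tokens: suppose the grammar only permits ids 1 and 3
# at this step (e.g. only tokens that keep the JSON well-formed).
logits = [5.0, 1.0, 4.0, 2.0]
constrained = constrain_logits(logits, allowed_token_ids=[1, 3])
print(greedy_pick(constrained))  # 3: best token the grammar allows
```

Even though token 0 has the highest raw score, it can never be emitted; the constraint holds at every decoding step, so the final output is well-formed by construction.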
Reasoning Model Support
For reasoning models that emit <think>...</think> blocks, oMLX strips the thinking section before tool call parsing. If the model’s regular (non-thinking) output contains no tool calls but the thinking block does, oMLX promotes the thinking-block tool calls as the actual invocation, with two guards:
- If the model also produced non-empty regular text, the thinking tool call is discarded (it was reasoning, not a real invocation).
- The promoted tool call’s function name must match one of the provided tools definitions.
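The promotion rules above can be sketched as follows; the function names and the toy parser are hypothetical stand-ins for oMLX’s internals.

```python
import json
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def resolve_tool_calls(raw_output, tool_names, parse_calls):
    """Prefer regular-output tool calls; promote thinking-block calls
    only when both guards pass."""
    match = THINK_RE.search(raw_output)
    thinking = match.group(1) if match else ""
    regular = THINK_RE.sub("", raw_output)

    calls = parse_calls(regular)
    if calls:
        return calls
    # Guard 1: non-empty regular text means the thinking-block call
    # was reasoning, not a real invocation.
    if regular.strip():
        return []
    # Guard 2: a promoted call must name one of the provided tools.
    return [c for c in parse_calls(thinking) if c["name"] in tool_names]

def toy_parse(text):
    """Toy <tool_call> JSON extractor for the sketch."""
    return [json.loads(m) for m in
            re.findall(r"<tool_call>(\{.*?\})</tool_call>", text, re.DOTALL)]

out = '<think><tool_call>{"name": "get_weather", "arguments": {}}</tool_call></think>'
print(resolve_tool_calls(out, {"get_weather"}, toy_parse))
```

Here the regular output is empty and the call names a known tool, so the thinking-block call is promoted; had any prose followed the think block, it would have been discarded.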