ik_llama.cpp supports OpenAI-style function calling through Jinja chat templates. When the server is started with --jinja, it parses tool definitions from incoming requests, formats them according to the model’s chat template, and extracts tool calls from the model’s response.
You must start llama-server with the --jinja flag to enable function calling. Without it, tool definitions in API requests are ignored.

Native and generic handlers

Function calling works with all models, but quality depends on which format handler is selected. Native handlers parse tool calls with model-specific logic and produce the most reliable results. The following model families have native support:
  • Llama 3.1 / 3.2 / 3.3 (including built-in tools: wolfram_alpha, web_search / brave_search, code_interpreter)
  • Qwen 2.5 and Qwen 2.5 Coder
  • Hermes 2 and Hermes 3
  • Mistral Nemo
  • Firefunction v2
  • Command R7B
  • Functionary v3.1 / v3.2
  • DeepSeek R1 (WIP — the model is reluctant to call tools)
The generic handler is used when the chat template is not recognised by any native handler; you will see Chat format: Generic in the server logs. Generic mode works, but it may consume more tokens and be less efficient than a model's native format.

Starting the server

Native support (no template override needed)

llama-server --jinja -fa -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M
llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L
llama-server --jinja -fa -hf bartowski/Llama-3.3-70B-Instruct-GGUF:Q4_K_M
The official DeepSeek R1 chat template has known issues. Use the bundled override:
llama-server --jinja -fa \
  -hf bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q6_K_L \
  --chat-template-file models/templates/llama-cpp-deepseek-r1.jinja

llama-server --jinja -fa \
  -hf bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF:Q4_K_M \
  --chat-template-file models/templates/llama-cpp-deepseek-r1.jinja

Models that require a template override

Some GGUF files embed an incorrect or default (non-tool-use) template. Pass the correct template with --chat-template-file:
# Functionary v3.2
llama-server --jinja -fa \
  -hf bartowski/functionary-small-v3.2-GGUF:Q4_K_M \
  --chat-template-file models/templates/meetkai-functionary-medium-v3.2.jinja

# Hermes 2 Pro
llama-server --jinja -fa \
  -hf bartowski/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M \
  --chat-template-file models/templates/NousResearch-Hermes-2-Pro-Llama-3-8B-tool_use.jinja

# Hermes 3
llama-server --jinja -fa \
  -hf bartowski/Hermes-3-Llama-3.1-8B-GGUF:Q4_K_M \
  --chat-template-file models/templates/NousResearch-Hermes-3-Llama-3.1-8B-tool_use.jinja

# Firefunction v2
llama-server --jinja -fa \
  -hf bartowski/firefunction-v2-GGUF \
  -hff firefunction-v2-IQ1_M.gguf \
  --chat-template-file models/templates/fireworks-ai-llama-3-firefunction-v2.jinja

# Command R7B
llama-server --jinja -fa \
  -hf bartowski/c4ai-command-r7b-12-2024-GGUF:Q6_K_L \
  --chat-template-file models/templates/CohereForAI-c4ai-command-r7b-12-2024-tool_use.jinja

Generic format models

These work out of the box with --jinja, using the generic handler:
llama-server --jinja -fa -hf bartowski/phi-4-GGUF:Q4_0
llama-server --jinja -fa -hf bartowski/gemma-2-2b-it-GGUF:Q8_0
llama-server --jinja -fa -hf bartowski/c4ai-command-r-v01-GGUF:Q2_K

Chat template override

If a model’s embedded template is buggy or missing tool-use support, download the correct .jinja file and pass it with --chat-template-file. This avoids re-downloading the full GGUF:
llama-server --jinja -fa \
  --model /models/my-model.gguf \
  --chat-template-file /models/templates/my-model-tool_use.jinja
To retrieve the official template from a HuggingFace repository:
python scripts/get_chat_template.py <model-repo>
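Conceptually, the script pulls the chat_template field out of the repository's tokenizer_config.json, which HuggingFace stores either as a single string or as a list of named variants (e.g. "default" and "tool_use"). A minimal sketch of that extraction step, assuming an already-downloaded config (the helper name is illustrative):

```python
import json

def extract_chat_template(tokenizer_config, variant="tool_use"):
    """Pull a chat template out of a parsed tokenizer_config.json.

    The chat_template field is either a single template string or a
    list of {"name": ..., "template": ...} entries when a model ships
    several variants.
    """
    tmpl = tokenizer_config.get("chat_template")
    if isinstance(tmpl, str):
        return tmpl  # single template, no variants
    if isinstance(tmpl, list):
        for entry in tmpl:
            if entry.get("name") == variant:
                return entry["template"]
    return None

# Example: a config that ships both a default and a tool_use variant
config = {
    "chat_template": [
        {"name": "default", "template": "{{ messages }}"},
        {"name": "tool_use", "template": "{{ tools }}{{ messages }}"},
    ]
}
print(extract_chat_template(config))  # picks the tool_use variant
```

Save the extracted template to a .jinja file and pass it with --chat-template-file as shown above.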
If no official tool_use template exists for your model, try --chat-template chatml. It works with many models as a fallback, though results vary.

Making a tool call request

Use the standard OpenAI /v1/chat/completions endpoint with a tools array:
curl http://localhost:8080/v1/chat/completions -d '{
  "model": "gpt-3.5-turbo",
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and country/state, e.g. San Francisco, CA"
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant that uses tools."
    },
    {
      "role": "user",
      "content": "What is the weather in Istanbul?"
    }
  ]
}'
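The same request can be assembled programmatically. A minimal sketch using only the Python standard library (the final POST is commented out because it needs a running server; the model name is arbitrary, since llama-server serves whatever model it was started with):

```python
import json
import urllib.request

payload = {
    "model": "gpt-3.5-turbo",  # name is not used to select a model
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and country/state, e.g. San Francisco, CA",
                    }
                },
                "required": ["location"],
            },
        },
    }],
    "messages": [
        {"role": "system", "content": "You are a helpful assistant that uses tools."},
        {"role": "user", "content": "What is the weather in Istanbul?"},
    ],
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = json.load(urllib.request.urlopen(req))  # requires a running server
```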
A successful response looks like:
{
  "choices": [
    {
      "finish_reason": "tool",
      "index": 0,
      "message": {
        "content": null,
        "tool_calls": [
          {
            "name": "get_current_weather",
            "arguments": "{\"location\": \"Istanbul, Turkey\"}"
          }
        ],
        "role": "assistant"
      }
    }
  ],
  "model": "gpt-3.5-turbo",
  "object": "chat.completion"
}
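Once a response arrives, the client decodes each entry's arguments string and dispatches to the matching function. A minimal sketch against the example response above (the weather function is a placeholder, not a real tool):

```python
import json

# Response shape follows the example above: each tool_calls entry
# carries a function name and a JSON-encoded argument string.
response = {
    "choices": [{
        "finish_reason": "tool",
        "message": {
            "content": None,
            "tool_calls": [
                {"name": "get_current_weather",
                 "arguments": "{\"location\": \"Istanbul, Turkey\"}"},
            ],
            "role": "assistant",
        },
    }],
}

def get_current_weather(location):
    # Placeholder implementation; a real tool would query a weather API.
    return f"Sunny in {location}"

TOOLS = {"get_current_weather": get_current_weather}

msg = response["choices"][0]["message"]
for call in msg.get("tool_calls") or []:
    args = json.loads(call["arguments"])  # arguments arrive as a JSON string
    result = TOOLS[call["name"]](**args)
    print(result)  # → Sunny in Istanbul, Turkey
```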

KV cache and tool calling quality

Extreme KV cache quantizations (e.g. -ctk q4_0) can substantially degrade tool calling performance. Use -ctk q8_0 or -ctk q6_0 when running function-calling workloads. For very aggressive quantization below Q6_0, add --k-cache-hadamard to partially recover quality.
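Putting those recommendations together, a function-calling launch with a q8_0 K cache might look like (model choice illustrative):

```shell
llama-server --jinja -fa \
  -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M \
  -ctk q8_0
```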

Verifying the active template

To confirm which template and format handler the server selected, check the /props endpoint after startup:
curl http://localhost:8080/props | jq '.chat_template, .chat_template_tool_use'
The server logs also print the detected chat format (e.g. Chat format: Hermes 2 Pro or Chat format: Generic) at startup.
