Overview
The completion API supports:
Multi-turn conversations with chat templates
Tool calling (function calling)
Streaming token callbacks
Vision-language models (images in messages)
Retrieval-augmented generation (RAG)
Cloud handoff for low-confidence responses
cactus_complete
Generate chat completion.
int cactus_complete(
    cactus_model_t model,
    const char *messages_json,
    char *response_buffer,
    size_t buffer_size,
    const char *options_json,
    const char *tools_json,
    cactus_token_callback callback,
    void *user_data
);
model: Model handle from cactus_init
messages_json: JSON array of message objects (see format below)
response_buffer: Buffer to write the JSON response into
buffer_size: Size of the response buffer in bytes
options_json: Optional JSON object with generation parameters
tools_json: Optional JSON array of tool definitions
callback: Optional streaming callback: void callback(const char *token, uint32_t token_id, void *user_data)
user_data: Optional pointer passed through to the callback
Returns: Number of bytes written to response_buffer on success, -1 on error
[
  {
    "role": "system",
    "content": "You are a helpful assistant."
  },
  {
    "role": "user",
    "content": "What is the capital of France?"
  },
  {
    "role": "assistant",
    "content": "The capital of France is Paris."
  },
  {
    "role": "user",
    "content": "What is its population?",
    "images": ["file:///path/to/map.jpg"]
  }
]
role: Message role (system, user, assistant, or tool)
name: Speaker name or tool name
images: Image file paths or URLs (VLM models only)
Options JSON
{
  "temperature": 0.7,
  "top_p": 0.95,
  "top_k": 40,
  "max_tokens": 2048,
  "stop": ["\n\n", "User:"],
  "include_stop_sequences": false,
  "force_tools": false,
  "tool_rag_top_k": 5,
  "confidence_threshold": 0.5,
  "cloud_handoff_threshold": 0.0
}
temperature: Sampling temperature (0.0 = greedy, higher = more random)
top_p: Nucleus sampling threshold
top_k: Top-k sampling cutoff
max_tokens: Maximum tokens to generate
stop: Stop sequences that end generation
include_stop_sequences: Include the stop sequence in the output
force_tools: Constrain output to valid tool calls
tool_rag_top_k: Number of RAG documents to retrieve
confidence_threshold: Minimum confidence for accepting a response
cloud_handoff_threshold: Entropy threshold that triggers cloud handoff (0.0 = disabled)
Tools JSON
[
  {
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {
          "type": "string",
          "description": "City name"
        },
        "units": {
          "type": "string",
          "enum": ["celsius", "fahrenheit"]
        }
      },
      "required": ["location"]
    }
  }
]
Success Response
{
  "success": true,
  "error": null,
  "text": "Paris has a population of approximately 2.1 million.",
  "stop_reason": "stop",
  "tool_calls": [],
  "time_to_first_token_ms": 45.2,
  "total_time_ms": 523.1,
  "prefill_tokens_per_second": 890.3,
  "decode_tokens_per_second": 42.7,
  "prompt_tokens": 128,
  "completion_tokens": 15,
  "confidence": 0.94,
  "cloud_handoff": false,
  "ram_usage_mb": 412.5
}
success: Whether generation succeeded
stop_reason: Why generation stopped: stop, length, or tool_call
tool_calls: Parsed tool invocations (if applicable)
time_to_first_token_ms: Latency to the first generated token
prefill_tokens_per_second: Prompt processing throughput
decode_tokens_per_second: Token generation throughput
completion_tokens: Number of generated tokens
confidence: Average confidence score (0.0-1.0)
cloud_handoff: Whether the response should be retried in the cloud
Error Response
{
  "success": false,
  "error": "Model not initialized",
  "text": "",
  "stop_reason": "error"
}
Example: Basic Completion
#include "cactus_ffi.h"
#include <stdio.h>

int main(void) {
    cactus_model_t model = cactus_init("/path/to/model", NULL, false);

    const char *messages = "["
        "{\"role\":\"user\",\"content\":\"Hello!\"}"
        "]";

    char response[8192];
    int result = cactus_complete(
        model,
        messages,
        response,
        sizeof(response),
        NULL,   // default options
        NULL,   // no tools
        NULL,   // no streaming
        NULL
    );

    if (result > 0) {
        printf("%s\n", response);
    }

    cactus_destroy(model);
    return 0;
}
Example: Streaming
#include "cactus_ffi.h"
#include <stdio.h>
#include <stdint.h>

void token_handler(const char *token, uint32_t token_id, void *user_data) {
    printf("%s", token);
    fflush(stdout);
}

int main(void) {
    cactus_model_t model = cactus_init("/path/to/model", NULL, false);

    const char *messages = "[{\"role\":\"user\",\"content\":\"Tell a story.\"}]";
    const char *options = "{\"temperature\":0.8,\"max_tokens\":500}";

    char response[8192];
    cactus_complete(
        model,
        messages,
        response,
        sizeof(response),
        options,
        NULL,
        token_handler,
        NULL
    );

    cactus_destroy(model);
    return 0;
}
Example: Tool Calling
const char *tools = "["
    "{"
    "\"name\":\"get_weather\","
    "\"description\":\"Get weather for a city\","
    "\"parameters\":{"
    "\"type\":\"object\","
    "\"properties\":{"
    "\"location\":{\"type\":\"string\"}"
    "},"
    "\"required\":[\"location\"]"
    "}"
    "}"
    "]";

const char *messages = "[{\"role\":\"user\",\"content\":\"What's the weather in Tokyo?\"}]";
const char *options = "{\"force_tools\":true}";

char response[8192];
cactus_complete(model, messages, response, sizeof(response), options, tools, NULL, NULL);
// Response will contain a tool_calls array
See Also
C FFI Complete FFI reference
Python SDK Python completion API
Chat Guide Building chat applications
Tool Calling Function calling guide