Overview
The completion API supports:
Multi-turn conversations with chat templates
Tool calling (function calling)
Streaming token callbacks
Vision-language models (images in messages)
Retrieval-augmented generation (RAG)
Cloud handoff for low-confidence responses
cactus_complete
Generate chat completion.
int cactus_complete(
    cactus_model_t model,
    const char *messages_json,
    char *response_buffer,
    size_t buffer_size,
    const char *options_json,
    const char *tools_json,
    cactus_token_callback callback,
    void *user_data
);
model: Model handle from cactus_init
messages_json: JSON array of message objects (see format below)
response_buffer: Buffer to write the JSON response into
buffer_size: Size of the response buffer in bytes
options_json: Optional JSON object with generation parameters
tools_json: Optional JSON array of tool definitions
callback: Optional streaming callback: void callback(const char *token, uint32_t token_id, void *user_data)
user_data: Optional pointer passed through to the callback
Returns: Number of bytes written to response_buffer on success, -1 on error
[
  {
    "role": "system",
    "content": "You are a helpful assistant."
  },
  {
    "role": "user",
    "content": "What is the capital of France?"
  },
  {
    "role": "assistant",
    "content": "The capital of France is Paris."
  },
  {
    "role": "user",
    "content": "What is its population?",
    "images": ["file:///path/to/map.jpg"]
  }
]
role: Message role (system, user, assistant, or tool)
name: Speaker name or tool name
images: Image file paths or URLs (VLM models only)
Options JSON
{
  "temperature": 0.7,
  "top_p": 0.95,
  "top_k": 40,
  "max_tokens": 2048,
  "stop": ["\n\n", "User:"],
  "include_stop_sequences": false,
  "force_tools": false,
  "tool_rag_top_k": 5,
  "confidence_threshold": 0.5,
  "cloud_handoff_threshold": 0.0
}
temperature: Sampling temperature (0.0 = greedy, higher = more random)
top_p: Nucleus sampling threshold
top_k: Top-k sampling cutoff
max_tokens: Maximum tokens to generate
stop: Stop sequences that end generation
include_stop_sequences: Include the stop sequence in the output
force_tools: Constrain output to valid tool calls
tool_rag_top_k: Number of RAG documents to retrieve
confidence_threshold: Minimum confidence for accepting a response
cloud_handoff_threshold: Entropy threshold that triggers cloud handoff (0.0 = disabled)
Tools JSON
[
  {
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {
          "type": "string",
          "description": "City name"
        },
        "units": {
          "type": "string",
          "enum": ["celsius", "fahrenheit"]
        }
      },
      "required": ["location"]
    }
  }
]
Success Response
{
  "success": true,
  "error": null,
  "text": "Paris has a population of approximately 2.1 million.",
  "stop_reason": "stop",
  "tool_calls": [],
  "time_to_first_token_ms": 45.2,
  "total_time_ms": 523.1,
  "prefill_tokens_per_second": 890.3,
  "decode_tokens_per_second": 42.7,
  "prompt_tokens": 128,
  "completion_tokens": 15,
  "confidence": 0.94,
  "cloud_handoff": false,
  "ram_usage_mb": 412.5
}
success: Whether generation succeeded
stop_reason: Why generation stopped: stop, length, or tool_call
tool_calls: Parsed tool invocations (if applicable)
time_to_first_token_ms: Latency to the first generated token
prefill_tokens_per_second: Prompt processing throughput
decode_tokens_per_second: Token generation throughput
completion_tokens: Number of generated tokens
confidence: Average confidence score (0.0-1.0)
cloud_handoff: Whether the response should be retried in the cloud
Error Response
{
  "success": false,
  "error": "Model not initialized",
  "text": "",
  "stop_reason": "error"
}
Example: Basic Completion
#include "cactus_ffi.h"
#include <stdio.h>

int main(void) {
    cactus_model_t model = cactus_init("/path/to/model", NULL, false);

    const char *messages = "["
        "{\"role\":\"user\",\"content\":\"Hello!\"}"
        "]";

    char response[8192];
    int result = cactus_complete(
        model,
        messages,
        response,
        sizeof(response),
        NULL,   // default options
        NULL,   // no tools
        NULL,   // no streaming
        NULL
    );

    if (result > 0) {
        printf("%s\n", response);
    }

    cactus_destroy(model);
    return 0;
}
Example: Streaming
#include "cactus_ffi.h"
#include <stdio.h>
#include <stdint.h>

void token_handler(const char *token, uint32_t token_id, void *user_data) {
    printf("%s", token);
    fflush(stdout);
}

int main(void) {
    cactus_model_t model = cactus_init("/path/to/model", NULL, false);

    const char *messages = "[{\"role\":\"user\",\"content\":\"Tell a story.\"}]";
    const char *options = "{\"temperature\":0.8,\"max_tokens\":500}";

    char response[8192];
    cactus_complete(
        model,
        messages,
        response,
        sizeof(response),
        options,
        NULL,
        token_handler,
        NULL
    );

    cactus_destroy(model);
    return 0;
}
Example: Tool Calling
const char *tools = "["
    "{"
    "\"name\":\"get_weather\","
    "\"description\":\"Get weather for a city\","
    "\"parameters\":{"
    "\"type\":\"object\","
    "\"properties\":{"
    "\"location\":{\"type\":\"string\"}"
    "},"
    "\"required\":[\"location\"]"
    "}"
    "}"
    "]";

const char *messages = "[{\"role\":\"user\",\"content\":\"What's the weather in Tokyo?\"}]";
const char *options = "{\"force_tools\":true}";

char response[8192];
cactus_complete(model, messages, response, sizeof(response), options, tools, NULL, NULL);
// Response will contain a tool_calls array
See Also
C FFI Complete FFI reference
Python SDK Python completion API
Chat Guide Building chat applications
Tool Calling Function calling guide