Overview
The Cactus chat completion API enables you to build conversational AI applications with support for multi-turn conversations, streaming responses, tool calling, and automatic cloud fallback.
Basic Completion
C API
```c
#include <cactus.h>
#include <stdio.h>

cactus_model_t model = cactus_init("weights/lfm2-1.2b", NULL, false);

const char *messages = R"([
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What is 2+2?"}
])";

char response[4096];
int result = cactus_complete(
    model,
    messages,
    response,
    sizeof(response),
    NULL,  // options
    NULL,  // tools
    NULL,  // callback
    NULL   // user_data
);

if (result == 0) {
    printf("%s\n", response);
}

cactus_destroy(model);
```
Python SDK
```python
from cactus import cactus_init, cactus_complete, cactus_destroy
import json

model = cactus_init("weights/lfm2-1.2b", None, False)

messages = json.dumps([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"}
])

result = json.loads(cactus_complete(model, messages, None, None, None))
print(result["response"])
print(f"Time to first token: {result['time_to_first_token_ms']:.2f} ms")
print(f"Decode speed: {result['decode_tps']:.2f} tokens/sec")

cactus_destroy(model)
```
All completion responses return a JSON object:
```json
{
  "success": true,
  "error": null,
  "cloud_handoff": false,
  "response": "4",
  "function_calls": [],
  "confidence": 0.92,
  "time_to_first_token_ms": 45.2,
  "total_time_ms": 163.7,
  "prefill_tps": 619.5,
  "decode_tps": 168.4,
  "ram_usage_mb": 245.67,
  "prefill_tokens": 28,
  "decode_tokens": 12,
  "total_tokens": 40
}
```
- `success`: Whether the generation succeeded
- `response`: The model's generated text response
- `cloud_handoff`: True if confidence was below the threshold and the cloud model was used
- `confidence`: Model confidence score (0-1) based on token probabilities
- `function_calls`: Parsed tool/function calls if tools were provided
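To make the fields concrete, here is a small helper that reads a response string and summarizes what happened. `summarize_result` is a hypothetical convenience function written for this guide, not part of the SDK; it only touches fields shown in the response object above.

```python
import json

def summarize_result(raw: str) -> str:
    """Parse a completion response string and describe the outcome.

    Hypothetical helper for illustration, not part of the Cactus SDK.
    """
    result = json.loads(raw)
    if not result["success"]:
        return f"failed: {result['error']}"
    where = "cloud" if result["cloud_handoff"] else "on-device"
    return f"{where} ({result['confidence']:.2f}): {result['response']}"

# Exercise it with a response shaped like the example above.
example = json.dumps({
    "success": True, "error": None, "cloud_handoff": False,
    "response": "4", "confidence": 0.92,
})
print(summarize_result(example))  # on-device (0.92): 4
```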
Options
Control generation behavior with an options JSON object:
```json
{
  "max_tokens": 256,
  "temperature": 0.7,
  "top_p": 0.95,
  "top_k": 20,
  "stop_sequences": ["<|im_end|>", "User:"],
  "cloud_handoff_threshold": 0.8
}
```
- `max_tokens`: Maximum number of tokens to generate
- `temperature`: Sampling temperature (0.0-2.0). Higher values produce more random output
- `top_p`: Nucleus sampling threshold (0.0-1.0)
- `top_k`: Top-k sampling limit. 0 disables top-k
- `stop_sequences`: Array of strings that stop generation when encountered
- `cloud_handoff_threshold`: Minimum confidence (0-1) required to stay on-device; below this triggers cloud fallback
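Since the options argument is a plain JSON string, a thin builder can keep per-call overrides tidy. This `make_options` helper is a sketch written for this guide; the default values baked into it mirror the example object above and are assumptions, not the library's documented defaults.

```python
import json

def make_options(**overrides) -> str:
    """Build an options JSON string for cactus_complete.

    Hypothetical helper; the defaults below copy the example object
    in this guide and may differ from the library's real defaults.
    """
    options = {
        "max_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.95,
        "top_k": 20,
        "stop_sequences": [],
        "cloud_handoff_threshold": 0.8,
    }
    options.update(overrides)
    return json.dumps(options)

# Override only what differs from the defaults.
opts = make_options(temperature=0.2, stop_sequences=["User:"])
print(opts)
```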
Streaming Responses
Get token-by-token streaming for better UX:
Python Example
```python
def on_token(token, token_id):
    print(token, end="", flush=True)

options = json.dumps({"max_tokens": 256, "temperature": 0.7})
result = json.loads(cactus_complete(model, messages, options, None, on_token))
print(f"\n\nGeneration complete: {result['total_time_ms']:.2f} ms")
```
Swift Example
```swift
let options = #"{"max_tokens":256,"temperature":0.7}"#
let result = try cactusComplete(model, messagesJson, options, nil) { token, tokenId in
    print(token, terminator: "")
}
```
Kotlin Example
```kotlin
val options = """{"max_tokens":256,"temperature":0.7}"""
val result = cactusComplete(model, messagesJson, options, null) { token, _ ->
    print(token)
}
```
Multi-Turn Conversations
Maintain conversation history by including previous messages:
```python
conversation = [
    {"role": "system", "content": "You are a helpful math tutor."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
    {"role": "user", "content": "What about 3+3?"}
]

messages = json.dumps(conversation)
result = json.loads(cactus_complete(model, messages, None, None, None))
conversation.append({"role": "assistant", "content": result["response"]})
```
The model's KV cache is managed automatically. Use `cactus_reset(model)` to clear the cache and start a fresh conversation.
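The append-and-resend bookkeeping above can be wrapped in a small class. `Conversation` is a hypothetical convenience wrapper written for this guide, not part of the SDK; the commented-out line marks where the real `cactus_complete` call would go.

```python
import json

class Conversation:
    """Tracks multi-turn message history for completion calls.

    Hypothetical wrapper for illustration, not part of the Cactus SDK.
    """
    def __init__(self, system_prompt):
        self.messages = [{"role": "system", "content": system_prompt}]

    def user(self, content) -> str:
        """Append a user turn; return the messages JSON for cactus_complete."""
        self.messages.append({"role": "user", "content": content})
        return json.dumps(self.messages)

    def assistant(self, content):
        """Record the model's reply so the next turn includes it."""
        self.messages.append({"role": "assistant", "content": content})

chat = Conversation("You are a helpful math tutor.")
payload = chat.user("What is 2+2?")
# result = json.loads(cactus_complete(model, payload, None, None, None))
chat.assistant("2+2 equals 4.")  # stand-in for result["response"]
```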
Cloud Fallback
Automatically hand off complex queries to cloud models:
```python
options = json.dumps({
    "max_tokens": 512,
    "cloud_handoff_threshold": 0.8  # Hand off if confidence < 0.8
})

result = json.loads(cactus_complete(model, messages, options, None, None))

if result["cloud_handoff"]:
    print("Query handled by cloud model")
else:
    print(f"On-device inference (confidence: {result['confidence']:.2f})")
```
Set your Cactus Cloud API key with `cactus auth` to enable cloud fallback.
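The handoff decision itself is just a threshold comparison on the confidence score. This standalone sketch shows the logic for clarity; the actual routing happens inside the library, and whether a score exactly at the threshold stays on-device is an assumption here.

```python
def should_hand_off(confidence: float, threshold: float = 0.8) -> bool:
    """Return True when on-device confidence falls below
    cloud_handoff_threshold. Illustrative sketch of the library's
    routing rule, not its actual implementation."""
    return confidence < threshold

print(should_hand_off(0.65))  # True: low confidence, route to cloud
print(should_hand_off(0.92))  # False: confident, stay on-device
```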
Error Handling
```python
try:
    result = json.loads(cactus_complete(model, messages, None, None, None))
    if not result["success"]:
        print(f"Generation failed: {result['error']}")
except RuntimeError as e:
    print(f"API error: {e}")
    error = cactus_get_last_error()
    if error:
        print(f"Details: {error}")
```
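If you prefer a single failure path instead of checking `success` at every call site, you can convert soft failures into exceptions. `CactusError` and `parse_or_raise` are hypothetical helpers written for this guide, not SDK symbols:

```python
import json

class CactusError(RuntimeError):
    """Raised when a completion reports success == false.

    Hypothetical helper for illustration, not part of the Cactus SDK.
    """

def parse_or_raise(raw: str) -> dict:
    """Parse a completion response; raise on reported failure."""
    result = json.loads(raw)
    if not result.get("success"):
        raise CactusError(result.get("error") or "unknown error")
    return result

# Simulated failing response, shaped like the response object above.
try:
    parse_or_raise(json.dumps({"success": False, "error": "context overflow"}))
except CactusError as e:
    print(f"Generation failed: {e}")  # Generation failed: context overflow
```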
Next Steps
- Tool Calling: add function calling for agentic workflows
- Vision Models: use vision-language models with image inputs
- API Reference: complete API documentation
- Streaming Guide: advanced streaming patterns