Overview

Cactus provides built-in RAG capabilities with:
  • Automatic text corpus indexing
  • Vector similarity search
  • Embedded vector database
  • Query-time retrieval
No external vector database is required; everything runs on-device.

Quick Start

Automatic RAG Setup

from cactus import cactus_init, cactus_complete, cactus_destroy
import json

# Initialize with corpus directory
model = cactus_init(
    "weights/lfm2-1.2b",
    "path/to/docs",  # Directory containing .txt files
    cache_index=True  # Cache embeddings for faster startup
)

# Query automatically retrieves relevant context
messages = json.dumps([{
    "role": "user",
    "content": "What is the return policy?"
}])

result = json.loads(cactus_complete(model, messages, None, None, None))
print(result["response"])

cactus_destroy(model)
Cactus automatically:
  1. Chunks documents into passages
  2. Generates embeddings for each passage
  3. Builds a vector index
  4. Retrieves top-k relevant passages at query time
  5. Injects context into the prompt (sketched below)
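
The exact template Cactus uses for step 5 is internal to the engine; conceptually, though, the injection is equivalent to something like this sketch (the message layout here is an assumption, not the engine's actual format):
# Hypothetical illustration of context injection -- not Cactus's real template
retrieved = ["Returns are accepted within 30 days.", "Refunds go to the original payment method."]

messages = json.dumps([
    {"role": "system", "content": "Answer using this context:\n" + "\n\n".join(retrieved)},
    {"role": "user", "content": "What is the return policy?"}
])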

Corpus Preparation

Organize your documents as text files:
docs/
  ├── faq.txt
  ├── product_info.txt
  ├── policies.txt
  └── technical_specs.txt
Each file is automatically chunked and indexed.
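
A short script can produce that layout; any plain-text file in the directory passed to cactus_init becomes part of the corpus (the file contents below are illustrative):
from pathlib import Path

docs = Path("docs")
docs.mkdir(exist_ok=True)

# Each .txt file is chunked and indexed when cactus_init runs
(docs / "faq.txt").write_text("Q: What is the return policy?\nA: ...", encoding="utf-8")
(docs / "policies.txt").write_text("Returns are accepted within 30 days.", encoding="utf-8")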

Manual RAG Query

Query the RAG index directly:
from cactus import cactus_rag_query
import json

# Query without generating a response
result = json.loads(cactus_rag_query(
    model,
    "machine learning features",
    top_k=5
))

for doc in result["documents"]:
    print(f"Score: {doc['score']:.3f}")
    print(f"Content: {doc['content'][:200]}...\n")

Response Format

{
    "success": true,
    "documents": [
        {
            "id": 42,
            "score": 0.87,
            "content": "Machine learning models can be deployed...",
            "metadata": "source: ml_guide.txt"
        },
        {
            "id": 15,
            "score": 0.82,
            "content": "On-device inference provides...",
            "metadata": "source: inference.txt"
        }
    ],
    "query_time_ms": 12.5
}
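
For example, to keep only high-confidence passages from a query result (the 0.8 cutoff is illustrative):
result = json.loads(cactus_rag_query(model, "deployment options", top_k=10))
if result["success"]:
    # Keep passages above an illustrative confidence cutoff
    strong = [d for d in result["documents"] if d["score"] >= 0.8]
    for doc in strong:
        print(f"{doc['metadata']} ({doc['score']:.2f})")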

Custom Vector Index

Use the vector index API for more control:
from cactus import (
    cactus_init,
    cactus_embed,
    cactus_index_init,
    cactus_index_add,
    cactus_index_query,
    cactus_index_get,
    cactus_index_delete,
    cactus_index_compact,
    cactus_index_destroy,
    cactus_destroy
)
import json

# Initialize embedding model
model = cactus_init("weights/qwen3-embedding-0.6b", None, False)

# Create vector index
index = cactus_index_init("/path/to/index", embedding_dim=768)

# Add documents
documents = [
    "Cactus is an AI inference engine for mobile devices.",
    "It supports NPU acceleration on Apple chips.",
    "Quantization reduces model size by 70-90%."
]

ids = list(range(len(documents)))
embeddings = [cactus_embed(model, doc, True) for doc in documents]

cactus_index_add(index, ids, documents, embeddings, None)

# Query the index
query = "How does Cactus optimize for mobile?"
query_emb = cactus_embed(model, query, True)

options = json.dumps({"top_k": 2, "score_threshold": 0.5})
results = json.loads(cactus_index_query(index, query_emb, options))

for result in results["results"]:
    doc_id = result["id"]
    print(f"Score: {result['score']:.3f}")
    print(f"Document: {documents[doc_id]}\n")

cactus_index_destroy(index)
cactus_destroy(model)

Adding Metadata

Store metadata with documents:
documents = [
    "Product A costs $99",
    "Product B costs $149"
]

metadatas = [
    json.dumps({"category": "pricing", "product": "A"}),
    json.dumps({"category": "pricing", "product": "B"})
]

cactus_index_add(index, ids, documents, embeddings, metadatas)

# Retrieve metadata
result = json.loads(cactus_index_get(index, [0]))
for doc in result["documents"]:
    print(f"Content: {doc['content']}")
    print(f"Metadata: {doc['metadata']}")

Updating the Index

Add New Documents

new_docs = ["New feature: Cloud fallback"]
new_ids = [100]
new_embeddings = [cactus_embed(model, doc, True) for doc in new_docs]

cactus_index_add(index, new_ids, new_docs, new_embeddings, None)

Delete Documents

cactus_index_delete(index, [5, 10, 15])

Compact Index

# Reclaim space from deleted documents
cactus_index_compact(index)
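
Using only the calls shown above, replacing a document amounts to delete, re-add under the same ID, then compact:
# Replace document 5: remove the stale entry, re-add under the same ID
cactus_index_delete(index, [5])

updated = "Quantization reduces model size by 70-90% with minimal quality loss."
cactus_index_add(index, [5], [updated],
                 [cactus_embed(model, updated, True)], None)

# Reclaim the space left by the deletion
cactus_index_compact(index)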

Query Options

{
    "top_k": 10,
    "score_threshold": 0.7
}
  • top_k (integer, default: 10) - Maximum number of results to return
  • score_threshold (number, default: -1.0) - Minimum similarity score (0-1); results below the threshold are filtered out, and -1 disables filtering
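
For example, to return only strong matches from the index built above:
options = json.dumps({"top_k": 3, "score_threshold": 0.7})
results = json.loads(cactus_index_query(index, query_emb, options))
# At most 3 results, each with similarity >= 0.7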

Hybrid Search

Combine semantic and keyword search:
def hybrid_search(query, documents, top_k=5):
    # Semantic search via the vector index
    query_emb = cactus_embed(model, query, True)
    semantic = json.loads(cactus_index_query(
        index, query_emb, json.dumps({"top_k": top_k})
    ))["results"]

    # Keyword search: case-insensitive substring match
    query_lower = query.lower()
    keyword_ids = [i for i, doc in enumerate(documents)
                   if query_lower in doc.lower()]

    # Merge: keep the semantic ranking first, then append keyword-only hits
    combined = [r["id"] for r in semantic]
    combined += [i for i in keyword_ids if i not in combined]

    return combined[:top_k]
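
For example, with the documents indexed earlier:
hits = hybrid_search("NPU acceleration", documents, top_k=3)
for doc_id in hits:
    print(documents[doc_id])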

Reranking

Rerank retrieved documents for better relevance:
import numpy as np

def rerank(query, retrieved_docs):
    # Generate the query embedding
    query_emb = np.asarray(cactus_embed(model, query, True))

    # Score each document; the dot product equals cosine similarity
    # when embeddings are normalized
    scores = []
    for doc in retrieved_docs:
        doc_emb = np.asarray(cactus_embed(model, doc["content"], True))
        scores.append((float(np.dot(query_emb, doc_emb)), doc))

    # Sort by score, highest first (the key avoids comparing dicts on ties)
    scores.sort(key=lambda s: s[0], reverse=True)
    return [doc for _, doc in scores]
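
A common pattern is to over-retrieve with the built-in RAG index and rerank down to a smaller final set (this sketch assumes the same model handle also supports cactus_embed):
result = json.loads(cactus_rag_query(model, "mobile deployment", top_k=20))
top_docs = rerank("mobile deployment", result["documents"])[:5]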

Chunking Strategies

When building a custom index, split long documents into chunks before embedding them. A fixed-size chunker with overlap:
def chunk_fixed(text, chunk_size=512, overlap=50):
    # Chunk by word count (a rough proxy for tokens); overlapping
    # windows preserve context across chunk boundaries
    words = text.split()
    chunks = []

    for i in range(0, len(words), chunk_size - overlap):
        chunks.append(' '.join(words[i:i + chunk_size]))

    return chunks
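
A sentence-aware variant keeps chunk boundaries at sentence breaks, which often retrieves cleaner passages. This is a simple sketch using a regex sentence split, not a Cactus API:
import re

def chunk_sentences(text, max_words=256):
    # Split on sentence-ending punctuation followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks, current, count = [], [], 0

    for sentence in sentences:
        words = len(sentence.split())
        if current and count + words > max_words:
            chunks.append(' '.join(current))
            current, count = [], 0
        current.append(sentence)
        count += words

    if current:
        chunks.append(' '.join(current))
    return chunks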

Performance Tips

  • Use cache_index=True to avoid recomputing embeddings on startup
  • Chunk documents to 256-512 tokens for optimal retrieval
  • Use batch embedding generation for large corpora
  • Enable NPU acceleration for faster embedding generation

Index Storage Format

The index is stored in two files:
  • index.bin - Vector embeddings (FP16) and metadata
  • data.bin - Document content and metadata strings
Both use memory-mapped I/O for efficient access.
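
Because the index persists at the path given to cactus_index_init, a later process should be able to reopen it without re-embedding documents. A sketch, assuming cactus_index_init reloads the existing on-disk files rather than starting fresh:
# Reopen a previously built index (assumes existing index.bin/data.bin
# at this path are reloaded)
model = cactus_init("weights/qwen3-embedding-0.6b", None, False)
index = cactus_index_init("/path/to/index", embedding_dim=768)

query_emb = cactus_embed(model, "NPU support", True)
print(cactus_index_query(index, query_emb, json.dumps({"top_k": 2})))

cactus_index_destroy(index)
cactus_destroy(model)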

Error Handling

Check the success flag in the parsed result and catch runtime errors:
try:
    result = json.loads(cactus_rag_query(model, query, top_k=5))
    if not result["success"]:
        print(f"RAG query failed: {result.get('error')}")
except RuntimeError as e:
    print(f"Error: {e}")

Next Steps

  • Embeddings Guide - Learn about embedding generation
  • Vector Index API - Complete vector database API reference
  • Chat Completion - Use RAG with chat completion
  • Supported Models - Browse embedding models
