The Ollama provider allows you to run AI safety evaluations using models hosted locally with Ollama, providing privacy and cost-effectiveness for your testing workflow.

Prerequisites

Install Ollama

Ollama must be running locally before using this provider.

1. Download and Install

Download Ollama from ollama.ai and follow the installation instructions for your operating system:
  • macOS: Download and run the installer
  • Linux: Run curl -fsSL https://ollama.ai/install.sh | sh
  • Windows: Download the Windows installer

2. Start Ollama Service

After installation, start the Ollama service:
ollama serve
By default, Ollama runs on http://localhost:11434

3. Pull a Model

Download a model to use for evaluations:
ollama pull llama3.2
View available models at ollama.ai/library
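
Before starting an evaluation, it can help to confirm that the model is available locally. The sketch below assumes that `ollama list` prints a header row followed by one model per line, with the name (for example `llama3.2:latest`) in the first column. The `model_is_pulled` helper is a hypothetical name for illustration; it is not part of Ollama or `cbl`.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given model appears in `ollama list`.
# Assumes the first column holds names such as "llama3.2:latest".
model_is_pulled() {
    wanted="$1"
    ollama list | awk -v m="$wanted" '
        NR > 1 {                          # skip the header row
            name = $1
            sub(/:latest$/, "", name)     # treat "model:latest" as "model"
            if ($1 == m || name == m) found = 1
        }
        END { exit found ? 0 : 1 }
    '
}

# Usage: pull the model only when it is missing.
# model_is_pulled llama3.2 || ollama pull llama3.2
```

The check matches either the exact name and tag or the bare name when the tag is `latest`; other tags must be spelled out in full.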

Basic Usage

cbl single-turn \
    --threshold 0.5 \
    --variations 2 \
    --maximum-iteration-layers 2 \
    ollama --model llama3.2

Configuration Options

Required Options

--model
string
required
Ollama model name to use for evaluations.
Examples: llama3.2, mistral, codellama, gemma
The model must already be pulled via ollama pull <model-name>.

Optional Options

--base-url
string
default:"http://localhost:11434"
Ollama server base URL. Change this if Ollama is running on a different host or port.
Environment variable: OLLAMA_BASE_URL
Example: --base-url http://192.168.1.100:11434
--logprobs
boolean
Return log probabilities for each token in the response.

Model Options

Ollama supports extensive model configuration through the following parameters:
--temperature
float
default:"0.8"
Sampling temperature. Higher values make answers more creative; lower values make them more deterministic.
Range: 0.0 to 2.0
--top-k
integer
default:"40"
Limits sampling to the k most probable tokens, which reduces the chance of generating nonsense. Higher values give more diverse answers; lower values are more conservative.
--top-p
float
default:"0.9"
Works with top-k. Higher values lead to more diverse text.
Range: 0.0 to 1.0
--num-predict
integer
default:"128"
Maximum number of tokens to predict.
Special values:
  • -1: Infinite generation
  • -2: Fill context window
--num-ctx
integer
default:"2048"
Size of the context window (number of tokens).
--repeat-penalty
float
default:"1.1"
How strongly to penalize repetitions. Higher values reduce repetition.
--repeat-last-n
integer
default:"64"
How far back (in tokens) to look to prevent repetition.
Special values:
  • 0: Disabled
  • -1: Use num_ctx value
--seed
integer
default:"0"
Random number seed for generation. Use the same seed for reproducible outputs.
--stop
string[]
Stop sequences. Generation stops when any of these strings is encountered. Repeat the flag to pass several sequences.
Example: --stop END --stop STOP
--tfs-z
float
default:"1"
Tail-free sampling. Reduces the impact of less probable tokens; higher values reduce it more, and the default of 1 disables the setting.
--mirostat
integer
default:"0"
Enable Mirostat sampling for controlling perplexity.
Options:
  • 0: Disabled
  • 1: Mirostat 1.0
  • 2: Mirostat 2.0
--mirostat-tau
float
default:"5.0"
Mirostat target entropy (tau). Controls the balance between coherence and diversity; lower values produce more focused text.
--mirostat-eta
float
default:"0.1"
Mirostat learning rate.
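
Because `--stop` is repeated once per sequence, a script that receives the sequences as a list has to expand them into flag pairs. The following is a minimal sketch of one way to do that in POSIX sh. The `run_with_stops` wrapper is a hypothetical name, and the `cbl` invocation uses only flags documented on this page.

```shell
#!/bin/sh
# Hypothetical wrapper: first argument is the model name, remaining arguments
# are stop sequences. Each sequence becomes its own "--stop <sequence>" pair.
run_with_stops() {
    model="$1"
    shift
    n=$#                              # number of stop sequences passed in
    for s in "$@"; do
        set -- "$@" --stop "$s"       # append the flag pair after the originals
    done
    shift "$n"                        # drop the original bare sequences
    cbl single-turn \
        --threshold 0.5 \
        ollama --model "$model" "$@"
}

# Usage:
# run_with_stops llama3.2 END STOP
```

Rebuilding the positional parameters keeps each sequence as a single argument, so sequences that contain spaces are passed through intact.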

Hardware Options

--num-gpu
integer
Number of layers to send to GPU(s). Use to control GPU memory usage.
--num-thread
integer
Number of threads to use during computation. Adjust based on your CPU cores.
--num-gqa
integer
Number of GQA (Grouped Query Attention) groups in transformer layer. Model-specific setting.
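
One way to choose a starting value for `--num-thread` is to read the CPU core count from the operating system. The sketch below is illustrative only: `detect_cpu_cores` is a hypothetical helper that tries `nproc` (common on Linux), then `sysctl -n hw.ncpu` (macOS), and falls back to 1. Whether matching threads to cores is the best setting depends on the model and hardware.

```shell
#!/bin/sh
# Hypothetical helper: print the number of CPU cores, or 1 if it cannot be detected.
detect_cpu_cores() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    elif command -v sysctl >/dev/null 2>&1 && sysctl -n hw.ncpu >/dev/null 2>&1; then
        sysctl -n hw.ncpu
    else
        echo 1
    fi
}

# Usage (flags as documented above):
# cbl single-turn --threshold 0.5 ollama --model llama3.2 --num-thread "$(detect_cpu_cores)"
```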

Examples

Basic Single-Turn Evaluation

cbl single-turn \
    --threshold 0.5 \
    --variations 2 \
    ollama --model llama3.2

Multi-Turn with Custom Temperature

cbl multi-turn \
    --threshold 0.5 \
    --max-turns 8 \
    --test-types user_persona,semantic_chunks \
    ollama \
    --model mistral \
    --temperature 0.7

Remote Ollama Instance

cbl single-turn \
    --threshold 0.5 \
    ollama \
    --model codellama \
    --base-url http://192.168.1.100:11434

Reproducible Results with Seed

cbl single-turn \
    --threshold 0.5 \
    ollama \
    --model llama3.2 \
    --temperature 0.3 \
    --seed 42

Large Context Window Configuration

cbl multi-turn \
    --threshold 0.4 \
    --max-turns 10 \
    ollama \
    --model llama3.2 \
    --num-ctx 8192 \
    --num-predict 1024

GPU Optimization

cbl single-turn \
    --threshold 0.5 \
    ollama \
    --model llama3.2 \
    --num-gpu 35 \
    --num-thread 8

Advanced Sampling Configuration

cbl multi-turn \
    --threshold 0.5 \
    --max-turns 8 \
    ollama \
    --model mistral \
    --temperature 0.8 \
    --top-k 50 \
    --top-p 0.95 \
    --repeat-penalty 1.2 \
    --mirostat 2 \
    --mirostat-tau 5.0

Here are some popular models available through Ollama:

| Model | Size | Description | Pull Command |
| --- | --- | --- | --- |
| llama3.2 | 3B | Latest Llama model, efficient and capable | ollama pull llama3.2 |
| llama3.1 | 8B-70B | Previous Llama generation, multiple sizes | ollama pull llama3.1 |
| mistral | 7B | High-performance open model | ollama pull mistral |
| mixtral | 8x7B | Mixture of experts model | ollama pull mixtral |
| codellama | 7B-34B | Code-specialized Llama variant | ollama pull codellama |
| gemma | 2B-7B | Google’s efficient open model | ollama pull gemma |
| phi | 2.7B | Microsoft’s compact model | ollama pull phi |

For a complete list of available models, visit the Ollama Library.

Environment Variables

| Variable | Description | Required |
| --- | --- | --- |
| OLLAMA_BASE_URL | Ollama server URL | No (defaults to http://localhost:11434) |
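
As a sketch of how the variable might be used, the wrapper below sets `OLLAMA_BASE_URL` for a single command instead of passing `--base-url`. The `run_remote_eval` function is a hypothetical name, and the host address is the example address used elsewhere on this page. This page does not say which value wins when both the flag and the variable are set, so the sketch sets only the variable.

```shell
#!/bin/sh
# Hypothetical wrapper: run a single-turn evaluation against a remote Ollama
# host by setting OLLAMA_BASE_URL for this one command only.
run_remote_eval() {
    host_url="$1"
    model="$2"
    OLLAMA_BASE_URL="$host_url" cbl single-turn \
        --threshold 0.5 \
        ollama --model "$model"
}

# Usage:
# run_remote_eval http://192.168.1.100:11434 codellama
```

To apply the setting to every command in a shell session, `export OLLAMA_BASE_URL=...` once instead.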

Tips

Model Selection: Larger models (70B+) provide better quality but require more resources. Start with 7B-13B models for development, then scale up if needed.
Context Window: If you encounter truncation issues, increase --num-ctx. Be aware this increases memory usage.
GPU Memory: Monitor GPU memory usage when running large models. Use --num-gpu to control how many layers are offloaded to the GPU.
Reproducibility: For consistent results across runs, set --seed to a fixed value and --temperature to 0 to minimize randomness.

Troubleshooting

Connection Issues

If you see connection errors:
  1. Verify Ollama is running: ollama list
  2. Check the service is accessible: curl http://localhost:11434
  3. Ensure the model is pulled: ollama pull <model-name>
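
The reachability check can be wrapped in a small script so the same test works for local and remote hosts. This is a minimal sketch: `check_ollama` is a hypothetical helper, and it assumes the server exposes Ollama's `/api/version` endpoint at the base URL. When no argument is given it falls back to `OLLAMA_BASE_URL`, then to the default address.

```shell
#!/bin/sh
# Hypothetical helper: report whether an Ollama server answers at a base URL.
check_ollama() {
    base_url="${1:-${OLLAMA_BASE_URL:-http://localhost:11434}}"
    # -f makes curl fail on HTTP errors; --max-time avoids hanging on dead hosts.
    if curl -fsS --max-time 5 "${base_url}/api/version" >/dev/null 2>&1; then
        echo "reachable: ${base_url}"
    else
        echo "unreachable: ${base_url}"
        return 1
    fi
}

# Usage:
# check_ollama                              # default or OLLAMA_BASE_URL
# check_ollama http://192.168.1.100:11434   # explicit remote host
```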

Performance Issues

  • Use --num-thread to match your CPU cores
  • Adjust --num-gpu to optimize GPU usage
  • Consider using smaller models for faster evaluations
