Running local models with Ollama lets you operate BoardPulse AI entirely on your own infrastructure. No data leaves your network, no API key is required, and you retain full control over the model being used — making it the right choice for air-gapped deployments, strict data-privacy environments, and teams that want to eliminate per-token cloud costs.

Why use local models

Data privacy

Queries and results never leave your machine. No data is sent to a third-party API.

No API key needed

Run without an OPENAI_API_KEY. Ollama serves models directly from your hardware.

Air-gapped deployments

Once the model is pulled, the stack runs without internet access.

Cost control

No per-token billing. The ongoing cost of inference is your own hardware and the electricity it consumes.

Prerequisites for GPU acceleration

Ollama detects NVIDIA GPUs automatically once the NVIDIA Container Toolkit is installed on the host. Without GPU support, Ollama falls back to CPU inference, which is significantly slower. Install the toolkit before starting the stack. The commands below assume a Debian or Ubuntu host where NVIDIA's package repository is already configured:
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
GPU support is optional. BoardPulse AI and Ollama will still run on CPU-only hardware — expect higher response latency for larger models.

Enabling Ollama

1. Set OLLAMA_ENABLED in .env

Open your .env file and enable Ollama:
OLLAMA_ENABLED=true
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_MODEL=qwen3:8b
OLLAMA_BASE_URL points to the ollama container through Docker Compose's internal DNS. Change this value only if you are running Ollama outside of Docker Compose.
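
A quick sanity check on these values can catch typos before the stack starts. The sketch below is a minimal shell helper and not part of BoardPulse AI: the function names env_value and check_ollama_env are hypothetical, and it assumes plain KEY=value lines without quotes or an export prefix.

```shell
# Hypothetical helper: print the value of KEY from a dotenv-style file.
# Only plain KEY=value lines are handled (no quoting, no "export" prefix).
env_value() {
  local key="$1" file="${2:-.env}"
  # Keep the last matching line so later definitions win, then strip "KEY=".
  grep -E "^${key}=" "$file" | tail -n 1 | cut -d= -f2-
}

# Hypothetical check: succeed only when OLLAMA_ENABLED is exactly "true"
# and the other two Ollama settings are non-empty.
check_ollama_env() {
  local file="${1:-.env}"
  [ "$(env_value OLLAMA_ENABLED "$file")" = "true" ] || { echo "OLLAMA_ENABLED is not true"; return 1; }
  [ -n "$(env_value OLLAMA_BASE_URL "$file")" ] || { echo "OLLAMA_BASE_URL is missing"; return 1; }
  [ -n "$(env_value OLLAMA_MODEL "$file")" ] || { echo "OLLAMA_MODEL is missing"; return 1; }
  echo "ok"
}
```

After sourcing the snippet, running check_ollama_env .env from the project directory prints ok when all three settings are in place, or a short message naming the first problem it finds.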

2. Start the stack with the local-models profile

The ollama service is gated behind the local-models profile. Use this command to bring up the full stack including Ollama:
docker compose --profile local-models up --build -d

3. Pull the model into the Ollama container

After the stack is running, pull the model you configured in OLLAMA_MODEL:
docker compose exec ollama ollama pull qwen3:8b
The download size depends on the model. qwen3:8b is approximately 5 GB. The model is stored in the ollama-data Docker volume and persists across restarts.
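
To confirm the pull succeeded, the model tag should appear in the output of ollama list. The following is a small sketch: model_is_pulled is a hypothetical helper, and it assumes the listing starts with a header row followed by one row per model, with the tag in the first column.

```shell
# Hypothetical helper: read `ollama list` output on stdin and succeed if a
# row's first column equals the given model tag. The first line is treated
# as a header and skipped.
model_is_pulled() {
  awk -v model="$1" 'NR > 1 && $1 == model { found = 1 } END { exit !found }'
}

# Usage against the running stack (-T disables the pseudo-TTY so the output
# pipes cleanly):
#   docker compose exec -T ollama ollama list | model_is_pulled qwen3:8b \
#     && echo "model ready"
```

The helper only inspects text, so it can be reused with any model tag you set in OLLAMA_MODEL.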

4. Verify local inference is working

Send a test query to the API using preferred_provider: "local":
curl -X POST http://localhost:8000/api/v1/chat/query \
  -H "Content-Type: application/json" \
  -d '{
    "message": "How many sales orders are in the database?",
    "preferred_provider": "local"
  }'
A successful response confirms that the API is routing requests to Ollama.

Ollama environment variables

| Variable | Default | Description |
| --- | --- | --- |
| OLLAMA_ENABLED | false | Set to true to enable local model routing |
| OLLAMA_BASE_URL | http://ollama:11434 | Internal URL of the Ollama service |
| OLLAMA_MODEL | qwen3:8b | The model name Ollama uses to serve responses |

Hybrid routing

When preferred_provider is set to hybrid in a query, BoardPulse AI checks whether Ollama is available and routes to it first. If Ollama is unreachable or OLLAMA_ENABLED is false, the request falls back to the configured cloud provider automatically.
curl -X POST http://localhost:8000/api/v1/chat/query \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What were total sales last quarter?",
    "preferred_provider": "hybrid"
  }'
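
The fallback rule described in this section can be sketched as a small decision function. This is illustrative only and does not reproduce BoardPulse AI's internal code: resolve_hybrid_provider and ollama_reachable are hypothetical names, and probing Ollama's /api/tags endpoint is an assumed way to test reachability.

```shell
# Illustrative only: mirrors the documented rule, not the actual implementation.
# $1 = value of OLLAMA_ENABLED, $2 = "yes" if Ollama answered a probe.
# Prints "local" when Ollama should serve the request, otherwise "cloud".
resolve_hybrid_provider() {
  local enabled="$1" reachable="$2"
  if [ "$enabled" = "true" ] && [ "$reachable" = "yes" ]; then
    echo "local"
  else
    echo "cloud"
  fi
}

# Assumed probe: ask Ollama for its model list with a short timeout.
# $1 = base URL of the Ollama service. Prints "yes" or "no".
ollama_reachable() {
  if curl -fsS --max-time 2 "$1/api/tags" >/dev/null 2>&1; then
    echo "yes"
  else
    echo "no"
  fi
}
```

A call such as resolve_hybrid_provider "$OLLAMA_ENABLED" "$(ollama_reachable "$OLLAMA_BASE_URL")" combines the two. Note that the default http://ollama:11434 resolves only from inside the Docker Compose network, so the probe would need to run from a container on that network.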

Open WebUI model alias

In Open WebUI (accessible at http://localhost:3002), the model named boardpulse-executive maps to the value of OLLAMA_MODEL in your .env. Selecting this model in the chat interface sends requests through the BoardPulse AI API to Ollama using your configured local model.

Monitoring GPU usage

Once Ollama is running with GPU support, you can inspect real-time GPU utilization from inside the container:
docker compose exec ollama nvidia-smi
This command shows VRAM usage, GPU load percentage, and temperature — useful for verifying that inference is hitting the GPU rather than falling back to CPU.
qwen3:8b is the default model and offers a strong balance of quality and performance on hardware with 8 GB or more of VRAM. Smaller models like qwen3:4b work on 6 GB GPUs, while qwen3:14b requires 12 GB or more for comfortable inference.
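
The sizing guidance above can be turned into a small lookup. This is a sketch under stated assumptions: suggest_model is a hypothetical helper, it takes VRAM as a whole number of gigabytes, and the thresholds are only the ones quoted in this section.

```shell
# Hypothetical helper: map whole gigabytes of VRAM to the model tags
# mentioned in this guide. Thresholds come from the sizing note above.
suggest_model() {
  local vram_gb="$1"
  if [ "$vram_gb" -ge 12 ]; then
    echo "qwen3:14b"
  elif [ "$vram_gb" -ge 8 ]; then
    echo "qwen3:8b"
  elif [ "$vram_gb" -ge 6 ]; then
    echo "qwen3:4b"
  else
    # Below the sizes covered here; Ollama can still run on CPU.
    echo "cpu"
  fi
}
```

For example, suggest_model 8 prints qwen3:8b. If you read the value from nvidia-smi, keep in mind that it reports memory in MiB, so a nominal 8 GB card may report slightly less than 8192 and should be rounded before calling the helper.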
