Serve local AI models with Ollama in BoardPulse AI

Running local models with Ollama lets you operate BoardPulse AI entirely on your own infrastructure. No data leaves your network, no API key is required, and you retain full control over the model being used — making it the right choice for air-gapped deployments, strict data-privacy environments, and teams that want to eliminate per-token cloud costs.

Why use local models

Data privacy

Queries and results never leave your machine. No data is sent to a third-party API.

No API key needed

Run without an OPENAI_API_KEY. Ollama serves models directly from your hardware.

Air-gapped deployments

Once the images are built and the model is pulled, the stack runs without internet access.

Cost control

No per-token billing. Inference cost is limited to your hardware and the electricity it consumes.

Prerequisites for GPU acceleration

Ollama automatically detects NVIDIA GPUs when the container toolkit is installed. Without GPU support, Ollama falls back to CPU inference, which is significantly slower. Install the NVIDIA container toolkit before starting the stack. The commands below assume a Debian or Ubuntu host where NVIDIA’s package repository is already configured:

sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

GPU support is optional. BoardPulse AI and Ollama will still run on CPU-only hardware — expect higher response latency for larger models.
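After running the install commands above, one way to confirm that Docker picked up the NVIDIA runtime is to inspect the runtime list reported by docker info. The sketch below is only an illustration: has_nvidia_runtime is a hypothetical helper name, and it assumes the runtime was registered under the name nvidia, which is what nvidia-ctk runtime configure normally does.

```shell
# Hypothetical helper: succeed if Docker's runtime list (JSON on stdin)
# contains an entry named "nvidia".
has_nvidia_runtime() {
  grep -q '"nvidia"'
}

# Usage (on the Docker host):
#   docker info --format '{{json .Runtimes}}' | has_nvidia_runtime \
#     && echo "NVIDIA runtime registered"
```

If the check fails, Ollama should still start, but inference will run on CPU.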

Enabling Ollama

Set OLLAMA_ENABLED in .env

Open your .env file and enable Ollama:

OLLAMA_ENABLED=true
OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_MODEL=qwen3:8b

OLLAMA_BASE_URL points to the ollama container through Docker’s internal DNS. Do not change this value unless you are running Ollama outside of Docker Compose.
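Before starting the stack, it can help to confirm that all three variables are present in .env. The following is a minimal sketch, not part of BoardPulse AI: check_ollama_env is a hypothetical helper, and it assumes plain KEY=value lines with no quoting or export prefix.

```shell
# Hypothetical helper: report any Ollama variable missing from an env file.
# Assumes plain KEY=value lines (no quoting, no "export" prefix).
check_ollama_env() {
  env_file="${1:-.env}"
  missing=0
  for key in OLLAMA_ENABLED OLLAMA_BASE_URL OLLAMA_MODEL; do
    # Require the key at line start followed by "=" and at least one character.
    if ! grep -q "^${key}=." "$env_file"; then
      echo "missing: ${key}"
      missing=1
    fi
  done
  return "$missing"
}

# Usage: check_ollama_env .env && echo "Ollama settings present"
```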

Start the stack with the local-models profile

The ollama service is gated behind the local-models profile. Use this command to bring up the full stack including Ollama:

docker compose --profile local-models up --build -d

Pull the model into the Ollama container

After the stack is running, pull the model you configured in OLLAMA_MODEL:

docker compose exec ollama ollama pull qwen3:8b

The download size depends on the model. qwen3:8b is approximately 5 GB. The model is stored in the ollama-data Docker volume and persists across restarts.
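To confirm that the pull completed, you can check the output of ollama list, which prints one row per installed model under a header line. The helper below is a sketch under that assumption; model_is_pulled is a hypothetical name, and it reads the listing from standard input so it can be used in a pipeline.

```shell
# Hypothetical helper: succeed if the given model tag appears in the first
# column of `ollama list` output. The first line (column header) is skipped.
model_is_pulled() {
  awk -v model="$1" 'NR > 1 && $1 == model { found = 1 } END { exit !found }'
}

# Usage (requires the running stack; -T disables the pseudo-TTY for piping):
#   docker compose exec -T ollama ollama list | model_is_pulled qwen3:8b \
#     && echo "model ready"
```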

Verify local inference is working

Send a test query to the API using preferred_provider: "local":

curl -X POST http://localhost:8000/api/v1/chat/query \
  -H "Content-Type: application/json" \
  -d '{
    "message": "How many sales orders are in the database?",
    "preferred_provider": "local"
  }'

A successful response confirms that the API is routing requests to Ollama.

Ollama environment variables

Variable	Default	Description
`OLLAMA_ENABLED`	`false`	Set to `true` to enable local model routing
`OLLAMA_BASE_URL`	`http://ollama:11434`	Internal URL of the Ollama service
`OLLAMA_MODEL`	`qwen3:8b`	The model name Ollama uses to serve responses

Hybrid routing

When preferred_provider is set to hybrid in a query, BoardPulse AI checks whether Ollama is available and routes to it first. If Ollama is unreachable or OLLAMA_ENABLED is false, the request falls back to the configured cloud provider automatically.

curl -X POST http://localhost:8000/api/v1/chat/query \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What were total sales last quarter?",
    "preferred_provider": "hybrid"
  }'
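The fallback rule described in this section can be summarized as a small decision function. This is only a sketch of the documented behavior, not BoardPulse AI's actual routing code: resolve_hybrid_provider, the "up" probe value, and the "cloud" label are hypothetical names used for illustration.

```shell
# Sketch of the hybrid routing rule; not BoardPulse AI's implementation.
# $1: value of OLLAMA_ENABLED ("true" or "false")
# $2: "up" if Ollama answered a reachability probe, anything else otherwise
resolve_hybrid_provider() {
  if [ "$1" = "true" ] && [ "$2" = "up" ]; then
    echo "local"
  else
    echo "cloud"   # stands in for the configured cloud provider
  fi
}

resolve_hybrid_provider true up      # prints "local"
resolve_hybrid_provider true down    # prints "cloud"
resolve_hybrid_provider false up     # prints "cloud"
```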

Open WebUI model alias

In Open WebUI (accessible at http://localhost:3002), the model named boardpulse-executive maps to the value of OLLAMA_MODEL in your .env. Selecting this model in the chat interface sends requests through the BoardPulse AI API to Ollama using your configured local model.

Monitoring GPU usage

Once Ollama is running with GPU support, you can inspect current GPU utilization from inside the container:

docker compose exec ollama nvidia-smi

This command shows VRAM usage, GPU load percentage, and temperature — useful for verifying that inference is hitting the GPU rather than falling back to CPU.
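For scripted checks, nvidia-smi can also emit selected fields as CSV through its --query-gpu and --format options. The sketch below formats one such line into a short summary; summarize_gpu is a hypothetical helper, and the field order it expects is the one requested in the usage comment.

```shell
# Hypothetical helper: turn one CSV line from nvidia-smi into a readable summary.
# Expected fields: utilization %, memory used MiB, memory total MiB, temperature C.
summarize_gpu() {
  awk -F', *' '{ printf "GPU %s%% | VRAM %s/%s MiB | %s C\n", $1, $2, $3, $4 }'
}

# Usage (requires the running stack and a GPU):
#   docker compose exec -T ollama nvidia-smi \
#     --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu \
#     --format=csv,noheader,nounits | summarize_gpu
```

A GPU load that stays near zero while a query is being answered suggests that inference has fallen back to CPU.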

qwen3:8b is the default model and offers a strong balance of quality and performance on hardware with 8 GB or more of VRAM. Smaller models like qwen3:4b work on 6 GB GPUs, while qwen3:14b requires 12 GB or more for comfortable inference.
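The sizing guidance above can be expressed as a simple lookup. This is a sketch that assumes you want the largest model that fits; suggest_model and the "cpu-only" marker are hypothetical, and the thresholds are the ones stated in this section rather than measured values.

```shell
# Hypothetical helper: suggest a model tag from available VRAM in whole GB,
# using the 6 GB, 8 GB, and 12 GB thresholds stated in this section.
suggest_model() {
  vram_gb="$1"
  if [ "$vram_gb" -ge 12 ]; then
    echo "qwen3:14b"
  elif [ "$vram_gb" -ge 8 ]; then
    echo "qwen3:8b"
  elif [ "$vram_gb" -ge 6 ]; then
    echo "qwen3:4b"
  else
    echo "cpu-only"   # below the smallest GPU tier mentioned here
  fi
}

# Usage: OLLAMA_MODEL="$(suggest_model 8)"
```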
