Run OpenGauss Against Local vLLM Inference Servers

OpenGauss supports any OpenAI-compatible inference API through the standard OPENAI_BASE_URL environment variable. This means you can point Gauss at a locally running vLLM server, llama.cpp, Ollama in OpenAI-compatible mode, LM Studio, or any other server that speaks the OpenAI chat-completion format — without touching the cloud.

Start a vLLM inference server

Start your vLLM server on a local port:

python -m vllm.entrypoints.openai.api_server --model <model_name>

For example, to serve Qwen2.5-Coder-7B on the default port (8000):

python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-Coder-7B-Instruct

vLLM will print the base URL once it is ready — typically http://localhost:8000/v1.

Point Gauss at your local server

Option A — run gauss setup (interactive wizard)

The setup wizard has a dedicated step for custom endpoints:

gauss setup

When prompted for your provider or API configuration, select the custom/local option and enter http://localhost:8000/v1 as the base URL. The wizard writes the value to ~/.gauss/.env automatically.

Option B — set OPENAI_BASE_URL directly

Edit ~/.gauss/.env and add:

OPENAI_BASE_URL=http://localhost:8000/v1

No API key is required for local endpoints. OpenGauss treats a configured OPENAI_BASE_URL as sufficient to consider the local provider active.

Select the model

Tell Gauss which model name to request (must match the --model you passed to vLLM):

gauss model

Or set it directly in ~/.gauss/config.yaml:

model: Qwen/Qwen2.5-Coder-7B-Instruct

Verify with gauss doctor

Run the health check to confirm the provider resolves correctly:

gauss doctor

Full config example

Here is a complete ~/.gauss/config.yaml snippet for a local vLLM setup:

model: Qwen/Qwen2.5-Coder-7B-Instruct

agent:
  max_turns: 40          # local models may have shorter context — tune as needed

compression:
  enabled: true
  threshold: 0.40        # compress earlier to stay within a smaller context window
  summary_model: ""      # leave empty to use the primary model for compression too

And the corresponding ~/.gauss/.env:

OPENAI_BASE_URL=http://localhost:8000/v1
# No OPENAI_API_KEY needed for a local endpoint

Compatible inference servers

Any server that implements the OpenAI chat-completion API works with the same OPENAI_BASE_URL approach:

Server	Notes
vLLM	Recommended for GPU-accelerated inference; supports most HuggingFace models
llama.cpp (`--server`)	CPU-friendly; use `--port` and `--host` to control the endpoint
Ollama	Start with `ollama serve`; OpenAI-compatible endpoint at `http://localhost:11434/v1`
LM Studio	Enable “Local Server” in the app; exposes `http://localhost:1234/v1` by default
LocalAI	Drop-in OpenAI replacement with broad model support

Routing auxiliary tasks to a different endpoint

OpenGauss uses auxiliary models for side tasks like vision analysis, context compression, and MCP sampling. You can route these to a different local endpoint — for example, a lighter model on the same machine — without changing the primary model:

auxiliary:
  vision:
    base_url: "http://localhost:8001/v1"   # a separate vLLM instance for vision
    model: "Qwen/Qwen2.5-VL-7B-Instruct"

  compression:
    base_url: "http://localhost:8001/v1"
    model: "Qwen/Qwen2.5-0.5B-Instruct"   # fast/small model for summarization

  mcp:
    base_url: "http://localhost:8001/v1"
    model: "Qwen/Qwen2.5-0.5B-Instruct"   # model used for MCP sampling requests

When base_url is set under an auxiliary task it takes precedence over the provider setting for that task. This lets you run a large model for primary interactions and a small model for background work.

Auxiliary tasks like context compression and MCP sampling fire frequently during long sessions. Pointing them at a smaller, faster local model (e.g., a 0.5B–3B parameter model) keeps latency low and frees GPU memory for the primary model.

Performance tuning

Local models often have shorter context windows than hosted frontier models. A few config settings that help:

agent:
  max_turns: 40          # reduce the default 90 if the model context is tight

compression:
  enabled: true
  threshold: 0.40        # start compressing when context reaches 40% capacity
                         # (default is 0.50)

You can also use gauss config set to change these without editing the file directly:

gauss config set agent.max_turns 40
gauss config set compression.threshold 0.40

Get Started

Core Concepts

Configuration

Guides

Run OpenGauss Against Local vLLM Inference Servers

Start a vLLM inference server

Point Gauss at your local server

Full config example

Compatible inference servers

Routing auxiliary tasks to a different endpoint

Performance tuning

Build docs developers (and LLMs) love

Get Started

Core Concepts

Configuration

Guides

Documentation Index

​Start a vLLM inference server

​Point Gauss at your local server

​Full config example

​Compatible inference servers

​Routing auxiliary tasks to a different endpoint

​Performance tuning

Build docs developers (and LLMs) love

Start a vLLM inference server

Point Gauss at your local server

Full config example

Compatible inference servers

Routing auxiliary tasks to a different endpoint

Performance tuning