Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/math-inc/OpenGauss/llms.txt

Use this file to discover all available pages before exploring further.

OpenGauss supports any OpenAI-compatible inference API through the standard OPENAI_BASE_URL environment variable. This means you can point Gauss at a locally running vLLM server, llama.cpp, Ollama in OpenAI-compatible mode, LM Studio, or any other server that speaks the OpenAI chat-completion format — without touching the cloud.

Start a vLLM inference server

Start your vLLM server on a local port:
python -m vllm.entrypoints.openai.api_server --model <model_name>
For example, to serve Qwen2.5-Coder-7B on the default port (8000):
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-Coder-7B-Instruct
vLLM will print the base URL once it is ready — typically http://localhost:8000/v1.

Point Gauss at your local server

1

Option A — run gauss setup (interactive wizard)

The setup wizard has a dedicated step for custom endpoints:
gauss setup
When prompted for your provider or API configuration, select the custom/local option and enter http://localhost:8000/v1 as the base URL. The wizard writes the value to ~/.gauss/.env automatically.
2

Option B — set OPENAI_BASE_URL directly

Edit ~/.gauss/.env and add:
OPENAI_BASE_URL=http://localhost:8000/v1
No API key is required for local endpoints. OpenGauss treats a configured OPENAI_BASE_URL as sufficient to consider the local provider active.
3

Select the model

Tell Gauss which model name to request (must match the --model you passed to vLLM):
gauss model
Or set it directly in ~/.gauss/config.yaml:
model: Qwen/Qwen2.5-Coder-7B-Instruct
4

Verify with gauss doctor

Run the health check to confirm the provider resolves correctly:
gauss doctor

Full config example

Here is a complete ~/.gauss/config.yaml snippet for a local vLLM setup:
model: Qwen/Qwen2.5-Coder-7B-Instruct

agent:
  max_turns: 40          # local models may have shorter context — tune as needed

compression:
  enabled: true
  threshold: 0.40        # compress earlier to stay within a smaller context window
  summary_model: ""      # leave empty to use the primary model for compression too
And the corresponding ~/.gauss/.env:
OPENAI_BASE_URL=http://localhost:8000/v1
# No OPENAI_API_KEY needed for a local endpoint

Compatible inference servers

Any server that implements the OpenAI chat-completion API works with the same OPENAI_BASE_URL approach:
ServerNotes
vLLMRecommended for GPU-accelerated inference; supports most HuggingFace models
llama.cpp (--server)CPU-friendly; use --port and --host to control the endpoint
OllamaStart with ollama serve; OpenAI-compatible endpoint at http://localhost:11434/v1
LM StudioEnable “Local Server” in the app; exposes http://localhost:1234/v1 by default
LocalAIDrop-in OpenAI replacement with broad model support

Routing auxiliary tasks to a different endpoint

OpenGauss uses auxiliary models for side tasks like vision analysis, context compression, and MCP sampling. You can route these to a different local endpoint — for example, a lighter model on the same machine — without changing the primary model:
auxiliary:
  vision:
    base_url: "http://localhost:8001/v1"   # a separate vLLM instance for vision
    model: "Qwen/Qwen2.5-VL-7B-Instruct"

  compression:
    base_url: "http://localhost:8001/v1"
    model: "Qwen/Qwen2.5-0.5B-Instruct"   # fast/small model for summarization

  mcp:
    base_url: "http://localhost:8001/v1"
    model: "Qwen/Qwen2.5-0.5B-Instruct"   # model used for MCP sampling requests
When base_url is set under an auxiliary task it takes precedence over the provider setting for that task. This lets you run a large model for primary interactions and a small model for background work.
Auxiliary tasks like context compression and MCP sampling fire frequently during long sessions. Pointing them at a smaller, faster local model (e.g., a 0.5B–3B parameter model) keeps latency low and frees GPU memory for the primary model.

Performance tuning

Local models often have shorter context windows than hosted frontier models. A few config settings that help:
agent:
  max_turns: 40          # reduce the default 90 if the model context is tight

compression:
  enabled: true
  threshold: 0.40        # start compressing when context reaches 40% capacity
                         # (default is 0.50)
You can also use gauss config set to change these without editing the file directly:
gauss config set agent.max_turns 40
gauss config set compression.threshold 0.40

Build docs developers (and LLMs) love