Overview
Gambiarra works with any OpenAI-compatible API endpoint. This includes popular local LLM servers and any custom implementation that follows the OpenAI chat completions API specification.
Supported Providers
The following table lists officially tested and supported LLM providers:

| Provider | Default Endpoint | Notes |
|---|---|---|
| Ollama | http://localhost:11434 | Most popular local LLM server |
| LM Studio | http://localhost:1234 | GUI-based LLM management |
| LocalAI | http://localhost:8080 | Self-hosted OpenAI alternative |
| vLLM | http://localhost:8000 | High-performance inference |
| text-generation-webui | http://localhost:5000 | Gradio-based interface |
| Custom | Any URL | Any OpenAI-compatible endpoint |
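All of these providers accept the same chat-completions contract, so switching between them is mostly a matter of changing the base URL and model name. As an illustrative sketch (the helper and its shape are assumptions for this page, not Gambiarra internals):

```typescript
// Minimal request body for POST {endpoint}/v1/chat/completions on any
// OpenAI-compatible server. Hypothetical helper, shown for illustration.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

function buildChatRequest(model: string, messages: ChatMessage[], stream = false) {
  return { model, messages, stream };
}

const body = buildChatRequest("llama3", [{ role: "user", content: "Hello!" }]);
// Send as JSON to e.g. http://localhost:11434/v1/chat/completions
```

Only the endpoint and the model name change per provider; the body stays the same.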
Provider Configuration
Ollama
Ollama is the most commonly used provider with Gambiarra. It exposes models through both its native API (/api/tags) and OpenAI-compatible endpoints.
Configuration:
- Automatic model discovery via /api/tags
- Native support for model pulling and management
- GPU acceleration with CUDA/ROCm
- Supports most popular open-source models
- Native API: http://localhost:11434/api/*
- OpenAI-compatible: http://localhost:11434/v1/*
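Both API surfaces hang off the same host, which makes the layout easy to hold in your head. A small sketch of the resulting URLs (the helper is hypothetical, shown only to make the split concrete):

```typescript
// Hypothetical helper mapping an Ollama host to its two API surfaces.
function ollamaUrls(host = "http://localhost:11434") {
  return {
    tags: `${host}/api/tags`,            // native model listing
    models: `${host}/v1/models`,         // OpenAI-compatible model listing
    chat: `${host}/v1/chat/completions`, // OpenAI-compatible inference
  };
}
```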
Gambiarra automatically detects Ollama models via the /api/tags endpoint during the join process.
LM Studio
LM Studio provides a desktop GUI for managing and running LLMs locally.
Configuration:
- User-friendly GUI for model management
- Built-in model downloader
- Hardware acceleration support
- OpenAI-compatible API by default
- Ensure the LM Studio server is running before joining
- Model names should match what’s loaded in LM Studio
- Check the server settings in LM Studio for the correct port
LocalAI
LocalAI is a drop-in replacement for the OpenAI API that runs locally.
Configuration:
- Full OpenAI API compatibility
- Supports multiple model formats (GGML, GGUF, etc.)
- Audio transcription and image generation
- Docker-ready deployment
- OpenAI-compatible: http://localhost:8080/v1/*
vLLM
vLLM is a high-performance inference engine optimized for serving LLMs.
Configuration:
- High throughput and low latency
- PagedAttention for efficient memory management
- OpenAI-compatible API
- Continuous batching
- Model names often use HuggingFace format
- Requires GPU with sufficient VRAM
- Best for production deployments
text-generation-webui
A Gradio-based web interface for running LLMs with an OpenAI API extension.
Configuration:
- Web-based interface with multiple extensions
- OpenAI API extension available
- Supports various model formats
- Character/chat mode
- Must enable the OpenAI API extension
- Check the extensions tab for API settings
- Default port may vary based on configuration
Custom Providers
Any service implementing the OpenAI chat completions API can be used.
Required Endpoints:
For custom providers, ensure your endpoint responds to both /v1/models (for model listing) and /v1/chat/completions (for inference).
Model Discovery
Gambiarra attempts to discover models from your endpoint during the join process:
- Ollama Format: Tries GET /api/tags
- OpenAI Format: Tries GET /v1/models
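The two listing formats return differently shaped JSON, so discovery has to normalize them. A sketch of that normalization, based on the publicly documented Ollama and OpenAI response shapes (the types and helper are assumptions, not Gambiarra's actual code):

```typescript
// Ollama's /api/tags returns { models: [{ name: "llama3", ... }] };
// OpenAI-style /v1/models returns { data: [{ id: "llama3", ... }] }.
type OllamaTags = { models: { name: string }[] };
type OpenAIModels = { data: { id: string }[] };

function parseModelList(payload: unknown): string[] {
  const p = (payload ?? {}) as Partial<OllamaTags> & Partial<OpenAIModels>;
  if (Array.isArray(p.models)) return p.models.map((m) => m.name); // Ollama format
  if (Array.isArray(p.data)) return p.data.map((m) => m.id);       // OpenAI format
  return []; // unknown shape: no models discovered
}
```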
See packages/cli/src/commands/join.ts:24-50 for implementation details.
Generation Parameters
Gambiarra supports standard OpenAI-compatible generation parameters:

| Parameter | Type | Range | Description |
|---|---|---|---|
| temperature | number | 0-2 | Controls randomness |
| top_p | number | 0-1 | Nucleus sampling |
| max_tokens | number | - | Maximum tokens to generate |
| stop | string[] | - | Stop sequences |
| frequency_penalty | number | -2 to 2 | Penalize token frequency |
| presence_penalty | number | -2 to 2 | Penalize token presence |
| seed | number | - | Deterministic generation |
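The ranges in the table can be enforced client-side before a request goes out. A sketch of such a guard (the interface below mirrors the table only; it is not the actual schema in packages/core/src/types.ts):

```typescript
// Clamp generation parameters to the documented OpenAI-compatible ranges.
interface GenerationParams {
  temperature?: number;       // 0-2
  top_p?: number;             // 0-1
  max_tokens?: number;
  stop?: string[];
  frequency_penalty?: number; // -2 to 2
  presence_penalty?: number;  // -2 to 2
  seed?: number;
}

const clamp = (v: number, lo: number, hi: number) => Math.min(hi, Math.max(lo, v));

function sanitize(p: GenerationParams): GenerationParams {
  return {
    ...p,
    temperature: p.temperature === undefined ? undefined : clamp(p.temperature, 0, 2),
    top_p: p.top_p === undefined ? undefined : clamp(p.top_p, 0, 1),
    frequency_penalty:
      p.frequency_penalty === undefined ? undefined : clamp(p.frequency_penalty, -2, 2),
    presence_penalty:
      p.presence_penalty === undefined ? undefined : clamp(p.presence_penalty, -2, 2),
  };
}
```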
See packages/core/src/types.ts:22-31 for the full schema.
Provider-Specific Considerations
Endpoint Availability
Make sure the provider’s server is running and reachable at its configured endpoint before joining a room.
Model Names
Different providers use different model naming conventions:
- Ollama: Simple names like llama3, mistral
- vLLM: HuggingFace format like meta-llama/Llama-2-7b-chat-hf
- LM Studio: Display names from the GUI
- LocalAI: Custom aliases defined in configuration
Performance
Hardware Requirements:
- CPU-only: 8-16GB RAM minimum, very slow inference
- GPU (8GB VRAM): Good for 7B models
- GPU (16GB+ VRAM): Can run 13B+ models
- GPU (24GB+ VRAM): Can run 30B+ models or quantized 70B
Use the --no-specs flag when joining if you don’t want to share your hardware specifications with other room participants.
Streaming Support
All providers should support streaming responses via Server-Sent Events (SSE). See packages/core/src/hub.ts:284-293 for the hub-side handling.
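An SSE stream arrives as data: lines and ends with a data: [DONE] sentinel. A minimal parser for the content deltas, based on the general OpenAI-compatible streaming format (a sketch, not Gambiarra's actual code):

```typescript
// Extract content deltas from the raw SSE text of a chat-completions stream.
// Chunks look like: data: {"choices":[{"delta":{"content":"Hi"}}]}
// and the stream ends with: data: [DONE]
function extractDeltas(sse: string): string[] {
  const out: string[] = [];
  for (const line of sse.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("data:")) continue; // skip blank lines and comments
    const payload = trimmed.slice(5).trim();
    if (payload === "[DONE]") break;            // end-of-stream sentinel
    const content = JSON.parse(payload).choices?.[0]?.delta?.content;
    if (typeof content === "string") out.push(content);
  }
  return out;
}
```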
Troubleshooting
Model Not Found
If you get “Model not found” errors:
- Verify the model is loaded in your LLM server
- Check the exact model name (case-sensitive)
- Ensure the server is running on the specified endpoint
- Try listing models manually via the endpoint’s /v1/models (or Ollama’s /api/tags) route
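Listing the models programmatically is a quick way to catch case-sensitivity mismatches. A hypothetical helper (assumes Node 18+ for the global fetch; this is illustrative, not Gambiarra's code):

```typescript
// Pull the case-sensitive model IDs an OpenAI-compatible endpoint reports.
function modelIds(payload: { data?: { id: string }[] }): string[] {
  return (payload.data ?? []).map((m) => m.id);
}

async function listModels(base: string): Promise<string[]> {
  const res = await fetch(`${base.replace(/\/+$/, "")}/v1/models`);
  if (!res.ok) throw new Error(`HTTP ${res.status} from ${base}`);
  return modelIds(await res.json());
}
```

Compare the returned IDs against the model name you passed when joining.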
Connection Refused
If connection fails:
- Verify the server is running
- Check firewall rules
- Ensure the correct port and hostname
- Test connectivity to the endpoint directly (e.g. by requesting /v1/models)
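The connectivity test can also be scripted. A quick reachability probe, assuming Node 18+ for global fetch and AbortSignal.timeout (a hypothetical helper, shown for illustration):

```typescript
// Strip trailing slashes so path joining stays clean.
function normalizeEndpoint(url: string): string {
  return url.replace(/\/+$/, "");
}

// Returns false on connection refused, DNS failure, or timeout.
async function isReachable(base: string, timeoutMs = 3000): Promise<boolean> {
  try {
    const res = await fetch(`${normalizeEndpoint(base)}/v1/models`, {
      signal: AbortSignal.timeout(timeoutMs),
    });
    return res.ok;
  } catch {
    return false;
  }
}
```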