## Provider comparison
| Provider | LLM_PROVIDER value | Default model | Privacy | Best for |
|---|---|---|---|---|
| Gemini | gemini | gemini-3.1-flash-lite-preview | Cloud (data sent to Google) | Daily use, large context windows |
| Groq | groq | llama-3.3-70b-versatile | Cloud (data sent to Groq) | Fast inference, modest hardware |
| Ollama | ollama | llama3.2 | 100% local, data never leaves | Air-gapped environments, sensitive projects |
## Switching providers
Open `.env` and change `LLM_PROVIDER`:
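For example, to switch to fully local inference (values per the provider table above):

```ini
# .env
LLM_PROVIDER=ollama   # one of: gemini | groq | ollama
```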
`LLMFactory` reads `LLM_PROVIDER` at startup and instantiates the appropriate client. Every client implements the `BaseLLMClient` interface (`generate()` for synchronous responses and `stream_generate()` for token-by-token streaming), so the rest of the RAG pipeline is provider-agnostic.
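A minimal sketch of this factory pattern. The class and method names follow the description above, but the bodies are placeholders, not the project's actual API calls:

```python
import os
from abc import ABC, abstractmethod
from typing import Iterator

class BaseLLMClient(ABC):
    """Common interface; the real project's signatures may differ."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Return a complete response synchronously."""

    @abstractmethod
    def stream_generate(self, prompt: str) -> Iterator[str]:
        """Yield the response token by token."""

class GeminiClient(BaseLLMClient):
    def generate(self, prompt: str) -> str:
        return f"[gemini] {prompt}"            # stand-in for a real API call

    def stream_generate(self, prompt: str) -> Iterator[str]:
        yield from self.generate(prompt).split()

class GroqClient(GeminiClient):
    def generate(self, prompt: str) -> str:
        return f"[groq] {prompt}"

class OllamaClient(GeminiClient):
    def generate(self, prompt: str) -> str:
        return f"[ollama] {prompt}"

class LLMFactory:
    _registry = {"gemini": GeminiClient, "groq": GroqClient, "ollama": OllamaClient}

    @classmethod
    def create(cls) -> BaseLLMClient:
        # gemini is the documented default provider
        provider = os.getenv("LLM_PROVIDER", "gemini")
        try:
            return cls._registry[provider]()
        except KeyError:
            raise ValueError(f"Unknown LLM_PROVIDER: {provider!r}")
```

The calling code never touches provider-specific classes; it only ever sees `LLMFactory.create()` returning some `BaseLLMClient`.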
### Gemini
Gemini is the default provider. It offers a large context window suitable for complex multi-document prompts with the full RAG context injected.

Obtain a Gemini API key from Google AI Studio. The free tier supports the context sizes used by SoftArchitect AI's default configuration.
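The key is supplied through the environment. The variable name below is an assumption; confirm against `.env.example` for the name your version expects:

```ini
# .env  (hypothetical variable name)
GEMINI_API_KEY=your-key-from-ai-studio
```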
### Groq
Groq provides ultra-fast cloud inference for large open-weight models. It is a good choice when you want near-instant responses and are comfortable with data leaving your machine.

### Ollama
Ollama runs LLMs entirely on your local hardware. No API key is required and no data leaves your network. This is the recommended mode for projects with sensitive architecture decisions or strict data sovereignty requirements.

| Model | RAM required | Use case |
|---|---|---|
| llama3.2 | ~4 GB | General architecture guidance (default) |
| qwen2.5-coder:7b | ~6 GB | Code-heavy architecture and API design |
| phi4-mini | ~3 GB | Low-memory laptops, faster responses |
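If you choose Ollama, download a model from the table before first use:

```shell
# Download the default model
ollama pull llama3.2

# Optional: verify it responds before pointing SoftArchitect AI at it
ollama run llama3.2 "Say hello"
```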
## Hardware optimization
Two environment variables let you tune the RAG context budget to match your hardware and model. The values below come from the `.env.example` defaults:
| Variable | Default | Ollama (8K model) | Gemini / Groq |
|---|---|---|---|
| LLM_MAX_PROMPT_CHARS | 200000 | 30000 | 200000 |
| RAG_MAX_CHUNKS | 3 | 2 | 3–5 |
`LLM_MAX_PROMPT_CHARS` is a hard cap on the fully assembled prompt, measured in characters (roughly 4 characters per token). When the prompt exceeds this value, the orchestrator truncates from the end, so architectural context injected earlier in the prompt is always preserved.
`RAG_MAX_CHUNKS` controls how many per-project semantic search results are injected. Increasing it to 5 gives the LLM more project context but consumes more of the context window.
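How the two knobs interact can be sketched as follows. This is an illustrative model, not the project's actual orchestrator code: the top-ranked chunks are kept up to `RAG_MAX_CHUNKS`, then the assembled prompt is cut at `LLM_MAX_PROMPT_CHARS` from the end so the head survives:

```python
import os

def build_prompt(question: str, ranked_chunks: list[str]) -> str:
    """Assemble a RAG prompt under the configured character budget.

    ranked_chunks must already be sorted by relevance, best first.
    """
    max_chars = int(os.getenv("LLM_MAX_PROMPT_CHARS", "200000"))
    max_chunks = int(os.getenv("RAG_MAX_CHUNKS", "3"))

    # Keep only the top-k semantic search results.
    context = "\n\n".join(ranked_chunks[:max_chunks])
    prompt = f"Context:\n{context}\n\nQuestion: {question}"

    # Hard cap: drop the tail, preserve the earlier context.
    return prompt[:max_chars]
```

With an 8K-context Ollama model you would set `LLM_MAX_PROMPT_CHARS=30000` and `RAG_MAX_CHUNKS=2`, per the table above.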
### Docker resource limits
The Ollama container has configurable memory and CPU limits to prevent it from starving other services. Set `OLLAMA_MEMORY_LIMIT` to 4 GB or more when running 7B+ parameter models locally.
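A minimal docker-compose sketch of such limits, assuming a service named `ollama` and Compose variable interpolation; the CPU variable name is illustrative:

```yaml
services:
  ollama:
    image: ollama/ollama
    # Values read from .env; the fallbacks after ":-" apply when unset
    mem_limit: ${OLLAMA_MEMORY_LIMIT:-4g}
    cpus: ${OLLAMA_CPU_LIMIT:-2.0}
```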