# ollama

Version: 1.0.0
Script: scripts/ollama-mcp-server.mjs

## Environment Variables

## Available Models
Genie Helper uses multiple specialized Ollama models:

| Model | Role | Size |
|---|---|---|
| `qwen-2.5:latest` | Primary agent / code / JSON | 7B |
| `dolphin3:8b-llama3.1-q4_K_M` | Orchestrator / tool planning | 8B |
| `dolphin-mistral:7b` | Uncensored content writer | 7B |
| `phi-3.5:latest` | Fallback classifier | 3.5B |
| `llama3.2:3b` | Lightweight summarizer | 3B |
| `scout-fast-tag:latest` | Fast taxonomy classifier | Custom |
| `bge-m3:latest` | Embeddings | — |
### list-models

List all locally available Ollama models.

Parameters: None

Returns: Array of model objects with `name`, `size`, and `modified_at` fields.

Example Response:
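A plausible response shape, assuming the fields listed above (the names come from the model table; sizes and timestamps are illustrative values only):

```json
[
  {
    "name": "qwen-2.5:latest",
    "size": 4683087332,
    "modified_at": "2024-11-02T14:31:05Z"
  },
  {
    "name": "llama3.2:3b",
    "size": 2019393189,
    "modified_at": "2024-10-28T09:12:44Z"
  }
]
```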
### generate

Generate a completion from an Ollama model (single-turn, no conversation history).

Use case: One-shot text generation, classification, data extraction, or completion tasks that don't require multi-turn context.

Parameters:
- `prompt` (string, required): The prompt to generate from
- `model` (string, optional): Model name (default: `qwen-2.5:latest`)
- `system` (string, optional): System prompt to prepend
- `temperature` (number, optional): Sampling temperature 0-2 (default 0.7)
- `max_tokens` (number, optional): Max tokens to generate
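As a sketch of how these parameters might map onto Ollama's REST `/api/generate` endpoint: `buildGenerateRequest` below is a hypothetical helper, not part of the actual server script; the defaults mirror the parameter list above.

```javascript
// Hypothetical helper: translate the generate tool's parameters into an
// Ollama /api/generate request body. Defaults follow the docs above.
function buildGenerateRequest({ prompt, model, system, temperature, max_tokens }) {
  if (!prompt) throw new Error("prompt is required");
  return {
    model: model ?? "qwen-2.5:latest", // default model from the table above
    prompt,
    system,                            // undefined fields are dropped by JSON.stringify
    stream: false,                     // single-turn: wait for the full completion
    options: {
      temperature: temperature ?? 0.7,
      num_predict: max_tokens,         // Ollama's option name for the max-token cap
    },
  };
}

// Sending it (requires a running Ollama server on the default port):
// const res = await fetch("http://localhost:11434/api/generate", {
//   method: "POST",
//   body: JSON.stringify(buildGenerateRequest({ prompt: "Say hi" })),
// });
```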
### chat

Multi-turn chat with an Ollama model using a messages array. Maintains conversation context across multiple exchanges.

Use case: Conversational AI, multi-step reasoning, context-dependent responses, or any task requiring chat history.

Parameters:
- `messages` (string, required): JSON string array of `{role, content}` objects. Valid roles: `user`, `assistant`, `system`
- `model` (string, optional): Model name (default: `qwen-2.5:latest`)
- `system` (string, optional): System message to prepend to conversation
- `temperature` (number, optional): Sampling temperature 0-2 (default 0.7)
## Message Format

The `messages` parameter for the chat tool must be a JSON string containing an array of message objects:
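For example, a short conversation serialized for the chat tool (the message contents are illustrative; the roles are the three valid values listed above):

```javascript
// The chat tool expects a JSON *string*, not a raw array.
const messages = [
  { role: "system", content: "You are a concise assistant." },
  { role: "user", content: "What is Ollama?" },
  { role: "assistant", content: "A runtime for serving LLMs locally." },
  { role: "user", content: "Name one model it can run." },
];

// Pass this string as the `messages` parameter:
const messagesParam = JSON.stringify(messages);
```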
## Temperature Guide
| Temperature | Use Case | Output Style |
|---|---|---|
| 0.0 - 0.3 | Data extraction, classification, structured output | Deterministic, consistent, focused |
| 0.4 - 0.7 | Balanced responses, Q&A, instructions | Natural, reliable, moderate creativity |
| 0.8 - 1.2 | Creative writing, captions, fan messages | Varied, expressive, engaging |
| 1.3 - 2.0 | Experimental, highly creative content | Unpredictable, very diverse |
Recommended starting points:
- Classification tasks: 0.1-0.3
- Chat responses: 0.5-0.7
- Content generation: 0.8-0.9
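The recommendations above could be captured in a small lookup, for callers that pick a temperature by task type. This is a hypothetical helper with illustrative task names, not part of the server's API:

```javascript
// Hypothetical helper: choose a temperature from the guide above by task type.
function temperatureFor(task) {
  const presets = {
    classification: 0.2, // 0.1-0.3: deterministic, structured output
    chat: 0.6,           // 0.5-0.7: natural, reliable responses
    creative: 0.85,      // 0.8-0.9: varied, expressive content
  };
  return presets[task] ?? 0.7; // fall back to the documented default
}
```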
## Performance Notes

Inference speed (CPU-only VPS):
- Small models (3B): ~1-2s first token
- Medium models (7B): ~2-5s first token
- Large models (8B+): ~5-10s first token

Memory usage:
- 3B models: ~2GB RAM
- 7B models: ~4.8GB RAM
- 8B models: ~5.5GB RAM

Tips:
- Use smaller models (`llama3.2:3b`, `phi-3.5`) for quick tasks
- Reserve larger models (`qwen-2.5`, `dolphin3`) for complex reasoning
- Set `max_tokens` to limit generation time
- Lower temperature for faster, more focused responses
## Error Handling

Common errors and solutions:

- Model not found: Run `list-models` to see available models, or pull the model with `ollama pull model-name:tag`.
- Connection refused: Ensure the Ollama server is running (e.g. `ollama serve`) and reachable at its configured host and port.