The Ollama MCP server provides 3 tools for interacting with locally-running Ollama models. These tools enable text generation, multi-turn chat, and model discovery.
  • Server name: ollama
  • Version: 1.0.0
  • Script: scripts/ollama-mcp-server.mjs

Environment Variables

OLLAMA_URL=http://127.0.0.1:11434
OLLAMA_MODEL=qwen-2.5:latest
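
In a Node script (matching the server's .mjs style), these variables can be read with fallbacks to the documented defaults. A minimal sketch, not the server's actual code:

```javascript
// Read the Ollama connection settings, falling back to the documented defaults.
const OLLAMA_URL = process.env.OLLAMA_URL ?? "http://127.0.0.1:11434";
const OLLAMA_MODEL = process.env.OLLAMA_MODEL ?? "qwen-2.5:latest";
```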

Available Models

Genie Helper uses multiple specialized Ollama models:
Model                          Role                           Size
qwen-2.5:latest                Primary agent / code / JSON    7B
dolphin3:8b-llama3.1-q4_K_M    Orchestrator / tool planning   8B
dolphin-mistral:7b             Uncensored content writer      7B
phi-3.5:latest                 Fallback classifier            3.5B
llama3.2:3b                    Lightweight summarizer         3B
scout-fast-tag:latest          Fast taxonomy classifier       Custom
bge-m3:latest                  Embeddings

list-models

List all locally available Ollama models.
Parameters: None
Returns: Array of model objects with name, size, and modified_at fields.
Example Response:
[
  {
    "name": "qwen-2.5:latest",
    "size": 4368439808,
    "modified_at": "2026-03-01T12:34:56Z"
  },
  {
    "name": "dolphin-mistral:7b",
    "size": 4109865216,
    "modified_at": "2026-02-28T09:15:30Z"
  },
  {
    "name": "llama3.2:3b",
    "size": 2019999744,
    "modified_at": "2026-02-25T14:22:10Z"
  }
]
Usage:
{}
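
A tool like this would typically wrap Ollama's public REST endpoint GET /api/tags, which returns { models: [...] } with additional fields per model. A sketch of that call (the pickModelFields helper is illustrative, not the server's actual code):

```javascript
const OLLAMA_URL = process.env.OLLAMA_URL ?? "http://127.0.0.1:11434";

// Keep only the fields the list-models tool documents.
function pickModelFields(models) {
  return models.map(({ name, size, modified_at }) => ({ name, size, modified_at }));
}

// GET /api/tags lists all locally available models.
async function listModels() {
  const res = await fetch(`${OLLAMA_URL}/api/tags`);
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  const { models } = await res.json();
  return pickModelFields(models);
}
```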

generate

Generate a completion from an Ollama model (single-turn, no conversation history).
Use case: One-shot text generation, classification, data extraction, or completion tasks that don’t require multi-turn context.
Parameters:
  • prompt (string, required): The prompt to generate from
  • model (string, optional): Model name (default: qwen-2.5:latest)
  • system (string, optional): System prompt to prepend
  • temperature (number, optional): Sampling temperature 0-2 (default 0.7)
  • max_tokens (number, optional): Max tokens to generate
Returns: Generated text response.
Example 1: Simple generation
{
  "prompt": "Write a short engaging caption for a beach photo",
  "model": "dolphin-mistral:7b",
  "temperature": 0.9
}
Response:
Sun-kissed waves and endless summer vibes 🌊☀️ Living my best life one beach day at a time
Example 2: With system prompt
{
  "prompt": "Rate this content on a scale of 1-10 for engagement potential: 'Check out my new outfit'",
  "system": "You are a social media analytics expert. Provide concise, data-driven ratings.",
  "model": "qwen-2.5:latest",
  "temperature": 0.3,
  "max_tokens": 100
}
Response:
Engagement Rating: 4/10

Reasoning: Generic phrasing lacks specificity. No visual hooks, call-to-action, or personality. Recommend adding descriptive details, emoji, or question to boost interaction.
Example 3: JSON extraction
{
  "prompt": "Extract structured data from this text and return as JSON: 'Posted 3 hours ago, 142 likes, 28 comments, trending'",
  "system": "Extract data into JSON format with keys: posted_time, likes, comments, status. Use null if data missing.",
  "model": "qwen-2.5:latest",
  "temperature": 0.1
}
Response:
{
  "posted_time": "3 hours ago",
  "likes": 142,
  "comments": 28,
  "status": "trending"
}
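
A generate call like the examples above would typically map onto Ollama's POST /api/generate endpoint. Note that in Ollama's REST API the token limit is named num_predict (inside options), not max_tokens. A sketch, assuming the tool forwards parameters this way:

```javascript
const OLLAMA_URL = process.env.OLLAMA_URL ?? "http://127.0.0.1:11434";

// Map the tool's parameters onto an /api/generate request body.
// Ollama calls the max-token limit "num_predict", not "max_tokens".
function buildGenerateBody({ prompt, model = "qwen-2.5:latest", system, temperature = 0.7, max_tokens }) {
  const body = { model, prompt, stream: false, options: { temperature } };
  if (system) body.system = system;
  if (max_tokens != null) body.options.num_predict = max_tokens;
  return body;
}

async function generate(params) {
  const res = await fetch(`${OLLAMA_URL}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildGenerateBody(params)),
  });
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  return (await res.json()).response; // the generated text
}
```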

chat

Multi-turn chat with an Ollama model using a messages array. Maintains conversation context across multiple exchanges.
Use case: Conversational AI, multi-step reasoning, context-dependent responses, or any task requiring chat history.
Parameters:
  • messages (string, required): JSON string array of {role, content} objects
    • Valid roles: user, assistant, system
  • model (string, optional): Model name (default: qwen-2.5:latest)
  • system (string, optional): System message to prepend to conversation
  • temperature (number, optional): Sampling temperature 0-2 (default 0.7)
Returns: Assistant’s message content.
Example 1: Basic conversation
{
  "messages": "[{\"role\":\"user\",\"content\":\"What are the best times to post on Instagram?\"},{\"role\":\"assistant\",\"content\":\"Peak engagement times are typically 11am-1pm and 7pm-9pm on weekdays.\"},{\"role\":\"user\",\"content\":\"What about for adult content creators specifically?\"}]",
  "model": "qwen-2.5:latest"
}
Response:
For adult content creators, optimal posting times shift slightly:

- Late evening (9pm-12am) when audience is most active
- Weekend mornings (10am-12pm) for casual browsing
- Avoid 3pm-5pm when users may be at work/school

Consider your audience's timezone and test different schedules to find your peak engagement windows.
Example 2: With system context
{
  "messages": "[{\"role\":\"user\",\"content\":\"Help me draft a message to a top fan who just renewed their subscription\"}]",
  "system": "You are Genie Helper, an AI assistant for adult content creators. Be professional, friendly, and help with fan engagement. The creator's brand voice is warm and appreciative.",
  "model": "dolphin-mistral:7b",
  "temperature": 0.8
}
Response:
Hey! 💜 I just saw you renewed and I'm seriously so grateful for your support! You've been with me since the beginning and that means the world. I've got some exclusive content dropping this week that I think you're going to love 😊 Thank you for believing in me!
Example 3: Multi-step reasoning
{
  "messages": "[{\"role\":\"user\",\"content\":\"I want to watermark my photos\"},{\"role\":\"assistant\",\"content\":\"I can help you watermark your photos. Do you want to use your username or a custom logo?\"},{\"role\":\"user\",\"content\":\"Username, bottom right corner\"},{\"role\":\"assistant\",\"content\":\"Got it. Would you like the watermark to be subtle (semi-transparent) or prominent (solid)?\"},{\"role\":\"user\",\"content\":\"Subtle\"}]",
  "model": "qwen-2.5:latest",
  "temperature": 0.5
}
Response:
Perfect! I'll apply a semi-transparent watermark with your username in the bottom right corner. This will protect your content while keeping it visually clean. Would you like me to process all images in your media library or just selected ones?
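
A chat call like the examples above would typically wrap Ollama's POST /api/chat endpoint: the tool's JSON-string messages parameter is parsed, an optional system message is prepended, and the reply text comes back in message.content. A sketch, assuming that mapping:

```javascript
const OLLAMA_URL = process.env.OLLAMA_URL ?? "http://127.0.0.1:11434";

// The tool takes `messages` as a JSON *string*; parse it, then optionally
// prepend a system message before sending to /api/chat.
function buildChatMessages(messagesJson, system) {
  const messages = JSON.parse(messagesJson);
  return system ? [{ role: "system", content: system }, ...messages] : messages;
}

async function chat({ messages, model = "qwen-2.5:latest", system, temperature = 0.7 }) {
  const res = await fetch(`${OLLAMA_URL}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model,
      messages: buildChatMessages(messages, system),
      stream: false,
      options: { temperature },
    }),
  });
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  return (await res.json()).message.content; // assistant reply text
}
```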

Message Format

The messages parameter for the chat tool must be a JSON string containing an array of message objects:
[
  {
    "role": "system",
    "content": "You are a helpful assistant."
  },
  {
    "role": "user",
    "content": "Hello!"
  },
  {
    "role": "assistant",
    "content": "Hi! How can I help you today?"
  },
  {
    "role": "user",
    "content": "Tell me about your features."
  }
]
Important: The entire array must be stringified when passing to the tool:
{
  "messages": "[{\"role\":\"user\",\"content\":\"Hello!\"}]"
}
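
In practice it is easier to build the conversation as a normal array and stringify it once, rather than hand-escaping quotes. A minimal sketch:

```javascript
// Build the conversation as a plain array, then stringify it once --
// this avoids hand-escaping quotes inside the JSON string.
const conversation = [
  { role: "user", content: "What are the best times to post on Instagram?" },
];

// A valid value for the chat tool's `messages` parameter:
const toolArgs = { messages: JSON.stringify(conversation) };
```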

Temperature Guide

Temperature    Use Case                                             Output Style
0.0 - 0.3      Data extraction, classification, structured output   Deterministic, consistent, focused
0.4 - 0.7      Balanced responses, Q&A, instructions                Natural, reliable, moderate creativity
0.8 - 1.2      Creative writing, captions, fan messages             Varied, expressive, engaging
1.3 - 2.0      Experimental, highly creative content                Unpredictable, very diverse
Genie Helper defaults:
  • Classification tasks: 0.1-0.3
  • Chat responses: 0.5-0.7
  • Content generation: 0.8-0.9
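
The defaults above can be encoded in a small lookup. The temperatureFor helper is hypothetical, not part of the server; the values are midpoints of the documented ranges:

```javascript
// Hypothetical helper: pick a temperature from the Genie Helper defaults above.
function temperatureFor(task) {
  switch (task) {
    case "classification": return 0.2;  // within the 0.1-0.3 range
    case "chat":           return 0.6;  // within the 0.5-0.7 range
    case "generation":     return 0.85; // within the 0.8-0.9 range
    default:               return 0.7;  // the tools' documented default
  }
}
```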

Performance Notes

Inference speed (CPU-only VPS):
  • Small models (3B): ~1-2s first token
  • Medium models (7B): ~2-5s first token
  • Large models (8B+): ~5-10s first token
Memory usage:
  • 3B models: ~2GB RAM
  • 7B models: ~4.8GB RAM
  • 8B models: ~5.5GB RAM
Recommendations:
  • Use smaller models (llama3.2:3b, phi-3.5) for quick tasks
  • Reserve larger models (qwen-2.5, dolphin3) for complex reasoning
  • Set max_tokens to limit generation time
  • Lower temperature for faster, more focused responses
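
The routing advice above can be sketched as a simple model picker. The pickModel helper and its task names are hypothetical; only the model names come from the table earlier in this page:

```javascript
// Hypothetical routing helper following the recommendations above:
// small models for quick tasks, larger models for complex reasoning.
function pickModel(task) {
  if (task === "summarize") return "llama3.2:3b";          // lightweight summarizer
  if (task === "fallback-classify") return "phi-3.5:latest"; // fallback classifier
  if (task === "orchestrate") return "dolphin3:8b-llama3.1-q4_K_M";
  return "qwen-2.5:latest"; // default: code, JSON, complex reasoning
}
```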

Error Handling

Common errors and solutions:

Model not found:
Error: model 'model-name:tag' not found
Solution: Use list-models to see available models, or pull the model with ollama pull model-name:tag

Connection refused:
Error: fetch failed (connection refused)
Solution: Ensure the Ollama service is running on port 11434

Out of memory:
Error: failed to load model
Solution: Use a smaller quantized model or free up system memory
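
Client code can map these failure modes to actionable messages. The describeOllamaError helper below is a hypothetical sketch, keyed on the error strings listed above:

```javascript
// Hypothetical helper: map the common Ollama failure modes above
// to actionable messages for the caller.
function describeOllamaError(err, model) {
  const msg = String(err && err.message ? err.message : err);
  if (/not found/i.test(msg)) {
    return `Model '${model}' is missing -- run: ollama pull ${model}`;
  }
  if (/fetch failed|ECONNREFUSED|connection refused/i.test(msg)) {
    return "Cannot reach Ollama -- ensure the service is running on port 11434";
  }
  if (/failed to load model/i.test(msg)) {
    return "Out of memory -- try a smaller quantized model or free up system memory";
  }
  return msg; // unknown error: pass it through unchanged
}
```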
