## Model Inventory
| Model | Size | Role | Typical Use Case |
|---|---|---|---|
| dolphin3:8b-llama3.1-q4_K_M | 4.9GB | Orchestrator | Tool planning, ACTION tag emission, multi-step reasoning |
| dolphin-mistral:7b | 4.1GB | Uncensored writer | NSFW captions, adult content drafts, fan messages |
| qwen-2.5:latest | 4.8GB | Primary agent | AnythingLLM agent, code generation, JSON output |
| phi-3.5:latest | 2.2GB | Fallback classifier | Lightweight taxonomy tagging when scout-fast-tag fails |
| llama3.2:3b | 2.0GB | Summarizer | Quick summaries, metadata extraction, lightweight tasks |
| scout-fast-tag:latest | 1.2GB | Taxonomy classifier | Fast 6-concept taxonomy tagging (custom SmolLM fine-tune) |
| bge-m3:latest | 2.4GB | Embeddings | Document embeddings for RAG, semantic search |
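Since several of these models can be resident at once, it helps to sanity-check combinations against the ~10GB RAM headroom noted under Hardware Constraints. A minimal sketch; the helper and the budget constant are illustrative, and on-disk size is used as a rough proxy for resident memory:

```python
# Model sizes (GB) from the inventory table above; for q4-quantized models
# the on-disk size is a rough proxy for resident RAM use.
MODEL_SIZES_GB = {
    "dolphin3:8b-llama3.1-q4_K_M": 4.9,
    "dolphin-mistral:7b": 4.1,
    "qwen-2.5:latest": 4.8,
    "phi-3.5:latest": 2.2,
    "llama3.2:3b": 2.0,
    "scout-fast-tag:latest": 1.2,
    "bge-m3:latest": 2.4,
}

RAM_BUDGET_GB = 10.0  # approximate headroom left for Ollama on the VPS


def fits_in_budget(models, budget_gb=RAM_BUDGET_GB):
    """Return (total_gb, fits) for keeping the given models loaded at once."""
    total = sum(MODEL_SIZES_GB[m] for m in models)
    return round(total, 1), total <= budget_gb
```

For example, qwen + bge-m3 + scout-fast-tag fits (8.4GB); adding dolphin-mistral on top (12.5GB) does not.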
## Model Roles
### Primary Agent: qwen-2.5:latest

Used by: AnythingLLM workspace (default)

Strengths:
- Best code generation in the 7B class
- Reliable JSON output for structured tasks
- Good instruction following
- Handles most general-purpose agent tasks

Weaknesses:
- Tool calling is unreliable (hence Action Runner bypass)
- First token latency: ~33s in agent mode on CPU-only VPS
- Will stall on CPU-only hardware if context > 8K tokens

Use for:
- Default for all AnythingLLM chat interactions
- Code generation (scripts, flows, API integrations)
- Structured data extraction
- Technical Q&A
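For the structured-extraction tasks above, the reliable path with Qwen is Ollama's JSON mode (`"format": "json"` on `/api/generate`). A request-builder sketch; the endpoint and field names follow the public Ollama API, while the prompt wording and function name are illustrative:

```python
def build_extraction_request(text, schema_hint, model="qwen-2.5:latest"):
    """Build an Ollama /api/generate payload that forces valid-JSON output."""
    prompt = (
        f"Extract the following fields as JSON: {schema_hint}\n\n"
        f"Text:\n{text}"
    )
    return {
        "model": model,
        "prompt": prompt,
        "format": "json",   # Ollama constrains decoding to valid JSON
        "stream": False,
        "options": {"temperature": 0},  # deterministic output for extraction
    }
```

POST the returned payload to `http://localhost:11434/api/generate` and parse the `response` field as JSON.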
### Orchestrator: dolphin3:8b-llama3.1-q4_K_M

Used by: Action Runner flow planning (optional)

Strengths:
- Excellent at multi-step planning
- Reliable ACTION tag emission
- Uncensored but still coherent for reasoning tasks
- Better tool selection than Qwen-2.5

Weaknesses:
- Slower inference (~45s first token on CPU)
- Larger memory footprint (4.9GB)
- Currently stalls on production VPS (CPU-only)

Use for:
- Complex workflows requiring multi-step reasoning
- When you need better tool selection than Qwen-2.5
- Note: requires a GPU VPS, or wait for the hardware upgrade
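The ACTION tags the orchestrator emits are consumed by the Action Runner. As a hedged sketch, assuming a tag shape like `[ACTION: tool(arg=value, ...)]` (the real syntax is defined by the Action Runner docs, not here), a parser might look like:

```python
import re

# Assumed tag shape: [ACTION: tool_name(key=value, key=value)]
ACTION_RE = re.compile(r"\[ACTION:\s*(\w+)\(([^)]*)\)\]")


def parse_actions(model_output):
    """Extract (tool, {arg: value}) pairs from an orchestrator response."""
    actions = []
    for name, raw_args in ACTION_RE.findall(model_output):
        args = {}
        for pair in filter(None, (p.strip() for p in raw_args.split(","))):
            key, _, value = pair.partition("=")
            args[key.strip()] = value.strip()
        actions.append((name, args))
    return actions
```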
### Content Writer: dolphin-mistral:7b

Used by: Caption generation, fan message drafting

Strengths:
- Fully uncensored (no refusals on NSFW content)
- Creative, natural-sounding prose
- Good at matching tone/style from examples
- Fast inference (~2-5s first token)

Weaknesses:
- Less structured than Qwen (worse for JSON)
- Can be overly verbose
- Not suitable for technical tasks

Use for:
- Drafting OnlyFans/Fansly captions
- Fan engagement messages (DMs, comments)
- Content descriptions and marketing copy
- Any NSFW text generation
Endpoint: `POST /api/captions/generate` (uses this model)
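The caption endpoint ultimately assembles a few-shot prompt so dolphin-mistral can match tone and style from examples, as noted above. A sketch of that assembly; the function name, parameter choices, and prompt wording are assumptions, not the endpoint's actual contract:

```python
def build_caption_prompt(description, example_captions, max_examples=3):
    """Assemble a few-shot prompt so the model mimics the creator's voice.

    Caps the number of examples to keep the prompt short, since the
    model tends toward verbosity anyway.
    """
    shots = "\n".join(f"- {c}" for c in example_captions[:max_examples])
    return (
        "Write a caption in the same voice as these examples:\n"
        f"{shots}\n\n"
        f"New post: {description}\nCaption:"
    )
```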
### Taxonomy Classifier: scout-fast-tag:latest

Used by: taxonomy-tag ACTION flow

Strengths:
- Custom fine-tune on 3,208 taxonomy tags
- Fast inference (~500ms per image)
- Optimized for 6-concept classification
- Low memory footprint (1.2GB)

Weaknesses:
- Single-purpose model (only taxonomy)
- Requires fallback to `phi-3.5` on failure
- Limited to visual content classification

Use for:
- Auto-tagging uploaded media
- Batch classification of scraped content
- Content audit (NSFW detection, theme analysis)
### Fallback Classifier: phi-3.5:latest

Used by: Taxonomy tagging when scout-fast-tag fails

Strengths:
- General-purpose vision model
- More robust on edge cases
- Can handle text + image input

Weaknesses:
- Slower than scout-fast-tag (~3-5s)
- Less accurate on Genie Helper’s specific taxonomy
- May refuse NSFW content (has guardrails)

Use for:
- Backup for failed scout-fast-tag calls
- Mixed content (text + image analysis)
- When you need explanations for classifications
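The primary/fallback handoff between scout-fast-tag and phi-3.5 can be sketched as below, with plain callables standing in for the real model calls:

```python
def classify_with_fallback(image_path, primary, fallback, logger=print):
    """Try scout-fast-tag first; fall back to phi-3.5 if it errors out.

    `primary` and `fallback` are callables taking a path and returning a
    list of taxonomy tags; they stand in for the real Ollama calls.
    Returns (tags, model_used).
    """
    try:
        return primary(image_path), "scout-fast-tag"
    except Exception as exc:
        logger(f"scout-fast-tag failed ({exc}); retrying with phi-3.5")
        return fallback(image_path), "phi-3.5"
```

Note that phi-3.5 may still refuse NSFW inputs, so a refusal check on the fallback's output is worth adding in practice.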
### Summarizer: llama3.2:3b

Used by: Quick summaries, metadata extraction

Strengths:
- Fastest inference (~1-2s first token)
- Lowest memory footprint (2GB)
- Good at extracting key points

Weaknesses:
- Limited context window (4K tokens)
- Less coherent on complex topics
- Not suitable for multi-turn chat

Use for:
- Summarizing scraped profile data
- Extracting metadata from long text
- Quick content previews
- Notifications and alerts
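Because of the 4K-token context limit, long inputs (e.g. scraped profile data) need chunking before they reach the summarizer. A sketch using the rough 4-characters-per-token heuristic; the heuristic and helper name are illustrative:

```python
def chunk_for_summarizer(text, max_tokens=4000, chars_per_token=4):
    """Split long text into chunks that fit llama3.2's ~4K context.

    Uses the rough 4-chars-per-token heuristic; in practice, budget
    below the full window to leave room for the prompt wrapper and
    the generated summary itself.
    """
    max_chars = max_tokens * chars_per_token
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

Each chunk is summarized separately, and the per-chunk summaries can be concatenated and summarized once more if needed.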
### Embeddings: bge-m3:latest

Used by: AnythingLLM document embeddings, RAG

Strengths:
- State-of-the-art multilingual embeddings
- Fast batch processing
- Supports 8K token context

Use for:
- Ingesting documents into AnythingLLM workspaces
- Semantic search over creator content
- Memory recall (“What did I post last week?”)
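Once bge-m3 has produced vectors, semantic search reduces to cosine similarity over them. A toy sketch with hand-rolled vectors standing in for real embeddings (in practice Ollama's embeddings endpoint supplies them; AnythingLLM handles this internally):

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, doc_vecs, k=3):
    """Rank stored document vectors against a query vector.

    `doc_vecs` maps document ids to their embedding vectors.
    """
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]
```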
## Hardware Constraints

Current server: IONOS dedicated VPS, CPU-only (no GPU)

Performance characteristics:
- Models up to 7B: acceptable performance (2-5s first token)
- Models 8B+: stall on CPU-only hardware (dolphin3:8b takes 45s+)
- Recommended limit: 7B quantized models (q4_K_M)

Upgrade options:
- GPU VPS: enables dolphin3:8b and larger models
- Alternative: stick with Qwen-2.5 and optimize Action Runner flows

Resource budget:
- ~10GB RAM available for Ollama (after OS + other services)
- ~33 concurrent Stagehand browser sessions (300MB each)
- FFmpeg clip generation is the real bottleneck (~30s of CPU per clip)
## Model Selection Logic
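A plausible task-type-to-model routing, consistent with the roles described above; the task names and helper are illustrative, not a fixed enum:

```python
# Task-type → model routing, following the roles in Model Roles above.
ROUTES = {
    "chat": "qwen-2.5:latest",
    "code": "qwen-2.5:latest",
    "json_extraction": "qwen-2.5:latest",
    "nsfw_writing": "dolphin-mistral:7b",
    "caption": "dolphin-mistral:7b",
    "taxonomy": "scout-fast-tag:latest",
    "summary": "llama3.2:3b",
    "embedding": "bge-m3:latest",
    # "planning": "dolphin3:8b-llama3.1-q4_K_M",  # pending GPU upgrade
}


def select_model(task_type, default="qwen-2.5:latest"):
    """Pick a model for a task, defaulting to the primary agent."""
    return ROUTES.get(task_type, default)
```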
## Configuration
### Environment Variables
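A sketch of how a service might read its Ollama configuration; the variable names here are illustrative defaults, not the project's actual keys, and only the port 11434 comes from upstream Ollama's default:

```python
import os

# Illustrative configuration keys; 11434 is Ollama's default listen port.
OLLAMA_BASE_URL = os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434")
DEFAULT_MODEL = os.environ.get("DEFAULT_MODEL", "qwen-2.5:latest")
EMBEDDING_MODEL = os.environ.get("EMBEDDING_MODEL", "bge-m3:latest")
```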
### AnythingLLM Workspace Settings

Administrator workspace:
- LLM: `qwen-2.5:latest`
- Embeddings: `bge-m3:latest`
- Agent mode: Enabled
- Temperature: 0.7
- Max tokens: 4096

Other workspaces:
- Same settings, isolated context
- Potential routing to dolphin3:8b after GPU upgrade
## Related

- MCP Servers — The `ollama` MCP server exposes these models
- Action Runner — Uses models via the `ollama` step type
- Taxonomy System — Uses scout-fast-tag for classification
- Endpoint: `POST /api/captions/generate` — Uses dolphin-mistral:7b
