Grip AI provides multiple strategies to reduce LLM costs while maintaining quality for complex tasks.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/5unnykum4r/grip-ai/llms.txt
Use this file to discover all available pages before exploring further.
Model Tier Routing
Automatic model selection based on prompt complexity. Simple queries use cheap models, complex tasks use premium models.Configuration
Fromgrip/config/schema.py:136:
Setup
Enable tiered routing in~/.grip/config.json:
How Complexity Detection Works
The router analyzes prompt characteristics: Low Complexity (useslow model):
- Greetings and small talk: “Hello”, “How are you?”
- Simple lookups: “What time is it?”, “Current Bitcoin price”
- Basic regex/formatting: “Extract emails from this text”
- Single-step operations: “List files in /tmp”
medium or default model):
- Code changes: “Fix this bug”, “Add error handling”
- Explanations: “Explain how JWT works”
- Multi-step tasks: “Search for X and summarize findings”
- Data analysis: “Analyze this CSV and find trends”
high model):
- System design: “Design a scalable microservices architecture”
- Large refactors: “Refactor this module to use async/await”
- Debugging: “Find why this race condition occurs”
- Research synthesis: “Compare React vs Vue for enterprise apps”
Cost Savings Example
Without Tier Routing:Consolidation Model
Use a cheaper model for session compaction and summarization.Configuration
Fromgrip/config/schema.py:91:
~/.grip/config.json:
How It Works
When conversation history exceeds2 × memory_window messages, grip automatically:
- Sends old messages to the
consolidation_model - Generates a concise summary (typically 200-500 tokens)
- Replaces old messages with the summary
- Keeps recent
memory_windowmessages intact
Manual Consolidation
From
grip/engines/litellm_engine.py:138, consolidation is implemented in the engine layer and works with any model combination.Choosing Cost-Effective Models
Budget Models
Google Gemini Flash 2.0 (openrouter/google/gemini-flash-2.0)
- Cost: ~$0.0001 per request (100x cheaper than Claude)
- Speed: 2-3x faster than Sonnet
- Best for: Lookups, simple Q&A, data extraction, summarization
- Limitations: Weaker reasoning, less reliable tool use
openrouter/openai/gpt-4o-mini)
- Cost: ~$0.0005 per request (30x cheaper than Claude)
- Speed: Very fast
- Best for: Code formatting, regex tasks, simple refactors
- Limitations: Shorter context, less creative
anthropic/claude-haiku-4)
- Cost: ~$0.003 per request (5x cheaper than Sonnet)
- Speed: Fastest Claude model
- Best for: Tool-heavy workflows, data processing, quick iterations
- Limitations: Not as strong for complex reasoning
Premium Models
Claude Sonnet-4 (anthropic/claude-sonnet-4)
- Cost: ~$0.015 per request
- Best for: General-purpose tasks, code generation, analysis
- Sweet spot: Best quality/cost ratio for most coding tasks
openrouter/anthropic/claude-opus-4)
- Cost: ~$0.075 per request (5x more than Sonnet)
- Best for: System design, complex debugging, research synthesis
- When to use: Only when task quality justifies the cost
openrouter/openai/gpt-4-turbo)
- Cost: ~$0.04 per request
- Best for: Math, structured output, code execution validation
- Trade-off: Cheaper than Opus, slightly different strengths
Memory Window Optimization
Reduce tokens sent per request by limiting conversation history.Configuration
memory_window: 10-20):
- Pros: Very low token usage, fast responses
- Cons: Agent forgets context quickly
- Best for: Single-task sessions, tool-heavy automation
memory_window: 30-50, default):
- Pros: Good balance of cost and context retention
- Cons: May need consolidation every 100 messages
- Best for: General interactive use
memory_window: 100-200):
- Pros: Excellent context retention, fewer consolidations
- Cons: High token usage (2-5K tokens per request)
- Best for: Complex multi-turn debugging or architecture discussions
Max Tool Iterations Limit
Prevent runaway costs from infinite tool loops.grip/config/schema.py:76):
0= Unlimited iterations (default, see long-running tasks)N= Stop after N tool call rounds, even if task is incomplete
- Simple tasks:
max_tool_iterations: 5(file operations, lookups) - Code tasks:
max_tool_iterations: 15(build, test, fix) - Research:
max_tool_iterations: 10(search, fetch, analyze)
Each iteration costs 1K-8K tokens depending on tool output size. A 15-iteration task with an 8K-token model can consume 120K tokens total.
Semantic Caching
Cache identical queries to avoid re-processing.grip/config/schema.py:100):
- Identical user messages return cached responses
- Cache expires after
semantic_cache_ttlseconds (default 1 hour) - Saves 100% of tokens for repeated queries
- FAQ-style queries: “What time is it?”, “Show me the logs”
- Repeated analysis: Re-running the same report
- Development: Testing the same prompt multiple times
- Only caches exact message matches (no fuzzy matching)
- Session-specific (cache key includes session_key)
- Stored in workspace
state/semantic_cache.db
Token Budget Enforcement
Set daily token limits to prevent cost overruns.grip/config/schema.py:110):
0= Unlimited (default)N= Stop all agent runs after N total tokens used today- Counts both prompt tokens and completion tokens
- Resets at midnight UTC
- Light use:
100,000tokens/day (~$1.50 with Sonnet) - Medium use:
500,000tokens/day (~$7.50 with Sonnet) - Heavy use:
2,000,000tokens/day (~$30 with Sonnet)