Cost Optimization

Grip AI provides multiple strategies to reduce LLM costs while maintaining quality for complex tasks.

Model Tier Routing

Grip can select a model automatically based on prompt complexity: simple queries go to cheap models, and complex tasks go to premium models.

Configuration

From grip/config/schema.py:136:

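# Imports the excerpt relies on (assumed to be pydantic, given BaseModel and Field)
from pydantic import BaseModel, Field


class ModelTiersConfig(BaseModel):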
    """Model overrides per complexity tier for the cost-aware router.
    
    Leave a tier empty to use agents.defaults.model for that complexity
    level. Only tiers with a model set will be routed differently.
    Example: set low to a fast/cheap model like gemini-flash, leave
    medium empty (uses default), and set high to claude-opus.
    """
    
    enabled: bool = Field(
        default=False,
        description="Enable automatic model routing based on prompt complexity.",
    )
    low: str = Field(
        default="",
        description="Model for simple queries (greetings, lookups, regex).",
    )
    medium: str = Field(
        default="",
        description="Model for moderate tasks (code changes, explanations).",
    )
    high: str = Field(
        default="",
        description="Model for complex tasks (architecture, refactors, debugging).",
    )

Setup

Enable tiered routing in ~/.grip/config.json:

{
  "agents": {
    "defaults": {
      "model": "openrouter/anthropic/claude-sonnet-4"
    },
    "model_tiers": {
      "enabled": true,
      "low": "openrouter/google/gemini-flash-2.0",
      "medium": "",
      "high": "openrouter/anthropic/claude-opus-4"
    }
  }
}

Leave medium empty to use your default model for medium-complexity tasks. Only override the tiers where cost savings matter most.
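The fallback rule above can be sketched in a few lines of Python. This is only an illustration of the documented behavior (an empty tier falls back to agents.defaults.model); resolve_model is a hypothetical helper, not a function from grip itself.

```python
from typing import Any, Mapping


def resolve_model(tier: str, agents_config: Mapping[str, Any]) -> str:
    """Pick the model for a complexity tier, falling back to the default model.

    Hypothetical helper: mirrors the documented rule, not grip's actual code.
    """
    default_model = agents_config["defaults"]["model"]
    tiers = agents_config.get("model_tiers", {})
    # Routing disabled: every tier uses the default model.
    if not tiers.get("enabled", False):
        return default_model
    # An empty or missing tier also falls back to the default model.
    return tiers.get(tier) or default_model


agents = {
    "defaults": {"model": "openrouter/anthropic/claude-sonnet-4"},
    "model_tiers": {
        "enabled": True,
        "low": "openrouter/google/gemini-flash-2.0",
        "medium": "",
        "high": "openrouter/anthropic/claude-opus-4",
    },
}

for tier in ("low", "medium", "high"):
    print(tier, "->", resolve_model(tier, agents))
```

With the configuration shown, only low and high are routed away from the default model.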

How Complexity Detection Works

The router analyzes prompt characteristics and assigns each prompt to one of three tiers.

Low Complexity (uses low model):

Greetings and small talk: “Hello”, “How are you?”
Simple lookups: “What time is it?”, “Current Bitcoin price”
Basic regex/formatting: “Extract emails from this text”
Single-step operations: “List files in /tmp”

Medium Complexity (uses medium or default model):

Code changes: “Fix this bug”, “Add error handling”
Explanations: “Explain how JWT works”
Multi-step tasks: “Search for X and summarize findings”
Data analysis: “Analyze this CSV and find trends”

High Complexity (uses high model):

System design: “Design a scalable microservices architecture”
Large refactors: “Refactor this module to use async/await”
Debugging: “Find why this race condition occurs”
Research synthesis: “Compare React vs Vue for enterprise apps”
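This page does not describe how the router scores a prompt, so the following is only a toy keyword classifier that reproduces the example prompts listed above. The hint lists and the classify_complexity function are assumptions made for illustration and should not be read as grip's actual detection logic.

```python
# Toy heuristic only: the keyword lists are assumptions, not grip's rules.
HIGH_HINTS = ("architecture", "design", "refactor", "race condition", "compare")
MEDIUM_HINTS = ("fix", "explain", "summarize", "analyze", "add ")


def classify_complexity(prompt: str) -> str:
    """Return "low", "medium", or "high" based on simple keyword matches."""
    text = prompt.lower()
    # Check the most expensive tier first so strong signals win.
    if any(hint in text for hint in HIGH_HINTS):
        return "high"
    if any(hint in text for hint in MEDIUM_HINTS):
        return "medium"
    # Anything without a matching hint is treated as a simple query.
    return "low"


for prompt in ("Hello", "Explain how JWT works", "Find why this race condition occurs"):
    print(f"{prompt!r} -> {classify_complexity(prompt)}")
```

A production router would likely weigh more signals (prompt length, attached files, tool requirements) than a keyword match can capture.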

Cost Savings Example

Without Tier Routing:

100 daily queries × Claude Sonnet-4 avg cost
= 100 × $0.015 = $1.50/day = $45/month

With Tier Routing (70% low, 20% medium, 10% high):

70 × Gemini Flash ($0.0001) = $0.007
20 × Claude Sonnet-4 ($0.015) = $0.30
10 × Claude Opus-4 ($0.075) = $0.75
Total: $1.057/day = $31.71/month (about 30% savings)
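The arithmetic above can be rerun with your own traffic mix. The sketch below uses the per-request costs and the 70/20/10 split from this example plus a 30-day month; the variable and function names are illustrative.

```python
DAILY_QUERIES = 100
DAYS_PER_MONTH = 30

# Average cost per request, taken from the example above.
COST_PER_REQUEST = {"low": 0.0001, "medium": 0.015, "high": 0.075}
# Share of daily queries landing in each tier.
TRAFFIC_MIX = {"low": 0.70, "medium": 0.20, "high": 0.10}


def daily_cost(queries: int, mix: dict, cost: dict) -> float:
    """Blend per-request costs by the share of traffic in each tier."""
    return sum(queries * share * cost[tier] for tier, share in mix.items())


baseline = DAILY_QUERIES * COST_PER_REQUEST["medium"]  # everything on Sonnet-4
routed = daily_cost(DAILY_QUERIES, TRAFFIC_MIX, COST_PER_REQUEST)
savings = 1 - routed / baseline

print(f"baseline: ${baseline * DAYS_PER_MONTH:.2f}/month")
print(f"routed:   ${routed * DAYS_PER_MONTH:.2f}/month")
print(f"savings:  {savings:.1%}")
```

Note that the high tier dominates the routed cost, so the savings are sensitive to how many prompts are classified as complex.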

Consolidation Model

Use a cheaper model for session compaction and summarization.

Configuration

From grip/config/schema.py:91:

consolidation_model: str = Field(
    default="",
    description="LLM model for summarization/consolidation. Empty = use main model. "
    "Set to a cheaper model (e.g. openrouter/google/gemini-flash-2.0) to save tokens.",
)

Set in ~/.grip/config.json:

{
  "agents": {
    "defaults": {
      "model": "openrouter/anthropic/claude-sonnet-4",
      "consolidation_model": "openrouter/google/gemini-flash-2.0",
      "auto_consolidate": true,
      "memory_window": 50
    }
  }
}

How It Works

When conversation history exceeds 2 × memory_window messages, grip automatically:

Sends old messages to the consolidation_model
Generates a concise summary (typically 200-500 tokens)
Replaces old messages with the summary
Keeps recent memory_window messages intact

Example:

Before consolidation: 120 messages (48K tokens)
After consolidation: 50 recent messages + 1 summary (20K tokens)
Savings: 28K tokens per subsequent request
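The trigger and replacement steps can be sketched as follows. This is a minimal illustration of the rule described above, not grip's engine code; maybe_consolidate and the summarize callback (which stands in for the call to consolidation_model) are hypothetical names.

```python
from typing import Callable, List


def maybe_consolidate(
    messages: List[str],
    memory_window: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Replace old messages with one summary once history exceeds 2 x memory_window."""
    if memory_window <= 0 or len(messages) <= 2 * memory_window:
        return messages  # below the threshold: nothing to do
    old = messages[:-memory_window]      # sent to the consolidation model
    recent = messages[-memory_window:]   # kept intact
    return [summarize(old)] + recent


history = [f"message {i}" for i in range(120)]
compacted = maybe_consolidate(
    history,
    memory_window=50,
    summarize=lambda old: f"[summary of {len(old)} messages]",
)
print(len(history), "->", len(compacted))
```

With memory_window set to 50, a 120-message history becomes one summary plus the 50 most recent messages, matching the example above.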

Manual Consolidation

# Interactive CLI
grip agent
> /compact

# Via API
curl -X POST http://localhost:18800/api/v1/agent/consolidate \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"session_key": "cli:default"}'

Consolidation is implemented in the engine layer (grip/engines/litellm_engine.py:138), so it works with any model combination.
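For scripts, the same API call can be built with Python's standard library. The sketch below only constructs the request; the endpoint, port, and payload come from the curl example above, while build_consolidate_request, the placeholder token, and the explicit JSON Content-Type header are assumptions.

```python
import json
import urllib.request


def build_consolidate_request(
    token: str,
    session_key: str,
    base_url: str = "http://localhost:18800",
) -> urllib.request.Request:
    """Build (but do not send) a POST request for manual consolidation."""
    body = json.dumps({"session_key": session_key}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/agent/consolidate",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            # Assumption: the API expects a JSON body.
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_consolidate_request("example-token", "cli:default")
print(request.get_method(), request.full_url)
# To send it against a running server: urllib.request.urlopen(request)
```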

Choosing Cost-Effective Models

Budget Models

Google Gemini Flash 2.0 (openrouter/google/gemini-flash-2.0)

Cost: ~$0.0001 per request (roughly 150x cheaper than Claude Sonnet-4)
Speed: 2-3x faster than Sonnet
Best for: Lookups, simple Q&A, data extraction, summarization
Limitations: Weaker reasoning, less reliable tool use

GPT-4o Mini (openrouter/openai/gpt-4o-mini)

Cost: ~$0.0005 per request (30x cheaper than Claude Sonnet-4)
Speed: Very fast
Best for: Code formatting, regex tasks, simple refactors
Limitations: Shorter context, less creative

Claude Haiku (anthropic/claude-haiku-4)

Cost: ~$0.003 per request (5x cheaper than Sonnet)
Speed: Fastest Claude model
Best for: Tool-heavy workflows, data processing, quick iterations
Limitations: Not as strong for complex reasoning

Premium Models

Claude Sonnet-4 (anthropic/claude-sonnet-4)

Cost: ~$0.015 per request
Best for: General-purpose tasks, code generation, analysis
Sweet spot: Best quality/cost ratio for most coding tasks

Claude Opus-4 (openrouter/anthropic/claude-opus-4)

Cost: ~$0.075 per request (5x more than Sonnet)
Best for: System design, complex debugging, research synthesis
When to use: Only when task quality justifies the cost

GPT-4 Turbo (openrouter/openai/gpt-4-turbo)

Cost: ~$0.04 per request
Best for: Math, structured output, code execution validation
Trade-off: Cheaper than Opus, slightly different strengths

Memory Window Optimization

Reduce tokens sent per request by limiting conversation history.

Configuration

{
  "agents": {
    "defaults": {
      "memory_window": 30,
      "auto_consolidate": true
    }
  }
}

Small Window (memory_window: 10-20):

Pros: Very low token usage, fast responses
Cons: Agent forgets context quickly
Best for: Single-task sessions, tool-heavy automation

Medium Window (memory_window: 30-50, default):

Pros: Good balance of cost and context retention
Cons: May need consolidation every 100 messages
Best for: General interactive use

Large Window (memory_window: 100-200):

Pros: Excellent context retention, fewer consolidations
Cons: High token usage (roughly 40-80K tokens of history per request at the ~400 tokens per message seen in the consolidation example)
Best for: Complex multi-turn debugging or architecture discussions

The default memory_window is 50 messages (grip/config/schema.py:81). Monitor your token usage and adjust the window to match your typical conversation length.
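To get a feel for how the window size translates into tokens, here is a rough estimate. It assumes about 400 tokens per message, the average implied by the consolidation example (48K tokens over 120 messages); real averages vary widely, and history_tokens is an illustrative helper.

```python
def history_tokens(message_count: int, avg_tokens_per_message: int = 400) -> int:
    """Rough history size in tokens; the 400-token average is an assumption."""
    return message_count * avg_tokens_per_message


for window in (20, 50, 100):
    after_consolidation = history_tokens(window)
    # History can grow to 2 x memory_window before consolidation triggers.
    peak = history_tokens(2 * window)
    print(f"memory_window={window}: ~{after_consolidation:,} tokens "
          f"after consolidation, up to ~{peak:,} before it")
```

The peak figure matters for budgeting because every request sent just before consolidation carries the full, uncompacted history.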

Max Tool Iterations Limit

Prevent runaway costs from infinite tool loops.

{
  "agents": {
    "defaults": {
      "max_tool_iterations": 15
    }
  }
}

How it works (from grip/config/schema.py:76):

0 = Unlimited iterations (default, see long-running tasks)
N = Stop after N tool call rounds, even if task is incomplete

Setting limits:

Simple tasks: max_tool_iterations: 5 (file operations, lookups)
Code tasks: max_tool_iterations: 15 (build, test, fix)
Research: max_tool_iterations: 10 (search, fetch, analyze)

Each iteration costs 1K-8K tokens depending on tool output size. A 15-iteration task that uses 8K tokens per iteration can consume 120K tokens in total.
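The semantics of the limit can be illustrated with a small loop. This is a sketch of the documented behavior (0 means unlimited, N stops after N rounds even if the task is unfinished), not grip's agent loop; run_tool_loop and the step callback are hypothetical.

```python
from typing import Callable, Optional, Tuple


def run_tool_loop(
    step: Callable[[int], Optional[str]],
    max_tool_iterations: int = 0,
) -> Tuple[Optional[str], int]:
    """Run tool rounds until step() returns an answer or the cap is reached.

    step(i) returns the final answer, or None to request another tool round.
    A cap of 0 means unlimited iterations.
    """
    rounds = 0
    while max_tool_iterations == 0 or rounds < max_tool_iterations:
        result = step(rounds)
        rounds += 1
        if result is not None:
            return result, rounds
    # Cap reached: stop even though the task is incomplete.
    return None, rounds


# A task that never finishes is cut off after 15 rounds.
answer, used = run_tool_loop(lambda i: None, max_tool_iterations=15)
print(answer, used)
```

At up to 8K tokens per round, the cap in this example bounds a runaway task at about 120K tokens.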

Semantic Caching

Cache identical queries to avoid re-processing.

{
  "agents": {
    "defaults": {
      "semantic_cache_enabled": true,
      "semantic_cache_ttl": 3600
    }
  }
}

How it works (from grip/config/schema.py:100):

Identical user messages return cached responses
Cache expires after semantic_cache_ttl seconds (default 1 hour)
Saves 100% of tokens for repeated queries

Best for:

FAQ-style queries: “What time is it?”, “Show me the logs”
Repeated analysis: Re-running the same report
Development: Testing the same prompt multiple times

Limitations:

Only caches exact message matches (no fuzzy matching)
Session-specific (cache key includes session_key)
Stored in workspace state/semantic_cache.db
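The behavior described above (exact-match lookup, session-scoped keys, TTL expiry) can be sketched with an in-memory dictionary. The real cache is stored in state/semantic_cache.db; the ExactMatchCache class, its hashing scheme, and the injectable clock are assumptions made for this illustration.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class ExactMatchCache:
    """In-memory sketch of an exact-match response cache with a TTL."""

    def __init__(self, ttl_seconds: float = 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def _key(session_key: str, message: str) -> str:
        # The key includes the session, so sessions never share entries.
        raw = f"{session_key}\x00{message}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get(self, session_key: str, message: str) -> Optional[str]:
        key = self._key(session_key, message)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired
            return None
        return response

    def put(self, session_key: str, message: str, response: str) -> None:
        self._entries[self._key(session_key, message)] = (self._clock(), response)


cache = ExactMatchCache(ttl_seconds=3600)
cache.put("cli:default", "Show me the logs", "cached response")
print(cache.get("cli:default", "Show me the logs"))   # exact match: hit
print(cache.get("cli:default", "show me the logs"))   # different casing: miss
```

Because matching is exact, even a change in casing or whitespace produces a miss.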

Token Budget Enforcement

Set daily token limits to prevent cost overruns.

{
  "agents": {
    "defaults": {
      "max_daily_tokens": 500000
    }
  }
}

How it works (from grip/config/schema.py:110):

0 = Unlimited (default)
N = Stop all agent runs after N total tokens used today
Counts both prompt tokens and completion tokens
Resets at midnight UTC

Example limits:

Light use: 100,000 tokens/day (~$1.50 with Sonnet)
Medium use: 500,000 tokens/day (~$7.50 with Sonnet)
Heavy use: 2,000,000 tokens/day (~$30 with Sonnet)
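The budget rules above (0 is unlimited, prompt and completion tokens both count, the counter resets at midnight UTC) can be sketched as a small tracker. DailyTokenBudget is a hypothetical class for illustration, not grip's implementation.

```python
from datetime import datetime, timezone
from typing import Optional


class DailyTokenBudget:
    """Track tokens used per UTC day against max_daily_tokens (0 = unlimited)."""

    def __init__(self, max_daily_tokens: int = 0) -> None:
        self.max_daily_tokens = max_daily_tokens
        self._day = None
        self._used = 0

    def _roll_over(self, now: Optional[datetime]) -> None:
        current = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
        # A new UTC date means the counter starts again from zero.
        if current.date() != self._day:
            self._day = current.date()
            self._used = 0

    def record(self, prompt_tokens: int, completion_tokens: int,
               now: Optional[datetime] = None) -> None:
        self._roll_over(now)
        self._used += prompt_tokens + completion_tokens

    def exhausted(self, now: Optional[datetime] = None) -> bool:
        self._roll_over(now)
        return self.max_daily_tokens > 0 and self._used >= self.max_daily_tokens


budget = DailyTokenBudget(max_daily_tokens=500_000)
budget.record(prompt_tokens=12_000, completion_tokens=1_500)
print("budget exhausted:", budget.exhausted())
```

An agent runner would check exhausted() before starting a run and call record() with the usage reported by the model provider afterwards.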

Combined Cost Strategy

Maximum savings configuration:

{
  "agents": {
    "defaults": {
      "model": "openrouter/anthropic/claude-sonnet-4",
      "consolidation_model": "openrouter/google/gemini-flash-2.0",
      "memory_window": 30,
      "max_tool_iterations": 15,
      "auto_consolidate": true,
      "semantic_cache_enabled": true,
      "semantic_cache_ttl": 7200,
      "max_daily_tokens": 500000
    },
    "model_tiers": {
      "enabled": true,
      "low": "openrouter/google/gemini-flash-2.0",
      "medium": "",
      "high": "openrouter/anthropic/claude-opus-4"
    },
    "profiles": {
      "budget": {
        "model": "openrouter/google/gemini-flash-2.0",
        "max_tokens": 4096,
        "memory_window": 20,
        "max_tool_iterations": 10
      }
    }
  }
}

Estimated savings: 40-60% compared to using Claude Sonnet-4 for all tasks with no optimizations.
