LLM Proxy Usage Costs, Limits, and Model Optimization

Archestra tracks LLM usage costs, enforces usage limits, and records savings from model optimization and tool-result compression. These controls work together: pricing defines cost, statistics show what happened, limits stop or shape usage, and optimization reduces spend before a request even reaches a model. All cost features depend on model pricing being configured correctly — without pricing, token counts are still logged but cost calculations remain incomplete.

Usage Statistics

The statistics view aggregates LLM traffic by time range, team, proxy, and model so you can answer questions like which teams are driving the most spend, which models account for the largest share of cost, or whether optimization rules and TOON compression are reducing spend over time.

Spend by Team

Break down total cost by team to identify which groups are consuming the most tokens and at what rate.

Spend by Model

See which models account for the largest share of cost to inform model selection decisions.

Savings Tracking

Archestra records both raw spend and savings from optimization rules and TOON compression so you can measure the impact of cost controls.

External Dashboards

For long-term monitoring, alerting, and cross-system cost analysis, use Archestra’s exported Prometheus metrics and the prebuilt Grafana dashboards.

The statistics view depends on model pricing being configured correctly. If a model has no pricing set, usage is still logged but cost calculations will be incomplete. Configure pricing in the provider model settings pages.

Usage Limits

Usage limits are guardrails for LLM spend. Archestra supports token-cost limits scoped to the organization, team, user, agent, LLM proxy, or virtual API key. Each limit can target one or more specific models, or apply to all models. A limit with no model specified acts as a global budget across every model the entity uses. Each limit has its own cleanup interval, and limits are evaluated from recorded model usage — meaning pricing configuration directly affects when a token-cost limit is considered reached.

Scope Reference

Scope	Use when
Organization	You need a shared platform-wide budget that applies to all teams and users.
Team	Different groups need separate spend caps to track and control costs independently.
User	Individual users need their own budgets, separate from team or org limits.
Agent or LLM Proxy	A specific agent profile or proxy needs its own budget regardless of who calls it.
Virtual API Key	Spend should be capped per API key, for example to give each external application its own budget.

Default User Limits

Admins can configure a default user limit in LLM settings. It applies to every current and future user automatically. A custom per-user limit overrides the default for that individual user — use this when one user needs a different budget from the platform default.

Limit Cleanup Intervals

Each limit resets on its own schedule:

Rolling intervals — reset after the elapsed time window (for example, every 30 days from when the limit was last reset)
Calendar intervals — reset at the next day, week, or month boundary; weekly intervals can start on Sunday or Monday

Changing a limit’s cleanup interval resets its current usage immediately. Default user limits use their own cleanup interval configured in LLM settings.

Model Pricing

Model pricing is configured on the provider model settings pages and is the foundation for every cost feature in Archestra:

Statistics

Pricing converts token counts into dollar spend for the statistics and aggregate cost views.

Token-Cost Limits

Limits use pricing to decide when a budget is reached and traffic should be stopped or throttled.

Optimization Reports

Savings from optimization rules are calculated in dollars using the configured model price differential.

TOON Compression Savings

Compression savings are reported in dollars using the price of the model that received the compressed input.

If you use custom or self-hosted models (vLLM, Ollama), add pricing explicitly so cost reporting and token-cost limits work as expected.

Optimization Rules

Optimization rules reduce cost before a request is sent to an LLM. Archestra evaluates request context against the configured rules and can switch the request to a lower-cost model when conditions match. Rules are applied in priority order, making them useful for layered policies where a specific exception should win over a general fallback.

Common Use Cases

Short Prompts

Route short prompts to a cheaper, smaller model when the full power of a flagship model is not needed.

No Tool Use

Use a less expensive model for requests that do not require tool calling or structured outputs.

Time-Based Policies

Apply time-based routing rules for predictable traffic patterns, such as off-hours cost reduction.

Savings from optimization rules are recorded alongside each interaction and roll up into the statistics view, so you can see how much each rule is saving over time.

TOON Compression

TOON (Token-Oriented Object Notation) compression reduces the token footprint of structured tool results before they are passed to the model. Archestra keeps the original JSON intact for application logic, then converts the model-facing representation to TOON when compression is enabled and when the converted form is actually smaller. TOON is a compact, lossless representation of the JSON data model. Its main advantage is with uniform arrays of objects, where repeated field names are declared once and row values are emitted in a table-like form — similar to a columnar format for LLM input.

When TOON Is Most Effective

TOON compression is especially valuable for tool outputs that contain repeated structure:

Database Query Results

Rows from SQL queries or ORM results with many repeated column names.

API Resource Lists

Lists of API resources with consistent schemas, such as cloud resource listings.

Analytics Rows

Analytics or report data with repeated field names across many records.

Search Results

Search results where each result object shares a common set of fields.

When Compression Is Skipped

Archestra skips TOON compression when:

TOON is disabled at the org or team level
A response has no tool results
The TOON representation would not actually save tokens (i.e., the TOON output is larger than the original JSON)

Archestra records before/after token counts and savings when compression is applied. These savings appear in individual interaction logs and in the aggregate cost reporting view.

Enabling TOON Compression

TOON can be enabled at two levels:

Level	Effect
Organization	Applies compression to all LLM traffic across the entire organization.
Team	Applies compression only to traffic from the specified team, useful when only certain workflows benefit from compression.

See the toon-format/toon project for the format specification and benchmarks showing token savings by data type.

Dynamic Model Routing for Cost Savings

Optimization rules and TOON compression work together with usage limits to give you layered cost control:

Configure Model Pricing

Set input and output token prices for each model in the provider model settings pages. This activates all cost-based features.

Set Usage Limits

Create limits at the appropriate scope — org-wide for a hard platform cap, team limits for per-group budgets, or virtual key limits for per-application spend controls.

Create Optimization Rules

Add rules that route to cheaper models based on request characteristics — prompt length, presence of tool calls, time of day, or model tier.

Enable TOON Compression

Turn on TOON compression at the org or team level to automatically reduce token counts for tool-heavy workflows without any change to application code.

Monitor in Statistics

Review the statistics view to see total spend, savings from optimization, savings from TOON compression, and which teams or models are consuming the most budget.

Get Started

MCP

Agents

LLM Proxy

Security

Administration

Integrations

Contributing

LLM Proxy Usage Costs, Limits, and Model Optimization

Usage Statistics

Spend by Team

Spend by Model

Savings Tracking

External Dashboards

Usage Limits

Scope Reference

Default User Limits

Limit Cleanup Intervals

Model Pricing

Statistics

Token-Cost Limits

Optimization Reports

TOON Compression Savings

Optimization Rules

Common Use Cases

Short Prompts

No Tool Use

Time-Based Policies

TOON Compression

When TOON Is Most Effective

Database Query Results

API Resource Lists

Analytics Rows

Search Results

When Compression Is Skipped

Enabling TOON Compression

Dynamic Model Routing for Cost Savings

Build docs developers (and LLMs) love

Get Started

MCP

Agents

LLM Proxy

Security

Administration

Integrations

Contributing

Documentation Index

​Usage Statistics

Spend by Team

Spend by Model

Savings Tracking

External Dashboards

​Usage Limits

​Scope Reference

​Default User Limits

​Limit Cleanup Intervals

​Model Pricing

Statistics

Token-Cost Limits

Optimization Reports

TOON Compression Savings

​Optimization Rules

​Common Use Cases

Short Prompts

No Tool Use

Time-Based Policies

​TOON Compression

​When TOON Is Most Effective

Database Query Results

API Resource Lists

Analytics Rows

Search Results

​When Compression Is Skipped

​Enabling TOON Compression

​Dynamic Model Routing for Cost Savings

Build docs developers (and LLMs) love

Usage Statistics

Usage Limits

Scope Reference

Default User Limits

Limit Cleanup Intervals

Model Pricing

Optimization Rules

Common Use Cases

TOON Compression

When TOON Is Most Effective

When Compression Is Skipped

Enabling TOON Compression

Dynamic Model Routing for Cost Savings