Archestra tracks LLM usage costs, enforces usage limits, and records savings from model optimization and tool-result compression. These controls work together: pricing defines cost, statistics show what happened, limits stop or shape usage, and optimization reduces spend before a request even reaches a model. All cost features depend on model pricing being configured correctly — without pricing, token counts are still logged but cost calculations remain incomplete.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/archestra-ai/archestra/llms.txt
Use this file to discover all available pages before exploring further.
Usage Statistics
The statistics view aggregates LLM traffic by time range, team, proxy, and model so you can answer questions like which teams are driving the most spend, which models account for the largest share of cost, or whether optimization rules and TOON compression are reducing spend over time.Spend by Team
Break down total cost by team to identify which groups are consuming the most tokens and at what rate.
Spend by Model
See which models account for the largest share of cost to inform model selection decisions.
Savings Tracking
Archestra records both raw spend and savings from optimization rules and TOON compression so you can measure the impact of cost controls.
External Dashboards
For long-term monitoring, alerting, and cross-system cost analysis, use Archestra’s exported Prometheus metrics and the prebuilt Grafana dashboards.
The statistics view depends on model pricing being configured correctly. If a model has no pricing set, usage is still logged but cost calculations will be incomplete. Configure pricing in the provider model settings pages.
Usage Limits
Usage limits are guardrails for LLM spend. Archestra supports token-cost limits scoped to the organization, team, user, agent, LLM proxy, or virtual API key. Each limit can target one or more specific models, or apply to all models. A limit with no model specified acts as a global budget across every model the entity uses. Each limit has its own cleanup interval, and limits are evaluated from recorded model usage — meaning pricing configuration directly affects when a token-cost limit is considered reached.Scope Reference
| Scope | Use when |
|---|---|
| Organization | You need a shared platform-wide budget that applies to all teams and users. |
| Team | Different groups need separate spend caps to track and control costs independently. |
| User | Individual users need their own budgets, separate from team or org limits. |
| Agent or LLM Proxy | A specific agent profile or proxy needs its own budget regardless of who calls it. |
| Virtual API Key | Spend should be capped per API key, for example to give each external application its own budget. |
Default User Limits
Admins can configure a default user limit in LLM settings. It applies to every current and future user automatically. A custom per-user limit overrides the default for that individual user — use this when one user needs a different budget from the platform default.Limit Cleanup Intervals
Each limit resets on its own schedule:- Rolling intervals — reset after the elapsed time window (for example, every 30 days from when the limit was last reset)
- Calendar intervals — reset at the next day, week, or month boundary; weekly intervals can start on Sunday or Monday
Changing a limit’s cleanup interval resets its current usage immediately. Default user limits use their own cleanup interval configured in LLM settings.
Model Pricing
Model pricing is configured on the provider model settings pages and is the foundation for every cost feature in Archestra:Statistics
Pricing converts token counts into dollar spend for the statistics and aggregate cost views.
Token-Cost Limits
Limits use pricing to decide when a budget is reached and traffic should be stopped or throttled.
Optimization Reports
Savings from optimization rules are calculated in dollars using the configured model price differential.
TOON Compression Savings
Compression savings are reported in dollars using the price of the model that received the compressed input.
Optimization Rules
Optimization rules reduce cost before a request is sent to an LLM. Archestra evaluates request context against the configured rules and can switch the request to a lower-cost model when conditions match. Rules are applied in priority order, making them useful for layered policies where a specific exception should win over a general fallback.Common Use Cases
Short Prompts
Route short prompts to a cheaper, smaller model when the full power of a flagship model is not needed.
No Tool Use
Use a less expensive model for requests that do not require tool calling or structured outputs.
Time-Based Policies
Apply time-based routing rules for predictable traffic patterns, such as off-hours cost reduction.
TOON Compression
TOON (Token-Oriented Object Notation) compression reduces the token footprint of structured tool results before they are passed to the model. Archestra keeps the original JSON intact for application logic, then converts the model-facing representation to TOON when compression is enabled and when the converted form is actually smaller. TOON is a compact, lossless representation of the JSON data model. Its main advantage is with uniform arrays of objects, where repeated field names are declared once and row values are emitted in a table-like form — similar to a columnar format for LLM input.When TOON Is Most Effective
TOON compression is especially valuable for tool outputs that contain repeated structure:Database Query Results
Rows from SQL queries or ORM results with many repeated column names.
API Resource Lists
Lists of API resources with consistent schemas, such as cloud resource listings.
Analytics Rows
Analytics or report data with repeated field names across many records.
Search Results
Search results where each result object shares a common set of fields.
When Compression Is Skipped
Archestra skips TOON compression when:- TOON is disabled at the org or team level
- A response has no tool results
- The TOON representation would not actually save tokens (i.e., the TOON output is larger than the original JSON)
Enabling TOON Compression
TOON can be enabled at two levels:| Level | Effect |
|---|---|
| Organization | Applies compression to all LLM traffic across the entire organization. |
| Team | Applies compression only to traffic from the specified team, useful when only certain workflows benefit from compression. |
Dynamic Model Routing for Cost Savings
Optimization rules and TOON compression work together with usage limits to give you layered cost control:Configure Model Pricing
Set input and output token prices for each model in the provider model settings pages. This activates all cost-based features.
Set Usage Limits
Create limits at the appropriate scope — org-wide for a hard platform cap, team limits for per-group budgets, or virtual key limits for per-application spend controls.
Create Optimization Rules
Add rules that route to cheaper models based on request characteristics — prompt length, presence of tool calls, time of day, or model tier.
Enable TOON Compression
Turn on TOON compression at the org or team level to automatically reduce token counts for tool-heavy workflows without any change to application code.