Archestra exposes Prometheus metrics and OpenTelemetry traces for monitoring system health, tracking HTTP requests, and analyzing LLM API performance. The web process exposes metrics at http://localhost:9050/metrics, and when the separate worker deployment is enabled (the default for Helm deployments), the worker exposes its own metrics endpoint at http://<worker-host>:9000/metrics. Production scrape configurations should collect both endpoints, because task queue metrics and Knowledge Base pipeline metrics (connector syncs, embedding batches) are emitted from the worker process.
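
Before wiring both endpoints into Prometheus, it can help to confirm which metric names each process actually exposes. The following is a minimal sketch, not an official tool: `worker.internal` is a hypothetical stand-in for `<worker-host>`, and the parser handles only the basic Prometheus text format (name, optional labels, value).

```python
import re
import urllib.request

# Endpoints from the text above. "worker.internal" is a hypothetical
# placeholder for <worker-host>.
ENDPOINTS = {
    "web": "http://localhost:9050/metrics",
    "worker": "http://worker.internal:9000/metrics",
}

# A sample line looks like: name{label="v"} 12.5   (labels are optional)
_SAMPLE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)")


def metric_names(exposition_text: str) -> set:
    """Return the distinct sample names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE_RE.match(line)
        if match:
            names.add(match.group(1))
    return names


def fetch_metric_names(url: str, timeout: float = 5.0) -> set:
    """Fetch one /metrics endpoint and return the sample names it exposes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return metric_names(response.read().decode("utf-8"))
```

Calling `fetch_metric_names(ENDPOINTS["worker"])` and filtering for names that start with `task_queue_` or `rag_` is one way to check that the worker endpoint is the one carrying those metrics in your deployment.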
Prometheus Metrics
LLM Metrics
The following metrics track LLM API usage, cost, and performance across all agents, proxies, and gateways:
| Metric | Description |
|---|---|
| llm_request_duration_seconds | LLM API request duration by provider, model, agent_id, agent_name, agent_type, external_agent_id, source, and status code |
| llm_tokens_total | Token consumption by provider, model, agent, source, and type (input/output) |
| llm_cost_total | Estimated cost in USD by provider, model, and agent. Requires token pricing to be configured in Archestra. |
| llm_blocked_tools_total | Counter of tool calls blocked by tool invocation policies, grouped by provider, model, and agent |
| llm_time_to_first_token_seconds | Time to first token (TTFT) for streaming requests, by provider, agent, source, and model. Helps compare models by initial response latency. |
| llm_tokens_per_second | Output tokens per second throughput, by provider, agent, source, and model. Useful for comparing response speeds in latency-sensitive applications. |
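
The metrics in the table above are typically consumed through PromQL. The sketch below builds a few example queries and a URL for Prometheus' instant query endpoint (`/api/v1/query`). It rests on several assumptions: that `llm_request_duration_seconds` is exposed as a histogram with `_bucket` series, that the label names are `provider`, `model`, `agent_name`, and `type`, and that Prometheus is reachable at `http://localhost:9090`. Adjust all of these to what your deployment actually exposes.

```python
import urllib.parse

# Hypothetical Prometheus address, not defined by Archestra.
PROMETHEUS_URL = "http://localhost:9090"


def p95_latency_by_model(window: str = "5m") -> str:
    """PromQL for p95 LLM request latency per provider and model.

    Assumes llm_request_duration_seconds is a histogram (has _bucket series).
    """
    return (
        "histogram_quantile(0.95, sum by (le, provider, model) "
        f"(rate(llm_request_duration_seconds_bucket[{window}])))"
    )


def token_rate_by_type(window: str = "5m") -> str:
    """PromQL for tokens per second, split by an assumed 'type' label."""
    return f"sum by (type) (rate(llm_tokens_total[{window}]))"


def cost_by_agent(window: str = "1h") -> str:
    """PromQL for estimated USD cost per agent over the window."""
    return f"sum by (agent_name) (increase(llm_cost_total[{window}]))"


def instant_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build a URL for Prometheus' instant query HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
```

Keeping the queries in small functions makes the window a single parameter, which is convenient when the same expression is reused for dashboards and alerts.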
agent_id and agent_name are the internal Archestra identifiers. external_agent_id contains the value passed via the X-Archestra-Agent-Id header, allowing clients to associate metrics with their own agent identifiers. Knowledge Base embedding and reranking calls emit these same metrics with agent_name="Knowledge Base" and source="knowledge:embedding" or source="knowledge:reranker".
MCP Metrics
| Metric | Description |
|---|---|
| mcp_tool_calls_total | Total MCP tool calls by agent, mcp_server_name, tool_name, and status (success/error) |
| mcp_tool_call_duration_seconds | MCP tool call execution duration by agent, mcp_server_name, tool_name, and status |
| mcp_server_deployment_status | Current deployment state of self-hosted MCP servers by server_name and state (not_created/pending/running/failed/succeeded). Value is 1 for the active state. Use count(mcp_server_deployment_status{state="running"} == 1) to count running deployments. |
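
Because `mcp_server_deployment_status` emits one series per state and marks the active one with the value 1, counting deployments means filtering on both the `state` label and the value. The sketch below mirrors the `count(... == 1)` expression on already-parsed samples; the server names and sample values are made up for illustration.

```python
from typing import Iterable, Mapping, Tuple

# One parsed sample: (labels, value).
Sample = Tuple[Mapping[str, str], float]


def count_in_state(samples: Iterable[Sample], state: str = "running") -> int:
    """Count servers whose active state is `state`.

    A series only counts when its state label matches AND its value is 1,
    which is what `== 1` does in the PromQL expression.
    """
    return sum(
        1
        for labels, value in samples
        if labels.get("state") == state and value == 1
    )


# Hypothetical samples: each server reports several states, one of them active.
SAMPLES = [
    ({"server_name": "github", "state": "running"}, 1.0),
    ({"server_name": "github", "state": "pending"}, 0.0),
    ({"server_name": "jira", "state": "running"}, 0.0),
    ({"server_name": "jira", "state": "failed"}, 1.0),
    ({"server_name": "slack", "state": "running"}, 1.0),
]

print(count_in_state(SAMPLES))            # prints 2
print(count_in_state(SAMPLES, "failed"))  # prints 1
```

Filtering on the label alone would also count servers whose `running` series exists but is currently 0, which is why the value check matters.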
RAG & Knowledge Base Metrics
| Metric | Description |
|---|---|
| rag_connector_syncs_total | Total connector syncs by connector_type and status (success/failed/partial) |
| rag_connector_sync_duration_seconds | Connector sync duration by connector_type and status |
| rag_documents_processed_total | Total documents processed during syncs by connector_type |
| rag_documents_ingested_total | Total documents ingested (new or updated) by connector_type |
| rag_chunks_created_total | Total chunks created during document ingestion by connector_type |
| rag_embedding_batches_total | Total embedding batches processed by status (success/error) |
| rag_embedding_documents_total | Total documents embedded by status |
| rag_queries_total | Total RAG queries by search_type (vector/hybrid) |
| rag_query_duration_seconds | RAG query end-to-end duration (embedding, search, rerank) by search_type |
| rag_query_results_count | Number of results returned per RAG query by search_type |
Task Queue Metrics
| Metric | Description |
|---|---|
| task_queue_tasks_enqueued_total | Total tasks enqueued by task_type |
| task_queue_tasks_completed_total | Total tasks completed successfully by task_type |
| task_queue_tasks_failed_total | Total task processing failures (may be retried) by task_type |
| task_queue_tasks_dead_total | Total tasks moved to dead-letter (max retries exceeded) by task_type |
| task_queue_task_duration_seconds | Task processing duration by task_type |
| task_queue_active_tasks | Currently active (in-flight) tasks by task_type |
| task_queue_stuck_tasks_reset_total | Total stuck tasks reset back to pending |
HTTP & Runtime Metrics
| Metric | Description |
|---|---|
| http_request_duration_seconds_count | Total HTTP requests by method, route, and status |
| http_request_duration_seconds_bucket | Request duration histogram buckets |
| http_request_summary_seconds | Request duration summary with quantiles |
| process_cpu_user_seconds_total | CPU time in user mode |
| process_resident_memory_bytes | Physical memory usage |
| nodejs_eventloop_lag_seconds | Event loop lag (latency indicator) |
| nodejs_heap_size_used_bytes | V8 heap memory usage |
| nodejs_gc_duration_seconds | Garbage collection timing by type |
Securing the Metrics Endpoint
The metrics endpoint supports bearer token authentication. Set ARCHESTRA_METRICS_SECRET to require an Authorization: Bearer <token> header from all scrapers.
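
From the scraper's side, the requirement amounts to sending the same secret in the `Authorization` header. The following is a minimal sketch of an authenticated scrape using only the standard library; the helper names are hypothetical, and the token passed in is assumed to match the value of `ARCHESTRA_METRICS_SECRET` configured on the server.

```python
import urllib.request


def build_metrics_request(url: str, token: str) -> urllib.request.Request:
    """Build a GET request that carries the bearer token expected by the endpoint."""
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
    )


def scrape(url: str, token: str, timeout: float = 5.0) -> str:
    """Fetch the metrics text with the bearer token attached.

    Raises urllib.error.HTTPError if the server rejects the request.
    """
    request = build_metrics_request(url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

In a Prometheus scrape job, the same token would normally be supplied through the job's `authorization` block (`credentials` or `credentials_file`) rather than in a script.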
OpenTelemetry Tracing
Archestra exports OpenTelemetry traces to any OTLP-compatible backend, such as Jaeger, Tempo, Honeycomb, or Grafana Cloud.
Configuration
Authentication
- Bearer Token
- Basic Auth
- No Authentication
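
The three options differ only in the `Authorization` header sent to the OTLP backend. The sketch below shows the header each one produces in generic HTTP terms; `otlp_auth_headers` is a hypothetical helper, and it does not describe how Archestra itself is configured for each mode.

```python
import base64
from typing import Dict, Optional


def otlp_auth_headers(
    bearer_token: Optional[str] = None,
    username: Optional[str] = None,
    password: Optional[str] = None,
) -> Dict[str, str]:
    """Return the HTTP headers for bearer, basic, or no authentication."""
    if bearer_token:
        return {"Authorization": f"Bearer {bearer_token}"}
    if username is not None and password is not None:
        # Basic auth is base64("username:password"), per RFC 7617.
        raw = f"{username}:{password}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    return {}  # no authentication: send no Authorization header
```

Basic credentials are only encoded, not encrypted, so either mode should be used over TLS.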
Content Capture
Archestra can capture prompt/completion content and tool call arguments/results as span events for a full audit trail. This is enabled by default.
- LLM spans: gen_ai.content.prompt event with request messages, and gen_ai.content.completion event with response text
- MCP spans: gen_ai.content.input event with tool call arguments, and gen_ai.content.output event with tool call results
What’s Traced
Archestra automatically traces these categories:
LLM API Calls
Every call to an LLM provider, with model, token counts, cost, and response time. Span names follow the GenAI semconv format {operation} {model}, e.g., chat gpt-4o-mini or generate_content gemini-2.0-flash.
MCP Tool Calls
Every tool execution through the MCP Gateway, with tool name, server, duration, and whether the call was blocked by a policy.
Knowledge Base Operations
Embedding and reranking LLM calls made by the Knowledge Base system, with cost and token tracking. Identified by archestra.trigger.source=knowledge:embedding or knowledge:reranker.
Skill Sandbox Execution
The native Rust sandbox exports its own spans (service.name=archestra-sandbox-rs) for command runs, artifact reads, and container materialization, with exit_code, duration_ms, and output size as span fields.
Chat Trace Structure
Each chat turn produces a unified trace grouping LLM calls and tool executions under a single parent span. The invocation path determines route.category and archestra.trigger.source:
| Invocation Path | route.category | archestra.trigger.source |
|---|---|---|
| Chat UI | chat | — |
| A2A Protocol | a2a | — |
| MS Teams | chatops | ms-teams |
| Email | email | email |
Verbose Tracing
By default, traces include only GenAI-specific spans. To also capture infrastructure spans (HTTP routes, outgoing HTTP calls, Node.js fetch), enable verbose tracing.
Metric-to-Trace Exemplars
All LLM and MCP metrics include trace exemplars. In Grafana, clicking on a data point jumps directly to the corresponding trace in Tempo. This requires:
- Prometheus configured with --enable-feature=exemplar-storage
- Grafana Prometheus datasource configured with exemplarTraceIdDestinations pointing to your Tempo datasource
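
As a rough illustration of the second requirement, the snippet below builds a JSON body for creating a Prometheus datasource through Grafana's HTTP API with `exemplarTraceIdDestinations` set. The datasource name, the URL, the Tempo datasource UID `tempo`, and the exemplar label name `trace_id` are all assumptions; check which label your exemplars actually carry before relying on it.

```python
import json


def prometheus_datasource_payload(
    prometheus_url: str,
    tempo_uid: str,
    exemplar_label: str = "trace_id",  # assumed exemplar label name
) -> dict:
    """Build a Grafana datasource definition that links exemplars to Tempo."""
    return {
        "name": "Prometheus",
        "type": "prometheus",
        "access": "proxy",
        "url": prometheus_url,
        "jsonData": {
            # Each entry maps an exemplar label to the tracing datasource
            # that Grafana should open when the exemplar is clicked.
            "exemplarTraceIdDestinations": [
                {"name": exemplar_label, "datasourceUid": tempo_uid}
            ]
        },
    }


# Hypothetical values for illustration only.
payload = prometheus_datasource_payload("http://prometheus:9090", "tempo")
print(json.dumps(payload, indent=2))
```

The same `jsonData` structure applies when the datasource is defined in a Grafana provisioning file instead of being created through the API.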
Custom Agent Labels
Labels are key-value pairs configured on agents in the Archestra UI. Once added, they appear in:
- Metrics: as additional label dimensions on all LLM and MCP metrics (kebab-case labels are converted to snake_case for Prometheus naming compatibility)
- Traces: as archestra.label.<key> span attributes
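
The naming rules above can be sketched as two small functions. This is an approximation of the described behavior, not Archestra's implementation: it assumes the metric conversion is a plain hyphen-to-underscore replacement and that the span attribute keeps the key exactly as configured.

```python
def prometheus_label_name(key: str) -> str:
    """Convert a kebab-case agent label key to snake_case for Prometheus.

    Prometheus label names may not contain hyphens, hence the conversion.
    """
    return key.replace("-", "_")


def span_attribute_name(key: str) -> str:
    """Return the span attribute name for an agent label key (assumed unchanged)."""
    return f"archestra.label.{key}"


# Hypothetical label key configured on an agent.
print(prometheus_label_name("cost-center"))  # prints cost_center
print(span_attribute_name("cost-center"))    # prints archestra.label.cost-center
```

When querying, use the snake_case form in PromQL selectors and the `archestra.label.` form in trace searches.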
Grafana Dashboards
Archestra provides five pre-built Grafana dashboards:
GenAI Observability
LLM request metrics, token usage, cost analysis, latency, and traces
MCP Monitoring
MCP tool call metrics, error rates, duration, and traces
Agent Sessions
Session-level agent audit trail with drill-down into LLM calls, MCP tool calls, and correlated logs
Application Metrics
HTTP traffic, Node.js runtime health, task queue processing, and PostgreSQL database monitoring
RAG & Knowledge Base
Connector sync monitoring, embedding pipeline, and RAG query performance
Importing All Dashboards
Create a Grafana Service Account token with the Editor role, then run the import command with that token.
PostgreSQL Metrics Provider
The Application Metrics dashboard supports multiple PostgreSQL metrics sources. Use the --postgres-provider flag to select the right one:
| Provider | Metric Prefix | Use When |
|---|---|---|
| helm (default) | pg_* | Using the Bitnami PostgreSQL Helm subchart with metrics sidecar |
| otel | postgresql_* | Using OTel Collector PostgreSQL Receiver against any PostgreSQL instance |
| cloudsql | stackdriver_cloudsql_* | Scraping GCP Cloud Monitoring via the Stackdriver Exporter |
| azure | azure_* | Scraping Azure Monitor metrics for Azure Database for PostgreSQL |
Performance Benchmarks
Archestra adds approximately 30-50ms latency per request while providing enterprise-grade security and policy enforcement for LLM applications.
| Metric | Value |
|---|---|
| Backend processing | 20-23ms |
| End-to-end P50 | 25ms |
| End-to-end P95 | 31ms |
| End-to-end P99 | 41ms |
| Database overhead | <0.5ms |
| Throughput @ concurrency=10 | 155 req/s |
| Throughput @ concurrency=500 | 272 req/s |
| CPU utilization (single process) | 0.44% |
| Memory (single process) | 222MB RAM |
Recommended Deployment Configurations
| Tier | Requests/Day | Requests/Second | Platform Resources | Database Resources |
|---|---|---|---|---|
| Small | under 100K | 1-100 | 1 instance: 2 vCPU, 4 GB RAM | 2 vCPU, 4 GB RAM |
| Medium | 100K-1M | 100-500 | 2-4 instances: 4 vCPU, 8 GB RAM each | 4 vCPU, 8 GB RAM, read replicas |
| Large | 1M-10M | 500-2K | 4-8 instances: 4 vCPU, 16 GB RAM each | 8 vCPU, 16 GB RAM, connection pooling |
| Enterprise | 10M+ | 2K+ | 8+ instances: 8 vCPU, 16 GB RAM each | 8+ vCPU, 32 GB RAM, sharding |
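
The tier table can be read as a lookup on daily request volume. The sketch below assumes each tier's upper bound is exclusive, since the table does not say which tier a boundary value such as exactly 1M belongs to. `recommend_tier` is a hypothetical helper, not part of Archestra.

```python
# (exclusive upper bound in requests/day, tier name), taken from the table above.
TIERS = [
    (100_000, "Small"),
    (1_000_000, "Medium"),
    (10_000_000, "Large"),
]


def recommend_tier(requests_per_day: int) -> str:
    """Map a daily request volume to a deployment tier name."""
    if requests_per_day < 0:
        raise ValueError("requests_per_day must be non-negative")
    for upper_bound, name in TIERS:
        if requests_per_day < upper_bound:
            return name
    return "Enterprise"  # 10M+ requests per day


print(recommend_tier(250_000))  # prints Medium
```

Daily volume is only a starting point; the Requests/Second column reflects peak load, so a bursty workload may call for a larger tier than its daily total suggests.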