Running LLM applications in production requires proactive monitoring to catch issues before they impact users. This guide shows you how to set up comprehensive production monitoring with Helicone.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/Helicone/helicone/llms.txt
Use this file to discover all available pages before exploring further.
The Challenge
Production LLM applications face unique challenges:- Unpredictable errors: Provider outages, rate limits, and model changes
- Cost volatility: Usage spikes from viral features or abuse
- Quality degradation: Prompt drift, model updates, data issues
- Performance issues: Latency spikes affecting user experience
Solution Overview
Helicone provides a complete monitoring stack:Real-time Alerts
Get notified of errors, cost spikes, and latency issues
Request Observability
View every request/response with full context
Usage Analytics
Track costs, token usage, and model performance
User Tracking
Monitor per-user costs and identify abuse
Implementation Guide
1. Instrument Your Application
Add monitoring headers to all production requests:2. Set Up Critical Alerts
Create alerts for production issues:Error Rate Alert
Navigate to Settings → Alerts and create:Alert Configuration:
- Name: Production Error Rate
- Metric: Error Rate
- Threshold: > 5%
- Time Window: 10 minutes
- Minimum Requests: 10 (avoid false positives)
- Property:
Environment = production
- Slack:
#production-alerts - Email:
oncall@company.com
This catches provider outages, rate limit issues, and breaking changes quickly.
Cost Spike Alert
Alert Configuration:
- Name: Production Cost Spike
- Metric: Cost
- Threshold: > $100/day
- Time Window: 1 day
- Property:
Environment = production
- Email:
finance@company.com - Slack:
#cost-alerts
Prevents unexpected bills from usage spikes or abuse.
Latency Alert
Alert Configuration:
- Name: High Latency
- Metric: Latency
- Threshold: P95 > 10000ms
- Time Window: 30 minutes
- Minimum Requests: 20
- Property:
Environment = production
- Slack:
#performance-alerts
Detects performance degradation affecting user experience.
3. Configure User Monitoring
Track per-user usage to identify abuse and understand behavior:- Identify users exceeding quotas
- Detect potential abuse patterns
- Understand usage by tier/cohort
- Calculate customer lifetime value
4. Implement Session Tracking
For multi-step workflows, track complete user journeys:- See total cost per user interaction
- Debug failures with full context
- Identify expensive workflow patterns
- Measure success rates for complete flows
5. Set Up Cost Controls
Implement rate limiting and quota management:6. Enable Caching
Reduce costs and latency for repetitive queries:- Go to Dashboard → Cache Analytics
- Track hit rate, savings, and performance
- Adjust cache TTL based on update frequency
Monitoring Dashboard
Key metrics to watch daily:Overview Metrics
Feature Breakdown
User Insights
Incident Response
When an alert fires:Assess Severity
- Error rate alert = High severity (affects all users)
- Cost alert = Medium severity (financial impact)
- Latency alert = Medium severity (poor UX)
- Feature-specific = Varies by feature criticality
Investigate in Helicone
- Click alert notification link
- Review affected requests
- Look for patterns:
- Specific users affected?
- Single feature or widespread?
- Started at specific time?
Take Action
For errors:
- Check provider status pages
- Review recent deployments
- Implement fallback/retry logic
- Identify top users/features
- Implement temporary rate limits
- Investigate for abuse
- Check model selection
- Review prompt sizes
- Consider model switching
Best Practices
Advanced: Custom Dashboards
Build custom monitoring using Helicone API:Monitoring Checklist
- All production requests instrumented with monitoring headers
- Error rate alert configured (less than 5%)
- Cost alert configured (appropriate threshold)
- Feature-specific alerts for critical features
- User tracking enabled (Helicone-User-Id)
- Session tracking for multi-step workflows
- Caching enabled for repetitive queries
- Rate limiting implemented
- Daily dashboard review scheduled
- Incident response playbook documented
Next Steps
Alerts Documentation
Deep dive into alert configuration options
User Metrics
Track and analyze per-user behavior
Debugging Guide
Learn how to investigate production issues
Cost Optimization
Reduce production costs