Helicone Evaluations let you automatically assess LLM responses for quality, accuracy, and alignment with your application’s goals. Build custom evaluators, use LLMs as judges, or integrate external evaluation services to continuously monitor and improve your AI systems.
Create a prompt that describes what makes a good response
const judgePrompt = `Evaluate this LLM response on a scale of 1-10 for:
- Accuracy: Does it answer the question correctly?
- Helpfulness: Is it useful to the user?
- Safety: Does it avoid harmful content?

User Question: {question}
LLM Response: {response}

Provide scores in JSON format.`;
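The template above relies on `{question}` and `{response}` placeholders and asks the judge to reply in JSON. A minimal sketch of filling the template and parsing the reply might look like the following. `fillTemplate` and `parseScores` are hypothetical helper names, not part of any Helicone SDK, and the tolerance for extra text around the JSON is an assumption about how judge models tend to answer.

```javascript
// Hypothetical helpers, not part of Helicone's SDK.

// Replace each {name} placeholder with the matching value from `vars`.
// Unknown placeholders are left untouched so missing data is easy to spot.
function fillTemplate(template, vars) {
  return template.replace(/\{(\w+)\}/g, (match, key) =>
    Object.prototype.hasOwnProperty.call(vars, key) ? String(vars[key]) : match
  );
}

// Take the text between the first "{" and the last "}" in the judge's reply
// and parse it as JSON. Returns null when no valid JSON object is found.
function parseScores(reply) {
  const start = reply.indexOf("{");
  const end = reply.lastIndexOf("}");
  if (start === -1 || end <= start) return null;
  try {
    return JSON.parse(reply.slice(start, end + 1));
  } catch {
    return null;
  }
}

const prompt = fillTemplate("User Question: {question}\nLLM Response: {response}", {
  question: "What is 2 + 2?",
  response: "4"
});
const scores = parseScores(
  'Here are the scores: {"accuracy": 9, "helpfulness": 8, "safety": 10}'
);
```

Returning `null` instead of throwing lets the caller decide whether to retry the judge or skip scoring for that request.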
2
Set up evaluation webhook
Configure a webhook to receive completed requests and trigger evaluation
// Webhook handler
// evaluateLLM and storeScore are helpers you define in your own application.
export default async function handler(req, res) {
  const { request_id, request_response_url } = req.body;

  // Fetch full request/response data
  const data = await fetch(request_response_url).then(r => r.json());

  // Run LLM judge
  const evaluation = await evaluateLLM({
    question: data.request.messages[0].content,
    response: data.response.choices[0].message.content
  });

  // Store scores back to Helicone
  await storeScore(request_id, evaluation);

  // Acknowledge the webhook so the sender does not time out
  res.status(200).json({ received: true });
}
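The handler above calls `storeScore` without defining it. One possible shape is sketched below. It assumes the scoring endpoint is `POST https://api.helicone.ai/v1/request/{requestId}/score` with a `{ scores }` body and a Bearer API key, and that score values should be integers; both assumptions should be checked against the current Helicone API reference. `buildScoreRequest`, the `HELICONE_API_KEY` variable name, and the `fetchImpl` parameter are hypothetical.

```javascript
// Hypothetical helper: builds the HTTP request used to store scores.
// ASSUMPTION: endpoint path, body shape, and auth header match Helicone's
// public API reference; verify before relying on this.
function buildScoreRequest(requestId, evaluation, apiKey) {
  const scores = {};
  for (const [name, value] of Object.entries(evaluation)) {
    // Keep only finite numbers and round them (assumed integer-only scores).
    if (typeof value === "number" && Number.isFinite(value)) {
      scores[name.toLowerCase()] = Math.round(value);
    }
  }
  return {
    url: `https://api.helicone.ai/v1/request/${encodeURIComponent(requestId)}/score`,
    options: {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${apiKey}`,
        "Content-Type": "application/json"
      },
      body: JSON.stringify({ scores })
    }
  };
}

// storeScore as called by the webhook handler. `fetchImpl` is injectable so
// the function can be exercised without network access.
async function storeScore(
  requestId,
  evaluation,
  apiKey = process.env.HELICONE_API_KEY,
  fetchImpl = fetch
) {
  const { url, options } = buildScoreRequest(requestId, evaluation, apiKey);
  const res = await fetchImpl(url, options);
  if (!res.ok) throw new Error(`Score upload failed with status ${res.status}`);
}
```

Separating request construction from the network call keeps the payload logic easy to unit test.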
3
View evaluation results
Monitor scores in the Helicone dashboard to track quality trends
// Tag requests with experiment ID
headers: {
  "Helicone-Property-Experiment": "prompt-v2"
}

// Filter by experiment to compare scores
// View in dashboard or query via API
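Once requests carry an experiment tag, comparing experiments comes down to grouping scores by that tag. The sketch below assumes scored requests have already been exported into a simple `{ experiment, score }` shape. That shape and the `averageScoreByExperiment` name are made up for illustration and do not reflect the format of any Helicone API response.

```javascript
// Hypothetical record shape: { experiment: string, score: number }.
// Returns an object mapping each experiment name to its mean score.
function averageScoreByExperiment(records) {
  const totals = new Map();
  for (const { experiment, score } of records) {
    const entry = totals.get(experiment) ?? { sum: 0, count: 0 };
    entry.sum += score;
    entry.count += 1;
    totals.set(experiment, entry);
  }
  const averages = {};
  for (const [experiment, { sum, count }] of totals) {
    averages[experiment] = sum / count;
  }
  return averages;
}

const averages = averageScoreByExperiment([
  { experiment: "prompt-v1", score: 6 },
  { experiment: "prompt-v1", score: 8 },
  { experiment: "prompt-v2", score: 9 }
]);
// averages is { "prompt-v1": 7, "prompt-v2": 9 }
```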
const qualityMetrics = {
  accuracy: 'Is the response factually correct?',
  relevance: 'Does it answer the question asked?',
  completeness: 'Does it fully address all aspects?',
  coherence: 'Is it well-structured and logical?',
  conciseness: 'Is it appropriately detailed without being verbose?'
};
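A metric map like this can be turned directly into the rubric section of a judge prompt, so the metric definitions live in one place. The following is a small sketch; `buildRubric` is a hypothetical helper name and the bullet format is only one reasonable choice.

```javascript
// Hypothetical helper: turns a { metric: question } map into rubric lines
// suitable for embedding in an LLM judge prompt.
function buildRubric(metrics) {
  return Object.entries(metrics)
    .map(([name, question]) => `- ${name}: ${question}`)
    .join("\n");
}

const rubric = buildRubric({
  accuracy: "Is the response factually correct?",
  relevance: "Does it answer the question asked?"
});
```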
const performanceMetrics = {
  response_time: 'How long did the request take?',
  token_efficiency: 'Tokens used vs. value delivered',
  cost_effectiveness: 'Cost relative to quality score',
  cache_hit_rate: 'Percentage of cached responses'
};
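Unlike the quality metrics, these can be computed arithmetically from request logs. The sketch below shows one way to do so, assuming each request has been reduced to a record with `latencyMs`, `totalTokens`, `costUsd`, `cached`, and `qualityScore` fields. The field names, the `summarizePerformance` helper, and the "quality points per dollar" definition of cost effectiveness are all illustrative assumptions.

```javascript
// Hypothetical record shape:
// { latencyMs, totalTokens, costUsd, cached, qualityScore }
function summarizePerformance(requests) {
  if (requests.length === 0) return null;
  const n = requests.length;
  const sum = (pick) => requests.reduce((acc, r) => acc + pick(r), 0);
  const totalCost = sum((r) => r.costUsd);
  const totalQuality = sum((r) => r.qualityScore);
  return {
    avgResponseTimeMs: sum((r) => r.latencyMs) / n,
    avgTokens: sum((r) => r.totalTokens) / n,
    // Fraction of requests served from cache, between 0 and 1.
    cacheHitRate: requests.filter((r) => r.cached).length / n,
    // Quality points per dollar; null when nothing was spent.
    qualityPerDollar: totalCost > 0 ? totalQuality / totalCost : null
  };
}

const summary = summarizePerformance([
  { latencyMs: 100, totalTokens: 100, costUsd: 0.5, cached: true, qualityScore: 8 },
  { latencyMs: 300, totalTokens: 200, costUsd: 0.5, cached: false, qualityScore: 6 }
]);
```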
Trigger evaluations automatically when requests complete
Datasets
Build evaluation datasets from scored production data
Experiments
Compare evaluation scores across different configurations
Alerts
Get notified when evaluation scores drop below thresholds
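The threshold idea behind score alerts can be illustrated with a rolling average. This is only a sketch of the logic under assumed defaults, not a description of how Helicone implements alerting; `shouldAlert` and the default window of 20 scores are hypothetical.

```javascript
// Hypothetical helper: returns true when the average of the most recent
// `windowSize` scores falls below `threshold`.
function shouldAlert(scores, threshold, windowSize = 20) {
  const recent = scores.slice(-windowSize);
  if (recent.length === 0) return false;
  const avg = recent.reduce((a, b) => a + b, 0) / recent.length;
  return avg < threshold;
}

const dropped = shouldAlert([9, 9, 4, 5], 7, 2); // last two scores average 4.5
const healthy = shouldAlert([9, 9, 8, 9], 7);
```

Averaging over a window rather than alerting on a single low score reduces noise from occasional outliers.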
Evaluations help you maintain and improve LLM quality over time. Start with simple scoring metrics, then expand to more sophisticated evaluation methods as your application matures.