Why Evaluate?
AI agents are non-deterministic. You can’t just run a test suite and call it done. Instead, you need systematic evaluation to:
- Catch regressions when you change prompts, models, or tools
- Measure improvements objectively when iterating on your agent
- Identify failure patterns across different types of inputs
- Build confidence before deploying to production
- Monitor quality continuously as your agent serves real users
The Evaluation Foundation: Datasets
Before you can evaluate, you need a dataset: a collection of test cases with inputs and (optionally) expected outputs.
Anatomy of a Dataset
Each example in your dataset should include:
- Inputs: The question or request the user makes
- Reference outputs (optional): The expected or ideal response
- Metadata: Tags, categories, or context about the test case
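The three fields above can be sketched as a small data structure. This is a minimal illustration, not the course's actual schema: the `Example` class, its field names, and the sample questions are all assumptions made up for this sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Example:
    """One test case in an evaluation dataset (hypothetical structure)."""

    # The question or request the user makes.
    inputs: dict[str, Any]
    # The expected or ideal response; None when no reference exists.
    reference_outputs: Optional[dict[str, Any]] = None
    # Tags, categories, or context about the test case.
    metadata: dict[str, Any] = field(default_factory=dict)


# Illustrative examples only; these are not taken from a real dataset.
dataset = [
    Example(
        inputs={"question": "How many staplers are in stock?"},
        reference_outputs={"answer": "There are 42 staplers in stock."},
        metadata={"category": "inventory", "source": "handcrafted"},
    ),
    Example(
        # No reference output: useful for evaluators that do not need one.
        inputs={"question": "Can you help me with something?"},
        metadata={"category": "ambiguous", "source": "handcrafted"},
    ),
]
```

Keeping the reference output optional lets the same dataset serve both reference-based checks (correctness) and reference-free ones (tone, tool usage).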
Creating Datasets
You can create datasets in several ways:
From Real Traffic
Export actual user interactions from your traces. This ensures your evaluations reflect real-world usage.
Handcrafted Test Cases
Write specific scenarios to test edge cases, known failure modes, or requirements from your PRD.
Synthetic Generation
Use an LLM to generate diverse test cases based on your agent’s capabilities.
From Bugs
When users report issues, add them to your dataset to prevent regressions.
Example: OfficeFlow Dataset
The OfficeFlow agent uses a dataset with questions like:
Running Experiments
An experiment runs your agent against a dataset and applies evaluators to measure performance.
Basic Experiment Structure
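As a rough sketch of that structure, the loop below runs an agent over each example, applies every evaluator, and averages the scores. It deliberately uses only the standard library rather than the LangSmith SDK, so `run_experiment`, `toy_agent`, `exact_match`, and the dictionary layout are all assumptions for illustration.

```python
from statistics import mean
from typing import Callable

# Assumed evaluator signature: (inputs, outputs, reference_outputs) -> score.
Evaluator = Callable[[dict, dict, dict], float]


def run_experiment(agent: Callable[[dict], dict],
                   dataset: list[dict],
                   evaluators: dict[str, Evaluator]) -> dict:
    """Run the agent on every example and score each output."""
    rows = []
    for example in dataset:
        outputs = agent(example["inputs"])
        reference = example.get("reference_outputs") or {}
        scores = {name: evaluate(example["inputs"], outputs, reference)
                  for name, evaluate in evaluators.items()}
        rows.append({"inputs": example["inputs"],
                     "outputs": outputs,
                     "scores": scores})
    # Aggregate: mean score per evaluator across the dataset.
    summary = ({name: mean(row["scores"][name] for row in rows)
                for name in evaluators} if rows else {})
    return {"rows": rows, "summary": summary}


def toy_agent(inputs: dict) -> dict:
    """Stand-in for a real agent; only knows one answer."""
    return {"answer": "4" if inputs["question"] == "2 + 2?" else "unknown"}


def exact_match(inputs: dict, outputs: dict, reference: dict) -> float:
    return float(outputs.get("answer") == reference.get("answer"))


toy_dataset = [
    {"inputs": {"question": "2 + 2?"}, "reference_outputs": {"answer": "4"}},
    {"inputs": {"question": "Capital of France?"},
     "reference_outputs": {"answer": "Paris"}},
]

result = run_experiment(toy_agent, toy_dataset, {"exact_match": exact_match})
print(result["summary"])
```

A hosted experiment runner adds tracing, storage, and comparison views on top, but the core shape (agent, dataset, evaluators, aggregate) stays the same.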
Types of Evaluators
Evaluators are functions that score agent outputs. There are three main types:
1. Code-Based Evaluators
Deterministic evaluators that check specific conditions. These work like unit tests.
Example: Schema-Before-Query Evaluator
This evaluator checks that the agent inspects the database schema before running queries.
Code-based evaluators work well for:
- Checking tool usage patterns
- Validating output format (JSON schema, specific fields)
- Verifying security requirements (no PII in logs)
- Measuring response length or token usage
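The schema-before-query check described above could look roughly like the following. It assumes the agent's trace has already been reduced to an ordered list of tool names; the tool names `get_schema` and `run_query` are hypothetical placeholders, not the course's actual tool names.

```python
def schema_before_query(tool_calls: list[str],
                        schema_tool: str = "get_schema",
                        query_tool: str = "run_query") -> float:
    """Return 1.0 if no query runs before the schema was inspected, else 0.0.

    A trace with no query calls passes, since there is nothing to violate.
    """
    schema_seen = False
    for name in tool_calls:
        if name == schema_tool:
            schema_seen = True
        elif name == query_tool and not schema_seen:
            # A query ran before any schema inspection: fail immediately.
            return 0.0
    return 1.0


print(schema_before_query(["get_schema", "run_query", "run_query"]))
print(schema_before_query(["run_query", "get_schema"]))
```

Because the check is deterministic and cheap, it can run on every example in every experiment without adding meaningful cost.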
2. LLM-as-Judge Evaluators
Use an LLM to evaluate subjective criteria that are hard to express in code.
Example: Correctness Evaluator
LLM-as-judge evaluators work well for:
- Evaluating helpfulness or tone
- Checking semantic correctness when exact wording varies
- Assessing whether the response follows specific guidelines
- Measuring conciseness or clarity
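A correctness judge of this kind might be sketched as below. The model call is injected as a plain `judge` callable so the example runs without any API key; the prompt wording, the `correctness_evaluator` name, and the one-word verdict format are assumptions for this sketch, not the course's actual prompt.

```python
from typing import Callable, Optional

JUDGE_PROMPT = """You are grading an answer for correctness.
Question: {question}
Reference answer: {reference}
Submitted answer: {answer}
Reply with exactly one word: CORRECT or INCORRECT."""


def correctness_evaluator(question: str, answer: str, reference: str,
                          judge: Callable[[str], str]) -> dict:
    """Ask a judge model whether the answer matches the reference in meaning.

    `judge` takes a prompt string and returns the model's raw text reply.
    """
    prompt = JUDGE_PROMPT.format(question=question,
                                 reference=reference,
                                 answer=answer)
    verdict = judge(prompt).strip().upper()
    score: Optional[float]
    if verdict == "CORRECT":
        score = 1.0
    elif verdict == "INCORRECT":
        score = 0.0
    else:
        # Do not guess when the judge ignores the output format.
        score = None
    return {"score": score, "comment": verdict}


def stub_judge(prompt: str) -> str:
    """Stand-in for a real LLM call, used only to make the sketch runnable."""
    return "correct\n"


print(correctness_evaluator("What is 2 + 2?", "Four", "4", stub_judge))
```

Returning `None` for an unparseable verdict keeps formatting failures of the judge separate from genuine failures of the agent.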
3. Pairwise Evaluators
Compare two versions of your agent side-by-side to measure relative improvement.
Example: Conciseness Comparison
Pairwise evaluators work well for:
- A/B testing different prompts or models
- Measuring incremental improvements
- Evaluating subjective qualities where absolute scoring is hard
- Detecting regressions when refactoring
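A pairwise comparison can be sketched as follows. The judge is again an injected callable, and the presentation order is randomized, a common precaution against position bias that is an addition of this sketch rather than something the text above prescribes. All names (`prefer`, `win_rate_b`, `shorter_wins`) and the sample answers are hypothetical.

```python
import random
from typing import Callable

# Assumed judge signature: (question, first, second) -> "first" or "second".
PairwiseJudge = Callable[[str, str, str], str]


def prefer(question: str, answer_a: str, answer_b: str,
           judge: PairwiseJudge, rng: random.Random) -> str:
    """Return "a" or "b" depending on which answer the judge prefers."""
    # Randomize which answer is shown first to reduce position bias.
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    choice = judge(question, first, second)
    if choice not in {"first", "second"}:
        raise ValueError(f"unexpected judge reply: {choice!r}")
    picked_first = choice == "first"
    # Map the positional choice back to the original labels.
    return "b" if picked_first == swapped else "a"


def win_rate_b(pairs: list[tuple[str, str, str]],
               judge: PairwiseJudge, seed: int = 0) -> float:
    """Fraction of (question, answer_a, answer_b) pairs where b is preferred."""
    if not pairs:
        raise ValueError("need at least one pair")
    rng = random.Random(seed)
    wins = sum(prefer(q, a, b, judge, rng) == "b" for q, a, b in pairs)
    return wins / len(pairs)


def shorter_wins(question: str, first: str, second: str) -> str:
    """Stand-in judge for the demo: prefers the shorter answer."""
    return "first" if len(first) <= len(second) else "second"


pairs = [
    ("Where is my order?",
     "Your order shipped on Monday and should arrive within three business days.",
     "It shipped Monday; expect it in three days."),
    ("Do you sell pens?",
     "Yes, we do sell pens, and we carry quite a wide variety of them.",
     "Yes, several kinds."),
]
print(win_rate_b(pairs, shorter_wins))
```

A real conciseness judge would be an LLM prompt; the length-based stand-in only exists so the order-handling logic can be exercised offline.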
Combining Evaluators
Real-world evaluation uses multiple evaluators to measure different aspects:
Functional Correctness
Does the agent produce correct outputs and use tools properly?
Quality & Style
Is the response helpful, concise, and appropriate in tone?
Performance
Does the agent respond quickly enough for production use?
Safety
Does the agent avoid hallucinations, PII leaks, or harmful outputs?
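One way to combine the four aspects is a single table of evaluators applied to each run, as in this sketch. The run layout, the 50-word and 5-second limits, and the email regex used as a crude PII check are all assumptions chosen for illustration.

```python
import re

# Very rough email pattern, used here only as a simple PII signal.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def correct(run: dict) -> float:
    """Functional correctness: case-insensitive match with the reference."""
    return float(run["answer"].strip().lower() == run["reference"].strip().lower())


def concise(run: dict, max_words: int = 50) -> float:
    """Quality and style: answer stays under an assumed word limit."""
    return float(len(run["answer"].split()) <= max_words)


def fast_enough(run: dict, budget_s: float = 5.0) -> float:
    """Performance: latency stays within an assumed budget in seconds."""
    return float(run["latency_s"] <= budget_s)


def no_email_leak(run: dict) -> float:
    """Safety: the answer does not contain anything that looks like an email."""
    return float(EMAIL_RE.search(run["answer"]) is None)


EVALUATORS = {
    "correctness": correct,
    "conciseness": concise,
    "latency": fast_enough,
    "safety": no_email_leak,
}


def score_run(run: dict) -> dict:
    """Apply every evaluator and keep the scores separate, not averaged."""
    return {name: evaluate(run) for name, evaluate in EVALUATORS.items()}


run = {"answer": "Paris", "reference": "paris", "latency_s": 7.2}
print(score_run(run))
```

Keeping one score per aspect, rather than a single blended number, makes trade-offs such as "correct but too slow" visible.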
Interpreting Results
After running an experiment, LangSmith shows:
- Aggregate scores for each evaluator
- Pass rate across the dataset
- Individual run details with inputs, outputs, and scores
- Comparison view when you select multiple experiments
What to Look For
- Overall trends: Is your new version improving or regressing?
- Failure patterns: Do certain types of questions consistently fail?
- Trade-offs: Did you improve accuracy but hurt latency?
- Edge cases: Which specific examples are still failing?
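If you export results for your own analysis, failure patterns can be surfaced by grouping pass rates by category. The sketch below assumes each result row carries a `category` tag from the dataset metadata and a boolean `passed`; that layout and the sample rows are hypothetical.

```python
from collections import defaultdict


def pass_rate_by_category(results: list[dict]) -> dict[str, float]:
    """Compute the fraction of passing examples for each category."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for row in results:
        bucket = totals[row["category"]]
        bucket[0] += int(row["passed"])
        bucket[1] += 1
    return {category: passed / total
            for category, (passed, total) in totals.items()}


def failing_examples(results: list[dict]) -> list[str]:
    """List the ids of examples that still fail, for manual review."""
    return [row["id"] for row in results if not row["passed"]]


# Illustrative rows only; not real experiment output.
results = [
    {"id": "ex-1", "category": "inventory", "passed": True},
    {"id": "ex-2", "category": "inventory", "passed": True},
    {"id": "ex-3", "category": "policy", "passed": True},
    {"id": "ex-4", "category": "policy", "passed": False},
]

print(pass_rate_by_category(results))
print(failing_examples(results))
```

A category with a much lower pass rate than the others is usually a better place to start debugging than the overall average.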
Iteration Loop
Real-World Example: OfficeFlow Evolution
The course demonstrates iterative improvement through six versions:
- v0: Baseline (no tracing)
- v1: + LangSmith tracing
  Evaluation: Can now see what the agent is doing
- v2: + Enhanced tool instructions
  Evaluation: Tool usage improves from 60% to 85% accuracy
- v3: + Stock information policy
  Evaluation: Reduces hallucinations about inventory
- v4: + No-chunking RAG
  Evaluation: Knowledge base retrieval accuracy improves
- v5: + Conciseness improvements
  Evaluation: Pairwise comparison shows 70% prefer v5 responses
Best Practices
Start Simple
Begin with a small dataset (10-20 examples) and basic evaluators. Add complexity as you learn what matters.
Test Edge Cases
Include challenging examples: ambiguous questions, multi-step reasoning, tool failures, and adversarial inputs.
Version Your Datasets
As your agent evolves, your evaluation needs change. Create new datasets for new capabilities.
Automate Everything
Run evaluations automatically on every code change (CI/CD) to catch regressions immediately.
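A CI step for this could compare the candidate's aggregate scores against a stored baseline and fail the build on a regression. The following is a minimal sketch under stated assumptions: higher scores are better, the 0.02 tolerance is arbitrary, and the function names and sample scores are invented for illustration.

```python
def find_regressions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.02) -> dict[str, tuple]:
    """Return metrics where the candidate is worse than baseline.

    Assumes higher is better. A metric missing from the candidate
    counts as a regression so that it cannot silently disappear.
    """
    regressions = {}
    for metric, base in baseline.items():
        new = candidate.get(metric)
        if new is None or base - new > tolerance:
            regressions[metric] = (base, new)
    return regressions


def ci_exit_code(baseline: dict[str, float],
                 candidate: dict[str, float]) -> int:
    """Return 0 when the build may pass and 1 when it should fail."""
    regressions = find_regressions(baseline, candidate)
    for metric, (base, new) in regressions.items():
        print(f"REGRESSION {metric}: baseline={base} candidate={new}")
    return 1 if regressions else 0


# Illustrative scores only.
baseline_scores = {"correctness": 0.85, "conciseness": 0.70}
candidate_scores = {"correctness": 0.80, "conciseness": 0.72}
print(ci_exit_code(baseline_scores, candidate_scores))
```

In a pipeline, the returned code would typically be passed to `sys.exit` at the end of the evaluation script. The tolerance absorbs small run-to-run noise, which matters when some evaluators are LLM judges.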
Balance Speed and Quality
Use fast evaluators (code-based) for rapid iteration, and slower ones (LLM-judge) for final validation.
Continuous Monitoring
In production, run evaluators on random samples of traffic to detect quality drift over time.
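Sampling production traffic for evaluation can be as simple as the sketch below. The run records, the 10% rate, and the function name are assumptions; a hosted platform would normally handle sampling through its own configuration.

```python
import random


def sample_for_evaluation(runs: list[dict], rate: float,
                          rng: random.Random) -> list[dict]:
    """Keep each run independently with probability `rate`."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    return [run for run in runs if rng.random() < rate]


# Illustrative traffic; a seeded generator keeps the demo reproducible.
production_runs = [{"id": i} for i in range(1000)]
sampled = sample_for_evaluation(production_runs, rate=0.1,
                                rng=random.Random(42))
print(f"evaluating {len(sampled)} of {len(production_runs)} runs")
```

Sampling keeps the cost of slower evaluators, such as LLM judges, proportional to the rate rather than to total traffic.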
Common Pitfalls
Pitfall 1: Dataset Overfitting
If you optimize only for your test dataset, you might hurt generalization. Solution: Maintain separate datasets for development and final validation.
Pitfall 2: Flaky Evaluators
LLM-as-judge evaluators can be inconsistent. Solution: Run evaluations multiple times and look at variance. Use temperature=0 for judges to reduce randomness.
Pitfall 3: Ignoring Latency
Focusing only on quality can lead to slow agents. Solution: Always include performance evaluators alongside quality metrics.
Pitfall 4: Binary Thinking
Not every improvement is clear-cut. Sometimes v2 is better at X but worse at Y. Solution: Use multiple evaluators and make trade-offs explicit. Document why you chose one version over another.
From Evaluation to Production
Once you’re confident in your agent’s performance:
- Run final validation on a held-out dataset
- Set quality thresholds for continuous monitoring
- Configure online evaluators to score production traffic
- Set up alerts for when metrics drop below thresholds
- Schedule periodic re-evaluation to catch model drift
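The threshold and alert steps above can be sketched as a check over recent online-evaluator scores. The threshold values, metric names, and message format here are hypothetical; in practice the alert would be routed to whatever notification channel your team uses.

```python
from statistics import mean

# Hypothetical minimum acceptable mean score per metric.
THRESHOLDS = {"correctness": 0.8, "safety": 0.99}


def check_thresholds(recent_scores: dict[str, list[float]],
                     thresholds: dict[str, float]) -> list[str]:
    """Return one alert message per metric that is below its threshold."""
    alerts = []
    for metric, minimum in thresholds.items():
        values = recent_scores.get(metric, [])
        if not values:
            # Missing data is itself worth an alert.
            alerts.append(f"{metric}: no recent scores")
            continue
        average = mean(values)
        if average < minimum:
            alerts.append(
                f"{metric}: mean {average:.2f} below threshold {minimum:.2f}")
    return alerts


# Illustrative window of recent scores from sampled production traffic.
recent = {"correctness": [1, 1, 0, 0], "safety": [1, 1, 1, 1]}
for message in check_thresholds(recent, THRESHOLDS):
    print("ALERT", message)
```

Running the same check on a schedule against a fixed dataset, not just live traffic, helps separate model drift from shifts in what users ask.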
Next Steps
Creating Datasets
Learn techniques for building comprehensive test datasets
Writing Evaluators
Deep dive into building custom evaluators for your use case
Analyzing Results
Master the LangSmith UI for experiment comparison and debugging
Production Monitoring
Set up continuous evaluation for deployed agents