Test datasets are the foundation of reliable agent evaluation. They provide consistent, repeatable inputs that let you measure your agent’s performance over time and detect regressions as you iterate.
Why Datasets Matter
Datasets enable you to:
- Track performance across agent versions
- Catch regressions before deploying changes
- Benchmark systematically against production scenarios
- Share evaluation criteria with your team
Start with 10-25 examples covering your core use cases. You can expand the dataset as you discover edge cases through production traces.
Dataset Structure
A dataset is a collection of test cases, where each case typically contains:
- Input: The user question or scenario
- Expected Output (optional): Reference answer or behavior
- Metadata (optional): Tags, difficulty level, scenario type
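The three fields above can be sketched as a small data class. This is only an illustration: the class name `DatasetCase` and the metadata keys are assumptions, not part of any library.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class DatasetCase:
    """One test case: a required input plus an optional reference and metadata.

    Hypothetical helper type for illustration; not a LangSmith class.
    """

    inputs: dict[str, Any]
    outputs: Optional[dict[str, Any]] = None  # expected output is optional
    metadata: dict[str, Any] = field(default_factory=dict)  # tags, difficulty, etc.


# A fully specified case (the metadata keys are made up for this sketch).
case = DatasetCase(
    inputs={"question": "What is your return policy?"},
    outputs={"expected_tool": "search_knowledge_base"},
    metadata={"scenario": "policy", "difficulty": "easy"},
)

# A minimal case: only the input is required.
minimal = DatasetCase(inputs={"question": "How many reams of copy paper do you have?"})
```

Keeping the expected output and metadata optional mirrors the structure described above: an input alone is enough to run the agent, and the other fields only matter once you start scoring or slicing results.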
Simple CSV Format
The easiest way to start is with a CSV file:
officeflow-dataset.csv
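As a rough sketch of what such a file might contain and how it could be loaded, the snippet below parses CSV text into the inputs/outputs shape used for dataset examples. The column names `question` and `expected_tool` are assumptions chosen to match the examples on this page; the actual file may use different columns.

```python
import csv
import io

# Hypothetical contents of officeflow-dataset.csv; the column names are assumptions.
CSV_TEXT = """question,expected_tool
How many reams of copy paper do you have?,query_database
What is your return policy?,search_knowledge_base
"""


def load_examples(csv_text: str) -> list[dict]:
    """Convert CSV rows into dicts with separate inputs and outputs."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "inputs": {"question": row["question"]},
            "outputs": {"expected_tool": row["expected_tool"]},
        }
        for row in rows
    ]


examples = load_examples(CSV_TEXT)
```

To read from disk instead, the same `csv.DictReader` call works on a file opened with `open("officeflow-dataset.csv", newline="")`.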
Creating Datasets in LangSmith
from langsmith import Client

client = Client()

# Create dataset
dataset = client.create_dataset(
    dataset_name="officeflow-dataset",
    description="Customer support queries for OfficeFlow"
)

# Add examples
examples = [
    {
        "inputs": {"question": "How many reams of copy paper do you have?"},
        "outputs": {"expected_tool": "query_database"}
    },
    {
        "inputs": {"question": "What is your return policy?"},
        "outputs": {"expected_tool": "search_knowledge_base"}
    }
]

for example in examples:
    client.create_example(
        dataset_id=dataset.id,
        inputs=example["inputs"],
        outputs=example.get("outputs")
    )
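You can also add examples from production traces:

from langsmith import Client

client = Client()

# Look up the existing dataset so its ID is available below
dataset = client.read_dataset(dataset_name="officeflow-dataset")

# Find interesting traces: runs with a user_rating feedback score of 4 or higher
runs = client.list_runs(
    project_name="production",
    filter='and(eq(feedback_key, "user_rating"), gte(feedback_score, 4))'
)

# Add to dataset
for run in runs:
    client.create_example(
        dataset_id=dataset.id,
        inputs=run.inputs,
        outputs=run.outputs
    )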
Best Practices
Cover Core Scenarios
Ensure your dataset includes:
- Happy path queries - Straightforward requests your agent should handle easily
- Edge cases - Ambiguous, multi-part, or unusual requests
- Error scenarios - Invalid inputs, out-of-scope questions
- Complex workflows - Multi-step interactions requiring multiple tools
Example Categories for OfficeFlow
dataset_categories.py
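One way to keep the four scenario types listed above visible is to tag each example with a category and check coverage before running evaluations. The sketch below is illustrative only: the category names, the sample questions, and the `coverage_report` helper are all assumptions, not the original contents of this file.

```python
from collections import Counter

# Hypothetical category labels mirroring the four scenario types listed above.
REQUIRED_CATEGORIES = {"happy_path", "edge_case", "error_scenario", "complex_workflow"}

# Made-up OfficeFlow questions, one or more per category.
CATEGORIZED_EXAMPLES = [
    {"question": "What is your return policy?", "category": "happy_path"},
    {"question": "How many reams of copy paper do you have?", "category": "happy_path"},
    {"question": "Do you have that paper from last time, and can I return it?", "category": "edge_case"},
    {"question": "What's the weather in Paris?", "category": "error_scenario"},
    {"question": "Check stock for copy paper, then tell me its return window.", "category": "complex_workflow"},
]


def coverage_report(examples: list[dict]) -> tuple[dict, list[str]]:
    """Return per-category counts and a sorted list of categories with no examples."""
    counts = Counter(example["category"] for example in examples)
    missing = sorted(REQUIRED_CATEGORIES - set(counts))
    return dict(counts), missing


counts, missing = coverage_report(CATEGORIZED_EXAMPLES)
```

An empty `missing` list means every category has at least one example. The category could also be stored as example metadata so results can be sliced by scenario type later.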
Add Reference Outputs
While not required, reference outputs help with evaluation:
with_expected_outputs.py
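To illustrate how a reference output might be used, here is a minimal sketch of a comparison that tolerates missing references. The `tool` key in the agent's output and the `tool_matches` helper are hypothetical; `expected_tool` follows the examples earlier on this page.

```python
from typing import Optional


def tool_matches(actual: dict, reference: Optional[dict]) -> Optional[bool]:
    """Compare the tool the agent chose against the reference, if one exists.

    Returns None when the example has no reference output, so unscored
    examples are not counted as failures.
    """
    if not reference or "expected_tool" not in reference:
        return None  # nothing to compare against
    # "tool" is an assumed key for the tool name in the agent's output.
    return actual.get("tool") == reference["expected_tool"]


# With a reference, the result is a plain pass/fail.
passed = tool_matches({"tool": "query_database"}, {"expected_tool": "query_database"})

# Without a reference, the example is skipped rather than failed.
skipped = tool_matches({"tool": "query_database"}, None)
```

Returning `None` for examples without a reference keeps them out of accuracy calculations, which matters when only part of the dataset has expected outputs.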
Using Your Dataset
Once created, reference your dataset by name in experiments:
run_experiment.py
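A minimal sketch of what referencing the dataset by name could look like, assuming the `evaluate` function from `langsmith.evaluation` and the classic `(run, example)` evaluator signature. The `target` stub, the `tool` output key, and the `officeflow-baseline` prefix are placeholders for illustration; check the SDK version you have installed before relying on the exact arguments.

```python
def target(inputs: dict) -> dict:
    """Stand-in for the agent under test (hypothetical); replace with a real call."""
    question = inputs["question"].lower()
    tool = "query_database" if "how many" in question else "search_knowledge_base"
    return {"tool": tool}


def correct_tool(run, example) -> dict:
    """Score 1 if the tool in the run output matches the example's expected_tool."""
    expected = (example.outputs or {}).get("expected_tool")
    actual = (run.outputs or {}).get("tool")
    return {"key": "correct_tool", "score": int(expected is not None and actual == expected)}


def run_experiment():
    """Run the target over the dataset; requires LangSmith credentials."""
    # Imported lazily so the helpers above can be used without the SDK configured.
    from langsmith.evaluation import evaluate

    return evaluate(
        target,
        data="officeflow-dataset",  # dataset referenced by name
        evaluators=[correct_tool],
        experiment_prefix="officeflow-baseline",  # assumed label for this run
    )
```

Calling `run_experiment()` sends each dataset example through `target` and records the `correct_tool` score per example.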
Datasets are versioned in LangSmith. You can update examples without breaking existing experiments.
Next Steps
Run Experiments
Connect your dataset to evaluators
Code-based Eval
Write deterministic evaluators