Running Experiments

Experiments connect your agent, dataset, and evaluators to produce quantitative measurements of performance. Every experiment creates a snapshot you can compare against future versions.

Anatomy of an Experiment

An experiment requires three components:

Target: The agent or function to evaluate
Dataset: Test cases with inputs (and optionally expected outputs)
Evaluators: Functions that score the agent’s outputs

experiment_components.py

from langsmith import evaluate

# 1. Your agent (target)
def your_agent(inputs: dict) -> dict:
    question = inputs["question"]
    # ... agent logic ...
    return {"answer": response}

# 2. Dataset (by name)
dataset_name = "officeflow-dataset"

# 3. Evaluators
def check_mentions_product(outputs: dict) -> bool:
    return "officeflow" in outputs["answer"].lower()

# Run experiment
results = evaluate(
    your_agent,
    data=dataset_name,
    evaluators=[check_mentions_product]
)

Basic Experiment

Here’s a minimal example from the OfficeFlow agent:

from dotenv import load_dotenv
from langsmith import evaluate

load_dotenv()

# Target: Your agent function
def dummy_app(inputs: dict) -> dict:
    return {
        "response": "Sure! In OfficeFlow, you can reset your password from the settings page."
    }

# Evaluator: Check if response mentions brand
def mentions_officeflow(outputs: dict) -> bool:
    return "officeflow" in outputs["response"].lower()

# Run experiment
results = evaluate(
    dummy_app,
    data="officeflow-dataset",
    evaluators=[mentions_officeflow]
)

The evaluate function automatically runs your agent on every example in the dataset and applies all evaluators to the outputs.

Async Agents

For agents using async/await, use aevaluate:

async_experiment.py

import asyncio
from langsmith import aevaluate
from agent_v5 import chat, load_knowledge_base

async def chat_wrapper(inputs: dict) -> dict:
    question = inputs.get("question", "")
    result = await chat(question)
    return {"answer": result["output"], "messages": result["messages"]}

async def main():
    # Load any required resources
    await load_knowledge_base(kb_dir="./knowledge_base")
    
    # Run evaluation
    results = await aevaluate(
        chat_wrapper,
        data="officeflow-dataset"
    )
    return results

if __name__ == "__main__":
    asyncio.run(main())

Experiment Naming

Experiments are automatically named with timestamps. Use experiment_prefix to organize results:

named_experiments.py

from langsmith import evaluate

results = evaluate(
    agent_v5,
    data="officeflow-dataset",
    evaluators=[schema_before_query],
    experiment_prefix="schema-check-v5"  # Appears as "schema-check-v5-{timestamp}"
)

Real-World Example

Here’s how the OfficeFlow course evaluates the schema-checking behavior:

run_eval.py

import asyncio
import sys
from pathlib import Path
from langsmith import evaluate
from langsmith import uuid7

# Import your agent
import agent_v5
from agent_v5 import chat, load_knowledge_base
from eval_schema_check import schema_before_query

async def setup():
    """Load knowledge base before running evals."""
    kb_dir = "./knowledge_base"
    await load_knowledge_base(kb_dir)

def run_agent(inputs: dict) -> dict:
    """Invoke the agent with a fresh thread_id each time."""
    agent_v5.thread_id = str(uuid7())
    return asyncio.run(chat(inputs["question"]))

if __name__ == "__main__":
    asyncio.run(setup())
    
    results = evaluate(
        run_agent,
        data="officeflow-dataset",
        evaluators=[schema_before_query],
        experiment_prefix="schema-check-v5",
    )

Creating a fresh thread_id for each example ensures test isolation - one example’s state doesn’t affect another.

Experiment Results

The evaluate function returns an ExperimentResults object:

analyze_results.py

results = evaluate(
    your_agent,
    data="officeflow-dataset",
    evaluators=[mentions_officeflow]
)

print(f"Experiment: {results.experiment_name}")
print(f"Dataset: {results.dataset_name}")
print(f"Results URL: {results.experiment_url}")

# Access individual evaluator results
for result in results.results:
    print(f"Input: {result.example.inputs}")
    print(f"Output: {result.output}")
    print(f"Scores: {result.evaluation_results}")

Viewing Results

Experiments appear in the LangSmith UI with:

Aggregate metrics - Pass rate, average scores
Per-example results - See which inputs failed
Trace links - Debug individual runs
Comparison view - Compare against other experiments

Access Experiment Results

Run your experiment

Click the results URL printed to console

Or navigate to Datasets → Your Dataset → Experiments tab

Compare Experiments

Select two experiments from the same dataset

Click Compare

View side-by-side metrics and identify regressions

Multiple Evaluators

Pass multiple evaluators to measure different aspects:

multiple_evaluators.py

from langsmith import evaluate

def mentions_brand(outputs: dict) -> bool:
    return "officeflow" in outputs["answer"].lower()

def is_concise(outputs: dict) -> bool:
    return len(outputs["answer"].split()) < 50

def uses_tools(run, example) -> dict:
    messages = run.outputs.get("messages", [])
    used_tools = any(msg.get("tool_calls") for msg in messages)
    return {"score": 1 if used_tools else 0}

results = evaluate(
    your_agent,
    data="officeflow-dataset",
    evaluators=[
        mentions_brand,
        is_concise,
        uses_tools
    ]
)

Using Dataset-Bound Evaluators

You can attach evaluators directly to datasets in the LangSmith UI. These run automatically:

auto_evaluators.py

from langsmith import aevaluate

# Evaluators bound to the dataset in UI will run automatically
results = await aevaluate(
    chat_wrapper,
    data="officeflow-dataset"  # No evaluators specified - uses dataset's bound evaluators
)

Dataset-bound evaluators are useful for organization-wide standards that should apply to all experiments on that dataset.

Best Practices

Test Isolation

Ensure each example runs independently:

test_isolation.py

import uuid

def run_agent(inputs: dict) -> dict:
    # Create fresh state for each run
    thread_id = str(uuid.uuid4())
    
    # Reset any global state
    agent.reset_state()
    
    return agent.chat(inputs["question"], thread_id=thread_id)

Performance Optimization

For large datasets, use async evaluation:

parallel_eval.py

from langsmith import aevaluate

results = await aevaluate(
    async_agent,
    data="large-dataset",
    evaluators=[evaluator1, evaluator2],
    max_concurrency=10  # Run 10 examples in parallel
)

Reproducibility

Set seeds and model parameters for consistent results:

reproducible_eval.py

def your_agent(inputs: dict) -> dict:
    response = llm.invoke(
        inputs["question"],
        temperature=0,  # Deterministic sampling
        seed=42         # Reproducible outputs
    )
    return {"answer": response}

Get Started

Core Concepts

Building Agents

Evaluation

Production

Anatomy of an Experiment

Basic Experiment

Async Agents

Experiment Naming

Real-World Example

Experiment Results

Viewing Results

Multiple Evaluators

Using Dataset-Bound Evaluators

Best Practices

Test Isolation

Performance Optimization

Reproducibility

Next Steps

Code-based Eval

LLM-as-Judge

Build docs developers (and LLMs) love

Get Started

Core Concepts

Building Agents

Evaluation

Production

Documentation Index

​Anatomy of an Experiment

​Basic Experiment

​Async Agents

​Experiment Naming

​Real-World Example

​Experiment Results

​Viewing Results

​Multiple Evaluators

​Using Dataset-Bound Evaluators

​Best Practices

​Test Isolation

​Performance Optimization

​Reproducibility

​Next Steps

Code-based Eval

LLM-as-Judge

Build docs developers (and LLMs) love

Anatomy of an Experiment

Basic Experiment

Async Agents

Experiment Naming

Real-World Example

Experiment Results

Viewing Results

Multiple Evaluators

Using Dataset-Bound Evaluators

Best Practices

Test Isolation

Performance Optimization

Reproducibility

Next Steps