Phoenix provides a comprehensive evaluation framework for LLM applications, enabling you to assess quality, accuracy, and performance at scale. Evaluations help you understand model behavior, catch issues early, and continuously improve your AI systems.
What is Evaluation?
Evaluation is the process of measuring how well your LLM application performs on specific criteria. Phoenix supports multiple evaluation approaches:
- LLM-as-a-Judge: Use an LLM to evaluate outputs based on criteria like correctness, relevance, or safety
- Code-based Evaluations: Write custom Python functions to check outputs programmatically
- Pre-built Metrics: Leverage battle-tested evaluators for common tasks like hallucination detection
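As a minimal sketch of the code-based approach, the function below checks an output programmatically without calling an LLM. The function name `contains_required_terms` and the shape of its return value are hypothetical, chosen for illustration; they are not part of the Phoenix API.

```python
from typing import Iterable


def contains_required_terms(output: str, required_terms: Iterable[str]) -> dict:
    """Hypothetical code-based evaluator: checks that every required term appears."""
    text = output.lower()
    terms = list(required_terms)
    missing = [t for t in terms if t.lower() not in text]
    # Fraction of required terms found; 1.0 when nothing is required.
    score = 1.0 if not terms else (len(terms) - len(missing)) / len(terms)
    return {
        "score": score,
        "label": "pass" if not missing else "fail",
        "explanation": "all terms present" if not missing else f"missing: {missing}",
    }


result = contains_required_terms(
    "Paris is the capital of France.", ["Paris", "France"]
)
print(result["label"], result["score"])
```

A check like this is deterministic and cheap, which makes it a reasonable first filter before spending LLM calls on subjective criteria.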
Client-Side vs Server-Side Evaluations
Client-Side Evaluations
Client-side evaluations run in your Python environment using the phoenix.evals library. This approach gives you:
- Full control over evaluation logic and prompts
- Flexibility to use any LLM provider (OpenAI, Anthropic, etc.)
- Fast iteration during development
- Offline evaluation on datasets without needing a Phoenix server
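The provider flexibility and offline iteration described above can be sketched in plain Python by passing the LLM in as a callable. Everything named here (`JUDGE_TEMPLATE`, `judge_correctness`, `stub_llm`) is a hypothetical stand-in rather than the phoenix.evals API, and the stub replaces a real provider call so the example runs without network access.

```python
from typing import Callable

# Hypothetical prompt template; the wording is an assumption for illustration.
JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Is the answer correct? Reply with exactly one word: correct or incorrect."
)


def judge_correctness(question: str, answer: str, llm: Callable[[str], str]) -> dict:
    """Ask any LLM (passed in as a callable) to classify an answer."""
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer)
    reply = llm(prompt).strip().lower()
    # Anything other than the exact expected word is treated as "incorrect".
    label = "correct" if reply == "correct" else "incorrect"
    return {"label": label, "score": 1.0 if label == "correct" else 0.0}


def stub_llm(prompt: str) -> str:
    """Stand-in for a real provider call so the sketch runs offline."""
    return "correct" if "Answer: 4" in prompt else "incorrect"


# A tiny in-memory dataset evaluated without any server.
dataset = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is 2 + 2?", "answer": "5"},
]
results = [judge_correctness(r["question"], r["answer"], stub_llm) for r in dataset]
print([r["label"] for r in results])
```

Swapping providers then amounts to replacing `stub_llm` with a function that calls the model of your choice and returns its text reply.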
Server-Side Evaluations
Server-side evaluations run on the Phoenix platform and automatically evaluate traces as they’re collected. Benefits include:
- Automatic evaluation of production traffic
- Real-time monitoring of quality metrics
- Historical tracking and trend analysis
- Team collaboration on evaluation criteria
Evaluation Metrics
Phoenix evaluations produce Score objects containing:
- Numeric score (e.g., 0.0 to 1.0)
- Categorical classification (e.g., “correct”, “incorrect”)
- The LLM’s reasoning for the score (for LLM-as-judge evaluations)
- The evaluator name (e.g., “faithfulness”)
- Evaluation type: “llm”, “code”, or “human”
- Whether to maximize or minimize the score
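To make the shape of such a result concrete, here is an illustrative dataclass carrying the six pieces of information listed above. The class name `ScoreSketch` and its field names are assumptions made for this sketch; consult the phoenix.evals reference for the actual Score class and its attributes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScoreSketch:
    """Illustrative stand-in for the fields described above (not the library class)."""

    name: str                          # evaluator name, e.g. "faithfulness"
    score: Optional[float] = None      # numeric score, e.g. 0.0 to 1.0
    label: Optional[str] = None        # categorical classification
    explanation: Optional[str] = None  # judge's reasoning, if any
    kind: str = "code"                 # "llm", "code", or "human"
    direction: str = "maximize"        # "maximize" or "minimize"

    def __post_init__(self) -> None:
        # Reject values outside the documented choices.
        if self.kind not in {"llm", "code", "human"}:
            raise ValueError(f"unknown kind: {self.kind!r}")
        if self.direction not in {"maximize", "minimize"}:
            raise ValueError(f"unknown direction: {self.direction!r}")


example = ScoreSketch(
    name="faithfulness",
    score=0.9,
    label="correct",
    explanation="The answer is supported by the retrieved context.",
    kind="llm",
)
print(example.name, example.score, example.direction)
```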
Viewing Evaluation Results
In Python
Evaluation results are returned as Score objects that you can inspect programmatically.
In DataFrames
When evaluating dataframes, results are added as new columns.
In Phoenix UI
When evaluations are traced (automatic in Phoenix 2.0), they appear in the Phoenix UI:
- Navigate to the Traces view
- Filter by evaluator name or score range
- Inspect individual traces to see evaluation details
- View aggregate metrics and distributions
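The filtering and aggregate views listed above can also be approximated locally when you have results in hand. The sketch below uses plain dictionaries with hypothetical keys (`name`, `score`, `label`) and invented sample values; it does not query Phoenix.

```python
from collections import Counter
from statistics import mean

# Hypothetical evaluation results; in practice these would come from your evaluators.
results = [
    {"name": "correctness", "score": 1.0, "label": "correct"},
    {"name": "correctness", "score": 0.0, "label": "incorrect"},
    {"name": "correctness", "score": 1.0, "label": "correct"},
    {"name": "correctness", "score": 1.0, "label": "correct"},
]

# Filter by evaluator name and score range.
selected = [r for r in results if r["name"] == "correctness" and r["score"] >= 0.5]

# Aggregate metric and label distribution across all results.
mean_score = mean(r["score"] for r in results)
label_counts = Counter(r["label"] for r in results)

print(f"mean score: {mean_score:.2f}")
print(f"label distribution: {dict(label_counts)}")
print(f"results with score >= 0.5: {len(selected)}")
```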
Tracing Evaluations
Phoenix automatically traces all evaluations, creating observability into:
- Evaluation inputs: What data was evaluated
- LLM calls: Model, prompt, and response for LLM-as-judge
- Scores: Complete Score objects with explanations
- Performance: Latency and error rates
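To illustrate the kind of information such a trace carries, the following sketch wraps an evaluator so that each call records its inputs, result, latency, and any error. This is not how Phoenix implements tracing; `TRACE_LOG`, `traced`, and `exact_match` are hypothetical names, and an in-memory list stands in for a trace backend.

```python
import time
from typing import Any, Callable, Dict, List

# In-memory record list standing in for a trace backend (hypothetical).
TRACE_LOG: List[Dict[str, Any]] = []


def traced(evaluator: Callable[..., Dict[str, Any]]) -> Callable[..., Dict[str, Any]]:
    """Wrap an evaluator so each call records inputs, result, latency, and errors."""

    def wrapper(**inputs: Any) -> Dict[str, Any]:
        record: Dict[str, Any] = {"evaluator": evaluator.__name__, "inputs": inputs}
        start = time.perf_counter()
        try:
            record["result"] = evaluator(**inputs)
            record["error"] = None
            return record["result"]
        except Exception as exc:
            record["result"] = None
            record["error"] = repr(exc)
            raise
        finally:
            # Runs on both success and failure, so every call is logged.
            record["latency_s"] = time.perf_counter() - start
            TRACE_LOG.append(record)

    return wrapper


@traced
def exact_match(output: str, expected: str) -> Dict[str, Any]:
    """Hypothetical code-based evaluator used to exercise the wrapper."""
    matched = output == expected
    return {"score": float(matched), "label": "match" if matched else "mismatch"}


exact_match(output="42", expected="42")
print(TRACE_LOG[0]["evaluator"], TRACE_LOG[0]["result"], TRACE_LOG[0]["error"])
```

For an LLM-as-judge evaluator, the same record would also need the model, prompt, and response, which this sketch omits.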
Common Evaluation Patterns
Quality Checks
Evaluate outputs for correctness, relevance, and completeness.
RAG Evaluations
Evaluate retrieval-augmented generation systems.
Tool Calling
Evaluate agent tool selection and invocation.
Next Steps
- LLM-as-a-Judge: Learn about using LLMs to evaluate outputs
- Pre-built Metrics: Explore ready-to-use evaluation metrics
- Custom Evaluators: Build your own evaluation logic
- Batch Evaluation: Evaluate datasets at scale