Phoenix provides a comprehensive evaluation framework to assess LLM outputs using both code-based and LLM-based evaluators. Evaluations can run client-side during development or server-side on production traces.
LLM-as-a-judge evaluation uses language models to assess the quality of LLM outputs based on specific criteria. This approach scales better than human evaluation while maintaining high correlation with human judgments.
Phoenix evaluations produce structured results with three components:
Score: Numeric value (typically 0-1 or boolean) indicating quality
Code-based evaluators provide deterministic validation without LLM calls:
Contains Keyword
Regex Match
JSON Parsable
Contains All Keywords
Check whether the output contains a specific keyword.
from phoenix.experiments.evaluators import ContainsKeyword

evaluator = ContainsKeyword(
    keyword="Phoenix",
    name="mentions_phoenix"
)
result = evaluator.evaluate(
    output="Phoenix is great for observability"
)
# Returns: EvaluationResult(score=1.0, label="true")
Validate output against a regular expression pattern.
from phoenix.experiments.evaluators import MatchesRegex

evaluator = MatchesRegex(
    pattern=r"^\d{3}-\d{4}$",  # Phone number format
    name="valid_phone_format"
)
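To see which strings the pattern above accepts, the same check can be sketched with Python's standard `re` module. This illustrates the matching logic only and is not Phoenix's implementation; the helper name `matches_phone_format` is hypothetical.

```python
import re

# Same pattern as above: three digits, a hyphen, four digits,
# anchored at both ends of the string.
PHONE_PATTERN = re.compile(r"^\d{3}-\d{4}$")


def matches_phone_format(output: str) -> bool:
    """Hypothetical helper: True when the output matches the pattern."""
    return PHONE_PATTERN.search(output) is not None


print(matches_phone_format("555-1234"))       # True
print(matches_phone_format("call 555-1234"))  # False
```

Because the pattern is anchored with `^` and `$`, surrounding text causes the check to fail; drop the anchors if the number may appear anywhere in the output.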
Verify that output is valid JSON.
from phoenix.experiments.evaluators import JSONParsable

evaluator = JSONParsable(name="valid_json")
result = evaluator.evaluate(
    output='{"status": "success"}'
)
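Conceptually, a JSON validity check amounts to attempting a parse and reporting whether it succeeded. The following is a minimal sketch using the standard `json` module, assuming the output arrives as a string; `is_json_parsable` is a hypothetical name and may differ from how Phoenix implements the evaluator.

```python
import json


def is_json_parsable(output: str) -> bool:
    """Hypothetical helper: True when the output parses as JSON."""
    try:
        json.loads(output)
    except (json.JSONDecodeError, TypeError):
        # JSONDecodeError: malformed JSON text.
        # TypeError: input was not a str, bytes, or bytearray.
        return False
    return True


print(is_json_parsable('{"status": "success"}'))  # True
print(is_json_parsable("{status: success}"))      # False
```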
Require multiple keywords to be present.
from phoenix.experiments.evaluators import ContainsAllKeywords

evaluator = ContainsAllKeywords(
    keywords=["observability", "tracing", "evaluation"],
    name="covers_key_topics"
)
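The "all keywords" requirement can be pictured as a conjunction of substring checks. The sketch below assumes case-sensitive substring matching, which may not match Phoenix's behavior exactly; `contains_all_keywords` is a hypothetical helper, not part of the library.

```python
from typing import Sequence


def contains_all_keywords(output: str, keywords: Sequence[str]) -> bool:
    """Hypothetical helper: True only when every keyword occurs in the output."""
    # Assumption: plain, case-sensitive substring matching.
    return all(keyword in output for keyword in keywords)


topics = ["observability", "tracing", "evaluation"]
print(contains_all_keywords("Phoenix covers observability, tracing, and evaluation.", topics))  # True
print(contains_all_keywords("Phoenix covers tracing only.", topics))                            # False
```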
Create evaluators for domain-specific criteria using LLMCriteriaEvaluator (from src/phoenix/experiments/evaluators/llm_evaluators.py):
from phoenix.experiments.evaluators import LLMCriteriaEvaluator
from phoenix.evals import OpenAIModel

evaluator = LLMCriteriaEvaluator(
    model=OpenAIModel(model="gpt-4"),
    criteria="professionalism",
    description="maintains a respectful tone and appropriate formality",
    name="professional_tone"
)
result = evaluator.evaluate(
    output="Thank you for your inquiry. I'd be happy to assist."
)
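To make the general LLM-as-a-judge pattern concrete, here is a rough, library-free sketch of how a criterion and its description might be turned into a prompt and the model's answer mapped to a score and label. Everything in it is an assumption for illustration: the prompt wording, the `judge_criteria` and `fake_model` names, and the yes/no parsing are hypothetical and do not reflect Phoenix's actual prompt or internals.

```python
from typing import Callable, Dict, Union


def judge_criteria(
    output: str,
    criteria: str,
    description: str,
    ask_model: Callable[[str], str],
) -> Dict[str, Union[float, str]]:
    """Hypothetical judge: ask a model a yes/no question and map the answer to a result."""
    # Assumed prompt wording; a real evaluator may phrase this differently.
    prompt = (
        f"Decide whether the text below meets the criterion '{criteria}', "
        f"meaning it {description}. Answer 'yes' or 'no'.\n\n"
        f"Text: {output}"
    )
    answer = ask_model(prompt).strip().lower()
    meets = answer.startswith("yes")
    return {"score": 1.0 if meets else 0.0, "label": str(meets).lower()}


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return "Yes"


result = judge_criteria(
    output="Thank you for your inquiry. I'd be happy to assist.",
    criteria="professionalism",
    description="maintains a respectful tone and appropriate formality",
    ask_model=fake_model,
)
print(result)  # {'score': 1.0, 'label': 'true'}
```

Passing the model as a plain callable keeps the sketch testable without network access; in practice that callable would wrap a real model client.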