The @arizeai/phoenix-evals package provides a framework for evaluating LLM outputs with built-in LLM-based evaluators and custom evaluator functions.
Installation
npm install @arizeai/phoenix-evals
Quick Start
import { createHallucinationEvaluator } from '@arizeai/phoenix-evals';
const evaluator = createHallucinationEvaluator({
model: 'gpt-4o'
});
const result = await evaluator({
output: 'The capital of France is Paris.',
context: 'France is a country in Europe with Paris as its capital.'
});
console.log(result);
// {
// name: 'hallucination',
// score: 0.0,
// label: 'factual',
// explanation: 'The output is fully supported by the context.'
// }
Built-in Evaluators
Phoenix provides ready-to-use evaluators for common LLM evaluation tasks.
Hallucination / Faithfulness
Detects when the model generates information not supported by the context.
import { createHallucinationEvaluator } from '@arizeai/phoenix-evals';
const evaluator = createHallucinationEvaluator({
model: 'gpt-4o',
temperature: 0.0
});
const result = await evaluator({
output: 'Paris is the capital of France.',
context: 'France is a European country with Paris as its capital city.'
});
Alias: createFaithfulnessEvaluator() (same functionality)
Required fields:
output: The LLM’s response
context: The context/documents provided to the LLM
Document Relevance
Evaluates if retrieved documents are relevant to the query.
import { createDocumentRelevanceEvaluator } from '@arizeai/phoenix-evals';
const evaluator = createDocumentRelevanceEvaluator({
model: 'gpt-4o'
});
const result = await evaluator({
input: 'What is the capital of France?',
context: 'France is a European country. Paris is its capital.'
});
Required fields:
input: The user’s query
context: The retrieved document(s)
Correctness
Compares the output against a reference answer.
import { createCorrectnessEvaluator } from '@arizeai/phoenix-evals';
const evaluator = createCorrectnessEvaluator({
model: 'gpt-4o'
});
const result = await evaluator({
output: 'Paris',
expected: 'Paris',
input: 'What is the capital of France?'
});
Required fields:
output: The LLM’s response
expected: The reference/correct answer
Optional fields:
input: The original query (recommended)
Conciseness
Evaluates if the response is appropriately concise.
import { createConcisenessEvaluator } from '@arizeai/phoenix-evals';
const evaluator = createConcisenessEvaluator({
model: 'gpt-4o'
});
const result = await evaluator({
input: 'What is 2+2?',
output: '2+2 equals 4.'
});
Required fields:
input: The user’s query
output: The LLM’s response
Refusal
Detects when the model inappropriately refuses to answer.
import { createRefusalEvaluator } from '@arizeai/phoenix-evals';
const evaluator = createRefusalEvaluator({
model: 'gpt-4o'
});
const result = await evaluator({
input: 'What is the weather?',
output: 'I cannot provide weather information.'
});
Required fields:
input: The user’s query
output: The LLM’s response
Tool Calling
Evaluate tool/function calling behavior:
import {
createToolSelectionEvaluator,
createToolInvocationEvaluator,
createToolResponseHandlingEvaluator
} from '@arizeai/phoenix-evals';
// Check if the right tool was selected
const toolSelection = createToolSelectionEvaluator({
model: 'gpt-4o'
});
const result1 = await toolSelection({
input: 'Get the weather in Paris',
output: 'Called get_weather(location="Paris")',
tools: ['get_weather', 'get_time', 'search_web']
});
// Check if tool was invoked correctly
const toolInvocation = createToolInvocationEvaluator({
model: 'gpt-4o'
});
const result2 = await toolInvocation({
input: 'Get weather for Paris',
output: 'get_weather(location="Paris", units="celsius")'
});
// Check if tool response was handled properly
const toolResponseHandling = createToolResponseHandlingEvaluator({
model: 'gpt-4o'
});
const result3 = await toolResponseHandling({
input: 'What\'s the weather?',
toolResponse: '{"temp": 20, "condition": "sunny"}',
output: 'It is sunny and 20°C.'
});
Custom Evaluators
Classification Evaluator
Create a custom binary or multi-class classifier:
import { createClassificationEvaluator } from '@arizeai/phoenix-evals';
const evaluator = createClassificationEvaluator({
name: 'politeness',
model: 'gpt-4o',
template: `
Given the following query and response, classify if the response is polite.
Query: {input}
Response: {output}
Is the response polite? Answer YES or NO.
`,
rails: ['YES', 'NO']
});
const result = await evaluator({
input: 'Can you help me?',
output: 'Of course! I\'d be happy to help.'
});
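The rails array appears to list the only labels the evaluator may return. As a rough illustration of that idea, the sketch below maps a raw model reply onto one of the allowed rails and returns null when the reply is missing or ambiguous. The snapToRail name and its matching rules are assumptions made for this example and are not taken from the package; the whole-word matching also assumes single-word rails.

```typescript
// Hypothetical helper: not part of @arizeai/phoenix-evals.
// Maps a raw LLM reply onto exactly one allowed label ("rail").
function snapToRail(raw: string, rails: string[]): string | null {
  const normalized = raw.trim().toUpperCase();

  // 1. Prefer an exact match after trimming and upper-casing.
  const exact = rails.find((rail) => rail.toUpperCase() === normalized);
  if (exact !== undefined) return exact;

  // 2. Otherwise look for rails that appear as whole words in the reply,
  //    so that "NOT" is never mistaken for "NO".
  const words = normalized.split(/[^A-Z0-9_]+/).filter(Boolean);
  const matches = rails.filter((rail) => words.includes(rail.toUpperCase()));

  // Accept only an unambiguous match; anything else counts as unparseable.
  const [only, ...others] = matches;
  return only !== undefined && others.length === 0 ? only : null;
}
```

Returning null instead of guessing keeps unparseable replies visible, so they can be counted or retried rather than silently scored.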
Function-Based Evaluator
Create an evaluator from a custom function:
import { createEvaluator } from '@arizeai/phoenix-evals';
const lengthEvaluator = createEvaluator({
  name: 'response-length',
  evaluateFn: ({ output }: { output: string }) => {
    const length = output.length;
    return {
      name: 'response-length',
      score: length < 100 ? 1.0 : 0.5,
      label: length < 100 ? 'concise' : 'verbose',
      metadata: { length }
    };
  }
});
const result = await lengthEvaluator({
output: 'Short response.'
});
LLM-Based Custom Evaluator
import { LLMEvaluator } from '@arizeai/phoenix-evals';
import OpenAI from 'openai';
class CreativityEvaluator extends LLMEvaluator {
  async evaluate({ input, output }: { input: string; output: string }) {
    const openai = new OpenAI();
    const response = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [
        {
          role: 'user',
          content: `Rate the creativity of this response on a scale of 1-10.\n\nQuery: ${input}\nResponse: ${output}\n\nProvide only the number.`
        }
      ],
      temperature: 0.0
    });
    // Parse the reply as a base-10 integer; fall back to 5 when it is not a number.
    const parsed = Number.parseInt(response.choices[0].message.content ?? '', 10);
    const rating = Number.isNaN(parsed) ? 5 : parsed;
    return {
      name: 'creativity',
      score: rating / 10,
      label: rating >= 7 ? 'creative' : 'conventional',
      metadata: { rating }
    };
  }
}
const evaluator = new CreativityEvaluator({ name: 'creativity' });
const result = await evaluator.evaluate({
input: 'Write a story',
output: 'Once upon a time...'
});
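Models asked to "provide only the number" do not always comply; a reply such as "Rating: 8/10" is common. The sketch below shows one way to make the parsing step more tolerant by taking the first run of digits and clamping it to the 1-10 scale. parseRating is a hypothetical helper written for this example, and the fallback of 5 mirrors the default used in the evaluator above.

```typescript
// Hypothetical helper: extracts a 1-10 rating from free-form model text.
function parseRating(text: string | null | undefined, fallback = 5): number {
  // Take the first run of digits in the reply, e.g. "8" in "Rating: 8/10".
  const match = (text ?? '').match(/\d+/);
  if (match === null) return fallback;

  const value = Number.parseInt(match[0] ?? '', 10);
  if (Number.isNaN(value)) return fallback;

  // Clamp to the 1-10 scale that the prompt asks for.
  return Math.min(10, Math.max(1, value));
}
```

Inside a custom evaluator, the result of parseRating(response.choices[0].message.content) could replace the raw integer parse.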
Evaluation Result
All evaluators return a result object:
interface EvaluationResult {
name: string; // Evaluator name
score: number; // Numeric score (0-1)
label: string; // Categorical label
explanation?: string; // Optional explanation
metadata?: Record<string, any>; // Additional data
}
Example:
{
name: 'hallucination',
score: 0.0,
label: 'factual',
explanation: 'The output is fully supported by the context.',
metadata: {
model: 'gpt-4o',
confidence: 0.98
}
}
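Because every result shares this shape, a list of results can be summarized without knowing which evaluator produced them. The sketch below is one possible aggregation; summarize and Summary are hypothetical names, and the interface is re-declared locally (with unknown in place of any) so that the example is self-contained.

```typescript
// Local copy of the result shape described above, for a self-contained example.
interface EvaluationResult {
  name: string;
  score: number;
  label: string;
  explanation?: string;
  metadata?: Record<string, unknown>;
}

// Hypothetical summary shape, not part of the package.
interface Summary {
  count: number;
  meanScore: number; // 0 when there are no results
  labelCounts: Record<string, number>;
}

function summarize(results: EvaluationResult[]): Summary {
  const labelCounts: Record<string, number> = {};
  let total = 0;

  for (const result of results) {
    total += result.score;
    labelCounts[result.label] = (labelCounts[result.label] ?? 0) + 1;
  }

  return {
    count: results.length,
    meanScore: results.length === 0 ? 0 : total / results.length,
    labelCounts,
  };
}
```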
Batch Evaluation
Evaluate multiple examples in parallel:
import { createHallucinationEvaluator } from '@arizeai/phoenix-evals';
const evaluator = createHallucinationEvaluator({
model: 'gpt-4o'
});
const examples = [
{
output: 'Paris is the capital.',
context: 'France has Paris as capital.'
},
{
output: 'London is the capital.',
context: 'UK has London as capital.'
}
];
const results = await Promise.all(
examples.map(ex => evaluator(ex))
);
console.log(results);
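Promise.all starts every evaluation at once, which may run into provider rate limits on large batches. One common alternative is to cap the number of calls in flight. The sketch below is a generic worker-pool helper, not something the package is known to export; mapWithConcurrency and its parameters are names chosen for this example.

```typescript
// Hypothetical helper: runs `fn` over `items` with at most `limit` calls in flight.
// Results keep the same order as the input array.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index until none remain.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index] as T, index);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With the examples array above, a call might look like `await mapWithConcurrency(examples, 5, (ex) => evaluator(ex))`. As with Promise.all, the first rejected call rejects the whole batch.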
Model Configuration
Phoenix evals support multiple LLM providers:
OpenAI
import { createHallucinationEvaluator } from '@arizeai/phoenix-evals';
const evaluator = createHallucinationEvaluator({
model: 'gpt-4o',
temperature: 0.0,
apiKey: 'your-api-key' // Or set OPENAI_API_KEY env var
});
Anthropic
const evaluator = createHallucinationEvaluator({
model: 'claude-3-5-sonnet-20241022',
temperature: 0.0,
apiKey: 'your-api-key' // Or set ANTHROPIC_API_KEY env var
});
Google (Gemini)
const evaluator = createHallucinationEvaluator({
model: 'gemini-1.5-pro',
apiKey: 'your-api-key' // Or set GOOGLE_API_KEY env var
});
Azure OpenAI
process.env.AZURE_OPENAI_API_KEY = 'your-key';
process.env.AZURE_OPENAI_ENDPOINT = 'https://your-resource.openai.azure.com';
process.env.AZURE_OPENAI_API_VERSION = '2024-02-01';
const evaluator = createHallucinationEvaluator({
model: 'azure/gpt-4o'
});
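Each provider example above names an environment variable for its API key. A small pre-flight check can surface a missing key before any evaluation runs. The sketch below is illustrative only: the Provider type, the mapping table, and resolveApiKey are hypothetical names, and only the environment variable names come from the examples in this section.

```typescript
type Provider = 'openai' | 'anthropic' | 'google' | 'azure';

// Environment variable names as they appear in the provider examples above.
const API_KEY_ENV_VARS: Record<Provider, string> = {
  openai: 'OPENAI_API_KEY',
  anthropic: 'ANTHROPIC_API_KEY',
  google: 'GOOGLE_API_KEY',
  azure: 'AZURE_OPENAI_API_KEY',
};

// Hypothetical helper: prefers an explicit key, then the environment,
// and throws a descriptive error when neither is set.
function resolveApiKey(
  provider: Provider,
  env: Record<string, string | undefined>,
  explicit?: string
): string {
  const envVar = API_KEY_ENV_VARS[provider];
  const key = explicit ?? env[envVar];
  if (key === undefined || key.trim() === '') {
    throw new Error(`No API key for ${provider}: pass apiKey or set ${envVar}`);
  }
  return key;
}
```

In Node.js, process.env can be passed as the env argument; taking it as a parameter keeps the helper easy to test.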
Template System
Customize evaluation prompts using templates:
import { createClassificationEvaluator, applyTemplate } from '@arizeai/phoenix-evals';
// Define a custom template
const template = `
You are an expert at evaluating responses.
Task: {task}
Response: {output}
Reference: {expected}
Evaluate if the response correctly completes the task.
Answer CORRECT or INCORRECT.
`;
const evaluator = createClassificationEvaluator({
name: 'task-completion',
model: 'gpt-4o',
template,
rails: ['CORRECT', 'INCORRECT']
});
Template Variables
Extract variables from a template:
import { getTemplateVariables } from '@arizeai/phoenix-evals';
const template = 'Evaluate {input} against {output}';
const variables = getTemplateVariables(template);
// Returns: ['input', 'output']
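To make the placeholder syntax concrete, the sketch below re-implements variable extraction and substitution with a regular expression. It is an illustration under the assumption that placeholders are single-brace word names such as {input}; extractVariables and fillTemplate are hypothetical names, and the package's own functions may handle more cases.

```typescript
// Illustrative re-implementation, not the package's own code.
// Assumes placeholders are single-brace word names such as {input}.
function extractVariables(template: string): string[] {
  const pattern = /\{(\w+)\}/g;
  const names: string[] = [];
  let match: RegExpExecArray | null;

  // Collect each placeholder name once, in order of first appearance.
  while ((match = pattern.exec(template)) !== null) {
    const name = match[1];
    if (name !== undefined && !names.includes(name)) names.push(name);
  }
  return names;
}

// Replaces every {name} with values[name]; throws if a value is missing.
function fillTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (_match, name: string) => {
    const value = values[name];
    if (value === undefined) {
      throw new Error(`Missing template variable: ${name}`);
    }
    return value;
  });
}
```

Throwing on a missing variable, rather than leaving the placeholder in the prompt, avoids sending a half-filled template to the model.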
Binding Evaluators
Create an evaluator with pre-filled inputs:
import { createHallucinationEvaluator, bindEvaluator } from '@arizeai/phoenix-evals';
const baseEvaluator = createHallucinationEvaluator({
model: 'gpt-4o'
});
// Bind a fixed context
const boundEvaluator = bindEvaluator(baseEvaluator, {
context: 'This is the fixed context for all evaluations.'
});
// Now only `output` needs to be provided
const result = await boundEvaluator({
output: 'The response based on context.'
});
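Conceptually, binding is partial application: some input fields are fixed up front and the rest are supplied per call. The sketch below shows that idea with a generic helper. bindInputs and EvaluatorFn are hypothetical names, and the real bindEvaluator may differ in both signature and behavior.

```typescript
type EvaluatorFn<In, Out> = (input: In) => Promise<Out>;

// Hypothetical sketch of binding as partial application.
// The returned evaluator accepts only the fields that were not fixed.
function bindInputs<In extends object, Fixed extends Partial<In>, Out>(
  evaluator: EvaluatorFn<In, Out>,
  fixed: Fixed
): EvaluatorFn<Omit<In, keyof Fixed>, Out> {
  // Merge the fixed fields with the per-call fields before delegating.
  return (rest) => evaluator({ ...fixed, ...rest } as unknown as In);
}
```

Typing the bound function with Omit means that passing an already-fixed field at call time is a compile-time error, so there is no question of which value wins.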
Helper Functions
toEvaluationResult()
Convert custom data to a standard evaluation result:
import { toEvaluationResult } from '@arizeai/phoenix-evals';
const customResult = {
evaluatorName: 'my-eval',
value: 0.85,
category: 'good'
};
const standardResult = toEvaluationResult({
name: customResult.evaluatorName,
score: customResult.value,
label: customResult.category
});
asEvaluatorFn()
Convert a function to an evaluator:
import { asEvaluatorFn } from '@arizeai/phoenix-evals';
const myFunction = async ({ input, output }: any) => {
return {
name: 'custom',
score: Math.random(),
label: 'random'
};
};
const evaluator = asEvaluatorFn(myFunction);
Integration Examples
With Phoenix Client
import { createClient } from '@arizeai/phoenix-client';
import { createHallucinationEvaluator } from '@arizeai/phoenix-evals';
const client = createClient();
const evaluator = createHallucinationEvaluator({ model: 'gpt-4o' });
// Get traces from Phoenix
const tracesResponse = await client.GET('/v1/traces', {
  params: {
    query: { project_id: 'my-project', limit: 10 }
  }
});
if (tracesResponse.data) {
  for (const trace of tracesResponse.data.traces) {
    for (const span of trace.spans || []) {
      // Evaluate each span
      const result = await evaluator({
        output: span.attributes?.output,
        context: span.attributes?.context
      });
      // Upload result as annotation
      await client.POST('/v1/spans/{spanId}/annotations', {
        params: {
          path: { spanId: span.id }
        },
        body: {
          name: result.name,
          score: result.score,
          label: result.label,
          explanation: result.explanation
        }
      });
    }
  }
}
With Vercel AI SDK
import { openai } from '@ai-sdk/openai';
import { generateText } from 'ai';
import { createHallucinationEvaluator } from '@arizeai/phoenix-evals';
const evaluator = createHallucinationEvaluator({ model: 'gpt-4o' });
const context = 'Paris is the capital of France.';
const { text } = await generateText({
model: openai('gpt-4'),
prompt: 'What is the capital of France?'
});
// Evaluate the response
const result = await evaluator({
output: text,
context
});
console.log('Hallucination score:', result.score);
TypeScript Types
The package provides full TypeScript support:
import { createClassificationEvaluator } from '@arizeai/phoenix-evals';
import type {
Evaluator,
EvaluationResult,
EvaluatorConfig,
ClassificationEvaluatorConfig
} from '@arizeai/phoenix-evals';
const config: ClassificationEvaluatorConfig = {
name: 'my-evaluator',
model: 'gpt-4o',
template: 'Evaluate {input}',
rails: ['GOOD', 'BAD']
};
const evaluator: Evaluator = createClassificationEvaluator(config);
const result: EvaluationResult = await evaluator({
input: 'test'
});