Evaluation
Evaluation helps you measure and improve the quality of your AI applications. Genkit provides tools for scoring outputs, running test datasets, and tracking performance over time.
Why Evaluate?
AI outputs can be unpredictable. Evaluation helps you:
- Measure quality - Quantify how well your AI performs
- Catch regressions - Detect when changes make outputs worse
- Compare approaches - Test different models, prompts, or parameters
- Track improvements - Monitor quality over time
Built-in Evaluators
Genkit includes several built-in evaluators:
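import (
	"context"

	"github.com/firebase/genkit/go/genkit"
	"github.com/firebase/genkit/go/plugins/evaluators"
	"github.com/firebase/genkit/go/plugins/googlegenai"
)

ctx := context.Background()

// Enable the built-in metrics you want to use.
metrics := []evaluators.MetricConfig{
	{MetricType: evaluators.EvaluatorDeepEqual},
	{MetricType: evaluators.EvaluatorRegex},
	{MetricType: evaluators.EvaluatorJsonata},
}

// Register the evaluators plugin alongside the model plugin.
g := genkit.Init(ctx, genkit.WithPlugins(
	&googlegenai.GoogleAI{},
	&evaluators.GenkitEval{Metrics: metrics},
))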
DeepEqual
Checks if the output exactly matches an expected value:
{
  "expected": "Paris is the capital of France",
  "actual": "Paris is the capital of France",
  "score": 1.0
}
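As a rough illustration of what an exact-match score involves, the sketch below compares two JSON-like values structurally and returns 1 or 0. It is not Genkit's implementation: `deepEqual` and `deepEqualScore` are hypothetical names, and ignoring object key order is an assumption made for this example.

```typescript
// Hypothetical helper types and functions; not part of the Genkit API.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Structural equality for JSON-like values. Array order matters, object key order does not.
function deepEqual(a: Json, b: Json): boolean {
  if (a === b) return true;
  if (a === null || b === null) return false;
  if (typeof a !== "object" || typeof b !== "object") return false;
  if (Array.isArray(a) || Array.isArray(b)) {
    if (!Array.isArray(a) || !Array.isArray(b) || a.length !== b.length) return false;
    for (let i = 0; i < a.length; i++) {
      if (!deepEqual(a[i], b[i])) return false;
    }
    return true;
  }
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  for (const key of keysA) {
    if (!(key in b) || !deepEqual(a[key], b[key])) return false;
  }
  return true;
}

// Map the boolean comparison onto the 1/0 score shown in the example above.
function deepEqualScore(expected: Json, actual: Json): number {
  return deepEqual(expected, actual) ? 1 : 0;
}
```

Because the comparison is exact, a correct answer phrased differently scores 0, so this kind of metric suits structured or tightly constrained outputs.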
Regex
Matches output against a regular expression:
{
  "pattern": "capital.*France",
  "actual": "The capital of France is Paris",
  "score": 1.0
}
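The example above matches because the pattern is found inside the output, not because the whole output equals it. A minimal sketch of that scoring rule, assuming an unanchored match and using the hypothetical name `regexScore` (Genkit's own evaluator may behave differently):

```typescript
// Hypothetical helper: 1 if the pattern matches anywhere in the output, else 0.
// new RegExp throws a SyntaxError for an invalid pattern; callers may want to catch it.
function regexScore(pattern: string, actual: string): number {
  return new RegExp(pattern).test(actual) ? 1 : 0;
}
```

Under this assumption, add `^` and `$` anchors to the pattern when the entire output must match.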
JSONata
Queries structured output using JSONata:
{
  "query": "$.ingredients[0].name",
  "expected": "flour",
  "score": 1.0
}
Custom Evaluators
Create custom evaluators for your specific needs:
import (
	"context"
	"fmt"
	"strings"

	"github.com/firebase/genkit/go/ai"
	"github.com/firebase/genkit/go/core/api"
	"github.com/firebase/genkit/go/genkit"
)

evalOptions := ai.EvaluatorOptions{
	DisplayName: "Simple Evaluator",
	Definition:  "Checks if output contains specific keywords",
	IsBilled:    false,
}

genkit.DefineEvaluator(g, api.NewName("custom", "keywordChecker"), &evalOptions,
	func(ctx context.Context, req *ai.EvaluatorCallbackRequest) (*ai.EvaluatorCallbackResponse, error) {
		// The output is untyped, so confirm it is a string instead of risking a panic.
		output, ok := req.Input.Output.(string)
		if !ok {
			return nil, fmt.Errorf("expected string output, got %T", req.Input.Output)
		}

		// Check whether the output contains every required keyword.
		keywords := []string{"Paris", "France"}
		foundAll := true
		for _, keyword := range keywords {
			if !strings.Contains(output, keyword) {
				foundAll = false
				break
			}
		}

		// Report pass or fail according to the result rather than always passing.
		status := ai.ScoreStatusFail
		if foundAll {
			status = ai.ScoreStatusPass
		}

		score := ai.Score{
			Id:     "keyword_match",
			Score:  foundAll,
			Status: status.String(),
			Details: map[string]any{
				"reasoning": fmt.Sprintf("Found all keywords: %v", foundAll),
			},
		}
		return &ai.EvaluatorCallbackResponse{
			TestCaseId: req.Input.TestCaseId,
			Evaluation: []ai.Score{score},
		}, nil
	})
Batch Evaluators
A batch evaluator receives the whole dataset in one request, so you can score multiple test cases in a single call:
genkit.DefineBatchEvaluator(g, api.NewName("custom", "batchChecker"), &evalOptions,
	func(ctx context.Context, req *ai.EvaluatorRequest) (*ai.EvaluatorResponse, error) {
		var evalResponses []ai.EvaluationResult
		for _, datapoint := range req.Dataset {
			score := ai.Score{
				Id: "testScore",
				// evaluateDatapoint is your own scoring helper; it is not part of Genkit.
				Score:  evaluateDatapoint(datapoint),
				Status: ai.ScoreStatusPass.String(),
				Details: map[string]any{
					// Input is untyped, so format it with %v rather than %s.
					"reasoning": fmt.Sprintf("Evaluated: %v", datapoint.Input),
				},
			}
			evalResponses = append(evalResponses, ai.EvaluationResult{
				TestCaseId: datapoint.TestCaseId,
				Evaluation: []ai.Score{score},
			})
		}
		return &evalResponses, nil
	})
Using the Developer UI
The Genkit Developer UI provides visual evaluation tools:
- Run the Dev UI:
  genkit start -- go run main.go
- Navigate to the Evaluate tab.
- Create a test dataset:
  [
    {
      "testCaseId": "test1",
      "input": "What is the capital of France?",
      "expected": "Paris"
    },
    {
      "testCaseId": "test2",
      "input": "What is the capital of Japan?",
      "expected": "Tokyo"
    }
  ]
- Run the evaluation and view the results with detailed traces.
Programmatic Evaluation
Evaluate flows programmatically in your tests:
import { genkit } from 'genkit';

// myFlow is the flow under test; define or import it before running this loop.
const testCases = [
  { input: 'What is 2+2?', expected: '4' },
  { input: 'What is the capital of France?', expected: 'Paris' },
];

for (const testCase of testCases) {
  const result = await myFlow(testCase.input);
  // Substring match: passes if the expected answer appears anywhere in the output.
  const passed = result.includes(testCase.expected);
  console.log(`Test: ${testCase.input}`);
  console.log(`Expected: ${testCase.expected}`);
  console.log(`Result: ${result}`);
  console.log(`Passed: ${passed}\n`);
}
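Logging each case is useful while debugging, but a test usually needs a single verdict. The sketch below collects the same substring checks into a summary that a test can assert on. `CaseResult`, `EvalSummary`, and `summarize` are hypothetical names introduced for illustration.

```typescript
// Hypothetical shapes for collected results; not part of the Genkit API.
interface CaseResult {
  input: string;
  expected: string;
  actual: string;
}

interface EvalSummary {
  total: number;
  passed: number;
  passRate: number; // 0 when there are no cases
  failures: CaseResult[];
}

// Apply the same substring check as the loop above and aggregate the outcome.
function summarize(results: CaseResult[]): EvalSummary {
  const failures = results.filter((r) => !r.actual.includes(r.expected));
  const passed = results.length - failures.length;
  return {
    total: results.length,
    passed,
    passRate: results.length === 0 ? 0 : passed / results.length,
    failures,
  };
}
```

A test could then fail when `passRate` drops below a chosen threshold and print `failures` for diagnosis.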
Evaluation Metrics
Accuracy
Measure exact match rate:
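function calculateAccuracy(results: Array<{ expected: string, actual: string }>) {
  // Avoid dividing by zero (which yields NaN) when there are no results.
  if (results.length === 0) return 0;
  const correct = results.filter(r => r.actual === r.expected).length;
  return correct / results.length;
}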
Semantic Similarity
Use embeddings to measure semantic similarity:
import { genkit } from 'genkit';
import { googleAI } from '@genkit-ai/google-genai';

const ai = genkit({ plugins: [googleAI()] });
const embedder = googleAI.embedder('text-embedding-004');

function cosineSimilarity(a: number[], b: number[]): number {
  const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  return dotProduct / (magnitudeA * magnitudeB);
}

async function semanticSimilarity(
  expected: string,
  actual: string
): Promise<number> {
  // In Genkit 1.x, ai.embed resolves to an array of embeddings; one input string yields one entry.
  const [expectedEmbed, actualEmbed] = await Promise.all([
    ai.embed({ embedder, content: expected }),
    ai.embed({ embedder, content: actual }),
  ]);
  return cosineSimilarity(expectedEmbed[0].embedding, actualEmbed[0].embedding);
}
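Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 (opposite) through 0 (orthogonal) to 1 (same direction). The variant below is a sketch that adds two guards the version above leaves out: a length check and a zero-vector check. `checkedCosineSimilarity` is a hypothetical name, and returning 0 for a zero vector is a convention chosen here, not a standard.

```typescript
// Hypothetical guarded variant of cosine similarity.
function checkedCosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Vector lengths differ: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; return 0 instead of NaN (a choice made for this sketch).
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

For example, `[1, 2, 3]` and `[2, 4, 6]` point in the same direction and score approximately 1, while `[1, 0]` and `[0, 1]` are orthogonal and score 0.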
Retrieval Metrics (RAG)
For RAG applications, measure retrieval quality:
interface RetrievalMetrics {
  precision: number; // Relevant retrieved docs / Retrieved docs
  recall: number;    // Relevant retrieved docs / Total relevant docs
  mrr: number;       // Reciprocal rank for this query; average over queries to get MRR
}

function calculateRetrievalMetrics(
  retrieved: string[],
  relevant: string[]
): RetrievalMetrics {
  const relevantSet = new Set(relevant);
  const relevantRetrieved = retrieved.filter(doc => relevantSet.has(doc));

  // Guard against division by zero when either list is empty.
  const precision = retrieved.length === 0 ? 0 : relevantRetrieved.length / retrieved.length;
  const recall = relevant.length === 0 ? 0 : relevantRetrieved.length / relevant.length;

  // Reciprocal rank: 1 / position of the first relevant document, or 0 if none was retrieved
  let mrr = 0;
  for (let i = 0; i < retrieved.length; i++) {
    if (relevantSet.has(retrieved[i])) {
      mrr = 1 / (i + 1);
      break;
    }
  }

  return { precision, recall, mrr };
}
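The function above scores a single query, so its `mrr` value is really one reciprocal rank. Mean Reciprocal Rank is the average of that value over a set of queries. A minimal sketch, using the hypothetical names `QueryResult`, `reciprocalRank`, and `meanReciprocalRank`:

```typescript
// Hypothetical shape for one evaluated query.
interface QueryResult {
  retrieved: string[]; // ranked document IDs, best first
  relevant: string[];  // IDs judged relevant for the query
}

// 1 / rank of the first relevant document, or 0 when none was retrieved.
function reciprocalRank(retrieved: string[], relevant: string[]): number {
  const relevantSet = new Set(relevant);
  const index = retrieved.findIndex((doc) => relevantSet.has(doc));
  return index === -1 ? 0 : 1 / (index + 1);
}

// Average the per-query reciprocal ranks; an empty query set yields 0.
function meanReciprocalRank(queries: QueryResult[]): number {
  if (queries.length === 0) return 0;
  const total = queries.reduce(
    (sum, q) => sum + reciprocalRank(q.retrieved, q.relevant),
    0
  );
  return total / queries.length;
}
```

With three queries whose first relevant documents appear at rank 1, rank 2, and not at all, the reciprocal ranks are 1, 0.5, and 0, giving an MRR of 0.5.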
A/B Testing
Compare different approaches:
// Assumes `ai`, `googleAI`, and a numeric `evaluateOutput` scorer are defined elsewhere.
async function compareModels(
  testCases: Array<{ input: string, expected: string }>
) {
  // Type the score lists explicitly; in strict mode a bare [] is inferred as never[].
  const results: Record<string, number[]> = {
    'gemini-2.5-flash': [],
    'gemini-2.5-pro': [],
  };

  for (const testCase of testCases) {
    for (const model of Object.keys(results)) {
      const { text } = await ai.generate({
        model: googleAI.model(model),
        prompt: testCase.input,
      });
      const score = await evaluateOutput(text, testCase.expected);
      results[model].push(score);
    }
  }

  // Calculate the average score per model
  const summary: Record<string, number> = {};
  for (const [model, scores] of Object.entries(results)) {
    summary[model] = scores.reduce((a, b) => a + b, 0) / scores.length;
  }
  return summary;
}
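An average can hide where two approaches actually differ. Because both models above are scored on the same test cases, the scores can also be compared pairwise. The sketch below counts wins and ties per test case alongside the means. `PairedComparison` and `comparePaired` are hypothetical names.

```typescript
// Hypothetical result shape for a pairwise comparison of two score lists.
interface PairedComparison {
  meanA: number;
  meanB: number;
  winsA: number; // test cases where A scored higher
  winsB: number; // test cases where B scored higher
  ties: number;
}

// Compare two score lists where index i in both lists refers to the same test case.
function comparePaired(scoresA: number[], scoresB: number[]): PairedComparison {
  if (scoresA.length === 0 || scoresA.length !== scoresB.length) {
    throw new Error("Score lists must be non-empty and the same length");
  }
  let winsA = 0;
  let winsB = 0;
  let ties = 0;
  for (let i = 0; i < scoresA.length; i++) {
    if (scoresA[i] > scoresB[i]) winsA++;
    else if (scoresB[i] > scoresA[i]) winsB++;
    else ties++;
  }
  const mean = (xs: number[]): number => xs.reduce((s, x) => s + x, 0) / xs.length;
  return { meanA: mean(scoresA), meanB: mean(scoresB), winsA, winsB, ties };
}
```

With a small test set, a difference in means may come down to one or two cases, so the win counts give a useful sanity check before switching models.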
Best Practices
Create Diverse Test Sets
Cover various scenarios:
const testCases = [
  // Happy path
  { input: 'What is the capital of France?', expected: 'Paris' },

  // Edge cases
  { input: 'What is the capital of a country that doesn\'t exist?', expected: 'unknown' },

  // Ambiguous inputs
  { input: 'capital', expected: 'clarification' },

  // Different phrasings
  { input: 'France\'s capital city?', expected: 'Paris' },
];
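Once test cases cover several scenario types, a single overall score can mask a weak category. One option is to tag each case with its category and report a pass rate per category. The sketch below assumes a hypothetical `TaggedResult` shape and a `passRateByCategory` helper; neither comes from Genkit.

```typescript
// Hypothetical shape: one evaluated test case tagged with its scenario type.
interface TaggedResult {
  category: string; // e.g. "happy-path", "edge-case"
  passed: boolean;
}

// Group results by category and compute the fraction that passed in each group.
function passRateByCategory(results: TaggedResult[]): Record<string, number> {
  const counts: Record<string, { passed: number; total: number }> = {};
  for (const result of results) {
    if (!counts[result.category]) {
      counts[result.category] = { passed: 0, total: 0 };
    }
    counts[result.category].total += 1;
    if (result.passed) counts[result.category].passed += 1;
  }
  const rates: Record<string, number> = {};
  for (const category of Object.keys(counts)) {
    rates[category] = counts[category].passed / counts[category].total;
  }
  return rates;
}
```

A report such as `{ "happy-path": 1, "edge-case": 0.5 }` points directly at the scenarios that need prompt or model changes.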
Track Metrics Over Time
Store evaluation results:
interface EvaluationResult {
  timestamp: Date;
  modelVersion: string;
  averageScore: number;
  testCases: number;
}

const results: EvaluationResult[] = [];

async function runAndTrackEvaluation() {
  const scores = await runEvaluation();

  results.push({
    timestamp: new Date(),
    modelVersion: 'gemini-2.5-flash',
    averageScore: scores.reduce((a, b) => a + b, 0) / scores.length,
    testCases: scores.length,
  });

  // Save to database
  await saveResults(results);
}
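Stored results become useful when something compares them. The sketch below flags a regression when the latest run's average score falls below the previous run's by more than a tolerance. `detectRegression` and the default tolerance of 0.02 are assumptions made for illustration; choose a tolerance that reflects the run-to-run noise of your own evaluations.

```typescript
// Same shape as the EvaluationResult interface above, repeated so the sketch is self-contained.
interface EvaluationResult {
  timestamp: Date;
  modelVersion: string;
  averageScore: number;
  testCases: number;
}

// Returns true when the most recent run scored worse than the run before it
// by more than `tolerance`. Assumes `history` is ordered oldest to newest.
function detectRegression(history: EvaluationResult[], tolerance = 0.02): boolean {
  if (history.length < 2) return false; // nothing to compare against
  const previous = history[history.length - 2];
  const latest = history[history.length - 1];
  return latest.averageScore < previous.averageScore - tolerance;
}
```

Comparing against a single previous run is the simplest policy; comparing against a rolling average of recent runs would be less sensitive to one unusually good or bad result.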
Automate Evaluation in CI/CD
Run evaluations automatically:
# .github/workflows/evaluate.yml
name: Evaluate AI Quality
on:
  pull_request:
  schedule:
    - cron: '0 0 * * *' # Daily
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install
      - run: npm run evaluate
      - name: Check quality threshold
        run: |
          # The shell's [ ] test cannot compare floats, so let jq do the comparison.
          # jq -e exits non-zero when the expression evaluates to false or null.
          if ! jq -e '.averageScore >= 0.8' results.json > /dev/null; then
            echo "Quality below threshold"
            exit 1
          fi
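The threshold check can also live in the evaluation script itself, which keeps the comparison in one language and makes it unit-testable. The sketch below assumes the same `results.json` format the workflow reads (a top-level numeric `averageScore`); `meetsThreshold` is a hypothetical helper.

```typescript
// Hypothetical helper: parse the results file contents and compare against a threshold.
// Returns false when the score is missing or not a number, so a malformed file fails the gate.
function meetsThreshold(resultsJson: string, threshold: number): boolean {
  const parsed: unknown = JSON.parse(resultsJson);
  if (typeof parsed !== "object" || parsed === null) return false;
  const score = (parsed as { averageScore?: unknown }).averageScore;
  return typeof score === "number" && score >= threshold;
}
```

A script could read the file, call `meetsThreshold(contents, 0.8)`, and set a non-zero exit code when it returns false so the CI job fails.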
Use Human Evaluation
For subjective qualities, involve humans:
interface HumanEvaluation {
  testCaseId: string;
  output: string;
  ratings: {
    accuracy: number;    // 1-5
    helpfulness: number; // 1-5
    tone: number;        // 1-5
  };
  feedback: string;
}

// Present outputs to human reviewers
// Collect ratings and feedback
// Use to improve prompts and fine-tune models
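Once ratings are collected, averaging each dimension gives a quick view of where outputs fall short. A minimal sketch, assuming the `HumanEvaluation` shape above (repeated here so the block is self-contained) and a hypothetical `averageRatings` helper:

```typescript
interface Ratings {
  accuracy: number;    // 1-5
  helpfulness: number; // 1-5
  tone: number;        // 1-5
}

interface HumanEvaluation {
  testCaseId: string;
  output: string;
  ratings: Ratings;
  feedback: string;
}

// Average each rating dimension across reviews; returns null when there are no reviews.
function averageRatings(evaluations: HumanEvaluation[]): Ratings | null {
  if (evaluations.length === 0) return null;
  const totals: Ratings = { accuracy: 0, helpfulness: 0, tone: 0 };
  for (const evaluation of evaluations) {
    totals.accuracy += evaluation.ratings.accuracy;
    totals.helpfulness += evaluation.ratings.helpfulness;
    totals.tone += evaluation.ratings.tone;
  }
  const count = evaluations.length;
  return {
    accuracy: totals.accuracy / count,
    helpfulness: totals.helpfulness / count,
    tone: totals.tone / count,
  };
}
```

Averages compress reviewer disagreement, so it can help to read the free-text `feedback` for any test case where ratings diverge widely.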
Next Steps