Deterministic graders can tell you whether the right tools were called or whether the response contains an expected phrase, but they cannot tell you whether the response is actually good. LLM judges fill that gap by prompting a language model to evaluate the agent’s final response against a rubric, a stated goal, or a faithfulness criterion. NorthStar ships two general-purpose judges —Documentation Index
Fetch the complete documentation index at: https://mintlify.com/sidmanale643/northstar/llms.txt
Use this file to discover all available pages before exploring further.
RubricJudge and FaithfulnessJudge — and two trace-level judges — HallucinatedToolResultJudge and PlanningActionMismatchJudge — all built on LiteLLM so they work with any supported provider.
RubricJudge
RubricJudge prompts the judge model to score the agent’s final response against a goal or a custom rubric. It supports both numeric and binary scoring modes, and its passing threshold is configurable.
Constructor
The grader name that appears in
GradeResult.name.Any model string accepted by LiteLLM (e.g.,
"gpt-4o", "anthropic/claude-3-5-sonnet-20241022", "openrouter/deepseek/deepseek-v4-flash").A custom grading rubric applied to all cases. When omitted, the judge falls back to
expected.rubric, then expected.goal, then a generic pass-if-goal-satisfied prompt.An optional callable that replaces the LiteLLM
completion() call. Useful for testing with a mock or for using a custom inference backend. When set, the prerequisite API key check is skipped.The minimum normalized score required to pass (when using numeric scoring). Ignored when you pass an explicit
scoring config with passing_score set.Sampling temperature passed to the judge model. Keep at
0.0 for deterministic grading.Full scoring configuration. When omitted, defaults to numeric mode with
min_score=0.0, max_score=1.0, and passing_score=threshold.Basic example
How RubricJudge prompts the model
The judge receives a system prompt instructing it to act as a strict evaluator, followed by a user message containing a JSON object with these keys:goal— fromexpected.goalrubric— from the judge-level or case-level rubricground_truth— fromexpected.ground_truthfinal_response— the last assistant messagetool_calls— list of tool call objects from the runtool_outputs— list of tool output objects from the runcontext— fromexpected.context
JudgeScoringConfig
JudgeScoringConfig controls how the judge’s output is interpreted.
"numeric" — the judge returns a score in [min_score, max_score], which is normalized to [0, 1]. "binary" — the judge returns a passed boolean.Lower bound of the numeric scale. Only used when
mode="numeric".Upper bound of the numeric scale.
The raw score at or above which the grade passes. Required when
mode="numeric". Defaults to max_score when mode="binary".Optional mapping of raw score values to label strings (e.g.,
{4: "good", 5: "excellent"}).passing_score is required for mode="numeric" — omitting it raises a ValueError at construction time.FaithfulnessJudge
FaithfulnessJudge is a subclass of RubricJudge with a faithfulness-specific system prompt. It checks whether the agent’s final response is grounded in the provided context or tool outputs, penalizing claims that are not supported by those sources — even when they sound plausible.
Constructor
FaithfulnessJudge accepts the same arguments as RubricJudge minus rubric (the faithfulness rubric is built-in). Key defaults differ:
Default name used when not overridden.
Higher default than
RubricJudge — faithfulness is a stricter criterion.When FaithfulnessJudge is skipped
FaithfulnessJudge returns SKIPPED when neither expected.context nor any tool outputs are present in the run. It requires grounding material to evaluate against.
Trace LLM judges
Two LLM judges operate specifically on trace DAG data and are included in the"trace" grader plan.
HallucinatedToolResultJudge
Prompts the judge to verify that claims in the final response are supported by observedtool_result events in the trace. Fails when the response contains information that was not returned by any tool.
PlanningActionMismatchJudge
Prompts the judge to verify that the tools and actions taken later in the trace are consistent with any planning or reasoning events that appeared earlier. Fails when the agent’s actions contradict its stated plan.run.trace is None.
trace_graders() convenience function
trace_graders() returns all 9 trace graders (7 deterministic + 2 LLM judges) configured with the same judge model and optional completion_fn.
Authentication
LLM judges check for the required API key environment variable before making any inference calls. If the key is missing, aJudgeAuthenticationError is raised immediately with a clear message identifying the exact environment variable to set.
completion_fn is provided, the prerequisite check is bypassed entirely — useful for tests and custom inference backends.
Supported providers
| Provider | Environment variable |
|---|---|
| OpenAI | OPENAI_API_KEY |
| Anthropic | ANTHROPIC_API_KEY |
| OpenRouter | OPENROUTER_API_KEY |
| Azure | AZURE_API_KEY |
| Gemini / Google | GEMINI_API_KEY / GOOGLE_API_KEY |
| Groq | GROQ_API_KEY |
| Mistral | MISTRAL_API_KEY |
| Cohere | COHERE_API_KEY |
| Together | TOGETHER_API_KEY |
| Replicate | REPLICATE_API_KEY |
| Perplexity | PERPLEXITY_API_KEY |
| DeepSeek | DEEPSEEK_API_KEY |
| Fireworks | FIREWORKS_API_KEY |
| HuggingFace | HUGGINGFACE_API_KEY |
| Vertex AI | GOOGLE_APPLICATION_CREDENTIALS |
Injecting a mock for testing
Error handling
When a judge call fails,EvalSuite catches the exception and converts it to a FAILED GradeResult with an actionable reason and feedback rather than crashing the entire eval run.
Common error scenarios and their GradeResult.reason values:
| Error type | reason message |
|---|---|
| Missing API key (401) | "Judge model '…' is not authenticated." |
| Rate limit (429) | "Judge model '…' is rate-limited." |
| Context window exceeded | "Judge model '…' exceeded its context window." |
| Request timeout | "Judge model '…' timed out." |
| Model not found (404) | "Judge model '…' was not found." |
| Invalid JSON response | "LLM judge returned invalid JSON." |