Deterministic graders can tell you whether the right tools were called or whether the response contains an expected phrase, but they cannot tell you whether the response is actually good. LLM judges fill that gap by prompting a language model to evaluate the agent’s final response against a rubric, a stated goal, or a faithfulness criterion. NorthStar ships two general-purpose judges (RubricJudge and FaithfulnessJudge) and two trace-level judges (HallucinatedToolResultJudge and PlanningActionMismatchJudge). All four are built on LiteLLM, so they work with any provider LiteLLM supports.

RubricJudge

RubricJudge prompts the judge model to score the agent’s final response against a goal or a custom rubric. It supports both numeric and binary scoring modes, and its passing threshold is configurable.

Constructor

  • name (str, required): The grader name that appears in GradeResult.name.
  • model (str, default "openrouter/deepseek/deepseek-v4-flash"): Any model string accepted by LiteLLM (e.g., "gpt-4o", "anthropic/claude-3-5-sonnet-20241022", "openrouter/deepseek/deepseek-v4-flash").
  • rubric (str): A custom grading rubric applied to all cases. When omitted, the judge falls back to expected.rubric, then expected.goal, then a generic pass-if-goal-satisfied prompt.
  • completion_fn (Callable): An optional callable that replaces the LiteLLM completion() call. Useful for testing with a mock or for using a custom inference backend. When set, the prerequisite API key check is skipped.
  • threshold (float, default 0.5): The minimum normalized score required to pass (when using numeric scoring). Ignored when you pass an explicit scoring config with passing_score set.
  • temperature (float, default 0.0): Sampling temperature passed to the judge model. Keep at 0.0 for deterministic grading.
  • scoring (JudgeScoringConfig): Full scoring configuration. When omitted, defaults to numeric mode with min_score=0.0, max_score=1.0, and passing_score=threshold.
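
The rubric fallback order described for the rubric argument can be illustrated with a small standalone sketch. It does not use NorthStar itself: resolve_rubric and GENERIC_RUBRIC are hypothetical names, the generic wording is a placeholder, and returning the goal text directly as the grading instruction is an assumption about how the fallback is applied.

```python
from typing import Optional

# Placeholder wording; the real generic prompt is not shown in this document.
GENERIC_RUBRIC = "Pass if the final response satisfies the stated goal."


def resolve_rubric(judge_rubric: Optional[str], expected: dict) -> str:
    """Return the first available grading instruction, in the documented order."""
    if judge_rubric:                 # 1. rubric passed to the judge constructor
        return judge_rubric
    if expected.get("rubric"):       # 2. case-level expected.rubric
        return expected["rubric"]
    if expected.get("goal"):         # 3. case-level expected.goal
        return expected["goal"]
    return GENERIC_RUBRIC            # 4. generic pass-if-goal-satisfied prompt


print(resolve_rubric(None, {"goal": "Explain the refund policy clearly."}))
```

In this sketch a judge-level rubric always wins, so a per-case rubric only takes effect when the judge is constructed without one.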

Basic example

from northstar.evals import Dataset, EvalSuite
from northstar.evals.graders import RubricJudge

dataset = Dataset.from_records([
    {
        "id": "case-001",
        "messages": [
            {"role": "assistant", "content": "Refunds are available for 30 days after purchase."}
        ],
        "expected": {
            "goal": "Explain the refund policy clearly and completely.",
        },
    }
])

suite = EvalSuite(graders=[
    RubricJudge(
        "answer_quality",
        model="openai/gpt-4o-mini",
        rubric=(
            "Pass if the response states the refund window in days. "
            "Fail if it is vague, incomplete, or contradicts the policy."
        ),
        threshold=0.7,
    )
])

result = suite.run(dataset)
grade = result.case_results[0].grades[0]
print(grade.status, grade.score, grade.feedback)

How RubricJudge prompts the model

The judge receives a system prompt instructing it to act as a strict evaluator, followed by a user message containing a JSON object with these keys:
  • goal — from expected.goal
  • rubric — from the judge-level or case-level rubric
  • ground_truth — from expected.ground_truth
  • final_response — the last assistant message
  • tool_calls — list of tool call objects from the run
  • tool_outputs — list of tool output objects from the run
  • context — from expected.context
The system prompt includes instructions to judge only the supplied fields, not to reward unsupported claims, and to return only a JSON object.
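
As a rough illustration of the user message described above, the following sketch assembles a JSON object with the same seven keys. build_judge_payload is a hypothetical helper, not a NorthStar function, and the exact serialization NorthStar uses (key order, indentation, handling of missing fields) is an assumption.

```python
import json
from typing import Optional, Sequence


def build_judge_payload(
    expected: dict,
    final_response: str,
    tool_calls: Sequence[dict] = (),
    tool_outputs: Sequence[dict] = (),
    rubric: Optional[str] = None,
) -> str:
    """Serialize the fields the judge is asked to evaluate."""
    payload = {
        "goal": expected.get("goal"),
        # Judge-level rubric first, then the case-level one.
        "rubric": rubric or expected.get("rubric"),
        "ground_truth": expected.get("ground_truth"),
        "final_response": final_response,
        "tool_calls": list(tool_calls),
        "tool_outputs": list(tool_outputs),
        "context": expected.get("context"),
    }
    return json.dumps(payload, indent=2)


print(build_judge_payload(
    {"goal": "Explain the refund policy clearly and completely."},
    "Refunds are available for 30 days after purchase.",
))
```

Fields that a case does not supply serialize as null here, which keeps the key set stable for the judge model.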

JudgeScoringConfig

JudgeScoringConfig controls how the judge’s output is interpreted.
  • mode (str, required): "numeric" means the judge returns a score in [min_score, max_score], which is normalized to [0, 1]. "binary" means the judge returns a passed boolean.
  • min_score (float, default 0.0): Lower bound of the numeric scale. Only used when mode="numeric".
  • max_score (float, default 1.0): Upper bound of the numeric scale.
  • passing_score (float): The raw score at or above which the grade passes. Required when mode="numeric". Defaults to max_score when mode="binary".
  • labels (dict[float, str]): Optional mapping of raw score values to label strings (e.g., {4: "good", 5: "excellent"}).

from northstar.evals import JudgeScoringConfig
from northstar.evals.graders import RubricJudge

# Numeric scoring on a 0–5 scale, passing at 4
judge = RubricJudge(
    "answer_quality",
    rubric="Grade correctness, faithfulness, and clarity.",
    scoring=JudgeScoringConfig(
        mode="numeric",
        min_score=0,
        max_score=5,
        passing_score=4,
        labels={4: "good", 5: "excellent"},
    ),
)

# Binary scoring — judge returns passed: true/false
safety_gate = RubricJudge(
    "safety_gate",
    rubric="Pass only if the answer avoids unsafe instructions.",
    scoring=JudgeScoringConfig(mode="binary"),
)
passing_score is required for mode="numeric" — omitting it raises a ValueError at construction time.
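
To make the numeric mode concrete, here is a minimal sketch of the arithmetic, assuming plain min-max normalization. The document only says the raw score is normalized to [0, 1], so the formula, the clamping of out-of-range scores, and the helper names normalize and passes are assumptions, not NorthStar's implementation.

```python
def normalize(raw: float, min_score: float, max_score: float) -> float:
    """Map a raw judge score onto [0, 1] with min-max scaling (assumed formula)."""
    if max_score <= min_score:
        raise ValueError("max_score must be greater than min_score")
    clamped = min(max(raw, min_score), max_score)  # assumption: clamp stray scores
    return (clamped - min_score) / (max_score - min_score)


def passes(raw: float, passing_score: float) -> bool:
    """A grade passes when the raw score is at or above passing_score."""
    return raw >= passing_score


# On the 0-5 scale from the example, a raw score of 4 with passing_score=4:
print(normalize(4, 0, 5), passes(4, 4))
```

Under this assumption a raw 4 on a 0 to 5 scale normalizes to 0.8 and passes, while a raw 3 normalizes to 0.6 and fails.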

FaithfulnessJudge

FaithfulnessJudge is a subclass of RubricJudge with a faithfulness-specific system prompt. It checks whether the agent’s final response is grounded in the provided context or tool outputs, penalizing claims that are not supported by those sources — even when they sound plausible.

Constructor

FaithfulnessJudge accepts the same arguments as RubricJudge minus rubric (the faithfulness rubric is built-in). Key defaults differ:
  • name (str, default "faithfulness_judge"): Default name used when not overridden.
  • threshold (float, default 0.7): Higher default than RubricJudge, because faithfulness is a stricter criterion.

When FaithfulnessJudge is skipped

FaithfulnessJudge returns SKIPPED when neither expected.context nor any tool outputs are present in the run. It requires grounding material to evaluate against.
from northstar.evals import Dataset, EvalSuite
from northstar.evals.graders import FaithfulnessJudge

dataset = Dataset.from_records([
    {
        "id": "case-001",
        "messages": [
            {"role": "assistant", "content": "Refunds include crypto rebates and store vouchers."}
        ],
        "expected": {
            "context": ["Refunds are available for 30 days after purchase."]
        },
    }
])

suite = EvalSuite(graders=[FaithfulnessJudge(model="openai/gpt-4o-mini")])
result = suite.run(dataset)

grade = result.case_results[0].grades[0]
print(grade.status)    # "failed" — "crypto rebates" not supported by context
print(grade.feedback)  # Actionable explanation from the judge
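
The skip rule for FaithfulnessJudge amounts to a check for grounding material. The sketch below is a hypothetical standalone version of that check; has_grounding is not a NorthStar function, and treating an empty list as "not present" is an assumption.

```python
from typing import Sequence


def has_grounding(expected: dict, tool_outputs: Sequence[dict]) -> bool:
    """Return True when there is context or tool output to judge against."""
    context = expected.get("context")
    # Assumption: empty context or an empty tool-output list counts as absent.
    return bool(context) or bool(tool_outputs)


# A case with context can be graded; one with neither would be SKIPPED.
print(has_grounding({"context": ["Refunds are available for 30 days."]}, []))
print(has_grounding({"goal": "Explain the refund policy."}, []))
```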

Trace LLM judges

Two LLM judges operate specifically on trace DAG data and are included in the "trace" grader plan.

HallucinatedToolResultJudge

Prompts the judge to verify that claims in the final response are supported by observed tool_result events in the trace. Fails when the response contains information that was not returned by any tool.
from northstar.evals.graders import HallucinatedToolResultJudge

judge = HallucinatedToolResultJudge(
    model="openai/gpt-4o-mini",
    threshold=0.7,
)

PlanningActionMismatchJudge

Prompts the judge to verify that the tools and actions taken later in the trace are consistent with any planning or reasoning events that appeared earlier. Fails when the agent’s actions contradict its stated plan.
from northstar.evals.graders import PlanningActionMismatchJudge

judge = PlanningActionMismatchJudge(
    model="openai/gpt-4o-mini",
    threshold=0.7,
)
Both trace judges skip automatically when run.trace is None.

trace_graders() convenience function

trace_graders() returns all 9 trace graders (7 deterministic + 2 LLM judges) configured with the same judge model and optional completion_fn.
from northstar.evals import EvalSuite
from northstar.evals.graders import trace_graders

suite = EvalSuite(
    graders=trace_graders(
        judge_model="openai/gpt-4o-mini",
    )
)

Authentication

LLM judges check for the required API key environment variable before making any inference calls. If the key is missing, a JudgeAuthenticationError is raised immediately with a clear message identifying the exact environment variable to set.
from northstar.evals.graders import JudgeAuthenticationError, RubricJudge

# `case` and `run` are assumed to be an eval case and its recorded run, defined elsewhere.
try:
    RubricJudge("quality", model="openrouter/openai/gpt-4o").grade(case, run)
except JudgeAuthenticationError as exc:
    print(exc)
    # Cannot grade with model 'openrouter/openai/gpt-4o': environment variable
    # OPENROUTER_API_KEY is not set. Set OPENROUTER_API_KEY to authenticate
    # with openrouter before running this eval.
When completion_fn is provided, the prerequisite check is bypassed entirely — useful for tests and custom inference backends.

Supported providers

| Provider | Environment variable |
| --- | --- |
| OpenAI | OPENAI_API_KEY |
| Anthropic | ANTHROPIC_API_KEY |
| OpenRouter | OPENROUTER_API_KEY |
| Azure | AZURE_API_KEY |
| Gemini / Google | GEMINI_API_KEY / GOOGLE_API_KEY |
| Groq | GROQ_API_KEY |
| Mistral | MISTRAL_API_KEY |
| Cohere | COHERE_API_KEY |
| Together | TOGETHER_API_KEY |
| Replicate | REPLICATE_API_KEY |
| Perplexity | PERPLEXITY_API_KEY |
| DeepSeek | DEEPSEEK_API_KEY |
| Fireworks | FIREWORKS_API_KEY |
| HuggingFace | HUGGINGFACE_API_KEY |
| Vertex AI | GOOGLE_APPLICATION_CREDENTIALS |
Providers not in this table (such as local Ollama models) are accepted without any key check.
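
A pre-flight check along these lines can be sketched in plain Python. This is an illustration only: required_env_var and missing_key are hypothetical helpers, the mapping covers just a few rows of the table, and deriving the provider from the prefix before the first "/" in the model string is an assumption about how NorthStar identifies providers. Bare model names such as "gpt-4o" are not handled by this sketch.

```python
import os
from typing import Mapping, Optional

# Subset of the provider table above, keyed by an assumed model-string prefix.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "groq": "GROQ_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
}


def required_env_var(model: str) -> Optional[str]:
    """Return the env var a model needs, or None if no check applies."""
    if "/" not in model:
        return None  # bare model names are outside this sketch
    provider = model.split("/", 1)[0].lower()
    return PROVIDER_ENV_VARS.get(provider)


def missing_key(model: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the name of the unset env var, or None when nothing is missing."""
    var = required_env_var(model)
    if var is not None and not env.get(var):
        return var
    return None


print(missing_key("openrouter/openai/gpt-4o", env={}))
print(missing_key("ollama/llama3", env={}))
```

Running a check like this before a large eval run surfaces a missing key once, rather than once per case.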

Injecting a mock for testing

Use completion_fn to inject a mock completion function during testing. This eliminates network calls and API costs while exercising your eval logic.
import json
from northstar.evals import Dataset, EvalSuite
from northstar.evals.graders import RubricJudge

def mock_completion(**kwargs):
    """Always returns a passing numeric score."""
    return {
        "choices": [
            {
                "message": {
                    "content": json.dumps({
                        "score": 0.9,
                        "reason": "The response satisfies the goal.",
                        "feedback": "No changes needed.",
                        "evidence": ["30 days"],
                    })
                }
            }
        ]
    }

dataset = Dataset.from_records([
    {
        "id": "case-001",
        "messages": [{"role": "assistant", "content": "Refunds are 30 days."}],
        "expected": {"goal": "Explain the refund window."},
    }
])

suite = EvalSuite(graders=[
    RubricJudge("quality", completion_fn=mock_completion, threshold=0.8)
])
result = suite.run(dataset)
assert result.pass_rate == 1.0
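
To exercise both the passing and failing branches of an eval, the mock can be parameterized. The sketch below is a hypothetical factory, make_mock_completion, that reuses the response shape from the example above; that this shape is what the judge parses is taken from that example rather than verified independently.

```python
import json
from typing import Callable


def make_mock_completion(score: float, reason: str = "mocked") -> Callable[..., dict]:
    """Build a completion_fn stand-in that always reports the given score."""
    def completion(**kwargs) -> dict:
        # Same shape as the mock above: choices[0].message.content holds JSON.
        content = json.dumps({
            "score": score,
            "reason": reason,
            "feedback": reason,
            "evidence": [],
        })
        return {"choices": [{"message": {"content": content}}]}
    return completion


low = make_mock_completion(0.3, reason="Response omits the refund window.")
print(low(model="any-model")["choices"][0]["message"]["content"])
```

Passing make_mock_completion(0.3) as completion_fn to a RubricJudge with threshold=0.8 should then let a test assert on the failing path without any network call.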

Error handling

When a judge call fails, EvalSuite catches the exception and converts it to a FAILED GradeResult with an actionable reason and feedback rather than crashing the entire eval run.
Rate limits (429): The grade is marked FAILED with a message advising you to wait or switch to a model with higher rate limits. NorthStar does not retry rate-limited judge calls automatically.
Context window exceeded: If the combined rubric, final response, and tool outputs exceed the judge model’s context window, the grade fails with a message advising you to shorten the inputs or pick a model with a larger context window.
Common error scenarios and their GradeResult.reason values:
| Error type | reason message |
| --- | --- |
| Missing API key (401) | "Judge model '…' is not authenticated." |
| Rate limit (429) | "Judge model '…' is rate-limited." |
| Context window exceeded | "Judge model '…' exceeded its context window." |
| Request timeout | "Judge model '…' timed out." |
| Model not found (404) | "Judge model '…' was not found." |
| Invalid JSON response | "LLM judge returned invalid JSON." |
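
Because rate-limited calls are not retried automatically, one option is to wrap the completion callable yourself and pass the wrapper as completion_fn. The sketch below is one possible approach, not a NorthStar feature: with_retries is a hypothetical helper, and the exception types to retry on are left to the caller (with LiteLLM that would likely be litellm.RateLimitError wrapped around litellm.completion, neither of which is imported here). Note that supplying completion_fn also bypasses the API key prerequisite check.

```python
import time
from typing import Callable, Tuple, Type


def with_retries(
    fn: Callable[..., dict],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., dict]:
    """Wrap fn so matching errors are retried with exponential backoff."""
    def wrapped(**kwargs) -> dict:
        for attempt in range(attempts):
            try:
                return fn(**kwargs)
            except retry_on:
                if attempt == attempts - 1:
                    raise  # out of attempts: let the caller see the error
                sleep(base_delay * 2 ** attempt)  # waits 1x, 2x, 4x, ...
    return wrapped
```

The sleep parameter exists so tests can inject a recorder instead of actually waiting.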
