NorthStar’s eval system is built around three concepts: a Dataset of EvalCase records, a list of Graders that score each case, and an EvalSuite that orchestrates the run and aggregates results. Graders are plain Python objects with a grade(case, run) method: deterministic functions, regex checks, code runners, or LLM-backed judges. You can compose them freely or pick a pre-built bundle with grader_plan().
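To make the "plain Python object" idea concrete, here is a minimal sketch of a grader. It does not import NorthStar: GradeStatus and GradeResult are stand-ins that mirror only the fields this reference names, and StubCase, StubRun, the final_response attribute, the expected dict, and the MaxResponseWords grader are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class GradeStatus(Enum):
    # Stand-in for the enum described in this reference.
    PASSED = "passed"
    FAILED = "failed"
    SKIPPED = "skipped"


@dataclass
class GradeResult:
    # Stand-in carrying only the three required fields.
    name: str
    status: GradeStatus
    reason: str


@dataclass
class StubCase:
    # Hypothetical shape: `expected` holds the fields a grader looks for.
    id: str
    expected: dict = field(default_factory=dict)


@dataclass
class StubRun:
    # Hypothetical shape: `final_response` is an assumed attribute name.
    final_response: str = ""


class MaxResponseWords:
    """Hypothetical custom grader: caps the word count of the final response."""

    name = "max_response_words"

    def grade(self, case, run):
        limit = case.expected.get("max_response_words")
        if limit is None:
            # Skip when the expected field this grader needs is absent.
            return GradeResult(self.name, GradeStatus.SKIPPED, "max_response_words not set")
        count = len(run.final_response.split())
        if count <= limit:
            return GradeResult(self.name, GradeStatus.PASSED, f"{count} words <= {limit}")
        return GradeResult(self.name, GradeStatus.FAILED, f"{count} words > {limit}")
```

The grader holds no state beyond its name, so the same instance can be reused across cases.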
EvalSuite
EvalSuite is the top-level orchestrator. It iterates over a Dataset, reconstructs an EvalRun from each case’s messages and trace, applies every grader, and returns an EvalResult.
Constructor
graders: Explicit list of grader instances to use. When provided, plan is ignored. When None, the plan argument selects a pre-built grader list via grader_plan(plan). Defaults to None.
plan: Name of the pre-built grader bundle to use when graders is None. One of "deterministic", "quality", "agentic", or "trace". Defaults to "deterministic".
metadata: Arbitrary metadata merged into the EvalResult.metadata dict. Useful for tagging runs with experiment names or model versions. Defaults to {}.
run(dataset) -> EvalResult
Iterates over every EvalCase in dataset, applies all graders, and returns an aggregated EvalResult.
dataset: Any iterable of EvalCase objects, including a Dataset instance.
EvalResult
EvalResult is returned by EvalSuite.run() and contains aggregate statistics and per-case breakdowns.
Total number of cases in the dataset.
Number of cases where at least one grader produced a non-skipped grade.
Number of cases where every grader skipped (i.e., all required expected fields were absent).
Number of evaluated cases where all non-skipped grades passed.
Number of evaluated cases where at least one non-skipped grade failed.
Pass rate, computed as passed_cases / evaluated_cases. 0.0 when evaluated_cases == 0.
Total count of individual SKIPPED grade results across all cases and all graders.
Per-case breakdown. Each CaseResult contains the case_id, an overall CaseStatus, and the list of individual GradeResult objects from each grader.
Run metadata, merged from EvalSuite.metadata. Also includes plan, grader_names, and created_at automatically.
CaseResult
The id of the EvalCase that produced this result.
Overall pass/fail/not-evaluated status for this case. See the CaseStatus enum below.
One GradeResult per grader, in the same order as EvalSuite.graders.
GradeResult
Every grader returns a GradeResult. All fields except name, status, and reason are optional.
The grader’s name attribute (e.g. "required_tools", "rubric_judge").
PASSED, FAILED, or SKIPPED. See the GradeStatus enum below.
A concise machine-generated explanation of why the grade passed, failed, or was skipped. Always non-empty.
Actionable feedback for the agent author. Populated by LLM judges with a concrete suggestion for what to fix. None for deterministic graders.
Numeric score in [0.0, 1.0]. Deterministic graders use 1.0 for pass, 0.0 for fail. LLM judges normalize their raw score to this range.
The passing threshold for this grade, typically 1.0 for deterministic graders and the normalized passing_score for judges.
A short categorical label such as "pass", "fail", or a custom label defined in JudgeScoringConfig.labels.
Optional confidence score in [0.0, 1.0] returned by LLM judges. None for deterministic graders.
Short strings copied or summarized from the inputs that justify the grade. Populated by LLM judges and some deterministic graders (e.g. ToolOutputReferenced). Defaults to [].
Grader-specific structured data. For deterministic graders, this carries counts and lists (e.g. missing_tools, actual_sequence). For LLM judges, this includes judge_model, scoring_mode, raw_score, and scale. Defaults to {}.
Enums
GradeStatus
| Value | String | Meaning |
|---|---|---|
| GradeStatus.PASSED | "passed" | The grade criterion was met |
| GradeStatus.FAILED | "failed" | The grade criterion was not met |
| GradeStatus.SKIPPED | "skipped" | The required expected field was absent; grader did not run |
CaseStatus
| Value | String | Meaning |
|---|---|---|
| CaseStatus.PASSED | "passed" | All non-skipped grades passed |
| CaseStatus.FAILED | "failed" | At least one non-skipped grade failed |
| CaseStatus.NOT_EVALUATED | "not_evaluated" | Every grader skipped (no expected fields were present) |
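The two enums and the EvalResult counters fit together through a simple aggregation rule. The sketch below restates that rule with stand-in enums. It is not NorthStar's implementation, and the case_status and summarize helpers and the dictionary keys are hypothetical names used only for illustration.

```python
from enum import Enum


class GradeStatus(Enum):
    PASSED = "passed"
    FAILED = "failed"
    SKIPPED = "skipped"


class CaseStatus(Enum):
    PASSED = "passed"
    FAILED = "failed"
    NOT_EVALUATED = "not_evaluated"


def case_status(grades):
    """Collapse one case's grade statuses into a single CaseStatus."""
    active = [g for g in grades if g is not GradeStatus.SKIPPED]
    if not active:
        # Every grader skipped, so the case was never evaluated.
        return CaseStatus.NOT_EVALUATED
    if any(g is GradeStatus.FAILED for g in active):
        return CaseStatus.FAILED
    return CaseStatus.PASSED


def summarize(grades_per_case):
    """Aggregate per-case grade lists into counters like those on EvalResult."""
    statuses = [case_status(grades) for grades in grades_per_case]
    passed = statuses.count(CaseStatus.PASSED)
    failed = statuses.count(CaseStatus.FAILED)
    evaluated = passed + failed
    return {
        "total": len(statuses),
        "evaluated": evaluated,
        "not_evaluated": statuses.count(CaseStatus.NOT_EVALUATED),
        "passed": passed,
        "failed": failed,
        # The pass rate is defined as 0.0 when nothing was evaluated.
        "pass_rate": passed / evaluated if evaluated else 0.0,
        "skipped_grades": sum(
            g is GradeStatus.SKIPPED for grades in grades_per_case for g in grades
        ),
    }
```

Note that a skipped grade never counts against a case: a case with one pass and one skip is still reported as passed.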
grader_plan(name)
Returns a pre-built list of graders by plan name. The default judge model is openrouter/deepseek/deepseek-v4-flash.
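As a rough illustration of how plan selection behaves, the sketch below models each of the four documented plans as a list of grader class names rather than real grader instances. It is a stand-in under stated assumptions, not NorthStar's code: plan_contents and select_graders are hypothetical helpers, and only the plan names, the grader class names, and the ValueError behavior come from this reference.

```python
DETERMINISTIC = [
    "MaxToolCalls", "RequiredTools", "ForbiddenTools", "ToolArgumentsMatch",
    "ToolSequence", "ToolOutputReferenced", "Contains", "NotContains",
    "GroundTruthMatch", "LatencyUnder", "CostUnder",
]

TRACE = [
    "BadToolFailureRecovery", "UnnecessaryToolLoop", "StaleContextUsage",
    "InvalidStateTransition", "RetrievalPrecisionRecall", "StepCostAttribution",
    "FailureOrigin", "HallucinatedToolResultJudge", "PlanningActionMismatchJudge",
]


def plan_contents(name):
    """Return the grader names for a plan; reject unknown plan names."""
    plans = {
        "deterministic": list(DETERMINISTIC),
        "quality": [*DETERMINISTIC, "RubricJudge"],
        "agentic": [*DETERMINISTIC, "FaithfulnessJudge"],
        "trace": list(TRACE),
    }
    if name not in plans:
        raise ValueError(f"unknown grader plan: {name!r}")
    return plans[name]


def select_graders(graders=None, plan="deterministic"):
    """An explicit grader list wins; the plan is consulted only when graders is None."""
    if graders is not None:
        return list(graders)
    return plan_contents(plan)
```

For example, select_graders(["Contains"], plan="trace") returns only the explicit list, which mirrors the documented rule that plan is ignored when graders is provided.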
name: One of the four plan names below. Raises ValueError for any other value.
judge_model: Override the LLM judge model for plans that include judge graders. Defaults to "openrouter/deepseek/deepseek-v4-flash".
completion_fn: Optional custom completion function passed to all judge graders in the plan. When provided, the judges call completion_fn(**kwargs) instead of litellm.completion. Useful for testing and custom providers.
Plans
| Plan | Includes | Best for |
|---|---|---|
| "deterministic" | All graders from default_graders(): tool checks, contains, ground truth, latency, cost | Fast, no LLM cost, CI pipelines |
| "quality" | All deterministic graders + RubricJudge | Response quality evaluation with rubric scoring |
| "agentic" | All deterministic graders + FaithfulnessJudge | RAG and tool-heavy agents where factual grounding matters |
| "trace" | All graders from trace_graders(): loop detection, cost attribution, hallucination, planning | Deep trace-level analysis; requires a trace payload in each case |
Built-in graders
Deterministic graders
These graders never call an LLM. They are always included in default_graders().
| Class | name | Required expected field | Description |
|---|---|---|---|
| MaxToolCalls | max_tool_calls | max_tool_calls | Passes if the total tool call count ≤ limit |
| RequiredTools | required_tools | required_tools | Passes if all named tools appear in run.tool_calls |
| ForbiddenTools | forbidden_tools | forbidden_tools | Passes if no forbidden tool appears in run.tool_calls |
| ToolArgumentsMatch | tool_arguments_match | tool_arguments | Passes if each named tool was called with arguments that are a superset of the expected dict |
| ToolSequence | tool_sequence | tool_sequence | Passes if run.tool_calls names match the expected ordered list exactly |
| ToolOutputReferenced | tool_output_referenced | require_tool_output_reference | Passes if the final response overlaps meaningfully with a tool output (threshold: 35%) |
| Contains | contains | contains | Passes if every phrase appears (case-insensitive) in the final response |
| NotContains | not_contains | not_contains | Passes if no forbidden phrase appears in the final response |
| GroundTruthMatch | ground_truth_match | ground_truth | Passes if normalized ground_truth is a substring of normalized final response |
| LatencyUnder | latency_under | max_latency_ms + case.metrics.latency_ms | Passes if metrics.latency_ms ≤ limit |
| CostUnder | cost_under | max_cost_usd + case.metrics.cost_usd | Passes if metrics.cost_usd ≤ limit |
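Several of the pass conditions in the table reduce to one-line checks. The sketch below restates four of them as plain functions to make the semantics concrete. These are illustrative stand-ins, not the library's code: the function names are hypothetical, tool calls are assumed to be dicts with a "name" key, the superset check is assumed to be shallow, and the normalization (lowercasing and collapsing whitespace) is an assumption because this reference does not define it.

```python
import re


def normalize(text):
    # Assumed normalization: collapse whitespace runs and lowercase.
    return re.sub(r"\s+", " ", text).strip().lower()


def contains_all(response, phrases):
    """Contains: every phrase must appear, ignoring case."""
    lowered = response.lower()
    return all(phrase.lower() in lowered for phrase in phrases)


def ground_truth_match(response, ground_truth):
    """GroundTruthMatch: normalized ground truth is a substring of the response."""
    return normalize(ground_truth) in normalize(response)


def arguments_match(actual, expected):
    """ToolArgumentsMatch: actual arguments are a (shallow) superset of expected."""
    return all(key in actual and actual[key] == value for key, value in expected.items())


def sequence_matches(tool_calls, expected_names):
    """ToolSequence: tool names match the expected ordered list exactly."""
    return [call["name"] for call in tool_calls] == list(expected_names)
```

Extra arguments on the actual call do not fail the superset check, whereas an extra or reordered tool call does fail the exact sequence check.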
Configurable deterministic graders
These require a constructor call.
LLM judge graders
These graders call an LLM and require an API key for the configured provider.
Trace graders
These graders require a trace DAG to be present in the case. They are included in trace_graders() and the "trace" plan.
| Class | name | Required input | Description |
|---|---|---|---|
| BadToolFailureRecovery | bad_tool_failure_recovery | Trace with failed tool spans | Passes if every failed tool span is followed by a recovery event (reasoning, assistant message, or final response) |
| UnnecessaryToolLoop | unnecessary_tool_loop | Trace with repeated tool calls | Passes if no tool signature repeats more than max_repeated_tool_calls times |
| StaleContextUsage | stale_context_usage | Trace with stale-marked events | Passes if no events have stale context markers in their attributes |
| InvalidStateTransition | invalid_state_transition | Trace + expected.trace.allowed_state_transitions | Passes if all observed state transitions are in the allowed set |
| RetrievalPrecisionRecall | retrieval_precision_recall | Trace + expected.trace.relevant_retrieval_ids | Computes precision and recall against the expected relevant document IDs |
| StepCostAttribution | step_cost_attribution | Trace with cost_usd attributes on spans | Reports per-step cost; fails if any step exceeds expected.trace.max_step_cost_usd |
| FailureOrigin | failure_origin | Trace with errored spans or run error | Identifies the earliest failure-origin span or event in the trace |
| HallucinatedToolResultJudge | hallucinated_tool_result_judge | Trace + LLM | LLM judge: passes only if final response claims are supported by observed tool_result events |
| PlanningActionMismatchJudge | planning_action_mismatch_judge | Trace + LLM | LLM judge: passes only if later tool calls are consistent with stated reasoning/planning events |
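Two of the trace checks above are simple enough to sketch directly. The code below shows one way precision and recall over retrieved document IDs, and a repeated-tool-call check, could be computed. It is a sketch under stated assumptions rather than the library's implementation: the function names are hypothetical, a "tool signature" is assumed to mean the tool name plus its JSON-serialized arguments, tool calls are assumed to be dicts, and returning 0.0 for an empty set is an assumed convention.

```python
import json
from collections import Counter


def precision_recall(retrieved_ids, relevant_ids):
    """Precision and recall of retrieved IDs against the expected relevant IDs."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    # Assumed convention: report 0.0 instead of dividing by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def has_unnecessary_loop(tool_calls, max_repeated_tool_calls):
    """True if any identical tool signature occurs more than the allowed count."""
    signatures = Counter(
        # Assumed signature: tool name plus arguments serialized with sorted keys.
        (call["name"], json.dumps(call.get("arguments", {}), sort_keys=True))
        for call in tool_calls
    )
    return any(count > max_repeated_tool_calls for count in signatures.values())
```

Serializing arguments with sorted keys means two calls with the same arguments in a different key order count as the same signature.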
Helper functions
default_graders() -> list[Grader]
Returns a list of fresh instances of all 11 deterministic graders: MaxToolCalls, RequiredTools, ForbiddenTools, ToolArgumentsMatch, ToolSequence, ToolOutputReferenced, Contains, NotContains, GroundTruthMatch, LatencyUnder, CostUnder.
trace_graders(*, completion_fn=None, judge_model=DEFAULT_RUBRIC_JUDGE_MODEL) -> list[Grader]
Returns a list of fresh instances of all 9 trace-aware graders: BadToolFailureRecovery, UnnecessaryToolLoop, StaleContextUsage, InvalidStateTransition, RetrievalPrecisionRecall, StepCostAttribution, FailureOrigin, HallucinatedToolResultJudge, PlanningActionMismatchJudge.
Custom grader protocol
Any Python class that implements the Grader protocol can be passed to EvalSuite. The protocol requires: