Deterministic graders evaluate agent runs without calling an LLM. They compare concrete facts (which tools were called, in what order, with what arguments, what the response contained, how long the run took, and how much it cost) against the expected values you declare in each dataset case. Graders never guess: if the expected field they need is absent, they return SKIPPED rather than FAILED. This means you can incrementally add graders to a dataset without breaking existing cases that don’t set the corresponding fields.
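The skip-versus-fail rule can be illustrated with a minimal standalone sketch. It does not use the NorthStar API: the function name grade_max_tool_calls and the plain-dict shape of the expected block are assumptions made for illustration only.

```python
from typing import Any


def grade_max_tool_calls(expected: dict[str, Any], tool_calls: list[str]) -> str:
    """Return "skipped", "passed", or "failed" for a tool-call budget check.

    Hypothetical stand-in for a deterministic grader, not the library's code.
    """
    limit = expected.get("max_tool_calls")
    if limit is None:
        # The case declares no limit, so there is nothing to compare against.
        return "skipped"
    return "passed" if len(tool_calls) <= limit else "failed"


calls = ["search", "fetch", "search"]
print(grade_max_tool_calls({}, calls))                     # no expected field
print(grade_max_tool_calls({"max_tool_calls": 3}, calls))  # within the limit
print(grade_max_tool_calls({"max_tool_calls": 2}, calls))  # over the limit
```

Because a missing field yields a skip, a case written before this check existed is unaffected when the check is later added to the dataset.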
Built-in deterministic graders
Standard graders
| Grader class | name field | What it checks | Required expected field |
|---|---|---|---|
| MaxToolCalls | max_tool_calls | Total tool call count is within the configured limit | expected.max_tool_calls |
| RequiredTools | required_tools | Every listed tool appears in the run’s tool calls | expected.required_tools |
| ForbiddenTools | forbidden_tools | None of the listed tools appear in the run’s tool calls | expected.forbidden_tools |
| ToolArgumentsMatch | tool_arguments_match | Each expected tool was called with at least the declared argument subset | expected.tool_arguments |
| ToolSequence | tool_sequence | Tool calls appear in exactly the declared order | expected.tool_sequence |
| ToolOutputReferenced | tool_output_referenced | Final response overlaps sufficiently with tool output text (threshold: 0.35) | expected.require_tool_output_reference |
| Contains | contains | All listed phrases appear in the final response (case-insensitive) | expected.contains |
| NotContains | not_contains | None of the listed phrases appear in the final response | expected.not_contains |
| GroundTruthMatch | ground_truth_match | Final response contains the ground truth string (normalized whitespace, case-insensitive) | expected.ground_truth |
| LatencyUnder | latency_under | Run latency is within the configured limit | expected.max_latency_ms and metrics.latency_ms |
| CostUnder | cost_under | Total cost is within the configured limit | expected.max_cost_usd and metrics.cost_usd |
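
The "at least the declared argument subset" rule for ToolArgumentsMatch can be sketched as follows. This is an illustration under assumptions, not the library's implementation: the mapping of tool name to argument dict for expected.tool_arguments and the name/arguments shape of a recorded tool call are both hypothetical.

```python
from typing import Any


def arguments_match(expected_args: dict[str, Any], actual_args: dict[str, Any]) -> bool:
    """True when every expected key is present in actual_args with an equal value."""
    return all(
        key in actual_args and actual_args[key] == value
        for key, value in expected_args.items()
    )


def tool_arguments_match(
    expected: dict[str, dict[str, Any]], tool_calls: list[dict[str, Any]]
) -> bool:
    """Each expected tool needs at least one call containing the declared subset."""
    for tool_name, expected_args in expected.items():
        matched = any(
            call["name"] == tool_name and arguments_match(expected_args, call["arguments"])
            for call in tool_calls
        )
        if not matched:
            return False
    return True


# Hypothetical recorded call: extra arguments beyond the declared subset are allowed.
tool_calls = [
    {"name": "search_flights",
     "arguments": {"origin": "SFO", "destination": "JFK", "cabin": "economy"}},
]
subset_ok = tool_arguments_match(
    {"search_flights": {"origin": "SFO", "destination": "JFK"}}, tool_calls
)
subset_bad = tool_arguments_match({"search_flights": {"destination": "LAX"}}, tool_calls)
print(subset_ok, subset_bad)
```

Subset matching lets a case pin only the arguments that matter while ignoring incidental ones the agent may add.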
Trace graders
Trace graders inspect the NorthStar trace DAG attached to each case. They are all skipped when run.trace is None.
| Grader class | name field | What it checks | Required expected field |
|---|---|---|---|
| BadToolFailureRecovery | bad_tool_failure_recovery | Every failed tool span is followed by a recovery event (assistant message, reasoning, or final response) | None (trace required) |
| UnnecessaryToolLoop | unnecessary_tool_loop | No tool signature repeats beyond the configured threshold (default: 3) | expected.trace.max_repeated_tool_calls (optional) |
| StaleContextUsage | stale_context_usage | No trace events carry stale, stale_context, or used_stale_context attributes | None (trace required) |
| InvalidStateTransition | invalid_state_transition | All observed state transitions are present in the allowed list | expected.trace.allowed_state_transitions |
| RetrievalPrecisionRecall | retrieval_precision_recall | Retrieved document IDs meet precision and recall thresholds | expected.trace.relevant_retrieval_ids + thresholds |
| StepCostAttribution | step_cost_attribution | Per-span costs are present and no span exceeds max_step_cost_usd | expected.trace.max_step_cost_usd (optional) |
| FailureOrigin | failure_origin | Identifies the earliest failing span or event in the trace | None (trace required; always fails when failure evidence exists) |
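
The check behind RetrievalPrecisionRecall rests on the standard set-based definitions of precision and recall, sketched below. The function names, the min_precision and min_recall parameter names, and the convention of scoring 0.0 when a set is empty are assumptions for illustration; the source does not state how the grader names its thresholds or handles empty sets.

```python
def precision_recall(retrieved_ids: list[str], relevant_ids: list[str]) -> tuple[float, float]:
    """Compute set-based precision and recall over document IDs."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = retrieved & relevant
    # Assumed convention: an empty denominator scores 0.0 instead of raising.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


def meets_thresholds(
    retrieved_ids: list[str],
    relevant_ids: list[str],
    min_precision: float,  # hypothetical threshold name
    min_recall: float,     # hypothetical threshold name
) -> bool:
    precision, recall = precision_recall(retrieved_ids, relevant_ids)
    return precision >= min_precision and recall >= min_recall


# Two of four retrieved documents are relevant; two of three relevant documents were found.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(f"precision={precision:.2f} recall={recall:.2f}")
```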
FailureOrigin is designed as a diagnostic grader. It always produces FAILED when there is failure evidence in the trace, allowing you to pinpoint the root cause.
Using a grader directly
You can call any grader’s .grade(case, run) method outside of an EvalSuite. This is useful for quick interactive checks or custom evaluation loops.
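A custom evaluation loop built on the .grade(case, run) call pattern might look like the sketch below. Because the import paths and constructors are not shown here, it uses stand-ins throughout: ContainsSketch and SketchResult are hypothetical classes that mimic the behavior described for the Contains grader, and the expected and final_response attributes on the stand-in case and run objects are assumed names.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class SketchResult:
    # Hypothetical result shape; the real GradeResult has more fields.
    grader: str
    status: str
    reason: str


class ContainsSketch:
    """Stand-in mimicking a case-insensitive phrase check with skip semantics."""

    name = "contains"

    def grade(self, case, run) -> SketchResult:
        phrases = case.expected.get("contains")
        if not phrases:
            return SketchResult(self.name, "skipped", "expected.contains not set")
        response = run.final_response.lower()
        missing = [p for p in phrases if p.lower() not in response]
        if missing:
            return SketchResult(self.name, "failed", f"missing phrases: {missing}")
        return SketchResult(self.name, "passed", "all phrases present")


# Stand-in (case, run) pairs; real code would pass EvalCase and EvalRun objects.
pairs = [
    (SimpleNamespace(expected={"contains": ["refund", "5 business days"]}),
     SimpleNamespace(final_response="Your Refund will arrive within 5 business days.")),
    (SimpleNamespace(expected={}),
     SimpleNamespace(final_response="Hello!")),
]

grader = ContainsSketch()
results = [grader.grade(case, run) for case, run in pairs]
for result in results:
    print(result.grader, result.status, "-", result.reason)
```

With a real grader, the loop body stays the same: construct the grader once, then call .grade(case, run) for each pair and inspect the returned result.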
GradeResult fields
Every grader returns a GradeResult with the following fields.
- The grader’s name identifier (e.g., "tool_sequence", "contains").
- One of "passed", "failed", or "skipped".
- A short machine-readable explanation of why the grade passed, failed, or was skipped.
- Actionable human-readable feedback. Populated by LLM judges; None for deterministic graders.
- Numeric score, normalized to [0, 1]. 1.0 for passing deterministic grades, 0.0 for failures, or the normalized LLM judge score.
- The passing threshold used to determine pass/fail. For deterministic graders this is typically 1.0.
- A string label for the grade outcome (e.g., "pass", "fail", or a custom label from a scoring config).
- Optional confidence score from 0 to 1. Populated by LLM judges when they return a confidence field.
- Snippets of evidence supporting the grade outcome. Populated by ToolOutputReferenced and LLM judges.
- Grader-specific structured data, such as actual_sequence vs. expected_sequence, missing_tools, or LLM judge metadata.
Running an EvalSuite with specific graders
Pass an explicit graders list to EvalSuite to run only the graders you want.
grader_plan() function
grader_plan(name) returns the standard list of graders for a named plan. Use it to start from a plan and extend it.
Available plan names: "deterministic", "quality", "agentic", "trace".
Custom graders
RegexGrader
RegexGrader matches a regular expression against the final response (or any other target field) without requiring an LLM.
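The matching step can be approximated with Python's standard re module, as in the sketch below. It is not the grader's implementation: the helper name regex_passes is hypothetical, and the use of re.search (a match anywhere in the text) instead of re.match or re.fullmatch is an assumption. The three flag names are the ones this grader accepts, mapped here to their re equivalents.

```python
import re
from typing import Optional

# Flag names accepted by the grader, mapped to the standard re flags.
FLAG_MAP = {
    "ignorecase": re.IGNORECASE,
    "multiline": re.MULTILINE,
    "dotall": re.DOTALL,
}


def regex_passes(pattern: str, text: str, flags: Optional[list[str]] = None) -> bool:
    """Return True when the pattern is found anywhere in text (assumed search semantics)."""
    combined = 0
    for name in flags or []:
        combined |= FLAG_MAP[name]
    return re.search(pattern, text, combined) is not None


response = "Items: 3\nTotal: $42 for ORDER #12345"
print(regex_passes(r"order #\d{5}", response))                  # case-sensitive
print(regex_passes(r"order #\d{5}", response, ["ignorecase"]))  # case-insensitive
print(regex_passes(r"^Total: \$\d+", response, ["multiline"]))  # ^ matches at line starts
```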
Parameters:
- The grader name that appears in results.
- A Python regular expression pattern.
- The value to match against. Use "final_response" (or "output") for the last assistant message, "case.<field>" for a case attribute, or "run.<field>" for a run attribute.
- Optional list of flag names: "ignorecase", "multiline", "dotall".
PythonCodeGrader
PythonCodeGrader runs a Python validate() function in a sandboxed subprocess. The function receives output (the final response string), case (the EvalCase as a dict), and run (the EvalRun as a dict).
Parameters:
- The grader name.
- Python source code as a string. Must define validate(output, case, run).
- Execution timeout in milliseconds. Maximum: 5000.
The validate() function can return:
- A boolean (True = pass, False = fail)
- A dict with passed (bool), optional reason, feedback, score, and metadata
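A validate() function using the dict return form might look like the following sketch. The check itself (the response must be a JSON object with a non-empty "answer" string) and the reason strings are hypothetical examples; only the signature and the returned keys come from the description above. In practice this source would be supplied to the grader as a string.

```python
import json


def validate(output, case, run):
    """Pass when the final response is a JSON object with a non-empty "answer" string."""
    try:
        payload = json.loads(output)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a ValueError; TypeError covers a non-string output.
        return {"passed": False, "reason": "output_not_json", "score": 0.0}

    answer = payload.get("answer") if isinstance(payload, dict) else None
    if isinstance(answer, str) and answer.strip():
        return {"passed": True, "reason": "answer_present", "score": 1.0}

    return {
        "passed": False,
        "reason": "answer_missing",
        "feedback": 'Return a JSON object with a non-empty "answer" string.',
        "score": 0.0,
        "metadata": {"payload_type": type(payload).__name__},
    }


print(validate('{"answer": "42"}', {}, {}))
print(validate("not json", {}, {}))
```

Returning a dict instead of a bare boolean lets the function carry a reason and feedback into the grade result, which helps when a failure needs to be diagnosed later.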
TypeScriptCodeGrader
TypeScriptCodeGrader works identically to PythonCodeGrader but runs TypeScript via Node.js. The source must export a function named validate.
Parameters:
- The grader name.
- TypeScript source as a string. Must export validate(output, evalCase, run).
- Execution timeout in milliseconds. Maximum: 5000.