Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/sidmanale643/northstar/llms.txt

Use this file to discover all available pages before exploring further.

NorthStar’s eval system is built around three concepts: a Dataset of EvalCase records, a list of Graders that score each case, and an EvalSuite that orchestrates the run and aggregates results. Graders are plain Python objects with a grade(case, run) method — they are deterministic functions, regex checks, code runners, or LLM-backed judges. You compose them freely, or pick a pre-built grader_plan() bundle.
from northstar.evals import EvalSuite, Dataset
from northstar.evals.graders import (
    RubricJudge,
    FaithfulnessJudge,
    ToolSequence,
    RegexGrader,
    PythonCodeGrader,
    TypeScriptCodeGrader,
    grader_plan,
    default_graders,
    trace_graders,
)

EvalSuite

EvalSuite is the top-level orchestrator. It iterates over a Dataset, reconstructs an EvalRun from each case’s messages and trace, applies every grader, and returns an EvalResult.
suite = EvalSuite(plan="quality")
result = suite.run(dataset)
print(f"Pass rate: {result.pass_rate:.1%} ({result.passed_cases}/{result.evaluated_cases})")

Constructor

EvalSuite(graders=None, *, plan="deterministic", metadata=None)
graders
list[Grader] | None
Explicit list of grader instances to use. When provided, plan is ignored. When None, the plan argument is used to select a pre-built grader list via grader_plan(plan). Defaults to None.
plan
str
Name of the pre-built grader bundle to use when graders is None. One of "deterministic", "quality", "agentic", or "trace". Defaults to "deterministic".
metadata
dict[str, Any]
Arbitrary metadata merged into the EvalResult.metadata dict. Useful for tagging runs with experiment names or model versions. Defaults to {}.

run(dataset) -> EvalResult

Iterates over every EvalCase in dataset, applies all graders, and returns an aggregated EvalResult.
dataset
Dataset | Iterable[EvalCase]
required
Any iterable of EvalCase objects, including a Dataset instance.
Returns: EvalResult

EvalResult

EvalResult is returned by EvalSuite.run() and contains aggregate statistics and per-case breakdowns.
total_cases
int
Total number of cases in the dataset.
evaluated_cases
int
Number of cases where at least one grader produced a non-skipped grade.
not_evaluated_cases
int
Number of cases where every grader skipped (i.e., all required expected fields were absent).
passed_cases
int
Number of evaluated cases where all non-skipped grades passed.
failed_cases
int
Number of evaluated cases where at least one non-skipped grade failed.
pass_rate
float
passed_cases / evaluated_cases. 0.0 when evaluated_cases == 0.
skipped_grades
int
Total count of individual SKIPPED grade results across all cases and all graders.
case_results
list[CaseResult]
Per-case breakdown. Each CaseResult contains the case_id, an overall CaseStatus, and the list of individual GradeResult objects from each grader.
metadata
dict[str, Any]
Merged from EvalSuite.metadata. Also includes plan, grader_names, and created_at automatically.

CaseResult

case_id
str
The id of the EvalCase that produced this result.
status
CaseStatus
Overall pass/fail/not-evaluated for this case. See CaseStatus enum below.
grades
list[GradeResult]
One GradeResult per grader, in the same order as EvalSuite.graders.

GradeResult

Every grader returns a GradeResult. All fields except name, status, and reason are optional.
name
str
The grader’s name attribute (e.g. "required_tools", "rubric_judge").
status
GradeStatus
PASSED, FAILED, or SKIPPED. See GradeStatus enum below.
reason
str
A concise machine-generated explanation of why the grade passed, failed, or was skipped. Always non-empty.
feedback
str | None
Actionable feedback for the agent author. Populated by LLM judges with a concrete suggestion for what to fix. None for deterministic graders.
score
float | None
Numeric score in [0.0, 1.0]. Deterministic graders use 1.0 for pass, 0.0 for fail. LLM judges normalize their raw score to this range.
threshold
float | None
The passing threshold for this grade, typically 1.0 for deterministic graders and the normalized passing_score for judges.
label
str | None
A short categorical label such as "pass", "fail", or a custom label defined in JudgeScoringConfig.labels.
confidence
float | None
Optional confidence score in [0.0, 1.0] returned by LLM judges. None for deterministic graders.
evidence
list[str]
Short strings copied or summarized from the inputs that justify the grade. Populated by LLM judges and some deterministic graders (e.g. ToolOutputReferenced). Defaults to [].
metadata
dict[str, Any]
Grader-specific structured data. For deterministic graders, this carries counts and lists (e.g. missing_tools, actual_sequence). For LLM judges, this includes judge_model, scoring_mode, raw_score, and scale. Defaults to {}.

Enums

GradeStatus

ValueStringMeaning
GradeStatus.PASSED"passed"The grade criterion was met
GradeStatus.FAILED"failed"The grade criterion was not met
GradeStatus.SKIPPED"skipped"The required expected field was absent; grader did not run

CaseStatus

ValueStringMeaning
CaseStatus.PASSED"passed"All non-skipped grades passed
CaseStatus.FAILED"failed"At least one non-skipped grade failed
CaseStatus.NOT_EVALUATED"not_evaluated"Every grader skipped (no expected fields were present)

grader_plan(name)

Returns a pre-built list of graders by plan name. The default judge model is openrouter/deepseek/deepseek-v4-flash.
from northstar.evals.graders import grader_plan

graders = grader_plan("quality", judge_model="openai/gpt-4o")
suite = EvalSuite(graders=graders)
name
str
required
One of the four plan names below. Raises ValueError for any other value.
judge_model
str
Override the LLM judge model for plans that include judge graders. Defaults to "openrouter/deepseek/deepseek-v4-flash".
completion_fn
Callable | None
Optional custom completion function passed to all judge graders in the plan. When provided, the judges call completion_fn(**kwargs) instead of litellm.completion. Useful for testing and custom providers.

Plans

PlanIncludesBest for
"deterministic"All graders from default_graders() — tool checks, contains, ground truth, latency, costFast, no LLM cost, CI pipelines
"quality"All deterministic graders + RubricJudgeResponse quality evaluation with rubric scoring
"agentic"All deterministic graders + FaithfulnessJudgeRAG and tool-heavy agents where factual grounding matters
"trace"All graders from trace_graders() — loop detection, cost attribution, hallucination, planningDeep trace-level analysis; requires a trace payload in each case

Built-in graders

Deterministic graders

These graders never call an LLM. They are always included in default_graders().
ClassnameRequired expected fieldDescription
MaxToolCallsmax_tool_callsmax_tool_callsPasses if the total tool call count ≤ limit
RequiredToolsrequired_toolsrequired_toolsPasses if all named tools appear in run.tool_calls
ForbiddenToolsforbidden_toolsforbidden_toolsPasses if no forbidden tool appears in run.tool_calls
ToolArgumentsMatchtool_arguments_matchtool_argumentsPasses if each named tool was called with arguments that are a superset of the expected dict
ToolSequencetool_sequencetool_sequencePasses if run.tool_calls names match the expected ordered list exactly
ToolOutputReferencedtool_output_referencedrequire_tool_output_referencePasses if the final response overlaps meaningfully with a tool output (threshold: 35%)
ContainscontainscontainsPasses if every phrase appears (case-insensitive) in the final response
NotContainsnot_containsnot_containsPasses if no forbidden phrase appears in the final response
GroundTruthMatchground_truth_matchground_truthPasses if normalized ground_truth is a substring of normalized final response
LatencyUnderlatency_undermax_latency_ms + case.metrics.latency_msPasses if metrics.latency_ms ≤ limit
CostUndercost_undermax_cost_usd + case.metrics.cost_usdPasses if metrics.cost_usd ≤ limit

Configurable deterministic graders

These require a constructor call.

LLM judge graders

These graders call an LLM and require an API key for the configured provider.

Trace graders

These graders require a trace DAG to be present in the case. They are included in trace_graders() and the "trace" plan.
ClassnameRequired inputDescription
BadToolFailureRecoverybad_tool_failure_recoveryTrace with failed tool spansPasses if every failed tool span is followed by a recovery event (reasoning, assistant message, or final response)
UnnecessaryToolLoopunnecessary_tool_loopTrace with repeated tool callsPasses if no tool signature repeats more than max_repeated_tool_calls times
StaleContextUsagestale_context_usageTrace with stale-marked eventsPasses if no events have stale context markers in their attributes
InvalidStateTransitioninvalid_state_transitionTrace + expected.trace.allowed_state_transitionsPasses if all observed state transitions are in the allowed set
RetrievalPrecisionRecallretrieval_precision_recallTrace + expected.trace.relevant_retrieval_idsComputes precision and recall against the expected relevant document IDs
StepCostAttributionstep_cost_attributionTrace with cost_usd attributes on spansReports per-step cost; fails if any step exceeds expected.trace.max_step_cost_usd
FailureOriginfailure_originTrace with errored spans or run errorIdentifies the earliest failure-origin span or event in the trace
HallucinatedToolResultJudgehallucinated_tool_result_judgeTrace + LLMLLM judge: passes only if final response claims are supported by observed tool_result events
PlanningActionMismatchJudgeplanning_action_mismatch_judgeTrace + LLMLLM judge: passes only if later tool calls are consistent with stated reasoning/planning events

Helper functions

default_graders() -> list[Grader]

Returns a fresh instance list of all 11 deterministic graders: MaxToolCalls, RequiredTools, ForbiddenTools, ToolArgumentsMatch, ToolSequence, ToolOutputReferenced, Contains, NotContains, GroundTruthMatch, LatencyUnder, CostUnder.

trace_graders(*, completion_fn=None, judge_model=DEFAULT_RUBRIC_JUDGE_MODEL) -> list[Grader]

Returns a fresh instance list of all 9 trace-aware graders: BadToolFailureRecovery, UnnecessaryToolLoop, StaleContextUsage, InvalidStateTransition, RetrievalPrecisionRecall, StepCostAttribution, FailureOrigin, HallucinatedToolResultJudge, PlanningActionMismatchJudge.

Custom grader protocol

Any Python class that implements the Grader protocol can be passed to EvalSuite. The protocol requires:
class MyGrader:
    name = "my_grader"          # str class attribute
    requires_feedback = False   # bool class attribute; True for LLM judges

    def grade(self, case: EvalCase, run: EvalRun) -> GradeResult:
        # Use case.expected, run.final_response, run.tool_calls, etc.
        if run.final_response is None:
            return GradeResult(
                name=self.name,
                status=GradeStatus.SKIPPED,
                reason="Final response was not found.",
            )
        passed = "magic word" in run.final_response.lower()
        return GradeResult(
            name=self.name,
            status=GradeStatus.PASSED if passed else GradeStatus.FAILED,
            reason="Found magic word." if passed else "Magic word was missing.",
            score=1.0 if passed else 0.0,
        )
Pass your custom grader alongside built-ins:
suite = EvalSuite(graders=[
    *default_graders(),
    MyGrader(),
    RubricJudge("response_quality"),
])
result = suite.run(dataset)
Return GradeStatus.SKIPPED when the required expected field is absent. This keeps the case status as NOT_EVALUATED rather than forcing a failure, and avoids inflating the pass or fail counts with incomplete data.

Complete example

from northstar.evals import EvalSuite, Dataset
from northstar.evals.graders import RubricJudge, RequiredTools, Contains, grader_plan

# Load dataset
dataset = Dataset.from_path("evals/weather_agent.jsonl")

# Custom grader mix
suite = EvalSuite(
    graders=[
        RequiredTools(),
        Contains(),
        RubricJudge(
            name="response_quality",
            model="openai/gpt-4o-mini",
            threshold=0.7,
        ),
    ],
    metadata={"experiment": "v2-weather-agent", "model": "gpt-4o-mini"},
)

result = suite.run(dataset)

print(f"Total cases:     {result.total_cases}")
print(f"Evaluated:       {result.evaluated_cases}")
print(f"Passed:          {result.passed_cases}")
print(f"Pass rate:       {result.pass_rate:.1%}")
print(f"Skipped grades:  {result.skipped_grades}")

for case_result in result.case_results:
    print(f"\nCase {case_result.case_id}: {case_result.status}")
    for grade in case_result.grades:
        print(f"  {grade.name}: {grade.status}{grade.reason}")
        if grade.feedback:
            print(f"    Feedback: {grade.feedback}")

Build docs developers (and LLMs) love