The NorthStar eval framework gives you a structured way to measure whether your AI agent is behaving correctly — before it reaches production and after every change you make to it. You define a dataset of representative inputs, attach expected outcomes to each case, choose a grader plan, and runDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/sidmanale643/northstar/llms.txt
Use this file to discover all available pages before exploring further.
EvalSuite. NorthStar evaluates every case against every grader, skips checks that are not relevant (instead of failing them), and returns a summary of pass rates broken down by case and grader. Use evals to catch regressions in tool usage, output quality, latency budgets, and cost — all from a single suite.run(dataset) call.
End-to-end example
The example below loads a JSON dataset, runs the default"deterministic" plan, and prints a summary of the results.
Install the eval extras
LLM judges require LiteLLM. Install the optional evals extras if you plan to use The deterministic graders have no extra dependencies.
RubricJudge or FaithfulnessJudge.Create a dataset file
Save agent message transcripts as a JSON array. Each case must have an
id and a messages array that matches the OpenAI chat format.dataset.json
Grader plans
EvalSuite accepts a plan argument that selects a built-in bundle of graders. You can also pass an explicit graders list to override the plan entirely.
| Plan | Graders included | Requires LLM? |
|---|---|---|
"deterministic" | MaxToolCalls, RequiredTools, ForbiddenTools, ToolArgumentsMatch, ToolSequence, ToolOutputReferenced, Contains, NotContains, GroundTruthMatch, LatencyUnder, CostUnder | No |
"quality" | All deterministic + RubricJudge | Yes |
"agentic" | All deterministic + FaithfulnessJudge | Yes |
"trace" | BadToolFailureRecovery, UnnecessaryToolLoop, StaleContextUsage, InvalidStateTransition, RetrievalPrecisionRecall, StepCostAttribution, FailureOrigin, HallucinatedToolResultJudge, PlanningActionMismatchJudge | Yes (last two) |
Every grader automatically skips when its corresponding
expected field is absent from the case. A skipped grade does not affect the case status or pass rate — only evaluated grades count.EvalResult fields
EvalSuite.run() returns an EvalResult object with the following fields.
Total number of cases in the dataset, regardless of whether any grader was active.
Cases where at least one grader produced a PASSED or FAILED result (i.e., was not entirely skipped).
Cases where every grader was skipped — no expected fields were set that triggered a grade.
Cases where every non-skipped grade passed.
Cases where at least one non-skipped grade failed.
passed_cases / evaluated_cases. Returns 0.0 when no cases were evaluated.Total number of individual SKIPPED grade results across all cases and all graders.
Per-case breakdown. Each
CaseResult has case_id, status ("passed", "failed", or "not_evaluated"), and grades — a list of GradeResult objects, one per grader.Suite-level metadata including
plan, grader_names, and created_at timestamp.Explore further
Datasets
How to structure JSON and JSONL dataset files and load them with
Dataset.from_path().Graders
All 11+ built-in deterministic graders, custom
RegexGrader, and code graders.LLM Judges
RubricJudge, FaithfulnessJudge, and trace-level LLM judges for qualitative evaluation.