The NorthStar eval framework gives you a structured way to measure whether your AI agent is behaving correctly — before it reaches production and after every change you make to it. You define a dataset of representative inputs, attach expected outcomes to each case, choose a grader plan, and run EvalSuite. NorthStar evaluates every case against every grader, skips checks that are not relevant (instead of failing them), and returns a summary of pass rates broken down by case and grader. Use evals to catch regressions in tool usage, output quality, latency budgets, and cost — all from a single suite.run(dataset) call.

End-to-end example

The example below loads a JSON dataset, runs the default "deterministic" plan, and prints a summary of the results.
1. Install the eval extras

LLM judges require LiteLLM. Install the optional evals extras if you plan to use RubricJudge or FaithfulnessJudge.
uv add 'northstar-ai[evals]'
The deterministic graders have no extra dependencies.
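
Because the LLM judges depend on LiteLLM, it can be useful to check whether the package is importable before picking a plan. The following is a minimal sketch using only the standard library. It assumes LiteLLM is importable as litellm; the helpers has_module and choose_plan are hypothetical names introduced for illustration and are not part of NorthStar. The plan names are the ones listed under Grader plans.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def choose_plan(preferred: str, llm_available: bool) -> str:
    """Hypothetical helper: fall back to the deterministic plan when the
    preferred plan always needs an LLM judge but LiteLLM is missing."""
    # "trace" is left out on purpose: only two of its graders use an LLM.
    needs_llm = {"quality", "agentic"}
    if preferred in needs_llm and not llm_available:
        return "deterministic"
    return preferred


plan = choose_plan("quality", has_module("litellm"))
print(f"Selected plan: {plan}")
```

The selected name could then be passed as the plan argument to EvalSuite.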
2. Create a dataset file

Save agent message transcripts as a JSON array. Each case must have an id and a messages array that matches the OpenAI chat format.
dataset.json
[
  {
    "id": "case-001",
    "input": "What is the refund policy?",
    "messages": [
      { "role": "user", "content": "What is the refund policy?" },
      {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call-1",
            "type": "function",
            "function": {
              "name": "search_docs",
              "arguments": "{\"query\": \"refund policy\"}"
            }
          }
        ]
      },
      {
        "role": "tool",
        "tool_call_id": "call-1",
        "name": "search_docs",
        "content": "Refunds are available for 30 days after purchase."
      },
      {
        "role": "assistant",
        "content": "Refunds are available for 30 days after purchase."
      }
    ],
    "expected": {
      "goal": "Correctly explain the refund policy.",
      "ground_truth": "refunds are available for 30 days",
      "required_tools": ["search_docs"],
      "tool_sequence": ["search_docs"],
      "contains": ["30 days"],
      "require_tool_output_reference": true,
      "max_tool_calls": 2
    },
    "metrics": {
      "latency_ms": 820,
      "cost_usd": 0.0012
    }
  }
]
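
Before handing a file to the suite, a quick shape check can catch cases that lack an id or a messages array. The sketch below is not part of NorthStar: validate_cases is a hypothetical helper, and the check that each message carries a role is an assumption based on the OpenAI chat format mentioned above. It validates an inline sample so that it runs without any files on disk.

```python
import json


def validate_cases(cases: list) -> list[str]:
    """Return a list of human-readable problems; an empty list means the
    basic shape (id + messages array) looks fine."""
    problems = []
    for index, case in enumerate(cases):
        label = f"case #{index}"
        if not isinstance(case, dict):
            problems.append(f"{label}: not a JSON object")
            continue
        case_id = case.get("id")
        if isinstance(case_id, str) and case_id:
            label = case_id
        else:
            problems.append(f"{label}: missing or empty 'id'")
        messages = case.get("messages")
        if not isinstance(messages, list):
            problems.append(f"{label}: 'messages' must be an array")
            continue
        for position, message in enumerate(messages):
            # Assumption: every chat message is an object with a 'role' key.
            if not isinstance(message, dict) or "role" not in message:
                problems.append(f"{label}: message {position} has no 'role'")
    return problems


# Inline sample: the first case is well formed, the second is not.
sample = json.loads(
    '[{"id": "case-001", "messages": [{"role": "user", "content": "Hi"}]},'
    ' {"messages": "oops"}]'
)
problems = validate_cases(sample)
for problem in problems:
    print(problem)
```

To check a real file, load it with json.load and pass the resulting list to the same function.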
3. Run the eval suite

from northstar.evals import Dataset, EvalSuite

# Load the dataset — auto-detects .json vs .jsonl
dataset = Dataset.from_path("dataset.json")

# Run with the default deterministic plan
suite = EvalSuite()
result = suite.run(dataset)

print(f"Pass rate: {result.pass_rate:.0%}")
print(f"Passed:    {result.passed_cases} / {result.evaluated_cases}")
print(f"Skipped grades: {result.skipped_grades}")

for case_result in result.case_results:
    print(f"\n[{case_result.status}] {case_result.case_id}")
    for grade in case_result.grades:
        if grade.status != "skipped":
            print(f"  {grade.name}: {grade.status} - {grade.reason}")
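
To use a run as a CI gate, the pass rate can be turned into a process exit code. The sketch below assumes only that the result object exposes a pass_rate attribute, as described under EvalResult fields. The helper exit_code_for and its min_pass_rate threshold are hypothetical, and a SimpleNamespace stands in for a real result so the example runs without NorthStar installed.

```python
from types import SimpleNamespace


def exit_code_for(result, min_pass_rate: float = 1.0) -> int:
    """Return 0 when the pass rate meets the threshold, 1 otherwise."""
    return 0 if result.pass_rate >= min_pass_rate else 1


# Stand-in for an EvalResult; a real one would come from suite.run(dataset).
fake_result = SimpleNamespace(pass_rate=0.9, passed_cases=9, evaluated_cases=10)
code = exit_code_for(fake_result, min_pass_rate=0.95)
print(f"exit code: {code}")
```

In a CI script the value would typically be passed to sys.exit. Because pass_rate is documented as 0.0 when no cases were evaluated, a dataset in which every grade is skipped fails this gate instead of passing silently.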

Grader plans

EvalSuite accepts a plan argument that selects a built-in bundle of graders. You can also pass an explicit graders list to override the plan entirely.
from northstar.evals import EvalSuite

# 11 deterministic graders covering tools, output, latency, and cost.
# No LLM calls — fast and free to run on CI.
suite = EvalSuite(plan="deterministic")
| Plan | Graders included | Requires LLM? |
| --- | --- | --- |
| "deterministic" | MaxToolCalls, RequiredTools, ForbiddenTools, ToolArgumentsMatch, ToolSequence, ToolOutputReferenced, Contains, NotContains, GroundTruthMatch, LatencyUnder, CostUnder | No |
| "quality" | All deterministic + RubricJudge | Yes |
| "agentic" | All deterministic + FaithfulnessJudge | Yes |
| "trace" | BadToolFailureRecovery, UnnecessaryToolLoop, StaleContextUsage, InvalidStateTransition, RetrievalPrecisionRecall, StepCostAttribution, FailureOrigin, HallucinatedToolResultJudge, PlanningActionMismatchJudge | Yes (last two) |
Every grader automatically skips when its corresponding expected field is absent from the case. A skipped grade does not affect the case status or pass rate — only evaluated grades count.
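
The skip rule can be illustrated with a few lines of plain Python. This is a sketch of the semantics described on this page, not NorthStar's actual implementation; case_status is a hypothetical function, and the lowercase status strings follow the CaseResult values listed under EvalResult fields.

```python
def case_status(grade_statuses: list[str]) -> str:
    """Derive a case status from its grade statuses, ignoring skipped grades."""
    evaluated = [status for status in grade_statuses if status != "skipped"]
    if not evaluated:
        # Every grader skipped: the case does not count toward the pass rate.
        return "not_evaluated"
    if any(status == "failed" for status in evaluated):
        return "failed"
    return "passed"


print(case_status(["passed", "skipped", "skipped"]))
print(case_status(["passed", "failed", "skipped"]))
print(case_status(["skipped", "skipped"]))
```

A case with one passing grade and ten skipped grades is therefore treated the same as a case with eleven passing grades.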

EvalResult fields

EvalSuite.run() returns an EvalResult object with the following fields.
total_cases (int): Total number of cases in the dataset, regardless of whether any grader was active.

evaluated_cases (int): Cases where at least one grader produced a PASSED or FAILED result (i.e., was not entirely skipped).

not_evaluated_cases (int): Cases where every grader was skipped because no expected fields were set that triggered a grade.

passed_cases (int): Cases where every non-skipped grade passed.

failed_cases (int): Cases where at least one non-skipped grade failed.

pass_rate (float): passed_cases / evaluated_cases. Returns 0.0 when no cases were evaluated.

skipped_grades (int): Total number of individual SKIPPED grade results across all cases and all graders.

case_results (list[CaseResult]): Per-case breakdown. Each CaseResult has case_id, status ("passed", "failed", or "not_evaluated"), and grades, a list of GradeResult objects with one entry per grader.

metadata (dict): Suite-level metadata including plan, grader_names, and created_at timestamp.
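
The relationships between the count fields can be made concrete with a small worked example. The sketch below recomputes them from a list of per-case statuses; summarize is a hypothetical helper written for illustration, not a NorthStar function, and it returns a plain dict instead of an EvalResult.

```python
from collections import Counter


def summarize(case_statuses: list[str]) -> dict:
    """Recompute the EvalResult count fields from per-case statuses."""
    counts = Counter(case_statuses)
    evaluated = counts["passed"] + counts["failed"]
    return {
        "total_cases": len(case_statuses),
        "evaluated_cases": evaluated,
        "not_evaluated_cases": counts["not_evaluated"],
        "passed_cases": counts["passed"],
        "failed_cases": counts["failed"],
        # Guard against division by zero when nothing was evaluated.
        "pass_rate": counts["passed"] / evaluated if evaluated else 0.0,
    }


summary = summarize(["passed", "failed", "not_evaluated", "passed"])
print(summary)
```

Here total_cases is 4 but evaluated_cases is 3, so the pass rate is 2 / 3 and the not-evaluated case does not lower it.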

Explore further

Datasets

How to structure JSON and JSONL dataset files and load them with Dataset.from_path().

Graders

All 11+ built-in deterministic graders, custom RegexGrader, and code graders.

LLM Judges

RubricJudge, FaithfulnessJudge, and trace-level LLM judges for qualitative evaluation.
