NorthStar Evals: Evaluate AI Agent Runs and Tool Use

The NorthStar eval framework gives you a structured way to measure whether your AI agent is behaving correctly, both before it reaches production and after every change you make to it. You define a dataset of representative inputs, attach expected outcomes to each case, choose a grader plan, and run an EvalSuite. NorthStar evaluates every case against every grader in the plan, skips checks that do not apply to a case (instead of failing them), and returns an overall pass rate together with per-case, per-grader results. Use evals to catch regressions in tool usage, output quality, latency budgets, and cost, all from a single suite.run(dataset) call.

End-to-end example

The example below loads a JSON dataset, runs the default "deterministic" plan, and prints a summary of the results.

Install the evals extras

LLM judges require LiteLLM. Install the optional evals extras if you plan to use RubricJudge, FaithfulnessJudge, or the trace-level LLM judges.

uv add 'northstar-ai[evals]'

The deterministic graders have no extra dependencies.

Create a dataset file

Save agent message transcripts as a JSON array. Each case must have an id and a messages array that matches the OpenAI chat format.

dataset.json

[
  {
    "id": "case-001",
    "input": "What is the refund policy?",
    "messages": [
      { "role": "user", "content": "What is the refund policy?" },
      {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call-1",
            "type": "function",
            "function": {
              "name": "search_docs",
              "arguments": "{\"query\": \"refund policy\"}"
            }
          }
        ]
      },
      {
        "role": "tool",
        "tool_call_id": "call-1",
        "name": "search_docs",
        "content": "Refunds are available for 30 days after purchase."
      },
      {
        "role": "assistant",
        "content": "Refunds are available for 30 days after purchase."
      }
    ],
    "expected": {
      "goal": "Correctly explain the refund policy.",
      "ground_truth": "refunds are available for 30 days",
      "required_tools": ["search_docs"],
      "tool_sequence": ["search_docs"],
      "contains": ["30 days"],
      "require_tool_output_reference": true,
      "max_tool_calls": 2
    },
    "metrics": {
      "latency_ms": 820,
      "cost_usd": 0.0012
    }
  }
]
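
Before handing a file to the loader, it can help to sanity-check the shape described above. The sketch below is a standalone illustration and does not use NorthStar; `validate_cases` is a hypothetical helper name, and it only checks the two stated requirements (an id and a messages array) plus the presence of a role on each message, which the OpenAI chat format implies.

```python
import json


def validate_cases(cases):
    """Return a list of readable problems; an empty list means the shape looks fine.

    Hypothetical helper, not part of NorthStar.
    """
    problems = []
    for index, case in enumerate(cases):
        label = case.get("id") or f"<case #{index}>"
        # Requirement from the docs: every case has an id.
        if not isinstance(case.get("id"), str) or not case["id"]:
            problems.append(f"{label}: missing or empty 'id'")
        # Requirement from the docs: every case has a messages array.
        messages = case.get("messages")
        if not isinstance(messages, list):
            problems.append(f"{label}: 'messages' must be a list")
            continue
        # OpenAI chat messages always carry a role.
        for position, message in enumerate(messages):
            if not isinstance(message, dict) or "role" not in message:
                problems.append(f"{label}: message {position} has no 'role'")
    return problems


raw = '[{"id": "case-001", "messages": [{"role": "user", "content": "hi"}]}, {"messages": "oops"}]'
problems = validate_cases(json.loads(raw))
for problem in problems:
    print(problem)
```

For a real file, replace the inline string with `json.load(open("dataset.json"))`.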

Run the eval suite

from northstar.evals import Dataset, EvalSuite

# Load the dataset — auto-detects .json vs .jsonl
dataset = Dataset.from_path("dataset.json")

# Run with the default deterministic plan
suite = EvalSuite()
result = suite.run(dataset)

print(f"Pass rate: {result.pass_rate:.0%}")
print(f"Passed:    {result.passed_cases} / {result.evaluated_cases}")
print(f"Skipped grades: {result.skipped_grades}")

for case_result in result.case_results:
    print(f"\n[{case_result.status}] {case_result.case_id}")
    for grade in case_result.grades:
        if grade.status != "skipped":
            print(f"  {grade.name}: {grade.status} — {grade.reason}")
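
If the suite runs on CI, the printed summary can be turned into a pass/fail signal. The following is a minimal sketch under stated assumptions: `ci_exit_code` and the `min_pass_rate` threshold are hypothetical names, and a `SimpleNamespace` stands in for the EvalResult so the snippet runs without NorthStar installed. It only reads `pass_rate` and `evaluated_cases`, both of which appear in the example above.

```python
from types import SimpleNamespace


def ci_exit_code(result, min_pass_rate=1.0):
    """Return 0 when the run meets the threshold, 1 otherwise (hypothetical helper)."""
    if result.evaluated_cases == 0:
        # Nothing was graded, so the pass rate carries no signal; fail loudly.
        return 1
    return 0 if result.pass_rate >= min_pass_rate else 1


# Stand-in object with the same attribute names as the result printed above.
fake_result = SimpleNamespace(pass_rate=0.9, evaluated_cases=10)
exit_code = ci_exit_code(fake_result, min_pass_rate=0.95)
print(exit_code)
```

In a CI script you would pass the real result and end with `sys.exit(ci_exit_code(result))`.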

Grader plans

EvalSuite accepts a plan argument that selects a built-in bundle of graders. You can also pass an explicit graders list to override the plan entirely.

from northstar.evals import EvalSuite

# 11 deterministic graders covering tools, output, latency, and cost.
# No LLM calls — fast and free to run on CI.
suite = EvalSuite(plan="deterministic")

Plan	Graders included	Requires LLM?
`"deterministic"`	`MaxToolCalls`, `RequiredTools`, `ForbiddenTools`, `ToolArgumentsMatch`, `ToolSequence`, `ToolOutputReferenced`, `Contains`, `NotContains`, `GroundTruthMatch`, `LatencyUnder`, `CostUnder`	No
`"quality"`	All deterministic + `RubricJudge`	Yes
`"agentic"`	All deterministic + `FaithfulnessJudge`	Yes
`"trace"`	`BadToolFailureRecovery`, `UnnecessaryToolLoop`, `StaleContextUsage`, `InvalidStateTransition`, `RetrievalPrecisionRecall`, `StepCostAttribution`, `FailureOrigin`, `HallucinatedToolResultJudge`, `PlanningActionMismatchJudge`	Yes (last two)

Every grader automatically skips when its corresponding expected field is absent from the case. A skipped grade does not affect the case status or pass rate — only evaluated grades count.
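
The skip behaviour can be illustrated with a toy grader. This is a sketch of the idea only, not NorthStar's implementation: `required_tools_grade` and `case_status` are hypothetical functions, grades are plain dicts, and the status strings follow the ones used elsewhere on this page ("passed", "failed", "skipped", "not_evaluated").

```python
def required_tools_grade(messages, expected):
    """Toy required-tools check that skips when the expected field is absent."""
    required = expected.get("required_tools")
    if required is None:
        return {"name": "required_tools", "status": "skipped",
                "reason": "no required_tools set"}
    # Collect tool names from assistant messages in OpenAI chat format.
    called = {
        call["function"]["name"]
        for message in messages
        for call in (message.get("tool_calls") or [])
    }
    missing = [tool for tool in required if tool not in called]
    if missing:
        return {"name": "required_tools", "status": "failed",
                "reason": f"missing tool calls: {missing}"}
    return {"name": "required_tools", "status": "passed",
            "reason": "all required tools were called"}


def case_status(grades):
    """Roll grades up into a case status; skipped grades never count."""
    evaluated = [g["status"] for g in grades if g["status"] != "skipped"]
    if not evaluated:
        return "not_evaluated"
    return "failed" if "failed" in evaluated else "passed"


messages = [{
    "role": "assistant",
    "content": None,
    "tool_calls": [{"id": "call-1", "type": "function",
                    "function": {"name": "search_docs", "arguments": "{}"}}],
}]
with_field = required_tools_grade(messages, {"required_tools": ["search_docs"]})
without_field = required_tools_grade(messages, {})
print(with_field["status"], without_field["status"])
print(case_status([with_field, without_field]))
```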

EvalResult fields

EvalSuite.run() returns an EvalResult object with the following fields.

total_cases

int

Total number of cases in the dataset, regardless of whether any grader was active.

evaluated_cases

int

Cases where at least one grader produced a PASSED or FAILED result (i.e., the case was not entirely skipped).

not_evaluated_cases

int

Cases where every grader was skipped — no expected fields were set that triggered a grade.

passed_cases

int

Cases where every non-skipped grade passed.

failed_cases

int

Cases where at least one non-skipped grade failed.

pass_rate

float

passed_cases / evaluated_cases. Returns 0.0 when no cases were evaluated.

skipped_grades

int

Total number of individual SKIPPED grade results across all cases and all graders.

case_results

list[CaseResult]

Per-case breakdown. Each CaseResult has case_id, status ("passed", "failed", or "not_evaluated"), and grades — a list of GradeResult objects, one per grader.

metadata

dict

Suite-level metadata including plan, grader_names, and created_at timestamp.
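
To make the relationships between these counters concrete, the sketch below recomputes them from a list of case statuses. It is an illustration of the definitions above, not NorthStar's code; `summarize` is a hypothetical function and the result is a plain dict whose keys mirror the field names.

```python
from collections import Counter


def summarize(case_statuses):
    """Recompute the summary counters from per-case statuses (hypothetical helper)."""
    counts = Counter(case_statuses)
    evaluated = counts["passed"] + counts["failed"]
    return {
        "total_cases": len(case_statuses),
        "evaluated_cases": evaluated,
        "not_evaluated_cases": counts["not_evaluated"],
        "passed_cases": counts["passed"],
        "failed_cases": counts["failed"],
        # Matches the documented rule: 0.0 when no cases were evaluated.
        "pass_rate": counts["passed"] / evaluated if evaluated else 0.0,
    }


summary = summarize(["passed", "passed", "failed", "not_evaluated"])
print(summary)
```

Here four cases yield three evaluated cases, so the pass rate is 2 / 3 rather than 2 / 4: the not-evaluated case is excluded from the denominator.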

Explore further

Datasets

How to structure JSON and JSONL dataset files and load them with Dataset.from_path().

Graders

All 11+ built-in deterministic graders, custom RegexGrader, and code graders.

LLM Judges

RubricJudge, FaithfulnessJudge, and trace-level LLM judges for qualitative evaluation.
