Structuring and Loading Eval Datasets in NorthStar

A dataset is the starting point for every eval run. It holds the agent conversations you want to evaluate, each paired with the expected outcomes NorthStar will grade against. With a well-structured dataset, NorthStar can tell you exactly which cases pass, which fail, and why. Datasets can be loaded from JSON or JSONL files on disk, or constructed in memory from Python dicts. The Dataset class is iterable and supports len(), so it can be passed directly to EvalSuite.run().

EvalCase structure

Each entry in a dataset is an EvalCase. Only id and messages are required; every other field is optional and falls back to a default when omitted.

id (str, required): Unique identifier for the case. Appears in all result objects and log output.

input (Any): The original user input that triggered the agent run. Not graded directly; use messages for the conversation transcript.

messages (list[dict], required): The full conversation transcript in OpenAI chat format. NorthStar normalizes this into tool calls, tool outputs, and the final assistant response automatically.

expected (EvalExpected): The expected outcomes. All fields are optional; graders skip automatically when their field is absent. See the full field reference below.

metrics (EvalMetrics): Run-time measurements. Accepted fields are latency_ms (float) and cost_usd (float). Required if you use LatencyUnder or CostUnder graders.

metadata (dict): Arbitrary key/value pairs. Passed through to results and available in custom graders.

trace (dict): Raw NorthStar trace payload. Required if you use trace graders (BadToolFailureRecovery, UnnecessaryToolLoop, etc.).
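To make the normalization of messages more concrete, here is a minimal sketch of how an OpenAI-style transcript could be split into tool calls, tool outputs, and a final assistant response. The function name split_transcript and the shape of its return value are hypothetical and chosen for illustration; NorthStar's internal representation may differ.

```python
import json
from typing import Any


def split_transcript(messages: list[dict[str, Any]]) -> dict[str, Any]:
    """Split an OpenAI-style chat transcript into its graded parts.

    Hypothetical helper for illustration; not part of the NorthStar API.
    """
    tool_calls: list[dict[str, Any]] = []
    tool_outputs: list[dict[str, Any]] = []
    final_response = None

    for message in messages:
        role = message.get("role")
        if role == "assistant":
            for call in message.get("tool_calls") or []:
                function = call["function"]
                tool_calls.append({
                    "id": call.get("id"),
                    "name": function["name"],
                    # OpenAI chat format stores arguments as a JSON-encoded string.
                    "arguments": json.loads(function.get("arguments") or "{}"),
                })
            if message.get("content"):
                # The last assistant message that carries text wins.
                final_response = message["content"]
        elif role == "tool":
            tool_outputs.append({
                "tool_call_id": message.get("tool_call_id"),
                "name": message.get("name"),
                "content": message.get("content"),
            })

    return {
        "tool_calls": tool_calls,
        "tool_outputs": tool_outputs,
        "final_response": final_response,
    }


messages = [
    {"role": "user", "content": "Find the refund policy."},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call-1",
            "type": "function",
            "function": {"name": "search_docs", "arguments": '{"query": "refund policy"}'},
        }],
    },
    {"role": "tool", "tool_call_id": "call-1", "name": "search_docs",
     "content": "Refunds are available for 30 days."},
    {"role": "assistant", "content": "Refunds are available for 30 days after purchase."},
]

parts = split_transcript(messages)
print(parts["tool_calls"][0]["name"], "->", parts["final_response"])
```

Seeing the transcript in this shape can help when deciding which expected fields a case needs, since most graders look at exactly one of these three parts.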

EvalExpected fields

goal (str): A plain-language description of what a correct response should achieve. Used by RubricJudge as the primary grading criterion when no rubric is provided.

rubric (str): A detailed grading rubric passed verbatim to RubricJudge. Takes precedence over the judge-level rubric when both are set.

ground_truth (str): The canonical correct answer. GroundTruthMatch checks whether the final response contains this string (case-insensitive, normalized whitespace).

context (list[str]): Reference passages or retrieved documents. Required by FaithfulnessJudge when no tool outputs are present.

tool_sequence (list[str]): The exact ordered list of tool names the agent should call. ToolSequence checks for an exact match against the actual call order.

tool_arguments (list[ExpectedToolArguments]): Expected tool arguments, as a list of {name, arguments} objects. ToolArgumentsMatch checks that every listed tool was called with at least the expected argument subset.

required_tools (list[str]): Tool names that must appear in the run’s tool calls. RequiredTools fails if any are missing.

forbidden_tools (list[str]): Tool names that must not appear. ForbiddenTools fails if any are called.

max_tool_calls (int): Maximum total number of tool calls allowed. MaxToolCalls fails if the actual count exceeds this.

contains (list[str]): Phrases that must appear in the final response (case-insensitive). Contains fails if any are missing.

not_contains (list[str]): Phrases that must not appear in the final response. NotContains fails if any are found.

require_tool_output_reference (bool): When true, ToolOutputReferenced checks that the final response is sufficiently grounded in tool output text (overlap threshold: 0.35).

max_latency_ms (float): Maximum allowed run latency in milliseconds. Requires metrics.latency_ms to be set.

max_cost_usd (float): Maximum allowed total cost in USD. Requires metrics.cost_usd to be set.

trace (TraceExpected): Trace-level constraints used by trace graders. Fields include max_repeated_tool_calls, allowed_state_transitions, relevant_retrieval_ids, min_retrieval_precision, min_retrieval_recall, and max_step_cost_usd.
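The matching rules described for ground_truth, tool_sequence, and tool_arguments can be sketched in a few lines of plain Python. This is an illustrative approximation built only from the descriptions above: the function names are hypothetical, and the real graders may normalize text or compare arguments differently.

```python
import re
from typing import Any


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()


def ground_truth_matches(response: str, ground_truth: str) -> bool:
    """Containment check: case-insensitive, whitespace-normalized."""
    return normalize(ground_truth) in normalize(response)


def sequence_matches(actual: list[str], expected: list[str]) -> bool:
    """Exact match: same tool names, same order, same length."""
    return actual == expected


def arguments_match(actual_calls: list[dict[str, Any]],
                    expected_calls: list[dict[str, Any]]) -> bool:
    """Every expected tool needs one call whose arguments include the expected subset."""
    for expected in expected_calls:
        found = any(
            call["name"] == expected["name"]
            and all(
                key in call["arguments"] and call["arguments"][key] == value
                for key, value in expected["arguments"].items()
            )
            for call in actual_calls
        )
        if not found:
            return False
    return True


response = "Refunds  are available\nfor 30 days after purchase."
actual_calls = [{"name": "search_docs", "arguments": {"query": "refund policy", "top_k": 3}}]
expected_calls = [{"name": "search_docs", "arguments": {"query": "refund policy"}}]

print(ground_truth_matches(response, "refunds are available for 30 days"))
print(sequence_matches(["search_docs"], ["search_docs"]))
print(arguments_match(actual_calls, expected_calls))
```

Note the difference in strictness: the sequence check rejects any extra or reordered call, while the argument check tolerates extra arguments such as the hypothetical top_k above.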

JSON dataset format

The simplest dataset is a JSON array of case objects. Each case must have id and messages; all other fields are optional.

dataset.json

[
  {
    "id": "case-001",
    "input": "What is the capital of France?",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" },
      { "role": "assistant", "content": "The capital of France is Paris." }
    ],
    "expected": {
      "goal": "Correctly identify Paris as the capital of France.",
      "ground_truth": "Paris",
      "contains": ["Paris"],
      "not_contains": ["London", "Berlin"]
    }
  },
  {
    "id": "case-002",
    "input": "Find the refund policy.",
    "messages": [
      { "role": "user", "content": "Find the refund policy." },
      {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call-1",
            "type": "function",
            "function": {
              "name": "search_docs",
              "arguments": "{\"query\": \"refund policy\"}"
            }
          }
        ]
      },
      {
        "role": "tool",
        "tool_call_id": "call-1",
        "name": "search_docs",
        "content": "Refunds are available for 30 days."
      },
      {
        "role": "assistant",
        "content": "Refunds are available for 30 days after purchase."
      }
    ],
    "expected": {
      "required_tools": ["search_docs"],
      "tool_sequence": ["search_docs"],
      "tool_arguments": [
        { "name": "search_docs", "arguments": { "query": "refund policy" } }
      ],
      "require_tool_output_reference": true,
      "max_tool_calls": 2,
      "ground_truth": "refunds are available for 30 days"
    },
    "metrics": {
      "latency_ms": 740,
      "cost_usd": 0.0009
    }
  }
]

Dataset.from_json() accepts three shapes: a bare array of cases (as above), an object that wraps the array under a "cases" key, and a single case object that is not wrapped in an array.
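As a rough illustration of how the accepted JSON shapes can be reduced to one list of cases, consider the sketch below. The helper extract_cases is hypothetical and is not the implementation of Dataset.from_json(); it only mirrors the behavior described above.

```python
import json
from typing import Any


def extract_cases(payload: Any) -> list[dict]:
    """Reduce the accepted JSON shapes to a flat list of case dicts.

    Hypothetical helper for illustration; not part of the NorthStar API.
    """
    if isinstance(payload, list):
        # Shape 1: a bare array of case objects.
        return payload
    if isinstance(payload, dict):
        if "cases" in payload:
            # Shape 2: an object wrapping the array under "cases".
            cases = payload["cases"]
            if not isinstance(cases, list):
                raise ValueError('"cases" must be a JSON array')
            return cases
        # Shape 3: a single case object.
        return [payload]
    raise ValueError("dataset JSON must be an array or an object")


bare = json.loads('[{"id": "a", "messages": []}, {"id": "b", "messages": []}]')
wrapped = json.loads('{"cases": [{"id": "a", "messages": []}]}')
single = json.loads('{"id": "a", "messages": []}')

for payload in (bare, wrapped, single):
    print(len(extract_cases(payload)))
```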

Loading datasets

from northstar.evals import Dataset

# Automatically picks JSON or JSONL based on file extension
dataset = Dataset.from_path("dataset.json")
dataset = Dataset.from_path("dataset.jsonl")

Dataset.from_path() raises ValueError for unsupported file extensions. Only .json and .jsonl are accepted.
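If you want to fail early on a bad path before handing it to the loader, the extension check can be approximated as follows. This is a sketch under assumptions: detect_format is a hypothetical helper, and treating the extension case-insensitively is a choice made here, not documented behavior of Dataset.from_path().

```python
from pathlib import Path


def detect_format(path: str) -> str:
    """Return "json" or "jsonl" based on the file extension, else raise ValueError."""
    # Assumption: extensions are compared case-insensitively.
    suffix = Path(path).suffix.lower()
    if suffix == ".json":
        return "json"
    if suffix == ".jsonl":
        return "jsonl"
    raise ValueError(f"Unsupported dataset extension {suffix!r}; expected .json or .jsonl")


print(detect_format("evals/dataset.json"))
print(detect_format("evals/dataset.jsonl"))
```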

JSONL format

JSONL (newline-delimited JSON) works well for large datasets and streaming writes. Each non-empty line must be a valid JSON object representing one EvalCase.

dataset.jsonl

{"id": "case-001", "messages": [{"role": "assistant", "content": "Paris"}], "expected": {"ground_truth": "Paris"}}
{"id": "case-002", "messages": [{"role": "assistant", "content": "Refunds are 30 days."}], "expected": {"contains": ["30 days"]}}
{"id": "case-003", "messages": [{"role": "assistant", "content": "Contact support."}]}

Blank lines are skipped. A parse error on any line raises ValueError with the line number included in the message.
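The parsing rules for JSONL (skip blank lines, report the line number on failure) can be sketched with the standard library alone. The function parse_jsonl and its error wording are illustrative assumptions, not NorthStar's actual loader or messages.

```python
import json


def parse_jsonl(text: str) -> list[dict]:
    """Parse newline-delimited JSON, one object per non-empty line."""
    cases = []
    # Split on "\n" only, so unusual separators inside JSON strings are left alone.
    for line_number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # blank lines are skipped
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        cases.append(record)
    return cases


good = (
    '{"id": "case-001", "messages": [{"role": "assistant", "content": "Paris"}]}\n'
    "\n"
    '{"id": "case-002", "messages": [{"role": "assistant", "content": "Contact support."}]}\n'
)
print(len(parse_jsonl(good)))

bad = '{"id": "case-001", "messages": []}\n\n{"id": "case-002", '
try:
    parse_jsonl(bad)
except ValueError as exc:
    print(exc)
```

Line numbers count blank lines too, so the reported number matches what a text editor shows.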

Iterating and measuring a dataset

Dataset is iterable and supports len(). You can inspect cases before passing the dataset to an eval suite.

from northstar.evals import Dataset

dataset = Dataset.from_path("dataset.json")

print(f"Dataset has {len(dataset)} cases")

for case in dataset:
    print(case.id, "→", case.expected.goal)
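Because ids are meant to be unique and both id and messages are required, a quick pre-flight check on the raw case dicts can catch problems before an eval run. The sketch below works on plain dicts (for example, the parsed contents of dataset.json) rather than on Dataset objects; find_problems is a hypothetical helper, not part of NorthStar.

```python
from collections import Counter


def find_problems(cases: list[dict]) -> list[str]:
    """Report missing required fields and duplicate ids in raw case dicts."""
    problems = []
    for index, case in enumerate(cases):
        for field in ("id", "messages"):
            if field not in case:
                problems.append(f"case at index {index} is missing required field {field!r}")

    # Count ids only for cases that have one; missing ids are reported above.
    counts = Counter(case["id"] for case in cases if "id" in case)
    for case_id, count in counts.items():
        if count > 1:
            problems.append(f"id {case_id!r} appears {count} times")
    return problems


raw_cases = [
    {"id": "case-001", "messages": []},
    {"id": "case-001", "messages": []},
    {"id": "case-002"},
]

for problem in find_problems(raw_cases):
    print(problem)
```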

Passing a dataset to EvalSuite

from northstar.evals import Dataset, EvalSuite

dataset = Dataset.from_path("dataset.json")
result = EvalSuite().run(dataset)

print(f"Pass rate: {result.pass_rate:.0%} ({result.passed_cases}/{result.evaluated_cases})")

EvalSuite.run() iterates the dataset once, evaluating each case against every configured grader. See the Graders and LLM Judges pages for details on what each grader checks.
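To build intuition for what a run computes, here is a toy model of the loop: each grader returns True, False, or None (skipped because its field is absent), and the pass rate is the share of cases with no failing grader. Everything in this sketch is an assumption made for illustration, including the rule that a case with only skipped graders counts as passed and the case-insensitive handling of not_contains; it is not how EvalSuite is implemented.

```python
from typing import Callable, Optional

# A grader returns True/False, or None when it skips the case.
Grader = Callable[[dict], Optional[bool]]


def final_response(case: dict) -> str:
    """Return the text of the last assistant message that has content."""
    for message in reversed(case["messages"]):
        if message.get("role") == "assistant" and message.get("content"):
            return message["content"]
    return ""


def contains_grader(case: dict) -> Optional[bool]:
    phrases = case.get("expected", {}).get("contains")
    if not phrases:
        return None  # field absent: skip
    text = final_response(case).lower()
    return all(phrase.lower() in text for phrase in phrases)


def not_contains_grader(case: dict) -> Optional[bool]:
    phrases = case.get("expected", {}).get("not_contains")
    if not phrases:
        return None  # field absent: skip
    # Assumption: compared case-insensitively, like contains.
    text = final_response(case).lower()
    return not any(phrase.lower() in text for phrase in phrases)


def run_suite(cases: list[dict], graders: list[Grader]) -> dict:
    """Evaluate every case against every grader in a single pass."""
    passed = 0
    for case in cases:
        verdicts = [grader(case) for grader in graders]
        graded = [verdict for verdict in verdicts if verdict is not None]
        # Assumption: a case passes when no non-skipped grader fails.
        if all(graded):
            passed += 1
    total = len(cases)
    return {
        "evaluated_cases": total,
        "passed_cases": passed,
        "pass_rate": passed / total if total else 0.0,
    }


cases = [
    {"id": "case-001", "messages": [{"role": "assistant", "content": "Paris"}],
     "expected": {"contains": ["Paris"], "not_contains": ["London"]}},
    {"id": "case-002", "messages": [{"role": "assistant", "content": "Refunds are 14 days."}],
     "expected": {"contains": ["30 days"]}},
    {"id": "case-003", "messages": [{"role": "assistant", "content": "Contact support."}]},
]

summary = run_suite(cases, [contains_grader, not_contains_grader])
print(f"Pass rate: {summary['pass_rate']:.0%} ({summary['passed_cases']}/{summary['evaluated_cases']})")
```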
