A dataset is the starting point for every eval run. It holds the collection of agent conversations you want to evaluate — each one paired with the expected outcomes NorthStar will grade against. Without a dataset there is nothing to evaluate; with a well-structured one, NorthStar can tell you exactly which cases pass, which fail, and why. Datasets can be loaded from JSON or JSONL files on disk, or constructed in memory from Python dicts. TheDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/sidmanale643/northstar/llms.txt
Use this file to discover all available pages before exploring further.
Dataset class is iterable and supports len(), so it integrates directly with EvalSuite.run().
EvalCase structure
Each entry in a dataset is anEvalCase. Only id and messages are required — every other field is optional and defaults gracefully.
Unique identifier for the case. Appears in all result objects and log output.
The original user input that triggered the agent run. Not graded directly — use
messages for the conversation transcript.The full conversation transcript in OpenAI chat format. NorthStar normalizes this into tool calls, tool outputs, and the final assistant response automatically.
The expected outcomes. All fields are optional; graders skip automatically when their field is absent. See the full field reference below.
Run-time measurements. Accepted fields are
latency_ms (float) and cost_usd (float). Required if you use LatencyUnder or CostUnder graders.Arbitrary key/value pairs. Passed through to results and available in custom graders.
Raw NorthStar trace payload. Required if you use trace graders (
BadToolFailureRecovery, UnnecessaryToolLoop, etc.).EvalExpected fields
View all EvalExpected fields
View all EvalExpected fields
A plain-language description of what a correct response should achieve. Used by
RubricJudge as the primary grading criterion when no rubric is provided.A detailed grading rubric passed verbatim to
RubricJudge. Takes precedence over the judge-level rubric when both are set.The canonical correct answer.
GroundTruthMatch checks whether the final response contains this string (case-insensitive, normalized whitespace).Reference passages or retrieved documents. Required by
FaithfulnessJudge when no tool outputs are present.The exact ordered list of tool names the agent should call.
ToolSequence checks for an exact match against the actual call order.Expected tool arguments, as a list of
{name, arguments} objects. ToolArgumentsMatch checks that every listed tool was called with at least the expected argument subset.Tool names that must appear in the run’s tool calls.
RequiredTools fails if any are missing.Tool names that must not appear.
ForbiddenTools fails if any are called.Maximum total number of tool calls allowed.
MaxToolCalls fails if the actual count exceeds this.Phrases that must appear in the final response (case-insensitive).
Contains fails if any are missing.Phrases that must not appear in the final response.
NotContains fails if any are found.When
true, ToolOutputReferenced checks that the final response is sufficiently grounded in tool output text (overlap threshold: 0.35).Maximum allowed run latency in milliseconds. Requires
metrics.latency_ms to be set.Maximum allowed total cost in USD. Requires
metrics.cost_usd to be set.Trace-level constraints used by trace graders. Fields include
max_repeated_tool_calls, allowed_state_transitions, relevant_retrieval_ids, min_retrieval_precision, min_retrieval_recall, and max_step_cost_usd.JSON dataset format
The simplest dataset is a JSON array of case objects. Each case must haveid and messages; all other fields are optional.
dataset.json
You can also wrap cases in an object with a
"cases" key — Dataset.from_json() accepts both formats. A file with a single case object (not wrapped in an array) is accepted too.Loading datasets
JSONL format
JSONL (newline-delimited JSON) works well for large datasets and streaming writes. Each non-empty line must be a valid JSON object representing oneEvalCase.
dataset.jsonl
ValueError with the line number included in the message.
Iterating and measuring a dataset
Dataset is iterable and supports len(). You can inspect cases before passing the dataset to an eval suite.
Passing a dataset to EvalSuite
EvalSuite.run() iterates the dataset once, evaluating each case against every configured grader. See the Graders and LLM Judges pages for details on what each grader checks.