Dataset is the entry point for loading evaluation cases into NorthStar’s eval pipeline. It reads structured EvalCase records from JSON or JSONL files on disk, or directly from a list of Python dicts, and exposes them as an iterable collection. Each EvalCase describes one test scenario: the agent’s input and conversation messages, what the expected outcome looks like, and any pre-measured performance metrics that deterministic graders can use.
Dataset class
Dataset.from_path(path)
Auto-detects the file format from the extension and delegates to from_json or from_jsonl. Raises ValueError for unsupported extensions.
path: Path to a .json or .jsonl file. The extension (case-insensitive) determines the parser: .json → from_json, .jsonl → from_jsonl. Any other extension raises ValueError.
Returns: Dataset
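The extension-based dispatch described above can be sketched with the standard library alone. This is an illustration of the rule, not NorthStar's implementation; `detect_format` is a hypothetical helper name.

```python
from pathlib import Path


def detect_format(path):
    """Return "json" or "jsonl" based on the file extension, ignoring case."""
    suffix = Path(path).suffix.lower()
    if suffix == ".json":
        return "json"
    if suffix == ".jsonl":
        return "jsonl"
    # Anything else is rejected, mirroring the documented ValueError.
    raise ValueError(f"Unsupported dataset extension: {suffix!r}")
```

Under this sketch, `cases.JSON` and `cases.json` resolve to the same parser, while `cases.csv` is rejected before any file is opened.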
Dataset.from_json(path)
Loads eval cases from a JSON file. The file may contain:
- A list of case objects ([{...}, {...}])
- An object with a "cases" key whose value is a list ({"cases": [{...}]})
- A single case object ({...}), wrapped in a one-element list
Raises ValueError if the file is not valid JSON or if the structure does not match one of these shapes.
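A minimal sketch of how the three accepted shapes might be normalized into a single list of raw case dicts. The helper names `normalize_cases` and `load_raw_cases` are hypothetical, and the check that "cases" holds a list is an assumption; the real loader also validates each record against EvalCase, which is omitted here.

```python
import json


def normalize_cases(payload):
    """Map any of the three accepted JSON shapes to a list of raw case dicts."""
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict):
        if "cases" in payload:
            cases = payload["cases"]
            if not isinstance(cases, list):
                raise ValueError('"cases" must be a list')
            return cases
        # A single case object is wrapped in a one-element list.
        return [payload]
    raise ValueError("Expected a list, a 'cases' object, or a single case object")


def load_raw_cases(text):
    """Parse JSON text and normalize it, reporting bad JSON as ValueError."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Invalid JSON: {exc}") from exc
    return normalize_cases(payload)
```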
path: Path to the .json dataset file. Read as UTF-8.
Returns: Dataset
Dataset.from_jsonl(path)
Loads eval cases from a JSONL file (one JSON object per line). Blank lines are skipped. Raises ValueError with a line number if any line contains invalid JSON or a malformed EvalCase.
path: Path to the .jsonl dataset file. Read as UTF-8, line-by-line.
Returns: Dataset
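The line-by-line behaviour can be illustrated with a short standard-library sketch. It assumes line numbers are 1-based and count blank lines; `parse_jsonl` is a hypothetical helper, and EvalCase validation is left out.

```python
import json


def parse_jsonl(text):
    """Parse JSONL text, skipping blank lines and reporting the failing line."""
    records = []
    for line_number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # blank lines are skipped, but still counted
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"Line {line_number}: invalid JSON ({exc.msg})") from exc
    return records
```

Including the line number in the message lets a large dataset file be fixed without bisecting it by hand.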
Dataset.from_records(records)
Constructs a Dataset directly from a Python iterable of dicts. Each dict is validated against the EvalCase schema using Pydantic’s model_validate. Useful for programmatically generated test cases.
records: An iterable of raw dictionaries. Each must be a valid EvalCase payload.
Returns: Dataset
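As a rough illustration of programmatic generation, the records can be built with a comprehension before being handed to the loader. Only keys named on this page (input, expected with its goal, metadata) are used; the identifier key is left out because its exact name is not given here, so a real payload would need it added. The import path for Dataset is not shown on this page, so the call appears only as a comment.

```python
cities = ["Paris", "Tokyo", "Lima"]

# One raw case dict per city, built from a shared template.
records = [
    {
        "input": f"What is the weather in {city}?",
        "expected": {"goal": f"Reports current weather conditions for {city}."},
        "metadata": {"city": city},
    }
    for city in cities
]

# With NorthStar installed, the list would be passed to the loader:
# dataset = Dataset.from_records(records)
```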
__iter__() and __len__()
Dataset implements both __iter__ and __len__, making it compatible with for loops and len().
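The two protocol methods are all a collection needs to work with for loops and len(). The class below is a hypothetical stand-in written for illustration, not the Dataset class itself.

```python
class CaseCollection:
    """Hypothetical stand-in exposing the same two protocol methods."""

    def __init__(self, cases):
        self._cases = list(cases)

    def __iter__(self):
        # Enables `for case in collection` and comprehensions.
        return iter(self._cases)

    def __len__(self):
        # Enables `len(collection)`.
        return len(self._cases)


collection = CaseCollection([{"input": "a"}, {"input": "b"}])
inputs = [case["input"] for case in collection]
```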
EvalCase
Each record in a dataset is validated and stored as an EvalCase. The models use extra="forbid", so unknown keys in the JSON raise a validation error.
A unique identifier for this test case. Used to correlate CaseResult objects in the EvalResult back to specific dataset rows. Example: "case-001" or "weather-query-paris".
The raw agent input for this case. May be a string, dict, or any JSON-serializable value. Also used to carry a trace payload when trace is not explicitly set: if input is a dict containing a "trace" key, EvalSuite will extract it automatically.
The full conversation history for this case in OpenAI message format. The EvalSuite parses these to reconstruct system_prompts, user_messages, assistant_messages, tool_calls, and tool_outputs for grading.
Describes the expected outcome. Each grader reads specific fields from expected to decide whether to run and how to score. Defaults to an empty EvalExpected (all fields None), which causes every grader to skip gracefully.
Pre-measured performance metrics for this case. Used by LatencyUnder and CostUnder graders. Defaults to an empty EvalMetrics.
Arbitrary key-value pairs attached to this case. Available to custom graders through case.metadata. Defaults to an empty dict.
An optional raw trace payload (the JSON structure produced by client.flush()). When provided, EvalSuite reconstructs an EvalTraceDag for trace-aware graders like BadToolFailureRecovery, UnnecessaryToolLoop, and HallucinatedToolResultJudge. Defaults to None.
EvalExpected
EvalExpected describes what a correct agent run looks like. Every field is optional — graders skip gracefully when their required field is absent.
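The skip-when-absent pattern, together with the single-string coercion and case-insensitive matching that several fields in this section describe, can be sketched as follows. Both helper names are hypothetical, and treating an empty list as "skip" is an assumption rather than documented behaviour.

```python
def as_list(value):
    """Coerce None to an empty list and a bare string to a one-element list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def check_contains(response, phrases):
    """Return None to skip, otherwise True/False for a case-insensitive match."""
    required = as_list(phrases)
    if not required:
        return None  # nothing expected, so the check does not run
    lowered = response.lower()
    return all(phrase.lower() in lowered for phrase in required)
```

Returning None rather than True keeps a skipped check distinguishable from a passed one when results are aggregated.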
A natural-language description of what a correct response should achieve. Read by RubricJudge and FaithfulnessJudge as the primary evaluation criterion.
A detailed rubric string that overrides goal inside RubricJudge. Use when you need more granular pass/fail criteria than a simple goal statement.
The canonical correct answer. Read by GroundTruthMatch (substring match) and surfaced to RubricJudge as additional context.
Supporting documents or reference passages the agent was given. Read by FaithfulnessJudge to check whether claims in the response are grounded in evidence. A single string is also accepted and converted to a one-element list.
Tool names that must appear in run.tool_calls. Read by RequiredTools. A single string is also accepted.
Tool names that must not appear in run.tool_calls. Read by ForbiddenTools. A single string is also accepted.
The exact ordered list of tool names the agent must call. Read by ToolSequence. A single string is also accepted.
Expected tool names with their expected argument subsets. Read by ToolArgumentsMatch. Each entry has a name (tool name) and arguments (a dict; the actual arguments must be a superset).
When True, ToolOutputReferenced checks that the final response meaningfully references content from at least one tool output. When False or None, the grader skips.
Maximum allowed number of tool calls. Read by MaxToolCalls. Must be non-negative.
Phrases that must appear (case-insensitive) in the final response. Read by Contains. A single string is also accepted.
Phrases that must not appear (case-insensitive) in the final response. Read by NotContains. A single string is also accepted.
Maximum allowed latency in milliseconds. Read by LatencyUnder against case.metrics.latency_ms. Must be non-negative.
Maximum allowed cost in USD. Read by CostUnder against case.metrics.cost_usd. Must be non-negative.
Trace-level constraints for trace-aware graders. See TraceExpected below.
TraceExpected
TraceExpected is nested inside EvalExpected.trace and provides constraints for trace-aware graders.
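For the retrieval thresholds in this section, the standard set-based definitions of precision and recall give a sense of what is being compared. This is a sketch under assumptions: how RetrievalPrecisionRecall handles empty sets or duplicate IDs is not stated here, and both function names are hypothetical.

```python
def precision_recall(retrieved_ids, relevant_ids):
    """Compute set-based precision and recall for one retrieval step."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)
    # Assumption: an empty denominator yields 0.0 rather than an error.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def meets_thresholds(precision, recall, min_precision=None, min_recall=None):
    """Apply only the thresholds that are set; None means no constraint."""
    if min_precision is not None and precision < min_precision:
        return False
    if min_recall is not None and recall < min_recall:
        return False
    return True
```

For example, retrieving four documents of which two belong to a three-document relevance set gives a precision of 2/4 and a recall of 2/3.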
Maximum number of times the same tool signature may repeat before UnnecessaryToolLoop fires. Must be positive. Defaults to 3 inside the grader when None.
List of {from_state, to_state} pairs that are permitted. Read by InvalidStateTransition. Any state transition not in this list causes a failure.
The set of document IDs that should have been retrieved. Used by RetrievalPrecisionRecall as the ground-truth relevance set.
Minimum acceptable retrieval precision (0–1). Used by RetrievalPrecisionRecall.
Minimum acceptable retrieval recall (0–1). Used by RetrievalPrecisionRecall.
Maximum allowed cost per individual span. Read by StepCostAttribution. Must be non-negative.
EvalMetrics
EvalMetrics carries pre-measured performance values for a case.
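A pre-measured value is only useful alongside a limit from EvalExpected. The sketch below shows one way such a comparison might work; it assumes the limit is inclusive and that the check is skipped when either side is missing, neither of which is confirmed here. `under_limit` is a hypothetical helper.

```python
def under_limit(measured, limit):
    """Return None to skip, otherwise whether the measured value is within the limit."""
    if measured is None or limit is None:
        return None  # nothing to compare
    # Assumption: a value exactly equal to the limit passes.
    return measured <= limit
```

The same helper covers both latency in milliseconds and cost in USD, since each is a single non-negative number compared against a maximum.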
The measured end-to-end latency for this case in milliseconds. Used by LatencyUnder.
The measured total cost in USD for this case. Used by CostUnder.
Dataset file formats
JSON array format
JSON object with cases key format
JSONL format
JSONL files must have one complete JSON object per line. Blank lines are skipped automatically. Each line is validated independently, so a single invalid line raises a ValueError with its line number.
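The three layouts named in this section can be produced from the same records with the standard json module. The records here are illustrative and use only keys named on this page (input, expected with its goal); a complete case would also carry its identifier and any other fields the graders need.

```python
import json

records = [
    {"input": "What is 2 + 2?", "expected": {"goal": "States that the answer is 4."}},
    {"input": "Name the capital of France.", "expected": {"goal": "Identifies Paris."}},
]

# JSON array format: the file is a list of case objects.
json_array_text = json.dumps(records, indent=2)

# JSON object with cases key format: the list sits under "cases".
cases_object_text = json.dumps({"cases": records}, indent=2)

# JSONL format: one compact JSON object per line.
jsonl_text = "\n".join(json.dumps(record) for record in records) + "\n"
```

Writing each string to a file with the matching .json or .jsonl extension, encoded as UTF-8, gives three equivalent dataset files.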