Scores are quality signals you attach to a run after it completes. They let you correlate human feedback, automated eval results, and computed metrics directly with the trace data that produced them. Each score is stored in the trace database alongside the run’s spans and events, and is immediately visible on the NorthStar dashboard. You can attach multiple scores per run, for example one for relevance, one for faithfulness, and one for human thumbs-up feedback.
Score model
Every score submitted through client.score() is stored as a Score record with the following fields:
- The id of the Run this score is attached to. Passed as the first positional argument to client.score().
- Optionally scope the score to a specific child Span within the run. When None, the score applies to the run as a whole. Pass via the span_id= keyword argument.
- A non-blank label identifying what this score measures. Examples: "relevance", "faithfulness", "thumbs_up", "cost_ok". Used as the display name in the dashboard.
- The numeric score value. For boolean scores, this is 1.0 (true) or 0.0 (false). For categorical scores, this is always 0.0 and the human-readable label is carried in string_value.
- The data type, inferred automatically from the Python type of the value you pass. You can also pass data_type= explicitly if needed. The three valid types are:
  - "numeric": a float or int on a continuous scale (e.g. 0.92)
  - "boolean": a Python bool; True maps to 1.0, False to 0.0
  - "categorical": a Python str label (e.g. "thumbs_up")
- Always "api" for scores submitted through the Python SDK. This field is set automatically and cannot be overridden.
- Optional free-text note, e.g. an annotator’s rationale or a link to a review. Passed via the comment= keyword argument.

client.score()
Enqueues a score to be flushed with the next batch. The score is sent to the backend on the next client.flush() call (or automatically by the background worker).
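The enqueue-then-flush behavior described above can be pictured with a small stand-in. This is only a sketch: `BufferedScoreClient` and `send_batch` are hypothetical names, not part of the NorthStar SDK, and the real client may buffer and send differently (for example through its background worker).

```python
import threading


class BufferedScoreClient:
    """Hypothetical stand-in for a batching client; not the NorthStar SDK."""

    def __init__(self, send_batch):
        self._send_batch = send_batch  # callable that receives a list of scores
        self._pending = []
        self._lock = threading.Lock()

    def score(self, run_id, name, value):
        # Only enqueue; nothing is sent until flush() runs.
        with self._lock:
            self._pending.append({"run_id": str(run_id), "name": name, "value": value})

    def flush(self):
        # Swap the queue out under the lock, then send outside of it.
        with self._lock:
            batch, self._pending = self._pending, []
        if batch:
            self._send_batch(batch)
        return len(batch)


sent_batches = []
client = BufferedScoreClient(sent_batches.append)
client.score("run-1", "relevance", 0.92)
client.score("run-1", "thumbs_up", True)
batches_before_flush = len(sent_batches)  # still zero: scores are only queued
flushed = client.flush()                  # both scores leave in one batch
```

In a short-lived script, the practical consequence is that an explicit client.flush() before exit is the safest way to make sure queued scores are sent.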
- The id of the Run to attach this score to. Accepts either a UUID object or a UUID string.
- A non-blank display label for this score. Raises ValueError if blank.
- The score value. The Python type determines data_type automatically:
  - float or int → "numeric"
  - bool → "boolean" (True = 1.0, False = 0.0)
  - str → "categorical" (stored as the string_value)
- Optional. Scope the score to a specific child span. When omitted, the score applies to the whole run.
- Optional. If provided, must match the type inferred from value. Raises ValueError on mismatch. Omit this parameter in most cases; the SDK infers it correctly.
- Optional free-text annotation. Displayed alongside the score in the dashboard.
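To make the validation and storage rules above concrete, here is a minimal sketch of how a call could be turned into a stored record. `build_score` is a hypothetical helper written only from the rules on this page, not the SDK's actual code. The dictionary keys `run_id`, `name`, `value`, and `source` are assumed names, and raising TypeError for unsupported value types is also an assumption.

```python
import uuid


def build_score(run_id, name, value, *, span_id=None, data_type=None, comment=None):
    """Hypothetical helper mirroring the documented rules; not the SDK's code."""
    if not isinstance(name, str) or not name.strip():
        raise ValueError("score name must be non-blank")

    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        inferred, numeric, label = "boolean", (1.0 if value else 0.0), None
    elif isinstance(value, (int, float)):
        inferred, numeric, label = "numeric", float(value), None
    elif isinstance(value, str):
        inferred, numeric, label = "categorical", 0.0, value
    else:
        raise TypeError(f"unsupported score value type: {type(value).__name__}")

    # An explicit data_type must agree with the inferred one.
    if data_type is not None and data_type != inferred:
        raise ValueError(f"data_type={data_type!r} does not match inferred {inferred!r}")

    return {
        "run_id": str(uuid.UUID(str(run_id))),  # accepts a UUID object or a UUID string
        "span_id": span_id,
        "name": name,
        "value": numeric,
        "string_value": label,
        "data_type": inferred,
        "source": "api",  # assumed key name for the always-"api" field
        "comment": comment,
    }


thumbs = build_score(uuid.uuid4(), "thumbs_up", True)
label = build_score(str(uuid.uuid4()), "verdict", "acceptable", comment="spot check")
```

Here `thumbs` carries a value of 1.0 with data_type "boolean", while `label` carries a value of 0.0 with the text in string_value, matching the field descriptions in the Score model.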
Examples
Numeric score from a human reviewer
Boolean thumbs-up feedback
Categorical label from a human annotator
Scoping a score to a specific span
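The four cases listed above can be sketched together as follows. To keep the snippet runnable without a NorthStar backend, it uses `RecordingClient`, a hypothetical stand-in that only records calls. The positional order (run id, then label, then value) and the parameter name `name` are assumptions; this page confirms only that the run id comes first and that span_id=, data_type=, and comment= are keyword arguments. The ids, labels, and comments are placeholders.

```python
import uuid


class RecordingClient:
    """Hypothetical stand-in that records score() calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def score(self, run_id, name, value, *, span_id=None, data_type=None, comment=None):
        self.calls.append(
            {"run_id": str(run_id), "name": name, "value": value,
             "span_id": span_id, "comment": comment}
        )


client = RecordingClient()
run_id = uuid.uuid4()        # placeholder; a real id comes from a completed run
span_id = str(uuid.uuid4())  # placeholder id of a child span within that run

# Numeric score from a human reviewer
client.score(run_id, "relevance", 0.92, comment="Reviewer: answers the question directly")

# Boolean thumbs-up feedback
client.score(run_id, "thumbs_up", True)

# Categorical label from a human annotator
client.score(run_id, "verdict", "acceptable", comment="Annotator spot check")

# Scoping a score to a specific span
client.score(run_id, "faithfulness", 0.8, span_id=span_id)
```

With the real client, the same four calls would be queued and then sent on the next flush.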
data_type rules
| Python value | Inferred data_type | Notes |
|---|---|---|
| 0.92 (float) | "numeric" | Any finite float or int |
| True / False (bool) | "boolean" | Stored as 1.0 / 0.0 |
| "thumbs_up" (str) | "categorical" | Label stored in string_value |
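One detail behind the table is worth spelling out: in Python, bool is a subclass of int, so any inference that tests for numbers first would classify True as "numeric". The sketch below contrasts the two orderings. Both functions are hypothetical illustrations, not the SDK's implementation.

```python
def infer_naive(value):
    # Wrong order: isinstance(True, int) is True, so bools are caught here.
    if isinstance(value, (int, float)):
        return "numeric"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, str):
        return "categorical"
    raise TypeError(f"unsupported type: {type(value).__name__}")


def infer_data_type(value):
    # Correct order: test bool before the numeric types.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "numeric"
    if isinstance(value, str):
        return "categorical"
    raise TypeError(f"unsupported type: {type(value).__name__}")


naive_result = infer_naive(True)        # misclassified as "numeric"
correct_result = infer_data_type(True)  # "boolean", as in the table
```

The same subtlety applies when calling client.score(): passing 1 or 1.0 yields a numeric score, while passing True yields a boolean one, even though both are stored as 1.0.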
Scores from EvalSuite
When you run an EvalSuite, each grader’s GradeResult is also surfaced as a score on the corresponding run. This happens automatically, so you don’t need to call client.score() yourself for eval-generated scores.
Scores are stored with the trace and are visible in the NorthStar dashboard under the Scores tab of any run. They can be filtered, sorted, and used to build dashboards tracking quality over time.