Phase 4: Test — Deterministic tests and AI Evals suite

Testing is not optional in the SDD cycle — it is a phase gate. A task is not marked “Done” until it has passing tests. The /test phase extends this principle beyond classical software testing by unifying deterministic tests (for conventional code) with non-deterministic Evals (for AI components and generative pipelines). Both types of validation must be designed and executed before the work advances to /code-simplify.

Deterministic Tests vs. AI Evals

Dimension	Deterministic Tests	AI Evals
What is validated	Classical system behaviour	Intelligent / probabilistic behaviour
Output	Pass / Fail (binary)	Score against a rubric
Examples	Compilation, unit tests, integration tests	Output quality, hallucination detection
Tooling	Vitest, Pytest, Jest, etc.	Model judges, rubric scripts
Repeatability	Fully deterministic	Statistically reproducible
Trigger	Always	Only if the project includes AI/LLM components
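
The "score against a rubric" and "statistically reproducible" rows can be illustrated with a small gate that runs an Eval several times and checks the pass rate rather than a single result. This is only a sketch: `pass_rate`, `eval_gate`, `stub_eval`, and every threshold are hypothetical names and values, not part of the SDD tooling.

```python
import random
from statistics import mean
from typing import Callable


def pass_rate(eval_once: Callable[[], float], runs: int, min_score: float) -> float:
    """Return the fraction of runs whose rubric score reaches min_score."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    results = [1.0 if eval_once() >= min_score else 0.0 for _ in range(runs)]
    return mean(results)


def eval_gate(
    eval_once: Callable[[], float],
    runs: int = 20,
    min_score: float = 0.7,
    min_pass_rate: float = 0.9,
) -> bool:
    """Pass the gate only if enough individual runs clear the score threshold."""
    return pass_rate(eval_once, runs, min_score) >= min_pass_rate


# Stand-in for a real Eval: a seeded, slightly noisy score (assumption).
rng = random.Random(0)


def stub_eval() -> float:
    return 0.8 + rng.uniform(-0.05, 0.05)


print(eval_gate(stub_eval))
```

A deterministic test would assert on one run. The gate above accepts run-to-run variation but still fails when the component scores poorly too often.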

Deterministic Tests

Every task that touches production code must produce at minimum:

Compilation check — the project builds without errors or type violations.
Unit tests — individual functions and modules are tested in isolation.
Critical path coverage — the happy path and at least one failure path for every user-facing feature are covered.
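
As a minimal sketch of the unit-test and critical-path requirements above, the following covers one happy path and one failure path for a single function. `parse_quantity` is a hypothetical production function invented for illustration. The test functions follow the `test_` naming that pytest collects, but use only plain `assert` statements so the file also runs without pytest installed.

```python
def parse_quantity(raw: str) -> int:
    """Hypothetical production function: parse a positive item quantity."""
    value = int(raw.strip())
    if value <= 0:
        raise ValueError(f"quantity must be positive, got {value}")
    return value


def test_parse_quantity_happy_path():
    # Happy path: surrounding whitespace is tolerated.
    assert parse_quantity(" 3 ") == 3


def test_parse_quantity_rejects_non_positive():
    # Failure path: zero must be rejected with ValueError.
    try:
        parse_quantity("0")
    except ValueError:
        return
    raise AssertionError("expected ValueError for a non-positive quantity")
```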

“Tests are proof.” If a feature exists in code but has no test verifying its behaviour, it is not complete from the SDD perspective.

AI Evals (non-deterministic components)

If the project includes AI components, complex prompts, or generative pipelines, a dedicated Evals suite must be designed and executed. Evals cover four dimensions:

1. Output quality evaluation

Assess the response against a rubric: accuracy, tone, absence of invented facts, and JSON/XML format conformity. Rubrics can be enforced by automated scripts or “Model Judge” patterns (a second LLM evaluating the first LLM’s output).
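
A rubric enforced by an automated script might look like the sketch below. The criteria, weights, and names (`Criterion`, `RUBRIC`, `score`) are assumptions chosen for illustration. A Model Judge could be plugged in as one more `check` callable, but none is implemented here.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float
    check: Callable[[str], bool]  # returns True when the response satisfies it


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def score(response: str, rubric: list[Criterion]) -> float:
    """Weighted share of rubric criteria the response satisfies, from 0.0 to 1.0."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(response))
    return earned / total


# Hypothetical rubric for a response that should be JSON describing an order.
RUBRIC = [
    Criterion("valid_json", 2.0, is_valid_json),
    Criterion("mentions_order_id", 1.0, lambda r: "order_id" in r),
    Criterion("no_apology_filler", 1.0, lambda r: "sorry" not in r.lower()),
]

print(score('{"order_id": 7}', RUBRIC))
```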

2. Tool trajectory evaluation

Audit the sequence of steps the agent took — not just whether the final result is correct. An agent that produces the right answer after 100 redundant tool calls or excessive token loops is not operating correctly. Trajectory Evals verify that the agent’s path through its tools is efficient and logical.
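
One way to audit a trajectory is to check the recorded tool calls against a call budget and flag identical repeated calls. The sketch below assumes the agent's calls are available as a list. `ToolCall`, `audit_trajectory`, and the default limits are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: str  # serialized arguments, e.g. a JSON string


def audit_trajectory(
    calls: list[ToolCall], max_calls: int = 10, max_repeats: int = 1
) -> list[str]:
    """Return a list of problems found in the trajectory (empty means it passed)."""
    problems = []
    if len(calls) > max_calls:
        problems.append(f"{len(calls)} tool calls exceeds budget of {max_calls}")
    # Identical tool + arguments more than max_repeats times suggests a loop.
    for call, count in Counter(calls).items():
        if count > max_repeats:
            problems.append(f"{call.tool}({call.args}) repeated {count} times")
    return problems


efficient = [ToolCall("read_file", "a.py"), ToolCall("run_tests", "")]
looping = [ToolCall("search", "foo")] * 12

print(audit_trajectory(efficient))
print(audit_trajectory(looping))
```

The check deliberately ignores whether the final answer was correct. It only judges the path taken.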

3. Hallucination detection

Verify that the AI has not invented API endpoints, library names, data fields, or factual claims that do not exist in the codebase or specification. This is especially important when the AI generates code that references external services.
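
For invented API endpoints specifically, a simple check compares every endpoint referenced in generated code with the set the specification actually defines. This is a sketch under assumptions: the `/api/` prefix, the `KNOWN_ENDPOINTS` set, and the regular expression are hypothetical, and a real project would load the known set from its own spec.

```python
import re

# Hypothetical endpoints taken from the project's specification.
KNOWN_ENDPOINTS = {"/api/orders", "/api/orders/{id}", "/api/customers"}

# Matches quoted string literals that start with /api/ (assumed convention).
ENDPOINT_PATTERN = re.compile(r"""["'](/api/[^"'\s?]*)""")


def normalize(path: str) -> str:
    """Replace numeric path segments with {id} so /api/orders/42 matches the spec."""
    return re.sub(r"/\d+(?=/|$)", "/{id}", path)


def find_unknown_endpoints(generated_code: str, known: set[str]) -> set[str]:
    """Return referenced endpoints that do not exist in the known set."""
    referenced = {normalize(m) for m in ENDPOINT_PATTERN.findall(generated_code)}
    return referenced - known


generated = 'requests.get("/api/orders/42")\nrequests.post("/api/refunds")'
print(find_unknown_endpoints(generated, KNOWN_ENDPOINTS))
```

The same pattern extends to library names or data fields by swapping the extraction step and the reference set.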

4. Format conformity

Confirm that structured outputs (JSON schemas, XML blocks, Markdown templates) match the exact format required by downstream consumers.
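
For JSON outputs, a conformity check can be as small as the sketch below, which uses only the standard library. The contract in `REQUIRED_FIELDS` is hypothetical. A project with a real JSON Schema would more likely validate against that schema with a dedicated validator.

```python
import json

# Hypothetical contract expected by a downstream consumer.
REQUIRED_FIELDS = {"task_id": str, "status": str, "tests_passed": int}


def conformity_errors(raw: str, required: dict[str, type]) -> list[str]:
    """Return format violations in a model's JSON output (empty means it conforms)."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level value must be a JSON object"]

    errors = []
    for field, expected in required.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        # Note: isinstance(True, int) is True in Python, so booleans pass an int check.
        elif not isinstance(payload[field], expected):
            actual = type(payload[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    # Exact-format consumers usually reject fields they do not know about.
    extra = sorted(set(payload) - set(required))
    errors.extend(f"unexpected field: {name}" for name in extra)
    return errors


print(conformity_errors('{"task_id": "T-1", "status": "done"}', REQUIRED_FIELDS))
```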

CHANGELOG and Memory Triggers

CHANGELOG: If tests uncover a bug and it is fixed, the fix is recorded in the [Sin publicar] ("Unreleased") section of CHANGELOG.md as a Fixed entry. Bugs do not disappear silently: every fix is traceable.

Memory Trigger: If a test reveals that an assumption documented in docs/SPECIFICATIONS.md was incorrect, the assistant must immediately log the finding in memory.md under ## ⚠️ Lecciones Aprendidas ("Lessons Learned"). Incorrect assumptions that are corrected in code but not reflected in the spec will cause the same mistake to recur in future sessions.

Never skip the Memory Trigger for a failed assumption. The whole value of memory.md comes from capturing decisions and corrections at the moment they occur, not from reconstructing them later.
