Testing is not optional in the SDD cycle: it is a phase gate. A task is not marked “Done” until it has passing tests. The /test phase extends this principle beyond classical software testing by unifying deterministic tests (for conventional code) with non-deterministic Evals (for AI components and generative pipelines). Both types of validation must be designed and executed before the work advances to /code-simplify.
Deterministic Tests vs. AI Evals
| Dimension | Deterministic Tests | AI Evals |
|---|---|---|
| What is validated | Classical system behaviour | Intelligent / probabilistic behaviour |
| Output | Pass / Fail (binary) | Score against a rubric |
| Examples | Compilation, unit tests, integration tests | Output quality, hallucination detection |
| Tooling | Vitest, Pytest, Jest, etc. | Model judges, rubric scripts |
| Repeatability | Fully deterministic | Statistically reproducible |
| Trigger | Always | Only if the project includes AI/LLM components |
Deterministic Tests
Every task that touches production code must produce at minimum:
- Compilation check: the project builds without errors or type violations.
- Unit tests: individual functions and modules are tested in isolation.
- Critical path coverage: the happy path and at least one failure path for every user-facing feature are covered.
“Tests are proof.” If a feature exists in code but has no test verifying its behaviour, it is not complete from the SDD perspective.
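As a minimal sketch of the critical-path rule, the example below covers one happy path and one failure path for a single function. The function `apply_discount` and its validation rule are hypothetical and exist only for illustration; the tests are plain functions with `assert`, which a runner such as Pytest would collect by their `test_` prefix.

```python
# Hypothetical function under test; it stands in for any user-facing feature.
def apply_discount(price: float, percent: float) -> float:
    """Return price reduced by percent; reject percentages outside 0-100."""
    if not 0 <= percent <= 100:
        raise ValueError(f"percent must be between 0 and 100, got {percent}")
    return round(price * (1 - percent / 100), 2)


def test_apply_discount_happy_path() -> None:
    # Happy path: a valid discount is applied to the price.
    assert apply_discount(200.0, 25) == 150.0


def test_apply_discount_rejects_invalid_percent() -> None:
    # Failure path: invalid input must raise instead of returning a wrong price.
    try:
        apply_discount(200.0, 150)
    except ValueError:
        return
    raise AssertionError("expected ValueError for percent > 100")
```

Both tests give a binary pass/fail result, which is what distinguishes them from the scored Evals described next.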
AI Evals (non-deterministic components)
If the project includes AI components, complex prompts, or generative pipelines, a dedicated Evals suite must be designed and executed. Evals cover four dimensions:
1. Output quality evaluation
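A script-enforced rubric for this dimension might look like the sketch below. The criteria, weights, and the 0.75 pass threshold are assumptions chosen for illustration, not values prescribed by the SDD cycle; a “Model Judge” variant would replace the simple checks with calls to a second model.

```python
import json
from typing import Callable

# Each criterion maps a name to (weight, check). All entries are illustrative.
Rubric = dict[str, tuple[float, Callable[[str], bool]]]


def is_valid_json(text: str) -> bool:
    """Return True when the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def score_response(response: str, rubric: Rubric) -> float:
    """Return the weighted fraction of rubric criteria satisfied, in [0, 1]."""
    total = sum(weight for weight, _ in rubric.values())
    earned = sum(weight for weight, check in rubric.values() if check(response))
    return earned / total


rubric: Rubric = {
    "valid_json": (2.0, is_valid_json),
    "mentions_status": (1.0, lambda r: '"status"' in r),
    "no_apology_tone": (1.0, lambda r: "sorry" not in r.lower()),
}

PASS_THRESHOLD = 0.75  # assumed threshold; a real project would set its own

good = '{"status": "ok", "items": 3}'
bad = "Sorry, I could not produce JSON."
passed = score_response(good, rubric) >= PASS_THRESHOLD
```

The result is a score rather than a boolean, so the threshold becomes an explicit, reviewable decision.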
Assess the response against a rubric: accuracy, tone, absence of invented facts, and JSON/XML format conformity. Rubrics can be enforced by automated scripts or “Model Judge” patterns (a second LLM evaluating the first LLM’s output).
2. Tool trajectory evaluation
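One way to sketch a trajectory check is to count repeated identical tool calls and compare the total against a budget. The `ToolCall` record, the notion of "redundant" as an exact repeat, and the call budget are all assumptions for this example; a real agent log may need a richer definition of redundancy.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: str  # serialized arguments, e.g. a JSON string


@dataclass
class TrajectoryReport:
    total_calls: int
    redundant_calls: int
    within_budget: bool


def evaluate_trajectory(calls: list[ToolCall], max_calls: int) -> TrajectoryReport:
    """Flag exact repeats of the same call and check the total against a budget."""
    counts = Counter(calls)
    redundant = sum(n - 1 for n in counts.values())
    return TrajectoryReport(len(calls), redundant, len(calls) <= max_calls)


# Hypothetical trajectory: the agent reads the same file three times.
trajectory = [
    ToolCall("read_file", "spec.md"),
    ToolCall("read_file", "spec.md"),
    ToolCall("run_tests", ""),
    ToolCall("read_file", "spec.md"),
]
report = evaluate_trajectory(trajectory, max_calls=3)
```

Here the report shows four calls, two of them redundant, and a blown budget, even if the agent's final answer happened to be correct.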
Audit the sequence of steps the agent took, not just whether the final result is correct. An agent that produces the right answer after 100 redundant tool calls or excessive token loops is not operating correctly. Trajectory Evals verify that the agent’s path through its tools is efficient and logical.
3. Hallucination detection
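For invented API endpoints specifically, a rough check is to extract every endpoint the generated code references and compare it with an allow-list. In this sketch the allow-list, the `/api/` prefix, and the sample snippet are hypothetical; in practice the known set would be derived from the specification or the codebase.

```python
import re

# Assumed allow-list; a real check would load this from the spec or the codebase.
KNOWN_ENDPOINTS = {"/api/users", "/api/orders"}

# Matches paths such as /api/users or /api/orders/123-abc.
ENDPOINT_PATTERN = re.compile(r"/api/[\w/-]+")


def find_unknown_endpoints(generated_code: str, known: set[str]) -> set[str]:
    """Return endpoints referenced in the generated code that are not known."""
    referenced = set(ENDPOINT_PATTERN.findall(generated_code))
    return referenced - known


# Hypothetical AI-generated snippet, treated as text to be audited.
generated = '''
resp = client.get("/api/users")
resp = client.post("/api/invoices/export")
'''
unknown = find_unknown_endpoints(generated, KNOWN_ENDPOINTS)
```

A non-empty result does not prove a hallucination, since the allow-list may be stale, but it marks exactly which references need human review.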
Verify that the AI has not invented API endpoints, library names, data fields, or factual claims that do not exist in the codebase or specification. This is especially important when the AI generates code that references external services.
4. Format conformity
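A format-conformity check for JSON output can be sketched with the standard library alone. The field names and types in `REQUIRED_FIELDS` are a hypothetical contract with a downstream consumer; a project with a formal JSON Schema would more likely use a dedicated schema validator.

```python
import json

# Hypothetical contract: required keys and the Python type each must decode to.
REQUIRED_FIELDS: dict[str, type] = {"task_id": str, "status": str, "score": float}


def conformity_errors(raw_output: str) -> list[str]:
    """Return a list of format violations; an empty list means the output conforms."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    errors: list[str] = []
    for key, expected in REQUIRED_FIELDS.items():
        if key not in data:
            errors.append(f"missing field: {key}")
        elif not isinstance(data[key], expected):
            errors.append(f"wrong type for {key}: expected {expected.__name__}")
    # Strict mode: fields the consumer does not expect are also violations.
    extra = sorted(set(data) - set(REQUIRED_FIELDS))
    errors.extend(f"unexpected field: {key}" for key in extra)
    return errors
```

Returning every violation at once, rather than failing on the first, makes the Eval report more useful when a prompt needs to be corrected.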
Confirm that structured outputs (JSON schemas, XML blocks, Markdown templates) match the exact format required by downstream consumers.
CHANGELOG and Memory Triggers
CHANGELOG: If tests uncover a bug and it is fixed, the fix is recorded in the [Sin publicar] section of CHANGELOG.md as a Fixed entry. Bugs do not disappear silently: every fix is traceable.
Memory Trigger: If a test reveals that an assumption documented in docs/SPECIFICATIONS.md was incorrect, the assistant must immediately log the finding in memory.md under ## ⚠️ Lecciones Aprendidas. Incorrect assumptions that are corrected in code but not reflected in the spec will cause the same mistake to recur in future sessions.
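The Memory Trigger can be sketched as a small text transformation that adds a bullet under the lessons heading. The helper name, the bullet format, and the sample lesson are assumptions for illustration; the function works on a string so that reading and writing memory.md stays with the caller.

```python
HEADING = "## ⚠️ Lecciones Aprendidas"


def add_entry_under_heading(markdown: str, heading: str, entry: str) -> str:
    """Insert `- entry` directly below the first line equal to `heading`.

    If the heading is absent, it is appended at the end together with the entry.
    """
    lines = markdown.splitlines()
    bullet = f"- {entry}"
    for index, line in enumerate(lines):
        if line.strip() == heading:
            lines.insert(index + 1, bullet)
            break
    else:
        # Heading not found: add it, separated from earlier content by a blank line.
        if lines and lines[-1].strip():
            lines.append("")
        lines.extend([heading, bullet])
    return "\n".join(lines) + "\n"


# Hypothetical memory file content and lesson text.
memory = f"# Memory\n\n{HEADING}\n- Older lesson\n"
updated = add_entry_under_heading(
    memory, HEADING, "Spec assumed page numbering starts at 0; tests show it starts at 1."
)
```

The newest lesson lands directly under the heading, so the most recent correction is the first thing read in a later session.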