Testing is not optional in the SDD cycle — it is a phase gate. A task is not marked “Done” until it has passing tests. The /test phase extends this principle beyond classical software testing by unifying deterministic tests (for conventional code) with non-deterministic Evals (for AI components and generative pipelines). Both types of validation must be designed and executed before the work advances to /code-simplify.

Deterministic Tests vs. AI Evals

| Dimension | Deterministic Tests | AI Evals |
| --- | --- | --- |
| What is validated | Classical system behaviour | Intelligent / probabilistic behaviour |
| Output | Pass / Fail (binary) | Score against a rubric |
| Examples | Compilation, unit tests, integration tests | Output quality, hallucination detection |
| Tooling | Vitest, Pytest, Jest, etc. | Model judges, rubric scripts |
| Repeatability | Fully deterministic | Statistically reproducible |
| Trigger | Always | Only if the project includes AI/LLM components |

Deterministic Tests

Every task that touches production code must produce at minimum:
  • Compilation check — the project builds without errors or type violations.
  • Unit tests — individual functions and modules are tested in isolation.
  • Critical path coverage — the happy path and at least one failure path for every user-facing feature are covered.
“Tests are proof.” If a feature exists in code but has no test verifying its behaviour, it is not complete from the SDD perspective.
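As a rough illustration of the critical-path rule, the sketch below pairs one happy-path test with one failure-path test. The `apply_discount` function and its validation rule are hypothetical and exist only for this example; the test functions follow the `test_*` naming that Pytest discovers, but use plain assertions so the snippet runs without any dependency.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical production function, used only for illustration."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


def test_apply_discount_happy_path():
    # Happy path: a valid discount reduces the price.
    assert apply_discount(200.0, 25) == 150.0


def test_apply_discount_rejects_invalid_percent():
    # Failure path: an out-of-range percentage must raise, not return a price.
    try:
        apply_discount(200.0, 150)
    except ValueError:
        return
    raise AssertionError("expected ValueError for percent > 100")
```

Under this reading, a feature with only the first test would still be incomplete, because no failure path is covered.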

AI Evals (non-deterministic components)

If the project includes AI components, complex prompts, or generative pipelines, a dedicated Evals suite must be designed and executed. Evals cover four dimensions:

1. Output quality evaluation

Assess the response against a rubric: accuracy, tone, absence of invented facts, and JSON/XML format conformity. Rubrics can be enforced by automated scripts or “Model Judge” patterns (a second LLM evaluating the first LLM’s output).
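One way a rubric script could be structured is as a weighted checklist, sketched below. The criteria, weights, and the 0.75 pass threshold are all assumptions made for illustration; a real rubric would come from the project's specification, and a Model Judge could replace any of the simple `check` functions.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    name: str
    weight: float
    check: Callable[[str], bool]  # returns True when the output satisfies it


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def score_output(output: str, rubric: list[Criterion]) -> float:
    """Return the weighted share of criteria the output satisfies (0.0 to 1.0)."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(output))
    return earned / total


# Hypothetical rubric: names, weights, and checks are illustrative only.
RUBRIC = [
    Criterion("valid_json", 2.0, is_valid_json),
    Criterion("no_apology_filler", 1.0, lambda t: "sorry" not in t.lower()),
    Criterion("mentions_status", 1.0, lambda t: '"status"' in t),
]
PASS_THRESHOLD = 0.75  # assumed cut-off, not taken from the SDD docs

score = score_output('{"status": "ok"}', RUBRIC)
print(f"score={score:.2f} passed={score >= PASS_THRESHOLD}")
```

Unlike a deterministic test, the result here is a score compared against a threshold, which matches the "score against a rubric" output described in the comparison table.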

2. Tool trajectory evaluation

Audit the sequence of steps the agent took — not just whether the final result is correct. An agent that produces the right answer after 100 redundant tool calls or excessive token loops is not operating correctly. Trajectory Evals verify that the agent’s path through its tools is efficient and logical.
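A trajectory check could be as simple as the sketch below, which flags a blown call budget and exact repeated calls. The `ToolCall` record, the `max_calls` budget, and the tool names are hypothetical; real agent frameworks expose their traces in their own formats.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen makes instances hashable, so Counter can count them
class ToolCall:
    tool: str
    arguments: str  # serialized arguments, kept as a string for simplicity


def evaluate_trajectory(calls: list[ToolCall], max_calls: int) -> list[str]:
    """Return the problems found; an empty list means the trajectory passes."""
    problems = []
    if len(calls) > max_calls:
        problems.append(f"{len(calls)} tool calls exceeds budget of {max_calls}")
    for call, count in Counter(calls).items():
        if count > 1:
            problems.append(f"{call.tool}({call.arguments}) repeated {count} times")
    return problems


# Hypothetical trace: the agent read the same file twice before running tests.
trajectory = [
    ToolCall("read_file", "docs/SPECIFICATIONS.md"),
    ToolCall("read_file", "docs/SPECIFICATIONS.md"),
    ToolCall("run_tests", ""),
]
print(evaluate_trajectory(trajectory, max_calls=5))
```

This only catches identical repeats and raw call counts. Judging whether a path is logical usually needs a richer rule set or a Model Judge.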

3. Hallucination detection

Verify that the AI has not invented API endpoints, library names, data fields, or factual claims that do not exist in the codebase or specification. This is especially important when the AI generates code that references external services.
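For invented API endpoints specifically, a minimal check might compare every path referenced in generated code against the set the specification actually defines. In this sketch the `/api/...` path convention, the endpoint names, and the `KNOWN_ENDPOINTS` set are assumptions; in practice the set would be derived from the spec or an API schema.

```python
import re

# Hypothetical allowlist; in a real project this would be loaded from the spec.
KNOWN_ENDPOINTS = {"/api/users", "/api/orders"}

# Matches paths such as /api/users or /api/invoices/export.
ENDPOINT_PATTERN = re.compile(r"/api/[A-Za-z0-9_/-]+")


def find_unknown_endpoints(generated_code: str, known: set[str]) -> set[str]:
    """Return endpoints referenced in the code that the spec does not define."""
    referenced = set(ENDPOINT_PATTERN.findall(generated_code))
    return referenced - known


generated = '''
resp = client.get("/api/users")
resp = client.post("/api/invoices/export")
'''
print(find_unknown_endpoints(generated, KNOWN_ENDPOINTS))
# prints {'/api/invoices/export'}
```

The same allowlist idea extends to library names and data fields, although those typically need a parser rather than a regular expression.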

4. Format conformity

Confirm that structured outputs (JSON schemas, XML blocks, Markdown templates) match the exact format required by downstream consumers.
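A format-conformity Eval for JSON output could look like the following sketch, which uses only the standard library. The `REQUIRED_FIELDS` contract is hypothetical; a project with a published JSON Schema would more likely validate against that schema with a dedicated validator.

```python
import json

# Hypothetical contract expected by a downstream consumer.
REQUIRED_FIELDS = {"title": str, "status": str, "tags": list}


def check_format(raw: str) -> list[str]:
    """Return format violations; an empty list means the output conforms."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]

    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            errors.append(f"{field} must be {expected.__name__}")
    # Exact-format consumers often reject fields they do not recognise.
    extra = sorted(set(data) - set(REQUIRED_FIELDS))
    errors.extend(f"unexpected field: {name}" for name in extra)
    return errors


print(check_format('{"title": "Fix login", "tags": "auth"}'))
# prints ['missing field: status', 'tags must be list']
```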

CHANGELOG and Memory Triggers

CHANGELOG: If tests uncover a bug and it is fixed, the fix is recorded in the [Sin publicar] (“Unreleased”) section of CHANGELOG.md as a Fixed entry. Bugs do not disappear silently; every fix is traceable.

Memory Trigger: If a test reveals that an assumption documented in docs/SPECIFICATIONS.md was incorrect, the assistant must immediately log the finding in memory.md under ## ⚠️ Lecciones Aprendidas (“Lessons Learned”). Incorrect assumptions that are corrected in code but not reflected in the spec will cause the same mistake to recur in future sessions.
Never skip the Memory Trigger for a failed assumption. The whole value of memory.md comes from capturing decisions and corrections at the moment they occur, not reconstructed from memory later.
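The CHANGELOG rule could be automated along the lines of the sketch below. It assumes a Keep a Changelog style layout in which `[Sin publicar]` is a level-2 heading and fixes live under a `### Fixed` subheading; the heading levels, the sample entries, and the `add_fixed_entry` helper are assumptions rather than something the SDD docs prescribe.

```python
def add_fixed_entry(changelog: str, entry: str) -> str:
    """Insert a bullet under '### Fixed' in the '## [Sin publicar]' section."""
    lines = changelog.splitlines()
    try:
        start = lines.index("## [Sin publicar]")
    except ValueError:
        raise ValueError("CHANGELOG has no '## [Sin publicar]' section") from None

    # The section ends at the next level-2 heading, or at the end of the file.
    end = next(
        (i for i in range(start + 1, len(lines)) if lines[i].startswith("## ")),
        len(lines),
    )
    section = lines[start + 1:end]

    if "### Fixed" in section:
        # Add the new bullet directly below the existing subheading.
        insert_at = start + 1 + section.index("### Fixed") + 1
        new_lines = [f"- {entry}"]
    else:
        # No Fixed subheading yet: create one at the top of the section.
        insert_at = start + 1
        new_lines = ["", "### Fixed", f"- {entry}"]

    lines[insert_at:insert_at] = new_lines
    return "\n".join(lines) + "\n"


# Hypothetical changelog content, for illustration only.
sample = "# Changelog\n\n## [Sin publicar]\n\n### Fixed\n- Old fix\n\n## [1.0.0]\n- Initial release\n"
print(add_fixed_entry(sample, "Reject discounts above 100 percent"))
```

The helper works on strings rather than files so that it can be covered by a deterministic test of its own.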
