
## Evaluation-First Development

n8n-skills uses an Evaluation-Driven Development (EDD) approach: evaluations are written before the skill, not after. This ensures every skill solves real, measurable problems.
<Tip>
  Write your evaluation scenarios before writing a single line of SKILL.md. Evaluations define what “done” looks like.
</Tip>
The full cycle for a new skill:
1. Create 3+ evaluation scenarios
2. Test baseline (without skill) to confirm the problem exists
3. Write minimal SKILL.md
4. Test against evaluations
5. Iterate until 100% pass
6. Add reference files as needed
Why this works: It ensures skills solve real problems and can be tested objectively, rather than optimizing for content that sounds good but doesn’t measurably improve AI behavior.
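As a sketch, here is what the cycle leaves behind for a hypothetical n8n-date-formatting skill; the skill name and reference file are invented, while the `skills/` and `evaluations/` directories follow the project layout:

```text
skills/
  n8n-date-formatting/
    SKILL.md                        # step 3: minimal skill content
    references/
      edge-cases.md                 # step 6: added only once evaluations pass
evaluations/
  n8n-date-formatting/
    eval-001-basic-usage.json       # step 1: written before SKILL.md
    eval-002-common-mistake.json
    eval-003-advanced-scenario.json
```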

## Evaluation File Format

Each evaluation is a JSON file in `evaluations/[skill-name]/`. The filename follows the pattern `eval-NNN-kebab-case-description.json`.
```json
{
  "id": "skill-001",
  "skills": ["skill-name"],
  "query": "User question or scenario",
  "expected_behavior": [
    "Skill should identify X",
    "Skill should provide Y guidance",
    "Skill should reference Z content"
  ],
  "baseline_without_skill": {
    "likely_response": "Generic answer",
    "expected_quality": "Low"
  },
  "with_skill_expected": {
    "response_quality": "High",
    "uses_skill_content": true,
    "provides_correct_guidance": true
  }
}
```
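As an illustration, a minimal Node.js script like the following (a hypothetical helper, not part of the repository; assumes Node 20+ for recursive `readdirSync`) can confirm that every evaluation file carries the required top-level fields:

```javascript
// check-evals.js — hypothetical helper, not part of the n8n-skills repo.
// Sanity-checks that every evaluation file has the required top-level fields.
const fs = require("fs");
const path = require("path");

const REQUIRED = ["id", "skills", "query", "expected_behavior"];
const root = process.argv[2] || "evaluations";

for (const entry of fs.readdirSync(root, { recursive: true })) {
  if (!entry.endsWith(".json")) continue;
  const evalData = JSON.parse(fs.readFileSync(path.join(root, entry), "utf8"));
  const missing = REQUIRED.filter((field) => !(field in evalData));
  console.log(
    missing.length === 0
      ? `PASS ${entry}`
      : `FAIL ${entry} (missing: ${missing.join(", ")})`
  );
}
```

Run it as `node check-evals.js evaluations` from the repository root.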

### Real examples

Here is an evaluation file from the expression-syntax skill that illustrates good evaluation design:
```json
{
  "id": "expr-001",
  "skills": ["n8n-expression-syntax"],
  "query": "I'm trying to access an email field in my n8n Slack node using $json.email but it's showing as literal text '$json.email' in the message. What's wrong?",
  "expected_behavior": [
    "Identifies missing curly braces around the expression",
    "Explains that n8n expressions must be wrapped in {{ }}",
    "Provides the corrected expression: {{$json.email}}",
    "Explains that without braces, it's treated as literal text",
    "References expression format documentation from SKILL.md"
  ],
  "baseline_without_skill": {
    "likely_response": "May suggest general JavaScript or template syntax, might not know n8n-specific {{ }} requirement",
    "expected_quality": "Low - lacks n8n-specific knowledge about expression syntax"
  },
  "with_skill_expected": {
    "response_quality": "High - precise fix with n8n-specific guidance",
    "uses_skill_content": true,
    "provides_correct_syntax": true,
    "explains_why_it_failed": true
  }
}
```

## How Many Evaluations?

Every skill needs a minimum of 3 evaluations. The existing skills in the project follow this coverage pattern:
| Skill | Evaluations |
| --- | --- |
| expression-syntax | 4 |
| mcp-tools | 5 |
| code-javascript | varies |
| code-python | varies |
| node-configuration | varies |
| validation-expert | varies |
| workflow-patterns | varies |
Aim for at least:
  1. Basic usage — the most common trigger query
  2. Common mistake — a specific error the skill should help fix
  3. Advanced scenario — a more complex or edge-case query
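For a hypothetical n8n-date-formatting skill, those three tiers might translate into queries like these (all illustrative):

```json
[
  { "id": "date-001", "query": "How do I format a date in n8n?" },
  { "id": "date-002", "query": "Why does $now.format('YYYY-MM-DD') throw an error in my Set node?" },
  { "id": "date-003", "query": "How do I convert a Unix timestamp from a webhook into the user's local timezone across DST boundaries?" }
]
```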

## Running Evaluations Manually

There is no automated evaluation runner yet. Test each scenario by hand:
1. **Start Claude Code.** Launch Claude Code with the skill loaded from your `skills/` directory.
2. **Ask the evaluation query.** Copy the `query` field from the evaluation JSON exactly as written and send it.
3. **Check expected behaviors.** Go through each item in `expected_behavior` and verify whether the response satisfies it. Be specific: vague confirmation does not count.
4. **Document results.** Mark each behavior as PASS or FAIL. A scenario only passes when every expected behavior is present.
5. **Iterate if needed.** If any behaviors fail, update SKILL.md to address the gap and re-run. Repeat until 100% of scenarios pass.
```bash
# Manual test via CLI (if testing framework available)
npm test

# Or manually with Claude
claude-code --skill n8n-expression-syntax "Test webhook data access"
```
<Note>
  An automated evaluation framework is planned for a future release. Until then, manual testing against evaluation JSON files is the standard process.
</Note>
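Until that framework exists, a small helper script, sketched below with hypothetical names, can turn one skill's evaluation files into a printable checklist for the manual run:

```javascript
// eval-checklist.js — hypothetical helper; turns one skill's evaluation
// files into a printable manual-test checklist.
const fs = require("fs");
const path = require("path");

const skillDir = process.argv[2]; // e.g. evaluations/n8n-expression-syntax

for (const file of fs.readdirSync(skillDir).filter((f) => f.endsWith(".json"))) {
  const evalData = JSON.parse(fs.readFileSync(path.join(skillDir, file), "utf8"));
  console.log(`\n=== ${evalData.id}: paste this query into Claude Code ===`);
  console.log(evalData.query);
  console.log("Verify each behavior and mark PASS or FAIL:");
  for (const behavior of evalData.expected_behavior) {
    console.log(`  [ ] ${behavior}`);
  }
}
```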

## What Makes a Good Evaluation?

Characteristics of effective evaluations:
- Specific, measurable expected behaviors (not “gives a good answer”)
- Based on real user queries that have actually been seen
- Covers both common and edge cases
- Includes a `baseline_without_skill` that shows what a generic response would miss
- Each `expected_behavior` item is independently verifiable
Example of a specific, measurable behavior:
"expected_behavior": [
  "Provides the corrected expression: {{$json.body.name}}",
  "Explains the webhook node output structure",
  "Warns this is a CRITICAL gotcha specific to webhook nodes"
]
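By contrast, behaviors like the following cannot be independently verified and should be rewritten:

```json
"expected_behavior": [
  "Gives a good answer",
  "Is helpful about expressions"
]
```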

## Test Quality Criteria

Before considering a skill complete, confirm all of the following:
- All evaluations pass (every `expected_behavior` item verified)
- Skill activates correctly on the trigger query
- Content in the response is accurate
- All code examples in the response actually work
- Baseline comparison confirms meaningful improvement over the no-skill response
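One lightweight way to capture the PASS/FAIL record per scenario is a small JSON log; this format is a suggestion, not something the project mandates:

```json
{
  "eval": "expr-001",
  "run_date": "2025-01-15",
  "behaviors": [
    { "behavior": "Identifies missing curly braces around the expression", "result": "PASS" },
    { "behavior": "Explains that n8n expressions must be wrapped in {{ }}", "result": "FAIL" }
  ],
  "scenario_result": "FAIL"
}
```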

## MCP Tool Testing

Before writing any skill content, test the relevant MCP tools and record real responses. This ensures the skill content is grounded in actual tool behavior. Document findings in `docs/MCP_TESTING_LOG.md`:
````markdown
## [Your Skill Name] - MCP Testing

### Tool: tool_name

**Test**:
```javascript
tool_name({param: "value"})
```

**Response**:
```
{actual response}
```

**Key Insights**:
- Finding 1
- Finding 2
````

Record:
- Actual tool responses (copy verbatim)
- Performance timings
- Gotchas discovered
- Format differences between tool modes
- Real error messages returned
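To keep entries consistent, a helper along these lines could emit the template above; this is a hypothetical sketch, and the function and field names are invented for illustration:

```javascript
// append-mcp-log.js — hypothetical helper for formatting a log entry.
const fs = require("fs");

function appendLogEntry({ skill, tool, testCall, response, insights }) {
  const entry = [
    `\n## ${skill} - MCP Testing`,
    `\n### Tool: ${tool}`,
    "\n**Test**:",
    "```javascript",
    testCall,
    "```",
    "\n**Response**:",
    "```",
    JSON.stringify(response, null, 2),
    "```",
    "\n**Key Insights**:",
    ...insights.map((finding) => `- ${finding}`),
  ].join("\n");
  fs.appendFileSync("docs/MCP_TESTING_LOG.md", entry + "\n");
}
```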

See `docs/MCP_TESTING_LOG.md` in the repository for the full log of MCP testing performed for the existing 7 skills.

---

## Iterating to 100%

It is expected that the first version of a SKILL.md will not pass all evaluations. The process is iterative:

1. Run all evaluations and record which behaviors pass or fail
2. Identify patterns — are failures concentrated in one section?
3. Update the relevant section of SKILL.md (or add a reference file)
4. Re-run the failed evaluations
5. Continue until every scenario passes every expected behavior
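An iteration history for a hypothetical skill might look like this (illustrative only):

```text
v1: 2/4 scenarios pass (failures cluster on webhook data access)
    -> add a "Webhook output structure" section to SKILL.md
v2: 3/4 scenarios pass (advanced scenario still misses an edge case)
    -> add a reference file covering the edge case
v3: 4/4 scenarios pass: done
```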

<Warning>
  Do not submit a skill with failing evaluations. The 100% pass rate requirement is not a target — it is the definition of done.
</Warning>
