scripts/quality_scorer.py. The scoring system enforces the optimized agent format and ensures consistent quality across the catalog.
Scoring Dimensions
Each dimension is scored 1-5. Higher is better.
1. Frontmatter
What it measures: Presence of required metadata fields.
| Score | Criteria |
|---|---|
| 5 | All 3 fields present: description, mode, permission block |
| 3 | 2 out of 3 fields present |
| 1 | Fewer than 2 fields |
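The table above can be sketched as a small function. This is only an illustration, not the actual code in `quality_scorer.py`: the name `score_frontmatter` and the assumption that frontmatter arrives as an already-parsed dictionary are hypothetical.

```python
REQUIRED_FIELDS = ("description", "mode", "permission")


def score_frontmatter(fields: dict) -> int:
    """Score the Frontmatter dimension (hypothetical helper, not the real API).

    `fields` is assumed to be the parsed frontmatter; a field counts as
    present when its key exists with a non-empty value.
    """
    present = sum(1 for name in REQUIRED_FIELDS if fields.get(name))
    if present == 3:
        return 5
    if present == 2:
        return 3
    return 1
```

Treating an empty value as missing is a design choice made for this sketch; the real scorer may only check that the key exists.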
2. Identity
What it measures: Unheaded paragraph between frontmatter and first `##` heading.
| Score | Criteria |
|---|---|
| 5 | 50-300 words |
| 3 | 30-400 words |
| 2 | More than 0 words (outside range) |
| 1 | Empty or missing |
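As a rough sketch of this dimension, the snippet below extracts the identity paragraph and maps its word count to a score. It assumes the frontmatter is delimited by `---` lines and that words are whitespace-separated; both are assumptions, and `extract_identity` and `score_identity` are hypothetical names.

```python
import re


def extract_identity(markdown: str) -> str:
    """Return the text between the frontmatter and the first '## ' heading."""
    # Assumption: frontmatter is a leading block delimited by '---' lines.
    body = re.sub(r"\A---\n.*?\n---\n", "", markdown, count=1, flags=re.DOTALL)
    heading = re.search(r"^## ", body, flags=re.MULTILINE)
    return (body[: heading.start()] if heading else body).strip()


def score_identity(paragraph: str) -> int:
    """Map the identity word count to a 1-5 score per the table above."""
    words = len(paragraph.split())
    if 50 <= words <= 300:
        return 5
    if 30 <= words <= 400:
        return 3
    if words > 0:
        return 2
    return 1
```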
3. Decisions
What it measures: `## Decisions` section with structured IF/THEN logic.
| Score | Criteria |
|---|---|
| 5 | 5+ decision rules (IF/THEN/ELIF/ELSE keywords or patterns) |
| 3 | 2-4 decision rules |
| 2 | Section exists but fewer than 2 rules |
| 1 | Section missing |
How rules are detected:
- Line-based: `IF`, `THEN`, `ELIF`, `ELSE` as whole words
- Inline patterns: `IF x → THEN y`
- Case-insensitive
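One way the keyword detection could work is sketched below. The counting rule used here (each line containing at least one keyword counts as one rule) is an assumption, since the exact rule is not stated; `count_decision_rules` and `score_decisions` are hypothetical names.

```python
import re
from typing import Optional

# Whole-word, case-insensitive keyword match.
KEYWORD = re.compile(r"\b(?:IF|THEN|ELIF|ELSE)\b", re.IGNORECASE)


def count_decision_rules(section: str) -> int:
    """Assumption: every line with at least one keyword is one rule."""
    return sum(1 for line in section.splitlines() if KEYWORD.search(line))


def score_decisions(section: Optional[str]) -> int:
    """Map a Decisions section (None when missing) to a 1-5 score."""
    if section is None:
        return 1
    rules = count_decision_rules(section)
    if rules >= 5:
        return 5
    if rules >= 2:
        return 3
    return 2
```

Because matching is case-insensitive, ordinary prose containing "if" or "then" also counts under this sketch, which is worth keeping in mind when reading scores.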
4. Examples
What it measures: `## Examples` section with fenced code blocks.
| Score | Criteria |
|---|---|
| 5 | 3+ code blocks |
| 4 | 2 code blocks |
| 3 | 1 code block |
| 2 | Section exists but no code blocks |
| 1 | Section missing |
Code blocks are counted by their triple-backtick fences (opening + closing).
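A minimal sketch of fence counting, assuming each code block contributes one opening and one closing fence line, so the block count is the number of fence lines divided by two. The helper names are hypothetical.

```python
from typing import Optional

FENCE = "`" * 3  # triple backtick, built this way to keep this listing readable


def count_code_blocks(section: str) -> int:
    """Count fenced blocks as (number of fence lines) // 2."""
    fence_lines = sum(
        1 for line in section.splitlines() if line.lstrip().startswith(FENCE)
    )
    return fence_lines // 2


def score_examples(section: Optional[str]) -> int:
    """Map an Examples section (None when missing) to a 1-5 score."""
    if section is None:
        return 1
    blocks = count_code_blocks(section)
    if blocks >= 3:
        return 5
    if blocks == 2:
        return 4
    if blocks == 1:
        return 3
    return 2
```

An unclosed fence is ignored by the integer division, which is one reason a block can be visible in the file and still not be counted.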
Example (5/5):
Discriminated union
Type guard
6. Conciseness
What it measures: Body line count and filler phrase density.
| Score | Criteria |
|---|---|
| 5 | 70-120 lines, ≤3% filler |
| 4 | 50-150 lines, ≤8% filler |
| 3 | 40-200 lines, ≤15% filler |
| 2 | Outside range but ≥30 lines |
| 1 | Fewer than 30 lines |
Filler phrases:
- “it is important”
- “note that”
- “please ensure”
- “keep in mind”
- “remember to”
- “as mentioned”
- “in order to”
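The sketch below shows one plausible reading of this dimension. How filler density is computed is not stated here, so the formula (phrase occurrences divided by body line count, as a percentage) is an assumption, as is scoring 2 for any body of at least 30 lines that misses the higher tiers.

```python
FILLER_PHRASES = (
    "it is important",
    "note that",
    "please ensure",
    "keep in mind",
    "remember to",
    "as mentioned",
    "in order to",
)


def filler_density(body: str) -> float:
    """Assumed definition: filler phrase occurrences per 100 body lines."""
    lines = body.splitlines()
    if not lines:
        return 0.0
    text = body.lower()
    hits = sum(text.count(phrase) for phrase in FILLER_PHRASES)
    return 100.0 * hits / len(lines)


def score_conciseness(line_count: int, filler_pct: float) -> int:
    """Map line count and filler percentage to a 1-5 score."""
    if 70 <= line_count <= 120 and filler_pct <= 3:
        return 5
    if 50 <= line_count <= 150 and filler_pct <= 8:
        return 4
    if 40 <= line_count <= 200 and filler_pct <= 15:
        return 3
    if line_count >= 30:
        return 2
    return 1
```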
7. No Banned Sections
What it measures: Absence of old format headings.
| Score | Criteria |
|---|---|
| 5 | No banned sections |
| 3 | 1 banned section |
| 1 | 2+ banned sections |
Banned headings (at any of the levels `#`, `##`, `###`):
- Workflow
- Tools
- Anti-patterns
- Collaboration
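Heading detection for this dimension might look like the following sketch. It assumes a heading is banned only when its full title matches one of the four names, compared case-insensitively; the real scorer may match differently, and the function names are hypothetical.

```python
import re

BANNED = ("Workflow", "Tools", "Anti-patterns", "Collaboration")

# Headings at levels 1-3: one to three '#' characters, a space, then the title.
HEADING = re.compile(r"^#{1,3}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def count_banned_sections(markdown: str) -> int:
    """Count level 1-3 headings whose title is a banned name (assumed exact match)."""
    banned = {name.lower() for name in BANNED}
    return sum(1 for title in HEADING.findall(markdown) if title.lower() in banned)


def score_banned_sections(markdown: str) -> int:
    """Map the banned heading count to a 1-5 score."""
    count = count_banned_sections(markdown)
    if count == 0:
        return 5
    if count == 1:
        return 3
    return 1
```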
8. Version Pinning
What it measures: Version numbers or years in the identity paragraph.
| Score | Criteria |
|---|---|
| 5 | Both version and year present |
| 4 | Either version or year present |
| 2 | Neither present |
- Versions: `5.x`, `3.11+`, `v2`, `>=4.0`, `~=1.2`
- Years: 2020-2039 (four-digit years)
Some agents are not tied to a specific technology version (for example `prd`, `scrum-master`), so absence scores 2, not 1.
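The regular expressions below are a sketch that covers the version and year forms listed for this dimension. They are written for illustration and are not taken from `quality_scorer.py`; among other gaps, any decimal number such as `4.59` would also count as a version.

```python
import re

# Covers forms like 5.x, 3.11+, v2, >=4.0 and ~=1.2 (illustrative, not exhaustive).
VERSION = re.compile(
    r"(?:>=|~=)\s*\d+(?:\.\d+)*"   # >=4.0, ~=1.2
    r"|\bv\d+(?:\.\d+)*\b"         # v2, v2.1
    r"|\b\d+\.(?:x|\d+)\+?"        # 5.x, 3.11, 3.11+
)

# Four-digit years from 2020 through 2039.
YEAR = re.compile(r"\b20[23]\d\b")


def score_version_pinning(identity: str) -> int:
    """Map version and year presence in the identity paragraph to a score."""
    has_version = bool(VERSION.search(identity))
    has_year = bool(YEAR.search(identity))
    if has_version and has_year:
        return 5
    if has_version or has_year:
        return 4
    return 2
```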
Overall Score and Pass Criteria
The overall score is the mean of all 8 dimensions, rounded to 2 decimals. Pass criteria (both must be true):
- Overall score ≥ 3.5
- No dimension < 2
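The aggregation and pass check can be sketched in a few lines. The dictionary keys used in the usage example are hypothetical names for the 8 dimensions.

```python
def overall_score(scores: dict) -> float:
    """Mean of all dimension scores, rounded to 2 decimals."""
    return round(sum(scores.values()) / len(scores), 2)


def passes(scores: dict) -> bool:
    """Both criteria must hold: overall >= 3.5 and no dimension below 2."""
    return overall_score(scores) >= 3.5 and min(scores.values()) >= 2


# Usage: seven dimensions at 5 and one at 4 give 39 / 8 = 4.875, reported as 4.88.
example = {
    "frontmatter": 5, "identity": 5, "decisions": 5, "examples": 5,
    "quality_gate": 5, "conciseness": 5, "no_banned_sections": 5,
    "version_pinning": 4,
}
print(overall_score(example), passes(example))
```

The second criterion matters on its own: seven 5s and a single 1 average 4.5, yet the agent still fails.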
Score Labels
| Label | Range |
|---|---|
| Excellent | ≥ 4.5 |
| Good | 3.5 - 4.49 |
| Needs improvement | 2.5 - 3.49 |
| Poor | < 2.5 |
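The label table maps directly to a threshold chain. The function name `score_label` is hypothetical.

```python
def score_label(overall: float) -> str:
    """Return the label for an overall score that is already rounded to 2 decimals."""
    if overall >= 4.5:
        return "Excellent"
    if overall >= 3.5:
        return "Good"
    if overall >= 2.5:
        return "Needs improvement"
    return "Poor"
```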
Running the Scorer
Single agent
Multiple agents
Exit codes:
- `0` if all agents pass
- `1` if any agent fails
Batch scoring
Regenerating the README score tables performs these steps:
- Scans all agents in `agents/`
- Scores each with `quality_scorer.py`
- Regenerates score tables in `README.md` and `README.en.md`
- Preserves content outside `<!-- SCORES:BEGIN -->` / `<!-- SCORES:END -->` markers
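Marker-based regeneration can be sketched as a single substitution that touches only the text between the two markers. This assumes one marker pair per file, and the function name `replace_scores` is hypothetical; the actual batch script may be structured differently.

```python
import re

BEGIN = "<!-- SCORES:BEGIN -->"
END = "<!-- SCORES:END -->"


def replace_scores(readme: str, table: str) -> str:
    """Swap the text between the markers for `table`, leaving the rest untouched."""
    pattern = re.compile(re.escape(BEGIN) + r".*?" + re.escape(END), re.DOTALL)
    replacement = f"{BEGIN}\n{table}\n{END}"
    # A callable replacement keeps backslashes in the table from being interpreted.
    new_text, count = pattern.subn(lambda _match: replacement, readme, count=1)
    if count == 0:
        raise ValueError("score markers not found")
    return new_text
```

Raising when the markers are missing, rather than appending a table, avoids silently duplicating content in a README that was edited by hand.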
Catalog Statistics
Current quality metrics (69 agents):
- Average score: 4.59/5
- Pass rate: 100%
- Excellent: 49 agents (≥ 4.5)
- Good: 20 agents (3.5 - 4.49)
- Needs improvement: 0 agents
- Poor: 0 agents
The following agents all score 4.88/5: llm-architect, golang-pro, java-architect, kotlin-specialist, php-pro, python-pro, rails-expert, rust-pro, swift-expert, typescript-pro, mcp-developer.
Adding New Agents
When creating or modifying an agent:
1. Write the agent following the optimized format:
   - Frontmatter with `description`, `mode`, `permission`
   - Identity paragraph (50-300 words)
   - `## Decisions` with IF/THEN rules
   - `## Examples` with code blocks
   - `## Quality Gate` with validation criteria
2. Score the agent:
   - Iterate until score ≥ 3.5 with no dimension < 2
3. Regenerate README scores:
   - Commit agent file and updated READMEs together
Common Issues
Low identity score
Problem: Identity paragraph too short or too long.
Fix: Aim for 50-300 words. Include role, expertise level, version context, and focus areas.
Low decisions score
Problem: Decisions section lacks structured rules.
Fix: Use explicit IF/THEN patterns, for example `IF x → THEN y`.
Low examples score
Problem: Fewer than 2 code blocks.
Fix: Add 2-3 fenced code examples showing before/after or common patterns.
Low conciseness score
Problem: Too many lines or high filler density.
Fix: Cut generic advice. Remove phrases like “it is important to note that”. Aim for 70-120 lines.
Banned sections detected
Problem: Old format headings (Workflow, Tools, Anti-patterns, Collaboration).
Fix: Remove those sections. Move relevant content to Decisions or Examples.
Why These 8 Dimensions?
The scoring system enforces the optimized agent format developed through iterative refinement:
- Frontmatter ensures discoverability and permission safety
- Identity establishes expertise and context
- Decisions provides structured, actionable rules (IF/THEN trees)
- Examples shows concrete application
- Quality Gate defines success criteria
- Conciseness prevents bloat and generic advice
- No Banned Sections removes old format cruft
- Version Pinning keeps advice current
Related
- Architecture — System design and data flow
- Permissions — Control agent access