Quality Scoring System

Every agent in the registry is evaluated on 8 quality dimensions using scripts/quality_scorer.py. The scoring system enforces the optimized agent format and ensures consistent quality across the catalog.

Scoring Dimensions

Each dimension is scored 1-5. Higher is better.

1. Frontmatter

What it measures: Presence of required metadata fields.

Score	Criteria
5	All 3 fields present: `description`, `mode`, `permission` block
3	2 out of 3 fields present
1	Fewer than 2 fields

Example (5/5):

---
description: Expert TypeScript developer specializing in type safety
mode: code
permission:
  read: allow
  write: allow
  edit: allow
  bash:
    '*': ask
  glob: allow
---
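The table above can be sketched in a few lines of Python. This is a minimal illustration, not the code in scripts/quality_scorer.py: the function name `score_frontmatter` is hypothetical, and the sketch assumes that only top-level (unindented) keys count as fields.

```python
REQUIRED_FIELDS = ("description", "mode", "permission")


def score_frontmatter(text):
    """Score 5/3/1 by how many required top-level keys the frontmatter has."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return 1  # no frontmatter block at all
    present = set()
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the frontmatter block
        # Assumption: only unindented "key:" lines are top-level fields.
        if line and not line[0].isspace() and ":" in line:
            present.add(line.split(":", 1)[0].strip())
    found = sum(1 for field in REQUIRED_FIELDS if field in present)
    if found == 3:
        return 5
    if found == 2:
        return 3
    return 1
```

Nested keys such as `read` or `bash` are ignored, so a `permission` block counts once regardless of how many rules it contains.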

2. Identity

What it measures: Unheaded paragraph between frontmatter and first ## heading.

Score	Criteria
5	50-300 words
3	30-400 words
2	More than 0 words (outside range)
1	Empty or missing

Why it matters: Identity establishes role, expertise level, and context. Too short lacks substance; too long is verbose.

Example (5/5):

---
# frontmatter
---

You are an expert TypeScript engineer with deep knowledge of the
TypeScript 5.x type system, focusing on type safety, inference,
and compile-time correctness. You prioritize strict mode, exhaustive
type checking, and minimal use of `any`. You stay current with
TypeScript 5.x features (2024+) and recommend modern patterns
over legacy workarounds in every review.

## Decisions
...
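A rough sketch of the identity check, assuming words are counted by splitting on whitespace and that the identity is everything between the closing frontmatter `---` and the first `## ` heading. The function names are hypothetical; the real scorer may tokenize differently.

```python
def identity_paragraph(text):
    """Return the unheaded text between the frontmatter and the first '## ' heading."""
    lines = text.splitlines()
    start = 0
    if lines and lines[0].strip() == "---":
        # Skip past the closing '---' of the frontmatter, if there is one.
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                start = i + 1
                break
    body = []
    for line in lines[start:]:
        if line.startswith("## "):
            break
        body.append(line)
    return "\n".join(body).strip()


def score_identity(text):
    words = len(identity_paragraph(text).split())
    if 50 <= words <= 300:
        return 5
    if 30 <= words <= 400:
        return 3  # close to the target range: 30-49 or 301-400 words
    if words > 0:
        return 2
    return 1
```

Because the tiers are checked in order, the 3-point tier effectively covers 30-49 and 301-400 words.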

3. Decisions

What it measures: ## Decisions section with structured IF/THEN logic.

Score	Criteria
5	5+ decision rules (IF/THEN/ELIF/ELSE keywords or patterns)
3	2-4 decision rules
2	Section exists but fewer than 2 rules
1	Section missing

Detection:

Line-based: IF, THEN, ELIF, ELSE as whole words
Inline patterns: IF x → THEN y
Case-insensitive

Example (5/5):

## Decisions

- IF user code uses `any`, THEN suggest strict type or generic
- IF function lacks return type, THEN add explicit annotation
- IF type can be inferred, THEN omit annotation (don't over-annotate)
- IF union type is complex (3+ members), THEN extract to type alias
- IF code uses `as` cast, THEN validate necessity or use type guard
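The detection rules can be approximated with one regular expression. This sketch assumes each line in the section that contains at least one keyword counts as one rule; how scripts/quality_scorer.py actually counts rules is not stated here, and the helper names are hypothetical.

```python
import re

# Whole-word, case-insensitive keywords, as listed under "Detection".
KEYWORD = re.compile(r"\b(?:IF|THEN|ELIF|ELSE)\b", re.IGNORECASE)


def section_lines(text, title):
    """Return the lines under '## <title>' up to the next '## ' heading, or None."""
    collected = None
    for line in text.splitlines():
        if line.startswith("## "):
            if collected is not None:
                break  # reached the next section
            if line[3:].strip().lower() == title.lower():
                collected = []
            continue
        if collected is not None:
            collected.append(line)
    return collected


def score_decisions(text):
    lines = section_lines(text, "Decisions")
    if lines is None:
        return 1  # section missing
    # Assumption: every line with at least one keyword is one decision rule.
    rules = sum(1 for line in lines if KEYWORD.search(line))
    if rules >= 5:
        return 5
    if rules >= 2:
        return 3
    return 2
```

A case-insensitive match also catches an ordinary "if" in prose, so a real implementation may need a stricter pattern.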

4. Examples

What it measures: ## Examples section with fenced code blocks.

Score	Criteria
5	3+ code blocks
4	2 code blocks
3	1 code block
2	Section exists but no code blocks
1	Section missing
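One way to implement the table above is to count fence lines inside the section and halve the total, since one code block is an opening plus a closing fence. This is an illustrative sketch with a hypothetical function name, not the scorer's actual code; it assumes fences are three backticks and that an unpaired fence does not count.

```python
FENCE = "`" * 3  # three backticks, built indirectly so this sketch nests cleanly


def score_examples(text):
    in_section = False  # True once the '## Examples' heading has been seen
    in_fence = False
    fences = 0
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            if in_section:
                fences += 1
            in_fence = not in_fence
            continue
        # A '## ' line inside a code block (e.g. a shell comment) is not a heading.
        if not in_fence and line.startswith("## "):
            if in_section:
                break  # next section starts, stop counting
            in_section = stripped[3:].strip().lower() == "examples"
    if not in_section:
        return 1  # section missing
    blocks = fences // 2  # one block = opening fence + closing fence
    if blocks >= 3:
        return 5
    if blocks == 2:
        return 4
    if blocks == 1:
        return 3
    return 2
```

Tracking the fence state matters because `###` subheadings and code comments inside the section should not end it early.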

Code block = pair of ``` fences (opening + closing).

Example (5/5):

## Examples

### Strict null check

```typescript
// Before
function getUser(id: string) {
  return users.find(u => u.id === id);
}

// After
function getUser(id: string): User | undefined {
  return users.find(u => u.id === id);
}
```

### Discriminated union

```typescript
type Result<T> =
  | { ok: true; value: T }
  | { ok: false; error: string };

function handleResult<T>(result: Result<T>) {
  if (result.ok) {
    console.log(result.value);
  } else {
    console.error(result.error);
  }
}
```

### Type guard

```typescript
function isString(x: unknown): x is string {
  return typeof x === 'string';
}
```

5. Quality Gate

What it measures: ## Quality Gate section with validation criteria.

Score	Criteria
5	5+ bullet points
4	3-4 bullet points
3	1-2 bullet points
2	Section exists but no bullets
1	Section missing

Bullet point = line starting with - or * followed by content.

Example (5/5):

## Quality Gate

- All functions have explicit return types
- No `any` types in production code (test mocks allowed)
- `strict: true` in tsconfig.json
- No TypeScript errors in build output
- Complex unions (3+ members) use type aliases
- Type guards preferred over `as` casts
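The Quality Gate tiers can be sketched as a bullet count within the section. The regular expression and function names below are assumptions for illustration, not the scorer's actual implementation.

```python
import re

# A bullet is a line starting with '-' or '*' followed by content.
BULLET = re.compile(r"^\s*[-*]\s+\S")


def quality_gate_bullets(text):
    """Count bullets under '## Quality Gate'; return None if the section is missing."""
    count = None
    for line in text.splitlines():
        if line.startswith("## "):
            if count is not None:
                break  # next section starts
            if line[3:].strip().lower() == "quality gate":
                count = 0
            continue
        if count is not None and BULLET.match(line):
            count += 1
    return count


def score_quality_gate(text):
    bullets = quality_gate_bullets(text)
    if bullets is None:
        return 1
    if bullets >= 5:
        return 5
    if bullets >= 3:
        return 4
    if bullets >= 1:
        return 3
    return 2
```

Requiring whitespace after the marker keeps lines that start with `**bold**` text from being counted as bullets.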

6. Conciseness

What it measures: Body line count and filler phrase density.

Score	Criteria
5	70-120 lines, ≤3% filler
4	50-150 lines, ≤8% filler
3	40-200 lines, ≤15% filler
2	Outside range but ≥30 lines
1	Fewer than 30 lines

Filler phrases (detected case-insensitive):

“it is important”
“note that”
“please ensure”
“keep in mind”
“remember to”
“as mentioned”
“in order to”

Why it matters: Agents should be dense and actionable. Generic advice adds noise.
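How filler density is computed is not spelled out above. The sketch below assumes it is the share of body lines that contain at least one filler phrase, and that the body is passed in without frontmatter. Names are hypothetical.

```python
FILLER_PHRASES = (
    "it is important",
    "note that",
    "please ensure",
    "keep in mind",
    "remember to",
    "as mentioned",
    "in order to",
)


def filler_density(body):
    """Fraction of lines containing at least one filler phrase (assumed definition)."""
    lines = body.splitlines()
    if not lines:
        return 0.0
    hits = sum(
        1 for line in lines
        if any(phrase in line.lower() for phrase in FILLER_PHRASES)
    )
    return hits / len(lines)


def score_conciseness(body):
    count = len(body.splitlines())
    density = filler_density(body)
    # Tiers are checked from strictest to loosest; the first match wins.
    if 70 <= count <= 120 and density <= 0.03:
        return 5
    if 50 <= count <= 150 and density <= 0.08:
        return 4
    if 40 <= count <= 200 and density <= 0.15:
        return 3
    if count >= 30:
        return 2
    return 1
```

Under this reading, a 100-line body with 20% filler lines drops to 2 even though its length is ideal, because both conditions of a tier must hold.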

7. No Banned Sections

What it measures: Absence of old format headings.

Score	Criteria
5	No banned sections
3	1 banned section
1	2+ banned sections

Banned headings (any level #, ##, ###):

Workflow
Tools
Anti-patterns
Collaboration

The optimized format uses Decisions, Examples, and Quality Gate instead.
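A minimal sketch of the banned-heading check, assuming an exact, case-insensitive match on the heading text at levels 1 to 3 and that every occurrence counts. The function names are illustrative only.

```python
import re

BANNED = {"workflow", "tools", "anti-patterns", "collaboration"}

# Heading levels #, ##, ### followed by the heading text.
HEADING = re.compile(r"^(#{1,3})\s+(.+?)\s*$")


def banned_headings(text):
    """Return the banned heading titles found in the text, in order."""
    found = []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match and match.group(2).lower() in BANNED:
            found.append(match.group(2))
    return found


def score_no_banned_sections(text):
    count = len(banned_headings(text))
    if count == 0:
        return 5
    if count == 1:
        return 3
    return 1
```

With an exact match, a heading such as `## Tools and tips` is not flagged; whether the real scorer matches prefixes or substrings is not specified here.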

8. Version Pinning

What it measures: Version numbers or years in the identity paragraph.

Score	Criteria
5	Both version and year present
4	Either version or year present
2	Neither present

Version patterns:

5.x, 3.11+, v2, >=4.0, ~=1.2

Year patterns:

2020-2039 (four-digit years)

Why it matters: Version pinning clarifies which features are available and prevents outdated advice. Not all agents need versions (e.g., prd, scrum-master), so absence scores 2, not 1.
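The version and year patterns listed above could be matched roughly as follows. These regular expressions are an approximation written for this example, and the function name is hypothetical; the scorer's actual patterns may be stricter or looser.

```python
import re

# Four-digit years from 2020 to 2039.
YEAR = re.compile(r"\b20[23]\d\b")

# Covers the listed shapes: >=4.0, ~=1.2, v2, 5.x, 3.11+
VERSION = re.compile(
    r"(?:>=|~=)\s*\d+(?:\.\d+)*"   # >=4.0, ~=1.2
    r"|\bv\d+(?:\.\d+)*\b"         # v2, v2.1
    r"|\b\d+\.(?:\d+|x)\+?"        # 5.x, 3.11+
)


def score_version_pinning(identity):
    has_version = bool(VERSION.search(identity))
    has_year = bool(YEAR.search(identity))
    if has_version and has_year:
        return 5
    if has_version or has_year:
        return 4
    return 2  # absence is not penalized down to 1
```

For the TypeScript identity shown earlier, `5.x` satisfies the version pattern and `2024` the year pattern, which gives 5.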

Overall Score and Pass Criteria

The overall score is the mean of all 8 dimensions, rounded to 2 decimals.

Pass criteria (both must be true):

Overall score ≥ 3.5
No dimension < 2

This ensures balanced quality — an agent can’t pass with one critically weak dimension.

Score Labels

Label	Range
Excellent	≥ 4.5
Good	3.5 - 4.49
Needs improvement	2.5 - 3.49
Poor	< 2.5
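The mean, the two pass conditions, and the labels fit in a few lines. This is a sketch with hypothetical function names; the dimension keys follow the scorer output format used in this document.

```python
def overall_score(scores):
    """Mean of all dimension scores, rounded to 2 decimals."""
    return round(sum(scores.values()) / len(scores), 2)


def passed(scores):
    # Both conditions must hold: overall >= 3.5 and no dimension below 2.
    return overall_score(scores) >= 3.5 and min(scores.values()) >= 2


def label(overall):
    if overall >= 4.5:
        return "Excellent"
    if overall >= 3.5:
        return "Good"
    if overall >= 2.5:
        return "Needs improvement"
    return "Poor"


scores = {
    "frontmatter": 5, "identity": 5, "decisions": 5, "examples": 4,
    "quality_gate": 5, "conciseness": 5, "no_banned_sections": 5,
    "version_pinning": 5,
}
print(overall_score(scores), label(overall_score(scores)), passed(scores))
```

An agent scoring 4 on seven dimensions and 1 on the eighth has a mean of 3.625, which clears the 3.5 threshold, but it still fails because one dimension is below 2.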

Running the Scorer

Single agent

python3 scripts/quality_scorer.py agents/languages/typescript-pro.md

Output:

============================================================
  agents/languages/typescript-pro.md
============================================================
  frontmatter                    [#####] 5/5
  identity                       [#####] 5/5
  decisions                      [#####] 5/5
  examples                       [####.] 4/5
  quality_gate                   [#####] 5/5
  conciseness                    [#####] 5/5
  no_banned_sections             [#####] 5/5
  version_pinning                [#####] 5/5
                                 --------
  overall                        4.88/5.00
  label                          Excellent
  passed                         YES

Multiple agents

python3 scripts/quality_scorer.py agents/languages/*.md

Exit code:

0 if all agents pass
1 if any agent fails

Batch scoring

Regenerate README score tables:

python3 scripts/generate_readme_scores.py

This:

Scans all agents in agents/
Scores each with quality_scorer.py
Regenerates score tables in README.md and README.en.md
Preserves content outside the score-table start and end markers

CI integration:

# Check that README scores are up to date
python3 scripts/generate_readme_scores.py --check
# → Exit 1 if scores don't match

Catalog Statistics

Current quality metrics (69 agents):

Average score: 4.59/5
Pass rate: 100%
Excellent: 49 agents (≥ 4.5)
Good: 20 agents (3.5 - 4.49)
Needs improvement: 0 agents
Poor: 0 agents

Top scores: llm-architect, golang-pro, java-architect, kotlin-specialist, php-pro, python-pro, rails-expert, rust-pro, swift-expert, typescript-pro, mcp-developer — all 4.88/5.

Adding New Agents

When creating or modifying an agent:

1. Write the agent following the optimized format:
   - Frontmatter with description, mode, permission
   - Identity paragraph (50-300 words)
   - ## Decisions with IF/THEN rules
   - ## Examples with code blocks
   - ## Quality Gate with validation criteria

2. Score the agent:

   python3 scripts/quality_scorer.py agents/new-category/new-agent.md

3. Iterate until score ≥ 3.5 with no dimension < 2

4. Regenerate README scores:

   python3 scripts/generate_readme_scores.py

5. Commit the agent file and updated READMEs together

Common Issues

Low identity score

Problem: Identity paragraph too short or too long.

Fix: Aim for 50-300 words. Include role, expertise level, version context, and focus areas.

Low decisions score

Problem: Decisions section lacks structured rules.

Fix: Use explicit IF/THEN patterns. Two to four rules score 3/5; five or more score 5/5. Example:

## Decisions

- IF function is pure, THEN mark with JSDoc `@pure`
- IF side effect is unavoidable, THEN document in comment
- IF parameter has default, THEN use ES6 default syntax

Low examples score

Problem: Fewer than 2 code blocks.

Fix: Add fenced code examples showing before/after or common patterns. Two blocks score 4/5; three or more score 5/5.

Low conciseness score

Problem: Line count outside the target range or high filler density.

Fix: Cut generic advice. Remove phrases like “it is important to note that”. Aim for 70-120 lines.

Banned sections detected

Problem: Old format headings (Workflow, Tools, Anti-patterns, Collaboration).

Fix: Remove those sections. Move relevant content to Decisions or Examples.

Why These 8 Dimensions?

The scoring system enforces the optimized agent format developed through iterative refinement:

Frontmatter ensures discoverability and permission safety
Identity establishes expertise and context
Decisions provides structured, actionable rules (IF/THEN trees)
Examples shows concrete application
Quality Gate defines success criteria
Conciseness prevents bloat and generic advice
No Banned Sections removes old format cruft
Version Pinning keeps advice current

This format produces agents that score 8-9/10 in practice, compared to 3-4/10 for generic templates.

Related

Architecture: System design and data flow
Permissions: Control agent access
