The tool
DevRel Playground lets you send a prompt to multiple models at once, stream their responses side by side, and then run an evaluation pass that scores each response against a rubric you define. I built it because I was tired of copy-pasting between ChatGPT and Claude. It turned into something more interesting than a tab-saver.

Stack: Next.js 15, React 19, TypeScript, Mantine UI, OpenAI, Anthropic, HuggingFace, react-pdf

Defining good
The tool supports different content types. Each needed its own rubric — a set of weighted criteria and a checklist that an evaluation model could apply.
Tutorial writer
Weights: Clarity 30%, Comprehensiveness 30%, Step-by-step 30%, Efficiency 10%

Checklist:
- Clear learning objectives
- Prerequisites mentioned upfront
- Step-by-step instructions
- Screenshots or visual aids
- Troubleshooting tips
- Next steps suggested
- Hands-on exercises
Code examples
Weights: Code Quality 80%, Clarity 10%, Comprehensiveness 5%, Efficiency 5%

Checklist:
- Produces correct output
- Compiles without errors
- Handles edge cases
- Best practices shown
- Comments explain concepts
- Error handling included
- Dependencies specified
- Real-world use cases
Error messages
Weights: Clarity 50%, Helpfulness 30%, Comprehensiveness 10%, Efficiency 10%

Checklist:
- Root cause explained
- Multiple solutions offered
- Step-by-step resolution
- Prevention tips
- Documentation links
- Context about the error
- Skill levels addressed
API documentation
Weights: Comprehensiveness 40%, Code Quality 30%, Clarity 20%, Efficiency 10%

Checklist:
- Auth methods explained
- Error codes covered
- Rate limits mentioned
- Request/response format documented
- Multi-language examples
- Endpoints documented
- Parameters specified
Community Q&A
Weights: Comprehensiveness 40%, Clarity 30%, Code Quality 20%, Efficiency 10%

Checklist:
- Question fully addressed
- Multiple approaches offered
- Relevant code examples
- Best practices highlighted
- Common pitfalls covered
- Additional resources linked
- Follow-ups anticipated
General purpose
Weights: All criteria equally weighted at 25%

Checklist:
- Addresses core question
- Actionable information provided
- Appropriate tone
- Relevant examples included
- Logical structure
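A rubric like the ones above is just weights plus a checklist, which makes the scoring arithmetic easy to sketch. The following is an illustrative TypeScript sketch, not the tool's actual code; the type and function names are mine:

```typescript
// Illustrative types -- these names are assumptions, not DevRel Playground's real API.
type Criterion = { name: string; weight: number }; // weights should sum to 1

interface Rubric {
  criteria: Criterion[];
  checklist: string[];
}

// The tutorial-writer rubric from above, expressed as data.
const tutorialRubric: Rubric = {
  criteria: [
    { name: "Clarity", weight: 0.3 },
    { name: "Comprehensiveness", weight: 0.3 },
    { name: "Step-by-step", weight: 0.3 },
    { name: "Efficiency", weight: 0.1 },
  ],
  checklist: [
    "Clear learning objectives",
    "Prerequisites mentioned upfront",
    "Step-by-step instructions",
  ],
};

// Combine per-criterion scores (0-100) into one weighted total.
// Criteria the judge did not score count as 0.
function weightedScore(rubric: Rubric, scores: Record<string, number>): number {
  return rubric.criteria.reduce(
    (total, c) => total + c.weight * (scores[c.name] ?? 0),
    0
  );
}

const total = weightedScore(tutorialRubric, {
  Clarity: 90,
  Comprehensiveness: 85,
  "Step-by-step": 88,
  Efficiency: 95,
});
console.log(total.toFixed(1)); // prints "88.4"
```

The point of making the rubric plain data is that swapping content types means swapping one object, not rewriting the evaluator.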
Multiple experiments
The same prompt was used across all runs.

Run 1

| Model | Score | Criteria met |
|---|---|---|
| Claude Sonnet 4 | 88.5 | 6/7 |
| Kimi-K2-Instruct | 87.5 | 6/7 |
| GPT-5 | 87.5 | 6/7 |

Run 2

| Model | Score | Criteria met |
|---|---|---|
| Claude Sonnet 4 | 85.3 | 6/7 |
| Kimi-K2-Instruct | 80.0 | 5/7 |
| GPT-5 | 91.3 | 6/7 |
What numbers can’t tell you
When I actually read the three tutorials — not skimmed them — they turned out to be written for entirely different people.

| | Kimi-K2-Instruct | GPT-5 | Claude Sonnet 4 |
|---|---|---|---|
| Driver | @neondatabase/serverless | @neondatabase/serverless | Prisma ORM |
| ORM | No ORM (suggests Drizzle later) | Reusable db helper with caching | Full relational model |
| Scope | 30-minute lean tutorial | Full CRUD implementation | 18-page deep dive |
| Extras | Production checklist | Structured exercises | Prisma Studio, Tailwind |
| Deployment | Assumes Vercel | Architecture overview | Users + Posts schema |
| Implicit audience | "Here's how a senior dev would set this up." | "Here's a solid foundation with room to grow." | "Let me show you how all the pieces connect." |
My assumptions are the data
There are three layers to the measurement problem.
Layer 1: The rubrics
I wrote the checklists and set the weights. The evaluation model judges responses against my definition of good. The models being evaluated have no idea what that definition is — they're answering a prompt, not a rubric.

The rubric reflects my assumptions about what a good tutorial contains. Those assumptions are reasonable. They're also mine.
Layer 2: What the rubrics cannot see
Rubrics have blind spots. The checklist measures structure, not teaching.

A concrete example: the Tic Tac Toe tutorial I wrote reached 75,000+ developers. It uses verbose switch statements rather than coordinate math — case 5: gameBoard[1][2] instead of position / 3 and position % 3. That's a deliberate choice. A beginner can read the verbose version and immediately understand what's happening. The compact version requires working backwards from the math.

"Step-by-step instructions provided" would score both versions the same. The checklist can't see that one of those steps is calibrated to where the learner actually is.
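The contrast is easy to paraphrase in code (this is an illustration of the trade-off, not the tutorial's actual implementation):

```typescript
// Two ways to map a 0-indexed board position (0-8) to a 3x3 grid cell.
// Paraphrased for illustration -- not the tutorial's actual code.

// Verbose: every case is spelled out, so a beginner can see the whole board.
function cellVerbose(position: number): [number, number] {
  switch (position) {
    case 0: return [0, 0];
    case 1: return [0, 1];
    case 2: return [0, 2];
    case 3: return [1, 0];
    case 4: return [1, 1];
    case 5: return [1, 2];
    case 6: return [2, 0];
    case 7: return [2, 1];
    case 8: return [2, 2];
    default: throw new Error(`invalid position: ${position}`);
  }
}

// Compact: correct, but the reader has to reverse-engineer the arithmetic.
function cellCompact(position: number): [number, number] {
  return [Math.floor(position / 3), position % 3];
}
```

Both functions land on the same cell for every position. A structural checklist scores them identically; only a reader can tell which one teaches.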
Layer 3: The judge
Even with a perfect rubric, asking an LLM to apply it isn't stable. The Run 1 vs. Run 2 results are the evidence: same inputs, different scores, a different winner on different days.

The evaluator isn't a fixed instrument. It's another model with its own variance.
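One way to take that seriously is to treat the judge as a noisy instrument: score the same response several times and report the spread, not a single number. A TypeScript sketch, where judge stands in for a real evaluation call:

```typescript
// Treat the evaluator as a noisy instrument: score the same response
// several times and report mean and spread instead of a single number.
// `Judge` is a stand-in for a real LLM evaluation call, not the tool's API.
type Judge = (response: string) => Promise<number>;

async function scoreWithSpread(
  judge: Judge,
  response: string,
  runs = 5
): Promise<{ mean: number; min: number; max: number }> {
  const scores: number[] = [];
  for (let i = 0; i < runs; i++) {
    scores.push(await judge(response));
  }
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  return { mean, min: Math.min(...scores), max: Math.max(...scores) };
}

// Example with a simulated noisy judge: same input, different scores each call.
const noisyJudge: Judge = async () => 85 + Math.random() * 8;
scoreWithSpread(noisyJudge, "the same tutorial text").then(({ mean, min, max }) =>
  console.log(`mean ${mean.toFixed(1)}, range ${min.toFixed(1)}-${max.toFixed(1)}`)
);
```

A score reported as "85-91 over five runs" is less satisfying than "88.5", but it is honest about what the instrument can actually resolve.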
Where does that leave us
I started this project thinking I could measure "good," and I ended up with three reasons why that's harder than it sounds: my rubric encodes my assumptions, my assumptions have blind spots I can't see, and the evaluator applying my rubric isn't even consistent with itself.

The models didn't fail. They answered different questions — questions I never asked. What kind of beginner? Learning toward what? Building on which foundation? The evaluation couldn't catch the mismatch because I never gave it the one input that would have mattered: who is this for, and what do they already understand?
