Every skill in this repository ships with a structured evaluation suite that measures whether the skill actually improves agent outputs compared to a baseline — the same prompt, sent to the same model, with and without the skill’s context loaded. The evaluation runner lives in theDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/aws-samples/sample-well-architected-skills-and-steering/llms.txt
Use this file to discover all available pages before exploring further.
evals/ directory and is powered by Amazon Bedrock’s Converse API.
How evaluations work
For each skill, a set of test cases defines realistic user prompts alongside 5–7 concrete, gradable assertions about what a good response should contain. The runner:- Sends each prompt to Bedrock without the skill — producing a baseline response
- Sends the same prompt with the
SKILL.mdinjected as a system prompt — producing a with-skill response - Uses an LLM-as-judge (Claude Opus 4.8 by default) to grade every assertion as
PASSorFAILagainst both responses - Reports a percentage score for each condition and the delta the skill contributes
What a test case looks like
Eachevals/evals.json file follows the Agent Skills eval spec. A single case includes:
id— stable identifier for the caseprompt— the realistic user request (e.g. “Review this Terraform architecture for security issues”)assertions— 5–7 specific, binary-gradable statements (e.g. “Response identifies the publicly accessible S3 bucket as a HIGH severity finding”)
Prerequisites
Before running evaluations, ensure:Python 3.13+ and uv
The runner requires Python 3.13 or later. Install uv for dependency management.
AWS credentials with Bedrock access
Configure credentials via
aws configure, AWS IAM Identity Center (SSO), or environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN).Bedrock model access enabled
Request model access in the AWS Console → Amazon Bedrock → Model access for the models listed in
evals/config.yaml. Claude Opus 4.8 is required by default.Correct region
Model availability varies by region. The default region is
us-east-1 — verify the models in config.yaml are available in your chosen region.Effectiveness results
All 11 skills are evaluated using an automated LLM-as-judge framework with paired comparison. Results below were produced with Claude Opus 4.8 as both the generation and grading model, with 16K token output limit:| Skill | Baseline | With Skill | Delta |
|---|---|---|---|
wa-review | 82% | 100% | +18% |
architecture-decision-record | 81% | 100% | +19% |
cost-optimization-review | 93% | 100% | +7% |
migration-readiness | 85% | 100% | +15% |
operational-excellence | 90% | 100% | +10% |
performance-efficiency | 90% | 100% | +10% |
reliability-improvement-plan | 95% | 100% | +5% |
security-assessment | 94% | 100% | +6% |
sustainability-optimization | 85% | 100% | +15% |
wa-builder | 61% | 94% | +33% |
wa-guardrails | 76% | 99% | +23% |
| Average | 85% | 99% | +15% |
- 9 of 11 skills achieve a perfect 100% score on behavioral assertions with skill context loaded
- Average +15% improvement over the same model operating without skill guidance
- Skills never produce worse output than baseline — they improve or match in every case
- The largest gains appear in
wa-builder(+33%) andwa-guardrails(+23%) — tasks where structured playbook guidance has the most impact on output shape
