Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/aws-samples/sample-well-architected-skills-and-steering/llms.txt

Use this file to discover all available pages before exploring further.

Every skill in this repository ships with a structured evaluation suite that measures whether the skill actually improves agent outputs compared to a baseline — the same prompt, sent to the same model, with and without the skill’s context loaded. The evaluation runner lives in the evals/ directory and is powered by Amazon Bedrock’s Converse API.

How evaluations work

For each skill, a set of test cases defines realistic user prompts alongside 5–7 concrete, gradable assertions about what a good response should contain. The runner:
  1. Sends each prompt to Bedrock without the skill — producing a baseline response
  2. Sends the same prompt with the SKILL.md injected as a system prompt — producing a with-skill response
  3. Uses an LLM-as-judge (Claude Opus 4.8 by default) to grade every assertion as PASS or FAIL against both responses
  4. Reports a percentage score for each condition and the delta the skill contributes
This paired-comparison design isolates the skill’s contribution: any difference in scores comes from the skill context, not model variance or prompt wording.

What a test case looks like

Each evals/evals.json file follows the Agent Skills eval spec. A single case includes:
  • id — stable identifier for the case
  • prompt — the realistic user request (e.g. “Review this Terraform architecture for security issues”)
  • assertions — 5–7 specific, binary-gradable statements (e.g. “Response identifies the publicly accessible S3 bucket as a HIGH severity finding”)
Assertions are written to be unambiguous: the judge only needs to determine whether the output contains or omits a specific, concrete piece of guidance.

Prerequisites

Before running evaluations, ensure:

Python 3.13+ and uv

The runner requires Python 3.13 or later. Install uv for dependency management.

AWS credentials with Bedrock access

Configure credentials via aws configure, AWS IAM Identity Center (SSO), or environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN).

Bedrock model access enabled

Request model access in the AWS Console → Amazon Bedrock → Model access for the models listed in evals/config.yaml. Claude Opus 4.8 is required by default.

Correct region

Model availability varies by region. The default region is us-east-1 — verify the models in config.yaml are available in your chosen region.

Effectiveness results

All 11 skills are evaluated using an automated LLM-as-judge framework with paired comparison. Results below were produced with Claude Opus 4.8 as both the generation and grading model, with 16K token output limit:
SkillBaselineWith SkillDelta
wa-review82%100%+18%
architecture-decision-record81%100%+19%
cost-optimization-review93%100%+7%
migration-readiness85%100%+15%
operational-excellence90%100%+10%
performance-efficiency90%100%+10%
reliability-improvement-plan95%100%+5%
security-assessment94%100%+6%
sustainability-optimization85%100%+15%
wa-builder61%94%+33%
wa-guardrails76%99%+23%
Average85%99%+15%
Key findings:
  • 9 of 11 skills achieve a perfect 100% score on behavioral assertions with skill context loaded
  • Average +15% improvement over the same model operating without skill guidance
  • Skills never produce worse output than baseline — they improve or match in every case
  • The largest gains appear in wa-builder (+33%) and wa-guardrails (+23%) — tasks where structured playbook guidance has the most impact on output shape
The evaluation framework is fully reproducible. See Running Evals to run the suite on your own models and prompts.

Build docs developers (and LLMs) love