The evaluation runner generates paired responses, baseline and with-skill, for every test case in a skill’s evals/evals.json, then grades each assertion using an LLM-as-judge. This page walks through setup and the full set of run options.
Setup
Install dependencies
Navigate to the evals/ directory and sync the project’s dependencies with uv. This installs boto3 and pyyaml into an isolated virtual environment. No other dependencies are required.
Configure AWS credentials
The runner calls Amazon Bedrock via boto3. Any standard credential chain works:
- aws configure: prompts for Access Key ID, Secret Access Key, region, and output format.
- SSO (IAM Identity Center)
- Environment variables
Running evaluations
Command reference
| Flag | Description |
|---|---|
| --skill <name> | Evaluate a single skill. Omit to run all skills. |
| --verbose / -v | Show per-assertion grading detail (pass/fail icons for every assertion in every case). |
| --parallel / -p | Run eval cases within each skill in parallel. Roughly 3× faster wall-clock time. |
| --save | Write results JSON to evals/results/ for historical tracking. |
| --list | Print available skill names and exit. |
| --model <id> | Override the generation model without editing config.yaml. |
| --grading-model <id> | Override the grading model. |
| --runs <n> | Run each skill n times for statistical averaging. |
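The flags in the table map naturally onto a standard argparse parser. The sketch below is illustrative only: it makes no claim about how the runner actually parses its arguments, `build_parser` is a hypothetical name, and the default of one run is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the runner's own code)."""
    parser = argparse.ArgumentParser(description="Skill evaluation runner (sketch)")
    parser.add_argument("--skill", metavar="NAME", default=None,
                        help="Evaluate a single skill; omit to run all skills")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Show per-assertion grading detail")
    parser.add_argument("--parallel", "-p", action="store_true",
                        help="Run eval cases within each skill in parallel")
    parser.add_argument("--save", action="store_true",
                        help="Write results JSON to evals/results/")
    parser.add_argument("--list", action="store_true",
                        help="Print available skill names and exit")
    parser.add_argument("--model", metavar="ID", default=None,
                        help="Override the generation model")
    parser.add_argument("--grading-model", metavar="ID", default=None,
                        help="Override the grading model")
    # Assumption: a single run when --runs is not given.
    parser.add_argument("--runs", metavar="N", type=int, default=1,
                        help="Run each skill N times for averaging")
    return parser


# "example-skill" is a placeholder, not a real skill name from the project.
args = build_parser().parse_args(["--skill", "example-skill", "-v", "--runs", "3"])
print(args.skill, args.verbose, args.runs)  # example-skill True 3
```

Flags that take no value are modelled as boolean switches, so combining them (for example verbose output with parallel execution) needs no extra handling.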
Configuration
The evals/config.yaml file controls which Bedrock models are used and where calls are made:
| Field | Purpose |
|---|---|
| region | AWS region for all Bedrock API calls |
| generation_model | Model used to generate baseline and with-skill responses |
| grading_model | Model used as LLM-as-judge to grade assertions; keep this a strong model |
| max_tokens | Maximum output tokens per generation call (16384 = full reports) |
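One way to picture how these fields interact with the --model and --grading-model flags is as a small immutable record with optional overrides. This is a sketch under assumptions: `EvalConfig` and `apply_overrides` are hypothetical names, the region and model strings are placeholders, and the runner's real loading code (which reads the YAML file) is not shown.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class EvalConfig:
    """The four documented config.yaml fields."""
    region: str
    generation_model: str
    grading_model: str
    max_tokens: int = 16384  # value quoted in the table above


def apply_overrides(config: EvalConfig,
                    model: Optional[str] = None,
                    grading_model: Optional[str] = None) -> EvalConfig:
    """Return a copy of config with any command-line overrides applied."""
    changes = {}
    if model is not None:
        changes["generation_model"] = model
    if grading_model is not None:
        changes["grading_model"] = grading_model
    return replace(config, **changes)


# Placeholder values, not real region or model identifiers from the project.
base = EvalConfig(region="example-region",
                  generation_model="example-generation-model",
                  grading_model="example-grading-model")
overridden = apply_overrides(base, model="another-model")
print(overridden.generation_model)  # another-model
print(overridden.grading_model)     # example-grading-model
```

Returning a new object rather than mutating the loaded config keeps the file-level defaults intact for later runs.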
How it works
For each test case in a skill’s evals.json, the runner:
- Sends the prompt to Bedrock without skill context → baseline response
- Sends the same prompt with SKILL.md prepended to the system prompt → with-skill response
- Grades each assertion against the baseline response (PASS/FAIL)
- Grades each assertion against the with-skill response (PASS/FAIL)
- Computes a percentage score for each condition and the delta
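The steps above can be sketched as a single function. This is a minimal illustration, not the runner's implementation: `generate` and `grade` are hypothetical callables standing in for the Bedrock generation and LLM-as-judge calls, the skill text is passed as the whole system prompt for simplicity, and the stubs below exist only so the example runs offline.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical signatures: generate(prompt, system) -> response text,
# grade(assertion, response) -> True for PASS, False for FAIL.
Generate = Callable[[str, Optional[str]], str]
Grade = Callable[[str, str], bool]


def percent_passed(results: List[bool]) -> float:
    """Percentage of assertions that passed (0.0 when there are none)."""
    return 100.0 * sum(results) / len(results) if results else 0.0


def evaluate_case(prompt: str, assertions: List[str], skill_md: str,
                  generate: Generate, grade: Grade) -> Dict[str, float]:
    baseline = generate(prompt, None)        # no skill context
    with_skill = generate(prompt, skill_md)  # skill text supplied as system prompt
    baseline_score = percent_passed([grade(a, baseline) for a in assertions])
    skill_score = percent_passed([grade(a, with_skill) for a in assertions])
    return {"baseline": baseline_score,
            "with_skill": skill_score,
            "delta": skill_score - baseline_score}


# Stub model and judge so the sketch runs without Bedrock.
def fake_generate(prompt: str, system: Optional[str]) -> str:
    return "mentions encryption and backups" if system else "mentions encryption"


def fake_grade(assertion: str, response: str) -> bool:
    return assertion in response


result = evaluate_case("Review my design", ["encryption", "backups"],
                       "skill text", fake_generate, fake_grade)
print(result)  # {'baseline': 50.0, 'with_skill': 100.0, 'delta': 50.0}
```

A positive delta means the skill context helped on this case; a delta near zero suggests the baseline model already satisfies the assertions.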
Estimated cost per run
| Scope | Generation calls | Grading calls | Estimated cost |
|---|---|---|---|
| Single skill (3 cases) | 6 (Opus) | 6 (Opus) | ~$2.50 |
| All 11 skills (33 cases) | 66 (Opus) | 66 (Opus) | ~$25 |
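The call counts in the table follow from two generation calls and two grading calls per test case (one per condition). The helper below reproduces that arithmetic; `estimate_calls` is a hypothetical name, and the assumption that --runs multiplies the counts linearly is not stated in the source.

```python
from typing import Dict


def estimate_calls(num_cases: int, runs: int = 1) -> Dict[str, int]:
    """Call counts implied by the table: two generation and two grading calls per case."""
    conditions = 2  # baseline and with-skill
    # Assumption: repeating a skill with --runs scales the calls linearly.
    total = num_cases * conditions * runs
    return {"generation": total, "grading": total}


print(estimate_calls(3))   # {'generation': 6, 'grading': 6}
print(estimate_calls(33))  # {'generation': 66, 'grading': 66}
```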
Experimenting with other models
The eval runner works with any model available in your Bedrock region. To try a different generation model without editing the config file, pass --model <id> on the command line. To change the default for all runs, edit generation_model in config.yaml. The grading model should remain a strong model (Claude Opus or Sonnet) for reliable assertion grading.
Claude Opus 4.8 does not support the temperature inference parameter. The runner detects this automatically and retries without the parameter, so no configuration change is needed.
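The retry behaviour can be pictured as a thin wrapper around the model call. How the runner actually detects the unsupported parameter is not documented here, so this sketch assumes a dedicated exception: `UnsupportedParameterError`, `call_with_temperature_fallback`, and `fake_invoke` are all hypothetical names, and the fake call stands in for Bedrock so the example runs offline.

```python
from typing import Any, Callable, Dict


class UnsupportedParameterError(Exception):
    """Hypothetical stand-in for whatever error the real model call raises."""


def call_with_temperature_fallback(invoke: Callable[[Dict[str, Any]], str],
                                   params: Dict[str, Any]) -> str:
    """Try the call as given; if it is rejected, retry once without temperature."""
    try:
        return invoke(params)
    except UnsupportedParameterError:
        if "temperature" not in params:
            raise  # nothing to strip, so the failure has another cause
        retry_params = {k: v for k, v in params.items() if k != "temperature"}
        return invoke(retry_params)


# Fake model call that rejects temperature.
def fake_invoke(params: Dict[str, Any]) -> str:
    if "temperature" in params:
        raise UnsupportedParameterError("temperature is not supported")
    return f"ok with {sorted(params)}"


reply = call_with_temperature_fallback(
    fake_invoke, {"max_tokens": 16384, "temperature": 0.0})
print(reply)  # ok with ['max_tokens']
```

Retrying only once, and only when temperature was actually present, avoids masking unrelated failures behind the fallback.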