The evaluation runner generates paired responses — baseline and with-skill — for every test case in a skill’s evals/evals.json, then grades each assertion using an LLM-as-judge. This page walks through setup and the full set of run options.

Setup

1. Install dependencies

Navigate to the evals/ directory and sync the project’s dependencies with uv:
cd evals
uv sync
This installs boto3 and pyyaml into an isolated virtual environment. No other dependencies are required.
2. Configure AWS credentials

The runner calls Amazon Bedrock via boto3, so any credential source in the standard AWS credential chain works. For example:
aws configure
This prompts for your Access Key ID, Secret Access Key, region, and output format.
3. Verify Bedrock model access

Open the AWS Console → Amazon Bedrock → Model access and confirm that access is granted, in your target region, for the models listed in evals/config.yaml. By default, this means Claude Opus 4.8 (us.anthropic.claude-opus-4-8) in us-east-1.

Running evaluations

uv run python run.py --list

Command reference

| Flag | Description |
| --- | --- |
| --skill <name> | Evaluate a single skill. Omit to run all skills. |
| --verbose / -v | Show per-assertion grading detail (pass/fail icons for every assertion in every case). |
| --parallel / -p | Run eval cases within each skill in parallel. Roughly 3× faster wall-clock time. |
| --save | Write results JSON to evals/results/ for historical tracking. |
| --list | Print available skill names and exit. |
| --model <id> | Override the generation model without editing config.yaml. |
| --grading-model <id> | Override the grading model. |
| --runs <n> | Run each skill n times for statistical averaging. |
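To make the flag set above concrete, here is a minimal sketch of how such a command line could be declared with Python's argparse. It is not the actual run.py: the `build_parser` function, the `list_skills` destination name, and the default of 1 for `--runs` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the real run.py)."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("--skill", metavar="NAME", help="evaluate a single skill")
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("-p", "--parallel", action="store_true")
    parser.add_argument("--save", action="store_true")
    # dest is renamed so the attribute does not shadow the built-in name "list".
    parser.add_argument("--list", action="store_true", dest="list_skills")
    parser.add_argument("--model", metavar="ID")
    parser.add_argument("--grading-model", metavar="ID")
    # Default of 1 run is an assumption; the docs do not state the default.
    parser.add_argument("--runs", type=int, default=1, metavar="N")
    return parser


args = build_parser().parse_args(["--skill", "wa-review", "-v", "-p", "--runs", "3"])
print(args.skill, args.verbose, args.parallel, args.runs)
```

Omitting `--skill` leaves `args.skill` as `None`, which matches the documented behavior of running all skills when the flag is absent.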

Configuration

The evals/config.yaml file controls which Bedrock models are used and where calls are made:
provider: bedrock
region: us-east-1
generation_model: us.anthropic.claude-opus-4-8
grading_model: us.anthropic.claude-opus-4-8
max_tokens: 16384
temperature: 0
| Field | Purpose |
| --- | --- |
| region | AWS region for all Bedrock API calls |
| generation_model | Model used to generate baseline and with-skill responses |
| grading_model | Model used as LLM-as-judge to grade assertions (keep this a strong model) |
| max_tokens | Maximum output tokens per generation call (16384 = full reports) |
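The interaction between config.yaml and the --model / --grading-model flags can be sketched as a simple precedence rule: a value given on the command line wins, otherwise the file value applies. The `EvalConfig` class and `apply_overrides` function below are hypothetical names, and the precedence itself is an assumption based on the flag descriptions rather than on the runner's source.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class EvalConfig:
    # Defaults copied from the config.yaml shown above.
    provider: str = "bedrock"
    region: str = "us-east-1"
    generation_model: str = "us.anthropic.claude-opus-4-8"
    grading_model: str = "us.anthropic.claude-opus-4-8"
    max_tokens: int = 16384
    temperature: float = 0.0


def apply_overrides(config: EvalConfig, model: Optional[str] = None,
                    grading_model: Optional[str] = None) -> EvalConfig:
    """Return a copy in which CLI values, when given, replace file values."""
    changes = {}
    if model is not None:
        changes["generation_model"] = model
    if grading_model is not None:
        changes["grading_model"] = grading_model
    return replace(config, **changes)


effective = apply_overrides(EvalConfig(), model="us.anthropic.claude-sonnet-4-6")
print(effective.generation_model)  # us.anthropic.claude-sonnet-4-6
print(effective.grading_model)     # us.anthropic.claude-opus-4-8
```

Returning a new frozen object instead of mutating the loaded one keeps the file defaults available for later runs in the same process.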

How it works

For each test case in a skill’s evals.json, the runner:
  1. Sends the prompt to Bedrock without skill context → baseline response
  2. Sends the same prompt with SKILL.md prepended to the system prompt → with-skill response
  3. Grades each assertion against the baseline response (PASS/FAIL)
  4. Grades each assertion against the with-skill response (PASS/FAIL)
  5. Computes a percentage score for each condition and the delta
The final report shows baseline score, with-skill score, and delta for each skill — quantifying the value the skill adds.
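The five steps above can be sketched in a few lines of Python. This is an illustration, not the runner's code: the `generate` and `grade` callables, the `evaluate_case` and `score` helpers, and the case shape (a dict with `prompt` and `assertions` keys) are all assumptions, and the model and judge are replaced by offline stubs.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical signatures: generate(prompt, system) -> response text,
# grade(response, assertion) -> True for PASS, False for FAIL.
Generate = Callable[[str, Optional[str]], str]
Grade = Callable[[str, str], bool]


def score(response: str, assertions: List[str], grade: Grade) -> float:
    """Percentage of assertions that the judge marks as PASS."""
    passed = sum(1 for a in assertions if grade(response, a))
    return 100.0 * passed / len(assertions)


def evaluate_case(case: Dict, skill_md: str, generate: Generate, grade: Grade) -> Dict[str, float]:
    baseline = generate(case["prompt"], None)        # step 1: no skill context
    with_skill = generate(case["prompt"], skill_md)  # step 2: SKILL.md as system context
    base_pct = score(baseline, case["assertions"], grade)     # step 3
    skill_pct = score(with_skill, case["assertions"], grade)  # step 4
    # step 5: percentage per condition plus the delta between them
    return {"baseline": base_pct, "with_skill": skill_pct, "delta": skill_pct - base_pct}


# Stub model and judge so the sketch runs without Bedrock access.
def fake_generate(prompt: str, system: Optional[str]) -> str:
    return "mentions reliability and cost" if system else "mentions cost"


def fake_grade(response: str, assertion: str) -> bool:
    return assertion in response


case = {"prompt": "Review this workload", "assertions": ["reliability", "cost"]}
print(evaluate_case(case, "SKILL.md text", fake_generate, fake_grade))
# {'baseline': 50.0, 'with_skill': 100.0, 'delta': 50.0}
```

A positive delta means the with-skill response satisfied more assertions than the baseline for the same prompt.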

Estimated cost per run

| Scope | Generation calls | Grading calls | Estimated cost |
| --- | --- | --- | --- |
| Single skill (3 cases) | 6 (Opus) | 6 (Opus) | ~$1.50–$2.50 |
| All 11 skills (33 cases) | 66 (Opus) | 66 (Opus) | ~$15–$25 |
Cost assumes ~1K input tokens and ~8K output tokens per generation call, and ~9K input / ~500 output per grading call. Actual cost depends on response length and Bedrock pricing in your region.
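The token assumptions above translate into a simple estimate. The sketch below hard-codes the per-call token counts from this page and takes prices as parameters. The `estimate_cost` function is hypothetical, and the prices in the usage lines are placeholders chosen for illustration, not actual Bedrock pricing.

```python
def estimate_cost(gen_calls: int, grade_calls: int,
                  usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Estimate run cost in USD from the per-call token assumptions above."""
    # ~1K input per generation call, ~9K input per grading call.
    input_tokens = gen_calls * 1_000 + grade_calls * 9_000
    # ~8K output per generation call, ~500 output per grading call.
    output_tokens = gen_calls * 8_000 + grade_calls * 500
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


# Placeholder prices in USD per million tokens; check current Bedrock pricing.
single = estimate_cost(6, 6, usd_per_m_input=5.0, usd_per_m_output=25.0)
all_skills = estimate_cost(66, 66, usd_per_m_input=5.0, usd_per_m_output=25.0)
print(f"single skill: {single:.3f} USD, all skills: {all_skills:.3f} USD")
```

Under these assumptions a single skill uses about 60K input tokens and 51K output tokens, so the output price dominates the total.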
Use --parallel to cut wall-clock time by ~3×. Cost is the same — parallelism doesn’t reduce token consumption, it just runs cases concurrently within a skill.
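Why parallelism changes wall-clock time but not cost can be shown with a small offline sketch. It uses a thread pool as an assumed concurrency mechanism (this page does not say how the runner implements --parallel), and `run_case` is a hypothetical stand-in that sleeps in place of waiting on model calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_case(case_id: int) -> int:
    """Stand-in for one eval case; the sleep imitates waiting on model calls."""
    time.sleep(0.2)
    return case_id


cases = [1, 2, 3]

start = time.perf_counter()
sequential = [run_case(c) for c in cases]
sequential_s = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(cases)) as pool:
    parallel = list(pool.map(run_case, cases))  # map preserves input order
parallel_s = time.perf_counter() - start

print(f"sequential {sequential_s:.2f}s, parallel {parallel_s:.2f}s")
```

Both variants call `run_case` exactly three times, so the work (and, for real model calls, the token usage) is identical. Only the waiting overlaps.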

Experimenting with other models

The eval runner works with any model available in your Bedrock region. To try a different generation model without editing the config file:
# Try Amazon Nova Pro
uv run python run.py --skill wa-review --model us.amazon.nova-pro-v1:0 --verbose

# Try Claude Sonnet
uv run python run.py --skill wa-review --model us.anthropic.claude-sonnet-4-6 --verbose
To see which models are available in your region, use the discovery utility:
uv run python list_models.py
Then update generation_model in config.yaml to change the default for all runs. The grading model should remain a strong model (Claude Opus or Sonnet) for reliable assertion grading.
Claude Opus 4.8 does not support the temperature inference parameter. The runner detects this automatically and retries without the parameter — no configuration change is needed.
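The retry behavior described above can be sketched as follows. This is an illustration only: `UnsupportedParameterError`, `call_with_temperature_fallback`, and `fake_invoke` are hypothetical names, and the real runner may detect the rejection differently (for example, by inspecting a Bedrock error response).

```python
from typing import Callable, Dict


class UnsupportedParameterError(Exception):
    """Hypothetical error standing in for the service rejecting a parameter."""


def call_with_temperature_fallback(invoke: Callable[[Dict], str], params: Dict) -> str:
    """Try the call as configured; on rejection, retry once without temperature."""
    try:
        return invoke(params)
    except UnsupportedParameterError:
        if "temperature" not in params:
            raise  # something else was rejected; there is nothing to strip
        retry_params = {k: v for k, v in params.items() if k != "temperature"}
        return invoke(retry_params)


# Stub model that rejects temperature, so the sketch runs offline.
def fake_invoke(params: Dict) -> str:
    if "temperature" in params:
        raise UnsupportedParameterError("temperature is not supported")
    return "ok"


print(call_with_temperature_fallback(fake_invoke, {"max_tokens": 16384, "temperature": 0}))  # ok
```

Building a new dict for the retry leaves the caller's parameters untouched, so the same configuration can still be used with models that accept temperature.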
On Windows, ensure AWS credentials are configured via aws configure or environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN). If using AWS IAM Identity Center (SSO), run aws sso login --profile your-profile before running evaluations.
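If you rely on environment variables, a quick pre-flight check can catch a missing value before the first Bedrock call. The helper below is a hypothetical sketch: it covers only the environment-variable path, so credentials set up through aws configure or SSO will not show up here. AWS_SESSION_TOKEN is left out of the required set because it is only needed for temporary credentials.

```python
import os
from typing import List, Mapping

REQUIRED = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_credential_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required credential variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
print(missing_credential_vars({"AWS_ACCESS_KEY_ID": "AKIAEXAMPLE"}))
# ['AWS_SECRET_ACCESS_KEY']
```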
