Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/aws-samples/sample-well-architected-skills-and-steering/llms.txt

Use this file to discover all available pages before exploring further.

The benchmark runner sends the same Well-Architected review prompt to multiple Amazon Bedrock models simultaneously and captures quality, latency, tokens per second, and cost per invocation side by side. Use it to choose the right model for your team’s balance of depth, speed, and budget — or to track how results evolve as new models are released.

Latest benchmark results

The table below reflects the most recent run against a customer-facing e-commerce Terraform architecture. Models are sorted by quality score descending.
ModelInput TokensOutput TokensLatency (s)Tokens/sCostQuality
claude-sonnet-57933,98341.895$0.06215.0/5
r15171,56611.0142$0.00924.8/5
nova-2-lite5121,4318.0179$0.00024.2/5
pixtral-large-25026201,56519.182$0.01063.8/5
claude-haiku-4-5-202510015874,09621.6190$0.01693.8/5
nova-pro5491,3169.1145$0.00463.6/5
llama4-maverick-17b-instruct5039855.4182$0.00053.5/5
llama3-3-70b-instruct5051,1339.9114$0.00123.3/5
Last run: 2026-07-01 · Region: us-east-1 · Max tokens: 4,096 · Temperature: 0Quality scores are graded by an LLM judge against eight criteria: coverage of all 6 WA pillars, identification of key risks (single-AZ ECS, multi_az=false on RDS, CloudFront allow-all viewer protocol), auto-scaling recommendation for flash sale peaks, single Redis node as SPOF, cost right-sizing, and actionability of recommendations.

Run commands

1

Install dependencies

cd evals
uv sync
2

Run the benchmark

# Latency and token counts only — no quality grading
uv run python benchmark.py

Command reference

FlagDescription
--gradeRun an LLM-as-judge grading pass after generation to produce quality scores
--models <id> [<id> ...]Override the model list from benchmark_config.yaml
--config <path>Use a custom config file
--results <path>Write results JSON to a custom path (default: results/benchmark-YYYYMMDD-HHMMSS.json)
--concurrency <n>Max parallel model calls (default: 4)

Configuration

Model list, prompt, grading criteria, and per-model pricing are all configured in evals/benchmark_config.yaml:
region: us-east-1
max_tokens: 4096
temperature: 0
concurrency: 4

models:
  - us.anthropic.claude-sonnet-5
  - us.anthropic.claude-haiku-4-5-20251001-v1:0
  - us.amazon.nova-pro-v1:0
  - us.amazon.nova-lite-v1:0
  - us.amazon.nova-2-lite-v1:0
  - us.deepseek.r1-v1:0
  - us.meta.llama4-maverick-17b-instruct-v1:0
  - us.meta.llama3-3-70b-instruct-v1:0
  - us.mistral.pixtral-large-2502-v1:0

grading:
  model: us.anthropic.claude-sonnet-4-6
  criteria:
    - "Covers all 6 WA pillars (Operational Excellence, Security, Reliability,
       Performance Efficiency, Cost Optimization, Sustainability)"
    - "Identifies the single-AZ ECS deployment as a reliability risk"
    - "Flags multi_az=false on RDS as HIGH severity"
    - "Identifies CloudFront allow-all viewer_protocol_policy as a security issue"
    - "Recommends auto-scaling for the ECS service to handle flash sale peaks"
    - "Notes the single Redis node as a SPOF"
    - "Addresses the cost concern with specific right-sizing or pricing model recommendations"
    - "Provides actionable recommendations (not just generic advice)"
Add new models as they become available in Bedrock, update the pricing section with their per-token rates, and re-run to extend the table.

Benchmark task

Every model receives the same prompt: a Well-Architected review of a customer-facing e-commerce platform defined in Terraform. The architecture includes deliberate issues planted across all six pillars — single-AZ ECS, multi_az=false on RDS, a single-node Redis cluster, CloudFront with allow-all viewer protocol, and no auto-scaling — alongside real operational context (10,000 orders/day, 5× peak traffic, a 4-hour RTO target not being met, and a $18K/month bill growing 20% QoQ).
Use uv run python list_models.py to discover available inference profile IDs in your Bedrock region, then add them to the models list in benchmark_config.yaml to include them in your next run.
Models that require extended thinking (Claude Sonnet 5, Claude Opus 4) are handled automatically — the runner detects them by model ID and configures additionalModelRequestFields with adaptive thinking. Models that don’t support temperature (e.g., Claude Opus 4.8) are also detected and retried without that parameter.

Build docs developers (and LLMs) love