The benchmark runner sends the same Well-Architected review prompt to multiple Amazon Bedrock models simultaneously and captures quality, latency, tokens per second, and cost per invocation side by side. Use it to choose the right model for your team’s balance of depth, speed, and budget — or to track how results evolve as new models are released.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/aws-samples/sample-well-architected-skills-and-steering/llms.txt
Use this file to discover all available pages before exploring further.
Latest benchmark results
The table below reflects the most recent run against a customer-facing e-commerce Terraform architecture. Models are sorted by quality score descending.| Model | Input Tokens | Output Tokens | Latency (s) | Tokens/s | Cost | Quality |
|---|---|---|---|---|---|---|
| claude-sonnet-5 | 793 | 3,983 | 41.8 | 95 | $0.0621 | 5.0/5 |
| r1 | 517 | 1,566 | 11.0 | 142 | $0.0092 | 4.8/5 |
| nova-2-lite | 512 | 1,431 | 8.0 | 179 | $0.0002 | 4.2/5 |
| pixtral-large-2502 | 620 | 1,565 | 19.1 | 82 | $0.0106 | 3.8/5 |
| claude-haiku-4-5-20251001 | 587 | 4,096 | 21.6 | 190 | $0.0169 | 3.8/5 |
| nova-pro | 549 | 1,316 | 9.1 | 145 | $0.0046 | 3.6/5 |
| llama4-maverick-17b-instruct | 503 | 985 | 5.4 | 182 | $0.0005 | 3.5/5 |
| llama3-3-70b-instruct | 505 | 1,133 | 9.9 | 114 | $0.0012 | 3.3/5 |
Last run: 2026-07-01 · Region: us-east-1 · Max tokens: 4,096 · Temperature: 0Quality scores are graded by an LLM judge against eight criteria: coverage of all 6 WA pillars, identification of key risks (single-AZ ECS,
multi_az=false on RDS, CloudFront allow-all viewer protocol), auto-scaling recommendation for flash sale peaks, single Redis node as SPOF, cost right-sizing, and actionability of recommendations.Run commands
Command reference
| Flag | Description |
|---|---|
--grade | Run an LLM-as-judge grading pass after generation to produce quality scores |
--models <id> [<id> ...] | Override the model list from benchmark_config.yaml |
--config <path> | Use a custom config file |
--results <path> | Write results JSON to a custom path (default: results/benchmark-YYYYMMDD-HHMMSS.json) |
--concurrency <n> | Max parallel model calls (default: 4) |
Configuration
Model list, prompt, grading criteria, and per-model pricing are all configured inevals/benchmark_config.yaml:
pricing section with their per-token rates, and re-run to extend the table.
Benchmark task
Every model receives the same prompt: a Well-Architected review of a customer-facing e-commerce platform defined in Terraform. The architecture includes deliberate issues planted across all six pillars — single-AZ ECS,multi_az=false on RDS, a single-node Redis cluster, CloudFront with allow-all viewer protocol, and no auto-scaling — alongside real operational context (10,000 orders/day, 5× peak traffic, a 4-hour RTO target not being met, and a $18K/month bill growing 20% QoQ).
Models that require extended thinking (Claude Sonnet 5, Claude Opus 4) are handled automatically — the runner detects them by model ID and configures
additionalModelRequestFields with adaptive thinking. Models that don’t support temperature (e.g., Claude Opus 4.8) are also detected and retried without that parameter.