The operational-excellence skill teaches your AI coding agent to assess workloads against the AWS Well-Architected Operational Excellence pillar. It reads your actual CI/CD pipeline definitions, CloudWatch configurations, deployment strategies, and incident management patterns to produce evidence-backed findings — every gap is cited to a specific file and line number, never inferred from silence.
Reviews pipeline stages, deployment strategies (canary, blue/green, rolling), rollback mechanisms, test coverage gates, and artifact promotion flows across CodePipeline, GitHub Actions, GitLab CI, and CDK Pipelines.
Observability
Audits CloudWatch alarms, dashboard definitions, X-Ray and OpenTelemetry tracing configurations, structured logging libraries, correlation ID usage, and custom metric publishing.
Incident Management
Inspects alert routing (SNS, PagerDuty, OpsGenie), automated remediation Lambdas, SSM runbooks, escalation configs, and health check implementations.
Continuous Improvement
Looks for DORA metrics tracking, deployment success monitoring, operational dashboards, and cultural signals like CODEOWNERS, PR templates, and contributing guides.
The skill evaluates all ten Operational Excellence pillar questions with evidence sourced from your codebase.
OPS 1–3 — Organization, culture, and priorities
Checks for SLO/SLI definitions, CODEOWNERS files, business metric dashboards, PR templates, and contributing guides that reflect a healthy operational culture.
OPS 4 — Observability
Verifies structured logging with correlation IDs, distributed tracing across service boundaries, custom metrics for business-critical operations, and CloudWatch dashboards with meaningful panels.
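As an illustration of the kind of pattern the OPS 4 check looks for, here is a minimal sketch of structured JSON logging with a correlation ID, using only the Python standard library. The names `correlation_id`, `JsonFormatter`, `build_logger`, and `handle_request` are hypothetical and chosen for this example; they are not part of the skill or of any AWS SDK, and a real service might use a dedicated logging library instead.

```python
import json
import logging
import uuid
from contextvars import ContextVar

# Hypothetical context variable holding the correlation ID of the current request.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Attached to every line so log entries can be joined across services.
            "correlation_id": correlation_id.get(),
        }
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr if None)."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_id=None) -> str:
    """Reuse the caller's correlation ID if present, otherwise mint a new one."""
    cid = incoming_id or str(uuid.uuid4())
    token = correlation_id.set(cid)
    try:
        logger.info("order received")
        logger.info("order persisted")
    finally:
        correlation_id.reset(token)
    return cid
```

Calling `handle_request(build_logger("orders"), "abc-123")` emits two JSON lines that both carry `"correlation_id": "abc-123"`, which is the property a reviewer would look for when tracing one request through the logs.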
1. The agent reads every pipeline definition it can find (buildspec.yml, .github/workflows/, codepipeline-stack.ts, cdk-pipelines.ts). It documents each stage, deployment strategy, rollback mechanism, and test gate with file paths and line numbers.
2. Observability discovery. Every CloudWatch alarm, dashboard JSON, log group configuration, X-Ray/OTEL SDK import, and PutMetricData call is inventoried. The agent explicitly maps which services lack alarms and which flows lack distributed tracing.
3. Incident and event management discovery. Alert routing (SNS topics, webhook integrations), SSM Automation documents, Lambda remediation functions, and EventBridge rules are catalogued alongside any runbook markdown files.
4. Checkpoint: confirm before proceeding. The skill pauses and presents a summary of what was discovered across all three domains. You confirm before the WA framework evaluation begins.
5. WA framework evaluation (OPS 1–10). Each question is answered with status, evidence (file:line), gaps, and risk level using an Impact × Likelihood matrix.
6. Prioritized remediation plan. Findings are organized into Quick Wins (<1 week), Foundation (1–4 weeks), and Strategic (1–3 months) tracks.
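To make the discovery steps more concrete, the following is a rough sketch of how pipeline files could be located and matching lines recorded as file:line evidence. It is not the skill's actual implementation, which is not shown in this document. The file patterns come from the examples in the text; the keyword heuristics in `SIGNALS` and the names `Evidence`, `discover_pipeline_files`, and `collect_evidence` are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from pathlib import Path

# File patterns taken from the examples in the text; real repositories may use others.
PIPELINE_GLOBS = (
    "**/buildspec.yml",
    ".github/workflows/*.yml",
    ".github/workflows/*.yaml",
    "**/codepipeline-stack.ts",
    "**/cdk-pipelines.ts",
)

# Hypothetical keyword heuristics, deliberately simple for the sketch.
SIGNALS = {
    "rollback": re.compile(r"rollback", re.IGNORECASE),
    "deployment_strategy": re.compile(r"canary|blue.?green|rolling", re.IGNORECASE),
    "test_gate": re.compile(r"\b(pytest|npm test|coverage)\b", re.IGNORECASE),
}


@dataclass(frozen=True)
class Evidence:
    signal: str
    path: str
    line: int
    text: str

    def cite(self) -> str:
        """Format the location in the file:line style used by the findings."""
        return f"{self.path}:{self.line}"


def discover_pipeline_files(root: Path) -> list[Path]:
    """Return every file under `root` that matches a known pipeline pattern."""
    found = set()
    for pattern in PIPELINE_GLOBS:
        found.update(p for p in root.glob(pattern) if p.is_file())
    return sorted(found)


def collect_evidence(root: Path) -> list[Evidence]:
    """Scan each pipeline file line by line and record every signal match."""
    evidence = []
    for path in discover_pipeline_files(root):
        lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        for number, text in enumerate(lines, start=1):
            for signal, pattern in SIGNALS.items():
                if pattern.search(text):
                    rel = path.relative_to(root).as_posix()
                    evidence.append(Evidence(signal, rel, number, text.strip()))
    return evidence
```

Running `collect_evidence(Path("."))` in a repository root returns a list whose entries can be cited with `.cite()`. In this sketch, the absence of a keyword match only means the pattern was not found in the scanned files; it would still need to be checked against the files actually read before being reported as a gap.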
The skill rates every finding using Impact × Likelihood so you can triage immediately:
| Impact | Likelihood | Risk Level |
| --- | --- | --- |
| Severe | High | Critical |
| Severe | Medium / Low | High |
| Moderate | High | High |
| Moderate | Medium | Medium |
| Moderate | Low | Medium |
| Minor | High | Medium |
| Minor | Medium / Low | Low |
Severe = undetected outage, no rollback capability, data loss
Moderate = delayed detection, extended recovery time
Minor = inconvenience, manual workaround exists
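The Impact × Likelihood matrix above can be written as a small lookup table. This is a sketch for illustration only; the names `RISK_MATRIX` and `risk_level` are hypothetical, and the "Medium / Low" rows are expanded into separate entries.

```python
# Keys are (impact, likelihood) in lowercase; values are the resulting risk level.
RISK_MATRIX = {
    ("severe", "high"): "Critical",
    ("severe", "medium"): "High",
    ("severe", "low"): "High",
    ("moderate", "high"): "High",
    ("moderate", "medium"): "Medium",
    ("moderate", "low"): "Medium",
    ("minor", "high"): "Medium",
    ("minor", "medium"): "Low",
    ("minor", "low"): "Low",
}


def risk_level(impact: str, likelihood: str) -> str:
    """Look up the risk level for a finding; input is case-insensitive."""
    key = (impact.strip().lower(), likelihood.strip().lower())
    try:
        return RISK_MATRIX[key]
    except KeyError:
        raise ValueError(
            f"unknown impact/likelihood combination: {impact!r}, {likelihood!r}"
        ) from None
```

For example, `risk_level("Severe", "High")` returns `"Critical"`, while `risk_level("Minor", "Low")` returns `"Low"`.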
The skill was evaluated with an automated LLM-as-judge framework using paired comparison: the same prompt was run with and without the skill context, using Claude Opus 4.8.
| | Baseline | With skill | Delta |
| --- | --- | --- | --- |
| Score | 90% | 100% | +10% |
The skill raises a strong baseline because operational excellence patterns are often well-known, but it adds precision by citing specific deployment config issues and observability gaps that a bare agent frequently misses or under-specifies.