The operational-excellence skill teaches your AI coding agent to assess workloads against the AWS Well-Architected Operational Excellence pillar. It reads your actual CI/CD pipeline definitions, CloudWatch configurations, deployment strategies, and incident management patterns to produce evidence-backed findings — every gap is cited to a specific file and line number, never inferred from silence.

What it does

CI/CD & Deployment

Reviews pipeline stages, deployment strategies (canary, blue/green, rolling), rollback mechanisms, test coverage gates, and artifact promotion flows across CodePipeline, GitHub Actions, GitLab CI, and CDK Pipelines.

Observability

Audits CloudWatch alarms, dashboard definitions, X-Ray and OpenTelemetry tracing configurations, structured logging libraries, correlation ID usage, and custom metric publishing.

Incident Management

Inspects alert routing (SNS, PagerDuty, OpsGenie), automated remediation Lambdas, SSM runbooks, escalation configs, and health check implementations.

Continuous Improvement

Looks for DORA metrics tracking, deployment success monitoring, operational dashboards, and cultural signals like CODEOWNERS, PR templates, and contributing guides.
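To make the DORA signals concrete, here is a minimal TypeScript sketch of two of the four DORA metrics (change failure rate and deployment frequency). The `Deployment` record and its field names are assumptions made for illustration; they are not a format the skill expects to find.

```typescript
// Hypothetical deployment record; the field names are assumptions for this sketch.
interface Deployment {
  finishedAt: Date;
  failed: boolean; // true if the release caused a rollback or incident
}

// Change failure rate: share of deployments that failed (0 when there are none).
function changeFailureRate(deployments: Deployment[]): number {
  if (deployments.length === 0) return 0;
  const failures = deployments.filter((d) => d.failed).length;
  return failures / deployments.length;
}

// Deployment frequency: average deployments per day over an observation window.
function deploymentsPerDay(deployments: Deployment[], windowDays: number): number {
  if (windowDays <= 0) throw new RangeError("windowDays must be positive");
  return deployments.length / windowDays;
}

const lastWeek: Deployment[] = [
  { finishedAt: new Date("2025-05-26T10:00:00Z"), failed: false },
  { finishedAt: new Date("2025-05-27T10:00:00Z"), failed: true },
  { finishedAt: new Date("2025-05-29T10:00:00Z"), failed: false },
  { finishedAt: new Date("2025-05-30T10:00:00Z"), failed: false },
];

console.log("change failure rate:", changeFailureRate(lastWeek));
console.log("deployments per day:", deploymentsPerDay(lastWeek, 7));
```

A repository that publishes numbers like these (for example as custom metrics or on a dashboard) is the kind of evidence this domain looks for.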

WA Operational Excellence pillar coverage

The skill evaluates all ten Operational Excellence pillar questions with evidence sourced from your codebase.
  • Checks for SLO/SLI definitions, CODEOWNERS files, business metric dashboards, PR templates, and contributing guides that reflect a healthy operational culture.
  • Verifies structured logging with correlation IDs, distributed tracing across service boundaries, custom metrics for business-critical operations, and CloudWatch dashboards with meaningful panels.
  • Examines automated testing (unit, integration, e2e), linting and security scanning, staged deployments, and pre-deployment quality gates.
  • Reviews CodeDeploy deployment configurations, feature flag SDK usage, automatic rollback alarm ARNs, and canary/blue-green deployment configurations.
  • Looks for SSM runbooks, pre-deployment checklists, load testing scripts, and operational readiness review artifacts.
  • Checks for CompositeAlarm constructs, health check endpoints, DORA metrics tracking, and pipeline health dashboards.
  • Examines EventBridge rule definitions, automated remediation Lambda functions, SNS topic subscriptions, and escalation configurations.
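
As an illustration of the structured-logging evidence mentioned above, the following TypeScript sketch emits one JSON log line per event and carries a correlation ID on each line. The helper names (`formatLogLine`, `logEvent`) and the field names are hypothetical; the skill does not require this particular shape.

```typescript
// Hypothetical structured-log entry; field names are illustrative assumptions.
interface LogFields {
  level: "INFO" | "WARN" | "ERROR";
  service: string;
  correlationId: string; // shared by every log line of one request
  message: string;
  [key: string]: unknown; // optional extra context
}

// Serializes one entry as a single JSON line so log tooling can filter on fields.
function formatLogLine(fields: LogFields, now: Date = new Date()): string {
  return JSON.stringify({ timestamp: now.toISOString(), ...fields });
}

// Writes the entry to stdout.
function logEvent(fields: LogFields): void {
  console.log(formatLogLine(fields));
}

logEvent({
  level: "INFO",
  service: "payments-service",
  correlationId: "req-1234",
  message: "charge accepted",
});
```

Because every line is machine-parseable and carries the same `correlationId` for a given request, a single query can reconstruct the request's path across services.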

How to invoke it

Ask your AI coding agent with a prompt such as the one below; the skill activates automatically:
Run an operational excellence review of this service

What the agent analyzes

1. CI/CD and deployment discovery
   The agent reads every pipeline definition it can find: buildspec.yml, .github/workflows/, codepipeline-stack.ts, cdk-pipelines.ts. It documents each stage, deployment strategy, rollback mechanism, and test gate with file paths and line numbers.

2. Observability discovery
   Every CloudWatch alarm, dashboard JSON, log group configuration, X-Ray/OTEL SDK import, and PutMetricData call is inventoried. The agent explicitly maps which services lack alarms and which flows lack distributed tracing.

3. Incident and event management discovery
   Alert routing (SNS topics, webhook integrations), SSM Automation documents, Lambda remediation functions, and EventBridge rules are catalogued alongside any runbook markdown files.

4. Checkpoint: confirm before proceeding
   The skill pauses and presents a summary of what was discovered across all three domains. You confirm before the WA framework evaluation begins.

5. WA framework evaluation (OPS 1–10)
   Each question is answered with status, evidence (file:line), gaps, and risk level using an Impact × Likelihood matrix.

6. Prioritized remediation plan
   Findings are organized into Quick Wins (<1 week), Foundation (1–4 weeks), and Strategic (1–3 months) tracks.
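
The last step can be pictured as a simple bucketing of findings by estimated effort. The TypeScript sketch below is not the skill's implementation: the `Finding` shape, the `effortDays` field, and the exact day thresholds (7 and 28, with anything longer treated as Strategic) are assumptions chosen to match the track descriptions above.

```typescript
type Track = "Quick Wins" | "Foundation" | "Strategic";

// Hypothetical finding record; evidence mirrors the file:line citations in the report.
interface Finding {
  id: string;
  title: string;
  evidence: { file: string; line: number };
  effortDays: number; // assumed estimate in calendar days
}

// Maps an effort estimate to a track: <1 week, 1–4 weeks, otherwise Strategic.
function trackFor(effortDays: number): Track {
  if (effortDays < 7) return "Quick Wins";
  if (effortDays <= 28) return "Foundation";
  return "Strategic";
}

// Groups findings into the three remediation tracks, preserving input order.
function groupByTrack(findings: Finding[]): Record<Track, Finding[]> {
  const plan: Record<Track, Finding[]> = {
    "Quick Wins": [],
    Foundation: [],
    Strategic: [],
  };
  for (const finding of findings) {
    plan[trackFor(finding.effortDays)].push(finding);
  }
  return plan;
}

// Renders evidence in the file:line form used throughout the assessment.
function formatEvidence(finding: Finding): string {
  return `${finding.evidence.file}:${finding.evidence.line}`;
}

const plan = groupByTrack([
  {
    id: "OPS-001",
    title: "All-at-once deployment to production with no rollback",
    evidence: { file: "infrastructure/codedeploy-stack.ts", line: 38 },
    effortDays: 2,
  },
]);
for (const finding of plan["Quick Wins"]) {
  console.log(`${finding.id} ${formatEvidence(finding)}`);
}
```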

Risk assessment matrix

The skill rates every finding using Impact × Likelihood so you can triage immediately:
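| Impact   | Likelihood   | Risk Level |
|----------|--------------|------------|
| Severe   | High         | Critical   |
| Severe   | Medium / Low | High       |
| Moderate | High         | High       |
| Moderate | Medium       | Medium     |
| Moderate | Low          | Medium     |
| Minor    | High         | Medium     |
| Minor    | Medium / Low | Low        |
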
Severe = undetected outage, no rollback capability, data loss
Moderate = delayed detection, extended recovery time
Minor = inconvenience, manual workaround exists
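
The matrix can also be read as a two-level lookup. The TypeScript sketch below encodes the rows listed above; the type and function names are illustrative and not part of the skill.

```typescript
type Impact = "Severe" | "Moderate" | "Minor";
type Likelihood = "High" | "Medium" | "Low";
type RiskLevel = "Critical" | "High" | "Medium" | "Low";

// One entry per Impact × Likelihood combination from the matrix.
const RISK_MATRIX: Record<Impact, Record<Likelihood, RiskLevel>> = {
  Severe: { High: "Critical", Medium: "High", Low: "High" },
  Moderate: { High: "High", Medium: "Medium", Low: "Medium" },
  Minor: { High: "Medium", Medium: "Low", Low: "Low" },
};

// Looks up the risk level for a finding.
function riskLevel(impact: Impact, likelihood: Likelihood): RiskLevel {
  return RISK_MATRIX[impact][likelihood];
}

console.log(riskLevel("Severe", "High"));
console.log(riskLevel("Minor", "Medium"));
```

Typing the table as `Record<Impact, Record<Likelihood, RiskLevel>>` lets the compiler reject a matrix with a missing combination.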

Example output

# Operational Excellence Assessment: payments-service

## Executive Summary
- Date: 2025-06-01
- Packages Analyzed: infrastructure/, .github/workflows/
- Findings: 1 Critical, 3 High, 4 Medium, 2 Low
- Overall Maturity: 2/5 — CI/CD in place but no safe deployment strategy; observability sparse

## Maturity Scorecard
| Domain                  | Score | Key Strength                   | Key Gap                              |
|-------------------------|-------|--------------------------------|--------------------------------------|
| CI/CD & Deployment      | 2/5   | Automated build and unit tests | All-at-once deploy, no rollback      |
| Observability           | 2/5   | CloudWatch logs enabled        | No alarms, no tracing, no dashboards |
| Incident Management     | 1/5   | SNS topic exists               | No routing, no runbooks              |
| Change Management       | 3/5   | PR review required             | No deployment approval gate          |
| Continuous Improvement  | 2/5   | Test suite present             | No DORA metrics tracked              |

## Critical Findings

**OPS-001** | CI/CD | All-at-once deployment to production with no rollback
The CodeDeploy configuration uses DeploymentType: IN_PLACE with AllAtOnce. A failed deploy
has no automatic rollback and requires a full re-deploy to recover.
Evidence: infrastructure/codedeploy-stack.ts:38
Recommendation: Switch to BLUE_GREEN with automatic rollback on alarm
AWS Services: CodeDeploy, CloudWatch Alarms
Effort: Medium (1–2 days)

Deployment strategy comparison

The skill identifies your current strategy and explains the safer alternatives:
# .github/workflows/deploy.yml (current, flagged by skill)
- name: Deploy
  # A block scalar keeps the backslash line continuation intact for the shell
  run: |
    aws deploy create-deployment \
      --deployment-config-name CodeDeployDefault.AllAtOnce
All-at-once has no automatic rollback. A failed deployment requires a full re-deploy. The skill flags this as Critical when used in production.
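
To show how a flag like this can be tied to file-and-line evidence, here is a rough TypeScript sketch that scans workflow text for the `CodeDeployDefault.AllAtOnce` configuration name. It is an assumption about how such a check could work, not the skill's actual mechanism, and `findAllAtOnce` is a hypothetical helper.

```typescript
// A single piece of evidence: where the flagged text was found.
interface Evidence {
  file: string;
  line: number; // 1-based line number
  text: string;
}

// Returns one Evidence entry per line that names the all-at-once deployment config.
function findAllAtOnce(file: string, content: string): Evidence[] {
  const hits: Evidence[] = [];
  content.split("\n").forEach((text, index) => {
    if (text.includes("CodeDeployDefault.AllAtOnce")) {
      hits.push({ file, line: index + 1, text: text.trim() });
    }
  });
  return hits;
}

// Inline copy of the workflow step above, used here instead of reading from disk.
const workflow = [
  "- name: Deploy",
  "  run: |",
  "    aws deploy create-deployment \\",
  "      --deployment-config-name CodeDeployDefault.AllAtOnce",
].join("\n");

for (const hit of findAllAtOnce(".github/workflows/deploy.yml", workflow)) {
  console.log(`${hit.file}:${hit.line} ${hit.text}`);
}
```

A plain substring match is deliberately simple; a real check would also need to confirm that the deployment targets production before rating the finding Critical.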

Effectiveness

Evaluated with an automated LLM-as-judge framework in a paired comparison (same prompt, with and without skill context), using Claude Opus 4.8.

|       | Baseline | With skill | Delta |
|-------|----------|------------|-------|
| Score | 90%      | 100%       | +10%  |

The baseline is already strong because operational excellence patterns are widely known. The skill adds precision by citing specific deployment configuration issues and observability gaps that a bare agent frequently misses or under-specifies.

Follow-up actions the agent offers

After delivering the assessment, the agent offers to:
  • Generate CloudWatch alarm and dashboard IaC for every gap identified
  • Create SSM runbook documents for specific failure modes found in the codebase
  • Design a self-healing architecture for recurring issues (DLQ reprocessing, auto-scaling events)
  • Implement structured logging with correlation IDs in the application code
  • Deep-dive into a specific domain: CI/CD, observability, or incident response

| Skill                        | When to use instead |
|------------------------------|---------------------|
| wa-review                    | Run a full cross-pillar review including all five other pillars alongside Operational Excellence |
| wa-guardrails                | Turn the findings from this assessment into preventive CI/CD checks and Config rules |
| reliability-improvement-plan | Focus specifically on fault tolerance, recovery, and single points of failure |