# Design Doc: SRE Journey with GenAI using AWS Resilience Hub

This document proposes an SRE platform built on AWS Resilience Hub with a GenAI layer to automate dependency discovery, failure mode analysis, and runbook generation for critical applications. The goal is to reduce operational risk through modular resiliency policies and organization-level consolidated reports, replacing manual processes prone to coverage gaps. The design prioritizes traceability, incremental automation, and integration with existing CI/CD pipelines.

- URL: https://fernando.moretes.com/studies/design-doc-resilience-hub-genai-sre-journey

- Markdown: https://fernando.moretes.com/studies/design-doc-resilience-hub-genai-sre-journey/study.md?lang=en

- Type: Design Doc / RFC

- Company: SRE platform (cenário)

- Domain: Resiliência / SRE

- Date: 2026-06-03

- Tags: sre, aws-resilience-hub, genai, resiliency, failure-mode-analysis, runbooks, well-architected, operational-risk

- Reading time: 11 min

---

SRE teams in organizations running dozens of critical applications face a structural problem: resilience analysis is manual, sporadic, and rarely covers the real dependency chain. AWS Resilience Hub, combined with GenAI capabilities, offers a concrete path to automate this cycle — from discovery to runbook. This RFC details how to build that journey incrementally, with explicit trade-offs and measurable success criteria.

## The Problem: Resilience as a Point-in-Time Audit

In financial and mission-critical environments I've worked in, the dominant resilience management pattern looks like this: once a quarter (or before an audit), someone produces a Business Impact Analysis document, manually maps dependencies in a spreadsheet, and defines RTOs/RPOs that frequently don't reflect the actual production architecture. When an incident occurs, the runbook is outdated, the critical dependency nobody documented is exactly the one that failed, and the post-mortem reveals the system was never tested against that specific failure mode.

This pattern has three identifiable root causes. First, **dependency discovery is expensive and manual**: mapping what a service actually consumes — SQS queues, DynamoDB tables, third-party endpoints, chained Lambda functions — requires access to multiple sources of truth (CloudFormation, Service Catalog, X-Ray, Config) that are rarely queried together. Second, **failure mode analysis requires scarce expertise**: a senior engineer needs to reason about what happens when each component fails, in what sequence, and with what impact — work that doesn't scale across dozens of applications. Third, **resiliency policies are defined per application, without organizational visibility**: each squad defines its own RTO/RPO thresholds without reference to a corporate policy, creating inconsistencies that only surface during real incidents.

AWS Resilience Hub addresses the first two problems through automatic discovery via CloudFormation, Terraform, and AWS Service Catalog, resilience analysis against defined policies, and — more recently — GenAI-assisted generation of recommendations and runbooks. This document proposes how to structure that adoption in a way that produces real value, not just compliance reports.

## Goals and Non-Goals

- ✅ GOAL: Automate dependency discovery for all critical applications (Tier-1 and Tier-2) registered in AWS Resilience Hub, with continuous updates via CI/CD pipeline.
- ✅ GOAL: Use GenAI (Amazon Bedrock) to generate assisted failure mode analysis (FMA), reducing analysis production time per application from days to hours.
- ✅ GOAL: Define modular resiliency policies by business criticality (Tier-1: RTO ≤ 1h / RPO ≤ 15min; Tier-2: RTO ≤ 4h / RPO ≤ 1h) applied centrally via AWS Organizations.
- ✅ GOAL: Generate versioned operational runbooks linked to specific CloudWatch alarms, replacing static documentation.
- ✅ GOAL: Produce consolidated resilience reports per business unit for consumption by technical leadership and risk committees.
- ❌ NON-GOAL: Replace chaos testing (Chaos Engineering) — Resilience Hub complements, not replaces, tools like AWS Fault Injection Service.

## Scenario Fact Sheet

- **Context:** Corporate SRE platform — composite scenario based on real patterns from AWS financial organizations
- **Estimated scale:** 50–150 critical applications, 8–20 AWS accounts, multiple regions (us-east-1 primary, sa-east-1 secondary)
- **Main tools:** AWS Resilience Hub, Amazon Bedrock (Claude 3), AWS Organizations, CloudFormation, AWS Config, Amazon CloudWatch, AWS Systems Manager (SSM)
- **Discovery sources:** CloudFormation stacks, Terraform state via S3, AWS Service Catalog, Resource Groups
- **Resiliency policies:** Tier-1 (RTO 1h / RPO 15min), Tier-2 (RTO 4h / RPO 1h), Tier-3 (RTO 24h / RPO 4h)
- **GenAI model:** Amazon Bedrock with Claude 3 Sonnet (FMA analysis and runbooks); Claude 3 Haiku (report summarization)
- **CI/CD integration:** AWS CodePipeline + CodeBuild with resilience assessment step as deploy gate
- **Regulatory:** Aligned with BACEN 4.557 (operational risk), ISO 22301 (business continuity), SOC 2 Type II

## Proposed Design: Three Automation Layers

The design organizes the SRE journey into three functional layers that feed sequentially but can be adopted incrementally.

**Layer 1 — Continuous Discovery and Modeling**: AWS Resilience Hub imports each application's definition from multiple sources (CloudFormation stacks, Terraform state stored in S3, AWS Service Catalog portfolios). This import is not a one-time event — it is triggered automatically via EventBridge whenever a stack is updated in production, ensuring the resilience model reflects the actual architecture. The result is a versioned dependency graph per application, with components classified by type (compute, storage, networking, database, messaging) and by criticality derived from the associated resiliency policy.

A point that is frequently underestimated: **tag quality** on AWS resources is the limiting factor of this layer. Without consistent `Application`, `Tier`, and `Environment` tags, automatic discovery produces fragmented graphs that require manual curation. Therefore, the operational prerequisite for this phase is implementing a mandatory tagging policy via AWS Config Rules and Service Control Policies (SCPs) on target accounts.

**Layer 2 — GenAI-Assisted Analysis**: After discovery, Resilience Hub runs its native resilience analysis, comparing the discovered architecture against the RTO/RPO policy defined for that tier. The result is a set of structured recommendations (e.g., "add Multi-AZ to RDS cluster X", "configure DLQ on SQS queue Y"). This is where the GenAI layer enters: an orchestrator Lambda invokes Amazon Bedrock (Claude 3 Sonnet) with the Resilience Hub analysis payload as context, requesting three specific outputs:

1. **Failure Mode Analysis (FMA)**: For each critical component, the model reasons about the most likely failure scenarios, cascade impact, and estimated severity. The prompt is structured with the dependency graph, policy thresholds, and incident history (when available via CloudWatch Logs Insights).

2. **Operational Runbook**: For each high-priority recommendation, the model generates a runbook in SSM Document format, with diagnostic steps, remediation commands, and escalation criteria. The runbook is versioned in S3 and linked to the corresponding CloudWatch alarm.

3. **Executive Summary**: Claude 3 Haiku (faster and cheaper) generates a natural language summary for consumption by technical leaders and risk committees, without infrastructure jargon.

**Layer 3 — Governance and Feedback Loop**: Results from all analyses are aggregated in a central dashboard (Amazon QuickSight) with views by business unit, criticality tier, and temporal trend of the resilience score. An EventBridge Scheduler triggers automatic weekly re-evaluations for Tier-1 applications. The feedback loop closes when generated runbooks are executed during real incidents and results are captured via SSM Automation for future prompt refinement.

## SRE Platform Architecture with GenAI

Complete flow from dependency discovery to runbook generation and organizational reports, showing the three automation layers and integration points with CI/CD and governance.

### 🔍 Camada 1 — Descoberta / Discovery Layer

- CloudFormation Stacks (ci)
- Terraform State (S3) (storage)
- Service Catalog Portfolios (ci)
- AWS Resilience Hub App Registry (compute)

### ⚙️ Camada 2 — Análise GenAI / GenAI Analysis Layer

- EventBridge Trigger (messaging)
- Lambda Orchestrator (compute)
- Amazon Bedrock Claude 3 Sonnet (ai)
- Amazon Bedrock Claude 3 Haiku (ai)
- FMA Output (S3 versioned) (storage)
- SSM Documents Runbooks (compute)

### 📊 Camada 3 — Governança / Governance Layer

- CloudWatch Alarms (security)
- Amazon QuickSight Dashboard (frontend)
- EventBridge Scheduler (weekly) (messaging)
- AWS Organizations Policy Hub (security)

### 🚀 CI/CD Integration

- CodePipeline Deploy Gate (ci)
- CodeBuild Resilience Check (ci)

### Flows

- cfn -> rh: imports stack
- tf -> rh: imports state
- sc -> rh: imports portfolio
- rh -> eb: assessment complete
- eb -> orch: triggers
- orch -> bedrock: FMA + runbook prompt
- orch -> haiku: executive summary
- bedrock -> fma: saves FMA
- bedrock -> ssm: creates runbook
- ssm -> cw: links alarm
- fma -> qs: feeds dashboard
- haiku -> qs: executive summary
- sched -> rh: weekly re-assessment
- org -> rh: tier policies
- pipe -> cb: deploy gate
- cb -> rh: checks score

## Evaluated Design Alternatives

### AWS Resilience Hub + Bedrock (proposed)

**Pros**
- Automatic discovery integrated with CloudFormation/Terraform without additional agent
- GenAI contextualized with the application's actual dependency graph
- Resiliency policies managed centrally via Organizations
- Native integration with SSM for executable runbooks

**Cons**
- Bedrock inference cost can be significant for organizations with many applications (estimate: $0.50–2.00 per full analysis with Sonnet)
- Quality of generated FMA depends on tag quality and completeness of the discovered graph
- Resilience Hub does not cover non-AWS resources natively

**Verdict:** Recommended for AWS-native organizations with minimum IaC maturity (CloudFormation or Terraform)

### Manual Process with FMA Templates

**Pros**
- No additional tooling cost
- Full control over analysis content

**Cons**
- Does not scale beyond 10–15 applications per senior engineer per quarter
- Dependency graphs become stale quickly in frequent-deploy environments
- Inconsistent coverage across squads and business units

**Verdict:** Inadequate for organizations with more than 20 critical applications

### Third-Party Solution (e.g., Gremlin, Steadybit)

**Pros**
- Specific focus on chaos engineering with mature experiment library
- Multi-cloud by design

**Cons**
- Does not integrate with AWS Organizations resiliency policies natively
- High license cost; overlap with AWS Fault Injection Service
- Does not generate runbooks integrated with SSM

**Verdict:** Complementary to the proposed design for chaos testing, not a substitute for resilience analysis

### Custom LLM with RAG over internal documentation

**Pros**
- Analyses contextualized with incident history and internal runbooks
- Potential for more precise recommendations with proprietary data

**Cons**
- Much higher engineering complexity; requires dedicated ML team
- Time-to-value of 6–12 months vs. weeks with Bedrock + Resilience Hub
- Continuous maintenance of RAG pipeline and knowledge base curation

**Verdict:** Natural evolution of the proposed design in Phase 3, not a starting point

## Security and Data Governance Considerations

In regulated financial environments, introducing GenAI into resilience analysis processes raises legitimate questions that need to be explicitly addressed in the design, not treated as afterthoughts.

**Dependency graph confidentiality**: The payload sent to Amazon Bedrock contains sensitive architectural information — resource names, network topology, database configurations. It is necessary to ensure: (1) Bedrock is being invoked via VPC endpoint (Interface VPC Endpoint) so traffic does not traverse the public internet; (2) Claude 3 models via Bedrock do not use customer data for training by default (per AWS documentation) — but this must be verified and documented for compliance purposes; (3) model outputs (FMAs and runbooks) are stored in S3 with SSE-KMS encryption and access restricted by specific IAM roles.

**Resilience Hub access control**: Resiliency policies defined at the Organizations level represent critical business decisions. Write access to Resilience Hub must be restricted to a `ResiliencyPolicyAdmin` role managed by the platform team, with all changes audited via CloudTrail. Product squads have read access to query their application scores but cannot modify policies or thresholds.

**GenAI output validation**: This is the most critical point in the design. An incorrectly generated runbook executed during an incident can worsen the situation. The validation process has three steps: (1) mandatory human review by an SRE engineer before publishing any new runbook; (2) execution in a staging environment with AWS Fault Injection Service to validate that remediation steps produce the expected effect; (3) explicit versioning with status (`draft`, `reviewed`, `validated`, `deprecated`) — only runbooks with `validated` status are linked to production alarms.

**Inference cost as operational risk**: In an organization with 100 Tier-1 and Tier-2 applications, weekly re-evaluations with Claude 3 Sonnet can generate inference costs in the range of $200–800/month (estimate based on ~2000 input tokens and ~1500 output tokens per analysis, at $0.003/1K input tokens and $0.015/1K output tokens for Sonnet). This cost is justifiable but must be monitored with AWS Cost Anomaly Detection and a specific budget alert for the analysis workload.

## Phased Rollout Plan

1. **Phase 0 — Prerequisites (Weeks 1–3)** — Audit and correction of tagging across all target accounts. Implementation of mandatory Config Rules for `Application`, `Tier`, `Environment`, `Owner` tags. Creation of account structure in Organizations with centralized SRE platform account. Definition and approval of resiliency policies by tier with business stakeholders. Configuration of VPC Endpoints for Resilience Hub and Bedrock in relevant accounts.

2. **Phase 1 — Pilot with 5 Tier-1 Applications (Weeks 4–7)** — Registration of the 5 most critical applications in Resilience Hub. Manual execution of first analyses and validation of discovered dependency graphs against product team knowledge. Development and testing of orchestrator Lambda with Bedrock integration. Generation of first runbooks and review by senior SRE engineers. Collection of qualitative feedback on quality and utility of generated FMAs.

3. **Phase 2 — Expansion and CI/CD Automation (Weeks 8–12)** — Expansion to all Tier-1 and Tier-2 applications (estimate: 30–60 applications). Implementation of resilience gate in CodePipeline: deploys that reduce resilience score below policy threshold require explicit manual approval. Configuration of EventBridge Scheduler for automatic weekly re-evaluations. Launch of QuickSight dashboard for technical leadership. Training of product squads on the process of consulting and interpreting reports.

4. **Phase 3 — Maturity and RAG (Weeks 13–20)** — Implementation of RAG (Retrieval-Augmented Generation) with Bedrock Knowledge Base fed by internal post-mortem history and validated runbooks. Integration with AWS Fault Injection Service for automated runbook validation in staging. Expansion to Tier-3 applications. Automatic monthly reports for risk committees with KMS digital signature. Multi-region coverage assessment and eventual extension to hybrid workloads via AWS Systems Manager.

> **Critical Risks and Mitigations:** **RISK 1 — Excessive trust in GenAI outputs**: The biggest risk of this architecture is not technical — it is organizational. Teams under pressure tend to treat AI-generated recommendations as absolute truth, especially when the model writes with confidence. Mitigation: the runbook validation process (draft → reviewed → validated) is non-negotiable and must be culturally reinforced, not just documented. Consider an explicit SLA: no runbook enters production without human review in less than 48h.

**RISK 2 — Drift between model and reality**: The dependency graph in Resilience Hub reflects the IaC state, not necessarily the actual infrastructure state if there are manually created resources (ClickOps). In organizations with partial IaC maturity, this creates a false sense of coverage. Mitigation: AWS Config with drift detection rules + alerts for resources not managed by IaC.

**RISK 3 — CI/CD gate as deploy blocker in crisis**: A resilience gate that blocks deploys can be counterproductive during an active incident where the fix requires an urgent deploy. Mitigation: implement a bypass mechanism with approval from two senior engineers + automatic notification to the CISO, audited via CloudTrail.

**RISK 4 — Bedrock costs at scale**: Frequent analyses of many applications can generate unexpected costs. Mitigation: implement throttling in the orchestrator Lambda, use Haiku for low-priority analyses, and configure Cost Anomaly Detection with an alert threshold at 150% of the monthly baseline.

## Alignment with AWS Well-Architected Framework

- **security**: VPC Endpoints for Bedrock and Resilience Hub eliminate public internet exposure. IAM roles with least-privilege for Resilience Hub access. GenAI outputs encrypted with SSE-KMS. CloudTrail enabled for all administrative operations in Resilience Hub and Organizations.
- **reliability**: Directly addresses the Reliability pillar: automatic dependency discovery, RTO/RPO definition by policy, resilience testing integrated into CI/CD, and versioned operational runbooks. Aligned with practices REL-6 (manage changes), REL-9 (test recovery procedures), and REL-10 (automatic recovery from failure).
- **performance**: Orchestrator Lambda with reserved concurrency for Tier-1 analyses ensures predictable latency. Analysis results stored in S3 with caching to avoid unnecessary Bedrock reinvocations when the graph has not changed.
- **sustainability**: Use of Haiku (smaller model) for summarization tasks reduces computational consumption. Incremental analyses (only when graph changes) avoid unnecessary processing.

## Success Metrics and Targets

- **Analysis coverage:** 100% of Tier-1 and Tier-2 applications with analysis updated within ≤ 7 days after any stack change
- **FMA production time:** Reduction from 3–5 days (manual) to < 4 hours (GenAI-assisted + human review)
- **Tier-1 resilience score:** ≥ 85% of Tier-1 applications with 'High' score in Resilience Hub after 6 months
- **Validated runbooks:** ≥ 1 validated runbook per Tier-1 application; 0 runbooks in 'draft' status linked to production alarms
- **MTTD of resilience gaps:** Reduction from reactive discovery (post-incident) to proactive (≤ 24h after deploy that introduces gap)
- **Inference cost:** < $1,000/month for 100 applications with weekly Tier-1 and monthly Tier-2 re-evaluation (estimate)
- **Squad adoption:** ≥ 80% of product squads consulting the resilience dashboard monthly after 3 months of operation

> **My Senior Perspective:** I've worked in financial systems where resilience analysis was treated as a compliance exercise — a document produced before an audit and forgotten until the next one. The fundamental problem wasn't lack of tools; it was that the cost of keeping analyses current was higher than the perceived cost of not having them. AWS Resilience Hub with GenAI genuinely changes that equation, but with an important caveat I don't see discussed enough: **the real value is not in automatic FMA generation — it's in continuous dependency discovery**.

Most serious incidents I've investigated didn't fail because someone didn't know component X was critical. They failed because nobody knew component X depended on component Y, which in turn depended on a third-party service with an SLA lower than required. The Resilience Hub dependency graph, kept current via EventBridge, is the most valuable artifact in this architecture — the Bedrock-generated FMA is second.

If I had to prioritize a single thing to start: invest the first two weeks exclusively in tag quality and IaC coverage. A resilience analysis based on an incomplete graph is worse than having no analysis — it creates false confidence. Only after having discovery working correctly for 5 pilot applications would I add the GenAI layer.

On the question of trusting model outputs: I'm skeptical by principle of any system that generates operational recommendations without structured human validation. The draft → reviewed → validated process is not bureaucracy — it's the mechanism that converts model-generated text into a reliable operational asset. Remove that step and you have a system that looks sophisticated but can do more harm than good during a real incident.

## Verdict

The proposed design is technically sound and operationally viable for AWS-native organizations with minimum IaC maturity. The combination of AWS Resilience Hub for structured discovery and analysis with Amazon Bedrock for assisted FMA and runbook generation solves a real scaling problem that manual processes cannot address beyond 20–30 critical applications.

The non-negotiable prerequisites are two: consistent tag quality across AWS accounts (without this, discovery is incomplete and value drops dramatically) and a rigorous human validation process for GenAI outputs before any operational use. These two elements are more important than any implementation detail of the architecture.

The phased rollout is deliberately conservative — starting with 5 pilot applications before expanding is not timidity, it is the only way to calibrate the quality of generated analyses and build organizational confidence in the process. The CI/CD gate is the highest long-term impact artifact: it institutionalizes resilience as a deploy criterion, not a periodic exercise.

For organizations in regulated sectors (financial, healthcare, critical infrastructure), this design also addresses audit requirements more traceably than manual processes — automatically generated reports with versioning and KMS signatures are more robust control evidence than quarterly-updated spreadsheets. The infrastructure investment is modest; the investment in process change and validation culture is where the real effort lies.

## References

- [AWS Resilience Hub — Product Page](https://aws.amazon.com/resilience-hub/)
- [AWS News Blog — AWS Resilience Hub with Generative AI](https://aws.amazon.com/blogs/aws/)
- [AWS Well-Architected Framework — Reliability Pillar](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html)
- [Amazon Bedrock — Documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-bedrock.html)
- [AWS Fault Injection Service — Documentation](https://docs.aws.amazon.com/fis/latest/userguide/what-is.html)
- [AWS Systems Manager — SSM Documents](https://docs.aws.amazon.com/systems-manager/latest/userguide/documents.html)
- [AWS Organizations — Service Control Policies](https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_scps.html)
- [AWS Config — Managed Rules](https://docs.aws.amazon.com/config/latest/developerguide/managed-rules-by-aws-config.html)

## Case sources

- [AWS News Blog — AWS Resilience Hub with generative AI](https://aws.amazon.com/blogs/aws/)
- [AWS Resilience Hub](https://aws.amazon.com/resilience-hub/)
- [AWS Well-Architected — Reliability Pillar](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html)
