Learning path
Security & Resilience
Zero Trust, observability, chaos engineering and DR — building systems that withstand failure and attack.
The path · 7
- 1 · Advanced · 10 min · Cognito Multi-Region: Migrating Identity to High Availability
  Authentication is critical infrastructure: a regional Cognito failure brings down the entire user journey. With Cognito multi-Region replication now available, there is a concrete path to elevating the identity plane to the same resilience level we already demand from databases and queues. In this article, I document the migration journey, the architecture decisions, and the risks that need active management.
- 2 · Advanced · 8 min · Scalable User Search with Amazon Cognito: A Deep-Dive Analysis
  Amazon Cognito excels at authentication, but its user-listing API was never designed for high-frequency search against large user pools. In this article, I analyze how to build a scalable search layer on top of Cognito, the failure modes that emerge when you ignore native API limits, and the real trade-offs between eventual consistency, data privacy, and operational cost.
- 3 · Advanced · 9 min · Postmortem: When AI Meets Resilience - AWS Resilience Hub and SRE
  AWS Resilience Hub gained generative AI capabilities for failure mode analysis and runbook generation, a change that looks incremental but redefines how SRE teams operate in production. In this retrospective, I analyze what this evolution means in practice, where it fails, and how to integrate these tools into financial-grade systems without creating new fragile dependencies.
- 4 · Advanced · 11 min · AWS WAF and AI Bot Traffic Monetization: A Technical Review
  AWS WAF has gained native capability to identify and route AI bot traffic, a shift that turns a defensive tool into a revenue control point. In this article, I analyze what the feature actually delivers, where it falls short, and how to integrate it safely in financial-grade architectures.
- 5 · Expert · 11 min · Ransomware Recovery Patterns on AWS: A Technical Review
  Ransomware remains the highest-financial-impact threat vector in enterprise environments, and AWS provides solid technical primitives to build real resilience. In this analysis, I examine the recovery patterns published by the AWS Architecture Blog through the lens of someone who has operated DR plans in regulated financial environments. The result is an honest view: where these patterns deliver real value, where operational gaps exist, and what you need to add for them to hold under pressure.
- 6 · Expert · 9 min · OIDC Session Metadata and Zero Trust: An Architecture Decision Record
  Session metadata support in Sign in with Google opens a genuine window for continuous, signal-driven adaptive access, not just at login time. In this ADR, I analyze the architectural forces, options considered, and the decision I would make in a high-criticality financial system integrated with AWS.
- 7 · Expert · 8 min · AI Agents for Security and DevOps: Productivity or Risk?
  AWS launched frontier agents for security testing and cloud operations, opening a real debate about how far AI autonomy can go in regulated environments. This article compares four deployment patterns: fully autonomous agent, semi-autonomous with human approval, assisted (copilot), and deterministic pipeline. The comparison uses concrete criteria of risk, cost, latency, and compliance.
Deep-dive studies
- adr · ADR: AWS Transform & AI Agents vs Traditional Modernization Factory
  This ADR evaluates the decision to adopt AWS Transform (with AI agents for .NET, Mainframe, VMware, and custom code) versus a traditional human-engineering modernization factory, or a hybrid approach. The analysis covers regression risk, test coverage, code ownership, security, total cost, and change governance in an enterprise-scale modernization program.
- design-doc · Design Doc: Continuous Evaluation Suite for Agents with Bedrock AgentCore
  LLM agents in production silently degrade as models, tools, and prompts evolve; without a continuous evaluation discipline, regressions reach users before they are detected. This document proposes a complete offline and online evaluation architecture using Amazon Bedrock AgentCore, with versioned datasets, CI/CD quality gates, runtime signals, and systematic adversarial testing.
- design-doc · Design Doc: LLM Observability - from GPU Utilization to Response Quality
  This document proposes an end-to-end observability architecture for LLM inference platforms running on Amazon SageMaker AI and Amazon Bedrock, covering everything from hardware metrics (GPU utilization, memory) to semantic response quality, behavioral drift, and per-tenant cost. The design integrates CloudWatch, Amazon Managed Grafana, prompt-level tracing, and automated regression alarms, with clear separation of concerns across collection, storage, evaluation, and alerting layers.
- adr · ADR: Cognito Multi-Region for Resilient Authentication
  This ADR examines when and how to adopt multi-region User Pool replication in Amazon Cognito to reduce authentication downtime on identity platforms with high-availability requirements. It covers regional failover, customer-managed KMS keys, user synchronization, session and token impact, custom domains, and customer experience, with explicit reasoning on operational and cost trade-offs.
- adr · ADR: OpenSearch Serverless vs Dedicated Vector Database for Agentic RAG
  This ADR evaluates vector search infrastructure options for a multi-tenant agentic RAG platform on AWS, comparing OpenSearch Serverless, dedicated vector databases (Pinecone, pgvector), and a self-managed hybrid search layer. The decision weighs cost, p99 latency, permission-based filtering, incremental ingestion, and native Bedrock Knowledge Bases integration.
- design-doc · Design Doc: SRE Journey with GenAI using AWS Resilience Hub
  This document proposes an SRE platform built on AWS Resilience Hub with a GenAI layer to automate dependency discovery, failure mode analysis, and runbook generation for critical applications. The goal is to reduce operational risk through modular resiliency policies and organization-level consolidated reports, replacing manual processes prone to coverage gaps. The design prioritizes traceability, incremental automation, and integration with existing CI/CD pipelines.
Open source to explore
- aws-event-driven-finops-platform · Event-driven AWS banking reference architecture with FinOps, security, and a live frontend.
- aws-agentic-ai-reference-architecture · AWS reference architecture for production agentic AI: security, observability, and DevSecOps.
- Deskbuddy · ESP32 touchscreen smart desk dashboard: firmware, web UI, and browser installer in one repo.
- solution-architecture-mcp-toolkit · Bilingual MCP toolkit for ADRs, threat modeling, and governed Well-Architected reviews.
- bedrock-agent-starter · Production-shaped Amazon Bedrock agent starter: tools, IaC, and evals in 30 minutes.
- aws-ai-reference-architectures · Six AWS AI reference architectures: diagrams, IaC skeletons, and Well-Architected analysis.