Architecture studies
Architecture documents from real cases — ADRs, design docs, post-mortem analyses and teardowns — with my reading as a solutions architect.
Design Doc: Zero Trust on AWS for Internal Service Access
This document proposes a Zero Trust architecture on AWS where identity, context, and device posture replace the network perimeter as the primary access control mechanism. The design covers workload segmentation, adaptive access via IAM Identity Center and Verified Access, and continuous audit instrumentation. The goal is to eliminate implicit trust based on network location without introducing excessive operational friction.
Design Doc: Enterprise RAG Platform with Continuous Evaluation and Guardrails on Bedrock
This document describes the architecture of an enterprise RAG platform built on Amazon Bedrock, covering semantic retrieval, continuous quality evaluation, safety guardrails, and cost control. The design prioritizes traceability, operability, and risk containment in regulated environments, without sacrificing acceptable end-user latency.
Design Doc: Multi-Region Active-Active Payments API
This document proposes a multi-region active-active architecture for a critical payments API, targeting near-zero RTO/RPO, deterministic conflict resolution in data replication, and a phased rollout that minimizes operational risk. The design draws on established payments engineering practice and AWS patterns, with explicit trade-offs between consistency, latency, and cost.
ADR: EventBridge vs Kafka/MSK for Order Processing
This ADR evaluates EventBridge and Amazon MSK as the event backbone for an order processing system, weighing throughput, ordering, replay, and operational burden. The decision rests on the trade-off between managed simplicity and platform control, with direct consequences for cost, operability, and delivery guarantees.
ADR: Aurora vs DynamoDB for a Double-Entry Ledger in Core Banking
This ADR evaluates Aurora PostgreSQL and DynamoDB as the persistence engine for a double-entry ledger in a core banking system, weighing strong consistency, access patterns, auditability, and cost. The decision favors Aurora with date-range partitioning and an immutable event layer, acknowledging the horizontal scaling constraints that choice imposes.
ADR: Modular Monolith vs Microservices in a Greenfield Fintech
An early-stage fintech faces the classic architecture decision: go straight to microservices or build a modular monolith first. This ADR examines the real forces at play — team size, validation speed, blast radius, and operational cost — and records the decision with its concrete consequences.
Figma: Horizontal Postgres Sharding Without Stopping Growth
In 2022, Figma hit the physical limits of its monolithic Postgres database and executed a horizontal sharding migration using key-based partitioning, dynamic routing, and incremental data movement — all without downtime and without halting product growth. This teardown reconstructs the architecture, analyzes the technical decisions, and points out what I would do differently.
Discord: How to Store Trillions of Messages — Cassandra → ScyllaDB Migration Teardown
Discord migrated its message storage from Apache Cassandra to ScyllaDB, eliminating unpredictable tail latencies and GC pauses that affected millions of users. This teardown reconstructs the architecture, examines the engineering decisions and trade-offs involved, and presents my critical read of what was done well — and what I would do differently.
Roblox 2021: 73 Hours of Downtime, Consul and the Load Effect
In October 2021, Roblox suffered 73 consecutive hours of unavailability, the largest outage in the platform's history. The root cause was twofold: a newly enabled Consul streaming feature caused heavy contention under unusually high read and write load, and the same load triggered a performance problem in BoltDB, the embedded store Consul uses for its Raft log. This post-mortem reconstructs the failure chain, analyzes the infrastructure decisions involved, and extracts lessons applicable to any platform relying on service mesh and distributed coordination.