ADR: EventBridge vs Kafka/MSK for Order Processing
This ADR evaluates EventBridge and Amazon MSK as the event backbone for an order processing system, weighing throughput, ordering, replay, and operational burden. The decision is grounded in real trade-offs between managed simplicity and platform control, with direct consequences on cost, operability, and delivery guarantees.
Choosing the wrong event backbone for an order processing system is not a matter of technology preference — it is a business risk. EventBridge and MSK solve similar problems on the surface, but diverge deeply in delivery guarantees, throughput, ordering, and operational cost. This ADR documents the reasoning behind the decision and the trade-offs every architect must understand before making this choice.
Scenario Fact Sheet
- System: Order processing platform (e-commerce / marketplace)
- Domain: Event-driven architecture
- Estimated volume: 5,000–50,000 orders/hour under normal operation; 10x spikes on peak dates (estimate)
- Event consumers: Inventory, payments, notifications, fraud, analytics, fulfillment
- Critical requirements: Per-order ordering, replay for reprocessing, at-least-once delivery, audit trail
- Cloud: AWS (single region, multi-AZ)
- Existing stack: Lambda, ECS/Fargate, RDS Aurora, S3, CloudWatch
- Decision status: Accepted
Context and Forces at Play
The order system is the operational core of the platform. Every state transition of an order — created, paid, picked, shipped, delivered, cancelled — must be reliably propagated to multiple downstream consumers. A failure in that propagation can result in double billing, incorrect inventory, missed customer notification, or inconsistency in the fraud system.
The engineering team has solid experience with AWS managed services and Lambda, but no operational history with Kafka. The platform team has two senior engineers dedicated to infrastructure. Business pressure favors fast time-to-market, but the CTO has signaled that any order-loss incident carries high political cost.
The forces shaping this decision are:
- Ordering: individual orders must be processed in sequence. An `order.cancelled` event must not be processed before `order.paid` for the same `orderId`.
- Throughput: current volume is not extreme, but Black Friday and seasonal spikes require real elasticity.
- Replay: the analytics team and fraud team need to reprocess historical events when models change or bugs are fixed.
- Operational burden: with a small team, every hour spent tuning brokers is an hour away from features.
- Cost: MSK has fixed infrastructure cost regardless of usage; EventBridge charges per published event.
- AWS ecosystem integration: the existing stack is 100% AWS; native integrations reduce glue code.
Why This Decision Is Non-Trivial
The common temptation is to treat EventBridge as "simpler managed Kafka". That analogy is dangerous. EventBridge is a rules-based event router — it routes events from sources to targets based on content patterns. Kafka (and MSK) is a distributed commit log — it stores events durably and in order, allowing multiple independent consumers to read at their own pace.
These differences have concrete implications:
Ordering: EventBridge does not guarantee ordering between events. For an order system where the sequence created → paid → shipped is semantically critical, this requires consumers to implement their own ordering logic or the application to tolerate out-of-order events. MSK with partitioning by orderId guarantees ordering within a partition — which is exactly what we need.
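To make "their own ordering logic" concrete, here is a minimal sketch of what a consumer on an unordered bus might have to carry. It assumes each event exposes an order ID and a status, and the transition table is hypothetical, loosely derived from the states named in the context section; the real order lifecycle may differ. The sketch also ignores duplicates and persistence, which a production consumer could not.

```python
from dataclasses import dataclass, field

# Hypothetical transition table (assumption, not taken from the real system).
ALLOWED = {
    None: {"created"},
    "created": {"paid", "cancelled"},
    "paid": {"picked", "cancelled"},
    "picked": {"shipped", "cancelled"},
    "shipped": {"delivered"},
    "delivered": set(),
    "cancelled": set(),
}


@dataclass
class OrderingGuard:
    """Tracks the last applied status per order and parks early arrivals."""
    last_status: dict = field(default_factory=dict)
    parked: dict = field(default_factory=dict)

    def accept(self, order_id: str, status: str) -> list:
        """Park the event, then apply every parked event that is now valid.

        Returns the statuses applied by this call, in application order.
        """
        applied = []
        pending = self.parked.setdefault(order_id, [])
        pending.append(status)
        progressed = True
        while progressed:
            progressed = False
            current = self.last_status.get(order_id)
            for candidate in list(pending):
                if candidate in ALLOWED[current]:
                    self.last_status[order_id] = candidate
                    pending.remove(candidate)
                    applied.append(candidate)
                    progressed = True
                    break
        return applied


guard = OrderingGuard()
print(guard.accept("o1", "paid"))     # arrives early, stays parked
print(guard.accept("o1", "created"))  # unblocks the parked event
```

Every consumer would need a guard like this, plus durable storage for the parked events. With a log partitioned by `orderId`, that per-consumer state is unnecessary because the broker preserves the sequence.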
Replay: EventBridge Archive allows event replay, but with limitations — replay targets the same destination and does not support fine-grained per-consumer filtering. MSK retains the log for a configurable period (days to weeks) and any consumer group can reposition its offset to any point in time, independently of others. For the analytics team that needs to reprocess 30 days of events without affecting the payments consumer, MSK is structurally superior.
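The independence of consumer group offsets can be illustrated with a toy in-memory model. This is not Kafka client code: `TopicLog` and its methods are hypothetical names standing in for a single partition and its per-group committed offsets.

```python
class TopicLog:
    """Append-only list of (timestamp, event) standing in for one partition."""

    def __init__(self):
        self.records = []
        self.offsets = {}  # consumer group -> next offset to read

    def append(self, timestamp, event):
        self.records.append((timestamp, event))

    def poll(self, group, max_records=100):
        """Return the next batch for this group and advance only its offset."""
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return [event for _, event in batch]

    def seek_to_timestamp(self, group, timestamp):
        """Reposition one group at the first record at or after timestamp."""
        for offset, (ts, _) in enumerate(self.records):
            if ts >= timestamp:
                self.offsets[group] = offset
                return offset
        self.offsets[group] = len(self.records)
        return self.offsets[group]


log = TopicLog()
for t in range(5):
    log.append(t, f"event-{t}")

log.poll("payments")                  # payments reads everything once
log.poll("analytics")                 # so does analytics
log.seek_to_timestamp("analytics", 2)
print(log.poll("analytics"))          # analytics re-reads from timestamp 2
print(log.poll("payments"))           # payments is unaffected
```

The point of the model is that a replay is a change to one group's position in a shared log, not a re-publication of events to every subscriber.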
Fan-out and coupling: EventBridge excels at decoupled fan-out — you add a new rule and a new target without touching the producer. MSK requires new consumers to create their own consumer groups, which is equally decoupled but requires more initial configuration.
Throughput and latency: EventBridge has a default publish quota of up to 10,000 events/second per region (the default is lower in many regions; it is a soft limit adjustable via support), with latency typically under 1 second but no published latency SLA. MSK supports throughput of hundreds of MB/s with millisecond latency in appropriate configurations. For the current system volume, both are sufficient, but MSK has far greater headroom.
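A back-of-the-envelope check supports the claim that both options cover the current volume. The sketch below uses the fact sheet's upper bound (50,000 orders/hour) and 10x spike factor. The figure of six events per order is an assumption, taken as an upper bound from the six order states listed in the context section.

```python
def peak_events_per_second(orders_per_hour, spike_factor, events_per_order):
    """Convert an hourly order rate into a peak per-second event rate."""
    return orders_per_hour * spike_factor * events_per_order / 3600


# events_per_order=6 is an assumption, not a measured value.
peak = peak_events_per_second(50_000, 10, 6)
print(f"{peak:.0f} events/s at peak")  # prints "833 events/s at peak"
```

Under these assumptions the peak is under a tenth of the 10,000 events/second figure quoted above, so throughput alone does not decide between the two options. Ordering and replay do.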
The core point is: EventBridge is the right choice when the problem is routing and service integration; MSK is the right choice when the problem is data streaming with ordering, replay, and multiple independent consumers. Order processing is the second case.
Options Evaluated
Amazon EventBridge (Event Bus)
- Zero operational overhead — fully serverless and managed by AWS
- Native integration with 200+ AWS services and SaaS partners
- Integrated Schema Registry for event contract governance
- Decoupled fan-out via rules — adding a consumer requires no change to the producer
- Usage-proportional cost — ideal for low and irregular volumes
- No ordering guarantee between events from the same order
- Archive replay has limitations: no per-consumer filtering, replay targets original destination
- Maximum event retention via Archive: indefinite, but replay has a limited operational window
- No consumer group concept — multiple consumers require multiple rules and targets
- Default throughput of 10k events/s per region (soft limit) may be insufficient at extreme peaks
Suitable for service integration and notifications; insufficient as an order backbone with ordering and granular replay requirements
Amazon MSK (Managed Streaming for Apache Kafka)
- Guaranteed per-partition ordering — partitioning by orderId guarantees state sequence
- Native and granular replay — any consumer group repositions offset independently
- Throughput of hundreds of MB/s — real headroom for growth
- Configurable retention (days to weeks) — immutable log as source of truth
- Independent consumer groups — analytics, fraud, payments read without mutual interference
- Fixed infrastructure cost regardless of usage — brokers billed per hour
- Operational learning curve: partitions, replication factor, consumer lag, offset management
- Lambda consumption goes through an MSK event source mapping, which is available but caps consumer concurrency at the partition count, limiting per-function throughput
- Requires upfront partition planning — later repartitioning is complex
- Consumer lag monitoring requires additional instrumentation (CloudWatch MSK metrics)
Correct choice for the order backbone — ordering, replay, and multiple independent consumers are first-class requirements met natively
Hybrid Architecture: MSK + EventBridge
- MSK as order backbone (ordering + replay); EventBridge for notifications and external integrations
- Each technology used at its strength
- SaaS and external service integration via EventBridge without exposing MSK
- Doubled operational complexity — two messaging systems to maintain
- Bridge between MSK and EventBridge requires Lambda or Connector — additional failure point
- Higher combined cost in the short term
Valid as a future evolution when external integrations scale; premature as a starting point
Architecture Decision
The order processing system requires an event backbone that guarantees per-order ordering, supports granular per-consumer-group replay, sustains seasonal throughput spikes, and operates with a small platform team. The existing stack is 100% AWS.
Adopt Amazon MSK (managed Kafka) as the event backbone for the order system. Use MSK Serverless in the initial phase to reduce capacity planning overhead, migrating to MSK Provisioned when volume justifies granular control of capacity and throughput. Partition by `orderId` to guarantee ordering. Set log retention to 7 days by default, extensible to 30 days for the orders topic. EventBridge remains in the stack for external service integrations and notifications, fed by a dedicated MSK consumer.
- ✅ Per-order ordering guaranteed natively via orderId partitioning
- ✅ Independent replay per consumer group — analytics and fraud can reprocess without affecting payments
- ✅ Real throughput headroom for organic growth and seasonal spikes
- ✅ Immutable log as auditable source of truth for compliance and debugging
- ⚠️ Fixed MSK infrastructure cost even during low-volume periods (mitigated by MSK Serverless initially)
- ⚠️ The team will need to acquire operational competency in Kafka: consumer lag monitoring, offset management, rebalancing
Implementation Details and Operational Risks
Partitioning strategy: The partition key must be orderId. This guarantees that all events for a specific order are written to the same partition and therefore consumed in order. The initial number of partitions for the orders topic should be calculated based on expected peak throughput divided by sustainable per-partition throughput (typically 1-10 MB/s depending on message size and broker configuration). For the estimated volume, 12 partitions is a reasonable starting point: each partition is assigned to at most one consumer per group, so 12 partitions allow up to 12 instances of a consumer group to read in parallel, and any further instances would sit idle.
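The two ideas in this paragraph, key-to-partition mapping and partition count sizing, can be sketched as follows. The hash here is CRC32, used only as a stand-in: Kafka's default partitioner uses a different hash (murmur2), so the partition numbers produced below will not match a real cluster. The function names are hypothetical.

```python
import math
import zlib


def partition_for(order_id: str, num_partitions: int) -> int:
    """Map a key to a partition. CRC32 is a stand-in, not Kafka's murmur2."""
    return zlib.crc32(order_id.encode("utf-8")) % num_partitions


def partitions_needed(peak_mb_per_s: float, per_partition_mb_per_s: float) -> int:
    """Peak throughput divided by sustainable per-partition throughput."""
    return max(1, math.ceil(peak_mb_per_s / per_partition_mb_per_s))


# The same key always lands on the same partition while the count is fixed.
assert partition_for("order-42", 12) == partition_for("order-42", 12)

# Changing the partition count can remap existing keys, which is why
# repartitioning later breaks per-order ordering across the change.
moved = sum(
    partition_for(f"order-{i}", 12) != partition_for(f"order-{i}", 24)
    for i in range(1000)
)
print(f"{moved} of 1000 keys change partition when going from 12 to 24")
```

The remapping count is the practical reason to size partitions with headroom up front rather than plan to grow the topic later.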
Retention and replay: Configure retention.ms to 604800000 (7 days) as the default for topics. For the orders topic, consider 30 days (2592000000) given the analytics team's reprocessing requirement. Replay is operationalized via consumer group offset reset. Kafka only allows an offset reset while the group has no active members, so the team must have documented runbooks for this operation, including how to stop the affected consumer before the reset and how to avoid unintentional duplicate processing during replay.
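Because `retention.ms` is expressed in milliseconds, it is easy to get wrong by a factor of 1,000. A small helper, shown here as a sketch with hypothetical names, keeps the values derived from days rather than typed by hand.

```python
from datetime import timedelta


def retention_ms(days: int) -> int:
    """Convert a retention period in days to a retention.ms value."""
    return int(timedelta(days=days).total_seconds() * 1000)


# Topic names are illustrative; only the orders topic is named in this ADR.
TOPIC_RETENTION_MS = {
    "default": retention_ms(7),   # 604800000
    "orders": retention_ms(30),   # 2592000000
}
print(TOPIC_RETENTION_MS)
```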
Consumer criticality tiers: Not all consumers have the same SLA. Payments and inventory are critical — they should run on ECS/Fargate with auto-scaling based on consumer lag (EstimatedMaxTimeLag metric in CloudWatch MSK). Notifications and analytics can use Lambda via MSK Event Source Mapping, accepting higher latency in exchange for lower operational cost.
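The scaling rule for critical consumers can be sketched as a pure function. This is an illustration of the logic only, not ECS auto-scaling configuration: the proportional rule and the parameter names are assumptions. The one constraint taken from Kafka itself is the cap at the partition count, since consumers beyond that number receive no partitions.

```python
import math


def desired_consumers(lag_seconds, target_lag_seconds, current, partitions, minimum=1):
    """Scale the task count in proportion to lag, capped by partition count."""
    wanted = math.ceil(current * lag_seconds / target_lag_seconds)
    return max(minimum, min(wanted, partitions))


# Lag is 4x the target with 3 tasks running: ask for 12, the partition count.
print(desired_consumers(lag_seconds=120, target_lag_seconds=30, current=3, partitions=12))
# Lag is 20x the target: still capped at 12.
print(desired_consumers(lag_seconds=600, target_lag_seconds=30, current=3, partitions=12))
```

In practice the lag input would come from the consumer lag metric named above, and scale-in would usually be damped to avoid triggering repeated rebalances.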
Dead Letter Queue: Each consumer must have a DLQ (SQS) for messages that fail after N retries. The retry policy should be exponential backoff with jitter. Messages in the DLQ should generate CloudWatch alerts and have a documented manual reprocessing process.
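A minimal sketch of the retry-then-DLQ flow, assuming the "full jitter" variant of exponential backoff. The handler and the `send_to_dlq` callback are hypothetical; in this architecture the callback would wrap an SQS send. The base delay, cap, and retry count are illustrative values, not recommendations.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def process_with_retries(message, handler, send_to_dlq, max_retries=5,
                         sleep=time.sleep, rng=random.random):
    """Try the handler, back off between failures, then hand off to the DLQ.

    Returns True if the handler succeeded, False if the message was dead-lettered.
    """
    for attempt in range(max_retries + 1):
        try:
            handler(message)
            return True
        except Exception as error:
            if attempt == max_retries:
                send_to_dlq(message, repr(error))
                return False
            sleep(backoff_delay(attempt, rng=rng))
```

One design consequence is worth noting: retrying in-process blocks the partition while the consumer sleeps, which preserves per-order ordering but delays every order behind the failing one. That trade-off is a reason to keep N small for critical consumers.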
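Rebalancing risk: In ECS consumers, Kafka rebalancing can cause processing pauses of seconds to tens of seconds depending on group size and rebalancing strategy. Set partition.assignment.strategy to the cooperative sticky assignor (org.apache.kafka.clients.consumer.CooperativeStickyAssignor in the Java client) to minimize impact. Monitor the consumer's rebalance latency metrics (rebalance-latency-avg and rebalance-latency-max in the Java client) and alert if they exceed acceptable thresholds.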
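As an illustration, the consumer settings might look like the following, assuming the team uses a librdkafka-based client such as confluent-kafka for Python. In that family of clients the cooperative sticky assignor is selected with the value `cooperative-sticky` rather than a Java class name. The other values are illustrative, and the security settings required for MSK IAM authentication are deliberately omitted.

```python
def consumer_config(bootstrap_servers: str, group_id: str) -> dict:
    """librdkafka-style consumer settings; values are illustrative, not tuned."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        # Incremental rebalancing: only the moved partitions are paused.
        "partition.assignment.strategy": "cooperative-sticky",
        # Commit offsets explicitly after processing (at-least-once).
        "enable.auto.commit": False,
        "auto.offset.reset": "earliest",
        "session.timeout.ms": 45000,
    }


# Hypothetical endpoint and group name.
config = consumer_config("broker.example.internal:9098", "payments-consumer")
print(config["partition.assignment.strategy"])
```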
MSK Serverless vs Provisioned: MSK Serverless simplifies the start, but has limitations: maximum throughput of 200 MB/s write and 400 MB/s read per cluster, quotas on per-partition throughput and on partitions per cluster, no control over broker-level configuration, and cost per capacity unit that may be higher than Provisioned at high volumes. Migration to Provisioned should be planned when monthly data volume makes the cost per capacity unit unfavorable, typically above a few TB/month (estimate).
Resulting Architecture: MSK Backbone for Order Processing
Event flow from order creation to downstream consumers, with MSK as the central backbone, per-consumer DLQs, and EventBridge for external integrations.
- Order API · ECS/Fargate
- Orders DB · RDS Aurora
- Amazon MSK · orders topic · 12 partitions / key=orderId
- Payments Consumer · ECS/Fargate
- Inventory Consumer · ECS/Fargate
- Fraud Consumer · ECS/Fargate
- Notifications Consumer · Lambda (MSK trigger)
- Analytics Consumer · Lambda (MSK trigger)
- EventBridge Bridge · Lambda Consumer
- Amazon EventBridge · Custom Event Bus
- External SaaS · (CRM, ERP, etc.)
- DLQ Payments · SQS
- DLQ Inventory · SQS
- DLQ Generic · SQS
- CloudWatch · Lag Alerts
- S3 · Analytics Lake
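The "EventBridge Bridge" node above could be a small Lambda function behind an MSK trigger. The sketch below assumes the standard MSK event shape delivered to Lambda (a `records` map of topic-partition keys to lists of records with base64-encoded values) and the EventBridge `put_events` call, which accepts at most 10 entries per request. The bus name, the source string, and the `type` field in the payload are hypothetical. The client is injected so the logic can be exercised without AWS; in Lambda it would be `boto3.client("events")`.

```python
import base64
import json


def build_entries(msk_event, bus_name="orders-external", source="orders.msk-bridge"):
    """Translate an MSK-triggered Lambda event into EventBridge entries."""
    entries = []
    for records in msk_event.get("records", {}).values():
        for record in records:
            payload = json.loads(base64.b64decode(record["value"]))
            entries.append({
                "EventBusName": bus_name,
                "Source": source,
                # "type" is an assumed field in the order event payload.
                "DetailType": payload.get("type", "order.event"),
                "Detail": json.dumps(payload),
            })
    return entries


def forward(msk_event, events_client, batch_size=10):
    """Send entries in batches of at most 10, the per-call limit of PutEvents."""
    entries = build_entries(msk_event)
    failed = 0
    for start in range(0, len(entries), batch_size):
        response = events_client.put_events(Entries=entries[start:start + batch_size])
        failed += response.get("FailedEntryCount", 0)
    if failed:
        # Raising makes Lambda retry the batch, so delivery is at-least-once.
        raise RuntimeError(f"{failed} entries rejected by EventBridge")
    return len(entries)
```

Because a retry resends the whole batch, downstream EventBridge targets should tolerate duplicates. This is the additional failure point the hybrid option's trade-offs refer to.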
Technical Comparison: EventBridge vs MSK for Orders
| Dimension | EventBridge | Amazon MSK | Relevance for Orders |
|---|---|---|---|
| Ordering | No guarantee | Guaranteed per partition | 🔴 Critical: state sequence is mandatory |
| Replay | Archive with filtering limitations | Offset reset per consumer group | 🔴 Critical: analytics and fraud need independent replay |
| Throughput | 10k events/s default (soft limit) | Hundreds of MB/s | 🟡 Relevant for seasonal spikes |
| Operational overhead | Minimal, serverless | Moderate: partitions, lag, rebalancing | 🟡 Relevant: small team |
| Cost | Per published event | Fixed per broker-hour + storage | 🟡 MSK Serverless mitigates initial fixed cost |
| Decoupled fan-out | Native via rules | Via consumer groups | 🟢 Both adequate |
| Native AWS integration | 200+ services without code | Via Lambda/ECS consumers | 🟢 EventBridge superior for external integrations |
| Log retention | Indefinite archive (but limited operational replay) | Configurable: hours to weeks | 🔴 MSK superior for audit and reprocessing |
Well-Architected Analysis
Security
MSK supports authentication via IAM (MSK IAM Auth), TLS in transit, and encryption at rest with KMS. Consumer groups isolated by IAM policy ensure the analytics consumer cannot write to the orders topic. VPC-only access eliminates public exposure. EventBridge bridge uses an IAM role with minimal permissions.
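The read-only isolation described above might be expressed as an IAM policy along these lines. This is a sketch: the action names are the MSK IAM access control actions for connecting, reading a topic, and joining a consumer group, while the region, account ID, cluster name, cluster UUID, topic, and group values are placeholders. The absence of `kafka-cluster:WriteData` is what prevents the analytics consumer from producing to the orders topic.

```python
import json


def read_only_policy(region, account_id, cluster, cluster_uuid, topic, group):
    """Build an IAM policy document allowing one group to read one topic."""
    base = f"arn:aws:kafka:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow",
             "Action": ["kafka-cluster:Connect"],
             "Resource": f"{base}:cluster/{cluster}/{cluster_uuid}"},
            {"Effect": "Allow",
             "Action": ["kafka-cluster:DescribeTopic", "kafka-cluster:ReadData"],
             "Resource": f"{base}:topic/{cluster}/{cluster_uuid}/{topic}"},
            {"Effect": "Allow",
             "Action": ["kafka-cluster:DescribeGroup", "kafka-cluster:AlterGroup"],
             "Resource": f"{base}:group/{cluster}/{cluster_uuid}/{group}"},
        ],
    }


# All identifiers below are placeholders.
policy = read_only_policy("us-east-1", "123456789012", "orders-cluster",
                          "00000000-0000-0000-0000-000000000000",
                          "orders", "analytics-consumer")
print(json.dumps(policy, indent=2))
```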
Reliability
MSK multi-AZ with replication factor 3 guarantees durability even with a broker failure. Per-consumer DLQs prevent failures in one downstream from blocking others. Consumer lag monitoring with alerts ensures early detection of lagging consumers. MSK Serverless has a 99.9% availability SLA.
Performance efficiency
Partitioning by orderId distributes load evenly and guarantees parallelism. Critical consumers on ECS/Fargate with consumer-lag-based auto-scaling respond to spikes within minutes. Configurable batch size and fetch.min.bytes optimize throughput vs latency per consumer.
Cost optimization
MSK Serverless in the initial phase avoids fixed provisioned broker costs. Migration to Provisioned when volume justifies it. 7-day retention as default — analytics topics with 30 days have incremental storage cost. Lambda consumers for async workloads reduce ECS cost.
Sustainability
MSK Serverless scales capacity with demand, reducing idle resource consumption during low-usage periods compared with brokers provisioned for peak. Configurable log retention avoids unnecessary storage of historical data beyond the operational utility period.
I see this EventBridge vs Kafka debate frequently, and the most common mistake is framing the decision as 'which is simpler to operate' instead of 'which correctly solves the functional requirements'. EventBridge is an excellent tool: I use it extensively for service integrations, notifications, and fan-out of domain events where ordering is not critical. But for an order backbone, the lack of guaranteed ordering is not an implementation detail; it is a structural failure.
What concerns me most in this type of decision is not the technology choice itself, but the tendency to underestimate the cost of not having replay. Every order system will need replay at some point: a bug in the inventory consumer, a change in the fraud model, a schema migration. If you built on EventBridge and didn't plan the Archive carefully, that replay will be painful or impossible. With MSK, it's a 5-minute operation with a runbook.
On operational burden: yes, Kafka has a learning curve. But MSK Serverless has significantly reduced the operational overhead for use cases like this. The team doesn't need to manage brokers, ZooKeeper, or replication manually. What they need is to understand consumer groups, offsets, and lag monitoring, and that's knowledge worth having in any team working with distributed systems.
My practical recommendation: start with MSK Serverless, invest 2 sprints in runbooks and lag observability, and use EventBridge only for what it does best: external service integration and notifications. The hybrid architecture I describe here is not over-engineering; it's using
Verdict
Amazon MSK is the correct choice for the event backbone of an order processing system. The three requirements that define this decision — per-order ordering, granular per-consumer replay, and log retention as a source of truth — are natively met by Kafka's distributed log model and have no functional equivalent in EventBridge. EventBridge is not discarded: it remains in the stack as an integration layer for external services and SaaS, fed by a dedicated MSK consumer. This separation of concerns is the correct architecture — not a concession. The operational cost of MSK is real and should not be minimized. The mitigation is MSK Serverless in the initial phase and deliberate investment in consumer lag observability and operational runbooks. A team that understands offsets and consumer groups is a more resilient team, regardless of whatever messaging technology they use in the future. The transferable lesson: the choice between an event router and a distributed log is not about preference or simplicity — it is about which semantic guarantees your domain requires. Identify those guarantees first; the technology follows.