Studies
TeardownApache Kafka / Amazon MSKStreaming

Kafka at Scale: Tiered Storage, KRaft, and Exactly-Once

Jun 1, 2024 10 min AI-assisted
Share:

Listen to study

generated on play

Generated only on first play

On demand
0:000:00
Speed
The MP3 is saved to S3 after the first play.

An in-depth analysis of Apache Kafka's internal architecture — the log/segment model, ZooKeeper elimination via KRaft, data offload to S3 with Tiered Storage, exactly-once semantics, and what Amazon MSK abstracts (and what it does not). A technical teardown for those who need to operate Kafka for real.

Kafka is not a queue. That distinction matters more than it seems — it defines how you scale, how you retain data, how you guarantee delivery, and most importantly, where you will fail in production. In this teardown, I disassemble Kafka's internal architecture from primary sources: the immutable log model, the transition from ZooKeeper to KRaft, Tiered Storage with S3 offload, exactly-once semantics with idempotent producers and transactions, and what Amazon MSK actually manages — and what still falls on you.

Fact Sheet

System
Apache Kafka + Amazon MSK
Domain
Event streaming / Data platform
Reference version
Kafka 3.x (KRaft GA: 3.3+, Tiered Storage GA: 3.6+)
Typical scale (production)
Millions of messages/second; retention from TB to PB per cluster
Core stack
Java/Scala (broker), Apache ZooKeeper (legacy), KRaft (internal Raft), Amazon S3 (tiered storage via MSK)
MSK — availability
Multi-AZ by default; AWS 99.9% SLA
Key architectural change
ZooKeeper removal (KIP-500/833); ZooKeeper deprecated in Kafka 3.5, removed in 4.0
Primary sources
kafka.apache.org/documentation, KIP-405, docs.aws.amazon.com/msk

The Fundamental Problem: Log, Not Queue

Kafka's central abstraction is the partition log: an immutable, append-only sequence of records ordered by offset. This is not an implementation detail — it is the design decision that explains everything else. Traditional queues (SQS, RabbitMQ) delete messages after consumption; Kafka retains them. This allows multiple consumer groups to read the same data independently, deterministic reprocessing, and native auditing. The price is that the broker must actively manage retention, and local disk becomes a bottleneck in clusters with high volume or long retention.

Each topic is divided into partitions, each partition replicated across brokers according to the configured replication factor. The leader broker for each partition accepts writes; followers replicate synchronously (if in the ISR — In-Sync Replica set). A producer writes to the leader; the record is only considered committed when all ISR members confirm — that is Kafka's durability contract. The acks=all (or acks=-1) parameter combined with min.insync.replicas defines the minimum confirmation quorum. Without understanding this triangle (leader, ISR, acks), any discussion of exactly-once is premature.

The segment model is where Kafka gains I/O efficiency: each partition on disk is a sequence of segments (.log files + offset and timestamp indexes). The broker only writes to the active segment (tail); closed segments are immutable. This enables sendfile() directly from the filesystem cache to the consumer's socket — zero-copy — without passing through the JVM heap. In high-throughput workloads, Kafka frequently saturates the network before the disk, precisely because of this design.

KRaft: Why ZooKeeper Had to Go

ZooKeeper was the pragmatic choice of 2011: a mature distributed coordination service that Kafka used to store cluster metadata (which brokers are alive, which are partition leaders, topic configurations, ACLs). It worked for a decade, but created an expensive architectural coupling: two distributed systems to operate, two security models to configure, and a real scaling bottleneck — ZooKeeper stored all metadata in memory, and clusters with hundreds of thousands of partitions began experiencing latency in controller election and metadata propagation.

KRaft (Kafka Raft Metadata) solves this via KIP-500: Kafka brokers themselves now run an internal Raft protocol for metadata consensus. A subset of brokers (or dedicated controller nodes) forms the controller quorum; the active controller maintains a replicated metadata log — the same log/segment model as Kafka, applied to itself. This eliminates ZooKeeper as an operational dependency and, more importantly, removes the practical partition-per-cluster limit: Confluent benchmarks showed controller failure recovery in seconds with KRaft vs. tens of seconds with ZooKeeper on large clusters.

The transition was not trivial. KIP-500 was proposed in 2019; KRaft entered early access in Kafka 2.8 (2021), reached GA in 3.3 (2022) for new clusters, and ZooKeeper-to-KRaft migration support arrived in 3.5. ZooKeeper was officially removed in Kafka 4.0 (2024). On Amazon MSK, KRaft clusters are available from Kafka 3.6 on MSK. The critical operational point: in KRaft mode, controller nodes need reliable disks and low network latency between them — they are the new control plane of the cluster. In MSK, this is abstracted; in self-managed clusters, it is an explicit infrastructure decision.

Kafka Architecture: Log, KRaft, Tiered Storage, and Consumer Groups

Reconstructed view of Kafka 3.6+ internal architecture with KRaft and Tiered Storage enabled on Amazon MSK. Shows the write flow (producer → leader → ISR), cold segment offload to S3, the KRaft control plane, and consumption via consumer groups.

👤 Producers
  • Producer A · idempotent + txn
  • Producer B · acks=all
⚙️ MSK Broker Plane (Multi-AZ)
  • Broker 1 · Partition Leader · (AZ-a)
  • Broker 2 · ISR Follower · (AZ-b)
  • Broker 3 · ISR Follower · (AZ-c)
🗂️ Local Storage (EBS/NVMe)
  • Segmentos Ativos · (hot — local disk)
  • Segmentos Fechados · (cold — candidatos a offload)
☁️ Tiered Storage (S3)
  • Amazon S3 · Segmentos Offloaded · (retenção longa)
  • Tiered Storage · Plugin (RemoteLogManager)
🔐 KRaft Controller Quorum
  • Controller 1 · (active — AZ-a)
  • Controller 2 · (standby — AZ-b)
  • Controller 3 · (standby — AZ-c)
📦 Consumer Groups
  • Consumer Group A · (analytics — offset latest)
  • Consumer Group B · (audit — offset earliest · / S3 fetch)
  • Group Coordinator · (broker-side)

Tiered Storage: Separating Hot from Cold in the Log

KIP-405 (Tiered Storage) solves a real economic problem: local disk on Kafka brokers is expensive and does not scale independently of compute. In clusters with days or weeks of retention, most data is cold — rarely read, but must be available for reprocessing or auditing. Keeping this on local EBS or NVMe is wasteful.

The mechanism works as follows: the RemoteLogManager (RLM), the component introduced by KIP-405, monitors closed segments for each partition. When a segment meets the offload criteria (configurable by time or size), the RLM uploads the .log file, offset and timestamp indexes, and metadata to remote storage — in MSK's case, Amazon S3. After upload confirmation, the local segment can be deleted, freeing disk. The broker maintains a remote metadata map indicating which offsets are in S3 vs. local.

From the consumer's perspective, the protocol is transparent: it issues a normal FetchRequest to the leader broker. If the requested offset is in remote storage, the broker fetches the data from S3, optionally caches it locally (configurable), and returns it to the consumer. The consumer does not know the data came from S3. This is elegant, but has latency implications: a cold data fetch has S3 latency (typically tens of ms) vs. local disk or page cache (sub-ms). For batch reprocessing workloads, this is acceptable; for consumers requiring low latency on historical data, it is not.

On Amazon MSK, Tiered Storage is enabled via cluster configuration and uses AWS-managed S3 — you do not configure the bucket directly, but can use S3 Lifecycle Policies to control long-term storage costs. Current limitation: Tiered Storage on MSK does not support all instance types, and log compaction has restrictions with tiered storage enabled — compacted segments are not eligible for offload in certain versions. This is critical for changelog topics (Kafka Streams, CDC).

Exactly-Once: Idempotency, Transactions, and the Real Cost

Exactly-once in Kafka is implemented in two complementary layers, and confusing the two is a common mistake in interviews and in production.

Layer 1 — Idempotent Producer: enabled with enable.idempotence=true (default true since Kafka 3.0). The broker assigns the producer a ProducerID (PID) and tracks the sequence number per partition. If the producer retransmits a message (due to timeout or network failure), the broker detects the duplicate sequence number and discards it — no duplicates in the log. This guarantees exactly-once per partition, per producer session. Cost: practically zero — it is bookkeeping on the broker, with no additional distributed coordination.

Layer 2 — Transactions: enabled with transactional.id on the producer. The producer coordinates with a Transaction Coordinator (a special broker elected by hash of transactional.id) to begin, commit, or abort a transaction that can span multiple partitions and multiple topics. The mechanism uses an internal two-phase commit: the coordinator writes the transaction state to an internal topic (__transaction_state) before instructing brokers to mark messages as committed. Consumers with isolation.level=read_committed only see messages from committed transactions — messages from open or aborted transactions are filtered.

The real cost of transactions is non-trivial: additional round-trip latency to the Transaction Coordinator, write overhead on __transaction_state, and the risk of zombie transactions — producers that failed without aborting the transaction, leaving it open until transaction.timeout.ms. In high-throughput systems, very large transactions (many messages per transaction) or very frequent transactions (one transaction per message) significantly degrade throughput. The practical recommendation: batch your transactions by time window or record count, not per individual message.

A frequently overlooked point: exactly-once end-to-end (producer → broker → consumer → sink) requires the sink to also be idempotent or transactional. Kafka guarantees exactly-once in the log; if the consumer processes and writes to a database without idempotency, you still have at-least-once in the system as a whole. Kafka Streams implements end-to-end exactly-once using Kafka transactions to atomically coordinate read, process, and write — but this only works when both source and sink are Kafka.

Consumer Groups, Rebalance, and the Stop-the-World Problem

The consumer group model is what allows Kafka to scale consumption horizontally: each partition is assigned to exactly one consumer within the group, and multiple groups can consume the same topic independently. The Group Coordinator (a broker elected by hash of group.id) manages the group lifecycle: join, assignment synchronization, heartbeats, and failure detection.

The classic problem is rebalance: whenever a consumer joins, leaves, or stops sending heartbeats within session.timeout.ms, the coordinator triggers a rebalance. In the original protocol (Eager Rebalance), all consumers in the group stop consuming, revoke all partitions, and wait for the new assignment. In large groups with many partitions, this can cause pauses of seconds to tens of seconds — a distributed stop-the-world.

Kafka 2.4 introduced the Cooperative Sticky Assignor (KIP-429/KIP-482), which implements incremental rebalancing: only the partitions that need to be moved are revoked; the rest continue to be consumed during the rebalance. This drastically reduces the impact, but requires all consumers in the group to use a cooperative assignor — a configuration detail that is frequently forgotten during migrations.

Another critical operational point: max.poll.interval.ms defines the maximum time between two poll() calls. If processing a batch takes longer than this interval (heavy processing, slow external calls), the consumer is considered dead by the coordinator and a rebalance is triggered — even if the consumer is alive and processing. This creates a rebalance loop in workloads with variable processing time. The solution is not to increase max.poll.interval.ms indefinitely (this delays detection of real failures), but to reduce batch size via max.poll.records to ensure processing fits within the configured interval. On Amazon MSK, these parameters are configurable per consumer — MSK does not manage them; they remain entirely in your application.

Architectural Trade-offs: Key Design Decisions in Kafka

KRaft vs. ZooKeeper

Pros
  • Eliminates operational dependency on ZooKeeper
  • Controller recovery in seconds (vs. tens of seconds)
  • Support for clusters with millions of partitions
  • Unified security model
Cons
  • Migration of existing clusters requires specific procedure (3.5+)
  • Controller nodes need reliable disks in self-managed setups
  • Tooling ecosystem still adapting (monitoring, etc.)

Adopt KRaft for new clusters. No exceptions.

Tiered Storage (S3) vs. Local Retention

Pros
  • Drastically lower storage cost for cold data
  • Retention scales independently of compute
  • Transparent to consumers (no protocol change)
Cons
  • Fetch latency for cold data (tens of ms vs. sub-ms)
  • Incompatibility with log compaction in certain versions
  • S3 API costs (GET/PUT) can surprise at high offload frequency

Ideal for long retention (>24h) with sporadic access to historical data. Avoid on compacted topics.

Exactly-Once Transactions vs. At-Least-Once + Sink Idempotency

Pros
  • Native guarantee at the broker; no deduplication logic in the consumer
  • Required for Kafka Streams EOS
Cons
  • Additional latency (round-trip to Transaction Coordinator)
  • Reduced throughput if transactions are too frequent or too large
  • Does not guarantee end-to-end EOS if the sink is not idempotent

Use transactions where the cost of duplicates is high (financial, CDC). For analytics, at-least-once + sink idempotency is simpler and faster.

Amazon MSK vs. Self-Managed Kafka (EC2/EKS)

Pros
  • MSK: no broker management overhead, patches, ZK/KRaft
  • MSK: native integration with IAM, VPC, CloudWatch, S3 (tiered)
  • MSK: Multi-AZ by design; automatic broker failover
Cons
  • MSK: less control over broker configurations (some are fixed)
  • MSK: higher cost than equivalent EC2 at very large scale
  • Self-managed: full flexibility, but real operational overhead

MSK for most cases. Self-managed only if you have a dedicated platform team and requirements MSK does not support.

Well-Architected Read: Kafka / Amazon MSK

Security

Kafka has a multidimensional security model: authentication (SASL/PLAIN, SCRAM, mTLS, IAM on MSK), authorization (ACLs per resource — topic, group, cluster), and encryption in transit (TLS) and at rest (EBS encryption on MSK). The critical point: Kafka ACLs are additive and granular — a misconfigured producer with WRITE permission on * (all topics) is a data injection vector. On MSK with IAM auth, IAM policies replace internal ACLs for authentication, but ACLs can still coexist.

Reliability

Reliability in Kafka is a direct function of three configurations: replication factor (≥3 in production), min.insync.replicas (≥2, typically RF-1), and acks=all on the producer. MSK manages broker failover automatically, but leader election time still impacts producers during failover — retries and retry.backoff.ms on the producer must be configured to absorb this interval. With KRaft, controller election is faster, reducing the unavailability window.

Performance efficiency

Kafka is optimized for throughput, not individual message latency. The batching mechanism on the producer (linger.ms, batch.size) is fundamental: increasing linger.ms from 0 to 5-20ms can multiply throughput by 10x at the cost of end-to-end latency. Compression (snappy or lz4) reduces network and disk usage with minimal CPU cost on text/JSON data. On MSK, the instance type defines maximum network throughput — kafka.m5.4xlarge and larger instances have Enhanced Networking enabled.

Cost optimization

MSK cost has three components: broker instances (per hour), EBS storage (per GB/month), and data transfer (intra-AZ is free; inter-AZ has cost). Tiered Storage adds S3 cost (storage + API), but typically reduces total cost by allowing smaller instances with less EBS. The most common hidden cost: excessive retention on high-volume topics without Tiered Storage — teams that configure retention.ms=7d on 100MB/s topics need ~60TB of EBS per replica. With Tiered Storage, local disk can be reduced to hours of retention, and the rest goes to S3 at ~$0.023/GB/month.

FA
What I'd do differently — and what the industry still gets wrong
Senior Solutions Architect

After operating and designing systems on Kafka in financial contexts, there are three error patterns I see repeatedly — and that I would address differently. 1. Tiered Storage as an afterthought, not an initial design. Most teams enable Tiered Storage after EBS costs are already high and migration is painful. I would define the tiered storage policy at topic creation: local.retention.ms for the hot tier (e.g., 6-24h), retention.ms for the total (e.g., 30d in S3). This is a day-zero decision, not a day-90 one. 2. Exactly-once misapplied. I see teams enabling transactional.id on all producers for "safety", without understanding the cost. In financial systems, exactly-once is necessary in payment and reconciliation flows — not in telemetry or audit logs where at-least-once + sink deduplication is more efficient. The decision should be per flow, not per cluster. 3. Consumer group design neglected. Teams create one consumer group per service and put multiple responsibilities into it. I separate consumer groups by processing responsibility — one group for real-time processing (low max.poll.records, high SLA priority), another for batch/analytics (high max.poll.records, lag-tolerant). This allows independent tuning and prevents a slow consumer doing heavy processing from causing rebalances that affect the real-time flow. The most important point I would emphasize for any team adopting Kafka: Kafka solves the transport and retention problem; it does not solve the data contract problem. Without a schema registry (Confluent Schema Registry or AWS Glue Schema Registry) with versioned schema evolution, you will break consumers in production at some point. This is not optional in serious production systems.

Verdict

Apache Kafka is one of the best-designed software architectures of the last 15 years — the immutable log model is simple, correct, and scales in ways that traditional queue systems cannot. The transition to KRaft removes the last major operational fragility point (ZooKeeper) without compromising the consistency model. Tiered Storage solves the economic problem of long retention elegantly, with a well-defined and documented latency trade-off. Exactly-once is genuinely useful and correctly implemented at the broker level — but it is frequently misapplied in production, generating unnecessary overhead or, worse, a false sense of security when the sink is not idempotent. Amazon MSK is the right choice for most teams: it eliminates the operational overhead of broker management and integrates natively with the AWS ecosystem (IAM, VPC, S3, CloudWatch). The real limitation of MSK is not technical — it is that it abstracts enough that less experienced teams underestimate what is still their responsibility: producer/consumer tuning, consumer group design, schema management, and consumer lag monitoring.

#kafka#streaming#event-driven#amazon-msk#tiered-storage#kraft#exactly-once#data-platform
Share:
Written with AI assistance from the public case and my architect's reading.