# Kafka at Scale: Tiered Storage, KRaft, and Exactly-Once

An in-depth analysis of Apache Kafka's internal architecture — the log/segment model, ZooKeeper elimination via KRaft, data offload to S3 with Tiered Storage, exactly-once semantics, and what Amazon MSK abstracts (and what it does not). A technical teardown for those who need to operate Kafka for real.

- URL: https://fernando.moretes.com/studies/kafka-tiered-storage-kraft-exactly-once

- Markdown: https://fernando.moretes.com/studies/kafka-tiered-storage-kraft-exactly-once/study.md?lang=en

- Type: Teardown

- Company: Apache Kafka / Amazon MSK

- Domain: Streaming

- Date: 2024-06-01

- Tags: kafka, streaming, event-driven, amazon-msk, tiered-storage, kraft, exactly-once, data-platform

- Reading time: 10 min

---

Kafka is not a queue. That distinction matters more than it seems — it defines how you scale, how you retain data, how you guarantee delivery, and most importantly, where you will fail in production. In this teardown, I disassemble Kafka's internal architecture from primary sources: the immutable log model, the transition from ZooKeeper to KRaft, Tiered Storage with S3 offload, exactly-once semantics with idempotent producers and transactions, and what Amazon MSK actually manages — and what still falls on you.

## Fact Sheet

- **System:** Apache Kafka + Amazon MSK
- **Domain:** Event streaming / Data platform
- **Reference version:** Kafka 3.x (KRaft GA: 3.3+, Tiered Storage GA: 3.6+)
- **Typical scale (production):** Millions of messages/second; retention from TB to PB per cluster
- **Core stack:** Java/Scala (broker), Apache ZooKeeper (legacy), KRaft (internal Raft), Amazon S3 (tiered storage via MSK)
- **MSK — availability:** Multi-AZ by default; AWS 99.9% SLA
- **Key architectural change:** ZooKeeper removal (KIP-500/833); ZooKeeper deprecated in Kafka 3.5, removed in 4.0
- **Primary sources:** kafka.apache.org/documentation, KIP-405, docs.aws.amazon.com/msk

## The Fundamental Problem: Log, Not Queue

Kafka's central abstraction is the **partition log**: an immutable, append-only sequence of records ordered by offset. This is not an implementation detail — it is the design decision that explains everything else. Traditional queues (SQS, RabbitMQ) delete messages after consumption; Kafka retains them. This allows multiple consumer groups to read the same data independently, deterministic reprocessing, and native auditing. The price is that **the broker must actively manage retention**, and local disk becomes a bottleneck in clusters with high volume or long retention.

Each topic is divided into **partitions**, each partition replicated across brokers according to the configured replication factor. The leader broker for each partition accepts writes; followers replicate synchronously (if in the ISR — In-Sync Replica set). A producer writes to the leader; the record is only considered committed when all ISR members confirm — that is Kafka's durability contract. The `acks=all` (or `acks=-1`) parameter combined with `min.insync.replicas` defines the minimum confirmation quorum. Without understanding this triangle (leader, ISR, `acks`), any discussion of exactly-once is premature.

The segment model is where Kafka gains I/O efficiency: each partition on disk is a sequence of **segments** (`.log` files + offset and timestamp indexes). The broker only writes to the active segment (tail); closed segments are immutable. This enables `sendfile()` directly from the filesystem cache to the consumer's socket — zero-copy — without passing through the JVM heap. In high-throughput workloads, Kafka frequently saturates the network before the disk, precisely because of this design.

## KRaft: Why ZooKeeper Had to Go

ZooKeeper was the pragmatic choice of 2011: a mature distributed coordination service that Kafka used to store cluster metadata (which brokers are alive, which are partition leaders, topic configurations, ACLs). It worked for a decade, but created an **expensive architectural coupling**: two distributed systems to operate, two security models to configure, and a real scaling bottleneck — ZooKeeper stored all metadata in memory, and clusters with hundreds of thousands of partitions began experiencing latency in controller election and metadata propagation.

**KRaft** (Kafka Raft Metadata) solves this via KIP-500: Kafka brokers themselves now run an internal Raft protocol for metadata consensus. A subset of brokers (or dedicated controller nodes) forms the **controller quorum**; the active controller maintains a replicated metadata log — the same log/segment model as Kafka, applied to itself. This eliminates ZooKeeper as an operational dependency and, more importantly, **removes the practical partition-per-cluster limit**: Confluent benchmarks showed controller failure recovery in seconds with KRaft vs. tens of seconds with ZooKeeper on large clusters.

The transition was not trivial. KIP-500 was proposed in 2019; KRaft entered early access in Kafka 2.8 (2021), reached GA in 3.3 (2022) for new clusters, and ZooKeeper-to-KRaft migration support arrived in 3.5. ZooKeeper was officially removed in Kafka 4.0 (2024). On Amazon MSK, KRaft clusters are available from Kafka 3.6 on MSK. **The critical operational point**: in KRaft mode, controller nodes need reliable disks and low network latency between them — they are the new control plane of the cluster. In MSK, this is abstracted; in self-managed clusters, it is an explicit infrastructure decision.

## Kafka Architecture: Log, KRaft, Tiered Storage, and Consumer Groups

Reconstructed view of Kafka 3.6+ internal architecture with KRaft and Tiered Storage enabled on Amazon MSK. Shows the write flow (producer → leader → ISR), cold segment offload to S3, the KRaft control plane, and consumption via consumer groups.

### 👤 Producers

- Producer A idempotent + txn (external)
- Producer B acks=all (external)

### ⚙️ MSK Broker Plane (Multi-AZ)

- Broker 1 Partition Leader (AZ-a) (messaging)
- Broker 2 ISR Follower (AZ-b) (messaging)
- Broker 3 ISR Follower (AZ-c) (messaging)

### 🗂️ Local Storage (EBS/NVMe)

- Segmentos Ativos (hot — local disk) (storage)
- Segmentos Fechados (cold — candidatos a offload) (storage)

### ☁️ Tiered Storage (S3)

- Amazon S3 Segmentos Offloaded (retenção longa) (storage)
- Tiered Storage Plugin (RemoteLogManager) (compute)

### 🔐 KRaft Controller Quorum

- Controller 1 (active — AZ-a) (compute)
- Controller 2 (standby — AZ-b) (compute)
- Controller 3 (standby — AZ-c) (compute)

### 📦 Consumer Groups

- Consumer Group A (analytics — offset latest) (external)
- Consumer Group B (audit — offset earliest / S3 fetch) (external)
- Group Coordinator (broker-side) (messaging)

### Flows

- p1 -> b1: write (acks=all)
- p2 -> b1: write (idempotent)
- b1 -> b2: ISR replication
- b1 -> b3: ISR replication
- b1 -> seg_hot: append active segment
- seg_hot -> seg_cold: segment closed (roll)
- seg_cold -> ts_plugin: offload trigger
- ts_plugin -> s3: upload segment + index
- seg_cold -> ts_plugin: delete local after offload
- kc1 -> b1: metadata / leader election
- kc1 -> kc2: Raft replication
- kc1 -> kc3: Raft replication
- cg1 -> b1: fetch (local)
- cg2 -> b1: fetch request
- b1 -> s3: remote fetch (tiered)
- cg1 -> gc: heartbeat / rebalance
- cg2 -> gc: heartbeat / rebalance

## Tiered Storage: Separating Hot from Cold in the Log

KIP-405 (Tiered Storage) solves a real economic problem: **local disk on Kafka brokers is expensive and does not scale independently of compute**. In clusters with days or weeks of retention, most data is cold — rarely read, but must be available for reprocessing or auditing. Keeping this on local EBS or NVMe is wasteful.

The mechanism works as follows: the `RemoteLogManager` (RLM), the component introduced by KIP-405, monitors closed segments for each partition. When a segment meets the offload criteria (configurable by time or size), the RLM uploads the `.log` file, offset and timestamp indexes, and metadata to remote storage — in MSK's case, Amazon S3. After upload confirmation, the local segment can be deleted, freeing disk. The broker maintains a **remote metadata map** indicating which offsets are in S3 vs. local.

From the consumer's perspective, the protocol is transparent: it issues a normal `FetchRequest` to the leader broker. If the requested offset is in remote storage, the broker fetches the data from S3, optionally caches it locally (configurable), and returns it to the consumer. The consumer does not know the data came from S3. This is elegant, but has latency implications: a cold data fetch has S3 latency (typically tens of ms) vs. local disk or page cache (sub-ms). **For batch reprocessing workloads, this is acceptable; for consumers requiring low latency on historical data, it is not.**

On Amazon MSK, Tiered Storage is enabled via cluster configuration and uses AWS-managed S3 — you do not configure the bucket directly, but can use S3 Lifecycle Policies to control long-term storage costs. Current limitation: Tiered Storage on MSK does not support all instance types, and log compaction has restrictions with tiered storage enabled — compacted segments are not eligible for offload in certain versions. This is critical for changelog topics (Kafka Streams, CDC).

## Exactly-Once: Idempotency, Transactions, and the Real Cost

Exactly-once in Kafka is implemented in two complementary layers, and confusing the two is a common mistake in interviews and in production.

**Layer 1 — Idempotent Producer**: enabled with `enable.idempotence=true` (default `true` since Kafka 3.0). The broker assigns the producer a `ProducerID` (PID) and tracks the sequence number per partition. If the producer retransmits a message (due to timeout or network failure), the broker detects the duplicate sequence number and discards it — no duplicates in the log. This guarantees **exactly-once per partition, per producer session**. Cost: practically zero — it is bookkeeping on the broker, with no additional distributed coordination.

**Layer 2 — Transactions**: enabled with `transactional.id` on the producer. The producer coordinates with a **Transaction Coordinator** (a special broker elected by hash of `transactional.id`) to begin, commit, or abort a transaction that can span multiple partitions and multiple topics. The mechanism uses an **internal two-phase commit**: the coordinator writes the transaction state to an internal topic (`__transaction_state`) before instructing brokers to mark messages as committed. Consumers with `isolation.level=read_committed` only see messages from committed transactions — messages from open or aborted transactions are filtered.

The real cost of transactions is non-trivial: additional round-trip latency to the Transaction Coordinator, write overhead on `__transaction_state`, and **the risk of zombie transactions** — producers that failed without aborting the transaction, leaving it open until `transaction.timeout.ms`. In high-throughput systems, very large transactions (many messages per transaction) or very frequent transactions (one transaction per message) significantly degrade throughput. The practical recommendation: **batch your transactions by time window or record count**, not per individual message.

A frequently overlooked point: exactly-once **end-to-end** (producer → broker → consumer → sink) requires the sink to also be idempotent or transactional. Kafka guarantees exactly-once in the log; if the consumer processes and writes to a database without idempotency, you still have at-least-once in the system as a whole. Kafka Streams implements end-to-end exactly-once using Kafka transactions to atomically coordinate read, process, and write — but this only works when both source and sink are Kafka.

## Consumer Groups, Rebalance, and the Stop-the-World Problem

The consumer group model is what allows Kafka to scale consumption horizontally: each partition is assigned to exactly one consumer within the group, and multiple groups can consume the same topic independently. The **Group Coordinator** (a broker elected by hash of `group.id`) manages the group lifecycle: join, assignment synchronization, heartbeats, and failure detection.

The classic problem is **rebalance**: whenever a consumer joins, leaves, or stops sending heartbeats within `session.timeout.ms`, the coordinator triggers a rebalance. In the original protocol (Eager Rebalance), **all consumers in the group stop consuming, revoke all partitions, and wait for the new assignment**. In large groups with many partitions, this can cause pauses of seconds to tens of seconds — a distributed stop-the-world.

Kafka 2.4 introduced the **Cooperative Sticky Assignor** (KIP-429/KIP-482), which implements incremental rebalancing: only the partitions that need to be moved are revoked; the rest continue to be consumed during the rebalance. This drastically reduces the impact, but requires all consumers in the group to use a cooperative assignor — a configuration detail that is frequently forgotten during migrations.

Another critical operational point: `max.poll.interval.ms` defines the maximum time between two `poll()` calls. If processing a batch takes longer than this interval (heavy processing, slow external calls), the consumer is considered dead by the coordinator and a rebalance is triggered — even if the consumer is alive and processing. This creates a rebalance loop in workloads with variable processing time. The solution is not to increase `max.poll.interval.ms` indefinitely (this delays detection of real failures), but to **reduce batch size via `max.poll.records`** to ensure processing fits within the configured interval. On Amazon MSK, these parameters are configurable per consumer — MSK does not manage them; they remain entirely in your application.

## Architectural Trade-offs: Key Design Decisions in Kafka

### KRaft vs. ZooKeeper

**Pros**
- Eliminates operational dependency on ZooKeeper
- Controller recovery in seconds (vs. tens of seconds)
- Support for clusters with millions of partitions
- Unified security model

**Cons**
- Migration of existing clusters requires specific procedure (3.5+)
- Controller nodes need reliable disks in self-managed setups
- Tooling ecosystem still adapting (monitoring, etc.)

**Verdict:** Adopt KRaft for new clusters. No exceptions.

### Tiered Storage (S3) vs. Local Retention

**Pros**
- Drastically lower storage cost for cold data
- Retention scales independently of compute
- Transparent to consumers (no protocol change)

**Cons**
- Fetch latency for cold data (tens of ms vs. sub-ms)
- Incompatibility with log compaction in certain versions
- S3 API costs (GET/PUT) can surprise at high offload frequency

**Verdict:** Ideal for long retention (>24h) with sporadic access to historical data. Avoid on compacted topics.

### Exactly-Once Transactions vs. At-Least-Once + Sink Idempotency

**Pros**
- Native guarantee at the broker; no deduplication logic in the consumer
- Required for Kafka Streams EOS

**Cons**
- Additional latency (round-trip to Transaction Coordinator)
- Reduced throughput if transactions are too frequent or too large
- Does not guarantee end-to-end EOS if the sink is not idempotent

**Verdict:** Use transactions where the cost of duplicates is high (financial, CDC). For analytics, at-least-once + sink idempotency is simpler and faster.

### Amazon MSK vs. Self-Managed Kafka (EC2/EKS)

**Pros**
- MSK: no broker management overhead, patches, ZK/KRaft
- MSK: native integration with IAM, VPC, CloudWatch, S3 (tiered)
- MSK: Multi-AZ by design; automatic broker failover

**Cons**
- MSK: less control over broker configurations (some are fixed)
- MSK: higher cost than equivalent EC2 at very large scale
- Self-managed: full flexibility, but real operational overhead

**Verdict:** MSK for most cases. Self-managed only if you have a dedicated platform team and requirements MSK does not support.

## Well-Architected Read: Kafka / Amazon MSK

- **security**: Kafka has a multidimensional security model: authentication (SASL/PLAIN, SCRAM, mTLS, IAM on MSK), authorization (ACLs per resource — topic, group, cluster), and encryption in transit (TLS) and at rest (EBS encryption on MSK). The critical point: Kafka ACLs are additive and granular — a misconfigured producer with `WRITE` permission on `*` (all topics) is a data injection vector. On MSK with IAM auth, IAM policies replace internal ACLs for authentication, but ACLs can still coexist.
- **reliability**: Reliability in Kafka is a direct function of three configurations: replication factor (≥3 in production), `min.insync.replicas` (≥2, typically RF-1), and `acks=all` on the producer. MSK manages broker failover automatically, but leader election time still impacts producers during failover — `retries` and `retry.backoff.ms` on the producer must be configured to absorb this interval. With KRaft, controller election is faster, reducing the unavailability window.
- **performance**: Kafka is optimized for throughput, not individual message latency. The batching mechanism on the producer (`linger.ms`, `batch.size`) is fundamental: increasing `linger.ms` from 0 to 5-20ms can multiply throughput by 10x at the cost of end-to-end latency. Compression (`snappy` or `lz4`) reduces network and disk usage with minimal CPU cost on text/JSON data. On MSK, the instance type defines maximum network throughput — `kafka.m5.4xlarge` and larger instances have Enhanced Networking enabled.
- **cost**: MSK cost has three components: broker instances (per hour), EBS storage (per GB/month), and data transfer (intra-AZ is free; inter-AZ has cost). Tiered Storage adds S3 cost (storage + API), but typically reduces total cost by allowing smaller instances with less EBS. The most common hidden cost: **excessive retention on high-volume topics without Tiered Storage** — teams that configure `retention.ms=7d` on 100MB/s topics need ~60TB of EBS per replica. With Tiered Storage, local disk can be reduced to hours of retention, and the rest goes to S3 at ~$0.023/GB/month.

> **What I'd do differently — and what the industry still gets wrong:** After operating and designing systems on Kafka in financial contexts, there are three error patterns I see repeatedly — and that I would address differently.

**1. Tiered Storage as an afterthought, not an initial design.** Most teams enable Tiered Storage after EBS costs are already high and migration is painful. I would define the tiered storage policy at topic creation: `local.retention.ms` for the hot tier (e.g., 6-24h), `retention.ms` for the total (e.g., 30d in S3). This is a day-zero decision, not a day-90 one.

**2. Exactly-once misapplied.** I see teams enabling `transactional.id` on all producers for "safety", without understanding the cost. In financial systems, exactly-once is necessary in payment and reconciliation flows — not in telemetry or audit logs where at-least-once + sink deduplication is more efficient. The decision should be per flow, not per cluster.

**3. Consumer group design neglected.** Teams create one consumer group per service and put multiple responsibilities into it. I separate consumer groups by processing responsibility — one group for real-time processing (low `max.poll.records`, high SLA priority), another for batch/analytics (high `max.poll.records`, lag-tolerant). This allows independent tuning and prevents a slow consumer doing heavy processing from causing rebalances that affect the real-time flow.

The most important point I would emphasize for any team adopting Kafka: **Kafka solves the transport and retention problem; it does not solve the data contract problem**. Without a schema registry (Confluent Schema Registry or AWS Glue Schema Registry) with versioned schema evolution, you will break consumers in production at some point. This is not optional in serious production systems.

## Verdict

Apache Kafka is one of the best-designed software architectures of the last 15 years — the immutable log model is simple, correct, and scales in ways that traditional queue systems cannot. The transition to KRaft removes the last major operational fragility point (ZooKeeper) without compromising the consistency model. Tiered Storage solves the economic problem of long retention elegantly, with a well-defined and documented latency trade-off.

Exactly-once is genuinely useful and correctly implemented at the broker level — but it is frequently misapplied in production, generating unnecessary overhead or, worse, a false sense of security when the sink is not idempotent.

Amazon MSK is the right choice for most teams: it eliminates the operational overhead of broker management and integrates natively with the AWS ecosystem (IAM, VPC, S3, CloudWatch). The real limitation of MSK is not technical — it is that it abstracts enough that less experienced teams underestimate what is still their responsibility: producer/consumer tuning, consumer group design, schema management, and consumer lag monitoring.

## References

- [Apache Kafka — Official Documentation](https://kafka.apache.org/documentation/)
- [KIP-405: Kafka Tiered Storage](https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage)
- [Amazon MSK — AWS Documentation](https://docs.aws.amazon.com/msk/)
- [KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum](https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum)
- [KIP-429: Kafka Consumer Incremental Rebalance Protocol](https://cwiki.apache.org/confluence/display/KAFKA/KIP-429%3A+Kafka+Consumer+Incremental+Rebalance+Protocol)
- [Kafka Exactly-Once Semantics — Confluent Blog](https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/)
- [Amazon MSK Tiered Storage — AWS Docs](https://docs.aws.amazon.com/msk/latest/developerguide/msk-tiered-storage.html)

## Case sources

- [Apache Kafka — Documentation](https://kafka.apache.org/documentation/)
- [KIP-405: Kafka Tiered Storage](https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage)
- [Amazon MSK — AWS docs](https://docs.aws.amazon.com/msk/)
