Kafka Tiered Storage and Multi-Region Replication for Analytics Lakehouses

At MinervaDB, we design Apache Kafka clusters that act as the durable ingestion tier in front of analytics lakehouses built on Apache Iceberg, Delta Lake, and ClickHouse. At web scale, two capabilities separate a production-grade ingestion tier from a science experiment: tiered storage that decouples retention from broker disk capacity, and multi-region replication that protects analytics pipelines from regional failure. This article explains how our engineering team combines KIP-405 tiered storage (generally available in Kafka 3.9 and a first-class feature on Kafka 4.2) with MirrorMaker 2 and Cluster Linking to build Kafka topologies that serve lakehouse ingestion at petabyte scale. The focus is on production patterns — sizing, cost, failure modes, and configuration — not marketing claims.

Kafka Tiered Storage

The Economics of Long Retention in Kafka

Web-scale analytics lakehouses need two things from Kafka that are in tension: low-latency ingest of the most recent hour of events, and long retention of raw events for backfill, model retraining, and regulatory replay. Before tiered storage, the only way to hold 30 days of raw events was to provision enough NVMe across the cluster to hold the full retention window multiplied by the replication factor. The math was brutal: 1 GiB of logical retention on replicated NVMe costs 10–20× the same GiB on object storage, before factoring in the free-space headroom every broker must maintain.

Tiered storage eliminates that tradeoff by separating hot data on broker disks from cold data in object storage. The hot tier serves low-latency reads from the page cache and NVMe, while the cold tier holds the long-retention copy in S3, GCS, or Azure Blob. In our engagements, tiered storage has reduced Kafka infrastructure cost by 60–80% for lakehouse-backing clusters while extending retention from days to months.

KIP-405 Tiered Storage Architecture

KIP-405 introduced tiered storage as an early-access feature, reached general availability in Kafka 3.9, and is a first-class feature on Kafka 4.2. The architecture is clean: each broker runs a RemoteLogManager that uploads closed log segments to a configured object store and reads them back on demand when consumers fetch data older than the local retention window.

The critical insight is that tiered storage operates below the partition API. Producers and consumers are unchanged — a consumer reading from offset 0 on a partition with 30 days of retention transparently fetches recent segments from broker disk and older segments from object storage. The broker handles the storage tiering, the pluggable RemoteStorageManager handles the object store integration, and the RemoteLogMetadataManager tracks which segments live where.

# Broker-side tiered storage configuration
remote.log.storage.system.enable=true

# Pluggable manager classes — Apache Kafka ships only the plugin interfaces, so the
# RemoteStorageManager class below comes from the S3 plugin deployed on the brokers
remote.log.storage.manager.class.name=org.apache.kafka.tiered.storage.s3.S3RemoteStorageManager
remote.log.metadata.manager.class.name=org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager
remote.log.storage.manager.impl.prefix=rsm.config.
remote.log.metadata.manager.impl.prefix=rlmm.config.

# S3-backed remote storage plugin settings (key names are plugin-specific)
rsm.config.s3.bucket.name=minerva-kafka-cold-tier-prod
rsm.config.s3.region=us-east-1
rsm.config.s3.credentials.provider=DEFAULT
rsm.config.s3.multipart.upload.part.size.bytes=10485760

# Metadata topic for remote segment tracking
remote.log.metadata.topic.replication.factor=3
remote.log.metadata.topic.num.partitions=50
remote.log.metadata.topic.retention.ms=-1
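
After a rolling restart with these settings, we confirm that the metadata topic was created with the expected replication factor and partition count. A minimal check, assuming the default topic-based RemoteLogMetadataManager, which creates an internal topic named __remote_log_metadata:

# Verify the internal metadata topic created by the default RemoteLogMetadataManager
kafka-topics.sh --bootstrap-server kafka-1:9092 \
  --describe --topic __remote_log_metadata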

Configuring Tiered Storage for Lakehouse Ingestion

Enabling tiered storage at the cluster level is not enough — each topic must opt in and set retention boundaries for the local and remote tiers. For lakehouse ingestion topics, we size the local retention window to cover the longest consumer lag plus operational headroom, and set the remote retention to the full compliance or replay window.

# Create a lakehouse ingestion topic with tiered storage
kafka-topics.sh --bootstrap-server kafka-1:9092 \
  --create --topic events_raw_iceberg \
  --partitions 120 \
  --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config remote.storage.enable=true \
  --config local.retention.ms=86400000 \
  --config retention.ms=2592000000 \
  --config segment.bytes=1073741824 \
  --config compression.type=zstd

In this configuration, the topic keeps 24 hours of hot data on broker NVMe (local.retention.ms=86400000) and retains 30 days in total across both tiers (retention.ms=2592000000). A lakehouse consumer that falls behind by six hours still reads entirely from hot data. A backfill job that reads from the beginning of the 30-day window transparently streams older segments from S3, at slightly higher latency but with no operational intervention.

Segment size matters for tiered storage efficiency. We standardize on 1 GiB segments (segment.bytes=1073741824) because each segment is uploaded as a single object, and smaller segments produce excessive S3 request counts. For very high-throughput topics, we raise the segment size to 2 GiB to further reduce upload overhead.
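
Existing topics can adopt tiered storage without being recreated. A minimal sketch with kafka-configs.sh, shown here against a hypothetical events_raw_clickhouse topic using the same 24-hour local and 30-day total retention split:

# Enable tiered storage and per-tier retention on an existing topic
kafka-configs.sh --bootstrap-server kafka-1:9092 \
  --alter --entity-type topics --entity-name events_raw_clickhouse \
  --add-config remote.storage.enable=true,local.retention.ms=86400000,retention.ms=2592000000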

Multi-Region Replication with MirrorMaker 2

Tiered storage solves retention cost but does not protect against regional failure. For analytics lakehouses that must survive a full AWS region outage, we replicate Kafka data across regions using MirrorMaker 2 or Cluster Linking. MirrorMaker 2 runs on the Kafka Connect framework, consumes from a source cluster, and produces to a destination cluster with configurable topic prefixing and offset translation.

# MirrorMaker 2 connector configuration
name=mm2-primary-to-secondary
connector.class=org.apache.kafka.connect.mirror.MirrorSourceConnector
source.cluster.alias=primary
target.cluster.alias=secondary
source.cluster.bootstrap.servers=kafka-1.us-east-1:9092,kafka-2.us-east-1:9092
target.cluster.bootstrap.servers=kafka-1.us-west-2:9092,kafka-2.us-west-2:9092
topics=events_raw_iceberg,features_online_v2
replication.factor=3
refresh.topics.interval.seconds=60
sync.topic.configs.enabled=true
sync.topic.acls.enabled=true
offset-syncs.topic.replication.factor=3
tasks.max=16
producer.compression.type=zstd
producer.acks=all

For a standby region used primarily for disaster recovery, we run MirrorMaker 2 with one-way replication and a dedicated Connect cluster sized to the target throughput. The offset-syncs topic and MirrorCheckpointConnector together allow consumer offsets to be translated across clusters, so that when we fail over, analytics consumers resume from the correct position in the replica.
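
A sketch of the companion checkpoint connector, assuming the same cluster aliases and bootstrap servers as the source connector above. With sync.group.offsets.enabled=true, MirrorMaker 2 also applies the translated offsets to idle consumer groups on the destination cluster, which removes the manual offset-reset step from failover:

# MirrorMaker 2 checkpoint connector for cross-cluster offset translation
name=mm2-checkpoint-primary-to-secondary
connector.class=org.apache.kafka.connect.mirror.MirrorCheckpointConnector
source.cluster.alias=primary
target.cluster.alias=secondary
source.cluster.bootstrap.servers=kafka-1.us-east-1:9092,kafka-2.us-east-1:9092
target.cluster.bootstrap.servers=kafka-1.us-west-2:9092,kafka-2.us-west-2:9092
groups=.*
emit.checkpoints.interval.seconds=60
sync.group.offsets.enabled=true
sync.group.offsets.interval.seconds=60
checkpoints.topic.replication.factor=3
tasks.max=4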

MirrorMaker 2 is reliable but not free. It consumes inter-region bandwidth at 1.0–1.1× the source throughput (the extra 10% covers headers and metadata), and it adds a Kafka Connect cluster to operate. For clusters approaching 1 GB/s of source throughput, we size the destination Connect cluster with at least four workers and monitor replication-latency-ms as the primary SLI.
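
The Connect REST API gives a quick first-line health check on the replication flow before digging into latency metrics; the hostname, port, and jq pipe below are illustrative:

# Confirm the MirrorSourceConnector and all of its tasks report RUNNING
curl -s http://connect-usw2-1:8083/connectors/mm2-primary-to-secondary/status | jq .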

Cluster Linking for Lower-Overhead Replication

Cluster Linking, available on Confluent Platform and Confluent Cloud rather than in Apache Kafka, moves replication from Connect workers into the brokers themselves. The destination cluster’s brokers fetch directly from the source cluster, preserving partition layout, offsets, and timestamps without transformation. For organizations that can adopt it, Cluster Linking removes the MirrorMaker 2 operational overhead.

The tradeoffs are platform dependence and topology constraints. Cluster Linking works well for straightforward one-way or active-passive replication but is less flexible than MirrorMaker 2 for complex topic routing. At MinervaDB, we make the choice per customer based on the platform (Apache vs. Confluent), topology complexity, and operational maturity.

Lakehouse Consumer Patterns

Analytics lakehouse ingestion from Kafka usually runs through one of three consumer patterns: direct Iceberg sinks via Kafka Connect, streaming ingestion with Apache Flink or Spark Structured Streaming, and batch micro-ingestion via scheduled Trino or Spark jobs. Tiered storage makes the third pattern substantially more efficient because batch jobs can pull hours-old segments directly from object storage without traversing the hot path.

# Spark Structured Streaming with tiered storage awareness
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-ingestion").getOrCreate()

# Kafka consumer properties must carry the "kafka." prefix to reach the underlying consumer
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka-1:9092,kafka-2:9092,kafka-3:9092")
    .option("subscribe", "events_raw_iceberg")
    .option("startingOffsets", "earliest")
    .option("maxOffsetsPerTrigger", 500000)  # cap records per micro-batch
    .option("kafka.fetch.min.bytes", "524288")
    .option("kafka.fetch.max.wait.ms", "500")
    .load())

query = (events
    .writeStream
    .format("iceberg")
    .outputMode("append")
    .option("path", "s3://lakehouse/warehouse/events_raw")
    .option("checkpointLocation", "s3://lakehouse/checkpoints/events_raw")
    .trigger(processingTime="60 seconds")
    .start())

query.awaitTermination()

The maxOffsetsPerTrigger throttle is essential when the job is catching up from hours or days of lag — otherwise the first micro-batch will attempt to drain the full backlog and either OOM the executors or saturate the tiered-storage fetch path. We size the throttle to 5–10 minutes of steady-state ingest, which keeps micro-batch durations predictable regardless of backlog depth.
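
As a rough illustration with assumed numbers: at a steady-state ingest of about 1,000 records per second, a 5–10 minute window works out to 300,000–600,000 records, which is the range the 500,000-record maxOffsetsPerTrigger in the example above falls into; the same arithmetic scales linearly with the topic's actual ingest rate.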

Failure Scenarios: Region Outage and Replay

We design multi-region Kafka topologies against two primary failure modes: full source-region outage and accidental data corruption requiring replay. For regional outage, the lakehouse consumers cut over to the replica region using the MirrorMaker 2 checkpoint-translated offsets. Consumer groups in the destination region resume from the translated position, and downstream analytics continue with a short gap but no data loss.

For replay scenarios — for example, a feature extraction bug that wrote corrupted features for 18 hours — tiered storage extends the practical replay window from days to months. The operator rewinds the consumer group to the offset corresponding to the start of the corrupted window and lets the job reprocess from remote-tier segments. Without tiered storage, this same replay would have required a restore from lakehouse snapshots, which is dramatically more expensive and slower.
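
A minimal sketch of that rewind, using a hypothetical consumer group name and corruption start time. The --to-datetime option resolves the timestamp to the matching offset in each partition, whether that offset now sits on broker disk or in the remote tier, and the consumer group must be stopped while the reset runs:

# Rewind the ingestion consumer group to the start of the corrupted window
kafka-consumer-groups.sh --bootstrap-server kafka-1:9092 \
  --group feature-extraction-v2 \
  --topic events_raw_iceberg \
  --reset-offsets --to-datetime 2025-03-01T00:00:00.000 \
  --execute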

Key Takeaways

  • Tiered storage (KIP-405, GA in Kafka 3.9) separates hot broker storage from cold object storage, cutting long-retention cost by 60–80%.
  • Size local.retention.ms to cover the longest expected consumer lag plus operational headroom; set retention.ms to the full compliance window.
  • Use 1–2 GiB segment sizes to minimize object store request overhead.
  • MirrorMaker 2 provides flexible multi-region replication with offset translation for clean failover.
  • Cluster Linking reduces replication operational overhead where the platform supports it.
  • Lakehouse ingestion through Flink or Spark Structured Streaming must throttle per-trigger to handle remote-tier backlogs safely.
  • Tiered storage extends the practical replay window from days to months, which is a game-changer for data quality incidents.

How MinervaDB Can Help

We build Kafka ingestion tiers that anchor lakehouse architectures on Apache Iceberg, Delta Lake, and ClickHouse for web-scale organizations. Our engineering team sizes tiered storage against the true replay and retention requirements, designs MirrorMaker 2 or Cluster Linking topologies for regional resilience, and tunes lakehouse consumers for predictable micro-batch performance. When the business needs 30-day replay with regional failover and linear cost scaling, we bring the practitioner experience to make it work in production. Explore MinervaDB database infrastructure engineering or contact our Kafka architecture team to discuss a tiered storage and multi-region design.

Frequently Asked Questions

What is Kafka tiered storage and when did it become production-ready?

Tiered storage, introduced in KIP-405, separates hot data on broker disks from cold data in object storage such as S3, GCS, or Azure Blob. It reached general availability in Kafka 3.9 and is a first-class feature on Kafka 4.2. Producers and consumers are unchanged; the broker transparently serves reads from either tier based on the requested offset.

How much does tiered storage reduce Kafka infrastructure cost?

For clusters with long retention windows, tiered storage typically reduces infrastructure cost by 60–80%. The savings come from replacing replicated NVMe storage (which multiplies by the replication factor and requires free-space headroom) with single-copy object storage that is 10–20× cheaper per GiB.

Should I use MirrorMaker 2 or Cluster Linking for multi-region replication?

MirrorMaker 2 is the right choice for Apache Kafka deployments and for topologies with complex routing or topic prefixing. Cluster Linking, available on Confluent Platform, offers lower operational overhead but is less flexible. At MinervaDB, we select based on platform, topology complexity, and operational maturity.

What segment size should I use for tiered storage topics?

We standardize on 1 GiB segments (segment.bytes=1073741824) for most tiered storage topics, because each segment becomes a single object in remote storage and smaller segments produce excessive S3 request overhead. For very high-throughput topics with multi-GB/s ingest, 2 GiB segments further reduce upload frequency.

Can tiered storage be used with log compaction?

Log compaction is not supported with tiered storage: in Kafka 3.9 GA and the 4.x line, remote storage can only be enabled on topics that use retention-based (time and size) cleanup, and it cannot be enabled on topics with cleanup.policy=compact. Teams relying on compacted topics should keep them on local storage and verify current support in the target Kafka version before planning around tiered storage for them.

How does MirrorMaker 2 handle consumer offset translation across regions?

MirrorMaker 2 includes a MirrorCheckpointConnector that periodically writes offset-translation records to an offset-syncs topic. On failover, consumer groups in the destination region use the translated offsets to resume from the correct logical position in the replica. This is essential for analytics consumers that must not reprocess or skip data during regional cutover.
