Monitoring replication lag at scale in PostgreSQL

Across the PostgreSQL environments MinervaDB has audited recently, monitoring replication lag at scale was the single largest source of avoidable incidents. There is a particular kind of replication failure that gives senior database engineers nightmares: the one where everything looks fine on the dashboards, replication lag is zero, and the standby is actually serving stale data because the WAL stream silently stopped applying half an hour ago. Designing the monitoring to catch this case is harder than designing the replication itself.
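
One way to catch that case is to measure apply progress on the standby itself rather than trusting the primary's view. A minimal sketch in SQL, with one caveat: on an idle primary the replay timestamp ages naturally, so in practice you pair this with a small heartbeat table the primary updates every minute (the heartbeat is our convention, not a built-in):

-- PostgreSQL: run on the standby; is WAL still arriving and being applied?
SELECT pg_last_wal_receive_lsn()                 AS received_lsn,
       pg_last_wal_replay_lsn()                  AS replayed_lsn,
       pg_last_xact_replay_timestamp()           AS last_replayed_commit,
       now() - pg_last_xact_replay_timestamp()   AS apparent_staleness;
-- received_lsn frozen means streaming has stopped; replayed_lsn frozen
-- while received_lsn advances means replay is stalled (often a conflict).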

Every operator who has run replication in anger has at least one story about a failover that went wrong because the wrong replica was promoted. Quorum, sync standby names, leader election timeouts and witness nodes are not academic concerns; they are the difference between a five-minute incident and a six-hour split-brain recovery.

The asymmetry between primary and replica is where most replication bugs hide. The primary writes WAL, ships it, and forgets. The replica receives WAL, applies it, and is responsible for keeping up. The interesting failure modes are all on the replica side: apply lag, recovery conflicts, slot bloat, and the surprisingly common case of a replica that thinks it is in sync because it stopped receiving WAL entirely.

How it works under the hood

Logical replication ships row-level changes through a publication/subscription model. Unlike physical streaming replication, it survives major version differences, can replicate subsets of tables, and can have multiple subscribers per publication. The price is that every change must be decoded from WAL into logical messages, a CPU-bound process that can become the bottleneck on write-heavy systems.
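
The moving parts, for reference; the publication, subscription and connection-string values here are illustrative:

-- PostgreSQL: minimal logical replication setup (names are illustrative)
-- On the publisher:
CREATE PUBLICATION orders_pub FOR TABLE public.orders;

-- On the subscriber (this creates a logical slot on the publisher):
CREATE SUBSCRIPTION orders_sub
    CONNECTION 'host=primary.internal port=5432 dbname=app user=replicator'
    PUBLICATION orders_pub;

-- Back on the publisher: how far has each logical slot confirmed?
SELECT slot_name, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_type = 'logical';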

The WAL sender on the primary is, in essence, a tail -f on the WAL files. When a replica connects, the sender consults the replica's replay_lsn and starts streaming from there. If the WAL the replica needs has already been recycled, replication breaks and the replica must be rebuilt from a base backup. Replication slots prevent this by pinning the WAL on the primary until the replica acknowledges it.
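
The flip side is that a slot for a replica that never comes back pins WAL indefinitely and can fill the primary's disk. From PostgreSQL 13 onward, max_slot_wal_keep_size bounds that retention; the 50GB figure below is an assumption to size against your own disk budget:

-- PostgreSQL: cap how much WAL any slot may pin (PG 13+)
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();

-- Slots that crossed the cap report wal_status = 'lost'; the
-- corresponding replica must then be rebuilt from a base backup.
SELECT slot_name, active, wal_status, safe_wal_size
FROM pg_replication_slots;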

// MongoDB: replica set health snapshot
rs.status();

// Per-member lag (operationally what you actually care about)
rs.printSecondaryReplicationInfo();

// What's the current write concern majority size?
db.runCommand({ getDefaultRWConcern: 1 });

The data structures involved

MySQL's parallel applier (slave_parallel_workers) is what you reach for when the single-threaded SQL_THREAD becomes the bottleneck. With slave_parallel_type=LOGICAL_CLOCK and binlog_transaction_dependency_tracking=WRITESET, throughput on the replica can approach the primary's. Without those settings, a write-heavy primary will outrun any replica that has even one slow disk.

SQL Server Always On Availability Groups can run synchronous and asynchronous secondaries side-by-side. The synchronous ones provide automatic failover and zero data loss; the asynchronous ones give you a geo-distant DR copy without inflating your write latency. The Listener provides a transparent endpoint that follows the primary on failover, but applications still need retry logic for the failover window itself, which is typically eight to fifteen seconds.
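
Returning to the MySQL settings above, a sketch of how they are applied (STOP/START REPLICA syntax is MySQL 8.0.22+; use STOP/START SLAVE on older versions, and note the parallel-applier variables require the SQL thread to be stopped):

-- MySQL: on the replica, enable the writeset-aware parallel applier
STOP REPLICA SQL_THREAD;
SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL slave_parallel_workers = 8;  -- size to the replica's cores
START REPLICA SQL_THREAD;

-- On the primary: key transaction dependencies on writesets
SET GLOBAL transaction_write_set_extraction = 'XXHASH64';  -- pre-8.0.26
SET GLOBAL binlog_transaction_dependency_tracking = 'WRITESET';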

The cost model nobody documents clearly

MongoDB's replica set heartbeat protocol is a quiet engineering achievement. Every two seconds, every member exchanges heartbeats with every other member, and the resulting view of the cluster drives election, write concern routing and read preference selection. The electionTimeoutMillis and heartbeatIntervalMillis values determine how quickly the cluster reacts to a primary loss — lower values mean faster failover but more false positives during transient network glitches.

Synchronous replication in PostgreSQL is actually three different commit modes selected by synchronous_commit: remote_write waits until the standby's WAL receiver has written to OS buffers, on waits until the standby has fsynced, and remote_apply waits until it has actually applied the WAL. Each level adds latency; remote_apply is the only one that lets you read your writes from the standby.
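
Because synchronous_commit can be set per session and even per transaction, a common pattern is to keep the cluster default cheap and pay the remote_apply latency only on writes that must be immediately visible on the standby; the table in this sketch is illustrative:

-- PostgreSQL: cluster default stays at 'on'; escalate per transaction
BEGIN;
SET LOCAL synchronous_commit = 'remote_apply';
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
COMMIT;  -- returns only after a sync standby has replayed the commit
-- If synchronous_standby_names is empty, remote_apply degrades to a
-- plain local commit and waits for nobody.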

# PostgreSQL: build a fresh replica without breaking the primary
pg_basebackup \
  --host=primary.internal \
  --port=5432 \
  --username=replicator \
  --pgdata=/var/lib/postgresql/16/replica \
  --write-recovery-conf \
  --create-slot \
  --slot=replica_2026 \
  --checkpoint=fast \
  --progress \
  --verbose

Failure modes we have actually seen

On a multi-tenant SaaS we found logical replication running between two regions with wal_sender_timeout and wal_receiver_timeout at the default 60 seconds, talking over a link with occasional 90-second stalls. The subscription dropped and reconnected dozens of times a day, each time replaying a few minutes of WAL twice. Raising the timeouts to 300 seconds turned a noisy cluster into a quiet one.
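
The fix amounted to two reloadable settings, one on each side of the link:

-- PostgreSQL: tolerate WAN stalls instead of cycling the subscription
ALTER SYSTEM SET wal_sender_timeout = '300s';    -- on the publisher
ALTER SYSTEM SET wal_receiver_timeout = '300s';  -- on the subscriber
SELECT pg_reload_conf();                         -- run on each side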

  • Configuring synchronous replication with a single sync standby and no synchronous_standby_names = 'ANY 1 (...)' quorum (see the sketch after this list). The first time the standby goes offline for maintenance, every write on the primary blocks indefinitely.
  • Running replicas on significantly weaker hardware than the primary. The savings are borrowed, not banked: eventually the replica falls behind during peak traffic and catching up takes hours.
  • Letting max_wal_senders drift below the actual number of replicas plus headroom for backup tools. We have seen pgBackRest and a streaming replica argue over the last available slot at 03:00 on the worst possible morning.
  • Promoting an asynchronous replica during an incident without first checking the LSN gap. Application writes that landed on the old primary just before failover are silently lost. The right pattern is to fence the primary before promoting anyone.
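
The quorum configuration from the first bullet, plus the pre-promotion gap check from the last one; the standby names are illustrative:

-- PostgreSQL: any one of two named standbys may acknowledge a commit
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (standby_a, standby_b)';
SELECT pg_reload_conf();

-- Before promoting: how far behind is each candidate? Run on the
-- primary while it is still reachable; a non-zero gap means lost writes.
SELECT application_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS gap
FROM pg_stat_replication;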

What we actually tune in production

Recovery conflicts are the silent killer of read-replica strategies. When a vacuum on the primary removes a tuple that a query on the replica is still reading, the replica must either pause WAL replay (lag grows) or kill the query (application errors). hot_standby_feedback = on avoids the conflict at the cost of bloat on the primary; max_standby_streaming_delay tunes the trade-off.
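
Both sides of that trade-off are observable and tunable; the 120-second delay below is an assumption, not a recommendation:

-- PostgreSQL: on the standby, which databases lose queries to replay?
SELECT datname, confl_snapshot, confl_lock, confl_bufferpin
FROM pg_stat_database_conflicts
ORDER BY confl_snapshot DESC;

-- Let standby queries win for up to two minutes before replay resumes...
ALTER SYSTEM SET max_standby_streaming_delay = '120s';
-- ...or move the cost to the primary as table bloat instead
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();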

MinervaDB engineers maintain a library of internal runbooks for PostgreSQL that are updated whenever a customer engagement reveals a new pattern; if you would like a copy of the relevant runbook for monitoring replication lag at scale, contact our team and we will share the sanitised version that we use during incident response.

When MinervaDB takes over a PostgreSQL estate as part of an enterprise support engagement, the first thirty days almost always include a structured review of replication lag monitoring, because the gains here are usually larger and faster than any other intervention available in the first month.

Finally, remember that documentation is a force multiplier. Every diagnostic command, every tuning decision, every runbook step that lives in a shared system rather than in someone's head is a step closer to a PostgreSQL estate that does not depend on a single hero engineer being awake.

It is worth emphasising that monitoring replication lag at scale in PostgreSQL is not a static topic. The engine, the cloud platforms it runs on, the storage technologies it uses and the workloads pushed through it all evolve, which means any configuration you ship today should be considered a snapshot rather than a permanent answer.

-- PostgreSQL: who is replicating, how far behind, and is the slot active?
SELECT application_name,
       client_addr,
       state,
       sent_lsn,
       write_lsn,
       flush_lsn,
       replay_lsn,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS replay_gap
FROM pg_stat_replication
ORDER BY application_name;

-- Replication slots: any inactive ones pinning WAL?
SELECT slot_name, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(
           pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
-- order by the raw byte gap; sorting the pretty-printed text is wrong
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;

There is no silver bullet for monitoring replication lag at scale: only careful engineering, honest measurement, and a willingness to revisit decisions as the workload changes.

Frequently asked questions

Do you support both self-managed and cloud-managed deployments?

Yes. We work across PostgreSQL, MySQL/MariaDB, MongoDB, SQL Server, ClickHouse, Cassandra, Redis/Valkey, Milvus, Trino and SAP HANA, on bare-metal, virtualised infrastructure, Kubernetes, and managed cloud services (Aurora, RDS, Azure SQL, Cloud SQL).

How quickly can MinervaDB engineers respond to a production incident on this topic?

MinervaDB runs a 24x7 support practice with documented SLAs that vary by contract; for SEV-1 incidents on supported clusters the first engineer response is measured in minutes, not hours.

Do you work with regulated industries with strict change-control requirements?

Yes. Several MinervaDB customers operate under PCI-DSS, HIPAA, SOC 2, RBI, GDPR or local equivalents. We work inside change-control processes, document every change, and provide audit-ready evidence on request.

Can your team take over on-call for our database tier?

Yes — our 24x7 enterprise support practice is designed exactly for this. We can take pager ownership at L1/L2 with documented escalation paths into your engineering team for application-side issues.

When to bring MinervaDB into the conversation

The MinervaDB consulting practice is built around 24x7 enterprise support, performance engineering and database reliability engineering for PostgreSQL. We are based in India with engineers across global timezones, and we have been doing this work since before "DBRE" was a job title.

How we typically help:

  • 24x7 Enterprise-Class Support with strict SLAs for incident response, root-cause analysis and recovery.
  • Performance Engineering and Tuning for high-throughput, low-latency, mixed OLTP and analytical workloads.
  • High Availability and Disaster Recovery Architecture across regions, clouds and hybrid topologies.
  • Database Reliability Engineering (DBRE) with observability, runbooks, capacity planning and incident review.
  • Cost Optimisation for self-managed and cloud database platforms, with hardware-right-sizing and licensing reviews.
  • Data Security, Audit and Compliance readiness for regulated workloads (PCI-DSS, HIPAA, SOC 2, RBI, GDPR).
  • Database Migrations and Upgrades with zero-downtime cutover playbooks.

Reach us: contact@minervadb.com or minervadb.com/contact. Mention monitoring replication lag at scale when you write and the consulting engineer will arrive with the right context.

MinervaDB — The WebScale Database Infrastructure Operations Experts.
