Troubleshooting PostgreSQL Logical Replication Delay: A Practical Guide

Logical replication delay in PostgreSQL is one of those problems that surfaces quietly and then escalates fast. A subscriber keeping pace at 50 ms of lag can balloon to minutes — or hours — under a heavy workload, and the diagnostic path is rarely obvious. Unlike physical replication, where pg_stat_replication gives you a single authoritative lag column, logical replication distributes the problem across WAL sender processes, logical decoding, the reorder buffer, apply workers on the subscriber, and the replication slot itself. We have observed this confusion extend production incidents far longer than necessary, purely because the engineer on call did not know which view to query first. This guide works through the full diagnostic chain: what logical replication delay actually means, where it originates, how to measure it precisely, and which tuning levers reliably reduce it.

Logical Replication Delay vs. Physical Replication Lag

Physical replication moves raw WAL blocks from the primary to a standby, which replays those blocks in order. The lag columns in pg_stat_replication — write_lag, flush_lag, and replay_lag — reflect elapsed wall-clock time between when the primary wrote a WAL record and when the standby acknowledged each stage. The standby applies changes at the storage level with no knowledge of table schemas or transactions, so it converges quickly under normal conditions.
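
For reference, a minimal query over those columns looks like this (the sort order is a suggestion, not part of the view):

-- Primary: stage-by-stage physical replication lag per standby
SELECT
    application_name,
    client_addr,
    state,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication
ORDER BY replay_lag DESC NULLS LAST;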

Logical replication is fundamentally different. The WAL sender on the publisher must run logical decoding — interpreting raw WAL through the output plugin (built-in pgoutput) to produce a stream of row-level change events. These events are grouped by transaction, reassembled in commit order, and sent to the subscriber. The subscriber’s apply worker then opens the target table, resolves any schema mappings, executes the appropriate DML, and manages its own transaction boundaries. Every one of those stages introduces latency with no equivalent in physical replication.
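
To see decoding in action outside production, a sandbox sketch using the contrib test_decoding plugin can help; it emits readable text, unlike pgoutput's binary protocol, and assumes wal_level = logical and a throwaway database:

-- Sandbox only: create a slot, peek at decoded changes, then drop the slot
SELECT pg_create_logical_replication_slot('decode_demo', 'test_decoding');

-- ... run some DML in this database, then inspect without consuming:
SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes('decode_demo', NULL, 10);

-- Always drop sandbox slots: an abandoned slot retains WAL indefinitely
SELECT pg_drop_replication_slot('decode_demo');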

The practical consequence is that logical replication delay is the sum of decoding time, network transit time, and apply time — and each component can become the bottleneck depending on workload shape. A burst of large transactions stresses the reorder buffer. Hundreds of small transactions per second stress the apply worker’s DML throughput. A subscriber serving long-running queries can hold locks that stall the apply worker entirely. Diagnosing the delay correctly requires isolating which component is actually the constraint.

Reading the Key Diagnostic Views

pg_stat_replication and LSN Columns

On the publisher, pg_stat_replication exposes one row per WAL sender process. The columns sent_lsn, write_lsn, flush_lsn, and replay_lsn track how far each stage has progressed. For logical replication, replay_lsn and replay_lag reflect when the subscriber confirmed it applied the change — which depends on the apply worker completing the DML and sending an acknowledgment back upstream. A large gap between sent_lsn and replay_lsn tells you the subscriber is behind; the lag interval columns tell you how long it has been behind at each stage.
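
A minimal sketch of that comparison (for logical replication, application_name is usually the subscription name):

-- Publisher: byte gap between what was sent and what the subscriber applied
SELECT
    application_name,
    pg_size_pretty(sent_lsn - replay_lsn) AS unapplied,
    replay_lag
FROM pg_stat_replication
ORDER BY sent_lsn - replay_lsn DESC;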

A more precise view of slot health comes from pg_replication_slots. The column confirmed_flush_lsn is the last LSN the subscriber has durably applied. Comparing it to pg_current_wal_lsn() on the publisher reveals the total unacknowledged WAL the slot is retaining. When this difference is large and not shrinking, the subscriber is falling behind faster than it can catch up.

-- Publisher: assess replication slot lag and WAL retention pressure
SELECT
    slot_name,
    plugin,
    active,
    pg_size_pretty(
        pg_current_wal_lsn() - confirmed_flush_lsn
    ) AS replication_lag_size,
    (pg_current_wal_lsn() - confirmed_flush_lsn) AS replication_lag_bytes,
    wal_status,
    safe_wal_size
FROM pg_replication_slots
WHERE slot_type = 'logical'
ORDER BY replication_lag_bytes DESC;

Pay close attention to wal_status. When it transitions from reserved to extended, PostgreSQL is retaining WAL segments beyond max_wal_size to service the slot. A value of lost means the slot has been invalidated — the subscriber has lost its replication position and must be rebuilt from scratch.

pg_stat_subscription and pg_stat_subscription_stats

On the subscriber, pg_stat_subscription shows the state of each subscription’s apply worker. The column received_lsn indicates the last LSN received from the publisher, and latest_end_time shows when the subscriber last sent a feedback message. The gap between what the subscriber has received and what it has confirmed applied represents pure apply-worker lag. PostgreSQL 15 introduced pg_stat_subscription_stats, which tracks error counts per subscription. Elevated apply_error_count values signal schema mismatches or constraint violations causing apply workers to restart repeatedly — each restart pauses replication while the worker reconnects.

-- Subscriber: check apply worker status and feedback freshness
SELECT
    s.subname,
    s.pid,
    s.received_lsn,
    s.latest_end_lsn,
    s.latest_end_time,
    now() - s.latest_end_time AS feedback_age,
    ss.apply_error_count,
    ss.sync_error_count
FROM pg_stat_subscription s
LEFT JOIN pg_stat_subscription_stats ss
    ON ss.subid = s.subid
ORDER BY s.subname, s.pid;

Common Root Causes

Large Transactions and Reorder Buffer Spill

Logical decoding must reconstruct complete transactions before delivering them to the subscriber. During decoding, in-progress transaction data accumulates in the reorder buffer — an in-memory structure whose size is governed by logical_decoding_work_mem (introduced in PostgreSQL 13, defaulting to 64 MB). When a transaction’s decoded changes exceed this limit, PostgreSQL spills the overflow to disk under pg_replslot/<slot_name>/. Each spill introduces disk I/O overhead during both write and the subsequent read at commit time. Publishers running large batch loads — multi-million-row deletes, bulk inserts via COPY, or large UPDATE sweeps — will produce spill files under the default configuration.
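
On PostgreSQL 14 and later, pg_stat_replication_slots exposes cumulative spill counters directly, which is a quicker first check than the filesystem:

-- Publisher, PostgreSQL 14+: spill activity per logical slot since stats reset
SELECT
    slot_name,
    spill_txns,    -- transactions that exceeded logical_decoding_work_mem
    spill_count,   -- individual spill-to-disk events
    pg_size_pretty(spill_bytes) AS spilled_total
FROM pg_stat_replication_slots
ORDER BY spill_bytes DESC;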

To inspect spill files on the publisher’s filesystem:

# Inspect reorder buffer spill files for all logical replication slots
# Run on the publisher as the postgres OS user
PGDATA=$(psql -Atc "SHOW data_directory;")
for slot_dir in "${PGDATA}/pg_replslot"/*/; do
    slot_name=$(basename "$slot_dir")
    spill_count=$(find "$slot_dir" -name "*.spill" 2>/dev/null | wc -l)
    spill_size=$(du -sh "$slot_dir" 2>/dev/null | awk '{print $1}')
    echo "Slot: ${slot_name} | Spill files: ${spill_count} | Dir size: ${spill_size}"
done

A non-zero spill count during steady-state operation means logical_decoding_work_mem is too low for the workload. Increasing it to 256 MB or 512 MB on publishers that regularly process large transactions typically eliminates spill. The tradeoff is memory pressure per active logical decoding session, so size this parameter relative to the number of concurrent logical slots multiplied by peak transaction size. Alternatively, enable the streaming = on subscription option (PostgreSQL 14+) so that large in-progress transactions stream incrementally to the subscriber rather than accumulating in the reorder buffer until commit.
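
Both remedies as a sketch; the 256 MB figure and the subscription name are assumptions to adapt:

-- Publisher: raise the reorder buffer budget, then reload
ALTER SYSTEM SET logical_decoding_work_mem = '256MB';
SELECT pg_reload_conf();

-- Subscriber, PostgreSQL 14+: stream large in-progress transactions
ALTER SUBSCRIPTION my_subscription SET (streaming = on);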

Long-Running Queries Blocking Apply Workers

This is one of the most frequently overlooked causes of delay in mixed-workload environments where the subscriber also serves read traffic. The apply worker executes ordinary DML on the subscriber, and if a concurrent session holds a conflicting lock, the apply worker queues behind it. A plain reader’s ACCESS SHARE lock will not block row changes, but it will block a replicated TRUNCATE, and long-running transactions that take row locks (SELECT ... FOR UPDATE, or their own writes) block the apply worker’s UPDATEs and DELETEs directly. Because apply is serial, every change behind the stuck statement waits too. PostgreSQL will not cancel user queries to unblock logical replication. We have seen subscriber lag spike from under a second to 45 minutes in production because a single analytical query ran without a statement timeout.

-- Subscriber: find queries blocking the logical replication apply worker
SELECT
    blocking.pid            AS blocking_pid,
    blocking.query          AS blocking_query,
    now() - blocking.query_start AS blocking_duration,
    blocked.pid             AS apply_worker_pid,
    blocked.wait_event_type,
    blocked.wait_event
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
    ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.backend_type LIKE 'logical replication%'
ORDER BY blocking_duration DESC;

The production fix is to enforce statement_timeout and idle_in_transaction_session_timeout on all non-replication roles on the subscriber, and to route long-running analytical queries to a physical standby that does not participate in logical replication.
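
A minimal sketch of that governance, using a hypothetical reporting role:

-- Subscriber: per-role timeouts so readers cannot block apply indefinitely
ALTER ROLE reporting_user SET statement_timeout = '300s';
ALTER ROLE reporting_user SET idle_in_transaction_session_timeout = '60s';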

Replication Slot Bloat and WAL Retention

An inactive or severely lagged slot forces the publisher to retain WAL segments that would otherwise be recycled. As WAL accumulates, catch-up decoding must scan an ever-larger WAL range, increasing CPU and I/O pressure. Configure max_slot_wal_keep_size (PostgreSQL 13+) to a hard ceiling so that a slot is invalidated before it can exhaust publisher disk space. Alert on wal_status = 'extended' and on lag bytes crossing a threshold relative to available disk.
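
For example (the 50 GB ceiling is an assumption; size it to the WAL volume's headroom):

-- Publisher, PostgreSQL 13+: cap WAL retained on behalf of slots
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();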

Slow Apply Throughput and Parallel Apply

A subscription’s apply worker is a single process that applies changes serially in commit order. Under high write rates from the publisher, the apply worker’s DML throughput becomes the ceiling. PostgreSQL 16 introduced parallel apply via max_parallel_apply_workers_per_subscription: with the subscription option streaming = parallel, large in-progress transactions are handed to parallel apply workers and applied as they stream in, rather than after commit. The workers coordinate through shared-memory queues and preserve commit ordering automatically; no application-level change is required. On PostgreSQL versions before 16, the only mitigation is to split publications across multiple subscriptions targeting distinct table sets, giving each set its own dedicated apply worker. This partitioning approach is effective but requires careful table grouping to avoid cross-table foreign key dependencies that could cause constraint failures during apply.
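
A sketch of both approaches, with hypothetical subscription and publication names:

-- Subscriber, PostgreSQL 16+: parallel apply for large streamed transactions
ALTER SYSTEM SET max_parallel_apply_workers_per_subscription = 4;
SELECT pg_reload_conf();
ALTER SUBSCRIPTION my_subscription SET (streaming = parallel);

-- Before PostgreSQL 16: one subscription (and apply worker) per table group
-- CREATE SUBSCRIPTION sub_orders CONNECTION '...' PUBLICATION pub_orders;
-- CREATE SUBSCRIPTION sub_events CONNECTION '...' PUBLICATION pub_events;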

Schema Mismatches Between Publisher and Subscriber

Logical replication does not automatically propagate DDL changes. If a column is added, dropped, or renamed on the publisher without a matching change on the subscriber, the apply worker will encounter errors on the next affected row change. PostgreSQL responds by terminating and restarting the apply worker, which introduces a replication pause on every affected transaction. In production environments where schema migrations are frequent, it is essential to apply DDL to the subscriber before or alongside the publisher change, depending on whether the migration is additive (adding a nullable column) or destructive (dropping a column). Always validate apply_error_count in pg_stat_subscription_stats after any schema migration to confirm the apply worker is processing cleanly.
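
For an additive change, the safe ordering looks like this sketch (table and column names are hypothetical):

-- Step 1, subscriber first, so incoming rows always find the column:
ALTER TABLE public.orders ADD COLUMN note text;
-- Step 2, then the publisher:
ALTER TABLE public.orders ADD COLUMN note text;
-- Step 3, subscriber: confirm the apply worker is processing cleanly
SELECT subname, apply_error_count FROM pg_stat_subscription_stats;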

Worked Diagnostic Example

Consider a scenario where subscriber lag has climbed to eight minutes and is growing. The on-call engineer begins on the publisher with the replication slot query and finds replication_lag_bytes at 4.2 GB for the active slot, with wal_status = 'extended'. They check pg_stat_replication and observe sent_lsn equals pg_current_wal_lsn() — the WAL sender is delivering data in real time, so decoding and network are not the bottleneck.

Moving to the subscriber, pg_stat_subscription shows received_lsn matches what the publisher is sending, but latest_end_time is seven minutes old, meaning no feedback acknowledgment has been sent. Running the blocking query reveals a reporting join — running for 90 minutes — holding an ACCESS SHARE lock on a high-volume orders table. A replicated TRUNCATE of that table has been queued behind it the entire time waiting for an ACCESS EXCLUSIVE lock, and because apply is serial, every subsequent change is stuck behind the TRUNCATE. Terminating the reporting query with pg_terminate_backend() unblocks the apply worker, which clears the backlog within 12 minutes. The permanent fix: add statement_timeout = '300s' for all non-superuser roles on the subscriber and redirect analytical queries to a physical standby.

Production Best Practices

  1. Monitor pg_replication_slots.confirmed_flush_lsn lag in bytes as the primary replication health metric. Alert on byte thresholds scaled to available WAL disk, not just time-based lag estimates.
  2. Set max_slot_wal_keep_size to a safe ceiling on all publishers. Allowing an inactive slot to retain unbounded WAL risks disk exhaustion — a more severe incident than losing the subscriber’s position and reseeding.
  3. Enable streaming = on on subscriptions where both sides run PostgreSQL 14 or later. Streaming eliminates reorder buffer spill for large transactions and is the highest-impact single change for bulk-DML workloads.
  4. Raise logical_decoding_work_mem to 256 MB or higher on publishers that routinely process large transactions, and monitor the pg_replslot directory for spill files as a validation step.
  5. Enforce statement_timeout and idle_in_transaction_session_timeout on subscriber instances serving read traffic. Apply workers cannot preempt user queries; query governance must come from the application and configuration tier.
  6. Monitor pg_stat_subscription_stats.apply_error_count (PostgreSQL 15+) for recurring apply failures. Persistent errors indicate schema drift between publisher and subscriber that will not self-correct and can cause the apply worker to restart in a continuous loop.
  7. Test slot invalidation recovery periodically. A slot lost due to max_slot_wal_keep_size or WAL removal requires dropping the subscription and performing a full resync, which can take hours on large datasets; a recovery sketch follows this list.
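
A minimal recovery sequence for that last scenario, with hypothetical subscription and publication names (the remote slot is invalid, so the subscription must be detached from it before it can be dropped):

-- Subscriber: detach from the invalidated slot, then rebuild
ALTER SUBSCRIPTION my_subscription DISABLE;
ALTER SUBSCRIPTION my_subscription SET (slot_name = NONE);
DROP SUBSCRIPTION my_subscription;

-- Publisher: remove the invalidated slot if it still exists
SELECT pg_drop_replication_slot('my_subscription');

-- Subscriber: recreate; copy_data = true performs the full resync
CREATE SUBSCRIPTION my_subscription
    CONNECTION 'host=publisher.example.com dbname=app user=replicator'
    PUBLICATION my_publication
    WITH (copy_data = true);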

Conclusion

Diagnosing logical replication delay requires a systematic walk through the publisher’s replication slots, WAL sender metrics in pg_stat_replication, and apply worker state in pg_stat_subscription. The most common production culprits — reorder buffer spill, subscriber-side lock contention, and slot bloat — each have clear diagnostic signatures and targeted remedies. With streaming subscriptions in PostgreSQL 14, parallel apply in PostgreSQL 16, and byte-level slot monitoring in place, production logical replication delay can be reliably held under one second even under demanding workloads. For deeper reference, consult the PostgreSQL documentation on logical decoding and the logical replication administration guide.



Full-stack Database Infrastructure Architecture, Engineering and Operations Consultative Support(24*7) Provider for PostgreSQL, MySQL, MariaDB, MongoDB, ClickHouse, Trino, SQL Server, Cassandra, CockroachDB, Yugabyte, Couchbase, Redis, Valkey, NoSQL, NewSQL, SAP HANA, Databricks, Amazon Resdhift, Amazon Aurora, CloudSQL, Snowflake and AzureSQL with core expertize in Performance, Scalability, High Availability, Database Reliability Engineering, Database Upgrades/Migration, and Data Security.