PostgreSQL MVCC Fundamentals

PostgreSQL’s Multi-Version Concurrency Control (MVCC) is the mechanism that allows readers and writers to operate without blocking each other. The trade-off is one that surprises many production teams: every UPDATE and DELETE leaves behind dead tuples — old row versions that accumulate inside heap pages. Unlike databases that overwrite rows in place, PostgreSQL never shrinks a table file when you delete rows. At MinervaDB, we regularly inherit production clusters where tables have grown to 3-5x the size of their live data — not because of data growth, but because of misunderstood MVCC mechanics. This guide explains exactly how dead tuples are born, why disk space is never reclaimed automatically, and what you must do to control bloat before it becomes a crisis.

Why PostgreSQL Keeps Old Row Versions

PostgreSQL implements snapshot isolation through MVCC rather than locking. When a transaction starts, it receives a snapshot — a consistent view of the database at a specific point in time. Rows written by transactions that commit after your snapshot was taken are invisible to you. Rows deleted by concurrent transactions remain visible to you for as long as your snapshot is active.
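You can inspect the snapshot a transaction is holding directly. A minimal sketch — pg_current_snapshot() requires PostgreSQL 13 or later; txid_current_snapshot() is the older equivalent:

-- Open a repeatable-read transaction and inspect its snapshot
BEGIN ISOLATION LEVEL REPEATABLE READ;

-- Returns xmin:xmax:xip_list — XIDs below xmin are visible to this
-- snapshot; XIDs at or above xmax, or in the in-progress list, are not
SELECT pg_current_snapshot();

COMMIT;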

This design eliminates read-write lock contention entirely. A long-running analytical query can read a table while application workers hammer it with writes. Neither blocks the other. The cost of this elegance is storage: PostgreSQL must keep every row version visible to any active snapshot. That means old versions of updated or deleted rows cannot be removed until no active transaction can see them.

In a busy OLTP system processing thousands of updates per second, this policy generates dead tuples continuously. The rate at which dead tuples accumulate depends on transaction throughput, row size, and — critically — the age of the oldest active transaction in the database. A single long-running connection holding an idle transaction can prevent the entire system from reclaiming dead tuple space.

-- Find transactions preventing dead tuple cleanup
SELECT
  pid,
  usename,
  application_name,
  state,
  backend_xmin,
  now() - xact_start AS txn_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC  -- xid has no ordering operator; sort by age
LIMIT 10;

Anatomy of a Dead Tuple Inside a Heap Page

PostgreSQL stores table data in 8KB heap pages. Each page contains a header, an array of line pointers (ItemId entries), and the actual tuple data growing from the end of the page toward the middle. Every tuple carries a 23-byte header (HeapTupleHeaderData) that includes system columns controlling visibility.
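With the pageinspect extension installed, the page layout described above can be observed directly. This sketch assumes a table named orders:

-- View the page header of the first heap page
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- lower = end of the line-pointer array, upper = start of tuple data;
-- the gap between them is the page's remaining free space
SELECT lower, upper, special, pagesize
FROM page_header(get_raw_page('orders', 0));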

The critical fields in the tuple header are t_xmin (the transaction ID that inserted the row), t_xmax (the transaction ID that deleted or updated the row), and t_infomask (a bitmask of flags indicating whether xmin/xmax transactions committed or aborted). When PostgreSQL evaluates row visibility for a query, it reads these fields against the current transaction snapshot.

A dead tuple is one where t_xmax is set to a committed transaction ID that is older than all active snapshots. The row is logically gone — no transaction can see it anymore — but it still occupies physical space on the heap page. The line pointer remains allocated, and the tuple bytes remain on disk.

-- Inspect tuple visibility flags using the pageinspect extension
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT
  t_ctid,
  t_xmin,
  t_xmax,
  t_infomask::bit(16) AS infomask_bits,
  t_data
FROM heap_page_items(get_raw_page('orders', 0))
WHERE t_xmax != 0
LIMIT 20;

What Happens Internally During UPDATE and DELETE

PostgreSQL UPDATE is not an in-place modification. The database executes it as a DELETE of the old tuple followed by an INSERT of the new tuple. The old tuple’s t_xmax is set to the current transaction ID. The new tuple is written to a new location — in the same page if space permits and no indexed column changed (enabling a HOT update), or in a different page otherwise.
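The share of updates that take the HOT path is tracked in pg_stat_user_tables; a low HOT ratio on a high-write table often points at an over-indexed schema or an insufficient fillfactor. A sketch:

-- Fraction of updates that took the HOT path, per table
SELECT
  relname,
  n_tup_upd,
  n_tup_hot_upd,
  ROUND(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 2) AS hot_pct
FROM pg_stat_user_tables
WHERE n_tup_upd > 0
ORDER BY n_tup_upd DESC
LIMIT 20;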

DELETE simply marks the existing tuple’s t_xmax with the current transaction ID; the infomask hint bits recording the commit are set later, the first time the tuple is examined after the transaction commits. The tuple bytes stay exactly where they are. No physical space is freed. The page does not shrink.

On a high-write table — an orders table, a sessions table, a queue table — this means every row goes through this delete-mark cycle repeatedly. A table processing 10,000 updates per second generates 10,000 dead tuples per second. Without aggressive vacuuming, a table that holds 1 million live rows can accumulate over 35 million dead tuples within an hour at that rate. We have seen this pattern destroy I/O performance on systems where autovacuum was throttled too aggressively.

-- Quantify dead tuple accumulation per table
SELECT
  schemaname,
  relname AS table_name,
  n_live_tup,
  n_dead_tup,
  ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_pct,
  last_vacuum,
  last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 0
ORDER BY n_dead_tup DESC
LIMIT 20;

xmin, xmax, and Transaction Visibility

PostgreSQL uses 32-bit transaction IDs (XIDs). The visibility algorithm for any given row against a snapshot is: the row is visible if t_xmin belongs to a transaction that committed before the snapshot was taken, and t_xmax is either zero (never deleted) or was set by a transaction that aborted, is still in progress, or committed after the snapshot was taken.

The hint bits in t_infomask cache the commit/abort status of xmin and xmax so PostgreSQL does not have to consult pg_clog (now pg_xact) on every tuple access. When a tuple is first read after the inserting transaction commits, the HEAP_XMIN_COMMITTED hint bit is set. This dirties the page — which is why even read-heavy workloads generate write I/O on freshly loaded tables (and WAL traffic too, when data checksums or wal_log_hints are enabled).
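The individual hint bits can be decoded from t_infomask using the flag values defined in src/include/access/htup_details.h — shown here as literals, which you should verify against your server version:

-- Decode commit/abort hint bits for tuples on page 0 of 'orders'
SELECT
  t_ctid,
  (t_infomask & 256)  > 0 AS xmin_committed,  -- HEAP_XMIN_COMMITTED = 0x0100
  (t_infomask & 512)  > 0 AS xmin_invalid,    -- HEAP_XMIN_INVALID   = 0x0200
  (t_infomask & 1024) > 0 AS xmax_committed,  -- HEAP_XMAX_COMMITTED = 0x0400
  (t_infomask & 2048) > 0 AS xmax_invalid     -- HEAP_XMAX_INVALID   = 0x0800
FROM heap_page_items(get_raw_page('orders', 0));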

A tuple becomes a candidate for removal only when t_xmax points to a committed transaction AND that transaction ID is older than the oldest XID visible in any current snapshot. This threshold is the removal horizon — the oldest xmin held by any active backend, replication slot, or prepared transaction. (It is distinct from relfrozenxid, which tracks how far a table has been frozen for wraparound protection.) Tuples older than this horizon can be safely vacuumed and the space reclaimed for reuse.
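Everything that can hold the removal horizon back is enumerable from the catalogs. A sketch combining the three usual suspects — backends, replication slots, and prepared transactions:

-- What is holding back the dead-tuple removal horizon?
SELECT 'backend' AS holder, pid::text AS id, age(backend_xmin) AS xmin_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
UNION ALL
SELECT 'replication_slot', slot_name, age(xmin)
FROM pg_replication_slots
WHERE xmin IS NOT NULL
UNION ALL
SELECT 'prepared_xact', gid, age(transaction)
FROM pg_prepared_xacts
ORDER BY xmin_age DESC;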

Why Disk Space Is Never Returned to the OS

This is the behavior that frustrates operations teams most. Standard VACUUM reclaims dead tuple space for reuse by future inserts — but it does not shrink the physical table file. A table that once held 100GB of data and had 80GB of that data deleted will still occupy 100GB on disk after VACUUM runs. The pages are marked as free in the Free Space Map (FSM) and will be reused for new inserts, but the file size reported by the OS does not change.
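The effect is easy to demonstrate on any table. This sketch assumes a hypothetical orders table with a created_at column:

-- Observe that plain VACUUM does not shrink the file
SELECT pg_size_pretty(pg_relation_size('orders'));

DELETE FROM orders WHERE created_at < now() - interval '1 year';
VACUUM orders;

-- Size is unchanged: the freed pages are recorded in the FSM and
-- available for reuse, but never handed back to the OS
SELECT pg_size_pretty(pg_relation_size('orders'));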

The only operations that physically return space to the operating system are VACUUM FULL, CLUSTER, and pg_repack. All of them rebuild the table from scratch, excluding all dead tuples. VACUUM FULL and CLUSTER acquire an exclusive lock for the entire duration of the rebuild. pg_repack performs the rebuild online using a shadow table and triggers, with only a brief lock at cutover.

For most production tables with continuous write activity, VACUUM FULL is operationally dangerous because of locking. The correct long-term strategy is to tune autovacuum aggressively enough that dead tuples never accumulate to the point where physical shrinkage becomes necessary. If a table is already severely bloated, pg_repack is the appropriate remediation tool — not VACUUM FULL.
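A typical online rebuild with pg_repack looks like the following; the table and database names are placeholders, and the pg_repack extension must be installed in the target database first:

# Install the extension in the target database, then rebuild online
psql -d mydb -c "CREATE EXTENSION IF NOT EXISTS pg_repack;"

# Rebuild public.orders without holding a long exclusive lock
pg_repack -d mydb -t public.orders --jobs 2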

Measuring Dead Tuple Accumulation in Production

The pg_stat_user_tables view gives you live and dead tuple counts updated by the statistics collector. For a more accurate physical bloat estimate — accounting for page fragmentation inside heap files — use the pgstattuple extension.

-- Install pgstattuple if not present
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Measure physical bloat for a specific table
SELECT
  table_len,
  tuple_count,
  tuple_len,
  dead_tuple_count,
  dead_tuple_len,
  ROUND(100.0 * dead_tuple_len / NULLIF(table_len, 0), 2) AS dead_pct,
  free_space,
  ROUND(100.0 * free_space / NULLIF(table_len, 0), 2) AS free_pct
FROM pgstattuple('public.orders');

The dead_tuple_len field gives you the exact byte count consumed by dead tuples. The free_space field shows space already reclaimed by prior VACUUM runs but not yet filled by new inserts. A table where dead_pct + free_pct > 30% is a strong candidate for pg_repack. We recommend running this check monthly on all tables with more than 1 million rows.

Where VACUUM Fits — and Where It Falls Short

Standard VACUUM scans heap pages looking for dead tuples. Pages at the very end of the file that contain no live tuples can be truncated, physically shrinking the file — but only a contiguous run of empty trailing pages qualifies. A single live row in the last page prevents every page before it from being released to the OS.

Autovacuum triggers based on dead tuple thresholds controlled by autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor. With default settings (threshold = 50, scale factor = 0.2), autovacuum will not run on a 10-million-row table until 2,000,050 dead tuples accumulate. On high-write tables, this default is far too permissive. We tune per-table storage parameters to trigger autovacuum at much lower thresholds for write-intensive tables — a topic we cover in depth separately.
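Per-table overrides are set through storage parameters. A sketch for a high-write table — the exact numbers must be tuned to your workload:

-- Trigger autovacuum at roughly 1% dead tuples instead of 20%
ALTER TABLE public.orders SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_vacuum_threshold    = 1000
);

-- Verify the override
SELECT reloptions FROM pg_class WHERE relname = 'orders';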

The most important operational discipline is monitoring the gap between last_autovacuum and the current time in pg_stat_user_tables. Any table that has not been vacuumed in more than 30 minutes during peak write hours warrants investigation. If autovacuum is consistently too slow to keep up with write volume, the solution is lowering autovacuum_vacuum_cost_delay or raising autovacuum_vacuum_cost_limit; adding workers via autovacuum_max_workers alone does not help throughput, because the cost budget is shared across all autovacuum workers.
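That discipline reduces to a query any monitoring system can run; the 30-minute and 10,000-row thresholds here are illustrative, not prescriptive:

-- High-write tables that autovacuum has not visited recently
SELECT
  relname,
  n_dead_tup,
  now() - last_autovacuum AS since_last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
  AND (last_autovacuum IS NULL
       OR last_autovacuum < now() - interval '30 minutes')
ORDER BY n_dead_tup DESC;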

Key Takeaways

  • PostgreSQL MVCC stores old row versions as dead tuples — every UPDATE creates one, every DELETE creates one.
  • Dead tuples remain on disk until no active transaction snapshot can see them, meaning long-running transactions block cleanup.
  • Standard VACUUM marks dead tuple space as reusable but does not shrink the physical file — only VACUUM FULL and pg_repack return space to the OS.
  • Use pgstattuple to measure actual physical bloat; pg_stat_user_tables n_dead_tup is a logical count, not a byte-accurate bloat measurement.
  • Default autovacuum thresholds are too permissive for high-write tables — per-table storage parameter overrides are essential.
  • A single idle transaction holding an old snapshot can prevent dead tuple cleanup across the entire database.
  • Monitor pg_stat_activity.backend_xmin to identify snapshot-blocking connections before they create bloat crises.

How MinervaDB Can Help

Table bloat caused by misunderstood MVCC mechanics is one of the most common performance emergencies we remediate at MinervaDB. Our database infrastructure engineering team performs deep bloat audits — using pgstattuple, pg_stat_user_tables, and internal page inspection — to quantify exactly how much space is recoverable and what the production impact of remediation will be. We design per-table autovacuum configurations, plan and execute online pg_repack operations during low-traffic windows, and set up proactive monitoring to catch bloat accumulation before it affects query performance. If your PostgreSQL cluster is consuming disk space that does not match your data volume, or if autovacuum cannot keep pace with write throughput, contact our database engineering team to schedule a bloat assessment.

Running into this in production? Contact MinervaDB for PostgreSQL support contracts with guaranteed SLAs.

About MinervaDB Corporation
Full-stack database infrastructure architecture, engineering, and operations consultative support (24x7) provider for PostgreSQL, MySQL, MariaDB, MongoDB, ClickHouse, Trino, SQL Server, Cassandra, CockroachDB, YugabyteDB, Couchbase, Redis, Valkey, NoSQL, NewSQL, SAP HANA, Databricks, Amazon Redshift, Amazon Aurora, Cloud SQL, Snowflake, and Azure SQL, with core expertise in performance, scalability, high availability, database reliability engineering, database upgrades/migration, and data security.