Why Your PostgreSQL Query Reads Millions of Blocks to Return Zero Rows (And How to Fix It)

A deep dive into index bloat, dead tuples, and MVCC visibility in high-throughput PostgreSQL tables

If you’ve ever stared at a PostgreSQL query plan in disbelief — watching a query scan 3.5 million blocks only to return zero rows — you’re not alone. And if that same query magically completes in 23 milliseconds five seconds later, only to crawl back to 60 seconds fifteen minutes after that, something deeper is going on beneath the surface.

This post breaks down exactly what’s happening, why it happens, and — most importantly — how to fix it for good.


The Setup: A High-Throughput Staging Table

Imagine a staging table — let’s call it poslog_publisher_rms_stage — running on Aurora PostgreSQL 17.4. It’s a busy table. Rows arrive with a status of READY, get picked up almost immediately by a processing job that flips them to PROCESSED, and every two hours a separate job scans for anything stuck in UNPROCESSED or FAILED with fewer than three retry attempts.

The table has a composite index on (status, error_retry_count, modified_date) — exactly what that retry query needs. And it does use it. So why is performance so catastrophically bad?
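
For concreteness, the retry query presumably looks something like this. This is a sketch: the status values and the fewer-than-three-retries cutoff come from the description above, while the SELECT list and the ORDER BY are assumptions based on the index definition:

SELECT *
FROM poslog_publisher_rms_stage
WHERE status IN ('UNPROCESSED', 'FAILED')
  AND error_retry_count < 3
ORDER BY modified_date;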


The Core Problem: Index Bloat from Dead Tuples

This is where PostgreSQL’s MVCC (Multi-Version Concurrency Control) model becomes both a strength and a hidden trap.

When a row is updated in PostgreSQL, the database doesn’t modify the row in place. Instead, it writes a brand new version of the row and marks the old one as dead. This is fundamental to how PostgreSQL handles concurrent reads and writes without locking — but it has a cost.

Every single row that gets flipped from READY to PROCESSED leaves behind a dead tuple. And those dead tuples don’t disappear immediately. They linger in the heap until VACUUM comes along to clean them up.
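
You can watch this accumulation directly in pg_stat_user_tables, which tracks per-table estimates of live and dead tuples:

-- Live vs. dead tuple estimates, plus the last time vacuum actually ran
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'poslog_publisher_rms_stage';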

Here’s where it gets painful: the composite index still contains entries pointing to those dead tuples. So when the retry query runs and walks the index looking for rows where status IN ('UNPROCESSED', 'FAILED'), it encounters index entries for all the old READY and PROCESSED versions of rows too.

For each of those entries, PostgreSQL has to go fetch the actual heap page to determine whether that version of the row is still visible to the current transaction. This is called a heap fetch for visibility check — and with 8.4 million dead tuples spread across a 96 million row table, the query ends up touching millions of heap pages just to confirm, over and over again: “Nope, not visible. Skip it.”

That’s where the 3.5 million block reads come from. Most of them aren’t finding matching rows — they’re doing visibility bookkeeping.
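
The BUFFERS option on EXPLAIN is the easiest way to see this firsthand. Reusing the assumed query shape from above:

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM poslog_publisher_rms_stage
WHERE status IN ('UNPROCESSED', 'FAILED')
  AND error_retry_count < 3;

Look at the Buffers line in the output: millions of shared hits and reads on a plan that returns zero rows is the signature of visibility-check overhead.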


Why the Second Run Drops to 18k Block Reads

Five seconds later, the same query runs in 23 milliseconds. What changed?

Nothing changed in the data. What changed is the buffer cache.

The first run pulled roughly 2 million blocks from disk into PostgreSQL’s shared buffer pool — the in-memory cache. When the query runs again five seconds later, all those pages are still warm in memory. Nothing has evicted them yet. So instead of going to disk, the second run reads from cache, and the difference between cold disk I/O and warm in-memory reads is the entire explanation for that 66-second-to-23-millisecond drop.

It’s worth noting: even 18k logical reads to find zero rows is higher than you’d expect from a clean index. That’s still the dead tuples at work — even with everything in cache, the query still has to touch those pages to confirm visibility. It’s just fast because it’s all in memory.
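
If the pg_buffercache extension is installed, you can confirm how much of the table is sitting in shared buffers between runs. A rough sketch, assuming the default 8 kB block size:

-- Requires: CREATE EXTENSION pg_buffercache;
SELECT count(*) AS cached_pages,
       pg_size_pretty(count(*) * 8192) AS cached_size
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE c.relname = 'poslog_publisher_rms_stage';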


Why It Blows Up Again 15 Minutes Later

In those 15 minutes, on the order of 10,000 new rows were inserted and processed. Each insert and update:

  • Added new pages to the heap
  • Added new entries to the index
  • Created new dead tuples

The buffer cache — which has a finite size — has been partially evicted and replaced with newer data. The index has grown with fresh entries pointing to newly dead rows. So when the query runs again, it’s back to scanning a bloated index and fetching heap pages that are no longer in cache. Another ~3 million block reads. Another ~60 seconds of execution time.

The cycle repeats.


The Visibility Map: The Missing Piece

There’s one more concept worth understanding here: the visibility map.

PostgreSQL maintains a visibility map for each table that tracks which heap pages contain only live, visible tuples. When a page is marked “all-visible,” an index scan can skip the heap fetch entirely — it knows without checking that any row on that page is safe to return.

But when a page has dead tuples on it, it’s not marked all-visible. PostgreSQL must fetch the heap page to check visibility for every single index entry it encounters.

With 8.4 million dead tuples spread across the table, a huge portion of heap pages are not all-visible. The index scan degenerates into what is essentially a full heap scan disguised as an index scan — all the overhead of a sequential scan, with none of the efficiency of a true index lookup.
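
You don't need an extension to get a rough read on this: pg_class carries the planner's estimate of all-visible pages alongside the total page count (both maintained by VACUUM and ANALYZE):

SELECT relpages,
       relallvisible,
       round(100.0 * relallvisible / greatest(relpages, 1), 1) AS pct_all_visible
FROM pg_class
WHERE relname = 'poslog_publisher_rms_stage';

On a healthy table that percentage sits near 100; on a table in this state it collapses toward zero.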


How to Fix It

Here are the most impactful solutions, roughly in order of effectiveness:

1. Create a Partial Index

This is the single most impactful change you can make right now.

Since the retry query only ever cares about UNPROCESSED and FAILED rows — and the vast majority of rows are PROCESSED — you can create a partial index that only includes the rows you actually need:

CREATE INDEX idx_retry_candidates
ON poslog_publisher_rms_stage (error_retry_count, modified_date)
WHERE status IN ('UNPROCESSED', 'FAILED');

This index will be tiny. It will stay clean. It will never contain entries for dead READY or PROCESSED rows. The retry query will fly through it in microseconds regardless of how large the main table grows.
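
A quick way to verify the payoff is to compare per-index sizes and scan counters once the retry query has run a few times:

-- Size and usage for every index on the table
SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'poslog_publisher_rms_stage';

The partial index should be orders of magnitude smaller than the original composite index, and its idx_scan counter should climb as the planner adopts it.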

2. Tune Autovacuum Aggressively for This Table

A dead tuple count of 8.4 million on a 96 million row table is a red flag. The default autovacuum settings are designed for general-purpose workloads — they’re not aggressive enough for a high-throughput staging table like this one.

You can set table-level storage parameters to make autovacuum kick in far more frequently:

ALTER TABLE poslog_publisher_rms_stage
SET (
  autovacuum_vacuum_scale_factor = 0.001,
  autovacuum_vacuum_threshold = 1000
);

With a scale factor of 0.001 instead of the default 0.2, VACUUM will trigger after just 0.1% of the table’s rows become dead — rather than waiting for 20%. On a 96 million row table, that’s the difference between vacuuming after ~96,000 dead tuples versus waiting for ~19 million.
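
To confirm the overrides took effect, check the table's stored parameters (reloptions is NULL when no per-table settings exist):

SELECT relname, reloptions
FROM pg_class
WHERE relname = 'poslog_publisher_rms_stage';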

3. Run a Manual VACUUM Immediately

For immediate relief, run a manual VACUUM to clean up the existing dead tuples and rebuild the visibility map:

VACUUM (ANALYZE) poslog_publisher_rms_stage;

This won’t prevent the problem from recurring — that requires the autovacuum tuning above — but it will restore performance right now.
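
On a 96 million row table this will take a while. You can watch it from another session via the built-in progress view:

-- One row per backend currently running VACUUM
SELECT pid, phase, heap_blks_total, heap_blks_scanned, heap_blks_vacuumed
FROM pg_stat_progress_vacuum;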

4. Consider Archiving or Partitioning

If processed rows are never needed again after a certain point, consider archiving or deleting them on a regular schedule. Keeping the table lean reduces the surface area for dead tuple accumulation and keeps the index manageable over time. Table partitioning by date or status can also make it easier to drop old data in bulk without the overhead of row-by-row deletes.
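
A minimal sketch of what range partitioning by date could look like here; the table and partition names are hypothetical and the column list is abbreviated:

CREATE TABLE poslog_publisher_rms_stage_part (
    -- ... other columns ...
    status            text        NOT NULL,
    error_retry_count integer     NOT NULL DEFAULT 0,
    modified_date     timestamptz NOT NULL
) PARTITION BY RANGE (modified_date);

CREATE TABLE poslog_publisher_rms_stage_part_2025_06
    PARTITION OF poslog_publisher_rms_stage_part
    FOR VALUES FROM ('2025-06-01') TO ('2025-07-01');

-- Retention becomes a cheap metadata operation instead of millions of row deletes:
DROP TABLE poslog_publisher_rms_stage_part_2025_06;

Dropping a partition removes its heap and index pages wholesale, leaving no dead tuples behind for vacuum to chase.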


Summary

Observation                        Root Cause
3.5M block reads for 0 rows        Index bloat + dead tuples forcing heap visibility checks
2nd run drops to 18k reads         Buffer cache — pages already in memory from first run
15 min later, back to 3M reads     New dead tuples + cache eviction from ongoing inserts/updates
18k reads still seems high         Remaining dead tuples still require visibility checks even in cache


The Bottom Line

PostgreSQL’s MVCC model is one of its greatest strengths — it enables high concurrency without the locking overhead that plagues other databases. But on high-throughput tables where rows are constantly being inserted and updated, it creates a hidden cost: dead tuples accumulate faster than autovacuum can clean them up, index bloat follows, and queries that should be lightning-fast end up doing the work of a full table scan.

The good news is that this is a well-understood problem with well-understood solutions. A partial index on just the rows your query actually cares about is often all it takes to go from a 60-second nightmare to a sub-millisecond lookup — permanently, regardless of how many dead tuples accumulate in the rest of the table.

If your PostgreSQL queries are behaving erratically — fast sometimes, slow others, with no obvious pattern — dead tuples and index bloat should be near the top of your diagnostic checklist.


