Slow query log analysis with pgBadger

Every PostgreSQL DBA eventually turns to pgBadger for slow query log analysis. The question is whether you have a repeatable process for it, or whether you reinvent one each time. After hundreds of production PostgreSQL deployments, our process has settled into the steps below.

Quick answer

To analyze slow queries with pgBadger: generate a report from your PostgreSQL logs, baseline current behavior with pg_stat_statements and pg_stat_activity, then narrow the symptom to the right layer (query plan, configuration, autovacuum, replication, or storage). Apply the smallest reversible change, validate against your baseline, and end with a Datadog or Prometheus alert plus a runbook entry. The full SQL and worked examples are below.
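For reference, the baseline itself is nothing exotic. A minimal sketch, assuming pg_stat_statements is installed and using the PostgreSQL 13+ column names:

-- Top 20 statements by total execution time since the last stats reset
SELECT queryid,
       calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       rows,
       left(query, 80) AS query_sample
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

-- What is running (or waiting) right now
SELECT pid, state, wait_event_type, wait_event,
       now() - query_start AS runtime,
       left(query, 80) AS query_sample
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;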

What is pgBadger?

pgBadger is a log analyzer for PostgreSQL: it parses the server logs and builds an HTML report covering the slowest queries, checkpoint and autovacuum activity, lock waits, temporary files, and errors. The discipline around it overlaps query optimization, PostgreSQL tuning, and slow query analysis, which is why every team eventually wrestles with it. Get it wrong and you get user-visible latency, throughput collapse, or operational risk you didn't budget for.

Under the hood, the problems a pgBadger report surfaces map onto a small set of PostgreSQL subsystems: the buffer manager, the write-ahead log, the planner and statistics collector, autovacuum, and the replication and HA layers. This guide walks through each in the order you should investigate them when a real production problem hits, with the SQL we actually run during engagements.

Why pgBadger matters in production

Most PostgreSQL incidents that escalate in production trace back to slow queries in one of three flavors: a sudden p99 latency cliff, a slow buildup that fires an alert weeks too late, or a regression that lands the moment an apparently unrelated deployment goes out. Each has a different root cause and a different fix.

What makes slow query analysis tricky is that the symptom rarely points cleanly at the root cause. A latency spike might be buffer cache, or it might be a noisy neighbor at the storage layer, or it might be an unrelated checkpoint cycle dropping caches. That's why measurement comes before tuning, every single time.

A useful mental model: every PostgreSQL change has a cost, a blast radius, and a reversibility. The cheapest, smallest, most reversible change that actually moves your metric is almost always the right first step. It may not be the change you eventually want in steady state, but it buys you the time and confidence to make the bigger one safely.

How pgBadger works in PostgreSQL

The behavior a pgBadger report captures is governed by five PostgreSQL subsystems. Each can quietly affect throughput in ways that aren't visible from query logs alone.

  • Buffer manager. The shared_buffers pool decides what stays hot in PostgreSQL memory versus the OS page cache.
  • Write-ahead log. Every change is written to WAL before it touches the heap. Replication, PITR, and crash recovery all depend on it.
  • Planner and statistics. The cost-based optimizer interacts with statistics gathered by ANALYZE to choose query plans.
  • Autovacuum. Background workers reclaim dead tuples produced by MVCC. Mistuned autovacuum is one of the most common causes of OLTP performance regressions.
  • Process model. PostgreSQL forks a backend per connection. work_mem is allocated per sort or hash operation in each backend, not once per server, which is exactly the surprise that takes down clusters during connection storms.
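The last point deserves a number. A minimal sketch of the worst-case memory math using pg_settings; the four-operations-per-query multiplier is an illustrative assumption, not a PostgreSQL constant:

-- work_mem is reported in kB, max_connections is a plain count
SELECT (SELECT setting::bigint FROM pg_settings WHERE name = 'work_mem') AS work_mem_kb,
       (SELECT setting::bigint FROM pg_settings WHERE name = 'max_connections') AS max_connections,
       pg_size_pretty(
         (SELECT setting::bigint FROM pg_settings WHERE name = 'work_mem') * 1024
         * (SELECT setting::bigint FROM pg_settings WHERE name = 'max_connections')
         * 4  -- assume up to 4 sort/hash nodes per active query (illustrative)
       ) AS worst_case_work_mem;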

Knowing which layer your symptom belongs to determines the fix. A p99 spike caused by checkpoint I/O is configuration. A regression caused by stale planner statistics is operational. A correlation between table growth and write latency is almost always autovacuum starvation. The diagnostic queries below help you place the symptom on this map before you change anything.
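To place a symptom on the autovacuum part of that map, a quick check of dead-tuple buildup and the last vacuum and analyze times per table:

-- Tables with the most dead tuples and when autovacuum last touched them
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum,
       last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;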

How to diagnose pgBadger issues

Before changing anything, capture a baseline. PostgreSQL has world-class observability built into the engine itself through the pg_stat_* views, so use it. None of the queries below take locks heavier than ACCESS SHARE, but several scan large system catalogs and should run on a read replica or during a low-traffic window on very large clusters.
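One habit that pays off: snapshot pg_stat_statements into a plain table before you change anything, so the validation in Step 4 becomes a join instead of a memory exercise. A minimal sketch; the table name stmt_baseline is our convention, not a PostgreSQL object:

-- Capture current statement statistics (PostgreSQL 13+ column names)
CREATE TABLE stmt_baseline AS
SELECT now() AS captured_at,
       queryid, calls, total_exec_time, mean_exec_time, rows
FROM pg_stat_statements;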

Step 1. Generate a pgBadger HTML report from log files.

pgbadger -j 8 -O /var/www/pgBadger \
 --prefix '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h ' \
 /var/log/postgresql/postgresql-2026-04-*.log

Read the output with two questions in mind. Does the shape match what you expected? And what's the worst-case row? The shape tells you whether your mental model of the cluster matches reality. The worst-case row tells you where the next surprise in throughput will come from.
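The quality of that report depends entirely on what PostgreSQL logs in the first place. A minimal sketch of the logging settings we typically want before collecting logs, for a self-managed instance (managed services set these through parameter groups instead); the 250ms threshold is an assumption to adjust for your workload, and the prefix must match what you pass to --prefix:

ALTER SYSTEM SET log_min_duration_statement = '250ms';
ALTER SYSTEM SET log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h ';
ALTER SYSTEM SET log_checkpoints = on;
ALTER SYSTEM SET log_lock_waits = on;
ALTER SYSTEM SET log_temp_files = 0;              -- log every temporary file with its size
ALTER SYSTEM SET log_autovacuum_min_duration = 0; -- log every autovacuum run
SELECT pg_reload_conf();                          -- these are all reloadable, no restart needed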

How to fix pgBadger step by step

The fix has three parts: the schema or configuration change itself, the rollout plan, and the validation. Skip any one of them and the problem returns within weeks.

On managed PostgreSQL services like AWS RDS, Aurora, Cloud SQL, and Azure Flexible Server, schema changes still happen via plain SQL, while configuration changes go through parameter groups (or server parameters on Azure). Some parameters take effect immediately, others require a reboot. Verify with SELECT name, context FROM pg_settings WHERE name = '<param>'; before scheduling the change window.
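The context column is the part to read: 'postmaster' means a restart, 'sighup' means a reload is enough, and 'user' or 'superuser' can be changed per session. A quick sketch:

-- Which of these parameters need a restart versus a reload?
SELECT name, setting, unit, context
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'max_connections', 'autovacuum_naptime');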

Step 2. Run EXPLAIN ANALYZE with BUFFERS to see real I/O.

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.id, c.name, SUM(o.amount) AS total
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.created_at >= now() - interval '30 days'
GROUP BY o.id, c.name
ORDER BY total DESC
LIMIT 50;

Step 3. Compare buffer hits versus reads — high read counts mean cache misses.

EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS)
SELECT * FROM events
WHERE event_type = 'checkout' AND occurred_at > now() - interval '1 hour';
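To see the same hit-versus-read split per table rather than per plan node, pg_statio_user_tables provides a quick cross-check:

-- Heap cache hit ratio per table; low ratios line up with high 'read' counts in BUFFERS output
SELECT relname,
       heap_blks_hit,
       heap_blks_read,
       round(100.0 * heap_blks_hit / nullif(heap_blks_hit + heap_blks_read, 0), 1) AS hit_pct
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 20;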

Step 4. Validation. Re-run your baseline query and compare the results. If the change didn't move the metric you set out to improve, revert before chasing a second hypothesis. Tuning one PostgreSQL parameter at a time is the only way to keep your sanity, and your audit trail, intact.
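If you captured the stmt_baseline snapshot from the diagnosis section, the before-and-after comparison is a single join. A minimal sketch; the table name and the one-millisecond noise threshold are our choices, not PostgreSQL's:

-- Statements whose mean execution time moved since the snapshot
SELECT s.queryid,
       b.mean_exec_time AS before_ms,
       s.mean_exec_time AS after_ms,
       round((s.mean_exec_time - b.mean_exec_time)::numeric, 2) AS delta_ms
FROM pg_stat_statements s
JOIN stmt_baseline b USING (queryid)
WHERE abs(s.mean_exec_time - b.mean_exec_time) > 1.0  -- ignore sub-millisecond noise
ORDER BY delta_ms DESC;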

Production guardrails and monitoring

A change isn't done when it ships. It's done when you have the alert, the runbook entry, and the rollback plan. The difference between a healthy PostgreSQL platform and a chronically firefighting one is whether teams complete this last 20 percent of the work.

  • Add a Datadog or Prometheus alert on the metric you just improved at a threshold 20 percent above your new baseline.
  • Capture an EXPLAIN (ANALYZE, BUFFERS) for any regressed query into your runbook so the on-call engineer has the next-step diagnostic ready.
  • Document the rollback path: the exact SQL or ALTER SYSTEM sequence to restore the prior state if the change misbehaves (see the sketch after this list).
  • Set a calendar reminder to re-validate after the next major PostgreSQL version upgrade. Planner behaviors and default GUC values do change.
  • Record the pg_stat_statements query ID and a representative plan in your team wiki so you can compare against future regressions in latency.
  • Schedule a follow-up review 30 days after the change to confirm the improvement persisted under realistic production traffic.
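For the rollback-path item above, a minimal sketch of what that sequence looks like for a configuration change made with ALTER SYSTEM; the parameter is illustrative:

-- Roll back a single parameter to whatever applied before the override
ALTER SYSTEM RESET work_mem;  -- removes the entry from postgresql.auto.conf
SELECT pg_reload_conf();      -- work_mem has 'user' context, so a reload is enough
SHOW work_mem;                -- confirm what is now in effect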

Going deeper with cross-checks

Once the basic fix is in place, the next layer of validation cross-checks against complementary signals. The example below is one we use on production PostgreSQL deployments to check whether a prepared statement's plan depends on the specific parameter value it was planned with.

Force a generic plan so the plan choice no longer depends on a single parameter value, PostgreSQL's closest analogue to working around parameter sniffing.

-- checkout_query must already exist as a prepared statement in this session
SET plan_cache_mode = force_generic_plan;
EXPLAIN ANALYZE EXECUTE checkout_query(1234);
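To make the comparison concrete end to end, a self-contained sketch; the prepared statement and the orders table are illustrative, so substitute your own. force_custom_plan plans with the actual value, force_generic_plan ignores it, and diffing the two EXPLAIN outputs shows whether the plan is parameter-sensitive:

-- Illustrative prepared statement; replace with your real query
PREPARE recent_orders (int) AS
  SELECT * FROM orders
  WHERE customer_id = $1 AND created_at > now() - interval '7 days';

SET plan_cache_mode = force_custom_plan;   -- plan using the actual parameter value
EXPLAIN (ANALYZE, BUFFERS) EXECUTE recent_orders(1234);

SET plan_cache_mode = force_generic_plan;  -- plan without looking at the value
EXPLAIN (ANALYZE, BUFFERS) EXECUTE recent_orders(1234);

RESET plan_cache_mode;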

Common mistakes and anti-patterns

After hundreds of production PostgreSQL deployments we see the same anti-patterns around this topic again and again, across companies, industries, and continents. None are obvious in the moment. Each looked like the right call when someone made it.

  • Tuning the parameters a pgBadger report points at by copy-pasting from a 2014 blog post without re-validating against PostgreSQL 14, 15, 16, or 17 behavior.
  • Changing more than one PostgreSQL parameter at a time without measurement.
  • Forgetting to ANALYZE after a large data load, then wondering why the planner picked a sequential scan over your shiny new index (see the sketch after this list).
  • Trusting an unverified backup or an untested failover plan; performance tuning does not help a cluster you cannot recover.
  • Treating autovacuum as something to disable rather than something to tune.
  • Allowing developers to write production queries with no EXPLAIN review.
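For the ANALYZE pitfall above, the fix is a single statement, and pg_stat_user_tables confirms it took effect; the table name is illustrative:

-- After a bulk load, refresh planner statistics for the affected table
ANALYZE VERBOSE orders;

-- Confirm when statistics were last refreshed
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';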

PostgreSQL on AWS, Aurora, GCP, Azure

If you're running on AWS RDS, Aurora, Cloud SQL, AlloyDB, or Azure Flexible Server, here's what changes. Schema work is identical to self-managed PostgreSQL. Configuration goes through parameter groups. Some OS-level levers are gone. And Aurora plays by slightly different rules because of its decoupled storage architecture.

Specifics worth memorizing. AWS RDS PostgreSQL on gp3 storage gives you provisioned IOPS, but the maximum is per-volume, not per-instance. That fact surprises customers scaling vCPU and expecting linear I/O. Google AlloyDB's columnar engine is opt-in per table; turning it on is a one-line SQL call, but the analytical workload eligibility rules aren't always obvious until you read the EXPLAIN plan. Azure Database for PostgreSQL Flexible Server exposes a broader set of extensions than RDS or Aurora, including pg_partman, pgvector, TimescaleDB, and Citus on the Citus-flavored variant.

When this approach is the wrong starting point

This technique assumes a roughly normal OLTP PostgreSQL workload with healthy autovacuum. It's the wrong starting point if your workload is dominated by long analytical queries against a Citus or TimescaleDB hypertable, if you run on Aurora's storage-decoupled architecture (where buffer-pool semantics differ), or if the symptom is actually a network or kernel-level issue masquerading as a PostgreSQL problem.

Another pattern we see often. A fintech in Mumbai called us about 12-second checkout p99s every Sunday night. The culprit hiding in pg_stat_statements was a 600-million-row transactions table being scanned without a partial index on status='pending'. One CREATE INDEX CONCURRENTLY later, p99 sat at 180 ms.
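For readers who want the shape of that fix, a sketch: the table name and the status predicate come from the anecdote, while the indexed column is an assumption about what the checkout query filters on.

-- Build the partial index without blocking writes on the 600-million-row table
CREATE INDEX CONCURRENTLY idx_transactions_pending
ON transactions (created_at)
WHERE status = 'pending';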

Frequently asked questions

How do I find the slowest queries in PostgreSQL?

Enable the pg_stat_statements extension and query it sorted by total_exec_time to surface the worst offenders. Pair that with auto_explain for sampled execution plans of long-running queries. Together these tools reveal the top 20 queries that drive most of your database load.
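auto_explain can also be loaded for a single session while you investigate, before committing to a server-wide setting. A minimal sketch; the 500ms threshold is an assumption:

LOAD 'auto_explain';
SET auto_explain.log_min_duration = '500ms';  -- log plans for statements slower than this
SET auto_explain.log_analyze = on;
SET auto_explain.log_buffers = on;
-- Plans for slow statements in this session now appear in the PostgreSQL log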

Should I increase shared_buffers to 50 percent of RAM for better PostgreSQL performance?

No. Most production PostgreSQL deployments do best with shared_buffers at about 25 percent of system RAM, leaving the rest to the OS page cache. Anything above 40 percent rarely helps and can hurt performance, because pages end up buffered twice: once in shared_buffers and once in the OS page cache.
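As a worked example, assuming a dedicated 32 GB host (the size is illustrative):

-- Roughly 25 percent of 32 GB; shared_buffers has 'postmaster' context, so this needs a restart
ALTER SYSTEM SET shared_buffers = '8GB';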

Does parallel query execution actually make PostgreSQL faster?

For large analytical scans, yes. Tuning max_parallel_workers_per_gather can cut query execution time by two to four times. For OLTP point-lookup queries, parallelism adds latency overhead and is often best disabled per session. The right setting depends on your workload mix.
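Because the parameter has 'user' context, you can scope it per session rather than globally. A small sketch:

-- In an analytical session: allow more parallel workers per Gather node
SET max_parallel_workers_per_gather = 4;

-- In a latency-sensitive OLTP session: turn parallelism off entirely
SET max_parallel_workers_per_gather = 0;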

What is the most useful EXPLAIN flag in PostgreSQL?

BUFFERS. Combined with ANALYZE, EXPLAIN (ANALYZE, BUFFERS) tells you whether the slow part of your plan is doing real disk I/O or hitting the buffer cache. That single distinction explains most PostgreSQL performance mysteries.

Why does the PostgreSQL planner ignore my new index?

Three usual suspects: outdated table statistics (run ANALYZE), an expression mismatch between your WHERE predicate and the index definition, or selectivity below the planner's cost threshold for an index scan over a sequential scan.
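The expression-mismatch case is the one people miss most often. A hypothetical sketch; the table, column, and predicate are illustrative:

-- An index on an expression...
CREATE INDEX idx_users_email_lower ON users (lower(email));

-- ...is invisible to a predicate that does not use the same expression
EXPLAIN SELECT * FROM users WHERE email = 'a@example.com';

-- Match the expression and the planner can consider the index
EXPLAIN SELECT * FROM users WHERE lower(email) = 'a@example.com';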

Where should I start if I’m new to slow query log analysis with pgBadger?

Read this guide end to end, then run the diagnostic SQL queries against a non-production PostgreSQL database to build intuition. Most engineers we coach are productive within a day. Bookmark this page, then move on to the cluster posts linked below for deeper dives.

Further Reading

How to compute cost of slow query in PostgreSQL?

How to tune a slow query in PostgreSQL?

Why Your PostgreSQL Query Reads Millions of Blocks

 
