Case study: cutting P99 latency 10x with observability

The conventional wisdom about database observability in PostgreSQL is mostly wrong, or at least wrong enough that copying it into production will cost you a quarter. Start with alerting. An alert that does not include a runbook link is not an alert; it is a notification of obligation. Every alert should answer two questions in its body: what is wrong, and what does the on-call engineer do next. Alerts that fail this test get tuned out within weeks, which is worse than not having the alert at all.

Database observability is the discipline of asking, in the middle of a production incident, ‘what is this database actually doing right now?’ and getting an answer in less than thirty seconds. Most teams discover during their first major incident that their dashboards were designed to look impressive in a sales demo, not to answer that question. The most useful observability metrics are the ones nobody puts on a dashboard until after the first time they were missing during an incident. Wait events, replication lag percentiles, autovacuum progress, lock graphs, slow-log digests — these are the metrics that matter at 03:00, and they are usually missing the first time a senior engineer goes looking for them.
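
As a sketch of where two of those 03:00 metrics come from on the PostgreSQL side (column names as in recent PostgreSQL releases; percentiles come from aggregating these samples over time, and lock graphs and slow-log digests need more machinery than a single query):

-- Replication lag per standby, as reported by the primary
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;

-- Autovacuum progress for any table currently being vacuumed
SELECT pid, relid::regclass AS table_name, phase,
       heap_blks_scanned, heap_blks_total
FROM pg_stat_progress_vacuum;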

Strategies for cutting P99 latency 10x

Wait events are the right primitive for diagnosing a busy PostgreSQL system. pg_stat_activity exposes wait_event_type and wait_event for every active backend, sampled cheaply enough that you can poll once per second without measurable overhead. Aggregating that sample over time gives you a flame graph of where the database actually spends its life. pg_stat_statements is the single most valuable extension PostgreSQL ships. It aggregates query execution stats by normalised SQL text, giving you mean, total and standard-deviation timings without any per-query overhead worth measuring. The standard reset cadence is daily; we prefer hourly snapshots stored externally for trend analysis across deployments.
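
A minimal sketch of that hourly-snapshot idea, assuming a scratch table of our own naming and a scheduler such as cron or pg_cron (ship the table to your warehouse afterwards):

-- Hypothetical snapshot table; names are ours, not part of PostgreSQL
CREATE TABLE IF NOT EXISTS stmt_snapshots (
    captured_at      timestamptz NOT NULL DEFAULT now(),
    queryid          bigint,
    calls            bigint,
    total_exec_time  double precision,
    mean_exec_time   double precision,
    query            text
);

-- Scheduled hourly: capture the current aggregates for later trend analysis
INSERT INTO stmt_snapshots (queryid, calls, total_exec_time, mean_exec_time, query)
SELECT queryid, calls, total_exec_time, mean_exec_time, LEFT(query, 200)
FROM pg_stat_statements;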

The same sampling discipline carries over to other engines. A representative health snapshot for MongoDB:

// MongoDB: cheap working-set health snapshot (mongosh)
let s = db.serverStatus();
printjson({
  cache_used_pct: 100 * s.wiredTiger.cache['bytes currently in the cache']
                  / s.wiredTiger.cache['maximum bytes configured'],
  dirty_pct: 100 * s.wiredTiger.cache['tracked dirty bytes in the cache']
                  / s.wiredTiger.cache['maximum bytes configured'],
  // cumulative counter: diff two samples over time to get an eviction rate
  pages_evicted_by_workers: s.wiredTiger.cache['eviction worker thread evicting pages']
});

Anti-patterns that hide incidents

These are the observability mistakes we see most often, and each one delays diagnosis in a predictable way:

  • Dashboards full of average-latency metrics. The average is the metric that hides the problem; p95, p99 and p99.9 are what your customers actually feel.
  • Alerts on absolute thresholds (“CPU > 80%”) without rate-of-change context. A database at 85% CPU all afternoon is healthy; the same database climbing 10% per minute is mid-incident.
  • Sampling intervals longer than the incident timeline. A 60-second poll on a 90-second outage shows two data points, neither of which is the incident.
  • Storing query text alongside metrics in unsecured logs. PII, secrets and customer identifiers all leak this way; redaction has to happen before the metric leaves the database host.

MongoDB’s db.serverStatus() output is overwhelming on first contact and indispensable on second. The fields that matter most for capacity work are under wiredTiger.cache, wiredTiger.transaction and opLatencies. Sample them every fifteen seconds and the time-series will tell you exactly when the working set started exceeding RAM. Slow-query logging is the cheapest form of observability and the one most teams misconfigure. The threshold should be set low enough to catch the queries you actually want to fix (we typically use 200 ms for OLTP, 5 seconds for OLAP) and the log itself should be shipped, parsed and aggregated, not tailed by a human. pg_stat_monitor, pt-query-digest and Datadog Database Monitoring all do this well.
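
A minimal sketch of those thresholds, assuming superuser access and that 200 ms really is the right OLTP cut-off for your workload:

-- PostgreSQL: log statements slower than 200 ms, then reload the config
ALTER SYSTEM SET log_min_duration_statement = '200ms';
SELECT pg_reload_conf();

-- MySQL: equivalent slow-query threshold (long_query_time is in seconds)
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 0.2;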

The MySQL equivalent

MySQL’s Performance Schema is the equivalent set of measurements with a different vocabulary: events_statements_summary_by_digest for query patterns, events_waits_summary_global_by_event_name for wait events, and file_summary_by_event_name for IO patterns. Enable the right consumers, ship them to a TSDB, and you have a feedback loop on every query that ran.
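
As an illustration, a digest query of the kind we would pin to a MySQL dashboard (column aliases are ours; the timer columns are in picoseconds):

-- MySQL: top statement digests by total time (Performance Schema)
SELECT DIGEST_TEXT,
       COUNT_STAR AS calls,
       ROUND(SUM_TIMER_WAIT / 1e12, 1) AS total_s,
       ROUND(AVG_TIMER_WAIT / 1e9, 2) AS mean_ms
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20;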

Putting the PostgreSQL views to work

The two queries below are the ones we keep pinned to the top of every PostgreSQL dashboard: a live wait-event breakdown from pg_stat_activity, and the top statements by total execution time from pg_stat_statements.

-- PostgreSQL wait-event flame source
SELECT wait_event_type, wait_event, COUNT(*) AS sessions
FROM pg_stat_activity
WHERE state = 'active' AND pid <> pg_backend_pid()
GROUP BY 1,2
ORDER BY sessions DESC;

-- Top queries by total time over a window
SELECT calls,
       round(total_exec_time::numeric, 0) AS total_ms,
       round(mean_exec_time::numeric, 1) AS mean_ms,
       round(stddev_exec_time::numeric, 1) AS stddev_ms,
       LEFT(query, 100) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

Our standing recommendation

We took a call from a customer whose Grafana boards showed all green during what was, by the time we arrived, a forty-minute partial outage. The wait-event panel was missing entirely. After we added it the next day, the same dashboard would have shown the issue (heavy LWLock:WALWriteLock) within thirty seconds. The fix was three lines of YAML and a one-line PromQL query.

When MinervaDB takes over a PostgreSQL estate as part of an enterprise support engagement, the first thirty days almost always include a structured review of observability, because the gains here are usually larger and faster than any other intervention available in the first month.

Finally, remember that documentation is a force multiplier. Every diagnostic command, every tuning decision, every runbook step that lives in a shared system rather than in someone’s head is a step closer to a PostgreSQL estate that does not depend on a single hero engineer being awake.

It is worth emphasising that observability in PostgreSQL is not a static topic. The engine, the cloud platforms it runs on, the storage technologies it uses and the workloads pushed through it all evolve, which means any configuration you ship today should be considered a snapshot rather than a permanent answer.

MinervaDB engineers maintain a library of internal runbooks for PostgreSQL that are updated whenever a customer engagement reveals a new pattern; if you would like a copy of the relevant runbook for observability and latency incidents, contact our team and we will share the sanitised version that we use during incident response.

Where possible, treat observability and alerting changes as a code review concern: a peer should challenge configuration changes the same way they would challenge an application code change, with explicit acceptance criteria and a documented rollback plan. This single cultural shift removes more outages than any individual parameter tweak.

  • Wait events first, CPU second. The wait-event picture tells you what the database is waiting on; CPU just confirms it is busy.
  • Alert on percentiles and rates of change, not absolute thresholds.
  • Every alert ships with a runbook link and a clear next action.
  • Sample fast, retain coarse. One-second granularity for the last hour, one-minute for the last week, one-hour for the last year (a rollup sketch follows this list).
  • Treat broken exporters as outages, not informational events.
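
A minimal sketch of that rollup in SQL, assuming hypothetical metrics_1s and metrics_1m tables (most teams let their TSDB do this instead):

-- Collapse one-second samples older than an hour into one-minute averages
BEGIN;
INSERT INTO metrics_1m (bucket, metric, value)
SELECT date_trunc('minute', sampled_at) AS bucket, metric, avg(value)
FROM metrics_1s
WHERE sampled_at < now() - interval '1 hour'
GROUP BY 1, 2;

DELETE FROM metrics_1s WHERE sampled_at < now() - interval '1 hour';
COMMIT;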

There is no silver bullet for cutting P99 latency: only careful engineering, honest measurement, and a willingness to revisit decisions as the workload changes.

Frequently asked questions

How quickly can MinervaDB engineers respond to a production incident on this topic?

MinervaDB runs a 24×7 support practice with documented SLAs that vary by contract; for SEV-1 incidents on supported clusters the first engineer response is measured in minutes, not hours.

Do you support both self-managed and cloud-managed deployments?

Yes. We work across PostgreSQL, MySQL/MariaDB, MongoDB, SQL Server, ClickHouse, Cassandra, Redis/Valkey, Milvus, Trino and SAP HANA, on bare-metal, virtualised infrastructure, Kubernetes, and managed cloud services (Aurora, RDS, Azure SQL, Cloud SQL).

What is your typical engagement model for a one-off review?

A typical engagement starts with a short discovery call, a focused review (architecture, performance, security, cost, or topic-specific), and a written assessment with prioritised recommendations. We can then either hand it back to your team to execute, or stay engaged to implement.

Can your team take over on-call for our database tier?

Yes — our 24×7 enterprise support practice is designed exactly for this. We can take pager ownership at L1/L2 with documented escalation paths into your engineering team for application-side issues.

A short note on how we work with PostgreSQL customers

The MinervaDB consulting practice is built around 24×7 enterprise support, performance engineering and database reliability engineering for PostgreSQL. We are based in India with engineers across global timezones, and we have been doing this work since before “DBRE” was a job title.

How we typically help:

  • Round-the-clock production support with documented SLAs and escalation paths.
  • Performance engineering and capacity planning for serious OLTP and analytical workloads.
  • High-availability and disaster-recovery architecture, across regions and clouds.
  • Cost optimisation for cloud database bills and self-managed platforms.
  • Migration and upgrade engineering with rehearsed cutover and rollback plans.
  • Security, audit, and compliance readiness for regulated industries.

If this resonates: reach the team at contact@minervadb.com or via minervadb.com/contact and we can schedule a no-obligation technical discovery focused on your observability and latency challenges.

MinervaDB — The WebScale Database Infrastructure Operations Experts.
