PostgreSQL Monitoring, Observability & Troubleshooting: The Complete Guide
Imagine a B2B SaaS platform with strict regulatory SLAs. The PostgreSQL cluster looks healthy, replication lag is zero, and yet customer complaints are climbing. The problem is almost always in the monitoring and observability stack: plenty of dashboards, none of them able to say where the pain actually is.
Database observability is the discipline of asking, in the middle of a production incident, ‘what is this database actually doing right now?’ and getting an answer in less than thirty seconds. Most teams discover during their first major incident that their dashboards were designed to look impressive in a sales demo, not to answer that question. There is a subtle pathology that affects mature monitoring stacks: too many metrics, all of them slightly stale, none of them mapped to a clear question. The team knows the database is unhealthy but cannot tell whether it is CPU-bound, IO-bound, lock-bound or replication-bound. The fix is fewer dashboards, sharper alerts, and a ruthless culling of metrics that nobody acts on.
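As a minimal sketch of what "an answer in thirty seconds" looks like on PostgreSQL, a single query against pg_stat_activity already tells you who is running what and what each session is waiting on:
-- What is this database actually doing right now?
SELECT pid, state, wait_event_type, wait_event,
       now() - query_start AS running_for,
       LEFT(query, 80) AS query_preview
FROM pg_stat_activity
WHERE state <> 'idle' AND pid <> pg_backend_pid()
ORDER BY query_start;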
The two approaches you will encounter
The first approach is continuous sampling of the engine's internal counters. MongoDB's db.serverStatus() output is overwhelming on first contact and indispensable on second: the fields that matter most for capacity work are under wiredTiger.cache, wiredTiger.transaction and opLatencies. Sample them every fifteen seconds and the time-series will tell you exactly when the working set started exceeding RAM; PostgreSQL's pg_stat_activity wait events and MySQL's Performance Schema fill the same role. The second approach is query-level aggregation: slow-query logs and statement digests that tell you which queries are responsible for the load, covered in more detail below.
A representative implementation of the first approach:
// MongoDB: cheap working-set health snapshot
let s = db.serverStatus();
printjson({
  // Percentage of the configured WiredTiger cache currently in use
  cache_used_pct: 100 * s.wiredTiger.cache['bytes currently in the cache']
    / s.wiredTiger.cache['maximum bytes configured'],
  // Percentage of the cache that is dirty and still waiting to be written back
  dirty_pct: 100 * s.wiredTiger.cache['tracked dirty bytes in the cache']
    / s.wiredTiger.cache['maximum bytes configured'],
  // Cumulative counter: diff successive samples to get an eviction rate
  evicted_pages_total: s.wiredTiger.cache['eviction worker thread evicting pages']
});
The trade-off matrix
Whichever approach you lean on, there is a short list of failure modes that operators learn to recognise:
- Dashboards full of average-latency metrics. The average is the metric that hides the problem; p95, p99 and p99.9 are what your customers actually feel (see the sketch after this list).
- Alerts on absolute thresholds (“CPU > 80%”) without rate-of-change context. A database at 85% CPU all afternoon is healthy; the same database climbing 10% per minute is mid-incident.
- Sampling intervals longer than the incident timeline. A 60-second poll on a 90-second outage shows two data points, neither of which is the incident.
- Storing query text alongside metrics in unsecured logs. PII, secrets and customer identifiers all leak this way; redaction has to happen before the metric leaves the database host.
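A minimal illustration of the first point, assuming a hypothetical request_latency_samples table populated by the application or a log shipper:
-- Illustration only: request_latency_samples is not a built-in view
SELECT avg(latency_ms) AS mean_ms,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95_ms,
       percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99_ms
FROM request_latency_samples
WHERE captured_at > now() - interval '5 minutes';
A healthy mean next to a runaway p99 is exactly the pattern that average-only dashboards hide.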
pg_stat_statements is the single most valuable extension PostgreSQL ships. It aggregates query execution stats by normalised SQL text, giving you mean, total and standard-deviation timings without any per-query overhead worth measuring. The standard reset cadence is daily; we prefer hourly snapshots stored externally for trend analysis across deployments.
Wait events are the right primitive for diagnosing a busy PostgreSQL system. pg_stat_activity exposes wait_event_type and wait_event for every active backend, sampled cheaply enough that you can poll once per second without measurable overhead. Aggregating that sample over time gives you a flame graph of where the database actually spends its life.
When the first approach is the right answer
Engine-level sampling is the right answer when the question is about the system as a whole rather than a single query: cache pressure, a working set outgrowing RAM, IO saturation. The fifteen-second db.serverStatus() samples shown above will tell you exactly when that happened on MongoDB. MySQL's Performance Schema is the equivalent set of measurements with a different vocabulary: events_statements_summary_by_digest for query patterns, events_waits_summary_global_by_event_name for wait events, and file_summary_by_event_name for IO patterns. Enable the right consumers, ship them to a TSDB, and you have a feedback loop on every query that ran.
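A rough sketch of the digest table in use (column names are the Performance Schema's own; timer values are in picoseconds):
-- MySQL: top statement digests by total latency
SELECT LEFT(DIGEST_TEXT, 80) AS query_preview,
       COUNT_STAR AS calls,
       ROUND(SUM_TIMER_WAIT / 1e12, 1) AS total_s,
       ROUND(AVG_TIMER_WAIT / 1e9, 2) AS avg_ms
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20;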
When the second approach earns its keep
Query-level aggregation earns its keep when the engine-level picture says "busy" but cannot say why. Slow-query logging is the cheapest form of observability and the one most teams misconfigure. The threshold should be set low enough to catch the queries you actually want to fix (we typically use 200 ms for OLTP, 5 seconds for OLAP) and the log itself should be shipped, parsed and aggregated, not tailed by a human. pg_stat_monitor, pt-query-digest and Datadog Database Monitoring all do this well.
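On the PostgreSQL side, a minimal sketch of the thresholds above (assuming a role with ALTER SYSTEM privilege; adjust the values to your workload):
-- Log statements slower than the 200 ms OLTP threshold discussed above
ALTER SYSTEM SET log_min_duration_statement = '200ms';
-- Give the log shipper enough context to aggregate by user, database and application
ALTER SYSTEM SET log_line_prefix = '%m [%p] user=%u db=%d app=%a ';
SELECT pg_reload_conf();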
-- PostgreSQL wait-event flame source
SELECT wait_event_type, wait_event, COUNT(*) AS sessions
FROM pg_stat_activity
WHERE state = 'active' AND pid <> pg_backend_pid()
GROUP BY 1,2
ORDER BY sessions DESC;
-- Top queries by total time over a window
SELECT calls,
       round(total_exec_time::numeric, 0) AS total_ms,
       round(mean_exec_time::numeric, 1) AS mean_ms,
       round(stddev_exec_time::numeric, 1) AS stddev_ms,
       LEFT(query, 100) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
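The hourly snapshot cadence mentioned earlier can be as simple as copying the counters into a local table before resetting; the table name and column selection here are illustrative, not part of the extension:
-- Illustrative snapshot table for hourly pg_stat_statements trend analysis
CREATE TABLE IF NOT EXISTS stat_statements_snapshots AS
  SELECT now() AS captured_at, queryid, calls, total_exec_time, mean_exec_time
  FROM pg_stat_statements
  WITH NO DATA;
INSERT INTO stat_statements_snapshots
  SELECT now(), queryid, calls, total_exec_time, mean_exec_time
  FROM pg_stat_statements;
-- Reset so the next snapshot covers a clean window
SELECT pg_stat_statements_reset();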
Our standing recommendation
An e-commerce platform alerted on disk usage thresholds and missed a runaway query that filled their replica's apply queue. The replica fell two hours behind, the application read stale data all the way through Black Friday, and the team rebuilt the dashboard around replication lag percentiles afterward. The lesson, in their words, was ‘monitor what hurts customers, not what fills disks’.
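A hedged sketch of the primary-side query behind such a replication-lag dashboard (replay_lag requires PostgreSQL 10 or later):
-- Per-standby replication lag, as seen from the primary
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;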
Where possible, treat monitoring and observability configuration as a code-review concern: a peer should challenge configuration changes the same way they would challenge an application code change, with explicit acceptance criteria and a documented rollback plan. This single cultural shift prevents more outages than any individual parameter tweak.
MinervaDB engineers maintain a library of internal runbooks for PostgreSQL that are updated whenever a customer engagement reveals a new pattern; if you would like a copy of the relevant runbook for PostgreSQL monitoring and observability, contact our team and we will share the sanitised version that we use during incident response.
It is worth emphasising that monitoring and observability in PostgreSQL is not a static topic. The engine, the cloud platforms it runs on, the storage technologies it uses and the workloads pushed through it all evolve, which means any configuration you ship today should be considered a snapshot rather than a permanent answer.
When MinervaDB takes over a PostgreSQL estate as part of an enterprise support engagement, the first thirty days almost always include a structured review of monitoring and observability, because the gains here are usually larger and faster than any other intervention available in the first month. The principles we keep coming back to:
- Wait events first, CPU second. The wait-event picture tells you what the database is waiting on; CPU just confirms it is busy.
- Alert on percentiles and rates of change, not absolute thresholds.
- Every alert ships with a runbook link and a clear next action.
- Sample fast, retain coarse. One-second granularity for the last hour, one-minute for the last week, one-hour for the last year.
- Treat broken exporters as outages, not informational events.
There is no silver bullet for PostgreSQL monitoring and observability: only careful engineering, honest measurement, and a willingness to revisit decisions as the workload changes.
Frequently asked questions
What if our team is smaller and we just want a quarterly health-check?
That is one of the most common engagements we run. A quarterly health-check covers performance trends, capacity, observability gaps, security posture, and a written report with prioritised actions.
Do you support both self-managed and cloud-managed deployments?
Yes. We work across PostgreSQL, MySQL/MariaDB, MongoDB, SQL Server, ClickHouse, Cassandra, Redis/Valkey, Milvus, Trino and SAP HANA, on bare-metal, virtualised infrastructure, Kubernetes, and managed cloud services (Aurora, RDS, Azure SQL, Cloud SQL).
Do you publish runbooks and documentation we can keep after the engagement?
Yes. Documentation and runbooks are deliverables, not afterthoughts. Everything we produce is yours to keep, with no proprietary tooling lock-in.
Can your team take over on-call for our database tier?
Yes — our 24×7 enterprise support practice is designed exactly for this. We can take pager ownership at L1/L2 with documented escalation paths into your engineering team for application-side issues.
References and resources
Authoritative external references
- PostgreSQL Wiki: Performance Optimization
- PostgreSQL high availability
- PostgreSQL Write-Ahead Logging
- PostgreSQL autovacuum configuration
- PostgreSQL official documentation
- Patroni HA template documentation
Talking to MinervaDB about PostgreSQL
For more than a decade, MinervaDB engineers have been the team that customers call when something complicated is happening on their PostgreSQL platform and the answer is not in the documentation. We bring deep operational experience and a strong opinion about how production database engineering should be practised.
How we typically help:
- 24×7 Enterprise-Class Support with strict SLAs for incident response, root-cause analysis and recovery.
- Performance Engineering and Tuning for high-throughput, low-latency, mixed OLTP and analytical workloads.
- High Availability and Disaster Recovery Architecture across regions, clouds and hybrid topologies.
- Database Reliability Engineering (DBRE) with observability, runbooks, capacity planning and incident review.
- Cost Optimisation for self-managed and cloud database platforms, with hardware-right-sizing and licensing reviews.
- Data Security, Audit and Compliance readiness for regulated workloads (PCI-DSS, HIPAA, SOC 2, RBI, GDPR).
- Database Migrations and Upgrades with zero-downtime cutover playbooks.
Reach us: contact@minervadb.com or minervadb.com/contact. Mention PostgreSQL monitoring and observability when you write and the consulting engineer will arrive with the right context.
MinervaDB — The WebScale Database Infrastructure Operations Experts.