
Amazon RDS PostgreSQL Performance Insights Deep Dive


A sudden spike in DB Load at 3 a.m. with no obvious query change is one of the most disorienting incidents an on-call engineer faces on Amazon RDS for PostgreSQL. Performance Insights was designed precisely for this moment — it samples wait events every second, correlates them with SQL statements, host identifiers, and users, and presents the data as a continuous time series.

At MinervaDB, we have used Performance Insights as the first-response tool in hundreds of RDS PostgreSQL engagements. In this article we explain what Performance Insights actually captures, how to read the DB Load chart, how to drill from a spike to a culprit query, how to pull data programmatically via the API, and how the retention and cost model works in practice.


What Performance Insights Actually Captures

Performance Insights is an always-on sampling engine embedded in the RDS host agent. Once enabled, it samples the state of every active session once per second using an internal mechanism equivalent to polling pg_stat_activity. Each sample records the session state (active, idle, idle in transaction, etc.), the current wait event category and name, the normalized query fingerprint, the database name, the username, the application name, and the client host. Sessions that are idle or waiting on client I/O with no pending query are excluded from the DB Load metric, which keeps the signal clean.
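
As a rough illustration of this sampling model, the sketch below (plain Python; the helper name and sample shape are ours, not an AWS API) aggregates one-second pg_stat_activity-style samples into a wait-event breakdown, excluding idle sessions the way PI excludes them from DB Load:

```python
from collections import Counter

def summarize_samples(samples):
    """Aggregate one-second pg_stat_activity-style samples into a wait-event
    breakdown, the way the DB Load chart stacks wait classes."""
    breakdown = Counter()
    for s in samples:
        if s["state"] != "active":
            continue  # idle / idle-in-transaction sessions do not count toward DB Load
        if s.get("wait_event_type") and s.get("wait_event"):
            key = f'{s["wait_event_type"]}:{s["wait_event"]}'
        else:
            key = "CPU"  # active and not waiting on anything => running on CPU
        breakdown[key] += 1
    return dict(breakdown)

samples = [
    {"state": "active", "wait_event_type": None, "wait_event": None},
    {"state": "active", "wait_event_type": "IO", "wait_event": "DataFileRead"},
    {"state": "active", "wait_event_type": "IO", "wait_event": "DataFileRead"},
    {"state": "idle",   "wait_event_type": "Client", "wait_event": "ClientRead"},
]
print(summarize_samples(samples))  # {'CPU': 1, 'IO:DataFileRead': 2}
```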

The raw samples feed two representations. First, the DB Load time series, measured in Average Active Sessions (AAS). This is an average count, not a percentage: if 4 backend processes are actively running at a given second, the instantaneous DB Load is 4, regardless of how many vCPUs are provisioned. Second, the wait-event breakdown chart stacks the contribution of each wait class (CPU, IO:*, Lock:*, LWLock:*, Client:*) across the selected time window.
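
The averaging behind AAS is simple; a minimal sketch (our own helper, not part of any AWS SDK) over per-second active-session counts:

```python
def average_active_sessions(per_second_counts):
    """DB Load over a window = mean of the per-second active-session counts."""
    return sum(per_second_counts) / len(per_second_counts)

# Ten one-second samples: 4 backends active for five seconds, then 2.
counts = [4, 4, 4, 4, 4, 2, 2, 2, 2, 2]
print(average_active_sessions(counts))  # 3.0
```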

Enabling Performance Insights requires no parameter group change; it is toggled at the RDS instance level:

aws rds modify-db-instance \
  --db-instance-identifier prod-pg-01 \
  --enable-performance-insights \
  --performance-insights-retention-period 7 \
  --performance-insights-kms-key-id alias/aws/rds \
  --apply-immediately

The --performance-insights-retention-period flag accepts 7 (free tier) or a paid value: a multiple of 31 days up to 713, or 731 days (priced at $0.02 per vCPU-hour per month on standard instances). Once enabled, the console shows data within 60 seconds of the first sample.

DB Load, Active Sessions, and the vCPU Line

The single most important concept in Performance Insights is the relationship between DB Load and the vCPU line. DB Load is the count of Average Active Sessions over the selected granularity (1-second, 1-minute, or 5-minute). The vCPU line is a horizontal reference drawn at the number of vCPUs provisioned for the instance. When DB Load exceeds the vCPU line, the database is CPU- or wait-bound in a way that cannot be absorbed by available parallelism.

A DB Load of 2.0 on a db.r7g.2xlarge (8 vCPUs) is unremarkable — 25% of capacity. The same DB Load of 2.0 on a db.t4g.medium (2 vCPUs) means every vCPU is fully saturated. This is why we always check the vCPU line before drawing conclusions.
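
That headroom check is a one-liner; a hypothetical helper (ours, not a PI feature) makes the contrast explicit:

```python
def saturation_ratio(db_load, vcpus):
    """Fraction of provisioned parallelism consumed; > 1.0 means sessions queue."""
    return db_load / vcpus

# Same DB Load, very different meaning depending on instance size
# (vCPU counts as cited in the text above).
print(saturation_ratio(2.0, 8))  # 0.25 -> db.r7g.2xlarge, comfortable headroom
print(saturation_ratio(2.0, 2))  # 1.0  -> db.t4g.medium, fully saturated
```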

DB Load decomposition by wait class tells us where time is being spent:

  • CPU — sessions actively executing on a vCPU. High CPU DB Load with DB Load below the vCPU line is healthy. CPU DB Load exceeding the vCPU line means compute saturation.
  • IO:DataFileRead — backend reading a page from EBS into shared buffers.
  • IO:WALWrite — WAL writer flushing to EBS; relevant on high-write workloads.
  • Lock:transactionid — sessions blocked waiting for a row lock held by another transaction.
  • LWLock:BufferContent — contention on the shared buffer pool itself, often a sign of concurrent access to a small hot set of pages.
  • Client:ClientRead — server waiting for the client to send the next query; this should never dominate DB Load and, when it does, it signals a connection-pool misconfiguration.

-- Real-time approximation of PI wait-event breakdown using pg_stat_activity
SELECT wait_event_type,
       wait_event,
       count(*) AS active_sessions
FROM   pg_stat_activity
WHERE  state = 'active'
  AND  pid <> pg_backend_pid()
GROUP  BY 1, 2
ORDER  BY 3 DESC;

Top SQL, Top Waits, Top Hosts, and Top Users Dimensions

Below the DB Load time series, the Performance Insights console presents a multi-dimensional breakdown panel with four tabs: Top SQL, Top waits, Top hosts, and Top users. Each tab ranks contributors to DB Load over the selected time window.

  • Top SQL is the most actionable tab. It displays normalized query digests (parameter values replaced with placeholders) ranked by AAS contribution. Each row shows load, calls, rows examined, and the wait-event breakdown for that specific statement. Clicking a digest opens the full SQL text and per-wait-event attribution.
  • Top waits aggregates by wait event across all statements, useful when a single wait type dominates and you want to identify whether it affects one or many queries.
  • Top hosts ranks client IP addresses by AAS contribution. In a multi-tenant SaaS model, this quickly reveals which tenant is generating disproportionate load — particularly useful when the application does not annotate queries with application_name.
  • Top users ranks PostgreSQL roles by AAS. In deployments where read/write roles are separated, the split between the read-replica user and the writer user confirms whether load is coming from analytical or transactional traffic.

# Retrieve top-10 SQL digests for the past hour via AWS CLI
aws pi get-resource-metrics \
  --service-type RDS \
  --identifier db-ABCDEFGHIJKLMNOPQRSTUVWXYZ \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time   $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period-in-seconds 60 \
  --metric-queries '[
    {"Metric":"db.load.avg",
     "GroupBy":{"Group":"db.sql","Limit":10}}
  ]'

Retention Tiers: 7-Day Free vs 2-Year Paid

Performance Insights retention is available in two tiers. The 7-day free tier stores samples at 1-second granularity for seven days with no additional charge beyond the instance cost. This covers the overwhelming majority of reactive incident investigations. The long-term retention tier extends storage to a multiple of 31 days (up to 713) or to 731 days (approximately two years), priced at roughly $0.02 per vCPU-hour per month of additional storage.

For a db.r7g.4xlarge (16 vCPUs) running 730 hours per month, long-term retention adds approximately $0.02 × 16 × 730 = $233.60 per month. For compliance-driven workloads where historical query patterns must be auditable, that cost is justified. For most OLTP fleets, the 7-day window is sufficient because incidents are triaged the same day they occur.
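
The arithmetic is easy to fold into a fleet-sizing script; a sketch using the article's rate (confirm against current AWS pricing for your region before budgeting):

```python
def longterm_retention_cost(vcpus, hours_per_month=730, rate_per_vcpu_hour=0.02):
    """Monthly add-on cost of paid PI retention.

    rate_per_vcpu_hour is the figure quoted in this article, not a value
    fetched from AWS; treat it as an assumption.
    """
    return round(rate_per_vcpu_hour * vcpus * hours_per_month, 2)

print(longterm_retention_cost(16))  # 233.6 for a db.r7g.4xlarge
```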

One operational subtlety: when a DB instance is replaced (restore from snapshot, blue-green switchover), Performance Insights history is tied to the DBI resource identifier, not the instance name. A new instance gets a fresh PI history. We keep this in mind during blue-green deployments and always export relevant PI data before discarding the old instance.

# Confirm current retention period on a running instance
aws rds describe-db-instances \
  --db-instance-identifier prod-pg-01 \
  --query 'DBInstances[0].PerformanceInsightsRetentionPeriod'

Accessing Performance Insights via the GetResourceMetrics API

The Performance Insights API (the pi service namespace in the AWS SDK) is underused. It exposes every metric and dimension available in the console, making it possible to build automated dashboards, SLO monitors, and capacity-planning pipelines without polling the database directly.

The primary call is GetResourceMetrics. It accepts a list of metric queries, a time range, a period, and an optional GroupBy dimension. Metrics prefixed with db.load return AAS values; metrics prefixed with db.sql or db.wait_event return dimension-specific breakdowns.

import boto3, datetime

pi = boto3.client('pi', region_name='us-east-1')
end   = datetime.datetime.now(datetime.timezone.utc)
start = end - datetime.timedelta(hours=1)

response = pi.get_resource_metrics(
    ServiceType='RDS',
    Identifier='db-ABCDEFGHIJKLMNOPQRSTUVWXYZ',  # DBI resource ID
    StartTime=start,
    EndTime=end,
    PeriodInSeconds=60,
    MetricQueries=[
        {
            'Metric': 'db.load.avg',
            'GroupBy': {'Group': 'db.wait_event', 'Limit': 5},
        },
        {
            'Metric': 'db.load.avg',
            'GroupBy': {'Group': 'db.sql',        'Limit': 5},
        },
    ],
)
for result in response['MetricList']:
    print(result['Key'], [p['Value'] for p in result['DataPoints'][:5]])

The pi_notify_kb parameter (set in the RDS parameter group) controls how much shared memory the PI agent uses for its local ring buffer before flushing samples upstream. The default is 4096 KB; increasing it to 8192 KB on high-connection instances (> 500 concurrent sessions) reduces the probability of sample loss during brief network hiccups between the instance and the PI backend.

Triaging a DB Load Spike Step by Step

When a page arrives showing that DB Load has crossed the vCPU line, the following workflow recovers the root cause in under ten minutes in most cases.

Step 1 — Bound the time window. Use the console zoom or the API to narrow the window to the exact spike interval. Confirm the spike’s start time and whether it was sudden (< 60 s ramp) or gradual (> 5 min ramp). Sudden spikes typically indicate a lock event or a deployment. Gradual ramps indicate accumulating row lock contention, unbounded connection growth, or a runaway query.
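
The sudden-versus-gradual distinction can be automated over a list of per-minute AAS values; a heuristic sketch (thresholds are illustrative assumptions of ours, not a Performance Insights feature):

```python
def classify_spike(load_series, vcpus):
    """Classify how DB Load crossed the vCPU line in per-minute AAS samples.

    A jump from under half the vCPU count to above it within one period is
    treated as sudden (lock event / deployment); anything slower as gradual
    (lock pile-up, connection growth, runaway query).
    """
    for prev, curr in zip(load_series, load_series[1:]):
        if curr > vcpus:
            return "sudden" if prev < vcpus * 0.5 else "gradual"
    return "no crossing"

# 8-vCPU instance: load idles near 2, then jumps straight past the vCPU line
print(classify_spike([2.1, 1.9, 2.0, 9.5, 10.2], vcpus=8))  # sudden
# Same instance: load climbs step by step before crossing
print(classify_spike([2.0, 4.0, 6.5, 7.5, 9.0], vcpus=8))   # gradual
```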

Step 2 — Identify the dominant wait class. Switch the DB Load chart to the wait-event breakdown view. A single color dominating the stack immediately tells you the category: orange for Lock:*, red for IO:DataFileRead, grey for CPU.

Step 3 — Pivot to Top SQL. Sort by DB Load descending. The top one or two digests typically account for 80%+ of load during a spike. Note the wait-event attribution for those specific statements.

Step 4 — Retrieve the full SQL text. Click the digest in the console to see the parameterized SQL. Match it to the application layer — an ORM, a scheduled job, a reporting query.

Step 5 — Correlate with CloudWatch. Check CPUUtilization, ReadIOPS, WriteIOPS, DatabaseConnections, and ReadLatency over the same window. A DB Load spike with flat CPUUtilization is almost always a wait event (IO, lock) rather than compute saturation.
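
That reasoning reduces to a small decision function; a sketch with an assumed 80% CPU threshold (our own cutoff, not an AWS-defined boundary):

```python
def interpret_spike(db_load, vcpus, cpu_utilization_pct):
    """Rough first-pass interpretation of DB Load against CloudWatch CPU."""
    if db_load <= vcpus:
        return "within capacity"
    if cpu_utilization_pct >= 80:
        return "compute saturation (CPU-bound)"
    return "wait-bound (IO or lock) - check the wait-event breakdown"

# High DB Load but CPU nearly flat: the time is going into waits, not compute.
print(interpret_spike(db_load=14.0, vcpus=8, cpu_utilization_pct=35))
# wait-bound (IO or lock) - check the wait-event breakdown
```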

Step 6 — Act. Depending on the finding, the resolution may be pg_cancel_backend() for a runaway query, VACUUM ANALYZE for a bloated table driving excessive IO:DataFileRead, or a schema change to add a missing index.

-- Identify and cancel runaway queries during an active incident
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM   pg_stat_activity
WHERE  state <> 'idle'
  AND  (now() - pg_stat_activity.query_start) > interval '30 seconds'
ORDER  BY duration DESC;

-- Cancel a specific backend (non-superuser safe version)
SELECT pg_cancel_backend(12345);

Cost Model and Practical Sizing Decisions

Performance Insights pricing has three components: the 7-day free tier (no charge), long-term retention ($0.02 per vCPU-hour per additional month), and API calls ($0.01 per 1,000 API requests beyond 1 million free requests per month). For most fleets, the API cost is negligible. Long-term retention is the main variable cost.
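
The API component is easy to sanity-check; a sketch using the figures quoted above (verify against current AWS pricing before relying on it):

```python
def pi_api_cost(monthly_requests, free_requests=1_000_000, rate_per_1k=0.01):
    """Monthly PI API charge: $0.01 per 1,000 requests past the free million."""
    billable = max(0, monthly_requests - free_requests)
    return round(billable / 1000 * rate_per_1k, 2)

print(pi_api_cost(600_000))    # 0.0  -> entirely inside the free tier
print(pi_api_cost(3_000_000))  # 20.0 -> 2M billable requests
```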

Our recommendation at MinervaDB: enable 7-day free-tier Performance Insights on every RDS PostgreSQL instance regardless of workload. The overhead on the database engine is sub-1% in our benchmarks. For regulated industries (financial services, healthcare) where query-level audit trails are required, enable 93-day or 372-day retention selectively on the primary writer instance.

For read replicas, Performance Insights is independently controlled per instance. Enabling PI on the replica allows correlation of read-heavy analytical load against replica lag, which is a useful diagnostic when replica lag spikes correlate with IO:DataFileRead bursts.

# Enable PI on every instance in a fleet using the AWS CLI
aws rds describe-db-instances \
  --query 'DBInstances[*].DBInstanceIdentifier' \
  --output text | tr '\t' '\n' | while read id; do
    aws rds modify-db-instance \
      --db-instance-identifier "$id" \
      --enable-performance-insights \
      --performance-insights-retention-period 7 \
      --apply-immediately
    echo "Enabled PI on $id"
done

One note on db.t4g burstable instances: Performance Insights is supported, but the effective compute capacity fluctuates as CPU credits are consumed and replenished, which can make the static vCPU reference line misleading. We recommend upgrading burstable instances to db.m7g or db.r7g in production before relying on PI for capacity planning.


Key Takeaways

  • Performance Insights samples every active session once per second and computes DB Load as Average Active Sessions, which is compared against the number of vCPUs.
  • The vCPU line is the critical reference: DB Load consistently above it signals saturation that more parallelism cannot absorb.
  • Top SQL, Top waits, Top hosts, and Top users let you pivot from a global spike to a specific statement, wait event, tenant, or role in seconds.
  • The 7-day free tier is sufficient for reactive incident triage; enable long-term retention only for compliance or capacity-planning requirements.
  • The GetResourceMetrics API exposes all PI data programmatically, enabling automated SLO dashboards and capacity pipelines without direct database polling.
  • The pi_notify_kb parameter controls the PI agent ring buffer; increase it to 8192 KB on high-connection instances to reduce sample loss.


How MinervaDB Can Help

At MinervaDB, we operate a full-stack database infrastructure engineering practice focused on Amazon RDS for PostgreSQL, Aurora PostgreSQL, and self-managed PostgreSQL at scale. Our PostgreSQL consulting services cover performance diagnostics, parameter-group tuning, capacity planning, Multi-AZ and read-replica architecture, pg_stat_statements-driven optimization, and 24×7 remote DBA operations. If the workload is experiencing latency spikes, replication lag, connection storms, or cost pressure on RDS, we can help.


Frequently Asked Questions

Does enabling Performance Insights affect RDS PostgreSQL query latency?

In our benchmarks on db.r7g and db.r6i instance classes, the overhead of Performance Insights sampling is below 1% of CPU. The agent samples pg_stat_activity once per second in a non-blocking read and writes to a local ring buffer. We enable it on every production instance by default.

What is the difference between DB Load and CPUUtilization in CloudWatch?

CPUUtilization (CloudWatch) measures the percentage of vCPU time consumed by all OS processes on the instance, including the PostgreSQL engine, the OS kernel, and background daemons. DB Load (Performance Insights) measures only the average number of PostgreSQL sessions actively executing or waiting, making it a more precise signal for database-layer bottlenecks.

How do we retrieve the DBI resource identifier needed for the PI API?

Call aws rds describe-db-instances --db-instance-identifier <name> --query 'DBInstances[0].DbiResourceId'. The value starts with db- followed by an uppercase alphanumeric string. This identifier is stable across reboots but changes when a new instance is created from a snapshot.
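
When scripting against the PI API, a quick format check (our own helper, based on the description above) catches the common mistake of passing the instance name instead of the resource ID:

```python
import re

def looks_like_dbi_resource_id(value):
    """Sanity-check: 'db-' followed by an uppercase alphanumeric string."""
    return bool(re.fullmatch(r"db-[A-Z0-9]+", value))

print(looks_like_dbi_resource_id("db-ABCDEFGHIJKLMNOPQRSTUVWXYZ"))  # True
print(looks_like_dbi_resource_id("prod-pg-01"))                     # False
```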

Can Performance Insights show query plans?

No. Performance Insights captures normalized SQL digests and wait events but does not store execution plans. For plan analysis, combine PI (to identify the high-load digest) with EXPLAIN (ANALYZE, BUFFERS) run manually, or enable auto_explain via the shared_preload_libraries parameter to log slow-query plans automatically.

Is Performance Insights available on RDS for PostgreSQL read replicas?

Yes. Performance Insights is independently enabled per instance, so it can be turned on for the primary, for read replicas, or for both. Enabling it on a read replica is useful for correlating analytical read load against replica lag and identifying IO:DataFileRead bursts caused by large sequential scans.

About MinervaDB Corporation
Full-stack Database Infrastructure Architecture, Engineering and Operations Consultative Support (24×7) provider for PostgreSQL, MySQL, MariaDB, MongoDB, ClickHouse, Trino, SQL Server, Cassandra, CockroachDB, Yugabyte, Couchbase, Redis, Valkey, NoSQL, NewSQL, SAP HANA, Databricks, Amazon Redshift, Amazon Aurora, CloudSQL, Snowflake and AzureSQL, with core expertise in Performance, Scalability, High Availability, Database Reliability Engineering, Database Upgrades/Migration, and Data Security.