Monitoring PostgreSQL Disk I/O Performance

PostgreSQL performance depends heavily on the I/O subsystem, and latency there can cause many performance problems, such as slow response times and tasks timing out. It is therefore critical that you monitor your disk usage. Disk Transfers/sec is the sum of Disk Reads/sec and Disk Writes/sec, and you can use these counters to determine whether a volume is backed by enough physical disks. When you use these counters, adjust the values for the RAID level you have implemented, because each level amplifies writes differently (see the worked example after the list):

  • RAID 0 — I/Os per disk = (reads + writes) / number of disks
  • RAID 1 — I/Os per disk = [reads + (2 * writes)] / 2
  • RAID 5 — I/Os per disk = [reads + (4 * writes)] / number of disks
  • RAID 10 — I/Os per disk = [reads + (2 * writes)] / number of disks
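
As a worked example with hypothetical numbers: a RAID 5 array of 6 disks serving 600 reads/sec and 150 writes/sec performs [600 + (4 * 150)] / 6 = 200 I/Os per disk, even though the array as a whole reports only 750 Disk Transfers/sec. The multiplier reflects the write penalty: a single logical write on RAID 5 costs four physical I/Os (read data, read parity, write data, write parity), which is why the same workload works a RAID 5 set harder than a RAID 10 set.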

Disk counters that you can monitor to determine disk activity are divided into the following two groups (the names below are Windows Performance Monitor PhysicalDisk counters; their Linux equivalents are covered in the next section, and a PostgreSQL-side companion query follows the list):

  • Primary
    • PhysicalDisk: Avg. Disk sec/Write
    • PhysicalDisk: Avg. Disk sec/Read
  • Secondary
    • PhysicalDisk: Avg. Disk Queue Length
    • PhysicalDisk: Disk Bytes/sec
    • PhysicalDisk: Disk Transfers/sec
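
To see how much of this disk traffic PostgreSQL itself is generating, you can check its cumulative statistics views. Below is a minimal sketch using pg_statio_user_tables (available in all supported PostgreSQL versions); a low hit ratio on a hot table means more reads are falling through to the disks measured by the counters above:

    -- Tables doing the most physical block reads, with their buffer cache hit ratio
    -- (counters are cumulative since the last statistics reset)
    SELECT relname,
           heap_blks_read,
           heap_blks_hit,
           round(heap_blks_hit::numeric
                 / NULLIF(heap_blks_hit + heap_blks_read, 0), 4) AS heap_hit_ratio
    FROM pg_statio_user_tables
    ORDER BY heap_blks_read DESC
    LIMIT 10;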

Monitoring Disk I/O Performance in Linux
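
On Linux, the same primary and secondary signals are exposed by iostat -x from the sysstat package: r_await and w_await correspond to Avg. Disk sec/Read and Avg. Disk sec/Write (reported in milliseconds rather than seconds), aqu-sz (avgqu-sz in older sysstat releases) corresponds to Avg. Disk Queue Length, r/s + w/s to Disk Transfers/sec, and rkB/s + wkB/s to Disk Bytes/sec. The 20 ms latency rule of thumb used in the conclusion below applies here as well.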

Monitoring PostgreSQL Table Size
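
Table growth is a common driver of rising I/O, so it is worth tracking sizes over time. Here is a minimal sketch using the built-in size functions; pg_relation_size() reports the main heap alone, while pg_total_relation_size() adds indexes and TOAST data:

    -- Ten largest tables in the current database: heap size versus total footprint
    SELECT schemaname,
           relname,
           pg_size_pretty(pg_relation_size(relid))       AS heap_size,
           pg_size_pretty(pg_total_relation_size(relid)) AS total_size
    FROM pg_statio_user_tables
    ORDER BY pg_total_relation_size(relid) DESC
    LIMIT 10;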

Finding the Largest Databases in Your PostgreSQL Cluster
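
A quick way to see where the bytes live at the cluster level is pg_database_size(); note that it requires the CONNECT privilege on each database you size:

    -- All databases in the cluster, largest first
    SELECT datname,
           pg_size_pretty(pg_database_size(datname)) AS db_size
    FROM pg_database
    ORDER BY pg_database_size(datname) DESC;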

Finding the Total Size of Your Biggest Tables in Your PostgreSQL Cluster
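
To put one number on how concentrated your storage is, you can total the largest relations. Below is a minimal sketch summing the ten biggest tables in the current database (the LIMIT of 10 is an arbitrary choice):

    -- Combined on-disk size of the ten largest tables
    SELECT pg_size_pretty(sum(pg_total_relation_size(relid))) AS top_ten_total
    FROM (SELECT relid
          FROM pg_statio_user_tables
          ORDER BY pg_total_relation_size(relid) DESC
          LIMIT 10) AS biggest_tables;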

Conclusion

The following three questions can help identify whether there is a storage bottleneck and, if so, where it likely is:

  • Is there observed latency? (Avg. Disk sec/Read > 0.020 or Avg. Disk sec/Write > 0.020, i.e. more than 20 ms per operation)
  • Is the latency observed on all (or most) of the disks (LUNs), or only on a single disk or a few?
    • This question helps determine whether the problem leans towards a general communication bottleneck between the server and the storage, or is more likely due to limitations of the physical spindles.
    • If most of the disks show latency at the same time, the cause may be a communication bottleneck, such as an HBA, switch, SAN port, or SAN CPU.
    • If many LUNs come from the same storage device and only one or a few show latency, the issue is likely specific to those LUNs.
  • Finally, compare the disk throughput (Disk Transfers/sec and Disk Bytes/sec) during the period when latency was observed with the period when maximum throughput was observed.
    • If latency always grows in proportion to throughput, the issue may be with the physical spindles, though this does not rule out the communication layer. Engage your storage administrator to verify that the physical spindles can handle the throughput observed with Disk Transfers/sec and Disk Bytes/sec.
    • If latency is much lower during periods of much higher activity, then the bottleneck is likely not the physical spindles (JBOD). Engage a storage administrator to help review the storage fabric (HBA, switches, SAN CPU, ports, …). PostgreSQL can supply the database-side half of this comparison, as shown in the query after this list.
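
Here is a minimal sketch of that database-side view against pg_stat_database, assuming track_io_timing is enabled; the counters are cumulative, so the derived average is only meaningful between two snapshots or after a statistics reset:

    -- Average block read latency observed by PostgreSQL, per database
    -- (requires track_io_timing = on; blk_read_time is cumulative milliseconds)
    SELECT datname,
           blks_read,
           blk_read_time,
           round((blk_read_time / NULLIF(blks_read, 0))::numeric, 3) AS avg_read_ms
    FROM pg_stat_database
    WHERE datname IS NOT NULL
    ORDER BY blk_read_time DESC;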
