How PostgreSQL Optimizer selects which indexes to use?

In PostgreSQL, the optimizer selects which indexes to use for query execution based on cost-based optimization. The goal is to choose the most efficient index or combination of indexes that minimize the total cost of executing the query. The process involves estimating the cost of different access methods and join strategies and selecting the one with the lowest estimated cost. Here’s an overview of how the PostgreSQL optimizer selects indexes:

Query Parsing and Analysis:

  • The PostgreSQL optimizer first parses the query and performs syntactic and semantic analysis to understand the query structure and identify the tables and columns involved.
  • It gathers statistics about the tables and columns, including cardinality (number of distinct values) and data distribution, from the system catalog tables and statistics objects.

Query Rewriting and Transformation:

  • The optimizer applies query rewriting and transformation techniques to transform the original query into an equivalent form that can be optimized more efficiently.
  • This includes transforming subqueries, eliminating redundant expressions, and applying predicate pushdown.

Cost Estimation:

  • The optimizer estimates the cost of different access methods and join strategies for executing the query with different index choices.
  • It uses statistical information, such as the number of rows, selectivity, and data distribution, to estimate the cost of different operations.
  • The cost model considers factors like disk I/O, CPU usage, memory usage, and network communication.

Index Selection:

  • The optimizer generates a set of potential index access paths and join strategies based on the available indexes and query predicates.
  • It assigns a cost estimate to each potential access path and join strategy using the cost model.
  • The optimizer then performs a cost-based comparison of the access paths and join strategies and selects the one with the lowest estimated cost.

Factors Considered by the Optimizer:

  • Selectivity: The selectivity of predicates helps the optimizer estimate the number of rows that will match the conditions. More selective conditions are more likely to benefit from index usage.
  • Index Statistics: The optimizer relies on statistics, such as the number of distinct values and data distribution, to estimate the cost of using an index.
  • Index Size and Structure: The size and structure of the index influence the cost of index scans and seeks. Smaller indexes and those with fewer levels are generally more efficient.
  • Join Selectivity: When joining multiple tables, the optimizer considers the selectivity of join conditions and estimates the number of rows that will result from the join.

Hints and Configuration:

  • PostgreSQL also allows the use of hints to provide guidance to the optimizer about the desired index usage.
  • Configuration parameters like random_page_cost, seq_page_cost, and effective_cache_size influence the cost model and can impact index selection.

To assist the optimizer in selecting the appropriate indexes, it is important to keep table and index statistics up to date using the ANALYZE command and consider proper indexing strategies based on the query patterns and workload characteristics.

It’s worth noting that the optimizer’s choices may not always align with expectations, and occasional manual intervention using hints or rewriting queries may be necessary to achieve optimal index usage.

About Shiv Iyer 452 Articles
Open Source Database Systems Engineer with a deep understanding of Optimizer Internals, Performance Engineering, Scalability and Data SRE. Shiv currently is the Founder, Investor, Board Member and CEO of multiple Database Systems Infrastructure Operations companies in the Transaction Processing Computing and ColumnStores ecosystem. He is also a frequent speaker in open source software conferences globally.