Hacking the PostgreSQL Parser: A Deep Dive into PostgreSQL Internals

The source code for the PostgreSQL LALR(1) parser is written in C and can be found in the PostgreSQL source code repository. The main components of the parser are:

scan.l: A lexer that takes the string representation of a SQL query and converts it into a stream of tokens.
gram.y: A Bison specification that defines the grammar of the SQL language and the actions to be performed by the parser when it recognizes each non-terminal symbol.
parse.c: The implementation of the parser itself, which uses the lexer and the parsing table generated from the grammar to parse queries.
parse_node.h: A header file that defines the data structures used by the parser, including the syntax tree nodes and the parse state structure.

To understand how the parser works, you can study the source code of each of these components and see how they interact with each other. You can also experiment with queries and observe the log output produced by the parser to see how it works in practice.

Table of Contents

Hacking PostgreSQL lexer (scan.l ) source code

The scan.l source code file in PostgreSQL implements the lexer, which is responsible for converting the string representation of a SQL query into a stream of tokens. The lexer uses the Flex lexer generator to generate its implementation.

The lexer is defined using a set of regular expressions that match specific patterns in the input, along with actions to be performed when a match is found. The actions typically include emitting a token with a type and a value, and updating the lexer’s state.

Here is a simplified example of what the lexer code in scan.l might look like:

%{
#include "postgres.h"
#include "nodes/parsenodes.h"
#include "parser/scanner.h"
%}

%option noyywrap

%x INTEGER

%%

[0-9]+ {
    yylval.ival = atoi(yytext);
    return INTEGER;
}

[a-zA-Z]+ {
    yylval.str = pstrdup(yytext);
    return IDENT;
}

[ \t\n]+ { /* skip whitespace */ }

. {
    /* error */
    yyerror("invalid character: %c", yytext[0]);
    return 0;
}

%%

In this example, the lexer has three rules: one for matching integers, one for matching alphabetic strings, and one for skipping whitespace. The yylval variable is used to store the value of the token being emitted, and the yytext variable contains the text that matched the pattern. The pstrdup function is used to make a copy of the matched text in the token value.

The lexer uses the Flex library to generate a scanner that reads the input stream, matches patterns in the input, and performs the associated actions. The generated scanner is then used by the parser to get the next token in the input stream.

Hacking PostgreSQL gram.y ( Bison specification for the grammar of the SQL language) in PostgreSQL source

The gram.y file in the PostgreSQL source code is a Bison specification for the grammar of the SQL language. It defines the syntax of the SQL language and specifies how the parser should recognize and process valid SQL statements.

The grammar is defined using a series of rules, with each rule representing a non-terminal symbol in the grammar. The left-hand side of each rule defines the non-terminal symbol, and the right-hand side defines one or more alternative expansions of the symbol in terms of other non-terminal or terminal symbols.

Here is a simplified example of what the grammar code in gram.y might look like:

%{
#include "postgres.h"
#include "nodes/parsenodes.h"
#include "parser/gram.h"
%}

%token INTEGER IDENT
%start query

%%

query:
    SELECT_STMT
    {
        $$ = make_query($1);
    }
;

SELECT_STMT:
    SELECT column_list
    FROM table_list
    {
        $$ = make_select_stmt($2, $4);
    }
;

column_list:
    IDENT
    {
        $$ = make_column_list($1);
    }
|   column_list ',' IDENT
    {
        $$ = add_column_to_list($1, $3);
    }
;

table_list:
    IDENT
    {
        $$ = make_table_list($1);
    }
|   table_list ',' IDENT
    {
        $$ = add_table_to_list($1, $3);
    }
;

%%

In this example, the grammar has four non-terminal symbols: query, SELECT_STMT, column_list, and table_list. The query symbol is the start symbol of the grammar, and it represents a complete SQL query. The SELECT_STMT symbol represents a SELECT statement, and the column_list and table_list symbols represent lists of column and table names, respectively.

Each rule has an associated action that is executed when the parser recognizes the non-terminal symbol defined by the rule. In this example, the actions use functions such as make_query, make_select_stmt, make_column_list, and add_column_to_list to build a parse tree representing the input. The parse tree is then used by the rest of the system to execute the query.

Hacking parse.c (PostgreSQL Parser) in PostgreSQL source

The parse.c file in the PostgreSQL source code is the implementation of the SQL parser in C. It uses the Bison-generated parser code and the lexer code (found in scan.l) to recognize and process SQL statements.

The main entry point for the parser is the parser function, which is called to parse an input string of SQL text. This function uses the lexer to tokenize the input string, and then passes the stream of tokens to the Bison-generated parser. The parser uses the grammar defined in gram.y to recognize valid SQL statements and to construct a parse tree representing the input.

Here is a simplified example of what the code in parse.c might look like:

#include "postgres.h"
#include "nodes/parsenodes.h"
#include "parser/gram.h"

/* Other headers and functions */

Node *
parser(const char *input)
{
    int parse_result;

    yy_scan_string(input);
    parse_result = yyparse();
    yy_delete_buffer(YY_CURRENT_BUFFER);

    if (parse_result != 0)
        ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), errmsg("Invalid SQL statement")));

    return parse_tree;
}

In this example, the parser function takes an input string of SQL text and tokenizes it using the lexer.

Hacking parse_node.h(the header file that defines data structures used by parser) in PostgreSQL source

The parse_node.h file in the PostgreSQL source code is a header file that defines the data structures used to represent the parse tree in the PostgreSQL SQL parser.

The parse tree is a hierarchical representation of the SQL statement being parsed, where each node in the tree represents a component of the statement, such as a query, a select clause, a join condition, or a constant value.

Here is a simplified example of what the code in parse_node.h might look like:

#ifndef PARSE_NODE_H
#define PARSE_NODE_H

#include "nodes/nodes.h"
#include "nodes/primnodes.h"

typedef struct Query
{
    NodeTag type;
    List *commandType;
    bool canSetTag;
    Node *utilityStmt;
    int resultRelation;
    List *intoClause;
    bool hasAggs;
    bool hasSubLinks;
    /* Other fields */
} Query;

typedef struct SelectStmt
{
    NodeTag type;
    List *targetList;
    List *fromClause;
    Node *whereClause;
    List *groupClause;
    Node *havingClause;
    /* Other fields */
} SelectStmt;

typedef struct ColumnRef
{
    NodeTag type;
    List *fields;
    /* Other fields */
} ColumnRef;

typedef struct A_Expr
{
    NodeTag type;
    int kind;
    Node *lexpr;
    Node *rexpr;
    /* Other fields */
} A_Expr;

typedef struct TypeName
{
    NodeTag type;
    List *names;
    /* Other fields */
} TypeName;

/* Other data structures */

#endif /* PARSE_NODE_H */

In this example, the parse tree is represented using a set of C structs, each of which corresponds to a different type of node in the tree. For example, the Query struct represents a query node, and it contains fields for the command type, the result relation, the into clause, and other information about the query. The SelectStmt struct represents a SELECT statement, and it contains fields for the target list, the from clause, the where clause, and the group clause. The ColumnRef struct represents a reference to a column in the target list or in a condition, and it contains a list of field names. The A_Expr struct represents an arithmetic expression, and it contains fields for the left and right operands and the type of the expression.

These data structures are used throughout the PostgreSQL code to process and analyze SQL statements, and they are a crucial part of the parser implementation.

Conclusion

The performance of a PostgreSQL system can be significantly influenced by the efficiency of the parser, as the parser is involved in nearly every SQL query that is executed.

A poorly designed or inefficient parser can lead to slower query processing times, which in turn can lead to decreased system performance and reduced overall scalability.

To ensure good performance, the PostgreSQL parser is designed to be fast and efficient. It uses a combination of lexical analysis, syntax analysis, and semantic analysis to parse SQL statements, and it caches frequently used query plans to avoid re-parsing the same statements. Additionally, the parser takes advantage of advanced techniques such as grammar specialization and predictive parsing to minimize the number of steps required to parse a statement.

In summary, the efficiency and design of the PostgreSQL parser has a direct impact on the overall performance of the system, and it is an important factor to consider when evaluating and tuning the performance of a PostgreSQL installation.

The WebScale Database Infrastructure Architecture, Engineering and Operations Company

Full-Stack Database Engineering & Cloud DBaaS Solutions for PostgreSQL, MySQL, MongoDB & More | Performance, Scalability, High Availability, Security & Analytics Experts

PostgreSQL Internals – Hacking PostgreSQL Parser

Hacking PostgreSQL lexer (scan.l ) source code

Hacking PostgreSQL gram.y ( Bison specification for the grammar of the SQL language) in PostgreSQL source

Hacking parse.c (PostgreSQL Parser) in PostgreSQL source

Hacking parse_node.h(the header file that defines data structures used by parser) in PostgreSQL source

Conclusion