Understanding Hash Aggregation Implementation in MariaDB for Efficient Query Execution

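Hash aggregation is an aggregation strategy for evaluating GROUP BY queries with aggregate functions such as SUM, COUNT, MAX, and MIN over large data sets. It hashes the grouping key of each row to locate that group's entry in an in-memory structure and then updates the running aggregate values stored there, so the input does not have to be sorted first. MariaDB has no operator that EXPLAIN labels as hash aggregation; the closest equivalent is grouping through an internal temporary table that is keyed on the GROUP BY columns.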

Here’s how hash aggregation is implemented in MariaDB:

The query optimizer first identifies the grouping columns in the query and determines whether grouping through a temporary table is appropriate, or whether an index can already supply the rows in grouping order.
If the temporary-table approach is selected, the server creates an intermediate result set that contains the grouping columns and the columns used in the aggregate functions.
Conceptually, the rows are then partitioned based on the hash value of the grouping key, so that all rows with the same key end up in the same in-memory bucket.
For each bucket, the aggregate functions are calculated and the results are kept in memory.
Once all rows have been processed, the results are combined into a final result set and returned to the user.
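
The partition, aggregate, and combine steps above can be sketched in a few lines of Python. This is a conceptual illustration only, not MariaDB's internal code: the function name hash_aggregate, its parameters, and the num_partitions value are assumptions made for the example.

```python
from collections import defaultdict
from decimal import Decimal

def hash_aggregate(rows, key_cols, value_col, num_partitions=4):
    """Sum value_col per grouping key, partitioning rows by key hash first."""
    # Step 1: route each row to a partition chosen by the hash of its key.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        partitions[hash(key) % num_partitions].append((key, row[value_col]))

    # Step 2: aggregate each partition independently in its own hash table.
    result = {}
    for part in partitions:
        totals = defaultdict(Decimal)
        for key, value in part:
            totals[key] += value
        # Step 3: a key lives in exactly one partition, so merging is a plain union.
        result.update(totals)
    return result

# Hypothetical input rows, shaped like (category, amount) records.
rows = [
    {"category": "A", "amount": Decimal("100")},
    {"category": "B", "amount": Decimal("200")},
    {"category": "A", "amount": Decimal("150")},
]
totals = hash_aggregate(rows, ["category"], "amount")
```

Because every row with a given key hashes to the same partition, the final combine step never has to reconcile two partial sums for one group.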

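Hash aggregation is efficient for large data sets because it needs only one pass over the input and no sort. MariaDB executes a single query on one thread, so the benefit does not come from spreading the work across CPU cores. The cost is memory: the intermediate results grow with the number of distinct groups and aggregate functions, and once an in-memory temporary table exceeds the tmp_table_size or max_heap_table_size limit, MariaDB converts it to an on-disk temporary table, which is considerably slower.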
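
To make the memory point concrete, the following Python sketch counts how many hash table entries one aggregation pass creates. It assumes a simple COUNT(*) per group; the function count_group_entries and the generated data are hypothetical and only illustrate that memory follows the number of distinct groups, not the number of rows.

```python
def count_group_entries(rows, key_cols):
    """Return (rows_seen, hash_table_entries) after one aggregation pass."""
    table = {}
    rows_seen = 0
    for row in rows:
        rows_seen += 1
        key = tuple(row[c] for c in key_cols)
        table[key] = table.get(key, 0) + 1  # COUNT(*) per group
    return rows_seen, len(table)

# 10,000 rows but only 3 distinct categories: the table stays tiny.
few_groups = [{"category": f"C{i % 3}"} for i in range(10_000)]
# 10,000 rows, each with its own key: one entry per row.
many_groups = [{"category": f"C{i}"} for i in range(10_000)]

few = count_group_entries(few_groups, ["category"])
many = count_group_entries(many_groups, ["category"])
```

With the same row count, the second input needs thousands of times more entries, which is the situation in which an in-memory structure is most likely to outgrow its limit.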

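In MariaDB, you can get a rough view of this work through status counters, for example SHOW STATUS LIKE 'Handler_tmp%' and SHOW STATUS LIKE 'Created_tmp%'. The Handler_tmp_write and Handler_tmp_update counters report rows written to and updated in internal temporary tables, while Created_tmp_tables and Created_tmp_disk_tables report how many temporary tables were created and how many of them ended up on disk. These counters track row operations and table counts rather than bytes of memory; SHOW STATUS LIKE 'Handler_read%' additionally shows how the rows of the base table were read.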

Example

Here’s an example of how hash aggregation works in MariaDB.

Consider the following table named “sales” containing sales data for a company:

CREATE TABLE sales (
    id INT,
    product VARCHAR(50),
    category VARCHAR(50),
    sales_date DATE,
    amount DECIMAL(10,2)
);

Let’s populate the table with some sample data:

INSERT INTO sales VALUES
(1, 'Product 1', 'Category A', '2021-01-01', 100),
(2, 'Product 2', 'Category B', '2021-01-01', 200),
(3, 'Product 1', 'Category A', '2021-01-02', 150),
(4, 'Product 2', 'Category B', '2021-01-02', 250),
(5, 'Product 3', 'Category C', '2021-01-01', 300),
(6, 'Product 3', 'Category C', '2021-01-02', 350);

Now, let’s say we want to calculate the total sales amount by product and category. We can prefix the query with EXPLAIN to see how MariaDB plans to evaluate it:

EXPLAIN SELECT product, category, SUM(amount) as total_sales
FROM sales
GROUP BY product, category;

Since the table has no indexes, the plan would look similar to the following (the rows value is an estimate):

+------+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                           |
+------+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+
|    1 | SIMPLE      | sales | ALL  | NULL          | NULL | NULL    | NULL |    6 | Using temporary; Using filesort |
+------+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+

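In this query, the GROUP BY clause groups the sales data by product and category, and the SUM function computes the total sales amount for each group. Because no index delivers the rows in (product, category) order, the plan reports Using temporary: the rows are grouped through an internal temporary table keyed on the grouping columns, which is where the hash-based grouping takes place. Using filesort indicates an additional sort step, which typically puts the groups in GROUP BY order.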

In hash aggregation, rows are assigned to buckets based on the grouping keys (product and category in this case) using a hash function. The aggregate function is updated for the matching group as each row arrives, and the per-group results form the final output. When the grouped data fits in memory, this is usually cheaper than sorting-based aggregation, because it avoids the sort and needs only a single pass over the data.
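
The difference between the two strategies can be seen side by side in a small Python sketch. Both functions are illustrative assumptions rather than MariaDB internals; they take (key, value) pairs built from the sample rows above and must produce the same totals.

```python
from itertools import groupby
from operator import itemgetter

def sort_aggregate(pairs):
    """Sort-based: sort by key (O(n log n)), then sum each run of equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(ordered, key=itemgetter(0))
    }

def hash_aggregate_pairs(pairs):
    """Hash-based: one pass, no sort; each key maps straight to its running sum."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals

# (product, category) keys and amounts taken from the sample sales rows.
pairs = [
    (("Product 1", "Category A"), 100),
    (("Product 2", "Category B"), 200),
    (("Product 1", "Category A"), 150),
    (("Product 2", "Category B"), 250),
    (("Product 3", "Category C"), 300),
    (("Product 3", "Category C"), 350),
]
by_sort = sort_aggregate(pairs)
by_hash = hash_aggregate_pairs(pairs)
```

The sort-based version returns groups in key order as a side effect, while the hash-based version gives no ordering guarantee beyond what the hash table provides, so a query that also needs ordered output still pays for a sort.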

Running the query without EXPLAIN returns:

+-----------+------------+-------------+
| product   | category   | total_sales |
+-----------+------------+-------------+
| Product 1 | Category A | 250.00      |
| Product 2 | Category B | 450.00      |
| Product 3 | Category C | 650.00      |
+-----------+------------+-------------+

In summary, hash aggregation is a powerful technique used in database systems to efficiently compute aggregate functions over large datasets. In MariaDB, hash aggregation can be used to compute aggregate functions like SUM, COUNT, MAX, MIN, and AVG, and it is particularly useful when dealing with large datasets where sorting-based aggregation is too costly.
