How do Bloom Filters Work in MyRocks?

Bloom filters are probabilistic data structures that are used for membership tests. They are particularly useful in databases as they allow faster and more efficient querying of data. In MyRocks, a storage engine based on RocksDB, Bloom filters are used to speed up read queries.

A Bloom filter consists of a bit array of a fixed size and a set of hash functions. The bit array is initially set to all zeros, and for each value that is added to the Bloom filter, the corresponding bits in the array are set to 1 using the hash functions. When a query is made to check for the presence of a value in the filter, the same hash functions are used to check the corresponding bits in the array. If any of the bits are not set, then the value is not in the filter. However, if all the bits are set, then the value may be in the filter, or it may be a false positive.

In MyRocks, Bloom filters are used to speed up read queries by allowing the engine to skip unnecessary disk reads. When a query is made, the Bloom filter is checked to see if the requested data is likely to be on disk. If the filter indicates that the data is not on disk, then the engine can avoid a disk read and return a result indicating that the data is not present. However, if the filter indicates that the data may be on disk, then the engine must perform a disk read to confirm the presence of the data.

MyRocks also allows for the use of compressed Bloom filters, which can further reduce the storage overhead of the filters. Compressed Bloom filters are created by applying a compression algorithm to the bit array, which reduces the number of bits required to represent the filter. However, the compression algorithm can also introduce additional false positives into the filter, which can reduce the accuracy of the membership tests.

Overall, Bloom filters are an important component of MyRocks, as they allow for more efficient querying of data by reducing the number of disk reads required. However, they must be carefully tuned to balance the tradeoff between filter size and accuracy.

About Shiv Iyer 390 Articles
Open Source Database Systems Engineer with a deep understanding of Optimizer Internals, Performance Engineering, Scalability and Data SRE. Shiv currently is the Founder, Investor, Board Member and CEO of multiple Database Systems Infrastructure Operations companies in the Transaction Processing Computing and ColumnStores ecosystem. He is also a frequent speaker in open source software conferences globally.