CockroachDB is a distributed SQL database built on a transactional and strongly-consistent key-value store that scales horizontally across multiple nodes while maintaining ACID compliance. Designed with inspiration from Google’s Spanner and F1 systems, CockroachDB aims to provide the best aspects of both traditional SQL databases and modern NoSQL systems, combining SQL functionality with horizontal scalability and high availability. The database implements a sophisticated architecture that ensures data consistency even in the face of machine, rack, and even data center failures.
CockroachDB Layered Architecture
Architectural Design Goals
CockroachDB was built with several primary design goals that guide its architecture. These include making life easier for humans through low-touch operations and simple reasoning for developers, offering industry-leading consistency even on massively scaled deployments, creating an always-on database that accepts reads and writes on all nodes without conflicts, allowing flexible deployment in any environment without vendor lock-in, and supporting familiar tools for working with relational data through SQL. The database employs a shared-nothing, multi-leader architecture where each node functions symmetrically, acting as a gateway to the entire database.
Layered Architecture
CockroachDB’s architecture is organized into distinct layers, each handling specific aspects of database operations while interacting with the layers directly above and below. This layered approach helps manage complexity and provides clear separation of concerns.
SQL Layer
The SQL layer sits at the top of CockroachDB’s architecture and serves as the interface between client applications and the database system. It receives SQL queries from clients, parses them into abstract syntax trees (AST), optimizes the query plans, and converts them into low-level key-value operations that can be processed by lower layers. The SQL layer contains several key components, including:
- SQL API – Forms the user interface for interacting with the database
- Parser – Converts SQL text into an abstract syntax tree
- Cost-based optimizer – Converts the AST into an optimized logical query plan based on data statistics and indexes
- Physical planner – Converts the logical plan into a physical execution plan across one or more nodes
- SQL execution engine – Executes the physical plan by making read and write requests to the underlying key-value store
CockroachDB supports both row-oriented and vectorized execution models, with the vectorized engine providing significant performance improvements for analytical workloads. The vectorized execution processes data in column-batches rather than row-by-row, making more efficient use of modern CPU caches and instruction pipelines.
Transaction Layer
The transaction layer implements support for ACID transactions by coordinating concurrent operations across the distributed system. This layer ensures that despite being distributed, CockroachDB maintains strong consistency guarantees. Key aspects of the transaction layer include:
- Transaction management – Handling the lifecycle of transactions from begin to commit or rollback
- Distributed transaction coordination – Ensuring atomicity across multiple ranges and nodes
- Parallel Commits protocol – An atomic commit protocol that reduces latency by performing consensus round trips concurrently
- Multi-Version Concurrency Control (MVCC) – Providing transaction isolation without blocking by maintaining multiple versions of data
CockroachDB uses a hybrid logical clock (HLC) as the versioning mechanism for MVCC, combining system time and a logical counter to create monotonically increasing timestamps that can be used for time travel queries. Transactions in CockroachDB guarantee serializable isolation, the strongest SQL isolation level, ensuring that concurrent transactions behave as if they executed serially.
Distribution Layer
The distribution layer provides a unified view of the cluster’s data by managing how data is distributed across nodes. This layer is responsible for:
- Managing the monolithic sorted map of key-value pairs that contains all data
- Dividing the key-space into ranges (contiguous chunks of approximately 64MB)
- Tracking the location of ranges using meta ranges (a two-level index at the beginning of the key-space)
- Handling range splits and merges as data grows or shrinks
The distribution layer enables both simple lookups and efficient scans by maintaining the order of data in the key-value store. It also facilitates data locality through geo-partitioning, allowing specific data to be pinned to particular geographic regions.
Replication Layer
The replication layer ensures data is safely replicated across multiple nodes for high availability and fault tolerance.This layer implements:
- The Raft consensus protocol to maintain consistency between replicas
- Leader election to designate one replica as the Raft leader
- Log replication to ensure all replicas apply the same changes in the same order
- Range lease management to determine which replica (the leaseholder) serves reads and coordinates writes
Each range in CockroachDB is replicated to at least three nodes by default, allowing the system to tolerate failures of individual nodes while maintaining data availability. The replication factor can be configured at the cluster, database, or table level using replication zones.
CockroachDB Transaction Flow
Storage Layer
The storage layer reads and writes data to disk, serving as the interface between the in-memory state and persistent storage. Key components include:
- Pebble – A RocksDB-inspired key-value storage engine written in Go
- Block cache – Shared among all stores in a node to improve read performance
- Write-ahead log – Ensures durability of writes before they’re committed
CockroachDB previously used RocksDB as its storage engine but switched to Pebble, which is specifically optimized for CockroachDB’s needs and avoids the challenges of crossing the Cgo boundary. Each CockroachDB node contains at least one store, which is where the process reads and writes data on disk.
Key Components and Mechanisms
Ranges and Data Distribution
CockroachDB divides its entire key-space into ranges, which are contiguous chunks of the key-space with a default target size of 64MB . These small range sizes facilitate quick splits and merges while helping distribute load at hotspots within a key range. From an SQL perspective, a table and its secondary indexes initially map to a single range, with each key-value pair representing a single row in the table.
When a range grows beyond the target size, it splits into two ranges, a process that continues as tables and indexes grow. Ranges are distributed across nodes in the cluster, with no node containing more than one replica of the same range.
Consensus Protocol (Raft)
CockroachDB uses the Raft consensus algorithm to ensure data consistency across replicas. Unlike many other systems that use Raft as a single consensus group, CockroachDB implements what it calls “MultiRaft,” where each range has its own consensus group. This approach allows a single node to participate in thousands of consensus groups simultaneously while reducing network traffic by batching heartbeat messages between nodes.
The Raft protocol in CockroachDB follows three key phases:
- Leader election – Nodes elect a leader through randomized timeouts and voting
- Log replication – The leader appends new entries to its log and replicates them to followers
- Safety – A majority of nodes must agree on entries before they’re committed, ensuring consistency
Leaseholders and Range Leadership
For each range, one of its replicas is designated as the leaseholder. The leaseholder receives and coordinates all read and write requests for that range. This mechanism allows CockroachDB to avoid the coordination overhead of sending every read through the Raft protocol while still maintaining consistency.
In the latest versions of CockroachDB, a feature called “leader leases” ensures that the Raft leader for a range is always the leaseholder (except briefly during lease transfers). This improvement enhances both resilience and performance by reducing overhead in the leasing mechanism and enabling faster recovery after node failures.
Multi-Version Concurrency Control (MVCC)
CockroachDB leverages Multi-Version Concurrency Control to handle concurrent transactions without locking. The core idea behind MVCC is never to overwrite existing data; instead, updates create new versions with new timestamps. This approach eliminates lock contentions, improves concurrency, and enables time travel queries.
When a read transaction starts, it specifies a timestamp at which it wants to read data, and CockroachDB ensures it only sees versions committed before that timestamp. Write transactions create new versions with new timestamps, making them visible to subsequent read transactions. Old versions are eventually cleaned up through garbage collection after they’re no longer needed.
Storage Engine (Pebble)
Pebble is CockroachDB’s storage engine, inspired by RocksDB but written in Go and optimized specifically for CockroachDB’s needs. It provides atomic write batches and snapshots, which give CockroachDB a subset of transaction capabilities at the storage level.
The storage engine uses a log-structured merge (LSM) tree implementation for efficiently storing and retrieving key-value pairs. This design optimizes for write performance by sequentially appending data to log files and periodically merging and compacting these files in the background.
Advanced Features
Distributed SQL Execution
CockroachDB supports distributed SQL execution, where parts of a query can be processed in parallel across multiple nodes. The gateway node creates a directed acyclic graph (DAG) of SQL processors and distributes them across the cluster, similar to Google’s F1 system. This distributed execution reduces network traffic by applying filters and aggregations where the data lives, only sending necessary results back to the gateway.
The SQL processors operate in either a tuple-at-a-time mode or a vectorized mode, depending on the query characteristics. The vectorized execution engine, introduced in newer versions of CockroachDB, significantly improves performance for analytical queries by processing data in column batches rather than row by row.
Geo-Partitioning and Data Locality
CockroachDB supports geo-partitioning, which allows data to be located in specific geographic regions based on business needs. This feature is particularly useful for applications that need to comply with data residency regulations or want to minimize latency by keeping data close to users.
Data locality can be configured at various levels, from cluster-wide settings to table-specific rules. For example, a table can be partitioned by region, ensuring that European customer data stays in European data centers while US customer data remains in US data centers.
Fault Tolerance and Recovery
CockroachDB is designed to survive failures at various levels, from disk failures to entire data center outages. The system automatically detects node failures and initiates recovery processes without manual intervention. When a node fails, the cluster rebalances data to maintain the configured replication factor, ensuring data remains available and properly replicated.
The database provides built-in high availability features and disaster recovery tooling to achieve operational resilience. These include multi-active availability through Raft replication, advanced fault tolerance capabilities for routine maintenance operations, and tools for backup and restore.
Scalability and Performance
CockroachDB achieves elastic scalability through its distributed architecture, allowing both vertical and horizontal scaling. As workloads grow, new nodes can be added to the cluster, and data automatically rebalances across them. This scalability extends to both read and write operations, as each node can serve both types of requests regardless of deployment topology.
Performance optimization in CockroachDB involves considerations at multiple levels, including node topology, hardware selection, workload patterns, and SQL query optimization. The cost-based optimizer plays a crucial role in selecting efficient execution plans, taking into account data distribution, statistics, and available indexes.
Conclusion
CockroachDB’s architecture represents a sophisticated approach to distributed database design, successfully combining the scalability and resilience of distributed systems with the familiar SQL interface and ACID guarantees of traditional relational databases. Through its layered design, consensus-based replication, distributed transaction management, and innovative features like parallel commits and vectorized execution, CockroachDB offers a compelling solution for modern applications that require both horizontal scalability and strong consistency.
The architecture enables CockroachDB to deliver on its core promises: survivability through automatic fault tolerance, scalability through horizontal growth, and SQL compliance through its PostgreSQL-compatible interface. As distributed applications continue to gain prominence in the cloud era, CockroachDB’s architecture provides a solid foundation for building globally distributed, highly available, and consistent database systems.
Citations:
- https://www.cockroachlabs.com/docs/stable/architecture/overview
- https://www.cockroachlabs.com/roachfest/2022/cockroachdb-architecture/
- https://www.xenonstack.com/insights/what-is-cockroachdb
- https://www.cockroachlabs.com/blog/cockroachdb-enterprise-architecture/
- https://infohub.delltechnologies.com/nl-nl/l/cockroachdb-deployment-on-dell-powerflex-with-kubernetes/cockroachdb-architecture/
- https://dbdb.io/db/cockroachdb
- https://www.cockroachlabs.com/docs/stable/schema-design-overview
- https://www.cockroachlabs.com/docs/stable/architecture/sql-layer
- https://www.cockroachlabs.com/docs/stable/architecture/transaction-layer
- https://www.cockroachlabs.com/docs/stable/transactions
- https://www.cockroachlabs.com/docs/stable/architecture/life-of-a-distributed-transaction
- https://www.cockroachlabs.com/docs/stable/ui-transactions-page
- https://www.cockroachlabs.com/docs/stable/architecture/storage-layer
- https://www.cockroachlabs.com/blog/trust-but-verify-cockroachdb-checks-replication/
- https://www.cockroachlabs.com/glossary/distributed-db/storage-layer/
- https://www.cockroachlabs.com/blog/logical-data-replication/
- https://www.cockroachlabs.com/blog/raft-is-so-fetch/
- https://www.cockroachlabs.com/glossary/distributed-db/raft-consensus-protocol/
- https://www.cockroachlabs.com/docs/stable/architecture/replication-layer
- https://dantheengineer.com/cockroachdb_and_raft/
- https://www.linkedin.com/posts/ashna1jain_cockroachdb-database-consensus-activity-7150876833180176385-2Uty
- https://www.bookstack.cn/read/CockroachDB/8046c6bc96011fb9.md?wd=structure
- https://www.youtube.com/watch?v=cek0uZjmdws
- https://www.cockroachlabs.com/blog/scaling-raft/
- https://www.cockroachlabs.com/blog/what-is-distributed-sql/
- https://www.cockroachlabs.com/blog/local-and-distributed-processing-in-cockroachdb/
- https://github.com/cockroachdb/cockroach
- https://www.youtube.com/watch?v=ujh6eMbql-c
- https://www.cockroachlabs.com/docs/stable/cost-based-optimizer
- https://dev.to/robreid/geo-partitioning-no-column-no-problem-ob4
- https://www.youtube.com/watch?v=ZAzgd4xvV7o
- https://courses.cs.washington.edu/courses/csep590d/22sp/lectures/UW Talk 2022 CockroachDB’s Optimizer.pdf
- https://www.cockroachlabs.com/blog/mvcc-range-tombstones/
- https://www.youtube.com/watch?v=Ctp5WQdbEd4
- https://www.linkedin.com/posts/arpitbhayani_asliengineering-cockroachdb-databaseinternals-activity-7052993626947137536-GzOf
- https://www.cockroachlabs.com/docs/stable/data-resilience
- https://stackbay.org/modules/chapter/learn-cockroachdb/cockroachdb-performance-tuning
- https://www.cockroachlabs.com/docs/stable/disaster-recovery-overview
- https://www.cockroachlabs.com/docs/stable/make-queries-fast
- https://www.cockroachlabs.com/docs/stable/vectorized-execution
- https://www.cockroachlabs.com/blog/how-we-built-a-vectorized-execution-engine/
- https://www.cockroachlabs.com/blog/disk-spilling-vectorized/
- https://www.cockroachlabs.com/blog/vectorized-hash-joiner/
- https://www.cockroachlabs.com/blog/unordered-distinct-vectorized-engine/
- https://www.linkedin.com/posts/cockroach-labs_whats-new-in-cockroachdb-251-cockroach-activity-7298346412428185600-L-fa
- https://dantheengineer.com/cockroachdb-multi-region/
- https://stackoverflow.com/questions/79496603/how-does-cockroachdb-handle-partitions-between-the-leaseholder-and-raft-leader-o
- https://www.cockroachlabs.com/blog/the-new-stack-meet-cockroachdb-the-resilient-sql-database/
- https://github.com/cockroachdb/cockroach/blob/master/docs/design.md
- https://dantheengineer.com/cockroachdb-vs-sql-server/
- https://smazumder05.gitbooks.io/design-and-architecture-of-cockroachdb/content/
- https://airbyte.com/data-engineering-resources/cockroachdb-vs-sql-server
- https://www.cockroachlabs.com/blog/distributed-sql-key-value-store/
- https://www.cockroachlabs.com/blog/sql-in-cockroachdb-mapping-table-data-to-key-value-storage/
- https://www.cockroachlabs.com/blog/pebble-rocksdb-kv-store/
- https://dev.to/achoarnold/key-value-store-built-with-cockroachdb-net-50-and-entity-framework-1kl4
- https://martinfowler.com/articles/patterns-of-distributed-systems/hybrid-clock.html
- https://www.cockroachlabs.com/blog/parallel-commits/
- https://www.martinfowler.com/articles/patterns-of-distributed-systems/hybrid-clock.html
- https://blog.cloudneutral.se/parallel-query-execution-in-cockroachdb
- https://www.youtube.com/watch?v=RgREEOnSKTg
- https://www.cockroachlabs.com/docs/stable/ui-distributed-dashboard
- https://dl.acm.org/doi/pdf/10.1145/3318464.3386134
- https://github.com/cockroachdb/cockroach/blob/master/docs/tech-notes/life_of_a_query.md
- https://www.cockroachlabs.com/blog/get-started-geo-partitioning-data-with-our-command-line-cockroachdb-demo/
- https://www.reddit.com/r/CockroachDB/comments/1hcsd0b/what_is_mvcc_and_why_should_i_care/
- https://faculty.cc.gatech.edu/~jarulraj/courses/4420-f20/slides/cockroachdb.pdf
- https://www.cockroachlabs.com/docs/stable/movr-flask-deployment
- https://github.com/cockroachdb/pebble
All Rights Reserved by MinervaDB®.
Be the first to comment