Comparison of JOINS: MongoDB vs. PostgreSQL
(Cross-post from Dr. Michael Stonebraker and Álvaro Hernández)
Introduction
In this blog post we review the JOIN capabilities of both MongoDB and Postgres. The conclusion is that MongoDB joins are very brittle (when things change, application programs must be extensively recoded) and that MongoDB often offers very poor performance relative to Postgres. We begin with MongoDB's support for joins, then cover the corresponding capabilities in Postgres. We then show why MongoDB joins are brittle, and finally compare join performance in the two systems.
Throughout this blog post, we will use the well-known example of employees and departments with the relational schema in Table 1. Here, employees have an ename, an age, a salary and are in a single department. In turn, departments have a dname, a floor and a budget. Note that there is a department (Candy) with no employees.
employee

| ename | age | salary | dname |
|-------|-----|--------|-------|
| Bill  | 36  | 10000  | Shoe  |
| Sam   | 27  | 15000  | Toy   |
| Fred  | 29  | 12000  | Shoe  |

department

| dname | floor | budget |
|-------|-------|--------|
| Shoe  | 1     | 1200   |
| Toy   | 2     | 1400   |
| Candy | 1     | 900    |

Table 1: Example Database
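For readers who want to follow along, here is a minimal sketch of the corresponding Postgres DDL. The column types and constraint details are our own assumptions, not taken from the source code linked later in this post; the employee join column is written as department to match the queries below.

```sql
-- Illustrative DDL for the Table 1 schema; names and types are assumptions.
CREATE TABLE department (
    dname  text PRIMARY KEY,
    floor  int,
    budget int
);

CREATE TABLE employee (
    ename      text PRIMARY KEY,
    age        int,
    salary     int,
    department text REFERENCES department (dname)  -- foreign key into department
);
```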
At some later time, management may decide that Bill can split his time between multiple departments. In this case the schema in Table 1 is no longer valid, and the data must be altered to that in Table 2. Notice that we must add a table called works_in with a dedication_pct field to indicate Bill’s time split between multiple departments.
employee

| ename | age | salary |
|-------|-----|--------|
| Bill  | 36  | 10000  |
| Sam   | 27  | 15000  |
| Fred  | 29  | 12000  |

department

| dname | floor | budget |
|-------|-------|--------|
| Shoe  | 1     | 1200   |
| Toy   | 2     | 1400   |
| Candy | 1     | 900    |

works_in

| ename | dname | dedication_pct |
|-------|-------|----------------|
| Bill  | Toy   | 60             |
| Bill  | Shoe  | 40             |
| Sam   | Toy   | 100            |
| Fred  | Shoe  | 100            |

Table 2: Revised Example Database
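Sketched in the same illustrative DDL (again, our own names and types, not from the original source), the schema change amounts to dropping the employee-to-department column and adding the new join table:

```sql
-- Illustrative migration to the Table 2 schema.
ALTER TABLE employee DROP COLUMN department;

CREATE TABLE works_in (
    ename          text REFERENCES employee (ename),
    dname          text REFERENCES department (dname),
    dedication_pct int,    -- percentage of the employee's time spent in this department
    PRIMARY KEY (ename, dname)
);
```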
JOINS in MongoDB
In MongoDB, there are two main ways to express a relationship, namely “embedded” and “reference.” Using the embedded approach, an Employee document is expressed as:
```json
{
  "_id": "1",
  "ename": "Bill",
  "age": 36,
  "salary": 10000,
  "department": {
    "dname": "Shoe",
    "floor": 1,
    "budget": 1200
  }
}
```
In other words, the department information is stored embedded in each employee document; this is essentially an "inline" representation. For document applications this representation may make some sense, but for structured data it has two major drawbacks.

First, department information is repeated for each employee in the department. Since Bill and Fred are both in the Shoe department, the Shoe information is replicated. When Shoe information is updated, say the budget is adjusted, all copies must be found and correctly updated; if even one replica is missed, the result is an inconsistent (corrupted) database. Worse, this multi-record update is either non-atomic (the default in MongoDB) or requires MongoDB's 4.0+ multi-document transactions, which have several limitations and incur a performance hit. In the first case a corrupted database is possible; in the second case performance suffers. Either situation is problematic.
Second, there is no place to put Candy information, because there is no employee in this department right now. Because of these two limitations, we will not consider this representation in this blog post.
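To make the first drawback concrete, here is a rough sketch in mongo shell syntax (the new budget value is made up) of what adjusting the Shoe budget would require under the embedded representation; every employee document that embeds Shoe must be rewritten:

```javascript
// Hypothetical budget change: every embedded copy of the Shoe department
// must be found and updated, not just a single department record.
db.employee.updateMany(
  { "department.dname": "Shoe" },
  { $set: { "department.budget": 1300 } }
);
```

Unless this runs inside a multi-document transaction, a failure partway through leaves some employees with the old budget and some with the new one.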
The alternative representation is “reference.” In this case, Bill’s record would be stored as:
```json
{
  "_id": "1",
  "ename": "Bill",
  "age": 36,
  "salary": 10000,
  "department": "1001"
}
```
and there would be a separate department document to store the attributes of departments:
```json
{
  "_id": "1001",
  "dname": "Shoe",
  "floor": 1,
  "budget": 1200
}
```
Effectively, this looks a lot like the relational representation in Table 1. However, the programmer must know that the department field refers to a document in the department collection. In other words, there is no notion of foreign keys to assist the programmer in specifying the join.
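Nothing in MongoDB enforces that relationship. As a quick illustration (mongo shell syntax, with hypothetical employee data), an insert that references a department id that does not exist is accepted without complaint, whereas the foreign key in the Postgres DDL sketch above would reject it:

```javascript
// Hypothetical insert: "9999" matches no document in the department
// collection, but MongoDB stores the dangling reference anyway.
db.employee.insertOne(
  { "_id": "2", "ename": "Alice", "age": 30, "salary": 11000, "department": "9999" }
);
```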
Now suppose management decides that employees can split their time between multiple departments. There is no standard way to represent this situation, but typically an array of references would be used. In order to record the information about the dedication_pct in a given department, we could transform the department field into an array of documents, which contain both the reference to the department collection and the dedication_pct, as noted below for the employee Bill:
Collection department:
```json
{
  "_id": "1001",
  "dname": "Shoe",
  "floor": 1,
  "budget": 1200
},
{
  "_id": "1002",
  "dname": "Toy",
  "floor": 2,
  "budget": 1400
}
```
Collection employee:
```json
{
  "_id": "1",
  "ename": "Bill",
  "age": 36,
  "salary": 10000,
  "departments": [
    { "dept": "1002", "dedication_pct": 60 },
    { "dept": "1001", "dedication_pct": 40 }
  ]
}
```
When one wants to obtain joined ("related") information, she may either perform the join manually in the application (which is error-prone and puts the burden on the developer) or use the $lookup aggregation stage (which has its own limitations, for example not being able to $lookup from a sharded collection, and not supporting right or full outer joins).
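For illustration, here is a minimal sketch of the manual approach in mongo shell syntax (the loop structure and variable names are ours, not from the source code linked later): the application walks the employees and issues one department lookup per reference to compute per-department salary totals.

```javascript
// Hypothetical application-side join: total salary per department,
// with one findOne() round trip per employee/department reference.
var totals = {};
db.employee.find().forEach(function (emp) {
  emp.departments.forEach(function (d) {
    var dept = db.department.findOne({ _id: d.dept });
    var share = emp.salary * d.dedication_pct * 0.01;
    totals[dept.dname] = (totals[dept.dname] || 0) + share;
  });
});
printjson(totals);
```

Besides the extra code, this issues one query per employee-department pair, which is exactly the per-document lookup pattern whose cost shows up in the performance comparison later in this post.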
JOINS in Postgres
In theory, one can use an embedded representation in Postgres, and the Postgres jsonb datatype allows this. However, we have very rarely seen this in practice because of the drawbacks noted above. Instead, one typically uses the representation in Table 1, which corresponds to the “reference” case in MongoDB.
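For completeness, here is a minimal sketch of what the embedded style could look like in Postgres using jsonb (the table and column names are purely illustrative):

```sql
-- Illustrative embedded representation in Postgres; rarely used in practice.
CREATE TABLE employee_doc (
    doc jsonb
);

INSERT INTO employee_doc (doc) VALUES (
  '{"ename": "Bill", "age": 36, "salary": 10000,
    "department": {"dname": "Shoe", "floor": 1, "budget": 1200}}'
);

-- Embedded fields are reachable with the jsonb operators.
SELECT doc ->> 'ename' AS ename,
       doc #>> '{department,dname}' AS dname
FROM employee_doc;
```

It suffers from the same duplication and update problems as the MongoDB embedded representation, which is why we do not pursue it further.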
Using standard SQL, one can find a list of all departments and the total salary of their employees, following the data model of Table 1, as shown below (click here for the full source code for the Postgres data and examples):
```sql
select dname, sum(salary)
from employee as e
inner join department as d on e.department = d.dname
group by dname;
```

```
┌───────┬───────┐
│ dname │  sum  │
├───────┼───────┤
│ Shoe  │ 22000 │
│ Toy   │ 15000 │
└───────┴───────┘
```
When employees can work in multiple departments, one would typically use the representation in Table 2. The query above can be readily converted to:
```sql
select dname, sum(salary * dedication_pct * 0.01)
from employee
inner join works_in using (ename)
group by dname;
```

```
┌───────┬──────────┐
│ dname │   sum    │
├───────┼──────────┤
│ Shoe  │ 16000.00 │
│ Toy   │ 21000.00 │
└───────┴──────────┘
```
Notice that to move from the first query to the second, one merely needs to multiply by dedication_pct in the aggregate and join employee to works_in instead of department, changing the join attribute accordingly.
JOINS are Brittle in MongoDB
Now reconsider the case where the semantics of the data follow Table 1, i.e. the join between employees and departments is 1:N. To construct a document containing the total salary for each department, the MongoDB code is as follows (see full source code here):
```javascript
db.employee.aggregate([
  {
    $lookup: {
      from: "department",
      localField: "department",
      foreignField: "_id",
      as: "dept"
    }
  },
  { $unwind: "$dept" },
  {
    $group: {
      "_id": "$dept.dname",
      "totalsalary": { "$sum": "$salary" }
    }
  }
]);
```
Result:
```
{ "_id" : "Shoe", "totalsalary" : 22000 }
{ "_id" : "Toy", "totalsalary" : 15000 }
```
Notice that this code is a great deal more complex than the Postgres code, because MongoDB lacks relational join notions and its aggregation framework is a lower-level language than SQL. It also requires the programmer to algorithmically construct a query plan for the join, in this case a combination of $lookup, $unwind and $group to extract the desired information. More ominously, when we move to the semantics of Table 2 (adding the dedication_pct field so that employees can be in multiple departments), the MongoDB join code must be significantly rewritten, adding two more aggregation stages to "unwind" the implicit works_in relationship:
```javascript
db.employee.aggregate([
  { $unwind: "$departments" },
  {
    $project: {
      "_id": 0,
      "salary": 1,
      "department": "$departments.dept",
      "dedication_pct": "$departments.dedication_pct"
    }
  },
  {
    $lookup: {
      from: "department",
      localField: "department",
      foreignField: "_id",
      as: "dept"
    }
  },
  { $unwind: "$dept" },
  {
    $group: {
      _id: "$dept.dname",
      totalsalary: { $sum: { $multiply: [ "$salary", "$dedication_pct", 0.01 ] } }
    }
  }
]);
```

Result:

```
{ "_id" : "Shoe", "totalsalary" : 16000 }
{ "_id" : "Toy", "totalsalary" : 21000 }
```
Hence, on a change to the semantics of the join from 1:N to M:N, the application (or applications) must be significantly rewritten. In contrast, in Postgres, the queries remain nearly unaltered and simple.
All previous results are, however, a bit misleading, because the Candy department has no employees and does not appear in the join. Let’s assume that the user actually wants to see all three departments with their total salaries. In other words, she wishes to see the Candy department with a total salary of zero in addition to the other two departments. In Postgres, this requires a simple change to the query, namely adding a right outer join to the department table:
```sql
select dname, sum(coalesce(salary * dedication_pct * 0.01, 0))
from employee
inner join works_in using (ename)
right outer join department using (dname)
group by dname;
```

```
┌───────┬──────────┐
│ dname │   sum    │
├───────┼──────────┤
│ Shoe  │ 16000.00 │
│ Toy   │ 21000.00 │
│ Candy │        0 │
└───────┴──────────┘
```
However, MongoDB has no support for right outer joins, so you would need to add the "0" for Candy manually in your application. Of course, this puts yet another burden on the developer and is cumbersome and error-prone.
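As a rough sketch of that workaround in mongo shell syntax (the variable names are ours), the application has to fetch every department up front and fill in zero totals for any department the aggregation did not return:

```javascript
// Hypothetical workaround for the missing right outer join:
// start every department at zero, then overwrite with the aggregated totals.
var totals = {};
db.department.find().forEach(function (d) {
  totals[d.dname] = 0;
});

db.employee.aggregate([
  { $unwind: "$departments" },
  { $lookup: { from: "department", localField: "departments.dept",
               foreignField: "_id", as: "dept" } },
  { $unwind: "$dept" },
  { $group: { _id: "$dept.dname",
              totalsalary: { $sum: { $multiply: [ "$salary", "$departments.dedication_pct", 0.01 ] } } } }
]).forEach(function (row) {
  totals[row._id] = row.totalsalary;
});

printjson(totals);   // Candy now appears with a total of 0
```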
It is well known (and presented in every DBMS textbook) that RDBMSs provide superior data independence relative to lower-level solutions like MongoDB. Because databases last a very long time, and their semantics can be expected to change from time to time, superior data independence is a very desirable feature; it is present in Postgres but not in MongoDB. As such, MongoDB joins are a brittle solution.
Performance Comparison of PostgreSQL vs. MongoDB
In this section, we report on the performance of the two queries in the previous section, namely to find the total salary of each department, with or without the departments with no employees.
All of the source code used in this section appears here; the interested reader will find a loader program that generates synthetic data in Postgres and then uses Postgres JSON functions to export the data in a format suitable for importing into MongoDB. We then run the two queries over both data sets and compare execution times.
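As a sketch of the export step (the file name and psql invocation are ours; the real loader is in the linked source code), Postgres can emit one JSON document per row, which mongoimport can then consume:

```sql
-- Illustrative export from psql: one JSON document per line for mongoimport.
-- Assumes simple values; text COPY escaping may need extra care for real data.
\copy (select row_to_json(e) from employee e) to 'employee.json'
```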
The benchmark was run on AWS, using an EC2 i3.xlarge instance (4 cores, 32 GB RAM), on a local NVMe disk formatted with XFS. Basic tuning was performed on the Postgres instance, and MongoDB production best practices were followed. The benchmark used 4,000 departments and 20M employees, with a given employee working in between one and three departments. Data size was 6.1 GB in Postgres and 1.6 GB in MongoDB (using default compression). The total execution times are shown in the tables below.
| (less is better) | Postgres (s) | Mongo (s) | MongoDB / Postgres |
|------------------|--------------|-----------|--------------------|
| 1:N case         | 9.97         | 1,162.91  | 129.61             |

Table 3: 1:N Performance Comparison between MongoDB and PostgreSQL
MongoDB is 130 times slower than Postgres because its only available join tactic is to iterate over the employees, performing for each one a lookup in the department collection. In contrast, Postgres can use this tactic (called iterative substitution) as well as merge join and hash join, and the Postgres query optimizer will pick the strategy it expects to be best. MongoDB is constrained to a single strategy; whenever (as is almost always the case) this single strategy is inferior, poor performance will result.
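To see which strategy Postgres actually chose, one can ask the planner directly. Here is a sketch using the 1:N query from above; the plan you get will depend on the data, indexes and configuration:

```sql
-- Show the join strategy the optimizer picked (for example, a hash join).
explain (analyze, costs off)
select dname, sum(salary)
from employee as e
inner join department as d on e.department = d.dname
group by dname;
```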
Changing the MongoDB query execution strategy involves either restructuring the database (and thereby rewriting all of the queries) or implementing a query optimizer in the application (which is a huge amount of work).
Things are also bad in the M:N case. Again, MongoDB has a single strategy, hard-coded into the application. In contrast, Postgres can pick from all of the available options. Table 4 shows the resulting performance comparison.
| (less is better) | Postgres (s) | Mongo (s) | MongoDB / Postgres |
|------------------|--------------|-----------|--------------------|
| M:N case         | 54.09        | 2,706.19  | 50.03              |

Table 4: M:N Performance Comparison between MongoDB and Postgres
In summary, relative to Postgres, joins in MongoDB have two major problems:
- They are brittle: if anything changes over the lifetime of the database, MongoDB requires significant recoding, while Postgres requires more modest changes.
- They are slow: MongoDB has no query optimizer, so the single execution strategy is hard-coded into the application. Whenever a merge join or hash join is the better choice, MongoDB performance will suffer.