MongoDB Join Benchmark
We build a benchmark for joins in MongoDB: we want to compare the performance of a join evaluated inside MongoDB with that of a join evaluated outside of it using a Tatooine plan.
Description of the benchmark
For this benchmark, we will generate collections of JSON documents of the form:
{ _id: "ID", k: "value" }
where the _id is unique and the key k can take two kinds of values: non-joinable and joinable. (The generator below does not set _id; MongoDB assigns one at import time.) The generator has the following parameters:
card
is the number of documents in the collection
rangeNJ
is the number of distinct non-joinable values
rangeJ
is the number of distinct joinable values
ratioNJJ
is the ratio of non-joinable to joinable values
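From these parameters we can estimate the expected size of the join output. The helper below is ours (not part of the benchmark code); it assumes both collections are generated with the same parameters and that a document draws a joinable value with probability 1/ratioNJJ, as the generator does:

```javascript
// Back-of-the-envelope estimate of the expected number of joined pairs,
// assuming both collections share the same card, rangeJ and ratioNJJ.
function expectedJoinPairs(card, rangeJ, ratioNJJ) {
  // each document draws a joinable value with probability ~1/ratioNJJ
  const joinableDocs = card / ratioNJJ
  // joinable documents are spread uniformly over rangeJ distinct values
  const docsPerValue = joinableDocs / rangeJ
  // each value contributes docsPerValue^2 pairs across the two collections
  return rangeJ * docsPerValue * docsPerValue
}

console.log(expectedJoinPairs(100e6, 1000, 100000)) // small-intersection setup
console.log(expectedJoinPairs(100e6, 1000, 10000))  // large-intersection setup
```

This reproduces the two expected counts used below: about 1000 pairs for the small-intersection setup and about 100 000 for the large one.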
The JavaScript generator function (fs is Node's file-system module):

```javascript
const fs = require('fs')

function generateCollection(file, card, rangeNJ, rangeJ, symbolNJ, symbolJ, ratioNJJ) {
    // empty the file
    if (fs.existsSync(file)) fs.truncateSync(file, 0)
    for (let c = 0; c < card; c++) {
        let obj = {}
        // choose between a non-joinable (NJ) and a joinable (J) value
        if (ratioNJJ * Math.random() > 1) {
            // pick one NJ value in its range
            obj.k = symbolNJ + Math.floor(rangeNJ * Math.random())
        } else {
            // pick one J value in its range
            obj.k = symbolJ + Math.floor(rangeJ * Math.random())
        }
        fs.appendFileSync(file, JSON.stringify(obj) + '\n')
    }
}
```
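To sanity-check a generated file, one can count how many documents drew a joinable value and compare the count against card / ratioNJJ. The helper below is ours (not part of the generator); it operates on the file's lines:

```javascript
// Count the documents in a generated file whose key starts with the
// joinable symbol (helper name is ours, not part of the benchmark code).
function countJoinable(jsonLines, symbolJ) {
  return jsonLines
    .filter(line => line.length > 0)        // skip the trailing empty line
    .map(line => JSON.parse(line))
    .filter(doc => doc.k.startsWith(symbolJ))
    .length
}

// e.g.: const lines = require('fs').readFileSync('collectionS.json', 'utf8').split('\n')
// countJoinable(lines, 'J') should be close to card / ratioNJJ
```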
Small Intersection of Values
We generate two collections of 100M documents with a small intersection of joinable values, using the following parameters:
card = 100M
rangeNJ = 1M
rangeJ = 1000
ratioNJJ = 100 000
The number of pairs of joinable documents should be around 1000.
We load the two collections into MongoDB:

```shell
mongoimport --drop -d join -c S --file collectionS.json
mongoimport --drop -d join -c R --file collectionR.json
```
We create an index on k in each collection:

```javascript
db.S.createIndex({k: 1})
db.R.createIndex({k: 1})
```
We query the join of the two collections, which takes 1 h 1 min 30 s:
```
db.S.explain("executionStats").aggregate([
    {$lookup: {from: "R", localField: "k", foreignField: "k", as: "kr"}},
    {$unwind: "$kr"}
])
{
    "stages" : [
        {
            "$cursor" : {
                "query" : { },
                "queryPlanner" : {
                    "plannerVersion" : 1,
                    "namespace" : "join.S",
                    "indexFilterSet" : false,
                    "parsedQuery" : { },
                    "winningPlan" : {
                        "stage" : "COLLSCAN",
                        "direction" : "forward"
                    },
                    "rejectedPlans" : [ ]
                },
                "executionStats" : {
                    "executionSuccess" : true,
                    "nReturned" : 100000000,
                    "executionTimeMillis" : 3691177,
                    "totalKeysExamined" : 0,
                    "totalDocsExamined" : 100000000,
                    "executionStages" : {
                        "stage" : "COLLSCAN",
                        "nReturned" : 100000000,
                        "executionTimeMillisEstimate" : 16530,
                        "works" : 100000002,
                        "advanced" : 100000000,
                        "needTime" : 1,
                        "needYield" : 0,
                        "saveState" : 787888,
                        "restoreState" : 787888,
                        "isEOF" : 1,
                        "invalidates" : 0,
                        "direction" : "forward",
                        "docsExamined" : 100000000
                    }
                }
            }
        },
        {
            "$lookup" : {
                "from" : "R",
                "as" : "kr",
                "localField" : "k",
                "foreignField" : "k",
                "unwinding" : {
                    "preserveNullAndEmptyArrays" : false
                }
            }
        }
    ],
    "ok" : 1
}
```
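For the external side of the comparison, the same join can be evaluated outside MongoDB, e.g. as a hash join over the two inputs. The sketch below is our own illustration, not the actual Tatooine operator: it joins two in-memory arrays on k, whereas a real external plan would stream cursors over S and R:

```javascript
// In-memory hash join on the key `k`, mirroring the output shape of the
// $lookup + $unwind pipeline (one output document per matching pair).
// Sketch only: a real external plan would consume cursors, not arrays.
function hashJoin(docsS, docsR) {
  // build phase: index R by its join key
  const byKey = new Map()
  for (const r of docsR) {
    if (!byKey.has(r.k)) byKey.set(r.k, [])
    byKey.get(r.k).push(r)
  }
  // probe phase: stream S and emit one document per matching pair
  const out = []
  for (const s of docsS) {
    for (const r of byKey.get(s.k) || []) {
      out.push({ ...s, kr: r })
    }
  }
  return out
}
```

The build phase costs one pass over R and the probe phase one pass over S, so the join itself is linear in the two collection sizes plus the output size, at the price of holding R's joinable keys in memory.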
Large Intersection of Values
We generate two collections of 100M documents with a large intersection of joinable values, using the following parameters:
card = 100M
rangeNJ = 1M
rangeJ = 1000
ratioNJJ = 10 000
The number of pairs of joinable documents should be around 100 000.
We run the same join query on these collections; according to executionTimeMillis it takes about 51 min 10 s:

```
db.S.explain("executionStats").aggregate([
    {$lookup: {from: "R", localField: "k", foreignField: "k", as: "kr"}},
    {$unwind: "$kr"}
])
{
    "stages" : [
        {
            "$cursor" : {
                "query" : { },
                "queryPlanner" : {
                    "plannerVersion" : 1,
                    "namespace" : "join.S",
                    "indexFilterSet" : false,
                    "parsedQuery" : { },
                    "winningPlan" : {
                        "stage" : "COLLSCAN",
                        "direction" : "forward"
                    },
                    "rejectedPlans" : [ ]
                },
                "executionStats" : {
                    "executionSuccess" : true,
                    "nReturned" : 100000000,
                    "executionTimeMillis" : 3070369,
                    "totalKeysExamined" : 0,
                    "totalDocsExamined" : 100000000,
                    "executionStages" : {
                        "stage" : "COLLSCAN",
                        "nReturned" : 100000000,
                        "executionTimeMillisEstimate" : 19450,
                        "works" : 100000002,
                        "advanced" : 100000000,
                        "needTime" : 1,
                        "needYield" : 0,
                        "saveState" : 787888,
                        "restoreState" : 787888,
                        "isEOF" : 1,
                        "invalidates" : 0,
                        "direction" : "forward",
                        "docsExamined" : 100000000
                    }
                }
            }
        },
        {
            "$lookup" : {
                "from" : "R",
                "as" : "kr",
                "localField" : "k",
                "foreignField" : "k",
                "unwinding" : {
                    "preserveNullAndEmptyArrays" : false
                }
            }
        }
    ],
    "ok" : 1
}
```
Postgres Comparison
We build a similar generator in Bash, writing tab-separated files:

```shell
function generator_csv () {
    FILE=$1
    CARD=$2
    RANGENJ=$3
    RANGEJ=$4
    SYMBOLNJ=$5
    SYMBOLJ=$6
    RATIONJJ=$7
    # empty the file
    > $FILE
    id=0
    while [ $id -lt $CARD ]
    do
        if [ $((RANDOM % (RATIONJJ + 1))) -gt 0 ]
        then
            # non-joinable value
            echo -e "$id\t$SYMBOLNJ$(( ( RANDOM % RANGENJ ) + 1 ))" >> $FILE
        else
            # joinable value
            echo -e "$id\t$SYMBOLJ$(( ( RANDOM % RANGEJ ) + 1 ))" >> $FILE
        fi
        ((id++))
    done
}

generator_csv "tableR.csv" 100000000 1000000 1000 "A" "J" 100000
generator_csv "tableS.csv" 100000000 1000000 1000 "B" "J" 100000
```
```sql
CREATE TABLE R (id integer PRIMARY KEY, k varchar);
CREATE INDEX ON R (k);
CREATE TABLE S (id integer PRIMARY KEY, k varchar);
CREATE INDEX ON S (k);
```
For the small-intersection tables, Postgres computes the result in 241 809.116 ms (about 4 min), with the following execution plan:
```
mongo_join=# explain select * from s AS S, r AS R where S.k = R.k;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Merge Join  (cost=755377247.21..3937606397.52 rows=302921560607 width=20)
   Merge Cond: ((s.k)::text = (r.k)::text)
   ->  Index Scan using s_k_idx on s  (cost=0.57..5066424.21 rows=99997608 width=10)
   ->  Index Scan using r_k_idx on r  (cost=0.57..5065805.49 rows=99997960 width=10)
(4 rows)

mongo_join=# select count(*) from s AS S, r AS R where S.k = R.k;
 count
-------
  9017
(1 row)
```

Note that the count (9017) is well above the expected ~1000 pairs. Bash's $RANDOM only ranges over 0..32767, so with RATIONJJ = 100000 the test RANDOM % (RATIONJJ + 1) == 0 fires with probability 1/32768 rather than 1/100001; this yields roughly three times more joinable rows per table, which likely explains the larger intersection.