MongoDB Join Benchmark
We build a benchmark for joins in MongoDB: we want to compare the performance of a join evaluated inside MongoDB with that of a join evaluated outside of it using a Tatooine plan.
Description of the benchmark
For this benchmark, we will generate collections of JSON documents of the form:
{ _id: "ID", k: "value" }
where the _id is unique and the key k can take two kinds of values: non-joinable and joinable. (The generator below does not set _id; MongoDB assigns one at import time.) The generator has the following parameters:
card
is the number of documents in the collection
rangeNJ
is the number of distinct non-joinable values
rangeJ
is the number of distinct joinable values
ratioNJJ
is the ratio of non-joinable to joinable values
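From these parameters we can estimate the expected size of the join output. The helper below is ours (not part of the benchmark code); it assumes both collections are generated with the same parameters and that a document draws a joinable value with probability 1/ratioNJJ, as the generator does:

```javascript
// Back-of-the-envelope estimate of the expected number of joined pairs,
// assuming both collections share the same card, rangeJ and ratioNJJ.
function expectedJoinPairs(card, rangeJ, ratioNJJ) {
  // each document draws a joinable value with probability ~1/ratioNJJ
  const joinableDocs = card / ratioNJJ
  // joinable documents are spread uniformly over rangeJ distinct values
  const docsPerValue = joinableDocs / rangeJ
  // each value contributes docsPerValue^2 pairs across the two collections
  return rangeJ * docsPerValue * docsPerValue
}

console.log(expectedJoinPairs(100e6, 1000, 100000)) // small-intersection setup
console.log(expectedJoinPairs(100e6, 1000, 10000))  // large-intersection setup
```

This reproduces the two expected counts used below: about 1000 pairs for the small-intersection setup and about 100 000 for the large one.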
The JavaScript generator function (fs is Node's file-system module):

```javascript
const fs = require('fs')

function generateCollection(file, card, rangeNJ, rangeJ, symbolNJ, symbolJ, ratioNJJ) {
    // empty the file
    if (fs.existsSync(file)) fs.truncateSync(file, 0)
    for (let c = 0; c < card; c++) {
        let obj = {}
        // choose between a non-joinable (NJ) and a joinable (J) value
        if (ratioNJJ * Math.random() > 1) {
            // pick one NJ value in its range
            obj.k = symbolNJ + Math.floor(rangeNJ * Math.random())
        } else {
            // pick one J value in its range
            obj.k = symbolJ + Math.floor(rangeJ * Math.random())
        }
        fs.appendFileSync(file, JSON.stringify(obj) + '\n')
    }
}
```
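To sanity-check a generated file, one can count how many documents drew a joinable value and compare the count against card / ratioNJJ. The helper below is ours (not part of the generator); it operates on the file's lines:

```javascript
// Count the documents in a generated file whose key starts with the
// joinable symbol (helper name is ours, not part of the benchmark code).
function countJoinable(jsonLines, symbolJ) {
  return jsonLines
    .filter(line => line.length > 0)        // skip the trailing empty line
    .map(line => JSON.parse(line))
    .filter(doc => doc.k.startsWith(symbolJ))
    .length
}

// e.g.: const lines = require('fs').readFileSync('collectionS.json', 'utf8').split('\n')
// countJoinable(lines, 'J') should be close to card / ratioNJJ
```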
Small Intersection of Values
We generate two collections of 100M documents with a small intersection of joinable values, using the following parameters:
card = 100M
rangeNJ = 1M
rangeJ = 1000
ratioNJJ = 100 000
The number of pairs of joinable documents should be around 1000.
We load the two collections into MongoDB:

```shell
mongoimport --drop -d join -c S --file collectionS.json
mongoimport --drop -d join -c R --file collectionR.json
```
We create an index on k in each collection:

```javascript
db.S.createIndex({k: 1})
db.R.createIndex({k: 1})
```
We query the join of the two collections, which takes 1 h 1 min 30 s:
```
db.S.explain("executionStats").aggregate([
    {$lookup: {from: "R", localField: "k", foreignField: "k", as: "kr"}},
    {$unwind: "$kr"}
])
{
    "stages" : [
        {
            "$cursor" : {
                "query" : { },
                "queryPlanner" : {
                    "plannerVersion" : 1,
                    "namespace" : "join.S",
                    "indexFilterSet" : false,
                    "parsedQuery" : { },
                    "winningPlan" : {
                        "stage" : "COLLSCAN",
                        "direction" : "forward"
                    },
                    "rejectedPlans" : [ ]
                },
                "executionStats" : {
                    "executionSuccess" : true,
                    "nReturned" : 100000000,
                    "executionTimeMillis" : 3691177,
                    "totalKeysExamined" : 0,
                    "totalDocsExamined" : 100000000,
                    "executionStages" : {
                        "stage" : "COLLSCAN",
                        "nReturned" : 100000000,
                        "executionTimeMillisEstimate" : 16530,
                        "works" : 100000002,
                        "advanced" : 100000000,
                        "needTime" : 1,
                        "needYield" : 0,
                        "saveState" : 787888,
                        "restoreState" : 787888,
                        "isEOF" : 1,
                        "invalidates" : 0,
                        "direction" : "forward",
                        "docsExamined" : 100000000
                    }
                }
            }
        },
        {
            "$lookup" : {
                "from" : "R",
                "as" : "kr",
                "localField" : "k",
                "foreignField" : "k",
                "unwinding" : {
                    "preserveNullAndEmptyArrays" : false
                }
            }
        }
    ],
    "ok" : 1
}
```
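For the external side of the comparison, the same join can be evaluated outside MongoDB, e.g. as a hash join over the two inputs. The sketch below is our own illustration, not the actual Tatooine operator: it joins two in-memory arrays on k, whereas a real external plan would stream cursors over S and R:

```javascript
// In-memory hash join on the key `k`, mirroring the output shape of the
// $lookup + $unwind pipeline (one output document per matching pair).
// Sketch only: a real external plan would consume cursors, not arrays.
function hashJoin(docsS, docsR) {
  // build phase: index R by its join key
  const byKey = new Map()
  for (const r of docsR) {
    if (!byKey.has(r.k)) byKey.set(r.k, [])
    byKey.get(r.k).push(r)
  }
  // probe phase: stream S and emit one document per matching pair
  const out = []
  for (const s of docsS) {
    for (const r of byKey.get(s.k) || []) {
      out.push({ ...s, kr: r })
    }
  }
  return out
}
```

The build phase costs one pass over R and the probe phase one pass over S, so the join itself is linear in the two collection sizes plus the output size, at the price of holding R's joinable keys in memory.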
Large Intersection of Values
We generate two collections of 100M documents with a large intersection of joinable values, using the following parameters:
card = 100M
rangeNJ = 1M
rangeJ = 1000
ratioNJJ = 10 000
The number of pairs of joinable documents should be around 100 000.
We run the same join query on these collections; according to executionTimeMillis it takes about 51 min 10 s:

```
db.S.explain("executionStats").aggregate([
    {$lookup: {from: "R", localField: "k", foreignField: "k", as: "kr"}},
    {$unwind: "$kr"}
])
{
    "stages" : [
        {
            "$cursor" : {
                "query" : { },
                "queryPlanner" : {
                    "plannerVersion" : 1,
                    "namespace" : "join.S",
                    "indexFilterSet" : false,
                    "parsedQuery" : { },
                    "winningPlan" : {
                        "stage" : "COLLSCAN",
                        "direction" : "forward"
                    },
                    "rejectedPlans" : [ ]
                },
                "executionStats" : {
                    "executionSuccess" : true,
                    "nReturned" : 100000000,
                    "executionTimeMillis" : 3070369,
                    "totalKeysExamined" : 0,
                    "totalDocsExamined" : 100000000,
                    "executionStages" : {
                        "stage" : "COLLSCAN",
                        "nReturned" : 100000000,
                        "executionTimeMillisEstimate" : 19450,
                        "works" : 100000002,
                        "advanced" : 100000000,
                        "needTime" : 1,
                        "needYield" : 0,
                        "saveState" : 787888,
                        "restoreState" : 787888,
                        "isEOF" : 1,
                        "invalidates" : 0,
                        "direction" : "forward",
                        "docsExamined" : 100000000
                    }
                }
            }
        },
        {
            "$lookup" : {
                "from" : "R",
                "as" : "kr",
                "localField" : "k",
                "foreignField" : "k",
                "unwinding" : {
                    "preserveNullAndEmptyArrays" : false
                }
            }
        }
    ],
    "ok" : 1
}
```
Postgres Comparison
We build a similar generator in Bash, writing tab-separated files:

```shell
function generator_csv () {
    FILE=$1
    CARD=$2
    RANGENJ=$3
    RANGEJ=$4
    SYMBOLNJ=$5
    SYMBOLJ=$6
    RATIONJJ=$7
    # empty the file
    > $FILE
    id=0
    while [ $id -lt $CARD ]
    do
        if [ $((RANDOM % (RATIONJJ + 1))) -gt 0 ]
        then
            # non-joinable value
            echo -e "$id\t$SYMBOLNJ$(( ( RANDOM % RANGENJ ) + 1 ))" >> $FILE
        else
            # joinable value
            echo -e "$id\t$SYMBOLJ$(( ( RANDOM % RANGEJ ) + 1 ))" >> $FILE
        fi
        ((id++))
    done
}

generator_csv "tableR.csv" 100000000 1000000 1000 "A" "J" 100000
generator_csv "tableS.csv" 100000000 1000000 1000 "B" "J" 100000
```
```sql
CREATE TABLE R (id integer PRIMARY KEY, k varchar);
CREATE INDEX ON R (k);
CREATE TABLE S (id integer PRIMARY KEY, k varchar);
CREATE INDEX ON S (k);
```
For the small-intersection tables, Postgres computes the result in 241 809.116 ms (about 4 min), with the following execution plan:
```
mongo_join=# explain select * from s AS S, r AS R where S.k = R.k;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Merge Join  (cost=755377247.21..3937606397.52 rows=302921560607 width=20)
   Merge Cond: ((s.k)::text = (r.k)::text)
   ->  Index Scan using s_k_idx on s  (cost=0.57..5066424.21 rows=99997608 width=10)
   ->  Index Scan using r_k_idx on r  (cost=0.57..5065805.49 rows=99997960 width=10)
(4 rows)

mongo_join=# select count(*) from s AS S, r AS R where S.k = R.k;
 count
-------
  9017
(1 row)
```

Note that the count (9017) is well above the expected ~1000 pairs. Bash's $RANDOM only ranges over 0..32767, so with RATIONJJ = 100000 the test RANDOM % (RATIONJJ + 1) == 0 fires with probability 1/32768 rather than 1/100001; this yields roughly three times more joinable rows per table, which likely explains the larger intersection.