ToroDB: All your MongoDB data are belong to SQL
Álvaro Hernández <[email protected]>
ToroDB
About $self and 8Kdata
ToroDB
What do Stripe and Buffer have in common?
Use MongoDB for high load OLTP
Successful companies
They are Web Scale
ToroDB
What else do they have in common?
Both ETL to a RDBMS
for Analytics
ToroDB
MongoDB analytics performance: Github dataset
{ "id": "2489368070", "type": "PushEvent", "public": true, "created_at": "2015-01-01T00:00:00Z", "actor": { "id": 9152315, "login": "davidjhulse", "gravatar_id": "", "url": "https://api.github.com/users/davidjhulse", "avatar_url": "https://avatars.githubusercontent.com/u/9152315?" }, "repo": { "id": 28635890, "name": "davidjhulse/davesbingrewardsbot", "url": "https://api.github.com/repos/davidjhulse/davesbingrewardsbot" },
ToroDB
MongoDB analytics performance: Github dataset
"payload": { "push_id": 536740396, "size": 1, "distinct_size": 1, "ref": "refs/heads/master", "head": "a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81", "before": "86ffa724b4d70fce46e760f8cc080f5ec3d7d85f", "commits": [ { "sha": "a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81", "author": { "email": "[email protected]", "name": "davidjhulse" }, "message": "Altered BingBot.jar\n\nFixed issue with multiple account", "distinct": true, "url": "https://api.github.com/repos/..." } ] } }
ToroDB
MongoDB analytics performance: Github dataset
Mill
iseco
nds
0
300.000
600.000
900.000
1.200.000
1.500.000
1.800.000
2.100.000
2.400.000
2.700.000
Github dataset size (GB)
1 GB 10 GB 100 GBMongoDB
db.github.find({ type: 'PushEvent', 'actor.login': 'davidjhulse' });
ToroDB
Why is it so slow?
• For most aggregated queries, the whole collection is scanned for every query.
• For every document, many keys and offsets are parsed and computed to find the possible keys. Worst case: all keys are scanned.
• Different documents with different in the same collection. Mixing apples and oranges!
ToroDB
How MongoDB performs an aggregated query
What if we had all our data in a RDBMS?
ToroDB
How a RDBMS performs an aggregate query
ToroDB
How a RDBMS performs an aggregate query
ToroDB
Read I/O required to answer the query db.githubarchive.aggregate([ { $group: { _id: '$actor.login', events: { $sum: 1 } } }, { $sort: { events: -1 }}, { $limit: 10 } ])
SELECT count(*), login FROM actor GROUP BY login ORDER BY 1 DESC LIMIT 10;
ToroDB
Read I/O required to answer the query (“iotop -o -a”)
Github Archive: top 10 actors (1,4GB dataset)
Disk
Rea
d (M
B)
0
125
250
375
500
MongoDB PostgreSQL
87,93
536,36MongoDB storageSize: 536.37 MB
MongoDB size: 1410.35 MB
Exactly 100% of the storageSize!
ToroDB
What if we use a columnar store?
ToroDB
And if we use a columnar store? (“iotop -o -a”)
Github Archive: top 10 actors (1,4GB dataset)
Disk
Rea
d (M
B)
0
125
250
375
500
MongoDB PostgreSQL PostgreSQL + cstore
6,587,93
536,36compressed, only 1 column read
ToroDB
What about query performance?
Mill
iseco
nds
0
43.750
87.500
131.250
175.000
Github dataset size (GB)
1 GB 10 GB 100 GB
Mill
iseco
nds
0
300.000
600.000
900.000
1.200.000
1.500.000
1.800.000
2.100.000
2.400.000
2.700.000
Github dataset size (GB)
1 GB 10 GB 100 GB
PostgreSQLMongoDB
ToroDB
What if we take this to the extreme
• What if we do an aggregate query for a 1:N relationship which is empty 99% of the time?
• MongoDB will still scan the whole collection
• A RDBMS will only scan an almost empty table
ToroDB
Show me another benchmark! YASP dataset
db.getCollection('player_matches').aggregate([ { $match: { 'item_uses.key': 'quelling_blade' } }, { $group: { '_id': null, avg: { $avg: '$level' } } } ]);
SELECT AVG(matches.level) FROM uses INNER JOIN matches ON uses.did = matches.did WHERE uses.key = 'quelling_blade';
ToroDB
Show me another benchmark! YASP dataset
YASP dataset. Avg level of users using 'quelling_blade' object
exec
utio
n tim
e (m
s)
0K
14K
28K
42K
56K
71K
85K
99K
113K
127K
MongoDB PostgreSQL
63,36
131.678,49
That's more than 2000x faster!
ToroDB
So what about indexes?
• For most aggregated queries, indexes will not be used!
• You cannot plan in advance all the queries that will require an index.
•Worst, but common use case: its the users who craft the queries they want to make.
ToroDB
Show me another benchmark! YASP dataset
YASP dataset. Avg level of users using 'quelling_blade' object
exec
utio
n tim
e (m
s)
0
8
16
24
32
MongoDB (Indexed) PostgreSQL (Indexed)
19,65
30,85
ToroDB
I’m sold! Gimme SQL
• Significant performance improvement!
• No connectors required: native SQL!
• Design DDL, implement ETL
• Not real-time, not HA
Maybe not that fast…
ToroDB
ToroDB: MongoDB to relational
ToroDB
ToroDB for MongoDB Analytics
• No need to design the SQL schema:
• ToroDB automagically does it for you!
• Real-time: insert in MongoDB, automatically shows up in ToroDB. Even if tables need created.
• Native SQL: your data ends in a RDBMS like PostgreSQL
ToroDB
ToroDB works as a secondary on a replica set
ToroD
this is your SQL replica!
ToroDB
ToroDB performance: yet another example :)
Github Archive 10GB dataset
Seco
nds
06
1218
2430
3642
4854
60
MongoDB ToroDB on PostgreSQLToroDB on Greenplum
2,467,901
58,519
ToroDB
ToroDB main characteristics
• Works as a MongoDB secondary node
•No need to run drdl or define schema. Accepts any input document, even with type conflicts
•Query w/ native SQL (PostgreSQL as of today)
•Open source!
ToroDB
ToroDB vs MongoDB BI Connector
ToroDB Mongo BI CLicense Open Source Proprietary
Query Language PostgreSQLPostgreSQL (v1) or reduced SQL set (v2)
Performance 100x faster than MongoDB
100x slower plus big (v1) or smaller (v2) connector overhead
NoSQL to SQL transformation Once (insert time) Many (per query)
Distributed analyticsWith Greenplum or CitusDB No
Columnar store and compression
Yes No
ToroDB
ToroDB main use cases
Native SQL BI Connector
Native SQL BI
Connector
Data Integration Platform: SQL and NoSQL apps in the
same RDBMS
Live MongoDB
to RDBMS migration
Apps: Write data with
Mongo API, query with
SQL!
ToroDB
Last but definitely not least
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ READ ONLY;
Consistent reads!
ToroDB
ToroDB: coming to Github soon!
ToroDB
Rate My Session!
Let’s Talk!
Edificio 4B - Loft 33 Avda. Fuencarral, 44
Campus Empresarial Tribeca 28108 Alcobendas, Madrid (SPAIN)
(+34) 91 867 55 54