ToroDB: All your MongoDB data are belong to SQL

Álvaro Hernández

ToroDB

About $self and 8Kdata

What do Stripe and Buffer have in common?

• Both use MongoDB for high-load OLTP

• Both are successful companies

• Both are Web Scale

What else do they have in common?

Both ETL to an RDBMS for analytics

MongoDB analytics performance: Github dataset

{
  "id": "2489368070",
  "type": "PushEvent",
  "public": true,
  "created_at": "2015-01-01T00:00:00Z",
  "actor": {
    "id": 9152315,
    "login": "davidjhulse",
    "gravatar_id": "",
    "url": "https://api.github.com/users/davidjhulse",
    "avatar_url": "https://avatars.githubusercontent.com/u/9152315?"
  },
  "repo": {
    "id": 28635890,
    "name": "davidjhulse/davesbingrewardsbot",
    "url": "https://api.github.com/repos/davidjhulse/davesbingrewardsbot"
  },
  "payload": {
    "push_id": 536740396,
    "size": 1,
    "distinct_size": 1,
    "ref": "refs/heads/master",
    "head": "a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81",
    "before": "86ffa724b4d70fce46e760f8cc080f5ec3d7d85f",
    "commits": [
      {
        "sha": "a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81",
        "author": { "email": "[email protected]", "name": "davidjhulse" },
        "message": "Altered BingBot.jar\n\nFixed issue with multiple account",
        "distinct": true,
        "url": "https://api.github.com/repos/..."
      }
    ]
  }
}

[Figure: MongoDB query time in milliseconds (y-axis 0 to 2,700,000) against Github dataset size (1 GB, 10 GB, 100 GB).]

db.github.find({ type: 'PushEvent', 'actor.login': 'davidjhulse' });

Why is it so slow?

• For most aggregate queries, the whole collection is scanned for every query.

• For every document, many keys and offsets have to be parsed and computed to locate the requested keys. Worst case: all keys are scanned.

• Documents with different structures live in the same collection. Mixing apples and oranges!
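The points above can be illustrated with a small in-memory sketch. It does not reproduce MongoDB's storage engine; `scanDocuments`, `scanColumn` and the sample events are hypothetical names and data, used only to count how many values each strategy has to touch to answer "events per actor login".

```javascript
// Hypothetical sample data: documents of different shapes in one collection.
const events = [
  { id: 1, type: 'PushEvent', actor: { id: 10, login: 'alice' }, payload: { size: 1 } },
  { id: 2, type: 'WatchEvent', actor: { id: 11, login: 'bob' } },
  { id: 3, type: 'PushEvent', actor: { id: 10, login: 'alice' }, payload: { size: 3 } },
];

// Document-style scan: every document is visited, and its top-level keys are
// walked one by one until the requested key is found.
function scanDocuments(docs, outerKey, innerKey) {
  const counts = new Map();
  let keysVisited = 0;
  for (const doc of docs) {
    for (const key of Object.keys(doc)) {
      keysVisited += 1;
      if (key !== outerKey) continue;
      const value = doc[key][innerKey];
      counts.set(value, (counts.get(value) || 0) + 1);
      break;
    }
  }
  return { counts, keysVisited };
}

// Column-style scan: the requested values were stored together up front,
// so only that one array is read.
function scanColumn(column) {
  const counts = new Map();
  for (const value of column) counts.set(value, (counts.get(value) || 0) + 1);
  return { counts, valuesRead: column.length };
}

const byDocument = scanDocuments(events, 'actor', 'login');
const loginColumn = events.map((e) => e.actor.login); // built once, at insert time
const byColumn = scanColumn(loginColumn);
console.log(byDocument.keysVisited, byColumn.valuesRead);
```

Both strategies produce the same counts; the difference is only in how much data has to be walked to get there, and the gap widens as documents carry more keys.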

How MongoDB performs an aggregate query

What if we had all our data in an RDBMS?

How an RDBMS performs an aggregate query

Read I/O required to answer the query

db.githubarchive.aggregate([
  { $group: { _id: '$actor.login', events: { $sum: 1 } } },
  { $sort: { events: -1 } },
  { $limit: 10 }
])

SELECT count(*), login FROM actor GROUP BY login ORDER BY 1 DESC LIMIT 10;
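To make explicit what both queries above compute, here is a minimal in-memory sketch of the same group, sort and limit steps. The function name `topActors` and the sample events are hypothetical; neither engine executes the query this way.

```javascript
// In-memory equivalent of the $group / $sort / $limit pipeline
// (and of the GROUP BY ... ORDER BY ... LIMIT statement).
function topActors(events, limit) {
  const counts = new Map();
  for (const event of events) {
    const login = event.actor.login;                 // group by '$actor.login'
    counts.set(login, (counts.get(login) || 0) + 1); // events: { $sum: 1 }
  }
  return [...counts.entries()]
    .map(([login, count]) => ({ _id: login, events: count }))
    .sort((a, b) => b.events - a.events)             // $sort: { events: -1 }
    .slice(0, limit);                                // $limit
}

// Hypothetical sample events, not taken from the Github Archive.
const sample = [
  { actor: { login: 'alice' } },
  { actor: { login: 'bob' } },
  { actor: { login: 'alice' } },
  { actor: { login: 'carol' } },
  { actor: { login: 'alice' } },
  { actor: { login: 'bob' } },
];
const top2 = topActors(sample, 2);
console.log(top2);
```

Only one field per event is needed to answer the question, which is why the amount of data read from disk is the interesting metric here.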

Read I/O required to answer the query (“iotop -o -a”)

Github Archive: top 10 actors (1.4 GB dataset), disk read in MB:

• MongoDB: 536.36

• PostgreSQL: 87.93

MongoDB storageSize: 536.37 MB

MongoDB size: 1410.35 MB

Exactly 100% of the storageSize!

What if we use a columnar store?

And if we use a columnar store? (“iotop -o -a”)

Github Archive: top 10 actors (1.4 GB dataset), disk read in MB:

• MongoDB: 536.36

• PostgreSQL: 87.93

• PostgreSQL + cstore: 6.5 (compressed, only 1 column read)
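As a quick check of the disk-read figures, the sketch below works out the ratios. The numbers are the chart labels from the slides (with the cstore bar read as 6.5 MB); the helper names `percentOfStorage` and `timesLessThanMongo` are hypothetical.

```javascript
// Disk-read figures (MB) as labelled on the charts.
const readMB = { mongodb: 536.36, postgresql: 87.93, cstore: 6.5 };
const storageSizeMB = 536.37; // MongoDB storageSize from the slides

// Data read, as a percentage of MongoDB's storageSize (one decimal place).
function percentOfStorage(mb) {
  return Math.round((mb / storageSizeMB) * 1000) / 10;
}

// How many times less data was read compared with MongoDB (one decimal place).
function timesLessThanMongo(mb) {
  return Math.round((readMB.mongodb / mb) * 10) / 10;
}

console.log(percentOfStorage(readMB.mongodb));      // 100
console.log(percentOfStorage(readMB.postgresql));   // 16.4
console.log(percentOfStorage(readMB.cstore));       // 1.2
console.log(timesLessThanMongo(readMB.postgresql)); // 6.1
console.log(timesLessThanMongo(readMB.cstore));     // 82.5
```

So MongoDB reads its whole collection, PostgreSQL reads roughly a sixth of that volume, and the columnar store a little over 1% of it.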

What about query performance?

[Figure: two panels of query time in milliseconds against Github dataset size (1 GB, 10 GB, 100 GB). PostgreSQL panel: y-axis 0 to 175,000. MongoDB panel: y-axis 0 to 2,700,000.]

What if we take this to the extreme?

• What if we run an aggregate query over a 1:N relationship that is empty 99% of the time?

• MongoDB will still scan the whole collection

• An RDBMS will only scan an almost empty table
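A minimal sketch of this extreme case, under made-up data: 1,000 documents, of which 1% carry an optional `commits` array. The `did` column and the `commitsTable` name are hypothetical; the point is only how many items each layout has to inspect to count the commits.

```javascript
// Hypothetical collection: 1,000 documents, only 1% have the 1:N field.
const docs = Array.from({ length: 1000 }, (_, i) => ({
  did: i,
  ...(i % 100 === 0 ? { commits: [{ sha: `sha-${i}` }] } : {}),
}));

// Document layout: every document is inspected, whether or not it has commits.
let docsInspected = 0;
let commitsFromDocs = 0;
for (const doc of docs) {
  docsInspected += 1;
  commitsFromDocs += (doc.commits || []).length;
}

// Relational layout: the 1:N side lives in its own table, which only holds
// rows for the documents that actually contain the field.
const commitsTable = docs.flatMap((doc) =>
  (doc.commits || []).map((commit) => ({ did: doc.did, sha: commit.sha }))
);
const rowsInspected = commitsTable.length; // a COUNT(*) touches only these rows

console.log(docsInspected, rowsInspected);
```

Both layouts find the same 10 commits, but one inspects 1,000 documents and the other 10 rows.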

Show me another benchmark! YASP dataset

db.getCollection('player_matches').aggregate([ { $match: { 'item_uses.key': 'quelling_blade' } }, { $group: { '_id': null, avg: { $avg: '$level' } } } ]);

SELECT AVG(matches.level) FROM uses INNER JOIN matches ON uses.did = matches.did WHERE uses.key = 'quelling_blade';

YASP dataset. Avg level of users using the 'quelling_blade' object, execution time in ms:

• MongoDB: 131,678.49

• PostgreSQL: 63.36

That's more than 2000x faster!
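The claim follows directly from the two execution times on the chart, as this short check shows (variable names are illustrative only):

```javascript
// Execution times in milliseconds, as labelled on the chart.
const mongoMs = 131678.49;
const postgresMs = 63.36;

// Speedup of PostgreSQL over MongoDB for this query.
const speedup = mongoMs / postgresMs;
console.log(Math.floor(speedup)); // 2078
```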

So what about indexes?

• For most aggregated queries, indexes will not be used!

• You cannot plan in advance all the queries that will require an index.

• Worst, but common, use case: it's the users who craft the queries they want to run.

[Figure: YASP dataset. Avg level of users using the 'quelling_blade' object, execution time in ms (y-axis 0 to 32). Bars: MongoDB (Indexed) and PostgreSQL (Indexed). Bar labels: 19.65 and 30.85.]

I’m sold! Gimme SQL

• Significant performance improvement!

• No connectors required: native SQL!

Maybe not that fast…

• You have to design the DDL and implement the ETL

• Not real-time, not HA

ToroDB: MongoDB to relational

ToroDB for MongoDB Analytics

• No need to design the SQL schema: ToroDB automagically does it for you!

• Real-time: insert in MongoDB and the data automatically shows up in ToroDB, even if new tables need to be created.

• Native SQL: your data ends up in an RDBMS like PostgreSQL
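To give a feel for what a document-to-relational mapping can look like, here is a rough sketch. It is an assumption for illustration only, not ToroDB's actual algorithm: the function `toRows`, the `root_*` table names and the use of `did` as the link back to the document are all hypothetical.

```javascript
// Split one nested document into rows grouped by a hypothetical table name.
// Scalars become columns of the current table; each nested object or array
// element becomes a row in a child table that points back via `did`.
function toRows(doc, did, table = 'root', out = {}) {
  const row = { did };
  for (const [key, value] of Object.entries(doc)) {
    const childTable = `${table}_${key}`;
    const children = Array.isArray(value) ? value : [value];
    for (const child of children) {
      if (child !== null && typeof child === 'object') {
        toRows(child, did, childTable, out);   // nested object: child table
      } else if (Array.isArray(value)) {
        (out[childTable] = out[childTable] || []).push({ did, value: child });
      } else {
        row[key] = child;                      // scalar: plain column
      }
    }
  }
  (out[table] = out[table] || []).push(row);
  return out;
}

// Trimmed-down, made-up event in the shape of the Github documents.
const tables = toRows(
  {
    type: 'PushEvent',
    actor: { login: 'davidjhulse' },
    payload: { size: 1, commits: [{ sha: 'abc' }, { sha: 'def' }] },
  },
  1
);
console.log(Object.keys(tables).sort());
```

With a layout along these lines, a query over actor logins only needs to read the `root_actor` rows, and documents of new shapes simply produce new tables or columns.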

ToroDB works as a secondary on a replica set

ToroDB: this is your SQL replica!

ToroDB performance: yet another example :)

Github Archive 10 GB dataset, query time in seconds:

• MongoDB: 58.519

• ToroDB on PostgreSQL: 7.901

• ToroDB on Greenplum: 2.46

ToroDB main characteristics

• Works as a MongoDB secondary node

• No need to run drdl or define a schema. Accepts any input document, even with type conflicts

• Query with native SQL (PostgreSQL as of today)

• Open source!

ToroDB vs MongoDB BI Connector

| | ToroDB | Mongo BI Connector |
|---|---|---|
| License | Open Source | Proprietary |
| Query language | PostgreSQL | PostgreSQL (v1) or reduced SQL set (v2) |
| Performance | 100x faster than MongoDB | 100x slower plus big (v1) or smaller (v2) connector overhead |
| NoSQL to SQL transformation | Once (insert time) | Many (per query) |
| Distributed analytics | With Greenplum or CitusDB | No |
| Columnar store and compression | Yes | No |

ToroDB main use cases

• Native SQL BI Connector

• Data Integration Platform: SQL and NoSQL apps in the same RDBMS

• Live MongoDB to RDBMS migration

• Apps: write data with the Mongo API, query with SQL!

Last but definitely not least

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ READ ONLY;

Consistent reads!

ToroDB: coming to Github soon!

Let’s Talk!

Campus Empresarial Tribeca, Edificio 4B - Loft 33

Avda. Fuencarral, 44

28108 Alcobendas, Madrid (SPAIN)

(+34) 91 867 55 54
