How Voron works: Insight into the new RavenDB storage engine

Post on 11-Aug-2014

24533 Views

Category:

Data & Analytics

11 Downloads

Preview:

Click to see full reader

DESCRIPTION

In this Level 400 talk, we go deep into how Voron is implemented, including all the gory details of creating a high performance transnational storage.

Transcript

Level 400: Diving into VoronOren Eini

ayende@ayende.com ayende.com/blogHibernating Rhinos

Voron is… Low level key / value store Transactional / ACID MVCC Multi layers

WHY!?

background LevelDB LMDB Esent

Seeks are slow

0.01 ms – Compress 1kb with Zippy 0.25 ms – Read 1 MB from memory 0.50 ms – Ping inside data center 10.0 ms – Disk seek 10.0 ms – Read 1 MB from network 30.0 ms – Read 1 MB from disk

Binary Trees, Eh?

FB

A

DC

E

GH

I

B+ Trees

Implementation 4KB Pages B+ Tree Page translation table MVCC Journal file Scratch file Memory mapped

Modifying the tree Find appropriate #to modify. Get a scratch page, copy #to scratch page. Register scratch #with the old ## in #translation table

(PTT). Modify the #as you wish. On commit, the PTT becomes publicly visible. All changed pages are written to journal file. If rollback, revert to previous PTT, release scratch pages,

done.

#3#0

#4#1

#5#2

Dat

asc

ratc

h

#6#3#0

#7#4#1

#8#5#2

Tx #1#0 >- #3#1 >- #1

Tx #2#0 >- #3#1 >- #5

Background Find pages in scratch that have no one looking at

older versions of them. Copy to data file. Clear the scratch space.

How it works Only I/O during commits is a single write through,

compressed, of data to journal. Moving data to data file is done in async. No need to call fsync().

Full & incremental backups.

Missing the forest Voron isn’t a B+ Tree system. It doesn’t have a tree, it has trees. Plural. <blink>Important</blink>

Falling trees Single root tree Contain many additional trees. Tree is similar to a table. Operations on tree:

Add(key, value) Del(key, value) Find(key) : value Iterate() (Seek,Next, Prev)

How it works?

With indexes

Finding stuff

* Not the most efficient method

So, Voron has trees… Root tree Free Space tree Contains references to named trees

Enough?

Tree of trees MultiAdd, MultiDelete, MultiRead

Why multi trees? Optimization – if has just 1 item (and no value) can

directly use the parent tree store. Store multiple items for a single value.

Iterating multi trees

What voron does? Opens up a lot of interesting scenarios. We have far better control over persistence now. Very low level (bits & bytes). Very fast! Concurrency benefits:

Reads Writes*

* Yet Voron allows only a single writer!

What it does not? It isn’t about Linux. It can’t run on Linux*.

Need to implment: PosixPureMemoryPager PosixPageFileBackedMemoryMappedPager PosixMemoryMapPager

Waiting for big Linux push post 3.0 release.

The cloud story… Scratch / temp usage Utilize fast local drives that can go away. Slow I/O only hold us for tx commit (and we optimized

that).

Summary Voron learned from LevelDB, LMDB, Esent. Journal for Atomicity, Consistency & Durability. MVCC for Consistency & Isolation. Root tree, named tress, multi trees.

Questions?

top related