MongoDB Europe 2016 - Building WiredTiger

Jan 07, 2017

Page 1: MongoDB Europe 2016 - Building WiredTiger

And now for something completely different...

Page 2: MongoDB Europe 2016 - Building WiredTiger

WiredTiger: Fast data structures in C

Page 3: MongoDB Europe 2016 - Building WiredTiger

Keith Bostic, MongoDB WiredTiger team

Page 4: MongoDB Europe 2016 - Building WiredTiger

#MDBE16

You are here: database layers

Middleware

Networking

Query APIs

Storage Engine

Page 5: MongoDB Europe 2016 - Building WiredTiger

Storage engines are performance critical

Middleware

Networking

Query APIs

mmapV1 Storage Engine

RocksDB Storage Engine

WiredTiger Storage Engine

ACID transactional guarantees

Page 6: MongoDB Europe 2016 - Building WiredTiger

WiredTiger

• From (some of) the folks that brought you Berkeley DB
• High performance data engine
    • scalable throughput with low latency
• MongoDB’s default storage engine
    • a general-purpose workhorse

Page 7: MongoDB Europe 2016 - Building WiredTiger

Next

➢ Hardware (is the problem)
• Hazard pointers
• Skiplists
• Ticket locks

Page 8: MongoDB Europe 2016 - Building WiredTiger

Modern servers have many CPUs/cores

[Diagram: cores 1, 2, 3, ... N]

Page 9: MongoDB Europe 2016 - Building WiredTiger

Each core has multiple memory caches

[Diagram: cores 1, 2, 3, ... N, each with two or more caches]

Page 10: MongoDB Europe 2016 - Building WiredTiger

Cache coherence: cores “snoop” on writes

[Diagram: cores 1, 2, 3, ... N, each with two or more caches, in front of Main Memory]

Page 11: MongoDB Europe 2016 - Building WiredTiger

Traditional data engines struggle with this architecture

• Writing “shared” memory is slow
    • but databases exist to manage shared access to data!
• Snoopy cache-coherence scales poorly

Page 12: MongoDB Europe 2016 - Building WiredTiger

Programmers solve with locking

• Locks are complex objects
    • get exclusive access to the lock state
    • review and update the lock state
    • “publish” (ensure every CPU sees the changes)
    • release exclusive access

Page 13: MongoDB Europe 2016 - Building WiredTiger

Locking is slow

• Every operation requires exclusive access
    • even shared (“read”) locks require a lock/unlock cycle
    • thread stall is inevitable
• Locks require notification of every CPU
• Locks require exclusive access to the memory bus

Page 14: MongoDB Europe 2016 - Building WiredTiger

Locking is expensive

• A lock per object is too much memory
    • POSIX locks are cache-aligned, up to 128B
    • grouping objects under locks makes contention worse
• More complexity to make locks “fair” and avoid starvation
    • add thread queues
    • wake up the next thread waiting for the lock

Page 15: MongoDB Europe 2016 - Building WiredTiger

We need to find something else

If we can’t use locks, what do we use instead?

Today we’re going to talk about ways to get rid of locks.

Page 16: MongoDB Europe 2016 - Building WiredTiger

WiredTiger is written in C

• Java or C++ are better choices for system programming
    • automatic memory management vs. malloc/free
    • exception handling vs. explicit error paths
    • widespread availability of reusable components
• Giving up programmer productivity

Page 17: MongoDB Europe 2016 - Building WiredTiger

C is “portable assembler”

• Marshal typed values to/from unaligned memory
    • streaming compression, encryption, checksums
    • unstructured I/O to/from stable storage
• Light-weight access to shared data
    • use the underlying machine primitives that make up locks
    • algorithms/structures based on those primitives
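
As a small illustration of marshalling typed values to and from unaligned memory, here is a minimal sketch. The helper names `put_u32` and `get_u32` are hypothetical and are not taken from WiredTiger; the sketch copies bytes in native byte order and makes no attempt at a portable on-disk format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value at any byte address, aligned or not. */
static void put_u32(uint8_t *p, uint32_t v)
{
    assert(p != NULL);
    memcpy(p, &v, sizeof(v));   /* byte copy: no alignment requirement */
}

/* Load a 32-bit value from any byte address, aligned or not. */
static uint32_t get_u32(const uint8_t *p)
{
    uint32_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof(v));
    return v;
}
```

Casting the byte pointer to `uint32_t *` and dereferencing it would be undefined behavior at an unaligned address; compilers typically turn the `memcpy` into a single load or store where the hardware allows it.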

Page 18: MongoDB Europe 2016 - Building WiredTiger

You may have seen this last year:

Page 19: MongoDB Europe 2016 - Building WiredTiger

Next

• Hardware
➢ Hazard pointers
• Skiplists
• Ticket locks

Page 20: MongoDB Europe 2016 - Building WiredTiger

Pages in the WiredTiger cache

[Diagram: the cache holding page 2, page 6, page 8, page 9, ...]

Lots and lots (and lots) of pages
MongoDB worker threads read from disk
WiredTiger server threads evict to disk

Page 21: MongoDB Europe 2016 - Building WiredTiger

A reasonable page-locking implementation

• MongoDB worker threads read, modify pages
• WiredTiger server threads evict pages from the cache
• Allocate a lock per page
    • MongoDB worker threads share pages
    • WiredTiger eviction threads require exclusive access

Page 22: MongoDB Europe 2016 - Building WiredTiger

Page locking in the WiredTiger cache

[Diagram: pages 2, 6, 8 and 9, each with its own lock; reader, writer and eviction threads contend for the locks]

• thread stall on read locks!
• vulnerable to starvation
• too much memory

Page 23: MongoDB Europe 2016 - Building WiredTiger

Introducing memory barriers

• Memory barriers
    • order reads, writes or both across a line of code
    • compiler won’t cache values or reorder across a barrier
• Locks imply memory barriers

Page 24: MongoDB Europe 2016 - Building WiredTiger

Something faster

• Hazard pointers: a technique for avoiding locks
• MongoDB worker threads
    • “log” intention to access a page
    • publish: a memory barrier to ensure global CPU visibility
• Write to a per-thread memory location
    • write won’t collide with other worker threads

Page 25: MongoDB Europe 2016 - Building WiredTiger

What about eviction starvation?

• Add a per-page “blocker”
    • MongoDB worker won’t proceed if the page is blocked
• Cheap:
    • it’s only a bit of information
    • a read-only operation for workers

Page 26: MongoDB Europe 2016 - Building WiredTiger

Worker threads

• Publish intent to access the page
    • Memory barrier to ensure global CPU visibility
• If the page is not blocked, it’s accessible
• Clear intent to access when done
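
The worker-side steps can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not WiredTiger's code: `struct page`, `struct hazard_slot`, `page_acquire` and `page_release` are hypothetical names, each worker is assumed to own exactly one slot, and the default sequentially consistent C11 atomics stand in for the explicit memory barrier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct page {
    atomic_bool blocked;            /* set by eviction to stop new access */
    int data;
};

/* One slot per worker thread; only its owner ever writes it. */
struct hazard_slot {
    _Atomic(struct page *) page;
};

/* Try to get access to a page; returns false if eviction has blocked it. */
static bool page_acquire(struct hazard_slot *slot, struct page *p)
{
    assert(slot != NULL && p != NULL);

    /* Publish intent; the seq_cst store doubles as the barrier. */
    atomic_store(&slot->page, p);

    /* Read-only check of the per-page blocker. */
    if (atomic_load(&p->blocked)) {
        atomic_store(&slot->page, NULL);    /* back off: page is blocked */
        return false;
    }
    return true;                            /* usable until page_release */
}

/* Clear intent to access when done. */
static void page_release(struct hazard_slot *slot)
{
    atomic_store(&slot->page, NULL);
}
```

The ordering matters: the intent has to be globally visible before the blocker is read, otherwise a worker and the eviction thread could each miss the other's write.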

Page 27: MongoDB Europe 2016 - Building WiredTiger

Hazard pointers for workers

[Diagram: pages 2, 6, 8 and 9, each with a blocking flag; the reader and writer threads each keep a per-thread list of the pages they are using (entries for pages 2, 6 and 9)]

Page 28: MongoDB Europe 2016 - Building WiredTiger

Eviction server

• Block future worker thread access
    • Memory barrier to ensure global CPU visibility
• Review worker thread access intentions
    • can either wait or quit
• Unblock worker thread access when done
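
The eviction-side steps can be sketched in the same spirit. Again this is an assumption-laden illustration rather than WiredTiger's implementation: `MAX_WORKERS`, the global `hazard` array, `evict_try_lock` and `evict_unlock` are hypothetical, and this version always quits instead of waiting when a worker is using the page.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORKERS 4               /* hypothetical fixed thread count */

struct page {
    atomic_bool blocked;            /* the per-page "blocker" */
};

/* hazard[i] is the page worker i has announced it is using, or NULL. */
static _Atomic(struct page *) hazard[MAX_WORKERS];

/* Try to get exclusive access; on success the page stays blocked. */
static bool evict_try_lock(struct page *p)
{
    assert(p != NULL);

    /* Block future worker access; the seq_cst store is the barrier. */
    atomic_store(&p->blocked, true);

    /* Review worker intentions: any match means the page is in use. */
    for (size_t i = 0; i < MAX_WORKERS; i++)
        if (atomic_load(&hazard[i]) == p) {
            atomic_store(&p->blocked, false);   /* quit, retry later */
            return false;
        }
    return true;
}

/* Unblock worker thread access when done. */
static void evict_unlock(struct page *p)
{
    atomic_store(&p->blocked, false);
}
```

All of the scanning cost lands on the eviction thread; a worker only ever writes its own slot and reads one flag.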

Page 29: MongoDB Europe 2016 - Building WiredTiger

Hazard pointers for workers and eviction

[Diagram: the same pages, blocking flags and per-thread lists for the reader and writer, now with the eviction thread added]

Page 30: MongoDB Europe 2016 - Building WiredTiger

Something faster: hazard pointers

Replaces two lock/unlock pairs for each page access ... with a single memory barrier instruction.

• Transfers work to the eviction server
    • MongoDB worker latency is what we care about
• Memory costs
    • per-worker-thread list
    • per-page blocking flag

Page 31: MongoDB Europe 2016 - Building WiredTiger

Next

• Hardware
• Hazard pointers
➢ Skiplists
• Ticket locks

Page 32: MongoDB Europe 2016 - Building WiredTiger

Introducing atomic instructions

• Atomic increment or decrement
    • read a value
    • change it and store it back without the possibility of racing
• Based on compare-and-swap (CAS) instruction
    • read value
    • update value if the value is unchanged
    • but fail if the value has changed

Page 33: MongoDB Europe 2016 - Building WiredTiger

Atomic prepend to singly-linked list

Update head if (and only if) head’s value is unchanged

[Diagram: the NEW node being linked in front of the node that head points to]

new.next = head
compare_and_swap(head, new.next, new)
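
The two lines of pseudocode can be written out with C11 atomics roughly as follows. This is a sketch: `struct node` and `list_prepend` are hypothetical names, and the retry loop is an addition for the case where another thread changes head between the read and the compare-and-swap.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/* Atomically link n in front of the list whose head pointer is *head. */
static void list_prepend(_Atomic(struct node *) *head, struct node *n)
{
    assert(head != NULL && n != NULL);

    n->next = atomic_load(head);            /* new.next = head */

    /*
     * Swing head to n only if head still equals n->next. On failure
     * n->next is refreshed with the current head and we try again.
     */
    while (!atomic_compare_exchange_weak(head, &n->next, n))
        ;
}
```

The new node is fully initialized before the compare-and-swap makes it reachable, so a concurrent reader never sees a half-built node.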

Page 34: MongoDB Europe 2016 - Building WiredTiger

How WiredTiger uses skiplists

• WiredTiger pages start with a disk image
    • a compact representation we don’t want to modify
• Inserts and updates for the disk image are stored in skiplists

Page 35: MongoDB Europe 2016 - Building WiredTiger

Skiplists start with a linked list

Singly-linked list with sorted values: 7, 10, 13, 18, 21, 24

[Diagram: 7 → 10 → 13 → 18 → 21 → 24]

Page 36: MongoDB Europe 2016 - Building WiredTiger

Skiplists: add additional linked lists

Each higher level “skips” over more of the list

[Diagram: three stacked linked lists]
Level 3: 7 → 21
Level 2: 7 → 13 → 21 → 24
Level 1: 7 → 10 → 13 → 18 → 21 → 24

Page 37: MongoDB Europe 2016 - Building WiredTiger

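Search for 18

search starts at the top-level

[Diagram: the same three-level skiplist (level 3: 7, 21; level 2: 7, 13, 21, 24; level 1: 7, 10, 13, 18, 21, 24), with the search path for 18 highlighted step by step]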
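
A skiplist search along these lines can be sketched as below. The layout is an assumption made for illustration (`MAX_LEVEL`, `struct node` and `skip_search` are hypothetical names, and a fixed array of three links per node is used for simplicity); it is not WiredTiger's search code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEVEL 3                 /* levels in the example above */

struct node {
    int key;
    struct node *next[MAX_LEVEL];   /* next[i] is the link on level i+1 */
};

/* head[i] points at the first node on level i+1. */
static struct node *skip_search(struct node *head[MAX_LEVEL], int key)
{
    struct node **links = head;     /* the links we are walking from */

    assert(head != NULL);
    for (int i = MAX_LEVEL - 1; i >= 0; ) {
        struct node *n = links[i];

        if (n != NULL && n->key == key)
            return n;               /* found */
        if (n != NULL && n->key < key)
            links = n->next;        /* move right on this level */
        else
            i--;                    /* would overshoot: drop down a level */
    }
    return NULL;                    /* not present */
}
```

With the list from the diagram, a search for 18 moves to 7 on level 3, sees that 21 would overshoot, drops to level 2 and moves to 13, sees 21 again, then drops to level 1 and finds 18.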

Page 43: MongoDB Europe 2016 - Building WiredTiger

Skiplists, the great

Replaces a lock/unlock pair over the entire skiplist with one atomic memory instruction per object level

• Insert without locking
• Search without locking, while inserting
• Forward & backward traversal without locking, while inserting
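
One way to insert with a single compare-and-swap per level is sketched below. It is a simplified illustration, not WiredTiger's insert path: the names are hypothetical, the caller supplies the node's height, keys are assumed to be unique, and nodes are never removed. Levels are linked bottom-up, so a node is reachable on a level only after every level below it is in place.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_LEVEL 3

struct node {
    int key;
    _Atomic(struct node *) next[MAX_LEVEL];
};

struct skiplist {
    _Atomic(struct node *) head[MAX_LEVEL];
};

/* Record, per level, the link that would have to point at a new key. */
static void find_links(struct skiplist *s, int key,
    _Atomic(struct node *) *link[MAX_LEVEL], struct node *succ[MAX_LEVEL])
{
    _Atomic(struct node *) *links = s->head;

    for (int i = MAX_LEVEL - 1; i >= 0; ) {
        struct node *n = atomic_load(&links[i]);

        if (n != NULL && n->key < key)
            links = n->next;            /* move right */
        else {
            link[i] = &links[i];        /* where we dropped down */
            succ[i] = n;
            i--;
        }
    }
}

/* Insert n on levels 1..height without taking a lock. */
static void skip_insert(struct skiplist *s, struct node *n, int height)
{
    _Atomic(struct node *) *link[MAX_LEVEL];
    struct node *succ[MAX_LEVEL];

    assert(height >= 1 && height <= MAX_LEVEL);
    find_links(s, n->key, link, succ);
    for (int i = 0; i < height; i++)
        for (;;) {
            atomic_store(&n->next[i], succ[i]);
            /* One CAS per level: link n in only if nothing changed. */
            if (atomic_compare_exchange_strong(link[i], &succ[i], n))
                break;
            find_links(s, n->key, link, succ);  /* raced: search again */
        }
}
```

A search that runs concurrently sees either the old link or the new one on each level, and both lead to a correctly sorted list.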

Page 44: MongoDB Europe 2016 - Building WiredTiger

Skiplists, the good

• Simpler code than a Btree
    • WiredTiger binary search ~200 lines of code
    • a typical skiplist search < 20
• Fast search
    • a Btree guarantees search in logarithmic time
    • skiplists don’t offer a guarantee, but are usually close

Page 45: MongoDB Europe 2016 - Building WiredTiger

Skiplists, the not-so-good

• Cache-unfriendly
    • every indirection a CPU cache miss
• Memory-unfriendly
    • needs more memory for a data set than a Btree
• Removal requires locking
    • WiredTiger is an MVCC engine (multiple values per key)
    • removal less important to WiredTiger

Page 46: MongoDB Europe 2016 - Building WiredTiger

Next

• Hardware
• Hazard pointers
• Skiplists
➢ Ticket locks

Page 47: MongoDB Europe 2016 - Building WiredTiger

Ticket locks

• WiredTiger still needs to lock objects
    • but we can make locks faster
• Ticket locks
    • customers take a unique ticket number
    • customers served in ticket order

Page 48: MongoDB Europe 2016 - Building WiredTiger

Ticket locks

[Diagram: a “Please Take a Number” dispenser and a “Now Serving” sign, with customers holding tickets 39 through 43]

Page 49: MongoDB Europe 2016 - Building WiredTiger

Ticket locks

• Two incrementing counters:
    ticket: the next available ticket number
    serving: the ticket number now being served
• Thread takes a ticket number
• Thread increments “next available”
• Thread waits for “serving” to match its ticket number
• When thread finishes, increments “serving”

Page 50: MongoDB Europe 2016 - Building WiredTiger

Ticket locks serialize threads

[Diagram: Thread A and Thread B hold consecutive tickets (39 and 40) and are served one at a time as “Now Serving” advances]

Page 51: MongoDB Europe 2016 - Building WiredTiger

Ticket locks are almost what we need

• Ticket locks avoid starvation and are “fair”
• Smaller memory footprint
• Can be made significantly faster than POSIX locks
    • remember our compare-and-swap instructions!
• But POSIX locks are shared between readers

Page 52: MongoDB Europe 2016 - Building WiredTiger

Ticket locks: shared vs. exclusive

• Three incrementing counters:
    ticket: the next available ticket number
    readers: the next reader to be served
    writers: the next writer to be served

Page 53: MongoDB Europe 2016 - Building WiredTiger

Readers run in parallel

[Diagram: Threads A, B and C take tickets 39 through 41; the “Readers” and “Writers” counters advance so that readers holding consecutive tickets run at the same time]

Page 54: MongoDB Europe 2016 - Building WiredTiger

Multiple variable updates without locking

• Updating multiple counters would require locking
    ... but we can write the bus width atomically
• Encode the entire lock state in a single 8B value

struct lock {
    uint16_t readers;
    uint16_t writers;
    uint16_t ticket;    // 64K simultaneous threads
    uint16_t unused;
};
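
A shared/exclusive ticket lock held in one 8-byte word might look like the sketch below. Several things here are assumptions rather than facts from the slides: the bit layout, the helper names, the rule that a reader admits the next reader on entry while every unlock advances `writers`, and the use of portable compare-and-swap loops where a tuned implementation would use cheaper single instructions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Bit offsets of the 16-bit counters in the lock word (assumed layout). */
enum { READERS = 0, WRITERS = 16, TICKET = 32 };

typedef _Atomic uint64_t rwticket_t;

static uint16_t field(uint64_t w, int shift)
{
    return (uint16_t)(w >> shift);
}

/* Return w with one counter incremented, wrapping at 64K. */
static uint64_t inc(uint64_t w, int shift)
{
    uint64_t f = (uint16_t)((w >> shift) + 1);

    return (w & ~((uint64_t)0xffff << shift)) | (f << shift);
}

/* Atomically bump one or two counters (b < 0: skip); returns old word. */
static uint64_t bump(rwticket_t *l, int a, int b)
{
    uint64_t old = atomic_load(l), next;

    do {
        next = inc(old, a);
        if (b >= 0)
            next = inc(next, b);
    } while (!atomic_compare_exchange_weak(l, &old, next));
    return old;
}

static void read_lock(rwticket_t *l)
{
    uint16_t mine = field(bump(l, TICKET, -1), TICKET);

    while (field(atomic_load(l), READERS) != mine)
        ;                           /* spin until it is our turn */
    bump(l, READERS, -1);           /* let the next reader in as well */
}

static void read_unlock(rwticket_t *l)
{
    bump(l, WRITERS, -1);           /* one fewer reader ahead of a writer */
}

static void write_lock(rwticket_t *l)
{
    uint16_t mine = field(bump(l, TICKET, -1), TICKET);

    while (field(atomic_load(l), WRITERS) != mine)
        ;                           /* wait for everyone ahead to finish */
}

static void write_unlock(rwticket_t *l)
{
    bump(l, READERS, WRITERS);      /* both counters, one 8-byte update */
}
```

Readers holding consecutive tickets pass straight through, while a writer waits until every earlier ticket holder has unlocked. Note that `write_unlock` changes two counters in a single update, which is the point of packing the state into one word.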

Page 55: MongoDB Europe 2016 - Building WiredTiger

Ticket locks

Replaces two higher-level lock/unlock calls

... with two atomic instructions.

Page 56: MongoDB Europe 2016 - Building WiredTiger

That’s a (very) fast introduction...

• Hazard pointers
• Skiplists
• Ticket locks

Open Source implementations are available in WiredTiger, including Public Domain ticket locks.

Page 57: MongoDB Europe 2016 - Building WiredTiger

WiredTiger distribution

• Standalone application database toolkit library
    • key-value store (NoSQL)
    • row-store, column-store and LSM engines
    • schema layer includes data types and indexes
• Another MongoDB Open Source contribution
    • WiredTiger available for other applications
    • https://github.com/wiredtiger

Page 58: MongoDB Europe 2016 - Building WiredTiger

Thank you!

Keith Bostic
