Faster persistent data structures through hashing
Johan Tibell ([email protected])
2011-02-15


This talk was given at Galois.
Transcript
Page 1: Faster persistent data structures through hashing

Johan Tibell ([email protected])
2011-02-15

Page 2: Motivating problem: Twitter data analysis

“I’m computing a communication graph from Twitter data and then scan it daily to allocate social capital to nodes behaving in a good karmic manner. The graph is culled from 100 million tweets and has about 3 million nodes.”

We need a data structure that is

- fast when used with string keys, and
- doesn’t use too much memory.

Page 3: Persistent maps in Haskell

- Data.Map is the most commonly used map type.
- It’s implemented using size balanced trees.
- Keys can be of any type, as long as values of the type can be ordered.
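
For instance, the Ord requirement shows up directly in the API; a minimal usage sketch (the map contents here are made up):

import qualified Data.Map as Map

-- insert :: Ord k => k -> a -> Map k a -> Map k a
counts :: Map.Map String Int
counts = Map.insert "nodes" 3 (Map.fromList [("tweets", 100)])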

Page 4: Real world performance of Data.Map

- Good in theory: no more than O(log n) comparisons.
- Not great in practice: up to O(log n) comparisons!
- Many common types are expensive to compare, e.g. String, ByteString, and Text.
- Given a string of length k, we need O(k * log n) comparisons to look up an entry.

Page 5: Hash tables

- Hash tables perform well with string keys: O(k) amortized lookup time for strings of length k.
- However, we want persistent maps, not mutable hash tables.

Page 6: Milan Straka’s idea: Patricia trees as sparse arrays

- We can use hashing without using hash tables!
- A Patricia tree implements a persistent, sparse array.
- Patricia trees are twice as fast as size balanced trees, but only work with Int keys.
- Use hashing to derive an Int from an arbitrary key.

Page 7: Implementation tricks

- Patricia trees implement a sparse, persistent array of size 2^32 (or 2^64).
- Hashing using this many buckets makes collisions rare: for 2^24 entries we expect about 32,000 single collisions (see the arithmetic below).
- Linked lists are a perfectly adequate collision resolution strategy.
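
The 32,000 figure is just the birthday bound: with $m = 2^{32}$ buckets and $n = 2^{24}$ keys, the expected number of colliding pairs is

\[
\binom{n}{2} \cdot \frac{1}{m} \approx \frac{n^2}{2m} = \frac{2^{48}}{2^{33}} = 2^{15} = 32{,}768.
\]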

Page 8: First attempt at an implementation

-- Defined in the containers package.
data IntMap a
    = Nil
    | Tip {-# UNPACK #-} !Key a
    | Bin {-# UNPACK #-} !Prefix
          {-# UNPACK #-} !Mask
          !(IntMap a) !(IntMap a)

type Prefix = Int
type Mask   = Int
type Key    = Int

newtype HashMap k v = HashMap (IntMap [(k, v)])
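
A minimal sketch of how this wrapper behaves, assuming the Hashable class from the hashable package (the primed names are illustrative):

import qualified Data.IntMap as IM
import Data.Hashable (Hashable, hash)

newtype HashMap k v = HashMap (IM.IntMap [(k, v)])

-- Hash the key to find the bucket, then scan the collision list.
lookup' :: (Eq k, Hashable k) => k -> HashMap k v -> Maybe v
lookup' k (HashMap m) = IM.lookup (hash k) m >>= lookup k

-- Prepend the new pair, dropping any previous binding for the key.
insert' :: (Eq k, Hashable k) => k -> v -> HashMap k v -> HashMap k v
insert' k v (HashMap m) = HashMap $
    IM.insertWith (\new old -> new ++ filter ((/= k) . fst) old)
                  (hash k) [(k, v)] m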

Page 9: A more memory efficient implementation

data HashMap k v
    = Nil
    | Tip {-# UNPACK #-} !Hash
          {-# UNPACK #-} !(FL.FullList k v)
    | Bin {-# UNPACK #-} !Prefix
          {-# UNPACK #-} !Mask
          !(HashMap k v) !(HashMap k v)

type Prefix = Int
type Mask   = Int
type Hash   = Int

data FullList k v = FL !k !v !(List k v)
data List k v = Nil | Cons !k !v !(List k v)

Page 10: Reducing the memory footprint

- List k v uses 2 fewer words per key/value pair than [(k, v)].
- FullList can be unpacked into the Tip constructor as it’s a product type, saving 2 more words.
- Always unpack word sized types, like Int, unless you really need them to be lazy.
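
A tiny illustration of the last point (the constructor names are made up):

-- The unpacked Int is stored inline in the constructor (one word for
-- the field); the plain version stores a pointer to a separately
-- allocated, possibly unevaluated I# box.
data Unpacked = Unpacked {-# UNPACK #-} !Int
data Boxed    = Boxed Int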

Page 11: Benchmarks

Keys: 2^12 random 8-byte ByteStrings

        Map   HashMap
insert  1.00  0.43
lookup  1.00  0.28

delete performs like insert.

Page 12: Can we do better?

- We still need to perform O(min(W, log n)) Int comparisons, where W is the number of bits in a word.
- The memory overhead per key/value pair is still quite high.

Page 13: Borrowing from our neighbours

- Clojure uses a hash-array mapped trie (HAMT) data structure to implement persistent maps.
- Described in the paper “Ideal Hash Trees” by Bagwell (2001).
- Originally a mutable data structure implemented in C++.
- Clojure’s persistent version was created by Rich Hickey.

Page 14: Hash-array mapped tries in Clojure

- Shallow tree with high branching factor.
- Each node, except the leaf nodes, contains an array of up to 32 elements.
- 5 bits of the hash are used to index the array at each level.
- A clever trick, using bit population count, is used to represent sparse arrays (sketched below).
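
A minimal sketch of that trick, assuming a 64-bit bitmap (the names mask and sparseIndex are illustrative):

import Data.Bits (bit, popCount, shiftR, (.&.))
import Data.Word (Word64)

-- The bitmap bit corresponding to the hash's 5-bit fragment at
-- shift s.
mask :: Word64 -> Int -> Word64
mask h s = bit (fromIntegral ((h `shiftR` s) .&. 0x1f))

-- A child's position in the packed array: the number of bitmap bits
-- set below its bit.
sparseIndex :: Word64 -> Word64 -> Int
sparseIndex bitmap m = popCount (bitmap .&. (m - 1))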

Page 15: The Haskell definition of a HAMT

data HashMap k v
    = Empty
    | BitmapIndexed {-# UNPACK #-} !Bitmap
                    {-# UNPACK #-} !(Array (HashMap k v))
    | Leaf {-# UNPACK #-} !(Leaf k v)
    | Full {-# UNPACK #-} !(Array (HashMap k v))
    | Collision {-# UNPACK #-} !Hash
                {-# UNPACK #-} !(Array (Leaf k v))

type Bitmap = Word

data Array a = Array !(Array# a)
                     {-# UNPACK #-} !Int
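
To make the traversal concrete, here is a sketch of lookup over a simplified version of this type: plain lists stand in for the specialized Array, and the Full case is folded into BitmapIndexed (the types and names are illustrative, not the library's):

{-# LANGUAGE BangPatterns #-}

import Data.Bits (bit, popCount, shiftR, (.&.))
import Data.Hashable (Hashable, hash)
import qualified Data.List as L
import Data.Word (Word64)
import Prelude hiding (lookup)

data HashMap k v
    = Empty
    | BitmapIndexed !Word64 [HashMap k v]  -- bitmap plus packed children
    | Leaf !Word64 !k v                    -- full hash, key, value
    | Collision !Word64 [(k, v)]           -- keys sharing one full hash

lookup :: (Eq k, Hashable k) => k -> HashMap k v -> Maybe v
lookup k = go (fromIntegral (hash k)) 0
  where
    go !h !s t = case t of
        Empty -> Nothing
        Leaf lh lk lv
            | lh == h && lk == k -> Just lv
            | otherwise          -> Nothing
        Collision ch kvs
            | ch == h            -> L.lookup k kvs
            | otherwise          -> Nothing
        BitmapIndexed bitmap children
            | bitmap .&. m == 0  -> Nothing  -- no child for this fragment
            | otherwise          ->          -- consume 5 bits and descend
                go h (s + 5) (children !! popCount (bitmap .&. (m - 1)))
          where
            m = bit (fromIntegral ((h `shiftR` s) .&. 0x1f))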

Page 16: Making it fast

- The initial implementation by Edward Z. Yang: correct but didn’t perform well.
- Improved performance by
  - replacing the use of Data.Vector by a specialized array type,
  - paying careful attention to strictness, and
  - using GHC’s new INLINABLE pragma (toy example below).
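
A toy illustration of the pragma (the function is hypothetical, not from the library):

module Example (member) where

-- INLINABLE keeps the full definition in this module's interface
-- file, so GHC can specialize away the Eq dictionary at call sites
-- in other modules, e.g. when the keys are ByteStrings.
member :: Eq k => k -> [(k, v)] -> Bool
member k = any ((== k) . fst)
{-# INLINABLE member #-}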

Page 17: Benchmarks

Keys: 2^12 random 8-byte ByteStrings

        Map   HashMap  HashMap (HAMT)
insert  1.00  0.43     1.21
lookup  1.00  0.28     0.21

Page 18: Where is all the time spent?

- Most time in insert is spent copying small arrays.
- Array copying is implemented using indexArray# and writeArray#, which results in poor performance.
- When cloning an array, we are forced to first fill the new array with dummy elements and then copy over the elements from the old array (sketched below).
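
A sketch of that cloning pattern, written against the primitive package's boxed arrays rather than the raw primops (an assumption made for readability):

import Control.Monad.ST (runST)
import Data.Primitive.Array (Array, indexArray, newArray, sizeofArray,
                             unsafeFreezeArray, writeArray)

-- Clone an array the slow way: allocate a new array pre-filled with
-- a dummy element, then copy the old elements over one at a time.
cloneArray :: Array a -> Array a
cloneArray old = runST $ do
    let n = sizeofArray old
    new <- newArray n (error "dummy")  -- the wasted initial fill
    let go i | i >= n    = unsafeFreezeArray new
             | otherwise = do
                 writeArray new i (indexArray old i)
                 go (i + 1)
    go 0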

Page 19: A better array copy

- Daniel Peebles has implemented a set of new primops for copying arrays in GHC (see the sketch below).
- The first implementation showed a 20% performance improvement for insert.
- Copying arrays is still slow, so there might still be room for big improvements.
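
With a copy primop the loop above collapses to a single call; sketched again via the primitive package, whose copyArray wraps such a primop today (the snoc helper is illustrative):

import Control.Monad.ST (runST)
import Data.Primitive.Array (Array, copyArray, newArray, sizeofArray,
                             unsafeFreezeArray)

-- Append one element by cloning with a block copy instead of an
-- element-by-element loop.
snoc :: Array a -> a -> Array a
snoc old x = runST $ do
    let n = sizeofArray old
    new <- newArray (n + 1) x  -- x doubles as the fill element
    copyArray new 0 old 0 n    -- one block copy of the old elements
    unsafeFreezeArray new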

Page 20: Other possible performance improvements

- Even faster array copying using SSE instructions, inline memory allocation, and CMM inlining.
- Use the dedicated bit population count instruction on architectures where it’s available.
- Clojure uses a clever trick to unpack keys and values directly into the arrays: keys are stored at even positions and values at odd positions.
- GHC 7.2 will use one less word for Array.

Page 21: Optimize common cases

- In many cases maps are created in one go from a sequence of key/value pairs.
- We can optimize for this case by repeatedly mutating the HAMT and freezing it when we’re done (see the sketch after the table).

Keys: 2^12 random 8-byte ByteStrings

fromList/pure      1.00
fromList/mutating  0.50
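
The underlying idiom, sketched with the primitive package's arrays rather than the HAMT itself (a simplification; fromListN is illustrative):

import Control.Monad.ST (runST)
import Data.Primitive.Array (Array, newArray, unsafeFreezeArray, writeArray)

-- Build by mutating a single array in place inside ST, then freeze
-- it once at the end; intermediate versions are never shared, so
-- in-place updates are safe.
fromListN :: Int -> a -> [a] -> Array a
fromListN n dummy xs = runST $ do
    marr <- newArray n dummy
    mapM_ (uncurry (writeArray marr)) (zip [0 .. n - 1] xs)
    unsafeFreezeArray marr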

Page 22: Abstracting over collection types

- We will soon have two map types worth using (one ordered and one unordered).
- We want to write functions that work with both types, without an O(n) conversion cost.
- Use type families to abstract over the different concrete implementations (sketched below).
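
A sketch of the type-families approach (the MapLike class and its methods are hypothetical, not from a released library):

{-# LANGUAGE TypeFamilies #-}

import qualified Data.Map as M
import Prelude hiding (lookup)

-- One interface, many concrete maps; each instance picks its own
-- key type and key constraints.
class MapLike m where
    type Key m
    empty  :: m v
    insert :: Key m -> v -> m v -> m v
    lookup :: Key m -> m v -> Maybe v

instance Ord k => MapLike (M.Map k) where
    type Key (M.Map k) = k
    empty  = M.empty
    insert = M.insert
    lookup = M.lookup

-- An unordered instance would look the same, with an
-- (Eq k, Hashable k) context instead of Ord k.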

Page 23: Summary

- Hashing allows us to create more efficient data structures.
- There are interesting new data structures out there that have, or could have, an efficient persistent implementation.