May 12, 2015
Motivating problem: Twitter data analysis
“I’m computing a communication graph from Twitter data and then scan it daily to allocate social capital to nodes behaving in a good karmic manner. The graph is culled from 100 million tweets and has about 3 million nodes.”
We need a data structure that is
▶ fast when used with string keys, and
▶ doesn’t use too much memory.
Persistent maps in Haskell
▶ Data.Map is the most commonly used map type.
▶ It’s implemented using size balanced trees.
▶ Keys can be of any type, as long as values of the type can be ordered.
Real world performance of Data.Map
▶ Good in theory: no more than O(log n) comparisons.
▶ Not great in practice: up to O(log n) comparisons!
▶ Many common types are expensive to compare, e.g. String, ByteString, and Text.
▶ Given a string of length k, we need O(k ∗ log n) comparisons to look up an entry.
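As a minimal sketch of the cost in question (the map contents here are made up), consider a Data.Map with String keys: every step of the O(log n) descent through the balanced tree compares the full query string against a stored key, and each such comparison can itself cost O(k).

```haskell
import qualified Data.Map as M

phoneBook :: M.Map String Int
phoneBook = M.fromList [("alice", 1), ("bob", 2), ("carol", 3)]

-- Each of the O(log n) tree steps performs a String comparison
-- costing up to O(k), hence O(k * log n) total per lookup.
look :: String -> Maybe Int
look name = M.lookup name phoneBook
```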
Hash tables
▶ Hash tables perform well with string keys: O(k) amortized lookup time for strings of length k.
▶ However, we want persistent maps, not mutable hash tables.
Milan Straka’s idea: Patricia trees as sparse arrays
▶ We can use hashing without using hash tables!
▶ A Patricia tree implements a persistent, sparse array.
▶ Patricia trees are about twice as fast as size balanced trees, but only work with Int keys.
▶ Use hashing to derive an Int from an arbitrary key.
Implementation tricks
▶ Patricia trees implement a sparse, persistent array of size 2^32 (or 2^64).
▶ Hashing using this many buckets makes collisions rare: for 2^24 entries we expect about 32,000 single collisions.
▶ Linked lists are a perfectly adequate collision resolution strategy.
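A back-of-the-envelope check of the collision claim (the function name here is ours, not the library's): hashing n keys uniformly into m buckets produces roughly n² / (2m) colliding pairs, by the standard birthday-problem estimate.

```haskell
-- Birthday-problem estimate of colliding key pairs when hashing
-- n keys uniformly into m buckets.
expectedCollisions :: Double -> Double -> Double
expectedCollisions n m = n * n / (2 * m)

-- expectedCollisions (2 ** 24) (2 ** 32) = 2^48 / 2^33 = 2^15
-- = 32768, i.e. roughly the 32,000 collisions quoted above.
```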
First attempt at an implementation
-- Defined in the containers package.
data IntMap a
    = Nil
    | Tip {-# UNPACK #-} !Key a
    | Bin {-# UNPACK #-} !Prefix
          {-# UNPACK #-} !Mask
          !(IntMap a) !(IntMap a)

type Prefix = Int
type Mask   = Int
type Key    = Int

newtype HashMap k v = HashMap (IntMap [(k, v)])
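A sketch (not the library's actual code) of how lookup and insert work for this first-attempt representation. To keep the example dependent only on containers, the hash function is passed in explicitly rather than taken from the hashable package, and the function names are ours.

```haskell
import qualified Data.IntMap as IM
import qualified Data.List as L

newtype HashMap k v = HashMap (IM.IntMap [(k, v)])

empty :: HashMap k v
empty = HashMap IM.empty

-- Hash the key to an Int, index the Patricia tree, then scan the
-- (almost always single-element) collision list.
lookupH :: Eq k => (k -> Int) -> k -> HashMap k v -> Maybe v
lookupH hash k (HashMap m) = IM.lookup (hash k) m >>= L.lookup k

-- Insert into the bucket's list, replacing any previous binding.
insertH :: Eq k => (k -> Int) -> k -> v -> HashMap k v -> HashMap k v
insertH hash k v (HashMap m) =
  HashMap (IM.insertWith (\_ old -> (k, v) : filter ((/= k) . fst) old)
                         (hash k) [(k, v)] m)
```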
A more memory efficient implementation
data HashMap k v
    = Nil
    | Tip {-# UNPACK #-} !Hash
          {-# UNPACK #-} !(FL.FullList k v)
    | Bin {-# UNPACK #-} !Prefix
          {-# UNPACK #-} !Mask
          !(HashMap k v) !(HashMap k v)

type Prefix = Int
type Mask   = Int
type Hash   = Int

data FullList k v = FL !k !v !(List k v)
data List k v = Nil | Cons !k !v !(List k v)
Reducing the memory footprint
▶ List k v uses 2 fewer words per key/value pair than [(k, v)].
▶ FullList can be unpacked into the Tip constructor as it’s a product type, saving 2 more words.
▶ Always unpack word sized types, like Int, unless you really need them to be lazy.
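A minimal illustration of the unpacking advice (the type here is ours): with UNPACK on a strict Int field, GHC stores the machine word directly in the constructor instead of a pointer to a separately allocated boxed Int.

```haskell
-- Two unboxed machine words stored inline in the constructor,
-- rather than two pointers to heap-allocated Int boxes.
data Point = Point {-# UNPACK #-} !Int {-# UNPACK #-} !Int

norm2 :: Point -> Int
norm2 (Point x y) = x * x + y * y
```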
Benchmarks
Keys: 2^12 random 8-byte ByteStrings

        Map   HashMap
insert  1.00  0.43
lookup  1.00  0.28
delete performs like insert .
Can we do better?
▶ We still need to perform O(min(W, log n)) Int comparisons, where W is the number of bits in a word.
▶ The memory overhead per key/value pair is still quite high.
Borrowing from our neighbours
▶ Clojure uses a hash-array mapped trie (HAMT) data structure to implement persistent maps.
▶ Described in the paper “Ideal Hash Trees” by Bagwell (2001).
▶ Originally a mutable data structure implemented in C++.
▶ Clojure’s persistent version was created by Rich Hickey.
Hash-array mapped tries in Clojure
▶ Shallow tree with high branching factor.
▶ Each node, except the leaf nodes, contains an array of up to 32 elements.
▶ 5 bits of the hash are used to index the array at each level.
▶ A clever trick, using bit population count, is used to represent sparse arrays.
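A sketch of the bit population count trick (the function name is ours): the 5-bit hash fragment at a given level selects one of 32 logical slots, and counting the bitmap bits below that slot gives the slot's position in the packed array, so absent children cost no space.

```haskell
import Data.Bits (popCount, shiftR, (.&.), bit)
import Data.Word (Word64)

-- Index of a child in the packed (sparse) array: take the 5-bit
-- hash fragment for this level, then count how many bitmap bits
-- are set below that logical slot.
sparseIndex :: Word64 -> Int -> Int -> Int
sparseIndex bitmap h level =
  let frag = (h `shiftR` (level * 5)) .&. 31
  in popCount (bitmap .&. (bit frag - 1))
```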
The Haskell definition of a HAMT
data HashMap k v
    = Empty
    | BitmapIndexed {-# UNPACK #-} !Bitmap
                    {-# UNPACK #-} !(Array (HashMap k v))
    | Leaf {-# UNPACK #-} !(Leaf k v)
    | Full {-# UNPACK #-} !(Array (HashMap k v))
    | Collision {-# UNPACK #-} !Hash
                {-# UNPACK #-} !(Array (Leaf k v))

type Bitmap = Word

data Array a = Array !(Array# a)
                     {-# UNPACK #-} !Int
Making it fast
▶ The initial implementation by Edward Z. Yang: correct but didn’t perform well.
▶ Improved performance by
  ▶ replacing the use of Data.Vector with a specialized array type,
  ▶ paying careful attention to strictness, and
  ▶ using GHC’s new INLINABLE pragma.
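A sketch of the INLINABLE idea on a deliberately generic function (the name is ours, not an unordered-containers internal): the pragma keeps the full definition in the module's interface file, so GHC can specialize the Ord-polymorphic code to a concrete type at each call site instead of going through a dictionary.

```haskell
-- INLINABLE exposes the definition across module boundaries,
-- letting GHC specialize it (e.g. to Int) at use sites.
{-# INLINABLE maximum3 #-}
maximum3 :: Ord a => a -> a -> a -> a
maximum3 a b c = max a (max b c)
```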
Benchmarks
Keys: 2^12 random 8-byte ByteStrings

        Map   HashMap  HashMap (HAMT)
insert  1.00  0.43     1.21
lookup  1.00  0.28     0.21
Where is all the time spent?
▶ Most time in insert is spent copying small arrays.
▶ Array copying is implemented using indexArray# and writeArray#, which results in poor performance.
▶ When cloning an array, we are forced to first fill the new array with dummy elements, and then copy over the elements from the old array.
A better array copy
▶ Daniel Peebles has implemented a set of new primops for copying arrays in GHC.
▶ The first implementation showed a 20% performance improvement for insert.
▶ Copying arrays is still slow, so there might still be room for big improvements.
Other possible performance improvements
▶ Even faster array copying using SSE instructions, inline memory allocation, and CMM inlining.
▶ Use dedicated bit population count instruction on architectures where it’s available.
▶ Clojure uses a clever trick to unpack keys and values directly into the arrays: keys are stored at even positions and values at odd positions.
I GHC 7.2 will use 1 less word for Array.
Optimize common cases
▶ In many cases maps are created in one go from a sequence of key/value pairs.
▶ We can optimize for this case by repeatedly mutating the HAMT and freezing it when we’re done.
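The same mutate-then-freeze pattern can be illustrated with an unboxed array from the array package (the real fromList mutates HAMT nodes instead, and the names here are ours): build the structure with cheap in-place writes inside ST, then hand out the frozen, purely usable result.

```haskell
import Data.Array.ST (runSTUArray, newArray, writeArray)
import Data.Array.Unboxed (UArray, (!))

-- Build an array by mutation, then freeze it on exit from
-- runSTUArray; callers only ever see the immutable result.
buildFrozen :: [(Int, Int)] -> UArray Int Int
buildFrozen kvs = runSTUArray $ do
  arr <- newArray (0, 9) 0               -- allocate, zero-filled
  mapM_ (uncurry (writeArray arr)) kvs   -- update in place
  return arr                             -- frozen by runSTUArray
```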
Keys: 2^12 random 8-byte ByteStrings

fromList/pure      1.00
fromList/mutating  0.50
Abstracting over collection types
▶ We will soon have two map types worth using (one ordered and one unordered).
▶ We want to write functions that work with both types, without an O(n) conversion cost.
▶ Use type families to abstract over the different concrete implementations.
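A hedged sketch of the type-family approach (the class and method names here are hypothetical): each map type declares its own key type via an associated type family, so code written against the class works with any concrete map without conversion.

```haskell
{-# LANGUAGE TypeFamilies #-}

import qualified Data.Map as M
import Prelude hiding (lookup)

-- One interface over both map flavours; Key is an associated
-- type family, so each instance picks its own key type.
class MapLike m where
  type Key m
  empty  :: m v
  insert :: Key m -> v -> m v -> m v
  lookup :: Key m -> m v -> Maybe v

instance Ord k => MapLike (M.Map k) where
  type Key (M.Map k) = k
  empty  = M.empty
  insert = M.insert
  lookup = M.lookup
```

An analogous instance for the unordered map would carry an Eq/Hashable constraint instead of Ord.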
Summary
▶ Hashing allows us to create more efficient data structures.
▶ There are interesting new data structures out there that have, or could have, efficient persistent implementations.