Nov 18, 2014
Alternative Data Structures in Ruby
Tyler McMullen
Friday, February 19, 2010
Why?
• Speed
• Memory
• Clarity
What’s wrong with my favorite data structure, X?
Nothing. (Maybe.)
• Bloom Filter
• BK-tree
• Splay Tree
• Trie
Bloom Filters
• Tests for existence in a set
• Probabilistic
• Minimal memory use
100 million strings in a Set
• Traditional Set: minimum 10 GB
• Bloom Filter (false-positive rate 0.00001): 280 MB
• Bloom Filter (false-positive rate 0.001): 170 MB
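The sizes above are close to what the standard Bloom filter sizing formula gives, m = -n ln(p) / (ln 2)^2 bits for n items at false-positive rate p. The talk does not say how its figures were computed, so the sketch below is only a plausibility check; the method names are made up for illustration.

```ruby
# Optimal number of bits for n items at false-positive rate p,
# using the standard formula m = -n * ln(p) / (ln 2)^2.
def bloom_filter_bits(n, p)
  (-n * Math.log(p) / (Math.log(2)**2)).ceil
end

# Same figure expressed in mebibytes (1 MiB = 1_048_576 bytes).
def bloom_filter_mib(n, p)
  bloom_filter_bits(n, p) / 8.0 / 1_048_576
end

puts bloom_filter_mib(100_000_000, 0.00001).round(1) # roughly 286 MiB
puts bloom_filter_mib(100_000_000, 0.001).round(1)   # roughly 171 MiB
```

Note that the memory depends only on the number of items and the error rate, not on how long the strings are.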
[Figure: a bit array with positions 0 to 7; each string is hashed to a few positions]
add: “to be or not to be”
add: “that is the question”
query: “whether ‘tis nobler” → NO MATCH
query: “to be or not to be” → MATCH
query: “in the mind to suffer” → FALSE MATCH
File Server
[Figure: flowchart. Request → exists? → Y: 200 / N: 404, with a Bloom Filter added to the flow]
Bloom Filter
• Test for existence in set
• Tiny Memory Footprint
• Excellent Speed
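A minimal sketch of the add/query behavior described in this section. The class name, constructor parameters, and the MD5-based double hashing are illustrative assumptions, not code from the talk. A plain Ruby Array of booleans is used for readability; a real implementation would pack bits to get the small memory footprint.

```ruby
require 'digest'

# Illustrative Bloom filter: names and hashing scheme are assumptions.
class BloomFilter
  def initialize(size_in_bits, hash_count)
    @size = size_in_bits
    @hash_count = hash_count
    @bits = Array.new(size_in_bits, false)
  end

  # Set every position the item hashes to.
  def add(item)
    positions(item).each { |i| @bits[i] = true }
    self
  end

  # false means "definitely not added"; true means "probably added"
  # (it can be a false match if other items set the same positions).
  def include?(item)
    positions(item).all? { |i| @bits[i] }
  end

  private

  # Derive hash_count positions from the two halves of one MD5 digest
  # (double hashing: h1 + k * h2).
  def positions(item)
    digest = Digest::MD5.hexdigest(item.to_s)
    h1 = digest[0, 16].to_i(16)
    h2 = digest[16, 16].to_i(16)
    (0...@hash_count).map { |k| (h1 + k * h2) % @size }
  end
end

filter = BloomFilter.new(1024, 3)
filter.add("to be or not to be")
filter.add("that is the question")
puts filter.include?("to be or not to be") # an added item always matches
```

An item that was added can never be reported as missing, which is what makes the filter safe to put in front of a slower existence check.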
BK-tree
• find items within a distance of a target
• reduces search space
• works inside a metric space
Triangle Inequality
| d(x, y) - d(x, z) | ≤ d(y, z)
[Figure: triangle with vertices x, y, z, where d(x, y) = 4, d(x, z) = 1, and d(y, z) is unknown]
| 4 - 1 | ≤ d(y, z)
3 ≤ d(y, z)
BK-tree
Words: paste, pasta, taser, pastor, shave, light
[Figure: BK-tree with root “paste” and five child edges labeled 1 to 5, leading to pasta, taser, pastor, shave, and light. A query for “pastu” is at distance 1 from the root, so only the edges labeled 1 and 2 (pasta, pastor) are followed; the other subtrees are skipped.]
BK-tree
• Most often used for spelling correctors
• Work in any metric space
• Reduce the search space
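The pruning step can be sketched as follows, assuming Levenshtein edit distance as the metric (a common choice for spelling correctors; the talk does not name one). `BKTree`, `levenshtein`, and the method signatures are hypothetical names for illustration.

```ruby
# Levenshtein edit distance. It is a true metric, so the triangle
# inequality holds and the pruning below is safe.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

class BKTree
  # children maps an edge label (distance to this node's word) to a Node.
  Node = Struct.new(:word, :children)

  def initialize
    @root = nil
  end

  def add(word)
    if @root.nil?
      @root = Node.new(word, {})
      return self
    end
    node = @root
    loop do
      d = levenshtein(word, node.word)
      return self if d.zero? # word is already stored
      child = node.children[d]
      if child
        node = child
      else
        node.children[d] = Node.new(word, {})
        return self
      end
    end
  end

  # All stored words within `tolerance` edits of `target`, sorted.
  def search(target, tolerance)
    results = []
    stack = @root ? [@root] : []
    until stack.empty?
      node = stack.pop
      d = levenshtein(target, node.word)
      results << node.word if d <= tolerance
      # Triangle inequality: a match can only sit under an edge whose
      # label lies in [d - tolerance, d + tolerance].
      node.children.each do |edge, child|
        stack << child if (d - tolerance..d + tolerance).cover?(edge)
      end
    end
    results.sort
  end
end

tree = BKTree.new
%w[paste pasta taser pastor shave light].each { |w| tree.add(w) }
p tree.search("pastu", 1) # => ["pasta", "paste"]
```

With a tolerance of 1 and a root distance of 1, only edges labeled 0 to 2 are followed, which matches the walk-through above.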
Splay Tree
Tangent: Access Patterns
Usually assumed to be random or even.
Rarely the case.
Splay Tree
• Self-balancing binary tree
• Brings most accessed items toward root
• The more uneven the access pattern, the better
[Figure: a binary tree of integer keys rooted at 7, with left subtree rooted at 4 and right subtree rooted at 11, shown before and after the nodes under 11 (8 to 14) are rearranged by rotations]
Splay Tree
• Made for very uneven access patterns
• Caches, garbage collectors, etc.
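A compact sketch of how a splay tree moves an accessed key to the root, using the textbook recursive zig-zig / zig-zag rotations. The class and method names are illustrative assumptions; the talk shows no implementation.

```ruby
class SplayTree
  Node = Struct.new(:key, :left, :right)

  attr_reader :root

  def initialize
    @root = nil
  end

  def insert(key)
    if @root.nil?
      @root = Node.new(key)
      return self
    end
    @root = splay(@root, key)
    return self if @root.key == key # already present
    node = Node.new(key)
    if key < @root.key
      node.left = @root.left
      node.right = @root
      @root.left = nil
    else
      node.right = @root.right
      node.left = @root
      @root.right = nil
    end
    @root = node
    self
  end

  # Every lookup splays, so frequently accessed keys stay near the root.
  def include?(key)
    return false if @root.nil?
    @root = splay(@root, key)
    @root.key == key
  end

  # In-order traversal (sorted keys).
  def to_a(node = @root, out = [])
    return out if node.nil?
    to_a(node.left, out)
    out << node.key
    to_a(node.right, out)
  end

  private

  def rotate_right(n)
    l = n.left
    n.left = l.right
    l.right = n
    l
  end

  def rotate_left(n)
    r = n.right
    n.right = r.left
    r.left = n
    r
  end

  # Bring key (or the last node on its search path) to the top of the subtree.
  def splay(node, key)
    return node if node.nil? || node.key == key
    if key < node.key
      return node if node.left.nil?
      if key < node.left.key # zig-zig
        node.left.left = splay(node.left.left, key)
        node = rotate_right(node)
      elsif key > node.left.key # zig-zag
        node.left.right = splay(node.left.right, key)
        node.left = rotate_left(node.left) unless node.left.right.nil?
      end
      node.left.nil? ? node : rotate_right(node)
    else
      return node if node.right.nil?
      if key > node.right.key # zig-zig
        node.right.right = splay(node.right.right, key)
        node = rotate_left(node)
      elsif key < node.right.key # zig-zag
        node.right.left = splay(node.right.left, key)
        node.right = rotate_right(node.right) unless node.right.left.nil?
      end
      node.right.nil? ? node : rotate_left(node)
    end
  end
end

tree = SplayTree.new
[7, 4, 11, 2, 6, 9, 13].each { |k| tree.insert(k) }
tree.include?(9)
puts tree.root.key # 9 is now the root
```

Because even a read restructures the tree, this sketch is not safe for concurrent readers without locking.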
Trie
• O(k) lookup, add, removal for a key of length k, regardless of how many keys are stored
• Ordered traversals
• Prefix matching
• Excellent memory usage (depending on implementation)
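A minimal hash-based sketch of these operations. The talk never shows its `Trie` class, so everything here is an assumption, with `add` and `children` named to match the calls in the talk's Autocompleter example; `include?` and the private helpers are extra illustrative names. Removal is left out to keep the sketch short.

```ruby
class Trie
  # children maps one character to the next Node; terminal marks the
  # end of a stored word.
  Node = Struct.new(:children, :terminal)

  def initialize
    @root = Node.new({}, false)
  end

  # Walk or create one node per character: cost grows with word length.
  def add(word)
    node = @root
    word.each_char { |ch| node = (node.children[ch] ||= Node.new({}, false)) }
    node.terminal = true
    self
  end

  def include?(word)
    node = find(word)
    !node.nil? && node.terminal
  end

  # Prefix matching: every stored word that starts with prefix,
  # returned in sorted order (an ordered traversal of the subtree).
  def children(prefix)
    node = find(prefix)
    return [] if node.nil?
    words = []
    collect(node, prefix, words)
    words
  end

  private

  # Follow the characters of string; nil if the path does not exist.
  def find(string)
    node = @root
    string.each_char do |ch|
      node = node.children[ch]
      return nil if node.nil?
    end
    node
  end

  def collect(node, path, words)
    words << path if node.terminal
    node.children.keys.sort.each do |ch|
      collect(node.children[ch], path + ch, words)
    end
  end
end

trie = Trie.new
%w[thin trap bar burp].each { |w| trie.add(w) }
p trie.include?("trap")   # => true
p trie.include?("bupkis") # => false
p trie.children("t")      # => ["thin", "trap"]
```

A hash per node is simple but not memory-efficient; the "excellent memory usage" bullet depends on more compact node layouts.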
Trie
[Figure: a trie built one word at a time, one letter per node; the stored paths are T-H-I-N, T-R-A-P, B-A-R, and B-U-R-P, sharing the prefixes T and B]
add: “thin”
add: “trap”
add: “bar”
add: “burp”
query: “trap” → Success!
query: “bupkis” → Fail!
Trie
Example: Autocompleter
class Autocompleter
  def initialize(words)
    @trie = Trie.new
    words.each { |word| @trie.add(word) }
  end

  def query(word)
    @trie.children(word)
  end
end
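# The same class as a Rack endpoint.
require 'json'
require 'rack'

class Autocompleter
  def initialize(words)
    @trie = Trie.new
    words.each { |word| @trie.add(word) }
  end

  def call(env)
    request = Rack::Request.new(env)
    # The slide leaves `word` undefined; reading it from a "word"
    # query parameter is an assumption.
    word = request.params['word']
    # A Rack response body must respond to #each, so wrap the JSON in an Array.
    [200, { 'content-type' => 'application/json' }, [@trie.children(word).to_json]]
  end
end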
Conclusion: Data structures are cool.
Questions?