Bloom Filters An Introduction and Really Most Of It CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Dec 26, 2015

Page 1:

Bloom Filters
An Introduction and Really Most Of It

CMSC 491
Hadoop-Based Distributed Computing

Spring 2015
Adam Shook

Page 2:

Agenda

• Discuss what a set data structure is using math terms
• Discuss the concept of a Bloom filter
• Explore the mathematical magic behind Bloom filters

Page 3:

Set!

• A set is an unordered data structure containing unique values
• Most common uses are:
  • Error-free set membership tests
  • Storing unique members of data (removing duplicates)
  • Iterating through data in no particular order
  • Other fun operations like unions, intersections, subsets, et cetera!

• Other set variants support sorting or duplicate values
  • But we aren’t here to talk about those

Page 4:

Set Insertion

[Figure: insert is called with peter, lois, chris, peter, stewie, chris; the resulting set contains stewie, lois, chris, peter.]

Page 5:

Set Membership Test

[Figure: is_member is called with peter on the set containing stewie, lois, chris, peter.]

Page 6:

Set Membership Test

[Figure: is_member is called with adam on the set containing stewie, lois, chris, peter.]
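The insertion and membership slides can be sketched with Python's built-in set. This is only an illustration: the variable names are made up here, and the values are the ones shown on the slides.

```python
# Build a set from values that contain duplicates.
names = ["peter", "lois", "chris", "peter", "stewie", "chris"]

family = set()
for name in names:
    family.add(name)  # inserting an existing member changes nothing

print(len(family))        # number of unique members
print("peter" in family)  # membership test for a value that was inserted
print("adam" in family)   # membership test for a value that was never inserted
```

An exact set like this never gives a wrong answer, which is what the memory cost discussed later pays for.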

Page 7:

Use Case!

• I’ve got a bunch of interesting keywords, A
• I’ve got a data set B
• I want to check if a record in B contains a word in A
• Make a new data set C for some cool data science

for each record x in B
    for each word w in x
        if w in A
            emit x
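A minimal Python sketch of the loop above, assuming records are plain strings and words are separated by whitespace. The function name and sample data are hypothetical. Unlike the pseudocode, it emits each matching record once, even if several of its words are in A.

```python
def matching_records(records, keywords):
    """Yield each record that contains at least one keyword."""
    for record in records:
        # Naive whitespace tokenization; real data likely needs more care.
        if any(word in keywords for word in record.split()):
            yield record

A = {"bloom", "hadoop"}                                        # keywords
B = ["intro to hadoop", "grocery list", "bloom filters rock"]  # data set
C = list(matching_records(B, A))
print(C)
```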

Page 8:

Use Case, Solved!

• Stuff all the data in A into a set
• Get an A+ on your computer science project
• Impress the boss

• But what if A is stupid big?

credit to mr. squarepants

Page 9:

Memory Footprint

• A contains 1 billion unique strings, average of 32 characters in length
  • 8 bits per character
  • 32 characters per string
  • 1 billion of them
  • 8 bits * 32 * 1,000,000,000 …
  • Roughly 29.8 GB of raw storage required to hold these elements
    • + overhead
    • + even more if you are using Java

• For the sake of argument, let’s all agree that A doesn’t fit comfortably on a computer…
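The arithmetic behind the 29.8 GB figure can be checked in a few lines. The sketch assumes the slide counts a gigabyte as 2^30 bytes, which is the convention that reproduces its number.

```python
bits_per_char = 8
chars_per_string = 32
num_strings = 1_000_000_000

total_bits = bits_per_char * chars_per_string * num_strings
total_bytes = total_bits // 8
total_gib = total_bytes / 2**30  # binary gigabytes (assumed convention)

print(f"{total_bits:,} bits = {total_bytes:,} bytes = {total_gib:.1f} GiB")
```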

Page 10:

credit to xkcd and paint

Page 11:

Making a Set Smaller

• What two ‘features’ of a set can we relax to meet our requirements and have a reasonable memory footprint?

• Functionality
  • Only want set membership operations

• Accuracy
  • Don’t really need to be 100% accurate

Page 12:

Use Case, Revised!

• I’ve got a bunch of interesting keywords, A
• I’ve got a data set B
• I want to check if a record in B contains a word in A
• Make a new data set C for some cool data science
• I don’t really care if some stuff in C doesn’t contain words from A

for each record x in B
    for each word w in x
        if w is likely in A with false positive rate p
            emit x

Page 13:

Let me paint you a story…

• We travel back to 1970…
• Burton Howard Bloom was investigating means to eliminate unnecessary disk accesses for particular algorithms
• Came up with a probabilistic data structure for set membership
• Useful for programs with expensive operations where the operation is often unnecessary
• A structure only 15% of the size of the original can eliminate 85% of unnecessary disk accesses

Page 14:

Bloom Filter

• A space-efficient means to test if an element is a member of a set
• Elements can be added, but cannot be removed
• Storage cost for a single element is independent of the element size
• The members are not stored, so they cannot be retrieved
• There are no false negatives, but false positives are possible

Page 15:

How It’s Made – Training a Bloom Filter

Given
    An array of bits, called bits, of size m, initialized to 0
    k hash functions
    n elements

foreach element e in the n elements
    foreach hash function h in the k hash functions
        bits[h(e) % m] = 1

Training a Bloom filter is O(n) for a fixed number of hash functions (O(n * k) in general)
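A minimal training sketch in Python. The slides do not say which hash functions to use, so salted SHA-256 is an assumption made for illustration, and the helper names make_hashes and train are hypothetical.

```python
import hashlib

def make_hashes(k):
    """Return k hash functions built by salting SHA-256 (an illustrative choice)."""
    def make(seed):
        def h(item):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            return int.from_bytes(digest[:8], "big")
        return h
    return [make(seed) for seed in range(k)]

def train(elements, m, hashes):
    """Set one bit per (element, hash function) pair in an m-bit array."""
    bits = [0] * m
    for element in elements:
        for h in hashes:
            bits[h(element) % m] = 1
    return bits

bits = train(["peter", "lois", "chris"], m=10, hashes=make_hashes(3))
print(bits)
```

A list of integers keeps the sketch readable; a real filter would pack the bits to get the space savings the slides are after.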

Page 16:

How It’s Made – Training a Bloom Filter

[Figure: a 10-bit array starts as 0 0 0 0 0 0 0 0 0 0; after training on peter, lois, and chris, seven of its bits are set to 1.]

Page 17:

How It’s Made – Membership Testing

Given
    A trained Bloom filter, bits, of size m
    The same k hash functions
    An element x

foreach hash function h in the k hash functions
    if bits[h(x) % m] is 0
        return false
return true

Testing a Bloom filter is O(k), which is O(1) for a fixed number of hash functions
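The membership test can be sketched the same way. As before, salted SHA-256 and the sizes M and K are assumptions for the demo, not something the slides specify. The filter is trained inline so the block stands on its own.

```python
import hashlib

M = 64  # number of bits (hypothetical size for this demo)
K = 3   # number of hash functions (hypothetical)

def positions(item):
    """Bit positions for item: K salted SHA-256 hashes, each reduced mod M."""
    return [
        int.from_bytes(hashlib.sha256(f"{seed}:{item}".encode()).digest()[:8], "big") % M
        for seed in range(K)
    ]

def might_contain(bits, item):
    """Return False if any probed bit is 0; otherwise the item is probably present."""
    for pos in positions(item):
        if bits[pos] == 0:
            return False
    return True

# Train on three members so there is something to test against.
members = ["peter", "lois", "chris"]
bits = [0] * M
for member in members:
    for pos in positions(member):
        bits[pos] = 1

# Every trained member must test positive: there are no false negatives.
print(all(might_contain(bits, member) for member in members))
```

A False answer is definitive, while a True answer only means "probably", which is why the function is named might_contain rather than contains.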

Page 18:

How It’s Made – Membership Testing

[Figure: peter is tested against the trained bit array 0 1 1 1 1 1 0 1 0 1.]

Page 19:

How It’s Made – Membership Testing

[Figure: adam is tested against the trained bit array 0 1 1 1 1 1 0 1 0 1.]

Page 20:

I know what you’re thinking

Page 21:

The Catch

• What we gain in space, we give up in accuracy…
• I give you… the false positive!

[Figure: cleveland is tested against the bit array 0 1 1 1 1 1 0 1 0 1 and comes back as a false positive.]
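A false positive is easy to provoke with a deliberately tiny filter. The sketch below assumes salted SHA-256 hashes and made-up probe keys (guest0, guest1, and so on). It searches for a key that was never inserted but still tests positive.

```python
import hashlib

M, K = 10, 3  # tiny on purpose so collisions are easy to find (hypothetical sizes)

def positions(item):
    """K salted SHA-256 hashes of item, each reduced mod M."""
    return [
        int.from_bytes(hashlib.sha256(f"{seed}:{item}".encode()).digest()[:8], "big") % M
        for seed in range(K)
    ]

def might_contain(bits, item):
    """True when every probed bit is set, which may be a coincidence."""
    return all(bits[pos] for pos in positions(item))

members = {"peter", "lois", "chris"}
bits = [0] * M
for member in members:
    for pos in positions(member):
        bits[pos] = 1

# Probe made-up keys until one that was never inserted tests positive.
false_positive = next(
    (f"guest{i}" for i in range(10_000) if might_contain(bits, f"guest{i}")),
    None,
)
print(false_positive)
```

The key that comes back depends on the hash functions chosen; the point is only that such a key exists once enough bits are set.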

Page 22:

credit to xkcd and paint

Page 23:

Controlling the False Positive Rate

Given
    Approximate number of elements in A, n
    A willingness to tolerate a false positive rate p
    k is the optimal number of hash functions

We can approximate m

If you want the full details, read the paper or Wikipedia
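The sizing formula itself does not survive in this transcript. The standard approximation, which reproduces the numbers on the use case slides (m of about 4.792 × 10^9 bits and k of about 3.32 for n = 10^9 and p = 0.1), is sketched below. Treat it as a reconstruction rather than the slide's exact content.

```latex
m \approx -\frac{n \ln p}{(\ln 2)^2},
\qquad
k = \frac{m}{n} \ln 2
```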

Page 24:

Back to our use case…

n = 1,000,000,000
p = .1

After dusting off the calculators…

m = 4.792 × 10^9 bits or 0.558 GB

An improvement of 29.8/0.558 = 53.4x!
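The calculator step can be reproduced in Python, assuming the standard approximation m = -n ln(p) / (ln 2)^2 and gigabytes of 2^30 bytes. Both assumptions are consistent with the figures above but are not spelled out in the transcript.

```python
import math

n = 1_000_000_000  # expected number of elements
p = 0.1            # tolerated false positive rate

m_bits = -n * math.log(p) / (math.log(2) ** 2)
m_gib = m_bits / 8 / 2**30
raw_gib = 8 * 32 * n / 8 / 2**30  # raw strings from the memory footprint slide

print(f"m = {m_bits:.4g} bits = {m_gib:.3f} GiB")
print(f"improvement = {raw_gib / m_gib:.1f}x")
```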

Page 25:

And now that we have m…

We can use n and m to calculate k = (m/n) * ln(2) ≈ 3.32

But I haven’t heard of 3.32 hash functions, so let’s call it 4
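To see what rounding k does, the sketch below computes the exact k and then estimates the false positive rate for nearby whole numbers. It uses the standard approximation (1 - e^(-kn/m))^k, which is an assumption here since the slides do not state it.

```python
import math

n = 1_000_000_000
m = 4.792e9  # bits, from the previous slide

k_exact = m / n * math.log(2)

def false_positive_rate(k):
    """Standard approximation (1 - e^(-k*n/m))^k for k hash functions."""
    return (1 - math.exp(-k * n / m)) ** k

print(f"k = {k_exact:.2f}")
for k in (3, 4):
    print(f"k = {k}: p ~ {false_positive_rate(k):.4f}")
```

Under this approximation both 3 and 4 hash functions stay close to the 10% target, so rounding either way is defensible.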

Page 26:

References

• Wikipedia