Bloom Filters Kira Radinsky Slides based on material from: Michael Mitzenmacher and Hanoch Levy
Tutorial 9 (bloom filters)

Jun 17, 2015

Kira

Part of the Search Engine course given in the Technion (2011)
Transcript
Page 1: Tutorial 9 (bloom filters)

Bloom Filters

Kira Radinsky

Slides based on material from:

Michael Mitzenmacher and Hanoch Levy

Page 2: Tutorial 9 (bloom filters)

Motivation - Cache

• Lookup questions: Does item “x” exist in a set?

• Data set may be very big or expensive to access. Filter lookup questions with negative results before accessing data.

• Allow false positive errors, as they only cost us an extra data access.

• Don’t allow false negative errors, because they result in wrong answers.

Page 3: Tutorial 9 (bloom filters)

Application of Bloom Filters: Distributed Web Caches

Web Cache 1 Web Cache 2 Web Cache 3

Web Cache 6 Web Cache 5 Web Cache 4

• Send Bloom filters of URLs.

• False positives do not hurt much.

– Get errors from cache changes anyway

Page 4: Tutorial 9 (bloom filters)

Web Caching

• Summary Cache: [Fan, Cao, Almeida, & Broder]

If local caches know each other’s content...

…try local cache before going out to Web

• Sending/updating lists of URLs too expensive.

• Solution: use Bloom filters.

• False positives

– Local requests go unfulfilled.

– Small cost, big potential gain

Page 5: Tutorial 9 (bloom filters)

The Problem Solved by BF: Approximate Set Membership

• Lookup Problem: Given a set S = {x1,x2,…,xn}, construct data structure to answer queries of the form “Is y in S?”

• Data structure should be:

– Fast (Faster than searching through S).

– Small (Smaller than explicit representation).

• To obtain speed and size improvements, allow some probability of error.

– False positives: $y \notin S$ but we report $y \in S$

– False negatives: $y \in S$ but we report $y \notin S$

Page 6: Tutorial 9 (bloom filters)

Bloom Filters

Start with an m bit array, filled with 0s.

Hash each item xj in S k times. If Hi(xj) = a, set B[a] = 1.

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0B

0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0B

To check if y is in S, check B at Hi(y). All k values must be 1.

B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

Possible to have a false positive; all k values are 1, but y is not in S.
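
The construction and lookup described on this slide can be sketched in Python as follows. This is a minimal illustration, not the slides' own code: the class name `BloomFilter`, the sizes m = 64 and k = 3, and the use of salted SHA-256 to stand in for the k hash functions H1..Hk are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array B and k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # start with an m-bit array filled with 0s

    def _positions(self, item):
        # Assumption: simulate H_1..H_k by salting SHA-256 with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for a in self._positions(item):
            self.bits[a] = 1  # if H_i(x) = a, set B[a] = 1

    def __contains__(self, item):
        # All k positions must be 1; a single 0 proves the item is absent.
        return all(self.bits[a] for a in self._positions(item))


bf = BloomFilter(m=64, k=3)
for url in ["a.example", "b.example", "c.example"]:
    bf.add(url)
print("a.example" in bf)  # inserted items are always reported as present
```

A lookup that returns False is definitive, while True may be a false positive, which matches the cache scenario from the motivation slides: a false positive only costs one extra data access.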

Page 7: Tutorial 9 (bloom filters)

Bloom Filter

[Figure: an element x is hashed by h1(x), h2(x), h3(x), ..., hk(x) to positions in the bit vector V0 ... Vm-1 (shown as 01000 10100 00010).]

Page 8: Tutorial 9 (bloom filters)

Advantages

• No overflow: new items can always be added (only the false positive rate grows)

• Union and intersection of Bloom filters

– Simple bitwise OR and AND operations

• Applications:

– Google BigTable

– The Squid Web Proxy Cache uses Bloom filters for cache digests.
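
The union and intersection point above can be illustrated with a small sketch. It assumes both filters use the same size m and the same k hash functions (here the hypothetical values M = 64, K = 3 and salted SHA-256), and it packs the bit array into a single Python integer so that OR and AND apply directly.

```python
import hashlib

M, K = 64, 3  # assumption: both filters share m and the same k hash functions


def positions(item):
    # Stand-in for h_1..h_k: SHA-256 salted with the hash index.
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % M


def make_filter(items):
    bits = 0  # the m-bit array packed into one integer
    for item in items:
        for a in positions(item):
            bits |= 1 << a
    return bits


def might_contain(bits, item):
    return all((bits >> a) & 1 for a in positions(item))


f1 = make_filter(["x1", "x2"])
f2 = make_filter(["x2", "x3"])
union = f1 | f2         # equals the filter built directly from S1 ∪ S2
intersection = f1 & f2  # contains every bit of the filter for S1 ∩ S2
print(might_contain(union, "x3"), might_contain(intersection, "x2"))
```

The OR result is exactly the filter of the union. The AND result may have extra bits set compared with a filter built directly from the intersection, so it answers intersection queries with no false negatives but a somewhat higher false positive rate.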

Page 9: Tutorial 9 (bloom filters)

Bloom Errors

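[Figure: bit vector V0 ... Vm-1 (shown as 01000 10100 00010) after inserting elements a, b, c, d; the positions h1(x), h2(x), h3(x), ..., hk(x) of a new element x are all set.]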

x didn’t appear, yet its bits are already set

Page 10: Tutorial 9 (bloom filters)

Example

[Figure: false positive rate (y-axis, 0 to 0.1) versus number of hash functions (x-axis, 0 to 10), for m/n = 8.]

Opt k = 8 ln 2 = 5.545...
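
The curve in this example can be reproduced numerically. The sketch below assumes the standard approximation for the false positive rate, f ≈ (1 − e^(−kn/m))^k, which is not written out on the slide; the function name `false_positive_rate` is hypothetical.

```python
import math


def false_positive_rate(bits_per_item, k):
    """Approximate rate f = (1 - e^(-k n / m))^k, with bits_per_item = m/n."""
    return (1.0 - math.exp(-k / bits_per_item)) ** k


bits_per_item = 8  # m/n = 8, as in the example
k_opt = bits_per_item * math.log(2)  # optimal (real-valued) k = (m/n) ln 2
print(f"optimal k = {k_opt:.3f}")
for k in range(1, 11):
    print(k, round(false_positive_rate(bits_per_item, k), 4))
```

Under this approximation the rate is lowest near k = 8 ln 2 ≈ 5.545; since k must be an integer, k = 5 or k = 6 would be used in practice, both giving a rate of roughly 2%.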

Page 11: Tutorial 9 (bloom filters)

Tradeoffs

• Three parameters.

– Size m/n : bits per item.

• |U| = n: Number of elements to encode.

• $h_i: U \to [1..m]$ : Maintain a Bit Vector V of size m

– Time k : number of hash functions.

• Use k hash functions (h1..hk)

– Error f : false positive probability.

Page 12: Tutorial 9 (bloom filters)

Bloom Filter Tradeoffs

• Three factors: m, k and n.

• Normally, n and m are given, and we select k.

• Small k

– Fewer computations.

– Actual number of bits accessed (nk) is smaller, so the chance of a “step over” is smaller too.

– However, fewer bits need to be stepped over to generate an error.

• For big k, the exact opposite holds.

• Not surprisingly, when k is optimal, the “hit ratio” (ratio of bits flipped in the array) is exactly 0.5
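
The 0.5 figure can be checked with a short derivation. It assumes the usual model in which each of the kn hash evaluations picks one of the m bits independently and uniformly, and it uses the optimal value k = (m/n) ln 2 from the earlier example:

```latex
\Pr[\text{a given bit is still } 0]
  = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m},
\qquad
k = \frac{m}{n}\ln 2
  \;\Longrightarrow\;
  e^{-kn/m} = e^{-\ln 2} = \frac{1}{2}.
```

So at the optimal k about half of the bits are 0 and half are 1, and the false positive rate becomes $(1/2)^k$.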

Page 13: Tutorial 9 (bloom filters)

Alternative Approach for Bloom Filters: Perfect Hashing Approach

Element 1 Element 2 Element 3 Element 4 Element 5

Fingerprint(4) Fingerprint(5) Fingerprint(2) Fingerprint(1) Fingerprint(3)

Page 14: Tutorial 9 (bloom filters)

Perfect Hashing Approach

• Folklore Bloom filter construction.

– Recall: Given a set S = {x1,x2,x3,…xn} on a universe U, we want to answer membership queries.

– Method: Find an n-cell perfect hash function for S.

• Maps set of n elements to n cells in a 1-1 manner.

– Then keep a $\lceil \log_2(1/\epsilon) \rceil$-bit fingerprint of the item in each cell. Lookups have false positive < $\epsilon$.

– Advantage: each bit/item reduces false positives by a factor of 1/2, vs $2^{-\ln 2} \approx 0.62$ for a standard Bloom filter.

• Negatives:

– Perfect hash functions non-trivial to find.

– Cannot handle on-line insertions.
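
A toy version of the perfect hashing approach is sketched below. It is only an illustration under strong assumptions: the perfect hash function is found by brute-force search over seeds of a salted SHA-256 hash, which is practical only for very small sets of distinct items, and all function names are hypothetical.

```python
import hashlib


def h(seed, item, modulus):
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus


def find_perfect_seed(items):
    """Brute-force a seed that maps the n distinct items to n distinct cells."""
    n = len(items)
    seed = 0
    while len({h(seed, x, n) for x in items}) != n:
        seed += 1
    return seed


def build(items, fingerprint_bits):
    seed = find_perfect_seed(items)
    cells = [0] * len(items)
    for x in items:
        # Store only a short fingerprint of x in its cell, not x itself.
        cells[h(seed, x, len(items))] = h("fp", x, 2 ** fingerprint_bits)
    return seed, cells


def might_contain(seed, cells, fingerprint_bits, y):
    cell = h(seed, y, len(cells))
    return cells[cell] == h("fp", y, 2 ** fingerprint_bits)


S = ["x1", "x2", "x3", "x4", "x5"]
seed, cells = build(S, fingerprint_bits=7)  # 7 bits: false positives about 1/128
print(all(might_contain(seed, cells, 7, x) for x in S))
```

Members always match their own fingerprint, so there are no false negatives. A non-member lands in some cell and matches its fingerprint with probability about 2^(−7). Adding an item later would generally require a new perfect hash function, which is the on-line insertion problem noted above.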

Page 15: Tutorial 9 (bloom filters)

Bloom Filters and Deletions

• Cache contents change

– Items both inserted and deleted.

• Insertions are easy – add bits to BF

• Can Bloom filters handle deletions?

– Use Counting Bloom Filters to track insertions/deletions at hosts;

– Send Bloom filters.

Page 16: Tutorial 9 (bloom filters)

Handling Deletions

• Bloom filters can handle insertions, but not deletions.

• If deleting xi means resetting 1s to 0s, then deleting xi will “delete” xj.

B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

xi xj

Page 17: Tutorial 9 (bloom filters)

Counting Bloom Filters

Start with an array of m counters, all set to 0.

Hash each item xj in S k times. If Hi(xj) = a, add 1 to B[a].

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0B

0 3 0 0 1 0 2 0 0 3 2 1 0 2 1 0B

To delete xj decrement the corresponding counters.

B: 0 2 0 0 0 0 2 0 0 3 2 1 0 1 1 0

Can obtain a corresponding Bloom filter by reducing to 0/1.

B: 0 1 0 0 0 0 1 0 0 1 1 1 0 1 1 0
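
The insert, delete, and reduce-to-0/1 steps above can be sketched as follows. This is an illustrative sketch only: the class name, the sizes m = 16 and k = 3, and the salted SHA-256 hashing are assumptions, and the counters are plain Python integers rather than fixed-width counters, so overflow is not modeled.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m  # m counters, all starting at 0

    def _positions(self, item):
        # Assumption: simulate H_1..H_k by salting SHA-256 with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for a in self._positions(item):
            self.counters[a] += 1  # if H_i(x) = a, add 1 to B[a]

    def remove(self, item):
        # Only valid for items that were previously added.
        for a in self._positions(item):
            self.counters[a] -= 1

    def __contains__(self, item):
        return all(self.counters[a] > 0 for a in self._positions(item))

    def to_bloom_bits(self):
        # Reduce counters to 0/1 to obtain an ordinary Bloom filter.
        return [1 if c > 0 else 0 for c in self.counters]


cbf = CountingBloomFilter(m=16, k=3)
cbf.add("xi")
cbf.add("xj")
cbf.remove("xi")
print("xj" in cbf, cbf.to_bloom_bits())
```

Deleting xi leaves xj intact because shared positions are decremented rather than cleared, which is exactly what a plain bit array cannot do.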

Page 18: Tutorial 9 (bloom filters)

Counting Bloom Filters: Overflow

• Must choose counters large enough to avoid overflow.

• Poisson approximation suggests 4 bits/counter.

– With k = (ln 2)m/n hash functions, the average load per counter is nk/m = ln 2.

– Probability a counter has load at least 16: $\approx e^{-\ln 2}(\ln 2)^{16}/16! \approx 6.78 \times 10^{-17}$

• Failsafes possible.
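
The overflow probability for 4-bit counters can be checked numerically. The sketch below assumes, as on the slide, that the load of a counter is approximately Poisson with mean ln 2; the helper name `poisson_pmf` is hypothetical.

```python
import math

lam = math.log(2)  # average counter load when k = (ln 2) m / n


def poisson_pmf(j, lam):
    return math.exp(-lam) * lam ** j / math.factorial(j)


p16 = poisson_pmf(16, lam)  # the dominant term of the tail
tail = sum(poisson_pmf(j, lam) for j in range(16, 60))  # P(load >= 16), truncated
print(f"P(load = 16)  ~ {p16:.3g}")
print(f"P(load >= 16) ~ {tail:.3g}")
```

The term for a load of exactly 16 dominates the tail, so both values are on the order of 10^(−17) per counter, which is why 4 bits per counter are considered enough.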

Page 19: Tutorial 9 (bloom filters)

Variations and Extensions

• Distance-Sensitive Bloom Filters

• Bloomier Filter

Page 20: Tutorial 9 (bloom filters)

Extension: Distance-Sensitive Bloom Filters

• Instead of answering questions of the form “Is $y \in S$?” we would like to answer questions of the form “Is $y \approx x \in S$?”

• That is, is the query close to some element of the set, under some metric and some notion of close.

• Applications:

– DNA matching

– Virus/worm matching

– Databases

• Some initial results [Kirsch, Mitzenmacher]. Hard.

Page 21: Tutorial 9 (bloom filters)

Extension: Bloomier Filter

• Bloom filters handle set membership.

• Counters to handle multi-set/count tracking.

• Bloomier filter [Chazelle, Kilian, Rubinfeld, Tal]:

– Extend to handle approximate functions.

– Each element of set has associated function value.

– Non-set elements should return null.

– Want to always return correct function value for set elements.

– A false positive returns a non-null function value for a non-set element.