1 Bloom Filter [email protected] 2011-11-18

Bloom filter

Dec 09, 2014



wangp1988

A Bloom filter is a data structure that supports very fast membership queries and needs very compact storage.
Transcript
Page 1: Bloom filter


Bloom Filter

[email protected]

2011-11-18

Page 2: Bloom filter

Agenda

• A Membership Query Problem
• What is Bloom Filter
• Bloom Filter Math Theory
• Compression
• Application Scenario

Page 3: Bloom filter

Membership Query Problem

Problem Description

Given an element E, determine whether it belongs to a large set S.

– As fast as possible
– As small as possible

Page 4: Bloom filter

Membership Query Problem

Some Solutions

Hash table
– fast, but a big data structure

Bitmap index
– can it be smaller?

Page 5: Bloom filter

Membership Query Problem

Tradeoff Solutions

To obtain speed and size improvements, allow some probability of error.

Bloom Filter

Page 6: Bloom filter

What is Bloom Filter

Support approximate set membership: given a set S = {x1, x2, …, xn}, construct a data structure to answer queries of the form “Is y in S?”

The data structure should be:
– Fast (faster than searching through S).
– Small (smaller than an explicit representation).

To obtain speed and size improvements, allow some probability of error.
– False positives: y ∉ S but we report y ∈ S
– False negatives: y ∈ S but we report y ∉ S

Page 7: Bloom filter

What is Bloom Filter

Start with an m-bit array B, filled with 0s.

B: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Hash each item xj in S k times. If Hi(xj) = a, set B[a] = 1.

B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

To check if y is in S, check B at Hi(y). All k values must be 1.

B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

It is possible to have a false positive: all k values are 1, but y is not in S.

B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

n items, m = cn bits, k hash functions
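The insert and query steps on this slide can be sketched in a few lines of Python. This is only an illustration, not the presenter's implementation: the class name `BloomFilter`, the choice of SHA-256 salted with the index i as the hash family Hi, and the sizes m = 16, k = 3 are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array B and k hash functions H_i."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        # One byte per bit keeps the sketch readable; a real filter packs bits.
        self.bits = bytearray(m)

    def _positions(self, item: str):
        # Assumption: H_i(x) is SHA-256 of "i:x", reduced modulo m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # If H_i(x) = a, set B[a] = 1.
        for a in self._positions(item):
            self.bits[a] = 1

    def __contains__(self, item: str) -> bool:
        # All k probed bits must be 1 for a "yes" answer.
        return all(self.bits[a] for a in self._positions(item))


bf = BloomFilter(m=16, k=3)
for x in ["x1", "x2", "x3"]:
    bf.add(x)

print("x1" in bf)  # True: an inserted item is always reported as present
print("y" in bf)   # usually False, but may be True (a false positive)
```

Because bits are only ever set and never cleared, this construction cannot produce a false negative; the only error it can make is a false positive.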

Page 8: Bloom filter

What is Bloom Filter

False Positive

[Figure: false-positive illustration. Labels: A, B, hash1, hash2, hash3. Bit array: 0 0 1 0 1 0 0 0 0 1 0]

Page 9: Bloom filter

Bloom Filter Math Theory

Pr(specific bit of filter is 0) is

$p' = (1 - 1/m)^{kn} \approx e^{-kn/m} = p$

If $\rho$ is the fraction of 0 bits in the filter, then the false positive probability is

$(1 - \rho)^k \approx (1 - p')^k \approx (1 - p)^k = (1 - e^{-k/c})^k$

Approximations are valid as $\rho$ is concentrated around $E[\rho]$.
– Martingale argument suffices.

Find the optimum at k = (ln 2) m/n by calculus.
– So the optimal false positive probability is about $(0.6185)^{m/n}$.

n items, m = cn bits, k hash functions
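The "by calculus" step can be spelled out as follows. This is a sketch under two assumptions: the approximation $f \approx (1 - e^{-kn/m})^k$ is used in place of the exact expression, and k is treated as a continuous variable.

```latex
\begin{align*}
f(k) &= \left(1 - e^{-kn/m}\right)^{k},
  \qquad p = e^{-kn/m} \;\Rightarrow\; k = -\frac{m}{n}\ln p \\
\ln f &= k \ln(1-p) = -\frac{m}{n}\,\ln p \,\ln(1-p) \\
\frac{d}{dp}\bigl[\ln p \,\ln(1-p)\bigr]
  &= \frac{\ln(1-p)}{p} - \frac{\ln p}{1-p} = 0
  \quad\text{at } p = \tfrac12 \\
p = \tfrac12 &\;\Rightarrow\; k = \frac{m}{n}\ln 2, \qquad
f = \left(\tfrac12\right)^{(\ln 2)\,m/n} = e^{-(\ln 2)^2\, m/n} \approx (0.6185)^{m/n}
\end{align*}
```

Minimizing f means maximizing $\ln p \,\ln(1-p)$, which is symmetric under $p \leftrightarrow 1-p$ and peaks at $p = 1/2$: at the optimum, half of the bits are still 0.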

Page 10: Bloom filter

Bloom Filter Math Theory

[Figure: false positive rate (y-axis, 0 to 0.1) versus number of hash functions (x-axis, 0 to 10) for m/n = 8. Optimal k = 8 ln 2 = 5.545...]

n items, m = cn bits, k hash functions
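The curve on this slide can be reproduced numerically. The sketch below assumes the approximation f = (1 - e^(-k/c))^k with c = m/n; the function name `false_positive_rate` is chosen for this example only.

```python
import math


def false_positive_rate(bits_per_item: float, k: float) -> float:
    """Approximate false positive rate f = (1 - e^(-k/c))^k, with c = m/n."""
    return (1.0 - math.exp(-k / bits_per_item)) ** k


c = 8  # m/n = 8 bits per item, as on the slide
k_opt = c * math.log(2)  # real-valued optimum k = (ln 2) m/n
print(f"optimal k (real-valued) = {k_opt:.3f}")

# Tabulate f for the integer values of k shown on the x-axis.
for k in range(1, 11):
    print(f"k = {k:2d}  f = {false_positive_rate(c, k):.4f}")
```

Since k must be an integer in practice, the usable choice is one of the two integers on either side of the real-valued optimum.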

Page 11: Bloom filter

Bloom Filter Compression

Use BF on Network Transmission

A BF sent as a message should be small enough to be transmitted over the network.

Compressing a bit vector is easy:
– Arithmetic coding gets close to entropy.

Can Bloom filters be compressed?

Page 12: Bloom filter

Bloom Filter Compression

$p = \Pr[\text{cell is empty}] = (1 - 1/m)^{kn} \approx e^{-kn/m}$

$f = \Pr[\text{false pos}] = (1 - p)^k \approx (1 - e^{-kn/m})^k$

• Optimize to minimize false positives: $k = (\ln 2)\, m/n$.
• At k = m (ln 2)/n, p = 1/2.
• The Bloom filter looks like a random string.
– Can’t compress it.
– $H(p) = -p \log_2 p - (1-p) \log_2 (1-p)$
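A short numeric check makes the "random string" point concrete: at the optimal k the empty-cell probability is 1/2 and the entropy per bit is 1, so no bits can be saved. The sizes m = 8000 and n = 1000 below are hypothetical, picked only to give m/n = 8.

```python
import math


def binary_entropy(p: float) -> float:
    """H(p) = -p log2 p - (1-p) log2(1-p), with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


m, n = 8000, 1000  # hypothetical sizes with m/n = 8
for k in (1, 2, 4, m * math.log(2) / n):
    p = math.exp(-k * n / m)  # approximate probability that a cell is empty
    print(f"k = {k:.3f}  p = {p:.3f}  H(p) = {binary_entropy(p):.3f}")
```

For k below the optimum, p moves away from 1/2 and H(p) drops below 1, which is the slack that compression can exploit.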

Page 13: Bloom filter

Bloom Filter Compression

With a larger decompressed size (storage), we can achieve compression.

• Assumption: optimal compressor, z = mH(p).
– H(p) is the entropy function; optimally get H(p) compressed bits per original table bit.
– Arithmetic coding is close to optimal.
• Optimization: given z bits for the compressed filter and n elements, choose table size m and number of hash functions k to minimize f.

$p = e^{-kn/m}; \quad f = (1 - e^{-kn/m})^k; \quad z = mH(p)$
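The optimization on this slide can be explored numerically. The sketch below fixes a budget of z/n = 8 compressed bits per item, and for each k solves z = mH(p) for the table size by bisection. It assumes an ideal compressor and that c·H(e^(-k/c)) increases with c = m/n on the search interval; all function names are invented for this example.

```python
import math


def binary_entropy(p: float) -> float:
    """H(p) = -p log2 p - (1-p) log2(1-p), with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def compressed_bits_per_item(c: float, k: float) -> float:
    """z/n = (m/n) * H(p), with p = e^(-k/c) and c = m/n."""
    return c * binary_entropy(math.exp(-k / c))


def table_size_for_budget(z_per_item: float, k: float) -> float:
    """Find c = m/n such that c * H(e^(-k/c)) = z/n, by bisection."""
    lo, hi = 1e-3, 1e6  # assumed search interval for m/n
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if compressed_bits_per_item(mid, k) < z_per_item:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def false_positive_rate(c: float, k: float) -> float:
    """f = (1 - e^(-k/c))^k."""
    return (1.0 - math.exp(-k / c)) ** k


z_per_item = 8.0
for k in range(1, 7):
    c = table_size_for_budget(z_per_item, k)
    print(f"k = {k}  m/n = {c:8.2f}  f = {false_positive_rate(c, k):.4f}")
```

Under these assumptions, a small k paired with a much larger (but sparse and therefore compressible) table gives a lower false positive rate than the uncompressed filter with the same number of transmitted bits.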

Page 14: Bloom filter

Bloom Filter Compression

[Figure: false positive rate (y-axis, 0 to 0.1) versus number of hash functions (x-axis, 0 to 10) for z/n = 8, comparing the original and the compressed Bloom filter.]

Page 15: Bloom filter

Bloom Filter Compression

Conclusion

• At k = m (ln 2)/n, false positives are maximized with a compressed Bloom filter.
– Best case without compression is worst case with compression; compression always helps.
– Side benefit: use fewer hash functions with compression; possible speedup.

Page 16: Bloom filter

Application Scenario

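Speed up answers in a key-value-like system

[Diagram: each lookup goes through filter (memory) before reaching storage (memory)]

key1 → filter: no → no disk access
key2 → filter: yes → disk access → success
key3 → filter: yes → disk access → fail (false positive)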
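The three lookups in this diagram can be sketched as a small Python wrapper that checks an in-memory filter before touching slow storage. Everything here is an assumption made for illustration: the class names `BloomFilter` and `FilteredStore`, the SHA-256 based hashing, the sizes m = 1024 and k = 3, and the use of a plain dict as a stand-in for the disk.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter with k salted SHA-256 hashes (an assumption)."""

    def __init__(self, m: int, k: int) -> None:
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for readability

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for a in self._positions(item):
            self.bits[a] = 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[a] for a in self._positions(item))


class FilteredStore:
    """Key-value store that consults the filter before a 'disk' access."""

    def __init__(self, m: int, k: int) -> None:
        self.filter = BloomFilter(m, k)
        self.disk = {}          # stand-in for slow on-disk storage
        self.disk_accesses = 0  # counts how often the slow path is taken

    def put(self, key: str, value: str) -> None:
        self.filter.add(key)
        self.disk[key] = value

    def get(self, key: str):
        if key not in self.filter:
            return None            # filter says no: skip the disk entirely
        self.disk_accesses += 1    # filter says yes: pay for a disk access
        return self.disk.get(key)  # None here means a false positive


store = FilteredStore(m=1024, k=3)
store.put("key2", "value2")
print(store.get("key2"), store.disk_accesses)  # stored key: one disk access
print(store.get("key1"), store.disk_accesses)  # absent key: usually filtered out
```

The saving comes from the first branch of `get`: every absent key that the filter rejects costs no disk access, and only false positives pay for a wasted one.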

Page 17: Bloom filter

Application Scenario

Web Cache

[Diagram: cache1, cache2, cache3, … in front of a Web Server]

Page 18: Bloom filter

Q&A

