Pre-diploma internship ASYLBEK OTAROV

Advanced Data Structures for information search

Dec 03, 2014


nar_zack1

Agenda
• Introduction
• A Membership Query Problem
• What Is a Bloom Filter
• Windows Application in C# (showing how the Bloom filter algorithm works)
• Conclusion

Introduction
• The place of practice: Almaty, AO “KBTU”
• Date of practice: January 13, 2014 – February 7, 2014
• Instructor: Eliusizov Damir
• Supervisor: Eliusizov Damir

Problem Description

Given an element E, determine whether it belongs to a large set of elements S. The solution should be:
• As fast as possible
• As small as possible

Some Solutions
• Hash table: fast, but a large data structure
• Bitmap index: a smaller data structure than a hash table

Tradeoff solutions
• To obtain speed and size improvements, allow some probability of error.

Bloom Filter

Bloom Filter
• A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970.

An empty Bloom filter is a bit array of m bits, all set to 0.

Index: 0 1 2 3 4 … m-1
Bits:  0 0 0 0 0 … 0

There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution.
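One common way to obtain k array positions without writing k unrelated hash functions is double hashing, where position i is computed as (h1 + i * h2) mod m. The sketch below assumes string keys and uses 32-bit FNV-1a for both base hashes; the names `BloomHashing`, `Fnv1a`, and `Positions` are hypothetical, and the way the second hash is seeded is an illustrative choice rather than anything taken from the slides.

```csharp
using System;
using System.Text;

public static class BloomHashing
{
    // 32-bit FNV-1a over the UTF-8 bytes of the key, starting from the given basis.
    private static uint Fnv1a(string key, uint basis)
    {
        uint hash = basis;
        foreach (byte b in Encoding.UTF8.GetBytes(key))
        {
            unchecked
            {
                hash ^= b;          // mix in the next byte
                hash *= 16777619;   // 32-bit FNV prime, wraps on overflow
            }
        }
        return hash;
    }

    // Returns k positions in [0, m) for the key (assumes m > 0 and k > 0).
    public static int[] Positions(string key, int m, int k)
    {
        uint h1 = Fnv1a(key, 2166136261);   // standard FNV offset basis
        uint h2 = Fnv1a(key, h1) | 1;       // assumption: reseed with h1, force a nonzero step

        var positions = new int[k];
        for (int i = 0; i < k; i++)
        {
            ulong combined = (ulong)h1 + (ulong)i * h2;   // cannot overflow 64 bits
            positions[i] = (int)(combined % (ulong)m);
        }
        return positions;
    }
}
```

Calling `BloomHashing.Positions("apple", 23, 3)` yields three indices into a 23-bit array; the same key always yields the same indices, which is what both Add and Test rely on.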

To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1.

HF1(x) = index
HF2(x) = index
…
HFk(x) = index

To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions is 0, the element is definitely not in the set; if it were, all of those bits would have been set to 1 when it was added. If all of the bits are 1, the element is reported as being in the set.

A Bloom filter has two operations: Add and Test.
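The Add and Test operations described above can be sketched as a small class. This is a minimal illustration and not the code of the Windows application: the class name `SimpleBloomFilter` is hypothetical, and the k hash functions are simulated by running an FNV-1a style hash from k different starting values, which is an assumption made to keep the example short.

```csharp
using System;
using System.Collections;

public sealed class SimpleBloomFilter
{
    private readonly BitArray bits;   // m bits, all initially 0 (false)
    private readonly int k;           // number of hash functions

    public SimpleBloomFilter(int m, int k)
    {
        bits = new BitArray(m);
        this.k = k;
    }

    // Position given by the i-th hash function (assumption: one FNV-1a
    // style hash whose starting value is shifted by a per-function offset).
    private int Position(string element, int i)
    {
        unchecked
        {
            uint hash = 2166136261u + (uint)i * 0x9E3779B9u;
            foreach (char c in element)
            {
                hash ^= c;
                hash *= 16777619u;
            }
            return (int)(hash % (uint)bits.Length);
        }
    }

    // Add: set the bits at all k positions to 1.
    public void Add(string element)
    {
        for (int i = 0; i < k; i++)
            bits[Position(element, i)] = true;
    }

    // Test: false means "definitely not in the set";
    // true means "in the set, or a false positive".
    public bool Test(string element)
    {
        for (int i = 0; i < k; i++)
            if (!bits[Position(element, i)]) return false;
        return true;
    }
}
```

After `filter.Add("apple")`, `filter.Test("apple")` always returns true, while a filter with no elements added returns false for every query.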

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.

Hash table: chance of collision
• A collision is a situation that occurs when two distinct pieces of data have the same hash value, checksum, fingerprint, or cryptographic digest.

x: an element added to the hash table
y: an element added to the hash table
F(x) and G(x): hash functions

F(x) = 2, G(x) = 1
F(y) = 2, G(y) = 1

Index: 0 1 2 3 4 … M-1
Bits:  0 0 1 0 1 … 0

• False positives are possible.
• False negatives are not possible.

If all the bits at an element's k positions are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of other elements, resulting in a false positive.
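How often such false positives occur can be estimated with the standard approximation (1 - e^(-kn/m))^k, where m is the number of bits and k the number of hash functions as above, and n (a symbol not used in the slides) is the number of elements inserted so far. The helper below is a sketch under the usual assumption that the hash functions behave as if uniformly random; the names `BloomMath`, `FalsePositiveRate`, and `OptimalK` are hypothetical.

```csharp
using System;

public static class BloomMath
{
    // Approximate probability that Test returns true for an element that
    // was never added, after n insertions into m bits with k hash functions.
    public static double FalsePositiveRate(int m, int k, int n)
    {
        double fractionOfBitsSet = 1.0 - Math.Exp(-(double)k * n / m);
        return Math.Pow(fractionOfBitsSet, k);
    }

    // Number of hash functions that minimizes the rate above: (m / n) * ln 2,
    // rounded to the nearest integer and kept at 1 or more.
    public static int OptimalK(int m, int n)
    {
        return Math.Max(1, (int)Math.Round((double)m / n * Math.Log(2.0)));
    }
}
```

For example, with m = 1000 bits and n = 100 elements, `OptimalK` gives 7, and the estimated false positive rate with k = 7 is a little under 1%.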

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.

• It is not a key-value store.
• It is an array of bits indicating the presence of a key in the filter.
• Removing an element from the filter is not possible, because clearing its bits could also clear bits shared with other elements.

Bloom Filter: Usage

Google Bigtable and Apache Cassandra use Bloom filters to reduce disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.

The Google Chrome web browser uses a Bloom filter to identify malicious URLs. Any URL is first checked against a local Bloom filter, and a full check of the URL is performed only upon a hit.
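Both usages follow the same pattern: ask the cheap in-memory filter first, and pay for the expensive check only when the filter reports a possible hit. The sketch below illustrates that pattern in isolation. It is not the API of Bigtable, Cassandra, or Chrome; `FilteredStore`, `mightContain`, and `slowLookup` are hypothetical names, with `mightContain` standing in for a Bloom filter's Test operation and `slowLookup` for a disk read or remote URL check.

```csharp
using System;

// Hypothetical wrapper that consults a membership test before a slow lookup.
public sealed class FilteredStore
{
    private readonly Func<string, bool> mightContain;   // e.g. a Bloom filter's Test
    private readonly Func<string, string> slowLookup;   // e.g. a disk read; null if absent

    // Counts how many times the expensive lookup was actually performed.
    public int SlowLookups { get; private set; }

    public FilteredStore(Func<string, bool> mightContain, Func<string, string> slowLookup)
    {
        this.mightContain = mightContain;
        this.slowLookup = slowLookup;
    }

    public string Get(string key)
    {
        // A negative answer from the filter is definite, so skip the slow path.
        if (!mightContain(key)) return null;

        // A positive answer may be a false positive, so the full check still
        // decides; it can legitimately come back empty.
        SlowLookups++;
        return slowLookup(key);
    }
}
```

The saving comes from the first branch: every key the filter rejects costs no slow lookup at all, while false positives only cost one unnecessary lookup and never a wrong answer.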

Windows Application in C#

Add element

Check element

Check element

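Hash function in C#

    using System;

    // FNV-style string hash reduced to a bit-array position in [0, 23).
    // XOR-then-multiply is the FNV-1a ordering; the name FNV1 is kept from the slide.
    public static UInt32 FNV1(string offset_str)
    {
        const UInt32 FNV_prime = 16777619;    // standard 32-bit FNV prime
        UInt32 offset_basis = 2166136261;     // standard 32-bit FNV offset basis

        foreach (char c in offset_str)
        {
            offset_basis ^= c;                                    // mix in the next character
            offset_basis = unchecked(offset_basis * FNV_prime);   // wrap on 32-bit overflow
        }

        // Reduce once at the end; 23 is presumably the demo's bit-array size.
        return offset_basis % 23;
    }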

Conclusion

This internship gave me a lot of experience that I can use in the future. As a student in the Department of Computer Engineering majoring in Information Systems, I can definitely say that this internship helped me understand the concepts of information systems more deeply, because I participated in a project where information systems play a big role in connecting two sides: people and organizations. I also learned a lot of tricks related to object-oriented programming.
