Advanced Data Structures for information search

Pre-diploma internship

ASYLBEK OTAROV


Agenda

• Introduction

• A Membership Query Problem

• What is a Bloom filter

• Windows application in C# (showing how the Bloom filter algorithm works)

• Conclusion

Introduction

• Place of practice: Almaty, AO “KBTU”

• Date of practice: January 13, 2014 – February 7, 2014

• Instructor: Eliusizov Damir

• Supervisor: Eliusizov Damir

Problem Description

Given an element E, query whether it belongs to a large set of elements S.

• As fast as possible

• As small as possible

Some Solutions

• Hash table: fast, but a big data structure

• Bitmap index: a smaller data structure than a hash table

Tradeoff solutions

• To obtain speed and size improvements, allow some probability of error.

Bloom Filter

• A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970.

An empty Bloom filter is a bit array of m bits, all set to 0.

Index: 0 1 2 3 4 … m-1

There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution.
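One common way to obtain k index functions without writing k unrelated hashes is to combine two base hashes as h1 + i·h2, the double-hashing scheme analysed by Kirsch and Mitzenmacher in "Less Hashing, Same Performance". The sketch below assumes that scheme rather than describing the original application; the names `IndexSketch`, `Fnv1a` and `Positions`, and the second seed value, are illustrative choices.

```csharp
using System;

public static class IndexSketch
{
    // 32-bit FNV-1a over the UTF-16 chars of the string; seed is the starting basis.
    public static uint Fnv1a(string s, uint seed)
    {
        uint h = seed;
        foreach (char c in s)
        {
            h ^= c;
            h = unchecked(h * 16777619u);   // 32-bit FNV prime, wraps modulo 2^32
        }
        return h;
    }

    // Returns k positions in [0, m) for element x, computed as (h1 + i * h2) mod m.
    public static int[] Positions(string x, int k, int m)
    {
        uint h1 = Fnv1a(x, 2166136261u);        // standard FNV offset basis
        uint h2 = Fnv1a(x, 0x9747B28Cu) | 1u;   // arbitrary second seed (assumption); forced odd so it is never 0
        var result = new int[k];
        for (int i = 0; i < k; i++)
        {
            ulong combined = h1 + (ulong)i * h2;   // 64-bit arithmetic avoids overflow
            result[i] = (int)(combined % (ulong)m);
        }
        return result;
    }
}
```

For example, `IndexSketch.Positions("hello", 3, 23)` yields three positions between 0 and 22, and the same input always yields the same positions.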

Bits: 0 0 0 0 0 0 …

To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1.

HF1(x) = index, HF2(x) = index, …, HFk(x) = index

To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions is 0, the element is definitely not in the set: if it were, all of those bits would have been set to 1 when it was added. If all of the bits are 1, the element is probably in the set.

A Bloom filter has two operations: Add and Test.
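The two operations can be sketched in a few lines of C#. This is a minimal illustration, not the code of the demo application: the class name `BloomFilterSketch` is hypothetical, and the k hash functions are assumed to be FNV-1a runs started from k different seeds.

```csharp
using System;
using System.Collections;

public class BloomFilterSketch
{
    private readonly BitArray bits;   // m bits, all false (0) initially
    private readonly int k;           // number of hash functions

    public BloomFilterSketch(int m, int k)
    {
        bits = new BitArray(m);
        this.k = k;
    }

    // i-th hash function: FNV-1a started from a different seed for each i (assumption).
    private int Index(string item, int i)
    {
        uint h = unchecked(2166136261u + (uint)i * 0x9E3779B1u);
        foreach (char c in item)
        {
            h ^= c;
            h = unchecked(h * 16777619u);
        }
        return (int)(h % (uint)bits.Length);
    }

    // Add: set the bit at each of the k positions to 1.
    public void Add(string item)
    {
        for (int i = 0; i < k; i++) bits[Index(item, i)] = true;
    }

    // Test: false means "definitely not in the set"; true means "probably in the set".
    public bool Test(string item)
    {
        for (int i = 0; i < k; i++)
            if (!bits[Index(item, i)]) return false;
        return true;
    }
}
```

After `var f = new BloomFilterSketch(1024, 3); f.Add("apple");`, the call `f.Test("apple")` always returns true, while `f.Test("pear")` returns false unless all three of its bits happen to be set already.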

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.

Hash table: chance of collision

• A collision is a situation that occurs when two distinct pieces of data have the same hash value, checksum, fingerprint, or cryptographic digest.

x is an element that is added to the hash table.

y is another element that is added to the hash table.

F and G are hash functions.

Index: 0 1 2 3 4 … M-1

Bits: 0 0 1 0 1 0 …

F(x) = 2, G(x) = 1, F(y) = 2, G(y) = 1

• False positives are possible

• False negatives are not possible

If all of the bits at the k positions are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of other elements, resulting in a false positive.
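The chance of such a false positive can be estimated with the standard textbook approximation. It assumes the k hash functions are independent and uniform, and it introduces n for the number of elements already inserted, a symbol the slides do not define.

```latex
% Probability that one given bit is still 0 after inserting n elements with k hashes each:
\Pr[\text{bit} = 0] = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}

% A false positive needs all k probed bits to be 1:
p \approx \left(1 - e^{-kn/m}\right)^{k}

% For fixed m and n, p is minimised near
k = \frac{m}{n}\,\ln 2

% Worked example: m = 1000 bits, n = 100 elements, k = 7
p \approx \left(1 - e^{-0.7}\right)^{7} \approx 0.5034^{7} \approx 0.0082
```

So with ten bits per stored element and seven hash functions, fewer than 1% of queries for absent elements would be expected to return a false positive.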

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set:

• Not a Key-Value store

• Array of bits indicating the presence of a key in the filter.

• Removing an element from the filter is not possible

Bloom Filter: Usage

Google Bigtable and Apache Cassandra use Bloom filters to reduce disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.

The Google Chrome web browser uses a Bloom filter to identify malicious URLs. Every URL is first checked against a local Bloom filter, and a full check of the URL is performed only on a hit.
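The "check the filter first, do the expensive lookup only on a hit" pattern can be sketched as follows. This illustrates the idea only and is not how Chrome, Bigtable or Cassandra implement it: the class `UrlCheckerSketch` is hypothetical, a `HashSet` stands in for the expensive full check, and two seeded FNV-1a hashes are assumed as the hash functions.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public class UrlCheckerSketch
{
    private readonly BitArray filter = new BitArray(4096);               // local Bloom filter
    private readonly HashSet<string> fullList = new HashSet<string>();   // stand-in for the expensive full check

    // Number of times the expensive path was taken.
    public int FullChecks { get; private set; }

    private int Index(string url, uint seed)
    {
        uint h = seed;
        foreach (char c in url) { h ^= c; h = unchecked(h * 16777619u); }
        return (int)(h % (uint)filter.Length);
    }

    public void AddMalicious(string url)
    {
        fullList.Add(url);
        filter[Index(url, 2166136261u)] = true;
        filter[Index(url, 0x9747B28Cu)] = true;
    }

    public bool IsMalicious(string url)
    {
        // A 0 bit means "definitely not listed": skip the expensive check.
        if (!filter[Index(url, 2166136261u)] || !filter[Index(url, 0x9747B28Cu)])
            return false;

        // Possible hit (may be a false positive): confirm with the full check.
        FullChecks++;
        return fullList.Contains(url);
    }
}
```

Because the full check runs after every filter hit, a false positive costs only one unnecessary lookup and never produces a wrong answer.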

Windows Application in C#

Add element

Check element


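Hash function in C#

    // 32-bit FNV hash of a string, reduced to an index in a 23-slot bit array.
    // XOR-then-multiply is the FNV-1a ordering; the name FNV1 is kept from the original.
    public static UInt32 FNV1(string offset_str)
    {
        UInt32 FNV_prime = 16777619;        // standard 32-bit FNV prime
        UInt32 offset_basis = 2166136261;   // standard 32-bit FNV offset basis

        for (int i = 0; i < offset_str.Length; i++)
        {
            offset_basis = offset_basis ^ offset_str[i];
            offset_basis = unchecked(offset_basis * FNV_prime);   // wraps modulo 2^32
        }

        // Reduce once at the end; 23 is presumably the bit-array size m of the demo.
        return offset_basis % 23;
    }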

Conclusion

This internship gave me a lot of experience that I can use in the future. As a student of the Department of Computer Engineering majoring in Information Systems, I can say that this internship helped me understand the concepts of information systems more deeply, because I participated in a project where information systems play a big role in connecting two sides: people and organizations. I also learned many techniques related to object-oriented programming.



