Three Cool Algorithms You’ve Never Heard Of! Carey Nachenberg [email protected]

Three Cool Algorithms You’ve Never Heard Of!

Feb 26, 2016

Transcript
Page 1: Three Cool  Algorithms You’ve  Never Heard Of!

Three Cool Algorithms You’ve Never Heard Of!

Carey Nachenberg [email protected]

Page 2: Three Cool  Algorithms You’ve  Never Heard Of!

Cool Data Structure: The Metric Tree

[Figure: a metric tree of cities. Each node holds a city and a distance threshold: LA (1500km), Las Vegas (1000km), SF (100km), Austin (250km), NYC (1100km), Boston (400km), Atlanta (600km), New Orleans (300km), San Jose (200km), Merced (70km), Providence (200km). From each node, one branch leads to the cities within its threshold (e.g., <=1500km away from LA) and the other branch leads to the cities beyond it (e.g., >1500km away from LA).]

Page 3: Three Cool  Algorithms You’ve  Never Heard Of!

Challenge: Building a Fuzzy Spell Checker

Imagine you’re building a word processor and you want to implement a spell checker that gives suggestions…

[Figure: the user types “lobeky” and the word processor offers the suggestions lonely, lovely, locale…]

Of course it’s easy to tell the user that their word is misspelled…

Question: What data structure could we use to determine if a word is in a dictionary or not?

Right! A hash table or binary search tree could tell you if a word is spelled correctly.

But what if we want to efficiently provide the user with possible alternatives?

Page 4: Three Cool  Algorithms You’ve  Never Heard Of!

Providing Alternatives?

Before we can provide alternatives, we need a way to find close matches…

One useful tool for this is the “edit distance” metric.

Edit Distance: How many letters must be added, deleted or replaced to get from word A to word B.

lobeky -> lovely has an edit distance of 2 (replace b with v, replace k with l).

lobeky -> lowly has an edit distance of 3 (replace b with w, replace e with l, delete k).

So given the user’s misspelled word, and this edit distance function…

How can we use this to provide the user with spelling suggestions?
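The slides use e(x,y) without showing how it is computed. Here is a minimal sketch of one common way to do it, the dynamic-programming Levenshtein distance, where each add, delete or replace costs 1. The name editDistance is an assumption of this sketch, not something from the slides.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Edit distance between a and b: the fewest single-letter adds,
// deletes or replacements needed to turn a into b.
unsigned editDistance(const std::string& a, const std::string& b) {
    // prev[j] = distance between the first i-1 letters of a and the first j letters of b.
    std::vector<unsigned> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j)
        prev[j] = static_cast<unsigned>(j);          // build b's prefix from nothing
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = static_cast<unsigned>(i);           // delete all i letters of a
        for (std::size_t j = 1; j <= b.size(); ++j) {
            unsigned replaceCost = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0u : 1u);
            cur[j] = std::min({prev[j] + 1,          // delete a[i-1]
                               cur[j - 1] + 1,       // add b[j-1]
                               replaceCost});        // replace (or keep) the letter
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}
```

With this sketch, editDistance("lobeky", "lovely") gives 2 and editDistance("lobeky", "lowly") gives 3, matching the examples above.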

Page 5: Three Cool  Algorithms You’ve  Never Heard Of!

Providing Alternatives?

Well, we could take our misspelled word and compute its edit distance to every word in the dictionary!

[Figure: “lobeky” is compared against each dictionary word in turn: aardvark, ark, acorn, …, bone, bonfire, …, lonely, lonesome, …]

And then give the user all words with an edit distance of <=3…

But that’s really, really slow!

There’s a better way! But before we talk about it, let’s talk about edit distance a bit more…

Page 6: Three Cool  Algorithms You’ve  Never Heard Of!

Edit Distance

As it turns out, the edit distance function, e(x,y), is what we call a “metric distance function.”

What does that mean?

1. e(x,y) = e(y,x). The edit distance from “foo” to “food” is the same as from “food” to “foo”.

2. e(x,y) >= 0. You can never have a negative edit distance… Well that makes sense…

3. e(x,z) <= e(x,y) + e(y,z), aka “the triangle inequality”. It’s never cheaper to do two conversions than a direct conversion.

For example, going from “foo” to “goon” by way of “feed”:

e(“foo”,“feed”) = 3
e(“feed”,“goon”) = 4
Total cost: 7

That is more than the direct conversion: e(“foo”,“goon”) = 2

Page 7: Three Cool  Algorithms You’ve  Never Heard Of!

Metric Distance Functions

Given some word w (e.g., pier), let’s say I happen to know all words with an edit distance of 1 from that word…

[Figure: a “cloud” of words centered on pier: peer, tier, piper, pie, pies. The misspelled word zifs sits 3 edits away from pier.]

Now, if my misspelled word m (e.g., zifs) has an edit distance of 3 from w, what does that guarantee about m to these other words?

Right: If e(“zifs”,“pier”) is 3, and all these other words are exactly 1 edit away from pier…

Then by the triangle inequality, “zifs” must be at most 4 edits away from any word in this cloud!

e(“zifs”,“pier”) = 3
e(“pier”,“piper”) = 1
Total cost: 4

And directly: e(“zifs”,“piper”) = 4

But by the same reasoning, none of these words can be less than 2 edits away from “zifs”…

Why? Because we know that all of these words have at most one character difference from “pier”…

So if “pier” is 3 away from “zifs”, then in the best case these other words would be one letter closer to “zifs” (e.g., if one of pier’s letters was replaced by one of zifs’ letters)...

Let’s see: e(“zifs”,“pies”) = 2

Imagine if we had thousands of different clouds like this.

We could compare your misspelled word to the center word of each cloud. If e(m,w) is less than some threshold edit distance, then the cloud’s other words are good suggestions…

Page 8: Three Cool  Algorithms You’ve  Never Heard Of!

Metric Distance Functions

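[Figure: several clouds of words, each with a center word surrounded by the words 1 edit away from it, e.g., pier (peer, tier, piper, pie, pies) and gate (hate, rate, date, ate, gale), plus clouds centered on pencil, computer and table. The misspelled word zifs is connected to each center word by its edit distance.]

We could compare your misspelled word to the center word of each cloud. If e(m,w) is less than some threshold edit distance, then the cloud’s other words are good suggestions…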

Page 9: Three Cool  Algorithms You’ve  Never Heard Of!

A Better Way?

That works well, but then again, we’d still have to do thousands of comparisons (one to each cloud)…

Hmmm. Can we figure out a more efficient way to do this?

Say with log2(D) comparisons, where D is the number of words in your dictionary?

Duh… Well of course, we’ll need a tree!

Page 10: Three Cool  Algorithms You’ve  Never Heard Of!

The Metric Tree

The Metric Tree was invented in 1991 by Jeffrey Uhlmann of the Naval Research Labs.

Each node in a Metric Tree holds a word, an edit distance threshold value, and left and right next pointers.

struct MetricTreeNode
{
    string word;
    unsigned int editThreshold;
    MetricTreeNode *left, *right;
};

Let’s see how to build a Metric Tree! Building one is really slow, but once we build it, searching it is really fast!

Page 11: Three Cool  Algorithms You’ve  Never Heard Of!

The Metric Tree

main()
{
    Let S = {every word in the dictionary};
    Node *root = buildMTree(S);
}

Node *buildMTree(SetOfWords &S)

1. Pick a random word W from set S.
2. Compute the edit distance for all other words in set S to your random word W.
3. Sort all these words based on their edit distance di to your random word W.
4. Select the median value of di, let dmed be this median edit distance.
5. Now, create a root node N for our tree and put our word W in this node. Set its editThreshold value to dmed.
6. N->left = buildMTree(subset of S that is <= dmed)
7. N->right = buildMTree(subset of S that is > dmed)
8. return N

SetOfWords: goat, oyster, roster, hippo, toad, hamster, mouse, chicken, rooster
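The eight steps above can be sketched in C++ roughly as follows. This is only one possible reading of the slide: the editDistance helper, the use of std::unique_ptr, the rng parameter (so a run can be repeated) and the choice of the lower median are all assumptions of this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Levenshtein edit distance (add, delete, replace each cost 1).
unsigned editDistance(const std::string& a, const std::string& b) {
    std::vector<unsigned> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<unsigned>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = static_cast<unsigned>(i);
        for (std::size_t j = 1; j <= b.size(); ++j) {
            unsigned replaceCost = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0u : 1u);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, replaceCost});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

struct MetricTreeNode {
    std::string word;
    unsigned editThreshold = 0;
    std::unique_ptr<MetricTreeNode> left, right;
};

std::unique_ptr<MetricTreeNode> buildMTree(std::vector<std::string> S, std::mt19937& rng) {
    if (S.empty()) return nullptr;
    // Step 1: pick a random word W and take it out of S.
    std::uniform_int_distribution<std::size_t> pick(0, S.size() - 1);
    std::swap(S[pick(rng)], S.back());
    std::unique_ptr<MetricTreeNode> node(new MetricTreeNode());
    node->word = S.back();
    S.pop_back();
    if (S.empty()) return node;  // a leaf keeps editThreshold 0
    // Steps 2-3: compute every other word's distance to W and sort by it.
    std::vector<std::pair<unsigned, std::string>> byDist;
    for (const auto& w : S) byDist.emplace_back(editDistance(node->word, w), w);
    std::sort(byDist.begin(), byDist.end());
    // Steps 4-5: the (lower) median distance becomes this node's threshold.
    node->editThreshold = byDist[(byDist.size() - 1) / 2].first;
    // Steps 6-7: words within the threshold go left, the rest go right.
    std::vector<std::string> within, beyond;
    for (const auto& p : byDist)
        (p.first <= node->editThreshold ? within : beyond).push_back(p.second);
    node->left = buildMTree(std::move(within), rng);
    node->right = buildMTree(std::move(beyond), rng);
    return node;  // Step 8
}
```

Because W is removed from the set before recursing, every call works on a strictly smaller set, so the recursion always terminates.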

Page 12: Three Cool  Algorithms You’ve  Never Heard Of!

The Metric Tree

Example: suppose the random word W picked from the SetOfWords (goat, oyster, roster, hippo, toad, hamster, mouse, chicken, rooster) is rooster.

Edit distance of every other word to rooster, sorted:

roster 1
oyster 2
hamster 3
mouse 4
goat 6
toad 6
hippo 7
chicken 7

dmed = 4

So the root node holds “rooster” with an editThreshold of 4.

Page 13: Three Cool  Algorithms You’ve  Never Heard Of!

The Metric Tree

Next, N->left = buildMTree(subset of S that is <= dmed), i.e., the words within 4 edits of rooster: roster, oyster, hamster, mouse.

Suppose the random word picked this time is mouse.

dmed = 4

So this node holds “mouse” with an editThreshold of 4. Recursing again, its left subtree is the node “oyster” 4, whose left child is “roster” 0, and its right subtree is the leaf “hamster” 0.

Page 14: Three Cool  Algorithms You’ve  Never Heard Of!

The Metric Tree

Then, N->right = buildMTree(subset of S that is > dmed), i.e., the words more than 4 edits away from rooster: goat, toad, hippo, chicken.

Suppose the random word picked this time is toad.

Edit distance of every other word to toad, sorted:

goat 2
hippo 5
chicken 7

dmed = 5

So this node holds “toad” with an editThreshold of 5. Recursing again, its left subtree is the node “goat” 5, whose left child is “hippo” 0, and its right subtree is the leaf “chicken” 0.

Page 15: Three Cool  Algorithms You’ve  Never Heard Of!

The Metric Tree

[Figure: the finished tree. “rooster” 4 is the root. Its left subtree is “mouse” 4 (with “oyster” 4, “roster” 0 and “hamster” 0 below it) and its right subtree is “toad” 5 (with “goat” 5, “hippo” 0 and “chicken” 0 below it).]

Page 16: Three Cool  Algorithms You’ve  Never Heard Of!

A Metric Tree

So now we have a metric tree!

rooster 4
    left: mouse 4
        left: oyster 4
            left: roster 0
        right: hamster 0
    right: toad 5
        left: goat 5
            left: hippo 0
        right: chicken 0

How do we interpret it?

Every word to the left of rooster is guaranteed to be within 4 edits of it…

And every word to the right of rooster is guaranteed to be more than 4 edits away…

And this same structure is repeated recursively!

Page 17: Three Cool  Algorithms You’ve  Never Heard Of!

Searching

When you search a metric tree, you specify the word you’re looking for and an edit-distance radius, e.g., I want to find words within 2 edits of “roaster”.

[Figure: the metric tree rooted at rooster, with the query “roaster” and its radius of 2 drawn inside rooster’s edit threshold of 4.]

Starting at the root, there are three cases to consider:

1. Your word and its search radius are totally inside the edit threshold.

In this case, all of your matches are guaranteed to be in our left subtree…

Page 18: Three Cool  Algorithms You’ve  Never Heard Of!

Searching

[Figure: the same metric tree, with the query “goute” and its radius of 2 straddling rooster’s edit threshold of 4.]

2. Your word and its search radius are partially inside and partially outside the edit threshold.

In this case, some matches will be in our left subtree and some in our right subtree…

Page 19: Three Cool  Algorithms You’ve  Never Heard Of!

Searching

[Figure: the same metric tree, with the query “vhivken” and its radius of 2 drawn entirely outside rooster’s edit threshold of 4.]

3. Your word and its search radius are completely outside the edit threshold.

In this case, all matches will be in our right subtree.

Page 20: Three Cool  Algorithms You’ve  Never Heard Of!

Metric Tree: Search Algorithm*

*This is a slight simplification…

PrintMatches(Node *cur, string misspell, int rad)
{
    if cur is NULL
        then return
    if e(misspell,cur->word) <= rad
        then print the current word
    if e(misspell,cur->word) <= cur->editThreshold
        then PrintMatches(cur->left, misspell, rad)
    if e(misspell,cur->word) > cur->editThreshold
        then PrintMatches(cur->right, misspell, rad)
}

PrintMatches(root,“chomster”,2);

cur -> rooster:
e(“chomster”,“rooster”) = 3. So rooster is outside of chomster’s radius of 2. It’s not a close enough match to print…
Since 3 is less than our editThreshold of 4, let’s go left…

cur -> mouse:
e(“chomster”,“mouse”) = 5. So mouse is outside of chomster’s radius of 2. It’s not a close enough match to print…
Since 5 is greater than our editThreshold of 4, we won’t go left. We will go right.

cur -> hamster:
e(“chomster”,“hamster”) = 2. So hamster is inside of chomster’s radius of 2. We’ve got a match! Print hamster!

Page 21: Three Cool  Algorithms You’ve  Never Heard Of!

Other Metric Tree Applications

In addition to spell checking, the Metric Tree can be used with virtually any application where the items obey metric rules!

Pretty cool, huh? Here’s the full search algorithm from the original paper (without my earlier simplifications):

void PrintMatches(Node *cur, string misspell, int rad)
{
    if (cur == nullptr)
        return;
    int dist = e(cur->word, misspell);
    int thresh = cur->editThreshold;   // compare as signed ints: dist - rad can be negative
    if (dist <= rad)
        cout << cur->word;
    if (dist - rad <= thresh)
        PrintMatches(cur->left, misspell, rad);
    if (dist + rad >= thresh)
        PrintMatches(cur->right, misspell, rad);
}
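To make the full search concrete, here is a hedged, self-contained sketch that runs it on the example tree from the earlier pages. The helper names (editDistance, makeNode, findMatches, slideTree) are inventions of this sketch, and matches are collected into a vector instead of being printed so the result can be inspected.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Levenshtein edit distance (add, delete, replace each cost 1).
unsigned editDistance(const std::string& a, const std::string& b) {
    std::vector<unsigned> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<unsigned>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = static_cast<unsigned>(i);
        for (std::size_t j = 1; j <= b.size(); ++j) {
            unsigned replaceCost = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0u : 1u);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, replaceCost});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

struct MetricTreeNode {
    std::string word;
    unsigned editThreshold = 0;
    std::unique_ptr<MetricTreeNode> left, right;
};

// Hypothetical helper for building a node by hand.
std::unique_ptr<MetricTreeNode> makeNode(std::string word, unsigned threshold,
                                         std::unique_ptr<MetricTreeNode> left = nullptr,
                                         std::unique_ptr<MetricTreeNode> right = nullptr) {
    std::unique_ptr<MetricTreeNode> n(new MetricTreeNode());
    n->word = std::move(word);
    n->editThreshold = threshold;
    n->left = std::move(left);
    n->right = std::move(right);
    return n;
}

// Full search: visit a subtree whenever the query's radius can reach into it.
void findMatches(const MetricTreeNode* cur, const std::string& misspell, int rad,
                 std::vector<std::string>& out) {
    if (cur == nullptr) return;
    const int dist = static_cast<int>(editDistance(cur->word, misspell));
    const int thresh = static_cast<int>(cur->editThreshold);  // signed: dist - rad may be negative
    if (dist <= rad) out.push_back(cur->word);
    if (dist - rad <= thresh) findMatches(cur->left.get(), misspell, rad, out);
    if (dist + rad >= thresh) findMatches(cur->right.get(), misspell, rad, out);
}

// The example tree from the slides.
std::unique_ptr<MetricTreeNode> slideTree() {
    return makeNode("rooster", 4,
        makeNode("mouse", 4,
            makeNode("oyster", 4, makeNode("roster", 0)),
            makeNode("hamster", 0)),
        makeNode("toad", 5,
            makeNode("goat", 5, makeNode("hippo", 0)),
            makeNode("chicken", 0)));
}
```

Calling findMatches(slideTree().get(), "chomster", 2, out) leaves only "hamster" in out. Unlike the simplified version, the full search may descend into both subtrees of a node when the query's radius straddles the node's threshold.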

Page 22: Three Cool  Algorithms You’ve  Never Heard Of!

Challenge: Space-efficient Set Membership

There are many problems where we want to maintain a set S of items and then check if a new item X is in the set, e.g.:

“Is ‘carey nachenberg’ a student at UCLA?”
“Is the phone number ‘424-750-7519’ known to be used by a terrorist cell?”

So, what data structures could you use for this?

Right! Both hash tables and binary search trees allow you to:
1. Hold a bunch of items.
2. Quickly search through them to see if they hold an item X.

Page 23: Three Cool  Algorithms You’ve  Never Heard Of!

So what’s the problem!

Well, binary search trees and hash tables are memory hogs!

But if I JUST want to do two things:
1. Add new items to the set
2. Check if an item was previously added to a set

I can actually create a much more memory efficient data structure!

In other words, if I never need to:
1. Print the items of the set (after they’ve been added).
2. Enumerate each value in the set.
3. Erase items from the set.

Then we can do much better than our classic data structures!

Page 24: Three Cool  Algorithms You’ve  Never Heard Of!

But first… A hash primer*

* Not that kind of hash.

A hash function is a function, y=f(x), that takes an input x (like a string) and returns an output number y for that input.

The ideal hash function returns entirely different values for each different input, even if two inputs are almost identical:

int y,z;
y = idealHashFunction("carey");
cout << y;
z = idealHashFunction("cArey");
cout << z;

So even though these two strings are almost identical, a good hash function might return y=92629 and z=152.

Page 25: Three Cool  Algorithms You’ve  Never Heard Of!

Hash Functions

Here’s a not-so-good hash function. Can anyone figure out why?

int hashFunc(const string &name)
{
    int total = 0;
    for (size_t i = 0; i < name.length(); i++)
        total = total + name[i];
    return(total);
}

Right, because inputs that contain the same letters in a different order produce the same output:

int y, z;
y = hashFunc("bat");
z = hashFunc("tab");
// y == z!!!! BAD!

How can we fix this?

By changing our function! That’s a little better, although not great…

total = total + (name[i] * (i+1));
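To see both versions side by side, here is a small sketch. The name weightedHashFunc is an assumption of this sketch; the slide only shows the changed line.

```cpp
#include <cstddef>
#include <string>

// First version from the slide: sums the character codes, so letter order is ignored.
int hashFunc(const std::string& name) {
    int total = 0;
    for (std::size_t i = 0; i < name.length(); i++)
        total = total + name[i];
    return total;
}

// Second version: each character is weighted by its 1-based position.
int weightedHashFunc(const std::string& name) {
    int total = 0;
    for (std::size_t i = 0; i < name.length(); i++)
        total = total + name[i] * static_cast<int>(i + 1);
    return total;
}
```

With ASCII character codes, hashFunc gives the same value for "bat" and "tab", while weightedHashFunc gives 640 for "bat" and 604 for "tab". It is still "not great", though: "ad" and "cc" both hash to 297 with the weighted version.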

Page 26: Three Cool  Algorithms You’ve  Never Heard Of!

A Better Hash Function

The CRC or Cyclic Redundancy Check algorithm is an excellent hash function.

This function was designed to check network packets for corruption.

We won’t go into CRC’s details, but it’s a perfectly fine hashing algorithm…

Ok, so we have a good hash function, now what?

Page 27: Three Cool  Algorithms You’ve  Never Heard Of!

A Simple Set Membership Algorithm

Imagine that I know I want to store up to 1 million items in my set…

I could create an array of say… 100 million bits

And then do the following…

class SimpleSet
{
public:
    …
    void insertItem(const string &name)
    {
        unsigned slot = CRC(SEED, name);   // unsigned, so slot can never be negative
        slot = slot % 100000000;
        m_arr[slot] = 1;
    }
    bool isItemInSet(const string &name)
    {
        unsigned slot = CRC(SEED, name);
        slot = slot % 100000000;
        if (m_arr[slot] == 1)
            return(true);
        else
            return(false);
    }
private:
    BitArray m_arr[100000000];
};

int main()
{
    SimpleSet s;
    s.insertItem("Carey");
    s.insertItem("Flint");

    if (s.isItemInSet("Flint") == true)
        cout << "Flint’s in my set!";
}

[Figure: the bit array s starts out as all 0s. “Carey” hashes to 3000012131, so the bit in slot 12131 is set to 1. “Flint” maps to slot 9721, so that bit is set to 1. Looking up “Flint” checks slot 9721 again and finds a 1.]

Most hash functions require a seed (initialization) value to be passed in.

Here’s how it might be used:

unsigned CRC(unsigned seed, const string &s)
{
    unsigned crc = seed;
    for (size_t i = 0; i < s.length(); i++)
        crc = ((crc >> 8) & CONST1) ^ crcTable[(crc ^ s[i]) & CONST2];
    return(crc);
}

Typically you’d use a seed value of 0xFFFFFFFF with CRC.

But you can change the seed if you like; this results in a (much) different hash value, even for the same input!
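For readers who want to run this, here is a minimal sketch of the simple set. The slides do not give crcTable, CONST1 or CONST2, so this sketch assumes the widely used CRC-32 table (reflected polynomial 0xEDB88320) with no final inversion. The names makeCrcTable and seededCrc, the std::vector<bool> storage and the constructor's size parameter are also assumptions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Builds the 256-entry lookup table for CRC-32 (reflected polynomial 0xEDB88320).
std::array<std::uint32_t, 256> makeCrcTable() {
    std::array<std::uint32_t, 256> table{};
    for (std::uint32_t i = 0; i < 256; ++i) {
        std::uint32_t c = i;
        for (int bit = 0; bit < 8; ++bit)
            c = (c & 1u) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        table[i] = c;
    }
    return table;
}

// Table-driven CRC update, one byte at a time, starting from the given seed.
std::uint32_t seededCrc(std::uint32_t seed, const std::string& s) {
    static const std::array<std::uint32_t, 256> table = makeCrcTable();
    std::uint32_t crc = seed;
    for (unsigned char ch : s)
        crc = (crc >> 8) ^ table[(crc ^ ch) & 0xFFu];
    return crc;
}

class SimpleSet {
public:
    // numBits must be greater than 0.
    explicit SimpleSet(std::size_t numBits) : m_bits(numBits, false) {}

    void insertItem(const std::string& name) { m_bits[slotFor(name)] = true; }

    // May return true for an item that was never added (a collision).
    bool isItemInSet(const std::string& name) const { return m_bits[slotFor(name)]; }

private:
    std::size_t slotFor(const std::string& name) const {
        return seededCrc(0xFFFFFFFFu, name) % m_bits.size();
    }
    std::vector<bool> m_bits;  // packed storage, roughly one bit per entry
};
```

A SimpleSet s(100000000) mirrors the 100 million bit array from the slide. Anything inserted is always reported as present; the only possible error is a false positive.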

Page 28: Three Cool  Algorithms You’ve  Never Heard Of!

A Simple Set Membership Algorithm

Ok, so what’s the problem with our SimpleSet?

Right! There’s a chance of collisions!

What if two names happen to hash right to the same slot?

int main()
{
    SimpleSet coolPeople;
    coolPeople.insertItem("Carey");

    if (coolPeople.isItemInSet("Paul"))
        cout << "Paul Agbabian is cool!";
}

[Figure: the coolPeople bit array starts out as all 0s. “Carey” hashes to 3000012131, so the bit in slot 12131 is set to 1. “Paul” hashes to 11000012131, which also maps to slot 12131, so isItemInSet("Paul") wrongly returns true.]

Page 29: Three Cool  Algorithms You’ve  Never Heard Of!

A Simple Set Membership Algorithm

Ack! If we put 1 million items in our 100 million entry array… we’ll have a collision rate of about 1%!

Actually, depending on your requirements, that might not be so bad…

Page 30: Three Cool  Algorithms You’ve  Never Heard Of!

A Simple Set Membership Algorithm

Our simple set can hold about 1M items in just 12.5MB of memory!

While it does have some false-positives, it’s much smaller than

a hash table or binary search tree…

But we’ve got to be able to do better… Right?

Right! That’s where the Bloom Filter comes in!

The Bloom Filter was invented by Burton Bloom in 1970.

Let’s take a look!

Page 31: Three Cool  Algorithms You’ve  Never Heard Of!

The Bloom Filter

In a Bloom Filter, we use an array of bits just like our original algorithm!

But instead of just using 1 hash function and setting just one bit for each insertion…

We use K hash functions, compute K hash values and set K bits!

const int K = 4;

class BloomFilter
{
public:
    …
    void insertItem(const string &name)
    {
        for (int i = 0; i < K; i++)
        {
            unsigned slot = CRC(i, name);
            slot = slot % 100000000;
            m_arr[slot] = 1;
        }
    }
private:
    BitArray m_arr[100000000];
};

int main()
{
    BloomFilter coolPeople;
    coolPeople.insertItem("Preston");
}

[Figure: the coolPeople bit array starts out as all 0s. Inserting “Preston” computes K = 4 hash values, one per seed, and sets the bit in each of the 4 resulting slots to 1 (e.g., the hash value 9000022531 maps to slot 22531).]

We’ll see how K is chosen in a bit. It’s a constant and its value is computed from:
1. The max # of items you want to add.
2. The size of the array.
3. Your desired false positive rate.

Notice that each time we call the CRC function, it starts with a different seed value:

unsigned CRC(unsigned seed, const string &s)
{
    unsigned crc = seed;
    for (size_t i = 0; i < s.length(); i++)
        crc = ((crc >> 8) & CONST1) ^ crcTable[(crc ^ s[i]) & CONST2];
    return(crc);
}

(Passing K different seed values is the same as using K different hash functions…)

Page 32: Three Cool  Algorithms You’ve  Never Heard Of!

The Bloom Filter

Now to search, we do the same thing!

bool isItemInSet(const string &name)
{
    for (int i = 0; i < K; i++)
    {
        unsigned slot = CRC(i, name);
        slot = slot % 100000000;
        if (m_arr[slot] == 0)
            return(false);
    }
    return(true);
}

Note: If any bit that we check is 0, then we have a miss…

Note: We only say an item is a member of the set if all K bits are set to 1.

int main()
{
    BloomFilter coolPeople;
    coolPeople.insertItem("Preston");

    if (coolPeople.isItemInSet("Carey"))
        cout << "I figured…";
}
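Putting insertion and search together, a runnable sketch of the Bloom Filter might look like the following. As with any code reconstructed from slides, several things here are assumptions: the CRC-32 table (reflected polynomial 0xEDB88320), the names makeCrcTable and seededCrc, the std::vector<bool> storage, and passing the bit count and K to the constructor. Seeded CRCs are also only loosely independent of each other, so a production filter would more likely use stronger hash functions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Builds the 256-entry lookup table for CRC-32 (reflected polynomial 0xEDB88320).
std::array<std::uint32_t, 256> makeCrcTable() {
    std::array<std::uint32_t, 256> table{};
    for (std::uint32_t i = 0; i < 256; ++i) {
        std::uint32_t c = i;
        for (int bit = 0; bit < 8; ++bit)
            c = (c & 1u) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        table[i] = c;
    }
    return table;
}

// Table-driven CRC update, one byte at a time, starting from the given seed.
std::uint32_t seededCrc(std::uint32_t seed, const std::string& s) {
    static const std::array<std::uint32_t, 256> table = makeCrcTable();
    std::uint32_t crc = seed;
    for (unsigned char ch : s)
        crc = (crc >> 8) ^ table[(crc ^ ch) & 0xFFu];
    return crc;
}

class BloomFilter {
public:
    // numBits must be greater than 0; numHashes is the K from the slides.
    BloomFilter(std::size_t numBits, unsigned numHashes)
        : m_bits(numBits, false), m_k(numHashes) {}

    // Set one bit per seed.
    void insertItem(const std::string& name) {
        for (unsigned i = 0; i < m_k; ++i)
            m_bits[slotFor(i, name)] = true;
    }

    // A single 0 bit proves the item was never added; all 1s means "probably added".
    bool isItemInSet(const std::string& name) const {
        for (unsigned i = 0; i < m_k; ++i)
            if (!m_bits[slotFor(i, name)])
                return false;
        return true;
    }

private:
    std::size_t slotFor(unsigned seed, const std::string& name) const {
        return seededCrc(seed, name) % m_bits.size();
    }
    std::vector<bool> m_bits;
    unsigned m_k;
};
```

BloomFilter coolPeople(100000000, 4) mirrors the slide's sizes. An inserted item is always found, so the filter has no false negatives, only occasional false positives.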

Page 33: Three Cool  Algorithms You’ve  Never Heard Of!

The Bloom Filter

Ok, so what’s the big deal? All we’re doing is checking K bits instead of 1?!!?

Well, it turns out that this dramatically reduces the false positive rate!

Ok… So the only questions are, how do we choose:
1. The size of our bit-array?
2. The value of K?

Let’s see!

Page 34: Three Cool  Algorithms You’ve  Never Heard Of!

The Bloom Filter

If you want to store N items in your Bloom Filter…

And you want a false positive rate of F (written as a fraction, e.g., 0.001 for .1%)...

You’ll want to have M bits in your bit array:

$$M = \frac{N \cdot \log(F)}{\log(0.6185)}$$

And you’ll want to use K different hash functions:

$$K = 0.7 \cdot \frac{M}{N}$$

Let’s see some stats!

To store N items   with this FP rate   use M bits (bytes)      and K hash fns
1M                 .1%                 14.4M bits (1.79MB)     10
100M               .001%               2.4B bits (299MB)       17
100M               .00001%             3.4B bits (419MB)       23

Now you’ve got to admit, that’s pretty efficient!

Of course, unlike a hash table, there is some chance of having a false positive…

But for many projects, this is not an issue, especially if you can guarantee a certain maximum level of FPs!

Now that’s COOL! And you’ve (hopefully) never heard about it!
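The two sizing formulas are easy to turn into code. The sketch below is only an illustration: the names BloomParams and sizeBloomFilter are made up here, F is taken as a fraction rather than a percentage, M is rounded up, and K is rounded to the nearest whole number.

```cpp
#include <cmath>
#include <cstdint>

struct BloomParams {
    std::uint64_t bits;   // M: size of the bit array
    unsigned hashes;      // K: number of hash functions
};

// n = number of items to store, f = desired false positive rate as a fraction (0 < f < 1).
BloomParams sizeBloomFilter(std::uint64_t n, double f) {
    // M / N = log(F) / log(0.6185); the ratio is the same for any log base.
    const double bitsPerItem = std::log(f) / std::log(0.6185);
    BloomParams p;
    p.bits = static_cast<std::uint64_t>(std::ceil(bitsPerItem * static_cast<double>(n)));
    p.hashes = static_cast<unsigned>(std::lround(0.7 * bitsPerItem));
    return p;
}
```

For example, sizeBloomFilter(1000000, 0.001) asks for roughly 14.4 million bits and K = 10, in line with the first row of the stats above.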

Page 35: Three Cool  Algorithms You’ve  Never Heard Of!

Challenge: Constant-time searching for similar items

(in a high-dimensional space)

Problem: I’ve got a large collection C of existing web-pages, and I want to determine if a new web-page P is a close match to any pages in my existing collection.

Obvious approach: I could iterate through all C of my existing pages and do a pair-wise comparison of page P to each page.

But that’s inefficient!

So how can we do it faster?

Page 36: Three Cool  Algorithms You’ve  Never Heard Of!

Answer: Use Locality Sensitive Hashing!

LSH has two operations:

Inserting items into the hash table:

We add a bunch of items (e.g., web pages) into a locality-sensitive hash table

Given an item, find closely-related items in the hash table:

Once we have a filled locality-sensitive hash table, we want to search it for a new item and see if it

contains anything similar.

Page 37: Three Cool  Algorithms You’ve  Never Heard Of!

LSH, Operation #1: Insertion

Here’s the Insertion algorithm:

Step #1: Take each input item (e.g., a web-page) and convert it to a feature vector of size V.

What’s a feature vector?

It’s a fixed-length array of floating point numbers that measure various attributes about each input item.

const int V = 6;
float fv[V];

fv[0] = # of times the word “free” was used in the email
fv[1] = # of times the word “viagra” was used in the email
fv[2] = # of exclamation marks used in the email
fv[3] = The length of the email in words
fv[4] = The average length of each word found in the email
fv[5] = The ratio of punctuation marks to letters in the email

The items in the feature vector should be chosen to provide maximum differentiation between different categories of items (e.g., spam vs clean email)!

fv[5] = # of times the word “the” was used in the email

Page 38: Three Cool  Algorithms You’ve  Never Heard Of!

LSH, Operation #1: Insertion

Why compute a feature vector for each input item?

The feature vector is a way of plotting each item into N-space.

Input #1: “Click here now for free viagra!!!!!”
fv1 = {1, 1, 5, 6, 4.17, 0.2}

Input #2: “Please come to the meeting at 5pm.”
fv2 = {0, 0, 1, 7, 3.71, 0.038}

[Figure: fv1 and fv2 plotted as two points in N-space.]

In principle, items (e.g. emails) with similar content (i.e., similar feature vectors) should occupy similar regions of N-space.

Page 39: Three Cool  Algorithms You’ve  Never Heard Of!

LSH, Operation #1: Insertion

Step #2: Once you have a feature vector for each of your items, you determine the size of your hash table.

“I’m going to need to hold 100 million email feature vectors, so I’ll want an open hash table of size N = 1 million”

Wait! Why is our hash table smaller than the # of items we want to store?

Because we want to put related items in the same bucket/slot of the table!

Step #3: Next compute the number of bits B required to represent N in binary.

If N is 1 million, B will be log2(1 million), or 20.

Note: N must be a power of 2, e.g., 65536, or 1,048,576

Page 40: Three Cool  Algorithms You’ve  Never Heard Of!

LSH, Operation #1: Insertion

Step #4: Now, create B (e.g., 20) RANDOM feature vectors that are the same dimension as your input feature vectors.

R1 = {.277,.891,3,.32,5.89, .136}
R2 = {2.143,.073,0.3,4.9, .58, .252}
…
R19 = {.8,.425,6.43,5.6,.197,1.43}
R20 = {1.47,.256,4.15,5.6,.437,.075}

Page 41: Three Cool  Algorithms You’ve  Never Heard Of!

LSH, Operation #1: Insertion

What are these B random vectors for?

Each of the B random vectors defines a hyper-plane in N-space! (each hyper-plane is perpendicular to its random vector)

[Figure: three random vectors, R1 = {1,0,1}, R2 = {0,0,3} and R3 = {0,2.5,0}, each drawn with its perpendicular hyper-plane.]

If we have B such random vectors, we essentially chop up N-space with B possibly overlapping slices!

So in our example, we’d have B=20 hyper-planes chopping up our V=6 dimensional space.

(Chopping it up into as many as 2^20 different regions, one per 20-bit pattern!)

Page 42: Three Cool  Algorithms You’ve  Never Heard Of!

LSH, Operation #1: Insertion

Ok, let’s consider a single random vector, R1, and its hyper-plane for now.

Now let’s consider a second vector, v1.

If the tips of those two vectors are on the same side of R1’s hyper-plane, then the dot-product of the two vectors will be positive.

R1 · v1 > 0

On the other hand, if the tips of those two vectors are on opposite sides of R1’s hyper-plane (like R1 and a third vector, v2), then the dot-product of the two vectors will be negative.

R1 · v2 < 0

So this is useful: if we compute the dot product of two vectors R and v, we can tell which side of R’s hyper-plane v is on, which is a rough indication of whether they’re close to each other or far from each other in N-space.

Page 43: Three Cool  Algorithms You’ve  Never Heard Of!

LSH, Operation #1: Insertion

Step #5: Create an empty open hash table with 2^B buckets (e.g. 2^20 = 1M).

Let’s label each bucket’s # using binary rather than decimal numbers. (You’ll see why soon)

000…0000
000…0001
000…0010
000…0011
…
111…1110
111…1111

Step #6: For each item we want to add to our hash table…

Take the feature vector for the item...

“Click here now for free viagra!!!!!”
{1, 1, 5, 6, 4.17, 0.2}

And dot-product multiply it by every one of our B random-valued vectors…

R1 = {.277,.891,3,.32,5.89, .136}
R2 = {2.13,.07,0.3,4.9, .58, .252}
…
R19 = {.8,.45,6.3,5.6,.197,1.43}
R20 = {1.7,.26,4.15,5.6,.47,.07}

Dot-products shown on the slide: -3.25, -1.73, …, .18, 5.24

Now convert every positive dot-product to a 1. And convert every negative dot-product into a 0.

This basically tells us whether our feature vector is on the same side or the opposite side of the hyper-plane of every one of our random vectors.

-3.25 -> 0 (opp. side of R1)
-1.73 -> 0 (opp. side of R2)
…
.18 -> 1 (same side as R19)
5.24 -> 1 (same side as R20)

And if we concatenate the 1s and 0s, this gives us a B-digit (e.g., 20 digit) binary number: 00…11

Which we can use to compute a bucket number in our hash table and store our item!
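The bucket computation from Step #6 can be sketched in a few lines. The names dot and bucketNumber are assumptions of this sketch, as is the convention that the first random vector supplies the most significant bit. It also assumes at most 32 random vectors, each with the same dimension as the feature vector.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Dot product of two vectors of the same dimension.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += a[i] * b[i];
    return sum;
}

// One bit per random vector: 1 if the dot product is positive, else 0.
// The bits are concatenated, with the first random vector as the most significant bit.
std::uint32_t bucketNumber(const std::vector<double>& fv,
                           const std::vector<std::vector<double>>& randomVectors) {
    std::uint32_t bucket = 0;
    for (const auto& r : randomVectors)
        bucket = (bucket << 1) | (dot(r, fv) > 0.0 ? 1u : 0u);
    return bucket;
}
```

For instance, with the three 2-dimensional vectors {1,0}, {0,1} and {-1,-1} and the feature vector {2,3}, the dot products are 2, 3 and -5, giving the bits 110, i.e., bucket 6. Note that for the hyper-planes to separate feature vectors whose components are all non-negative, the random vectors need some negative components (for example, components drawn from a zero-mean distribution).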

Page 44: Three Cool  Algorithms You’ve  Never Heard Of!

LSH, Operation #1: Insertion

Basically, every item in bucket 000…0000 will be on the opposite sides of hyper-planes of all the random vectors.

And every item in bucket 111…1111 will be on the same side of the hyper-planes of all the random vectors.

And items in bucket 000…0001 will be on the same side as R20, but the opposite side of R1, R2… R19.

So each bucket essentially represents one of the 2^20 different regions of N-space, as divided by the 20 random hyper-plane slices.

Page 45: Three Cool  Algorithms You’ve  Never Heard Of!

LSH, Operation #2: Searching

Searching for closely-related items is the same as inserting!

Step #1: Compute the feature vector for your item

Step #2: Dot-product multiply this vector by your B random vectors

Step #3: Convert all positive dot-products to 1, and all negative dot-products to 0

Step #4: Use the concatenated binary number to pick a bucket in your hash table

And voilà, you’ve located similar feature vectors/items!

Page 46: Three Cool  Algorithms You’ve  Never Heard Of!

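LSH, One Last Point…

Typically, we don’t just use one LSH hash table… But we use two or more, each with a different set of random vectors!

Why? Two similar items can still land on opposite sides of one of the random hyper-planes, and so end up in different buckets. With several tables, each sliced up differently, they are much more likely to share a bucket in at least one of them.

Then, when searching for a new vector V, we take the union of all buckets that V hashes to, from all hash tables to obtain a list of matches.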
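Here is a hedged sketch of the whole scheme with several tables. Everything about its shape is an assumption: the class name LshIndex, integer item ids, std::unordered_map as the open hash table, and random vectors whose components are drawn from a zero-mean normal distribution (so that hyper-planes can split vectors with all-positive components). It assumes at most 32 bits per table.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <set>
#include <unordered_map>
#include <vector>

using Vec = std::vector<double>;

class LshIndex {
public:
    // numTables hash tables, each with its own bitsPerTable random vectors of dimension dim.
    LshIndex(std::size_t numTables, std::size_t bitsPerTable, std::size_t dim, unsigned seed)
        : m_tables(numTables) {
        std::mt19937 rng(seed);
        std::normal_distribution<double> component(0.0, 1.0);
        for (auto& t : m_tables) {
            t.randomVectors.assign(bitsPerTable, Vec(dim));
            for (auto& r : t.randomVectors)
                for (auto& x : r)
                    x = component(rng);
        }
    }

    // Store the item's id in the bucket it hashes to in every table.
    void insert(int itemId, const Vec& fv) {
        for (auto& t : m_tables)
            t.buckets[bucketNumber(t, fv)].push_back(itemId);
    }

    // Union of the buckets that fv hashes to, across all tables.
    std::set<int> query(const Vec& fv) const {
        std::set<int> matches;
        for (const auto& t : m_tables) {
            auto it = t.buckets.find(bucketNumber(t, fv));
            if (it != t.buckets.end())
                matches.insert(it->second.begin(), it->second.end());
        }
        return matches;
    }

private:
    struct Table {
        std::vector<Vec> randomVectors;
        std::unordered_map<std::uint32_t, std::vector<int>> buckets;
    };

    // One bit per random vector: 1 for a positive dot product, 0 otherwise.
    static std::uint32_t bucketNumber(const Table& t, const Vec& fv) {
        std::uint32_t bucket = 0;
        for (const auto& r : t.randomVectors) {
            double d = 0.0;
            for (std::size_t i = 0; i < fv.size(); ++i)
                d += r[i] * fv[i];
            bucket = (bucket << 1) | (d > 0.0 ? 1u : 0u);
        }
        return bucket;
    }

    std::vector<Table> m_tables;
};
```

A query returns candidate ids only. In practice you would still compare the query against each candidate's feature vector to rank or filter the matches.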

Page 47: Three Cool  Algorithms You’ve  Never Heard Of!

Questions?