Feb 26, 2016
Three Cool Algorithms You’ve Never Heard Of!
Carey Nachenberg ([email protected])
Cool Data Structure: The Metric Tree
(Figure: a Metric Tree of cities. Each node holds a city and a distance threshold – LA/1500km at the root, with Las Vegas/1000km, NYC/1100km, SF/100km, Austin/250km, Boston/400km, Atlanta/600km, New Orleans/300km, San Jose/200km, Merced/70km, and Providence/200km below it. Each node's left edge leads to cities within its threshold distance; its right edge leads to cities farther away.)
Challenge: Building a Fuzzy Spell Checker
Imagine you're building a word processor and you want to implement a spell checker that gives suggestions… Say the user types the misspelled word "lobeky", and we'd like to pop up suggestions like "lonely", "lovely", "locale", …
Of course it's easy to tell the user that their word is misspelled… Question: What data structure could we use to determine if a word is in a dictionary or not?
Right – a hash table or binary search tree could tell you if a word is spelled correctly.
But what if we want to efficiently provide the user with possible alternatives?
Providing Alternatives?
Before we can provide alternatives, we need a way to find close matches… One useful tool for this is the "edit distance" metric.
Edit Distance: how many letters must be added, deleted or replaced to get from word A to word B.
lobeky -> lovely has an edit distance of 2 (replace the b with v and the k with l).
lobeky -> lowly has an edit distance of 3 (replace the b with w, replace the e with l, and drop the k).
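The slides don't spell out how to compute this, but here's a minimal sketch of the classic dynamic-programming (Levenshtein) edit distance in C++; it's just one reasonable implementation, and the function name editDistance is my own:

#include <algorithm>
#include <string>
#include <vector>

// dp[i][j] = number of adds, deletes, and replaces needed to turn the
// first i characters of a into the first j characters of b.
unsigned editDistance(const std::string& a, const std::string& b)
{
    std::vector<std::vector<unsigned>> dp(a.size() + 1,
                                          std::vector<unsigned>(b.size() + 1));
    for (size_t i = 0; i <= a.size(); i++) dp[i][0] = (unsigned)i;  // delete all of a
    for (size_t j = 0; j <= b.size(); j++) dp[0][j] = (unsigned)j;  // add all of b
    for (size_t i = 1; i <= a.size(); i++)
        for (size_t j = 1; j <= b.size(); j++)
        {
            unsigned replaceCost = (a[i-1] == b[j-1]) ? 0 : 1;
            dp[i][j] = std::min({ dp[i-1][j] + 1,              // delete a[i-1]
                                  dp[i][j-1] + 1,              // add b[j-1]
                                  dp[i-1][j-1] + replaceCost   // replace (or keep)
                                });
        }
    return dp[a.size()][b.size()];
}

// editDistance("lobeky", "lovely") returns 2; editDistance("lobeky", "lowly") returns 3.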
So given the user’s misspelled word, and
this edit distance function…
How can we use this to provide the user with spelling suggestions?
Providing Alternatives?
Well, we could take our misspelled word ("lobeky") and compute its edit distance to every word in the dictionary – aardvark, ark, acorn, …, bone, bonfire, …, lonely, lonesome, … – and then give the user all words with an edit distance of <= 3.
But that's really, really slow!
There's a better way! But before we talk about it, let's talk about edit distance a bit more…
Edit Distance
As it turns out, the edit distance function, e(x,y), is what we call a "metric distance function." What does that mean?
1. e(x,y) = e(y,x): the edit distance from "foo" to "food" is the same as from "food" to "foo".
2. e(x,y) >= 0: you can never have a negative edit distance… Well, that makes sense…
3. e(x,z) <= e(x,y) + e(y,z), aka "the triangle inequality": it's never cheaper to do two conversions than a direct conversion. For example, e("foo","goon") = 2, while going through "feed" costs e("foo","feed") + e("feed","goon") = 3 + 4 = 7 > 2.
Metric Distance Functions
Given some word w (e.g., "pier"), let's say I happen to know all words with an edit distance of 1 from that word: peer, tier, piper, pie, pies, …
Now, if my misspelled word m (e.g., "zifs") has an edit distance of 3 from w, what does that guarantee about m's distance to these other words?
Right: if e("zifs","pier") is 3, and all these other words are exactly 1 edit away from "pier", then by the triangle inequality "zifs" must be at most 3 + 1 = 4 edits away from any word in this cloud! And directly: e("zifs","piper") = 4.
But by the same reasoning, none of these words can be less than 3 - 1 = 2 edits away from "zifs"… Why? Because we know that all of these words have at most one character difference from "pier"… So if "pier" is 3 away from "zifs", then in the best case one of these other words would be one letter closer to "zifs" (e.g., if one of pier's letters was replaced by one of zifs' letters). Let's see: e("zifs","pies") = 2.
Imagine if we had thousands of different clouds like this. We could compare your misspelled word to the center word of each cloud. If e(m,w) is less than some threshold edit distance, then the cloud's other words are good suggestions…
Metric Distance Functions
(Figure: many such clouds of words – one centered on "pier" (peer, tier, piper, pie, pies), another centered on "gate" (hate, rate, date, ate, gale), plus clouds for unrelated words like pencil, computer and table – with the misspelled word's edit distance to each cloud's center word labeled on the arrows.)
A Better Way?
That works well, but then again, we'd still have to do thousands of comparisons (one to each cloud)… Hmmm. Can we figure out a more efficient way to do this? Say with log2(D) comparisons, where D is the number of words in your dictionary?
Duh… Well of course, we'll need a tree!
The Metric Tree
The Metric Tree was invented in 1991 by Jeffrey Uhlmann of the Naval Research Laboratory.
Each node in a Metric Tree holds a word, an edit-distance threshold value, and left and right child pointers.

struct MetricTreeNode
{
    string word;
    unsigned int editThreshold;
    MetricTreeNode *left, *right;
};

Let's see how to build a Metric Tree! Building one is really slow, but once we build it, searching it is really fast!
The Metric Tree

main()
{
    Let S = {every word in the dictionary};
    Node *root = buildMTree(S);
}

Node *buildMTree(SetOfWords &S)
1. Pick a random word W from set S.
2. Compute the edit distance of every other word in set S to your random word W.
3. Sort all these words based on their edit distance di to your random word W.
4. Select the median value of the di; let dmed be this median edit distance.
5. Now create a node N for our tree and put our word W in this node. Set its editThreshold value to dmed.
6. N->left = buildMTree(subset of S with di <= dmed)
7. N->right = buildMTree(subset of S with di > dmed)
8. return N
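Here's a rough, compilable C++ rendering of that recipe; it reuses the MetricTreeNode struct from the previous slide and the editDistance sketch from earlier, and details like the container type (a vector of strings) and the nth_element-based median are my own choices, not something the slides prescribe:

#include <algorithm>
#include <cstdlib>
#include <string>
#include <vector>

typedef std::vector<std::string> SetOfWords;       // assumed container type

struct MetricTreeNode                               // same layout as the earlier slide
{
    std::string word;
    unsigned int editThreshold;
    MetricTreeNode *left, *right;
};

unsigned editDistance(const std::string& a, const std::string& b);  // earlier sketch

MetricTreeNode *buildMTree(SetOfWords S)
{
    if (S.empty())
        return nullptr;

    // 1. Pick a random word W and pull it out of the set.
    size_t pick = std::rand() % S.size();
    std::string W = S[pick];
    S.erase(S.begin() + pick);

    MetricTreeNode *N = new MetricTreeNode{ W, 0, nullptr, nullptr };
    if (S.empty())
        return N;                                   // leaf: threshold stays 0

    // 2-4. Compute each remaining word's distance to W and find the median distance.
    std::vector<unsigned> dist;
    for (size_t i = 0; i < S.size(); i++)
        dist.push_back(editDistance(W, S[i]));
    std::vector<unsigned> sorted = dist;
    std::nth_element(sorted.begin(), sorted.begin() + (sorted.size() - 1) / 2, sorted.end());
    unsigned dmed = sorted[(sorted.size() - 1) / 2];

    // 5-8. Store W and dmed in the node, then recurse on the two halves.
    N->editThreshold = dmed;
    SetOfWords leftSet, rightSet;
    for (size_t i = 0; i < S.size(); i++)
        (dist[i] <= dmed ? leftSet : rightSet).push_back(S[i]);
    N->left = buildMTree(leftSet);
    N->right = buildMTree(rightSet);
    return N;
}

Because W is removed from S before recursing, each call works on a strictly smaller set, so the recursion terminates even when every remaining word falls on one side of dmed.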
The Metric Tree
SetOfWords: goat, oyster, roster, hippo, toad, hamster, mouse, chicken, rooster
Let's trace the algorithm on this set:
1. We pick a random word – say "rooster" – and compute every other word's edit distance to it, then sort:
   roster 1, oyster 2, hamster 3, mouse 4, goat 6, toad 6, hippo 7, chicken 7
2. The median edit distance is dmed = 4, so we create a node holding "rooster" with an editThreshold of 4. This becomes the root of our tree.
3. rooster->left = buildMTree({roster, oyster, hamster, mouse}) – the words within 4 edits of "rooster".
   rooster->right = buildMTree({goat, toad, hippo, chicken}) – the words more than 4 edits away.
4. The recursion then repeats the same steps on each subset: for the left subset it might pick "mouse" (median distance 4, so mouse's editThreshold is 4), for the right subset "toad" (median distance 5, so toad's editThreshold is 5), and so on, until every word has its own node.
A Metric Tree
(Figure: the finished tree. "rooster" with threshold 4 is the root; its left subtree contains mouse (4), oyster (4), roster (0) and hamster (0); its right subtree contains toad (5), goat (5), hippo (0) and chicken (0).)
So now we have a metric tree! How do we interpret it?
Every word to the left of "rooster" is guaranteed to be within 4 edits of it… And every word to the right of "rooster" is guaranteed to be more than 4 edits away…
And this same structure is repeated recursively at every node!
Searching
When you search a metric tree, you specify the word you're looking for and an edit-distance radius, e.g., "I want to find all words within 2 edits of 'roaster'."
Starting at the root, there are three cases to consider:
1. Your word and its search radius are totally inside the node's edit threshold (e.g., "roaster" with a radius of 2, at our root "rooster" with threshold 4). In this case, all of your matches are guaranteed to be in our left subtree…
Searching
2. Your word and its search radius are partially inside and partially outside the edit threshold (e.g., "goute" with a radius of 2). In this case, some matches may be in our left subtree and some in our right subtree, so we have to look in both…
Searching
3. Your word and its search radius are completely outside the edit threshold (e.g., "vhivken" with a radius of 2). In this case, all matches will be in our right subtree.
Metric Tree: Search Algorithm

PrintMatches(Node *cur, string misspell, int rad)
{
    if e(misspell, cur->word) <= rad then
        print the current word
    if e(misspell, cur->word) <= cur->editThreshold then
        PrintMatches(cur->left, misspell, rad)
    if e(misspell, cur->word) > cur->editThreshold then
        PrintMatches(cur->right, misspell, rad)
}
*This is a slight simplification…

Let's trace PrintMatches(root, "chomster", 2).
At the root: e("chomster","rooster") = 3, so "rooster" is outside of chomster's radius of 2. It's not a close enough match to print…
Since 3 is less than our editThreshold of 4, let's go left…
At "mouse": e("chomster","mouse") = 5, so "mouse" is outside of chomster's radius of 2. It's not a close enough match to print…
Since 5 is greater than mouse's editThreshold of 4, we won't go left – we will go right.
At "hamster": e("chomster","hamster") = 2, so "hamster" is inside chomster's radius of 2. We've got a match! Print "hamster"!
Other Metric Tree Applications
In addition to spell checking, the Metric Tree can be used in virtually any application where the items obey the metric rules!
Pretty cool, huh? Here's the full search algorithm from the original paper (without my earlier simplifications):

PrintMatches(Node *cur, string misspell, int rad)
{
    if ( e(cur->word, misspell) <= rad )
        cout << cur->word;
    if ( e(cur->word, misspell) - rad <= cur->editThresh )
        PrintMatches(cur->left, misspell, rad);
    if ( e(cur->word, misspell) + rad >= cur->editThresh )
        PrintMatches(cur->right, misspell, rad);
}
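For completeness, here's a minimal compilable version of that search, with the null-pointer checks the slide-sized pseudocode leaves out; it reuses the MetricTreeNode and editDistance sketches from above, and the signed-int distance is just to keep the d - rad comparison safe:

#include <iostream>
#include <string>

// MetricTreeNode and editDistance are as defined in the earlier sketches.
void printMatches(const MetricTreeNode *cur, const std::string& misspell, int rad)
{
    if (cur == nullptr)
        return;

    int d = (int)editDistance(cur->word, misspell);

    // The node's own word is a suggestion if it's within the search radius.
    if (d <= rad)
        std::cout << cur->word << "\n";

    // The left subtree only holds words within editThreshold of cur->word,
    // so visit it only if the search ball can reach into that region.
    if (d - rad <= (int)cur->editThreshold)
        printMatches(cur->left, misspell, rad);

    // The right subtree only holds words beyond editThreshold,
    // so visit it only if the search ball can reach outside that region.
    if (d + rad >= (int)cur->editThreshold)
        printMatches(cur->right, misspell, rad);
}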
Challenge: Space-efficient Set Membership
There are many problems where we want to maintain a set S of items and then check if a new item X is in the set, e.g.:
"Is 'carey nachenberg' a student at UCLA?"
"Is the phone number '424-750-7519' known to be used by a terrorist cell?"
So, what data structures could you use for this?
Right! Both hash tables and binary search trees allow you to: 1. hold a bunch of items, and 2. quickly search through them to see if they hold an item X.
So what's the problem? Well, binary search trees and hash tables are memory hogs!
But if I JUST want to do two things:
1. Add new items to the set
2. Check if an item was previously added to the set
In other words, if I never need to:
1. Print the items of the set (after they've been added)
2. Enumerate each value in the set
3. Erase items from the set
Then we can do much better than our classic data structures! I can actually create a much more memory-efficient data structure!
But first… A hash primer*
* Not that kind of hash.
A hash function is a function, y = f(x), that takes an input x (like a string) and returns an output number y for that input. The ideal hash function returns entirely different values for each different input, even if two inputs are almost identical:

int y, z;
y = idealHashFunction("carey");
cout << y;
z = idealHashFunction("cArey");
cout << z;

So even though these two strings are almost identical, a good hash function might return y=92629 and z=152.
Hash Functions

int hashFunc(const string &name)
{
    int i, total = 0;
    for (i = 0; i < name.length(); i++)
        total = total + name[i];
    return total;
}

Here's a not-so-good hash function. Can anyone figure out why?
Right – because similar inputs produce the same output:

int y, z;
y = hashFunc("bat");
z = hashFunc("tab");   // y == z!!!! BAD!

How can we fix this? By changing our function to take each character's position into account:

    total = total + (name[i] * (i+1));

That's a little better, although not great…
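Putting that one-line tweak into a complete function (just a sketch of the idea on the slide, not a hash you'd use in production):

int betterHashFunc(const string &name)
{
    int total = 0;
    for (int i = 0; i < (int)name.length(); i++)
        total = total + (name[i] * (i + 1));   // weight each character by its position
    return total;
}

// Now "bat" and "tab" no longer collide:
// betterHashFunc("bat") = 98*1 + 97*2 + 116*3 = 640
// betterHashFunc("tab") = 116*1 + 97*2 + 98*3 = 604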
A Better Hash Function
The CRC, or Cyclic Redundancy Check, algorithm is an excellent hash function. It was originally designed to check network packets for corruption. We won't go into CRC's details, but it's a perfectly fine hashing algorithm…
Ok, so we have a good hash function. Now what?
A Simple Set Membership Algorithm
Imagine that I know I want to store up to 1 million items in my set… I could create an array of, say, 100 million bits, and then do the following:

class SimpleSet
{
public:
    void insertItem(string &name)
    {
        int slot = CRC(SEED, name);
        slot = slot % 100000000;
        m_arr[slot] = 1;
    }
    bool isItemInSet(string &name)
    {
        int slot = CRC(SEED, name);
        slot = slot % 100000000;
        if (m_arr[slot] == 1)
            return true;
        else
            return false;
    }
    …
private:
    BitArray m_arr[100000000];
};

main()
{
    SimpleSet s;
    s.insertItem("Carey");
    s.insertItem("Flint");
    if (s.isItemInSet("Flint") == true)
        cout << "Flint's in my set!";
}

(Figure: a bit array of 100 million zeros; inserting "Carey" sets the bit at Carey's hash slot to 1, and inserting "Flint" sets the bit at slot 9721 to 1.)
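One practical note: BitArray isn't a standard C++ type. A minimal way to get the same behavior (my substitution, not something from the slides) is std::vector<bool>, which the standard library packs one bit per element:

#include <string>
#include <vector>

unsigned CRC(unsigned seed, const std::string &s);      // see the CRC sketch below

class SimpleSet
{
public:
    SimpleSet() : m_arr(100000000, false) {}             // 100 million bits ≈ 12.5 MB

    void insertItem(const std::string &name)
    {
        m_arr[CRC(SEED, name) % m_arr.size()] = true;
    }
    bool isItemInSet(const std::string &name) const
    {
        return m_arr[CRC(SEED, name) % m_arr.size()];
    }

private:
    static const unsigned SEED = 0xFFFFFFFF;              // typical CRC seed (next slide)
    std::vector<bool> m_arr;                               // bit-packed by the library
};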
Most hash functions require a seed (initialization) value to be passed in. Here's how it might be used:

unsigned CRC(unsigned seed, string &s)
{
    unsigned crc = seed;
    for (int i = 0; i < s.length(); i++)
        crc = ((crc >> 8) & CONST1) ^ crcTable[(crc ^ s[i]) & CONST2];
    return crc;
}

Typically you'd use a seed value of 0xFFFFFFFF with CRC. But you can change the seed if you like – this results in a (much) different hash value, even for the same input!
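The slide leaves CONST1, CONST2 and crcTable unspecified; for reference, here's one common table-driven CRC-32 formulation that fills in those blanks (standard polynomial 0xEDB88320) – an assumption on my part, since any decent string hash would do here:

#include <cstdint>
#include <string>

// Standard reflected CRC-32 update, one byte at a time.
unsigned CRC(unsigned seed, const std::string &s)
{
    static std::uint32_t table[256];
    static bool built = false;
    if (!built)                                            // build the 256-entry table once
    {
        for (std::uint32_t n = 0; n < 256; n++)
        {
            std::uint32_t c = n;
            for (int k = 0; k < 8; k++)
                c = (c & 1) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
            table[n] = c;
        }
        built = true;
    }
    std::uint32_t crc = seed;
    for (size_t i = 0; i < s.length(); i++)
        crc = (crc >> 8) ^ table[(crc ^ (unsigned char)s[i]) & 0xFF];
    return crc;
}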
A Simple Set Membership Algorithm
Ok, so what's the problem with our SimpleSet? Right! There's a chance of collisions! What if two names happen to hash right to the same slot?

main()
{
    SimpleSet coolPeople;
    coolPeople.insertItem("Carey");
    if (coolPeople.isItemInSet("Paul"))
        cout << "Paul Agbabian is cool!";
}

If "Paul" happens to hash to the same slot as "Carey", our set reports a false positive even though Paul was never inserted.
Ack! If we put 1 million items in our 100 million entry array… we'll have a collision rate of about 1%! Actually, depending on your requirements, that might not be so bad…
A Simple Set Membership Algorithm
Our simple set can hold about 1M items in just 12.5MB of memory! While it does have some false positives, it's much smaller than a hash table or binary search tree…
But we've got to be able to do better… Right?
Right! That's where the Bloom Filter comes in! The Bloom Filter was invented by Burton Bloom in 1970. Let's take a look!
The Bloom Filter
In a Bloom Filter, we use an array of bits just like our original algorithm! But instead of just using 1 hash function and setting just one bit for each insertion… we use K hash functions, compute K hash values and set K bits!

class BloomFilter
{
public:
    void insertItem(string &name)
    {
        for (int i = 0; i < K; i++)
        {
            int slot = CRC(i, name);   // seed the CRC with i each time
            slot = slot % 100000000;
            m_arr[slot] = 1;
        }
    }
    …
private:
    BitArray m_arr[100000000];
};

main()
{
    BloomFilter coolPeople;
    coolPeople.insertItem("Preston");
}

We'll see how K is chosen in a bit (e.g., const int K = 4;). It's a constant and its value is computed from:
1. The max # of items you want to add.
2. The size of the array.
3. Your desired false positive rate.

Notice that each time we call the CRC function, it starts with a different seed value. (Passing K different seed values is the same as using K different hash functions…)
(Figure: inserting "Preston" with K = 4 sets four different bits in the 100-million-bit array, e.g., at slots 22531, 9197, 79929, …)
The Bloom Filter
Now to search, we do the same thing!

bool isItemInSet(string &name)
{
    for (int i = 0; i < K; i++)
    {
        int slot = CRC(i, name);
        slot = slot % 100000000;
        if (m_arr[slot] == 0)
            return false;
    }
    return true;
}

main()
{
    BloomFilter coolPeople;
    coolPeople.insertItem("Preston");
    if (coolPeople.isItemInSet("Carey"))
        cout << "I figured…";
}

Note: we only say an item is a member of the set if all K bits are set to 1. If any bit that we check is 0, then we have a miss.
The Bloom Filter
Ok, so what's the big deal? All we're doing is checking K bits instead of 1?!!?
Well, it turns out that this dramatically reduces the false positive rate!
Ok… So the only questions are, how do we choose:
1. The size of our bit array?
2. The value of K?
Let's see!
The Bloom Filter
If you want to store N items in your Bloom Filter… and you want a false positive rate of F…
You'll want to have M bits in your bit array:
    M = N * log(F) / log(0.6185)
And you'll want to use K different hash functions:
    K = 0.7 * M / N
Let's see some stats! To store N items with false-positive rate F, use M bits and K hash functions:
    N       F            M                         K
    1M      0.1%         14.4M bits (1.79MB)       10
    100M    0.001%       2.4B bits (299MB)         17
    100M    0.00001%     3.4B bits (419MB)         23
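Here's a rough sketch that ties the pieces together – sizing the filter with the two formulas above, and substituting std::hash seeded per-iteration for the CRC-with-K-seeds trick on the slides (the class and member names are my own):

#include <cmath>
#include <functional>
#include <string>
#include <vector>

class BloomFilter
{
public:
    // n = max # of items you plan to add, f = desired false positive rate (e.g., 0.001)
    BloomFilter(size_t n, double f)
    {
        size_t m = (size_t)std::ceil(n * std::log(f) / std::log(0.6185));  // M bits
        m_bits.assign(m, false);
        m_k = (size_t)std::round(0.7 * (double)m / (double)n);             // K hashes
        if (m_k == 0) m_k = 1;
    }

    void insertItem(const std::string &name)
    {
        for (size_t i = 0; i < m_k; i++)
            m_bits[slotFor(name, i)] = true;             // set K bits
    }

    bool isItemInSet(const std::string &name) const
    {
        for (size_t i = 0; i < m_k; i++)
            if (!m_bits[slotFor(name, i)])
                return false;                             // any 0 bit: definitely absent
        return true;                                      // all K bits set: probably present
    }

private:
    // Mixing the seed into the string approximates K independent hash functions.
    size_t slotFor(const std::string &name, size_t seed) const
    {
        return std::hash<std::string>{}(std::to_string(seed) + name) % m_bits.size();
    }

    std::vector<bool> m_bits;
    size_t m_k;
};

// Usage: BloomFilter coolPeople(1000000, 0.001); coolPeople.insertItem("Carey"); ...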
Now you’ve got to admit, that’s pretty efficient!
Of course, unlike a hash table, there is some chanceof having a false positive…
But for many projects, this is not an issue, especially if you can guarantee a certain minimum level of FPs!
Now that’s COOL! And you’ve (hopefully) never heard about it!
Challenge: Constant-time Searching for Similar Items (in a High-dimensional Space)
Problem: I've got a large collection C of existing web pages, and I want to determine if a new web page P is a close match to any of the pages in my existing collection.
Obvious approach: I could iterate through all of my existing pages and do a pair-wise comparison of page P to each page. But that's inefficient! So how can we do it faster?
Answer: Use Locality Sensitive Hashing!
LSH has two operations:
1. Inserting items into the hash table: we add a bunch of items (e.g., web pages) into a locality-sensitive hash table.
2. Given an item, finding closely-related items in the hash table: once we have a filled locality-sensitive hash table, we want to search it with a new item and see if it contains anything similar.
LSH, Operation #1: Insertion
Here's the insertion algorithm:
Step #1: Take each input item (e.g., a web page or an email) and convert it to a feature vector of size V.
What's a feature vector? It's a fixed-length array of floating point numbers that measure various attributes of each input item, e.g.:

const int V = 6;
float fv[V];
fv[0] = # of times the word "free" was used in the email
fv[1] = # of times the word "viagra" was used in the email
fv[2] = # of exclamation marks used in the email
fv[3] = the length of the email in words
fv[4] = the average length of each word found in the email
fv[5] = the ratio of punctuation marks to letters in the email

The items in the feature vector should be chosen to provide maximum differentiation between different categories of items (e.g., spam vs. clean email)! A feature like "# of times the word 'the' was used in the email" would provide far less differentiation.
LSH, Operation #1: Insertion
Why compute a feature vector for each input item? The feature vector is a way of plotting each item as a point in V-dimensional space.
Input #1: "Click here now for free viagra!!!!!" → fv1 = {1, 1, 5, 6, 4.17, 0.2}
Input #2: "Please come to the meeting at 5pm." → fv2 = {0, 0, 1, 7, 3.71, 0.038}
In principle, items (e.g., emails) with similar content (i.e., similar feature vectors) should occupy similar regions of that space.
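As a concrete illustration, here's a rough sketch of how that 6-element feature vector might be computed for an email; the tokenization and counting details are my own guesses and aren't specified on the slides:

#include <cctype>
#include <sstream>
#include <string>
#include <vector>

std::vector<float> computeFeatureVector(const std::string &email)
{
    std::vector<float> fv(6, 0.0f);
    int letters = 0, punct = 0, words = 0, letterTotal = 0;
    std::istringstream in(email);
    std::string w;
    while (in >> w)                                    // split on whitespace
    {
        words++;
        std::string lower;
        for (char c : w)
        {
            if (std::isalpha((unsigned char)c)) { letters++; lower += (char)std::tolower((unsigned char)c); }
            else if (std::ispunct((unsigned char)c)) punct++;
            if (c == '!') fv[2] += 1;                  // exclamation marks
        }
        letterTotal += (int)lower.size();
        if (lower == "free")   fv[0] += 1;
        if (lower == "viagra") fv[1] += 1;
    }
    fv[3] = (float)words;                              // length of the email in words
    fv[4] = words   ? (float)letterTotal / words : 0;  // average word length
    fv[5] = letters ? (float)punct / letters : 0;      // punctuation-to-letter ratio
    return fv;
}

// computeFeatureVector("Click here now for free viagra!!!!!") ≈ {1, 1, 5, 6, 4.17, 0.2}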
LSH, Operation #1: Insertion
Step #2: Once you have a feature vector for each of your items, determine the size of your hash table: "I'm going to need to hold 100 million email feature vectors, so I'll want an open hash table of size N = 1 million." (Note: N must be a power of 2, e.g., 65,536 or 1,048,576.)
Wait! Why is our hash table smaller than the # of items we want to store? Because we want to put related items in the same bucket/slot of the table!
Step #3: Next, compute the number of bits B required to represent N in binary. If N is 1 million, B will be log2(1 million), or 20.
LSH, Operation #1: Insertion
Step #4: Now create B (e.g., 20) RANDOM feature vectors that have the same dimension as your input feature vectors:
R1 = {.277, .891, 3, .32, 5.89, .136}
R2 = {2.143, .073, 0.3, 4.9, .58, .252}
…
R19 = {.8, .425, 6.43, 5.6, .197, 1.43}
R20 = {1.47, .256, 4.15, 5.6, .437, .075}
What are these B random vectors for? Each of the B random vectors defines a hyper-plane – the plane perpendicular to that vector. (For example, in 3 dimensions, R1 = {1,0,1}, R2 = {0,0,3} and R3 = {0,2.5,0} each define a plane perpendicular to the vector.) If we have B such random vectors, we essentially chop up the space with B possibly overlapping slices!
So in our example, we'd have B = 20 hyper-planes chopping up our V = 6 dimensional space – into as many as 2^20 different regions!
LSH, Operation #1: Insertion
Ok, let's consider a single random vector, R1, and its hyper-plane for now, along with a feature vector v1. If the tips of those two vectors are on the same side of R1's hyper-plane, then the dot product of the two vectors will be positive: R1 · v1 > 0.
On the other hand, if the tips of the two vectors are on opposite sides of R1's hyper-plane (say, for a second feature vector v2), then the dot product will be negative: R1 · v2 < 0.
So this is useful – if we compute the dot product of R and a feature vector v (e.g., {1, 1, 5, 6, 4.17, 0.2}), its sign tells us which side of R's hyper-plane v lies on. And if we do this for all B random vectors and concatenate the resulting 1s and 0s, this gives us a B-digit (e.g., 20-digit) binary number.
LSH, Operation #1: Insertion
Step #5: Create an empty open hash table with 2^B buckets (e.g., 2^20 ≈ 1M). Let's label each bucket's # using binary rather than decimal numbers: 000…0000, 000…0001, 000…0010, …, 111…1111. (You'll see why soon.)
Step #6: For each item we want to add to our hash table… take the feature vector for the item (e.g., {1, 1, 5, 6, 4.17, 0.2} for "Click here now for free viagra!!!!!") and dot-product multiply it by every one of our B random-valued vectors R1 through R20.
This basically tells us whether our feature vector is on the same side or the opposite side of the hyper-plane of every one of our random vectors (e.g., negative dot products with R1 and R2 mean "opposite side of R1, opposite side of R2", while positive dot products with R19 and R20 mean "same side as R19, same side as R20").
Now convert every positive dot-product to a 1 and every negative dot-product to a 0, e.g., 00…11. Concatenated, these bits give us a bucket number in our hash table, and that's where we store our item!
LSH, Operation #1: Insertion
Basically, every item in bucket 000…000 will be on the opposite side of the hyper-planes of all the random vectors. And every item in bucket 111…111 will be on the same side of the hyper-planes of all the random vectors. And items in bucket 000…001 will be on the same side as R20, but the opposite side of R1, R2, … R19.
So each bucket essentially represents one of the 2^20 different regions of the space, as divided by the 20 random hyper-plane slices.
LSH, Operation #2: Searching
Searching for closely-related items is the same as inserting!
Step #1: Compute the feature vector for your item.
Step #2: Dot-product multiply this vector by your B random vectors.
Step #3: Convert all positive dot-products to 1, and all negative dot-products to 0.
Step #4: Use the concatenated binary number to pick a bucket in your hash table.
And voilà – you've located similar feature vectors/items!
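Here's a rough C++ sketch of both operations under the scheme described above; the random-vector generation, the unordered_map used in place of a literal 2^B-slot open hash table, and all of the names are my own choices rather than anything prescribed by the slides:

#include <cstdint>
#include <random>
#include <string>
#include <unordered_map>
#include <vector>

typedef std::vector<float> FeatureVector;

class LSHTable
{
public:
    // B = # of random hyper-planes (bits in the bucket number), V = feature dimension.
    LSHTable(int B, int V, unsigned rngSeed = 1)
    {
        std::mt19937 gen(rngSeed);
        std::normal_distribution<float> dist(0.0f, 1.0f);
        for (int i = 0; i < B; i++)
        {
            FeatureVector r(V);
            for (int j = 0; j < V; j++)
                r[j] = dist(gen);                        // one random vector per hyper-plane
            m_random.push_back(r);
        }
    }

    void insertItem(const FeatureVector &fv, const std::string &item)
    {
        m_buckets[signature(fv)].push_back(item);        // bucket # = B-bit signature
    }

    // Everything that landed in the same bucket is a candidate "close match".
    std::vector<std::string> findSimilar(const FeatureVector &fv) const
    {
        auto it = m_buckets.find(signature(fv));
        return (it == m_buckets.end()) ? std::vector<std::string>() : it->second;
    }

private:
    // One bit per random vector: 1 if the dot product is positive, 0 otherwise.
    std::uint32_t signature(const FeatureVector &fv) const
    {
        std::uint32_t sig = 0;
        for (size_t i = 0; i < m_random.size(); i++)
        {
            float dot = 0;
            for (size_t j = 0; j < fv.size(); j++)
                dot += m_random[i][j] * fv[j];
            sig = (sig << 1) | (dot > 0 ? 1u : 0u);
        }
        return sig;
    }

    std::vector<FeatureVector> m_random;
    std::unordered_map<std::uint32_t, std::vector<std::string>> m_buckets;
};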
LSH, One Last Point…
Typically, we don't just use one LSH hash table… we use two or more, each with a different set of random vectors! Why? Because one unlucky set of hyper-planes can split two very similar items into different buckets; with several independent tables, truly similar items are unlikely to be separated in all of them.
Then, when searching for a new vector V, we take the union of all the buckets that V hashes to, across all the hash tables, to obtain a list of matches.
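Continuing the sketch above, the multi-table variant just keeps several LSHTable objects, each built with a different random seed, and merges their candidate buckets (the choice of three tables here is arbitrary):

#include <set>

std::vector<std::string> findSimilarMultiTable(const std::vector<LSHTable> &tables,
                                               const FeatureVector &fv)
{
    std::set<std::string> merged;                        // union of all matching buckets
    for (size_t t = 0; t < tables.size(); t++)
        for (const std::string &item : tables[t].findSimilar(fv))
            merged.insert(item);
    return std::vector<std::string>(merged.begin(), merged.end());
}

// Usage: std::vector<LSHTable> tables;
//        for (unsigned s = 0; s < 3; s++) tables.push_back(LSHTable(20, 6, s + 1));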
Questions?