Transcript: 04 LSH Theory (8/22/2019)
CS246: Mining Massive Datasets, Jure Leskovec, Stanford University
http://cs246.stanford.edu
Goal: Given a large number (N in the millions or billions) of text documents, find pairs that are near duplicates.

Applications:
Mirror websites, or approximate mirrors: we don't want to show both in a search.

Problems:
Many small pieces of one document can appear out of order in another.
There are too many documents to compare all pairs.
Documents are so large, or so many, that they cannot fit in main memory.
1/16/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2
1. Shingling: Convert documents to sets of items. A document becomes a set of k-shingles.

2. Minhashing: Convert large sets into short signatures, while preserving similarity. We want a hash function such that Pr[h(C1) = h(C2)] = sim(C1, C2). For the Jaccard similarity, Minhash has this property!

3. Locality-sensitive hashing: Focus on pairs of signatures likely to be from similar documents. Split signatures into bands and hash them. Documents with similar signatures get hashed into the same buckets: candidate pairs.
[Pipeline diagram: Document → the set of strings of length k that appear in the document → Signatures: short integer vectors that represent the sets and reflect their similarity → Locality-sensitive hashing → Candidate pairs: those pairs of signatures that we need to test for similarity.]
A k-shingle (or k-gram) is a sequence of k tokens that appears in the document.

Example: k = 2; D1 = abcab.
Set of 2-shingles: C1 = S(D1) = {ab, bc, ca}.

Represent a document by the set of hash values of its k-shingles. A natural document similarity measure is then the Jaccard similarity:

Sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|
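The shingling step and the Jaccard similarity above can be sketched in a few lines of Python (a minimal illustration; the function names are my own, not from the course):

```python
def shingles(doc: str, k: int) -> set:
    """Return the set of k-shingles (character k-grams) in doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def jaccard(c1: set, c2: set) -> float:
    """Jaccard similarity: |C1 ∩ C2| / |C1 ∪ C2|."""
    return len(c1 & c2) / len(c1 | c2)

# The slide's example: k = 2, D1 = abcab
c1 = shingles("abcab", 2)
print(sorted(c1))                          # ['ab', 'bc', 'ca']
print(jaccard(c1, shingles("abcabd", 2)))  # 0.75
```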
Prob. that h(C1) = h(C2) is the same as Sim(D1, D2): Pr[h(C1) = h(C2)] = Sim(D1, D2).

[Worked example figure: a 7-shingle x 4-document input matrix, three random row permutations, and the resulting 3 x 4 signature matrix M. The column/column Jaccard similarities and the signature/signature agreement rates approximately match:]

Pair:     1-3    2-4    1-2    3-4
Col/Col   0.75   0.75   0      0
Sig/Sig   0.67   1.00   0      0
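The permutation-based Minhashing in this example can be sketched as follows. Explicit permutations are shown for clarity; real implementations simulate them with hash functions since permuting billions of rows is impractical. Names here are illustrative:

```python
import random

def minhash_signature(columns, num_perms, seed=0):
    """Build a num_perms x num_docs signature matrix.
    columns: one set of shingle-row indices per document."""
    rng = random.Random(seed)
    n_rows = max(max(c) for c in columns) + 1
    signature = []
    for _ in range(num_perms):
        # a random permutation assigns each row a position 1..n_rows
        order = list(range(1, n_rows + 1))
        rng.shuffle(order)
        # minhash value = smallest permuted position of any row with a 1
        signature.append([min(order[r] for r in col) for col in columns])
    return signature

sig = minhash_signature([{0, 1}, {0, 1}, {2, 3}], num_perms=5)
print(sig)  # identical sets (docs 0, 1) always get identical signature values
```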
Hash columns of the signature matrix M: similar columns are likely to hash to the same bucket.

Divide matrix M into b bands of r rows each (so M has b·r rows). Candidate column pairs are those that hash to the same bucket for at least 1 band.

[Figure: matrix M split into b bands of r rows; each band is hashed into buckets. The probability of sharing at least 1 bucket jumps from near 0 to near 1 around a similarity threshold s.]
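The banding step can be sketched like this (a toy version with illustrative names; a production system would hash each band tuple into a large but fixed number of buckets instead of using the tuple itself as a key):

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(sig, b, r):
    """sig: (b*r) x num_docs signature matrix (list of rows).
    Columns colliding in at least one band become candidate pairs."""
    n_docs = len(sig[0])
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for c in range(n_docs):
            # the band's r signature values for column c form the bucket key
            key = tuple(sig[band * r + i][c] for i in range(r))
            buckets[key].append(c)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates

sig = [[1, 1, 2], [3, 3, 4], [5, 5, 6], [7, 7, 8]]
print(lsh_candidates(sig, b=2, r=2))  # {(0, 1)}
```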
The S-curve is where the magic happens.

Remember: the probability of equal hash values equals the similarity, i.e., Pr[h(C1) = h(C2)] = sim(D1, D2). This is what 1 band gives you: the probability of sharing a bucket equals the similarity t of the two sets, a straight line.

What we want is a step function at a threshold s: no chance of becoming a candidate if t < s, and probability 1 if t > s. How do we get a step function? By picking r and b!
Remember: b bands, r rows per band. Let sim(C1, C2) = t, and pick some band (r rows):

Prob. that elements in a single row of columns C1 and C2 are equal = t
Prob. that all r rows in a band are equal = t^r
Prob. that some row in a band is not equal = 1 - t^r
Prob. that no band is equal = (1 - t^r)^b
Prob. that at least 1 band is equal = 1 - (1 - t^r)^b

So P(C1, C2 is a candidate pair) = 1 - (1 - t^r)^b.
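The derivation above can be checked numerically. For example, with r = 5 and b = 20, dissimilar pairs almost never become candidates while similar pairs almost always do:

```python
def p_candidate(t: float, r: int, b: int) -> float:
    """Probability that a pair with similarity t becomes a candidate
    pair under b bands of r rows: 1 - (1 - t^r)^b."""
    return 1 - (1 - t ** r) ** b

for t in (0.2, 0.4, 0.6, 0.8):
    print(t, round(p_candidate(t, 5, 20), 4))
```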
[Four plots of Prob(candidate pair), y = 1 - (1 - x^r)^b, against similarity x: r = 1..10 with b = 1; r = 1 with b = 1..10; r = 5 with b = 1..50; r = 10 with b = 1..50.]

Given a fixed threshold s, we want to choose r and b such that P(candidate pair) has a step right around s.
[Recap of the pipeline: Document → set of strings of length k that appear in the document → signatures (short vectors that represent the sets and reflect their similarity) → locality-sensitive hashing → candidate pairs: those pairs of signatures that we need to test for similarity.]
We have used LSH to find similar documents; in reality, columns in large sparse matrices with high Jaccard similarity, e.g., customer/item purchase histories.

Can we use LSH for other distance measures, e.g., Euclidean distance or cosine distance? Let's generalize what we've learned!
For Minhashing signatures, we got a Minhash function for each permutation of rows. This is an example of a family of hash functions:

A hash function is any function that takes two elements and says whether or not they are equal. Shorthand: h(x) = h(y) means h says x and y are equal.

A family of hash functions is any set of hash functions from which we can pick one at random efficiently. Example: the set of Minhash functions generated from permutations of rows.
Suppose we have a space S of points with a distance measure d. A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S:

1. If d(x, y) < d1, then the probability over all h in H that h(x) = h(y) is at least p1.
2. If d(x, y) > d2, then the probability over all h in H that h(x) = h(y) is at most p2.
[Figure: Pr[h(x) = h(y)] plotted against d(x, y). For distances below d1 the probability is high (at least p1); for distances above d2 it is low (at most p2).]
Let S = the space of all sets, d = the Jaccard distance, and H = the family of Minhash functions for all permutations of rows. Then for any hash function h in H:

Pr[h(x) = h(y)] = 1 - d(x, y)

This simply restates the theorem about Minhashing in terms of distances rather than similarities.
Claim: H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d.

More generally, for the Jaccard distance, Minhashing gives a (d1, d2, (1 - d1), (1 - d2))-sensitive family for any d1 < d2. For instance, if the distance is < 1/3 (so similarity > 2/3), then the probability that the Minhash values agree is > 2/3.
Can we reproduce the S-curve effect we saw before for any LS family? Yes: the bands technique we learned for signature matrices carries over to this more general setting. There are two constructions:

AND construction, like the rows in a band
OR construction, like many bands
Given family H, construct family H' consisting of r functions from H. For h = [h1, ..., hr] in H', we say h(x) = h(y) if and only if hi(x) = hi(y) for all i, 1 <= i <= r. Note this corresponds to creating a band of size r.

Theorem: If H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, (p1)^r, (p2)^r)-sensitive.
Proof: Use the fact that the hi's are independent.
Independence of hash functions (HFs) really means that the probability of two HFs saying "yes" is the product of each saying "yes". But two Minhash functions (e.g.) could be highly correlated, for example if their permutations agree in the first one million entries. However, the probabilities in the 4-tuples are over all possible members of H and H', so this is OK for Minhash and others, but it must be part of the LSH-family definition.
Given family H, construct family H' consisting of b functions from H. For h = [h1, ..., hb] in H', we say h(x) = h(y) if and only if hi(x) = hi(y) for at least one i.

Theorem: If H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, 1-(1-p1)^b, 1-(1-p2)^b)-sensitive.
Proof: Use the fact that the hi's are independent.
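Both constructions, and their effect on the agreement probability, can be sketched like this (helper names are illustrative; the member functions are modeled as plain Python callables):

```python
def and_equal(hs):
    """r-way AND: declare x, y equal iff every member function agrees."""
    return lambda x, y: all(h(x) == h(y) for h in hs)

def or_equal(hs):
    """b-way OR: declare x, y equal iff at least one member agrees."""
    return lambda x, y: any(h(x) == h(y) for h in hs)

# Effect of each construction on an agreement probability p:
def and_prob(p, r):
    return p ** r

def or_prob(p, b):
    return 1 - (1 - p) ** b

hs = [lambda x: x % 2, lambda x: x % 3]
print(and_equal(hs)(2, 8))  # True: both functions agree on 2 and 8
print(and_equal(hs)(2, 4))  # False: 2 % 3 != 4 % 3
```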
AND makes all probabilities shrink, but by choosing r correctly, we can make the lower probability approach 0 while the higher one does not. OR makes all probabilities grow, but by choosing b correctly, we can make the upper probability approach 1 while the lower one does not.
[Two plots of the probability of sharing a bucket against the similarity of a pair of items: the AND construction with r = 1..10, b = 1; and the OR construction with r = 1, b = 1..10.]
r-way AND followed by b-way OR construction: exactly what we did with Minhashing. If bands match in all r values, the columns hash to the same bucket; columns hashed into at least 1 common bucket become candidates.

Take points x and y such that Pr[h(x) = h(y)] = p. H will make (x, y) a candidate pair with probability p; the construction makes (x, y) a candidate pair with probability 1 - (1 - p^r)^b. The S-curve!

Example: Take H and construct H' by the AND construction with r = 4. Then, from H', construct H'' by the OR construction with b = 4.
p     1-(1-p^4)^4
.2    .0064
.3    .0320
.4    .0985
.5    .2275
.6    .4260
.7    .6666
.8    .8785
.9    .9860

r = 4, b = 4 transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .8785, .0064)-sensitive family.
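The table's values can be reproduced directly from the formula:

```python
def and_then_or(p: float, r: int = 4, b: int = 4) -> float:
    """r-way AND then b-way OR: transforms p into 1 - (1 - p^r)^b."""
    return 1 - (1 - p ** r) ** b

# Matches the .0064, .2275, and .8785 rows of the table:
for p in (0.2, 0.5, 0.8):
    print(p, round(and_then_or(p), 4))
```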
Picking r and b to get the desired performance: 50 hash functions (r = 5, b = 10).

[Plot of Prob(candidate pair) against similarity, with the threshold s marked.]

Blue area X: false negative rate. These are pairs with sim > s, but a fraction X of them won't share a band and will therefore never become candidates. This means we will never consider these pairs for the (slow/exact) similarity calculation!

Green area Y: false positive rate. These are pairs with sim < s, but we will consider them as candidates. This is not too bad: we will consider them for the (slow/exact) similarity computation and then discard them.
Picking r and b to get the desired performance: 50 hash functions (r * b = 50).

[Plot comparing the S-curves and their thresholds for r = 2, b = 25; r = 5, b = 10; and r = 10, b = 5.]
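A standard rule of thumb for this S-curve (not stated on the slide, but it follows from the formula) is that the threshold sits near (1/b)^(1/r). Computing it for the three choices shows how the threshold moves as r grows:

```python
def lsh_threshold(r: int, b: int) -> float:
    """Approximate similarity where 1 - (1 - t^r)^b rises most steeply."""
    return (1 / b) ** (1 / r)

for r, b in ((2, 25), (5, 10), (10, 5)):
    print(f"r={r}, b={b}: threshold ~ {lsh_threshold(r, b):.3f}")
```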
Apply a b-way OR construction followed by an r-way AND construction. This transforms probability p into (1 - (1 - p)^b)^r: the same S-curve, mirrored horizontally and vertically.

Example: Take H and construct H' by the OR construction with b = 4. Then, from H', construct H'' by the AND construction with r = 4.
p     (1-(1-p)^4)^4
.1    .0140
.2    .1215
.3    .3334
.4    .5740
.5    .7725
.6    .9015
.7    .9680
.8    .9936

The example transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9936, .1215)-sensitive family.
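Again, the table's values can be checked directly:

```python
def or_then_and(p: float, b: int = 4, r: int = 4) -> float:
    """b-way OR then r-way AND: transforms p into (1 - (1 - p)^b)^r."""
    return (1 - (1 - p) ** b) ** r

# Matches the .0140, .7725, and .9936 rows of the table:
for p in (0.1, 0.5, 0.8):
    print(p, round(or_then_and(p), 4))
```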
Example: Apply the (4,4) OR-AND construction followed by the (4,4) AND-OR construction. This transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9999996, .0008715)-sensitive family. Note this family uses 256 (= 4*4*4*4) of the original hash functions.
Pick any two distances x < y. Start with a (x, y, (1-x), (1-y))-sensitive family, then apply constructions to amplify it into a (x, y, p, q)-sensitive family, where p is almost 1 and q is almost 0. The closer to 0 and 1 we want to get, the more hash functions must be used.
LSH methods for other distance metrics:
Cosine distance: random hyperplanes
Euclidean distance: project on lines

[Generalized pipeline: Points → signatures (short integer signatures that reflect their similarity) → locality-sensitive hashing → candidate pairs: those pairs of signatures that we need to test for similarity. Step 1: design a (x, y, p, q)-sensitive family of hash functions for the particular distance metric used (the family depends on the distance function). Step 2: amplify the family using AND and OR constructions.]
For cosine distance,

d(A, B) = θ = arccos(A·B / (‖A‖·‖B‖))

there is a technique called Random Hyperplanes, similar to Minhashing. Random Hyperplanes is a (d1, d2, (1 - d1/180), (1 - d2/180))-sensitive family for any d1 and d2.

Reminder: (d1, d2, p1, p2)-sensitive means:
1. If d(x, y) < d1, then the probability that h(x) = h(y) is at least p1.
2. If d(x, y) > d2, then the probability that h(x) = h(y) is at most p2.
Pick a random vector v, which determines a hash function hv with two buckets: hv(x) = +1 if v·x >= 0, and hv(x) = -1 if v·x < 0. The LS family H is the set of all functions derived from any vector v.

Claim: For points x and y, Pr[h(x) = h(y)] = 1 - d(x, y) / 180.
[Proof sketch figure: look in the plane of x and y, which meet at angle θ. A hyperplane normal to the random vector v separates x and y (so h(x) ≠ h(y)) exactly when v falls within an angular region of size θ; otherwise h(x) = h(y).]

So Prob[h(x) ≠ h(y)] = θ / 180, and therefore P[h(x) = h(y)] = 1 - θ/180 = 1 - d(x, y)/180.
Pick some number of random vectors, and hash your data for each vector. The result is a signature (sketch) of +1's and -1's for each data point. This can be used for LSH just like we used the Minhash signatures for Jaccard distance, and amplified using AND/OR constructions.
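The sketching step, and the fact that the fraction of disagreeing signs estimates the angle, can be sketched as follows (an illustrative toy version using Gaussian random vectors; function names are my own):

```python
import random

def sketch(x, vectors):
    """Sign (+1 / -1) of the dot product of x with each random vector."""
    return [1 if sum(v * c for v, c in zip(vec, x)) >= 0 else -1
            for vec in vectors]

def estimate_angle(s1, s2):
    """The fraction of disagreeing signs estimates d(x, y) / 180."""
    return 180 * sum(a != b for a, b in zip(s1, s2)) / len(s1)

random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(3)] for _ in range(2000)]
x, y = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]   # true angle: 90 degrees
print(estimate_angle(sketch(x, vectors), sketch(y, vectors)))
```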
It is expensive to pick a random vector in M dimensions for large M: we would have to generate M random numbers. A more efficient approach: it suffices to consider only vectors v consisting of +1 and -1 components. Why is this more efficient?
Simple idea: hash functions correspond to lines. Partition the line into buckets of size a, and hash each point to the bucket containing its projection onto the line. Nearby points are always close; distant points are rarely in the same bucket.
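One such line hash can be sketched like this (names illustrative; the random line is represented by a direction vector, and a is the bucket width):

```python
import math

def line_hash(point, direction, a):
    """Bucket index of the point's projection onto the line through
    `direction`, with buckets of width a."""
    norm = math.sqrt(sum(d * d for d in direction))
    projection = sum(p * d for p, d in zip(point, direction)) / norm
    return math.floor(projection / a)

# Nearby points share a bucket; distant points usually do not.
print(line_hash([0.1, 3.0], [1.0, 0.0], a=1.0),   # 0
      line_hash([0.4, -2.0], [1.0, 0.0], a=1.0),  # 0
      line_hash([7.3, 0.0], [1.0, 0.0], a=1.0))   # 7
```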
Lucky case: points that are close hash into the same bucket, and distant points end up in different buckets.

Two unlucky cases:
Top: unlucky quantization (close points straddle a bucket boundary)
Bottom: unlucky projection (distant points project close together)

[Figure: points projected onto a line partitioned into buckets of size a, illustrating the lucky and unlucky cases.]
[Figure: a randomly chosen line with bucket width a, and two points at distance d.]

If d is small relative to a (d <= a/2), there is at least a 1/2 chance the two points fall in the same bucket.
[Figure: the same setting, where the randomly chosen line makes angle θ with the segment between the points, so the projected distance is d cos θ.]

If d >> a, then θ must be close to 90° for there to be any chance the points go to the same bucket.
If points are at distance d >= 2a apart, then they can be in the same bucket only if d cos θ <= a, i.e., cos θ <= 1/2, so 60° <= θ < 90°: at most a 1/3 probability.

This yields a (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions for any a. Amplify using AND-OR cascades.
The projection method yields a (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions. For the previous distance measures, we could start with an (x, y, p, q)-sensitive family for any x < y, and drive p and q to 1 and 0 by AND/OR constructions. Here, we seem to need y >= 4x: in the calculation on the previous slide we only considered the cases d <= a/2 and d >= 2a.
But as long as x < y, the probability of points at distance x falling in the same bucket is greater than the probability of points at distance y doing so. Thus, the hash family formed by projecting onto lines is an (x, y, p, q)-sensitive family for some p > q. Then, amplify by AND/OR constructions.