Nearest Neighbor
Paul Hsiung
March 16, 2004
Quick Review of NN
Set of points P, query point q, distance metric d.
Find p in P such that d(p,q) ≤ d(p',q) for all p' in P.
[Figure: query point q and its nearest neighbor p]
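In code, the exact problem is a linear scan over P; a one-line Python sketch, assuming d is any distance function:

```python
def nearest_neighbor(P, q, d):
    """Brute-force exact NN: compare q against every point in P."""
    return min(P, key=lambda p: d(p, q))
```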
NN Used In…
Image databases [Pentland et al.]
Color indexing [Swain et al.]
Recognizing 3D objects [Murase et al.]
Shapes [Mori et al.]
Drug testing
DNA sequence matching [Buhler]
Tree-based Approaches
Quadtrees
– Split in the middle along all dimensions
– Split until each cell contains no points or one point
Kd-trees
– Split along one dimension at a time
– Pick the split point wisely, e.g., the median (see the sketch after this list)
Ball-trees
– Pick two pivots and split the points between them
SR-trees
– We have rectangles and spheres, so why not combine them?
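None of this code is from the talk; a minimal Python sketch of the kd-tree idea, assuming median splits and squared-Euclidean pruning (the node layout and function names are illustrative):

```python
def build_kdtree(points, depth=0):
    """Recursively split the points on the median of one coordinate per level."""
    if not points:
        return None
    axis = depth % len(points[0])                  # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                         # "pick the middle wisely": median split
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, q, best=None):
    """Descend toward q first; visit the far side only if it could still help."""
    if node is None:
        return best
    d2 = sum((a - b) ** 2 for a, b in zip(node["point"], q))
    if best is None or d2 < best[0]:
        best = (d2, node["point"])
    diff = q[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, q, best)
    if diff ** 2 < best[0]:                        # splitting plane closer than best so far
        best = nearest(far, q, best)
    return best
```

For example, nearest(build_kdtree([(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)]), (9,2)) returns (2, (8, 1)): the squared distance and the point. Indyk's complaint below is that in high dimensions this pruning almost never fires.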
Indyk’s Gripe
Beyond 10 or 20 dimensions, tree-based structures end up examining many of the points.
That is no better than brute-force linear search, so he came up with a hash-table approach:
Locality-Sensitive Hashing (LSH). The rest of this talk is about his paper.
LSH
Interlude: Near Neighbor
Set of points P, query point q, distance metric d.
Find p in P such that d(p,q) ≤ (1+ε)·d(P,q), where d(P,q) is the distance from q to its closest point in P.
[Figure: query q, its nearest neighbor at distance d(P,q), and the (1+ε)d(P,q) ball]
Hash
Pick a subset I of random coordinates.
The hash function h(p) returns a bucket ID: h(p) = the projection of p onto I.
Intuition
If two points are close, they hash to the same bucket with some probability p1.
If they are far apart, they hash to the same bucket with a smaller probability p2 < p1.
Indyk’s Hash
Convert the coordinates of p to {0,1}^d.
Use Hamming distance: d(p,q) = the number of positions on which p and q differ.
Example:
– p = (0,1,0,1,1,1,0,0,1,0)
– I = {2,5,7}
– Then h(p) = (1,1,0)
Demo: http://web.mit.edu/ardonite/6.838/locality-hashing.htm
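As a minimal sketch of the bit-sampling hash (the helper name make_hash is mine; Python indexing is 0-based, so the slide's 1-based I = {2,5,7} becomes {1,4,6}):

```python
import random

def make_hash(d, k, seed=None):
    """Build h: project a d-bit point onto a random k-subset I of its coordinates."""
    I = random.Random(seed).sample(range(d), k)
    return lambda p: tuple(p[i] for i in I)

# Reproducing the example above with I fixed to the 0-based {1, 4, 6}:
p = (0, 1, 0, 1, 1, 1, 0, 0, 1, 0)
h = lambda point: tuple(point[i] for i in (1, 4, 6))
assert h(p) == (1, 1, 0)                           # the slide's h(p)
```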
Why Locality-sensitive?
Pr[h(p)=h(q)] = (1 − d(p,q)/D)^k
– D is the number of dimensions in the binary representation
– k is the size of I
We can vary the probability by changing k.
[Plots: Pr[h(p)=h(q)] versus distance, for k=1 and k=2]
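For instance, plugging into the formula with D = 10 and d(p,q) = 3: one sampled coordinate agrees with probability 1 − 3/10 = 0.7, so k = 1 gives Pr = 0.7 while k = 2 gives Pr = 0.7² = 0.49. Raising k drives down the collision probability of far pairs faster than that of near pairs.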
Now to Use It (Training)
Generate l hash functions: h1, …, hl.
Store each point p in the bucket hi(p) of the i-th hash table, for i = 1, …, l.
Now to Use It (Query)
Retrieve all the points that belong to the buckets h1(q), …, hl(q).
Return the retrieved point that is closest to q. This “solves” the Near Neighbor problem.
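Putting the training and query slides together, a minimal sketch over binary points stored as tuples (the class name, parameter names, and seed handling are mine, not from the paper):

```python
import random
from collections import defaultdict

def hamming(p, q):
    """Number of positions on which p and q differ."""
    return sum(a != b for a, b in zip(p, q))

class LSHIndex:
    def __init__(self, d, k, l, seed=0):
        rng = random.Random(seed)
        # l independent hash functions, each a random k-subset of the d coordinates
        self.subsets = [rng.sample(range(d), k) for _ in range(l)]
        self.tables = [defaultdict(list) for _ in range(l)]

    def insert(self, p):
        """Training: store p in bucket h_i(p) of the i-th table, i = 1..l."""
        for I, table in zip(self.subsets, self.tables):
            table[tuple(p[i] for i in I)].append(p)

    def query(self, q):
        """Query: gather the candidates in buckets h_1(q)..h_l(q), return the closest."""
        candidates = {p for I, table in zip(self.subsets, self.tables)
                        for p in table[tuple(q[i] for i in I)]}
        return min(candidates, key=lambda p: hamming(p, q), default=None)
```

A query that lands only in empty buckets returns None; the fraction of such queries is the miss ratio measured in the experiments below.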
Indyk’s Results
Compared with a tree-based algorithm on a color-histogram dataset from Corel Draw:
– 20,000 images, 64 dimensions
– Used 1k, 2k, 5k, 10k, and 19k points for training
– 1k points used for queries
– Computed the miss ratio: the fraction of queries with no hits
Indyk’s Results
Results II
Ugly Side
Works best with Hamming distance
– Can be extended to the L1 and L2 norms
Requires parameter tweaking (the size of I and the number of hash tables)
Does not work well on uniformly distributed data
Bibliography
A. Gionis, P. Indyk, R. Motwani. Similarity Search in High Dimensions via Hashing. In Proc. 25th VLDB, 1999.
J. Buhler. Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing. Bioinformatics 17(5), 419-428, 2001.
H. Murase, S. K. Nayar. Visual Learning and Recognition of 3D Objects from Appearance. IJCV 14(1), 5-24, 1995.
A. Pentland, R. W. Picard, S. Sclaroff. Photobook: Tools for Content-Based Manipulation of Image Databases. SPIE 2185, 34-47, 1994.
M. J. Swain, D. H. Ballard. Color Indexing. IJCV 7(1), 11-32, 1991.
G. Mori, S. Belongie, J. Malik. Shape Contexts Enable Efficient Retrieval of Similar Shapes. CVPR 1, 723-730, 2001.
Slides: “Algorithms for Nearest Neighbor Search” by Piotr Indyk
Slides: “Approximate Nearest Neighbor in High Dimensions via Hashing” by Aris Gionis, Piotr Indyk, and Rajeev Motwani