Beyond Locality Sensitive Hashing
Alex Andoni (Microsoft Research)
Joint with: Piotr Indyk (MIT), Huy L. Nguyen (Princeton), Ilya Razenshteyn (MIT)
Transcript
Page 1:

Beyond Locality Sensitive Hashing

Alex Andoni (Microsoft Research)

Joint with: Piotr Indyk (MIT), Huy L. Nguyen (Princeton), Ilya Razenshteyn (MIT)

Page 2:

Nearest Neighbor Search (NNS)

• Preprocess: a set of points
• Query: given a query point q, report a point p with the smallest distance to q
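For concreteness, a minimal brute-force baseline (not from the slides): it answers the exact version by a linear scan, which is the cost that the hashing schemes below aim to beat. The function and variable names, and the choice of Euclidean distance, are illustrative assumptions.

import numpy as np

def brute_force_nns(points: np.ndarray, q: np.ndarray) -> int:
    """Exact NNS by linear scan: O(n * d) per query.

    The hashing schemes discussed in these slides aim for roughly n^rho
    query time with rho < 1 instead.
    """
    dists = np.linalg.norm(points - q, axis=1)  # distance from q to every point
    return int(np.argmin(dists))                # index of the closest point p

# usage sketch: n = 1000 points in d = 64 dimensions
points = np.random.randn(1000, 64)
q = np.random.randn(64)
print(brute_force_nns(points, q))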

Page 3:

Motivation

• Generic setup:
  • Points model objects (e.g., images)
  • Distance models (dis)similarity measure
• Application areas:
  • machine learning: k-NN rule
  • image/video/music recognition, deduplication, bioinformatics, etc.
• Distance can be: Hamming, Euclidean, …
• Primitive for other problems: find the similar pairs, clustering, …

[Figure: two example bit vectors, 000000011100010100000100010100011111 and 000000001100000100000100110100111111, labeled q and p and compared in Hamming distance]
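A quick way to check the Hamming distance between the two example bit vectors shown above (which string carries which label in the figure is not recoverable here, so the q/p assignment below is arbitrary):

# the two bit vectors from the figure above
q = "000000011100010100000100010100011111"
p = "000000001100000100000100110100111111"

# Hamming distance: number of coordinates where the strings differ
print(sum(a != b for a, b in zip(q, p)))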

Page 4:

Approximate NNS

• c-approximate r-near neighbor: given a query point q, report a point p with dist(p, q) ≤ cr if there exists a point at distance ≤ r
• Randomized: such a point is returned with 90% probability
• In practice: query time is small if there are not too many approximate near neighbors

[Figure: query q, a point p at distance r, and the approximation ball of radius cr around q]
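Stated compactly, the guarantee from the bullets above reads:

\[
\exists\, p^{*} \in P:\ \mathrm{dist}(p^{*}, q) \le r
\;\Longrightarrow\;
\Pr\bigl[\text{the query returns some } p \text{ with } \mathrm{dist}(p, q) \le c\,r\bigr] \ge 0.9 .
\]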

Page 5:

Locality-Sensitive Hashing [Indyk-Motwani'98]

• Random hash function g on R^d s.t. for any points p, q:
  • Close: when dist(p, q) ≤ r, Pr[g(p) = g(q)] = P1 is "high"
  • Far: when dist(p, q) > cr, Pr[g(p) = g(q)] = P2 is "small" (but "not-so-small")
• Use several hash tables: n^ρ of them, where P1 = P2^ρ, i.e., ρ = log(1/P1) / log(1/P2)

[Figure: collision probability Pr[g(p) = g(q)] as a function of dist(p, q); it equals P1 at distance r and P2 at distance cr]
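The standard parameter setting behind "use several hash tables" (the textbook analysis of [Indyk-Motwani'98]; the choice of k and L below is not spelled out on this slide):

\[
g(p) = \bigl(h_{1}(p), \dots, h_{k}(p)\bigr), \qquad
k = \frac{\log n}{\log(1/P_{2})}, \qquad
L = n^{\rho} \ \text{tables}, \qquad
\rho = \frac{\log(1/P_{1})}{\log(1/P_{2})},
\]
\[
\text{giving query time roughly } n^{\rho} \ \text{and space roughly } n^{1+\rho}.
\]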

Page 6:

Locality-sensitive hash functions [Indyk-Motwani'98]

• Hash function g is actually a concatenation of "primitive" functions: g(p) = (h1(p), …, hk(p))
• LSH in Hamming space:
  • h(p) = p_j, i.e., choose the j-th bit for a random coordinate j
  • Pr[h(p) = h(q)] = 1 − Ham(p, q)/d
  • g = projection on a few random bits

[Figure: collision probability as a function of Ham(p, q); it equals P1 at distance r and P2 at distance cr]
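A minimal sketch of this bit-sampling family together with the table construction from the previous slides (illustrative only: all names are made up, and points are assumed to be bit strings of length d):

import random
from collections import defaultdict

def make_g(d, k):
    """One concatenated hash g: projection onto k random bit positions."""
    coords = [random.randrange(d) for _ in range(k)]
    return lambda p: tuple(p[j] for j in coords)

def build_tables(points, d, k, L):
    """Build L independent hash tables, each keyed by its own g."""
    gs = [make_g(d, k) for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for i, p in enumerate(points):
        for g, table in zip(gs, tables):
            table[g(p)].append(i)
    return gs, tables

def query(q, points, gs, tables, cr):
    """Scan the buckets q falls into; return any point within Hamming distance cr."""
    for g, table in zip(gs, tables):
        for i in table.get(g(q), []):
            if sum(a != b for a, b in zip(points[i], q)) <= cr:
                return i
    return None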

Page 7:

Algorithms and Lower Bounds

Space      Time    Comment                                      Reference
n^(1+ρ)    n^ρ     ρ = 1/c                 (ℓ1, Hamming)        [IM'98]
                   cell-probe bound for few memory lookups      [PTW'08, PTW'10]
                   LSH lower bound: ρ ≥ Ω(1/c)                  [MNP'06]
                   LSH lower bound: ρ ≥ 1/c − o(1)              [OWZ'11]
n^(1+ρ)    n^ρ     ρ ≈ 1/c²                (ℓ2, Euclidean)      [DIIM'04, AI'06]
                   cell-probe bound for few memory lookups      [PTW'08, PTW'10]
                   LSH lower bound: ρ ≥ Ω(1/c²)                 [MNP'06]
                   LSH lower bound: ρ ≥ 1/c² − o(1)             [OWZ'11]

Page 8:

LSH is tight… leave the rest to cell-probe lower bounds?

Page 9:

Main Result

• NNS in Hamming space (ℓ1) with n^ρ query time, and n^(1+ρ) space and preprocessing, for ρ = 7/(8c) + O(1/c^(3/2)) + o(1)
  • cf. ρ = 1/c of [IM'98]
• NNS in Euclidean space (ℓ2) with the same bounds for ρ = 7/(8c²) + O(1/c³) + o(1)
  • cf. ρ = 1/c² of [AI'06]
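Written out, the new exponents versus the classical ones (the comparison holds once c is large enough for the lower-order terms to be negligible):

\[
\ell_{1}:\ \rho = \frac{7}{8c} + O\!\Bigl(\frac{1}{c^{3/2}}\Bigr) + o(1) < \frac{1}{c},
\qquad
\ell_{2}:\ \rho = \frac{7}{8c^{2}} + O\!\Bigl(\frac{1}{c^{3}}\Bigr) + o(1) < \frac{1}{c^{2}} .
\]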

Page 10:

A look at LSH lower bounds

• LSH lower bounds in Hamming space are Fourier-analytic [Motwani-Naor-Panigrahy'06], [O'Donnell-Wu-Zhou'11]:
  • fix a distribution over hash functions
  • Far pair: random points at distance cr = εd
  • Close pair: random pair at distance r = εd/c
  • Get ρ ≥ 1/c

Page 11:

Why not an NNS lower bound?

• Suppose we try to generalize [OWZ'11] to NNS
• Pick a random instance from the distribution above
• All the "far points" are concentrated in a small ball of radius ≈ εd
• Easy to see at preprocessing: the actual near neighbor is close to the center of the minimum enclosing ball

Page 12:

Our algorithm: intuition

• Data-dependent LSH: hashing that depends on the entire given dataset!
• Two components:
  • a "nice" geometric configuration in which an exponent better than the LSH bound is achievable
  • a reduction from the general case to this "nice" geometric configuration

Page 13:

Nice configuration: "sparsity"

• All points are on a sphere of radius R
• Random points on the sphere are at distance ≈ R√2
• Lemma 1: in this configuration the exponent ρ beats the general LSH bound
  • "Proof": obtained via "cap carving", similar to "ball carving" [KMS'98, AI'06]
• Lemma 1': the same holds for radius R = O(cr)
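One way "cap carving" can be realized in code, as a hedged sketch rather than the paper's exact construction: sample random directions and hash each point to the first spherical cap that contains it. The threshold tau and the cap budget are assumed tuning parameters.

import numpy as np

def cap_carving_hash(d, tau, max_caps=10000, seed=0):
    """Sketch of a cap-carving hash for points on (or near) a sphere.

    Random directions g_1, g_2, ... are drawn once; a point is hashed to the
    index of the first cap {x : <g_i, x/|x|> >= tau} that contains it.
    """
    rng = np.random.default_rng(seed)
    caps = rng.standard_normal((max_caps, d))  # random cap directions
    def h(p):
        u = p / np.linalg.norm(p)              # map the point to the unit sphere
        hits = np.nonzero(caps @ u >= tau)[0]  # caps that contain u
        return int(hits[0]) if hits.size else max_caps  # overflow bucket
    return h

# usage sketch: a nearby pair collides more often than a random pair
h = cap_carving_hash(d=64, tau=1.5)
p = np.random.randn(64)
print(h(p), h(p + 0.05 * np.random.randn(64)))

Closer pairs on the sphere land in the same cap more often, which is exactly the locality-sensitivity needed here.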

Page 14:

Reduction: into spherical LSH

• Idea: apply a few rounds of "regular" LSH first
  • Ball carving [AI'06]
• Intuitively:
  • far points are unlikely to collide
  • this partitions the data into buckets of small diameter
  • find the minimum enclosing ball of each bucket
  • finally, apply spherical LSH on this ball!

Page 15:

Two-level algorithm

• Several hash tables, each with a hash function of the form g(p) = (h1(p), …, hl(p), s1(p), …, sm(p))
  • the h's are "ball carving LSH" (data independent)
  • the s's are "spherical LSH" (data dependent)
• The final exponent ρ is an "average" of the exponents from levels 1 and 2
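A schematic of how the two levels could compose into a single table key (a sketch under the assumption that the level-2 functions are stored per level-1 bucket; ball_hashes and spherical_hashes_for_bucket are hypothetical names):

def two_level_key(p, ball_hashes, spherical_hashes_for_bucket):
    """Compose the data-independent and data-dependent levels into one key.

    Level 1: concatenated ball-carving hashes h_1..h_l pick a coarse bucket.
    Level 2: spherical hashes s_1..s_m, fitted to that bucket during
             preprocessing, refine the key.
    """
    level1 = tuple(h(p) for h in ball_hashes)               # (h_1(p), ..., h_l(p))
    s_funcs = spherical_hashes_for_bucket.get(level1, [])   # data-dependent part
    level2 = tuple(s(p) for s in s_funcs)                   # (s_1(p), ..., s_m(p))
    return level1 + level2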

Page 16:

Details

• Inside each bucket, we need to ensure the "sparse" case:
  • 1) drop all "far pairs"
  • 2) find the minimum enclosing ball (MEB)
    • use Jung's theorem: diameter Δ implies MEB radius ≤ Δ/√2
  • 3) partition by "sparsity" (distance from the center) into shells
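Jung's theorem in full, whose high-dimensional limit gives the Δ/√2 bound used in step 2:

\[
\mathrm{diam}(S) \le \Delta
\;\Longrightarrow\;
r_{\mathrm{MEB}}(S) \;\le\; \Delta \sqrt{\frac{d}{2(d+1)}} \;\le\; \frac{\Delta}{\sqrt{2}} .
\]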

Page 17:

Practice

• Practice uses data-dependent partitions!
  • "wherever theoreticians suggest to use random dimensionality reduction, use PCA"
• Lots of variants
  • Trees: kd-trees, quad-trees, ball-trees, rp-trees, PCA-trees, sp-trees, …
  • no guarantees: e.g., the partitions are deterministic
• Is there a better way to do partitions in practice?
• Why do PCA-trees work?
  • [Abdullah-A-Kannan-Krauthgamer]: they help if the data have more structure
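As a concrete illustration of the "use PCA" folklore above, a single PCA-tree-style split: project the data onto its top principal direction and cut at the median. This is a generic sketch, not the construction analyzed in [Abdullah-A-Kannan-Krauthgamer].

import numpy as np

def pca_split(points):
    """One PCA-tree-style split: project on the top principal direction
    and partition at the median projection value."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]                      # leading principal direction
    proj = centered @ direction
    threshold = np.median(proj)
    return direction, threshold, points[proj <= threshold], points[proj > threshold]

# usage sketch
pts = np.random.randn(500, 32)
_, _, left, right = pca_split(pts)
print(left.shape, right.shape)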

Page 18:

Finale

• NNS with n^ρ query time:
  • ρ ≈ 7/(8c) for the Hamming distance (ℓ1)
  • ρ ≈ 7/(8c²) for the Euclidean distance (ℓ2)
• Below the lower bounds for LSH/space partitions!
• Idea: data-dependent space partitions
• Better upper bound?
  • Multi-level improves it a bit, but not too much
  • how low can ρ go for a given c?
• Best partitions in theory and practice?