Summer School on Hashing'14: Locality Sensitive Hashing
Alex Andoni (Microsoft Research)

Feb 23, 2016

Transcript
Page 1: Title

Summer School on Hashing'14
Locality Sensitive Hashing

Alex Andoni (Microsoft Research)

Page 2: Nearest Neighbor Search (NNS)

• Preprocess: a set P of points
• Query: given a query point q, report a point p ∈ P with the smallest distance to q

Page 3: Approximate NNS

• r-near neighbor: given a new point q, report a point p ∈ P s.t. dist(p, q) ≤ r
  • Randomized: a point returned with 90% probability
• c-approximate: report p s.t. dist(p, q) ≤ cr if there exists a point at distance ≤ r

(Figure: query q, a point p within radius r, and the approximation radius cr.)

Page 4: Summer School on Hashing’14 Locality Sensitive Hashing

Heuristic for Exact NNS• r-near neighbor: given a new

point , report a set with• all points s.t. (each with 90%

probability)

• may contain some approximate neighbors s.t.

• Can filter out bad answers

q

r p

cr

c-approximate

Page 5: Locality-Sensitive Hashing [Indyk-Motwani'98]

• Random hash function g s.t. for any points p, q:
  • Close: when dist(p, q) ≤ r, Pr[g(p) = g(q)] = P1 is "high" ("not-so-small")
  • Far: when dist(p, q) > cr, Pr[g(p) = g(q)] = P2 is "small"
• Use several hash tables: n^ρ of them, where

  ρ = log(1/P1) / log(1/P2)

Page 6: Locality-sensitive hash functions

• Hash function g is usually a concatenation of "primitive" functions:
  g(p) = ⟨h1(p), h2(p), …, hk(p)⟩
• Example: Hamming space {0, 1}^d
  • h(p) = p_j, i.e., choose bit j for a random j
  • g chooses k bits at random
  • Pr[h(p) = h(q)] = 1 − Ham(p, q)/d
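As a concrete illustration (a minimal sketch, not code from the talk; function names are made up here), the single-bit collision probability can be computed exactly by enumerating coordinates:

```python
def hamming(p, q):
    """Hamming distance between two equal-length bit strings."""
    return sum(a != b for a, b in zip(p, q))

def bit_sample_collision_prob(p, q):
    """Exact Pr[h(p) = h(q)] when h returns one uniformly random coordinate."""
    d = len(p)
    return sum(p[j] == q[j] for j in range(d)) / d

p, q = "10110100", "10010110"
d, r = len(p), hamming(p, q)                       # d = 8, Ham(p, q) = 2
assert bit_sample_collision_prob(p, q) == 1 - r / d   # = 0.75
# Concatenating k independent bits sharpens the gap: P1 = (1 - r/d)^k.
```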

Page 7: Full algorithm

• Data structure is just L hash tables:
  • Each hash table uses a fresh random function g_i(p) = ⟨h_{i,1}(p), …, h_{i,k}(p)⟩
  • Hash all dataset points into the table
• Query:
  • Check for collisions in each of the L hash tables
  • until we encounter a point within distance cr
• Guarantees:
  • Space: O(nL), plus space to store the points
  • Query time: O(L(k + d)) (in expectation)
  • 50% probability of success.
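The build/query loop above can be sketched as follows (an illustrative bit-sampling index for Hamming space; the class and method names are assumptions, not the talk's code):

```python
import random

class BitSamplingLSH:
    """Illustrative multi-table LSH index for bit strings:
    L tables, each keyed by k randomly sampled coordinates."""

    def __init__(self, d, k, L, seed=0):
        rng = random.Random(seed)
        # For each of the L tables, pick k random coordinates.
        self.projections = [[rng.randrange(d) for _ in range(k)]
                            for _ in range(L)]
        self.tables = [{} for _ in range(L)]

    def _key(self, i, p):
        # g_i(p) = the k sampled bits of p, used as the bucket key.
        return tuple(p[j] for j in self.projections[i])

    def insert(self, p):
        for i, table in enumerate(self.tables):
            table.setdefault(self._key(i, p), []).append(p)

    def query(self, q, cr):
        """Scan colliding buckets; return the first point within distance cr."""
        for i, table in enumerate(self.tables):
            for p in table.get(self._key(i, q), []):
                if sum(a != b for a, b in zip(p, q)) <= cr:
                    return p
        return None

index = BitSamplingLSH(d=8, k=3, L=4)
for point in ["10110100", "10010110", "00000000"]:
    index.insert(point)
# An exact duplicate collides in every table, so it is always found.
assert index.query("10110100", cr=0) == "10110100"
```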

Page 8: Analysis of LSH Scheme

• Choice of parameters k, L?
  • L hash tables, each with g(p) = ⟨h1(p), …, hk(p)⟩
  • Set k s.t. Pr[collision of far pair] = P2^k = 1/n
  • Then Pr[collision of close pair] = P1^k = n^(−ρ)
  • Hence L = n^ρ "repetitions" (tables) suffice!

(Figure: collision probability vs. distance, dropping from P1 at r to P2 at cr; concatenation sharpens the curve, from P1, P2 at k = 1 to P1², P2² at k = 2.)
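The parameter choice above is simple arithmetic; a sketch (the function name and the numeric example are illustrative):

```python
import math

def lsh_parameters(n, p1, p2):
    """Pick k so that P2^k ~= 1/n (one expected far collision per table),
    then L ~= 1/P1^k = n^rho tables."""
    k = math.ceil(math.log(n) / math.log(1 / p2))
    L = math.ceil(1 / p1 ** k)
    rho = math.log(1 / p1) / math.log(1 / p2)
    return k, L, rho

# Example: n = 10^6 points, P1 = 0.9, P2 = 0.5.
k, L, rho = lsh_parameters(10**6, 0.9, 0.5)
assert (k, L) == (20, 9)
assert rho < 0.5            # here rho = log(1/0.9)/log 2 ~= 0.152
```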

Page 9: Analysis: Correctness

• Let p* be an r-near neighbor of q
  • If p* does not exist, the algorithm can output anything
• Algorithm fails when:
  • the near neighbor p* is not in the searched buckets
• Probability of failure:
  • Probability q, p* do not collide in one hash table: ≤ 1 − P1^k
  • Probability they do not collide in L hash tables: at most (1 − P1^k)^L = (1 − 1/L)^L ≤ 1/e
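Plugging in concrete numbers (illustrative values, not from the talk), the failure bound stays below 1/e:

```python
import math

def failure_prob(p1, k, L):
    """Upper bound on missing the near neighbor in all L tables."""
    return (1 - p1 ** k) ** L

# With P1 = 0.9, k = 20, and L = ceil(1/P1^k) = 9 tables:
assert failure_prob(0.9, 20, 9) < math.exp(-1)   # below 1/e ~= 0.37
```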

Page 10: Analysis: Runtime

• Runtime dominated by:
  • Hash function evaluation: O(Lk) time
  • Distance computations to points in buckets
• Distance computations:
  • Care only about far points, at distance > cr
  • In one hash table, the probability a far point collides with q is at most P2^k = 1/n
  • Expected number of far points in a bucket: n · (1/n) = 1
  • Over L hash tables, the expected number of far points is L
• Total: O(L(k + d)) = O(d · n^ρ) in expectation

Page 11: NNS for Euclidean space [Datar-Immorlica-Indyk-Mirrokni'04]

• Hash function g is a concatenation of "primitive" functions:
  g(p) = ⟨h1(p), …, hk(p)⟩
• LSH function h:
  • pick a random line ℓ, project point p onto ℓ, and quantize:
    h(p) = ⌊(⟨p, ℓ⟩ + b) / w⌋
  • ℓ is a random Gaussian vector
  • b random in [0, w]
  • w is a parameter (e.g., 4)
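One such primitive can be sketched in a few lines (an illustrative implementation of the project-shift-quantize idea; the function name is made up here):

```python
import math
import random

def make_line_hash(d, w=4.0, seed=0):
    """One DIIM'04-style primitive: project onto a random Gaussian
    direction, add a random shift b in [0, w), quantize into width-w cells."""
    rng = random.Random(seed)
    line = [rng.gauss(0.0, 1.0) for _ in range(d)]   # random direction ℓ
    b = rng.uniform(0.0, w)
    def h(p):
        return math.floor((sum(x * y for x, y in zip(p, line)) + b) / w)
    return h

h = make_line_hash(d=3)
p = [1.0, 2.0, 0.5]
assert h(p) == h(p)              # deterministic once the line and b are drawn
assert isinstance(h(p), int)     # bucket index along the line
```

Nearby points usually land in the same width-w cell; far points usually do not, which is exactly the P1/P2 gap the scheme needs.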

Page 12: Optimal* LSH [A-Indyk'06]

• Regular grid → grid of balls
  • p can hit empty space, so take more such grids until p is in a ball
• Need (too) many grids of balls
  • Start by projecting into dimension t
• Analysis gives ρ = 1/c² + o(1)
• Choice of reduced dimension t?
  • Tradeoff between # hash tables, n^ρ, and time to hash, t^O(t)
• Total query time: d·n^(1/c² + o(1))

(Figure: a 2D grid of balls; p is projected into R^t.)

Page 13: Proof idea

• Claim: ρ ≈ 1/c², i.e., P(r) ≈ exp(−A·r²)
  • P(r) = probability of collision when ‖p − q‖ = r
• Intuitive proof:
  • Projection approximately preserves distances [JL]
  • P(r) = intersection / union
  • P(r) ≈ probability a random point u lands beyond the dashed line
  • Fact (high dimensions): the x-coordinate of u has a nearly Gaussian distribution
    → P(r) ≈ exp(−A·r²)

(Figure: points p, q at distance r, a random point u, and the dashed cutoff line.)
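Granting P(r) = exp(−A·r²), the exponent ρ = log(1/P(r)) / log(1/P(cr)) collapses to 1/c² for any constant A, which a few lines confirm (function name is illustrative):

```python
import math

def rho_from_gaussian_decay(c, A=0.7):
    """If P(r) = exp(-A r^2), then rho = log(1/P(1)) / log(1/P(c)) = 1/c^2,
    independently of the constant A."""
    P = lambda r: math.exp(-A * r * r)
    return math.log(1 / P(1.0)) / math.log(1 / P(c))

assert abs(rho_from_gaussian_decay(2.0) - 1 / 4) < 1e-12
assert abs(rho_from_gaussian_decay(3.0, A=1.3) - 1 / 9) < 1e-12
```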

Page 14: LSH Zoo

• Hamming distance
  • h: pick a random coordinate(s) [IM98]
• Manhattan distance:
  • h: random grid [AI'06]
• Jaccard distance between sets:
  • h: pick a random permutation π on the universe; min-wise hashing: h(A) = min over a ∈ A of π(a) [Bro'97, BGMZ'97]
• Angle: sim-hash [Cha'02, …]

(Example: "To be or not to be" vs. "To Simons or not to Simons" as the word sets {be, not, or, to} and {not, or, to, Simons} over the universe {be, to, Simons, or, not}.)
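For the word-set example above, Pr[min-hash collision] equals the Jaccard similarity; with a 5-element universe this can be verified exactly by enumerating all 120 permutations (an illustrative sketch):

```python
from itertools import permutations

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def minhash_collision_prob(a, b, universe):
    """Exact Pr[min-hash collision] by enumerating every permutation
    of the universe; it equals the Jaccard similarity of the two sets."""
    hits = total = 0
    for perm in permutations(universe):
        rank = {x: i for i, x in enumerate(perm)}
        # Collision iff both sets have the same minimum-rank element.
        hits += min(a, key=rank.get) == min(b, key=rank.get)
        total += 1
    return hits / total

a = {"be", "not", "or", "to"}
b = {"not", "or", "to", "Simons"}
assert jaccard(a, b) == 0.6                    # |{not,or,to}| / |5 words|
assert minhash_collision_prob(a, b, sorted(a | b)) == jaccard(a, b)
```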

Page 15: LSH in the wild

• If we want exact NNS, what is c? (safety not guaranteed)
• Can choose any parameters L, k
• Correct as long as the exact near neighbor collides with q in some table
• Performance:
  • trade-off between # tables and false positives: larger k gives fewer false positives; smaller L gives fewer tables
  • will depend on dataset "quality"
  • Can tune k, L to optimize for a given dataset

Page 16: Time-Space Trade-offs

(Flattened table: Space | Query time | Comment | Reference.)
• low space, high query time [Ind'01, Pan'06]
• medium space, medium query time [IM'98], [DIIM'04, AI'06]
• high space, low query time (1 memory lookup) [KOR'98, IM'98, Pan'06]
• Lower bounds: ω(1) memory lookups [AIP'06], [PTW'08, PTW'10]; LSH exponent [MNP'06, OWZ'11]

Page 17: LSH is tight…

…leave the rest to cell-probe lower bounds?

Page 18: Data-dependent Hashing! [A-Indyk-Nguyen-Razenshteyn'14]

• NNS in Hamming space ({0, 1}^d) with n^ρ query time, n^(1+ρ) space and preprocessing, for
  ρ = 7/(8c) + O(1/c^(3/2)) + o(1)
  • optimal LSH: ρ = 1/c [IM'98]
• NNS in Euclidean space (R^d) with:
  ρ = 7/(8c²) + O(1/c³) + o(1)
  • optimal LSH: ρ = 1/c² [AI'06]

Page 19: A look at LSH lower bounds

• LSH lower bounds in Hamming space
  • Fourier analytic
  • [Motwani-Naor-Panigrahy'06]
  • [O'Donnell-Wu-Zhou'11]:
    • distribution over hash functions
    • Far pair: random, at distance εdc
    • Close pair: random, at distance εd
    • Get ρ ≥ 1/c

Page 20: Why not an NNS lower bound?

• Suppose we try to generalize [OWZ'11] to NNS
• Pick a random dataset
• All the "far points" are concentrated in a small ball
• Easy to see at preprocessing: the actual near neighbor is close to the center of the minimum enclosing ball

Page 21: Intuition

• Data-dependent LSH:
  • Hashing that depends on the entire given dataset!
• Two components:
  • "Nice" geometric configuration with small ρ
  • Reduction from the general case to this "nice" geometric configuration

Page 22: Nice Configuration: "sparsity"

• All points are on a sphere of radius R
  • Random points are at distance ≈ R√2
• Lemma 1: ρ = 1/(2c² − 1)
  • "Proof": obtained via "cap carving", similar to "ball carving" [KMS'98, AI'06]
• Lemma 1′: the same bound for the radius arising in the reduction

Page 23: Reduction: into spherical LSH

• Idea: apply a few rounds of "regular" LSH
  • Ball carving [AI'06], as if the target approximation is 5c, so these rounds are cheap in query time
• Intuitively:
  • far points are unlikely to collide
  • partitions the data into buckets of small diameter
  • find the minimum enclosing ball
  • finally apply spherical LSH on this ball!

Page 24: Two-level algorithm

• Hash tables, each with:
  • hash function g(p) = ⟨h1(p), …, hl(p), s1(p), …, sm(p)⟩
  • the h's are "ball carving LSH" (data independent)
  • the s's are "spherical LSH" (data dependent)
• The final exponent ρ is an "average" of the exponents from levels 1 and 2

Page 25: Details

• Inside a bucket, need to ensure the "sparse" case:
  • 1) drop all "far pairs"
  • 2) find the minimum enclosing ball (MEB)
  • 3) partition by "sparsity" (distance from the center)

Page 26: 1) Far points

• In a level-1 bucket:
  • Set parameters as if looking for approximation 5c
• Probability a pair survives:
  • at distance r: …
  • at distance cr: …
  • at larger distances: …
• Expected number of collisions at far distance is constant
• Throw out all pairs that violate this

Page 27: 2) Minimum Enclosing Ball

• Use Jung's theorem:
  • A pointset with diameter Δ has an MEB of radius at most Δ/√2
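Jung's bound is approached by the regular simplex: for vertices e_1, …, e_{t+1} in R^{t+1}, pairwise distances are √2 and the circumcenter is the centroid, giving circumradius √(t/(t+1)). A quick numeric check (illustrative, not from the talk):

```python
import math

def simplex_meb_ratio(t):
    """Circumradius / diameter for the regular t-simplex with vertices
    e_1, ..., e_{t+1} in R^{t+1}: diameter sqrt(2), circumradius
    sqrt(t / (t + 1)) (distance from the centroid to any vertex)."""
    diameter = math.sqrt(2.0)
    radius = math.sqrt(t / (t + 1))
    return radius / diameter

# The ratio sqrt(t / (2(t+1))) stays below 1/sqrt(2) and approaches it
# as the dimension t grows, matching Jung's bound R <= diameter / sqrt(2).
for t in (1, 2, 10, 1000):
    assert simplex_meb_ratio(t) <= 1 / math.sqrt(2) + 1e-12
assert abs(simplex_meb_ratio(1000) - 1 / math.sqrt(2)) < 1e-3
```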

Page 28: 3) Partition by "sparsity"

• Inside a bucket, points are at small bounded distance from each other
• "Sparse" LSH does not work directly in this case
• Need to partition by the distance to the center
• Partition into spherical shells of width 1

Page 29: Practice of NNS

• Data-dependent partitions…
• Practice:
  • Trees: kd-trees, quad-trees, ball-trees, rp-trees, PCA-trees, sp-trees…
  • often no guarantees
• Theory?
  • assuming more about the data: PCA-like algorithms "work" [Abdullah-A-Kannan-Krauthgamer'14]

Page 30: Finale

• LSH: geometric hashing
• Data-dependent hashing can do better!
  • Better upper bound? Multi-level improves a bit, but not too much
• Best partitions in theory and practice?
• LSH for other metrics:
  • Edit distance, Earth-Mover Distance, etc.
• Lower bounds (cell probe?)

Page 31: Open question

• Practical variant of ball-carving hashing?
• Design a space partition of R^t that is
  • efficient: point location in poly(t) time
  • qualitative: regions are "sphere-like":

  Pr[needle of length 1 is not cut] ≥ (Pr[needle of length c is not cut])^(1/c²)