Dimension Reduction in the Hamming Cube (and its Applications) Rafail Ostrovsky UCLA (joint works with Rabani; and Kushilevitz and Rabani)
Feb 13, 2016
http://www.cs.ucla.edu/~rafail/
PLAN
Problem formulations
Communication complexity game
What really happened? (dimension reduction)
Solutions to 2 problems
– ANN
– k-clustering
What’s next?
Problem statements
Johnson-Lindenstrauss lemma: n points in a high-dimensional Hilbert space can be embedded into an O(log n)-dimensional subspace with small distortion
Q: how do we do it for the Hamming cube?
(we show how to avoid the impossibility result of [Charikar-Sahai])
Many different formulations of ANN
ANN – “approximate nearest neighbor search”
(many applications in computational geometry, biology/stringology, IR, other areas)
Here are different formulations:
Approximate Searching
Motivation: given a DB of “names” and a user with a “target” name, find whether any of the DB names are “close” to the target name, without doing a linear scan.
DB: Jon, Alice, Bob, Eve, Panconesi, Kate, Fred
Target: A.Panconesi ?
Geometric formulation
Nearest Neighbor Search (NNS): given N blue points (and a distance function, say Euclidean distance in R^d), store all these points somehow
Data structure question
Given a new red point, find the closest blue point.
Naive solution 1: store the blue points “as is” and, when given a red point, measure the distances to all blue points.
Q: can we do better?
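Naive solution 1 can be sketched in a few lines of Python. This is only an illustration of the linear scan; the function name `nearest_neighbor` and the sample points are made up for the example.

```python
import math

def nearest_neighbor(blue_points, red_point):
    """Linear scan: return (index, distance) of the blue point closest to red_point."""
    best_index, best_dist = -1, math.inf
    for i, p in enumerate(blue_points):
        d = math.dist(p, red_point)  # Euclidean distance (Python 3.8+)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist

# Hypothetical data: three blue points in R^2 and one red query point.
blue = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
index, dist = nearest_neighbor(blue, (2.0, 3.0))
print(index, dist)
```

Every query costs N distance computations, each taking O(d) time, which is what the rest of the talk tries to avoid.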
Can we do better?
Easy in small dimensions (Voronoi diagrams)
“Curse of dimensionality” in high dimensions…
[KOR]: can get a good “approximate” solution efficiently!
Hamming Cube Formulation for ANN
Given a DB of N blue n-bit strings, process them somehow.
Given an n-bit red string, find its ANN in the hypercube {0,1}^n
Naïve solution 2: pre-compute all (exponentially many) answers (but we want small data structures!)
[Figure: example bit strings 0010101101011001111010011011011011010101110110001010101010101111 and 11010100]
Clustering problem that I’ll discuss in detail
k-clustering
An example of clustering – find “centers”
Given N points in R^d
A clustering formulation
Find cluster “centers”
Clustering formulation
The “cost” is the sum of distances from each point to its closest center
Main technique
First, as a communication game
Second, interpreted as a dimension reduction
COMMUNICATION COMPLEXITY GAME
Given two players, Alice and Bob:
Alice is secretly given a string x
Bob is secretly given a string y
They want to estimate the Hamming distance between x and y with small communication (and small error), provided that they have common randomness
How can they do it? (say |x| = |y| = n)
Much easier: how do we check that x = y?
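For the "much easier" question, one standard answer (not spelled out on the slide, so treat it as an illustrative assumption) is to compare parities of random subsets chosen with the shared randomness. A minimal Python sketch, with hypothetical function names and a seeded generator standing in for the common random string:

```python
import random

def parity_sketch(bits, masks):
    """One output bit per mask: the XOR (parity) of the bits selected by that mask."""
    return [sum(b & m for b, m in zip(bits, mask)) % 2 for mask in masks]

def probably_equal(x, y, repetitions=40, seed=0):
    """Randomized equality test with shared randomness.

    Both players derive the same masks from the shared seed. Alice would send
    her `repetitions` parity bits and Bob compares them with his own.
    If x == y the answer is always True. If x != y, each parity differs with
    probability 1/2, so a wrong answer has probability 2**-repetitions.
    """
    rng = random.Random(seed)  # stands in for the common random string
    masks = [[rng.randint(0, 1) for _ in x] for _ in range(repetitions)]
    return parity_sketch(x, masks) == parity_sketch(y, masks)

x = [0, 1, 0, 1] * 8
y = list(x)
y[5] ^= 1  # flip a single bit
print(probably_equal(x, x), probably_equal(x, y))
```

The communication is `repetitions` bits, independent of the string length.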
Main lemma: an abstract game
How can Alice and Bob estimate the Hamming distance between X and Y with small CC (communication complexity)?
We assume Alice and Bob share randomness
ALICE: X1 X2 X3 X4 … Xn
BOB: Y1 Y2 Y3 Y4 … Yn
A simpler question
To estimate the Hamming distance between X and Y (within a factor of (1+ε)) with small CC, it suffices that, for any L, Alice and Bob can distinguish between:
– H(X,Y) <= L
– H(X,Y) > (1+ε)L
Q: why does sampling not work?
ALICE: X1 X2 X3 X4 … Xn
BOB: Y1 Y2 Y3 Y4 … Yn
Alice and Bob pick the SAME n-bit blue R
Each bit of R is 1 independently with probability 1/(2L)
[Figure: X = 0 1 0 1 0 0 0 1 0 1 0 and Y = 0 1 0 1 1 1 0 1 0 1 0 are each combined with the same R = 0 1 0 0 0 1 0 0 1 0 0; the XOR of the selected bits gives one output bit (0/1) on each side.]
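The one-bit test on this slide can be sketched as follows. The bit rows are copied from the slide's figure; reading the repeated row as R is an assumption, and the function names and the seed are hypothetical.

```python
import random

def sample_mask(n, L, rng):
    """n-bit mask R: each bit is 1 independently with probability 1/(2L)."""
    return [1 if rng.random() < 1.0 / (2 * L) else 0 for _ in range(n)]

def masked_parity(bits, mask):
    """XOR of the entries of `bits` at the positions where the mask is 1."""
    return sum(b & r for b, r in zip(bits, mask)) % 2

X = [0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0]
Y = [0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0]

# The mask drawn on the slide (assumed to be R): Alice gets 1, Bob gets 0.
R_SLIDE = [0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
print(masked_parity(X, R_SLIDE), masked_parity(Y, R_SLIDE))

# A freshly sampled shared mask; the seed stands in for the common randomness.
rng = random.Random(1)
R = sample_mask(len(X), L=2, rng=rng)
alice_bit = masked_parity(X, R)
bob_bit = masked_parity(Y, R)
```

Because the map is linear over GF(2), the two output bits differ exactly when R selects an odd number of the positions where X and Y differ.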
What is the difference in probabilities between H(X,Y) <= L and H(X,Y) > (1+ε)L?
[Figure: X and Y are each combined with the same R; the XOR of the selected bits gives one output bit (0/1) on each side.]
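One way to answer the question: write h = H(X,Y). The two output bits differ exactly when R selects an odd number of the h differing positions, and each position is selected independently with probability 1/(2L), so the probability of a difference is (1 - (1 - 1/L)^h) / 2, which grows with h. The sketch below evaluates this expression; the function names are hypothetical.

```python
import math

def prob_bits_differ(h, L):
    """Pr[Alice's and Bob's output bits differ] when H(X,Y) = h."""
    return 0.5 * (1.0 - (1.0 - 1.0 / L) ** h)

def gap(L, eps):
    """Difference between the smallest 'far' case and the largest 'close' case."""
    far = math.floor((1 + eps) * L) + 1  # smallest integer h with h > (1+eps)L
    return prob_bits_differ(far, L) - prob_bits_differ(L, L)

print(prob_bits_differ(2, 2))  # 2 * (1/4) * (3/4) = 0.375
print(gap(10, 0.5))
```

The gap is positive but small, which is why a single bit is not enough and the test has to be repeated.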
How do we amplify?
Repeat with many independent R’s, all drawn from the same distribution!
[Figure: X and Y are each combined with the same R; the XOR of the selected bits gives one output bit (0/1) on each side.]
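A minimal sketch of the amplification step, assuming the decision rule "count the differing output bits and compare with a threshold halfway between the two expected counts". The number of repetitions `m`, the threshold rule, and all names are illustrative choices, not taken from the slides.

```python
import random

def make_masks(n, L, m, seed):
    """m independent n-bit masks; each bit is 1 with probability 1/(2L)."""
    rng = random.Random(seed)  # stands in for the shared randomness
    p = 1.0 / (2 * L)
    return [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(m)]

def sketch(bits, masks):
    """One parity bit per mask."""
    return [sum(b & r for b, r in zip(bits, mask)) % 2 for mask in masks]

def looks_close(x, y, L, eps, m=1000, seed=0):
    """Decide 'H(x,y) <= L' versus 'H(x,y) > (1+eps)L' from the two m-bit sketches."""
    masks = make_masks(len(x), L, m, seed)
    differing = sum(a != b for a, b in zip(sketch(x, masks), sketch(y, masks)))
    p_close = 0.5 * (1 - (1 - 1 / L) ** L)             # at distance exactly L
    p_far = 0.5 * (1 - (1 - 1 / L) ** ((1 + eps) * L))  # at distance (1+eps)L
    threshold = m * (p_close + p_far) / 2
    return differing <= threshold

x = [0] * 64
y = [1] * 32 + [0] * 32  # Hamming distance 32 from x
print(looks_close(x, x, L=4, eps=1.0), looks_close(x, y, L=4, eps=1.0))
```

Only the sketches need to be exchanged, so the communication is m bits rather than n.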
A refined game with small communication
How can Alice and Bob distinguish between:
– H(X,Y) <= L
– H(X,Y) > (1+ε)L
Pick poly(1/ε)·log N R’s with the correct distribution
ALICE: X1 X2 X3 X4 … Xn. For each R, XOR the subset of the Xi selected by R.
BOB: Y1 Y2 Y3 Y4 … Yn. For each R, XOR the same subset of the Yi.
Compare the outputs of this linear transformation.
Dimension Reduction in the Hamming Cube [OR]
For each L, we can pick O(log N) R’s and boost the probabilities!
Key property: we get an embedding from the large cube to a small cube that preserves ranges around L very well.
Key idea in applications: can build inverse lookup table for the small cube!
Applications
Applications of the dimension reduction in the Hamming cube:
– For ANN in the Hamming cube and R^d
– For k-clustering
Application to ANN in the Hamming Cube
For each possible L, build a “small cube” and project the original DB to the small cube
Pre-compute an inverse table for each entry of the small cube
Why is this efficient?
How do we answer any query?
How do we navigate between different L?
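The inverse-table idea can be sketched for a single value of L as below. This is a simplified illustration, not the [KOR] data structure: here every entry of the small cube simply stores the index of a DB string whose projection is closest to it, and all names and parameters are hypothetical.

```python
import random

def make_masks(n, L, m, seed):
    """m independent n-bit masks; each bit is 1 with probability 1/(2L)."""
    rng = random.Random(seed)
    p = 1.0 / (2 * L)
    return [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(m)]

def sketch_to_int(bits, masks):
    """Project an n-bit string to a point of the m-bit small cube, packed into an int."""
    value = 0
    for mask in masks:
        parity = sum(b & r for b, r in zip(bits, mask)) % 2
        value = (value << 1) | parity
    return value

def build_inverse_table(db, masks):
    """For every point z of the small cube, store the index of a DB string
    whose projection is closest to z in Hamming distance."""
    projected = [sketch_to_int(s, masks) for s in db]
    table = []
    for z in range(2 ** len(masks)):
        best = min(range(len(db)), key=lambda i: bin(projected[i] ^ z).count("1"))
        table.append(best)
    return table

def query(table, masks, q):
    """Answer a query with one projection and one table lookup."""
    return table[sketch_to_int(q, masks)]

rng = random.Random(7)
db = [[rng.randint(0, 1) for _ in range(16)] for _ in range(5)]  # made-up data
masks = make_masks(16, L=2, m=8, seed=3)
table = build_inverse_table(db, masks)
print(query(table, masks, db[0]))
```

The table has 2^m entries, so with m = O(log N) its size stays polynomial in N, and a query costs one projection plus one lookup.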
Putting it All together: User’s private approx search from DB
Each projection is O(log N) R’s. User picks many such projections for each L-range. That defines all the embeddings.
Now, DB builds inverse lookup tables for each projection as new DB’s for each L.
User can now “project” its query into small cube and use binary search on L
MAIN THM [KOR]
Can build a poly-size data structure to do ANN for high-dimensional data in time polynomial in d and poly-log in N:
– For the Hamming cube
– L_1
– L_2
– Square of the Euclidean distance
[IM] had similar results, with a slightly weaker guarantee.
Dealing with R^d
Project to random lines, choose “cut” points…
Well, not exactly… we need “navigation”
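The "random lines and cut points" idea can be illustrated roughly as follows: each random line together with a random cut point yields one bit, and two points get different bits when the cut separates their projections. This is only a sketch of that first idea; the "navigation" the slide mentions is not covered, and the Gaussian directions, the cut range `lo`/`hi`, and all names are assumptions.

```python
import random

def make_cuts(d, m, lo, hi, seed=0):
    """m random (direction, cut point) pairs; lo and hi bound the cut points."""
    rng = random.Random(seed)
    cuts = []
    for _ in range(m):
        direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
        cuts.append((direction, rng.uniform(lo, hi)))
    return cuts

def to_bits(point, cuts):
    """One bit per cut: which side of the cut point the projection falls on."""
    return [1 if sum(p * u for p, u in zip(point, direction)) >= t else 0
            for direction, t in cuts]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

cuts = make_cuts(d=3, m=200, lo=-5.0, hi=5.0)
a = [0.0, 0.0, 0.0]
b = [1.0, 2.0, -1.0]
c = [2.0, 4.0, -2.0]  # on the same ray as b, twice as far from a
print(hamming(to_bits(a, cuts), to_bits(b, cuts)),
      hamming(to_bits(a, cuts), to_bits(c, cuts)))
```

Any cut that separates a from b also separates a from c here, so the farther point never gets a smaller Hamming distance.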
Clustering
Huge number of applications (IR, mining, analysis of statistical data, biology, automatic taxonomy formation, web, topic-specific data collections, etc.)
Two independent issues:
– Representation of data
– Forming “clusters” (many incomparable methods)
Representation of data: examples
Latent semantic indexing yields points in R^d with l2 distance (distance indicating similarity)
Min-wise permutation approach (Broder et al.) yields points in the Hamming metric
Many other representations from the IR literature lead to other metrics, including the edit-distance metric on strings
Recent news: [OR-95] showed that we can embed the edit-distance metric into l1 with small distortion: distortion = exp(sqrt(log n log log n))
Geometric Clustering: examples
Min-sum clustering in R^d: form clusters s.t. the sum of intra-cluster distances is minimized
K-clustering: pick k “centers” in the ambient space. The cost is the sum of distances from each data point to the closest center
Agglomerative clustering (form clusters below some distance threshold)
Q: which is better?
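The first two objectives can be written down directly. A minimal sketch with made-up points in R^2; the function names are hypothetical.

```python
import math

def min_sum_cost(clusters):
    """Min-sum objective: sum of all pairwise distances inside each cluster."""
    total = 0.0
    for cluster in clusters:
        for i in range(len(cluster)):
            for j in range(i + 1, len(cluster)):
                total += math.dist(cluster[i], cluster[j])
    return total

def k_clustering_cost(points, centers):
    """K-clustering objective: each point pays its distance to the closest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)

left = [(0.0, 0.0), (0.0, 2.0)]
right = [(10.0, 0.0), (10.0, 2.0)]
print(min_sum_cost([left, right]))                                   # 2 + 2
print(k_clustering_cost(left + right, [(0.0, 1.0), (10.0, 1.0)]))    # 1 + 1 + 1 + 1
```

Min-sum depends only on the partition, while k-clustering also depends on where the centers are placed in the ambient space.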
Methods are (in general) incomparable
Min-SUM
2-Clustering
A k-clustering problem: notation
N – number of points
d – dimension
k – number of centers
About k-clustering
When k is fixed, this is easy for small d
[Kleinberg, Papadimitriou, Raghavan]: NP-complete for k=2 for the cube
[Drineas, Frieze, Kannan, Vempala, Vinay]: NP-complete for R^d for the square of the Euclidean distance
When k is not fixed, this is facility location (Euclidean k-median)
For fixed d but growing k, a PTAS was given by [Arora, Raghavan, Rao] (using dynamic programming)
(this talk) [OR]: PTAS for fixed k, arbitrary d
Common tools in geometric PTAS
Dynamic programming
Sampling [Schulman, AS, DLVK]
[DFKVV] use SVD
Embeddings/dimension reduction seem useless because:
– Too many candidate centers
– May introduce new centers
[OR] k-clustering result
A PTAS for fixed k:
– Hamming cube {0,1}^d
– l_1^d
– l_2^d (Euclidean distance)
– Square of the Euclidean distance
Main ideas
For 2-clustering, finding a good partition is as good as solving the problem
Switch to the cube
Try partitions in the embedded low-dimensional data set
Given a partition, compute centers and cost in the original data set
Embedding/dim. reduction is used to reduce the number of partitions
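The step "given a partition, compute centers and cost in the original data set" is easy in the Hamming cube, because the best center of a cluster is its bitwise majority. The sketch below shows that step; the brute-force loop over all partitions is only a stand-in for tiny inputs and is not the [OR] method, which uses the embedding to cut down the partitions that must be tried. All names are hypothetical.

```python
from itertools import product

def majority_center(cluster):
    """Bitwise majority of a non-empty list of equal-length bit tuples (ties go to 0)."""
    d = len(cluster[0])
    return tuple(1 if 2 * sum(p[j] for p in cluster) > len(cluster) else 0
                 for j in range(d))

def cluster_cost(cluster):
    """Sum of Hamming distances from the cluster's points to its majority center."""
    if not cluster:
        return 0
    center = majority_center(cluster)
    return sum(sum(a != b for a, b in zip(p, center)) for p in cluster)

def best_2_partition(points):
    """Brute force over all 2^N labelings; feasible only for tiny inputs."""
    best = None
    for labels in product((0, 1), repeat=len(points)):
        part0 = [p for p, l in zip(points, labels) if l == 0]
        part1 = [p for p, l in zip(points, labels) if l == 1]
        cost = cluster_cost(part0) + cluster_cost(part1)
        if best is None or cost < best[0]:
            best = (cost, labels)
    return best

points = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1), (1, 1, 1, 0)]
print(best_2_partition(points))  # (2, (0, 0, 1, 1))
```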
Stronger property of [OR] dimension reduction
Our random linear transformation preserves ranges!
THE ALGORITHM
The algorithm yet again
Guess the 2-center distance
Map to the small cube
Partition in the small cube
Measure the partition in the big cube
THM: gets within (1+ε) of optimal.
Disclaimer: a PTAS is almost never practical; this shows “feasibility only”, and more ideas are needed for a practical solution.
Dealing with k>2
Apex of a tournament is a node of max out-degree
Fact: the apex has a path of length at most 2 to every node
Every point is assigned an apex of center “tournaments”:
– Guess all (k choose 2) center distances
– Embed into (k choose 2) small cubes
– Guess center-projections in the small cubes
– For every point, for every pair of centers, define a “tournament”: which center is closer in the projection
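The tournament fact can be checked on small examples. In this sketch a tournament is a dict mapping each node to the set of nodes it beats; that representation and the function names are hypothetical. In the setting of the slide the nodes would be the k centers, and an edge would record which of two centers is closer to the point in the corresponding projection.

```python
def apex(beats):
    """Return a node of maximum out-degree; beats[u] is the set of nodes u beats."""
    return max(beats, key=lambda u: len(beats[u]))

def reaches_all_within_two(beats, u):
    """Check that u reaches every node by a directed path of length at most 2."""
    reached = {u} | beats[u]
    for v in beats[u]:
        reached |= beats[v]
    return reached == set(beats)

# A 3-cycle: every node has out-degree 1, so any of them can serve as the apex.
cycle = {"a": {"b"}, "b": {"c"}, "c": {"a"}}
# A transitive tournament on three nodes: node 1 beats everyone.
chain = {1: {2, 3}, 2: {3}, 3: set()}
print(apex(cycle), reaches_all_within_two(cycle, apex(cycle)))
print(apex(chain), reaches_all_within_two(chain, apex(chain)))
```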
Conclusions
Dimension reduction in the cube allows us to deal with a huge number of “incomparable” attributes.
Embeddings of other metrics into the cube allow fast ANN for other metrics
Real applications still require considerable additional ideas
Fun area to work in