Computing Sketches of Matrices Efficiently & (Privacy Preserving) Data Mining Petros Drineas Rensselaer Polytechnic Institute [email protected] (joint.

Computing Sketches of Matrices Efficiently & (Privacy Preserving) Data Mining

Petros Drineas

Rensselaer Polytechnic Institute
[email protected]

(joint work with R. Kannan and M. Mahoney)

@ DIMACS Workshop on Privacy Preserving Data Mining

2

Motivation (Data Mining)

In many applications large matrices appear (too large to store in RAM).

• We can make a few “passes” (sequential READS) through the matrices.

• We can create and store a small “sketch” of the matrices in RAM.

• Computing the “sketch” should be a very fast process.

Discard the original matrix and work with the “sketch”.

3

Motivation (Privacy Preserving)

In many applications, instead of revealing a large matrix, we only reveal its “sketch”.

• Intuition: The “sketch” is an approximation to the original matrix.

Instead of viewing the approximation as a “necessary evil”, we might be able to use it to achieve privacy preservation (similar ideas in Feigenbaum et al., ICALP 2001).

• Goal: Formulate a technical definition of privacy that might be achievable by such “sketching” algorithms and provide meaningful and quantifiable protection.

Achieving this goal is an open problem!

4

Our approach & our results

1. A “sketch” consisting of a few rows/columns of the matrix is adequate for efficient approximations.

[see D & Kannan ’03, and D, Kannan & Mahoney ’04]

2. We draw the rows/columns randomly, using adaptive sampling; e.g. rows/columns are picked with probability proportional to their squared lengths.

Create an approximation to the original matrix which can be stored in much less space.

5

Overview

• A Data Mining setup

• Approximating a large matrix

• Algorithm

• Error bounds

• Tightness of the results

• An alternative approach (Achlioptas and McSherry ’01 and ’03)

• Conclusions

6

Applications: Data Mining

We are given m (> $10^6$) objects and n (> $10^5$) features describing the objects.

Database

An m-by-n matrix A ($A_{ij}$ shows the “importance” of feature j for object i).

Every row of A represents an object.

Queries

Given a new object x, find similar objects in the database (nearest neighbors).

7

Applications (cont’d)

Key observation: The exact value $x^T \cdot d$ might not be necessary.

1. The feature values in the vectors are set by coarse heuristics.

2. It is in general enough to see if $x^T \cdot d$ > Threshold.

[Figure: objects x and d drawn as vectors against the axes feature 1 and feature 2; (d,x) marks the angle between them.]

Two objects are “close” if the angle between their corresponding vectors is small. So, assuming that the vectors are normalized,

$x^T \cdot d = \cos(x,d)$

is high when the two objects are close.

A·x computes the cosines of all the angles at once and answers the query.
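As a rough illustration of the query step, the sketch below normalizes the rows of A and the query x, so that a single matrix-vector product yields every cosine, and then applies the threshold test. The function name `nearest_neighbors`, the toy data, and the threshold value are illustrative assumptions, and the sketch assumes no row of A is all zeros.

```python
import numpy as np

def nearest_neighbors(A, x, threshold):
    """Return indices of rows of A whose cosine with x exceeds threshold."""
    # Normalize rows of A and the query so dot products equal cosines.
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    x_unit = x / np.linalg.norm(x)
    cosines = A_unit @ x_unit  # one matrix-vector product covers all objects
    return np.flatnonzero(cosines > threshold), cosines

# Toy database: 3 objects, 2 features (hypothetical values).
A = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
x = np.array([1.0, 0.1])
hits, cosines = nearest_neighbors(A, x, threshold=0.7)
```

Here the first two objects pass the threshold, while the third is nearly orthogonal to x.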

8

Using an approximation to A

Assume that A’ = CUR is an approximation to A, such that A’ is stored efficiently (e.g. in RAM).

Given a query vector x, instead of computing A · x, compute A’ · x to identify its nearest neighbors.

The CUR algorithm guarantees an error bound that holds even for the worst-case choice of x.
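One way to see why the factored form is cheap to query: A’ · x can be evaluated right to left as C · (U · (R · x)), so the m-by-n product is never formed. A spectral norm bound on A - A’ covers every query, since $\|(A - A')x\| \le \|A - A'\|_2 \|x\|$. The sketch below uses random stand-ins for C, U and R purely to show the evaluation order; the function name `approx_query` and the sizes are assumptions.

```python
import numpy as np

def approx_query(C, U, R, x):
    """Compute (C U R) x right to left, never forming the m-by-n product."""
    return C @ (U @ (R @ x))

rng = np.random.default_rng(0)
m, n, c, r = 200, 150, 5, 4
# Random placeholders; in the setting above, C and R would hold sampled
# columns and rows of A.
C = rng.standard_normal((m, c))
U = rng.standard_normal((c, r))
R = rng.standard_normal((r, n))
x = rng.standard_normal(n)
y = approx_query(C, U, R, x)
```

The cost is roughly O(rn + cr + mc) arithmetic operations, compared with O(mn) for a dense product with A.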

9

Approximating A efficiently

Given a large m-by-n matrix A (stored on disk), compute an approximation A’ to A such that:

1. A’ can be stored in O(m+n) space, after making two passes through the entire matrix A, and using O(m+n) additional space and time.

2. A’ satisfies (with high probability)

$\|A - A'\|_2^2 < \varepsilon \|A\|_F^2$

(and a similar bound with respect to the Frobenius norm).

10

Describing A’ = C · U · R

• C consists of c = $\Theta(1/\varepsilon^2)$ columns of A and R consists of r = $\Theta(1/\varepsilon^2)$ rows of A (the “description length” of A’ is O(m+n)).

• C and R are created using adaptive sampling.

11

Creating C and R

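• Create C (R) by performing c (r) i.i.d. trials.

• In each trial, pick a column (row) of A with probability proportional to its squared length, $|A^{(i)}|^2 / \|A\|_F^2$ ($|A_{(i)}|^2 / \|A\|_F^2$).

• Include $A^{(i)}$ ($A_{(i)}$) as a column of C (row of R).

[$A^{(i)}$ ($A_{(i)}$) is the i-th column (row) of A]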
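The trials above can be sketched as follows, assuming that "length" means Euclidean norm and that the sampling probability is proportional to its square. Any rescaling of the sampled columns or rows that the full algorithm may apply is left out, and the function names are illustrative.

```python
import numpy as np

def sample_columns(A, c, rng):
    """Pick c columns in i.i.d. trials, with probability proportional
    to the squared Euclidean norm of each column (assumed rule)."""
    col_sq = np.sum(A * A, axis=0)
    probs = col_sq / col_sq.sum()
    idx = rng.choice(A.shape[1], size=c, replace=True, p=probs)
    return A[:, idx], idx, probs

def sample_rows(A, r, rng):
    """Same procedure applied to the rows of A."""
    Rt, idx, probs = sample_columns(A.T, r, rng)
    return Rt.T, idx, probs

rng = np.random.default_rng(0)
A = np.array([[3.0, 0.0, 0.0],
              [4.0, 0.0, 1.0]])
C, col_idx, col_probs = sample_columns(A, 3, rng)
R, row_idx, row_probs = sample_rows(A, 2, rng)
```

Because trials are independent, the same column or row can be picked more than once; an all-zero column has probability zero and is never picked.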

12

Singular Value Decomposition (SVD)

1. Exact computation of the SVD takes $O(\min(mn^2, m^2 n))$ time.

2. The top few singular vectors/values can be approximated faster (Lanczos/ Arnoldi methods).

U (V): orthogonal matrix containing the left (right) singular vectors of A.

$\Sigma$: diagonal matrix containing the singular values of A.

13

Rank k approximations (Ak)

$A_k$ is a matrix of rank k such that $\|A - A_k\|_{2,F}$ is minimized over all rank k matrices!

$U_k$ ($V_k$): orthogonal matrix containing the top k left (right) singular vectors of A.

$\Sigma_k$: diagonal matrix containing the top k singular values of A.
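For a small dense matrix, the best rank k approximation can be formed directly from the SVD by keeping the top k singular triplets. The sketch below does this with NumPy and records the two error norms; the function name `rank_k_approx` and the random test matrix are illustrative assumptions.

```python
import numpy as np

def rank_k_approx(A, k):
    """Best rank-k approximation from the top k singular triplets."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return A_k, s

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 6))
k = 2
A_k, s = rank_k_approx(A, k)
err_2 = np.linalg.norm(A - A_k, 2)      # equals the (k+1)-st singular value
err_F = np.linalg.norm(A - A_k, 'fro')  # equals sqrt of the sum of squared discarded values
```

The 2-norm error is the first discarded singular value, and the Frobenius error collects all discarded singular values.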

14

The CUR algorithm

Input:

1. The matrix A in “sparse unordered representation”.

(e.g. non-zero entries of A are presented as triples (i,j,Aij) in any order)

2. Positive integers c < n and r < m (number of columns/rows that we pick).

3. Positive integer k (the rank of A’ = CUR).

Note: Since A’ has rank at most k, $\|A - A'\|_{2,F} \ge \|A - A_k\|_{2,F}$.

We choose a k such that $\|A - A_k\|_{2,F}$ is small. As k grows, for the Frobenius norm approximation, c and r grow as well.
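To illustrate why the sparse unordered representation is enough, the sketch below makes one pass over the triples and accumulates the squared length of every row and column, plus the squared Frobenius norm, in O(m+n) memory. It assumes that sampling uses squared lengths and that each position (i,j) appears at most once in the stream; the function name is hypothetical.

```python
from collections import defaultdict

def squared_norms_from_triples(triples):
    """One sequential pass over (i, j, A_ij) triples, in any order."""
    row_sq = defaultdict(float)
    col_sq = defaultdict(float)
    total = 0.0
    for i, j, a_ij in triples:
        sq = a_ij * a_ij
        row_sq[i] += sq   # contributes to the squared length of row i
        col_sq[j] += sq   # contributes to the squared length of column j
        total += sq       # running squared Frobenius norm
    return dict(row_sq), dict(col_sq), total

triples = [(0, 1, 2.0), (2, 0, 1.0), (0, 0, 3.0), (1, 1, 4.0)]
row_sq, col_sq, fro_sq = squared_norms_from_triples(triples)
```

Dividing each accumulated value by `fro_sq` gives sampling probabilities; a second pass can then extract the chosen rows and columns.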

15

Computing U

Intuition:

The CUR algorithm essentially expresses every row of the matrix A as a linear combination of a small subset of the rows of A.

• This small subset consists of the rows in R.

• Given a row of A, say $A_{(i)}$, the algorithm computes the “best fit” for the row $A_{(i)}$ using the rows in R as the basis.

Notice that only c = O(1) elements of the i-th row are given as input.

However, a vector of coefficients u can still be computed.
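The intuition can be sketched as a small least-squares problem: only the entries of the row at c sampled column positions are known, and the coefficients u are fitted against the rows of R restricted to those same positions. This illustrates the idea only; it is not the construction of U itself, which may also involve rescaling and a rank k truncation. The names `fit_row`, `cols` and the test data are assumptions.

```python
import numpy as np

def fit_row(a_sampled, R, cols):
    """Least-squares coefficients u such that a_sampled ~ u @ R[:, cols]."""
    u, *_ = np.linalg.lstsq(R[:, cols].T, a_sampled, rcond=None)
    return u

rng = np.random.default_rng(2)
R = rng.standard_normal((3, 10))   # r = 3 sampled rows of a 10-column matrix
true_u = np.array([0.5, -1.0, 2.0])
a = true_u @ R                     # a row lying exactly in the row space of R
cols = np.array([1, 4, 6, 9])      # only c = 4 entries of the row are seen
u = fit_row(a[cols], R, cols)
a_hat = u @ R                      # reconstruction of the full row
```

In this toy case the row lies exactly in the span of R, so the c observed entries recover it in full; for a general row the fit is only approximate.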

16

Creating U

Running time

Computing the elements of U amounts to a pseudo-inverse computation. It can be done in $O(c^2 m + c^3 + r^3)$ time.

Thus, for fixed c and r, U can be computed in O(m) time.

Note on the rank of U and CUR

The rank of U (by construction) is k.

Thus, the rank of A’=CUR is at most k.

17

Error bounds (Frobenius norm)

Assume Ak is the “best” rank k approximation to A (through SVD). Then

We need to pick $O(k/\varepsilon^2)$ rows and $O(k/\varepsilon^2)$ columns.

18

Error bounds (2-norm)

Assume Ak is the “best” rank k approximation to A (through SVD). Then

since $\|A - A_k\|_2^2 \le \|A\|_F^2 / (k+1)$.

We need to pick $O(1/\varepsilon^2)$ rows and $O(1/\varepsilon^2)$ columns.
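The inequality relating the 2-norm error of the best rank k approximation to the Frobenius norm of A follows from the ordering of the singular values. A short derivation, writing $\sigma_1 \ge \sigma_2 \ge \dots$ for the singular values of A (notation assumed here):

```latex
\[
(k+1)\,\sigma_{k+1}^2 \;\le\; \sum_{i=1}^{k+1} \sigma_i^2 \;\le\; \sum_{i} \sigma_i^2 \;=\; \|A\|_F^2
\quad\Longrightarrow\quad
\|A - A_k\|_2^2 \;=\; \sigma_{k+1}^2 \;\le\; \frac{\|A\|_F^2}{k+1}.
\]
```

So choosing k on the order of $1/\varepsilon$ already makes $\|A - A_k\|_2^2$ at most $\varepsilon \|A\|_F^2$.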

19

Can we do better?

Lemma

For any $\varepsilon < 1$, there is a set of $\Omega(\varepsilon^{-n})$ n-by-n matrices, such that for two distinct matrices A, B in the set,

$\|A - B\|_2^2 > (\varepsilon/20) \|A\|_F^2$

Lower bound Theorem

Any algorithm which approximates these matrices must output a different “sketch” for each one, thus it must output at least

$\Omega(n \log(1/\varepsilon))$ bits

Tighter lower bounds, matching almost exactly with our upper bounds, have been obtained by Ziv Bar-Yossef, STOC ’03.

20

A different technique

(D. Achlioptas and F. McSherry, ’01 and ’03)

The Algorithm in 2 lines:

• To approximate a matrix A, keep a few elements of the matrix (instead of rows or columns) and zero out the remaining elements.

• Compute a rank k approximation to this sparse matrix (using Lanczos methods).
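A minimal sketch of the two steps, with two loud simplifications: entries are kept with one uniform probability `keep_prob` rather than by weighted sampling, and a dense SVD stands in for a Lanczos method. Kept entries are divided by `keep_prob` so that the sparse matrix equals A in expectation; that rescaling is an assumption of this sketch, as are all names used.

```python
import numpy as np

def sparsify_then_rank_k(A, keep_prob, k, rng):
    """Keep each entry independently with probability keep_prob, zero the
    rest, then take a rank-k approximation of the sparsified matrix."""
    mask = rng.random(A.shape) < keep_prob
    A_sparse = np.where(mask, A / keep_prob, 0.0)  # rescale kept entries
    U, s, Vt = np.linalg.svd(A_sparse, full_matrices=False)
    A_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return A_hat, A_sparse

rng = np.random.default_rng(3)
A = np.outer(np.arange(1.0, 41.0), np.ones(30))  # toy rank-1 matrix
A_hat, A_sparse = sparsify_then_rank_k(A, keep_prob=0.5, k=1, rng=rng)
```

At scale, the sparsified matrix would be stored in a sparse format and passed to an iterative solver, which is where the savings come from.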

Comparing the two techniques:

• The error bound w.r.t. the 2-norm is better, while the error bound w.r.t. the Frobenius norm is the same.

(weighted sampling is used - heavier elements are kept with higher probabilities)

• Running times are the same.

21

Conclusions

• Given the small “sketch” of a matrix A, a “friendly user” can

• reconstruct a (provably accurate) approximation A’ to the original matrix A and employ any algorithms that he would use to process the original matrix A on A’,

• use the Frobenius and spectral norm bounds for A-A’ to argue about the approximation error of his algorithms.

• How do we ensure privacy for the object-vectors (rows) of A that are revealed as part of R?

• Are such sketches offering some privacy preserving guarantees, under some (relaxed) definition of privacy?
