Jan 04, 2016
Iterative Row Sampling
Richard Peng
Joint work with Mu Li (CMU) and Gary Miller (CMU)
CMU MIT
OUTLINE
• Matrix sketches
• Existence
• Samples → better samples
• Iterative algorithms
DATA
• n-by-d matrix A, m entries
• Columns: data
• Rows: attributes

Goal:
• Classification/clustering
• Identify patterns
• Interpret new data
LINEAR MODEL
• Can add/scale data points
• x: coefficients; combination: Ax

Ax = x1·A:,1 + x2·A:,2 + x3·A:,3
PROBLEM
Interpret a new data point b as a combination of known ones: b ≈ Ax
REGRESSION
• Express as combination of current examples
• Regression: minx ║Ax–b║p
• p=2: least squares
• p=1: compressive sensing
• ║x║2: Euclidean norm of x
• ║x║1: sum of absolute values
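As a concrete reference for the p=2 case, here is a minimal least-squares solve in numpy (the data is synthetic, purely for illustration):

```python
import numpy as np

# Synthetic tall problem: many attributes (rows), few examples (columns).
rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# p = 2: least squares, min_x ||Ax - b||_2
x, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(A @ x - b))
```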
VARIANTS OF COMPRESSIVE SENSING
•minx ║Ax-b║1 +║x║1
•minx ║Ax-b║2 +║x║1
•minx ║x║1 s.t. Ax=b
•minx ║Ax║1 s.t. Bx = y
•minx ║Ax-b║1 + ║Bx - y║1
All similar to minx║Ax-b║1
SIMPLIFIED
• minx ║Ax–b║p = minx ║[A, b][x; –1]║p
• Regression is equivalent to min ║Ax║p with one entry of x fixed
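A quick numerical check of this reduction (synthetic data; the two objectives agree for every x):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 4))
b = rng.standard_normal(50)

x, *_ = np.linalg.lstsq(A, b, rcond=None)
Ab = np.hstack([A, b[:, None]])        # the augmented matrix [A, b]
y = np.append(x, -1.0)                 # the vector [x; -1]

# [A, b] @ [x; -1] equals Ax - b, so the two objectives match.
print(np.allclose(Ab @ y, A @ x - b))  # True
```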
‘BIG’ DATA POINTS
• Each data point has many attributes
• #rows (n) >> #columns (d)
• Examples: genetic data, time series (videos)
• Reverse case (d >> n) also common: images + SIFT
FASTER?
Find a smaller, equivalent A’: a matrix sketch
ROW SAMPLING
• Pick some rows of A to be A’
• How to pick? Randomly
SHORTER EQUIVALENT
• Find a shorter A’ that preserves the answer: ║Ax║p ≈1+ε ║A’x║p for all x
• Run the algorithm on A’; the answer remains good for A

Simplified error notation ≈: a ≈k b if there exist k1, k2 s.t. k2/k1 ≤ k and k1·a ≤ b ≤ k2·a
OUTLINE
• Matrix sketches
• How? Existence
• Samples → better samples
• Iterative algorithms
SKETCHES EXIST
• Linear sketches: A’ = SA
• [Drineas et al. `12]: row sampling, one non-zero in each row of S
• [Clarkson-Woodruff `12]: S = CountSketch, one non-zero per column

║Ax║p ≈ ║A’x║p for all x
SKETCHES EXIST
                         p=2          p=1
Dasgupta et al. `09                   d^2.5
Magdon-Ismail `10        d·log²d
Sohler & Woodruff `11                 d^3.5
Drineas et al. `12       d·log d
Clarkson et al. `12                   d^4.5·log^1.5 d
Clarkson & Woodruff `12  d²·log d     d^8
Mahoney & Meng `12       d²           d^3.5
Nelson & Nguyen `12      d^{1+α}
This paper               d·log d      d^3.66

Hidden: runtime costs, ε^{-2} dependency
WHY IS ≈D POSSIBLE?
• ║Ax║2² = xᵀAᵀAx
• AᵀA: d-by-d matrix
• Any factorization (e.g. QR) of AᵀA suffices as A’

║Ax║p ≈ ║A’x║p for all x
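A small demonstration of why d rows always suffice for p=2: the R factor of a thin QR of A satisfies AᵀA = RᵀR, so ║Ax║2 = ║Rx║2 exactly (synthetic data below):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((1000, 8))

# Thin QR: R is d-by-d and satisfies A^T A = R^T R.
R = np.linalg.qr(A, mode='r')

x = rng.standard_normal(8)
print(np.linalg.norm(A @ x), np.linalg.norm(R @ x))  # identical norms
```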
AᵀA
• Covariance matrix
• Dot products of all pairs of columns (data)
• Covariance: cov(j1, j2) = Σi Ai,j1·Ai,j2
USE OF COVARIANCE MATRIX
• Clustering: ℓ2 distances of all pairs given by C
• Kernel methods: all-pairs dot products suffice for many models

C = AᵀA
OTHER USE OF COVARIANCE
• Covariance of attributes used to tune parameters
• Images + SIFT: many data points, few attributes
• http://www.image-net.org/: 14,197,122 images, 1000 SIFT features
HOW EXPENSIVE IS THIS?
• d² dot products of length-n vectors
• Total: O(nd²)
• Faster: O(nd^{ω−1})
• Expensive: nd² > nd ≥ m
EQUIVALENT VIEW OF SKETCHES
• Approximate covariance matrix: C’ = (A’)ᵀA’
• ║Ax║2 ≈ ║A’x║2 is the same as C ≈ C’
APPLICATION OF SKETCHES
• A’: n’ rows
• d² dot products of length-n’ vectors
• Total cost: O(n’·d^{ω−1})
SKETCHES IN INPUT SPARSITY TIME
• Need: cost of computing C’ < cost of computing C = AᵀA
• Two goals:
• n’ small
• A’ found efficiently
COST AND QUALITY OF A’
                         p=2                          p=1
                         cost            size         cost             size
Dasgupta et al. `09                                   n·d^5            d^2.5
Magdon-Ismail `10        n·d²/log d      d·log²d
Sohler & Woodruff `11                                 n·d^{ω−1+α}      d^3.5
Drineas et al. `12       n·d·log d+d^ω   d·log d
Clarkson et al. `12                                   n·d·log d        d^4.5·log^1.5 d
Clarkson & Woodruff `12  m               d²·log d     m + d^7          d^8
Mahoney & Meng `12       m               d²           m·log n + d^8    d^3.5
Nelson & Nguyen `12      m               d^{1+α}      same as above
This paper               m + d^{ω+α}     d·log d      m + d^{ω+α}      d^3.66
OUTLINE
• Matrix sketches
• How? Existence
• Samples → better samples
• Iterative algorithms
PREVIOUS APPROACHES
• Go to poly(d) rows directly
• Projection to obtain key info, or the sketch itself

A (m rows) → A’ (poly(d) rows): “a miracle happens”
OUR MAIN APPROACH
• Utilize the robustness of sketches, covariance matrices, and sampling
• Iteratively reduce errors and sizes

A → A” → A’
BETTER ALGORITHM FOR P=2
                         p=2                          p=1
                         cost            size         cost             size
Dasgupta et al. `09                                   n·d^5            d^2.5
Magdon-Ismail `10        n·d²/log d      d·log²d
Sohler & Woodruff `11                                 n·d^{ω−1+α}      d^3.5
Drineas et al. `12       n·d·log d+d^ω   d·log d
Clarkson et al. `12                                   n·d·log d        d^4.5·log^1.5 d
Clarkson & Woodruff `12  m               d²·log d     m + d^7          d^8
Mahoney & Meng `12       m               d²           m·log n + d^8    d^3.5
Nelson & Nguyen `12      m               d^{1+α}      same as above
This paper               m + d^{ω+α}     d·log d      m + d^{ω+α}      d^3.66
COMPOSING SKETCHES
A (n rows) → A” (n’ = d^{1+α} rows, cost O(m)) → A’ (d·log d rows, cost O(n’·d·log d + d^ω))

Total cost: O(m + n’·d·log d + d^ω) = O(m + d^ω)
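Schematically, the composition looks like the sketch below (the function names are placeholders for any sketches with these cost/size guarantees, not the paper's API):

```python
def compose_sketches(A, cheap_sketch, careful_sketch):
    """Two-stage sketching: coarse, then fine."""
    A2 = cheap_sketch(A)     # n rows -> n' = d^{1+a} rows, cost O(m)
    A1 = careful_sketch(A2)  # n' rows -> O(d log d) rows, cost O(n' d log d + d^w)
    return A1                # ||Ax|| ~ ||A1 x||, with the errors multiplying
```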
ACCUMULATION OF ERRORS
A (n rows) → A” (n’ = d^{1+α} rows) → A’ (d·log d rows)

║Ax║2 ≈k ║A”x║2 and ║A”x║2 ≈k’ ║A’x║2, so ║Ax║2 ≈k·k’ ║A’x║2
ACCUMULATION OF ERRORS
║Ax║2 ≈k·k’ ║A’x║2
• Final error: product of both errors
• Dependency of error in cost: usually ε^{-2} or more for 1±ε error
• [Avron & Toledo `11]: only the final step needs to be accurate
• Idea: compute sketches indirectly
ROW SAMPLING
• Pick some rows of A to be A’
• How to pick? Randomly
ARE ALL ROWS EQUAL?
• Example: a matrix with one non-zero row, or a column with a single entry
• ║A·[1; 0; …; 0]║p ≠ 0, so the row holding that entry must be kept
ROW SAMPLING
• τ’: weights on rows, forming a sampling distribution
• Pick a number of rows independently from this distribution; rescale them to form A’
MATRIX CHERNOFF BOUNDS
• Sufficient property of τ’
• τ: statistical leverage scores
• If τ’ ≥ τ, then ║τ’║1·log d (rescaled) rows suffice for A’ ≈ A
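A hedged numpy sketch of this sampling step for p=2, assuming overestimates tau ≥ τ are already in hand: draw rows i.i.d. proportionally to tau and rescale so that E[(A’)ᵀA’] = AᵀA.

```python
import numpy as np

def row_sample(A, tau, N, rng=np.random.default_rng()):
    """Sample N rows of A proportionally to tau and rescale them."""
    p = tau / tau.sum()                       # sampling distribution from tau'
    idx = rng.choice(len(tau), size=N, p=p)   # N independent draws
    scale = 1.0 / np.sqrt(N * p[idx])         # unbiased rescaling for p = 2
    return A[idx] * scale[:, None]
```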
STATISTICAL LEVERAGE SCORES
• Studied in statistics since the 70s
• Importance of rows
• Leverage score of row i: τi = Ai·(AᵀA)^{-1}·Aiᵀ
• Key fact: ║τ║1 = rank ≤ d, so ║τ’║1·log d = d·log d rows
COMPUTING LEVERAGE SCORES
τi = Ai·(AᵀA)^{-1}·Aiᵀ = Ai·C^{-1}·Aiᵀ
• AᵀA: the covariance matrix C
• Given C^{-1}, each τi can be computed in O(d²) time
• Total cost: O(nd² + d^ω)
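Computed directly from the definition, this is a few lines of numpy (a naive O(nd² + d^ω) implementation that assumes A has full column rank):

```python
import numpy as np

def leverage_scores(A):
    """Exact leverage scores: tau_i = A_i (A^T A)^{-1} A_i^T."""
    C = A.T @ A                   # covariance matrix, d-by-d
    Cinv = np.linalg.inv(C)       # assumes full column rank
    # tau_i = A_i Cinv A_i^T for every row at once; sum(tau) = rank <= d.
    return np.einsum('ij,jk,ik->i', A, Cinv, A)
```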
COMPUTING LEVERAGE SCORES
τi = Ai·C^{-1}·Aiᵀ = ║Ai·C^{-1/2}║2²
• Squared 2-norm of the vector Ai·C^{-1/2}
• C^{-1/2} puts the rows in isotropic position
• Decorrelates the columns
ASIDE: WHAT IS LEVERAGE?
Geometric view:
• Rows define ‘energy’ directions
• Normalize so total energy is uniform
• τi: norm of row i after normalizing, Ai → Ai·C^{-1/2}
ASIDE: WHAT IS LEVERAGE?
How to interpret statistical leverage scores?
• Statistics ([Hoaglin-Welsh `78], [Chatterjee-Hadi `86]):
• Influence on the data set
• Likelihood of being an outlier
• Uniqueness of the row
ASIDE: WHAT IS LEVERAGE?
High leverage score:
• Key attribute?
• Outlier (measurement error)?
ASIDE: WHAT IS LEVERAGE?
My current view (motivated by graph sparsification):
• Sampling probabilities
• Use them to find sketches
COMPUTING LEVERAGE SCORES
τi = ║Ai·C^{-1/2}║2²
• Only need τ’ ≥ τ
• Can use approximations after scaling them up
• Error leads to larger ║τ’║1
DIMENSIONALITY REDUCTION
║x║2² ≈jl ║Gx║2²
• Johnson-Lindenstrauss transform
• G: d-by-O(1/α) Gaussian
• Error: jl = d^α
ESTIMATING LEVERAGE SCORES
τi = ║Ai·C^{-1/2}║2² ≈jl ║Ai·C^{-1/2}·G║2²
• G: d-by-O(1/α) Gaussian
• C^{-1/2}·G: d-by-O(1/α)
• Cost: O(α·nnz(Ai)) per row; total O(α·m + α·d²·log d)
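A hedged implementation of this estimator, using a Cholesky factor of C in place of C^{1/2} and k Gaussian columns (both are standard substitutions, not prescribed by the slides):

```python
import numpy as np

def approx_leverage_scores(A, k=20, rng=np.random.default_rng()):
    """JL-style leverage estimates: tau_i ~ ||A_i C^{-1/2} G||_2^2 / k."""
    d = A.shape[1]
    L = np.linalg.cholesky(A.T @ A)     # C = L L^T, so C^{-1} = L^{-T} L^{-1}
    G = rng.standard_normal((d, k))
    M = np.linalg.solve(L.T, G)         # M = L^{-T} G, a d-by-k matrix
    S = A @ M                           # n-by-k sketch, O(k) work per nonzero
    return (S ** 2).sum(axis=1) / k     # unbiased estimate of each tau_i
```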
ESTIMATING LEVERAGE SCORES
• C ≈k C’ gives ║C^{-1/2}·x║2 ≈k ║C’^{-1/2}·x║2
• Uses C’ as a preconditioner for C
• Can also combine with JL

τi = ║Ai·C^{-1/2}║2² ≈ ║Ai·C’^{-1/2}║2²
ESTIMATING LEVERAGE SCORES
τi’ = ║Ai·C’^{-1/2}·G║2² ≈jl ║Ai·C’^{-1/2}║2² ≈k τi
• So (jl·k)·τ’ ≥ τ
• Total number of rows: ║jl·k·τ’║1 ≤ jl·k·║τ’║1 ≤ k·d^{1+α}
ESTIMATING LEVERAGE SCORES
• Quality of A’ does not depend on quality of τ’
• C ≈k C’ gives A’ ≈2 A with O(k·d^{1+α}) rows in O(m + d^ω) time
• (jl·k)·τ’ ≥ τ, and ║jl·k·τ’║1 ≤ jl·k·d = k·d^{1+α}
• Some fixable issues when n >>> d
SIZE REDUCTION
• A” ≈O(1) A
• C” ≈O(1) C
• τ’ ≈O(1) τ
• A’ ≈O(1) A, with O(d^{1+α}·log d) rows
HIGH ERROR SETTING
• A” ≈k A
• C” ≈k C
• τ’ ≈k τ
• A’ ≈O(1) A, with O(k·d^{1+α}·log d) rows
ACCURACY BOOSTING
• Can reduce any error k in O(m + k·d^{ω+α}) time
• All intermediate steps can have large (constant) error

A → A” → A’
OUTLINE
• Matrix sketches
• How? Existence
• Samples → better samples
• Iterative algorithms
ONE STEP SKETCHING
• Obtain sketch of size poly(d)
• Error-correct to O(d·log d) rows in poly(d) time

A (m rows) → A’ (poly(d) rows, “a miracle happens”) → A” (d·log d rows)
WHAT WE WILL SHOW
• A number of iterative steps can give a similar result
• More work, less miraculous, more robust
• Key idea: find leverage scores
ALGORITHMIC PICTURE
A sketch A’, covariance matrix C’, and leverage scores τ’ with error k give all three with high accuracy in O(m + k·d^{ω+α}) time
OBSERVATIONS
• Error does not accumulate
• Can loop around many times
• Unused parameter: the size of A

Each loop: ≈k in, ≈O(1) out, with an O(k) size increase
OUR APPROACH
A → As: create a shorter matrix As s.t. the total leverage score of each block is close
LEVERAGE SCORE OF A BLOCK
• ℓ2² of a block’s leverage scores: Frobenius norm of A1:k·C^{-1/2}
• Preserved (≈) under random projection
• G: O(1)-by-k, so G·A1:k has O(1) rows

║τ1:k║2² = ║A1:k·C^{-1/2}║F² ≈ ║G·A1:k·C^{-1/2}║F²
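A minimal sketch of this compression step (the target row count t, standing in for the O(1) above, is an illustrative choice):

```python
import numpy as np

def compress_block(A_b, t=4, rng=np.random.default_rng()):
    """Replace a block A_b by t rows whose total leverage tracks the block's.

    With G scaled by 1/sqrt(t), E[||G A_b C^{-1/2}||_F^2] = ||A_b C^{-1/2}||_F^2.
    """
    G = rng.standard_normal((t, A_b.shape[0])) / np.sqrt(t)
    return G @ A_b
```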
SIZE REDUCTION
Recursing on As gives leverage scores that:
• Sum to ≤ d
• Can be used to row sample A
ALGORITHM
• Decrease size by d^α, recurse
• Bring back leverage scores
• Reduce error

Each loop: ≈k in, ≈O(1) out, with an O(k) size increase
PROBLEM
• Leverage scores in As are measured using Cs = AsᵀAs
• Already have a bound on the total; suffices to show ║x·C^{-1/2}║2 ≤ k·║x·Cs^{-1/2}║2
PROOF SKETCH
• Show ║Cs^{1/2}·x║2 ≤ k·║C^{1/2}·x║2
• Invert both sides
• Some issues when As has smaller rank than A

Need: ║x·C^{-1/2}║2 ≤ k·║x·Cs^{-1/2}║2
║Cs^{1/2}·x║2 ≤ k·║C^{1/2}·x║2

║Cs^{1/2}·x║2² = ║As·x║2²
  = Σb Σi (Gi,b·Ab·x)²
  ≤ Σb Σi ║Gi,b║2²·║Ab·x║2²
  ≤ maxb,i ║Gi,b║2² · ║Ax║2²
  ≤ O(k·log n)·║Ax║2²
(b ranges over the blocks of As)
P=1, OR ARBITRARY P
• The same approach can still work
• p-norm leverage scores
• Need: a well-conditioned basis U for the column space

║Ax║p ≈ ║A’x║p for all x
QUALITY OF BASIS (P=1)
• Quality of U: maximum distortion in the dual norm: β = maxx≠0 ║Ux║∞ / ║x║∞
• Analog of leverage scores: τi = β·║Ui,:║1
• Total number of rows: β·║U║1
BASIS CONSTRUCTION
• Basis via a linear transform: U = A·C
• Compute ║Ui║1 using p-stable distributions (Indyk `06) instead of JL

Same loop as before, now tracking (C1, U) and τ’: ≈k in, ≈O(1) out, with an O(k) size increase
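A hedged sketch of the Indyk-style ℓ1 estimation this refers to: for a vector c of i.i.d. standard Cauchy entries, ⟨u, c⟩ is distributed as ║u║1 times a standard Cauchy, so a median over a few independent sketches estimates ║Ui║1 (the trial count t is an illustrative choice):

```python
import numpy as np

def l1_row_norms(U, t=25, rng=np.random.default_rng()):
    """Estimate ||U_i||_1 for every row via 1-stable (Cauchy) sketches."""
    Csk = rng.standard_cauchy((U.shape[1], t))   # t independent Cauchy vectors
    # The median of |standard Cauchy| is 1, so no extra rescaling is needed.
    return np.median(np.abs(U @ Csk), axis=1)
```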
ITERATIVE ALGORITHM FOR P=1
• C1 = C^{-1/2}, the ℓ2 basis
• Quality of U = A·C1: β·║U║1 = n^{1/2}·d
• Too coarse for a single step, but good enough to iterate
• n approaches poly(d) quickly
• Need to run the ℓ2 algorithm for C
SUMMARY
                         p=2: cost for d·log d rows   p=1: cost        p=1: size
Sohler & Woodruff `11                                 n·d^{ω−1+α}      d^3.5
Drineas et al. `12       n·d·log d + d^ω
Clarkson et al. `12                                   n·d·log d        d^4.5·log^1.5 d
Clarkson & Woodruff `12  m + d^3·log²d                m + d^7          d^8
Mahoney & Meng `12       m + d^3·log d                m·log n + d^8    d^3.5
Nelson & Nguyen `12      m + d^ω                      same as above
This paper               m + d^{ω+α}                  m + d^{ω+α}      d^3.66

• Robust steps yield algorithms
• ℓ2: more complicated than sketching
• Smaller overhead for p-norms
FUTURE WORK
• What are leverage scores???
• Iterative low-rank approximation?
• Better p-norm leverage scores?
• More streamlined view of the projections in our algorithm?
• Empirical evaluation?
THANK YOU!
Questions?