Jan 04, 2016
Iterative Row Sampling
Richard Peng
Joint work with Mu Li (CMU) and Gary Miller (CMU)
CMU MIT
OUTLINE
• Matrix sketches
• Existence
• Samples → better samples
• Iterative algorithms
DATA
• n-by-d matrix A, m entries
• Columns: data
• Rows: attributes

Goal:
• Classification/clustering
• Identify patterns
• Interpret new data
LINEAR MODEL
• Can add/scale data points
• x: coefficients; combination: Ax

Ax = x1·A:,1 + x2·A:,2 + x3·A:,3
PROBLEM
Interpret a new data point b as a combination of known ones: b ≈ Ax
REGRESSION
• Express as combination of current examples
• Regression: minx ║Ax–b║p
• p=2: least squares
• p=1: compressive sensing
• ║x║2: Euclidean norm of x
• ║x║1: sum of absolute values
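As a concrete reference for the p=2 case, here is a minimal least-squares solve in numpy (the data is synthetic, purely for illustration):

```python
import numpy as np

# Synthetic tall problem: many attributes (rows), few examples (columns).
rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# p = 2: least squares, min_x ||Ax - b||_2
x, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(A @ x - b))
```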
VARIANTS OF COMPRESSIVE SENSING
•minx ║Ax-b║1 +║x║1
•minx ║Ax-b║2 +║x║1
•minx ║x║1 s.t. Ax=b
•minx ║Ax║1 s.t. Bx = y
•minx ║Ax-b║1 + ║Bx - y║1
All similar to minx║Ax-b║1
SIMPLIFIED
• minx ║Ax–b║p = minx ║[A, b][x; –1]║p
• Regression is equivalent to min ║Ax║p with one entry of x fixed
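A quick numerical check of this reduction (synthetic data; the two objectives agree for every x):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 4))
b = rng.standard_normal(50)

x, *_ = np.linalg.lstsq(A, b, rcond=None)
Ab = np.hstack([A, b[:, None]])        # the augmented matrix [A, b]
y = np.append(x, -1.0)                 # the vector [x; -1]

# [A, b] @ [x; -1] equals Ax - b, so the two objectives match.
print(np.allclose(Ab @ y, A @ x - b))  # True
```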
‘BIG’ DATA POINTS
• Each data point has many attributes
• #rows (n) >> #columns (d)
• Examples: genetic data, time series (videos)
• Reverse case (d >> n) also common: images + SIFT
FASTER?
Find a smaller, equivalent A’: a matrix sketch
ROW SAMPLING
• Pick some rows of A to be A’
• How to pick? Randomly
SHORTER EQUIVALENT
• Find a shorter A’ that preserves the answer: ║Ax║p ≈1+ε ║A’x║p for all x
• Run the algorithm on A’; the answer remains good for A

Simplified error notation ≈: a ≈k b if there exist k1, k2 s.t. k2/k1 ≤ k and k1·a ≤ b ≤ k2·a
OUTLINE
• Matrix sketches
• How? Existence
• Samples → better samples
• Iterative algorithms
SKETCHES EXIST
• Linear sketches: A’ = SA
• [Drineas et al. `12]: row sampling, one non-zero in each row of S
• [Clarkson-Woodruff `12]: S = CountSketch, one non-zero per column

║Ax║p ≈ ║A’x║p for all x
SKETCHES EXIST
                         p=2          p=1
Dasgupta et al. `09                   d^2.5
Magdon-Ismail `10        d·log²d
Sohler & Woodruff `11                 d^3.5
Drineas et al. `12       d·log d
Clarkson et al. `12                   d^4.5·log^1.5 d
Clarkson & Woodruff `12  d²·log d     d^8
Mahoney & Meng `12       d²           d^3.5
Nelson & Nguyen `12      d^{1+α}
This paper               d·log d      d^3.66

Hidden: runtime costs, ε^{-2} dependency
WHY IS ≈D POSSIBLE?
• ║Ax║2² = xᵀAᵀAx
• AᵀA: d-by-d matrix
• Any factorization (e.g. QR) of AᵀA suffices as A’

║Ax║p ≈ ║A’x║p for all x
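A small demonstration of why d rows always suffice for p=2: the R factor of a thin QR of A satisfies AᵀA = RᵀR, so ║Ax║2 = ║Rx║2 exactly (synthetic data below):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((1000, 8))

# Thin QR: R is d-by-d and satisfies A^T A = R^T R.
R = np.linalg.qr(A, mode='r')

x = rng.standard_normal(8)
print(np.linalg.norm(A @ x), np.linalg.norm(R @ x))  # identical norms
```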
AᵀA
• Covariance matrix
• Dot products of all pairs of columns (data)
• Covariance: cov(j1, j2) = Σi Ai,j1·Ai,j2
USE OF COVARIANCE MATRIX
• Clustering: ℓ2 distances of all pairs given by C
• Kernel methods: all-pairs dot products suffice for many models

C = AᵀA
OTHER USE OF COVARIANCE
• Covariance of attributes used to tune parameters
• Images + SIFT: many data points, few attributes
• http://www.image-net.org/: 14,197,122 images, 1000 SIFT features
HOW EXPENSIVE IS THIS?
• d² dot products of length-n vectors
• Total: O(nd²)
• Faster: O(nd^{ω−1})
• Expensive: nd² > nd ≥ m
EQUIVALENT VIEW OF SKETCHES
• Approximate covariance matrix: C’ = (A’)ᵀA’
• ║Ax║2 ≈ ║A’x║2 is the same as C ≈ C’
APPLICATION OF SKETCHES
• A’: n’ rows
• d² dot products of length-n’ vectors
• Total cost: O(n’·d^{ω−1})
SKETCHES IN INPUT SPARSITY TIME
• Need: cost of computing C’ < cost of computing C = AᵀA
• Two goals:
• n’ small
• A’ found efficiently
COST AND QUALITY OF A’
                         p=2                          p=1
                         cost            size         cost             size
Dasgupta et al. `09                                   n·d^5            d^2.5
Magdon-Ismail `10        n·d²/log d      d·log²d
Sohler & Woodruff `11                                 n·d^{ω−1+α}      d^3.5
Drineas et al. `12       n·d·log d+d^ω   d·log d
Clarkson et al. `12                                   n·d·log d        d^4.5·log^1.5 d
Clarkson & Woodruff `12  m               d²·log d     m + d^7          d^8
Mahoney & Meng `12       m               d²           m·log n + d^8    d^3.5
Nelson & Nguyen `12      m               d^{1+α}      same as above
This paper               m + d^{ω+α}     d·log d      m + d^{ω+α}      d^3.66
OUTLINE
• Matrix sketches
• How? Existence
• Samples → better samples
• Iterative algorithms
PREVIOUS APPROACHES
• Go to poly(d) rows directly
• Projection to obtain key info, or the sketch itself

A (m rows) → A’ (poly(d) rows): “a miracle happens”
OUR MAIN APPROACH
• Utilize the robustness of sketches, covariance matrices, and sampling
• Iteratively reduce errors and sizes

A → A” → A’
BETTER ALGORITHM FOR P=2
                         p=2                          p=1
                         cost            size         cost             size
Dasgupta et al. `09                                   n·d^5            d^2.5
Magdon-Ismail `10        n·d²/log d      d·log²d
Sohler & Woodruff `11                                 n·d^{ω−1+α}      d^3.5
Drineas et al. `12       n·d·log d+d^ω   d·log d
Clarkson et al. `12                                   n·d·log d        d^4.5·log^1.5 d
Clarkson & Woodruff `12  m               d²·log d     m + d^7          d^8
Mahoney & Meng `12       m               d²           m·log n + d^8    d^3.5
Nelson & Nguyen `12      m               d^{1+α}      same as above
This paper               m + d^{ω+α}     d·log d      m + d^{ω+α}      d^3.66
COMPOSING SKETCHES
A (n rows) → A” (n’ = d^{1+α} rows, cost O(m)) → A’ (d·log d rows, cost O(n’·d·log d + d^ω))

Total cost: O(m + n’·d·log d + d^ω) = O(m + d^ω)
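Schematically, the composition looks like the sketch below (the function names are placeholders for any sketches with these cost/size guarantees, not the paper's API):

```python
def compose_sketches(A, cheap_sketch, careful_sketch):
    """Two-stage sketching: coarse, then fine."""
    A2 = cheap_sketch(A)     # n rows -> n' = d^{1+a} rows, cost O(m)
    A1 = careful_sketch(A2)  # n' rows -> O(d log d) rows, cost O(n' d log d + d^w)
    return A1                # ||Ax|| ~ ||A1 x||, with the errors multiplying
```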
ACCUMULATION OF ERRORS
A (n rows) → A” (n’ = d^{1+α} rows) → A’ (d·log d rows)

║Ax║2 ≈k ║A”x║2 and ║A”x║2 ≈k’ ║A’x║2, so ║Ax║2 ≈k·k’ ║A’x║2
ACCUMULATION OF ERRORS
║Ax║2 ≈k·k’ ║A’x║2
• Final error: product of both errors
• Dependency of error in cost: usually ε^{-2} or more for 1±ε error
• [Avron & Toledo `11]: only the final step needs to be accurate
• Idea: compute sketches indirectly
ROW SAMPLING
• Pick some rows of A to be A’
• How to pick? Randomly
ARE ALL ROWS EQUAL?
• Example: a matrix with one non-zero row, or a column with a single entry
• ║A·[1; 0; …; 0]║p ≠ 0, so the row holding that entry must be kept
ROW SAMPLING
• τ’: weights on rows, forming a sampling distribution
• Pick a number of rows independently from this distribution; rescale them to form A’
MATRIX CHERNOFF BOUNDS
• Sufficient property of τ’
• τ: statistical leverage scores
• If τ’ ≥ τ, then ║τ’║1·log d (rescaled) rows suffice for A’ ≈ A
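A hedged numpy sketch of this sampling step for p=2, assuming overestimates tau ≥ τ are already in hand: draw rows i.i.d. proportionally to tau and rescale so that E[(A’)ᵀA’] = AᵀA.

```python
import numpy as np

def row_sample(A, tau, N, rng=np.random.default_rng()):
    """Sample N rows of A proportionally to tau and rescale them."""
    p = tau / tau.sum()                       # sampling distribution from tau'
    idx = rng.choice(len(tau), size=N, p=p)   # N independent draws
    scale = 1.0 / np.sqrt(N * p[idx])         # unbiased rescaling for p = 2
    return A[idx] * scale[:, None]
```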
STATISTICAL LEVERAGE SCORES
• Studied in statistics since the 70s
• Importance of rows
• Leverage score of row i: τi = Ai·(AᵀA)^{-1}·Aiᵀ
• Key fact: ║τ║1 = rank ≤ d, so ║τ’║1·log d = d·log d rows
COMPUTING LEVERAGE SCORES
τi = Ai·(AᵀA)^{-1}·Aiᵀ = Ai·C^{-1}·Aiᵀ
• AᵀA: the covariance matrix C
• Given C^{-1}, each τi can be computed in O(d²) time
• Total cost: O(nd² + d^ω)
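Computed directly from the definition, this is a few lines of numpy (a naive O(nd² + d^ω) implementation that assumes A has full column rank):

```python
import numpy as np

def leverage_scores(A):
    """Exact leverage scores: tau_i = A_i (A^T A)^{-1} A_i^T."""
    C = A.T @ A                   # covariance matrix, d-by-d
    Cinv = np.linalg.inv(C)       # assumes full column rank
    # tau_i = A_i Cinv A_i^T for every row at once; sum(tau) = rank <= d.
    return np.einsum('ij,jk,ik->i', A, Cinv, A)
```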
COMPUTING LEVERAGE SCORES
τi = Ai·C^{-1}·Aiᵀ = ║Ai·C^{-1/2}║2²
• Squared 2-norm of the vector Ai·C^{-1/2}
• C^{-1/2} puts the rows in isotropic position
• Decorrelates the columns
ASIDE: WHAT IS LEVERAGE?
Geometric view:
• Rows define ‘energy’ directions
• Normalize so total energy is uniform
• τi: norm of row i after normalizing, Ai → Ai·C^{-1/2}
ASIDE: WHAT IS LEVERAGE?
How to interpret statistical leverage scores?
• Statistics ([Hoaglin-Welsh `78], [Chatterjee-Hadi `86]):
• Influence on the data set
• Likelihood of being an outlier
• Uniqueness of the row
ASIDE: WHAT IS LEVERAGE?
High leverage score:
• Key attribute?
• Outlier (measurement error)?
ASIDE: WHAT IS LEVERAGE?
My current view (motivated by graph sparsification):
• Sampling probabilities
• Use them to find sketches
COMPUTING LEVERAGE SCORES
τi = ║Ai·C^{-1/2}║2²
• Only need τ’ ≥ τ
• Can use approximations after scaling them up
• Error leads to larger ║τ’║1
DIMENSIONALITY REDUCTION
║x║2² ≈jl ║Gx║2²
• Johnson-Lindenstrauss transform
• G: d-by-O(1/α) Gaussian
• Error: jl = d^α
ESTIMATING LEVERAGE SCORES
τi = ║Ai·C^{-1/2}║2² ≈jl ║Ai·C^{-1/2}·G║2²
• G: d-by-O(1/α) Gaussian
• C^{-1/2}·G: d-by-O(1/α)
• Cost: O(α·nnz(Ai)) per row; total O(α·m + α·d²·log d)
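A hedged implementation of this estimator, using a Cholesky factor of C in place of C^{1/2} and k Gaussian columns (both are standard substitutions, not prescribed by the slides):

```python
import numpy as np

def approx_leverage_scores(A, k=20, rng=np.random.default_rng()):
    """JL-style leverage estimates: tau_i ~ ||A_i C^{-1/2} G||_2^2 / k."""
    d = A.shape[1]
    L = np.linalg.cholesky(A.T @ A)     # C = L L^T, so C^{-1} = L^{-T} L^{-1}
    G = rng.standard_normal((d, k))
    M = np.linalg.solve(L.T, G)         # M = L^{-T} G, a d-by-k matrix
    S = A @ M                           # n-by-k sketch, O(k) work per nonzero
    return (S ** 2).sum(axis=1) / k     # unbiased estimate of each tau_i
```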
ESTIMATING LEVERAGE SCORES
• C ≈k C’ gives ║C^{-1/2}·x║2 ≈k ║C’^{-1/2}·x║2
• Uses C’ as a preconditioner for C
• Can also combine with JL

τi = ║Ai·C^{-1/2}║2² ≈ ║Ai·C’^{-1/2}║2²
ESTIMATING LEVERAGE SCORES
τi’ = ║Ai·C’^{-1/2}·G║2² ≈jl ║Ai·C’^{-1/2}║2² ≈k τi
• So (jl·k)·τ’ ≥ τ
• Total number of rows: ║jl·k·τ’║1 ≤ jl·k·║τ’║1 ≤ k·d^{1+α}
ESTIMATING LEVERAGE SCORES
• Quality of A’ does not depend on quality of τ’
• C ≈k C’ gives A’ ≈2 A with O(k·d^{1+α}) rows in O(m + d^ω) time
• (jl·k)·τ’ ≥ τ, and ║jl·k·τ’║1 ≤ jl·k·d = k·d^{1+α}
• Some fixable issues when n >>> d
SIZE REDUCTION
• A” ≈O(1) A
• C” ≈O(1) C
• τ’ ≈O(1) τ
• A’ ≈O(1) A, with O(d^{1+α}·log d) rows
HIGH ERROR SETTING
• A” ≈k A
• C” ≈k C
• τ’ ≈k τ
• A’ ≈O(1) A, with O(k·d^{1+α}·log d) rows
ACCURACY BOOSTING
• Can reduce any error k in O(m + k·d^{ω+α}) time
• All intermediate steps can have large (constant) error

A → A” → A’
OUTLINE
• Matrix sketches
• How? Existence
• Samples → better samples
• Iterative algorithms
ONE STEP SKETCHING
• Obtain sketch of size poly(d)
• Error-correct to O(d·log d) rows in poly(d) time

A (m rows) → A’ (poly(d) rows, “a miracle happens”) → A” (d·log d rows)
WHAT WE WILL SHOW
• A number of iterative steps can give a similar result
• More work, less miraculous, more robust
• Key idea: find leverage scores
ALGORITHMIC PICTURE
A sketch A’, covariance matrix C’, and leverage scores τ’ with error k give all three with high accuracy in O(m + k·d^{ω+α}) time
OBSERVATIONS
• Error does not accumulate
• Can loop around many times
• Unused parameter: the size of A

Each loop: ≈k in, ≈O(1) out, with an O(k) size increase
OUR APPROACH
A → As: create a shorter matrix As s.t. the total leverage score of each block is close
LEVERAGE SCORE OF A BLOCK
• ℓ2² of a block’s leverage scores: Frobenius norm of A1:k·C^{-1/2}
• Preserved (≈) under random projection
• G: O(1)-by-k, so G·A1:k has O(1) rows

║τ1:k║2² = ║A1:k·C^{-1/2}║F² ≈ ║G·A1:k·C^{-1/2}║F²
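A minimal sketch of this compression step (the target row count t, standing in for the O(1) above, is an illustrative choice):

```python
import numpy as np

def compress_block(A_b, t=4, rng=np.random.default_rng()):
    """Replace a block A_b by t rows whose total leverage tracks the block's.

    With G scaled by 1/sqrt(t), E[||G A_b C^{-1/2}||_F^2] = ||A_b C^{-1/2}||_F^2.
    """
    G = rng.standard_normal((t, A_b.shape[0])) / np.sqrt(t)
    return G @ A_b
```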
SIZE REDUCTION
Recursing on As gives leverage scores that:
• Sum to ≤ d
• Can be used to row sample A
ALGORITHM
• Decrease size by d^α, recurse
• Bring back leverage scores
• Reduce error

Each loop: ≈k in, ≈O(1) out, with an O(k) size increase
PROBLEM
• Leverage scores in As are measured using Cs = AsᵀAs
• Already have a bound on the total; suffices to show ║x·C^{-1/2}║2 ≤ k·║x·Cs^{-1/2}║2
PROOF SKETCH
• Show ║Cs^{1/2}·x║2 ≤ k·║C^{1/2}·x║2
• Invert both sides
• Some issues when As has smaller rank than A

Need: ║x·C^{-1/2}║2 ≤ k·║x·Cs^{-1/2}║2
║Cs^{1/2}·x║2 ≤ k·║C^{1/2}·x║2

║Cs^{1/2}·x║2² = ║As·x║2²
  = Σb Σi (Gi,b·Ab·x)²
  ≤ Σb Σi ║Gi,b║2²·║Ab·x║2²
  ≤ maxb,i ║Gi,b║2² · ║Ax║2²
  ≤ O(k·log n)·║Ax║2²
(b ranges over the blocks of As)
P=1, OR ARBITRARY P
• The same approach can still work
• p-norm leverage scores
• Need: a well-conditioned basis U for the column space

║Ax║p ≈ ║A’x║p for all x
QUALITY OF BASIS (P=1)
• Quality of U: maximum distortion in the dual norm: β = maxx≠0 ║Ux║∞ / ║x║∞
• Analog of leverage scores: τi = β·║Ui,:║1
• Total number of rows: β·║U║1
BASIS CONSTRUCTION
• Basis via a linear transform: U = A·C
• Compute ║Ui║1 using p-stable distributions (Indyk `06) instead of JL

Same loop as before, now tracking (C1, U) and τ’: ≈k in, ≈O(1) out, with an O(k) size increase
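A hedged sketch of the Indyk-style ℓ1 estimation this refers to: for a vector c of i.i.d. standard Cauchy entries, ⟨u, c⟩ is distributed as ║u║1 times a standard Cauchy, so a median over a few independent sketches estimates ║Ui║1 (the trial count t is an illustrative choice):

```python
import numpy as np

def l1_row_norms(U, t=25, rng=np.random.default_rng()):
    """Estimate ||U_i||_1 for every row via 1-stable (Cauchy) sketches."""
    Csk = rng.standard_cauchy((U.shape[1], t))   # t independent Cauchy vectors
    # The median of |standard Cauchy| is 1, so no extra rescaling is needed.
    return np.median(np.abs(U @ Csk), axis=1)
```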
ITERATIVE ALGORITHM FOR P=1
• C1 = C^{-1/2}, the ℓ2 basis
• Quality of U = A·C1: β·║U║1 = n^{1/2}·d
• Too coarse for a single step, but good enough to iterate
• n approaches poly(d) quickly
• Need to run the ℓ2 algorithm for C
SUMMARY
                         p=2: cost for d·log d rows   p=1: cost        p=1: size
Sohler & Woodruff `11                                 n·d^{ω−1+α}      d^3.5
Drineas et al. `12       n·d·log d + d^ω
Clarkson et al. `12                                   n·d·log d        d^4.5·log^1.5 d
Clarkson & Woodruff `12  m + d^3·log²d                m + d^7          d^8
Mahoney & Meng `12       m + d^3·log d                m·log n + d^8    d^3.5
Nelson & Nguyen `12      m + d^ω                      same as above
This paper               m + d^{ω+α}                  m + d^{ω+α}      d^3.66

• Robust steps yield algorithms
• ℓ2: more complicated than sketching
• Smaller overhead for p-norms
FUTURE WORK
• What are leverage scores???
• Iterative low-rank approximation?
• Better p-norm leverage scores?
• More streamlined view of the projections in our algorithm?
• Empirical evaluation?
THANK YOU!
Questions?