Randomized Regression in Parallel and Distributed Environments
Michael W. Mahoney
Stanford University
(For more info, see: http://cs.stanford.edu/people/mmahoney/
or Google on “Michael Mahoney”)
November 2013
Motivation: larger-scale graph analytics
(Leskovec, Lang, Dasgupta, and Mahoney, “Community Structure in Large Networks,” Internet Mathematics, 2009.)
[Figures: “How people think of networks” vs. “What real networks look like.”]
True on 10^3-node graphs: meaning even small graphs have “bad” properties
True on 10^6-node graphs: where many/most real applications are interested
True on 10^9-node graphs: presumably, but we don’t have tools to test it
Goal:
Go beyond “simple statistics,” e.g., triangles-closing and degree distribution
Develop more sophisticated “interactive analytics” methods for large-scale data
Goal: very large-scale “vector space analytics”
Small-scale and medium-scale:
Model data by graphs and matrices
Compute eigenvectors, correlations, etc. in RAM
Very large-scale:
Model data with flat tables and the relational model
Compute with join/select and other “counting” in, e.g., Hadoop
Can we “bridge the gap” and do “vector space computations” at very large scale?
Not obviously yes: exactly computing eigenvectors, correlations, etc. is subtle and uses lots of communication.
Not obviously no: the lesson from random sampling algorithms is that you can get an ε-approximation of the optimum with very few samples.
Why randomized matrix algorithms?
Traditional matrix algorithms (direct & iterative methods, interior point, simplex, etc.) are designed to work in RAM, and their performance is measured in floating-point operations per second (FLOPS).
Traditional algorithms are NOT well-suited for:
- problems that are very large
- distributed or parallel computation
- when communication is a bottleneck
- when the data must be accessed via “passes”
Randomized matrix algorithms are:
- faster: better theory
- simpler: easier to implement
- implicitly regularizing: noise in the algorithm avoids overfitting
- inherently parallel: exploiting modern computer architectures
- more scalable: to modern massive data sets
Over-determined/over-constrained regression
An ℓp regression problem is specified by a design matrix A ∈ R^{m×n}, a response vector b ∈ R^m, and a norm ‖·‖_p:

    minimize_{x ∈ R^n} ‖Ax − b‖_p.

Assume m ≫ n, i.e., many more “constraints” than “variables.” Given an ε > 0, find a (1 + ε)-approximate solution x̂ in relative scale, i.e.,

    ‖Ax̂ − b‖_p ≤ (1 + ε) ‖Ax* − b‖_p,

where x* is a/the optimal solution.
p = 2: Least Squares Approximation: very widely used, but highly non-robust to outliers.
p = 1: Least Absolute Deviations: improved robustness, but at the cost of increased complexity.
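To make the two objectives concrete, here is a minimal NumPy/SciPy sketch (illustrative, not from the slides; the data, sizes, and the LP formulation of the ℓ1 problem are standard choices) that solves both on synthetic data with a few gross outliers:

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    m, n = 1000, 5
    A = rng.standard_normal((m, n))
    b = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)
    b[:10] += 100.0                                   # a few gross outliers

    # p = 2: least squares (dense QR/SVD-based solve)
    x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

    # p = 1: least absolute deviations as the linear program
    #   minimize sum(t)  subject to  -t <= Ax - b <= t
    c = np.concatenate([np.zeros(n), np.ones(m)])
    A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
    b_ub = np.concatenate([b, -b])
    bounds = [(None, None)] * n + [(0, None)] * m
    x_lad = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds).x[:n]

On data like this, x_lad is far less affected by the outlier rows than x_ls, which is the robustness trade-off described above.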
Strongly rectangular data
Some examples:
example          m                                       n
SNP              number of SNPs (10^6)                   number of subjects (10^3)
TinyImages       number of pixels in each image (10^3)   number of images (10^8)
PDE              number of degrees of freedom            number of time steps
sensor network   size of sensing data                    number of sensors
NLP              number of words and n-grams             number of principal components
More generally:
Over-constrained ℓ1/ℓ2 regression is a good model for implementing other matrix algorithms (low-rank approximation, matrix completion, generalized linear models, etc.) in large-scale settings.
Large-scale environments and how they scale

Shared memory
- cores: [10, 10^3]
- memory: [100GB, 100TB]

Message passing
- cores: [200, 10^5]
- memory: [1TB, 1000TB]
- CUDA cores: [5×10^4, 3×10^6]
- GPU memory: [500GB, 20TB]

MapReduce
- cores: [40, 10^5]
- memory: [240GB, 100TB]
- storage: [100TB, 100PB]
Two important notions: leverage and condition
(Mahoney, “Randomized Algorithms for Matrices and Data,” FnTML, 2011.)

Statistical leverage. (Think: eigenvectors. Important for low-precision.)
- The statistical leverage scores of A (assume m ≫ n) are the diagonal elements of the projection matrix onto the column span of A.
- They equal the ℓ2-norms-squared of the rows of any orthogonal basis spanning the column space of A.
- They measure:
  - how well-correlated the singular vectors are with the canonical basis
  - which constraints have the largest “influence” on the LS fit
  - a notion of “coherence” or “outlierness”
- Computing them exactly is as hard as solving the LS problem.

Condition number. (Think: eigenvalues. Important for high-precision.)
- The ℓ2-norm condition number of A is κ(A) = σ_max(A)/σ⁺_min(A).
- κ(A) bounds the number of iterations; for ill-conditioned problems (e.g., κ(A) ≈ 10^6 ≫ 1), convergence is very slow.
- Computing κ(A) is generally as hard as solving the LS problem.

These are for the ℓ2-norm. Generalizations exist for the ℓ1-norm.
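For matrices that fit in RAM, both quantities can be computed directly from a thin QR or an SVD; a minimal NumPy sketch (illustrative, not from the slides):

    import numpy as np

    def leverage_scores(A):
        # Thin QR: columns of Q are an orthonormal basis for the column span of A (m >> n).
        Q, _ = np.linalg.qr(A)
        # Leverage scores = diagonal of the projection Q @ Q.T = squared row norms of Q.
        return np.sum(Q**2, axis=1)

    def l2_condition_number(A):
        s = np.linalg.svd(A, compute_uv=False)
        return s[0] / s[-1]        # sigma_max / sigma_min, assuming full column rank

Both routines cost about as much as solving the LS problem itself, which is why the fast algorithms below rely on approximations.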
Meta-algorithm for ℓ2-norm regression (1 of 3)
(Drineas, Mahoney, etc., 2006, 2008, etc., starting with SODA 2006; Mahoney FnTML, 2011.)

1: Using the ℓ2 statistical leverage scores of A, construct an importance sampling distribution {p_i}_{i=1}^m.
2: Randomly sample a small number of constraints according to {p_i}_{i=1}^m to construct a subproblem.
3: Solve the ℓ2-regression problem on the subproblem.

A naïve version of this meta-algorithm gives a (1 + ε) relative-error approximation in roughly O(mn^2/ε) time (DMM 2006, 2008). (Ugh.)
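A minimal serial sketch of the three steps (illustrative; the sample size s and the standard importance-sampling rescaling of the selected rows are choices not spelled out on the slide):

    import numpy as np

    def sample_and_solve(A, b, s, rng=np.random.default_rng(0)):
        m, _ = A.shape
        Q, _ = np.linalg.qr(A)                            # exact leverage scores via thin QR
        lev = np.sum(Q**2, axis=1)
        p = lev / lev.sum()                               # step 1: sampling distribution {p_i}
        idx = rng.choice(m, size=s, replace=True, p=p)    # step 2: sample s constraints
        w = 1.0 / np.sqrt(s * p[idx])                     # rescale rows to keep the estimate unbiased
        x_tilde, *_ = np.linalg.lstsq(w[:, None] * A[idx], w * b[idx], rcond=None)
        return x_tilde                                    # step 3: solve the small subproblem

Using exact leverage scores, as here, is what makes the naïve version slow; the “fast” variants on the next slides replace this step.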
Meta-algorithm for ℓ2-norm regression (2 of 3)
(Drineas, Mahoney, etc., 2006, 2008, etc., starting with SODA 2006; Mahoney FnTML, 2011.)

Randomly sample high-leverage constraints.
Solve the subproblem.

(In many moderately large-scale applications, one uses “ℓ2 objectives,” not because they are “right,” but because other things are even more expensive.)
Meta-algorithm for ℓ2-norm regression (3 of 3)
(Drineas, Mahoney, etc., 2006, 2008, etc., starting with SODA 2006; Mahoney FnTML, 2011.‡‡)

We can make this meta-algorithm “fast” in RAM. It runs in O(mn log n/ε) time in RAM if:
- we perform a Hadamard-based random projection and sample uniformly in a randomly rotated basis, or
- we quickly compute approximations to the statistical leverage scores and use those as an importance sampling distribution.

We can make this meta-algorithm “high precision” in RAM. It runs in O(mn log n log(1/ε)) time in RAM if:
- we use the random projection/sampling basis to construct a preconditioner and couple it with a traditional iterative method.
‡‡(Mahoney, “Randomized Algorithms for Matrices and Data,” FnTML, 2011.)
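A sketch of the second “fast” route (quickly approximating the leverage scores), using a Gaussian sketch in place of the Hadamard-based projection mentioned above; the oversampling factor is an illustrative choice:

    import numpy as np

    def approx_leverage_scores(A, oversample=4, rng=np.random.default_rng(0)):
        m, n = A.shape
        r = oversample * n
        S = rng.standard_normal((r, m)) / np.sqrt(r)   # Gaussian sketching matrix (r x m)
        _, R = np.linalg.qr(S @ A)                     # R factor of the small r x n sketch
        AR_inv = np.linalg.solve(R.T, A.T).T           # rows of A @ inv(R) approximate an orthonormal basis
        return np.sum(AR_inv**2, axis=1)               # approximate leverage scores

For the “high precision” route, the same kind of sketch is instead used to build a preconditioner and is coupled with an iterative solver, which is the LSRN approach discussed below.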
Randomized regression in RAM: Implementations
(Avron, Maymounkov, and Toledo, SISC, 32, 1217–1236, 2010.)
Conclusions:
Their randomized algorithm “beats Lapack’s direct dense least-squares solver by a large margin on essentially any dense tall matrix.”
These results “suggest that random projection algorithms should be incorporated into future versions of Lapack.”
Randomized regression in RAM: Human Genetics
(Paschou et al., PLoS Gen ’07; Paschou et al., J Med Gen ’10; Drineas et al., PLoS ONE ’10; Javed et al., Ann Hum Gen ’11.)
Computing large rectangular regressions/SVDs/CUR decompositions:
On commodity hardware (e.g., a 4GB RAM, dual-core laptop), using MatLab 7.0 (R14), the computation of the SVD of the dense 2,240 × 447,143 matrix A takes about 20 minutes.
Computing this SVD is not a one-liner; we cannot load the whole matrix into RAM (MatLab runs out of memory).
Instead, compute the SVD of AA^T.
In a similar experiment, compute 1,200 SVDs on matrices of dimensions (approx.) 1,200 × 450,000 (roughly, a full leave-one-out cross-validation experiment).
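A minimal sketch of that AA^T route (illustrative; the Gram-matrix trick trades some numerical accuracy, since it squares the condition number, for an eigenproblem of size only m × m):

    import numpy as np

    def svd_via_gram(A):
        # A is short and fat (e.g., 2,240 x 447,143); the Gram matrix is only m x m.
        G = A @ A.T
        lam, U = np.linalg.eigh(G)                  # eigendecomposition of the small Gram matrix
        order = np.argsort(lam)[::-1]
        lam, U = lam[order], U[:, order]
        sigma = np.sqrt(np.clip(lam, 0.0, None))    # singular values of A
        return U, sigma                             # if needed: V = A.T @ U / sigma (for sigma > 0)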
A retrospective
Randomized matrix algorithms:
BIG success story in high-precision scientific computing applications and large-scale statistical data analysis!
Can they really be implemented in parallel and distributed environments for LARGE-scale statistical data analysis?
1: Perform a Gaussian random projection
2: Construct a preconditioner from the subsample
3: Iterate with a “traditional” iterative algorithm
Why a Gaussian random projection? Since it
provides the strongest results for conditioning,
uses level 3 BLAS on dense matrices and can be generated super fast,
speeds up automatically on sparse matrices and fast operators,
still works (with an extra “allreduce” operation) when A is partitioned along its bigger dimension.
Although it is “slow” (compared with “fast” Hadamard-based projections in terms of FLOPS), it allows for better communication properties.
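A minimal serial sketch of steps 1-3 above (Gaussian projection, preconditioner from the SVD of the sketch, then a standard iterative solver); the oversampling factor gamma = 2 and the use of SciPy's LSQR are illustrative choices, and the parallel implementation instead distributes the projection and the matrix-vector products:

    import numpy as np
    from scipy.sparse.linalg import lsqr

    def lsrn_like_solve(A, b, gamma=2.0, rng=np.random.default_rng(0)):
        m, n = A.shape                                   # strongly over-determined: m >> n
        s = int(np.ceil(gamma * n))
        G = rng.standard_normal((s, m))                  # 1: Gaussian random projection
        _, Sigma, Vt = np.linalg.svd(G @ A, full_matrices=False)
        N = Vt.T / Sigma                                 # 2: right preconditioner; A @ N is well conditioned w.h.p.
        y = lsqr(A @ N, b, atol=1e-10, btol=1e-10)[0]    # 3: iterate on the preconditioned system
        return N @ y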
Implementation of LSRN
(Meng, Saunders, and Mahoney 2011)
Shared memory (C++ with MATLAB interface)
- Multi-threaded ziggurat random number generator (Marsaglia and Tsang 2000), generating 10^9 numbers in less than 2 seconds on 12 CPU cores.
- A naïve implementation of multi-threaded dense-sparse matrix multiplications.

Message passing (Python)
- Single-threaded BLAS for matrix-matrix and matrix-vector products.
- Multi-threaded BLAS/LAPACK for SVD.
- Using the Chebyshev semi-iterative method (Golub and Varga 1961) instead of LSQR.
Table: Real-world problems and corresponding running times. DGELSD doesn’t take advantage of sparsity. Though MATLAB’s backslash may not give the min-length solutions to rank-deficient or under-determined problems, we still report its running times. Blendenpik either doesn’t apply to rank-deficient problems or runs out of memory (OOM). LSRN’s running time is mainly determined by the problem size and the sparsity.
Iterating with LSQR
(Paige and Saunders 1982)
Code snippet (Python):
u = A.matvec(v) - alpha * u
beta = sqrt(comm.allreduce(np.dot(u, u)))
...
v = comm.allreduce(A.rmatvec(u)) - beta * v
Cost per iteration:
two matrix-vector multiplications
two cluster-wide synchronizations
Iterating with the Chebyshev semi-iterative (CS) method
(Golub and Varga 1961)
The strong concentration results on σ_max(AN) and σ_min(AN) enable use of the CS method, which requires an accurate bound on the extreme singular values to work efficiently.
Code snippet (Python):
v = comm.allreduce(A.rmatvec(r)) - beta * v
x += alpha * v
r -= alpha * A.matvec(v)
Cost per iteration:
two matrix-vector multiplications
one cluster-wide synchronization
LSQR vs. CS on an Amazon EC2 cluster
(Meng, Saunders, and Mahoney 2011)
solver          Nnodes  Nprocesses  m     n    nnz    Niter  Titer  Ttotal
LSRN w/ CS      2       4           1024  4e6  8.4e7  106    34.03  170.4
LSRN w/ LSQR    2       4           1024  4e6  8.4e7  84     41.14  178.6
LSRN w/ CS      5       10          1024  1e7  2.1e8  106    50.37  193.3
LSRN w/ LSQR    5       10          1024  1e7  2.1e8  84     68.72  211.6
LSRN w/ CS      10      20          1024  2e7  4.2e8  106    73.73  220.9
LSRN w/ LSQR    10      20          1024  2e7  4.2e8  84     102.3  249.0
LSRN w/ CS      20      40          1024  4e7  8.4e8  106    102.5  255.6
LSRN w/ LSQR    20      40          1024  4e7  8.4e8  84     137.2  290.2
Table: Test problems on an Amazon EC2 cluster and corresponding running times in seconds. Though the CS method takes more iterations, it actually runs faster than LSQR by making only one cluster-wide synchronization per iteration.
Outline
1 Randomized regression in RAM
2 Solving ℓ2 regression using MPI
3 Solving ℓ1 and quantile regression on MapReduce
Comparison of three types of regression problems
                ℓ2 regression    ℓ1 regression    quantile regression
estimation      mean             median           quantile τ
loss function   z^2              |z|              ρ_τ(z) = τz if z ≥ 0; (τ − 1)z if z < 0
formulation     ‖Ax − b‖_2^2     ‖Ax − b‖_1       ρ_τ(Ax − b)
is a norm?      yes              yes              no
These measure different functions of the response variable given certain values of the predictor variables.
[Figures: the ℓ2, ℓ1, and quantile regression loss functions plotted over [−1, 1].]
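A minimal sketch of the three loss functions from the table above (illustrative; the quantile loss is the standard “pinball” loss):

    import numpy as np

    def l2_loss(z):
        return z**2

    def l1_loss(z):
        return np.abs(z)

    def quantile_loss(z, tau):
        # rho_tau(z) = tau*z for z >= 0 and (tau - 1)*z for z < 0; tau = 0.5 recovers 0.5*|z|.
        return np.where(z >= 0, tau * z, (tau - 1.0) * z)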
“Everything generalizes” from ℓ2 regression to ℓ1 regression
(But “everything generalizes messily,” since ℓ1 is “worse” than ℓ2.)

A matrix U ∈ R^{m×n} is (α, β, p = 1)-conditioned if |U|_1 ≤ α and ‖x‖_∞ ≤ β‖Ux‖_1, ∀x; U is an ℓ1-well-conditioned basis if α, β = poly(n).
Define the ℓ1 leverage scores of an m × n matrix A, with m > n, as the ℓ1-norms-squared of the rows of an ℓ1-well-conditioned basis of A.
Define the ℓ1-norm condition number of A, denoted by κ1(A), as:

    κ1(A) = σ_1^max(A) / σ_1^min(A) = max_{‖x‖_2=1} ‖Ax‖_1 / min_{‖x‖_2=1} ‖Ax‖_1.

This implies: σ_1^min(A) ‖x‖_2 ≤ ‖Ax‖_1 ≤ σ_1^max(A) ‖x‖_2, ∀x ∈ R^n.
Making ℓ1 and quantile work to low and high precision
Finding a good basis (to get a low-precision solution):
First and third quartiles of relative errors in 1-, 2-, and ∞-norms on a data set of size 10^10 × 15. CT (and FCT) clearly performs the best. GT is worse but follows closely. NOCD and UNIF are much worse. (Similar results for size 10^10 × 100 if SPC2 is used.)
The Method of Inscribed Ellipsoids (MIE)
MIE works similarly to the bisection method, but in a higher dimension.
Why do we choose MIE?
Least number of iterations
Initialization using all the subsampled solutions
Multiple queries per iteration
At each iteration, we need to compute (1) a function value and (2) a gradient/subgradient.
For each subsampled solution, we have a hemisphere that contains the optimal solution.
We use all these solution hemispheres to construct the initial search region.
Computing multiple f and g in a single pass
On MapReduce, the IO cost may dominate the computational cost, which calls for algorithms that do more computation in a single pass.
Single query:

    f(x) = ‖Ax‖_1,    g(x) = A^T sign(Ax).

Multiple queries:

    F(X) = sum(|AX|, 0),    G(X) = A^T sign(AX).
An example on a 10-node Hadoop cluster:
A: 10^8 × 50, 118.7 GB.
A single query: 282 seconds.
100 queries in a single pass: 328 seconds.
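A dense NumPy sketch of the single- and multi-query formulas above (illustrative; on MapReduce the products with A and A^T are the per-pass work, and X holds one query point per column):

    import numpy as np

    def f_and_g(A, x):
        Ax = A @ x
        return np.abs(Ax).sum(), A.T @ np.sign(Ax)    # f(x) and a subgradient g(x)

    def F_and_G(A, X):
        AX = A @ X                                    # one pass over A serves all queries at once
        F = np.abs(AX).sum(axis=0)                    # F(X) = sum(|AX|, 0)
        G = A.T @ np.sign(AX)                         # G(X) = A^T sign(AX)
        return F, G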
MIE with sampling initialization and multiple queries
[Figures: relative error versus number of iterations. Panel (d), size 10^6 × 20: MIE, MIE w/ multiple queries, MIE w/ sampling initialization, and MIE w/ sampling initialization and multiple queries. Panel (e), size 5.24e9 × 15: (f − f*)/f* for standard IPCPM versus proposed IPCPM.]
Comparing different MIE methods on large/LARGE ℓ1 regression problems.
Conclusion
Randomized regression in parallel & distributed environments: different design principles for high-precision versus low-precision
- Least Squares Approximation
- Least Absolute Deviations
- Quantile Regression

Algorithms require more computation than traditional matrix algorithms, but they have better communication profiles.
- On MPI: Chebyshev semi-iterative method vs. LSQR.
- On MapReduce: Method of inscribed ellipsoids with multiple queries.
- Look beyond FLOPS in parallel and distributed environments.