Dimensionality reduction techniques for large-scale optimization
Coralia Cartis (University of Oxford)
Joint with Jari Fowkes, Estelle Massart, Adilet Otemissov, Zhen Shao (Oxford), Lindon Roberts (ANU Canberra), Jan Fiala (NAG Ltd)
Research supported by the Alan Turing Institute for Data Science, NAG Ltd and NPL
Workshop on Mathematical Foundations of Optimization in Data Science November 24, 2020 (online)
Cantab Capital Institute for the Mathematics of Information
Johnson-Lindenstrauss Lemma and Random Embeddings
Definition [subspace embedding]: Let A ∈ ℝ^{n×d} be a(ny) real matrix with n ≫ d, and let ϵ_s ∈ (0,1]. Then S ∈ ℝ^{m×n} is an ϵ_s-subspace embedding for A if
(1 − ϵ_s)‖Ax‖₂² ≤ ‖SAx‖₂² ≤ (1 + ϵ_s)‖Ax‖₂² for all x ∈ ℝ^d.
Johnson-Lindenstrauss Lemma [Woodruff'14]: If S is a scaled Gaussian matrix with m = 𝒪(d |log δ_s| ϵ_s^{-2}), then S is an (oblivious) ϵ_s-subspace embedding for A with probability at least 1 − δ_s.
But note the high cost of forming SA: 𝒪(nd²).
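To make this concrete, here is a minimal NumPy sketch (illustrative only, not from the talk; the sizes n, d, m are assumed values) that forms a scaled Gaussian sketch SA and checks the distortion bound empirically:

```python
# Minimal check of the scaled-Gaussian subspace embedding (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10000, 20, 400                         # n >> d, sketch size m = O(d)
A = rng.standard_normal((n, d))
S = rng.standard_normal((m, n)) / np.sqrt(m)     # scaled Gaussian sketch
SA = S @ A                                       # forming SA costs O(mnd) flops

# empirically check (1 - eps_s) <= ||SAx||^2 / ||Ax||^2 <= (1 + eps_s)
ratios = []
for _ in range(200):
    x = rng.standard_normal(d)
    ratios.append(np.linalg.norm(SA @ x) ** 2 / np.linalg.norm(A @ x) ** 2)
print(min(ratios), max(ratios))                  # roughly within (1 +- eps_s) of 1
```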
Sparse Random Embeddings
Moving away from Gaussian sketching: uniformly sampling rows of A ⟶ fast, preserves sparsity.
But it does not always work. [Figure: a matrix A ∈ ℝ^{n×d} with one row of magnitude 𝒪(1) and all remaining entries of magnitude 𝒪(10⁻⁶); the chance that a uniform sample of m rows misses the first row is (n − m)/n.]
Sampling provides an embedding when A has low coherence
Definition [leverage score, coherence]: Given the SVD A = UΣVᵀ, the leverage score of row i is ‖U_i‖₂ (the norm of row i of U). The coherence μ(A) of A is the maximum of the leverage scores, with d/n ≤ μ(A) ≤ 1.
If μ(A) ≪ 1, the ‖U_i‖₂ are similar in magnitude; intuitively, the rows of A are then similarly important in determining the solution.
[Drineas et al'10,'11, Tropp'11]: If S is a random sampling matrix with m = 𝒪(μ(A)² d log d |log δ_s| ϵ_s^{-2}), then S is an ϵ_s-subspace embedding for A with probability at least 1 − δ_s.
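As an illustration (a toy sketch, not from the talk), the leverage scores, coherence and a uniform row-sampling sketch can be computed as follows; the sizes are assumed values:

```python
# Leverage scores, coherence, and a uniform row-sampling sketch (illustrative).
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 5000, 10, 600
A = rng.standard_normal((n, d))              # a Gaussian A has low coherence

U, _, _ = np.linalg.svd(A, full_matrices=False)
leverage = np.linalg.norm(U, axis=1)         # row norms of U (leverage scores, as defined above)
mu = leverage.max()                          # coherence mu(A): small => uniform sampling works
print("coherence:", mu)

# uniform sampling sketch: keep m rows, rescale by sqrt(n/m) so E[||SAx||^2] = ||Ax||^2
idx = rng.choice(n, size=m, replace=False)
SA = np.sqrt(n / m) * A[idx, :]

x = rng.standard_normal(d)
print(np.linalg.norm(SA @ x) ** 2 / np.linalg.norm(A @ x) ** 2)   # close to 1 for low coherence
```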
Sparse Random Embeddings
Hashing: sparse sketching for dense and sparse matrices
Compared to sampling, hashing uses every row of A ⟹ expect better robustness.
SA = Σ_{i=1}^n s_i a_i, where s_i is the i-th column of S and a_i is the i-th row of A.
Can also consider s nonzeros per column: s-hashing. More robustness is achieved.
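For illustration (not from the talk), a 1-hashing (CountSketch-type) matrix has exactly one nonzero, equal to ±1, in each column; a minimal sparse construction with assumed sizes:

```python
# Building a 1-hashing sketching matrix and applying it (illustrative sketch).
import numpy as np
from scipy.sparse import csr_matrix

def hashing_matrix(m, n, rng):
    rows = rng.integers(0, m, size=n)            # hash each column of S to one row
    signs = rng.choice([-1.0, 1.0], size=n)      # random +-1 sign per column
    return csr_matrix((signs, (rows, np.arange(n))), shape=(m, n))

rng = np.random.default_rng(2)
n, d, m = 10000, 20, 400
A = rng.standard_normal((n, d))
S = hashing_matrix(m, n, rng)
SA = S @ A          # costs O(nnz(A)): every row of A contributes to exactly one row of SA

x = rng.standard_normal(d)
print(np.linalg.norm(SA @ x) ** 2 / np.linalg.norm(A @ x) ** 2)
```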
Sparse Random Embeddings
Sketching with hashing matrices: theoretical results. [Shao, C, Fiala’20]
Sampling has better embedding properties when the coherence of A is low. Is this true for hashing?
When μ(A) is sufficiently small, hashing provides an ϵ_s-subspace embedding with an optimal dimensionality reduction bound, 𝒪(d), better than the 𝒪(d log d) bound for sampling.
Result | Coherence of A | Size m of sketching S
[Meng & Mahoney'13] | — | Θ(d² |log δ_s| ϵ_s^{-2})
[Bourgain et al'15] | 𝒪(log^{-3} d) | 𝒪(d log² d |log δ_s| ϵ_s^{-2})
[C, Fiala & Shao'20] | 𝒪(d^{-1}) | 𝒪(d |log δ_s| ϵ_s^{-2})
Throughout, d can be replaced by the rank of A.
Using sketching for optimization?
Sketching in the observational domain (subsampling, batch) reduces number of observations/measurements/data points
• linear least squares (solver: Ski-LLS [C, Fiala, Shao'20])
• nonlinear least squares: derivative-based Gauss-Newton methods [C, Scheinberg'20]
• nonlinear least squares: derivative-free Gauss-Newton methods [C, Ferguson, Roberts'20]
Sketching in the variable domain (block-coordinate, subspace methods) reduces the number of parameters/variables
• Gauss-Newton variants for derivative-based and derivative-free optimization
• Functions with low effective dimensionality, global optimization
How can we use sketching to improve the efficiency and scalability of optimization algorithms?
Today
Nonlinear least squares: derivative-based methods
Gauss-Newton method for Non-linear Least Squares (NNLS)
min_{x∈ℝ^d} f(x) = (1/2)‖r(x)‖₂² = (1/2) Σ_{i=1}^n r_i(x)²,
where r : ℝ^d → ℝ^n is smooth and possibly nonconvex; J is the n × d Jacobian matrix of first derivatives of r.
Gauss-Newton method(s): state-of-the-art for NNLS
At iterate x_k, calculate a direction s_k ∈ ℝ^d by approximately minimizing, over s ∈ ℝ^d, a regularized/constrained/unconstrained variant of the convex quadratic local model
q_k(s) = (1/2)‖J(x_k)s + r(x_k)‖₂² = f(x_k) + ⟨J(x_k)ᵀr(x_k), s⟩ + (1/2)⟨s, J(x_k)ᵀJ(x_k)s⟩.
Regularization, trust-region and linesearch variants have been successfully developed.
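A minimal Gauss-Newton iteration in NumPy (an illustrative sketch with a toy problem, not the talk's code): each step minimizes the local model q_k by solving the linearized least-squares problem.

```python
# Plain Gauss-Newton: step minimizes q_k(s) = 0.5*||J(x_k) s + r(x_k)||^2.
import numpy as np

def gauss_newton(r, J, x0, tol=1e-8, max_iter=100):
    """r: residual map R^d -> R^n, J: Jacobian map R^d -> R^{n x d}."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        rk, Jk = r(x), J(x)
        grad = Jk.T @ rk                       # gradient of f(x) = 0.5*||r(x)||^2
        if np.linalg.norm(grad) <= tol:
            break
        s, *_ = np.linalg.lstsq(Jk, -rk, rcond=None)   # minimize the local model
        x = x + s               # in practice: linesearch / trust region / regularization
    return x

# toy usage: fit y = exp(a*t) to data by nonlinear least squares
t = np.linspace(0, 1, 50)
y = np.exp(0.7 * t)
r = lambda a: np.exp(a[0] * t) - y
J = lambda a: (t * np.exp(a[0] * t)).reshape(-1, 1)
print(gauss_newton(r, J, np.array([0.0])))     # approx [0.7]
```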
We will look at: Sketching in the variable domain (subspace methods)
R-SGN with quadratic regularisation/trust region: at iteration k, [C, Fowkes, Shao'20]
q_k(s) = (1/2)‖J(x_k)S_kᵀ s + r(x_k)‖₂² = f(x_k) + ⟨S_k J(x_k)ᵀr(x_k), s⟩ + (1/2)⟨s, S_k J(x_k)ᵀJ(x_k)S_kᵀ s⟩,
where S_k is a p × d random sketching matrix and s ∈ ℝ^p.
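The following is an illustrative sketch (assumed names and sizes, not the R-SGN implementation) of one quadratically regularized reduced step: the model is built from J(x_k)S_kᵀ, so only p directions of the Jacobian are needed.

```python
# One sketched Gauss-Newton step in the variable domain (illustrative).
import numpy as np

def sketched_gn_step(r_k, J_k, p, sigma, rng):
    """min_s 0.5*||J_k S_k^T s + r_k||^2 + 0.5*sigma*||s||^2, then lift back to R^d."""
    d = J_k.shape[1]
    S_k = rng.standard_normal((p, d)) / np.sqrt(p)   # scaled Gaussian sketch (hashing also possible)
    JS = J_k @ S_k.T                                  # n x p reduced Jacobian (only this is needed)
    H = JS.T @ JS + sigma * np.eye(p)                 # regularized reduced normal equations
    s = np.linalg.solve(H, -JS.T @ r_k)
    return S_k.T @ s                                  # step in the original space R^d

rng = np.random.default_rng(3)
n, d, p = 200, 100, 10
J_k = rng.standard_normal((n, d))                     # formed in full here only for illustration
r_k = rng.standard_normal(n)
print(np.linalg.norm(sketched_gn_step(r_k, J_k, p, sigma=1.0, rng=rng)))
```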
Global rates of convergence for R-SGN methods
Assumptions:
• r, J are Lipschitz continuous. [smoothness]
• Let ϵ_s, δ_s ∈ (0,1). At each iterate x_k, with probability at least 1 − δ_s,
  ‖S_k ∇f(x_k)‖₂² ≥ (1 − ϵ_s)‖∇f(x_k)‖₂², and ‖S_k‖₂ ≤ S_max. [sketching accuracy]
• Typical inexact model minimization conditions for quadratic regularisation/trust-region.
Theorem [R-SGN]: Let ϵ > 0 and let δ ∈ (0,1) be a user-chosen parameter such that (1 − δ_s)δ > c*. Then the R-SGN algorithm takes at most
N ≤ [(1 − δ_s)δ − c*]^{-1} 𝒪(f(x_0)(1 − ϵ_s)^{-1}ϵ^{-2})
iterations and evaluations of the residual and sketched Jacobian to ensure min_{k≤N} ‖∇f(x_k)‖₂ ≤ ϵ, with probability at least 1 − e^{−((1−δ)²/2)(1−δ_s)N}.
[The constant c* is connected to the updates of the regularisation/trust-region parameter.]
This bound matches deterministic complexity bounds for first-order and Gauss-Newton methods despite having only partial Jacobian information available at each iteration. [C,Fowkes,Shao’20]
Global rates of convergence for R-SGN methods
Proof idea: uses techniques from complexity analyses of probabilistic models [Gratton et al'18; C, Scheinberg'18].
There can be at most C[f(x_0) − f*]ϵ^{-2} true and successful iterations (from the sufficient decrease condition), and f* = 0 here.
The sketching accuracy assumption gives ℙ(T < δN) ≤ e^{−((1−δ)²/2)(1−δ_s)N} for any δ ∈ (0,1), where T and N are the total number of true iterations and the total number of iterations, respectively.
Global rates of convergence for R-SGN methods
Satisfying the sketching accuracy assumption
• S_k a p × d matrix with i.i.d. scaled Gaussian entries, with p = 𝒪(|log δ_s| ϵ_s^{-2})
• S_k a p × d s-hashing matrix, with p = 𝒪(|log δ_s| ϵ_s^{-2})
It is sufficient for each S_k to be a (one-sided) ϵ_s-subspace embedding for one-dimensional vectors, so that the gradient can be embedded correctly.
• S_k a p × d sampling matrix: we need non-uniformity dependent subspace embeddings for vectors y with ‖y‖_∞ · ‖y‖₂^{-1} ≤ ν_s; then p = 𝒪(d ν_s² |log δ_s| ϵ_s^{-2}). This implies that sampling embeds the gradient correctly whenever ‖∇f(x)‖_∞ · ‖∇f(x)‖₂^{-1} ≤ ν_s, i.e. the gradient components are similar in magnitude.
Comparison with probabilistic models: our sketching assumption is weaker than the probabilistic-model conditions of [Bandeira, Scheinberg, Vicente'13]; it requires only one-sided length preservation of the gradient, not embedding of a whole subspace (see numerical example). [C, Fowkes, Shao'20]
Block-Coordinate Gauss-Newton (BC-GN) methods
BC-GN = R-SGN with a sampling matrix S_k.
Theorem [R-SGN] ⟹ global rate of convergence of BC-GN (with quadratic regularisation or trust region) with high probability, provided the gradient has components of similar magnitude.
When S_k is a sampling matrix, J(x_k)S_kᵀ is a random subset of size p of the columns ∂r/∂x_i of J(x_k).
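A minimal illustration of this (a sketch with assumed names, not the BC-GN code): sampling S_k means only the Jacobian columns for a random coordinate block are needed, and the step lives in those coordinates.

```python
# One block-coordinate Gauss-Newton step with a random coordinate block (illustrative).
import numpy as np

def bcgn_step(r_k, J_k, p, sigma, rng):
    d = J_k.shape[1]
    block = rng.choice(d, size=p, replace=False)   # random coordinate block B_k
    J_B = J_k[:, block]                            # n x p block of the Jacobian (all that is needed)
    s_B = np.linalg.solve(J_B.T @ J_B + sigma * np.eye(p), -J_B.T @ r_k)
    s = np.zeros(d)
    s[block] = s_B                                 # move only in the sampled coordinates
    return s

rng = np.random.default_rng(4)
n, d, p = 200, 100, 10
J_k, r_k = rng.standard_normal((n, d)), rng.standard_normal(n)
print(np.linalg.norm(bcgn_step(r_k, J_k, p, sigma=1.0, rng=rng)))
```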
Under more general assumptions, we can obtain a weaker global rate analysis for BC-GN with fixed and arbitrary block size. Assume that each coordinate block B_k of size p is drawn with probability P_k (with replacement, or from a partition). Then
𝔼_{B_k}[‖∇_{B_k} f(x_k)‖₂²] ≥ P_min R ‖∇f(x_k)‖₂²,
where each coordinate appears R times in the set of all possible blocks. [Here ∇_{B_k} f(x_k) = J_{B_k}(x_k)ᵀ r(x_k).]
Theorem [BC-GN]: Assume r, J are Lipschitz continuous. Then the number of BC-GN iterations/evaluations until min_{k≤N} ‖∇f(x_k)‖₂ ≤ ϵ in expectation is at most 𝒪((P_min R)^{-1} ϵ^{-2}). In particular, when blocks are drawn uniformly at random, such as from a partition, then P_min R = p/d.
BC-GN with TR on logistic regression on gisette dataset
Nonlinear least squares: derivative-free methods
Subspace derivative-free Gauss-Newton methods for NNLS
Sketching DFO-GN/DFO-LS in d (the number of variables / size of the interpolation set).
Fewer evaluations and lower linear algebra cost per iteration. Global efficiency?
[Roberts, PhD Thesis’19; C, Roberts’20]
Use an interpolation set {x_k, y_1, …, y_p} with p < d, then solve
  [(y_1 − x_k)ᵀ]            [(r(y_1) − r(x_k))ᵀ]
  [     ⋮      ]  J_kᵀ  =   [         ⋮        ]
  [(y_p − x_k)ᵀ]            [(r(y_p) − r(x_k))ᵀ]
Underdetermined system ⟹ take the minimal-norm solution.
Computational cost = factorization + solve = 𝒪(dp²) + 𝒪(np²) ≈ 𝒪(np²).
Evaluation cost: only need p evaluations of r on the first iteration and a small number/multiple of p subsequently.
Choose p based on computational resources/evaluation cost.
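An illustrative sketch of this interpolation system (toy residual and assumed sizes, not the DFO-LS/DFBGN code): with p < d points the system is underdetermined, and np.linalg.lstsq returns the minimal-norm solution.

```python
# Minimal-norm model Jacobian from p < d interpolation points (illustrative).
import numpy as np

rng = np.random.default_rng(5)
d, n, p = 50, 30, 10
M = rng.standard_normal((n, d))                  # data defining a toy residual map
def r(x):                                        # toy residual map R^d -> R^n
    return np.tanh(M @ x)

x_k = rng.standard_normal(d)
Y = 0.1 * rng.standard_normal((p, d))            # rows are the displacements y_i - x_k
R = np.stack([r(x_k + Y[i]) - r(x_k) for i in range(p)])   # p x n right-hand side

# Solve Y @ J_k^T = R; with p < d this is underdetermined and lstsq
# returns the minimal-norm solution, as on the slide above.
JkT, *_ = np.linalg.lstsq(Y, R, rcond=None)
J_k = JkT.T                                      # n x d model Jacobian
print(J_k.shape)                                 # (30, 50)
```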
Subspace derivative-free Gauss-Newton methods for NNLS
DFBGN (Derivative-Free Block Gauss-Newton) Algorithm
• Build a low-dimensional model and calculate the trust-region step s_k ∈ ℝ^p:
  min_{s∈ℝ^p} (1/2)‖r(x_k) + J_k s‖² s.t. ‖s‖ ≤ Δ_k
• Evaluate f(x_k + Q_k s_k), accept/reject the step, and update Δ_k (usual DFO choices), where Q_k is a basis of the interpolation set 𝒴_k = {y_1 − x_k, …, y_p − x_k}
• Add x_k + Q_k s_k to the interpolation set and remove p_drop ≥ 2 points from the interpolation set
• Add random orthogonal directions x_k + Δ_k d, with d ⊥ 𝒴_k, until we have p + 1 interpolation points
Comments:
• Linear algebra cost: 𝒪(np² + dp² + p³) vs 𝒪(nd² + d³) for the full-space method
• Choosing points to remove uses Lagrange polynomials (geometry-aware)
• Choice of p_drop: p_drop = 2 on successful iterations, p_drop = p/10 on unsuccessful iterations
Subspace derivative-free Gauss-Newton methods for NNLS
Numerical results for DFBGN algorithm
Test set: CUTEst problems with d ≈ 1000; max 12 hrs per problem.
Relative accuracy 0.1 vs budget; solvers and timeouts. DFBGN outperforms DFO-LS for low-accuracy solutions… because it does not time out!
[Figure: performance profile, proportion of problems solved vs budget / min budget of any solver, for DFO-LS, DFO-LS (init n/100), and DFBGN with p = n, n/2, n/10, n/100. Reported percentages: DFO-LS 93%; DFBGN (d/2) 82%; DFBGN (d/10) 74%; DFBGN (d/100) 35%. In figures, n denotes d.]
Subspace derivative-free Gauss-Newton methods for NNLS
Numerical results for DFBGN algorithm
Other advantage: DFBGN makes progress after only 𝒪(p) evaluations (especially important when d is large).
[Figures: progress vs budget on ARWHDNE (d = 2000) and CHANDHEQ (d = 2000); in figures, n denotes d.]
Random embeddings for global optimization
Global optimization of functions with low effective dimensionality
Global optimization is generally NP-hard. Can global optimization algorithms be made efficient for `simpler' problems? What is problem/data ‘simplicity'? Can algorithms adapt to data (without knowing it a priori)?
min_x f(x) subject to x ∈ 𝒳 = [−1, 1]^d
Problem simplicity: Functions which do not vary along certain linear subspaces.
Alternative names: low effective dimensionality, (multi-)ridge, planar waves, active subspaces
Global optimization of functions with low effective dimensionality
Challenging set-up: The objective function is black box. The orientation of the important subspace is not known.
Solution: Random embeddings [Ziyu Wang et al. Bayesian optimization in a billion dimensions via random embeddings. J. Artif. Int. Res., 55(1), 2016.]
Random embeddings ⟶ lower-dimensional problems: replace f(x) by f(Sᵀy), where S is a p × d Gaussian matrix and p ≪ d.
Functions with low effective dimensionality [Wang et al.'13]: f : ℝ^d → ℝ has effective dimensionality d_e ≤ d if there exists a linear subspace 𝒯 of dimension d_e such that f(x_⊤ + x_⊥) = f(x_⊤) for all vectors x_⊤ in 𝒯 and x_⊥ in 𝒯^⊥. [d_e is the smallest integer satisfying these properties.] Dimensions of interest: d_e ≤ p ≤ d.
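A toy illustration of the reformulation (assumed sizes and toy objective, not from the talk): a function of d = 1000 variables that varies only along a 2-dimensional subspace, reduced to p dimensions by a random Gaussian embedding.

```python
# Low effective dimensionality and the random-embedding reformulation f(S^T y).
import numpy as np

rng = np.random.default_rng(7)
d, d_e, p = 1000, 2, 5                      # ambient, effective, embedding dimensions

T = rng.standard_normal((d_e, d))           # rows span the effective subspace
def f(x):                                   # f varies only along the rows of T
    z = T @ x
    return (z[0] - 1.0) ** 2 + (z[1] + 0.5) ** 2

S = rng.standard_normal((p, d))             # p x d Gaussian embedding, p << d
g = lambda y: f(S.T @ y)                    # reduced p-dimensional objective

# any off-the-shelf global solver can now be applied in dimension p;
# here a crude random search stands in for the global optimization step
ys = rng.standard_normal((2000, p))
print(min(g(y) for y in ys))
```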
Global optimization of functions with low effective dimensionality