Dimensionality reduction techniques for large-scale ...

Dimensionality reduction techniques for large-scale optimization

Coralia Cartis (University of Oxford)Joint with

Jari Fowkes, Estelle Massart, Adilet Otemissov, Zhen Shao (Oxford) Lindon Roberts (ANU Canberra), Jan Fiala (NAG Ltd)

Research supported by the Alan Turing Institute for Data Science, NAG Ltd and NPL

Workshop on Mathematical Foundations of Optimization in Data Science November 24, 2020 (online)

Cantab Capital Institute for the Mathematics of Information

Johnson-Lindenstrauss Lemma and Random Embeddings

A ∈ ℝn×dS ∈ ℝm×n SA ∈ ℝm×d

Let a(ny) real matrix, , .

Then is an -subspace embedding for if

for all .

A n × d n ≫ d ϵs ∈ (0,1]

S m × n ϵs A

(1 − ϵs)∥Ax∥22 ≤ ∥SAx∥2

2 ≤ (1 + ϵs)∥Ax∥22

x ∈ ℝd

Johnson-Lindenstrauss Lemma: [Woodruff,’14]

If is a scaled Gaussian matrix with , then

is an (oblivious) -subspace embedding for with probability at least .

S m = 𝒪 (d | log δs |ϵ−2s )

S ϵs A 1 − δs

But note the high cost of forming SA ⟹ 𝒪(nd2)

Sparse Random EmbeddingsMoving away from Gaussian sketching: uniformly sampling rows of A fast, preserves sparsity.⟶

A ∈ ℝn×d SA ∈ ℝm×dBut it does not work sometimes….

𝒪(1)

𝒪(10−6)

Chance of missing first row: (n-m)/n

A

Sampling provides an embedding when A has low coherence

If , are similar in magnitude; intuitively, the rows of are similarly important in determining the solution.μ(A) ≪ 1 ∥Ui∥2 A

Definition [Leverage score, coherence]: Given , the leverage score of row is (row norm).The coherence of , is the maximum of the leverage scores.

A = UΣV i ∥Ui∥2A μ(A), d /n ≤ μ(A) ≤ 1

[Drineas et al’10,’11, Tropp’11]: If is a random sampling matrix with ,

then is an subspace embedding for with probability at least .

S m = 𝒪 (μ(A)2d log d | log δs |ϵ−2s )

S ϵs− A 1 − δs

Sparse Random Embeddings

Hashing: sparse sketching for dense and sparse matrices

Sampling: one non-zero per row

⎛

⎜⎜⎝

0 · · · 0 1 0 00 0 · · · · · · · · · 10 0 1 0 · · · 00 · · · · · · 1 0 0

⎞

⎟⎟⎠

Hashing: one non-zero per column

0

BB@

0 1 0 1 0 · · · 01 0 0 0 0 · · · 10 0 1 0 0 · · · 00 0 0 0 1 · · · 0

1

CCA

<latexit sha1_base64="WPyQptR0n9joYXrp3aceajFQpSg=">AAACrnicbVHBThsxEPUutAXT0gDHXiwCVU/RblqpvRQhceGYSiQgZVeR1ztJLLz2yp5tG63yP/wSX9FfqBMWtARGGvn5zRt79CYrlXQYRfdBuLX95u27nV269/7D/sfOweHImcoKGAqjjL3JuAMlNQxRooKb0gIvMgXX2e3Fqn79G6yTRl/hooS04DMtp1Jw9NSkc3dCkwxmUtdlwdHKv0tGacQ+s9hn+0xEbtCtLklCH8l2Pgli5hXRRvvzJ9qKtrKloAno/GkoejLpdKNetA72EsQN6JImBpPOvyQ3oipAo1DcuXEclZjW3KIUCpY0qRyUXNzyGYw91LwAl9ZrR5fs1DM5mxrrUyNbs+2OmhfOLYrMK/2Ec7dZW5Gv1cYVTn+ktdRlhaDFw0fTSjE0bLUelksLAtXCAy6s9LMyMeeWC/RLpImGPzgHY6Gom3NZXzXAOxRv+vESjPq9+Guv/6vfPR80Xu2QT+SYfCEx+U7OySUZkCERwX7wLfgZnIVROArTcPIgDYOm54g8i3D+H4gVvnw=</latexit>

A ∈ ℝn×d SA ∈ ℝm×d

Action of Hashing S:

Compared to sampling, Hashing uses every row of A.

•Expect better robustness

SA =n

∑i=1

siai

: column of S : row of A

si ith

ai ith

Can also consider nonzero per columns: s-hashing. More robustness is achieved.

s

Sparse Random Embeddings

Sketching with hashing matrices: theoretical results. [Shao, C, Fiala’20]

Sampling has better embedding properties when coherence of is low. Is this true for hashing?

A

When is sufficiently small, hashing provides an -subspace embedding with an optimal dimensionality reduction bound, , better than the bound for sampling.

μ(A) ϵs𝒪(d) 𝒪(d log d)

Result Coherence of A Size of sketching S

[Meng & Mahoney’13] —

[Bourgain et al’15]

[C, Fiala & Shao,’20]

Θ(d2 | log δs |ϵ−2s )

Throughout, can be replaced by rank of .d A

𝒪(log−3 d) 𝒪(d log2 d | log δs |ϵ−2s )

𝒪(d−1) 𝒪(d | log δs |ϵ−2s )

Using sketching for optimization?

Sketching in the observational domain (subsampling, batch) reduces number of observations/measurements/data points

• linear least squares solver (Solver: Ski-LLS [C, Fiala, Shao’20])• nonlinear least squares - derivative-based Gauss-Newton methods [C, Scheinberg’20]• nonlinear least squares - derivative-free Gauss-Newton methods [C, Ferguson,

Roberts’20]

Sketching in the variable domain (block-coordinate, subspace methods)reduces the number of parameters/variables

• Gauss-Newton variants for derivative-based and derivative-free• Functions with low effective dimensionality, global optimization

How can we use sketching for improving efficiency and scalability of optimization algorithms ?

Today

Nonlinear least squares: derivative-based methods

minx∈ℝd

f(x) =12

∥r(x)∥22 =

n

∑i=1

(ri(x))2

Gauss-Newton method for Non-linear Least Squares (NNLS)

where smooth and possibly nonconvex; Jacobian matrix of first derivatives of . r : ℝd → ℝn J n × d r

Gauss-Newton method(s): state-of-the-art for NNLS

At iterate , calculate direction by approximately minimizing a regularized/constrained/unconstrained variant of the convex quadratic local model, over ,

xk sk ∈ ℝd

s ∈ ℝd

qk(s) =12

∥J(xk)s + r(xk)∥22 = f(xk) + ⟨J(xk)Tr(xk), s⟩ +

12

⟨s, J(xk)TJ(xk)s⟩ .

Regularization, trust-region and linesearch variants have been successfully developed.

We will look at: Sketching in the variable domain (subspace methods)

Randomised Subspace Gauss-Newton (R-SGN) methods: variable sketching

randomly draw sketching matrix ; calculate the subspace-Jacobian and the reduced local quadratic model, , ,

solve the reduced subproblem (inexactly) to find

compute the ratio .

set and if ; else, and

p × d Sk J(xk)STk

s ∈ ℝp p ≪ d

sk ∈ ℝp,

ρk =f(xk) − f(xk + ST

k sk)f(xk) − qk( sk)

xk+1 = xk + STk sk σk+1 < σk(Δk+1 > Δk) ρk ≥ η1 xk+1 = xk

σk+1 > σk(Δk+1 < Δk) .

mins∈ℝp

qk( s)(+σk

2∥ST

k s∥2) or (such that ∥STk s∥ ≤ Δk) .

R-SGN with quadratic regularisation /trust region: at iteration , [C,Fowkes,Shao’20] k

qk(s) =12 (J(xk)ST

k ) s + r(xk)2

= f(xk) + ⟨SkJ(xk)Tr(xk), s⟩ +12

⟨ s, SkJ(xk)TJ(xk)STk s⟩ .

Global rates of convergence for R-SGN methodsAssumptions:

are Lipschitz continuous; [smoothness] Let . At each iterate , with probability at least ,

, and . [sketching accuracy]Typical inexact model minimization conditions for quadratic regularisation/trust-region

r, Jϵs, δs ∈ (0,1) xk 1 − δs

∥Sk ∇f(xk)∥22 ≥ (1 − ϵs)∥∇f(xk)∥2

2 ∥Sk∥2 ≤ Smax

Theorem[R-SGN]: Let , and such that , where is a user-chosen parameter . Then the R-SGN algorithm takes at most

iterations and evaluations of the residual and sketched Jacobian such that , with probability at least .

[ The constant connects the updates of the regularisation/trust-region parameter]

ϵ > 0 δ ∈ (0,1) (1 − δs)δ > c c*

N ≤ [(1 − δs)δ1 − c]−1𝒪 (f(x0)(1 − ϵs)−1ϵ−2)

mink≤N

∥∇f(xk)∥2 ≤ ϵ 1 − e− (1 − δ)22 (1−δs)N

* c

This bound matches deterministic complexity bounds for first-order and Gauss-Newton methods despite having only partial Jacobian information available at each iteration. [C,Fowkes,Shao’20]

Global rates of convergence for R-SGN methods

Proof Idea: uses techniques from probabilistic models complexity analyses [Gratton et al’18; C, Scheinberg’18]

True/false iterations, successful/unsuccessful iterations

There can be at most true and successful iterations (from sufficient decrease condition) and .

Sketching accuracy assumption gives: for any ,where and are the total number of true and total iterations, respectively.

C[ f(x0) − f*]ϵ−2

f* = 0

ℙ(T < δN) ≤ ϵ−(1−δ)2N δ ∈ (0,1)T N

Global rates of convergence for R-SGN methods

Satisfying the sketching accuracy assumption

matrix with iid scaled Gaussian entries with -hashing matrix with

Sk p × d p = 𝒪( | log δs |ϵ−2s )

Sk p × d s p = 𝒪( | log δs |ϵ−2s )

Sufficient for each to be a (one-sided) -subspace embedding for one-dimensional vectors,so that the gradient can be embedded correctly.

Sk ϵs

sampling matrix : need non-uniformity dependent subspace embeddings for vectors with . Then . This implies that sampling embeds correctly the gradient whenever , the gradient components are similar in magnitude.

Sk p × d∥y∥∞ ⋅ ∥y∥−1

2 ≤ νs p = 𝒪(dν2s | log δs |ϵ−2

s ) Sk∥∇f(x)∥∞ ⋅ ∥∇f(x)∥−1

2 ≤ νs

Comparison with probabilistic modelsOur sketching assumption is weaker than probabilistic model conditions [Bandeira, Scheinberg, Vicente’13]: one-sided length preservation of gradient; not required to embed subspace; numerical example

[C,Fowkes,Shao’20]

Block-Coordinate Gauss-Newton (BC-GN) methodsBC-GN= R-SGN with sampling matrixSk

Theorem[R-SGN] global rate of convergence of BC-GN (with quadratic regularisation or trust region) with high probability, provided the gradient has similar components in magnitude

⟹

When is a sampling matrix, is a random subset of size p of the columns of . Sk J(xk)STk

∂r∂xi

J(xk)

Under more general assumptions, we can obtain a weaker global rate analysis for BC-GN with fixed and arbitrary block size. Assume that each coordinate block of size p is drawn with probability P_k (with replacementor from a partition). Then , where each coordinate appears R times in the

set of all possible block.

Bk𝔼Bk

∥∇Bkf(xk)∥2 ≥ PminR∥∇f(xk)∥2

[∇Bkf(xk) = JBk

(xk)Tr(xk)]Theorem[BC-GN]: Assume Lipschitz continuous. Then the number of BC-GN iterations/evaluations until

is at most . In particular, when blocks are drawn uniformlyat random, such as from a partition, then .

r, J𝔼(∥∇f(xk)∥2) ≤ ϵ2 𝒪 (f(x0)(PminR)−1ϵ−2)

PminR = pd−1

R-SGN/BC-GN methods: numerical experiments[C,Fowkes,Shao’20; WPaper 20]

BC-GN with TR on logistic regression for chemotherapy

dataset (Python code)

R-SGN/BC-GN methods: numerical experiments[C,Fowkes,Shao’20]

BC-GN with TR on logistic regression on gisette dataset

Nonlinear least squares: derivative-free methods

Subspace derivative-free Gauss-Newton methods for NNLS

Sketching DFO-GN/DFO-LS in (number of variables/size of interpolation set)d

Less evaluations and lower linear algebra cost per iteration. Global efficiency?

[Roberts, PhD Thesis’19; C, Roberts’20]

Use interpolation set , then solve

Underdetermined system take minimal norm solution. Computational Cost= factorization + solve = Evaluation cost: only need evaluations of on first iteration and a small number/multiple of subsequentlyChoose based on computational resources/evaluation cost

{xk, y1, …, yp} for p < d(y1 − xk)T

⋮(yp − xk)T

JTk =

(r(y1) − r(xk))T

⋮(r(yp) − r(xk))T

⟹

𝒪(dp2) + (np2) ≈ 𝒪(np2)p r p

p


DFBGN (Derivative-Free Block Gauss-Newton) Algorithm Build low-dimensional model and calculate trust-region step,

Evaluate , accept/reject step, and update (usual DFO choices) (where is basis of interpolation set )

Add to interpolation set and remove points from the interpolation setAdd random orthogonal directions for until we have interpolation points

sk ∈ ℝp,

mins∈ℝp

12

∥r(xk) + Jk sk∥2 s.t. ∥ s∥ ≤ Δk

f(xk + Qk sk) ΔkQk 𝒴k = {y1 − xk, …, yp − xk}

xk + Qk sk pdrop ≥ 2

xk + Δkd d ⊥ 𝒴k p + 1

Comments: Linear algebra cost vs full space method

Choosing points to remove uses Lagrange polynomials (geometry-aware)

Choice of on successful iterations, on unsuccessful iterations

𝒪(np2 + dp2 + p3) 𝒪(nd2 + d3)

pdrop : pdrop = 2 p/10


Numerical results for DFBGN algorithm

Choose test set CUTEst with , max 12hrs per problem

Relative accuracy=0.1 vs budget; Solver and timeoutDFBGN outperforms DFO-LS for low accuracy solutions …because it does not time out!

d ≈ 1000

1 2 4 8 16 32Budget / min budget of any solver

0.0

0.2

0.4

0.6

0.8

1.0

Proportion

problemssolved

DFO-LS

DFO-LS (init n/100)

DFBGN (p = n)

DFBGN (p = n/2)

DFBGN (p = n/10)

DFBGN (p = n/100)

DFOLS 93%

DFBGN (d/100) 35%

DFBGN (d/10) 74%

DFBGN (d/2) 82%

in figures[n → d]


Numerical results for DFBGN algorithm

Other advantage: DFBGN make progress after evaluations (especially important when large)

p ≪ d d

normalized objective reduction vs.~\# evaluations, 12hr timeout);

0.0 0.2 0.4 0.6 0.8 1.0Budget (in gradients)

100

2× 10−1

3× 10−1

4× 10−1

6× 10−1

Normalized

ObjectiveValue

DFO-LS

DFO-LS (init n/100)

p = n

p = n/2

p = n/10

p = n/100

0.0 0.2 0.4 0.6 0.8 1.0Budget (in gradients)

100

5× 10−1

6× 10−1

7× 10−1

8× 10−1

9× 10−1

Normalized

ObjectiveValue

DFO-LS

DFO-LS (init n/100)

p = n

p = n/2

p = n/10

p = n/100

ARWHDNE, d=2000 CHANDHEQ, d=2000 in figures[n → d]

Random embeddings for global optimization

Global optimization of functions with low effective dimensionality

Global optimization is generally NP-hard. Can global optimization algorithms be made efficient for `simpler' problems? What is problem/data ‘simplicity'? Can algorithms adapt to data (without knowing it a priori)?

minx

f(x) subject to x ∈ 𝒳 = [−1,1]d

Problem simplicity: Functions which do not vary along certain linear subspaces.

Alternative names: low effective dimensionality, (multi-)ridge, planar waves, active subspaces

Applications: hyper-parameter optimization; complex engineering simulations; parametric, stochastic PDEs; over-parametrized DNNs?


Challenging set-up: The objective function is black box. The orientation of the important subspace is not known.

Solution: Random embeddings [Ziyu Wang et al. Bayesian optimization in a billion dimensions via random embeddings. \textit{J. Artif. Int. Res.}, 55(1), 2016.]

Random embeddings lower dimensional problemsReplace by , where is an Gaussian matrix, and .

⟶f(x) f(STy) S p × d p ≪ d

Functions with low effective dimensionality [Wang et al.’13]: has effective dimensionality if there exists a linear subspace of dimension such that for all vectors in and in . [ is the smallest integer satisfying these properties]. Dimensions of interest: .

f : ℝd → ℝde ≤ d 𝒯 de

f(x⊤ + x⊥) = f(x⊤) x⊤ 𝒯 x⊥ 𝒯⊥ dede ≤ p ≤ d


(R) (AR)miny∈ℝp

f(STy + u)

s.t. y ∈ Y = [−a, a]p

miny∈ℝp

f(STy + u)

s.t. STy + u ∈ 𝒳

Reduced optimization problems [C, Otemissov’20; C, Massart, Otemissov’20]

REGO algorithm: (single random embedding)u=0; solves (R) once (using any global solver) , unconstrained solution f(STy*) ≈ f* STy*

AREGO algorithm: (multiple random embedding)solves (AR) multiple times (with any global solver) updates to best point found so far

, u

f(STy*) ≈ f* STy* ∈ 𝒳Assumption: .p ≥ deTheoretical analysis

REGO: best-known probability of success of (R) and suitable choices of , depends only on

, not on ambient dimension a

p, de d

Numerical experiments: confirm the theoretical findings; include replacing global solvers with local ones

AREGO: probability of success of (AR) and convergence of AREGO, depends on

algebraically, not exponentiallyd

Global optimization of functions with low effective dimensionality(R) reduced subproblem and REGO algorithm min

y∈ℝpf(STy)

s.t. y ∈ Y = [−a, a]py*2 = arg miny: f(STy)=f*

∥y∥2.

We can show that ∥x*T ∥2

2

∥y*2 ∥22

∼ χ2p−de+1⟹ ℙ((R) is successful) ≥ 1 − C(q)(1 +

q2

e−c2/2) ( c2

2 )q2

where , .q = p − de + 1 c = ∥x*T ∥/a

BARON on GO problems with low effective dimensionality

in figures[D → d and d → p]

(R)

Global optimization of functions with low effective dimensionality(AR) reduced subproblem and AREGO algorithm min

y∈ℝpf(STy + u)

s.t. STy + u ∈ 𝒳 ℙ((AR) is successful) ≥ ℙ(−1 ≤ STy*2 + u ≤ 1) > τ(d) > 0

⟶ t-distributionConvergence of AREGO, with prob one, proved to a neighbourhood of global minimum of original problem; multiple embeddings used.Sk

BARON Local KNITRO

Same tests as for REGO, functions with low effective dimensionality

(AR)

in figures[D → d and d → p]

Dimensionality reduction techniques for large-scale ...

Documents