
Sublinear-time Algorithms for Machine Learning
Ken Clarkson (IBM Almaden), Elad Hazan (Technion), David Woodruff (IBM Almaden)

Mar 27, 2015

Transcript
Page 1:

Sublinear-time Algorithms for Machine Learning

Ken Clarkson (IBM Almaden), Elad Hazan (Technion), David Woodruff (IBM Almaden)

Page 2:

Linear Classification

[Figure: linearly separable points with the margin marked]

Page 3:

Linear Classification

n vectors in d dimensions: A_1, …, A_n ∈ R^d

Can assume the norms of the A_i are bounded by 1

Labels y_1, …, y_n ∈ {−1, +1}

Find a vector x such that:

∀ i ∈ [n]: sign(A_i · x) = y_i

Page 4:

The Perceptron Algorithm

Page 5:

The Perceptron Algorithm

[Rosenblatt 1957, Novikoff 1962, Minsky & Papert 1969]

Iteratively:

1. Find a vector A_i for which sign(A_i · x) ≠ y_i

2. Add A_i to x: x ← x + A_i

Note: can assume all y_i = +1 by multiplying each A_i by y_i
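A minimal runnable sketch of the algorithm above, assuming the labels have already been folded into the rows (each A_i pre-multiplied by y_i, so a mistake means A_i · x ≤ 0); the iteration cap is just a safety bound:

```python
import numpy as np

def perceptron(A, max_iters=10000):
    """Classical perceptron. Rows of A are assumed pre-multiplied by
    their labels y_i, so correct classification means A_i . x > 0."""
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(max_iters):
        margins = A @ x            # O(n*d) work per iteration
        i = np.argmin(margins)     # a most-violated example
        if margins[i] > 0:         # every example classified correctly
            return x
        x = x + A[i]               # add the misclassified row to x
    return x
```

On data with margin ε this terminates within 1/ε² updates, which is exactly the bound proved on the next slide.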

Page 6:

Thm [Novikoff 1962]: converges in 1/ε² iterations

Proof:

Let x* be the optimal (unit-norm) hyperplane, for which A_i · x* ≥ ε for all i.
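The remaining steps of the standard argument, filled in for completeness (every update uses a mistaken row, so A_i · x_t ≤ 0, and ||x*|| = 1):

```latex
\begin{align*}
x_{t+1}\cdot x^* &= x_t\cdot x^* + A_i\cdot x^* \;\ge\; x_t\cdot x^* + \varepsilon
  &&\Rightarrow\;\; x_T\cdot x^* \ge T\varepsilon,\\
\|x_{t+1}\|^2 &= \|x_t\|^2 + 2\,A_i\cdot x_t + \|A_i\|^2 \;\le\; \|x_t\|^2 + 1
  &&\Rightarrow\;\; \|x_T\| \le \sqrt{T},\\
T\varepsilon &\le x_T\cdot x^* \le \|x_T\|\,\|x^*\| \le \sqrt{T}
  &&\Rightarrow\;\; T \le 1/\varepsilon^2.
\end{align*}
```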

Page 7:

For n vectors in d dimensions:

1/ε² iterations

Each iteration takes O(n · d) time, for total time O(n · d / ε²)

New algorithm: O((n + d)/ε²), sublinear time with high probability, a leading-order-term improvement

(in running times, poly-log factors are omitted)

Page 8:

Why is it surprising?

[Figure: linearly separable points with the margin marked]

Page 9:

More results

- O(n/ε² + d/ε) time algorithm for minimum enclosing ball (MEB), assuming the norms of the input points are known

- Sublinear-time kernel versions, e.g., the polynomial kernel of degree q

- Poly-log space / low-pass / sublinear-time algorithms for these problems

All running times are tight up to polylog factors (we give information-theoretic lower bounds)

Page 10:

Talk outline

- primal-dual optimization in learning

- l2 sampling

- MEB

- Kernels

Page 11:

A Primal-dual Perceptron

η = Θ~(ε)

Iteratively:

1. Primal player supplies hyperplane xt

2. Dual player supplies distribution pt

3. Updates:

x_{t+1} = x_t + η Σ_{i=1}^n p_t(i) A_i

p_{t+1}(i) = p_t(i) · e^{−η A_i · x_t}
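A sketch of these updates in code, under stated assumptions: labels folded into the rows, the primal set B taken to be the unit ball (the projection step is implicit on the slide), and p_t renormalized after the multiplicative step:

```python
import numpy as np

def primal_dual_perceptron(A, eps, T):
    """Exact (unsampled) primal-dual perceptron: online gradient
    ascent for the primal player, multiplicative weights for the dual.
    Rows of A are unit-norm with labels folded in."""
    n, d = A.shape
    eta = eps                              # step size, Theta~(eps)
    x = np.zeros(d)                        # primal iterate x_t
    p = np.ones(n) / n                     # dual distribution p_t
    x_sum = np.zeros(d)
    for _ in range(T):
        x_sum += x
        margins = A @ x                    # A_i . x_t for all i
        x = x + eta * (p @ A)              # step toward p_t-weighted example
        x /= max(1.0, np.linalg.norm(x))   # project back onto the unit ball B
        p = p * np.exp(-eta * margins)     # MW: up-weight low-margin examples
        p /= p.sum()                       # keep p_t a distribution
    return x_sum / T                       # average iterate (output, see Page 14)
```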

Page 12:

The Primal-dual Perceptron

[Figure: the primal hyperplane against the dual distribution over examples]

Page 13:

Optimization via learning

[Diagram: a reduction between an offline optimization problem and a repeated game in which Player 1 and Player 2 each run a low-regret algorithm; the play converges to the min-max solution]

Classification problem:

max_{x∈B} min_{i∈[n]} A_i · x = max_{x∈B} min_{p∈Δ} Σ_i p(i) A_i · x = min_{p∈Δ} max_{x∈B} Σ_i p(i) A_i · x

Low-regret algorithm: after many game iterations, the average payoff converges to the best payoff attainable in hindsight by a fixed strategy.

Page 14:

Thm: the number of iterations needed to converge to an approximate solution is bounded by the T at which Regret_1 + Regret_2 becomes a small fraction of Tε.

Total time = # iterations × time-per-iteration

Advantages:

- Generic optimization

- Easy to apply randomization

Player 1 regret: Tε ≤ max_{x∈B} Σ_t p_t^T A x ≤ Σ_t p_t^T A x_t + Regret_1

Player 2 regret: Σ_t p_t^T A x_t ≤ min_{p∈Δ} Σ_t p^T A x_t + Regret_2

So min_{i∈[n]} Σ_t A_i · x_t ≥ Tε − Regret_1 − Regret_2

Output x̄ = (1/T) Σ_t x_t

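Dividing the combined inequality by T makes the guarantee for this averaged output explicit:

```latex
\min_{i\in[n]} A_i\cdot\bar{x}
  \;=\; \frac{1}{T}\,\min_{i\in[n]} \sum_t A_i\cdot x_t
  \;\ge\; \varepsilon \;-\; \frac{\mathrm{Regret}_1 + \mathrm{Regret}_2}{T},
\qquad \bar{x} = \frac{1}{T}\sum_t x_t .
```

So once the regrets are o(T), the average iterate x̄ classifies every example with margin close to ε.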

Page 15:

A Primal-dual Perceptron

Iteratively:

1. Primal player supplies hyperplane xt

2. Dual player supplies distribution pt

3. Updates:

# iterations via regret of OGD/MW: Regret_1 = O(√T), Regret_2 = O(√(T log n)), so T = Õ(1/ε²)

x_{t+1} = x_t + η Σ_{i=1}^n p_t(i) A_i

p_{t+1}(i) = p_t(i) · e^{−η A_i · x_t}

Page 16:

A Primal-dual Perceptron

Total time?

Speed up via randomization:

1. Sufficient to look at one example per iteration

2. Sufficient to obtain crude estimates of the inner products

Page 17:

l2 sampling

Consider two vectors u, v on the d-dimensional sphere

- Sample coordinate i w.p. v_i²

- Return u_i / v_i

Notice that

- The expectation is correct: E[u_i / v_i] = Σ_i v_i² · (u_i / v_i) = u · v

- The variance is at most one (though the magnitude can be d)

- Time: O(d)
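A small sketch of the estimator, written for general (not necessarily unit-norm) v so the formula stays unbiased; for unit v it reduces to u_i / v_i as above:

```python
import numpy as np

def l2_sample(u, v, rng=np.random.default_rng()):
    """Unbiased estimate of <u, v>: sample coordinate i with
    probability v_i^2 / ||v||^2 and return u_i * ||v||^2 / v_i.
    E = sum_i (v_i^2/||v||^2) * (u_i ||v||^2 / v_i) = <u, v>."""
    w = v * v
    i = rng.choice(len(v), p=w / w.sum())   # i ~ v_i^2
    return u[i] * w.sum() / v[i]
```

Building the sampling distribution is the O(d) preprocessing; with an alias table each subsequent draw is O(1), which is what lets one preprocessing of x_t serve all n estimates on the next slide.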

Page 18:

The Primal-Dual Perceptron

Iteratively:

1. Primal player supplies hyperplane x_t; l2-sample from x_t

2. Dual player supplies distribution p_t; sample i_t from it

3. Updates:

x_{t+1} = x_t + η A_{i_t}

p_{t+1}(i) ← p_t(i) · (1 − η l2-sample(A_i · x_t) + η² l2-sample(A_i · x_t)²)

(replacing the exact update p_{t+1}(i) = p_t(i) · e^{−η A_i · x_t})

Important: preprocess x_t only once for all n estimates

Running time: O((n + d)/ε²)
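Putting the pieces together, a hedged sketch of the sampled variant; the renormalizations and constants are my assumptions, not the paper's analysis:

```python
import numpy as np

def sampled_perceptron(A, eps, T, rng=np.random.default_rng()):
    """Sublinear primal-dual perceptron sketch. Each iteration touches
    one sampled row on the primal side and one l2-sampled coordinate
    per row on the dual side, after a single O(d) preprocessing of x_t.
    The factor 1 - eta*z + eta^2*z^2 is the slide's surrogate for
    e^{-eta*z} (always positive, so p stays a distribution)."""
    n, d = A.shape
    eta = eps
    x = np.zeros(d)
    p = np.ones(n) / n
    x_sum = np.zeros(d)
    for _ in range(T):
        x_sum += x
        i_t = rng.choice(n, p=p)                   # dual player samples a row
        if (nx2 := x @ x) > 0:
            q = (x * x) / nx2                      # preprocess x_t once: O(d)
            js = rng.choice(d, size=n, p=q)        # one l2-sample index per row
            z = A[np.arange(n), js] * nx2 / x[js]  # estimates of A_i . x_t
            p = p * (1 - eta * z + eta**2 * z**2)
            p /= p.sum()
        x = x + eta * A[i_t]                       # primal step on sampled row
        x /= max(1.0, np.linalg.norm(x))           # stay in the unit ball B
    return x_sum / T
```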

Page 19:

Analysis

Some difficulties:

- Non-trivial regret analysis due to sampling

- Need a new multiplicative-update analysis for bounded variance

- The analysis shows a good solution with constant probability

- Need a way to verify a solution to get high probability

Page 20:

Streaming Implementation

- See rows one at a time

- Can't afford to store x_t or p_t

- Want few passes, poly(log n / ε) space, and sublinear time

- Want to output a succinct representation of the hyperplane: a list of 1/ε² row indices

- In the t-th iteration, when l2-sampling from x_t, use the same index j_t for all n rows

- Store the samples i_1, …, i_T of rows chosen by the dual player, and j_1, …, j_T of l2-sampling indices of the primal player

- Sample in a stream using known algorithms

Page 21:

Lower Bound

- Consider an n × d matrix

- The first 1/ε² rows each contain a random position equal to ε; all other values are 0

- Each of the remaining n − 1/ε² rows is a copy of a random row among the first 1/ε²

- With probability ½, choose a random row and replace the value ε by −ε; with probability ½, do nothing

- Deciding which case you're in requires reading Ω((n + d)/ε²) entries
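A small generator for this instance, to make the construction concrete (the function and parameter names are mine):

```python
import numpy as np

def hard_instance(n, d, eps, rng=np.random.default_rng()):
    """Lower-bound instance sketch: k = 1/eps^2 pattern rows, each with
    a single entry eps at a random position; the remaining rows copy
    random pattern rows. A fair coin decides whether one random row has
    its eps flipped to -eps, which is the case a solver must detect."""
    k = int(round(1 / eps**2))
    A = np.zeros((n, d))
    A[np.arange(k), rng.integers(0, d, size=k)] = eps   # pattern rows
    A[k:] = A[rng.integers(0, k, size=n - k)]           # copies of patterns
    if rng.random() < 0.5:
        A[rng.integers(0, n)] *= -1                     # flip eps to -eps
    return A
```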

Page 22:

MEB (minimum enclosing ball)

Page 23:

A Primal-dual algorithm

Iteratively:

1. Primal player supplies point xt

2. Dual player supplies distribution pt

3. Updates:

# iterations via regret of OGD/MW: T = Õ(1/ε²)

x_{t+1} = x_t + η Σ_{i=1}^n p_t(i) (A_i − x_t)

p_{t+1}(i) = p_t(i) · e^{η ||A_i − x_t||²}
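A minimal sketch of this exact (unsampled) MEB game; the starting center and the dual normalization are my assumptions:

```python
import numpy as np

def primal_dual_meb(A, eps, T):
    """Primal-dual MEB sketch: the primal player moves the center x_t
    toward the p_t-weighted points; the dual player up-weights points
    far from the current center."""
    n, d = A.shape
    eta = eps
    x = A.mean(axis=0)                      # any reasonable starting center
    p = np.ones(n) / n
    x_sum = np.zeros(d)
    for _ in range(T):
        x_sum += x
        dists = ((A - x) ** 2).sum(axis=1)  # ||A_i - x_t||^2 for all i
        x = x + eta * (p @ (A - x))         # move toward weighted points
        p = p * np.exp(eta * dists)         # MW: favor far-away points
        p /= p.sum()
    return x_sum / T                        # approximate center
```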

Page 24:

l2-sampling speed up

Iteratively:

1. Primal player supplies point xt

2. Dual player supplies distribution pt

3. Updates:

# iterations via regret of OGD/MW: T = Õ(1/ε²)

x_{t+1} = x_t + η A_{i_t}

p_{t+1}(i) = p_t(i) · (1 + η l2-sample(||A_i − x_t||²) + η² l2-sample(||A_i − x_t||²)²)

Page 25:

Regret speed up

Updates:

# iterations remains Õ(1/ε²)

But only in an ε-fraction of iterations do we have to do O(d) work, though in all iterations we do O(n) work

O(n/ε² + d/ε) total time

with probability ε: x_{t+1} = x_t + η A_{i_t}

p_{t+1}(i) = p_t(i) · (1 + η l2-sample(||A_i − x_t||²) + η² l2-sample(||A_i − x_t||²)²)

Page 26:

Kernels

Page 27:

Kernels

Map the input to a higher-dimensional space via a non-linear mapping Φ, e.g. the polynomial map, for which ⟨Φ(x), Φ(y)⟩ = (x^T y)^q.

Classification via a linear classifier in the new space.

Efficient classification and optimization if inner products can be computed efficiently (the "kernel function").

Page 28:

The Primal-Dual Perceptron

Iteratively:

1. Primal player supplies hyperplane x_t; l2-sample from x_t

2. Dual player supplies distribution p_t; sample i_t from it

3. Updates: x_{t+1} ← x_t + η A_{i_t}

Page 29:

The Primal-Dual Kernel Perceptron

Iteratively:

1. Primal player supplies hyperplane x_t; l2-sample from x_t

2. Dual player supplies distribution p_t; sample i_t from it

3. Updates: x_{t+1} ← x_t + η Φ(A_{i_t})

Page 30:

l2 sampling for kernels

Polynomial kernel: ⟨Φ(x), Φ(y)⟩ = (x^T y)^q

Kernel l2-sample = product of q independent l2-samples of x^T y

Running time increases by a factor of q

Can also use a Taylor expansion to handle, say, Gaussian kernels
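A sketch of the kernel estimator: by independence, the expectation of the product is the product of the expectations, so it is unbiased for (x^T y)^q (the helper name is mine):

```python
import numpy as np

def kernel_l2_sample(u, v, q, rng=np.random.default_rng()):
    """Unbiased estimate of <Phi(u), Phi(v)> = (u . v)^q for the
    degree-q polynomial kernel: multiply q independent l2-samples
    of u . v. Time grows by a factor of q."""
    w = v * v
    w = w / w.sum()
    est = 1.0
    for _ in range(q):
        i = rng.choice(len(v), p=w)          # i ~ v_i^2 / ||v||^2
        est *= u[i] * (v @ v) / v[i]         # one l2-sample of u . v
    return est
```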

Page 31: