Sublinear-time Algorithms for Machine Learning
Ken Clarkson (IBM Almaden), Elad Hazan (Technion), David Woodruff (IBM Almaden)
Mar 27, 2015
Linear Classification
[Figure: linearly separable points with an ε-margin around the separating hyperplane]
n vectors in d dimensions: A1, …, An ∈ R^d
Can assume the norms of the Ai are bounded by 1
Labels y1, …, yn ∈ {−1, +1}
Find a vector x such that:
∀ i ∈ [n]: sign(Ai x) = yi
The Perceptron Algorithm
[Rosenblatt 1957, Novikoff 1962, Minsky & Papert 1969]
Iteratively:
1. Find a vector Ai for which sign(Ai x) ≠ yi
2. Add Ai to x: x ← x + Ai
Note: can assume all yi = +1 by multiplying Ai by yi
Thm [Novikoff 1962]: converges in 1/ε² iterations
Proof:
Let x* be the optimal hyperplane, for which Ai x* ≥ ε for all i.
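The slide's proof is truncated; the standard argument (a sketch, assuming ‖Ai‖ ≤ 1 and ‖x*‖ = 1, with updates x ← x + A_{i_t} on misclassified examples) goes:

```latex
\langle x_T, x^{*}\rangle \;=\; \sum_{t=1}^{T}\langle A_{i_t}, x^{*}\rangle \;\ge\; T\varepsilon,
\qquad
\|x_T\|^{2} \;\le\; \|x_{T-1}\|^{2}
  + 2\underbrace{\langle x_{T-1}, A_{i_T}\rangle}_{\le 0 \text{ (misclassified)}}
  + \underbrace{\|A_{i_T}\|^{2}}_{\le 1}
\;\le\; T.
% Cauchy--Schwarz: T\varepsilon \le \langle x_T, x^*\rangle \le \|x_T\|\,\|x^*\| \le \sqrt{T},
% hence T \le 1/\varepsilon^2.
```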
For n vectors in d dimensions:
1/ε² iterations
Each takes n × d time, so total time: O(nd/ε²)
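As a concrete reference point, here is a minimal sketch of the classical loop just described (the function name, the iteration cap, and the assumption that labels are folded into the rows, i.e., all yi = +1, are ours):

```python
import numpy as np

def perceptron(A, max_iters=100_000):
    """Classical perceptron. A is n x d; each row Ai was pre-multiplied
    by its label yi, so a correct classification means Ai . x > 0."""
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(max_iters):          # <= 1/eps^2 iterations under eps-margin
        margins = A @ x                 # O(nd) work per iteration
        i = int(np.argmin(margins))     # a most-violated example
        if margins[i] > 0:              # every example classified correctly
            return x
        x += A[i]                       # perceptron update: add Ai to x
    return x                            # not separated within max_iters
```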
New algorithm:
sublinear time with high probability; an improvement in the leading-order term
(in running times, poly-log factors are omitted)
Why is it surprising?
More results
- O(n/ε² + d/ε) time algorithm for minimum enclosing ball (MEB), assuming the norms of the input points are known
- Sublinear-time kernel versions, e.g., for the polynomial kernel of degree q
- Poly-log space / low-pass / sublinear-time algorithms for these problems
All running times are tight up to polylog factors
(we give information-theoretic lower bounds)
Talk outline
- Primal-dual optimization in learning
- ℓ2 sampling
- MEB
- Kernels
A Primal-dual Perceptron
η = Θ̃(ε)
Iteratively:
1. Primal player supplies hyperplane xt
2. Dual player supplies distribution pt
3. Updates:
   xt+1 = xt + η Σi pt(i) Ai
   pt+1(i) = pt(i) · e^(−η Ai xt)
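A minimal full-information sketch of this loop (the projection of xt back to the unit ball B and the final averaging are our assumptions about details not shown on the slide):

```python
import numpy as np

def primal_dual_perceptron(A, eps, T):
    """Primal-dual perceptron sketch: OGD step for the hyperplane x,
    multiplicative-weights step for the distribution p over examples."""
    n, d = A.shape
    eta = eps                            # step size eta = Theta~(eps)
    x, p = np.zeros(d), np.ones(n) / n
    avg = np.zeros(d)
    for _ in range(T):
        margins = A @ x                  # Ai . xt for every example, O(nd)
        avg += x / T
        x = x + eta * (p @ A)            # primal: p-weighted sum of examples
        norm = np.linalg.norm(x)
        if norm > 1.0:                   # keep x in the unit ball B (assumption)
            x /= norm
        p = p * np.exp(-eta * margins)   # dual: upweight small-margin examples
        p /= p.sum()
    return avg                           # average iterate, as in the reduction
```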
The Primal-dual Perceptron
[Diagram: xt is a hyperplane; pt is a distribution over examples]
Optimization via learning
game ↔ offline optimization problem
Player 1: low-regret algorithm
Player 2: low-regret algorithm
Converges to the min-max solution.
Classification problem:
max_{x∈B} min_{i∈[n]} Ai x = max_{x∈B} min_{p∈Δ} pᵀA x = min_{p∈Δ} max_{x∈B} pᵀA x
Reduction
Low-regret algorithm: after many game iterations, the average payoff converges to the best payoff attainable in hindsight by a fixed strategy.
Thm: the number of iterations to converge to an ε-approximate solution is bounded by the T for which:
(Regret1(T) + Regret2(T)) / T ≤ ε
Total time = # iterations × time-per-iteration
Advantages:
- Generic optimization
- Easy to apply randomization
Player 1 regret:
Tε ≤ max_{x∈B} Σt pt A x ≤ Σt pt A xt + Regret1
Player 2 regret:
Σt pt A xt ≤ min_{p∈Δ} Σt p A xt + Regret2
So, min_{i∈[n]} Σt Ai xt ≥ Tε − Regret1 − Regret2
Output Σt xt / T
A Primal-dual Perceptron
Iteratively:
1. Primal player supplies hyperplane xt
2. Dual player supplies distribution pt
3. Updates:
   xt+1 = xt + η Σi pt(i) Ai
   pt+1(i) = pt(i) · e^(−η Ai xt)
# iterations via regret of OGD/MW: T = Õ(1/ε²)
A Primal-dual Perceptron
Total time? Still O(nd) per iteration.
Speed up via randomization:
1. It suffices to look at one example per iteration
2. It suffices to obtain crude estimates of the inner products
ℓ2 sampling
Consider two vectors u, v from the d-dimensional sphere
- Sample coordinate i w.p. vi²
- Return ui / vi
Notice that:
- The expectation is correct: E[ui/vi] = Σi vi² · (ui/vi) = u·v
- The variance is at most one (though the magnitude of a single sample can be d)
- Time: O(d)
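A minimal sketch of this estimator (written for general norms; for unit vectors it reduces to returning ui/vi exactly as above):

```python
import numpy as np

def l2_sample(u, v, rng=None):
    """Unbiased estimate of <u, v> from ONE coordinate of u:
    sample i w.p. vi^2 / ||v||^2, return ui * ||v||^2 / vi.
    E = <u, v>; E[estimate^2] = ||u||^2 ||v||^2, so the variance is
    at most 1 for unit vectors even though a single sample can be large."""
    rng = rng or np.random.default_rng()
    vv = float(v @ v)
    i = rng.choice(len(v), p=v * v / vv)   # O(d) preprocessing / sampling
    return u[i] * vv / v[i]
```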
The Primal-Dual Perceptron
Iteratively:
1. Primal player supplies hyperplane xt and an ℓ2-sample from xt
2. Dual player supplies distribution pt and a sample it from it
3. Updates:
   xt+1 = xt + η A_it
   pt+1(i) ← pt(i) · (1 − η ℓ2-sample(Ai xt) + η² ℓ2-sample(Ai xt)²)
   (the exact update pt+1(i) = pt(i) · e^(−η Ai xt) is replaced by its second-order approximation)
Important: preprocess xt only once for all n estimates
Running time: O((n + d)/ε²)
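Putting the pieces together, a sketch of one sampled iteration (the clipping of large ℓ2-samples and the handling of xt = 0 are our assumptions, added so the multiplicative update stays positive):

```python
import numpy as np

def sampled_pd_step(A, x, p, eta, rng):
    """One iteration of the sampled primal-dual perceptron.
    Dual side: O(n) work from ONE shared l2-sampling index j of xt.
    Primal side: O(d) work from ONE row it sampled from pt."""
    n, d = A.shape
    it = rng.choice(n, p=p)                    # dual player's sampled row
    xx = float(x @ x)
    if xx > 0:
        j = rng.choice(d, p=x * x / xx)        # preprocess xt once, reuse j
        est = A[:, j] * (xx / x[j])            # est[i] ~ Ai . xt, unbiased
    else:
        est = np.zeros(n)
    est = np.clip(est, -1.0 / eta, 1.0 / eta)  # tame heavy tails (assumption)
    p = p * (1.0 - eta * est + eta**2 * est**2)
    p /= p.sum()                               # linearized MW update
    x = x + eta * A[it]                        # primal update toward A_it
    norm = np.linalg.norm(x)
    if norm > 1.0:
        x /= norm                              # stay in the unit ball B
    return x, p
```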
Analysis
Some difficulties:
- Non-trivial regret analysis due to sampling
- Need a new multiplicative-update analysis for bounded variance
- The analysis shows a good solution with constant probability
- Need a way to verify a solution to get high probability
Streaming Implementation
- See rows one at a time
- Can't afford to store xt or pt
- Want few passes, poly(log n/ε) space, and sublinear time
- Want to output a succinct representation of the hyperplane:
  - a list of 1/ε² row indices
- In the t-th iteration, when ℓ2-sampling from xt, use the same index jt for all n rows
- Store the samples i1, …, iT of rows chosen by the dual player, and j1, …, jT of ℓ2-sampling indices of the primal player
- Sample in a stream using known algorithms
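For intuition, here is a one-pass ℓ2-sampler for an insertion-only stream via weighted reservoir sampling (a simplification of our own; the known algorithms referenced on the slide handle more general update streams):

```python
import numpy as np

def stream_l2_sample(stream, rng=None):
    """One-pass l2-sample of a vector given as (index, value) pairs,
    each index appearing once. Keeps item j with overall probability
    xj^2 / ||x||^2 via a weighted reservoir of size one."""
    rng = rng or np.random.default_rng()
    total, idx, val = 0.0, None, 0.0
    for j, xj in stream:
        w = xj * xj
        total += w
        # replace with prob w/total; survival prob telescopes to w/||x||^2
        if total > 0 and rng.random() < w / total:
            idx, val = j, xj
    return idx, val, total    # total = ||x||^2, needed to rescale estimates
```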
Lower Bound
- Consider an n × d matrix
- The first 1/ε² rows contain a random position equal to ε; all other values are 0
- Each of the remaining n − 1/ε² rows is a copy of a random row among the first 1/ε²
- With probability ½, choose a random row and replace the value ε by −ε
- With probability ½, do nothing
- Deciding which case you're in requires reading Ω((n + d)/ε²) entries
MEB (minimum enclosing ball)
A Primal-dual algorithm
Iteratively:
1. Primal player supplies point xt
2. Dual player supplies distribution pt
3. Updates:
   xt+1 = xt + η Σ_{i=1}^{n} pt(i) (Ai − xt)
   pt+1(i) = pt(i) · e^(η ||Ai − xt||²)
# iterations via regret of OGD/MW: T = Õ(1/ε²)
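A full-information sketch of one such MEB iteration (the normalization of p each step is our assumption):

```python
import numpy as np

def meb_pd_step(A, x, p, eta):
    """One primal-dual MEB iteration. Primal: move the center x toward
    the p-weighted points. Dual: upweight points far from x."""
    x_new = x + eta * (p @ (A - x))        # sum_i pt(i) (Ai - xt)
    dist2 = np.sum((A - x) ** 2, axis=1)   # ||Ai - xt||^2 for all i
    p_new = p * np.exp(eta * dist2)        # MW toward far points
    return x_new, p_new / p_new.sum()
```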
ℓ2-sampling speed-up
Iteratively:
1. Primal player supplies point xt
2. Dual player supplies distribution pt
3. Updates:
   xt+1 = xt + η A_it
   pt+1(i) = pt(i) · (1 + η ℓ2-sample(||Ai − xt||²) + η² ℓ2-sample(||Ai − xt||²)²)
# iterations via regret of OGD/MW: T = Õ(1/ε²)
Regret speed-up
Updates:
- with probability ε: xt+1 = xt + η A_it
- pt+1(i) = pt(i) · (1 + η ℓ2-sample(||Ai − xt||²) + η² ℓ2-sample(||Ai − xt||²)²)
# iterations remains Õ(1/ε²),
but the O(d) primal work is needed only in an ε-fraction of the iterations, while the O(n) dual work happens in every iteration:
O(n/ε² + d/ε) total time
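A sketch of one iteration with both speed-ups (the precomputed norms ||Ai||² are the slides' stated assumption for MEB; the clipping, and folding any 1/ε rescaling of the lazy primal step into η, are ours):

```python
import numpy as np

def meb_lazy_step(A, norms2, x, p, eta, eps, rng):
    """MEB iteration with the regret speed-up: O(d) primal work only
    with probability eps; O(n) dual work every iteration, estimating
    ||Ai - xt||^2 = ||Ai||^2 - 2 Ai.xt + ||xt||^2 with known norms."""
    n, d = A.shape
    if rng.random() < eps:                     # rare O(d) primal update
        it = rng.choice(n, p=p)
        x = x + eta * A[it]                    # step as on the slide
    xx = float(x @ x)
    if xx > 0:                                 # one shared l2-sample of xt
        j = rng.choice(d, p=x * x / xx)
        inner = A[:, j] * (xx / x[j])          # inner[i] ~ Ai . xt
    else:
        inner = np.zeros(n)
    est = norms2 - 2.0 * inner + xx            # est[i] ~ ||Ai - xt||^2
    est = np.clip(est, -1.0 / eta, 1.0 / eta)  # tame heavy tails (assumption)
    p = p * (1.0 + eta * est + eta**2 * est**2)
    return x, p / p.sum()
```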
Kernels
Map the input to a higher-dimensional space via a non-linear mapping Φ, e.g., a polynomial map.
Classification via a linear classifier in the new space.
Efficient classification and optimization if inner products ⟨Φ(x), Φ(y)⟩ can be computed efficiently (the "kernel function").
The Primal-Dual Perceptron
Iteratively:
1. Primal player supplies hyperplane xt and an ℓ2-sample from xt
2. Dual player supplies distribution pt and a sample it from it
3. Updates: xt+1 ← xt + η A_it
The Primal-Dual Kernel Perceptron
Iteratively:
1. Primal player supplies hyperplane xt and an ℓ2-sample from xt
2. Dual player supplies distribution pt and a sample it from it
3. Updates: xt+1 ← xt + η Φ(A_it)
ℓ2 sampling for kernels
Polynomial kernel: k(x, y) = (xᵀy)^q
Kernel ℓ2-sample = the product of q independent ℓ2-samples of xᵀy
Running time increases by a factor of q
Can also use a Taylor expansion to handle, e.g., the Gaussian kernel
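A sketch of the kernel ℓ2-sample for the degree-q polynomial kernel (the product of q independent unbiased samples of xᵀy is unbiased for (xᵀy)^q by independence):

```python
import numpy as np

def kernel_l2_sample(u, v, q, rng=None):
    """Estimate (u.v)^q by multiplying q independent l2-samples of u.v.
    Each factor is unbiased for <u, v>, so the product is unbiased for
    the degree-q polynomial kernel; time grows by a factor of q."""
    rng = rng or np.random.default_rng()
    vv = float(v @ v)
    probs = v * v / vv
    est = 1.0
    for _ in range(q):                   # q independent l2-samples
        i = rng.choice(len(v), p=probs)
        est *= u[i] * vv / v[i]
    return est
```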