Stochastic Optimization for Big Data Analytics: Algorithms and Libraries
Tianbao Yang‡
SDM 2014, Philadelphia, Pennsylvania
collaborators: Rong Jin†, Shenghuo Zhu‡
‡NEC Laboratories America, †Michigan State University
February 9, 2014
How is the variance reduced? If $x^{s-1} \to x_*$, then $\nabla f(x_{t-1}) \to 0$ and $\nabla f(x_{t-1};\omega_t) - \nabla f(x_*;\omega_t) \to 0$
$x^s = x^s_m$ or $x^s = \sum_{t=1}^{m} x^s_t / m$
Constant learning rate
Better iteration complexity: $O\big((n + \tfrac{1}{\lambda})\log(\tfrac{1}{\epsilon})\big)$ for smooth & strongly convex, $O(1/\epsilon)$ for smooth
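To make the epoch structure above concrete, here is a minimal Python sketch of an SVRG-style variance-reduced update consistent with the notation on this slide; the oracle names grad_i and full_grad and the interface are assumptions for illustration, not code from the tutorial.

```python
import numpy as np

def svrg(grad_i, full_grad, x0, n, eta, m, num_epochs, rng=None):
    """Sketch of an SVRG-style method: within an epoch, the update direction
    grad_i(x, i) - grad_i(x_snap, i) + full_grad(x_snap) is unbiased and its
    variance vanishes as both x and the snapshot x_snap approach x_*, which is
    what permits a constant learning rate eta."""
    rng = rng or np.random.default_rng(0)
    x_snap = x0.copy()
    for s in range(num_epochs):
        mu = full_grad(x_snap)                  # full gradient at the snapshot x^{s-1}
        x = x_snap.copy()
        for t in range(m):
            i = rng.integers(n)                 # sample omega_t
            v = grad_i(x, i) - grad_i(x_snap, i) + mu
            x = x - eta * v                     # constant learning rate
        x_snap = x                              # or the average of the m inner iterates
    return x_snap
```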
General Strategies for Stochastic Optimization
Distributed Optimization: Why?
data distributed over multiple machines
moving the data to a single machine suffers from low network bandwidth and limited disk or memory
benefits from parallel computation
cluster of machines, GPUs
General Strategies for Stochastic Optimization
A Naive Solution
[Figure: the data are partitioned across machines; each machine trains a local solution $w_1, w_2, \dots, w_k$ on its own partition]
$\bar{w} = \frac{1}{k}\sum_{i=1}^{k} w_i$
Issue: the averaged solution is not the optimal solution
General Strategies for Stochastic Optimization
Distributed SGD is simple
Mini-batch SGD with synchronization
Good: reduced variance, faster convergence
Bad: synchronization is expensive
Solutions:
asynchronous updates (convergence: difficult to prove)
fewer synchronizations (guaranteed convergence)
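A hedged sketch of synchronous mini-batch SGD, with the K workers simulated serially; grad_fn, data_shards, and the parameter names are illustrative assumptions rather than a specific library's API.

```python
import numpy as np

def minibatch_sgd(grad_fn, x0, data_shards, eta, T, batch_size, rng=None):
    """Each 'worker' computes an average gradient over its own mini-batch;
    the per-worker gradients are then averaged (the synchronization step)
    and a single update is applied to the shared iterate."""
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    for t in range(T):
        worker_grads = []
        for shard in data_shards:               # in practice these run in parallel
            idx = rng.choice(len(shard), size=batch_size, replace=False)
            worker_grads.append(np.mean([grad_fn(x, shard[i]) for i in idx], axis=0))
        x = x - eta * np.mean(worker_grads, axis=0)   # synchronize, then update
    return x
```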
General Strategies for Stochastic Optimization
Distributed SDCA is not trivial
issue: data are correlated
$\Delta\alpha_i = \arg\max_{\Delta\alpha_i}\; -\phi_i^*(-\alpha^t_i - \Delta\alpha_i) - \frac{\lambda n}{2}\left\| w^t + \frac{1}{\lambda n}\Delta\alpha_i x_i \right\|_2^2$
Not working (when each machine applies this update independently)
General Strategies for Stochastic Optimization
Distributed SDCA (T. Yang, NIPS'13)
The Basic Variant
mR-SDCA running on machine k
Iterate: for $t = 1, \dots, T$
  Iterate: for $j = 1, \dots, m$
    Randomly pick $i \in \{1, \dots, n_k\}$ and let $i_j = i$
    Find $\Delta\alpha_{k,i}$ by IncDual($w = w^{t-1}$, scl $= mK$)
    Set $\alpha^t_{k,i} = \alpha^{t-1}_{k,i} + \Delta\alpha_{k,i}$
  Reduce: $w^t \leftarrow w^{t-1} + \frac{1}{\lambda n}\sum_{j=1}^{m} \Delta\alpha_{k,i_j} x_{k,i_j}$

IncDual: $\Delta\alpha_i = \arg\max_{\Delta\alpha_i}\; -\phi_i^*(-\alpha^t_i - \Delta\alpha_i) - \frac{\lambda n}{2mK}\left\| w^t + \frac{mK}{\lambda n}\Delta\alpha_i x_i \right\|_2^2$
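Below is a rough, serially simulated sketch of the basic variant for the L-1 hinge loss. The closed-form IncDual step used here is an assumption derived from maximizing the scaled quadratic surrogate above for the hinge loss; the function and variable names are illustrative and are not the Birds library's API.

```python
import numpy as np

def disdca_basic(shards, lam, T, m, rng=None):
    """Basic DisDCA variant (machines simulated serially): each 'machine' k takes
    m IncDual steps against the shared w^{t-1} with scaling scl = m*K, and the
    accumulated dual increments are then reduced into w."""
    rng = rng or np.random.default_rng(0)
    K = len(shards)
    n = sum(len(y) for _, y in shards)
    d = shards[0][0].shape[1]
    w = np.zeros(d)
    alpha = [np.zeros(len(y)) for _, y in shards]
    scl = m * K
    for t in range(T):
        delta_w = np.zeros(d)
        for k, (X, y) in enumerate(shards):          # runs in parallel in practice
            for _ in range(m):
                i = rng.integers(len(y))
                xi, yi = X[i], y[i]
                # closed-form maximizer of the scaled incremental dual for the hinge loss
                margin = 1.0 - yi * (w @ xi)
                beta = max(0.0, min(1.0, margin * lam * n / (scl * (xi @ xi)) + alpha[k][i] * yi))
                d_alpha = yi * beta - alpha[k][i]
                alpha[k][i] += d_alpha
                delta_w += d_alpha * xi / (lam * n)
        w = w + delta_w                              # Reduce step
    return w
```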
General Strategies for Stochastic Optimization
Distributed SDCA (T. Yang, NIPS'13)
The Practical Variant
Initialize: $u^0_t = w^{t-1}$
Iterate: for $j = 1, \dots, m$
  Randomly pick $i \in \{1, \dots, n_k\}$ and let $i_j = i$
  Find $\Delta\alpha_{k,i_j}$ by IncDual($w = u^{j-1}_t$, scl $= K$)
  Update $\alpha^t_{k,i_j} = \alpha^{t-1}_{k,i_j} + \Delta\alpha_{k,i_j}$
  Update $u^j_t = u^{j-1}_t + \frac{1}{\lambda n_k}\Delta\alpha_{k,i_j} x_{k,i_j}$

IncDual: $\Delta\alpha_{i_j} = \arg\max_{\Delta\alpha_{i_j}}\; -\phi_{i_j}^*(-\alpha^t_{i_j} - \Delta\alpha_{i_j}) - \frac{\lambda n}{2K}\left\| u^{j-1}_t + \frac{K}{\lambda n}\Delta\alpha_{i_j} x_{i_j} \right\|_2^2$

Uses updated information and a larger step size
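For contrast with the basic-variant sketch above, here is a hedged sketch of the practical variant's inner loop on one machine: the local copy u is refreshed after every IncDual step and the scaling is K rather than mK, so each step uses fresher information and a larger effective step size. The hinge-loss closed form and all names remain illustrative assumptions.

```python
import numpy as np

def disdca_practical_inner(u, X, y, alpha_k, lam, n, K, m, rng):
    """One machine's inner loop in the practical variant: u is updated immediately
    after each dual step (note 1/(lam*n_k) = K/(lam*n) when the data are evenly
    split), and delta_w collects the contribution to be reduced into the global w."""
    n_k = len(y)
    delta_w = np.zeros_like(u)
    for _ in range(m):
        i = rng.integers(n_k)
        xi, yi = X[i], y[i]
        margin = 1.0 - yi * (u @ xi)
        beta = max(0.0, min(1.0, margin * lam * n / (K * (xi @ xi)) + alpha_k[i] * yi))
        d_alpha = yi * beta - alpha_k[i]
        alpha_k[i] += d_alpha
        u = u + d_alpha * xi / (lam * n_k)     # immediate local update
        delta_w += d_alpha * xi / (lam * n)
    return u, delta_w
```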
General Strategies for Stochastic Optimization
Distributed SDCA (T. Yang, NIPS'13)
The Practical Variant is much faster than the Basic Variant
[Figure: duality gap $\log(P(w^t) - D(\alpha^t))$ versus iterations, comparing DisDCA-p and DisDCA-n]
Implementations and A Library
Outline
1 STochastic OPtimization (STOP) and Machine Learning
2 STOP Algorithms for Big Data Classification and Regression
3 General Strategies for Stochastic Optimization
4 Implementations and A Library
Implementations and A Library
Implementations and A Library
In this section, we present some techniques for efficient implementations and a practical library
Coverage:
efficient averaging
Gradient sparsification
pre-optimization: screening
pre-optimization: data preconditioning
distributed (parallel) optimization library
Implementations and A Library
Efficient Averaging
Update rules:
$x_t = (1 - \gamma_t\lambda)\, x_{t-1} - \gamma_t g_t$
$\bar{x}_t = (1 - \alpha_t)\, \bar{x}_{t-1} + \alpha_t x_t$
Efficient update when the gradient $g_t$ is sparse (has many zero entries):
$S_t = \begin{pmatrix} 1-\lambda\gamma_t & 0 \\ \alpha_t(1-\lambda\gamma_t) & 1-\alpha_t \end{pmatrix} S_{t-1}, \quad S_1 = I$
$y_t = y_{t-1} - [S_t^{-1}]_{11}\,\gamma_t g_t$
$\bar{y}_t = \bar{y}_{t-1} - \big([S_t^{-1}]_{21} + [S_t^{-1}]_{22}\,\alpha_t\big)\,\gamma_t g_t$
$x_T = [S_T]_{11}\, y_T, \qquad \bar{x}_T = [S_T]_{21}\, y_T + [S_T]_{22}\, \bar{y}_T$
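A minimal numpy sketch of this lazy-update scheme; the gradient oracle sample_fn and the helper x_dot are assumptions introduced for illustration. The pair $(x_t, \bar{x}_t)$ is represented as $S_t$ applied to two stored vectors, so each step touches only the nonzero coordinates of the gradient.

```python
import numpy as np

def lazy_sgd_with_average(sample_fn, d, T, lam, gamma, alpha):
    """Sketch of the efficient-averaging recursion: x_t and the running average
    xbar_t are recovered from S_t and the vectors (y1, y2), which are updated
    only on the support of each sparse gradient.  sample_fn(x_dot, t) should
    return (grad_values, grad_indices); x_dot(idx, vals) gives <x_{t-1}, v>
    for a sparse vector v without materializing x_{t-1}."""
    y1, y2 = np.zeros(d), np.zeros(d)
    S = np.eye(2)                                   # S_1 = I
    for t in range(1, T + 1):
        gt, at = gamma(t), alpha(t)
        x_dot = lambda idx, vals: S[0, 0] * np.dot(y1[idx], vals)
        g_vals, idx = sample_fn(x_dot, t)           # sparse stochastic gradient
        A = np.array([[1.0 - lam * gt, 0.0],
                      [at * (1.0 - lam * gt), 1.0 - at]])
        S = A @ S
        Sinv = np.linalg.inv(S)
        y1[idx] -= Sinv[0, 0] * gt * g_vals
        y2[idx] -= (Sinv[1, 0] + Sinv[1, 1] * at) * gt * g_vals
    x = S[0, 0] * y1                                # x_T
    xbar = S[1, 0] * y1 + S[1, 1] * y2              # averaged iterate
    return x, xbar
```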
Implementations and A Library
Gradient sparsification
Sparsification by importance sampling
$R_{ti} = \mathrm{unif}(0, 1)$
$\tilde{g}_{ti} = g_{ti}\,[\,|g_{ti}| \ge g_i\,] + g_i\,\mathrm{sign}(g_{ti})\,[\,g_i R_{ti} \le |g_{ti}| < g_i\,]$
Unbiased sample: $\mathrm{E}[\tilde{g}_t] = g_t$.
Trades a variance increase for more efficient computation.
Especially useful for Logistic Regression
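A small sketch of this importance-sampling step; the function name and the use of a single scalar threshold in place of the per-coordinate thresholds $g_i$ are simplifications for illustration. Because the result is unbiased, the sparsified gradient can be plugged into the stochastic updates above at the price of extra variance.

```python
import numpy as np

def sparsify(g, threshold):
    """Importance-sampling sparsification: coordinates with |g_i| >= threshold are
    kept exactly; smaller coordinates are kept with probability |g_i|/threshold and
    rounded to +/- threshold, otherwise set to zero, so E[g_tilde] = g."""
    g = np.asarray(g, dtype=float)
    r = np.random.uniform(size=g.shape)
    keep_big = np.abs(g) >= threshold
    keep_small = (~keep_big) & (r * threshold <= np.abs(g))
    g_tilde = np.where(keep_big, g, 0.0)
    return np.where(keep_small, threshold * np.sign(g), g_tilde)
```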
Implementations and A Library
Pre-optimization: Screening
Screening for Lasso (Wang et al., 2012)
Screening for SVM (Ogawa et al., 2013)
Implementations and A Library
Distributed Optimization Library: Birds
The Birds library implements distributed stochastic dual coordinate ascent (DisDCA) for classification and regression with broad support.
For technical details see:
"Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent." Tianbao Yang. NIPS 2013.
"On the Theoretical Analysis of Distributed Stochastic Dual Coordinate Ascent." Tianbao Yang et al. Tech Report 2013.
The code is distributed under the GNU General Public License (see license.txt for details).
Implementations and A Library
Distributed Optimization Library: Birds
What problems does it Solve?
Classification and Regression
Loss:
1. Hinge loss (SVM): L-1 hinge loss, L-2 hinge loss
2. Logistic loss (Logistic Regression)
3. Least squares loss (Ridge Regression)
Regularizer:
1. ℓ2 norm: SVM, Logistic Regression, Ridge Regression
2. ℓ1 norm: Lasso, SVM, LR with ℓ1 norm
Multi-class: one-vs-all
Implementations and A Library
Distributed Optimization Library: Birds
What data does it Support?
dense, sparse
txt, binary
What environment does it Support?
Prerequisites: Boost Library
Tested on a cluster of Linux servers (up to hundreds of machines)
Tested on Cygwin on Windows with multiple cores
Implementations and A Library
How about kernel methods?
Stochastic/Online Optimization Approaches (Jin et al., 2010; Orabona et al., 2012)
Linearization + STOP for linear methods:
the Nyström method (Drineas & Mahoney, 2005)
Random Fourier features (Rahimi & Recht, 2007)
Comparison of the two (Yang et al., 2012)
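As an illustration of the linearization route, here is a hedged sketch of random Fourier features for the RBF kernel $k(x, y) = \exp(-\gamma\|x-y\|^2)$; the function name and parameterization are assumptions for the example. The mapped data can then be handed to any of the linear STOP solvers discussed earlier.

```python
import numpy as np

def random_fourier_features(X, n_features, gamma, seed=0):
    """Random Fourier features (Rahimi & Recht, 2007): z(x)^T z(y) approximates
    exp(-gamma * ||x - y||^2), so kernel methods reduce to linear ones."""
    rng = np.random.RandomState(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)
```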
Implementations and A Library
References I
Drineas, Petros and Mahoney, Michael W. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. J. Mach. Learn. Res., 6:2153–2175, 2005.
Jin, Rong, Hoi, Steven C. H., and Yang, Tianbao. Online multiple kernel learning: Algorithms and mistake bounds. In Proceedings of the 21st International Conference on Algorithmic Learning Theory, pp. 390–404, 2010.
Ogawa, Kohei, Suzuki, Yoshiki, and Takeuchi, Ichiro. Safe screening of non-support vectors in pathwise SVM computation. In ICML (3), pp. 1382–1390, 2013.
Orabona, Francesco, Luo, Jie, and Caputo, Barbara. Multi kernel learning with online-batch optimization. Journal of Machine Learning Research, 13:227–253, 2012.
Implementations and A Library
References II
Rahimi, Ali and Recht, Benjamin. Random features for large-scale kernel machines. In NIPS, 2007.
Wang, Jie, Lin, Binbin, Gong, Pinghua, Wonka, Peter, and Ye, Jieping. Lasso screening rules via dual polytope projection. CoRR, abs/1211.3966, 2012.
Yang, Tianbao, Li, Yu-Feng, Mahdavi, Mehrdad, Jin, Rong, and Zhou, Zhi-Hua. Nyström method vs random Fourier features: A theoretical and empirical comparison. In NIPS, pp. 485–493, 2012.