  • Fast Streaming Regression

    HARSH GUPTA          R SRIKANT          SAI KIRAN BURLE

    UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

  • Outline

    Motivation

    Problem formulation

    Existing techniques/algorithms

    Fast Streaming Regression (FSR) algorithm

    Performance comparison

  • Motivation

    Regression is widely used in practical machine learning applications as well as statistics and hence has been a subject of detailed study.

    Stochastic optimization techniques and algorithms based on stochastic gradient descent (SGD) are used widely in regression problems, primarily due to their desirable computational complexity and storage overhead.

    The disadvantage of such methods is that they do not converge as fast as asymptotically optimal methods such as the empirical risk minimizer (ERM), because the noisy gradient estimate they use introduces high variance.

  • Motivation

    Therefore, there has been growing interest in reducing the variance introduced by noisy gradient estimates in SGD by using explicit and implicit variance reduction techniques.

    The aim is to devise an algorithm which is computationally and storage-wise as efficient as SGD while having an asymptotically optimal convergence rate like ERM.

    Streaming SVRG (Frostig et al., 2015) and Projected SGD with Weighted Averaging (Srikant et al., 2016) are some recent papers addressing similar problems.

  • Problem formulation

    The general stochastic optimization problem is formulated as follows:

    $\min_{w \in \Omega} f(w)$

    where $f(w) = \mathbb{E}_{\psi \sim D}[\psi(w)]$, $\psi(w)$ is a random function of $w$ drawn from the distribution $D$, and $\Omega$ is the constraint set. Let $w^*$ be the minimizer of $f$.

    We consider the case where $f(w) = \frac{1}{2} w^T P w - q^T w + r$, i.e., the case of quadratic risk regression. We assume $f$ is strongly convex and has Lipschitz continuous gradients.

    What is $w$? In most practical machine learning applications, $w$ is a contextually defined vector representing weights or parameters assigned to various features. Thus, $\Omega \subseteq \mathbb{R}^d$, where $d \geq 1$.
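    For the quadratic case, these two regularity assumptions reduce to spectral conditions on $P$; a short derivation (spelled out here, not on the original slide, assuming $P$ is symmetric):

```latex
% Gradient and Hessian of the quadratic risk f(w) = (1/2) w^T P w - q^T w + r
\nabla f(w) = P w - q, \qquad \nabla^2 f(w) = P.
% So f is \mu-strongly convex iff \lambda_{\min}(P) \ge \mu > 0, and \nabla f is
% Lipschitz with constant L = \lambda_{\max}(P):
\mu \,\|u - v\|_2 \;\le\; \|\nabla f(u) - \nabla f(v)\|_2 \;=\; \|P (u - v)\|_2 \;\le\; \lambda_{\max}(P)\,\|u - v\|_2 .
```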

  • Practical issues

    For the general stochastic optimization problem, the distribution $D$ from which the random function $\psi(w)$ is drawn is often unknown.

    Even if we know the distribution, it might not be possible to use standard optimization techniques to solve the problem as gradient computations may turn out to be cumbersome and often impossible.

    We generally have access to $n$ samples of the random function $\psi(w)$, i.e. $\psi_1(w), \psi_2(w), \ldots, \psi_n(w)$. We need to solve the problem as fast as possible using as little data as possible.

    The $n$ data points may also be available one at a time (streaming data) rather than all at once.

    For the quadratic risk regression problem, this means that we do not have the exact values of $P$ and $q$ available to us. Instead, for every time step $i$, we have random samples $P_i, q_i$ such that $E[P_i] = P$ and $E[q_i] = q$.

  • Example: Least-squares regression

    GIF Credits – Tom O’Haver, UMD

    For least-squares regression, we have $f(w) = \mathbb{E}_{(X,Y) \sim D}\big[(w^T X - Y)^2\big]$ (a special case of quadratic risk regression).

    The aim is to minimize $f$, i.e., to fit the best possible line through the data, as shown in the animation.

    We make a prediction $Y' = w^T X$ for $Y$ and try to minimize the mean squared error $\mathbb{E}\big[(Y' - Y)^2\big]$ in order to obtain the most accurate predictor.
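    To connect this example to the quadratic-risk form on the previous slide, expanding the squared error gives (a short derivation, not shown on the original slide):

```latex
% Least-squares risk written in the quadratic-risk form (1/2) w^T P w - q^T w + r
f(w) = \mathbb{E}\big[(w^T X - Y)^2\big]
     = \tfrac{1}{2}\, w^T \underbrace{\big(2\,\mathbb{E}[X X^T]\big)}_{P} w
       \;-\; \underbrace{\big(2\,\mathbb{E}[Y X]\big)^T}_{q^T} w
       \;+\; \underbrace{\mathbb{E}[Y^2]}_{r}.
% Hence the per-sample quantities  P_i = 2 X_i X_i^T  and  q_i = 2 Y_i X_i
% are unbiased estimates of P and q, as required on the "Practical issues" slide.
```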

  • Existing techniques/algorithms

    What technique sets the benchmark for performance?

    ⇒ Empirical Risk Minimization (ERM): ERM solves for the $w$ that minimizes the empirical risk, i.e. $w_{ERM} = \arg\min_{w \in \Omega} \sum_{i=1}^{n} \psi_i(w)$.

    ERM is the Maximum Likelihood Estimator (MLE) in the context of statistical modelling and is known to converge to the true solution with a convergence rate of $O(1/n)$ for strongly convex and Lipschitz continuous functions.

    Under certain regularity conditions and correct model specification, MLE is asymptotically efficient (since it is the ML estimator, this is intuitive).
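    For the least-squares instance, the ERM solution has a closed form via the normal equations. A minimal NumPy sketch (function and variable names are our own, for illustration):

```python
import numpy as np

def erm_least_squares(X, y):
    """Empirical risk minimizer for linear least squares.

    Solves  min_w  sum_i (w^T x_i - y_i)^2  via the normal equations
    (sum_i x_i x_i^T) w = sum_i y_i x_i.
    """
    A = X.T @ X        # d x d matrix: sum of x_i x_i^T over all samples
    b = X.T @ y        # d-vector: sum of y_i x_i
    return np.linalg.solve(A, b)

# Example usage on synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 25))
w_true = rng.standard_normal(25)
y = X @ w_true + 0.1 * rng.standard_normal(1000)
w_erm = erm_least_squares(X, y)
```

    Note the $d \times d$ linear solve: this is the matrix-inversion cost referred to on the next slide.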

  • Existing techniques/algorithms

    Why not use ERM then?

    All 𝑛𝑛 data points may not be available to us immediately as data points might arrive in a streaming fashion.

    ERM is computationally cumbersome as it involves matrix inversions even for the simple case of least-squares regression.

    There may not be a closed-form ERM solution for certain objective functions, such as the logistic loss in logistic regression.

  • Existing techniques/algorithms

    What techniques are used instead of ERM?

    ⇒ Stochastic Gradient Descent (SGD) based algorithms.

    SGD uses a noisy estimate of the gradient at each iteration such that the expected value of the noisy estimate is the same as the true gradient.

    Although SGD also achieves a convergence rate of $O(1/n)$ for strongly convex and Lipschitz continuous functions, it has poorer constants than ERM due to the variance introduced by the gradient estimate.

    SGD is often combined with weighted iterate averaging to speed up convergence. Srikant et al. (2016) propose a projected SGD method with weighted averaging for linear least-squares regression and obtain a competitive ratio of 4 with respect to ERM.
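    For concreteness, a minimal sketch of projected SGD with weighted iterate averaging on the streaming least-squares problem (the step-size schedule, weights, and the unconstrained setting $\Omega = \mathbb{R}^d$ are illustrative assumptions, not the exact choices of the cited papers):

```python
import numpy as np

def psgd_weighted_avg(stream, d, c=1.0):
    """Projected SGD with weighted iterate averaging (sketch).

    `stream` yields (x_i, y_i) pairs.  The per-sample gradient of
    (w^T x - y)^2 is 2 (w^T x - y) x.  Iterate weights proportional
    to i emphasize the later, less noisy iterates.
    """
    w = np.zeros(d)
    w_avg = np.zeros(d)
    weight_sum = 0.0
    for i, (x, y) in enumerate(stream, start=1):
        grad = 2.0 * (w @ x - y) * x            # noisy single-sample gradient
        w = w - (c / (i + 2 * d)) * grad        # decaying step; offset keeps early steps stable
        # With Omega = R^d the projection step is the identity.
        weight_sum += i
        w_avg += (i / weight_sum) * (w - w_avg)  # running weighted average of iterates
    return w_avg
```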

  • Goal

    ERM is asymptotically optimal but computationally infeasible.

    On the other hand, SGD-based techniques achieve the same order of convergence as ERM with much lower computational complexity, but they have very poor constants compared to ERM.

    Our aim is to come up with an algorithm that achieves exactly the same rate of convergence as ERM asymptotically.

    Important: The computational and storage overhead of the new algorithm should be comparable to that of SGD based algorithms.

  • Fast Streaming Regression

    How to reduce the variance in the gradient estimate of SGD?

    ⇒ Use as many points as possible to compute the gradient instead of using just one!

    For quadratic risk regression, we observe that $\nabla f(w) = Pw - q$. Since the gradient is linear, we can maintain a running average of the $P_i$'s and $q_i$'s and use it to compute the gradient at each iteration.

    Thus, at iteration $i$, the SGD gradient estimate would be $\nabla f_i(w) = P_i w - q_i$. For FSR, we instead use $\nabla f_i(w) = \bar{P}_i w - \bar{q}_i$, where $\bar{P}_i = \frac{1}{i}\sum_{t=1}^{i} P_t$ and $\bar{q}_i = \frac{1}{i}\sum_{t=1}^{i} q_t$.

  • Fast Streaming Regression

    Hence, we get the following update rule for FSR:

    Update Rule: $w_i = \Pi_{\Omega}\big(w_{i-1} - \gamma_i \{\bar{P}_i w_{i-1} - \bar{q}_i\}\big)$

    where,

    ◦ $\gamma_i$ is the step size at the $i$-th iteration

    ◦ $\bar{P}_i = \dfrac{(i-1)\,\bar{P}_{i-1} + P_i}{i}$, $\;\bar{q}_i = \dfrac{(i-1)\,\bar{q}_{i-1} + q_i}{i}$

    ◦ $\Pi_{\Omega}$ is the projection operator onto the constraint set $\Omega$.

    Note that each iteration of FSR has $O(1)$ computational complexity, the same as SGD. The storage overhead is also the same as that of SGD.
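    A minimal sketch of this update for the unconstrained streaming least-squares case ($\Omega = \mathbb{R}^d$, so the projection is the identity); the step-size schedule is an illustrative choice, not the one analyzed in the report:

```python
import numpy as np

def fsr_least_squares(stream, d, c=1.0):
    """Fast Streaming Regression (sketch) for streaming least squares.

    `stream` yields (x_i, y_i) pairs.  Per-sample estimates are
    P_i = 2 x_i x_i^T and q_i = 2 y_i x_i; the update uses their
    running averages P_bar, q_bar instead of the single-sample values.
    """
    w = np.zeros(d)
    P_bar = np.zeros((d, d))
    q_bar = np.zeros(d)
    for i, (x, y) in enumerate(stream, start=1):
        # Running averages: bar_i = ((i-1) * bar_{i-1} + sample_i) / i
        P_bar += (2.0 * np.outer(x, x) - P_bar) / i
        q_bar += (2.0 * y * x - q_bar) / i
        grad = P_bar @ w - q_bar                 # averaged gradient estimate
        w = w - (c / (i + d)) * grad             # decaying step size (illustrative)
        # Omega = R^d here, so no projection step is needed.
    return w
```

    Each iteration only folds the newest sample into the running averages and takes one gradient step, so the per-sample work and storage stay constant in the number of samples (they scale with the dimension $d$, just as forming the single-sample estimate $P_i w - q_i$ does).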

  • Fast Streaming Regression

    Why does using more data points help?

    ⇒ The greater the number of samples used, the lower the noise variance in the gradient estimate.

    Why would FSR compete with ERM?

    ⇒ We give an intuitive explanation here. The detailed proof will be included in the report.

    ⇒ Let $\Omega = \mathbb{R}^d$. Suppose the FSR algorithm converges. The update rule implies:

    $\lim_{i \to \infty} w_i = \lim_{i \to \infty} \big(w_{i-1} - \gamma_i \{\bar{P}_i w_{i-1} - \bar{q}_i\}\big)$

    Therefore, we have $\lim_{i \to \infty} \big(\bar{P}_i w_{i-1} - \bar{q}_i\big) = 0$. This is the same asymptotic condition that the ERM solution needs to satisfy!
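    To make the last claim explicit (a step not written out on the slide): for the quadratic risk, the ERM objective built from the same $n$ samples has exactly these running averages in its first-order optimality condition.

```latex
% ERM over the first n streaming samples of the quadratic risk
w_n^{ERM} \;=\; \arg\min_{w}\; \frac{1}{n}\sum_{i=1}^{n}\Big(\tfrac{1}{2}\, w^T P_i w - q_i^T w + r_i\Big)
\;\;\Longrightarrow\;\;
\bar{P}_n\, w_n^{ERM} - \bar{q}_n \;=\; 0,
% which is the same condition the FSR iterates satisfy in the limit.
```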

  • Results

    For the Fast Streaming Regression (FSR) algorithm, we obtain the following result.

    Theorem: For the FSR algorithm, we have:

    $\lim_{n \to \infty} \dfrac{E\big[\|w_n^{FSR} - w^*\|_2^2\big]}{E\big[\|w_n^{ERM} - w^*\|_2^2\big]} = 1$

    where $w_n^{ERM}$ is the estimate of the ERM algorithm using $n$ data samples and $w_n^{FSR}$ is the output of the FSR algorithm after $n$ iterations. Thus, the FSR algorithm achieves an asymptotic competitive ratio of 1 with respect to ERM. Also, FSR has the same order of convergence rate as ERM, i.e., $O(1/n)$.

  • Performance comparison

    We compare our algorithm with ERM, which achieves a convergence rate of $O(1/n)$ (the same as that of SGD).

    We generate synthetic 25-dimensional data for the random variable $X$ and set $Y = w^{*T} X + \eta$, where $\eta$ is zero-mean noise.

    Our algorithm clearly competes with ERM, achieving nearly the same convergence rate; a sketch of such an experiment follows below.
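    A minimal, self-contained sketch of such an experiment (noise level, sample size, and step-size schedule are illustrative assumptions, not the exact settings behind the slide's figure):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 25, 5000
w_star = rng.standard_normal(d)

# Synthetic streaming data: X ~ N(0, I_25), Y = w*^T X + zero-mean noise
X = rng.standard_normal((n, d))
y = X @ w_star + 0.5 * rng.standard_normal(n)

# ERM via the normal equations on all n samples
w_erm = np.linalg.solve(X.T @ X, X.T @ y)

# FSR: one pass over the stream with running averages (Omega = R^d)
w_fsr = np.zeros(d)
P_bar, q_bar = np.zeros((d, d)), np.zeros(d)
for i in range(1, n + 1):
    x_i, y_i = X[i - 1], y[i - 1]
    P_bar += (2.0 * np.outer(x_i, x_i) - P_bar) / i
    q_bar += (2.0 * y_i * x_i - q_bar) / i
    w_fsr -= (1.0 / (i + d)) * (P_bar @ w_fsr - q_bar)   # illustrative step size

print("ERM squared error:", np.linalg.norm(w_erm - w_star) ** 2)
print("FSR squared error:", np.linalg.norm(w_fsr - w_star) ** 2)
```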

  • Performance comparison

    We also implement FSR, ERM, and PSGD-WA for the linear least-squares regression problem with $\ell_2$ regularization (for a larger number of iterations); see the note below on how the regularized problem fits the quadratic-risk framework.

    Since the number of iterations is larger than in the previous experiment, the FSR and ERM curves cannot be distinguished in the figure; they lie right on top of each other.

    Thus, experimental results corroborate our theoretical claims.
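    The $\ell_2$-regularized problem fits the same quadratic-risk framework, which is why FSR applies unchanged; a brief check (not spelled out on the slide), with $\lambda > 0$ the regularization weight:

```latex
% Ridge-regularized least-squares risk, still of the form (1/2) w^T P w - q^T w + r
f_{\lambda}(w) = \mathbb{E}\big[(w^T X - Y)^2\big] + \lambda \|w\|_2^2
              = \tfrac{1}{2}\, w^T \big(2\,\mathbb{E}[X X^T] + 2\lambda I\big) w
                - \big(2\,\mathbb{E}[Y X]\big)^T w + \mathbb{E}[Y^2],
% so P is simply shifted to P + 2*lambda*I (and the streaming estimates to P_i + 2*lambda*I),
% which also improves strong convexity.
```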

  • Ongoing Work

    Since loss functions are typically not polynomial, we are investigating linearization techniques to extend our idea to non-polynomial loss functions, aiming for a faster convergence rate than SGD while keeping the same computational complexity.

    We are also investigating the algorithm's performance on large-scale systems with high dimensionality and strong correlation among data points, to understand its suitability for different scenarios.

    We also intend to implement the algorithm on real data sets in order to gauge its practical usability.

  • References

    [1] Simon Lacoste-Julien, Mark Schmidt, and Francis Bach (2012), “A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method”, http://arxiv.org/abs/1212.2002

    [2] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford (2015), “Competing with the Empirical Risk Minimizer in a Single Pass”, http://arxiv.org/pdf/1412.6606v2.pdf

    [3] K. Cohen, A. Nedić, and R. Srikant (2016), “On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression”, in ICASSP 2016.


  • THANK YOU
