Page 1:

Computational and Statistical Learning Theory

TTIC 31120

Prof. Nati Srebro

Lecture 17: Stochastic Optimization

Part II: Realizable vs Agnostic Rates
Part III: Nearest Neighbor Classification

Page 2:

Stochastic (Sub)-Gradient Descent

If $\|\nabla f(w, z)\|_2 \le G$, then with appropriate step size:

$\mathbb{E}[F(\bar w_m)] \;\le\; \inf_{w\in\mathcal W,\ \|w\|_2\le B} F(w) \;+\; O\!\left(\sqrt{\frac{B^2G^2}{m}}\right)$

Similarly for Stochastic Mirror Descent.

Optimize $F(w) = \mathbb{E}_{z\sim\mathcal D}[f(w, z)]$ s.t. $w \in \mathcal W$
1. Initialize $w_1 = 0 \in \mathcal W$
2. At iteration $t = 1, 2, 3, \ldots$:
   a. Sample $z_t \sim \mathcal D$ (obtain $g_t$ s.t. $\mathbb{E}[g_t] \in \partial F(w_t)$)
   b. $w_{t+1} = \Pi_{\mathcal W}\!\left(w_t - \eta_t \nabla f(w_t, z_t)\right)$
3. Return $\bar w_m = \frac{1}{m}\sum_{t=1}^m w_t$

Online Gradient Descent [Zinkevich '03] → (online2stochastic conversion [Cesa-Bianchi et al. '02], using $g_t$) → Stochastic Gradient Descent [Nemirovski & Yudin '78]

Online Mirror Descent [Shalev-Shwartz & Singer '07] → Stochastic Mirror Descent [Nemirovski & Yudin '78]

Page 3:

Stochastic Optimization

$\min_{w\in\mathcal W} F(w) = \mathbb{E}_{z\sim\mathcal D}[f(w, z)]$

based only on stochastic information about $F$:
• Only unbiased estimates of $F(w)$, $\nabla F(w)$
• No direct access to $F$

E.g., fixed $f(w, z)$ but $\mathcal D$ unknown:
• Optimize $F(w)$ based on an i.i.d. sample $z_1, z_2, \ldots, z_m \sim \mathcal D$
• $g = \nabla f(w, z_t)$ is an unbiased estimate of $\nabla F(w)$

• Traditional applications:
  • Optimization under uncertainty
    • Uncertainty about network performance
    • Uncertainty about client demands
    • Uncertainty about system behavior in control problems
  • Complex systems where it's easier to sample than to integrate over $z$

Page 4:

Machine Learning is Stochastic Optimization

$\min_h L(h) = \mathbb{E}_{z\sim\mathcal D}[\ell(h, z)] = \mathbb{E}_{(x,y)\sim\mathcal D}[\mathrm{loss}(h(x), y)]$

• Optimization variable: predictor $h$
• Objective: generalization error $L(h)$
• Stochasticity over $z = (x, y)$

“General Learning” ≡ Stochastic Optimization:

Vladimir Vapnik

Arkadi Nemirovskii

Page 5:

Statistical Learning vs Stochastic Optimization

Stochastic Optimization [Arkadi Nemirovskii]:
• Focus on computational efficiency
• Generally assumes unlimited sampling, as in Monte-Carlo methods for complicated objectives
• Optimization variable generally a vector in a normed space; complexity control through norm
• Mostly convex objectives

Statistical Learning [Vladimir Vapnik]:
• Focus on sample size
• What can be done with a fixed number of samples?
• Abstract hypothesis classes: linear predictors, but also combinatorial hypothesis classes; generic measures of complexity such as VC-dimension, fat-shattering dimension, Rademacher complexity
• Also non-convex classes and loss functions

Page 6:

Two Approaches to Stochastic Optimization / Learning

$\min_{w\in\mathcal W} F(w) = \mathbb{E}_{z\sim\mathcal D}[f(w, z)]$

• Empirical Risk Minimization (ERM) / Sample Average Approximation (SAA):
  • Collect a sample $z_1, \ldots, z_m$
  • Minimize $F_S(w) = \frac{1}{m}\sum_i f(w, z_i)$
  • Analysis typically based on uniform convergence

• Stochastic Approximation (SA): [Robbins & Monro 1951]
  • Update $w_t$ based on $z_t$
  • E.g., based on $g_t = \nabla f(w_t, z_t)$
  • E.g.: stochastic gradient descent
  • Online-to-batch conversion of an online algorithm…

Page 7:

SA/SGD for Machine Learning

• In learning with ERM, we need to optimize

$\hat w = \arg\min_{w\in\mathcal W} L_S(w), \qquad L_S(w) = \frac{1}{m}\sum_i \ell(w, z_i)$

• $L_S(w)$ is expensive to evaluate exactly: $O(md)$ time
• Cheap to get an unbiased gradient estimate: $O(d)$ time

$i \sim \mathrm{Unif}(1\ldots m), \quad g = \nabla\ell(w, z_i), \quad \mathbb{E}[g] = \frac{1}{m}\sum_i \nabla\ell(w, z_i) = \nabla L_S(w)$

• SGD guarantee:

$\mathbb{E}\!\left[L_S(\bar w^{(T)})\right] \le \inf_{w\in\mathcal W} L_S(w) + \sqrt{\frac{\sup\|\nabla\ell\|_2^2 \cdot \sup\|w\|_2^2}{T}}$

Page 8:

SGD for SVM

$\min L_S(w)$ s.t. $\|w\|_2 \le B$

Use $g_t = \nabla_w \mathrm{loss}_{\mathrm{hinge}}(\langle w_t, \phi(x_{i_t})\rangle;\, y_{i_t})$ for a random $i_t$:

Initialize $w^{(0)} = 0$
At iteration $t$:
• Pick $i \in 1\ldots m$ at random
• If $y_i \langle w^{(t)}, \phi(x_i)\rangle < 1$: $\;w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_i \phi(x_i)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
• If $\|w^{(t+1)}\|_2 > B$, then $w^{(t+1)} \leftarrow B\,\frac{w^{(t+1)}}{\|w^{(t+1)}\|_2}$
Return $\bar w^{(T)} = \frac{1}{T}\sum_{t=1}^T w^{(t)}$

If $\|\phi(x)\|_2 \le G$, then $\|g_t\|_2 \le G$ and $L_S(\bar w^{(T)}) \le L_S(\hat w) + \sqrt{\frac{B^2G^2}{T}}$
(in expectation over the randomness in the algorithm)
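A runnable sketch of this loop (taking the feature map $\phi$ to be the identity; the data arrays `X`, `y` are assumed inputs, not from the slides):

```python
import numpy as np

def sgd_svm(X, y, B, T, rng=np.random.default_rng(0)):
    """Projected SGD on the hinge loss over {||w||_2 <= B}; returns the averaged iterate."""
    m, d = X.shape
    G = np.max(np.linalg.norm(X, axis=1))     # ||phi(x)||_2 <= G, hence ||g_t||_2 <= G
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)                   # pick i in 1..m at random
        if y[i] * (w @ X[i]) < 1:             # hinge loss is active: take a gradient step
            w = w + (B / (G * np.sqrt(t))) * y[i] * X[i]
        norm = np.linalg.norm(w)
        if norm > B:                          # project back onto the ball of radius B
            w *= B / norm
        w_sum += w
    return w_sum / T                          # averaged iterate \bar w^{(T)}
```

With $T$ on the order of $B^2G^2/\epsilon^2$, the averaged iterate is $\epsilon$-suboptimal on $L_S$ in expectation, matching the guarantee above.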

Page 9:

Stochastic vs Batch

[Diagram: for $\min L_S(w)$ s.t. $\|w\|_2 \le B$, with per-example gradients $g_i = \nabla\mathrm{loss}(w, (x_i, y_i))$ over the sample $(x_1, y_1), \ldots, (x_m, y_m)$:
• Batch gradient descent computes all $m$ gradients and then takes a single step using $\nabla L_S(w) = \frac{1}{m}\sum_i g_i$.
• Stochastic gradient descent takes a step $w \leftarrow w - \eta\, g_i$ after each individual gradient.]

Page 10:

Stochastic vs Batch

• Intuitive argument: if only taking simple gradient steps, better to be stochastic

• To get $L_S(w) \le L_S(\hat w) + \epsilon_{opt}$:

              #iter                         cost/iter    runtime
  Batch GD    $B^2G^2/\epsilon_{opt}^2$     $md$         $md \cdot B^2G^2/\epsilon_{opt}^2$
  SGD         $B^2G^2/\epsilon_{opt}^2$     $d$          $d \cdot B^2G^2/\epsilon_{opt}^2$

• Comparison to methods with a $\log(1/\epsilon_{opt})$ dependence that use the structure of $L_S(w)$ (not only local access)?
• How small should $\epsilon_{opt}$ be?
• What about $L(w)$, which is what we really care about?

Page 11:

Overall Analysis of $L_{\mathcal D}(w)$

• Recall for ERM: $L_{\mathcal D}(\hat w) \le L_{\mathcal D}(w^*) + 2\sup_w \left|L_{\mathcal D}(w) - L_S(w)\right|$

• For an $\epsilon_{opt}$-suboptimal ERM $\bar w$:

$L_{\mathcal D}(\bar w) \le L_{\mathcal D}(w^*) + 2\sup_w \left|L_{\mathcal D}(w) - L_S(w)\right| + \left(L_S(\bar w) - L_S(\hat w)\right)$

• Take $\epsilon_{opt} \approx \epsilon_{est}$, i.e. #iterations $T \approx$ sample size $m$

• To ensure $L_{\mathcal D}(w) \le L_{\mathcal D}(w^*) + \epsilon$: $\quad T, m = O\!\left(\frac{B^2G^2}{\epsilon^2}\right)$

where $\hat w = \arg\min_{\|w\|\le B} L_S(w)$, $\;w^* = \arg\min_{\|w\|\le B} L_{\mathcal D}(w)$, and (beyond the approximation error $\epsilon_{approx}$)
$\epsilon_{est} \le 2\sqrt{\frac{B^2G^2}{m}}, \qquad \epsilon_{opt} \le \sqrt{\frac{B^2G^2}{T}}$

Page 12:

Direct Online-to-Batch: SGD on $L_{\mathcal D}(w)$

$\min_w L_{\mathcal D}(w)$

Use $g_t = \nabla_w \mathrm{hinge}(y\langle w, \phi(x)\rangle)$ for a random $(x, y) \sim \mathcal D$, so $\mathbb{E}[g_t] = \nabla L_{\mathcal D}(w)$:

Initialize $w^{(0)} = 0$
At iteration $t$:
• Draw $(x_t, y_t) \sim \mathcal D$
• If $y_t \langle w^{(t)}, \phi(x_t)\rangle < 1$: $\;w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_t \phi(x_t)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
Return $\bar w^{(T)} = \frac{1}{T}\sum_{t=1}^T w^{(t)}$

$L_{\mathcal D}(\bar w^{(T)}) \le \inf_{\|w\|_2\le B} L_{\mathcal D}(w) + \sqrt{\frac{B^2G^2}{T}} \qquad m = T = O\!\left(\frac{B^2G^2}{\epsilon^2}\right)$

Page 13:

SGD for Machine Learning

Direct SA (online2batch) approach: $\min_w L(w)$
  Initialize $w^{(0)} = 0$
  At iteration $t$:
  • Draw $(x_t, y_t) \sim \mathcal D$
  • If $y_t \langle w^{(t)}, \phi(x_t)\rangle < 1$: $\;w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_t \phi(x_t)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
  Return $\bar w^{(T)} = \frac{1}{T}\sum_{t=1}^T w^{(t)}$

SGD on ERM: $\min_{\|w\|_2 \le B} L_S(w)$
  Draw $(x_1, y_1), \ldots, (x_m, y_m) \sim \mathcal D$
  Initialize $w^{(0)} = 0$
  At iteration $t$:
  • Pick $i \in 1\ldots m$ at random
  • If $y_i \langle w^{(t)}, \phi(x_i)\rangle < 1$: $\;w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_i \phi(x_i)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
  • $w^{(t+1)} \leftarrow \mathrm{proj}\!\left(w^{(t+1)}\ \text{onto}\ \|w\| \le B\right)$
  Return $\bar w^{(T)} = \frac{1}{T}\sum_{t=1}^T w^{(t)}$

Direct SA:
• Fresh sample at each iteration, $m = T$
• No need to project nor require $\|w\| \le B$
• Implicit regularization via early stopping

SGD on ERM:
• Can have $T > m$ iterations
• Need to project to $\|w\| \le B$
• Explicit regularization via $\|w\|$

Page 14:

SGD for Machine Learning

(Same two algorithms as on the previous page: the direct SA approach for $\min_w L(w)$, and SGD on ERM for $\min_{\|w\|_2\le B} L_S(w)$, both with $\eta_t = \sqrt{B^2/(G^2 t)}$ and $w^* = \arg\min_{\|w\|\le B} L(w)$.)

Direct SA: $\quad L(\bar w^{(T)}) \le L(w^*) + \sqrt{\frac{B^2G^2}{T}}$

SGD on ERM: $\quad L(\bar w^{(T)}) \le L(w^*) + 2\sqrt{\frac{B^2G^2}{m}} + \sqrt{\frac{B^2G^2}{T}}$

Page 15:

SGD for Machine Learning

Direct SA (online2batch) approach, $\min_w L(w)$: as before (a fresh $(x_t, y_t) \sim \mathcal D$ at each iteration, no projection or shrinkage).

SGD on RERM: $\min_w L_S(w) + \frac{\lambda}{2}\|w\|^2$
  Draw $(x_1, y_1), \ldots, (x_m, y_m) \sim \mathcal D$
  Initialize $w^{(0)} = 0$
  At iteration $t$:
  • Pick $i \in 1\ldots m$ at random
  • If $y_i \langle w^{(t)}, \phi(x_i)\rangle < 1$: $\;w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_i \phi(x_i)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
  • $w^{(t+1)} \leftarrow w^{(t+1)} - \lambda w^{(t)}$ (shrinkage step for the regularizer)
  Return $\bar w^{(T)} = \frac{1}{T}\sum_{t=1}^T w^{(t)}$

Direct SA:
• Fresh sample at each iteration, $m = T$
• No need to shrink $w$
• Implicit regularization via early stopping

SGD on RERM:
• Can have $T > m$ iterations
• Need to shrink $w$
• Explicit regularization via $\|w\|$

Page 16:

SGD vs ERM

[Figure: sketch of the hypothesis space showing $w^{(0)}$, the SGD average $\bar w^{(m)}$, the ERM $\hat w = \arg\min_{\|w\|\le B} L_S(w)$, the population optimum $w^* = \arg\min_{\|w\|\le B} L(w)$, a radius of $O\!\left(\sqrt{B^2G^2/m}\right)$, the unconstrained $\arg\min_w L_S(w)$ (overfit), and the effect of running $T > m$ iterations with projection.]

Page 17:

Mixed Approach: SGD on ERM

[Plot: test misclassification error (0.052 to 0.058) vs. number of SGD iterations $k$ (0 to 3,000,000) for training set sizes m = 300,000, 400,000, 500,000. Reuters RCV1 data, CCAT task.]

• The mixed approach (reusing examples) can make sense

Page 18:

Mixed Approach: SGD on ERM

[Same plot as the previous page: test misclassification error vs. SGD iterations for m = 300,000, 400,000, 500,000; Reuters RCV1, CCAT task.]

• The mixed approach (reusing examples) can make sense
• Still: fresh samples are better
  ⇒ With a larger training set, generalization error can be reduced faster
  ⇒ A larger training set means less runtime to reach a target generalization error

Page 19:

Online Optimization vs Stochastic Approximation

• In both the Online Setting and Stochastic Approximation:
  • Receive samples sequentially
  • Update $w$ after each sample

• But, in the Online Setting:
  • The objective is empirical regret, i.e. behavior on the observed instances
  • $z_t$ chosen adversarially (no distribution involved)

• As opposed to Stochastic Approximation:
  • The objective is $\mathbb{E}[\ell(w, z)]$, i.e. behavior on “future” samples
  • i.i.d. samples $z_t$

• Stochastic Approximation is a computational approach; Online Learning is an analysis setup
  • E.g. “Follow the Leader”

Page 20:

Part II: Realizable vs Agnostic Rates

Page 21:

Realizable vs Agnostic Rates

• Recall for finite hypothesis classes:

$L_{\mathcal D}(\hat h) \le \inf_{h\in\mathcal H} L_{\mathcal D}(h) + 2\sqrt{\frac{\log|\mathcal H| + \log\frac{2}{\delta}}{2m}} \qquad m = O\!\left(\frac{\log|\mathcal H|}{\epsilon^2}\right)$

• But in the realizable case, if $\inf_{h\in\mathcal H} L_{\mathcal D}(h) = 0$:

$L_{\mathcal D}(\hat h) \le \frac{\log|\mathcal H| + \log\frac{1}{\delta}}{m} \qquad m = O\!\left(\frac{\log|\mathcal H|}{\epsilon}\right)$

• Also for VC classes: in general $m = O\!\left(\frac{\mathrm{VCdim}}{\epsilon^2}\right)$, while in the realizable case $m = O\!\left(\frac{\mathrm{VCdim}\cdot\log(1/\epsilon)}{\epsilon}\right)$

• What happens if $L^* = \inf_{h\in\mathcal H} L_{\mathcal D}(h)$ is low, but not zero?

Page 22:

Estimating the Bias of a Coin

Additive (Hoeffding-style) bound: $\quad |p - \hat p| \le \sqrt{\frac{\log\frac{2}{\delta}}{2m}}$

Multiplicative (Bernstein-style) bound: $\quad |p - \hat p| \le \sqrt{\frac{2\,p\log\frac{2}{\delta}}{m}} + \frac{2\log\frac{2}{\delta}}{3m}$
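A small numerical comparison of the two widths above, illustrating why the multiplicative bound is the relevant one when the bias $p$ is small (plain NumPy; the true $p$ is plugged in for illustration only):

```python
import numpy as np

def hoeffding_width(m, delta):
    """Additive bound: sqrt(log(2/delta) / (2m))."""
    return np.sqrt(np.log(2 / delta) / (2 * m))

def bernstein_width(p, m, delta):
    """Multiplicative bound: sqrt(2 p log(2/delta) / m) + 2 log(2/delta) / (3m)."""
    return np.sqrt(2 * p * np.log(2 / delta) / m) + 2 * np.log(2 / delta) / (3 * m)

if __name__ == "__main__":
    m, delta = 10_000, 0.05
    for p in (0.5, 0.05, 0.001):
        print(f"p={p:<6}  Hoeffding: {hoeffding_width(m, delta):.4f}  "
              f"Bernstein: {bernstein_width(p, m, delta):.4f}")
    # For p = 0.5 the two widths are comparable; for small p the second shrinks like sqrt(p).
```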

Page 23:

Optimistic VC bound (aka $L^*$-bound, multiplicative bound)

$\hat h = \arg\min_{h\in\mathcal H} L_S(h) \qquad L^* = \inf_{h\in\mathcal H} L(h)$

• For a hypothesis class with VC-dim $D$, w.p. $1-\delta$ over $m$ samples:

$L(\hat h) \le L^* + 2\sqrt{L^*\,\frac{D\log\frac{2em}{D} + \log\frac{2}{\delta}}{m}} + 4\,\frac{D\log\frac{2em}{D} + \log\frac{2}{\delta}}{m} = \inf_\alpha\left[(1+\alpha)\,L^* + \left(1 + \frac{1}{\alpha}\right)\frac{4\left(D\log\frac{2em}{D} + \log\frac{2}{\delta}\right)}{m}\right]$

• Sample complexity to get $L(\hat h) \le L^* + \epsilon$:

$m(\epsilon) = O\!\left(\frac{D}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\,\log\frac{1}{\epsilon}\right)$

• Extends to bounded real-valued losses in terms of the VC-subgraph dimension
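One way to read the sample complexity off the $\inf_\alpha$ form above (a sketch; write $\tilde D = D\log\frac{2em}{D} + \log\frac{2}{\delta}$ and assume $L^* > 0$):

```latex
Choosing $\alpha = \tfrac{\epsilon}{2L^*}$ in
$(1+\alpha)L^* + \bigl(1+\tfrac{1}{\alpha}\bigr)\tfrac{4\tilde D}{m}$
bounds the excess error by
\[
  \frac{\epsilon}{2} \;+\; \Bigl(1 + \frac{2L^*}{\epsilon}\Bigr)\frac{4\tilde D}{m},
\]
which is at most $\epsilon$ once
\[
  m \;\ge\; \frac{8\tilde D}{\epsilon}\Bigl(1 + \frac{2L^*}{\epsilon}\Bigr)
  \;=\; O\!\left(\frac{D}{\epsilon}\cdot\frac{L^*+\epsilon}{\epsilon}\,\log\frac{1}{\epsilon}\right).
\]
```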

Page 24:

From Parametric to Scale-Sensitive

$L(h) = \mathbb{E}[\mathrm{loss}(h(x), y)], \qquad h \in \mathcal H$

• Instead of VC-dim or VC-subgraph-dim (≈ #params), rely on a metric scale to control complexity, e.g.:

$\mathcal H = \{\, w \mapsto \langle w, x\rangle \;:\; \|w\|_2 \le B \,\}$

• Learning depends on:
  • Metric complexity measures: fat-shattering dimension, covering numbers, Rademacher complexity
  • Scale sensitivity of the loss (bound on derivatives or “margin”)

• For $\mathcal H$ with Rademacher complexity $\mathcal R_m \le \sqrt{\frac{R}{m}}$, and $|\mathrm{loss}'| \le G$:

$L(\hat h) \le L^* + 2G\mathcal R_m + \sqrt{\frac{\log\frac{2}{\delta}}{2m}} \;\le\; L^* + O\!\left(\sqrt{\frac{G^2 R + \log\frac{2}{\delta}}{2m}}\right)$

e.g. for the class above, $\mathcal R_m(\mathcal H) = \sqrt{\frac{B^2 \sup\|x\|^2}{m}}$, i.e. $R = B^2\sup\|x\|^2$

Page 25:

Non-Parametric Optimistic Rate for Smooth Loss

• Theorem: for any $\mathcal H$ with (worst case) Rademacher complexity $\mathcal R_m(\mathcal H)$, and any smooth loss with $|\mathrm{loss}''| \le H$, $\mathrm{loss} \le b$, w.p. $1-\delta$ over $m$ samples:

• Sample complexity:

Page 26:

Parametric vs Non-Parametric

(Excess-error rates by loss type (rows) and hypothesis class type (columns); min-max tight up to poly-log factors.)

                                                      Parametric: $\dim(\mathcal H) \le D$, $|h| \le 1$      Scale-sensitive: $\mathcal R_m(\mathcal H) \le \sqrt{R/m}$
Lipschitz: $|\phi'| \le G$ (e.g. hinge, $\ell_1$)     $\frac{GD}{m} + \sqrt{\frac{L^* GD}{m}}$               $\sqrt{\frac{G^2 R}{m}}$
Smooth: $|\phi''| \le H$ (e.g. logistic, Huber,       $\frac{HD}{m} + \sqrt{\frac{L^* HD}{m}}$               $\frac{HR}{m} + \sqrt{\frac{L^* HR}{m}}$
  smoothed hinge)
Smooth & strongly convex: $\mu \le \phi'' \le H$      $\frac{H}{\mu}\cdot\frac{HD}{m}$                       $\frac{HR}{m} + \sqrt{\frac{L^* HR}{m}}$
  (e.g. square loss)

Page 27:

Optimistic Learning Guarantees

$L(\hat h) \le (1+\alpha)\,L^* + \left(1 + \frac{1}{\alpha}\right)\tilde O\!\left(\frac{R}{m}\right) \qquad m(\epsilon) \le \tilde O\!\left(\frac{R}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\right)$

Where such guarantees hold:
• Parametric classes
• Scale-sensitive classes with smooth loss
• Perceptron guarantee
• Margin bounds
• Stability-based guarantees with smooth loss
• Online learning/optimization with smooth loss

Where they do not:
× Non-parametric (scale-sensitive) classes with non-smooth loss
× Online learning/optimization with non-smooth loss

Page 28:

Why Optimistic Guarantees?

$L(\hat h) \le (1+\alpha)\,L^* + \left(1 + \frac{1}{\alpha}\right)\tilde O\!\left(\frac{R}{m}\right) \qquad m(\epsilon) \le \tilde O\!\left(\frac{R}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\right)$

• The optimistic regime is typically the relevant regime:
  • Approximation error $L^*$ ≈ estimation error $\epsilon$
  • If $\epsilon \ll L^*$, better to spend energy on lowering the approximation error (use a more complex class)
• Often important in highlighting true phenomena

Page 29:

Part III: Nearest Neighbor Classification

Page 30:

The Nearest Neighbor Classifier

• Training sample $S = (x_1, y_1), \ldots, (x_m, y_m)$
• Want to predict the label of a new point $x$
• The Nearest Neighbor Rule:
  • Find the closest training point: $i = \arg\min_i \rho(x, x_i)$
  • Predict the label of $x$ as $y_i$

[Figure: labeled training points and a new query point marked “?”.]

Page 31:

The Nearest Neighbor Classifier (cont.)

• Training sample $S = (x_1, y_1), \ldots, (x_m, y_m)$; predict the label of a new point $x$ as the label $y_i$ of the closest training point $i = \arg\min_i \rho(x, x_i)$
• As a learning rule: $\mathrm{NN}(S) = h$ where $h(x) = y_{\arg\min_i \rho(x, x_i)}$

Page 32:

Where is the Bias Hiding?

[Figure: the nearest neighbor of a point depends on the choice of distance and representation, e.g. $\|x - x'\|$ in different norms vs. $\|\tilde\phi(x) - \tilde\phi(x')\|$ with $\tilde\phi(x) = (5x[1], x[2])$.]

• What is the right “distance” between images? Between sound waves? Between sentences?
• Option 1: $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$
  • What representation $\phi(x)$?
• Find the closest training point: $i = \arg\min_i \rho(x, x_i)$
• Predict the label of $x$ as $y_i$

Page 33:

Where is the Bias Hiding? (cont.)

• Maybe a different distance? $\|\phi(x) - \phi(x')\|_1$? $\|\phi(x) - \phi(x')\|_\infty$? $\sin(\angle(\phi(x), \phi(x')))$? $\mathrm{KL}(\phi(x)\,\|\,\phi(x'))$?

Page 34:

Where is the Bias Hiding? (cont.)

• Option 2: a special-purpose distance measure on $x$
  • E.g. edit distance, deformation measure, etc.

Page 35:

Nearest Neighbor Learning Guarantee

• Optimal (Bayes) predictor: $h^* = \arg\min_h L_{\mathcal D}(h)$, i.e. $h^*(x) = \begin{cases} +1, & \eta(x) > 0.5 \\ -1, & \eta(x) < 0.5 \end{cases}$ where $\eta(x) = P_{\mathcal D}(y = 1 \mid x)$

• For the NN rule with $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$, $\phi: \mathcal X \to [0,1]^d$, and $|\eta(x) - \eta(x')| \le c_{\mathcal D}\cdot\rho(x, x')$:

$\mathbb{E}_{S\sim\mathcal D^m}\!\left[L(\mathrm{NN}(S))\right] \le 2L(h^*) + \frac{4\,c_{\mathcal D}\sqrt{d}}{m^{1/(d+1)}}$

Page 36:

Data Fit / Complexity Tradeoff

• k-Nearest Neighbor: predict according to the majority among the $k$ closest points from $S$.

(Recall the 1-NN guarantee: $\mathbb{E}_{S\sim\mathcal D^m}\!\left[L(\mathrm{NN}(S))\right] \le 2L(h^*) + \frac{4\,c_{\mathcal D}\sqrt{d}}{m^{1/(d+1)}}$.)
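The corresponding k-NN predictor, as a sketch (labels assumed to be ±1, ties broken toward +1; brute-force distances, nothing from the slides beyond the rule itself):

```python
import numpy as np

def knn_predict(x, X_train, y_train, k=5):
    """k-NN rule: majority vote among the k nearest training points (labels in {-1, +1})."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points in S
    return np.sign(np.sum(y_train[nearest]) + 1e-12)   # majority vote, ties go to +1
```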

Page 37:

k-Nearest Neighbor: Data Fit / Complexity Tradeoff

[Figure: k-NN decision boundaries on a training sample $S$ for k = 1, 5, 12, 50, 100, 200, compared with the optimal predictor $h^*$.]

Page 38:

k-Nearest Neighbor Guarantee

• For k-NN with $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$, $\phi: \mathcal X \to [0,1]^d$, and $|\eta(x) - \eta(x')| \le c_{\mathcal D}\cdot\rho(x, x')$:

$\mathbb{E}_{S\sim\mathcal D^m}\!\left[L(\mathrm{NN}_k(S))\right] \le \left(1 + \sqrt{\frac{8}{k}}\right) L(h^*) + \frac{6\,c_{\mathcal D}\sqrt{d} + k}{m^{1/(d+1)}}$

• Should increase $k$ with the sample size $m$
• The above theory suggests $k_m \propto L(h^*)^{2/3}\cdot m^{\frac{2}{3(d+1)}}$

โ€ข โ€œUniversalโ€ Learning: for any โ€œsmoothโ€ ๐’Ÿ and representation ๐œ™ โ‹… (with continuous ๐‘ƒ(๐‘ฆ|๐œ™ ๐‘ฅ )), if we increase k slowly enough, we will eventually converge to optimal ๐ฟ(โ„Žโˆ—)

โ€ข Very non-uniform: sample complexity depends not only on โ„Žโˆ—, but also on ๐’Ÿ

๐œ‚ ๐‘ฅ โˆ’ ๐œ‚ ๐‘ฅโ€ฒ โ‰ค ๐‘๐’Ÿ โ‹… ๐œŒ(๐‘ฅ, ๐‘ฅโ€ฒ)

Page 39:

Uniform and Non-Uniform Learnability

• Definition: A hypothesis class $\mathcal H$ is agnostically PAC-learnable if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\exists m(\epsilon, \delta)$, $\forall \mathcal D$, $\forall h$, $\forall^{\delta}_{S\sim\mathcal D^{m(\epsilon,\delta)}}$:
$L_{\mathcal D}(A(S)) \le L_{\mathcal D}(h) + \epsilon$

• Definition: A hypothesis class $\mathcal H$ is non-uniformly learnable if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\forall h$, $\exists m(\epsilon, \delta, h)$, $\forall \mathcal D$, $\forall^{\delta}_{S\sim\mathcal D^{m(\epsilon,\delta,h)}}$:
$L_{\mathcal D}(A(S)) \le L_{\mathcal D}(h) + \epsilon$

• Definition: A hypothesis class $\mathcal H$ is “consistently learnable” if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\forall h$, $\forall \mathcal D$, $\exists m(\epsilon, \delta, h, \mathcal D)$, $\forall^{\delta}_{S\sim\mathcal D^{m(\epsilon,\delta,h,\mathcal D)}}$:
$L_{\mathcal D}(A(S)) \le L_{\mathcal D}(h) + \epsilon$

Realizable/Optimistic guarantees: $\mathcal D$ dependence through $L_{\mathcal D}(h)$