RANDOM TOPICS: stochastic gradient descent & Monte Carlo


Sep 30, 2020

Transcript
Page 1: RANDOM TOPICS - University of Maryland (tomg/course/cmsc764/L12_stochastic.pdf)

RANDOM TOPICS

stochastic gradient descent & Monte Carlo

Page 2

MASSIVE MODEL FITTING

least squares:  minimize $\frac{1}{2}\|Ax-b\|^2 = \sum_i \frac{1}{2}(a_i x - b_i)^2$

SVM:  minimize $\frac{1}{2}\|w\|^2 + h(LDw) = \frac{1}{2}\|w\|^2 + \sum_i h(l_i d_i w)$

low-rank factorization:  minimize $\frac{1}{2}\|D - XY\|^2 = \sum_{ij} \frac{1}{2}(d_{ij} - x_i y_j)^2$

general form, with $n$ big (over 100K):  minimize $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$

Page 3

THE BIG IDEA

  minimize $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$

True gradient:  $\nabla f = \frac{1}{n}\sum_{i=1}^n \nabla f_i(x)$

Idea: choose a subset $\Omega \subset [1, N]$ of the data (usually just one sample):

  $\nabla f \approx g_\Omega = \frac{1}{|\Omega|}\sum_{i\in\Omega} \nabla f_i(x)$

Page 4

INFINITE SAMPLE VS FINITE SAMPLE

infinite sample:  minimize $f(x) = E_s[f_s(x)] = \int_s f_s(x)\,p(s)\,ds$

finite sample:  minimize $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$

We can solve the finite sample problem to high accuracy... but true accuracy is limited by the sample size.

Page 5

SGD

Repeat: select data, compute the gradient, update.

  $g^k = \frac{1}{M}\sum_{i=1}^M \nabla f(x, d_i)$, approximated from one sample, e.g. $g^k \approx \nabla f(x, d_8)$ or $g^k \approx \nabla f(x, d_{12})$

  $x^{k+1} = x^k - \tau_k g^k$

Far from the solution, the gradient error is big but the solution improves; near the solution, the error is small but the solution gets worse.
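The select/compute/update loop above can be sketched as follows, a minimal one-sample SGD for the least squares problem with the decreasing stepsize $\tau_k = a/(b_0 + k)$ from a later slide; the function name, data, and tuning constants are illustrative choices, not from the slides.

```python
import numpy as np

def sgd_least_squares(A, b, steps=10000, a=1.0, b0=10.0, seed=0):
    """One-sample SGD for f(x) = (1/2)||Ax - b||^2 with stepsize tau_k = a/(b0 + k).

    Each term is f_i(x) = (1/2)(a_i . x - b_i)^2, so the stochastic gradient
    from a single sample i is g_k = a_i * (a_i . x - b_i).
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for k in range(steps):
        i = rng.integers(n)                 # select data: one random sample
        g = A[i] * (A[i] @ x - b[i])        # compute (stochastic) gradient
        x -= (a / (b0 + k)) * g             # update: x_{k+1} = x_k - tau_k g_k
    return x

# Usage: recover a planted solution from noiseless measurements
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 5))
x_true = rng.standard_normal(5)
b = A @ x_true
x_hat = sgd_least_squares(A, b)
```

With noiseless data the gradient noise vanishes at the solution, so even this simple scheme converges well.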

Page 6

SGD

Select data, compute the gradient $g^k \approx \nabla f(x, d_8)$, update $x^{k+1} = x^k - \tau_k g^k$.

The error must decrease as we approach the solution.

Classical solution: shrink the stepsize, $\lim_{k\to\infty} \tau_k = 0$. Slow convergence: $O(1/\sqrt{k})$.

Variance reduction: correct the error in the gradient approximations.

Page 7

EXAMPLE: WITHOUT DECREASING STEPSIZE

  minimize $\frac{1}{2}\|Ax - b\|^2$

[Plot: convergence of the inexact gradient method.] Why does this happen? What's happening?

Page 8

DECREASING STEP SIZE

  $x^{k+1} = x^k - \tau_k \nabla f_k(x)$, with $\tau_k = \frac{a}{b + k}$ going from a big stepsize to a small stepsize

Used for strongly convex problems. This is (almost) equivalent to choosing a larger sample. Why?

Page 9

AVERAGING

  $x^{k+1} = x^k - \tau_k \nabla f_k(x)$, with $\tau_k = \frac{a}{\sqrt{k} + b}$

"Ergodic" averaging:  $\bar{x}^{k+1} = \frac{1}{k+1}\sum_i x^i$

Used for weakly convex problems. Does this limit the convergence rate? Why is this bad?

Page 10

AVERAGING

Ergodic averaging:  $\bar{x}^{k+1} = \frac{1}{k+1}\sum_i x^i$

Compute without storage:  $\bar{x}^{k+1} = \frac{k}{k+1}\,\bar{x}^k + \frac{1}{k+1}\,x^{k+1}$

Short memory version:  $\bar{x}^{k+1} = \frac{k}{k+\eta}\,\bar{x}^k + \frac{\eta}{k+\eta}\,x^{k+1}$, with $\eta \ge 1$: trade off variance for bias.
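Both running averages above can be computed in one pass with no stored history; a minimal sketch (function name and toy iterates are mine):

```python
import numpy as np

def running_average(xs, eta=1.0):
    """Streaming average with no storage of past iterates:

        xbar_{k+1} = k/(k+eta) * xbar_k + eta/(k+eta) * x_{k+1}

    eta = 1 recovers plain ergodic averaging (xbar is the mean of all
    iterates); eta > 1 is the short-memory version, which weights recent
    iterates more heavily, trading variance reduction for bias toward the
    current iterate.
    """
    xbar = np.zeros_like(np.asarray(xs[0], dtype=float))
    for k, x in enumerate(xs):
        xbar = (k / (k + eta)) * xbar + (eta / (k + eta)) * x
    return xbar

xs = [np.array([float(i)]) for i in range(1, 6)]   # iterates 1, 2, 3, 4, 5
avg = running_average(xs, eta=1.0)                 # the plain mean, 3.0
avg_short = running_average(xs, eta=2.0)           # leans toward later iterates
```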

Page 11

KNOWN RATES: WEAKLY CONVEX

Theorem. Suppose $f$ is convex, $\|\nabla f(x)\| \le G$, and the diameter of $\mathrm{dom}(f)$ is less than $D$. If we use the stepsize $\tau_k = \frac{c}{\sqrt{k}}$, then

  $E[f(x^k) - f^\star] \le \left(\frac{D^2}{c} + cG^2\right)\frac{2 + \log(k)}{\sqrt{k}}$

See Shamir and Zhang, ICML '13; Rakhlin, Shamir, Sridharan, CoRR '11.

Page 12

KNOWN RATES: STRONGLY CONVEX

Theorem (Shamir and Zhang, ICML '13). Suppose $f$ is strongly convex with parameter $m$, and that $\|\nabla f(x)\| \le G$. If you use stepsize $\tau_k = 1/(mk)$ and the limited memory averaging

  $\bar{x}^{k+1} = \frac{k}{k+\eta}\,\bar{x}^k + \frac{\eta}{k+\eta}\,x^{k+1}$

with $\eta \ge 1$, then

  $E[f(\bar{x}^k) - f^\star] \le 58\,(1 + \eta/k)\left(\eta(\eta+1) + \frac{(\eta + .5)^3(1 + \log k)}{k}\right)\frac{G^2}{mk}$

Page 13

EXAMPLE: SVM

PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM

  minimize $\frac{1}{2}\|w\|^2 + C\,h(Aw)$

  $\nabla h(x) = \begin{cases} -1, & x < 1 \\ 0, & \text{otherwise} \end{cases}$

Note: this is a "subgradient" descent method.

Page 14

PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM

  minimize $\sum_i \frac{\lambda}{2}\|w\|^2 + h(a_i w)$,  with stepsize $\tau_k = \frac{1}{\lambda k}$

While "not converged":
  If $a_k^T w^k < 1$:  $w^{k+1} = w^k - \tau_k(\lambda w^k - a_k)$
  If $a_k^T w^k \ge 1$:  $w^{k+1} = w^k - \tau_k \lambda w^k$

Averaging, $\bar{w}^{k+1} = \frac{1}{k+1}\sum_{i=1}^{k+1} w^i$, is not used in practice.
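The Pegasos update above can be sketched as follows, assuming (as in the slides' notation) that each row $a_i$ already has its label folded in; the toy data and constants are illustrative, not from the slides.

```python
import numpy as np

def pegasos(A, lam=0.1, steps=5000, seed=0):
    """Pegasos-style SGD for min_w (lam/2)||w||^2 + avg_i max(0, 1 - a_i . w),
    where each row a_i = y_i * x_i has the label folded in.
    Uses the stepsize tau_k = 1/(lam * k) from the slides.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    for k in range(1, steps + 1):
        tau = 1.0 / (lam * k)
        i = rng.integers(n)
        if A[i] @ w < 1:                    # margin violated: hinge is active
            w = w - tau * (lam * w - A[i])  # subgradient lam*w - a_i
        else:
            w = w - tau * lam * w           # hinge flat: only the regularizer
    return w

# Usage on a linearly separable toy problem
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
A = X * y[:, None]          # fold labels into the data rows
w = pegasos(A)
```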

Page 15

PEGASOS

[Plot: convergence of the gradient method vs. stochastic methods on the finite sample problem.]

Page 16

PEGASOS

[Plot: classification error (CV) for the gradient method vs. stochastic methods.]

You hope to converge before SGD gets slow!

Page 17

ALLREDUCE

...but if you don't: use SGD as a warm start for an iterative method.

Agarwal. "Effective Terascale Linear Learning." '12

Page 18

SGD

Select data, compute the gradient $g^k \approx \nabla f(x, d_8)$, update $x^{k+1} = x^k - \tau_k g^k$.

The error must decrease as we approach the solution.

The variance reduction solution: make the gradient more accurate, preserve fast convergence.

Page 19

SGD + VARIANCE REDUCTION

Select data, compute the gradient $g^k \approx \nabla f(x, d_8) - \text{error}_8$, update $x^{k+1} = x^k - \tau_k g^k$.

The error must decrease as we approach the solution. The variance reduction solution: make the gradient more accurate, preserve fast convergence.

Page 20

VR APPROACHES

SVRG (Johnson, Zhang, 2013): the original; requires full gradient computations.

SAG (Le Roux, Schmidt, Bach, 2013) and SAGA (Defazio, Bach, Lacoste-Julien, 2014): avoid full gradient computations.

CentralVR: a VR approach targeting distributed ML. "Efficient Distributed SGD with Variance Reduction," ICDM 2016 (shameless self-promotion).

...and many more.

Page 21

SVRG

First epoch: build a gradient tableau

  $\nabla f_1(x_m^1),\ \nabla f_2(x_m^2),\ \nabla f_3(x_m^3),\ \ldots,\ \nabla f_{n-1}(x_m^{n-1}),\ \nabla f_n(x_m^n)$

Page 22

SVRG

Average the gradient tableau over the last epoch:

  $g_m = \frac{1}{n}\sum_{i=1}^n \nabla f_i(x_m^i)$

Page 23

SVRG

When sample 3 is drawn, compute the new gradient $\nabla f_3(x_{m+1}^3)$ and subtract the tableau entry's error:

  corrected gradient $= \nabla f_3(x_{m+1}^3) - \underbrace{(\nabla f_3(x_m^3) - g_m)}_{\text{error}}$

where $g_m = \frac{1}{n}\sum_{i=1}^n \nabla f_i(x_m^i)$ is the average gradient over the last epoch.
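The corrected-gradient idea can be sketched as follows. This is the textbook form of SVRG (full gradient at a snapshot point once per epoch), a sketch of the idea rather than the exact tableau bookkeeping drawn on the slides; the problem, function name, and constants are illustrative.

```python
import numpy as np

def svrg_least_squares(A, b, epochs=50, tau=0.02, seed=0):
    """Textbook SVRG (Johnson & Zhang '13) on f(x) = (1/n) sum_i (1/2)(a_i.x - b_i)^2.

    Each epoch: take a snapshot x_tilde and compute its full gradient once,
    then run n inner steps with the corrected gradient
        g_k = grad f_i(x) - grad f_i(x_tilde) + full_grad.
    The correction has zero mean, so g_k is unbiased, and its variance shrinks
    as x and x_tilde approach the solution -- no decreasing stepsize needed.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs):
        x_tilde = x.copy()
        full_grad = A.T @ (A @ x_tilde - b) / n       # one full gradient per epoch
        for _ in range(n):
            i = rng.integers(n)
            gi = A[i] * (A[i] @ x - b[i])             # grad f_i at current point
            gi_snap = A[i] * (A[i] @ x_tilde - b[i])  # grad f_i at the snapshot
            x -= tau * (gi - gi_snap + full_grad)     # corrected gradient step
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((200, 5))
x_true = rng.standard_normal(5)
b = A @ x_true
x_hat = svrg_least_squares(A, b)
```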

Page 24

SVRG GETS BACK DETERMINISTIC RATES

Theorem (Johnson and Zhang, 2013). Suppose the objective terms are $m$-strongly convex with $L$-Lipschitz gradients. If the learning rate is small enough, then

  $E\,f(w^k) - f^\star \le c^k\,[f(w^0) - f^\star]$  for some $c < 1$.

Page 25

MONTE CARLO METHODS

Methods that involve randomly sampling a distribution.

[Photo: the Monte Carlo casino. Monaco?]

Page 26

BAYESIAN LEARNING

Goal: estimate parameters in a model, data = M(parameters). We measure the data and estimate the parameters.

Ingredients:
  prior: probability distribution of the unknown parameters
  likelihood: probability of the data given the (unknown) parameters

[Portrait: Thomas Bayes]

Page 27

BAYES RULE

  $P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$

  $P(\text{parameters}|\text{data}) = \frac{P(\text{data}|\text{parameters})\,P(\text{parameters})}{P(\text{data})}$

  $P(\text{parameters}|\text{data}) \propto P(\text{data}|\text{parameters})\,P(\text{parameters})$   (likelihood $\times$ prior)

We really care about this "posterior distribution".

Page 28

EXAMPLE: LOGISTIC REGRESSION

  $P(\text{parameters}|\text{data}) \propto P(\text{data}|\text{parameters})\,P(\text{parameters})$

Likelihood:

  $P(y_i = 1 \mid x_i) = \frac{\exp(x_i^T w)}{1 + \exp(x_i^T w)}$

  $P(y, X \mid w) = \prod_i \frac{\exp(y_i \cdot x_i^T w)}{1 + \exp(y_i \cdot x_i^T w)}$

Page 29

EXAMPLE: LOGISTIC REGRESSION

  $P(\text{parameters}|\text{data}) \propto P(\text{data}|\text{parameters})\,P(\text{parameters})$

Example prior:  $\log P(w) = -|w|$

  $P(w \mid y, X) = \left(\prod_i \frac{\exp(y_i \cdot x_i^T w)}{1 + \exp(y_i \cdot x_i^T w)}\right) P(w)$

  $\log P(w \mid y, X) = \log P(w) + \sum_i -\log(1 + \exp(-y_i \cdot x_i^T w))$
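The log posterior above is exactly what a sampler needs to evaluate; a minimal sketch (the scale parameter `lam` generalizes the slides' $\log P(w) = -|w|$, and the tiny dataset is mine):

```python
import numpy as np

def log_posterior(w, X, y, lam=1.0):
    """Unnormalized log posterior for logistic regression with the slides'
    Laplace-type prior log P(w) = -|w| (scaled here by lam):

        log P(w | y, X) = -lam * ||w||_1 + sum_i -log(1 + exp(-y_i * x_i . w))

    up to an additive constant. np.logaddexp(0, z) = log(1 + e^z) keeps the
    likelihood term numerically stable; labels y_i are in {-1, +1}.
    """
    margins = y * (X @ w)
    return -lam * np.sum(np.abs(w)) - np.sum(np.logaddexp(0.0, -margins))

# A sampler like Metropolis-Hastings only ever needs this unnormalized value
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
val = log_posterior(np.zeros(2), X, y)   # at w = 0 each likelihood factor is 1/2
```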

Page 30

BAYESIANS OFFER NO SOLUTIONS

The posterior $P(w \mid D, l)$ is a distribution over the solution space; optimization returns a single solution. Monte Carlo methods randomly sample this solution space.

[Portraits: Bayes, Lagrange]

Page 31

WHY MONTE CARLO? You need to know more than just a max/minimizer: "error bars" around the optimization solution.

  $E(w) = \int w\,P(w)\,dw$

  $\mathrm{Var}(w) = E[(w - \mu)(w - \mu)^T] = \int w w^T P(w)\,dw - \mu\mu^T$

Page 32

EXAMPLE: POKER BOTS

You can't solve your problem by other (faster) methods: we want to maximize $\log P(w)$, but we don't know the derivative.

Poker bots: model parameters describe player behavior; 2.5M different community cards; 10 players = 59K raise/fold/call permutations per round.

Page 33

WHY MONTE CARLO? You're incredibly lazy: "I could differentiate this, I just don't wanna."

  maximize $\log P(w)$:  $\arg\max \log P(w) \approx \max_k \{P(w^k)\}$

Page 34

MARKOV CHAINS

What is an MC?

Irreducible: can visit any state starting at any state.
Aperiodic: does not get trapped in deterministic cycles.

An MC has a steady state if it is irreducible and aperiodic.

Page 35

METROPOLIS HASTINGS

Ingredients:
  proposal distribution $q(y|x)$
  posterior distribution $p(x)$

MH Algorithm: start with $x^0$. For $k = 1, 2, 3, \ldots$
  Choose candidate $y$ from $q(y|x^k)$
  Compute the acceptance probability  $\alpha = \min\left\{1, \frac{p(y)\,q(x^k|y)}{p(x^k)\,q(y|x^k)}\right\}$
  Set $x^{k+1} = y$ with probability $\alpha$; otherwise $x^{k+1} = x^k$
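The MH algorithm above can be sketched as follows, using a Gaussian random-walk proposal (a common choice, not specified on the slides) and working with log densities for stability; note $p$ only needs to be known up to a normalization constant.

```python
import numpy as np

def metropolis_hastings(log_p, x0, steps=10000, sigma=1.0, seed=0):
    """Random-walk Metropolis-Hastings with Gaussian proposal
    q(y|x) = N(x, sigma^2 I). The proposal is symmetric, so the
    ratio q(x|y)/q(y|x) cancels and alpha = min{1, p(y)/p(x)}.
    """
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.array(x0, dtype=float))
    lp = log_p(x)
    samples = []
    for _ in range(steps):
        y = x + sigma * rng.standard_normal(x.shape)   # candidate from q(y|x)
        lpy = log_p(y)
        if np.log(rng.uniform()) < lpy - lp:           # accept with probability alpha
            x, lp = y, lpy
        samples.append(x.copy())
    return np.array(samples)

# Sample a standard normal; discard the burn-in before using the chain
chain = metropolis_hastings(lambda x: -0.5 * float(x @ x), x0=[3.0], steps=20000)
kept = chain[5000:]
```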

Page 36

CONVERGENCE

Theorem. Suppose the support of $q$ contains the support of $p$. Then the MH sampler has a stationary distribution, and that distribution is equal to $p$.

Irreducible: the support of $q$ must contain the support of $p$.
Aperiodic: there must be states with positive rejection probability.

Page 37

METROPOLIS ALGORITHM

Ingredients: proposal distribution $q(y|x)$, posterior distribution $p(x)$.

Assume the proposal is symmetric: $q(x^k|y) = q(y|x^k)$. Then the Metropolis Hastings acceptance probability

  $\alpha = \min\left\{1, \frac{p(y)\,q(x^k|y)}{p(x^k)\,q(y|x^k)}\right\}$

reduces to the Metropolis rule

  $\alpha = \min\left\{1, \frac{p(y)}{p(x^k)}\right\}$

Page 38

EXAMPLE: GMM

[Figure: histogram of Metropolis iterates. Andrieu, Freitas, Doucet, Jordan '03]

Page 39

PROPERTIES OF MH

Pros:
  We don't need a normalization constant for the posterior: instead of
    $P(\text{parameters}|\text{data}) = \frac{P(\text{data}|\text{parameters})\,P(\text{parameters})}{P(\text{data})}$,
  it is enough that
    $P(\text{parameters}|\text{data}) \propto P(\text{data}|\text{parameters})\,P(\text{parameters})$.
  We can run many chains in parallel (cluster/GPU).
  We don't need any derivatives.

Page 40

PROPERTIES OF MH

Cons:
  "Mixing time" depends on the proposal distribution:
    too wide = constant rejections = slow mixing
    too narrow = short movements = slow mixing
  Samples are only meaningful at the stationary distribution:
    "burn in" samples must be discarded
    many samples are needed because of correlations

[Figure: a narrow proposal centered at $x^k$ under a wide posterior. This is bad...]

Page 41

SIMULATED ANNEALING

Goal: maximize $p(x)$.

Easy choice: run MCMC, then take $\max_k p(x^k)$.

Better choice: sample the distribution $p^{\frac{1}{T_k}}(x^k)$, where $T_k$ is the "temperature".

Why is this better? Where does the name come from?

Page 42

COOLING SCHEDULE

[Figure: tempered densities at decreasing temperatures, hot / warm / cool / cold. Andrieu, Freitas, Doucet, Jordan '03]

Page 43

CONVERGENCE

Theorem (Granville, "Simulated annealing: A proof of convergence"). Suppose the MCMC mixes fast enough that epsilon-dense sampling occurs in finite time starting at every temperature. For an annealing schedule with temperature

  $T_k = \frac{1}{C \log(k + T_0)}$

simulated annealing converges to a global optimum with probability 1.

SA solves non-convex problems, even NP-complete problems, as time goes to infinity.
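Annealing with the logarithmic schedule above can be sketched as follows; the two-well toy density, the proposal width, and the constants $C$, $T_0$ are arbitrary choices of mine for illustration.

```python
import numpy as np

def simulated_annealing(f, x0, steps=20000, sigma=2.0, C=0.5, T0=2.0, seed=0):
    """Simulated annealing to maximize p(x) = exp(f(x)): Metropolis steps on
    the tempered density p^(1/T_k) with the logarithmic schedule
    T_k = 1/(C log(k + T_0)) from the slides.
    """
    rng = np.random.default_rng(seed)
    x = float(x0)
    best_x, best_f = x, f(x)
    for k in range(1, steps + 1):
        T = 1.0 / (C * np.log(k + T0))
        y = x + sigma * rng.standard_normal()           # symmetric random-walk proposal
        if np.log(rng.uniform()) < (f(y) - f(x)) / T:   # Metropolis on p^(1/T)
            x = y
        if f(x) > best_f:                               # track the best point seen
            best_x, best_f = x, f(x)
    return best_x

# Two-well log density: a shallow well at x = -2, the global maximum near x = +2
f = lambda x: np.log(0.2 * np.exp(-(x + 2.0) ** 2) + 0.8 * np.exp(-(x - 2.0) ** 2))
best = simulated_annealing(f, x0=-2.0)   # start in the wrong well
```

Early on the temperature is high enough to hop between wells; as it cools, the chain concentrates near the global maximum.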

Page 44

WHEN TO USE SIMULATED ANNEALING

[Flow chart: "Should I use simulated annealing?" → "no."]

There is no practical way to choose the temperature schedule: too fast = stuck in a local minimum (risky); too slow = no different from MCMC. An act of desperation!

Page 45

GIBBS SAMPLER

Want to sample $P(x_1, x_2, x_3)$. On stage $k$, pick some coordinate $j$ and propose

  $q(y|x^k) = \begin{cases} p(y_j \mid x_{j^c}^k), & \text{if } y_{j^c} = x_{j^c}^k \\ 0, & \text{otherwise} \end{cases}$

Using $P(B) = \frac{P(A \text{ and } B)}{P(A|B)}$ with $p(y) = p(y_j \text{ and } y_{j^c})$, the acceptance probability is

  $\alpha = \frac{p(y)\,q(x^k|y)}{p(x^k)\,q(y|x^k)} = \frac{p(y)\,p(x_j^k \mid x_{j^c}^k)}{p(x^k)\,p(y_j \mid y_{j^c})} = \frac{p(y_{j^c})}{p(x_{j^c}^k)} = 1$

Page 46

GIBBS SAMPLER

Want to sample $P(x_1, x_2, x_3)$. Iterates:

  $x^2 \sim P(x_1 \mid x_2^1, x_3^1, x_4^1, \ldots, x_n^1)$
  $x^3 \sim P(x_2 \mid x_1^2, x_3^2, x_4^2, \ldots, x_n^2)$
  $x^4 \sim P(x_3 \mid x_1^3, x_2^3, x_4^3, \ldots, x_n^3)$
  $\cdots$
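The coordinate-by-coordinate iteration above can be sketched on a standard toy problem (a correlated 2D Gaussian, my choice, not from the slides) where the conditionals are known exactly, so every move is accepted with probability 1:

```python
import numpy as np

def gibbs_bivariate_normal(rho, steps=20000, seed=0):
    """Gibbs sampling from a 2D standard Gaussian with correlation rho.
    Each stage resamples one coordinate from its exact conditional,
    x1 | x2 ~ N(rho*x2, 1 - rho^2), and symmetrically for x2.
    """
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    out = np.empty((steps, 2))
    for k in range(steps):
        x1 = rho * x2 + np.sqrt(1 - rho ** 2) * rng.standard_normal()
        x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.standard_normal()
        out[k] = x1, x2
    return out

samples = gibbs_bivariate_normal(rho=0.9)
```

With $\rho$ close to 1 the chain moves in small steps along the correlation direction, which is exactly the slow-mixing failure mode a later slide warns about.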

Page 47

APPLICATION: SAMPLING GRAPHICAL MODELS

Restricted Boltzmann Machine (RBM): binary (0/1) random variables $v, h$ connected by weights $W$,

  $E(v, h) = -a^T v - b^T h - v^T W h$

  $P(v, h) = \frac{1}{Z} e^{-E(v, h)}$, where $Z$ is the "partition function".

Page 48

APPLICATION: SAMPLING GRAPHICAL MODELS

Restricted Boltzmann Machine (RBM): $E(v, h) = -a^T v - b^T h - v^T W h$, $P(v, h) = \frac{1}{Z} e^{-E(v, h)}$.

Remove the normalization, aggregate the terms not involving $v_i$ into a constant $C$, cancel the constants, and recognize the sigmoid function:

  $P(v_i = 1 \mid h) = \frac{P(v_i = 1 \mid h)}{P(v_i = 0 \mid h) + P(v_i = 1 \mid h)} = \frac{\exp(a_i + \sum_j w_{ij} h_j + C)}{\exp(C) + \exp(a_i + \sum_j w_{ij} h_j + C)} = \frac{\exp(a_i + \sum_j w_{ij} h_j)}{1 + \exp(a_i + \sum_j w_{ij} h_j)} = \sigma(a_i + \sum_j w_{ij} h_j)$

Page 49

BLOCK GIBBS FOR RBM

  $P(v_i = 1 \mid h) = \sigma(a_i + \sum_j w_{ij} h_j)$
  $P(h_j = 1 \mid v) = \sigma(b_j + \sum_i w_{ij} v_i)$

Stage 1: freeze the hidden units, randomly sample the visible units.
Stage 2: freeze the visible units, randomly sample the hidden units.
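Because each layer's units are conditionally independent given the other layer, both stages can resample a whole layer at once; a sketch on a toy RBM with random (untrained) parameters of my choosing:

```python
import numpy as np

def block_gibbs_rbm(W, a, b, sweeps=100, seed=0):
    """Block Gibbs for an RBM with P(v,h) proportional to exp(a.v + b.h + v.W.h).
    Given h, the visible units are independent with
    P(v_i = 1 | h) = sigmoid(a_i + (W h)_i), and symmetrically for hidden.
    """
    rng = np.random.default_rng(seed)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    nv, nh = W.shape
    v = rng.integers(0, 2, nv).astype(float)
    h = rng.integers(0, 2, nh).astype(float)
    for _ in range(sweeps):
        h = (rng.uniform(size=nh) < sigmoid(b + v @ W)).astype(float)  # stage 2
        v = (rng.uniform(size=nv) < sigmoid(a + W @ h)).astype(float)  # stage 1
    return v, h

# Toy RBM with random parameters (illustration only, not a trained model)
rng = np.random.default_rng(1)
W = rng.standard_normal((6, 4))
v, h = block_gibbs_rbm(W, np.zeros(6), np.zeros(4))
```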

Page 50

DEEP BELIEF NETS

DBN = layered RBM. Each layer only depends on the layer beneath it (feed forward). The probability for each hidden node is a sigmoid function.

[Diagram: visible layer with stacked hidden layers above it.]

Page 51

EXAMPLE: MNIST

Train a 3-layer DBN with 200 hidden units on 60K MNIST digits.

  pre-training: layer-by-layer training
  training: train all weights simultaneously

Training is done using a Gibbs sampler; the Gibbs sampler is also used to explore the final solution.

Gan, Henao, Carlson, Carin '15

Page 52

EXAMPLE: MNIST

[Figures: training data; learned features; observations sampled from the deep belief network using the Gibbs sampler.]

Page 53

WHAT'S WRONG WITH GIBBS?

Susceptible to strong correlations. (Also a problem for MH: a bad proposal distribution.)

Page 54

SLICE SAMPLER

Problems with MH: hard to choose the proposal distribution; does not exploit analytical information about the problem; long mixing time.

Slice sampler: lift $p(x)$ to a uniform density on the region under its graph,

  $q(x, u) = \begin{cases} 1, & \text{if } 0 \le u \le p(x) \\ 0, & \text{otherwise} \end{cases}$

[Figure: the density $p(x)$ and the region $0 \le u \le p(x)$ in the $(x, u)$ plane.]

Page 55

"LIFTED INTEGRAL"

  $p(x) = \int q(x, u)\, du$

  $\int f(x)\,p(x) = \int\!\!\int f(x)\,q(x, u)\, du\, dx$

Sample $q$ instead of $p$.

Page 56

SLICE SAMPLER

  $\int f(x)\,p(x) = \int\!\!\int f(x)\,q(x, u)\, du\, dx$

Do Gibbs on this:
  Freeze $x$: choose $u$ uniformly from $[0, p(x)]$
  Freeze $u$: choose $x$ uniformly from $\{x \mid p(x) \ge u\}$ (we need an analytical formula for this set)
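The two Gibbs stages above can be sketched as follows; the Gaussian example is mine, chosen because its super-level set has the closed form the slides call for.

```python
import numpy as np

def slice_sample(p, level_set, x0, steps=5000, seed=0):
    """1D slice sampler. `level_set(u)` returns the interval (lo, hi) where
    p(x) >= u -- the analytical super-level-set formula the slides require.
    """
    rng = np.random.default_rng(seed)
    x = float(x0)
    out = np.empty(steps)
    for k in range(steps):
        u = rng.uniform(0.0, p(x))    # freeze x: u ~ Uniform[0, p(x)]
        lo, hi = level_set(u)         # freeze u: the slice {x | p(x) >= u}
        x = rng.uniform(lo, hi)       # uniform draw from the slice
        out[k] = x
    return out

# Unnormalized standard normal: p(x) >= u  <=>  |x| <= sqrt(-2 log u)
p = lambda x: np.exp(-x ** 2 / 2.0)
level = lambda u: (-np.sqrt(-2.0 * np.log(u)), np.sqrt(-2.0 * np.log(u)))
samples = slice_sample(p, level, x0=0.0)
```

Note there is no proposal distribution and no rejection anywhere in the loop.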

Page 57

MODEL FITTING

  $p(x) = \prod_{i=1}^n p_i(x)$,  $\log p(x) = \sum_{i=1}^n \log p_i(x)$

Lift each factor separately:

  $q(x, u_1, u_2, \ldots, u_n) = \begin{cases} 1, & \text{if } u_i \le p_i(x)\ \forall i \\ 0, & \text{otherwise} \end{cases}$

The support in $u$ is a box, so

  $\int q(x, u_1, u_2, \ldots, u_n)\, du = \int_{u_1=0}^{p_1(x)} \int_{u_2=0}^{p_2(x)} \cdots \int_{u_n=0}^{p_n(x)} du = \prod_{i=1}^n p_i(x)$

  $\int f(x)\,p(x)\, dx = \int_x \int_u f(x)\,q(x, u)\, dx\, du$

Page 58

MODEL FITTING

  $q(x, u_1, u_2, \ldots, u_n) = \begin{cases} 1, & \text{if } u_i \le p_i(x)\ \forall i \\ 0, & \text{otherwise} \end{cases}$

  $\int f(x)\,p(x)\, dx = \int_x \int_u f(x)\,q(x, u)\, dx\, du$

Gibbs steps:
  Freeze $x$: for all $i$, choose $u_i$ uniformly from $[0, p_i(x)]$
  Freeze $u$: choose $x$ uniformly from $\bigcap_i \{x \mid p_i(x) \ge u_i\}$

Page 59

SLICE SAMPLER PROPERTIES

Pros:
  does not require a proposal distribution
  mixes super fast
  simple implementation

Cons:
  needs an analytical formula for super-level sets
  hard in multiple dimensions

Fixes / generalizations:
  step-out methods
  random hyper-rectangles

Page 60

STEP-OUT METHODS

  pick a small interval
  double the width until the slice is contained
  sample uniformly from the slice using rejection sampling: throw away selections outside the slice

[Figure: an interval at height $u$ under $p(x)$ being doubled until it covers the slice.]

Page 61

HIGH DIMS: COORDINATE METHOD

  $\int f(x)\,p(x) = \int\!\!\int f(x)\,q(x, u)\, du\, dx$

Do coordinate Gibbs on this:
  Freeze $x$: choose $u$ uniformly from $[0, p(x)]$
  Freeze $u$: for $i = 1 \cdots n$, choose $x_i$ uniformly from $\{x_i \mid p(x_1, \cdots, x_i, \cdots, x_n) \ge u\}$

Can use the stepping-out method. Radford Neal: "Slice Sampling" '03

Page 62

HIGH DIMS: HYPER-RECTANGLE

  choose a "large" rectangle
  randomly/uniformly place the rectangle around the current location
  draw a random proposal point from the rectangle
  if the proposal lies outside the slice, shrink the box so the proposal is on the boundary

Performs much better for poorly-conditioned objectives. Radford Neal: "Slice Sampling" '03

Page 63

COMPARISON

method             | rate          | when to use
-------------------|---------------|--------------------------------------------------------------
Gradient/Splitting | $e^{-ck}$     | you value reliability and precision (moderate speed, high accuracy)
SGD                | $1/k$         | you value speed over accuracy (high speed, moderate accuracy)
MCMC               | $1/\sqrt{k}$  | you value simplicity (no gradient) or need statistical inference (slow and inaccurate)

Page 64

DO MATLAB EXERCISE: MCMC