Page 1: Static Parameter Estimation using Kalman Filtering and ...

Static Parameter Estimation using Kalman Filtering and Proximal Operators

Vivak Patel, Department of Statistics, University of Chicago
December 2, 2015

Page 2: Static Parameter Estimation using Kalman Filtering and ...

Acknowledgments

Mihai Anitescu, Senior Computational Mathematician in LANS, Part-time Professor at the University of Chicago

Madhukanta Patel (June 8, 1924 – Nov. 21, 2015)

Page 3: Static Parameter Estimation using Kalman Filtering and ...

Outline

I. Stationary Parameter Estimation (SPE) 

II. Optimization 

III. Proximal Operators 

IV. Kalman Filtering Theory 

V. Kalman-based SGD

Page 4: Static Parameter Estimation using Kalman Filtering and ...

Stationary Parameters: Problem

Observed pairs $(Y_t, X_t) \in \mathbb{R} \times \mathbb{R}^n$, $t = 1, 2, \ldots$

Data generation process: $P(Y_t \mid X_t) = f(Y_t, X_t)$

Estimate $f$, which is stationary (changes on long time scales).
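As a concrete, hypothetical instance of this setup, the sketch below simulates observed pairs from a Gaussian conditional density; the linear-Gaussian form of $f$ and all names are illustrative choices (matching the linear model used later in the talk), not prescribed by the slide.

```python
import numpy as np

# Hypothetical data-generating process: Y_t | X_t ~ N(X_t' beta_star, 1).
# The linear-Gaussian form of f and all names here are illustrative, not from the slides.
rng = np.random.default_rng(0)
n, N = 5, 1000                      # parameter dimension, number of observed pairs
beta_star = rng.normal(size=n)      # "stationary" parameter behind f

X = rng.normal(size=(N, n))         # regressors X_t in R^n
Y = X @ beta_star + rng.normal(size=N)   # responses Y_t in R

# Each row (Y[t], X[t]) is one observed pair (Y_t, X_t).
print(Y[:3], X[:3])
```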

Page 5: Static Parameter Estimation using Kalman Filtering and ...

Stationary Parameters: Nonparametric Density Estimation

Assume some level of smoothness of $f$.
Use orthonormal bases of function spaces, or use a moving average.
Rate of convergence (curse of dimensionality): $\mathrm{AMISE} \sim N^{-\frac{2s}{2s+n}}$
Replace the smoothness assumption with sparsity.
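To make the curse of dimensionality concrete, here is a worked instance of the stated rate; the smoothness level $s = 2$ is an illustrative choice, not from the slides.

```latex
% With smoothness s = 2, the rate N^{-2s/(2s+n)} degrades quickly in the dimension n:
%   n = 1:  N^{-4/5}  = N^{-0.80}
%   n = 10: N^{-4/14} \approx N^{-0.29}
% versus the parametric rate N^{-1} on the next slide.
\mathrm{AMISE} \sim N^{-\frac{2s}{2s+n}}, \qquad
s = 2:\quad N^{-4/5}\ (n=1), \qquad N^{-4/14} \approx N^{-0.29}\ (n=10).
```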

Page 6: Static Parameter Estimation using Kalman Filtering and ...

Stationary Parameters: Parametric Density Estimation

Assume a particular form for $f$ depending on a parameter $\beta$:
$P(Y_t \mid X_t, \beta) = f(Y_t, X_t, \beta)$
Estimate $\beta$.
Rate of convergence: $\mathrm{MSE} \sim N^{-1}$

Page 7: Static Parameter Estimation using Kalman Filtering and ...

Optimization

For statistical inference, the objective function is the negative log-likelihood:
$F_N(\beta) = -\sum_{t=1}^{N} \log f(Y_t, X_t, \beta)$

Fisher (1922), Cramér (1946), Le Cam (1970), Hájek (1972)

Suppose the data is generated by some fixed $\beta^*$. Let $\beta_N = \arg\min_\beta F_N(\beta)$, and
$V^* = \mathbb{E}\left[\nabla_\beta \log f(Y, X, \beta^*)\, \nabla_\beta^T \log f(Y, X, \beta^*)\right]^{-1}$

If some regularity conditions on $f$ hold, then
$\sqrt{N}\,(\beta_N - \beta^*) \Rightarrow \mathcal{N}(0, V^*)$

van der Vaart (2000)
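A minimal simulation sketch of this asymptotic-normality statement, assuming a linear-Gaussian model with unit noise variance (so that $V^* = \mathbb{E}[XX^T]^{-1}$); the model choice and all names are illustrative, not from the slides.

```python
import numpy as np

# Monte Carlo illustration of sqrt(N) (beta_N - beta*) => N(0, V*) for a
# hypothetical linear-Gaussian model Y = X' beta* + eps, eps ~ N(0, 1):
# the MLE beta_N is ordinary least squares, and V* = E[X X']^{-1} = I here.
rng = np.random.default_rng(1)
n, N, reps = 3, 2000, 500
beta_star = np.array([1.0, -2.0, 0.5])

scaled_errors = []
for _ in range(reps):
    X = rng.normal(size=(N, n))
    Y = X @ beta_star + rng.normal(size=N)
    beta_N, *_ = np.linalg.lstsq(X, Y, rcond=None)       # minimizer of the negative log-likelihood
    scaled_errors.append(np.sqrt(N) * (beta_N - beta_star))

emp_cov = np.cov(np.array(scaled_errors), rowvar=False)  # empirical covariance of the scaled errors
print(np.round(emp_cov, 2))                              # approximately V* = I (3 x 3 identity)
```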

Page 8: Static Parameter Estimation using Kalman Filtering and ...

Optimization: Goals of Optimization Method

Fast rate of convergence
Computational stability
Well-defined stop condition
Low computational requirements (floating point operations, memory)

Classical Methods for $F_N(\beta)$

Method           | Gradient Evals | Hessian Evals | Floating Point | Memory | RoC          | Stop Condition | Conditioning
Newton           | N              | N             | O(n^2 N)       | O(n^2) | Quadratic    | Deterministic  | Mild
Quasi-Newton     | N              | 0             | O(n^2 + nN)    | O(n^2) | Super-Linear | Deterministic  | Mild
Gradient Descent | N              | 0             | O(nN)          | O(n)   | Linear       | Deterministic  | Moderate

Moral: If $N$ is large, a single pass through the data set before calculating a new iterate is too expensive.

Page 9: Static Parameter Estimation using Kalman Filtering and ...

Stochastic Gradients

Basic Idea: Minimize $F_N(\beta) = -\sum_{t=1}^{N} \log f(Y_t, X_t, \beta)$ using:
$\beta^{SGD}_{k+1} = \beta^{SGD}_k - \alpha_k \nabla\left(-\log f(Y_t, X_t, \beta^{SGD}_k)\right) =: \beta^{SGD}_k - \alpha_k \nabla l^{SGD}_{t,k}$

History

Least Mean Squares: Cotes (early 1700s), Legendre (1805), Gauss (1806), Macchi & Eweda (1983), Gerencsér (1995)
Stochastic Gradient Descent (SGD): Robbins & Monro (1951), Neveu (1975), Murata (1998)
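A minimal sketch of this recursion for the quadratic loss $l^{SGD}_{t,k} = (Y_t - X_t^T \beta^{SGD}_k)^2/2$ used on the kSGD slides; the step-size schedule and all names are illustrative assumptions.

```python
import numpy as np

def sgd_linear(X, Y, beta0=None, seed=0):
    """Plain SGD on the per-example loss l_t(beta) = (Y_t - X_t' beta)^2 / 2.

    The step-size schedule alpha_k = 1 / (10 + k) is an illustrative choice,
    not prescribed by the slides."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    beta = np.zeros(n) if beta0 is None else beta0.copy()
    for k in range(N):
        t = rng.integers(N)                      # sample one example per iteration
        grad = -X[t] * (Y[t] - X[t] @ beta)      # gradient of l_t at the current iterate
        beta = beta - grad / (10.0 + k)          # SGD step with a decaying step size
    return beta

# Usage on simulated data: the iterate approaches beta_star at a sub-linear rate.
rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 4))
beta_star = np.array([1.0, -1.0, 2.0, 0.0])
Y = X @ beta_star + 0.1 * rng.normal(size=5000)
print(sgd_linear(X, Y))
```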

Page 10: Static Parameter Estimation using Kalman Filtering and ...

Stochastic Gradients: Characterization per Iteration

Method                 | Gradient Evals | Hessian Evals | Floating Point | Memory | RoC          | Stop Condition | Conditioning
Newton                 | N              | N             | O(n^2 N)       | O(n^2) | Quadratic    | Deterministic  | Mild
Quasi-Newton           | N              | 0             | O(n^2 + nN)    | O(n^2) | Super-Linear | Deterministic  | Mild
Gradient Descent       | N              | 0             | O(nN)          | O(n)   | Linear       | Deterministic  | Moderate
Alt. Full Gradient SGD | N + m          | 0             | O(N + m)       | O(n)   | Linear       | Probabilistic  | Moderate
SGD                    | 1              | 0             | O(n)           | O(n)   | Sub-Linear   | None           | Severe

Alternative Strategy? Without increasing the number of gradient evaluations, can we:
Increase the rate of convergence?
Reduce sensitivity to conditioning?

Yes, by increasing the computational resources used per iteration.

Page 11: Static Parameter Estimation using Kalman Filtering and ...

Proximal Operator: Proximal Form of SGD

$\beta^{SGD}_{k+1} = \arg\min_\beta \left\{ l^{SGD}_{t,k} + (\beta - \beta^{SGD}_k)^T \nabla l^{SGD}_{t,k} + \frac{1}{2\alpha_k} \left\| \beta - \beta^{SGD}_k \right\|^2 \right\}$

Levenberg (1944), Marquardt (1963) Improvement

$\beta^{LM}_{k+1} = \arg\min_\beta \left\{ \frac{1}{2}\left( l^{LM}_{t,k} + (\beta - \beta^{LM}_k)^T \nabla l^{LM}_{t,k} \right)^2 + \frac{1}{2\alpha_k} \left\| \beta - \beta^{LM}_k \right\|^2 \right\}$

$\beta^{LM}_{k+1} = \beta^{LM}_k - \left( \nabla l^{LM}_{t,k} \nabla^T l^{LM}_{t,k} + \alpha_k^{-1} I \right)^{-1} \left( \nabla l^{LM}_{t,k} \right) l^{LM}_{t,k}$

Why is this an improvement?
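A minimal sketch of the Levenberg–Marquardt-style proximal step above in closed form, specialized to a single example of the quadratic loss $l_{t,k} = (Y_t - X_t^T \beta_k)^2/2$; the concrete loss and all names are illustrative assumptions.

```python
import numpy as np

def lm_proximal_step(beta, x_t, y_t, alpha):
    """One Levenberg-Marquardt-style proximal step
    beta_{k+1} = beta_k - (grad grad' + I/alpha)^{-1} grad * loss,
    specialized to the single-example loss l = (y_t - x_t' beta)^2 / 2."""
    resid = y_t - x_t @ beta
    loss = 0.5 * resid ** 2
    grad = -x_t * resid                              # gradient of l at beta_k
    n = beta.size
    H = np.outer(grad, grad) + np.eye(n) / alpha     # regularized rank-one curvature estimate
    return beta - np.linalg.solve(H, grad * loss)    # closed-form minimizer of the prox objective

# Usage: one step from the origin on a single example.
beta = np.zeros(3)
print(lm_proximal_step(beta, np.array([1.0, 2.0, -1.0]), 4.0, alpha=0.5))
```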

Page 12: Static Parameter Estimation using Kalman Filtering and ...

Proximal Operator

Recall:
$\beta^{LM}_{k+1} = \beta^{LM}_k - \left( \nabla l^{LM}_{t,k} \nabla^T l^{LM}_{t,k} + \alpha_k^{-1} I \right)^{-1} \left( \nabla l^{LM}_{t,k} \right) l^{LM}_{t,k}$

Second Bartlett Identity:
$\mathbb{E}_{\beta^{LM}_k}\left[ \nabla l^{LM}_{t,k} \nabla^T l^{LM}_{t,k} \right] = \mathbb{E}_{\beta^{LM}_k}\left[ \nabla^2 l_{t,k} \right]$
holds under some regularity conditions on $f$.

Moral: We are using a regularized estimate of the Hessian/covariance.

Can we do better still?
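A quick Monte Carlo sanity check of the second Bartlett identity, assuming the linear-Gaussian model with unit noise variance, for which both sides equal $\mathbb{E}[XX^T]$ at the data-generating parameter; the model and all names are illustrative, not from the slides.

```python
import numpy as np

# Check E[grad l grad l'] == E[hess l] for l(beta) = (Y - X' beta)^2 / 2,
# with Y = X' beta* + eps, eps ~ N(0, 1), both sides evaluated at beta = beta*.
# Unit noise variance is a hypothetical choice that makes the two sides match.
rng = np.random.default_rng(3)
n, N = 3, 200_000
beta_star = np.array([0.5, -1.0, 2.0])

X = rng.normal(size=(N, n))
Y = X @ beta_star + rng.normal(size=N)

grads = -X * (Y - X @ beta_star)[:, None]        # per-example gradients at beta*
lhs = grads.T @ grads / N                        # Monte Carlo E[grad grad']
rhs = X.T @ X / N                                # Monte Carlo E[hess l] = E[X X']
print(np.round(lhs, 2))                          # both approximately E[X X'] = I
print(np.round(rhs, 2))
```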

Page 13: Static Parameter Estimation using Kalman Filtering and ...

Proximal Operator: Levenberg (1944), Marquardt (1963) Improvement

$\beta^{LM}_{k+1} = \arg\min_\beta \left\{ \frac{1}{2}\left( l^{LM}_{t,k} + (\beta - \beta^{LM}_k)^T \nabla l^{LM}_{t,k} \right)^2 + \frac{1}{2\alpha_k} \left\| \beta - \beta^{LM}_k \right\|^2 \right\}$

$\beta^{LM}_{k+1} = \beta^{LM}_k - \left( \nabla l^{LM}_{t,k} \nabla^T l^{LM}_{t,k} + \alpha_k^{-1} I \right)^{-1} \left( \nabla l^{LM}_{t,k} \right) l^{LM}_{t,k}$

Possible Rescaling Improvement?

$\beta_{k+1} = \arg\min_\beta \left\{ \frac{1}{2}\left( l_{t,k} + (\beta - \beta_k)^T \nabla l_{t,k} \right)^2 + \frac{1}{2} \left\| \beta - \beta_k \right\|^2_{M_k^{-1}} \right\}$

$\beta_{k+1} = \beta_k - \left( \nabla l_{t,k} \nabla^T l_{t,k} + M_k^{-1} \right)^{-1} \left( \nabla l_{t,k} \right) l_{t,k}$

Page 14: Static Parameter Estimation using Kalman Filtering and ...

Proximal Operator

Recall:
$\beta_{k+1} = \arg\min_\beta \left\{ \frac{1}{2}\left( l_{t,k} + (\beta - \beta_k)^T \nabla l_{t,k} \right)^2 + \frac{1}{2} \left\| \beta - \beta_k \right\|^2_{M_k^{-1}} \right\}$
$\beta_{k+1} = \beta_k - \left( \nabla l_{t,k} \nabla^T l_{t,k} + M_k^{-1} \right)^{-1} \left( \nabla l_{t,k} \right) l_{t,k}$

Optimizer Interpretation: $M_k$ should be the inverse Hessian at $\beta_k$.

First Generation Stochastic Quasi-Newton:
No improvement in rate of convergence
No stop condition
High sensitivity to conditioning

Byrd, Hansen, Nocedal & Singer (arXiv, 2015)

Page 15: Static Parameter Estimation using Kalman Filtering and ...

Proximal Operator

Recall:
$\beta_{k+1} = \arg\min_\beta \left\{ \frac{1}{2}\left( l_{t,k} + (\beta - \beta_k)^T \nabla l_{t,k} \right)^2 + \frac{1}{2} \left\| \beta - \beta_k \right\|^2_{M_k^{-1}} \right\}$
$\beta_{k+1} = \beta_k - \left( \nabla l_{t,k} \nabla^T l_{t,k} + M_k^{-1} \right)^{-1} \left( \nabla l_{t,k} \right) l_{t,k}$

Statistician Interpretation: $M_k$ should estimate the covariance of the estimate (so $M_k^{-1}$ is the inverse covariance).

Kalman Filter: simultaneously estimates the parameter and its covariance matrix.

Kalman (1960)

Page 16: Static Parameter Estimation using Kalman Filtering and ...

Kalman Filter

Control Theoretic Derivation

Suppose
$\beta_{k+1} = \beta_k - G_{k+1} l_{k,t}$
and find $G_{k+1}$ which minimizes
$\mathbb{E}\left[ \left\| \beta_{k+1} - \beta^* \right\|^2 \,\middle|\, X_1, \ldots, X_{k+1} \right]$

Proximal Derivation

Given the "variance" of $l_{k,t}$ conditioned on $X_t$, $\sigma^2_t$, and $M_k := \mathbb{E}\left[ (\beta_k - \beta^*)(\beta_k - \beta^*)^T \,\middle|\, X_1, \ldots, X_k \right]$,

$\beta_{k+1} = \arg\min_\beta \left\{ \frac{1}{2\sigma^2_t}\left( l_{t,k} + (\beta - \beta_k)^T \nabla l_{t,k} \right)^2 + \frac{1}{2} \left\| \beta - \beta_k \right\|^2_{M_k^{-1}} \right\}$
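For completeness, here is the closed form of this proximal step, worked out the same way as the Levenberg–Marquardt update above; the simplification via the Sherman–Morrison identity is an added derivation, not displayed on the slide.

```latex
% Setting the gradient of the proximal objective to zero gives
\beta_{k+1}
  = \beta_k - \frac{1}{\sigma^2_t}
    \left( \tfrac{1}{\sigma^2_t} \nabla l_{t,k} \nabla^T l_{t,k} + M_k^{-1} \right)^{-1}
    \nabla l_{t,k}\, l_{t,k}
  = \beta_k - \frac{M_k \nabla l_{t,k}}{\sigma^2_t + \nabla^T l_{t,k}\, M_k\, \nabla l_{t,k}}\, l_{t,k},
% so the update has the form beta_{k+1} = beta_k - G_{k+1} l_{k,t} with a Kalman-like gain G_{k+1}.
```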

Page 17: Static Parameter Estimation using Kalman Filtering and ...

Kalman Filter

Update for $M_k$:
$M_{k+1}^{-1} = \frac{1}{\sigma^2_{t+1}} \nabla l_{k,t} \nabla^T l_{k,t} + M_k^{-1}$

Problem: neither $\sigma^2_t$ nor $M_0$ is known a priori.

Possible Solution: replace $\sigma^2_t$ with another sequence $\gamma^2_k$, replace $M_0$ with $\tilde{M}_0$, and generate
$\tilde{M}_{k+1}^{-1} = \frac{1}{\gamma^2_{k+1}} \nabla l_{k,t} \nabla^T l_{k,t} + \tilde{M}_k^{-1}$

Questions

What values of $\gamma^2_k$ will result in $\beta_k$ converging to $\beta^*$?
What values of $\gamma^2_k$ will result in $\tilde{M}_k$ approximating $M_k$?
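A minimal sketch of the recursion obtained from the displayed proximal step and covariance update, specialized to the quadratic loss $l_{k,t} = (Y_t - X_t^T \beta_k)^2/2$; the use of the Sherman–Morrison identity to keep each step at $O(n^2)$ cost, the constant $\gamma^2_k$, and all names are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def ksgd_step(beta, M, x_t, y_t, gamma2):
    """One step of the Kalman-based recursion for l = (y_t - x_t' beta)^2 / 2.

    Minimizes (l + (beta - beta_k)' grad)^2 / (2 gamma2) + ||beta - beta_k||^2_{M^{-1}} / 2
    and updates M_{k+1}^{-1} = grad grad' / gamma2 + M_k^{-1}, both written with the
    Sherman-Morrison identity so each step costs O(n^2)."""
    resid = y_t - x_t @ beta
    loss = 0.5 * resid ** 2
    grad = -x_t * resid                  # gradient of the single-example loss
    Mg = M @ grad
    denom = gamma2 + grad @ Mg           # gamma2 + grad' M grad
    beta_new = beta - Mg * loss / denom  # Kalman-gain form of the proximal step
    M_new = M - np.outer(Mg, Mg) / denom
    return beta_new, M_new

# Usage on a small simulated linear model (illustrative choices throughout):
rng = np.random.default_rng(4)
n, N = 4, 2000
beta_star = rng.normal(size=n)
X = rng.normal(size=(N, n))
Y = X @ beta_star + 0.1 * rng.normal(size=N)

beta, M = np.zeros(n), np.eye(n)         # M_0 = I is an arbitrary starting "covariance"
for t in range(N):
    beta, M = ksgd_step(beta, M, X[t], Y[t], gamma2=0.1 ** 2)
print(beta - beta_star)                  # much closer to beta_star than the zero start;
                                         # M^{-1} has accumulated the gradient outer products
```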

Page 18: Static Parameter Estimation using Kalman Filtering and ...

Kalman Filter: Summary of Kalman Filtering Theory

Randomness in the model is not assumed to exist. Thus, $\sigma^2_t = 0$, and $\gamma^2_k$ could be picked based on rate-of-convergence needs.
There is a strict focus on dynamic parameter estimation. Approximating $M_k$ is ignored in order to prove a linear convergence rate.

References

Johnstone, Johnson, Bitmead & Anderson (1982)
Bittanti, Bolzern & Campi (1990)
Parkum, Poulsen & Holst (1992)
Cao & Schwartz (2003)

Page 19: Static Parameter Estimation using Kalman Filtering and ...

Kalman-based SGD

Assumptions

Linear Model: $l_{k,t} = (Y_t - X_t^T \beta_k)^2/2$ with $\mathbb{E}[l_{*,t}] < \infty$
Independence and Identical Distribution of $(Y_t, X_t)$
Regularity: $\mathbb{E}\left[\|X_1\|_2^2\right] < \infty$
Uniqueness: for all unit vectors $v \in \mathbb{R}^n$, $P\left[\left|X_1^T v\right| = 0\right] < 1$

In the noise-free case, $Y_t = X_t^T \beta^*$, and kSGD will calculate the vector $\beta^*$ once $n$ linearly independent examples are assimilated (Modified Gram-Schmidt).
In the noisy case, if $0 < \inf_k \gamma^2_k \le \sup_k \gamma^2_k < \infty$, then almost surely
$\mathbb{E}\left[\|\beta_k - \beta^*\| \,\middle|\, X_1, \ldots, X_k\right] \to 0$
For every $\epsilon > 0$, asymptotically almost surely:
$\frac{1+\epsilon}{\inf_k \gamma^2_k}\, \tilde{M}_k \succeq \frac{1}{\sigma^2}\, M_k \succeq \frac{1-\epsilon}{\sup_k \gamma^2_k}\, \tilde{M}_k$

Page 20: Static Parameter Estimation using Kalman Filtering and ...

Kalman-based SGD: CMS Data Set (n = 31, N = 2.8 million) [figure]

Page 21: Static Parameter Estimation using Kalman Filtering and ...

Kalman-based SGD: CMS Data Set (n = 31, N = 2.8 million) [figure]

Page 22: Static Parameter Estimation using Kalman Filtering and ...

Kalman-based SGD: Characterization per Iteration

Method                 | Gradient Evals | Hessian Evals | Floating Point | Memory | RoC          | Stop Condition | Conditioning
Newton                 | N              | N             | O(n^2 N)       | O(n^2) | Quadratic    | Deterministic  | Mild
Quasi-Newton           | N              | 0             | O(n^2 + nN)    | O(n^2) | Super-Linear | Deterministic  | Mild
Gradient Descent       | N              | 0             | O(nN)          | O(n)   | Linear       | Deterministic  | Moderate
Alt. Full Gradient SGD | N + m          | 0             | O(N + m)       | O(n)   | Linear       | Probabilistic  | Moderate
kSGD                   | 1              | 0             | O(n^2)         | O(n^2) | > Sub-Linear | Almost Sure    | Mild
SGD                    | 1              | 0             | O(n)           | O(n)   | Sub-Linear   | None           | Severe