Newton’s Method applied to a scalar function
Newton’s method for minimizing f(x):
Twice differentiable function f(x), initial solution x_0. Generate a sequence of solutions x_1, x_2, … and stop if the sequence converges to a solution with ∇f(x) = 0.
1. Solve −∇f(x_k) ≈ ∇²f(x_k) Δx for Δx.
2. Let x_{k+1} = x_k + Δx.
3. Let k = k + 1.
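To make the iteration concrete, here is a minimal Python sketch for a scalar f; the test function, tolerance, and iteration cap are illustrative choices, not from the slides.

```python
# Minimal sketch of Newton's method for minimizing a scalar f(x):
# solve f''(x_k) dx = -f'(x_k), step, and stop when f'(x) is ~0.

def newton_minimize(fprime, fsecond, x0, tol=1e-8, max_iter=50):
    x = x0
    for k in range(max_iter):
        g = fprime(x)
        if abs(g) < tol:        # gradient ~0: (local) minimum found
            return x, k
        x = x - g / fsecond(x)  # Newton step dx = -f'(x_k)/f''(x_k)
    return x, max_iter

# Example: f(x) = (x - 2)^4, so f'(x) = 4(x-2)^3 and f''(x) = 12(x-2)^2
x_star, iters = newton_minimize(lambda x: 4 * (x - 2) ** 3,
                                lambda x: 12 * (x - 2) ** 2,
                                x0=0.0)
print(x_star, iters)  # converges toward the minimizer x* = 2
```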
Newton’s Method applied to LS
Newton's method is not directly applicable to most nonlinear regression and inverse problems (the number of model parameters does not equal the number of data points, and there is no exact solution to G(m) = d). Instead we will use NM to minimize a nonlinear LS problem, e.g., fit a vector of n parameters to a data vector d.
f(m) = Σ_{i=1}^{m} [(G(m)_i − d_i)/σ_i]²

Let f_i(m) = (G(m)_i − d_i)/σ_i, i = 1, 2, …, m, and F(m) = [f_1(m) … f_m(m)]^T,

so that f(m) = Σ_{i=1}^{m} f_i(m)² and ∇f(m) = Σ_{i=1}^{m} ∇[f_i(m)²].
NM: Solve −∇f(m_k) ≈ ∇²f(m_k) Δm

LHS: ∇f(m)_j = Σ_{i=1}^{m} 2 f_i(m) ∇f_i(m)_j, i.e., ∇f(m) = 2 J(m)^T F(m), so the LHS is −2 J(m_k)^T F(m_k)

RHS: ∇²f(m_k) Δm = [2 J(m_k)^T J(m_k) + Q(m_k)] Δm, where

Q(m) = 2 Σ_{i=1}^{m} f_i(m) ∇²f_i(m)

Writing H(m) = 2 J(m)^T J(m) + Q(m), the system becomes −2 J(m_k)^T F(m_k) = H(m_k) Δm, so

Δm = −H(m_k)^{-1} 2 J(m_k)^T F(m_k) = −H(m_k)^{-1} ∇f(m_k)   (eq. 9.19)
Gauss-Newton (GN) method
∇²f(m_k) Δm = H(m_k) Δm = [2 J(m_k)^T J(m_k) + Q(m_k)] Δm

GN ignores Q(m) = 2 Σ_{i=1}^{m} f_i(m) ∇²f_i(m):

∇²f(m) ≈ 2 J(m)^T J(m), assuming the f_i(m) will be reasonably small as we approach m*. That is, instead of solving −∇f(m_k) ≈ ∇²f(m_k) Δm with ∇f(m)_j = Σ_{i=1}^{m} 2 f_i(m) ∇f_i(m)_j, GN solves

J(m_k)^T J(m_k) Δm = −J(m_k)^T F(m_k)
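As a concrete illustration, here is a minimal Gauss-Newton sketch in Python. The names G_func (forward model) and jac (Jacobian of G) are assumed, user-supplied placeholders, not anything defined in these notes; a fixed iteration count stands in for a proper convergence test.

```python
import numpy as np

def gauss_newton(G_func, jac, d, sigma, m0, n_iter=20):
    """GN iteration for min sum f_i(m)^2, f_i(m) = (G(m)_i - d_i)/sigma_i."""
    m = np.asarray(m0, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    for _ in range(n_iter):
        F = (G_func(m) - d) / sigma    # scaled residual vector F(m_k)
        J = jac(m) / sigma[:, None]    # Jacobian of F (rows scaled by 1/sigma_i)
        # GN normal equations: J^T J dm = -J^T F
        dm = np.linalg.solve(J.T @ J, -J.T @ F)
        m = m + dm
    return m
```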
Levenberg-Marquardt (LM) method uses
[J(m_k)^T J(m_k) + λI] Δm = −J(m_k)^T F(m_k)

λ → 0: GN.

λ → large: steepest descent (SD) (moves down-gradient most rapidly). SD provides slow but certain convergence.

Which value of λ to use? Small values when GN is working well; switch to larger values in problem areas. Start with a small value of λ, then adjust.
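A sketch of this LM strategy, under the same assumptions as the GN sketch above; the factor-of-10 adjustment of λ and its starting value are common heuristics, not values prescribed by these notes.

```python
import numpy as np

def levenberg_marquardt(G_func, jac, d, sigma, m0, lam=1e-3, n_iter=50):
    """LM iteration: shrink lam when a step helps (GN-like),
    grow it when the step fails (SD-like)."""
    m = np.asarray(m0, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    F = (G_func(m) - d) / sigma
    for _ in range(n_iter):
        J = jac(m) / sigma[:, None]
        # Solve [J^T J + lam I] dm = -J^T F
        dm = np.linalg.solve(J.T @ J + lam * np.eye(m.size), -J.T @ F)
        F_new = (G_func(m + dm) - d) / sigma
        if F_new @ F_new < F @ F:   # misfit decreased: accept, move toward GN
            m, F = m + dm, F_new
            lam /= 10.0
        else:                       # misfit increased: reject, move toward SD
            lam *= 10.0
    return m
```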
Statistics of iterative methods
Cov(Ad) = A Cov(d) A^T (d has a multivariate normal distribution)

Cov(m_L2) = (G^T G)^{-1} G^T Cov(d) G (G^T G)^{-1}

If Cov(d) = σ²I: Cov(m_L2) = σ² (G^T G)^{-1}
However, we don't have a linear relationship between the data and the estimated model parameters in nonlinear regression, so we cannot use these formulas directly. Instead:
F(m* + Δm) ≈ F(m*) + J(m*) Δm

Cov(m*) ≈ (J(m*)^T J(m*))^{-1}; this is not exact due to the linearization, so confidence intervals may not be accurate.
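A short sketch of this linearized estimate, reusing the illustrative jac placeholder from the sketches above; the 1.96 multiplier gives approximate 95% confidence intervals, with the caveat just stated.

```python
import numpy as np

def linearized_ci(jac, sigma, m_star):
    """Approximate 95% confidence intervals from Cov(m*) ~ (J^T J)^-1."""
    J = jac(m_star) / np.asarray(sigma, dtype=float)[:, None]
    cov = np.linalg.inv(J.T @ J)          # linearized covariance
    half = 1.96 * np.sqrt(np.diag(cov))   # per-parameter half-widths
    return m_star - half, m_star + half
```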
Conjugate Gradient (CG) method

It can be shown that CG generates a sequence of mutually conjugate basis vectors; in theory, the method will find an exact solution in n iterations.
Given a positive definite, symmetric system of equations Ax = b and an initial solution x_0, let β_0 = 0, p_{-1} = 0, r_0 = b − Ax_0, k = 0.

1. If k > 0, let β_k = r_k^T r_k / (r_{k-1}^T r_{k-1})
2. Let p_k = r_k + β_k p_{k-1}
3. Let α_k = r_k^T r_k / (p_k^T A p_k)
4. Let x_{k+1} = x_k + α_k p_k
5. Let r_{k+1} = r_k − α_k A p_k
6. Let k = k + 1
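A direct Python transcription of the algorithm above; this is a bare sketch (no preconditioning, simple residual-based stopping test).

```python
import numpy as np

def conjugate_gradients(A, b, x0, tol=1e-12):
    x = np.asarray(x0, dtype=float)
    r = b - A @ x                 # r_0 = b - A x_0
    p = np.zeros_like(x)          # p_{-1} = 0
    rr_old = 1.0                  # placeholder; beta_0 = 0 is forced below
    for k in range(len(b)):       # exact solution in n steps, in theory
        rr = r @ r
        if rr < tol:
            break
        beta = rr / rr_old if k > 0 else 0.0  # beta_k = r_k.r_k / r_{k-1}.r_{k-1}
        p = r + beta * p                      # p_k = r_k + beta_k p_{k-1}
        alpha = rr / (p @ (A @ p))            # alpha_k = r_k.r_k / p_k^T A p_k
        x = x + alpha * p                     # x_{k+1} = x_k + alpha_k p_k
        r = r - alpha * (A @ p)               # r_{k+1} = r_k - alpha_k A p_k
        rr_old = rr
    return x
```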
Conjugate Gradients Least Squares Method
CG can only be applied to positive definite systems of equations, and is thus not applicable to general LS problems. Instead, we can apply the CGLS method to

min ||Gm − d||_2

via the normal equations G^T G m = G^T d.
r_k = G^T d − G^T G m_k = G^T (d − G m_k) = G^T s_k

s_{k+1} = d − G m_{k+1} = d − G(m_k + α_k p_k) = (d − G m_k) − α_k G p_k = s_k − α_k G p_k
Given a system of equations Gm = d, let k = 0, m_0 = 0, p_{-1} = 0, β_0 = 0, s_0 = d − G m_0, r_0 = G^T s_0.

1. If k > 0, let β_k = r_k^T r_k / (r_{k-1}^T r_{k-1})
2. Let p_k = r_k + β_k p_{k-1}
3. Let α_k = r_k^T r_k / [(G p_k)^T (G p_k)]
4. Let m_{k+1} = m_k + α_k p_k
5. Let s_{k+1} = s_k − α_k G p_k
6. Let r_{k+1} = G^T s_{k+1}
7. Let k = k + 1

Note that G^T G is never computed, only the products G p_k and G^T s_{k+1}.
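The same algorithm transcribed directly into Python; as noted above, only the products G @ p and G.T @ s are ever formed, never G^T G.

```python
import numpy as np

def cgls(G, d, n_iter):
    m = np.zeros(G.shape[1])      # m_0 = 0
    s = d - G @ m                 # s_0 = d - G m_0
    r = G.T @ s                   # r_0 = G^T s_0
    p = np.zeros_like(m)          # p_{-1} = 0
    rr_old = 1.0                  # placeholder; beta_0 = 0 is forced below
    for k in range(n_iter):
        rr = r @ r
        beta = rr / rr_old if k > 0 else 0.0
        p = r + beta * p          # p_k = r_k + beta_k p_{k-1}
        Gp = G @ p                # the only product with G
        alpha = rr / (Gp @ Gp)    # alpha_k = r_k.r_k / (G p_k)^T (G p_k)
        m = m + alpha * p         # m_{k+1} = m_k + alpha_k p_k
        s = s - alpha * Gp        # s_{k+1} = s_k - alpha_k G p_k
        r = G.T @ s               # r_{k+1} = G^T s_{k+1}
        rr_old = rr
    return m
```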
L1 Regression
LS (L2) is strongly affected by outliers
If outliers are due to incorrect measurements, the inversion should minimize their effect on the estimated model.
The effect of outliers on LS estimates reflects the rapid fall-off of the tails of the normal distribution.

In contrast, the exponential distribution has a longer tail, implying that the probability of realizing data far from the mean is higher. A few data points several standard deviations from <d> are much more probable if drawn from an exponential rather than from a normal distribution. Therefore, methods based on exponential distributions are able to handle outliers better than methods based on normal distributions. Such methods are said to be robust.
min Σ_{i=1}^{m} |d_i − (Gm)_i| / σ_i = min ||d_w − G_w m||_1

This is more robust to outliers because the misfit is not squared.
Example: repeating a measurement m times:

[1 1 … 1]^T m = [d_1 d_2 … d_m]^T

m_L2 = (G^T G)^{-1} G^T d = (1/m) Σ_{i=1}^{m} d_i (the mean)

f(m) = ||d − Gm||_1 = Σ_{i=1}^{m} |d_i − m|

Non-differentiable where m = d_i.

Convex, so local minima = global minima.

f′(m) = −Σ_{i=1}^{m} sgn(d_i − m), with sgn(x) = +1 if x > 0, −1 if x < 0, 0 if x = 0

f′(m) = 0 if half of the residuals are + and half are −

<d>_est = median: half of the data is < <d>_est and half is > <d>_est
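A quick numerical check of this contrast, with a contrived outlier:

```python
import numpy as np

d = np.array([9.8, 10.1, 10.0, 9.9, 50.0])  # repeated measurement, one outlier
print(d.mean())      # L2 estimate (mean): 17.96, dragged toward the outlier
print(np.median(d))  # L1 estimate (median): 10.0, essentially unaffected
```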
Finding min ||d_w − G_w m||_1 is not trivial. Several methods are available, such as IRLS (iteratively reweighted least squares), which solves a series of weighted LS problems converging to the 1-norm solution; a minimal sketch follows.
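In this IRLS sketch, each pass solves a weighted LS problem with weights 1/|r_i|, clipped at a small eps to handle the non-differentiable points. The eps value and the fixed iteration count are illustrative choices.

```python
import numpy as np

def irls_l1(G, d, n_iter=50, eps=1e-8):
    """Approximate min ||d - G m||_1 by iteratively reweighted LS."""
    m = np.linalg.lstsq(G, d, rcond=None)[0]   # start from the L2 solution
    for _ in range(n_iter):
        r = d - G @ m
        w = 1.0 / np.maximum(np.abs(r), eps)   # downweight large residuals
        GtW = G.T * w                          # rows of G^T scaled by w_i
        m = np.linalg.solve(GtW @ G, GtW @ d)  # weighted normal equations
    return m
```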
'Shaping' filtering: A*x = D, where D is the 'desired' response and A, D are known.

Writing the convolution as a matrix equation Ax = D and forming the normal equations A^T A x = A^T D:

A^T A =
[  .     .     .  ] [  .     .     .  ]
[ a_1   a_2   a_3 ] [ a_1   a_0   a_-1]
[ a_0   a_1   a_2 ] [ a_2   a_1   a_0 ]
[ a_-1  a_0   a_1 ] [ a_3   a_2   a_1 ]
[  .     .     .  ] [  .     .     .  ]

(the row-times-column products here are auto-correlations of a_t, i.e., a convolution of a_t with its time reverse)

The matrix Φ = A^T A, with entries φ_ij, is formed by the auto-correlation of a_t, with the zero-lag values along the diagonal and auto-correlations of successively higher lags off the diagonal. Φ is symmetric, of order n.

A^T D becomes

[  .     .     .  ] [  .   ]   [  .   ]
[ a_-1  a_-2  a_-3] [ d_-1 ]   [ c_1  ]
[ a_0   a_-1  a_-2] [ d_0  ] = [ c_0  ]
[ a_1   a_0   a_-1] [ d_1  ]   [ c_-1 ]
[  .     .     .  ] [  .   ]   [  .   ]

The vector c, with entries c_j, is formed by the cross-correlation of the elements of A and D.

Solution: x = (A^T A)^{-1} A^T D = Φ^{-1} c
Example
Find a filter, 3 elements long, that convolved with (2, 1) produces (1, 0, 0, 0): (2, 1) * (f_1, f_2, f_3) = (1, 0, 0, 0).
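The example worked numerically in Python; A below is the convolution matrix of (2, 1) acting on a 3-point filter, and the solution follows the Φ⁻¹c recipe above.

```python
import numpy as np

a = np.array([2.0, 1.0])                  # given wavelet
desired = np.array([1.0, 0.0, 0.0, 0.0])  # desired output

# Convolution matrix: A @ f = a * f for a 3-point filter f
A = np.array([[2, 0, 0],
              [1, 2, 0],
              [0, 1, 2],
              [0, 0, 1]], dtype=float)

Phi = A.T @ A       # autocorrelation (Toeplitz) matrix of a
c = A.T @ desired   # cross-correlation of a with the desired output
f = np.linalg.solve(Phi, c)   # f = Phi^{-1} c

print(f)                  # ~ [ 0.494, -0.235,  0.094]
print(np.convolve(a, f))  # ~ [ 0.988,  0.024, -0.047,  0.094], close to (1,0,0,0)
```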