Lecture 8 Advanced Topics in Least Squares - Part Two -
Page 1: Lecture 8 Advanced Topics in Least Squares - Part Two -

Lecture 8

Advanced Topics in

Least Squares

- Part Two -

Page 2: Lecture 8 Advanced Topics in Least Squares - Part Two -

Concerning the Homework

the dirty little secret of data analysis

Page 3: Lecture 8 Advanced Topics in Least Squares - Part Two -

You often spend more time futzing with reading files that are

in inscrutable formats

than the intellectually-interesting side of data analysis

Page 4: Lecture 8 Advanced Topics in Least Squares - Part Two -

Sample MatLab Code

a routine to read a text file

% importdata with a delimiter ('=') that never occurs in the file forces each
% complete line into one cell; it returns a "cellstr" data type: an array of strings
cs = importdata('test.txt','=');
Ns = length(cs);
mag = zeros(Ns,1);
Nm = 0;
for i = [1:Ns]
    s = char(cs(i));              % convert the "cellstr" element to an ordinary string
    smag  = s(48:50);             % fixed character positions for the magnitude value
    stype = s(51:52);             % and for the magnitude type
    if( stype == 'Mb' )
        Nm = Nm+1;
        mag(Nm,1) = str2num(smag);   % convert string to number
    end
end
mag = mag(1:Nm);

Page 5: Lecture 8 Advanced Topics in Least Squares - Part Two -

What can go wrong in least-squares

$$m = [G^T G]^{-1} G^T d$$

the matrix $G^T G$ is singular, so the inverse $[G^T G]^{-1}$ does not exist

Page 6: Lecture 8 Advanced Topics in Least Squares - Part Two -

EXAMPLE - a straight-line fit, $d_i = m_1 + m_2 x_i$:

$$\begin{bmatrix} d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_N \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \end{bmatrix}$$

$$G^T G = \begin{bmatrix} N & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{bmatrix}, \qquad \det(G^T G) = N \sum_i x_i^2 - \Big[ \sum_i x_i \Big]^2$$

$G^T G$ is singular, and $[G^T G]^{-1}$ does not exist, when the determinant is zero

Page 7: Lecture 8 Advanced Topics in Least Squares - Part Two -

N = 1, only one measurement $(x, d)$:

$$N \sum_i x_i^2 - \Big[ \sum_i x_i \Big]^2 = x^2 - x^2 = 0$$

you can’t fit a straight line to only one point

N > 1, but all data measured at the same $x$:

$$N \sum_i x_i^2 - \Big[ \sum_i x_i \Big]^2 = N^2 x^2 - N^2 x^2 = 0$$

measuring the same point over and over doesn’t help

in both cases $\det(G^T G) = N \sum_i x_i^2 - [ \sum_i x_i ]^2 = 0$
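A minimal MATLAB sketch (my own illustration, not from the slides) of the second case: with every datum measured at the same x, GᵀG is rank-deficient and its inverse does not exist.

N = 5;
x = 2.0*ones(N,1);          % all N measurements taken at the same x
G = [ones(N,1), x];         % data kernel for the straight-line fit
GTG = G'*G;
det(GTG)                    % 0:  N*sum(x.^2) - sum(x)^2
rank(GTG)                   % 1, not 2, so [G'*G]^(-1) does not exist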

Page 8: Lecture 8 Advanced Topics in Least Squares - Part Two -

another example – sums and differences: $N_s$ sums, $s_i = m_1 + m_2$, and $N_d$ differences, $d_i = m_1 - m_2$, of two unknowns $m_1$ and $m_2$

$$\begin{bmatrix} s_1 \\ s_2 \\ \vdots \\ d_{N_d - 1} \\ d_{N_d} \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 1 \\ \vdots & \vdots \\ 1 & -1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \end{bmatrix}$$

$$G^T G = \begin{bmatrix} N_s + N_d & N_s - N_d \\ N_s - N_d & N_s + N_d \end{bmatrix}$$

$$\det(G^T G) = [N_s + N_d]^2 - [N_s - N_d]^2 = [N_s^2 + N_d^2 + 2 N_s N_d] - [N_s^2 + N_d^2 - 2 N_s N_d] = 4 N_s N_d$$

which is zero when $N_s = 0$ or $N_d = 0$, that is, when there are only measurements of one kind
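A small MATLAB check of this determinant (again my own sketch, not from the slides): with both kinds of measurements present, det(GᵀG) = 4·Ns·Nd; drop one kind and it collapses to zero.

Ns = 4;  Nd = 3;                                 % numbers of sums and of differences
G = [ones(Ns,2); [ones(Nd,1), -ones(Nd,1)]];     % Ns rows of [1 1], then Nd rows of [1 -1]
det(G'*G)                                        % 48 = 4*Ns*Nd
Gsums = ones(Ns,2);                              % only sums (Nd = 0)
det(Gsums'*Gsums)                                % 0: m1 and m2 can no longer be separated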

Page 9: Lecture 8 Advanced Topics in Least Squares - Part Two -

This sort of ‘missing measurement’ might be difficult to recognize in a

complicated problem

but it happens all the time …

Page 10: Lecture 8 Advanced Topics in Least Squares - Part Two -

Example - Tomography

Page 11: Lecture 8 Advanced Topics in Least Squares - Part Two -

in this method, you try to plaster the subject with X-ray beams made at every possible position and direction, but you can easily wind up missing some small region …

[Figure: X-ray beam paths through the subject, with a small region labeled “no data coverage here”]

Page 12: Lecture 8 Advanced Topics in Least Squares - Part Two -

What to do ?

Introduce prior information

assumptions about the behavior of the unknowns

that ‘fill in’ the data gaps

Page 13: Lecture 8 Advanced Topics in Least Squares - Part Two -

Examples of Prior Information

The unknowns:

are close to some already-known value (e.g. the density of the mantle is close to 3000 kg/m³)

vary smoothly with time or with geographical position (e.g. ocean currents have length scales of tens of km)

obey some physical law embodied in a PDE (e.g. water is incompressible and thus its velocity satisfies div(v) = 0)

Page 14: Lecture 8 Advanced Topics in Least Squares - Part Two -

Are you only fooling yourself ?

It depends …

are your assumptions good ones?

Page 15: Lecture 8 Advanced Topics in Least Squares - Part Two -

Application of the Maximum Likelihood Method

to this problem

so, let’s have a foray into the world of probability

Page 16: Lecture 8 Advanced Topics in Least Squares - Part Two -

Overall Strategy

1. Represent the observed data as a probability distribution

2. Represent prior information as a probability distribution

3. Represent the relationship between data and model parameters as a probability distribution

4. Combine the three distributions in a way that embodies combining the information that they contain

5. Apply maximum likelihood to the combined distribution

Page 17: Lecture 8 Advanced Topics in Least Squares - Part Two -

How to combine distributions in a way that embodies combining the information that they contain …

Short answer: multiply them

But let’s step through a more well-reasoned analysis of why we should do that …

[Figure: three sketches of probability densities plotted against x: p1(x), p2(x), and the combined distribution pT(x)]

Page 18: Lecture 8 Advanced Topics in Least Squares - Part Two -

how to quantify the information in a distribution p(x)

Information compared to what?

Compared to a distribution pN(x) that represents the state of complete ignorance

Example: pN(x) = a uniform distribution

The information content should be a scalar quantity, Q

Page 19: Lecture 8 Advanced Topics in Least Squares - Part Two -

$$Q = \int \ln\!\left[\frac{p(x)}{p_N(x)}\right] p(x)\, dx$$

Q is the expected value of $\ln[\, p(x)/p_N(x) \,]$

Properties:

Q = 0 when $p(x) = p_N(x)$

$Q \ge 0$ always (since $\lim_{p \to 0} p \ln(p) = 0$)

Q is invariant under a change of variables $x \to y$
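As a numerical illustration (my own sketch, not from the slides), Q can be approximated on a grid; for a Normal p(x) compared against a uniform pN(x) on the same interval, it comes out positive.

x  = linspace(-10, 10, 2001)';   dx = x(2) - x(1);
pN = ones(size(x)) / 20;                             % uniform 'complete ignorance' on [-10, 10]
sigma = 1.0;
p  = exp(-0.5*(x/sigma).^2) / (sqrt(2*pi)*sigma);    % Normal distribution with zero mean
Q  = sum( log(p./pN) .* p ) * dx                     % about 1.58 for sigma = 1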

Page 20: Lecture 8 Advanced Topics in Least Squares - Part Two -

Combining distributions pA(x) and pB(x)

Desired properties of the combination:

pA(x) combined with pB(x) is the same as pB(x) combined with pA(x)

pA(x) combined with [ pB(x) combined with pC(x) ] is the same as [ pA(x) combined with pB(x) ] combined with pC(x)

Q of [ pA(x) combined with pN(x) ] = QA

Page 21: Lecture 8 Advanced Topics in Least Squares - Part Two -

pAB(x) = pA(x) pB(x) / pN(x)

When pN(x) is the uniform distribution …

… combining is just multiplying.

But note that for “points on the surface of a sphere”, the null distribution, p(θ,φ), where θ is colatitude (the angle from the pole) and φ is longitude, would not be uniform, but rather proportional to sin(θ).

Page 22: Lecture 8 Advanced Topics in Least Squares - Part Two -

Overall Strategy

1. Represent the observed data as a Normal probability distribution

$$p_A(d) \propto \exp\left\{ -\tfrac{1}{2} (d - d^{obs})^T C_d^{-1} (d - d^{obs}) \right\}$$

In the absence of any other information, the best estimate of the mean of the data is the observed data itself.

Prior covariance of the data.

I don’t feel like typing the normalization

Page 23: Lecture 8 Advanced Topics in Least Squares - Part Two -

Overall Strategy

2. Represent prior information as a Normal probability distribution

$$p_A(m) \propto \exp\left\{ -\tfrac{1}{2} (m - m_A)^T C_m^{-1} (m - m_A) \right\}$$

Prior estimate of the model, your best guess as to what it would be, in the absence of any observations.

Prior covariance of the model quantifies how good you think your prior estimate is …

Page 24: Lecture 8 Advanced Topics in Least Squares - Part Two -

example

one observation, dobs = 0.8 ± 0.4

one model parameter with mA = 1.0 ± 1.25

Page 25: Lecture 8 Advanced Topics in Least Squares - Part Two -

[Figure: pA(d) pA(m) plotted over the (m, d) plane, both axes running from 0 to 2; the peak sits at m = mA = 1, d = dobs = 0.8]
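A short MATLAB sketch (my own, using the example numbers from Page 24) of what this figure shows: evaluate the two unnormalized Gaussians on a grid over the (m, d) plane and multiply them.

dobs = 0.8;  sigd = 0.4;                     % observed datum and its standard deviation
mA   = 1.0;  sigm = 1.25;                    % prior model value and its standard deviation
m = linspace(0, 2, 201);  d = linspace(0, 2, 201);
[Mg, Dg] = meshgrid(m, d);                   % Mg varies across columns, Dg down rows
pAd = exp(-0.5*((Dg - dobs)/sigd).^2);       % unnormalized pA(d)
pAm = exp(-0.5*((Mg - mA)/sigm).^2);         % unnormalized pA(m)
P = pAd .* pAm;                              % the product shown in the figure
imagesc(m, d, P);  axis xy;  xlabel('m');  ylabel('d');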

Page 26: Lecture 8 Advanced Topics in Least Squares - Part Two -

Overall Strategy

3. Represent the relationship between data and model parameters as a probability distribution

$$p_T(d, m) \propto \exp\left\{ -\tfrac{1}{2} (d - Gm)^T C_G^{-1} (d - Gm) \right\}$$

Prior covariance of the theory quantifies how good you think your linear theory is.

linear theory, Gm=d, relating data, d, to model parameters, m.

Page 27: Lecture 8 Advanced Topics in Least Squares - Part Two -

example

theory: d=m

but only accurate to ± 0.2

Page 28: Lecture 8 Advanced Topics in Least Squares - Part Two -

[Figure: pT(d, m) for the theory d = m, plotted over the (m, d) plane with both axes running from 0 to 2; the distribution is a narrow ridge along the line d = m, shown with the reference values mA = 1 and dobs = 0.8]

Page 29: Lecture 8 Advanced Topics in Least Squares - Part Two -

Overall Strategy

4. Combine the three distributions in a way that embodies combining the information that they contain

$$p(m, d) = p_A(d)\, p_A(m)\, p_T(m, d)$$

$$\propto \exp\Big\{ -\tfrac{1}{2} \big[ (d - d^{obs})^T C_d^{-1} (d - d^{obs}) + (m - m_A)^T C_m^{-1} (m - m_A) + (d - Gm)^T C_G^{-1} (d - Gm) \big] \Big\}$$

a bit of a mess, but it can be simplified …

Page 30: Lecture 8 Advanced Topics in Least Squares - Part Two -

[Figure: the combined distribution p(d, m) = pA(d) pA(m) pT(d, m) over the (m, d) plane, both axes running from 0 to 2]

Page 31: Lecture 8 Advanced Topics in Least Squares - Part Two -

Overall Strategy

5. Apply maximum likelihood to the combined distribution, p(d,m) = pA(d) pA(m) pT(m,d)

There are two distinct ways we could do this:

Find the (d, m) combination that maximizes the joint probability distribution, p(d, m)

Find the m that maximizes the marginal probability distribution, $p(m) = \int p(d, m)\, dd$

These do not necessarily give the same value for m

Page 32: Lecture 8 Advanced Topics in Least Squares - Part Two -

[Figure: the maximum of p(d, m) = pA(d) pA(m) pT(d, m) over the (m, d) plane, both axes running from 0 to 2; the maximum likelihood point is marked at (mest, dpre)]

Page 33: Lecture 8 Advanced Topics in Least Squares - Part Two -

[Figure: the marginal distribution $p(m) = \int p(d, m)\, dd$ plotted against m; the maximum likelihood point is marked at mest]
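Continuing the same numerical sketch (mine, not from the slides), with the ±0.2 theory accuracy of Page 27, one can multiply all three factors on a grid and compare the m at the joint maximum with the m at the maximum of the marginal p(m).

dobs = 0.8;  sigd = 0.4;  mA = 1.0;  sigm = 1.25;  sigT = 0.2;   % example numbers from Pages 24 and 27
m = linspace(0, 2, 401);  d = linspace(0, 2, 401);
[Mg, Dg] = meshgrid(m, d);
P = exp(-0.5*((Dg - dobs)/sigd).^2) ...     % pA(d)
 .* exp(-0.5*((Mg - mA)/sigm).^2) ...       % pA(m)
 .* exp(-0.5*((Dg - Mg)/sigT).^2);          % pT(d,m) for the theory d = m
[~, k] = max(P(:));  [id, im] = ind2sub(size(P), k);
fprintf('joint maximum at    m = %.3f, d = %.3f\n', m(im), d(id));
pm = sum(P, 1);                             % marginal p(m): integrate (sum) over d
[~, im2] = max(pm);
fprintf('marginal maximum at m = %.3f\n', m(im2));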

Page 34: Lecture 8 Advanced Topics in Least Squares - Part Two -

special case of an exact theory

in the limit $C_G \to 0$,

$$\exp\left\{ -\tfrac{1}{2} (d - Gm)^T C_G^{-1} (d - Gm) \right\} \;\to\; \delta(d - Gm)$$

where $\delta$ is the Dirac delta function, with the property $\int f(x)\, \delta(x - y)\, dx = f(y)$, and so

$$p(m) = \int p(d, m)\, dd = p_A(m) \int p_A(d)\, \delta(d - Gm)\, dd = p_A(m)\, p_A(d = Gm)$$

so for Normal distributions

$$p(m) \propto \exp\Big\{ -\tfrac{1}{2} \big[ (Gm - d^{obs})^T C_d^{-1} (Gm - d^{obs}) + (m - m_A)^T C_m^{-1} (m - m_A) \big] \Big\}$$

Page 35: Lecture 8 Advanced Topics in Least Squares - Part Two -

special case of an exact theory

maximizing p(m) is equivalent to minimizing

$$(Gm - d^{obs})^T C_d^{-1} (Gm - d^{obs}) \;+\; (m - m_A)^T C_m^{-1} (m - m_A)$$

that is, the weighted “prediction error” plus the weighted “distance of the model from its prior value”

Page 36: Lecture 8 Advanced Topics in Least Squares - Part Two -

the solution, calculated via the usual messy minimization process:

$$m^{est} = m_A + M\, [\, d^{obs} - G m_A \,], \qquad \text{where } M = [\, G^T C_d^{-1} G + C_m^{-1} \,]^{-1} G^T C_d^{-1}$$

Don’t memorize, but be prepared to use it (e.g. in the homework).
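A hedged MATLAB sketch of this estimate; the straight-line G, the covariances, and the prior below are all made up for illustration, not taken from the lecture.

% made-up example: straight-line fit with a loose prior pulling m toward mA = 0
x    = (1:10)';
G    = [ones(10,1), x];                  % data kernel
dobs = 2 + 0.5*x + 0.1*randn(10,1);      % synthetic observations
Cd   = 0.1^2 * eye(10);                  % prior covariance of the data
mA   = [0; 0];                           % prior model
Cm   = 10^2 * eye(2);                    % prior covariance of the model (loose)
M    = (G'/Cd*G + inv(Cm)) \ (G'/Cd);    % M = [G' Cd^-1 G + Cm^-1]^-1 G' Cd^-1
mest = mA + M*(dobs - G*mA)              % close to [2; 0.5] because the prior is loose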

Page 37: Lecture 8 Advanced Topics in Least Squares - Part Two -

interesting interpretation

$$m^{est} - m_A = M\, [\, d^{obs} - G m_A \,]$$

the left-hand side is the estimated model minus its prior value; the bracketed term on the right is the observed data minus the prediction of the prior model; the matrix M is the linear connection between the two

Page 38: Lecture 8 Advanced Topics in Least Squares - Part Two -

special case of no prior information: $C_m \to \infty$

$$M = [\, G^T C_d^{-1} G + C_m^{-1} \,]^{-1} G^T C_d^{-1} \;\to\; [\, G^T C_d^{-1} G \,]^{-1} G^T C_d^{-1}$$

$$\begin{aligned}
m^{est} &= m_A + [G^T C_d^{-1} G]^{-1} G^T C_d^{-1} [\, d^{obs} - G m_A \,] \\
&= m_A + [G^T C_d^{-1} G]^{-1} G^T C_d^{-1} d^{obs} - [G^T C_d^{-1} G]^{-1} G^T C_d^{-1} G\, m_A \\
&= m_A + [G^T C_d^{-1} G]^{-1} G^T C_d^{-1} d^{obs} - m_A \\
&= [G^T C_d^{-1} G]^{-1} G^T C_d^{-1} d^{obs}
\end{aligned}$$

recovers weighted least squares

Page 39: Lecture 8 Advanced Topics in Least Squares - Part Two -

special case of infinitely accurate prior information: $C_m \to 0$

$$M = [\, G^T C_d^{-1} G + C_m^{-1} \,]^{-1} G^T C_d^{-1} \;\to\; 0$$

$$m^{est} = m_A + 0 = m_A$$

recovers prior value of m

Page 40: Lecture 8 Advanced Topics in Least Squares - Part Two -

special uncorrelated case: $C_m = \sigma_m^2 I$ and $C_d = \sigma_d^2 I$

$$\begin{aligned}
M &= [\, G^T C_d^{-1} G + C_m^{-1} \,]^{-1} G^T C_d^{-1} \\
&= [\, \sigma_d^{-2} G^T G + \sigma_m^{-2} I \,]^{-1} G^T \sigma_d^{-2} \\
&= [\, G^T G + (\sigma_d / \sigma_m)^2 I \,]^{-1} G^T
\end{aligned}$$

this formula is sometimes called “damped least squares”, with “damping factor” $\varepsilon = \sigma_d / \sigma_m$

Page 41: Lecture 8 Advanced Topics in Least Squares - Part Two -

Damped least squares makes the process of avoiding singular matrices associated with insufficient data trivially easy:

you just add $\varepsilon^2 I$ to $G^T G$ before computing the inverse
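A minimal MATLAB sketch of damped least squares (my own illustration, with made-up data and damping factor); note that it returns a usable answer even for the same-x-over-and-over case that made plain least squares singular.

N = 5;
G = [ones(N,1), 2*ones(N,1)];                    % every measurement at x = 2
dobs = 3*ones(N,1) + 0.01*randn(N,1);
eps2 = 0.1^2;                                    % damping factor epsilon^2 = (sigma_d/sigma_m)^2
mest = (G'*G + eps2*eye(size(G,2))) \ (G'*dobs)  % the damped matrix is invertible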

Page 42: Lecture 8 Advanced Topics in Least Squares - Part Two -

$$G^T G \;\to\; G^T G + \varepsilon^2 I$$

this process regularizes the matrix, so its inverse always exists

its interpretation is: in the absence of relevant data, assume the model parameter has its prior value

Page 43: Lecture 8 Advanced Topics in Least Squares - Part Two -

Are you only fooling yourself ?

It depends …

is the assumption - that you know the prior value - a good one?

Page 44: Lecture 8 Advanced Topics in Least Squares - Part Two -

Smoothness Constraints

e.g. model is smooth when its second derivative is small

$$\frac{d^2 m_i}{dx^2} \;\propto\; m_{i-1} - 2 m_i + m_{i+1}$$

(assuming the data are organized according to one spatial variable)

Page 45: Lecture 8 Advanced Topics in Least Squares - Part Two -

matrix D approximates second derivative

$$D = \begin{bmatrix} 1 & -2 & 1 & 0 & 0 & 0 & \cdots \\ 0 & 1 & -2 & 1 & 0 & 0 & \cdots \\ & & & & \ddots & & \\ 0 & 0 & 0 & \cdots & 1 & -2 & 1 \end{bmatrix}$$

$$\frac{d^2 m}{dx^2} \;\propto\; D\, m$$
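One way to build D in MATLAB (a sketch of mine, for M model parameters; the slides do not prescribe a particular construction):

M = 8;                           % number of model parameters (made up)
D = zeros(M-2, M);
for i = 1:M-2
    D(i, i:i+2) = [1, -2, 1];    % row i approximates d2m/dx2 at interior point i+1
end
disp(D)

For large problems the same matrix is usually assembled as a sparse matrix (e.g. with spdiags) rather than a full one.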

Page 46: Lecture 8 Advanced Topics in Least Squares - Part Two -

Choosing a smooth solution is thus equivalent to minimizing

$$(Dm)^T (Dm) = m^T (D^T D)\, m$$

Comparing this to the

$$(m - m_A)^T C_m^{-1} (m - m_A)$$

minimization implied by the general solution

$$m^{est} = m_A + M\, [\, d^{obs} - G m_A \,], \qquad M = [\, G^T C_d^{-1} G + C_m^{-1} \,]^{-1} G^T C_d^{-1}$$

indicates that we should make the choices

$$m_A = 0 \qquad \text{and} \qquad C_m^{-1} = D^T D$$

to implement smoothness.
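Putting the pieces together, a hedged MATLAB sketch of smoothness-damped least squares; the data kernel G and the data dobs below are made up for illustration, and an adjustable overall strength epsilon has been added in front of the slide's choice Cm⁻¹ = DᵀD.

Nd = 20;  Nm = 20;
G    = eye(Nd, Nm);                                % made-up data kernel
dobs = sin(2*pi*(1:Nd)'/Nd) + 0.1*randn(Nd,1);     % made-up noisy data
sigd = 0.1;  Cdinv = eye(Nd) / sigd^2;             % inverse data covariance
D = zeros(Nm-2, Nm);
for i = 1:Nm-2
    D(i, i:i+2) = [1, -2, 1];                      % second-difference operator
end
epsilon = 1;                                       % strength of the smoothness constraint
mest = (G'*Cdinv*G + epsilon^2*(D'*D)) \ (G'*Cdinv*dobs);
plot(1:Nm, mest, 'o-');  xlabel('i');  ylabel('m^{est}_i');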