
Lecture 1: Dimension reduction estimation: can linear methods solve nonlinear problems?

Fall 2014

Hong Kong Baptist University / University of Connecticut


Part I: Why dimension reduction?

Question: Why dimension reduction?

Regression modelling: dimension reduction makes data visualization possible. The current practice of regression analysis is:

Fit a simpler model;

Check the residual plot;

If the residual plot does not show a systematic pattern, then stop; otherwise continue to fit a more complex model.


Why dimension reduction?

Question: How to give a comprehensive residual plot?

In the one-dimensional case the residual plot is informative. Given (X1, Y1), ..., (Xn, Yn), first fit a linear regression model: for 1 ≤ i ≤ n,

Yi = β0 + β1 Xi + εi.

Find the regression estimators:

β̂1 = Σ_{i=1}^n (Yi − Ȳ)(Xi − X̄) / Σ_{i=1}^n (Xi − X̄)²,  β̂0 = Ȳ − β̂1 X̄.  (1)
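As a quick numerical check, the closed-form estimators in equation (1) can be computed directly. The data-generating model below (intercept 2, slope 3, Gaussian noise of scale 0.5) is an illustrative assumption, not from the slides.

```python
import numpy as np

# Simulated one-dimensional data; the true model here (intercept 2,
# slope 3, noise scale 0.5) is an assumption for illustration.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)

# Closed-form least-squares estimators from equation (1).
beta1_hat = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

# Residuals e_i = Y_i - (beta0_hat + beta1_hat * X_i), as used in the residual plot.
e = y - (beta0_hat + beta1_hat * x)
```

By construction the residuals average to zero and are uncorrelated with x, which is exactly why a systematic pattern in the residual plot signals a misspecified model.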


Why dimension reduction?

Let Ŷi = β̂0 + β̂1 Xi be the predicted values and ei = Yi − Ŷi the residuals. We can simply plot ei against Xi. This is called the one-dimensional residual plot.


Figure: Residual Plot


Why dimension reduction?

What should we do if Xi is a vector in Rp? Currently two methods are in frequent use:

(1) The residual plot of ei versus Ŷi (note that Ŷi is always one-dimensional).

(2) The scatter plot matrix, in which we plot ei against each predictor, and each predictor against every other predictor, forming a (p + 1) × (p + 1) matrix of scatter plots.

However, each of these methods is intrinsically marginal – they cannot reflect the whole picture of the regression relation. Let us see this through an example.


Why dimension reduction?

Example 1. 100 pairs, (X1, Y1), ..., (X100, Y100), are generated from some model, where each Xi is a vector in R3 (Xi = (Xi1, Xi2, Xi3)).

The scatter plot matrix is produced. Show the scatter plot matrix below. From the scatter plot matrix the data appear to have the following features:

(1) Y doesn't seem to depend on X2;
(2) Y seems to depend on X1 in a nonlinear way;
(3) Y seems to depend on X3 in a nonlinear way.


Why dimension reduction?

Figure: Scatter Plot Matrix


Why dimension reduction?

However, (X, Y) are actually generated from the following model:

Y = |X1 + X2| + ε,

where ε is independent of X = (X1, X2, X3), and X is multivariate normal with mean 0 and covariance matrix

Σ = [ 1    0    0.8
      0    0.2  0
      0.8  0    1  ].

Note that Y does not depend on X3, and Y does depend on X2.

X2 has much smaller variance than X1 or X3, and X1 and X3 are highly correlated.
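A sketch of how such data could be simulated (the noise scale 0.2 and the larger sample size are illustrative assumptions; the slides use n = 100):

```python
import numpy as np

# Simulate Example 1: Y = |X1 + X2| + eps, with X trivariate normal and
# the covariance matrix stated above. Noise scale 0.2 and n = 2000 are
# illustrative choices, not from the slides.
rng = np.random.default_rng(1)
n = 2000
Sigma = np.array([[1.0, 0.0, 0.8],
                  [0.0, 0.2, 0.0],
                  [0.8, 0.0, 1.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=n)
Y = np.abs(X[:, 0] + X[:, 1]) + rng.normal(scale=0.2, size=n)
```

With draws like these, the marginal scatter of Y against X2 looks nearly flat (X2's variance is only 0.2), even though Y genuinely depends on X2 — which is exactly why the scatter plot matrix misleads here.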


Why dimension reduction?

Once again, the scatter plot matrix cannot capture the true relation between X and Y.

What truly captures the relation between X and Y is the scatter plot of Y versus X1 + X2.

But how can we make this plot before we know that X1 + X2 is the predictor? This is the question of dimension reduction.

Find the linear combination X1 + X2 before any regression modelling is performed!


Models under dimension reduction structure

Some examples:

Linear model: Y = βᵀX + ε,

Generalized linear model: Y = g(βᵀX) + ε for a given monotonic function g,

Single-index model: Y = g(βᵀX) + ε for an unknown function g,

Multi-index model (1): Y = g(βᵀX) + ε, where β is a p × q orthogonal matrix,

Multi-index model (2): Y = g1(β1ᵀX) + g2(β2ᵀX) ε, where β = (β1, β2),

Multi-index model (3): Y = g(βᵀX, ε).


Further observation on dimension reduction structure

In these models, all the information on the response Y can be captured through βᵀX, rather than through the original X!

In other words, when βᵀX is given, no more information on Y can be acquired from the rest of X; Y is then conditionally independent of X if ε is independent of X.

We call this independence the conditional independence, and write it as Y ⊥⊥ X | βᵀX.

Based on this, consider a generic framework in which "model-free" methods can be developed to estimate the parameter of interest and then to establish a model.


Central Subspace

The goal of dimension reduction is to seek β ∈ Rp×q, q < p, such that

Y ⊥⊥ X | βᵀX.

However, the β satisfying Y ⊥⊥ X | βᵀX is not unique, even when q = 1 and β is a vector. To make the notion precise, we first define the column space.

Proposition 1.1. If A is any q × q non-singular matrix, then Y ⊥⊥ X | βᵀX if and only if Y ⊥⊥ X | (βA)ᵀX.

Column space. For a matrix B we denote by S(B) the subspace spanned by the columns of B: let b1, ..., bq be the columns of B; the space consists of all linear combinations c1 b1 + ... + cq bq for constants c1, ..., cq.


Central Subspace

If γ is another matrix such that S(β) ⊆ S(γ), then Y ⊥⊥ X | βᵀX implies Y ⊥⊥ X | γᵀX. Therefore, we are naturally interested in the smallest dimension reduction space, which achieves the maximal reduction of the dimension of X.

Definition. If the intersection of all dimension reduction spaces for (X, Y) is itself a dimension reduction space, this space is called the central space, written SY|X.

Reference: Cook (1994, 1998). Thus the goal of dimension reduction is to find the central space SY|X.


Sufficient plot

Once we know the central space, or equivalently its p × q base matrix β = (β1, ..., βq), a comprehensive scatter plot or residual plot against βᵀX can be informative.

When q = 1, the scatter plot is informative; when q = 2, use spin software to get a comprehensive view of the data. Usually this will suffice for most data analyses.

Example 1 (continued). The model is

Y = |X1 + X2| + ε.

Thus the central space is spanned by (1, 1, 0)ᵀ. The sufficient plot is the scatter plot of Y versus X1 + X2. Show this plot here.


Sufficient plot

Figure: Sufficient plot (Y versus X1 + X2)


Part II: Ordinary least squares

Assumption 2.1. Let β be a p × q matrix whose columns form an orthonormal basis of SY|X. Assume that E(X|βᵀX) is a linear function of X; that is, for a constant vector c and a p × q matrix C, E(X|βᵀX) = c + CβᵀX.

We first make some notes on the intuition and implications of this assumption.

In practice, we do not know β at the outset, so we typically replace this assumption by: E(X|γᵀX) is linear in X for all γ ∈ Rp×q. This is equivalent to elliptical symmetry of X.


Ordinary least squares

An elliptically symmetric distribution can often be approximately achieved by an appropriate transformation of the original data, such as a power or logarithm of the data. See Cook and Weisberg (1994).

Hall and K. C. Li (1993) demonstrated that, if the original dimension p is much larger than the structural dimension q, then E(X|βᵀX) is approximately linear in X.


Ordinary least squares

We can get C = β.

Theorem 2.2. If Assumption 2.1 holds, then

E(X|βᵀX) = c + CβᵀX = Pβ(X),

where Pβ = ββᵀ is the projection operator.
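A quick numerical sanity check (the dimensions p = 5, q = 2 are arbitrary): when β has orthonormal columns, Pβ = ββᵀ behaves as the orthogonal projection onto the column space of β.

```python
import numpy as np

# beta: p x q with orthonormal columns (obtained via QR); P_beta = beta beta^T.
rng = np.random.default_rng(2)
p, q = 5, 2
beta, _ = np.linalg.qr(rng.normal(size=(p, q)))
P = beta @ beta.T

# P is symmetric and idempotent (P @ P == P), has rank q, and fixes
# every vector already in span(beta) -- the defining properties of an
# orthogonal projection.
```

These three properties are exactly what makes E(X|βᵀX) = Pβ(X) a projection of X onto the central subspace in the standardized coordinates.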


Ordinary least squares

Recall the assumption that E(X) = 0 and var(X) = Ip. Then we have

Theorem 2.3. Suppose that Assumption 2.1 holds. Then E(XY) = βc for a q × 1 vector c, so E(XY) is a vector in SY|X. In other words, E(XY) identifies a vector in the central subspace.


Ordinary least squares

Theorem 2.4. Suppose that Y|X follows the model Y = g(βᵀX) + ε, where β = (β1, ..., βp)ᵀ is a vector, and that Assumption 2.1 holds. Then E(XY) is proportional to β in the model.


Ordinary least squares

At the population level, the identifying procedure can be described as follows. First, standardize X to Z = Σx^{−1/2}(X − μ). Identify a vector in SY|Z, then transform back to SY|X = Σx^{−1/2} SY|Z. At the sample level, the estimating procedure is as follows.

Step 1. Compute the sample mean and variance, μ̂ = En(X) and Σ̂x = varn(X), and standardize Xi to Ẑi = Σ̂x^{−1/2}(Xi − μ̂).

Step 2. Center Yi to Ỹi = Yi − E(Y), estimated by Ŷi = Yi − En(Y).

Step 3. Let γ̂ = En(Ẑ Ŷ) estimate E(ZY) ∈ SY|Z.

Step 4. Let β̂ = Σ̂x^{−1/2} γ̂ estimate the corresponding vector in SY|X.
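Steps 1-4 can be sketched as follows. The single-index model used to generate the data (a cubic link with β = (1, 1, 0)/√2) is an illustrative assumption, not from the slides.

```python
import numpy as np

# Illustrative single-index data: Y = (beta^T X)^3 + eps (assumed model).
rng = np.random.default_rng(3)
n, p = 2000, 3
beta = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
X = rng.normal(size=(n, p))
Y = (X @ beta) ** 3 + rng.normal(scale=0.5, size=n)

# Step 1: sample mean/variance; standardize X via Sigma_x^{-1/2}.
mu = X.mean(axis=0)
Sig = np.cov(X.T)
vals, vecs = np.linalg.eigh(Sig)
Sig_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T
Z = (X - mu) @ Sig_inv_half

# Step 2: center Y.
Yc = Y - Y.mean()

# Step 3: gamma_hat = E_n(Z * Y) estimates a vector in S_{Y|Z}.
gamma_hat = (Z * Yc[:, None]).mean(axis=0)

# Step 4: transform back; the normalized vector estimates the direction
# spanning S_{Y|X}.
beta_hat = Sig_inv_half @ gamma_hat
direction = beta_hat / np.linalg.norm(beta_hat)
```

Despite the nonlinear (cubic) link, the linear OLS-type moment E(XY) recovers the direction of β, illustrating Theorem 2.4.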


Applications

For the generalized linear model Y = g(βᵀX) + ε, the single-index model Y = g(βᵀX) + ε, and the transformation model H(Y) = βᵀX + ε, the above result shows that OLS can be applied to estimate the direction γ = β/‖β‖, even when the L2-norm ‖β‖ of β is not 1.

As OLS has a closed form, estimating γ is computationally very efficient under the above nonlinear models.

After that, we can define the one-dimensional variable z = γᵀX, rewrite the model as Y = g(αz) + ε, and estimate α in a one-dimensional setting.

Thus, estimating β = αγ can be performed in this two-step procedure (see Feng and Zhu, 2013, CSDA).
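A minimal sketch of the two-step idea, using a linear link g(t) = t so that the second step is an ordinary one-dimensional least-squares fit (Feng and Zhu treat general g; the model below is an assumption for illustration):

```python
import numpy as np

# Assumed model: Y = beta^T X + eps with beta = (3, 4, 0), so ||beta|| = 5.
rng = np.random.default_rng(4)
n, p = 5000, 3
beta = np.array([3.0, 4.0, 0.0])
X = rng.normal(size=(n, p))
Y = X @ beta + rng.normal(size=n)

# Step 1: OLS recovers the unit direction gamma = beta / ||beta||.
gamma_hat = (X * (Y - Y.mean())[:, None]).mean(axis=0)
gamma_hat /= np.linalg.norm(gamma_hat)

# Step 2: one-dimensional least squares of Y on z = gamma_hat^T X gives alpha.
z = X @ gamma_hat
alpha_hat = np.sum(z * Y) / np.sum(z ** 2)

# Recombine: beta_hat = alpha_hat * gamma_hat.
beta_hat = alpha_hat * gamma_hat
```

The design choice is that the hard p-dimensional problem is reduced to a closed-form direction estimate plus a cheap one-dimensional fit of the scale.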


Part III: Principal Hessian Directions

The biggest disadvantages of OLS: 1) it can estimate at most one direction in the central space; 2) it cannot identify the direction in a symmetric function such as the one in Y = |βᵀX| + ε.

For example, for Y = |X1 + X2| + ε we have β = (1, 1, 0)ᵀ, but it cannot be well identified by OLS.


OLS plot

Figure: OLS Scatter Plot

[Three panels of Y versus the fitted OLS direction: n = 20, estimated direction (−1.13, 0.22, 0.94); n = 50, (0.02, 0.61, −0.12); n = 100, (0.01, 0.11, 1.49).]

Principal Hessian Directions

Another method: consider the conditional mean E(Y|X) of Y given X. When E(Y|X) = E(Y|βᵀX), its second derivative is

∂²E(Y|X) / ∂X∂Xᵀ = β [∂²E(Y|βᵀX) / ∂(βᵀX)∂(βᵀX)ᵀ] βᵀ,

which is a p × p matrix. Now apply Stein's Lemma (1956): when the distribution of X is normal,

E(∂²E(Y|X) / ∂X∂Xᵀ) = E(YXXᵀ) = β E(∂²E(Y|βᵀX) / ∂(βᵀX)∂(βᵀX)ᵀ) βᵀ,

where E(∂²E(Y|βᵀX) / ∂(βᵀX)∂(βᵀX)ᵀ) is a q × q matrix.

The p × p matrix β E(∂²E(Y|βᵀX) / ∂(βᵀX)∂(βᵀX)ᵀ) βᵀ has at most q non-zero eigenvalues, and the corresponding eigenvectors will be proved to be in the central subspace.


Principal Hessian Directions

Assumption 3.1. Assume that the conditional variance

var(X|βᵀX) = C

is a p × p non-random matrix.

This assumption is satisfied if X is multivariate normal.


Principal Hessian directions: population development

Let α be the OLS vector E(XY), and let e be the residual from the simple linear regression, that is,

e = Y − αᵀX.

Note that, in the standardized coordinates, the intercept of the OLS fit is zero, because E(Y) = 0 and E(X) = 0, so E(Y) − αᵀE(X) = 0. That is why there is no constant term in e.

Definition 3.1. The matrix H1 = E(YXXᵀ) is called the y-based Hessian matrix; the matrix H2 = E(eXXᵀ) is called the e-based Hessian matrix.

The central result of Part III is that the column space of either Hessian matrix is a subspace of the central space.


Principal Hessian directions: population development

Theorem 3.1. Suppose that Assumptions 2.1 and 3.1 hold. Then the column space of H1 is a subspace of SY|X.

Theorem 3.2. Suppose that Assumptions 2.1 and 3.1 hold. Then the column space of H2 is a subspace of SY|X.


Sample estimator of pHd

Again, we use the idea of first transforming to Z, estimating SY|Z, and then transforming back to SY|X. We summarize the computation in the following steps.

Step 1. Standardize X1, ..., Xn to Ẑ1, ..., Ẑn and center Y1, ..., Yn to Ŷ1, ..., Ŷn, as described in the algorithm for OLS.

Step 2. Compute the OLS of Ŷi versus Ẑi to get α̂:

α̂ = (varn(Ẑ))^{−1} covn(Ẑ, Ŷ), and α̂0 = En(Ŷ) − α̂ᵀ En(Ẑ) = 0.

Because of the standardization, we have:


Sample estimator of pHd

varn(Ẑ) = Ip and covn(Ẑ, Ŷ) = En(Ẑ Ŷ) − En(Ẑ) En(Ŷ) = En(Ẑ Ŷ).

This means the OLS vector of Ŷ on Ẑ is α̂ = En(Ẑ Ŷ). The residual is êi = Ŷi − α̂ᵀ Ẑi.

Step 3. Construct the y-based and e-based Hessian matrices:

Ĥ1 = En(Ŷ Ẑ Ẑᵀ),  Ĥ2 = En(ê Ẑ Ẑᵀ).

Step 4. Assume, for now, that we know the structural dimension q. Let γ̂1, ..., γ̂q be the q eigenvectors corresponding to the q largest eigenvalues of Ĥ1Ĥ1ᵀ, and let δ̂1, ..., δ̂q be the q eigenvectors corresponding to the q largest eigenvalues of Ĥ2Ĥ2ᵀ. We use γ̂1, ..., γ̂q and δ̂1, ..., δ̂q as the estimators of SY|Z.


Sample estimator of pHd

Let β̂i = Σ̂x^{−1/2} γ̂i and η̂i = Σ̂x^{−1/2} δ̂i.

We then use β̂1, ..., β̂q and η̂1, ..., η̂q as the estimators of SY|X.
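The pHd steps above can be sketched as follows on the model Y = |X1 + X2| + ε. For simplicity the sketch assumes X standard normal (identity covariance) rather than the Σ of Example 1, with noise scale 0.2; both are illustrative assumptions.

```python
import numpy as np

# Assumed setup: X ~ N(0, I_3), Y = |X1 + X2| + eps, structural dimension q = 1.
rng = np.random.default_rng(5)
n, p = 2000, 3
X = rng.normal(size=(n, p))
Y = np.abs(X[:, 0] + X[:, 1]) + rng.normal(scale=0.2, size=n)

# Step 1: standardize X and center Y.
mu = X.mean(axis=0)
Sig = np.cov(X.T)
vals, vecs = np.linalg.eigh(Sig)
Sig_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T
Z = (X - mu) @ Sig_inv_half
Yc = Y - Y.mean()

# Step 2: OLS of Yc on Z (var_n(Z) = I_p, so alpha = E_n(Z * Y)); residuals e.
alpha = (Z * Yc[:, None]).mean(axis=0)
e = Yc - Z @ alpha

# Step 3: y-based and e-based Hessian matrices.
H1 = (Z * Yc[:, None]).T @ Z / n
H2 = (Z * e[:, None]).T @ Z / n

# Step 4: top eigenvector of H2 H2^T estimates S_{Y|Z}; transform back
# with Sigma^{-1/2} to get a direction in S_{Y|X}.
w, V = np.linalg.eigh(H2 @ H2.T)
delta = V[:, np.argmax(w)]
eta = Sig_inv_half @ delta
eta /= np.linalg.norm(eta)
```

Unlike OLS, whose vector α is nearly zero for this symmetric link, the leading pHd eigenvector recovers the direction (1, 1, 0)/√2.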

We have assumed that the structural dimension q is known. In practice it must be determined from the data; there are several proposals in the literature.

The following is the pHd scatter plot for the model Y = |X1 + X2| + ε, described before.


pHd plot

Figure: pHd Scatter Plot

[Three panels of Y versus the first pHd direction (r-pHd1): n = 20, estimated direction (−0.9, −0.31, 0.27); n = 50, (0.72, 0.66, −0.17); n = 100, (−0.62, −0.74, −0.23).]