• Reduces time complexity: less computation
• Reduces space complexity: fewer parameters
• Simpler models are more robust on small datasets
• More interpretable; simpler explanation
• Data visualization (beyond 2 attributes, it gets difficult)
Diagonal elements are the variances σ_i^2 of the individual attributes. Off-diagonal elements describe how fluctuations in one attribute affect fluctuations in another.
\[
S \equiv E\!\left[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^{T}\right]
= \begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\
\sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2
\end{pmatrix}
\]
where (x − μ) is d×1, (x − μ)^T is 1×d, and S is d×d.
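As a quick illustration (a sketch, not from the slides; the data and variable names are made up), the sample covariance matrix can be estimated directly from an N-by-d data matrix:

% Estimate the covariance matrix S from synthetic data
N = 500; d = 3;
X = randn(N, d) * [1 0 0; 0.8 1 0; 0 0.5 1];   % correlated attributes
mu = mean(X, 1);                 % 1-by-d sample mean
Xc = X - mu;                     % center each attribute
S  = (Xc' * Xc) / (N - 1);       % d-by-d sample covariance
% diag(S) holds the attribute variances; S matches cov(X).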
Dividing the off-diagonal elements by the product of the corresponding standard deviations gives the “correlation coefficients”.
Correlation among attributes makes it difficult to say how any one attribute contributes to an effect.
\[
\mathrm{Corr}(x_i, x_j) \equiv \rho_{ij} = \frac{\sigma_{ij}}{\sigma_i \sigma_j}, \qquad -1 \le \rho_{ij} \le 1
\]
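Continuing the sketch above (same assumed variable names), the correlation matrix follows from the covariance matrix in one line; base MATLAB's corrcoef(X) computes the same thing directly from the data:

% Convert covariance S to correlation coefficients
sd = sqrt(diag(S));        % d-by-1 standard deviations
R  = S ./ (sd * sd');      % rho_ij = sigma_ij / (sigma_i * sigma_j)
% diag(R) is all ones; R matches corrcoef(X) up to rounding.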
Consider a linear transformation of the attributes, z = Mx, where M is a d×d matrix. The d features z will also be normally distributed (proof later).
A choice of M that results in a diagonal covariance matrix in feature space has the following advantages:
1. Interpretation of uncorrelated features is easier
2. The total variance of the features is the sum of the diagonal elements
Diagonalization of the covariance matrix:
The transformation z = Mx that leads to a diagonal feature-space covariance has M = W^T, where the columns of W are the eigenvectors of the covariance matrix S.
The collection of eigenvalue equations S w_k = λ_k w_k can be written as SW = WD, where D = diag(λ_1, …, λ_d) and W is formed from the column vectors [w_1 … w_d].
Because S is symmetric, W is orthogonal, so W^T = W^{-1} and W^T S W = W^{-1} W D = D.
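A short sketch (assuming the S computed above) confirming that the eigenvector matrix diagonalizes the covariance:

% Diagonalize S with its eigenvectors
[W, D] = eig(S);        % S*W = W*D; columns of W are eigenvectors
Sz = W' * S * W;        % covariance in feature space z = W'*x
% Sz equals D up to rounding, and W'*W is the identity.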
If we arrange the eigenvectors so that the eigenvalues λ_1 … λ_d are in decreasing order of magnitude, then z_i = w_i^T x, i = 1…k < d, are the “principal components”.
Proportion of Variance (PoV) explained by k principal components (λ_i sorted in descending order) is
\[
\mathrm{PoV} = \frac{\lambda_1 + \cdots + \lambda_k}{\lambda_1 + \cdots + \lambda_d}
\]
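In code (a sketch continuing the example above, with D from the eigendecomposition), the PoV curve and the smallest k capturing, say, 90% of the variance are:

% Proportion of variance explained by the first k components
[lam, idx] = sort(diag(D), 'descend');   % eigenvalues, largest first
PoV = cumsum(lam) / sum(lam);            % PoV for k = 1..d
k   = find(PoV >= 0.90, 1);              % smallest k with PoV >= 90%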
If S has at least 2 distinct eigenvalues, define the 2nd principal component by max Var(z_2), such that ||w_2|| = 1 and w_2 is orthogonal to w_1.
Introduce Lagrange multipliers α and β:
\[
L = \mathbf{w}_2^T S \mathbf{w}_2 - \alpha\!\left(\mathbf{w}_2^T \mathbf{w}_2 - 1\right) - \beta\!\left(\mathbf{w}_2^T \mathbf{w}_1 - 0\right)
\]
Set the gradient of L with respect to w_2 to zero:
\[
2 S \mathbf{w}_2 - 2\alpha \mathbf{w}_2 - \beta \mathbf{w}_1 = 0
\]
Choosing β = 0 (justified below) and α = λ_2 gives S w_2 = λ_2 w_2.
To maximize Var(z_2), choose λ_2 as the second largest eigenvalue.
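The choice β = 0 is in fact forced (one step added here for completeness): premultiplying the gradient equation by w_1^T gives
\[
2\,\mathbf{w}_1^T S \mathbf{w}_2 - 2\alpha\,\mathbf{w}_1^T \mathbf{w}_2 - \beta\,\mathbf{w}_1^T \mathbf{w}_1 = 0 .
\]
Since w_1^T S w_2 = λ_1 w_1^T w_2 = 0 (by symmetry of S and orthogonality of w_1 and w_2) and w_1^T w_1 = 1, it follows that β = 0, leaving the eigenvalue equation S w_2 = α w_2.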
For any d×d matrix M, z = M^T x is a linear transformation of the attributes x that defines the features z.
If the attributes x are normally distributed with mean μ and covariance S, then z is normally distributed with mean M^T μ and covariance M^T S M (proof slide 8).
If M = W, a matrix with columns that are the normalized eigenvectors of S, then the covariance of z is diagonal with elements equal to the eigenvalues of S (proof slide 6)
Arrange the eigenvalues in decreasing order of magnitude and find λ_1 … λ_k that account for most (e.g. 90%) of the total variance; then z_i = w_i^T x are the “principal components”.
Review
MATLAB’s [V,D] = eig(A) returns both the eigenvectors (columns of V) and the eigenvalues (diagonal elements of D), with the eigenvalues in increasing order.
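Since the slides want the eigenvalues in decreasing order, the eig output has to be flipped before projecting. A minimal sketch (the names S, Xc, and the choice k = 2 are assumptions carried over from the examples above):

% Reorder eigenvalues/eigenvectors and project onto the first k PCs
[V, D] = eig(S);
[lam, idx] = sort(diag(D), 'descend');  % largest eigenvalue first
W = V(:, idx);                          % reorder eigenvector columns to match
k = 2;                                  % e.g. keep two components
Z = Xc * W(:, 1:k);                     % N-by-k principal-component scores
% scatter(Z(:,1), Z(:,2)) gives a PC1-vs-PC2 plot like the one below.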
[Figure: scatter plot of PCs 1 and 2; samples 1–34 are cancer, >35 are control. Samples from cancer patients cluster.]
Assignment 5 due 10-30-15
Find the accuracy of a model that classifies all 6 types of beer bottles in glassdata.csv by multivariate linear regression. Find the eigenvalues and eigenvectors of the covariance matrix for the full beer-bottle data set. How many eigenvalues are required to capture more than 90% of the variance? Transform the attribute data by the eigenvectors of the 3 largest eigenvalues. What is the accuracy of a linear model that uses these features?
Plot the accuracy when you successively extend the linear model by including z_1, then z_2, then z_3.