Transcript
Page 1:

PCA

1

10-601 Introduction to Machine Learning

Matt Gormley
Lecture 26
April 20, 2020

Machine Learning Department
School of Computer Science
Carnegie Mellon University

Page 2:

Reminders

- Homework 8: Reinforcement Learning
  - Out: Fri, Apr 10
  - Due: Wed, Apr 22 at 11:59pm
- Homework 9: Learning Paradigms
  - Out: Wed, Apr 22
  - Due: Wed, Apr 29 at 11:59pm
  - Can only be submitted up to 3 days late, so we can return grades before the final exam
- Today's In-Class Poll
  - http://poll.mlcourse.org

4

Page 3:

ML Big Picture

5

Learning Paradigms: What data is available and when? What form of prediction?
supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, active learning, imitation learning, domain adaptation, online learning, density estimation, recommender systems, feature learning, manifold learning, dimensionality reduction, ensemble learning, distant supervision, hyperparameter optimization

Problem Formulation: What is the structure of our output prediction?
- boolean: Binary Classification
- categorical: Multiclass Classification
- ordinal: Ordinal Classification
- real: Regression
- ordering: Ranking
- multiple discrete: Structured Prediction
- multiple continuous: (e.g. dynamical systems)
- both discrete & cont.: (e.g. mixed graphical models)

Theoretical Foundations: What principles guide learning?
probabilistic, information theoretic, evolutionary search, ML as optimization

Facets of Building ML Systems: How to build systems that are robust, efficient, adaptive, effective?
1. Data prep
2. Model selection
3. Training (optimization / search)
4. Hyperparameter tuning on validation data
5. (Blind) Assessment on test data

Big Ideas in ML: Which are the ideas driving development of the field?
inductive bias, generalization / overfitting, bias-variance decomposition, generative vs. discriminative, deep nets, graphical models, PAC learning, distant rewards

Application Areas: Key challenges? NLP, Speech, Computer Vision, Robotics, Medicine, Search

Page 4:

Learning Paradigms

6

Page 5:

DIMENSIONALITY REDUCTION

7

Page 6:

PCA Outline

- Dimensionality Reduction
  - High-dimensional data
  - Learning (low dimensional) representations
- Principal Component Analysis (PCA)
  - Examples: 2D and 3D
  - Data for PCA
  - PCA Definition
  - Objective functions for PCA
  - PCA, Eigenvectors, and Eigenvalues
  - Algorithms for finding Eigenvectors / Eigenvalues
- PCA Examples
  - Face Recognition
  - Image Compression

8

Page 7:

High Dimension Data

Examples of high dimensional data:
- High resolution images (millions of pixels)

9

Page 8:

High Dimension Data

Examples of high dimensional data:
- Multilingual News Stories (vocabulary of hundreds of thousands of words)

10

Page 9:

High Dimension Data

Examples of high dimensional data:
- Brain Imaging Data (100s of MBs per scan)

11

Image from https://pixabay.com/en/brain-mrt-magnetic-resonance-imaging-1728449/
Image from (Wehbe et al., 2014)

Page 10:

High Dimension Data

Examples of high dimensional data:
- Customer Purchase Data

12

Page 11:

Learning Representations

PCA, Kernel PCA, ICA: Powerful unsupervised learning techniques for extracting hidden (potentially lower dimensional) structure from high dimensional datasets.

Useful for:

- Visualization
- Further processing by machine learning algorithms
- More efficient use of resources (e.g., time, memory, communication)
- Statistical: fewer dimensions → better generalization
- Noise removal (improving data quality)

Slide from Nina Balcan

Page 12:

Shortcut Example

17

https://www.youtube.com/watch?v=MlJN9pEfPfE

Photo from https://www.springcarnival.org/booth.shtml

Page 13:

PRINCIPAL COMPONENT ANALYSIS (PCA)

18

Page 14:

PCA Outline

- Dimensionality Reduction
  - High-dimensional data
  - Learning (low dimensional) representations
- Principal Component Analysis (PCA)
  - Examples: 2D and 3D
  - Data for PCA
  - PCA Definition
  - Objective functions for PCA
  - PCA, Eigenvectors, and Eigenvalues
  - Algorithms for finding Eigenvectors / Eigenvalues
- PCA Examples
  - Face Recognition
  - Image Compression

19

Page 15:

Principal Component Analysis (PCA)

In the case where the data lies on or near a low d-dimensional linear subspace, the axes of this subspace are an effective representation of the data.

Identifying these axes is known as Principal Components Analysis, and they can be obtained using classic matrix computation tools (Eigen or Singular Value Decomposition).

Slide from Nina Balcan

Page 16:

2D Gaussian dataset

Slide from Barnabas Poczos

Page 17:

1st PCA axis

Slide from Barnabas Poczos

Page 18:

2nd PCA axis

Slide from Barnabas Poczos

Page 19:

Data for PCA

We assume the data is centered, i.e. the sample mean is zero: $\boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}^{(i)} = \mathbf{0}$.

24

Q: What if your data is not centered?
A: Subtract off the sample mean.
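Since everything that follows assumes centered data, here is a minimal numpy sketch of that preprocessing step (the function name and data are illustrative, not from the slides):

```python
import numpy as np

def center(X):
    """Center an N x M data matrix by subtracting the per-feature sample mean."""
    mu = X.mean(axis=0)   # sample mean, shape (M,)
    return X - mu, mu     # keep mu so projections can be un-centered later

# usage on synthetic, deliberately uncentered data
X = np.random.randn(100, 5) + 3.0
Xc, mu = center(X)
assert np.allclose(Xc.mean(axis=0), 0.0)
```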

Page 20:

Sample Covariance Matrix

The sample covariance matrix is given by:

$$\boldsymbol{\Sigma} = \frac{1}{N}\sum_{i=1}^{N} (\mathbf{x}^{(i)} - \boldsymbol{\mu})(\mathbf{x}^{(i)} - \boldsymbol{\mu})^T$$

25

Since the data matrix is centered ($\boldsymbol{\mu} = \mathbf{0}$), we can rewrite this as:

$$\boldsymbol{\Sigma} = \frac{1}{N}\mathbf{X}^T\mathbf{X}$$
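A small numpy check that the two forms agree on centered data (an illustrative verification, not part of the slides):

```python
import numpy as np

N, M = 100, 5
X = np.random.randn(N, M)
Xc = X - X.mean(axis=0)  # centered data matrix

# definition: average of outer products of the centered points
Sigma_def = sum(np.outer(x, x) for x in Xc) / N

# matrix form, valid once the data is centered
Sigma_mat = Xc.T @ Xc / N

assert np.allclose(Sigma_def, Sigma_mat)
```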

Page 21:

Principal Component Analysis (PCA)

Whiteboard
- Strawman: random linear projection
- PCA Definition
- Objective functions for PCA

26

Page 22:

Maximizing the Variance

Quiz: Consider the two projections below.

1. Which maximizes the variance?
2. Which minimizes the reconstruction error?

27

Option A    Option B

Page 23:

Background: Eigenvectors & Eigenvalues

For a square matrix A (an n x n matrix), the vector v (an n x 1 matrix) is an eigenvector iff there exists an eigenvalue λ (scalar) such that:

Av = λv

28

The linear transformation A only stretches the vector v. That is, λv is a scalar multiple of v.
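As a quick numerical illustration of the definition (the matrix here is a hypothetical example, not from the slides):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is appropriate for symmetric matrices (e.g. covariance matrices)
eigvals, eigvecs = np.linalg.eigh(A)

# verify Av = lambda * v for each eigenpair
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)

print(eigvals)  # [1. 3.]
```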

Page 24:

Principal Component Analysis (PCA)

Whiteboard
- PCA, Eigenvectors, and Eigenvalues

29

Page 25:

PCA

30

Equivalence of Maximizing Variance and Minimizing Reconstruction Error
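The whiteboard derivation is not captured in this transcript; the key identity (a standard argument, reconstructed here) is that for a unit vector $\mathbf{v}$:

$$\|\mathbf{x}^{(i)} - (\mathbf{v}^T\mathbf{x}^{(i)})\mathbf{v}\|^2 = \|\mathbf{x}^{(i)}\|^2 - (\mathbf{v}^T\mathbf{x}^{(i)})^2$$

Summed over the centered data, the first term does not depend on $\mathbf{v}$, so minimizing the total reconstruction error is equivalent to maximizing $\sum_{i}(\mathbf{v}^T\mathbf{x}^{(i)})^2$, i.e. the sample variance of the projection.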

Page 26:

PCA: the First Principal Component

31
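The slide's equations did not survive extraction; the standard statement of the objective, consistent with the variance view above, is:

$$\mathbf{v}_1 = \underset{\|\mathbf{v}\| = 1}{\mathrm{argmax}} \; \frac{1}{N}\sum_{i=1}^{N} (\mathbf{v}^T\mathbf{x}^{(i)})^2 = \underset{\|\mathbf{v}\| = 1}{\mathrm{argmax}} \; \mathbf{v}^T\boldsymbol{\Sigma}\mathbf{v}$$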

Page 27:

Algorithms for PCA

How do we find principal components (i.e. eigenvectors)?

- Power iteration (aka. Von Mises iteration)
  - finds each principal component one at a time in order (see the sketch below)
- Singular Value Decomposition (SVD)
  - finds all the principal components at once
  - two options:
    - Option A: run SVD on X^T X
    - Option B: run SVD on X (not obvious why Option B should work…)
- Stochastic Methods (approximate)
  - very efficient for high dimensional datasets with lots of points

32
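A minimal power-iteration sketch in numpy (illustrative only; the deflation step for finding subsequent components is an assumption about the intended algorithm, not transcribed from the lecture):

```python
import numpy as np

def power_iteration(Sigma, n_iters=1000):
    """Leading eigenvector/eigenvalue of a symmetric matrix via power iteration."""
    v = np.random.randn(Sigma.shape[0])
    for _ in range(n_iters):
        v = Sigma @ v
        v /= np.linalg.norm(v)
    lam = v @ Sigma @ v  # Rayleigh quotient gives the eigenvalue
    return v, lam

def top_k_components(Sigma, k):
    """Find the top-k principal components one at a time via deflation."""
    components = []
    S = Sigma.copy()
    for _ in range(k):
        v, lam = power_iteration(S)
        components.append(v)
        S = S - lam * np.outer(v, v)  # deflate: remove the recovered direction
    return np.array(components)
```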

Page 28:

Background: SVD

Singular Value Decomposition (SVD)

33
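The slide's formulas are missing from the transcript; the factorization it refers to is the standard one:

$$\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T$$

where the columns of $\mathbf{U}$ (left singular vectors) and of $\mathbf{V}$ (right singular vectors) are orthonormal, and $\mathbf{S}$ is diagonal with non-negative singular values.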

Page 29:

SVD for PCA

34
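The derivation on this slide did not survive extraction; a short numpy sketch of PCA via SVD (Option B from the algorithms slide), under the same centering and $\frac{1}{N}$ conventions as above:

```python
import numpy as np

def pca_svd(X, k):
    """PCA via SVD on the centered data matrix X (N x M); returns top-k PCs."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]               # rows are principal components
    eigvals = (s ** 2) / X.shape[0]   # eigenvalues of the sample covariance
    return components, eigvals[:k]
```

This also suggests why Option B works: if $\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T$, then $\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{S}^2\mathbf{V}^T$, so the right singular vectors of $\mathbf{X}$ are exactly the eigenvectors of the sample covariance matrix.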

Page 30:

Principal Component Analysis (PCA)

$\boldsymbol{\Sigma}\mathbf{v} = \lambda\mathbf{v}$, so $\mathbf{v}$ (the first PC) is an eigenvector of the sample correlation/covariance matrix $\boldsymbol{\Sigma}$.

Sample variance of the projection: $\mathbf{v}^T\boldsymbol{\Sigma}\mathbf{v} = \lambda\mathbf{v}^T\mathbf{v} = \lambda$

Thus, the eigenvalue $\lambda$ denotes the amount of variability captured along that dimension (aka amount of energy along that dimension).

Eigenvalues: $\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \cdots$

- The 1st PC $\mathbf{v}_1$ is the eigenvector of the sample covariance matrix $\boldsymbol{\Sigma}$ associated with the largest eigenvalue
- The 2nd PC $\mathbf{v}_2$ is the eigenvector of the sample covariance matrix $\boldsymbol{\Sigma}$ associated with the second largest eigenvalue
- And so on …

Slide from Nina Balcan

Page 31:

How Many PCs?

- For M original dimensions, the sample covariance matrix is M x M, and has up to M eigenvectors. So M PCs.
- Where does dimensionality reduction come from? Can ignore the components of lesser significance.

[Figure: bar chart of Variance (%) for PC1 through PC10.]

- You do lose some information, but if the eigenvalues are small, you don't lose much:
  - M dimensions in original data
  - calculate M eigenvectors and eigenvalues
  - choose only the first D eigenvectors, based on their eigenvalues
  - final data set has only D dimensions

Variance (%) = ratio of variance along a given principal component to total variance of all principal components

© Eric Xing @ CMU, 2006-2011

36
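A small sketch of this selection rule, choosing the smallest D that retains a target fraction of the total variance (the 95% threshold is illustrative, not from the slides):

```python
import numpy as np

def choose_num_components(eigvals, threshold=0.95):
    """Smallest D whose top-D eigenvalues explain >= threshold of total variance."""
    eigvals = np.sort(eigvals)[::-1]                  # sort descending
    cumulative_ratio = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(cumulative_ratio, threshold) + 1)
```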

Page 32:

PCA EXAMPLES

37

Page 33:

Projecting MNIST digits

38

Task Setting:
1. Take 25x25 images of digits and project them down to K components
2. Report percent of variance explained for K components
3. Then project back up to a 25x25 image to visualize how much information was preserved
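A sketch of steps 1 and 3, assuming the components and mean come from a routine like the pca_svd sketch above (625 = 25 x 25 follows the image size stated on the slide):

```python
import numpy as np

def project(X, mu, components):
    """Project flattened images (N x 625) down to K components."""
    return (X - mu) @ components.T       # codes, shape (N, K)

def reconstruct(Z, mu, components):
    """Project K-dimensional codes back up to image space."""
    return Z @ components + mu           # shape (N, 625); reshape rows to 25x25 to view
```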

Page 34:

Projecting MNIST digits

39

Task Setting:
1. Take 25x25 images of digits and project them down to 2 components
2. Plot the 2 dimensional points
3. Here we look at all ten digits 0 - 9

Page 35:

Projecting MNIST digits

40

Task Setting:
1. Take 25x25 images of digits and project them down to 2 components
2. Plot the 2 dimensional points
3. Here we look at just four digits 0, 1, 2, 3

Page 36:

Learning Objectives

Dimensionality Reduction / PCA

You should be able to…
1. Define the sample mean, sample variance, and sample covariance of a vector-valued dataset
2. Identify examples of high dimensional data and common use cases for dimensionality reduction
3. Draw the principal components of a given toy dataset
4. Establish the equivalence of minimization of reconstruction error with maximization of variance
5. Given a set of principal components, project from high to low dimensional space and do the reverse to produce a reconstruction
6. Explain the connection between PCA, eigenvectors, eigenvalues, and covariance matrix
7. Use common methods in linear algebra to obtain the principal components

41