Curse of Dimensionality - George Mason University

Transcript
Page 1: Curse of Dimensionality - George Mason University


Curse of Dimensionality

•  Real world applications deal with spaces with high dimensionality

•  High dimensionality poses serious challenges for the design of pattern recognition techniques

[Figure: oil flow data in two dimensions; red = homogeneous, green = annular, blue = laminar]

Curse of Dimensionality

•  A simple classification approach

Page 2: Curse of Dimensionality - George Mason University


Curse of Dimensionality

•  Going higher in dimensionality…

The number of regions of a regular grid grows exponentially with the dimensionality of the space

[Figure: DMAX/DMIN, the ratio of the distance to the farthest point over the distance to the nearest point, plotted against the dimensionality of the space]
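A minimal sketch (Python with numpy is assumed; this is not part of the original slides) illustrating both points: the number of cells in a regular grid with 10 divisions per axis grows as 10^D, and for randomly drawn points the ratio DMAX/DMIN of the farthest to the nearest distance shrinks toward 1 as the dimensionality grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# A regular grid with 10 divisions per axis has 10**D cells.
for D in (1, 2, 5, 10, 20):
    print(f"D={D:2d}  grid cells = {10**D:.1e}")

# Distance concentration: DMAX/DMIN for 500 random points and a query at the origin.
for D in (2, 10, 100, 1000):
    X = rng.uniform(-0.5, 0.5, size=(500, D))
    d = np.linalg.norm(X, axis=1)
    print(f"D={D:4d}  DMAX/DMIN = {d.max() / d.min():.2f}")
```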

Curse of Dimensionality

Page 3: Curse of Dimensionality - George Mason University


Curse of Dimensionality

Curse of Dimensionality

Page 4: Curse of Dimensionality - George Mason University


Curse of Dimensionality

Curse of Dimensionality

Page 5: Curse of Dimensionality - George Mason University


Uniform distribution in the unit cube $[-0.5,\, 0.5]^q$

Page 6: Curse of Dimensionality - George Mason University


Curse-of-Dimensionality

$K = 1$ (nearest neighbor): the expected distance from a query point to its nearest neighbor among $N$ uniformly distributed points scales as

$$d(q, N) = O\!\left(N^{-1/q}\right)$$

$d(q, N) \Rightarrow$ for fixed $N$, $N^{-1/q} \to 1$ as the dimensionality $q$ grows, so the nearest neighbor ends up nearly as far away as a typical point in the cube.
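A small simulation of this scaling (numpy assumed; illustrative only): for $N$ points drawn uniformly from $[-0.5, 0.5]^q$, the average nearest-neighbor distance approaches the scale of the cube as $q$ increases, so $N$ would have to grow exponentially with $q$ to keep that distance small.

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_nn_distance(N, q, trials=5):
    """Average distance from a random query point to its nearest neighbor
    among N points drawn uniformly from the cube [-0.5, 0.5]^q."""
    dists = []
    for _ in range(trials):
        X = rng.uniform(-0.5, 0.5, size=(N, q))
        query = rng.uniform(-0.5, 0.5, size=q)
        dists.append(np.linalg.norm(X - query, axis=1).min())
    return np.mean(dists)

for q in (1, 2, 5, 10, 50):
    print(f"q={q:3d}  N=1000  d(q, N) ~ {mean_nn_distance(1000, q):.3f}")
```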

Curse-of-Dimensionality

Page 7: Curse of Dimensionality - George Mason University


Curse-of-Dimensionality

Microarray Data Analysis

Page 8: Curse of Dimensionality - George Mason University


Document Classification

Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the message about the image falling on the retina undergoes a step-wise analysis in a system of nerve cells stored in columns. In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image.

sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical nerve, image, Hubel, Wiesel

[Figure: bag-of-words representation of a document. Each document is mapped to a term-frequency vector over the dictionary terms w1, w2, …, wq.]
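A minimal bag-of-words sketch in plain Python (not from the slides; the tiny dictionary below is hypothetical, whereas a real dictionary contains tens of thousands of terms): each document becomes a vector of term frequencies.

```python
from collections import Counter

# Hypothetical toy dictionary w1 ... wq.
dictionary = ["sensory", "brain", "visual", "perception", "retinal",
              "cortex", "eye", "cell", "nerve", "image"]

def term_frequency_vector(text, dictionary):
    """Bag-of-words: count how often each dictionary term occurs in the text."""
    tokens = [t.strip(".,;:").lower() for t in text.split()]
    counts = Counter(tokens)
    return [counts[term] for term in dictionary]

doc = ("Of all the sensory impressions proceeding to the brain, "
       "the visual experiences are the dominant ones.")
print(term_frequency_vector(doc, dictionary))   # [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
```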

Page 9: Curse of Dimensionality - George Mason University


What’s the size of the dictionary?

Page 10: Curse of Dimensionality - George Mason University


Curse-of-Dimensionality

Page 11: Curse of Dimensionality - George Mason University


$$\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad \boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i$$

The 2 × 2 covariance matrix:

$$E\!\left[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^\top\right] = E\!\left[\begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}\begin{pmatrix} x_1-\mu_1, & x_2-\mu_2 \end{pmatrix}\right] = E\begin{pmatrix} (x_1-\mu_1)^2 & (x_1-\mu_1)(x_2-\mu_2) \\ (x_1-\mu_1)(x_2-\mu_2) & (x_2-\mu_2)^2 \end{pmatrix}$$

$$= \frac{1}{N}\sum_{i=1}^{N}\begin{pmatrix} (x_{1i}-\mu_1)^2 & (x_{1i}-\mu_1)(x_{2i}-\mu_2) \\ (x_{1i}-\mu_1)(x_{2i}-\mu_2) & (x_{2i}-\mu_2)^2 \end{pmatrix} = \begin{pmatrix} \frac{1}{N}\sum_{i=1}^{N}(x_{1i}-\mu_1)^2 & \frac{1}{N}\sum_{i=1}^{N}(x_{1i}-\mu_1)(x_{2i}-\mu_2) \\ \frac{1}{N}\sum_{i=1}^{N}(x_{1i}-\mu_1)(x_{2i}-\mu_2) & \frac{1}{N}\sum_{i=1}^{N}(x_{2i}-\mu_2)^2 \end{pmatrix}$$
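A quick numerical check of the estimator above (numpy assumed, synthetic data): the explicit $\frac{1}{N}\sum(\mathbf{x}_i-\boldsymbol{\mu})(\mathbf{x}_i-\boldsymbol{\mu})^\top$ matches `np.cov` with `bias=True` (by default `np.cov` divides by $N-1$ instead of $N$).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 2))              # N samples of x = (x1, x2)

mu = X.mean(axis=0)                          # sample mean (mu1, mu2)
centered = X - mu
cov = centered.T @ centered / len(X)         # (1/N) * sum (x_i - mu)(x_i - mu)^T

print(cov)
print(np.cov(X, rowvar=False, bias=True))    # same matrix (bias=True divides by N)
```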

Page 12: Curse of Dimensionality - George Mason University


Example covariance matrices:

$$\begin{pmatrix} 0.99 & -0.5 \\ -0.5 & 1.06 \end{pmatrix} \qquad \begin{pmatrix} 1.04 & -1.05 \\ -1.05 & 1.15 \end{pmatrix} \qquad \begin{pmatrix} 0.93 & 0.01 \\ 0.01 & 1.05 \end{pmatrix} \qquad \begin{pmatrix} 0.97 & 0.49 \\ 0.49 & 1.04 \end{pmatrix} \qquad \begin{pmatrix} 0.94 & 0.93 \\ 0.93 & 1.03 \end{pmatrix}$$

Dimensionality Reduction

•  Many dimensions are often interdependent (correlated);

We can:

•  Reduce the dimensionality of problems;

•  Transform interdependent coordinates into significant and independent ones;
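The slides do not name a specific method at this point, but one standard way to carry out both steps is principal component analysis: diagonalize the covariance matrix and rotate the data onto its eigenvectors, which yields uncorrelated coordinates ordered by variance. A minimal numpy sketch, reusing the last covariance matrix above to generate correlated data:

```python
import numpy as np

rng = np.random.default_rng(3)
# Correlated 2-D data with a covariance close to the last example matrix above.
X = rng.multivariate_normal([0, 0], [[0.94, 0.93], [0.93, 1.03]], size=1000)

cov = np.cov(X, rowvar=False, bias=True)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigendecomposition of the covariance
Z = (X - X.mean(axis=0)) @ eigvecs           # rotate onto the principal axes

print("variance along the principal axes:", eigvals.round(3))
print(np.cov(Z, rowvar=False, bias=True).round(3))   # (nearly) diagonal
```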

Page 13: Curse of Dimensionality - George Mason University


Decision Theory

•  Decision theory, when combined with probability theory, allows us to make optimal decisions in situations involving uncertainty

•  Training data: input vectors together with their class labels

•  Inference step: determine the joint probability distribution

•  Decision step: make the optimal decision

Decision Theory

Classification example: medical diagnosis problem

•  Input $\mathbf{x}$: the set of pixel intensities in an image

•  Two classes:
  –  $C_1$: absence of cancer
  –  $C_2$: presence of cancer

•  Inference step: estimate the joint distribution $p(\mathbf{x}, C_k)$

•  Decision step: given a new $\mathbf{x}$, predict its class so that a measure of error is minimized according to the given probabilities

Page 14: Curse of Dimensionality - George Mason University


Decision Theory

How do probabilities play a role in decision making?

•  Decision step: given a new $\mathbf{x}$, predict its class

Thus, we are interested in the posterior probabilities $p(C_k \mid \mathbf{x})$.

Intuitively: we want to minimize the chance of assigning $\mathbf{x}$ to the wrong class. Thus, we choose the class that gives the higher posterior probability.
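A minimal sketch of this rule (numpy assumed; the posterior values below are made up for illustration): for each input, pick the class with the largest posterior $p(C_k \mid \mathbf{x})$.

```python
import numpy as np

# Hypothetical posteriors p(C_k | x) for three inputs and two classes.
posteriors = np.array([[0.90, 0.10],
                       [0.30, 0.70],
                       [0.55, 0.45]])

classes = ["cancer", "normal"]
decisions = [classes[k] for k in posteriors.argmax(axis=1)]
print(decisions)   # ['cancer', 'normal', 'cancer']
```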

Minimizing the misclassification rate

•  Goal: minimize the number of misclassifications

We need to find a rule that assigns each input vector $\mathbf{x}$ to one of the possible classes $C_k$.

Such a rule divides the input space into regions $R_k$ (decision regions) so that all points in $R_k$ are assigned to class $C_k$.

Boundaries between decision regions are called decision boundaries.

Page 15: Curse of Dimensionality - George Mason University


Minimizing the misclassification rate

•  Goal: minimize the number of misclassifications

$$p(\text{mistake}) = p(\mathbf{x} \in R_1, C_2) + p(\mathbf{x} \in R_2, C_1) = \int_{R_1} p(\mathbf{x}, C_2)\,d\mathbf{x} + \int_{R_2} p(\mathbf{x}, C_1)\,d\mathbf{x}$$

•  Assign $\mathbf{x}$ to the class that gives the smaller value of the integrand:
  –  Choose $C_1$ if $p(\mathbf{x}, C_1) > p(\mathbf{x}, C_2)$
  –  Choose $C_2$ otherwise

Minimizing the misclassification rate

  –  Choose $C_1$ if $p(\mathbf{x}, C_1) > p(\mathbf{x}, C_2)$
  –  Choose $C_2$ if $p(\mathbf{x}, C_2) > p(\mathbf{x}, C_1)$

Since $p(\mathbf{x}, C_k) = p(C_k \mid \mathbf{x})\, p(\mathbf{x})$, this is equivalent to:

  –  Choose $C_1$ if $p(C_1 \mid \mathbf{x}) > p(C_2 \mid \mathbf{x})$
  –  Choose $C_2$ if $p(C_2 \mid \mathbf{x}) > p(C_1 \mid \mathbf{x})$

Page 16: Curse of Dimensionality - George Mason University


Minimizing the misclassification rate

Optimal decision boundary: the set of points where the two posteriors are equal, $p(C_1 \mid \mathbf{x}) = p(C_2 \mid \mathbf{x})$ (equivalently, $p(\mathbf{x}, C_1) = p(\mathbf{x}, C_2)$).

Minimizing the misclassification rate

Thus, in the general case of $K$ classes: choose the class $C_k$ that gives the largest posterior probability $p(C_k \mid \mathbf{x})$.

Page 17: Curse of Dimensionality - George Mason University


Minimizing the expected loss

[Loss matrix $L_{kj}$: rows indexed by the true class (cancer, normal), columns by the assigned class (cancer, normal).]

The optimal solution is the one that minimizes the expected loss.
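A small sketch of decision making with a loss matrix (numpy assumed; the loss values are illustrative, not taken from the slide): for each possible decision $j$, compute the expected loss $\sum_k L_{kj}\, p(C_k \mid \mathbf{x})$ and pick the decision that minimizes it.

```python
import numpy as np

# Hypothetical loss matrix L[k, j]: rows = true class, columns = decision.
#                decide cancer   decide normal
L = np.array([[      0.0,            1000.0],   # truth: cancer
              [      1.0,               0.0]])  # truth: normal

classes = ["cancer", "normal"]
posterior = np.array([0.05, 0.95])     # p(cancer | x), p(normal | x)

expected_loss = posterior @ L          # sum_k L[k, j] * p(C_k | x), for each decision j
print(expected_loss)                   # [0.95, 50.0]
print("decision:", classes[expected_loss.argmin()])
```

Note how the asymmetric loss makes the rule decide "cancer" even though its posterior probability is only 0.05.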

Minimizing the expected loss

Page 18: Curse of Dimensionality - George Mason University


Minimizing the expected loss

The Reject Option

Page 19: Curse of Dimensionality - George Mason University


Inference and Decision

To make decisions we need the posterior probabilities $p(C_k \mid \mathbf{x})$.

Generative Methods

Model the class-conditional densities $p(\mathbf{x} \mid C_k)$ for each class $C_k$, together with the prior probabilities $p(C_k)$; then obtain the posteriors through Bayes' theorem:

$$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, p(C_k)}{p(\mathbf{x})}, \qquad p(\mathbf{x}) = \sum_k p(\mathbf{x} \mid C_k)\, p(C_k)$$
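A minimal generative-classification sketch (numpy and scipy assumed; the univariate Gaussian class-conditional densities are an illustrative choice, not something specified on the slide): fit $p(x \mid C_k)$ and $p(C_k)$ from training data, then compute the posterior via Bayes' theorem.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
x1 = rng.normal(-1.0, 1.0, size=200)    # training samples from class C1
x2 = rng.normal(+1.5, 0.8, size=300)    # training samples from class C2

# Inference: model p(x | C_k) (here univariate Gaussians) and the priors p(C_k).
params = [(x1.mean(), x1.std()), (x2.mean(), x2.std())]
priors = np.array([len(x1), len(x2)], dtype=float)
priors /= priors.sum()

def posterior(x):
    """p(C_k | x) = p(x | C_k) p(C_k) / p(x), with p(x) = sum_k p(x | C_k) p(C_k)."""
    likelihood = np.array([norm.pdf(x, mu, sd) for mu, sd in params])
    joint = likelihood * priors
    return joint / joint.sum()

print(posterior(0.0))    # posterior probabilities of C1 and C2 at x = 0
```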

Page 20: Curse of Dimensionality - George Mason University


Discriminative Methods

Model the posterior probabilities $p(C_k \mid \mathbf{x})$ directly.

Discriminant Functions

Learn a function $f(\mathbf{x})$ that maps each input $\mathbf{x}$ directly onto a class label.

Page 21: Curse of Dimensionality - George Mason University


Example

•  Classification: Given an input vector $\mathbf{x}$, assign it to one of $K$ classes $C_k$, where $k = 1, \dots, K$

•  The input space is divided into decision regions whose boundaries are called decision boundaries or decision surfaces

•  Linear models: decision surfaces are linear functions of the input vector $\mathbf{x}$. They are defined by $(D-1)$-dimensional hyperplanes within the $D$-dimensional input space

Page 22: Curse of Dimensionality - George Mason University


Linear Models for Classification

•  For regression: $y(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + w_0$

•  For classification, we want to predict class labels, or more generally class posterior probabilities.

•  We transform the linear function of $\mathbf{w}$ using a nonlinear function $f(\cdot)$ so that $y(\mathbf{x}) = f(\mathbf{w}^\top \mathbf{x} + w_0)$. Such models are called generalized linear models.

Linear Discriminant Functions

Two classes: $y(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + w_0$

If $y(\mathbf{x}) \geq 0$, assign $\mathbf{x}$ to $C_1$; otherwise assign $\mathbf{x}$ to $C_2$.

Decision boundary: $y(\mathbf{x}) = 0$

Page 23: Curse of Dimensionality - George Mason University


Linear Discriminant Functions Geometrical properties:

Decision boundary: $y(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + w_0 = 0$

Let $\mathbf{x}_1, \mathbf{x}_2$ be two points which lie on the decision boundary. Then

$$y(\mathbf{x}_1) = \mathbf{w}^\top \mathbf{x}_1 + w_0 = 0, \quad y(\mathbf{x}_2) = \mathbf{w}^\top \mathbf{x}_2 + w_0 = 0 \;\Rightarrow\; \mathbf{w}^\top (\mathbf{x}_1 - \mathbf{x}_2) = 0$$

Hence $\mathbf{w}$ represents the orthogonal direction to the decision boundary.

Geometrical properties (cont'd)

Let $\mathbf{w}^{*} = \mathbf{w} / \|\mathbf{w}\|$ be the unit vector in the direction of $\mathbf{w}$, and let $\mathbf{x}_0$ be any point on the decision boundary, so $y(\mathbf{x}_0) = \mathbf{w}^\top \mathbf{x}_0 + w_0 = 0$.

The projection of $\mathbf{x}_0$ onto the direction of $\mathbf{w}$ is

$$\mathbf{w}^{*\top} \mathbf{x}_0 = \frac{\mathbf{w}^\top \mathbf{x}_0}{\|\mathbf{w}\|} = -\frac{w_0}{\|\mathbf{w}\|}$$

(when $\|\mathbf{w}\| = 1$ this is simply $-w_0$): the signed orthogonal distance of the origin from the decision surface.
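A small numerical check of these facts (numpy assumed; the weight vector and bias are arbitrary illustrative values): the classification rule, the orthogonality of $\mathbf{w}$ to the boundary, and the distance $-w_0/\|\mathbf{w}\|$ of the origin from the boundary.

```python
import numpy as np

w = np.array([3.0, 4.0])        # assumed weight vector
w0 = -10.0                      # assumed bias

def y(x):                       # linear discriminant y(x) = w^T x + w0
    return w @ x + w0

def classify(x):                # assign to C1 if y(x) >= 0, else to C2
    return "C1" if y(x) >= 0 else "C2"

# Two points on the decision boundary y(x) = 0 (the line 3*x1 + 4*x2 = 10).
x1 = np.array([10.0 / 3.0, 0.0])
x2 = np.array([0.0, 2.5])
print("w orthogonal to boundary:", np.isclose(w @ (x1 - x2), 0.0))

# Signed distance of the origin from the boundary: -w0 / ||w|| = 10 / 5 = 2.
print("origin distance:", -w0 / np.linalg.norm(w))

print(classify(np.array([5.0, 5.0])), classify(np.array([0.0, 0.0])))   # C1 C2
```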

Page 24: Curse of Dimensionality - George Mason University


Linear Discriminant Functions Multiple classes

one-versus-the-rest: $K-1$ classifiers, each of which solves a two-class problem of separating points of class $C_k$ from points not in that class

Linear Discriminant Functions Multiple classes

one-versus-one: K(K-1)/2 binary discriminant functions, one for every possible pair of classes.

Page 25: Curse of Dimensionality - George Mason University


Linear Discriminant Functions Multiple classes

Solution: consider a single $K$-class discriminant comprising $K$ linear functions of the form

$$y_k(\mathbf{x}) = \mathbf{w}_k^\top \mathbf{x} + w_{k0}$$

Assign a point $\mathbf{x}$ to class $C_k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x}) \;\; \forall j \neq k$.

The decision boundary between class $C_k$ and class $C_j$ is given by

$$y_k(\mathbf{x}) = y_j(\mathbf{x}) \;\Rightarrow\; (\mathbf{w}_k - \mathbf{w}_j)^\top \mathbf{x} + (w_{k0} - w_{j0}) = 0$$
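A minimal sketch of the $K$-class rule (numpy; the weight values are arbitrary illustrative choices): evaluate $y_k(\mathbf{x})$ for every class and assign $\mathbf{x}$ to the class with the largest value.

```python
import numpy as np

# Assumed parameters for K = 3 classes in a 2-D input space.
W = np.array([[ 1.0,  0.0],     # w_1
              [-1.0,  1.0],     # w_2
              [ 0.0, -1.0]])    # w_3
w0 = np.array([0.0, 0.5, -0.2])

def predict(x):
    """Assign x to the class C_k with the largest y_k(x) = w_k^T x + w_k0."""
    scores = W @ x + w0
    return int(np.argmax(scores)) + 1      # class index 1..K

print(predict(np.array([2.0, 0.5])))       # -> 1
print(predict(np.array([-1.0, 2.0])))      # -> 2
```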

Linear Discriminant Functions

Two approaches:

•  Fisher's linear discriminant

•  Perceptron algorithm

Page 26: Curse of Dimensionality - George Mason University


Fisher’s Linear Discriminant One way to view a linear classification model is in terms of dimensionality reduction.

Two class case: suppose we project $\mathbf{x}$ onto one dimension:

$$y = \mathbf{w}^\top \mathbf{x}$$

Set a threshold $t$: if $y \geq t$, assign $\mathbf{x}$ to $C_1$; otherwise assign $\mathbf{x}$ to $C_2$.

Page 27: Curse of Dimensionality - George Mason University


Page 28: Curse of Dimensionality - George Mason University


Page 29: Curse of Dimensionality - George Mason University


Fisher’s Linear Discriminant

•  Find an orientation along which the projected samples are well separated;

•  This is exactly the goal of linear discriminant analysis (LDA);

•  In other words: we are after the linear projection that best separates the data, i.e. best discriminates data of different classes.

Page 30: Curse of Dimensionality - George Mason University


LDA

•  Training data $\{(\mathbf{x}_i, C_i)\}_{i=1}^{N}$ with $\mathbf{x}_i \in \mathbb{R}^q$ and $C_i \in \{C_1, C_2\}$

•  $N_1$ samples of class $C_1$, $N_2$ samples of class $C_2$

•  Consider $\mathbf{w} \in \mathbb{R}^q$ with $\|\mathbf{w}\| = 1$

•  Then $\mathbf{w}^\top \mathbf{x}$ is the projection of $\mathbf{x}$ along the direction of $\mathbf{w}$

•  We want the projections $\mathbf{w}^\top \mathbf{x}$ where $\mathbf{x} \in C_1$ separated from the projections $\mathbf{w}^\top \mathbf{x}$ where $\mathbf{x} \in C_2$

LDA

•  A measure of the separation between the projected points is the difference of the sample means. For class $C_i$:

$$\mathbf{m}_i = \frac{1}{N_i}\sum_{\mathbf{x} \in C_i} \mathbf{x}, \qquad \tilde{m}_i = \frac{1}{N_i}\sum_{\mathbf{x} \in C_i} \mathbf{w}^\top \mathbf{x} = \mathbf{w}^\top \mathbf{m}_i$$

$$\tilde{m}_1 - \tilde{m}_2 = \mathbf{w}^\top (\mathbf{m}_1 - \mathbf{m}_2)$$

Page 31: Curse of Dimensionality - George Mason University


LDA

•  To obtain good separation of the projected data we really want the difference between the projected means to be large relative to some measure of the standard deviation of each class. For class $C_i$, define the scatter of the projected data:

$$s_i^2 = \sum_{\mathbf{x} \in C_i} \left(\mathbf{w}^\top \mathbf{x} - \tilde{m}_i\right)^2$$

so that $s_1^2 + s_2^2$ is the total within-class scatter of the projected data. We seek

$$\arg\max_{\mathbf{w}} \; \frac{(\tilde{m}_1 - \tilde{m}_2)^2}{s_1^2 + s_2^2}$$

LDA

Page 32: Curse of Dimensionality - George Mason University


LDA

To obtain $J(\mathbf{w}) = \dfrac{(\tilde{m}_1 - \tilde{m}_2)^2}{s_1^2 + s_2^2}$ as an explicit function of $\mathbf{w}$, we define the following matrices:

$$S_i = \sum_{\mathbf{x} \in C_i} (\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^\top, \qquad S_W = S_1 + S_2$$

Then:

$$s_i^2 = \sum_{\mathbf{x} \in C_i} \left(\mathbf{w}^\top \mathbf{x} - \tilde{m}_i\right)^2 = \sum_{\mathbf{x} \in C_i} \left(\mathbf{w}^\top \mathbf{x} - \mathbf{w}^\top \mathbf{m}_i\right)^2 = \mathbf{w}^\top \left[\sum_{\mathbf{x} \in C_i} (\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^\top\right] \mathbf{w} = \mathbf{w}^\top S_i \mathbf{w}$$

LDA

So $s_1^2 = \mathbf{w}^\top S_1 \mathbf{w}$ and $s_2^2 = \mathbf{w}^\top S_2 \mathbf{w}$. Thus:

$$s_1^2 + s_2^2 = \mathbf{w}^\top S_1 \mathbf{w} + \mathbf{w}^\top S_2 \mathbf{w} = \mathbf{w}^\top (S_1 + S_2)\, \mathbf{w} = \mathbf{w}^\top S_W \mathbf{w}$$

Similarly:

$$(\tilde{m}_1 - \tilde{m}_2)^2 = \left(\mathbf{w}^\top \mathbf{m}_1 - \mathbf{w}^\top \mathbf{m}_2\right)^2 = \mathbf{w}^\top (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^\top \mathbf{w} = \mathbf{w}^\top S_B \mathbf{w}$$

where $S_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^\top$.

Page 33: Curse of Dimensionality - George Mason University


LDA

We have obtained:

$$s_1^2 + s_2^2 = \mathbf{w}^\top S_W \mathbf{w}, \qquad (\tilde{m}_1 - \tilde{m}_2)^2 = \mathbf{w}^\top S_B \mathbf{w}$$

$$J(\mathbf{w}) = \frac{(\tilde{m}_1 - \tilde{m}_2)^2}{s_1^2 + s_2^2} = \frac{\mathbf{w}^\top S_B \mathbf{w}}{\mathbf{w}^\top S_W \mathbf{w}}, \qquad \arg\max_{\mathbf{w}} \; \frac{\mathbf{w}^\top S_B \mathbf{w}}{\mathbf{w}^\top S_W \mathbf{w}}$$

LDA

$J(\mathbf{w})$ is maximized when $\left(\mathbf{w}^\top S_B \mathbf{w}\right) S_W \mathbf{w} = \left(\mathbf{w}^\top S_W \mathbf{w}\right) S_B \mathbf{w}$.

We observe that $S_B \mathbf{w} = (\mathbf{m}_1 - \mathbf{m}_2)\left((\mathbf{m}_1 - \mathbf{m}_2)^\top \mathbf{w}\right)$ is always in the direction of $(\mathbf{m}_1 - \mathbf{m}_2)$. Dropping the scalar factors, which only affect the scale of $\mathbf{w}$, gives

$$\mathbf{w} = S_W^{-1} (\mathbf{m}_1 - \mathbf{m}_2)$$

Page 34: Curse of Dimensionality - George Mason University


LDA

Projection onto the line joining the class means

LDA

Solution of LDA

Page 35: Curse of Dimensionality - George Mason University


LDA

$$\mathbf{w} = S_W^{-1} (\mathbf{m}_1 - \mathbf{m}_2)$$
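A minimal numpy sketch of this closed-form solution on synthetic two-class data (the data and its parameters are assumptions for illustration): compute the class means, the within-class scatter $S_W = S_1 + S_2$, and $\mathbf{w} = S_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$.

```python
import numpy as np

rng = np.random.default_rng(5)
X1 = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=200)   # class C1
X2 = rng.multivariate_normal([2, 1], [[1.0, 0.6], [0.6, 1.0]], size=200)   # class C2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)        # class means
S1 = (X1 - m1).T @ (X1 - m1)                     # scatter matrix of C1
S2 = (X2 - m2).T @ (X2 - m2)                     # scatter matrix of C2
Sw = S1 + S2                                     # within-class scatter

w = np.linalg.solve(Sw, m1 - m2)                 # w = Sw^{-1} (m1 - m2)
w /= np.linalg.norm(w)                           # the scale of w is irrelevant

print("LDA direction:", w)
print("projected class means:", w @ m1, w @ m2)  # well separated along w
```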

LDA