Page 1:

Linear Models for Classification
Machine Learning

Torsten Möller and Thomas Torsney-Weir

©Möller/Mori 1

Page 2:

Reading

• Chapter 4 of “Pattern Recognition and Machine Learning” by Bishop

©Möller/Mori 2

Page 3:

Classification

• goal in classification is to take input x and assign it to one of K discrete classes Ck

• Ck typically disjoint (unique class membership)
• input space divided into decision regions
• boundaries: decision boundaries or decision surfaces
• linear models: decision boundaries are hyperplanes; classes that can be separated exactly by hyperplanes are called linearly separable

©Möller/Mori 3

Page 4:

Classification: Hand-written Digit Recognition

xi =

!"#$%&'"()#*$#+ !,- )( !.,$#/+ +%((/#/01/! 233456 334

7/#()#-! 8/#9 */:: )0 "'%! /$!9 +$"$;$!/ +,/ ") "'/ :$1< )(

8$#%$"%)0 %0 :%&'"%0& =>?@ 2ABC D,!" -$</! %" ($!"/#56E'/ 7#)")"97/ !/:/1"%)0 $:&)#%"'- %! %::,!"#$"/+ %0 F%&6 GH6

C! !//0I 8%/*! $#/ $::)1$"/+ -$%0:9 ()# -)#/ 1)-7:/J1$"/&)#%/! *%"' '%&' *%"'%0 1:$!! 8$#%$;%:%"96 E'/ 1,#8/-$#</+ 3BK7#)") %0 F%&6 L !')*! "'/ %-7#)8/+ 1:$!!%(%1$"%)07/#()#-$01/ ,!%0& "'%! 7#)")"97/ !/:/1"%)0 !"#$"/&9 %0!"/$+)( /.,$::9K!7$1/+ 8%/*!6 M)"/ "'$" */ );"$%0 $ >6? 7/#1/0"

/##)# #$"/ *%"' $0 $8/#$&/ )( )0:9 (),# "*)K+%-/0!%)0$:

8%/*! ()# /$1' "'#//K+%-/0!%)0$: );D/1"I "'$0<! ") "'/

(:/J%;%:%"9 7#)8%+/+ ;9 "'/ -$"1'%0& $:&)#%"'-6

!"# $%&'() *+,-. */0+12.33. 4,3,5,6.

N,# 0/J" /J7/#%-/0" %08):8/! "'/ OAPQKR !'$7/ !%:'),/""/

+$"$;$!/I !7/1%(%1$::9 B)#/ PJ7/#%-/0" BPK3'$7/KG 7$#" SI

*'%1' -/$!,#/! 7/#()#-$01/ )( !%-%:$#%"9K;$!/+ #/"#%/8$:

=>T@6 E'/ +$"$;$!/ 1)0!%!"! )( GI?HH %-$&/!U RH !'$7/

1$"/&)#%/!I >H %-$&/! 7/# 1$"/&)#96 E'/ 7/#()#-$01/ %!

-/$!,#/+ ,!%0& "'/ !)K1$::/+ V;,::!/9/ "/!"IW %0 *'%1' /$1'

!"# $%%% &'()*(+&$,)* ,) -(&&%') ()(./*$* ()0 1(+2$)% $)&%..$3%)+%4 5,.6 784 ),6 784 (-'$. 7997

:;<6 #6 (== >? @AB C;DE=FDD;?;BG 1)$*& @BD@ G;<;@D HD;I< >HJ CB@A>G KLM >H@ >? "94999N6 &AB @BO@ FP>QB BFEA G;<;@ ;IG;EF@BD @AB BOFCR=B IHCPBJ

?>==>SBG PT @AB @JHB =FPB= FIG @AB FDD;<IBG =FPB=6

:;<6 U6 M0 >PVBE@ JBE><I;@;>I HD;I< @AB +,$.W79 GF@F DB@6 +>CRFJ;D>I >?@BD@ DB@ BJJ>J ?>J **04 *AFRB 0;D@FIEB K*0N4 FIG *AFRB 0;D@FIEB S;@A!!"#$%&$' RJ>@>@TRBD K*0WRJ>@>N QBJDHD IHCPBJ >? RJ>@>@TRB Q;BSD6 :>J**0 FIG *04 SB QFJ;BG @AB IHCPBJ >? RJ>@>@TRBD HI;?>JC=T ?>J F==>PVBE@D6 :>J *0WRJ>@>4 @AB IHCPBJ >? RJ>@>@TRBD RBJ >PVBE@ GBRBIGBG >I@AB S;@A;IW>PVBE@ QFJ;F@;>I FD SB== FD @AB PB@SBBIW>PVBE@ D;C;=FJ;@T6

:;<6 "96 -J>@>@TRB Q;BSD DB=BE@BG ?>J @S> G;??BJBI@ M0 >PVBE@D ?J>C @AB+,$. GF@F DB@ HD;I< @AB F=<>J;@AC GBDEJ;PBG ;I *BE@;>I !676 X;@A @A;DFRRJ>FEA4 Q;BSD FJB F==>EF@BG FGFR@;QB=T GBRBIG;I< >I @AB Q;DHF=E>CR=BO;@T >? FI >PVBE@ S;@A JBDRBE@ @> Q;BS;I< FI<=B6

ti = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0)

• Each input vector classified into one of K discrete classes
• Denote classes by Ck
• Represent input image as a vector xi ∈ R^784
• We have target vector ti ∈ {0, 1}^10
• Given a training set {(x1, t1), . . . , (xN, tN)}, the learning problem is to construct a “good” function y(x) from these
• y : R^784 → R^10

©Möller/Mori 4

Page 5:

Generalized Linear Models

• Similar to previous chapter on linear models for regression, we will use a “linear” model for classification:

y(x) = f(wTx+ w0)

• This is called a generalized linear model
• f(·) is a fixed non-linear function

• e.g. f(u) = 1 if u ≥ 0, 0 otherwise

• Decision boundary between classes will be a linear function of x

• Can also apply non-linearity to x, as in ϕi(x) for regression

©Möller/Mori 5
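A minimal sketch of this model in code (illustrative only; the numpy helper names are not from the slides):

```python
import numpy as np

def step(u):
    """Threshold non-linearity f: 1 if u >= 0, else 0."""
    return (u >= 0).astype(int)

def glm_classify(X, w, w0):
    """Generalized linear model y(x) = f(w^T x + w0), applied row-wise."""
    return step(X @ w + w0)

# toy usage: 2D points, decision boundary x1 + x2 = 1
X = np.array([[0.0, 0.0], [2.0, 1.0]])
w = np.array([1.0, 1.0])
print(glm_classify(X, w, -1.0))  # [0 1]
```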


Page 8:

Outline

Discriminant Functions

Generative Models

Discriminative Models

©Möller/Mori 8


Page 10:

Discriminant Functions with Two Classes

[Figure: geometry of a linear discriminant in 2D; w is normal to the decision surface y = 0, y(x)/‖w‖ is the signed distance of a point x (with foot x⊥ on the surface), the surface lies at distance −w0/‖w‖ from the origin, and the regions y > 0 (R1) and y < 0 (R2) lie on either side]

• Start with 2-class problem, ti ∈ {0, 1}

• Simple linear discriminant

y(x) = wTx+ w0

apply threshold function to get classification

• Projection of x in the w direction is wTx/‖w‖

©Möller/Mori 10
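A small numerical sketch of this geometry (illustrative; the boundary and point are made up):

```python
import numpy as np

w, w0 = np.array([3.0, 4.0]), -5.0   # boundary: 3*x1 + 4*x2 - 5 = 0
x = np.array([2.0, 1.0])

y = w @ x + w0                        # discriminant value y(x) = 5.0
proj = (w @ x) / np.linalg.norm(w)    # projection of x onto the w direction = 2.0
dist = y / np.linalg.norm(w)          # signed distance of x from the boundary = 1.0
print(y, proj, dist)
```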

Page 11:

Multiple Classes

[Figure: ambiguous regions (marked “?”) when building a multi-class classifier from binary linear discriminants; left: one-versus-the-rest (C1 vs. not C1, C2 vs. not C2), right: one-versus-one (pairwise C1/C2, C1/C3, C2/C3), with regions R1, R2, R3]

• A linear discriminant between two classes separates with a hyperplane

• How to use this for multiple classes?
• One-versus-the-rest method: build K − 1 classifiers, between Ck and all others
• One-versus-one method: build K(K − 1)/2 classifiers, between all pairs

©Möller/Mori 11


Page 17:

Multiple Classes

[Figure: decision regions Ri, Rj, Rk formed by K linear functions; two points xA and xB inside Rk and the line segment between them, which lies entirely within Rk]

• A solution is to build K linear functions yk(x) = wkTx + wk0 and assign x to class arg maxk yk(x)
• Gives connected, convex decision regions: for xA, xB in Rk and x̂ = λxA + (1 − λ)xB,

yk(x̂) = λyk(xA) + (1 − λ)yk(xB) ⇒ yk(x̂) > yj(x̂), ∀j ≠ k

©Möller/Mori 17
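A sketch of this arg-max rule in code (illustrative; the helper name is hypothetical):

```python
import numpy as np

def multiclass_linear(X, W, w0):
    """Assign each row of X to arg max_k (w_k^T x + w_k0).

    W: (K, D) matrix of weight vectors, w0: (K,) biases.
    """
    scores = X @ W.T + w0          # (N, K) matrix of values y_k(x_n)
    return np.argmax(scores, axis=1)

# toy usage with K = 3 classes in D = 2 dimensions
W = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
w0 = np.zeros(3)
X = np.array([[2.0, 0.5], [-1.0, 3.0]])
print(multiclass_linear(X, W, w0))  # [0 2]
```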


Page 19:

Least Squares for Classification

• How do we learn the decision boundaries (wk, wk0)?
• One approach is to use least squares, similar to regression
• Find W to minimize squared error over all examples and all components of the label vector:

E(W) = (1/2) ∑_{n=1}^{N} ∑_{k=1}^{K} (yk(xn) − tnk)²

• After some algebra, we get a solution using the pseudo-inverse, as in regression

©Möller/Mori 19
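A sketch of the pseudo-inverse solution, assuming a bias feature is prepended to x and targets are one-hot vectors (the helper names are hypothetical):

```python
import numpy as np

def least_squares_fit(X, T):
    """Fit W for y(x) = W^T x~ by least squares, with x~ = [1, x].

    X: (N, D) inputs, T: (N, K) one-hot targets.
    Returns W~ of shape (D+1, K) via the pseudo-inverse.
    """
    Xt = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend bias feature
    return np.linalg.pinv(Xt) @ T                  # W~ = (X~^T X~)^{-1} X~^T T

def least_squares_predict(X, W):
    Xt = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(Xt @ W, axis=1)               # pick the largest output
```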


Page 22:

Problems with Least Squares

[Figure: two scatter plots of a 2-class data set; left: least-squares and logistic-regression decision boundaries agree; right: after adding easily classified points far from the boundary, the least-squares boundary shifts while the logistic-regression one does not]

• Looks okay... least squares decision boundary
• Similar to logistic regression decision boundary (more later)
• Gets worse by adding easy points?!
• Why?
• If target value is 1, points far from the boundary will have a high value, say 10; this is a large error, so the boundary is moved

©Möller/Mori 22


Page 26:

More Least Squares Problems

[Figure: a 2D data set with three classes that are easily separated by hyperplanes, but whose least-squares decision regions fail to separate them]

• Easily separated by hyperplanes, but not found using least squares!
• We’ll address these problems later with better models
• First, a look at a different criterion for a linear discriminant

©Möller/Mori 26

Page 27:

Fisher’s Linear Discriminant

• The two-class linear discriminant acts as a projection y = wTx followed by a threshold: classify as C1 if y ≥ −w0
• In which direction w should we project?
• One which separates classes “well”

©Möller/Mori 27


Page 31:

Fisher’s Linear Discriminant

[Figure: two classes in 2D projected onto the line connecting the class means (left) versus onto the Fisher direction (right); the means direction leaves class overlap, the Fisher direction separates the classes]

• A natural idea would be to project in the direction of the line connecting the class means
• However, this is problematic if classes have variance in this direction
• Fisher criterion: maximize the ratio of inter-class separation (between) to intra-class variance (within)

©Möller/Mori 31


Page 35:

Math time - FLD

• Projection yn = wTxn

• Inter-class separation is distance between class means (good):

mk = (1/Nk) ∑_{n∈Ck} wTxn

• Intra-class variance (bad):

sk² = ∑_{n∈Ck} (yn − mk)²

• Fisher criterion:

J(w) = (m2 − m1)² / (s1² + s2²)

maximize wrt w

©Möller/Mori 35


Page 39:

Math time - FLD

J(w) = (m2 − m1)² / (s1² + s2²) = wTSBw / wTSWw

Between-class covariance: SB = (m2 − m1)(m2 − m1)T

Within-class covariance: SW = ∑_{n∈C1} (xn − m1)(xn − m1)T + ∑_{n∈C2} (xn − m2)(xn − m2)T

Lots of math:

w ∝ SW−1(m2 − m1)

If the covariance SW is isotropic, this reduces to the class mean difference vector

©Möller/Mori 39
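A sketch of computing the Fisher direction w ∝ SW−1(m2 − m1) (illustrative helper, not from the slides):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction w ∝ S_W^{-1} (m2 - m1).

    X1: (N1, D) samples of class 1, X2: (N2, D) samples of class 2.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter: sum of outer products around each class mean
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)   # solve instead of forming the inverse
    return w / np.linalg.norm(w)
```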


Page 41:

FLD Summary

• FLD is a dimensionality reduction technique (more later in the course)
• Criterion for choosing the projection based on class labels
• Still suffers from outliers (e.g. earlier least squares example)

©Möller/Mori 41

Page 42:

Perceptrons

• “Perceptron” is used to refer to many neural network structures

• The classic type is a fixed non-linear transformation of input, one layer of adaptive weights, and a threshold:

y(x) = f(wTϕ(x))

• Developed by Rosenblatt in the 50s
• The main difference compared to the methods we’ve seen so far is the learning algorithm

©Möller/Mori 42

Page 43:

Perceptron Learning

• Two class problem
• For ease of notation, we will use t = 1 for class C1 and t = −1 for class C2
• We saw that squared error was problematic
• Instead, we’d like to minimize the number of misclassified examples
• An example is mis-classified if wTϕ(xn)tn < 0
• Perceptron criterion:

EP(w) = −∑_{n∈M} wTϕ(xn)tn

where the sum is over mis-classified examples only

©Möller/Mori 43


Page 47:

Perceptron Learning Algorithm

• Minimize the error function using stochastic gradient descent (gradient descent per example):

w(τ+1) = w(τ) − η∇EP(w) = w(τ) + ηϕ(xn)tn   (update applied only if xn is incorrectly classified)

• Iterate over all training examples, only change w if the example is mis-classified
• Guaranteed to converge if data are linearly separable
• Will not converge if not
• May take many iterations
• Sensitive to initialization

©Möller/Mori 47
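A sketch of the algorithm (illustrative; it updates on wTϕ(xn)tn ≤ 0, a slight extension of the slide’s condition, so that learning can start from w = 0):

```python
import numpy as np

def perceptron_train(Phi, t, eta=1.0, max_epochs=100):
    """Perceptron learning: w += eta * phi_n * t_n on each mistake.

    Phi: (N, M) feature vectors phi(x_n), t: (N,) labels in {-1, +1}.
    Converges only if the data are linearly separable in feature space.
    """
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for phi_n, t_n in zip(Phi, t):
            if (w @ phi_n) * t_n <= 0:   # mis-classified (or on the boundary)
                w += eta * phi_n * t_n
                mistakes += 1
        if mistakes == 0:                # every example correct: done
            break
    return w
```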


Page 50:

Perceptron Learning Illustration

[Figure: four snapshots of the perceptron algorithm on a 2-class data set in [−1, 1]², showing the decision boundary after successive updates until all points are correctly classified]

• Note there are many hyperplanes with 0 error
• Support vector machines (in a few weeks) have a nice way of choosing one

©Möller/Mori 50


Page 55:

Limitations of Perceptrons

• Perceptrons can only solve linearly separable problems in feature space
• Same as the other models in this chapter
• Canonical example of a non-separable problem is X-OR
• Real datasets can look like this too

Expressiveness of perceptrons

Consider a perceptron with g = step function (Rosenblatt, 1957, 1960). It can represent AND, OR, NOT, majority, etc. It represents a linear separator in input space: ∑j Wj xj > 0, i.e. W · x > 0.

[Figure: plots of (a) I1 AND I2, (b) I1 OR I2, and (c) I1 XOR I2 over inputs I1, I2 ∈ {0, 1}; AND and OR are linearly separable, while XOR (marked “?”) is not]

(Chapter 20, slide 11)

©Möller/Mori 55

Page 56:

Outline

Discriminant Functions

Generative Models

Discriminative Models

©Möller/Mori 56

Page 57:

Generative vs. Discriminative

• Generative models
  • Can generate synthetic example data
  • Perhaps accurate classification is equivalent to accurate synthesis, e.g. vision and graphics
  • Tend to have more parameters
  • Require good model of the class distributions

• Discriminative models
  • Only usable for classification
  • Don’t solve a harder problem than you need to
  • Tend to have fewer parameters
  • Require good model of the decision boundary

©Möller/Mori 57

Page 58:

Probabilistic Generative Models

• Up to now we’ve looked at learning classification by choosing parameters to minimize an error function
• We’ll now develop a probabilistic approach
• With 2 classes, C1 and C2:

p(C1|x) = p(x|C1)p(C1) / p(x)   (Bayes’ rule)

p(C1|x) = p(x|C1)p(C1) / (p(x, C1) + p(x, C2))   (sum rule)

p(C1|x) = p(x|C1)p(C1) / (p(x|C1)p(C1) + p(x|C2)p(C2))   (product rule)

• In generative models we specify the distribution p(x|Ck) which generates the data for each class

©Möller/Mori 58


Page 64:

Probabilistic Generative Models - Example

• Let’s say we observe x, which is the current temperature
• Determine if we are in Vienna (C1) or Honolulu (C2)
• Generative model:

p(C1|x) = p(x|C1)p(C1) / (p(x|C1)p(C1) + p(x|C2)p(C2))

• p(x|C1) is distribution over typical temperatures in Vienna, e.g. p(x|C1) = N(x; 10, 5)
• p(x|C2) is distribution over typical temperatures in Honolulu, e.g. p(x|C2) = N(x; 25, 5)
• Class priors p(C1) = 0.1, p(C2) = 0.9
• p(C1|x = 15) = (0.0484 · 0.1) / (0.0484 · 0.1 + 0.0108 · 0.9) ≈ 0.33

©Möller/Mori 64
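A sketch reproducing this computation, assuming the second argument of N(x; µ, 5) is a standard deviation (which matches the densities 0.0484 and 0.0108 quoted above):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of N(x; mu, sigma), with sigma a standard deviation."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

x = 15.0
lik_vienna = normal_pdf(x, 10.0, 5.0)     # ~0.0484
lik_honolulu = normal_pdf(x, 25.0, 5.0)   # ~0.0108
prior_v, prior_h = 0.1, 0.9

posterior_v = lik_vienna * prior_v / (lik_vienna * prior_v + lik_honolulu * prior_h)
print(round(posterior_v, 2))  # 0.33
```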


Page 66:

Generalized Linear Models

• We can write the classifier in another form

p(C1|x) = p(x|C1)p(C1) / (p(x|C1)p(C1) + p(x|C2)p(C2)) = 1 / (1 + exp(−a)) ≡ σ(a)

where a = ln [ p(x|C1)p(C1) / (p(x|C2)p(C2)) ]

• This looks like gratuitous math, but if a takes a simple form this is another generalized linear model, which we have been studying
• Of course, we will see how such a simple form a = wTx + w0 arises naturally

©Möller/Mori 66


Page 69:

Logistic Sigmoid

[Figure: plot of the logistic sigmoid σ(a) for a ∈ [−5, 5], rising smoothly from 0 through 0.5 at a = 0 towards 1]

• The function σ(a) = 1/(1 + exp(−a)) is known as the logistic sigmoid
• It squashes the real axis down to [0, 1]
• It is continuous and differentiable
• It avoids the problems encountered with the “too correct” least-squares error fitting (later)

©Möller/Mori 69

Page 70:

Multi-class Extension

• There is a generalization of the logistic sigmoid to K > 2 classes:

p(Ck|x) = p(x|Ck)p(Ck) / ∑_j p(x|Cj)p(Cj) = exp(ak) / ∑_j exp(aj)

where ak = ln p(x|Ck)p(Ck)

• a.k.a. the softmax function
• If some ak ≫ aj, p(Ck|x) goes to 1

©Möller/Mori 70
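A sketch of the softmax (illustrative; subtracting the maximum is a standard numerical-stability trick, not from the slides):

```python
import numpy as np

def softmax(a):
    """Softmax p_k = exp(a_k) / sum_j exp(a_j), shifted for numerical stability."""
    a = a - np.max(a)   # subtracting a constant leaves the result unchanged
    e = np.exp(a)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # roughly [0.66, 0.24, 0.10]
print(softmax(np.array([10.0, 0.0, 0.0])))  # first entry close to 1
```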


Page 72:

Gaussian Class-Conditional Densities

• Back to that a in the logistic sigmoid for 2 classes
• Let’s assume the class-conditional densities p(x|Ck) are Gaussians, and have the same covariance matrix Σ:

p(x|Ck) = (2π)^{−D/2} |Σ|^{−1/2} exp{ −(1/2) (x − µk)TΣ−1(x − µk) }

• a takes a simple form:

a = ln [ p(x|C1)p(C1) / (p(x|C2)p(C2)) ] = wTx + w0

• Note that the quadratic terms xTΣ−1x cancel

©Möller/Mori 72
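For reference, carrying out this expansion gives the standard closed forms (this step appears in Bishop, Section 4.2, not on this slide):

w = Σ−1(µ1 − µ2)
w0 = −(1/2) µ1TΣ−1µ1 + (1/2) µ2TΣ−1µ2 + ln [ p(C1)/p(C2) ]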


Page 76:

Maximum Likelihood Learning

• We can fit the parameters of this model using maximum likelihood
• Parameters are µ1, µ2, Σ, p(C1) ≡ π, p(C2) ≡ 1 − π
• Refer to these collectively as θ

• For a datapoint xn from class C1 (tn = 1):

p(xn, C1) = p(C1)p(xn|C1) = πN (xn|µ1,Σ)

• For a datapoint xn from class C2 (tn = 0):

p(xn, C2) = p(C2)p(xn|C2) = (1− π)N (xn|µ2,Σ)

©Möller/Mori 76


Page 79:

Maximum Likelihood Learning

• The likelihood of the training data is:

p(t|π, µ1, µ2, Σ) = ∏_{n=1}^{N} [πN(xn|µ1, Σ)]^{tn} [(1 − π)N(xn|µ2, Σ)]^{1−tn}

• As usual, ln is our friend:

ℓ(t; θ) = ∑_{n=1}^{N} [ tn ln π + (1 − tn) ln(1 − π) ] + [ tn ln N1 + (1 − tn) ln N2 ]

where the first bracket involves only π and the second only µ1, µ2, Σ (here Nk abbreviates N(xn|µk, Σ))

• Maximize for each separately

©Möller/Mori 79


Page 81:

Maximum Likelihood Learning - Class Priors

• Maximization with respect to the class prior parameter π is straightforward:

∂ℓ(t; θ)/∂π = ∑_{n=1}^{N} [ tn/π − (1 − tn)/(1 − π) ] = 0  ⇒  π = N1/(N1 + N2)

• N1 and N2 are the number of training points in each class
• Prior is simply the fraction of points in each class

©Möller/Mori 81


Page 84:

Maximum Likelihood Learning - Gaussian Parameters

• The other parameters can also be found in the same fashion
• Class means:

µ1 = (1/N1) ∑_{n=1}^{N} tn xn,   µ2 = (1/N2) ∑_{n=1}^{N} (1 − tn) xn

• Means of training examples from each class
• Shared covariance matrix:

Σ = (N1/N) · (1/N1) ∑_{n∈C1} (xn − µ1)(xn − µ1)T + (N2/N) · (1/N2) ∑_{n∈C2} (xn − µ2)(xn − µ2)T

• Weighted average of class covariances

©Möller/Mori 84
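A sketch of these ML estimates in code (illustrative helper, not from the slides):

```python
import numpy as np

def fit_gaussian_generative(X, t):
    """ML fit of the shared-covariance Gaussian class-conditional model.

    X: (N, D) inputs, t: (N,) labels in {0, 1} (t=1 means class C1).
    Returns prior pi = p(C1), class means mu1, mu2, shared covariance Sigma.
    """
    X1, X2 = X[t == 1], X[t == 0]
    N1, N2 = len(X1), len(X2)
    pi = N1 / (N1 + N2)                      # fraction of points in C1
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1      # per-class covariances
    S2 = (X2 - mu2).T @ (X2 - mu2) / N2
    Sigma = (N1 * S1 + N2 * S2) / (N1 + N2)  # weighted average
    return pi, mu1, mu2, Sigma
```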


Page 86:

Probabilistic Generative Models Summary

• Fitting a Gaussian using the ML criterion is sensitive to outliers
• The simple linear form for a in the logistic sigmoid occurs for more than just Gaussian distributions
• It arises for any distribution in the exponential family, a large class of distributions

©Möller/Mori 86

Page 87:

Outline

Discriminant Functions

Generative Models

Discriminative Models

©Möller/Mori 87

Page 88:

Probabilistic Discriminative Models

• Generative model made assumptions about the form of the class-conditional distributions (e.g. Gaussian)
• Resulted in logistic sigmoid of a linear function of x
• Discriminative model: explicitly use the functional form

p(C1|x) = 1 / (1 + exp(−(wTx + w0)))

and find w directly
• For the generative model we had 2M + M(M + 1)/2 + 1 parameters (2M for the two means, M(M + 1)/2 for the shared covariance, 1 for the prior; M is the dimensionality of x)
• Discriminative model will have M + 1 parameters

©Möller/Mori 88


Page 91: Linear Models for Classification

Discriminant Functions Generative Models Discriminative Models

Maximum Likelihood Learning - Discriminative Model

• As usual we can use the maximum likelihood criterion for learning

  p(\mathbf{t}|\mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n}\,\{1 - y_n\}^{1 - t_n}, \quad \text{where } y_n = p(\mathcal{C}_1|\mathbf{x}_n)

• Taking ln and the derivative gives the gradient of the log-likelihood:

  \nabla \ell(\mathbf{w}) = \sum_{n=1}^{N} (t_n - y_n)\,\mathbf{x}_n

• This time there is no closed-form solution, since y_n = \sigma(\mathbf{w}^T\mathbf{x}_n)
• Could use (stochastic) gradient descent - a sketch follows this slide
• But there is a better iterative technique

©Möller/Mori 91
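Before the better technique, a minimal NumPy sketch of the gradient-based option (my own illustration; it assumes a constant 1 has been appended to each xn so the bias is folded into w, and the step size eta is an arbitrary choice):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_gd(X, t, eta=0.5, iters=2000):
    # Gradient ascent on the log-likelihood l(w);
    # grad l(w) = sum_n (t_n - y_n) x_n, with y_n = sigmoid(w^T x_n)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        y = sigmoid(X @ w)
        grad = X.T @ (t - y)
        w = w + (eta / len(t)) * grad   # averaged step for stability
    return w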

Page 95: Linear Models for Classification

Discriminant Functions Generative Models Discriminative Models

Iterative Reweighted Least Squares

• Iterative reweighted least squares (IRLS) is a descent method
• As in gradient descent, start with an initial guess and improve it
• Gradient descent takes a step (how large?) in the gradient direction
• IRLS is a special case of the Newton-Raphson method
  • Approximate the function by its second-order Taylor expansion around the current guess w:

    \hat{f}(\mathbf{v}) = f(\mathbf{w}) + \nabla f(\mathbf{w})^T(\mathbf{v} - \mathbf{w}) + \frac{1}{2}(\mathbf{v} - \mathbf{w})^T \nabla^2 f(\mathbf{w})(\mathbf{v} - \mathbf{w})

  • Minimizing this in closed form is straightforward: the model is quadratic, so its derivatives are linear
• In IRLS this second-order Taylor expansion ends up being a weighted least-squares problem, as in the regression case from last week (a sketch follows this slide)
  • Hence the name IRLS

©Möller/Mori 95
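A compact sketch of the IRLS/Newton step for logistic regression (again my own illustration, with the same bias-in-w convention; if the classes are perfectly separable the Hessian can become ill-conditioned, so treat this as a didactic version):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_irls(X, t, iters=10):
    # Newton-Raphson on the logistic-regression likelihood.
    # Each step solves a weighted least-squares problem:
    #   w <- w - (X^T R X)^{-1} X^T (y - t),  R = diag(y_n (1 - y_n))
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        y = sigmoid(X @ w)
        R = y * (1.0 - y)                  # diagonal weights
        H = X.T @ (R[:, None] * X)         # Hessian of the negative log-likelihood
        g = X.T @ (y - t)                  # gradient of the negative log-likelihood
        w = w - np.linalg.solve(H, g)      # Newton step
    return w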

Page 100: Linear Models for Classification

Discriminant Functions Generative Models Discriminative Models

Newton-Raphson

[Figure 9.16 from Boyd and Vandenberghe: the function f (solid) and its second-order approximation f̂ at x (dashed); the Newton step ∆x_nt is what must be added to x to give the minimizer of f̂.]

• For x ∈ dom f, the vector

  \Delta x_{nt} = -\nabla^2 f(x)^{-1} \nabla f(x)

  is called the Newton step (for f, at x)
• Positive definiteness of \nabla^2 f(x) implies that

  \nabla f(x)^T \Delta x_{nt} = -\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x) < 0

  unless \nabla f(x) = 0, so the Newton step is a descent direction (unless x is optimal)
• The second-order Taylor approximation (or model) f̂ of f at x,

  \hat{f}(x + v) = f(x) + \nabla f(x)^T v + \frac{1}{2} v^T \nabla^2 f(x) v,

  is a convex quadratic function of v and is minimized when v = \Delta x_{nt}
• If f is quadratic, x + \Delta x_{nt} is the exact minimizer of f; since f is twice differentiable, the quadratic model is very accurate near a minimizer x*, so there x + \Delta x_{nt} is a very good estimate of x*

• Figure from Boyd and Vandenberghe, Convex Optimization

• Excellent reference, free for download online: http://www.stanford.edu/~boyd/cvxbook/

©Möller/Mori 100
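To tie the Newton step back to code, a generic sketch (a toy of my own, not from Boyd's book): each iteration adds ∆x_nt = −∇²f(x)^{-1}∇f(x) to the current point.

import numpy as np

def newton_minimize(grad, hess, x0, iters=50, tol=1e-10):
    # Undamped Newton iteration: x <- x + dx_nt,
    # with dx_nt = -hess(x)^{-1} grad(x)
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x + np.linalg.solve(hess(x), -g)
    return x

# Toy convex function f(x) = x0^4 + x1^2, minimized at the origin
grad = lambda x: np.array([4 * x[0] ** 3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0] ** 2, 0.0], [0.0, 2.0]])
print(newton_minimize(grad, hess, x0=[1.0, 1.0]))  # approximately [0, 0]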

Page 101: Linear Models for Classification

Discriminant Functions Generative Models Discriminative Models

Conclusion

• Readings: Ch. 4.1.1-4.1.4, 4.1.7, 4.2.1-4.2.2, 4.3.1-4.3.3
• Generalized linear models y(\mathbf{x}) = f(\mathbf{w}^T\mathbf{x} + w_0)
  • Threshold/max function for f(·)
  • Minimize with least squares
  • Fisher criterion - class separation
  • Perceptron criterion - mis-classified examples
• Probabilistic models: logistic sigmoid / softmax for f(·)
  • Generative model - assume class-conditional densities in the exponential family; obtain the sigmoid
  • Discriminative model - directly model the posterior using the sigmoid (a.k.a. logistic regression, though it solves classification)
  • Can learn either using maximum likelihood
• All of these models are limited to linear decision boundaries in feature space

©Möller/Mori 101