CS 2750 Machine Learning
Lecture 8

Classification: Logistic regression. Generative classification model.

Milos Hauskrecht
[email protected]
5329 Sennott Square

Binary classification

• Two classes: Y = {0, 1}
• Our goal is to learn to classify correctly two types of examples
  – Class 0 – labeled as 0
  – Class 1 – labeled as 1
• We would like to learn f: X → {0, 1}
• Zero-one error (loss) function:

    Error_1(x_i, y_i) = 1  if f(x_i, w) ≠ y_i
    Error_1(x_i, y_i) = 0  if f(x_i, w) = y_i

• Error we would like to minimize: E_(x,y)[Error_1(x, y)]
• First step: we need to devise a model of the function f
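The zero-one loss above can be sketched in a few lines of Python; the classifier `f` and the data here are illustrative toys, not from the lecture:

```python
def zero_one_error(f, xs, ys):
    """Average zero-one loss: counts 1 for each misclassified example, 0 otherwise."""
    return sum(1 if f(x) != y else 0 for x, y in zip(xs, ys)) / len(ys)

# Toy classifier: predict class 1 when the single feature is positive.
f = lambda x: 1 if x > 0 else 0
print(zero_one_error(f, [-2.0, -0.5, 1.0, 3.0], [0, 1, 1, 1]))  # → 0.25
```

Only the example at x = −0.5 is misclassified, so the average loss is 1/4.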
Discriminant functions

• One way to represent a classifier is by using
  – Discriminant functions
• Works for binary and multi-way classification
• Idea:
  – For every class i = 0, 1, …, k define a function g_i(x) mapping X → R
  – When the decision on input x should be made, choose the class with the highest value of g_i(x)
• So what happens with the input space? Assume a binary case.
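The "choose the class with the highest value of g_i(x)" rule can be sketched as follows; the three discriminant functions are made up for illustration:

```python
def classify(x, discriminants):
    """Return the index i of the discriminant function with the largest value g_i(x)."""
    scores = [g(x) for g in discriminants]
    return scores.index(max(scores))

# Three illustrative discriminants on a 1-D input.
g0 = lambda x: -x    # favors negative inputs
g1 = lambda x: x     # favors positive inputs
g2 = lambda x: 0.5   # constant score
print(classify(2.0, [g0, g1, g2]))   # → 1
print(classify(-3.0, [g0, g1, g2]))  # → 0
```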
Discriminant functions

[Figure: 2-D input space with two classes of points; the marked region satisfies g_1(x) ≥ g_0(x).]
Discriminant functions

[Figure: the input space is split into a region where g_1(x) ≥ g_0(x) (decide class 1) and a region where g_1(x) ≤ g_0(x) (decide class 0).]
Discriminant functions

• Define decision boundary: the set of points where g_1(x) = g_0(x)

[Figure: the two regions g_1(x) ≥ g_0(x) and g_1(x) ≤ g_0(x), separated by the decision boundary g_1(x) = g_0(x).]
Quadratic decision boundary

[Figure: a quadratic decision boundary g_1(x) = g_0(x) separating the region where g_1(x) ≥ g_0(x) from the region where g_1(x) ≤ g_0(x).]
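A quadratic boundary like the one in the figure can arise from a discriminant difference that is linear in quadratically expanded features; a minimal sketch with weights chosen by hand so the boundary is the unit circle:

```python
# Difference of discriminants on expanded features (1, x1, x2, x1^2, x2^2):
# g_1(x) - g_0(x) = w0 + w1*x1 + w2*x2 + w3*x1^2 + w4*x2^2
# With these illustrative weights the boundary g_1(x) = g_0(x)
# is the circle x1^2 + x2^2 = 1.
w = [-1.0, 0.0, 0.0, 1.0, 1.0]

def g_diff(x1, x2):
    return w[0] + w[1] * x1 + w[2] * x2 + w[3] * x1**2 + w[4] * x2**2

print(g_diff(0.0, 0.0) < 0)  # inside the circle: class 0 region → True
print(g_diff(2.0, 0.0) > 0)  # outside the circle: class 1 region → True
```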
Logistic regression model

• Defines a linear decision boundary
• Discriminant functions:

    g_1(x) = g(w^T x)        g_0(x) = 1 − g(w^T x)

• where g(z) = 1 / (1 + e^(−z)) is a logistic function

    f(x, w) = g_1(x) = g(w^T x)

[Figure: network diagram. Inputs 1, x_1, x_2, …, x_d with weights w_0, w_1, w_2, …, w_d feed a summation unit z = w^T x, followed by the logistic function producing f(x, w).]
Logistic function

• Is also referred to as a sigmoid function
• Replaces the hard threshold function with smooth switching
• Takes a real number and outputs a number in the interval [0, 1]

    g(z) = 1 / (1 + e^(−z))

[Figure: plot of g(z) over z ∈ [−20, 20], rising smoothly from 0 to 1 with g(0) = 0.5.]
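The logistic function g(z) = 1 / (1 + e^(−z)) in code, with its switching behavior at z = 0:

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))    # → 0.5 (the switching point)
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```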
Logistic regression model

• Discriminant functions:

    g_1(x) = g(w^T x)        g_0(x) = 1 − g(w^T x)

• Values of discriminant functions vary in [0, 1]
  – Probabilistic interpretation:

    f(x, w) = p(y = 1 | x, w) = g_1(x) = g(w^T x)

[Figure: the same network diagram, with the output now interpreted as p(y = 1 | x, w).]
Logistic regression

• We learn a probabilistic function f: X → [0, 1]

    f(x, w) = g(w^T x) = p(y = 1 | x, w)

  – where f describes the probability of class 1 given x
• Note that: p(y = 0 | x, w) = 1 − p(y = 1 | x, w)
• Transformation to binary class values:
  – If p(y = 1 | x) ≥ 1/2 then choose 1, else choose 0
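The probabilistic output and the 1/2-threshold decision rule can be sketched as follows; the weight vector here is illustrative, not learned:

```python
import math

def p_class1(x, w):
    """p(y = 1 | x, w) = g(w^T x); x and w include the bias term (x[0] = 1)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w):
    """Choose class 1 when p(y = 1 | x, w) >= 1/2, else choose class 0."""
    return 1 if p_class1(x, w) >= 0.5 else 0

w = [0.5, 2.0, -1.0]                 # illustrative weights [w0, w1, w2]
print(predict([1.0, 1.0, 0.0], w))   # z = 2.5 > 0 → 1
print(predict([1.0, -1.0, 1.0], w))  # z = -2.5 < 0 → 0
```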
Linear decision boundary

• Logistic regression model defines a linear decision boundary
• Why?
• Answer: Compare the two discriminant functions.
• Decision boundary: g_1(x) = g_0(x)
• For the boundary it must hold:

    log [ g_1(x) / g_0(x) ]
      = log [ (1 / (1 + exp(−w^T x))) / (exp(−w^T x) / (1 + exp(−w^T x))) ]
      = log exp(w^T x)
      = w^T x = 0
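A quick numeric check of the derivation: any x with w^T x = 0 gives g_1(x) = g_0(x) = 1/2, so the boundary is exactly the hyperplane w^T x = 0. The weights and the point below are chosen for illustration:

```python
import math

def g(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

w = [1.0, 2.0, -1.0]  # illustrative weights
x = [0.0, 0.5, 1.0]   # chosen so that w^T x = 0
z = sum(wi * xi for wi, xi in zip(w, x))
g1, g0 = g(z), 1.0 - g(z)
print(z, g1, g0)      # → 0.0 0.5 0.5, i.e. x lies on the decision boundary
```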
Logistic regression model. Decision boundary

• LR defines a linear decision boundary
• Example: 2 classes (blue and red points)

[Figure: scatter plot of blue and red points in 2-D, separated by a linear decision boundary.]
Likelihood of outputs

• Let μ_i = p(y_i = 1 | x_i, w) = g(w^T x_i)
• Then the likelihood of the outputs is

    L(D, w) = ∏_{i=1..n} μ_i^(y_i) (1 − μ_i)^(1 − y_i)

• Find weights w that maximize the likelihood of outputs
  – Apply the log-likelihood trick: the optimal weights are the same for both the likelihood and the log-likelihood
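The log-likelihood for the Bernoulli output model can be sketched as below; the weights and data are fixed for illustration, not the result of any optimization:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(w, xs, ys):
    """log L(D, w) = sum_i [ y_i * log(mu_i) + (1 - y_i) * log(1 - mu_i) ],
    where mu_i = p(y_i = 1 | x_i, w) = g(w^T x_i)."""
    ll = 0.0
    for x, y in zip(xs, ys):
        mu = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        ll += y * math.log(mu) + (1 - y) * math.log(1.0 - mu)
    return ll

xs = [[1.0, -2.0], [1.0, -0.5], [1.0, 1.0], [1.0, 3.0]]  # bias feature + one input
ys = [0, 0, 1, 1]
print(log_likelihood([0.0, 1.0], xs, ys))  # a better fit gives a larger value
print(log_likelihood([0.0, 0.0], xs, ys))  # w = 0 gives 4 * log(1/2)
```

Maximizing this quantity over w (e.g. by gradient ascent) gives the same optimum as maximizing the likelihood itself, since the logarithm is monotone.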