Page 1: Logistic Regression

Robot Image Credit: Viktoriya Sukhanova © 123RF.com

These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.

Page 2: Classification Based on Probability

• Instead of just predicting the class, give the probability of the instance being that class
  – i.e., learn $p(y \mid x)$

• Comparison to perceptron:
  – Perceptron doesn't produce a probability estimate
  – Perceptron (and other discriminative classifiers) are only interested in producing a discriminative model

• Recall that:
  $p(\text{event}) + p(\neg\text{event}) = 1$
  $0 \le p(\text{event}) \le 1$

Page 3: Logistic Regression

• Takes a probabilistic approach to learning discriminative functions (i.e., a classifier)

• $h_\theta(x)$ should give $p(y = 1 \mid x; \theta)$
  – Want $0 \le h_\theta(x) \le 1$
  – Can't just use linear regression with a threshold

• Logistic regression model:

  $h_\theta(x) = g(\theta^T x)$, where $g(z) = \frac{1}{1 + e^{-z}}$ is the logistic/sigmoid function

  Substituting gives $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$
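A minimal NumPy sketch of this model (a sketch only; the helper names `sigmoid` and `h` are mine, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic/sigmoid function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x); returns a value in (0, 1)."""
    return sigmoid(theta @ x)
```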

Page 4: Interpretation of Hypothesis Output

• $h_\theta(x)$ = estimated $p(y = 1 \mid x; \theta)$

• Example: Cancer diagnosis from tumor size

  $x = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} 1 \\ \text{tumorSize} \end{bmatrix}$, $\quad h_\theta(x) = 0.7$

  → Tell the patient that there is a 70% chance of the tumor being malignant

• Note that: $p(y = 0 \mid x; \theta) + p(y = 1 \mid x; \theta) = 1$

  Therefore, $p(y = 0 \mid x; \theta) = 1 - p(y = 1 \mid x; \theta)$

Based on example by Andrew Ng

Page 5: Another Interpretation

• Equivalently, logistic regression assumes that

  $\log \frac{p(y = 1 \mid x; \theta)}{p(y = 0 \mid x; \theta)} = \theta_0 + \theta_1 x_1 + \ldots + \theta_d x_d$

  (the left-hand side is the log odds of $y = 1$)

• In other words, logistic regression assumes that the log odds is a linear function of $x$

• Side note: the odds in favor of an event is the quantity $p/(1-p)$, where $p$ is the probability of the event
  – E.g., if I roll a fair die, what are the odds that I will get a 6? ($p = 1/6$, so the odds are $(1/6)/(5/6) = 1/5$, i.e., 1 to 5)

Based on slide by Xiaoli Fern

Page 6: Logistic Regression

• $h_\theta(x) = g(\theta^T x)$ with $g(z) = \frac{1}{1 + e^{-z}}$

• Assume a threshold and...
  – Predict $y = 1$ if $h_\theta(x) \ge 0.5$
  – Predict $y = 0$ if $h_\theta(x) < 0.5$

• Accordingly, $\theta^T x$ should be a large positive value for positive instances and a large negative value for negative instances (a small sketch of the decision rule follows below)

Based on slide by Andrew Ng
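The thresholded decision rule, sketched in NumPy (self-contained; `predict` is my name for it):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x, threshold=0.5):
    """Predict y = 1 if h_theta(x) >= threshold, else y = 0."""
    return 1 if sigmoid(theta @ x) >= threshold else 0
```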

Page 7: Non-Linear Decision Boundary

• Can apply basis function expansion to the features, same as with linear regression (a sketch follows below):

  $x = \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix} \;\rightarrow\; \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_1 x_2 \\ x_1^2 \\ x_2^2 \\ x_1^2 x_2 \\ x_1 x_2^2 \\ \vdots \end{bmatrix}$
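A minimal sketch of such an expansion for two features (the function name and the particular truncation are my own):

```python
import numpy as np

def expand_features(x1, x2):
    """Polynomial basis expansion of [1, x1, x2], truncated at the terms
    shown on the slide: [1, x1, x2, x1*x2, x1^2, x2^2, x1^2*x2, x1*x2^2]."""
    return np.array([1.0, x1, x2, x1 * x2, x1**2, x2**2, x1**2 * x2, x1 * x2**2])
```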

Page 8: Logistic Regression

• Given $\left\{ \left(x^{(1)}, y^{(1)}\right), \left(x^{(2)}, y^{(2)}\right), \ldots, \left(x^{(n)}, y^{(n)}\right) \right\}$

  where $x^{(i)} \in \mathbb{R}^d$, $y^{(i)} \in \{0, 1\}$,

  $x^T = \begin{bmatrix} 1 & x_1 & \ldots & x_d \end{bmatrix}$, $\quad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix}$

• Model: $h_\theta(x) = g(\theta^T x)$ with $g(z) = \frac{1}{1 + e^{-z}}$

Page 9: Logistic Regression Objective Function

• Can't just use squared loss as in linear regression:

  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2$

  – Using the logistic regression model $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$ results in a non-convex optimization

Page 10: Deriving the Cost Function via Maximum Likelihood Estimation

• Likelihood of the data is given by:

  $l(\theta) = \prod_{i=1}^{n} p\left(y^{(i)} \mid x^{(i)}; \theta\right)$

• So, we are looking for the $\theta$ that maximizes the likelihood:

  $\theta_{MLE} = \arg\max_\theta \, l(\theta) = \arg\max_\theta \prod_{i=1}^{n} p\left(y^{(i)} \mid x^{(i)}; \theta\right)$

• Can take the log without changing the solution:

  $\theta_{MLE} = \arg\max_\theta \log \prod_{i=1}^{n} p\left(y^{(i)} \mid x^{(i)}; \theta\right) = \arg\max_\theta \sum_{i=1}^{n} \log p\left(y^{(i)} \mid x^{(i)}; \theta\right)$

Page 11: Deriving the Cost Function via Maximum Likelihood Estimation

• Expand as follows:

  $\theta_{MLE} = \arg\max_\theta \sum_{i=1}^{n} \log p\left(y^{(i)} \mid x^{(i)}; \theta\right)$

  $\qquad = \arg\max_\theta \sum_{i=1}^{n} \left[ y^{(i)} \log p\left(y^{(i)}{=}1 \mid x^{(i)}; \theta\right) + \left(1 - y^{(i)}\right) \log\left(1 - p\left(y^{(i)}{=}1 \mid x^{(i)}; \theta\right)\right) \right]$

• Substitute in the model, and take the negative to yield:

  $J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right]$

• Logistic regression objective: $\min_\theta J(\theta)$
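A direct NumPy transcription of this objective (a sketch; the vectorized form and the small epsilon for numerical stability are my additions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, eps=1e-12):
    """Negative log-likelihood J(theta) for logistic regression.
    X is an n x (d+1) matrix whose rows are x^(i) (with a leading 1);
    y is a length-n vector of 0/1 labels."""
    h = sigmoid(X @ theta)
    h = np.clip(h, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```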

Page 12: Intuition Behind the Objective

$J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right]$

• Cost of a single instance:

  $\text{cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$

• Can re-write the objective function as

  $J(\theta) = \sum_{i=1}^{n} \text{cost}\left(h_\theta\left(x^{(i)}\right), y^{(i)}\right)$

• Compare to linear regression: $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2$

Page 13: Intuition Behind the Objective

$\text{cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$

• Aside: Recall the plot of $\log(z)$

Page 14: Intuition Behind the Objective

$\text{cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$

If $y = 1$:
• Cost = 0 if prediction is correct
• As $h_\theta(x) \to 0$, $\text{cost} \to \infty$
• Captures intuition that larger mistakes should get larger penalties
  – e.g., predict $h_\theta(x) = 0$, but $y = 1$

[Plot: cost vs. $h_\theta(x) \in [0, 1]$ for the case $y = 1$]

Based on example by Andrew Ng

Page 15: Intuition Behind the Objective

$\text{cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$

If $y = 0$:
• Cost = 0 if prediction is correct
• As $(1 - h_\theta(x)) \to 0$, $\text{cost} \to \infty$
• Captures intuition that larger mistakes should get larger penalties

[Plot: cost vs. $h_\theta(x) \in [0, 1]$ for the cases $y = 1$ and $y = 0$]

Based on example by Andrew Ng

Page 16: Regularized Logistic Regression

• We can regularize logistic regression exactly as before:

  $J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right]$

  $J_{\text{regularized}}(\theta) = J(\theta) + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2 = J(\theta) + \frac{\lambda}{2} \left\| \theta_{[1:d]} \right\|_2^2$
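Extending the hypothetical `cost` sketch from above with this penalty (note that $\theta_0$ is not regularized, since the sum starts at $j = 1$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_regularized(theta, X, y, lam, eps=1e-12):
    """J(theta) plus the L2 penalty (lambda/2) * ||theta[1:]||^2.
    theta[0] (the bias term) is excluded from the penalty."""
    h = np.clip(sigmoid(X @ theta), eps, 1.0 - eps)
    nll = -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
    return nll + 0.5 * lam * np.sum(theta[1:] ** 2)
```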

Page 17: Gradient Descent for Logistic Regression

Want $\min_\theta J(\theta)$, where

$J_{\text{reg}}(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right] + \frac{\lambda}{2} \left\| \theta_{[1:d]} \right\|_2^2$

• Initialize $\theta$
• Repeat until convergence:

  $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$   (simultaneous update for $j = 0 \ldots d$)

• Use the natural logarithm ($\ln = \log_e$) to cancel with the $\exp()$ in $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$

Page 18: Gradient Descent for Logistic Regression

Want $\min_\theta J(\theta)$, where

$J_{\text{reg}}(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right] + \frac{\lambda}{2} \left\| \theta_{[1:d]} \right\|_2^2$

• Initialize $\theta$
• Repeat until convergence (simultaneous update for $j = 0 \ldots d$):

  $\theta_0 \leftarrow \theta_0 - \alpha \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)$

  $\theta_j \leftarrow \theta_j - \alpha \left[ \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]$
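A compact sketch of this loop (vectorized; the fixed iteration count stands in for a convergence test and is my simplification):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha, lam, num_iters=1000):
    """Batch gradient descent for regularized logistic regression.
    X is n x (d+1) with a leading column of ones; y is a 0/1 vector."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        error = sigmoid(X @ theta) - y      # h_theta(x^(i)) - y^(i)
        grad = X.T @ error                  # sum_i error_i * x_j^(i)
        grad[1:] += lam * theta[1:]         # regularize all but theta_0
        theta -= alpha * grad               # simultaneous update
    return theta
```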

Page 19: Gradient Descent for Logistic Regression

• Initialize $\theta$
• Repeat until convergence (simultaneous update for $j = 0 \ldots d$):

  $\theta_0 \leftarrow \theta_0 - \alpha \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)$

  $\theta_j \leftarrow \theta_j - \alpha \left[ \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]$

This looks IDENTICAL to linear regression!!!
• Ignoring the $1/n$ constant
• However, the form of the model is very different:

  $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$

Page 20: Stochastic Gradient Descent

Page 21: Consider Learning with Numerous Data

• Logistic regression objective:

  $J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log\left(1 - h_\theta(x_i)\right) \right]$

• Fit via gradient descent:

  $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x_i) - y_i \right) x_{ij}$

  where each summand is the per-instance gradient $\frac{\partial}{\partial \theta_j} \text{cost}_\theta(x_i, y_i)$

• What is the computational complexity in terms of $n$?

Page 22: Gradient Descent

Batch Gradient Descent:
  Initialize $\theta$
  Repeat {
    $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x_i) - y_i \right) x_{ij}$   for $j = 0 \ldots d$
  }
  (each update uses the full gradient $\frac{\partial}{\partial \theta_j} J(\theta)$)

Stochastic Gradient Descent:
  Initialize $\theta$
  Randomly shuffle the dataset
  Repeat (typically 1–10×) {
    For $i = 1 \ldots n$, do
      $\theta_j \leftarrow \theta_j - \alpha \left( h_\theta(x_i) - y_i \right) x_{ij}$   for $j = 0 \ldots d$
  }
  (each update uses the single-instance gradient $\frac{\partial}{\partial \theta_j} \text{cost}_\theta(x_i, y_i)$)
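A sketch of the stochastic variant (one epoch = one shuffled sweep over the data; `num_epochs` is my name for the "typically 1–10×" repeat count):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd(X, y, alpha, num_epochs=10, seed=0):
    """Stochastic gradient descent: update theta on one instance at a time."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for i in rng.permutation(len(y)):       # randomly shuffle the dataset
            error = sigmoid(X[i] @ theta) - y[i]
            theta -= alpha * error * X[i]       # single-instance gradient step
    return theta
```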

Page 23: Batch vs. Stochastic GD

[Plots: convergence trajectories of Batch GD vs. Stochastic GD]

• The learning rate $\alpha$ is typically held constant
• Can slowly decrease $\alpha$ over time to force $\theta$ to converge, e.g.:

  $\alpha_t = \frac{\text{constant}_1}{\text{iterationNumber} + \text{constant}_2}$

Based on slide by Andrew Ng
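This schedule as a one-liner (the constants here are arbitrary placeholder values, not from the slides):

```python
def learning_rate(t, c1=1.0, c2=100.0):
    """Decaying learning rate: alpha_t = c1 / (t + c2)."""
    return c1 / (t + c2)
```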

Page 24: Adagrad

Page 25: New Stochastic Gradient Algorithms

• So far, we have considered:
  – a constant learning rate $\alpha$
  – a time-dependent learning rate $\alpha_t$ via a pre-set formula

• AdaGrad adjusts the learning rate based on historical information
  – Frequently occurring features in the gradients get small learning rates, and infrequent features get higher ones
  – Key idea: "learn slowly" from frequent features but "pay attention" to rare but informative features

• Define a per-feature learning rate for feature $j$ as:

  $\alpha_{t,j} = \frac{\alpha}{\sqrt{G_{t,j}}}$, where $G_{t,j} = \sum_{k=1}^{t} g_{k,j}^2$ and $g_{k,j} = \frac{\partial}{\partial \theta_j} \text{cost}_\theta(x_k, y_k)$

• $G_{t,j}$ is the sum of squares of the gradients for feature $j$ through time $t$

Page 26: New Stochastic Gradient Algorithms

• Adagrad changes the update rule for SGD at time $t$ from

  $\theta_j \leftarrow \theta_j - \alpha \, g_{t,j}$

  to

  $\theta_j \leftarrow \theta_j - \frac{\alpha}{\sqrt{G_{t,j} + \zeta}} \, g_{t,j}$

  using the Adagrad per-feature learning rate $\alpha_{t,j} = \frac{\alpha}{\sqrt{G_{t,j}}}$, where $G_{t,j} = \sum_{k=1}^{t} g_{k,j}^2$

• In practice, we add a small constant $\zeta > 0$ to prevent divide-by-zero errors

• Adagrad converges quickly:

[Plots: $\|\theta - \theta^*\|_2$ vs. time, with a bad choice for $\alpha$ and with a good choice for $\alpha$]

Plots from http://akyrillidis.github.io/notes/AdaGrad
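A sketch of this update inside the SGD loop (accumulating $G$ per coordinate; the names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adagrad(X, y, alpha, num_epochs=10, zeta=1e-8, seed=0):
    """SGD with AdaGrad's per-feature learning rate alpha / sqrt(G + zeta)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    G = np.zeros(X.shape[1])                # running sum of squared gradients
    for _ in range(num_epochs):
        for i in rng.permutation(len(y)):
            g = (sigmoid(X[i] @ theta) - y[i]) * X[i]   # per-instance gradient
            G += g ** 2
            theta -= alpha / np.sqrt(G + zeta) * g      # per-feature step size
    return theta
```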

Page 27: Multi-Class Classification

Page 28: Multi-Class Classification

• Disease diagnosis: healthy / cold / flu / pneumonia
• Object classification: desk / chair / monitor / bookcase

[Plots: binary classification (two classes in the $x_1$–$x_2$ plane) vs. multi-class classification (several classes in the $x_1$–$x_2$ plane)]

Page 29: Multi-Class Logistic Regression

• For 2 classes:

  $h_\theta(x) = \frac{1}{1 + \exp(-\theta^T x)} = \frac{\exp(\theta^T x)}{1 + \exp(\theta^T x)}$

  In the last expression, the 1 in the denominator is the weight assigned to $y = 0$, and $\exp(\theta^T x)$ is the weight assigned to $y = 1$

• For $C$ classes $\{1, \ldots, C\}$:

  $p(y = c \mid x; \theta_1, \ldots, \theta_C) = \frac{\exp(\theta_c^T x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^T x)}$

  – Called the softmax function
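A small NumPy sketch of the softmax (the max-subtraction trick for numerical stability is my addition; softmax is invariant to shifting all scores):

```python
import numpy as np

def softmax_probs(Theta, x):
    """p(y = c | x) = exp(theta_c^T x) / sum_{c'} exp(theta_{c'}^T x).
    Theta is a C x (d+1) matrix whose rows are the per-class theta_c."""
    z = Theta @ x
    z -= z.max()            # stabilize: softmax is shift-invariant
    e = np.exp(z)
    return e / e.sum()
```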

Page 30: Multi-Class Logistic Regression

• Train a logistic regression classifier for each class $c$ to predict the probability that $y = c$ with

  $h_c(x) = \frac{\exp(\theta_c^T x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^T x)}$

[Plot: a multi-class dataset in the $x_1$–$x_2$ plane, split into One vs. Rest subproblems]

Page 31: Implementing Multi-Class Logistic Regression

• Use $h_c(x) = \frac{\exp(\theta_c^T x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^T x)}$ as the model for class $c$

• Gradient descent simultaneously updates all parameters for all models
  – Same derivative as before, just with the above $h_c(x)$

• Predict the class label as the most probable label: $\arg\max_c h_c(x)$
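A sketch of this prediction step (reusing the shape conventions of the hypothetical `softmax_probs` above):

```python
import numpy as np

def predict_class(Theta, x):
    """Return the most probable class label: argmax_c h_c(x).
    Theta is C x (d+1); x includes the leading 1."""
    z = Theta @ x
    e = np.exp(z - z.max())
    return int(np.argmax(e / e.sum()))  # argmax of softmax = argmax of z
```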