Page 1
Logistic Regression

Robot Image Credit: Viktoriya Sukhanova © 123RF.com

These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.
Page 2
Classification Based on Probability
• Instead of just predicting the class, give the probability of the instance being that class
  – i.e., learn $p(y \mid x)$
• Comparison to perceptron:
  – Perceptron doesn't produce a probability estimate
  – Perceptron (and other discriminative classifiers) are only interested in producing a discriminative model
• Recall that:
  $p(\text{event}) + p(\neg\text{event}) = 1$
  $0 \le p(\text{event}) \le 1$
Page 3
Logistic Regression
• Takes a probabilistic approach to learning discriminative functions (i.e., a classifier)
• $h_\theta(x)$ should give $p(y = 1 \mid x; \theta)$
  – Want $0 \le h_\theta(x) \le 1$
  – Can't just use linear regression with a threshold
• Logistic regression model:
  $h_\theta(x) = g(\theta^\top x)$ where $g(z) = \frac{1}{1 + e^{-z}}$ is the logistic/sigmoid function
  $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$
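As a concrete illustration of the model above, here is a minimal NumPy sketch of the sigmoid and of $h_\theta(x)$; the function names `sigmoid` and `predict_proba` are illustrative choices, not from the slides.

```python
import numpy as np

def sigmoid(z):
    """Logistic/sigmoid function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    """h_theta(x) = g(theta^T x) for each row of X.

    X is n-by-(d+1) with a leading column of 1s; theta has length d+1.
    Returns estimates of p(y = 1 | x; theta), each in (0, 1).
    """
    return sigmoid(X @ theta)
```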
Page 4
Interpretation of Hypothesis Output
$h_\theta(x)$ = estimated $p(y = 1 \mid x; \theta)$

Example: Cancer diagnosis from tumor size
  $x = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} 1 \\ \text{tumorSize} \end{bmatrix}$, $\quad h_\theta(x) = 0.7$
  → Tell patient that there is a 70% chance of the tumor being malignant

Note that: $p(y = 0 \mid x; \theta) + p(y = 1 \mid x; \theta) = 1$
Therefore, $p(y = 0 \mid x; \theta) = 1 - p(y = 1 \mid x; \theta)$

Based on example by Andrew Ng
Page 5
Another Interpretation
• Equivalently, logistic regression assumes that
  $\log \frac{p(y = 1 \mid x; \theta)}{p(y = 0 \mid x; \theta)} = \theta_0 + \theta_1 x_1 + \ldots + \theta_d x_d$
  (the left-hand side is the log of the odds of $y = 1$)
• In other words, logistic regression assumes that the log odds is a linear function of $x$

Side Note: the odds in favor of an event is the quantity $p / (1 - p)$, where $p$ is the probability of the event
  E.g., if I toss a fair die, what are the odds that I will roll a 6?  (odds $= \frac{1/6}{5/6} = \frac{1}{5}$)

Based on slide by Xiaoli Fern
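A short derivation (not spelled out on the slide) of why the sigmoid model implies a linear log odds: since $p(y = 1 \mid x; \theta) = \frac{1}{1 + e^{-\theta^\top x}}$ and therefore $p(y = 0 \mid x; \theta) = \frac{e^{-\theta^\top x}}{1 + e^{-\theta^\top x}}$, their ratio is $e^{\theta^\top x}$, so
$\log \frac{p(y = 1 \mid x; \theta)}{p(y = 0 \mid x; \theta)} = \theta^\top x = \theta_0 + \theta_1 x_1 + \ldots + \theta_d x_d$.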
Page 6
Logistic Regression
$h_\theta(x) = g(\theta^\top x)$ where $g(z) = \frac{1}{1 + e^{-z}}$
• $\theta^\top x$ should take large positive values for positive instances ($y = 1$) and large negative values for negative instances ($y = 0$)
• Assume a threshold of 0.5 and...
  – Predict $y = 1$ if $h_\theta(x) \ge 0.5$
  – Predict $y = 0$ if $h_\theta(x) < 0.5$

Based on slide by Andrew Ng
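A step that is only implicit on the slide: since $g(z) \ge 0.5$ exactly when $z \ge 0$, the threshold rule is equivalent to predicting $y = 1$ whenever $\theta^\top x \ge 0$ and $y = 0$ otherwise, i.e., the decision boundary $\theta^\top x = 0$ is linear in $x$.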
Page 7
Non-Linear Decision Boundary
• Can apply basis function expansion to features, same as with linear regression:
  $x = \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix} \rightarrow \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_1 x_2 \\ x_1^2 \\ x_2^2 \\ x_1^2 x_2 \\ x_1 x_2^2 \\ \vdots \end{bmatrix}$
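A minimal NumPy sketch (not from the slides) of this kind of basis function expansion, using exactly the monomials listed above; the helper name `expand_features` is illustrative.

```python
import numpy as np

def expand_features(x1, x2):
    """Map a 2D input (x1, x2) to the expanded feature vector
    [1, x1, x2, x1*x2, x1^2, x2^2, x1^2*x2, x1*x2^2]."""
    return np.array([1.0, x1, x2, x1 * x2,
                     x1 ** 2, x2 ** 2,
                     x1 ** 2 * x2, x1 * x2 ** 2])

# Example: expand a single instance before feeding it to logistic regression
phi = expand_features(0.5, -1.2)
```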
Page 8
Logistic Regression
• Given $\left\{ \left(x^{(1)}, y^{(1)}\right), \left(x^{(2)}, y^{(2)}\right), \ldots, \left(x^{(n)}, y^{(n)}\right) \right\}$
  where $x^{(i)} \in \mathbb{R}^d$, $y^{(i)} \in \{0, 1\}$
• Model: $h_\theta(x) = g(\theta^\top x)$ where $g(z) = \frac{1}{1 + e^{-z}}$,
  $x^\top = \begin{bmatrix} 1 & x_1 & \ldots & x_d \end{bmatrix}$, $\quad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix}$
Page 9
Logistic Regression Objective Function
• Can't just use squared loss as in linear regression:
  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2$
  – Using the logistic regression model $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$ results in a non-convex optimization
Page 10
Deriving the Cost Function via Maximum Likelihood Estimation
• Likelihood of the data is given by:
  $l(\theta) = \prod_{i=1}^{n} p\left(y^{(i)} \mid x^{(i)}; \theta\right)$
• So, we are looking for the θ that maximizes the likelihood:
  $\theta_{\mathrm{MLE}} = \arg\max_\theta\, l(\theta) = \arg\max_\theta \prod_{i=1}^{n} p\left(y^{(i)} \mid x^{(i)}; \theta\right)$
• Can take the log without changing the solution:
  $\theta_{\mathrm{MLE}} = \arg\max_\theta \log \prod_{i=1}^{n} p\left(y^{(i)} \mid x^{(i)}; \theta\right) = \arg\max_\theta \sum_{i=1}^{n} \log p\left(y^{(i)} \mid x^{(i)}; \theta\right)$
Page 11
Deriving the Cost Function via Maximum Likelihood Estimation
• Expand as follows:
  $\theta_{\mathrm{MLE}} = \arg\max_\theta \sum_{i=1}^{n} \log p\left(y^{(i)} \mid x^{(i)}; \theta\right)$
  $= \arg\max_\theta \sum_{i=1}^{n} \left[ y^{(i)} \log p\left(y^{(i)}{=}1 \mid x^{(i)}; \theta\right) + \left(1 - y^{(i)}\right) \log\left(1 - p\left(y^{(i)}{=}1 \mid x^{(i)}; \theta\right)\right) \right]$
• Substitute in the model, and take the negative to yield
  $J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right]$
Logistic regression objective: $\min_\theta J(\theta)$
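A minimal NumPy sketch of this objective; clipping the probabilities away from 0 and 1 to avoid log(0) is an implementation detail, not part of the slide.

```python
import numpy as np

def cost(theta, X, y, eps=1e-12):
    """Negative log-likelihood J(theta) for labels y in {0, 1}.

    X is n-by-(d+1) with a leading column of 1s; theta has length d+1.
    """
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(x^(i)) for every instance
    h = np.clip(h, eps, 1.0 - eps)           # guard against log(0)
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```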
Page 12
Intuition Behind the Objective
  $J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right]$
• Cost of a single instance:
  $\mathrm{cost}\left(h_\theta(x), y\right) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
• Can re-write the objective function as
  $J(\theta) = \sum_{i=1}^{n} \mathrm{cost}\left(h_\theta\left(x^{(i)}\right), y^{(i)}\right)$
Compare to linear regression: $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2$
Page 13
Intuition Behind the Objective
  $\mathrm{cost}\left(h_\theta(x), y\right) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
Aside: Recall the plot of log(z)
Page 14
Intuition Behind the Objective
  $\mathrm{cost}\left(h_\theta(x), y\right) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
If $y = 1$:
• Cost = 0 if the prediction is correct
• As $h_\theta(x) \rightarrow 0$, $\mathrm{cost} \rightarrow \infty$
• Captures the intuition that larger mistakes should get larger penalties
  – e.g., predict $h_\theta(x) = 0$, but $y = 1$

[Figure: plot of cost vs. $h_\theta(x)$ for $y = 1$]

Based on example by Andrew Ng
Page 15
Intuition Behind the Objective
  $\mathrm{cost}\left(h_\theta(x), y\right) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
If $y = 0$:
• Cost = 0 if the prediction is correct
• As $(1 - h_\theta(x)) \rightarrow 0$, $\mathrm{cost} \rightarrow \infty$
• Captures the intuition that larger mistakes should get larger penalties

[Figure: plots of cost vs. $h_\theta(x)$ for $y = 1$ and $y = 0$]

Based on example by Andrew Ng
Page 16
Regularized Logistic Regression
• We can regularize logistic regression exactly as before:
  $J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right]$
  $J_{\text{regularized}}(\theta) = J(\theta) + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2 = J(\theta) + \frac{\lambda}{2} \left\|\theta_{[1:d]}\right\|_2^2$
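A minimal sketch of the regularized objective, mirroring the formula above; note that the bias $\theta_0$ is left out of the penalty, which is what the $\theta_{[1:d]}$ notation indicates. Names are illustrative.

```python
import numpy as np

def cost_regularized(theta, X, y, lam, eps=1e-12):
    """J_reg(theta): negative log-likelihood plus (lambda / 2) * ||theta[1:]||^2."""
    h = np.clip(1.0 / (1.0 + np.exp(-(X @ theta))), eps, 1.0 - eps)
    nll = -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
    return nll + 0.5 * lam * np.sum(theta[1:] ** 2)   # theta_0 is not penalized
```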
Page 17
Gradient Descent for Logistic Regression
Want $\min_\theta J(\theta)$, where
  $J_{\text{reg}}(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right] + \frac{\lambda}{2} \left\|\theta_{[1:d]}\right\|_2^2$
  (Use the natural logarithm (ln = log_e) to cancel with the exp() in $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$)
• Initialize $\theta$
• Repeat until convergence (simultaneous update for $j = 0 \ldots d$):
  $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
Page 18
Gradient Descent for Logistic Regression
Want $\min_\theta J(\theta)$, where
  $J_{\text{reg}}(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right] + \frac{\lambda}{2} \left\|\theta_{[1:d]}\right\|_2^2$
• Initialize $\theta$
• Repeat until convergence (simultaneous update for $j = 0 \ldots d$):
  $\theta_0 \leftarrow \theta_0 - \alpha \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)$
  $\theta_j \leftarrow \theta_j - \alpha \left[ \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]$
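A minimal NumPy sketch of these update equations; the fixed iteration count stands in for "repeat until convergence", and all names are illustrative.

```python
import numpy as np

def gradient_descent(X, y, alpha, lam, num_iters=1000):
    """Batch gradient descent for regularized logistic regression.

    X is n-by-(d+1) with a leading column of 1s; y is in {0, 1}.
    The bias theta_0 is not regularized, matching the update rules above.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(x^(i)) for every instance
        grad = X.T @ (h - y)                     # sum_i (h - y) * x_j^(i), all j at once
        grad[1:] += lam * theta[1:]              # + lambda * theta_j for j >= 1
        theta = theta - alpha * grad             # simultaneous update of all theta_j
    return theta
```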
Page 19
Gradient Descent for Logistic Regression
• Initialize $\theta$
• Repeat until convergence (simultaneous update for $j = 0 \ldots d$):
  $\theta_0 \leftarrow \theta_0 - \alpha \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)$
  $\theta_j \leftarrow \theta_j - \alpha \left[ \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]$
This looks IDENTICAL to linear regression!!!
• Ignoring the 1/n constant
• However, the form of the model is very different:
  $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$
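A condensed derivation (not on the slide) of why the gradient has this familiar form: using $g'(z) = g(z)(1 - g(z))$, we have $\frac{\partial}{\partial \theta_j} h_\theta(x) = h_\theta(x)\left(1 - h_\theta(x)\right) x_j$, so
$\frac{\partial J}{\partial \theta_j} = -\sum_{i=1}^{n} \left[ \frac{y^{(i)}}{h_\theta(x^{(i)})} - \frac{1 - y^{(i)}}{1 - h_\theta(x^{(i)})} \right] h_\theta(x^{(i)}) \left(1 - h_\theta(x^{(i)})\right) x_j^{(i)} = \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$.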
Page 20
Stochastic Gradient Descent
Page 21
Consider Learning with Numerous Data
• Logistic regression objective:
  $J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log\left(1 - h_\theta(x_i)\right) \right]$
• Fit via gradient descent:
  $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x_i) - y_i \right) x_{ij}$
  (each summand is $\frac{\partial}{\partial \theta_j} \mathrm{cost}_\theta(x_i, y_i)$)
• What is the computational complexity in terms of n?
Page 22
Gradient Descent

Batch Gradient Descent:
  Initialize $\theta$
  Repeat {
    $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x_i) - y_i \right) x_{ij}$  for $j = 0 \ldots d$
  }
  (each update uses the full-data gradient $\frac{\partial}{\partial \theta_j} J(\theta)$)

Stochastic Gradient Descent:
  Initialize $\theta$
  Randomly shuffle dataset
  Repeat {  (typically 1–10x)
    For $i = 1 \ldots n$, do
      $\theta_j \leftarrow \theta_j - \alpha \left( h_\theta(x_i) - y_i \right) x_{ij}$  for $j = 0 \ldots d$
  }
  (each update uses the single-instance gradient $\frac{\partial}{\partial \theta_j} \mathrm{cost}_\theta(x_i, y_i)$)
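A minimal NumPy sketch of the stochastic update; `num_epochs` plays the role of the "typically 1–10x" outer repeat, and all names are illustrative.

```python
import numpy as np

def sgd(X, y, alpha, num_epochs=5, seed=0):
    """Stochastic gradient descent for (unregularized) logistic regression."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))        # randomly shuffle the dataset once, as on the slide
    theta = np.zeros(X.shape[1])
    for _ in range(num_epochs):            # Repeat { ... }, typically 1-10 passes
        for i in order:                    # For i = 1...n
            h_i = 1.0 / (1.0 + np.exp(-(X[i] @ theta)))    # h_theta(x_i)
            theta = theta - alpha * (h_i - y[i]) * X[i]    # update all theta_j at once
    return theta
```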
Page 23
Batch vs. Stochastic GD
[Figure: convergence paths of Batch GD vs. Stochastic GD]
• Learning rate α is typically held constant
• Can slowly decrease α over time to force θ to converge:
  e.g., $\alpha_t = \frac{\text{constant}_1}{\text{iterationNumber} + \text{constant}_2}$

Based on slide by Andrew Ng
Page 25
New Stochastic Gradient Algorithms
• So far, we have considered:
  • a constant learning rate α
  • a time-dependent learning rate α_t via a pre-set formula
• AdaGrad adjusts the learning rate based on historical information
  • Frequently occurring features in the gradients get small learning rates and infrequent features get higher ones
  • Key idea: "learn slowly" from frequent features but "pay attention" to rare but informative features
• Define a per-feature learning rate for feature j as:
  $\alpha_{t,j} = \frac{\alpha}{\sqrt{G_{t,j}}}$, where $G_{t,j} = \sum_{k=1}^{t} g_{k,j}^2$ and $g_{k,j} = \frac{\partial}{\partial \theta_j} \mathrm{cost}_\theta(x_k, y_k)$
• $G_{t,j}$ is the sum of squares of gradients of feature j through time t
Page 26
New Stochastic Gradient Algorithms
• AdaGrad changes the update rule for SGD at time t from
  $\theta_j \leftarrow \theta_j - \alpha\, g_{t,j}$
  to
  $\theta_j \leftarrow \theta_j - \frac{\alpha}{\sqrt{G_{t,j} + \zeta}}\, g_{t,j}$
  where $\frac{\alpha}{\sqrt{G_{t,j}}}$ is the AdaGrad per-feature learning rate, $G_{t,j} = \sum_{k=1}^{t} g_{k,j}^2$, and in practice we add a small constant ζ > 0 to prevent dividing-by-zero errors
• AdaGrad converges quickly:
  [Figure: $\|\theta - \theta^*\|_2$ vs. time, with a bad choice for α and with a good choice for α]

Plots from http://akyrillidis.github.io/notes/AdaGrad
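A minimal sketch of the AdaGrad update applied to the same stochastic loop; ζ is the small constant from the slide, and all names are illustrative.

```python
import numpy as np

def adagrad_sgd(X, y, alpha, num_epochs=5, zeta=1e-8, seed=0):
    """Stochastic gradient descent with AdaGrad per-feature learning rates."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    G = np.zeros(X.shape[1])               # G_{t,j}: running sum of squared gradients, per feature
    for _ in range(num_epochs):
        for i in rng.permutation(len(y)):
            h_i = 1.0 / (1.0 + np.exp(-(X[i] @ theta)))
            g = (h_i - y[i]) * X[i]        # g_{t,j} = d/d theta_j cost(x_i, y_i)
            G += g ** 2
            theta = theta - (alpha / np.sqrt(G + zeta)) * g   # per-feature learning rate
    return theta
```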
Page 27
Multi-Class Classification
Page 28
Multi-Class Classification
Disease diagnosis: healthy / cold / flu / pneumonia
Object classification: desk / chair / monitor / bookcase

[Figure: scatter plots in the (x1, x2) plane of binary classification vs. multi-class classification]
Page 29
Multi-Class Logistic Regression
• For 2 classes:
  $h_\theta(x) = \frac{1}{1 + \exp(-\theta^\top x)} = \frac{\exp(\theta^\top x)}{1 + \exp(\theta^\top x)}$
  (in the right-hand expression, the 1 in the denominator is the weight assigned to $y = 0$ and $\exp(\theta^\top x)$ is the weight assigned to $y = 1$)
• For C classes {1, ..., C}:
  $p(y = c \mid x; \theta_1, \ldots, \theta_C) = \frac{\exp(\theta_c^\top x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^\top x)}$
  – Called the softmax function
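A minimal NumPy sketch of the softmax function; subtracting the maximum score before exponentiating is a standard numerical-stability trick, not part of the slide.

```python
import numpy as np

def softmax(scores):
    """Map scores [theta_1^T x, ..., theta_C^T x] to class probabilities that sum to 1."""
    z = scores - np.max(scores)    # stability shift; does not change the result
    e = np.exp(z)
    return e / np.sum(e)

# Example: probabilities for 3 classes
probs = softmax(np.array([2.0, 1.0, -1.0]))
```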
Page 30
Multi-Class Logistic Regression
• Train a logistic regression classifier for each class c to predict the probability that y = c with
  $h_c(x) = \frac{\exp(\theta_c^\top x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^\top x)}$
Split into One vs Rest:
[Figure: multi-class data in the (x1, x2) plane split into one-vs-rest subproblems]
Page 31
Implementing Multi-Class Logistic Regression
• Use $h_c(x) = \frac{\exp(\theta_c^\top x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^\top x)}$ as the model for class c
• Gradient descent simultaneously updates all parameters for all models
  – Same derivative as before, just with the above $h_c(x)$
• Predict the class label as the most probable label: $\arg\max_c h_c(x)$
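Putting the pieces together, a minimal sketch of prediction under this model, assuming the per-class parameters are stacked as rows of a C-by-(d+1) matrix `Theta` (an illustrative layout, not prescribed by the slides):

```python
import numpy as np

def predict_multiclass(Theta, x):
    """Return argmax_c h_c(x) for softmax (multi-class logistic) regression.

    Theta: C-by-(d+1) matrix whose rows are theta_1, ..., theta_C.
    x: feature vector of length d+1 with a leading 1.
    """
    scores = Theta @ x                      # theta_c^T x for every class c
    e = np.exp(scores - np.max(scores))     # softmax numerator (stabilized)
    h = e / np.sum(e)                       # h_c(x) for every class c
    return int(np.argmax(h))                # most probable class label
```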