Introduction to Machine Learning Summer School
June 18, 2018 - June 29, 2018, Chicago
Instructor: Suriya Gunasekar, TTI Chicago
20 June 2018

Day 3: Classification, logistic regression
Topics so far
• Supervised learning, linear regression
• Yesterday
  o Overfitting
  o Ridge and lasso regression
  o Gradient descent
• Today
  o Bias-variance trade-off
  o Classification
  o Logistic regression
  o Regularization for logistic regression
  o Classification metrics
Bias-variance tradeoff
Empirical vs population loss
• Population distribution: let $(x, y) \sim \mathcal{D}$
• We have
  o Loss function $\ell(\hat{y}, y)$
  o Hypothesis class $\mathcal{H}$
  o Training data $S = \{(x^{(i)}, y^{(i)}) : i = 1, 2, \ldots, N\} \sim_{iid} \mathcal{D}^N$
    § Think of $S$ as a random variable
• What we really want: $f \in \mathcal{H}$ minimizing the population loss
  $L_{\mathcal{D}}(f) \triangleq \mathbf{E}_{\mathcal{D}}[\ell(f(x), y)] = \sum_{(x,y)} \ell(f(x), y) \Pr(x, y)$
• ERM minimizes the empirical loss
  $L_S(f) \triangleq \hat{\mathbf{E}}_S[\ell(f(x), y)] = \frac{1}{N} \sum_{i=1}^{N} \ell(f(x^{(i)}), y^{(i)})$
• e.g., $\Pr(x) = \mathrm{uniform}(0,1)$, $y = w^* \cdot x + \epsilon$ where $\epsilon \sim \mathcal{N}(0, 0.1)$
  $\Rightarrow \Pr(y \mid x) = \mathcal{N}(w^* \cdot x, 0.1)$ and $\Pr(x, y) = \Pr(x) \Pr(y \mid x)$
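A minimal simulation sketch of this distinction, assuming numpy; the true slope $w^* = 2$, the fixed predictor $f$, and the sample sizes are made-up illustrative values, and $\mathcal{N}(0, 0.1)$ is treated as having variance 0.1. For an $f$ chosen independently of $S$, both losses are close:

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = 2.0  # hypothetical true parameter

def sample(n):
    # x ~ uniform(0,1), y = w*.x + eps, eps ~ N(0, 0.1) (0.1 taken as variance)
    x = rng.uniform(0, 1, size=n)
    return x, w_star * x + rng.normal(0, np.sqrt(0.1), size=n)

def sq_loss(f, x, y):
    return np.mean((f(x) - y) ** 2)

f = lambda x: 1.9 * x + 0.05      # a fixed predictor, independent of S

x_tr, y_tr = sample(20)           # a small training set S
x_big, y_big = sample(1_000_000)  # huge fresh sample: Monte-Carlo estimate of L_D(f)

print("empirical loss L_S(f) :", sq_loss(f, x_tr, y_tr))
print("population loss L_D(f):", sq_loss(f, x_big, y_big))
```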
Empirical vs population loss
  $L_{\mathcal{D}}(f) \triangleq \mathbf{E}_{\mathcal{D}}[\ell(f(x), y)] = \sum_{(x,y)} \ell(f(x), y) \Pr(x, y)$
  $L_S(f) \triangleq \hat{\mathbf{E}}_S[\ell(f(x), y)] = \frac{1}{N} \sum_{i=1}^{N} \ell(f(x^{(i)}), y^{(i)})$
• $\hat{f}_S$ from some model overfits to $S$ if there is $f^* \in \mathcal{H}$ with $\hat{\mathbf{E}}_S[\ell(\hat{f}_S(x), y)] \le \hat{\mathbf{E}}_S[\ell(f^*(x), y)]$ but $\mathbf{E}_{\mathcal{D}}[\ell(\hat{f}_S(x), y)] \gg \mathbf{E}_{\mathcal{D}}[\ell(f^*(x), y)]$
• If $f$ is independent of $S_{train}$, then both $L_{S_{train}}(f)$ and $L_{S_{test}}(f)$ are good approximations of $L_{\mathcal{D}}(f)$
• But generally $\hat{f}$ depends on $S_{train}$. Why?
  o $L_{S_{train}}(\hat{f}_{S_{train}})$ is no longer a good approximation of $L_{\mathcal{D}}(\hat{f})$
  o $L_{S_{test}}(\hat{f}_{S_{train}})$ is still a good approximation of $L_{\mathcal{D}}(\hat{f})$, since $\hat{f}_{S_{train}}$ is independent of $S_{test}$
Optimum unrestricted predictor
• Consider the population squared loss
  $\arg\min_{f \in \mathcal{H}} L(f) \triangleq \mathbf{E}_{\mathcal{D}}[\ell(f(x), y)] = \mathbf{E}_{(x,y)}[(f(x) - y)^2]$
• Say $\mathcal{H}$ is unrestricted – any function $f : x \to y$ is allowed:
  $L(f) = \mathbf{E}_{(x,y)}[(f(x) - y)^2] = \mathbf{E}_x[\mathbf{E}_y[(f(x) - y)^2 \mid x]]$
  $= \mathbf{E}_x[\mathbf{E}_y[(f(x) - \mathbf{E}_y[y \mid x] + \mathbf{E}_y[y \mid x] - y)^2 \mid x]]$
  $= \mathbf{E}_x[\mathbf{E}_y[(f(x) - \mathbf{E}_y[y \mid x])^2 \mid x]] + \mathbf{E}_x[\mathbf{E}_y[(\mathbf{E}_y[y \mid x] - y)^2 \mid x]] + 2\,\mathbf{E}_x[\mathbf{E}_y[(f(x) - \mathbf{E}_y[y \mid x])(\mathbf{E}_y[y \mid x] - y) \mid x]]$
  The cross term is $0$, so
  $L(f) = \underbrace{\mathbf{E}_x[(f(x) - \mathbf{E}_y[y \mid x])^2]}_{\text{not a function of } y} + \underbrace{\mathbf{E}_{x,y}[(\mathbf{E}_y[y \mid x] - y)^2]}_{\text{noise}}$
• Minimized for $f = \mathbf{E}_y[y \mid x]$
Bias-variance decomposition
• Best unrestricted predictor: $f^{**}(x) = \mathbf{E}_y[y \mid x]$
• $L(f_S) = \mathbf{E}_x[(f_S(x) - f^{**}(x))^2] + \mathbf{E}_{x,y}[(f^{**}(x) - y)^2]$
• $\mathbf{E}_S L(f_S) = \mathbf{E}_S \mathbf{E}_x[(f_S(x) - f^{**}(x))^2] + \text{noise}$
  $\mathbf{E}_S \mathbf{E}_x[(f_S(x) - f^{**}(x))^2] = \mathbf{E}_x[\mathbf{E}_S[(f_S(x) - f^{**}(x))^2 \mid x]]$
  $= \mathbf{E}_x \mathbf{E}_S[(f_S(x) - \mathbf{E}_S[f_S(x)] + \mathbf{E}_S[f_S(x)] - f^{**}(x))^2 \mid x]$
  $= \mathbf{E}_x \mathbf{E}_S[(f_S(x) - \mathbf{E}_S[f_S(x)])^2 \mid x] + \mathbf{E}_x[(\mathbf{E}_S[f_S(x)] - f^{**}(x))^2] + 2\,\mathbf{E}_x[\mathbf{E}_S[(\mathbf{E}_S[f_S(x)] - f^{**}(x))(f_S(x) - \mathbf{E}_S[f_S(x)]) \mid x]]$
  The cross term is $0$, so this equals $\mathbf{E}_{S,x}[(f_S(x) - \mathbf{E}_S[f_S(x)])^2] + \mathbf{E}_x[(\mathbf{E}_S[f_S(x)] - f^{**}(x))^2]$
• Altogether, $\mathbf{E}_S L(f_S) = \text{variance} + \text{bias}^2 + \text{noise}$:
  $\mathbf{E}_S L(f_S) = \underbrace{\mathbf{E}_{S,x}[(f_S(x) - \mathbf{E}_S[f_S(x)])^2]}_{\text{variance}} + \underbrace{\mathbf{E}_x[(\mathbf{E}_S[f_S(x)] - f^{**}(x))^2]}_{\text{bias}^2} + \underbrace{\mathbf{E}_{x,y}[(f^{**}(x) - y)^2]}_{\text{noise}}$
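A minimal sketch estimating the three terms by simulation, assuming numpy; the target $f^{**}(x) = \sin(2\pi x)$, the noise variance, the sample size, and the polynomial degrees are all made-up choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
f_star = lambda x: np.sin(2 * np.pi * x)   # hypothetical f**(x) = E[y|x]
sigma2 = 0.05                              # noise variance E[(f**(x) - y)^2]

def fit(deg, n=20):
    # draw a fresh training set S and return the least-squares polynomial f_S
    x = rng.uniform(0, 1, n)
    y = f_star(x) + rng.normal(0, np.sqrt(sigma2), n)
    return np.polyfit(x, y, deg)

x_grid = np.linspace(0, 1, 200)            # grid approximating E_x[...]
for deg in [1, 3, 9]:
    preds = np.array([np.polyval(fit(deg), x_grid) for _ in range(500)])
    mean_pred = preds.mean(axis=0)                      # E_S[f_S(x)]
    variance = preds.var(axis=0).mean()                 # E_{S,x}[(f_S - E_S f_S)^2]
    bias2 = ((mean_pred - f_star(x_grid)) ** 2).mean()  # E_x[(E_S f_S - f**)^2]
    print(f"degree {deg}: bias^2 = {bias2:.4f}, variance = {variance:.4f}, noise = {sigma2}")
```

Low-degree fits should show high bias and low variance; high-degree fits the reverse.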
Bias-variance tradeoff
• Recall: $\mathbf{E}_S L(f_S) = \underbrace{\mathbf{E}_{S,x}[(f_S(x) - \mathbf{E}_S[f_S(x)])^2]}_{\text{variance}} + \underbrace{\mathbf{E}_x[(\mathbf{E}_S[f_S(x)] - f^{**}(x))^2]}_{\text{bias}^2} + \underbrace{\mathbf{E}_{x,y}[(f^{**}(x) - y)^2]}_{\text{noise}}$, with $f_S \in \mathcal{H}$
• noise is irreducible
• variance can be reduced by
  o getting more data
  o making $f_S$ less sensitive to $S$
    § fewer candidates in $\mathcal{H}$ to choose from → less variance
    § reducing the "complexity" of the model class $\mathcal{H}$ decreases variance
• $\text{bias}^2 \ge \min_{f \in \mathcal{H}} \mathbf{E}_x[(f(x) - f^{**}(x))^2]$
    § expanding the model class $\mathcal{H}$ decreases bias
Model complexity
• Reducing the complexity of the model class $\mathcal{H}$ decreases variance; expanding the model class $\mathcal{H}$ decreases bias.
• Complexity ≈ number of choices in $\mathcal{H}$
  o For any loss $L$, for all $f \in \mathcal{H}$, with probability greater than $1 - \delta$,
    $L(f) \le L_S(f) + \sqrt{\frac{\log|\mathcal{H}| + \log\frac{1}{\delta}}{N}}$
  o many other variants for infinite-cardinality classes
  o often these bounds are loose
• Complexity ≈ number of degrees of freedom
  o e.g., number of parameters to estimate
  o more data ⇒ can fit more complex models
• Is $\mathcal{H}_1 = \{\boldsymbol{x} \to w_0 + \boldsymbol{w}_1 \cdot \boldsymbol{x} - \boldsymbol{w}_2 \cdot \boldsymbol{x}\}$ more complex than $\mathcal{H}_2 = \{\boldsymbol{x} \to w_0 + \boldsymbol{w}_1 \cdot \boldsymbol{x}\}$?
• What we really need is how many different "behaviors" we can get on the same $S$.
Summary
• Overfitting
  o What is overfitting?
  o How to detect overfitting?
  o Avoiding overfitting using model selection
• Bias-variance tradeoff
Classification
• Supervised learning: estimate a mapping $f$ from input $x \in \mathcal{X}$ to output $y \in \mathcal{Y}$
  o Regression: $\mathcal{Y} = \mathbb{R}$ or other continuous variables
  o Classification: $\mathcal{Y}$ takes a discrete set of values
    § Examples:
      q $\mathcal{Y} = \{\text{spam}, \text{no spam}\}$
      q digits (labels, not numeric values): $\mathcal{Y} = \{0, 1, 2, \ldots, 9\}$
• Many successful applications of ML in vision, speech, NLP, healthcare
Classification vs regression
• Label values do not have meaning
  o $\mathcal{Y} = \{\text{spam}, \text{no spam}\}$ or $\mathcal{Y} = \{0, 1\}$ or $\mathcal{Y} = \{-1, 1\}$
• Ordering of labels does not matter (for the most part)
  o $f(x) =$ “0” when $y =$ “1” is as bad as $f(x) =$ “9” when $y =$ “1”
• Often $f(x)$ does not return labels $y$
  o e.g., in binary classification with $\mathcal{Y} = \{-1, 1\}$ we often estimate $f : \mathcal{X} \to \mathbb{R}$ and then post-process to get $\hat{y}$: $\hat{y} = \mathbf{1}[f(x) \ge 0]$
  o mainly for computational reasons
    § remember, we need to solve $\min_{f \in \mathcal{H}} \sum_i \ell(f(x^{(i)}), y^{(i)})$
    § discrete values → combinatorial problems → hard to solve
  o more generally, $\mathcal{H} \subset \{f : \mathcal{X} \to \mathbb{R}\}$ with loss $\ell : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$
    § compare to regression, where typically $\mathcal{H} \subset \{f : \mathcal{X} \to \mathcal{Y}\}$ with loss $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$
Non-parametric classifiers
Nearest neighbor (NN) classifier
• Training data $S = \{(x^{(i)}, y^{(i)}) : i = 1, 2, \ldots, N\}$
• Want to predict the label of a new point $x$
• Nearest neighbor rule
  o Find the closest training point: $i^* = \arg\min_i \rho(x, x^{(i)})$
  o Predict the label of $x$ as $\hat{y}(x) = y^{(i^*)}$
• Computation
  o Training time: do nothing
  o Test time: search the training set for the nearest neighbor

[Figure: a query point “?” among labeled training points. Figure credit: Nati Srebro]
Nearest neighbor (NN) classifier
• Where is the main model?
  o $i^* = \arg\min_i \rho(x, x^{(i)})$
  o What is the right “distance” between images? Between sound waves? Between sentences?
  o Often $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$ or other norms, e.g. $\|x - x'\|_1$

[Figure: NN decision boundaries under $\|x - x'\|_1$ vs $\|\phi(x) - \phi(x')\|_2$ with $\phi(x) = (5x_1, x_2)$. Slide credit: Nati Srebro]
k-nearest neighbor (kNN) classifier
• Training data $S = \{(x^{(i)}, y^{(i)}) : i = 1, 2, \ldots, N\}$
• Want to predict the label of a new point $x$
• k-nearest neighbor rule
  o Find the $k$ closest training points: $i_1^*, i_2^*, \ldots, i_k^*$
  o Predict the label of $x$ as $\hat{y}(x) = \mathrm{majority}(y^{(i_1^*)}, y^{(i_2^*)}, \ldots, y^{(i_k^*)})$
• Computation
  o Training time: do nothing
  o Test time: search the training set for the k nearest neighbors
k-nearest neighbor
• Advantages
  o no training
  o universal approximator – non-parametric
• Disadvantages
  o not scalable
    § test-time memory requirement
    § test-time computation
  o easily overfits with small data
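A minimal numpy sketch of the kNN rule above, on made-up 2D data; this is a brute-force version, not an efficient implementation (in practice one would use e.g. sklearn.neighbors.KNeighborsClassifier):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # rho(x, x^(i)): Euclidean distance to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # indices i_1*, ..., i_k* of the k closest training points
    nearest = np.argsort(dists)[:k]
    # majority vote over +/-1 neighbor labels (ties go to +1)
    return 1 if y_train[nearest].sum() >= 0 else -1

# toy usage with made-up data
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 0.8]])
y = np.array([-1, -1, 1, 1])
print(knn_predict(X, y, np.array([0.95, 0.9]), k=3))  # -> 1
```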
Training vs test error
• 1-NN
  o Training error? 0 (each training point is its own nearest neighbor)
  o Test error? Depends on $\Pr(x, y)$
• k-NN
  o Training error: can be greater than 0
  o Test error: again depends on $\Pr(x, y)$

[Figure credit: Nati Srebro]
k-nearest neighbor: data fit / complexity tradeoff

[Figure: kNN decision regions on the same training set $S$ for k = 1, 5, 12, 50, 100, 200, alongside the training sample $S$ and the optimal predictor $h^*$. Slide credit: Nati Srebro]
Space partition
• kNN partitions $\mathcal{X}$ (or $\mathbb{R}^d$) into regions of +1 and -1
• What about discrete-valued features $x$?
• Even for continuous $x$, can we get more structured partitions?
  o easy to describe
    § e.g., $R_2 = \{x : x_1 < t_1 \text{ and } x_2 > t_2\}$
  o reduces degrees of freedom
• Any non-overlapping partition using only (hyper)rectangles is representable by a tree

[Figure credit: Greg Shakhnarovich]
Decision trees
• Focus on binary trees (trees with at most two children at each node)
• How to create trees?
• What is a “good” tree?
  o A measure of “purity” at each leaf node, where each leaf node corresponds to a region $R_i$:
    $\mathrm{purity}(tree) = \sum_i |\#\text{blue at } R_i - \#\text{red at } R_i|$
  o There are various metrics of (im)purity used in practice, but the rough idea is the same.
Decision trees
• How to create trees?
• Training data $S = \{(x^{(i)}, y^{(i)}) : i = 1, 2, \ldots, N\}$, where $y^{(i)} \in \{\text{blue}, \text{red}\}$
• At each point, $\mathrm{purity}(tree) = \sum_{\text{leaf}} |\#\text{blue at leaf} - \#\text{red at leaf}|$
• Start with all data at the root
  o only one $\text{leaf} = \text{root}$. What is $\mathrm{purity}(tree)$?
• Create a split based on a rule that increases the “purity” of the tree.
  o How complex can the rules be?
• Repeat.
• When to stop? What is the complexity of a DT?
  o Limit the number of leaf nodes
Decision trees
• Advantages
  o interpretable
  o easy to deal with non-numeric features
  o natural extensions to multi-class, multi-label
• Disadvantages
  o not scalable
  o hard decisions – non-smooth decision boundaries
  o often overfits in spite of regularization
• Check the CART package in scikit-learn (a minimal sketch follows below)
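A minimal sketch using scikit-learn's CART-based DecisionTreeClassifier on made-up data; the axis-aligned labeling rule and max_leaf_nodes=4 are illustrative choices, with max_leaf_nodes capping the number of leaf nodes as discussed above:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = ((X[:, 0] < 0.5) & (X[:, 1] > 0.5)).astype(int)  # axis-aligned region: ideal for a tree

tree = DecisionTreeClassifier(max_leaf_nodes=4)      # limit the number of leaf nodes
tree.fit(X, y)
print(tree.predict([[0.2, 0.8], [0.9, 0.1]]))        # expected: [1 0]
```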
Parametric classifiers
• What is the equivalent of linear regression?
  o something easy to train
  o something easy to use at test time
• $f(\boldsymbol{x}) = f_{\boldsymbol{w}}(\boldsymbol{x}) = \boldsymbol{w} \cdot \boldsymbol{x} + w_0$
• $\mathcal{H} = \{f_{\boldsymbol{w}} = \boldsymbol{x} \to \boldsymbol{w} \cdot \boldsymbol{x} + w_0 : \boldsymbol{w} \in \mathbb{R}^d, w_0 \in \mathbb{R}\}$
• but $f(\boldsymbol{x}) \notin \{-1, 1\}$! How do we get labels?
  o a reasonable choice: $\hat{y}(\boldsymbol{x}) = 1$ if $f_{\hat{\boldsymbol{w}}}(\boldsymbol{x}) \ge 0$ and $\hat{y}(\boldsymbol{x}) = -1$ otherwise
  o linear classifier: $\hat{y}(\boldsymbol{x}) = \mathrm{sign}(\hat{\boldsymbol{w}} \cdot \boldsymbol{x} + \hat{w}_0)$
Parametric classifiers
• $\mathcal{H} = \{f_{\boldsymbol{w}} = \boldsymbol{x} \to \boldsymbol{w} \cdot \boldsymbol{x} + w_0 : \boldsymbol{w} \in \mathbb{R}^d, w_0 \in \mathbb{R}\}$
• $\hat{y}(\boldsymbol{x}) = \mathrm{sign}(\hat{\boldsymbol{w}} \cdot \boldsymbol{x} + \hat{w}_0)$
• $\hat{\boldsymbol{w}} \cdot \boldsymbol{x} + \hat{w}_0 = 0$ is the (linear) decision boundary, or separating hyperplane
  o it separates $\mathbb{R}^d$ into two halfspaces (regions): $\hat{\boldsymbol{w}} \cdot \boldsymbol{x} + \hat{w}_0 > 0$ and $\hat{\boldsymbol{w}} \cdot \boldsymbol{x} + \hat{w}_0 < 0$
• more generally, $\hat{y}(\boldsymbol{x}) = \mathrm{sign}(\hat{f}(\boldsymbol{x}))$ → the decision boundary is $\hat{f}(\boldsymbol{x}) = 0$

[Figure: a linear classifier in the $(x_1, x_2)$ plane with its separating hyperplane]
• What if we ignore the above and solve classification using regression?
Classification as regression
• Binary classification: $\mathcal{Y} = \{-1, 1\}$ and $\mathcal{X} = \mathbb{R}^d$
• Treat it as regression with squared loss, say linear regression
  o Training data $S = \{(\boldsymbol{x}^{(i)}, y^{(i)}) : i = 1, 2, \ldots, N\}$
  o ERM: $\hat{\boldsymbol{w}}, \hat{w}_0 = \arg\min_{\boldsymbol{w}, w_0} \sum_i (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + w_0 - y^{(i)})^2$
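A minimal sketch of this approach, assuming numpy; the Gaussian data and linear labeling rule are made up for illustration. We solve the least-squares ERM on the $\pm 1$ labels, then threshold:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=100))  # +/-1 labels

A = np.hstack([X, np.ones((100, 1))])      # append a 1s column to absorb the bias w0
w = np.linalg.lstsq(A, y, rcond=None)[0]   # ERM with squared loss on the labels
y_hat = np.sign(A @ w)                     # post-process: sign(w.x + w0)
print("training accuracy:", (y_hat == y).mean())
```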
Classification as regression

[Figure: 1D example plotting $y \in \{-1, +1\}$ against $x$ with the fitted regression line; thresholding gives $\hat{y}(x) = \mathrm{sign}(wx + w_0)$, with $\hat{y} = +1$ on one side of the boundary and $\hat{y} = -1$ on the other. Example credit: Greg Shakhnarovich]
Classification as regression

[Figure: points with large $|x|$ whose labels $y$ are classified correctly by $\hat{y}(x) = \mathrm{sign}(w \cdot x)$, but whose squared loss $(w \cdot x + 1)^2$ will be high. Example credit: Greg Shakhnarovich]
Classification as regression

[Figures: further 1D examples with $y = +1$ and $y = -1$ groups in which correctly classified far-away points pull the squared-loss fit and degrade the thresholded classifier. Example and slide credit: Greg Shakhnarovich]
Surrogate losses
• The correct loss to use is the 0-1 loss after thresholding:
  $\ell_{01}(f(x), y) = \mathbf{1}[\mathrm{sign}(f(x)) \ne y] = \mathbf{1}[f(x)\,y < 0]$
• Linear regression uses $\ell_{sq}(f(x), y) = (f(x) - y)^2$
• Why not do ERM over $\ell_{01}(f(x), y)$ directly?
  o non-continuous, non-convex
• Since $\ell_{01}$ is hard to optimize over, find another loss $\ell(\hat{y}, y)$ that is
  o convex (for any fixed $y$) → easier to minimize
  o an upper bound on $\ell_{01}$ → small $\ell$ ⇒ small $\ell_{01}$
• The squared loss satisfies both, but has “large” loss even when $\ell_{01}(\hat{y}, y) = 0$
• Two more surrogate losses in this course:
  o logistic loss $\ell_{log}(\hat{y}, y) = \log(1 + \exp(-\hat{y}y))$ (TODAY)
  o hinge loss $\ell_{hinge}(\hat{y}, y) = \max(0, 1 - \hat{y}y)$ (TOMORROW)

[Figure: the 0-1, squared, logistic, and hinge losses plotted against the margin $f(x)\,y$]
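A minimal sketch, assuming numpy and matplotlib, that plots these losses against the margin $m = f(x)\,y$ (the plot range is an arbitrary choice):

```python
import numpy as np
import matplotlib.pyplot as plt

m = np.linspace(-3, 3, 500)            # margin m = f(x) * y
zero_one = (m < 0).astype(float)       # 0-1 loss: 1[f(x)y < 0]
squared = (m - 1) ** 2                 # (f(x) - y)^2 = (m - 1)^2 when y is +/-1
logistic = np.log(1 + np.exp(-m))      # logistic loss
hinge = np.maximum(0.0, 1 - m)         # hinge loss

for name, loss in [("0-1", zero_one), ("squared", squared),
                   ("logistic", logistic), ("hinge", hinge)]:
    plt.plot(m, loss, label=name)
plt.xlabel("f(x) y"); plt.ylabel("loss"); plt.ylim(0, 4); plt.legend(); plt.show()
```

Note how the squared loss keeps growing for large positive margins, where $\ell_{01}$ is already 0.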
Logistic Regression
Logistic regression: ERM on a surrogate loss
• $S = \{(\boldsymbol{x}^{(i)}, y^{(i)}) : i = 1, 2, \ldots, N\}$, $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \{-1, 1\}$
• Linear model $f(\boldsymbol{x}) = f_{\boldsymbol{w}}(\boldsymbol{x}) = \boldsymbol{w} \cdot \boldsymbol{x} + w_0$
• Minimize the training loss:
  $\hat{\boldsymbol{w}}, \hat{w}_0 = \arg\min_{\boldsymbol{w}, w_0} \sum_i \log\big(1 + \exp\big(-(\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + w_0)\,y^{(i)}\big)\big)$
• Output classifier $\hat{y}(\boldsymbol{x}) = \mathrm{sign}(\boldsymbol{w} \cdot \boldsymbol{x} + w_0)$
• Logistic loss: $\ell(f(x), y) = \log(1 + \exp(-f(x)\,y))$

[Figure: the logistic loss plotted against the margin $f(x)\,y$]
Logistic regression
  $\hat{\boldsymbol{w}}, \hat{w}_0 = \arg\min_{\boldsymbol{w}, w_0} \sum_i \log\big(1 + \exp\big(-(\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + w_0)\,y^{(i)}\big)\big)$
• Learns a linear decision boundary
  o $\{\boldsymbol{x} : \boldsymbol{w} \cdot \boldsymbol{x} + w_0 = 0\}$ is a hyperplane in $\mathbb{R}^d$ – the decision boundary
  o $\{\boldsymbol{x} : \boldsymbol{w} \cdot \boldsymbol{x} + w_0 = 0\}$ divides $\mathbb{R}^d$ into two halfspaces (regions)
  o $\{\boldsymbol{x} : \boldsymbol{w} \cdot \boldsymbol{x} + w_0 \ge 0\}$ will get label $+1$ and $\{\boldsymbol{x} : \boldsymbol{w} \cdot \boldsymbol{x} + w_0 < 0\}$ will get label $-1$
• Maps $\boldsymbol{x}$ to a 1D coordinate $x' = \frac{\boldsymbol{w} \cdot \boldsymbol{x} + w_0}{\|\boldsymbol{w}\|}$

[Figure: the hyperplane in the $(x_1, x_2)$ plane with normal direction $w$ and points $\boldsymbol{x}$, $\boldsymbol{x}'$. Figure credit: Greg Shakhnarovich]
Logistic regression
  $\hat{\boldsymbol{w}}, \hat{w}_0 = \arg\min_{\boldsymbol{w}, w_0} \sum_i \log\big(1 + \exp\big(-(\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + w_0)\,y^{(i)}\big)\big)$
• Convex optimization problem
• Can solve using gradient descent (a minimal sketch follows below)
• Can also add the usual regularization: $\ell_2$, $\ell_1$
  o More details in the next session
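A minimal sketch of this ERM by gradient descent, assuming numpy; the data, the step size 0.1, and the 500 iterations are made-up illustrative choices. It uses $\frac{d}{dm}\log(1 + e^{-m}) = -\sigma(-m)$, where $\sigma$ is the sigmoid:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([1.0, -2.0]) + 0.5)   # +/-1 labels from a made-up linear rule

A = np.hstack([X, np.ones((200, 1))])          # fold the bias w0 into a last coordinate
w = np.zeros(3)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

for _ in range(500):
    margins = (A @ w) * y                      # f(x^(i)) * y^(i)
    # gradient of sum_i log(1 + exp(-m_i)): -sum_i sigmoid(-m_i) * y^(i) * a^(i)
    grad = -(A * (sigmoid(-margins) * y)[:, None]).sum(axis=0)
    w -= 0.1 * grad / len(y)                   # step on the averaged gradient

y_hat = np.sign(A @ w)                         # output classifier sign(w.x + w0)
print("training accuracy:", (y_hat == y).mean())
```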