Introduction to Machine Learning Summer School
June 18, 2018 - June 29, 2018, Chicago
Instructor: Suriya Gunasekar, TTI Chicago
20 June 2018

Day 3: Classification, logistic regression
Topics so far
• Supervised learning, linear regression
• Yesterday
  o Overfitting
  o Ridge and lasso regression
  o Gradient descent
• Today
  o Bias-variance trade-off
  o Classification
  o Logistic regression
  o Regularization for logistic regression
  o Classification metrics
Bias-variance tradeoff
Empirical vs population loss
• Population distribution: let $(x, y) \sim \mathcal{D}$
• We have
  o Loss function $\ell(\hat{y}, y)$
  o Hypothesis class $\mathcal{H}$
  o Training data $S = \{(x^{(i)}, y^{(i)}) : i = 1, 2, \ldots, N\} \sim_{iid} \mathcal{D}^N$
    § Think of $S$ as a random variable
• What we really want: $f \in \mathcal{H}$ minimizing the population loss
  $L_{\mathcal{D}}(f) \triangleq \mathbf{E}_{\mathcal{D}}[\ell(f(x), y)] = \sum_{(x,y)} \ell(f(x), y) \Pr(x, y)$
• ERM minimizes the empirical loss
  $L_S(f) \triangleq \hat{\mathbf{E}}_S[\ell(f(x), y)] = \frac{1}{N} \sum_{i=1}^{N} \ell(f(x^{(i)}), y^{(i)})$
• e.g., $\Pr(x) = \mathrm{uniform}(0,1)$, $y = w^* \cdot x + \epsilon$ where $\epsilon \sim \mathcal{N}(0, 0.1)$
  $\Rightarrow \Pr(y \mid x) = \mathcal{N}(w^* \cdot x, 0.1)$ and $\Pr(x, y) = \Pr(x) \Pr(y \mid x)$
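A minimal simulation sketch of this distinction, assuming numpy; the true slope $w^* = 2$, the fixed predictor $f$, and the sample sizes are made-up illustrative values, and $\mathcal{N}(0, 0.1)$ is treated as having variance 0.1. For an $f$ chosen independently of $S$, both losses are close:

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = 2.0  # hypothetical true parameter

def sample(n):
    # x ~ uniform(0,1), y = w*.x + eps, eps ~ N(0, 0.1) (0.1 taken as variance)
    x = rng.uniform(0, 1, size=n)
    return x, w_star * x + rng.normal(0, np.sqrt(0.1), size=n)

def sq_loss(f, x, y):
    return np.mean((f(x) - y) ** 2)

f = lambda x: 1.9 * x + 0.05      # a fixed predictor, independent of S

x_tr, y_tr = sample(20)           # a small training set S
x_big, y_big = sample(1_000_000)  # huge fresh sample: Monte-Carlo estimate of L_D(f)

print("empirical loss L_S(f) :", sq_loss(f, x_tr, y_tr))
print("population loss L_D(f):", sq_loss(f, x_big, y_big))
```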
Empirical vs population loss
  $L_{\mathcal{D}}(f) \triangleq \mathbf{E}_{\mathcal{D}}[\ell(f(x), y)] = \sum_{(x,y)} \ell(f(x), y) \Pr(x, y)$
  $L_S(f) \triangleq \hat{\mathbf{E}}_S[\ell(f(x), y)] = \frac{1}{N} \sum_{i=1}^{N} \ell(f(x^{(i)}), y^{(i)})$
• $\hat{f}_S$ from some model overfits to $S$ if there is $f^* \in \mathcal{H}$ with $\hat{\mathbf{E}}_S[\ell(\hat{f}_S(x), y)] \le \hat{\mathbf{E}}_S[\ell(f^*(x), y)]$ but $\mathbf{E}_{\mathcal{D}}[\ell(\hat{f}_S(x), y)] \gg \mathbf{E}_{\mathcal{D}}[\ell(f^*(x), y)]$
• If $f$ is independent of $S_{train}$, then both $L_{S_{train}}(f)$ and $L_{S_{test}}(f)$ are good approximations of $L_{\mathcal{D}}(f)$
• But generally $\hat{f}$ depends on $S_{train}$. Why?
  o $L_{S_{train}}(\hat{f}_{S_{train}})$ is no longer a good approximation of $L_{\mathcal{D}}(\hat{f})$
  o $L_{S_{test}}(\hat{f}_{S_{train}})$ is still a good approximation of $L_{\mathcal{D}}(\hat{f})$, since $\hat{f}_{S_{train}}$ is independent of $S_{test}$
Optimum unrestricted predictor
• Consider the population squared loss
  $\arg\min_{f \in \mathcal{H}} L(f) \triangleq \mathbf{E}_{\mathcal{D}}[\ell(f(x), y)] = \mathbf{E}_{(x,y)}[(f(x) - y)^2]$
• Say $\mathcal{H}$ is unrestricted – any function $f : x \to y$ is allowed:
  $L(f) = \mathbf{E}_{(x,y)}[(f(x) - y)^2] = \mathbf{E}_x[\mathbf{E}_y[(f(x) - y)^2 \mid x]]$
  $= \mathbf{E}_x[\mathbf{E}_y[(f(x) - \mathbf{E}_y[y \mid x] + \mathbf{E}_y[y \mid x] - y)^2 \mid x]]$
  $= \mathbf{E}_x[\mathbf{E}_y[(f(x) - \mathbf{E}_y[y \mid x])^2 \mid x]] + \mathbf{E}_x[\mathbf{E}_y[(\mathbf{E}_y[y \mid x] - y)^2 \mid x]] + 2\,\mathbf{E}_x[\mathbf{E}_y[(f(x) - \mathbf{E}_y[y \mid x])(\mathbf{E}_y[y \mid x] - y) \mid x]]$
  The cross term is $0$, so
  $L(f) = \underbrace{\mathbf{E}_x[(f(x) - \mathbf{E}_y[y \mid x])^2]}_{\text{not a function of } y} + \underbrace{\mathbf{E}_{x,y}[(\mathbf{E}_y[y \mid x] - y)^2]}_{\text{noise}}$
• Minimized for $f = \mathbf{E}_y[y \mid x]$
Bias-variance decomposition
• Best unrestricted predictor: $f^{**}(x) = \mathbf{E}_y[y \mid x]$
• $L(f_S) = \mathbf{E}_x[(f_S(x) - f^{**}(x))^2] + \mathbf{E}_{x,y}[(f^{**}(x) - y)^2]$
• $\mathbf{E}_S L(f_S) = \mathbf{E}_S \mathbf{E}_x[(f_S(x) - f^{**}(x))^2] + \text{noise}$
  $\mathbf{E}_S \mathbf{E}_x[(f_S(x) - f^{**}(x))^2] = \mathbf{E}_x[\mathbf{E}_S[(f_S(x) - f^{**}(x))^2 \mid x]]$
  $= \mathbf{E}_x \mathbf{E}_S[(f_S(x) - \mathbf{E}_S[f_S(x)] + \mathbf{E}_S[f_S(x)] - f^{**}(x))^2 \mid x]$
  $= \mathbf{E}_x \mathbf{E}_S[(f_S(x) - \mathbf{E}_S[f_S(x)])^2 \mid x] + \mathbf{E}_x[(\mathbf{E}_S[f_S(x)] - f^{**}(x))^2] + 2\,\mathbf{E}_x[\mathbf{E}_S[(\mathbf{E}_S[f_S(x)] - f^{**}(x))(f_S(x) - \mathbf{E}_S[f_S(x)]) \mid x]]$
  The cross term is $0$, so this equals $\mathbf{E}_{S,x}[(f_S(x) - \mathbf{E}_S[f_S(x)])^2] + \mathbf{E}_x[(\mathbf{E}_S[f_S(x)] - f^{**}(x))^2]$
• Altogether, $\mathbf{E}_S L(f_S) = \text{variance} + \text{bias}^2 + \text{noise}$:
  $\mathbf{E}_S L(f_S) = \underbrace{\mathbf{E}_{S,x}[(f_S(x) - \mathbf{E}_S[f_S(x)])^2]}_{\text{variance}} + \underbrace{\mathbf{E}_x[(\mathbf{E}_S[f_S(x)] - f^{**}(x))^2]}_{\text{bias}^2} + \underbrace{\mathbf{E}_{x,y}[(f^{**}(x) - y)^2]}_{\text{noise}}$
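A minimal sketch estimating the three terms by simulation, assuming numpy; the target $f^{**}(x) = \sin(2\pi x)$, the noise variance, the sample size, and the polynomial degrees are all made-up choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
f_star = lambda x: np.sin(2 * np.pi * x)   # hypothetical f**(x) = E[y|x]
sigma2 = 0.05                              # noise variance E[(f**(x) - y)^2]

def fit(deg, n=20):
    # draw a fresh training set S and return the least-squares polynomial f_S
    x = rng.uniform(0, 1, n)
    y = f_star(x) + rng.normal(0, np.sqrt(sigma2), n)
    return np.polyfit(x, y, deg)

x_grid = np.linspace(0, 1, 200)            # grid approximating E_x[...]
for deg in [1, 3, 9]:
    preds = np.array([np.polyval(fit(deg), x_grid) for _ in range(500)])
    mean_pred = preds.mean(axis=0)                      # E_S[f_S(x)]
    variance = preds.var(axis=0).mean()                 # E_{S,x}[(f_S - E_S f_S)^2]
    bias2 = ((mean_pred - f_star(x_grid)) ** 2).mean()  # E_x[(E_S f_S - f**)^2]
    print(f"degree {deg}: bias^2 = {bias2:.4f}, variance = {variance:.4f}, noise = {sigma2}")
```

Low-degree fits should show high bias and low variance; high-degree fits the reverse.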
Bias-variance tradeoff
• Recall: $\mathbf{E}_S L(f_S) = \underbrace{\mathbf{E}_{S,x}[(f_S(x) - \mathbf{E}_S[f_S(x)])^2]}_{\text{variance}} + \underbrace{\mathbf{E}_x[(\mathbf{E}_S[f_S(x)] - f^{**}(x))^2]}_{\text{bias}^2} + \underbrace{\mathbf{E}_{x,y}[(f^{**}(x) - y)^2]}_{\text{noise}}$, with $f_S \in \mathcal{H}$
• noise is irreducible
• variance can be reduced by
  o getting more data
  o making $f_S$ less sensitive to $S$
    § fewer candidates in $\mathcal{H}$ to choose from → less variance
    § reducing the "complexity" of the model class $\mathcal{H}$ decreases variance
• $\text{bias}^2 \ge \min_{f \in \mathcal{H}} \mathbf{E}_x[(f(x) - f^{**}(x))^2]$
    § expanding the model class $\mathcal{H}$ decreases bias
Model complexity
• Reducing the complexity of the model class $\mathcal{H}$ decreases variance; expanding the model class $\mathcal{H}$ decreases bias.
• Complexity ≈ number of choices in $\mathcal{H}$
  o For any loss $L$, for all $f \in \mathcal{H}$, with probability greater than $1 - \delta$,
    $L(f) \le L_S(f) + \sqrt{\frac{\log|\mathcal{H}| + \log\frac{1}{\delta}}{N}}$
  o many other variants for infinite-cardinality classes
  o often these bounds are loose
• Complexity ≈ number of degrees of freedom
  o e.g., number of parameters to estimate
  o more data ⇒ can fit more complex models
• Is $\mathcal{H}_1 = \{\boldsymbol{x} \to w_0 + \boldsymbol{w}_1 \cdot \boldsymbol{x} - \boldsymbol{w}_2 \cdot \boldsymbol{x}\}$ more complex than $\mathcal{H}_2 = \{\boldsymbol{x} \to w_0 + \boldsymbol{w}_1 \cdot \boldsymbol{x}\}$?
• What we really need is how many different "behaviors" we can get on the same $S$.
Summary
• Overfitting
  o What is overfitting?
  o How to detect overfitting?
  o Avoiding overfitting using model selection
• Bias-variance tradeoff
Classification
• Supervised learning: estimate a mapping $f$ from input $x \in \mathcal{X}$ to output $y \in \mathcal{Y}$
  o Regression: $\mathcal{Y} = \mathbb{R}$ or other continuous variables
  o Classification: $\mathcal{Y}$ takes a discrete set of values
    § Examples:
      q $\mathcal{Y} = \{\text{spam}, \text{no spam}\}$
      q digits (labels, not numeric values): $\mathcal{Y} = \{0, 1, 2, \ldots, 9\}$
• Many successful applications of ML in vision, speech, NLP, healthcare
Classification vs regression
• Label values do not have meaning
  o $\mathcal{Y} = \{\text{spam}, \text{no spam}\}$ or $\mathcal{Y} = \{0, 1\}$ or $\mathcal{Y} = \{-1, 1\}$
• Ordering of labels does not matter (for the most part)
  o $f(x) =$ “0” when $y =$ “1” is as bad as $f(x) =$ “9” when $y =$ “1”
• Often $f(x)$ does not return labels $y$
  o e.g., in binary classification with $\mathcal{Y} = \{-1, 1\}$ we often estimate $f : \mathcal{X} \to \mathbb{R}$ and then post-process to get $\hat{y}$: $\hat{y} = \mathbf{1}[f(x) \ge 0]$
  o mainly for computational reasons
    § remember, we need to solve $\min_{f \in \mathcal{H}} \sum_i \ell(f(x^{(i)}), y^{(i)})$
    § discrete values → combinatorial problems → hard to solve
  o more generally, $\mathcal{H} \subset \{f : \mathcal{X} \to \mathbb{R}\}$ with loss $\ell : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$
    § compare to regression, where typically $\mathcal{H} \subset \{f : \mathcal{X} \to \mathcal{Y}\}$ with loss $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$
Non-parametric classifiers
Nearest neighbor (NN) classifier
• Training data $S = \{(x^{(i)}, y^{(i)}) : i = 1, 2, \ldots, N\}$
• Want to predict the label of a new point $x$
• Nearest neighbor rule
  o Find the closest training point: $i^* = \arg\min_i \rho(x, x^{(i)})$
  o Predict the label of $x$ as $\hat{y}(x) = y^{(i^*)}$
• Computation
  o Training time: do nothing
  o Test time: search the training set for the nearest neighbor

[Figure: a query point “?” among labeled training points. Figure credit: Nati Srebro]
Nearest neighbor (NN) classifier
• Where is the main model?
  o $i^* = \arg\min_i \rho(x, x^{(i)})$
  o What is the right “distance” between images? Between sound waves? Between sentences?
  o Often $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$ or other norms, e.g. $\|x - x'\|_1$

[Figure: NN decision boundaries under $\|x - x'\|_1$ vs $\|\phi(x) - \phi(x')\|_2$ with $\phi(x) = (5x_1, x_2)$. Slide credit: Nati Srebro]
k-nearest neighbor (kNN) classifier
• Training data $S = \{(x^{(i)}, y^{(i)}) : i = 1, 2, \ldots, N\}$
• Want to predict the label of a new point $x$
• k-nearest neighbor rule
  o Find the $k$ closest training points: $i_1^*, i_2^*, \ldots, i_k^*$
  o Predict the label of $x$ as $\hat{y}(x) = \mathrm{majority}(y^{(i_1^*)}, y^{(i_2^*)}, \ldots, y^{(i_k^*)})$
• Computation
  o Training time: do nothing
  o Test time: search the training set for the k nearest neighbors
k-nearest neighbor
• Advantages
  o no training
  o universal approximator – non-parametric
• Disadvantages
  o not scalable
    § test-time memory requirement
    § test-time computation
  o easily overfits with small data
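A minimal numpy sketch of the kNN rule above, on made-up 2D data; this is a brute-force version, not an efficient implementation (in practice one would use e.g. sklearn.neighbors.KNeighborsClassifier):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # rho(x, x^(i)): Euclidean distance to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # indices i_1*, ..., i_k* of the k closest training points
    nearest = np.argsort(dists)[:k]
    # majority vote over +/-1 neighbor labels (ties go to +1)
    return 1 if y_train[nearest].sum() >= 0 else -1

# toy usage with made-up data
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 0.8]])
y = np.array([-1, -1, 1, 1])
print(knn_predict(X, y, np.array([0.95, 0.9]), k=3))  # -> 1
```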
Training vs test error
• 1-NN
  o Training error? 0 (each training point is its own nearest neighbor)
  o Test error? Depends on $\Pr(x, y)$
• k-NN
  o Training error: can be greater than 0
  o Test error: again depends on $\Pr(x, y)$

[Figure credit: Nati Srebro]
k-nearest neighbor: data fit / complexity tradeoff

[Figure: kNN decision regions on the same training set $S$ for k = 1, 5, 12, 50, 100, 200, alongside the training sample $S$ and the optimal predictor $h^*$. Slide credit: Nati Srebro]
Space partition
• kNN partitions $\mathcal{X}$ (or $\mathbb{R}^d$) into regions of +1 and -1
• What about discrete-valued features $x$?
• Even for continuous $x$, can we get more structured partitions?
  o easy to describe
    § e.g., $R_2 = \{x : x_1 < t_1 \text{ and } x_2 > t_2\}$
  o reduces degrees of freedom
• Any non-overlapping partition using only (hyper)rectangles is representable by a tree

[Figure credit: Greg Shakhnarovich]
Decision trees
• Focus on binary trees (trees with at most two children at each node)
• How to create trees?
• What is a “good” tree?
  o A measure of “purity” at each leaf node, where each leaf node corresponds to a region $R_i$:
    $\mathrm{purity}(tree) = \sum_i |\#\text{blue at } R_i - \#\text{red at } R_i|$
  o There are various metrics of (im)purity used in practice, but the rough idea is the same.
Decision trees
• How to create trees?
• Training data $S = \{(x^{(i)}, y^{(i)}) : i = 1, 2, \ldots, N\}$, where $y^{(i)} \in \{\text{blue}, \text{red}\}$
• At each point, $\mathrm{purity}(tree) = \sum_{\text{leaf}} |\#\text{blue at leaf} - \#\text{red at leaf}|$
• Start with all data at the root
  o only one $\text{leaf} = \text{root}$. What is $\mathrm{purity}(tree)$?
• Create a split based on a rule that increases the “purity” of the tree.
  o How complex can the rules be?
• Repeat.
• When to stop? What is the complexity of a DT?
  o Limit the number of leaf nodes
Decision trees
• Advantages
  o interpretable
  o easy to deal with non-numeric features
  o natural extensions to multi-class, multi-label
• Disadvantages
  o not scalable
  o hard decisions – non-smooth decision boundaries
  o often overfits in spite of regularization
• Check the CART package in scikit-learn (a minimal sketch follows below)
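A minimal sketch using scikit-learn's CART-based DecisionTreeClassifier on made-up data; the axis-aligned labeling rule and max_leaf_nodes=4 are illustrative choices, with max_leaf_nodes capping the number of leaf nodes as discussed above:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = ((X[:, 0] < 0.5) & (X[:, 1] > 0.5)).astype(int)  # axis-aligned region: ideal for a tree

tree = DecisionTreeClassifier(max_leaf_nodes=4)      # limit the number of leaf nodes
tree.fit(X, y)
print(tree.predict([[0.2, 0.8], [0.9, 0.1]]))        # expected: [1 0]
```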
Parametric classifiers
• What is the equivalent of linear regression?
  o something easy to train
  o something easy to use at test time
• $f(\boldsymbol{x}) = f_{\boldsymbol{w}}(\boldsymbol{x}) = \boldsymbol{w} \cdot \boldsymbol{x} + w_0$
• $\mathcal{H} = \{f_{\boldsymbol{w}} = \boldsymbol{x} \to \boldsymbol{w} \cdot \boldsymbol{x} + w_0 : \boldsymbol{w} \in \mathbb{R}^d, w_0 \in \mathbb{R}\}$
• but $f(\boldsymbol{x}) \notin \{-1, 1\}$! How do we get labels?
  o a reasonable choice: $\hat{y}(\boldsymbol{x}) = 1$ if $f_{\hat{\boldsymbol{w}}}(\boldsymbol{x}) \ge 0$ and $\hat{y}(\boldsymbol{x}) = -1$ otherwise
  o linear classifier: $\hat{y}(\boldsymbol{x}) = \mathrm{sign}(\hat{\boldsymbol{w}} \cdot \boldsymbol{x} + \hat{w}_0)$
Parametric classifiers
• $\mathcal{H} = \{f_{\boldsymbol{w}} = \boldsymbol{x} \to \boldsymbol{w} \cdot \boldsymbol{x} + w_0 : \boldsymbol{w} \in \mathbb{R}^d, w_0 \in \mathbb{R}\}$
• $\hat{y}(\boldsymbol{x}) = \mathrm{sign}(\hat{\boldsymbol{w}} \cdot \boldsymbol{x} + \hat{w}_0)$
• $\hat{\boldsymbol{w}} \cdot \boldsymbol{x} + \hat{w}_0 = 0$ is the (linear) decision boundary, or separating hyperplane
  o it separates $\mathbb{R}^d$ into two halfspaces (regions): $\hat{\boldsymbol{w}} \cdot \boldsymbol{x} + \hat{w}_0 > 0$ and $\hat{\boldsymbol{w}} \cdot \boldsymbol{x} + \hat{w}_0 < 0$
• more generally, $\hat{y}(\boldsymbol{x}) = \mathrm{sign}(\hat{f}(\boldsymbol{x}))$ → the decision boundary is $\hat{f}(\boldsymbol{x}) = 0$

[Figure: a linear classifier in the $(x_1, x_2)$ plane with its separating hyperplane]
• What if we ignore the above and solve classification using regression?
Classification as regression
• Binary classification: $\mathcal{Y} = \{-1, 1\}$ and $\mathcal{X} = \mathbb{R}^d$
• Treat it as regression with squared loss, say linear regression
  o Training data $S = \{(\boldsymbol{x}^{(i)}, y^{(i)}) : i = 1, 2, \ldots, N\}$
  o ERM: $\hat{\boldsymbol{w}}, \hat{w}_0 = \arg\min_{\boldsymbol{w}, w_0} \sum_i (\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + w_0 - y^{(i)})^2$
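A minimal sketch of this approach, assuming numpy; the Gaussian data and linear labeling rule are made up for illustration. We solve the least-squares ERM on the $\pm 1$ labels, then threshold:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=100))  # +/-1 labels

A = np.hstack([X, np.ones((100, 1))])      # append a 1s column to absorb the bias w0
w = np.linalg.lstsq(A, y, rcond=None)[0]   # ERM with squared loss on the labels
y_hat = np.sign(A @ w)                     # post-process: sign(w.x + w0)
print("training accuracy:", (y_hat == y).mean())
```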
Classification as regression

[Figure: 1D example plotting $y \in \{-1, +1\}$ against $x$ with the fitted regression line; thresholding gives $\hat{y}(x) = \mathrm{sign}(wx + w_0)$, with $\hat{y} = +1$ on one side of the boundary and $\hat{y} = -1$ on the other. Example credit: Greg Shakhnarovich]
Classification as regression

[Figure: points with large $|x|$ whose labels $y$ are classified correctly by $\hat{y}(x) = \mathrm{sign}(w \cdot x)$, but whose squared loss $(w \cdot x + 1)^2$ will be high. Example credit: Greg Shakhnarovich]
Classification as regression

[Figures: further 1D examples with $y = +1$ and $y = -1$ groups in which correctly classified far-away points pull the squared-loss fit and degrade the thresholded classifier. Example and slide credit: Greg Shakhnarovich]
Surrogate losses
• The correct loss to use is the 0-1 loss after thresholding:
  $\ell_{01}(f(x), y) = \mathbf{1}[\mathrm{sign}(f(x)) \ne y] = \mathbf{1}[f(x)\,y < 0]$
• Linear regression uses $\ell_{sq}(f(x), y) = (f(x) - y)^2$
• Why not do ERM over $\ell_{01}(f(x), y)$ directly?
  o non-continuous, non-convex
• Since $\ell_{01}$ is hard to optimize over, find another loss $\ell(\hat{y}, y)$ that is
  o convex (for any fixed $y$) → easier to minimize
  o an upper bound on $\ell_{01}$ → small $\ell$ ⇒ small $\ell_{01}$
• The squared loss satisfies both, but has “large” loss even when $\ell_{01}(\hat{y}, y) = 0$
• Two more surrogate losses in this course:
  o logistic loss $\ell_{log}(\hat{y}, y) = \log(1 + \exp(-\hat{y}y))$ (TODAY)
  o hinge loss $\ell_{hinge}(\hat{y}, y) = \max(0, 1 - \hat{y}y)$ (TOMORROW)

[Figure: the 0-1, squared, logistic, and hinge losses plotted against the margin $f(x)\,y$]
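A minimal sketch, assuming numpy and matplotlib, that plots these losses against the margin $m = f(x)\,y$ (the plot range is an arbitrary choice):

```python
import numpy as np
import matplotlib.pyplot as plt

m = np.linspace(-3, 3, 500)            # margin m = f(x) * y
zero_one = (m < 0).astype(float)       # 0-1 loss: 1[f(x)y < 0]
squared = (m - 1) ** 2                 # (f(x) - y)^2 = (m - 1)^2 when y is +/-1
logistic = np.log(1 + np.exp(-m))      # logistic loss
hinge = np.maximum(0.0, 1 - m)         # hinge loss

for name, loss in [("0-1", zero_one), ("squared", squared),
                   ("logistic", logistic), ("hinge", hinge)]:
    plt.plot(m, loss, label=name)
plt.xlabel("f(x) y"); plt.ylabel("loss"); plt.ylim(0, 4); plt.legend(); plt.show()
```

Note how the squared loss keeps growing for large positive margins, where $\ell_{01}$ is already 0.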
Logistic Regression
Logistic regression: ERM on a surrogate loss
• $S = \{(\boldsymbol{x}^{(i)}, y^{(i)}) : i = 1, 2, \ldots, N\}$, $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \{-1, 1\}$
• Linear model $f(\boldsymbol{x}) = f_{\boldsymbol{w}}(\boldsymbol{x}) = \boldsymbol{w} \cdot \boldsymbol{x} + w_0$
• Minimize the training loss:
  $\hat{\boldsymbol{w}}, \hat{w}_0 = \arg\min_{\boldsymbol{w}, w_0} \sum_i \log\big(1 + \exp\big(-(\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + w_0)\,y^{(i)}\big)\big)$
• Output classifier $\hat{y}(\boldsymbol{x}) = \mathrm{sign}(\boldsymbol{w} \cdot \boldsymbol{x} + w_0)$
• Logistic loss: $\ell(f(x), y) = \log(1 + \exp(-f(x)\,y))$

[Figure: the logistic loss plotted against the margin $f(x)\,y$]
Logistic regression
  $\hat{\boldsymbol{w}}, \hat{w}_0 = \arg\min_{\boldsymbol{w}, w_0} \sum_i \log\big(1 + \exp\big(-(\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + w_0)\,y^{(i)}\big)\big)$
• Learns a linear decision boundary
  o $\{\boldsymbol{x} : \boldsymbol{w} \cdot \boldsymbol{x} + w_0 = 0\}$ is a hyperplane in $\mathbb{R}^d$ – the decision boundary
  o $\{\boldsymbol{x} : \boldsymbol{w} \cdot \boldsymbol{x} + w_0 = 0\}$ divides $\mathbb{R}^d$ into two halfspaces (regions)
  o $\{\boldsymbol{x} : \boldsymbol{w} \cdot \boldsymbol{x} + w_0 \ge 0\}$ will get label $+1$ and $\{\boldsymbol{x} : \boldsymbol{w} \cdot \boldsymbol{x} + w_0 < 0\}$ will get label $-1$
• Maps $\boldsymbol{x}$ to a 1D coordinate $x' = \frac{\boldsymbol{w} \cdot \boldsymbol{x} + w_0}{\|\boldsymbol{w}\|}$

[Figure: the hyperplane in the $(x_1, x_2)$ plane with normal direction $w$ and points $\boldsymbol{x}$, $\boldsymbol{x}'$. Figure credit: Greg Shakhnarovich]
Logistic regression
  $\hat{\boldsymbol{w}}, \hat{w}_0 = \arg\min_{\boldsymbol{w}, w_0} \sum_i \log\big(1 + \exp\big(-(\boldsymbol{w} \cdot \boldsymbol{x}^{(i)} + w_0)\,y^{(i)}\big)\big)$
• Convex optimization problem
• Can solve using gradient descent (a minimal sketch follows below)
• Can also add the usual regularization: $\ell_2$, $\ell_1$
  o More details in the next session
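A minimal sketch of this ERM by gradient descent, assuming numpy; the data, the step size 0.1, and the 500 iterations are made-up illustrative choices. It uses $\frac{d}{dm}\log(1 + e^{-m}) = -\sigma(-m)$, where $\sigma$ is the sigmoid:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([1.0, -2.0]) + 0.5)   # +/-1 labels from a made-up linear rule

A = np.hstack([X, np.ones((200, 1))])          # fold the bias w0 into a last coordinate
w = np.zeros(3)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

for _ in range(500):
    margins = (A @ w) * y                      # f(x^(i)) * y^(i)
    # gradient of sum_i log(1 + exp(-m_i)): -sum_i sigmoid(-m_i) * y^(i) * a^(i)
    grad = -(A * (sigmoid(-margins) * y)[:, None]).sum(axis=0)
    w -= 0.1 * grad / len(y)                   # step on the averaged gradient

y_hat = np.sign(A @ w)                         # output classifier sign(w.x + w0)
print("training accuracy:", (y_hat == y).mean())
```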