Page 1: Multiclass Classification

Multiclass Classification

Wei Xu (many slides from Greg Durrett, Vivek Srikumar, Stanford CS231n)

Page 2: Multiclass Classification

Administrivia

‣ Problem Set 1 graded (on Gradescope)

‣ Programming Project 1 is released (due 9/20)

‣ Reading: Eisenstein 2.0-2.5, 4.1, 4.3-4.5

‣ Optional readings related to Project 1 were posted by TA on Piazza

Page 3: Multiclass Classification

This Lecture

‣ Multiclass fundamentals

‣ Multiclass logistic regression

‣ Feature extraction

‣ Optimization

Page 4: Multiclass Classification

Multiclass Fundamentals

Page 5: Multiclass Classification

Text Classification

‣ ~20 classes, e.g.:

Sports

Health

Page 6: Multiclass Classification

Image Classification

‣ Thousands of classes (ImageNet), e.g.:

Car

Dog

Page 7: Multiclass Classification

Entity Linking

‣ 4,500,000 classes (all articles in Wikipedia)

Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005.

Lance Edward Armstrong is an American former professional road cyclist…

Armstrong County is a county in Pennsylvania…

??

Page 8: Multiclass Classification

Entity Linking

Page 9: Multiclass Classification

Binary Classification

‣ Binary classification: one weight vector defines positive and negative classes

[figure: positive (+) and negative (-) examples separated by a single linear boundary]

Page 10: Multiclass Classification

Multiclass Classification

[figure: points from three classes (1, 2, 3) in feature space]

‣ Can we just use binary classifiers here?

Page 11: Multiclass Classification

Multiclass Classification

‣ One-vs-all: train k classifiers, one to distinguish each class from all the rest

[figure: the three-class dataset shown twice, each time with a boundary separating one class from the other two]

‣ How do we reconcile multiple positive predictions? Highest score?
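
A minimal sketch of this highest-score reconciliation (NumPy; the weights and input are made-up numbers, and `ova_predict` is an illustrative helper, not course code):

```python
import numpy as np

def ova_predict(W, x):
    """W: (k, d) matrix, one binary classifier's weight vector per row.
    Reconcile one-vs-all by returning the class whose classifier scores x highest."""
    scores = W @ x          # k raw scores w_y^T x
    return int(np.argmax(scores))

# toy example: k = 3 classes, d = 2 features
W = np.array([[ 1.0, -0.5],
              [-0.3,  0.8],
              [ 0.2,  0.1]])
x = np.array([0.9, 0.4])
print(ova_predict(W, x))    # -> 0: class 1 wins with score 0.7
```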

Page 12: Multiclass Classification

Multiclass Classification

‣ Not all classes may even be separable using this approach

[figure: three classes arranged so that class 3 cannot be linearly separated from classes 1 and 2]

‣ Can separate 1 from 2+3 and 2 from 1+3, but not 3 from the others (with these features)

Page 13: Multiclass Classification

Multiclass Classification

‣ All-vs-all: train n(n-1)/2 classifiers to differentiate each pair of classes

[figure: the three-class dataset shown once per pairwise classifier, e.g., 1 vs. 3 and 1 vs. 2]

‣ Again, how to reconcile?
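
One common way to reconcile all-vs-all is majority voting over the pairwise decisions; a sketch, assuming a hypothetical `pairwise[(i, j)]` scorer that returns > 0 when it prefers class i over class j:

```python
from itertools import combinations
from collections import Counter

def ava_predict(pairwise, classes, x):
    """Each of the n(n-1)/2 pairwise classifiers votes; the most-voted class wins."""
    votes = Counter()
    for i, j in combinations(classes, 2):
        winner = i if pairwise[(i, j)](x) > 0 else j
        votes[winner] += 1
    return votes.most_common(1)[0][0]
```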

Page 14: Multiclass Classification

Multiclass Classification

[figure: the binary +/- dataset next to the three-class dataset]

‣ Binary classification: one weight vector defines both classes

‣ Multiclass classification: different weights and/or features per class

Page 15: Multiclass Classification

Multiclass Classification

‣ Formally: instead of two labels, we have an output space $\mathcal{Y}$ containing a number of possible classes

‣ Can have one weight vector per class; decision rule: $\arg\max_{y \in \mathcal{Y}} w_y^\top f(x)$

‣ Decision rule with a single weight vector: $\arg\max_{y \in \mathcal{Y}} w^\top f(x, y)$ (multiple feature vectors, one weight vector; the features depend on the choice of label now! note: this isn't the gold label)

‣ The single weight vector approach will generalize to structured output spaces, whereas per-class weight vectors won't

‣ Same machinery that we'll use later for exponentially large output spaces, including sequences and trees

Page 16: Multiclass Classification

Feature Extraction

Page 17: Multiclass Classification

Block Feature Vectors

‣ Decision rule: $\arg\max_{y \in \mathcal{Y}} w^\top f(x, y)$

"too many drug trials, too few patients" (labels: Health, Sports, Science)

‣ Base feature function: f(x) = [I[contains drug], I[contains patients], I[contains baseball]] = [1, 1, 0]

‣ Feature vector blocks for each label, e.g., I[contains drug & label = Health]:

f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]

f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]

‣ Equivalent to having three weight vectors in this case

Page 18: Multiclass Classification

Making Decisions

"too many drug trials, too few patients" (labels: Health, Sports, Science)

f(x) = [I[contains drug], I[contains patients], I[contains baseball]]

f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]

f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]

w = [+2.1, +2.3, -5, -2.1, -3.8, 0, +1.1, -1.7, -1.3]

$w^\top f(x, y)$ = Health: +4.4, Sports: -5.9, Science: -0.6; argmax: Health

(e.g., w's entry for "word drug in Science article" = +1.1)
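
A sketch of this block construction and the argmax decision (NumPy; the weights and base features are the slide's, the helper names are illustrative):

```python
import numpy as np

LABELS = ["Health", "Sports", "Science"]
base = np.array([1, 1, 0])  # [contains drug, contains patients, contains baseball]

def block_features(base, label_idx, num_labels):
    """Place the base features f(x) into the block for the chosen label y."""
    f = np.zeros(len(base) * num_labels)
    start = label_idx * len(base)
    f[start:start + len(base)] = base
    return f

w = np.array([+2.1, +2.3, -5, -2.1, -3.8, 0, +1.1, -1.7, -1.3])
scores = {lab: w @ block_features(base, i, 3) for i, lab in enumerate(LABELS)}
print(scores)                        # Health: +4.4, Sports: -5.9, Science: -0.6
print(max(scores, key=scores.get))   # -> Health
```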

Page 19: Multiclass Classification

Another Example: POS Tagging

‣ Classify blocks as one of 36 POS tags

‣ Example x: sentence with a word (in this case, blocks) highlighted: "the router blocks the packets" (DT NN VBZ DT NNS)

‣ Extract features with respect to this word:

f(x, y = VBZ) = I[curr_word = blocks & tag = VBZ], I[prev_word = router & tag = VBZ], I[next_word = the & tag = VBZ], I[curr_suffix = s & tag = VBZ]

(not saying that "the" is tagged as VBZ! saying that "the" follows the VBZ word)

‣ Next two lectures: sequence labeling!
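
A sketch of extracting these position-dependent, tag-conjoined indicator features (the feature strings mirror the slide; `pos_features` is an illustrative helper):

```python
def pos_features(words, i, tag):
    """Binary indicator features (by name) for tagging words[i] with `tag`."""
    feats = [
        f"curr_word={words[i]}&tag={tag}",
        f"curr_suffix={words[i][-1]}&tag={tag}",
    ]
    if i > 0:
        feats.append(f"prev_word={words[i-1]}&tag={tag}")
    if i < len(words) - 1:
        feats.append(f"next_word={words[i+1]}&tag={tag}")
    return feats

print(pos_features("the router blocks the packets".split(), 2, "VBZ"))
# ['curr_word=blocks&tag=VBZ', 'curr_suffix=s&tag=VBZ',
#  'prev_word=router&tag=VBZ', 'next_word=the&tag=VBZ']
```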

Page 20: Multiclass Classification

Multiclass Logistic Regression

Page 21: Multiclass Classification

Multiclass Logistic Regression

$P_w(y|x) = \frac{\exp(w^\top f(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w^\top f(x, y'))}$ (softmax function; sum over output space to normalize)

‣ Compare to binary: $P(y = 1|x) = \frac{\exp(w^\top f(x))}{1 + \exp(w^\top f(x))}$, where the negative class implicitly had f(x, y = 0) = the zero vector

Page 22: Multiclass Classification

Multiclass Logistic Regression

$P_w(y|x) = \frac{\exp(w^\top f(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w^\top f(x, y'))}$ (sum over output space to normalize)

Why? Interpret raw classifier scores as probabilities.

"too many drug trials, too few patients"

$w^\top f(x, y)$: Health: +2.2, Sports: +3.1, Science: -0.6

exp: 6.05, 22.2, 0.55 (unnormalized probabilities; probabilities must be >= 0)

normalize: 0.21, 0.77, 0.02 (probabilities must sum to 1)
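
A minimal softmax sketch of the exp-and-normalize steps. Note: the printed exp values and probabilities on this slide correspond to a Health score of 1.8 rather than the +2.2 in the header (exp(1.8) ≈ 6.05), so 1.8 is used below to reproduce the slide's numbers:

```python
import numpy as np

def softmax(scores):
    z = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return z / z.sum()

scores = np.array([1.8, 3.1, -0.6])  # Health, Sports, Science
print(np.exp(scores))    # ~[6.05, 22.2, 0.55]: unnormalized, always >= 0
print(softmax(scores))   # ~[0.21, 0.77, 0.02]: normalized, sums to 1
```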

Page 23: Multiclass Classification

Multiclass Logistic Regression

$P_w(y|x) = \frac{\exp(w^\top f(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w^\top f(x, y'))}$ (sum over output space to normalize)

‣ Training: maximize

$\mathcal{L}(x, y) = \sum_{j=1}^{n} \log P(y_j^* | x_j) = \sum_{j=1}^{n} \left( w^\top f(x_j, y_j^*) - \log \sum_{y} \exp(w^\top f(x_j, y)) \right)$ (j indexes data points)

i.e., minimize negative log likelihood or cross-entropy loss
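
A sketch of this training objective over a dataset (assuming a `feats(x, y)` function that returns f(x, y); names are illustrative):

```python
import numpy as np

def log_likelihood(w, data, feats, labels):
    """Sum over examples of log P_w(y*_j | x_j); training maximizes this."""
    total = 0.0
    for x, y_star in data:
        scores = np.array([w @ feats(x, y) for y in labels])
        # log of the normalizer, computed stably (log-sum-exp trick)
        log_z = scores.max() + np.log(np.sum(np.exp(scores - scores.max())))
        total += w @ feats(x, y_star) - log_z
    return total
```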

Page 24: Multiclass Classification

Multiclass Logistic Regression

$P_w(y|x) = \frac{\exp(w^\top f(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w^\top f(x, y'))}$ (sum over output space to normalize)

"too many drug trials, too few patients"

$w^\top f(x, y)$: Health: +2.2, Sports: +3.1, Science: -0.6

exp: 6.05, 22.2, 0.55 (unnormalized probabilities; must be >= 0)

normalize: 0.21, 0.77, 0.02 (probabilities must sum to 1)

compare to the correct (gold) probabilities: 1.00, 0.00, 0.00

$\mathcal{L}(x, y) = \sum_{j=1}^{n} \log P(y_j^* | x_j)$, with $\mathcal{L}(x_j, y_j^*) = w^\top f(x_j, y_j^*) - \log \sum_{y} \exp(w^\top f(x_j, y))$

log(0.21) = -1.56

Q: max/min of log prob.?

Page 25: Multiclass Classification

Training

‣ Multiclass logistic regression: $P_w(y|x) = \frac{\exp(w^\top f(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w^\top f(x, y'))}$

‣ Likelihood: $\mathcal{L}(x_j, y_j^*) = w^\top f(x_j, y_j^*) - \log \sum_{y} \exp(w^\top f(x_j, y))$

$\frac{\partial}{\partial w_i} \mathcal{L}(x_j, y_j^*) = f_i(x_j, y_j^*) - \frac{\sum_y f_i(x_j, y) \exp(w^\top f(x_j, y))}{\sum_y \exp(w^\top f(x_j, y))}$

$\frac{\partial}{\partial w_i} \mathcal{L}(x_j, y_j^*) = f_i(x_j, y_j^*) - \sum_y f_i(x_j, y) P_w(y|x_j)$

$\frac{\partial}{\partial w_i} \mathcal{L}(x_j, y_j^*) = f_i(x_j, y_j^*) - \mathbb{E}_y[f_i(x_j, y)]$

gold feature value minus the model's expectation of the feature value

Page 26: Multiclass Classification

Training

"too many drug trials, too few patients", y* = Health

f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]

f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]

$P_w(y|x)$ = [0.21, 0.77, 0.02]

$\frac{\partial}{\partial w_i} \mathcal{L}(x_j, y_j^*) = f_i(x_j, y_j^*) - \sum_y f_i(x_j, y) P_w(y|x_j)$

gradient: [1, 1, 0, 0, 0, 0, 0, 0, 0] - 0.21 [1, 1, 0, 0, 0, 0, 0, 0, 0] - 0.77 [0, 0, 0, 1, 1, 0, 0, 0, 0] - 0.02 [0, 0, 0, 0, 0, 0, 1, 1, 0] = [0.79, 0.79, 0, -0.77, -0.77, 0, -0.02, -0.02, 0]

update (step size 1): [1.3, 0.9, -5, 3.2, -0.1, 0, 1.1, -1.7, -1.3] + [0.79, 0.79, 0, -0.77, -0.77, 0, -0.02, -0.02, 0] = [2.09, 1.69, -5, 2.43, -0.87, 0, 1.08, -1.72, -1.3]

new $P_w(y|x)$ = [0.89, 0.10, 0.01]
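
A sketch checking this slide's arithmetic (NumPy; the probabilities [0.21, 0.77, 0.02] are taken from the slide as given, everything else follows from them):

```python
import numpy as np

F = np.array([[1, 1, 0, 0, 0, 0, 0, 0, 0],   # f(x, Health)
              [0, 0, 0, 1, 1, 0, 0, 0, 0],   # f(x, Sports)
              [0, 0, 0, 0, 0, 0, 1, 1, 0]])  # f(x, Science)
w = np.array([1.3, 0.9, -5, 3.2, -0.1, 0, 1.1, -1.7, -1.3])
p = np.array([0.21, 0.77, 0.02])      # P_w(y|x) from the slide

grad = F[0] - p @ F                   # f(x, y*) - E_y[f(x, y)], y* = Health
print(grad)                           # [0.79, 0.79, 0, -0.77, -0.77, 0, -0.02, -0.02, 0]

w = w + grad                          # gradient ascent step, step size 1
z = np.exp(F @ w)
print(z / z.sum())                    # ~[0.89, 0.10, 0.01]: more mass on Health
```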

Page 27: Multiclass Classification

Multiclass Logistic Regression: Summary

‣ Model: $P_w(y|x) = \frac{\exp(w^\top f(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w^\top f(x, y'))}$

‣ Inference: $\arg\max_y P_w(y|x)$

‣ Learning: gradient ascent on the discriminative log likelihood

$f(x, y^*) - \mathbb{E}_y[f(x, y)] = f(x, y^*) - \sum_y P_w(y|x) f(x, y)$

"towards gold feature value, away from expectation of feature value"

Page 28: Multiclass Classification

Multiclass SVM

Page 29: Multiclass Classification

Soft Margin SVM

Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$

s.t. $\forall j \; (2y_j - 1)(w^\top x_j) \geq 1 - \xi_j$ and $\forall j \; \xi_j \geq 0$

slack variables $\xi_j > 0$ iff the example is a support vector

Image credit: Lang Van Tran

Page 30: Multiclass Classification

Multiclass SVM

Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$

s.t. $\forall j \; \xi_j \geq 0$

Binary constraint: $\forall j \; (2y_j - 1)(w^\top x_j) \geq 1 - \xi_j$

Multiclass constraint: $\forall j \, \forall y \in \mathcal{Y} \; w^\top f(x_j, y_j^*) \geq w^\top f(x_j, y) + \ell(y, y_j^*) - \xi_j$

‣ The correct prediction now has to beat every other class

‣ The 1 that was in the binary constraint is replaced by a loss function

‣ The score comparison is more explicit now

slack variables $\xi_j > 0$ iff the example is a support vector

Page 31: Multiclass Classification

Training (loss-augmented)

‣ Are all decisions equally costly?

"too many drug trials, too few patients" (gold label: Health)

Predicted Science: not so bad

Predicted Sports: bad error

‣ We can define a loss function $\ell(y, y^*)$:

$\ell(\text{Sports}, \text{Health}) = 3$

$\ell(\text{Science}, \text{Health}) = 1$

Page 32: Multiclass Classification

Loss-Augmented Decoding

$\forall j \, \forall y \in \mathcal{Y} \; w^\top f(x_j, y_j^*) \geq w^\top f(x_j, y) + \ell(y, y_j^*) - \xi_j$

$w^\top f(x, y) + \ell(y, y^*)$: Health: 2.4 + 0, Sports: 1.3 + 3, Science: 1.8 + 1

‣ Does gold beat every label + loss? No!

‣ Most violated constraint is Sports; what is $\xi_j$?

‣ $\xi_j$ = 4.3 - 2.4 = 1.9

‣ Perceptron would make no update here

Page 33: Multiclass Classification

Loss-Augmented Decoding

"too many drug trials, too few patients" (argmax of $w^\top f(x, y)$: Health)

                    Health   Sports   Science
$w^\top f(x, y)$    +2.4     +1.3     +1.8
Loss                0        3        1
Total               2.4      4.3      2.8

$\xi_j = \max_{y \in \mathcal{Y}} \left[ w^\top f(x_j, y) + \ell(y, y_j^*) \right] - w^\top f(x_j, y_j^*)$

‣ Sports is the most violated constraint, slack = 4.3 - 2.4 = 1.9

‣ Perceptron would make no update; a regular SVM would pick Science
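
A sketch of loss-augmented decoding with this slide's numbers (gold = Health; variable names are illustrative):

```python
scores = {"Health": 2.4, "Sports": 1.3, "Science": 1.8}  # w^T f(x, y)
loss   = {"Health": 0.0, "Sports": 3.0, "Science": 1.0}  # l(y, y*)
gold = "Health"

augmented = {y: scores[y] + loss[y] for y in scores}   # totals: 2.4, 4.3, 2.8
y_max = max(augmented, key=augmented.get)              # most violated: Sports
slack = augmented[y_max] - scores[gold]                # 4.3 - 2.4 = 1.9
print(y_max, slack)
```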

Page 34: Multiclass Classification

Multiclass SVM

Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$

s.t. $\forall j \; \xi_j \geq 0$ and $\forall j \, \forall y \in \mathcal{Y} \; w^\top f(x_j, y_j^*) \geq w^\top f(x_j, y) + \ell(y, y_j^*) - \xi_j$

‣ One slack variable per example, so it's set to be whatever the most violated constraint is for that example:

$\xi_j = \max_{y \in \mathcal{Y}} \left[ w^\top f(x_j, y) + \ell(y, y_j^*) \right] - w^\top f(x_j, y_j^*)$

‣ Plug in the gold y and you get 0, so slack is always nonnegative!

Page 35: Multiclass Classification

Computing the Subgradient

Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$ s.t. $\forall j \; \xi_j \geq 0$, $\forall j \, \forall y \in \mathcal{Y} \; w^\top f(x_j, y_j^*) \geq w^\top f(x_j, y) + \ell(y, y_j^*) - \xi_j$

$\xi_j = \max_{y \in \mathcal{Y}} \left[ w^\top f(x_j, y) + \ell(y, y_j^*) \right] - w^\top f(x_j, y_j^*)$

‣ If $\xi_j = 0$, the example is not a support vector and the gradient is zero

‣ Otherwise, $\frac{\partial}{\partial w_i} \xi_j = f_i(x_j, y_{\max}) - f_i(x_j, y_j^*)$ (the update looks backwards; we're minimizing here!)

‣ Perceptron-like, but we update away from the *loss-augmented* prediction

Page 36: Multiclass Classification

Putting it Together

Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$ s.t. $\forall j \; \xi_j \geq 0$, $\forall j \, \forall y \in \mathcal{Y} \; w^\top f(x_j, y_j^*) \geq w^\top f(x_j, y) + \ell(y, y_j^*) - \xi_j$

‣ (Unregularized) gradients:

‣ SVM: $f(x, y^*) - f(x, y_{\max})$ (loss-augmented max using $\ell(y, y^*)$)

‣ Logreg: $f(x, y^*) - \mathbb{E}_y[f(x, y)] = f(x, y^*) - \sum_y P_w(y|x) f(x, y)$

‣ SVM: max over y's to compute the gradient. LR: need to sum over y's

Page 37: Multiclass Classification

Softmax Margin

‣ Can we include a loss function in logistic regression?

$P(y|x) = \frac{\exp(w^\top f(x, y) + \ell(y, y^*))}{\sum_{y'} \exp(w^\top f(x, y') + \ell(y', y^*))}$

‣ Likelihood is artificially higher for things with high loss; training needs to work even harder to maximize the likelihood of the right thing!

[figure: probability mass shifted from the right answer toward high-loss wrong answers]

‣ Biased estimator for the original likelihood, but better loss. Gimpel and Smith (2010)
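
A sketch of this loss-augmented normalization (reusing the earlier scores and losses; `softmax_margin_probs` is an illustrative name):

```python
import numpy as np

def softmax_margin_probs(scores, losses):
    """Add l(y, y*) to each label's score before exp-and-normalize."""
    z = np.exp(np.array(scores) + np.array(losses))
    return z / z.sum()

scores = [2.4, 1.3, 1.8]   # Health (gold), Sports, Science
losses = [0.0, 3.0, 1.0]   # l(y, y*): zero for the gold label
print(softmax_margin_probs(scores, losses))
# the gold label's probability is pushed down relative to plain softmax,
# so maximizing its likelihood forces a larger margin over high-loss labels
```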

Page 38: Multiclass Classification

Entity Linking

Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005.

Lance Edward Armstrong is an American former professional road cyclist…

Armstrong County is a county in Pennsylvania…

??

‣ 4.5M classes, not enough data to learn features like "Tour de France <-> en/wiki/Lance_Armstrong"

‣ Instead, features f(x, y) look at the actual article associated with y

Page 39: Multiclass Classification

Entity Linking

Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005. (candidates: Lance Edward Armstrong, Armstrong County)

‣ tf-idf(doc, w) = freq of w in doc * log(4.5M / # Wiki articles w occurs in)

‣ the: occurs in every article, tf-idf = 0

‣ cyclist: occurs in 1% of articles, tf-idf = # occurrences * log10(100)

‣ tf-idf(doc) = vector of tf-idf(doc, w) for all words in the vocabulary (50,000)

‣ f(x, y) = [cos(tf-idf(x), tf-idf(y)), … other features]
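
A sketch of the tf-idf cosine feature (NumPy; the counts and document frequencies below are made-up, and the helpers are illustrative):

```python
import math
import numpy as np

def tf_idf(doc_counts, doc_freq, n_docs, vocab):
    """tf-idf(doc, w) = freq of w in doc * log(n_docs / # docs containing w)."""
    return np.array([doc_counts.get(w, 0) * math.log(n_docs / doc_freq[w])
                     for w in vocab])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

vocab = ["the", "cyclist", "county"]
doc_freq = {"the": 4_500_000, "cyclist": 45_000, "county": 90_000}  # assumed
mention_ctx = {"the": 3, "cyclist": 2}             # words around the mention x
article     = {"the": 5, "cyclist": 7, "county": 4}  # candidate article y

u = tf_idf(mention_ctx, doc_freq, 4_500_000, vocab)  # note: "the" gets 0 weight
v = tf_idf(article, doc_freq, 4_500_000, vocab)
print(cosine(u, v))   # one entry of f(x, y) for candidate article y
```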

Page 40: Multiclass Classification

Optimization

Page 41: Multiclass Classification

Recap

‣ Four elements of a machine learning method:

‣ Model: probabilistic, max-margin, deep neural network

‣ Inference: just maxes and simple expectations so far, but will get harder

‣ Training: gradient descent?

‣ Objective:

Page 42: Multiclass Classification

Optimization

‣ Gradient descent

‣ Batch update for logistic regression

‣ Each update is based on a computation over the entire dataset

[figure: loss $\mathcal{L}(w)$ as a function of w, descending toward the minimum $\mathcal{L}_{\min}$]

Page 43: Multiclass Classification

Optimization

‣ Gradient descent

‣ Batch update for logistic regression

‣ Each update is based on a computation over the entire dataset

‣ Very simple to code up

[figure: gradient descent steps on a 2D loss contour plot over (W_1, W_2); slide from Stanford CS231n Lecture 7 (Fei-Fei Li, Justin Johnson, Serena Yeung, April 24, 2018)]

Page 44: Multiclass Classification

Optimization

‣ Stochastic gradient descent: $w \leftarrow w + \alpha g, \; g = \frac{\partial}{\partial w} \mathcal{L}$

‣ Approx. gradient is computed on a single instance

Q: What if the loss changes quickly in one direction and slowly in another direction? What does gradient descent do?

(Such a loss function has a high condition number: the ratio of the largest to smallest singular value of the Hessian matrix is large.)

[figure: contour plot of such a loss; slide from Stanford CS231n Lecture 7]

Page 45: Multiclass Classification

Optimization

‣ Stochastic gradient descent: $w \leftarrow w + \alpha g, \; g = \frac{\partial}{\partial w} \mathcal{L}$

‣ Approx. gradient is computed on a single instance

Q: What if the loss changes quickly in one direction and slowly in another direction? A: Very slow progress along the shallow dimension, jitter along the steep direction.

[figure: zig-zagging descent trajectory on the high-condition-number contour plot; slide from Stanford CS231n Lecture 7]
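
To make the batch-vs-stochastic contrast concrete, a minimal sketch (assuming a hypothetical per-example gradient function `grad_one(w, x, y)` for the log likelihood; the step size 0.1 is arbitrary):

```python
import random
import numpy as np

def batch_step(w, data, grad_one, alpha=0.1):
    """One batch update: the gradient sums over the entire dataset."""
    g = sum(grad_one(w, x, y) for x, y in data)
    return w + alpha * g

def sgd_step(w, data, grad_one, alpha=0.1):
    """One stochastic update: approximate gradient from a single instance."""
    x, y = random.choice(data)
    return w + alpha * grad_one(w, x, y)   # w <- w + alpha * g
```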

Page 46: Multiclass Classification

Optimization

‣ Stochastic gradient descent: $w \leftarrow w + \alpha g, \; g = \frac{\partial}{\partial w} \mathcal{L}$

‣ Very simple to code up

‣ What if the loss function has a local minimum or saddle point?

"Identifying and attacking the saddle point problem in high-dimensional non-convex optimization," Dauphin et al. (2014)

Page 47: Multiclass Classification

Optimization

‣ Stochastic gradient descent: $w \leftarrow w + \alpha g, \; g = \frac{\partial}{\partial w} \mathcal{L}$

‣ Very simple to code up

‣ "First-order" technique: only relies on having the gradient

[figure, from Stanford CS231n Lecture 7: First-Order Optimization: (1) use the gradient to form a linear approximation of the loss, (2) step to minimize the approximation. Second-Order Optimization: (1) use the gradient and Hessian to form a quadratic approximation, (2) step to the minimum of the approximation.]

Page 48: Multiclass Classification

Optimization (extracurricular)

‣ Stochastic gradient descent: $w \leftarrow w + \alpha g, \; g = \frac{\partial}{\partial w} \mathcal{L}$

‣ Very simple to code up

‣ "First-order" technique: only relies on having the gradient

‣ Setting the step size is hard (decrease when held-out performance worsens?)

‣ Newton's method: $w \leftarrow w + \left( \frac{\partial^2}{\partial w^2} \mathcal{L} \right)^{-1} g$

‣ Second-order technique: optimizes a quadratic instantly

‣ Inverse Hessian: an n x n matrix, expensive!

‣ Quasi-Newton methods (L-BFGS, etc.) approximate the inverse Hessian

Page 49: Multiclass Classification

AdaGrad (extracurricular)

‣ Optimized for problems with sparse features

‣ Per-parameter learning rate: smaller updates are made to parameters that get updated frequently ("adaptive learning rates": element-wise scaling of the gradient based on the historical sum of squares in each dimension)

Duchi et al., "Adaptive subgradient methods for online learning and stochastic optimization," JMLR 2011

Page 50: Multiclass Classification

AdaGrad (extracurricular)

‣ Optimized for problems with sparse features

‣ Per-parameter learning rate: smaller updates are made to parameters that get updated frequently

$w_i \leftarrow w_i + \alpha \frac{1}{\sqrt{\epsilon + \sum_{\tau=1}^{t} g_{\tau,i}^2}} g_{t,i}$ (the denominator is the (smoothed) sum of squared gradients from all updates)

‣ Generally more robust than SGD, requires less tuning of the learning rate

‣ Other techniques for optimizing deep models: more later!

Duchi et al. (2011)
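
A sketch of this update (NumPy; the default `alpha` and `eps` values are assumptions, not the paper's settings):

```python
import numpy as np

class AdaGrad:
    def __init__(self, dim, alpha=0.5, eps=1e-8):
        self.alpha, self.eps = alpha, eps
        self.sum_sq = np.zeros(dim)   # historical sum of squared gradients

    def step(self, w, g):
        """Apply w_i <- w_i + alpha * g_i / sqrt(eps + sum of past g_i^2)."""
        self.sum_sq += g ** 2
        # the per-parameter rate shrinks for frequently updated parameters
        return w + self.alpha * g / np.sqrt(self.eps + self.sum_sq)

# usage: opt = AdaGrad(dim=len(w)); w = opt.step(w, gradient) each update
```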

Page 51: Multiclass Classification

Summary

‣ Design tradeoffs need to reflect interactions:

‣ Model and objective are coupled: probabilistic model <-> maximize likelihood

‣ … but not always: a linear model or neural network can be trained to minimize any differentiable loss function

‣ Inference governs learning: you need to be able to compute expectations to use logistic regression

Page 52: Multiclass Classification

Next Up

‣ You've now seen everything you need to implement multi-class classification models

‣ Next time: Neural Network Basics!

‣ In 2 weeks: Sequential Models (HMM, CRF, …) for POS tagging, NER