Page 1: Multiclass Classification

Multiclass Classification

Wei Xu (many slides from Greg Durrett, Vivek Srikumar, Stanford CS231n)

Page 2: Multiclass Classification

Administrivia

‣ Problem Set 1 graded (on Gradescope)

‣ Programming Project 1 is released (due 9/20)

‣ Reading: Eisenstein 2.0-2.5, 4.1, 4.3-4.5

‣ Optional readings related to Project 1 were posted by TA on Piazza

Page 3: Multiclass Classification

This Lecture

‣ Multiclass fundamentals

‣ Multiclass logistic regression

‣ Feature extraction

‣ Optimization

Page 4: Multiclass Classification

Multiclass Fundamentals

Page 5: Multiclass Classification

Text Classification

‣ ~20 classes, e.g.:

Sports

Health

Page 6: Multiclass Classification

Image Classification

‣ Thousands of classes (ImageNet), e.g.:

Car

Dog

Page 7: Multiclass Classification

Entity Linking

‣ 4,500,000 classes (all articles in Wikipedia)

Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005.

Lance Edward Armstrong is an American former professional road cyclist…

Armstrong County is a county in Pennsylvania…

??

Page 8: Multiclass Classification

Entity Linking

Page 9: Multiclass Classification

Binary Classification

‣ Binary classification: one weight vector defines positive and negative classes

[figure: positive (+) and negative (-) examples separated by a single linear boundary]

Page 10: Multiclass Classification

Multiclass Classification

[figure: points from three classes (1, 2, 3) in feature space]

‣ Can we just use binary classifiers here?

Page 11: Multiclass Classification

Multiclass Classification

‣ One-vs-all: train k classifiers, one to distinguish each class from all the rest

[figure: the three-class dataset shown twice, each time with a boundary separating one class from the other two]

‣ How do we reconcile multiple positive predictions? Highest score?
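
A minimal sketch of this highest-score reconciliation (NumPy; the weights and input are made-up numbers, and `ova_predict` is an illustrative helper, not course code):

```python
import numpy as np

def ova_predict(W, x):
    """W: (k, d) matrix, one binary classifier's weight vector per row.
    Reconcile one-vs-all by returning the class whose classifier scores x highest."""
    scores = W @ x          # k raw scores w_y^T x
    return int(np.argmax(scores))

# toy example: k = 3 classes, d = 2 features
W = np.array([[ 1.0, -0.5],
              [-0.3,  0.8],
              [ 0.2,  0.1]])
x = np.array([0.9, 0.4])
print(ova_predict(W, x))    # -> 0: class 1 wins with score 0.7
```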

Page 12: Multiclass Classification

Multiclass Classification

‣ Not all classes may even be separable using this approach

[figure: three classes arranged so that class 3 cannot be linearly separated from classes 1 and 2]

‣ Can separate 1 from 2+3 and 2 from 1+3, but not 3 from the others (with these features)

Page 13: Multiclass Classification

Multiclass Classification

‣ All-vs-all: train n(n-1)/2 classifiers to differentiate each pair of classes

[figure: the three-class dataset shown once per pairwise classifier, e.g., 1 vs. 3 and 1 vs. 2]

‣ Again, how to reconcile?
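
One common way to reconcile all-vs-all is majority voting over the pairwise decisions; a sketch, assuming a hypothetical `pairwise[(i, j)]` scorer that returns > 0 when it prefers class i over class j:

```python
from itertools import combinations
from collections import Counter

def ava_predict(pairwise, classes, x):
    """Each of the n(n-1)/2 pairwise classifiers votes; the most-voted class wins."""
    votes = Counter()
    for i, j in combinations(classes, 2):
        winner = i if pairwise[(i, j)](x) > 0 else j
        votes[winner] += 1
    return votes.most_common(1)[0][0]
```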

Page 14: Multiclass Classification

Multiclass Classification

[figure: the binary +/- dataset next to the three-class dataset]

‣ Binary classification: one weight vector defines both classes

‣ Multiclass classification: different weights and/or features per class

Page 15: Multiclass Classification

Multiclass Classification

‣ Formally: instead of two labels, we have an output space $\mathcal{Y}$ containing a number of possible classes

‣ Can have one weight vector per class; decision rule: $\arg\max_{y \in \mathcal{Y}} w_y^\top f(x)$

‣ Decision rule with a single weight vector: $\arg\max_{y \in \mathcal{Y}} w^\top f(x, y)$ (multiple feature vectors, one weight vector; the features depend on the choice of label now! note: this isn't the gold label)

‣ The single weight vector approach will generalize to structured output spaces, whereas per-class weight vectors won't

‣ Same machinery that we'll use later for exponentially large output spaces, including sequences and trees

Page 16: Multiclass Classification

Feature Extraction

Page 17: Multiclass Classification

Block Feature Vectors

‣ Decision rule: $\arg\max_{y \in \mathcal{Y}} w^\top f(x, y)$

"too many drug trials, too few patients" (labels: Health, Sports, Science)

‣ Base feature function: f(x) = [I[contains drug], I[contains patients], I[contains baseball]] = [1, 1, 0]

‣ Feature vector blocks for each label, e.g., I[contains drug & label = Health]:

f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]

f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]

‣ Equivalent to having three weight vectors in this case

Page 18: Multiclass Classification

Making Decisions

"too many drug trials, too few patients" (labels: Health, Sports, Science)

f(x) = [I[contains drug], I[contains patients], I[contains baseball]]

f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]

f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]

w = [+2.1, +2.3, -5, -2.1, -3.8, 0, +1.1, -1.7, -1.3]

$w^\top f(x, y)$ = Health: +4.4, Sports: -5.9, Science: -0.6; argmax: Health

(e.g., w's entry for "word drug in Science article" = +1.1)
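
A sketch of this block construction and the argmax decision (NumPy; the weights and base features are the slide's, the helper names are illustrative):

```python
import numpy as np

LABELS = ["Health", "Sports", "Science"]
base = np.array([1, 1, 0])  # [contains drug, contains patients, contains baseball]

def block_features(base, label_idx, num_labels):
    """Place the base features f(x) into the block for the chosen label y."""
    f = np.zeros(len(base) * num_labels)
    start = label_idx * len(base)
    f[start:start + len(base)] = base
    return f

w = np.array([+2.1, +2.3, -5, -2.1, -3.8, 0, +1.1, -1.7, -1.3])
scores = {lab: w @ block_features(base, i, 3) for i, lab in enumerate(LABELS)}
print(scores)                        # Health: +4.4, Sports: -5.9, Science: -0.6
print(max(scores, key=scores.get))   # -> Health
```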

Page 19: Multiclass Classification

Another Example: POS Tagging

‣ Classify blocks as one of 36 POS tags

‣ Example x: sentence with a word (in this case, blocks) highlighted: "the router blocks the packets" (DT NN VBZ DT NNS)

‣ Extract features with respect to this word:

f(x, y = VBZ) = I[curr_word = blocks & tag = VBZ], I[prev_word = router & tag = VBZ], I[next_word = the & tag = VBZ], I[curr_suffix = s & tag = VBZ]

(not saying that "the" is tagged as VBZ! saying that "the" follows the VBZ word)

‣ Next two lectures: sequence labeling!
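
A sketch of extracting these position-dependent, tag-conjoined indicator features (the feature strings mirror the slide; `pos_features` is an illustrative helper):

```python
def pos_features(words, i, tag):
    """Binary indicator features (by name) for tagging words[i] with `tag`."""
    feats = [
        f"curr_word={words[i]}&tag={tag}",
        f"curr_suffix={words[i][-1]}&tag={tag}",
    ]
    if i > 0:
        feats.append(f"prev_word={words[i-1]}&tag={tag}")
    if i < len(words) - 1:
        feats.append(f"next_word={words[i+1]}&tag={tag}")
    return feats

print(pos_features("the router blocks the packets".split(), 2, "VBZ"))
# ['curr_word=blocks&tag=VBZ', 'curr_suffix=s&tag=VBZ',
#  'prev_word=router&tag=VBZ', 'next_word=the&tag=VBZ']
```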

Page 20: Multiclass Classification

Multiclass Logistic Regression

Page 21: Multiclass Classification

Multiclass Logistic Regression

$P_w(y|x) = \frac{\exp(w^\top f(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w^\top f(x, y'))}$ (softmax function; sum over output space to normalize)

‣ Compare to binary: $P(y = 1|x) = \frac{\exp(w^\top f(x))}{1 + \exp(w^\top f(x))}$, where the negative class implicitly had f(x, y = 0) = the zero vector

Page 22: Multiclass Classification

Multiclass Logistic Regression

$P_w(y|x) = \frac{\exp(w^\top f(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w^\top f(x, y'))}$ (sum over output space to normalize)

Why? Interpret raw classifier scores as probabilities.

"too many drug trials, too few patients"

$w^\top f(x, y)$: Health: +2.2, Sports: +3.1, Science: -0.6

exp: 6.05, 22.2, 0.55 (unnormalized probabilities; probabilities must be >= 0)

normalize: 0.21, 0.77, 0.02 (probabilities must sum to 1)
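
A minimal softmax sketch of the exp-and-normalize steps. Note: the printed exp values and probabilities on this slide correspond to a Health score of 1.8 rather than the +2.2 in the header (exp(1.8) ≈ 6.05), so 1.8 is used below to reproduce the slide's numbers:

```python
import numpy as np

def softmax(scores):
    z = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return z / z.sum()

scores = np.array([1.8, 3.1, -0.6])  # Health, Sports, Science
print(np.exp(scores))    # ~[6.05, 22.2, 0.55]: unnormalized, always >= 0
print(softmax(scores))   # ~[0.21, 0.77, 0.02]: normalized, sums to 1
```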

Page 23: Multiclass Classification

Multiclass Logistic Regression

$P_w(y|x) = \frac{\exp(w^\top f(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w^\top f(x, y'))}$ (sum over output space to normalize)

‣ Training: maximize

$\mathcal{L}(x, y) = \sum_{j=1}^{n} \log P(y_j^* | x_j) = \sum_{j=1}^{n} \left( w^\top f(x_j, y_j^*) - \log \sum_{y} \exp(w^\top f(x_j, y)) \right)$ (j indexes data points)

i.e., minimize negative log likelihood or cross-entropy loss
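
A sketch of this training objective over a dataset (assuming a `feats(x, y)` function that returns f(x, y); names are illustrative):

```python
import numpy as np

def log_likelihood(w, data, feats, labels):
    """Sum over examples of log P_w(y*_j | x_j); training maximizes this."""
    total = 0.0
    for x, y_star in data:
        scores = np.array([w @ feats(x, y) for y in labels])
        # log of the normalizer, computed stably (log-sum-exp trick)
        log_z = scores.max() + np.log(np.sum(np.exp(scores - scores.max())))
        total += w @ feats(x, y_star) - log_z
    return total
```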

Page 24: Multiclass Classification

Multiclass Logistic Regression

$P_w(y|x) = \frac{\exp(w^\top f(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w^\top f(x, y'))}$ (sum over output space to normalize)

"too many drug trials, too few patients"

$w^\top f(x, y)$: Health: +2.2, Sports: +3.1, Science: -0.6

exp: 6.05, 22.2, 0.55 (unnormalized probabilities; must be >= 0)

normalize: 0.21, 0.77, 0.02 (probabilities must sum to 1)

compare to the correct (gold) probabilities: 1.00, 0.00, 0.00

$\mathcal{L}(x, y) = \sum_{j=1}^{n} \log P(y_j^* | x_j)$, with $\mathcal{L}(x_j, y_j^*) = w^\top f(x_j, y_j^*) - \log \sum_{y} \exp(w^\top f(x_j, y))$

log(0.21) = -1.56

Q: max/min of log prob.?

Page 25: Multiclass Classification

Training

‣ Multiclass logistic regression: $P_w(y|x) = \frac{\exp(w^\top f(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w^\top f(x, y'))}$

‣ Likelihood: $\mathcal{L}(x_j, y_j^*) = w^\top f(x_j, y_j^*) - \log \sum_{y} \exp(w^\top f(x_j, y))$

$\frac{\partial}{\partial w_i} \mathcal{L}(x_j, y_j^*) = f_i(x_j, y_j^*) - \frac{\sum_y f_i(x_j, y) \exp(w^\top f(x_j, y))}{\sum_y \exp(w^\top f(x_j, y))}$

$\frac{\partial}{\partial w_i} \mathcal{L}(x_j, y_j^*) = f_i(x_j, y_j^*) - \sum_y f_i(x_j, y) P_w(y|x_j)$

$\frac{\partial}{\partial w_i} \mathcal{L}(x_j, y_j^*) = f_i(x_j, y_j^*) - \mathbb{E}_y[f_i(x_j, y)]$

gold feature value minus the model's expectation of the feature value

Page 26: Multiclass Classification

Training

"too many drug trials, too few patients", y* = Health

f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]

f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]

$P_w(y|x)$ = [0.21, 0.77, 0.02]

$\frac{\partial}{\partial w_i} \mathcal{L}(x_j, y_j^*) = f_i(x_j, y_j^*) - \sum_y f_i(x_j, y) P_w(y|x_j)$

gradient: [1, 1, 0, 0, 0, 0, 0, 0, 0] - 0.21 [1, 1, 0, 0, 0, 0, 0, 0, 0] - 0.77 [0, 0, 0, 1, 1, 0, 0, 0, 0] - 0.02 [0, 0, 0, 0, 0, 0, 1, 1, 0] = [0.79, 0.79, 0, -0.77, -0.77, 0, -0.02, -0.02, 0]

update (step size 1): [1.3, 0.9, -5, 3.2, -0.1, 0, 1.1, -1.7, -1.3] + [0.79, 0.79, 0, -0.77, -0.77, 0, -0.02, -0.02, 0] = [2.09, 1.69, -5, 2.43, -0.87, 0, 1.08, -1.72, -1.3]

new $P_w(y|x)$ = [0.89, 0.10, 0.01]
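
A sketch checking this slide's arithmetic (NumPy; the probabilities [0.21, 0.77, 0.02] are taken from the slide as given, everything else follows from them):

```python
import numpy as np

F = np.array([[1, 1, 0, 0, 0, 0, 0, 0, 0],   # f(x, Health)
              [0, 0, 0, 1, 1, 0, 0, 0, 0],   # f(x, Sports)
              [0, 0, 0, 0, 0, 0, 1, 1, 0]])  # f(x, Science)
w = np.array([1.3, 0.9, -5, 3.2, -0.1, 0, 1.1, -1.7, -1.3])
p = np.array([0.21, 0.77, 0.02])      # P_w(y|x) from the slide

grad = F[0] - p @ F                   # f(x, y*) - E_y[f(x, y)], y* = Health
print(grad)                           # [0.79, 0.79, 0, -0.77, -0.77, 0, -0.02, -0.02, 0]

w = w + grad                          # gradient ascent step, step size 1
z = np.exp(F @ w)
print(z / z.sum())                    # ~[0.89, 0.10, 0.01]: more mass on Health
```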

Page 27: Multiclass Classification

Multiclass Logistic Regression: Summary

‣ Model: $P_w(y|x) = \frac{\exp(w^\top f(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w^\top f(x, y'))}$

‣ Inference: $\arg\max_y P_w(y|x)$

‣ Learning: gradient ascent on the discriminative log likelihood

$f(x, y^*) - \mathbb{E}_y[f(x, y)] = f(x, y^*) - \sum_y P_w(y|x) f(x, y)$

"towards gold feature value, away from expectation of feature value"

Page 28: Multiclass Classification

Multiclass SVM

Page 29: Multiclass Classification

Soft Margin SVM

Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$

s.t. $\forall j \; (2y_j - 1)(w^\top x_j) \geq 1 - \xi_j$ and $\forall j \; \xi_j \geq 0$

slack variables $\xi_j > 0$ iff the example is a support vector

Image credit: Lang Van Tran

Page 30: Multiclass Classification

Multiclass SVM

Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$

s.t. $\forall j \; \xi_j \geq 0$

Binary constraint: $\forall j \; (2y_j - 1)(w^\top x_j) \geq 1 - \xi_j$

Multiclass constraint: $\forall j \, \forall y \in \mathcal{Y} \; w^\top f(x_j, y_j^*) \geq w^\top f(x_j, y) + \ell(y, y_j^*) - \xi_j$

‣ The correct prediction now has to beat every other class

‣ The 1 that was in the binary constraint is replaced by a loss function

‣ The score comparison is more explicit now

slack variables $\xi_j > 0$ iff the example is a support vector

Page 31: Multiclass Classification

Training (loss-augmented)

‣ Are all decisions equally costly?

"too many drug trials, too few patients" (gold label: Health)

Predicted Science: not so bad

Predicted Sports: bad error

‣ We can define a loss function $\ell(y, y^*)$:

$\ell(\text{Sports}, \text{Health}) = 3$

$\ell(\text{Science}, \text{Health}) = 1$

Page 32: Multiclass Classification

Loss-Augmented Decoding

$\forall j \, \forall y \in \mathcal{Y} \; w^\top f(x_j, y_j^*) \geq w^\top f(x_j, y) + \ell(y, y_j^*) - \xi_j$

$w^\top f(x, y) + \ell(y, y^*)$: Health: 2.4 + 0, Sports: 1.3 + 3, Science: 1.8 + 1

‣ Does gold beat every label + loss? No!

‣ Most violated constraint is Sports; what is $\xi_j$?

‣ $\xi_j$ = 4.3 - 2.4 = 1.9

‣ Perceptron would make no update here

Page 33: Multiclass Classification

Loss-Augmented Decoding

"too many drug trials, too few patients" (argmax of $w^\top f(x, y)$: Health)

                    Health   Sports   Science
$w^\top f(x, y)$    +2.4     +1.3     +1.8
Loss                0        3        1
Total               2.4      4.3      2.8

$\xi_j = \max_{y \in \mathcal{Y}} \left[ w^\top f(x_j, y) + \ell(y, y_j^*) \right] - w^\top f(x_j, y_j^*)$

‣ Sports is the most violated constraint, slack = 4.3 - 2.4 = 1.9

‣ Perceptron would make no update; a regular SVM would pick Science
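
A sketch of loss-augmented decoding with this slide's numbers (gold = Health; variable names are illustrative):

```python
scores = {"Health": 2.4, "Sports": 1.3, "Science": 1.8}  # w^T f(x, y)
loss   = {"Health": 0.0, "Sports": 3.0, "Science": 1.0}  # l(y, y*)
gold = "Health"

augmented = {y: scores[y] + loss[y] for y in scores}   # totals: 2.4, 4.3, 2.8
y_max = max(augmented, key=augmented.get)              # most violated: Sports
slack = augmented[y_max] - scores[gold]                # 4.3 - 2.4 = 1.9
print(y_max, slack)
```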

Page 34: Multiclass Classification

Multiclass SVM

Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$

s.t. $\forall j \; \xi_j \geq 0$ and $\forall j \, \forall y \in \mathcal{Y} \; w^\top f(x_j, y_j^*) \geq w^\top f(x_j, y) + \ell(y, y_j^*) - \xi_j$

‣ One slack variable per example, so it's set to be whatever the most violated constraint is for that example:

$\xi_j = \max_{y \in \mathcal{Y}} \left[ w^\top f(x_j, y) + \ell(y, y_j^*) \right] - w^\top f(x_j, y_j^*)$

‣ Plug in the gold y and you get 0, so slack is always nonnegative!

Page 35: Multiclass Classification

Computing the Subgradient

Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$ s.t. $\forall j \; \xi_j \geq 0$, $\forall j \, \forall y \in \mathcal{Y} \; w^\top f(x_j, y_j^*) \geq w^\top f(x_j, y) + \ell(y, y_j^*) - \xi_j$

$\xi_j = \max_{y \in \mathcal{Y}} \left[ w^\top f(x_j, y) + \ell(y, y_j^*) \right] - w^\top f(x_j, y_j^*)$

‣ If $\xi_j = 0$, the example is not a support vector and the gradient is zero

‣ Otherwise, $\frac{\partial}{\partial w_i} \xi_j = f_i(x_j, y_{\max}) - f_i(x_j, y_j^*)$ (the update looks backwards; we're minimizing here!)

‣ Perceptron-like, but we update away from the *loss-augmented* prediction

Page 36: Multiclass Classification

Putting it Together

Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$ s.t. $\forall j \; \xi_j \geq 0$, $\forall j \, \forall y \in \mathcal{Y} \; w^\top f(x_j, y_j^*) \geq w^\top f(x_j, y) + \ell(y, y_j^*) - \xi_j$

‣ (Unregularized) gradients:

‣ SVM: $f(x, y^*) - f(x, y_{\max})$ (loss-augmented max using $\ell(y, y^*)$)

‣ Logreg: $f(x, y^*) - \mathbb{E}_y[f(x, y)] = f(x, y^*) - \sum_y P_w(y|x) f(x, y)$

‣ SVM: max over y's to compute the gradient. LR: need to sum over y's

Page 37: Multiclass Classification

Softmax Margin

‣ Can we include a loss function in logistic regression?

$P(y|x) = \frac{\exp(w^\top f(x, y) + \ell(y, y^*))}{\sum_{y'} \exp(w^\top f(x, y') + \ell(y', y^*))}$

‣ Likelihood is artificially higher for things with high loss; training needs to work even harder to maximize the likelihood of the right thing!

[figure: probability mass shifted from the right answer toward high-loss wrong answers]

‣ Biased estimator for the original likelihood, but better loss. Gimpel and Smith (2010)
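
A sketch of this loss-augmented normalization (reusing the earlier scores and losses; `softmax_margin_probs` is an illustrative name):

```python
import numpy as np

def softmax_margin_probs(scores, losses):
    """Add l(y, y*) to each label's score before exp-and-normalize."""
    z = np.exp(np.array(scores) + np.array(losses))
    return z / z.sum()

scores = [2.4, 1.3, 1.8]   # Health (gold), Sports, Science
losses = [0.0, 3.0, 1.0]   # l(y, y*): zero for the gold label
print(softmax_margin_probs(scores, losses))
# the gold label's probability is pushed down relative to plain softmax,
# so maximizing its likelihood forces a larger margin over high-loss labels
```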

Page 38: Multiclass Classification

Entity Linking

Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005.

Lance Edward Armstrong is an American former professional road cyclist…

Armstrong County is a county in Pennsylvania…

??

‣ 4.5M classes, not enough data to learn features like "Tour de France <-> en/wiki/Lance_Armstrong"

‣ Instead, features f(x, y) look at the actual article associated with y

Page 39: Multiclass Classification

Entity Linking

Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005. (candidates: Lance Edward Armstrong, Armstrong County)

‣ tf-idf(doc, w) = freq of w in doc * log(4.5M / # Wiki articles w occurs in)

‣ the: occurs in every article, tf-idf = 0

‣ cyclist: occurs in 1% of articles, tf-idf = # occurrences * log10(100)

‣ tf-idf(doc) = vector of tf-idf(doc, w) for all words in the vocabulary (50,000)

‣ f(x, y) = [cos(tf-idf(x), tf-idf(y)), … other features]
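
A sketch of the tf-idf cosine feature (NumPy; the counts and document frequencies below are made-up, and the helpers are illustrative):

```python
import math
import numpy as np

def tf_idf(doc_counts, doc_freq, n_docs, vocab):
    """tf-idf(doc, w) = freq of w in doc * log(n_docs / # docs containing w)."""
    return np.array([doc_counts.get(w, 0) * math.log(n_docs / doc_freq[w])
                     for w in vocab])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

vocab = ["the", "cyclist", "county"]
doc_freq = {"the": 4_500_000, "cyclist": 45_000, "county": 90_000}  # assumed
mention_ctx = {"the": 3, "cyclist": 2}             # words around the mention x
article     = {"the": 5, "cyclist": 7, "county": 4}  # candidate article y

u = tf_idf(mention_ctx, doc_freq, 4_500_000, vocab)  # note: "the" gets 0 weight
v = tf_idf(article, doc_freq, 4_500_000, vocab)
print(cosine(u, v))   # one entry of f(x, y) for candidate article y
```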

Page 40: Multiclass Classification

Optimization

Page 41: Multiclass Classification

Recap

‣ Four elements of a machine learning method:

‣ Model: probabilistic, max-margin, deep neural network

‣ Inference: just maxes and simple expectations so far, but will get harder

‣ Training: gradient descent?

‣ Objective:

Page 42: Multiclass Classification

Optimization

‣ Gradient descent

‣ Batch update for logistic regression

‣ Each update is based on a computation over the entire dataset

[figure: loss $\mathcal{L}(w)$ as a function of w, descending toward the minimum $\mathcal{L}_{\min}$]

Page 43: Multiclass Classification

Optimization

‣ Gradient descent

‣ Batch update for logistic regression

‣ Each update is based on a computation over the entire dataset

‣ Very simple to code up

[figure: gradient descent steps on a 2D loss contour plot over (W_1, W_2); slide from Stanford CS231n Lecture 7 (Fei-Fei Li, Justin Johnson, Serena Yeung, April 24, 2018)]

Page 44: Multiclass Classification

Optimization

‣ Stochastic gradient descent: $w \leftarrow w + \alpha g, \; g = \frac{\partial}{\partial w} \mathcal{L}$

‣ Approx. gradient is computed on a single instance

Q: What if the loss changes quickly in one direction and slowly in another direction? What does gradient descent do?

(Such a loss function has a high condition number: the ratio of the largest to smallest singular value of the Hessian matrix is large.)

[figure: contour plot of such a loss; slide from Stanford CS231n Lecture 7]

Page 45: Multiclass Classification

Optimization

‣ Stochastic gradient descent: $w \leftarrow w + \alpha g, \; g = \frac{\partial}{\partial w} \mathcal{L}$

‣ Approx. gradient is computed on a single instance

Q: What if the loss changes quickly in one direction and slowly in another direction? A: Very slow progress along the shallow dimension, jitter along the steep direction.

[figure: zig-zagging descent trajectory on the high-condition-number contour plot; slide from Stanford CS231n Lecture 7]
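
To make the batch-vs-stochastic contrast concrete, a minimal sketch (assuming a hypothetical per-example gradient function `grad_one(w, x, y)` for the log likelihood; the step size 0.1 is arbitrary):

```python
import random
import numpy as np

def batch_step(w, data, grad_one, alpha=0.1):
    """One batch update: the gradient sums over the entire dataset."""
    g = sum(grad_one(w, x, y) for x, y in data)
    return w + alpha * g

def sgd_step(w, data, grad_one, alpha=0.1):
    """One stochastic update: approximate gradient from a single instance."""
    x, y = random.choice(data)
    return w + alpha * grad_one(w, x, y)   # w <- w + alpha * g
```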

Page 46: Multiclass Classification

Optimization

‣ Stochastic gradient descent: $w \leftarrow w + \alpha g, \; g = \frac{\partial}{\partial w} \mathcal{L}$

‣ Very simple to code up

‣ What if the loss function has a local minimum or saddle point?

"Identifying and attacking the saddle point problem in high-dimensional non-convex optimization," Dauphin et al. (2014)

Page 47: Multiclass Classification

Optimization

‣ Stochastic gradient descent: $w \leftarrow w + \alpha g, \; g = \frac{\partial}{\partial w} \mathcal{L}$

‣ Very simple to code up

‣ "First-order" technique: only relies on having the gradient

[figure, from Stanford CS231n Lecture 7: First-Order Optimization: (1) use the gradient to form a linear approximation of the loss, (2) step to minimize the approximation. Second-Order Optimization: (1) use the gradient and Hessian to form a quadratic approximation, (2) step to the minimum of the approximation.]

Page 48: Multiclass Classification

Optimization (extracurricular)

‣ Stochastic gradient descent: $w \leftarrow w + \alpha g, \; g = \frac{\partial}{\partial w} \mathcal{L}$

‣ Very simple to code up

‣ "First-order" technique: only relies on having the gradient

‣ Setting the step size is hard (decrease when held-out performance worsens?)

‣ Newton's method: $w \leftarrow w + \left( \frac{\partial^2}{\partial w^2} \mathcal{L} \right)^{-1} g$

‣ Second-order technique: optimizes a quadratic instantly

‣ Inverse Hessian: an n x n matrix, expensive!

‣ Quasi-Newton methods (L-BFGS, etc.) approximate the inverse Hessian

Page 49: Multiclass Classification

AdaGrad (extracurricular)

‣ Optimized for problems with sparse features

‣ Per-parameter learning rate: smaller updates are made to parameters that get updated frequently ("adaptive learning rates": element-wise scaling of the gradient based on the historical sum of squares in each dimension)

Duchi et al., "Adaptive subgradient methods for online learning and stochastic optimization," JMLR 2011

Page 50: Multiclass Classification

AdaGrad (extracurricular)

‣ Optimized for problems with sparse features

‣ Per-parameter learning rate: smaller updates are made to parameters that get updated frequently

$w_i \leftarrow w_i + \alpha \frac{1}{\sqrt{\epsilon + \sum_{\tau=1}^{t} g_{\tau,i}^2}} g_{t,i}$ (the denominator is the (smoothed) sum of squared gradients from all updates)

‣ Generally more robust than SGD, requires less tuning of the learning rate

‣ Other techniques for optimizing deep models: more later!

Duchi et al. (2011)
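
A sketch of this update (NumPy; the default `alpha` and `eps` values are assumptions, not the paper's settings):

```python
import numpy as np

class AdaGrad:
    def __init__(self, dim, alpha=0.5, eps=1e-8):
        self.alpha, self.eps = alpha, eps
        self.sum_sq = np.zeros(dim)   # historical sum of squared gradients

    def step(self, w, g):
        """Apply w_i <- w_i + alpha * g_i / sqrt(eps + sum of past g_i^2)."""
        self.sum_sq += g ** 2
        # the per-parameter rate shrinks for frequently updated parameters
        return w + self.alpha * g / np.sqrt(self.eps + self.sum_sq)

# usage: opt = AdaGrad(dim=len(w)); w = opt.step(w, gradient) each update
```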

Page 51: Multiclass Classification

Summary

‣ Design tradeoffs need to reflect interactions:

‣ Model and objective are coupled: probabilistic model <-> maximize likelihood

‣ … but not always: a linear model or neural network can be trained to minimize any differentiable loss function

‣ Inference governs learning: you need to be able to compute expectations to use logistic regression

Page 52: Multiclass Classification

Next Up

‣ You've now seen everything you need to implement multi-class classification models

‣ Next time: Neural Network Basics!

‣ In 2 weeks: Sequential Models (HMM, CRF, …) for POS tagging, NER