CS395T: Structured Models for NLP, Lecture 2: Binary Classification. Greg Durrett. Some slides adapted from Vivek Srikumar, University of Utah.

CS395T: Structured Models for NLP Lecture 2: Binary ...gdurrett/courses/fa2018/lectures/lec2-1… · the film was beauCful, stunning cinematography and gorgeous sets, but boring

Jul 26, 2020


Page 1: Title slide

CS395T: Structured Models for NLP
Lecture 2: Binary Classification

Greg Durrett

Some slides adapted from Vivek Srikumar, University of Utah

Page 2: Administrivia

‣ Course enrollment

‣ OHs this week: Jifan 1pm-2pm Tues (today) in GDC 1.304 TA desk #1; Greg 11am-12pm Weds + 10am-11am Fri in GDC 3.420

‣ Readings on course website

‣ Mini 1 is out, due September 11

‣ Feel free to extend the code as needed; optimizers, featurization, etc. aren't set in stone

Page 3: This Lecture

‣ Linear classification fundamentals

‣ Three discriminative models: logistic regression, perceptron, SVM

‣ Naive Bayes, maximum likelihood in generative models

‣ Different motivations but very similar update rules / inference!

Page 4: Classification

Page 5: Classification

‣ Datapoint $x$ with label $y \in \{0, 1\}$

‣ Embed datapoint in a feature space $f(x) \in \mathbb{R}^n$, but in this lecture $f(x)$ and $x$ are interchangeable

‣ Linear decision rule: $w^\top f(x) + b > 0$

‣ Can delete bias if we augment feature space: $w^\top f(x) > 0$, e.g. $f(x) = [0.5, 1.6, 0.3]$ becomes $[0.5, 1.6, 0.3, 1]$

[Figure: + and - points in feature space separated by a linear decision boundary]
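The bias-augmentation step can be checked with a few lines of plain Python. This is a minimal sketch: the feature vector [0.5, 1.6, 0.3] comes from the slide, while the weights `w` and bias `b` are made-up values chosen only for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def augment(features):
    """Append a constant 1 so the bias can live inside the weight vector."""
    return features + [1.0]


f_x = [0.5, 1.6, 0.3]   # feature vector from the slide
w = [1.0, -0.5, 2.0]    # hypothetical weights (assumption)
b = -0.2                # hypothetical bias (assumption)

# Original rule: w . f(x) + b
score_with_bias = dot(w, f_x) + b

# Augmented rule: the bias becomes the weight on the constant feature.
score_augmented = dot(w + [b], augment(f_x))
```

Both scores agree, so the decision `score > 0` is unchanged; the bias has simply become one more weight.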

Page 6: Linear functions are powerful!

[Figure: scatter plots of + and - points. One panel is labeled $f(x) = [x_1, x_2]$ (axes $x_1$, $x_2$, marked ???); the other is labeled $f(x) = [x_1, x_2, x_1^2, x_2^2, x_1 x_2]$ (axes $x_1$, $x_1 x_2$)]

‣ “Kernel trick” does this for “free,” but is too expensive to use in NLP applications: training is $O(n^2)$ instead of $O(n \cdot (\text{num feats}))$
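To make the feature-map idea concrete, here is a small sketch using the slide's map $f(x) = [x_1, x_2, x_1^2, x_2^2, x_1 x_2]$. The XOR-style data and the hand-picked weight vector are assumptions for illustration; they are not the points plotted on the slide.

```python
def quadratic_features(x1, x2):
    """The slide's feature map f(x) = [x1, x2, x1^2, x2^2, x1*x2]."""
    return [x1, x2, x1 * x1, x2 * x2, x1 * x2]


def predict(w, feats):
    """Linear decision rule w . f(x) > 0."""
    return sum(wi * fi for wi, fi in zip(w, feats)) > 0


# XOR-style toy data (assumption): the label is positive exactly when x1 and
# x2 share a sign, so no line in the raw (x1, x2) plane separates the classes.
DATA = [((1, 1), True), ((-1, -1), True), ((1, -1), False), ((-1, 1), False)]

# Hand-picked weights that only look at the x1*x2 coordinate.
W = [0.0, 0.0, 0.0, 0.0, 1.0]

ALL_CORRECT = all(predict(W, quadratic_features(*x)) == y for x, y in DATA)
```

The classifier is still linear in $f(x)$; only the feature space changed.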

Page 7: Classification: Sentiment Analysis

this movie was great! would watch again   Positive

that film was awful, I'll never watch again   Negative

‣ Surface cues can basically tell you what's going on here: presence or absence of certain words (great, awful)

‣ Steps to classification:

  ‣ Turn examples like this into feature vectors

  ‣ Pick a model / learning algorithm

  ‣ Train weights on data to get our classifier

Page 8: Feature Representation

this movie was great! would watch again   Positive

‣ Convert this example to a vector using bag-of-words features

           [contains the]  [contains a]  [contains was]  [contains movie]  [contains film]  …
           position 0      position 1    position 2      position 3        position 4
f(x) = [   0               0             1               1                 0                …  ]

‣ Very large vector space (size of vocabulary), sparse features

‣ Requires indexing the features (mapping them to axes)

‣ More sophisticated feature mappings possible (tf-idf), as well as lots of other features: character n-grams, parts of speech, lemmas, …
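The indexing and featurization steps can be sketched as follows. The tokenizer is a deliberately crude assumption (lowercase, letters and apostrophes only), and the five-word vocabulary is just the five columns shown on the slide.

```python
import re


def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes; a very simple tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def build_index(vocab):
    """Map each vocabulary word to an axis (position) in the feature vector."""
    return {word: i for i, word in enumerate(vocab)}


def bag_of_words(text, index):
    """Binary 'contains word' features as a dense list."""
    vec = [0] * len(index)
    for tok in tokenize(text):
        if tok in index:
            vec[index[tok]] = 1
    return vec


INDEX = build_index(["the", "a", "was", "movie", "film"])
VEC = bag_of_words("this movie was great! would watch again", INDEX)
```

`VEC` reproduces the slide's row `0 0 1 1 0`. With a full vocabulary almost every entry is 0, so a practical implementation would likely store only the indices of the nonzero features.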

Page 9: Naive Bayes

Page 10: Naive Bayes

‣ Data point $x = (x_1, \ldots, x_n)$, label $y \in \{0, 1\}$

‣ Formulate a probabilistic model that places a distribution $P(x, y)$

‣ Compute $P(y|x)$, predict $\arg\max_y P(y|x)$ to classify

$$P(y|x) = \frac{P(y) P(x|y)}{P(x)} \quad \text{(Bayes' Rule)}$$

$$\propto P(y) P(x|y) \quad \text{($P(x)$ is constant: irrelevant for finding the max)}$$

$$= P(y) \prod_{i=1}^{n} P(x_i|y) \quad \text{(“Naive” assumption)}$$

[Figure: graphical model in which $y$ generates each $x_i$, with the $x_i$ inside a plate of size $n$]

$$\arg\max_y P(y|x) = \arg\max_y \log P(y|x) = \arg\max_y \left[ \log P(y) + \sum_{i=1}^{n} \log P(x_i|y) \right] \quad \text{(linear model!)}$$

Page 11: Naive Bayes Example

it was great   $P(y|x) \propto [\quad]$

$$P(y|x) \propto P(y) \prod_{i=1}^{n} P(x_i|y)$$

$$\arg\max_y P(y|x) = \arg\max_y \log P(y|x) = \arg\max_y \left[ \log P(y) + \sum_{i=1}^{n} \log P(x_i|y) \right]$$

Page 12: Maximum Likelihood Estimation

‣ Data points $(x_j, y_j)$ provided ($j$ indexes over examples)

‣ Find values of $P(y)$, $P(x_i|y)$ that maximize data likelihood (generative):

$$\prod_{j=1}^{m} P(y_j, x_j) = \prod_{j=1}^{m} P(y_j) \left[ \prod_{i=1}^{n} P(x_{ji}|y_j) \right]$$

The outer product runs over data points ($j$) and the inner product over features ($i$); $x_{ji}$ is the $i$th feature of the $j$th example.

Page 13: Maximum Likelihood Estimation

‣ Imagine a coin flip which is heads with probability $p$

‣ Observe (H, H, H, T) and maximize likelihood: $\prod_{j=1}^{m} P(y_j) = p^3 (1 - p)$

‣ Easier: maximize log likelihood

$$\sum_{j=1}^{m} \log P(y_j) = 3 \log p + \log(1 - p)$$

[Figure: log likelihood plotted against $p$ from 0 to 1, with its maximum at P(H) = 0.75]

‣ Maximum likelihood parameters for binomial/multinomial = read counts off of the data + normalize
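Setting the derivative of $3 \log p + \log(1 - p)$ to zero gives $3/p = 1/(1 - p)$, i.e. $p = 3/4$, which is exactly "count and normalize". The sketch below checks this numerically; the grid search is only a sanity check added for illustration, not something the lecture prescribes.

```python
import math

FLIPS = ["H", "H", "H", "T"]   # observations from the slide


def log_likelihood(p, flips):
    """Sum of log P(y_j) for a coin that lands heads with probability p."""
    return sum(math.log(p if y == "H" else 1.0 - p) for y in flips)


# Closed form: read counts off the data and normalize.
P_MLE = FLIPS.count("H") / len(FLIPS)

# Brute-force check over candidate values (endpoints excluded so log is defined).
GRID = [k / 100 for k in range(1, 100)]
P_GRID = max(GRID, key=lambda p: log_likelihood(p, FLIPS))
```

Both `P_MLE` and `P_GRID` come out as 0.75, matching the maximum marked on the slide's plot.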

Page 14: Maximum Likelihood Estimation

‣ Data points $(x_j, y_j)$ provided ($j$ indexes over examples)

‣ Find values of $P(y)$, $P(x_i|y)$ that maximize data likelihood (generative):

$$\prod_{j=1}^{m} P(y_j, x_j) = \prod_{j=1}^{m} P(y_j) \left[ \prod_{i=1}^{n} P(x_{ji}|y_j) \right]$$

The outer product runs over data points ($j$) and the inner product over features ($i$); $x_{ji}$ is the $i$th feature of the $j$th example.

‣ Equivalent to maximizing logarithm of data likelihood:

$$\sum_{j=1}^{m} \log P(y_j, x_j) = \sum_{j=1}^{m} \left[ \log P(y_j) + \sum_{i=1}^{n} \log P(x_{ji}|y_j) \right]$$

Page 15: Maximum Likelihood for Naive Bayes

this movie was great! would watch again   +

that film was awful, I'll never watch again   -

I didn't really like that movie   -

dry and a bit distasteful, it misses the mark   -

great potential but ended up being a flop   -

I liked it well enough for an action flick   +

I expected a great film and left happy   +

brilliant directing and stunning visuals   +

$$P(+) = \frac{1}{2} \qquad P(-) = \frac{1}{2} \qquad P(\text{great}|+) = \frac{1}{2} \qquad P(\text{great}|-) = \frac{1}{4}$$

it was great:

$$P(y|x) \propto \begin{bmatrix} P(+) P(\text{great}|+) \\ P(-) P(\text{great}|-) \end{bmatrix} = \begin{bmatrix} 1/4 \\ 1/8 \end{bmatrix}, \quad \text{which normalizes to} \quad \begin{bmatrix} 2/3 \\ 1/3 \end{bmatrix}$$
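The counts on this slide can be reproduced directly. The sketch below uses the eight training sentences as read from the slide (the sentence boundaries and labels are a reconstruction from the extracted text), exact fractions instead of floats, and, like the slide, scores "it was great" using only the word "great". The tokenizer and helper names are assumptions.

```python
import re
from fractions import Fraction

# Toy training set transcribed from the slide; labels are "+" or "-".
TRAIN = [
    ("this movie was great! would watch again", "+"),
    ("that film was awful, I'll never watch again", "-"),
    ("I didn't really like that movie", "-"),
    ("dry and a bit distasteful, it misses the mark", "-"),
    ("great potential but ended up being a flop", "-"),
    ("I liked it well enough for an action flick", "+"),
    ("I expected a great film and left happy", "+"),
    ("brilliant directing and stunning visuals", "+"),
]


def tokens(text):
    """Set of lowercase word tokens in a sentence."""
    return set(re.findall(r"[a-z']+", text.lower()))


def prior(label):
    """P(y): fraction of training examples with this label."""
    return Fraction(sum(1 for _, y in TRAIN if y == label), len(TRAIN))


def p_word_given(word, label):
    """P(word present | y): fraction of that class's examples containing the word."""
    docs = [tokens(t) for t, y in TRAIN if y == label]
    return Fraction(sum(1 for d in docs if word in d), len(docs))


def posterior_from_word(word):
    """Normalize P(y) * P(word | y) over the two labels."""
    scores = {y: prior(y) * p_word_given(word, y) for y in ("+", "-")}
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}


POSTERIOR = posterior_from_word("great")
```

`POSTERIOR` holds 2/3 for "+" and 1/3 for "-", the same numbers as the slide.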

Page 16: Naive Bayes: Summary

‣ Model

$$P(x, y) = P(y) \prod_{i=1}^{n} P(x_i|y)$$

[Figure: graphical model in which $y$ generates each $x_i$, with the $x_i$ inside a plate of size $n$]

‣ Learning: maximize $P(x, y)$ by reading counts off the data

‣ Inference

$$\arg\max_y \log P(y|x) = \arg\max_y \left[ \log P(y) + \sum_{i=1}^{n} \log P(x_i|y) \right]$$

‣ Alternatively: $\log P(y = +|x) - \log P(y = -|x) > 0$

$$\iff \log \frac{P(y = +)}{P(y = -)} + \sum_{i=1}^{n} \log \frac{P(x_i|y = +)}{P(x_i|y = -)} > 0$$

The first term is the ratio of the class priors: the $P(x)$ normalizer cancels when the two log posteriors are subtracted.

Page 17: Problems with Naive Bayes

the film was beautiful, stunning cinematography and gorgeous sets, but boring   -

$P(x_{\text{beautiful}}|+) = 0.1$      $P(x_{\text{beautiful}}|-) = 0.01$

$P(x_{\text{stunning}}|+) = 0.1$       $P(x_{\text{stunning}}|-) = 0.01$

$P(x_{\text{gorgeous}}|+) = 0.1$       $P(x_{\text{gorgeous}}|-) = 0.01$

$P(x_{\text{boring}}|+) = 0.01$        $P(x_{\text{boring}}|-) = 0.1$

‣ Correlated features compound: beautiful and gorgeous are not independent!

‣ Naive Bayes is naive, but another problem is that it's generative: spends capacity modeling P(x,y), when what we care about is P(y|x)

‣ Discriminative models model P(y|x) directly (SVMs, most neural networks, …)
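Plugging the slide's numbers into the Naive Bayes log-odds shows how the three near-synonyms compound. This is a sketch under one assumption that is not on the slide: equal class priors, P(+) = P(-) = 0.5.

```python
import math

# Per-word likelihoods from the slide.
P_WORD = {
    "beautiful": {"+": 0.1, "-": 0.01},
    "stunning": {"+": 0.1, "-": 0.01},
    "gorgeous": {"+": 0.1, "-": 0.01},
    "boring": {"+": 0.01, "-": 0.1},
}
PRIOR = {"+": 0.5, "-": 0.5}   # assumption: equal priors


def log_odds(words):
    """log P(+|x) - log P(-|x) under the Naive Bayes factorization."""
    total = math.log(PRIOR["+"] / PRIOR["-"])
    for w in words:
        total += math.log(P_WORD[w]["+"] / P_WORD[w]["-"])
    return total


def posterior_positive(words):
    """Convert the log-odds back into P(+|x) with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds(words)))


# All four cue words: each synonym multiplies the odds by 10.
P_POS = posterior_positive(["beautiful", "stunning", "gorgeous", "boring"])

# Counting the correlated praise words once leaves the evidence balanced.
P_POS_ONE_SYNONYM = posterior_positive(["beautiful", "boring"])
```

With all four words the model assigns about 0.99 to the positive class even though the review is negative; with a single praise word against "boring" the posterior is 0.5.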

Page 18: Logistic Regression

Page 19: Logistic Regression

$$P(y = +|x) = \text{logistic}(w^\top x)$$

$$P(y = +|x) = \frac{\exp\left(\sum_{i=1}^{n} w_i x_i\right)}{1 + \exp\left(\sum_{i=1}^{n} w_i x_i\right)}$$

‣ To learn weights: maximize discriminative log likelihood of data P(y|x)

$$L(x_j, y_j = +) = \log P(y_j = +|x_j) = \sum_{i=1}^{n} w_i x_{ji} - \log\left(1 + \exp\left(\sum_{i=1}^{n} w_i x_{ji}\right)\right)$$

Here the sums run over features.

Page 20: Logistic Regression

$$L(x_j, y_j = +) = \log P(y_j = +|x_j) = \sum_{i=1}^{n} w_i x_{ji} - \log\left(1 + \exp\left(\sum_{i=1}^{n} w_i x_{ji}\right)\right)$$

$$\frac{\partial L(x_j, y_j)}{\partial w_i} = x_{ji} - \frac{\partial}{\partial w_i} \log\left(1 + \exp\left(\sum_{i=1}^{n} w_i x_{ji}\right)\right)$$

$$= x_{ji} - \frac{1}{1 + \exp\left(\sum_{i=1}^{n} w_i x_{ji}\right)} \frac{\partial}{\partial w_i} \left(1 + \exp\left(\sum_{i=1}^{n} w_i x_{ji}\right)\right) \quad \text{(deriv of log)}$$

$$= x_{ji} - \frac{1}{1 + \exp\left(\sum_{i=1}^{n} w_i x_{ji}\right)} \, x_{ji} \exp\left(\sum_{i=1}^{n} w_i x_{ji}\right) \quad \text{(deriv of exp)}$$

$$= x_{ji} - x_{ji} \frac{\exp\left(\sum_{i=1}^{n} w_i x_{ji}\right)}{1 + \exp\left(\sum_{i=1}^{n} w_i x_{ji}\right)}$$

$$= x_{ji} (1 - P(y_j = +|x_j))$$
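A derivation like this is easy to sanity-check numerically. The sketch below compares the analytic gradient $x_{ji}(1 - P(y_j = +|x_j))$ against central finite differences of the log likelihood; the test point `W` and feature vector `X` are arbitrary values chosen for illustration.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_lik_positive(w, x):
    """L(x, y=+) = w.x - log(1 + exp(w.x))."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return z - math.log(1.0 + math.exp(z))


def grad_positive(w, x):
    """Analytic gradient from the derivation: x_i * (1 - P(y=+|x))."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return [xi * (1.0 - logistic(z)) for xi in x]


def numeric_grad(f, w, x, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grads = []
    for i in range(len(w)):
        up = w[:i] + [w[i] + eps] + w[i + 1:]
        down = w[:i] + [w[i] - eps] + w[i + 1:]
        grads.append((f(up, x) - f(down, x)) / (2 * eps))
    return grads


W = [0.3, -1.2, 0.5]   # arbitrary test point (assumption)
X = [1.0, 0.0, 2.0]    # arbitrary feature vector (assumption)
ANALYTIC = grad_positive(W, X)
NUMERIC = numeric_grad(log_lik_positive, W, X)
MAX_ERR = max(abs(a - n) for a, n in zip(ANALYTIC, NUMERIC))
```

`MAX_ERR` should be tiny (limited by floating-point error), which supports the closed form above.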

Page 21: Logistic Regression

‣ Gradient of $w_i$ on positive example $= x_{ji}(y_j - P(y_j = +|x_j))$

  If P(+) is close to 1, make very little update. Otherwise make $w_i$ look more like $x_{ji}$, which will increase P(+)

‣ Gradient of $w_i$ on negative example $= x_{ji}(-P(y_j = +|x_j))$

  If P(+) is close to 0, make very little update. Otherwise make $w_i$ look less like $x_{ji}$, which will decrease P(+)

‣ Can combine these gradients as $x_j(y_j - P(y_j = 1|x_j))$

‣ Recall that $y_j = 1$ for positive instances, $y_j = 0$ for negative instances.
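The combined gradient $x_j(y_j - P(y_j = 1|x_j))$ gives a one-line update rule. Below is a minimal stochastic gradient ascent sketch; the step size, number of epochs, and the three-feature toy data are all assumptions picked for illustration.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(data, n_feats, step=0.5, epochs=200):
    """Stochastic gradient ascent with the update w += step * x * (y - P(y=1|x))."""
    w = [0.0] * n_feats
    for _ in range(epochs):
        for x, y in data:
            p = logistic(sum(wi * xi for wi, xi in zip(w, x)))
            for i, xi in enumerate(x):
                w[i] += step * xi * (y - p)
    return w


# Toy data (assumption): features are [contains "great", contains "awful", bias].
DATA = [
    ([1, 0, 1], 1),
    ([0, 1, 1], 0),
    ([1, 0, 1], 1),
    ([0, 1, 1], 0),
]
W = train_logistic_regression(DATA, n_feats=3)
P_GREAT = logistic(W[0] + W[2])   # P(y=1) for a review containing "great"
P_AWFUL = logistic(W[1] + W[2])   # P(y=1) for a review containing "awful"
```

As the slide describes, updates shrink as the predicted probability approaches the correct label, so the weights keep growing but ever more slowly on separable data.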

Page 22: Regularization

‣ Regularizing an objective can mean many things, including an L2-norm penalty to the weights:

$$\sum_{j=1}^{m} L(x_j, y_j) - \lambda \|w\|_2^2$$

‣ Keeping weights small can prevent overfitting

‣ For most of the NLP models we build, explicit regularization isn't necessary

  ‣ Early stopping

  ‣ Large numbers of sparse features are hard to overfit in a really bad way

  ‣ For neural networks: dropout and gradient clipping
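For the L2 penalty, the only change to a gradient-ascent learner is an extra term: differentiating $-\lambda \|w\|_2^2$ with respect to $w_i$ gives $-2 \lambda w_i$. A minimal sketch, with a made-up log-likelihood gradient standing in for the real one:

```python
def regularized_gradient(grad_log_lik, w, lam):
    """Gradient of sum_j L(x_j, y_j) - lam * ||w||_2^2 with respect to w.

    grad_log_lik is the gradient of the unregularized log likelihood;
    the penalty contributes -2 * lam * w_i to each coordinate.
    """
    return [g - 2.0 * lam * wi for g, wi in zip(grad_log_lik, w)]


W = [3.0, -1.0, 0.0]
GRAD = [0.1, 0.1, 0.1]   # hypothetical log-likelihood gradient (assumption)
REG = regularized_gradient(GRAD, W, lam=0.05)
```

Large weights are pulled toward zero in proportion to their size, while a weight that is already zero is unaffected by the penalty.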

Page 23: Logistic Regression: Summary

‣ Model

$$P(y = +|x) = \frac{\exp\left(\sum_{i=1}^{n} w_i x_i\right)}{1 + \exp\left(\sum_{i=1}^{n} w_i x_i\right)}$$

‣ Inference

$\arg\max_y P(y|x)$, fundamentally same as Naive Bayes

$$P(y = 1|x) \geq 0.5 \iff w^\top x \geq 0$$

‣ Learning: gradient ascent on the (regularized) discriminative log-likelihood

Page 24: Perceptron/SVM

Page 25: Perceptron

‣ Simple error-driven learning approach similar to logistic regression

‣ Decision rule: $w^\top x > 0$

‣ If incorrect: if positive, $w \leftarrow w + x$; if negative, $w \leftarrow w - x$

  Logistic regression, for comparison: if positive, $w \leftarrow w + x(1 - P(y = 1|x))$; if negative, $w \leftarrow w - x P(y = 1|x)$

‣ Guaranteed to eventually separate the data if the data are separable
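The perceptron rule above fits in a few lines. This is a sketch on a small separable toy set (an assumption, with the last feature acting as a bias); the early-exit check reflects the convergence guarantee for separable data.

```python
def predict(w, x):
    """Decision rule: 1 if w . x > 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0


def perceptron_train(data, n_feats, max_epochs=100):
    """On a mistake, add x for a positive example and subtract x for a negative one."""
    w = [0.0] * n_feats
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            if predict(w, x) != y:
                sign = 1 if y == 1 else -1
                w = [wi + sign * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:   # every example classified correctly: stop
            break
    return w


# Separable toy data (assumption); the last feature is a constant bias.
DATA = [([1, 0, 1], 1), ([0, 1, 1], 0), ([1, 1, 1], 0)]
W = perceptron_train(DATA, n_feats=3)
PREDS = [predict(W, x) for x, _ in DATA]
```

On this data the loop makes two updates in the first pass and none in the second, ending at `W == [1.0, -1.0, 0.0]`.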

Page 26: Support Vector Machines

‣ Many separating hyperplanes: is there a best one?

[Figure: + and - points in feature space]

Page 27: Support Vector Machines

‣ Many separating hyperplanes: is there a best one?

[Figure: + and - points in feature space with the margin of the separating hyperplane marked]

Page 28: Support Vector Machines

‣ Constraint formulation: find $w$ via following quadratic program:

Minimize $\|w\|_2^2$

s.t. $\forall j \;\; w^\top x_j \geq 1$ if $y_j = 1$

$\phantom{\text{s.t. }} \forall j \;\; w^\top x_j \leq -1$ if $y_j = 0$

As a single constraint: $\forall j \;\; (2y_j - 1)(w^\top x_j) \geq 1$

minimizing norm with fixed margin <=> maximizing margin

‣ Generally no solution (data is generally non-separable): need slack!

Page 29: N-Slack SVMs

Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$

s.t. $\forall j \;\; (2y_j - 1)(w^\top x_j) \geq 1 - \xi_j$, $\quad \forall j \;\; \xi_j \geq 0$

‣ The $\xi_j$ are a “fudge factor” to make all constraints satisfied

‣ Take the gradient of the objective:

$$\frac{\partial}{\partial w_i} \xi_j = 0 \quad \text{if } \xi_j = 0$$

$$\frac{\partial}{\partial w_i} \xi_j = (2y_j - 1) x_{ji} \quad \text{if } \xi_j > 0$$

$$= x_{ji} \text{ if } y_j = 1, \quad -x_{ji} \text{ if } y_j = 0$$

‣ Looks like the perceptron! But updates more frequently
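A sketch of the resulting learner: subgradient descent on the slack objective, where the slack $\xi_j = \max(0, 1 - (2y_j - 1) w^\top x_j)$ is active whenever an example is not correct with margin 1. The step size, $\lambda$, epoch count, and toy data are assumptions; applying the regularizer on every per-example step is a common simplification, not something the slide specifies.

```python
def svm_train(data, n_feats, lam=0.01, step=0.01, epochs=2000):
    """Subgradient descent on lam * ||w||^2 + sum_j max(0, 1 - (2y_j - 1) w.x_j)."""
    w = [0.0] * n_feats
    for _ in range(epochs):
        for x, y in data:
            sign = 2 * y - 1   # maps y in {0, 1} to {-1, +1}
            margin = sign * sum(wi * xi for wi, xi in zip(w, x))
            # The regularizer shrinks every weight a little on each step.
            w = [wi - step * 2.0 * lam * wi for wi in w]
            if margin < 1:     # slack is active: move w toward sign * x
                w = [wi + step * sign * xi for wi, xi in zip(w, x)]
    return w


def margins(w, data):
    """Signed margin (2y - 1) * w.x for each example."""
    return [(2 * y - 1) * sum(wi * xi for wi, xi in zip(w, x)) for x, y in data]


# Separable toy data (assumption); the last feature is a constant bias.
DATA = [([1, 0, 1], 1), ([0, 1, 1], 0), ([1, 1, 1], 0)]
W = svm_train(DATA, n_feats=3)
MARGINS = margins(W, DATA)
```

Compared with the perceptron, the update also fires on examples that are classified correctly but with margin below 1, which is why it "updates more frequently".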

Page 30: Gradients on Positive Examples

Logistic regression: $x(1 - P(y = 1|x)) = x(1 - \text{logistic}(w^\top x))$

Perceptron: $x$ if $w^\top x < 0$, else 0

SVM (ignoring regularizer): $x$ if $w^\top x < 1$, else 0

[Figure: loss plotted against $w^\top x$ for four losses: Hinge (SVM), Logistic, Perceptron, and 0-1]

*gradients are for maximizing things, which is why they are flipped

Page 31: Comparing Gradient Updates (Reference)

Logistic regression (unregularized): $x(y - P(y = 1|x)) = x(y - \text{logistic}(w^\top x))$

  $y = 1$ for pos, 0 for neg

Perceptron: $(2y - 1)x$ if classified incorrectly, 0 else

SVM: $(2y - 1)x$ if not classified correctly with margin of 1, 0 else

Page 32: Optimization (next time…)

‣ Range of techniques from simple gradient descent (works pretty well) to more complex methods (can work better)

‣ Most methods boil down to: take a gradient and a step size, apply the gradient update times step size, incorporate estimated curvature information to make the update more effective

Page 33: Sentiment Analysis

the movie was gross and overwrought, but I liked it   +

this movie was great! would watch again   +

this movie was not really very enjoyable   -

‣ Bag-of-words doesn't seem sufficient (discourse structure, negation)

‣ There are some ways around this: extract bigram feature for “not X” for all X following the not

Bo Pang, Lillian Lee, Shivakumar Vaithyanathan (2002)
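One simple reading of the “not X” rule is sketched below: every token after a "not" is rewritten as `not_X` until the next punctuation mark. The scope rule, the tokenizer, and the feature naming are assumptions for illustration, not a description of the exact features used in the cited work.

```python
import re

PUNCT = set(".,!?;")


def negation_features(text):
    """Unigram features, with tokens after 'not' rewritten as 'not_X'.

    The negation scope runs from 'not' to the next punctuation mark
    (or the end of the text).
    """
    feats = set()
    negated = False
    for tok in re.findall(r"[a-z']+|[.,!?;]", text.lower()):
        if tok in PUNCT:
            negated = False            # punctuation ends the negation scope
        elif tok == "not":
            negated = True
            feats.add(tok)
        else:
            feats.add("not_" + tok if negated else tok)
    return feats


FEATS = negation_features("this movie was not really very enjoyable")
```

`FEATS` contains `not_enjoyable` instead of `enjoyable`, so a linear model can give the negated word its own weight.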

Page 34: Sentiment Analysis

‣ Simple feature sets can do pretty well!

Bo Pang, Lillian Lee, Shivakumar Vaithyanathan (2002)

Page 35: Sentiment Analysis

Wang and Manning (2012)

Before neural nets had taken off, results weren't that great

Naive Bayes is doing well!

Ng and Jordan (2002): NB can be better for small data

Kim (2014) CNNs: 81.5, 89.5

Page 36: Recap

‣ Logistic regression:

$$P(y = 1|x) = \frac{\exp\left(\sum_{i=1}^{n} w_i x_i\right)}{1 + \exp\left(\sum_{i=1}^{n} w_i x_i\right)}$$

  Decision rule: $P(y = 1|x) \geq 0.5 \iff w^\top x \geq 0$

  Gradient (unregularized): $x(y - P(y = 1|x))$

‣ SVM:

  Decision rule: $w^\top x \geq 0$

  (Sub)gradient (unregularized): 0 if correct with margin of 1, else $x(2y - 1)$

Page 37: Recap

‣ Logistic regression, SVM, and perceptron are closely related

‣ SVM and perceptron inference require taking maxes, logistic regression has a similar update but is “softer” due to its probabilistic nature

‣ All gradient updates: “make it look more like the right thing and less like the wrong thing”