Discriminative Estimation (Maxent models and perceptron) Generative vs. Discriminative models Many slides are adapted from slides by Christopher Manning and perceptron slides by Alan Ritter

Dec 22, 2021


dariahiddleston
Transcript
Page 1: Discriminative Estimation (Maxentmodelsandperceptron)

Discriminative Estimation (Maxent models and perceptron)

Generative vs. Discriminative models

Many slides are adapted from slides by Christopher Manning and perceptron slides by Alan Ritter

Page 2: Discriminative Estimation (Maxentmodelsandperceptron)

Introduction

• So far we've looked at "generative models"
  • Naive Bayes

• But there is now much use of conditional or discriminative probabilistic models in NLP, Speech, IR (and ML generally)

• Because:
  • They give high accuracy performance
  • They make it easy to incorporate lots of linguistically important features
  • They allow automatic building of language independent, retargetable NLP modules

Page 3: Discriminative Estimation (Maxentmodelsandperceptron)

Joint vs. Conditional Models

• We have some data {(d, c)} of paired observations d and hidden classes c.

• Joint (generative) models place probabilities over both observed data and the hidden stuff (generate the observed data from hidden stuff): P(c,d)
  • All the classic StatNLP models:
    • n-gram models, Naive Bayes classifiers, hidden Markov models, probabilistic context-free grammars, IBM machine translation alignment models

Page 4: Discriminative Estimation (Maxentmodelsandperceptron)

Joint vs. Conditional Models

• Discriminative (conditional) models take the data as given, and put a probability over hidden structure given the data: P(c|d)
  • Logistic regression, conditional loglinear or maximum entropy models, conditional random fields
  • Also, SVMs, (averaged) perceptron, etc. are discriminative classifiers (but not directly probabilistic)

Page 5: Discriminative Estimation (Maxentmodelsandperceptron)

Bayes Net / Graphical Models

• Bayes net diagrams draw circles for random variables, and lines for direct dependencies
• Some variables are observed; some are hidden
• Each node is a little classifier (conditional probability table) based on incoming arcs

[Figure: two graphical models over a class c and observations d1, d2, d3. Left panel: Naive Bayes (Generative). Right panel: Logistic Regression (Discriminative).]

Page 6: Discriminative Estimation (Maxentmodelsandperceptron)

Conditional vs. Joint Likelihood

• A joint model gives probabilities P(d,c) and tries to maximize this joint likelihood.
  • It turns out to be trivial to choose weights: just relative frequencies.

• A conditional model gives probabilities P(c|d). It takes the data as given and models only the conditional probability of the class.
  • We seek to maximize conditional likelihood.
  • Harder to do (as we'll see…)
  • More closely related to classification error.

Page 7: Discriminative Estimation (Maxentmodelsandperceptron)

Maxent Models and Discriminative Estimation

Generative vs. Discriminative models

Page 8: Discriminative Estimation (Maxentmodelsandperceptron)

Discriminative Model Features

Making features from text for discriminative NLP models

Page 9: Discriminative Estimation (Maxentmodelsandperceptron)

Features

• In these slides and most maxent work: features f are elementary pieces of evidence that link aspects of what we observe d with a category c that we want to predict

• A feature is a function with a bounded real value: f: C × D → ℝ

A Belief: to create a data partition

Page 10: Discriminative Estimation (Maxentmodelsandperceptron)

Features

• In NLP uses, usually a feature specifies
  1. an indicator function – a yes/no boolean matching function – of properties of the input and
  2. a particular class

fi(c, d) ≡ [Φ(d) ∧ c = cj]   [Value is 0 or 1]

• Each feature picks out a data subset and suggests a label for it

Page 11: Discriminative Estimation (Maxentmodelsandperceptron)

Example features

• f1(c, d) ≡ [c = LOCATION ∧ w-1 = "in" ∧ isCapitalized(w)]
• f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
• f3(c, d) ≡ [c = DRUG ∧ ends(w, "c")]

• Models will assign to each feature a weight:
  • A positive weight votes that this configuration is likely correct
  • A negative weight votes that this configuration is likely incorrect

Example data (class: context):
• LOCATION: in Québec
• PERSON: saw Sue
• DRUG: taking Zantac
• LOCATION: in Arcadia
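The three example features can be written down directly as indicator functions. The following is a minimal sketch, not the slides' implementation: the helper names (`is_capitalized`, `has_accented_latin_char`, `feature_vector`) and the argument layout (class, previous word, current word) are assumptions made for illustration, and "accented" is taken to mean that a character decomposes into a base letter plus a combining mark.

```python
import unicodedata


def is_capitalized(word: str) -> bool:
    return word[:1].isupper()


def has_accented_latin_char(word: str) -> bool:
    # Assumption: a character counts as accented if its NFD decomposition
    # contains a combining mark, e.g. "é" -> "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", word)
    return any(unicodedata.combining(ch) for ch in decomposed)


def f1(c: str, prev_word: str, word: str) -> int:
    return int(c == "LOCATION" and prev_word == "in" and is_capitalized(word))


def f2(c: str, prev_word: str, word: str) -> int:
    return int(c == "LOCATION" and has_accented_latin_char(word))


def f3(c: str, prev_word: str, word: str) -> int:
    return int(c == "DRUG" and word.endswith("c"))


FEATURES = [f1, f2, f3]


def feature_vector(c: str, prev_word: str, word: str) -> list:
    """Values [f1, f2, f3] for one (class, datum) pair; each is 0 or 1."""
    return [f(c, prev_word, word) for f in FEATURES]


if __name__ == "__main__":
    for c, prev, w in [("LOCATION", "in", "Québec"), ("PERSON", "saw", "Sue"),
                       ("DRUG", "taking", "Zantac"), ("LOCATION", "in", "Arcadia")]:
        print(c, prev, w, feature_vector(c, prev, w))
```

Note that every feature depends on the candidate class as well as on the observed words, so the same datum yields a different vector for each class considered.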

Page 12: Discriminative Estimation (Maxentmodelsandperceptron)

Feature-Based Models

• The decision about a data point is based only on the features active at that point.

Text Categorization
• Data: BUSINESS: Stocks hit a yearly low …
• Features: {…, stocks, hit, a, yearly, low, …}
• Label: BUSINESS

Word-Sense Disambiguation
• Data: … to restructure bank:MONEY debt.
• Features: {…, w-1=restructure, w+1=debt, …}
• Label: MONEY

POS Tagging
• Data: The/DT previous/JJ fall/NN …
• Features: {w=fall, t-1=JJ, w-1=previous}
• Label: NN

Page 13: Discriminative Estimation (Maxentmodelsandperceptron)

Example: Text Categorization

(Zhang and Oles 2001)
• Features are presence of each word in a document and the document class (they do feature selection to use reliable indicator words)
• Tests on classic Reuters data set (and others)
  • Naïve Bayes: 77.0% F1
  • Linear regression: 86.0%
  • Logistic regression: 86.4%
  • Support vector machine: 86.5%

• Paper emphasizes the importance of regularization (smoothing) for successful use of discriminative methods (not used in much early NLP/IR work)

Page 14: Discriminative Estimation (Maxentmodelsandperceptron)

Other Maxent Classifier Examples

• You can use a maxent classifier whenever you want to assign data points to one of a number of classes:
  • Sentence boundary detection (Mikheev 2000)
    • Is a period end of sentence or abbreviation?
  • Sentiment analysis (Pang and Lee 2002)
    • Word unigrams, bigrams, POS counts, …
  • PP attachment (Ratnaparkhi 1998)
    • Attach to verb or noun? Features of head noun, preposition, etc.
  • Parsing decisions in general (Ratnaparkhi 1997; Johnson et al. 1999, etc.)

Page 15: Discriminative Estimation (Maxentmodelsandperceptron)

Discriminative Model Features

Making features from text for discriminative NLP models

Page 16: Discriminative Estimation (Maxentmodelsandperceptron)

Feature-based Linear Classifiers

How to put features into a classifier

Page 17: Discriminative Estimation (Maxentmodelsandperceptron)

Feature-Based Linear Classifiers

• Linear classifiers at classification time:
  • Linear function from feature sets {fi} to classes {c}.
  • Assign a weight λi to each feature fi.
  • We consider each class for an observed datum d
  • For a pair (c,d), features vote with their weights:
    • vote(c) = Σi λi fi(c,d)
  • Choose the class c which maximizes Σi λi fi(c,d)

Candidate classes for the datum "in Québec":
• LOCATION: in Québec
• DRUG: in Québec
• PERSON: in Québec

Page 18: Discriminative Estimation (Maxentmodelsandperceptron)

Feature-Based Linear Classifiers

• Linear classifiers at classification time:
  • Linear function from feature sets {fi} to classes {c}.
  • Assign a weight λi to each feature fi.
  • We consider each class for an observed datum d
  • For a pair (c,d), features vote with their weights:
    • vote(c) = Σi λi fi(c,d)
  • Choose the class c which maximizes Σi λi fi(c,d) = LOCATION

Active feature weights for the datum "in Québec":
• LOCATION: in Québec → 1.8, –0.6
• DRUG: in Québec → 0.3
• PERSON: in Québec → (no active features)
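The voting rule can be traced by hand for "in Québec". A small sketch follows, using the slide's weights (1.8, –0.6, 0.3); the table `ACTIVE` of precomputed feature values and the function names are illustrative assumptions, not part of the original slides.

```python
WEIGHTS = [1.8, -0.6, 0.3]  # λ1, λ2, λ3 as given on the slides

# Feature values [f1, f2, f3] for the datum d = "in Québec",
# one row per candidate class (precomputed by hand for illustration).
ACTIVE = {
    "LOCATION": [1, 1, 0],  # preceded by "in" and capitalized; has an accented char
    "DRUG": [0, 0, 1],      # ends in "c"
    "PERSON": [0, 0, 0],    # no feature mentions PERSON
}


def vote(feature_values, weights=WEIGHTS):
    """vote(c) = sum_i λi * fi(c, d)."""
    return sum(w * f for w, f in zip(weights, feature_values))


def classify(active):
    """Return the highest-voted class together with all votes."""
    votes = {c: vote(values) for c, values in active.items()}
    return max(votes, key=votes.get), votes


if __name__ == "__main__":
    best, votes = classify(ACTIVE)
    print(best, votes)
```

LOCATION wins even though one of its two active features carries a negative weight, because only the sum of the votes matters.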

Page 19: Discriminative Estimation (Maxentmodelsandperceptron)

Feature-Based Linear Classifiers

There are many ways to choose weights for features, with different loss functions as the optimization goal

• Perceptron: find a currently misclassified example, and nudge weights in the direction of its correct classification

• Margin-based methods (Support Vector Machines)

Page 20: Discriminative Estimation (Maxentmodelsandperceptron)

Feature-Based Linear Classifiers

• Exponential (log-linear, maxent, logistic, Gibbs) models:
  • Make a probabilistic model from the linear combination Σi λi fi(c,d)

$$P(c \mid d, \lambda) = \frac{\exp \sum_i \lambda_i f_i(c, d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c', d)}$$

  • Numerator: makes votes positive
  • Denominator: normalizes votes

  • P(LOCATION | in Québec) = e^{1.8} e^{–0.6} / (e^{1.8} e^{–0.6} + e^{0.3} + e^{0}) = 0.586
  • P(DRUG | in Québec) = e^{0.3} / (e^{1.8} e^{–0.6} + e^{0.3} + e^{0}) = 0.238
  • P(PERSON | in Québec) = e^{0} / (e^{1.8} e^{–0.6} + e^{0.3} + e^{0}) = 0.176

• The weights are the parameters of the probability model, combined via a "soft max" function
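The three probabilities for "in Québec" can be reproduced numerically. This is a minimal sketch of the soft max step only; the function name `softmax` and the dictionary layout are assumptions for illustration. Subtracting the maximum vote before exponentiating is a common numerical-stability device and does not change the result.

```python
import math


def softmax(votes):
    """Turn per-class votes into probabilities: exp(vote) / sum of exp(votes)."""
    m = max(votes.values())  # shift by the max so exp() cannot overflow
    exps = {c: math.exp(v - m) for c, v in votes.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}


# Votes for "in Québec": LOCATION has weights 1.8 and -0.6 active, DRUG has 0.3.
probs = softmax({"LOCATION": 1.8 - 0.6, "DRUG": 0.3, "PERSON": 0.0})

if __name__ == "__main__":
    for c, p in probs.items():
        print(f"P({c} | in Québec) = {p:.3f}")
```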

Page 21: Discriminative Estimation (Maxentmodelsandperceptron)

Aside: logistic regression

• Maxent models in NLP are essentially the same as multiclass logistic regression models in statistics (or machine learning)
  • The key role of feature functions in NLP and in this presentation
  • The features are more general, with f also being a function of the class

Page 22: Discriminative Estimation (Maxentmodelsandperceptron)

Quiz Question

• Assuming exactly the same set up (3 class decision: LOCATION, PERSON, or DRUG; 3 features as before, maxent), what are:
  • P(PERSON | by Goéric) =
  • P(LOCATION | by Goéric) =
  • P(DRUG | by Goéric) =

• Weights and features:
  • 1.8   f1(c, d) ≡ [c = LOCATION ∧ w-1 = "in" ∧ isCapitalized(w)]
  • –0.6  f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
  • 0.3   f3(c, d) ≡ [c = DRUG ∧ ends(w, "c")]

$$P(c \mid d, \lambda) = \frac{\exp \sum_i \lambda_i f_i(c, d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c', d)}$$

Candidate classes for the datum "by Goéric":
• PERSON: by Goéric
• LOCATION: by Goéric
• DRUG: by Goéric

Page 23: Discriminative Estimation (Maxentmodelsandperceptron)

Feature-based Linear Classifiers

How to put features into a classifier

Page 24: Discriminative Estimation (Maxentmodelsandperceptron)

Building a Maxent Model

The nuts and bolts

Page 25: Discriminative Estimation (Maxentmodelsandperceptron)

Building a Maxent Model

• We define features (indicator functions) over data points
  • Features represent sets of data points which are distinctive enough to deserve model parameters.
    • Words, but also "word contains number", "word ends with ing", etc.

• We will simply encode each Φ feature as a unique String (index)
  • A datum will give rise to a set of Strings: the active Φ features
  • Each feature fi(c, d) ≡ [Φ(d) ∧ c = cj] gets a real number weight

• We concentrate on Φ features but the math uses i indices of fi
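The string encoding can be sketched as follows. Everything named here is hypothetical: the templates in `phi_features`, the tag names, and the weight values are made up for illustration, and only the idea (a datum yields a set of Φ strings, and each (Φ string, class) pair owns one real-valued weight) comes from the slide.

```python
def phi_features(prev_word: str, word: str) -> set:
    """Return the set of active Φ feature strings for one datum."""
    feats = {f"w={word}", f"w-1={prev_word}"}
    if any(ch.isdigit() for ch in word):
        feats.add("word_contains_number")
    if word.endswith("ing"):
        feats.add("word_ends_with_ing")
    return feats


# One weight per (Φ string, class) pair, i.e. per feature fi(c, d).
# Pairs that are absent from the table implicitly have weight 0.
WEIGHTS = {
    ("word_ends_with_ing", "VBG"): 1.5,  # illustrative values only
    ("w-1=the", "NN"): 0.7,
}


def score(c: str, active: set) -> float:
    """Sum the weights of all features fi(c, d) that fire for class c."""
    return sum(WEIGHTS.get((phi, c), 0.0) for phi in active)


if __name__ == "__main__":
    active = phi_features("is", "running")
    print(sorted(active), score("VBG", active), score("NN", active))
```

Keying the weight table on the pair keeps storage sparse: only pairs that received a nonzero weight need an entry.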

Page 26: Discriminative Estimation (Maxentmodelsandperceptron)

Building a Maxent Model

• Features are often added during model development to target errors
  • Often, the easiest thing to think of are features that mark bad combinations

• Then, for any given feature weights, we want to be able to calculate:
  • Data conditional likelihood
  • Derivative of the likelihood wrt each feature weight
    • Uses expectations of each feature according to the model

• We can then find the optimum feature weights (discussed later).

Page 27: Discriminative Estimation (Maxentmodelsandperceptron)

Building a Maxent Model

The nuts and bolts

Page 28: Discriminative Estimation (Maxentmodelsandperceptron)

Naive Bayes vs. Maxent models

Generative vs. Discriminative models: The problem of overcounting evidence

Page 29: Discriminative Estimation (Maxentmodelsandperceptron)

Text classification: Asia or Europe

NB FACTORS:
• P(A) = P(E) =
• P(M|A) =
• P(M|E) =
• P(H|A) = P(K|A) =
• P(H|E) = P(K|E) =

NB Model PREDICTIONS:
• P(A,H,K,M) =
• P(E,H,K,M) =
• P(A|H,K,M) =
• P(E|H,K,M) =

[Figure: training data in two columns, Europe and Asia, made up of documents containing the words "Monaco" and "Hong Kong"; NB model diagram with a Class node and word nodes H, K, M.]

Page 30: Discriminative Estimation (Maxentmodelsandperceptron)

Naive Bayes vs. Maxent Models

• Naive Bayes models multi-count correlated evidence
  • Each feature is multiplied in, even when you have multiple features telling you the same thing

• Maximum Entropy models (pretty much) solve this problem
  • As we will see, this is done by weighting features so that model expectations match the observed (empirical) expectations
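The multi-counting effect is easy to see numerically. The sketch below uses made-up probabilities (a prior of 0.5 and likelihoods of 0.75 and 0.25), not the values from the Asia/Europe example; `nb_posterior_a` is a hypothetical helper. It feeds the same piece of evidence to Naive Bayes one, two, or three times, as happens when several perfectly correlated features all fire together.

```python
def nb_posterior_a(copies: int, p_x_given_a: float = 0.75,
                   p_x_given_e: float = 0.25, prior_a: float = 0.5) -> float:
    """Naive Bayes posterior P(A | x, ..., x) when x is multiplied in `copies` times."""
    joint_a = prior_a * p_x_given_a ** copies
    joint_e = (1.0 - prior_a) * p_x_given_e ** copies
    return joint_a / (joint_a + joint_e)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, round(nb_posterior_a(n), 4))
```

The evidence is the same in all three cases, yet the posterior for A moves from 0.75 to 0.9 and beyond, because each duplicate is multiplied in as if it were independent.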

Page 31: Discriminative Estimation (Maxentmodelsandperceptron)

Naive Bayes vs. Maxent models

Generative vs. Discriminative models: The problem of overcounting evidence

Page 32: Discriminative Estimation (Maxentmodelsandperceptron)

Maxent Models and Discriminative Estimation

Maximizing the likelihood

Page 33: Discriminative Estimation (Maxentmodelsandperceptron)

Feature Expectations

• We will crucially make use of two expectations
  • actual or predicted counts of a feature firing:

• Empirical count (expectation) of a feature:

$$\text{empirical } E(f_i) = \sum_{(c,d) \in \text{observed}(C,D)} f_i(c, d)$$

• Model expectation of a feature:

$$E(f_i) = \sum_{(c,d) \in (C,D)} P(c, d)\, f_i(c, d)$$

Goal: well fit the data

Page 34: Discriminative Estimation (Maxentmodelsandperceptron)

Exponential Model Likelihood

• Maximum (Conditional) Likelihood Models:
  • Given a model form, choose values of parameters to maximize the (conditional) likelihood of the data.

$$\log P(C \mid D, \lambda) = \sum_{(c,d) \in (C,D)} \log P(c \mid d, \lambda) = \sum_{(c,d) \in (C,D)} \log \frac{\exp \sum_i \lambda_i f_i(c, d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c', d)}$$

Page 35: Discriminative Estimation (Maxentmodelsandperceptron)

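The Likelihood Value

• The (log) conditional likelihood of iid data (C,D) according to maxent model is a function of the data and the parameters λ:

$$\log P(C \mid D, \lambda) = \log \prod_{(c,d) \in (C,D)} P(c \mid d, \lambda) = \sum_{(c,d) \in (C,D)} \log P(c \mid d, \lambda)$$

• If there aren't many values of c, it's easy to calculate:

$$\log P(C \mid D, \lambda) = \sum_{(c,d) \in (C,D)} \log \frac{\exp \sum_i \lambda_i f_i(c, d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c', d)}$$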

Page 36: Discriminative Estimation (Maxentmodelsandperceptron)

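The Likelihood Value

• We can separate this into two components:

$$\log P(C \mid D, \lambda) = \sum_{(c,d) \in (C,D)} \log \exp \sum_i \lambda_i f_i(c, d) \; - \sum_{(c,d) \in (C,D)} \log \sum_{c'} \exp \sum_i \lambda_i f_i(c', d)$$

$$\log P(C \mid D, \lambda) = N(\lambda) - M(\lambda)$$

• The derivative is the difference between the derivatives of each component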

Page 37: Discriminative Estimation (Maxentmodelsandperceptron)

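The Derivative I: Numerator

$$\begin{aligned}
\frac{\partial N(\lambda)}{\partial \lambda_i}
&= \frac{\partial \sum_{(c,d) \in (C,D)} \log \exp \sum_i \lambda_i f_i(c, d)}{\partial \lambda_i}
 = \frac{\partial \sum_{(c,d) \in (C,D)} \sum_i \lambda_i f_i(c, d)}{\partial \lambda_i} \\
&= \sum_{(c,d) \in (C,D)} \frac{\partial \sum_i \lambda_i f_i(c, d)}{\partial \lambda_i} \\
&= \sum_{(c,d) \in (C,D)} f_i(c, d)
\end{aligned}$$

Derivative of the numerator is: the empirical count(fi, c)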

Page 38: Discriminative Estimation (Maxentmodelsandperceptron)

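The Derivative II: Denominator

$$\begin{aligned}
\frac{\partial M(\lambda)}{\partial \lambda_i}
&= \frac{\partial \sum_{(c,d) \in (C,D)} \log \sum_{c'} \exp \sum_i \lambda_i f_i(c', d)}{\partial \lambda_i} \\
&= \sum_{(c,d) \in (C,D)} \frac{1}{\sum_{c''} \exp \sum_i \lambda_i f_i(c'', d)} \, \frac{\partial \sum_{c'} \exp \sum_i \lambda_i f_i(c', d)}{\partial \lambda_i} \\
&= \sum_{(c,d) \in (C,D)} \frac{1}{\sum_{c''} \exp \sum_i \lambda_i f_i(c'', d)} \sum_{c'} \frac{\exp \sum_i \lambda_i f_i(c', d)}{1} \, \frac{\partial \sum_i \lambda_i f_i(c', d)}{\partial \lambda_i} \\
&= \sum_{(c,d) \in (C,D)} \sum_{c'} \frac{\exp \sum_i \lambda_i f_i(c', d)}{\sum_{c''} \exp \sum_i \lambda_i f_i(c'', d)} \, \frac{\partial \sum_i \lambda_i f_i(c', d)}{\partial \lambda_i} \\
&= \sum_{(c,d) \in (C,D)} \sum_{c'} P(c' \mid d, \lambda)\, f_i(c', d) = \text{predicted count}(f_i, \lambda)
\end{aligned}$$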

Page 39: Discriminative Estimation (Maxentmodelsandperceptron)

The Derivative III

$$\frac{\partial \log P(C \mid D, \lambda)}{\partial \lambda_i} = \text{actual count}(f_i, C) - \text{predicted count}(f_i, \lambda)$$

• The optimum parameters are the ones for which each feature's predicted expectation equals its empirical expectation. The optimum distribution is:
  • Always unique (but parameters may not be unique)
  • Always exists (if feature counts are from actual data).

• These models are also called maximum entropy models because we find the model having maximum entropy and satisfying the constraints:

$$E_p(f_j) = E_{\tilde{p}}(f_j), \quad \forall j$$
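The "actual count minus predicted count" gradient can be checked against a finite-difference estimate of the log-likelihood. The sketch below assumes a toy three-class problem: the boolean properties in each datum (`after_in`, `accented`, `ends_c`), the training pairs in `DATA`, and all function names are made up for illustration.

```python
import math

CLASSES = ["LOCATION", "DRUG", "PERSON"]


def feats(c, d):
    """Toy indicator features fi(c, d); d is a dict of boolean properties."""
    return [
        int(c == "LOCATION" and d["after_in"]),
        int(c == "LOCATION" and d["accented"]),
        int(c == "DRUG" and d["ends_c"]),
    ]


def probs(d, lam):
    """P(c | d, λ) for every class, via the soft max of the votes."""
    scores = [sum(l * f for l, f in zip(lam, feats(c, d))) for c in CLASSES]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def log_lik(data, lam):
    """Conditional log-likelihood: sum over (c, d) of log P(c | d, λ)."""
    return sum(math.log(probs(d, lam)[CLASSES.index(c)]) for c, d in data)


def gradient(data, lam):
    """d log P(C|D,λ) / dλi = actual count(fi) - predicted count(fi)."""
    grad = [0.0] * len(lam)
    for c, d in data:
        p = probs(d, lam)
        actual = feats(c, d)
        for i in range(len(lam)):
            predicted = sum(p[k] * feats(ck, d)[i] for k, ck in enumerate(CLASSES))
            grad[i] += actual[i] - predicted
    return grad


def numeric_gradient(data, lam, eps=1e-6):
    """Central finite differences of log_lik, used only as a sanity check."""
    out = []
    for i in range(len(lam)):
        up = lam[:i] + [lam[i] + eps] + lam[i + 1:]
        down = lam[:i] + [lam[i] - eps] + lam[i + 1:]
        out.append((log_lik(data, up) - log_lik(data, down)) / (2 * eps))
    return out


# Made-up training pairs (c, d), for illustration only.
DATA = [
    ("LOCATION", {"after_in": True, "accented": True, "ends_c": True}),
    ("DRUG", {"after_in": False, "accented": False, "ends_c": True}),
    ("PERSON", {"after_in": False, "accented": True, "ends_c": False}),
]

if __name__ == "__main__":
    lam = [1.8, -0.6, 0.3]
    print(gradient(DATA, lam))
    print(numeric_gradient(DATA, lam))
```

The two printed vectors should agree to several decimal places; a mismatch is usually the quickest sign of a bug in either the likelihood or the gradient code.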

Page 40: Discriminative Estimation (Maxentmodelsandperceptron)

Finding the optimal parameters

• We want to choose parameters λ1, λ2, λ3, … that maximize the conditional log-likelihood of the training data

$$CLogLik(D) = \sum_{i=1}^{n} \log P(c_i \mid d_i)$$

• To be able to do that, we've worked out how to calculate the function value and its partial derivatives (its gradient)

Page 41: Discriminative Estimation (Maxentmodelsandperceptron)

A likelihood surface

Page 42: Discriminative Estimation (Maxentmodelsandperceptron)

Finding the optimal parameters

• Use your favorite numerical optimization package….
  • Commonly, you minimize the negative of CLogLik

1. Gradient descent (GD); Stochastic gradient descent (SGD)
2. Iterative proportional fitting methods: Generalized Iterative Scaling (GIS) and Improved Iterative Scaling (IIS)
3. Conjugate gradient (CG), perhaps with preconditioning
4. Quasi-Newton methods – limited memory variable metric (LMVM) methods, in particular, L-BFGS

Page 43: Discriminative Estimation (Maxentmodelsandperceptron)

Gradient Descent (GD)
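As a rough illustration of batch gradient descent on the negative conditional log-likelihood, the sketch below uses the smallest possible maxent model. The set-up is an assumption made for illustration: two classes A and B, one feature f(c, d) = [c = A], and made-up training counts of three A's and one B. With one feature, P(A | d, λ) reduces to the logistic function of λ, and the optimum is the weight at which the predicted count of the feature equals its actual count.

```python
import math


def neg_cloglik_grad(lam, n_a=3, n_b=1):
    """Gradient of -CLogLik for a one-feature model with f(c, d) = [c = A]."""
    p_a = 1.0 / (1.0 + math.exp(-lam[0]))  # P(A | d, λ) = e^λ / (e^λ + e^0)
    actual = n_a                           # empirical count of the feature
    predicted = (n_a + n_b) * p_a          # model's expected count
    return [-(actual - predicted)]


def gradient_descent(grad_fn, x0, step=0.5, iters=200):
    """Plain batch gradient descent with a fixed step size."""
    x = list(x0)
    for _ in range(iters):
        g = grad_fn(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


lam = gradient_descent(neg_cloglik_grad, [0.0])
p_a = 1.0 / (1.0 + math.exp(-lam[0]))

if __name__ == "__main__":
    print(lam[0], p_a)
```

At convergence P(A) equals the relative frequency 3/4, so λ approaches ln 3; the fixed step size and iteration count are arbitrary choices that happen to work for this toy problem.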

Page 44: Discriminative Estimation (Maxentmodelsandperceptron)

Maxent Models and Discriminative Estimation

Maximizing the likelihood

Page 45: Discriminative Estimation (Maxentmodelsandperceptron)

Feature Sparsity Regularization

Combating overfitting

Page 46: Discriminative Estimation (Maxentmodelsandperceptron)

Smoothing: Issues of Scale

• Lots of features:
  • NLP maxent models can have well over a million features.
  • Even storing a single array of parameter values can have a substantial memory cost.

• Lots of sparsity:
  • Overfitting very easy – we need smoothing!
  • Many features seen in training will never occur again at test time.

• Optimization problems:
  • Feature weights can be infinite, and iterative solvers can take a long time to get to those infinities.

Page 47: Discriminative Estimation (Maxentmodelsandperceptron)

Smoothing/Priors/Regularization

Page 48: Discriminative Estimation (Maxentmodelsandperceptron)

Standard vs. Regularized Updates
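The body of this slide did not survive in the transcript. As a hedged illustration only, one common way to contrast the two updates is gradient ascent on the conditional log-likelihood with and without an L2 penalty (equivalently, a zero-mean Gaussian prior on the weights). The step size η and the prior variance σ² below are symbols introduced for this sketch and do not come from the slides.

$$\text{Standard:} \quad \lambda_i \leftarrow \lambda_i + \eta \left[ \text{actual count}(f_i, C) - \text{predicted count}(f_i, \lambda) \right]$$

$$\text{Regularized objective:} \quad \log P(C \mid D, \lambda) - \sum_i \frac{\lambda_i^2}{2\sigma^2}$$

$$\text{Regularized:} \quad \lambda_i \leftarrow \lambda_i + \eta \left[ \text{actual count}(f_i, C) - \text{predicted count}(f_i, \lambda) - \frac{\lambda_i}{\sigma^2} \right]$$

Under this assumed penalty, the extra term pulls every weight toward zero, so a weight can no longer run off to infinity when a feature only ever occurs with one class.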

Page 49: Discriminative Estimation (Maxentmodelsandperceptron)

Feature Sparsity Regularization

Combating overfitting

Page 50: Discriminative Estimation (Maxentmodelsandperceptron)

Batch vs. Online Learning

GD vs. SGD

Page 51: Discriminative Estimation (Maxentmodelsandperceptron)

Stochastic Gradient Descent (SGD)

Batch vs. Online learning:
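The difference from batch gradient descent is that an online learner updates the weights after every single training example instead of after a full pass over the data. The following is a minimal sketch of SGD for a maxent model, under assumed conventions: each datum is a set of Φ strings, every (Φ string, class) pair is a feature, and the two-class training set is made up for illustration.

```python
import math
import random

CLASSES = ["A", "B"]


def feats(c, d):
    """Hypothetical feature set: each active Φ string paired with class c."""
    return {(phi, c) for phi in d}


def probs(d, w):
    """P(c | d, w) for every class, via the soft max of the votes."""
    scores = {c: sum(w.get(k, 0.0) for k in feats(c, d)) for c in CLASSES}
    m = max(scores.values())
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}


def sgd(data, epochs=50, step=0.5, seed=0):
    """Stochastic gradient ascent on the conditional log-likelihood."""
    rng = random.Random(seed)
    data = list(data)
    w = {}
    for _ in range(epochs):
        rng.shuffle(data)
        for gold, d in data:                 # one update per training example
            p = probs(d, w)                  # probabilities before the update
            for k in feats(gold, d):         # + actual count of each feature
                w[k] = w.get(k, 0.0) + step
            for c in CLASSES:                # - predicted count of each feature
                for k in feats(c, d):
                    w[k] = w.get(k, 0.0) - step * p[c]
    return w


# Made-up training pairs (class, active Φ strings), for illustration only.
DATA = [("A", {"x"}), ("A", {"x"}), ("B", {"y"}), ("B", {"y"})]
w = sgd(DATA)

if __name__ == "__main__":
    print(probs({"x"}, w), probs({"y"}, w))
```

Because this toy data is perfectly separable and no regularizer is used, the weights keep growing with more epochs, which is the "infinite weights" issue raised on the smoothing slide.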

Page 52: Discriminative Estimation (Maxentmodelsandperceptron)

Batch vs. Online Learning

GD vs. SGD

Page 53: Discriminative Estimation (Maxentmodelsandperceptron)

Perceptron

Another Online Learning algorithm

Page 54: Discriminative Estimation (Maxentmodelsandperceptron)

Perceptron Algorithm
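The algorithm itself is not preserved in the transcript. The sketch below is a standard multiclass perceptron written to match the earlier description (find a misclassified example and nudge the weights toward its correct classification); the feature convention, function names, and toy data are assumptions for illustration.

```python
def feats(c, d):
    """Hypothetical feature set: each active Φ string paired with class c."""
    return {(phi, c) for phi in d}


def predict(d, w, classes):
    """Highest-scoring class; ties go to the class listed first."""
    return max(classes, key=lambda c: sum(w.get(k, 0.0) for k in feats(c, d)))


def perceptron(data, classes, epochs=10):
    w = {}
    for _ in range(epochs):
        mistakes = 0
        for gold, d in data:
            pred = predict(d, w, classes)
            if pred != gold:                  # update only on a mistake
                mistakes += 1
                for k in feats(gold, d):      # reward the correct class
                    w[k] = w.get(k, 0.0) + 1.0
                for k in feats(pred, d):      # penalize the wrong guess
                    w[k] = w.get(k, 0.0) - 1.0
        if mistakes == 0:                     # converged on the training data
            break
    return w


# Made-up training pairs (class, active Φ strings), for illustration only.
CLASSES = ["A", "B"]
DATA = [("A", {"x", "z"}), ("B", {"y", "z"}), ("A", {"x"}), ("B", {"y"})]
w = perceptron(DATA, CLASSES)

if __name__ == "__main__":
    print(w)
    print([predict(d, w, CLASSES) for _, d in DATA])
```

Unlike the SGD update for maxent, nothing changes when the prediction is already correct, and the update uses no probabilities.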

Page 55: Discriminative Estimation (Maxentmodelsandperceptron)

MaxEnt vs. Perceptron

• Perceptron doesn't always make updates
• Probabilities vs. scores

Page 56: Discriminative Estimation (Maxentmodelsandperceptron)

Regularization in the Perceptron Algorithm

• No gradient computed, so can't directly include a regularizer in an objective function.
• Instead run different numbers of iterations
• Use parameter averaging, for instance, average of all parameters after seeing each data point
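Parameter averaging can be sketched as follows. This is a deliberately naive version that re-adds the whole weight vector to a running total after every data point, which is simple to read but slow for large feature sets. The feature convention, names, and toy data are assumptions for illustration; the `epochs` argument plays the role of the "number of iterations" knob.

```python
def feats(c, d):
    """Hypothetical feature set: each active Φ string paired with class c."""
    return {(phi, c) for phi in d}


def predict(d, w, classes):
    """Highest-scoring class; ties go to the class listed first."""
    return max(classes, key=lambda c: sum(w.get(k, 0.0) for k in feats(c, d)))


def averaged_perceptron(data, classes, epochs=5):
    w, total, seen = {}, {}, 0
    for _ in range(epochs):
        for gold, d in data:
            pred = predict(d, w, classes)
            if pred != gold:
                for k in feats(gold, d):
                    w[k] = w.get(k, 0.0) + 1.0
                for k in feats(pred, d):
                    w[k] = w.get(k, 0.0) - 1.0
            # Accumulate the current weights after seeing this data point.
            for k, v in w.items():
                total[k] = total.get(k, 0.0) + v
            seen += 1
    return {k: v / seen for k, v in total.items()}


# Made-up training pairs (class, active Φ strings), for illustration only.
CLASSES = ["A", "B"]
DATA = [("A", {"x", "z"}), ("B", {"y", "z"}), ("A", {"x"}), ("B", {"y"})]
w_avg = averaged_perceptron(DATA, CLASSES)

if __name__ == "__main__":
    print(w_avg)
```

Weights that were only briefly nonzero during training end up small in the average, which is what gives averaging its smoothing effect.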
