Discriminative Estimation (Maxent models and perceptron)
Generative vs. Discriminative models
Many slides are adapted from slides by Christopher Manning and perceptron slides by Alan Ritter
Introduction
• So far we’ve looked at “generative models”
• Naive Bayes
• But there is now much use of conditional or discriminative probabilistic models in NLP, Speech, IR (and ML generally)
• Because:
• They give high accuracy performance
• They make it easy to incorporate lots of linguistically important features
• They allow automatic building of language independent, retargetable NLP modules
Joint vs. Conditional Models
• We have some data {(d, c)} of paired observations d and hidden classes c.
• Joint (generative) models place probabilities over both observed data and the hidden stuff (generate the observed data from hidden stuff):
• All the classic StatNLP models:
• n-gram models, Naive Bayes classifiers, hidden Markov models, probabilistic context-free grammars, IBM machine translation alignment models
P(c,d)
Joint vs. Conditional Models
• Discriminative (conditional) models take the data as given, and put a probability over hidden structure given the data:
• Logistic regression, conditional loglinear or maximum entropy models, conditional random fields
• Also, SVMs, (averaged) perceptron, etc. are discriminative classifiers (but not directly probabilistic)
P(c|d)
Bayes Net/Graphical Models
• Bayes net diagrams draw circles for random variables, and lines for direct dependencies
• Some variables are observed; some are hidden
• Each node is a little classifier (conditional probability table) based on incoming arcs
[Diagram: class node c with arcs to observed nodes d1, d2, d3]
Naive Bayes (Generative): [diagram: c → d1, d2, d3]
Logistic Regression (Discriminative): [diagram: d1, d2, d3 → c]
Conditional vs. Joint Likelihood
• A joint model gives probabilities P(d, c) and tries to maximize this joint likelihood.
• It turns out to be trivial to choose weights: just relative frequencies.
• A conditional model gives probabilities P(c|d). It takes the data as given and models only the conditional probability of the class.
• We seek to maximize conditional likelihood.
• Harder to do (as we’ll see…)
• More closely related to classification error.
Maxent Models and Discriminative Estimation
Generative vs. Discriminative models
Discriminative Model Features
Making features from text for discriminative NLP models
Features
• In these slides and most maxent work: features f are elementary pieces of evidence that link aspects of what we observe with a category c that we want to predict
• A feature is a function with a bounded real value: f: C × D → ℝ
• A belief: the job of a feature is to create a (useful) partition of the data
Features
• In NLP uses, usually a feature specifies
1. an indicator function – a yes/no boolean matching function – of properties of the input and
2. a particular class

fi(c, d) ≡ [Φ(d) ∧ c = cj]   [Value is 0 or 1]
• Each feature picks out a data subset and suggests a label for it
Example features
• f1(c, d) ≡ [c = LOCATION ∧ w-1 = “in” ∧ isCapitalized(w)]
• f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
• f3(c, d) ≡ [c = DRUG ∧ ends(w, “c”)]
• Models will assign to each feature a weight:
• A positive weight votes that this configuration is likely correct
• A negative weight votes that this configuration is likely incorrect
LOCATION: in Québec
PERSON: saw Sue
DRUG: taking Zantac
LOCATION: in Arcadia
Feature-Based Models
• The decision about a data point is based only on the features active at that point.
Text Categorization:
• Data: BUSINESS: Stocks hit a yearly low …
• Features: {…, stocks, hit, a, yearly, low, …}
• Label: BUSINESS

Word-Sense Disambiguation:
• Data: … to restructure bank:MONEY debt.
• Features: {…, w-1=restructure, w+1=debt, …}
• Label: MONEY

POS Tagging:
• Data: DT JJ NN … The previous fall …
• Features: {w=fall, t-1=JJ, w-1=previous}
• Label: NN
Example: Text Categorization
(Zhang and Oles 2001)
• Features are presence of each word in a document and the document class (they do feature selection to use reliable indicator words)
• Tests on classic Reuters data set (and others)
• Naïve Bayes: 77.0% F1
• Linear regression: 86.0%
• Logistic regression: 86.4%
• Support vector machine: 86.5%
• Paper emphasizes the importance of regularization (smoothing) for successful use of discriminative methods (not used in much early NLP/IR work)
Other Maxent Classifier Examples
• You can use a maxent classifier whenever you want to assign data points to one of a number of classes:
• Sentence boundary detection (Mikheev 2000)
• Is a period end of sentence or abbreviation?
• Sentiment analysis (Pang and Lee 2002)
• Word unigrams, bigrams, POS counts, …
• PP attachment (Ratnaparkhi 1998)
• Attach to verb or noun? Features of head noun, preposition, etc.
• Parsing decisions in general (Ratnaparkhi 1997; Johnson et al. 1999, etc.)
Discriminative Model Features
Making features from text for discriminative NLP models
Feature-based Linear Classifiers
How to put features into a classifier
Feature-Based Linear Classifiers
• Linear classifiers at classification time:
• Linear function from feature sets {fi} to classes {c}.
• Assign a weight λi to each feature fi.
• We consider each class for an observed datum d
• For a pair (c, d), features vote with their weights:
• vote(c) = Σi λi fi(c, d)
• Choose the class c which maximizes Σi λi fi(c, d)
LOCATION: in Québec
DRUG: in Québec
PERSON: in Québec
Feature-Based Linear Classifiers
• Linear classifiers at classification time:
• Linear function from feature sets {fi} to classes {c}.
• Assign a weight λi to each feature fi.
• We consider each class for an observed datum d
• For a pair (c, d), features vote with their weights:
• vote(c) = Σi λi fi(c, d)
• Choose the class c which maximizes Σi λi fi(c, d) = LOCATION
Weights: f1 = 1.8, f2 = –0.6, f3 = 0.3
LOCATION: in Québec → vote = 1.8 – 0.6 = 1.2
DRUG: in Québec → vote = 0.3
PERSON: in Québec → vote = 0
Feature-Based Linear Classifiers
There are many ways to choose weights for features, with different loss functions as the optimization goal:
• Perceptron: find a currently misclassified example, and nudge weights in the direction of its correct classification
• Margin-based methods (Support Vector Machines)
Feature-Based Linear Classifiers
• Exponential (log-linear, maxent, logistic, Gibbs) models:
• Make a probabilistic model from the linear combination Σi λi fi(c, d)
• P(LOCATION | in Québec) = e^1.8 e^–0.6 / (e^1.8 e^–0.6 + e^0.3 + e^0) = 0.586
• P(DRUG | in Québec) = e^0.3 / (e^1.8 e^–0.6 + e^0.3 + e^0) = 0.238
• P(PERSON | in Québec) = e^0 / (e^1.8 e^–0.6 + e^0.3 + e^0) = 0.176
• The weights are the parameters of the probability model, combined via a “soft max” function
P(c|d, λ) = exp Σi λi fi(c, d) / Σc′ exp Σi λi fi(c′, d)
• The exp makes the votes positive
• The sum over classes c′ normalizes the votes
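A minimal runnable sketch of this “soft max” step, starting from the per-class votes computed above (numbers from the slide):

```python
from math import exp

# Votes for "in Québec": LOCATION = 1.8 - 0.6 = 1.2, DRUG = 0.3, PERSON = 0.
votes = {"LOCATION": 1.2, "DRUG": 0.3, "PERSON": 0.0}

def maxent_probs(votes):
    """P(c|d, lambda) = exp(vote(c)) / sum over c' of exp(vote(c'))."""
    Z = sum(exp(v) for v in votes.values())           # normalizes the votes
    return {c: exp(v) / Z for c, v in votes.items()}  # exp makes votes positive

print(maxent_probs(votes))
# {'LOCATION': 0.586, 'DRUG': 0.238, 'PERSON': 0.176} (to 3 decimal places)
```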
Aside: logistic regression
• Maxent models in NLP are essentially the same as multiclass logistic regression models in statistics (or machine learning)
• The key role of feature functions in NLP and in this presentation:
• The features are more general, with f also being a function of the class
Quiz Question
• Assuming exactly the same set up (3-class decision: LOCATION, PERSON, or DRUG; 3 features as before, maxent), what are:
• P(PERSON | by Goéric) =
• P(LOCATION | by Goéric) =
• P(DRUG | by Goéric) =
• 1.8   f1(c, d) ≡ [c = LOCATION ∧ w-1 = “in” ∧ isCapitalized(w)]
• –0.6   f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
• 0.3   f3(c, d) ≡ [c = DRUG ∧ ends(w, “c”)]
P(c|d, λ) = exp Σi λi fi(c, d) / Σc′ exp Σi λi fi(c′, d)
PERSON: by Goéric
LOCATION: by Goéric
DRUG: by Goéric
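If you want to check your answer, here is a self-contained sketch (spoiler: it prints the solution). With w-1 = “by”, f1 cannot fire, so only f2 (LOCATION) and f3 (DRUG) are active:

```python
from math import exp

# Votes for "by Goéric": f1 needs w-1 = "in", so it is off.
scores = {"LOCATION": -0.6, "DRUG": 0.3, "PERSON": 0.0}
Z = sum(exp(s) for s in scores.values())
for c, s in scores.items():
    print(c, round(exp(s) / Z, 3))
# LOCATION 0.189, DRUG 0.466, PERSON 0.345
```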
Feature-based Linear Classifiers
How to put features into a classifier
Building a Maxent Model
The nuts and bolts
Building a Maxent Model
• We define features (indicator functions) over data points
• Features represent sets of data points which are distinctive enough to deserve model parameters.
• Words, but also “word contains number”, “word ends with ing”, etc.
• We will simply encode each Φ feature as a unique String (index)
• A datum will give rise to a set of Strings: the active Φ features
• Each feature fi(c, d) ≡ [Φ(d) ∧ c = cj] gets a real number weight
• We concentrate on Φ features but the math uses i indices of fi
Building a Maxent Model
• Features are often added during model development to target errors
• Often, the easiest things to think of are features that mark bad combinations
• Then, for any given feature weights, we want to be able to calculate:
• Data conditional likelihood
• Derivative of the likelihood wrt each feature weight
• Uses expectations of each feature according to the model
• We can then find the optimum feature weights (discussed later).
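A minimal sketch of the String-encoding step described above, assuming a simple dict-based index from feature Strings to integer ids (the Φ predicates shown are illustrative):

```python
# Minimal sketch: encode each (phi, class) pair as a unique String index.
feature_index = {}  # feature String -> integer id

def fid(name):
    """Return a stable integer id for a feature String, adding it if new."""
    return feature_index.setdefault(name, len(feature_index))

def active_features(word, prev_word, c):
    """Ids of the active fi(c, d), pairing each active phi(d) with class c."""
    phis = ["w-1=" + prev_word]
    if word[0].isupper():
        phis.append("isCapitalized")
    if word.lower().endswith("c"):
        phis.append("endsWith_c")
    return [fid(phi + "&c=" + c) for phi in phis]

print(active_features("Québec", "in", "LOCATION"))  # e.g. [0, 1, 2]
```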
Building a Maxent Model
The nuts and bolts
Naive Bayes vs. Maxent models
Generative vs. Discriminative models: The problem of overcounting evidence
Text classification: Asia or Europe
NB FACTORS:
• P(A) = P(E) =
• P(M|A) =
• P(M|E) =
• P(H|A) = P(K|A) =
• P(H|E) = P(K|E) =

NB Model PREDICTIONS:
• P(A, H, K, M) =
• P(E, H, K, M) =
• P(A|H, K, M) =
• P(E|H, K, M) =

[Diagram: Naive Bayes model with Class node (Europe/Asia) over word nodes H, K, M]

[Training data: two columns of documents, Europe and Asia, containing the words Monaco (M) and Hong Kong (H, K)]
Naive Bayes vs. Maxent Models
• Naive Bayes models multi-count correlated evidence
• Each feature is multiplied in, even when you have multiple features telling you the same thing
• Maximum Entropy models (pretty much) solve this problem
• As we will see, this is done by weighting features so that model expectations match the observed (empirical) expectations
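The overcounting effect is easy to see empirically. A hedged sketch (my own illustration, not from the slides; assumes scikit-learn and NumPy are available) that duplicates a single noisy indicator feature and compares the two model families:

```python
# Duplicate one feature: Naive Bayes double-counts it, while logistic
# regression (a maxent model) shares the weight across the copies.
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
x = (y ^ (rng.random(1000) < 0.2)).astype(float)  # one noisy indicator of y
X1 = x[:, None]               # the feature once
X2 = np.hstack([X1, X1])      # the identical feature twice

t1, t2 = np.array([[1.0]]), np.array([[1.0, 1.0]])
nb1 = BernoulliNB().fit(X1, y).predict_proba(t1)[0, 1]
nb2 = BernoulliNB().fit(X2, y).predict_proba(t2)[0, 1]
lr1 = LogisticRegression().fit(X1, y).predict_proba(t1)[0, 1]
lr2 = LogisticRegression().fit(X2, y).predict_proba(t2)[0, 1]
print(f"NB: {nb1:.3f} -> {nb2:.3f}  (more extreme: evidence multi-counted)")
print(f"LR: {lr1:.3f} -> {lr2:.3f}  (nearly unchanged: weight is split)")
```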
Naive Bayes vs. Maxent models
Generative vs. Discriminative models: The problem of overcounting evidence
Maxent Models and Discriminative Estimation
Maximizing the likelihood
Feature Expectations
• We will crucially make use of two expectations – actual or predicted counts of a feature firing:
• Empirical count (expectation) of a feature:
E_empirical(fi) = Σ(c,d)∈observed(C,D) fi(c, d)
• Model expectation of a feature:
E(fi) = Σ(c,d)∈(C,D) P(c, d) fi(c, d)
Goal: fit the data well
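A minimal sketch of both quantities in code (conventions are illustrative: data is a list of (d, c) pairs, feats(c, d) returns the active feature ids, lam the weights; for the model expectation I use the conditional form Σc′ P(c′|d, λ) fi(c′, d) over the observed d’s, which is the “predicted count” used in the derivative slides below):

```python
from math import exp

def empirical_count(i, data, feats):
    """How often feature i fires in the observed (d, c) pairs."""
    return sum(i in feats(c, d) for d, c in data)

def p_cond(c, d, lam, feats, classes):
    """Maxent P(c|d, lambda)."""
    score = lambda cc: exp(sum(lam[j] for j in feats(cc, d)))
    return score(c) / sum(score(cc) for cc in classes)

def predicted_count(i, data, lam, feats, classes):
    """Model expectation of feature i at the observed d's."""
    return sum(p_cond(cc, d, lam, feats, classes) * (i in feats(cc, d))
               for d, _ in data for cc in classes)
```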
Exponential Model Likelihood
• Maximum (Conditional) Likelihood Models:
• Given a model form, choose values of parameters to maximize the (conditional) likelihood of the data.

log P(C|D, λ) = Σ(c,d)∈(C,D) log P(c|d, λ) = Σ(c,d)∈(C,D) log [ exp Σi λi fi(c, d) / Σc′ exp Σi λi fi(c′, d) ]
The Likelihood Value
• The (log) conditional likelihood of iid data (C, D) according to a maxent model is a function of the data and the parameters λ:

log P(C|D, λ) = log Π(c,d)∈(C,D) P(c|d, λ) = Σ(c,d)∈(C,D) log P(c|d, λ)

• If there aren’t many values of c, it’s easy to calculate:

log P(C|D, λ) = Σ(c,d)∈(C,D) log [ exp Σi λi fi(c, d) / Σc′ exp Σi λi fi(c′, d) ]
The Likelihood Value
• We can separate this into two components:

log P(C|D, λ) = Σ(c,d)∈(C,D) log exp Σi λi fi(c, d) − Σ(c,d)∈(C,D) log Σc′ exp Σi λi fi(c′, d)

log P(C|D, λ) = N(λ) − M(λ)

• The derivative is the difference between the derivatives of each component
The Derivative I: Numerator
Derivative of the numerator is: the empirical count(fi, C)

∂N(λ)/∂λi = ∂/∂λi Σ(c,d)∈(C,D) log exp Σi λi fi(c, d)
= Σ(c,d)∈(C,D) ∂/∂λi Σi λi fi(c, d)
= Σ(c,d)∈(C,D) fi(c, d)
The Derivative II: Denominator

∂M(λ)/∂λi = ∂/∂λi Σ(c,d)∈(C,D) log Σc′ exp Σi λi fi(c′, d)
= Σ(c,d)∈(C,D) [ 1 / Σc″ exp Σi λi fi(c″, d) ] · Σc′ ∂/∂λi exp Σi λi fi(c′, d)
= Σ(c,d)∈(C,D) [ 1 / Σc″ exp Σi λi fi(c″, d) ] · Σc′ exp(Σi λi fi(c′, d)) · ∂/∂λi Σi λi fi(c′, d)
= Σ(c,d)∈(C,D) Σc′ [ exp Σi λi fi(c′, d) / Σc″ exp Σi λi fi(c″, d) ] · fi(c′, d)
= Σ(c,d)∈(C,D) Σc′ P(c′|d, λ) fi(c′, d)
= predicted count(fi, λ)
The Derivative III

∂ log P(C|D, λ) / ∂λi = actual count(fi, C) − predicted count(fi, λ)

• The optimum parameters are the ones for which each feature’s predicted expectation equals its empirical expectation. The optimum distribution is:
• Always unique (but parameters may not be unique)
• Always exists (if feature counts are from actual data).
• These models are also called maximum entropy models because we find the model having maximum entropy and satisfying the constraints:

E_p(fj) = E_p̃(fj), ∀j   (model expectations equal empirical expectations for every feature j)
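As a sanity check, here is a small self-contained sketch comparing this analytic gradient (actual minus predicted counts) with a finite-difference estimate on a toy two-class, two-feature problem (all names and numbers are illustrative):

```python
from math import exp, log

f = {"A": [1.0, 0.0], "B": [0.0, 1.0]}  # feature vector per class
data = ["A", "A", "B"]                  # observed classes (one shared datum d)
lam = [0.5, -0.2]

def logprob(c, lam):
    z = sum(exp(sum(w * x for w, x in zip(lam, f[cc]))) for cc in f)
    return sum(w * x for w, x in zip(lam, f[c])) - log(z)

def loglik(lam):
    return sum(logprob(c, lam) for c in data)

# Analytic derivative for feature 0: actual count minus predicted count.
actual = sum(f[c][0] for c in data)
predicted = sum(exp(logprob(cc, lam)) * f[cc][0] for _ in data for cc in f)
analytic = actual - predicted

eps = 1e-6  # finite-difference check
numeric = (loglik([lam[0] + eps, lam[1]]) - loglik(lam)) / eps
print(analytic, numeric)  # the two values should agree closely
```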
Finding the optimal parameters
• We want to choose parameters λ1, λ2, λ3, … that maximize the conditional log-likelihood of the training data
• To be able to do that, we’ve worked out how to calculate the function value and its partial derivatives (its gradient)

CLogLik(C, D) = Σi=1..n log P(ci | di)
A likelihood surface [figure]
Finding the optimal parameters
• Use your favorite numerical optimization package….
• Commonly, you minimize the negative of CLogLik
1. Gradient descent (GD); Stochastic gradient descent (SGD)
2. Iterative proportional fitting methods: Generalized Iterative Scaling (GIS) and Improved Iterative Scaling (IIS)
3. Conjugate gradient (CG), perhaps with preconditioning
4. Quasi-Newton methods – limited memory variable metric (LMVM) methods, in particular, L-BFGS (see the sketch below)
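As an illustration of option 4, a hedged sketch (assumes NumPy and SciPy; toy problem, not a full NLP model) that hands the negated likelihood and its gradient to SciPy’s L-BFGS:

```python
# Fit maxent weights with L-BFGS by minimizing the negative CLogLik.
import numpy as np
from scipy.optimize import minimize

F = np.array([[1.0, 0.0], [0.0, 1.0]])  # feature vector per class
y = np.array([0, 0, 1])                 # observed classes (one shared datum)

def neg_loglik_and_grad(lam):
    scores = F @ lam
    p = np.exp(scores - scores.max())
    p /= p.sum()                         # P(c|d, lambda)
    ll = np.log(p[y]).sum()              # conditional log-likelihood
    actual = F[y].sum(axis=0)            # empirical feature counts
    predicted = len(y) * (p @ F)         # model-expected feature counts
    return -ll, -(actual - predicted)    # negate: we minimize

res = minimize(neg_loglik_and_grad, np.zeros(2), jac=True, method="L-BFGS-B")
print(res.x)  # fitted weights
```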
Gradient Descent (GD)
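A minimal sketch of the basic GD option (the learning rate eta and iteration count are illustrative; grad is the actual-minus-predicted gradient derived earlier):

```python
# Minimal sketch: batch gradient ascent on the conditional log-likelihood.
def gradient_ascent(grad, lam, eta=0.1, iters=100):
    for _ in range(iters):
        g = grad(lam)                                  # full-batch gradient
        lam = [w + eta * gi for w, gi in zip(lam, g)]  # step uphill
    return lam
```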
Maxent Models and Discriminative Estimation
Maximizing the likelihood
Feature Sparsity Regularization
Combating overfitting
Smoothing: Issues of Scale
• Lots of features:
• NLP maxent models can have well over a million features.
• Even storing a single array of parameter values can have a substantial memory cost.
• Lots of sparsity:
• Overfitting very easy – we need smoothing!
• Many features seen in training will never occur again at test time.
• Optimization problems:
• Feature weights can be infinite, and iterative solvers can take a long time to get to those infinities.
Smoothing/Priors/Regularization
Standard vs. Regularized Updates
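A hedged sketch of the contrast, assuming the usual Gaussian-prior (L2) penalty on the objective, CLogLik(λ) − Σi λi²/(2σ²), so the regularized gradient gains an extra −λi/σ² term (σ² is an illustrative hyperparameter):

```python
# Per-weight gradient terms: standard vs. L2-regularized update.
def standard_grad_i(actual_i, predicted_i, lam_i):
    return actual_i - predicted_i

def regularized_grad_i(actual_i, predicted_i, lam_i, sigma2=1.0):
    # The prior term pulls weights toward 0 and keeps them finite.
    return actual_i - predicted_i - lam_i / sigma2
```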
Feature Sparsity Regularization
Combating overfitting
Batch vs. Online Learning
GD vs. SGD
Stochastic Gradient Descent (SGD)
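A minimal sketch of SGD for this objective: update on one (d, c) example at a time using that example’s actual-minus-predicted feature counts (eta and the epoch count are illustrative):

```python
import random

def sgd(example_grad, lam, data, eta=0.1, epochs=5):
    """example_grad(lam, d, c) = fi(c, d) - sum_c' P(c'|d, lam) fi(c', d)."""
    for _ in range(epochs):
        random.shuffle(data)              # visit examples in random order
        for d, c in data:
            g = example_grad(lam, d, c)
            lam = [w + eta * gi for w, gi in zip(lam, g)]
    return lam  # (a decaying eta is common but omitted for brevity)
```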
Batch vs. Online learning: batch methods (like GD) compute each update from the full training set; online methods (like SGD and the perceptron) update after each example.
Batch vs. Online Learning
GD vs. SGD
Perceptron
Another Online Learning algorithm
Perceptron Algorithm
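A minimal sketch of the multiclass perceptron in the feature notation used throughout this deck (here feats(c, d) returns a dense feature vector; names are illustrative):

```python
def perceptron(data, classes, feats, n_feats, epochs=5):
    lam = [0.0] * n_feats
    for _ in range(epochs):
        for d, c in data:
            # Predict the class with the highest weighted vote.
            pred = max(classes, key=lambda cc: sum(
                lam[i] * v for i, v in enumerate(feats(cc, d))))
            if pred != c:  # update only on mistakes
                fc, fp = feats(c, d), feats(pred, d)
                lam = [w + t - p for w, t, p in zip(lam, fc, fp)]
    return lam
```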
MaxEnt vs. Perceptron
• Perceptron doesn’t always make updates
• Probabilities vs. scores
Regularization in the Perceptron Algorithm
• No gradient is computed, so we can’t directly include a regularizer in an objective function.
• Instead, run different numbers of iterations
• Use parameter averaging, for instance, the average of all parameters after seeing each data point (see the sketch below)
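A minimal sketch of the parameter-averaging idea from the last bullet, layered on the perceptron loop above (a naive running sum; production implementations use a lazier update):

```python
def averaged_perceptron(data, classes, feats, n_feats, epochs=5):
    lam = [0.0] * n_feats
    total = [0.0] * n_feats   # running sum of weights after each example
    steps = 0
    for _ in range(epochs):
        for d, c in data:
            pred = max(classes, key=lambda cc: sum(
                lam[i] * v for i, v in enumerate(feats(cc, d))))
            if pred != c:
                fc, fp = feats(c, d), feats(pred, d)
                lam = [w + t - p for w, t, p in zip(lam, fc, fp)]
            total = [t + w for t, w in zip(total, lam)]
            steps += 1
    return [t / steps for t in total]  # the averaged parameters
```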