Discriminative Estimation (Maxent models and perceptron)
Generative vs. Discriminative models
Many slides are adapted from slides by Christopher Manning and perceptron slides by Alan Ritter
Introduction
• So far we’ve looked at “generative models”
• Naive Bayes
• But there is now much use of conditional or discriminative probabilistic models in NLP, Speech, IR (and ML generally)
• Because:
• They give high accuracy performance
• They make it easy to incorporate lots of linguistically important features
• They allow automatic building of language independent, retargetable NLP modules
Joint vs. Conditional Models
• We have some data {(d, c)} of paired observations d and hidden classes c.
• Joint (generative) models place probabilities over both observed data and the hidden stuff (generate the observed data from hidden stuff):
• All the classic StatNLP models:
• n-gram models, Naive Bayes classifiers, hidden Markov models, probabilistic context-free grammars, IBM machine translation alignment models
P(c,d)
Joint vs. Conditional Models
• Discriminative (conditional) models take the data as given, and put a probability over hidden structure given the data:
• Logistic regression, conditional loglinear or maximum entropy models, conditional random fields
• Also, SVMs, (averaged) perceptron, etc. are discriminative classifiers (but not directly probabilistic)
P(c|d)
Bayes Net/Graphical Models
• Bayes net diagrams draw circles for random variables, and lines for direct dependencies
• Some variables are observed; some are hidden
• Each node is a little classifier (conditional probability table) based on incoming arcs
[Diagram: class node c with arcs to observed nodes d1, d2, d3]
Naive Bayes (Generative): [diagram: c → d1, d2, d3]
Logistic Regression (Discriminative): [diagram: d1, d2, d3 → c]
Conditional vs. Joint Likelihood
• A joint model gives probabilities P(d, c) and tries to maximize this joint likelihood.
• It turns out to be trivial to choose weights: just relative frequencies.
• A conditional model gives probabilities P(c|d). It takes the data as given and models only the conditional probability of the class.
• We seek to maximize conditional likelihood.
• Harder to do (as we’ll see…)
• More closely related to classification error.
Maxent Models and Discriminative Estimation
Generative vs. Discriminative models
Discriminative Model Features
Making features from text for discriminative NLP models
Features
• In these slides and most maxent work: features f are elementary pieces of evidence that link aspects of what we observe with a category c that we want to predict
• A feature is a function with a bounded real value: f: C × D → ℝ
• A belief: the job of a feature is to create a (useful) partition of the data
Features
• In NLP uses, usually a feature specifies
1. an indicator function – a yes/no boolean matching function – of properties of the input and
2. a particular class

fi(c, d) ≡ [Φ(d) ∧ c = cj]   [Value is 0 or 1]
• Each feature picks out a data subset and suggests a label for it
Example features
• f1(c, d) ≡ [c = LOCATION ∧ w-1 = “in” ∧ isCapitalized(w)]
• f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
• f3(c, d) ≡ [c = DRUG ∧ ends(w, “c”)]
• Models will assign to each feature a weight:
• A positive weight votes that this configuration is likely correct
• A negative weight votes that this configuration is likely incorrect
LOCATION: in Québec
PERSON: saw Sue
DRUG: taking Zantac
LOCATION: in Arcadia
Feature-Based Models
• The decision about a data point is based only on the features active at that point.
Text Categorization:
• Data: BUSINESS: Stocks hit a yearly low …
• Features: {…, stocks, hit, a, yearly, low, …}
• Label: BUSINESS

Word-Sense Disambiguation:
• Data: … to restructure bank:MONEY debt.
• Features: {…, w-1=restructure, w+1=debt, …}
• Label: MONEY

POS Tagging:
• Data: DT JJ NN … The previous fall …
• Features: {w=fall, t-1=JJ, w-1=previous}
• Label: NN
Example: Text Categorization
(Zhang and Oles 2001)
• Features are presence of each word in a document and the document class (they do feature selection to use reliable indicator words)
• Tests on classic Reuters data set (and others)
• Naïve Bayes: 77.0% F1
• Linear regression: 86.0%
• Logistic regression: 86.4%
• Support vector machine: 86.5%
• Paper emphasizes the importance of regularization (smoothing) for successful use of discriminative methods (not used in much early NLP/IR work)
Other Maxent Classifier Examples
• You can use a maxent classifier whenever you want to assign data points to one of a number of classes:
• Sentence boundary detection (Mikheev 2000)
• Is a period end of sentence or abbreviation?
• Sentiment analysis (Pang and Lee 2002)
• Word unigrams, bigrams, POS counts, …
• PP attachment (Ratnaparkhi 1998)
• Attach to verb or noun? Features of head noun, preposition, etc.
• Parsing decisions in general (Ratnaparkhi 1997; Johnson et al. 1999, etc.)
Discriminative Model Features
Making features from text for discriminative NLP models
Feature-based Linear Classifiers
How to put features into a classifier
Feature-Based Linear Classifiers
• Linear classifiers at classification time:
• Linear function from feature sets {fi} to classes {c}.
• Assign a weight λi to each feature fi.
• We consider each class for an observed datum d
• For a pair (c, d), features vote with their weights:
• vote(c) = Σi λi fi(c, d)
• Choose the class c which maximizes Σi λi fi(c, d)
LOCATION: in Québec
DRUG: in Québec
PERSON: in Québec
Feature-Based Linear Classifiers
• Linear classifiers at classification time:
• Linear function from feature sets {fi} to classes {c}.
• Assign a weight λi to each feature fi.
• We consider each class for an observed datum d
• For a pair (c, d), features vote with their weights:
• vote(c) = Σi λi fi(c, d)
• Choose the class c which maximizes Σi λi fi(c, d) = LOCATION
Weights: f1 = 1.8, f2 = –0.6, f3 = 0.3
LOCATION: in Québec → vote = 1.8 – 0.6 = 1.2
DRUG: in Québec → vote = 0.3
PERSON: in Québec → vote = 0
Feature-Based Linear Classifiers
There are many ways to choose weights for features, with different loss functions as the optimization goal:
• Perceptron: find a currently misclassified example, and nudge weights in the direction of its correct classification
• Margin-based methods (Support Vector Machines)
Feature-Based Linear Classifiers
• Exponential (log-linear, maxent, logistic, Gibbs) models:
• Make a probabilistic model from the linear combination Σi λi fi(c, d)
• P(LOCATION | in Québec) = e^1.8 e^–0.6 / (e^1.8 e^–0.6 + e^0.3 + e^0) = 0.586
• P(DRUG | in Québec) = e^0.3 / (e^1.8 e^–0.6 + e^0.3 + e^0) = 0.238
• P(PERSON | in Québec) = e^0 / (e^1.8 e^–0.6 + e^0.3 + e^0) = 0.176
• The weights are the parameters of the probability model, combined via a “soft max” function
P(c|d, λ) = exp Σi λi fi(c, d) / Σc′ exp Σi λi fi(c′, d)
• The exp makes the votes positive
• The sum over classes c′ normalizes the votes
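A minimal runnable sketch of this “soft max” step, starting from the per-class votes computed above (numbers from the slide):

```python
from math import exp

# Votes for "in Québec": LOCATION = 1.8 - 0.6 = 1.2, DRUG = 0.3, PERSON = 0.
votes = {"LOCATION": 1.2, "DRUG": 0.3, "PERSON": 0.0}

def maxent_probs(votes):
    """P(c|d, lambda) = exp(vote(c)) / sum over c' of exp(vote(c'))."""
    Z = sum(exp(v) for v in votes.values())           # normalizes the votes
    return {c: exp(v) / Z for c, v in votes.items()}  # exp makes votes positive

print(maxent_probs(votes))
# {'LOCATION': 0.586, 'DRUG': 0.238, 'PERSON': 0.176} (to 3 decimal places)
```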
Aside: logistic regression
• Maxent models in NLP are essentially the same as multiclass logistic regression models in statistics (or machine learning)
• The key role of feature functions in NLP and in this presentation:
• The features are more general, with f also being a function of the class
Quiz Question
• Assuming exactly the same set up (3-class decision: LOCATION, PERSON, or DRUG; 3 features as before, maxent), what are:
• P(PERSON | by Goéric) =
• P(LOCATION | by Goéric) =
• P(DRUG | by Goéric) =
• 1.8   f1(c, d) ≡ [c = LOCATION ∧ w-1 = “in” ∧ isCapitalized(w)]
• –0.6   f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
• 0.3   f3(c, d) ≡ [c = DRUG ∧ ends(w, “c”)]
P(c|d, λ) = exp Σi λi fi(c, d) / Σc′ exp Σi λi fi(c′, d)
PERSON: by Goéric
LOCATION: by Goéric
DRUG: by Goéric
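If you want to check your answer, here is a self-contained sketch (spoiler: it prints the solution). With w-1 = “by”, f1 cannot fire, so only f2 (LOCATION) and f3 (DRUG) are active:

```python
from math import exp

# Votes for "by Goéric": f1 needs w-1 = "in", so it is off.
scores = {"LOCATION": -0.6, "DRUG": 0.3, "PERSON": 0.0}
Z = sum(exp(s) for s in scores.values())
for c, s in scores.items():
    print(c, round(exp(s) / Z, 3))
# LOCATION 0.189, DRUG 0.466, PERSON 0.345
```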
Feature-based Linear Classifiers
How to put features into a classifier
Building a Maxent Model
The nuts and bolts
Building a Maxent Model
• We define features (indicator functions) over data points
• Features represent sets of data points which are distinctive enough to deserve model parameters.
• Words, but also “word contains number”, “word ends with ing”, etc.
• We will simply encode each Φ feature as a unique String (index)
• A datum will give rise to a set of Strings: the active Φ features
• Each feature fi(c, d) ≡ [Φ(d) ∧ c = cj] gets a real number weight
• We concentrate on Φ features but the math uses i indices of fi
Building a Maxent Model
• Features are often added during model development to target errors
• Often, the easiest things to think of are features that mark bad combinations
• Then, for any given feature weights, we want to be able to calculate:
• Data conditional likelihood
• Derivative of the likelihood wrt each feature weight
• Uses expectations of each feature according to the model
• We can then find the optimum feature weights (discussed later).
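A minimal sketch of the String-encoding step described above, assuming a simple dict-based index from feature Strings to integer ids (the Φ predicates shown are illustrative):

```python
# Minimal sketch: encode each (phi, class) pair as a unique String index.
feature_index = {}  # feature String -> integer id

def fid(name):
    """Return a stable integer id for a feature String, adding it if new."""
    return feature_index.setdefault(name, len(feature_index))

def active_features(word, prev_word, c):
    """Ids of the active fi(c, d), pairing each active phi(d) with class c."""
    phis = ["w-1=" + prev_word]
    if word[0].isupper():
        phis.append("isCapitalized")
    if word.lower().endswith("c"):
        phis.append("endsWith_c")
    return [fid(phi + "&c=" + c) for phi in phis]

print(active_features("Québec", "in", "LOCATION"))  # e.g. [0, 1, 2]
```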
Building a Maxent Model
The nuts and bolts
Naive Bayes vs. Maxent models
Generative vs. Discriminative models: The problem of overcounting evidence
Text classification: Asia or Europe
NB FACTORS:
• P(A) = P(E) =
• P(M|A) =
• P(M|E) =
• P(H|A) = P(K|A) =
• P(H|E) = P(K|E) =

NB Model PREDICTIONS:
• P(A, H, K, M) =
• P(E, H, K, M) =
• P(A|H, K, M) =
• P(E|H, K, M) =

[Diagram: Naive Bayes model with Class node (Europe/Asia) over word nodes H, K, M]

[Training data: two columns of documents, Europe and Asia, containing the words Monaco (M) and Hong Kong (H, K)]
Naive Bayes vs. Maxent Models
• Naive Bayes models multi-count correlated evidence
• Each feature is multiplied in, even when you have multiple features telling you the same thing
• Maximum Entropy models (pretty much) solve this problem
• As we will see, this is done by weighting features so that model expectations match the observed (empirical) expectations
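The overcounting effect is easy to see empirically. A hedged sketch (my own illustration, not from the slides; assumes scikit-learn and NumPy are available) that duplicates a single noisy indicator feature and compares the two model families:

```python
# Duplicate one feature: Naive Bayes double-counts it, while logistic
# regression (a maxent model) shares the weight across the copies.
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
x = (y ^ (rng.random(1000) < 0.2)).astype(float)  # one noisy indicator of y
X1 = x[:, None]               # the feature once
X2 = np.hstack([X1, X1])      # the identical feature twice

t1, t2 = np.array([[1.0]]), np.array([[1.0, 1.0]])
nb1 = BernoulliNB().fit(X1, y).predict_proba(t1)[0, 1]
nb2 = BernoulliNB().fit(X2, y).predict_proba(t2)[0, 1]
lr1 = LogisticRegression().fit(X1, y).predict_proba(t1)[0, 1]
lr2 = LogisticRegression().fit(X2, y).predict_proba(t2)[0, 1]
print(f"NB: {nb1:.3f} -> {nb2:.3f}  (more extreme: evidence multi-counted)")
print(f"LR: {lr1:.3f} -> {lr2:.3f}  (nearly unchanged: weight is split)")
```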
Naive Bayes vs. Maxent models
Generative vs. Discriminative models: The problem of overcounting evidence
Maxent Models and Discriminative Estimation
Maximizing the likelihood
Feature Expectations
• We will crucially make use of two expectations – actual or predicted counts of a feature firing:
• Empirical count (expectation) of a feature:
E_empirical(fi) = Σ(c,d)∈observed(C,D) fi(c, d)
• Model expectation of a feature:
E(fi) = Σ(c,d)∈(C,D) P(c, d) fi(c, d)
Goal: fit the data well
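A minimal sketch of both quantities in code (conventions are illustrative: data is a list of (d, c) pairs, feats(c, d) returns the active feature ids, lam the weights; for the model expectation I use the conditional form Σc′ P(c′|d, λ) fi(c′, d) over the observed d’s, which is the “predicted count” used in the derivative slides below):

```python
from math import exp

def empirical_count(i, data, feats):
    """How often feature i fires in the observed (d, c) pairs."""
    return sum(i in feats(c, d) for d, c in data)

def p_cond(c, d, lam, feats, classes):
    """Maxent P(c|d, lambda)."""
    score = lambda cc: exp(sum(lam[j] for j in feats(cc, d)))
    return score(c) / sum(score(cc) for cc in classes)

def predicted_count(i, data, lam, feats, classes):
    """Model expectation of feature i at the observed d's."""
    return sum(p_cond(cc, d, lam, feats, classes) * (i in feats(cc, d))
               for d, _ in data for cc in classes)
```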
Exponential Model Likelihood
• Maximum (Conditional) Likelihood Models:
• Given a model form, choose values of parameters to maximize the (conditional) likelihood of the data.

log P(C|D, λ) = Σ(c,d)∈(C,D) log P(c|d, λ) = Σ(c,d)∈(C,D) log [ exp Σi λi fi(c, d) / Σc′ exp Σi λi fi(c′, d) ]
The Likelihood Value
• The (log) conditional likelihood of iid data (C, D) according to a maxent model is a function of the data and the parameters λ:

log P(C|D, λ) = log Π(c,d)∈(C,D) P(c|d, λ) = Σ(c,d)∈(C,D) log P(c|d, λ)

• If there aren’t many values of c, it’s easy to calculate:

log P(C|D, λ) = Σ(c,d)∈(C,D) log [ exp Σi λi fi(c, d) / Σc′ exp Σi λi fi(c′, d) ]
The Likelihood Value
• We can separate this into two components:

log P(C|D, λ) = Σ(c,d)∈(C,D) log exp Σi λi fi(c, d) − Σ(c,d)∈(C,D) log Σc′ exp Σi λi fi(c′, d)

log P(C|D, λ) = N(λ) − M(λ)

• The derivative is the difference between the derivatives of each component
The Derivative I: Numerator
Derivative of the numerator is: the empirical count(fi, C)

∂N(λ)/∂λi = ∂/∂λi Σ(c,d)∈(C,D) log exp Σi λi fi(c, d)
= Σ(c,d)∈(C,D) ∂/∂λi Σi λi fi(c, d)
= Σ(c,d)∈(C,D) fi(c, d)
The Derivative II: Denominator

∂M(λ)/∂λi = ∂/∂λi Σ(c,d)∈(C,D) log Σc′ exp Σi λi fi(c′, d)
= Σ(c,d)∈(C,D) [ 1 / Σc″ exp Σi λi fi(c″, d) ] · Σc′ ∂/∂λi exp Σi λi fi(c′, d)
= Σ(c,d)∈(C,D) [ 1 / Σc″ exp Σi λi fi(c″, d) ] · Σc′ exp(Σi λi fi(c′, d)) · ∂/∂λi Σi λi fi(c′, d)
= Σ(c,d)∈(C,D) Σc′ [ exp Σi λi fi(c′, d) / Σc″ exp Σi λi fi(c″, d) ] · fi(c′, d)
= Σ(c,d)∈(C,D) Σc′ P(c′|d, λ) fi(c′, d)
= predicted count(fi, λ)
The Derivative III

∂ log P(C|D, λ) / ∂λi = actual count(fi, C) − predicted count(fi, λ)

• The optimum parameters are the ones for which each feature’s predicted expectation equals its empirical expectation. The optimum distribution is:
• Always unique (but parameters may not be unique)
• Always exists (if feature counts are from actual data).
• These models are also called maximum entropy models because we find the model having maximum entropy and satisfying the constraints:

E_p(fj) = E_p̃(fj), ∀j   (model expectations equal empirical expectations for every feature j)
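As a sanity check, here is a small self-contained sketch comparing this analytic gradient (actual minus predicted counts) with a finite-difference estimate on a toy two-class, two-feature problem (all names and numbers are illustrative):

```python
from math import exp, log

f = {"A": [1.0, 0.0], "B": [0.0, 1.0]}  # feature vector per class
data = ["A", "A", "B"]                  # observed classes (one shared datum d)
lam = [0.5, -0.2]

def logprob(c, lam):
    z = sum(exp(sum(w * x for w, x in zip(lam, f[cc]))) for cc in f)
    return sum(w * x for w, x in zip(lam, f[c])) - log(z)

def loglik(lam):
    return sum(logprob(c, lam) for c in data)

# Analytic derivative for feature 0: actual count minus predicted count.
actual = sum(f[c][0] for c in data)
predicted = sum(exp(logprob(cc, lam)) * f[cc][0] for _ in data for cc in f)
analytic = actual - predicted

eps = 1e-6  # finite-difference check
numeric = (loglik([lam[0] + eps, lam[1]]) - loglik(lam)) / eps
print(analytic, numeric)  # the two values should agree closely
```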
Finding the optimal parameters
• We want to choose parameters λ1, λ2, λ3, … that maximize the conditional log-likelihood of the training data
• To be able to do that, we’ve worked out how to calculate the function value and its partial derivatives (its gradient)

CLogLik(C, D) = Σi=1..n log P(ci | di)
A likelihood surface [figure]
Finding the optimal parameters
• Use your favorite numerical optimization package….
• Commonly, you minimize the negative of CLogLik
1. Gradient descent (GD); Stochastic gradient descent (SGD)
2. Iterative proportional fitting methods: Generalized Iterative Scaling (GIS) and Improved Iterative Scaling (IIS)
3. Conjugate gradient (CG), perhaps with preconditioning
4. Quasi-Newton methods – limited memory variable metric (LMVM) methods, in particular, L-BFGS (see the sketch below)
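As an illustration of option 4, a hedged sketch (assumes NumPy and SciPy; toy problem, not a full NLP model) that hands the negated likelihood and its gradient to SciPy’s L-BFGS:

```python
# Fit maxent weights with L-BFGS by minimizing the negative CLogLik.
import numpy as np
from scipy.optimize import minimize

F = np.array([[1.0, 0.0], [0.0, 1.0]])  # feature vector per class
y = np.array([0, 0, 1])                 # observed classes (one shared datum)

def neg_loglik_and_grad(lam):
    scores = F @ lam
    p = np.exp(scores - scores.max())
    p /= p.sum()                         # P(c|d, lambda)
    ll = np.log(p[y]).sum()              # conditional log-likelihood
    actual = F[y].sum(axis=0)            # empirical feature counts
    predicted = len(y) * (p @ F)         # model-expected feature counts
    return -ll, -(actual - predicted)    # negate: we minimize

res = minimize(neg_loglik_and_grad, np.zeros(2), jac=True, method="L-BFGS-B")
print(res.x)  # fitted weights
```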
Gradient Descent (GD)
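A minimal sketch of the basic GD option (the learning rate eta and iteration count are illustrative; grad is the actual-minus-predicted gradient derived earlier):

```python
# Minimal sketch: batch gradient ascent on the conditional log-likelihood.
def gradient_ascent(grad, lam, eta=0.1, iters=100):
    for _ in range(iters):
        g = grad(lam)                                  # full-batch gradient
        lam = [w + eta * gi for w, gi in zip(lam, g)]  # step uphill
    return lam
```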
Maxent Models and Discriminative Estimation
Maximizing the likelihood
Feature Sparsity Regularization
Combating overfitting
Smoothing: Issues of Scale
• Lots of features:
• NLP maxent models can have well over a million features.
• Even storing a single array of parameter values can have a substantial memory cost.
• Lots of sparsity:
• Overfitting very easy – we need smoothing!
• Many features seen in training will never occur again at test time.
• Optimization problems:
• Feature weights can be infinite, and iterative solvers can take a long time to get to those infinities.
Smoothing/Priors/Regularization
Standard vs. Regularized Updates
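A hedged sketch of the contrast, assuming the usual Gaussian-prior (L2) penalty on the objective, CLogLik(λ) − Σi λi²/(2σ²), so the regularized gradient gains an extra −λi/σ² term (σ² is an illustrative hyperparameter):

```python
# Per-weight gradient terms: standard vs. L2-regularized update.
def standard_grad_i(actual_i, predicted_i, lam_i):
    return actual_i - predicted_i

def regularized_grad_i(actual_i, predicted_i, lam_i, sigma2=1.0):
    # The prior term pulls weights toward 0 and keeps them finite.
    return actual_i - predicted_i - lam_i / sigma2
```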
Feature Sparsity Regularization
Combating overfitting
Batch vs. Online Learning
GD vs. SGD
Stochastic Gradient Descent (SGD)
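A minimal sketch of SGD for this objective: update on one (d, c) example at a time using that example’s actual-minus-predicted feature counts (eta and the epoch count are illustrative):

```python
import random

def sgd(example_grad, lam, data, eta=0.1, epochs=5):
    """example_grad(lam, d, c) = fi(c, d) - sum_c' P(c'|d, lam) fi(c', d)."""
    for _ in range(epochs):
        random.shuffle(data)              # visit examples in random order
        for d, c in data:
            g = example_grad(lam, d, c)
            lam = [w + eta * gi for w, gi in zip(lam, g)]
    return lam  # (a decaying eta is common but omitted for brevity)
```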
Batch vs. Online learning: batch methods (like GD) compute each update from the full training set; online methods (like SGD and the perceptron) update after each example.
Batch vs. Online Learning
GD vs. SGD
Perceptron
Another Online Learning algorithm
Perceptron Algorithm
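A minimal sketch of the multiclass perceptron in the feature notation used throughout this deck (here feats(c, d) returns a dense feature vector; names are illustrative):

```python
def perceptron(data, classes, feats, n_feats, epochs=5):
    lam = [0.0] * n_feats
    for _ in range(epochs):
        for d, c in data:
            # Predict the class with the highest weighted vote.
            pred = max(classes, key=lambda cc: sum(
                lam[i] * v for i, v in enumerate(feats(cc, d))))
            if pred != c:  # update only on mistakes
                fc, fp = feats(c, d), feats(pred, d)
                lam = [w + t - p for w, t, p in zip(lam, fc, fp)]
    return lam
```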
MaxEnt vs. Perceptron
• Perceptron doesn’t always make updates
• Probabilities vs. scores
Regularization in the Perceptron Algorithm
• No gradient is computed, so we can’t directly include a regularizer in an objective function.
• Instead, run different numbers of iterations
• Use parameter averaging, for instance, the average of all parameters after seeing each data point (see the sketch below)
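A minimal sketch of the parameter-averaging idea from the last bullet, layered on the perceptron loop above (a naive running sum; production implementations use a lazier update):

```python
def averaged_perceptron(data, classes, feats, n_feats, epochs=5):
    lam = [0.0] * n_feats
    total = [0.0] * n_feats   # running sum of weights after each example
    steps = 0
    for _ in range(epochs):
        for d, c in data:
            pred = max(classes, key=lambda cc: sum(
                lam[i] * v for i, v in enumerate(feats(cc, d))))
            if pred != c:
                fc, fp = feats(c, d), feats(pred, d)
                lam = [w + t - p for w, t, p in zip(lam, fc, fp)]
            total = [t + w for t, w in zip(total, lam)]
            steps += 1
    return [t / steps for t in total]  # the averaged parameters
```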