Text Classification and Naïve Bayes The Task of Text Classification Many slides are adapted from slides by Dan Jurafsky

Oct 05, 2020


dariahiddleston
Page 1:

Text Classification and Naïve Bayes

The Task of Text Classification

Many slides are adapted from slides by Dan Jurafsky

Page 2:

Is this spam?

Page 3:

Who wrote which Federalist papers?

• 1787-8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.
• Authorship of 12 of the letters in dispute
• 1963: solved by Mosteller and Wallace using Bayesian methods

James Madison    Alexander Hamilton

Page 4:

Male or female author?

1. By 1925 present-day Vietnam was divided into three parts under French colonial rule. The southern region embracing Saigon and the Mekong delta was the colony of Cochin-China; the central area with its imperial capital at Hue was the protectorate of Annam…

2. Clara never failed to be astonished by the extraordinary felicity of her own name. She found it hard to trust herself to the mercy of fate, which had managed over the years to convert her greatest shame into one of her greatest assets…

S. Argamon, M. Koppel, J. Fine, A. R. Shimoni, 2003. “Gender, Genre, and Writing Style in Formal Written Texts,” Text, volume 23, number 3, pp. 321–346

Page 5:

Positive or negative movie review?

• unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.

Page 6:

What is the subject of this article?

MEDLINE Article: ?

MeSH Subject Category Hierarchy:

• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …

Page 7:

Text Classification

• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language Identification
• Sentiment analysis
• …

Page 8:

Text Classification: definition

• Input:
  – a document d
  – a fixed set of classes C = {c_1, c_2, …, c_J}
• Output: a predicted class c ∈ C

Page 9:

Classification Methods: Hand-coded rules

• Rules based on combinations of words or other features
  – spam: black-list-address OR (“dollars” AND “have been selected”)
• Accuracy can be high
  – If rules carefully refined by expert
• But building and maintaining these rules is expensive
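The spam rule on this slide can be written out directly. The following is a minimal sketch, not a production filter: the function name `is_spam`, its parameters, and the example black list are hypothetical and chosen only to mirror the rule "black-list-address OR (“dollars” AND “have been selected”)".

```python
def is_spam(sender, body, blacklist):
    """Hand-coded rule: black-listed address OR ("dollars" AND "have been selected")."""
    text = body.lower()
    on_blacklist = sender.lower() in blacklist
    has_both_phrases = "dollars" in text and "have been selected" in text
    return on_blacklist or has_both_phrases


# Hypothetical black list, for illustration only.
BLACKLIST = {"promo@example.com"}

print(is_spam("friend@example.org",
              "You have been selected to receive a million dollars",
              BLACKLIST))
```

Every new spam pattern needs another hand-written clause like these, which is the maintenance cost the last bullet refers to.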

Page 10:

Classification Methods: Supervised Machine Learning

• Input:
  – a document d
  – a fixed set of classes C = {c_1, c_2, …, c_J}
  – a training set of m hand-labeled documents (d_1, c_1), …, (d_m, c_m)
• Output:
  – a learned classifier γ: d → c

Page 11:

Classification Methods: Supervised Machine Learning

• Any kind of classifier
  – Naïve Bayes
  – Logistic regression, maxent
  – Support-vector machines
  – k-Nearest Neighbors
  – …

Page 12:

Text Classification and Naïve Bayes

The Task of Text Classification

Page 13:

Text Classification and Naïve Bayes

Formalizing the Naïve Bayes Classifier

Page 14:

Naïve Bayes Intuition

• Simple (“naïve”) classification method based on Bayes rule
• Relies on very simple representation of document
  – Bag of words

Page 15:

Bayes’ Rule Applied to Documents and Classes

• For a document d and a class c

$$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}$$

Page 16:

Naïve Bayes Classifier (I)

$$
\begin{aligned}
c_{MAP} &= \operatorname*{argmax}_{c \in C} P(c \mid d) && \text{MAP is “maximum a posteriori” = most likely class} \\
&= \operatorname*{argmax}_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)} && \text{Bayes Rule} \\
&= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c) && \text{Dropping the denominator}
\end{aligned}
$$

Page 17:

Naïve Bayes Classifier (II)

$$
\begin{aligned}
c_{MAP} &= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c) \\
&= \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c) && \text{Document } d \text{ represented as features } x_1..x_n
\end{aligned}
$$

Page 18:

Naïve Bayes Classifier (IV)

$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$

• P(x_1, x_2, …, x_n | c): O(|X|^n · |C|) parameters. Could only be estimated if a very, very large number of training examples was available.
• P(c): How often does this class occur? We can just count the relative frequencies in a corpus.

Page 19:

Multinomial Naïve Bayes Independence Assumptions

• Bag of Words assumption: Assume position doesn’t matter
• Conditional Independence: Assume the feature probabilities P(x_i | c_j) are independent given the class c.

$$P(x_1, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdot P(x_3 \mid c) \cdot \ldots \cdot P(x_n \mid c)$$

Page 20:

The bag of words representation

γ( I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet. ) = c

Page 21:

The bag of words representation

γ(
  great       2
  love        2
  recommend   1
  laugh       1
  happy       1
  ...         ...
) = c
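A bag of words is just a table of word counts with the order thrown away. The sketch below is one assumed way to build it: the lowercasing and the regular-expression tokenizer are illustrative choices, not something the slides specify, and no stemming is applied, so the counts need not match the table on the slide.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map a document to word counts, ignoring word order."""
    # Assumed tokenizer: lowercase, keep runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


excerpt = ("I love this movie! I would recommend it to just about anyone. "
           "I've seen it several times")
bag = bag_of_words(excerpt)
print(bag.most_common(3))
```

Because only counts are kept, `bag_of_words("great movie")` and `bag_of_words("movie great")` are the same object as far as the classifier is concerned.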

Page 22:

Bag of words for document classification

Test document: parser, language, label, translation, … → ?

Class                Words
Machine Learning     learning, training, algorithm, shrinkage, network, ...
NLP                  parser, tag, training, translation, language, ...
Garbage Collection   garbage, collection, memory, optimization, region, ...
Planning             planning, temporal, reasoning, plan, language, ...
GUI                  ...

Page 23:

Applying Multinomial Naive Bayes Classifiers to Text Classification

positions ← all word positions in test document

$$c_{NB} = \operatorname*{argmax}_{c_j \in C} P(c_j) \prod_{i \in positions} P(x_i \mid c_j)$$

Page 24:

Text Classification and Naïve Bayes

Formalizing the Naïve Bayes Classifier

Page 25:

Text Classification and Naïve Bayes

Naïve Bayes: Learning

Page 26:

Learning the Multinomial Naïve Bayes Model

Sec. 13.3

• First attempt: maximum likelihood estimates
  – simply use the frequencies in the data

$$\hat{P}(c_j) = \frac{\mathrm{doccount}(C = c_j)}{N_{doc}}$$

$$\hat{P}(w_i \mid c_j) = \frac{\mathrm{count}(w_i, c_j)}{\sum_{w \in V} \mathrm{count}(w, c_j)}$$
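The two maximum likelihood estimates are plain relative frequencies, so they can be computed with counters. This is a minimal sketch under stated assumptions: `train_mle` and the three toy documents are hypothetical and exist only to show the counting.

```python
from collections import Counter, defaultdict


def train_mle(docs):
    """docs: list of (tokens, class_label) pairs. Returns (priors, likelihoods)."""
    doc_counts = Counter()               # doccount(C = c_j)
    word_counts = defaultdict(Counter)   # count(w, c_j)
    for tokens, label in docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)

    n_docs = sum(doc_counts.values())
    priors = {c: doc_counts[c] / n_docs for c in doc_counts}
    likelihoods = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())     # sum over w in V of count(w, c_j)
        likelihoods[c] = {w: n / total for w, n in counts.items()}
    return priors, likelihoods


# Hypothetical toy corpus, for illustration only.
toy_docs = [
    ("great great fun".split(), "pos"),
    ("fun plot".split(), "pos"),
    ("boring plot".split(), "neg"),
]
priors, likelihoods = train_mle(toy_docs)
print(priors["pos"], likelihoods["pos"]["great"])
```

Note that a word never seen with a class (here "great" with "neg") has no entry at all, which corresponds to an estimated probability of zero.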

Page 27:

Parameter estimation

$$\hat{P}(w_i \mid c_j) = \frac{\mathrm{count}(w_i, c_j)}{\sum_{w \in V} \mathrm{count}(w, c_j)}$$

fraction of times word w_i appears among all words in documents of topic c_j

• Create mega-document for topic j by concatenating all docs in this topic
  – Use frequency of w in mega-document

Page 28:

Problem with Maximum Likelihood

Sec. 13.3

• What if we have seen no training documents with the word fantastic and classified in the topic positive (thumbs-up)?

$$\hat{P}(\text{“fantastic”} \mid \text{positive}) = \frac{\mathrm{count}(\text{“fantastic”}, \text{positive})}{\sum_{w \in V} \mathrm{count}(w, \text{positive})} = 0$$

• Zero probabilities cannot be conditioned away, no matter the other evidence!

$$c_{MAP} = \operatorname*{argmax}_{c} \hat{P}(c) \prod_{i} \hat{P}(x_i \mid c)$$

Page 29:

Laplace (add-1) smoothing: unknown words

Add one extra word to the vocabulary, the “unknown word” w_u

$$\hat{P}(w_u \mid c) = \frac{\mathrm{count}(w_u, c) + 1}{\left( \sum_{w \in V} \mathrm{count}(w, c) \right) + |V| + 1} = \frac{1}{\left( \sum_{w \in V} \mathrm{count}(w, c) \right) + |V| + 1}$$
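The smoothed estimate can be checked numerically. The sketch below assumes the same denominator as the formula on this slide (total word count in the class, plus |V|, plus 1 for the unknown word); the function name `smoothed_likelihood` and the toy counts are hypothetical.

```python
from collections import Counter


def smoothed_likelihood(word, class_counts, vocab):
    """Add-1 estimate of P(word | c) with one extra slot for the unknown word."""
    total = sum(class_counts[w] for w in vocab)
    # Any word outside the vocabulary is treated as the unknown word w_u,
    # whose count is zero by definition.
    count = class_counts[word] if word in vocab else 0
    return (count + 1) / (total + len(vocab) + 1)


# Hypothetical counts for one class, for illustration only.
vocab = {"great", "fun", "plot", "boring"}
pos_counts = Counter({"great": 2, "fun": 2, "plot": 1})

p_seen = smoothed_likelihood("great", pos_counts, vocab)
p_unseen = smoothed_likelihood("boring", pos_counts, vocab)
p_unknown = smoothed_likelihood("fantastic", pos_counts, vocab)
print(p_seen, p_unseen, p_unknown)
```

With these counts the denominator is 5 + 4 + 1 = 10, so no word gets probability zero, and the estimates over the vocabulary plus the unknown word still sum to one.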

Page 30:

Underflow Prevention: log space

• Multiplying lots of probabilities can result in floating-point underflow.
• Since log(xy) = log(x) + log(y)
  – Better to sum logs of probabilities instead of multiplying probabilities.
• Class with highest un-normalized log probability score is still most probable.

$$c_{NB} = \operatorname*{argmax}_{c_j \in C} \left[ \log P(c_j) + \sum_{i \in positions} \log P(x_i \mid c_j) \right]$$

• Model is now just max of sum of weights
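The underflow problem is easy to reproduce with double-precision floats. In this sketch the per-word probabilities are made-up values (100 words at 1e-5 or 2e-5 each), chosen only so that the product falls below the smallest representable positive double.

```python
import math


def product(probs):
    """Multiply probabilities directly (prone to underflow)."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def log_score(probs):
    """Sum log probabilities instead of multiplying."""
    return sum(math.log(p) for p in probs)


# Hypothetical per-word probabilities for two classes, 100 word positions each.
probs_a = [1e-5] * 100
probs_b = [2e-5] * 100

# Both products underflow to 0.0, so the classes cannot be compared.
print(product(probs_a), product(probs_b))

# The log scores remain finite and preserve the ordering.
print(log_score(probs_a) < log_score(probs_b))
```

Because the logarithm is monotonic, the class with the largest log score is the same class that would have had the largest product, had the product been representable.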

Page 31:

Text Classification and Naïve Bayes

Naïve Bayes: Learning

Page 32:

Text Classification and Naïve Bayes

Multinomial Naïve Bayes: A Worked Example

Page 33:

           Doc   Words                                  Class
Training   1     Chinese Beijing Chinese                c
           2     Chinese Chinese Shanghai               c
           3     Chinese Macao                          c
           4     Tokyo Japan Chinese                    j
Test       5     Chinese Chinese Chinese Tokyo Japan    ?

$$\hat{P}(c) = \frac{N_c}{N} \qquad \hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\mathrm{count}(c) + |V| + 1}$$

Here |V| = 6, and the extra +1 in the denominator is the slot for the unknown word, which gives the 7 used in the calculations.

Priors:
P(c) = 3/4
P(j) = 1/4

Conditional Probabilities:
P(Chinese|c) = (5+1)/(8+7) = 6/15
P(Tokyo|c) = (0+1)/(8+7) = 1/15
P(Japan|c) = (0+1)/(8+7) = 1/15
P(Chinese|j) = (1+1)/(3+7) = 2/10
P(Tokyo|j) = (1+1)/(3+7) = 2/10
P(Japan|j) = (1+1)/(3+7) = 2/10

Choosing a class:
P(c|d5) ∝ 3/4 * (6/15)^3 * 1/15 * 1/15 ≈ 0.0002
P(j|d5) ∝ 1/4 * (2/10)^3 * 2/10 * 2/10 ≈ 0.00008
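The arithmetic of this worked example can be reproduced exactly with rational numbers. The sketch below uses the four training documents and the test document from the slide; the helper names (`prior`, `likelihood`, `score`) are hypothetical, and the denominator uses |V| + 1 so that it matches the 8+7 and 3+7 terms in the calculations.

```python
from collections import Counter
from fractions import Fraction

train = [
    ("Chinese Beijing Chinese", "c"),
    ("Chinese Chinese Shanghai", "c"),
    ("Chinese Macao", "c"),
    ("Tokyo Japan Chinese", "j"),
]
test_doc = "Chinese Chinese Chinese Tokyo Japan"

vocab = {w for text, _ in train for w in text.split()}
doc_counts = Counter(label for _, label in train)
word_counts = {c: Counter() for c in doc_counts}
for text, label in train:
    word_counts[label].update(text.split())


def prior(c):
    return Fraction(doc_counts[c], len(train))


def likelihood(w, c):
    # Add-1 smoothing; |V| + 1 reserves one slot for the unknown word.
    total = sum(word_counts[c].values())
    return Fraction(word_counts[c][w] + 1, total + len(vocab) + 1)


def score(c, text):
    """Unnormalized posterior: prior times the product of word likelihoods."""
    s = prior(c)
    for w in text.split():
        s *= likelihood(w, c)
    return s


scores = {c: score(c, test_doc) for c in doc_counts}
best = max(scores, key=scores.get)
print({c: float(s) for c, s in scores.items()}, best)
```

The scores are proportional to the posteriors, not equal to them, which is enough to pick the class: document 5 is assigned to class c.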

Page 34:

Summary: Naive Bayes is Not So Naive

• Robust to Irrelevant Features
  – Irrelevant Features cancel each other without affecting results
• Very good in domains with many equally important features
  – Decision Trees suffer from fragmentation in such cases – especially if little data
• Optimal if the independence assumptions hold: If assumed independence is correct, then it is the Bayes Optimal Classifier for problem
• A good dependable baseline for text classification
  – But we will see other classifiers that give better accuracy

Page 35:

Text Classification and Naïve Bayes

Multinomial Naïve Bayes: A Worked Example

Page 36:

Text Classification and Naïve Bayes

Text Classification: Evaluation

Page 37:

The 2-by-2 contingency table

               correct   not correct
selected       tp        fp
not selected   fn        tn

Page 38:

Precision and recall

• Precision: % of selected items that are correct
• Recall: % of correct items that are selected

               correct   not correct
selected       tp        fp
not selected   fn        tn

Page 39:

A combined measure: F

• A combined measure that assesses the P/R tradeoff is F measure (weighted harmonic mean):

$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$$

• People usually use balanced F1 measure
  – i.e., with β = 1 (that is, α = ½): F = 2PR/(P+R)
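Precision, recall, and the F measure follow directly from the contingency-table counts. This is a minimal sketch: the function names and the counts (tp = 90, fp = 10, fn = 30) are made up for illustration and do not come from the slides.

```python
def precision(tp, fp):
    """Fraction of selected items that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of correct items that are selected."""
    return tp / (tp + fn)


def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta form)."""
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)


# Hypothetical counts, for illustration only.
tp, fp, fn = 90, 10, 30
p = precision(tp, fp)
r = recall(tp, fn)
f1 = f_measure(p, r)
print(p, r, f1)
```

With `beta=1.0` the function reduces to 2PR/(P+R). The two forms of the formula agree when α = 1/(β² + 1), so larger β puts more weight on recall.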

Page 40:

Confusion matrix c

• For each pair of classes <c_1, c_2> how many documents from c_1 were incorrectly assigned to c_2?
  – c_{3,2}: 90 wheat documents incorrectly assigned to poultry

Docs in test set   Assigned UK   Assigned poultry   Assigned wheat   Assigned coffee   Assigned interest   Assigned trade
True UK            95            1                  13               0                 1                   0
True poultry       0             1                  0                0                 0                   0
True wheat         10            90                 0                1                 0                   0
True coffee        0             0                  0                34                3                   7
True interest      -             1                  2                13                26                  5
True trade         0             0                  2                14                5                   10

Page 41:

Per class evaluation measures

Sec. 15.2.4

Recall: Fraction of docs in class i classified correctly:

$$\frac{c_{ii}}{\sum_{j} c_{ij}}$$

Precision: Fraction of docs assigned class i that are actually about class i:

$$\frac{c_{ii}}{\sum_{j} c_{ji}}$$

Accuracy: (1 - error rate) Fraction of docs classified correctly:

$$\frac{\sum_{i} c_{ii}}{\sum_{j} \sum_{i} c_{ij}}$$
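The per-class measures read rows and columns of the confusion matrix, where entry c[i][j] counts documents whose true class is i and assigned class is j. The sketch below uses a small made-up 3-class matrix rather than the six-class table from the previous slide, since that table has one missing entry; the function names are hypothetical.

```python
def class_recall(c, i):
    """c[i][i] divided by the row sum: docs truly in class i."""
    return c[i][i] / sum(c[i])


def class_precision(c, i):
    """c[i][i] divided by the column sum: docs assigned to class i."""
    return c[i][i] / sum(row[i] for row in c)


def accuracy(c):
    """Sum of the diagonal divided by the sum of all entries."""
    correct = sum(c[i][i] for i in range(len(c)))
    total = sum(sum(row) for row in c)
    return correct / total


# Hypothetical confusion matrix (rows: true class, columns: assigned class).
matrix = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 3, 7],
]
print(class_recall(matrix, 0), class_precision(matrix, 1), accuracy(matrix))
```

Recall for a class only looks along its row and precision only down its column, so the two can differ sharply when one class attracts many wrong assignments.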

Page 42:

Micro- vs. Macro-Averaging

Sec. 15.2.4

• If we have more than one class, how do we combine multiple performance measures into one quantity?
• Macroaveraging: Compute performance for each class, then average. Average on classes
• Microaveraging: Collect decisions for each instance from all classes, compute contingency table, evaluate. Average on instances

Page 43:

Micro- vs. Macro-Averaging: Example

Sec. 15.2.4

Class 1
                  Truth: yes   Truth: no
Classifier: yes   10           10
Classifier: no    10           970

Class 2
                  Truth: yes   Truth: no
Classifier: yes   90           10
Classifier: no    10           890

Micro Ave. Table
                  Truth: yes   Truth: no
Classifier: yes   100          20
Classifier: no    20           1860

• Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
• Microaveraged precision: 100/120 = .83
• Microaveraged score is dominated by score on common classes
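The two averages in this example can be recomputed from the per-class tables. The sketch below uses the counts from the Class 1 and Class 2 tables on this slide; the dictionary layout and variable names are illustrative assumptions.

```python
# Contingency counts taken from the two per-class tables.
tables = {
    "Class 1": {"tp": 10, "fp": 10, "fn": 10, "tn": 970},
    "Class 2": {"tp": 90, "fp": 10, "fn": 10, "tn": 890},
}

# Macroaveraging: precision per class, then the mean over classes.
per_class = [t["tp"] / (t["tp"] + t["fp"]) for t in tables.values()]
macro_precision = sum(per_class) / len(per_class)

# Microaveraging: pool the counts into one table, then compute precision once.
pooled_tp = sum(t["tp"] for t in tables.values())
pooled_fp = sum(t["fp"] for t in tables.values())
micro_precision = pooled_tp / (pooled_tp + pooled_fp)

print(round(macro_precision, 2), round(micro_precision, 2))
```

Class 2 contributes 100 of the 120 "yes" decisions in the pooled table, which is why the microaverage sits much closer to its precision of 0.9.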

Page 44:

Development Test Sets and Cross-validation

• Metric: P/R/F1 or Accuracy
• Unseen test set
  – avoid overfitting (‘tuning to the test set’)
  – more conservative estimate of performance
• Cross-validation over multiple splits
  – Handle sampling errors from different datasets
  – Pool results over each split
  – Compute pooled dev set performance

[Figure: the data divided into Training Set, Development Test Set, and Test Set; for cross-validation, several splits in which the Dev Test portion occupies a different part of the Training Set, with a separate Test Set]
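Pooling dev-set results over several splits can be sketched with the standard library alone. Everything named here is hypothetical: `k_fold_splits` uses contiguous folds without shuffling, and `train_majority` is a stand-in "classifier" that always predicts the most frequent training label, used only so the example runs. The held-out test set is assumed to have been set aside before any of this.

```python
from collections import Counter


def k_fold_splits(n_items, k):
    """Yield (train_indices, dev_indices) for k contiguous folds."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        dev = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, dev
        start += size


def pooled_dev_accuracy(docs, labels, train_fn, k):
    """Train on each split, pool correct dev decisions, and report one accuracy."""
    correct = 0
    for train_idx, dev_idx in k_fold_splits(len(docs), k):
        model = train_fn([docs[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(model(docs[i]) == labels[i] for i in dev_idx)
    return correct / len(docs)


def train_majority(train_docs, train_labels):
    """Placeholder learner: always predict the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda doc: majority


# Hypothetical data: ten placeholder documents, seven "pos" and three "neg".
docs = ["doc%d" % i for i in range(10)]
labels = ["pos"] * 7 + ["neg"] * 3
print(pooled_dev_accuracy(docs, labels, train_majority, k=5))
```

In practice the documents would usually be shuffled before splitting, and `train_majority` would be replaced by a real learner such as a Naïve Bayes trainer.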

Page 45:

Text Classification and Naïve Bayes

Text Classification: Evaluation