复旦大学大数据学院 School of Data Science, Fudan University DATA130006 Text Management and Analysis Text Classification and Naïve Bayes 魏忠钰 October 11th, 2017 Adapted from UIUC CS410 and Stanford CS124U

Jun 10, 2020

Transcript
Page 1

复旦大学大数据学院 School of Data Science, Fudan University

DATA130006 Text Management and Analysis

Text Classification and Naïve Bayes

魏忠钰

October 11th, 2017

Adapted from UIUC CS410 and Stanford CS124U

Page 2

Outline

§ The Task of Text Classification

Page 3

Is this spam?

Page 4

Who wrote which Federalist papers?

§ 1787-8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.
§ Authorship of 12 of the letters in dispute
§ 1963: solved by Mosteller and Wallace using Bayesian methods

James Madison, Alexander Hamilton

Page 5

Male or female author?

1. By 1925 present-day Vietnam was divided into three parts under French colonial rule. The southern region embracing Saigon and the Mekong delta was the colony of Cochin-China; the central area with its imperial capital at Hue was the protectorate of Annam…

2. Clara never failed to be astonished by the extraordinary felicity of her own name. She found it hard to trust herself to the mercy of fate, which had managed over the years to convert her greatest shame into one of her greatest assets…

Page 6

Positive or negative movie review?

§ unbelievably disappointing
§ Full of fantastic characters and richly applied satire, and some great plot twists
§ this is the greatest screwball comedy ever filmed
§ It was pathetic. The worst part about it was the boxing scenes.

Page 7

What is the subject of this article?

MEDLINE Article → ? → MeSH Subject Category Hierarchy

• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …

Page 8

Text Categorization

§ Given the following:
  § A set of predefined categories, possibly forming a hierarchy, and often
  § A training set of labeled text objects
§ Task: Classify a text object into one or more of the categories

[Figure: text objects and training data (known categories) are fed to a categorization system, which produces categorization results over categories such as Sports, Business, Education, Science, …]

Page 9

Examples of Text Categorization

§ Text objects can vary (e.g., documents, passages, or collections of text)
§ Categories can also vary
  § “Internal” categories that characterize a text object (e.g., topical categories, sentiment categories)
  § “External” categories that characterize an entity associated with the text object (e.g., author attribution or any other meaningful categories associated with text data)
§ Some examples of applications
  § News categorization, literature article categorization (e.g., MeSH annotations)
  § Spam email detection/filtering
  § Sentiment categorization of product reviews or tweets
  § Automatic email sorting/routing
  § Author attribution

Page 10

Variants of Problem Formulation

§ Binary categorization: Only two categories
  § Retrieval: {relevant-doc, non-relevant-doc}
  § Spam filtering: {spam, non-spam}
  § Opinion: {positive, negative}
§ K-category categorization: More than two categories
  § Topic categorization: {sports, science, travel, business, …}
  § Email routing: {folder1, folder2, folder3, …}
§ Hierarchical categorization: Categories form a hierarchy
§ Joint categorization: Multiple related categorization tasks done in a joint manner

Page 11

Text Classification: definition

§ Input:
  § a document $d$
  § a fixed set of classes $C = \{c_1, c_2, \ldots, c_J\}$
§ Output: a predicted class $c \in C$

Page 12

Classification Methods: Hand-coded rules

§ Rules based on combinations of words or other features
  § spam: black-list-address OR (“dollars” AND “have been selected”)
§ Accuracy can be high
  § If rules are carefully refined by an expert
§ But building and maintaining these rules is expensive
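The spam rule above can be written directly as code. This is only a minimal sketch: the function name `is_spam`, its parameters, and the example addresses are illustrative assumptions, not something the slides define.

```python
def is_spam(sender: str, body: str, blacklist: set[str]) -> bool:
    """Hand-coded rule: black-list-address OR ("dollars" AND "have been selected")."""
    text = body.lower()
    on_blacklist = sender.lower() in blacklist
    has_phrases = "dollars" in text and "have been selected" in text
    return on_blacklist or has_phrases


# Hypothetical inputs for illustration only.
blacklist = {"promo@example.com"}
print(is_spam("promo@example.com", "Hello", blacklist))
print(is_spam("friend@example.com", "You have been selected to win dollars", blacklist))
print(is_spam("friend@example.com", "Lunch tomorrow?", blacklist))
```

Every new spam pattern needs another hand-written condition, which is why maintaining such rules gets expensive.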

Page 13

Classification Methods: Supervised Machine Learning

§ Input:
  § a document $d$
  § a fixed set of classes $C = \{c_1, c_2, \ldots, c_J\}$
  § A training set of $m$ hand-labeled documents $(d_1, c_1), \ldots, (d_m, c_m)$
§ Output:
  § a learned classifier $\gamma: d \to c$

Page 14

Classification Methods: Supervised Machine Learning

§ Any kind of classifier
  § Naïve Bayes
  § Logistic regression
  § Support-vector machines
  § k-Nearest Neighbors
  § …

Page 15

Outline

§ The Task of Text Classification
§ Naïve Bayes (I)

Page 16

Naïve Bayes Intuition

§ Simple (“naïve”) classification method based on Bayes rule
§ Relies on very simple representation of document
  § Bag of words

Page 17

The bag of words representation

I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.

$\gamma(\text{document}) = c$


Page 19

The bag of words representation: using a subset of words

Only the selected words are kept; every other word of the review is masked out:

love … sweet … satirical … great … fun … whimsical … romantic … laughing … recommend … several … happy … again …

$\gamma(\text{document}) = c$

Page 20

The bag of words representation

$\gamma(\text{word counts}) = c$

Word       Count
great      2
love       2
recommend  1
laugh      1
happy      1
...        ...
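A bag of words is just a table of word counts with word order discarded. The sketch below builds one with Python's `collections.Counter`; the tokenizer (lowercasing plus a simple regular expression) and the name `bag_of_words` are assumptions chosen for illustration.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# First two sentences of the review shown on the earlier slide.
review = ("I love this movie! It's sweet, but with satirical humor. "
          "The dialogue is great and the adventure scenes are fun")
bow = bag_of_words(review)
print(bow["the"], bow["love"], bow["boring"])
```

A `Counter` returns 0 for words it has never seen, which is convenient when looking up test-document words later.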

Page 21

Bag of words for document classification

Test document: parser language label translation … → ?

Class               Characteristic words
Machine Learning    learning training algorithm shrinkage network ...
NLP                 parser tag training translation language ...
Garbage Collection  garbage collection memory optimization region ...
Planning            planning temporal reasoning plan language ...
GUI                 ...

Page 22

Outline

§ The Task of Text Classification
§ Naïve Bayes (I)
§ Formalizing the Naïve Bayes Classifier

Page 23

Bayes’ Rule Applied to Documents and Classes

§ For a document $d$ and a class $c$

$$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}$$

Page 24

Naïve Bayes Classifier (I)

$$
\begin{aligned}
c_{MAP} &= \operatorname*{argmax}_{c \in C} P(c \mid d) && \text{MAP is “maximum a posteriori” = most likely class} \\
&= \operatorname*{argmax}_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)} && \text{Bayes rule} \\
&= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c) && \text{dropping the denominator}
\end{aligned}
$$

Page 25

Naïve Bayes Classifier (II)

$$
\begin{aligned}
c_{MAP} &= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c) \\
&= \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c) && \text{document } d \text{ represented as features } x_1..x_n
\end{aligned}
$$

Page 26

Naïve Bayes Classifier (III)

$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$

§ $P(c)$: How often does this class occur?
  § We can just count the relative frequencies in a corpus
§ $P(x_1, x_2, \ldots, x_n \mid c)$: $O(|X|^n \cdot |C|)$ parameters
  § Could only be estimated if a very, very large number of training examples was available.

Page 27

Multinomial Naïve Bayes Independence Assumptions

$$P(x_1, x_2, \ldots, x_n \mid c)$$

§ Bag of Words assumption: Assume position doesn’t matter
§ Conditional Independence: Assume the feature probabilities $P(x_i \mid c_j)$ are independent given the class $c$.

$$P(x_1, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdot P(x_3 \mid c) \cdot \ldots \cdot P(x_n \mid c)$$

Page 28

Multinomial Naïve Bayes Classifier

$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$

$$c_{NB} = \operatorname*{argmax}_{c \in C} P(c) \prod_{x \in X} P(x \mid c)$$

Page 29

Applying Multinomial Naive Bayes Classifiers to Text Classification

positions ← all word positions in test document

$$c_{NB} = \operatorname*{argmax}_{c_j \in C} P(c_j) \prod_{i \in positions} P(x_i \mid c_j)$$

Page 30

Outline

§ The Task of Text Classification
§ Naïve Bayes (I)
§ Formalizing the Naïve Bayes Classifier
§ Naïve Bayes: Learning

Page 31

Learning the Multinomial Naïve Bayes Model

§ First attempt: maximum likelihood estimates
  § simply use the frequencies in the data

$$\hat{P}(c_j) = \frac{\mathrm{doccount}(C = c_j)}{N_{doc}}$$

$$\hat{P}(w_i \mid c_j) = \frac{\mathrm{count}(w_i, c_j)}{\sum_{w \in V} \mathrm{count}(w, c_j)}$$

Page 32

Parameter estimation

$$\hat{P}(w_i \mid c_j) = \frac{\mathrm{count}(w_i, c_j)}{\sum_{w \in V} \mathrm{count}(w, c_j)}$$

This is the fraction of times word $w_i$ appears among all words in documents of topic $c_j$.

§ Create mega-document for topic $j$ by concatenating all docs in this topic
  § Use frequency of $w$ in mega-document
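The maximum likelihood estimates are plain relative frequencies, so they can be computed with counters. The following is a minimal sketch; the function name `mle_estimates` and the three toy documents are hypothetical and only serve to illustrate the formulas.

```python
from collections import Counter


def mle_estimates(docs):
    """docs: list of (tokens, label) pairs. Returns (priors, likelihoods)."""
    doc_counts = Counter(label for _, label in docs)
    word_counts = {}
    for tokens, label in docs:
        # Per-class "mega-document": all tokens of the class pooled together.
        word_counts.setdefault(label, Counter()).update(tokens)

    priors = {c: n / len(docs) for c, n in doc_counts.items()}
    likelihoods = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        likelihoods[c] = {w: n / total for w, n in counts.items()}
    return priors, likelihoods


# Hypothetical toy corpus.
docs = [
    ("great fun great".split(), "pos"),
    ("fun plot".split(), "pos"),
    ("boring plot".split(), "neg"),
]
priors, likelihoods = mle_estimates(docs)
print(priors["pos"], likelihoods["pos"]["great"])
# A word never seen with a class has no entry, i.e. an estimated probability of zero.
print(likelihoods["pos"].get("fantastic", 0.0))
```

Because unseen word and class pairs get probability zero, a single such word wipes out the whole product for that class.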

Page 33

Problem with Maximum Likelihood

§ What if we have seen no training documents with the word fantastic and classified in the topic positive (thumbs-up)?

$$\hat{P}(\text{“fantastic”} \mid \text{positive}) = \frac{\mathrm{count}(\text{“fantastic”}, \text{positive})}{\sum_{w \in V} \mathrm{count}(w, \text{positive})} = 0$$

§ Zero probabilities cannot be conditioned away, no matter the other evidence!

$$c_{MAP} = \operatorname*{argmax}_{c} \hat{P}(c) \prod_{i} \hat{P}(x_i \mid c)$$

Page 34

Laplace (add-1) smoothing for Naïve Bayes

Maximum likelihood estimate:

$$\hat{P}(w_i \mid c) = \frac{\mathrm{count}(w_i, c)}{\sum_{w \in V} \mathrm{count}(w, c)}$$

Add-1 estimate:

$$\hat{P}(w_i \mid c) = \frac{\mathrm{count}(w_i, c) + 1}{\sum_{w \in V} \left(\mathrm{count}(w, c) + 1\right)} = \frac{\mathrm{count}(w_i, c) + 1}{\left(\sum_{w \in V} \mathrm{count}(w, c)\right) + |V|}$$

Page 35

Multinomial Naïve Bayes: Learning

§ From training corpus, extract Vocabulary
§ Calculate $P(c_j)$ terms
  § For each $c_j$ in $C$ do
    docs_j ← all docs with class = $c_j$
    $P(c_j) \leftarrow \dfrac{|docs_j|}{|\text{total \# documents}|}$
§ Calculate $P(w_k \mid c_j)$ terms
  § Text_j ← single doc containing all docs_j
  § For each word $w_k$ in Vocabulary
    $n_k$ ← # of occurrences of $w_k$ in Text_j
    $P(w_k \mid c_j) \leftarrow \dfrac{n_k + \alpha}{n + \alpha\,|Vocabulary|}$, where $n$ is the total number of word occurrences in Text_j
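The learning procedure can be sketched in a few lines. This is an illustrative implementation under stated assumptions: the name `train_multinomial_nb`, the `(tokens, label)` input format, and the toy corpus are not from the slides, and `n` is taken to be the total number of tokens in the class mega-document.

```python
from collections import Counter


def train_multinomial_nb(docs, alpha=1.0):
    """docs: list of (tokens, label). Returns (vocabulary, priors, cond_prob)."""
    vocabulary = {w for tokens, _ in docs for w in tokens}
    classes = {label for _, label in docs}
    priors, cond_prob = {}, {}
    for c in classes:
        docs_c = [tokens for tokens, label in docs if label == c]
        priors[c] = len(docs_c) / len(docs)
        # Text_j: one mega-document holding every token of class c.
        text_c = Counter(w for tokens in docs_c for w in tokens)
        n = sum(text_c.values())
        cond_prob[c] = {
            w: (text_c[w] + alpha) / (n + alpha * len(vocabulary))
            for w in vocabulary
        }
    return vocabulary, priors, cond_prob


# Hypothetical toy corpus.
docs = [
    ("great fun great".split(), "pos"),
    ("fun plot".split(), "pos"),
    ("boring plot".split(), "neg"),
]
vocabulary, priors, cond_prob = train_multinomial_nb(docs)
print(cond_prob["pos"]["great"], cond_prob["pos"]["boring"])
```

With `alpha=1.0` this is add-1 smoothing: "boring" never occurs in a positive document, yet it still receives a small nonzero probability.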

Page 36

Laplace (add-1) smoothing: unknown words

Add one extra word to the vocabulary, the “unknown word” $w_u$. Since $\mathrm{count}(w_u, c) = 0$:

$$\hat{P}(w_u \mid c) = \frac{\mathrm{count}(w_u, c) + 1}{\left(\sum_{w \in V} \mathrm{count}(w, c)\right) + |V| + 1} = \frac{1}{\left(\sum_{w \in V} \mathrm{count}(w, c)\right) + |V| + 1}$$

Page 37

Outline

§ The Task of Text Classification
§ Naïve Bayes (I)
§ Formalizing the Naïve Bayes Classifier
§ Naïve Bayes: Learning
§ Naïve Bayes: Relationship to Language Modeling

Page 38

Generative Model for Multinomial Naïve Bayes

c=China

X1=Shanghai X2=and X3=Shenzhen X4=issue X5=bonds

Page 39

Naïve Bayes and Language Modeling

§ Naïve Bayes classifiers can use any sort of feature
  § URL, email address, dictionaries, network features
§ But if, as in the previous slides
  § We use only word features
  § we use all of the words in the text (not a subset)
§ Then
  § Naïve Bayes has an important similarity to language modeling.

Page 40

Each class = a unigram language model

§ Assigning each word: P(word | c)
§ Assigning each sentence: P(s | c) = Π P(word | c)

Class pos

Word   P(word | pos)
I      0.1
love   0.1
this   0.01
fun    0.05
film   0.1

s = “I love this fun film”: 0.1 × 0.1 × 0.01 × 0.05 × 0.1

P(s | pos) = 0.0000005

Page 41

Naïve Bayes as a Language Model

§ Which class assigns the higher probability to s?

Word   Model pos   Model neg
I      0.1         0.2
love   0.1         0.001
this   0.01        0.01
fun    0.05        0.005
film   0.1         0.1

s = “I love this fun film”

P(s | pos) > P(s | neg)
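The comparison can be checked numerically by multiplying the per-word probabilities from the two models. This is a small sketch; the dictionary layout and the helper name `sentence_probability` are assumptions made for illustration.

```python
import math

# Unigram probabilities taken from the pos and neg models on the slide.
pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}


def sentence_probability(sentence: str, model: dict) -> float:
    """P(s | c) under a unigram model: the product of the word probabilities."""
    return math.prod(model[word] for word in sentence.split())


s = "I love this fun film"
p_pos = sentence_probability(s, pos)
p_neg = sentence_probability(s, neg)
print(p_pos, p_neg, p_pos > p_neg)
```

The positive model assigns the sentence a probability of about 5 × 10⁻⁷, against about 10⁻⁹ for the negative model, so the sentence looks far more like positive text.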

Page 42

Outline

§ The Task of Text Classification
§ Naïve Bayes (I)
§ Formalizing the Naïve Bayes Classifier
§ Naïve Bayes: Learning
§ Naïve Bayes: Relationship to Language Modeling
§ Multinomial Naïve Bayes: A Worked Example

Page 43

In Class Quiz

$$\hat{P}(c) = \frac{N_c}{N} \qquad \hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\mathrm{count}(c) + |V|}$$

          Doc  Words                                 Class
Training  1    Chinese Beijing Chinese Beijing       c
          2    Chinese Chinese Shanghai Shanghai     c
          3    Chinese Macao                         c
          4    Tokyo Japan Chinese                   j
Test      5    Chinese Chinese Chinese Tokyo Japan   ?

Priors:
P(c) = 3/4
P(j) = 1/4

Conditional Probabilities:
P(Chinese | c) = (5+1)/(10+6) = 6/16 = 3/8
P(Tokyo | c) = (0+1)/(10+6) = 1/16
P(Japan | c) = (0+1)/(10+6) = 1/16
P(Chinese | j) = (1+1)/(3+6) = 2/9
P(Tokyo | j) = (1+1)/(3+6) = 2/9
P(Japan | j) = (1+1)/(3+6) = 2/9

Choosing a class:
P(c | d5) ∝ 3/4 × (3/8)³ × 1/16 × 1/16 ≈ 0.00015
P(j | d5) ∝ 1/4 × (2/9)³ × 2/9 × 2/9 ≈ 0.00014

The score for class c is higher, so d5 is assigned to c.
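The quiz can be reproduced with exact fractions, which makes it easy to check each intermediate value. This is a sketch of add-1 multinomial Naïve Bayes on the four training documents above; the helper names `cond_prob` and `score` are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction

train = [
    ("Chinese Beijing Chinese Beijing", "c"),
    ("Chinese Chinese Shanghai Shanghai", "c"),
    ("Chinese Macao", "c"),
    ("Tokyo Japan Chinese", "j"),
]
test_doc = "Chinese Chinese Chinese Tokyo Japan".split()

vocab = {w for text, _ in train for w in text.split()}
n_docs = Counter(label for _, label in train)
counts = {}
for text, label in train:
    counts.setdefault(label, Counter()).update(text.split())


def cond_prob(word, label):
    """Add-1 estimate: (count(w, c) + 1) / (count(c) + |V|)."""
    return Fraction(counts[label][word] + 1,
                    sum(counts[label].values()) + len(vocab))


def score(label):
    """Unnormalized posterior: prior times the product over test positions."""
    result = Fraction(n_docs[label], len(train))
    for word in test_doc:
        result *= cond_prob(word, label)
    return result


scores = {label: score(label) for label in n_docs}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Using `Fraction` avoids rounding, so the two scores can be compared exactly; the margin between the classes is small here because the test document also contains "Tokyo" and "Japan".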

Page 44

Naïve Bayes in Spam Filtering

§ SpamAssassin Features:
  § Mentions Generic Viagra
  § Online Pharmacy
  § Mentions millions of (dollar) ((dollar) NN,NNN,NNN.NN)
  § Phrase: impress ... girl
  § From: starts with many numbers
  § Subject is all capitals
  § HTML has a low ratio of text to image area
  § One hundred percent guaranteed
  § Claims you can be removed from the list
  § http://spamassassin.apache.org/tests_3_3_x.html

Page 45

Outline

§ The Task of Text Classification
§ Naïve Bayes (I)
§ Formalizing the Naïve Bayes Classifier
§ Naïve Bayes: Learning
§ Naïve Bayes: Relationship to Language Modeling
§ Evaluation

Page 46

General Evaluation Methodology

§ Have humans create a test collection where every document is tagged with the desired categories (“ground truth”)
§ Generate categorization results using a system on the test collection
§ Compare the system categorization decisions with the human-made categorization decisions and quantify their similarity (or equivalently difference)
  § The higher the similarity is, the better the results are
  § Similarity can be measured from different perspectives to understand the quality of results in detail (e.g., which category performs better?)

Page 47

The 2-by-2 result table

Rows: ground truth (human). Columns: prediction (system).

            System (“y”)           System (“n”)
Human (+)   True Positives (TP)    False Negatives (FN)
Human (-)   False Positives (FP)   True Negatives (TN)

$$\mathrm{accuracy} = \frac{tp + tn}{tp + tn + fp + fn}$$

Page 48

An example of why accuracy alone is not enough

§ How about earthquake detection?

Rows: prediction. Columns: ground truth.

        E     NotE
E       0     0
NotE    10    9990

§ Accuracy can be as high as 99.90%
§ But…

Page 49

Problems with Classification Accuracy

§ Some decision errors are more serious than others
  § It may be more important to get the decisions right on some documents than others
  § It may be more important to get the decisions right on some categories than others
  § E.g., spam filtering: missing a legitimate email costs more than letting a spam go
§ Problem with imbalanced test set
  § Skewed test set: 98% in category 1; 2% in category 2
  § Strong baseline: put all instances in category 1 → 98% accuracy!

Page 50

Per-Document Evaluation

      c1     c2     c3     …   ck
d1    y(+)   y(-)   n(+)       n(+)
d2    y(-)   n(+)   y(+)       n(+)
d3    n(+)   n(+)   y(+)       n(+)

How good are the decisions on $d_i$?

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

When the system says “yes,” how many are correct?

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

Does the doc have all the categories it should have?

            System (“y”)           System (“n”)
Human (+)   True Positives (TP)    False Negatives (FN)
Human (-)   False Positives (FP)   True Negatives (TN)

Page 51

Per-Class Evaluation

      c1     c2     c3     …   ck
d1    y(+)   y(-)   n(+)       n(+)
d2    y(-)   n(+)   y(+)       n(+)
d3    n(+)   n(+)   y(+)       n(+)

How good are the decisions on $c_i$?

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

When the system says “yes,” how many are correct?

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

Does the doc have all the categories it should have?

Page 52

Re-visit Earthquake

Rows: prediction. Columns: ground truth.

        E         NotE
E       0 (tp)    0 (fp)
NotE    10 (fn)   9990 (tn)

$$\mathrm{Precision} = \frac{tp}{tp + fp} = 0 \quad \text{(strictly } 0/0\text{, taken as 0 by convention)}$$

$$\mathrm{Recall} = \frac{tp}{tp + fn} = 0$$

Page 53

Re-visit Earthquake - 2

Rows: prediction. Columns: ground truth.

        E         NotE
E       10 (tp)   9990 (fp)
NotE    0 (fn)    0 (tn)

$$\mathrm{Precision} = \frac{tp}{tp + fp} = \frac{10}{10 + 9990} = 0.001$$

$$\mathrm{Recall} = \frac{tp}{tp + fn} = 1$$
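Both earthquake detectors can be evaluated with the same three formulas. The sketch below is illustrative: the function name `evaluate` and the convention of returning 0 when a denominator is zero are assumptions, not something the slides prescribe.

```python
def evaluate(tp: int, fp: int, fn: int, tn: int):
    """Return (accuracy, precision, recall) for one 2-by-2 table."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Assumed convention: an undefined 0/0 ratio is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Detector that never predicts an earthquake.
never_predict = evaluate(tp=0, fp=0, fn=10, tn=9990)
# Detector that always predicts an earthquake.
always_predict = evaluate(tp=10, fp=9990, fn=0, tn=0)
print(never_predict)
print(always_predict)
```

The first detector has high accuracy but finds nothing; the second finds every earthquake but is almost always wrong when it says "yes". Neither number alone describes a useful system.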

Page 54

A combined measure: F

§ A combined measure that assesses the P/R tradeoff is F measure (weighted harmonic mean):

$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$$

§ The harmonic mean is a very conservative average;
§ People usually use balanced F1 measure, i.e., with $\beta = 1$ (that is, $\alpha = 1/2$): $F = 2PR/(P+R)$
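The two forms of the F measure can be compared numerically. In this sketch the function names are hypothetical, and the relation $\alpha = 1/(\beta^2 + 1)$ used to link the two forms is the standard one implied by the formula above.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Beta form: (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # assumed convention when both are zero
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


def f_measure_alpha(precision: float, recall: float, alpha: float = 0.5) -> float:
    """Alpha form: weighted harmonic mean; needs nonzero precision and recall."""
    return 1.0 / (alpha / precision + (1 - alpha) / recall)


# beta = 2 corresponds to alpha = 1 / (2^2 + 1) = 0.2.
print(f_measure(0.5, 1.0, beta=2.0), f_measure_alpha(0.5, 1.0, alpha=0.2))
# Balanced F1 for precision 0.001 and recall 1.
print(f_measure(0.001, 1.0))
```

Because the harmonic mean is dominated by the smaller value, a detector with recall 1 but precision 0.001 still gets an F1 close to 0.002.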

Page 55

Evaluation: Classic Reuters-21578 Data Set

§ Most (over)used data set, 21,578 docs (each 90 types, 200 tokens)
§ 9603 training, 3299 test articles
§ 118 categories
  § An article can be in more than one category
  § Learn 118 binary category distinctions
§ Average document (with at least one category) has 1.24 classes
§ Only about 10 out of 118 categories are large

Common categories (#train, #test)

• Earn (2877, 1087)
• Acquisitions (1650, 179)
• Money-fx (538, 179)
• Grain (433, 149)
• Crude (389, 189)
• Trade (369, 119)
• Interest (347, 131)
• Ship (197, 89)
• Wheat (212, 71)
• Corn (182, 56)

Page 56

Reuters Text Categorization data set (Reuters-21578) document

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">

<DATE> 2-MAR-1987 16:51:43.42</DATE>

<TOPICS><D>livestock</D><D>hog</D></TOPICS>

<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>

<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC.

Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.

A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter

&#3;</BODY></TEXT></REUTERS>

Page 57

Confusion matrix c

§ For each pair of classes <c1, c2> how many documents from c1 were incorrectly assigned to c2?
  § c3,2: 90 wheat documents incorrectly assigned to poultry

Docs in test set   Assigned UK   Assigned poultry   Assigned wheat   Assigned coffee   Assigned interest   Assigned trade
True UK            95            1                  13               0                 1                   0
True poultry       0             1                  0                0                 0                   0
True wheat         10            90                 0                1                 0                   0
True coffee        0             0                  0                34                3                   7
True interest      -             1                  2                13                26                  5
True trade         0             0                  2                14                5                   10
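Per-class precision and recall can be read off a confusion matrix: the diagonal entry divided by the column total gives precision, and divided by the row total gives recall. The sketch below uses the matrix above; treating the "-" entry as 0 is an assumption, and the name `per_class_metrics` is illustrative.

```python
labels = ["UK", "poultry", "wheat", "coffee", "interest", "trade"]
# Rows: true class. Columns: assigned class.
confusion = [
    [95, 1, 13, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [10, 90, 0, 1, 0, 0],
    [0, 0, 0, 34, 3, 7],
    [0, 1, 2, 13, 26, 5],   # the "-" entry is treated as 0 (assumption)
    [0, 0, 2, 14, 5, 10],
]


def per_class_metrics(matrix, names):
    """Return {class: (precision, recall)} from a square confusion matrix."""
    metrics = {}
    for i, name in enumerate(names):
        tp = matrix[i][i]
        row_total = sum(matrix[i])                  # all docs truly in class i
        col_total = sum(row[i] for row in matrix)   # all docs assigned to class i
        precision = tp / col_total if col_total else 0.0
        recall = tp / row_total if row_total else 0.0
        metrics[name] = (precision, recall)
    return metrics


metrics = per_class_metrics(confusion, labels)
accuracy = sum(confusion[i][i] for i in range(len(labels))) / sum(map(sum, confusion))
for name, (p, r) in metrics.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The wheat class shows why per-class numbers matter: none of its documents land on the diagonal, so both its precision and recall are 0 even though overall accuracy is about 0.5.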

Page 58

Micro- vs. Macro-Averaging

§ If we have more than one class, how do we combine multiple performance measures into one quantity?
§ Macroaveraging: Compute performance for each class, then average.
§ Microaveraging: Collect decisions for all classes, compute contingency table, evaluate.

Page 59

(Macro) Average Over All the Categories

            c1     c2     c3     …   ck
d1          y(+)   y(-)   n(+)       n(+)
d2          y(-)   n(+)   y(+)       n(+)
d3          n(+)   n(+)   y(+)       n(+)
…
dN          …                        …

Precision   p1     p2     p3     …   pk    → aggregate → Overall Precision
Recall      r1     r2     r3     …   rk    → aggregate → Overall Recall
F-Measure   f1     f2     f3     …   fk    → aggregate → Overall F score

Page 60

Micro-Averaging of Precision and Recall

      c1     c2     c3     …   ck
d1    y(+)   y(-)   n(+)       n(+)
d2    y(-)   n(+)   y(+)       n(+)
d3    n(+)   n(+)   y(+)       n(+)
…
dN    …                        …

First pool all decisions, then compute precision and recall.

            System (“y”)           System (“n”)
Human (+)   True Positives (TP)    False Negatives (FN)
Human (-)   False Positives (FP)   True Negatives (TN)

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

Page 61

Micro- vs. Macro-Averaging: Example

Class 1
                  Truth: yes   Truth: no
Classifier: yes   10           10
Classifier: no    10           970

Class 2
                  Truth: yes   Truth: no
Classifier: yes   90           10
Classifier: no    10           890

Micro Ave. Table
                  Truth: yes   Truth: no
Classifier: yes   100          20
Classifier: no    20           1860

§ Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
§ Microaveraged precision: 100/120 = .83
§ Microaveraged score is dominated by score on common classes
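The two averages can be recomputed from the per-class tables. This is a minimal sketch; the dictionary representation of the tables and the helper name `precision` are assumptions made for illustration.

```python
# Per-class contingency tables from the example above.
class1 = {"tp": 10, "fp": 10, "fn": 10, "tn": 970}
class2 = {"tp": 90, "fp": 10, "fn": 10, "tn": 890}
tables = [class1, class2]


def precision(table: dict) -> float:
    return table["tp"] / (table["tp"] + table["fp"])


# Macro: compute precision per class, then average the per-class values.
macro_precision = sum(precision(t) for t in tables) / len(tables)

# Micro: pool the counts of all classes into one table, then compute precision.
pooled = {key: sum(t[key] for t in tables) for key in class1}
micro_precision = precision(pooled)

print(pooled)
print(round(macro_precision, 2), round(micro_precision, 2))
```

Macro-averaging weights every class equally, while micro-averaging weights every decision equally, so the larger class pulls the micro-averaged value toward its own precision.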

Page 62

Development Test Sets and Cross-validation

[Figure: the data split into Training Set | Development Test Set | Test Set]

§ Metric: P/R/F1 or Accuracy
§ Unseen test set
  § avoid overfitting (‘tuning to the test set’)
  § more conservative estimate of performance
§ Cross-validation over multiple splits
  § Handle sampling errors from different datasets
  § Pool results over each split
  § Compute pooled dev set performance

[Figure: cross-validation; the Dev Test portion rotates through the Training Set across splits, while the Test Set stays held out]
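Rotating the development portion through the training data can be sketched as a k-fold index generator. The function name `k_fold_splits` and the choice of contiguous folds are assumptions for illustration; the final test set is assumed to have been set aside beforehand.

```python
def k_fold_splits(n_items: int, k: int):
    """Yield (train_indices, dev_indices) for k contiguous, near-equal folds."""
    indices = list(range(n_items))
    # The first (n_items % k) folds get one extra item.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        dev = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, dev
        start += size


splits = list(k_fold_splits(n_items=10, k=3))
for train, dev in splits:
    print("train:", train, "dev:", dev)
```

Each item serves as development data exactly once, so the per-split decisions can be pooled into a single contingency table before computing the final dev-set metric.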

Page 63

Outline

§ The Task of Text Classification
§ Naïve Bayes (I)
§ Formalizing the Naïve Bayes Classifier
§ Naïve Bayes: Learning
§ Naïve Bayes: Relationship to Language Modeling
§ Precision, Recall, and the F measure
§ Text Classification: Evaluation
§ Text Classification: Practical Issues

Page 64

The Real World

§ Gee, I’m building a text classifier for real, now!
§ What should I do?

Page 65

No training data? Manually written rules

If (wheat or grain) and not (whole or bread) then Categorize as grain

§ Need careful crafting
  § Human tuning on development data
  § Time-consuming: 2 days per class

Page 66

Very little data?

§ Use Naïve Bayes
  § Naïve Bayes is a “high-bias” algorithm (Ng and Jordan 2002 NIPS)
§ Get more labeled data
  § Find clever ways to get humans to label data for you
§ Try semi-supervised training methods:
  § Bootstrapping, EM over unlabeled documents, …

Page 67

A reasonable amount of data?

§ Perfect for all the clever classifiers
  § SVM
  § Regularized Logistic Regression
§ You can even use user-interpretable decision trees
  § Users like to hack
  § Management likes quick fixes

Page 68

A huge amount of data?

§ Can achieve high accuracy!
§ At a cost:
  § SVMs (train time) or kNN (test time) can be too slow
  § Regularized logistic regression can be somewhat better
  § Neural networks benefit a lot from this!
§ So Naïve Bayes can come back into its own again!

Page 69

Accuracy as a function of data size

§ With enough data
  § Classifier may not matter

Page 70

Real-world systems generally combine:

§ Automatic classification
§ Manual review of uncertain/difficult/“new” cases

Page 71

Underflow Prevention: log space

§ Multiplying lots of probabilities can result in floating-point underflow.
§ Since log(xy) = log(x) + log(y)
  § Better to sum logs of probabilities instead of multiplying probabilities.
§ Class with highest un-normalized log probability score is still most probable.

$$c_{NB} = \operatorname*{argmax}_{c_j \in C} \left[ \log P(c_j) + \sum_{i \in positions} \log P(x_i \mid c_j) \right]$$

§ Model is now just max of sum of weights
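The underflow problem and the log-space fix can be seen on a long document. This sketch reuses the pos and neg unigram probabilities from the language-model slides; the equal priors of 0.5, the 500-token test document, and the name `classify_log_space` are assumptions made for illustration.

```python
import math

pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}
cond_probs = {"pos": pos, "neg": neg}
priors = {"pos": 0.5, "neg": 0.5}  # assumed equal priors


def classify_log_space(tokens, priors, cond_probs):
    """Return (best class, scores) using log P(c) + sum of log P(x_i | c)."""
    scores = {
        c: math.log(priors[c]) + sum(math.log(cond_probs[c][t]) for t in tokens)
        for c in priors
    }
    return max(scores, key=scores.get), scores


# A long document: the same five-word sentence repeated 100 times.
tokens = "I love this fun film".split() * 100

# Direct multiplication underflows to 0.0 for both classes.
direct = {c: priors[c] * math.prod(cond_probs[c][t] for t in tokens) for c in priors}
label, scores = classify_log_space(tokens, priors, cond_probs)
print(direct)
print(label, scores)
```

With direct products both classes score exactly 0.0 and cannot be ranked, while the log scores stay finite and still pick the more probable class.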

Page 72

How to improve performance

§ Domain-specific features and weights: very important in real performance
§ Sometimes need to collapse terms:
  § Part numbers, chemical formulas, …
  § But stemming generally doesn’t help (for classification)
§ Upweighting: Counting a word as if it occurred twice:
  § title words (Cohen & Singer 1996)
  § first sentence of each paragraph (Murata, 1999)
  § In sentences that contain title words (Ko et al, 2002)