复旦大学大数据学院 School of Data Science, Fudan University DATA130006 Text Management and Analysis Text Classification and Naïve Bayes 魏忠钰 October 11th, 2017 Adapted from UIUC CS410 and Stanford CS124U
Outline
§ The Task of Text Classification
Is this spam?
Who wrote which Federalist papers?
§ 1787–8: anonymous essays try to convince New York to ratify U.S. Constitution: Jay, Madison, Hamilton.
§ Authorship of 12 of the letters in dispute
§ 1963: solved by Mosteller and Wallace using Bayesian methods

James Madison / Alexander Hamilton
Male or female author?
1. By 1925 present-day Vietnam was divided into three parts under French colonial rule. The southern region embracing Saigon and the Mekong delta was the colony of Cochin-China; the central area with its imperial capital at Hue was the protectorate of Annam…
2. Clara never failed to be astonished by the extraordinary felicity of her own name. She found it hard to trust herself to the mercy of fate, which had managed over the years to convert her greatest shame into one of her greatest assets…
Positive or negative movie review?
§ unbelievably disappointing
§ Full of fantastic characters and richly applied satire, and some great plot twists
§ this is the greatest screwball comedy ever filmed
§ It was pathetic. The worst part about it was the boxing scenes.
What is the subject of this article?
• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …
[Figure: a MEDLINE article to be assigned categories from the MeSH Subject Category Hierarchy]
Text Categorization
§ Given the following:
§ A set of predefined categories, possibly forming a hierarchy
§ A training set of labeled text objects
§ Task: Classify a text object into one or more of the categories
[Figure: a Categorization System takes Text Objects plus Training data (known categories) and produces Categorization Results, e.g., Sports, Business, Education, Science, …]
Examples of Text Categorization
§ Text objects can vary (e.g., documents, passages, or collections of text)
§ Categories can also vary
§ “Internal” categories that characterize a text object (e.g., topical categories, sentiment categories)
§ “External” categories that characterize an entity associated with the text object (e.g., author attribution or any other meaningful categories associated with text data)
§ Some examples of applications
§ News categorization, literature article categorization (e.g., MeSH annotations)
§ Spam email detection/filtering
§ Sentiment categorization of product reviews or tweets
§ Automatic email sorting/routing
§ Author attribution
Variants of Problem Formulation
§ Binary categorization: only two categories
§ Retrieval: {relevant-doc, non-relevant-doc}
§ Spam filtering: {spam, non-spam}
§ Opinion: {positive, negative}
§ K-category categorization: more than two categories
§ Topic categorization: {sports, science, travel, business, …}
§ Email routing: {folder1, folder2, folder3, …}
§ Hierarchical categorization: categories form a hierarchy
§ Joint categorization: multiple related categorization tasks done in a joint manner
Text Classification: definition
§ Input:
§ a document d
§ a fixed set of classes C = {c1, c2, …, cJ}
§ Output: a predicted class c ∈ C
Classification Methods: Hand-coded rules
§ Rules based on combinations of words or other features
§ spam: black-list-address OR (“dollars” AND “have been selected”)
§ Accuracy can be high
§ If rules carefully refined by expert
§ But building and maintaining these rules is expensive
Classification Methods: Supervised Machine Learning
§ Input:
§ a document d
§ a fixed set of classes C = {c1, c2, …, cJ}
§ a training set of m hand-labeled documents (d1, c1), …, (dm, cm)
§ Output:
§ a learned classifier γ: d → c
Classification Methods: Supervised Machine Learning
§ Any kind of classifier
§ Naïve Bayes
§ Logistic regression
§ Support-vector machines
§ k-Nearest Neighbors
§ …
Outline
§ The Task of Text Classification
§ Naïve Bayes (I)
Naïve Bayes Intuition
§ Simple (“naïve”) classification method based on Bayes rule
§ Relies on a very simple representation of the document
§ Bag of words
The bag of words representation
I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.
γ(d) = c
The bag of words representation: using a subset of words
x love xxxx sweet xxxx satirical xxxx great xxxx fun xxxx whimsical xxxx romantic xxxx laughing xxxx recommend xxxx several xxxx happy xxxx again xxxx

γ(d) = c
The bag of words representation

γ(d) = c, where d is reduced to word counts:

great      2
love       2
recommend  1
laugh      1
happy      1
...        ...
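The bag-of-words reduction above can be sketched in a few lines (a minimal sketch; the punctuation-stripping tokenizer is a simplifying assumption, not a real tokenizer):

```python
from collections import Counter

# Bag of words: reduce a document to unordered word counts,
# discarding position and grammar.
def bag_of_words(text):
    tokens = [t.strip(".,!?'\"").lower() for t in text.split()]
    return Counter(t for t in tokens if t)

doc = "I love this movie! It's sweet. I would recommend it. I love it."
bag = bag_of_words(doc)
print(bag["love"])  # counts, not positions
```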
Bag of words for document classification
Test document: parser, language, label, translation, …

Which class does it belong to?

Machine Learning: learning, training, algorithm, shrinkage, network, ...
Planning: planning, temporal, reasoning, plan, language, ...
NLP: parser, tag, training, translation, language, ...
Garbage Collection: garbage, collection, memory, optimization, region, ...
GUI: …
Outline
§ The Task of Text Classification
§ Naïve Bayes (I)
§ Formalizing the Naïve Bayes Classifier
Bayes’ Rule Applied to Documents and Classes
§ For a document d and a class c:

P(c | d) = P(d | c) P(c) / P(d)
Naïve Bayes Classifier (I)
cMAP = argmaxc∈C P(c | d)              (MAP is “maximum a posteriori” = most likely class)
     = argmaxc∈C P(d | c) P(c) / P(d)  (Bayes rule)
     = argmaxc∈C P(d | c) P(c)         (dropping the denominator)
Naïve Bayes Classifier (II)
cMAP = argmaxc∈C P(d | c) P(c)
     = argmaxc∈C P(x1, x2, …, xn | c) P(c)   (document d represented as features x1..xn)
Naïve Bayes Classifier (III)
cMAP = argmaxc∈C P(x1, x2, …, xn | c) P(c)

§ P(c): how often does this class occur? We can just count the relative frequencies in a corpus.
§ P(x1, x2, …, xn | c): O(|X|^n · |C|) parameters; could only be estimated if a very, very large number of training examples was available.
Multinomial Naïve Bayes Independence Assumptions
P(x1, x2, …, xn | c)

§ Bag of Words assumption: assume position doesn’t matter
§ Conditional Independence: assume the feature probabilities P(xi | c) are independent given the class c:

P(x1, …, xn | c) = P(x1 | c) · P(x2 | c) · P(x3 | c) · … · P(xn | c)
Multinomial Naïve Bayes Classifier
cMAP = argmaxc∈C P(x1, x2, …, xn | c) P(c)

cNB = argmaxc∈C P(c) Πx∈X P(x | c)
Applying Multinomial Naive Bayes Classifiers to Text Classification
cNB = argmaxcj∈C P(cj) Πi∈positions P(xi | cj)

positions ← all word positions in test document
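As a minimal sketch of this decision rule (the priors and conditional probabilities below are made-up toy numbers, not estimates from any corpus):

```python
# Toy sketch of the multinomial Naive Bayes decision rule:
# c_NB = argmax_c P(c) * prod_i P(x_i | c)
priors = {"pos": 0.5, "neg": 0.5}
cond = {
    "pos": {"love": 0.10, "film": 0.10, "boring": 0.01},
    "neg": {"love": 0.01, "film": 0.10, "boring": 0.10},
}

def classify(words):
    best_class, best_score = None, -1.0
    for c in priors:
        score = priors[c]
        for w in words:
            score *= cond[c][w]  # one factor per word position
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify(["love", "film"]))    # -> pos
print(classify(["boring", "film"]))  # -> neg
```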
Outline
§ The Task of Text Classification
§ Naïve Bayes (I)
§ Formalizing the Naïve Bayes Classifier
§ Naïve Bayes: Learning
Learning the Multinomial Naïve Bayes Model

§ First attempt: maximum likelihood estimates
§ simply use the frequencies in the data

P̂(wi | cj) = count(wi, cj) / Σw∈V count(w, cj)

P̂(cj) = doccount(C = cj) / Ndoc
Parameter estimation

P̂(wi | cj) = count(wi, cj) / Σw∈V count(w, cj)
(the fraction of times word wi appears among all words in documents of topic cj)

§ Create a mega-document for topic j by concatenating all docs in this topic
§ Use the frequency of w in the mega-document
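The mega-document estimate can be sketched in a few lines (the tiny corpus below is made up for illustration):

```python
from collections import Counter

# Maximum-likelihood sketch: concatenate all docs of a topic into a
# "mega-document" and use raw relative frequencies (no smoothing yet).
docs = [("sports", "game win game score"), ("sports", "win team"),
        ("business", "stock market win")]

def mle_estimates(docs):
    mega = {}
    for label, text in docs:
        mega.setdefault(label, []).extend(text.split())
    probs = {}
    for label, words in mega.items():
        counts = Counter(words)
        total = len(words)  # all tokens in the topic's mega-document
        probs[label] = {w: n / total for w, n in counts.items()}
    return probs

p = mle_estimates(docs)
print(p["sports"]["game"])  # 2 of 6 sports tokens = 1/3
```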
Problem with Maximum Likelihood

§ What if we have seen no training documents with the word fantastic classified in the topic positive (thumbs-up)?

P̂(“fantastic” | positive) = count(“fantastic”, positive) / Σw∈V count(w, positive) = 0

§ Zero probabilities cannot be conditioned away, no matter the other evidence!

cMAP = argmaxc P̂(c) Πi P̂(xi | c)
Laplace (add-1) smoothing for Naïve Bayes

Unsmoothed MLE:
P̂(wi | c) = count(wi, c) / Σw∈V count(w, c)

Add-1 smoothed:
P̂(wi | c) = (count(wi, c) + 1) / Σw∈V (count(w, c) + 1)
          = (count(wi, c) + 1) / ( (Σw∈V count(w, c)) + |V| )
Multinomial Naïve Bayes: Learning

§ From training corpus, extract Vocabulary
§ Calculate P(cj) terms
§ For each cj in C do
  docsj ← all docs with class = cj
  P(cj) ← |docsj| / |total # documents|
§ Calculate P(wk | cj) terms
§ Textj ← single doc containing all docsj
§ For each word wk in Vocabulary
  nk ← # of occurrences of wk in Textj
  P(wk | cj) ← (nk + α) / (n + α|Vocabulary|)
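The training procedure above can be sketched directly (a minimal sketch; the whitespace tokenizer and the two-document corpus are simplifying assumptions):

```python
from collections import Counter

# Priors from document counts; word likelihoods from per-class
# "mega-documents" with add-alpha (Laplace when alpha=1) smoothing.
def train_nb(docs, alpha=1.0):
    vocab = {w for _, text in docs for w in text.split()}
    classes = {label for label, _ in docs}
    priors, cond = {}, {}
    for c in classes:
        class_texts = [text for label, text in docs if label == c]
        priors[c] = len(class_texts) / len(docs)           # P(c)
        mega = Counter(w for text in class_texts for w in text.split())
        n = sum(mega.values())                             # tokens in class c
        cond[c] = {w: (mega[w] + alpha) / (n + alpha * len(vocab))
                   for w in vocab}                         # P(w | c)
    return priors, cond

priors, cond = train_nb([("pos", "great great fun"), ("neg", "boring")])
print(priors["pos"])         # 0.5
print(cond["pos"]["great"])  # (2+1)/(3+3) = 0.5
```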
Laplace (add-1) smoothing: unknown words
P̂(wu | c) = count(wu,c)+1
count(w,cw∈V∑ )
#
$%%
&
'(( + V +1
Addoneextrawordtothevocabulary,the“unknownword”wu
=1
count(w,cw∈V∑ )
#
$%%
&
'(( + V +1
Outline
§ The Task of Text Classification
§ Naïve Bayes (I)
§ Formalizing the Naïve Bayes Classifier
§ Naïve Bayes: Learning
§ Naïve Bayes: Relationship to Language Modeling
Generative Model for Multinomial Naïve Bayes
c=China
X1=Shanghai X2=and X3=Shenzhen X4=issue X5=bonds
Naïve Bayes and Language Modeling
§ Naïve Bayes classifiers can use any sort of feature
§ URL, email address, dictionaries, network features
§ But if, as in the previous slides,
§ we use only word features
§ we use all of the words in the text (not a subset)
§ Then
§ Naïve Bayes has an important similarity to language modeling.

Each class = a unigram language model
§ Assigning each word: P(word | c)
§ Assigning each sentence: P(s | c) = Π P(word | c)
Class pos:
0.1   I
0.1   love
0.01  this
0.05  fun
0.1   film
…

I     love  this  fun   film
0.1   0.1   0.01  0.05  0.1

P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 0.0000005
Naïve Bayes as a Language Model

§ Which class assigns the higher probability to s?

Model pos          Model neg
0.1    I           0.2    I
0.1    love        0.001  love
0.01   this        0.01   this
0.05   fun         0.005  fun
0.1    film        0.1    film

       I     love   this  fun    film
pos:   0.1   0.1    0.01  0.05   0.1
neg:   0.2   0.001  0.01  0.005  0.1

P(s | pos) > P(s | neg)
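The comparison above can be checked directly with the two unigram models from the slide:

```python
from math import prod

# Each class is a unigram language model; score a sentence under both.
pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

s = "I love this fun film".split()
p_pos = prod(pos[w] for w in s)  # about 5e-07
p_neg = prod(neg[w] for w in s)  # about 1e-09
print(p_pos > p_neg)  # the positive model assigns higher probability
```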
Outline
§ The Task of Text Classification
§ Naïve Bayes (I)
§ Formalizing the Naïve Bayes Classifier
§ Naïve Bayes: Learning
§ Naïve Bayes: Relationship to Language Modeling
§ Multinomial Naïve Bayes: A Worked Example
In Class Quiz

          Doc  Words                                Class
Training  1    Chinese Beijing Chinese Beijing      c
          2    Chinese Chinese Shanghai Shanghai    c
          3    Chinese Macao                        c
          4    Tokyo Japan Chinese                  j
Test      5    Chinese Chinese Chinese Tokyo Japan  ?

P̂(w | c) = (count(w, c) + 1) / (count(c) + |V|)
P̂(c) = Nc / N

Priors:
P(c) = 3/4
P(j) = 1/4

Conditional probabilities:
P(Chinese | c) = (5+1)/(10+6) = 6/16 = 3/8
P(Tokyo | c) = (0+1)/(10+6) = 1/16
P(Japan | c) = (0+1)/(10+6) = 1/16
P(Chinese | j) = (1+1)/(3+6) = 2/9
P(Tokyo | j) = (1+1)/(3+6) = 2/9
P(Japan | j) = (1+1)/(3+6) = 2/9

Choosing a class:
P(c | d5) ∝ 3/4 × (3/8)³ × 1/16 × 1/16 ≈ 0.000155
P(j | d5) ∝ 1/4 × (2/9)³ × 2/9 × 2/9 ≈ 0.000136
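The quiz numbers can be reproduced end-to-end (a minimal sketch using whitespace tokenization and add-1 smoothing over the six-word vocabulary):

```python
from collections import Counter
from math import prod

train = [("c", "Chinese Beijing Chinese Beijing"),
         ("c", "Chinese Chinese Shanghai Shanghai"),
         ("c", "Chinese Macao"),
         ("j", "Tokyo Japan Chinese")]
test_doc = "Chinese Chinese Chinese Tokyo Japan".split()

vocab = {w for _, t in train for w in t.split()}  # |V| = 6
scores = {}
for c in ("c", "j"):
    texts = [t for lab, t in train if lab == c]
    prior = len(texts) / len(train)
    counts = Counter(w for t in texts for w in t.split())
    n = sum(counts.values())  # 10 tokens for c, 3 for j
    likelihood = prod((counts[w] + 1) / (n + len(vocab)) for w in test_doc)
    scores[c] = prior * likelihood

print(max(scores, key=scores.get))  # -> c
```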
Naïve Bayes in Spam Filtering
§ SpamAssassin features:
§ Mentions Generic Viagra
§ Online Pharmacy
§ Mentions millions of (dollar) ((dollar)NN,NNN,NNN.NN)
§ Phrase: impress ... girl
§ From: starts with many numbers
§ Subject is all capitals
§ HTML has a low ratio of text to image area
§ One hundred percent guaranteed
§ Claims you can be removed from the list
§ http://spamassassin.apache.org/tests_3_3_x.html
Outline
§ The Task of Text Classification
§ Naïve Bayes (I)
§ Formalizing the Naïve Bayes Classifier
§ Naïve Bayes: Learning
§ Naïve Bayes: Relationship to Language Modeling
§ Evaluation
General Evaluation Methodology
§ Have humans create a test collection where every document is tagged with the desired categories (“ground truth”)
§ Generate categorization results using a system on the test collection
§ Compare the system categorization decisions with the human-made categorization decisions and quantify their similarity (or, equivalently, difference)
§ The higher the similarity is, the better the results are
§ Similarity can be measured from different perspectives to understand the quality of results in detail (e.g., which category performs better?)
The 2-by-2 result table

prediction →   System (“y”)           System (“n”)
Human (+)      True Positives (TP)    False Negatives (FN)
Human (−)      False Positives (FP)   True Negatives (TN)

accuracy = (tp + tn) / (tp + tn + fp + fn)
An example of why accuracy alone is not enough

§ How about earthquake detection?

prediction \ ground-truth   E    Not E
E                           0    0
Not E                       10   9990

§ Accuracy can be as high as 99.90%
§ But…
Problems with Classification Accuracy

§ Some decision errors are more serious than others
§ It may be more important to get the decisions right on some documents than others
§ It may be more important to get the decisions right on some categories than others
§ E.g., spam filtering: missing a legitimate email costs more than letting a spam go
§ Problem with imbalanced test set
§ Skewed test set: 98% in category 1; 2% in category 2
§ Strong baseline: put all instances in category 1 → 98% accuracy!
Per-Document Evaluation

      c1    c2    c3   …  ck
d1    y(+)  y(−)  n(+)    n(+)
d2    y(−)  n(+)  y(+)    n(+)
d3    n(+)  n(+)  y(+)    n(+)

How good are the decisions on di?

Precision = TP / (TP + FP) — when the system says “yes,” how many are correct?
Recall = TP / (TP + FN) — does the doc have all the categories it should have?
Per-Class Evaluation

      c1    c2    c3   …  ck
d1    y(+)  y(−)  n(+)    n(+)
d2    y(−)  n(+)  y(+)    n(+)
d3    n(+)  n(+)  y(+)    n(+)

How good are the decisions on ci?

Precision = TP / (TP + FP) — when the system says “yes” for ci, how many are correct?
Recall = TP / (TP + FN) — did the system find all the documents that belong to ci?
Re-visit Earthquake

prediction \ ground-truth   E         Not E
E                           0 (tp)    0 (fp)
Not E                       10 (fn)   9990 (tn)

Precision = tp / (tp + fp) = 0
Recall = tp / (tp + fn) = 0
Re-visit Earthquake – 2

prediction \ ground-truth   E          Not E
E                           10 (tp)    9990 (fp)
Not E                       0 (fn)     0 (tn)

Precision = tp / (tp + fp) = 10/10000 = 0.001
Recall = tp / (tp + fn) = 1
A combined measure: F

§ A combined measure that assesses the P/R tradeoff is the F measure (weighted harmonic mean):

F = 1 / ( α(1/P) + (1 − α)(1/R) ) = (β² + 1) P R / (β² P + R)

§ The harmonic mean is a very conservative average
§ People usually use the balanced F1 measure, i.e., with β = 1 (that is, α = ½): F = 2PR / (P + R)
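Precision, recall, and balanced F1 from a 2-by-2 table can be computed in a few lines (the counts below are made-up toy numbers):

```python
# Precision, recall, and the balanced F1 measure from TP/FP/FN counts.
def prf(tp, fp, fn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)  # harmonic mean: the beta = 1 case
    return p, r, f1

p, r, f1 = prf(tp=10, fp=10, fn=10)
print(p, r, f1)  # 0.5 0.5 0.5
```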
Evaluation: Classic Reuters-21578 Data Set

§ Most (over)used data set, 21,578 docs (each: 90 types, 200 tokens)
§ 9,603 training, 3,299 test articles
§ 118 categories
§ An article can be in more than one category
§ Learn 118 binary category distinctions
§ Average document (with at least one category) has 1.24 classes
§ Only about 10 out of 118 categories are large

Common categories (#train, #test):
• Earn (2877, 1087)
• Acquisitions (1650, 179)
• Money-fx (538, 179)
• Grain (433, 149)
• Crude (389, 189)
• Trade (369, 119)
• Interest (347, 131)
• Ship (197, 89)
• Wheat (212, 71)
• Corn (182, 56)
Reuters Text Categorization dataset (Reuters-21578) document
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC.
Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.
A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter
</BODY></TEXT></REUTERS>
Confusion matrix

§ For each pair of classes <c1, c2>, how many documents from c1 were incorrectly assigned to c2?
§ e.g., c3,2: 90 wheat documents incorrectly assigned to poultry

Docs in test set  Assigned UK  Assigned poultry  Assigned wheat  Assigned coffee  Assigned interest  Assigned trade
True UK           95           1                 13              0                1                  0
True poultry      0            1                 0               0                0                  0
True wheat        10           90                0               1                0                  0
True coffee       0            0                 0               34               3                  7
True interest     -            1                 2               13               26                 5
True trade        0            0                 2               14               5                  10
Micro- vs. Macro-Averaging

§ If we have more than one class, how do we combine multiple performance measures into one quantity?
§ Macroaveraging: compute performance for each class, then average.
§ Microaveraging: collect decisions for all classes, compute contingency table, evaluate.
(Macro) Average Over All the Categories

      c1    c2    c3   …  ck
d1    y(+)  y(−)  n(+)    n(+)
d2    y(−)  n(+)  y(+)    n(+)
d3    n(+)  n(+)  y(+)    n(+)
…
dN    …     …

Precision:  p1  p2  p3  …  pk  →  aggregate into Overall Precision
Recall:     r1  r2  r3  …  rk  →  Overall Recall
F-Measure:  f1  f2  f3  …  fk  →  Overall F score
Micro-Averaging of Precision and Recall

First pool all decisions across classes, then compute precision and recall:

               System (“y”)           System (“n”)
Human (+)      True Positives (TP)    False Negatives (FN)
Human (−)      False Positives (FP)   True Negatives (TN)

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Micro- vs. Macro-Averaging: Example

Class 1            Truth: yes   Truth: no
Classifier: yes    10           10
Classifier: no     10           970

Class 2            Truth: yes   Truth: no
Classifier: yes    90           10
Classifier: no     10           890

Micro Ave. Table   Truth: yes   Truth: no
Classifier: yes    100          20
Classifier: no     20           1860

§ Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
§ Microaveraged precision: 100/120 ≈ 0.83
§ Microaveraged score is dominated by score on common classes
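The two averages from the tables above can be reproduced directly (each class is summarized here by its (tp, fp) pair):

```python
# Micro vs. macro precision for the two-class example:
# class 1: tp=10, fp=10; class 2: tp=90, fp=10.
classes = [(10, 10), (90, 10)]

# Macro: average the per-class precisions.
macro = sum(tp / (tp + fp) for tp, fp in classes) / len(classes)

# Micro: pool the counts first, then compute one precision.
tp_sum = sum(tp for tp, _ in classes)
fp_sum = sum(fp for _, fp in classes)
micro = tp_sum / (tp_sum + fp_sum)

print(round(macro, 2), round(micro, 2))  # macro 0.7, micro 0.83
```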
Development Test Sets and Cross-validation

§ Metric: P/R/F1 or Accuracy
§ Unseen test set
§ avoids overfitting (‘tuning to the test set’)
§ more conservative estimate of performance
§ Cross-validation over multiple splits
§ Handles sampling errors from different data sets
§ Pool results over each split
§ Compute pooled dev set performance

Training Set | Development Test Set | Test Set
[Figure: the train/dev split rotates across the data over multiple folds; a final Test Set is held out]
Outline
§ The Task of Text Classification
§ Naïve Bayes (I)
§ Formalizing the Naïve Bayes Classifier
§ Naïve Bayes: Learning
§ Naïve Bayes: Relationship to Language Modeling
§ Precision, Recall, and the F measure
§ Text Classification: Evaluation
§ Text Classification: Practical Issues
TheRealWorld
§ Gee, I’m building a text classifier for real, now!
§ What should I do?
No training data? Manually written rules

If (wheat or grain) and not (whole or bread) then categorize as grain

§ Need careful crafting
§ Human tuning on development data
§ Time-consuming: 2 days per class
Very little data?

§ Use Naïve Bayes
§ Naïve Bayes is a “high-bias” algorithm (Ng and Jordan 2002 NIPS)
§ Get more labeled data
§ Find clever ways to get humans to label data for you
§ Try semi-supervised training methods:
§ Bootstrapping, EM over unlabeled documents, …
A reasonable amount of data?

§ Perfect for all the clever classifiers
§ SVM
§ Regularized Logistic Regression
§ You can even use user-interpretable decision trees
§ Users like to hack
§ Management likes quick fixes
A huge amount of data?

§ Can achieve high accuracy!
§ At a cost:
§ SVMs (train time) or kNN (test time) can be too slow
§ Regularized logistic regression can be somewhat better
§ Neural networks benefit a lot from this!
§ So Naïve Bayes can come back into its own again!

Accuracy as a function of data size

§ With enough data
§ the classifier may not matter
Real-world systems generally combine:

§ Automatic classification
§ Manual review of uncertain/difficult/“new” cases
Underflow Prevention: log space

§ Multiplying lots of probabilities can result in floating-point underflow.
§ Since log(xy) = log(x) + log(y):
§ better to sum logs of probabilities instead of multiplying probabilities.
§ The class with the highest un-normalized log probability score is still the most probable.
§ The model is now just a max of a sum of weights:

cNB = argmaxcj∈C [ log P(cj) + Σi∈positions log P(xi | cj) ]
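The log-space rule can be sketched as follows (the probability tables are made-up toy numbers; a raw product over a 1000-word document would underflow, while the log scores stay in a comfortable range):

```python
from math import log

# Log-space Naive Bayes: sum log-probabilities instead of multiplying.
priors = {"pos": 0.5, "neg": 0.5}
cond = {"pos": {"love": 0.10, "film": 0.10},
        "neg": {"love": 0.01, "film": 0.10}}

def classify(words):
    def score(c):
        return log(priors[c]) + sum(log(cond[c][w]) for w in words)
    return max(priors, key=score)

# 1000 word positions: the raw product would be ~1e-1000 (underflow),
# but the log score is just a finite negative number.
print(classify(["love", "film"] * 500))  # -> pos
```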
How to improve performance

§ Domain-specific features and weights: very important in real performance
§ Sometimes need to collapse terms:
§ Part numbers, chemical formulas, …
§ But stemming generally doesn’t help (for classification)
§ Upweighting: counting a word as if it occurred twice:
§ title words (Cohen & Singer 1996)
§ first sentence of each paragraph (Murata, 1999)
§ in sentences that contain title words (Ko et al., 2002)