
School of Data Science, Fudan University

DATA130006 Text Management and Analysis

Text Classification and Naïve Bayes

Zhongyu Wei (魏忠钰)

October 11th, 2017

Adapted from UIUC CS410 and Stanford CS124U

Outline

§ The Task of Text Classification

Is this a spam?

Who wrote which Federalist papers?

§ 1787-8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.

§ Authorship of 12 of the letters in dispute

§ 1963: solved by Mosteller and Wallace using Bayesian methods

[Portraits: James Madison, Alexander Hamilton]

Male or female author?

1. By 1925 present-day Vietnam was divided into three parts under French colonial rule. The southern region embracing Saigon and the Mekong delta was the colony of Cochin-China; the central area with its imperial capital at Hue was the protectorate of Annam…

2. Clara never failed to be astonished by the extraordinary felicity of her own name. She found it hard to trust herself to the mercy of fate, which had managed over the years to convert her greatest shame into one of her greatest assets…

Positive or negative movie review?

§ unbelievably disappointing

§ Full of fantastic characters and richly applied satire, and some great plot twists

§ this is the greatest screwball comedy ever filmed

§ It was pathetic. The worst part about it was the boxing scenes.

What is the subject of this article?

MEDLINE Article → ?

MeSH Subject Category Hierarchy

• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …

Text Categorization

§ Given the following:
  § A set of predefined categories, possibly forming a hierarchy, and often
  § A training set of labeled text objects

§ Task: Classify a text object into one or more of the categories

[Diagram: text objects are fed to a categorization system, which uses training data with known categories and outputs categorization results over categories such as Sports, Business, Education, Science, …]

Examples of Text Categorization

§ Text objects can vary (e.g., documents, passages, or collections of text)

§ Categories can also vary
  § “Internal” categories that characterize a text object (e.g., topical categories, sentiment categories)
  § “External” categories that characterize an entity associated with the text object (e.g., author attribution or any other meaningful categories associated with text data)

§ Some examples of applications
  § News categorization, literature article categorization (e.g., MeSH annotations)
  § Spam email detection/filtering
  § Sentiment categorization of product reviews or tweets
  § Automatic email sorting/routing
  § Author attribution

Variants of Problem Formulation

§ Binary categorization: Only two categories
  § Retrieval: {relevant-doc, non-relevant-doc}
  § Spam filtering: {spam, non-spam}
  § Opinion: {positive, negative}

§ K-category categorization: More than two categories
  § Topic categorization: {sports, science, travel, business, …}
  § Email routing: {folder1, folder2, folder3, …}

§ Hierarchical categorization: Categories form a hierarchy

§ Joint categorization: Multiple related categorization tasks done in a joint manner

Text Classification: definition

§ Input:
  § a document d
  § a fixed set of classes C = {c1, c2, …, cJ}

§ Output: a predicted class c ∈ C

Classification Methods: Hand-coded rules

§ Rules based on combinations of words or other features
  § spam: black-list-address OR (“dollars” AND “have been selected”)

§ Accuracy can be high
  § If rules are carefully refined by an expert

§ But building and maintaining these rules is expensive
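As an illustration only, the spam rule above could be written as a small Python predicate. The function name `is_spam`, the example addresses, and the substring matching are assumptions made for this sketch, not part of the original slides.

```python
def is_spam(sender, body, blacklist):
    """Hand-coded rule: black-list-address OR ("dollars" AND "have been selected")."""
    text = body.lower()
    on_blacklist = sender.lower() in blacklist
    has_phrases = "dollars" in text and "have been selected" in text
    return on_blacklist or has_phrases


# Hypothetical black list, for illustration only.
blacklist = {"promo@example.com"}

print(is_spam("friend@example.org", "You have been selected to win a million dollars!", blacklist))
print(is_spam("friend@example.org", "Lunch tomorrow?", blacklist))
```

Every new spam pattern needs another hand-written clause, which is where the maintenance cost comes from.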

Classification Methods: Supervised Machine Learning

§ Input:
  § a document d
  § a fixed set of classes C = {c1, c2, …, cJ}
  § a training set of m hand-labeled documents (d1, c1), …, (dm, cm)

§ Output:
  § a learned classifier γ: d → c

Classification Methods: Supervised Machine Learning

§ Any kind of classifier
  § Naïve Bayes
  § Logistic regression
  § Support-vector machines
  § k-Nearest Neighbors
  § …

Outline

§ The Task of Text Classification
§ Naïve Bayes (I)

Naïve Bayes Intuition

§ Simple (“naïve”) classification method based on Bayes rule

§ Relies on very simple representation of document
  § Bag of words

The bag of words representation

I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.

γ( )=c



The bag of words representation: using a subset of words

x love xxxxxxxxxxxxxxxx sweetxxxxxxx satirical xxxxxxxxxxxxxxxxxxxxx great xxxxxxxxxxxxxxxxxxxxxxxxxx fun xxxxxxxxxxxxxxxxx whimsical xxxxromantic xxxx laughingxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx recommend xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx several xxxxxxxxxxxxxxxxxxxxxx happy xxxxxxxxx againxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

γ( )=c


γ(bag of words) = c, where the bag lists each word with its count:

great 2
love 2
recommend 1
laugh 1
happy 1
... ...
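A minimal sketch of building such a bag of words in Python. The tokenizer (lowercasing and keeping runs of letters and apostrophes) and the short example sentence are assumptions for illustration; the slides do not specify a tokenization scheme.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count tokens while discarding their order (the bag-of-words view)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Hypothetical short review, loosely based on the example above.
review = "I love this movie! I would recommend it; I'm always happy to see it again."
bag = bag_of_words(review)

for word in ("love", "recommend", "happy", "it"):
    print(word, bag[word])
```

Any two documents with the same counts get the same representation, whatever the word order.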

Bag of words for document classification

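[Figure: a test document containing “parser language label translation …” must be assigned to one of the classes below, each shown with some of its characteristic words.]

Machine Learning: learning, training, algorithm, shrinkage, network, ...
NLP: parser, tag, training, translation, language, ...
Garbage Collection: garbage, collection, memory, optimization, region, ...
Planning: planning, temporal, reasoning, plan, language, ...
GUI: ...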

Outline

§ The Task of Text Classification
§ Naïve Bayes (I)
§ Formalizing the Naïve Bayes Classifier

Bayes’ Rule Applied to Documents and Classes

§ For a document d and a class c

$$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}$$

Naïve Bayes Classifier (I)

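$$
\begin{aligned}
c_{MAP} &= \operatorname*{argmax}_{c \in C} P(c \mid d) && \text{MAP is “maximum a posteriori” = most likely class} \\
&= \operatorname*{argmax}_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)} && \text{Bayes rule} \\
&= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c) && \text{dropping the denominator}
\end{aligned}
$$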

Naïve Bayes Classifier (II)

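Document d represented as features x1..xn

$$
\begin{aligned}
c_{MAP} &= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c) \\
&= \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)
\end{aligned}
$$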

Naïve Bayes Classifier (III)

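$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$

§ P(c): How often does this class occur?
  § We can just count the relative frequencies in a corpus

§ P(x1, x2, …, xn | c): O(|X|^n · |C|) parameters
  § Could only be estimated if a very, very large number of training examples was available.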

Multinomial Naïve Bayes Independence Assumptions

$$P(x_1, x_2, \ldots, x_n \mid c)$$

§ Bag of Words assumption: Assume position doesn’t matter

§ Conditional Independence: Assume the feature probabilities P(xi | cj) are independent given the class c.

$$P(x_1, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdot P(x_3 \mid c) \cdots P(x_n \mid c)$$

Multinomial Naïve Bayes Classifier

$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$

$$c_{NB} = \operatorname*{argmax}_{c \in C} P(c) \prod_{x \in X} P(x \mid c)$$

Applying Multinomial Naive Bayes Classifiers to Text Classification

positions ← all word positions in test document

$$c_{NB} = \operatorname*{argmax}_{c_j \in C} P(c_j) \prod_{i \in positions} P(x_i \mid c_j)$$

Outline

§ The Task of Text Classification
§ Naïve Bayes (I)
§ Formalizing the Naïve Bayes Classifier
§ Naïve Bayes: Learning

Learning the Multinomial Naïve Bayes Model

§ First attempt: maximum likelihood estimates
  § simply use the frequencies in the data

$$\hat{P}(c_j) = \frac{doccount(C = c_j)}{N_{doc}}$$

$$\hat{P}(w_i \mid c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V} count(w, c_j)}$$

Parameter estimation

$$\hat{P}(w_i \mid c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V} count(w, c_j)}$$

= fraction of times word wi appears among all words in documents of topic cj

§ Create mega-document for topic j by concatenating all docs in this topic
  § Use frequency of w in mega-document

Problem with Maximum Likelihood

§ What if we have seen no training documents with the word fantastic and classified in the topic positive (thumbs-up)?

$$\hat{P}(\text{“fantastic”} \mid \text{positive}) = \frac{count(\text{“fantastic”}, \text{positive})}{\sum_{w \in V} count(w, \text{positive})} = 0$$

§ Zero probabilities cannot be conditioned away, no matter the other evidence!

$$c_{MAP} = \operatorname*{argmax}_{c} \hat{P}(c) \prod_{i} \hat{P}(x_i \mid c)$$

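Laplace (add-1) smoothing for Naïve Bayes

Maximum likelihood estimate:

$$\hat{P}(w_i \mid c) = \frac{count(w_i, c)}{\sum_{w \in V} count(w, c)}$$

Add-1 estimate:

$$\hat{P}(w_i \mid c) = \frac{count(w_i, c) + 1}{\sum_{w \in V} \left( count(w, c) + 1 \right)} = \frac{count(w_i, c) + 1}{\left( \sum_{w \in V} count(w, c) \right) + |V|}$$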

Multinomial Naïve Bayes: Learning

§ From training corpus, extract Vocabulary

§ Calculate P(cj) terms
  § For each cj in C do
      docsj ← all docs with class = cj
      P(cj) ← |docsj| / |total # documents|

§ Calculate P(wk | cj) terms
  § Textj ← single doc containing all docsj
  § n ← total number of word tokens in Textj
  § For each word wk in Vocabulary
      nk ← # of occurrences of wk in Textj
      P(wk | cj) ← (nk + α) / (n + α |Vocabulary|)
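The learning procedure above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the function name `train_multinomial_nb`, the pre-tokenized input format, and the two-document toy corpus are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_multinomial_nb(docs, alpha=1.0):
    """docs: list of (tokens, label) pairs. Returns (priors, cond_prob, vocabulary)."""
    # Extract the vocabulary from the whole training corpus.
    vocabulary = {w for tokens, _ in docs for w in tokens}

    docs_by_class = defaultdict(list)
    for tokens, label in docs:
        docs_by_class[label].append(tokens)

    priors, cond_prob = {}, {}
    for label, class_docs in docs_by_class.items():
        # P(c_j) = |docs_j| / |total # documents|
        priors[label] = len(class_docs) / len(docs)
        # Text_j: all documents of the class concatenated into one bag of counts.
        counts = Counter(w for tokens in class_docs for w in tokens)
        n = sum(counts.values())
        # P(w_k | c_j) = (n_k + alpha) / (n + alpha * |Vocabulary|)
        cond_prob[label] = {
            w: (counts[w] + alpha) / (n + alpha * len(vocabulary))
            for w in vocabulary
        }
    return priors, cond_prob, vocabulary


# Hypothetical toy corpus, for illustration only.
toy_docs = [("great fun film".split(), "pos"), ("boring film".split(), "neg")]
priors, cond_prob, vocabulary = train_multinomial_nb(toy_docs)
print(priors)
print(cond_prob["pos"]["great"], cond_prob["pos"]["boring"])
```

With alpha = 1 this is add-1 smoothing, so the unseen pair ("boring", pos) still gets a small non-zero probability.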

Laplace (add-1) smoothing: unknown words

Add one extra word to the vocabulary, the “unknown word” wu

$$\hat{P}(w_u \mid c) = \frac{count(w_u, c) + 1}{\left( \sum_{w \in V} count(w, c) \right) + |V| + 1} = \frac{1}{\left( \sum_{w \in V} count(w, c) \right) + |V| + 1}$$

Outline

§ The Task of Text Classification
§ Naïve Bayes (I)
§ Formalizing the Naïve Bayes Classifier
§ Naïve Bayes: Learning
§ Naïve Bayes: Relationship to Language Modeling

Generative Model for Multinomial Naïve Bayes

[Diagram: the class c = China generates each word of the document]

X1 = Shanghai, X2 = and, X3 = Shenzhen, X4 = issue, X5 = bonds

Naïve Bayes and Language Modeling

§ Naïve Bayes classifiers can use any sort of feature
  § URL, email address, dictionaries, network features

§ But if, as in the previous slides
  § We use only word features
  § We use all of the words in the text (not a subset)

§ Then
  § Naïve Bayes has an important similarity to language modeling.

Each class = a unigram language model

§ Assigning each word: P(word | c)
§ Assigning each sentence: P(s | c) = Π P(word | c)

Class pos

P(word | pos)   word
0.1             I
0.1             love
0.01            this
0.05            fun
0.1             film
…

s:               I     love   this   fun    film
P(word | pos):   0.1   0.1    0.01   0.05   0.1

P(s | pos) = 0.0000005

Naïve Bayes as a Language Model

§ Which class assigns the higher probability to s?

word    Model pos   Model neg
I       0.1         0.2
love    0.1         0.001
this    0.01        0.01
fun     0.05        0.005
film    0.1         0.1

s = I love this fun film

P(s | pos) > P(s | neg)
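The comparison between the two class models can be checked with a few lines of Python. The dictionary names `model_pos` and `model_neg` and the helper `sentence_prob` are hypothetical; the probabilities are the ones listed in the slides.

```python
import math

# Unigram probabilities per class, taken from the tables in the slides.
model_pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
model_neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}


def sentence_prob(sentence, model):
    """P(s | c) as the product of the unigram probabilities P(word | c)."""
    return math.prod(model[word] for word in sentence.split())


s = "I love this fun film"
p_pos = sentence_prob(s, model_pos)
p_neg = sentence_prob(s, model_neg)
print(p_pos, p_neg, p_pos > p_neg)
```

Here P(s | pos) comes out at about 5e-7 and P(s | neg) at about 1e-9, so the positive model assigns the higher probability.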

Outline

§ The Task of Text Classification
§ Naïve Bayes (I)
§ Formalizing the Naïve Bayes Classifier
§ Naïve Bayes: Learning
§ Naïve Bayes: Relationship to Language Modeling
§ Multinomial Naïve Bayes: A Worked Example

In Class Quiz

           Doc   Words                                  Class
Training   1     Chinese Beijing Chinese Beijing        c
           2     Chinese Chinese Shanghai Shanghai      c
           3     Chinese Macao                          c
           4     Tokyo Japan Chinese                    j
Test       5     Chinese Chinese Chinese Tokyo Japan    ?

Estimators:

$$\hat{P}(c) = \frac{N_c}{N} \qquad \hat{P}(w \mid c) = \frac{count(w, c) + 1}{count(c) + |V|}$$

Priors:
P(c) = 3/4
P(j) = 1/4

Conditional Probabilities:
P(Chinese | c) = (5+1)/(10+6) = 6/16 = 3/8
P(Tokyo | c) = (0+1)/(10+6) = 1/16
P(Japan | c) = (0+1)/(10+6) = 1/16
P(Chinese | j) = (1+1)/(3+6) = 2/9
P(Tokyo | j) = (1+1)/(3+6) = 2/9
P(Japan | j) = (1+1)/(3+6) = 2/9

Choosing a class:
P(c | d5) ∝ 3/4 · (3/8)^3 · 1/16 · 1/16 ≈ 0.00015
P(j | d5) ∝ 1/4 · (2/9)^3 · 2/9 · 2/9 ≈ 0.00014

Class c has the higher score, so d5 is assigned to c.
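The quiz can be reproduced with exact fractions in Python. This is a sketch for checking the arithmetic, assuming whitespace tokenization and add-1 smoothing as in the estimator above; the variable names are chosen for the example.

```python
from collections import Counter
from fractions import Fraction

training = [
    ("Chinese Beijing Chinese Beijing", "c"),
    ("Chinese Chinese Shanghai Shanghai", "c"),
    ("Chinese Macao", "c"),
    ("Tokyo Japan Chinese", "j"),
]
test_doc = "Chinese Chinese Chinese Tokyo Japan"

vocab = {w for text, _ in training for w in text.split()}  # |V| = 6

scores = {}
for label in ("c", "j"):
    texts = [text for text, y in training if y == label]
    counts = Counter(w for text in texts for w in text.split())
    total = sum(counts.values())
    # Prior P(label), then one smoothed factor per word position in the test doc.
    score = Fraction(len(texts), len(training))
    for w in test_doc.split():
        score *= Fraction(counts[w] + 1, total + len(vocab))
    scores[label] = score

prediction = max(scores, key=scores.get)
print({label: float(score) for label, score in scores.items()}, prediction)
```

The unnormalized scores are 81/524288 for c and 8/59049 for j, so the prediction is c.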

Naïve Bayes in Spam Filtering

§ SpamAssassin Features:
  § Mentions Generic Viagra
  § Online Pharmacy
  § Mentions millions of (dollar) ((dollar) NN,NNN,NNN.NN)
  § Phrase: impress ... girl
  § From: starts with many numbers
  § Subject is all capitals
  § HTML has a low ratio of text to image area
  § One hundred percent guaranteed
  § Claims you can be removed from the list
  § http://spamassassin.apache.org/tests_3_3_x.html

Outline

§ The Task of Text Classification
§ Naïve Bayes (I)
§ Formalizing the Naïve Bayes Classifier
§ Naïve Bayes: Learning
§ Naïve Bayes: Relationship to Language Modeling
§ Evaluation

General Evaluation Methodology

§ Have humans create a test collection where every document is tagged with the desired categories (“ground truth”)

§ Generate categorization results using a system on the test collection

§ Compare the system categorization decisions with the human-made categorization decisions and quantify their similarity (or equivalently difference)
  § The higher the similarity is, the better the results are
  § Similarity can be measured from different perspectives to understand the quality of results in detail (e.g., which category performs better?)

The 2-by-2 result table

Rows: ground truth (human); columns: prediction (system)

             System (“y”)            System (“n”)
Human (+)    True Positives (TP)     False Negatives (FN)
Human (-)    False Positives (FP)    True Negatives (TN)

$$accuracy = \frac{tp + tn}{tp + tn + fp + fn}$$

An example of why accuracy alone is not enough

§ How about earthquake detection?

                   ground truth: E   ground truth: NotE
prediction: E      0                 0
prediction: NotE   10                9990

§ Accuracy can be as high as 99.90%

§ But…

Problems with Classification Accuracy

§ Some decision errors are more serious than others
  § It may be more important to get the decisions right on some documents than others
  § It may be more important to get the decisions right on some categories than others
  § E.g., spam filtering: missing a legitimate email costs more than letting a spam go

§ Problem with imbalanced test set
  § Skewed test set: 98% in category 1; 2% in category 2
  § Strong baseline: put all instances in category 1 → 98% accuracy!

Per-Document Evaluation

      c1     c2     c3     …    ck
d1    y(+)   y(-)   n(+)        n(+)
d2    y(-)   n(+)   y(+)        n(+)
d3    n(+)   n(+)   y(+)        n(+)

How good are the decisions on di?

$$Precision = \frac{TP}{TP + FP}$$

When the system says “yes,” how many are correct?

$$Recall = \frac{TP}{TP + FN}$$

Does the doc have all the categories it should have?

             System (“y”)            System (“n”)
Human (+)    True Positives (TP)     False Negatives (FN)
Human (-)    False Positives (FP)    True Negatives (TN)

Per-Class Evaluation

      c1     c2     c3     …    ck
d1    y(+)   y(-)   n(+)        n(+)
d2    y(-)   n(+)   y(+)        n(+)
d3    n(+)   n(+)   y(+)        n(+)

How good are the decisions on ci?

$$Precision = \frac{TP}{TP + FP}$$

When the system says “yes,” how many are correct?

$$Recall = \frac{TP}{TP + FN}$$

Has the category been assigned to all the docs that should have it?

Re-visit Earthquake

                   ground truth: E   ground truth: NotE
prediction: E      0 (tp)            0 (fp)
prediction: NotE   10 (fn)           9990 (tn)

$$Precision = \frac{tp}{tp + fp} = 0 \quad \text{(0/0, taken as 0 here)}$$

$$Recall = \frac{tp}{tp + fn} = 0$$

Re-visit Earthquake - 2

                   ground truth: E   ground truth: NotE
prediction: E      10 (tp)           9990 (fp)
prediction: NotE   0 (fn)            0 (tn)

$$Precision = \frac{tp}{tp + fp} = \frac{10}{10 + 9990} = 0.001$$

$$Recall = \frac{tp}{tp + fn} = 1$$
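The two earthquake scenarios can be compared with a small helper. This is a sketch under one stated assumption: a ratio with an empty denominator (for example precision when the system never says "yes") is reported as 0.0, which is a convention chosen for this example. The function names are also hypothetical.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all decisions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision_recall(tp, fp, fn):
    """Return (precision, recall); an empty denominator is reported as 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Detector that never predicts an earthquake.
never_acc = accuracy(tp=0, fp=0, fn=10, tn=9990)
never_pr = precision_recall(tp=0, fp=0, fn=10)

# Detector that always predicts an earthquake.
always_acc = accuracy(tp=10, fp=9990, fn=0, tn=0)
always_pr = precision_recall(tp=10, fp=9990, fn=0)

print(never_acc, never_pr)
print(always_acc, always_pr)
```

The first detector has 99.9% accuracy but zero recall. The second has perfect recall but a precision of only 10 / (10 + 9990) = 0.001, which is why neither number is sufficient on its own.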

A combined measure: F

§ A combined measure that assesses the P/R tradeoff is F measure (weighted harmonic mean):

$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1)\,P R}{\beta^2 P + R}$$

§ The harmonic mean is a very conservative average
§ People usually use balanced F1 measure, i.e., with β = 1 (that is, α = ½): F = 2PR/(P+R)
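A short Python sketch of the F measure in its β form. The name `f_measure` and the choice to return 0.0 when both precision and recall are zero are assumptions for this example.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall: (b^2 + 1)PR / (b^2 P + R)."""
    if precision == 0 and recall == 0:
        return 0.0  # convention for this sketch: avoid dividing by zero
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


f1_balanced = f_measure(0.5, 1.0)          # beta = 1, i.e. 2PR / (P + R)
f2_recall_heavy = f_measure(0.5, 1.0, 2)   # beta = 2 weights recall more
f1_skewed = f_measure(0.001, 1.0)          # precision 0.001, recall 1
print(f1_balanced, f2_recall_heavy, f1_skewed)
```

With precision 0.001 and recall 1, the arithmetic mean would be about 0.5, while F1 stays near 0.002. This is the sense in which the harmonic mean is conservative.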

Evaluation: Classic Reuters-21578 Data Set

§ Most (over)used data set, 21,578 docs (each 90 types, 200 tokens)

§ 9603 training, 3299 test articles
§ 118 categories
  § An article can be in more than one category
  § Learn 118 binary category distinctions

§ Average document (with at least one category) has 1.24 classes

§ Only about 10 out of 118 categories are large

Common categories (#train, #test)

• Earn (2877, 1087)
• Acquisitions (1650, 179)
• Money-fx (538, 179)
• Grain (433, 149)
• Crude (389, 189)
• Trade (369, 119)
• Interest (347, 131)
• Ship (197, 89)
• Wheat (212, 71)
• Corn (182, 56)

Reuters Text Categorization data set (Reuters-21578) document

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">

<DATE> 2-MAR-1987 16:51:43.42</DATE>

<TOPICS><D>livestock</D><D>hog</D></TOPICS>

<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>

<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC.

Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.

A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter

</BODY></TEXT></REUTERS>

Confusion matrix c

§ For each pair of classes <c1, c2> how many documents from c1 were incorrectly assigned to c2?
  § c3,2: 90 wheat documents incorrectly assigned to poultry

Docs in test set   Assigned UK   Assigned poultry   Assigned wheat   Assigned coffee   Assigned interest   Assigned trade
True UK            95            1                  13               0                 1                   0
True poultry       0             1                  0                0                 0                   0
True wheat         10            90                 0                1                 0                   0
True coffee        0             0                  0                34                3                   7
True interest      -             1                  2                13                26                  5
True trade         0             0                  2                14                5                   10

Micro- vs. Macro-Averaging

§ If we have more than one class, how do we combine multiple performance measures into one quantity?

§ Macroaveraging: Compute performance for each class, then average.

§ Microaveraging: Collect decisions for all classes, compute contingency table, evaluate.

(Macro) Average Over All the Categories

      c1     c2     c3     …    ck
d1    y(+)   y(-)   n(+)        n(+)
d2    y(-)   n(+)   y(+)        n(+)
d3    n(+)   n(+)   y(+)        n(+)
…
dN    …      …

Per-category scores and their aggregate:

             c1   c2   c3   …   ck   Aggregate
Precision    p1   p2   p3   …   pk   Overall Precision
Recall       r1   r2   r3   …   rk   Overall Recall
F-Measure    f1   f2   f3   …   fk   Overall F score

Micro-Averaging of Precision and Recall

      c1     c2     c3     …    ck
d1    y(+)   y(-)   n(+)        n(+)
d2    y(-)   n(+)   y(+)        n(+)
d3    n(+)   n(+)   y(+)        n(+)
…
dN    …      …

             System (“y”)            System (“n”)
Human (+)    True Positives (TP)     False Negatives (FN)
Human (-)    False Positives (FP)    True Negatives (TN)

First pool all decisions, then compute precision and recall

$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

Micro- vs. Macro-Averaging: Example

Class 1

                  Truth: yes   Truth: no
Classifier: yes   10           10
Classifier: no    10           970

[Class 2 and Micro Ave. Table: same layout as Class 1; cell values not preserved]

§ Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
§ Microaveraged precision: 100/120 = .83
§ Microaveraged score is dominated by score on common classes
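The two averages in this example can be recomputed as follows. Only the Class 1 counts are listed explicitly in the table; the Class 2 counts used here (90 true positives, 10 false positives) are an assumption inferred from the stated per-class precision of 0.9 and the pooled figure 100/120.

```python
# True-positive and false-positive counts per class.
# class2 values are inferred from the stated precisions, not read from a table.
per_class = {
    "class1": {"tp": 10, "fp": 10},
    "class2": {"tp": 90, "fp": 10},
}

# Macro-averaging: compute precision per class, then average the precisions.
class_precisions = [c["tp"] / (c["tp"] + c["fp"]) for c in per_class.values()]
macro_precision = sum(class_precisions) / len(class_precisions)

# Micro-averaging: pool the counts over all classes, then compute precision once.
pooled_tp = sum(c["tp"] for c in per_class.values())
pooled_fp = sum(c["fp"] for c in per_class.values())
micro_precision = pooled_tp / (pooled_tp + pooled_fp)

print(macro_precision, micro_precision)
```

The micro-average (about 0.83) sits closer to the precision of the larger class, while the macro-average (0.7) weights both classes equally.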

Development Test Sets and Cross-validation

§ Metric: P/R/F1 or Accuracy
§ Unseen test set
  § avoid overfitting (‘tuning to the test set’)
  § more conservative estimate of performance

§ Cross-validation over multiple splits
  § Handle sampling errors from different datasets
  § Pool results over each split
  § Compute pooled dev set performance

[Figure: the data is divided into a Training Set, a Development Test Set, and a Test Set; under cross-validation the Dev Test portion moves through the training data across splits, while the Test Set stays fixed.]
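One way to generate the rotating training / dev-test splits is sketched below. The generator name `k_fold_splits` and the use of contiguous folds are assumptions for illustration; in practice the data would usually be shuffled first and the final test set kept out of these splits entirely.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, dev_indices) for k contiguous folds over range(n_items)."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        dev = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, dev
        start += size


splits = list(k_fold_splits(10, 3))
for train, dev in splits:
    print("train:", train, "dev:", dev)
```

Each item appears in exactly one dev fold, so results can be pooled over the splits to compute a single dev-set score.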

Outline

§ The Task of Text Classification
§ Naïve Bayes (I)
§ Formalizing the Naïve Bayes Classifier
§ Naïve Bayes: Learning
§ Naïve Bayes: Relationship to Language Modeling
§ Precision, Recall, and the F measure
§ Text Classification: Evaluation
§ Text Classification: Practical Issues

The Real World

§ Gee, I’m building a text classifier for real, now!
§ What should I do?

No training data? Manually written rules

If (wheat or grain) and not (whole or bread) then Categorize as grain

§ Need careful crafting
  § Human tuning on development data
  § Time-consuming: 2 days per class

Very little data?

§ Use Naïve Bayes
  § Naïve Bayes is a “high-bias” algorithm (Ng and Jordan 2002 NIPS)

§ Get more labeled data
  § Find clever ways to get humans to label data for you

§ Try semi-supervised training methods:
  § Bootstrapping, EM over unlabeled documents, …

A reasonable amount of data?

§ Perfect for all the clever classifiers
  § SVM
  § Regularized Logistic Regression

§ You can even use user-interpretable decision trees
  § Users like to hack
  § Management likes quick fixes

A huge amount of data?

§ Can achieve high accuracy!
§ At a cost:
  § SVMs (train time) or kNN (test time) can be too slow
  § Regularized logistic regression can be somewhat better
  § Neural networks benefit a lot from this!

§ So Naïve Bayes can come back into its own again!

Accuracy as a function of data size

§ With enough data
  § Classifier may not matter

Real-world systems generally combine:

§ Automatic classification
§ Manual review of uncertain/difficult/“new” cases

Underflow Prevention: log space

§ Multiplying lots of probabilities can result in floating-point underflow.

§ Since log(xy) = log(x) + log(y)
  § Better to sum logs of probabilities instead of multiplying probabilities.

§ Class with highest un-normalized log probability score is still most probable.

$$c_{NB} = \operatorname*{argmax}_{c_j \in C} \left[ \log P(c_j) + \sum_{i \in positions} \log P(x_i \mid c_j) \right]$$

§ Model is now just max of sum of weights
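The underflow problem is easy to reproduce. The sketch below uses 100 hypothetical word probabilities of 1e-5 each; the numbers are invented purely to trigger the effect in double precision.

```python
import math

# 100 hypothetical per-word probabilities; their true product is 1e-500.
probs = [1e-5] * 100

product = math.prod(probs)                  # too small for a double: becomes 0.0
log_sum = sum(math.log(p) for p in probs)   # stays a perfectly ordinary number

print(product)
print(log_sum)
```

Once the product has collapsed to 0.0, every class scores the same and the argmax is meaningless. The summed logs still rank the classes correctly, because the logarithm is monotonic.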

How to improve performance

§ Domain-specific features and weights: very important in real performance

§ Sometimes need to collapse terms:
  § Part numbers, chemical formulas, …
  § But stemming generally doesn’t help (for classification)

§ Upweighting: Counting a word as if it occurred twice:
  § title words (Cohen & Singer 1996)
  § first sentence of each paragraph (Murata, 1999)
  § In sentences that contain title words (Ko et al, 2002)