Page 1
CS188: Artificial Intelligence
Naïve Bayes
Instructors: Brijen Thananjeyan and Aditya Baradwaj --- University of California, Berkeley
[These slides were created by Dan Klein, Pieter Abbeel, Sergey Levine, with some materials from A. Farhadi. All CS188 materials are at http://ai.berkeley.edu.]
Page 2
Machine Learning
§ Up until now: how to use a model to make optimal decisions
§ Machine learning: how to acquire a model from data/experience
§ Learning parameters (e.g. probabilities)
§ Learning structure (e.g. BN graphs)
§ Learning hidden concepts (e.g. clustering)
§ Today: model-based classification with Naive Bayes
Page 4
Example: Spam Filter
§ Input: an email
§ Output: spam/ham
§ Setup:
§ Get a large collection of example emails, each labeled "spam" or "ham"
§ Note: someone has to hand label all this data!
§ Want to learn to predict labels of new, future emails
§ Features: The attributes used to make the ham/spam decision
§ Words: FREE!
§ Text Patterns: $dd, CAPS
§ Non-text: Sender In Contacts
§ …
Dear Sir.
First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …
TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.
99 MILLION EMAIL ADDRESSES FOR ONLY $99
Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.
Page 5
Example: Digit Recognition
§ Input: images / pixel grids
§ Output: a digit 0-9
§ Setup:
§ Get a large collection of example images, each labeled with a digit
§ Note: someone has to hand label all this data!
§ Want to learn to predict labels of new, future digit images
§ Features: The attributes used to make the digit decision
§ Pixels: (6,8) = ON
§ Shape Patterns: NumComponents, AspectRatio, NumLoops
§ …
[Example digit images with labels: 0, 1, 2, 1, and a query image labeled ??]
Page 6
Other Classification Tasks
§ Classification: given inputs x, predict labels (classes) y
§ Examples:
§ Spam detection (input: document, classes: spam/ham)
§ OCR (input: images, classes: characters)
§ Medical diagnosis (input: symptoms, classes: diseases)
§ Automatic essay grading (input: document, classes: grades)
§ Fraud detection (input: account activity, classes: fraud/no fraud)
§ Customer service email routing
§ …many more
§ Classification is an important commercial technology!
Page 7
Model-Based Classification
Page 8
Model-Based Classification
§ Model-based approach
§ Build a model (e.g. Bayes' net) where both the label and features are random variables
§ Instantiate any observed features
§ Query for the distribution of the label conditioned on the features
§ Challenges
§ What structure should the BN have?
§ How should we learn its parameters?
Page 9
Naïve Bayes for Digits
§ Naïve Bayes: Assume all features are independent effects of the label
§ Simple digit recognition version:
§ One feature (variable) F_ij for each grid position <i,j>
§ Feature values are on/off, based on whether intensity is more or less than 0.5 in the underlying image
§ Each input maps to a feature vector of 0s and 1s, one per pixel (see the sketch below)
§ Here: lots of features, each is binary valued
§ Naïve Bayes model: P(Y, F_0,0, …, F_15,15) = P(Y) ∏_{i,j} P(F_i,j | Y)
§ What do we need to learn?
[Bayes' net diagram: label Y with children F1, F2, …, Fn]
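As a minimal sketch of the feature mapping just described, assuming the image arrives as a 2D list of grayscale intensities in [0, 1] (the 0.5 threshold is from the slide; the function name is illustrative):

```python
def extract_features(image):
    """Map a grayscale image (2D list of intensities in [0, 1])
    to binary features F_ij, one per grid position <i, j>."""
    features = {}
    for i, row in enumerate(image):
        for j, intensity in enumerate(row):
            # Feature is "on" exactly when intensity exceeds 0.5
            features[(i, j)] = 1 if intensity > 0.5 else 0
    return features
```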
Page 10
General Naïve Bayes
§ A general Naive Bayes model: P(Y, F1 … Fn) = P(Y) ∏_i P(Fi | Y)
§ We only have to specify how each feature depends on the class
§ Total number of parameters is linear in n
§ Model is very simplistic, but often works anyway
[Bayes' net diagram: label Y with children F1, F2, …, Fn]
§ P(Y): |Y| parameters
§ P(Fi|Y) tables: n × |F| × |Y| parameters
§ Compare: the full joint over Y, F1 … Fn has |Y| × |F|^n values (worked count below)
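To make the count concrete, a quick check for the digit model (10 classes; a 16x16 grid of binary pixel features is assumed here for illustration):

```python
n_classes = 10        # |Y|: digits 0-9
n_features = 16 * 16  # n: one binary feature per grid position (assumed size)
n_values = 2          # |F|: each feature is on/off

# Naive Bayes: prior plus one conditional table per feature: linear in n
nb_params = n_classes + n_features * n_values * n_classes
print(nb_params)                            # 5130

# Full joint over (Y, F1 ... Fn): exponential in n
print(n_classes * n_values ** n_features)   # 10 * 2**256, hopeless to store
```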
Page 11
Inference for Naïve Bayes
§ Goal: compute posterior distribution over label variable Y
§ Step 1: get joint probability of label and evidence for each label: P(y, f1 … fn) = P(y) ∏_i P(fi | y)
§ Step 2: sum to get probability of evidence: P(f1 … fn) = Σ_y P(y, f1 … fn)
§ Step 3: normalize by dividing Step 1 by Step 2: P(y | f1 … fn) = P(y, f1 … fn) / P(f1 … fn)
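A minimal sketch of the three steps in code, assuming `prior` maps each label y to P(y) and `likelihoods` maps y to the list of P(fi | y) values for the observed evidence (both names are illustrative):

```python
def naive_bayes_posterior(prior, likelihoods):
    """Compute P(Y | f1 ... fn) from P(Y) and the observed P(fi | y)."""
    # Step 1: joint probability for each label:
    # P(y, f1 ... fn) = P(y) * prod_i P(fi | y)
    joint = {}
    for y, p_y in prior.items():
        p = p_y
        for p_f in likelihoods[y]:
            p *= p_f
        joint[y] = p

    # Step 2: probability of evidence: P(f1 ... fn) = sum_y P(y, f1 ... fn)
    evidence = sum(joint.values())

    # Step 3: normalize Step 1 by Step 2
    return {y: p / evidence for y, p in joint.items()}
```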
Page 12
General Naïve Bayes
§ What do we need in order to use Naïve Bayes?
§ Inference method (we just saw this part)
§ Start with a bunch of probabilities: P(Y) and the P(Fi|Y) tables
§ Use standard inference to compute P(Y|F1…Fn)
§ Nothing new here
§ Estimates of local conditional probability tables
§ P(Y), the prior over labels
§ P(Fi|Y) for each feature (evidence variable)
§ These probabilities are collectively called the parameters of the model and denoted by θ
§ Up until now, we assumed these appeared by magic, but…
§ …they typically come from training data counts: we'll look at this soon
Page 13
Example: Conditional Probabilities

P(Y), a uniform prior over digits:
y     1    2    3    4    5    6    7    8    9    0
P(y)  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1

P(F = on | Y) for one pixel feature:
y        1     2     3     4     5     6     7     8     9     0
P(on|y)  0.01  0.05  0.05  0.30  0.80  0.90  0.05  0.60  0.50  0.80

P(F = on | Y) for another pixel feature:
y        1     2     3     4     5     6     7     8     9     0
P(on|y)  0.05  0.01  0.90  0.80  0.90  0.90  0.25  0.85  0.60  0.80
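Running the three inference steps from the previous slide on these tables, under the assumption that both pixel features are observed "on", picks out the most likely digit. A sketch using the values above:

```python
p_y = 0.1  # uniform prior over the ten digits
f1_on = {1: 0.01, 2: 0.05, 3: 0.05, 4: 0.30, 5: 0.80,
         6: 0.90, 7: 0.05, 8: 0.60, 9: 0.50, 0: 0.80}
f2_on = {1: 0.05, 2: 0.01, 3: 0.90, 4: 0.80, 5: 0.90,
         6: 0.90, 7: 0.25, 8: 0.85, 9: 0.60, 0: 0.80}

joint = {y: p_y * f1_on[y] * f2_on[y] for y in f1_on}     # Step 1
evidence = sum(joint.values())                            # Step 2
posterior = {y: p / evidence for y, p in joint.items()}   # Step 3
print(max(posterior, key=posterior.get))                  # 6 wins here
```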
Page 14
A Spam Filter
§ Naïve Bayes spam filter
§ Data:
§ Collection of emails, labeled spam or ham
§ Note: someone has to hand label all this data!
§ Split into training, held-out, test sets
§ Classifiers
§ Learn on the training set
§ (Tune it on a held-out set)
§ Test it on new emails
Dear Sir.
First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …
TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.
99 MILLION EMAIL ADDRESSES FOR ONLY $99
Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.
Page 15
Naïve Bayes for Text
§ Bag-of-words Naïve Bayes:
§ Features: Wi is the word at position i
§ As before: predict label conditioned on feature variables (spam vs. ham)
§ As before: assume features are conditionally independent given label
§ New: each Wi is identically distributed
§ Generative model: P(Y, W1 … Wn) = P(Y) ∏_i P(Wi | Y)
§ "Tied" distributions and bag-of-words
§ Usually, each variable gets its own conditional probability distribution P(F|Y)
§ In a bag-of-words model
§ Each position is identically distributed
§ All positions share the same conditional probs P(W|Y) (see the sketch below)
§ Why make this assumption?
§ Called "bag-of-words" because the model is insensitive to word order or reordering
Wi is the word at position i, not the i-th word in the dictionary!
Example: "When the lecture is over, remember to wake up the person sitting next to you in the lecture room." becomes the sorted bag:
in is lecture lecture next over person remember room sitting the the the to to up wake when you
How many variables are there? How many values?
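A minimal sketch of the tied-distribution scoring described above: every position shares one P(W|Y) table, so reordering the words cannot change the result. Names are illustrative:

```python
def bag_of_words_joint(label, words, prior, word_probs):
    """P(y, w1 ... wn) = P(y) * prod_i P(wi | y), with one shared
    P(W|Y) table per label instead of a table per position."""
    p = prior[label]
    for w in words:
        p *= word_probs[label][w]   # same table for every position i
    return p

# Reordering `words` multiplies the same factors, so the joint is
# unchanged: the model only sees the "bag", not the order.
```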
Page 16
Example: Spam Filtering
§ Model: P(Y, W1 … Wn) = P(Y) ∏_i P(Wi | Y)
§ What are the parameters?
§ Where do these tables come from?

P(Y):
ham : 0.66
spam: 0.33

P(W|Y=ham):
the : 0.0156
to  : 0.0153
and : 0.0115
of  : 0.0095
you : 0.0093
a   : 0.0086
with: 0.0080
from: 0.0075
...

P(W|Y=spam):
the : 0.0210
to  : 0.0133
of  : 0.0119
2002: 0.0110
with: 0.0108
from: 0.0107
and : 0.0105
a   : 0.0100
...
Page 17
Spam Example

Word     P(w|spam)  P(w|ham)  Tot Spam  Tot Ham
(prior)  0.33333    0.66666   -1.1      -0.4
Gary     0.00002    0.00021   -11.8     -8.9
would    0.00069    0.00084   -19.1     -16.0
you      0.00881    0.00304   -23.8     -21.8
like     0.00086    0.00083   -30.9     -28.9
to       0.01517    0.01339   -35.1     -33.2
lose     0.00008    0.00002   -44.5     -44.0
weight   0.00016    0.00002   -53.3     -55.0
while    0.00027    0.00027   -61.5     -63.2
you      0.00881    0.00304   -66.2     -69.0
sleep    0.00006    0.00001   -76.0     -80.5

(The "Tot" columns are running sums of natural-log probabilities.)

P(spam | w) = 98.9%
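Since multiplying many small probabilities underflows floating point, the running totals above are sums of log probabilities. A sketch that reproduces the table's arithmetic with the values shown:

```python
import math

prior = {"spam": 0.33333, "ham": 0.66666}
words = ["Gary", "would", "you", "like", "to",
         "lose", "weight", "while", "you", "sleep"]
p_w = {  # word: (P(w|spam), P(w|ham)), copied from the table above
    "Gary": (0.00002, 0.00021), "would": (0.00069, 0.00084),
    "you": (0.00881, 0.00304), "like": (0.00086, 0.00083),
    "to": (0.01517, 0.01339), "lose": (0.00008, 0.00002),
    "weight": (0.00016, 0.00002), "while": (0.00027, 0.00027),
    "sleep": (0.00006, 0.00001),
}

tot_spam = math.log(prior["spam"])   # -1.1, the (prior) row
tot_ham = math.log(prior["ham"])     # -0.4
for w in words:
    tot_spam += math.log(p_w[w][0])  # running "Tot Spam" column
    tot_ham += math.log(p_w[w][1])   # running "Tot Ham" column

# Normalize in log space to recover the posterior
p_spam = 1.0 / (1.0 + math.exp(tot_ham - tot_spam))
print(round(100 * p_spam, 1))  # ~98.7 with these rounded table values;
                               # the slide's 98.9% uses unrounded probabilities
```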
Page 18
Training and Testing
Page 19
Important Concepts
§ Data: labeled instances, e.g. emails marked spam/ham
§ Training set
§ Held out set
§ Test set
§ Features: attribute-value pairs which characterize each x
§ Experimentation cycle
§ Learn parameters (e.g. model probabilities) on training set
§ (Tune hyperparameters on held-out set)
§ Compute accuracy on test set
§ Very important: never "peek" at the test set!
§ Evaluation
§ Accuracy: fraction of instances predicted correctly
§ Overfitting and generalization
§ Want a classifier which does well on test data
§ Overfitting: fitting the training data very closely, but not generalizing well
§ Underfitting: fits the training set poorly
[Diagram: the labeled data split into Training Data, Held-Out Data, and Test Data]
Page 20
Underfitting and Overfitting
Page 21
[Plot: a degree-15 polynomial fit to a handful of data points, x from 0 to 20, y from -15 to 30, oscillating wildly between the points]
Degree 15 polynomial
Overfitting
Page 22
Example: Overfitting
2 wins!!
Page 23
Example: Overfitting
§ Posteriors determined by relative probabilities (odds ratios):

P(W|ham)/P(W|spam), words seen only in ham:
south-west : inf
nation : inf
morally : inf
nicely : inf
extent : inf
seriously : inf
...

P(W|spam)/P(W|ham), words seen only in spam:
screens : inf
minute : inf
guaranteed : inf
$205.00 : inf
delivery : inf
signature : inf
...

What went wrong here?
Page 24
Generalization and Overfitting
§ Relative frequency parameters will overfit the training data!
§ Just because we never saw a 3 with pixel (15,15) on during training doesn't mean we won't see it at test time
§ Unlikely that every occurrence of "minute" is 100% spam
§ Unlikely that every occurrence of "seriously" is 100% ham
§ What about all the words that don't occur in the training set at all?
§ In general, we can't go around giving unseen events zero probability
§ As an extreme case, imagine using the entire email as the only feature
§ Would get the training data perfect (if deterministic labeling)
§ Wouldn't generalize at all
§ Just making the bag-of-words assumption gives us some generalization, but isn't enough
§ To generalize better: we need to smooth or regularize the estimates
Page 25
Parameter Estimation
Page 26
Parameter Estimation
§ Estimating the distribution of a random variable
§ Elicitation: ask a human (why is this hard?)
§ Empirically: use training data (learning!)
§ E.g.: for each outcome x, look at the empirical rate of that value: P_ML(x) = count(x) / total samples (sketched below)
§ This is the estimate that maximizes the likelihood of the data
[Image: draws of red (r) and blue (b) balls, e.g. the sequence r r b]
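A sketch of the empirical-rate estimate using Python's collections.Counter; the red/blue draws mirror the ball example on the slide:

```python
from collections import Counter

def ml_estimate(samples):
    """Relative-frequency estimate: P_ML(x) = count(x) / total samples."""
    counts = Counter(samples)
    total = len(samples)
    return {x: c / total for x, c in counts.items()}

print(ml_estimate(["r", "r", "b"]))   # {'r': 2/3, 'b': 1/3}
```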
Page 27
Your First Consulting Job
§ A billionaire tech entrepreneur asks you a question:
§ He says: I have a thumbtack, if I flip it, what's the probability it will fall with the nail up?
§ You say: Please flip it a few times:
§ You say: The probability is:
§ P(H) = 3/5
§ He says: Why???
§ You say: Because…
Page 28
Your First Consulting Job
§ P(Heads) = θ, P(Tails) = 1 - θ
§ Flips are i.i.d.:
§ Independent events
§ Identically distributed according to unknown distribution
§ Sequence D of α_H Heads and α_T Tails
D = {x_i | i = 1…n},  P(D | θ) = ∏_i P(x_i | θ)
Page 29
Maximum Likelihood Estimation
§ Data: Observed set D of α_H Heads and α_T Tails
§ Hypothesis space: Bernoulli distributions
§ Learning: finding θ is an optimization problem
§ What's the objective function? The likelihood P(D | θ) = θ^α_H (1 - θ)^α_T
§ MLE: Choose θ to maximize probability of D: θ̂ = argmax_θ P(D | θ)
Page 30
Maximum Likelihood Estimation
§ Set derivative to zero, and solve!
θ̂ = argmax_θ ln P(D | θ)

d/dθ ln P(D | θ) = d/dθ [ln θ^α_H (1 - θ)^α_T]
                 = d/dθ [α_H ln θ + α_T ln(1 - θ)]
                 = α_H d/dθ ln θ + α_T d/dθ ln(1 - θ)
                 = α_H / θ - α_T / (1 - θ) = 0

Solving for θ: α_H (1 - θ) = α_T θ, so θ̂ = α_H / (α_H + α_T)
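A quick numeric check of the closed form against the thumbtack flips from the consulting slide (3 heads out of 5 flips):

```python
alpha_H, alpha_T = 3, 2                    # counts from the five flips
theta_hat = alpha_H / (alpha_H + alpha_T)  # the MLE derived above
print(theta_hat)                           # 0.6, i.e. P(H) = 3/5
```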
Page 32
Maximum Likelihood?
§ Relative frequencies are the maximum likelihood estimates:
θ_ML = argmax_θ P(D | θ), which works out to P_ML(x) = count(x) / total samples
§ Another option is to consider the most likely parameter value given the data:
θ_MAP = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ) / P(D)
Page 34
Laplace Smoothing
§ Laplace's estimate:
§ Pretend you saw every outcome once more than you actually did:
P_LAP(x) = (c(x) + 1) / (N + |X|), where c(x) is the observed count, N the number of samples, and |X| the number of possible outcomes
§ Can derive this estimate with Dirichlet priors (see cs281a)
[Example: draws r r b give P_ML = ⟨2/3, 1/3⟩ but P_LAP = ⟨3/5, 2/5⟩]
Page 35
Laplace Smoothing
§ Laplace's estimate (extended):
§ Pretend you saw every outcome k extra times:
P_LAP,k(x) = (c(x) + k) / (N + k|X|)
§ What's Laplace with k = 0?
§ k is the strength of the prior
§ Laplace for conditionals:
§ Smooth each condition independently:
P_LAP,k(x|y) = (c(x,y) + k) / (c(y) + k|X|)
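A minimal sketch of the extended estimate, assuming `outcomes` enumerates every possible value so that unseen outcomes also receive k pseudo-counts (function name illustrative):

```python
from collections import Counter

def laplace_estimate(samples, outcomes, k=1):
    """P_LAP,k(x) = (c(x) + k) / (N + k|X|); k = 0 is the ML estimate."""
    counts = Counter(samples)
    denom = len(samples) + k * len(outcomes)
    return {x: (counts[x] + k) / denom for x in outcomes}

print(laplace_estimate(["r", "r", "b"], ["r", "b"], k=0))  # {'r': 2/3, 'b': 1/3}
print(laplace_estimate(["r", "r", "b"], ["r", "b"], k=1))  # {'r': 0.6, 'b': 0.4}
```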
Page 36
Real NB: Smoothing
§ For real classification problems, smoothing is critical
§ New odds ratios:

P(W|ham)/P(W|spam):
helvetica : 11.4
seems : 10.8
group : 10.2
ago : 8.4
areas : 8.3
...

P(W|spam)/P(W|ham):
verdana : 28.8
Credit : 28.4
ORDER : 27.2
<FONT> : 26.9
money : 26.5
...

Do these make more sense?
Page 38
Tuning on Held-Out Data
§ Now we've got two kinds of unknowns
§ Parameters: the probabilities P(X|Y), P(Y)
§ Hyperparameters: e.g. the amount/type of smoothing to do, k, α
§ What should we learn where?
§ Learn parameters from training data
§ Tune hyperparameters on different data
§ Why?
§ For each value of the hyperparameters, train and test on the held-out data
§ Choose the best value and do a final test on the test data
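A sketch of the loop this slide describes. `train` and `accuracy` are hypothetical helpers standing in for "estimate smoothed P(Y) and P(F|Y) from data" and "fraction predicted correctly"; they are not a real library API:

```python
def tune_and_evaluate(train_set, held_out_set, test_set,
                      k_values=(0.001, 0.01, 0.1, 1, 5, 10)):
    """Pick smoothing strength k on held-out data; touch the test set once."""
    best_k, best_model, best_acc = None, None, -1.0
    for k in k_values:
        model = train(train_set, k=k)        # hypothetical: learn parameters
        acc = accuracy(model, held_out_set)  # hypothetical: held-out accuracy
        if acc > best_acc:
            best_k, best_model, best_acc = k, model, acc
    # Final test accuracy is reported once and never used to choose k
    return best_k, accuracy(best_model, test_set)
```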
Page 39
Summary
§ Bayes rule lets us do diagnostic queries with causal probabilities
§ The naïve Bayes assumption takes all features to be independent given the class label
§ We can build classifiers out of a naïve Bayes model using training data
§ Smoothing estimates is important in real systems