Page 1
CS188: Artificial Intelligence
Naïve Bayes
Instructors: Brijen Thananjeyan and Aditya Baradwaj --- University of California, Berkeley
[These slides were created by Dan Klein, Pieter Abbeel, Sergey Levine, with some materials from A. Farhadi. All CS188 materials are at http://ai.berkeley.edu.]
Page 2
Machine Learning
§ Up until now: how to use a model to make optimal decisions
§ Machine learning: how to acquire a model from data/experience
§ Learning parameters (e.g. probabilities)
§ Learning structure (e.g. BN graphs)
§ Learning hidden concepts (e.g. clustering)
§ Today: model-based classification with Naive Bayes
Page 4
Example: Spam Filter
§ Input: an email
§ Output: spam/ham
§ Setup:
§ Get a large collection of example emails, each labeled "spam" or "ham"
§ Note: someone has to hand label all this data!
§ Want to learn to predict labels of new, future emails
§ Features: The attributes used to make the ham/spam decision
§ Words: FREE!
§ Text Patterns: $dd, CAPS
§ Non-text: Sender In Contacts
§ …
Dear Sir.
First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …
TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.
99 MILLION EMAIL ADDRESSES FOR ONLY $99
Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.
Page 5
Example: Digit Recognition
§ Input: images / pixel grids
§ Output: a digit 0-9
§ Setup:
§ Get a large collection of example images, each labeled with a digit
§ Note: someone has to hand label all this data!
§ Want to learn to predict labels of new, future digit images
§ Features: The attributes used to make the digit decision
§ Pixels: (6,8) = ON
§ Shape Patterns: NumComponents, AspectRatio, NumLoops
§ …
[Example digit images with labels: 0, 1, 2, 1, and a query image labeled ??]
Page 6
Other Classification Tasks
§ Classification: given inputs x, predict labels (classes) y
§ Examples:
§ Spam detection (input: document, classes: spam/ham)
§ OCR (input: images, classes: characters)
§ Medical diagnosis (input: symptoms, classes: diseases)
§ Automatic essay grading (input: document, classes: grades)
§ Fraud detection (input: account activity, classes: fraud/no fraud)
§ Customer service email routing
§ …many more
§ Classification is an important commercial technology!
Page 7
Model-Based Classification
Page 8
Model-Based Classification
§ Model-based approach
§ Build a model (e.g. Bayes' net) where both the label and features are random variables
§ Instantiate any observed features
§ Query for the distribution of the label conditioned on the features
§ Challenges
§ What structure should the BN have?
§ How should we learn its parameters?
Page 9
Naïve Bayes for Digits
§ Naïve Bayes: Assume all features are independent effects of the label
§ Simple digit recognition version:
§ One feature (variable) F_ij for each grid position <i,j>
§ Feature values are on/off, based on whether intensity is more or less than 0.5 in the underlying image
§ Each input maps to a feature vector of 0s and 1s, one per pixel (see the sketch below)
§ Here: lots of features, each is binary valued
§ Naïve Bayes model: P(Y, F_0,0, …, F_15,15) = P(Y) ∏_{i,j} P(F_i,j | Y)
§ What do we need to learn?
[Bayes' net diagram: label Y with children F1, F2, …, Fn]
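As a minimal sketch of the feature mapping just described, assuming the image arrives as a 2D list of grayscale intensities in [0, 1] (the 0.5 threshold is from the slide; the function name is illustrative):

```python
def extract_features(image):
    """Map a grayscale image (2D list of intensities in [0, 1])
    to binary features F_ij, one per grid position <i, j>."""
    features = {}
    for i, row in enumerate(image):
        for j, intensity in enumerate(row):
            # Feature is "on" exactly when intensity exceeds 0.5
            features[(i, j)] = 1 if intensity > 0.5 else 0
    return features
```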
Page 10
General Naïve Bayes
§ A general Naive Bayes model: P(Y, F1 … Fn) = P(Y) ∏_i P(Fi | Y)
§ We only have to specify how each feature depends on the class
§ Total number of parameters is linear in n
§ Model is very simplistic, but often works anyway
[Bayes' net diagram: label Y with children F1, F2, …, Fn]
§ P(Y): |Y| parameters
§ P(Fi|Y) tables: n × |F| × |Y| parameters
§ Compare: the full joint over Y, F1 … Fn has |Y| × |F|^n values (worked count below)
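To make the count concrete, a quick check for the digit model (10 classes; a 16x16 grid of binary pixel features is assumed here for illustration):

```python
n_classes = 10        # |Y|: digits 0-9
n_features = 16 * 16  # n: one binary feature per grid position (assumed size)
n_values = 2          # |F|: each feature is on/off

# Naive Bayes: prior plus one conditional table per feature: linear in n
nb_params = n_classes + n_features * n_values * n_classes
print(nb_params)                            # 5130

# Full joint over (Y, F1 ... Fn): exponential in n
print(n_classes * n_values ** n_features)   # 10 * 2**256, hopeless to store
```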
Page 11
Inference for Naïve Bayes
§ Goal: compute posterior distribution over label variable Y
§ Step 1: get joint probability of label and evidence for each label: P(y, f1 … fn) = P(y) ∏_i P(fi | y)
§ Step 2: sum to get probability of evidence: P(f1 … fn) = Σ_y P(y, f1 … fn)
§ Step 3: normalize by dividing Step 1 by Step 2: P(y | f1 … fn) = P(y, f1 … fn) / P(f1 … fn)
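A minimal sketch of the three steps in code, assuming `prior` maps each label y to P(y) and `likelihoods` maps y to the list of P(fi | y) values for the observed evidence (both names are illustrative):

```python
def naive_bayes_posterior(prior, likelihoods):
    """Compute P(Y | f1 ... fn) from P(Y) and the observed P(fi | y)."""
    # Step 1: joint probability for each label:
    # P(y, f1 ... fn) = P(y) * prod_i P(fi | y)
    joint = {}
    for y, p_y in prior.items():
        p = p_y
        for p_f in likelihoods[y]:
            p *= p_f
        joint[y] = p

    # Step 2: probability of evidence: P(f1 ... fn) = sum_y P(y, f1 ... fn)
    evidence = sum(joint.values())

    # Step 3: normalize Step 1 by Step 2
    return {y: p / evidence for y, p in joint.items()}
```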
Page 12
General Naïve Bayes
§ What do we need in order to use Naïve Bayes?
§ Inference method (we just saw this part)
§ Start with a bunch of probabilities: P(Y) and the P(Fi|Y) tables
§ Use standard inference to compute P(Y|F1…Fn)
§ Nothing new here
§ Estimates of local conditional probability tables
§ P(Y), the prior over labels
§ P(Fi|Y) for each feature (evidence variable)
§ These probabilities are collectively called the parameters of the model and denoted by θ
§ Up until now, we assumed these appeared by magic, but…
§ …they typically come from training data counts: we'll look at this soon
Page 13
Example: Conditional Probabilities

P(Y), a uniform prior over digits:
y     1    2    3    4    5    6    7    8    9    0
P(y)  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1

P(F = on | Y) for one pixel feature:
y        1     2     3     4     5     6     7     8     9     0
P(on|y)  0.01  0.05  0.05  0.30  0.80  0.90  0.05  0.60  0.50  0.80

P(F = on | Y) for another pixel feature:
y        1     2     3     4     5     6     7     8     9     0
P(on|y)  0.05  0.01  0.90  0.80  0.90  0.90  0.25  0.85  0.60  0.80
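Running the three inference steps from the previous slide on these tables, under the assumption that both pixel features are observed "on", picks out the most likely digit. A sketch using the values above:

```python
p_y = 0.1  # uniform prior over the ten digits
f1_on = {1: 0.01, 2: 0.05, 3: 0.05, 4: 0.30, 5: 0.80,
         6: 0.90, 7: 0.05, 8: 0.60, 9: 0.50, 0: 0.80}
f2_on = {1: 0.05, 2: 0.01, 3: 0.90, 4: 0.80, 5: 0.90,
         6: 0.90, 7: 0.25, 8: 0.85, 9: 0.60, 0: 0.80}

joint = {y: p_y * f1_on[y] * f2_on[y] for y in f1_on}     # Step 1
evidence = sum(joint.values())                            # Step 2
posterior = {y: p / evidence for y, p in joint.items()}   # Step 3
print(max(posterior, key=posterior.get))                  # 6 wins here
```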
Page 14
A Spam Filter
§ Naïve Bayes spam filter
§ Data:
§ Collection of emails, labeled spam or ham
§ Note: someone has to hand label all this data!
§ Split into training, held-out, test sets
§ Classifiers
§ Learn on the training set
§ (Tune it on a held-out set)
§ Test it on new emails
Dear Sir.
First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …
TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.
99 MILLION EMAIL ADDRESSES FOR ONLY $99
Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.
Page 15
Naïve Bayes for Text
§ Bag-of-words Naïve Bayes:
§ Features: Wi is the word at position i
§ As before: predict label conditioned on feature variables (spam vs. ham)
§ As before: assume features are conditionally independent given label
§ New: each Wi is identically distributed
§ Generative model: P(Y, W1 … Wn) = P(Y) ∏_i P(Wi | Y)
§ "Tied" distributions and bag-of-words
§ Usually, each variable gets its own conditional probability distribution P(F|Y)
§ In a bag-of-words model
§ Each position is identically distributed
§ All positions share the same conditional probs P(W|Y) (see the sketch below)
§ Why make this assumption?
§ Called "bag-of-words" because the model is insensitive to word order or reordering
Wi is the word at position i, not the i-th word in the dictionary!
Example: "When the lecture is over, remember to wake up the person sitting next to you in the lecture room." becomes the sorted bag:
in is lecture lecture next over person remember room sitting the the the to to up wake when you
How many variables are there? How many values?
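A minimal sketch of the tied-distribution scoring described above: every position shares one P(W|Y) table, so reordering the words cannot change the result. Names are illustrative:

```python
def bag_of_words_joint(label, words, prior, word_probs):
    """P(y, w1 ... wn) = P(y) * prod_i P(wi | y), with one shared
    P(W|Y) table per label instead of a table per position."""
    p = prior[label]
    for w in words:
        p *= word_probs[label][w]   # same table for every position i
    return p

# Reordering `words` multiplies the same factors, so the joint is
# unchanged: the model only sees the "bag", not the order.
```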
Page 16
Example: Spam Filtering
§ Model: P(Y, W1 … Wn) = P(Y) ∏_i P(Wi | Y)
§ What are the parameters?
§ Where do these tables come from?

P(Y):
ham : 0.66
spam: 0.33

P(W|Y=ham):
the : 0.0156
to  : 0.0153
and : 0.0115
of  : 0.0095
you : 0.0093
a   : 0.0086
with: 0.0080
from: 0.0075
...

P(W|Y=spam):
the : 0.0210
to  : 0.0133
of  : 0.0119
2002: 0.0110
with: 0.0108
from: 0.0107
and : 0.0105
a   : 0.0100
...
Page 17
Spam Example

Word     P(w|spam)  P(w|ham)  Tot Spam  Tot Ham
(prior)  0.33333    0.66666   -1.1      -0.4
Gary     0.00002    0.00021   -11.8     -8.9
would    0.00069    0.00084   -19.1     -16.0
you      0.00881    0.00304   -23.8     -21.8
like     0.00086    0.00083   -30.9     -28.9
to       0.01517    0.01339   -35.1     -33.2
lose     0.00008    0.00002   -44.5     -44.0
weight   0.00016    0.00002   -53.3     -55.0
while    0.00027    0.00027   -61.5     -63.2
you      0.00881    0.00304   -66.2     -69.0
sleep    0.00006    0.00001   -76.0     -80.5

(The "Tot" columns are running sums of natural-log probabilities.)

P(spam | w) = 98.9%
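Since multiplying many small probabilities underflows floating point, the running totals above are sums of log probabilities. A sketch that reproduces the table's arithmetic with the values shown:

```python
import math

prior = {"spam": 0.33333, "ham": 0.66666}
words = ["Gary", "would", "you", "like", "to",
         "lose", "weight", "while", "you", "sleep"]
p_w = {  # word: (P(w|spam), P(w|ham)), copied from the table above
    "Gary": (0.00002, 0.00021), "would": (0.00069, 0.00084),
    "you": (0.00881, 0.00304), "like": (0.00086, 0.00083),
    "to": (0.01517, 0.01339), "lose": (0.00008, 0.00002),
    "weight": (0.00016, 0.00002), "while": (0.00027, 0.00027),
    "sleep": (0.00006, 0.00001),
}

tot_spam = math.log(prior["spam"])   # -1.1, the (prior) row
tot_ham = math.log(prior["ham"])     # -0.4
for w in words:
    tot_spam += math.log(p_w[w][0])  # running "Tot Spam" column
    tot_ham += math.log(p_w[w][1])   # running "Tot Ham" column

# Normalize in log space to recover the posterior
p_spam = 1.0 / (1.0 + math.exp(tot_ham - tot_spam))
print(round(100 * p_spam, 1))  # ~98.7 with these rounded table values;
                               # the slide's 98.9% uses unrounded probabilities
```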
Page 18
Training and Testing
Page 19
Important Concepts
§ Data: labeled instances, e.g. emails marked spam/ham
§ Training set
§ Held out set
§ Test set
§ Features: attribute-value pairs which characterize each x
§ Experimentation cycle
§ Learn parameters (e.g. model probabilities) on training set
§ (Tune hyperparameters on held-out set)
§ Compute accuracy on test set
§ Very important: never "peek" at the test set!
§ Evaluation
§ Accuracy: fraction of instances predicted correctly
§ Overfitting and generalization
§ Want a classifier which does well on test data
§ Overfitting: fitting the training data very closely, but not generalizing well
§ Underfitting: fits the training set poorly
[Diagram: the labeled data split into Training Data, Held-Out Data, and Test Data]
Page 20
Underfitting and Overfitting
Page 21
[Plot: a degree-15 polynomial fit to a handful of data points, x from 0 to 20, y from -15 to 30, oscillating wildly between the points]
Degree 15 polynomial
Overfitting
Page 22
Example: Overfitting
2 wins!!
Page 23
Example: Overfitting
§ Posteriors determined by relative probabilities (odds ratios):

P(W|ham)/P(W|spam), words seen only in ham:
south-west : inf
nation : inf
morally : inf
nicely : inf
extent : inf
seriously : inf
...

P(W|spam)/P(W|ham), words seen only in spam:
screens : inf
minute : inf
guaranteed : inf
$205.00 : inf
delivery : inf
signature : inf
...

What went wrong here?
Page 24
Generalization and Overfitting
§ Relative frequency parameters will overfit the training data!
§ Just because we never saw a 3 with pixel (15,15) on during training doesn't mean we won't see it at test time
§ Unlikely that every occurrence of "minute" is 100% spam
§ Unlikely that every occurrence of "seriously" is 100% ham
§ What about all the words that don't occur in the training set at all?
§ In general, we can't go around giving unseen events zero probability
§ As an extreme case, imagine using the entire email as the only feature
§ Would get the training data perfect (if deterministic labeling)
§ Wouldn't generalize at all
§ Just making the bag-of-words assumption gives us some generalization, but isn't enough
§ To generalize better: we need to smooth or regularize the estimates
Page 25
Parameter Estimation
Page 26
Parameter Estimation
§ Estimating the distribution of a random variable
§ Elicitation: ask a human (why is this hard?)
§ Empirically: use training data (learning!)
§ E.g.: for each outcome x, look at the empirical rate of that value: P_ML(x) = count(x) / total samples (sketched below)
§ This is the estimate that maximizes the likelihood of the data
[Image: draws of red (r) and blue (b) balls, e.g. the sequence r r b]
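A sketch of the empirical-rate estimate using Python's collections.Counter; the red/blue draws mirror the ball example on the slide:

```python
from collections import Counter

def ml_estimate(samples):
    """Relative-frequency estimate: P_ML(x) = count(x) / total samples."""
    counts = Counter(samples)
    total = len(samples)
    return {x: c / total for x, c in counts.items()}

print(ml_estimate(["r", "r", "b"]))   # {'r': 2/3, 'b': 1/3}
```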
Page 27
Your First Consulting Job
§ A billionaire tech entrepreneur asks you a question:
§ He says: I have a thumbtack, if I flip it, what's the probability it will fall with the nail up?
§ You say: Please flip it a few times:
§ You say: The probability is:
§ P(H) = 3/5
§ He says: Why???
§ You say: Because…
Page 28
Your First Consulting Job
§ P(Heads) = θ, P(Tails) = 1 - θ
§ Flips are i.i.d.:
§ Independent events
§ Identically distributed according to unknown distribution
§ Sequence D of α_H Heads and α_T Tails
D = {x_i | i = 1…n},  P(D | θ) = ∏_i P(x_i | θ)
Page 29
Maximum Likelihood Estimation
§ Data: Observed set D of α_H Heads and α_T Tails
§ Hypothesis space: Bernoulli distributions
§ Learning: finding θ is an optimization problem
§ What's the objective function? The likelihood P(D | θ) = θ^α_H (1 - θ)^α_T
§ MLE: Choose θ to maximize probability of D: θ̂ = argmax_θ P(D | θ)
Page 30
Maximum Likelihood Estimation
§ Set derivative to zero, and solve!
θ̂ = argmax_θ ln P(D | θ)

d/dθ ln P(D | θ) = d/dθ [ln θ^α_H (1 - θ)^α_T]
                 = d/dθ [α_H ln θ + α_T ln(1 - θ)]
                 = α_H d/dθ ln θ + α_T d/dθ ln(1 - θ)
                 = α_H / θ - α_T / (1 - θ) = 0

Solving for θ: α_H (1 - θ) = α_T θ, so θ̂ = α_H / (α_H + α_T)
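A quick numeric check of the closed form against the thumbtack flips from the consulting slide (3 heads out of 5 flips):

```python
alpha_H, alpha_T = 3, 2                    # counts from the five flips
theta_hat = alpha_H / (alpha_H + alpha_T)  # the MLE derived above
print(theta_hat)                           # 0.6, i.e. P(H) = 3/5
```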
Page 32
Maximum Likelihood?
§ Relative frequencies are the maximum likelihood estimates:
θ_ML = argmax_θ P(D | θ), which works out to P_ML(x) = count(x) / total samples
§ Another option is to consider the most likely parameter value given the data:
θ_MAP = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ) / P(D)
Page 34
Laplace Smoothing
§ Laplace's estimate:
§ Pretend you saw every outcome once more than you actually did:
P_LAP(x) = (c(x) + 1) / (N + |X|), where c(x) is the observed count, N the number of samples, and |X| the number of possible outcomes
§ Can derive this estimate with Dirichlet priors (see cs281a)
[Example: draws r r b give P_ML = ⟨2/3, 1/3⟩ but P_LAP = ⟨3/5, 2/5⟩]
Page 35
Laplace Smoothing
§ Laplace's estimate (extended):
§ Pretend you saw every outcome k extra times:
P_LAP,k(x) = (c(x) + k) / (N + k|X|)
§ What's Laplace with k = 0?
§ k is the strength of the prior
§ Laplace for conditionals:
§ Smooth each condition independently:
P_LAP,k(x|y) = (c(x,y) + k) / (c(y) + k|X|)
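A minimal sketch of the extended estimate, assuming `outcomes` enumerates every possible value so that unseen outcomes also receive k pseudo-counts (function name illustrative):

```python
from collections import Counter

def laplace_estimate(samples, outcomes, k=1):
    """P_LAP,k(x) = (c(x) + k) / (N + k|X|); k = 0 is the ML estimate."""
    counts = Counter(samples)
    denom = len(samples) + k * len(outcomes)
    return {x: (counts[x] + k) / denom for x in outcomes}

print(laplace_estimate(["r", "r", "b"], ["r", "b"], k=0))  # {'r': 2/3, 'b': 1/3}
print(laplace_estimate(["r", "r", "b"], ["r", "b"], k=1))  # {'r': 0.6, 'b': 0.4}
```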
Page 36
Real NB: Smoothing
§ For real classification problems, smoothing is critical
§ New odds ratios:

P(W|ham)/P(W|spam):
helvetica : 11.4
seems : 10.8
group : 10.2
ago : 8.4
areas : 8.3
...

P(W|spam)/P(W|ham):
verdana : 28.8
Credit : 28.4
ORDER : 27.2
<FONT> : 26.9
money : 26.5
...

Do these make more sense?
Page 38
Tuning on Held-Out Data
§ Now we've got two kinds of unknowns
§ Parameters: the probabilities P(X|Y), P(Y)
§ Hyperparameters: e.g. the amount/type of smoothing to do, k, α
§ What should we learn where?
§ Learn parameters from training data
§ Tune hyperparameters on different data
§ Why?
§ For each value of the hyperparameters, train and test on the held-out data
§ Choose the best value and do a final test on the test data
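A sketch of the loop this slide describes. `train` and `accuracy` are hypothetical helpers standing in for "estimate smoothed P(Y) and P(F|Y) from data" and "fraction predicted correctly"; they are not a real library API:

```python
def tune_and_evaluate(train_set, held_out_set, test_set,
                      k_values=(0.001, 0.01, 0.1, 1, 5, 10)):
    """Pick smoothing strength k on held-out data; touch the test set once."""
    best_k, best_model, best_acc = None, None, -1.0
    for k in k_values:
        model = train(train_set, k=k)        # hypothetical: learn parameters
        acc = accuracy(model, held_out_set)  # hypothetical: held-out accuracy
        if acc > best_acc:
            best_k, best_model, best_acc = k, model, acc
    # Final test accuracy is reported once and never used to choose k
    return best_k, accuracy(best_model, test_set)
```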
Page 39
Summary
§ Bayes rule lets us do diagnostic queries with causal probabilities
§ The naïve Bayes assumption takes all features to be independent given the class label
§ We can build classifiers out of a naïve Bayes model using training data
§ Smoothing estimates is important in real systems