Machine Learning
The Naïve Bayes Classifier
Today's lecture
• The naïve Bayes classifier
• Learning the naïve Bayes classifier
• Practical concerns
Where are we?

We have seen Bayesian learning:
– Using a probabilistic criterion to select a hypothesis
– Maximum a posteriori and maximum likelihood learning
– You should know the difference between them

We could also learn functions that predict probabilities of outcomes:
– This is different from using a probabilistic criterion to learn
– Maximum a posteriori (MAP) prediction, as opposed to MAP learning
MAP prediction

Using the Bayes rule for predicting y given an input x:

\[
P(Y = y \mid X = \mathbf{x}) = \frac{P(X = \mathbf{x} \mid Y = y)\, P(Y = y)}{P(X = \mathbf{x})}
\]

This is the posterior probability of the label being y for this input x.

Predict the label y for the input x using

\[
\operatorname{argmax}_y \; \frac{P(X = \mathbf{x} \mid Y = y)\, P(Y = y)}{P(X = \mathbf{x})}
\]

The denominator does not depend on y, so this is equivalent to

\[
\operatorname{argmax}_y \; P(X = \mathbf{x} \mid Y = y)\, P(Y = y)
\]

Don't confuse this with MAP learning, which finds a hypothesis by maximizing its posterior probability. Here:

• P(X = x | Y = y) is the likelihood of observing this input x when the label is y
• P(Y = y) is the prior probability of the label being y

All we need are these two sets of probabilities.
Example: Tennis again

Likelihoods:

Temperature  Wind    P(T, W | Tennis = Yes)
Hot          Strong  0.15
Hot          Weak    0.4
Cold         Strong  0.1
Cold         Weak    0.35

Temperature  Wind    P(T, W | Tennis = No)
Hot          Strong  0.4
Hot          Weak    0.1
Cold         Strong  0.3
Cold         Weak    0.2

Prior:

Play tennis  P(Play tennis)
Yes          0.3
No           0.7

• Without any other information, what is the prior probability that I should play tennis?
• On days that I do play tennis, what is the probability that the temperature is T and the wind is W?
• On days that I don't play tennis, what is the probability that the temperature is T and the wind is W?

Input: Temperature = Hot (H), Wind = Weak (W). Should I play tennis?

argmax_y P(H, W | play?) P(play?)

P(H, W | Yes) P(Yes) = 0.4 × 0.3 = 0.12
P(H, W | No) P(No) = 0.1 × 0.7 = 0.07

MAP prediction = Yes
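The comparison above is mechanical enough to script. Here is a minimal Python sketch (the dictionary layout and function name are ours, not from the lecture) that reproduces the 0.12 vs. 0.07 computation:

```python
# Likelihood and prior tables from the example above.
likelihood = {
    "Yes": {("Hot", "Strong"): 0.15, ("Hot", "Weak"): 0.40,
            ("Cold", "Strong"): 0.10, ("Cold", "Weak"): 0.35},
    "No":  {("Hot", "Strong"): 0.40, ("Hot", "Weak"): 0.10,
            ("Cold", "Strong"): 0.30, ("Cold", "Weak"): 0.20},
}
prior = {"Yes": 0.3, "No": 0.7}

def map_predict(temperature, wind):
    """Return the label maximizing P(T, W | label) * P(label)."""
    scores = {label: likelihood[label][(temperature, wind)] * prior[label]
              for label in prior}
    return max(scores, key=scores.get), scores

label, scores = map_predict("Hot", "Weak")
print(scores)  # roughly {'Yes': 0.12, 'No': 0.07}, up to floating point
print(label)   # Yes
```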
How hard is it to learn probabilistic models?

     O  T  H  W  Play?
 1   S  H  H  W  -
 2   S  H  H  S  -
 3   O  H  H  W  +
 4   R  M  H  W  +
 5   R  C  N  W  +
 6   R  C  N  S  -
 7   O  C  N  S  +
 8   S  M  H  W  -
 9   S  C  N  W  +
10   R  M  N  W  +
11   S  M  N  S  +
12   O  M  H  S  +
13   O  H  N  W  +
14   R  M  H  S  -

Outlook: S(unny), O(vercast), R(ainy)
Temperature: H(ot), M(edium), C(ool)
Humidity: H(igh), N(ormal), L(ow)
Wind: S(trong), W(eak)

We need to learn:
1. The prior P(Play?)
2. The likelihoods P(x | Play?)
Prior P(play?):
• A single number (why only one?)

Likelihood P(X | Play?):
• There are 4 features, taking 3, 3, 3, and 2 values respectively
• For each value of Play? (+/−), we need a value for each possible assignment: P(O, T, H, W | Play?)
• That is, (3 ⋅ 3 ⋅ 3 ⋅ 2 − 1) parameters in each case, one for each assignment (if all four features were binary, it would be 2⁴ − 1)
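To make this count concrete, the arithmetic for this dataset written out:

\[
3 \cdot 3 \cdot 3 \cdot 2 = 54 \text{ joint assignments of } (O, T, H, W), \text{ so } 54 - 1 = 53 \text{ parameters per label,}
\]
\[
2 \times 53 + 1 \;(\text{the prior}) = 107 \text{ parameters in total, to be estimated from just 14 examples.}
\]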
In general:

Prior P(Y):
• If there are k labels, then k − 1 parameters (why not k? Because the probabilities must sum to one, the last one is determined by the rest)

Likelihood P(X | Y):
• If there are d Boolean features, we need a value for each possible P(x_1, x_2, …, x_d | y), for each y
• That is, k(2^d − 1) parameters

We need a lot of data to estimate this many numbers!

This is high model complexity: if there is very limited data, there will be high variance in the parameters.

How can we deal with this? Answer: make independence assumptions.
Recall: Conditional independence

Suppose X, Y, and Z are random variables. X is conditionally independent of Y given Z if the probability distribution of X is independent of the value of Y when Z is observed:

\[
P(X \mid Y, Z) = P(X \mid Z)
\]

Or equivalently:

\[
P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)
\]
Modeling the features

P(x_1, x_2, …, x_d | y) required k(2^d − 1) parameters.

What if all the features were conditionally independent given the label? That is,

\[
P(x_1, x_2, \cdots, x_d \mid y) = P(x_1 \mid y)\, P(x_2 \mid y) \cdots P(x_d \mid y)
\]

This is the naïve Bayes assumption. It requires only d numbers for each label: kd parameters overall. Not bad!
The naïve Bayes classifier

Assumption: features are conditionally independent given the label Y.

To predict, we need two sets of probabilities:
– The prior P(y)
– For each x_j, the likelihood P(x_j | y)

Decision rule:

\[
h_{NB}(\mathbf{x}) = \operatorname{argmax}_y \; P(y)\, P(x_1, x_2, \cdots, x_d \mid y)
                   = \operatorname{argmax}_y \; P(y) \prod_j P(x_j \mid y)
\]
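As a concrete rendering of this decision rule, here is a minimal Python sketch (the data layout and names are ours). It sums log probabilities instead of multiplying, which computes the same argmax because log is monotonic, while avoiding numerical underflow when d is large:

```python
import math

def h_nb(x, priors, likelihoods):
    """Naive Bayes decision rule: argmax_y log P(y) + sum_j log P(x_j | y).

    priors:      dict mapping label y -> P(y)
    likelihoods: dict mapping label y -> list with one dict per feature j,
                 each mapping a feature value -> P(x_j = value | y)
    x:           the input, a sequence of feature values
    """
    def log_score(y):
        return math.log(priors[y]) + sum(
            math.log(likelihoods[y][j][x_j]) for j, x_j in enumerate(x))
    return max(priors, key=log_score)
```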
Decision boundaries of naïve Bayes

What is the decision boundary of the naïve Bayes classifier? Consider the two-class case. We predict the label to be + if

\[
P(y = +) \prod_j P(x_j \mid y = +) \;>\; P(y = -) \prod_j P(x_j \mid y = -)
\]

or equivalently, if

\[
\frac{P(y = +) \prod_j P(x_j \mid y = +)}{P(y = -) \prod_j P(x_j \mid y = -)} > 1
\]

Taking the log and simplifying, we get

\[
\log \frac{P(y = - \mid \mathbf{x})}{P(y = + \mid \mathbf{x})} = \mathbf{w}^\top \mathbf{x} + b
\]

This is a linear function of the feature space! Easy to prove; see the note on the course website.
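A sketch of why this holds for binary features, written in terms of the Bernoulli parameters a_j = P(x_j = 1 | y = +) and b_j = P(x_j = 1 | y = −) (these parameters are introduced formally later in this lecture; the course-website note gives the full proof):

\[
\log \frac{P(y=- \mid \mathbf{x})}{P(y=+ \mid \mathbf{x})}
= \log \frac{P(y=-)}{P(y=+)} + \sum_j \log \frac{P(x_j \mid y=-)}{P(x_j \mid y=+)}
\]

For a binary feature, \( \log P(x_j \mid y{=}+) = x_j \log a_j + (1 - x_j) \log(1 - a_j) \), which is linear in \( x_j \), and likewise with \( b_j \) for the negative class. Collecting the coefficient of each \( x_j \) into \( w_j \) and the constants into \( b \) gives exactly \( \mathbf{w}^\top \mathbf{x} + b \), with

\[
w_j = \log \frac{b_j (1 - a_j)}{a_j (1 - b_j)}, \qquad
b = \log \frac{P(y=-)}{P(y=+)} + \sum_j \log \frac{1 - b_j}{1 - a_j}.
\]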
Today's lecture
• The naïve Bayes classifier
• Learning the naïve Bayes classifier
• Practical concerns
Learning the naïve Bayes classifier

What is the hypothesis function h defined by? A collection of probabilities:
• The prior for each label: P(y)
• The likelihoods for feature x_j given a label: P(x_j | y)

A note on convention for this section:
• Examples in the dataset are indexed by the subscript i (e.g., x_i)
• Features within an example are indexed by the subscript j
• The j-th feature of the i-th example is x_ij

Suppose we have a dataset D = {(x_i, y_i)} with m examples, and we want to learn the classifier in a probabilistic way. What is a probabilistic criterion to select the hypothesis?
Maximum likelihood estimation

Given the dataset D = {(x_i, y_i)} with m examples, maximum likelihood estimation selects

\[
h_{ML} = \operatorname{argmax}_h \; P(D \mid h)
\]

Here h is defined by all the probabilities used to construct the naïve Bayes decision.

Each example in the dataset is independent and identically distributed, so we can represent P(D | h) as the product

\[
P(D \mid h) = \prod_{i=1}^{m} P(\mathbf{x}_i, y_i \mid h)
\]

Each factor asks: "What probability would this particular h assign to the pair (x_i, y_i)?" Factoring each term into a likelihood times a prior, and then applying the naïve Bayes assumption (recall that x_ij is the j-th feature of x_i):

\[
P(D \mid h) = \prod_{i=1}^{m} P(\mathbf{x}_i \mid y_i, h)\, P(y_i \mid h)
            = \prod_{i=1}^{m} P(y_i \mid h) \prod_j P(x_{ij} \mid y_i, h)
\]

How do we proceed?
We need to make a modeling assumption about the functional form of these probability distributions. For simplicity, suppose there are two labels (1 and 0) and all features are binary.

• Prior: P(y = 1) = p and P(y = 0) = 1 − p
  That is, the prior probability is from the Bernoulli distribution.

• Likelihood for each feature given a label:
  P(x_j = 1 | y = 1) = a_j and P(x_j = 0 | y = 1) = 1 − a_j
  P(x_j = 1 | y = 0) = b_j and P(x_j = 0 | y = 0) = 1 − b_j
  That is, the likelihood of each feature is also from the Bernoulli distribution.

h consists of p and all the a's and b's.

With this assumption, the prior factor for example i can be written as

\[
P(y_i \mid h) = p^{[y_i = 1]} (1 - p)^{[y_i = 0]}
\]

where [z] is called the indicator function or the Iverson bracket: its value is 1 if the argument z is true and zero otherwise. Likewise, each likelihood factor becomes

\[
P(x_{ij} \mid y_i, h) = \left( a_j^{[x_{ij} = 1]} (1 - a_j)^{[x_{ij} = 0]} \right)^{[y_i = 1]} \left( b_j^{[x_{ij} = 1]} (1 - b_j)^{[x_{ij} = 0]} \right)^{[y_i = 0]}
\]
Substituting and deriving the argmax (take the log of the likelihood, differentiate with respect to each parameter, and set the derivative to zero), we get

\[
p = P(y = 1) = \frac{1}{m} \sum_i [y_i = 1]
\]

\[
a_j = P(x_j = 1 \mid y = 1) = \frac{\sum_i [x_{ij} = 1 \wedge y_i = 1]}{\sum_i [y_i = 1]}
\qquad
b_j = P(x_j = 1 \mid y = 0) = \frac{\sum_i [x_{ij} = 1 \wedge y_i = 0]}{\sum_i [y_i = 0]}
\]

That is, each maximum likelihood estimate is just a count, normalized.
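These counting formulas translate directly into code. A minimal sketch, assuming a binary label vector y of shape (m,) and a binary feature matrix X of shape (m, d), with names mirroring the slide notation:

```python
import numpy as np

def mle_naive_bayes(X, y):
    """Maximum likelihood estimates for binary-feature naive Bayes."""
    p = np.mean(y == 1)         # P(y = 1): fraction of examples labeled 1
    a = X[y == 1].mean(axis=0)  # a_j = P(x_j = 1 | y = 1), one entry per feature
    b = X[y == 0].mean(axis=0)  # b_j = P(x_j = 1 | y = 0)
    return p, a, b
```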
Let's learn a naïve Bayes classifier

With the assumption that all our probabilities are from the Bernoulli distribution, we estimate from the table above:

P(Play = +) = 9/14        P(Play = −) = 5/14

P(O = S | Play = +) = 2/9
P(O = R | Play = +) = 3/9
P(O = O | Play = +) = 4/9

And so on, for the other attributes and also for Play = −.
Naïve Bayes: learning and prediction

• Learning
  – Count how often features occur with each label; normalize to get the likelihoods
  – Priors come from the fraction of examples with each label
  – Generalizes to multiclass
• Prediction
  – Use the learned probabilities to find the highest scoring label
Today's lecture
• The naïve Bayes classifier
• Learning the naïve Bayes classifier
• Practical concerns + an example
Important caveats with naïve Bayes

1. Features need not be conditionally independent given the label
   – Just because we assume that they are doesn't mean that that's how they behave in nature
   – We made a modeling assumption because it makes computation and learning easier

2. There may not be enough training data to get good estimates of the probabilities from counts
On caveat 1: all bets are off if the naïve Bayes assumption is not satisfied. And yet, naïve Bayes is very often used in practice because of its simplicity, and it works reasonably well even when the assumption is violated.
On caveat 2: the basic operation for learning likelihoods is counting how often a feature occurs with a label. What if we never see a particular feature with a particular label? E.g., suppose we never observe Temperature = cold with PlayTennis = Yes.

Should we treat those counts as zero? But that will make the corresponding probabilities zero, and a single zero likelihood zeroes out the entire product for that label.

Answer: smoothing
• Add fake counts (very small numbers, so that the counts are not zero)
• The Bayesian interpretation of smoothing: priors on the hypothesis (MAP learning)
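A minimal sketch of one common smoothing scheme, add-α (Laplace smoothing when α = 1), applied to the Bernoulli likelihoods above; the function name and the α parameter are ours:

```python
import numpy as np

def smoothed_likelihood(X, y, label, alpha=1.0):
    """Estimate P(x_j = 1 | y = label) with alpha fake counts.

    alpha is added to the count of ones, and 2 * alpha to the total
    (alpha for each of the two feature values), so every estimate stays
    strictly between 0 and 1 and still sums to one over the two values.
    """
    X_label = X[y == label]
    return (X_label.sum(axis=0) + alpha) / (len(X_label) + 2 * alpha)
```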
Example: classifying text

• Instance space: text documents
• Labels: Spam or NotSpam
• Goal: to learn a function that can predict whether a new document is Spam or NotSpam

How would you build a naïve Bayes classifier? Let us brainstorm: how to represent documents? How to estimate probabilities? How to classify?
1. Represent documents by a vector of words: a sparse vector consisting of one feature per word

2. Learning from N labeled documents
   1. Priors: the fraction of the N documents that carry each label
   2. For each word w in the vocabulary: count how often the word occurs with each label, normalize, and apply smoothing
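Putting the recipe together, a minimal end-to-end sketch in Python (the toy corpus, names, and the choice of word-count features with add-one smoothing are ours; other document representations fit the same recipe):

```python
import math
from collections import Counter

# A tiny labeled corpus standing in for the N training documents.
docs = [("buy cheap pills now", "Spam"),
        ("meeting notes attached", "NotSpam"),
        ("cheap pills cheap", "Spam"),
        ("lunch meeting tomorrow", "NotSpam")]

labels = {y for _, y in docs}
vocab = {w for text, _ in docs for w in text.split()}
label_counts = Counter(y for _, y in docs)    # for the priors
word_counts = {y: Counter() for y in labels}  # word-with-label counts
for text, y in docs:
    word_counts[y].update(text.split())

def predict(text):
    """Highest scoring label under log P(y) + sum_w log P(w | y)."""
    def log_score(y):
        log_prior = math.log(label_counts[y] / len(docs))
        total = sum(word_counts[y].values())
        return log_prior + sum(
            math.log((word_counts[y][w] + 1) / (total + len(vocab)))
            for w in text.split() if w in vocab)  # add-one smoothing
    return max(labels, key=log_score)

print(predict("cheap pills"))  # Spam
```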
Continuous features

• So far, we have been looking at discrete features
  – P(x_j | y) is a Bernoulli trial (i.e., a coin toss)
• We could model P(x_j | y) with other distributions too
  – This is a separate assumption from the independence assumption that naïve Bayes makes
  – E.g., for real-valued features, (X_j | Y) could be drawn from a normal distribution
• Exercise: derive the maximum likelihood estimate when the features are assumed to be drawn from the normal distribution
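For the normal-distribution case mentioned above, prediction only needs the log density of each feature under its class-conditional Gaussian. A minimal sketch (estimating the mean and variance is the stated exercise, so only the density evaluation is shown; names are ours):

```python
import math

def gaussian_log_likelihood(x_j, mu, sigma2):
    """log N(x_j; mu, sigma2): the log density used in place of log P(x_j | y)."""
    return -0.5 * (math.log(2 * math.pi * sigma2) + (x_j - mu) ** 2 / sigma2)
```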
Summary: naïve Bayes

• Independence assumption
  – All features are independent of each other given the label
• Maximum likelihood learning: learning is simple
  – Generalizes to real-valued features
• Prediction via MAP estimation
  – Generalizes beyond binary classification
• Important caveats to remember
  – Smoothing
  – The independence assumption may not be valid
• The decision boundary is linear for binary classification