Machine Learning
The Naïve Bayes Classifier
Today's lecture
• The naïve Bayes classifier
• Learning the naïve Bayes classifier
• Practical concerns
Where are we?

We have seen Bayesian learning:
– Using a probabilistic criterion to select a hypothesis
– Maximum a posteriori and maximum likelihood learning
– You should know the difference between them

We could also learn functions that predict probabilities of outcomes:
– This is different from using a probabilistic criterion to learn
– Maximum a posteriori (MAP) prediction, as opposed to MAP learning
MAP prediction

Using the Bayes rule for predicting y given an input x:

\[
P(Y = y \mid X = \mathbf{x}) = \frac{P(X = \mathbf{x} \mid Y = y)\, P(Y = y)}{P(X = \mathbf{x})}
\]

This is the posterior probability of the label being y for this input x.

Predict the label y for the input x using

\[
\operatorname{argmax}_y \; \frac{P(X = \mathbf{x} \mid Y = y)\, P(Y = y)}{P(X = \mathbf{x})}
\]

The denominator does not depend on y, so this is equivalent to

\[
\operatorname{argmax}_y \; P(X = \mathbf{x} \mid Y = y)\, P(Y = y)
\]

Don't confuse this with MAP learning, which finds a hypothesis by maximizing its posterior probability. Here:

• P(X = x | Y = y) is the likelihood of observing this input x when the label is y
• P(Y = y) is the prior probability of the label being y

All we need are these two sets of probabilities.
Example: Tennis again

Likelihoods:

Temperature  Wind    P(T, W | Tennis = Yes)
Hot          Strong  0.15
Hot          Weak    0.4
Cold         Strong  0.1
Cold         Weak    0.35

Temperature  Wind    P(T, W | Tennis = No)
Hot          Strong  0.4
Hot          Weak    0.1
Cold         Strong  0.3
Cold         Weak    0.2

Prior:

Play tennis  P(Play tennis)
Yes          0.3
No           0.7

• Without any other information, what is the prior probability that I should play tennis?
• On days that I do play tennis, what is the probability that the temperature is T and the wind is W?
• On days that I don't play tennis, what is the probability that the temperature is T and the wind is W?

Input: Temperature = Hot (H), Wind = Weak (W). Should I play tennis?

argmax_y P(H, W | play?) P(play?)

P(H, W | Yes) P(Yes) = 0.4 × 0.3 = 0.12
P(H, W | No) P(No) = 0.1 × 0.7 = 0.07

MAP prediction = Yes
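The comparison above is mechanical enough to script. Here is a minimal Python sketch (the dictionary layout and function name are ours, not from the lecture) that reproduces the 0.12 vs. 0.07 computation:

```python
# Likelihood and prior tables from the example above.
likelihood = {
    "Yes": {("Hot", "Strong"): 0.15, ("Hot", "Weak"): 0.40,
            ("Cold", "Strong"): 0.10, ("Cold", "Weak"): 0.35},
    "No":  {("Hot", "Strong"): 0.40, ("Hot", "Weak"): 0.10,
            ("Cold", "Strong"): 0.30, ("Cold", "Weak"): 0.20},
}
prior = {"Yes": 0.3, "No": 0.7}

def map_predict(temperature, wind):
    """Return the label maximizing P(T, W | label) * P(label)."""
    scores = {label: likelihood[label][(temperature, wind)] * prior[label]
              for label in prior}
    return max(scores, key=scores.get), scores

label, scores = map_predict("Hot", "Weak")
print(scores)  # roughly {'Yes': 0.12, 'No': 0.07}, up to floating point
print(label)   # Yes
```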
How hard is it to learn probabilistic models?

     O  T  H  W  Play?
 1   S  H  H  W  -
 2   S  H  H  S  -
 3   O  H  H  W  +
 4   R  M  H  W  +
 5   R  C  N  W  +
 6   R  C  N  S  -
 7   O  C  N  S  +
 8   S  M  H  W  -
 9   S  C  N  W  +
10   R  M  N  W  +
11   S  M  N  S  +
12   O  M  H  S  +
13   O  H  N  W  +
14   R  M  H  S  -

Outlook: S(unny), O(vercast), R(ainy)
Temperature: H(ot), M(edium), C(ool)
Humidity: H(igh), N(ormal), L(ow)
Wind: S(trong), W(eak)

We need to learn:
1. The prior P(Play?)
2. The likelihoods P(x | Play?)
Prior P(play?):
• A single number (why only one?)

Likelihood P(X | Play?):
• There are 4 features, taking 3, 3, 3, and 2 values respectively
• For each value of Play? (+/−), we need a value for each possible assignment: P(O, T, H, W | Play?)
• That is, (3 ⋅ 3 ⋅ 3 ⋅ 2 − 1) parameters in each case, one for each assignment (if all four features were binary, it would be 2⁴ − 1)
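To make this count concrete, the arithmetic for this dataset written out:

\[
3 \cdot 3 \cdot 3 \cdot 2 = 54 \text{ joint assignments of } (O, T, H, W), \text{ so } 54 - 1 = 53 \text{ parameters per label,}
\]
\[
2 \times 53 + 1 \;(\text{the prior}) = 107 \text{ parameters in total, to be estimated from just 14 examples.}
\]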
In general:

Prior P(Y):
• If there are k labels, then k − 1 parameters (why not k? Because the probabilities must sum to one, the last one is determined by the rest)

Likelihood P(X | Y):
• If there are d Boolean features, we need a value for each possible P(x_1, x_2, …, x_d | y), for each y
• That is, k(2^d − 1) parameters

We need a lot of data to estimate this many numbers!

This is high model complexity: if there is very limited data, there will be high variance in the parameters.

How can we deal with this? Answer: make independence assumptions.
Recall: Conditional independence

Suppose X, Y, and Z are random variables. X is conditionally independent of Y given Z if the probability distribution of X is independent of the value of Y when Z is observed:

\[
P(X \mid Y, Z) = P(X \mid Z)
\]

Or equivalently:

\[
P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)
\]
Modeling the features

P(x_1, x_2, …, x_d | y) required k(2^d − 1) parameters.

What if all the features were conditionally independent given the label? That is,

\[
P(x_1, x_2, \cdots, x_d \mid y) = P(x_1 \mid y)\, P(x_2 \mid y) \cdots P(x_d \mid y)
\]

This is the naïve Bayes assumption. It requires only d numbers for each label: kd parameters overall. Not bad!
The naïve Bayes classifier

Assumption: features are conditionally independent given the label Y.

To predict, we need two sets of probabilities:
– The prior P(y)
– For each x_j, the likelihood P(x_j | y)

Decision rule:

\[
h_{NB}(\mathbf{x}) = \operatorname{argmax}_y \; P(y)\, P(x_1, x_2, \cdots, x_d \mid y)
                   = \operatorname{argmax}_y \; P(y) \prod_j P(x_j \mid y)
\]
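As a concrete rendering of this decision rule, here is a minimal Python sketch (the data layout and names are ours). It sums log probabilities instead of multiplying, which computes the same argmax because log is monotonic, while avoiding numerical underflow when d is large:

```python
import math

def h_nb(x, priors, likelihoods):
    """Naive Bayes decision rule: argmax_y log P(y) + sum_j log P(x_j | y).

    priors:      dict mapping label y -> P(y)
    likelihoods: dict mapping label y -> list with one dict per feature j,
                 each mapping a feature value -> P(x_j = value | y)
    x:           the input, a sequence of feature values
    """
    def log_score(y):
        return math.log(priors[y]) + sum(
            math.log(likelihoods[y][j][x_j]) for j, x_j in enumerate(x))
    return max(priors, key=log_score)
```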
Decision boundaries of naïve Bayes

What is the decision boundary of the naïve Bayes classifier? Consider the two-class case. We predict the label to be + if

\[
P(y = +) \prod_j P(x_j \mid y = +) \;>\; P(y = -) \prod_j P(x_j \mid y = -)
\]

or equivalently, if

\[
\frac{P(y = +) \prod_j P(x_j \mid y = +)}{P(y = -) \prod_j P(x_j \mid y = -)} > 1
\]

Taking the log and simplifying, we get

\[
\log \frac{P(y = - \mid \mathbf{x})}{P(y = + \mid \mathbf{x})} = \mathbf{w}^\top \mathbf{x} + b
\]

This is a linear function of the feature space! Easy to prove; see the note on the course website.
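A sketch of why this holds for binary features, written in terms of the Bernoulli parameters a_j = P(x_j = 1 | y = +) and b_j = P(x_j = 1 | y = −) (these parameters are introduced formally later in this lecture; the course-website note gives the full proof):

\[
\log \frac{P(y=- \mid \mathbf{x})}{P(y=+ \mid \mathbf{x})}
= \log \frac{P(y=-)}{P(y=+)} + \sum_j \log \frac{P(x_j \mid y=-)}{P(x_j \mid y=+)}
\]

For a binary feature, \( \log P(x_j \mid y{=}+) = x_j \log a_j + (1 - x_j) \log(1 - a_j) \), which is linear in \( x_j \), and likewise with \( b_j \) for the negative class. Collecting the coefficient of each \( x_j \) into \( w_j \) and the constants into \( b \) gives exactly \( \mathbf{w}^\top \mathbf{x} + b \), with

\[
w_j = \log \frac{b_j (1 - a_j)}{a_j (1 - b_j)}, \qquad
b = \log \frac{P(y=-)}{P(y=+)} + \sum_j \log \frac{1 - b_j}{1 - a_j}.
\]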
Today's lecture
• The naïve Bayes classifier
• Learning the naïve Bayes classifier
• Practical concerns
Learning the naïve Bayes classifier

What is the hypothesis function h defined by? A collection of probabilities:
• The prior for each label: P(y)
• The likelihoods for feature x_j given a label: P(x_j | y)

A note on convention for this section:
• Examples in the dataset are indexed by the subscript i (e.g., x_i)
• Features within an example are indexed by the subscript j
• The j-th feature of the i-th example is x_ij

Suppose we have a dataset D = {(x_i, y_i)} with m examples, and we want to learn the classifier in a probabilistic way. What is a probabilistic criterion to select the hypothesis?
Maximum likelihood estimation

Given the dataset D = {(x_i, y_i)} with m examples, maximum likelihood estimation selects

\[
h_{ML} = \operatorname{argmax}_h \; P(D \mid h)
\]

Here h is defined by all the probabilities used to construct the naïve Bayes decision.

Each example in the dataset is independent and identically distributed, so we can represent P(D | h) as the product

\[
P(D \mid h) = \prod_{i=1}^{m} P(\mathbf{x}_i, y_i \mid h)
\]

Each factor asks: "What probability would this particular h assign to the pair (x_i, y_i)?" Factoring each term into a likelihood times a prior, and then applying the naïve Bayes assumption (recall that x_ij is the j-th feature of x_i):

\[
P(D \mid h) = \prod_{i=1}^{m} P(\mathbf{x}_i \mid y_i, h)\, P(y_i \mid h)
            = \prod_{i=1}^{m} P(y_i \mid h) \prod_j P(x_{ij} \mid y_i, h)
\]

How do we proceed?
We need to make a modeling assumption about the functional form of these probability distributions. For simplicity, suppose there are two labels (1 and 0) and all features are binary.

• Prior: P(y = 1) = p and P(y = 0) = 1 − p
  That is, the prior probability is from the Bernoulli distribution.

• Likelihood for each feature given a label:
  P(x_j = 1 | y = 1) = a_j and P(x_j = 0 | y = 1) = 1 − a_j
  P(x_j = 1 | y = 0) = b_j and P(x_j = 0 | y = 0) = 1 − b_j
  That is, the likelihood of each feature is also from the Bernoulli distribution.

h consists of p and all the a's and b's.

With this assumption, the prior factor for example i can be written as

\[
P(y_i \mid h) = p^{[y_i = 1]} (1 - p)^{[y_i = 0]}
\]

where [z] is called the indicator function or the Iverson bracket: its value is 1 if the argument z is true and zero otherwise. Likewise, each likelihood factor becomes

\[
P(x_{ij} \mid y_i, h) = \left( a_j^{[x_{ij} = 1]} (1 - a_j)^{[x_{ij} = 0]} \right)^{[y_i = 1]} \left( b_j^{[x_{ij} = 1]} (1 - b_j)^{[x_{ij} = 0]} \right)^{[y_i = 0]}
\]
Substituting and deriving the argmax (take the log of the likelihood, differentiate with respect to each parameter, and set the derivative to zero), we get

\[
p = P(y = 1) = \frac{1}{m} \sum_i [y_i = 1]
\]

\[
a_j = P(x_j = 1 \mid y = 1) = \frac{\sum_i [x_{ij} = 1 \wedge y_i = 1]}{\sum_i [y_i = 1]}
\qquad
b_j = P(x_j = 1 \mid y = 0) = \frac{\sum_i [x_{ij} = 1 \wedge y_i = 0]}{\sum_i [y_i = 0]}
\]

That is, each maximum likelihood estimate is just a count, normalized.
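These counting formulas translate directly into code. A minimal sketch, assuming a binary label vector y of shape (m,) and a binary feature matrix X of shape (m, d), with names mirroring the slide notation:

```python
import numpy as np

def mle_naive_bayes(X, y):
    """Maximum likelihood estimates for binary-feature naive Bayes."""
    p = np.mean(y == 1)         # P(y = 1): fraction of examples labeled 1
    a = X[y == 1].mean(axis=0)  # a_j = P(x_j = 1 | y = 1), one entry per feature
    b = X[y == 0].mean(axis=0)  # b_j = P(x_j = 1 | y = 0)
    return p, a, b
```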
Let's learn a naïve Bayes classifier

With the assumption that all our probabilities are from the Bernoulli distribution, we estimate from the table above:

P(Play = +) = 9/14        P(Play = −) = 5/14

P(O = S | Play = +) = 2/9
P(O = R | Play = +) = 3/9
P(O = O | Play = +) = 4/9

And so on, for the other attributes and also for Play = −.
Naïve Bayes: learning and prediction

• Learning
  – Count how often features occur with each label; normalize to get the likelihoods
  – Priors come from the fraction of examples with each label
  – Generalizes to multiclass
• Prediction
  – Use the learned probabilities to find the highest scoring label
Today's lecture
• The naïve Bayes classifier
• Learning the naïve Bayes classifier
• Practical concerns + an example
Important caveats with naïve Bayes

1. Features need not be conditionally independent given the label
   – Just because we assume that they are doesn't mean that that's how they behave in nature
   – We made a modeling assumption because it makes computation and learning easier

2. There may not be enough training data to get good estimates of the probabilities from counts
On caveat 1: all bets are off if the naïve Bayes assumption is not satisfied. And yet, naïve Bayes is very often used in practice because of its simplicity, and it works reasonably well even when the assumption is violated.
On caveat 2: the basic operation for learning likelihoods is counting how often a feature occurs with a label. What if we never see a particular feature with a particular label? E.g., suppose we never observe Temperature = cold with PlayTennis = Yes.

Should we treat those counts as zero? But that will make the corresponding probabilities zero, and a single zero likelihood zeroes out the entire product for that label.

Answer: smoothing
• Add fake counts (very small numbers, so that the counts are not zero)
• The Bayesian interpretation of smoothing: priors on the hypothesis (MAP learning)
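A minimal sketch of one common smoothing scheme, add-α (Laplace smoothing when α = 1), applied to the Bernoulli likelihoods above; the function name and the α parameter are ours:

```python
import numpy as np

def smoothed_likelihood(X, y, label, alpha=1.0):
    """Estimate P(x_j = 1 | y = label) with alpha fake counts.

    alpha is added to the count of ones, and 2 * alpha to the total
    (alpha for each of the two feature values), so every estimate stays
    strictly between 0 and 1 and still sums to one over the two values.
    """
    X_label = X[y == label]
    return (X_label.sum(axis=0) + alpha) / (len(X_label) + 2 * alpha)
```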
Example: classifying text

• Instance space: text documents
• Labels: Spam or NotSpam
• Goal: to learn a function that can predict whether a new document is Spam or NotSpam

How would you build a naïve Bayes classifier? Let us brainstorm: how to represent documents? How to estimate probabilities? How to classify?
1. Represent documents by a vector of words: a sparse vector consisting of one feature per word

2. Learning from N labeled documents
   1. Priors: the fraction of the N documents that carry each label
   2. For each word w in the vocabulary: count how often the word occurs with each label, normalize, and apply smoothing
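Putting the recipe together, a minimal end-to-end sketch in Python (the toy corpus, names, and the choice of word-count features with add-one smoothing are ours; other document representations fit the same recipe):

```python
import math
from collections import Counter

# A tiny labeled corpus standing in for the N training documents.
docs = [("buy cheap pills now", "Spam"),
        ("meeting notes attached", "NotSpam"),
        ("cheap pills cheap", "Spam"),
        ("lunch meeting tomorrow", "NotSpam")]

labels = {y for _, y in docs}
vocab = {w for text, _ in docs for w in text.split()}
label_counts = Counter(y for _, y in docs)    # for the priors
word_counts = {y: Counter() for y in labels}  # word-with-label counts
for text, y in docs:
    word_counts[y].update(text.split())

def predict(text):
    """Highest scoring label under log P(y) + sum_w log P(w | y)."""
    def log_score(y):
        log_prior = math.log(label_counts[y] / len(docs))
        total = sum(word_counts[y].values())
        return log_prior + sum(
            math.log((word_counts[y][w] + 1) / (total + len(vocab)))
            for w in text.split() if w in vocab)  # add-one smoothing
    return max(labels, key=log_score)

print(predict("cheap pills"))  # Spam
```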
Continuous features

• So far, we have been looking at discrete features
  – P(x_j | y) is a Bernoulli trial (i.e., a coin toss)
• We could model P(x_j | y) with other distributions too
  – This is a separate assumption from the independence assumption that naïve Bayes makes
  – E.g., for real-valued features, (X_j | Y) could be drawn from a normal distribution
• Exercise: derive the maximum likelihood estimate when the features are assumed to be drawn from the normal distribution
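For the normal-distribution case mentioned above, prediction only needs the log density of each feature under its class-conditional Gaussian. A minimal sketch (estimating the mean and variance is the stated exercise, so only the density evaluation is shown; names are ours):

```python
import math

def gaussian_log_likelihood(x_j, mu, sigma2):
    """log N(x_j; mu, sigma2): the log density used in place of log P(x_j | y)."""
    return -0.5 * (math.log(2 * math.pi * sigma2) + (x_j - mu) ** 2 / sigma2)
```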
Summary: naïve Bayes

• Independence assumption
  – All features are independent of each other given the label
• Maximum likelihood learning: learning is simple
  – Generalizes to real-valued features
• Prediction via MAP estimation
  – Generalizes beyond binary classification
• Important caveats to remember
  – Smoothing
  – The independence assumption may not be valid
• The decision boundary is linear for binary classification