Page 1: Bayesian Learning

Bayesian Learning
CS503/CSL603 - Fall 2018
Narayanan C Krishnan ([email protected])

Source: cse.iitrpr.ac.in/ckn/courses/f2018/csl603/w10.pdf

Page 2: Outline

• Bayes’ theorem
• MAP learners
• Bayes optimal classifier
• Naïve Bayes classifier
• Example – text classification
• Bayesian networks

Page 3: Features of Bayesian Learning

• Each training example can incrementally increase or decrease the estimated probability that a hypothesis is correct.
• Allows for probabilistic predictions.
• Practical learning algorithms:
  • Naïve Bayes learning
  • Bayesian network learning
• Combines prior knowledge with observations.
  • Requires prior probabilities.
• Useful conceptual framework:
  • A “gold standard” for evaluating other classifiers
  • Tools for analysis

Page 4: Bayes’ Theorem

• If $X$ and $Y$ are two random variables:

$$P(X \mid Y) = \frac{P(Y \mid X)\, P(X)}{P(Y)}$$

• In the context of a classifier, with hypothesis $h$ and training data $D$:

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

• $P(h)$ – prior probability of hypothesis $h$
• $P(D)$ – prior probability of training data $D$
• $P(h \mid D)$ – probability of $h$ given $D$ (posterior)
• $P(D \mid h)$ – probability of $D$ given $h$ (likelihood)

Page 5: Choosing the Hypotheses

• Given the training data, we are interested in the most probable hypothesis.
• Maximum a posteriori (MAP) hypothesis $h_{MAP}$:

$$h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D) \equiv \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)} \equiv \arg\max_{h \in H} P(D \mid h)\, P(h)$$

• If every hypothesis is equally probable a priori, $P(h_i) = P(h_j)\ \forall h_i, h_j \in H$, this simplifies to the maximum likelihood (ML) hypothesis $h_{ML}$:

$$h_{ML} = \arg\max_{h_i \in H} P(D \mid h_i)$$

Page 6: Example

• Does the patient have cancer or not?
• A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, only 0.008 of the entire population has this cancer.

Page 7: Example (continued)

$P(cancer) = 0.008 \qquad P(\lnot cancer) = 0.992$
$P(+ \mid cancer) = 0.98 \qquad P(- \mid cancer) = 0.02$
$P(+ \mid \lnot cancer) = 0.03 \qquad P(- \mid \lnot cancer) = 0.97$

$$P(cancer \mid +) = \frac{P(+ \mid cancer)\, P(cancer)}{P(+)} = \frac{0.98 \times 0.008}{0.98 \times 0.008 + 0.03 \times 0.992} \approx 0.21$$

So even after a positive test, $h_{MAP} = \lnot cancer$, because the prior probability of cancer is so small.

Page 8: Brute-Force MAP Hypothesis Learner (1)

• Given the examples and class labels $D = \langle \langle x_1, y_1 \rangle, \ldots, \langle x_m, y_m \rangle \rangle$:
• For each hypothesis $h \in H$, calculate the posterior probability

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

• Output the hypothesis $h_{MAP}$ that has the highest posterior probability:

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D)$$

Page 9: Brute-Force MAP Hypothesis Learner (2)

• Given the examples and class labels $D = \langle \langle x_1, y_1 \rangle, \ldots, \langle x_m, y_m \rangle \rangle$, choose $P(D \mid h)$:
  • $P(D \mid h) = 1$ if $h$ is consistent with $D$
  • $P(D \mid h) = 0$ otherwise
• Choose $P(h)$ to be the uniform distribution:
  • $P(h) = \frac{1}{|H|}\ \forall h \in H$
• Then $P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$

Page 10: Brute-Force MAP Hypothesis Learner (3)

• Given the examples and class labels $D = \langle \langle x_1, y_1 \rangle, \ldots, \langle x_m, y_m \rangle \rangle$, choose $P(D \mid h)$:
  • $P(D \mid h) = 1$ if $h$ is consistent with $D$
  • $P(D \mid h) = 0$ otherwise
• Choose $P(h)$ to be the uniform distribution: $P(h) = \frac{1}{|H|}$
• Then

$$P(h \mid D) = \begin{cases} \frac{1}{|VS_{H,D}|} & \text{if } h \text{ is consistent with } D \\ 0 & \text{otherwise} \end{cases}$$

where $VS_{H,D}$ is the version space: the subset of hypotheses in $H$ consistent with $D$.
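
A brute-force implementation sketch under the assumptions above (the concrete hypothesis space and toy dataset are invented for illustration; the slides do not fix either):

```python
from itertools import product

# Toy hypothesis space: boolean rules of the form "x[i] == v" over two
# binary features. Purely illustrative.
hypotheses = [(i, v) for i, v in product(range(2), [0, 1])]

def predict(h, x):
    i, v = h
    return x[i] == v

# Training data: (feature vector, boolean label).
D = [((0, 1), True), ((1, 1), True), ((1, 0), False)]

def consistent(h, D):
    return all(predict(h, x) == y for x, y in D)

# P(D|h) is 1 iff h is consistent with D; with a uniform prior the
# posterior is uniform over the version space VS_{H,D}.
version_space = [h for h in hypotheses if consistent(h, D)]
posterior = {h: (1 / len(version_space) if h in version_space else 0.0)
             for h in hypotheses}

h_map = max(posterior, key=posterior.get)  # any member of the version space
print(version_space, posterior[h_map])
```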

Page 11: Evolution of Posterior Probabilities

[Figure: three plots showing how the posterior evolves as data arrives – the uniform prior $P(h)$, then $P(h \mid D_1)$, then $P(h \mid D_1, D_2)$ – with the probability mass progressively concentrating on the hypotheses consistent with the data.]

Page 12: Classifying New Instances

• Given a new instance $\mathbf{x}$, what is the most probable classification?
• One solution – $h_{MAP}(\mathbf{x})$.
• But can we do better?
• Consider the following example containing three hypotheses:

$$P(h_1 \mid D) = 0.4, \quad P(h_2 \mid D) = 0.3, \quad P(h_3 \mid D) = 0.3$$

• Given a new instance $\mathbf{x}$:

$$h_1(\mathbf{x}) = +, \quad h_2(\mathbf{x}) = -, \quad h_3(\mathbf{x}) = -$$

• What is the most probable classification for $\mathbf{x}$? ($h_{MAP} = h_1$ predicts $+$, yet the combined posterior weight of $h_2$ and $h_3$ favours $-$.)

Page 13: Bayes Optimal Classifier (1)

• Combine the predictions of all hypotheses, weighted by their posterior probabilities.
• Bayes optimal classification:

$$\arg\max_{y \in Y} \sum_{h_i \in H} P(y \mid h_i)\, P(h_i \mid D)$$

• Example:

$$P(h_1 \mid D) = 0.4, \quad P(- \mid h_1) = 0, \quad P(+ \mid h_1) = 1$$
$$P(h_2 \mid D) = 0.3, \quad P(- \mid h_2) = 1, \quad P(+ \mid h_2) = 0$$
$$P(h_3 \mid D) = 0.3, \quad P(- \mid h_3) = 1, \quad P(+ \mid h_3) = 0$$

$$\sum_{h_i \in H} P(+ \mid h_i)\, P(h_i \mid D) = 0.4, \qquad \sum_{h_i \in H} P(- \mid h_i)\, P(h_i \mid D) = 0.6$$

so the Bayes optimal classification for $\mathbf{x}$ is $-$.
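
The same weighted vote as a short script (the dictionaries simply mirror the numbers above):

```python
# Posterior over hypotheses and each hypothesis's predictive distribution.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
likelihood = {  # P(y | h) for y in {+, -}
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def bayes_optimal(posterior, likelihood, labels=("+", "-")):
    # Weight each hypothesis's vote for label y by its posterior probability.
    score = {y: sum(likelihood[h][y] * p for h, p in posterior.items())
             for y in labels}
    return max(score, key=score.get), score

label, score = bayes_optimal(posterior, likelihood)
print(label, score)  # '-' with scores {'+': 0.4, '-': 0.6}
```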

Page 14: Bayes Optimal Classifier (2)

• Optimal in the sense that:
  • No other classification method using the same hypothesis space and the same prior knowledge can outperform it on average.
  • The method maximizes the probability that the new instance is classified correctly, given the available data, the hypothesis space, and the prior probabilities over the hypotheses.
• But it is inefficient:
  • It must compute the posterior probability of every hypothesis and combine the predictions of all hypotheses.

Page 15: Naïve Bayes Classifier (1)

• Bayes’ rule, in a slightly different application.
• Let $C = \{c_1, c_2, \ldots, c_K\}$ be the different class labels.
• The label for the $i$-th instance is $y_i \in C$:

$$P(c_k \mid \mathbf{x}_i) = \frac{P(\mathbf{x}_i \mid c_k)\, P(c_k)}{P(\mathbf{x}_i)}$$

• $P(c_k \mid \mathbf{x}_i)$ – posterior probability that instance $\mathbf{x}_i$ belongs to class $c_k$
• $P(\mathbf{x}_i \mid c_k)$ – probability that an instance drawn from class $c_k$ would be $\mathbf{x}_i$ (likelihood)
• $P(c_k)$ – probability of class $c_k$ (prior)
• $P(\mathbf{x}_i)$ – probability of instance $\mathbf{x}_i$ (evidence)

Page 16: Naïve Bayes Classifier (2)

• Classify instance $\mathbf{x}$ as the class $c$ with maximum posterior probability:

$$c = \arg\max_{c_k \in C} P(c_k \mid \mathbf{x})$$

• Ignore the denominator (since we are only interested in the maximum):

$$c = \arg\max_{c_k \in C} P(\mathbf{x} \mid c_k)\, P(c_k)$$

• If the prior is uniform:

$$c = \arg\max_{c_k \in C} P(\mathbf{x} \mid c_k)$$

Page 17: Naïve Bayes Classifier (3)

• Look at the classifier:

$$c = \arg\max_{c_k \in C} P(\mathbf{x} \mid c_k)$$

• What is each instance $\mathbf{x}$?
  • An $n$-dimensional tuple $(x_1, \ldots, x_n)$
• Estimate the joint probability distribution $P(x_1, \ldots, x_n \mid c_k)$.
• Practical issue – we need to know the probability of every possible instance given every possible class.
  • With $n$ Boolean features and $K$ classes – $K \cdot 2^n$ probability values! For example, $n = 30$ Boolean features and $K = 2$ classes already require about $2 \times 2^{30} \approx 2$ billion values.

Page 18: Naïve Bayes Classifier (4)

• Make the naïve Bayes assumption – features/attributes are conditionally independent given the target attribute (class label):

$$P(x_1, \ldots, x_n \mid c_k) = \prod_{j=1}^{n} P(x_j \mid c_k)$$

• This results in the naïve Bayes classifier (NBC)!

$$c = \arg\max_{c_k \in C} \prod_{j=1}^{n} P(x_j \mid c_k)\, P(c_k)$$

Page 19: NBC – Practical Issues (1)

• Estimating probabilities from $D$:
• Prior probabilities:

$$P(c_k) = \frac{|\{\langle \mathbf{x}_i, y_i \rangle : y_i = c_k\}|}{|D|}$$

• If the features are discrete:

$$P(x_j = v \mid c_k) = \frac{|\{\langle \mathbf{x}_i, y_i \rangle : x_{ij} = v \wedge y_i = c_k\}|}{|\{\langle \mathbf{x}_i, y_i \rangle : y_i = c_k\}|}$$

Page 20: NBC – Practical Issues (2)

• What if the features are continuous?
• Assume a parameterized distribution for $x_j$, e.g., normal:
  • Learn the parameters of the distribution from data, e.g., the mean and variance of the $x_j$ values.
  • Determine the parameters that maximize the likelihood.
  • $P(x_j \mid c_k) \sim N(\mu, \sigma)$, where $\mu$ and $\sigma$ are unknown.
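
A sketch of the Gaussian option (maximum-likelihood mean and variance per class; the sample values are invented):

```python
import math

# Toy 1-D continuous feature per class; values invented for illustration.
samples = {"spam": [2.1, 2.5, 1.9], "ham": [0.2, 0.4, 0.3]}

def fit_gaussian(xs):
    # Maximum-likelihood estimates: sample mean and (biased) variance.
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var

params = {c: fit_gaussian(xs) for c, xs in samples.items()}

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Class-conditional likelihoods P(x_j = 2.0 | c) under the fitted normals.
for c, (mu, var) in params.items():
    print(c, gaussian_pdf(2.0, mu, var))
```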

Page 21: NBC – Practical Issues (3)

• What if the features are continuous?
• Assume a parameterized distribution for $x_j$, e.g., normal:
  • Learn the parameters of the distribution from data, e.g., the mean and variance of the $x_j$ values.
  • Determine the parameters that maximize the likelihood.
• Or discretize the feature (a sketch follows below):
  • E.g., price $\in \mathbb{R}$ mapped to price $\in \{low, moderate, high\}$
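
The discretization mentioned above, as a tiny helper (the cut points are arbitrary, chosen only to illustrate the mapping):

```python
def discretize_price(price: float) -> str:
    # Map a real-valued price into {low, moderate, high}.
    # The thresholds 100 and 500 are illustrative, not from the slides.
    if price < 100:
        return "low"
    elif price < 500:
        return "moderate"
    return "high"

print([discretize_price(p) for p in (40.0, 250.0, 900.0)])
# ['low', 'moderate', 'high']
```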

Page 22: NBC – Practical Issues (4)

• If there are no examples in class $c_k$ for which $x_j = v$, then $P(x_j = v \mid c_k) = 0$ and

$$\prod_{j} P(x_j \mid c_k)\, P(c_k) = 0$$

• Use the m-estimate, defined as follows (a sketch follows below):

$$P(x_j = v \mid c_k) = \frac{|\{\langle \mathbf{x}_i, y_i \rangle : x_{ij} = v \wedge y_i = c_k\}| + mp}{|\{\langle \mathbf{x}_i, y_i \rangle : y_i = c_k\}| + m}$$

• $p$ – prior estimate of the probability
• $m$ – equivalent sample size (how heavily to weight $p$ relative to the observed data)
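
The m-estimate as a one-line function (the counts, $p$, and $m$ in the usage lines are invented):

```python
def m_estimate(count_v_and_c: int, count_c: int, p: float, m: float) -> float:
    # (n_c + m*p) / (n + m): blends the observed fraction with the prior p,
    # where m acts as an equivalent sample size for the prior.
    return (count_v_and_c + m * p) / (count_c + m)

# The unsmoothed estimate would be 0/8 = 0; the m-estimate keeps it positive.
print(m_estimate(0, 8, p=1/3, m=3))  # (0 + 1) / (8 + 3) ≈ 0.091
print(m_estimate(5, 8, p=1/3, m=3))  # (5 + 1) / 11 ≈ 0.545
```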

Page 23: Example – Learn to Classify Text

• Problem definition:
  • Given a set of news articles that are of interest, we would like to learn to classify the articles by topic.
• Naïve Bayes is among the most effective algorithms for this task.
• What attributes will represent the documents?
  • Vector of words – one attribute per word position in the document
• What is the target concept?
  • Is the document interesting?
  • Topic of the document

Page 24: Algorithm – Learn Naïve Bayes

• Collect all words and tokens that occur in the examples $D$.
  • Vocabulary – all distinct words and tokens in $D$.
• Compute the probabilities $P(c_j)$ and $P(w_k \mid c_j)$:
  • $D_j$ – examples for which the target label is $c_j$
  • $P(c_j) = \frac{|D_j|}{|D|}$
  • $n$ – total number of words in $D_j$ (counting duplicates multiple times)
  • For each word $w_k$ in Vocabulary:
    • $n_k$ – number of times word $w_k$ occurs in $D_j$
    • $P(w_k \mid c_j) = \frac{n_k + 1}{n + |Vocabulary|}$ (a sketch follows the next slide)

Page 25: Algorithm – Classify Naïve Bayes

• Given a test instance:
  • Compute the frequency of occurrence in the test instance of each term in the vocabulary.
  • Apply the naïve Bayes classification rule, as in the sketch below.
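
A compact, runnable sketch combining the two algorithms above (Learn and Classify). The two-topic corpus is invented for illustration; the word probabilities follow the $\frac{n_k + 1}{n + |Vocabulary|}$ rule from the previous slide:

```python
import math
from collections import Counter

# Toy labelled corpus, invented for illustration.
docs = [("the match was a great win for the team", "sports"),
        ("the team scored in the final match", "sports"),
        ("the election results were announced today", "politics"),
        ("the candidate won the election vote", "politics")]

# --- Learn_Naive_Bayes ---
vocabulary = {w for text, _ in docs for w in text.split()}
classes = {c for _, c in docs}
prior, cond = {}, {}
for c in classes:
    texts = [text.split() for text, label in docs if label == c]
    prior[c] = len(texts) / len(docs)       # P(c_j) = |D_j| / |D|
    words = [w for t in texts for w in t]
    n, counts = len(words), Counter(words)
    # P(w_k | c_j) = (n_k + 1) / (n + |Vocabulary|): Laplace-smoothed.
    cond[c] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}

# --- Classify_Naive_Bayes ---
def classify(text):
    words = [w for w in text.split() if w in vocabulary]  # skip unseen words
    scores = {c: math.log(prior[c]) +
                 sum(math.log(cond[c][w]) for w in words)
              for c in classes}
    return max(scores, key=scores.get)

print(classify("the team won the match"))  # 'sports'
```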

Page 26: Example: 20 Newsgroups

• Given 1000 training documents from each group.
• Learn to classify new documents according to the newsgroup they came from.
• NBC achieves 89% accuracy.

Page 27: Bayesian Networks (1)

• The naïve Bayes assumption of conditional independence is too restrictive.
• But the problem is intractable without some conditional independence assumption.
• Bayesian networks describe conditional independence among subsets of variables.
• This allows combining prior knowledge about (in)dependencies among variables with training data.
• Recollect – conditional independence: $X$ is conditionally independent of $Y$ given $Z$ if $P(X \mid Y, Z) = P(X \mid Z)$.

Page 28: Bayesian Network – Example

[Figure: a Bayesian network over the variables Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire; Storm and BusTourGroup are the parents of Campfire.]

Conditional probability table for Campfire:

         S,B    S,¬B   ¬S,B   ¬S,¬B
  C      0.4    0.1    0.8    0.2
  ¬C     0.6    0.9    0.2    0.8

Page 29: Bayesian Networks (2)

• The network represents the joint probability distribution over all variables:
  • $P(Storm, BusTourGroup, \ldots, ForestFire)$
• In general,

$$P(y_1, y_2, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \mid Parents(y_i))$$

where $Parents(y_i)$ denotes the immediate predecessors of $y_i$ in the graph.

• What is the Bayes network corresponding to the naïve Bayes classifier?
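
A sketch of reading a joint probability off this factorization, using the Campfire fragment from the example slide (only Storm, BusTourGroup, and Campfire are modelled here; the marginals for Storm and BusTourGroup are invented, while the Campfire CPT is the one shown two slides above):

```python
# P(S) and P(B): Storm and BusTourGroup have no parents in the example
# network, so they get marginals. These numbers are illustrative.
p_storm = {True: 0.2, False: 0.8}
p_bus = {True: 0.5, False: 0.5}

# P(Campfire = True | Storm, BusTourGroup), from the CPT on the example slide.
p_camp_given = {(True, True): 0.4, (True, False): 0.1,
                (False, True): 0.8, (False, False): 0.2}

def joint(storm: bool, bus: bool, campfire: bool) -> float:
    # P(S, B, C) = P(S) * P(B) * P(C | S, B): each variable is conditioned
    # only on its parents.
    p_c = p_camp_given[(storm, bus)]
    if not campfire:
        p_c = 1 - p_c
    return p_storm[storm] * p_bus[bus] * p_c

# The joint probabilities over all assignments sum to 1.
total = sum(joint(s, b, c) for s in (True, False)
            for b in (True, False) for c in (True, False))
print(joint(True, True, True), total)  # 0.04, 1.0
```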

Page 30: Bayesian Networks (3)

• Inference:
  • A Bayes network encodes all the information required for inference.
  • Exact inference methods work well for some network structures.
  • Monte Carlo methods simulate the network randomly to calculate approximate solutions.
• Learning:
  • If the structure is known and there are no missing values, it is easy to learn a Bayes network.
  • If the structure is known and there are some missing values, use the expectation-maximization (EM) algorithm.
  • If the structure is unknown, the problem is very difficult.

Page 31: Summary

• Bayes’ rule
• Bayes optimal classifier
• Practical naïve Bayes classifier
  • Example – text classification task
  • Maximum-likelihood estimates
• Bayesian networks