Page 1: Bayesian Learning

Bayesian Learning
CS503/CSL603 - Fall 2018
Narayanan C Krishnan ([email protected])

Source: cse.iitrpr.ac.in/ckn/courses/f2018/csl603/w10.pdf

Page 2: Outline

• Bayes’ theorem
• MAP learners
• Bayes optimal classifier
• Naïve Bayes classifier
• Example – text classification
• Bayesian networks

Page 3: Features of Bayesian Learning

• Each training example can incrementally increase or decrease the estimated probability that a hypothesis is correct.
• Allows for probabilistic predictions.
• Practical learning algorithms:
  • Naïve Bayes learning
  • Bayesian network learning
• Combines prior knowledge with observations.
  • Requires prior probabilities.
• Useful conceptual framework:
  • A “gold standard” for evaluating other classifiers
  • Tools for analysis

Page 4: Bayes’ Theorem

• If $X$ and $Y$ are two random variables:

$$P(X \mid Y) = \frac{P(Y \mid X)\, P(X)}{P(Y)}$$

• In the context of a classifier, with hypothesis $h$ and training data $D$:

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

• $P(h)$ – prior probability of hypothesis $h$
• $P(D)$ – prior probability of training data $D$
• $P(h \mid D)$ – probability of $h$ given $D$ (posterior)
• $P(D \mid h)$ – probability of $D$ given $h$ (likelihood)

Page 5: Choosing the Hypotheses

• Given the training data, we are interested in the most probable hypothesis.
• Maximum a posteriori (MAP) hypothesis $h_{MAP}$:

$$h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D) \equiv \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)} \equiv \arg\max_{h \in H} P(D \mid h)\, P(h)$$

• If every hypothesis is equally probable a priori, $P(h_i) = P(h_j)\ \forall h_i, h_j \in H$, this simplifies to the maximum likelihood (ML) hypothesis $h_{ML}$:

$$h_{ML} = \arg\max_{h_i \in H} P(D \mid h_i)$$

Page 6: Example

• Does the patient have cancer or not?
• A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, only 0.008 of the entire population has this cancer.

Page 7: Example (continued)

$P(cancer) = 0.008 \qquad P(\lnot cancer) = 0.992$
$P(+ \mid cancer) = 0.98 \qquad P(- \mid cancer) = 0.02$
$P(+ \mid \lnot cancer) = 0.03 \qquad P(- \mid \lnot cancer) = 0.97$

$$P(cancer \mid +) = \frac{P(+ \mid cancer)\, P(cancer)}{P(+)} = \frac{0.98 \times 0.008}{0.98 \times 0.008 + 0.03 \times 0.992} \approx 0.21$$

So even after a positive test, $h_{MAP} = \lnot cancer$, because the prior probability of cancer is so small.

Page 8: Brute-Force MAP Hypothesis Learner (1)

• Given the examples and class labels $D = \langle \langle x_1, y_1 \rangle, \ldots, \langle x_m, y_m \rangle \rangle$:
• For each hypothesis $h \in H$, calculate the posterior probability

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

• Output the hypothesis $h_{MAP}$ that has the highest posterior probability:

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D)$$

Page 9: Brute-Force MAP Hypothesis Learner (2)

• Given the examples and class labels $D = \langle \langle x_1, y_1 \rangle, \ldots, \langle x_m, y_m \rangle \rangle$, choose $P(D \mid h)$:
  • $P(D \mid h) = 1$ if $h$ is consistent with $D$
  • $P(D \mid h) = 0$ otherwise
• Choose $P(h)$ to be the uniform distribution:
  • $P(h) = \frac{1}{|H|}\ \forall h \in H$
• Then $P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$

Page 10: Brute-Force MAP Hypothesis Learner (3)

• Given the examples and class labels $D = \langle \langle x_1, y_1 \rangle, \ldots, \langle x_m, y_m \rangle \rangle$, choose $P(D \mid h)$:
  • $P(D \mid h) = 1$ if $h$ is consistent with $D$
  • $P(D \mid h) = 0$ otherwise
• Choose $P(h)$ to be the uniform distribution: $P(h) = \frac{1}{|H|}$
• Then

$$P(h \mid D) = \begin{cases} \frac{1}{|VS_{H,D}|} & \text{if } h \text{ is consistent with } D \\ 0 & \text{otherwise} \end{cases}$$

where $VS_{H,D}$ is the version space: the subset of hypotheses in $H$ consistent with $D$.
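
A brute-force implementation sketch under the assumptions above (the concrete hypothesis space and toy dataset are invented for illustration; the slides do not fix either):

```python
from itertools import product

# Toy hypothesis space: boolean rules of the form "x[i] == v" over two
# binary features. Purely illustrative.
hypotheses = [(i, v) for i, v in product(range(2), [0, 1])]

def predict(h, x):
    i, v = h
    return x[i] == v

# Training data: (feature vector, boolean label).
D = [((0, 1), True), ((1, 1), True), ((1, 0), False)]

def consistent(h, D):
    return all(predict(h, x) == y for x, y in D)

# P(D|h) is 1 iff h is consistent with D; with a uniform prior the
# posterior is uniform over the version space VS_{H,D}.
version_space = [h for h in hypotheses if consistent(h, D)]
posterior = {h: (1 / len(version_space) if h in version_space else 0.0)
             for h in hypotheses}

h_map = max(posterior, key=posterior.get)  # any member of the version space
print(version_space, posterior[h_map])
```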

Page 11: Evolution of Posterior Probabilities

[Figure: three plots showing how the posterior evolves as data arrives – the uniform prior $P(h)$, then $P(h \mid D_1)$, then $P(h \mid D_1, D_2)$ – with the probability mass progressively concentrating on the hypotheses consistent with the data.]

Page 12: Classifying New Instances

• Given a new instance $\mathbf{x}$, what is the most probable classification?
• One solution – $h_{MAP}(\mathbf{x})$.
• But can we do better?
• Consider the following example containing three hypotheses:

$$P(h_1 \mid D) = 0.4, \quad P(h_2 \mid D) = 0.3, \quad P(h_3 \mid D) = 0.3$$

• Given a new instance $\mathbf{x}$:

$$h_1(\mathbf{x}) = +, \quad h_2(\mathbf{x}) = -, \quad h_3(\mathbf{x}) = -$$

• What is the most probable classification for $\mathbf{x}$? ($h_{MAP} = h_1$ predicts $+$, yet the combined posterior weight of $h_2$ and $h_3$ favours $-$.)

Page 13: Bayes Optimal Classifier (1)

• Combine the predictions of all hypotheses, weighted by their posterior probabilities.
• Bayes optimal classification:

$$\arg\max_{y \in Y} \sum_{h_i \in H} P(y \mid h_i)\, P(h_i \mid D)$$

• Example:

$$P(h_1 \mid D) = 0.4, \quad P(- \mid h_1) = 0, \quad P(+ \mid h_1) = 1$$
$$P(h_2 \mid D) = 0.3, \quad P(- \mid h_2) = 1, \quad P(+ \mid h_2) = 0$$
$$P(h_3 \mid D) = 0.3, \quad P(- \mid h_3) = 1, \quad P(+ \mid h_3) = 0$$

$$\sum_{h_i \in H} P(+ \mid h_i)\, P(h_i \mid D) = 0.4, \qquad \sum_{h_i \in H} P(- \mid h_i)\, P(h_i \mid D) = 0.6$$

so the Bayes optimal classification for $\mathbf{x}$ is $-$.
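
The same weighted vote as a short script (the dictionaries simply mirror the numbers above):

```python
# Posterior over hypotheses and each hypothesis's predictive distribution.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
likelihood = {  # P(y | h) for y in {+, -}
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def bayes_optimal(posterior, likelihood, labels=("+", "-")):
    # Weight each hypothesis's vote for label y by its posterior probability.
    score = {y: sum(likelihood[h][y] * p for h, p in posterior.items())
             for y in labels}
    return max(score, key=score.get), score

label, score = bayes_optimal(posterior, likelihood)
print(label, score)  # '-' with scores {'+': 0.4, '-': 0.6}
```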

Page 14: Bayes Optimal Classifier (2)

• Optimal in the sense that:
  • No other classification method using the same hypothesis space and the same prior knowledge can outperform it on average.
  • The method maximizes the probability that the new instance is classified correctly, given the available data, the hypothesis space, and the prior probabilities over the hypotheses.
• But it is inefficient:
  • It must compute the posterior probability of every hypothesis and combine the predictions of all hypotheses.

Page 15: Naïve Bayes Classifier (1)

• Bayes’ rule, in a slightly different application.
• Let $C = \{c_1, c_2, \ldots, c_K\}$ be the different class labels.
• The label for the $i$-th instance is $y_i \in C$:

$$P(c_k \mid \mathbf{x}_i) = \frac{P(\mathbf{x}_i \mid c_k)\, P(c_k)}{P(\mathbf{x}_i)}$$

• $P(c_k \mid \mathbf{x}_i)$ – posterior probability that instance $\mathbf{x}_i$ belongs to class $c_k$
• $P(\mathbf{x}_i \mid c_k)$ – probability that an instance drawn from class $c_k$ would be $\mathbf{x}_i$ (likelihood)
• $P(c_k)$ – probability of class $c_k$ (prior)
• $P(\mathbf{x}_i)$ – probability of instance $\mathbf{x}_i$ (evidence)

Page 16: Naïve Bayes Classifier (2)

• Classify instance $\mathbf{x}$ as the class $c$ with maximum posterior probability:

$$c = \arg\max_{c_k \in C} P(c_k \mid \mathbf{x})$$

• Ignore the denominator (since we are only interested in the maximum):

$$c = \arg\max_{c_k \in C} P(\mathbf{x} \mid c_k)\, P(c_k)$$

• If the prior is uniform:

$$c = \arg\max_{c_k \in C} P(\mathbf{x} \mid c_k)$$

Page 17: Naïve Bayes Classifier (3)

• Look at the classifier:

$$c = \arg\max_{c_k \in C} P(\mathbf{x} \mid c_k)$$

• What is each instance $\mathbf{x}$?
  • An $n$-dimensional tuple $(x_1, \ldots, x_n)$
• Estimate the joint probability distribution $P(x_1, \ldots, x_n \mid c_k)$.
• Practical issue – we need to know the probability of every possible instance given every possible class.
  • With $n$ Boolean features and $K$ classes – $K \cdot 2^n$ probability values! For example, $n = 30$ Boolean features and $K = 2$ classes already require about $2 \times 2^{30} \approx 2$ billion values.

Page 18: Naïve Bayes Classifier (4)

• Make the naïve Bayes assumption – features/attributes are conditionally independent given the target attribute (class label):

$$P(x_1, \ldots, x_n \mid c_k) = \prod_{j=1}^{n} P(x_j \mid c_k)$$

• This results in the naïve Bayes classifier (NBC)!

$$c = \arg\max_{c_k \in C} \prod_{j=1}^{n} P(x_j \mid c_k)\, P(c_k)$$

Page 19: NBC – Practical Issues (1)

• Estimating probabilities from $D$:
• Prior probabilities:

$$P(c_k) = \frac{|\{\langle \mathbf{x}_i, y_i \rangle : y_i = c_k\}|}{|D|}$$

• If the features are discrete:

$$P(x_j = v \mid c_k) = \frac{|\{\langle \mathbf{x}_i, y_i \rangle : x_{ij} = v \wedge y_i = c_k\}|}{|\{\langle \mathbf{x}_i, y_i \rangle : y_i = c_k\}|}$$

Page 20: NBC – Practical Issues (2)

• What if the features are continuous?
• Assume a parameterized distribution for $x_j$, e.g., normal:
  • Learn the parameters of the distribution from data, e.g., the mean and variance of the $x_j$ values.
  • Determine the parameters that maximize the likelihood.
  • $P(x_j \mid c_k) \sim N(\mu, \sigma)$, where $\mu$ and $\sigma$ are unknown.
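
A sketch of the Gaussian option (maximum-likelihood mean and variance per class; the sample values are invented):

```python
import math

# Toy 1-D continuous feature per class; values invented for illustration.
samples = {"spam": [2.1, 2.5, 1.9], "ham": [0.2, 0.4, 0.3]}

def fit_gaussian(xs):
    # Maximum-likelihood estimates: sample mean and (biased) variance.
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var

params = {c: fit_gaussian(xs) for c, xs in samples.items()}

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Class-conditional likelihoods P(x_j = 2.0 | c) under the fitted normals.
for c, (mu, var) in params.items():
    print(c, gaussian_pdf(2.0, mu, var))
```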

Page 21: NBC – Practical Issues (3)

• What if the features are continuous?
• Assume a parameterized distribution for $x_j$, e.g., normal:
  • Learn the parameters of the distribution from data, e.g., the mean and variance of the $x_j$ values.
  • Determine the parameters that maximize the likelihood.
• Or discretize the feature (a sketch follows below):
  • E.g., price $\in \mathbb{R}$ mapped to price $\in \{low, moderate, high\}$
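
The discretization mentioned above, as a tiny helper (the cut points are arbitrary, chosen only to illustrate the mapping):

```python
def discretize_price(price: float) -> str:
    # Map a real-valued price into {low, moderate, high}.
    # The thresholds 100 and 500 are illustrative, not from the slides.
    if price < 100:
        return "low"
    elif price < 500:
        return "moderate"
    return "high"

print([discretize_price(p) for p in (40.0, 250.0, 900.0)])
# ['low', 'moderate', 'high']
```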

Page 22: NBC – Practical Issues (4)

• If there are no examples in class $c_k$ for which $x_j = v$, then $P(x_j = v \mid c_k) = 0$ and

$$\prod_{j} P(x_j \mid c_k)\, P(c_k) = 0$$

• Use the m-estimate, defined as follows (a sketch follows below):

$$P(x_j = v \mid c_k) = \frac{|\{\langle \mathbf{x}_i, y_i \rangle : x_{ij} = v \wedge y_i = c_k\}| + mp}{|\{\langle \mathbf{x}_i, y_i \rangle : y_i = c_k\}| + m}$$

• $p$ – prior estimate of the probability
• $m$ – equivalent sample size (how heavily to weight $p$ relative to the observed data)
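
The m-estimate as a one-line function (the counts, $p$, and $m$ in the usage lines are invented):

```python
def m_estimate(count_v_and_c: int, count_c: int, p: float, m: float) -> float:
    # (n_c + m*p) / (n + m): blends the observed fraction with the prior p,
    # where m acts as an equivalent sample size for the prior.
    return (count_v_and_c + m * p) / (count_c + m)

# The unsmoothed estimate would be 0/8 = 0; the m-estimate keeps it positive.
print(m_estimate(0, 8, p=1/3, m=3))  # (0 + 1) / (8 + 3) ≈ 0.091
print(m_estimate(5, 8, p=1/3, m=3))  # (5 + 1) / 11 ≈ 0.545
```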

Page 23: Example – Learn to Classify Text

• Problem definition:
  • Given a set of news articles that are of interest, we would like to learn to classify the articles by topic.
• Naïve Bayes is among the most effective algorithms for this task.
• What attributes will represent the documents?
  • Vector of words – one attribute per word position in the document
• What is the target concept?
  • Is the document interesting?
  • Topic of the document

Page 24: Algorithm – Learn Naïve Bayes

• Collect all words and tokens that occur in the examples $D$.
  • Vocabulary – all distinct words and tokens in $D$.
• Compute the probabilities $P(c_j)$ and $P(w_k \mid c_j)$:
  • $D_j$ – examples for which the target label is $c_j$
  • $P(c_j) = \frac{|D_j|}{|D|}$
  • $n$ – total number of words in $D_j$ (counting duplicates multiple times)
  • For each word $w_k$ in Vocabulary:
    • $n_k$ – number of times word $w_k$ occurs in $D_j$
    • $P(w_k \mid c_j) = \frac{n_k + 1}{n + |Vocabulary|}$ (a sketch follows the next slide)

Page 25: Algorithm – Classify Naïve Bayes

• Given a test instance:
  • Compute the frequency of occurrence in the test instance of each term in the vocabulary.
  • Apply the naïve Bayes classification rule, as in the sketch below.
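
A compact, runnable sketch combining the two algorithms above (Learn and Classify). The two-topic corpus is invented for illustration; the word probabilities follow the $\frac{n_k + 1}{n + |Vocabulary|}$ rule from the previous slide:

```python
import math
from collections import Counter

# Toy labelled corpus, invented for illustration.
docs = [("the match was a great win for the team", "sports"),
        ("the team scored in the final match", "sports"),
        ("the election results were announced today", "politics"),
        ("the candidate won the election vote", "politics")]

# --- Learn_Naive_Bayes ---
vocabulary = {w for text, _ in docs for w in text.split()}
classes = {c for _, c in docs}
prior, cond = {}, {}
for c in classes:
    texts = [text.split() for text, label in docs if label == c]
    prior[c] = len(texts) / len(docs)       # P(c_j) = |D_j| / |D|
    words = [w for t in texts for w in t]
    n, counts = len(words), Counter(words)
    # P(w_k | c_j) = (n_k + 1) / (n + |Vocabulary|): Laplace-smoothed.
    cond[c] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}

# --- Classify_Naive_Bayes ---
def classify(text):
    words = [w for w in text.split() if w in vocabulary]  # skip unseen words
    scores = {c: math.log(prior[c]) +
                 sum(math.log(cond[c][w]) for w in words)
              for c in classes}
    return max(scores, key=scores.get)

print(classify("the team won the match"))  # 'sports'
```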

Page 26: Example: 20 Newsgroups

• Given 1000 training documents from each group.
• Learn to classify new documents according to the newsgroup they came from.
• NBC achieves 89% accuracy.

Page 27: Bayesian Networks (1)

• The naïve Bayes assumption of conditional independence is too restrictive.
• But the problem is intractable without some conditional independence assumption.
• Bayesian networks describe conditional independence among subsets of variables.
• This allows combining prior knowledge about (in)dependencies among variables with training data.
• Recollect – conditional independence: $X$ is conditionally independent of $Y$ given $Z$ if $P(X \mid Y, Z) = P(X \mid Z)$.

Page 28: Bayesian Network – Example

[Figure: a Bayesian network over the variables Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire; Storm and BusTourGroup are the parents of Campfire.]

Conditional probability table for Campfire:

         S,B    S,¬B   ¬S,B   ¬S,¬B
  C      0.4    0.1    0.8    0.2
  ¬C     0.6    0.9    0.2    0.8

Page 29: Bayesian Networks (2)

• The network represents the joint probability distribution over all variables:
  • $P(Storm, BusTourGroup, \ldots, ForestFire)$
• In general,

$$P(y_1, y_2, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \mid Parents(y_i))$$

where $Parents(y_i)$ denotes the immediate predecessors of $y_i$ in the graph.

• What is the Bayes network corresponding to the naïve Bayes classifier?
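
A sketch of reading a joint probability off this factorization, using the Campfire fragment from the example slide (only Storm, BusTourGroup, and Campfire are modelled here; the marginals for Storm and BusTourGroup are invented, while the Campfire CPT is the one shown two slides above):

```python
# P(S) and P(B): Storm and BusTourGroup have no parents in the example
# network, so they get marginals. These numbers are illustrative.
p_storm = {True: 0.2, False: 0.8}
p_bus = {True: 0.5, False: 0.5}

# P(Campfire = True | Storm, BusTourGroup), from the CPT on the example slide.
p_camp_given = {(True, True): 0.4, (True, False): 0.1,
                (False, True): 0.8, (False, False): 0.2}

def joint(storm: bool, bus: bool, campfire: bool) -> float:
    # P(S, B, C) = P(S) * P(B) * P(C | S, B): each variable is conditioned
    # only on its parents.
    p_c = p_camp_given[(storm, bus)]
    if not campfire:
        p_c = 1 - p_c
    return p_storm[storm] * p_bus[bus] * p_c

# The joint probabilities over all assignments sum to 1.
total = sum(joint(s, b, c) for s in (True, False)
            for b in (True, False) for c in (True, False))
print(joint(True, True, True), total)  # 0.04, 1.0
```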

Page 30: Bayesian Networks (3)

• Inference:
  • A Bayes network encodes all the information required for inference.
  • Exact inference methods work well for some network structures.
  • Monte Carlo methods simulate the network randomly to calculate approximate solutions.
• Learning:
  • If the structure is known and there are no missing values, it is easy to learn a Bayes network.
  • If the structure is known and there are some missing values, use the expectation-maximization (EM) algorithm.
  • If the structure is unknown, the problem is very difficult.

Page 31: Summary

• Bayes’ rule
• Bayes optimal classifier
• Practical naïve Bayes classifier
  • Example – text classification task
  • Maximum-likelihood estimates
• Bayesian networks