Page 1

MACHINE LEARNING

Vasant Honavar
Artificial Intelligence Research Laboratory
Department of Computer Science
Bioinformatics and Computational Biology Program
Center for Computational Intelligence, Learning, & Discovery
Iowa State University
[email protected]
www.cs.iastate.edu/~honavar/
www.cild.iastate.edu/

Copyright Vasant Honavar, 2006.
Iowa State University, Department of Computer Science, Artificial Intelligence Research Laboratory

Page 2

Recall the Bayesian recipe for classification

• The Bayesian recipe is simple, optimal, and in principle, straightforward to apply

• To use this recipe in practice, we need to know P(X|ωi) – the generative model for data for each class and P(ωi) – the prior probabilities of classes

• Because these probabilities are unknown, we need to estimate them from data – or learn them!

• X is typically high-dimensional
• Need to estimate P(X|ωi) from limited data

Page 3

Naïve Bayes Classifier

• We can classify X if we know P(X|ωi)
• How to learn P(X|ωi)?
• One solution: assume that the random variables in X are conditionally independent given the class.
• Result: the Naïve Bayes classifier, which performs optimally under certain assumptions
• A simple, practical learning algorithm grounded in probability theory

When to use
• Attributes that describe instances are likely to be conditionally independent given the classification
• The data is insufficient to estimate all the probabilities reliably if we do not assume independence

Page 4

Naïve Bayes Classifier

Successful applications
• Diagnosis
• Document classification
• Protein function classification
• Prediction of protein-protein interfaces
and many others…

Page 5

Conditional Independence

Let $Z_1, \dots, Z_n$ and $W$ be random variables on a given event space. $Z_1, \dots, Z_n$ are mutually independent given $W$ if

$$P(Z_1, \dots, Z_n \mid W) = \prod_{i=1}^{n} P(Z_i \mid W)$$

Note that these represent sets of equations, one for each possible assignment of values to the random variables.

Page 6

Implications of Independence

• Suppose we have 5 binary attributes and a binary class label
• Without independence, in order to specify the joint distribution, we need to specify a probability for each possible assignment of values to the variables, resulting in a table of size 2^6 = 64
• Suppose the features are independent given the class label – we only need 5 × (2 × 2) = 20 entries
• The reduction in the number of probabilities to be estimated is even more striking when N, the number of attributes, is large – from O(2^N) to O(N)
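The parameter counts above can be checked with a quick sketch (the function names here are illustrative, not from the slides):

```python
# Number of probabilities needed to specify a distribution over
# n attributes plus a class label, with and without the
# conditional-independence assumption.

def joint_table_size(n_attrs, n_vals=2, n_classes=2):
    # One entry per joint assignment to (attributes, class): n_vals^n * n_classes.
    return n_vals ** n_attrs * n_classes

def naive_bayes_table_size(n_attrs, n_vals=2, n_classes=2):
    # One conditional table of n_vals entries per attribute per class.
    return n_attrs * n_vals * n_classes

print(joint_table_size(5))        # 64, i.e. 2^6
print(naive_bayes_table_size(5))  # 20, i.e. 5 x (2 x 2)
```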

Page 7

Naive Bayes Classifier

Consider a discrete-valued target function $f : \chi \rightarrow \Omega$, where an instance $X = (X_1, \dots, X_n)$ is described in terms of attribute values $X_1 = x_1, X_2 = x_2, \dots, X_n = x_n$, with $x_i \in Domain(X_i)$.

$$\omega_{MAP} = \arg\max_{\omega_j \in \Omega} P(\omega_j \mid X_1 = x_1, \dots, X_n = x_n)$$
$$= \arg\max_{\omega_j \in \Omega} \frac{P(X_1 = x_1, \dots, X_n = x_n \mid \omega_j)\, P(\omega_j)}{P(X_1 = x_1, \dots, X_n = x_n)}$$
$$= \arg\max_{\omega_j \in \Omega} P(X_1 = x_1, \dots, X_n = x_n \mid \omega_j)\, P(\omega_j)$$

ω_MAP is called the maximum a posteriori classification.

Page 8

Naive Bayes Classifier

$$\omega_{MAP} = \arg\max_{\omega_j \in \Omega} P(\omega_j \mid X_1 = x_1, \dots, X_n = x_n) = \arg\max_{\omega_j \in \Omega} P(X_1 = x_1, \dots, X_n = x_n \mid \omega_j)\, P(\omega_j)$$

If the attributes are independent given the class, we have

$$\omega_{NB} = \arg\max_{\omega_j \in \Omega} P(\omega_j) \prod_{i=1}^{n} P(X_i = x_i \mid \omega_j)$$

Page 9

Naive Bayes Learner

For each possible value $\omega_j$ of $\Omega$:
  $\hat{P}(\Omega = \omega_j) \leftarrow Estimate(P(\Omega = \omega_j), D)$
  For each possible value $a_{ik}$ of $X_i$:
    $\hat{P}(X_i = a_{ik} \mid \omega_j) \leftarrow Estimate(P(X_i = a_{ik} \mid \omega_j), D)$

Classify a new instance $X = (x_1, \dots, x_n)$:

$$c(X) = \arg\max_{\omega_j \in \Omega} \hat{P}(\omega_j) \prod_{i=1}^{n} \hat{P}(X_i = x_i \mid \omega_j)$$

Estimate is a procedure for estimating the relevant probabilities from a set of training examples.
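The learner above can be sketched in a few lines, using relative frequency as the Estimate procedure (a minimal sketch with no smoothing; the function names and the data literals, taken from the dating-preferences example later in these notes, are illustrative):

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    # examples: list of (attribute_tuple, class_label).
    class_counts = Counter(label for _, label in examples)
    n = len(examples)
    priors = {c: class_counts[c] / n for c in class_counts}
    cond = defaultdict(Counter)          # (attribute index, class) -> value counts
    for x, c in examples:
        for i, v in enumerate(x):
            cond[(i, c)][v] += 1
    def p(i, v, c):
        # Relative-frequency estimate of P(X_i = v | class c).
        return cond[(i, c)][v] / class_counts[c]
    return priors, p

def classify(x, priors, p):
    # argmax over classes of P(w_j) * prod_i P(x_i | w_j).
    return max(priors, key=lambda c: priors[c] *
               math.prod(p(i, v, c) for i, v in enumerate(x)))

# Dating-preferences training data: (Height, Hair, Eye) -> class.
data = [(("t","d","l"),"+"), (("s","d","l"),"+"), (("t","b","l"),"-"),
        (("t","r","l"),"-"), (("s","b","l"),"-"), (("t","b","w"),"+"),
        (("t","d","w"),"+"), (("s","b","w"),"+")]
priors, p = train_naive_bayes(data)
print(classify(("t","b","l"), priors, p))  # "-"
```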

Page 10

Estimation of Probabilities from Small Samples

$$\hat{P}(X_i = a_{ik} \mid \omega_j) \leftarrow \frac{n_k + m\,p}{n + m}$$

where
• $n$ is the number of training examples of class $\omega_j$
• $n_k$ is the number of training examples of class $\omega_j$ which have attribute value $a_{ik}$ for attribute $X_i$
• $p$ is the prior estimate for $\hat{P}(X_i = a_{ik} \mid \omega_j)$
• $m$ is the weight given to the prior

As $n \rightarrow \infty$, $\hat{P}(X_i = a_{ik} \mid \omega_j) \rightarrow n_k / n$, the relative frequency.

This is effectively the same as using Dirichlet priors, as we shall see later.
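The m-estimate is a one-line formula; a small sketch makes its behavior at the extremes concrete (function name illustrative):

```python
# m-estimate of a conditional probability, as on the slide:
# P_hat(X_i = a_ik | w_j) = (n_k + m * p) / (n + m),
# where n_k of the n class-w_j examples have X_i = a_ik,
# p is a prior estimate, and m is the weight given to the prior.

def m_estimate(n_k, n, p, m):
    return (n_k + m * p) / (n + m)

# With no data the estimate is just the prior p; as n grows it
# approaches the relative frequency n_k / n.
print(m_estimate(0, 0, 0.5, 2))        # 0.5 (prior only)
print(m_estimate(3, 5, 0.5, 2))        # (3 + 1) / 7, between prior and 3/5
print(m_estimate(3000, 5000, 0.5, 2))  # close to 3/5
```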

Page 11

Sample Applications of Naïve Bayes Classifier

Naive Bayes is among the most useful algorithms
• Learning dating preferences
• Learn which news articles are of interest
• Learn to classify web pages by topic
• Learn to classify SPAM
• Learn to assign proteins to functional families based on amino acid composition

What attributes shall we use to represent text?

Page 12

Learning Dating Preferences

Instances – ordered 3-tuples of attribute values corresponding to
Height (tall, short), Hair (dark, blonde, red), Eye (blue, brown)

Classes – +, –

Training Data
Instance  (Height, Hair, Eye)  Class label
I1        (t, d, l)            +
I2        (s, d, l)            +
I3        (t, b, l)            –
I4        (t, r, l)            –
I5        (s, b, l)            –
I6        (t, b, w)            +
I7        (t, d, w)            +
I8        (s, b, w)            +

Page 13

Probabilities to estimate

P(+) = 5/8,  P(–) = 3/8

P(Height | c):    t     s
  +              3/5   2/5
  –              2/3   1/3

P(Hair | c):      d     b     r
  +              3/5   2/5    0
  –               0    2/3   1/3

P(Eye | c):       l     w
  +              2/5   3/5
  –               1     0

Classify (Height=t, Hair=b, Eye=l):
P(X | +) = (3/5)(2/5)(2/5) = 12/125
P(X | –) = (2/3)(2/3)(1) = 4/9
Classification = ?

Classify (Height=t, Hair=r, Eye=w)

Note the problem with zero probabilities.
Solution – use Laplacian estimates.
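The worked example can be checked with exact arithmetic (a small sketch; variable names are illustrative):

```python
from fractions import Fraction as F

# Class-conditional products for the instance (Height=t, Hair=b, Eye=l),
# read off the tables above.
p_plus  = F(3, 5) * F(2, 5) * F(2, 5)   # P(X | +) = 12/125
p_minus = F(2, 3) * F(2, 3) * F(1, 1)   # P(X | -) = 4/9
print(p_plus, p_minus)

# Multiplying in the priors P(+) = 5/8 and P(-) = 3/8 still favors "-":
assert F(3, 8) * p_minus > F(5, 8) * p_plus
```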

Page 14

Learning to Classify Text

• Target concept Interesting? : Documents → {+, –}
• Learning: use training examples to estimate P(+), P(–), P(d | +), P(d | –)

Alternative generative models for documents:
• Represent each document by its sequence of words
  – In the most general case, we need a probability for each word occurrence in each position in the document, for each possible document length
  – Too many probabilities to estimate!
• Represent each document by a tuple of word counts

Page 15

Learning to Classify Text

$$P(d \mid \omega_j) = P(length(d)) \times \prod_{i=1}^{length(d)} P(X_i \mid \omega_j, length(d))$$

This would require estimating, for each document position, probabilities for each possible document length! To simplify matters, assume that the probability of encountering a specific word in a particular position is independent of the position, and of document length.

Treat each document as a bag of words!

Page 16

Bag of Words Representation

So we estimate one position-independent, class-conditional probability $P(w_k \mid \omega_j)$ for each word, instead of the set of probabilities $P(X_1 = w_k \mid \omega_j), \dots, P(X_{length(d)} = w_k \mid \omega_j)$.

The number of probabilities to be estimated drops to $|Vocabulary| \times |\Omega|$.

The result is a generative model for documents that treats each document as an ordered tuple of word frequencies

More sophisticated models can consider dependencies between adjacent word positions (Markov models – we will come back to these later)

Page 17

Learning to Classify Text

With the bag of words representation, we have

$$P(d \mid \omega_j) \;\propto\; \left\{ \frac{\left(\sum_k n_{kd}\right)!}{\prod_k n_{kd}!} \right\} \prod_k P(w_k \mid \omega_j)^{n_{kd}}$$

where $n_{kd}$ is the number of occurrences of $w_k$ in document $d$ (ignoring dependence on the length of the document). We can estimate $P(w_k \mid \omega_j)$ from the labeled bags of words we have.
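For classification, the multinomial coefficient depends only on the document, not the class, so it cancels in the argmax; only the per-word log-probabilities weighted by counts matter. A minimal sketch (the word probabilities below are made-up toy numbers, not estimates from any corpus):

```python
import math
from collections import Counter

# Bag-of-words log score: log P(w_j) + sum_k n_kd * log P(w_k | w_j).
def log_score(doc_words, log_prior, log_p_word):
    counts = Counter(doc_words)
    return log_prior + sum(n * log_p_word[w] for w, n in counts.items())

# Illustrative class-conditional word probabilities for two toy classes:
p_sport = {"goal": 0.5, "vote": 0.1, "game": 0.4}
p_news  = {"goal": 0.1, "vote": 0.6, "game": 0.3}

doc = ["goal", "game", "goal"]
s_sport = log_score(doc, math.log(0.5),
                    {w: math.log(p) for w, p in p_sport.items()})
s_news  = log_score(doc, math.log(0.5),
                    {w: math.log(p) for w, p in p_news.items()})
print("sport" if s_sport > s_news else "news")
```

Working in log space avoids underflow when documents are long and word probabilities are small.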

Page 18

• Given 1000 training documents from each group, learn to classify new documents according to the newsgroups to which they belong

• Naive Bayes achieves 89% classification accuracy

Naïve Bayes Text Classifier

comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, alt.atheism, soc.religion.christian, talk.religion.misc, talk.politics.mideast, talk.politics.misc, talk.politics.guns, sci.space, sci.crypt, sci.electronics, sci.med

Page 19

Naïve Bayes Text Classifier

Representative article from rec.sport.hockey:

Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!uwm.edu
From: [email protected] (John Doe)
Subject: Re: This year's biggest and worst (opinion)...
Date: 5 Apr 93 09:53:39 GMT

I can only comment on the Kings, but the most obvious candidate for pleasant surprise is Alex Zhitnik. He came highly touted as a defensive defenseman, but he's clearly much more than that. Great skater and hard shot (though wish he were more accurate). In fact, he pretty much allowed the Kings to trade away that huge defensive liability Paul Coffey. Kelly Hrudey is only the biggest disappointment if you thought he was any good to begin with. But, at best, he's only a mediocre goaltender. A better choice would be Tomas Sandstrom, though not through any fault of his own, but because some thugs in Toronto decided ….

Page 20

Sequence Classification

Need a generative model for sequences.
Simplest alternative – a sequence-length-independent multinomial (bag of letters) model!
More sophisticated alternatives are possible – for example, Markov models that capture dependencies among small windows of neighboring letters.

Page 21

Naïve Bayes Learner – Summary

• Produces a minimum-error classifier if the attributes are conditionally independent given the class

When to use
• Attributes that describe instances are likely to be conditionally independent given the classification
• There is not enough data to estimate all the probabilities reliably if we do not assume independence

• Often works well even when the independence assumption is violated (Domingos and Pazzani, 1996)
• Can be used iteratively – Kang et al., 2006

Page 22

Estimating probabilities from data (discrete case)

• Maximum likelihood estimation
• Bayesian estimation
• Maximum a posteriori estimation

Page 23

Example: Binomial Experiment

• When tossed, the thumbtack can land in one of two positions: Head or Tail
• We denote by θ the (unknown) probability P(H)
• Estimation task: given a sequence of toss samples x[1], x[2], …, x[M], we want to estimate the probabilities P(H) = θ and P(T) = 1 − θ

Page 24

Statistical parameter fitting

Consider samples x[1], x[2], …, x[M] such that
• The set of values that X can take is known
• Each is sampled from the same distribution
• Each is sampled independently of the rest
i.e., the samples are i.i.d. (independent and identically distributed).

The task is to find a parameter Θ so that the data can be summarized by a probability P(x[j] | Θ).
• The parameters depend on the given family of probability distributions: multinomial, Gaussian, Poisson, etc.
• We will focus first on binomial and then on multinomial distributions
• The main ideas generalize to other distribution families

Page 25

The Likelihood Function

How good is a particular θ? It depends on how likely it is to generate the observed data:

$$L(\theta : D) = P(D \mid \theta) = \prod_m P(x[m] \mid \theta)$$

The likelihood for the sequence H, T, T, H, H is

$$L(\theta : D) = \theta \cdot (1 - \theta) \cdot (1 - \theta) \cdot \theta \cdot \theta$$

[Figure: L(θ : D) plotted against θ ∈ [0, 1]]

Page 26

Likelihood function

• The likelihood function L(θ : D) provides a measure of the relative preferences for various values of the parameter θ, given a collection of observations D drawn from a distribution that is parameterized by a fixed but unknown θ.
• L(θ : D) is the probability of the observed data D, considered as a function of θ.
• Suppose the data D is 5 heads out of 8 tosses. What is the likelihood function, assuming that the observations were generated by a binomial distribution with an unknown but fixed parameter θ?

$$\binom{8}{5}\, \theta^5 (1 - \theta)^3$$

Page 27

Sufficient Statistics

• To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails):

$$L(\theta : D) = \theta^{N_H} \cdot (1 - \theta)^{N_T}$$

• N_H and N_T are sufficient statistics for the parameter θ that specifies the binomial distribution
• A statistic is simply a function of the data
• A sufficient statistic s for a parameter θ is a function that summarizes from the data D the relevant information s(D) needed to compute the likelihood L(θ : D)
• If s is a sufficient statistic and s(D) = s(D′), then L(θ : D) = L(θ : D′)
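A quick numeric check that the likelihood of a toss sequence depends only on the counts (N_H, N_T), not on the order of outcomes (a small sketch; names are illustrative):

```python
# Likelihood of a sequence of coin tosses, computed toss by toss.
def likelihood(theta, seq):
    p = 1.0
    for x in seq:
        p *= theta if x == "H" else (1.0 - theta)
    return p

theta = 0.3
# Equals theta^N_H * (1 - theta)^N_T for N_H = 3, N_T = 2 ...
assert abs(likelihood(theta, "HTTHH")
           - theta ** 3 * (1 - theta) ** 2) < 1e-12
# ... and any reordering of the sequence gives the same value.
assert abs(likelihood(theta, "HHHTT") - likelihood(theta, "HTTHH")) < 1e-12
```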

Page 28

Maximum Likelihood Estimation

• Main idea: learn parameters that maximize the likelihood function
• Maximum likelihood estimation is
  • Intuitively appealing
  • One of the most commonly used estimators in statistics
• Assumes that the parameter to be estimated is fixed, but unknown

Page 29

Example: MLE for Binomial Data

• Applying the MLE principle we get

$$\theta_{ML} = \frac{N_H}{N_H + N_T}$$

• (Why?)

Example: (N_H, N_T) = (3, 2); the ML estimate is 3/5 = 0.6

[Figure: likelihood L(θ : D) over θ ∈ [0, 1]]

Page 30

MLE for Binomial Data

$$L(\theta : D) = \theta^{N_H} \cdot (1 - \theta)^{N_T}$$
$$\log L(\theta : D) = N_H \log\theta + N_T \log(1 - \theta)$$

The likelihood is positive for all legitimate values of θ, so maximizing the likelihood is equivalent to maximizing its logarithm, i.e. the log likelihood.

At extrema of L(θ : D):

$$\frac{\partial}{\partial\theta} \log L(\theta : D) = \frac{N_H}{\theta} - \frac{N_T}{1 - \theta} = 0$$
$$N_H (1 - \theta) - N_T\, \theta = 0 \quad\Rightarrow\quad \theta_{ML} = \frac{N_H}{N_H + N_T}$$

Note that the likelihood is indeed maximized at θ = θ_ML, because in the neighborhood of θ_ML the value of the likelihood is smaller than it is at θ = θ_ML.

Page 31

Maximum and curvature of likelihood around the maximum

• At the maximum, the derivative of the log likelihood is zero
• At the maximum, the second derivative is negative
• The curvature of the log likelihood is defined as

$$I(\theta) = -\frac{\partial^2}{\partial\theta^2} \log L(\theta : D)$$

• A large observed curvature I(θ_ML) at θ = θ_ML is associated with a sharp peak, intuitively indicating less uncertainty about the maximum likelihood estimate
• I(θ_ML) is called the Fisher information

Page 32

Maximum Likelihood Estimate

The ML estimate can be shown to be
• Asymptotically unbiased:

$$\lim_{N \to \infty} E\left(\theta_{ML}\right) = \theta_{True}$$

• Asymptotically consistent – converges to the true value as the number of examples approaches infinity:

$$\lim_{N \to \infty} \Pr\left(\left|\theta_{ML} - \theta_{True}\right| \le \varepsilon\right) = 1, \qquad \lim_{N \to \infty} E\left(\left|\theta_{ML} - \theta_{True}\right|^2\right) = 0$$

• Asymptotically efficient – achieves the lowest variance that any estimate can achieve for a training set of a certain size (satisfies the Cramér-Rao bound)

Page 33

Maximum Likelihood Estimate

• The ML estimate can be shown to be representationally invariant – if θ_ML is an ML estimate of θ, and g(θ) is a function of θ, then g(θ_ML) is an ML estimate of g(θ)
• When the number of samples is large, the probability distribution of θ_ML is Gaussian with mean θ_True (the actual value of the parameter) – a consequence of the central limit theorem: a random variable which is a sum of a large number of random variables has a Gaussian distribution, and the ML estimate is related to such a sum
• We can use the likelihood ratio to reject the null hypothesis corresponding to θ = θ_0 as unsupported by the data if the ratio of the likelihoods evaluated at θ_0 and at θ_ML is small. (The ratio can be calibrated when the likelihood function is approximately quadratic.)

Page 34

Naïve Bayes Classifier

• We can define the likelihood for a Naïve Bayes classifier
• Let Θ_j be the class-conditional probabilities for class j, and L_j the corresponding likelihood
• L_j factorizes:

$$L_j(\Theta_j : D) = \prod_p P(x_1[p], \dots, x_n[p] : \Theta_j) = \prod_p \prod_i P(x_i[p] : \Theta_{ij}) = \prod_i \prod_p P(x_i[p] : \Theta_{ij}) = \prod_i L(\Theta_{ij} : D)$$

(the first step uses i.i.d. samples; the second uses the independence factorization)

• Each Θ_ij specifies a binomial distribution associated with class j for the ith attribute

Page 35

Naïve Bayes Classifier

• Decomposition ⇒ independent estimation problems

• If the parameters for each family are decoupled via independence, then they can be estimated independently of each other

Page 36

From Binomial to Multinomial

• Suppose a random variable X can take the values 1, 2, …, K
• We want to learn the parameters θ_1, θ_2, …, θ_K
• Sufficient statistics: N_1, N_2, …, N_K – the number of times each outcome is observed
• Likelihood function:

$$L(\theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}$$

• ML estimate:

$$\hat{\theta}_k = \frac{N_k}{\sum_{\ell} N_{\ell}}$$
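The multinomial ML estimate is just the vector of relative frequencies; a minimal sketch (function name illustrative):

```python
from collections import Counter

# ML estimates for a multinomial: theta_k = N_k / sum_l N_l.
def mle_multinomial(samples):
    counts = Counter(samples)
    n = len(samples)
    return {k: counts[k] / n for k in counts}

theta = mle_multinomial([1, 2, 2, 3, 3, 3])
print(theta)  # outcome 1 -> 1/6, outcome 2 -> 1/3, outcome 3 -> 1/2
```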

Page 37

MLE estimates for Naive Bayes Classifiers

• When we assume that P(X_i | C) is multinomial, we get the decomposition:

$$L(\Theta : D) = \prod_m \prod_i P(x_i[m] \mid c[m] : \Theta) = \prod_j \prod_i \prod_{x_i} P(x_i \mid c_j : \Theta)^{N(x_i,\, c_j)} = \prod_j \prod_i \prod_{x_i} \theta_{x_i \mid c_j}^{N(x_i,\, c_j)}$$

• For each class we get an independent multinomial estimation problem
• The MLE is

$$\hat{\theta}_{x_i \mid c_j} = \frac{N(x_i, c_j)}{N(c_j)}$$

Page 38

Summary of Maximum Likelihood Estimation

• Define a likelihood function, which is a measure of how likely it is that the observed data were generated from a probability distribution with a particular choice of parameters
• Select the parameters that maximize the likelihood
• In simple cases, the ML estimate has a closed-form solution
• In other cases, ML estimation may require numerical optimization
• Problem with the ML estimate – it assigns zero probability to unobserved values, which can lead to difficulties when estimating from small samples
• Question – how would a Naïve Bayes classifier behave if some of the class-conditional probability estimates are zero?

Page 39

Bayesian Estimation

• MLE commits to a specific value of the unknown parameter(s)
• MLE is the same in both cases shown

[Figure: two likelihood functions over θ ∈ [0, 1], compared side by side]

Of course, in general, one cannot summarize a function by a single number!
Intuitively, the confidence in the estimates should be different.

Page 40

Bayesian Estimation

The Maximum Likelihood approach is frequentist at its core:
• Assumes there is an unknown but fixed parameter θ
• Estimates θ with some confidence
• Predicts probabilities using the estimated parameter value

The Bayesian approach:
• Represents uncertainty about the unknown parameter
• Uses probability to quantify this uncertainty: unknown parameters are treated as random variables
• Prediction follows from the rules of probability: expectation over the unknown parameters

Page 41

Example: Binomial Data Revisited

• Suppose that we choose a uniform prior p(θ) = 1 for θ in [0, 1]
• Then p(θ | D) is proportional to the likelihood L(θ : D):

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int_0^1 p(D \mid \theta)\, p(\theta)\, d\theta}$$

In this case, $p(D \mid \theta) = \theta^4 (1 - \theta)$ and $p(\theta) = 1$ for all $\theta \in [0, 1]$, so

$$p(D) = \int_0^1 \theta^4 (1 - \theta)\, d\theta = \left[\frac{\theta^5}{5} - \frac{\theta^6}{6}\right]_0^1 = \frac{1}{30}$$
$$p(\theta \mid D) = 30\, \theta^4 (1 - \theta)$$
$$P(x[M+1] = H \mid D) = \int_0^1 \theta\, p(\theta \mid D)\, d\theta = 30 \int_0^1 \theta^5 (1 - \theta)\, d\theta = 30\left[\frac{1}{6} - \frac{1}{7}\right] = \frac{5}{7} \approx 0.7142$$

Page 42

Example: Binomial Data Revisited

$$P(x[M+1] = H \mid D) = \int_0^1 \theta \cdot P(\theta \mid D)\, d\theta = \frac{5}{7} \approx 0.7142$$

(N_H, N_T) = (4, 1); the MLE for P(X = H) is 4/5 = 0.8, while the Bayesian estimate is 5/7 ≈ 0.71.

In this example, the MLE and the Bayesian prediction differ. It can be proved that if the prior is well-behaved – i.e. does not assign zero density to any feasible parameter value – then both the MLE and the Bayesian estimate converge to the same value in the limit.

Both almost surely converge to the underlying distribution P(X). But the ML and Bayesian approaches behave differently when the number of samples is small.
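The posterior predictive under a uniform prior can be checked with exact arithmetic, using the integral $\int_0^1 \theta^a (1-\theta)^b\, d\theta = \frac{a!\, b!}{(a+b+1)!}$ for integer a, b (a small sketch; function names are illustrative):

```python
from fractions import Fraction as F
from math import factorial

# int_0^1 theta^a (1 - theta)^b d(theta) = a! b! / (a + b + 1)!  (integer a, b)
def beta_int(a, b):
    return F(factorial(a) * factorial(b), factorial(a + b + 1))

n_h, n_t = 4, 1
# P(x[M+1] = H | D) = int theta * p(theta | D) d(theta) with a uniform prior;
# the normalizer cancels, leaving a ratio of two Beta integrals.
pred = beta_int(n_h + 1, n_t) / beta_int(n_h, n_t)
print(pred)  # 5/7, versus the MLE 4/5
```

The result equals (N_H + 1) / (M + 2), Laplace's rule of succession.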


All relative frequencies are not equi-probable

• In practice we might want priors that express our beliefs regarding the parameter to be estimated

• For example, we might want a prior that assigns a higher probability to parameter values that describe a fair coin than to those that describe an unfair coin

• The beta distribution allows us to capture such prior beliefs


Beta distribution

Gamma function:

$$\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, dt$$

The integral converges if and only if x > 0. If x is an integer that is greater than 0, it can be shown that

$$\Gamma(x) = (x-1)! \qquad \text{so} \qquad \frac{\Gamma(x+1)}{\Gamma(x)} = x$$

The beta density function with parameters a, b, where a, b are real numbers > 0 and N = a + b, is:

$$beta(\theta; a, b) = \frac{\Gamma(N)}{\Gamma(a)\,\Gamma(b)}\, \theta^{a-1} (1-\theta)^{b-1}, \qquad \text{where } 0 \le \theta \le 1$$

Beta distribution

If θ has the distribution given by beta(θ; a, b), then

$$E[\theta] = \frac{a}{N}$$

If a, b are real numbers > 0, then

$$\int_0^1 \theta^{a} (1-\theta)^{b}\, d\theta = \frac{\Gamma(a+1)\,\Gamma(b+1)}{\Gamma(a+b+2)}$$

Let $D = x[1], \ldots, x[M]$ be a sequence of iid samples from a binomial distribution; let $N_H = s$ and $N_T = t$, and let $p(\theta) = beta(\theta; a, b)$. Then we can show that

$$p(\theta \mid D) = beta(\theta;\, a+s,\, b+t)$$

Update of the parameter with a beta prior based on data yields a beta posterior


Conjugate Families

• The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy
• Conjugate families are useful because:
  – For many distributions we can represent them with hyperparameters
  – They allow for sequential update to obtain the posterior
  – In many cases we have a closed-form solution for prediction
• The beta prior is a conjugate family for the binomial likelihood


Bayesian prediction

Prior: $beta(\theta;\, a, b)$

Data: $D = x[1], \ldots, x[M]$ with counts $N_H$ and $N_T$

Posterior: $p(\theta \mid D) = beta(\theta;\, a + N_H,\, b + N_T)$

Prediction:

$$P(x[M{+}1] = H \mid D) = E[\theta \mid D] = \frac{a + N_H}{a + b + N_H + N_T}$$
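The beta-binomial recipe (prior hyperparameters plus observed counts give the posterior, whose mean is the prediction) is short enough to sketch directly. This is our own minimal illustration; the names `update` and `predict_heads` are assumptions, not from the slides.

```python
def update(a, b, n_h, n_t):
    """Posterior hyperparameters after observing n_h heads and n_t tails."""
    return a + n_h, b + n_t

def predict_heads(a, b):
    """Posterior predictive P(next toss = H) under beta(a, b): a / (a + b)."""
    return a / (a + b)

# Uniform prior beta(1, 1) plus the slides' data (N_H, N_T) = (4, 1):
a, b = update(1, 1, 4, 1)
print(predict_heads(a, b))   # 0.714285... = 5/7

# Sequential update: folding in the tosses one at a time yields the same
# posterior -- one reason conjugate families are convenient.
a2, b2 = 1, 1
for toss in "HHHHT":
    a2, b2 = update(a2, b2, toss == "H", toss == "T")
print((a2, b2) == (a, b))    # True
```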


Dirichlet Priors

• Recall that the likelihood function is

$$L(\Theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}$$

• A Dirichlet prior with hyperparameters $\alpha_1, \ldots, \alpha_K$ is defined as

$$P(\Theta) = \frac{\Gamma(N)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1},
\qquad \text{where } 0 \le \theta_k \le 1,\;\; \sum_{k=1}^{K} \theta_k = 1,\;\; N = \sum_{k=1}^{K} \alpha_k$$

• Then the posterior has the same form, with hyperparameters $\alpha_1 + N_1, \ldots, \alpha_K + N_K$:

$$P(\Theta \mid D) \propto P(\Theta)\, P(D \mid \Theta)
\propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \prod_{k=1}^{K} \theta_k^{N_k}
= \prod_{k=1}^{K} \theta_k^{\alpha_k + N_k - 1}$$


Dirichlet Priors

• Dirichlet priors enable closed-form prediction based on multinomial samples:
  – If P(Θ) is Dirichlet with hyperparameters $\alpha_1, \ldots, \alpha_K$, then

$$P(X[1] = k) = \int \theta_k \cdot P(\Theta)\, d\Theta = \frac{\alpha_k}{\sum_{\ell} \alpha_{\ell}}$$

• Since the posterior is also Dirichlet, we get

$$P(x[M{+}1] = k \mid D) = \int \theta_k \cdot P(\Theta \mid D)\, d\Theta = \frac{\alpha_k + N_k}{\sum_{\ell} \left(\alpha_{\ell} + N_{\ell}\right)}$$
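The closed-form Dirichlet prediction is just pseudocounts plus observed counts, normalized. A minimal sketch of our own (the function name `dirichlet_predict` is an assumption):

```python
def dirichlet_predict(alphas, counts):
    """Posterior predictive over K outcomes:
    P(x[M+1] = k | D) = (alpha_k + N_k) / sum_l (alpha_l + N_l)."""
    total = sum(a + n for a, n in zip(alphas, counts))
    return [(a + n) / total for a, n in zip(alphas, counts)]

# Uniform Dirichlet(1, 1, 1) prior over 3 outcomes; observed counts (2, 0, 1).
# Note the unseen outcome still gets nonzero probability.
probs = dirichlet_predict([1, 1, 1], [2, 0, 1])
print(probs)  # [0.5, 0.1666..., 0.3333...]
```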


Intuition behind priors

• The hyperparameters α1,…,αK can be thought of as imaginary counts from our prior experience

• Equivalent sample size = α1+…+αK

• The larger the equivalent sample size the more confident we are in our prior


Effect of Priors

Prediction of P(X = H) after seeing data with N_H = 0.25·N_T, for different sample sizes

[Two plots: one with a fixed ratio α_H / α_T and different strengths α_H + α_T; one with a fixed strength α_H + α_T and different ratios α_H / α_T]


Effect of Priors

• In real data, Bayesian estimates are less sensitive to noise in the data

[Plot: P(X = 1|D) as a function of N for the MLE and for Dirichlet(.5,.5), Dirichlet(1,1), Dirichlet(5,5), and Dirichlet(10,10) priors, with the individual toss results (0/1) shown along the bottom]


Conjugate Families

• The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy
  – The Dirichlet prior is a conjugate family for the multinomial likelihood
• Conjugate families are useful because:
  – For many distributions we can represent them with hyperparameters
  – They allow for sequential update within the same representation
  – In many cases we have a closed-form solution for prediction


Bayesian Estimation

$$P(x[M{+}1] \mid x[1], \ldots, x[M])
= \int P(x[M{+}1] \mid \theta,\, x[1], \ldots, x[M])\; P(\theta \mid x[1], \ldots, x[M])\, d\theta
= \int P(x[M{+}1] \mid \theta)\; P(\theta \mid x[1], \ldots, x[M])\, d\theta$$

where

$$\underbrace{P(\theta \mid x[1], \ldots, x[M])}_{\text{Posterior}}
= \frac{\overbrace{P(x[1], \ldots, x[M] \mid \theta)}^{\text{Likelihood}}\;\;\overbrace{P(\theta)}^{\text{Prior}}}
{\underbrace{P(x[1], \ldots, x[M])}_{\text{Probability of data}}}$$


Summary of Bayesian estimation

• Treat the unknown parameters as random variables• Assume a prior distribution for the unknown parameters• Update the distribution of the parameters based on data• Use Bayes rule to make prediction


Maximum a posteriori (MAP) estimates –
Reconciling ML and Bayesian approaches

$$\Theta_{MAP} = \arg\max_{\Theta} P(\Theta \mid D)
= \arg\max_{\Theta} \frac{P(D \mid \Theta)\, P(\Theta)}{P(D)}
= \arg\max_{\Theta} P(D \mid \Theta)\, P(\Theta)
= \arg\max_{\Theta} L(\Theta : D)\, P(\Theta)$$

(since P(D) does not depend on Θ)


Maximum a posteriori (MAP) estimates –
Reconciling ML and Bayesian approaches

$$\Theta_{MAP} = \arg\max_{\Theta}\, L(\Theta : D)\, P(\Theta)$$

As in Bayesian estimation, we treat the unknown parameters as random variables

But we estimate a single value for the parameter – the maximum a posteriori estimate, which corresponds to the most probable value of the parameter given the data, for a given choice of the prior


Back to Naïve Bayes Classifier

$$\hat{P}(X_i = a_i \mid \omega_j) = 0 \;\;\Rightarrow\;\; \hat{P}(\omega_j) \prod_{l} \hat{P}(X_l = a_l \mid \omega_j) = 0$$

If one of the attribute values has an estimated class-conditional probability of 0, it dominates all other attribute values

When we have few examples, this is more likely

Solution – use priors, e.g., assume each value to be equally likely unless the data indicates otherwise
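The "use priors" fix amounts to adding pseudocounts to the class-conditional estimates – e.g., Laplace (add-one) smoothing, which is a uniform Dirichlet prior over attribute values. A hedged sketch of our own (the name `smoothed_estimate` is an assumption, not from the slides):

```python
from collections import Counter

def smoothed_estimate(value, observed_values, num_values, alpha=1.0):
    """P-hat(X = value | class) with pseudocount alpha for each of the
    num_values possible attribute values; never returns 0 for alpha > 0."""
    counts = Counter(observed_values)
    return (counts[value] + alpha) / (len(observed_values) + alpha * num_values)

# Attribute with 3 possible values; within this class we only ever saw "a", "b":
data = ["a", "a", "b"]
print(smoothed_estimate("c", data, num_values=3))  # 0.1666... -- not 0
print(smoothed_estimate("a", data, num_values=3))  # 0.5
```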


Decision Tree Classifiers

• Decision tree representation for modeling dependencies among input variables
• Elements of information theory
• How to learn decision trees from data
• Over-fitting and how to minimize it
• How to deal with missing values in the data
• Learning decision trees from distributed data
• Learning decision trees at multiple levels of abstraction


Decision tree representation

In the simplest case,
• each internal node tests on an attribute
• each branch corresponds to an attribute value
• each leaf node corresponds to a class label

In general,
• each internal node corresponds to a test (on input instances) with mutually exclusive and exhaustive outcomes – tests may be univariate or multivariate
• each branch corresponds to an outcome of a test
• each leaf node corresponds to a class label


Decision Tree Representation

Data set:

Example | x | y | Class c
   1    | 1 | 1 |   A
   2    | 0 | 1 |   B
   3    | 1 | 0 |   A
   4    | 0 | 0 |   B

Tree 1: test x at the root; the branch x = 1 leads to leaf c = A and the branch x = 0 to leaf c = B

Tree 2: test y at the root; each of its branches then tests x, with x = 1 leading to c = A and x = 0 to c = B

Should we choose Tree 1 or Tree 2? Why?


Decision tree representation

• Any Boolean function can be represented by a decision tree

• Any function

$$f : A_1 \times A_2 \times \cdots \times A_n \to C$$

where each $A_i$ is the domain of the i-th attribute and C is a discrete set of values (class labels), can be represented by a decision tree

• In general, the inputs need not be discrete valued


Learning Decision Tree Classifiers

• Decision trees are especially well suited for representing simple rules for classifying instances that are described by discrete attribute values
• Decision tree learning algorithms:
  • Implement Ockham's razor as a preference bias (simpler decision trees are preferred over more complex trees)
  • Are relatively efficient – linear in the size of the decision tree and the size of the data set
  • Produce comprehensible results
  • Are often among the first to be tried on a new data set


Learning Decision Tree Classifiers

Ockham's razor recommends that we pick the simplest decision tree that is consistent with the training set

The simplest tree is one that takes the fewest bits to encode (why? – information theory)

There are far too many trees that are consistent with a training set, and searching for the simplest tree that is consistent with the training set is typically not computationally feasible

Solution:
• Use a greedy algorithm – not guaranteed to find the simplest tree, but works well in practice
• Or restrict the space of hypotheses to a subset of simple trees
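The greedy recipe can be sketched in a few lines: grow the tree top-down, at each node choosing the attribute whose test most reduces the class entropy (the information-gain heuristic developed in the following slides). This is an ID3-style illustration of our own, under simplifying assumptions (discrete attributes, no pruning); the function names are ours, not the course's.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, attr):
    """Reduction in class entropy from splitting on attribute index attr."""
    n = len(labels)
    remainder = 0.0
    for v in set(x[attr] for x in examples):
        subset = [y for x, y in zip(examples, labels) if x[attr] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def grow_tree(examples, labels, attrs):
    if len(set(labels)) == 1 or not attrs:      # pure node, or no tests left
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(examples, labels, a))
    tree = {}
    for v in set(x[best] for x in examples):
        idx = [i for i, x in enumerate(examples) if x[best] == v]
        tree[(best, v)] = grow_tree([examples[i] for i in idx],
                                    [labels[i] for i in idx],
                                    [a for a in attrs if a != best])
    return tree

# The 4-example (x, y) data set from the earlier slide: class is A iff x = 1.
X = [(1, 1), (0, 1), (1, 0), (0, 0)]
y = ["A", "B", "A", "B"]
tree = grow_tree(X, y, [0, 1])
print(tree[(0, 1)], tree[(0, 0)])  # A B -- the tree tests only attribute 0 (x)
```

On this data the greedy choice recovers the smaller Tree 1, since splitting on x has information gain 1 bit while splitting on y has gain 0.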


Information – Some intuitions

• Information reduces uncertainty
• Information is relative – to what you already know
• The information content of a message is related to how surprising the message is
• Information depends on context


Digression: Information and Uncertainty

Sender Receiver

Message

You are stuck inside. You send me out to report back to you on what the weather is like. I do not lie, so you trust me. You and I are both generally familiar with the weather in Iowa

On a July afternoon in Iowa, I walk into the room and tell you it is hot outside

On a January afternoon in Iowa, I walk into the room and tell you it is hot outside


Digression: Information and Uncertainty

How much information does a message contain?

If my message to you describes a scenario that you expect with certainty, the information content of the message for you is zero

The more surprising the message to the receiver, the greater the amount of information conveyed by the message

What does it mean for a message to be surprising?

Sender Receiver

Message


Digression: Information and Uncertainty

Suppose I have a coin with heads on both sides and you know that I have a coin with heads on both sides.

I toss the coin, and without showing you the outcome, tell you that it came up heads. How much information did I give you?

Suppose I have a fair coin and you know that I have a fair coin.

I toss the coin, and without showing you the outcome, tell you that it came up heads. How much information did I give you?


Information

• Without loss of generality, assume that messages are binary – made of 0s and 1s
• Conveying the outcome of a fair coin toss requires 1 bit of information – we need to identify one out of two equally likely outcomes
• Conveying the outcome of an experiment with 8 equally likely outcomes requires 3 bits ..
• Conveying an outcome that is certain takes 0 bits
• In general, if an outcome has probability p, the information content of the corresponding message is

$$I(p) = -\log_2 p \qquad \text{so that } I(1) = 0$$


Information is Subjective

• Suppose there are 3 agents – Adrian, Oksana, and Jun – in a world where a die has been tossed. Adrian observes that the outcome is a "6" and whispers to Oksana that the outcome is "even", but Jun knows nothing about the outcome.
• The probability assigned by Oksana to the event "6" is a subjective measure of Oksana's belief about the state of the world.
• Information gained by Adrian by looking at the outcome of the die = log₂6 bits.
• Information conveyed by Adrian to Oksana = log₂6 − log₂3 bits.
• Information conveyed by Adrian to Jun = 0 bits.


Information and Shannon Entropy

• Suppose we have a message that conveys the result of a random experiment with m possible discrete outcomes, with probabilities $p_1, p_2, \ldots, p_m$

The expected information content of such a message is called the entropy of the probability distribution:

$$H(p_1, p_2, \ldots, p_m) = \sum_{i=1}^{m} p_i\, I(p_i)$$

$$I(p_i) = -\log_2 p_i \ \text{ provided } p_i \neq 0, \qquad I(p_i) = 0 \text{ otherwise}$$
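The definition above, including the convention that a zero-probability outcome contributes nothing, is easy to check numerically. A small sketch of our own (the function name `entropy` is an assumption):

```python
from math import log2

def entropy(probs):
    """H(p_1, ..., p_m) = sum_i p_i * I(p_i), with p_i = 0 contributing 0."""
    assert abs(sum(probs) - 1) < 1e-9
    return sum(p * -log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 -- fair coin: 1 bit
print(entropy([1.0, 0.0]))   # 0.0 -- certain outcome: 0 bits
print(entropy([1/8] * 8))    # 3.0 -- 8 equally likely outcomes: 3 bits
```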


Shannon’s entropy as a measure of information

Let $\vec{P} = (p_1, \ldots, p_n)$ be a discrete probability distribution. The entropy of the distribution $\vec{P}$ is given by

$$H(\vec{P}) = \sum_{i=1}^{n} p_i \log_2\!\left(\frac{1}{p_i}\right) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

$$H\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit}$$

$$H(1, 0) = 1 \cdot I(1) + 0 \cdot I(0) = 0 \text{ bits}$$


Properties of Shannon’s entropy

• $H(\vec{P}) \ge 0$
• If there are N possible outcomes, $H(\vec{P}) \le \log_2 N$
• If $\forall i,\ p_i = \frac{1}{N}$, then $H(\vec{P}) = \log_2 N$
• If $\exists i$ such that $p_i = 1$, then $H(\vec{P}) = 0$
• $H(\vec{P})$ is a continuous function of $\vec{P}$


Shannon’s entropy as a measure of information

• For any distribution $\vec{P}$, $H(\vec{P})$ is the optimal number of binary questions required on average to determine an outcome drawn from $\vec{P}$
• We can extend these ideas to talk about how much information the observation of the outcome of one experiment conveys about the possible outcomes of another (mutual information)
• We can also quantify the difference between two probability distributions (Kullback-Leibler divergence or relative entropy)


Coding Theory Perspective

• Suppose you and I both know the distribution $\vec{P}$
• I choose an outcome according to $\vec{P}$
• Suppose I want to send you a message about the outcome
• You and I could agree in advance on the questions
• I can simply send you the answers
• The optimal message length on average is $H(\vec{P})$
• This generalizes to noisy communication


Entropy of random variables and sets of random variables

For a random variable X taking values $a_1, \ldots, a_n$:

$$H(X) = -\sum_{i=1}^{n} P(X = a_i) \log_2 P(X = a_i)$$

If $\mathbf{X}$ is a set of random variables,

$$H(\mathbf{X}) = -\sum_{\mathbf{x}} P(\mathbf{x}) \log_2 P(\mathbf{x})$$


Joint Entropy and Conditional Entropy

For random variables X and Y, the joint entropy:

$$H(X, Y) = -\sum_{x,\,y} P(x, y) \log_2 P(x, y)$$

Conditional entropy of X given Y:

$$H(X \mid Y) = \sum_{y} P(y)\, H(X \mid Y = y)$$

$$H(X \mid Y = a) = -\sum_{x} P(x \mid Y = a) \log_2 P(x \mid Y = a)$$


Joint Entropy and Conditional Entropy

Some useful results:

$$H(X, Y) \le H(X) + H(Y)$$

$$H(Y \mid X) \le H(Y) \qquad \text{(When do we have equality?)}$$

Chain rule for entropy:

$$H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$$


Example of entropy calculations

P(X = H, Y = H) = 0.2    P(X = H, Y = T) = 0.4
P(X = T, Y = H) = 0.3    P(X = T, Y = T) = 0.1

H(X,Y) = −0.2 log₂ 0.2 + … ≈ 1.85

P(X = H) = 0.6, so H(X) ≈ 0.97
P(Y = H) = 0.5, so H(Y) = 1.0

P(Y = H | X = H) = 0.2/0.6 ≈ 0.333    P(Y = T | X = H) = 1 − 0.333 = 0.667
P(Y = H | X = T) = 0.3/0.4 = 0.75     P(Y = T | X = T) = 0.1/0.4 = 0.25

H(Y|X) ≈ 0.88
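The numbers in this example can be reproduced directly from the joint distribution, using the chain rule H(Y|X) = H(X,Y) − H(X). A sketch of our own (names `joint`, `H` are assumptions):

```python
from math import log2

joint = {("H", "H"): 0.2, ("H", "T"): 0.4, ("T", "H"): 0.3, ("T", "T"): 0.1}

def H(probs):
    """Entropy in bits of a list of probabilities (0 log 0 = 0 convention)."""
    return sum(p * -log2(p) for p in probs if p > 0)

H_xy = H(joint.values())
p_x = {x: sum(p for (a, _), p in joint.items() if a == x) for x in "HT"}
H_x = H(p_x.values())

print(round(H_xy, 2))        # 1.85
print(round(H_x, 2))         # 0.97
print(round(H_xy - H_x, 2))  # 0.88 -- chain rule: H(Y|X) = H(X,Y) - H(X)
```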


Mutual Information

For random variables X and Y, the average mutual information between X and Y:

$$I(X, Y) = H(X) + H(Y) - H(X, Y)$$

Or, by using the chain rule $H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$:

$$I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$

In terms of probability distributions,

$$I(X, Y) = \sum_{a,\,b} P(X = a, Y = b) \log_2 \frac{P(X = a, Y = b)}{P(X = a)\, P(Y = b)}$$

Question: When is I(X,Y) = 0?
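The probability form of I(X,Y) also answers the question on this slide: when X and Y are independent, every log term is log 1 = 0. A sketch of our own (the name `mutual_information` is an assumption):

```python
from math import log2

def mutual_information(joint):
    """I(X,Y) = sum_{a,b} P(a,b) log2[P(a,b) / (P(a) P(b))];
    joint maps (a, b) -> P(X = a, Y = b)."""
    p_x, p_y = {}, {}
    for (a, b), p in joint.items():
        p_x[a] = p_x.get(a, 0) + p
        p_y[b] = p_y.get(b, 0) + p
    return sum(p * log2(p / (p_x[a] * p_y[b]))
               for (a, b), p in joint.items() if p > 0)

# Two independent fair coins: I(X,Y) = 0.
indep = {(a, b): 0.25 for a in "HT" for b in "HT"}
print(mutual_information(indep))  # 0.0

# The dependent joint distribution from the entropy-calculation example:
dep = {("H", "H"): 0.2, ("H", "T"): 0.4, ("T", "H"): 0.3, ("T", "T"): 0.1}
print(round(mutual_information(dep), 2))  # 0.12 = H(X) + H(Y) - H(X,Y)
```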


Relative Entropy

Let P and Q be two distributions over a random variable X.

The relative entropy (Kullback-Leibler distance) is a measure of the “distance” from P to Q:

$$D(P \,\|\, Q) = \sum_{X} P(X) \log_2\!\left(\frac{P(X)}{Q(X)}\right)$$

Note that $D(P \,\|\, Q) \neq D(Q \,\|\, P)$, $\;D(P \,\|\, Q) \ge 0$, and $D(P \,\|\, P) = 0$
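A short numeric check of the three properties noted above, in particular the asymmetry. This is our own sketch (the name `kl` is an assumption); it assumes Q(x) > 0 wherever P(x) > 0, which the divergence requires to be finite.

```python
from math import log2

def kl(p, q):
    """D(P || Q) = sum_x P(x) log2(P(x)/Q(x)), with 0 log 0 = 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]
print(kl(P, P))            # 0.0   -- D(P || P) = 0
print(round(kl(P, Q), 3))  # 0.737
print(round(kl(Q, P), 3))  # 0.531 -- differs from D(P || Q): KL is not symmetric
```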