Page 1: The Classification Problem

PGM: Tirgul 11

Naïve Bayesian Classifier + Tree Augmented Naïve Bayes

(adapted from a tutorial by Nir Friedman and Moises Goldszmidt)

Page 2: The Classification Problem

The Classification Problem

From a data set describing objects by vectors of features and a class

Find a function F: features → class to classify a new object

Vector1 = <49, 0, 2, 134, 271, 0, 0, 162, 0, 0, 2, 0, 3>  Presence
Vector2 = <42, 1, 3, 130, 180, 0, 0, 150, 0, 0, 1, 0, 3>  Presence
Vector3 = <39, 0, 3, 94, 199, 0, 0, 179, 0, 0, 1, 0, 3>  Presence
Vector4 = <41, 1, 2, 135, 203, 0, 0, 132, 0, 0, 2, 0, 6>  Absence
Vector5 = <56, 1, 3, 130, 256, 1, 2, 142, 1, 0.6, 2, 1, 6>  Absence
Vector6 = <70, 1, 2, 156, 245, 0, 2, 143, 0, 0, 1, 0, 3>  Presence
Vector7 = <56, 1, 4, 132, 184, 0, 2, 105, 1, 2.1, 2, 1, 6>  Absence

Page 3: The Classification Problem

Examples

Predicting heart disease
  Features: cholesterol, chest pain, angina, age, etc.
  Class: {present, absent}

Finding lemons in cars
  Features: make, brand, miles per gallon, acceleration, etc.
  Class: {normal, lemon}

Digit recognition
  Features: matrix of pixel descriptors
  Class: {1, 2, 3, 4, 5, 6, 7, 8, 9, 0}

Speech recognition
  Features: signal characteristics, language model
  Class: {pause/hesitation, retraction}

Page 4: The Classification Problem

Approaches

Memory based: define a distance between samples (nearest neighbor, support vector machines)

Decision surface: find the best partition of the space (CART, decision trees)

Generative models: induce a model and impose a decision rule (Bayesian networks)

Page 5: The Classification Problem

Generative Models

Bayesian classifiers induce a probability distribution describing the data

P(A1,…,An,C)

and impose a decision rule. Given a new object <a1,…,an>, choose

c = argmaxc P(C = c | a1,…,an)

We have shifted the problem to learning P(A1,…,An,C)

We are learning how to do this efficiently: learn a Bayesian network representation for P(A1,…,An,C)
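A minimal Python sketch of this decision rule (illustrative only, not from the tutorial), assuming the joint distribution P(A1,…,An,C) is available as a lookup table; since P(c | a1,…,an) is proportional to P(a1,…,an, c), the argmax can be taken directly over the joint entries:

# Hypothetical sketch: classify by c = argmax_c P(C = c | a1,...,an),
# given the joint P(A1,...,An,C) as a dictionary of probabilities.
def classify(joint, attrs, classes):
    # P(c | a1,...,an) is proportional to P(a1,...,an, c), so the
    # normalisation over c can be skipped when taking the argmax.
    return max(classes, key=lambda c: joint.get(tuple(attrs) + (c,), 0.0))

# Toy usage: one binary attribute, binary class.
joint = {(0, "absent"): 0.4, (0, "present"): 0.1,
         (1, "absent"): 0.2, (1, "present"): 0.3}
print(classify(joint, [1], ["absent", "present"]))   # -> "present"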

Page 6: The Classification Problem

Optimality of the decision rule: minimizing the error rate

Let ci be the true class, and let lj be the class returned by the classifier.

A decision by the classifier is correct if ci = lj, and in error if ci ≠ lj.

The error incurred by choosing label lj is

E(lj | a) = Σ_{i=1..n} L(ci, lj) P(ci | a) = 1 - P(lj | a)   (for 0/1 loss L)

Thus, had we had access to P, we would minimize the error rate by choosing li when

P(li | a) ≥ P(lj | a) for all j ≠ i

which is the decision rule for the Bayesian classifier.

Page 7: The Classification Problem

Advantages of the Generative Model Approach

Output: a ranking over the outcomes (likelihood of present vs. absent)

Explanation: what is the profile of a "typical" person with heart disease?

Missing values: handled both in training and in testing

Value of information: if the person has high cholesterol and blood sugar, which other test should be conducted?

Validation: confidence measures over the model and its parameters

Background knowledge: priors and structure

Page 8: The Classification Problem

Evaluating the performance of a classifier: n-fold cross validation

[Diagram: the original data set is partitioned into n segments D1, D2, D3, …, Dn; in each run one segment (red) is held out for testing and the remaining segments (green) are used for training.]

Partition the data set into n segments

Do n times:
  Train the classifier on the green segments
  Test accuracy on the red segments

Compute statistics on the n runs:
  Mean accuracy
  Variance

Accuracy on test data of size m:

Acc = (1/m) Σ_{k=1..m} δ(ck, lk), the fraction of the m test instances whose predicted label lk matches the true class ck
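A minimal Python sketch of the procedure (the train/predict interface is hypothetical, just to make the loop concrete; assumes 2 ≤ n ≤ number of instances):

import statistics

def n_fold_cv(data, labels, n, train, predict):
    # Partition the indices into n disjoint segments (here: round-robin).
    folds = [list(range(i, len(data), n)) for i in range(n)]
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(data)) if i not in held_out]
        # Train on the remaining segments, test on the held-out one.
        model = train([data[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict(model, data[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    # Statistics over the n runs: mean accuracy and its variance.
    return statistics.mean(accuracies), statistics.variance(accuracies)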

Page 9: The Classification Problem

Advantages of Using a Bayesian Network

[Figure: Bayesian network learned for the heart-disease domain, with nodes Age, Sex, ChestPain, RestBP, Cholesterol, BloodSugar, ECG, MaxHeartRate, Angina, OldPeak, STSlope, Vessels and Thal, and the outcome node Heart Disease. Accuracy = 85%. Data source: UCI repository.]

Efficiency in learning and query answering

Combine knowledge engineering and statistical induction

Algorithms for decision making, value of information, diagnosis and repair

Page 10: The Classification Problem

Problems with BNs as classifiers

When evaluating a Bayesian network, we examine the likelihood of the model B given the data D and try to maximize it:

LL(B | D) = Σ_{i=1..N} log P_B(a1^i, …, an^i, c^i)

When learning structure we also add a penalty for structure complexity and seek a balance between the two terms (MDL or a variant). The following properties follow:

A Bayesian network minimizes the error over all the variables in the domain, and not necessarily the local error of the class given the attributes (OK with enough data).

Because of the penalty, a Bayesian network in effect looks only at a small subset of the variables that affect a given node (its Markov blanket).

Page 11: The Classification Problem

Problems with BNs as classifiers (cont.)

Let's look closely at the likelihood term:

LL(B | D) = Σ_{i=1..N} log P_B(c^i | a1^i, …, an^i) + Σ_{i=1..N} log P_B(a1^i, …, an^i)

The first term estimates just what we want: the probability of the class given the attributes. The second term estimates the joint probability of the attributes.

When there are many attributes, the second term starts to dominate (the magnitude of each log term grows as the probabilities become small).

Why not use just the first term? Because then we can no longer factorize, and the calculations become much harder.

Page 12: The Classification Problem

The Naïve Bayesian Classifier

[Figure: naïve Bayes structure for diabetes in Pima Indians (from the UCI repository): class node C with children F1, …, F6 = pregnant, age, insulin, dpf, mass, glucose.]

Fixed structure encoding the assumption that features are independent of each other given the class.

Learning amounts to estimating the parameters of P(Fi | C) for each Fi.

P(C | F1, …, F6) ∝ P(F1 | C) · P(F2 | C) ⋯ P(F6 | C) · P(C)

Page 13: The Classification Problem

The Naïve Bayesian Classifier (cont.)

What do we gain?

We ensure that in the learned network, the probability P(C|A1…An) will take every attribute into account.

We will show a polynomial-time algorithm for learning the network.

Estimates are robust, consisting of low-order statistics that require few instances.

It has proven to be a powerful classifier, often exceeding unrestricted Bayesian networks.

Page 14: The Classification Problem

The Naïve Bayesian Classifier (cont.)

Common practice is to estimate

θ̂(ai | c) = N(ai, c) / N(c)

These estimates are identical to the MLE for multinomials.

[Figure: the naïve Bayes structure C with children F1, …, F6, as before.]
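A minimal Python sketch of these count-based estimates and the resulting classifier, assuming discrete features (function and variable names are mine, not the tutorial's):

from collections import Counter
import math

def train_nb(X, y):
    # X: list of discrete feature vectors, y: list of class labels.
    class_counts = Counter(y)                                   # N(c)
    joint_counts = Counter((j, x[j], c)                         # N(Fj = v, C = c)
                           for x, c in zip(X, y) for j in range(len(x)))
    return class_counts, joint_counts, len(y)

def predict_nb(model, x):
    class_counts, joint_counts, n = model
    def log_score(c):
        s = math.log(class_counts[c] / n)                       # log P(c)
        for j, v in enumerate(x):
            p = joint_counts[(j, v, c)] / class_counts[c]       # N(v, c) / N(c)
            s = s + math.log(p) if p > 0 else float("-inf")     # unseen value: score -inf
        return s
    return max(class_counts, key=log_score)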

Page 15: The Classification Problem

Improving Naïve Bayes

Naïve Bayes encodes assumptions of independence that may be unreasonable:

Are pregnancy and age independent given diabetes?

Problem: the same evidence may be incorporated multiple times (a rare Glucose level and a rare Insulin level over-penalize the class variable)

The success of naïve Bayes is attributed to:
  Robust estimation
  Decisions may be correct even if the probabilities are inaccurate

Idea: improve on naïve Bayes by weakening the independence assumptions

Bayesian networks provide the appropriate mathematical language for this task

Page 16: The Classification Problem

Tree Augmented Naïve Bayes (TAN)

Approximate the dependence among features with a tree-structured Bayes net

Tree induction algorithm:
  Optimality: maximum-likelihood tree
  Efficiency: polynomial algorithm
  Robust parameter estimation

[Figure: TAN structure for the diabetes domain: the class node C points to every feature F1, …, F6 (pregnant, age, insulin, dpf, mass, glucose), and a tree over the features gives each feature at most one additional feature parent.]

P(C | F1, …, F6) ∝ P(F1 | C) · P(F2 | F1, C) ⋯ P(F6 | F3, C) · P(C)
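A minimal Python sketch of classification with such a model, assuming the tree parents and the conditional probability tables are already given (the data structures below are illustrative, not the tutorial's own code):

import math

def tan_classify(x, classes, prior, cpt, parent):
    # x: dict feature -> value
    # prior[c] = P(C = c)
    # cpt[(f, c)][(v, pv)] = P(F_f = v | parent value pv, C = c); pv is None for the root
    # parent[f] = tree parent of feature f, or None for the root feature
    def log_score(c):
        s = math.log(prior[c])
        for f, v in x.items():
            pv = x[parent[f]] if parent[f] is not None else None
            s += math.log(cpt[(f, c)][(v, pv)])
        return s
    return max(classes, key=log_score)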

Page 17: The Classification Problem

Optimal Tree construction algorithm

The procedure of Chow and Liu constructs a tree structure BT that maximizes LL(BT | D):

Compute the mutual information between every pair of attributes:

I(Ai; Aj) = Σ_{ai,aj} P_D(ai, aj) log [ P_D(ai, aj) / (P_D(ai) P_D(aj)) ]

Build a complete undirected graph in which the vertices are the attributes and each edge is annotated with the corresponding mutual information as its weight.

Build a maximum weighted spanning tree of this graph.

Complexity: O(n²N) + O(n²) + O(n² log n) = O(n²N), where n is the number of attributes and N is the sample size.
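A minimal Python sketch of the two main steps, assuming discrete attributes given as columns of values (function names are mine): empirical mutual information from counts, then a maximum-weight spanning tree via Kruskal's algorithm.

from collections import Counter
import math

def mutual_information(col_i, col_j):
    # I(Ai; Aj) estimated from the empirical distribution of the two columns.
    n = len(col_i)
    p_i, p_j = Counter(col_i), Counter(col_j)
    p_ij = Counter(zip(col_i, col_j))
    return sum((cnt / n) * math.log((cnt / n) / ((p_i[a] / n) * (p_j[b] / n)))
               for (a, b), cnt in p_ij.items())

def maximum_spanning_tree(num_attrs, weights):
    # weights: dict {(i, j): I(Ai; Aj)}; returns the edges of a maximum-weight tree.
    parent = list(range(num_attrs))          # union-find forest
    def find(u):
        while parent[u] != u:
            u = parent[u]
        return u
    edges = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:                         # edge does not create a cycle
            parent[ri] = rj
            edges.append((i, j))
    return edges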

Page 18: The Classification Problem

Tree construction algorithm (cont.)

It is easy to "plant" the optimal tree in the TAN by revising the algorithm to use a conditional measure that takes the conditioning on the class into account:

I(Ai; Aj | C) = Σ_{ai,aj,c} P_D(ai, aj, c) log [ P_D(ai, aj | c) / (P_D(ai | c) P_D(aj | c)) ]

This measures the gain in the log-likelihood of adding Ai as a parent of Aj when C is already a parent.
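A minimal Python sketch of this conditional mutual information, estimated from counts over three columns of discrete values (illustrative only):

from collections import Counter
import math

def conditional_mutual_information(col_i, col_j, col_c):
    # I(Ai; Aj | C) = sum over (a, b, c) of P(a, b, c) * log [ P(a, b | c) / (P(a | c) P(b | c)) ]
    n = len(col_c)
    n_abc = Counter(zip(col_i, col_j, col_c))
    n_ac = Counter(zip(col_i, col_c))
    n_bc = Counter(zip(col_j, col_c))
    n_c = Counter(col_c)
    return sum((cnt / n) * math.log(cnt * n_c[c] / (n_ac[(a, c)] * n_bc[(b, c)]))
               for (a, b, c), cnt in n_abc.items())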

Page 19: The Classification Problem

Problem with TAN

When estimating parameters we estimate the conditional probability P(Ai | Parents(Ai)). This is done by partitioning the data according to the possible values of Parents(Ai).

When a partition contains just a few instances we get an unreliable estimate.

In naïve Bayes the partition was only on the values of the class (and we have to assume that is adequate).

In TAN we have twice the number of partitions and get unreliable estimates, especially for small data sets.

Solution:

P*(x | Pa_x) = α · P̂_D(x | Pa_x) + (1 - α) · P̂_D(x),  where α = N(Pa_x) / (N(Pa_x) + s)

where s is the smoothing bias and is typically small.
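A minimal Python sketch of this smoothed estimate, assuming the raw counts and the marginal estimate are available (names and the default value of s below are mine):

def smoothed_estimate(n_x_pa, n_pa, p_x_marginal, s=5.0):
    # P*(x | Pa_x) = alpha * P_D(x | Pa_x) + (1 - alpha) * P_D(x),
    # with alpha = N(Pa_x) / (N(Pa_x) + s); s is the smoothing bias.
    alpha = n_pa / (n_pa + s)
    p_x_given_pa = n_x_pa / n_pa if n_pa > 0 else 0.0
    return alpha * p_x_given_pa + (1 - alpha) * p_x_marginal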

Page 20: The Classification Problem

Performance: TAN vs. Naïve Bayes

[Scatter plot comparing the accuracy of TAN and Naïve Bayes on each data set; both axes run from 65 to 100.]

25 data sets from the UCI repository: medical, signal processing, financial, games

Accuracy based on 5-fold cross-validation; no parameter tuning

Page 21: The Classification Problem

Performance: TAN vs C4.5

[Scatter plot comparing the accuracy of TAN and C4.5 on each data set; both axes run from 65 to 100.]

25 data sets from the UCI repository: medical, signal processing, financial, games

Accuracy based on 5-fold cross-validation; no parameter tuning

Page 22: The Classification Problem

Beyond TAN

Can we do better by learning a more flexible structure?

Experiment: learn a Bayesian network without restrictions on the structure

Page 23: The Classification Problem

Performance: TAN vs. Bayesian Networks

[Scatter plot comparing the accuracy of TAN and unrestricted Bayesian networks on each data set; both axes run from 65 to 100.]

25 data sets from the UCI repository: medical, signal processing, financial, games

Accuracy based on 5-fold cross-validation; no parameter tuning

Page 24: The Classification Problem

Classification: Summary

Bayesian networks provide a useful language to improve Bayesian classifiers

Lesson: we need to be aware of the task at hand, the amount of training data vs. the dimensionality of the problem, etc.

Additional benefits:
  Missing values
  Compute the tradeoffs involved in finding out feature values
  Compute misclassification costs

Recent progress: combine generative probabilistic models, such as Bayesian networks, with decision-surface approaches such as Support Vector Machines