PGM: Tirgul 11
Naïve Bayesian Classifier + Tree Augmented Naïve Bayes
(adapted from a tutorial by Nir Friedman and Moises Goldszmidt)
The Classification Problem
From a data set describing objects by vectors of features and a class
Find a function F: features → class to classify a new object
Vector1 = <49, 0, 2, 134, 271, 0, 0, 162, 0, 0, 2, 0, 3> → Presence
Vector2 = <42, 1, 3, 130, 180, 0, 0, 150, 0, 0, 1, 0, 3> → Presence
Vector3 = <39, 0, 3, 94, 199, 0, 0, 179, 0, 0, 1, 0, 3> → Presence
Vector4 = <41, 1, 2, 135, 203, 0, 0, 132, 0, 0, 2, 0, 6> → Absence
Vector5 = <56, 1, 3, 130, 256, 1, 2, 142, 1, 0.6, 2, 1, 6> → Absence
Vector6 = <70, 1, 2, 156, 245, 0, 2, 143, 0, 0, 1, 0, 3> → Presence
Vector7 = <56, 1, 4, 132, 184, 0, 2, 105, 1, 2.1, 2, 1, 6> → Absence
Examples
Predicting heart disease. Features: cholesterol, chest pain, angina, age, etc. Class: {present, absent}
Finding lemons in cars. Features: make, brand, miles per gallon, acceleration, etc. Class: {normal, lemon}
Digit recognition. Features: matrix of pixel descriptors. Class: {1, 2, 3, 4, 5, 6, 7, 8, 9, 0}
Speech recognition. Features: signal characteristics, language model. Class: {pause/hesitation, retraction}
Approaches
Memory based: define a distance between samples. Examples: nearest neighbor, support vector machines
Decision surface: find the best partition of the space. Examples: CART, decision trees
Generative models: induce a model and impose a decision rule. Examples: Bayesian networks
Generative Models
Bayesian classifiers induce a probability distribution describing the data:
P(A1,…,An,C)
Impose a decision rule: given a new object <a1,…,an>, choose
c = argmax_c P(C = c | a1,…,an)
We have shifted the problem to learning P(A1,…,An,C)
We are learning how to do this efficiently: learn a Bayesian network representation for P(A1,…,An,C)
Optimality of the decision rule
Minimizing the error rate...
Let ci be the true class, and let lj be the class returned by the classifier.
A decision by the classifier is correct if ci = lj, and in error if ci ≠ lj.
The error incurred by choosing label lj is
E[L(c, lj) | a] = Σi=1..n L(ci, lj) P(ci | a) = 1 − P(lj | a)
(with the 0/1 loss: L(ci, lj) is 0 when i = j and 1 otherwise).
Thus, had we had access to P, we would minimize the error rate by choosing li when
P(li | a) ≥ P(lj | a) for all j ≠ i,
which is the decision rule for the Bayesian classifier.
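To make the rule concrete, here is a minimal Python sketch (the function name and example posteriors are ours, not from the tutorial): given the posterior P(c | a) for each candidate class, return the class with the largest posterior, which minimizes the expected 0/1 loss.

    def bayes_decision(posterior):
        """posterior: dict mapping each class label to P(class | a).

        Returns the label with maximal posterior probability,
        i.e. the minimum expected 0/1-loss decision."""
        return max(posterior, key=posterior.get)

    # Hypothetical posteriors for the heart-disease example:
    print(bayes_decision({"present": 0.7, "absent": 0.3}))  # -> "present"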
Advantages of the Generative Model Approach
Output: a rank over the outcomes---the likelihood of present vs. absent
Explanation: what is the profile of a "typical" person with a heart disease?
Missing values: handled both in training and in testing
Value of information: if the person has high cholesterol and blood sugar, which other test should be conducted?
Validation: confidence measures over the model and its parameters
Background knowledge: priors and structure
Evaluating the performance of a classifier: n-fold cross validation
[Figure: the original data set is partitioned into segments D1, D2, D3, …, Dn; in run i, segment Di (red) is held out for testing, and the remaining segments (green) are used for training.]
Partition the data set into n segments
Do n times:
• Train the classifier with the green segments
• Test accuracy on the red segment
Compute statistics on the n runs:
• Mean accuracy
• Variance
Accuracy on test data of size m:
Acc = (1/m) Σk=1..m δ(ck, lk)
where δ(ck, lk) is 1 when the classifier's label lk matches the true class ck of test instance k, and 0 otherwise.
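A sketch of the procedure in Python, under our own conventions (the train and classify arguments are hypothetical stand-ins for whichever learner is being evaluated):

    import random
    from statistics import mean, variance

    def n_fold_cv(data, n, train, classify, seed=0):
        """Estimate accuracy by n-fold cross-validation.

        data:     list of (features, true_class) pairs.
        train:    function mapping a training list to a classifier model.
        classify: function mapping (model, features) to a predicted label.
        """
        data = data[:]                           # do not shuffle the caller's list
        random.Random(seed).shuffle(data)
        folds = [data[i::n] for i in range(n)]   # n disjoint segments D1..Dn
        accuracies = []
        for i in range(n):                       # fold i is the (red) test segment
            test = folds[i]
            training = [x for j, f in enumerate(folds) if j != i for x in f]
            model = train(training)              # learn on the (green) segments
            correct = sum(classify(model, a) == c for a, c in test)
            accuracies.append(correct / len(test))
        return mean(accuracies), variance(accuracies)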
[Figure: Bayesian network learned for heart disease over the attributes Age, Sex, ChestPain, RestBP, Cholesterol, BloodSugar, ECG, MaxHeartRate, Angina, OldPeak, STSlope, Vessels, and Thal.]
Outcome: heart disease
Accuracy = 85%
Data source: UCI repository
Advantages of Using a Bayesian Network
Efficiency in learning and query answering
Combine knowledge engineering and statistical induction
Algorithms for decision making, value of information, diagnosis and repair
Problems with BNs as classifiers
When evaluating a Bayesian network, we examine the likelihood of the model B given the data D and try to maximize it:
LL(B | D) = Σi=1..N log PB(a1^i, …, an^i, c^i)
When learning structure we also add a penalty for structure complexity and seek a balance between the two terms (MDL or a variant). The following properties follow:
• A Bayesian network minimizes the error over all the variables in the domain, and not necessarily the local error of the class given the attributes (OK with enough data).
• Because of the penalty, a Bayesian network in effect looks at a small subset of the variables that affect a given node (its Markov blanket).
Problems with BNs as classifiers (cont.)
Let's look closely at the likelihood term:
LL(B | D) = Σi=1..N log PB(c^i | a1^i, …, an^i) + Σi=1..N log PB(a1^i, …, an^i)
The first term estimates just what we want: the probability of the class given the attributes. The second term estimates the joint probability of the attributes.
When there are many attributes, the second term starts to dominate (the magnitude of the log grows as the joint probabilities become small).
Why not use just the first term? We can no longer factorize, and calculations become much harder.
[Figure: the naïve Bayes structure for diabetes in Pima Indians (from the UCI repository): class node C with children F1, …, F6 = pregnant, age, insulin, dpf, mass, glucose.]
The Naïve Bayesian Classifier
Fixed structure encoding the assumption that features are independent of each other given the class.
Learning amounts to estimating the parameters of P(Fi|C) for each Fi.
P(C | F1, …, F6) ∝ P(F1 | C) P(F2 | C) ⋯ P(F6 | C) P(C)
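As a sketch of how this factorization is used at prediction time, the following Python snippet (our own naming, not from the tutorial) computes the posterior from already-estimated tables P(Fi | C) and the prior P(C), working in log space to avoid underflow:

    from math import log, exp

    def nb_posterior(prior, cpts, features):
        """Posterior P(c | f1,...,fn) for each class c under naive Bayes.

        prior:    dict c -> P(c)
        cpts:     list of dicts, cpts[i][(f, c)] = P(Fi = f | C = c)
        features: observed values (f1, ..., fn)
        """
        scores = {}
        for c, pc in prior.items():
            # log P(c) + sum_i log P(fi | c): the factorization above, in logs
            scores[c] = log(pc) + sum(log(cpts[i][(f, c)])
                                      for i, f in enumerate(features))
        # normalize back into probabilities
        z = sum(exp(s) for s in scores.values())
        return {c: exp(s) / z for c, s in scores.items()}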
The Naïve Bayesian Classifier (cont.)
What do we gain?
• We ensure that in the learned network, the probability P(C | A1, …, An) will take every attribute into account.
• We will show a polynomial-time algorithm for learning the network.
• Estimates are robust: they consist of low-order statistics requiring few instances.
• It has proven to be a powerful classifier, often exceeding unrestricted Bayesian networks.
The Naïve Bayesian Classifier (cont.)
Common practice is to estimate
θ̂(ai | c) = N(ai, c) / N(c)
where N(·) counts the number of instances in the training data with the given values.
These estimates are identical to the MLE for multinomials.
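The counts translate directly into code. A sketch, assuming a hypothetical data layout of (features, class) pairs, producing tables in the format used by the prediction sketch above:

    from collections import Counter

    def estimate_nb_parameters(data):
        """MLE for naive Bayes: theta(ai | c) = N(ai, c) / N(c).

        data: non-empty list of (features, c) pairs, features = (a1, ..., an).
        Returns (prior, cpts) as used by nb_posterior above."""
        n = len(data[0][0])
        class_counts = Counter(c for _, c in data)          # N(c)
        pair_counts = [Counter() for _ in range(n)]         # N(ai, c)
        for features, c in data:
            for i, a in enumerate(features):
                pair_counts[i][(a, c)] += 1
        prior = {c: k / len(data) for c, k in class_counts.items()}
        # note: value/class pairs never seen in data get no entry at all --
        # the zero-count problem that the smoothing discussed later addresses
        cpts = [{(a, c): k / class_counts[c]
                 for (a, c), k in pair_counts[i].items()} for i in range(n)]
        return prior, cpts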
Improving Naïve Bayes
Naïve Bayes encodes assumptions of independence that may be unreasonable:
Are pregnancy and age independent given diabetes?
Problem: the same evidence may be incorporated multiple times (a rare glucose level and a rare insulin level over-penalize the class variable)
The success of naïve Bayes is attributed to:
• Robust estimation
• The decision may be correct even if the probabilities are inaccurate
Idea: improve on naïve Bayes by weakening the independence assumptions
Bayesian networks provide the appropriate mathematical language for this task
Tree Augmented Naïve Bayes (TAN)
• Approximate the dependence among features with a tree Bayes net
• Tree induction algorithm:
  • Optimality: maximum-likelihood tree
  • Efficiency: polynomial algorithm
• Robust parameter estimation
[Figure: TAN structure for the diabetes domain: class node C with children F1, …, F6 (pregnant, age, insulin, dpf, mass, glucose), plus tree edges among the features.]
P(C | F1, …, F6) ∝ P(F1 | C) P(F2 | F1, C) ⋯ P(F6 | F3, C) P(C)
Optimal Tree construction algorithm
The procedure of Chow and Liu constructs a tree structure BT that maximizes LL(BT | D):
1. Compute the mutual information between every pair of attributes:
   I(Ai; Aj) = Σai,aj PD(ai, aj) log [ PD(ai, aj) / (PD(ai) PD(aj)) ]
2. Build a complete undirected graph in which the vertices are the attributes and each edge is annotated with the corresponding mutual information as its weight.
3. Build a maximum weighted spanning tree of this graph.
Complexity: O(n²N) + O(n²) + O(n² log n) = O(n²N), where n is the number of attributes and N is the sample size.
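A compact sketch of the procedure in Python (all names are ours): empirical mutual information for every pair of attributes, then a maximum-weight spanning tree built with Kruskal's algorithm over the complete graph. The n² mutual-information passes over the data are the O(n²N) term above.

    from collections import Counter
    from math import log

    def mutual_information(data, i, j):
        """Empirical I(Ai; Aj) from a list of attribute tuples."""
        N = len(data)
        pi = Counter(x[i] for x in data)
        pj = Counter(x[j] for x in data)
        pij = Counter((x[i], x[j]) for x in data)
        # sum over (a, b): P(a,b) log [ P(a,b) / (P(a) P(b)) ]
        return sum((nij / N) * log((nij * N) / (pi[a] * pj[b]))
                   for (a, b), nij in pij.items())

    def chow_liu_tree(data, n):
        """Maximum-weight spanning tree over attributes 0..n-1 (Kruskal)."""
        edges = sorted(((mutual_information(data, i, j), i, j)
                        for i in range(n) for j in range(i + 1, n)),
                       reverse=True)
        parent = list(range(n))            # union-find forest
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        tree = []
        for w, i, j in edges:              # heaviest edges first
            ri, rj = find(i), find(j)
            if ri != rj:                   # adding (i, j) keeps it acyclic
                parent[ri] = rj
                tree.append((i, j))
        return tree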
Tree construction algorithm (cont.)
It is easy to “plant” the optimal tree in the TAN by revising the algorithm to use a conditional measure that takes the class into account:
I(Ai; Aj | C) = Σai,aj,c PD(ai, aj, c) log [ PD(ai, aj | c) / (PD(ai | c) PD(aj | c)) ]
This measures the gain in log-likelihood of adding Ai as a parent of Aj when C is already a parent.
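The only change needed in code is to condition every count on the class; a sketch in the same style as above (names ours), now taking (features, class) pairs:

    from collections import Counter
    from math import log

    def conditional_mutual_information(data, i, j):
        """Empirical I(Ai; Aj | C) from a list of (features, class) pairs."""
        N = len(data)
        nc = Counter(c for _, c in data)                    # N(c)
        nic = Counter((x[i], c) for x, c in data)           # N(ai, c)
        njc = Counter((x[j], c) for x, c in data)           # N(aj, c)
        nijc = Counter((x[i], x[j], c) for x, c in data)    # N(ai, aj, c)
        # sum over (ai, aj, c): P(ai,aj,c) log [ P(ai,aj|c) / (P(ai|c) P(aj|c)) ]
        return sum((n / N) * log((n * nc[c]) / (nic[(a, c)] * njc[(b, c)]))
                   for (a, b, c), n in nijc.items())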
Problem with TAN
When evaluating parameters we estimate the conditional probability P(Ai | Parents(Ai)). This is done by partitioning the data according to the possible values of Parents(Ai).
• When a partition contains just a few instances we get an unreliable estimate.
• In naïve Bayes the partition was only on the values of the class (and we have to assume that is adequate).
• In TAN we have twice the number of partitions and get unreliable estimates, especially for small data sets.
Solution: smooth the conditional estimate toward the unconditional one:
P*(x | Pax) = α PD(x | Pax) + (1 − α) PD(x), where α = N(Pax) / (N(Pax) + s)
and s is the smoothing bias, typically small.
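The smoothed estimate translates directly into code; a minimal sketch, assuming count tables N(x, pa) and N(pa) and a precomputed marginal PD(x) (the default value of s is an arbitrary choice for illustration):

    def smoothed_estimate(n_x_pa, n_pa, p_marginal, s=5.0):
        """P*(x | pa) = alpha * P_D(x | pa) + (1 - alpha) * P_D(x),
        with alpha = N(pa) / (N(pa) + s).

        n_x_pa:     count N(x, pa)
        n_pa:       count N(pa)
        p_marginal: the unconditional estimate P_D(x)
        s:          smoothing bias (typically small)
        """
        if n_pa == 0:
            return p_marginal          # empty partition: fall back entirely
        alpha = n_pa / (n_pa + s)      # trust the partition more as it grows
        return alpha * (n_x_pa / n_pa) + (1 - alpha) * p_marginal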
Performance: TAN vs. Naïve Bayes
[Scatter plot comparing the accuracy of TAN and naïve Bayes on each data set; both axes range from 65% to 100%.]
25 data sets from the UCI repository: medical, signal processing, financial, games
Accuracy based on 5-fold cross-validation; no parameter tuning
Performance: TAN vs. C4.5
[Scatter plot comparing the accuracy of TAN and C4.5 on each data set; both axes range from 65% to 100%.]
25 data sets from the UCI repository: medical, signal processing, financial, games
Accuracy based on 5-fold cross-validation; no parameter tuning
Beyond TAN
Can we do better by learning a more flexible structure?
Experiment: learn a Bayesian network without restrictions on the structure
Performance: TAN vs. Bayesian Networks
[Scatter plot comparing the accuracy of TAN and unrestricted Bayesian networks on each data set; both axes range from 65% to 100%.]
25 data sets from the UCI repository: medical, signal processing, financial, games
Accuracy based on 5-fold cross-validation; no parameter tuning
Classification: Summary
• Bayesian networks provide a useful language to improve Bayesian classifiers
• Lesson: we need to be aware of the task at hand, the amount of training data vs. the dimensionality of the problem, etc.
• Additional benefits:
  • Missing values
  • Compute the tradeoffs involved in finding out feature values
  • Compute misclassification costs
• Recent progress: combine generative probabilistic models, such as Bayesian networks, with decision-surface approaches such as Support Vector Machines