Robust Feature Selection by Mutual Information Distributions
Marco Zaffalon & Marcus Hutter
Consider two discrete random variables taking values i ∈ {1, …, r} and j ∈ {1, …, s}
(In)Dependence is often measured by mutual information (MI)
– Also known as cross-entropy or information gain
– Examples
  Inference of Bayesian nets, classification trees
  Selection of relevant variables for the task at hand
Classification
– Predicting the class value given values of features
– Features (or attributes) and class = random variables
– Learning the rule ‘features → class’ from data
Filters goal: removing irrelevant features
– More accurate predictions, easier models
MI-based approach
– Remove feature if class does not depend on it: I(π) = 0
– Or: remove feature if I(π) ≤ ε
  ε is an arbitrary threshold of relevance
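As a concrete sketch of the quantity the filter thresholds, the following computes I(π) = Σ_ij π_ij log(π_ij / (π_i+ π_+j)) for a joint distribution given as a table. The function name and the two example distributions are illustrative, not from the talk.

```python
import numpy as np

def mutual_information(pi):
    """Mutual information I(pi) of a joint distribution pi[i, j]
    over two discrete variables, in nats."""
    pi = np.asarray(pi, dtype=float)
    pi_i = pi.sum(axis=1, keepdims=True)   # row-variable marginal pi_{i+}
    pi_j = pi.sum(axis=0, keepdims=True)   # column-variable marginal pi_{+j}
    nz = pi > 0                            # 0 * log 0 = 0 by convention
    return float((pi[nz] * np.log(pi[nz] / (pi_i * pi_j)[nz])).sum())

# Independent variables -> I = 0; the filter would drop the feature.
indep = np.outer([0.3, 0.7], [0.5, 0.5])
print(mutual_information(indep))           # 0.0

# Perfectly dependent variables -> I = log 2 ≈ 0.693.
dep = np.array([[0.5, 0.0], [0.0, 0.5]])
print(mutual_information(dep))
```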
Empirical Mutual Information
– a common way to use MI in practice
Data (n) → contingency table
Problems of the empirical approach
– Is Î > 0 due to random fluctuations? (finite sample)
– How to know if it is reliable, e.g. by P(I | n)?
j\i |  1   |  2   | … |  r
 1  | n_11 | n_12 | … | n_1r
 2  | n_21 | n_22 | … | n_2r
 …  |      |      |   |
 s  | n_s1 | n_s2 | … | n_sr

n_ij = # of times (i, j) occurred
n_i+ = Σ_j n_ij = # of times i occurred
n_+j = Σ_i n_ij = # of times j occurred
n = Σ_ij n_ij = dataset size

π̂_ij = n_ij / n,   Î = I(π̂)
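A minimal sketch of the empirical plug-in estimate, illustrating the fluctuation problem: the data are drawn from two genuinely independent variables (true MI = 0), yet the estimate Î computed from the finite contingency table comes out strictly positive. Variable names and sample sizes here are illustrative choices, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_mi(counts):
    """Empirical MI: plug pi_hat[i, j] = n_ij / n into I(pi)."""
    n = counts.sum()
    pi = counts / n
    pi_i = pi.sum(axis=1, keepdims=True)
    pi_j = pi.sum(axis=0, keepdims=True)
    nz = pi > 0
    return float((pi[nz] * np.log(pi[nz] / (pi_i * pi_j)[nz])).sum())

# Draw two INDEPENDENT variables and build the contingency table n_ij.
x = rng.integers(0, 3, size=200)        # feature with r = 3 values
y = rng.integers(0, 2, size=200)        # class with s = 2 values
counts = np.zeros((3, 2))
np.add.at(counts, (x, y), 1)            # unbuffered accumulation of counts

# True MI is 0, but the empirical estimate is > 0 on a finite sample.
print(empirical_mi(counts))
```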
We Need the Distribution of MI
Bayesian approach
– Prior distribution for the unknown chances π
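One way to see what the distribution of MI buys: sample the unknown chances π from a Dirichlet posterior over the contingency-table cells and look at the induced distribution of I. This Monte Carlo sketch is an assumption-laden stand-in for the talk's closed-form expressions; the uniform prior (`prior=1.0`), the threshold `eps`, and the decision rule "keep the feature only if P(I > ε | n) is high" are illustrative choices here, not the paper's exact filter definitions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mi(pi):
    """I(pi) for a joint distribution given as a table."""
    pi_i = pi.sum(axis=1, keepdims=True)
    pi_j = pi.sum(axis=0, keepdims=True)
    nz = pi > 0
    return float((pi[nz] * np.log(pi[nz] / (pi_i * pi_j)[nz])).sum())

def mi_posterior_samples(counts, n_samples=5000, prior=1.0):
    """Sample chances pi ~ Dirichlet(n_ij + prior) over the cells of the
    contingency table and return the induced samples of I (Monte Carlo)."""
    shape = counts.shape
    alpha = counts.flatten() + prior
    sams = rng.dirichlet(alpha, size=n_samples)
    return np.array([mi(s.reshape(shape)) for s in sams])

counts = np.array([[30.0, 10.0], [10.0, 30.0]])
I = mi_posterior_samples(counts)
eps = 0.05                              # illustrative relevance threshold
print(I.mean(), I.std())                # posterior mean and spread of MI
print((I > eps).mean())                 # P(I > eps | n): a robust filter
                                        # keeps the feature only if this
                                        # probability is close to 1
```

The closed-form moments in the paper give the same mean/variance information at O(rs) cost, without the sampling loop.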
Collected measures for each filter
– Average # of correct predictions (prediction accuracy)
– Average # of features used
[Diagram: sequential experimental setup. The filter selects features from the learning data (instances up to k); Naive Bayes classifies test instance k+1, which is stored after classification; the process repeats up to instance N.]
Results on 10 Complete Datasets
# of used features
Accuracies NOT significantly different
– Except Chess & Spam with FF
# Instances | # Features | Dataset    | FF   | F    | BF
690         | 36         | Australian | 32.6 | 34.3 | 35.9
Extension to Incomplete Samples
MAR assumption
– General case: missing features and class
  EM + closed-form expressions
– Missing features only
  Closed-form approximate expressions for Mean and Variance
  Complexity still O(rs)
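For orientation, the complete-data leading-order moments of the MI distribution (as derived in Hutter's work on the distribution of mutual information; the incomplete-sample expressions referred to above generalize these) have the form:

```latex
\mathbb{E}[I \mid n] \;=\; \hat I \;+\; \frac{(r-1)(s-1)}{2n} \;+\; O(n^{-2})

\operatorname{Var}[I \mid n] \;=\; \frac{1}{n}\left(
  \sum_{ij} \hat\pi_{ij}\,
  \log^2\!\frac{\hat\pi_{ij}}{\hat\pi_{i+}\hat\pi_{+j}}
  \;-\; \hat I^{\,2}\right) \;+\; O(n^{-2})
```

where π̂_ij = n_ij/n and Î = I(π̂) as defined earlier; both expressions cost O(rs) to evaluate, the same as the empirical estimate itself.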
New experiments
– 5 data sets
– Similar behavior
[Plot: prediction accuracy vs. instance number (0 to 3000) on the Hypothyroidloss dataset; curves for filters F and FF; accuracy roughly between 0.90 and 1.00.]
Conclusions
Expressions for several moments of the MI distribution are available
– The distribution can be approximated well
– Safer inferences, same computational complexity as empirical MI
– Why not use it?
Robust feature selection shows the power of the MI distribution
– FF outperforms the traditional filter F
Many useful applications possible
– Inference of Bayesian nets
– Inference of classification trees
– …