Discovering Unrevealed Properties of Probability Estimation Trees: on Algorithm Selection and Performance Explanation
Kun Zhang, Wei Fan, Bill Buckles, Xiaojing Yuan, and Zujia Xu
Dec. 21, 2006
Transcript
What this Paper Offers
• The preference of a probability estimation tree (PET) under different conditions
• Many important and previously unrevealed properties of PETs
• A practical guide for choosing the most appropriate PET algorithm
• For larger AUC, P(y|x,θ) should vary from one test point to another
• The number of unique probability estimates is thereby maximized
• Observed ranking: RDT > BPET > CFT > C4.4 > C4.5
AUC = (Σ_{i=1}^{n₀} r_i − n₀(n₀+1)/2) / (n₀ n₁), where r_i is the rank of the i-th positive example in the score-sorted list, n₀ is the number of positive examples, and n₁ the number of negative examples.
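The rank-sum form of AUC can be computed directly. A minimal sketch (the function name `rank_sum_auc` is mine; ties receive mid-ranks, which matches the usual AUC definition and shows why many tied probability estimates pull AUC toward 0.5):

```python
def rank_sum_auc(scores, labels):
    """AUC via the Wilcoxon rank-sum statistic.

    scores: predicted P(y=1|x) for each example
    labels: 1 for positive, 0 for negative
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied scores
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mid = (i + j) / 2.0 + 1.0          # mid-rank (1-based)
        for k in range(i, j + 1):
            ranks[order[k]] = mid
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    r_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (r_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```

Perfectly separated, distinct scores give AUC = 1, while a model emitting one constant probability scores 0.5 regardless of accuracy, which is why unique probability values matter for AUC.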
Behind the Scenes - Why is RDT (BPET) preferable on low (high) signal-separability datasets?
1. RDT discards any criterion for optimal feature selection.
2. It behaves more like a structure for data summarization.
3. When signal separability is low, this property protects RDT from identifying noise as signal, or from overfitting on noise, which the massive search and optimization adopted by BPET is likely to cause.
4. RDT averages the probability estimates of its trees, and this average approaches the mean of the true probabilities as more individual trees are added.
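The averaging idea can be sketched as a toy implementation (illustrative only, not the authors' RDT code: each tree splits on a randomly chosen feature and threshold with no purity criterion, leaves keep class counts, and the ensemble averages Laplace-smoothed leaf frequencies):

```python
import random

def build_random_tree(X, y, depth, rng):
    """Grow a tree with purely random splits; leaves store class counts."""
    if depth == 0 or len(set(y)) <= 1:
        return ("leaf", sum(y), len(y))            # (#positives, #examples)
    f = rng.randrange(len(X[0]))                   # random feature, no search
    lo, hi = min(x[f] for x in X), max(x[f] for x in X)
    if lo == hi:
        return ("leaf", sum(y), len(y))
    t = rng.uniform(lo, hi)                        # random threshold
    left = [i for i in range(len(X)) if X[i][f] <= t]
    right = [i for i in range(len(X)) if X[i][f] > t]
    if not left or not right:
        return ("leaf", sum(y), len(y))
    return ("node", f, t,
            build_random_tree([X[i] for i in left], [y[i] for i in left], depth - 1, rng),
            build_random_tree([X[i] for i in right], [y[i] for i in right], depth - 1, rng))

def tree_proba(node, x):
    while node[0] == "node":
        _, f, t, l, r = node
        node = l if x[f] <= t else r
    _, pos, n = node
    return (pos + 1.0) / (n + 2.0)                 # Laplace-smoothed estimate

def rdt_proba(trees, x):
    """P(y=1|x) averaged over the ensemble."""
    return sum(tree_proba(t, x) for t in trees) / len(trees)

# Toy data: y = 1 iff x0 > 0.5; the second feature is irrelevant noise.
rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [1 if x[0] > 0.5 else 0 for x in X]
trees = [build_random_tree(X, y, 5, rng) for _ in range(30)]
```

Even though no individual split is optimized, averaging over 30 trees gives a clearly higher estimate deep in the positive region than in the negative one.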
[Figure: MSE (left) and AUC (right) vs. percentage of the 75% data examples (20–100%) for BagPET, RDT, C4.4, C4.5, and CFT]
Behind the Scenes - Why is RDT (BPET) preferable on low (high) signal-separability datasets?
• The evidence (I) – Spect and Sonar, low-signal-separability domains
Behind the Scenes - Why is RDT (BPET) preferable on low (high) signal-separability datasets?
• The evidence (II) – Pima, a low-signal-separability domain
[Figure: reliability plots (score vs. empirical probability) with per-bin frequency histograms. RDT: Cal = 0.0036; BPET: Cal = 0.018]
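The Cal values in these plots measure how far binned scores sit from the empirical positive rates. The paper's exact definition may differ; a common binned version (function name mine) is the size-weighted mean squared gap between the average score and the empirical positive frequency in each bin:

```python
def cal_score(scores, labels, n_bins=10):
    """Size-weighted mean squared gap between a bin's average score and its
    empirical positive frequency, over equal-width score bins on [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)     # clamp s == 1.0 into last bin
        bins[idx].append((s, y))
    total, n = 0.0, len(scores)
    for b in bins:
        if not b:
            continue
        avg_score = sum(s for s, _ in b) / len(b)
        emp_prob = sum(y for _, y in b) / len(b)
        total += len(b) * (avg_score - emp_prob) ** 2
    return total / n
```

Scores that match the positive frequency in every bin give 0; overconfident scores (e.g. 0.9 on all-negative examples) are penalized quadratically.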
Behind the Scenes - Why is RDT (BPET) preferable on low (high) signal-separability datasets?
• The evidence (III) – Spam, a high-signal-separability domain
[Figure: reliability plots with per-bin frequency histograms on Spam. RDT: Cal = 0.013; BPET: Cal = 0.0038]
[Figure: MSE (left) and AUC (right) vs. percentage of the 75% data examples for BagPET, RDT, C4.4, C4.5, and CFT]
Behind the Scenes - Why do high-separability categorical datasets with limited feature values hurt RDT?
• The observations – Tic-tac-toe and Chess
Behind the Scenes - Why do high-separability categorical datasets with limited feature values hurt RDT?
• The reason: high-separability categorical datasets with limited feature values tend to restrict the degree of diversity that RDT's random feature selection can explore.
- Random feature selection mechanism of RDT:
  • A categorical feature can be chosen only once along a path;
  • A continuous feature can be chosen multiple times, but with a different splitting value each time.
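That selection rule can be sketched as follows (names mine): along a path, a categorical feature leaves the candidate pool once used, while a continuous feature stays available with a fresh threshold each time. With only a handful of limited-valued categorical features, the pool empties after a few levels, capping both tree depth and ensemble diversity:

```python
def candidate_features(feature_kinds, used_on_path):
    """feature_kinds: {name: 'categorical' or 'continuous'}
    used_on_path: feature names already split on above this node."""
    return [f for f, kind in feature_kinds.items()
            if kind == "continuous" or f not in used_on_path]

kinds = {"color": "categorical", "age": "continuous"}
# "color" drops out after one use; "age" can be reused with a new threshold.
```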
The reasons:
1. Low-signal-separability domains: good performance benefits from the probability aggregation mechanism, which rectifies errors introduced into the probability estimates by attribute noise.
2. High-signal-separability domains: aggregating the estimated probabilities from other, irrelevant leaves adversely affects the final probability estimates.
Behind the Scenes - Why is CFT preferable on low-signal-separability datasets?
[Figure: MSE (left) and AUC (right) vs. percentage of the 75% data examples for BagPET, RDT, C4.4, C4.5, and CFT]
• The evidence (I) – Spect and Pima, low-signal-separability domains
Behind the Scenes - Why is CFT preferable on low-signal-separability datasets?
[Figure: reliability plots with per-bin frequency histograms on Liver. CFT: Cal = 0.0044; C4.4: Cal = 0.081]
• The evidence (II) - Liver, a low-signal separability domain
Choosing the Appropriate PET Algorithm Given a New Problem
• Given a dataset, estimate its signal-noise separability via the AUC score of an RDT or BPET run.
• AUC < 0.9: low signal-noise separability
  - Ensemble acceptable: RDT or CFT (on AUC, MSE, and error rate)
  - Single tree required: CFT for AUC; C4.5 or C4.4 for MSE and error rate
• AUC >= 0.9: high signal-noise separability; check feature types and value characteristics
  - Categorical features with limited values: BPET
  - Continuous features (or categorical features with a large number of values): RDT (or BPET), on AUC, MSE, and error rate
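The selection flowchart reads as a small decision procedure. A sketch, assuming the separability index is the AUC of a quick RDT/BPET probe run (the function and parameter names are mine, not the paper's):

```python
def choose_pet(separability_auc, ensemble_ok=True, metric="AUC",
               features="continuous"):
    """Suggest a PET algorithm following the paper's selection flowchart.

    separability_auc: AUC of an RDT/BPET probe run, used as the
                      signal-noise separability index (threshold 0.9).
    ensemble_ok: whether an ensemble model is acceptable.
    metric: "AUC", "MSE", or "ErrorRate".
    features: "continuous" (or many-valued categorical) vs.
              "limited_categorical".
    """
    if separability_auc < 0.9:                 # low signal-noise separability
        if ensemble_ok:
            return "RDT or CFT"                # good on AUC, MSE, error rate
        return "CFT" if metric == "AUC" else "C4.5 or C4.4"
    # high signal-noise separability: decide by feature characteristics
    if features == "limited_categorical":
        return "BPET"
    return "RDT (or BPET)"
```

Example: a dataset whose RDT probe yields AUC 0.7 and which must ship as a single tree optimized for error rate would get `choose_pet(0.7, ensemble_ok=False, metric="ErrorRate")`.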
Summary
• AUC: an index of signal-noise separability
• The preference of a PET depends on multiple evaluation metrics, the "signal-noise separability" of the dataset, and other observable statistics
• Many important and previously unrevealed properties of PETs are analyzed
• A practical guide for choosing the most appropriate PET algorithm is provided