Smooth Boosting By Using An Information-Based Criterion
Kohei Hatano, Kyushu University, JAPAN
Organization of this talk
1. Introduction
2. Preliminaries
3. Our booster
4. Experiments
5. Summary
Boosting
• Methodology to combine prediction rules into a more accurate one.
• Example: learning a rule to classify web pages about "Drew Barrymore".
  – Set of prediction rules = words; labeled training data = web pages.
  – A single word rule such as "Barrymore?" (predict YES/NO) is weak: accuracy only 51%, since "Barrymore" alone is ambiguous. ("The Barrymore family" of Hollywood: John Drew Barrymore (her father), Jaid Barrymore (her mother), John Barrymore (her grandpa), Lionel Barrymore (her granduncle), Diana Barrymore (her aunt).)
  – A combination of prediction rules, e.g. "Barrymore?" + "Drew?" + "Charlie's Angels?" (say, a majority vote), reaches accuracy 80%.
Boosting by filtering [Schapire 90], [Freund 95]
• Boosting scheme that uses random sampling from the data:
  (huge) data → sample randomly → boosting algorithm → accept / reject.
• Advantage 1: can determine the sample size adaptively.
  – batch learning: O(1/ε); boosting by filtering: polylog(1/ε) (ε: desired error).
• Advantage 2: smaller space complexity (for the sample).
Some known results
• Boosting algorithms by filtering:
  – Schapire's first boosting alg. [Schapire 90], Boost-by-Majority [Freund 95], MadaBoost [Domingo&Watanabe 00], AdaFlat [Gavinsky 03].
  – Criterion for choosing prediction rules: accuracy.
• Are there any better criteria?
• A candidate: an information-based criterion.
  – Real AdaBoost [Schapire&Singer 99], InfoBoost [Aslam 00] (a simple version of Real AdaBoost).
  – Criterion for choosing prediction rules: mutual information.
  – Sometimes faster than boosters using an accuracy-based criterion (experimental: [Schapire&Singer 99]; theoretical: [Hatano&Warmuth 03], [Hatano&Watanabe 04]).
  – However, no boosting algorithm by filtering is known.
Our work
• Boosting by filtering → lower space complexity.
• Information-based criterion → faster convergence.
• Our work: an efficient boosting-by-filtering algorithm using an information-based criterion.
1. Introduction  2. Preliminaries  3. Our booster  4. Experiments  5. Summary
Illustration of general boosting
Training data: (x1,+1), (x2,+1), (x3,−1), (x4,−1), (x5,+1); distribution D1 = (0.2, 0.2, 0.2, 0.2, 0.2).
1. Choose a prediction rule h1 maximizing some criterion w.r.t. D1 (predictions of h1: +1, +1, −1, +1, −1).
2. Assign a coefficient to h1 based on its quality (here 0.25).
3. Update the distribution: correctly classified examples get lower weight, wrongly classified examples get higher weight.
Illustration of general boosting (2)
Training data as above; updated distribution D2 = (0.16, 0.16, 0.21, 0.21, 0.26).
1. Choose a prediction rule h2 maximizing some criterion w.r.t. D2 (predictions of h2: −1, −1, −1, −1, +1).
2. Assign a coefficient to h2 based on its weighted error (here 0.28).
3. Update the distribution (correct: lower weight, wrong: higher weight).
Repeat this procedure for T rounds.
Illustration of general boosting (3)
Final prediction rule = weighted majority vote of the chosen prediction rules, e.g.
H(x) = 0.25·h1(x) + 0.28·h2(x) + 0.05·h3(x) + … + 0.02·hT(x).
Given an instance x, predict +1 if H(x) > 0, and −1 otherwise.
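To make the three-step loop concrete, here is a minimal generic sketch (my own illustration, not code from the talk). The names `choose`, `coefficient`, and `reweight` are placeholders that a concrete booster such as AdaBoost, MadaBoost, or GiniBoost fills in:

```python
import numpy as np

def boost(preds, y, choose, coefficient, reweight, T):
    """Generic boosting loop.

    preds: (n_rules, m) array of {+1,-1} predictions of the candidate rules
           on the m training examples; y: (m,) array of {+1,-1} labels.
    choose(h, y, D)      -> score of rule h under distribution D (the criterion)
    coefficient(h, y, D) -> coefficient alpha_t for the chosen rule
    reweight(H, y)       -> unnormalized weights for the next distribution
    """
    m = len(y)
    D = np.full(m, 1.0 / m)                 # D_1: uniform over the training examples
    H = np.zeros(m)                         # H_t(x_i), the combined score so far
    for t in range(T):
        scores = [choose(h, y, D) for h in preds]
        h = preds[int(np.argmax(scores))]   # 1. rule maximizing the criterion w.r.t. D_t
        alpha = coefficient(h, y, D)        # 2. coefficient based on its quality
        H = H + alpha * h                   # H_t = H_{t-1} + alpha_t * h_t
        D = reweight(H, y)                  # 3. correct examples lower, wrong examples higher
        D = D / D.sum()
    return np.sign(H)                       # weighted majority vote on the training set
```

With the edge as the criterion, α_t = (1/2)·ln((1+r_t)/(1−r_t)) as the coefficient, and exponential reweighting, this loop becomes AdaBoost (next slide).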
Example: AdaBoost [Freund & Schapire 97]
• Criterion for choosing prediction rules (edge):
  h_t = argmax_{h ∈ W} edge_{D_t}(h), where edge_{D_t}(h) = Σ_{i=1}^m y_i h(x_i) D_t(x_i).
• Coefficient:
  α_t = (1/2) ln((1 + r_t)/(1 − r_t)), where r_t = Σ_i y_i h_t(x_i) D_t(x_i) (the edge of h_t).
• Update:
  H_t = H_{t−1} + α_t h_t;  D_{t+1}(x_i) = exp(−y_i H_t(x_i)) / Σ_{i=1}^m exp(−y_i H_t(x_i)).
[Plot: the weight exp(−y_i H_t(x_i)) as a function of −y_i H_t(x_i); correct examples get low weight, wrong examples get exponentially large weight.]
• Difficult examples (possibly noisy) may receive too much weight.
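The three AdaBoost components, written out as plug-ins for the generic loop sketched earlier (a paraphrase of the slide's formulas, not the author's code):

```python
import numpy as np

def edge(h, y, D):
    """edge_D(h) = sum_i y_i h(x_i) D(x_i): AdaBoost's criterion for choosing rules."""
    return float(np.sum(y * h * D))

def adaboost_coefficient(h, y, D):
    """alpha_t = (1/2) ln((1 + r_t) / (1 - r_t)), where r_t is the edge of the chosen rule."""
    r = edge(h, y, D)
    return 0.5 * np.log((1.0 + r) / (1.0 - r))

def adaboost_reweight(H, y):
    """Unnormalized AdaBoost weights exp(-y_i H_t(x_i)); the loop normalizes them to D_{t+1}."""
    return np.exp(-y * H)
```

The unbounded exponential weight is what lets difficult (possibly noisy) examples accumulate very large weight, which the smooth boosters below avoid.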
Smooth boosting
• Keep the distribution "smooth": sup_x D_t(x)/D_1(x) is poly-bounded
  (D_1: original distribution, e.g. uniform; D_t: distribution constructed by the booster).
• Smoothness makes boosting algorithms
  – noise-tolerant:
    • (statistical query model) MadaBoost [Domingo&Watanabe 00]
    • (malicious noise model) SmoothBoost [Servedio 01]
    • (agnostic boosting model) AdaFlat [Gavinsky 03]
  – efficient to simulate: sampling from D_t can be simulated efficiently via sampling from D_1 (e.g., by rejection sampling, as sketched below)
  → applicable in the boosting-by-filtering framework.
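Why smoothness enables filtering: if sup_x D_t(x)/D_1(x) ≤ B, an example drawn from D_1 can be accepted with probability D_t(x)/(B·D_1(x)) to simulate a draw from D_t. A minimal sketch (the function names `draw_from_D1`, `weight_ratio`, and the bound `B` are illustrative, not from the paper):

```python
import random

def sample_from_Dt(draw_from_D1, weight_ratio, B):
    """Rejection sampling: simulate one draw from D_t using draws from D_1.

    draw_from_D1(): returns an example x drawn from the original distribution D_1.
    weight_ratio(x): returns D_t(x)/D_1(x), assumed to be at most B (smoothness).
    The expected number of draws until acceptance is B, so a poly-bounded
    ratio keeps the simulation efficient.
    """
    while True:
        x = draw_from_D1()
        if random.random() <= weight_ratio(x) / B:   # accept with prob. D_t(x)/(B * D_1(x))
            return x
```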
Example: MadaBoost [Domingo & Watanabe 00]
• Criterion for choosing prediction rules (edge):
  h_t = argmax_{h ∈ W} Σ_{i=1}^m y_i h(x_i) D_t(x_i) (same as AdaBoost).
• Coefficient:
  α_t = (1/2) ln((1 + r_t)/(1 − r_t)), where r_t = Σ_i y_i h_t(x_i) D_t(x_i).
• Update:
  H_t = H_{t−1} + α_t h_t;  D_{t+1}(x_i) = ℓ(−y_i H_t(x_i)) / Σ_{i=1}^m ℓ(−y_i H_t(x_i)),
  where ℓ is AdaBoost's exponential weight clipped at 1.
[Plot: the clipped weight ℓ(−y_i H_t(x_i)) as a function of −y_i H_t(x_i).]
• D_t is 1/ε-bounded (ε: error of H_t).
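Compared with AdaBoost, the only change in the reweighting step is the clipping, which is what keeps the distribution smooth. A minimal sketch (same conventions as the earlier snippets):

```python
import numpy as np

def madaboost_reweight(H, y):
    """MadaBoost weight l(-y_i H_t(x_i)) with l(z) = min(1, exp(z)).

    Correctly classified examples keep the exponential weight (< 1), while
    misclassified ones are capped at 1, so no example can dominate D_{t+1}.
    """
    return np.minimum(1.0, np.exp(-y * H))
```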
Examples of other smooth boosters
• LogitBoost [Friedman et al. 00]: logistic weight function.
• AdaFlat [Gavinsky 03]: stepwise linear weight function.
1. Introduction  2. Preliminaries  3. Our booster  4. Experiments  5. Summary
Our new booster
• Criterion for choosing prediction rules (pseudo gain):
  h_t = argmax_{h ∈ W} γ_t(h) (the pseudo gain, defined on the next slide).
• Coefficient: two coefficients per round, depending on the sign of h_t(x):
  α_t(z) = α_t[+1] if z > 0, and α_t[−1] if z < 0, where α_t[±1] = δ_t[±1]/2 and
  δ_t[±1] = Σ_{i: h_t(x_i)=±1} y_i h_t(x_i) D_t(x_i) / Σ_{i: h_t(x_i)=±1} D_t(x_i).
• Update:
  H_t(x) = H_{t−1}(x) + α_t(h_t(x)) h_t(x);
  D_{t+1}(x_i) = ℓ(−y_i H_t(x_i)) / Σ_{i=1}^m ℓ(−y_i H_t(x_i)) (the same weight function ℓ as in MadaBoost).
• Still, D_t is 1/ε-bounded (ε: error of H_t).
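A sketch of one GiniBoost update round, reconstructed from the formulas above (my reading of the slide; the conditional edges δ_t[±1] and the halved coefficients are as defined above, and GiniBoost2, used later in the experiments, simply drops the halving):

```python
import numpy as np

def giniboost_coefficients(h, y, D):
    """alpha_t[b] = delta_t[b] / 2, where delta_t[b] is the edge of h_t restricted
    to the examples with h_t(x) = b (b in {+1, -1})."""
    alphas = {}
    for b in (+1, -1):
        mask = (h == b)
        p_b = D[mask].sum()
        delta_b = np.sum(y[mask] * h[mask] * D[mask]) / p_b if p_b > 0 else 0.0
        alphas[b] = delta_b / 2.0            # GiniBoost2 would use delta_b itself
    return alphas

def giniboost_update(H, h, alphas):
    """H_t(x) = H_{t-1}(x) + alpha_t[h_t(x)] * h_t(x)."""
    alpha_per_example = np.where(h == +1, alphas[+1], alphas[-1])
    return H + alpha_per_example * h
```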
Pseudo gain
γ_t(h) = p_t δ_t[+1]² + (1 − p_t) δ_t[−1]²,
where p_t = Pr_{x∼D_t}{h(x) = +1} and
δ_t[±1] = Σ_{i: h(x_i)=±1} y_i h(x_i) D_t(x_i) / Σ_{i: h(x_i)=±1} D_t(x_i)
(the edge of h conditioned on h(x) = ±1).
Recall edge_D(h) = Σ_{i=1}^m y_i h(x_i) D(x_i), so that edge_{D_t}(h) = p_t δ_t[+1] + (1 − p_t) δ_t[−1].
Relation to edge
γ_t(h) = p_t δ_t[+1]² + (1 − p_t) δ_t[−1]².
Property: γ_t(h) ≥ (p_t δ_t[+1] + (1 − p_t) δ_t[−1])² = edge_{D_t}(h)² (by convexity of the square function, i.e., Jensen's inequality).
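A short numerical sketch of the pseudo gain and the property above (my own illustration, not from the talk): compute γ_t(h) and edge_{D_t}(h) for a rule and check that γ ≥ edge².

```python
import numpy as np

def pseudo_gain(h, y, D):
    """gamma(h) = p * delta[+1]^2 + (1 - p) * delta[-1]^2,
    p = Pr_{x~D}[h(x) = +1], delta[b] = conditional edge of h given h(x) = b."""
    gamma = 0.0
    for b in (+1, -1):
        mask = (h == b)
        p_b = D[mask].sum()                       # Pr_D[h(x) = b]
        if p_b > 0:
            delta_b = np.sum(y[mask] * h[mask] * D[mask]) / p_b
            gamma += p_b * delta_b ** 2
    return gamma

def edge(h, y, D):
    return float(np.sum(y * h * D))

# toy check of gamma(h) >= edge(h)^2 on random +/-1 data under a uniform distribution
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=100)
h = rng.choice([-1, 1], size=100)
D = np.full(100, 0.01)
assert pseudo_gain(h, y, D) >= edge(h, y, D) ** 2 - 1e-12
```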
Interpretation of pseudo gain
max_h γ_t(h)  ⟺  min_h (conditional entropy of the labels given h)  ⟺  max_h (mutual information between h and the labels)
but … the entropy function here is NOT Shannon's entropy; it is defined with the Gini index.
Information-based criteria (binary entropy functions)
• E_Shannon(p) = −p log p − (1 − p) log(1 − p)
• E_KM(p) = 2√(p(1 − p))   [Kearns & Mansour 98]
• E_Gini(p) = 4 p (1 − p)
Cf. Real AdaBoost and InfoBoost choose a prediction rule that maximizes the mutual information defined with the KM entropy.
Good news: the Gini index can be estimated efficiently via sampling!
→ Our booster chooses a prediction rule maximizing the mutual information defined by the Gini index (GiniBoost).
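For reference, the three entropy functions in code (a small illustration; the Shannon entropy is taken with a base-2 logarithm here):

```python
import numpy as np

def entropy_gini(p):     # E_Gini(p) = 4 p (1 - p)
    return 4.0 * p * (1.0 - p)

def entropy_km(p):       # E_KM(p) = 2 sqrt(p (1 - p))   [Kearns & Mansour 98]
    return 2.0 * np.sqrt(p * (1.0 - p))

def entropy_shannon(p):  # E_Shannon(p) = -p log p - (1 - p) log(1 - p)
    p = np.clip(p, 1e-12, 1.0 - 1e-12)   # avoid log(0)
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)
```

All three peak at p = 1/2 and vanish at p ∈ {0, 1}; the Gini entropy is a simple quadratic, which is consistent with the slide's point that it is easy to estimate from a sample.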
Convergence of the training error (GiniBoost)
Thm. Suppose that the training error of H_t is greater than ε for t = 1, …, T. Then
  train.err(H_T) ≤ 1 − (ε/4) Σ_{t=1}^T γ_t(h_t).
Coro. Further, if γ_t(h_t) ≥ γ for all t, then train.err(H_T) ≤ ε within T = O(1/(γε)) steps.
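For completeness, the step from the theorem to the corollary, under the bound as reconstructed above (γ is the assumed uniform lower bound on the pseudo gains):

```latex
% If train.err(H_t) > \epsilon held for all t \le T, the theorem and \gamma_t(h_t) \ge \gamma give
\mathrm{train.err}(H_T) \;\le\; 1 - \frac{\epsilon}{4}\sum_{t=1}^{T}\gamma_t(h_t)
                        \;\le\; 1 - \frac{\epsilon\,\gamma\,T}{4}.
% This contradicts train.err(H_T) > \epsilon as soon as
T \;\ge\; \frac{4(1-\epsilon)}{\epsilon\,\gamma} \;=\; O\!\left(\frac{1}{\gamma\,\epsilon}\right),
% so the training error must drop to \epsilon or below within that many rounds.
```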
Comparison on convergence speed

  booster                            | # of iterations for final error ε | comments
  MadaBoost [Domingo&Watanabe 00]    | O(1/(ε δ²))                       | ○ boosting by filtering  ○ adaptive (no need to know δ)  × needs technical assumptions
  SmoothBoost [Servedio 01]          | O(1/(ε δ²))                       | ○ boosting by filtering  × not adaptive
  AdaFlat [Gavinsky 03]              | O(1/(ε² δ²))                      | ○ boosting by filtering  ○ adaptive
  GiniBoost (our result)             | O(1/(ε γ))                        | ○ boosting by filtering  ○ adaptive
  AdaBoost [Freund&Schapire 97]      | O(log(1/ε)/δ²)                    | ○ adaptive  × not boosting by filtering

(γ: minimum pseudo gain; δ: minimum edge. Note γ ≥ δ² by the relation to the edge.)
Boosting-by-filtering version of GiniBoost (outline)
• Multiplicative bounds for the pseudo gain (and more practical bounds using the central limit approximation).
• Adaptive prediction rule selector.
• Boosting algorithm in the PAC learning sense.
A simplified sketch of the filtering step follows.
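To give a flavor of the filtering version, here is a simplified sketch (not the paper's exact procedure: a fixed acceptance count replaces the adaptive, multiplicative-bound stopping rule, and the names `draw_example` and `weight` are illustrative): draw examples from D_1, accept each with probability equal to its current weight, and estimate the pseudo gain of a candidate rule on the accepted sample.

```python
import numpy as np

def estimate_pseudo_gain_by_filtering(draw_example, weight, h, n_accept):
    """Estimate the pseudo gain of rule h from a filtered sample.

    draw_example(): returns (x, y) drawn from the original distribution D_1.
    weight(x, y): current weight in [0, 1] (e.g. min(1, exp(-y * H(x)))).
    Accepted examples are distributed according to D_t (up to normalization),
    so the pseudo-gain formula can be evaluated on them with uniform weights.
    """
    xs, ys = [], []
    while len(ys) < n_accept:
        x, y = draw_example()
        if np.random.random() <= weight(x, y):     # filtering / rejection step
            xs.append(x)
            ys.append(y)
    preds = np.array([h(x) for x in xs])
    ys = np.array(ys)
    D = np.full(len(ys), 1.0 / len(ys))            # uniform over the accepted sample
    gamma = 0.0
    for b in (+1, -1):
        mask = (preds == b)
        p_b = D[mask].sum()
        if p_b > 0:
            delta_b = np.sum(ys[mask] * preds[mask] * D[mask]) / p_b
            gamma += p_b * delta_b ** 2
    return gamma
```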
1. Introduction  2. Preliminaries  3. Our booster  4. Experiments  5. Summary
Experiments
• Topic classification of Reuters news (Reuters-21578).
• Binary classification for each of 5 topics (results are averaged).
• 10,000 examples.
• 30,000 words used as base prediction rules.
• Run the algorithms until they sample 1,000,000 examples in total.
• 10-fold cross-validation.
Test error over Reuters
Note: GiniBoost2 doubles the coefficients α_t[+1], α_t[−1] used in GiniBoost.
Execution time

  algorithm                                   | test error (%) | time (sec.)
  AdaBoost (w/o sampling, run for 100 steps)  | 5.6            | 1349
  MadaBoost                                   | 6.7            | 493
  GiniBoost                                   | 5.8            | 408
  GiniBoost2                                  | 5.5            | 359

→ about 4 times faster! (Cf. a similar result without sampling for Real AdaBoost [Schapire & Singer 99].)
1. Introduction  2. Preliminaries  3. Our booster  4. Experiments  5. Summary
Summary / Open problem
Summary. GiniBoost:
• uses the pseudo gain (based on the Gini index) to choose base prediction rules;
• shows faster convergence in the filtering scheme.
Open problem:
• theoretical analysis of noise tolerance.
Comparison on sample size

  algorithm                                   | # of sampled examples | # of accepted examples | time (sec.)
  AdaBoost (w/o sampling, run for 100 steps)  | N/A                   | N/A                    | 1349
  MadaBoost                                   | 1,032,219             | 157,320                | 493
  GiniBoost1                                  | 1,039,943             | 156,856                | 408
  GiniBoost2                                  | 1,027,874             | 140,916                | 359

Observation: fewer accepted examples → faster selection of prediction rules.