Decision Trees: Some exercises
0.
Exemplifying
how to compute information gains
and how to work with decision stumps
CMU, 2013 fall, W. Cohen, E. Xing, Sample questions, pr. 4
1.
Timmy wants to know how to do well on the ML exam. He collects old statistics and decides to use decision trees to build his model. He has 9 data points and two features: "whether he stayed up late before the exam" (S) and "whether he attended all the classes" (A). We already know the statistics are as below:
Set (all ) = [5+, 4−]
Set (S+) = [3+, 2−], Set (S−) = [2+, 2−]
Set (A+) = [5+, 1−], Set (A−) = [0+, 3−]
Suppose we are going to split first on the feature that gains the most information. Which feature should we choose? How much is the information gain?
You may use the following approximations:
N        3      5      7
log2 N   1.58   2.32   2.81
2.
[Decision stumps:]
S: [5+,4−] splits into S+ → [3+,2−] and S− → [2+,2−] (H[2+,2−] = 1)
A: [5+,4−] splits into A+ → [5+,1−] and A− → [0+,3−] (H[0+,3−] = 0)
H(all) = H[5+, 4−] = H(5/9) = H(4/9)
= −(5/9)·log2(5/9) − (4/9)·log2(4/9)
= (5/9)·log2(9/5) + (4/9)·log2(9/4)
= log2 9 − (5/9)·log2 5 − (4/9)·log2 4
= 2·log2 3 − (5/9)·log2 5 − 8/9 = 0.991076
H(all|S) = (5/9)·H[3+, 2−] + (4/9)·H[2+, 2−] = (5/9)·0.970951 + (4/9)·1 = 0.983861
H(all|A) = (6/9)·H[5+, 1−] + (3/9)·H[0+, 3−] = (6/9)·0.650022 + (3/9)·0 = 0.433348
IG(all, S) = H(all) − H(all|S) = 0.991076 − 0.983861 = 0.007215
IG(all, A) = H(all) − H(all|A) = 0.991076 − 0.433348 = 0.557728
IG(all, S) < IG(all, A) ⇔ H(all|S) > H(all|A), so we choose feature A for the first split; its information gain is 0.557728.
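The computation above can be cross-checked numerically; the sketch below (plain Python, with helper names of my choosing) recomputes both information gains directly from the given counts.

```python
from math import log2

def entropy(pos, neg):
    """Entropy (in bits) of a [pos+, neg-] label partition."""
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c > 0)

def info_gain(parent, children):
    """IG = H(parent) - mean conditional entropy of the children.
    parent and children are (pos, neg) count pairs."""
    n = sum(parent)
    h_cond = sum((p + q) / n * entropy(p, q) for p, q in children)
    return entropy(*parent) - h_cond

ig_S = info_gain((5, 4), [(3, 2), (2, 2)])   # split on S
ig_A = info_gain((5, 4), [(5, 1), (0, 3)])   # split on A
```

Running it reproduces IG(all, S) ≈ 0.0072 and IG(all, A) ≈ 0.5577, confirming that A should be chosen.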
3.
Decision stumps;
entropy, mean conditional entropy, and information gain:
some very convenient formulas to be used when working with pocket calculators
Sebastian Ciobanu, Liviu Ciortuz, 2017
4.
Consider the decision stump given in the nearby image. The symbols a, b, c, d, e and f represent counts computed from a training dataset (not provided). As you see, the label (or output variable), here denoted by Y, is binary, and so is the attribute (or input variable) A. Obviously, a = c + e and b = d + f.
[Decision stump: root [a+, b−], test on A; branch A = 0 → [c+, d−]; branch A = 1 → [e+, f−].]
a. Prove that the entropy of [the output variable] corresponding to the partition associated to the test node in this decision stump is

H[a+, b−] = 1/(a+b) · log2( (a+b)^(a+b) / (a^a · b^b) ), if a ≠ 0 and b ≠ 0.
b. Derive a similar formula for the entropy, for the case when the output variable has three values, and the partition associated to the test node in the decision stump would be [a+, b−, c∗]. (Note that there is no link between the last c and the c count in the above decision stump.)
5.
c. Assume that for the above given decision stump we have [all the counts] c, d, e and f different from 0. Prove that the mean conditional entropy corresponding to this decision stump is

H(node|attribute) = 1/(a+b) · log2( (c+d)^(c+d)/(c^c·d^d) · (e+f)^(e+f)/(e^e·f^f) ).
d. Now suppose that one of the counts c, d, e and f is 0; for example, let's consider c = 0. Infer the formula for the mean conditional entropy in this case.
e. Prove the following formula for the information gain corresponding to the above given decision stump, assuming that a, b, c, d, e and f are all strictly positive:
IG(node; attribute) = 1/(a+b) · log2( (a+b)^(a+b)/(a^a·b^b) · c^c·d^d/(c+d)^(c+d) · e^e·f^f/(e+f)^(e+f) ).
6.
WARNING!
A serious problem when using the above formulas on a pocket calculator is the fact that the internal capacity of representation for intermediate results can be overflowed.
For example, a Sharp EL-531VH pocket calculator can represent the number 56^56 but not 57^57. Similarly, the calculator made available by the Linux Mint operating system [see the Accessories menu] can represent 179^179 but not 180^180.
In such overflow cases, you should use the basic / general formulas for the entropies and the information gain, because they make better use of the log function.
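The same overflow phenomenon appears with double-precision floating point, where n^n already exceeds the largest representable value for n in the mid-hundreds. A small Python sketch (an illustration, not part of the original problem) shows why the log-space form of the entropy formula is safe:

```python
from math import log2

def H_counts(a, b):
    """H[a+, b-] = (1/(a+b)) * log2((a+b)^(a+b) / (a^a * b^b)),
    evaluated entirely in log space: no huge intermediate powers."""
    if a == 0 or b == 0:
        return 0.0
    n = a + b
    return (n * log2(n) - a * log2(a) - b * log2(b)) / n

# Forming the power itself overflows a 64-bit float:
try:
    _ = 400.0 ** 400          # an intermediate the closed form would need
    overflowed = False
except OverflowError:
    overflowed = True
```

(Python integers would not overflow, but a pocket calculator, or any fixed-precision float, does, which is exactly the warning above.)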
7.
Answer
a.
H[a+, b−] = −( a/(a+b) · log2(a/(a+b)) + b/(a+b) · log2(b/(a+b)) )
= −1/(a+b) · ( a·log2(a/(a+b)) + b·log2(b/(a+b)) )
= −1/(a+b) · ( log2 (a/(a+b))^a + log2 (b/(a+b))^b )
= −1/(a+b) · ( log2( a^a/(a+b)^a ) + log2( b^b/(a+b)^b ) )
= −1/(a+b) · log2( a^a·b^b / (a+b)^(a+b) )
= 1/(a+b) · log2( (a+b)^(a+b) / (a^a·b^b) ).
b.
H[a+, b−, c∗] = −( a/(a+b+c)·log2(a/(a+b+c)) + b/(a+b+c)·log2(b/(a+b+c)) + c/(a+b+c)·log2(c/(a+b+c)) )
= −1/(a+b+c) · ( log2 (a/(a+b+c))^a + log2 (b/(a+b+c))^b + log2 (c/(a+b+c))^c )
= −1/(a+b+c) · ( log2( a^a/(a+b+c)^a ) + log2( b^b/(a+b+c)^b ) + log2( c^c/(a+b+c)^c ) )
= −1/(a+b+c) · log2( a^a·b^b·c^c / (a+b+c)^(a+b+c) )
= 1/(a+b+c) · log2( (a+b+c)^(a+b+c) / (a^a·b^b·c^c) ).
8.
c.
H(node|attribute) = (c+d)/(a+b) · H[c+, d−] + (e+f)/(a+b) · H[e+, f−]
= (c+d)/(a+b) · 1/(c+d) · log2( (c+d)^(c+d)/(c^c·d^d) ) + (e+f)/(a+b) · 1/(e+f) · log2( (e+f)^(e+f)/(e^e·f^f) )
= 1/(a+b) · log2( (c+d)^(c+d)/(c^c·d^d) · (e+f)^(e+f)/(e^e·f^f) ).
d. (for c = 0, so that H[c+, d−] = 0)
H(node|attribute) = (e+f)/(a+b) · H[e+, f−]
= (e+f)/(a+b) · ( e/(e+f)·log2((e+f)/e) + f/(e+f)·log2((e+f)/f) )
= (e+f)/(a+b) · 1/(e+f) · ( e·log2((e+f)/e) + f·log2((e+f)/f) )
= 1/(a+b) · log2( (e+f)^(e+f) / (e^e·f^f) ).
e.
IG(node; attribute) = 1/(a+b) · log2( (a+b)^(a+b)/(a^a·b^b) ) − 1/(a+b) · log2( (c+d)^(c+d)/(c^c·d^d) · (e+f)^(e+f)/(e^e·f^f) )
= 1/(a+b) · log2( (a+b)^(a+b)/(a^a·b^b) · c^c·d^d/(c+d)^(c+d) · e^e·f^f/(e+f)^(e+f) ).
9.
Important REMARKS
1. Since most pocket calculators do not provide the log2 function, but only ln and lg, in the formulas presented or derived at points a–e it is desirable to change the base of the logarithm. Besides replacing log2 with ln or lg, this amounts to multiplying the right-hand side by 1/ln 2 or 1/lg 2, respectively.
2. Since, when applying the ID3 algorithm, it suffices to compute the mean conditional entropies in order to choose the best attribute for the current node, it is enough to compare the products of the form

(c+d)^(c+d)/(c^c·d^d) · (e+f)^(e+f)/(e^e·f^f)    (1)

for the decision stumps considered at that node, and to choose the minimum among these products.
10.
Exemplifying the application of the ID3 algorithm
on a toy mushrooms dataset
CMU, 2002(?) spring, Andrew Moore, midterm example questions, pr. 2
11.
You are stranded on a deserted island. Mushrooms of various types grow widely all over the island, but no other food is anywhere to be found. Some of the mushrooms have been determined to be poisonous and others not (determined by your former companions' trial and error). You are the only one remaining on the island. You have the following data to consider:
Example NotHeavy Smelly Spotted Smooth Edible
A 1 0 0 0 1
B 1 0 1 0 1
C 0 1 0 1 1
D 0 0 0 1 0
E 1 1 1 0 0
F 1 0 1 1 0
G 1 0 0 1 0
H 0 1 0 0 0
U 0 1 1 1 ?
V 1 1 0 1 ?
W 1 1 0 0 ?
You know whether or not mushrooms A through H are poisonous, but you do not know about U through W.
12.
For questions a–d, consider only mushrooms A through H.
a. What is the entropy of Edible?
b. Which attribute should you choose as the root of a decision tree? Hint: You can figure this out by looking at the data without explicitly computing the information gain of all four attributes.
c. What is the information gain of the attribute you chose in the previous question?
d. Build an ID3 decision tree to classify mushrooms as poisonous or not.
e. Classify mushrooms U, V and W using the decision tree as poisonous or not poisonous.
f. If the mushrooms A through H that you know are not poisonous suddenly became scarce, should you consider trying U, V and W? Which one(s) and why? Or if none of them, then why not?
13.
a.
H(Edible) = H[3+, 5−] = −(3/8)·log2(3/8) − (5/8)·log2(5/8) = (3/8)·log2(8/3) + (5/8)·log2(8/5)
= (3/8)·3 − (3/8)·log2 3 + (5/8)·3 − (5/8)·log2 5 = 3 − (3/8)·log2 3 − (5/8)·log2 5
≈ 0.9544
14.
b.
Candidate decision stumps for the root node (each starting from [3+,5−]):
• NotHeavy: 0 → [1+,2−], 1 → [2+,3−]
• Smelly: 0 → [2+,3−], 1 → [1+,2−]
• Spotted: 0 → [2+,3−], 1 → [1+,2−]
• Smooth: 0 → [2+,2−] (Node 1), 1 → [1+,3−] (Node 2)
The first three attributes produce the same pair of partitions, hence the same information gain, so only Smooth has to be compared against them.
15.
c.
H(0|Smooth) = (4/8)·H[2+, 2−] + (4/8)·H[1+, 3−]
= (1/2)·1 + (1/2)·( (1/4)·log2 4 + (3/4)·log2(4/3) )
= 1/2 + (1/2)·( (1/4)·2 + (3/4)·2 − (3/4)·log2 3 )
= 1/2 + (1/2)·( 2 − (3/4)·log2 3 )
= 3/2 − (3/8)·log2 3 ≈ 0.9056
IG(0|Smooth) = H(Edible) − H(0|Smooth) = 0.9544 − 0.9056 = 0.0488
16.
d.
H(0|NotHeavy) = (3/8)·H[1+, 2−] + (5/8)·H[2+, 3−]
= (3/8)·( (1/3)·log2 3 + (2/3)·log2(3/2) ) + (5/8)·( (2/5)·log2(5/2) + (3/5)·log2(5/3) )
= (3/8)·( (1/3)·log2 3 + (2/3)·log2 3 − 2/3 ) + (5/8)·( (2/5)·log2 5 − 2/5 + (3/5)·log2 5 − (3/5)·log2 3 )
= (3/8)·( log2 3 − 2/3 ) + (5/8)·( log2 5 − (3/5)·log2 3 − 2/5 )
= (3/8)·log2 3 − 2/8 + (5/8)·log2 5 − (3/8)·log2 3 − 2/8
= (5/8)·log2 5 − 1/2 ≈ 0.9512
⇒ IG(0|NotHeavy) = H(Edible) − H(0|NotHeavy) = 0.9544 − 0.9512 = 0.0032
IG(0|NotHeavy) = IG(0|Smelly) = IG(0|Spotted) = 0.0032 < IG(0|Smooth) = 0.0488
17.
Important Remark (in Romanian)
In loc sa fi calculat efectiv aceste castiguri de informatie, pentru a determina atributulcel mai ,,bun“, ar fi fost suficient sa comparam valorile entropiilor conditionale mediiH0/Smooth si H0/NotHeavy:
IG0/Smooth > IG0/NotHeavy ⇔ H0/Smooth < H0/NotHeavy
⇔ 3
2− 3
8log2 3 <
5
8log2 5−
1
2⇔ 12− 3 log2 3 < 5 log2 5− 4
⇔ 16 < 5 log2 5 + 3 log2 3 ⇔ 16 < 11.6096 + 4.7548 (adev.)
In mod alternativ, tinand cont de formulele de la problema UAIC, 2017 fall, S. Ciobanu,L. Ciortuz, putem proceda chiar mai simplu relativ la calcule (nu doar aici, ori de cateori nu avem de-a face cu un numar mare de instante):
H0/Neteda < H0/Usoara ⇔ 44
��22 ·��22
· 44
33<
55
��22 ·��33
·��33
��22⇔ 48
33< 55 ⇔ 48 < 33 · 55 ⇔ 216 < 33 · 55
⇔ 64 · 210︸︷︷︸
1024
< 27 · 25 · 125︸ ︷︷ ︸
>3 · 8 · 125︸ ︷︷ ︸
1000
(adev.)
18.
Node 1: Smooth = 0 ([2+,2−])
• Smelly: 0 → [2+,0−], 1 → [0+,2−] (a perfect split)
• NotHeavy: 0 → [0+,1−], 1 → [2+,1−]
• Spotted: 0 → [1+,1−], 1 → [1+,1−]
Smelly separates the two classes completely, so it is placed in Node 1.
19.
Node 2: Smooth = 1 ([1+,3−])
• NotHeavy: 0 → [1+,1−] (Node 3), 1 → [0+,2−]
• Smelly: 0 → [0+,3−], 1 → [1+,0−] (a perfect split)
• Spotted: 0 → [1+,2−], 1 → [0+,1−]
Smelly again separates the two classes completely, so it is placed in Node 2.
20.
The resulting ID3 tree
Root [3+,5−]: Smooth
• Smooth = 0 → Smelly: 0 → [2+,0−] (Edible), 1 → [0+,2−] (¬Edible)
• Smooth = 1 → Smelly: 0 → [0+,3−] (¬Edible), 1 → [1+,0−] (Edible)
IF (Smooth = 0 AND Smelly = 0) OR (Smooth = 1 AND Smelly = 1) THEN Edible; ELSE ¬Edible.
Classification of the test instances:
U: Smooth = 1, Smelly = 1 ⇒ Edible = 1
V: Smooth = 1, Smelly = 1 ⇒ Edible = 1
W: Smooth = 0, Smelly = 1 ⇒ Edible = 0
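The whole derivation can be replayed by a minimal recursive ID3 implementation; the sketch below (illustrative only, binary attributes, no tie-breaking subtleties) rebuilds the tree from the eight training mushrooms and classifies U, V and W.

```python
from math import log2

ATTRS = ['NotHeavy', 'Smelly', 'Spotted', 'Smooth']
# training data: ((NotHeavy, Smelly, Spotted, Smooth), Edible)
rows = [((1,0,0,0), 1), ((1,0,1,0), 1), ((0,1,0,1), 1), ((0,0,0,1), 0),
        ((1,1,1,0), 0), ((1,0,1,1), 0), ((1,0,0,1), 0), ((0,1,0,0), 0)]

def entropy(rows):
    n = len(rows)
    if n == 0:
        return 0.0
    p = sum(y for _, y in rows) / n
    return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)

def mean_cond_entropy(rows, a):
    i = ATTRS.index(a)
    parts = [[r for r in rows if r[0][i] == v] for v in (0, 1)]
    return sum(len(p) / len(rows) * entropy(p) for p in parts)

def id3(rows, attrs):
    labels = {y for _, y in rows}
    if len(labels) == 1:
        return labels.pop()              # pure leaf
    best = min(attrs, key=lambda a: mean_cond_entropy(rows, a))
    i = ATTRS.index(best)
    return (best, {v: id3([r for r in rows if r[0][i] == v],
                          [a for a in attrs if a != best]) for v in (0, 1)})

def predict(tree, x):
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[x[ATTRS.index(attr)]]
    return tree

tree = id3(rows, ATTRS)
```

Minimizing the mean conditional entropy is equivalent to maximizing the information gain, which is exactly the shortcut used in the remark about comparing conditional entropies.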
21.
Exemplifying the greedy character of
the ID3 algorithm
CMU, 2003 fall, T. Mitchell, A. Moore, midterm, pr. 9.a
22.
Consider the binary input attributes A, B, C, the output attribute Y, and the following training examples:

A B C Y
1 1 0 0
1 0 1 1
0 1 1 1
0 0 1 0

a. Determine the decision tree computed by the ID3 algorithm. Is this decision tree consistent with the training data?
23.
Answer
Node 0 (the root), [2+,2−]. Candidate decision stumps:
• A: 0 → [1+,1−], 1 → [1+,1−]
• B: 0 → [1+,1−], 1 → [1+,1−]
• C: 0 → [0+,1−], 1 → [2+,1−] (Node 1)
One immediately sees that the first two decision stumps have IG = 0, while the third decision stump has IG > 0. Therefore, we place the attribute C in node 0 (the root).
24.
Node 1: We have to classify the instances with C = 1, so the choice is made between the attributes A and B.
• A: 0 → [1+,1−] (Node 2), 1 → [1+,0−]
• B: 0 → [1+,1−], 1 → [1+,0−]
The two mean conditional entropies are equal:
H(1|A) = H(1|B) = (2/3)·H[1+,1−] + (1/3)·H[1+,0−]
Therefore, we may choose either of the two attributes. To be definite, we choose A.
25.
Node 2: At this node, only the attribute B is still available, so we place it here.
The complete ID3 tree is:
[root C [2+,2−]; C = 0 → [0+,1−]: 0; C = 1 → A [2+,1−]; A = 1 → [1+,0−]: 1; A = 0 → B [1+,1−]; B = 0 → [0+,1−]: 0; B = 1 → [1+,0−]: 1]
By construction, the ID3 algorithm is consistent with the training data whenever the data is consistent (i.e., non-contradictory). In our case, one immediately verifies that the training data is consistent.
26.
b. Is there a decision tree of smaller depth (than that of the ID3 tree) consistent with the above data? If yes, what (logical) concept does it represent?
Answer:
From the data one observes that the output attribute Y is in fact the logical function A xor B.
Representing this function as a decision tree yields:
[root A [2+,2−]; A = 0 → B [1+,1−] (B = 0 → [0+,1−]: 0, B = 1 → [1+,0−]: 1); A = 1 → B [1+,1−] (B = 0 → [1+,0−]: 1, B = 1 → [0+,1−]: 0)]
This tree has one level fewer than the tree built by the ID3 algorithm. Therefore, the ID3 tree is not optimal with respect to the number of levels.
27.
This is a consequence of the "greedy" character of the ID3 algorithm, due to the fact that at each iteration we choose the "best" attribute with respect to the information gain criterion.
It is well known that greedy algorithms do not guarantee that the global optimum is reached.
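The greedy behaviour is easy to verify numerically: at the root, the information gains of A and B are exactly 0, while C has positive gain, even though the target concept is A xor B. A small sketch (illustrative code, not from the original exam):

```python
from math import log2

# training set from the exercise: (A, B, C) -> Y, with Y = A xor B
rows = [((1, 1, 0), 0), ((1, 0, 1), 1), ((0, 1, 1), 1), ((0, 0, 1), 0)]

def H(rows):
    n = len(rows)
    if n == 0:
        return 0.0
    p = sum(y for _, y in rows) / n
    return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)

def ig(rows, j):
    """Information gain of the j-th input attribute."""
    parts = [[r for r in rows if r[0][j] == v] for v in (0, 1)]
    return H(rows) - sum(len(p) / len(rows) * H(p) for p in parts)

gains = [ig(rows, j) for j in range(3)]   # gains for A, B, C
```

ID3 must therefore pick C first, although the depth-2 tree over A and B alone is consistent with the data: the one-step-lookahead criterion cannot see the xor interaction.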
28.
Exemplifying the application of the ID3 algorithm
in the presence of both
categorical and continuous attributes
CMU, 2012 fall, Eric Xing, Aarti Singh, HW1, pr. 1.1
29.
As of September 2012, 800 extrasolar planets have been identified in our galaxy. Super-secret surveying spaceships sent to all these planets have established whether they are habitable for humans or not, but sending a spaceship to each planet is expensive. In this problem, you will come up with decision trees to predict if a planet is habitable based only on features observable using telescopes.
a. In the nearby table you are given the data from all 800 planets surveyed so far. The features observed by telescope are Size ("Big" or "Small") and Orbit ("Near" or "Far"). Each row indicates the values of the features and of habitability, and how many times that set of values was observed. So, for example, there were 20 "Big" planets "Near" their star that were habitable.
Size  Orbit  Habitable  Count
Big   Near   Yes   20
Big   Far    Yes   170
Small Near   Yes   139
Small Far    Yes   45
Big   Near   No    130
Big   Far    No    30
Small Near   No    11
Small Far    No    255
Derive and draw the decision tree learned by ID3 on this data (use the maximum information gain criterion for splits, don't do any pruning). Make sure to clearly mark at each node what attribute you are splitting on, and which value corresponds to which branch. By each leaf node of the tree, write in the number of habitable and uninhabitable planets in the training data that belong to that node.
30.
Answer: Level 1
Root: [374+,426−], H(374/800) = 0.9969. Candidate stumps:
• Size: B → [190+,160−] (H(19/35)), S → [184+,266−] (H(92/225))
• Orbit: N → [159+,141−] (H(47/100)), F → [215+,285−] (H(43/100))

H(Habitable|Size) = (35/80)·H(19/35) + (45/80)·H(92/225) = (35/80)·0.9946 + (45/80)·0.9759 = 0.9841
H(Habitable|Orbit) = (3/8)·H(47/100) + (5/8)·H(43/100) = (3/8)·0.9974 + (5/8)·0.9858 = 0.9901
IG(Habitable; Size) = 0.0128
IG(Habitable; Orbit) = 0.0067
Size has the larger information gain, so it is placed in the root.
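Since every (Size, Orbit, Habitable) combination comes with a count, the two gains can be recomputed directly from the table; below is a sketch (function and variable names are mine, not the original solution's).

```python
from math import log2

counts = {  # (Size, Orbit, Habitable) -> number of planets
    ('Big', 'Near', 'Yes'): 20,    ('Big', 'Far', 'Yes'): 170,
    ('Small', 'Near', 'Yes'): 139, ('Small', 'Far', 'Yes'): 45,
    ('Big', 'Near', 'No'): 130,    ('Big', 'Far', 'No'): 30,
    ('Small', 'Near', 'No'): 11,   ('Small', 'Far', 'No'): 255,
}
N = sum(counts.values())          # 800 planets in total

def H(pos, neg):
    n = pos + neg
    return -sum(c / n * log2(c / n) for c in (pos, neg) if c > 0)

def ig(attr_index, values):
    """IG(Habitable; attribute), the attribute given by its position
    in the key tuples (0 = Size, 1 = Orbit)."""
    pos = sum(v for k, v in counts.items() if k[2] == 'Yes')
    h_cond = 0.0
    for val in values:
        p = sum(v for k, v in counts.items()
                if k[attr_index] == val and k[2] == 'Yes')
        q = sum(v for k, v in counts.items()
                if k[attr_index] == val and k[2] == 'No')
        h_cond += (p + q) / N * H(p, q)
    return H(pos, N - pos) - h_cond

ig_size = ig(0, ('Big', 'Small'))
ig_orbit = ig(1, ('Near', 'Far'))
```

This reproduces IG(Habitable; Size) ≈ 0.0128 > IG(Habitable; Orbit) ≈ 0.0068, so Size goes in the root.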
31.
The final decision tree
Root [374+,426−]: Size
• Big → [190+,160−]: Orbit: Near → [20+,130−] (−), Far → [170+,30−] (+)
• Small → [184+,266−]: Orbit: Near → [139+,11−] (+), Far → [45+,255−] (−)
32.
b. For just 9 of the planets, a third feature, Temperature (in Kelvin degrees), has been measured, as shown in the nearby table. Redo all the steps from part a on this data using all three features. For the Temperature feature, in each iteration you must maximize over all possible binary thresholding splits (such as T ≤ 250 vs. T > 250, for example).
Size  Orbit  Temperature  Habitable
Big   Far    205  No
Big   Near   205  No
Big   Near   260  Yes
Big   Near   380  Yes
Small Far    205  No
Small Far    260  Yes
Small Near   260  Yes
Small Near   380  No
Small Near   380  No
According to your decision tree, would a planet with the features (Big, Near, 280) be predicted to be habitable or not habitable?
Hint: You might need to use the following values of the entropy function for a Bernoulli variable of parameter p:
H(1/3) = 0.9182, H(2/5) = 0.9709, H(92/225) = 0.9759, H(43/100) = 0.9858, H(16/35) = 0.9946, H(47/100) = 0.9974.
33.
Answer
Binary threshold splits for the continuous attribute Temperature: the sorted distinct values are 205, 260 and 380, which give the candidate thresholds 232.5 and 320.
34.
Answer: Level 1
Root: [4+,5−], H(4/9) = 0.9911. Candidate stumps:
• Size: B → [2+,2−] (H = 1), S → [2+,3−] (H(2/5))
• Orbit: N → [3+,3−] (H = 1), F → [1+,2−] (H(1/3))
• T ≤ 232.5: Y → [0+,3−] (H = 0), N → [4+,2−] (H(1/3))
• T ≤ 320: Y → [3+,3−] (H = 1), N → [1+,2−] (H(1/3))

H(Habitable|Size) = (4/9)·1 + (5/9)·H(2/5) = 4/9 + (5/9)·0.9709 = 0.9838
H(Habitable|T ≤ 232.5) = (3/9)·0 + (6/9)·H(1/3) = (2/3)·0.9182 = 0.6121
IG(Habitable; Size) = H(4/9) − 0.9838 = 0.9911 − 0.9838 = 0.0073
IG(Habitable; T ≤ 232.5) = 0.9911 − 0.6121 = 0.3790
Orbit and T ≤ 320 produce the same pair of partitions ([3+,3−] and [1+,2−]), hence the same (smaller) information gain, so the split T ≤ 232.5 is placed in the root.
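The threshold search over the continuous attribute can be sketched as follows (illustrative code; candidate thresholds are taken midway between consecutive distinct values, as in the problem statement):

```python
from math import log2

# the 9 planets: (Temperature, Habitable)
temps = [(205, 0), (205, 0), (260, 1), (380, 1),
         (205, 0), (260, 1), (260, 1), (380, 0), (380, 0)]

def H(rows):
    n = len(rows)
    p = sum(y for _, y in rows) / n
    return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)

def ig_threshold(rows, t):
    """IG of the binary split T <= t vs. T > t."""
    left = [r for r in rows if r[0] <= t]
    right = [r for r in rows if r[0] > t]
    return (H(rows) - len(left) / len(rows) * H(left)
            - len(right) / len(rows) * H(right))

values = sorted({t for t, _ in temps})                 # [205, 260, 380]
thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
gains = {t: ig_threshold(temps, t) for t in thresholds}
```

The two candidates come out as 232.5 and 320, with T ≤ 232.5 clearly the better split.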
35.
Answer: Level 2
Node for T > 232.5: [4+,2−]. Candidate stumps:
• Size: B → [2+,0−] (H = 0), S → [2+,2−] (H = 1)
• Orbit: N → [3+,2−] (H(2/5)), F → [1+,0−] (H = 0)
• T ≤ 320: Y → [3+,0−] (H = 0), N → [1+,2−] (H(1/3))
The split T ≤ 320 has the smallest mean conditional entropy, (3/6)·0 + (3/6)·H(1/3), so it is chosen.
Note: The plain lines [in the original figure] indicate that both the specific conditional entropies and their coefficients (weights) in the mean conditional entropies satisfy the indicated relationship. (For example, H(2/5) > H(1/3) and 5/6 > 3/6.)
The dotted lines indicate that only the specific conditional entropies satisfy the indicated relationship. (For example, H(1/2) = 1 > H(2/5), but 4/6 < 5/6.)
36.
The final decision tree:
Root [4+,5−]: T ≤ 232.5?
• Y → [0+,3−]: −
• N → [4+,2−]: T ≤ 320? Y → [3+,0−]: +; N → [1+,2−]: Size: B → [1+,0−]: +, S → [0+,2−]: −
c. According to your decision tree, would a planet with the features (Big, Near, 280) be predicted to be habitable or not habitable?
Answer: habitable (280 > 232.5 and 280 ≤ 320, so the planet reaches the [3+,0−] leaf).
37.
Exemplifying
the application of the ID3 algorithm
on continuous attributes,
and in the presence of noise.
Decision surfaces; decision boundaries.
The computation of the LOOCV error
CMU, 2002 fall, Andrew Moore, midterm, pr. 3
38.
Suppose we are learning a classifier with binary output values Y = 0 and Y = 1. There is one real-valued input X. The training data is given in the nearby table.
Assume that we learn a decision tree on this data. Assume that when the decision tree splits on the real-valued attribute X, it puts the split threshold halfway between the attributes that surround the split. For example, using information gain as the splitting criterion, the decision tree would initially choose to split at X = 5, which is halfway between the X = 4 and X = 6 datapoints.

X    Y
1    0
2    0
3    0
4    0
6    1
7    1
8    1
8.5  0
9    1
10   1
Let Algorithm DT2 be the method of learning a decision tree with only two leaf nodes (i.e., only one split).
Let Algorithm DT⋆ be the method of learning a decision tree fully, with no pruning.
a. What will be the training set error for DT2 and respectively DT⋆ on our data?
b. What will be the leave-one-out cross-validation error (LOOCV) for DT2 and respectively DT⋆ on our data?
39.
• training data: X = 1, 2, 3, 4 with Y = 0; X = 6, 7, 8 with Y = 1; X = 8.5 with Y = 0; X = 9, 10 with Y = 1
• discretization / decision thresholds: 5, 8.25 and 8.75
• compact representation of the ID3 tree: label 0 on X < 5, label 1 on 5 < X < 8.25, label 0 on 8.25 < X < 8.75, label 1 on X > 8.75
• decision "surfaces": − | 5 | + | 8.25 | − | 8.75 | +
ID3 tree (root [5−,5+]):
X < 5? Yes → 0 [4−,0+]; No → [1−,5+]: X < 8.25? Yes → 1 [0−,3+]; No → [1−,2+]: X < 8.75? Yes → 0 [1−,0+]; No → 1 [0−,2+]
40.
ID3: IG computations
Level 0 (root, [5−,5+]):
• X < 5: [4−,0+] / [1−,5+]
• X < 8.25: [4−,3+] / [1−,2+]
• X < 8.75: [5−,3+] / [0−,2+]
X < 5 yields the largest information gain, so it is chosen.
Level 1 (node [1−,5+], i.e., X > 5):
• X < 8.25: [0−,3+] / [1−,2+], IG = 0.191
• X < 8.75: [1−,3+] / [0−,2+], IG = 0.109
So X < 8.25 is chosen, and the remaining impure node [1−,2+] is then split with X < 8.75.
Decision "surfaces": − | 5 | + | 8.25 | − | 8.75 | +
41.
ID3, LOOCV: decision surfaces for each left-out point
• X = 1, 2, 3: thresholds 5, 8.25, 8.75; surfaces − + − + ⇒ correct
• X = 4: thresholds 4.5, 8.25, 8.75; surfaces − + − + ⇒ correct
• X = 6: thresholds 5.5, 8.25, 8.75; surfaces − + − + ⇒ correct
• X = 7: thresholds 5, 8.25, 8.75; surfaces − + − + ⇒ correct
• X = 8: thresholds 5, 7.75, 8.75; surfaces − + − + ⇒ X = 8 falls in (7.75, 8.75) and is predicted −, but Y = 1 ⇒ error
• X = 8.5: threshold 5; surfaces − + ⇒ X = 8.5 is predicted +, but Y = 0 ⇒ error
• X = 9: thresholds 5, 8.25, 9.25; surfaces − + − + ⇒ X = 9 falls in (8.25, 9.25) and is predicted −, but Y = 1 ⇒ error
• X = 10: thresholds 5, 8.25, 8.75; surfaces − + − + ⇒ correct
LOOCV error: 3/10
42.
DT2
Tree (root [5−,5+]): X < 5? Yes → 0 [4−,0+]; No → 1 [1−,5+]
Decision "surfaces": − on X < 5, + on X > 5
Training error for DT2: 1/10 (the point X = 8.5 is misclassified); the fully grown tree DT⋆ has training error 0.
43.
DT2, LOOCV: IG computations
Case 1: X = 1, 2, 3 or 4 left out (data [4−,5+]):
• X < 5: [3−,0+] / [1−,5+]
• X < 8.25: [3−,3+] / [1−,2+]
• X < 8.75: [4−,3+] / [0−,2+]
(When X = 4 is left out, the threshold 4.5 replaces 5.) X < 5 still wins, so the left-out (negative) point is classified − ⇒ correct.
Case 2: X = 6, 7 or 8 left out (data [5−,4+]):
• X < 5: [4−,0+] / [1−,4+]
• X < 8.25: [4−,2+] / [1−,2+]
• X < 8.75: [5−,2+] / [0−,2+]
(When X = 6 is left out, 5.5 replaces 5; when X = 8 is left out, 7.75 replaces 8.25.) Again X < 5 wins, and the left-out (positive) point is classified + ⇒ correct.
44.
DT2, LOOCV: IG computations (cont'd)
Case 3: X = 8.5 left out (data [4−,5+]): the split X < 5 gives [4−,0+] / [0−,5+], a perfect separation, so it is surely chosen. The left-out point X = 8.5 is classified +, but Y = 0 ⇒ error.
Case 4: X = 9 or 10 left out (data [5−,4+]):
• X < 5: [4−,0+] / [1−,4+]
• X < 8.25: [4−,3+] / [1−,1+]
• X < 8.75: [5−,3+] / [0−,1+]
(When X = 9 is left out, the threshold 9.25 replaces 8.75.) X < 5 wins once more, and the left-out (positive) point is classified + ⇒ correct.
LOOCV error for DT2: 1/10
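The DT2 leave-one-out loop can also be simulated; the sketch below (assumed helper names, majority-vote leaves) recovers the 1/10 LOOCV error.

```python
from math import log2

data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1),
        (7, 1), (8, 1), (8.5, 0), (9, 1), (10, 1)]

def H(rows):
    n = len(rows)
    if n == 0:
        return 0.0
    p = sum(y for _, y in rows) / n
    return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)

def dt2(rows):
    """One split at the midpoint threshold with minimal mean
    conditional entropy; each side predicts its majority class."""
    xs = sorted({x for x, _ in rows})
    mids = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    def h_cond(t):
        left = [r for r in rows if r[0] < t]
        right = [r for r in rows if r[0] >= t]
        return (len(left) / len(rows) * H(left)
                + len(right) / len(rows) * H(right))
    t = min(mids, key=h_cond)
    maj = lambda ys: max(set(ys), key=ys.count)
    lm = maj([y for x, y in rows if x < t])
    rm = maj([y for x, y in rows if x >= t])
    return lambda x: lm if x < t else rm

errors = sum(dt2(data[:i] + data[i + 1:])(x) != y
             for i, (x, y) in enumerate(data))
loocv_error = errors / len(data)
```

Only the left-out point X = 8.5 is misclassified, giving a LOOCV error of 1/10.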
45.
Applying ID3 on a dataset with two continuous attributes:
decision zones
Liviu Ciortuz, 2017
46.
Consider the training dataset in the nearby figure. X1 and X2 are considered continuous attributes.
Apply the ID3 algorithm on this dataset. Draw the resulting decision tree.
Make a graphical representation of the decision areas and decision boundaries determined by ID3.
[Figure: a training dataset of nine points ([4+,5−]) in the (X1, X2) plane, with X1 ∈ [0, 5] and X2 ∈ [0, 4].]
47.
Solution
Level 1 (root [4+,5−], H(4/9) = 0.9911): the candidate splits are X1 < 5/2, X1 < 9/2, X2 < 3/2, X2 < 5/2 and X2 < 7/2. The three best ones:
• X1 < 5/2: Y → [2+,0−] (H = 0), N → [2+,5−] (H(2/7)); H[Y|·] = (7/9)·H(2/7); IG = 0.319
• X2 < 5/2: Y → [3+,2−] (H(2/5)), N → [1+,3−] (H(1/4)); H[Y|·] = (5/9)·H(2/5) + (4/9)·H(1/4); IG = 0.091
• X2 < 7/2: Y → [4+,2−] (H(1/3)), N → [0+,3−] (H = 0); H[Y|·] = (2/3)·H(1/3); IG = 0.378
The split X2 < 7/2 has the largest information gain, so it is placed in the root.
48.
Level 2 (node for X2 < 7/2, [4+,2−], H(1/3) = 0.9183). Candidate splits:
• X1 < 5/2: Y → [2+,0−] (H = 0), N → [2+,2−] (H = 1); H[Y|·] = 2/3; IG = 0.251
• X1 < 4: Y → [2+,2−] (H = 1), N → [2+,0−] (H = 0); H[Y|·] = 2/3; IG = 0.251
• X2 < 3/2: Y → [1+,1−] (H = 1), N → [3+,1−] (H(1/4)); H[Y|·] = 1/3 + (2/3)·H(1/4); IG = 0.04
• X2 < 5/2: Y → [3+,2−] (H(2/5)), N → [1+,0−] (H = 0); H[Y|·] = (5/6)·H(2/5); IG = 0.109
Notes:
1. Split thresholds for continuous attributes must be recomputed at each new iteration, because they may change. (For instance, here above, 4 replaces 4.5 as a threshold for X1.)
2. In the current stage, i.e., for the current node in the ID3 tree, you may choose (as test) either X1 < 5/2 or X1 < 4.
3. Here above we have an example of a reversed relationship between the weighted and un-weighted specific entropies: H[2+,2−] > H[3+,2−], but (4/6)·H[2+,2−] < (5/6)·H[3+,2−].
49.
The final decision tree:
Root [4+,5−]: X2 < 7/2?
• N → [0+,3−]: −
• Y → [4+,2−]: X1 < 5/2? Y → [2+,0−]: +; N → [2+,2−]: X1 < 4? Y → [0+,2−]: −; N → [2+,0−]: +
Decision areas: the plane is partitioned by the lines X2 = 7/2, X1 = 5/2 and X1 = 4; the region above X2 = 7/2 is labelled −, and below it the strips X1 < 5/2, 5/2 ≤ X1 < 4 and X1 ≥ 4 are labelled +, − and +, respectively.
50.
Other criteria than IG for
the best attribute selection in ID3:
Gini impurity / index and Misclassification impurity
CMU, 2003 fall, T. Mitchell, A. Moore, HW1, pr. 4
51.
Entropy is a natural measure to quantify the impurity of a data set. The Decision Tree learning algorithm uses entropy as a splitting criterion by calculating the information gain to decide the next attribute to partition the current node.
However, there are other impurity measures that could be used as the splitting criteria too. Let's investigate two of them.
Assume the current node n has k classes c1, c2, . . . , ck.
• Gini Impurity: i(n) = 1 − Σ_{i=1}^k P²(c_i).
• Misclassification Impurity: i(n) = 1 − max_{i=1,...,k} P(c_i).
a. Assume node n has two classes, c1 and c2. Please draw a figure in which the three impurity measures (Entropy, Gini and Misclassification) are represented as functions of P(c1).
52.
Answer
Entropy(p) = −p·log2 p − (1 − p)·log2(1 − p)
Gini(p) = 1 − p² − (1 − p)² = 2p(1 − p)
MisClassif(p) = p, for p ∈ [0, 1/2); 1 − p, for p ∈ [1/2, 1]
[Figure: the three curves plotted over p ∈ [0, 1]; all three vanish at p = 0 and p = 1 and peak at p = 1/2, where Entropy = 1 and Gini = MisClassif = 1/2.]
53.
b. Now we can define new splitting criteria based on the Gini and Misclassification impurities, called Drop-of-Impurity in some of the literature. That is the difference between the impurity of the current node and the weighted sum of the impurities of its children.
For binary category splits, the Drop-of-Impurity is defined as
Δi(n) = i(n) − P(nl)·i(nl) − P(nr)·i(nr),
where nl and nr are the left and, respectively, the right child of node n after splitting.
Please calculate the Drop-of-Impurity (using both the Gini and the Misclassification Impurity) for the following example dataset, in which C is the class variable to be predicted.

A: a1 a1 a1 a2 a2 a2
C: c1 c1 c2 c2 c2 c2
54.
Answer
[Decision stump: root (node 0) [2+,4−], test on A; a1 → node 1 [2+,1−]; a2 → node 2 [0+,3−].]
Gini: p = 2/6 = 1/3 ⇒
i(0) = 2·(1/3)·(1 − 1/3) = (2/3)·(2/3) = 4/9
i(1) = 2·(2/3)·(1 − 2/3) = (4/3)·(1/3) = 4/9, i(2) = 0
⇒ Δi(0) = 4/9 − (3/6)·(4/9) = 4/9 − 2/9 = 2/9.
Misclassification: p = 1/3 < 1/2 ⇒
i(0) = p = 1/3
i(1) = 1 − 2/3 = 1/3, i(2) = 0
⇒ Δi(0) = 1/3 − (1/2)·(1/3) = 1/6.
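Both drops can be verified with a few lines of code (a sketch with function names of my choosing):

```python
def gini(pos, neg):
    n = pos + neg
    p = pos / n
    return 2 * p * (1 - p)            # = 1 - p^2 - (1-p)^2

def misclass(pos, neg):
    n = pos + neg
    return min(pos, neg) / n          # = 1 - max(P(c1), P(c2))

def drop(i, parent, left, right):
    """Drop-of-Impurity: i(n) - P(nl) i(nl) - P(nr) i(nr)."""
    n = sum(parent)
    return (i(*parent) - sum(left) / n * i(*left)
            - sum(right) / n * i(*right))

d_gini = drop(gini, (2, 4), (2, 1), (0, 3))      # expected 2/9
d_mis = drop(misclass, (2, 4), (2, 1), (0, 3))   # expected 1/6
```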
55.
c. We choose the attribute that maximizes the Drop-of-Impurity to split a node. Please create a dataset and show that on this dataset the Misclassification-Impurity-based Δi(n) cannot determine which attribute should be used for splitting (e.g., Δi(n) = 0 for all the attributes), but Information Gain and the Gini-Impurity-based Δi(n) can.
Answer
A: a1 a1 a1 a2 a2 a2 a2
C: c1 c2 c2 c2 c2 c2 c1
Entropy: Δi(0) = H[5+, 2−] − ( (3/7)·H[2+, 1−] + (4/7)·H[3+, 1−] ) = 0.006 ≠ 0;
Gini: Δi(0) = 2·{ (2/7)·(1 − 2/7) − [ (3/7)·(1/3)·(1 − 1/3) + (4/7)·(1/4)·(1 − 1/4) ] } = 2·{ 10/49 − [ 2/21 + 3/28 ] } = 2·( 10/49 − 17/84 ) ≠ 0;
Misclassification: Δi(0) = 2/7 − ( (3/7)·(1/3) + (4/7)·(1/4) ) = 0.
56.
Note: A [quite bad] property
If C1 < C2, C1^l < C2^l and C1^r < C2^r (with C1 = C1^l + C1^r and C2 = C2^l + C2^r), then the Drop-of-Impurity based on Misclassification is 0.
[Decision stump: root [C1+, C2−], test on A; a1 → [C1^l+, C2^l−]; a2 → [C1^r+, C2^r−].]
Proof:
Δi(n) = C1/(C1+C2) − ( (C1^l+C2^l)/(C1+C2) · C1^l/(C1^l+C2^l) + (C1^r+C2^r)/(C1+C2) · C1^r/(C1^r+C2^r) )
= C1/(C1+C2) − (C1^l + C1^r)/(C1+C2) = C1/(C1+C2) − C1/(C1+C2) = 0.
57.
Exemplifying
pre- and post-pruning of decision trees
using a threshold for the Information Gain
CMU, 2006 spring, Carlos Guestrin, midterm, pr. 4
[adapted by Liviu Ciortuz]
58.
Starting from the data in the following table, the ID3 algorithm builds the decision tree shown nearby.
V W X Y
0 0 0 0
0 1 0 1
1 0 0 1
1 1 0 0
1 1 1 1

[ID3 tree: root X; X = 1 → leaf 1; X = 0 → V; V = 0 → W (W = 0 → 0, W = 1 → 1); V = 1 → W (W = 0 → 1, W = 1 → 0).]
a. One idea for pruning such a decision tree would be to start at the root, and prune splits for which the information gain (or some other criterion) is less than some small ε. This is called top-down pruning. What is the decision tree returned for ε = 0.0001? What is the training set error for this tree?
59.
Answer
We will first augment the given decision tree with information regarding the data partitions (i.e., the numbers of positive and negative instances) which were assigned to each test node during the application of the ID3 algorithm:
[root X [3+;2−]; X = 1 → [1+;0−]; X = 0 → V [2+;2−]; V = 0 → W [1+;1−] (W = 0 → [0+;1−], W = 1 → [1+;0−]); V = 1 → W [1+;1−] (W = 0 → [1+;0−], W = 1 → [0+;1−])]
The information gain yielded by the attribute X in the root node is:
H[3+; 2−] − 1/5 · 0 − 4/5 · 1 = 0.971 − 0.8 = 0.171 > ε.
Therefore, this node will not be eliminated from the tree.
The information gain for the attribute V (in the left-hand child of the root node) is:
H[2+; 2−] − 1/2 · 1 − 1/2 · 1 = 1 − 1 = 0 < ε.
So the whole left subtree will be cut off and replaced by a decision leaf, yielding the tree [X: 0 → 0, 1 → 1]. The training error produced by this tree is 2/5.
60.
b. Another option would be to start at the leaves, and prune subtrees for which the information gain (or some other criterion) of a split is less than some small ε. In this method, no ancestors of children with high information gain will get pruned. This is called bottom-up pruning. What is the tree returned for ε = 0.0001? What is the training set error for this tree?
Answer:
The information gain of V is IG(Y;V) = 0. A step later, the information gain of W (for either one of the descendant nodes of V) is IG(Y;W) = 1. So bottom-up pruning won't delete any nodes, and the tree [given in the problem statement] remains unchanged.
The training error is 0.
61.
c. Discuss when you would want to choose bottom-up pruningover top-down pruning and vice versa.
Answer:
Top-down pruning is computationally cheaper. When building the tree we can determine when to stop (no need for real pruning). But, as we saw, top-down pruning may prune too much.
On the other hand, bottom-up pruning is more expensive, since we have to first build a full tree (which can be exponentially large) and then apply pruning. The second problem with bottom-up pruning is that superfluous attributes may fool it (see CMU, 2009 fall, Carlos Guestrin, HW1, pr. 2.4). The third problem with it is that in the lower levels of the tree the number of examples in the subtree gets smaller, so the information gain might be an inappropriate criterion for pruning; one would usually use a statistical test instead.
62.
Exemplifying
χ2-Based Pruning of Decision Trees
CMU, 2010 fall, Ziv Bar-Joseph, HW2, pr. 2.1
63.
In class, we learned a decision tree pruning algorithm that iteratively visited subtrees and used a validation dataset to decide whether to remove the subtree. However, sometimes it is desirable to prune the tree after training on all of the available data.
One such approach is based on statistical hypothesis testing.
After learning the tree, we visit each internal node and test whether the attribute split at that node is actually uncorrelated with the class labels.
We hypothesize that the attribute is independent and then use Pearson's chi-square test to generate a test statistic that may provide evidence that we should reject this "null" hypothesis. If we fail to reject the hypothesis, we prune the subtree at that node.
64.
a. At each internal node we can create a contingency table for the training examples that pass through that node on their paths to the leaves. The table will have the c class labels associated with the columns and the r values of the split attribute associated with the rows.
Each entry O_{i,j} in the table is the number of times we observe a training sample with that attribute value and label, where i is the row index that corresponds to an attribute value and j is the column index that corresponds to a class label.
In order to calculate the chi-square test statistic, we need a similar table of expected counts. The expected count is the number of observations we would expect if the class and the attribute are independent.
Derive a formula for each expected count E_{i,j} in the table.
Hint: What is the probability that a training example that passes through the node has a particular label? Using this probability and the independence assumption, what can you say about how many examples with a specific attribute value are expected to also have the class label?
65.
b. Given these two tables for the split, you can now calculate the chi-square test statistic
χ² = Σ_{i=1}^r Σ_{j=1}^c (O_{i,j} − E_{i,j})² / E_{i,j}
with (r − 1)·(c − 1) degrees of freedom.
You can plug the test statistic and the degrees of freedom into a software package^a or an online calculator^b to calculate a p-value. Typically, if p < 0.05 we reject the null hypothesis that the attribute and the class are independent, and say the split is statistically significant.
The decision tree given on the next slide was built from the data in the nearby table. For each of the 3 internal nodes in the decision tree, show the p-value for the split and state whether it is statistically significant. How many internal nodes will the tree have if we prune splits with p ≥ 0.05?
^a Use 1-chi2cdf(x,df) in MATLAB or CHIDIST(x,df) in Excel.
^b https://en.m.wikipedia.org/wiki/Chi-square distribution (table of χ² value vs p-value)
66.
Input:
X1 X2 X3 X4 Class
1 1 0 0 0
1 0 1 0 1
0 1 0 0 0
1 0 1 1 1
0 1 1 1 1
0 0 1 0 0
1 0 0 0 1
0 1 0 1 1
1 0 0 1 1
1 1 0 1 1
1 1 1 1 1
0 0 0 0 0

[ID3 tree: root X4 [4−,8+]; X4 = 1 → [0−,6+]: 1; X4 = 0 → X1 [4−,2+]; X1 = 0 → [3−,0+]: 0; X1 = 1 → X2 [1−,2+]; X2 = 0 → [0−,2+]: 1; X2 = 1 → [1−,0+]: 0.]
67.
Idea
While traversing the ID3 tree [usually in a bottom-up manner], remove the nodes for which there is not enough ("significant") statistical evidence that there is a dependence between the values of the input attribute tested in that node and the values of the output attribute (the labels), as supported by the set of instances assigned to that node.
68.
Contingency tables

O_{X4}        Class = 0   Class = 1
X4 = 0            4           2
X4 = 1            0           6

(N = 12) ⇒ P(X4 = 0) = 6/12 = 1/2, P(X4 = 1) = 1/2;
P(Class = 0) = 4/12 = 1/3, P(Class = 1) = 2/3

O_{X1|X4=0}   Class = 0   Class = 1
X1 = 0            3           0
X1 = 1            1           2

(N = 6) ⇒ P(X1 = 0 | X4 = 0) = 3/6 = 1/2, P(X1 = 1 | X4 = 0) = 1/2;
P(Class = 0 | X4 = 0) = 4/6 = 2/3, P(Class = 1 | X4 = 0) = 1/3

O_{X2|X4=0,X1=1}   Class = 0   Class = 1
X2 = 0                 0           2
X2 = 1                 1           0

(N = 3) ⇒ P(X2 = 0 | X4 = 0, X1 = 1) = 2/3, P(X2 = 1 | X4 = 0, X1 = 1) = 1/3;
P(Class = 0 | X4 = 0, X1 = 1) = 1/3, P(Class = 1 | X4 = 0, X1 = 1) = 2/3
69.
The reasoning that leads to the computation of the expected number of observations
Under the independence ("null") hypothesis, P(A = i, C = j) = P(A = i) · P(C = j), with
P(A = i) = (Σ_{k=1}^c O_{i,k}) / N and P(C = j) = (Σ_{k=1}^r O_{k,j}) / N,
hence P(A = i, C = j) = (Σ_{k=1}^c O_{i,k}) · (Σ_{k=1}^r O_{k,j}) / N²
and E_{i,j} = N · P(A = i, C = j).
70.
Expected numbers of observations

E_{X4}        Class = 0   Class = 1
X4 = 0            2           4
X4 = 1            2           4

E_{X1|X4=0}   Class = 0   Class = 1
X1 = 0            2           1
X1 = 1            2           1

E_{X2|X4=0,X1=1}   Class = 0   Class = 1
X2 = 0               2/3         4/3
X2 = 1               1/3         2/3

For example, for E_{X4}(X4 = 0, Class = 0): N = 12, P(X4 = 0) = 1/2 and P(Class = 0) = 1/3 ⇒
N · P(X4 = 0, Class = 0) = N · P(X4 = 0) · P(Class = 0) = 12 · (1/2) · (1/3) = 2.
71.
χ² statistics
χ² = Σ_{i=1}^r Σ_{j=1}^c (O_{i,j} − E_{i,j})² / E_{i,j}
χ²_{X4} = (4−2)²/2 + (2−4)²/4 + (0−2)²/2 + (6−4)²/4 = 2 + 1 + 2 + 1 = 6
χ²_{X1|X4=0} = (3−2)²/2 + (0−1)²/1 + (1−2)²/2 + (2−1)²/1 = 3
χ²_{X2|X4=0,X1=1} = (0−2/3)²/(2/3) + (2−4/3)²/(4/3) + (1−1/3)²/(1/3) + (0−2/3)²/(2/3) = (4/9)·(27/4) = 3
p-values (each test has (2−1)·(2−1) = 1 degree of freedom): 0.0143, 0.0833 and 0.0833, respectively.
Only the first of these p-values is smaller than 0.05, therefore the root node (X4) cannot be pruned, while the X1 and X2 splits are pruned.
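For 2×2 contingency tables the test has one degree of freedom, and the p-value has the closed form erfc(√(χ²/2)), so the three tests can be redone without statistical tables (an illustrative sketch):

```python
from math import erfc, sqrt

def chi2_stat(observed, expected):
    """Pearson's chi-square statistic over flattened O and E tables."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_df1(x):
    # survival function of the chi-square distribution with df = 1
    return erfc(sqrt(x / 2))

chi2_x4 = chi2_stat([4, 2, 0, 6], [2, 4, 2, 4])
chi2_x1 = chi2_stat([3, 0, 1, 2], [2, 1, 2, 1])
chi2_x2 = chi2_stat([0, 2, 1, 0], [2/3, 4/3, 1/3, 2/3])
```

The statistics come out as 6, 3 and 3, with p-values ≈ 0.0143, 0.0833 and 0.0833.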
72.
[Figure: the p-value as a function of Pearson's cumulative test statistic χ², plotted for k = 1, 2, 3, 4, 6 and 9 degrees of freedom.]
73.
Output (pruned tree) for the 95% confidence level:
[The tree is reduced to the root test X4: X4 = 0 → 0, X4 = 1 → 1.]
74.
The AdaBoost algorithm:
why it was designed the way it was, and
the convergence of the training error, under certain conditions
CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5
CMU, 2009 fall, Carlos Guestrin, HW2, pr. 3.1
CMU, 2009 fall, Eric Xing, HW3, pr. 4.2.2
75.
Consider m training examples S = {(x1, y1), . . . , (xm, ym)}, where x ∈ X and y ∈ {−1, 1}. Suppose we have a weak learning algorithm A which produces a hypothesis h : X → {−1, 1} given any distribution D of examples.

AdaBoost is an iterative algorithm which works as follows:

• Begin with a uniform distribution D1(i) = 1/m, i = 1, . . . , m.

• At each iteration t = 1, . . . , T,

  • run the weak learning algorithm A on the distribution Dt and produce the hypothesis ht;

    Note (1): Since A is a weak learning algorithm, the hypothesis ht produced at round t is only slightly better than random guessing, say, by a margin γt:

    εt = errDt(ht) = Pr_{x∼Dt}[y ≠ ht(x)] = 1/2 − γt.

    Note (2): If at a certain iteration t < T the weak classifier A cannot produce a hypothesis better than random guessing (i.e., γt = 0), or it produces a hypothesis for which εt = 0, then the AdaBoost algorithm should be stopped.

  • update the distribution

    Dt+1(i) = (1/Zt) · Dt(i) · e^{−αt yi ht(xi)}   for i = 1, . . . , m,   (2)

    where αt := (1/2) ln((1 − εt)/εt), and Zt is the normalizer.

• In the end, deliver HT = sign(∑_{t=1}^{T} αt ht) as the learned hypothesis, which will act as a weighted majority vote.
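The loop above can be sketched in code. The following is a minimal illustration (our own function names), using decision stumps over a single real attribute as the weak learners; it is a sketch of the pseudo-code, not the exact formulation used in the exercises below:

```python
import math

def stump_error(s, d, xs, ys, D):
    """Weighted error of the stump sign(d * (x - s)) under the distribution D."""
    return sum(w for x, y, w in zip(xs, ys, D)
               if (1 if d * (x - s) >= 0 else -1) != y)

def adaboost(xs, ys, T):
    m = len(xs)
    D = [1.0 / m] * m
    # candidate thresholds: the data points themselves plus one "outside" threshold
    thresholds = sorted(set(xs)) + [max(xs) + 1]
    H = []  # list of (alpha, s, d) triples
    for _ in range(T):
        # weak learner A: the stump with minimal weighted error under D
        eps, s, d = min((stump_error(s, d, xs, ys, D), s, d)
                        for s in thresholds for d in (+1, -1))
        if eps == 0:
            H.append((1.0, s, d))   # a perfect stump: stop (Note (2))
            break
        if eps >= 0.5:
            break                   # nothing better than random guessing (Note (2))
        alpha = 0.5 * math.log((1 - eps) / eps)
        H.append((alpha, s, d))
        # re-weighting step, relation (2): misclassified points get larger weight
        D = [w * math.exp(-alpha * y * (1 if d * (x - s) >= 0 else -1))
             for x, y, w in zip(xs, ys, D)]
        Z = sum(D)
        D = [w / Z for w in D]

    def predict(x):
        vote = sum(a * (1 if d * (x - s) >= 0 else -1) for a, s, d in H)
        return 1 if vote >= 0 else -1
    return predict
```

On a linearly separable 1-D dataset a single stump suffices, and `adaboost` stops after the first round.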
76.
We will prove that the training error errS(HT) of AdaBoost decreases at a very fast rate, and in certain cases it converges to 0.

Important Remark

The above formulation of the AdaBoost algorithm states no restriction on the hypothesis ht delivered by the weak classifier A at iteration t, except that εt < 1/2.

However, in another formulation of the AdaBoost algorithm (in a more general setup; see for instance MIT, 2006 fall, Tommi Jaakkola, HW4, problem 3), it is requested / recommended that the hypothesis ht be chosen by (approximately) minimizing the weighted training error over a whole class of hypotheses such as, for instance, decision trees of depth 1 (decision stumps).

In this problem we will not be concerned with such a request, but we will comply with it, for instance, in problem CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.6, when showing how AdaBoost works in practice.
77.
a. Prove the following relationships:

i.  Zt = e^{−αt} · (1 − εt) + e^{αt} · εt

ii. Zt = 2√(εt(1 − εt))   (a consequence derivable from i.)

iii. 0 < Zt < 1   (a consequence derivable from ii.)

iv. Dt+1(i) = Dt(i) / (2εt) for i ∈ M := {i | yi ≠ ht(xi)}, i.e., the mistake set, and
    Dt+1(i) = Dt(i) / (2(1 − εt)) for i ∈ C := {i | yi = ht(xi)}, i.e., the correct set
    (a consequence derivable from (2) and ii.)

v.  εi > εj ⇒ αi < αj

vi. errDt+1(ht) = (1/Zt) · e^{αt} · εt, where errDt+1(ht) := PrDt+1({xi | ht(xi) ≠ yi})

vii. errDt+1(ht) = 1/2   (a consequence derivable from ii. and vi.)
78.
Solution

a/i. Since Zt is the normalization factor for the distribution Dt+1, we can write:

Zt = ∑_{i=1}^{m} Dt(i) e^{−αt yi ht(xi)} = ∑_{i∈C} Dt(i) e^{−αt} + ∑_{i∈M} Dt(i) e^{αt}
   = (1 − εt) · e^{−αt} + εt · e^{αt}.   (3)

a/ii. Since αt := (1/2) ln((1 − εt)/εt), it follows that

e^{αt} = e^{(1/2) ln((1−εt)/εt)} = e^{ln √((1−εt)/εt)} = √((1 − εt)/εt)   (4)

and

e^{−αt} = 1/e^{αt} = √(εt/(1 − εt)).   (5)

So,

Zt = (1 − εt) · √(εt/(1 − εt)) + εt · √((1 − εt)/εt) = 2√(εt(1 − εt)).   (6)

Note that (1 − εt)/εt > 1 because εt ∈ (0, 1/2); therefore αt > 0.
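Relations (3) and (6), together with property iii., can be verified numerically. This is only a quick sanity check (our own function names), not part of the proof:

```python
import math

def Z_direct(eps):
    """Relation (3): Z_t = (1 - eps_t) e^{-alpha_t} + eps_t e^{alpha_t}."""
    alpha = 0.5 * math.log((1 - eps) / eps)
    return (1 - eps) * math.exp(-alpha) + eps * math.exp(alpha)

def Z_closed(eps):
    """Relation (6): Z_t = 2 sqrt(eps_t (1 - eps_t))."""
    return 2 * math.sqrt(eps * (1 - eps))

for eps in (0.1, 0.25, 0.4):
    assert abs(Z_direct(eps) - Z_closed(eps)) < 1e-12
    assert 0 < Z_closed(eps) < 1     # property iii., since eps in (0, 1/2)
```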
79.
a/iii.

The second order function εt(1 − εt) reaches its maximum value for εt = 1/2, and that maximum is 1/4. Since εt ∈ (0, 1/2), it follows from (6) that Zt > 0 and Zt < 2√(1/4) = 1.

a/iv. Based on (2), we can write:

Dt+1(i) = (1/Zt) · Dt(i) · e^{αt} for i ∈ M, and Dt+1(i) = (1/Zt) · Dt(i) · e^{−αt} for i ∈ C.

Therefore, using (4) and (5),

i ∈ M ⇒ Dt+1(i) = (1/Zt) · Dt(i) · e^{αt} = (1/(2√(εt(1 − εt)))) · Dt(i) · √((1 − εt)/εt) = Dt(i)/(2εt)

i ∈ C ⇒ Dt+1(i) = (1/Zt) · Dt(i) · e^{−αt} = (1/(2√(εt(1 − εt)))) · Dt(i) · √(εt/(1 − εt)) = Dt(i)/(2(1 − εt)).
80.
a/v. Starting from the definition αt = ln √((1 − εt)/εt), we can write:

αi < αj ⇔ ln √((1 − εi)/εi) < ln √((1 − εj)/εj).

Further on, since both ln and √ are strictly increasing functions, it follows that

αi < αj ⇔ (1 − εi)/εi < (1 − εj)/εj
        ⇔ εj(1 − εi) < εi(1 − εj)   (since εi, εj > 0)
        ⇔ εj − εiεj < εi − εiεj ⇔ εi > εj.

a/vi. It is easy to see that

errDt+1(ht) = ∑_{i=1}^{m} Dt+1(i) · 1{yi ≠ ht(xi)} = ∑_{i∈M} (1/Zt) Dt(i) e^{αt}
            = (1/Zt) (∑_{i∈M} Dt(i)) e^{αt} = (1/Zt) · εt · e^{αt}.   (7)

a/vii. By substituting (6) and (4) into (7), we get:

errDt+1(ht) = (1/Zt) · εt · e^{αt} = (1/(2√(εt(1 − εt)))) · εt · √((1 − εt)/εt) = 1/2.
81.
b. Show that DT+1(i) = (m · ∏_{t=1}^{T} Zt)^{−1} e^{−yi f(xi)}, where f(x) = ∑_{t=1}^{T} αt ht(x).

c. Show that errS(HT) ≤ ∏_{t=1}^{T} Zt, where errS(HT) := (1/m) ∑_{i=1}^{m} 1{HT(xi) ≠ yi} is the training error produced by AdaBoost.

d. Obviously, we would like to minimize the test set error produced by AdaBoost, but it is hard to do so directly. We thus settle for greedily optimizing the upper bound on the training error found at part c. Observe that Z1, . . . , Zt−1 are determined by the first t − 1 iterations, and we cannot change them at iteration t. A greedy step we can take to minimize the training set error bound on round t is to minimize Zt. Prove that the value of αt that minimizes Zt (among all possible values for αt) is indeed αt = (1/2) ln((1 − εt)/εt) (see the previous slide).

e. Show that ∏_{t=1}^{T} Zt ≤ e^{−2 ∑_{t=1}^{T} γt²}.

f. From parts c and e, we know the training error decreases at an exponential rate with respect to T. Assume that there is a number γ > 0 such that γ ≤ γt for t = 1, . . . , T. (This γ is called a guarantee of empirical γ-weak learnability.) How many rounds are needed to achieve a training error ε > 0? Please express in big-O notation, T = O(·).
82.
Solution

b. We will expand DT+1(i) recursively:

DT+1(i) = (1/ZT) DT(i) e^{−αT yi hT(xi)}
        = DT−1(i) (1/ZT−1) e^{−αT−1 yi hT−1(xi)} · (1/ZT) e^{−αT yi hT(xi)}
        . . .
        = D1(i) (1/∏_{t=1}^{T} Zt) e^{−∑_{t=1}^{T} αt yi ht(xi)}
        = (1/(m · ∏_{t=1}^{T} Zt)) e^{−yi f(xi)}.
83.
c. We will make use of the fact that the exponential loss function upper bounds the 0-1 loss function, i.e. 1{x<0} ≤ e^{−x}:

errS(HT) = (1/m) ∑_{i=1}^{m} 1{yi f(xi) < 0}
         ≤ (1/m) ∑_{i=1}^{m} e^{−yi f(xi)}
         = (1/m) ∑_{i=1}^{m} DT+1(i) · m · ∏_{t=1}^{T} Zt   (by part b)
         = (∑_{i=1}^{m} DT+1(i)) · (∏_{t=1}^{T} Zt) = ∏_{t=1}^{T} Zt,

since ∑_{i=1}^{m} DT+1(i) = 1.
84.
d. We will start from the equation

Zt = εt · e^{αt} + (1 − εt) · e^{−αt},

which has been proven at part a. Note that εt (the error produced by ht, the hypothesis delivered by the weak classifier A at the current step) is constant with respect to αt. We then proceed as usual, setting the partial derivative w.r.t. αt to zero:

∂/∂αt (εt · e^{αt} + (1 − εt) · e^{−αt}) = 0 ⇔ εt · e^{αt} − (1 − εt) · e^{−αt} = 0
⇔ εt · (e^{αt})² = 1 − εt ⇔ e^{2αt} = (1 − εt)/εt
⇔ 2αt = ln((1 − εt)/εt) ⇔ αt = (1/2) ln((1 − εt)/εt).

Note that (1 − εt)/εt > 1 (and therefore αt > 0) because εt ∈ (0, 1/2).

It can also be immediately shown that αt = (1/2) ln((1 − εt)/εt) is indeed the value at which the expression εt · e^{αt} + (1 − εt) · e^{−αt}, and therefore Zt too, reaches its minimum: the derivative is positive exactly to the right of this point, since

εt · e^{αt} − (1 − εt) · e^{−αt} > 0 ⇔ e^{2αt} > (1 − εt)/εt ⇔ αt > (1/2) ln((1 − εt)/εt).
85.
Plots of three Z(β) functions,

Z(β) = εt · β + (1 − εt) · (1/β),

where β := e^{α} (α being free(!) here) and εt is fixed.

It follows that

β_min = √((1 − εt)/εt),
Z(β_min) = . . . = 2√(εt(1 − εt)),
α_min = ln β_min = ln √((1 − εt)/εt).

[Figure: Z(β) plotted for εt = 1/4, εt = 1/10 and εt = 2/5.]
86.
e. Making use of relationship (6) proven at part a, and of the fact that 1 − x ≤ e^{−x} for all x ∈ R, we can write:

∏_{t=1}^{T} Zt = ∏_{t=1}^{T} 2√(εt(1 − εt))
             = ∏_{t=1}^{T} 2√((1/2 − γt)(1 − (1/2 − γt)))
             = ∏_{t=1}^{T} √(1 − 4γt²)
             ≤ ∏_{t=1}^{T} √(e^{−4γt²}) = ∏_{t=1}^{T} e^{−2γt²} = e^{−2 ∑_{t=1}^{T} γt²}.
87.
f. From the results obtained at parts c and e, we get:

errS(HT) ≤ e^{−2 ∑_{t=1}^{T} γt²} ≤ e^{−2Tγ²} = (e^{−2γ²})^T.

Therefore,

errS(HT) < ε if −2Tγ² < ln ε ⇔ 2Tγ² > −ln ε ⇔ 2Tγ² > ln(1/ε) ⇔ T > (1/(2γ²)) ln(1/ε).

Hence we need T = O((1/γ²) ln(1/ε)) rounds.
Note: It follows that errS(HT ) → 0 as T → ∞.
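The bound T > (1/(2γ²)) ln(1/ε) can be wrapped in a small helper (our own naming; a sketch):

```python
import math

def rounds_needed(gamma, eps):
    """Smallest integer T with T > ln(1/eps) / (2 * gamma**2),
    which guarantees errS(HT) <= exp(-2*T*gamma^2) < eps."""
    return math.floor(math.log(1 / eps) / (2 * gamma ** 2)) + 1

# e.g. gamma = 0.1, target training error eps = 0.01
T = rounds_needed(0.1, 0.01)   # → 231
```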
88.
Exemplifying the application of AdaBoost algorithm
CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.6
89.
Consider the training dataset in the nearby figure. Run T = 3 iterations of AdaBoost with decision stumps (axis-aligned separators) as the base learners. Illustrate the learned weak hypotheses ht in this figure and fill in the table given below.

(For the pseudo-code of the AdaBoost algorithm, see CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5. Please read the Important Remark that follows that pseudo-code!)

[Figure: nine training instances x1, . . . , x9 in the plane (X1, X2), with X1 ∈ [0, 5] and X2 ∈ [0, 4].]

t | εt | αt | Dt(1) | Dt(2) | Dt(3) | Dt(4) | Dt(5) | Dt(6) | Dt(7) | Dt(8) | Dt(9) | errS(H)
1 |
2 |
3 |

Note: The goal of this exercise is to help you understand how AdaBoost works in practice. It is advisable that, after understanding this exercise, you implement a program / function that calculates the weighted training error produced by a given decision stump, w.r.t. a certain probability distribution (D) defined on the training dataset. Later on you can extend this program into a full-fledged implementation of AdaBoost.
90.
Solution

Unlike the graphical representation that we used until now for decision stumps (as trees of depth 1), here we will work with the following analytical representation: for a continuous attribute X taking values x ∈ R and for any threshold s ∈ R, we can define two decision stumps:

sign(x − s) = 1 if x ≥ s, −1 if x < s;   and   sign(s − x) = −1 if x ≥ s, 1 if x < s.

For convenience, in the sequel we will denote the first decision stump by X ≥ s and the second by X < s.
According to the Important Remark that follows the AdaBoost pseudo-code [see CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5], at each iteration (t) the weak algorithm A selects a decision stump which, among all decision stumps, has the minimum weighted training error w.r.t. the current distribution (Dt) on the training data.
91.
Notes
When applying the ID3 algorithm, for each continuous attribute X we used a threshold for each pair of examples (xi, yi), (xi+1, yi+1) with yi yi+1 < 0 such that xi < xi+1, and no xj ∈ Val(X) for which xi < xj < xi+1.

We will proceed similarly when applying AdaBoost with decision stumps and continuous attributes.

[In the case of the ID3 algorithm, there is a theoretical result stating that there is no need to consider other thresholds for a continuous attribute X apart from those situated between pairs of successive values (xi < xi+1) having opposite labels (yi ≠ yi+1), because the Information Gain (IG) for the other thresholds (xi < xi+1, with yi = yi+1) is provably less than the maximal IG for X.

LC: A similar result can be proven, which allows us to simplify the application of the weak classifier (A) in the framework of the AdaBoost algorithm.]

Moreover, we will also consider a threshold from outside the interval of values taken by the attribute X in the training dataset. [The decision stumps corresponding to this "outside" threshold can be associated with the decision trees of depth 0 that we met in other problems.]
92.
Iteration t = 1:

At this stage (i.e., the first iteration of AdaBoost) the thresholds for the two continuous variables (X1 and X2), corresponding to the two coordinates of the training instances (x1, . . . , x9), are

• 1/2, 5/2, and 9/2 for X1, and

• 1/2, 3/2, 5/2 and 7/2 for X2.

One can easily see that we can get rid of the "outside" threshold 1/2 for X2, because the decision stumps corresponding to this threshold act in the same way as the decision stumps associated to the "outside" threshold 1/2 for X1.

The decision stumps corresponding to this iteration, together with their associated weighted training errors, are shown on the next slide. When filling in those tables, we have used the equalities errDt(X1 ≥ s) = 1 − errDt(X1 < s) and, similarly, errDt(X2 ≥ s) = 1 − errDt(X2 < s), for any threshold s and every iteration t = 1, 2, . . .. These equalities are easy to prove.
93.
s                  1/2    5/2    9/2
errD1(X1 < s)      4/9    2/9    4/9 + 2/9 = 2/3
errD1(X1 ≥ s)      5/9    7/9    1/3

s                  1/2    3/2                5/2                7/2
errD1(X2 < s)      4/9    1/9 + 3/9 = 4/9    2/9 + 1/9 = 1/3    2/9
errD1(X2 ≥ s)      5/9    5/9                2/3                7/9

It can be seen that the minimal weighted training error (ε1 = 2/9) is obtained for the decision stumps X1 < 5/2 and X2 < 7/2. Therefore we can choose h1 = sign(7/2 − X2) as the best hypothesis at iteration t = 1; the corresponding separator is the line X2 = 7/2. The hypothesis h1 wrongly classifies the instances x4 and x5. Then

γ1 = 1/2 − 2/9 = 5/18   and   α1 = (1/2) ln((1 − ε1)/ε1) = ln √((7/9)/(2/9)) = ln √(7/2) ≈ 0.626.
94.
Now the algorithm must get a new distribution (D2) by altering the old one (D1) so that the next iteration concentrates more on the misclassified instances.

D2(i) = (1/Z1) D1(i) (e^{−α1})^{yi h1(xi)}, with e^{−α1} = √(2/7); so

D2(i) = (1/Z1) · (1/9) · √(2/7) for i ∈ {1, 2, 3, 6, 7, 8, 9};
D2(i) = (1/Z1) · (1/9) · √(7/2) for i ∈ {4, 5}.

Remember that Z1 is a normalization factor for D2. So,

Z1 = (1/9) (7 · √(2/7) + 2 · √(7/2)) = 2√14/9 ≈ 0.8315.

Therefore,

D2(i) = (9/(2√14)) · (1/9) · √(2/7) = 1/14 for i ∉ {4, 5};
D2(i) = (9/(2√14)) · (1/9) · √(7/2) = 1/4  for i ∈ {4, 5}.

[Figure: the nine instances in the (X1, X2) plane, with the separator h1 (the line X2 = 7/2, positive side below it) and the new weights: 1/4 for x4 and x5, 1/14 for the other instances.]
95.
Note

If, instead of sign(7/2 − X2), we had taken as hypothesis h1 the decision stump sign(5/2 − X1), the subsequent calculation would have been slightly different (although both decision stumps have the same, minimal, weighted training error, 2/9): x8 and x9 would have been allocated the weights 1/4, while x4 and x5 would have been allocated the weights 1/14.

(Therefore, the output of AdaBoost may not be uniquely determined!)
96.
Iteration t = 2:

s                  1/2     5/2     9/2
errD2(X1 < s)      4/14    2/14    2/14 + 2/4 + 2/14 = 11/14
errD2(X1 ≥ s)      10/14   12/14   3/14

s                  1/2     3/2                  5/2                 7/2
errD2(X2 < s)      4/14    1/4 + 3/14 = 13/28   2/4 + 1/14 = 8/14   2/4 = 1/2
errD2(X2 ≥ s)      10/14   15/28                6/14                1/2

Note: According to the theoretical result presented at part a of CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5, computing the weighted error rate of the decision stump [corresponding to the test] X2 < 7/2 is now superfluous, because this decision stump was chosen as the optimal hypothesis at the previous iteration. (Nevertheless, we have placed it in the table, for the sake of a thorough presentation.)
97.
Now the best hypothesis is h2 = sign(5/2 − X1); the corresponding separator is the line X1 = 5/2.

ε2 = PD2({x8, x9}) = 2/14 = 1/7 ≈ 0.143 ⇒ γ2 = 1/2 − 1/7 = 5/14

α2 = ln √((1 − ε2)/ε2) = ln √((6/7)/(1/7)) = ln √6 ≈ 0.896

D3(i) = (1/Z2) · D2(i) · (e^{−α2})^{yi h2(xi)}, with e^{−α2} = 1/√6; that is,
D3(i) = (1/Z2) · D2(i) · (1/√6) if h2(xi) = yi, and (1/Z2) · D2(i) · √6 otherwise. So

D3(i) = (1/Z2) · (1/14) · (1/√6) for i ∈ {1, 2, 3, 6, 7};
D3(i) = (1/Z2) · (1/4) · (1/√6)  for i ∈ {4, 5};
D3(i) = (1/Z2) · (1/14) · √6     for i ∈ {8, 9}.
98.
Z2 = 5 · (1/14) · (1/√6) + 2 · (1/4) · (1/√6) + 2 · (1/14) · √6
   = 5/(14√6) + 1/(2√6) + √6/7 = (12 + 12)/(14√6) = 24/(14√6) = 2√6/7 ≈ 0.7

D3(i) = (7/(2√6)) · (1/14) · (1/√6) = 1/24 for i ∈ {1, 2, 3, 6, 7};
D3(i) = (7/(2√6)) · (1/4) · (1/√6)  = 7/48 for i ∈ {4, 5};
D3(i) = (7/(2√6)) · (1/14) · √6     = 1/4  for i ∈ {8, 9}.

[Figure: the nine instances in the (X1, X2) plane, with the separators h1 (the line X2 = 7/2) and h2 (the line X1 = 5/2), and the new weights: 7/48 for x4 and x5, 1/4 for x8 and x9, 1/24 for the other instances.]
99.
Iteration t = 3:

s                  1/2                  5/2    9/2
errD3(X1 < s)      2/24 + 2/4 = 7/12    2/4    2/24 + 2 · 7/48 + 2 · 1/4 = 21/24
errD3(X1 ≥ s)      5/12                 2/4    3/24 = 1/8

s                  1/2     3/2                          5/2                      7/2
errD3(X2 < s)      7/12    7/48 + 2/24 + 1/4 = 23/48    2 · 7/48 + 1/24 = 1/3    2 · 7/48 = 7/24
errD3(X2 ≥ s)      5/12    25/48                        2/3                      17/24
100.
The new best hypothesis is h3 = sign(X1 − 9/2); the corresponding separator is the line X1 = 9/2.

ε3 = PD3({x1, x2, x7}) = 1/24 + 1/24 + 1/24 = 3/24 = 1/8

γ3 = 1/2 − 1/8 = 3/8

α3 = ln √((1 − ε3)/ε3) = ln √((7/8)/(1/8)) = ln √7 ≈ 0.973

[Figure: the nine instances in the (X1, X2) plane, together with the three separators h1 (X2 = 7/2), h2 (X1 = 5/2) and h3 (X1 = 9/2), with their positive and negative sides.]
101.
Finally, after filling our results in the given table, we get:

t | εt   | αt       | Dt(1) | Dt(2) | Dt(3) | Dt(4) | Dt(5) | Dt(6) | Dt(7) | Dt(8) | Dt(9) | errS(H)
1 | 2/9  | ln√(7/2) | 1/9   | 1/9   | 1/9   | 1/9   | 1/9   | 1/9   | 1/9   | 1/9   | 1/9   | 2/9
2 | 2/14 | ln√6     | 1/14  | 1/14  | 1/14  | 1/4   | 1/4   | 1/14  | 1/14  | 1/14  | 1/14  | 2/9
3 | 1/8  | ln√7     | 1/24  | 1/24  | 1/24  | 7/48  | 7/48  | 1/24  | 1/24  | 1/4   | 1/4   | 0

Note: The following table helps you understand how errS(H) was computed; remember that H(xi) := sign(∑_{t=1}^{T} αt ht(xi)).

t | αt    | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9
1 | 0.626 | +1 | +1 | −1 | +1 | +1 | −1 | −1 | +1 | +1
2 | 0.896 | +1 | +1 | −1 | −1 | −1 | −1 | −1 | −1 | −1
3 | 0.973 | −1 | −1 | −1 | −1 | −1 | −1 | +1 | +1 | +1
H(xi)    | +1 | +1 | −1 | −1 | −1 | −1 | −1 | +1 | +1
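The H(xi) row can be reproduced from the α values and the sign rows of the table (a quick check; the vote signs below are copied from the table):

```python
# alpha values and the h_t(x_i) sign rows, copied from the table above
alphas = [0.626, 0.896, 0.973]                  # ln sqrt(7/2), ln sqrt(6), ln sqrt(7)
votes = {
    'x1': [+1, +1, -1], 'x2': [+1, +1, -1], 'x3': [-1, -1, -1],
    'x4': [+1, -1, -1], 'x5': [+1, -1, -1], 'x6': [-1, -1, -1],
    'x7': [-1, -1, +1], 'x8': [+1, -1, +1], 'x9': [+1, -1, +1],
}
# H(x_i) = sign(sum_t alpha_t * h_t(x_i))
H = {k: (1 if sum(a * v for a, v in zip(alphas, vs)) > 0 else -1)
     for k, vs in votes.items()}
```

For instance, for x8 the weighted vote is 0.626 − 0.896 + 0.973 = 0.703 > 0, hence H(x8) = +1.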
102.
Remark: One can immediately see that the [test] instance (1, 4) will be classified by the hypothesis H learned by AdaBoost as negative (since −α1 + α2 − α3 = −0.626 + 0.896 − 0.973 < 0). After making other similar calculations, we can conclude that the decision zones and the decision boundaries produced by AdaBoost for the given training data will be as indicated in the nearby figure.

[Figure: the decision zones delimited by the three separators h1, h2 and h3 in the (X1, X2) plane.]

Remark: The execution of AdaBoost could continue (if we had initially taken T > 3), although we have obtained errS(H) = 0 at iteration t = 3. By elaborating the details, we would see that for t = 4 we would obtain as optimal hypothesis X2 < 7/2 (which had already been selected at iteration t = 1). This hypothesis now produces the weighted training error ε4 = 1/6. Therefore α4 = ln √5, and this will be added to α1 = ln √(7/2) in the new output H. In this way, the confidence in the hypothesis X2 < 7/2 would be strengthened.

So, we should keep in mind that AdaBoost can select a certain weak hypothesis several times (but never at consecutive iterations, cf. CMU, 2015 fall, E. Xing, Z. Bar-Joseph, HW4, pr. 2.1).
103.
Graphs made by MSc student Sebastian Ciobanu (2018 fall)

[Figure, left: the variation of εt w.r.t. the iteration number t.]

[Figure, right: the two upper bounds of the empirical error of HT, namely exp(−2 ∑_{t=1}^{T} γt²) and ∏_{t=1}^{T} Zt, plotted together with errS(HT) as functions of T.]
104.
AdaBoost and [non-]empirical γ-weak learnability:
Exemplification on a dataset from R
CMU, 2012 fall, Tom Mitchell, Ziv Bar-Joseph, final, pr. 8.a-e
105.
In this problem, we study how AdaBoost performs on a very simple classification problem shown in the nearby figure.

[Figure: three training points on the real line, at x = 1, 3 and 5.]

We use a decision stump for each weak hypothesis hi. A decision stump classifier chooses a constant value s and classifies all points where x > s as one class and the other points, where x ≤ s, as the other class.

a. What is the initial weight that is assigned to each data point?

b. Show the decision boundary for the first decision stump (indicate the positive and negative sides of the decision boundary).

c. Circle the point whose weight increases in the boosting process.

d. Write down the weight that is assigned to each data point after the first iteration of the boosting algorithm.

e. Can the boosting algorithm perfectly classify all the training examples? If no, briefly explain why. If yes, what is the minimum number of iterations?
106.
Answer

With outside threshold:

• t = 1: D1 = (1/3, 1/3, 1/3) for the points at x = 1, 3, 5; the best stump, h1, has weighted error 1/3.

• t = 2: D2 = (1/4, 1/2, 1/4); the candidate stumps have weighted errors 1/2, 1/4 and 1/4, so h2 has error ε2 = 1/4.

• t = 3: D3 = (1/6, 1/3, 1/2); the candidate stumps have weighted errors 1/3, 1/2 and 1/6, so h3 has error ε3 = 1/6.

Without outside threshold:

• t = 1: D1 = (1/3, 1/3, 1/3); h1 has weighted error 1/3.

• t = 2: D2 = (1/4, 1/4, 1/2); the candidate stumps have weighted errors 1/2 and 1/4, so h2 has error ε2 = 1/4.

• t = 3: D3 = (1/2, 1/6, 1/3); the candidate stumps have weighted errors 1/3 and 1/2, so h3 has error ε3 = 1/3.

• t = 4: D4 = (3/8, 1/8, 1/2); the candidate stumps have weighted errors 1/2 and 3/8, so h4 has error ε4 = 3/8.
107.
With outside threshold:

• t = 1: ε1 = 1/3 ⇒ α1 = ln √2 ≈ 0.3466, errS(H1) = 1/3.

• t = 2: ε2 = 1/4 ⇒ α2 = ln √3 ≈ 0.5493.

        x1  x2  x3
  α1:   −   −   −
  α2:   −   +   +
  H2(xi): − + +   ⇒ errS(H2) = 1/3

• t = 3: ε3 = 1/6 ⇒ α3 = ln √5 ≈ 0.8047.

        x1  x2  x3
  α1:   −   −   −
  α2:   −   +   +
  α3:   +   +   −
  H3(xi): − + −   ⇒ errS(H3) = 0

Without outside threshold:

• t = 1: ε1 = 1/3 ⇒ α1 = ln √2 ≈ 0.3466, errS(H1) = 1/3.

• t = 2: ε2 = 1/4 ⇒ α2 = ln √3 ≈ 0.5493.

        x1  x2  x3
  α1:   −   +   +
  α2:   +   +   −
  H2(xi): + + −   ⇒ errS(H2) = 1/3

• t = 3: ε3 = 1/3 = ε1 ⇒ α3 = ln √2 ≈ 0.3466 = α1.

        x1  x2  x3
  α1:   −   +   +
  α2:   +   +   −
  α3:   −   +   +
  H3(xi): − + +   ⇒ errS(H3) = 1/3

• t = 4: ε4 = 3/8 ⇒ α4 = ln √(5/3) ≈ 0.2554.

        x1  x2  x3
  α1:   −   +   +
  α2:   +   +   −
  α3:   −   +   +
  α4:   +   +   −
  H4(xi): + + −   ⇒ errS(H4) = 1/3

It can easily be proven that the signs of x1 and x3 will always be opposite to each other, while the sign of x2 will always be +. Therefore errS(HT) = 1/3 for any T ∈ N*.
108.
Graphs made by Sebastian Ciobanu

[Figures, left column: with the outside threshold; the variation of εt w.r.t. t, and the bounds exp(−2 ∑_{t=1}^{T} γt²) and ∏_{t=1}^{T} Zt plotted together with errS(HT).]

[Figures, right column: without the outside threshold; the same quantities, with errS(HT) remaining at 1/3.]
109.
Seeing AdaBoost as an optimization algorithm,
w.r.t. the [inverse] exponential loss function
CMU, 2008 fall, Eric Xing, HW3, pr. 4.1.1
CMU, 2008 fall, Eric Xing, midterm, pr. 5.1
110.
At CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5, part d, we have shown that in AdaBoost we try to [indirectly] minimize the training error errS(H) by sequentially minimizing its upper bound ∏_{t=1}^{T} Zt, i.e. at each iteration t (1 ≤ t ≤ T) we choose αt so as to minimize Zt (viewed as a function of αt).

Here you will see that another way to explain AdaBoost is by sequentially minimizing the [negative] exponential loss:

E := ∑_{i=1}^{m} exp(−yi fT(xi)) = ∑_{i=1}^{m} exp(−yi ∑_{t=1}^{T} αt ht(xi)).   (8)

That is to say, at the t-th iteration (1 ≤ t ≤ T) we want to choose, besides the appropriate classifier ht, the corresponding weight αt so that the overall loss E (accumulated up to the t-th iteration) is minimized.

Prove that this [new] strategy will lead to the same update rule for αt used in AdaBoost, i.e., αt = (1/2) ln((1 − εt)/εt).

Hint: You can use the fact that Dt(i) ∝ exp(−yi ft−1(xi)), and it [LC: the proportionality factor] can be viewed as constant when we try to optimize E with respect to αt in the t-th iteration.
111.
Solution

At the t-th iteration, we have

E = ∑_{i=1}^{m} exp(−yi ft(xi)) = ∑_{i=1}^{m} exp(−yi (∑_{t′=1}^{t−1} αt′ ht′(xi)) − yi αt ht(xi))
  = ∑_{i=1}^{m} exp(−yi ft−1(xi)) · exp(−yi αt ht(xi))
  = ∑_{i=1}^{m} (m ∏_{t′=1}^{t−1} Zt′) · Dt(i) · exp(−yi αt ht(xi))
  ∝ ∑_{i=1}^{m} Dt(i) · exp(−yi αt ht(xi)) =: E′

(see CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.1-5, part b).

Further on, we can rewrite E′ as

E′ = ∑_{i=1}^{m} Dt(i) · exp(−yi αt ht(xi)) = ∑_{i∈C} Dt(i) exp(−αt) + ∑_{i∈M} Dt(i) exp(αt)
   = (1 − εt) · e^{−αt} + εt · e^{αt},   (9)

where C is the set of examples which are correctly classified by ht, and M is the set of examples which are misclassified by ht.
112.
The relation (9) is identical to expression (3) from part a of CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5 (see the solution). Therefore, E reaches its minimum for αt = (1/2) ln((1 − εt)/εt).
113.
AdaBoost algorithm: the notion of [voting] margin;
some properties
CMU, 2016 spring, W. Cohen, N. Balcan, HW4, pr. 3.3
114.
Although model complexity increases with each iteration, AdaBoost does not usually overfit. The reason behind this is that the model becomes more "confident" as we increase the number of iterations. The "confidence" can be expressed mathematically as the [voting] margin. Recall that after the AdaBoost algorithm terminates with T iterations, the [output] classifier is

HT(x) = sign(∑_{t=1}^{T} αt ht(x)).

Similarly, we can define the intermediate weighted classifier after k iterations as:

Hk(x) = sign(∑_{t=1}^{k} αt ht(x)).

As its output is either −1 or 1, it does not tell the confidence of its judgement. Here, without changing the decision rule, we replace each αt by αt / ∑_{t′=1}^{k} αt′, so that the weights on the weak classifiers are normalized; in what follows, αt denotes these normalized weights.
115.
Define the margin after the k-th iteration as [the sum of] the [normalized] weights of the ht voting correctly minus [the sum of] the [normalized] weights of the ht voting incorrectly:

Margin_k(x) = ∑_{t: ht(x)=y} αt − ∑_{t: ht(x)≠y} αt.

a. Let fk(x) := ∑_{t=1}^{k} αt ht(x). Show that Margin_k(xi) = yi fk(xi) for all training instances xi, with i = 1, . . . , m.

b. If Margin_k(xi) > Margin_k(xj), which of the samples xi and xj will receive a higher weight in iteration k + 1?

Hint: Use the relation Dk+1(i) = (1/(m · ∏_{t=1}^{k} Zt)) · exp(−yi fk(xi)), which was proven at CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.2.
116.
Solution

a. We will prove the equality starting from its right-hand side:

yi fk(xi) = yi ∑_{t=1}^{k} αt ht(xi) = ∑_{t=1}^{k} αt yi ht(xi) = ∑_{t: ht(xi)=yi} αt − ∑_{t: ht(xi)≠yi} αt = Margin_k(xi).

b. According to the relationship already proven at part a,

Margin_k(xi) > Margin_k(xj) ⇔ yi fk(xi) > yj fk(xj) ⇔ −yi fk(xi) < −yj fk(xj)
⇔ exp(−yi fk(xi)) < exp(−yj fk(xj)).

Based on the given Hint, it follows that Dk+1(i) < Dk+1(j).
117.
Important Remark

It can be shown that boosting tends to increase the margins of training examples (see relation (8) at CMU, 2008 fall, Eric Xing, HW3, pr. 4.1.1), and that a large margin on training examples reduces the generalization error.

Thus we can explain why, although the number of "parameters" of the model created by AdaBoost increases by 2 at every iteration, and therefore its complexity rises, it usually doesn't overfit.
118.
AdaBoost: a sufficient condition for γ-weak learnability,
based on the voting margins
CMU, 2016 spring, W. Cohen, N. Balcan, HW4, pr. 3.1.4
119.
At CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.5 we encountered the notion of empirical γ-weak learnability. When this condition (γ ≤ γt for all t, where γt := 1/2 − εt, with εt the weighted training error produced by the weak hypothesis ht) is met, it ensures that AdaBoost will drive down the training error quickly. However, this condition does not always hold.

In this problem we will prove a sufficient condition for empirical weak learnability [to hold]. This condition refers to the notion of voting margin, which was presented in CMU, 2016 spring, W. Cohen, N. Balcan, HW4, pr. 3.3.

Namely, we will prove that if there is a constant θ > 0 such that the [voting] margins of all training instances are lower-bounded by θ at each iteration of the AdaBoost algorithm, then the property of empirical γ-weak learnability is "guaranteed", with γ = θ/2.
120.
[Formalisation]

Suppose we are given a training set S = {(x1, y1), . . . , (xm, ym)} such that, for some weak hypotheses h1, . . . , hk from the hypothesis space H and some non-negative coefficients α1, . . . , αk with ∑_{j=1}^{k} αj = 1, there exists θ > 0 such that

yi (∑_{j=1}^{k} αj hj(xi)) ≥ θ,  ∀(xi, yi) ∈ S.

Note: according to CMU, 2016 spring, W. Cohen, N. Balcan, HW4, pr. 3.3,

yi (∑_{j=1}^{k} αj hj(xi)) = Margin_k(xi) = yi fk(xi), where fk(xi) := ∑_{j=1}^{k} αj hj(xi).

Key idea: We will show that if the condition above is satisfied (for a given k), then for any distribution D over S there exists a hypothesis hl ∈ {h1, . . . , hk} with weighted training error at most 1/2 − θ/2 over the distribution D.

It will follow that when the condition above is satisfied for any k, the training set S is empirically γ-weak learnable, with γ = θ/2.
121.
a. Show that, if the condition stated above is met, there exists a weak hypothesis hl in {h1, . . . , hk} such that E_{i∼D}[yi hl(xi)] ≥ θ.

Hint: Taking expectations under the same distribution does not change the inequality conditions.

b. Show that the inequality E_{i∼D}[yi hl(xi)] ≥ θ is equivalent to

Pr_{i∼D}[yi ≠ hl(xi)] ≤ 1/2 − θ/2,

where the left-hand side is errD(hl); this means that the weighted training error of hl is at most 1/2 − θ/2, and therefore γt ≥ θ/2.
122.
Solution

a. Since yi (∑_{j=1}^{k} αj hj(xi)) ≥ θ ⇔ yi fk(xi) ≥ θ for i = 1, . . . , m, it follows (according to the Hint) that

E_{i∼D}[yi fk(xi)] ≥ θ, where fk(xi) := ∑_{j=1}^{k} αj hj(xi).   (10)

On the other hand, E_{i∼D}[yi hl(xi)] ≥ θ ⇔ ∑_{i=1}^{m} yi hl(xi) · D(i) ≥ θ, by definition.

Suppose, to the contrary, that E_{i∼D}[yi hl(xi)] < θ, that is, ∑_{i=1}^{m} yi hl(xi) · D(i) < θ for l = 1, . . . , k. Then ∑_{i=1}^{m} yi hl(xi) · D(i) · αl ≤ θ · αl for l = 1, . . . , k, with strict inequality for every l having αl > 0; at least one such l exists because ∑_{l=1}^{k} αl = 1. By summing up these inequalities for l = 1, . . . , k we get

∑_{l=1}^{k} ∑_{i=1}^{m} yi hl(xi) · D(i) · αl < ∑_{l=1}^{k} θ · αl ⇔ ∑_{i=1}^{m} yi D(i) (∑_{l=1}^{k} αl hl(xi)) < θ ∑_{l=1}^{k} αl ⇔ ∑_{i=1}^{m} yi fk(xi) · D(i) < θ,   (11)

because ∑_{j=1}^{k} αj = 1 and fk(xi) := ∑_{l=1}^{k} αl hl(xi).

The inequality (11) can be written as E_{i∼D}[yi fk(xi)] < θ. Obviously, it contradicts relationship (10). Therefore the supposition is false. In conclusion, there exists l ∈ {1, . . . , k} such that E_{i∼D}[yi hl(xi)] ≥ θ.
123.
Solution (cont'd)

b. We already said that E_{i∼D}[yi hl(xi)] ≥ θ ⇔ ∑_{i=1}^{m} yi hl(xi) · D(i) ≥ θ.

Since yi ∈ {−1, +1} and hl(xi) ∈ {−1, +1} for i = 1, . . . , m and l = 1, . . . , k, we have

∑_{i=1}^{m} yi hl(xi) · D(i) ≥ θ ⇔ ∑_{i: yi=hl(xi)} D(i) − ∑_{i: yi≠hl(xi)} D(i) ≥ θ ⇔ (1 − εl) − εl ≥ θ
⇔ 1 − 2εl ≥ θ ⇔ 2εl ≤ 1 − θ ⇔ εl ≤ 1/2 − θ/2, i.e., errD(hl) ≤ 1/2 − θ/2.
124.
AdaBoost:
Any set of consistently labelled instances from R
is empirically γ-weak learnable
by using decision stumps
Stanford, 2016 fall, Andrew Ng, John Duchi, HW2, pr. 6.abc
125.
At CMU, 2015, Z. Bar-Joseph, E. Xing, HW4, pr. 2.5 we encountered the notion of empirical γ-weak learnability. When this condition (γ ≤ γt for all t, where γt := 1/2 − εt, with εt the weighted training error produced by the weak hypothesis ht) is met, it ensures that AdaBoost will drive down the training error quickly.

In this problem we will assume that our input attribute vectors x ∈ R, that is, they are one-dimensional, and we will show that [LC] when these vectors are consistently labelled, decision stumps based on thresholding provide a weak-learning guarantee (γ).
126.
Decision stumps: analytical definitions / formalization

Thresholding-based decision stumps can be seen as functions indexed by a threshold s and a sign +/−, such that

φ_{s,+}(x) = 1 if x ≥ s, −1 if x < s;   and   φ_{s,−}(x) = −1 if x ≥ s, 1 if x < s.

Therefore, φ_{s,+}(x) = −φ_{s,−}(x).

Key idea for the proof

We will show that, given a consistently labelled training set S = {(x1, y1), . . . , (xm, ym)}, with xi ∈ R and yi ∈ {−1, +1} for i = 1, . . . , m, there is some γ > 0 such that for any distribution p defined on this training set there is a threshold s ∈ R for which

error_p(φ_{s,+}) ≤ 1/2 − γ   or   error_p(φ_{s,−}) ≤ 1/2 − γ,

where error_p(φ_{s,+}) and error_p(φ_{s,−}) denote the weighted training errors of φ_{s,+} and φ_{s,−} respectively, computed according to the distribution p.
127.
Convention: In our problem we will assume that our training instances x1, . . . , xm ∈ R are distinct. Moreover, we will assume (without loss of generality, but this makes the proof notationally simpler) that

x1 > x2 > . . . > xm.

a. Show that, given S, for each threshold s ∈ R there is some m0(s) ∈ {0, 1, . . . , m} such that

error_p(φ_{s,+}) := ∑_{i=1}^{m} pi · 1{yi ≠ φ_{s,+}(xi)} = 1/2 − (1/2) (∑_{i=1}^{m0(s)} yi pi − ∑_{i=m0(s)+1}^{m} yi pi),

where the quantity in parentheses is denoted f(m0(s)), and

error_p(φ_{s,−}) := ∑_{i=1}^{m} pi · 1{yi ≠ φ_{s,−}(xi)} = 1/2 − (1/2) (∑_{i=m0(s)+1}^{m} yi pi − ∑_{i=1}^{m0(s)} yi pi) = 1/2 + (1/2) f(m0(s)).

Note: Treat sums over empty sets of indices as zero. Therefore ∑_{i=1}^{0} ai = 0 for any ai, and similarly ∑_{i=m+1}^{m} ai = 0.
128.
b. Prove that, given S, there is some γ > 0 (which may depend on the training set size m) such that for any set of probabilities p on the training set (therefore pi ≥ 0 and ∑_{i=1}^{m} pi = 1) we can find m0 ∈ {0, . . . , m} such that

|f(m0)| ≥ 2γ, where f(m0) := ∑_{i=1}^{m0} yi pi − ∑_{i=m0+1}^{m} yi pi.

Note: γ should not depend on p.

Hint: Consider the difference f(m0) − f(m0 − 1).

What is your γ?
129.
c. Based on your answers to parts a and b, what edge can thresholded decision stumps guarantee on any training set {xi, yi}_{i=1}^{m} where the raw attributes xi ∈ R are all distinct? Recall that the edge of a weak classifier φ : R → {−1, 1} is the constant γ ∈ (0, 1/2) such that

error_p(φ) := ∑_{i=1}^{m} pi · 1{φ(xi) ≠ yi} ≤ 1/2 − γ.

d. Can you give an upper bound on the number of thresholded decision stumps required to achieve zero error on a given training set?
130.
Solution

a. We perform several algebraic steps. Let sign(t) = 1 if t ≥ 0, and sign(t) = −1 otherwise. Then

1{φ_{s,+}(x) ≠ y} = 1{sign(x − s) ≠ y} = 1{y · sign(x − s) ≤ 0},

where 1{·} denotes the well-known indicator function. Thus we have

error_p(φ_{s,+}) := ∑_{i=1}^{m} pi · 1{yi ≠ φ_{s,+}(xi)} = ∑_{i=1}^{m} pi · 1{yi · sign(xi − s) ≤ 0}
                 = ∑_{i: xi ≥ s} pi · 1{yi = −1} + ∑_{i: xi < s} pi · 1{yi = 1}.

Thus, if we let m0(s) be the index in {0, . . . , m} such that xi ≥ s for i ≤ m0(s) and xi < s for i > m0(s) (which we know must exist because x1 > x2 > . . . > xm), we have

error_p(φ_{s,+}) = ∑_{i=1}^{m0(s)} pi · 1{yi = −1} + ∑_{i=m0(s)+1}^{m} pi · 1{yi = 1}.
131.
Now we make a key observation: we have

1{y = −1} = (1 − y)/2 and 1{y = 1} = (1 + y)/2,

because y ∈ {−1, 1}.

Consequently,

error_p(φ_{s,+}) = ∑_{i=1}^{m0(s)} pi · (1 − yi)/2 + ∑_{i=m0(s)+1}^{m} pi · (1 + yi)/2
               = (1/2) ∑_{i=1}^{m} pi − (1/2) ∑_{i=1}^{m0(s)} pi yi + (1/2) ∑_{i=m0(s)+1}^{m} pi yi
               = 1/2 − (1/2) (∑_{i=1}^{m0(s)} pi yi − ∑_{i=m0(s)+1}^{m} pi yi).

The last equality follows because ∑_{i=1}^{m} pi = 1.

The case of φ_{s,−} is symmetric to this one, so we omit the argument.
132.
Solution (cont’d)
b. For any $m_0 \in \{1, \ldots, m\}$ we have
$$f(m_0) - f(m_0 - 1) = \sum_{i=1}^{m_0} y_i p_i - \sum_{i=m_0+1}^{m} y_i p_i - \sum_{i=1}^{m_0-1} y_i p_i + \sum_{i=m_0}^{m} y_i p_i = 2 y_{m_0} p_{m_0}.$$
Therefore, $|f(m_0) - f(m_0 - 1)| = 2 |y_{m_0}| \, p_{m_0} = 2 p_{m_0}$ for all $m_0 \in \{1, \ldots, m\}$.
Because $\sum_{i=1}^m p_i = 1$, there must be at least one index $m_0'$ with $p_{m_0'} \geq \frac{1}{m}$. Thus we have $|f(m_0') - f(m_0' - 1)| \geq \frac{2}{m}$, and so it must be the case that at least one of
$$|f(m_0')| \geq \frac{1}{m} \quad \text{or} \quad |f(m_0' - 1)| \geq \frac{1}{m}$$
holds. Depending on which one of these two inequalities is true, we would then "return" $m_0'$ or $m_0' - 1$.
(Note: If $|f(m_0' - 1)| \geq \frac{1}{m}$ and $m_0' = 1$, then we have to consider an "outside" threshold, $s > x_1$.)
Finally, we have $\gamma = \frac{1}{2m}$.
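The constructive argument above can be mirrored in code: pick an index of maximal probability (so that $p_{m_0'} \geq 1/m$), then return either $m_0'$ or $m_0' - 1$, whichever satisfies the bound. A sketch with hypothetical helper names:

```python
import random

def f(m0, ys, ps):
    # f(m0) = sum_{i<=m0} y_i p_i - sum_{i>m0} y_i p_i  (m0 is 1-based)
    return sum(y * p for y, p in zip(ys[:m0], ps[:m0])) - \
           sum(y * p for y, p in zip(ys[m0:], ps[m0:]))

def find_m0(ys, ps):
    # Follow the proof: the largest p_i satisfies p_i >= 1/m, and then
    # |f(m0') - f(m0'-1)| = 2 p_{m0'} >= 2/m forces one of the two values up.
    m = len(ps)
    i = max(range(m), key=lambda j: ps[j])
    m0 = i + 1
    if abs(f(m0, ys, ps)) >= 1.0 / m:
        return m0
    return m0 - 1   # then |f(m0'-1)| >= 2/m - |f(m0')| > 1/m

random.seed(1)
for _ in range(100):
    m = random.randint(1, 12)
    ys = [random.choice([-1, 1]) for _ in range(m)]
    ws = [random.random() for _ in range(m)]
    ps = [w / sum(ws) for w in ws]
    m0 = find_m0(ys, ps)
    assert abs(f(m0, ys, ps)) >= 1.0 / m - 1e-12
```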
133.
Solution (cont’d)
c. The inequality proven at part b, $|f(m_0)| \geq 2\gamma$, implies that either $f(m_0) \geq 2\gamma$ or $f(m_0) \leq -2\gamma$. So,
$$\text{either } f(m_0) \geq 2\gamma \Leftrightarrow -f(m_0) \leq -2\gamma \Rightarrow \underbrace{\frac{1}{2} - \frac{1}{2} f(m_0)}_{\mathrm{error}_p(\phi_{s,+})} \leq \frac{1}{2} - \frac{1}{2} \cdot 2\gamma = \frac{1}{2} - \gamma$$
$$\text{or } f(m_0) \leq -2\gamma \Rightarrow \underbrace{\frac{1}{2} + \frac{1}{2} f(m_0)}_{\mathrm{error}_p(\phi_{s,-})} \leq \frac{1}{2} - \frac{1}{2} \cdot 2\gamma = \frac{1}{2} - \gamma$$
for any $s \in (x_{m_0+1}, x_{m_0}]$.$^a$
Therefore thresholded decision stumps are guaranteed to have an edge of at least $\gamma = \frac{1}{2m}$ over random guessing.
$^a$In the case described by the Note at part b, we must consider $s > x_1$.
134.
Summing up
At each iteration $t$ executed by AdaBoost,
• a probabilistic distribution $p$ (denoted as $D_t$ in CMU, 2015, Z. Bar-Joseph, E. Xing, HW4, pr. 2.1-5) is in use;
• [at part b of the present exercise we proved that] there is at least one $m_0$ (better denoted $m_0(p)$) in $\{0, \ldots, m\}$ such that
$$|f(m_0)| \geq \frac{1}{m} \overset{\text{not.}}{=} 2\gamma, \quad \text{where } f(m_0) \overset{\text{def}}{=} \sum_{i=1}^{m_0} y_i p_i - \sum_{i=m_0+1}^{m} y_i p_i;$$
• [the proofs made at parts a and c of the present exercise imply that] for any $s \in (x_{m_0+1}, x_{m_0}]$,$^a$
$$\mathrm{error}_p(\phi_{s,+}) \leq \frac{1}{2} - \gamma \quad \text{or} \quad \mathrm{error}_p(\phi_{s,-}) \leq \frac{1}{2} - \gamma, \quad \text{where } \gamma \overset{\text{not.}}{=} \frac{1}{2m}.$$
As a consequence, AdaBoost can choose at each iteration a weak hypothesis ($h_t$) for which $\gamma_t \geq \gamma = \frac{1}{2m}$.
$^a$See the previous footnote.
135.
Solution (cont’d)
d. Boosting takes $\frac{\ln m}{2\gamma^2}$ iterations to achieve zero [training] error, as shown at CMU, 2015, Z. Bar-Joseph, E. Xing, HW4, pr. 2.5, so with decision stumps we will achieve zero [training] error in at most $2m^2 \ln m$ iterations of boosting.
Each iteration of boosting introduces a single new weak hypothesis, so at most $2m^2 \ln m$ thresholded decision stumps are necessary.
136.
A generalized version of the AdaBoost algorithm
MIT, 2003 fall, Tommy Jaakkola, HW4, pr. 2.1-3
137.
Here we derive a boosting algorithm from a slightly more general perspective than the AdaBoost algorithm in CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.1-5, one that will be applicable to a class of loss functions including the exponential one.
The goal is to generate discriminant functions of the form
$$f_K(x) = \alpha_1 h(x; \theta_1) + \ldots + \alpha_K h(x; \theta_K),$$
where $x \in \mathbb{R}^d$ and $\theta$ are parameters. You can assume that the weak classifiers $h(x; \theta)$ are decision stumps whose predictions are $\pm 1$; any other set of weak learners would be fine without modification.
We successively add components to the overall discriminant function in a manner that will separate the estimation of [the parameters of] the weak classifiers from the setting of the votes $\alpha$ to the extent possible.
138.
A useful definition
Let's start by defining a set of useful loss functions. The only restriction we place on the loss is that it should be a monotonically decreasing and differentiable function of its argument. The argument in our context is $y_i f_K(x_i)$, so that the more the discriminant function agrees with the $\pm 1$ label $y_i$, the smaller the loss. The simple exponential loss we have already considered [at CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.1-5], i.e.,
$$\mathrm{Loss}(y_i f_K(x_i)) = \exp(-y_i f_K(x_i)),$$
certainly conforms to this notion. And so does the logistic loss
$$\mathrm{Loss}(y_i f_K(x_i)) = \ln(1 + \exp(-y_i f_K(x_i))).$$
[Figure: plots of $-\log(\sigma(z))$ and $\sigma(z)$ as functions of $z \in [-4, 4]$.]
139.
Remark
Note that the logistic loss has a nice interpretation as a negative log-probability. Indeed, [recall that] for an additive logistic regression model
$$- \ln P(y = 1 \,|\, x, w) = - \ln \frac{1}{1 + \exp(-z)} = \ln(1 + \exp(-z)),$$
where $z = w_1 \phi_1(x) + \ldots + w_K \phi_K(x)$ and we omit the bias term ($w_0$) for simplicity.
By replacing the additive combination of basis functions ($\phi_i(x)$) with the combination of weak classifiers ($h(x; \theta_i)$), we have an additive logistic regression model where the weak classifiers serve as the basis functions. The difference is that both the basis functions (weak classifiers) and the coefficients multiplying them will be estimated. In the logistic regression model we typically envision a fixed set of basis functions.
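The identity $\ln(1 + \exp(-z)) = -\ln \sigma(z)$, with $\sigma$ the logistic sigmoid, can be checked numerically in a few lines (the function names below are my own):

```python
import math

def sigma(z):
    # logistic sigmoid
    return 1.0 / (1.0 + math.exp(-z))

def logistic_loss(z):
    # Loss(z) = ln(1 + e^{-z})
    return math.log(1.0 + math.exp(-z))

# the logistic loss equals the negative log-probability -ln sigma(z)
for z in [-3.0, -0.5, 0.0, 1.0, 4.0]:
    assert abs(logistic_loss(z) - (-math.log(sigma(z)))) < 1e-12
```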
140.
Let us now try to derive the boosting algorithm in a manner that can accommodate any loss function of the type discussed above. To this end, suppose we have already included $k-1$ component classifiers
$$f_{k-1}(x) = \alpha_1 h(x; \theta_1) + \ldots + \alpha_{k-1} h(x; \theta_{k-1}), \quad (12)$$
and we wish to add another $h(x; \theta)$. The estimation criterion for the overall discriminant function, including the new component with vote $\alpha$, is given by
$$J(\alpha, \theta) = \frac{1}{m} \sum_{i=1}^m \mathrm{Loss}(y_i f_{k-1}(x_i) + y_i \, \alpha \, h(x_i; \theta)).$$
Note that we explicate only how the objective depends on the choice of the last component and the corresponding vote, since the parameters of the $k-1$ previous components, along with their votes, have already been set and won't be modified further.
141.
We will first try to find the new component, or parameters $\theta$, so as to maximize its potential in reducing the empirical loss, potential in the sense that we can subsequently adjust the vote to actually reduce the empirical loss. More precisely, we set $\theta$ so as to minimize the derivative
$$\frac{\partial}{\partial \alpha} J(\alpha, \theta) \Big|_{\alpha=0} = \frac{1}{m} \sum_{i=1}^m \frac{\partial}{\partial \alpha} \mathrm{Loss}(y_i f_{k-1}(x_i) + y_i \, \alpha \, h(x_i; \theta)) \Big|_{\alpha=0} = \frac{1}{m} \sum_{i=1}^m dL(y_i f_{k-1}(x_i)) \, y_i \, h(x_i; \theta), \quad (13)$$
where $dL(z) \overset{\text{not.}}{=} \frac{\partial \mathrm{Loss}(z)}{\partial z}$.
Note that this derivative $\frac{\partial}{\partial \alpha} J(\alpha, \theta)|_{\alpha=0}$ precisely captures the amount by which we would start to reduce the empirical loss if we gradually increased the vote ($\alpha$) for the new component with parameters $\theta$. Minimizing this derivative, i.e. maximizing the initial reduction of the loss, seems like a sensible estimation criterion for the new component, or $\theta$. This plan permits us to first set $\theta$ and then subsequently optimize $\alpha$ to actually minimize the empirical loss.
142.
Let's rewrite the algorithm slightly to make it look more like a boosting algorithm. First, let's define the following weights and normalized weights on the training examples:
$$W_i^{(k-1)} = - dL(y_i f_{k-1}(x_i)) \quad \text{and} \quad \widetilde{W}_i^{(k-1)} = \frac{W_i^{(k-1)}}{\sum_{j=1}^m W_j^{(k-1)}}, \quad \text{for } i = 1, \ldots, m.$$
These weights are guaranteed to be non-negative, since the loss function is a decreasing function of its argument (its derivative has to be negative or zero).
143.
Now we can rewrite the expression (13) as
$$\frac{\partial}{\partial \alpha} J(\alpha, \theta) \Big|_{\alpha=0} = - \frac{1}{m} \sum_{i=1}^m W_i^{(k-1)} \, y_i \, h(x_i; \theta) = - \frac{1}{m} \Big( \sum_j W_j^{(k-1)} \Big) \cdot \sum_{i=1}^m \frac{W_i^{(k-1)}}{\sum_j W_j^{(k-1)}} \, y_i \, h(x_i; \theta)$$
$$= - \frac{1}{m} \Big( \sum_j W_j^{(k-1)} \Big) \cdot \sum_{i=1}^m \widetilde{W}_i^{(k-1)} \, y_i \, h(x_i; \theta).$$
By ignoring the multiplicative constant (i.e., $\frac{1}{m} \sum_j W_j^{(k-1)}$, which is constant at iteration $k$), we will estimate $\theta$ by minimizing
$$- \sum_{i=1}^m \widetilde{W}_i^{(k-1)} \, y_i \, h(x_i; \theta), \quad (14)$$
where the normalized weights $\widetilde{W}_i^{(k-1)}$ sum to 1. (This is the same as maximizing the weighted agreement with the labels, i.e., $\sum_{i=1}^m \widetilde{W}_i^{(k-1)} y_i h(x_i; \theta)$.)
144.
Some remarks (by Liviu Ciortuz)
1. Using some familiar notations ($C$ and $M$ denote, respectively, the sets of correctly classified and misclassified training examples), we can write
$$\sum_{i=1}^m \widetilde{W}_i^{(k-1)} y_i h(x_i; \theta) = \sum_{i \in C} \widetilde{W}_i^{(k-1)} y_i h(x_i; \theta) + \sum_{i \in M} \widetilde{W}_i^{(k-1)} y_i h(x_i; \theta) = \underbrace{\sum_{i \in C} \widetilde{W}_i^{(k-1)}}_{1 - \varepsilon_k} - \underbrace{\sum_{i \in M} \widetilde{W}_i^{(k-1)}}_{\varepsilon_k} = 1 - 2\varepsilon_k$$
$$\Rightarrow \varepsilon_k = \frac{1}{2} \Big( 1 - \sum_{i=1}^m \widetilde{W}_i^{(k-1)} y_i h(x_i; \theta_k) \Big).$$
2. Because $\widetilde{W}_i^{(k-1)} \geq 0$ and $\sum_{i=1}^m \widetilde{W}_i^{(k-1)} = 1$, it follows that $\sum_{i=1}^m \widetilde{W}_i^{(k-1)} y_i h(x_i; \theta) \in [-1, +1]$, therefore
$$1 - \Big( \sum_{i \in C} \widetilde{W}_i^{(k-1)} - \sum_{i \in M} \widetilde{W}_i^{(k-1)} \Big) = 1 - \underbrace{\sum_{i=1}^m \widetilde{W}_i^{(k-1)} y_i h(x_i; \theta_k)}_{\in [-1, +1]} \in [0, +2],$$
and so
$$\frac{1}{2} \Big( 1 - \sum_{i=1}^m \widetilde{W}_i^{(k-1)} y_i h(x_i; \theta_k) \Big) \in [0, +1].$$
145.
We are now ready to cast the steps of the boosting algorithm in a form similar to the AdaBoost algorithm given at CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.1-5.
[Assume $\widetilde{W}_i^{(0)} = \frac{1}{m}$ and $f_0(x_i) = 0$ for $i = 1, \ldots, m$.]
Step 1: Find any classifier $h(x; \theta_k)$ that performs better than chance with respect to the weighted training error:
$$\varepsilon_k = \frac{1}{2} \Big( 1 - \sum_{i=1}^m \widetilde{W}_i^{(k-1)} y_i h(x_i; \theta_k) \Big). \quad (15)$$
Step 2: Set the vote $\alpha_k$ for the new component by minimizing the overall empirical loss:
$$J(\alpha, \theta_k) = \frac{1}{m} \sum_{i=1}^m \mathrm{Loss}(y_i f_{k-1}(x_i) + y_i \, \alpha \, h(x_i; \theta_k)), \quad \text{and so} \quad \alpha_k = \underset{\alpha \geq 0}{\operatorname{argmin}} \, J(\alpha, \theta_k).$$
Step 3: Recompute the normalized weights for the next iteration according to
$$\widetilde{W}_i^{(k)} = - c_k \cdot dL(\underbrace{y_i f_{k-1}(x_i) + y_i \, \alpha_k \, h(x_i; \theta_k)}_{y_i f_k(x_i)}) \quad \text{for } i = 1, \ldots, m, \quad (16)$$
where $c_k$ is chosen so that $\sum_{i=1}^m \widetilde{W}_i^{(k)} = 1$.
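The three steps can be sketched in code for a one-dimensional dataset, with the loss plugged in through its derivative $dL$. This is only an illustrative implementation, not the algorithm as stated: the stump search is exhaustive, Step 2's argmin over $\alpha$ is replaced by a coarse grid search, and all function names are mine.

```python
import math

def d_exp(z):
    # dL for the exponential loss, Loss(z) = e^{-z}
    return -math.exp(-z)

def d_log(z):
    # dL for the logistic loss, Loss(z) = ln(1 + e^{-z})
    return -math.exp(-z) / (1.0 + math.exp(-z))

def best_stump(xs, ys, ws):
    # Step 1: maximize the weighted agreement sum_i W_i y_i h(x_i; theta)
    pts = sorted(set(xs))
    thetas = [pts[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(pts, pts[1:])] \
             + [pts[-1] + 1.0]
    def agree(theta, pol):
        return sum(w * y * (pol if x >= theta else -pol)
                   for x, y, w in zip(xs, ys, ws))
    return max(((t, pol) for t in thetas for pol in (+1, -1)),
               key=lambda c: agree(*c))

def boost(xs, ys, loss, dL, rounds=10):
    m = len(xs)
    F = [0.0] * m                                  # f_0(x_i) = 0
    for _ in range(rounds):
        raw = [-dL(y * f) for y, f in zip(ys, F)]  # unnormalized weights
        Z = sum(raw)
        W = [r / Z for r in raw]                   # normalized weights (Step 3)
        theta, pol = best_stump(xs, ys, W)         # Step 1
        h = [pol if x >= theta else -pol for x in xs]
        def J(a):                                  # Step 2: grid search on [0, 5]
            return sum(loss(y * (f + a * hi))
                       for y, f, hi in zip(ys, F, h)) / m
        alpha = min((a / 100.0 for a in range(501)), key=J)
        F = [f + alpha * hi for f, hi in zip(F, h)]
    return F

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1, 1, -1, -1, 1, 1]                          # not separable by one stump
F = boost(xs, ys, lambda z: math.log(1.0 + math.exp(-z)), d_log, rounds=20)
avg_loss = sum(math.log(1.0 + math.exp(-y * f)) for y, f in zip(ys, F)) / len(ys)
assert avg_loss < math.log(2.0)   # strictly below the loss of f = 0
```

Since $\alpha = 0$ is always in the search grid, the empirical loss is non-increasing from round to round.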
146.
One more remark (by Liviu Ciortuz),
now concerning Step 1:
Normally there should be such an $\varepsilon_k \in (0, 1/2)$ (in fact, some corresponding $\theta_k$), because if for some $h$ we would have $\varepsilon_k \in (1/2, 1)$, then we can take $h' = -h$, and the resulting $\varepsilon_k'$ would belong to $(0, 1/2)$.
There are only two exceptions, which correspond to the case when for any hypothesis $h$ we would have
− either $\varepsilon_k = 1/2$, in which case $\sum_{i \in C} \widetilde{W}_i^{(k-1)} = \sum_{i \in M} \widetilde{W}_i^{(k-1)}$;
− or $\varepsilon_k \in \{0, 1\}$, in which case either $h$ or $h' = -h$ is a perfect (therefore not weak) classifier for the given training data.
147.
Exemplifying Step 1 on data from
CMU, 2012 fall, T. Mitchell, Z. Bar-Joseph, final, pr. 8.a-e
[graphs made by MSc student Sebastian Ciobanu, FII, 2018 fall]
Iteration 1
[Figure: plots of $\frac{\partial}{\partial \alpha} J(\alpha, \theta)\big|_{\alpha=0}$ as a function of the stump threshold $\theta \in [-10, 10]$, for stumps of type $(+|-)$ (left) and $(-|+)$ (right).]
148.
[graphs made by MSc student Sebastian Ciobanu, FII, 2018 fall]
Iteration 2
[Figure: plots of $\frac{\partial}{\partial \alpha} J(\alpha, \theta)\big|_{\alpha=0}$ as a function of the stump threshold $\theta \in [-10, 10]$, for stumps of type $(+|-)$ (left) and $(-|+)$ (right).]
149.
[graphs made by MSc student Sebastian Ciobanu, FII, 2018 fall]
Iteration 3
[Figure: plots of $\frac{\partial}{\partial \alpha} J(\alpha, \theta)\big|_{\alpha=0}$ as a function of the stump threshold $\theta \in [-10, 10]$, for stumps of type $(+|-)$ (left) and $(-|+)$ (right).]
150.
a. Show that the three steps in the algorithm correspond exactly to AdaBoost when the loss function is the exponential loss $\mathrm{Loss}(z) = \exp(-z)$.
More precisely, show that in this case the setting of $\alpha_k$ based on the new weak classifier and the weight update to get $\widetilde{W}_i^{(k)}$ would be identical to AdaBoost. (In CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.1-5, $\widetilde{W}_i^{(k)}$ corresponds to $D_k(i)$.)
Solution
For the first part, we will show that the minimization in Step 2 of the general algorithm (LHS below), with $\mathrm{Loss}(z) = e^{-z}$, is the same as the minimization performed by AdaBoost (RHS below), i.e. that
$$\underset{\alpha > 0}{\operatorname{argmin}} \sum_{i=1}^m \mathrm{Loss}(y_i f_{k-1}(x_i) + \alpha \, y_i \, h(x_i; \theta_k)) = \underset{\alpha > 0}{\operatorname{argmin}} \sum_{i=1}^m \widetilde{W}_i^{(k-1)} \exp(-\alpha \, y_i \, h(x_i; \theta_k)),$$
with (from AdaBoost)
$$\widetilde{W}_i^{(k-1)} = c_{k-1} \cdot \exp(-y_i f_{k-1}(x_i)),$$
where $c_{k-1}$ is a normalization constant (the weights sum to 1).
151.
Solution (cont’d)
Evaluating the objective in the LHS gives:
$$\sum_{i=1}^m \mathrm{Loss}(y_i f_{k-1}(x_i) + \alpha \, y_i \, h(x_i; \theta_k)) = \sum_{i=1}^m \exp(-y_i f_{k-1}(x_i)) \exp(-\alpha \, y_i \, h(x_i; \theta_k)) \overset{(16)}{=} \frac{1}{c_{k-1}} \sum_{i=1}^m \widetilde{W}_i^{(k-1)} \exp(-\alpha \, y_i \, h(x_i; \theta_k))$$
$$= \frac{1}{c_{k-1}} \Big[ \sum_{i \,:\, y_i h(x_i; \theta_k) = 1} \widetilde{W}_i^{(k-1)} \, e^{-\alpha} + \sum_{i \,:\, y_i h(x_i; \theta_k) = -1} \widetilde{W}_i^{(k-1)} \, e^{\alpha} \Big],$$
which is proportional to the objective minimized by AdaBoost, so that the minimizing value of $\alpha$ is the same for both algorithms (see CMU, 2008 fall, Eric Xing, HW3, pr. 4.1.1).
For the second part, note that the weight assignment in Step 3 of the general algorithm (for stage $k$) is
$$\widetilde{W}_i^{(k)} = - c_k \cdot dL(y_i f_k(x_i)) = c_k \cdot \exp(-y_i f_k(x_i)),$$
which is the same as in AdaBoost (see CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.2).
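Since the weighted exponential objective is convex in $\alpha$, AdaBoost's closed-form vote $\alpha_k = \frac{1}{2} \ln \frac{1-\varepsilon_k}{\varepsilon_k}$ can be checked numerically to be its minimizer. A small sketch (the example weights are made up):

```python
import math

def closed_form_alpha(eps):
    # AdaBoost's vote: alpha = (1/2) ln((1 - eps)/eps)
    return 0.5 * math.log((1.0 - eps) / eps)

def weighted_exp_objective(alpha, W, agree):
    # sum_i W_i exp(-alpha * y_i h(x_i)); agree[i] = y_i h(x_i) in {-1, +1}
    return sum(w * math.exp(-alpha * a) for w, a in zip(W, agree))

# a small weighted example (weights sum to 1); the stump errs only on example 2
W = [0.1, 0.2, 0.3, 0.4]
agree = [1, -1, 1, 1]
eps = sum(w for w, a in zip(W, agree) if a == -1)   # weighted error = 0.2

alpha_star = closed_form_alpha(eps)
J_star = weighted_exp_objective(alpha_star, W, agree)
# perturbing alpha in either direction can only increase the convex objective
for delta in (-0.1, -0.01, 0.01, 0.1):
    assert weighted_exp_objective(alpha_star + delta, W, agree) >= J_star
```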
152.
b. Show that for any valid loss function of the type discussed above, the new component $h(x; \theta_k)$ just added at the $k$-th iteration would have weighted training error exactly $1/2$ relative to the updated weights $\widetilde{W}_i^{(k)}$. If you prefer, you can show this only in the case of the logistic loss.
Solution
At stage $k$, $\alpha_k$ is chosen to minimize $J(\alpha, \theta_k)$, i.e. to solve $\frac{\partial J(\alpha, \theta_k)}{\partial \alpha} = 0$. In general,
$$\frac{\partial}{\partial \alpha} J(\alpha, \theta_k) = \frac{1}{m} \sum_{i=1}^m \frac{\partial}{\partial \alpha} \mathrm{Loss}(y_i f_{k-1}(x_i) + y_i \, \alpha \, h(x_i; \theta_k)) = \frac{1}{m} \sum_{i=1}^m \underbrace{dL(y_i f_{k-1}(x_i) + y_i \, \alpha \, h(x_i; \theta_k))}_{- \widetilde{W}_i^{(k)} / c_k} \, y_i \, h(x_i; \theta_k) \propto \sum_{i=1}^m \widetilde{W}_i^{(k)} \, y_i \, h(x_i; \theta_k),$$
so that we must have $\sum_{i=1}^m \widetilde{W}_i^{(k)} y_i h(x_i; \theta_k) = 0$. Then, the weighted training error for $h(x; \theta_k)$ (relative to the updated weights $\widetilde{W}_i^{(k)}$, determined by $\alpha_k$) can be computed in a way similar to (15):
$$\frac{1}{2} \Big( 1 - \underbrace{\sum_{i=1}^m \widetilde{W}_i^{(k)} \, y_i \, h(x_i; \theta_k)}_{0} \Big) = \frac{1}{2}(1 - 0) = \frac{1}{2}.$$
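For the exponential loss, this property is the familiar AdaBoost fact that, after the weight update with the optimal $\alpha_k$, the weighted error of the stump just added becomes exactly $1/2$. A quick numerical check (the weights and predictions are made up):

```python
import math

def weighted_error(W, agree):
    # weighted error of h relative to weights W: sum of W_i where y_i h(x_i) = -1
    return sum(w for w, a in zip(W, agree) if a == -1)

W = [0.1, 0.2, 0.3, 0.4]          # current normalized weights
agree = [1, -1, 1, 1]             # y_i h(x_i; theta_k)
eps = weighted_error(W, agree)    # 0.2
alpha = 0.5 * math.log((1 - eps) / eps)   # minimizer of sum_i W_i e^{-alpha y_i h_i}

# exponential-loss weight update, then renormalization
W_new = [w * math.exp(-alpha * a) for w, a in zip(W, agree)]
Z = sum(W_new)
W_new = [w / Z for w in W_new]

assert abs(weighted_error(W_new, agree) - 0.5) < 1e-12
```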
153.
CMU, 2008 fall, Eric Xing, HW3, pr. 4.1.1
CMU, 2008 fall, Eric Xing, midterm, pr. 5.1
c. Now, suppose that we change the objective function to $J(f_k) = \sum_{i=1}^m (y_i - f_k(x_i))^2$ and we still want to optimize it sequentially.$^a$ What is the new update rule for $\alpha_k$?
Solution
We will compute the derivative of $J(f_k) = \sum_{i=1}^m (y_i - f_k(x_i))^2$ and set it to zero to find the value of $\alpha_k$:
$$\frac{\partial J(f_k)}{\partial \alpha_k} = \frac{\partial \sum_{i=1}^m (y_i - f_k(x_i))^2}{\partial \alpha_k} = \sum_{i=1}^m 2 (y_i - f_k(x_i)) \frac{\partial (y_i - f_k(x_i))}{\partial \alpha_k}.$$
We also know that $f_k(x_i) = f_{k-1}(x_i) + \alpha_k h_k(x_i)$. In this equation, $f_{k-1}$ is independent of $\alpha_k$. Substituting this in the derivative equation, we get
$$\frac{\partial J(f_k)}{\partial \alpha_k} = 2 \sum_{i=1}^m (y_i - f_k(x_i)) \frac{\partial (y_i - f_{k-1}(x_i) - \alpha_k h_k(x_i))}{\partial \alpha_k} = 2 \sum_{i=1}^m (y_i - f_k(x_i)) (- h_k(x_i)).$$
$^a$LC: Note that $(y_i - f_k(x_i))^2 = [y_i (1 - y_i f_k(x_i))]^2 = (1 - y_i f_k(x_i))^2 = (1 - z_i)^2$, where $z_i = y_i f_k(x_i)$. The function $(1-z)^2$ is differentiable and convex; it is decreasing on $(-\infty, 1]$ and increasing on $[1, +\infty)$.
154.
Solution (cont’d)
Setting the derivative to zero, we get
$$\frac{\partial J(f_k)}{\partial \alpha_k} = 0 \Leftrightarrow \sum_{i=1}^m (y_i - f_k(x_i)) h_k(x_i) = 0 \Leftrightarrow \sum_{i=1}^m (y_i - \alpha_k h_k(x_i) - f_{k-1}(x_i)) h_k(x_i) = 0$$
$$\Leftrightarrow \sum_{i=1}^m (y_i - f_{k-1}(x_i)) h_k(x_i) = \alpha_k \sum_{i=1}^m \underbrace{h_k^2(x_i)}_{1} \Leftrightarrow \alpha_k = \frac{\sum_{i=1}^m (y_i - f_{k-1}(x_i)) h_k(x_i)}{m} = \frac{1}{m} \sum_{i=1}^m (y_i - f_{k-1}(x_i)) h_k(x_i).$$
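The update rule can be verified numerically: the objective is quadratic in $\alpha_k$ with positive leading coefficient $\sum_i h_k^2(x_i) = m$, so the $\alpha_k$ above is its exact minimizer. A sketch on random data (the helper names are mine):

```python
import random

def alpha_update(ys, f_prev, h):
    # alpha_k = (1/m) sum_i (y_i - f_{k-1}(x_i)) h_k(x_i), valid for h_k in {-1,+1}
    m = len(ys)
    return sum((y - f) * hi for y, f, hi in zip(ys, f_prev, h)) / m

def J(alpha, ys, f_prev, h):
    # squared-error objective of f_k = f_{k-1} + alpha h_k
    return sum((y - (f + alpha * hi)) ** 2 for y, f, hi in zip(ys, f_prev, h))

random.seed(2)
m = 7
ys = [random.choice([-1, 1]) for _ in range(m)]
f_prev = [random.uniform(-2.0, 2.0) for _ in range(m)]
h = [random.choice([-1, 1]) for _ in range(m)]

a = alpha_update(ys, f_prev, h)
# any perturbation of the closed-form alpha can only increase the objective
for delta in (-0.5, -0.01, 0.01, 0.5):
    assert J(a + delta, ys, f_prev, h) >= J(a, ys, f_prev, h)
```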
155.
MIT, 2006 fall, Tommi Jaakkola, HW4, pr. 3.a
MIT, 2009 fall, Tommi Jaakkola, HW3, pr. 2.1
d. Show that if we use the logistic loss instead [of the exponential loss], the unnormalized weights $W_i^{(k)}$ are bounded by 1.
Solution
$W_i^{(k)}$ was defined as $- dL(y_i f_k(x_i))$, with $dL(z) \overset{\text{not.}}{=} \frac{\partial}{\partial z} \ln(1 + e^{-z}) = - \frac{e^{-z}}{1 + e^{-z}}$. Therefore,
$$W_i^{(k)} = \frac{e^{-z_i}}{1 + e^{-z_i}} = \frac{1}{1 + e^{z_i}} < 1, \quad \text{where } z_i = y_i f_k(x_i).$$
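A short check of the bound (the function name is mine); note by contrast that the exponential-loss weight $e^{-z}$ is unbounded for negative agreements $z$:

```python
import math

def w_logistic(z):
    # unnormalized logistic weight: W = -dL(z) = e^{-z}/(1 + e^{-z}) = 1/(1 + e^z)
    return 1.0 / (1.0 + math.exp(z))

# the logistic weight stays strictly between 0 and 1 for any agreement z
for z in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    assert 0.0 < w_logistic(z) < 1.0

# the exponential-loss weight e^{-z} has no such bound for z < 0
assert math.exp(-(-10.0)) > 1.0
```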
156.
e. When using the logistic loss, what are the normalized weights $\widetilde{W}_i^{(k)}$? Express the weights as a function of the agreements $y_i f_k(x_i)$, where we have already included the $k$-th weak learner.
What can you say about the resulting normalized weights for examples that are clearly misclassified, in comparison to those that are just slightly misclassified by the current ensemble?
If the training data contains mislabeled examples, why do we prefer the logistic loss over the exponential loss, $\mathrm{Loss}(z) = \exp(-z)$?
Solution
The normalized weights are given by
$$\widetilde{W}_i^{(k)} = c_k \cdot \frac{\exp(-y_i f_k(x_i))}{1 + \exp(-y_i f_k(x_i))}, \quad \text{with the normalization constant} \quad c_k = \Big( \sum_{i=1}^m \frac{\exp(-y_i f_k(x_i))}{1 + \exp(-y_i f_k(x_i))} \Big)^{-1}.$$
[Answer from MIT, 2011 fall, Leslie P. Kaelbling, HW5, pr. 1.1]
For clearly misclassified examples, $y_i f_k(x_i)$ is a large negative number, so the unnormalized weight $W_i^{(k)}$ is close to [and less than] 1, while for slightly misclassified examples $W_i^{(k)}$ is close to [and greater than] 1/2. Thus, the normalized weights for the two respective cases will be in a ratio of at most 2:1, i.e. a single clearly misclassified outlier will never be worth more than two completely uncertain points. This is why boosting [with the logistic loss function] is robust to outliers.
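The 2:1 bound follows because the unnormalized weight $e^{-z}/(1+e^{-z})$ lies in $(1/2, 1)$ for every misclassified example ($z < 0$), and normalization preserves ratios. A quick check (the names are mine):

```python
import math

def w_logistic(z):
    # unnormalized logistic weight as a function of the agreement z = y_i f_k(x_i)
    return math.exp(-z) / (1.0 + math.exp(-z))

# clearly misclassified (large negative agreement) vs barely misclassified (z ~ 0)
w_outlier = w_logistic(-20.0)     # close to (and below) 1
w_uncertain = w_logistic(-1e-9)   # close to (and above) 1/2

assert w_outlier < 1.0 and w_uncertain > 0.5
# ratio of (normalized) weights is at most 2:1
assert w_outlier / w_uncertain < 2.0
```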
157.
Solution (cont'd)
LC: For the last part of point e I could not find any answer at MIT, but one can reason as follows:
In the case of the logistic loss, if we have an $x_i$ whose correct label would be $y_i = +1$, but which is (erroneously) taken to be $y_i = -1$, the loss is approximately $f_k(x_i)$ if $f_k(x_i) > 0$,$^a$ whereas in the case of the exponential loss the loss is $\exp(f_k(x_i))$, which is in general much larger than $f_k(x_i)$. The symmetric cases ($f_k(x_i) \leq 0$, and then $y_i = -1 \rightarrow +1$) are treated similarly.
$^a$See the graph of the logistic loss function in the statement of the present problem.
158.
MIT, 2006 fall, Tommi Jaakkola, HW4, pr. 3.b
f. Suppose we use the logistic loss and the training set is linearly separable. We would like to use a linear support vector machine (no slack penalties) as a base classifier [LC: or any linear separator consistent with the training data]. Assume that the generalized AdaBoost algorithm minimizes the weighted error $\varepsilon_k$ at Step 1. In the first boosting iteration, what would the resulting $\alpha_1$ be?
159.
Solution
In Step 1, we pick $\theta_1$. We wish to find $\theta_1$ to minimize $\frac{\partial J(\alpha, \theta)}{\partial \alpha} \big|_{\alpha=0}$. Equivalently, this $\theta_1$ is chosen to minimize the weighted sum $2\varepsilon_1 - 1 = - \sum_{i=1}^m \widetilde{W}_i^{(0)} y_i h(x_i; \theta)$, where $\widetilde{W}_i^{(0)} = \frac{1}{m}$ for all $i = 1, 2, \ldots, m$. If the training set is linearly separable with offset, then the no-slack SVM problem is feasible. Hence, the base classifier in this case will be an affine (linear with offset) separator $h(\cdot; \theta_1)$, which satisfies the inequality $y_i h(x_i; \theta_1) \geq 1$ for all $i = 1, 2, \ldots, m$.
In Step 2, we pick $\alpha_1$ to minimize $J(\alpha_1, \theta_1) = \sum_{i=1}^m \mathrm{Loss}(y_i f_0(x_i) + \alpha_1 y_i h(x_i; \theta_1)) = \sum_{i=1}^m \mathrm{Loss}(\alpha_1 y_i h(x_i; \theta_1))$. Note that $J(\alpha_1, \theta_1)$ is a sum of terms that are strictly decreasing in $\alpha_1$ (as $y_i h(x_i; \theta_1) \geq 1$); therefore, it itself is also strictly decreasing in $\alpha_1$. It follows that the boosting algorithm will take $\alpha_1 = \infty$ in order to minimize $J(\alpha_1, \theta_1)$.
This makes sense, because if we can find a base classifier that perfectly separates the data, we will weight it as much as we can to minimize the boosting loss. The lesson here is simple: when doing boosting, we need to use base classifiers that are not powerful enough to perfectly separate the data.
160.