
Deep Boosting

Corinna Cortes CORINNA@GOOGLE.COM

Google Research, 111 8th Avenue, New York, NY 10011

Mehryar Mohri MOHRI@CIMS.NYU.EDU

Courant Institute and Google Research, 251 Mercer Street, New York, NY 10012

Umar Syed USYED@GOOGLE.COM

Google Research, 111 8th Avenue, New York, NY 10011

Abstract

We present a new ensemble learning algorithm, DeepBoost, which can use as base classifiers a hypothesis set containing deep decision trees, or members of other rich or complex families, and succeed in achieving high accuracy without overfitting the data. The key to the success of the algorithm is a capacity-conscious criterion for the selection of the hypotheses. We give new data-dependent learning bounds for convex ensembles expressed in terms of the Rademacher complexities of the sub-families composing the base classifier set, and the mixture weight assigned to each sub-family. Our algorithm directly benefits from these guarantees since it seeks to minimize the corresponding learning bound. We give a full description of our algorithm, including the details of its derivation, and report the results of several experiments showing that its performance compares favorably to that of AdaBoost and Logistic Regression and their L1-regularized variants.

1. Introduction

Ensemble methods are general techniques in machine learning for combining several predictors or experts to create a more accurate one. In the batch learning setting, techniques such as bagging, boosting, stacking, error-correction techniques, Bayesian averaging, or other averaging schemes are prominent instances of these methods (Breiman, 1996; Freund & Schapire, 1997; Smyth & Wolpert, 1999; MacKay, 1991; Freund et al., 2004). Ensemble methods often significantly improve performance in practice (Quinlan, 1996; Bauer & Kohavi, 1999; Caruana et al., 2004; Dietterich, 2000; Schapire, 2003) and benefit from favorable learning guarantees. In particular, AdaBoost and its variants are based on a rich theoretical analysis, with performance guarantees in terms of the margins of the training samples (Schapire et al., 1997; Koltchinskii & Panchenko, 2002).

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Standard ensemble algorithms such as AdaBoost combine functions selected from a base classifier hypothesis set H. In many successful applications of AdaBoost, H is reduced to the so-called boosting stumps, that is, decision trees of depth one. For some difficult tasks in speech or image processing, simple boosting stumps are not sufficient to achieve a high level of accuracy. It is tempting then to use a more complex hypothesis set, for example the set of all decision trees with depth bounded by some relatively large number. But existing learning guarantees for AdaBoost depend not only on the margin and the number of training examples, but also on the complexity of H, measured in terms of its VC-dimension or its Rademacher complexity (Schapire et al., 1997; Koltchinskii & Panchenko, 2002). These learning bounds become looser when using too complex a base classifier set H. They suggest a risk of overfitting, which indeed can be observed in some experiments with AdaBoost (Grove & Schuurmans, 1998; Schapire, 1999; Dietterich, 2000; Ratsch et al., 2001b).

This paper explores the design of alternative ensemble algorithms using as base classifiers a hypothesis set H that may contain very deep decision trees, or members of some other very rich or complex families, and that can yet succeed in achieving a higher performance level. Assume that the set of base classifiers H can be decomposed as the union of p disjoint families H1, . . . , Hp ordered by increasing complexity, where Hk, k ∈ [1, p], could be for example the set of decision trees of depth k, or a set of functions based on monomials of degree k. Figure 1 shows a pictorial illustration. Of course, if we strictly confine ourselves to using hypotheses belonging only to families Hk with small k, then we are effectively using a smaller base classifier set H with favorable guarantees. But, to succeed in some challenging tasks, the use of a few more complex hypotheses could be needed.


Figure 1. Base classifier set H decomposed in terms of sub-families H1, . . . , Hp or their unions H1, H1 ∪ H2, . . . , H1 ∪ · · · ∪ Hp.

The main idea behind the design of our algorithms is that an ensemble based on hypotheses drawn from H1, . . . , Hp can achieve a higher accuracy by making use of hypotheses drawn from Hks with large k if it allocates more weight to hypotheses drawn from Hks with small k. But can we determine quantitatively the amounts of mixture weights apportioned to the different families? Can we provide learning guarantees for such algorithms?

Note that our objective is somewhat reminiscent of that of model selection, in particular Structural Risk Minimization (SRM) (Vapnik, 1998), but it differs from it in that we do not wish to limit our base classifier set to some optimal $H_q = \bigcup_{k=1}^q H_k$. Rather, we seek the freedom of using as base hypotheses even relatively deep trees from rich Hks, with the promise of doing so infrequently, or that of reserving for them a somewhat small weight contribution. This provides the flexibility of learning with deep hypotheses.

We present a new algorithm, DeepBoost, whose design is precisely guided by the ideas just discussed. Our algorithm is grounded in a solid theoretical analysis that we present in Section 2. We give new data-dependent learning bounds for convex ensembles. These guarantees are expressed in terms of the Rademacher complexities of the sub-families Hk and the mixture weight assigned to each Hk, in addition to the familiar margin terms and sample size. Our capacity-conscious algorithm is derived via the application of a coordinate descent technique seeking to minimize such learning bounds. We give a full description of our algorithm, including the details of its derivation and its pseudocode (Section 3), and discuss its connection with previous boosting-style algorithms. We also report the results of several experiments (Section 4) demonstrating that its performance compares favorably to that of AdaBoost, which is known to be one of the most competitive binary classification algorithms.

2. Data-dependent learning guarantees for convex ensembles with multiple hypothesis sets

Non-negative linear combination ensembles such as boosting or bagging typically assume that base functions are selected from the same hypothesis set H. Margin-based generalization bounds were given for ensembles of base functions taking values in {−1, +1} by Schapire et al. (1997) in terms of the VC-dimension of H. Tighter margin bounds with simpler proofs were later given by Koltchinskii & Panchenko (2002), see also (Bartlett & Mendelson, 2002), for the more general case of a family H taking arbitrary real values, in terms of the Rademacher complexity of H.

Here, we also consider base hypotheses taking arbitrary real values, but assume that they can be selected from several distinct hypothesis sets H1, . . . , Hp with p ≥ 1, and present margin-based learning guarantees in terms of the Rademacher complexity of these sets. Remarkably, the complexity term of these bounds admits an explicit dependency in terms of the mixture coefficients defining the ensembles. Thus, the ensemble family we consider is $\mathcal{F} = \operatorname{conv}(\bigcup_{k=1}^p H_k)$, that is the family of functions $f$ of the form $f = \sum_{t=1}^T \alpha_t h_t$, where $\alpha = (\alpha_1, \ldots, \alpha_T)$ is in the simplex $\Delta$ and where, for each $t \in [1, T]$, $h_t$ is in $H_{k_t}$ for some $k_t \in [1, p]$.

Let X denote the input space. H1, . . . , Hp are thus families of functions mapping from X to R. We consider the familiar supervised learning scenario and assume that training and test points are drawn i.i.d. according to some distribution D over X × {−1, +1} and denote by S = ((x1, y1), . . . , (xm, ym)) a training sample of size m drawn according to $D^m$.

Let ρ > 0. For a function f taking values in R, we denote by R(f) its binary classification error, by $R_\rho(f)$ its ρ-margin error, and by $\widehat{R}_{S,\rho}(f)$ its empirical margin error:

$$R(f) = \mathbb{E}_{(x,y)\sim D}\big[1_{yf(x)\le 0}\big], \qquad R_\rho(f) = \mathbb{E}_{(x,y)\sim D}\big[1_{yf(x)\le \rho}\big], \qquad \widehat{R}_{S,\rho}(f) = \mathbb{E}_{(x,y)\sim S}\big[1_{yf(x)\le \rho}\big],$$

where the notation (x, y) ∼ S indicates that (x, y) is drawn according to the empirical distribution defined by S.

The following theorem gives a margin-based Rademacher complexity bound for learning with such functions in the binary classification case. As with other Rademacher complexity learning guarantees, our bound is data-dependent, which is an important and favorable characteristic of our results. For p = 1, that is for the special case of a single hypothesis set, the analysis coincides with that of the standard ensemble margin bounds (Koltchinskii & Panchenko, 2002).

Theorem 1. Assume p > 1. Fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ over the choice of a sample S of size m drawn i.i.d. according to $D^m$, the following inequality holds for all $f = \sum_{t=1}^T \alpha_t h_t \in \mathcal{F}$:

$$R(f) \le \widehat{R}_{S,\rho}(f) + \frac{4}{\rho}\sum_{t=1}^T \alpha_t\,\mathfrak{R}_m(H_{k_t}) + \frac{2}{\rho}\sqrt{\frac{\log p}{m}} + \sqrt{\Big\lceil \frac{4}{\rho^2}\log\Big[\frac{\rho^2 m}{\log p}\Big]\Big\rceil \frac{\log p}{m} + \frac{\log\frac{2}{\delta}}{2m}}.$$

Thus, $R(f) \le \widehat{R}_{S,\rho}(f) + \frac{4}{\rho}\sum_{t=1}^T \alpha_t\,\mathfrak{R}_m(H_{k_t}) + C(m, p)$, with $C(m, p) = O\Big(\sqrt{\frac{\log p}{\rho^2 m}\log\Big[\frac{\rho^2 m}{\log p}\Big]}\Big)$.

This result is remarkable since the complexity term in the right-hand side of the bound admits an explicit dependency on the mixture coefficients αt. It is a weighted average of Rademacher complexities with mixture weights αt, t ∈ [1, T]. Thus, the second term of the bound suggests that, while some hypothesis sets Hk used for learning could have a large Rademacher complexity, this may not be detrimental to generalization if the corresponding total mixture weight (sum of the αts corresponding to that hypothesis set) is relatively small. Such complex families offer the potential of achieving a better margin on the training sample.
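To make the role of this term concrete, here is a small Python sketch (our own illustration, not from the paper; the numerical values are hypothetical) that evaluates the capacity term $\frac{4}{\rho}\sum_t \alpha_t\,\mathfrak{R}_m(H_{k_t})$ from per-family complexity estimates; it shows how a high-complexity family contributes little to the bound when its total mixture weight is small.

def capacity_term(alphas, family_of, rademacher, rho):
    """Capacity term (4/rho) * sum_t alpha_t * R_m(H_{k_t}) of Theorem 1."""
    return (4.0 / rho) * sum(a * rademacher[family_of[t]]
                             for t, a in enumerate(alphas))

# Toy example: most of the weight on the simple family H_1, a little on the deep H_5.
rademacher = {1: 0.02, 5: 0.20}   # assumed per-family Rademacher complexity estimates
alphas = [0.5, 0.45, 0.05]        # mixture weights (sum <= 1)
family_of = {0: 1, 1: 1, 2: 5}    # hypothesis index t -> family index k_t
print(capacity_term(alphas, family_of, rademacher, rho=0.5))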

The theorem cannot be proven via a standard Rademacher complexity analysis such as that of Koltchinskii & Panchenko (2002) since the complexity term of the bound would then be the Rademacher complexity of the family of hypotheses $\mathcal{F} = \operatorname{conv}(\bigcup_{k=1}^p H_k)$ and would not depend on the specific weights αt defining a given function f. Furthermore, the complexity term of a standard Rademacher complexity analysis is always lower bounded by the complexity term appearing in our bound. Indeed, since $\mathfrak{R}_m(\operatorname{conv}(\bigcup_{k=1}^p H_k)) = \mathfrak{R}_m(\bigcup_{k=1}^p H_k)$, the following lower bound holds for any choice of the non-negative mixture weights αt summing to one:

$$\mathfrak{R}_m(\mathcal{F}) \;\ge\; \max_{k\in[1,p]} \mathfrak{R}_m(H_k) \;\ge\; \sum_{t=1}^T \alpha_t\,\mathfrak{R}_m(H_{k_t}). \qquad (1)$$

Thus, Theorem 1 provides a finer learning bound than the one obtained via a standard Rademacher complexity analysis. The full proof of the theorem is given in Appendix A. Our proof technique exploits standard tools used to derive Rademacher complexity learning bounds (Koltchinskii & Panchenko, 2002) as well as a technique used by Schapire, Freund, Bartlett, and Lee (1997) to derive early VC-dimension margin bounds. Using other standard techniques as in (Koltchinskii & Panchenko, 2002; Mohri et al., 2012), Theorem 1 can be straightforwardly generalized to hold uniformly for all ρ > 0 at the price of an additional term that is in $O\Big(\sqrt{\frac{\log\log_2\frac{2}{\rho}}{m}}\Big)$.

3. Algorithm

In this section, we will use the learning guarantees of Section 2 to derive a capacity-conscious ensemble algorithm for binary classification.

3.1. Optimization problem

Let H1, . . . , Hp be p disjoint families of functions taking values in [−1, +1] with increasing Rademacher complexities $\mathfrak{R}_m(H_k)$, k ∈ [1, p]. We will assume that the hypothesis sets Hk are symmetric, that is, for any h ∈ Hk, we also have (−h) ∈ Hk, which holds for most hypothesis sets typically considered in practice. This assumption is not necessary but it helps simplify the presentation of our algorithm. For any hypothesis $h \in \bigcup_{k=1}^p H_k$, we denote by d(h) the index of the hypothesis set it belongs to, that is $h \in H_{d(h)}$. The bound of Theorem 1 holds uniformly for all ρ > 0 and functions $f \in \operatorname{conv}(\bigcup_{k=1}^p H_k)$.¹ Since the last term of the bound does not depend on α, it suggests selecting α to minimize

$$G(\alpha) = \frac{1}{m}\sum_{i=1}^m 1_{y_i \sum_{t=1}^T \alpha_t h_t(x_i) \le \rho} + \frac{4}{\rho}\sum_{t=1}^T \alpha_t r_t,$$

where $r_t = \mathfrak{R}_m(H_{d(h_t)})$. Since for any ρ > 0, f and f/ρ admit the same generalization error, we can instead search for α ≥ 0 with $\sum_{t=1}^T \alpha_t \le 1/\rho$, which leads to

$$\min_{\alpha \ge 0}\; \frac{1}{m}\sum_{i=1}^m 1_{y_i \sum_{t=1}^T \alpha_t h_t(x_i) \le 1} + 4\sum_{t=1}^T \alpha_t r_t \quad \text{s.t.} \quad \sum_{t=1}^T \alpha_t \le \frac{1}{\rho}.$$

The first term of the objective is not a convex function of α and its minimization is known to be computationally hard. Thus, we will consider instead a convex upper bound. Let $u \mapsto \Phi(-u)$ be a non-increasing convex function upper bounding $u \mapsto 1_{u \le 0}$, with Φ differentiable over R and $\Phi'(u) \ne 0$ for all u. Φ may be selected to be for example the exponential function as in AdaBoost (Freund & Schapire, 1997) or the logistic function. Using such an upper bound, we obtain the following convex optimization problem:

$$\min_{\alpha \ge 0}\; \frac{1}{m}\sum_{i=1}^m \Phi\Big(1 - y_i\sum_{t=1}^T \alpha_t h_t(x_i)\Big) + \lambda\sum_{t=1}^T \alpha_t r_t \quad \text{s.t.} \quad \sum_{t=1}^T \alpha_t \le \frac{1}{\rho}, \qquad (2)$$

where we introduced a parameter λ ≥ 0 controlling the balance between the magnitude of the values taken by the function Φ and the second term. Introducing a Lagrange variable β ≥ 0 associated to the constraint in (2), the problem can be equivalently written as

$$\min_{\alpha \ge 0}\; \frac{1}{m}\sum_{i=1}^m \Phi\Big(1 - y_i\sum_{t=1}^T \alpha_t h_t(x_i)\Big) + \sum_{t=1}^T (\lambda r_t + \beta)\alpha_t.$$

¹ The condition $\sum_{t=1}^T \alpha_t = 1$ of Theorem 1 can be relaxed to $\sum_{t=1}^T \alpha_t \le 1$. To see this, use for example a null hypothesis ($h_t = 0$ for some t).

Here, β is a parameter that can be freely selected by the algorithm since any choice of its value is equivalent to a choice of ρ in (2). Let {h1, . . . , hN} be the set of distinct base functions, and let G be the objective function based on that collection:

$$G(\alpha) = \frac{1}{m}\sum_{i=1}^m \Phi\Big(1 - y_i\sum_{j=1}^N \alpha_j h_j(x_i)\Big) + \sum_{j=1}^N (\lambda r_j + \beta)\alpha_j,$$

with $\alpha = (\alpha_1, \ldots, \alpha_N) \in \mathbb{R}^N$. Note that we can drop the requirement α ≥ 0 since the hypothesis sets are symmetric and $\alpha_t h_t = (-\alpha_t)(-h_t)$. For each hypothesis h, we keep either h or −h in {h1, . . . , hN}. Using the notation

Λj = λrj + β, (3)

for all j ∈ [1, N], our optimization problem can then be rewritten as $\min_\alpha F(\alpha)$ with

$$F(\alpha) = \frac{1}{m}\sum_{i=1}^m \Phi\Big(1 - y_i\sum_{j=1}^N \alpha_j h_j(x_i)\Big) + \sum_{j=1}^N \Lambda_j |\alpha_j|, \qquad (4)$$

with no non-negativity constraint on α. The function F is convex as a sum of convex functions and admits a subdifferential at every $\alpha \in \mathbb{R}^N$. We can design a boosting-style algorithm by applying coordinate descent to F(α).
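As an illustration of the objective being minimized, the following minimal Python sketch (our own, assuming the exponential surrogate Φ(−u) = exp(−u); function and variable names are not from the paper) evaluates F(α) of Eq. (4):

import numpy as np

def deepboost_objective(alpha, H, y, Lambda):
    """Objective F(alpha) of Eq. (4) with the exponential surrogate.

    alpha  : (N,) signed mixture weights over the distinct base hypotheses
    H      : (m, N) matrix with H[i, j] = h_j(x_i) in [-1, +1]
    y      : (m,) labels in {-1, +1}
    Lambda : (N,) per-hypothesis penalties Lambda_j = lambda * r_j + beta
    """
    margins = y * (H @ alpha)                  # y_i * sum_j alpha_j h_j(x_i)
    surrogate = np.exp(1.0 - margins).mean()   # (1/m) sum_i Phi(1 - y_i f(x_i))
    penalty = np.dot(Lambda, np.abs(alpha))    # weighted L1-norm term
    return surrogate + penalty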

Let $\alpha_t = (\alpha_{t,1}, \ldots, \alpha_{t,N})^\top$ denote the vector obtained after t ≥ 1 iterations and let $\alpha_0 = 0$. Let $e_k$ denote the kth unit vector in $\mathbb{R}^N$, k ∈ [1, N]. The direction $e_k$ and the step η selected at the tth round are those minimizing $F(\alpha_{t-1} + \eta e_k)$, that is

$$F(\alpha_{t-1} + \eta e_k) = \frac{1}{m}\sum_{i=1}^m \Phi\big(1 - y_i f_{t-1}(x_i) - \eta y_i h_k(x_i)\big) + \sum_{j \ne k}\Lambda_j|\alpha_{t-1,j}| + \Lambda_k|\alpha_{t-1,k} + \eta|,$$

where $f_{t-1} = \sum_{j=1}^N \alpha_{t-1,j} h_j$. For any t ∈ [1, T], we denote by $D_t$ the distribution defined by

$$D_t(i) = \frac{\Phi'\big(1 - y_i f_{t-1}(x_i)\big)}{S_t}, \qquad (5)$$

where $S_t$ is a normalization factor, $S_t = \sum_{i=1}^m \Phi'\big(1 - y_i f_{t-1}(x_i)\big)$. For any s ∈ [1, T] and j ∈ [1, N], we denote by $\epsilon_{s,j}$ the weighted error of hypothesis $h_j$ for the distribution $D_s$:

$$\epsilon_{s,j} = \frac{1}{2}\Big[1 - \mathbb{E}_{i \sim D_s}\big[y_i h_j(x_i)\big]\Big]. \qquad (6)$$

3.2. DeepBoost

Figure 2 shows the pseudocode of the algorithm DeepBoost derived by applying coordinate descent to the objective function (4). The details of the derivation of the expression are given in Appendix B.

DEEPBOOST(S = ((x1, y1), . . . , (xm, ym)))
 1  for i ← 1 to m do
 2      D1(i) ← 1/m
 3  for t ← 1 to T do
 4      for j ← 1 to N do
 5          if (αt−1,j ≠ 0) then
 6              dj ← (εt,j − 1/2) + sgn(αt−1,j) Λj m / (2 St)
 7          elseif (|εt,j − 1/2| ≤ Λj m / (2 St)) then
 8              dj ← 0
 9          else dj ← (εt,j − 1/2) − sgn(εt,j − 1/2) Λj m / (2 St)
10      k ← argmax_{j ∈ [1,N]} |dj|
11      εt ← εt,k
12      if (|(1 − εt) e^{αt−1,k} − εt e^{−αt−1,k}| ≤ Λk m / St) then
13          ηt ← −αt−1,k
14      elseif ((1 − εt) e^{αt−1,k} − εt e^{−αt−1,k} > Λk m / St) then
15          ηt ← log[ −Λk m/(2 εt St) + sqrt( [Λk m/(2 εt St)]² + (1 − εt)/εt ) ]
16      else ηt ← log[ +Λk m/(2 εt St) + sqrt( [Λk m/(2 εt St)]² + (1 − εt)/εt ) ]
17      αt ← αt−1 + ηt ek
18      St+1 ← Σ_{i=1}^m Φ′(1 − yi Σ_{j=1}^N αt,j hj(xi))
19      for i ← 1 to m do
20          Dt+1(i) ← Φ′(1 − yi Σ_{j=1}^N αt,j hj(xi)) / St+1
21  f ← Σ_{j=1}^N αT,j hj
22  return f

Figure 2. Pseudocode of the DeepBoost algorithm for both the exponential loss and the logistic loss. The expression of the weighted error εt,j is given in (6). In the generic case of a surrogate loss Φ different from the exponential or logistic losses, ηt is found instead via a line search or other numerical methods from ηt = argmin_η F(αt−1 + η ek).

In the special cases of the exponential loss ($\Phi(-u) = \exp(-u)$) or the logistic loss ($\Phi(-u) = \log_2(1 + \exp(-u))$), a closed-form expression is given for the step size (lines 12-16), which is the same in both cases (see Sections B.4 and B.5). In the generic case, the step size ηt can be found using a line search or other numerical methods. Note that when the condition of line 12 is satisfied, the step taken by the algorithm cancels out the coordinate along the direction k, thereby leading to a sparser result. This is consistent with the fact that the objective function contains a second term based on a (weighted) L1-norm, which favors sparsity.
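For readers who prefer code to pseudocode, the following Python sketch (our own illustrative transcription, assuming base hypotheses taking values in [−1, +1], the exponential surrogate, and 0 < εt < 1) mirrors one round of Figure 2: it computes the gradients dj of lines 5-9, selects the coordinate k of line 10, and applies the closed-form step of lines 12-16.

import numpy as np

def deepboost_round(alpha, H, y, Lambda):
    """One coordinate-descent round of DeepBoost (exponential-loss sketch).

    alpha  : (N,) current signed mixture weights
    H      : (m, N) matrix of base-hypothesis values h_j(x_i) in [-1, +1]
    y      : (m,) labels in {-1, +1}
    Lambda : (N,) penalties Lambda_j = lambda * r_j + beta
    """
    m, N = H.shape
    margins = y * (H @ alpha)                   # y_i f_{t-1}(x_i)
    phi_prime = np.exp(1.0 - margins)           # Phi'(1 - y_i f_{t-1}(x_i))
    S = phi_prime.sum()
    D = phi_prime / S                           # distribution D_t of Eq. (5)
    eps = 0.5 * (1.0 - (D * y) @ H)             # weighted errors eps_{t,j} of Eq. (6)

    # Lines 5-9: descent gradients d_j.
    shift = Lambda * m / (2.0 * S)
    d = np.where(alpha != 0.0,
                 (eps - 0.5) + np.sign(alpha) * shift,
                 np.where(np.abs(eps - 0.5) <= shift, 0.0,
                          (eps - 0.5) - np.sign(eps - 0.5) * shift))
    k = int(np.argmax(np.abs(d)))               # line 10
    e_t, a_k, L_k = eps[k], alpha[k], Lambda[k]

    # Lines 12-16: closed-form step size for the exponential loss.
    gap = (1.0 - e_t) * np.exp(a_k) - e_t * np.exp(-a_k)
    thresh = L_k * m / S
    r = L_k * m / (2.0 * e_t * S)
    if abs(gap) <= thresh:
        eta = -a_k                              # cancel coordinate k (sparsity)
    elif gap > thresh:
        eta = np.log(-r + np.sqrt(r * r + (1.0 - e_t) / e_t))
    else:
        eta = np.log(r + np.sqrt(r * r + (1.0 - e_t) / e_t))

    new_alpha = alpha.copy()
    new_alpha[k] += eta
    return new_alpha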

Our algorithm is related to several other boosting-type algorithms devised in the past. For λ = 0 and β = 0 and using the exponential surrogate loss, it coincides with AdaBoost (Freund & Schapire, 1997), with precisely the same direction and same step $\log\big[\sqrt{\frac{1-\epsilon_t}{\epsilon_t}}\big]$, using $H = \bigcup_{k=1}^p H_k$ as the hypothesis set for base learners. This corresponds to ignoring the complexity term of our bound as well as the control of the sum of the mixture weights via β. For λ = 0 and β = 0 and using the logistic surrogate loss, our algorithm coincides with the additive Logistic Regression algorithm of Friedman et al. (1998).

In the special case where λ = 0 and β ≠ 0 and for the exponential surrogate loss, our algorithm matches the L1-norm regularized AdaBoost (e.g., see (Ratsch et al., 2001a)). For the same choice of the parameters and for the logistic surrogate loss, our algorithm matches the L1-norm regularized additive Logistic Regression studied by Duchi & Singer (2009) using the base learner hypothesis set $H = \bigcup_{k=1}^p H_k$. H may in general be very rich. The key foundation of our algorithm and analysis is instead to take into account the relative complexity of the sub-families Hk. Also, note that L1-norm regularized AdaBoost and Logistic Regression can be viewed as algorithms minimizing the learning bound obtained via the standard Rademacher complexity analysis (Koltchinskii & Panchenko, 2002), using the exponential or logistic surrogate losses. Instead, the objective function minimized by our algorithm is based on the generalization bound of Theorem 1, which, as discussed earlier, is a finer bound (see (1)). For λ = 0 but β ≠ 0, our algorithm is also close to the so-called unnormalized Arcing (Breiman, 1999) or AdaBoostρ (Ratsch & Warmuth, 2002) using H as a hypothesis set. AdaBoostρ coincides with AdaBoost modulo the step size, which is more conservative than that of AdaBoost and depends on ρ. Ratsch & Warmuth (2005) give another variant of the algorithm that does not require knowing the best ρ; see also the related work of Kivinen & Warmuth (1999); Warmuth et al. (2006).

Our algorithm directly benefits from the learning guarantees given in Section 2 since it seeks to minimize the bound of Theorem 1. In the next section, we report the results of our experiments with DeepBoost. Let us mention that we have also designed an alternative deep boosting algorithm that we briefly describe and discuss in Appendix C.

4. Experiments

An additional benefit of the learning bounds presented in Section 2 is that they are data-dependent. They are based on the Rademacher complexity of the base hypothesis sets Hk, which in some cases can be well estimated from the training sample. The algorithm DeepBoost directly inherits this advantage. For example, if the hypothesis set Hk is based on a positive definite kernel with sample matrix Kk, it is known that its empirical Rademacher complexity can be upper bounded by $\frac{\sqrt{\mathrm{Tr}[K_k]}}{m}$ and lower bounded by $\frac{1}{\sqrt{2}}\frac{\sqrt{\mathrm{Tr}[K_k]}}{m}$. In other cases, when Hk is a family of functions taking binary values, we can use an upper bound on the Rademacher complexity in terms of the growth function of Hk, $\Pi_{H_k}(m)$: $\mathfrak{R}_m(H_k) \le \sqrt{\frac{2\log \Pi_{H_k}(m)}{m}}$. Thus, for the family $H_1^{\text{stumps}}$ of boosting stumps in dimension d, $\Pi_{H_1^{\text{stumps}}}(m) \le 2md$, since there are 2m distinct threshold functions for each dimension with m points. Thus, the following inequality holds:

$$\mathfrak{R}_m(H_1^{\text{stumps}}) \le \sqrt{\frac{2\log(2md)}{m}}. \qquad (7)$$

Similarly, we consider the family $H_2^{\text{stumps}}$ of decision trees of depth 2 with the same question at the internal nodes of depth 1. We have $\Pi_{H_2^{\text{stumps}}}(m) \le (2m)^2\,\frac{d(d-1)}{2}$, since there are d(d − 1)/2 distinct trees of this type and since each induces at most (2m)² labelings. Thus, we can write

$$\mathfrak{R}_m(H_2^{\text{stumps}}) \le \sqrt{\frac{2\log(2m^2 d(d-1))}{m}}. \qquad (8)$$

More generally, we also consider the family $H_k^{\text{trees}}$ of all binary decision trees of depth k. For this family it is known that $\mathrm{VC\text{-}dim}(H_k^{\text{trees}}) \le (2^k + 1)\log_2(d + 1)$ (Mansour, 1997). More generally, the VC-dimension of $T_n$, the family of decision trees with n nodes in dimension d, can be bounded by $(2n + 1)\log_2(d + 2)$ (see for example (Mohri et al., 2012)). Since $\mathfrak{R}_m(H) \le \sqrt{\frac{2\,\mathrm{VC\text{-}dim}(H)\log(m+1)}{m}}$ for any hypothesis class H, we have

$$\mathfrak{R}_m(T_n) \le \sqrt{\frac{(4n + 2)\log_2(d + 2)\log(m + 1)}{m}}. \qquad (9)$$

The experiments with DeepBoost described below use either $H^{\text{stumps}} = H_1^{\text{stumps}} \cup H_2^{\text{stumps}}$ or $H_K^{\text{trees}} = \bigcup_{k=1}^K H_k^{\text{trees}}$, for some K > 0, as the base hypothesis sets. For any hypothesis in these sets, DeepBoost will use the upper bounds given above as a proxy for the Rademacher complexity of the set to which it belongs. We leave it to the future to experiment with finer data-dependent estimates or upper bounds on the Rademacher complexity, which could further improve the performance of our algorithm. Recall that each iteration of DeepBoost searches for the base hypothesis that is optimal with respect to a certain criterion (see lines 5-10 of Figure 2). While an exhaustive search is feasible for $H_1^{\text{stumps}}$, it would be far too expensive to visit all trees in $H_K^{\text{trees}}$ when K is large. Therefore, when using $H_K^{\text{trees}}$ and also $H_2^{\text{stumps}}$ as the base hypotheses, we use the following heuristic search procedure in each iteration t: First, the optimal tree $h_1^* \in H_1^{\text{trees}}$ is found via exhaustive search. Next, for all 1 < k ≤ K, a locally optimal tree $h_k^* \in H_k^{\text{trees}}$ is found by considering only trees that can be obtained by adding a single layer of leaves to $h_{k-1}^*$. Finally, we select the best hypothesis in the set $\{h_1^*, \ldots, h_K^*, h_1, \ldots, h_{t-1}\}$, where $h_1, \ldots, h_{t-1}$ are the hypotheses selected in previous iterations.
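As an illustration, the following Python helper (our own sketch, not code from the paper) computes complexity proxies of the kind used above: the growth-function bounds (7)-(8) for stumps and, under the assumption that the depth-k family bound VC-dim(Hk^trees) ≤ (2^k + 1) log2(d + 1) is plugged into the Rademacher/VC inequality, a corresponding proxy for depth-k trees.

import math

def rademacher_stumps1(m, d):
    """Upper bound (7) for depth-1 stumps in dimension d on m points."""
    return math.sqrt(2.0 * math.log(2 * m * d) / m)

def rademacher_stumps2(m, d):
    """Upper bound (8) for the depth-2 stump family H2^stumps."""
    return math.sqrt(2.0 * math.log(2 * m * m * d * (d - 1)) / m)

def rademacher_trees(m, d, k):
    """Proxy for depth-k binary trees, assuming
    VC-dim <= (2^k + 1) * log2(d + 1) and R_m(H) <= sqrt(2 VC-dim log(m+1) / m)."""
    vc = (2 ** k + 1) * math.log2(d + 1)
    return math.sqrt(2.0 * vc * math.log(m + 1) / m)

# Example: proxies for a sample of m = 1000 points in dimension d = 20.
print(rademacher_stumps1(1000, 20), rademacher_trees(1000, 20, k=4))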


Table 1. Results for boosted decision stumps and the exponential loss function. For each dataset, the four columns are: AdaBoost with H1^stumps, AdaBoost with H2^stumps, AdaBoost-L1, DeepBoost.

breastcancer    Error: 0.0429, 0.0437, 0.0408, 0.0373     (std dev): 0.0248, 0.0214, 0.0223, 0.0225
                Avg tree size: 1, 2, 1.436, 1.215         Avg no. of trees: 100, 100, 43.6, 21.6
ocr17           Error: 0.0085, 0.008, 0.0075, 0.0070      (std dev): 0.0072, 0.0054, 0.0068, 0.0048
                Avg tree size: 1, 2, 1.086, 1.369         Avg no. of trees: 100, 100, 37.8, 36.9
ionosphere      Error: 0.1014, 0.075, 0.0708, 0.0638      (std dev): 0.0414, 0.0413, 0.0331, 0.0394
                Avg tree size: 1, 2, 1.392, 1.168         Avg no. of trees: 100, 100, 39.35, 17.45
ocr49           Error: 0.0555, 0.032, 0.03, 0.0275        (std dev): 0.0167, 0.0114, 0.0122, 0.0095
                Avg tree size: 1, 2, 1.99, 1.96           Avg no. of trees: 100, 100, 99.3, 96
german          Error: 0.243, 0.2505, 0.2455, 0.2395      (std dev): 0.0445, 0.0487, 0.0438, 0.0462
                Avg tree size: 1, 2, 1.54, 1.76           Avg no. of trees: 100, 100, 54.1, 76.5
ocr17-mnist     Error: 0.0056, 0.0048, 0.0046, 0.0040     (std dev): 0.0017, 0.0014, 0.0013, 0.0014
                Avg tree size: 1, 2, 2, 1.99              Avg no. of trees: 100, 100, 100, 100
diabetes        Error: 0.253, 0.260, 0.254, 0.253         (std dev): 0.0330, 0.0518, 0.04868, 0.0510
                Avg tree size: 1, 2, 1.9975, 1.9975       Avg no. of trees: 100, 100, 100, 100
ocr49-mnist     Error: 0.0414, 0.0209, 0.0200, 0.0177     (std dev): 0.00539, 0.00521, 0.00408, 0.00438
                Avg tree size: 1, 2, 1.9975, 1.9975       Avg no. of trees: 100, 100, 100, 100

Breiman (1999) and Reyzin & Schapire (2006) extensively investigated the relationship between the complexity of decision trees in an ensemble learned by AdaBoost and the generalization error of the ensemble. We tested DeepBoost on the same UCI datasets used by these authors, http://archive.ics.uci.edu/ml/datasets.html, specifically breastcancer, ionosphere, german (numeric) and diabetes. We also experimented with two optical character recognition datasets used by Reyzin & Schapire (2006), ocr17 and ocr49, which contain the handwritten digits 1 and 7, and 4 and 9 (respectively). Finally, because these OCR datasets are fairly small, we also constructed the analogous datasets from all of MNIST, http://yann.lecun.com/exdb/mnist/, which we call ocr17-mnist and ocr49-mnist. More details on all the datasets are given in Table 4, Appendix D.1.

As we discussed in Section 3.2, by fixing the parameters β and λ to certain values, we recover some known algorithms as special cases of DeepBoost. Our experiments compared DeepBoost to AdaBoost (β = λ = 0 with exponential loss), to Logistic Regression (β = λ = 0 with logistic loss), which we abbreviate as LogReg, to L1-norm regularized AdaBoost (e.g., see (Ratsch et al., 2001a)), abbreviated as AdaBoost-L1, and also to the L1-norm regularized additive Logistic Regression algorithm studied by Duchi & Singer (2009) (β > 0, λ = 0), abbreviated as LogReg-L1.

In the first set of experiments, reported in Table 1, we compared AdaBoost, AdaBoost-L1, and DeepBoost with the exponential loss (Φ(−u) = exp(−u)) and base hypotheses H^stumps. We tested standard AdaBoost with base hypotheses $H_1^{\text{stumps}}$ and $H_2^{\text{stumps}}$. For AdaBoost-L1, we optimized over β ∈ {2^{−i} : i = 6, . . . , 0} and for DeepBoost, we optimized over β in the same range and λ ∈ {0.0001, 0.005, 0.01, 0.05, 0.1, 0.5}. The exact parameter optimization procedure is described below.

In the second set of experiments, reported in Table 2, we used base hypotheses $H_K^{\text{trees}}$ instead of H^stumps, where the maximum tree depth K was an additional parameter to be optimized. Specifically, for AdaBoost we optimized over K ∈ {1, . . . , 6}, for AdaBoost-L1 we optimized over those same values for K and β ∈ {10^{−i} : i = 3, . . . , 7}, and for DeepBoost we optimized over those same values for K, β and λ ∈ {10^{−i} : i = 3, . . . , 7}.

The last set of experiments, reported in Table 3, is identical to the experiments reported in Table 2, except that we used the logistic loss Φ(−u) = log₂(1 + exp(−u)).

We used the following parameter optimization procedure in all experiments: Each dataset was randomly partitioned into 10 folds, and each algorithm was run 10 times, with a different assignment of folds to the training set, validation set and test set for each run. Specifically, for each run i ∈ {0, . . . , 9}, fold i was used for testing, fold i + 1 (mod 10) was used for validation, and the remaining folds were used for training. For each run, we selected the parameters that had the lowest error on the validation set and then measured the error of those parameters on the test set. The average error and the standard deviation of the error over all 10 runs are reported in Tables 1, 2 and 3, as are the average number of trees and the average size of the trees in the ensembles.
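The fold-rotation scheme can be summarized by the following small Python sketch (a hypothetical helper, not from the paper):

def fold_assignment(num_folds=10):
    """Test/validation/training fold indices for each run, as described above."""
    runs = []
    for i in range(num_folds):
        test_fold = i
        valid_fold = (i + 1) % num_folds
        train_folds = [f for f in range(num_folds) if f not in (test_fold, valid_fold)]
        runs.append((test_fold, valid_fold, train_folds))
    return runs

for test, valid, train in fold_assignment():
    print(f"run: test fold {test}, validation fold {valid}, training folds {train}")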

In all of our experiments, the number of iterations was set to 100.


Table 2. Results for boosted decision trees and the exponential loss function. For each dataset, the three columns are: AdaBoost, AdaBoost-L1, DeepBoost.

breastcancer    Error: 0.0267, 0.0264, 0.0243        (std dev): 0.00841, 0.0098, 0.00797
                Avg tree size: 29.1, 28.9, 20.9      Avg no. of trees: 67.1, 51.7, 55.9
ocr17           Error: 0.004, 0.003, 0.002           (std dev): 0.00316, 0.00100, 0.00100
                Avg tree size: 15.0, 30.4, 26.0      Avg no. of trees: 88.3, 65.3, 61.8
ionosphere      Error: 0.0661, 0.0657, 0.0501        (std dev): 0.0315, 0.0257, 0.0316
                Avg tree size: 29.8, 31.4, 26.1      Avg no. of trees: 75.0, 69.4, 50.0
ocr49           Error: 0.0180, 0.0175, 0.0175        (std dev): 0.00555, 0.00357, 0.00510
                Avg tree size: 30.9, 62.1, 30.2      Avg no. of trees: 92.4, 89.0, 83.0
german          Error: 0.239, 0.239, 0.234           (std dev): 0.0165, 0.0201, 0.0148
                Avg tree size: 3, 7, 16.0            Avg no. of trees: 91.3, 87.5, 14.1
ocr17-mnist     Error: 0.00471, 0.00471, 0.00409     (std dev): 0.0022, 0.0021, 0.0021
                Avg tree size: 15, 33.4, 22.1        Avg no. of trees: 88.7, 66.8, 59.2
diabetes        Error: 0.249, 0.240, 0.230           (std dev): 0.0272, 0.0313, 0.0399
                Avg tree size: 3, 3, 5.37            Avg no. of trees: 45.2, 28.0, 19.0
ocr49-mnist     Error: 0.0198, 0.0197, 0.0182        (std dev): 0.00500, 0.00512, 0.00551
                Avg tree size: 29.9, 66.3, 30.1      Avg no. of trees: 82.4, 81.1, 80.9

Table 3. Results for boosted decision trees and the logistic loss function. For each dataset, the three columns are: LogReg, LogReg-L1, DeepBoost.

breastcancer    Error: 0.0351, 0.0264, 0.0264        (std dev): 0.0101, 0.0120, 0.00876
                Avg tree size: 15, 59.9, 14.0        Avg no. of trees: 65.3, 16.0, 23.8
ocr17           Error: 0.00300, 0.00400, 0.00250     (std dev): 0.00100, 0.00141, 0.000866
                Avg tree size: 15.0, 7, 22.1         Avg no. of trees: 75.3, 53.8, 25.8
ionosphere      Error: 0.074, 0.060, 0.043           (std dev): 0.0236, 0.0219, 0.0188
                Avg tree size: 7, 30.0, 18.4         Avg no. of trees: 44.7, 25.3, 29.5
ocr49           Error: 0.0205, 0.0200, 0.0170        (std dev): 0.00654, 0.00245, 0.00361
                Avg tree size: 31.0, 31.0, 63.2      Avg no. of trees: 63.5, 54.0, 37.0
german          Error: 0.233, 0.232, 0.225           (std dev): 0.0114, 0.0123, 0.0103
                Avg tree size: 7, 7, 14.4            Avg no. of trees: 72.8, 66.8, 67.8
ocr17-mnist     Error: 0.00422, 0.00417, 0.00399     (std dev): 0.00191, 0.00188, 0.00211
                Avg tree size: 15, 15, 25.9          Avg no. of trees: 71.4, 55.6, 27.6
diabetes        Error: 0.250, 0.246, 0.246           (std dev): 0.0374, 0.0356, 0.0356
                Avg tree size: 3, 3, 3               Avg no. of trees: 46.0, 45.5, 45.5
ocr49-mnist     Error: 0.0211, 0.0201, 0.0201        (std dev): 0.00412, 0.00433, 0.00411
                Avg tree size: 28.7, 33.5, 72.8      Avg no. of trees: 79.3, 61.7, 41.9

We also experimented with running each algorithm for up to 1,000 iterations, but observed that the test errors did not change significantly, and, more importantly, the ordering of the algorithms by their test errors was unchanged from 100 iterations to 1,000 iterations.

Observe that with the exponential loss, DeepBoost has a smaller test error than AdaBoost and AdaBoost-L1 on every dataset and for every set of base hypotheses, except for the ocr49-mnist dataset with decision trees, where its performance matches that of AdaBoost-L1. Similarly, with the logistic loss, DeepBoost always performs at least as well as LogReg or LogReg-L1. For the small-sized UCI datasets it is difficult to obtain statistically significant results, but, for the larger ocrXX-mnist datasets, our results with DeepBoost are statistically significantly better at the 2% level using one-sided paired t-tests in all three sets of experiments (three tables), except for ocr49-mnist in Table 3, where this holds only for the comparison with LogReg.

This across-the-board improvement is the result of DeepBoost's complexity-conscious ability to dynamically tune the sizes of the decision trees selected in each boosting round, trading off between training error and hypothesis class complexity. The selected tree sizes should depend on properties of the training set, and this is borne out by our experiments: For some datasets, such as breastcancer, DeepBoost selects trees that are smaller on average than the trees selected by AdaBoost-L1 or LogReg-L1, while, for other datasets, such as german, the average tree size is larger. Note that AdaBoost and AdaBoost-L1 produce ensembles of trees that have a constant depth since neither algorithm penalizes tree size except for imposing a maximum tree depth K, while for DeepBoost the trees in one ensemble typically vary in size.


Figure 3. Distribution of tree sizes when DeepBoost is run on the ionosphere dataset. (Histogram; x-axis: tree sizes, y-axis: frequency.)

Figure 3 plots the distribution of tree sizes for one run of DeepBoost. It should be noted that the columns for AdaBoost in Table 1 simply list the number of stumps to be the same as the number of boosting rounds; a careful examination of the ensembles for 100 rounds of boosting typically reveals a 5% duplication of stumps in the ensembles.

Theorem 1 is a margin-based generalization guarantee, and is also the basis for the derivation of DeepBoost, so we should expect DeepBoost to induce large margins on the training set. Figure 4 shows the margin distributions for AdaBoost, AdaBoost-L1 and DeepBoost on the same subset of the ionosphere dataset.

5. Conclusion

We presented a theoretical analysis of learning with a base hypothesis set composed of increasingly complex sub-families, including very deep or complex ones, and derived an algorithm, DeepBoost, which is precisely based on those guarantees. We also reported the results of experiments with this algorithm and compared its performance with that of AdaBoost and additive Logistic Regression, and their L1-norm regularized counterparts, in several tasks. We have derived similar theoretical guarantees in the multi-class setting and used them to derive a family of new multi-class deep boosting algorithms that we will present and discuss elsewhere. Our theoretical analysis and algorithmic design could also be extended to ranking and to a broad class of loss functions. This should also lead to the generalization of several existing algorithms and their use with a richer hypothesis set structured as a union of families with different Rademacher complexity. In particular, the broad family of maximum entropy models and conditional maximum entropy models and their many variants, which includes the already discussed logistic regression, could all be extended in a similar way. The resulting DeepMaxent models (or their conditional versions) may admit an alternative theoretical justification that we will discuss elsewhere. Our algorithm can also be extended by considering non-differentiable convex surrogate losses such as the hinge loss. When used with kernel base classifiers, this leads to an algorithm we have named DeepSVM. The theory we developed could perhaps be further generalized to encompass the analysis of other learning techniques such as multi-layer neural networks.

Figure 4. Distribution of normalized margins for AdaBoost (upper right), AdaBoost-L1 (upper left) and DeepBoost (lower left) on the same subset of ionosphere (panels "Ion: AdaBoost", "Ion: AdaBoost-L1", "Ion: DeepBoost", fold = 6; x-axis: normalized margin, y-axis: frequency). The cumulative margin distributions (lower right) illustrate that DeepBoost (red) induces larger margins on the training set than either AdaBoost (black) or AdaBoost-L1 (blue).


Our analysis and algorithm also shed some new light on some remaining questions about the theory underlying AdaBoost. The primary theoretical justification for AdaBoost is a margin guarantee (Schapire et al., 1997; Koltchinskii & Panchenko, 2002). However, AdaBoost does not precisely maximize the minimum margin, while other algorithms such as arc-gv (Breiman, 1999) that are designed to do so tend not to outperform AdaBoost (Reyzin & Schapire, 2006). Two main reasons are suspected for this observation: (1) in order to achieve a better margin, algorithms such as arc-gv may tend to select deeper decision trees or in general more complex hypotheses, which may then affect their generalization; (2) while those algorithms achieve a better margin, they do not achieve a better margin distribution. Our theory may help better understand and evaluate the effect of factor (1) since our learning bounds explicitly depend on the mixture weights and the contribution of each hypothesis set Hk to the definition of the ensemble function. However, our guarantees also suggest a better algorithm, DeepBoost.

Acknowledgments

We thank Vitaly Kuznetsov for his comments on an earlier draft of this paper. The work of M. Mohri was partly funded by the NSF award IIS-1117591.


References

Bartlett, Peter L. and Mendelson, Shahar. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3, 2002.

Bauer, Eric and Kohavi, Ron. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105-139, 1999.

Breiman, Leo. Bagging predictors. Machine Learning, 24(2):123-140, 1996.

Breiman, Leo. Prediction games and arcing algorithms. Neural Computation, 11(7):1493-1517, 1999.

Caruana, Rich, Niculescu-Mizil, Alexandru, Crew, Geoff, and Ksikes, Alex. Ensemble selection from libraries of models. In ICML, 2004.

Dietterich, Thomas G. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139-157, 2000.

Duchi, John C. and Singer, Yoram. Boosting with structural sparsity. In ICML, pp. 38, 2009.

Freund, Yoav and Schapire, Robert E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

Freund, Yoav, Mansour, Yishay, and Schapire, Robert E. Generalization bounds for averaged classifiers. The Annals of Statistics, 32:1698-1722, 2004.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28:2000, 1998.

Grove, Adam J. and Schuurmans, Dale. Boosting in the limit: Maximizing the margin of learned ensembles. In AAAI/IAAI, pp. 692-699, 1998.

Kivinen, Jyrki and Warmuth, Manfred K. Boosting as entropy projection. In COLT, pp. 134-144, 1999.

Koltchinskii, Vladimir and Panchenko, Dmitry. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.

MacKay, David J. C. Bayesian methods for adaptive models. PhD thesis, California Institute of Technology, 1991.

Mansour, Yishay. Pessimistic decision tree pruning based on tree size. In Proceedings of ICML, pp. 195-201, 1997.

Mohri, Mehryar, Rostamizadeh, Afshin, and Talwalkar, Ameet. Foundations of Machine Learning. The MIT Press, 2012.

Quinlan, J. Ross. Bagging, boosting, and C4.5. In AAAI/IAAI, Vol. 1, pp. 725-730, 1996.

Ratsch, Gunnar and Warmuth, Manfred K. Maximizing the margin with boosting. In COLT, pp. 334-350, 2002.

Ratsch, Gunnar and Warmuth, Manfred K. Efficient margin maximizing with boosting. Journal of Machine Learning Research, 6:2131-2152, 2005.

Ratsch, Gunnar, Mika, Sebastian, and Warmuth, Manfred K. On the convergence of leveraging. In NIPS, pp. 487-494, 2001a.

Ratsch, Gunnar, Onoda, Takashi, and Muller, Klaus-Robert. Soft margins for AdaBoost. Machine Learning, 42(3):287-320, 2001b.

Reyzin, Lev and Schapire, Robert E. How boosting the margin can also boost classifier complexity. In ICML, pp. 753-760, 2006.

Schapire, Robert E. Theoretical views of boosting and applications. In Proceedings of ALT 1999, volume 1720 of Lecture Notes in Computer Science, pp. 13-25. Springer, 1999.

Schapire, Robert E. The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification, pp. 149-172. Springer, 2003.

Schapire, Robert E., Freund, Yoav, Bartlett, Peter, and Lee, Wee Sun. Boosting the margin: A new explanation for the effectiveness of voting methods. In ICML, pp. 322-330, 1997.

Smyth, Padhraic and Wolpert, David. Linearly combining density estimators via stacking. Machine Learning, 36:59-83, July 1999.

Vapnik, Vladimir N. Statistical Learning Theory. Wiley-Interscience, 1998.

Warmuth, Manfred K., Liao, Jun, and Ratsch, Gunnar. Totally corrective boosting algorithms that maximize the margin. In ICML, pp. 1001-1008, 2006.


A. Proof of Theorem 1

Theorem 1. Assume p > 1. Fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ over the choice of a sample S of size m drawn i.i.d. according to $D^m$, the following inequality holds for all $f = \sum_{t=1}^T \alpha_t h_t \in \mathcal{F}$:

$$R(f) \le \widehat{R}_{S,\rho}(f) + \frac{4}{\rho}\sum_{t=1}^T \alpha_t\,\mathfrak{R}_m(H_{k_t}) + \frac{2}{\rho}\sqrt{\frac{\log p}{m}} + \sqrt{\Big\lceil \frac{4}{\rho^2}\log\Big[\frac{\rho^2 m}{\log p}\Big]\Big\rceil \frac{\log p}{m} + \frac{\log\frac{2}{\delta}}{2m}}.$$

Thus, $R(f) \le \widehat{R}_{S,\rho}(f) + \frac{4}{\rho}\sum_{t=1}^T \alpha_t\,\mathfrak{R}_m(H_{k_t}) + C(m, p)$, with $C(m, p) = O\Big(\sqrt{\frac{\log p}{\rho^2 m}\log\Big[\frac{\rho^2 m}{\log p}\Big]}\Big)$.

Proof. For a fixed h = (h1, . . . , hT), any α ∈ ∆ defines a distribution over {h1, . . . , hT}. Sampling from {h1, . . . , hT} according to α and averaging leads to functions g of the form $g = \frac{1}{n}\sum_{t=1}^T n_t h_t$ for some $\mathbf{n} = (n_1, \ldots, n_T)$, with $\sum_{t=1}^T n_t = n$, and $h_t \in H_{k_t}$.

For any $\mathbf{N} = (N_1, \ldots, N_p)$ with $|\mathbf{N}| = n$, we consider the family of functions

$$G_{\mathcal{F},\mathbf{N}} = \Big\{\frac{1}{n}\sum_{k=1}^p\sum_{j=1}^{N_k} h_{k,j} \;\Big|\; \forall (k,j) \in [p]\times[N_k],\; h_{k,j} \in H_k\Big\},$$

and the union of all such families $G_{\mathcal{F},n} = \bigcup_{|\mathbf{N}|=n} G_{\mathcal{F},\mathbf{N}}$.

Fix ρ > 0. For a fixed $\mathbf{N}$, the Rademacher complexity of $G_{\mathcal{F},\mathbf{N}}$ can be bounded as follows for any m ≥ 1: $\mathfrak{R}_m(G_{\mathcal{F},\mathbf{N}}) \le \frac{1}{n}\sum_{k=1}^p N_k\,\mathfrak{R}_m(H_k)$. Thus, the following standard margin-based Rademacher complexity bound holds (Koltchinskii & Panchenko, 2002). For any δ > 0, with probability at least 1 − δ, for all $g \in G_{\mathcal{F},\mathbf{N}}$,

$$R_\rho(g) - \widehat{R}_{S,\rho}(g) \le \frac{2}{\rho}\,\frac{1}{n}\sum_{k=1}^p N_k\,\mathfrak{R}_m(H_k) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$

Since there are at most $p^n$ possible p-tuples $\mathbf{N}$ with $|\mathbf{N}| = n$, by the union bound, for any δ > 0, with probability at least 1 − δ, for all $g \in G_{\mathcal{F},n}$, we can write

$$R_\rho(g) - \widehat{R}_{S,\rho}(g) \le \frac{2}{\rho}\,\frac{1}{n}\sum_{k=1}^p N_k\,\mathfrak{R}_m(H_k) + \sqrt{\frac{\log\frac{p^n}{\delta}}{2m}}.$$

Thus, with probability at least 1 − δ, for all functions $g = \frac{1}{n}\sum_{t=1}^T n_t h_t$ with $h_t \in H_{k_t}$, the following inequality holds:

$$R_\rho(g) - \widehat{R}_{S,\rho}(g) \le \frac{2}{\rho}\,\frac{1}{n}\sum_{k=1}^p \sum_{t: k_t = k} n_t\,\mathfrak{R}_m(H_{k_t}) + \sqrt{\frac{\log\frac{p^n}{\delta}}{2m}}.$$

Taking the expectation with respect to α and using $\mathbb{E}_\alpha[n_t/n] = \alpha_t$, we obtain that for any δ > 0, with probability at least 1 − δ, for all h, we can write

$$\mathbb{E}_\alpha\big[R_\rho(g) - \widehat{R}_{S,\rho}(g)\big] \le \frac{2}{\rho}\sum_{t=1}^T \alpha_t\,\mathfrak{R}_m(H_{k_t}) + \sqrt{\frac{\log\frac{p^n}{\delta}}{2m}}.$$

Fix n ≥ 1. Then, for any $\delta_n > 0$, with probability at least $1 - \delta_n$,

$$\mathbb{E}_\alpha\big[R_{\rho/2}(g) - \widehat{R}_{S,\rho/2}(g)\big] \le \frac{4}{\rho}\sum_{t=1}^T \alpha_t\,\mathfrak{R}_m(H_{k_t}) + \sqrt{\frac{\log\frac{p^n}{\delta_n}}{2m}}.$$

Choose $\delta_n = \frac{\delta}{2p^{n-1}}$ for some δ > 0; then, for p ≥ 2, $\sum_{n\ge 1}\delta_n = \frac{\delta}{2(1 - 1/p)} \le \delta$. Thus, for any δ > 0 and any n ≥ 1, with probability at least 1 − δ, the following holds for all h:

$$\mathbb{E}_\alpha\big[R_{\rho/2}(g) - \widehat{R}_{S,\rho/2}(g)\big] \le \frac{4}{\rho}\sum_{t=1}^T \alpha_t\,\mathfrak{R}_m(H_{k_t}) + \sqrt{\frac{\log\frac{2p^{2n-1}}{\delta}}{2m}}. \qquad (10)$$

Now, for any $f = \sum_{t=1}^T \alpha_t h_t \in \mathcal{F}$ and any $g = \frac{1}{n}\sum_{t=1}^T n_t h_t$, we can upper bound $R(f) = \Pr_{(x,y)\sim D}[yf(x) \le 0]$, the generalization error of f, as follows:

$$R(f) = \Pr_{(x,y)\sim D}\big[yf(x) - yg(x) + yg(x) \le 0\big] \le \Pr\big[yf(x) - yg(x) < -\rho/2\big] + \Pr\big[yg(x) \le \rho/2\big] = \Pr\big[yf(x) - yg(x) < -\rho/2\big] + R_{\rho/2}(g).$$

We can also write

$$\widehat{R}_{S,\rho/2}(g) = \widehat{R}_{S,\rho/2}(g - f + f) \le \Pr_{(x,y)\sim S}\big[yg(x) - yf(x) < -\rho/2\big] + \widehat{R}_{S,\rho}(f).$$

Combining these inequalities yields

$$\Pr_{(x,y)\sim D}\big[yf(x) \le 0\big] - \widehat{R}_{S,\rho}(f) \le \Pr\big[yf(x) - yg(x) < -\rho/2\big] + \Pr\big[yg(x) - yf(x) < -\rho/2\big] + R_{\rho/2}(g) - \widehat{R}_{S,\rho/2}(g).$$

Taking the expectation with respect to α yields

$$R(f) - \widehat{R}_{S,\rho}(f) \le \mathbb{E}_{x\sim D,\alpha}\big[1_{yf(x)-yg(x) < -\rho/2}\big] + \mathbb{E}_{x\sim S,\alpha}\big[1_{yg(x)-yf(x) < -\rho/2}\big] + \mathbb{E}_\alpha\big[R_{\rho/2}(g) - \widehat{R}_{S,\rho/2}(g)\big].$$

Since $f = \mathbb{E}_\alpha[g]$, by Hoeffding's inequality, for any x,

$$\mathbb{E}_\alpha\big[1_{yf(x)-yg(x) < -\rho/2}\big] = \Pr_\alpha\big[yf(x)-yg(x) < -\rho/2\big] \le e^{-\frac{n\rho^2}{8}}, \qquad \mathbb{E}_\alpha\big[1_{yg(x)-yf(x) < -\rho/2}\big] = \Pr_\alpha\big[yg(x)-yf(x) < -\rho/2\big] \le e^{-\frac{n\rho^2}{8}}.$$

Thus, for any fixed f ∈ F, we can write

$$R(f) - \widehat{R}_{S,\rho}(f) \le 2e^{-n\rho^2/8} + \mathbb{E}_\alpha\big[R_{\rho/2}(g) - \widehat{R}_{S,\rho/2}(g)\big].$$

Thus, the following inequality holds:

$$\sup_{f\in\mathcal{F}}\; R(f) - \widehat{R}_{S,\rho}(f) \le 2e^{-n\rho^2/8} + \sup_{\mathbf{h}}\; \mathbb{E}_\alpha\big[R_{\rho/2}(g) - \widehat{R}_{S,\rho/2}(g)\big].$$

Therefore, in view of (10), for any δ > 0 and any n ≥ 1, with probability at least 1 − δ, the following holds for all f ∈ F:

$$R(f) - \widehat{R}_{S,\rho}(f) \le \frac{4}{\rho}\sum_{t=1}^T \alpha_t\,\mathfrak{R}_m(H_{k_t}) + 2e^{-n\rho^2/8} + \sqrt{\frac{\log\frac{2p^{2n-1}}{\delta}}{2m}} = \frac{4}{\rho}\sum_{t=1}^T \alpha_t\,\mathfrak{R}_m(H_{k_t}) + 2e^{-n\rho^2/8} + \sqrt{\frac{(2n-1)\log p + \log\frac{2}{\delta}}{2m}}.$$

To select n, we seek to minimize

$$\varphi: n \mapsto 2e^{-n\rho^2/8} + \sqrt{\frac{n\log p}{m}} = 2e^{-nu} + \sqrt{nv},$$

with $u = \rho^2/8$ and $v = (\log p)/m$. $\varphi$ is differentiable and, for all n, $\varphi'(n) = -2ue^{-nu} + \frac{\sqrt{v}}{2\sqrt{n}}$. The minimum of $\varphi$ is thus attained at n such that

$$\varphi'(n) = 0 \;\Longleftrightarrow\; 2ue^{-nu} = \frac{\sqrt{v}}{2\sqrt{n}} \;\Longleftrightarrow\; -2un\,e^{-2un} = -\frac{v}{8u} \;\Longleftrightarrow\; n = \frac{-1}{2u}\,W_{-1}\Big(\frac{-v}{8u}\Big),$$

where $W_{-1}$ is the second branch of the Lambert function (the inverse of $x \mapsto xe^x$). It is not hard to verify that the following inequalities hold for all $x \in (0, 1/e]$:

$$-\log(x) \le -W_{-1}(-x) \le -2\log(x).$$

Bounding $-W_{-1}$ using the lower bound leads to the following choice for n:

$$n = \Big\lceil \frac{-1}{2u}\log\Big(\frac{v}{8u}\Big)\Big\rceil = \Big\lceil \frac{4}{\rho^2}\log\Big(\frac{\rho^2 m}{\log p}\Big)\Big\rceil.$$

Plugging in this value of n yields the following bound:

$$R(f) - \widehat{R}_{S,\rho}(f) \le \frac{4}{\rho}\sum_{t=1}^T \alpha_t\,\mathfrak{R}_m(H_{k_t}) + \frac{2}{\rho}\sqrt{\frac{\log p}{m}} + \sqrt{\Big\lceil \frac{4}{\rho^2}\log\Big[\frac{\rho^2 m}{\log p}\Big]\Big\rceil \frac{\log p}{m} + \frac{\log\frac{2}{\delta}}{2m}},$$

which concludes the proof.

Figure 5. Illustration of the directional derivatives in the three cases of definition (11) (panels (a), (b), (c)).

B. Coordinate descent

B.1. Maximum descent coordinate

For a differentiable convex function, the definition of coordinate descent along the direction with maximal descent is standard: the direction selected is the one maximizing the absolute value of the directional derivative. Here, we clarify the definition of the maximal descent strategy for a non-differentiable convex function.

For any function $Q: \mathbb{R}^N \to \mathbb{R}$, we denote by $Q'_+(\alpha, e)$ the right directional derivative of Q at $\alpha \in \mathbb{R}^N$ and by $Q'_-(\alpha, e)$ its left directional derivative at $\alpha \in \mathbb{R}^N$ along the direction $e \in \mathbb{R}^N$, $\|e\| = 1$, when they exist:

$$Q'_+(\alpha, e) = \lim_{\eta\to 0^+}\frac{Q(\alpha + \eta e) - Q(\alpha)}{\eta}, \qquad Q'_-(\alpha, e) = \lim_{\eta\to 0^-}\frac{Q(\alpha + \eta e) - Q(\alpha)}{\eta}.$$

For the remainder of this section, we will assume that Q is a convex function. It is known that in that case these quantities always exist and that $Q'_-(\alpha, e) \le Q'_+(\alpha, e)$ for all α and e. The left and right directional derivatives coincide with the directional derivative $Q'(\alpha, e)$ of Q along the direction e when Q is differentiable at α along the direction e: $Q'(\alpha, e) = Q'_+(\alpha, e) = Q'_-(\alpha, e)$.

For any j ∈ [1, N], let $e_j$ denote the jth unit vector in $\mathbb{R}^N$. For any $\alpha \in \mathbb{R}^N$ and j ∈ [1, N], we define the descent gradient $\delta Q(\alpha, e_j)$ of Q along the direction $e_j$ as follows:

$$\delta Q(\alpha, e_j) = \begin{cases} 0 & \text{if } Q'_-(\alpha, e_j) \le 0 \le Q'_+(\alpha, e_j)\\ Q'_+(\alpha, e_j) & \text{if } Q'_-(\alpha, e_j) \le Q'_+(\alpha, e_j) \le 0\\ Q'_-(\alpha, e_j) & \text{if } 0 \le Q'_-(\alpha, e_j) \le Q'_+(\alpha, e_j). \end{cases} \qquad (11)$$

$\delta Q(\alpha, e_j)$ is the element of the subgradient along $e_j$ that is the closest to 0. Figure 5 illustrates the three cases in that definition. Note that when Q is differentiable along $e_j$, then $Q'_+(\alpha, e_j) = Q'_-(\alpha, e_j)$ and $\delta Q(\alpha, e_j) = Q'(\alpha, e_j)$.

The maximum descent coordinate can then be defined by

$$k = \operatorname*{argmax}_{j\in[1,N]} |\delta Q(\alpha, e_j)|. \qquad (12)$$

This coincides with the standard definition when Q is convex and differentiable.


B.2. Direction

In view of (12), at each iteration t ≥ 1, the direction $e_k$ selected by coordinate descent with maximum descent is $k = \operatorname{argmax}_{j\in[1,N]} |\delta F(\alpha_{t-1}, e_j)|$. To determine k, we compute $\delta F(\alpha_{t-1}, e_j)$ for all j ∈ [1, N] by distinguishing two cases: $\alpha_{t-1,j} \ne 0$ and $\alpha_{t-1,j} = 0$.

Assume first that $\alpha_{t-1,j} \ne 0$ and let s denote the sign of $\alpha_{t-1,j}$. For η sufficiently small, $\alpha_{t-1,j} + \eta$ has the sign of $\alpha_{t-1,j}$, that is s, and

$$F(\alpha_{t-1} + \eta e_j) = \frac{1}{m}\sum_{i=1}^m \Phi\big(1 - y_i f_{t-1}(x_i) - \eta y_i h_j(x_i)\big) + \sum_{p\ne j}\Lambda_p|\alpha_{t-1,p}| + s\Lambda_j(\alpha_{t-1,j} + \eta).$$

Thus, when $\alpha_{t-1,j} \ne 0$, F admits a directional derivative along $e_j$ given by

$$F'(\alpha_{t-1}, e_j) = -\frac{1}{m}\sum_{i=1}^m y_i h_j(x_i)\,\Phi'\big(1 - y_i f_{t-1}(x_i)\big) + s\Lambda_j = -\frac{1}{m}\sum_{i=1}^m y_i h_j(x_i)\,D_t(i)\,S_t + s\Lambda_j = (2\epsilon_{t,j} - 1)\frac{S_t}{m} + s\Lambda_j,$$

and $\delta F(\alpha_{t-1}, e_j) = (2\epsilon_{t,j} - 1)\frac{S_t}{m} + \operatorname{sgn}(\alpha_{t-1,j})\Lambda_j$. When $\alpha_{t-1,j} = 0$, we find similarly that

$$F'_+(\alpha_{t-1}, e_j) = (2\epsilon_{t,j} - 1)\frac{S_t}{m} + \Lambda_j, \qquad F'_-(\alpha_{t-1}, e_j) = (2\epsilon_{t,j} - 1)\frac{S_t}{m} - \Lambda_j.$$

The condition $F'_-(\alpha, e_j) \le 0 \le F'_+(\alpha, e_j)$ is equivalent to

$$-\Lambda_j \le (2\epsilon_{t,j} - 1)\frac{S_t}{m} \le \Lambda_j \;\Longleftrightarrow\; \Big|\epsilon_{t,j} - \frac{1}{2}\Big| \le \frac{\Lambda_j m}{2S_t}.$$

Thus, in summary, we can write, for all j ∈ [1, N],

$$\delta F(\alpha_{t-1}, e_j) = \begin{cases} (2\epsilon_{t,j}-1)\frac{S_t}{m} + \operatorname{sgn}(\alpha_{t-1,j})\Lambda_j & \text{if } \alpha_{t-1,j} \ne 0\\ 0 & \text{else if } \big|\epsilon_{t,j}-\frac{1}{2}\big| \le \frac{\Lambda_j m}{2S_t}\\ (2\epsilon_{t,j}-1)\frac{S_t}{m} + \Lambda_j & \text{else if } \epsilon_{t,j}-\frac{1}{2} \le -\frac{\Lambda_j m}{2S_t}\\ (2\epsilon_{t,j}-1)\frac{S_t}{m} - \Lambda_j & \text{otherwise.} \end{cases}$$

This can be simplified by unifying the last two cases and observing that the sign of $(\epsilon_{t,j} - \frac{1}{2})$ suffices to distinguish between them:

$$\delta F(\alpha_{t-1}, e_j) = \begin{cases} (2\epsilon_{t,j}-1)\frac{S_t}{m} + \operatorname{sgn}(\alpha_{t-1,j})\Lambda_j & \text{if } \alpha_{t-1,j} \ne 0\\ 0 & \text{else if } \big|\epsilon_{t,j}-\frac{1}{2}\big| \le \frac{\Lambda_j m}{2S_t}\\ (2\epsilon_{t,j}-1)\frac{S_t}{m} - \operatorname{sgn}\big(\epsilon_{t,j}-\frac{1}{2}\big)\Lambda_j & \text{otherwise.} \end{cases}$$

B.3. Step

Given the direction $e_k$, the optimal step value η is given by $\operatorname{argmin}_\eta F(\alpha_{t-1} + \eta e_k)$. In the most general case, η can be found via a line search or other numerical methods. In some special cases, we can derive a closed-form solution for the step by minimizing an upper bound on $F(\alpha_{t-1} + \eta e_k)$. For convenience, in what follows, we use the shorthand $\epsilon_t$ for $\epsilon_{t,k}$.

Since $y_i h_k(x_i) = \frac{1 + y_i h_k(x_i)}{2}\cdot(+1) + \frac{1 - y_i h_k(x_i)}{2}\cdot(-1)$, by the convexity of $u \mapsto \Phi(1 - \eta u)$, the following holds for all η ∈ R:

$$\Phi\big(1 - y_i f_{t-1}(x_i) - \eta y_i h_k(x_i)\big) \le \frac{1 + y_i h_k(x_i)}{2}\,\Phi\big(1 - y_i f_{t-1}(x_i) - \eta\big) + \frac{1 - y_i h_k(x_i)}{2}\,\Phi\big(1 - y_i f_{t-1}(x_i) + \eta\big). \qquad (13)$$

Thus, we can write

$$F(\alpha_{t-1} + \eta e_k) - \sum_{j\ne k}\Lambda_j|\alpha_{t-1,j}| \le \frac{1}{m}\sum_{i=1}^m \frac{1 + y_i h_k(x_i)}{2}\,\Phi\big(1 - y_i f_{t-1}(x_i) - \eta\big) + \frac{1}{m}\sum_{i=1}^m \frac{1 - y_i h_k(x_i)}{2}\,\Phi\big(1 - y_i f_{t-1}(x_i) + \eta\big) + \Lambda_k|\alpha_{t-1,k} + \eta|.$$

Let J(η) denote that upper bound. We can select η to minimize J(η). J is convex and admits a subdifferential at all points. Thus, η* is a minimizer of J(η) iff $0 \in \partial J(\eta^*)$, where $\partial J(\eta^*)$ denotes the subdifferential of J at η*.

B.4. Exponential loss

In the case Φ = exp, J(η) can be expressed as follows:

$$J(\eta) = \frac{1}{m}\sum_{i=1}^m \frac{1 + y_i h_k(x_i)}{2}\, e^{1 - y_i f_{t-1}(x_i)}\, e^{-\eta} + \frac{1}{m}\sum_{i=1}^m \frac{1 - y_i h_k(x_i)}{2}\, e^{1 - y_i f_{t-1}(x_i)}\, e^{\eta} + \Lambda_k|\alpha_{t-1,k} + \eta|,$$

and $e^{1 - y_i f_{t-1}(x_i)} = \Phi'\big(1 - y_i f_{t-1}(x_i)\big) = S_t D_t(i)$. Thus, J can be rewritten as follows:²

$$J(\eta) = (1 - \epsilon_t)\frac{S_t}{m}e^{-\eta} + \epsilon_t\frac{S_t}{m}e^{\eta} + \Lambda_k|\alpha_{t-1,k} + \eta|,$$

² Note that when the functions in H take values in {−1, +1}, (13) is in fact an equality and J(η) coincides with $F(\alpha_{t-1} + \eta e_k) - \sum_{j\ne k}\Lambda_j|\alpha_{t-1,j}|$.


Figure 6. Plot of the second-degree polynomial function P, with its positive root $e^{\eta^*}$ and the value $P(e^{-\alpha_{t-1,k}})$ marked.

where we used the shorthand $\epsilon_t = \epsilon_{t,k}$, with k the index of the selected direction $e_k$. If $\alpha_{t-1,k} + \eta^* = 0$, then the subdifferential of $|\alpha_{t-1,k} + \eta|$ at η* is the set $\{\nu : \nu \in [-1, +1]\}$. Thus, $\partial J(\eta^*)$ contains 0 iff there exists ν ∈ [−1, +1] such that

$$-(1-\epsilon_t)\frac{S_t}{m}e^{-\eta^*} + \epsilon_t\frac{S_t}{m}e^{\eta^*} + \Lambda_k\nu = 0 \;\Longleftrightarrow\; -(1-\epsilon_t)e^{\alpha_{t-1,k}} + \epsilon_t e^{-\alpha_{t-1,k}} + \frac{\Lambda_k m}{S_t}\nu = 0.$$

This is equivalent to the condition

$$\big|(1-\epsilon_t)e^{\alpha_{t-1,k}} - \epsilon_t e^{-\alpha_{t-1,k}}\big| \le \frac{\Lambda_k m}{S_t}. \qquad (14)$$

If $\alpha_{t-1,k} + \eta^* > 0$, then the subdifferential of $|\alpha_{t-1,k} + \eta|$ at η* is reduced to {1} and $\partial J(\eta^*)$ contains 0 iff

$$-(1-\epsilon_t)e^{-\eta^*} + \epsilon_t e^{\eta^*} + \frac{\Lambda_k m}{S_t} = 0 \;\Longleftrightarrow\; \epsilon_t e^{2\eta^*} + \frac{\Lambda_k m}{S_t}e^{\eta^*} - (1-\epsilon_t) = 0. \qquad (15)$$

Solving the resulting second-degree equation in $e^{\eta^*}$ gives

$$e^{\eta^*} = -\frac{\Lambda_k m}{2\epsilon_t S_t} + \sqrt{\Big(\frac{\Lambda_k m}{2\epsilon_t S_t}\Big)^2 + \frac{1-\epsilon_t}{\epsilon_t}},$$

that is

$$\eta^* = \log\Bigg[-\frac{\Lambda_k m}{2\epsilon_t S_t} + \sqrt{\Big(\frac{\Lambda_k m}{2\epsilon_t S_t}\Big)^2 + \frac{1-\epsilon_t}{\epsilon_t}}\Bigg].$$

Let P be the second-degree polynomial of (15) whose solution is $e^{\eta^*}$. P is convex, has one negative root and one positive root, and the positive root is $e^{\eta^*}$. Since $e^{-\alpha_{t-1,k}}$ is positive, the condition $\alpha_{t-1,k} + \eta^* > 0$, or $-\alpha_{t-1,k} < \eta^*$, is then equivalent to $P(e^{-\alpha_{t-1,k}}) < 0$ (see Figure 6), that is

$$\epsilon_t e^{-2\alpha_{t-1,k}} + \frac{\Lambda_k m}{S_t}e^{-\alpha_{t-1,k}} - (1-\epsilon_t) < 0 \;\Longleftrightarrow\; (1-\epsilon_t)e^{\alpha_{t-1,k}} - \epsilon_t e^{-\alpha_{t-1,k}} > \frac{\Lambda_k m}{S_t}. \qquad (16)$$

Note that $\eta^* \le \eta_0$, where $\eta_0 = \log\big[\sqrt{\frac{1-\epsilon_t}{\epsilon_t}}\big]$ is the step size used in AdaBoost.

The case $\alpha_{t-1,k} + \eta^* < 0$ can be treated similarly. It is equivalent to the condition

$$(1-\epsilon_t)e^{\alpha_{t-1,k}} - \epsilon_t e^{-\alpha_{t-1,k}} < -\frac{\Lambda_k m}{S_t}, \qquad (17)$$

and leads to the step size

$$\eta^* = \log\Bigg[\frac{\Lambda_k m}{2\epsilon_t S_t} + \sqrt{\Big(\frac{\Lambda_k m}{2\epsilon_t S_t}\Big)^2 + \frac{1-\epsilon_t}{\epsilon_t}}\Bigg].$$
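As a quick numerical sanity check of this derivation (our own snippet with hypothetical values, not part of the paper), one can verify that the closed-form η* of the case (16) indeed makes the derivative of J vanish:

import math

# Hypothetical values for the quantities appearing in J(eta); condition (16) holds here.
eps_t, S_t, m, Lam_k, alpha_k = 0.3, 50.0, 100, 0.01, 0.2

r = Lam_k * m / (2.0 * eps_t * S_t)
eta_star = math.log(-r + math.sqrt(r * r + (1.0 - eps_t) / eps_t))

# Derivative of J for alpha_{t-1,k} + eta > 0:
# J'(eta) = -(1 - eps) (S/m) e^{-eta} + eps (S/m) e^{eta} + Lambda_k.
dJ = (-(1.0 - eps_t) * (S_t / m) * math.exp(-eta_star)
      + eps_t * (S_t / m) * math.exp(eta_star) + Lam_k)
print(abs(dJ) < 1e-9)   # True: eta_star solves (15)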

B.5. Logistic loss

In the case of the logistic loss, for any u ∈ R, $\Phi(-u) = \log_2(1 + e^{-u})$ and $\Phi'(-u) = \frac{1}{\log 2}\,\frac{1}{1 + e^{u}}$. To determine the step size, we use the following general upper bound:

$$\Phi(-u - v) - \Phi(-u) = \log_2\Big[\frac{1 + e^{-u-v}}{1 + e^{-u}}\Big] = \log_2\Big[\frac{1 + e^{-u} + e^{-u-v} - e^{-u}}{1 + e^{-u}}\Big] = \log_2\Big[1 + \frac{e^{-v} - 1}{1 + e^{u}}\Big] \le \frac{e^{-v} - 1}{(\log 2)(1 + e^{u})} = \Phi'(-u)\,(e^{-v} - 1).$$

Thus, we can write

$$F(\alpha_{t-1} + \eta e_k) - F(\alpha_{t-1}) \le \frac{1}{m}\sum_{i=1}^m \Phi'\big(1 - y_i f_{t-1}(x_i)\big)\big(e^{-\eta y_i h_k(x_i)} - 1\big) + \Lambda_k\big(|\alpha_{t-1,k} + \eta| - |\alpha_{t-1,k}|\big) = \frac{1}{m}\sum_{i=1}^m D_t(i)S_t\big(e^{-\eta y_i h_k(x_i)} - 1\big) + \Lambda_k\big(|\alpha_{t-1,k} + \eta| - |\alpha_{t-1,k}|\big).$$

To determine η, we can minimize this upper bound, or equivalently the following:

$$\frac{1}{m}\sum_{i=1}^m D_t(i)S_t\, e^{-\eta y_i h_k(x_i)} + \Lambda_k|\alpha_{t-1,k} + \eta|.$$

This expression is syntactically the same as the one considered in the case of the exponential loss, with only the distribution weights $D_t(i)$ and $S_t$ being different.


Indeed, in the case of the exponential loss (Φ = exp), we can write

$$F(\alpha_{t-1} + \eta e_k) - \sum_{j\ne k}\Lambda_j|\alpha_{t-1,j}| = \frac{1}{m}\sum_{i=1}^m \Phi\big(1 - y_i f_{t-1}(x_i) - \eta y_i h_k(x_i)\big) + \Lambda_k|\alpha_{t-1,k} + \eta| = \frac{1}{m}\sum_{i=1}^m \Phi\big(1 - y_i f_{t-1}(x_i)\big)\, e^{-\eta y_i h_k(x_i)} + \Lambda_k|\alpha_{t-1,k} + \eta| = \frac{1}{m}\sum_{i=1}^m \Phi'\big(1 - y_i f_{t-1}(x_i)\big)\, e^{-\eta y_i h_k(x_i)} + \Lambda_k|\alpha_{t-1,k} + \eta| = \frac{1}{m}\sum_{i=1}^m D_t(i)S_t\, e^{-\eta y_i h_k(x_i)} + \Lambda_k|\alpha_{t-1,k} + \eta|.$$

Thus, we immediately obtain the same expressions for the step size in the case of the logistic loss, with the same three cases, but with $S_t = \sum_{i=1}^m \frac{1}{1 + e^{-1 + y_i f_{t-1}(x_i)}}$ and $D_t(i) = \frac{1}{S_t}\,\frac{1}{1 + e^{-1 + y_i f_{t-1}(x_i)}}$.

C. Alternative DeepBoostγ algorithm

We also devised and implemented an alternative algorithm, DeepBoostγ, which is inspired by the learning bound of Theorem 1 but does not seek to minimize it. The algorithm admits a parameter γ > 0 representing the edge value demanded at each boosting round, that is, the amount by which we require the error εt of the base hypothesis ht selected at round t to be better than 1/2: $\frac{1}{2} - \epsilon_t > \gamma$. We assume given p distinct hypothesis sets with increasing degrees of complexity H1, . . . , Hp. DeepBoostγ proceeds as if we were running AdaBoost using only H1 as the base hypothesis set. But, at each round, if the edge achieved by the best hypothesis found in H1 is not sufficient, that is if it is not larger than the demanded edge γ, then it selects instead the hypothesis in H2 with the smallest error on the sample weighted by Dt. If the edge of that hypothesis is also not sufficient, it proceeds with the next hypothesis set, and so forth. If the edge is insufficient even with the best hypothesis in Hp, then it simply uses the best hypothesis found in $H = \bigcup_{k=1}^p H_k$. The edge parameter γ is determined via cross-validation.

DeepBoostγ is inspired by the bound of Theorem 1 since it seeks to use, as much as possible, hypotheses from H1 or other low-complexity families, and only when necessary functions from more complex families. Since it rarely chooses hypotheses from the more complex Hks, the complexity term of the bound of Theorem 1 remains close to the one obtained using only H1. On the other hand, DeepBoostγ can achieve a smaller empirical margin loss (first term of the bound) by selecting, when needed, more powerful hypotheses than those accessible using H1 alone.
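A minimal sketch of this selection rule (our own illustration; weighted_error and the family lists are assumed inputs, not objects defined in the paper) might look as follows:

def select_hypothesis(families, weighted_error, gamma):
    """DeepBoost_gamma fallback rule: try H_1, H_2, ... until the edge
    1/2 - error exceeds gamma; otherwise return the best hypothesis overall.

    families       : list of hypothesis families H_1, ..., H_p (simplest first)
    weighted_error : function h -> error of h under the current distribution D_t
    gamma          : demanded edge
    """
    best_overall = None
    for H_k in families:
        h = min(H_k, key=weighted_error)               # best hypothesis in H_k
        if best_overall is None or weighted_error(h) < weighted_error(best_overall):
            best_overall = h
        if 0.5 - weighted_error(h) > gamma:            # edge is sufficient
            return h
    return best_overall                                # fall back to best in the union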

We carried out some early experiments on several datasets with DeepBoostγ using boosting stumps, in which the performance of the algorithm was found to be superior to that of AdaBoost. A more extensive study of the theoretical and empirical properties of this algorithm is left to the future.

Table 4. Dataset statistics. german refers more specifically to the german (numeric) dataset.

Dataset         Examples   Attributes
breastcancer       699          9
ionosphere         351         34
german            1000         24
diabetes           768          8
ocr17             2000        196
ocr49             2000        196
ocr17-mnist      15170        400
ocr49-mnist      13782        400

D. Additional empirical information

D.1. Dataset sizes and attributes

The size and the number of attributes of the datasets used in our experiments are indicated in Table 4.