Local Rademacher Complexity: Sharper Risk Bounds
With and Without Unlabeled Samples
Luca Oneto, Alessandro Ghio, Sandro Ridella
DITEN - University of Genova, Via Opera Pia 11A, I-16145 Genova, Italy
{Luca.Oneto, Alessandro.Ghio, Sandro.Ridella}@unige.it
Davide Anguita
DIBRIS - University of Genova, Via Opera Pia 13, I-16145 Genova, Italy
Abstract
We derive in this paper a new Local Rademacher Complexity risk bound
on the generalization ability of a model, which is able to take advantage of
the availability of unlabeled samples. Moreover, this new bound improves
state–of–the–art results even when no unlabeled samples are available.
Keywords: Statistical Learning Theory, Performance Estimation, Local
Rademacher Complexity, Unlabeled Samples
Preprint submitted to Neural Networks, January 11, 2015.

1. Introduction

A learning process can be described as the selection of a hypothesis
in a fixed set, based on empirical observations (Vapnik, 1998). Its asymptotic
analysis, through a bound on the generalization error, has been thoroughly
investigated in the past (Vapnik, 1998; Talagrand, 1987). However,
as the number of samples is limited in practice, finite sample analysis with
global measures of the complexity of the hypothesis set was proposed, and
represented a fundamental advance in the field (Vapnik, 1998; Bartlett &
Mendelson, 2003; Koltchinskii, 2006; Bousquet & Elisseeff, 2002; Valiant,
2013; McAllester & Akinbiyi, 2013). A further refinement has consisted in
exploiting local measures of complexity, which take into account only those
models that approximate the available data well (Bartlett et al., 2002b, 2005;
Koltchinskii, 2006; Blanchard & Massart, 2006; Lever et al., 2013). Recently,
some attempts to further improve these results have been made (Audibert
& Tsybakov, 2007; Srebro et al., 2010; Steinwart & Scovel, 2007): unfor-
tunately, these approaches require additional assumptions that, in general,
are not satisfied or cannot be justified by inferring them from the data. Al-
ternative paths have been explored like, for example, exploiting additional
a–priori information (Parrado-Hernandez et al., 2012). Recently, the use of
unlabeled samples has been proposed for improving the tightness of Global
Rademacher Complexity based bounds (Anguita et al., 2011). Such results
are appealing since unlabeled samples are commonly available in many real
world applications, as also confirmed by the success of learning procedures
able to exploit them (Chapelle et al., 2006).
In this paper, we extend the recent results on the use of unlabeled samples
in global complexity measures to the case of local ones and derive sharper
Local Rademacher Complexity risk bounds on the generalization ability of
a model. For this purpose, two steps are completed. First, we propose a
proof for the Local Rademacher Complexity bound, simplified with respect
to the milestone result of (Bartlett et al., 2005) through the exploitation
of the well–known bounded difference inequality (McDiarmid, 1989). Such
simplification enables us to apply results on concentration inequalities of self–
bounding functions (Boucheron et al., 2013), and to obtain a sharper Local
Rademacher Complexity risk bound. The latter improves the state-of-the-
art results both when unlabeled samples are used and when the dataset is
entirely composed of labeled samples.
2. The learning framework
We consider the conventional learning problem (Vapnik, 1998): based on
a random observation of X ∈ X , one has to estimate Y ∈ Y by choosing a
suitable hypothesis h : X → Y , where h ∈ H. A learning algorithm selects h
by exploiting a set of labeled samples D_{n_l} = {(X^l_1, Y^l_1), …, (X^l_{n_l}, Y^l_{n_l})}
and, possibly, a set of unlabeled ones D_{n_u} = {X^u_1, …, X^u_{n_u}}. D_{n_l} and D_{n_u}
consist of sequences of independent samples distributed according to µ over
X × Y. The generalization error
L(h) = E_µ ℓ(h(X), Y),  (1)
associated with a hypothesis h ∈ H, is defined through a loss function
ℓ(h(X), Y) : Y × Y → [0, 1]. As µ is unknown, L(h) cannot be explicitly computed, thus
we have to resort to its empirical estimator, namely the empirical error
L_{n_l}(h) = (1/n_l) ∑_{i=1}^{n_l} ℓ(h(X^l_i), Y^l_i).  (2)
Note that Lnl(h) is a biased estimator, since the data used for selecting the
model and for computing the empirical error coincide. We estimate this
bias by studying the discrepancy between the generalization error and the
empirical error. For this purpose we exploit powerful statistical tools like
concentration inequalities and the Local Rademacher Complexity.
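Since the bias of the empirical estimator is central to what follows, a small simulation may help fix ideas. The sketch below is our own toy construction (the noisy threshold model `true_label`, the finite class of thresholds, and all constants are hypothetical, not from the paper): it selects the empirically best hypothesis and measures how optimistic its training error is.

```python
# Toy illustration of the bias of the empirical error when the same data
# both select the model and evaluate it.  All choices here are hypothetical.
import random

random.seed(0)

def true_label(x):
    # hypothetical data model: threshold at 0.5, with 20% label noise
    y = 1 if x > 0.5 else -1
    return -y if random.random() < 0.2 else y

def sample(n):
    return [(x, true_label(x)) for x in [random.random() for _ in range(n)]]

def h(theta, x):                         # a small finite class of thresholds
    return 1 if x > theta else -1

def emp_error(theta, data):              # hard loss (1 - Y h(X)) / 2, averaged
    return sum((1 - y * h(theta, x)) / 2 for x, y in data) / len(data)

thetas = [i / 20 for i in range(21)]
gap = 0.0
trials = 200
for _ in range(trials):
    train = sample(30)
    best = min(thetas, key=lambda t: emp_error(t, train))
    test = sample(2000)                  # fresh data: near-unbiased estimate of L(h)
    gap += emp_error(best, test) - emp_error(best, train)
gap /= trials
print(gap > 0)                           # training error is optimistically biased
```

On average the gap is positive: the selected model looks better on the data that chose it, which is exactly the discrepancy the bounds below control.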
2.1. Definitions
In the seminal work of Bartlett et al. (2005), a bound, defined over the
space of functions, is provided. In this work, we extend that result to a
more general supervised learning framework. For this purpose, we switch
from the space of functions H to the space of loss functions.
Definition 2.1. Given a space of functions H with its associated loss func-
tion ℓ(h(X), Y), the space of loss functions L is defined as:

L = { ℓ(h(X), Y) : h ∈ H }.  (3)
Let us also consider the corresponding star–shaped space of functions.

Definition 2.2. Given the space of loss functions L, its star–shaped version
is:

L^s = { αℓ : α ∈ [0, 1], ℓ ∈ L }.  (4)
Then, the generalization error and the empirical error can be rewritten
in terms of the space of loss functions:
L(h) ≡ L(ℓ) = E_µ ℓ(h(X), Y),  (5)

L_{n_l}(h) ≡ L_{n_l}(ℓ) = (1/n_l) ∑_{i=1}^{n_l} ℓ(h(X^l_i), Y^l_i).  (6)
Moreover, we can define, respectively, the expected square error and the em-
pirical square error:

L(ℓ²) = E_µ [ℓ(h(X), Y)]²,  (7)

L_{n_l}(ℓ²) = (1/n_l) ∑_{i=1}^{n_l} [ℓ(h(X^l_i), Y^l_i)]².  (8)
Consequently, the variance of ℓ ∈ L can be defined as:

V²(ℓ) = E_µ [ℓ(h(X), Y) − L(ℓ)]² = L(ℓ²) − [L(ℓ)]².  (9)
Note that the following relations hold:
V²(ℓ) ≤ L(ℓ²) ≤ L(ℓ),   L[(αℓ)²] = α² L(ℓ²).  (10)
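These relations can be checked numerically; the following snippet (an illustrative sketch with synthetic loss values, not tied to any specific H) verifies Eqs. (9)–(10):

```python
# Numerical check of Eq. (10): V^2 <= L(l^2) <= L(l) for losses in [0, 1],
# and the scaling property L[(a*l)^2] = a^2 L(l^2).  Synthetic loss values.
import random

random.seed(1)
losses = [random.random() for _ in range(10000)]   # stand-in for l(h(X), Y)

L  = sum(losses) / len(losses)                     # L(l): mean loss
L2 = sum(v * v for v in losses) / len(losses)      # L(l^2): mean squared loss
V2 = L2 - L * L                                    # variance, Eq. (9)

alpha = 0.3
L2_scaled = sum((alpha * v) ** 2 for v in losses) / len(losses)

print(V2 <= L2 <= L)                               # l in [0,1] implies l^2 <= l
print(abs(L2_scaled - alpha ** 2 * L2) < 1e-12)    # scaling property
```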
Since we do not know in advance which h ∈ H will be chosen during the
learning phase, in order to estimate L(ℓ) we have to study the behavior of
the difference between the generalization error and the empirical error.
Definition 2.3. Given L, the Uniform Deviations of the loss, Û_{n_l}(L), and
of the square loss, Û²_{n_l}(L), are:

Û_{n_l}(L) = sup_{ℓ∈L} [L(ℓ) − L_{n_l}(ℓ)],   Û²_{n_l}(L) = sup_{ℓ∈L} [L_{n_l}(ℓ²) − L(ℓ²)],  (11)

while their deterministic counterparts are:

U_{n_l}(L) = E_µ Û_{n_l}(L),   U²_{n_l}(L) = E_µ Û²_{n_l}(L).  (12)
The Uniform Deviation is not computable, but we can upper bound its
value through some computable quantity. One possibility is to use the Ra-
demacher Complexity.
Definition 2.4. The Rademacher Complexities of the loss and of the square
loss are:

R̂_{n_l}(L) = E_σ sup_{ℓ∈L} (2/n_l) ∑_{i=1}^{n_l} σ_i ℓ(h(X^l_i), Y^l_i),  (13)

R̂²_{n_l}(L) = E_σ sup_{ℓ∈L} (2/n_l) ∑_{i=1}^{n_l} σ_i [ℓ(h(X^l_i), Y^l_i)]²,  (14)

where σ_1, …, σ_{n_l} are n_l {±1}–valued independent Rademacher random
variables for which P(σ_i = +1) = P(σ_i = −1) = 1/2. Their deterministic
counterparts are:

R_{n_l}(L) = E_µ R̂_{n_l}(L),   R²_{n_l}(L) = E_µ R̂²_{n_l}(L).  (15)
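Definition 2.4 can be turned into a Monte-Carlo computation when the class is finite. In the sketch below (our own toy setup: each hypothesis is represented directly by its vector of losses on the n samples), the expectation over σ is approximated by sampling:

```python
# Monte-Carlo sketch of the empirical Rademacher Complexity of Eq. (13):
# (2/n) E_sigma sup_l sum_i sigma_i l_i, for a small finite class.
import random

random.seed(2)
n = 50
# three hypothetical "loss vectors" (l(h(X_i), Y_i))_i, one per hypothesis:
loss_vectors = [[random.random() for _ in range(n)] for _ in range(3)]

def rademacher(vectors, n_mc=2000):
    total = 0.0
    for _ in range(n_mc):
        sigma = [random.choice((-1, 1)) for _ in range(n)]
        total += max(sum(s * v for s, v in zip(sigma, vec)) for vec in vectors)
    return (2.0 / n) * total / n_mc

R = rademacher(loss_vectors)
print(0.0 < R <= 2.0)   # bounded, since each loss lies in [0, 1]
```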
In Appendix A, some propaedeutic properties of the Uniform Deviation
and Rademacher Complexity are recalled, which will be useful for deriving
the main results of this work.
Finally, we will also make use of the notion of sub–root function
(Bartlett et al., 2005).
Definition 2.5. A function ψ : (0, +∞) → [0, +∞) is a sub–root function if and only if:

(I) ψ(r) is positive,

(II) ψ(r) is non–decreasing,

(III) ψ(r)/√r is non–increasing,

for r > 0.
Its properties are reported in Appendix B.
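The fixed point r* with ψ(r*) = r*, which plays a central role in the following sections, can be located by simple iteration: for a sub-root function, r ← ψ(r) converges to the unique positive fixed point. A minimal sketch, assuming an arbitrary example sub-root function of our own choosing:

```python
# Fixed-point iteration for a sub-root function.  psi below is a hypothetical
# example: positive, non-decreasing, with psi(r)/sqrt(r) non-increasing.
import math

def psi(r):
    return 0.5 * math.sqrt(r) + 0.1

r = 1.0
for _ in range(200):               # r <- psi(r) converges to the fixed point
    r = psi(r)

print(abs(psi(r) - r) < 1e-9)      # r is (numerically) the fixed point r*
```

For this particular ψ the fixed point can also be obtained in closed form (solving r = 0.5√r + 0.1 gives r* ≈ 0.4266), which the iteration reproduces.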
3. Local Rademacher Complexity error bound
In this section, we propose a proof of the Local Rademacher Complexity
bound on the generalization error of a model (Bartlett et al., 2005; Koltchin-
skii, 2006), which is simplified with respect to the original proof in literature
and allows us also to obtain optimal constants.
In order to improve the readability of the paper, an outline of the main
steps of the proof is presented. As a first step, Theorem 3.1 shows that it is
possible to bound the generalization error of a function chosen in H, through
an assumption over the Expected Uniform Deviation of a normalized and
slightly enlarged version (see Lemma 3.2) of H. As a second step, Theorem
3.3 shows how to relate the Expected Uniform Deviation and the Expected
Rademacher Complexity through the use of a sub–root function. The fixed
point of this sub–root function is used to bound the generalization error of
a function chosen in H. As a third step, Lemma 3.4 shows that, instead of
using any sub–root function, we can directly use the Expected Rademacher
Complexity of a local space of functions, where functions therein are the ones
with low expected square error. As a fourth step, Lemma 3.5 and Lemma
3.6 allow us to substitute the non–computable expected quantities mentioned
above with their empirical counterparts, which can be computed from the
data. Then, we finally derive the main result of this section in Theorem
3.7, which is a fully empirical Local Rademacher Complexity bound on the
generalization error of a function chosen in the original hypothesis space H.
The following theorem is needed for normalizing the original hypothesis
space: this allows us to bound the generalization error of a function chosen
in H.
Theorem 3.1. Let us consider the normalized loss space L_r:

L_r = { [r/(L(ℓ²) ∨ r)] ℓ : ℓ ∈ L },  (16)

and let us suppose that, ∀K > 1:

Û_{n_l}(L_r) ≤ r/K.  (17)

Then, ∀h ∈ H, the following inequality holds:

L(h) ≤ max{ [K/(K−1)] L_{n_l}(h), L_{n_l}(h) + r/K } ≤ [K/(K−1)] L_{n_l}(h) + r/K.  (18)
Proof. We note that, ∀ℓ_r ∈ L_r:

L(ℓ_r) ≤ L_{n_l}(ℓ_r) + Û_{n_l}(L_r) ≤ L_{n_l}(ℓ_r) + r/K.  (19)

Let us consider the two cases:

1. L(ℓ²) ≤ r
2. L(ℓ²) > r

In the first case ℓ_r = ℓ, so the following inequality holds:

L(ℓ) = L(ℓ_r) ≤ L_{n_l}(ℓ_r) + r/K = L_{n_l}(ℓ) + r/K.  (20)

In the second case, ℓ_r = [r/L(ℓ²)] ℓ and r/L(ℓ²) ∈ [0, 1]. Then, using L(ℓ²) ≤ L(ℓ) from Eq. (10):

L(ℓ) − L_{n_l}(ℓ) = [L(ℓ²)/r] [L(ℓ_r) − L_{n_l}(ℓ_r)] ≤ [L(ℓ²)/r] Û_{n_l}(L_r) ≤ [L(ℓ)/r] (r/K) = L(ℓ)/K.  (21)

By solving the last inequality with respect to L(ℓ), we obtain the bound over
L(ℓ):

L(ℓ) ≤ [K/(K−1)] L_{n_l}(ℓ),   L(ℓ²) > r, ℓ ∈ L.  (22)

Note that L(ℓ) ≡ L(h) and L_{n_l}(ℓ) ≡ L_{n_l}(h), with h ∈ H, if ℓ ∈ L. By
combining the results of Eqns. (20) and (22), the theorem is proved.
The next step shows that the normalized hypothesis space defined in
Theorem 3.1 is a subset of a new star–shaped space.
Lemma 3.2. Let

L^s_r = { αℓ : α ∈ [0, 1], ℓ ∈ L, L[(αℓ)²] ≤ r },  (23)

then

L_r ⊆ L^s_r.  (24)

Proof. Let us consider L_r in the two cases introduced above:

1. L(ℓ²) ≤ r
2. L(ℓ²) > r

In the first case, ℓ_r = ℓ and then:

L(ℓ²_r) = L(ℓ²) ≤ r.  (25)

In the second case, L(ℓ²) > r, so we have that

ℓ_r = [r/L(ℓ²)] ℓ,   r/L(ℓ²) ≤ 1,  (26)

and the following bound holds:

L(ℓ²_r) ≤ [r/L(ℓ²)] L(ℓ²) = r.  (27)

Thus, the property of Eq. (24) is proved.
If we consider a sub–root function that upper-bounds the Expected Ra-
demacher Complexity of the hypothesis space defined in Lemma 3.2, we can
exploit Theorem 3.1 for bounding the generalization error of a function cho-
sen in the original hypothesis space H.
Theorem 3.3. Let us consider a sub–root function ψ_{n_l}(r), with fixed point
r*_{n_l}, and suppose that, ∀r > r*_{n_l}:

R_{n_l}(L^s_r) ≤ ψ_{n_l}(r).  (28)

Then, ∀h ∈ H and ∀K > 1 we have that, with probability (1 − e^{−x}):

L(h) ≤ max{ [K/(K−1)] L_{n_l}(h), L_{n_l}(h) + K r*_{n_l} + 2√(x/(2n_l)) }.  (29)
Proof. By exploiting the properties of the Uniform Deviation and the Rade-
macher Complexity (see Lemma Appendix A.1 and Lemma Appendix A.2)
and the properties of the sub–root functions, listed in Appendix B, we have
that, with probability (1 − e^{−x}):

Û_{n_l}(L_r) ≤ U_{n_l}(L_r) + √(x/(2n_l)) ≤ R_{n_l}(L_r) + √(x/(2n_l)) ≤ R_{n_l}(L^s_r) + √(x/(2n_l)) ≤ ψ_{n_l}(r) + √(x/(2n_l)) ≤ √(r r*_{n_l}) + √(x/(2n_l)).  (30)

The last step of the proof consists in showing that r can be chosen such that
Û_{n_l}(L_r) ≤ r/K with K > 1 and r ≥ r*_{n_l}, so that we can exploit Theorem 3.1
and conclude the proof. For this purpose, we set A = √(r*_{n_l}) and C = √(x/(2n_l)).
Thus, we have to find the solution of:

A√r + C = r/K,  (31)

which is

r = (K²/2) [ (2C/K + A²) + √( (2C/K + A²)² − 4C²/K² ) ].  (32)

It is straightforward to prove that:

r ≥ A²K² ≥ A² = r*_{n_l},  (33)

r ≤ A²K² + 2CK.  (34)

By substituting Eq. (34) into Theorem 3.1, the proof is complete.
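The closed form of Eq. (32) is easy to validate numerically; the following check (with arbitrary hypothetical values of A, C and K) confirms that it satisfies Eq. (31) together with the two-sided estimates (33)–(34):

```python
# Check that the closed form of Eq. (32) solves A*sqrt(r) + C = r/K (Eq. (31))
# and respects the bounds of Eqs. (33)-(34).  A, C, K are arbitrary choices.
import math

A, C, K = 0.3, 0.05, 2.0
inner = 2 * C / K + A ** 2
r = (K ** 2 / 2) * (inner + math.sqrt(inner ** 2 - 4 * C ** 2 / K ** 2))

print(abs(A * math.sqrt(r) + C - r / K) < 1e-12)   # Eq. (31) is satisfied
print(r >= A ** 2 * K ** 2)                        # Eq. (33)
print(r <= A ** 2 * K ** 2 + 2 * C * K)            # Eq. (34)
```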
The previous theorem holds for any sub–root function which satisfies Eq.
(28). The next lemma shows that the Rademacher Complexity defined in the
last theorem is itself a sub–root function, so that the inequality of Eq. (28)
can be taken as an equality.
Lemma 3.4. Let us consider R_{n_l}(L^s_r), namely the Expected Rademacher
Complexity computed on L^s_r. Then:

ψ_{n_l}(r) = R_{n_l}(L^s_r)  (35)

is a sub–root function.

Proof. In order to prove the lemma, the following properties must apply (see
Definition 2.5):

1. ψ_{n_l}(r) is positive
2. ψ_{n_l}(r) is non–decreasing
3. ψ_{n_l}(r)/√r is non–increasing

Since all the relations below hold for every realization of the data, they are
preserved when taking the expectation E_µ. Concerning the first property, we
can note that:

ψ_{n_l}(r) = E_σ sup_{ℓ∈L^s_r} (2/n_l) ∑_{i=1}^{n_l} σ_i ℓ(h(X^l_i), Y^l_i) ≥ sup_{ℓ∈L^s_r} E_σ (2/n_l) ∑_{i=1}^{n_l} σ_i ℓ(h(X^l_i), Y^l_i) = 0.  (36)

Concerning the second property, we have that, for 0 ≤ r₁ ≤ r₂:

L^s_{r₁} ⊆ L^s_{r₂},  (37)

and therefore

ψ_{n_l}(r₁) = E_σ sup_{ℓ∈L^s_{r₁}} (2/n_l) ∑_{i=1}^{n_l} σ_i ℓ(h(X^l_i), Y^l_i) ≤ E_σ sup_{ℓ∈L^s_{r₂}} (2/n_l) ∑_{i=1}^{n_l} σ_i ℓ(h(X^l_i), Y^l_i) = ψ_{n_l}(r₂).  (38)

Finally, concerning the third property, for 0 < r₁ ≤ r₂ we define the following
quantity:

ℓ^σ_{r₂} = arg sup_{ℓ∈L^s_{r₂}} (2/n_l) ∑_{i=1}^{n_l} σ_i ℓ(h(X^l_i), Y^l_i),   L[(ℓ^σ_{r₂})²] ≤ r₂.  (39)

Note that, since r₁/r₂ ≤ 1, we have that √(r₁/r₂) ℓ^σ_{r₂} ∈ L^s_{r₁}. In fact:

L[(√(r₁/r₂) ℓ^σ_{r₂})²] = (r₁/r₂) L[(ℓ^σ_{r₂})²] ≤ r₁.  (40)

Thus, we have that:

ψ_{n_l}(r₁) = E_σ sup_{ℓ∈L^s_{r₁}} (2/n_l) ∑_{i=1}^{n_l} σ_i ℓ(h(X^l_i), Y^l_i) ≥ E_σ (2/n_l) ∑_{i=1}^{n_l} σ_i √(r₁/r₂) ℓ^σ_{r₂}(h(X^l_i), Y^l_i)
= √(r₁/r₂) E_σ sup_{ℓ∈L^s_{r₂}} (2/n_l) ∑_{i=1}^{n_l} σ_i ℓ(h(X^l_i), Y^l_i) = √(r₁/r₂) ψ_{n_l}(r₂),  (41)

which allows proving the claim, since

ψ_{n_l}(r₂)/√r₂ ≤ ψ_{n_l}(r₁)/√r₁.  (42)
The next two lemmas allow us to substitute the non–computable expected
quantities, L^s_r and R_{n_l}, with their empirical counterparts, which can be
computed from the data.
Lemma 3.5. Let us suppose that

r ≥ R_{n_l}(L^s_r),  (43)

and let us define

L̂^s_r = { αℓ : α ∈ [0, 1], ℓ ∈ L, L_{n_l}[(αℓ)²] ≤ 3r + √(x/(2n_l)) }.  (44)

Then, the following inclusion holds with probability (1 − e^{−x}):

L^s_r ⊆ L̂^s_r.  (45)

Proof. By exploiting Lemma Appendix A.1 and Lemma Appendix A.2 we
have that, with probability (1 − e^{−x}) and ∀ℓ' ∈ L^s_r:

L_{n_l}[(ℓ')²] ≤ L[(ℓ')²] + Û²_{n_l}(L^s_r) ≤ L[(ℓ')²] + U²_{n_l}(L^s_r) + √(x/(2n_l)) ≤ r + 2R_{n_l}(L^s_r) + √(x/(2n_l)) ≤ 3r + √(x/(2n_l)),  (46)

which concludes the proof.
Lemma 3.6. Let us consider two sub–root functions and their fixed points:

ψ_{n_l}(r) = R_{n_l}(L^s_r),   ψ_{n_l}(r*_{n_l}) = r*_{n_l},  (47)

ψ̂_{n_l}(r) = R̂_{n_l}(L̂^s_r) + √(2x/n_l),   ψ̂_{n_l}(r̂*_{n_l}) = r̂*_{n_l}.  (48)

The following inequalities hold with probability (1 − 2e^{−x}):

ψ_{n_l}(r) ≤ ψ̂_{n_l}(r),   r*_{n_l} ≤ r̂*_{n_l}.  (49)

Proof. By exploiting Lemma Appendix A.2 and Lemma 3.5 we can obtain
the following chain of inequalities, which holds with probability (1 − 2e^{−x}):

ψ_{n_l}(r) = R_{n_l}(L^s_r) ≤ R̂_{n_l}(L^s_r) + √(2x/n_l) ≤ R̂_{n_l}(L̂^s_r) + √(2x/n_l) = ψ̂_{n_l}(r).  (50)

Both ψ_{n_l}(r) and ψ̂_{n_l}(r) are sub–root functions (as proved in Lemma 3.4 and
in accordance with the properties of Appendix B), with fixed points r*_{n_l} and
r̂*_{n_l}, respectively. Then, since ψ_{n_l}(r) ≤ ψ̂_{n_l}(r), we have r*_{n_l} ≤ r̂*_{n_l};
moreover, r*_{n_l} ≥ R_{n_l}(L^s_r), thanks to the properties of the sub–root
functions (see Appendix B), as required by Lemma 3.5.
Finally, we derive the main result of this section, namely a fully empir-
ical Local Rademacher Complexity bound on the generalization error of a
function, chosen in the original hypothesis space H.
Theorem 3.7. Let us consider a space of functions H and the fixed point r̂*_{n_l}
of the following sub–root function:

ψ̂_{n_l}(r) = R̂_{n_l}(L̂^s_r) + √(2x/n_l),  (51)

where

L̂^s_r = { αℓ : α ∈ [0, 1], ℓ ∈ L, L_{n_l}(ℓ²) ≤ (1/α²) [3r + √(x/(2n_l))] }.  (52)

Then, ∀h ∈ H and ∀K > 1, the following inequality holds with probability
(1 − 3e^{−x}):

L(h) ≤ max{ [K/(K−1)] L_{n_l}(h), L_{n_l}(h) + K r̂*_{n_l} + 2√(x/(2n_l)) }.  (53)

Proof. The theorem can be straightforwardly proved by combining the results
of Theorem 3.3 and Lemma 3.6.
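To make the structure of Theorem 3.7 concrete, here is an end-to-end toy sketch, entirely our own construction (a finite class represented by loss vectors, a coarse α grid, Monte-Carlo estimation of R̂, and arbitrary constants): build the localized class, iterate ψ̂ to its fixed point, and evaluate the bound.

```python
# Toy sketch of the fully empirical recipe of Theorem 3.7.  All choices
# (class, alpha grid, constants) are hypothetical illustrations.
import math
import random

random.seed(3)
n, x, K = 100, 3.0, 2.0
loss_vectors = [[random.random() for _ in range(n)] for _ in range(3)]
alphas = [0.0, 0.5, 1.0]

def localized(r):
    # empirical localized class: keep alpha*l with small empirical square loss
    thr = 3 * r + math.sqrt(x / (2 * n))
    out = []
    for vec in loss_vectors:
        for a in alphas:
            if sum((a * v) ** 2 for v in vec) / n <= thr:
                out.append([a * v for v in vec])
    return out

def rademacher(vectors, n_mc=100):
    total = 0.0
    for _ in range(n_mc):
        sigma = [random.choice((-1, 1)) for _ in range(n)]
        total += max(sum(s * v for s, v in zip(sigma, vec)) for vec in vectors)
    return (2.0 / n) * total / n_mc

def psi_hat(r):
    return rademacher(localized(r)) + math.sqrt(2 * x / n)

r = 1.0
for _ in range(15):                      # fixed-point iteration r <- psi_hat(r)
    r = psi_hat(r)

emp = sum(loss_vectors[0]) / n           # empirical error of one chosen h
bound = max(K / (K - 1) * emp, emp + K * r + 2 * math.sqrt(x / (2 * n)))
print(emp <= bound <= 2.0)
```

In this toy example the localization constraint is loose and filters little; the point is only to show how the pieces of the theorem fit together numerically.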
4. Exploiting unlabeled samples to tighten the bound
When unlabeled samples are available, a training set consisting of both
labeled and unlabeled samples can be defined:

D_{n_l+n_u} = D_{n_l} ∪ D_{n_u} = { (X^{l+u}_1, Y^{l+u}_1), …, (X^{l+u}_{n_l+n_u}, Y^{l+u}_{n_l+n_u}) },  (54)

and m, such that (m − 1) n_l < n_u + n_l ≤ m n_l (supposing m > 1). In order
to prove a new fully empirical Local Rademacher Complexity bound in this
new framework, we need to recall some additional properties and definitions.
Definition 4.1. The Rademacher Complexity of H is:

R̂_{n_l}(H) = E_σ sup_{h∈H} (1/n_l) ∑_{i=1}^{n_l} σ_i h(X^l_i).  (55)

Only the input patterns X^l_i are needed to compute the Rademacher
Complexity of H, according to Eq. (55).
Remark 4.2. In order to compute R̂_{n_l}(H), the labels Y^l_i, i ∈ {1, …, n_l}, are
not required.
Moreover, we can show that the Rademacher Complexity computed on
the set of functions is strictly related to the Rademacher Complexity com-
puted on the set of loss functions.
Remark 4.3. Under some mild conditions:

R̂_{n_l}(L) ≤ C_ℓ R̂_{n_l}(H),  (56)

where C_ℓ ∈ (0, ∞) is a constant which depends only on the loss ℓ.
For example, in binary classification, where Y ∈ {±1} and h(X) ∈ {±1}, we
have the hard loss:

ℓ_H(h(X), Y) = [1 − Y h(X)] / 2,  (57)

for which C_{ℓ_H} = 1. If, instead, we have an L-Lipschitz loss function,

|ℓ_L(h₁(X), Y) − ℓ_L(h₂(X), Y)| ≤ L |h₁(X) − h₂(X)|,  (58)

by exploiting the contraction inequality (Ledoux & Talagrand, 1991; Koltchin-
skii, 2011):

R̂_{n_l}(L) ≤ 2L R̂_{n_l}(H),  (59)

so that C_{ℓ_L} = 2L. Many well-known and commonly exploited loss functions
are L–Lipschitz. In the classification framework, where Y ∈ {±1} and
h(X) ∈ [−1, 1], examples are the ρ–margin loss (Bartlett & Mendelson, 2003),

ℓ_ρ(h(X), Y) = min[1, max[0, 1 − Y h(X)/ρ]],  (60)

and the soft loss (Anguita et al., 2011),

ℓ_S(h(X), Y) = {1 − min[1, max[0, Y h(X)]]} / 2.  (61)

In the bounded regression framework, examples are the square loss (Rosasco
et al., 2004),

ℓ_{|·|²}(h(X), Y) = min[1, max[0, (Y − h(X))²]],  (62)

and the L1 loss (Rosasco et al., 2004),

ℓ_{|·|}(h(X), Y) = min[1, max[0, |Y − h(X)|]].  (63)
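For concreteness, the bounded losses above can be written out directly. The function names are ours, and the soft loss follows our reconstruction of Eq. (61); all of them map into [0, 1], as the framework requires:

```python
# The bounded losses of Eqs. (57)-(63), written out.  Names are ours; the
# soft loss l_soft follows our reading of Eq. (61).
def l_hard(hx, y):            # Eq. (57), hx and y in {-1, +1}
    return (1 - y * hx) / 2

def l_rho(hx, y, rho=0.5):    # Eq. (60), rho-margin loss, clipped to [0, 1]
    return min(1.0, max(0.0, 1 - y * hx / rho))

def l_soft(hx, y):            # Eq. (61), soft loss (as reconstructed here)
    return (1 - min(1.0, max(0.0, y * hx))) / 2

def l_sq(hx, y):              # Eq. (62), clipped square loss
    return min(1.0, max(0.0, (y - hx) ** 2))

def l_abs(hx, y):             # Eq. (63), clipped L1 loss
    return min(1.0, max(0.0, abs(y - hx)))

print(l_hard(1, 1), l_hard(-1, 1))        # 0.0 on a hit, 1.0 on a miss
print(all(0.0 <= f(0.2, 1) <= 1.0 for f in (l_rho, l_soft, l_sq, l_abs)))
```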
The notion of Rademacher Complexity can be extended in order to take
unlabeled samples into account (Anguita et al., 2011).

Definition 4.4. The Extended Rademacher Complexity (Anguita et al., 2011)
and its expected version (the Expected Extended Rademacher Complexity) are,
respectively:

R̂_{n_u}(L) = (1/m) ∑_{j=1}^{m} E_σ sup_{ℓ∈L} (2/n_l) ∑_{i=1}^{n_l} σ_i ℓ(h(X^{l+u}_{(j−1)n_l+i}), Y^{l+u}_{(j−1)n_l+i}) = (1/m) ∑_{j=1}^{m} R̂^j_{n_l}(L),  (64)

R_{n_u}(L) = E_µ R̂_{n_u}(L).  (65)
Remark 4.5. Note that the labels Y^{l+u}_i are unknown for all the samples
X^{l+u}_i coming from D_{n_u}. As a consequence, R̂_{n_u}(L) is not computable.
For the sake of readability, we refer the reader to Appendix A, where
some additional properties of the Extended Rademacher Complexity are sum-
marized.
The notion of Extended Rademacher Complexity can be exploited to
derive a Local Extended Rademacher Complexity bound, which takes into
account also the unlabeled samples (if available), and to obtain a bound,
analogous to the one of Theorem 3.7. The steps to obtain this result are
similar to the ones of Section 3, but we have to properly modify Lemmas 3.4
and 3.6 so as to deal with the extended definition of the notion of complexity.
The results of Lemma 3.4 can be adapted to the new context by noting
that the Extended Rademacher Complexity is a sub–root function because
it is a sum of sub–root functions (see Appendix B). The modification of
Lemma 3.6, instead, requires that two results are proved. The first one
(Lemma 4.6) shows that we can relate the Expected Rademacher Complexity
to the Extended Rademacher Complexity, both computed on the set of loss
functions. Unfortunately, measures computed on the space of loss functions
still cannot be computed from data, since they require labels to be known.
Therefore, we can replace the Extended Rademacher Complexity computed
on the set of loss functions with its counterpart computed on the set of
functions (Lemma 4.7). The latter is a fully empirical quantity, since it does
not require the knowledge of the labels.
Lemma 4.6. Let us consider two sub–root functions and their fixed points:

ψ_{n_l}(r) = R_{n_l}(L^s_r),   ψ_{n_l}(r*_{n_l}) = r*_{n_l},  (66)

ψ_{n_u}(r) = R̂_{n_u}(L̂^s_r) + √(2x/(m n_l)),   ψ_{n_u}(r*_{n_u}) = r*_{n_u}.  (67)

Then the following inequalities hold with probability (1 − 2e^{−x}):

ψ_{n_l}(r) ≤ ψ_{n_u}(r),   r*_{n_l} ≤ r*_{n_u}.  (68)

Proof. We exploit Lemma 3.5, Lemma Appendix A.5 and Lemma Appendix A.6
to obtain the following chain of inequalities, which holds with probability
(1 − 2e^{−x}):

ψ_{n_l}(r) = R_{n_l}(L^s_r) = R_{n_u}(L^s_r) ≤ R̂_{n_u}(L^s_r) + √(2x/(m n_l)) ≤ R̂_{n_u}(L̂^s_r) + √(2x/(m n_l)) = ψ_{n_u}(r).  (69)

Note that ψ_{n_l}(r) is a sub–root function, and so is ψ_{n_u}(r), since it is a sum
of sub–root functions. Their fixed points are, respectively, r*_{n_l} and r*_{n_u}. Thus,
since ψ_{n_l}(r) ≤ ψ_{n_u}(r), for the properties of the sub–root functions we have
r*_{n_l} ≤ r*_{n_u} and r*_{n_u} ≥ R_{n_l}(L^s_r), as required by Lemma 3.5.
Lemma 4.7. Let us define Ĥ^s_r as

Ĥ^s_r = { αh : α ∈ [0, 1], ℓ ∈ L, L_{n_l}[(αℓ)²] ≤ 3r + √(x/(2n_l)) }.  (71)

Let us consider two sub–root functions and their fixed points:

ψ_{n_u}(r) = R̂_{n_u}(L̂^s_r) + √(2x/(m n_l)),   ψ_{n_u}(r*_{n_u}) = r*_{n_u},  (72)

ψ̂_{n_u}(r) = C_ℓ R̂_{n_u}(Ĥ^s_r) + √(2x/(m n_l)),   ψ̂_{n_u}(r̂*_{n_u}) = r̂*_{n_u}.  (73)

Then, the following inequalities hold:

ψ_{n_u}(r) ≤ ψ̂_{n_u}(r),   r*_{n_u} ≤ r̂*_{n_u}.  (74)

Proof. By exploiting the property of Remark 4.3 we have that:

ψ_{n_u}(r) = R̂_{n_u}(L̂^s_r) + √(2x/(m n_l)) ≤ C_ℓ R̂_{n_u}(Ĥ^s_r) + √(2x/(m n_l)) = ψ̂_{n_u}(r).  (75)

Thanks to the properties of sub–root functions, if r*_{n_u} and r̂*_{n_u} are the fixed
points of ψ_{n_u}(r) and ψ̂_{n_u}(r) respectively, then r*_{n_u} ≤ r̂*_{n_u}.
Remark 4.8. Note that R̂_{n_u}(Ĥ^s_r) can be computed from the available data,
as it does not require the labels Y^u_i, i ∈ {1, …, n_u}, to be known.
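Remark 4.8 suggests a direct computation: split the pooled inputs into m blocks of n_l points and average the per-block complexities, as in Definition 4.4, but on the space of functions. A sketch with a toy threshold class (our own construction; the 2/n_l normalization of Definition 4.4 is assumed):

```python
# Sketch of an Extended-Rademacher-style quantity on H (Definition 4.4 /
# Remark 4.8): average per-block complexities over m blocks.  No labels used.
import random

random.seed(5)
n_l, m = 40, 3
xs = [random.random() for _ in range(n_l * m)]     # labeled + unlabeled inputs

def h(theta, x):                                   # hypothetical threshold class
    return 1.0 if x > theta else -1.0

thetas = [0.3, 0.6]

def block_rc(block, n_mc=400):
    total = 0.0
    for _ in range(n_mc):
        sigma = [random.choice((-1, 1)) for _ in range(len(block))]
        total += max(sum(s * h(t, x) for s, x in zip(sigma, block)) for t in thetas)
    return 2.0 * total / (len(block) * n_mc)

blocks = [xs[j * n_l:(j + 1) * n_l] for j in range(m)]
R_ext = sum(block_rc(b) for b in blocks) / m       # the (1/m)-average of Eq. (64)
print(0.0 < R_ext <= 2.0)
```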
We finally derive the fully empirical Local Extended Rademacher Com-
plexity bound on the generalization error of a function chosen in the original
hypothesis space H. This is the counterpart of Theorem 3.7, but it also
exploits unlabeled samples, when available.
Theorem 4.9. Let us consider a space of functions H and the fixed point r̂*_{n_u}
of the following sub–root function:

ψ̂_{n_u}(r) = C_ℓ R̂_{n_u}(Ĥ^s_r) + √(2x/(m n_l)),  (76)

where

Ĥ^s_r = { αh : α ∈ [0, 1], ℓ ∈ L, L_{n_l}[(αℓ)²] ≤ 3r + √(x/(2n_l)) }.  (77)

Then, ∀h ∈ H and ∀K > 1, the following inequality holds with probability
(1 − 3e^{−x}):

L(h) ≤ max{ [K/(K−1)] L_{n_l}(h), L_{n_l}(h) + K r̂*_{n_u} + 2√(x/(2n_l)) }.  (78)

Proof. The theorem can be straightforwardly proved by combining the results
of Theorem 3.3, Lemma 4.6 and Lemma 4.7.
The bound of Theorem 4.9, which takes into account unlabeled samples,
is always sharper than the bound of Theorem 3.7, and the two coincide when
no unlabeled samples are available (m = 1). In fact, assuming that the value
of the Rademacher Complexity does not remarkably change when we average
m different realizations (see Definition 4.4), we have that ψ̂_{n_u}(r) ≤ ψ̂_{n_l}(r).
Therefore, for the properties of the sub–root functions, r̂*_{n_u} ≤ r̂*_{n_l}.
5. Pushing forward the state–of–the–art: further sharpening the
bound
The bounds of Theorems 3.7 and 4.9 are based on the exploitation of Mc-
Diarmid’s inequalities (McDiarmid, 1989). We exploit in this section more
refined concentration inequalities (Boucheron et al., 2000; Bousquet, 2002;
Boucheron et al., 2013), based on the milestone results of Talagrand Ledoux
& Talagrand (1991). This approach improves the technique proposed by
Bartlett et al. (2005) and obtains optimal constants for the bounds. To the
20
best knowledge of the authors, these results represent the state–of–art Lo-
cal Rademacher Complexity bounds, which are achieved by avoiding some
unnecessary upper bounds at the expenses of a slightly more complex formu-
lation with respect to the ones of Theorems 3.7 and 4.9.
As a first issue, we start by considering the case when no unlabeled sam-
ples are available (Theorem 3.7). For this purpose, we modify Theorem 3.3,
by giving up the closed form solution. Then, we exploit the more refined
concentration inequalities in Lemmas 3.5 and 3.6. By combining these dif-
ferent pieces, the desired bound, taking into account only labeled samples,
can be derived.
The first step is to obtain the counterpart of Theorem 3.3.
Theorem 5.1. Let us consider a sub–root function ψ_{n_l}(r) and its fixed point
r*_{n_l}, and suppose that, ∀r > r*_{n_l}:

R_{n_l}(L^s_r) ≤ ψ_{n_l}(r).  (79)

Let us define r_U as the largest solution, with respect to r, of the following
equation:

√(r r*_{n_l}) + [2√(r r*_{n_l}) + r] φ̲( x / (n_l [2√(r r*_{n_l}) + r]) ) = r/K.  (80)

Then, ∀h ∈ H and ∀K > 1 we have that, with probability (1 − e^{−x}):

L(h) ≤ max{ [K/(K−1)] L_{n_l}(h), L_{n_l}(h) + r_U/K }.  (81)
Proof. Analogously to what was done in Theorem 3.3, we upper bound the
uniform deviation Û_{n_l}(L_r). In this case, we exploit Lemma Appendix A.1,
Lemma Appendix A.4 and Lemma 3.2:

Û_{n_l}(L_r) ≤ √(r r*_{n_l}) + [2√(r r*_{n_l}) + r] φ̲( x / (n_l [2√(r r*_{n_l}) + r]) ) = A√r + C(r) ≤ r/K,  (82)

where A = √(r*_{n_l}) and C(r) collects the φ̲ term. Let us define r_U as the
largest possible solution of the following equation:

A√r + C(r) = r/K.  (83)

Since C(r) ≥ 0 for every r, at r = r_U we have:

A√(r_U) ≤ A√(r_U) + C(r_U) = r_U/K.  (84)

Then

r_U ≥ A²K² ≥ r*_{n_l}.  (85)

Combining this result with the one of Theorem 3.1 allows proving the theo-
rem.
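The largest solution r_U of Eq. (83) can be found numerically by scanning for the last sign change and bisecting. In the sketch below, C(r) is a hypothetical non-negative stand-in for the φ̲ term of Eq. (80); the same scheme applies verbatim to the actual expression:

```python
# Numerically locate r_U, the largest solution of A*sqrt(r) + C(r) = r/K
# (Eq. (83)).  C(r) is a hypothetical stand-in for the phi-term of Eq. (80).
import math

A, K = 0.3, 2.0

def C(r):
    return 0.05 / (1.0 + r)        # any non-negative, well-behaved stand-in

def f(r):
    return A * math.sqrt(r) + C(r) - r / K

# scan a grid for the last sign change, then bisect inside that bracket
grid = [i / 1000 for i in range(1, 20001)]        # r in (0, 20]
lo = hi = None
for a, b in zip(grid, grid[1:]):
    if f(a) > 0 >= f(b):
        lo, hi = a, b                              # keep the last bracket found
for _ in range(60):
    mid = (lo + hi) / 2
    if f(mid) > 0:
        lo = mid
    else:
        hi = mid
r_U = (lo + hi) / 2
print(abs(f(r_U)) < 1e-6 and r_U >= A ** 2 * K ** 2)   # consistent with Eq. (85)
```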
The next two results are the counterparts of Lemmas 3.5 and 3.6.
Theorem 5.2. Let us suppose that:

r ≥ R_{n_l}(L^s_r).  (86)

Let us define L̂^s_r as:

L̂^s_r = { αℓ : α ∈ [0, 1], ℓ ∈ L, L_{n_l}[(αℓ)²] ≤ 3r + 5r φ̲( x/(5 n_l r) ) }.  (87)

Then, with probability (1 − e^{−x}):

L^s_r ⊆ L̂^s_r.  (88)

Proof. Let us consider Lemma Appendix A.1 and Lemma Appendix A.4.
Then, we have that, ∀ℓ' ∈ L^s_r and with probability (1 − e^{−x}):

L_{n_l}[(ℓ')²] ≤ r + 2r + (4r + r) φ̲( x / (n_l (4r + r)) ) ≤ 3r + 5r φ̲( x/(5 n_l r) ),  (89)

which allows proving the theorem.
Lemma 5.3. Let us consider two sub–root functions and their fixed points:

ψ_{n_l}(r) = R_{n_l}(L^s_r),   ψ_{n_l}(r*_{n_l}) = r*_{n_l},  (90)

ψ̂_{n_l}(r) = R̂_{n_l}(L̂^s_r) + r φ̄( 2x/(n_l r) ),   ψ̂_{n_l}(r̂*_{n_l}) = r̂*_{n_l}.  (91)

Then, the following inequalities hold with probability (1 − 2e^{−x}):

ψ_{n_l}(r) ≤ ψ̂_{n_l}(r),   r*_{n_l} ≤ r̂*_{n_l}.  (92)

Proof. By exploiting Lemma Appendix A.4, Theorem 5.2 and the sub–root
properties (see Appendix B), we have that:

ψ_{n_l}(r) = R_{n_l}(L^s_r) ≤ R̂_{n_l}(L^s_r) + R_{n_l}(L^s_r) φ̄( 2x / (n_l R_{n_l}(L^s_r)) ) ≤ R̂_{n_l}(L̂^s_r) + r φ̄( 2x/(n_l r) ) = ψ̂_{n_l}(r).  (93)

Since ψ_{n_l}(r) ≤ ψ̂_{n_l}(r), we have that r*_{n_l} ≤ r̂*_{n_l}.
Finally, we derive the tighter version of the bound, which contemplates
only labeled samples.

Theorem 5.4. Let us consider a space of functions H and the fixed point r̂*_{n_l}
of the following sub–root function:

ψ̂_{n_l}(r) = R̂_{n_l}(L̂^s_r) + r φ̄( 2x/(n_l r) ),  (94)

where

L̂^s_r = { αℓ : α ∈ [0, 1], ℓ ∈ L, L_{n_l}[(αℓ)²] ≤ 3r + 5r φ̲( x/(5 n_l r) ) }.  (95)

Let us define r_U as the largest solution of the following equation:

√(r r̂*_{n_l}) + [2√(r r̂*_{n_l}) + r] φ̲( x / (n_l [2√(r r̂*_{n_l}) + r]) ) = r/K.  (96)

Then, ∀h ∈ H and ∀K > 1, the following inequality holds with probability
(1 − 3e^{−x}):

L(h) ≤ max{ [K/(K−1)] L_{n_l}(h), L_{n_l}(h) + r_U/K }.  (97)

Proof. The theorem can be straightforwardly proved by combining the results
of Theorem 5.1 and Lemma 5.3.
Theorem 5.4 is a sharper version of the bound of Theorem 3.7. Note that
both Theorem 5.4 and Theorem 3.7 are in implicit form. However, Theorem
5.4 requires searching for a fixed point and finding the largest solution of an
equation; Theorem 3.7, instead, only requires finding a fixed point.
Analogously, it is possible to derive a tighter bound when unlabeled sam-
ples are available.

Theorem 5.5. Let us consider a space of functions H and the fixed point
r̂*_{n_u} of the following sub–root function:

ψ̂_{n_u}(r) = C_ℓ R̂_{n_u}(Ĥ^s_r) + r φ̄( 2x/(m n_l r) ),  (98)

where

Ĥ^s_r = { αh : α ∈ [0, 1], ℓ ∈ L, L_{n_l}[(αℓ)²] ≤ 3r + 5r φ̲( x/(5 n_l r) ) }.  (99)

Let us define r_U as the largest solution of the following equation:

√(r r̂*_{n_u}) + [2√(r r̂*_{n_u}) + r] φ̲( x / (n_l [2√(r r̂*_{n_u}) + r]) ) = r/K.  (100)

Then, ∀h ∈ H and ∀K > 1, the following inequality holds with probability
(1 − 3e^{−x}):

L(h) ≤ max{ [K/(K−1)] L_{n_l}(h), L_{n_l}(h) + r_U/K }.  (101)
Note that the bound of Theorem 5.5 coincides with the one of Theorem
5.4 when m = 1, namely when no unlabeled samples are available. Analogous
considerations hold for Theorem 4.9 and Theorem 3.7.

To the best of our knowledge, Theorem 5.4 is the sharpest Local Ra-
demacher Complexity bound for the usual case when only labeled samples
are available. Theorem 5.5 further improves it, since it also contemplates
unlabeled samples.
6. Discussion: How tight are the new bounds?
In this section, we benchmark the bounds proposed in this paper against
state-of-the-art results in the literature, aiming to show the positive effect
of exploiting unlabeled samples, when available. In particular, we compare
the following bounds:
(I) Theorem 5.5, which coincides with the one of Theorem 5.4 when m = 1
(II) The sharpest bound based on Global Rademacher Complexity with un-
labeled data (Anguita et al., 2011)
(III) The state-of-the-art Local Rademacher Complexity bound (Corollary
5.1 in Bartlett et al. (2005)).
Figure 1: Comparison of the bound computed with no unlabeled samples proposed in this
paper against the state–of–the–art Local Rademacher Complexity bound.

Figure 2: Comparison of the bounds proposed in this paper against the state–of–the–art
bound with unlabeled samples (m = 1).

Figure 3: Comparison of the bounds proposed in this paper against the state–of–the–art
bound with unlabeled samples (m = 10).

Figure 4: Comparison of the bounds proposed in this paper against the state–of–the–art
bound with unlabeled samples (m = 100).

Figure 5: Comparison of the bounds proposed in this paper against the state–of–the–art
bound with unlabeled samples (m = 1000).

Figure 6: Effect of the unlabeled samples on the bound proposed in this paper (m ∈
{1, 10, 100, 1000}).
Results are obtained with respect to the scenario where H = {h*} and
L_{n_l}(h*) = 0. In this case, the hypothesis space is composed of only one
model, which makes no errors on the labeled training data: this is the best-
case scenario for every configuration (global and local settings, both when
unlabeled samples are considered and when they are unavailable), thus
allowing a fair comparison of the different bounds.
Figure 1 proposes the comparison in the scenario when no unlabeled sam-
ples are available. In particular, we benchmark the bound proved in Theorem
5.4 (or, equivalently, Theorem 5.5 with m = 1) against the Local Rademacher
Complexity bound of Corollary 5.1 in Bartlett et al. (2005), as the number
of labeled samples n_l is varied: the plot shows that, independently of
the number of training samples, the proposed bound is tighter than the one
previously proposed in the literature.
Figures 2, 3, 4, and 5, instead, compare the bound proposed
in Theorem 5.5 with the sharpest bound based on Global Rademacher Com-
plexity with unlabeled data (Anguita et al., 2011), as both the number of
labeled samples n_l and the unlabeled-to-labeled ratio m are varied. Generally, the proposed
bound is tighter than the previous proposition in the literature, especially as
the number of unlabeled samples increases: this effect is particularly evident
for m > 100. In fact, while the state-of-the-art Global Rademacher Complexity
bound by Anguita et al. (2011) does not remarkably improve as m increases,
Figure 6 shows that the bound of Theorem 5.5 becomes much tighter
as the number of available unlabeled samples grows, especially when few
labeled samples are available (nl < 500). As a matter of fact, the additional
information related to unlabeled samples helps in defining more refined local
spaces, thus leading to benefits in terms of tightness.
As a final remark, it is worth noting that this work also represents a
starting point for future research. In fact, the effectiveness of the proposed
bound will have to be tested in real-world applications, e.g. as a tool for
model selection and error estimation (Bartlett et al., 2002a; Anguita et al.,
2011). As a second issue, since Local Rademacher Complexity has been
exploited for different purposes (e.g. Multiple Kernel Learning (Kloft &
Blanchard, 2011; Cortes et al., 2013)), an interesting research perspective
consists in verifying whether the Extended Local Rademacher Complexity
allows obtaining improved results in these frameworks. Finally, it is also
worth exploring, from a theoretical point of view, whether making additional
hypotheses (e.g. low noise (Srebro et al., 2010; Audibert & Tsybakov, 2007;
Steinwart & Scovel, 2007)) can further improve the tightness of the
proposed bound.
Appendix A. Properties of the Uniform Deviation, the Radema-
cher Complexity and the Extended Rademacher Com-
plexity
The following lemma shows how to upper bound the Uniform Deviation
in terms of the Rademacher Complexity. Moreover, it allows us to upper
bound the Rademacher Complexity of the square loss in terms of the
Rademacher Complexity of the loss.
Lemma Appendix A.1. For the Rademacher Complexity and the Uni-
form Deviation, the following properties hold (Bartlett & Mendelson, 2003;
Koltchinskii, 2011):

U_{n_l}(L) ≤ R_{n_l}(L),   U²_{n_l}(L) ≤ R²_{n_l}(L) ≤ 2R_{n_l}(L).  (A.1)
The proof is mainly based on the symmetrization lemma (Bartlett &
Mendelson, 2003) and the contraction inequality (Ledoux & Talagrand, 1991;
Bartlett et al., 2005; Koltchinskii, 2011).
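To make the quantities in Lemma Appendix A.1 concrete, the empirical Rademacher complexity of a finite loss class can be estimated by Monte Carlo averaging over random sign vectors. The following sketch (in Python; the function name, the sup-of-averages normalization, and the random loss matrix are our own illustrative assumptions, with losses normalized to [0, 1]) also checks the contraction inequality for the square loss numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, H = 200, 50                                 # n_l samples, |H| hypotheses
losses = rng.uniform(0.0, 1.0, size=(H, n))    # loss of each hypothesis on each sample

def empirical_rademacher(loss_matrix, n_mc=2000, rng=rng):
    """Monte Carlo estimate of sup_h (1/n) sum_i sigma_i * l_h(x_i),
    averaged over n_mc draws of the Rademacher signs sigma_i."""
    H, n = loss_matrix.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))
    corr = sigma @ loss_matrix.T / n           # (n_mc, H): sign/loss correlations
    return corr.max(axis=1).mean()             # sup over hypotheses, mean over signs

r_loss = empirical_rademacher(losses)
r_sq = empirical_rademacher(losses ** 2)
# contraction inequality: t -> t^2 is 2-Lipschitz on [0, 1], so R(l^2) <= 2 R(l)
assert r_sq <= 2.0 * r_loss
```

The contraction step holds because squaring is 2-Lipschitz on [0, 1], which is exactly the mechanism behind the rightmost inequality in (A.1).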
The next two lemmas, instead, show that the Uniform Deviation and the
Rademacher Complexity are tightly concentrated around their expected
values.
Lemma Appendix A.2. The functions $\hat{R}_{n_l}(\mathcal{L})$, $\hat{U}_{n_l}(\mathcal{L})$ and $\hat{U}^2_{n_l}(\mathcal{L})$ are bounded difference functions (Bartlett & Mendelson, 2003). Thus, according to (McDiarmid, 1989), we can state that, with probability $(1 - e^{-x})$:
$$R_{n_l}(\mathcal{L}) \le \hat{R}_{n_l}(\mathcal{L}) + \sqrt{\frac{2x}{n_l}}, \quad (A.2)$$
$$U_{n_l}(\mathcal{L}) \le \hat{U}_{n_l}(\mathcal{L}) + \sqrt{\frac{x}{2 n_l}}, \quad (A.3)$$
$$U^2_{n_l}(\mathcal{L}) \le \hat{U}^2_{n_l}(\mathcal{L}) + \sqrt{\frac{x}{2 n_l}}. \quad (A.4)$$
Definition Appendix A.3. In accordance with (Boucheron et al., 2000; Bousquet, 2002; Oneto et al., 2013), we can define the following functions:
$$\phi(a) = (1 + a)\log(1 + a) - a, \quad a > -1, \quad (A.5)$$
$$\underline{\phi}(a) = 1 - \exp\left[1 + W_{-1}\left(\frac{a - 1}{\exp(1)}\right)\right], \quad \phi\left[-\underline{\phi}(a)\right] = a, \quad a \in [0, 1], \quad (A.6)$$
$$\overline{\phi}(a) = \exp\left[1 + W_{0}\left(\frac{a - 1}{\exp(1)}\right)\right] - 1, \quad \phi\left[\overline{\phi}(a)\right] = a, \quad a \in [0, +\infty), \quad (A.7)$$
where $W_{-1}$ and $W_0$ are, respectively, the two real branches of the Lambert W function (Weisstein, 2013).
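As a numerical sanity check of Definition Appendix A.3, the two inverse branches can be evaluated with `scipy.special.lambertw` (branch `k=0` for $W_0$ and `k=-1` for $W_{-1}$). This sketch (function names are ours) verifies the inverse identities $\phi[-\underline{\phi}(a)] = a$ and $\phi[\overline{\phi}(a)] = a$ on a grid of values:

```python
import numpy as np
from scipy.special import lambertw

def phi(a):
    # phi(a) = (1 + a) log(1 + a) - a, defined for a > -1
    return (1.0 + a) * np.log1p(a) - a

def phi_lower(a):
    # inverse through the W_{-1} branch: phi(-phi_lower(a)) = a, a in [0, 1]
    return 1.0 - np.exp(1.0 + lambertw((a - 1.0) / np.e, k=-1).real)

def phi_upper(a):
    # inverse through the W_0 branch: phi(phi_upper(a)) = a, a in [0, +inf)
    return np.exp(1.0 + lambertw((a - 1.0) / np.e, k=0).real) - 1.0

a = np.linspace(0.05, 0.95, 10)
assert np.allclose(phi(-phi_lower(a)), a)
assert np.allclose(phi(phi_upper(a)), a)
```

Note that `lambertw` returns complex values, so the real part is taken; for the argument range used here both branches are real.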
Lemma Appendix A.4. The function $\hat{R}_{n_l}(\mathcal{L})$ is a self-bounding function (Bartlett & Mendelson, 2003; Boucheron et al., 2000), while $\hat{U}_{n_l}(\mathcal{L})$ and $\hat{U}^2_{n_l}(\mathcal{L})$ satisfy the hypotheses of (Bousquet, 2002; Bartlett et al., 2005). Consequently, we can also state that, with probability $(1 - e^{-x})$:
$$R_{n_l}(\mathcal{L}) \le \hat{R}_{n_l}(\mathcal{L}) + R_{n_l}(\mathcal{L})\,\underline{\phi}\!\left(\frac{2x}{n_l R_{n_l}(\mathcal{L})}\right), \quad (A.8)$$
$$U_{n_l}(\mathcal{L}) \le \hat{U}_{n_l}(\mathcal{L}) + \left[2 U_{n_l}(\mathcal{L}) + L(\ell^2)\right]\underline{\phi}\!\left\{\frac{x}{n_l \left[2 U_{n_l}(\mathcal{L}) + L(\ell^2)\right]}\right\}, \quad (A.9)$$
$$U^2_{n_l}(\mathcal{L}) \le \hat{U}^2_{n_l}(\mathcal{L}) + \left[2 U^2_{n_l}(\mathcal{L}) + L(\ell^2)\right]\underline{\phi}\!\left\{\frac{x}{n_l \left[2 U^2_{n_l}(\mathcal{L}) + L(\ell^2)\right]}\right\}. \quad (A.10)$$
The following lemma links the Expected Extended Rademacher Complexity and the Expected Rademacher Complexity.
Lemma Appendix A.5. The Expected Extended Rademacher Complexity and the Expected Rademacher Complexity coincide.
Proof. The proof is trivial by noting that:
$$R_{n_u}(\mathcal{L}) = \mathbb{E}_{\mu}\,\frac{1}{m}\sum_{j=1}^{m}\hat{R}^j_{n_l}(\mathcal{L}) = \frac{1}{m}\sum_{j=1}^{m}\mathbb{E}_{\mu}\hat{R}^j_{n_l}(\mathcal{L}) = \frac{1}{m}\sum_{j=1}^{m}R_{n_l}(\mathcal{L}) = R_{n_l}(\mathcal{L}). \quad (A.11)$$
The next two lemmas, instead, show that the Extended Rademacher Complexity is tightly concentrated around its expected value.
Lemma Appendix A.6. The Extended Rademacher Complexity $\hat{R}_{n_u}(\mathcal{L})$ is a bounded difference function (since it is a sum of bounded difference functions (Anguita et al., 2011; McDiarmid, 1989)). Thus, we have that, with probability $(1 - e^{-x})$:
$$R_{n_l}(\mathcal{L}) = R_{n_u}(\mathcal{L}) \le \hat{R}_{n_u}(\mathcal{L}) + \sqrt{\frac{2x}{m n_l}}. \quad (A.12)$$
Lemma Appendix A.7. Since $\hat{R}_{n_u}(\mathcal{L})$ is a self-bounding function (as it is a sum of self-bounding functions (Anguita et al., 2011; Boucheron et al., 2000)), we have that, with probability $(1 - e^{-x})$:
$$R_{n_l}(\mathcal{L}) = R_{n_u}(\mathcal{L}) \le \hat{R}_{n_u}(\mathcal{L}) + R_{n_u}(\mathcal{L})\,\underline{\phi}\!\left[\frac{2x}{m n_l R_{n_u}(\mathcal{L})}\right] = \hat{R}_{n_u}(\mathcal{L}) + R_{n_l}(\mathcal{L})\,\underline{\phi}\!\left[\frac{2x}{m n_l R_{n_l}(\mathcal{L})}\right]. \quad (A.13)$$
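To see why the self-bounding refinement of Lemma Appendix A.7 matters, one can compare, for a plugged-in value of the expected complexity, the Lambert-W-based penalty term with the plain bounded-difference penalty $\sqrt{2x/(m n_l)}$ of Lemma Appendix A.6. The following illustrative sketch (the numeric values, and the plug-in treatment of the implicit complexity, are our own assumptions) shows the self-bounding penalty is smaller when the complexity itself is small:

```python
import numpy as np
from scipy.special import lambertw

def phi_lower(a):
    # inverse of phi through the W_{-1} branch: phi(-phi_lower(a)) = a
    return 1.0 - np.exp(1.0 + lambertw((a - 1.0) / np.e, k=-1).real)

x = np.log(1.0 / 0.05)       # confidence 95%, i.e. e^{-x} = 0.05
m, n_l = 10, 100             # 10 blocks of n_l unlabeled samples each
mcdiarmid = np.sqrt(2.0 * x / (m * n_l))
for r in (0.01, 0.05, 0.2):  # plugged-in values of the expected complexity
    self_bounding = r * phi_lower(2.0 * x / (m * n_l * r))
    assert self_bounding < mcdiarmid
```

For large complexity values the ordering can reverse, which is consistent with the self-bounding bound being advantageous precisely in the small-complexity regime targeted by local analyses.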
Appendix B. Sub–Root Functions Properties
The notion of sub–root function has emerged as a key concept in Learning
Theory in recent years (Bartlett et al., 2005; Koltchinskii, 2006; Bartlett
et al., 2002b). Its properties are useful to prove the main results of this
paper and, for this reason, we survey them in the following.
Lemma Appendix B.1. A non–trivial sub–root function $\psi$ satisfies the following properties (Bousquet et al., 2004; Bartlett et al., 2005):
1. $\psi : [0, +\infty) \to [0, +\infty)$;
2. if $0 < \beta \le 1$, then $\psi(\beta r) \ge \sqrt{\beta}\,\psi(r)$;
3. if $\beta \ge 1$, then $\psi(\beta r) \le \sqrt{\beta}\,\psi(r)$;
4. $\psi(r)$ is continuous on $[0, +\infty)$;
5. the equation $\psi(r) = r$ has a unique positive solution $r^*$, which is called the fixed point of $\psi$;
6. for all $r > 0$, $r \ge \psi(r)$ if and only if $r \ge r^*$;
7. let $\psi_1$, $\psi_2$ have fixed points $r^*_1$, $r^*_2$, and let $\alpha \in [0, 1]$ be such that $\alpha\psi_2(r^*_1) \le \psi_1(r^*_1) \le \psi_2(r^*_1)$; then $\alpha^2 r^*_2 \le r^*_1 \le r^*_2$;
8. let $\psi$ have fixed point $r^*$ and let $g : \mathbb{R}^+ \to \mathbb{R}^+$ with smallest fixed point $r^*_g$ (if it exists); if $\psi(r) \le g(r)$ for all $r \in [0, +\infty)$, then $r^* \le r^*_g$;
9. for all $r \ge r^*$, $\psi(r) \le \sqrt{r/r^*}\,\psi(r^*) = \sqrt{r r^*}$;
10. let $\psi_1$ and $\psi_2$ be two sub–root functions; then $\psi(r) = \psi_1(r) + \psi_2(r)$ is also a sub–root function;
11. let $\psi$ be a sub–root function and $c \in [0, +\infty)$ a constant; then $\psi'(r) = \psi(r) + c$ is also a sub–root function.
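Property 5 above guarantees a unique positive fixed point $r^*$, and in practice it can be found by the plain iteration $r \leftarrow \psi(r)$. A minimal sketch (our own example: $\psi(r) = c\sqrt{r}$, a sub–root function whose fixed point is $r^* = c^2$, and a generic iteration helper whose convergence we only verify for this case):

```python
import math

def fixed_point(psi, r0=1.0, tol=1e-12, max_iter=10_000):
    """Find the positive fixed point r* = psi(r*) by the iteration r <- psi(r),
    stopping when successive iterates differ by at most tol."""
    r = r0
    for _ in range(max_iter):
        r_next = psi(r)
        if abs(r_next - r) <= tol:
            return r_next
        r = r_next
    raise RuntimeError("fixed-point iteration did not converge")

# psi(r) = c * sqrt(r) is sub-root; its positive fixed point is r* = c**2
c = 3.0
r_star = fixed_point(lambda r: c * math.sqrt(r))
assert abs(r_star - c**2) < 1e-8
# property 6: r >= psi(r) iff r >= r* (here r* = 9)
assert 10.0 >= c * math.sqrt(10.0)
assert not (8.0 >= c * math.sqrt(8.0))
```

Property 6 is what makes such fixed points useful in risk bounds: any $r$ dominating $\psi(r)$ automatically dominates $r^*$, so $r^*$ is the sharpest radius one can certify.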
References

Anguita, D., Ghio, A., Oneto, L., & Ridella, S. (2011). The impact of unlabeled patterns in Rademacher complexity theory for kernel classifiers. Neural Information Processing Systems, (pp. 585–593).

Audibert, J. Y., & Tsybakov, A. B. (2007). Fast learning rates for plug-in classifiers. The Annals of Statistics, 35, 608–633.

Bartlett, P. L., Boucheron, S., & Lugosi, G. (2002a). Model selection and error estimation. Machine Learning, 48, 85–113.

Bartlett, P. L., Bousquet, O., & Mendelson, S. (2002b). Localized Rademacher complexities. In Computational Learning Theory.

Bartlett, P. L., Bousquet, O., & Mendelson, S. (2005). Local Rademacher complexities. The Annals of Statistics, 33, 1497–1537.

Bartlett, P. L., & Mendelson, S. (2003). Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3, 463–482.

Blanchard, G., & Massart, P. (2006). Discussion: Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34, 2664–2671.

Boucheron, S., Lugosi, G., & Massart, P. (2000). A sharp concentration inequality with applications. Random Structures & Algorithms, 16, 277–292.

Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.

Bousquet, O. (2002). A Bennett concentration inequality and its application to suprema of empirical processes. Comptes Rendus Mathematique, 334, 495–500.

Bousquet, O., Boucheron, S., & Lugosi, G. (2004). Introduction to statistical learning theory. In Advanced Lectures on Machine Learning.

Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. The Journal of Machine Learning Research, 2, 499–526.

Chapelle, O., Scholkopf, B., & Zien, A. (2006). Semi-Supervised Learning. MIT Press, Cambridge.

Cortes, C., Kloft, M., & Mohri, M. (2013). Learning kernels using local Rademacher complexity. In Neural Information Processing Systems.

Kloft, M., & Blanchard, G. (2011). The local Rademacher complexity of lp-norm multiple kernel learning. In Neural Information Processing Systems.

Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34, 2593–2656.

Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer.

Ledoux, M., & Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes, volume 23. Springer.

Lever, G., Laviolette, F., & Shawe-Taylor, J. (2013). Tighter PAC-Bayes bounds through distribution-dependent priors. Theoretical Computer Science, 473, 4–28.

McAllester, D., & Akinbiyi, T. (2013). PAC-Bayesian theory. In Empirical Inference (pp. 95–103).

McDiarmid, C. (1989). On the method of bounded differences. Surveys in Combinatorics, 141, 148–188.

Oneto, L., Ghio, A., Anguita, D., & Ridella, S. (2013). An improved analysis of the Rademacher data-dependent bound using its self bounding property. Neural Networks, 44, 107–111.

Parrado-Hernandez, E., Ambroladze, A., Shawe-Taylor, J., & Sun, S. (2012). PAC-Bayes bounds with data dependent priors. The Journal of Machine Learning Research, 13, 3507–3531.

Rosasco, L., De Vito, E., Caponnetto, A., Piana, M., & Verri, A. (2004). Are loss functions all the same? Neural Computation, 16, 1063–1076.

Srebro, N., Sridharan, K., & Tewari, A. (2010). Smoothness, low noise and fast rates. In Neural Information Processing Systems.

Steinwart, I., & Scovel, C. (2007). Fast rates for support vector machines using Gaussian kernels. The Annals of Statistics, 35, 575–607.

Talagrand, M. (1987). The Glivenko-Cantelli problem. The Annals of Probability, 15, 837–870.

Valiant, L. (2013). Probably Approximately Correct: Nature's Algorithms for Learning and Prospering in a Complex World. Basic Books.

Vapnik, V. N. (1998). Statistical Learning Theory. Wiley–Interscience.

Weisstein, E. W. (2013). Lambert W-Function. From MathWorld–A Wolfram Web Resource, http://mathworld.wolfram.com/LambertW-Function.html.