Local Rademacher Complexity: Sharper Risk Bounds
With and Without Unlabeled Samples
Luca Oneto, Alessandro Ghio, Sandro Ridella
DITEN - University of Genova, Via Opera Pia 11A, I-16145 Genova, Italy
{Luca.Oneto, Alessandro.Ghio, Sandro.Ridella}@unige.it
Davide Anguita
DIBRIS - University of Genova, Via Opera Pia 13, I-16145 Genova, Italy
Abstract
We derive in this paper a new Local Rademacher Complexity risk bound
on the generalization ability of a model, which is able to take advantage of
the availability of unlabeled samples. Moreover, this new bound improves
state–of–the–art results even when no unlabeled samples are available.
Keywords: Statistical Learning Theory, Performance Estimation, Local
Rademacher Complexity, Unlabeled Samples
Preprint submitted to Neural Networks, January 11, 2015.

1. Introduction

A learning process can be described as the selection of a hypothesis
in a fixed set, based on empirical observations (Vapnik, 1998). Its asymptotic
analysis, through a bound on the generalization error, has been thoroughly
investigated in the past (Vapnik, 1998; Talagrand, 1987). However,
as the number of samples is limited in practice, finite sample analysis with
global measures of the complexity of the hypothesis set was proposed, and
represented a fundamental advance in the field (Vapnik, 1998; Bartlett &
Mendelson, 2003; Koltchinskii, 2006; Bousquet & Elisseeff, 2002; Valiant,
2013; McAllester & Akinbiyi, 2013). A further refinement has consisted in
exploiting local measures of complexity, which take into account only those
models that approximate the available data well (Bartlett et al., 2002b, 2005;
Koltchinskii, 2006; Blanchard & Massart, 2006; Lever et al., 2013). Recently,
some attempts to further improve these results have been made (Audibert
& Tsybakov, 2007; Srebro et al., 2010; Steinwart & Scovel, 2007): unfor-
tunately, these approaches require additional assumptions that, in general,
are not satisfied or cannot be justified by inferring them from the data. Al-
ternative paths have been explored like, for example, exploiting additional
a–priori information (Parrado-Hernandez et al., 2012). Recently, the use of
unlabeled samples has been proposed for improving the tightness of Global
Rademacher Complexity based bounds (Anguita et al., 2011). Such results
are appealing since unlabeled samples are commonly available in many real
world applications, as also confirmed by the success of learning procedures
able to exploit them (Chapelle et al., 2006).
In this paper, we extend the recent results on the use of unlabeled samples
in global complexity measures to the case of local ones and derive sharper
Local Rademacher Complexity risk bounds on the generalization ability of
a model. For this purpose, two steps are completed. First, we propose a
proof for the Local Rademacher Complexity bound, simplified with respect
to the milestone result of (Bartlett et al., 2005) through the exploitation
of the well–known bounded difference inequality (McDiarmid, 1989). Such
simplification enables us to apply results on concentration inequalities of self–
bounding functions (Boucheron et al., 2013), and to obtain a sharper Local
Rademacher Complexity risk bound. The latter improves the state-of-the-
art results both when unlabeled samples are used and when the dataset is
entirely composed of labeled samples.
2. The learning framework
We consider the conventional learning problem (Vapnik, 1998): based on
a random observation of X ∈ X , one has to estimate Y ∈ Y by choosing a
suitable hypothesis h : X → Y , where h ∈ H. A learning algorithm selects h
by exploiting a set of labeled samples D_{n_l} = {(X^l_1, Y^l_1), …, (X^l_{n_l}, Y^l_{n_l})}
and, possibly, a set of unlabeled ones D_{n_u} = {X^u_1, …, X^u_{n_u}}. D_{n_l} and D_{n_u}
consist of sequences of independent samples distributed according to µ over
X × Y. The generalization error
L(h) = E_µ ℓ(h(X), Y),  (1)
associated with a hypothesis h ∈ H, is defined through a loss function
ℓ(h(X), Y) : Y × Y → [0, 1]. As µ is unknown, L(h) cannot be explicitly computed, thus
we have to resort to its empirical estimator, namely the empirical error
L_{n_l}(h) = (1/n_l) ∑_{i=1}^{n_l} ℓ(h(X^l_i), Y^l_i).  (2)
Note that Lnl(h) is a biased estimator, since the data used for selecting the
model and for computing the empirical error coincide. We estimate this
bias by studying the discrepancy between the generalization error and the
empirical error. For this purpose we exploit powerful statistical tools like
concentration inequalities and the Local Rademacher Complexity.
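Since the bias of the empirical estimator is central to what follows, a small simulation may help fix ideas. The sketch below is our own toy construction (the noisy threshold model `true_label`, the finite class of thresholds, and all constants are hypothetical, not from the paper): it selects the empirically best hypothesis and measures how optimistic its training error is.

```python
# Toy illustration of the bias of the empirical error when the same data
# both select the model and evaluate it.  All choices here are hypothetical.
import random

random.seed(0)

def true_label(x):
    # hypothetical data model: threshold at 0.5, with 20% label noise
    y = 1 if x > 0.5 else -1
    return -y if random.random() < 0.2 else y

def sample(n):
    return [(x, true_label(x)) for x in [random.random() for _ in range(n)]]

def h(theta, x):                         # a small finite class of thresholds
    return 1 if x > theta else -1

def emp_error(theta, data):              # hard loss (1 - Y h(X)) / 2, averaged
    return sum((1 - y * h(theta, x)) / 2 for x, y in data) / len(data)

thetas = [i / 20 for i in range(21)]
gap = 0.0
trials = 200
for _ in range(trials):
    train = sample(30)
    best = min(thetas, key=lambda t: emp_error(t, train))
    test = sample(2000)                  # fresh data: near-unbiased estimate of L(h)
    gap += emp_error(best, test) - emp_error(best, train)
gap /= trials
print(gap > 0)                           # training error is optimistically biased
```

On average the gap is positive: the selected model looks better on the data that chose it, which is exactly the discrepancy the bounds below control.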
2.1. Definitions
In the seminal work of Bartlett et al. (2005), a bound, defined over the
space of functions, is provided. In this work, we extend that result to a
more general supervised learning framework. For this purpose, we switch
from the space of functions H to the space of loss functions.
Definition 2.1. Given a space of functions H with its associated loss func-
tion ℓ(h(X), Y), the space of loss functions L is defined as:

L = { ℓ(h(X), Y) : h ∈ H }.  (3)
Let us also consider the corresponding star–shaped space of functions.

Definition 2.2. Given the space of loss functions L, its star–shaped version
is:

L^s = { αℓ : α ∈ [0, 1], ℓ ∈ L }.  (4)
Then, the generalization error and the empirical error can be rewritten
in terms of the space of loss functions:
L(h) ≡ L(ℓ) = E_µ ℓ(h(X), Y),  (5)

L_{n_l}(h) ≡ L_{n_l}(ℓ) = (1/n_l) ∑_{i=1}^{n_l} ℓ(h(X^l_i), Y^l_i).  (6)
Moreover, we can define, respectively, the expected square error and the em-
pirical square error:

L(ℓ²) = E_µ [ℓ(h(X), Y)]²,  (7)

L_{n_l}(ℓ²) = (1/n_l) ∑_{i=1}^{n_l} [ℓ(h(X^l_i), Y^l_i)]².  (8)
Consequently, the variance of ℓ ∈ L can be defined as:

V²(ℓ) = E_µ [ℓ(h(X), Y) − L(ℓ)]² = L(ℓ²) − [L(ℓ)]².  (9)
Note that the following relations hold:
V²(ℓ) ≤ L(ℓ²) ≤ L(ℓ),   L[(αℓ)²] = α² L(ℓ²).  (10)
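These relations can be checked numerically; the following snippet (an illustrative sketch with synthetic loss values, not tied to any specific H) verifies Eqs. (9)–(10):

```python
# Numerical check of Eq. (10): V^2 <= L(l^2) <= L(l) for losses in [0, 1],
# and the scaling property L[(a*l)^2] = a^2 L(l^2).  Synthetic loss values.
import random

random.seed(1)
losses = [random.random() for _ in range(10000)]   # stand-in for l(h(X), Y)

L  = sum(losses) / len(losses)                     # L(l): mean loss
L2 = sum(v * v for v in losses) / len(losses)      # L(l^2): mean squared loss
V2 = L2 - L * L                                    # variance, Eq. (9)

alpha = 0.3
L2_scaled = sum((alpha * v) ** 2 for v in losses) / len(losses)

print(V2 <= L2 <= L)                               # l in [0,1] implies l^2 <= l
print(abs(L2_scaled - alpha ** 2 * L2) < 1e-12)    # scaling property
```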
Since we do not know in advance which h ∈ H will be chosen during the
learning phase, in order to estimate L(ℓ) we have to study the behavior of
the difference between the generalization error and the empirical error.
Definition 2.3. Given L, the Uniform Deviations of the loss, Û_{n_l}(L), and
of the square loss, Û²_{n_l}(L), are:

Û_{n_l}(L) = sup_{ℓ∈L} [L(ℓ) − L_{n_l}(ℓ)],   Û²_{n_l}(L) = sup_{ℓ∈L} [L_{n_l}(ℓ²) − L(ℓ²)],  (11)

while their deterministic counterparts are:

U_{n_l}(L) = E_µ Û_{n_l}(L),   U²_{n_l}(L) = E_µ Û²_{n_l}(L).  (12)
The Uniform Deviation is not computable, but we can upper bound its
value through some computable quantity. One possibility is to use the Ra-
demacher Complexity.
Definition 2.4. The Rademacher Complexities of the loss and of the square
loss are:

R̂_{n_l}(L) = E_σ sup_{ℓ∈L} (2/n_l) ∑_{i=1}^{n_l} σ_i ℓ(h(X^l_i), Y^l_i),  (13)

R̂²_{n_l}(L) = E_σ sup_{ℓ∈L} (2/n_l) ∑_{i=1}^{n_l} σ_i [ℓ(h(X^l_i), Y^l_i)]²,  (14)

where σ_1, …, σ_{n_l} are n_l {±1}–valued independent Rademacher random
variables for which P(σ_i = +1) = P(σ_i = −1) = 1/2. Their deterministic
counterparts are:

R_{n_l}(L) = E_µ R̂_{n_l}(L),   R²_{n_l}(L) = E_µ R̂²_{n_l}(L).  (15)
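Definition 2.4 can be turned into a Monte-Carlo computation when the class is finite. In the sketch below (our own toy setup: each hypothesis is represented directly by its vector of losses on the n samples), the expectation over σ is approximated by sampling:

```python
# Monte-Carlo sketch of the empirical Rademacher Complexity of Eq. (13):
# (2/n) E_sigma sup_l sum_i sigma_i l_i, for a small finite class.
import random

random.seed(2)
n = 50
# three hypothetical "loss vectors" (l(h(X_i), Y_i))_i, one per hypothesis:
loss_vectors = [[random.random() for _ in range(n)] for _ in range(3)]

def rademacher(vectors, n_mc=2000):
    total = 0.0
    for _ in range(n_mc):
        sigma = [random.choice((-1, 1)) for _ in range(n)]
        total += max(sum(s * v for s, v in zip(sigma, vec)) for vec in vectors)
    return (2.0 / n) * total / n_mc

R = rademacher(loss_vectors)
print(0.0 < R <= 2.0)   # bounded, since each loss lies in [0, 1]
```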
In Appendix A, some propaedeutic properties of the Uniform Deviation
and Rademacher Complexity are recalled, which will be useful for deriving
the main results of this work.
Finally, we will also make use of the notion of sub–root function
(Bartlett et al., 2005).
Definition 2.5. A function ψ : (0, +∞) → [0, +∞) is a sub–root function if and only if:

(I) ψ(r) is positive,

(II) ψ(r) is non–decreasing,

(III) ψ(r)/√r is non–increasing,

for r > 0.
Its properties are reported in Appendix B.
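The fixed point r* with ψ(r*) = r*, which plays a central role in the following sections, can be located by simple iteration: for a sub-root function, r ← ψ(r) converges to the unique positive fixed point. A minimal sketch, assuming an arbitrary example sub-root function of our own choosing:

```python
# Fixed-point iteration for a sub-root function.  psi below is a hypothetical
# example: positive, non-decreasing, with psi(r)/sqrt(r) non-increasing.
import math

def psi(r):
    return 0.5 * math.sqrt(r) + 0.1

r = 1.0
for _ in range(200):               # r <- psi(r) converges to the fixed point
    r = psi(r)

print(abs(psi(r) - r) < 1e-9)      # r is (numerically) the fixed point r*
```

For this particular ψ the fixed point can also be obtained in closed form (solving r = 0.5√r + 0.1 gives r* ≈ 0.4266), which the iteration reproduces.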
3. Local Rademacher Complexity error bound
In this section, we propose a proof of the Local Rademacher Complexity
bound on the generalization error of a model (Bartlett et al., 2005; Koltchin-
skii, 2006), which is simplified with respect to the original proof in literature
and allows us also to obtain optimal constants.
In order to improve the readability of the paper, an outline of the main
steps of the proof is presented. As a first step, Theorem 3.1 shows that it is
possible to bound the generalization error of a function chosen in H, through
an assumption over the Expected Uniform Deviation of a normalized and
slightly enlarged version (see Lemma 3.2) of H. As a second step, Theorem
3.3 shows how to relate the Expected Uniform Deviation and the Expected
Rademacher Complexity through the use of a sub–root function. The fixed
point of this sub–root function is used to bound the generalization error of
a function chosen in H. As a third step, Lemma 3.4 shows that, instead of
using any sub–root function, we can directly use the Expected Rademacher
Complexity of a local space of functions, where functions therein are the ones
with low expected square error. As a fourth step, Lemma 3.5 and Lemma
3.6 allow us to substitute the non–computable expected quantities mentioned
above with their empirical counterparts, which can be computed from the
data. Then, we finally derive the main result of this section in Theorem
3.7, which is a fully empirical Local Rademacher Complexity bound on the
generalization error of a function chosen in the original hypothesis space H.
The following theorem is needed for normalizing the original hypothesis
space: this allows us to bound the generalization error of a function chosen
in H.
Theorem 3.1. Let us consider the normalized loss space L_r:

L_r = { [r/(L(ℓ²) ∨ r)] ℓ : ℓ ∈ L },  (16)

and let us suppose that, ∀K > 1:

Û_{n_l}(L_r) ≤ r/K.  (17)

Then, ∀h ∈ H, the following inequality holds:

L(h) ≤ max{ [K/(K−1)] L_{n_l}(h), L_{n_l}(h) + r/K } ≤ [K/(K−1)] L_{n_l}(h) + r/K.  (18)
Proof. We note that, ∀ℓ_r ∈ L_r:

L(ℓ_r) ≤ L_{n_l}(ℓ_r) + Û_{n_l}(L_r) ≤ L_{n_l}(ℓ_r) + r/K.  (19)

Let us consider the two cases:

1. L(ℓ²) ≤ r
2. L(ℓ²) > r

In the first case ℓ_r = ℓ, so the following inequality holds:

L(ℓ) = L(ℓ_r) ≤ L_{n_l}(ℓ_r) + r/K = L_{n_l}(ℓ) + r/K.  (20)

In the second case, ℓ_r = [r/L(ℓ²)] ℓ and r/L(ℓ²) ∈ [0, 1]. Then, using L(ℓ²) ≤ L(ℓ) from Eq. (10):

L(ℓ) − L_{n_l}(ℓ) = [L(ℓ²)/r] [L(ℓ_r) − L_{n_l}(ℓ_r)] ≤ [L(ℓ²)/r] Û_{n_l}(L_r) ≤ [L(ℓ)/r] (r/K) = L(ℓ)/K.  (21)

By solving the last inequality with respect to L(ℓ), we obtain the bound over
L(ℓ):

L(ℓ) ≤ [K/(K−1)] L_{n_l}(ℓ),   L(ℓ²) > r, ℓ ∈ L.  (22)

Note that L(ℓ) ≡ L(h) and L_{n_l}(ℓ) ≡ L_{n_l}(h), with h ∈ H, if ℓ ∈ L. By
combining the results of Eqns. (20) and (22), the theorem is proved.
The next step shows that the normalized hypothesis space defined in
Theorem 3.1 is a subset of a new star–shaped space.
Lemma 3.2. Let

L^s_r = { αℓ : α ∈ [0, 1], ℓ ∈ L, L[(αℓ)²] ≤ r },  (23)

then

L_r ⊆ L^s_r.  (24)

Proof. Let us consider L_r in the two cases introduced above:

1. L(ℓ²) ≤ r
2. L(ℓ²) > r

In the first case, ℓ_r = ℓ and then:

L(ℓ²_r) = L(ℓ²) ≤ r.  (25)

In the second case, L(ℓ²) > r, so we have that

ℓ_r = [r/L(ℓ²)] ℓ,   r/L(ℓ²) ≤ 1,  (26)

and the following bound holds:

L(ℓ²_r) ≤ [r/L(ℓ²)] L(ℓ²) = r.  (27)

Thus, the property of Eq. (24) is proved.
If we consider a sub–root function that upper-bounds the Expected Ra-
demacher Complexity of the hypothesis space defined in Lemma 3.2, we can
exploit Theorem 3.1 for bounding the generalization error of a function cho-
sen in the original hypothesis space H.
Theorem 3.3. Let us consider a sub–root function ψ_{n_l}(r), with fixed point
r*_{n_l}, and suppose that, ∀r > r*_{n_l}:

R_{n_l}(L^s_r) ≤ ψ_{n_l}(r).  (28)

Then, ∀h ∈ H and ∀K > 1 we have that, with probability (1 − e^{−x}):

L(h) ≤ max{ [K/(K−1)] L_{n_l}(h), L_{n_l}(h) + K r*_{n_l} + 2√(x/(2n_l)) }.  (29)
Proof. By exploiting the properties of the Uniform Deviation and the Rade-
macher Complexity (see Lemma Appendix A.1 and Lemma Appendix A.2)
and the properties of the sub–root functions, listed in Appendix B, we have
that, with probability (1 − e^{−x}):

Û_{n_l}(L_r) ≤ U_{n_l}(L_r) + √(x/(2n_l)) ≤ R_{n_l}(L_r) + √(x/(2n_l)) ≤ R_{n_l}(L^s_r) + √(x/(2n_l)) ≤ ψ_{n_l}(r) + √(x/(2n_l)) ≤ √(r r*_{n_l}) + √(x/(2n_l)).  (30)

The last step of the proof consists in showing that r can be chosen such that
Û_{n_l}(L_r) ≤ r/K with K > 1 and r ≥ r*_{n_l}, so that we can exploit Theorem 3.1
and conclude the proof. For this purpose, we set A = √(r*_{n_l}) and C = √(x/(2n_l)).
Thus, we have to find the solution of:

A√r + C = r/K,  (31)

which is

r = (K²/2) [ (2C/K + A²) + √( (2C/K + A²)² − 4C²/K² ) ].  (32)

It is straightforward to prove that:

r ≥ A²K² ≥ A² = r*_{n_l},  (33)

r ≤ A²K² + 2CK.  (34)

By substituting Eq. (34) into Theorem 3.1, the proof is complete.
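The closed form of Eq. (32) is easy to validate numerically; the following check (with arbitrary hypothetical values of A, C and K) confirms that it satisfies Eq. (31) together with the two-sided estimates (33)–(34):

```python
# Check that the closed form of Eq. (32) solves A*sqrt(r) + C = r/K (Eq. (31))
# and respects the bounds of Eqs. (33)-(34).  A, C, K are arbitrary choices.
import math

A, C, K = 0.3, 0.05, 2.0
inner = 2 * C / K + A ** 2
r = (K ** 2 / 2) * (inner + math.sqrt(inner ** 2 - 4 * C ** 2 / K ** 2))

print(abs(A * math.sqrt(r) + C - r / K) < 1e-12)   # Eq. (31) is satisfied
print(r >= A ** 2 * K ** 2)                        # Eq. (33)
print(r <= A ** 2 * K ** 2 + 2 * C * K)            # Eq. (34)
```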
The previous theorem holds for any sub–root function which satisfies Eq.
(28). The next lemma shows that the Rademacher Complexity defined in the
last theorem is itself a sub–root function, so that the inequality of Eq. (28)
can be taken as an equality.
Lemma 3.4. Let us consider R_{n_l}(L^s_r), namely the Expected Rademacher
Complexity computed on L^s_r. Then:

ψ_{n_l}(r) = R_{n_l}(L^s_r)  (35)

is a sub–root function.

Proof. In order to prove the lemma, the following properties must apply (see
Definition 2.5):

1. ψ_{n_l}(r) is positive
2. ψ_{n_l}(r) is non–decreasing
3. ψ_{n_l}(r)/√r is non–increasing

Since all the relations below hold for every realization of the data, they are
preserved when taking the expectation E_µ. Concerning the first property, we
can note that:

ψ_{n_l}(r) = E_σ sup_{ℓ∈L^s_r} (2/n_l) ∑_{i=1}^{n_l} σ_i ℓ(h(X^l_i), Y^l_i) ≥ sup_{ℓ∈L^s_r} E_σ (2/n_l) ∑_{i=1}^{n_l} σ_i ℓ(h(X^l_i), Y^l_i) = 0.  (36)

Concerning the second property, we have that, for 0 ≤ r₁ ≤ r₂:

L^s_{r₁} ⊆ L^s_{r₂},  (37)

and therefore

ψ_{n_l}(r₁) = E_σ sup_{ℓ∈L^s_{r₁}} (2/n_l) ∑_{i=1}^{n_l} σ_i ℓ(h(X^l_i), Y^l_i) ≤ E_σ sup_{ℓ∈L^s_{r₂}} (2/n_l) ∑_{i=1}^{n_l} σ_i ℓ(h(X^l_i), Y^l_i) = ψ_{n_l}(r₂).  (38)

Finally, concerning the third property, for 0 < r₁ ≤ r₂ we define the following
quantity:

ℓ^σ_{r₂} = arg sup_{ℓ∈L^s_{r₂}} (2/n_l) ∑_{i=1}^{n_l} σ_i ℓ(h(X^l_i), Y^l_i),   L[(ℓ^σ_{r₂})²] ≤ r₂.  (39)

Note that, since r₁/r₂ ≤ 1, we have that √(r₁/r₂) ℓ^σ_{r₂} ∈ L^s_{r₁}. In fact:

L[(√(r₁/r₂) ℓ^σ_{r₂})²] = (r₁/r₂) L[(ℓ^σ_{r₂})²] ≤ r₁.  (40)

Thus, we have that:

ψ_{n_l}(r₁) = E_σ sup_{ℓ∈L^s_{r₁}} (2/n_l) ∑_{i=1}^{n_l} σ_i ℓ(h(X^l_i), Y^l_i) ≥ E_σ (2/n_l) ∑_{i=1}^{n_l} σ_i √(r₁/r₂) ℓ^σ_{r₂}(h(X^l_i), Y^l_i)
= √(r₁/r₂) E_σ sup_{ℓ∈L^s_{r₂}} (2/n_l) ∑_{i=1}^{n_l} σ_i ℓ(h(X^l_i), Y^l_i) = √(r₁/r₂) ψ_{n_l}(r₂),  (41)

which allows proving the claim, since

ψ_{n_l}(r₂)/√r₂ ≤ ψ_{n_l}(r₁)/√r₁.  (42)
The next two lemmas allow us to substitute the non–computable expected
quantities, L^s_r and R_{n_l}, with their empirical counterparts, which can be
computed from the data.
Lemma 3.5. Let us suppose that

r ≥ R_{n_l}(L^s_r),  (43)

and let us define

L̂^s_r = { αℓ : α ∈ [0, 1], ℓ ∈ L, L_{n_l}[(αℓ)²] ≤ 3r + √(x/(2n_l)) }.  (44)

Then, the following inclusion holds with probability (1 − e^{−x}):

L^s_r ⊆ L̂^s_r.  (45)

Proof. By exploiting Lemma Appendix A.1 and Lemma Appendix A.2 we
have that, with probability (1 − e^{−x}) and ∀ℓ' ∈ L^s_r:

L_{n_l}[(ℓ')²] ≤ L[(ℓ')²] + Û²_{n_l}(L^s_r) ≤ L[(ℓ')²] + U²_{n_l}(L^s_r) + √(x/(2n_l)) ≤ r + 2R_{n_l}(L^s_r) + √(x/(2n_l)) ≤ 3r + √(x/(2n_l)),  (46)

which concludes the proof.
Lemma 3.6. Let us consider two sub–root functions and their fixed points:

ψ_{n_l}(r) = R_{n_l}(L^s_r),   ψ_{n_l}(r*_{n_l}) = r*_{n_l},  (47)

ψ̂_{n_l}(r) = R̂_{n_l}(L̂^s_r) + √(2x/n_l),   ψ̂_{n_l}(r̂*_{n_l}) = r̂*_{n_l}.  (48)

The following inequalities hold with probability (1 − 2e^{−x}):

ψ_{n_l}(r) ≤ ψ̂_{n_l}(r),   r*_{n_l} ≤ r̂*_{n_l}.  (49)

Proof. By exploiting Lemma Appendix A.2 and Lemma 3.5 we can obtain
the following chain of inequalities, which holds with probability (1 − 2e^{−x}):

ψ_{n_l}(r) = R_{n_l}(L^s_r) ≤ R̂_{n_l}(L^s_r) + √(2x/n_l) ≤ R̂_{n_l}(L̂^s_r) + √(2x/n_l) = ψ̂_{n_l}(r).  (50)

Both ψ_{n_l}(r) and ψ̂_{n_l}(r) are sub–root functions (as proved in Lemma 3.4 and
in accordance with the properties of Appendix B), with fixed points r*_{n_l} and
r̂*_{n_l}, respectively. Then, since ψ_{n_l}(r) ≤ ψ̂_{n_l}(r), we have r*_{n_l} ≤ r̂*_{n_l};
moreover, r*_{n_l} ≥ R_{n_l}(L^s_r), thanks to the properties of the sub–root
functions (see Appendix B), as required by Lemma 3.5.
Finally, we derive the main result of this section, namely a fully empir-
ical Local Rademacher Complexity bound on the generalization error of a
function, chosen in the original hypothesis space H.
Theorem 3.7. Let us consider a space of functions H and the fixed point r̂*_{n_l}
of the following sub–root function:

ψ̂_{n_l}(r) = R̂_{n_l}(L̂^s_r) + √(2x/n_l),  (51)

where

L̂^s_r = { αℓ : α ∈ [0, 1], ℓ ∈ L, L_{n_l}(ℓ²) ≤ (1/α²) [3r + √(x/(2n_l))] }.  (52)

Then, ∀h ∈ H and ∀K > 1, the following inequality holds with probability
(1 − 3e^{−x}):

L(h) ≤ max{ [K/(K−1)] L_{n_l}(h), L_{n_l}(h) + K r̂*_{n_l} + 2√(x/(2n_l)) }.  (53)

Proof. The theorem can be straightforwardly proved by combining the results
of Theorem 3.3 and Lemma 3.6.
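To make the structure of Theorem 3.7 concrete, here is an end-to-end toy sketch, entirely our own construction (a finite class represented by loss vectors, a coarse α grid, Monte-Carlo estimation of R̂, and arbitrary constants): build the localized class, iterate ψ̂ to its fixed point, and evaluate the bound.

```python
# Toy sketch of the fully empirical recipe of Theorem 3.7.  All choices
# (class, alpha grid, constants) are hypothetical illustrations.
import math
import random

random.seed(3)
n, x, K = 100, 3.0, 2.0
loss_vectors = [[random.random() for _ in range(n)] for _ in range(3)]
alphas = [0.0, 0.5, 1.0]

def localized(r):
    # empirical localized class: keep alpha*l with small empirical square loss
    thr = 3 * r + math.sqrt(x / (2 * n))
    out = []
    for vec in loss_vectors:
        for a in alphas:
            if sum((a * v) ** 2 for v in vec) / n <= thr:
                out.append([a * v for v in vec])
    return out

def rademacher(vectors, n_mc=100):
    total = 0.0
    for _ in range(n_mc):
        sigma = [random.choice((-1, 1)) for _ in range(n)]
        total += max(sum(s * v for s, v in zip(sigma, vec)) for vec in vectors)
    return (2.0 / n) * total / n_mc

def psi_hat(r):
    return rademacher(localized(r)) + math.sqrt(2 * x / n)

r = 1.0
for _ in range(15):                      # fixed-point iteration r <- psi_hat(r)
    r = psi_hat(r)

emp = sum(loss_vectors[0]) / n           # empirical error of one chosen h
bound = max(K / (K - 1) * emp, emp + K * r + 2 * math.sqrt(x / (2 * n)))
print(emp <= bound <= 2.0)
```

In this toy example the localization constraint is loose and filters little; the point is only to show how the pieces of the theorem fit together numerically.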
4. Exploiting unlabeled samples to tighten the bound
When unlabeled samples are available, a training set consisting of both
labeled and unlabeled samples can be defined:

D_{n_l+n_u} = D_{n_l} ∪ D_{n_u} = { (X^{l+u}_1, Y^{l+u}_1), …, (X^{l+u}_{n_l+n_u}, Y^{l+u}_{n_l+n_u}) },  (54)

and m, such that (m − 1) n_l < n_u + n_l ≤ m n_l (supposing m > 1). In order
to prove a new fully empirical Local Rademacher Complexity bound in this
new framework, we need to recall some additional properties and definitions.
Definition 4.1. The Rademacher Complexity of H is:

R̂_{n_l}(H) = E_σ sup_{h∈H} (1/n_l) ∑_{i=1}^{n_l} σ_i h(X^l_i).  (55)

Only the input patterns X^l_i are needed to compute the Rademacher
Complexity of H, according to Eq. (55).
Remark 4.2. In order to compute R̂_{n_l}(H), the labels Y^l_i, i ∈ {1, …, n_l}, are
not required.
Moreover, we can show that the Rademacher Complexity computed on
the set of functions is strictly related to the Rademacher Complexity com-
puted on the set of loss functions.
Remark 4.3. Under some mild conditions:

R̂_{n_l}(L) ≤ C_ℓ R̂_{n_l}(H),  (56)

where C_ℓ ∈ (0, ∞) is a constant which depends only on the loss ℓ.
For example, in binary classification, where Y ∈ {±1} and h(X) ∈ {±1}, we
have the hard loss:

ℓ_H(h(X), Y) = [1 − Y h(X)] / 2,  (57)

for which C_{ℓ_H} = 1. If, instead, we have an L-Lipschitz loss function,

|ℓ_L(h₁(X), Y) − ℓ_L(h₂(X), Y)| ≤ L |h₁(X) − h₂(X)|,  (58)

by exploiting the contraction inequality (Ledoux & Talagrand, 1991; Koltchin-
skii, 2011):

R̂_{n_l}(L) ≤ 2L R̂_{n_l}(H),  (59)

so that C_{ℓ_L} = 2L. Many well-known and commonly exploited loss functions
are L–Lipschitz. In the classification framework, where Y ∈ {±1} and
h(X) ∈ [−1, 1], examples are the ρ–margin loss (Bartlett & Mendelson, 2003),

ℓ_ρ(h(X), Y) = min[1, max[0, 1 − Y h(X)/ρ]],  (60)

and the soft loss (Anguita et al., 2011),

ℓ_S(h(X), Y) = {1 − min[1, max[0, Y h(X)]]} / 2.  (61)

In the bounded regression framework, examples are the square loss (Rosasco
et al., 2004),

ℓ_{|·|²}(h(X), Y) = min[1, max[0, (Y − h(X))²]],  (62)

and the L1 loss (Rosasco et al., 2004),

ℓ_{|·|}(h(X), Y) = min[1, max[0, |Y − h(X)|]].  (63)
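For concreteness, the bounded losses above can be written out directly. The function names are ours, and the soft loss follows our reconstruction of Eq. (61); all of them map into [0, 1], as the framework requires:

```python
# The bounded losses of Eqs. (57)-(63), written out.  Names are ours; the
# soft loss l_soft follows our reading of Eq. (61).
def l_hard(hx, y):            # Eq. (57), hx and y in {-1, +1}
    return (1 - y * hx) / 2

def l_rho(hx, y, rho=0.5):    # Eq. (60), rho-margin loss, clipped to [0, 1]
    return min(1.0, max(0.0, 1 - y * hx / rho))

def l_soft(hx, y):            # Eq. (61), soft loss (as reconstructed here)
    return (1 - min(1.0, max(0.0, y * hx))) / 2

def l_sq(hx, y):              # Eq. (62), clipped square loss
    return min(1.0, max(0.0, (y - hx) ** 2))

def l_abs(hx, y):             # Eq. (63), clipped L1 loss
    return min(1.0, max(0.0, abs(y - hx)))

print(l_hard(1, 1), l_hard(-1, 1))        # 0.0 on a hit, 1.0 on a miss
print(all(0.0 <= f(0.2, 1) <= 1.0 for f in (l_rho, l_soft, l_sq, l_abs)))
```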
The notion of Rademacher Complexity can be extended in order to take
unlabeled samples into account (Anguita et al., 2011).

Definition 4.4. The Extended Rademacher Complexity (Anguita et al., 2011)
and its expected version (the Expected Extended Rademacher Complexity) are,
respectively:

R̂_{n_u}(L) = (1/m) ∑_{j=1}^{m} E_σ sup_{ℓ∈L} (2/n_l) ∑_{i=1}^{n_l} σ_i ℓ(h(X^{l+u}_{(j−1)n_l+i}), Y^{l+u}_{(j−1)n_l+i}) = (1/m) ∑_{j=1}^{m} R̂^j_{n_l}(L),  (64)

R_{n_u}(L) = E_µ R̂_{n_u}(L).  (65)
Remark 4.5. Note that the labels Y^{l+u}_i are unknown for all the samples
X^{l+u}_i coming from D_{n_u}. As a consequence, R̂_{n_u}(L) is not computable.
For the sake of readability, we refer the reader to Appendix A, where
some additional properties of the Extended Rademacher Complexity are sum-
marized.
The notion of Extended Rademacher Complexity can be exploited to
derive a Local Extended Rademacher Complexity bound, which takes into
account also the unlabeled samples (if available), and to obtain a bound,
analogous to the one of Theorem 3.7. The steps to obtain this result are
similar to the ones of Section 3, but we have to properly modify Lemmas 3.4
and 3.6 so as to deal with the extended definition of the notion of complexity.
The results of Lemma 3.4 can be adapted to the new context by noting
that the Extended Rademacher Complexity is a sub–root function because
it is a sum of sub–root functions (see Appendix B). The modification of
Lemma 3.6, instead, requires that two results are proved. The first one
(Lemma 4.6) shows that we can relate the Expected Rademacher Complexity
to the Extended Rademacher Complexity, both computed on the set of loss
functions. Unfortunately, measures computed on the space of loss functions
still cannot be computed from data, since they require labels to be known.
Therefore, we can replace the Extended Rademacher Complexity computed
on the set of loss functions with its counterpart computed on the set of
functions (Lemma 4.7). The latter is a fully empirical quantity, since it does
not require the knowledge of the labels.
Lemma 4.6. Let us consider two sub–root functions and their fixed points:

ψ_{n_l}(r) = R_{n_l}(L^s_r),   ψ_{n_l}(r*_{n_l}) = r*_{n_l},  (66)

ψ_{n_u}(r) = R̂_{n_u}(L̂^s_r) + √(2x/(m n_l)),   ψ_{n_u}(r*_{n_u}) = r*_{n_u}.  (67)

Then the following inequalities hold with probability (1 − 2e^{−x}):

ψ_{n_l}(r) ≤ ψ_{n_u}(r),   r*_{n_l} ≤ r*_{n_u}.  (68)

Proof. We exploit Lemma 3.5, Lemma Appendix A.5 and Lemma Appendix A.6
to obtain the following chain of inequalities, which holds with probability
(1 − 2e^{−x}):

ψ_{n_l}(r) = R_{n_l}(L^s_r) = R_{n_u}(L^s_r) ≤ R̂_{n_u}(L^s_r) + √(2x/(m n_l)) ≤ R̂_{n_u}(L̂^s_r) + √(2x/(m n_l)) = ψ_{n_u}(r).  (69)

Note that ψ_{n_l}(r) is a sub–root function, and so is ψ_{n_u}(r), since it is a sum
of sub–root functions. Their fixed points are, respectively, r*_{n_l} and r*_{n_u}. Thus,
since ψ_{n_l}(r) ≤ ψ_{n_u}(r), for the properties of the sub–root functions we have
r*_{n_l} ≤ r*_{n_u} and r*_{n_u} ≥ R_{n_l}(L^s_r), as required by Lemma 3.5.
Lemma 4.7. Let us define Ĥ^s_r as

Ĥ^s_r = { αh : α ∈ [0, 1], ℓ ∈ L, L_{n_l}[(αℓ)²] ≤ 3r + √(x/(2n_l)) }.  (71)

Let us consider two sub–root functions and their fixed points:

ψ_{n_u}(r) = R̂_{n_u}(L̂^s_r) + √(2x/(m n_l)),   ψ_{n_u}(r*_{n_u}) = r*_{n_u},  (72)

ψ̂_{n_u}(r) = C_ℓ R̂_{n_u}(Ĥ^s_r) + √(2x/(m n_l)),   ψ̂_{n_u}(r̂*_{n_u}) = r̂*_{n_u}.  (73)

Then, the following inequalities hold:

ψ_{n_u}(r) ≤ ψ̂_{n_u}(r),   r*_{n_u} ≤ r̂*_{n_u}.  (74)

Proof. By exploiting the property of Remark 4.3 we have that:

ψ_{n_u}(r) = R̂_{n_u}(L̂^s_r) + √(2x/(m n_l)) ≤ C_ℓ R̂_{n_u}(Ĥ^s_r) + √(2x/(m n_l)) = ψ̂_{n_u}(r).  (75)

Thanks to the properties of sub–root functions, if r*_{n_u} and r̂*_{n_u} are the fixed
points of ψ_{n_u}(r) and ψ̂_{n_u}(r) respectively, then r*_{n_u} ≤ r̂*_{n_u}.
Remark 4.8. Note that R̂_{n_u}(Ĥ^s_r) can be computed from the available data,
as it does not require the labels Y^u_i, i ∈ {1, …, n_u}, to be known.
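Remark 4.8 suggests a direct computation: split the pooled inputs into m blocks of n_l points and average the per-block complexities, as in Definition 4.4, but on the space of functions. A sketch with a toy threshold class (our own construction; the 2/n_l normalization of Definition 4.4 is assumed):

```python
# Sketch of an Extended-Rademacher-style quantity on H (Definition 4.4 /
# Remark 4.8): average per-block complexities over m blocks.  No labels used.
import random

random.seed(5)
n_l, m = 40, 3
xs = [random.random() for _ in range(n_l * m)]     # labeled + unlabeled inputs

def h(theta, x):                                   # hypothetical threshold class
    return 1.0 if x > theta else -1.0

thetas = [0.3, 0.6]

def block_rc(block, n_mc=400):
    total = 0.0
    for _ in range(n_mc):
        sigma = [random.choice((-1, 1)) for _ in range(len(block))]
        total += max(sum(s * h(t, x) for s, x in zip(sigma, block)) for t in thetas)
    return 2.0 * total / (len(block) * n_mc)

blocks = [xs[j * n_l:(j + 1) * n_l] for j in range(m)]
R_ext = sum(block_rc(b) for b in blocks) / m       # the (1/m)-average of Eq. (64)
print(0.0 < R_ext <= 2.0)
```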
We finally derive the fully empirical Local Extended Rademacher Com-
plexity bound on the generalization error of a function chosen in the original
hypothesis space H. This is the counterpart of Theorem 3.7, but it also
exploits unlabeled samples, when available.
Theorem 4.9. Let us consider a space of functions H and the fixed point r̂*_{n_u}
of the following sub–root function:

ψ̂_{n_u}(r) = C_ℓ R̂_{n_u}(Ĥ^s_r) + √(2x/(m n_l)),  (76)

where

Ĥ^s_r = { αh : α ∈ [0, 1], ℓ ∈ L, L_{n_l}[(αℓ)²] ≤ 3r + √(x/(2n_l)) }.  (77)

Then, ∀h ∈ H and ∀K > 1, the following inequality holds with probability
(1 − 3e^{−x}):

L(h) ≤ max{ [K/(K−1)] L_{n_l}(h), L_{n_l}(h) + K r̂*_{n_u} + 2√(x/(2n_l)) }.  (78)

Proof. The theorem can be straightforwardly proved by combining the results
of Theorem 3.3, Lemma 4.6 and Lemma 4.7.
The bound of Theorem 4.9, which takes into account unlabeled samples,
is always sharper than the bound of Theorem 3.7, and the two coincide when
no unlabeled samples are available (m = 1). In fact, assuming that the value
of the Rademacher Complexity does not remarkably change when we average
m different realizations (see Definition 4.4), we have that ψ̂_{n_u}(r) ≤ ψ̂_{n_l}(r).
Therefore, for the properties of the sub–root functions, r̂*_{n_u} ≤ r̂*_{n_l}.
5. Pushing forward the state–of–the–art: further sharpening the
bound
The bounds of Theorems 3.7 and 4.9 are based on the exploitation of Mc-
Diarmid’s inequalities (McDiarmid, 1989). We exploit in this section more
refined concentration inequalities (Boucheron et al., 2000; Bousquet, 2002;
Boucheron et al., 2013), based on the milestone results of Talagrand Ledoux
& Talagrand (1991). This approach improves the technique proposed by
Bartlett et al. (2005) and obtains optimal constants for the bounds. To the
20
best knowledge of the authors, these results represent the state–of–art Lo-
cal Rademacher Complexity bounds, which are achieved by avoiding some
unnecessary upper bounds at the expenses of a slightly more complex formu-
lation with respect to the ones of Theorems 3.7 and 4.9.
As a first issue, we start by considering the case when no unlabeled sam-
ples are available (Theorem 3.7). For this purpose, we modify Theorem 3.3,
by giving up the closed form solution. Then, we exploit the more refined
concentration inequalities in Lemmas 3.5 and 3.6. By combining these dif-
ferent pieces, the desired bound, taking into account only labeled samples,
can be derived.
The first step is to obtain the counterpart of Theorem 3.3.
Theorem 5.1. Let us consider a sub–root function ψ_{n_l}(r) and its fixed point
r*_{n_l}, and suppose that, ∀r > r*_{n_l}:

R_{n_l}(L^s_r) ≤ ψ_{n_l}(r).  (79)

Let us define r_U as the largest solution, with respect to r, of the following
equation:

√(r r*_{n_l}) + [2√(r r*_{n_l}) + r] φ̲( x / (n_l [2√(r r*_{n_l}) + r]) ) = r/K.  (80)

Then, ∀h ∈ H and ∀K > 1 we have that, with probability (1 − e^{−x}):

L(h) ≤ max{ [K/(K−1)] L_{n_l}(h), L_{n_l}(h) + r_U/K }.  (81)
Proof. Analogously to what was done in Theorem 3.3, we upper bound the
uniform deviation Û_{n_l}(L_r). In this case, we exploit Lemma Appendix A.1,
Lemma Appendix A.4 and Lemma 3.2:

Û_{n_l}(L_r) ≤ √(r r*_{n_l}) + [2√(r r*_{n_l}) + r] φ̲( x / (n_l [2√(r r*_{n_l}) + r]) ) = A√r + C(r) ≤ r/K,  (82)

where A = √(r*_{n_l}) and C(r) collects the φ̲ term. Let us define r_U as the
largest possible solution of the following equation:

A√r + C(r) = r/K.  (83)

Since C(r) ≥ 0 for every r, at r = r_U we have:

A√(r_U) ≤ A√(r_U) + C(r_U) = r_U/K.  (84)

Then

r_U ≥ A²K² ≥ r*_{n_l}.  (85)

Combining this result with the one of Theorem 3.1 allows proving the theo-
rem.
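The largest solution r_U of Eq. (83) can be found numerically by scanning for the last sign change and bisecting. In the sketch below, C(r) is a hypothetical non-negative stand-in for the φ̲ term of Eq. (80); the same scheme applies verbatim to the actual expression:

```python
# Numerically locate r_U, the largest solution of A*sqrt(r) + C(r) = r/K
# (Eq. (83)).  C(r) is a hypothetical stand-in for the phi-term of Eq. (80).
import math

A, K = 0.3, 2.0

def C(r):
    return 0.05 / (1.0 + r)        # any non-negative, well-behaved stand-in

def f(r):
    return A * math.sqrt(r) + C(r) - r / K

# scan a grid for the last sign change, then bisect inside that bracket
grid = [i / 1000 for i in range(1, 20001)]        # r in (0, 20]
lo = hi = None
for a, b in zip(grid, grid[1:]):
    if f(a) > 0 >= f(b):
        lo, hi = a, b                              # keep the last bracket found
for _ in range(60):
    mid = (lo + hi) / 2
    if f(mid) > 0:
        lo = mid
    else:
        hi = mid
r_U = (lo + hi) / 2
print(abs(f(r_U)) < 1e-6 and r_U >= A ** 2 * K ** 2)   # consistent with Eq. (85)
```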
The next two results are the counterparts of Lemmas 3.5 and 3.6.
Theorem 5.2. Let us suppose that:

r ≥ R_{n_l}(L^s_r).  (86)

Let us define L̂^s_r as:

L̂^s_r = { αℓ : α ∈ [0, 1], ℓ ∈ L, L_{n_l}[(αℓ)²] ≤ 3r + 5r φ̲( x/(5 n_l r) ) }.  (87)

Then, with probability (1 − e^{−x}):

L^s_r ⊆ L̂^s_r.  (88)

Proof. Let us consider Lemma Appendix A.1 and Lemma Appendix A.4.
Then, we have that, ∀ℓ' ∈ L^s_r and with probability (1 − e^{−x}):

L_{n_l}[(ℓ')²] ≤ r + 2r + (4r + r) φ̲( x / (n_l (4r + r)) ) ≤ 3r + 5r φ̲( x/(5 n_l r) ),  (89)

which allows proving the theorem.
Lemma 5.3. Let us consider two sub–root functions and their fixed points:

ψ_{n_l}(r) = R_{n_l}(L^s_r),   ψ_{n_l}(r*_{n_l}) = r*_{n_l},  (90)

ψ̂_{n_l}(r) = R̂_{n_l}(L̂^s_r) + r φ̄( 2x/(n_l r) ),   ψ̂_{n_l}(r̂*_{n_l}) = r̂*_{n_l}.  (91)

Then, the following inequalities hold with probability (1 − 2e^{−x}):

ψ_{n_l}(r) ≤ ψ̂_{n_l}(r),   r*_{n_l} ≤ r̂*_{n_l}.  (92)

Proof. By exploiting Lemma Appendix A.4, Theorem 5.2 and the sub–root
properties (see Appendix B), we have that:

ψ_{n_l}(r) = R_{n_l}(L^s_r) ≤ R̂_{n_l}(L^s_r) + R_{n_l}(L^s_r) φ̄( 2x / (n_l R_{n_l}(L^s_r)) ) ≤ R̂_{n_l}(L̂^s_r) + r φ̄( 2x/(n_l r) ) = ψ̂_{n_l}(r).  (93)

Since ψ_{n_l}(r) ≤ ψ̂_{n_l}(r), we have that r*_{n_l} ≤ r̂*_{n_l}.
Finally, we derive the tighter version of the bound, which contemplates
only labeled samples.

Theorem 5.4. Let us consider a space of functions H and the fixed point r̂*_{n_l}
of the following sub–root function:

ψ̂_{n_l}(r) = R̂_{n_l}(L̂^s_r) + r φ̄( 2x/(n_l r) ),  (94)

where

L̂^s_r = { αℓ : α ∈ [0, 1], ℓ ∈ L, L_{n_l}[(αℓ)²] ≤ 3r + 5r φ̲( x/(5 n_l r) ) }.  (95)

Let us define r_U as the largest solution of the following equation:

√(r r̂*_{n_l}) + [2√(r r̂*_{n_l}) + r] φ̲( x / (n_l [2√(r r̂*_{n_l}) + r]) ) = r/K.  (96)

Then, ∀h ∈ H and ∀K > 1, the following inequality holds with probability
(1 − 3e^{−x}):

L(h) ≤ max{ [K/(K−1)] L_{n_l}(h), L_{n_l}(h) + r_U/K }.  (97)

Proof. The theorem can be straightforwardly proved by combining the results
of Theorem 5.1 and Lemma 5.3.
Theorem 5.4 is a sharper version of the bound of Theorem 3.7. Note that
both Theorem 5.4 and Theorem 3.7 are in implicit form. However, Theorem
5.4 requires searching for a fixed point and finding the largest solution of an
equation; Theorem 3.7, instead, only requires finding a fixed point.
Analogously, it is possible to derive a tighter bound when unlabeled sam-
ples are available.

Theorem 5.5. Let us consider a space of functions H and the fixed point
r̂*_{n_u} of the following sub–root function:

ψ̂_{n_u}(r) = C_ℓ R̂_{n_u}(Ĥ^s_r) + r φ̄( 2x/(m n_l r) ),  (98)

where

Ĥ^s_r = { αh : α ∈ [0, 1], ℓ ∈ L, L_{n_l}[(αℓ)²] ≤ 3r + 5r φ̲( x/(5 n_l r) ) }.  (99)

Let us define r_U as the largest solution of the following equation:

√(r r̂*_{n_u}) + [2√(r r̂*_{n_u}) + r] φ̲( x / (n_l [2√(r r̂*_{n_u}) + r]) ) = r/K.  (100)

Then, ∀h ∈ H and ∀K > 1, the following inequality holds with probability
(1 − 3e^{−x}):

L(h) ≤ max{ [K/(K−1)] L_{n_l}(h), L_{n_l}(h) + r_U/K }.  (101)
Note that the bound of Theorem 5.5 coincides with the one of Theorem
5.4 when m = 1, namely when no unlabeled samples are available. Analogous
considerations hold for Theorem 4.9 and Theorem 3.7.

To the best of our knowledge, Theorem 5.4 is the sharpest Local Ra-
demacher Complexity bound for the usual case when only labeled samples
are available. Theorem 5.5 further improves it, since it also contemplates
unlabeled samples.
6. Discussion: How tight are the new bounds?
In this section, we benchmark the bounds proposed in this paper against
state-of-the-art results in the literature, aiming to show the positive effect
of exploiting unlabeled samples, when available. In particular, we compare
the following bounds:
(I) Theorem 5.5, which coincides with the one of Theorem 5.4 when m = 1
(II) The sharpest bound based on Global Rademacher Complexity with un-
labeled data (Anguita et al., 2011)
(III) The state-of-the-art Local Rademacher Complexity bound (Corollary
5.1 in Bartlett et al. (2005)).
Figure 1: Comparison of the bound computed with no unlabeled samples proposed in this
paper against the state–of–the–art Local Rademacher Complexity bound.

Figure 2: Comparison of the bounds proposed in this paper against the state–of–the–art
bound with unlabeled samples (m = 1).

Figure 3: Comparison of the bounds proposed in this paper against the state–of–the–art
bound with unlabeled samples (m = 10).

Figure 4: Comparison of the bounds proposed in this paper against the state–of–the–art
bound with unlabeled samples (m = 100).

Figure 5: Comparison of the bounds proposed in this paper against the state–of–the–art
bound with unlabeled samples (m = 1000).

Figure 6: Effect of the unlabeled samples on the bound proposed in this paper (m ∈
{1, 10, 100, 1000}).
Results are obtained with respect to the scenario where H = {h*} and
L_{n_l}(h*) = 0. In this case, the hypothesis space is composed of only one
model, which makes no errors on the labeled training data: this is the best-
case scenario for every configuration (global and local settings, both when
unlabeled samples are considered and when they are unavailable), thus
allowing a fair comparison of the different bounds.
Figure 1 proposes the comparison in the scenario when no unlabeled sam-
ples are available. In particular, we benchmark the bound proved in Theorem
5.4 (or, equivalently, Theorem 5.5 with m = 1) against the Local Rademacher
Complexity bound of Corollary 5.1 in Bartlett et al. (2005), as the number
of labeled samples n_l is varied: the plot shows that, independently of
the number of training samples, the proposed bound is tighter than the one
previously proposed in the literature.
Figures 2, 3, 4, and 5, instead, compare the bound proposed
in Theorem 5.5 with the sharpest bound based on Global Rademacher Com-
plexity with unlabeled data (Anguita et al., 2011), as both the number of
labeled samples n_l and the unlabeled-to-labeled ratio m are varied. Generally, the proposed
bound is tighter than the previous proposition in the literature, especially as
the number of unlabeled samples increases: this effect is particularly evident
for m > 100. In fact, while the state-of-the-art Global Rademacher Complexity
bound by Anguita et al. (2011) does not remarkably improve as m increases,
Figure 6 shows that the bound of Theorem 5.5 becomes much tighter
as the number of available unlabeled samples grows, especially when few
labeled samples are available (nl < 500). As a matter of fact, the additional
information related to unlabeled samples helps in defining more refined local
spaces, thus leading to benefits in terms of tightness.
As a final remark, it is worth noting that this work also represents a
starting point for future research. In fact, the effectiveness of the proposed
bound will have to be tested in real-world applications, e.g. as a tool for
model selection and error estimation (Bartlett et al., 2002a; Anguita et al.,
2011). As a second issue, since Local Rademacher Complexity has been
exploited for different purposes (e.g. Multiple Kernel Learning (Kloft &
Blanchard, 2011; Cortes et al., 2013)), an interesting research perspective
consists in verifying whether the Extended Local Rademacher Complexity
allows obtaining improved results in these frameworks. Finally, it is also
worth exploring, from a theoretical point of view, whether making additional
hypotheses (e.g. low noise (Srebro et al., 2010; Audibert & Tsybakov, 2007;
Steinwart & Scovel, 2007)) can further improve the tightness of the
proposed bound.
Appendix A. Properties of the Uniform Deviation, the Radema-
cher Complexity and the Extended Rademacher Com-
plexity
The following lemma shows how to upper bound the Uniform Deviation
in terms of the Rademacher Complexity. Moreover, it allows us to upper
bound the Rademacher Complexity of the square loss in terms of the
Rademacher Complexity of the loss.
Lemma Appendix A.1. For the Rademacher Complexity and the Uni-
form Deviation, the following properties hold (Bartlett & Mendelson, 2003;
Koltchinskii, 2011):

U_{n_l}(L) ≤ R_{n_l}(L),   U²_{n_l}(L) ≤ R²_{n_l}(L) ≤ 2R_{n_l}(L).  (A.1)
The proof is mainly based on the symmetrization lemma (Bartlett &
Mendelson, 2003) and the contraction inequality (Ledoux & Talagrand, 1991;
Bartlett et al., 2005; Koltchinskii, 2011).
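To make the quantities in Lemma Appendix A.1 concrete, the empirical Rademacher complexity of a finite loss class can be estimated by Monte Carlo averaging over random sign vectors. The following sketch (in Python; the function name, the sup-of-averages normalization, and the random loss matrix are our own illustrative assumptions, with losses normalized to [0, 1]) also checks the contraction inequality for the square loss numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, H = 200, 50                                 # n_l samples, |H| hypotheses
losses = rng.uniform(0.0, 1.0, size=(H, n))    # loss of each hypothesis on each sample

def empirical_rademacher(loss_matrix, n_mc=2000, rng=rng):
    """Monte Carlo estimate of sup_h (1/n) sum_i sigma_i * l_h(x_i),
    averaged over n_mc draws of the Rademacher signs sigma_i."""
    H, n = loss_matrix.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))
    corr = sigma @ loss_matrix.T / n           # (n_mc, H): sign/loss correlations
    return corr.max(axis=1).mean()             # sup over hypotheses, mean over signs

r_loss = empirical_rademacher(losses)
r_sq = empirical_rademacher(losses ** 2)
# contraction inequality: t -> t^2 is 2-Lipschitz on [0, 1], so R(l^2) <= 2 R(l)
assert r_sq <= 2.0 * r_loss
```

The contraction step holds because squaring is 2-Lipschitz on [0, 1], which is exactly the mechanism behind the rightmost inequality in (A.1).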
The next two lemmas, instead, show that the Uniform Deviation and the
Rademacher Complexity are tightly concentrated around their expected
values.
Lemma Appendix A.2. The functions $\hat{R}_{n_l}(\mathcal{L})$, $\hat{U}_{n_l}(\mathcal{L})$ and $\hat{U}^2_{n_l}(\mathcal{L})$ are bounded difference functions (Bartlett & Mendelson, 2003). Thus, according to (McDiarmid, 1989), we can state that, with probability $(1 - e^{-x})$:
$$R_{n_l}(\mathcal{L}) \le \hat{R}_{n_l}(\mathcal{L}) + \sqrt{\frac{2x}{n_l}}, \quad (A.2)$$
$$U_{n_l}(\mathcal{L}) \le \hat{U}_{n_l}(\mathcal{L}) + \sqrt{\frac{x}{2 n_l}}, \quad (A.3)$$
$$U^2_{n_l}(\mathcal{L}) \le \hat{U}^2_{n_l}(\mathcal{L}) + \sqrt{\frac{x}{2 n_l}}. \quad (A.4)$$
Definition Appendix A.3. In accordance with (Boucheron et al., 2000; Bousquet, 2002; Oneto et al., 2013), we can define the following functions:
$$\phi(a) = (1 + a)\log(1 + a) - a, \quad a > -1, \quad (A.5)$$
$$\underline{\phi}(a) = 1 - \exp\left[1 + W_{-1}\left(\frac{a - 1}{\exp(1)}\right)\right], \quad \phi\left[-\underline{\phi}(a)\right] = a, \quad a \in [0, 1], \quad (A.6)$$
$$\overline{\phi}(a) = \exp\left[1 + W_{0}\left(\frac{a - 1}{\exp(1)}\right)\right] - 1, \quad \phi\left[\overline{\phi}(a)\right] = a, \quad a \in [0, +\infty), \quad (A.7)$$
where $W_{-1}$ and $W_0$ are, respectively, the two real branches of the Lambert W function (Weisstein, 2013).
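As a numerical sanity check of Definition Appendix A.3, the two inverse branches can be evaluated with `scipy.special.lambertw` (branch `k=0` for $W_0$ and `k=-1` for $W_{-1}$). This sketch (function names are ours) verifies the inverse identities $\phi[-\underline{\phi}(a)] = a$ and $\phi[\overline{\phi}(a)] = a$ on a grid of values:

```python
import numpy as np
from scipy.special import lambertw

def phi(a):
    # phi(a) = (1 + a) log(1 + a) - a, defined for a > -1
    return (1.0 + a) * np.log1p(a) - a

def phi_lower(a):
    # inverse through the W_{-1} branch: phi(-phi_lower(a)) = a, a in [0, 1]
    return 1.0 - np.exp(1.0 + lambertw((a - 1.0) / np.e, k=-1).real)

def phi_upper(a):
    # inverse through the W_0 branch: phi(phi_upper(a)) = a, a in [0, +inf)
    return np.exp(1.0 + lambertw((a - 1.0) / np.e, k=0).real) - 1.0

a = np.linspace(0.05, 0.95, 10)
assert np.allclose(phi(-phi_lower(a)), a)
assert np.allclose(phi(phi_upper(a)), a)
```

Note that `lambertw` returns complex values, so the real part is taken; for the argument range used here both branches are real.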
Lemma Appendix A.4. The function $\hat{R}_{n_l}(\mathcal{L})$ is a self-bounding function (Bartlett & Mendelson, 2003; Boucheron et al., 2000), while $\hat{U}_{n_l}(\mathcal{L})$ and $\hat{U}^2_{n_l}(\mathcal{L})$ satisfy the hypotheses of (Bousquet, 2002; Bartlett et al., 2005). Consequently, we can also state that, with probability $(1 - e^{-x})$:
$$R_{n_l}(\mathcal{L}) \le \hat{R}_{n_l}(\mathcal{L}) + R_{n_l}(\mathcal{L})\,\underline{\phi}\!\left(\frac{2x}{n_l R_{n_l}(\mathcal{L})}\right), \quad (A.8)$$
$$U_{n_l}(\mathcal{L}) \le \hat{U}_{n_l}(\mathcal{L}) + \left[2 U_{n_l}(\mathcal{L}) + L(\ell^2)\right]\underline{\phi}\!\left\{\frac{x}{n_l \left[2 U_{n_l}(\mathcal{L}) + L(\ell^2)\right]}\right\}, \quad (A.9)$$
$$U^2_{n_l}(\mathcal{L}) \le \hat{U}^2_{n_l}(\mathcal{L}) + \left[2 U^2_{n_l}(\mathcal{L}) + L(\ell^2)\right]\underline{\phi}\!\left\{\frac{x}{n_l \left[2 U^2_{n_l}(\mathcal{L}) + L(\ell^2)\right]}\right\}. \quad (A.10)$$
The following lemma links the Expected Extended Rademacher Complexity and the Expected Rademacher Complexity.
Lemma Appendix A.5. The Expected Extended Rademacher Complexity and the Expected Rademacher Complexity coincide.
Proof. The proof is trivial by noting that:
$$R_{n_u}(\mathcal{L}) = \mathbb{E}_{\mu}\,\frac{1}{m}\sum_{j=1}^{m}\hat{R}^j_{n_l}(\mathcal{L}) = \frac{1}{m}\sum_{j=1}^{m}\mathbb{E}_{\mu}\hat{R}^j_{n_l}(\mathcal{L}) = \frac{1}{m}\sum_{j=1}^{m}R_{n_l}(\mathcal{L}) = R_{n_l}(\mathcal{L}). \quad (A.11)$$
The next two lemmas, instead, show that the Extended Rademacher Complexity is tightly concentrated around its expected value.
Lemma Appendix A.6. The Extended Rademacher Complexity $\hat{R}_{n_u}(\mathcal{L})$ is a bounded difference function (since it is a sum of bounded difference functions (Anguita et al., 2011; McDiarmid, 1989)). Thus, we have that, with probability $(1 - e^{-x})$:
$$R_{n_l}(\mathcal{L}) = R_{n_u}(\mathcal{L}) \le \hat{R}_{n_u}(\mathcal{L}) + \sqrt{\frac{2x}{m n_l}}. \quad (A.12)$$
Lemma Appendix A.7. Since $\hat{R}_{n_u}(\mathcal{L})$ is a self-bounding function (as it is a sum of self-bounding functions (Anguita et al., 2011; Boucheron et al., 2000)), we have that, with probability $(1 - e^{-x})$:
$$R_{n_l}(\mathcal{L}) = R_{n_u}(\mathcal{L}) \le \hat{R}_{n_u}(\mathcal{L}) + R_{n_u}(\mathcal{L})\,\underline{\phi}\!\left[\frac{2x}{m n_l R_{n_u}(\mathcal{L})}\right] = \hat{R}_{n_u}(\mathcal{L}) + R_{n_l}(\mathcal{L})\,\underline{\phi}\!\left[\frac{2x}{m n_l R_{n_l}(\mathcal{L})}\right]. \quad (A.13)$$
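To see why the self-bounding refinement of Lemma Appendix A.7 matters, one can compare, for a plugged-in value of the expected complexity, the Lambert-W-based penalty term with the plain bounded-difference penalty $\sqrt{2x/(m n_l)}$ of Lemma Appendix A.6. The following illustrative sketch (the numeric values, and the plug-in treatment of the implicit complexity, are our own assumptions) shows the self-bounding penalty is smaller when the complexity itself is small:

```python
import numpy as np
from scipy.special import lambertw

def phi_lower(a):
    # inverse of phi through the W_{-1} branch: phi(-phi_lower(a)) = a
    return 1.0 - np.exp(1.0 + lambertw((a - 1.0) / np.e, k=-1).real)

x = np.log(1.0 / 0.05)       # confidence 95%, i.e. e^{-x} = 0.05
m, n_l = 10, 100             # 10 blocks of n_l unlabeled samples each
mcdiarmid = np.sqrt(2.0 * x / (m * n_l))
for r in (0.01, 0.05, 0.2):  # plugged-in values of the expected complexity
    self_bounding = r * phi_lower(2.0 * x / (m * n_l * r))
    assert self_bounding < mcdiarmid
```

For large complexity values the ordering can reverse, which is consistent with the self-bounding bound being advantageous precisely in the small-complexity regime targeted by local analyses.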
Appendix B. Sub–Root Functions Properties
The notion of sub–root function has emerged as a key concept in Learning
Theory in recent years (Bartlett et al., 2005; Koltchinskii, 2006; Bartlett
et al., 2002b). Its properties are useful to prove the main results of this
paper and, for this reason, we survey them in the following.
Lemma Appendix B.1. A non–trivial sub–root function $\psi$ satisfies the following properties (Bousquet et al., 2004; Bartlett et al., 2005):
1. $\psi : [0, +\infty) \to [0, +\infty)$;
2. if $0 < \beta \le 1$, then $\psi(\beta r) \ge \sqrt{\beta}\,\psi(r)$;
3. if $\beta \ge 1$, then $\psi(\beta r) \le \sqrt{\beta}\,\psi(r)$;
4. $\psi(r)$ is continuous on $[0, +\infty)$;
5. the equation $\psi(r) = r$ has a unique positive solution $r^*$, which is called the fixed point of $\psi$;
6. for all $r > 0$, $r \ge \psi(r)$ if and only if $r \ge r^*$;
7. let $\psi_1$, $\psi_2$ have fixed points $r^*_1$, $r^*_2$, and let $\alpha \in [0, 1]$ be such that $\alpha\psi_2(r^*_1) \le \psi_1(r^*_1) \le \psi_2(r^*_1)$; then $\alpha^2 r^*_2 \le r^*_1 \le r^*_2$;
8. let $\psi$ have fixed point $r^*$ and let $g : \mathbb{R}^+ \to \mathbb{R}^+$ with smallest fixed point $r^*_g$ (if it exists); if $\psi(r) \le g(r)$ for all $r \in [0, +\infty)$, then $r^* \le r^*_g$;
9. for all $r \ge r^*$, $\psi(r) \le \sqrt{r/r^*}\,\psi(r^*) = \sqrt{r r^*}$;
10. let $\psi_1$ and $\psi_2$ be two sub–root functions; then $\psi(r) = \psi_1(r) + \psi_2(r)$ is also a sub–root function;
11. let $\psi$ be a sub–root function and $c \in [0, +\infty)$ a constant; then $\psi'(r) = \psi(r) + c$ is also a sub–root function.
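Property 5 above guarantees a unique positive fixed point $r^*$, and in practice it can be found by the plain iteration $r \leftarrow \psi(r)$. A minimal sketch (our own example: $\psi(r) = c\sqrt{r}$, a sub–root function whose fixed point is $r^* = c^2$, and a generic iteration helper whose convergence we only verify for this case):

```python
import math

def fixed_point(psi, r0=1.0, tol=1e-12, max_iter=10_000):
    """Find the positive fixed point r* = psi(r*) by the iteration r <- psi(r),
    stopping when successive iterates differ by at most tol."""
    r = r0
    for _ in range(max_iter):
        r_next = psi(r)
        if abs(r_next - r) <= tol:
            return r_next
        r = r_next
    raise RuntimeError("fixed-point iteration did not converge")

# psi(r) = c * sqrt(r) is sub-root; its positive fixed point is r* = c**2
c = 3.0
r_star = fixed_point(lambda r: c * math.sqrt(r))
assert abs(r_star - c**2) < 1e-8
# property 6: r >= psi(r) iff r >= r* (here r* = 9)
assert 10.0 >= c * math.sqrt(10.0)
assert not (8.0 >= c * math.sqrt(8.0))
```

Property 6 is what makes such fixed points useful in risk bounds: any $r$ dominating $\psi(r)$ automatically dominates $r^*$, so $r^*$ is the sharpest radius one can certify.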
References

Anguita, D., Ghio, A., Oneto, L., & Ridella, S. (2011). The impact of unlabeled patterns in Rademacher complexity theory for kernel classifiers. Neural Information Processing Systems, (pp. 585–593).

Audibert, J. Y., & Tsybakov, A. B. (2007). Fast learning rates for plug-in classifiers. The Annals of Statistics, 35, 608–633.

Bartlett, P. L., Boucheron, S., & Lugosi, G. (2002a). Model selection and error estimation. Machine Learning, 48, 85–113.

Bartlett, P. L., Bousquet, O., & Mendelson, S. (2002b). Localized Rademacher complexities. In Computational Learning Theory.

Bartlett, P. L., Bousquet, O., & Mendelson, S. (2005). Local Rademacher complexities. The Annals of Statistics, 33, 1497–1537.

Bartlett, P. L., & Mendelson, S. (2003). Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3, 463–482.

Blanchard, G., & Massart, P. (2006). Discussion: Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34, 2664–2671.

Boucheron, S., Lugosi, G., & Massart, P. (2000). A sharp concentration inequality with applications. Random Structures & Algorithms, 16, 277–292.

Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.

Bousquet, O. (2002). A Bennett concentration inequality and its application to suprema of empirical processes. Comptes Rendus Mathematique, 334, 495–500.

Bousquet, O., Boucheron, S., & Lugosi, G. (2004). Introduction to statistical learning theory. In Advanced Lectures on Machine Learning.

Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. The Journal of Machine Learning Research, 2, 499–526.

Chapelle, O., Scholkopf, B., & Zien, A. (2006). Semi-Supervised Learning. MIT Press, Cambridge.

Cortes, C., Kloft, M., & Mohri, M. (2013). Learning kernels using local Rademacher complexity. In Neural Information Processing Systems.

Kloft, M., & Blanchard, G. (2011). The local Rademacher complexity of lp-norm multiple kernel learning. In Neural Information Processing Systems.

Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34, 2593–2656.

Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer.

Ledoux, M., & Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes, volume 23. Springer.

Lever, G., Laviolette, F., & Shawe-Taylor, J. (2013). Tighter PAC-Bayes bounds through distribution-dependent priors. Theoretical Computer Science, 473, 4–28.

McAllester, D., & Akinbiyi, T. (2013). PAC-Bayesian theory. In Empirical Inference (pp. 95–103).

McDiarmid, C. (1989). On the method of bounded differences. Surveys in Combinatorics, 141, 148–188.

Oneto, L., Ghio, A., Anguita, D., & Ridella, S. (2013). An improved analysis of the Rademacher data-dependent bound using its self bounding property. Neural Networks, 44, 107–111.

Parrado-Hernandez, E., Ambroladze, A., Shawe-Taylor, J., & Sun, S. (2012). PAC-Bayes bounds with data dependent priors. The Journal of Machine Learning Research, 13, 3507–3531.

Rosasco, L., De Vito, E., Caponnetto, A., Piana, M., & Verri, A. (2004). Are loss functions all the same? Neural Computation, 16, 1063–1076.

Srebro, N., Sridharan, K., & Tewari, A. (2010). Smoothness, low noise and fast rates. In Neural Information Processing Systems.

Steinwart, I., & Scovel, C. (2007). Fast rates for support vector machines using Gaussian kernels. The Annals of Statistics, 35, 575–607.

Talagrand, M. (1987). The Glivenko-Cantelli problem. The Annals of Probability, 15, 837–870.

Valiant, L. (2013). Probably Approximately Correct: Nature's Algorithms for Learning and Prospering in a Complex World. Basic Books.

Vapnik, V. N. (1998). Statistical Learning Theory. Wiley–Interscience.

Weisstein, E. W. (2013). Lambert W-Function. From MathWorld–A Wolfram Web Resource, http://mathworld.wolfram.com/LambertW-Function.html.