arXiv:2107.14478v2 [math.NA] 5 Sep 2021
ERROR ANALYSIS OF DEEP RITZ METHODS FOR ELLIPTIC EQUATIONS
YULING JIAO∗, YANMING LAI†, YISU LO‡, YANG WANG§, AND YUNFEI YANG¶
Abstract. Using deep neural networks to solve PDEs has attracted a lot of attention recently. However, the theoretical understanding of why deep learning works still lags far behind its empirical success. In this paper, we provide a rigorous numerical analysis of the deep Ritz method (DRM) [43] for second order elliptic equations with Dirichlet, Neumann and Robin boundary conditions. We establish the first nonasymptotic convergence rate in the H^1 norm for DRM using deep networks with smooth activation functions, including the logistic and hyperbolic tangent functions. Our results show how to set the hyper-parameters of depth and width to achieve the desired convergence rate in terms of the number of training samples.

Key words. DRM, neural networks, approximation error, Rademacher complexity, chaining method, pseudo-dimension, covering number.

AMS subject classifications. 65N99
1. Introduction. Partial differential equations (PDEs) are among the fundamental mathematical models for studying a variety of phenomena arising in science and engineering. Many conventional numerical methods have been successfully established for solving PDEs in low dimensions (d ≤ 3), most notably the finite element method [6, 7, 33, 39, 22]. However, one encounters difficulties in both theoretical analysis and numerical implementation when extending conventional numerical schemes to high-dimensional PDEs. The classical analysis of convergence, stability and other properties becomes problematic due to the complex construction of finite element spaces [7, 6]. Moreover, in terms of practical computation, the size of the discrete problem increases exponentially with respect to the dimension.
Motivated by the well-known success of deep learning in high-dimensional data analysis, with applications in discriminative, generative and reinforcement learning [18, 14, 37], solving high-dimensional PDEs with deep neural networks has become a promising approach and has attracted much attention [3, 38, 27, 34, 43, 45, 5, 17]. Roughly speaking, these works can be divided into three categories. The first category uses deep neural networks to improve classical numerical methods; see, for example, [40, 42, 21, 15]. In the second category, neural operators are introduced to learn mappings between infinite-dimensional spaces with neural networks [24, 2, 25]. In the last category, one utilizes deep neural networks to approximate the solutions of PDEs directly, including physics-informed neural networks (PINNs) [34], the deep Ritz method (DRM) [43] and the deep Galerkin method (DGM) [45]. PINNs are based on residual minimization for solving PDEs [3, 38, 27, 34]. Proceed from
∗School of Mathematics and Statistics, and Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan 430072, P.R. China. ([email protected])
†School of Mathematics and Statistics, Wuhan University, Wuhan 430072, P.R. China. ([email protected])
‡Department of Mathematics, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong ([email protected])
§Department of Mathematics, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong ([email protected])
¶Department of Mathematics, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong ([email protected])
We summarize the related works and our results in Table 1.1 below.
Table 1.1: Previous works and our results
Paper      | Depth and activation functions                 | Equation(s)                                | Regularity condition
[28]       | D = 2, ReLU^3                                  | Second order differential equation         | u∗ ∈ Barron class
[26]       | D = 2, Softplus                                | Poisson equation and Schrödinger equation  | u∗ ∈ Barron class
[19]       | D = 2, ReLU^k                                  | 2m-th order differential equation          | u∗ ∈ Barron class
[9]        | D = O(log d), ReLU^2                           | Second order elliptic equation             | u∗ ∈ C^2
This paper | D = O(log d), logistic and hyperbolic tangent  | Second order elliptic equation             | u∗ ∈ H^2
The rest of the paper is organized as follows. In Section 2, we give some preliminaries. In Section 3, we present the deep Ritz method and the error decomposition used in our analysis. In Sections 4 and 5, we give a detailed analysis of the approximation error and the statistical error, respectively. In Section 6, we present our main results. We conclude with a short discussion in Section 7.
2. Neural Network. Due to its strong expressivity, the neural network function class plays an important role in machine learning. A variety of neural networks are chosen as parameterized function classes in the training process. We now introduce some notation related to neural networks which will simplify our later discussion. Let D ∈ N_+. A function f : R^d → R^{n_D} implemented by a neural network is defined by

f_0(x) = x,
f_ℓ(x) = ρ(A_ℓ f_{ℓ−1} + b_ℓ) for ℓ = 1, ..., D − 1,
f := f_D(x) = A_D f_{D−1} + b_D,   (2.1)

where A_ℓ = (a^{(ℓ)}_{ij}) ∈ R^{n_ℓ × n_{ℓ−1}} and b_ℓ = (b^{(ℓ)}_i) ∈ R^{n_ℓ}. ρ is called the activation function and acts componentwise. D is called the depth of the network and W := max{n_ℓ : ℓ = 1, ..., D} is called the width of the network. φ = {A_ℓ, b_ℓ}_ℓ are called the weight parameters. For convenience, we denote by n_i, i = 1, ..., D, the number of nonzero weights in the first i layers of the representation (2.1). Clearly n_D is the total number of nonzero weights. Sometimes we denote a function implemented by a neural network by f_ρ for short. We use the notation N_ρ(D, n_D, B_θ) to refer to the collection of functions implemented by a ρ-neural network with depth D, total number of nonzero weights n_D and each weight bounded by B_θ.
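To make the definition concrete, the following is a minimal NumPy sketch of a function in N_ρ(D, n_D, B_θ) with ρ = tanh, evaluated exactly as in (2.1); the function names, the random initialization and the clipping of weights to [−B_θ, B_θ] are illustrative choices, not part of the method analyzed in this paper.

import numpy as np

def init_network(widths, B_theta=1.0, seed=0):
    # widths = [n_0, n_1, ..., n_D]; n_0 = d is the input dimension.
    # Each weight entry is clipped to [-B_theta, B_theta] to mimic the bound on the weights.
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        A = np.clip(rng.standard_normal((n_out, n_in)), -B_theta, B_theta)
        b = np.clip(rng.standard_normal(n_out), -B_theta, B_theta)
        params.append((A, b))
    return params

def forward(params, x, rho=np.tanh):
    # Evaluate f(x) as in (2.1): rho is applied in every layer except the last.
    f = x
    for A, b in params[:-1]:
        f = rho(A @ f + b)
    A_D, b_D = params[-1]
    return A_D @ f + b_D

# Example: a network of depth D = 3 and width W = 16 mapping R^2 to R.
params = init_network([2, 16, 16, 1])
print(forward(params, np.array([0.3, 0.7])))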
3. Deep Ritz Method and Error Decomposition. Let Ω be a convex bounded open set in R^d and assume that ∂Ω ∈ C^∞. Without loss of generality we assume that Ω ⊂ [0, 1]^d. We consider the following second order elliptic equation:

−Δu + wu = f in Ω (3.1)

with three kinds of boundary conditions:

u = 0 on ∂Ω, (3.2a)
∂u/∂n = g on ∂Ω, (3.2b)
αu + β ∂u/∂n = g on ∂Ω, α, β ∈ R, β ≠ 0, (3.2c)

which are called the Dirichlet, Neumann and Robin boundary conditions, respectively. Note that for the Dirichlet problem we only consider the homogeneous boundary condition here, since the inhomogeneous case can be turned into the homogeneous one by translation. We also remark that the Neumann condition (3.2b) is covered by the Robin condition (3.2c). Hence in the following we only consider the Dirichlet problem and the Robin problem.
We make the following assumption on the known terms in the equation:

(A) f ∈ L^∞(Ω), g ∈ H^{1/2}(∂Ω), w ∈ L^∞(Ω), w ≥ c_w,

where c_w is some positive constant. In the following we abbreviate C(‖f‖_{L^∞(Ω)}, ‖g‖_{H^{1/2}(∂Ω)}, ‖w‖_{L^∞(Ω)}, c_w), a constant depending on the known terms in the equation, as C(coe) for simplicity.
For problem (3.1), (3.2a), the variational problem is to find u ∈ H^1_0(Ω) such that

(∇u, ∇v) + (wu, v) = (f, v), ∀v ∈ H^1_0(Ω). (3.3a)

The corresponding minimization problem is

min_{u ∈ H^1_0(Ω)} (1/2) ∫_Ω (|∇u|^2 + wu^2 − 2fu) dx. (3.3b)

Lemma 3.1. Let (A) hold. Let u_D be the solution of problem (3.3a) (also (3.3b)). Then u_D ∈ H^2(Ω).

Proof. See [12].
For problem (3.1), (3.2c), the variational problem is to find u ∈ H^1(Ω) such that

(∇u, ∇v) + (wu, v) + (α/β)(T_0u, T_0v)_{∂Ω} = (f, v) + (1/β)(g, T_0v)_{∂Ω}, ∀v ∈ H^1(Ω), (3.4a)

where T_0 is the zero order trace operator. The corresponding minimization problem is

min_{u ∈ H^1(Ω)} ∫_Ω ((1/2)|∇u|^2 + (1/2)wu^2 − fu) dx + (1/β) ∫_{∂Ω} ((α/2)(T_0u)^2 − g T_0u) ds. (3.4b)

Lemma 3.2. Let (A) hold. Let u_R be the solution of problem (3.4a) (also (3.4b)). Then u_R ∈ H^2(Ω) and ‖u_R‖_{H^2(Ω)} ≤ C(coe)/β for any β > 0.

Proof. See [12].
Intuitively, when α = 1, g = 0 and β → 0, we expect that the solution of the Robin problem converges to the solution of the Dirichlet problem. Hence we only need to consider the Robin problem, since the Dirichlet problem can be handled through a limit process. Define L as a functional on H^1(Ω):

L(u) := ∫_Ω ((1/2)|∇u|^2 + (1/2)wu^2 − fu) dx + (1/β) ∫_{∂Ω} ((α/2)(T_0u)^2 − g T_0u) ds.

The next lemma verifies this assertion.

Lemma 3.3. Let (A) hold. Let α = 1, g = 0. Let u_D be the solution of problem (3.3a) (also (3.3b)) and u_R the solution of problem (3.4a) (also (3.4b)). There holds

‖u_R − u_D‖_{H^1(Ω)} ≤ C(coe)β.
Proof. We first have

∫_Ω ∇u_D · ∇v dx − ∫_{∂Ω} T_1u_D v ds + ∫_Ω w u_D v dx = ∫_Ω f v dx, ∀v ∈ H^1(Ω),

with T_1 being the first order trace operator. Hence for any u ∈ H^1(Ω),

L(u) = ∫_Ω ((1/2)|∇u|^2 + (1/2)wu^2 − fu) dx + 1/(2β) ∫_{∂Ω} (T_0u)^2 ds
     = ∫_Ω ((1/2)|∇u|^2 + (1/2)wu^2) dx + 1/(2β) ∫_{∂Ω} (T_0u)^2 ds − ∫_Ω ∇u_D · ∇u dx + ∫_{∂Ω} T_1u_D u ds − ∫_Ω w u_D u dx
     = ∫_Ω ((1/2)|∇u − ∇u_D|^2 + (1/2)w(u − u_D)^2) dx + 1/(2β) ∫_{∂Ω} (T_0u + βT_1u_D)^2 ds
       − ∫_Ω ((1/2)|∇u_D|^2 + (1/2)wu_D^2) dx − (β/2) ∫_{∂Ω} (T_1u_D)^2 ds. (3.5)

Define

R_β(u) = ∫_Ω ((1/2)|∇u − ∇u_D|^2 + (1/2)w(u − u_D)^2) dx + 1/(2β) ∫_{∂Ω} (T_0u + βT_1u_D)^2 ds.

Since u_R is the minimizer of L and, by (3.5), L and R_β differ only by terms independent of u, we conclude that u_R is also the minimizer of R_β.

Note that u_D ∈ H^2(Ω) (Lemma 3.1); by the trace theorem we know that T_1u_D ∈ H^{1/2}(∂Ω), and hence there exists φ ∈ H^1(Ω) such that T_0φ = −T_1u_D. Set u = βφ + u_D; then

C(coe)‖u_R − u_D‖^2_{H^1(Ω)} ≤ R_β(u_R) ≤ R_β(u) = β^2 ∫_Ω ((1/2)|∇φ|^2 + (1/2)wφ^2) dx = C(u_D, coe)β^2.
Note that L can be equivalently written as

L(u) = |Ω| E_{X∼U(Ω)} [(1/2)|∇u(X)|^2 + (1/2)w(X)u^2(X) − f(X)u(X)]
       + (|∂Ω|/β) E_{Y∼U(∂Ω)} [(α/2)(T_0u)^2(Y) − g(Y)T_0u(Y)],

where U(Ω) and U(∂Ω) are the uniform distributions on Ω and ∂Ω, respectively. We then introduce a discrete version of L:

L̂(u) := (|Ω|/N) Σ_{i=1}^N [(1/2)|∇u(X_i)|^2 + (1/2)w(X_i)u^2(X_i) − f(X_i)u(X_i)]
        + (|∂Ω|/(βM)) Σ_{j=1}^M [(α/2)(T_0u)^2(Y_j) − g(Y_j)T_0u(Y_j)],

where {X_i}_{i=1}^N and {Y_j}_{j=1}^M are i.i.d. random variables distributed according to U(Ω) and U(∂Ω), respectively. We now consider a minimization problem with respect to L̂:

min_{u_φ ∈ P} L̂(u_φ), (3.6)

where P refers to the parameterized function class. We denote by u_φ the solution of problem (3.6). Finally, we call a (random) solver A, say SGD, to minimize L̂ and denote the output of A, say u_{φ_A}, as the final solution.
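For illustration only, the following is a schematic PyTorch sketch of the empirical loss L̂ in (3.6) for the Robin problem; the network u, the samplers producing X and Y, and the callables f, w, g are placeholders, and the use of PyTorch with a gradient-based solver A (e.g. SGD or Adam) is an implementation assumption rather than part of the analysis.

import torch

def empirical_drm_loss(u, X, Y, f, w, g, alpha, beta, vol_omega, area_bdry):
    # Monte Carlo version of L-hat in (3.6).
    # X: (N, d) interior samples ~ U(Omega); Y: (M, d) boundary samples ~ U(dOmega).
    X = X.requires_grad_(True)
    uX = u(X).squeeze(-1)                                      # u(X_i), shape (N,)
    # automatic differentiation gives nabla u(X_i) row by row
    grad_u = torch.autograd.grad(uX.sum(), X, create_graph=True)[0]
    interior = 0.5 * (grad_u ** 2).sum(dim=1) + 0.5 * w(X) * uX ** 2 - f(X) * uX
    uY = u(Y).squeeze(-1)                                      # T_0 u(Y_j), shape (M,)
    boundary = 0.5 * alpha * uY ** 2 - g(Y) * uY
    return vol_omega * interior.mean() + (area_bdry / beta) * boundary.mean()

A solver A would then repeatedly draw samples, evaluate this loss on a network u_φ, and update the parameters φ by stochastic gradient steps.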
The following error decomposition enables us to apply different methods to deal with different kinds of error.
Proposition 3.1. Let (A) hold. Assume that P ⊂ H^1(Ω). Let u_R and u_D be the solutions of problem (3.4a) (also (3.4b)) and (3.3a) (also (3.3b)), respectively. Let u_{φ_A} be the solution of problem (3.6) generated by a random solver.

(1) ‖u_{φ_A} − u_R‖_{H^1(Ω)} ≤ C(coe) [E_app + E_sta + E_opt]^{1/2},

where

E_app = (1/β) C(Ω, coe, α) inf_{u∈P} ‖u − u_R‖^2_{H^1(Ω)},
E_sta = sup_{u∈P} [L(u) − L̂(u)] + sup_{u∈P} [L̂(u) − L(u)],
E_opt = L̂(u_{φ_A}) − L̂(u_φ).

(2) Set α = 1, g = 0. Then

‖u_{φ_A} − u_D‖_{H^1(Ω)} ≤ C(coe) [E_app + E_sta + E_opt + ‖u_R − u_D‖^2_{H^1(Ω)}]^{1/2}.
Proof. We only prove (1), since (2) is a direct consequence of (1) and the triangle inequality. For any u ∈ P, set v = u − u_R; then

L(u) = L(u_R + v)
     = (1/2)(∇(u_R + v), ∇(u_R + v))_{L^2(Ω)} + (1/2)(u_R + v, u_R + v)_{L^2(Ω;w)} − ⟨u_R + v, f⟩_{L^2(Ω)}
       + α/(2β)(T_0u_R + T_0v, T_0u_R + T_0v)_{L^2(∂Ω)} − (1/β)⟨T_0u_R + T_0v, g⟩_{L^2(∂Ω)}
     = (1/2)(∇u_R, ∇u_R)_{L^2(Ω)} + (1/2)(u_R, u_R)_{L^2(Ω;w)} − ⟨u_R, f⟩_{L^2(Ω)} + α/(2β)(T_0u_R, T_0u_R)_{L^2(∂Ω)}
       − (1/β)⟨T_0u_R, g⟩_{L^2(∂Ω)} + (1/2)(∇v, ∇v)_{L^2(Ω)} + (1/2)(v, v)_{L^2(Ω;w)} + α/(2β)(T_0v, T_0v)_{L^2(∂Ω)}
       + [(∇u_R, ∇v)_{L^2(Ω)} + (u_R, v)_{L^2(Ω;w)} − ⟨v, f⟩_{L^2(Ω)} + (α/β)(T_0u_R, T_0v)_{L^2(∂Ω)} − (1/β)⟨T_0v, g⟩_{L^2(∂Ω)}]
     = L(u_R) + (1/2)(∇v, ∇v)_{L^2(Ω)} + (1/2)(v, v)_{L^2(Ω;w)} + α/(2β)(T_0v, T_0v)_{L^2(∂Ω)},

where the last equality is due to the fact that u_R is the solution of equation (3.4a). Hence

C(coe)‖v‖^2_{H^1(Ω)} ≤ L(u) − L(u_R) = (1/2)(∇v, ∇v)_{L^2(Ω)} + (1/2)(v, v)_{L^2(Ω;w)} + α/(2β)(T_0v, T_0v)_{L^2(∂Ω)}
                     ≤ (1/β) C(Ω, coe, α)‖v‖^2_{H^1(Ω)},

where we apply the trace inequality

‖T_0v‖_{L^2(∂Ω)} ≤ C(Ω)‖v‖_{H^1(Ω)};

see [1] for more details. In other words, we obtain

C(coe)‖u − u_R‖^2_{H^1(Ω)} ≤ L(u) − L(u_R) ≤ (1/β) C(Ω, coe, α)‖u − u_R‖^2_{H^1(Ω)}. (3.7)

Now, letting u be any element in P, we have

L(u_{φ_A}) − L(u_R)
   = L(u_{φ_A}) − L̂(u_{φ_A}) + L̂(u_{φ_A}) − L̂(u_φ) + L̂(u_φ) − L̂(u) + L̂(u) − L(u) + L(u) − L(u_R)
   ≤ sup_{u∈P}[L(u) − L̂(u)] + [L̂(u_{φ_A}) − L̂(u_φ)] + sup_{u∈P}[L̂(u) − L(u)] + (1/β) C(Ω, coe, α)‖u − u_R‖^2_{H^1(Ω)},

where the last step is due to inequality (3.7) and the fact that L̂(u_φ) − L̂(u) ≤ 0. Since u can be any element in P, we take the infimum over u ∈ P on both sides of the above display:

L(u_{φ_A}) − L(u_R) ≤ inf_{u∈P} (1/β) C(Ω, coe, α)‖u − u_R‖^2_{H^1(Ω)} + sup_{u∈P}[L(u) − L̂(u)]
                      + sup_{u∈P}[L̂(u) − L(u)] + [L̂(u_{φ_A}) − L̂(u_φ)]. (3.8)

Combining (3.7) and (3.8) yields the result.
4. Approximation Error. For neural network approximation in Sobolev spaces, [16] is a comprehensive study covering a variety of activation functions, including ReLU, sigmoidal type functions, etc. The key idea in [16] for establishing upper bounds in Sobolev spaces is to construct an approximate partition of unity.

Denote F_{s,p,d} := {f ∈ W^{s,p}([0,1]^d) : ‖f‖_{W^{s,p}([0,1]^d)} ≤ 1}.

Theorem 4.1 (Proposition 4.8, [16]). Let p ≥ 1, s, k, d ∈ N_+, s ≥ k + 1. Let ρ be the logistic function 1/(1 + e^{−x}) or the tanh function (e^x − e^{−x})/(e^x + e^{−x}). For any ε > 0 and f ∈ F_{s,p,d}, there exists a neural network f_ρ with depth C log(d + s) and C(d, s, p, k) ε^{−d/(s−k−μk)} non-zero weights such that

‖f − f_ρ‖_{W^{k,p}([0,1]^d)} ≤ ε.

Moreover, the weights in the neural network are bounded in absolute value by

C(d, s, p, k) ε^{−2 − (2(d/p + d + k + μk) + d/p + d)/(s − k − μk)},

where μ is an arbitrarily small positive number.
Remark 4.1. The bounds in the theorem can be found in the proof of [16, Proposition 4.8], except that the bound on the depth is not given there explicitly. In their proof, they partition [0,1]^d into small patches, approximate f by a sum of localized polynomials Σ_m φ_m p_m, and approximately implement Σ_m φ_m p_m by a neural network, where the bump functions φ_m form an approximate partition of unity and p_m = Σ_{|α|<s} c_{f,m,α} x^α are averaged Taylor polynomials. As shown in [16], φ_m can be approximated by products of the d-dimensional output of a neural network with a constant number of layers, and the identity map I(x) = x and the product function ×(a, b) = ab can also be approximated by neural networks with a constant number of layers. In order to approximate φ_m x^α, we need to implement d + s − 1 products. Hence, the required depth can be bounded by C log(d + s).
Since the region [0,1]^d is larger than the region Ω we consider (recall that we assumed without loss of generality that Ω ⊂ [0,1]^d at the beginning), we need the following extension result.

Lemma 4.2. Let k ∈ N_+, 1 ≤ p < ∞. There exists a linear operator E from W^{k,p}(Ω) to W^{k,p}_0([0,1]^d) such that Eu = u in Ω.

Proof. See Theorem 7.25 in [13].
From Lemma 3.2 we know that our target function satisfies u_R ∈ H^2(Ω). Hence we are able to obtain an approximation result in the H^1 norm.

Corollary 4.3. Let ρ be the logistic function 1/(1 + e^{−x}) or the tanh function (e^x − e^{−x})/(e^x + e^{−x}). For any ε > 0 and f ∈ H^2(Ω) with ‖f‖_{H^2(Ω)} ≤ 1, there exists a neural network f_ρ with depth C log(d + 1) and C(d) ε^{−d/(1−μ)} non-zero weights such that

‖f − f_ρ‖_{H^1(Ω)} ≤ ε.

Moreover, the weights in the neural network are bounded by C(d) ε^{−(9d+8)/(2−2μ)}, where μ is an arbitrarily small positive number.

Proof. Set k = 1, s = 2, p = 2 in Theorem 4.1 and use the fact that ‖f − f_ρ‖_{H^1(Ω)} ≤ ‖Ef − f_ρ‖_{H^1([0,1]^d)}, where E is the extension operator in Lemma 4.2.
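As a quick sanity check of the exponent in Corollary 4.3 (not part of the original proof), substituting k = 1, s = 2, p = 2 into the weight bound of Theorem 4.1 as stated above gives
\[
2 + \frac{2\left(\tfrac{d}{2} + d + 1 + \mu\right) + \tfrac{d}{2} + d}{1-\mu}
= \frac{2(1-\mu) + 3d + 2 + 2\mu + \tfrac{3d}{2}}{1-\mu}
= \frac{\tfrac{9d}{2} + 4}{1-\mu}
= \frac{9d+8}{2-2\mu},
\]
which matches the exponent (9d+8)/(2−2μ) in the corollary; similarly, d/(s − k − μk) = d/(1 − μ) gives the stated bound on the number of non-zero weights.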
5. Statistical Error. In this section we investigate the statistical error sup_{u∈P} ±[L(u) − L̂(u)].
Lemma 5.1.

E_{{X_i}_{i=1}^N, {Y_j}_{j=1}^M} sup_{u∈P} ±[L(u) − L̂(u)] ≤ Σ_{k=1}^5 E_{{X_i}_{i=1}^N, {Y_j}_{j=1}^M} sup_{u∈P} ±[L_k(u) − L̂_k(u)],

where

L_1(u) = (|Ω|/2) E_{X∼U(Ω)} |∇u(X)|^2,   L_2(u) = (|Ω|/2) E_{X∼U(Ω)} w(X)u^2(X),
L_3(u) = −|Ω| E_{X∼U(Ω)} f(X)u(X),      L_4(u) = (α|∂Ω|/(2β)) E_{Y∼U(∂Ω)} (T_0u)^2(Y),
L_5(u) = −(|∂Ω|/β) E_{Y∼U(∂Ω)} g(Y)T_0u(Y),

and L̂_k(u) is the discrete version of L_k(u); for example,

L̂_1(u) = (|Ω|/(2N)) Σ_{i=1}^N |∇u(X_i)|^2.

Proof. This is a direct result of the triangle inequality.
By the technique of symmetrization, we can bound the difference between the continuous loss L_k and the empirical loss L̂_k by Rademacher complexity.

Definition 5.1. The Rademacher complexity of a set A ⊆ R^N is defined as

R_N(A) = E_{{σ_k}_{k=1}^N} [ sup_{a∈A} (1/N) Σ_{k=1}^N σ_k a_k ],

where {σ_k}_{k=1}^N are N i.i.d. Rademacher variables with P(σ_k = 1) = P(σ_k = −1) = 1/2. The Rademacher complexity of a function class F associated with random samples {X_k}_{k=1}^N is defined as

R_N(F) = E_{{X_k, σ_k}_{k=1}^N} [ sup_{u∈F} (1/N) Σ_{k=1}^N σ_k u(X_k) ].
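For intuition only, Definition 5.1 can be approximated numerically for a small finite function class by averaging the inner supremum over random draws of the samples and of the Rademacher signs; the following NumPy sketch does exactly that, and the function class, the sampler and all names are illustrative assumptions rather than objects from the paper.

import numpy as np

def rademacher_complexity_mc(functions, sample_X, N, n_trials=500, seed=0):
    # Monte Carlo estimate of R_N(F) in Definition 5.1 for a finite class F.
    # functions: list of callables u; sample_X(N) draws N i.i.d. sample points.
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_trials):
        X = sample_X(N)                                        # X_1, ..., X_N
        sigma = rng.choice([-1.0, 1.0], size=N)                # Rademacher signs
        U = np.array([[u(x) for x in X] for u in functions])   # |F| x N values u(X_k)
        vals.append(np.max(U @ sigma) / N)                     # sup over the finite class
    return float(np.mean(vals))

# Example: F = {sin(a x) : a = 1, ..., 5} with X_k uniform on [0, 1].
F = [lambda x, a=a: np.sin(a * x) for a in range(1, 6)]
print(rademacher_complexity_mc(F, lambda N: np.random.uniform(size=N), N=100))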
For the Rademacher complexity, we have the following structural result.

Lemma 5.2. Assume that w : Ω → R and |w(x)| ≤ B for all x ∈ Ω. Then for any function class F, there holds

R_N(w · F) ≤ B R_N(F),

where w · F := {ū : ū(x) = w(x)u(x), u ∈ F}.

Proof.

R_N(w · F) = (1/N) E_{{X_k, σ_k}_{k=1}^N} sup_{u∈F} Σ_{k=1}^N σ_k w(X_k)u(X_k)
 = (1/(2N)) E_{{X_k}_{k=1}^N} E_{{σ_k}_{k=2}^N} sup_{u∈F} [ w(X_1)u(X_1) + Σ_{k=2}^N σ_k w(X_k)u(X_k) ]
   + (1/(2N)) E_{{X_k}_{k=1}^N} E_{{σ_k}_{k=2}^N} sup_{u∈F} [ −w(X_1)u(X_1) + Σ_{k=2}^N σ_k w(X_k)u(X_k) ]
 = (1/(2N)) E_{{X_k}_{k=1}^N} E_{{σ_k}_{k=2}^N} sup_{u,u'∈F} [ w(X_1)[u(X_1) − u'(X_1)] + Σ_{k=2}^N σ_k w(X_k)u(X_k) + Σ_{k=2}^N σ_k w(X_k)u'(X_k) ]
 ≤ (1/(2N)) E_{{X_k}_{k=1}^N} E_{{σ_k}_{k=2}^N} sup_{u,u'∈F} [ B|u(X_1) − u'(X_1)| + Σ_{k=2}^N σ_k w(X_k)u(X_k) + Σ_{k=2}^N σ_k w(X_k)u'(X_k) ]
 = (1/(2N)) E_{{X_k}_{k=1}^N} E_{{σ_k}_{k=2}^N} sup_{u,u'∈F} [ B[u(X_1) − u'(X_1)] + Σ_{k=2}^N σ_k w(X_k)u(X_k) + Σ_{k=2}^N σ_k w(X_k)u'(X_k) ]
 = (1/N) E_{{X_k, σ_k}_{k=1}^N} sup_{u∈F} [ σ_1 B u(X_1) + Σ_{k=2}^N σ_k w(X_k)u(X_k) ]
 ≤ ··· ≤ (B/N) E_{{X_k, σ_k}_{k=1}^N} sup_{u∈F} Σ_{k=1}^N σ_k u(X_k) = B R_N(F).
Now we bound the difference between the continuous loss and the empirical loss in terms of Rademacher complexity.

Lemma 5.3.

E_{{X_i}_{i=1}^N} sup_{u∈P} ±[L_1(u) − L̂_1(u)] ≤ C(Ω, coe) R_N(F_1),
E_{{X_i}_{i=1}^N} sup_{u∈P} ±[L_2(u) − L̂_2(u)] ≤ C(Ω, coe) R_N(F_2),
E_{{X_i}_{i=1}^N} sup_{u∈P} ±[L_3(u) − L̂_3(u)] ≤ C(Ω, coe) R_N(F_3),
E_{{Y_j}_{j=1}^M} sup_{u∈P} ±[L_4(u) − L̂_4(u)] ≤ (α/β) C(Ω, coe) R_M(F_4),
E_{{Y_j}_{j=1}^M} sup_{u∈P} ±[L_5(u) − L̂_5(u)] ≤ (1/β) C(Ω, coe) R_M(F_5),

where

F_1 = {|∇u|^2 : u ∈ P}, F_2 = {u^2 : u ∈ P}, F_3 = {u : u ∈ P}, F_4 = {u^2|_{∂Ω} : u ∈ P}, F_5 = {u|_{∂Ω} : u ∈ P}.
Proof. We only present the proof for L_2, since the other inequalities can be shown similarly. We take {X̃_k}_{k=1}^N as an independent copy of {X_k}_{k=1}^N; then

L_2(u) − L̂_2(u) = (|Ω|/2) [ E_{X∼U(Ω)} w(X)u^2(X) − (1/N) Σ_{k=1}^N w(X_k)u^2(X_k) ]
                = (|Ω|/(2N)) E_{{X̃_k}_{k=1}^N} Σ_{k=1}^N [ w(X̃_k)u^2(X̃_k) − w(X_k)u^2(X_k) ].

Hence

E_{{X_k}_{k=1}^N} sup_{u∈P} |L_2(u) − L̂_2(u)|
 ≤ (|Ω|/(2N)) E_{{X_k, X̃_k}_{k=1}^N} sup_{u∈P} Σ_{k=1}^N [ w(X̃_k)u^2(X̃_k) − w(X_k)u^2(X_k) ]
 = (|Ω|/(2N)) E_{{X_k, X̃_k, σ_k}_{k=1}^N} sup_{u∈P} Σ_{k=1}^N σ_k [ w(X̃_k)u^2(X̃_k) − w(X_k)u^2(X_k) ]
 ≤ (|Ω|/(2N)) E_{{X̃_k, σ_k}_{k=1}^N} sup_{u∈P} Σ_{k=1}^N σ_k w(X̃_k)u^2(X̃_k) + (|Ω|/(2N)) E_{{X_k, σ_k}_{k=1}^N} sup_{u∈P} Σ_{k=1}^N (−σ_k) w(X_k)u^2(X_k)
 = (|Ω|/N) E_{{X_k, σ_k}_{k=1}^N} sup_{u∈P} Σ_{k=1}^N σ_k w(X_k)u^2(X_k)
 = |Ω| R_N(w · F_2) ≤ C(Ω, coe) R_N(F_2),

where the second step is due to the fact that the insertion of Rademacher variables does not change the distribution, the fourth step is because σ_k w(X_k)u^2(X_k) and −σ_k w(X_k)u^2(X_k) have the same distribution, and we use Lemma 5.2 in the last step.
In order to bound the Rademacher complexities, we need the concept of covering number.

Definition 5.4. An ε-cover of a set T in a metric space (S, τ) is a subset T_c ⊂ S such that for each t ∈ T, there exists a t_c ∈ T_c such that τ(t, t_c) ≤ ε. The ε-covering number of T, denoted by C(ε, T, τ), is defined to be the minimum cardinality among all ε-covers of T with respect to the metric τ.

In Euclidean space, we can easily establish an upper bound on the covering number of a bounded set.
Lemma 5.5. Suppose that T ⊂ R^d and ‖t‖_2 ≤ B for all t ∈ T. Then

C(ε, T, ‖·‖_2) ≤ (2B√d/ε)^d.

Proof. Let m = ⌊2B√d/ε⌋ and define

T_c = { −B + ε/√d, −B + 2ε/√d, ..., −B + mε/√d }^d;

then for any t ∈ T, there exists t_c ∈ T_c such that

‖t − t_c‖_2 ≤ ( Σ_{i=1}^d (ε/√d)^2 )^{1/2} = ε.

Hence

C(ε, T, ‖·‖_2) ≤ |T_c| = m^d ≤ (2B√d/ε)^d.
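As a small worked example of Lemma 5.5 (not from the original text): take d = 2, B = 1 and ε = 1/2. The grid T_c then has spacing ε/√d = 1/(2√2) in each coordinate, and the bound gives
\[
C\bigl(\tfrac{1}{2}, T, \|\cdot\|_2\bigr) \le \Bigl(\frac{2 \cdot 1 \cdot \sqrt{2}}{1/2}\Bigr)^{2} = (4\sqrt{2})^{2} = 32.
\]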
A Lipschitz parameterization allows us to translate a cover of the parameter space into a cover of the function space. Such a property plays an essential role in our analysis of the statistical error.

Lemma 5.6. Let F be a parameterized class of functions: F = {f(x; θ) : θ ∈ Θ}. Let ‖·‖_Θ be a norm on Θ and let ‖·‖_F be a norm on F. Suppose that the mapping θ ↦ f(x; θ) is L-Lipschitz, that is,

‖f(x; θ) − f(x; θ̃)‖_F ≤ L‖θ − θ̃‖_Θ;

then for any ε > 0, C(ε, F, ‖·‖_F) ≤ C(ε/L, Θ, ‖·‖_Θ).

Proof. Suppose that C(ε/L, Θ, ‖·‖_Θ) = n and {θ_i}_{i=1}^n is an ε/L-cover of Θ. Then for any θ ∈ Θ, there exists 1 ≤ i ≤ n such that

‖f(x; θ) − f(x; θ_i)‖_F ≤ L‖θ − θ_i‖_Θ ≤ ε.

Hence {f(x; θ_i)}_{i=1}^n is an ε-cover of F, implying that C(ε, F, ‖·‖_F) ≤ n.
To relate Rademacher complexity to covering numbers, we first need Massart's finite class lemma, stated below.

Lemma 5.7. For any finite set A ⊂ R^N with diameter D = sup_{a∈A} ‖a‖_2,

R_N(A) ≤ (D/N) √(2 log|A|).

Proof. See, for example, [35, Lemma 26.8].
Lemma 5.8. Let F be a function class with ‖f‖_∞ ≤ B for any f ∈ F. Then

R_N(F) ≤ inf_{0<δ<B/2} ( 4δ + (12/√N) ∫_δ^{B/2} √(log C(ε, F, ‖·‖_∞)) dε ).

Proof. We apply the chaining method. Set ε_k = 2^{−k+1}B. We denote by F_k an ε_k-cover of F with |F_k| = C(ε_k, F, ‖·‖_∞). Hence for any u ∈ F, there exists u_k ∈ F_k such that ‖u − u_k‖_∞ ≤ ε_k. Let K be a positive integer to be determined later. We have

R_N(F) = E_{{σ_i, X_i}_{i=1}^N} [ sup_{u∈F} (1/N) Σ_{i=1}^N σ_i u(X_i) ]
 = E_{{σ_i, X_i}_{i=1}^N} (1/N) sup_{u∈F} [ Σ_{i=1}^N σ_i (u(X_i) − u_K(X_i)) + Σ_{j=1}^{K−1} Σ_{i=1}^N σ_i (u_{j+1}(X_i) − u_j(X_i)) + Σ_{i=1}^N σ_i u_1(X_i) ]
 ≤ E_{{σ_i, X_i}_{i=1}^N} [ sup_{u∈F} (1/N) Σ_{i=1}^N σ_i (u(X_i) − u_K(X_i)) ] + Σ_{j=1}^{K−1} E_{{σ_i, X_i}_{i=1}^N} [ sup_{u∈F} (1/N) Σ_{i=1}^N σ_i (u_{j+1}(X_i) − u_j(X_i)) ]
   + E_{{σ_i, X_i}_{i=1}^N} [ sup_{u∈F} (1/N) Σ_{i=1}^N σ_i u_1(X_i) ].

We can choose F_1 = {0} to eliminate the third term. For the first term,

E_{{σ_i, X_i}_{i=1}^N} sup_{u∈F} (1/N) Σ_{i=1}^N σ_i (u(X_i) − u_K(X_i)) ≤ E_{{σ_i, X_i}_{i=1}^N} sup_{u∈F} (1/N) Σ_{i=1}^N |σ_i| ‖u − u_K‖_∞ ≤ ε_K.

For the second term, for any fixed samples {X_i}_{i=1}^N, we define