arXiv:2107.14478v2 [math.NA] 5 Sep 2021
ERROR ANALYSIS OF DEEP RITZ METHODS FOR ELLIPTIC EQUATIONS
YULING JIAO∗, YANMING LAI†, YISU LO‡, YANG WANG§, AND YUNFEI YANG¶
Abstract. Using deep neural networks to solve PDEs has attracted a lot of attention recently. However, the theoretical understanding of why deep learning works still lags far behind its empirical success. In this paper, we provide a rigorous numerical analysis of the deep Ritz method (DRM) [43] for second order elliptic equations with Dirichlet, Neumann and Robin boundary conditions. We establish the first nonasymptotic convergence rate in the H^1 norm for DRM using deep networks with smooth activation functions, including the logistic and hyperbolic tangent functions. Our results show how to set the hyper-parameters of depth and width to achieve the desired convergence rate in terms of the number of training samples.

Key words. DRM, neural networks, approximation error, Rademacher complexity, chaining method, pseudo-dimension, covering number.

AMS subject classifications. 65N99
1. Introduction. Partial differential equations (PDEs) are among the fundamental mathematical models for studying a variety of phenomena arising in science and engineering. Many conventional numerical methods have been successfully established for solving PDEs in low dimensions (d ≤ 3), most notably the finite element method [6, 7, 33, 39, 22]. However, one encounters difficulties in both theoretical analysis and numerical implementation when extending conventional numerical schemes to high-dimensional PDEs. The classical analysis of convergence, stability and other properties becomes problematic due to the complex construction of finite element spaces [7, 6]. Moreover, in terms of practical computation, the size of the discrete problem increases exponentially with respect to the dimension.
Motivated by the well-known success of deep learning in high-dimensional data analysis, with applications in discriminative, generative and reinforcement learning [18, 14, 37], solving high-dimensional PDEs with deep neural networks has become a promising approach and has attracted much attention [3, 38, 27, 34, 43, 45, 5, 17]. Roughly speaking, these works can be divided into three categories. The first category uses deep neural networks to improve classical numerical methods; see, for example, [40, 42, 21, 15]. In the second category, neural operators are introduced to learn mappings between infinite-dimensional spaces with neural networks [24, 2, 25]. In the last category, one utilizes deep neural networks to approximate the solutions of PDEs directly, including physics-informed neural networks (PINNs) [34], the deep Ritz method (DRM) [43] and the deep Galerkin method (DGM) [45]. PINNs are based on residual minimization for solving PDEs [3, 38, 27, 34]. Proceed from
∗School of Mathematics and Statistics, and Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan 430072, P.R. China. ([email protected])
†School of Mathematics and Statistics, Wuhan University, Wuhan 430072, P.R. China. ([email protected])
‡Department of Mathematics, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong ([email protected])
§Department of Mathematics, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong ([email protected])
¶Department of Mathematics, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong ([email protected])
We summarize the related works and our results in Table 1.1 below.
Table 1.1: Previous works and our results
Paper      | Depth and activation functions                 | Equation(s)                                | Regularity condition
[28]       | D = 2, ReLU^3                                  | Second order differential equation         | u∗ ∈ Barron class
[26]       | D = 2, Softplus                                | Poisson equation and Schrödinger equation  | u∗ ∈ Barron class
[19]       | D = 2, ReLU^k                                  | 2m-th order differential equation          | u∗ ∈ Barron class
[9]        | D = O(log d), ReLU^2                           | Second order elliptic equation             | u∗ ∈ C^2
This paper | D = O(log d), logistic and hyperbolic tangent  | Second order elliptic equation             | u∗ ∈ H^2
The rest of the paper is organized as follows. In Section 2, we give some preliminaries. In Section 3, we present the deep Ritz method and the error decomposition used in our analysis. In Sections 4 and 5, we give a detailed analysis of the approximation error and the statistical error, respectively. In Section 6, we present our main results. We conclude with a short discussion in Section 7.
2. Neural Network. Due to its strong expressivity, the neural network function class plays an important role in machine learning. A variety of neural networks are chosen as parameterized function classes in the training process. We now introduce some notation related to neural networks which will simplify our later discussion. Let D ∈ N_+. A function f : R^d → R^{n_D} implemented by a neural network is defined by

f_0(x) = x,
f_ℓ(x) = ρ(A_ℓ f_{ℓ−1} + b_ℓ) for ℓ = 1, ..., D − 1,
f := f_D(x) = A_D f_{D−1} + b_D,   (2.1)

where A_ℓ = (a^{(ℓ)}_{ij}) ∈ R^{n_ℓ × n_{ℓ−1}} and b_ℓ = (b^{(ℓ)}_i) ∈ R^{n_ℓ}. ρ is called the activation function and acts componentwise. D is called the depth of the network and W := max{n_ℓ : ℓ = 1, ..., D} is called the width of the network. φ = {A_ℓ, b_ℓ}_ℓ are called the weight parameters. For convenience, we denote by n_i, i = 1, ..., D, the number of nonzero weights in the first i layers of the representation (2.1). Clearly n_D is the total number of nonzero weights. Sometimes we denote a function implemented by a neural network by f_ρ for short. We use the notation N_ρ(D, n_D, B_θ) to refer to the collection of functions implemented by a ρ-neural network with depth D, total number of nonzero weights n_D and each weight bounded by B_θ.
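To make the definition concrete, the following is a minimal NumPy sketch of a function in N_ρ(D, n_D, B_θ) with ρ = tanh, evaluated exactly as in (2.1); the function names, the random initialization and the clipping of weights to [−B_θ, B_θ] are illustrative choices, not part of the method analyzed in this paper.

import numpy as np

def init_network(widths, B_theta=1.0, seed=0):
    # widths = [n_0, n_1, ..., n_D]; n_0 = d is the input dimension.
    # Each weight entry is clipped to [-B_theta, B_theta] to mimic the bound on the weights.
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        A = np.clip(rng.standard_normal((n_out, n_in)), -B_theta, B_theta)
        b = np.clip(rng.standard_normal(n_out), -B_theta, B_theta)
        params.append((A, b))
    return params

def forward(params, x, rho=np.tanh):
    # Evaluate f(x) as in (2.1): rho is applied in every layer except the last.
    f = x
    for A, b in params[:-1]:
        f = rho(A @ f + b)
    A_D, b_D = params[-1]
    return A_D @ f + b_D

# Example: a network of depth D = 3 and width W = 16 mapping R^2 to R.
params = init_network([2, 16, 16, 1])
print(forward(params, np.array([0.3, 0.7])))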
3. Deep Ritz Method and Error Decomposition. Let Ω be a convex bounded open set in R^d and assume that ∂Ω ∈ C^∞. Without loss of generality we assume that Ω ⊂ [0, 1]^d. We consider the following second order elliptic equation:

−Δu + wu = f in Ω (3.1)

with three kinds of boundary conditions:

u = 0 on ∂Ω, (3.2a)
∂u/∂n = g on ∂Ω, (3.2b)
αu + β ∂u/∂n = g on ∂Ω, α, β ∈ R, β ≠ 0, (3.2c)

which are called the Dirichlet, Neumann and Robin boundary conditions, respectively. Note that for the Dirichlet problem we only consider the homogeneous boundary condition here, since the inhomogeneous case can be turned into the homogeneous one by translation. We also remark that the Neumann condition (3.2b) is covered by the Robin condition (3.2c). Hence in the following we only consider the Dirichlet problem and the Robin problem.
We make the following assumption on the known terms in the equation:

(A) f ∈ L^∞(Ω), g ∈ H^{1/2}(∂Ω), w ∈ L^∞(Ω), w ≥ c_w,

where c_w is some positive constant. In the following we abbreviate C(‖f‖_{L^∞(Ω)}, ‖g‖_{H^{1/2}(∂Ω)}, ‖w‖_{L^∞(Ω)}, c_w), a constant depending on the known terms in the equation, as C(coe) for simplicity.
For problem (3.1), (3.2a), the variational problem is to find u ∈ H^1_0(Ω) such that

(∇u, ∇v) + (wu, v) = (f, v), ∀v ∈ H^1_0(Ω). (3.3a)

The corresponding minimization problem is

min_{u ∈ H^1_0(Ω)} (1/2) ∫_Ω (|∇u|^2 + wu^2 − 2fu) dx. (3.3b)

Lemma 3.1. Let (A) hold. Let u_D be the solution of problem (3.3a) (also (3.3b)). Then u_D ∈ H^2(Ω).

Proof. See [12].
For problem (3.1), (3.2c), the variational problem is to find u ∈ H^1(Ω) such that

(∇u, ∇v) + (wu, v) + (α/β)(T_0u, T_0v)_{∂Ω} = (f, v) + (1/β)(g, T_0v)_{∂Ω}, ∀v ∈ H^1(Ω), (3.4a)

where T_0 is the zero order trace operator. The corresponding minimization problem is

min_{u ∈ H^1(Ω)} ∫_Ω ((1/2)|∇u|^2 + (1/2)wu^2 − fu) dx + (1/β) ∫_{∂Ω} ((α/2)(T_0u)^2 − g T_0u) ds. (3.4b)

Lemma 3.2. Let (A) hold. Let u_R be the solution of problem (3.4a) (also (3.4b)). Then u_R ∈ H^2(Ω) and ‖u_R‖_{H^2(Ω)} ≤ C(coe)/β for any β > 0.

Proof. See [12].
Intuitively, when α = 1, g = 0 and β → 0, we expect that the solution of the Robin problem converges to the solution of the Dirichlet problem. Hence we only need to consider the Robin problem, since the Dirichlet problem can be handled through a limit process. Define L as a functional on H^1(Ω):

L(u) := ∫_Ω ((1/2)|∇u|^2 + (1/2)wu^2 − fu) dx + (1/β) ∫_{∂Ω} ((α/2)(T_0u)^2 − g T_0u) ds.

The next lemma verifies this assertion.

Lemma 3.3. Let (A) hold. Let α = 1, g = 0. Let u_D be the solution of problem (3.3a) (also (3.3b)) and u_R the solution of problem (3.4a) (also (3.4b)). There holds

‖u_R − u_D‖_{H^1(Ω)} ≤ C(coe)β.
Proof. We first have

∫_Ω ∇u_D · ∇v dx − ∫_{∂Ω} T_1u_D v ds + ∫_Ω w u_D v dx = ∫_Ω f v dx, ∀v ∈ H^1(Ω),

with T_1 being the first order trace operator. Hence for any u ∈ H^1(Ω),

L(u) = ∫_Ω ((1/2)|∇u|^2 + (1/2)wu^2 − fu) dx + 1/(2β) ∫_{∂Ω} (T_0u)^2 ds
     = ∫_Ω ((1/2)|∇u|^2 + (1/2)wu^2) dx + 1/(2β) ∫_{∂Ω} (T_0u)^2 ds − ∫_Ω ∇u_D · ∇u dx + ∫_{∂Ω} T_1u_D u ds − ∫_Ω w u_D u dx
     = ∫_Ω ((1/2)|∇u − ∇u_D|^2 + (1/2)w(u − u_D)^2) dx + 1/(2β) ∫_{∂Ω} (T_0u + βT_1u_D)^2 ds
       − ∫_Ω ((1/2)|∇u_D|^2 + (1/2)wu_D^2) dx − (β/2) ∫_{∂Ω} (T_1u_D)^2 ds. (3.5)

Define

R_β(u) = ∫_Ω ((1/2)|∇u − ∇u_D|^2 + (1/2)w(u − u_D)^2) dx + 1/(2β) ∫_{∂Ω} (T_0u + βT_1u_D)^2 ds.

Since u_R is the minimizer of L and, by (3.5), L and R_β differ only by terms independent of u, we conclude that u_R is also the minimizer of R_β.

Note that u_D ∈ H^2(Ω) (Lemma 3.1); by the trace theorem we know that T_1u_D ∈ H^{1/2}(∂Ω), and hence there exists φ ∈ H^1(Ω) such that T_0φ = −T_1u_D. Set u = βφ + u_D; then

C(coe)‖u_R − u_D‖^2_{H^1(Ω)} ≤ R_β(u_R) ≤ R_β(u) = β^2 ∫_Ω ((1/2)|∇φ|^2 + (1/2)wφ^2) dx = C(u_D, coe)β^2.
Note that L can be equivalently written as

L(u) = |Ω| E_{X∼U(Ω)} [(1/2)|∇u(X)|^2 + (1/2)w(X)u^2(X) − f(X)u(X)]
       + (|∂Ω|/β) E_{Y∼U(∂Ω)} [(α/2)(T_0u)^2(Y) − g(Y)T_0u(Y)],

where U(Ω) and U(∂Ω) are the uniform distributions on Ω and ∂Ω, respectively. We then introduce a discrete version of L:

L̂(u) := (|Ω|/N) Σ_{i=1}^N [(1/2)|∇u(X_i)|^2 + (1/2)w(X_i)u^2(X_i) − f(X_i)u(X_i)]
        + (|∂Ω|/(βM)) Σ_{j=1}^M [(α/2)(T_0u)^2(Y_j) − g(Y_j)T_0u(Y_j)],

where {X_i}_{i=1}^N and {Y_j}_{j=1}^M are i.i.d. random variables distributed according to U(Ω) and U(∂Ω), respectively. We now consider a minimization problem with respect to L̂:

min_{u_φ ∈ P} L̂(u_φ), (3.6)

where P refers to the parameterized function class. We denote by u_φ the solution of problem (3.6). Finally, we call a (random) solver A, say SGD, to minimize L̂ and denote the output of A, say u_{φ_A}, as the final solution.
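For illustration only, the following is a schematic PyTorch sketch of the empirical loss L̂ in (3.6) for the Robin problem; the network u, the samplers producing X and Y, and the callables f, w, g are placeholders, and the use of PyTorch with a gradient-based solver A (e.g. SGD or Adam) is an implementation assumption rather than part of the analysis.

import torch

def empirical_drm_loss(u, X, Y, f, w, g, alpha, beta, vol_omega, area_bdry):
    # Monte Carlo version of L-hat in (3.6).
    # X: (N, d) interior samples ~ U(Omega); Y: (M, d) boundary samples ~ U(dOmega).
    X = X.requires_grad_(True)
    uX = u(X).squeeze(-1)                                      # u(X_i), shape (N,)
    # automatic differentiation gives nabla u(X_i) row by row
    grad_u = torch.autograd.grad(uX.sum(), X, create_graph=True)[0]
    interior = 0.5 * (grad_u ** 2).sum(dim=1) + 0.5 * w(X) * uX ** 2 - f(X) * uX
    uY = u(Y).squeeze(-1)                                      # T_0 u(Y_j), shape (M,)
    boundary = 0.5 * alpha * uY ** 2 - g(Y) * uY
    return vol_omega * interior.mean() + (area_bdry / beta) * boundary.mean()

A solver A would then repeatedly draw samples, evaluate this loss on a network u_φ, and update the parameters φ by stochastic gradient steps.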
The following error decomposition enables us to apply different methods to deal with different kinds of error.
Proposition 3.1. Let (A) hold. Assume that P ⊂ H^1(Ω). Let u_R and u_D be the solutions of problem (3.4a) (also (3.4b)) and (3.3a) (also (3.3b)), respectively. Let u_{φ_A} be the solution of problem (3.6) generated by a random solver.

(1) ‖u_{φ_A} − u_R‖_{H^1(Ω)} ≤ C(coe) [E_app + E_sta + E_opt]^{1/2},

where

E_app = (1/β) C(Ω, coe, α) inf_{u∈P} ‖u − u_R‖^2_{H^1(Ω)},
E_sta = sup_{u∈P} [L(u) − L̂(u)] + sup_{u∈P} [L̂(u) − L(u)],
E_opt = L̂(u_{φ_A}) − L̂(u_φ).

(2) Set α = 1, g = 0. Then

‖u_{φ_A} − u_D‖_{H^1(Ω)} ≤ C(coe) [E_app + E_sta + E_opt + ‖u_R − u_D‖^2_{H^1(Ω)}]^{1/2}.
Proof. We only prove (1), since (2) is a direct consequence of (1) and the triangle inequality. For any u ∈ P, set v = u − u_R; then

L(u) = L(u_R + v)
     = (1/2)(∇(u_R + v), ∇(u_R + v))_{L^2(Ω)} + (1/2)(u_R + v, u_R + v)_{L^2(Ω;w)} − ⟨u_R + v, f⟩_{L^2(Ω)}
       + α/(2β)(T_0u_R + T_0v, T_0u_R + T_0v)_{L^2(∂Ω)} − (1/β)⟨T_0u_R + T_0v, g⟩_{L^2(∂Ω)}
     = (1/2)(∇u_R, ∇u_R)_{L^2(Ω)} + (1/2)(u_R, u_R)_{L^2(Ω;w)} − ⟨u_R, f⟩_{L^2(Ω)} + α/(2β)(T_0u_R, T_0u_R)_{L^2(∂Ω)}
       − (1/β)⟨T_0u_R, g⟩_{L^2(∂Ω)} + (1/2)(∇v, ∇v)_{L^2(Ω)} + (1/2)(v, v)_{L^2(Ω;w)} + α/(2β)(T_0v, T_0v)_{L^2(∂Ω)}
       + [(∇u_R, ∇v)_{L^2(Ω)} + (u_R, v)_{L^2(Ω;w)} − ⟨v, f⟩_{L^2(Ω)} + (α/β)(T_0u_R, T_0v)_{L^2(∂Ω)} − (1/β)⟨T_0v, g⟩_{L^2(∂Ω)}]
     = L(u_R) + (1/2)(∇v, ∇v)_{L^2(Ω)} + (1/2)(v, v)_{L^2(Ω;w)} + α/(2β)(T_0v, T_0v)_{L^2(∂Ω)},

where the last equality is due to the fact that u_R is the solution of equation (3.4a). Hence

C(coe)‖v‖^2_{H^1(Ω)} ≤ L(u) − L(u_R) = (1/2)(∇v, ∇v)_{L^2(Ω)} + (1/2)(v, v)_{L^2(Ω;w)} + α/(2β)(T_0v, T_0v)_{L^2(∂Ω)}
                     ≤ (1/β) C(Ω, coe, α)‖v‖^2_{H^1(Ω)},

where we apply the trace inequality

‖T_0v‖_{L^2(∂Ω)} ≤ C(Ω)‖v‖_{H^1(Ω)};

see [1] for more details. In other words, we obtain

C(coe)‖u − u_R‖^2_{H^1(Ω)} ≤ L(u) − L(u_R) ≤ (1/β) C(Ω, coe, α)‖u − u_R‖^2_{H^1(Ω)}. (3.7)

Now, letting u be any element in P, we have

L(u_{φ_A}) − L(u_R)
   = L(u_{φ_A}) − L̂(u_{φ_A}) + L̂(u_{φ_A}) − L̂(u_φ) + L̂(u_φ) − L̂(u) + L̂(u) − L(u) + L(u) − L(u_R)
   ≤ sup_{u∈P}[L(u) − L̂(u)] + [L̂(u_{φ_A}) − L̂(u_φ)] + sup_{u∈P}[L̂(u) − L(u)] + (1/β) C(Ω, coe, α)‖u − u_R‖^2_{H^1(Ω)},

where the last step is due to inequality (3.7) and the fact that L̂(u_φ) − L̂(u) ≤ 0. Since u can be any element in P, we take the infimum over u ∈ P on both sides of the above display:

L(u_{φ_A}) − L(u_R) ≤ inf_{u∈P} (1/β) C(Ω, coe, α)‖u − u_R‖^2_{H^1(Ω)} + sup_{u∈P}[L(u) − L̂(u)]
                      + sup_{u∈P}[L̂(u) − L(u)] + [L̂(u_{φ_A}) − L̂(u_φ)]. (3.8)

Combining (3.7) and (3.8) yields the result.
4. Approximation Error. For neural network approximation in Sobolev spaces, [16] is a comprehensive study covering a variety of activation functions, including ReLU, sigmoidal type functions, etc. The key idea in [16] for establishing upper bounds in Sobolev spaces is to construct an approximate partition of unity.

Denote F_{s,p,d} := {f ∈ W^{s,p}([0,1]^d) : ‖f‖_{W^{s,p}([0,1]^d)} ≤ 1}.

Theorem 4.1 (Proposition 4.8, [16]). Let p ≥ 1, s, k, d ∈ N_+, s ≥ k + 1. Let ρ be the logistic function 1/(1 + e^{−x}) or the tanh function (e^x − e^{−x})/(e^x + e^{−x}). For any ε > 0 and f ∈ F_{s,p,d}, there exists a neural network f_ρ with depth C log(d + s) and C(d, s, p, k) ε^{−d/(s−k−μk)} non-zero weights such that

‖f − f_ρ‖_{W^{k,p}([0,1]^d)} ≤ ε.

Moreover, the weights in the neural network are bounded in absolute value by

C(d, s, p, k) ε^{−2 − (2(d/p + d + k + μk) + d/p + d)/(s − k − μk)},

where μ is an arbitrarily small positive number.
Remark 4.1. The bounds in the theorem can be found in the proof of [16, Proposition 4.8], except that the bound on the depth is not given there explicitly. In their proof, they partition [0,1]^d into small patches, approximate f by a sum of localized polynomials Σ_m φ_m p_m, and approximately implement Σ_m φ_m p_m by a neural network, where the bump functions φ_m form an approximate partition of unity and p_m = Σ_{|α|<s} c_{f,m,α} x^α are averaged Taylor polynomials. As shown in [16], φ_m can be approximated by products of the d-dimensional output of a neural network with a constant number of layers, and the identity map I(x) = x and the product function ×(a, b) = ab can also be approximated by neural networks with a constant number of layers. In order to approximate φ_m x^α, we need to implement d + s − 1 products. Hence, the required depth can be bounded by C log(d + s).
Since the region [0,1]^d is larger than the region Ω we consider (recall that we assumed without loss of generality that Ω ⊂ [0,1]^d at the beginning), we need the following extension result.

Lemma 4.2. Let k ∈ N_+, 1 ≤ p < ∞. There exists a linear operator E from W^{k,p}(Ω) to W^{k,p}_0([0,1]^d) such that Eu = u in Ω.

Proof. See Theorem 7.25 in [13].
From Lemma 3.2 we know that our target function satisfies u_R ∈ H^2(Ω). Hence we are able to obtain an approximation result in the H^1 norm.

Corollary 4.3. Let ρ be the logistic function 1/(1 + e^{−x}) or the tanh function (e^x − e^{−x})/(e^x + e^{−x}). For any ε > 0 and f ∈ H^2(Ω) with ‖f‖_{H^2(Ω)} ≤ 1, there exists a neural network f_ρ with depth C log(d + 1) and C(d) ε^{−d/(1−μ)} non-zero weights such that

‖f − f_ρ‖_{H^1(Ω)} ≤ ε.

Moreover, the weights in the neural network are bounded by C(d) ε^{−(9d+8)/(2−2μ)}, where μ is an arbitrarily small positive number.

Proof. Set k = 1, s = 2, p = 2 in Theorem 4.1 and use the fact that ‖f − f_ρ‖_{H^1(Ω)} ≤ ‖Ef − f_ρ‖_{H^1([0,1]^d)}, where E is the extension operator in Lemma 4.2.
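As a quick sanity check of the exponent in Corollary 4.3 (not part of the original proof), substituting k = 1, s = 2, p = 2 into the weight bound of Theorem 4.1 as stated above gives
\[
2 + \frac{2\left(\tfrac{d}{2} + d + 1 + \mu\right) + \tfrac{d}{2} + d}{1-\mu}
= \frac{2(1-\mu) + 3d + 2 + 2\mu + \tfrac{3d}{2}}{1-\mu}
= \frac{\tfrac{9d}{2} + 4}{1-\mu}
= \frac{9d+8}{2-2\mu},
\]
which matches the exponent (9d+8)/(2−2μ) in the corollary; similarly, d/(s − k − μk) = d/(1 − μ) gives the stated bound on the number of non-zero weights.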
5. Statistical Error. In this section we investigate the statistical error sup_{u∈P} ±[L(u) − L̂(u)].
Lemma 5.1.

E_{{X_i}_{i=1}^N, {Y_j}_{j=1}^M} sup_{u∈P} ±[L(u) − L̂(u)] ≤ Σ_{k=1}^5 E_{{X_i}_{i=1}^N, {Y_j}_{j=1}^M} sup_{u∈P} ±[L_k(u) − L̂_k(u)],

where

L_1(u) = (|Ω|/2) E_{X∼U(Ω)} |∇u(X)|^2,   L_2(u) = (|Ω|/2) E_{X∼U(Ω)} w(X)u^2(X),
L_3(u) = −|Ω| E_{X∼U(Ω)} f(X)u(X),      L_4(u) = (α|∂Ω|/(2β)) E_{Y∼U(∂Ω)} (T_0u)^2(Y),
L_5(u) = −(|∂Ω|/β) E_{Y∼U(∂Ω)} g(Y)T_0u(Y),

and L̂_k(u) is the discrete version of L_k(u); for example,

L̂_1(u) = (|Ω|/(2N)) Σ_{i=1}^N |∇u(X_i)|^2.

Proof. This is a direct result of the triangle inequality.
By the technique of symmetrization, we can bound the difference between the continuous loss L_k and the empirical loss L̂_k by Rademacher complexity.

Definition 5.1. The Rademacher complexity of a set A ⊆ R^N is defined as

R_N(A) = E_{{σ_k}_{k=1}^N} [ sup_{a∈A} (1/N) Σ_{k=1}^N σ_k a_k ],

where {σ_k}_{k=1}^N are N i.i.d. Rademacher variables with P(σ_k = 1) = P(σ_k = −1) = 1/2. The Rademacher complexity of a function class F associated with random samples {X_k}_{k=1}^N is defined as

R_N(F) = E_{{X_k, σ_k}_{k=1}^N} [ sup_{u∈F} (1/N) Σ_{k=1}^N σ_k u(X_k) ].
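For intuition only, Definition 5.1 can be approximated numerically for a small finite function class by averaging the inner supremum over random draws of the samples and of the Rademacher signs; the following NumPy sketch does exactly that, and the function class, the sampler and all names are illustrative assumptions rather than objects from the paper.

import numpy as np

def rademacher_complexity_mc(functions, sample_X, N, n_trials=500, seed=0):
    # Monte Carlo estimate of R_N(F) in Definition 5.1 for a finite class F.
    # functions: list of callables u; sample_X(N) draws N i.i.d. sample points.
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_trials):
        X = sample_X(N)                                        # X_1, ..., X_N
        sigma = rng.choice([-1.0, 1.0], size=N)                # Rademacher signs
        U = np.array([[u(x) for x in X] for u in functions])   # |F| x N values u(X_k)
        vals.append(np.max(U @ sigma) / N)                     # sup over the finite class
    return float(np.mean(vals))

# Example: F = {sin(a x) : a = 1, ..., 5} with X_k uniform on [0, 1].
F = [lambda x, a=a: np.sin(a * x) for a in range(1, 6)]
print(rademacher_complexity_mc(F, lambda N: np.random.uniform(size=N), N=100))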
For the Rademacher complexity, we have the following structural result.

Lemma 5.2. Assume that w : Ω → R and |w(x)| ≤ B for all x ∈ Ω. Then for any function class F, there holds

R_N(w · F) ≤ B R_N(F),

where w · F := {ū : ū(x) = w(x)u(x), u ∈ F}.

Proof.

R_N(w · F) = (1/N) E_{{X_k, σ_k}_{k=1}^N} sup_{u∈F} Σ_{k=1}^N σ_k w(X_k)u(X_k)
 = (1/(2N)) E_{{X_k}_{k=1}^N} E_{{σ_k}_{k=2}^N} sup_{u∈F} [ w(X_1)u(X_1) + Σ_{k=2}^N σ_k w(X_k)u(X_k) ]
   + (1/(2N)) E_{{X_k}_{k=1}^N} E_{{σ_k}_{k=2}^N} sup_{u∈F} [ −w(X_1)u(X_1) + Σ_{k=2}^N σ_k w(X_k)u(X_k) ]
 = (1/(2N)) E_{{X_k}_{k=1}^N} E_{{σ_k}_{k=2}^N} sup_{u,u'∈F} [ w(X_1)[u(X_1) − u'(X_1)] + Σ_{k=2}^N σ_k w(X_k)u(X_k) + Σ_{k=2}^N σ_k w(X_k)u'(X_k) ]
 ≤ (1/(2N)) E_{{X_k}_{k=1}^N} E_{{σ_k}_{k=2}^N} sup_{u,u'∈F} [ B|u(X_1) − u'(X_1)| + Σ_{k=2}^N σ_k w(X_k)u(X_k) + Σ_{k=2}^N σ_k w(X_k)u'(X_k) ]
 = (1/(2N)) E_{{X_k}_{k=1}^N} E_{{σ_k}_{k=2}^N} sup_{u,u'∈F} [ B[u(X_1) − u'(X_1)] + Σ_{k=2}^N σ_k w(X_k)u(X_k) + Σ_{k=2}^N σ_k w(X_k)u'(X_k) ]
 = (1/N) E_{{X_k, σ_k}_{k=1}^N} sup_{u∈F} [ σ_1 B u(X_1) + Σ_{k=2}^N σ_k w(X_k)u(X_k) ]
 ≤ ··· ≤ (B/N) E_{{X_k, σ_k}_{k=1}^N} sup_{u∈F} Σ_{k=1}^N σ_k u(X_k) = B R_N(F).
Now we bound the difference between the continuous loss and the empirical loss in terms of Rademacher complexity.

Lemma 5.3.

E_{{X_i}_{i=1}^N} sup_{u∈P} ±[L_1(u) − L̂_1(u)] ≤ C(Ω, coe) R_N(F_1),
E_{{X_i}_{i=1}^N} sup_{u∈P} ±[L_2(u) − L̂_2(u)] ≤ C(Ω, coe) R_N(F_2),
E_{{X_i}_{i=1}^N} sup_{u∈P} ±[L_3(u) − L̂_3(u)] ≤ C(Ω, coe) R_N(F_3),
E_{{Y_j}_{j=1}^M} sup_{u∈P} ±[L_4(u) − L̂_4(u)] ≤ (α/β) C(Ω, coe) R_M(F_4),
E_{{Y_j}_{j=1}^M} sup_{u∈P} ±[L_5(u) − L̂_5(u)] ≤ (1/β) C(Ω, coe) R_M(F_5),

where

F_1 = {|∇u|^2 : u ∈ P}, F_2 = {u^2 : u ∈ P}, F_3 = {u : u ∈ P}, F_4 = {u^2|_{∂Ω} : u ∈ P}, F_5 = {u|_{∂Ω} : u ∈ P}.
Proof. We only present the proof for L_2, since the other inequalities can be shown similarly. We take {X̃_k}_{k=1}^N as an independent copy of {X_k}_{k=1}^N; then

L_2(u) − L̂_2(u) = (|Ω|/2) [ E_{X∼U(Ω)} w(X)u^2(X) − (1/N) Σ_{k=1}^N w(X_k)u^2(X_k) ]
                = (|Ω|/(2N)) E_{{X̃_k}_{k=1}^N} Σ_{k=1}^N [ w(X̃_k)u^2(X̃_k) − w(X_k)u^2(X_k) ].

Hence

E_{{X_k}_{k=1}^N} sup_{u∈P} |L_2(u) − L̂_2(u)|
 ≤ (|Ω|/(2N)) E_{{X_k, X̃_k}_{k=1}^N} sup_{u∈P} Σ_{k=1}^N [ w(X̃_k)u^2(X̃_k) − w(X_k)u^2(X_k) ]
 = (|Ω|/(2N)) E_{{X_k, X̃_k, σ_k}_{k=1}^N} sup_{u∈P} Σ_{k=1}^N σ_k [ w(X̃_k)u^2(X̃_k) − w(X_k)u^2(X_k) ]
 ≤ (|Ω|/(2N)) E_{{X̃_k, σ_k}_{k=1}^N} sup_{u∈P} Σ_{k=1}^N σ_k w(X̃_k)u^2(X̃_k) + (|Ω|/(2N)) E_{{X_k, σ_k}_{k=1}^N} sup_{u∈P} Σ_{k=1}^N (−σ_k) w(X_k)u^2(X_k)
 = (|Ω|/N) E_{{X_k, σ_k}_{k=1}^N} sup_{u∈P} Σ_{k=1}^N σ_k w(X_k)u^2(X_k)
 = |Ω| R_N(w · F_2) ≤ C(Ω, coe) R_N(F_2),

where the second step is due to the fact that the insertion of Rademacher variables does not change the distribution, the fourth step is because σ_k w(X_k)u^2(X_k) and −σ_k w(X_k)u^2(X_k) have the same distribution, and we use Lemma 5.2 in the last step.
In order to bound the Rademacher complexities, we need the concept of covering number.

Definition 5.4. An ε-cover of a set T in a metric space (S, τ) is a subset T_c ⊂ S such that for each t ∈ T, there exists a t_c ∈ T_c such that τ(t, t_c) ≤ ε. The ε-covering number of T, denoted by C(ε, T, τ), is defined to be the minimum cardinality among all ε-covers of T with respect to the metric τ.

In Euclidean space, we can easily establish an upper bound on the covering number of a bounded set.
Lemma 5.5. Suppose that T ⊂ R^d and ‖t‖_2 ≤ B for all t ∈ T. Then

C(ε, T, ‖·‖_2) ≤ (2B√d/ε)^d.

Proof. Let m = ⌊2B√d/ε⌋ and define

T_c = { −B + ε/√d, −B + 2ε/√d, ..., −B + mε/√d }^d;

then for any t ∈ T, there exists t_c ∈ T_c such that

‖t − t_c‖_2 ≤ ( Σ_{i=1}^d (ε/√d)^2 )^{1/2} = ε.

Hence

C(ε, T, ‖·‖_2) ≤ |T_c| = m^d ≤ (2B√d/ε)^d.
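As a small worked example of Lemma 5.5 (not from the original text): take d = 2, B = 1 and ε = 1/2. The grid T_c then has spacing ε/√d = 1/(2√2) in each coordinate, and the bound gives
\[
C\bigl(\tfrac{1}{2}, T, \|\cdot\|_2\bigr) \le \Bigl(\frac{2 \cdot 1 \cdot \sqrt{2}}{1/2}\Bigr)^{2} = (4\sqrt{2})^{2} = 32.
\]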
A Lipschitz parameterization allows us to translate a cover of the parameter space into a cover of the function space. Such a property plays an essential role in our analysis of the statistical error.

Lemma 5.6. Let F be a parameterized class of functions: F = {f(x; θ) : θ ∈ Θ}. Let ‖·‖_Θ be a norm on Θ and let ‖·‖_F be a norm on F. Suppose that the mapping θ ↦ f(x; θ) is L-Lipschitz, that is,

‖f(x; θ) − f(x; θ̃)‖_F ≤ L‖θ − θ̃‖_Θ;

then for any ε > 0, C(ε, F, ‖·‖_F) ≤ C(ε/L, Θ, ‖·‖_Θ).

Proof. Suppose that C(ε/L, Θ, ‖·‖_Θ) = n and {θ_i}_{i=1}^n is an ε/L-cover of Θ. Then for any θ ∈ Θ, there exists 1 ≤ i ≤ n such that

‖f(x; θ) − f(x; θ_i)‖_F ≤ L‖θ − θ_i‖_Θ ≤ ε.

Hence {f(x; θ_i)}_{i=1}^n is an ε-cover of F, implying that C(ε, F, ‖·‖_F) ≤ n.
To relate Rademacher complexity to covering numbers, we first need Massart's finite class lemma, stated below.

Lemma 5.7. For any finite set A ⊂ R^N with diameter D = sup_{a∈A} ‖a‖_2,

R_N(A) ≤ (D/N) √(2 log|A|).

Proof. See, for example, [35, Lemma 26.8].
Lemma 5.8. Let F be a function class with ‖f‖_∞ ≤ B for any f ∈ F. Then

R_N(F) ≤ inf_{0<δ<B/2} ( 4δ + (12/√N) ∫_δ^{B/2} √(log C(ε, F, ‖·‖_∞)) dε ).

Proof. We apply the chaining method. Set ε_k = 2^{−k+1}B. We denote by F_k an ε_k-cover of F with |F_k| = C(ε_k, F, ‖·‖_∞). Hence for any u ∈ F, there exists u_k ∈ F_k such that ‖u − u_k‖_∞ ≤ ε_k. Let K be a positive integer to be determined later. We have

R_N(F) = E_{{σ_i, X_i}_{i=1}^N} [ sup_{u∈F} (1/N) Σ_{i=1}^N σ_i u(X_i) ]
 = E_{{σ_i, X_i}_{i=1}^N} (1/N) sup_{u∈F} [ Σ_{i=1}^N σ_i (u(X_i) − u_K(X_i)) + Σ_{j=1}^{K−1} Σ_{i=1}^N σ_i (u_{j+1}(X_i) − u_j(X_i)) + Σ_{i=1}^N σ_i u_1(X_i) ]
 ≤ E_{{σ_i, X_i}_{i=1}^N} [ sup_{u∈F} (1/N) Σ_{i=1}^N σ_i (u(X_i) − u_K(X_i)) ] + Σ_{j=1}^{K−1} E_{{σ_i, X_i}_{i=1}^N} [ sup_{u∈F} (1/N) Σ_{i=1}^N σ_i (u_{j+1}(X_i) − u_j(X_i)) ]
   + E_{{σ_i, X_i}_{i=1}^N} [ sup_{u∈F} (1/N) Σ_{i=1}^N σ_i u_1(X_i) ].

We can choose F_1 = {0} to eliminate the third term. For the first term,

E_{{σ_i, X_i}_{i=1}^N} sup_{u∈F} (1/N) Σ_{i=1}^N σ_i (u(X_i) − u_K(X_i)) ≤ E_{{σ_i, X_i}_{i=1}^N} sup_{u∈F} (1/N) Σ_{i=1}^N |σ_i| ‖u − u_K‖_∞ ≤ ε_K.

For the second term, for any fixed samples {X_i}_{i=1}^N, we define