arXiv:2101.01708v2 [math.NA] 22 Mar 2021

A PRIORI GENERALIZATION ANALYSIS OF THE DEEP RITZ METHOD FOR SOLVING HIGH DIMENSIONAL ELLIPTIC EQUATIONS

JIANFENG LU, YULONG LU, AND MIN WANG

Abstract. This paper concerns the a priori generalization analysis of the Deep Ritz Method (DRM) [W. E and B. Yu, 2017], a popular neural-network-based method for solving high dimensional partial differential equations. We derive the generalization error bounds of two-layer neural networks in the framework of the DRM for solving two prototype elliptic PDEs: the Poisson equation and the static Schrödinger equation on the d-dimensional unit hypercube. Specifically, we prove that the convergence rates of the generalization errors are independent of the dimension d, under the a priori assumption that the exact solutions of the PDEs lie in a suitable low-complexity space called the spectral Barron space. Moreover, we give sufficient conditions on the forcing term and the potential function which guarantee that the solutions are spectral Barron functions. We achieve this by developing a new solution theory for the PDEs on the spectral Barron space, which can be viewed as an analog of the classical Sobolev regularity theory for PDEs.

1. Introduction

Numerical solutions to high dimensional partial differential equations (PDEs) have been a long-standing challenge in scientific computing. The impressive advance of deep learning has offered exciting possibilities for algorithmic innovations. In particular, it is a natural idea to represent solutions of PDEs by (deep) neural networks to exploit the rich expressiveness of neural-network representations. The parameters of the neural networks are then trained by optimizing some loss function associated with the PDE. Natural loss functions can be designed using the variational structure, similar to the Ritz-Galerkin method in classical numerical analysis of PDEs. Such a method is known as the Deep Ritz Method (DRM) in [13, 22]. Methods in a similar spirit have also been developed in the computational physics literature [4] for solving eigenvalue problems arising from many-body quantum mechanics, under the framework of the variational Monte Carlo method [28].

Despite the wide popularity and many successful applications of the DRM and other approaches that use neural networks to solve high-dimensional PDEs, the analysis of such methods is scarce and they are still not well understood. This paper aims to provide an a priori generalization error analysis of the DRM with dimension-explicit estimates.

Generally speaking, the error of using neural networks to solve high dimensional PDEs can be decomposed into the following parts:
• Approximation error: this is the error of approximating the solution of a PDE using neural networks;
• Generalization error: this refers to the error of the neural-network-based approximate solution on predicting unseen data. The variational problem involves integrals in high dimension, which can be expensive to compute. In practice Monte Carlo methods are usually used to approximate those high dimensional integrals, and thus the minimizer of the surrogate model (known as empirical risk minimization) would be different from the minimizer of the original variational problem;

Date: March 23, 2021.
J.L. and M.W. are supported in part by National Science Foundation via grants DMS-2012286 and CCF-1934964. Y.L. is supported by the start-up fund of the Department of Mathematics and Statistics at UMass Amherst.

• Training (or optimization) error: this is the error incurred by the optimization algorithm used in the training of neural networks for PDEs. Since the parameters of the neural networks are obtained through an optimization process, it might not be able to find the best approximation to the unknown solution within the function class.

Note that from a numerical analysis point of view, these errors already appear for conventional Galerkin methods. Indeed, taking finite element methods as an example, the approximation error is the error of approximating the true solution in the finite element space; the generalization error can be seen as the discretization error caused by numerical quadrature of the variational formulation; the optimization error corresponds to the computational error in conventional numerical PDEs due to the inaccurate resolution of the linear or nonlinear finite dimensional discrete system.

Although classical numerical analysis for PDEs in low dimensions has formed a relatively complete theory over the last several decades, the error analysis of neural network methods is much more challenging for high dimensional PDEs and requires new ideas and tools. In fact, the three components of the error analysis highlighted above all face new difficulties.

For approximation, as is well known, high dimensional problems suffer from the curse of dimensionality if we proceed with standard regularity-based function spaces such as Sobolev spaces or Hölder spaces as in conventional numerical analysis. In fact, even using deep neural networks, the approximation rate for functions in such spaces deteriorates as the dimension becomes higher; see [43, 44]. Therefore, to obtain better approximation rates that scale mildly with the large dimensionality, it is natural to assume that the function of interest lies in a suitable smaller function space which has low complexity compared to Sobolev or Hölder spaces, so that the function can be efficiently approximated by neural networks in high dimensions. The first function class of this kind is the so-called Barron space defined in the seminal work of Barron [2]; see also [11, 23, 37, 38] for more variants of Barron spaces and their neural-network approximation properties. In the present paper we will introduce a discrete version of Barron's definition of such a space using the idea of spectral decomposition, and because of this we adopt the terminology of spectral Barron space following [10, 38] to distinguish it from the other versions. As the Barron spaces are very different from the usual Sobolev spaces, for PDE problems one has to develop novel a priori estimates and a corresponding approximation error analysis. In particular, a new solution theory for high dimensional PDEs in those low-complexity function spaces needs to be developed. This paper makes an initial attempt at establishing a solution theory in the spectral Barron space for a class of elliptic PDEs.

The analysis of the generalization error is also intimately related to the function class (e.g. neural networks) we use, in particular its complexity. This makes the generalization analysis quite different from the analysis of numerical quadrature error in a usual finite element method. We face a trade-off between approximation and generalization: to reduce the approximation error, one would like to use an approximation ansatz which involves a large number of degrees of freedom; however, such a choice will incur a large generalization error.

The training of the neural networks also remains a very challenging problem since the associated optimization problem is highly non-convex. In fact, even in a standard supervised learning setting, we still largely lack understanding of the optimization error, except in simplified settings where the optimization dynamics is essentially linear (see e.g., [6, 15, 21]). The analysis for PDE problems would face similar, if not more severe, difficulties, and it is beyond the scope of our current work.

In this work, we provide a rigorous analysis of the approximation and generalization errors of the DRM for high dimensional elliptic PDEs. We will focus on relatively simple PDEs (the Poisson equation and the static Schrödinger equation) to better convey the idea and illustrate the framework, without bogging the readers down with technical details. Our analysis, as already suggested by the discussions above, is based on identifying a correct functional analysis setup and developing the corresponding a priori analysis and complexity estimates, and it provides dimension-independent generalization error estimates.

1.1. Related Works. Several previous works on the analysis of neural-network based methods for high-dimensional PDEs focus on the aspect of representation, i.e., whether a solution to the PDE can be approximated by a neural network with quantitative error control; see e.g., [17, 20]. Fixing an approximation space, the generalization error can be controlled by analyzing complexity measures such as covering numbers; see e.g., [3] for a specific PDE problem.

More recently, several papers [26, 30, 35, 36] considered the generalization error analysis of the physics informed neural networks (PINNs) approach based on residual minimization for solving PDEs [24, 33]. In particular, the work [35] established the consistency of the loss function, so that the approximation converges to the true solution as the training sample increases, under the assumption of vanishing training error. For the generalization error, Mishra and Molinaro [30] carried out an a-posteriori-type generalization error analysis for PINNs, and proved that the generalization error is bounded by the training error and the quadrature error under some stability assumptions on the PDEs. To avoid the issue of curse of dimensionality in the quadrature error, the authors also considered the cumulative generalization error which involves a validation set. The paper [36] proved both a priori and a posteriori estimates for residual minimization methods in Sobolev spaces. The paper [26] obtained a priori generalization estimates for a class of second order linear PDEs by assuming (but without verifying) that the exact solutions of the PDEs belong to a Barron-type space introduced in [11].

Different from the previous generalization error analyses, we derive a priori and dimension-explicit generalization error estimates under the assumption that the solutions of the PDEs lie in the spectral Barron space, which is more aligned with [2]. Moreover, we justify this assumption by developing a novel solution theory in the spectral Barron space for the PDEs under consideration. This regularity theory is the main difference between our work and the above-mentioned ones.

It is worth mentioning that in a very recent preprint [12], E and Wojtowytsch considered the regularity theory of high dimensional PDEs on the whole space (including the screened Poisson equation, the heat equation, and a viscous Hamilton-Jacobi equation) in the Barron space introduced by [11]. Their results share a similar spirit with our analysis of the PDE regularity theory in the spectral Barron space (Theorem 2.5 for the Poisson equation and Theorem 2.6 for the static Schrödinger equation), while we focus on PDEs on a bounded domain and, as a result, we have to develop Barron function spaces different from those used for the whole space. The authors of [12] also provided some counterexamples to the regularity theory for PDE problems defined on non-convex domains, while we only focus on simple domains (in fact hypercubes) in this work.

While we focus on the variational-principle-based approach for solving high dimensional PDEs using neural networks, we note that many other approaches have been developed, such as the deep BSDE method based on the control formulation of parabolic PDEs [18], the deep Galerkin method based on the weak formulation [39], methods based on the strong formulation (residual minimization) such as the PINNs [24, 33], and the diffusion Monte Carlo type approach for high-dimensional eigenvalue problems [19], just to name a few. It would be an interesting future direction to extend our analysis to these methods.

1.2. Our Contributions. We analyze the generalization error of two-layer neural networks for solving two simple elliptic PDEs in the framework of the DRM. Specifically we make the following contributions:

• We define a spectral Barron space B^s(Ω) on the d-dimensional unit hypercube Ω = [0, 1]^d that extends Barron's original function space [2] from the whole space to a bounded domain; see the definition (2.10). In the generalization theory we develop, we assume that the solutions lie in the spectral Barron space.
• We show that spectral Barron functions in B^2(Ω) can be well approximated in the H¹-norm by two-layer neural networks with either ReLU or Softplus activation functions without curse of dimensionality. Moreover, the parameters (weights and biases) of the two-layer neural networks are controlled explicitly in terms of the spectral Barron norm. The bounds on the neural-network parameters play an essential role in controlling the generalization error of the neural nets. See Theorem 2.1 and Theorem 2.2 for the approximation results.
• We derive generalization error bounds of the neural-network solutions for solving the Poisson equation and the static Schrödinger equation under the assumption that the solutions belong to the Barron space B^2(Ω). We emphasize that the convergence rates in our generalization error bounds are dimension-independent and that the prefactors in the error estimates depend at most polynomially on the dimension and the Barron norms of the solutions, indicating that the DRM overcomes the curse of dimensionality when the solutions of the PDEs are spectral Barron functions. See Theorem 2.3 and Theorem 2.4 for the generalization results.
• Last but not least, we develop a new well-posedness theory for the solutions of the Poisson and static Schrödinger equations in the spectral Barron space, providing sufficient conditions that verify the earlier assumption on the solutions made in the generalization analysis. The new solution theory can be viewed as an analog of the classical PDE theory in Sobolev or Hölder spaces. See Theorem 2.5 and Theorem 2.6 for the new solution theory in the spectral Barron space.

1.3. Notation. We use |x|_p to denote the p-norm of a vector x ∈ R^d. When p = 2 we write |x| = |x|_2.

2. Set-Up and Main Results

2.1. Set-Up of PDEs. Let Ω = [0, 1]^d be the unit hypercube in R^d and let ∂Ω be the boundary of Ω. We consider the following two prototype elliptic PDEs on Ω equipped with the Neumann boundary condition: the Poisson equation

(2.1)  −∆u = f on Ω,  ∂u/∂ν = 0 on ∂Ω,

and the static Schrödinger equation

(2.2)  −∆u + V u = f on Ω,  ∂u/∂ν = 0 on ∂Ω.

Throughout the paper, we make the minimal assumption that f ∈ L²(Ω) and V ∈ L^∞(Ω) with V(x) ≥ V_min > 0, although later we will impose stronger regularity assumptions on f and V. In particular, in our high dimensional setting, we would certainly need to restrict the class of f and V, since otherwise just prescribing such general functions numerically would already incur the curse of dimensionality. The well-posedness of the solutions to the Poisson equation and the static Schrödinger equation in the Sobolev space H¹(Ω), as well as the variational characterizations of the solutions, are well known and are summarized in the proposition below, whose proof can be found in Appendix A.

Proposition 2.1. (i) Assume that f ∈ L²(Ω) with ∫_Ω f dx = 0. Then there exists a unique weak solution u∗_P ∈ H̊¹(Ω) := {u ∈ H¹(Ω) : ∫_Ω u dx = 0} to the Poisson equation (2.1). Moreover, we have that

(2.3)  u∗_P = argmin_{u∈H¹(Ω)} E_P(u),  E_P(u) := (1/2)∫_Ω |∇u|² dx + (1/2)(∫_Ω u dx)² − ∫_Ω f u dx,

and that for any u ∈ H¹(Ω),

(2.4)  2(E_P(u) − E_P(u∗_P)) ≤ ‖u − u∗_P‖²_{H¹(Ω)} ≤ 2 max{2C_P + 1, 2}(E_P(u) − E_P(u∗_P)),

where C_P is the Poincaré constant of the domain Ω, i.e., for any v ∈ H¹(Ω),

‖v − ∫_Ω v dx‖²_{L²(Ω)} ≤ C_P ‖∇v‖²_{L²(Ω)}.

(ii) Assume that f, V ∈ L^∞(Ω) and that 0 < V_min ≤ V(x) ≤ V_max < ∞ for some constants V_min and V_max. Then there exists a unique weak solution u∗_S ∈ H¹(Ω) to the static Schrödinger equation (2.2). Moreover, we have that

(2.5)  u∗_S = argmin_{u∈H¹(Ω)} E_S(u),  E_S(u) := (1/2)∫_Ω (|∇u|² + V|u|²) dx − ∫_Ω f u dx,

and that for any u ∈ H¹(Ω),

(2.6)  (2/max(1, V_max))(E_S(u) − E_S(u∗_S)) ≤ ‖u − u∗_S‖²_{H¹(Ω)} ≤ (2/min(1, V_min))(E_S(u) − E_S(u∗_S)).

The variational formulations (2.3) and (2.5) are the basis of the DRM [13] for solving those PDEs. The main idea is to train neural networks to minimize the (population) loss defined by the Ritz energy functional E. More specifically, let F ⊂ H¹(Ω) be a hypothesis function class parameterized by neural networks. The DRM seeks the optimal solution to the population loss E within the hypothesis space F. However, the population loss requires evaluations of d-dimensional integrals, which can be prohibitively expensive when d ≫ 1 if traditional quadrature methods were used. To circumvent the curse of dimensionality, it is natural to employ the Monte Carlo method for computing the high dimensional integrals, which leads to the so-called empirical loss (or risk) minimization.

2.2. Empirical Loss Minimization. Let us denote by P_Ω the uniform probability distribution on the domain Ω. Then the loss functionals E_P and E_S can be rewritten in terms of expectations under P_Ω as

E_P(u) = |Ω| · E_{X∼P_Ω}[ (1/2)|∇u(X)|² − f(X)u(X) ] + (1/2)(|Ω| · E_{X∼P_Ω} u(X))²,

E_S(u) = |Ω| · E_{X∼P_Ω}[ (1/2)|∇u(X)|² + (1/2)V(X)|u(X)|² − f(X)u(X) ].

To define the empirical loss, let {X_j}_{j=1}^n be an i.i.d. sequence of random variables distributed according to P_Ω. Define the empirical losses E_{n,P} and E_{n,S} by setting

(2.7)
E_{n,P}(u) = (1/n)∑_{j=1}^n [ |Ω| · ( (1/2)|∇u(X_j)|² − f(X_j)u(X_j) ) ] + (1/2)( (|Ω|/n)∑_{j=1}^n u(X_j) )²,

E_{n,S}(u) = (1/n)∑_{j=1}^n [ |Ω| · ( (1/2)|∇u(X_j)|² + (1/2)V(X_j)|u(X_j)|² − f(X_j)u(X_j) ) ].

Given an empirical loss E_n, the empirical loss minimization algorithm seeks u_n which minimizes E_n over the hypothesis class, i.e.

(2.8)  u_n = argmin_{u∈F} E_n(u).

Here we have suppressed the dependence of u_n on F. We denote by u_{n,P} and u_{n,S} the minimal solutions to the empirical losses E_{n,P} and E_{n,S}, respectively.
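
As an illustration of how the empirical losses (2.7) are assembled in practice, the following Python sketch evaluates E_{n,P} and E_{n,S} by Monte Carlo for a two-layer Softplus network whose input gradient is computed analytically. This is only a sketch of the discretization step, not the authors' implementation: the dimension, width, sample size, scale τ, forcing term f, potential V and the random parameters are all illustrative assumptions, and the subsequent minimization over the parameters would be delegated to a standard optimizer.

```python
import numpy as np

# Sketch of the empirical Ritz losses (2.7) for a two-layer Softplus network.
# All concrete choices below (d, m, n, tau, f, V, random parameters) are
# illustrative assumptions, not taken from the paper.

rng = np.random.default_rng(0)
d, m, n, tau = 4, 16, 2000, 4.0          # dimension, width, sample size, Softplus scale

def softplus(z):            # SP_tau(z) = log(1 + exp(tau*z)) / tau
    return np.logaddexp(0.0, tau * z) / tau

def softplus_prime(z):      # SP_tau'(z) = 1 / (1 + exp(-tau*z))
    return 1.0 / (1.0 + np.exp(-tau * z))

# network parameters theta = (c, gamma, w, t); each w_i has unit l1-norm as in (2.13)
c = 0.1
gamma = rng.normal(size=m) / m
w = rng.normal(size=(m, d)); w /= np.abs(w).sum(axis=1, keepdims=True)
t = rng.uniform(-1.0, 1.0, size=m)

def u(x):                   # x: (n, d) -> (n,)
    return c + softplus(x @ w.T - t) @ gamma

def grad_u(x):              # analytic gradient in x: sum_i gamma_i SP'(w_i.x - t_i) w_i
    return (softplus_prime(x @ w.T - t) * gamma) @ w

f = lambda x: np.cos(np.pi * x[:, 0])    # forcing term with zero mean (assumption)
V = lambda x: np.ones(len(x))            # potential V == 1 (assumption)

X = rng.uniform(0.0, 1.0, size=(n, d))   # i.i.d. uniform samples on Omega = [0,1]^d
uX, gX = u(X), grad_u(X)                 # note |Omega| = 1

E_nP = np.mean(0.5 * (gX ** 2).sum(axis=1) - f(X) * uX) + 0.5 * np.mean(uX) ** 2
E_nS = np.mean(0.5 * (gX ** 2).sum(axis=1) + 0.5 * V(X) * uX ** 2 - f(X) * uX)
print("E_{n,P}(u_theta) =", E_nP, "  E_{n,S}(u_theta) =", E_nS)
```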

2.3. Main Results. The goal of the present paper is to obtain quantitative estimates for the generalization error between the minimal solutions u_{n,S} and u_{n,P} computed from the finite data points {X_j}_{j=1}^n and the exact solutions when the spatial dimension d is large. Our primary interest is to derive estimates which scale mildly with respect to the increasing dimension d. To this end, it is necessary to assume that the true solutions lie in a smaller space which has a lower complexity than Sobolev spaces. Specifically we will consider the spectral Barron space defined below via the cosine transformation.

Let C be the set of cosine functions defined by

(2.9)  C := { Φ_k }_{k∈N₀^d} := { ∏_{i=1}^d cos(πk_i x_i) : k_i ∈ N₀ }.

Given u ∈ L¹(Ω), let {û(k)}_{k∈N₀^d} be the expansion coefficients of u under the basis {Φ_k}_{k∈N₀^d}. Let us define for s ≥ 0 the spectral Barron space B^s(Ω) on Ω by

(2.10)  B^s(Ω) := { u ∈ L¹(Ω) : ∑_{k∈N₀^d} (1 + π^s|k|₁^s)|û(k)| < ∞ }.

The spectral Barron norm of a function u ∈ B^s(Ω) is given by

‖u‖_{B^s(Ω)} = ∑_{k∈N₀^d} (1 + π^s|k|₁^s)|û(k)|.
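
For a function represented by finitely many cosine coefficients, the spectral Barron norm is a finite weighted sum of coefficient magnitudes. The short Python sketch below evaluates ‖u‖_{B^s(Ω)} directly from such a coefficient table; the dimension, the orders s and the sample coefficients are arbitrary assumptions made for illustration.

```python
import numpy as np

# Minimal sketch: spectral Barron norm (2.10) of a function given by finitely many
# cosine coefficients u_hat[k], k in N_0^d.  Dimension, orders s and the sample
# coefficients are arbitrary assumptions.

def barron_norm(u_hat, s):
    """u_hat: dict mapping multi-indices k (tuples of ints >= 0) to coefficients.
    Uses the weight 1 + pi^s |k|_1^s; Python's 0**0 == 1 is used for the k = 0 term."""
    return sum((1.0 + np.pi ** s * sum(k) ** s) * abs(c) for k, c in u_hat.items())

u_hat = {(0, 0): 1.0, (1, 0): 0.5, (2, 3): -0.25, (0, 4): 0.1}   # d = 2 example
for s in (0, 1, 2):
    print(f"||u||_B^{s} =", barron_norm(u_hat, s))
```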

Observe that a function u ∈ B^s(Ω) if and only if {û(k)}_{k∈N₀^d} belongs to the weighted ℓ¹-space ℓ¹_{W_s}(N₀^d) on the lattice N₀^d with the weights W_s(k) = 1 + π^{2s}|k|₁^{2s}. When s = 2, we adopt the short notation B(Ω) for B^2(Ω). Our definition of the spectral Barron space is strongly motivated by the seminal work of Barron [2] and other recent works [11, 23, 37]. The initial Barron function f in [2] is defined on the whole space R^d and its Fourier transform f̂(ω) satisfies ∫|f̂(ω)||ω|dω < ∞. Our spectral Barron space B^s(Ω) with s = 1 can be viewed as a discrete analog of the initial Barron space from [2].

The most important property of the Barron functions is that those functions can be well approximated by two-layer neural networks without the curse of dimensionality. To make this more precise, let us define the class of two-layer neural networks to be used as our hypothesis space for solving PDEs. Given an activation function φ, a constant B > 0 and the number of hidden neurons m, we define

(2.11)  F_{φ,m}(B) := { c + ∑_{i=1}^m γ_i φ(w_i · x − t_i) : |c| ≤ 2B, |w_i|₁ = 1, |t_i| ≤ 1, ∑_{i=1}^m |γ_i| ≤ 4B }.

Our first result concerns the approximation of spectral Barron functions in B(Ω) by two-layer neural networks with the ReLU activation function.

Theorem 2.1. Consider the class of two-layer ReLU neural networks

(2.12)  F_{ReLU,m}(B) := { c + ∑_{i=1}^m γ_i ReLU(w_i · x − t_i) : |c| ≤ 2B, |w_i|₁ = 1, |t_i| ≤ 1, ∑_{i=1}^m |γ_i| ≤ 4B }.

Then for any u ∈ B(Ω), there exists u_m ∈ F_{ReLU,m}(‖u‖_{B(Ω)}) such that

‖u − u_m‖_{H¹(Ω)} ≤ √116 ‖u‖_{B(Ω)} / √m.

A similar approximation result was first proved in the seminal paper of Barron [2], where the same approximation rate O(m^{−1/2}) was obtained when approximating a Barron function defined on the whole space by two-layer neural nets with the sigmoid activation function in the L^∞-norm. Results of this kind were also obtained in the recent works [11, 23, 37]. In particular, the same convergence rate was proved for approximating functions f with ‖f‖_{B^s} = ∫_{R^d} |f̂(ω)|(1 + |ω|)^s dω < ∞ in Sobolev norms by two-layer networks with a general class of activation functions satisfying a polynomial decay condition. The convergence rate O(m^{−1/2}) was recently improved to O(m^{−(1/2+δ(d))}) with δ(d) > 0 depending on d in [38] when ReLU^k or cosine is used as the activation function. Moreover, the rate has been proved to be sharp in the Sobolev norms when the index s of the Barron space and that of the Sobolev norm belong to a certain appropriate regime.

Although the function class F_{ReLU,m}(B) can be used to approximate functions in B(Ω) without curse of dimensionality, it brings several issues to both theory and computation if used as the hypothesis class for solving PDEs. On the one hand, the set F_{ReLU,m} ⊂ H¹(Ω) consists only of piecewise affine functions, which may be undesirable in some PDE problems if the function of interest is expected to be more regular or smooth. On the other hand, the fact that F_{ReLU,m} only admits first order weak derivatives makes it extremely difficult to bound the complexities of function classes involving derivatives of functions from F_{ReLU,m}, whereas the latter is a crucial ingredient for obtaining a generalization bound for the DRM.

To resolve those issues, in what follows we consider instead a class of two-layer neural networks with the Softplus [9, 16] activation function. Recall the Softplus function SP(z) = ln(1 + e^z) and its rescaled version SP_τ(z), defined for τ > 0 by

SP_τ(z) = (1/τ)SP(τz) = (1/τ)ln(1 + e^{τz}).

Observe that the rescaled Softplus SP_τ(z) can be viewed as a smooth approximation of the ReLU function since SP_τ(z) → ReLU(z) as τ → ∞ for any z ∈ R (see Lemma 4.6 for a quantitative statement). Moreover, the two-layer neural networks with the activation function SP_τ satisfy an approximation result similar to Theorem 2.1 when approximating spectral Barron functions in B(Ω), as shown in the next theorem.

Theorem 2.2. Consider the class of two-layer Softplus neural networks

(2.13)  F_{SP_τ,m}(B) := { c + ∑_{i=1}^m γ_i SP_τ(w_i · x − t_i) : |c| ≤ 2B, |w_i|₁ = 1, |t_i| ≤ 1, ∑_{i=1}^m |γ_i| ≤ 4B }.

Then for any u ∈ B(Ω), there exists a two-layer neural network u_m ∈ F_{SP_τ,m}(‖u‖_{B(Ω)}) with activation function SP_τ, where τ = √m, such that

‖u − u_m‖_{H¹(Ω)} ≤ ‖u‖_{B(Ω)}(6 log m + 30) / √m.

The proofs of Theorem 2.1 and Theorem 2.2 can be found in Section 4.

Now we are ready to state the main generalization results of two-layer neural networks for solving the Poisson and the static Schrödinger equations. We start with the generalization error bound for the neural-network solution in the Poisson case.

Theorem 2.3. Assume that the solution u∗_P of the Neumann problem for the Poisson equation (2.1) satisfies ‖u∗_P‖_{B(Ω)} < ∞. Let u^m_{n,P} be the minimizer of the empirical loss E_{n,P} in the set F = F_{SP_τ,m}(‖u∗_P‖_{B(Ω)}) with τ = √m. Then it holds that

(2.14)  E[ E_P(u^m_{n,P}) − E_P(u∗_P) ] ≤ C₁ √m(√(log m) + 1)/√n + C₂(log m + 1)²/m.

Here C₁ > 0 depends polynomially on ‖u∗_P‖_{B(Ω)}, d, ‖f‖_{L^∞(Ω)}, and C₂ > 0 depends quadratically on ‖u∗_P‖_{B(Ω)}. In particular, setting m = n^{1/3} in (2.14) leads to

E[ E_P(u^m_{n,P}) − E_P(u∗_P) ] ≤ C₃(log n)²/n^{1/3}

for some C₃ > 0 depending only polynomially on ‖u∗_P‖_{B(Ω)}, d, ‖f‖_{L^∞(Ω)}.

Next we state the generalization error bound for the neural-network solution in the case of the static Schrödinger equation.

Theorem 2.4. Assume that the solution u∗_S of the Neumann problem for the static Schrödinger equation (2.2) satisfies ‖u∗_S‖_{B(Ω)} < ∞. Let u^m_{n,S} be the minimizer of the empirical loss E_{n,S} in the set F = F_{SP_τ,m}(‖u∗_S‖_{B(Ω)}) with τ = √m. Then it holds that

(2.15)  E[ E_S(u^m_{n,S}) − E_S(u∗_S) ] ≤ C₄ √m(√(log m) + 1)/√n + C₅(log m + 1)²/m.

Here C₄ > 0 depends polynomially on ‖u∗_S‖_{B(Ω)}, d, ‖f‖_{L^∞(Ω)}, ‖V‖_{L^∞(Ω)}, and C₅ > 0 depends quadratically on ‖u∗_S‖_{B(Ω)}. In particular, setting m = n^{1/3} in (2.15) leads to

E[ E_S(u^m_{n,S}) − E_S(u∗_S) ] ≤ C₆(log n)²/n^{1/3}

for some C₆ > 0 depending only polynomially on ‖u∗_S‖_{B(Ω)}, d, ‖f‖_{L^∞(Ω)}, ‖V‖_{L^∞(Ω)}.

Remark 2.1. Thanks to the estimates (2.4) and (2.6), the generalization bounds above on the energy excess translate directly into generalization bounds on the square of the H¹-error between the neural-network solution and the exact solution of the PDE. Specifically, when m = n^{1/3}, it holds for some constant C₇ > 0 that

E‖u^m_n − u∗‖²_{H¹(Ω)} ≤ C₇ (log n)²/n^{1/3}.

Theorem 2.3 and Theorem 2.4 show that the generalization error of the neural-network solutions for the Poisson and the static Schrödinger equations does not suffer from the curse of dimensionality under the key assumption that the exact solutions belong to the spectral Barron space B(Ω). The proofs of Theorem 2.3 and Theorem 2.4 can be found in Section 6.

Finally we verify the key low-complexity assumption by proving a new well-posedness theory for the Poisson and the static Schrödinger equations in spectral Barron spaces. We start with the new solution theory for the Poisson equation, whose proof can be found in Section 7.1.

Theorem 2.5. Assume that f ∈ B^s(Ω) with s ≥ 0 and that f̂(0) = ∫_Ω f(x)dx = 0. Then the unique solution u∗ to the Neumann problem for the Poisson equation satisfies u∗ ∈ B^{s+2}(Ω) and

‖u∗‖_{B^{s+2}(Ω)} ≤ d‖f‖_{B^s(Ω)}.

In particular, when s = 0 we have ‖u∗‖_{B(Ω)} ≤ d‖f‖_{B^0(Ω)}.
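
In the cosine basis the Neumann Laplacian acts diagonally, −∆Φ_k = π²|k|²Φ_k, so for a forcing term with finitely many cosine modes the solution coefficients are obtained by a simple division. The Python sketch below uses this (a sketch assuming that diagonalization, not the paper's proof) to check the estimate of Theorem 2.5 numerically; the dimension, the order s and the random coefficient table are arbitrary assumptions.

```python
import numpy as np

# Sketch: for f = sum_k f_hat[k] Phi_k with f_hat(0) = 0, the Neumann Poisson solution
# has cosine coefficients u_hat[k] = f_hat[k] / (pi^2 |k|_2^2), since -Laplace Phi_k =
# pi^2 |k|_2^2 Phi_k.  We then check ||u*||_{B^{s+2}} <= d ||f||_{B^s} numerically.
# Dimension, order s and the coefficients are arbitrary assumptions.

def barron_norm(coeffs, s):
    return sum((1.0 + np.pi ** s * sum(k) ** s) * abs(c) for k, c in coeffs.items())

rng = np.random.default_rng(1)
d, s = 6, 1
f_hat = {tuple(int(v) for v in rng.integers(0, 4, size=d)): rng.normal() for _ in range(50)}
f_hat.pop(tuple([0] * d), None)                       # enforce zero mean: f_hat(0) = 0

u_hat = {k: c / (np.pi ** 2 * sum(ki ** 2 for ki in k)) for k, c in f_hat.items()}

lhs = barron_norm(u_hat, s + 2)
rhs = d * barron_norm(f_hat, s)
print(f"||u*||_B^{s+2} = {lhs:.4f} <= d ||f||_B^{s} = {rhs:.4f} : {lhs <= rhs}")
```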

The next theorem establishes the solution theory for the static Schrödinger equation in spectral Barron spaces.

Theorem 2.6. Assume that f ∈ B^s(Ω) with s ≥ 0 and that V ∈ B^s(Ω) with V(x) ≥ V_min > 0 for every x ∈ Ω. Then the static Schrödinger problem (2.2) has a unique solution u ∈ B^{s+2}(Ω). Moreover, there exists a constant C₈ > 0 depending on V and d such that

(2.16)  ‖u‖_{B^{s+2}(Ω)} ≤ C₈‖f‖_{B^s(Ω)}.

In particular, when s = 0 we have ‖u‖_{B(Ω)} ≤ C₈‖f‖_{B^0(Ω)}.

The stability estimates above can be viewed as an analog of the standard Sobolev regularity estimate ‖u‖_{H^{s+2}(Ω)} ≤ C‖f‖_{H^s(Ω)}. However, the proof of the estimate (2.16) is quite different from that of the Sobolev estimate. In particular, due to the lack of a Hilbert structure in the Barron space B^s(Ω), the standard Lax-Milgram theorem and the bootstrap arguments used for proving the Sobolev regularity estimates cannot be applied here. Instead, we turn to studying the equivalent operator equation satisfied by the cosine coefficients of the solution of the static Schrödinger equation. By exploiting the fact that the Barron space is a weighted ℓ¹-space on the cosine coefficients, we manage to prove the well-posedness of the operator equation and the stability estimate (2.16) with an application of the Fredholm theory to the operator equation. The complete proof of Theorem 2.6 can be found in Section 7.2.

2.4. Discussions and Future Directions. We established dimension-independent rates of convergence for the generalization error of the DRM for solving two simple linear elliptic PDEs. We would like to discuss some restrictions of the main results and point out some interesting future directions.

First, some numerical results show that the convergence rates in our generalization error estimates may not be sharp. In fact, Siegel and Xu [38] obtained sharp convergence rates of O(m^{−(1/2+δ(d))}) with some δ(d) > 0 for approximating a similar class of spectral Barron functions using two-layer neural nets with cosine and ReLU^k activation functions. However, the parameters (weights and biases) of the neural networks constructed in their approximation results were not well controlled (and may be unbounded) and could potentially lead to large generalization errors. One interesting open question is to sharpen the approximation rate for our spectral Barron functions using controllable two-layer neural networks with possibly different activation functions. On the other hand, the statistical error bound O(√m(√(log m)+1)/√n) may also be improved with sharper and more delicate Rademacher complexity estimates of the neural networks.

We restricted our attention to two simple elliptic problems defined on a hypercube with the Neumann boundary condition to better convey the main ideas. It is natural to consider carrying out similar programs for more general PDE problems defined on general bounded or unbounded domains with other boundary conditions. One major difficulty arises when one comes to the definition of Barron functions on a general bounded domain, since our spectral Barron functions built on cosine expansions cannot be adapted to general domains. Other Barron functions, such as the one defined in [11] via an integral representation, are defined on bounded domains and may be considered as alternatives, but building a solution theory for PDEs in those spaces seems highly nontrivial; see [12] for some results and discussions along this direction. Another major issue comes from solving PDEs with essential boundary conditions such as Dirichlet or periodic boundary conditions, where one needs to construct neural networks that satisfy those boundary conditions; we refer to [8, 31] for some initial attempts in this direction.

Finally, the analysis of the training error of neural network methods for solving PDEs is a highly important and challenging question. The difficulty is largely due to the non-convexity of the loss function in the parameters. Nevertheless, recent breakthroughs in the theoretical analysis of two-layer neural network training show that the training dynamics can be largely simplified in the infinite-width limit, such as in the mean field regime [5, 29, 34, 40] or the neural tangent kernel (NTK) regime [6, 15, 21], where global convergence of the limiting dynamics can be proved under suitable assumptions. It is an exciting direction to establish similar convergence results for overparameterized two-layer networks in the context of solving PDEs.

3. Abstract generalization error bounds

In this section, we derive some abstract generalization bounds for the empirical loss minimization discussed in the previous section. To simplify the notation, we suppress the problem-dependent subscript P or S and denote by u_n the minimizer of the empirical loss E_n over the hypothesis space F. Recall that u∗ is the exact solution of the PDE. We aim to bound the energy excess

∆E_n := E(u_n) − E(u∗).

By definition we have that ∆E_n ≥ 0. To bound ∆E_n from above, we first decompose ∆E_n as

(3.1)  ∆E_n = E(u_n) − E_n(u_n) + E_n(u_n) − E_n(u_F) + E_n(u_F) − E(u_F) + E(u_F) − E(u∗).

Here u_F = argmin_{u∈F} E(u). Since u_n is the minimizer of E_n, we have E_n(u_n) − E_n(u_F) ≤ 0. Therefore taking expectations on both sides of (3.1) gives

(3.2)  E∆E_n ≤ E[E(u_n) − E_n(u_n)] + ( E[E_n(u_F)] − E(u_F) ) + ( E(u_F) − E(u∗) ) =: ∆E_gen + ∆E_bias + ∆E_approx.

Observe that ∆E_gen and ∆E_bias are statistical errors: the first term ∆E_gen describes the generalization error of the empirical loss minimization over the hypothesis space F, and the second term ∆E_bias is the bias coming from the Monte Carlo approximation of the integrals. The third term ∆E_approx is the approximation error incurred by restricting the minimization of E from the set H¹(Ω) to F. Moreover, thanks to Proposition 2.1, the third term ∆E_approx is equivalent (up to a constant) to inf_{u∈F} ‖u − u∗‖²_{H¹(Ω)}.

To control the statistical errors, it is essential to prove a so-called uniform law of large numbers for certain function classes, where the notion of Rademacher complexity plays an important role; we recall it below.

Definition 3.1. For a set of random variables {Z_j}_{j=1}^n independently distributed according to P_Ω and a function class S, we define the random variable

R̂_n(S) := E_σ[ sup_{g∈S} | (1/n)∑_{j=1}^n σ_j g(Z_j) |  |  Z₁, ..., Z_n ],

where the expectation E_σ is taken with respect to the independent uniform Bernoulli sequence {σ_j}_{j=1}^n with σ_j ∈ {±1}. The Rademacher complexity of S is then defined by R_n(S) = E_{P_Ω}[R̂_n(S)].
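
For a small, finite function class the conditional expectation in Definition 3.1 can be estimated directly by sampling the signs σ_j, which is a convenient sanity check when working with complexity bounds. The Python sketch below does this for a toy class of cosines on [0, 1]; the class, the sample size and the number of sign draws are arbitrary assumptions.

```python
import numpy as np

# Sketch: Monte Carlo estimate of the empirical Rademacher complexity of a small
# finite function class S on Omega = [0,1].  The class, n and the number of sigma
# draws are arbitrary assumptions; for a finite class the sup is an exact maximum.

rng = np.random.default_rng(2)
n, n_sigma = 200, 2000
Z = rng.uniform(0.0, 1.0, size=n)                      # fixed sample Z_1,...,Z_n
S = [lambda z, a=a: np.cos(np.pi * a * z) for a in range(1, 6)]   # finite class
G = np.stack([g(Z) for g in S])                        # |S| x n matrix of g(Z_j)

sigma = rng.choice([-1.0, 1.0], size=(n_sigma, n))     # Rademacher signs
sup_vals = np.abs(G @ sigma.T / n).max(axis=0)         # sup_g |1/n sum_j sigma_j g(Z_j)|
print("hat R_n(S)  ~", sup_vals.mean())                # conditional on Z_1,...,Z_n
print("1/sqrt(n)    =", 1.0 / np.sqrt(n))
```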

The following important symmetrization lemma makes the connection between the uniform law of large numbers and the Rademacher complexity.

Lemma 3.1. [41, Proposition 4.11] Let F be a set of functions. Then

E sup_{u∈F} | (1/n)∑_{j=1}^n u(X_j) − E_{X∼P_Ω} u(X) | ≤ 2R_n(F).

3.1. Poisson Equation. In this subsection we derive the abstract generalization bound in the setting of the Poisson equation. Recall the Ritz loss and the empirical loss associated to the Poisson equation:

E(u) = |Ω| · E_{X∼P_Ω}[ (1/2)|∇u(X)|² − f(X)u(X) ] + (1/2)(|Ω| · E_{X∼P_Ω} u(X))² =: E¹(u) + E²(u),

E_n(u) = (1/n)∑_{j=1}^n [ |Ω| · ( (1/2)|∇u(X_j)|² − f(X_j)u(X_j) ) ] + (1/2)( (|Ω|/n)∑_{j=1}^n u(X_j) )² =: E¹_n(u) + E²_n(u).

By definition, the bias term ∆E_bias satisfies

∆E_bias = E[E¹_n(u_F)] − E¹(u_F) + E[E²_n(u_F)] − E²(u_F)
 = (1/2)E[ ( (|Ω|/n)∑_{j=1}^n u_F(X_j) )² ] − (1/2)( |Ω| · E_{X∼P_Ω} u_F(X) )²
 = (1/2)E[ ( (1/n)∑_{j=1}^n u_F(X_j) − E_{X∼P_Ω} u_F(X) ) · ( (1/n)∑_{j=1}^n u_F(X_j) + E_{X∼P_Ω} u_F(X) ) ]
 ≤ ‖u_F‖_{L^∞(Ω)} · E sup_{u∈F} | (1/n)∑_{j=1}^n u(X_j) − E_{X∼P_Ω} u(X) |
 ≤ 2 sup_{u∈F} ‖u‖_{L^∞(Ω)} · R_n(F),

where we have used |Ω| = 1 and the last inequality follows from Lemma 3.1.

Next we bound the first term ∆E_gen. Let us first define the set of functions G_P associated with the term E¹ by

G_P := { g : Ω → R | g = (1/2)|∇u|² − fu, where u ∈ F }.

Then it follows from Lemma 3.1 that

∆E_gen ≤ E sup_{v∈F} | E(v) − E_n(v) |
 ≤ E sup_{v∈F} | E¹(v) − E¹_n(v) | + E sup_{v∈F} | E²(v) − E²_n(v) |
 ≤ E sup_{g∈G_P} | (1/n)∑_{j=1}^n g(X_j) − E_{P_Ω}[g] | + E sup_{u∈F} (1/2)| ( E_{X∼P_Ω} u(X) )² − ( (1/n)∑_{j=1}^n u(X_j) )² |
 ≤ 2R_n(G_P) + sup_{u∈F} ‖u‖_{L^∞(Ω)} · E sup_{u∈F} | (1/n)∑_{j=1}^n u(X_j) − E_{X∼P_Ω} u(X) |
 ≤ 2R_n(G_P) + 2 sup_{u∈F} ‖u‖_{L^∞(Ω)} R_n(F).

Finally, owing to the estimate (2.4) in Proposition 2.1, the approximation error ∆E_approx satisfies

∆E_approx ≤ (1/2) inf_{u∈F} ‖u − u∗‖²_{H¹(Ω)}.

To summarize, we have established the following abstract generalization error bound for the energy excess ∆E_n in the case of the Poisson equation.

Theorem 3.1. Let u_{n,P} be the minimizer of the empirical risk E_{n,P} within a hypothesis class F satisfying sup_{u∈F} ‖u‖_{L^∞(Ω)} < ∞. Let ∆E_{n,P} = E_P(u_{n,P}) − E_P(u∗_P). Then

(3.3)  E∆E_{n,P} ≤ 2R_n(G_P) + 4 sup_{u∈F} ‖u‖_{L^∞(Ω)} · R_n(F) + (1/2) inf_{u∈F} ‖u − u∗_P‖²_{H¹(Ω)}.

3.2. Static Schrödinger Equation. In this subsection we proceed to prove an abstract generalization bound for the static Schrödinger equation. First recall the corresponding Ritz loss and the empirical loss:

E_S(u) = |Ω| · E_{X∼P_Ω}[ (1/2)|∇u(X)|² + (1/2)V(X)|u(X)|² − f(X)u(X) ],

E_{n,S}(u) = (1/n)∑_{j=1}^n [ |Ω| · ( (1/2)|∇u(X_j)|² + (1/2)V(X_j)|u(X_j)|² − f(X_j)u(X_j) ) ].

Similar to the previous subsection, we introduce the function class G_S by setting

G_S := { g : Ω → R | g = (1/2)|∇u|² + (1/2)V|u|² − fu, where u ∈ F }.

In the Schrödinger case, since the Ritz energy E_S is linear with respect to the probability measure P_Ω, the statistical errors ∆E_gen and ∆E_bias are simpler than those in the Poisson case. In particular, a calculation similar to the one above shows that ∆E_bias = 0 and ∆E_gen ≤ 2R_n(G_S). Therefore, as a consequence of (3.2), we obtain the following theorem.

Theorem 3.2. Let u_{n,S} be the minimizer of the empirical risk E_{n,S} within a hypothesis class F satisfying sup_{u∈F} ‖u‖_{L^∞(Ω)} < ∞. Let ∆E_{n,S} = E_S(u_{n,S}) − E_S(u∗_S). Then

(3.4)  E∆E_{n,S} ≤ 2R_n(G_S) + (1/2) inf_{u∈F} ‖u − u∗_S‖²_{H¹(Ω)}.

4. Spectral Barron functions on the hypercube and their H¹-approximation

In this section, we discuss the properties of spectral Barron functions on the d-dimensional hypercube defined by (2.10), as well as their neural network approximations. Since our spectral Barron functions are defined via the expansion under the following set of cosine functions,

C = { Φ_k }_{k∈N₀^d} := { ∏_{i=1}^d cos(πk_i x_i) : k_i ∈ N₀ },

we start by stating some preliminaries on C and the product of cosines to be used in the subsequent proofs.

4.1. Preliminary Lemmas.

Lemma 4.1. The set C forms an orthogonal basis of L2(Ω) and H1(Ω).

Proof. That C forms an orthogonal basis of L²(Ω) follows directly from Parseval's theorem applied to the Fourier expansion of the even extension of a function u ∈ L²(Ω). To see that C is an orthogonal basis of H¹(Ω), since C is an orthogonal set in H¹(Ω), it suffices to show that if u ∈ H¹(Ω) satisfies

(u, Φ_k)_{H¹(Ω)} = 0

for all k ∈ N₀^d, then u = 0. In fact, the last display yields

0 = ∫_Ω u · Φ_k dx + ∫_Ω ∇u · ∇Φ_k dx = ∫_Ω u · (Φ_k − ∆Φ_k) dx = (1 + π²|k|²) ∫_Ω u · Φ_k dx,

where for the second identity we have used Green's formula and the fact that the normal derivative of Φ_k vanishes on the boundary of Ω. Therefore we have obtained that (u, Φ_k)_{L²} = 0 for any k ∈ N₀^d, which implies that u = 0 since C is an orthogonal basis of L²(Ω). □

Given u ∈ L²(Ω), let {û(k)}_{k∈N₀^d} be the expansion coefficients of u under the basis {Φ_k}_{k∈N₀^d}. Then for any u ∈ L²(Ω),

u(x) = ∑_{k∈N₀^d} û(k)Φ_k(x).

Moreover, it follows from a straightforward calculation that for u ∈ H¹(Ω),

‖u‖²_{H¹(Ω)} = ∑_{k∈N₀^d} α_k(1 + π²|k|²)|û(k)|²,

where α_k = ⟨Φ_k, Φ_k⟩_{L²(Ω)} = 2^{−∑_{i=1}^d 1_{k_i≠0}} ≤ 1. This implies the following characterization of H¹(Ω) functions in terms of their expansion coefficients under C.

Corollary 4.1. The space H¹(Ω) can be characterized as

H¹(Ω) = { u ∈ L²(Ω) | ∑_{k∈N₀^d} |û(k)|²(1 + π²|k|²) < ∞ }.
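
The cosine coefficients themselves are plain L² projections, û(k) = ⟨u, Φ_k⟩/⟨Φ_k, Φ_k⟩ with ⟨Φ_k, Φ_k⟩ = α_k. As a concrete d = 1 illustration, the Python sketch below computes the first few coefficients of u(t) = t² by quadrature and compares them with the closed-form values 1/3 and 4(−1)^k/(π²k²); the test function and the grid size are arbitrary assumptions.

```python
import numpy as np

# Sketch: cosine expansion coefficients on [0,1] for d = 1, computed by quadrature,
# assuming the normalization u_hat(k) = <u, Phi_k> / <Phi_k, Phi_k> with
# alpha_k = 1 for k = 0 and 1/2 for k >= 1.  Test function and grid are assumptions.

x = (np.arange(4000) + 0.5) / 4000                      # midpoint rule on [0,1]
u = lambda t: t ** 2                                     # test function
u_hat = []
for k in range(6):
    alpha_k = 1.0 if k == 0 else 0.5
    u_hat.append(np.mean(u(x) * np.cos(np.pi * k * x)) / alpha_k)

# exact coefficients of t^2: u_hat(0) = 1/3, u_hat(k) = 4(-1)^k/(pi^2 k^2) for k >= 1
exact = [1.0 / 3.0] + [4.0 * (-1) ** k / (np.pi ** 2 * k ** 2) for k in range(1, 6)]
for k, (a, b) in enumerate(zip(u_hat, exact)):
    print(f"k={k}:  quadrature {a:+.6f}   exact {b:+.6f}")
```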

The following elementary product formula of cosine functions will also be useful.

Lemma 4.2. For any {θ_i}_{i=1}^d ⊂ R,

∏_{i=1}^d cos(θ_i) = (1/2^d) ∑_{ξ∈Ξ} cos(ξ · θ),

where θ = (θ₁, ..., θ_d)^T and Ξ = {1, −1}^d.

Proof. The lemma follows directly by iterating the simple identity

cos(θ₁)cos(θ₂) = (1/2)( cos(θ₁ + θ₂) + cos(θ₁ − θ₂) ) = (1/4)( cos(θ₁ + θ₂) + cos(θ₁ − θ₂) + cos(−θ₁ − θ₂) + cos(−θ₁ + θ₂) ). □
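
The identity of Lemma 4.2 is also easy to verify numerically; the short Python check below evaluates both sides for random angles (the dimension and the angles are arbitrary assumptions).

```python
import numpy as np
from itertools import product

# Sketch: numerical check of Lemma 4.2,
#   prod_i cos(theta_i) = 2^{-d} sum_{xi in {1,-1}^d} cos(xi . theta).
# The dimension and the random angles are arbitrary assumptions.

rng = np.random.default_rng(3)
d = 4
theta = rng.uniform(-np.pi, np.pi, size=d)

lhs = np.prod(np.cos(theta))
rhs = sum(np.cos(np.dot(xi, theta)) for xi in product([1.0, -1.0], repeat=d)) / 2 ** d
print(f"lhs = {lhs:.12f}, rhs = {rhs:.12f}, difference = {abs(lhs - rhs):.2e}")
```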

4.2. Spectral Barron Space and Neural-Network Approximation. Recall that for any s ≥ 0 the spectral Barron space B^s(Ω) is given by

B^s(Ω) := { u ∈ L¹(Ω) : ∑_{k∈N₀^d} (1 + π^s|k|₁^s)|û(k)| < ∞ }

with the associated norm ‖u‖_{B^s(Ω)} := ∑_{k∈N₀^d} (1 + π^s|k|₁^s)|û(k)|. Recall also the short notation B(Ω) for B^2(Ω).

Lemma 4.3. The following embedding results hold:
(i) B(Ω) ↪ H¹(Ω);
(ii) B⁰(Ω) ↪ L^∞(Ω).

Proof. (i) If u ∈ B(Ω), then ‖u‖_{B(Ω)} = ∑_{k∈N₀^d} (1 + π²|k|₁²)|û(k)| < ∞. This in particular implies |û(k)| ≤ ‖u‖_{B(Ω)} for each k ∈ N₀^d. Since α_k ≤ 1, we have

‖u‖²_{H¹(Ω)} = ∑_{k∈N₀^d} α_k(1 + π²|k|²)|û(k)|² ≤ ‖u‖_{B(Ω)} ∑_{k∈N₀^d} (1 + π²d|k|₁²)|û(k)| ≤ d‖u‖²_{B(Ω)}.

(ii) For u ∈ B⁰(Ω), using the fact that ‖Φ_k‖_{L^∞(Ω)} ≤ 1 we have

‖u‖_{L^∞(Ω)} = ‖ ∑_{k∈N₀^d} û(k)Φ_k ‖_{L^∞(Ω)} ≤ ∑_{k∈N₀^d} |û(k)| ≤ ‖u‖_{B⁰(Ω)}. □

Thanks to Lemma 4.1 and Lemma 4.2, any function u ∈ H¹(Ω) admits the expansion

(4.1)  u(x) = ∑_{k∈N₀^d} û(k) · (1/2^d) ∑_{ξ∈Ξ} cos(πk_ξ · x),

where û(k) is the expansion coefficient of u under the basis C and k_ξ = (k₁ξ₁, ..., k_dξ_d) ∈ Z^d. Given u ∈ B(Ω) ⊂ H¹(Ω), letting (−1)^{θ(k)} = sign(û(k)) with θ(k) ∈ {0, 1}, we have from (4.1) that

u(x) = û(0) + ∑_{k∈N₀^d\{0}} û(k) · (1/2^d) ∑_{ξ∈Ξ} cos(πk_ξ · x)
 = û(0) + ∑_{k∈N₀^d\{0}} |û(k)| sign(û(k)) · (1/2^d) ∑_{ξ∈Ξ} cos(πk_ξ · x)
 = û(0) + ∑_{k∈N₀^d\{0}} |û(k)| · (1/2^d) ∑_{ξ∈Ξ} cos(π(k_ξ · x + θ(k)))
 = û(0) + ∑_{k∈N₀^d\{0}} (1/Z_u)|û(k)|(1 + π²|k|₁²) · ( Z_u/(1 + π²|k|₁²) ) · (1/2^d) ∑_{ξ∈Ξ} cos(π(k_ξ · x + θ(k)))
 =: û(0) + ∫ g(x, k) μ(dk),

where μ(dk) is the probability measure on N₀^d \ {0} defined by

μ(dk) = ∑_{k∈N₀^d\{0}} (1/Z_u)|û(k)|(1 + π²|k|₁²) δ_k(dk)

with normalizing constant Z_u = ∑_{k∈N₀^d\{0}} |û(k)|(1 + π²|k|₁²) ≤ ‖u‖_{B(Ω)} and

g(x, k) = ( Z_u/(1 + π²|k|₁²) ) · (1/2^d) ∑_{ξ∈Ξ} cos(π(k_ξ · x + θ(k))).

Observe that the function g(·, k) ∈ C²(Ω) for every k ∈ N₀^d \ {0}. Moreover, it is straightforward to show that the following bounds hold:

‖g(·, k)‖_{H¹(Ω)} = Z_u √(α_k(1 + π²|k|²)) / (1 + π²|k|₁²) ≤ ‖u‖_{B(Ω)},
‖D^s g(·, k)‖_{L^∞(Ω)} ≤ Z_u ≤ ‖u‖_{B(Ω)} for s = 0, 1, 2.

Let us define for a constant B > 0 the function class

F_cos(B) := { (γ/(1 + π²|k|₁²)) cos(π(k · x + b)) : k ∈ Z^d \ {0}, |γ| ≤ B, b ∈ {0, 1} }.

It follows from the calculations above that if u ∈ B(Ω), then ū := u − û(0) lies in the H¹-closure of the convex hull of F_cos(B) with B = ‖u‖_{B(Ω)}. Indeed, if {k_i}_{i=1}^m is an i.i.d. sequence of random samples from the probability measure μ, then it follows from Fubini's theorem that

E‖ū(x) − (1/m)∑_{i=1}^m g(x, k_i)‖²_{H¹(Ω)}
 = E∫_Ω | ū(x) − (1/m)∑_{i=1}^m g(x, k_i) |² dx + E∫_Ω | ∇ū(x) − (1/m)∑_{i=1}^m ∇g(x, k_i) |² dx
 = (1/m)∫_Ω Var[g(x, k)] dx + (1/m)∫_Ω Tr(Cov[∇g(x, k)]) dx
 ≤ E‖g(·, k)‖²_{H¹(Ω)} / m
 ≤ ‖u‖²_{B(Ω)} / m.

Therefore the expected H¹-distance between ū and an average of m i.i.d. samples from F_cos(B) converges to zero as m → ∞. This in particular implies that there exists a sequence of convex combinations of points in F_cos(B) converging to ū in the H¹-norm. Since the H¹-norm of any function in F_cos(B) is bounded by B, an application of Maurey's empirical method (see Lemma 4.4) yields the following theorem.

Theorem 4.1. Let u ∈ B(Ω). Then there exists u_m, a convex combination of m functions in F_cos(B) with B = ‖u‖_{B(Ω)}, such that

‖u − û(0) − u_m‖²_{H¹(Ω)} ≤ ‖u‖²_{B(Ω)} / m.
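
The probabilistic argument above is easy to reproduce numerically in one dimension: draw frequencies k_i with probability proportional to |û(k)|(1 + π²k²) and average the corresponding rescaled cosines g(·, k_i). The Python sketch below does this on a grid and reports a discrete H¹ error together with the reference scale Z_u/√m; the target coefficients, the grid and the values of m are arbitrary assumptions of the sketch, and a single random realization is shown rather than the expectation appearing in Theorem 4.1.

```python
import numpy as np

# Sketch (d = 1): Monte Carlo version of the argument behind Theorem 4.1.  Draw
# frequencies k_i ~ mu with mu(k) proportional to |u_hat(k)| (1 + pi^2 k^2) and average
# g(x, k_i) = sign(u_hat(k_i)) Z_u cos(pi k_i x) / (1 + pi^2 k_i^2); the H^1 error of
# the average decays like Z_u / sqrt(m).  Coefficients, grid and m are assumptions.

rng = np.random.default_rng(4)
ks = np.arange(1, 21)
u_hat = 1.0 / (1.0 + ks) ** 3 * (-1.0) ** ks            # target coefficients, k >= 1
weights = np.abs(u_hat) * (1.0 + np.pi ** 2 * ks ** 2)
Z_u = weights.sum()
mu = weights / Z_u

x = np.linspace(0.0, 1.0, 2001)
u_bar = (u_hat[:, None] * np.cos(np.pi * ks[:, None] * x)).sum(axis=0)   # u - u_hat(0)

def h1_norm(vals, dx):                                   # discrete H^1 norm on the grid
    dv = np.gradient(vals, dx)
    return np.sqrt(np.mean(vals ** 2) + np.mean(dv ** 2))

for m in (10, 100, 1000):
    idx = rng.choice(len(ks), size=m, p=mu)
    g = (np.sign(u_hat[idx])[:, None] * Z_u / (1.0 + np.pi ** 2 * ks[idx][:, None] ** 2)
         * np.cos(np.pi * ks[idx][:, None] * x)).mean(axis=0)
    err = h1_norm(u_bar - g, x[1] - x[0])
    print(f"m={m:5d}   H1 error = {err:.4f}   Z_u/sqrt(m) = {Z_u / np.sqrt(m):.4f}")
```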

Lemma 4.4. [2, 32] Let u belong to the closure of the convex hull of a set G in a Hilbert space, and let the Hilbert norm of each element of G be bounded by B > 0. Then for every m ∈ N, there exist {g_i}_{i=1}^m ⊂ G and {c_i}_{i=1}^m ⊂ [0, 1] with ∑_{i=1}^m c_i = 1 such that

‖u − ∑_{i=1}^m c_i g_i‖² ≤ B²/m.

4.3. Reduction to ReLU and Softplus Activation Functions. Notice that every function in F_cos(B) is the composition of the one dimensional function g defined on [−1, 1] by

(4.2)  g(z) = (γ/(1 + π²|k|₁²)) cos(π(|k|₁ z + b))

with k ∈ Z^d \ {0}, |γ| ≤ B and b ∈ {0, 1}, and the linear function z = w · x with w = k/|k|₁. It is clear that g ∈ C²([−1, 1]) and that g satisfies

(4.3)  ‖g^{(s)}‖_{L^∞([−1,1])} ≤ |γ| ≤ B for s = 0, 1, 2.

Since b ∈ {0, 1}, it also holds that g′(0) = 0.

Lemma 4.5. Let g ∈ C²([−1, 1]) with ‖g^{(s)}‖_{L^∞([−1,1])} ≤ B for s = 0, 1, 2. Assume that g′(0) = 0. Let {z_j}_{j=0}^{2m} be the partition of [−1, 1] with z₀ = −1, z_m = 0, z_{2m} = 1 and z_{j+1} − z_j = h = 1/m for each j = 0, ..., 2m − 1. Then there exists a two-layer ReLU network g_m of the form

(4.4)  g_m(z) = c + ∑_{i=1}^{2m} a_i ReLU(ε_i z − b_i), z ∈ [−1, 1],

with c = g(0), b_i ∈ [−1, 1] and ε_i ∈ {±1}, i = 1, ..., 2m, such that

(4.5)  ‖g − g_m‖_{W^{1,∞}([−1,1])} ≤ 2B/m.

Moreover, we have that |a_i| ≤ 2B/m and that |c| ≤ B.

Proof. Let g_m be the piecewise linear interpolation of g with respect to the grid {z_j}_{j=0}^{2m}, i.e.

g_m(z) = g(z_{j+1})(z − z_j)/h + g(z_j)(z_{j+1} − z)/h if z ∈ [z_j, z_{j+1}].

According to [1, Chapter 11],

‖g − g_m‖_{L^∞([−1,1])} ≤ (h²/8)‖g″‖_{L^∞([−1,1])}.

Moreover, ‖g′ − g_m′‖_{L^∞([−1,1])} ≤ h‖g″‖_{L^∞([−1,1])}. In fact, consider z ∈ [z_j, z_{j+1}] for some j ∈ {0, ..., 2m − 1}. By the mean value theorem, there exist ξ, η ∈ (z_j, z_{j+1}) such that (g(z_{j+1}) − g(z_j))/h = g′(ξ), and hence

| g′(z) − (g(z_{j+1}) − g(z_j))/h | = | g′(z) − g′(ξ) | = |g″(η)||z − ξ| ≤ h‖g″‖_{L^∞([−1,1])}.

This proves the error bound (4.5).

Next, we show that g_m can be represented by a two-layer ReLU neural network. Indeed, it is easy to verify that g_m can be rewritten as

(4.6)  g_m(z) = c + ∑_{i=1}^m a_i ReLU(z_i − z) + ∑_{i=m+1}^{2m} a_i ReLU(z − z_{i−1}), z ∈ [−1, 1],

where c = g(z_m) = g(0) and the parameters a_i are defined by

a_i = (g(z_{m+1}) − g(z_m))/h, if i = m + 1,
a_i = (g(z_{m−1}) − g(z_m))/h, if i = m,
a_i = (g(z_i) − 2g(z_{i−1}) + g(z_{i−2}))/h, if i > m + 1,
a_i = (g(z_{i−1}) − 2g(z_i) + g(z_{i+1}))/h, if i < m.

Furthermore, by the mean value theorem again, there exist ξ₁, ξ₂ ∈ (z_m, z_{m+1}) such that |a_{m+1}| = |g′(ξ₁)| = |g′(ξ₁) − g′(0)| = |g″(ξ₂)ξ₁| ≤ Bh. In a similar manner one can obtain that |a_m| ≤ Bh and |a_i| ≤ 2Bh if i ∉ {m, m + 1}.

Finally, by setting ε_i = −1, b_i = −z_i for i = 1, ..., m and ε_i = 1, b_i = z_{i−1} for i = m + 1, ..., 2m, one obtains the desired form (4.4) of g_m. This completes the proof of the lemma. □

The following proposition is a direct consequence of Lemma 4.5.

Proposition 4.1. Define the function class

F_ReLU(B) := { c + γ ReLU(w · x − t) : |c| ≤ 2B, |w|₁ = 1, |t| ≤ 1, |γ| ≤ 4B }.

Then for any constant c̃ with |c̃| ≤ B, the set c̃ + F_cos(B) is contained in the H¹-closure of the convex hull of F_ReLU(B).

Proof. Lemma 4.5 states that each C²-function g with g′(0) = 0 and with derivatives up to second order bounded by B can be well approximated in the H¹-norm by a linear combination of a constant function and ReLU functions ReLU(εz − t), with the sum of the absolute values of the combination coefficients bounded by 4B. As a result, the function g defined in (4.2) lies in the closure of the convex hull of the functions c + γReLU(εz − t) with |c| ≤ B, |γ| ≤ 4B, |t| ≤ 1. The proposition then follows by absorbing the additive constant c̃ into the constant term c in the definition of F_ReLU(B). □

With Proposition 4.1, we are ready to give the proof of Theorem 2.1.

Proof of Theorem 2.1. Observe that if u ∈ F_ReLU(B), then

‖u‖²_{H¹(Ω)} ≤ (|c| + 2|γ|)² + |γ|² ≤ (10² + 4²)B² = 116B².

Therefore Theorem 2.1 follows directly from Lemma 4.4 and Proposition 4.1 with c̃ = û(0), together with the fact that |û(0)| ≤ ‖u‖_{B(Ω)}. □

Next we proceed to prove Theorem 2.2, which concerns approximating spectral Barron functions using two-layer networks with the Softplus activation. To this end, let us first state a lemma which shows that ReLU can be well approximated by SP_τ for τ ≫ 1.

Lemma 4.6. The following inequalities hold:

(i) |ReLU(z) − SP_τ(z)| ≤ (1/τ)e^{−τ|z|} for all z ∈ [−2, 2];
(ii) |ReLU′(z) − SP_τ′(z)| ≤ e^{−τ|z|} for all z ∈ [−2, 0) ∪ (0, 2];
(iii) ‖SP_τ‖_{W^{1,∞}([−2,2])} ≤ 3 + 1/τ.

Proof. Notice that ReLU(z) − SP_τ(z) = −(1/τ)ln(1 + e^{−τ|z|}). Hence inequality (i) follows from

|ReLU(z) − SP_τ(z)| = (1/τ)ln(1 + e^{−τ|z|}) ≤ e^{−τ|z|}/τ,

where the last inequality uses the elementary bound ln(1 + x) ≤ x for x > −1. In addition, inequality (ii) holds since

|ReLU′(z) − SP_τ′(z)| = 1/(1 + e^{τ|z|}) ≤ e^{−τ|z|} if z ≠ 0.

Finally, inequality (iii) follows from

‖SP_τ‖_{L^∞([−2,2])} = SP_τ(2) ≤ 2 + 1/τ and |SP_τ′(z)| = 1/(1 + e^{−τz}) ≤ 1. □

Lemma 4.7. Let g ∈ C²([−1, 1]) with ‖g^{(s)}‖_{L^∞([−1,1])} ≤ B for s = 0, 1, 2. Assume that g′(0) = 0. Let {z_j}_{j=−m}^m be the partition of [−1, 1] with m ≥ 2, z_{−m} = −1, z₀ = 0, z_m = 1 and z_{j+1} − z_j = h = 1/m for each j = −m, ..., m − 1. Then there exists a two-layer neural network g_{τ,m} of the form

(4.7)  g_{τ,m}(z) = c + ∑_{i=1}^{2m} a_i SP_τ(ε_i z − b_i), z ∈ [−1, 1],

with c = g(0) ≤ B, b_i ∈ [−1, 1], |a_i| ≤ 2B/m and ε_i ∈ {±1}, i = 1, ..., 2m, such that

(4.8)  ‖g − g_{τ,m}‖_{W^{1,∞}([−1,1])} ≤ 6Bδ_τ,

where

(4.9)  δ_τ := (1/τ)(1 + 1/τ)(log(τ/3) + 1).

Proof. Thanks to Lemma 4.5, there exists g_m of the form

(4.10)  g_m(z) = c + ∑_{i=1}^m a_i ReLU(z_i − z) + ∑_{i=m+1}^{2m} a_i ReLU(z − z_{i−1}), z ∈ [−1, 1],

such that ‖g − g_m‖_{W^{1,∞}([−1,1])} ≤ 2B/m. More importantly, the coefficients a_i satisfy |a_i| ≤ 2B/m, so that ∑_{i=1}^{2m} |a_i| ≤ 4B. Now let g_{τ,m} be the function obtained by replacing the activation ReLU in g_m by SP_τ, i.e.

(4.11)  g_{τ,m}(z) = c + ∑_{i=1}^m a_i SP_τ(z_i − z) + ∑_{i=m+1}^{2m} a_i SP_τ(z − z_{i−1}), z ∈ [−1, 1].

Suppose that z ∈ (z_j, z_{j+1}) for some fixed j < m − 1. Then, thanks to Lemma 4.6 (i), the bound |a_i| ≤ 2B/m and the fact that |z_i − z| ≥ 1/m if i ≠ j while z ∈ (z_j, z_{j+1}), we have

|g_m(z) − g_{τ,m}(z)| ≤ |a_j||ReLU(z_j − z) − SP_τ(z_j − z)| + ∑_{i=1, i≠j}^m |a_i||ReLU(z_i − z) − SP_τ(z_i − z)| + ∑_{i=m+1}^{2m} |a_i||ReLU(z − z_{i−1}) − SP_τ(z − z_{i−1})|
 ≤ 2B/(mτ) + (2B/τ) e^{−τ|x|} 1_{|x|≥1/m}.

Similar bounds hold in the case where z ∈ (z_j, z_{j+1}) for j > m. Lastly, if z ∈ (z_m, z_{m+1}), then both the m-th and the (m+1)-th terms in (4.10) and (4.11) depend on z_m, from which we get

|g_m(z) − g_{τ,m}(z)| ≤ 4B/(mτ) + (2B/τ) e^{−τ|x|} 1_{|x|≥1/m}.

Therefore we have obtained that

‖g_m − g_{τ,m}‖_{L^∞([−1,1])} ≤ 4B/(mτ) + (2B/τ) e^{−τ|x|} 1_{|x|≥1/m}.

Thanks to Lemma 4.6 (ii), the same argument carries over to the estimate for the difference of the derivatives and leads to

‖g_m′ − g_{τ,m}′‖_{L^∞([−1,1])} ≤ 4B/m + 2B e^{−τ|x|} 1_{|x|≥1/m}.

Combining the estimates above with ‖g − g_m‖_{W^{1,∞}([−1,1])} ≤ 2B/m yields

‖g − g_{τ,m}‖_{W^{1,∞}([−1,1])} ≤ ‖g − g_m‖_{W^{1,∞}([−1,1])} + ‖g_m − g_{τ,m}‖_{W^{1,∞}([−1,1])}
 ≤ 2B/m + 4B/(mτ) + (2B/τ) e^{−τ|x|} 1_{|x|≥1/m}
 ≤ 2B(1 + 1/τ)( 3/m + e^{−τ/m} ) = 6Bδ_τ.

We have used the fact that max_{0<x≤1/2} (3x + e^{−τx}) = (log(τ/3) + 1)·(3/τ) in the last inequality. The proof of the lemma is finished by combining the estimates above and by rewriting (4.11) in the form of (4.7). □

Now we are ready to present the proof of Theorem 2.2. To do so, let us define the function class

F_{SP_τ}(B) := { c + γ SP_τ(w · x − t) : |c| ≤ 2B, |w|₁ = 1, |t| ≤ 1, |γ| ≤ 4B }.

Note by (iii) of Lemma 4.6 that

(4.12)  sup_{u∈F_{SP_τ}(B)} ‖u‖_{H¹(Ω)} ≤ 2B + 4B‖SP_τ‖_{W^{1,∞}([−2,2])} ≤ 14B + 4B/τ.

Proof of Theorem 2.2. First, according to Theorem 4.1, u − û(0) lies in the H¹-closure of the convex hull of F_cos(B) with B = ‖u‖_{B(Ω)}. Note that each function in F_cos(B) is a composition of the multivariate linear function z = w · x with |w|₁ = 1 and the univariate function g(z) defined in (4.2), which satisfies g′(0) = 0 and ‖g^{(s)}‖_{L^∞([−1,1])} ≤ B for s = 0, 1, 2. By Lemma 4.7, such a g can be approximated by g_{τ,m}, which lies in the convex hull of the set of functions

{ c + γ SP_τ(εz − b) : |c| ≤ B, ε ∈ {±1}, |b| ≤ 1, |γ| ≤ 4B }.

Moreover, ‖g − g_{τ,m}‖_{W^{1,∞}([−1,1])} ≤ 6Bδ_τ. As a result, we have that

‖g(w · x) − g_{τ,m}(w · x)‖_{H¹(Ω)} ≤ ‖g − g_{τ,m}‖_{W^{1,∞}([−1,1])} ≤ 6Bδ_τ.

This, combined with the fact that |û(0)| ≤ B, yields that there exists a function u_τ in the closure of the convex hull of F_{SP_τ}(B) such that

‖u − u_τ‖_{H¹(Ω)} ≤ 6Bδ_τ.

Thanks to Lemma 4.4 and the bound (4.12), there exists u_m ∈ F_{SP_τ,m}(B), a convex combination of m functions in F_{SP_τ}(B), such that

‖u_τ − u_m‖_{H¹(Ω)} ≤ B(4/τ + 14)/√m.

Combining the last two inequalities leads to

‖u − u_m‖_{H¹(Ω)} ≤ 6Bδ_τ + B(4/τ + 14)/√m.

Setting τ = √m ≥ 1 and using (4.9), we obtain that

‖u − u_m‖_{H¹(Ω)} ≤ (6B/τ)(1 + 1/τ)(log(τ/3) + 1) + (B/√m)(4/τ + 14)
 ≤ (6B/√m) · 2 · ((1/2)log m + 1) + 18B/√m = B(6 log m + 30)/√m.

This proves the desired estimate. □

5. Rademacher complexities of two-layer neural networks

The goal of this section is to derive Rademacher complexity bounds for some two-layer neural-network function classes that are relevant to the Ritz losses of the Poisson and the static Schrödinger equations. These bounds will be essential for obtaining the generalization bounds in Theorem 2.3 and Theorem 2.4.

First, let us consider, for fixed positive constants C, Γ, W and T, the set of two-layer neural networks

(5.1)  F_m = { u_θ(x) = c + ∑_{i=1}^m γ_i φ(w_i · x + t_i), x ∈ Ω, θ ∈ Θ : |c| ≤ C, ∑_{i=1}^m |γ_i| ≤ Γ, |w_i|₁ ≤ W, |t_i| ≤ T }.

Here φ is the activation function, θ = (c, {γ_i}_{i=1}^m, {w_i}_{i=1}^m, {t_i}_{i=1}^m) denotes collectively the parameters of the two-layer neural network, and Θ = Θ_c × Θ_γ × Θ_w × Θ_t = [−C, C] × B₁^m(Γ) × (B₁^d(W))^m × [−T, T]^m represents the parameter space. We shall consider the set Θ endowed with the metric ρ_Θ defined for θ = (c, γ, w, t), θ′ = (c′, γ′, w′, t′) ∈ Θ by

(5.2)  ρ_Θ(θ, θ′) = max{ |c − c′|, |γ − γ′|₁, max_i |w_i − w_i′|₁, ‖t − t′‖_∞ }.
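
The metric (5.2) is a plain maximum of coordinate-wise distances and is straightforward to evaluate; the short Python sketch below computes ρ_Θ for two randomly drawn parameter tuples (the sizes and the random draws are arbitrary assumptions).

```python
import numpy as np

# Sketch: the parameter metric rho_Theta of (5.2) for two parameter tuples
# theta = (c, gamma, w, t).  Sizes and the random parameters are arbitrary assumptions.

def rho_theta(theta, theta_p):
    c, gamma, w, t = theta
    cp, gammap, wp, tp = theta_p
    return max(abs(c - cp),
               np.abs(gamma - gammap).sum(),                 # l1 distance of gamma
               np.abs(w - wp).sum(axis=1).max(),             # max_i |w_i - w'_i|_1
               np.abs(t - tp).max())                         # sup norm of t

rng = np.random.default_rng(5)
d, m = 3, 4
make = lambda: (rng.uniform(-1, 1), rng.normal(size=m),
                rng.normal(size=(m, d)), rng.uniform(-1, 1, size=m))
theta, theta_p = make(), make()
print("rho_Theta(theta, theta') =", rho_theta(theta, theta_p))
```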

Throughout the section we assume that φ satisfies the following assumption, which in particular holds for the Softplus activation function.

Assumption 5.1. φ ∈ C²(R), and φ (resp. φ′, the derivative of φ) is L-Lipschitz (resp. L′-Lipschitz) for some L, L′ > 0. Moreover, there exist positive constants φ_max and φ′_max such that

sup_{w∈Θ_w, t∈Θ_t, x∈Ω} |φ(w · x + t)| ≤ φ_max and sup_{w∈Θ_w, t∈Θ_t, x∈Ω} |φ′(w · x + t)| ≤ φ′_max.

Recall that the Rademacher complexity of a function class $\mathcal{G}$ is defined by
\[
R_n(\mathcal{G}) = \mathbb{E}_Z\mathbb{E}_\sigma\Bigl[\sup_{g\in\mathcal{G}} \Bigl|\frac{1}{n}\sum_{j=1}^n \sigma_j g(Z_j)\Bigr| \,\Big|\, Z_1,\cdots,Z_n\Bigr].
\]
In the subsequent proofs, it will be useful to work with the following modified Rademacher complexity $\widetilde{R}_n(\mathcal{G})$, defined without the absolute value sign:
\[
\widetilde{R}_n(\mathcal{G}) = \mathbb{E}_Z\mathbb{E}_\sigma\Bigl[\sup_{g\in\mathcal{G}} \frac{1}{n}\sum_{j=1}^n \sigma_j g(Z_j) \,\Big|\, Z_1,\cdots,Z_n\Bigr].
\]
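As a quick numerical illustration of these definitions, the sketch below estimates the conditional expectation over $\sigma$ by Monte Carlo for a toy class of linear functions $g_w(z) = w\cdot z$ with $|w|_1\le W$. The toy class, the sampling sizes, and the comparison value are assumptions for illustration only; for the linear class the inner supremum is explicit, since $\sup_{|w|_1\le W}|w\cdot v| = W\|v\|_\infty$.
\begin{verbatim}
import numpy as np

def empirical_rademacher_linear(Z, W, n_sigma=2000, seed=0):
    # Monte Carlo estimate of E_sigma sup_{|w|_1 <= W} |(1/n) sum_j sigma_j w.Z_j|,
    # conditioned on the sample Z of shape (n, d). The supremum over the l1-ball
    # equals W times the max-norm of (1/n) sum_j sigma_j Z_j.
    rng = np.random.default_rng(seed)
    n = Z.shape[0]
    vals = []
    for _ in range(n_sigma):
        sigma = rng.choice([-1.0, 1.0], size=n)
        v = (sigma @ Z) / n
        vals.append(W * np.max(np.abs(v)))
    return float(np.mean(vals))

rng = np.random.default_rng(1)
n, d, W = 200, 5, 1.0
Z = rng.uniform(0.0, 1.0, size=(n, d))   # sample points in Omega = [0,1]^d
est = empirical_rademacher_linear(Z, W)
print(est, W * np.sqrt(d) / np.sqrt(n))  # estimate vs. the O(W sqrt(d)/sqrt(n)) scaling
\end{verbatim}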

The lemma below bounds the Rademacher complexity of $\mathcal{F}_m$.

Lemma 5.1. Assume that the activation function $\phi$ is $L$-Lipschitz. Then
\[
R_n(\mathcal{F}_m) \le \frac{4\Gamma L(W\sqrt{d} + T) + 2\Gamma|\phi(0)|}{\sqrt{n}}.
\]

Proof. Let $\bar\phi(x) = \phi(x) - \phi(0)$. First observe that
\begin{align*}
\mathbb{E}_\sigma\Bigl[\sup_{f\in\mathcal{F}_m}\frac{1}{n}\sum_{j=1}^n \sigma_j f(Z_j)\,\Big|\,Z_1,\cdots,Z_n\Bigr]
&= \mathbb{E}_\sigma\Bigl[\sup_{\Theta}\frac{1}{n}\sum_{j=1}^n \sigma_j\Bigl(c+\sum_{i=1}^m\gamma_i\phi(w_i\cdot Z_j+t_i)\Bigr)\,\Big|\,Z_1,\cdots,Z_n\Bigr]\\
&= \mathbb{E}_\sigma\Bigl[\sup_{\Theta}\frac{1}{n}\sum_{j=1}^n \sigma_j\sum_{i=1}^m\gamma_i\phi(w_i\cdot Z_j+t_i)\,\Big|\,Z_1,\cdots,Z_n\Bigr]\\
&\le \frac{1}{n}\mathbb{E}_\sigma\Bigl[\sup_{\Theta}\sum_{i=1}^m\gamma_i\sum_{j=1}^n \sigma_j\bar\phi(w_i\cdot Z_j+t_i)\,\Big|\,Z_1,\cdots,Z_n\Bigr]
+\frac{1}{n}\mathbb{E}_\sigma\Bigl[\sup_{\Theta}\sum_{i=1}^m\gamma_i\sum_{j=1}^n \sigma_j\phi(0)\Bigr]\\
&=: J_1 + J_2.
\end{align*}
Using the fact that $\bar\phi(\cdot) = \phi(\cdot)-\phi(0)$ is $L$-Lipschitz, one has that
\begin{align*}
J_1 &\le \frac{1}{n}\sum_{i=1}^m|\gamma_i|\cdot\mathbb{E}_\sigma\Bigl[\sup_{|w|_1\le W,\,|t|\le T}\Bigl|\sum_{j=1}^n \sigma_j\bar\phi(w\cdot Z_j+t)\Bigr|\,\Big|\,Z_1,\cdots,Z_n\Bigr]\\
&\le \frac{2\Gamma L}{n}\Bigl(\mathbb{E}_\sigma\Bigl[\sup_{|w|_1\le W}\Bigl|\sum_{j=1}^n \sigma_j\, w\cdot Z_j\Bigr|\,\Big|\,Z_1,\cdots,Z_n\Bigr]
+\mathbb{E}_\sigma\Bigl[\sup_{|t|\le T}\Bigl|\sum_{j=1}^n \sigma_j t\Bigr|\Bigr]\Bigr)\\
&\le \frac{2\Gamma L}{n}\Bigl(W\cdot\mathbb{E}_\sigma\Bigl|\sum_{j=1}^n \sigma_j Z_j\Bigr| + T\,\mathbb{E}_\sigma\Bigl[\Bigl|\sum_{j=1}^n \sigma_j\Bigr|\Bigr]\Bigr)
\le \frac{2\Gamma L}{n}\Bigl(W\sqrt{\sum_{j=1}^n |Z_j|^2} + T\sqrt{\mathbb{E}_\sigma\Bigl[\sum_{j=1}^n \sigma_j^2\Bigr]}\Bigr)
\le \frac{2\Gamma L(W\sqrt{d}+T)}{\sqrt{n}}.
\end{align*}

Note that in the second inequality we have used Talagrand's contraction principle (Lemma 5.2 below). Moreover, since $\sum_{i=1}^m|\gamma_i|\le\Gamma$, it is easy to see that
\[
J_2 \le \frac{\Gamma|\phi(0)|}{n}\,\mathbb{E}_\sigma\Bigl[\Bigl|\sum_{j=1}^n \sigma_j\Bigr|\Bigr]
\le \frac{\Gamma|\phi(0)|}{n}\sqrt{\mathbb{E}_\sigma\Bigl[\sum_{j=1}^n \sigma_j^2\Bigr]} = \frac{\Gamma|\phi(0)|}{\sqrt{n}}.
\]
Combining the estimates above and then taking the expectation with respect to $\{Z_j\}$ yields that $\widetilde{R}_n(\mathcal{F}_m) \le \frac{2\Gamma L(W\sqrt{d}+T)+\Gamma|\phi(0)|}{\sqrt{n}}$. This combined with Lemma 5.3 below leads to the desired estimate.

Lemma 5.2 (Ledoux-Talagrand contraction [25, Theorem 4.12]). Assume that $\phi:\mathbb{R}\to\mathbb{R}$ is $L$-Lipschitz with $\phi(0)=0$. Let $\{\sigma_i\}_{i=1}^n$ be independent Rademacher random variables. Then for any $T\subset\mathbb{R}^n$,
\[
\mathbb{E}_\sigma\sup_{(t_1,\cdots,t_n)\in T}\Bigl|\sum_{i=1}^n \sigma_i\phi(t_i)\Bigr| \le 2L\cdot\mathbb{E}_\sigma\sup_{(t_1,\cdots,t_n)\in T}\Bigl|\sum_{i=1}^n \sigma_i t_i\Bigr|.
\]

Lemma 5.3 ([27, Lemma 1]). Assume that the set of functions $\mathcal{G}$ contains the zero function. Then
\[
R_n(\mathcal{G}) \le 2\widetilde{R}_n(\mathcal{G}).
\]

Recall the sets of two-layer neural networks $\mathcal{F}_{\mathrm{ReLU},m}(B)$ and $\mathcal{F}_{\mathrm{SP}_\tau,m}(B)$ defined by (2.12) and (2.13) respectively. Since both $\mathrm{ReLU}$ and $\mathrm{SP}_\tau$ are 1-Lipschitz, and $\mathrm{ReLU}(0)=0$, $\mathrm{SP}_\tau(0)=\frac{\ln 2}{\tau}$, the following corollary is a direct consequence of Lemma 5.1.

Corollary 5.1.
\[
R_n(\mathcal{F}_{\mathrm{ReLU},m}(B)) \le \frac{16(\sqrt{d}+1)B}{\sqrt{n}}
\quad\text{and}\quad
R_n(\mathcal{F}_{\mathrm{SP}_\tau,m}(B)) \le \frac{16\bigl(\sqrt{d}+1+\frac{2\ln 2}{\tau}\bigr)B}{\sqrt{n}}.
\]

Given the source function $f\in L^\infty(\Omega)$ and the potential $V\in L^\infty(\Omega)$, we recall the function classes associated to the Ritz losses of the Poisson equation and the static Schrödinger equation:
\[
(5.3)\qquad
\begin{aligned}
\mathcal{G}_{m,P} &:= \Bigl\{\, g:\Omega\to\mathbb{R} \ \Big|\ g = \tfrac{1}{2}|\nabla u|^2 - fu \text{ where } u\in\mathcal{F}_m \,\Bigr\},\\
\mathcal{G}_{m,S} &:= \Bigl\{\, g:\Omega\to\mathbb{R} \ \Big|\ g = \tfrac{1}{2}|\nabla u|^2 + \tfrac{1}{2}V|u|^2 - fu \text{ where } u\in\mathcal{F}_m \,\Bigr\}.
\end{aligned}
\]
In the sequel we aim to bound the Rademacher complexities of $\mathcal{G}_{m,P}$ and $\mathcal{G}_{m,S}$ defined above. This will be achieved by bounding the Rademacher complexities of the following function classes:
\[
\begin{aligned}
\mathcal{G}^1_m &:= \Bigl\{\, g:\Omega\to\mathbb{R} \ \Big|\ g = \tfrac{1}{2}|\nabla u|^2 \text{ where } u\in\mathcal{F}_m \,\Bigr\},\\
\mathcal{G}^2_m &:= \Bigl\{\, g:\Omega\to\mathbb{R} \ \Big|\ g = fu \text{ where } u\in\mathcal{F}_m \,\Bigr\},\\
\mathcal{G}^3_m &:= \Bigl\{\, g:\Omega\to\mathbb{R} \ \Big|\ g = \tfrac{1}{2}V|u|^2 \text{ where } u\in\mathcal{F}_m \,\Bigr\}.
\end{aligned}
\]


The celebrated theorem of Dudley will be used to bound the Rademacher complexity in terms of the metric entropy. To this end, let us first recall the metric entropy and Dudley's theorem below.

Let $(E,\rho)$ be a metric space with metric $\rho$. A $\delta$-cover of a set $A\subset E$ with respect to $\rho$ is a collection of points $\{x_1,\cdots,x_n\}\subset A$ such that for every $x\in A$ there exists $i\in\{1,\cdots,n\}$ such that $\rho(x,x_i)\le\delta$. The $\delta$-covering number $\mathcal{N}(\delta,A,\rho)$ is the cardinality of the smallest $\delta$-cover of the set $A$ with respect to the metric $\rho$. Equivalently, the $\delta$-covering number $\mathcal{N}(\delta,A,\rho)$ is the minimal number of balls $B_\rho(x,\delta)$ of radius $\delta$ needed to cover the set $A$.
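For intuition, the short sketch below builds an explicit $\delta$-cover of an interval $[-C,C]$, checks the covering property on random points, and counts a product cover (cf. Lemma 5.4 below). The grid construction and the chosen values of $C$ and $\delta$ are illustrative assumptions; the resulting cardinality is of order $C/\delta$, consistent with the $2C/\delta$-type upper bounds used later.
\begin{verbatim}
import numpy as np
import itertools

def cover_interval(C, delta):
    # Grid with spacing 2*delta covers [-C, C]: every point lies within delta of a grid point.
    return np.arange(-C, C + 2 * delta, 2 * delta)

def cover_product(Cs, delta):
    # A product of one-dimensional covers is a delta-cover of the product set
    # with respect to the max-metric, so its size is the product of the sizes.
    grids = [cover_interval(C, delta) for C in Cs]
    return list(itertools.product(*grids))

C, delta = 2.0, 0.1
net = cover_interval(C, delta)
x = np.random.default_rng(0).uniform(-C, C, 1000)
dist = np.min(np.abs(x[:, None] - net[None, :]), axis=1)
print(len(net), float(dist.max()) <= delta)      # cover size and covering property
print(len(cover_product([2.0, 1.0, 1.0], 0.5)))  # product cover size
\end{verbatim}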

Theorem 5.1 (Dudley's theorem). Let $\mathcal{F}$ be a function class such that $\sup_{f\in\mathcal{F}}\|f\|_\infty \le M$. Then the Rademacher complexity $R_n(\mathcal{F})$ satisfies
\[
R_n(\mathcal{F}) \le \inf_{0\le\delta\le M}\Bigl\{4\delta + \frac{12}{\sqrt{n}}\int_\delta^M \sqrt{\log\mathcal{N}(\varepsilon,\mathcal{F},\|\cdot\|_\infty)}\, d\varepsilon\Bigr\}.
\]

Note that our statement of Dudley's theorem is slightly different from the standard one, where the covering number is based on the empirical $\ell^2$-metric instead of the $L^\infty$-metric above. However, since the $L^\infty$-metric is stronger than the empirical $\ell^2$-metric and since the covering number is monotonically increasing with respect to the metric, Theorem 5.1 follows directly from the classical Dudley theorem (see e.g. [42, Theorem 1.19]).

Let us now state an elementary lemma on the covering number of product spaces.

Lemma 5.4. Let $(E_i,\rho_i)$ be metric spaces with metrics $\rho_i$ and let $A_i\subset E_i$, $i=1,\cdots,n$. Consider the product space $E = \times_{i=1}^n E_i$ equipped with the metric $\rho = \max_i \rho_i$ and the set $A = \times_{i=1}^n A_i$. Then for any $\delta>0$,
\[
(5.4)\qquad \mathcal{N}(\delta,A,\rho) \le \prod_{i=1}^n \mathcal{N}(\delta,A_i,\rho_i).
\]

Proof. It suffices to prove the lemma in the case $n=2$, i.e.,
\[
(5.5)\qquad \mathcal{N}(\delta,A_1\times A_2,\rho) \le \mathcal{N}(\delta,A_1,\rho_1)\cdot\mathcal{N}(\delta,A_2,\rho_2).
\]
Indeed, suppose that $C_1$ and $C_2$ are $\delta$-covers of $A_1$ and $A_2$ respectively. Then it is straightforward that the product set $C_1\times C_2$ is also a $\delta$-cover of $A_1\times A_2$ in the space $(E_1\times E_2,\rho)$ with $\rho = \max(\rho_1,\rho_2)$. Hence $\mathcal{N}(\delta,A_1\times A_2,\rho) \le \mathrm{card}(C_1)\cdot\mathrm{card}(C_2)$. Applying this inequality for $C_i$ with $\mathrm{card}(C_i) = \mathcal{N}(\delta,A_i,\rho_i)$, $i=1,2$, we obtain (5.5). The general inequality (5.4) follows by iterating (5.5).

As a consequence of Lemma 5.4, the following proposition gives an upper bound for the covering number $\mathcal{N}(\delta,\Theta,\rho_\Theta)$.

Proposition 5.1. Consider the metric space $(\Theta,\rho_\Theta)$ with $\rho_\Theta$ defined in (5.2). Then for any $\delta>0$, the covering number $\mathcal{N}(\delta,\Theta,\rho_\Theta)$ satisfies
\[
\mathcal{N}(\delta,\Theta,\rho_\Theta) \le \frac{2C}{\delta}\cdot\Bigl(\frac{3\Gamma}{\delta}\Bigr)^m\cdot\Bigl(\frac{3W}{\delta}\Bigr)^{dm}\cdot\Bigl(\frac{3T}{\delta}\Bigr)^m.
\]

Proof. Thanks to Lemma 5.4,
\begin{align*}
\mathcal{N}(\delta,\Theta,\rho_\Theta) &\le \mathcal{N}(\delta,\Theta_c,|\cdot|)\cdot\mathcal{N}(\delta,\Theta_\gamma,|\cdot|_1)\cdot\bigl(\mathcal{N}(\delta,B^d_1(W),|\cdot|_1)\bigr)^m\cdot\mathcal{N}(\delta,\Theta_t,|\cdot|_\infty)\\
&\le \frac{2C}{\delta}\cdot\Bigl(\frac{3\Gamma}{\delta}\Bigr)^m\cdot\Bigl(\frac{3W}{\delta}\Bigr)^{dm}\cdot\Bigl(\frac{3T}{\delta}\Bigr)^m,
\end{align*}
where in the last inequality we have used the fact that the covering number of a $d$-dimensional $\ell^p$-ball of radius $r$ satisfies
\[
\mathcal{N}(\delta,B^d_p(r),|\cdot|_p) \le \Bigl(\frac{3r}{\delta}\Bigr)^d.
\]


Bounding $R_n(\mathcal{G}^1_m)$. We would like to bound $R_n(\mathcal{G}^1_m)$ from above using metric entropy. To this end, let us first bound the covering number $\mathcal{N}(\delta,\mathcal{G}^1_m,\|\cdot\|_\infty)$. Recall the parameters $C,\Gamma,W$ and $T$ in (5.1). With those parameters fixed, and to simplify expressions, we introduce the following functions to be used in the sequel:
\begin{align*}
(5.6)\qquad M(\delta,\Lambda,m,d) &:= \frac{2C\Lambda}{\delta}\cdot\Bigl(\frac{3\Gamma\Lambda}{\delta}\Bigr)^m\cdot\Bigl(\frac{3W\Lambda}{\delta}\Bigr)^{dm}\cdot\Bigl(\frac{3T\Lambda}{\delta}\Bigr)^m,\\
(5.7)\qquad Z(M,\Lambda,d) &:= M\Bigl(\sqrt{(\log(2C\Lambda))_+} + \sqrt{\bigl(\log(3\Gamma\Lambda) + d\log(3W\Lambda) + \log(3T\Lambda)\bigr)_+}\Bigr)
+ \sqrt{d+3}\int_0^M \sqrt{(\log(1/\varepsilon))_+}\, d\varepsilon.
\end{align*}

Lemma 5.5. Let the activation function $\phi$ satisfy Assumption 5.1. Then we have
\[
(5.8)\qquad \mathcal{N}(\delta,\mathcal{G}^1_m,\|\cdot\|_\infty) \le M(\delta,\Lambda_1,m,d),
\]
where the constant $\Lambda_1$ is defined by
\[
(5.9)\qquad \Lambda_1 = \bigl((W+\Gamma)\phi'_{\max} + 2\Gamma WL'\bigr)\,\Gamma W\phi'_{\max}.
\]

Proof. Thanks to Assumption 5.1, $\sup_{\theta\in\Theta}|\phi'(w\cdot x+t)| \le \phi'_{\max}$. This implies that
\[
\max_{\theta\in\Theta}|\nabla u_\theta(x)| \le \sum_{i=1}^m|\gamma_i||w_i||\phi'(w_i\cdot x+t_i)| \le \Gamma W\phi'_{\max}.
\]
Furthermore, for $\theta,\theta'\in\Theta$, by adding and subtracting terms, we have that
\begin{align*}
|\nabla u_\theta(x)-\nabla u_{\theta'}(x)| &\le \sum_{i=1}^m|\gamma_i-\gamma'_i||w_i||\phi'(w_i\cdot x+t_i)|
+\sum_{i=1}^m|\gamma'_i||w_i-w'_i||\phi'(w_i\cdot x+t_i)|
+\sum_{i=1}^m|\gamma'_i||w'_i||\phi'(w_i\cdot x+t_i)-\phi'(w'_i\cdot x+t'_i)|\\
&\le W\phi'_{\max}|\gamma-\gamma'|_1 + \Gamma\phi'_{\max}\max_i|w_i-w'_i|_1 + \Gamma WL'\bigl(\max_i|w_i-w'_i|_1+|t-t'|_\infty\bigr)\\
&\le \bigl((W+\Gamma)\phi'_{\max}+2\Gamma WL'\bigr)\rho_\Theta(\theta,\theta').
\end{align*}
Combining the last two estimates yields that
\[
\frac{1}{2}\bigl||\nabla u_\theta(x)|^2-|\nabla u_{\theta'}(x)|^2\bigr|
\le \frac{1}{2}\bigl|\nabla u_\theta(x)+\nabla u_{\theta'}(x)\bigr|\,\bigl|\nabla u_\theta(x)-\nabla u_{\theta'}(x)\bigr|
\le \Lambda_1\rho_\Theta(\theta,\theta').
\]
This particularly implies that $\mathcal{N}(\delta,\mathcal{G}^1_m,\|\cdot\|_\infty) \le \mathcal{N}\bigl(\frac{\delta}{\Lambda_1},\Theta,\rho_\Theta\bigr)$. Then the estimate (5.8) follows from Proposition 5.1 with $\delta$ replaced by $\frac{\delta}{\Lambda_1}$.

Proposition 5.2. Assume that the activation function $\phi$ satisfies Assumption 5.1. Then
\[
R_n(\mathcal{G}^1_m) \le Z(M_1,\Lambda_1,d)\cdot\sqrt{\frac{m}{n}},
\]
where $M_1 = \frac{1}{2}\Gamma^2W^2(\phi'_{\max})^2$ and $\Lambda_1$ is defined in (5.9).

Proof. Thanks to Assumption 5.1,
\[
\sup_{g\in\mathcal{G}^1_m}\|g\|_{L^\infty(\Omega)} \le \sup_{u\in\mathcal{F}_m}\frac{1}{2}\|\nabla u\|^2_{L^\infty(\Omega)} \le \frac{\Gamma^2W^2(\phi'_{\max})^2}{2}.
\]
Then the proposition follows from Lemma 5.5, Theorem 5.1 with $\delta=0$ and $M=M_1=\frac{\Gamma^2W^2(\phi'_{\max})^2}{2}$, and the simple fact that $\sqrt{a+b}\le\sqrt{a}+\sqrt{b}$ for $a,b\ge 0$.

Bounding $R_n(\mathcal{G}^2_m)$. The next lemma provides an upper bound for $\mathcal{N}(\delta,\mathcal{G}^2_m,\|\cdot\|_\infty)$.

Lemma 5.6. Assume that $\|f\|_{L^\infty(\Omega)}\le F$ for some $F>0$, and assume that the activation function $\phi$ satisfies Assumption 5.1. Then the covering number $\mathcal{N}(\delta,\mathcal{G}^2_m,\|\cdot\|_\infty)$ satisfies
\[
\mathcal{N}(\delta,\mathcal{G}^2_m,\|\cdot\|_\infty) \le M(\delta,\Lambda_2,m,d).
\]
Here the constant $\Lambda_2$ is defined by
\[
(5.10)\qquad \Lambda_2 = F\bigl(1+\phi_{\max}+2L\Gamma\bigr).
\]

Proof. Note that a function $g_\theta\in\mathcal{G}^2_m$ has the form $g_\theta = fu_\theta$. Given $\theta=(c,\gamma,w,t),\ \theta'=(c',\gamma',w',t')\in\Theta$, we have
\[
(5.11)\qquad |u_\theta(x)-u_{\theta'}(x)| \le |c-c'| + \Bigl|\sum_{i=1}^m\gamma_i\phi(w_i\cdot x-t_i)-\sum_{i=1}^m\gamma'_i\phi(w'_i\cdot x-t'_i)\Bigr|
\le |c-c'| + \sum_{i=1}^m|\gamma_i-\gamma'_i||\phi(w_i\cdot x-t_i)| + \sum_{i=1}^m|\gamma'_i||\phi(w_i\cdot x-t_i)-\phi(w'_i\cdot x-t'_i)|.
\]
Since $\phi$ satisfies Assumption 5.1, we have that $|\phi(w_i\cdot x-t_i)|\le\phi_{\max}$ and that
\[
|\phi(w_i\cdot x-t_i)-\phi(w'_i\cdot x-t'_i)| \le L\bigl(|w_i-w'_i|_1+|t_i-t'_i|\bigr).
\]
Therefore, it follows from (5.11) that
\[
(5.12)\qquad |u_\theta(x)-u_{\theta'}(x)| \le |c-c'|+\phi_{\max}|\gamma-\gamma'|_1+L\Gamma\bigl(\max_i|w_i-w'_i|_1+|t-t'|_\infty\bigr) \le \bigl(1+\phi_{\max}+2L\Gamma\bigr)\rho_\Theta(\theta,\theta').
\]
This implies that
\[
\|g_\theta-g_{\theta'}\|_\infty \le F\bigl(1+\phi_{\max}+2L\Gamma\bigr)\rho_\Theta(\theta,\theta') = \Lambda_2\rho_\Theta(\theta,\theta').
\]
As a consequence, $\mathcal{N}(\delta,\mathcal{G}^2_m,\|\cdot\|_\infty)\le\mathcal{N}\bigl(\frac{\delta}{\Lambda_2},\Theta,\rho_\Theta\bigr)$. Then the lemma follows from Proposition 5.1 with $\delta$ replaced by $\frac{\delta}{\Lambda_2}$.

Proposition 5.3. Assume that $\|f\|_{L^\infty(\Omega)}\le F$ for some $F>0$, and assume that the activation function $\phi$ is $L$-Lipschitz. Then
\[
R_n(\mathcal{G}^2_m) \le Z(M_2,\Lambda_2,d)\cdot\sqrt{\frac{m}{n}},
\]
where $M_2 = F(C+\Gamma\phi_{\max})$ and $\Lambda_2$ is defined in (5.10).

Proof. From the definition of $\mathcal{G}^2_m$ and the assumption that $\|f\|_{L^\infty(\Omega)}\le F$, one has that $\sup_{g\in\mathcal{G}^2_m}\|g\|_{L^\infty(\Omega)} \le M_2 = F(C+\Gamma\phi_{\max})$. Then the proposition is proved by an application of Theorem 5.1 with $\delta=0$, $M=M_2$, and Lemma 5.6.

Bounding $R_n(\mathcal{G}^3_m)$. The lemma below gives an upper bound for $\mathcal{N}(\delta,\mathcal{G}^3_m,\|\cdot\|_\infty)$.

Lemma 5.7. Assume that $\|V\|_{L^\infty(\Omega)}\le V_{\max}$ for some $V_{\max}<\infty$, and assume that the activation function $\phi$ satisfies Assumption 5.1. Then the covering number $\mathcal{N}(\delta,\mathcal{G}^3_m,\|\cdot\|_\infty)$ satisfies
\[
(5.13)\qquad \mathcal{N}(\delta,\mathcal{G}^3_m,\|\cdot\|_\infty) \le M(\delta,\Lambda_3,m,d),
\]
where the constant $\Lambda_3$ is defined by
\[
(5.14)\qquad \Lambda_3 = V_{\max}(C+\Gamma\phi_{\max})\bigl(1+\phi_{\max}+2L\Gamma\bigr).
\]


Proof. By the definition of $\mathcal{F}_m$ and Assumption 5.1 on $\phi$,
\[
\sup_{u\in\mathcal{F}_m}\|u\|_{L^\infty(\Omega)} \le C+\Gamma\phi_{\max}.
\]
Moreover, recall from (5.12) that for $\theta,\theta'\in\Theta$,
\[
|u_\theta(x)-u_{\theta'}(x)| \le \bigl(1+\phi_{\max}+2L\Gamma\bigr)\rho_\Theta(\theta,\theta').
\]
Consequently,
\[
\Bigl|\frac{1}{2}V(x)u_\theta^2(x)-\frac{1}{2}V(x)u_{\theta'}^2(x)\Bigr| \le \frac{1}{2}|V(x)||u_\theta(x)+u_{\theta'}(x)||u_\theta(x)-u_{\theta'}(x)| \le \Lambda_3\rho_\Theta(\theta,\theta').
\]
The estimate (5.13) follows from the same line of arguments used in the proof of Lemma 5.6.

Proposition 5.4. Under the same assumptions as in Lemma 5.7, $\mathcal{G}^3_m$ satisfies
\[
R_n(\mathcal{G}^3_m) \le Z(M_3,\Lambda_3,d)\cdot\sqrt{\frac{m}{n}},
\]
where $M_3 = \frac{V_{\max}}{2}(C+\Gamma\phi_{\max})^2$ and $\Lambda_3$ is defined in (5.14).

Proof. Note that $\sup_{g\in\mathcal{G}^3_m}\|g\|_{L^\infty(\Omega)} \le M_3 = \frac{V_{\max}}{2}(C+\Gamma\phi_{\max})^2$. Then the proposition follows from Theorem 5.1 with $\delta=0$, $M=M_3$, and Lemma 5.7.

The following corollary is a direct consequence of Propositions 5.2-5.4.

Corollary 5.2. The two sets of functions $\mathcal{G}_{m,P}$ and $\mathcal{G}_{m,S}$ defined in (5.3) satisfy
\[
R_n(\mathcal{G}_{m,P}) \le \bigl(Z(M_1,\Lambda_1,d) + Z(M_2,\Lambda_2,d)\bigr)\cdot\sqrt{\frac{m}{n}}
\]
and
\[
R_n(\mathcal{G}_{m,S}) \le \sum_{i=1}^3 Z(M_i,\Lambda_i,d)\cdot\sqrt{\frac{m}{n}}.
\]

Considering the set of two-layer neural networks $\mathcal{F}_{\mathrm{SP}_\tau,m}(B)$ defined in (2.13) with $\tau=\sqrt{m}$, we define the following associated sets of functions:
\begin{align*}
\mathcal{G}_{\mathrm{SP}_\tau,m,P}(B) &:= \Bigl\{\,g:\Omega\to\mathbb{R}\ \Big|\ g=\tfrac{1}{2}|\nabla u|^2-fu \text{ where } u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B)\,\Bigr\},\\
\mathcal{G}_{\mathrm{SP}_\tau,m,S}(B) &:= \Bigl\{\,g:\Omega\to\mathbb{R}\ \Big|\ g=\tfrac{1}{2}|\nabla u|^2+\tfrac{1}{2}V|u|^2-fu \text{ where } u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B)\,\Bigr\},\\
\mathcal{G}^1_{\mathrm{SP}_\tau,m}(B) &:= \Bigl\{\,g:\Omega\to\mathbb{R}\ \Big|\ g=\tfrac{1}{2}|\nabla u|^2 \text{ where } u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B)\,\Bigr\},\\
\mathcal{G}^2_{\mathrm{SP}_\tau,m}(B) &:= \Bigl\{\,g:\Omega\to\mathbb{R}\ \Big|\ g=fu \text{ where } u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B)\,\Bigr\},\\
\mathcal{G}^3_{\mathrm{SP}_\tau,m}(B) &:= \Bigl\{\,g:\Omega\to\mathbb{R}\ \Big|\ g=\tfrac{1}{2}V|u|^2 \text{ where } u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B)\,\Bigr\}.
\end{align*}
Corollary 5.2 allows us to bound the Rademacher complexities of $\mathcal{G}_{\mathrm{SP}_\tau,m,P}(B)$ and $\mathcal{G}_{\mathrm{SP}_\tau,m,S}(B)$. Indeed, from the definition of the activation function $\mathrm{SP}_\tau$ we know that $\|\mathrm{SP}'_\tau\|_{L^\infty(\mathbb{R})}\le 1$ and $\|\mathrm{SP}''_\tau\|_{L^\infty(\mathbb{R})}\le\tau=\sqrt{m}$, so $\mathrm{SP}_\tau$ satisfies Assumption 5.1 with
\[
L = \phi'_{\max} = 1,\qquad L' = \tau = \sqrt{m},\qquad \phi_{\max} \le 3+\frac{1}{\sqrt{m}} \le 4.
\]
Note also that $\mathcal{F}_{\mathrm{SP}_\tau,m}(B)$ coincides with the set $\mathcal{F}_m$ defined in (5.1) with the following parameters:
\[
(5.15)\qquad C = 2B,\quad \Gamma = 4B,\quad W = 1,\quad T = 1.
\]


With the parameters above, one has that
\[
M_1 = 8B^2,\quad \Lambda_1 \le 32B^2\sqrt{m}+4B(1+4B),\qquad
M_2 \le 18FB,\quad \Lambda_2 \le F(5+8B),\qquad
M_3 \le \frac{V_{\max}}{2}(18B)^2,\quad \Lambda_3 \le 18V_{\max}B(5+8B).
\]
Inserting $M_i$ and $\Lambda_i$, $i=1,2,3$, into (5.7), one obtains by a straightforward calculation that there exist positive constants $C_1(B,d)$, $C_2(B,d,F)$ and $C_3(B,d,V_{\max})$, depending polynomially on the parameters $B,d,F,V_{\max}$, such that
\[
Z(M_1,\Lambda_1,d) \le C_1(B,d)\sqrt{\log m},\qquad
Z(M_2,\Lambda_2,d) \le C_2(B,d,F),\qquad
Z(M_3,\Lambda_3,d) \le C_3(B,d,V_{\max}).
\]
Combining the estimates above with Corollary 5.2 gives directly the Rademacher complexity bounds for $\mathcal{G}_{\mathrm{SP}_\tau,m,P}(B)$ and $\mathcal{G}_{\mathrm{SP}_\tau,m,S}(B)$, as summarized in the following theorem.

Theorem 5.2. Assume that $\|f\|_{L^\infty(\Omega)}\le F$ and $\|V\|_{L^\infty(\Omega)}\le V_{\max}$. Consider the sets $\mathcal{G}_{\mathrm{SP}_\tau,m,P}(B)$ and $\mathcal{G}_{\mathrm{SP}_\tau,m,S}(B)$ with $\tau=\sqrt{m}$. Then there exist positive constants $C_P(B,d,F)$ and $C_S(B,d,F,V_{\max})$, depending polynomially on $B,d,F,V_{\max}$, such that
\[
R_n(\mathcal{G}_{\mathrm{SP}_\tau,m,P}(B)) \le \frac{C_P(B,d,F)\sqrt{m}(\sqrt{\log m}+1)}{\sqrt{n}},
\qquad
R_n(\mathcal{G}_{\mathrm{SP}_\tau,m,S}(B)) \le \frac{C_S(B,d,F,V_{\max})\sqrt{m}(\sqrt{\log m}+1)}{\sqrt{n}}.
\]

6. Proofs of Theorem 2.3 and Theorem 2.4

With the approximation estimates for spectral Barron functions and the complexity estimates of the two-layer neural networks proved in the previous sections, we are ready to prove Theorem 2.3 and Theorem 2.4, which establish the a priori generalization error bounds of the DRM.

Proof of Theorem 2.3. Recall that $u^m_{n,P}$ is the minimizer of the empirical loss $\mathcal{E}_{n,P}$ in the set $\mathcal{F} = \mathcal{F}_{\mathrm{SP}_\tau,m}(B)$ with $\tau=\sqrt{m}$, where $B=\|u^*_P\|_{\mathcal{B}(\Omega)}$. From the definition of $\mathcal{F}_{\mathrm{SP}_\tau,m}(B)$, one can obtain that
\[
\sup_{u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B)}\|u\|_{L^\infty(\Omega)} \le 14B.
\]
Then it follows from Theorem 3.1, Theorem 5.2, Theorem 2.2 and Corollary 5.1 that
\begin{align*}
\mathbb{E}\bigl[\mathcal{E}_P(u^m_{n,P})-\mathcal{E}_P(u^*_P)\bigr]
&\le 2R_n(\mathcal{G}_{\mathrm{SP}_\tau,m,P}) + 4\sup_{u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B)}\|u\|_{L^\infty(\Omega)}\cdot R_n(\mathcal{F}_{\mathrm{SP}_\tau,m})
+\frac{1}{2}\inf_{u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B)}\|u-u^*_P\|^2_{H^1(\Omega)}\\
&\le \frac{2C_P(B,d,F)\sqrt{m}(\sqrt{\log m}+1)}{\sqrt{n}}
+\frac{4\cdot 14\cdot 16\cdot B^2\bigl(\sqrt{d}+1+\frac{\ln 2}{\sqrt{m}}\bigr)}{\sqrt{n}}
+\frac{B^2(6\log m+30)^2}{2m}\\
&\le \frac{C_1\sqrt{m}(\sqrt{\log m}+1)}{\sqrt{n}} + \frac{C_2(\log m+1)^2}{m},
\end{align*}
where the constant $C_1$ depends polynomially on $B$, $d$ and $F$, and $C_2$ depends only quadratically on $B$.

Proof of Theorem 2.4. The proof is almost identical to the proof of Theorem 2.3 and follows directly from Theorem 3.2, Theorem 5.2, Theorem 2.2 and Corollary 5.1. Hence we omit the details.


7. Solution theory of Poisson and static Schrödinger equations in spectral Barron spaces

In Theorems 2.3 and 2.4, we have established the generalization error bounds of the DRM for the Poisson equation and the static Schrödinger equation under the assumption that the exact solutions lie in the spectral Barron space $\mathcal{B}(\Omega)$. This section aims to justify this assumption by proving complexity estimates of the solutions in the spectral Barron space, as stated in Theorem 2.5 and Theorem 2.6. This can be viewed as a regularity analysis of high dimensional PDEs in the spectral Barron space.

7.1. Proof of Theorem 2.5. Suppose that $f = \sum_{k\in\mathbb{N}^d_0}f_k\Phi_k$ and that $f$ has vanishing mean value on $\Omega$, so that $f_0 = 0$. Let $u_k$ be the cosine coefficients of the solution $u^*_P$ of the Neumann problem for the Poisson equation. By testing both sides of the Poisson equation against $\Phi_k$ and taking account of the Neumann boundary condition, one obtains that
\[
u_0 = 0,\qquad u_k = -\frac{1}{\pi^2|k|^2}f_k.
\]
As a result,
\begin{align*}
\|u^*_P\|_{\mathcal{B}^{s+2}(\Omega)} &= \sum_{k\in\mathbb{N}^d_0\setminus\{0\}}(1+\pi^{s+2}|k|_1^{s+2})|u_k|
= \sum_{k\in\mathbb{N}^d_0\setminus\{0\}}\frac{1+\pi^{s+2}|k|_1^{s+2}}{\pi^2|k|^2}|f_k|\\
&\le d\sum_{k\in\mathbb{N}^d_0\setminus\{0\}}(1+\pi^s|k|_1^s)|f_k| = d\,\|f\|_{\mathcal{B}^s(\Omega)},
\end{align*}
where we have used $|k|_1^2\le d|k|^2$ in the inequality above. This finishes the proof.
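As a sanity check of the coefficient relation and of the resulting bound $\|u^*_P\|_{\mathcal{B}^{s+2}(\Omega)} \le d\,\|f\|_{\mathcal{B}^s(\Omega)}$, the sketch below works in dimension $d=1$ with a truncated cosine expansion. The truncation level, the choice of $f$, and the sign convention for the coefficient formula are assumptions of this illustration only.
\begin{verbatim}
import numpy as np

# d = 1 on Omega = [0,1] with cosine basis Phi_k(x) = cos(pi k x).
# For the Neumann Poisson problem with mean-zero f, the coefficients satisfy
# |u_k| = |f_k| / (pi^2 k^2) for k >= 1.

K = 200                              # truncation level (assumption)
k = np.arange(1, K + 1)
f_coeff = 1.0 / k**3                 # a decaying, mean-zero test datum (assumption)
u_coeff = f_coeff / (np.pi**2 * k**2)

def barron_norm(coeff, s):
    # Spectral Barron norm: sum_k (1 + pi^s |k|_1^s) |c_k| over the nonzero modes.
    return np.sum((1.0 + np.pi**s * k.astype(float)**s) * np.abs(coeff))

s = 0
lhs = barron_norm(u_coeff, s + 2)    # ||u||_{B^{s+2}}
rhs = 1 * barron_norm(f_coeff, s)    # d * ||f||_{B^s} with d = 1
print(lhs, rhs, lhs <= rhs)          # the bound of Theorem 2.5 holds on this example
\end{verbatim}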

7.2. Proof of Theorem 2.6. First, under the assumptions of Theorem 2.6, there exists a unique solution $u\in H^1(\Omega)$ to (2.2). Moreover,
\[
(7.1)\qquad \|\nabla u\|^2_{L^2(\Omega)} + V_{\min}\|u\|^2_{L^2(\Omega)} \le \|f\|_{L^2(\Omega)}\|u\|_{L^2(\Omega)}.
\]
Our goal is to show that $u\in\mathcal{B}^{s+2}(\Omega)$. To this end, let us first derive an operator equation that is equivalent to the original Schrödinger problem (2.2). Multiplying both sides of the static Schrödinger equation by $\Phi_k$ and integrating yields the following equivalent linear system for the coefficients $u_k$:
\[
(7.2)\qquad -\pi^2|k|^2u_k + (Vu)_k = f_k,\qquad k\in\mathbb{N}^d_0.
\]
Let us first consider (7.2) with $k=0$. Thanks to Corollary B.1,
\[
(Vu)_0 = \frac{1}{\beta_0}\Bigl(\sum_{m\in\mathbb{Z}^d}\beta^2_mu_{|m|}V_{|m|}\Bigr) = u_0V_0 + \sum_{m\in\mathbb{Z}^d\setminus\{0\}}\beta^2_mu_{|m|}V_{|m|},
\]
where we have also used the fact that $\beta_0 = 1$. Consequently, equation (7.2) with $k=0$ becomes
\[
u_0V_0 + \sum_{m\in\mathbb{Z}^d\setminus\{0\}}\beta^2_mu_{|m|}V_{|m|} = f_0.
\]

For $k\ne 0$, using again Corollary B.1, equation (7.2) can be written as
\[
-\pi^2|k|^2u_k + \frac{1}{\beta_k}\Bigl(\sum_{m\in\mathbb{Z}^d}\beta_mu_{|m|}\beta_{m-k}V_{|m-k|}\Bigr) = f_k,\qquad k\in\mathbb{N}^d_0\setminus\{0\}.
\]
Recall that $u\in\mathcal{B}^s(\Omega)$ if and only if the coefficient sequence $\mathbf{u} = (u_k)_{k\in\mathbb{N}^d_0}$ belongs to the weighted $\ell^1$ space $\ell^1_{W_s}(\mathbb{N}^d_0)$ with the weight $W_s(k) = 1+\pi^s|k|_1^s$. We would like to rewrite the above equations as an operator equation on the space $\ell^1_{W_s}(\mathbb{N}^d_0)$. To do so, let us define some useful operators. Define the operator $\mathbf{M}:\mathbf{u}\mapsto\mathbf{M}\mathbf{u}$ by
\[
(\mathbf{M}\mathbf{u})_k =
\begin{cases}
V_0u_0 & \text{if } k = 0,\\
-\pi^2|k|^2u_k & \text{otherwise.}
\end{cases}
\]
Define the operator $\mathbf{V}:\mathbf{u}\mapsto\mathbf{V}\mathbf{u}$ by
\[
(\mathbf{V}\mathbf{u})_k =
\begin{cases}
\displaystyle\sum_{m\in\mathbb{Z}^d\setminus\{0\}}\beta^2_mu_{|m|}V_{|m|} & \text{if } k = 0,\\[2mm]
\displaystyle\frac{1}{\beta_k}\Bigl(\sum_{m\in\mathbb{Z}^d}\beta_mu_{|m|}\beta_{m-k}V_{|m-k|}\Bigr) & \text{otherwise.}
\end{cases}
\]
With those operators, the system (7.2) can be reformulated as the operator equation
\[
(7.3)\qquad (\mathbf{M}+\mathbf{V})\mathbf{u} = \mathbf{f}.
\]
Since $V(x)\ge V_{\min}>0$ for every $x$, we have $V_0>0$. As a direct consequence, the diagonal operator $\mathbf{M}$ is invertible. Therefore the operator equation (7.3) is equivalent to
\[
(7.4)\qquad (\mathbf{I}+\mathbf{M}^{-1}\mathbf{V})\mathbf{u} = \mathbf{M}^{-1}\mathbf{f}.
\]
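To see the coefficient system behind (7.3) in a computable setting, the sketch below assembles a truncated cosine-Galerkin system in dimension $d=1$ and solves for the cosine coefficients of $u$. The truncation level, the specific choices of $V$ and $f$, the sign convention $-u''+Vu=f$, and the finite-difference residual check are assumptions of this illustration rather than part of the proof.
\begin{verbatim}
import numpy as np

# Truncated cosine-Galerkin discretization of -u'' + V u = f on [0,1] with Neumann
# boundary conditions, using Phi_k(x) = cos(pi k x), k = 0,...,K: a finite-dimensional
# stand-in for the operator equation (M + V)u = f.

K = 32                                    # truncation level (assumption)
x = np.linspace(0.0, 1.0, 4001)           # quadrature grid (assumption)
V = 1.0 + 0.5 * np.cos(np.pi * x)         # a positive potential (assumption)
f = np.cos(np.pi * x)                     # source term (assumption)

ks = np.arange(K + 1)
Phi = np.cos(np.pi * np.outer(ks, x))                     # (K+1, len(x))
dPhi = -np.pi * ks[:, None] * np.sin(np.pi * np.outer(ks, x))

def quad(vals):
    # composite trapezoidal rule on [0,1]
    return np.trapz(vals, x)

# Galerkin matrix A[k,l] = int Phi_l' Phi_k' + V Phi_l Phi_k dx, right-hand side b[k] = int f Phi_k dx.
A = np.array([[quad(dPhi[l] * dPhi[k] + V * Phi[l] * Phi[k]) for l in ks] for k in ks])
b = np.array([quad(f * Phi[k]) for k in ks])
coeff = np.linalg.solve(A, b)
u_h = coeff @ Phi

# Approximate strong-form residual of -u_h'' + V u_h - f via finite differences;
# it is small but nonzero due to truncation and differencing error.
residual = quad((-np.gradient(np.gradient(u_h, x), x) + V * u_h - f) ** 2) ** 0.5
print(coeff[:3], residual)
\end{verbatim}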

In order to show that $u\in\mathcal{B}^{s+2}(\Omega)$, it suffices to show that the equation (7.3), or equivalently (7.4), has a unique solution $\mathbf{u}\in\ell^1_{W_s}(\mathbb{N}^d_0)$. Indeed, if $\mathbf{u}\in\ell^1_{W_s}(\mathbb{N}^d_0)$, then it follows from (7.3) and the boundedness of $\mathbf{V}$ on $\ell^1_{W_s}(\mathbb{N}^d_0)$ (see (7.8) in the proof of Lemma 7.1 below) that
\[
(7.5)\qquad \|\mathbf{M}\mathbf{u}\|_{\ell^1_{W_s}(\mathbb{N}^d_0)} \le \|\mathbf{V}\mathbf{u}\|_{\ell^1_{W_s}(\mathbb{N}^d_0)} + \|\mathbf{f}\|_{\ell^1_{W_s}(\mathbb{N}^d_0)} \le C(d,V)\|\mathbf{u}\|_{\ell^1_{W_s}(\mathbb{N}^d_0)} + \|\mathbf{f}\|_{\ell^1_{W_s}(\mathbb{N}^d_0)}.
\]
Moreover, this combined with the positivity of $V_0$ implies that
\begin{align*}
(7.6)\qquad \|u\|_{\mathcal{B}^{s+2}(\Omega)} &= \sum_{k\in\mathbb{N}^d_0}(1+\pi^{s+2}|k|_1^{s+2})|u_k|
= \frac{1}{V_0}\cdot V_0|u_0| + \sum_{k\in\mathbb{N}^d_0\setminus\{0\}}\frac{1+\pi^{s+2}|k|_1^{s+2}}{\pi^2|k|^2}\cdot\pi^2|k|^2|u_k|\\
&\le \max\Bigl\{\frac{1}{V_0},\ \frac{1}{\pi^2}+d\Bigr\}\|\mathbf{M}\mathbf{u}\|_{\ell^1_{W_s}(\mathbb{N}^d_0)}
\le C_1(d,V)\bigl(\|\mathbf{u}\|_{\ell^1_{W_s}(\mathbb{N}^d_0)} + \|\mathbf{f}\|_{\ell^1_{W_s}(\mathbb{N}^d_0)}\bigr)
\end{align*}
for some $C_1(d,V)>0$.

Next, we claim that equation (7.4) has a unique solution $\mathbf{u}\in\ell^1_{W_s}(\mathbb{N}^d_0)$ and that there exists a constant $C_2>0$ such that
\[
(7.7)\qquad \|\mathbf{u}\|_{\ell^1_{W_s}(\mathbb{N}^d_0)} \le C_2\|\mathbf{f}\|_{\ell^1_{W_s}(\mathbb{N}^d_0)}.
\]

To see this, observe that owing to the compactness of $\mathbf{M}^{-1}\mathbf{V}$ shown in Lemma 7.1, the operator $\mathbf{I}+\mathbf{M}^{-1}\mathbf{V}$ is a Fredholm operator on $\ell^1_{W_s}(\mathbb{N}^d_0)$. By the celebrated Fredholm alternative (see e.g. [14] and [7, VII 10.7]), the operator $\mathbf{I}+\mathbf{M}^{-1}\mathbf{V}$ has a bounded inverse $(\mathbf{I}+\mathbf{M}^{-1}\mathbf{V})^{-1}$ if and only if $(\mathbf{I}+\mathbf{M}^{-1}\mathbf{V})\mathbf{u} = 0$ has only the trivial solution. Therefore, to obtain the bound (7.7), it suffices to show that $(\mathbf{I}+\mathbf{M}^{-1}\mathbf{V})\mathbf{u} = 0$ implies $\mathbf{u} = 0$. By the equivalence between the Schrödinger problem (2.2) and (7.4), we only need to show that the only solution of (2.2) with $f = 0$ is zero, and the latter is a direct consequence of (7.1). This finishes the proof that the Schrödinger problem (2.2) has a unique solution in $\mathcal{B}(\Omega)$. Finally, the stability estimate (2.16) follows by combining (7.6) and (7.7).

Lemma 7.1. Assume that $V\in\mathcal{B}^s(\Omega)$ with $V(x)\ge V_{\min}>0$ for every $x\in\Omega$. Then the operator $\mathbf{M}^{-1}\mathbf{V}$ is compact on $\ell^1_{W_s}(\mathbb{N}^d_0)$.


Proof. Since $\mathbf{M}^{-1}$ is a multiplication operator on $\ell^1_{W_s}(\mathbb{N}^d_0)$ with diagonal entries converging to zero, it follows from Lemma 7.2 that $\mathbf{M}^{-1}$ is compact on $\ell^1_{W_s}(\mathbb{N}^d_0)$. Therefore, to show the compactness of $\mathbf{M}^{-1}\mathbf{V}$, it is sufficient to show that the operator $\mathbf{V}$ is bounded on $\ell^1_{W_s}(\mathbb{N}^d_0)$. To see this, note that by definition $\beta_k = 2^{1_{k\ne 0}-\sum_{i=1}^d 1_{k_i\ne 0}}\in[2^{1-d},2]$. In addition, since $V\in\mathcal{B}^s(\Omega)\subset\mathcal{B}^0(\Omega)$, using Corollary B.1, one has that
\begin{align*}
(7.8)\qquad \|\mathbf{V}\mathbf{u}\|_{\ell^1_{W_s}(\mathbb{N}^d_0)} &= \Bigl|\sum_{m\in\mathbb{Z}^d\setminus\{0\}}\beta^2_mu_{|m|}V_{|m|}\Bigr| + \sum_{k\in\mathbb{N}^d_0\setminus\{0\}}\frac{1}{\beta_k}\Bigl|\sum_{m\in\mathbb{Z}^d}\beta_mu_{|m|}\beta_{m-k}V_{|m-k|}\Bigr|(1+\pi^s|k|^s_1)\\
&\le 4\sum_{m\in\mathbb{Z}^d\setminus\{0\}}|u_{|m|}|\sum_{m\in\mathbb{Z}^d\setminus\{0\}}|V_{|m|}| + 2^{d+1}\sum_{m\in\mathbb{Z}^d}\sum_{k\in\mathbb{N}^d_0\setminus\{0\}}|u_{|m|}||V_{|m-k|}|\bigl(1+\pi^sC_s(|m-k|^s_1+|m|^s_1)\bigr)\\
&\le 2^{2d+2}\|\mathbf{u}\|_{\ell^1(\mathbb{N}^d_0)}\|V\|_{\ell^1(\mathbb{N}^d_0)} + 2^{2d+1}\max(1,C_s)\cdot\bigl(\|\mathbf{u}\|_{\ell^1(\mathbb{N}^d_0)}\|V\|_{\ell^1_{W_s}(\mathbb{N}^d_0)}+\|\mathbf{u}\|_{\ell^1_{W_s}(\mathbb{N}^d_0)}\|V\|_{\ell^1(\mathbb{N}^d_0)}\bigr)\\
&\le 2^{2d+3}\max(1,C_s)\cdot\|V\|_{\ell^1_{W_s}(\mathbb{N}^d_0)}\|\mathbf{u}\|_{\ell^1_{W_s}(\mathbb{N}^d_0)} = 2^{2d+3}\max(1,C_s)\cdot\|V\|_{\mathcal{B}^s(\Omega)}\|\mathbf{u}\|_{\ell^1_{W_s}(\mathbb{N}^d_0)},
\end{align*}
where in the first inequality above we used the elementary inequality $|a+b|^s\le C_s(|a|^s+|b|^s)$ for some constant $C_s>0$, and in the second inequality we used the fact that $\sum_{m\in\mathbb{Z}^d}|u_{|m|}|\le 2^d\|\mathbf{u}\|_{\ell^1(\mathbb{N}^d_0)}\le 2^d\|\mathbf{u}\|_{\ell^1_{W_s}(\mathbb{N}^d_0)}$.

Lemma 7.2. Suppose that $\mathbf{T}$ is a multiplication operator on $\ell^1_{W_s}(\mathbb{N}^d_0)$ defined, for $\mathbf{u}=(u_k)_{k\in\mathbb{N}^d_0}$, by $(\mathbf{T}\mathbf{u})_k = \lambda_ku_k$ with $\lambda_k\to 0$ as $\|k\|_2\to\infty$. Then $\mathbf{T}:\ell^1_{W_s}(\mathbb{N}^d_0)\to\ell^1_{W_s}(\mathbb{N}^d_0)$ is compact.

Proof. It suffices to show that the image of the unit ball in $\ell^1_{W_s}(\mathbb{N}^d_0)$ under the map $\mathbf{T}$ is totally bounded. To this end, given any fixed $\varepsilon>0$, let $K_0\in\mathbb{N}$ be such that $|\lambda_k|\le\varepsilon$ if $\|k\|_2>K_0$. Denote $I_0 := \{k\in\mathbb{N}^d_0:\|k\|_2\le K_0\}$ and let $d_0$ be the cardinality of the index set $I_0$. Note that the ball in $\mathbb{R}^{d_0}$ of radius $\max\{|\lambda_k|:k\in I_0\}$ with respect to the weighted 1-norm $\|v\|_{\ell^1_{W_s}} = \sum_{k\in I_0}|v_k|W_s(k)$ is precompact, so it can be covered by the union of $n$ $\varepsilon$-balls with centers $v_1,\cdots,v_n$, where $v_i\in\mathbb{R}^{d_0}$. We now claim that the image of the unit ball in $\ell^1_{W_s}(\mathbb{N}^d_0)$ under $\mathbf{T}$ is covered by $n$ $2\varepsilon$-balls with centers $(v_1,\mathbf{0}),\cdots,(v_n,\mathbf{0})$. In fact, for $\mathbf{u}\in\ell^1_{W_s}(\mathbb{N}^d_0)$ with $\sum_{k\in\mathbb{N}^d_0}|u_k|W_s(k)\le 1$, one has
\[
\mathbf{T}\mathbf{u} = \bigl((\lambda_ku_k)_{k\in I_0},\mathbf{0}\bigr) + \bigl(\mathbf{0},(\lambda_ku_k)_{k\notin I_0}\bigr).
\]
Suppose that $v_{i^*}$ is the closest center among $v_1,\cdots,v_n$ to the vector $(\lambda_ku_k)_{k\in I_0}$. Then
\begin{align*}
\|\mathbf{T}\mathbf{u}-(v_{i^*},\mathbf{0})\|_{\ell^1_{W_s}(\mathbb{N}^d_0)} &= \sum_{k\in I_0}|(v_{i^*})_k-\lambda_ku_k|W_s(k) + \bigl\|\bigl(\mathbf{0},(\lambda_ku_k)_{k\notin I_0}\bigr)\bigr\|_{\ell^1_{W_s}(\mathbb{N}^d_0)}\\
&\le \varepsilon + \varepsilon\bigl\|\bigl(\mathbf{0},(u_k)_{k\notin I_0}\bigr)\bigr\|_{\ell^1_{W_s}(\mathbb{N}^d_0)} \le 2\varepsilon.
\end{align*}
This finishes the proof.

Appendix A. Proof of Proposition 2.1

A.1. Proof of Proposition 2.1-(i). First, it is well known that the problem (2.1) has a unique weak solution $u^*_P\in H^1_\diamond(\Omega) := \{u\in H^1(\Omega):\int_\Omega u\,dx = 0\}$, i.e.
\[
(A.1)\qquad a(u,v) := \int_\Omega\nabla u\cdot\nabla v\,dx = F(v) := \int_\Omega fv\,dx \quad\text{for every } v\in H^1_\diamond(\Omega).
\]
Moreover, the solution $u^*_P$ satisfies
\[
u^*_P = \operatorname*{arg\,min}_{u\in H^1_\diamond(\Omega)}\ \frac{1}{2}\int_\Omega|\nabla u|^2dx - \int_\Omega fu\,dx.
\]


Due to the mean-zero constraint in the space $H^1_\diamond(\Omega)$, the variational formulation above is inconvenient to adopt as the loss function for training a neural network solution. To tackle this issue, we consider instead the following modified Poisson problem:
\[
(A.2)\qquad -\Delta u + \lambda\int_\Omega u\,dx = f \ \text{ on } \Omega,\qquad \partial_\nu u = 0 \ \text{ on } \partial\Omega.
\]
Here $\lambda>0$ is a fixed constant. By the Lax-Milgram theorem, the problem (A.2) has a unique weak solution $u^*_{\lambda,P}$, which solves
\[
(A.3)\qquad a_\lambda(u^*_{\lambda,P},v) := \int_\Omega\nabla u^*_{\lambda,P}\cdot\nabla v\,dx + \lambda\int_\Omega u^*_{\lambda,P}\,dx\int_\Omega v\,dx = F(v) \quad\text{for every } v\in H^1(\Omega).
\]

It is clear that $u^*_{\lambda,P}$ is the solution of the variational problem
\[
(A.4)\qquad \operatorname*{arg\,min}_{u\in H^1(\Omega)}\ \frac{1}{2}\int_\Omega|\nabla u|^2dx + \frac{\lambda}{2}\Bigl(\int_\Omega u\,dx\Bigr)^2 - \int_\Omega fu\,dx.
\]
Furthermore, the lemma below shows that the weak solution of (A.2) is independent of $\lambda$ and coincides with $u^*_P$.

Lemma A.1. Assume that $\lambda>0$. Let $u^*_P$ and $u^*_{\lambda,P}$ be the weak solutions of (2.1) and (A.2) respectively, with $f\in L^2(\Omega)$ satisfying $\int_\Omega f\,dx = 0$. Then we have that $u^*_{\lambda,P} = u^*_P$.

Proof. We only need to show that $u^*_{\lambda,P}$ satisfies the weak formulation (A.1). In fact, since $u^*_{\lambda,P}$ satisfies (A.3), by setting $v = 1$ we obtain that
\[
\lambda\int_\Omega u^*_{\lambda,P}\,dx = \int_\Omega f\,dx = 0.
\]
This immediately implies that $a_\lambda(u^*_{\lambda,P},v) = a(u^*_{\lambda,P},v)$ and hence $u^*_{\lambda,P}$ satisfies (A.1).

Since the solution to (A.2) is independent of $\lambda>0$, for simplicity we set $\lambda = 1$ in (A.4), and this proves (2.3), i.e.
\[
(A.5)\qquad u^*_P = \operatorname*{arg\,min}_{u\in H^1(\Omega)}\mathcal{E}_P(u) = \operatorname*{arg\,min}_{u\in H^1(\Omega)}\ \frac{1}{2}\int_\Omega|\nabla u|^2dx - \int_\Omega fu\,dx + \frac{1}{2}\Bigl(\int_\Omega u\,dx\Bigr)^2.
\]
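In practice the DRM replaces the integrals in the population loss (A.5) by Monte Carlo averages over uniform samples from $\Omega$. The sketch below evaluates one such empirical estimator for a two-layer softplus network; the network sizes, the sample size, the particular forcing term, and the closed-form gradient formula are illustrative assumptions and not the training procedure analyzed in this paper.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
d, m, tau = 5, 16, 4.0                    # dimension, width, softplus parameter (assumptions)

# Parameters of a two-layer network u(x) = c + sum_i gamma_i SP_tau(w_i . x + t_i).
c = 0.0
gamma = rng.uniform(-0.5, 0.5, m)
w = rng.uniform(-1.0, 1.0, (m, d))
t = rng.uniform(-1.0, 1.0, m)

def f(x):
    # A mean-zero forcing term on Omega = [0,1]^d (assumption for the illustration).
    return np.cos(np.pi * x[:, 0])

def u_and_grad(x):
    # x: (n, d). Returns u(x) of shape (n,) and grad_x u(x) of shape (n, d).
    z = x @ w.T + t                                  # (n, m)
    sp = np.log1p(np.exp(tau * z)) / tau             # SP_tau(z)
    sig = 1.0 / (1.0 + np.exp(-tau * z))             # SP_tau'(z) = sigmoid(tau z)
    u = c + sp @ gamma
    grad = (sig * gamma) @ w                         # sum_i gamma_i sigmoid(tau z_i) w_i
    return u, grad

def empirical_ritz_loss(n):
    # Monte Carlo estimate of E_P(u) = 1/2 int |grad u|^2 dx - int f u dx + 1/2 (int u dx)^2,
    # with all integrals taken over the uniform measure on [0,1]^d.
    x = rng.uniform(0.0, 1.0, (n, d))
    u, grad = u_and_grad(x)
    return 0.5 * np.mean(np.sum(grad**2, axis=1)) - np.mean(f(x) * u) + 0.5 * np.mean(u) ** 2

print(empirical_ritz_loss(10_000))
\end{verbatim}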

Finally, we prove that $u^*_P$ satisfies the estimate (2.4). To see this, we first state a useful lemma which computes the energy excess $\mathcal{E}_P(u)-\mathcal{E}_P(u^*_P)$ for any $u\in H^1(\Omega)$.

Lemma A.2. Let $u^*_P$ be the minimizer of $\mathcal{E}_P$, or equivalently the weak solution of the Poisson problem (A.2). Then for any $u\in H^1(\Omega)$, it holds that
\[
\mathcal{E}_P(u)-\mathcal{E}_P(u^*_P) = \frac{1}{2}\int_\Omega|\nabla u-\nabla u^*_P|^2dx + \frac{1}{2}\Bigl(\int_\Omega(u^*_P-u)\,dx\Bigr)^2.
\]

Proof. It follows from Green's formula and the fact that $u^*_P\in H^1_\diamond(\Omega)$ that
\[
\mathcal{E}_P(u^*_P) = \int_\Omega\frac{1}{2}|\nabla u^*_P|^2 - fu^*_P\,dx + \underbrace{\frac{1}{2}\Bigl(\int_\Omega u^*_P\,dx\Bigr)^2}_{=0}
= \int_\Omega\frac{1}{2}|\nabla u^*_P|^2 + \Delta u^*_P\,u^*_P\,dx
= -\frac{1}{2}\int_\Omega|\nabla u^*_P|^2dx.
\]


Then for any $u\in H^1(\Omega)$, applying Green's formula again yields
\begin{align*}
\mathcal{E}_P(u)-\mathcal{E}_P(u^*_P) &= \frac{1}{2}\int_\Omega|\nabla u|^2dx - \int_\Omega fu\,dx + \frac{1}{2}\Bigl(\int_\Omega u\,dx\Bigr)^2 + \frac{1}{2}\int_\Omega|\nabla u^*_P|^2dx\\
&= \frac{1}{2}\int_\Omega|\nabla u|^2dx + \int_\Omega\Delta u^*_P\,u\,dx + \frac{1}{2}\Bigl(\int_\Omega u\,dx\Bigr)^2 + \frac{1}{2}\int_\Omega|\nabla u^*_P|^2dx\\
&= \frac{1}{2}\int_\Omega|\nabla u-\nabla u^*_P|^2dx + \frac{1}{2}\Bigl(\int_\Omega(u^*_P-u)\,dx\Bigr)^2.
\end{align*}

Now recall that $C_P>0$ is the Poincaré constant such that for any $v\in H^1(\Omega)$,
\[
\Bigl\|v-\int_\Omega v\,dx\Bigr\|^2_{L^2(\Omega)} \le C_P\|\nabla v\|^2_{L^2(\Omega)}.
\]
As a result,
\begin{align*}
\|v\|^2_{H^1(\Omega)} &= \|\nabla v\|^2_{L^2(\Omega)} + \|v\|^2_{L^2(\Omega)}
\le \|\nabla v\|^2_{L^2(\Omega)} + 2\Bigl\|v-\int_\Omega v\,dx\Bigr\|^2_{L^2(\Omega)} + 2\Bigl|\int_\Omega v\,dx\Bigr|^2\\
&\le (2C_P+1)\|\nabla v\|^2_{L^2(\Omega)} + 2\Bigl|\int_\Omega v\,dx\Bigr|^2.
\end{align*}
Therefore, an application of the last inequality with $v = u-u^*_P$ together with Lemma A.2 yields that
\[
\|u-u^*_P\|^2_{H^1(\Omega)} \le 2\max\{2C_P+1,\,2\}\bigl(\mathcal{E}_P(u)-\mathcal{E}_P(u^*_P)\bigr).
\]
On the other hand, it follows from Lemma A.2 that
\[
\mathcal{E}_P(u)-\mathcal{E}_P(u^*_P) \le \frac{1}{2}\|u-u^*_P\|^2_{H^1(\Omega)}.
\]
Combining the last two estimates leads to (2.4) and hence finishes the proof of Proposition 2.1-(i).

A.2. Proof of Proposition 2.1-(ii). First, the standard Lax-Milgram theorem implies that the static Schrödinger equation has a unique weak solution $u^*_S$. Moreover, it is not hard to verify that $u^*_S$ solves the equivalent variational problem (2.5), i.e.
\[
u^*_S = \operatorname*{arg\,min}_{u\in H^1(\Omega)}\mathcal{E}_S(u) = \operatorname*{arg\,min}_{u\in H^1(\Omega)}\ \frac{1}{2}\int_\Omega|\nabla u|^2 + V|u|^2\,dx - \int_\Omega fu\,dx.
\]
Finally, we prove that $u^*_S$ satisfies the estimate (2.6). For this, we first claim that for any $u\in H^1(\Omega)$,
\[
(A.6)\qquad \mathcal{E}_S(u)-\mathcal{E}_S(u^*_S) = \frac{1}{2}\int_\Omega|\nabla u-\nabla u^*_S|^2dx + \frac{1}{2}\int_\Omega V(u^*_S-u)^2dx.
\]
In fact, using Green's formula, one has that
\begin{align*}
\mathcal{E}_S(u^*_S) &= \int_\Omega\frac{1}{2}|\nabla u^*_S|^2 + \frac{1}{2}V|u^*_S|^2 - fu^*_S\,dx
= \int_\Omega\frac{1}{2}|\nabla u^*_S|^2 + \frac{1}{2}V|u^*_S|^2 + (\Delta u^*_S - Vu^*_S)u^*_S\,dx\\
&= -\frac{1}{2}\int_\Omega|\nabla u^*_S|^2 + V|u^*_S|^2\,dx.
\end{align*}
Then for any $u\in H^1(\Omega)$, applying Green's formula again yields
\begin{align*}
\mathcal{E}_S(u)-\mathcal{E}_S(u^*_S) &= \frac{1}{2}\int_\Omega|\nabla u|^2 + V|u|^2\,dx - \int_\Omega fu\,dx + \frac{1}{2}\int_\Omega|\nabla u^*_S|^2 + V|u^*_S|^2\,dx\\
&= \frac{1}{2}\int_\Omega|\nabla u|^2 + V|u|^2\,dx + \int_\Omega(\Delta u^*_S - Vu^*_S)u\,dx + \frac{1}{2}\int_\Omega|\nabla u^*_S|^2 + V|u^*_S|^2\,dx\\
&= \frac{1}{2}\int_\Omega|\nabla u-\nabla u^*_S|^2dx + \frac{1}{2}\int_\Omega V(u^*_S-u)^2dx.
\end{align*}
The estimate (2.6) follows directly from the identity (A.6) and the assumption that $0<V_{\min}\le V(x)\le V_{\max}$. This completes the proof.

Appendix B. Some useful facts on cosine series and convolution

Assume that $u\in L^1(\Omega)$ admits the cosine series expansion
\[
u(x) = \sum_{k\in\mathbb{N}^d_0}u_k\Phi_k(x),
\]
where $\{u_k\}_{k\in\mathbb{N}^d_0}$ are the cosine expansion coefficients, i.e.
\[
(B.1)\qquad u_k = \frac{\int_\Omega u(x)\Phi_k(x)\,dx}{\int_\Omega\Phi_k^2(x)\,dx} = \frac{\int_\Omega u(x)\Phi_k(x)\,dx}{2^{-\sum_{i=1}^d 1_{k_i\ne 0}}}.
\]

Let $\Omega_e := [-1,1]^d$ and define the even extension $u_e$ of a function $u$ by
\[
u_e(x) = u_e(x_1,\cdots,x_d) = u(|x_1|,\cdots,|x_d|),\qquad x\in\Omega_e.
\]
Let $\hat{u}_k$ denote the Fourier coefficients of $u_e$. Since $u_e$ is real and even, one has that
\[
u_e = \sum_{k\in\mathbb{Z}^d}\hat{u}_k\cos(\pi k\cdot x),
\]
where
\[
(B.2)\qquad \hat{u}_k = \frac{\int_{\Omega_e}u_e(x)\cos(\pi k\cdot x)\,dx}{\int_{\Omega_e}\cos^2(\pi k\cdot x)\,dx} = \frac{1}{2^{d-1_{k\ne 0}}}\int_{\Omega_e}u_e(x)\cos(\pi k\cdot x)\,dx.
\]
By abuse of notation, we use $|k|$ to stand for the vector $(|k_1|,|k_2|,\cdots,|k_d|)$.

Lemma B.1. For every $k\in\mathbb{Z}^d$, it holds that $\hat{u}_k = \beta_ku_{|k|}$, where $\beta_k = 2^{1_{k\ne 0}-\sum_{i=1}^d 1_{k_i\ne 0}}$.

Proof. First, thanks to Lemma 4.2 and the evenness of the cosine function,
\begin{align*}
\int_{\Omega_e}u_e(x)\cos(\pi k\cdot x)\,dx
&= \int_{\Omega_e}u_e(x)\cos\Bigl(\pi\sum_{i=1}^{d-1}k_ix_i\Bigr)\cos(\pi k_dx_d)\,dx
- \underbrace{\int_{\Omega_e}u_e(x)\sin\Bigl(\pi\sum_{i=1}^{d-1}k_ix_i\Bigr)\sin(\pi k_dx_d)\,dx}_{=0}\\
&= \int_{\Omega_e}u_e(x)\cos\Bigl(\pi\sum_{i=1}^{d-2}k_ix_i\Bigr)\cos(\pi k_{d-1}x_{d-1})\cos(\pi k_dx_d)\,dx\\
&\qquad - \underbrace{\int_{\Omega_e}u_e(x)\sin\Bigl(\pi\sum_{i=1}^{d-2}k_ix_i\Bigr)\sin(\pi k_{d-1}x_{d-1})\cos(\pi k_dx_d)\,dx}_{=0}\\
&= \cdots = \int_{\Omega_e}u_e(x)\prod_{i=1}^d\cos(\pi k_ix_i)\,dx = 2^d\int_\Omega u(x)\Phi_k(x)\,dx.
\end{align*}
In addition, since $\Phi_k = \Phi_{|k|}$ for any $k\in\mathbb{Z}^d$, the lemma follows from the equation above, (B.1) and (B.2).


The next lemma shows that the Fourier coefficients of the product of two functions $u$ and $v$ are given by the discrete convolution of their Fourier coefficients. Recall that $\{\hat{u}_k\}_{k\in\mathbb{Z}^d}$ denote the Fourier coefficients of the even extension $u_e$.

Lemma B.2. Let $w_e = u_ev_e$. Then $\hat{w}_k = \sum_{m\in\mathbb{Z}^d}\hat{u}_m\hat{v}_{k-m}$.

Proof. By definition, $u_e(x) = \sum_{m\in\mathbb{Z}^d}\hat{u}_m\cos(\pi m\cdot x)$ and $v_e(x) = \sum_{n\in\mathbb{Z}^d}\hat{v}_n\cos(\pi n\cdot x)$. Thanks to the fact that
\[
\int_{\Omega_e}\cos(\pi\ell\cdot x)\cos(\pi k\cdot x)\,dx = 2^{d-1_{k\ne 0}}\delta_\ell(k),
\]
one obtains that
\begin{align*}
\hat{w}_k &= \frac{1}{2^{d-1_{k\ne 0}}}\int_{\Omega_e}u_e(x)v_e(x)\cos(\pi k\cdot x)\,dx\\
&= \frac{1}{2^{d-1_{k\ne 0}}}\sum_{m\in\mathbb{Z}^d}\sum_{n\in\mathbb{Z}^d}\hat{u}_m\hat{v}_n\int_{\Omega_e}\cos(\pi m\cdot x)\cos(\pi n\cdot x)\cos(\pi k\cdot x)\,dx\\
&= \frac{1}{2^{d-1_{k\ne 0}}}\sum_{m\in\mathbb{Z}^d}\sum_{n\in\mathbb{Z}^d}\hat{u}_m\hat{v}_n\int_{\Omega_e}\frac{1}{2}\bigl[\cos(\pi(m+n)\cdot x)+\cos(\pi(m-n)\cdot x)\bigr]\cos(\pi k\cdot x)\,dx\\
&= \frac{1}{2}\sum_{m\in\mathbb{Z}^d}\hat{u}_m(\hat{v}_{k-m}+\hat{v}_{m-k}) = \sum_{m\in\mathbb{Z}^d}\hat{u}_m\hat{v}_{k-m},
\end{align*}
where we have also used that $\hat{v}_k = \hat{v}_{-k}$ for any $k$.

Corollary B.1. For any $k\in\mathbb{N}^d_0$,
\[
(uv)_k = \frac{1}{\beta_k}\sum_{m\in\mathbb{Z}^d}\beta_mu_{|m|}\beta_{m-k}v_{|m-k|}.
\]

Proof. Thanks to Lemma B.1 and Lemma B.2,
\[
(uv)_k = \frac{1}{\beta_k}\widehat{(uv)}_k = \frac{1}{\beta_k}(\hat{u}*\hat{v})_k = \frac{1}{\beta_k}\sum_{m\in\mathbb{Z}^d}\beta_mu_{|m|}\beta_{m-k}v_{|m-k|}.
\]

References

[1] Uri M. Ascher and Chen Greif. A First Course on Numerical Methods. SIAM, 2011.
[2] Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930-945, 1993.
[3] Julius Berner, Philipp Grohs, and Arnulf Jentzen. Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. SIAM Journal on Mathematics of Data Science, 2(3):631-657, 2020.
[4] Giuseppe Carleo and Matthias Troyer. Solving the quantum many-body problem with artificial neural networks. Science, 355(6325):602-606, 2017.
[5] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in Neural Information Processing Systems, 31:3036-3046, 2018.
[6] Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. In Advances in Neural Information Processing Systems, pages 2933-2943, 2019.
[7] John B. Conway. A Course in Functional Analysis, volume 96 of Graduate Texts in Mathematics. Springer-Verlag, New York, second edition, 1990.
[8] Suchuan Dong and Naxian Ni. A method for representing periodic functions and enforcing exactly periodic boundary conditions with deep neural networks. arXiv preprint arXiv:2007.07442, 2020.
[9] Charles Dugas, Yoshua Bengio, François Bélisle, Claude Nadeau, and René Garcia. Incorporating second-order functional knowledge for better option pricing. In Advances in Neural Information Processing Systems, pages 472-478, 2001.
[10] Weinan E, Chao Ma, Stephan Wojtowytsch, and Lei Wu. Towards a mathematical understanding of neural network-based machine learning: what we know and what we don't. arXiv preprint arXiv:2009.10713, 2020.
[11] Weinan E, Chao Ma, and Lei Wu. Barron spaces and the compositional function spaces for neural network models. arXiv preprint arXiv:1906.08039, 2019.
[12] Weinan E and Stephan Wojtowytsch. Some observations on partial differential equations in Barron and multi-layer spaces. arXiv preprint arXiv:2012.01484, 2020.
[13] Weinan E and Bing Yu. The deep Ritz method: a deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6(1):1-12, 2018.
[14] Ivar Fredholm. On a class of functional equations. Acta Mathematica, 27(1):365-390, 1903.
[15] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Limitations of lazy training of two-layers neural network. In Advances in Neural Information Processing Systems, pages 9108-9118, 2019.
[16] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315-323, 2011.
[17] Philipp Grohs, Fabian Hornung, Arnulf Jentzen, and Philippe von Wurstemberger. A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. arXiv preprint arXiv:1809.02362, 2018.
[18] Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505-8510, 2018.
[19] Jiequn Han, Jianfeng Lu, and Mo Zhou. Solving high-dimensional eigenvalue problems using deep neural networks: A diffusion Monte Carlo like approach. Journal of Computational Physics, 423:109792, 2020.
[20] Martin Hutzenthaler, Arnulf Jentzen, Thomas Kruse, and Tuan Anh Nguyen. A proof that rectified deep neural networks overcome the curse of dimensionality in the numerical approximation of semilinear heat equations. SN Partial Differential Equations and Applications, 1:1-34, 2020.
[21] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571-8580, 2018.
[22] Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. Solving for high-dimensional committor functions using artificial neural networks. Research in the Mathematical Sciences, 6(1):1, 2019.
[23] Jason M. Klusowski and Andrew R. Barron. Approximation by combinations of ReLU and squared ReLU ridge functions with ℓ1 and ℓ0 controls. IEEE Transactions on Information Theory, 64(12):7649-7656, 2018.
[24] Isaac E. Lagaris, Aristidis Likas, and Dimitrios I. Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. IEEE Transactions on Neural Networks, 9(5):987-1000, 1998.
[25] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes, volume 23. Springer Science & Business Media, 1991.
[26] Tao Luo and Haizhao Yang. Two-layer neural networks for partial differential equations: Optimization and generalization theory. arXiv preprint arXiv:2006.15733, 2020.
[27] Tengyu Ma. CS229T/STATS231: Statistical Learning Theory, 2018. URL: https://web.stanford.edu/class/cs229t/scribe_notes/10_08_final.pdf. Last visited on 2020/09/16.
[28] William Lauchlin McMillan. Ground state of liquid He4. Physical Review, 138(2A):A442, 1965.
[29] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665-E7671, 2018.
[30] Siddhartha Mishra and Roberto Molinaro. Estimates on the generalization error of physics informed neural networks (PINNs) for approximating PDEs. arXiv preprint arXiv:2006.16144, 2020.
[31] Ali Girayhan Özbay, Sylvain Laizet, Panagiotis Tzirakis, Georgios Rizos, and Björn Schuller. Poisson CNN: Convolutional neural networks for the solution of the Poisson equation with varying meshes and Dirichlet boundary conditions. arXiv preprint arXiv:1910.08613, 2019.
[32] Gilles Pisier. Remarques sur un résultat non publié de B. Maurey. Séminaire Analyse fonctionnelle (dit "Maurey-Schwartz"), pages 1-12, 1981.
[33] Maziar Raissi, Paris Perdikaris, and George E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686-707, 2019.
[34] Grant Rotskoff and Eric Vanden-Eijnden. Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks. In Advances in Neural Information Processing Systems, pages 7146-7155, 2018.
[35] Yeonjong Shin, Jerome Darbon, and George Em Karniadakis. On the convergence of physics informed neural networks for linear second-order elliptic and parabolic type PDEs. arXiv preprint arXiv:2004.01806, 2020.
[36] Yeonjong Shin, Zhongqiang Zhang, and George Em Karniadakis. Error estimates of residual minimization using neural networks for linear PDEs. arXiv preprint arXiv:2010.08019, 2020.
[37] Jonathan W. Siegel and Jinchao Xu. Approximation rates for neural networks with general activation functions. Neural Networks, 2020.
[38] Jonathan W. Siegel and Jinchao Xu. High-order approximation rates for neural networks with ReLU^k activation functions. arXiv preprint arXiv:2012.07205, 2020.
[39] Justin Sirignano and Konstantinos Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339-1364, 2018.
[40] Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks: A law of large numbers. SIAM Journal on Applied Mathematics, 80(2):725-752, 2020.
[41] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
[42] Michael M. Wolf. Mathematical Foundations of Supervised Learning, 2020. URL: https://www-m5.ma.tum.de/foswiki/pub/M5/Allgemeines/MA4801_2020S/ML_notes_main.pdf. Last visited on 2020/12/5.
[43] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103-114, 2017.
[44] Dmitry Yarotsky. Optimal approximation of continuous functions by very deep ReLU networks. arXiv preprint arXiv:1802.03620, 2018.

(JL) Departments of Mathematics, Physics, and Chemistry, Duke University, Box 90320, Durham, NC 27708.
Email address: [email protected]

(YL) Department of Mathematics and Statistics, Lederle Graduate Research Tower, University of Massachusetts, 710 N. Pleasant Street, Amherst, MA 01003.
Email address: [email protected]

(MW) Mathematics Department, Duke University, Box 90320, Durham, NC 27708.
Email address: [email protected]