https://www.monash.edu/science/quantitative-finance/publication

Monash CQFIS Working Paper 2020

Towards Explaining Deep Learning:

Asymptotic Properties of ReLU FFN Sieve Estimators∗

Hasan Fallahgoul†

Monash University

Vincentius Franstianto‡

Monash University

Grégoire Loeper§

Monash University

First draft: December 6, 2019

This draft: March 6, 2020

Abstract

A multi-layer, multi-node ReLU network is a powerful, efficient, and popular tool in statistical prediction tasks. However, in contrast to the great emphasis on its empirical applications, its statistical properties are rarely investigated, mainly because of its severe nonlinearity and heavy parametrization. To help close this gap via a sieve estimator, we first show that such a sieve estimator exists for a ReLU feed-forward network. Next, we establish three asymptotic properties of the ReLU network: consistency, a sieve-based convergence rate, and asymptotic normality. Finally, to validate the theoretical results, a Monte Carlo analysis is provided.

Keywords: Deep Learning, Neural Networks, Rectified Linear Unit, Sieve Estimators, Consistency, Rate of Convergence

JEL classification: C1, C5.

∗Monash Centre for Quantitative Finance and Investment Strategies has been supported by BNP Paribas. We thank Yan Dolinsky, Loriano Mancini, and Juan-Pablo Ortega for comments on the earlier draft.
†Hasan Fallahgoul, Monash University, School of Mathematics and Centre for Quantitative Finance and Investment Strategies, 9 Rainforest Walk, 3800 Victoria, Australia. E-mail: [email protected].
‡Vincentius Franstianto, Monash University, School of Mathematics and Centre for Quantitative Finance and Investment Strategies, 9 Rainforest Walk, 3800 Victoria, Australia. E-mail: [email protected].
§Grégoire Loeper, Monash University, School of Mathematics and Centre for Quantitative Finance and Investment Strategies, 9 Rainforest Walk, 3800 Victoria, Australia. E-mail: [email protected].


1 Introduction

The asymptotic properties of feed-forward networks (FFN) with rectified linear unit (ReLU) activation functions are rarely explored.1 Although ReLU networks achieve great accuracy in statistical regression and classification tasks, the lack of such properties has left them, like many other neural networks, as "black boxes". This lack of knowledge also hinders the development of statistical inference for ReLU network regressions. Furthermore, it has been shown that the regularization of neural networks requires more care than for their machine learning counterparts, due to severe nonlinearity and heavy parametrization.2 These challenges have limited the applications of neural networks in economics and finance. To help tackle these difficulties, in this paper we ask the following question: what are the asymptotic properties of the ReLU FFN?

The objective of a neural network is to transform input data into an output by approximating an unknown target function, which may be non-linear and time-varying. Each layer of a neural network can be seen as an approximation of the unknown target function. For a complicated target function, it is likely that the output of a one-layer neural network does not match the target function. Generally, there are two ways to overcome this problem: increasing the number of nodes (neurons) or the number of layers.3 Adding another layer to the neural network is equivalent to endowing the model with another chance at a better approximation of the target function. Theoretically, increasing the number of layers leads to a better approximation.4 By adding more layers to the network, the model becomes more powerful. Practically, however, this leads to several problems: (i) training the model is difficult and takes a longer time; (ii) it is likely that the trained model becomes "overfitted"; (iii) to use the trained model for transforming unseen data in the future (i.e., out-of-sample data), one will need a greater level of regularization to obtain a reasonable validation metric.5

Farrell, Liang, and Misra (2019) have obtained a groundbreaking result on the asymptotic

properties of deep neural networks. They prove a probabilistic convergence rate for multi-layer

ReLU networks, under the assumption that the number of hidden layers, i.e., depth of the net-

work, grows with the sample size. This paper complements their findings in two important ways.

Firstly, practitioners usually work with a fixed number of layers, as growing the depth makes the

1 With the notable exception of Farrell, Liang, and Misra (2019).
2 See Gu, Kelly, and Xiu (2020) and references therein.
3 Adding more nodes and layers means increasing network width and depth, respectively. There are other approaches such as using different activation functions, optimization techniques, etc. Detailed information about an FFN can be found in Anthony and Bartlett (2009).
4 See Eldan and Shamir (2016), among others.
5 See Liu, Shi, Li, Li, Zhu, and Liu (2016), Sun, Chen, Wang, Liu, and Liu (2016), and references therein.


training harder and increases the tendency toward eventual overfitting.6 Unlike Farrell, Liang, and Misra (2019), we derive the asymptotic properties for a neural network with a fixed number of layers, where the depth of the network does not grow with the sample size. Secondly, their paper does not prove the existence of an exact minimizer of the least-squares or logistic losses; it assumes that such a solution exists. Since the neural network function can be considered in a nonparametric regression framework, it can be seen as a form of sieve estimator.7 Therefore, it is natural to explore the existence of a solution for such a network in the strong formulation.

Neural networks can be seen as an approximation procedure in which the basis functions are themselves learned from the data. The learning is done via an optimization problem over many flexible combinations of simple functions. Indeed, viewing neural networks in this way makes them a form of sieve estimator. Earlier studies on this topic explored the theoretical properties of neural networks with a small number of hidden layers (shallow networks) and smooth activation functions, see Hornik, Stinchcombe, White, et al. (1989), White (1992), Chen and White (1999), and Anthony and Bartlett (2009), among others. Many of the existing works, such as Anthony and Bartlett (2009), Akpinar, Kratzwald, and Feuerriegel (2019), and Bartlett, Harvey, Liaw, and Mehrabian (2019), focus more on sample complexities and Vapnik-Chervonenkis and Pollard dimension upper bounds, which do not give much insight into statistical explainability in the way that asymptotic properties do. Other theoretically oriented works, such as Shen, Jiang, Sakhanenko, and Lu (2019) and Horel and Giesecke (2020), focus on the asymptotic properties and statistical tests, respectively, of one-layer sigmoid networks. However, sigmoid networks are rarely used in practice compared to ReLU networks, as the latter have sparse representations and non-vanishing gradients that help speed up training computations.

It is thus natural to explore the asymptotic properties of ReLU networks using conventional approaches such as consistency, probabilistic rate of convergence, and asymptotic normality, since neural network regression can be seen as a specific class of sieve extremum estimation, see Grenander (1981).8 The main contribution of this paper is threefold.

6 It has been formally shown that depth, even if increased by one layer, can be exponentially more valuable than increasing width in standard feed-forward neural networks, see Eldan and Shamir (2016). After the seminal work of Hinton, Osindero, and Teh (2006), the machine learning community has experimented with and adopted deeper (and wider) networks. For a complicated problem such as image recognition, a neural network with as many as 152 layers has been used, see He, Zhang, Ren, and Sun (2016), among others. Gu, Kelly, and Xiu (2020) have applied machine learning approaches to predict risk premia. They have found strong evidence in favor of these approaches. Their deepest network has five hidden layers.
7 Chen (2007) provides an excellent review of the sieve estimator, its properties, and its applications.
8 In this paper, we consider a fully connected FFN with ReLU as the activation function. Farrell, Liang, and Misra (2019) refer to such a network as a multi-layer perceptron.


• Motivated by Shen, Jiang, Sakhanenko, and Lu (2019), we first show that there exists a sieve estimator based on a ReLU feed-forward network. We also prove that the ReLU network with fixed depth is consistent. Unlike Farrell, Liang, and Misra (2019), in our setting the depth of the network is fixed and does not diverge with the sample size. The asymptotic analysis by Farrell, Liang, and Misra (2019) covers three neural network architectures; however, a neural network with fixed depth is not included in any of them.

• We also establish nonasymptotic bounds for the nonparametric estimator, which we refer to as the rate of convergence of the ReLU neural network estimator. These bounds are new to the literature and can be used for developing a significance test for the neural network, see Horel and Giesecke (2020).

• The asymptotic normality of a sieve estimator for the ReLU FFN is provided. The asymptotic

normality appears to be new to the literature and is one of the main theoretical contributions

of this paper.

Our results are among the first inferential results for multi-layer ReLU networks built on the sieve space of continuous functions. Sieve estimation underlies many parametric and nonparametric estimation methods, such as time-series and quantile regressions. As these classical methods predate machine learning methods, sieve-based inference is more widely known than the newer machine-learning-specific inference. The asymptotic properties of ReLU networks derived in the sieve framework will open new possibilities for adapting existing sieve inference to ReLU neural networks.

Furthermore, it is known that fixed-depth ReLU networks are easier to train than their growing-depth counterparts.9 With the same number of sample data and iterations in stochastic gradient descent, the fixed-depth networks give better convergence results. Considering the availability of easy-to-use packages for training neural networks, such as Keras in Python, having a sieve-based convergence result will give an intuitive understanding of the accuracy of ReLU networks to communities that are unfamiliar with machine learning.

The paper is organized as follows. Section 2 provides an overview of the ReLU FFN. Section 3

states the theoretical setting needed to prove asymptotic properties of ReLU neural networks. Sec-

tion 4 presents the main theoretical results of the paper: the consistency, sieve-based convergence

9See Sun, Chen, Wang, Liu, and Liu (2016) and Liu, Shi, Li, Li, Zhu, and Liu (2016), among others.


rate, and asymptotic normality. Section 5 explores the validity of theoretical results in simulations.

Section 6 concludes.

2 The ReLU feed-forward network

In this section, we discuss the architecture of a ReLU FFN. We refer interested readers to Anthony

and Bartlett (2009) and Goodfellow, Bengio, and Courville (2016) for detailed exposition.

In regression, we observe responses $y_i$ whose values are driven by an underlying target function $f_0$, which is a function of a $d$-dimensional vector $\mathbf{x}_i$. Each element of $\mathbf{x}_i$ is an observed predictor (independent variable). As the exact functional form of $f_0$ is rarely known, the target function $f_0$ is estimated using a specific function of $\mathbf{x}_i$. In our context, this estimating function is a multi-layer neural network with the ReLU activation function.

Figure 1 is an example of a ReLU network. It is an example of an FFN, in which information propagates only forward.10 The network begins by taking input from $d$ initial nodes, i.e., $\mathbf{x}_i$. The number of initial nodes in Figure 1 is two, i.e., $\mathbf{x}_i = (x_{1,i}, x_{2,i})$. Each initial node corresponds to one element of the $d$-dimensional predictor vector $\mathbf{x}_i$, and the layer consisting of these nodes is called the input layer. These initial nodes can be seen as impulse receptors in biological neural networks.

The inputs are then transformed into signals that are propagated forward to the next layer, called the first hidden layer, equivalent to neurons in the biological counterpart. This layer contains $H_n$ hidden nodes, each of which is connected to all nodes in the input layer. The value of each node, $Y$, is specified in the following way. First, one calculates
$$Y = \sum_i w_i x_i + b \tag{2.1}$$
where $w_i$, $x_i$, and $b$ are the weight, the value of each input node, and the bias, respectively. Since the value of $Y$ can lie anywhere in $(-\infty, \infty)$, one needs a function to decide whether the neuron associated with $Y$ is "fired" or not. An activation function is used for this purpose. There are various types of activation functions. In this paper, we use the most popular one, the rectified linear unit, which is given by $\mathrm{ReLU}(x) = \max(x, 0)$.

In plain words, the input signals are taken in the form of linear combinations of all input nodes,

10 There are other classes of deep neural networks in which information does not only propagate forward, see Anthony and Bartlett (2009) and Goodfellow, Bengio, and Courville (2016) for a detailed exposition. The results of this paper are for the feed-forward network; the validity of the results for other neural networks is an open question.


whose weights depend on each hidden node. Then, these hidden nodes may be activated based on the activation function used. The activation is the same for all nodes in the hidden layer; we refer to this network as the ReLU network. If the input is positive, then the hidden node is activated and produces a positive output. If it is not, then the node is not activated and produces a zero output. In Figure 1, the first and second hidden layers have three nodes each.

The outputs of this hidden layer are then used as new inputs to the next layer, and again each of its nodes is connected to the inputting nodes from the first hidden layer. If the next layer is also a hidden layer, the input signals from the previous hidden layer are processed by each node of this next hidden layer and turned into output in the same manner as in the previous hidden layer.11 If the next layer is the output layer, equivalent to the response to the impulse in its biological counterpart, the input signals are still linear aggregations, but no activation function is applied to them; they are taken plainly as linear combinations. We denote the number of hidden layers in a ReLU network by $L_d$; in Figure 1, $L_d = 2$.
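To make the forward propagation just described concrete, the following minimal sketch (plain NumPy, with made-up weights) evaluates a network with the architecture of Figure 1: two inputs, two hidden layers of three ReLU nodes each ($L_d = 2$, $H_n = 3$), and a linear output node.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Arbitrary illustrative parameters gamma_{u,j,k}: rows index the nodes j of a layer,
# columns index the incoming nodes k of the previous layer; b_u collects the biases.
W1 = np.array([[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]); b1 = np.array([0.1, -0.2, 0.0])
W2 = np.array([[0.7, 0.2, -0.5], [-0.1, 0.9, 0.3], [0.4, -0.6, 0.2]]); b2 = np.zeros(3)
W3 = np.array([[1.0, -1.0, 0.5]]);                     b3 = np.array([0.2])

def feed_forward(x):
    """Forward pass of the Figure 1 network: two ReLU hidden layers, linear output."""
    h1 = relu(W1 @ x + b1)      # first hidden layer, h_{1,j}
    h2 = relu(W2 @ h1 + b2)     # second hidden layer, h_{2,j}
    return (W3 @ h2 + b3)[0]    # output layer: linear aggregation, no activation

print(feed_forward(np.array([0.3, 0.9])))
```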

A crucial step in approximating the target function $f_0$ via neural networks is training the model, that is, finding the unknown parameters, i.e., the weights $w_i$ and the biases. The unknown parameters are estimated by minimizing a loss function.12 One advantage of neural networks over their machine learning counterparts is that training a neural network allows for joint updates of all model parameters at each step of the optimization. However, the optimization procedure can be highly computationally intensive due to the high degree of nonlinearity and heavy parameterization. To overcome this problem, stochastic gradient descent has been used to train neural networks.13
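As a minimal illustration of such a training step (a sketch with made-up sizes and data, not the paper's implementation), the following performs one stochastic-gradient update of all parameters of a one-hidden-layer ReLU network under the squared loss:

```python
import numpy as np

rng = np.random.default_rng(0)
d, H, lr = 2, 3, 0.01                      # illustrative sizes and learning rate
W1, b1 = rng.normal(size=(H, d)), np.zeros(H)
w2, b2 = rng.normal(size=H), 0.0

x, y = rng.uniform(size=d), 1.0            # one made-up training example
z = W1 @ x + b1                            # pre-activations, as in (2.1)
h = np.maximum(z, 0.0)                     # ReLU hidden layer
f = w2 @ h + b2                            # linear output

g_out = 2.0 * (f - y)                      # d(loss)/d(f) for the squared loss (y - f)^2
grad_w2, grad_b2 = g_out * h, g_out
g_h = g_out * w2 * (z > 0)                 # backpropagate through the ReLU
grad_W1, grad_b1 = np.outer(g_h, x), g_h

# joint update of all parameters in a single step
W1 -= lr * grad_W1; b1 -= lr * grad_b1
w2 -= lr * grad_w2; b2 -= lr * grad_b2
```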

Notation: For the rest of the paper, vectors are denoted by either bold or capital fonts. $N(\varepsilon, \mathcal{D}, \rho)$ is the covering number of a pseudo-metric space $(\mathcal{D}, \rho)$, i.e., the minimum number of $\varepsilon$-balls of the pseudo-metric $\rho$ needed to cover $\mathcal{D}$, with possible overlapping.14 The asymptotic inequality $a_n \lesssim b_n$ means that there exist $R > 0$ and $N_1 \in \mathbb{N}$ such that $a_n \le R b_n$ for all $n \ge N_1$; $a_n \sim b_n$ means $\lim_{n\to\infty} a_n / b_n = 1$. $C^0([0,1]^d)$ is the set of all continuous functions on $[0,1]^d$, and $W^{k,\infty}([0,1]^d)$ denotes the Sobolev space of order $k$ on $[0,1]^d$.15 We denote big-O and small-o by $O$ and $o$, respectively; $O_P$ and $o_P$ denote big-O and small-o in probability. EC and CC are abbreviations for existence condition and consistency condition, respectively.

11 The input signals have a linear form and are subject to the ReLU activation function.
12 In this paper we use the mean squared error loss function.
13 See Wilson and Martinez (2003), among others, for a detailed discussion of stochastic gradient descent and its implementation.
14 The covering number is the minimal number of balls of radius ε which cover a pseudo-metric space (D, ρ).
15 See Leoni (2017).


3 The setting

In this section, we discuss the setting needed to establish the main results in Section 4. Suppose

that the true nonparametric regression model is
$$y_i = f_0(\mathbf{x}_i) + \varepsilon_i$$
where $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are independent and identically distributed random variables defined on a complete probability space $(\Omega, \mathcal{A}, P)$ with $\mathbb{E}[\varepsilon_i] = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2 < \infty$, $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n \in \mathcal{X} = [0,1]^d$ are vectors of predictors, and $f_0 \in \mathcal{F} := \{ f \in C^0 \mid f : [0,1]^d \to \mathbb{R} \}$. Define the sample squared loss on $f \in \mathcal{F}$ and the population criterion function, respectively, as
$$Q_n(f) := \frac{1}{n}\sum_{i=1}^{n} \big(y_i - f(\mathbf{x}_i)\big)^2 = \frac{1}{n}\sum_{i=1}^{n} \big(f_0(\mathbf{x}_i) - f(\mathbf{x}_i)\big)^2 - \frac{2}{n}\sum_{i=1}^{n} \varepsilon_i \big(f(\mathbf{x}_i) - f_0(\mathbf{x}_i)\big) + \frac{1}{n}\sum_{i=1}^{n} \varepsilon_i^2,$$
$$\bar{Q}_n(f) := \mathbb{E}\left[Q_n(f)\right] = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}\big(y_i - f(\mathbf{x}_i)\big)^2\right] = \frac{1}{n}\sum_{i=1}^{n}\big(f(\mathbf{x}_i) - f_0(\mathbf{x}_i)\big)^2 + \sigma^2.$$
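As a quick numerical sanity check of this decomposition (a sketch with an arbitrary choice of $f$, $f_0$, and simulated errors, not taken from the paper), the following snippet verifies that the two expressions for $Q_n(f)$ coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 10_000, 0.7
x = rng.uniform(size=n)
f0 = lambda t: np.sin(2 * np.pi * t)          # arbitrary illustrative target
f  = lambda t: 0.8 * t                        # arbitrary candidate function
eps = rng.normal(0.0, sigma, size=n)
y = f0(x) + eps

Qn = np.mean((y - f(x)) ** 2)                 # sample squared loss
decomposed = (np.mean((f0(x) - f(x)) ** 2)
              - 2.0 * np.mean(eps * (f(x) - f0(x)))
              + np.mean(eps ** 2))
assert np.isclose(Qn, decomposed)             # the algebraic identity above
print(Qn, np.mean((f(x) - f0(x)) ** 2) + sigma ** 2)   # close to the population criterion
```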

In regression, we are interested in finding $\hat{f}$ such that
$$\hat{f} := \arg\min_{f \in \mathcal{F}} Q_n(f).$$
However, if $\mathcal{F}$ is too rich, $\hat{f}$ may be inconsistent.16 Hence, we are interested in finding a sequence of nested function spaces $\mathcal{F}_n$ satisfying
$$\mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots \subset \mathcal{F}_n \subset \mathcal{F}_{n+1} \subset \cdots \subset \mathcal{F}$$
where for every $f \in \mathcal{F}$ there exist $f_n \in \mathcal{F}_n$ such that $\lim_{n\to\infty} \rho(f, f_n) = 0$; that is, $\bigcup_n \mathcal{F}_n$ is dense in $\mathcal{F}$. $\mathcal{F}_n$ is called a sieve space of $\mathcal{F}$ with respect to the pseudo-metric $\rho$, and the sequence $f_n$ is called a sieve. Take $\rho \equiv \rho_n$, where
$$\rho_n(f) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} f(\mathbf{x}_i)^2}$$
is a pseudo-norm on $\mathcal{F}$, with associated pseudo-metric $\rho_n(f - g)$.17 Hence, instead of $\hat{f}$, we are looking for an approximate sieve estimator $f_n$ that satisfies
$$Q_n(f_n) \le \inf_{f \in \mathcal{F}_n} Q_n(f) + O_P(\eta_n)$$
where $\lim_{n\to\infty} \eta_n = 0$.

The next question is how to construct $\mathcal{F}_n$ so that it is dense in $\mathcal{F}$. First of all, consider the

16 Inconsistency here means that $\hat{f}$ does not converge in probability to $f_0$.
17 See Shen, Jiang, Sakhanenko, and Lu (2019).


following fixed-width ReLU feed-forward network function space indexed by $W_n$:
$$\mathcal{F}_{W_n} := \left\{ h_{L_d+1,1}(\mathbf{x}) : \mathbf{x} \in [0,1]^d \right\}$$
where $h_{u,j}(\mathbf{x})$ is the output of the $j$th node of layer $u$ in the ReLU network with input $\mathbf{x}$; $u = 0$ and $u = L_d + 1$ correspond to the input and output layers, respectively, and $1 \le u \le L_d$ corresponds to the $u$th hidden layer. We also have $j \in \{1, 2, \ldots, H_{n,u}\}$, where $H_{n,u}$ is the number of nodes in the $u$th layer, $H_{n,0} = d$, and $H_{n,L_d+1} = 1$. For $1 \le u \le L_d$, the formula for $h_{u,j}(\mathbf{x})$ is
$$h_{u,j}(\mathbf{x}) = \mathrm{ReLU}\left( \sum_{k=1}^{H_{n,u-1}} \gamma_{u,j,k}\, h_{u-1,k}(\mathbf{x}) + \gamma_{u,j,0} \right)$$
where $h_{0,k}(\mathbf{x}) = x_k$, the $k$th element of $\mathbf{x}$. It should be noted that $\gamma_{u,j,k}$ and $\gamma_{u,j,0}$ are the analogues of $w_i$ and $b$ in (2.1).

We use the upper bound
$$\max_{1 \le j \le H_{n,u}} \sum_{k=0}^{H_{n,u-1}} \left| \gamma_{u,j,k} \right| \le M_{n,u}, \qquad \forall\, 1 \le u \le L_d + 1,$$
where $M_{n,u} > 1$, $M_{n,u}$ can depend on $n$, and $M_{n,0} = 1$ as $\mathcal{X} = [0,1]^d$. This upper bound is used in the entropy number upper bound. $W_n$ itself is the number of parameters $\gamma_{u,j,k}$ in a single ReLU network, with
$$W_n = \sum_{u=0}^{L_d} (H_{n,u} + 1)\, H_{n,u+1}.$$
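As a small illustration (a sketch, with the Figure 1 architecture as an arbitrary example), the parameter count $W_n$ can be computed directly from the layer widths; it coincides with the usual weights-plus-biases count of a fully connected network.

```python
# The parameter count W_n from the formula above, for given layer widths
# [H_{n,0}, ..., H_{n,L_d+1}] (H_{n,0} = d and H_{n,L_d+1} = 1).
def num_params(widths):
    return sum((widths[u] + 1) * widths[u + 1] for u in range(len(widths) - 1))

# Figure 1 architecture: d = 2 inputs, two hidden layers of 3 nodes, one output node
print(num_params([2, 3, 3, 1]))   # (2+1)*3 + (3+1)*3 + (3+1)*1 = 25
```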

According to Proposition 1 in Yarotsky (2018), for every $f_0 \in \mathcal{F}$ there exists $\pi_{W_n} f_0 \in \mathcal{F}_{W_n}$ such that
$$\|\pi_{W_n} f_0 - f_0\|_\infty := \sup_{\mathbf{x}\in[0,1]^d} |\pi_{W_n} f_0(\mathbf{x}) - f_0(\mathbf{x})| \le O\!\left(\omega_{f_0}\!\left(O\!\left(W_n^{-1/d}\right)\right)\right)$$
with $\omega_{f_0} : [0,\infty) \to [0,\infty)$, $\omega_{f_0}(r) = \max\{ |f_0(\mathbf{x}) - f_0(\mathbf{y})| : \mathbf{x},\mathbf{y} \in [0,1]^d,\ |\mathbf{x}-\mathbf{y}| < r \}$. It is clear that $\|\cdot\|_\infty$ is a pseudo-metric, $(\mathcal{F}_{W_n}, \|\cdot\|_\infty)$ is a sieve space of $\mathcal{F}$, and $\pi_{W_n} f_0$ is the related sieve. It is imperative that $W_n^{-1/d} \to 0$ as $n \uparrow \infty$ to ensure that $\mathcal{F}_{W_n}$ is dense. Also, for $\Gamma_n := \{\gamma_{u,j,k} : \forall u, j, k\}$,18 we need to ensure that the range of the parameters inside $\cup_n \Gamma_n$ can span $\mathbb{R}^{W_n}$. These requirements for having a dense $\mathcal{F}_{W_n}$ can be summarized as
$$W_n \uparrow \infty \quad \text{and} \quad \Gamma_n \uparrow \mathbb{R}^{W_n}, \quad \text{as } n \uparrow \infty,$$
and the chosen set $\Gamma_n$ is the compact set
$$\Gamma_n = \prod_{u,j,k} \left[ -M^{(\gamma)}_{n,u,j,k},\ M^{(\gamma)}_{n,u,j,k} \right], \qquad \forall\, 1 \le u \le L_d+1,\ \forall\, 1 \le j \le H_{n,u},\ \forall\, 0 \le k \le H_{n,u-1},$$
such that $|\gamma_{u,j,k}| \le M^{(\gamma)}_{n,u,j,k}$ for all $u, j, k$, and also $\sum_{j=1}^{H_{n,u}} \sum_{k=0}^{H_{n,u-1}} M^{(\gamma)}_{n,u,j,k} = M_{n,u}$. Hence, the two requirements for the denseness of $\mathcal{F}_{W_n}$ are given under the following assumption.

Assumption 3.1 (Assumption for Dense $\mathcal{F}_{W_n}$). $H_n,\ M^{(\gamma)}_{n,u,j,k} \uparrow \infty$ as $n \uparrow \infty$, for all $1 \le u \le L_d+1$, $1 \le j \le H_{n,u}$, $0 \le k \le H_{n,u-1}$.

18This is the set of all parameters in the ReLU network


In the following sections, the results and proofs are obtained by the same proof strategies as in Shen, Jiang, Sakhanenko, and Lu (2019). The main changes are the upper bound on the supremum of $\|\cdot\|_\infty$ over the elements of $\mathcal{F}_{W_n}$ and the upper bound on the entropy number, together with the quantities related to them.

4 Main results

We discuss our three main theoretical results in this section: sieve-estimator consistency, the convergence rate, and asymptotic normality. All proofs are provided in Appendix A.

4.1 Existence

Theorem 4.1.1 (Existence). There exists an approximate sieve estimator fn in FWn .

The following remark is the tool for proving the Existence Theorem.

Remark 4.1.1 (Existence Conditions). (Remark 2.1 in Chen (2007)). There exists an approximate sieve estimator $f_n$ inside $\mathcal{F}_{W_n}$ if the following statements hold:

EC1. $Q_n(f)$ is a measurable function of the data $(\mathbf{x}_i, y_i)$, $i \in \{1, 2, \ldots, n\}$.

EC2. $Q_n(f)$ is lower semicontinuous on $\mathcal{F}_{W_n}$ under the pseudo-metric $\rho_n$, for each $\omega \in \Omega$, fixing the sequence $(\mathbf{x}_i, y_i(\omega))_{i=1}^{n}$.

EC3. $\mathcal{F}_{W_n}$ is a sieve of $\mathcal{F}$ and compact under $\rho_n$.

Note that Chen (2007) works with the criterion $-Q_n(f)$ (maximization rather than minimization), which explains why EC2 of Remark 4.1.1 requires lower semicontinuity instead of upper semicontinuity. EC1 is obviously satisfied by $Q_n(f)$, as $y_i = f_0(\mathbf{x}_i) + \varepsilon_i$. Note also that fixing $(\mathbf{x}_i, y_i(\omega))$ is equivalent to fixing $\varepsilon_i(\omega)$, for all $\omega \in \Omega$. To prove EC2 and EC3, we make use of the following lemma.

Lemma 4.1.1. For each $n$ and $1 \le u \le L_d + 1$,
$$\sup_{1 \le j \le H_{n,u}} \|h_{u,j}\|_\infty \le M^*_{n,u} := \prod_{i=0}^{u} M_{n,i} \ge 1,$$
and this implies
$$\sup_{f \in \mathcal{F}_{W_n}} \|f\|_\infty \le M^*_{n,L_d+1} = \prod_{u=0}^{L_d+1} M_{n,u} \ge 1.$$
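The bound above is easy to check numerically. The following sketch (with arbitrary small layer sizes and row bounds, not taken from the paper) builds a ReLU network whose weight rows satisfy the $\ell_1$ constraint $\sum_{k=0}^{H_{n,u-1}} |\gamma_{u,j,k}| \le M_{n,u}$ and verifies that its output never exceeds $\prod_i M_{n,i}$ on $[0,1]^d$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, widths, M = 2, [3, 3, 1], [2.0, 1.5, 3.0]       # hidden widths and row bounds M_{n,u}

def random_layer(n_in, n_out, bound):
    W = rng.normal(size=(n_out, n_in + 1))          # last column is the bias gamma_{u,j,0}
    scale = bound / np.abs(W).sum(axis=1, keepdims=True)
    return W * scale                                # each row now has l1-norm equal to `bound`

layers, n_in = [], d
for width, bound in zip(widths, M):
    layers.append(random_layer(n_in, width, bound))
    n_in = width

def network(x):
    h = x
    for u, W in enumerate(layers):
        z = W[:, :-1] @ h + W[:, -1]
        h = z if u == len(layers) - 1 else np.maximum(z, 0.0)   # linear output layer
    return h[0]

xs = rng.uniform(size=(2000, d))
assert max(abs(network(x)) for x in xs) <= np.prod(M)   # sup-norm bounded by the product
```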


Please refer to Appendix A for the proof of this lemma and of EC2 and EC3. Note also that $Q_n(f)$ can be proven to be continuous on $(\mathcal{F}_{W_n}, \rho_n)$, which is stronger than EC2. Thus the existence of $f_n$ is justified.

We can consider the bounded ReLU function
$$h^*_{L_d+1,1}(\mathbf{x}) = \min\!\left( UB_{f_0},\ \max\!\left( LB_{f_0},\ h_{L_d+1,1}(\mathbf{x}) \right) \right)$$
where $LB_{f_0}$ and $UB_{f_0}$ are lower and upper bounds of $f_0$ such that $LB_{f_0} < \min_{\mathbf{x}\in[0,1]^d} f_0$ and $UB_{f_0} > \max_{\mathbf{x}\in[0,1]^d} f_0$, respectively. The existence of both bounds is guaranteed by the Extreme Value Theorem.

If $\pi^*_{W_n} f_0$ is the sequence of functions $h^*_{L_d+1,1}(\mathbf{x})$ sharing the same parameters as $\pi_{W_n} f_0$, then the fact that $\|\pi_{W_n} f_0 - f_0\|_\infty \to 0$ as $n \to \infty$ gives $\pi_{W_n} f_0 = \pi^*_{W_n} f_0$ for all $n > M$, for some $M > 0$. Thus $\pi^*_{W_n} f_0$ is also a sieve sequence, and the pseudo-metric space $(\mathcal{F}^*_{W_n}, \rho_n)$, composed of the functions $h^*_{L_d+1,1}(\mathbf{x})$, is also a sieve space of $\mathcal{F}$.

As $\min(UB_{f_0}, x)$ and $\max(LB_{f_0}, x)$ are Lipschitz continuous functions with Lipschitz constant 1, all Existence Conditions are satisfied and $Q_n(f)$ is still continuous on $(\mathcal{F}^*_{W_n}, \rho_n)$. The key difference is that now
$$\sup_{f \in \mathcal{F}^*_{W_n}} \|f\|_\infty \le M^{**}_{n,L^*_d+1} := \min\!\left( M^*_{n,L_d+1},\ \max\!\left( |LB_{f_0}|, |UB_{f_0}| \right) \right),$$
where $L^*_d = L_d + 2$ is the depth of the new bounded ReLU network.

The advantage of considering such a bounded ReLU network space will become clear in the next subsection, where we show that $f_n$ is also consistent under a product probability space.

4.2 Consistency

Define the product space $(\Omega^*, \mathcal{A}^*, P^*) = \prod_{i=1}^{n} (\Omega, \mathcal{A}, P) \times (\mathcal{Z}, \mathcal{C}, P_{\mathcal{Z}})$, where the last probability space contains additional random variables independent of $\prod_{i=1}^{n} (\Omega, \mathcal{A}, P)$. The consistency of $f_n$ holds under this probability measure, with a condition on the growth of the number of parameters.

Theorem 4.2.1 (Consistency). Define
$$M^{(all)}_{n,L_d+1} := \max_{1 \le i \le L_d+1} M_{n,i} > 1,$$
$$C^*_{n,d,L_d+1,W_n} := W_n \ln\!\left( d\, M^*_{n,L_d+1}\, W_n \left( M^{(all)}_{n,L_d+1} \right)^{L_d} \right).$$
If $\left(M^*_{n,L_d+1}\right)^2 C^*_{n,d,L_d+1,W_n} = o(n)$, then
$$\operatorname*{plim}_{n\to\infty} \rho_n(f_n - f_0) = 0, \quad \text{under } (\Omega^*, \mathcal{A}^*, P^*).$$

The conditions for consistency are given in the following remark.


Remark 4.2.1 (Consistency Conditions). (Remark 3.1(3) in Chen (2007)). The approximate sieve estimator $f_n$ in the sieve space $\mathcal{F}_{W_n}$ of $\mathcal{F}$ satisfies
$$\operatorname*{plim}_{n\to\infty} \rho_n(f_n - f_0) = 0$$
if the following conditions are satisfied:

CC1. $\bar{Q}_n(f)$ is continuous at $f_0$ in $\mathcal{F}$, and $\bar{Q}_n(f_0) < \infty$.

CC2. For all $\zeta > 0$, $\bar{Q}_n(f_0) < \inf_{f \in \mathcal{F}:\, \rho_n(f - f_0) \ge \zeta} \bar{Q}_n(f)$.

CC3. $Q_n(f)$ is a measurable function of the data $(\mathbf{x}_i, y_i)$, $i \in \{1, 2, \ldots, n\}$.

CC4. $Q_n(f)$ is lower semicontinuous on $\mathcal{F}_{W_n}$ under $\rho_n$, for each $\omega \in \Omega$, fixing the sequence $(\mathbf{x}_i, y_i(\omega))_{i=1}^{n}$.

CC5. $(\mathcal{F}_{W_n}, \rho_n)$ is a compact sieve space.

CC6. (Uniform convergence) $\operatorname*{plim}_{n\to\infty} \sup_{f \in \mathcal{F}_{W_n}} |Q_n(f) - \bar{Q}_n(f)| = 0$, for each $W_n$.

Note that $\bar{Q}_n(f) = \frac{1}{n}\sum_{i=1}^{n}(f(\mathbf{x}_i) - f_0(\mathbf{x}_i))^2 + \sigma^2$ is continuous on $\mathcal{F}$. The proof of its continuity is very similar to the proof of the lower semicontinuity of $Q_n(f)$, with the related constant now being $D_{n,L_d+1} = n^{-1}\left( 2M^*_{n,L_d+1} + 2\max_{\mathbf{x}\in[0,1]^d} |f_0(\mathbf{x})| \right)$, and hence CC1 is satisfied. It is clear that CC2 is satisfied, as $f_0$ minimizes $\bar{Q}_n$ in $\mathcal{F}$. As CC3, CC4, and CC5 are the Existence Conditions, we already have them. The last condition that needs to be dealt with is CC6.

Lemma 4.2.1 (CC6 Satisfaction). If $\left(M^*_{n,L_d+1}\right)^2 C^*_{n,d,L_d+1,W_n} = o(n)$, then CC6 is satisfied under $(\Omega^*, \mathcal{A}^*, P^*)$.

We can also consider $\mathcal{F}^*_{W_n}$ for consistency. CC1-CC5 are clearly satisfied. For CC6, the new condition is
$$\left(M^{**}_{n,L^*_d+1}\right)^2 C^{**}_{n,d,L^*_d+1,W^*_n} = o(n)$$
where
$$C^{**}_{n,d,L^*_d+1,W^*_n} := W^*_n \ln\!\left( d\, M^{**}_{n,L^*_d+1}\, W^*_n \left( M^{(all)*}_{n,L^*_d+1} \right)^{L^*_d} \right)$$
with $M^{(all)*}_{n,L^*_d+1} := \max\!\left( M^{(all)}_{n,L_d+1}, |LB_{f_0}|, |UB_{f_0}| \right)$ and $W^*_n = W_n + 4$. The reason $W^*_n$ is taken in this way is that both $\min(UB_{f_0}, v)$ and $\max(LB_{f_0}, v)$ are activation functions that can be seen to take the linear aggregation $1 \cdot v + 0$ as input, and thus each requires two additional weights.
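To see why the clipping can be treated as two extra ReLU-type layers, note that $\max(LB, v) = LB + \mathrm{ReLU}(v - LB)$ and $\min(UB, w) = UB - \mathrm{ReLU}(UB - w)$. The following sketch (with arbitrary illustrative bounds, not taken from the paper) checks this identity numerically:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
LB, UB = -12.0, 25.0                      # hypothetical bounds LB_{f0} < min f0, UB_{f0} > max f0

def clipped_via_relu(v):
    # min(UB, max(LB, v)) expressed with two extra ReLU-type units,
    # each taking the linear aggregation 1*v + constant as input
    return UB - relu(UB - (LB + relu(v - LB)))

v = np.linspace(-40.0, 40.0, 801)
assert np.allclose(clipped_via_relu(v), np.clip(v, LB, UB))
```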


Remark 4.2.2. As $M^{**}_{n,L^*_d+1}$ is bounded above by $\max(|LB_{f_0}|, |UB_{f_0}|)$, satisfaction of the consistency condition depends only on the growth rate of $W^*_n = O(W_n)$. This implies that we have more flexibility in adjusting the growth rate of $W_n$. Note that $UB_{f_0}$ and $LB_{f_0}$ can be taken to be very large positive and negative real numbers, respectively, such as $\pm 10^{100{,}000}$. As most $f_0$ encountered in practice are rarely that large, we can treat bounded ReLU networks as if they were unbounded ReLU networks in most applications.

In the next subsection, we show that the convergence rate of $f_n$ can be bounded, given a suitable convergence rate for $\eta_n$.

4.3 Rate of convergence

Theorem 4.3.1 (Rate of Convergence). Suppose that
$$\eta_n = O\!\left( \max\!\left\{ \rho_n\!\left(\pi_{W_n} f_0 - f_0\right)^2,\ \left( \frac{C^*_{n,d,L_d+1,W_n}}{n} \right)^{2/3} \right\} \right)$$
where $C^*_{n,d,L_d+1,W_n}$ is defined in the Consistency Theorem and $\left(M^*_{n,L_d+1}\right)^2 C^*_{n,d,L_d+1,W_n} = o(n)$. Then
$$\rho_n\!\left( f_n - f_0 \right) = O_{P^*}\!\left( \max\!\left\{ \rho_n\!\left(\pi_{W_n} f_0 - f_0\right),\ \left( \frac{C^*_{n,d,L_d+1,W_n}}{n} \right)^{1/3} \right\} \right).$$

The following remark underlies the proof of the Rate of Convergence Theorem.

Remark 4.3.1 (Convergence Rate of $\rho_n(f_n - \pi_{W_n} f_0)$). (Theorem 3.4.1 in van der Vaart and Wellner (1996)). For each $n$, let $\delta_n$ satisfying $0 \le \delta_n \le \alpha$ be arbitrary ($\delta_n$ is typically a multiple of $\rho_n(\pi_{W_n} f_0 - f_0)$). Suppose that, for every $n$ and $\delta_n < \delta \le \alpha$,
$$\sup_{\substack{\delta/2 \le \rho_n(f - \pi_{W_n} f_0) \le \delta \\ f \in \mathcal{F}_{W_n}}} \left[ \bar{Q}_n(\pi_{W_n} f_0) - \bar{Q}_n(f) \right] \le -\delta^2,$$
$$\mathbb{E}_{P^*}\!\left[ \sup_{\substack{\delta/2 \le \rho_n(f - \pi_{W_n} f_0) \le \delta \\ f \in \mathcal{F}_{W_n}}} \sqrt{n}\, \left[ \left(Q_n - \bar{Q}_n\right)(\pi_{W_n} f_0) - \left(Q_n - \bar{Q}_n\right)(f) \right] \right] \lesssim \phi_n(\delta)$$
for functions $\phi_n$ such that $\delta \mapsto \phi_n(\delta)/\delta^\beta$ is decreasing on $(\delta_n, \alpha)$ for some $\beta < 2$. Let $r_n \lesssim \delta_n^{-1}$ satisfy
$$r_n^2\, \phi_n\!\left( \frac{1}{r_n} \right) \le \sqrt{n}, \quad \text{for every } n.$$
If the approximate sieve estimator $f_n$ satisfies $Q_n(f_n) \le Q_n(\pi_{W_n} f_0) + O_P\!\left(r_n^{-2}\right)$ and $\rho_n(f_n - \pi_{W_n} f_0)$ converges to zero in outer probability defined on $(\Omega^*, \mathcal{A}^*, P^*)$, then
$$\rho_n(f_n - \pi_{W_n} f_0) = O_{P^*}\!\left( r_n^{-1} \right).$$
If the displayed conditions are valid for $\alpha = \infty$, then the condition that $f_n$ is consistent is unnecessary.


The two supremum upper-bound conditions in Remark 4.3.1 have been proven in Shen, Jiang, Sakhanenko, and Lu (2019). We state them in the remark below.

Remark 4.3.2. (Lemma 4.1 and Lemma 4.2 in Shen, Jiang, Sakhanenko, and Lu (2019))

• For every $n$ and $\delta > 8\rho_n(\pi_{W_n} f_0 - f_0)$, we have
$$\sup_{\substack{\delta/2 \le \rho_n(f - \pi_{W_n} f_0) \le \delta \\ f \in \mathcal{F}_{W_n}}} \left[ \bar{Q}_n(\pi_{W_n} f_0) - \bar{Q}_n(f) \right] \lesssim -\delta^2.$$

• For every sufficiently large $n$ and $\delta > 8\rho_n(\pi_{W_n} f_0 - f_0)$, we have
$$\mathbb{E}_{P^*}\!\left[ \sup_{\substack{\delta/2 \le \rho_n(f - \pi_{W_n} f_0) \le \delta \\ f \in \mathcal{F}_{W_n}}} \sqrt{n}\, \left[ \left(Q_n - \bar{Q}_n\right)(\pi_{W_n} f_0) - \left(Q_n - \bar{Q}_n\right)(f) \right] \right] \lesssim \int_0^\delta \sqrt{\ln\!\left( N(\eta, \mathcal{F}_{W_n}, \rho_n) \right)}\, d\eta.$$

We can then finish the proof of the major theorem regarding the rate of convergence of $f_n$.

4.4 Asymptotic Normality

We show that $f_n - f_0$ is indeed asymptotically Gaussian under certain assumptions. We follow the same procedure as the proof of asymptotic normality for one-layer sigmoid networks in Shen, Jiang, Sakhanenko, and Lu (2019), which is inspired by the general theory on asymptotic normality in Shen (1997). This general asymptotic normality requires stronger growth conditions than consistency. To achieve this, we depart from our usual assumption that each ReLU network has fixed depth: the networks are now allowed to have growing depth. We note that this does not change any of the proofs or circumstances above for consistency and the sieve-based rate of convergence. This is done so that we can use Theorem 1 from Yarotsky (2017), which requires ReLU network classes with growing depth. We then have a flexible sieve-sequence rate $\rho_n(\pi_{W_n} f_0 - f_0)$ that can be adjusted to meet the asymptotic Gaussianity requirements. We replace $L_d$ with $L_n$ for clarity.

We also require that $f_0 \in \left\{ f \in C^0([0,1]^d) \cap W^{k,\infty}([0,1]^d) \mid \|f\|_{W^\infty} \le M_W \right\}$ for some $M_W > 0$ and $k \in \mathbb{N}$. $W^{k,\infty}([0,1]^d)$ is the Sobolev space on $[0,1]^d$, composed of functions whose derivatives up to order $k$ are defined in the weak sense (via integration by parts for $d = 1$, or in the sense of distributions for $d > 1$). This space is a Banach space with respect to the norm $\|f\|_{W^\infty} := \max_{\mathbf{k} : 0 \le |\mathbf{k}| \le k} \| D^{\mathbf{k}} f \|_{L^\infty([0,1]^d)}$, where $\mathbf{k} \in (\mathbb{N}\cup\{0\})^d$,
$$D^{\mathbf{k}} f(\mathbf{x}) := \frac{\partial^{|\mathbf{k}|} f}{\partial x_1^{k_1} \partial x_2^{k_2} \cdots \partial x_d^{k_d}}$$
is the corresponding weak derivative, and $x_1, \ldots, x_d$ and $k_1, \ldots, k_d$ are the elements of the vectors $\mathbf{x}$ and $\mathbf{k}$, respectively.

We use the Gateaux derivative of $Q_n(f)$ at $f_0$ in the direction of $f - f_0$.19 It is immediate that
$$dQ_n(f_0; f - f_0) = \lim_{\tau \to 0} \frac{Q_n\!\left( f_0 + \tau(f - f_0) \right) - Q_n(f_0)}{\tau} = \lim_{\tau \to 0} \frac{1}{n\tau} \left\{ \sum_{i=1}^{n} \left[ y_i - f_0(\mathbf{x}_i) - \tau\!\left( f(\mathbf{x}_i) - f_0(\mathbf{x}_i) \right) \right]^2 - \sum_{i=1}^{n} \left[ y_i - f_0(\mathbf{x}_i) \right]^2 \right\} = -\frac{2}{n} \sum_{i=1}^{n} \varepsilon_i \left( f(\mathbf{x}_i) - f_0(\mathbf{x}_i) \right)$$
with the related first-order Taylor remainder term
$$R_1(f_0; f - f_0) = Q_n(f) - Q_n(f_0) - dQ_n(f_0; f - f_0) = \frac{1}{n} \sum_{i=1}^{n} \left[ y_i - f(\mathbf{x}_i) \right]^2 - \frac{1}{n} \sum_{i=1}^{n} \left[ y_i - f_0(\mathbf{x}_i) \right]^2 + \frac{2}{n} \sum_{i=1}^{n} \varepsilon_i \left( f(\mathbf{x}_i) - f_0(\mathbf{x}_i) \right)$$
$$= \frac{1}{n} \sum_{i=1}^{n} \left[ \varepsilon_i + f_0(\mathbf{x}_i) - f(\mathbf{x}_i) \right]^2 - \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i^2 + \frac{2}{n} \sum_{i=1}^{n} \varepsilon_i \left( f(\mathbf{x}_i) - f_0(\mathbf{x}_i) \right) = \frac{1}{n} \sum_{i=1}^{n} \left[ f(\mathbf{x}_i) - f_0(\mathbf{x}_i) \right]^2 = \rho_n(f - f_0)^2.$$

Note that the Gateaux derivative of $Q_n$ can be defined because $\mathcal{F}$ is a convex vector space. We define a pseudo-scalar product $\langle \cdot, \cdot \rangle_{\rho_n} : \mathcal{F} \times \mathcal{F} \to \mathbb{R}$ by
$$\langle f, g \rangle_{\rho_n} = \frac{1}{n} \sum_{i=1}^{n} f(\mathbf{x}_i)\, g(\mathbf{x}_i)$$
where the subscript $\rho_n$ indicates that $\langle f - g, f - g \rangle_{\rho_n} = \rho_n(f - g)^2$. The proof that $\langle \cdot, \cdot \rangle_{\rho_n}$ is indeed a pseudo-scalar product is given in the proof of Proposition 6.2 in Shen, Jiang, Sakhanenko, and Lu (2019).

We also make use of the following remark, which is useful for bounding the empirical process $\sqrt{n}\, dQ_n(f_0; f - f_0)$ in the proof.

Remark 4.4.1. (Lemma 5.1 in Shen, Jiang, Sakhanenko, and Lu (2019)). Let $X_1, \ldots, X_n$ be independent random variables, with $X_i$ distributed under the probability measure $P_i$. Define the empirical process $\nu_n(g)$ as
$$\nu_n(g) := \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left[ g(X_i) - \mathbb{E}_{P_i}[g(X_i)] \right].$$
Let $\mathcal{G}_n = \{ g : \|g\|_\infty \le M_n \}$, $\varepsilon > 0$, and $V \ge \sup_{g \in \mathcal{G}_n} \frac{1}{n} \sum_{i=1}^{n} \mathrm{Var}[g(X_i)]$ be arbitrary. Define
$$\psi(B, n, V) := \frac{B^2}{2V \left( 1 + \dfrac{B M_n}{2\sqrt{n}\, V} \right)}.$$
Suppose $\ln\!\left( N(u, \mathcal{G}_n, \|\cdot\|_\infty) \right) \le A_n u^{-r}$ for some $0 < r < 2$ and $u \in (0, a]$, where $a$ is a small positive number, and there exist positive constants $K_i = K_i(r, \varepsilon)$, $i = 1, 2$, such that
$$B \ge K_1 A_n^{\frac{2}{r+2}} M_n^{\frac{2-r}{r+2}} n^{\frac{r-2}{2(r+2)}} \vee K_2 A_n^{1/2} V^{\frac{2-r}{4}}.$$

19For further algebraic details see Section 5 of Shen, Jiang, Sakhanenko, and Lu (2019).


Then
$$P^*\!\left( \sup_{g \in \mathcal{G}_n} |\nu_n(g)| > B \right) \le 5 \exp\!\left( -(1 - \varepsilon)\, \psi(B, n, V) \right).$$

Now, we are ready to state the asymptotic Gaussianity exhibited by fn.

Theorem 4.4.1 (Asymptotic Normality). Suppose that $\eta_n = o\!\left(r_n^{-2}\right)$, and also
$$r_n^{-1} = o\!\left(n^{-1/2}\right),$$
$$\sqrt{ M^*_{n,L_n+1}\, C^*_{n,d,L_n+1,W_n} } = o\!\left(n^{1/4}\right),$$
$$\rho_n\!\left( \pi_{W_n} f_0 - f_0 \right) = o\!\left( \min\!\left\{ n^{-1/4},\ n^{-1/6} \left( C^*_{n,d,L_n+1,W_n} \right)^{-1/3} \right\} \right).$$
Then the distribution of the statistic
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left( f_n(\mathbf{x}_i) - f_0(\mathbf{x}_i) \right)$$
approaches a normal distribution as $n \to \infty$.

4.4.1 Asymptotic Normality Conditions Satisfaction for Sufficiently Smooth f0

In this section, we discuss the growth-rate requirements that satisfy the conditions of the Asymptotic Normality Theorem, which are very demanding. As stated before, we use the following remark, which is a restatement of Theorem 1 from Yarotsky (2017).

Remark 4.4.2. (Theorem 1 from Yarotsky (2017)). For any function
$$f \in \mathcal{G}^* := \left\{ f \in W^{k,\infty}([0,1]^d) \mid \|f\|_{W^\infty} \le 1 \right\}$$
and any $k, d \in \mathbb{N}$, $\varepsilon \in (0,1)$, there is a feed-forward ReLU network, whose layers may be connected to layers beyond their adjacent layers, with a weight assignment that

• is capable of expressing $f$ with error $\varepsilon$;

• has depth at most $c(\ln(1/\varepsilon) + 1)$ and at most $c\, \varepsilon^{-d/k} (\ln(1/\varepsilon) + 1)$ weights and hidden-layer nodes, for some constant $c = c(d, k)$.

This remark makes it possible to construct a sieve sequence $\pi_{W_n} f_0$ satisfying $\|\pi_{W_n} f_0 - f_0\|_\infty = \varepsilon_n$. Note that any function $f$ satisfying $\|f\|_{W^\infty} \le K$, $K > 0$, can be related to an $f^* \in \mathcal{G}^*$ by $f^* = \frac{1}{K} f$. We can then take $\varepsilon_n$ to be a sequence in $(0,1)$ such that $\varepsilon_n \downarrow 0$. One might question the possibility of obtaining such weight assignments from a compact $\Gamma_n$ that can accommodate $\varepsilon_n$. We emphasize that $\Gamma_n$ can be made as large as necessary and can be replaced by other compact sets. For example, one can take $M^{(\gamma)}_{n,u,j,k} = M'$, where $M'$ can be taken arbitrarily large, and replace the $\Gamma_n$ constructed from element-wise bounds on $\gamma_{u,j,k}$ by the set defined by $\ell_1$-norm bounds on $\sum_{k=0}^{H_{n,u-1}} |\gamma_{u,j,k}|$, where the bounds are $M_{n,u} = \sum_{j=1}^{H_{n,u}} \sum_{k=0}^{H_{n,u-1}} M'$. The resulting $\Gamma_n$ is still compact. However, as $M'$ can be made as large as necessary, the sieve sequence $\pi_{W_n} f_0$ has its tail in $\Gamma_n$ for sufficiently large $n$, as it converges to $f_0$.

We show that the ReLU network required by the remark above can be embedded in a ReLU network with layers connected only to their adjacent layers, which is called a multi-layer perceptron. Our multi-layer perceptron also assumes that a hidden-layer node is connected to all nodes of the previous layer. Our idea is similar to Lemma 1 in Farrell, Liang, and Misra (2019), although we use a hidden-layer-node upper bound instead of a weight bound. Note that all ReLU networks described in the previous sections are multi-layer perceptrons; we give them an explicit name in this section for the sake of clarity. Elsewhere, what we mean by ReLU networks are multi-layer perceptrons, as they are the most commonly used ReLU networks in practice.

Lemma 4.4.1. If $\theta$ is a ReLU feed-forward network with non-adjacent layer connections, $N_n$ hidden-layer nodes, and $L_n$ hidden layers, then there is a ReLU multi-layer perceptron $\theta'$ with full previous-layer connections, $H_n$ nodes per hidden layer, and $L_n$ hidden layers such that $\theta(\mathbf{x}) = \theta'(\mathbf{x})$, where $\mathbf{x}$ is the input vector, and $H_n \le N_n L_n + d$.
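As a small numerical illustration of the idea behind this lemma (a sketch with made-up weights, not the construction used in the proof), a network with a skip connection from the input to the second hidden layer can be matched by a multi-layer perceptron that carries the skipped value forward through an identity ReLU unit; this works because inputs lie in $[0,1]$ and hidden outputs are non-negative.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def theta(x):                               # skip connection: x[0] feeds layer 2 directly
    a = relu(0.8 * x[0] - 0.3 * x[1] + 0.1)         # hidden layer 1
    c = relu(1.5 * a + 0.7 * x[0] - 0.2)            # hidden layer 2 (uses x[0] again)
    return 2.0 * c - 1.0                            # linear output

def theta_prime(x):                         # multi-layer perceptron, adjacent layers only
    a = relu(0.8 * x[0] - 0.3 * x[1] + 0.1)
    p = relu(1.0 * x[0])                            # identity unit carrying x[0] forward
    c = relu(1.5 * a + 0.7 * p - 0.2)
    return 2.0 * c - 1.0

x = np.random.default_rng(1).uniform(size=(100, 2))
assert np.allclose([theta(v) for v in x], [theta_prime(v) for v in x])
```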

From this point on, we again refer to multi-layer perceptrons as ReLU networks. First, for bounded ReLU networks, $M^{**}_{n,L^*_d+1} \le \max(|LB_{f_0}|, |UB_{f_0}|)$, and both $|LB_{f_0}|$ and $|UB_{f_0}|$ can be taken arbitrarily large. We emphasize again that arbitrarily large-magnitude bounds enable the bounded ReLU networks to be regarded as standard ReLU networks in practice, both during training and prediction. Moreover, as the bounds no longer depend on $n$, this allows flexibility in the convergence-rate settings needed to satisfy the Asymptotic Normality conditions. We therefore work with bounded ReLU networks under the assumption of very large $|LB_{f_0}|$ and $|UB_{f_0}|$.

Next, we derive conditions under which all the conditions of the Asymptotic Normality Theorem are satisfied. Suppose $f_0$ satisfies $k = ud$ for some $u \in \mathbb{N}$, and the sieve-sequence error is $\varepsilon_n = n^{-a}$ for some $a > 0$. Under this polynomial-in-$n$ condition on $\varepsilon_n$, both $H_n$ and $L_n$ grow at most polynomially in $n$. Therefore, $W_n = O(H_n^2 L_n)$. As $\rho_n(\pi_{W_n} f_0 - f_0) \le \varepsilon_n$ and $C^*_{n,d,L_n+1,W_n} = O(W_n L_n)$, the two asymptotic Gaussianity rate conditions can thus be written as
$$H_n^2 L_n^2 = o\!\left(n^{1/4}\right),$$
$$\varepsilon_n = o\!\left( \min\!\left\{ n^{-1/4},\ n^{-1/6} \left( C^*_{n,d,L_n+1,W_n} \right)^{-1/3} \right\} \right),$$
where $H_n = O(N_n L_n)$ by Lemma 4.4.1, and $N_n$ is the number of hidden-unit nodes in the original, possibly non-multi-layer-perceptron ReLU networks from which the ReLU sieve sequence $\pi_{W_n} f_0$ is constructed. Remark 4.4.2 tells us that
$$N_n = c\, n^{a/u} \left( \frac{a}{u} \ln(n) + 1 \right), \qquad L_n = c \left( \frac{a}{u} \ln(n) + 1 \right),$$
and these, together with the rewritten rate conditions, yield
$$N_n^2 L_n^3 = O\!\left( c^5 n^{2a/u} \left( \frac{a}{u} \ln(n) + 1 \right)^5 \right),$$
$$n^{-a} = o\!\left(n^{-1/4}\right),$$
$$n^{-a} = o\!\left( n^{-1/6}\, c^{-5/3}\, n^{-2a/(3u)} \left( \frac{a}{u} \ln(n) + 1 \right)^{-5/3} \right),$$
and these conditions lead to
$$n^{2a/u} < n^{1/4}, \qquad n^{a} > n^{1/4}, \qquad \text{and} \qquad n^{-a} < n^{-1/6}\, n^{-2a/(3u)},$$
which then simplify to
$$\frac{1}{4} < a < \frac{u}{8} \qquad \text{and} \qquad a > \frac{u}{6u - 4}. \tag{4.1}$$
The last two conditions are satisfied for every $u \ge 3$, as the function $b : (0,\infty) \to \mathbb{R}$, $b(x) = \frac{x}{6x - 4}$, is decreasing on $[1, \infty)$.
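As a quick check of (4.1) (a sketch; the bounds are exactly those displayed above), the admissible interval for $a$ is non-empty precisely from $u = 3$ onward:

```python
# Check of condition (4.1): for which u is there an admissible exponent a?
def admissible_a(u):
    lo = max(0.25, u / (6 * u - 4))   # a > 1/4 and a > u/(6u - 4)
    hi = u / 8                        # a < u/8
    return (lo, hi) if hi > lo else None

for u in range(1, 7):
    print(u, admissible_a(u))         # the interval is non-empty from u = 3 onward
```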

5 Monte Carlo analysis

These simulations are meant to confirm that the multi-layer ReLU network sieve estimator $f_n$ does indeed converge to the true regression function $f_0$, and that their difference is asymptotically normal. As $f_0$ rarely has the same form as the estimating neural network $f_n$, parameter comparisons such as those in Section 4.1 of Shen, Jiang, Sakhanenko, and Lu (2019) are not practically important, because they cannot be carried out for $f_n$ and $f_0$ with different functional forms. Instead of studying parameter consistency, one can therefore study the asymptotic properties of the estimating function without worrying about parameter consistency.20

20 It should be noted that we have carried out an additional Monte Carlo analysis related to measuring asset risk premia in finance. To save space, we have not reported the results; however, they are available upon request. The results are in line with Gu, Kelly, and Xiu (2020).


5.1 Consistency of ReLU feed-forward network

We conduct a simulation of $y_i = f_0(x_i) + \varepsilon_i$ to show the probabilistic convergence of $f_n$. We simulate $x_i$ from the uniform distribution on $[0,1]$, i.e., $x_i \sim U[0,1]$, and the residuals are independent and identically distributed normal with mean zero and standard deviation 0.7, i.e., $\varepsilon_i \sim \text{i.i.d. } \mathcal{N}(0, 0.7^2)$. The functions that serve as the true mean function $f_0(x)$ are:

• A sigmoid function
$$f_0(x) = 5 + 18\sigma(9x - 2) - 12\sigma(2x - 9)$$

• A periodic function
$$f_0(x) = \sin(2\pi x) + \tfrac{1}{3}\cos(3\pi x + 3)$$

• A non-differentiable function
$$f_0(x) = \begin{cases} 8\left(\tfrac{1}{2} - x\right), & \text{if } x \in \left[0, \tfrac{1}{2}\right] \\ 10\sqrt{x - \tfrac{1}{2}}\,(2 - x), & \text{if } x \in \left(\tfrac{1}{2}, 1\right] \end{cases}$$

• A superposition of a sigmoid and a periodic function
$$f_0(x) = 5\sin(8\pi x) + 18\sigma(9x - 2) - 12\sigma(2x - 9).$$

Note that we have chosen functions with functional forms similar to the simulation functions in Shen, Jiang, Sakhanenko, and Lu (2019), but with larger parameter values. Although defined on the very short range $[0,1]$, they exhibit significant variation in their values. They are harder to fit than the functions used in Shen, Jiang, Sakhanenko, and Lu (2019), which are much gentler. This difficulty in fitting makes the comparison between $f_0$ and $f_n$ more interesting, as the two functions are much more likely to have visibly different plots for smaller values of $n$. Also, as we compare the performance of multi-layer ReLU and one-layer sigmoid networks, the network with better performance is more likely to show significantly better numerical and visual convergence when the $f_0$ are challenging to fit.

To conduct the simulation, we take $M^{(\gamma)}_{n,u,j,k}$ for ReLU networks and $V_n$ for sigmoid networks to be $M'$ and $M' r_n$, respectively,21 where $M'$ is a very large number, one possible value being $M' = 10^{100{,}000}$. We can replace the original $\Gamma_n$ with the new compact set defined by $\ell_1$-norm bounds on $\sum_{k=0}^{H_{n,u-1}} |\gamma_{u,j,k}|$, with the bounds $M_{n,u} = \sum_{j=1}^{H_{n,u}} \sum_{k=0}^{H_{n,u-1}} M'$. Remark 4.2.2 guarantees

21See Shen, Jiang, Sakhanenko, and Lu (2019).


that our output-unbounded ReLU networks can be seen as output-bounded with the upper and

lower bounds that are very large and small, respectively, and these bounds are independent of

sample size n.

By bounding the parameter sets and the output with a large real number, we can conduct the training minimization as an unconstrained optimization, since the parameter values always remain inside the set in the coded implementation. This simplifies the implementation, as one can use common gradient descent algorithms instead of the subgradient projection algorithm. The subgradient projection algorithm projects the iterate at each gradient descent step onto the convex set (e.g., $\Gamma_n$) in which the parameters are assumed to lie, and it therefore reduces to standard gradient/subgradient descent if at each iteration the parameters stay inside the convex set.22

The training is done using Keras 2.2.4 for Python 3.7 in Spyder 3.3.4. The gradient algorithm used is Nadam with learning rate 0.001. The simulation is conducted by setting the growth rate of $H_n$, and of $r_n$ for one-layer sigmoid networks, to $n^{0.4}$. For multi-layer ReLU networks, $L_d = 2$. The sample sizes are $n \in \{2 \times 10^3, 5 \times 10^3, 2 \times 10^4, 5 \times 10^4\}$. For the superposition $f_0$, as the convergence is slower, we also consider $n \in \{2 \times 10^5, 5 \times 10^5\}$. The results can be seen in Table 1. For $Q_n(f_n)$, the values are considered good if they are close to $\sigma^2 = 0.49$, the value of $\bar{Q}_n(f_0)$.
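The setup just described can be reproduced in a few lines. The sketch below is a minimal version of the consistency experiment for the sigmoid $f_0$; it uses tf.keras in place of the standalone Keras 2.2.4 reported above, while the optimizer, learning rate, batch and epoch numbers of 32, $H_n \approx n^{0.4}$, and $L_d = 2$ follow the description in this section. Everything else is an illustrative assumption.

```python
import numpy as np
import tensorflow as tf

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def f0(x):                                   # the sigmoid test function
    return 5.0 + 18.0 * sigma(9.0 * x - 2.0) - 12.0 * sigma(2.0 * x - 9.0)

def run(n, L_d=2, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=(n, 1))
    y = f0(x) + rng.normal(0.0, 0.7, size=(n, 1))

    H_n = int(round(n ** 0.4))               # nodes per hidden layer
    model = tf.keras.Sequential(
        [tf.keras.layers.Dense(H_n, activation="relu") for _ in range(L_d)]
        + [tf.keras.layers.Dense(1)]         # linear output layer
    )
    model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=0.001), loss="mse")
    model.fit(x, y, batch_size=32, epochs=32, verbose=0)

    f_hat = model.predict(x, verbose=0)
    rho_sq = float(np.mean((f_hat - f0(x)) ** 2))   # rho_n(f_n - f0)^2
    q_n = float(np.mean((y - f_hat) ** 2))          # Q_n(f_n), compare with 0.49
    return rho_sq, q_n

print(run(2_000))
```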

An inspection of the errors, $\rho_n(f_n - f_0)^2$, and the least-squares errors, $Q_n(f_n)$, reveals two major points. Firstly, as the sample size increases, $\rho_n(f_n - f_0)^2$ converges to zero and $Q_n(f_n)$ approaches $\sigma^2 = 0.49$, whether the activation function is ReLU or sigmoid, across all simulated functions $f_0$. In fact, the errors $\rho_n(f_n - f_0)^2$ show a decreasing pattern as the sample size increases for both ReLU and sigmoid. Secondly, when the simulated function $f_0$ has a more complicated structure, the two-layer ReLU neural network outperforms the one-layer sigmoid in terms of convergence rate. Overall, the consistency of the estimated function $f_n$ is confirmed by the results provided in Table 1.

We close this section with a detailed comparison of two-layer ReLU networks to one-layer sigmoid networks as used in Shen, Jiang, Sakhanenko, and Lu (2019). An inspection of Table 1 reveals that the convergence speed of the one-layer sigmoid networks matches that of the multi-layer ReLU network when $f_0$ is the sigmoid function. This result holds both numerically (the sigmoid panel of Table 1) and visually (Figure 2). This is not surprising, as sigmoid neural networks are themselves linear combinations of sigmoid functions.

As evidenced by Figures 3-5, the two-layer ReLU network can detect the fluctuating patterns and the

22See Zhou and Feng (2018) and references therein.


non-differentiable point better and more quickly than the one-layer sigmoid network. The sigmoid networks become somewhat wavy and less accurate when approaching the point of non-differentiability at larger $n$ (Figure 4). As expected, the ReLU networks also show faster numerical convergence for fluctuating and non-differentiable $f_0$, as indicated by Table 1.

5.2 Asymptotic Normality

This part focuses on the simulation of the Asymptotic Normality Theorem. For this simulation, the number of nodes per hidden layer is chosen to be $H_n = 9n^{0.1}(0.1\ln(n) + 1)^2$, and the hidden-layer depth is $L_n = 3(0.1\ln(n) + 1)$. These growth rates follow the bounding argument from Remark 4.4.2. As deep neural networks are notorious for their training difficulties, we conduct the training with batch size 4 and epoch size 40; the number of training iterations is thus more than 8 times that used for the fixed-depth ReLU networks. The training is conducted with the same device, operating system, Python library, method, and learning rate as in the consistency simulations. The training is done with data sample sizes $n \in \{2 \times 10^3, 5 \times 10^3, 2 \times 10^4, 5 \times 10^4, 2 \times 10^5\}$.

For this simulation, the true regression functions $f_0$ are the first two functions from the consistency simulation, together with the sigmoid-periodic superposition
$$f_0(x) = 10\sin(16\pi x) + 12\sigma(2x - 9) - 18\sigma(9x - 2), \qquad x \in [0,1].$$
We choose them because all of these functions are infinitely differentiable, satisfying the smoothness requirement for asymptotic normality in (4.1). As before, the true target functions $f_0$ used for asymptotic normality are significantly steeper than those used in the normality simulation of Shen, Jiang, Sakhanenko, and Lu (2019), which makes attaining the required estimation accuracy and stability more challenging. After the training is done, we repeat the data simulation 200 times (similar to Shen, Jiang, Sakhanenko, and Lu (2019)) to obtain samples of the statistic
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left( f_n(\mathbf{x}_i) - f_0(\mathbf{x}_i) \right).$$
Note that the $\rho_n(f_n - f_0)^2$ and $Q_n(f_n)$ values in Table 2 verify the consistency of the increasing-depth ReLU networks, as in the previous section.

Next, after standardizing the samples, we construct Q-Q plots comparing them against $\mathcal{N}(0,1)$ and conduct normality tests on them. The statistical tests used are Kolmogorov-Smirnov, Shapiro-Wilk, and d'Agostino-Pearson. We standardize the data even for Kolmogorov-Smirnov: our interest here is in the form of the asymptotic distribution itself, not its mean and variance, which explains why the standardization is done. We use the Kolmogorov-Smirnov test only to check the normality of the distribution.
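A minimal sketch of these checks using SciPy is given below; the array of 200 statistics is replaced here by a placeholder sample, since producing it requires the trained networks described above.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(0)
T = rng.normal(size=200)                       # placeholder for the 200 replicated statistics
                                               # T = n^{-1/2} * sum_i (f_n(x_i) - f0(x_i))

z = (T - T.mean()) / T.std(ddof=1)             # standardize before testing
print(st.kstest(z, "norm"))                    # Kolmogorov-Smirnov against N(0, 1)
print(st.shapiro(z))                           # Shapiro-Wilk
print(st.normaltest(z))                        # d'Agostino-Pearson (skewness + kurtosis)
```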


The Q-Q plots in Figure 6 clearly indicate the normality of the 200 samples of the statistic.23 Almost all of the statistical test results in Table 2 do not reject the normality of these samples at the 5% significance level. The notable exception is the sigmoid $f_0$ with $n = 5 \times 10^4$ (Table 2), where Shapiro-Wilk and d'Agostino-Pearson reject normality. Our explanation is the presence of two extreme outliers separated from the other samples in the case of the sigmoid $f_0$ with $n = 5 \times 10^4$ (Figure 6), which makes the tail slightly heavier. This slight tail heaviness creates a problem for the Shapiro-Wilk test, which takes the variance of the samples into account, and for d'Agostino-Pearson, which considers the samples' skewness and kurtosis. However, the bulk of the sample remains on the line and thus exhibits normality. Note also that for the other values of $n$, neither of these tests rejects normality.

6 Summary and future research

It has been shown that the regularization of neural networks requires more care than for their machine learning counterparts, due to severe nonlinearity and heavy parameterization. The same is true for their explainability and interpretation (if not impossible). To help overcome these obstacles, we provide some asymptotic properties for ReLU feed-forward networks. More specifically, we derived three unexplored asymptotic properties of a parallel ReLU network sieve space in $C^0([0,1]^d)$: consistency, a sieve-based convergence rate, and asymptotic normality in the product probability space $(\Omega^*, \mathcal{A}^*, P^*)$ of independent identically distributed errors.

To validate the theoretical findings, namely consistency, and to compare the convergence of multi-layer neural networks with ReLU and sigmoid activation functions, we conducted a Monte Carlo analysis that confirms our theoretical findings. Furthermore, although both converge, the multi-layer ReLU networks have better convergence and pattern recognition in all cases except the sigmoid $f_0$, for which both are equally good. It is thus worthwhile to explore the statistical properties of ReLU neural networks, which remains uncommon.

There are several directions for future investigation. Although consistency can be derived, it requires enlarging the errors' probability space; one could explore consistency under the single probability space $(\Omega, \mathcal{A}, P)$. Furthermore, exploring the asymptotic distribution and statistical tests of multi-layer ReLU networks is also an interesting and exciting direction for future research.

23We got the same results for other test functions. Results are available upon request.


[Figure 1 diagram: input layer (x1, x2); hidden layer 1 (h1,1, h1,2, h1,3); hidden layer 2 (h2,1, h2,2, h2,3); output layer (y1).]

Figure 1: Architecture of a fully connected feed-forward network with two hidden layers.

An example of the multi-layer feed-forward network described above. The green, blue, and orange layers indicate the input, hidden, and output layers, respectively. The number of hidden layers is $L_d = 2$, and there are $H_n = 3$ nodes per hidden layer. For the input layer, the node $x_i$ indicates the $i$th predictor (node $x_1$ is the first predictor, $x_2$ the second). For the hidden and output layers, the indices indicate the ReLU function $h_{u,j}$ associated with the related node, with $u$ and $j$ the layer and node indices, respectively, where $u = 0$ and $u = 3$ correspond to the input and output layers. For example, node 3 in the second hidden layer is the function
$$h_{2,3}(\mathbf{x}) = \mathrm{ReLU}\!\left( \sum_{k=1}^{H_n=3} \gamma_{2,3,k}\, h_{1,k}(\mathbf{x}) + \gamma_{2,3,0} \right)$$
where $\mathrm{ReLU}(x) = \max(x, 0)$. A directed arrow going from node $k$ in layer $u-1$ to node $j$ in layer $u$ represents the parameter $\gamma_{u,j,k}$. For example, the arrow from node 2 in the first hidden layer to node 3 in the second represents the parameter $\gamma_{2,3,2}$.


Figure 2: Multi-layer ReLU and one-layer sigmoid neural networks approximating $f_0(x) = 18\sigma(9x-2) - 12\sigma(2-9x) + 5$ for different sample sizes. Both the batch and epoch numbers used during training are 32. The numbers of nodes per layer, after rounding $n^{0.4}$ to the nearest integer for each choice of $n$, are 21, 30, 53, and 76, respectively, where $n$ is the sample size. For ReLU, the number of hidden layers is $L_d = 2$.

Figure 3: Multi-layer ReLU and one-layer sigmoid neural networks approximating $f_0(x) = \sin(2\pi x) + \frac{1}{3}\cos(3\pi x + 3)$ for different sample sizes. Both the batch and epoch numbers used during training are 32. The numbers of nodes per layer, after rounding $n^{0.4}$ to the nearest integer for each choice of $n$, are 21, 30, 53, and 76, respectively, where $n$ is the sample size. For ReLU, the number of hidden layers is $L_d = 2$.


Figure 4: Multi-layer ReLU and one-layer sigmoid neural networks approximating $f_0(x) = -8\left(x - \tfrac{1}{2}\right)\mathbf{1}_{0 \le x \le 0.5} + 10\sqrt{x - \tfrac{1}{2}}\,(2 - x)\,\mathbf{1}_{0.5 < x \le 1}$ for different sample sizes. Both the batch and epoch numbers used during training are 32. The numbers of nodes per layer, after rounding $n^{0.4}$ to the nearest integer for each choice of $n$, are 21, 30, 53, and 76, respectively, where $n$ is the sample size. For ReLU, the number of hidden layers is $L_d = 2$.

Figure 5: Multi-layer ReLU and one-layer sigmoid neural networks approximating $f_0(x) = 18\sigma(9x-2) - 12\sigma(2-9x) + 5\sin(8\pi x)$ for different sample sizes. Both the batch and epoch numbers used during training are 32. The numbers of nodes per layer, after rounding $n^{0.4}$ to the nearest integer for each choice of $n$, are 21, 30, 53, 76, 132, and 190, respectively, where $n$ is the sample size. For ReLU, the number of hidden layers is $L_d = 2$.


Figure 6: Q-Q plots for the multi-layer ReLU network estimation of $f_0(x) = 5 + 18\sigma(9x-2) - 12\sigma(2x-9)$, $x \in [0,1]$, for different sample sizes. The theoretical quantiles are those of $\mathcal{N}(0,1)$. The batch and epoch numbers used in the training are 4 and 40, respectively. The number of nodes per hidden layer and the depth of the ReLU networks are $H_n = 9n^{0.1}(0.1\ln(n)+1)^2$ and $L_n = 3(0.1\ln(n)+1)$, respectively, where $n$ is the sample size.


Table 1: Error and least square errors for different functions of f0.

ρn( fn − f0)2 and Qn( fn) are the error and least square errors, respectively. n is the sample size oftraining data. fn is the approximated sieve estimator. σ(.) is a Sigmoid function. ReLU: rectifiedlinear unit. The visualizations of the convergence for the sigmoid, periodic, non-differentiable, andsigmoid and periodic f0 are in Figures 2, 3, 4, and 5, respectively.

Left columns: f0(x) = 18σ(9x − 2) − 12σ(2 − 9x) + 5.   Right columns: f0(x) = sin(2πx) + (1/3)cos(3πx + 3).

              ρn(f̂n − f0)²           Qn(f̂n)                 ρn(f̂n − f0)²           Qn(f̂n)
n             ReLU      Sigmoid       ReLU      Sigmoid       ReLU      Sigmoid       ReLU      Sigmoid
2 × 10^3      13.9306   51.6624       14.6018   52.8889       0.1469    0.4475        0.6433    0.9428
5 × 10^3      13.6968   13.4850       14.1075   13.9171       0.0378    0.4472        0.5305    0.9281
2 × 10^4       0.0340    0.0439        0.5223    0.5330       0.0018    0.4413        0.4907    0.9299
5 × 10^4       0.0070    0.0140        0.4950    0.5025       0.0079    0.4134        0.4958    0.9077

Left columns: f0(x) = −8(x − 1/2)·1{0 ≤ x ≤ 0.5} + 10·√(x − 1/2)·(2 − x)·1{0.5 < x ≤ 1}.   Right columns: f0(x) = 18σ(9x − 2) − 12σ(2 − 9x) + 5 sin(8πx).

              ρn(f̂n − f0)²           Qn(f̂n)                 ρn(f̂n − f0)²           Qn(f̂n)
n             ReLU      Sigmoid       ReLU      Sigmoid       ReLU      Sigmoid       ReLU      Sigmoid
2 × 10^3      0.8408    3.6109        1.3705    4.1554        26.7791   56.6896       27.5223   57.9058
5 × 10^3      0.5048    2.3753        1.0156    2.9082        14.7668   25.8588       15.3108   26.3610
2 × 10^4      0.0187    0.8677        0.5058    1.3548         8.4030   12.3960        8.8938   12.8759
5 × 10^4      0.0194    0.1739        0.5076    0.6694         8.2855   11.4466        8.7745   11.9541
2 × 10^5                                                       0.9574    7.8232        1.4476    8.3020
5 × 10^5                                                       0.1372    6.4662        0.6274    6.9519
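As a reading aid for Table 1, the two reported quantities can be computed from fitted values as in the sketch below; the definitions mirror the least-squares setting used in the proofs (Qn is the empirical least-squares criterion and ρn(f̂n − f0)² the in-sample squared deviation from f0), and the placeholder arrays are purely illustrative.

```python
import numpy as np

def rho_sq(fhat_vals, f0_vals):
    """In-sample squared error rho_n(fhat - f0)^2 = (1/n) * sum (fhat(x_i) - f0(x_i))^2."""
    return np.mean((fhat_vals - f0_vals) ** 2)

def Q_n(fhat_vals, y):
    """Empirical least-squares criterion Q_n(fhat) = (1/n) * sum (y_i - fhat(x_i))^2."""
    return np.mean((y - fhat_vals) ** 2)

# Placeholder arrays standing in for fhat(x_i), f_0(x_i), and y_i on the training sample.
fhat_vals = np.zeros(100)
f0_vals = np.zeros(100)
y = np.zeros(100)
print(rho_sq(fhat_vals, f0_vals), Q_n(fhat_vals, y))
```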


Table 2: Goodness-of-fit test results for different functions f0.

ρn(f̂n − f0)² and Qn(f̂n) are the estimation error and the least-squares error, respectively. n is the sample size of the training data. f̂n is the approximate sieve estimator. σ(·) is the sigmoid function. KS: Kolmogorov-Smirnov test. SW: Shapiro-Wilk test. AP: d'Agostino-Pearson test. The Q-Q plots of the standardized data are provided in Figure 6.

f0(x) = 18σ(9x − 2) − 12σ(2 − 9x) + 5

n             ρn(f̂n − f0)²   Qn(f̂n)    KS (p-value)       SW (p-value)       AP (p-value)
2 × 10^3      0.0722          0.5801     0.0505 (0.7300)    0.9967 (0.9485)    0.0103 (0.9948)
5 × 10^3      0.1054          0.5870     0.0427 (0.8668)    0.9955 (0.8232)    1.3818 (0.5011)
2 × 10^4      0.0897          0.5775     0.0495 (0.7050)    0.9954 (0.8200)    0.2040 (0.9029)
5 × 10^4      0.0722          0.5549     0.0332 (0.9783)    0.9849 (0.0314)    7.8790 (0.0194)
2 × 10^5      0.0490          0.5416     0.0491 (0.7105)    0.9936 (0.5432)    0.4241 (0.8089)

f0(x) = sin(2πx) + (1/3)cos(3πx + 3)

n             ρn(f̂n − f0)²   Qn(f̂n)    KS (p-value)       SW (p-value)       AP (p-value)
2 × 10^3      0.0103          0.5032     0.0403 (0.9230)    0.9939 (0.5901)    1.5961 (0.4501)
5 × 10^3      0.0470          0.5323     0.0435 (0.8507)    0.9948 (0.7213)    2.2750 (0.3206)
2 × 10^4      0.0075          0.4953     0.0376 (0.9376)    0.9928 (0.4426)    1.0288 (0.5978)
5 × 10^4      0.0148          0.4992     0.0449 (0.8060)    0.9912 (0.2695)    5.0601 (0.0796)
2 × 10^5      0.0048          0.4967     0.0442 (0.8209)    0.9957 (0.8561)    0.5117 (0.7742)

f0(x) = 12σ(2 − 9x) − 18σ(9x − 2) + 10 sin(16πx)

n             ρn(f̂n − f0)²   Qn(f̂n)    KS (p-value)       SW (p-value)       AP (p-value)
2 × 10^3      41.3207         41.6786    0.0485 (0.7752)    0.9949 (0.7458)    0.0539 (0.9733)
5 × 10^3      21.0600         21.6952    0.0582 (0.5192)    0.9929 (0.4553)    2.3510 (0.3086)
2 × 10^4       0.5636          1.0433    0.0465 (0.7750)    0.9939 (0.5947)    1.2024 (0.5481)
5 × 10^4       0.9429          1.4291    0.0348 (0.9662)    0.9942 (0.6418)    1.3650 (0.5053)
2 × 10^5       0.2481          0.7400    0.0317 (0.9866)    0.9959 (0.8807)    1.5539 (0.4597)
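The three goodness-of-fit statistics reported in Table 2 are standard normality tests and can be recomputed with scipy as sketched below; the array z is a placeholder for the standardized Monte Carlo output whose limiting distribution should be N(0, 1).

```python
import numpy as np
from scipy import stats

# z: standardized values whose limiting distribution should be N(0, 1);
# a standard normal placeholder is used here in place of the Monte Carlo output.
z = np.random.standard_normal(200)

ks_stat, ks_p = stats.kstest(z, "norm")    # Kolmogorov-Smirnov (KS) against N(0, 1)
sw_stat, sw_p = stats.shapiro(z)           # Shapiro-Wilk (SW)
ap_stat, ap_p = stats.normaltest(z)        # d'Agostino-Pearson (AP)

print(f"KS {ks_stat:.4f} ({ks_p:.4f})  SW {sw_stat:.4f} ({sw_p:.4f})  AP {ap_stat:.4f} ({ap_p:.4f})")
```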


A Proofs

Proof of Lemma 4.1.1. Suppose first that $u = 1$. As the ReLU function is Lipschitz with constant 1, we have, for all $1 \le j \le H_{n,1}$,
\[
\|h_{1,j}\|_\infty
= \sup_{x\in[0,1]^d}\Big|\mathrm{ReLU}\Big(\sum_{k=1}^{d}\gamma_{1,j,k}x_k + \gamma_{1,j,0}\Big)\Big|
\le \sup_{x\in[0,1]^d}\Big|\sum_{k=1}^{d}\gamma_{1,j,k}x_k + \gamma_{1,j,0}\Big|
\le M_{n,1},
\]
since $\mathcal{X} = [0,1]^d$. Thus $\sup_{1\le j\le H_{n,1}}\|h_{1,j}\|_\infty \le M_{n,1} = \prod_{i=0}^{1}M_{n,i}$, as $M_{n,0} = 1$.

Suppose now $2 \le u \le L_d + 1$. Then, for all $1 \le j \le H_{n,u}$,
\[
\begin{aligned}
\|h_{u,j}\|_\infty
&= \sup_{x\in[0,1]^d}\Big|\mathrm{ReLU}\Big(\sum_{k=1}^{H_{n,u-1}}\gamma_{u,j,k}\,h_{u-1,k}(x) + \gamma_{u,j,0}\Big)\Big|
\le \sup_{x\in[0,1]^d}\Big|\sum_{k=1}^{H_{n,u-1}}\gamma_{u,j,k}\,h_{u-1,k}(x) + \gamma_{u,j,0}\Big| \\
&\le \Big(\sup_{1\le k\le H_{n,u-1}}\|h_{u-1,k}\|_\infty \vee 1\Big)\sum_{k=0}^{H_{n,u-1}}|\gamma_{u,j,k}|
\le \Big(\sup_{1\le k\le H_{n,u-1}}\|h_{u-1,k}\|_\infty \vee 1\Big)M_{n,u}.
\end{aligned}
\]
We now argue by induction. If $u = 2$, then $\|h_{2,j}\|_\infty \le \prod_{i=1}^{2}M_{n,i}$, since $M_{n,1} \ge 1$. Next, if the bound holds at some $2 \le u \le L_d$, then for all $j$,
\[
\|h_{u+1,j}\|_\infty \le \Big(\sup_{1\le k\le H_{n,u}}\|h_{u,k}\|_\infty \vee 1\Big)M_{n,u+1} \le \prod_{i=1}^{u+1}M_{n,i},
\]
as $M_{n,u} \ge 1$ for all $u$. Hence $\|h_{L_d+1,1}\|_\infty \le \prod_{i=1}^{L_d+1}M_{n,i}$, and the conclusion follows.

Proof of EC2 and EC3 Satisfaction. For each fixed $\omega \in \Omega$, $Q_n^{(\omega)} : (\mathcal{F}_{W_n}, \rho_n) \to ([0,\infty), |\cdot|)$ is a mapping from a pseudo-metric space to a metric space. By the triangle inequality, for all $f, g \in \mathcal{F}_{W_n}$,
\[
\begin{aligned}
\big|Q_n^{(\omega)}(f) - Q_n^{(\omega)}(g)\big|
&= \bigg|\frac{1}{n}\sum_{i=1}^{n}\Big[\big(f(x_i)-f_0(x_i)\big)^2 - \big(g(x_i)-f_0(x_i)\big)^2\Big]
 - \frac{2}{n}\sum_{i=1}^{n}\varepsilon_i(\omega)\big(f(x_i)-g(x_i)\big)\bigg| \\
&\le \bigg|\frac{1}{n}\sum_{i=1}^{n}\big(f(x_i)-g(x_i)\big)\big(f(x_i)+g(x_i)-2f_0(x_i)\big)\bigg|
 + \bigg|\frac{2}{n}\sum_{i=1}^{n}\varepsilon_i(\omega)\big(f(x_i)-g(x_i)\big)\bigg| \\
&\le \frac{1}{n}\sum_{i=1}^{n}\big|f(x_i)-g(x_i)\big|\Big(\big|f(x_i)+g(x_i)\big| + 2\big|f_0(x_i)\big|\Big)
 + \frac{2}{n}\sum_{i=1}^{n}\big|\varepsilon_i(\omega)\big|\,\big|f(x_i)-g(x_i)\big|.
\end{aligned}
\]
As $f_0$ is a continuous function on the compact domain $[0,1]^d$, the extreme value theorem gives
\[
\big|Q_n^{(\omega)}(f) - Q_n^{(\omega)}(g)\big|
\le \frac{2M^*_{n,L_d+1} + 2\max_{x\in[0,1]^d}|f_0(x)| + 2\sup_{1\le i\le n}|\varepsilon_i(\omega)|}{n}
\sum_{i=1}^{n}\big|f(x_i)-g(x_i)\big|
= D_{n,L_d+1,\omega}\,\|f-g\|_1,
\]
with $D_{n,L_d+1,\omega} = n^{-1}\big(2M^*_{n,L_d+1} + 2\max_{x\in[0,1]^d}|f_0(x)| + 2\sup_{1\le i\le n}|\varepsilon_i(\omega)|\big)$, where $\|\cdot\|_k$ denotes the $\ell_k$-norm of the vector $\big(f(x_1)-g(x_1),\dots,f(x_n)-g(x_n)\big)$.

From the equivalence of norms on $\mathbb{R}^n$, there exists $V^* > 0$ such that $\|f-g\|_1 \le V^*\|f-g\|_2$, which leads to
\[
\big|Q_n^{(\omega)}(f) - Q_n^{(\omega)}(g)\big| \le \sqrt{n}\,D_{n,L_d+1,\omega}V^*\,\rho_n(f-g).
\]
Hence, for every $\zeta > 0$ we can take $\delta = \zeta\big(\sqrt{n}\,D_{n,L_d+1,\omega}V^*\big)^{-1} > 0$ such that, for all $f, g \in \mathcal{F}_{W_n}$, $\big|Q_n^{(\omega)}(f) - Q_n^{(\omega)}(g)\big| < \zeta$ whenever $\rho_n(f-g) < \delta$, implying continuity and hence EC2.

Next, we prove EC3. Note that
\[
\rho_n(\pi_{W_n}f_0 - f_0) \le \|\pi_{W_n}f_0 - f_0\|_\infty \le O\Big(\omega_{f_0}\big(O\big(W_n^{-1/d}\big)\big)\Big),
\]
and thus it is clear that $\mathcal{F}_{W_n}$ is still a sieve space of $\mathcal{F}$ under $\rho_n$, as long as the Denseness Assumption holds.

To prove the compactness of the pseudo-metric space $(\mathcal{F}_{W_n}, \rho_n)$, use the mapping $F : (\Gamma_n, \|\cdot\|_2) \to (\mathcal{F}_{W_n}, \rho_n)$,
\[
[\gamma_{u,j,k}] \mapsto F\big([\gamma_{u,j,k}]\big) = h_{L_d+1,1}\big(x \mid [\gamma_{u,j,k}]\big),
\]
where $h_{L_d+1,1}(x \mid [\gamma_{u,j,k}])$ uses $[\gamma_{u,j,k}]$ as the parameters of its ReLU linear combinations. Clearly $\mathcal{F}_{W_n} = F(\Gamma_n)$.

Now we prove the continuity of $F$. For every $[\gamma^{(1)}_{u,j,k}], [\gamma^{(2)}_{u,j,k}] \in \Gamma_n$,
\[
\rho_n\Big(F\big([\gamma^{(1)}_{u,j,k}]\big) - F\big([\gamma^{(2)}_{u,j,k}]\big)\Big)^2
= \frac{1}{n}\sum_{i=1}^{n}\Bigg(\sum_{k=1}^{H_{n,L_d}}\gamma^{(1)}_{L_d+1,1,k}\,h_{L_d,k}\big(x_i \mid [\gamma^{(1)}_{u,j,k}]\big) + \gamma^{(1)}_{L_d+1,1,0}
- \sum_{k=1}^{H_{n,L_d}}\gamma^{(2)}_{L_d+1,1,k}\,h_{L_d,k}\big(x_i \mid [\gamma^{(2)}_{u,j,k}]\big) - \gamma^{(2)}_{L_d+1,1,0}\Bigg)^{\!2}.
\]
Write $\gamma_{u,j}$ and $h_u$ as column vectors whose entries are ordered by $k = 1, 2, \dots, H_{n,u}$, and, for brevity, set, for $m = 1, 2$,
\[
a^{(m)}_{i,k} := \big(\gamma^{(m)}_{L_d,k}\big)^{\!\top}h_{L_d-1}\big(x_i \mid [\gamma^{(m)}_{u,k,[1:H_{n,L_d-1}]}]\big) + \gamma^{(m)}_{L_d,k,0}.
\]
This notation, together with the triangle inequality and the fact that ReLU is Lipschitz with constant 1, leads to
\[
\begin{aligned}
\rho_n\Big(F\big([\gamma^{(1)}_{u,j,k}]\big) - F\big([\gamma^{(2)}_{u,j,k}]\big)\Big)^2
&= \frac{1}{n}\sum_{i=1}^{n}\Bigg(\sum_{k=1}^{H_{n,L_d}}\gamma^{(1)}_{L_d+1,1,k}\,\mathrm{ReLU}\big(a^{(1)}_{i,k}\big) + \gamma^{(1)}_{L_d+1,1,0}
- \sum_{k=1}^{H_{n,L_d}}\gamma^{(2)}_{L_d+1,1,k}\,\mathrm{ReLU}\big(a^{(2)}_{i,k}\big) - \gamma^{(2)}_{L_d+1,1,0}\Bigg)^{\!2} \\
&\le \frac{1}{n}\sum_{i=1}^{n}\Bigg(\big|\gamma^{(1)}_{L_d+1,1,0} - \gamma^{(2)}_{L_d+1,1,0}\big|
+ \sum_{k=1}^{H_{n,L_d}}\Big|\gamma^{(1)}_{L_d+1,1,k}\,\mathrm{ReLU}\big(a^{(1)}_{i,k}\big) - \gamma^{(2)}_{L_d+1,1,k}\,\mathrm{ReLU}\big(a^{(2)}_{i,k}\big)\Big|\Bigg)^{\!2} \\
&\le \frac{1}{n}\sum_{i=1}^{n}\Bigg(\big|\gamma^{(1)}_{L_d+1,1,0} - \gamma^{(2)}_{L_d+1,1,0}\big|
+ \sum_{k=1}^{H_{n,L_d}}\big|\gamma^{(1)}_{L_d+1,1,k}\big|\,\Big|\mathrm{ReLU}\big(a^{(1)}_{i,k}\big) - \mathrm{ReLU}\big(a^{(2)}_{i,k}\big)\Big|
+ \sum_{k=1}^{H_{n,L_d}}\big|\gamma^{(1)}_{L_d+1,1,k} - \gamma^{(2)}_{L_d+1,1,k}\big|\,\Big|\mathrm{ReLU}\big(a^{(2)}_{i,k}\big)\Big|\Bigg)^{\!2} \\
&\le \frac{1}{n}\sum_{i=1}^{n}\Bigg(\big|\gamma^{(1)}_{L_d+1,1,0} - \gamma^{(2)}_{L_d+1,1,0}\big|
+ \sum_{k=1}^{H_{n,L_d}}\big|\gamma^{(1)}_{L_d+1,1,k}\big|\,\big|a^{(1)}_{i,k} - a^{(2)}_{i,k}\big|
+ \sum_{k=1}^{H_{n,L_d}}\big|\gamma^{(1)}_{L_d+1,1,k} - \gamma^{(2)}_{L_d+1,1,k}\big|\,\big|a^{(2)}_{i,k}\big|\Bigg)^{\!2}.
\end{aligned}
\]
By Lemma 4.1.1 and the triangle inequality,
\[
\begin{aligned}
\rho_n\Big(F\big([\gamma^{(1)}_{u,j,k}]\big) - F\big([\gamma^{(2)}_{u,j,k}]\big)\Big)^2
&\le \frac{1}{n}\sum_{i=1}^{n}\Bigg(\big|\gamma^{(1)}_{L_d+1,1,0} - \gamma^{(2)}_{L_d+1,1,0}\big|
+ M^*_{n,L_d-1}\sum_{k=1}^{H_{n,L_d}}\Big[\big|\gamma^{(1)}_{L_d+1,1,k}\big|\,\big\|\gamma^{(1)}_{L_d,k} - \gamma^{(2)}_{L_d,k}\big\|_1
+ \big|\gamma^{(1)}_{L_d+1,1,k}\big|\,\big|\gamma^{(1)}_{L_d,k,0} - \gamma^{(2)}_{L_d,k,0}\big|\Big] \\
&\qquad\quad + M^*_{n,L_d-1}\sum_{k=1}^{H_{n,L_d}}\big|\gamma^{(1)}_{L_d+1,1,k} - \gamma^{(2)}_{L_d+1,1,k}\big|\Big(\big\|\gamma^{(2)}_{L_d,k}\big\|_1 + \big|\gamma^{(2)}_{L_d,k,0}\big|\Big)\Bigg)^{\!2} \\
&\le \frac{1}{n}\sum_{i=1}^{n}\Bigg(\big|\gamma^{(1)}_{L_d+1,1,0} - \gamma^{(2)}_{L_d+1,1,0}\big|
+ M^*_{n,L_d-1}M_{n,L_d+1}\sum_{k=1}^{H_{n,L_d}}\Big[\big\|\gamma^{(1)}_{L_d,k} - \gamma^{(2)}_{L_d,k}\big\|_1 + \big|\gamma^{(1)}_{L_d,k,0} - \gamma^{(2)}_{L_d,k,0}\big|\Big] \\
&\qquad\quad + M^*_{n,L_d-1}M_{n,L_d}\sum_{k=1}^{H_{n,L_d}}\big|\gamma^{(1)}_{L_d+1,1,k} - \gamma^{(2)}_{L_d+1,1,k}\big|\Bigg)^{\!2} \\
&\le \big(M^*_{n,L_d+1}\big)^2\Bigg(\big|\gamma^{(1)}_{L_d+1,1,0} - \gamma^{(2)}_{L_d+1,1,0}\big|
+ \sum_{k=1}^{H_{n,L_d}}\Big[\big\|\gamma^{(1)}_{L_d,k} - \gamma^{(2)}_{L_d,k}\big\|_1 + \big|\gamma^{(1)}_{L_d,k,0} - \gamma^{(2)}_{L_d,k,0}\big|\Big]
+ \sum_{k=1}^{H_{n,L_d}}\big|\gamma^{(1)}_{L_d+1,1,k} - \gamma^{(2)}_{L_d+1,1,k}\big|\Bigg)^{\!2} \\
&\le \big(M^*_{n,L_d+1}\big)^2\,\big\|[\gamma^{(1)}_{u,j,k}] - [\gamma^{(2)}_{u,j,k}]\big\|_1^2.
\end{aligned}
\]
Using the equivalence of norms, there exists $V^* > 0$ such that $\big\|[\gamma^{(1)}_{u,j,k}] - [\gamma^{(2)}_{u,j,k}]\big\|_1 \le V^*\big\|[\gamma^{(1)}_{u,j,k}] - [\gamma^{(2)}_{u,j,k}]\big\|_2$, and hence
\[
\rho_n\Big(F\big([\gamma^{(1)}_{u,j,k}]\big) - F\big([\gamma^{(2)}_{u,j,k}]\big)\Big)^2
\le \Big(M^*_{n,L_d+1}V^*\big\|[\gamma^{(1)}_{u,j,k}] - [\gamma^{(2)}_{u,j,k}]\big\|_2\Big)^2,
\]
that is, $\rho_n\big(F([\gamma^{(1)}_{u,j,k}]) - F([\gamma^{(2)}_{u,j,k}])\big) \le M^*_{n,L_d+1}V^*\big\|[\gamma^{(1)}_{u,j,k}] - [\gamma^{(2)}_{u,j,k}]\big\|_2$. Hence, for every $\zeta > 0$ we can take $\delta = \big(M^*_{n,L_d+1}V^*\big)^{-1}\zeta > 0$ such that $\big\|[\gamma^{(1)}_{u,j,k}] - [\gamma^{(2)}_{u,j,k}]\big\|_2 < \delta$ implies $\rho_n\big(F([\gamma^{(1)}_{u,j,k}]) - F([\gamma^{(2)}_{u,j,k}])\big) < \zeta$, which proves the continuity of $F$.

Finally, $(\Gamma_n, \|\cdot\|_2)$ is compact, and every continuous image of a compact set is compact. This proves the third existence condition.
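The continuity of the parameter-to-function map F established above can also be observed numerically. The sketch below is a rough illustration under its own assumptions (a single hidden layer, uniform weights, no attempt to track the constant M*_{n,Ld+1}V*): the ratio of the empirical distance ρn(F(γ(1)) − F(γ(2))) to the parameter distance ‖γ(1) − γ(2)‖2 remains bounded as the perturbation shrinks.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, H = 500, 2, 8                        # assumed sample size, input dimension, width
x = rng.uniform(0.0, 1.0, size=(n, d))

def forward(params):
    W1, b1, W2, b2 = params
    h = np.maximum(x @ W1 + b1, 0.0)       # hidden ReLU layer
    return np.maximum(h @ W2 + b2, 0.0)    # output node (ReLU, as in the sieve definition)

def perturb(params, scale):
    return [p + scale * rng.standard_normal(p.shape) for p in params]

theta1 = [rng.uniform(-1, 1, (d, H)), rng.uniform(-1, 1, H),
          rng.uniform(-1, 1, (H, 1)), rng.uniform(-1, 1, 1)]
for scale in (1e-1, 1e-2, 1e-3):
    theta2 = perturb(theta1, scale)
    rho = np.sqrt(np.mean((forward(theta1) - forward(theta2)) ** 2))
    dist = np.sqrt(sum(np.sum((a - b) ** 2) for a, b in zip(theta1, theta2)))
    print(scale, rho / dist)               # ratio stays bounded as the perturbation shrinks
```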

Proof of the CC6 Satisfaction Lemma. The proof is almost the same as the proof of Lemma 3.2 in Shen, Jiang, Sakhanenko, and Lu (2019). For any $\delta > 0$ and $W_n$, we have
\[
\begin{aligned}
P^*\Big(\sup_{f\in\mathcal{F}_{W_n}}\big|Q_n(f) - \bar{Q}_n(f)\big| > \delta\Big)
&= P^*\Bigg(\sup_{f\in\mathcal{F}_{W_n}}\bigg|\frac{1}{n}\sum_{i=1}^{n}\big[\varepsilon_i^2 - \sigma^2\big]
 - \frac{2}{n}\sum_{i=1}^{n}\varepsilon_i\big(f(x_i)-f_0(x_i)\big)\bigg| > \delta\Bigg) \\
&\le P\Bigg(\bigg|\frac{1}{n}\sum_{i=1}^{n}\big[\varepsilon_i^2 - \sigma^2\big]\bigg| > \frac{\delta}{2}\Bigg)
 + P^*\Bigg(\sup_{f\in\mathcal{F}_{W_n}}\bigg|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\big(f(x_i)-f_0(x_i)\big)\bigg| > \frac{\delta}{4}\Bigg).
\end{aligned}
\]
Since $E[\varepsilon_i^2] = \sigma^2$, the weak law of large numbers gives $P\big(\big|\tfrac{1}{n}\sum_{i=1}^{n}[\varepsilon_i^2 - \sigma^2]\big| > \tfrac{\delta}{2}\big) \to 0$ as $n \to \infty$.

Now we need to show that
\[
\lim_{n\to\infty}P^*\Bigg(\sup_{f\in\mathcal{F}_{W_n}}\bigg|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\big(f(x_i)-f_0(x_i)\big)\bigg| > \frac{\delta}{4}\Bigg) = 0.
\]
By Markov's inequality, this is satisfied if
\[
\lim_{n\to\infty}E_{P^*}\Bigg[\sup_{f\in\mathcal{F}_{W_n}}\bigg|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\big(f(x_i)-f_0(x_i)\big)\bigg|\Bigg] = 0.
\]
Our intention is to use the symmetrization inequality (Lemma 2.3.1 in Vaart and Wellner (1996)) to obtain a Rademacher upper bound. Define $Y_1(\varepsilon_i, f(x_i)) := \varepsilon_i\big(f(x_i)-f_0(x_i)\big)$ for every $f \in \mathcal{F}_{W_n}$. Clearly $E[Y_1(\varepsilon, f(x))] = 0$. We need to show that $Y_1(\varepsilon, f(x))$ is measurable w.r.t. $(\Omega, \mathcal{A}, P)$.

Suppose $[\gamma_{u,j,k}]_n \in \Gamma_n$ from (1). As $\Gamma_n$ is compact, there is a sequence $[\gamma_{u,j,k}]_{n,m} \in \mathbb{Q}^{W_n} \cap \Gamma_n$ converging to $[\gamma_{u,j,k}]_n$ as $m \to \infty$. The mapping $F$ from the proof of EC2 and EC3 Satisfaction maps $[\gamma_{u,j,k}]_{n,m}$ to $f_{m,n} \in \mathcal{F}_{W_n}$. Since $F$ is continuous, $f_{m,n} \to f$ pointwise, and Example 2.3.4 in Vaart and Wellner (1996) then implies that the members of $\mathcal{F}_{W_n}$ are measurable w.r.t. $(\Omega, \mathcal{A}, P)$, and thus so is $Y_1(\varepsilon, f(x))$. Hence, by the symmetrization inequality,
\[
\begin{aligned}
E_{P^*}\Bigg[\sup_{f\in\mathcal{F}_{W_n}}\bigg|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\big(f(x_i)-f_0(x_i)\big)\bigg|\Bigg]
&= E_{P^*}\Bigg[\sup_{f\in\mathcal{F}_{W_n}}\bigg|\frac{1}{n}\sum_{i=1}^{n}Y_1\big(\varepsilon_i, f(x_i)\big)\bigg|\Bigg] \\
&\le 2\,E_{P^*\!,\varepsilon}\Bigg[E_R\Bigg[\sup_{f\in\mathcal{F}_{W_n}}\bigg|\frac{1}{n}\sum_{i=1}^{n}R_i\,Y_1\big(\varepsilon_i, f(x_i)\big)\bigg|\Bigg]\Bigg]
= 2\,E_{P^*\!,\varepsilon}\Bigg[E_R\Bigg[\sup_{f\in\mathcal{F}_{W_n}}\bigg|\frac{1}{n}\sum_{i=1}^{n}R_i\varepsilon_i\big(f(x_i)-f_0(x_i)\big)\bigg|\Bigg]\Bigg],
\end{aligned}
\]
where the $R_i$ are i.i.d. Rademacher random variables living on $(\mathcal{Z}, \mathcal{C}, P_{\mathcal{Z}})$, independent of $\prod_{i=1}^{n}(\Omega, \mathcal{A}, P)$, and $E_{P^*\!,\varepsilon}$ and $E_R$ denote expectations taken w.r.t. $\varepsilon$ and the Rademacher variables, respectively.

Define $Y_2(\omega', f, n) := \frac{1}{\sqrt{n}}\sum_{i=1}^{n}R_i(\omega')\,\varepsilon_i\big(f(x_i)-f_0(x_i)\big)$ for each $f \in \mathcal{F}_{W_n}$ and $\omega' \in \mathcal{Z}$. Fix $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n$. Then $Y_2(\omega', f, n)$ is a sub-Gaussian process indexed by $f$. It is easy to see that, for every $\omega' \in \mathcal{Z}$, $Y_2(\omega', f, n)$ is continuous w.r.t. $(\mathcal{F}_{W_n}, \rho_n)$. Hence, with the sequence $f_{m,n}$ converging pointwise to $f_n$ as above, we have $Y_2(\omega', f_{m,n}, n) \to Y_2(\omega', f_n, n)$ for every $\omega' \in \mathcal{Z}$.

By Section 2.3.3* and Corollary 2.2.8 in Vaart and Wellner (1996), respectively, $Y_2(\omega', f, n)$ is a separable sub-Gaussian process defined on $(\mathcal{F}_{W_n}, \rho_n)$, and there exists $C > 0$ such that, for every $f^*_n \in \mathcal{F}_{W_n}$,
\[
E_R\Bigg[\sup_{f\in\mathcal{F}_{W_n}}\bigg|\frac{1}{n}\sum_{i=1}^{n}R_i\varepsilon_i\big(f(x_i)-f_0(x_i)\big)\bigg|\Bigg]
= \frac{1}{\sqrt{n}}\,E_R\Bigg[\sup_{f\in\mathcal{F}_{W_n}}\big|Y_2(\cdot, f, n)\big|\Bigg]
\le E_R\Bigg[\bigg|\frac{1}{n}\sum_{i=1}^{n}R_i\varepsilon_i\big(f^*_n(x_i)-f_0(x_i)\big)\bigg|\Bigg]
+ C\int_0^\infty\sqrt{\frac{\ln N\big(\tfrac{1}{2}\eta,\,\mathcal{F}_{W_n},\,d_n\big)}{n}}\,d\eta,
\]
with $d_n(f,g) := \sqrt{\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2\big(f(x_i)-g(x_i)\big)^2}$. It is clear that $d_n$ is also a pseudo-distance. The term $N\big(\tfrac{1}{2}\eta, \mathcal{F}_{W_n}, d_n\big)$ denotes the minimum number of balls of radius $\tfrac{1}{2}\eta$, in the distance $d_n$, required to cover $\mathcal{F}_{W_n}$; its natural logarithm is the metric entropy of $\mathcal{F}_{W_n}$.

Obviously, $d_n(f,g) \le \|f-g\|_\infty\sqrt{\tfrac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2}$. The strong law of large numbers tells us that there exists $N_1$ such that, for any $n \ge N_1$, $\tfrac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2 < \sigma^2 + 1$ almost everywhere, except when the $\varepsilon_i$ take values in a null set $E$ of $\prod_{i=1}^{n}(\Omega, \mathcal{A}, P)$.

By the Cauchy–Schwarz inequality, for any $n \ge N_1$ we have
\[
E_R\Bigg[\bigg|\frac{1}{n}\sum_{i=1}^{n}R_i\varepsilon_i\big(f^*_n(x_i)-f_0(x_i)\big)\bigg|\Bigg]
\le \frac{1}{n}\sum_{i=1}^{n}|\varepsilon_i|\,\big|f^*_n(x_i)-f_0(x_i)\big|
\le \sqrt{\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2}\,\sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(f^*_n(x_i)-f_0(x_i)\big)^2}
\le \sqrt{\sigma^2+1}\,\|f^*_n - f_0\|_\infty \quad \text{a.e.}
\]
Take $f^*_n = \pi_{W_n}f_0$. Since $\|\pi_{W_n}f_0 - f_0\|_\infty \to 0$ as $n \to \infty$, for every $\zeta > 0$ there exists $N_2 > 0$ such that, for any $n \ge N_2$,
\[
\|\pi_{W_n}f_0 - f_0\|_\infty < \frac{\zeta}{\sqrt{\sigma^2+1}}.
\]
Thus every $n > N_1 \vee N_2$ satisfies
\[
E_R\Bigg[\bigg|\frac{1}{n}\sum_{i=1}^{n}R_i\varepsilon_i\big(\pi_{W_n}f_0(x_i)-f_0(x_i)\big)\bigg|\Bigg] < \zeta \quad \text{a.e.}
\]
Next, we bound the entropy integral by a term that converges to 0 as $n \to \infty$ almost everywhere. We know that $d_n(f,g) \le \|f-g\|_\infty\sqrt{\sigma^2+1}$ almost everywhere. Therefore a ball
\[
B_{d_n}\Big(f;\,\tfrac{1}{2}\eta\Big) := \Big\{g \in \mathcal{F}_{W_n} : d_n(f,g) < \tfrac{1}{2}\eta\Big\}
\supseteq \Big\{g \in \mathcal{F}_{W_n} : \|f-g\|_\infty < \tfrac{\eta}{2\sqrt{\sigma^2+1}}\Big\}
=: B_{\|\cdot\|_\infty}\Big(f;\,\tfrac{\eta}{2\sqrt{\sigma^2+1}}\Big)
\]
almost everywhere, for every $f, g \in \mathcal{F}_{W_n}$. This implies
\[
\int_0^\infty\sqrt{\frac{\ln N\big(\tfrac{1}{2}\eta, \mathcal{F}_{W_n}, d_n\big)}{n}}\,d\eta
\le \int_0^\infty\sqrt{\frac{\ln N\big(\tfrac{\eta}{2\sqrt{\sigma^2+1}}, \mathcal{F}_{W_n}, \|\cdot\|_\infty\big)}{n}}\,d\eta
= \int_0^{2M^*_{n,L_d+1}}\sqrt{\frac{\ln N\big(\tfrac{\eta}{2\sqrt{\sigma^2+1}}, \mathcal{F}_{W_n}, \|\cdot\|_\infty\big)}{n}}\,d\eta,
\]
since $\|f-g\|_\infty \le 2M^*_{n,L_d+1}$ for every $f, g \in \mathcal{F}_{W_n}$. From Theorem 14.5 in Anthony and Bartlett (2009), for all $\eta \le 2M^*_{n,L_d+1}$,
\[
\ln N\Big(\tfrac{\eta}{2\sqrt{\sigma^2+1}},\,\mathcal{F}_{W_n},\,\|\cdot\|_\infty\Big)
\le W_n\cdot\ln\Bigg(\frac{8\sqrt{\sigma^2+1}\cdot e\cdot d\cdot M^*_{n,L_d+1}\cdot W_n\cdot\big(M^{(all)}_{n,L_d+1}\big)^{L_d+1}}{\eta\,\big(M^{(all)}_{n,L_d+1}-1\big)}\Bigg).
\]
Define
\[
\bar{U}_{n,d,L_d+1,W_n} := \Bigg(\frac{8\sqrt{\sigma^2+1}\cdot e\cdot d\cdot M^*_{n,L_d+1}\cdot W_n\cdot\big(M^{(all)}_{n,L_d+1}\big)^{L_d+1}}{M^{(all)}_{n,L_d+1}-1}\Bigg)^{W_n},
\]
\[
\begin{aligned}
U_{n,d,L_d+1,W_n} := \ln\big(\bar{U}_{n,d,L_d+1,W_n}\big) - W_n
&= W_n\cdot\Bigg[\ln\Bigg(\frac{8\sqrt{\sigma^2+1}\cdot e\cdot d\cdot M^*_{n,L_d+1}\cdot W_n\cdot\big(M^{(all)}_{n,L_d+1}\big)^{L_d+1}}{M^{(all)}_{n,L_d+1}-1}\Bigg) - 1\Bigg] \\
&= W_n\cdot\Bigg[\ln\Bigg(\frac{d\cdot M^*_{n,L_d+1}\cdot W_n\cdot\big(M^{(all)}_{n,L_d+1}\big)^{L_d+1}}{M^{(all)}_{n,L_d+1}-1}\Bigg) + \ln\big(8\sqrt{\sigma^2+1}\big)\Bigg] \\
&\le 2W_n\cdot\ln\Bigg(\frac{d\cdot M^*_{n,L_d+1}\cdot W_n\cdot\big(M^{(all)}_{n,L_d+1}\big)^{L_d+1}}{M^{(all)}_{n,L_d+1}-1}\Bigg), \quad\text{for each } n \ge N_1\vee N_3,
\end{aligned}
\]
by choosing $N_3 > 0$ such that $d\cdot M^*_{n,L_d+1}\cdot W_n \ge 8\sqrt{\sigma^2+1}$ whenever $n \ge N_3$. As $M^{(all)}_{n,L_d+1} > 1$, we have $\big(M^{(all)}_{n,L_d+1}\big)^{L_d+1}\big/\big(M^{(all)}_{n,L_d+1}-1\big) > 1$, and thus $\ln\Big(d\,M^*_{n,L_d+1}W_n\big(M^{(all)}_{n,L_d+1}\big)^{L_d+1}\big/\big(M^{(all)}_{n,L_d+1}-1\big)\Big) > \ln\big(8\sqrt{\sigma^2+1}\big)$.

Since, for every $\eta \le 2M^*_{n,L_d+1}$,
\[
\ln N\Big(\tfrac{\eta}{2\sqrt{\sigma^2+1}},\,\mathcal{F}_{W_n},\,\|\cdot\|_\infty\Big)
\le U_{n,d,L_d+1,W_n} + W_n\ln\Big(\frac{1}{\eta}\Big)
\le U_{n,d,L_d+1,W_n} + \frac{W_n}{\eta}
\le U_{n,d,L_d+1,W_n}\Big(1 + \frac{1}{\eta}\Big),
\]
we have, for all $n \ge N_1\vee N_3$,
\[
\int_0^{2M^*_{n,L_d+1}}\sqrt{\frac{\ln N\big(\tfrac{\eta}{2\sqrt{\sigma^2+1}}, \mathcal{F}_{W_n}, \|\cdot\|_\infty\big)}{n}}\,d\eta
\le \sqrt{\frac{U_{n,d,L_d+1,W_n}}{n}}\int_0^{2M^*_{n,L_d+1}}\sqrt{1+\frac{1}{\eta}}\,d\eta
\le 5\sqrt{2}\,\sqrt{\frac{U_{n,d,L_d+1,W_n}}{n}}\,M^*_{n,L_d+1}.
\]
Now, for large $M^{(all)}_{n,L_d+1}$, we have $\big(M^{(all)}_{n,L_d+1}\big)^{L_d+1}\big/\big(M^{(all)}_{n,L_d+1}-1\big) \sim \big(M^{(all)}_{n,L_d+1}\big)^{L_d}$. Hence there exists $N_4 > 0$ such that, for any $n \ge N_4$,
\[
\int_0^{2M^*_{n,L_d+1}}\sqrt{\frac{\ln N\big(\tfrac{\eta}{2\sqrt{\sigma^2+1}}, \mathcal{F}_{W_n}, \|\cdot\|_\infty\big)}{n}}\,d\eta
\lesssim 10\sqrt{\frac{\big(M^*_{n,L_d+1}\big)^2\cdot W_n\cdot\ln\Big(d\,M^*_{n,L_d+1}\,W_n\,\big(M^{(all)}_{n,L_d+1}\big)^{L_d}\Big)}{n}}
\lesssim 10\sqrt{\frac{\big(M^*_{n,L_d+1}\big)^2\,C^*_{n,d,L_d+1,W_n}}{n}}.
\]
Therefore, if $\big(M^*_{n,L_d+1}\big)^2\cdot C^*_{n,d,L_d+1,W_n} = o(n)$, every $n \ge N_1\vee N_2\vee N_3\vee N_4$ satisfies
\[
\int_0^{2M^*_{n,L_d+1}}\sqrt{\frac{\ln N\big(\tfrac{\eta}{2\sqrt{\sigma^2+1}}, \mathcal{F}_{W_n}, \|\cdot\|_\infty\big)}{n}}\,d\eta \to 0 \quad\text{as } n \to \infty \text{ a.e.},
\]
which implies
\[
E_R\Bigg[\sup_{f\in\mathcal{F}_{W_n}}\bigg|\frac{1}{n}\sum_{i=1}^{n}R_i\varepsilon_i\big(f(x_i)-f_0(x_i)\big)\bigg|\Bigg] \to 0 \quad\text{as } n \to \infty \text{ a.e.}
\]
Thus, for any $n \ge N_1\vee N_2\vee N_3\vee N_4$, we have
\[
E_R\Bigg[\sup_{f\in\mathcal{F}_{W_n}}\bigg|\frac{1}{n}\sum_{i=1}^{n}R_i\varepsilon_i\big(f(x_i)-f_0(x_i)\big)\bigg|\Bigg]
\le \sqrt{\sigma^2+1}\,\|\pi_{W_n}f_0 - f_0\|_\infty + 10\,C\sqrt{\frac{\big(M^*_{n,L_d+1}\big)^2\,C^*_{n,d,L_d+1,W_n}}{n}} \to 0 \quad\text{a.e.}
\]
It is clear that
\[
E_{P^*\!,\varepsilon}\Bigg[\sqrt{\sigma^2+1}\,\|\pi_{W_n}f_0 - f_0\|_\infty + 10\,C\sqrt{\frac{\big(M^*_{n,L_d+1}\big)^2\,C^*_{n,d,L_d+1,W_n}}{n}}\Bigg]
= \sqrt{\sigma^2+1}\,\|\pi_{W_n}f_0 - f_0\|_\infty + 10\,C\sqrt{\frac{\big(M^*_{n,L_d+1}\big)^2\,C^*_{n,d,L_d+1,W_n}}{n}} \to 0 < \infty.
\]
Therefore, we can use the generalized dominated convergence theorem (from the completeness of $(\Omega, \mathcal{A}, P)$) to conclude that
\[
E_{P^*}\Bigg[\sup_{f\in\mathcal{F}_{W_n}}\bigg|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\big(f(x_i)-f_0(x_i)\big)\bigg|\Bigg]
\le 2\,E_{P^*\!,\varepsilon}\Bigg[E_R\Bigg[\sup_{f\in\mathcal{F}_{W_n}}\bigg|\frac{1}{n}\sum_{i=1}^{n}R_i\varepsilon_i\big(f(x_i)-f_0(x_i)\big)\bigg|\Bigg]\Bigg] \to 0,
\]
and this certainly implies that, under $(\Omega^*, \mathcal{A}^*, P^*)$,
\[
\operatorname*{plim}_{n\to\infty}\,\sup_{f\in\mathcal{F}_{W_n}}\bigg|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\big(f(x_i)-f_0(x_i)\big)\bigg| = 0.
\]
Since $\ln N\big(\tfrac{\eta}{2\sqrt{\sigma^2+1}}, \mathcal{F}_{W_n}, \|\cdot\|_\infty\big) = o(n)$, the equality above holds for each $W_n$, which finishes the proof.

Proof of the Rate of Convergence Theorem. Take $\delta_n = \rho_n(\pi_{W_n}f_0 - f_0)$, which decreases to 0 as $n \to \infty$, and suppose that all hypotheses of the Rate of Convergence Theorem hold.

Note that, for any $\eta \le 2M^*_{n,L_d+1}$,
\[
\ln N\big(\eta, \mathcal{F}_{W_n}, \rho_n\big)
\le \ln N\big(\eta, \mathcal{F}_{W_n}, \|\cdot\|_\infty\big)
\le W_n\cdot\ln\Bigg(\frac{4\,e\,d\,M^*_{n,L_d+1}\,W_n\,\big(M^{(all)}_{n,L_d+1}\big)^{L_d+1}}{\eta\,\big(M^{(all)}_{n,L_d+1}-1\big)}\Bigg)
\lesssim C^*_{n,d,L_d+1,W_n} + W_n\Big(\ln\Big(\frac{1}{\eta}\Big) + 1\Big)
\lesssim C^*_{n,d,L_d+1,W_n}\cdot\Big(\frac{1+\eta}{16\,\eta}\Big),
\]
since for $n$ sufficiently large $C^*_{n,d,L_d+1,W_n} > W_n$, and the difference keeps increasing as $n$ grows. Thus, for any $\delta \le 1$, we have
\[
\int_0^{\delta}\sqrt{\ln N\big(\eta, \mathcal{F}_{W_n}, \rho_n\big)}\,d\eta
\lesssim \sqrt{C^*_{n,d,L_d+1,W_n}}\int_0^{\delta}\sqrt{\frac{1+\eta}{16\,\eta}}\,d\eta
\le \frac{\sqrt{C^*_{n,d,L_d+1,W_n}}}{4}\int_0^{\delta}\sqrt{1+\frac{1}{\eta}}\,d\eta
\le \frac{\sqrt{2\,C^*_{n,d,L_d+1,W_n}}}{4}\int_0^{\delta}\sqrt{\frac{1}{\eta}}\,d\eta
\lesssim \sqrt{C^*_{n,d,L_d+1,W_n}}\,\sqrt{\delta} =: \phi_n(\delta).
\]
Clearly, the $\phi_n$ defined above makes $\delta \mapsto \phi_n(\delta)/\delta^{\beta}$ decreasing on $(0,1)$ for any $\tfrac{1}{2} < \beta < 2$. Let $r_n \lesssim \rho_n(\pi_{W_n}f_0 - f_0)^{-1}$. For each $n$ under consideration, we would like to have
\[
r_n^2\,\phi_n\Big(\frac{1}{r_n}\Big) = \sqrt{C^*_{n,d,L_d+1,W_n}}\;r_n^{3/2} \le \sqrt{n},
\]
which is satisfied if $r_n \le \big(n/C^*_{n,d,L_d+1,W_n}\big)^{1/3}$. From
\[
Q_n(\hat f_n) \le \inf_{f\in\mathcal{F}_{W_n}}Q_n(f) + O_P(\eta_n)
\le Q_n(\pi_{W_n}f_0) + O_P(\eta_n)
\le Q_n(\pi_{W_n}f_0) + O_P\big(r_n^{-2}\big),
\]
we require $\eta_n \le r_n^{-2}$.

Because there are no other requirements on the lower bound of $r_n^{-1}$ besides the three inequalities above, we can take
\[
r_n^{-2} = \max\Bigg\{\rho_n(\pi_{W_n}f_0 - f_0)^2,\ \Big(\frac{C^*_{n,d,L_d+1,W_n}}{n}\Big)^{2/3}\Bigg\},
\]
which justifies the upper bound on $\eta_n$ in the Rate of Convergence Theorem.

Hence, by the Convergence Rate of $\rho_n(\hat f_n - \pi_{W_n}f_0)$ Remark, we have $\rho_n(\hat f_n - \pi_{W_n}f_0) = O_{P^*}\big(r_n^{-1}\big)$; by the triangle inequality,
\[
\rho_n(\hat f_n - f_0)
\le \rho_n(\hat f_n - \pi_{W_n}f_0) + \rho_n(\pi_{W_n}f_0 - f_0)
\le O_{P^*}\big(r_n^{-1}\big) + \rho_n(\pi_{W_n}f_0 - f_0)
\le O_{P^*}\Bigg(\max\Bigg\{\rho_n(\pi_{W_n}f_0 - f_0),\ \Big(\frac{C^*_{n,d,L_d+1,W_n}}{n}\Big)^{1/3}\Bigg\}\Bigg).
\]
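To see the order of magnitude implied by this bound, the toy computation below evaluates the two competing terms in r_n^{-1} for a few sample sizes. Both the assumed decay of the sieve approximation error and the assumed growth of C*_{n,d,Ld+1,Wn} are placeholders chosen for illustration, not quantities derived in the paper.

```python
import numpy as np

n = np.array([2e3, 5e3, 2e4, 5e4, 2e5])
approx_err = n ** (-0.3)                  # assumed decay of rho_n(pi_{W_n} f0 - f0)
C_star = 50.0 * np.log(n)                 # assumed growth of C*_{n,d,L_d+1,W_n}

rate = np.maximum(approx_err, (C_star / n) ** (1.0 / 3.0))   # r_n^{-1} as in the theorem
for ni, ri in zip(n, rate):
    print(f"n = {ni:.0e}:  r_n^-1 = {ri:.4f}")
```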

Proof of the Asymptotic Normality Theorem. For any $f \in \mathcal{F}$, the first-order functional Taylor expansion of $Q_n$ around $f_0$ is
\[
\begin{aligned}
Q_n(f) &= Q_n(f_0) + dQ_n(f_0;\,f-f_0) + \mathcal{R}_1(f_0;\,f-f_0) \\
&= \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2 - \frac{2}{n}\sum_{i=1}^{n}\varepsilon_i\big(f(x_i)-f_0(x_i)\big) + \frac{1}{n}\sum_{i=1}^{n}\big(f(x_i)-f_0(x_i)\big)^2 \\
&= \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2 - \frac{2}{n}\sum_{i=1}^{n}\varepsilon_i\big(f(x_i)-f_0(x_i)\big) + \rho_n(f-f_0)^2. \qquad\text{(A.1)}
\end{aligned}
\]
For $\delta_n = \sqrt{\eta_n} = o\big(r_n^{-1}\big)$ and $\iota(x) \equiv 1$, define the local alternatives
\[
\bar f_n := (1-\delta_n)\hat f_n + \delta_n\big(f_0 + \iota\big), \qquad
\pi_{W_n}\bar f_n := (1-\delta_n)\hat f_n + \delta_n\big(\pi_{W_n}f_0 + \iota\big).
\]
Then, by using (A.1), we have
\[
Q_n(\hat f_n) = \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2 - \frac{2}{n}\sum_{i=1}^{n}\varepsilon_i\big(\hat f_n(x_i)-f_0(x_i)\big) + \rho_n(\hat f_n - f_0)^2,
\]
\[
Q_n(\pi_{W_n}\bar f_n) = \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2 - \frac{2}{n}\sum_{i=1}^{n}\varepsilon_i\big(\pi_{W_n}\bar f_n(x_i)-f_0(x_i)\big) + \rho_n(\pi_{W_n}\bar f_n - f_0)^2,
\]
and subtracting these two equations gives
\[
Q_n(\hat f_n) = Q_n(\pi_{W_n}\bar f_n) + \frac{2}{n}\sum_{i=1}^{n}\varepsilon_i\big(\pi_{W_n}\bar f_n(x_i) - \hat f_n(x_i)\big) + \rho_n(\hat f_n - f_0)^2 - \rho_n(\pi_{W_n}\bar f_n - f_0)^2.
\]
Next, by using the definition of the local alternative $\pi_{W_n}\bar f_n$, the distributivity of the pseudo-inner product with respect to addition, and the Cauchy–Schwarz inequality, we have
\[
\begin{aligned}
\rho_n\big(\pi_{W_n}\bar f_n - f_0\big)^2
&= \big\langle \pi_{W_n}\bar f_n - f_0,\ \pi_{W_n}\bar f_n - f_0\big\rangle_{\rho_n} \\
&\le (1-\delta_n)^2\,\rho_n(\hat f_n - f_0)^2
+ 2(1-\delta_n)\delta_n\,\rho_n(\hat f_n - f_0)\,\rho_n(\pi_{W_n}f_0 - f_0)
+ 2(1-\delta_n)\delta_n\,\big\langle \hat f_n - f_0,\ \iota\big\rangle_{\rho_n} \\
&\qquad + 2\delta_n^2\,\rho_n(\pi_{W_n}f_0 - f_0)^2 + 2\delta_n^2\,\rho_n(\pi_{W_n}f_0 - f_0) + \delta_n^2.
\end{aligned}
\]
Using this expression and the fact that
\[
\frac{2}{n}\sum_{i=1}^{n}\varepsilon_i\big(\pi_{W_n}\bar f_n(x_i) - \hat f_n(x_i)\big)
= -\frac{2}{n}\sum_{i=1}^{n}\varepsilon_i\Big(\delta_n\big(\hat f_n(x_i) - \pi_{W_n}f_0(x_i)\big) - \delta_n\,\iota(x_i)\Big),
\]
we have
\[
\begin{aligned}
-O_P\big(\delta_n^2\big)
&\le \inf_{f\in\mathcal{F}_{W_n}}Q_n(f) - Q_n(\hat f_n)
\le Q_n\big(\pi_{W_n}\bar f_n\big) - Q_n(\hat f_n) \\
&\le \rho_n\big(\pi_{W_n}\bar f_n - f_0\big)^2 - \rho_n(\hat f_n - f_0)^2 - \frac{2}{n}\sum_{i=1}^{n}\varepsilon_i\big(\pi_{W_n}\bar f_n(x_i) - \hat f_n(x_i)\big) \\
&\le \big(-2\delta_n+\delta_n^2\big)\rho_n(\hat f_n - f_0)^2
+ 2(1-\delta_n)\delta_n\,\rho_n(\hat f_n - f_0)\,\rho_n(\pi_{W_n}f_0 - f_0)
+ 2(1-\delta_n)\delta_n\,\big\langle \hat f_n - f_0,\ \iota\big\rangle_{\rho_n} \\
&\qquad + \frac{2}{n}\delta_n\sum_{i=1}^{n}\varepsilon_i\big(\hat f_n(x_i) - \pi_{W_n}f_0(x_i)\big)
- \frac{2}{n}\delta_n\sum_{i=1}^{n}\varepsilon_i\,\iota(x_i)
+ 2\delta_n^2\,\rho_n(\pi_{W_n}f_0 - f_0)^2 + 2\delta_n^2\,\rho_n(\pi_{W_n}f_0 - f_0) + \delta_n^2 \\
&\le \delta_n^2\,\rho_n(\hat f_n - f_0)^2
+ 2(1-\delta_n)\delta_n\,\rho_n(\hat f_n - f_0)\,\rho_n(\pi_{W_n}f_0 - f_0)
+ 2(1-\delta_n)\delta_n\,\big\langle \hat f_n - f_0,\ \iota\big\rangle_{\rho_n} \\
&\qquad + \frac{2}{n}\delta_n\sum_{i=1}^{n}\varepsilon_i\big(\hat f_n(x_i) - \pi_{W_n}f_0(x_i)\big)
- \frac{2}{n}\delta_n\sum_{i=1}^{n}\varepsilon_i\,\iota(x_i)
+ O_P\big(\delta_n^2\big),
\end{aligned}
\]
since $2\delta_n^2\rho_n(\pi_{W_n}f_0 - f_0)^2 + 2\delta_n^2\rho_n(\pi_{W_n}f_0 - f_0) + \delta_n^2 = O_P(\delta_n^2)$ and $-2\delta_n+\delta_n^2 \le \delta_n^2$. Dividing through by $\delta_n$, we can rewrite the last inequality as
\[
\begin{aligned}
-O_P(\delta_n)
&\le \delta_n\,\rho_n(\hat f_n - f_0)^2
+ 2(1-\delta_n)\,\rho_n(\hat f_n - f_0)\,\rho_n(\pi_{W_n}f_0 - f_0)
+ 2(1-\delta_n)\,\big\langle \hat f_n - f_0,\ \iota\big\rangle_{\rho_n} \\
&\qquad + \frac{2}{n}\sum_{i=1}^{n}\varepsilon_i\big(\hat f_n(x_i) - \pi_{W_n}f_0(x_i)\big)
- \frac{2}{n}\sum_{i=1}^{n}\varepsilon_i\,\iota(x_i)
+ O_P(\delta_n). \qquad\text{(A.2)}
\end{aligned}
\]
Our goal now is to show that, as $n \to \infty$, $\big\langle \hat f_n - f_0, \iota\big\rangle_{\rho_n}$ can be expressed as $\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\,\iota(x_i) + o_{P^*}(n^{-1/2})$, and that the other terms are $o_{P^*}(n^{-1/2})$.

Note that the second condition of this theorem ensures that the consistency assumption in the Consistency and Rate of Convergence theorems is satisfied. Also, big-$O$ and small-$o$ in $P$ imply big-$O$ and small-$o$ in $P^*$, respectively. Hence, by the Rate of Convergence theorem,
\[
\rho_n(\hat f_n - f_0)\,\rho_n(\pi_{W_n}f_0 - f_0)
= O_{P^*}\Bigg(\max\Bigg\{\rho_n(\pi_{W_n}f_0 - f_0)^2,\ \rho_n(\pi_{W_n}f_0 - f_0)\Big(\frac{C^*_{n,d,L_n+1,W_n}}{n}\Big)^{1/3}\Bigg\}\Bigg).
\]
The third condition of this theorem implies
\[
\rho_n(\pi_{W_n}f_0 - f_0)^2 = o\big(n^{-1/4}\big)^2 = o\big(n^{-1/2}\big),
\qquad
\rho_n(\pi_{W_n}f_0 - f_0)\Big(\frac{C^*_{n,d,L_n+1,W_n}}{n}\Big)^{1/3}
= o\Big(n^{-1/6}\big(C^*_{n,d,L_n+1,W_n}\big)^{-1/3}\Big)\big(C^*_{n,d,L_n+1,W_n}\big)^{1/3}n^{-1/3}
= o\big(n^{-1/2}\big),
\]
which give
\[
2(1-\delta_n)\,\rho_n(\hat f_n - f_0)\,\rho_n(\pi_{W_n}f_0 - f_0) = o_{P^*}\big(n^{-1/2}\big).
\]
Now we use Remark 4.4.1 and the definition of big-$O$ in probability. Noting that $A_n = C^*_{n,d,L_n+1,W_n}$ and $M_n = M^*_{n,L_n+1}$, we have
\[
\frac{2}{n}\sum_{i=1}^{n}\varepsilon_i\big(\hat f_n(x_i) - \pi_{W_n}f_0(x_i)\big)
\le \frac{1}{\sqrt{n}}\,\sup_{f\in\mathcal{F}_{W_n}}\frac{4}{\sqrt{n}}\bigg|\sum_{i=1}^{n}\varepsilon_i f(x_i)\bigg|
= O_P\Bigg(n^{-1/2}\,\frac{\big(C^*_{n,d,L_n+1,W_n}\big)^{2/3}\big(M^*_{n,L_n+1}\big)^{1/3}}{n^{1/6}}\Bigg),
\]
and to make the right-hand side $o\big(n^{-1/2}\big)$ we require
\[
\frac{\big(C^*_{n,d,L_n+1,W_n}\big)^{2/3}\big(M^*_{n,L_n+1}\big)^{1/3}}{n^{1/6}}
= \Bigg(\frac{C^*_{n,d,L_n+1,W_n}\sqrt{M^*_{n,L_n+1}}}{n^{1/4}}\Bigg)^{2/3} = o(1),
\]
which is satisfied by the second condition. Hence (A.2) can be rewritten as
\[
-O_P\big(r_n^{-1}\big)
\le o\big(r_n^{-1}n^{-1/2}\big) + o_{P^*}\big(n^{-1/2}\big)
+ 2(1-\delta_n)\,\big\langle \hat f_n - f_0,\ \iota\big\rangle_{\rho_n}
+ o_P\big(n^{-1/2}\big)
- \frac{2}{n}\sum_{i=1}^{n}\varepsilon_i\,\iota(x_i)
+ O_P\big(r_n^{-1}\big),
\]
which, together with (3.2) and the fact that big-$O$ and small-$o$ imply big-$O$ and small-$o$ in probability, respectively, implies
\[
-(1-\delta_n)\,\big\langle \hat f_n - f_0,\ \iota\big\rangle_{\rho_n} + \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\,\iota(x_i)
\le o\big(r_n^{-1}n^{-1/2}\big) + o_{P^*}\big(n^{-1/2}\big) + o_P\big(n^{-1/2}\big) + O_P\big(r_n^{-1}\big)
\le o_{P^*}\big(n^{-1/2}\big).
\]
If both sides of the last inequality in (A.2) are multiplied by $-1$ and the $\big\langle \hat f_n - f_0, \iota\big\rangle_{\rho_n}$ and $\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\,\iota(x_i)$ terms are moved to the left-hand side and the other terms to the right-hand side, then we have
\[
(1-\delta_n)\,\big\langle \hat f_n - f_0,\ \iota\big\rangle_{\rho_n} - \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\,\iota(x_i) \le o_{P^*}\big(n^{-1/2}\big).
\]
Therefore, by the triangle inequality,
\[
\bigg|\big\langle \hat f_n - f_0,\ \iota\big\rangle_{\rho_n} - \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\,\iota(x_i)\bigg|
\le \bigg|(1-\delta_n)\big\langle \hat f_n - f_0,\ \iota\big\rangle_{\rho_n} - \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\,\iota(x_i)\bigg|
+ \delta_n\,\Big|\big\langle \hat f_n - f_0,\ \iota\big\rangle_{\rho_n}\Big|
\le o_{P^*}\big(n^{-1/2}\big) + \delta_n\,\rho_n(\hat f_n - f_0)
= o_{P^*}\big(n^{-1/2}\big),
\]
which gives
\[
\sqrt{n}\,\big\langle \hat f_n - f_0,\ \iota\big\rangle_{\rho_n}
= \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\big(\hat f_n(x_i) - f_0(x_i)\big)
= \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\varepsilon_i + o_{P^*}(1),
\]
and thus the conclusion follows from the central limit theorem.

Proof of Lemma 4.4.1. Suppose that a node of $\theta$, call it $\beta_1$, in either the input layer or a hidden layer, is connected to a node $\beta_2$ contained in a layer that is not adjacent to the layer of $\beta_1$. Then we add new nodes $\beta''_1, \beta''_2, \dots, \beta''_v$, one in each of the $v$ intermediate layers between the layer of $\beta_1$ and the layer of $\beta_2$, which connect $\beta_1$ and $\beta_2$ through adjacent layers. If initially $\beta_2$ receives the linear term $\beta_{1,\mathrm{output}}\,w_{\beta_1,\beta_2}$, where $\beta_{1,\mathrm{output}}$ is the output of $\beta_1$, as a part of its linear aggregation input, then $\beta_{1,\mathrm{output}}\,w_{\beta_1,\beta_2} + B$ is both the input and the output of each of the nodes $\beta''_1, \beta''_2, \dots, \beta''_v$, where $B$ is taken sufficiently large to ensure $\mathrm{ReLU}(\beta_{1,\mathrm{output}}\,w_{\beta_1,\beta_2} + B) = \beta_{1,\mathrm{output}}\,w_{\beta_1,\beta_2} + B$. The node $\beta_2$ now has a new $-B$ term added to its old constant input term, to ensure that the input received from $\beta''_v$ remains $\beta_{1,\mathrm{output}}\,w_{\beta_1,\beta_2}$. This shows that any non-adjacent node connection in $\theta$ can be transformed into an equivalent chain of adjacent connections.

After transforming all node connections from the input layer to the last hidden layer into adjacent connections, we can add the missing previous-layer connections by assigning zero weights to the inputs from the related nodes. Then we look for the hidden layer with the most nodes, take $H_n$ to be the number of nodes in that layer, and add nodes to the other hidden layers so that every hidden layer has $H_n$ nodes. These new nodes have zero inputs and outputs, as all incoming connections to each of them have zero weights, which ensures that the outputs of the other nodes stay the same. We are then guaranteed to obtain a multi-layer perceptron $\theta'$ with full previous-layer connections and the same number of nodes $H_n$ per hidden layer.

Now we derive the upper-bound condition for $H_n$ in $\theta'$. Suppose that $H'_1, H'_2, \dots, H'_{L_n}$ are the numbers of nodes in the 1st, 2nd, $\dots$, and $L_n$-th hidden layers of $\theta$, respectively. Suppose also that $\theta$ has been transformed into $\theta'$ by constructing adjacent connections successively from the input layer to the $L_n$-th hidden layer. As the new nodes in the $u$-th hidden layer are constructed to bridge the non-adjacent connections coming from all layers before that layer, the 1st, 2nd, $\dots$, and $L_n$-th hidden layers of $\theta'$ have at most $d + H'_1$, $d + H'_1 + H'_2$, $\dots$, $d + \sum_{u=1}^{L_n}H'_u$ nodes, respectively. We know that $H'_u \le N_n$ for all $1 \le u \le L_n$. Thus we can infer
\[
H_n \le d + \sum_{u=1}^{L_n}H'_u \le d + \Big(\max_{1\le u\le L_n}H'_u\Big)L_n \le d + N_nL_n.
\]
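The device underlying this construction is that a ReLU node can relay a bounded signal unchanged: adding a sufficiently large constant B before the activation and subtracting B at the receiving node reproduces the identity, since ReLU(z + B) − B = z whenever z ≥ −B. A minimal numerical check, with an assumed bound on the signal, is given below.

```python
import numpy as np

z = np.random.uniform(-5.0, 5.0, size=10)   # assumed bounded pre-activation signal w * beta_output
B = 10.0                                     # any B with B >= 5 works for this range
relayed = np.maximum(z + B, 0.0) - B         # pass through one inserted ReLU node, then subtract B
print(np.allclose(relayed, z))               # True: the non-adjacent connection is reproduced
```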


References

AKPINAR, N.-J., B. KRATZWALD, AND S. FEUERRIEGEL (2019): "Sample Complexity Bounds for Recurrent Neural Networks with Application to Combinatorial Graph Problems," arXiv preprint arXiv:1901.10289.

ANTHONY, M., AND P. L. BARTLETT (2009): Neural Network Learning: Theoretical Foundations. Cambridge University Press.

BARTLETT, P. L., N. HARVEY, C. LIAW, AND A. MEHRABIAN (2019): "Nearly-tight VC-dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks," Journal of Machine Learning Research, 20(63), 1-17.

CHEN, X. (2007): "Large sample sieve estimation of semi-nonparametric models," Handbook of Econometrics, 6, 5549-5632.

CHEN, X., AND H. WHITE (1999): "Improved rates and asymptotic normality for nonparametric neural network estimators," IEEE Transactions on Information Theory, 45(2), 682-691.

ELDAN, R., AND O. SHAMIR (2016): "The power of depth for feedforward neural networks," in 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pp. 907-940.

FARRELL, M. H., T. LIANG, AND S. MISRA (2019): "Deep neural networks for estimation and inference: Application to causal effects and other semiparametric estimands," arXiv preprint arXiv:1809.09953.

GOODFELLOW, I., Y. BENGIO, AND A. COURVILLE (2016): Deep Learning. MIT Press.

GRENANDER, U. (1981): Abstract Inference. Wiley, New York.

GU, S., B. T. KELLY, AND D. XIU (2020): "Empirical Asset Pricing via Machine Learning," Review of Financial Studies, forthcoming.

HE, K., X. ZHANG, S. REN, AND J. SUN (2016): "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778.

HINTON, G. E., S. OSINDERO, AND Y.-W. TEH (2006): "A fast learning algorithm for deep belief nets," Neural Computation, 18(7), 1527-1554.

HOREL, E., AND K. GIESECKE (2020): "Towards explainable AI: Significance tests for neural networks," Journal of Machine Learning Research, forthcoming.

HORNIK, K., M. STINCHCOMBE, AND H. WHITE (1989): "Multilayer feedforward networks are universal approximators," Neural Networks, 2(5), 359-366.

LEONI, G. (2017): A First Course in Sobolev Spaces. American Mathematical Society.

LIU, M., J. SHI, Z. LI, C. LI, J. ZHU, AND S. LIU (2016): "Towards better analysis of deep convolutional neural networks," IEEE Transactions on Visualization and Computer Graphics, 23(1), 91-100.

SHEN, X. (1997): "On methods of sieves and penalization," Annals of Statistics, 25(6), 2555-2591.

SHEN, X., C. JIANG, L. SAKHANENKO, AND Q. LU (2019): "Asymptotic Properties of Neural Network Sieve Estimators," arXiv preprint arXiv:1906.00875.

SUN, S., W. CHEN, L. WANG, X. LIU, AND T.-Y. LIU (2016): "On the depth of deep neural networks: A theoretical view," in Thirtieth AAAI Conference on Artificial Intelligence.

VAART, A. W., AND J. A. WELLNER (1996): Weak Convergence and Empirical Processes: With Applications to Statistics. Springer.

WHITE, H. (1992): Artificial Neural Networks: Approximation and Learning Theory. Blackwell Publishers.

WILSON, D. R., AND T. R. MARTINEZ (2003): "The general inefficiency of batch training for gradient descent learning," Neural Networks, 16(10), 1429-1451.

YAROTSKY, D. (2017): "Error bounds for approximations with deep ReLU networks," Neural Networks, 94, 103-114.

YAROTSKY, D. (2018): "Optimal approximation of continuous functions by very deep ReLU networks," arXiv preprint arXiv:1802.03620.

ZHOU, P., AND J. FENG (2018): "Understanding generalization and optimization performance of deep CNNs," arXiv preprint arXiv:1805.10767.