CBMM Memo No. 100 August 17, 2019
Theoretical Issues in Deep Networks: Approximation, Optimization and Generalization
Tomaso Poggio1, Andrzej Banburski1, Qianli Liao1
1Center for Brains, Minds, and Machines, MIT
Abstract
While deep learning is successful in a number of applications, it is not yet well understood theoretically. A satisfactory theoretical characterization of deep learning, however, is beginning to emerge. It covers the following questions: 1) the representation power of deep networks; 2) the optimization of the empirical risk; 3) the generalization properties of gradient descent techniques – why does the expected error not suffer, despite the absence of explicit regularization, when the networks are overparametrized? In this review we discuss recent advances in these three areas. In approximation theory, both shallow and deep networks have been shown to approximate any continuous function on a bounded domain at the expense of a number of parameters that is exponential in the dimensionality of the function. However, for a subset of compositional functions, deep networks of the convolutional type (even without weight sharing) can have a linear dependence on dimensionality, unlike shallow networks. In optimization we discuss the loss landscape for the exponential loss function. It turns out that global minima at infinity are completely degenerate, while the other critical points of the gradient are less degenerate, with at least one – and typically more – nonzero eigenvalues. This suggests that stochastic gradient descent will find the global minima with high probability. To address the question of generalization for classification tasks, we use classical uniform convergence results to justify minimizing a surrogate exponential-type loss function under a unit norm constraint on the weight matrix at each layer. It is an interesting side remark that such minimization for (homogeneous) ReLU deep networks implies maximization of the margin. The resulting constrained gradient system turns out to be identical to the well-known weight normalization technique, originally motivated in a rather different way. We also show that standard gradient descent contains an implicit L2 unit norm constraint, in the sense that it solves the same constrained minimization problem with the same critical points (but a different dynamics). Our approach, which is supported by several independent new results (1–4), offers a solution to the puzzle of the generalization performance of deep overparametrized ReLU networks, uncovering the origin of the underlying hidden complexity control in the case of deep networks.
This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.
Machine Learning | Deep learning | Approximation | Optimization | Generalization
1. Introduction

In the last few years, deep learning has been tremendously successful in many important applications of machine learning. However, our theoretical understanding of deep learning, and thus the ability to develop principled improvements, has lagged behind. A satisfactory theoretical characterization of deep learning is emerging. It covers the following areas: 1) approximation properties of deep networks; 2) optimization of the empirical risk; 3) generalization properties of gradient descent techniques – why does the expected error not suffer, despite the absence of explicit regularization, when the networks are overparametrized?
A. When Can Deep Networks Avoid the Curse of Dimensionality? We start with the first set of questions, summarizing results in (5–7) and (8, 9). The main result is that deep networks have the theoretical guarantee, which shallow networks do not have, that they can avoid the curse of dimensionality for an important class of problems, corresponding to compositional functions, that is, functions of functions. An especially interesting subset of such compositional functions are hierarchically local compositional functions, where all the constituent functions are local in the sense of bounded small dimensionality. The deep networks that can approximate them without the curse of dimensionality are of the deep convolutional type – though, importantly, weight sharing is not necessary.
Implications of the theorems likely to be relevant in practice are: a) deep convolutional architectures have the theoretical guarantee that they can be much better than one-layer architectures such as kernel machines for certain classes of problems; b) the problems for which certain deep networks are guaranteed to avoid the curse of dimensionality (see (10) for a nice review) correspond to input-output mappings that are compositional with local constituent functions; c) the key aspect of convolutional networks that can give them an exponential advantage is not weight sharing but locality at each level of the hierarchy.
Significance Statement
In the last few years, deep learning has been tremendously successful in many important applications of machine learning. However, our theoretical understanding of deep learning, and thus the ability to develop principled improvements, has lagged behind. A theoretical characterization of deep learning is now beginning to emerge. It covers the following questions: 1) representation power of deep networks; 2) optimization of the empirical risk; 3) generalization properties of gradient descent techniques – how can deep networks generalize despite being overparametrized (more weights than training data) in the absence of any explicit regularization? We review progress on all three areas, showing that 1) for the class of compositional functions, deep networks of the convolutional type are exponentially better approximators than shallow networks; 2) only global minima are effectively found by stochastic gradient descent for over-parametrized networks; 3) there is a hidden norm control in the minimization of cross-entropy by gradient descent that allows generalization despite overparametrization.
T.P. designed research; T.P., A.B., and Q.L. performed research; and T.P. and A.B. wrote the paper.
The authors declare no conflict of interest.
1To whom correspondence should be addressed. E-mail: [email protected]
[Figure 1: panel a) a shallow network and panel b) a binary-tree hierarchical network, both over inputs x1, ..., x8, each drawn below the graph of the function it approximates.]

Fig. 1. The top graphs are associated to functions; each of the bottom diagrams depicts the ideal network approximating the function above. In a) a shallow universal network in 8 variables and N units approximates a generic function of 8 variables f(x1, ..., x8). Inset b) shows a hierarchical network at the bottom in n = 8 variables, which approximates well functions of the form f(x1, ..., x8) = h3(h21(h11(x1, x2), h12(x3, x4)), h22(h13(x5, x6), h14(x7, x8))), as represented by the binary graph above. In the approximating network, each of the n − 1 nodes in the graph of the function corresponds to a set of Q = N/(n − 1) ReLU units computing the ridge function $\sum_{i=1}^Q a_i(\langle v_i, x\rangle + t_i)_+$, with $v_i, x \in \mathbb{R}^2$, $a_i, t_i \in \mathbb{R}$. Each term in the ridge function corresponds to a unit in the node (this is somewhat different from today's deep networks, but equivalent to them (25)). Similar to the shallow network, a hierarchical network is universal, that is, it can approximate any continuous function; the text proves that it can approximate compositional functions exponentially better than a shallow network. Redrawn from (9).
B. Related Work. Several papers in the '80s focused on the approximation power and learning properties of one-hidden-layer networks (called shallow networks here). Very little appeared on multilayer networks (but see (11–15)). By now, several papers (16–18) have appeared. (8, 19–22) derive new upper bounds for the approximation by deep networks of certain important classes of functions which avoid the curse of dimensionality. The upper bound for the approximation of general functions by shallow networks was well known to be exponential. It seems natural to assume that, since there is no general way for shallow networks to exploit a compositional prior, lower bounds for the approximation of compositional functions by shallow networks should also be exponential. In fact, examples of specific functions that cannot be represented efficiently by shallow networks have been given, for instance in (23–25). An interesting review of approximation of univariate functions by deep networks has recently appeared (26).
C. Degree of approximation. The general paradigm is as follows. We are interested in determining how complex a network ought to be to theoretically guarantee approximation of an unknown target function f up to a given accuracy ε > 0. To measure the accuracy, we need a norm ‖·‖ on some normed linear space X. As we will see, the norm used in the results of this paper is the sup norm, in keeping with the standard choice in approximation theory. As it turns out, the results of this section require the sup norm in order to be independent of the unknown distribution of the input data.
Let $V_N$ be the set of all networks of a given kind with $N$ units (which we take as a measure of the complexity of the approximant network). The degree of approximation is defined by $\mathrm{dist}(f, V_N) = \inf_{P\in V_N}\|f - P\|$. For example, if $\mathrm{dist}(f, V_N) = O(N^{-\gamma})$ for some $\gamma > 0$, then a network with complexity $N = O(\epsilon^{-1/\gamma})$ will be sufficient to guarantee an approximation with accuracy at least $\epsilon$. The only a priori information on the class of target functions $f$ is codified by the statement that $f \in W$ for some subspace $W \subseteq X$. This subspace is a smoothness and compositional class, characterized by the parameters $m$ and $d$ ($d = 2$ in the example of Figure 1; it is the size of the kernel in a convolutional network).
D. Shallow and deep networks. This section characterizes conditions under which deep networks are "better" than shallow networks in approximating functions. Thus we compare shallow (one-hidden-layer) networks with deep networks, as shown in Figure 1. Both types of networks use the same small set of operations – dot products, linear combinations, a fixed nonlinear function of one variable, possibly convolution and pooling. Each node in the networks corresponds to a node in the graph of the function to be approximated, as shown in the Figure. A unit is a neuron which computes $(\langle x, w\rangle + b)_+$, where $w$ is the vector of weights on the vector input $x$. Both $w$ and the real number $b$ are parameters tuned by learning. We assume here that each node in the networks computes the linear combination of $r$ such units, $\sum_{i=1}^r c_i(\langle x, w_i\rangle + b_i)_+$. Notice that in our main example of a network corresponding to a function with a binary tree graph, the resulting architecture is an idealized version of the deep convolutional neural networks described in the literature. In particular, it has only one output at the top, unlike most of the deep architectures with many channels and many top-level outputs. Correspondingly, each node computes a single value instead of multiple channels, using the combination of several units. However, our results hold also for these more complex networks (see (25)).
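As a purely illustrative aside, the sketch below (Python/NumPy) builds a binary-tree compositional function of n = 8 variables and evaluates the matching idealized hierarchical network, in which every node is a ridge function, i.e., a sum of Q ReLU units of two variables. The unit count Q, the random weights and the tree construction are our own illustrative choices, not quantities prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def node(x2, params):
    """One node of the hierarchical network: the ridge function
    sum_i a_i * relu(<v_i, x2> + t_i) acting on a 2-dimensional input x2."""
    V, t, a = params                      # V: (Q, 2), t: (Q,), a: (Q,)
    return a @ np.maximum(V @ x2 + t, 0.0)

def make_node(Q=16):
    """Randomly initialized node with Q ReLU units (illustrative values)."""
    return (rng.normal(size=(Q, 2)), rng.normal(size=Q), rng.normal(size=Q))

# A binary-tree network for n = 8 inputs: 4 nodes at level 1, 2 at level 2, 1 at the top.
# Each of the n - 1 = 7 nodes plays the role of one constituent function h_ij in Figure 1b.
tree = [[make_node() for _ in range(4)], [make_node() for _ in range(2)], [make_node()]]

def hierarchical_net(x):
    """Evaluate the binary-tree network on an 8-dimensional input x."""
    level = [node(x[2 * i:2 * i + 2], p) for i, p in enumerate(tree[0])]                 # h11..h14
    level = [node(np.array(level[2 * i:2 * i + 2]), p) for i, p in enumerate(tree[1])]   # h21, h22
    return node(np.array(level), tree[2][0])                                             # h3

print(hierarchical_net(rng.normal(size=8)))   # a single scalar output
```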
The sequence of results is as follows.
• Both shallow (a) and deep (b) networks are universal, that is, they can approximate arbitrarily well any continuous function of n variables on a compact domain. The result for shallow networks is classical.
• We consider a special class of functions of n variables on a compact domain that are hierarchical compositions of local functions, such as f(x1, ..., x8) = h3(h21(h11(x1, x2), h12(x3, x4)), h22(h13(x5, x6), h14(x7, x8))). The structure of the function in Figure 1 b) is represented by a graph of the binary tree type, reflecting dimensionality d = 2 for the constituent functions h. In general, d is arbitrary but fixed and independent of the dimensionality n of the compositional function f. (25) formalizes the more general compositional case using directed acyclic graphs.
• The approximation of functions with a compositional structure can be achieved with the same degree of accuracy by deep and shallow networks, but the number of parameters is much smaller for the deep networks than
for the shallow network with equivalent approximation accuracy.
We approximate functions with networks in which the activation nonlinearity is a smoothed version of the so-called ReLU, originally called a ramp by Breiman and given by σ(x) = x_+ = max(0, x). The architecture of the deep networks reflects the function graph, with each node h_i being a ridge function comprising one or more neurons.
Let $I^n = [-1, 1]^n$, and let $X = C(I^n)$ be the space of all continuous functions on $I^n$, with $\|f\| = \max_{x\in I^n}|f(x)|$. Let $S_{N,n}$ denote the class of all shallow networks with $N$ units of the form
$$x \mapsto \sum_{k=1}^N a_k\,\sigma(\langle w_k, x\rangle + b_k),$$
where $w_k \in \mathbb{R}^n$ and $b_k, a_k \in \mathbb{R}$. The number of trainable parameters here is $(n+2)N \sim nN$. Let $m \ge 1$ be an integer, and let $W_m^n$ be the set of all functions of $n$ variables with continuous partial derivatives of orders up to $m < \infty$ such that $\|f\| + \sum_{1\le |k|_1\le m}\|D^k f\| \le 1$, where $D^k$ denotes the partial derivative indicated by the multi-integer $k \ge 1$, and $|k|_1$ is the sum of the components of $k$.
For the hierarchical binary tree network, the analogous spaces are defined by considering the compact set $W_m^{n,2}$ to be the class of all compositional functions $f$ of $n$ variables with a binary tree architecture and constituent functions $h$ in $W_m^2$. We define the corresponding class of deep networks $D_{N,2}$ to be the set of all deep networks with a binary tree architecture, where each of the constituent nodes is in $S_{M,2}$, with $N = |V|\,M$, $V$ being the set of non-leaf vertices of the tree. We note that in the case when $n$ is an integer power of 2, the total number of parameters involved in a deep network in $D_{N,2}$ is $4N$.
The first theorem is about shallow networks.

Theorem 1. Let $\sigma : \mathbb{R}\to\mathbb{R}$ be infinitely differentiable and not a polynomial. For $f \in W_m^n$ the complexity of shallow networks that provide accuracy at least $\epsilon$ is
$$N = O(\epsilon^{-n/m}) \quad \text{and is the best possible.} \quad [1]$$
The estimate of Theorem 1 is the best possible if the only a priori information we are allowed to assume is that the target function belongs to $W_m^n$. The exponential dependence on the dimension $n$ of the number $\epsilon^{-n/m}$ of parameters needed to obtain an accuracy $O(\epsilon)$ is known as the curse of dimensionality. Note that the constants involved in the $O(\cdot)$ in the theorems will depend upon the norms of the derivatives of $f$ as well as of $\sigma$.
Our second and main theorem is about deep networks with smooth activations (preliminary versions appeared in (6–8)). We formulate it in the binary tree case for simplicity, but it extends immediately to functions that are compositions of constituent functions of a fixed number of variables $d$ (in convolutional networks $d$ corresponds to the size of the kernel).

Theorem 2. For $f \in W_m^{n,2}$ consider a deep network with the same compositional architecture and with an activation function $\sigma : \mathbb{R}\to\mathbb{R}$ which is infinitely differentiable and not a polynomial. The complexity of the network to provide approximation with accuracy at least $\epsilon$ is
$$N = O((n-1)\,\epsilon^{-2/m}). \quad [2]$$
The proof is in (25). The assumptions on $\sigma$ in the theorems are not satisfied by the ReLU function $x \mapsto x_+$, but they are satisfied by smoothing the function in an arbitrarily small interval around the origin. The result of the theorem can be extended to the non-smooth ReLU (25).
In summary, when the only a priori assumption on the target function is about the number of derivatives, then to guarantee an accuracy of $\epsilon$ we need a shallow network with $O(\epsilon^{-n/m})$ trainable parameters. If we assume a hierarchical structure on the target function as in Theorem 2, then the corresponding deep network yields a guaranteed accuracy of $\epsilon$ with $O(\epsilon^{-2/m})$ trainable parameters. Note that Theorem 2 applies to all $f$ with a compositional architecture given by a graph which corresponds to, or is a subgraph of, the graph associated with the deep network – in this case the graph corresponding to $W_m^{n,d}$.
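To make the gap between the two bounds concrete, here is a back-of-the-envelope sketch (plain Python) comparing the unit counts of Theorems 1 and 2; the values of n, m and ε are arbitrary illustrative choices, and the hidden constants in the $O(\cdot)$ bounds are simply set to 1.

```python
def shallow_units(eps, n, m):
    """Theorem 1: N = O(eps^(-n/m)) units for a generic f in W_m^n (constant set to 1)."""
    return eps ** (-n / m)

def deep_units(eps, n, m, d=2):
    """Theorem 2: N = O((n - 1) * eps^(-d/m)) units for a binary-tree compositional f (d = 2)."""
    return (n - 1) * eps ** (-d / m)

eps, m = 0.1, 4
for n in (8, 16, 32):
    print(f"n = {n:3d}:  shallow ~ {shallow_units(eps, n, m):.2e}   deep ~ {deep_units(eps, n, m):.2e}")
# The shallow count grows exponentially with n; the deep count only linearly.
```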
2. The Optimization Landscape of Deep Nets with Smooth Activation Function

The main question in the optimization of deep networks concerns the landscape of the empirical loss, in terms of its global minima and local critical points of the gradient.
A. Related work. There are many recent papers studying optimization in deep learning. For optimization we mention work based on the idea that noisy gradient descent (27–30) can find a global minimum. More recently, several authors have studied the dynamics of gradient descent for deep networks with assumptions about the input distribution or about how the labels are generated. They obtain global convergence for some shallow neural networks (31–36). Some local convergence results have also been proved (37–39). The most interesting such approach is (36), which focuses on minimizing the training loss and proving that randomly initialized gradient descent can achieve zero training loss (see also (40–42)). In summary, there is by now an extensive literature on optimization that formalizes and refines, for different special cases and for the discrete domain, our results of (43, 44).
B. Degeneracy of global and local minima under the exponential loss. The first part of the argument of this section relies on the obvious fact (see (1)) that for ReLU networks, under the hypothesis of an exponential-type loss function, there are no local minima that separate the data – the only critical points of the gradient that separate the data are the global minima.
Notice that the global minima are at $\rho = \infty$, where the exponential is zero. As a consequence, the Hessian is identically zero there, with all eigenvalues being zero. On the other hand, any point of the loss at finite $\rho$ has a nonzero Hessian: for instance, in the linear case the Hessian is proportional to $\sum_n^N x_n x_n^T$. The local minima which are not global minima must misclassify. How degenerate are they?
Simple arguments (1) suggest that the critical points which are not global minima cannot be completely degenerate. We thus have the following

Property 1. Under the exponential loss, global minima are completely degenerate, with all eigenvalues of the Hessian ($W$ of them, with $W$ being the number of parameters in the network) being zero. The other critical points of the gradient are less degenerate, with at least one – and typically $N$ – nonzero eigenvalues.
Fig. 2. Stochastic Gradient Descent and Langevin Stochastic Gradient Descent (SGDL) on the 2D potential function shown above lead to an asymptotic distribution with the histograms shown on the left. As expected from the form of the Boltzmann distribution, both dynamics prefer degenerate minima to non-degenerate minima of the same depth. From (1).
For the general case of non-exponential loss and smooth nonlinearities instead of the ReLU, the following conjecture has been proposed (1):

Conjecture 1. For appropriate overparametrization, there are a large number of global zero-error minimizers which are degenerate; the other critical points – saddles and local minima – are generically (that is, with probability one) degenerate on a set of much lower dimensionality.
C. SGD and Boltzmann Equation. The second part of our argument (in (44)) is that SGD concentrates in probability on the most degenerate minima. The argument is based on the similarity between a Langevin equation and SGD, and on the fact that the Boltzmann distribution is exactly the asymptotic "solution" of the stochastic differential Langevin equation and also of SGDL, defined as SGD with added white noise (see for instance (45)). The Boltzmann distribution is
$$p(f) = \frac{1}{Z}\,e^{-\frac{L}{T}}, \quad [3]$$
where $Z$ is a normalization constant, $L(f)$ is the loss and $T$ reflects the noise power. The equation implies that SGDL prefers degenerate minima relative to non-degenerate ones of the same depth. In addition, among two minimum basins of equal depth, the one with the larger volume is much more likely in high dimensions, as shown by the simulations in (44). Taken together, these two facts suggest that SGD selects degenerate minimizers corresponding to larger isotropic flat regions of the loss. Then SGDL shows concentration – because of the high dimensionality – of its asymptotic distribution, Equation 3.
Together, (43) and (1) suggest the following

Conjecture 2. For appropriate overparametrization of the deep network, SGD selects with high probability the global minimizers of the empirical loss, which are highly degenerate.
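The preference of SGDL for flat minima can be illustrated with a toy experiment in the spirit of Figure 2. The sketch below (NumPy) runs gradient descent with added white noise on a hand-made one-dimensional loss with two equal-depth minima, one sharp and one flat (more degenerate); the potential, step size and noise level are arbitrary illustrative choices, not the ones used in (44).

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(x):
    """Gradient of a toy 1D loss with two equal-depth minima:
    a sharp basin at x = -2 (curvature 10) and a flat basin at x = +2 (curvature 0.1)."""
    left = 10.0 * (x + 2.0) ** 2
    right = 0.1 * (x - 2.0) ** 2
    return 20.0 * (x + 2.0) if left < right else 0.2 * (x - 2.0)

x, lr, noise, steps = -2.0, 1e-2, 1.5, 100_000
visits_flat = 0
for _ in range(steps):
    # SGDL step: a gradient step plus white noise (Euler-discretized Langevin dynamics).
    x = x - lr * grad(x) + np.sqrt(lr) * noise * rng.normal()
    visits_flat += x > 0
print(f"fraction of time in the flat (more degenerate) basin: {visits_flat / steps:.2f}")
# Most of the time is spent in the flatter basin, as the Boltzmann distribution predicts.
```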
3. Generalization

Recent results by (2) illuminate the apparent absence of "overfitting" (see Figure 4) in the special case of linear networks for binary classification. They prove that minimization of loss functions such as the logistic, the cross-entropy and the exponential loss yields asymptotic convergence to the maximum margin solution for linearly separable datasets, independently of the initial conditions and without explicit regularization. Here we discuss the case of nonlinear multilayer DNNs under exponential-type losses, for several variations of the basic gradient descent algorithm. The main results are:
• classical uniform convergence bounds for generalization suggest a form of complexity control on the dynamics of the weight directions $V_k$: minimize a surrogate loss subject to a unit $L_p$ norm constraint;
• gradient descent on the exponential loss with an explicit $L_2$ unit norm constraint is equivalent to a well-known gradient descent algorithm, weight normalization, which is closely related to batch normalization;
• unconstrained gradient descent on the exponential loss yields a dynamics with the same critical points as weight normalization: the dynamics seems to implicitly enforce an $L_2$ unit norm constraint on the directions of the weights $V_k$.
We observe that several of these results directly apply to kernel machines for the exponential loss under the separability/interpolation assumption, because kernel machines are one-homogeneous.
A. Related work. A number of papers have studied gradient descent for deep networks (46–48). Close to the approach summarized here (details are in (1)) is the paper (49). Its authors study generalization assuming a regularizer because they are – like us – interested in the normalized margin. Unlike their assumption of an explicit regularization, we show here that commonly used techniques, such as weight and batch normalization, in fact minimize the surrogate loss margin while controlling the complexity of the classifier, without the need to add a regularizer or to use weight decay. Surprisingly, we will show that even standard gradient descent on the weights implicitly controls the complexity through an "implicit" unit $L_2$ norm constraint. Two very recent papers ((4) and (3)) develop an elegant but complicated margin-maximization-based approach which leads to some of the same results of this section (and many more). The important question of which conditions are necessary for gradient descent to converge to the maximum of the margin of $f$ is studied by (4) and (3). Our approach does not need the notion of maximum margin, but our Theorem 3 establishes a connection with it and thus with the results of (4) and (3). Our main goal here (and in (1)) is to achieve a simple understanding of where the complexity control underlying generalization is hiding in the training of deep networks.
B. Deep networks: definitions and properties. We define a deep network with $K$ layers, with the usual coordinate-wise scalar activation functions $\sigma(z) : \mathbb{R}\to\mathbb{R}$, as the set of functions $f(W; x) = \sigma(W^K\sigma(W^{K-1}\cdots\sigma(W^1 x)))$, where the input is $x \in \mathbb{R}^d$ and the weights are given by the matrices $W^k$, one per layer, with matching dimensions. We sometimes use the symbol $W$ as a shorthand for the set of $W^k$ matrices, $k = 1, \cdots, K$. For simplicity we consider here the case of binary classification, in which $f$ takes scalar values, implying that the last layer matrix $W^K$ is a row vector. The labels are $y_n \in \{-1, 1\}$. The weights of hidden layer $l$ are collected in a matrix of size $h_l \times h_{l-1}$. There are no biases apart from the input layer, where the bias is instantiated by one of the input dimensions being a constant. The activation function in this section is the ReLU activation.
For ReLU activations the following important positive one-homogeneity property holds: $\sigma(z) = \frac{\partial\sigma(z)}{\partial z}\,z$. A consequence of one-homogeneity is a structural lemma (Lemma 2.1 of (50)):
$$\sum_{i,j} W_k^{i,j}\,\frac{\partial f(x)}{\partial W_k^{i,j}} = f(x),$$
where $W_k$ is here the vectorized representation of the weight matrix of each layer (each matrix is seen as a vector).
For the network, homogeneity implies $f(W; x) = \prod_{k=1}^K \rho_k\, f(V_1, \cdots, V_K; x)$, where $W_k = \rho_k V_k$ with the matrix norm $\|V_k\|_p = 1$. Another property of the Rademacher complexity of ReLU networks that follows from homogeneity is $\mathbb{R}_N(\mathbb{F}) = \rho\,\mathbb{R}_N(\tilde{\mathbb{F}})$, where $\rho = \prod_{k=1}^K \rho_k$, $\mathbb{F}$ is the class of neural networks described above and $\tilde{\mathbb{F}}$ is the associated class of normalized networks defined next.
We define $f = \rho\tilde f$; $\tilde{\mathbb{F}}$ is the associated class of normalized neural networks (we write $\tilde f(V; x) = \tilde f(x)$ with the understanding that $f(x) = f(W; x)$). Note that $\frac{\partial f}{\partial\rho_k} = \frac{\rho}{\rho_k}\tilde f$ and that the definitions of $\rho_k$, $V_k$ and $\tilde f$ all depend on the choice of the norm used in the normalization.
In the case of training data that can be separated by the network, $f(x_n)y_n > 0\ \forall n = 1, \cdots, N$. We will sometimes write $f(x_n)$ as a shorthand for $y_n f(x_n)$.
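The homogeneity property is easy to verify numerically. The sketch below (NumPy; the layer sizes and random weights are arbitrary illustrative choices) checks that for a bias-free ReLU network, dividing each $W_k$ by its Frobenius norm $\rho_k$ and multiplying the output of the normalized network by $\prod_k\rho_k$ recovers $f(W; x)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_net(Ws, x):
    """Bias-free ReLU network f(W; x) = W_K sigma(W_{K-1} ... sigma(W_1 x))."""
    h = x
    for W in Ws[:-1]:
        h = np.maximum(W @ h, 0.0)
    return Ws[-1] @ h                                 # scalar output (last layer is a row vector)

# Arbitrary 3-layer network with input dimension 5 (illustrative sizes).
Ws = [rng.normal(size=(7, 5)), rng.normal(size=(6, 7)), rng.normal(size=(1, 6))]
x = rng.normal(size=5)

rhos = [np.linalg.norm(W) for W in Ws]                # Frobenius norms rho_k
Vs = [W / r for W, r in zip(Ws, rhos)]                # normalized matrices V_k, ||V_k|| = 1

f = relu_net(Ws, x)
f_tilde = relu_net(Vs, x)
print(np.allclose(f, np.prod(rhos) * f_tilde))        # True: f(W; x) = (prod_k rho_k) * f(V; x)
```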
C. Uniform convergence bounds: minimizing a surrogate loss under a norm constraint. Classical generalization bounds for regression (51) suggest that minimizing the empirical loss of a loss function such as the cross-entropy, subject to constrained complexity of the minimizer, is a way to attain generalization, that is, an expected loss which is close to the empirical loss:

Proposition 1. The following generalization bounds apply to all $f \in \mathbb{F}$ with probability at least $1 - \delta$:
$$L(f) \le \hat L(f) + c_1\,\mathbb{R}_N(\mathbb{F}) + c_2\sqrt{\frac{\ln(1/\delta)}{2N}}, \quad [4]$$
where $L(f) = \mathbb{E}[\ell(f(x), y)]$ is the expected loss, $\hat L(f)$ is the empirical loss, $\mathbb{R}_N(\mathbb{F})$ is the empirical Rademacher average of the class of functions $\mathbb{F}$, measuring its complexity; $c_1, c_2$ are constants that depend on properties of the Lipschitz constant of the loss function and on the architecture of the network.
Thus minimizing, under a constraint on the Rademacher complexity, a surrogate function such as the cross-entropy (which becomes the logistic loss in the binary classification case) will minimize an upper bound on the expected classification error, because such surrogate functions are upper bounds on the 0–1 loss function. We can choose a class of functions $\tilde{\mathbb{F}}$ with normalized weights and write $f(x) = \rho\tilde f(x)$ and $\mathbb{R}_N(\mathbb{F}) = \rho\,\mathbb{R}_N(\tilde{\mathbb{F}})$. One can choose any fixed $\rho$ as a (Ivanov) regularization-type tradeoff.
In summary, the problem of generalization may be approached by minimizing the exponential loss – or, more generally, an exponential-type loss such as the logistic and the cross-entropy – under a unit norm constraint on the weight matrices:
$$\lim_{\rho\to\infty}\ \arg\min_{\|V_k\|=1,\ \forall k}\ L(\rho\tilde f), \quad [5]$$
where we write $f(W) = \rho\tilde f(V)$ using the homogeneity of the network. As will become clear later, gradient descent techniques on the exponential loss automatically increase $\rho$ to infinity. We will typically consider the sequence of minimizations over $V_k$ for a sequence of increasing $\rho$. The key quantity for us is $\tilde f$ and the associated weights $V_k$; $\rho$ is in a certain sense an auxiliary variable, a constraint that is progressively relaxed.
In the following we explore the implications for deep networks of this classical approach to generalization.
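As a small illustration of why the normalized network $\tilde f$ is the right object for bounds such as Equation 4, the sketch below (NumPy; the tiny random "dataset" and network sizes are made up for illustration) computes the exponential loss of a ReLU network before and after dividing each weight matrix by its Frobenius norm: the unnormalized loss changes when every layer is rescaled, while the normalized loss is invariant to such rescaling.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu_net(Ws, x):
    h = x
    for W in Ws[:-1]:
        h = np.maximum(W @ h, 0.0)
    return float(Ws[-1] @ h)

def exp_loss(Ws, X, y):
    """Empirical exponential loss L = sum_n exp(-y_n f(W; x_n))."""
    return sum(np.exp(-yn * relu_net(Ws, xn)) for xn, yn in zip(X, y))

def normalized(Ws):
    """Divide each layer by its Frobenius norm: the normalized network f-tilde of the text."""
    return [W / np.linalg.norm(W) for W in Ws]

X, y = rng.normal(size=(20, 5)), rng.choice([-1.0, 1.0], size=20)
Ws = [rng.normal(size=(8, 5)), rng.normal(size=(1, 8))]

for scale in (1.0, 2.0, 4.0):
    scaled = [scale * W for W in Ws]                        # blow up every layer by `scale`
    print(exp_loss(scaled, X, y), exp_loss(normalized(scaled), X, y))
# The first column changes with the scale; the second (normalized) column does not.
```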
C.1. Remark: minimization of an exponential-type loss implies margin maximization. Though not critical for our approach to the question of generalization in deep networks, it is interesting that constrained minimization of the exponential loss implies margin maximization. This property relates our approach to the results of several recent papers (2–4). Notice that our Theorem 3, as in (52), is a sufficient condition for margin maximization. Necessity is not true for general loss functions.
To state the margin property more formally, we adapt to our setting a different result due to (52) (they consider for a linear network a vanishing $\lambda$ regularization term, whereas we have for nonlinear networks a set of unit norm constraints). First we recall the definition of the empirical loss $L(f) = \sum_{n=1}^N \ell(y_n f(x_n))$ with an exponential loss function $\ell(yf) = e^{-yf}$. We define $\eta(f)$ as the margin of $\tilde f$, that is, $\eta(f) = \min_n \tilde f(x_n)$.
Then our margin maximization theorem (proved in (1)) takes the form

Theorem 3. Consider the set of $V_k$, $k = 1, \cdots, K$, corresponding to
$$\min_{\|V_k\|=1} L(f(\rho_k, V_k)), \quad [6]$$
where the norm $\|V_k\|$ is a chosen $L_p$ norm and $L(f(\rho_k, V_k)) = L(f(\rho)) = \sum_n \ell(y_n\rho\tilde f(V; x_n))$ is the empirical exponential loss. For each layer consider a sequence of increasing $\rho_k$. Then the associated sequence of $V_k$ defined by Equation 6 converges for $\rho\to\infty$ to the maximum margin of $\tilde f$, that is, to $\max_{\|V_k\|\le 1}\eta(f)$.
D. Minimization under unit norm constraint: weight normalization. The approach is then to minimize the loss function $L(f(W)) = \sum_{n=1}^N e^{-f(W;x_n)y_n} = \sum_{n=1}^N e^{-\rho\tilde f(V;x_n)y_n}$, with $\rho = \prod_k\rho_k$, subject to $\|V_k\|_p^p = 1\ \forall k$, that is, under a unit norm constraint for the weight matrix at each layer (if $p = 2$ then $\sum_{i,j}(V_k)_{i,j}^2 = 1$ is the Frobenius norm). The minimization is understood as a sequence of minimizations for a sequence of increasing $\rho_k$. Clearly these constraints imply the constraint on the norm of the product of the weight matrices for any $p$ norm (because any induced operator norm is a sub-multiplicative matrix norm). The standard choice for a loss function is an exponential-type loss such as the cross-entropy, which for binary classification becomes the logistic function. We study here the exponential loss because it is simpler and retains all the basic properties.
There are several gradient descent techniques that, given the unconstrained optimization problem, transform it into a constrained gradient descent problem. To provide the background, let us formulate the standard unconstrained gradient descent problem for the exponential loss as it is used in practical training of deep networks:
$$\dot W_k^{i,j} = -\frac{\partial L}{\partial W_k^{i,j}} = \sum_{n=1}^N y_n\,\frac{\partial f(x_n; W)}{\partial W_k^{i,j}}\,e^{-y_n f(x_n;W)}, \quad [7]$$
where $W_k$ is the weight matrix of layer $k$. Notice that, since the structural property implies that at a critical point we have
$\sum_{n=1}^N y_n f(x_n; W)\,e^{-y_n f(x_n;W)} = 0$, the only critical points of this dynamics that separate the data (i.e., $y_n f(x_n; W) > 0\ \forall n$) are global minima at infinity. Of course, for separable data, while the loss decreases asymptotically to zero, the norm of the weights $\rho_k$ increases to infinity, as we will see later. Equations 7 define a dynamical system in terms of the gradient of the exponential loss $L$.
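This behaviour is easy to reproduce numerically. The sketch below (NumPy) runs the unconstrained dynamics of Equation 7 on a linearly separable toy problem with a linear "network" $f(w; x) = w^Tx$, chosen for brevity; the data, labels and step size are our own illustrative choices. The loss decreases toward zero while $\|w\|$ keeps growing, so the only minima that separate the data sit at infinity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Separable toy data: labels given by the sign of the first coordinate (an assumption for illustration).
X = rng.normal(size=(50, 2))
y = np.sign(X[:, 0])

w = np.zeros(2)
lr = 0.1
for t in range(1, 20001):
    margins = y * (X @ w)
    grad = -(y * np.exp(-margins)) @ X          # gradient of L = sum_n exp(-y_n w.x_n)
    w -= lr * grad                              # Equation 7 for the linear case
    if t in (100, 1000, 10000, 20000):
        loss = np.exp(-margins).sum()
        print(f"t={t:6d}  loss={loss:.4f}  ||w||={np.linalg.norm(w):.3f}")
# The loss decreases toward zero while ||w|| grows without bound (roughly like log t).
```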
The set of gradient-based algorithms enforcing a unit-norm constraint (53) comprises several techniques that are equivalent for small values of the step size. They are all good approximations of the true gradient method. One of them is the Lagrange multiplier method; another is the tangent gradient method, based on the following theorem:

Theorem 4 (53). Let $\|u\|_p$ denote a vector norm that is differentiable with respect to the elements of $u$ and let $g(t)$ be any vector function with finite $L_2$ norm. Then, calling $\nu(t) = \frac{\partial\|u\|_p}{\partial u}\big|_{u=u(t)}$, the equation
$$\dot u = h_g(t) = S\,g(t) = \left(I - \frac{\nu\nu^T}{\|\nu\|_2^2}\right)g(t), \quad [8]$$
with $\|u(0)\| = 1$, describes the flow of a vector $u$ that satisfies $\|u(t)\|_p = 1$ for all $t \ge 0$.
In particular, a form for $g$ is $g(t) = \mu(t)\nabla_u L$, the gradient update in a gradient descent algorithm. We call $Sg(t)$ the tangent gradient transformation of $g$. In the case of $p = 2$ we replace $\nu$ in Equation 8 with $u$, because $\nu(t) = \frac{\partial\|u\|_2}{\partial u} = u$. This gives $S = I - \frac{uu^T}{\|u\|_2^2}$ and $\dot u = Sg(t)$.
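A direct numerical check of Theorem 4 for $p = 2$ is immediate: projecting an arbitrary update direction $g$ with $S = I - uu^T/\|u\|^2$ and following the resulting flow keeps $\|u\|_2 \approx 1$. In the sketch below (NumPy), the "gradient" $g(t)$ is just random noise, an arbitrary stand-in for $\mu(t)\nabla_u L$, and the flow is integrated with a simple Euler step.

```python
import numpy as np

rng = np.random.default_rng(0)

def tangent_gradient(u, g):
    """S g = (I - u u^T / ||u||^2) g, the tangent gradient transformation for p = 2."""
    return g - u * (u @ g) / (u @ u)

u = np.array([1.0, 0.0, 0.0])                 # ||u(0)||_2 = 1
dt = 1e-3
for _ in range(10_000):
    g = rng.normal(size=3)                    # arbitrary update direction g(t)
    u = u + dt * tangent_gradient(u, g)       # Euler step of du/dt = S g(t)
print(np.linalg.norm(u))                      # stays close to 1 (up to discretization error)
```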
Consider now the empirical loss $L$ written in terms of $V_k$ and $\rho_k$ instead of $W_k$, using the change of variables defined by $W_k = \rho_k V_k$ but without imposing a unit norm constraint on $V_k$. The flows in $\rho_k, V_k$ can be computed as $\dot\rho_k = \frac{\partial W_k}{\partial\rho_k}\frac{\partial L}{\partial W_k} = V_k^T\frac{\partial L}{\partial W_k}$ and $\dot V_k = \frac{\partial W_k}{\partial V_k}\frac{\partial L}{\partial W_k} = \rho_k\frac{\partial L}{\partial W_k}$, with $\frac{\partial L}{\partial W_k}$ given by Equations 7.
We now enforce the unit norm constraint on $V_k$ by using the tangent gradient transform on the $V_k$ flow. This yields
$$\dot\rho_k = V_k^T\frac{\partial L}{\partial W_k}, \qquad \dot V_k = S_k\,\rho_k\,\frac{\partial L}{\partial W_k}. \quad [9]$$
Notice that the dynamics above follows from the classical approach of controlling the Rademacher complexity of $\tilde f$ during optimization (suggested by bounds such as Equation 4). The approach and the resulting dynamics for the directions of the weights would seem different from the standard unconstrained approach used in training deep networks. It turns out, however, that the dynamics described by Equations 9 is the same dynamics as that of weight normalization.
The technique of weight normalization (54) was originally proposed as a small improvement on standard gradient descent "to reduce covariate shifts". It was defined for each layer in terms of $w = g\frac{v}{\|v\|}$, as
$$\dot g = \frac{v^T}{\|v\|}\frac{\partial L}{\partial w}, \qquad \dot v = \frac{g}{\|v\|}\,S\,\frac{\partial L}{\partial w}, \quad [10]$$
with $S = I - \frac{vv^T}{\|v\|^2}$.
It is easy to see that Equations 9 are the same as the weight normalization Equations 10 if $\|v\|_2 = 1$. We now observe, multiplying Equation 9 by $v^T$, that $v^T\dot v = 0$ because $v^T S = 0$, implying that $\|v\|_2$ is constant in time, with a constant that can be taken to be 1. Thus the two dynamics are the same.
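The key observation above, $v^T\dot v = 0$, can be checked directly. In the sketch below (NumPy), the "gradient" $\partial L/\partial w$ and the scale $g$ are random stand-ins (illustrative only); at every point the $v$ update of Equation 10 is orthogonal to $v$, so $\|v\|_2$ is conserved and can be taken equal to 1, making Equations 9 and 10 the same dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)

for _ in range(5):
    v = rng.normal(size=6)                         # current direction (not necessarily unit norm)
    dL_dw = rng.normal(size=6)                     # arbitrary stand-in for dL/dw
    g = rng.normal()                               # the scale parameter of weight normalization
    S = np.eye(6) - np.outer(v, v) / (v @ v)       # S = I - v v^T / ||v||^2
    v_dot = (g / np.linalg.norm(v)) * (S @ dL_dw)  # v update of Equation 10
    print(np.isclose(v @ v_dot, 0.0, atol=1e-12))  # v^T v_dot = 0, so ||v||_2 is conserved
```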
E. Generalization with hidden complexity control. Empirically it appears that GD and SGD converge to solutions that can generalize even without batch or weight normalization. Convergence may be difficult for quite deep networks, and generalization may not be as good as with batch normalization, but it still occurs. How is this possible?
We study the dynamical system $\dot W_k^{i,j}$ under the reparametrization $W_k^{i,j} = \rho_k V_k^{i,j}$ with $\|V_k\|_2 = 1$. We consider for each weight matrix $W_k$ the corresponding "vectorized" representation in terms of a vector $W_k$. We use the following definitions and properties (for a vector $w$):
• Define $\frac{w}{\|w\|_2} = \tilde w$; thus $w = \|w\|_2\tilde w$ with $\|\tilde w\|_2 = 1$. Also define $S = I - \tilde w\tilde w^T = I - \frac{ww^T}{\|w\|_2^2}$.
• The following relations are easy to check:
1. $\frac{\partial\|w\|_2}{\partial w} = \tilde w$
2. $\frac{\partial\tilde w}{\partial w} = \frac{S}{\|w\|_2}$
3. $S\tilde w = Sw = 0$
4. $S^2 = S$
The gradient descent dynamical system used in training deep networks for the exponential loss is given by Equation 7. Following the chain rule for the time derivatives, the dynamics for $W_k$ is exactly (see (1)) equivalent to the following dynamics for $\|W_k\| = \rho_k$ and $V_k$:
$$\dot\rho_k = \frac{\partial\|W_k\|}{\partial W_k}\frac{\partial W_k}{\partial t} = V_k^T\dot W_k \quad [11]$$
and
$$\dot V_k = \frac{\partial V_k}{\partial W_k}\frac{\partial W_k}{\partial t} = \frac{S_k}{\rho_k}\dot W_k, \quad [12]$$
where $S_k = I - V_k V_k^T$. We used property 1 above for Equation 11 and property 2 for Equation 12.
The key point here is that the dynamics of $V_k$ includes a unit $L_2$ norm constraint: using the tangent gradient transform will not change the equation, because $S^2 = S$.
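The equivalence between the $W_k$ dynamics and the $(\rho_k, V_k)$ dynamics of Equations 11–12 can also be checked numerically. The sketch below (NumPy) uses a single weight vector $w$ standing in for one vectorized layer and a random stand-in for $\dot W_k$, and compares $\dot\rho_k$ and $\dot V_k$ from Equations 11–12 with finite differences of $\rho = \|w\|_2$ and $V = w/\|w\|_2$ along the $W$ flow.

```python
import numpy as np

rng = np.random.default_rng(0)

w = rng.normal(size=5)                      # one vectorized layer W_k
w_dot = rng.normal(size=5)                  # arbitrary stand-in for dW_k/dt = -dL/dW_k

rho, V = np.linalg.norm(w), w / np.linalg.norm(w)
S = np.eye(5) - np.outer(V, V)

rho_dot = V @ w_dot                         # Equation 11
V_dot = (S @ w_dot) / rho                   # Equation 12: (S_k / rho_k) W_k_dot

# Finite-difference check along the flow w(t + dt) = w + dt * w_dot.
dt = 1e-6
w2 = w + dt * w_dot
rho_dot_fd = (np.linalg.norm(w2) - rho) / dt
V_dot_fd = (w2 / np.linalg.norm(w2) - V) / dt
print(np.isclose(rho_dot, rho_dot_fd, atol=1e-4), np.allclose(V_dot, V_dot_fd, atol=1e-4))
```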
As separate remarks, notice that if for $t > t_0$ $f$ separates all the data, then $\frac{d}{dt}\rho_k > 0$, that is, $\rho$ diverges to $\infty$ with $\lim_{t\to\infty}\dot\rho = 0$. In the 1-layer network case the dynamics yields $\rho \approx \log t$ asymptotically. For deeper networks, this is different: (1) shows (for one support vector) that the product of weights at each layer diverges faster than logarithmically, but each individual layer diverges more slowly than in the 1-layer case. The norm of each layer grows at the same rate ($\dot\rho_k^2$ is independent of $k$). The $V_k$ dynamics has stationary or critical points given by
$$\sum_n \alpha_n\,\rho(t)\left(\frac{\partial\tilde f(x_n)}{\partial V_k^{i,j}} - V_k^{i,j}\,\tilde f(x_n)\right) = 0, \quad [13]$$
where $\alpha_n = e^{-y_n\rho(t)\tilde f(x_n)}$. We examine later the linear one-layer case $f(x) = v^Tx$, in which case the stationary points of the gradient are given by $\sum_n \alpha_n\,\rho(t)\,(x_n - vv^Tx_n) = 0$. In the linear case the critical point is unique and corresponds to a hyperbolic minimum. In the general case the critical points correspond, for $\rho\to\infty$, to degenerate zero "asymptotic minima" of the loss.
To understand whether there exists a hidden complexity control in standard gradient descent, we check whether there exists an $L_p$ norm for which unconstrained normalization is equivalent to constrained normalization.
[Figure 3: scatter plots of testing loss vs training cross-entropy loss, unnormalized and normalized layer-wise, with point color given by the standard deviation / scaling of the initial weights.]

Fig. 3. The top left graph shows testing vs training cross-entropy loss for networks, each trained on the same data set (CIFAR-10) but with different initializations, yielding zero classification error on the training set but different testing errors. The top right graph shows the same data, that is, testing vs training loss for the same networks, now normalized by dividing each weight by the Frobenius norm of its layer. Notice that all points have zero classification error at training. The red point on the top right refers to a network trained on the same CIFAR-10 data set but with randomized labels. It shows zero classification error at training and test error at chance level. The top line is a square-loss regression of slope 1 with positive intercept. The bottom line is the diagonal at which training and test loss are equal. The networks are 3-layer convolutional networks. The left graph can be considered a visualization of Equation 4 when the Rademacher complexity is not controlled. The right hand side is a visualization of the same relation for normalized networks, that is, $L(\tilde f) \le \hat L(\tilde f) + c_1\mathbb{R}_N(\tilde{\mathbb{F}}) + c_2\sqrt{\frac{\ln(1/\delta)}{2N}}$. Under our conditions for $N$ and for the architecture of the network, the terms $c_1\mathbb{R}_N(\tilde{\mathbb{F}}) + c_2\sqrt{\frac{\ln(1/\delta)}{2N}}$ represent a small offset. From (55).
From Theorem 4 we expect the constrained case to be given by the action of the following projector onto the tangent space:
$$S_p = I - \frac{\nu\nu^T}{\|\nu\|_2^2} \quad\text{with}\quad \nu_i = \frac{\partial\|w\|_p}{\partial w_i} = \mathrm{sign}(w_i)\left(\frac{|w_i|}{\|w\|_p}\right)^{p-1}. \quad [14]$$
The constrained gradient descent is then
$$\dot\rho_k = V_k^T\dot W_k, \qquad \dot V_k = \rho_k\,S_p\,\dot W_k. \quad [15]$$
On the other hand, reparametrization of the unconstrained dynamics in the $p$-norm gives (following Equations 11 and 12)
$$\dot\rho_k = \frac{\partial\|W_k\|_p}{\partial W_k}\frac{\partial W_k}{\partial t} = \mathrm{sign}(W_k)\circ\left(\frac{|W_k|}{\|W_k\|_p}\right)^{p-1}\cdot\dot W_k,$$
$$\dot V_k = \frac{\partial V_k}{\partial W_k}\frac{\partial W_k}{\partial t} = \frac{I - \mathrm{sign}(W_k)\circ\left(\frac{|W_k|}{\|W_k\|_p}\right)^{p-1}W_k^T}{\|W_k\|_p^{\,p-1}}\,\dot W_k. \quad [16]$$
These two dynamical systems are clearly different for generic $p$, reflecting the presence or absence of a regularization-like constraint on the dynamics of $V_k$.
As we have seen, however, for $p = 2$ the 1-layer dynamical system obtained by minimizing $L$ in $\rho_k$ and $V_k$ with $W_k = \rho_k V_k$ under the constraint $\|V_k\|_2 = 1$ is the weight normalization dynamics
$$\dot\rho_k = V_k^T\dot W_k, \qquad \dot V_k = S\,\rho_k\,\dot W_k, \quad [17]$$
which is quite similar to the standard gradient equations
$$\dot\rho_k = V_k^T\dot W_k, \qquad \dot V_k = \frac{S}{\rho_k}\dot W_k. \quad [18]$$
The two dynamical systems differ only by a $\rho_k^2$ factor in the $\dot V_k$ equations. However, the critical points of the gradient for the $V_k$ flow, that is, the points for which $\dot V_k = 0$, are the same in both cases, since for any $t > 0$, $\rho_k(t) > 0$ and thus $\dot V_k = 0$ is equivalent to $S\dot W_k = 0$. Hence, gradient descent with a unit $L_p$-norm constraint is equivalent to the standard, unconstrained gradient descent, but only when $p = 2$. Thus

Fact 1. The standard dynamical system used in deep learning, defined by $\dot W_k = -\frac{\partial L}{\partial W_k}$, implicitly enforces a unit $L_2$ norm constraint on $V_k$ with $\rho_k V_k = W_k$. Thus, under an exponential loss, if the dynamics converges, the $V_k$ represent the minimizer under the $L_2$ unit norm constraint.

Thus standard GD implicitly enforces the $L_2$ norm constraint on $V_k = \frac{W_k}{\|W_k\|_2}$, consistently with Srebro's results on the implicit bias of GD. Other minimization techniques, such as coordinate descent, may be biased towards different norm constraints.
F. Linear networks and rates of convergence. The linear network case ($f(x) = \rho v^Tx$) (2) is an interesting example of our analysis in terms of the $\rho$ and $v$ dynamics. We start with unconstrained gradient descent, that is, with the dynamical system
$$\dot\rho = \frac{1}{\rho}\sum_{n=1}^N e^{-\rho v^Tx_n}\,v^Tx_n, \qquad \dot v = \frac{1}{\rho}\sum_{n=1}^N e^{-\rho v^Tx_n}\,(x_n - vv^Tx_n). \quad [19]$$
If gradient descent in $v$ converges to $\dot v = 0$ at finite time, $v$ satisfies $vv^Tx = x$, where $x = \sum_{j=1}^C \alpha_j x_j$ with positive coefficients $\alpha_j$, and the $x_j$ are the $C$ support vectors (see (1)). A solution $v^T = \|x\|\,x^\dagger$ then exists ($x^\dagger$, the pseudoinverse of $x$,
since $x$ is a vector, is given by $x^\dagger = \frac{x^T}{\|x\|^2}$). On the other hand, the operator $T$ in $v(t+1) = Tv(t)$ associated with Equation 19 is non-expanding, because $\|v\| = 1,\ \forall t$. Thus there is a fixed point $v \propto x$ which is independent of the initial conditions (56).
The rates of convergence of the solutions $\rho(t)$ and $v(t)$, derived in a different way in (2), may be read out from the equations for $\rho$ and $v$. It is easy to check that a general solution for $\rho$ is of the form $\rho \propto C\log t$. A similar estimate for the exponential term gives $e^{-\rho v^Tx_n} \propto \frac{1}{t}$. Assume for simplicity a single support vector $x$. We claim that a solution for the error $\epsilon = v - x$, since $v$ converges to $x$, behaves as $\frac{1}{\log t}$. In fact, we write $v = x + \epsilon$ and plug it into the equation for $v$ in Equation 19. We obtain (assuming normalized input, $\|x\| = 1$)
$$\dot\epsilon = \frac{1}{\rho}e^{-\rho v^Tx}\left(x - (x+\epsilon)(x+\epsilon)^Tx\right) \approx \frac{1}{\rho}e^{-\rho v^Tx}\left(x - x - x\epsilon^Tx - \epsilon x^Tx\right), \quad [20]$$
which has the form $\dot\epsilon = -\frac{1}{t\log t}\left(2x\epsilon^Tx\right)$. Assuming $\epsilon$ of the form $\epsilon \propto \frac{1}{\log t}$, we obtain $-\frac{1}{t\log^2 t} = -B\frac{1}{t\log^2 t}$. Thus the error indeed converges as $\epsilon \propto \frac{1}{\log t}$.
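The logarithmic growth of $\rho$ and the slow convergence of $v$ can be seen in a small simulation of Equation 19 as reconstructed above (NumPy; the single "support vector" $x$, the initial conditions and the step size are arbitrary illustrative choices).

```python
import numpy as np

x = np.array([0.6, 0.8])                 # a single support vector with ||x|| = 1 (illustrative)
rho, v = 1.0, np.array([1.0, 0.0])
dt = 0.1
for t in range(1, 200001):
    m = rho * (v @ x)
    rho += dt * np.exp(-m) * (v @ x) / rho           # Equation 19, rho flow
    v = v + dt * np.exp(-m) * (x - v * (v @ x)) / rho  # Equation 19, v flow
    if t in (100, 10_000, 100_000, 200_000):
        print(f"t={t:6d}  rho={rho:6.3f}  ||v - x||={np.linalg.norm(v - x):.4f}")
# rho grows slowly, roughly like log t, while v drifts toward x at a 1/log t rate.
```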
A similar analysis for the weight normalization Equations 17 considers the same dynamical system with a change in the equation for $v$, which becomes
$$\dot v \propto e^{-\rho}\rho\,(I - vv^T)\,x. \quad [21]$$
This equation differs by a factor $\rho^2$ from Equation 20. As a consequence, Equation 21 is of the form $\dot\epsilon = -\frac{\log t}{t}\epsilon$, with a general solution of the form $\epsilon \propto t^{-\frac{1}{2}\log t}$. In summary, GD with weight normalization converges faster to the same equilibrium than standard gradient descent: the rate for $\epsilon = v - x$ is $t^{-\frac{1}{2}\log t}$ vs $\frac{1}{\log t}$.
Our goal was to find $\lim_{\rho\to\infty}\arg\min_{\|V_k\|=1,\forall k} L(\rho\tilde f)$. We have seen that various forms of gradient descent enforce different paths in increasing $\rho$ that empirically have different effects on the convergence rate. It is an interesting theoretical and practical challenge to find the optimal way, in terms of generalization and convergence rate, to grow $\rho\to\infty$.
Our analysis of simplified batch normalization (1) suggests that several of the same considerations that we used for weight normalization should apply (in the linear one-layer case BN is identical to WN). However, BN differs from WN in the multilayer case in several ways, in addition to weight normalization: it has, for instance, separate normalization for each unit, that is, for each row of the weight matrix at each layer.
4. Discussion

A main difference between shallow and deep networks is in terms of approximation power or, in equivalent words, of the ability to learn good representations from data based on the compositional structure of certain tasks. Unlike shallow networks, deep local networks – in particular convolutional networks – can avoid the curse of dimensionality in approximating the class of hierarchically local compositional functions. This means that for such a class of functions, deep local networks represent an appropriate hypothesis class that allows good approximation with a minimum number of parameters. It is not clear, of course, why many problems encountered in practice should match the class of compositional functions. Though we and others have argued that the explanation may be in either the physics or the neuroscience of the brain, these arguments are not rigorous. Our conjecture at present is that compositionality is imposed by the wiring of our cortex and, critically, is reflected in language. Thus compositionality of some of the most common visual tasks may simply reflect the way our brain works.
Fig. 4. Empirical and expected error on CIFAR-10 as a function of the number of neurons in a 5-layer convolutional network (training data size: 50,000; the horizontal axis is the number of model parameters, the vertical axis the error on CIFAR-10, with training and test curves shown). The expected classification error does not increase when increasing the number of parameters beyond the size of the training set, in the range we tested.
Optimization turns out to be surprisingly easy to perform for overparametrized deep networks, because SGD will converge with high probability to global minima that are typically much more degenerate for the exponential loss than other local critical points.
More surprisingly, gradient descent yields generalization in classification performance, despite overparametrization and even in the absence of explicit norm control or regularization, because standard gradient descent on the weights is subject to an implicit unit ($L_2$) norm constraint on the directions of the weights in the case of exponential-type losses for classification tasks.
In summary, it is tempting to conclude that the practical success of deep learning has its roots in the almost magic synergy of unexpected and elegant theoretical properties of several aspects of the technique: the deep convolutional network architecture itself, its overparametrization, the use of stochastic gradient descent, the exponential loss, the homogeneity of the ReLU units and of the resulting networks.
Of course, many problems remain open on the way to developing a full theory and, especially, in translating it to new architectures. More detailed results are needed in approximation theory, especially for densely connected networks. Our framework for optimization is missing at present a full classification of local minima and their dependence on overparametrization. The analysis of generalization should include an analysis of convergence of the weights for multilayer networks (see (4) and (3)). A full theory would also require an analysis of the trade-off for deep networks between approximation and estimation error, relaxing the separability assumption.
ACKNOWLEDGMENTS. We are grateful to Sasha Rakhlin and Nate Srebro for useful suggestions about the structural lemma and about separating critical points. Part of the funding is from the Center for Brains, Minds and Machines (CBMM), funded by NSF
STC award CCF-1231216, and part by C-BRIC, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.
1. Banburski A, et al. (2019) Theory of deep learning III: Dynamics and generalization in deep networks. CBMM Memo No. 090.
2. Soudry D, Hoffer E, Srebro N (2017) The implicit bias of gradient descent on separable data. arXiv e-prints.
3. Lyu K, Li J (2019) Gradient descent maximizes the margin of homogeneous neural networks. CoRR abs/1906.05890.
4. Shpigel Nacson M, Gunasekar S, Lee JD, Srebro N, Soudry D (2019) Lexicographic and depth-sensitive margins in homogeneous and non-homogeneous deep models. arXiv:1905.07325.
5. Anselmi F, Rosasco L, Tan C, Poggio T (2015) Deep convolutional networks are hierarchical kernel machines. Center for Brains, Minds and Machines (CBMM) Memo No. 35, also in arXiv.
6. Poggio T, Rosasco L, Shashua A, Cohen N, Anselmi F (2015) Notes on hierarchical splines, dclns and i-theory. MIT Computer Science and Artificial Intelligence Laboratory, Technical report.
7. Poggio T, Anselmi F, Rosasco L (2015) I-theory on depth vs width: hierarchical function composition. CBMM Memo No. 041.
8. Mhaskar H, Liao Q, Poggio T (2016) Learning real and boolean functions: When is deep better than shallow? Center for Brains, Minds and Machines (CBMM) Memo No. 45, also in arXiv.
9. Mhaskar H, Poggio T (2016) Deep versus shallow networks: an approximation theory perspective. Center for Brains, Minds and Machines (CBMM) Memo No. 54, also in arXiv.
10. Donoho DL (2000) High-dimensional data analysis: The curses and blessings of dimensionality. AMS Conference on Math Challenges of the 21st Century.
11. Mhaskar H (1993) Approximation properties of a multilayered feedforward artificial neural network. Advances in Computational Mathematics pp. 61–80.
12. Mhaskar HN (1993) Neural networks for localized approximation of real functions. Neural Networks for Processing [1993] III. Proceedings of the 1993 IEEE-SP Workshop (IEEE), pp. 190–196.
13. Chui C, Li X, Mhaskar H (1994) Neural networks for localized approximation. Mathematics of Computation 63(208):607–623.
14. Chui CK, Li X, Mhaskar HN (1996) Limitations of the approximation capabilities of neural networks with one hidden layer. Advances in Computational Mathematics 5(1):233–243.
15. Pinkus A (1999) Approximation theory of the MLP model in neural networks. Acta Numerica 8:143–195.
16. Poggio T, Smale S (2003) The mathematics of learning: Dealing with data. Notices of the American Mathematical Society (AMS) 50(5):537–544.
17. Montufar GF, Pascanu R, Cho K, Bengio Y (2014) On the number of linear regions of deep neural networks. Advances in Neural Information Processing Systems 27:2924–2932.
18. Livni R, Shalev-Shwartz S, Shamir O (2013) A provably efficient algorithm for training deep networks. CoRR abs/1304.7045.
19. Anselmi F, et al. (2014) Unsupervised learning of invariant representations with low sample complexity: the magic of sensory cortex or a new framework for machine learning? Center for Brains, Minds and Machines (CBMM) Memo No. 1, arXiv:1311.4158v5.
20. Anselmi F, et al. (2015) Unsupervised learning of invariant representations. Theoretical Computer Science.
21. Poggio T, Rosasco L, Shashua A, Cohen N, Anselmi F (2015) Notes on hierarchical splines, dclns and i-theory. CBMM Memo No. 037.
22. Liao Q, Poggio T (2016) Bridging the gap between residual learning, recurrent neural networks and visual cortex. Center for Brains, Minds and Machines (CBMM) Memo No. 47, also in arXiv.
23. Telgarsky M (2015) Representation benefits of deep feedforward networks. arXiv:1509.08101v2 [cs.LG].
24. Safran I, Shamir O (2016) Depth separation in ReLU networks for approximating smooth non-linear functions. arXiv:1610.09887v1.
25. Poggio T, Mhaskar H, Rosasco L, Miranda B, Liao Q (2016) Theory I: Why and when can deep – but not shallow – networks avoid the curse of dimensionality. CBMM Memo No. 058, MIT Center for Brains, Minds and Machines, Technical report.
26. Daubechies I, DeVore R, Foucart S, Hanin B, Petrova G (2019) Nonlinear approximation and (deep) ReLU networks. arXiv:1905.02199.
27. Jin C, Ge R, Netrapalli P, Kakade SM, Jordan MI (2017) How to escape saddle points efficiently. CoRR abs/1703.00887.
28. Ge R, Huang F, Jin C, Yuan Y (2015) Escaping from saddle points – online stochastic gradient for tensor decomposition. CoRR abs/1503.02101.
29. Lee JD, Simchowitz M, Jordan MI, Recht B (2016) Gradient descent only converges to minimizers. 29th Annual Conference on Learning Theory, Proceedings of Machine Learning Research, eds. Feldman V, Rakhlin A, Shamir O (PMLR, Columbia University, New York, NY, USA), Vol. 49, pp. 1246–1257.
30. Du SS, Lee JD, Tian Y (2018) When is a convolutional filter easy to learn? International Conference on Learning Representations.
31. Tian Y (2017) An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. Proceedings of the 34th International Conference on Machine Learning – Volume 70, ICML'17 (JMLR.org), pp. 3404–3413.
32. Soltanolkotabi M, Javanmard A, Lee JD (2019) Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory 65(2):742–769.
33. Li Y, Yuan Y (2017) Convergence analysis of two-layer neural networks with ReLU activation. Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17 (Curran Associates Inc., USA), pp. 597–607.
34. Brutzkus A, Globerson A (2017) Globally optimal gradient descent for a convnet with Gaussian inputs. Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, pp. 605–614.
35. Du S, Lee J, Tian Y, Singh A, Poczos B (2018) Gradient descent learns one-hidden-layer CNN: Don't be afraid of spurious local minima. Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, eds. Dy J, Krause A (PMLR, Stockholmsmässan, Stockholm, Sweden), Vol. 80, pp. 1339–1348.
36. Du SS, Lee JD, Li H, Wang L, Zhai X (2018) Gradient descent finds global minima of deep neural networks. CoRR abs/1811.03804.
37. Zhong K, Song Z, Jain P, Bartlett PL, Dhillon IS (2017) Recovery guarantees for one-hidden-layer neural networks. Proceedings of the 34th International Conference on Machine Learning – Volume 70, ICML'17 (JMLR.org), pp. 4140–4149.
38. Zhong K, Song Z, Dhillon IS (2017) Learning non-overlapping convolutional neural networks with multiple kernels. CoRR abs/1711.03440.
39. Zhang X, Yu Y, Wang L, Gu Q (2018) Learning one-hidden-layer ReLU networks via gradient descent. arXiv e-prints.
40. Li Y, Liang Y (2018) Learning overparameterized neural networks via stochastic gradient descent on structured data. Advances in Neural Information Processing Systems 31, eds. Bengio S, et al. (Curran Associates, Inc.), pp. 8157–8166.
41. Du SS, Zhai X, Poczos B, Singh A (2019) Gradient descent provably optimizes over-parameterized neural networks. International Conference on Learning Representations.
42. Zou D, Cao Y, Zhou D, Gu Q (2018) Stochastic gradient descent optimizes over-parameterized deep ReLU networks. CoRR abs/1811.08888.
43. Poggio T, Liao Q (2017) Theory II: Landscape of the empirical risk in deep learning. arXiv:1703.09833, CBMM Memo No. 066.
44. Zhang C, et al. (2017) Theory of deep learning IIb: Optimization properties of SGD. CBMM Memo No. 072.
45. Raginsky M, Rakhlin A, Telgarsky M (2017) Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis. arXiv e-prints [cs, math].
46. Daniely A (2017) SGD learns the conjugate kernel class of the network. Advances in Neural Information Processing Systems 30, eds. Guyon I, et al. (Curran Associates, Inc.), pp. 2422–2430.
47. Allen-Zhu Z, Li Y, Liang Y (2018) Learning and generalization in overparameterized neural networks, going beyond two layers. CoRR abs/1811.04918.
48. Arora S, Du SS, Hu W, Li Z, Wang R (2019) Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. CoRR abs/1901.08584.
49. Wei C, Lee JD, Liu Q, Ma T (2018) On the margin theory of feedforward neural networks. CoRR abs/1810.05369.
50. Liang T, Poggio T, Rakhlin A, Stokes J (2017) Fisher-Rao metric, geometry, and complexity of neural networks. CoRR abs/1711.01530.
51. Bousquet O, Boucheron S, Lugosi G (2003) Introduction to statistical learning theory. pp. 169–207.
52. Rosset S, Zhu J, Hastie T (2003) Margin maximizing loss functions. Advances in Neural Information Processing Systems 16 (NIPS 2003, Vancouver and Whistler, British Columbia, Canada), pp. 1237–1244.
53. Douglas SC, Amari S, Kung SY (2000) On gradient adaptation with unit-norm constraints. IEEE Transactions on Signal Processing 48(6):1843–1847.
54. Salimans T, Kingma DP (2016) Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in Neural Information Processing Systems.
55. Liao Q, Miranda B, Banburski A, Hidary J, Poggio TA (2018) A surprising linear relationship predicts test performance in deep networks. CoRR abs/1807.09659.
56. Ferreira PJSG (1996) The existence and uniqueness of the minimum norm solution to certain linear and nonlinear problems. Signal Processing 55:137–139.