
Is Depth Needed for Deep Learning? Circuit Complexity in Neural Networks

Ohad Shamir

Weizmann Institute of Science and Microsoft Research

STOC Deep Learning Workshop, June 2017


Neural Networks (a.k.a. Deep Learning)

A single neuron

x ↦ σ(w^⊤x + b)

Activation σ examples

ReLU: [z]_+ := max{0, z}

Feedforward neural network

Deep Networks

x (∈ ℝ^d) ↦ W_k σ_{k−1}(··· σ_2(W_2 σ_1(W_1 x + b_1) + b_2) ···) + b_k

Depth: k. Width: maximal dimension of W_1, ..., W_k

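As a concrete illustration of the definition above, here is a minimal numpy sketch of a feedforward network (the layer sizes and random parameters are arbitrary choices for the example, not from the talk):

```python
import numpy as np

def relu(z):
    # ReLU activation: [z]_+ = max{0, z}, applied elementwise
    return np.maximum(0.0, z)

def feedforward(x, weights, biases, activation=relu):
    """Compute W_k σ_{k-1}(... σ_1(W_1 x + b_1) ...) + b_k.
    Depth = number of layers k; width = largest dimension among the W_i."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = activation(W @ h + b)
    return weights[-1] @ h + biases[-1]

# Example: a depth-3, width-4 network on R^2 (illustrative sizes)
rng = np.random.default_rng(0)
dims = [2, 4, 4, 1]
weights = [rng.standard_normal((m, n)) for n, m in zip(dims[:-1], dims[1:])]
biases = [rng.standard_normal(m) for m in dims[1:]]
print(feedforward(rng.standard_normal(2), weights, biases))
```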

Deep Learning

Winner of ImageNet challenge 2012: AlexNet, 8 layers

Winner of ImageNet challenge 2014: VGG, 19 layers

Winner of ImageNet challenge 2015: ResNet, 152 layers

Is Depth Needed for Deep Learning?

Overwhelming empirical evidence

Intuitive:

Many tasks are naturally modelled as a pipeline
Deep networks allow end-to-end learning

(image of a dog) → hand-crafted features → predictor → “dog”

明天打电话给我。 → “call me tomorrow”

Is Depth Needed for Deep Learning?

No (in some sense):

Universal Approximation Theorems [Cybenko 1989, Hornik 1991, Leshno et al. 1993, ...]

2-layer networks, with any non-polynomial activation σ, can approximate any continuous f : [0, 1]^d → ℝ to arbitrary accuracy

Catch: The construction uses exp(d)-wide networks
What about poly-sized networks?

Is Depth Needed for Deep Learning?

Main Question

Are there real-valued functions which are

Expressible by a depth-h, width-w neural network

Not even approximable by any depth < h network, unless the width is much larger than w

Approximation metric: Expected loss w.r.t. some data distribution:

d(n, f) = E_{x∼D} ℓ(n(x), f(x))

In this talk: ℓ(y, y′) = (y − y′)²

Should Sound Familiar...

Same question asked in circuit complexity! (just a different motivation)

Boolean circuits (e.g. AC⁰)
Separation between any two constant depths [Håstad 1986, Rossman et al. 2015]

Threshold circuits (e.g. TC⁰)
Neural networks with σ(z) = 1{z ≥ 0} activations; Boolean input/output
Separation between depth 2 and 3, if weights are bounded [Hajnal et al. 1987]
Sufficiently large depths known to hit the natural proofs barrier

Arithmetic circuits
Neural networks computing polynomials
Each neuron computes a sum or a product

Should Sound Familiar...

But: Modern neural networks have non-Boolean inputs/outputs, and non-polynomial activations

Not Boolean circuits

Not threshold circuits

Not arithmetic circuits

Unlike work from the 80's/90's (e.g. Parberry [1994]), interested in real-valued inputs/outputs, not just Boolean functions

This Talk

Depth separations for modern neural networks

Nascent field in the machine learning community; some examples of results and techniques

Focus on clean lower bounds and standard activations

Many open questions...

Comments/feedback welcome!

Separating Depth 2 and 3 via Correlations

Depth-2 networks: x ↦ w_2^⊤ σ(W_1 x + b_1) + b_2
Linear combination of neurons σ(w^⊤x + b)

Depth-3 networks: x ↦ w_3^⊤ σ(W_2 σ(W_1 x + b_1) + b_2) + b_3

Theorem (Daniely 2017)

Let (x, y) ↦ f(x^⊤y), where x, y are uniform on S^{d−1} and f(z) = sin(π d³ z):

ε-approximable by a depth-3 ReLU network of poly(d, 1/ε) width and weight sizes

Not Ω(1)-approximable by any depth-2 ReLU network of exp(o(d log d)) width and O(exp(d))-sized weights

More generally: other activations; any f which is inapproximable by O(d^{1+ε})-degree polynomials
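To see why this target is so oscillatory from the network's point of view, here is a small numeric sketch (the dimension and sample size are my own illustrative choices, not from the talk): the inner product x^⊤y of two uniform points on the sphere concentrates in a range of width ≈ 1/√d, over which f(z) = sin(π d³ z) completes on the order of d^{2.5} periods.

```python
import numpy as np

def sample_sphere(n, d, rng):
    # n points drawn uniformly from the unit sphere S^{d-1}
    v = rng.standard_normal((n, d))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

d, n = 10, 100_000          # small d, purely for illustration
rng = np.random.default_rng(0)
x, y = sample_sphere(n, d, rng), sample_sphere(n, d, rng)

z = np.sum(x * y, axis=1)                # inner products x^T y, concentrated near 0
f = np.sin(np.pi * d**3 * z)             # the hard target f(x^T y)

print("std of x^T y             :", z.std())              # ~ 1/sqrt(d)
print("periods of f within ±2 std:", 2 * z.std() * d**3)  # ~ 2 d^{5/2}
```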

Lower Bound Proof Idea

Based on harmonic analysis over S^{d−1}

(x, y) ↦ f(x^⊤y) is almost orthogonal to any (x, y) ↦ ψ(x^⊤w, v^⊤y) (e.g. one neuron)

Need many neurons (or huge weights) to correlate with f(x^⊤y)

Comparison to Threshold Circuit Results

Correlation bounds also used for the depth-2/3 separation of threshold circuits [Hajnal et al. 1987]

But, a stronger separation: exp(Ω(d log d)) vs. exp(Ω(d)) width

Boolean functions never require more than O(2^d) width...

Weight Restrictions

The result assumes that weights are not too large. Really necessary?

In threshold circuits: a 30-year-old open question

Next: Separating depth-2/3 neural networks, without any weight restrictions, using a different technique

Theorem (Eldan and S., 2016)

∃ function f & distribution on ℝ^d such that f is

ε-approximable by a 3-layer, poly(d, 1/ε)-wide network

Not Ω(1)-approximable by any 2-layer, exp(o(d))-wide network

Applies to virtually any measurable σ(·) s.t. |σ(x)| ≤ poly(x)

Proof Idea

Use radial functions:

f(x) = g(‖x‖²), where x ∈ ℝ^d and g : ℝ → ℝ

If f is Lipschitz, easy to approximate with depth 3:

One layer: Approximate x ↦ x²; hence also x ↦ ‖x‖² = Σ_i x_i²

Second layer + output neuron: Approximate a univariate function of ‖x‖²

With two layers, difficult to do
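To make the two-stage construction concrete, here is a small numpy sketch (my own illustration; the ranges, knot counts, and choice of g are arbitrary, and this is not the construction from the paper): the first ReLU layer approximates each x_i² by a piecewise-linear interpolant, so their sum approximates ‖x‖², and the second ReLU layer approximates the univariate profile g on the relevant range.

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

def pwl(f, lo, hi, m):
    """ReLU representation of the piecewise-linear interpolant of f on [lo, hi]
    with m segments:  h(t) = f(lo) + sum_j c[j] * relu(t - k[j])."""
    grid = np.linspace(lo, hi, m + 1)
    slopes = np.diff(f(grid)) / np.diff(grid)
    return grid[:-1], np.diff(slopes, prepend=0.0), f(lo)

def radial_depth3(X, g, B=1.0, m=1000):
    """Depth-3 ReLU sketch for x -> g(||x||^2), assuming |x_i| <= B."""
    d = X.shape[1]
    k1, c1, b1 = pwl(np.square, -B, B, m)            # layer-1 neurons: t -> t^2
    r = d * b1 + relu(X[:, :, None] - k1).dot(c1).sum(axis=1)   # ~ ||x||^2
    k2, c2, b2 = pwl(g, 0.0, d * B**2, m)            # layer-2 neurons: r -> g(r)
    return b2 + relu(r[:, None] - k2).dot(c2)        # linear output neuron

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 20))
g = lambda r: np.sin(3 * r)
print("max error:", np.abs(radial_depth3(X, g) - g((X**2).sum(axis=1))).max())
```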

Proof Idea

Fourier Transform on ℝ^d

Given a function f:  f̂(ξ) = ∫ f(x) exp(−2πi ξ^⊤x) dx

If x is sampled from a distribution with density ϕ², then (by Parseval, using the fact that the Fourier transform of a product is a convolution):

E_{x∼ϕ²}[(n(x) − f(x))²] = ∫ (n(x) − f(x))² ϕ²(x) dx
                         = ∫ (n(x)ϕ(x) − f(x)ϕ(x))² dx
                         = ∫ (n̂ ∗ ϕ̂(ξ) − f̂ ∗ ϕ̂(ξ))² dξ

For a two-layer network, n(x) = Σ_i n_{i,w_i}(x) := Σ_i n_i(w_i^⊤x), so this equals

∫ (Σ_i n̂_{i,w_i} ∗ ϕ̂(ξ) − f̂ ∗ ϕ̂(ξ))² dξ

Proof Idea

∫ (Σ_i n̂_{i,w_i} ∗ ϕ̂(ξ) − f̂ ∗ ϕ̂(ξ))² dξ

For (say) Gaussian ϕ², ϕ̂ is Gaussian ⇒ each n̂_{i,w_i} is supported on the line spanned by w_i, so n̂_{i,w_i} ∗ ϕ̂(ξ) is concentrated in a thin tube, while f̂ ∗ ϕ̂(ξ) can be spread over a full-dimensional region

Intuition: Can't approximate a "fat" function with few "thin" functions in high dimension

Proof Idea

But: Hard to handle the Gaussian tail

Idea: Use a density ϕ² s.t.
f̂ ∗ ϕ̂ is "sufficiently fat"
ϕ̂ has bounded support
[figure: Σ_i n̂_{i,w_i}(ξ) ∗ ϕ̂(ξ) stays "thin"]

Proof Idea

Explicit construction in ℝ^d:

Density  ϕ²(x) = (R_d / ‖x‖)^d · J²_{d/2}(2π R_d ‖x‖),  where J_{d/2} is the Bessel function of the first kind

Function  f(x) = Σ_{i=1}^{poly(d)} ε_i · 1{‖x‖² ∈ Δ_i}

where ε_i ∈ {−1, +1} and the Δ_i are disjoint intervals

Higher Depth

So far: Separations between depths 2 and 3, in terms of dimension

Open Question: Can we show separations for higher depths?

In threshold circuits: Longstanding open problem. Probably very difficult from some constant depth onwards (natural proofs barrier)

But: These are not threshold circuits... Perhaps there are "hard" functions in Euclidean space?

Next: Higher-depth separations, in terms of quantities other than dimension

Highly Oscillatory Functions

Theorem (Telgarsky, 2016)

There exists a family of functions {ϕ_k}_{k=1}^∞ on [0, 1] s.t., for any k:

ϕ_k is expressible by a depth-k, O(1)-width ReLU network

ϕ_k is not approximable by any o(k/log(k))-depth, poly(k)-width ReLU network

* Approximation w.r.t. the uniform distribution on [0, 1]

Again, can be generalized to other activations

Construction

ϕ_1(x) = [2x]_+ − [4x − 2]_+

ϕ_2(x) = ϕ_1(ϕ_1(x))

ϕ_k(x) = ϕ_1^{∘k}(x)   (k-fold composition of ϕ_1)

Construction

ϕ_k is expressible by an O(k)-depth, O(1)-width ReLU network

ϕ_k is composed of 2^{k+1} linear segments; it can't be approximated by a piecewise-linear function with o(2^k) segments

A depth-h, width-w network expresses at most (2w)^h linear segments

⇒ If h = o(k/log(k)), can't approximate with width w = poly(k)
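A quick numerical check of the segment-counting argument (my own illustrative script, not from the talk): build ϕ_k by composing the width-2 ReLU unit k times and count its linear pieces on a dyadic grid; the count doubles with each composition.

```python
import numpy as np

def phi1(x):
    # One stage, a width-2 ReLU unit: phi_1(x) = [2x]_+ - [4x - 2]_+
    return np.maximum(0.0, 2 * x) - np.maximum(0.0, 4 * x - 2)

def phi(x, k):
    # phi_k = k-fold composition of phi_1: a depth-O(k), width-O(1) ReLU network
    for _ in range(k):
        x = phi1(x)
    return x

def count_segments(k):
    # All breakpoints of phi_k lie on multiples of 2^{-k}, so a dyadic grid finds them exactly
    x = np.linspace(0.0, 1.0, 2**(k + 2) + 1)
    slopes = np.diff(phi(x, k)) / np.diff(x)
    return 1 + int(np.sum(~np.isclose(np.diff(slopes), 0.0)))

for k in range(1, 9):
    print(k, count_segments(k))   # grows like 2^k, far beyond what a shallow network can match
```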

Separations in Accuracy

Theorem (Safran and S., 2016)

There exists a large family F of C² functions on [0, 1]^d (including x ↦ x²), s.t. for any f ∈ F:

Can be ε-approximated with a polylog(1/ε) depth and width ReLU network

Cannot be ε-approximated with an O(1)-depth ReLU network, unless the width is poly(1/ε)

* Approximation w.r.t. the uniform distribution

F ≈ non-linear functions expressible by a fixed number of additions and multiplications

Note: Broadly similar and independent results in [Yarotsky 2016], [Liang and Srikant 2016]

Proof Idea for x ↦ x²

Upper bound:

Use ϕ_1, ϕ_2, ..., ϕ_{O(log(1/ε))} variants to extract the first O(log(1/ε)) bits of x

Given the bit vector, do long multiplication to get the first O(log(1/ε)) bits of x²

Convert back to ℝ

Representable via an O(log(1/ε)) depth/width network
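A closely related construction that is easy to verify numerically (this is Yarotsky's sawtooth identity, cited on the previous slide, not the exact bit-extraction argument above): the piecewise-linear interpolant of x² at 2^m + 1 uniform knots equals x − Σ_{s=1}^{m} ϕ_s(x)/4^s, where ϕ_s is the s-fold composition of the sawtooth unit. Each extra stage halves the grid and quarters the error, so ε accuracy needs only m = O(log(1/ε)) stages of constant width.

```python
import numpy as np

def tent(x):
    # Sawtooth unit, a width-2 ReLU layer: [2x]_+ - [4x - 2]_+
    return np.maximum(0.0, 2 * x) - np.maximum(0.0, 4 * x - 2)

def square_approx(x, m):
    # x^2 on [0,1] via  x - sum_{s=1}^m phi_s(x) / 4^s   (depth O(m), constant width per stage)
    out, t = x.copy(), x.copy()
    for s in range(1, m + 1):
        t = tent(t)               # phi_s(x), the s-fold composition
        out -= t / 4**s
    return out

x = np.linspace(0.0, 1.0, 10_001)
for m in (2, 4, 8, 16):
    print(m, np.max(np.abs(square_approx(x, m) - x**2)))   # error decays like 4^{-(m+1)}
```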

Proof Idea for x ↦ x²

Lower bound:

If h is linear, then ∫_a^{a+Δ} (x² − h(x))² dx = Ω(Δ⁵)

⇒ If h is piecewise-linear with O(n) segments, then ∫_0^1 (x² − h(x))² dx = Ω(n⁻⁴)

But: Any O(1)-depth, w-width network can express only poly(w) segments

⇒ For ε approximation, need poly(1/ε) width

Similar ideas also in higher dimensions
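A small numerical check of the Ω(n⁻⁴) rate (my own script; grid size and segment counts are arbitrary illustrative choices): fit x² by its piecewise-linear interpolant with n equal segments, an upper bound on the best piecewise-linear fit, and watch the squared L2 error scale like n⁻⁴.

```python
import numpy as np

def pl_interp_error(n, grid=200_000):
    # Squared L2 error of the piecewise-linear interpolant of x^2 with n equal segments on [0,1]
    x = np.linspace(0.0, 1.0, grid)
    knots = np.linspace(0.0, 1.0, n + 1)
    h = np.interp(x, knots, knots**2)      # piecewise-linear interpolant of x^2
    return np.mean((x**2 - h) ** 2)        # ~ integral over [0,1]

for n in (4, 8, 16, 32, 64):
    err = pl_interp_error(n)
    print(n, err, err * n**4)              # last column ≈ 1/30: error scales as n^{-4}
```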

Natural Depth Separations

So far: Depth separations for some functions

But for machine learning, this is only 1/3 of the picture!

Expressiveness | Statistical Error | Optimization Error

Depth separations for functions that we can hope to learn with standard optimization methods?

(x, y) ↦ f(x^⊤y), f highly oscillating

x ↦ f(‖x‖), f highly oscillating

x ↦ f(x), f highly oscillating

x ↦ x², bit-extraction (highly oscillating) + long multiplication networks

First Example: Indicator of the L2 Ball

Theorem (Safran & S., 2016)

Let f(x) = 1{‖x‖ ≤ 1} on ℝ^d; there exists a distribution s.t. f is

ε-approximable with a depth-3, poly(d, 1/ε)-wide ReLU network

Not Ω(d⁻⁴)-approximable by any depth-2, exp(o(d))-wide ReLU network

Can be generalized to indicators of any ellipsoids

Proof idea: Reduction from the construction of Eldan and S.

Experiment: Unit L2 ball

d = 100

[Figure: RMSE on a validation set vs. batch number (×1000), comparing a 3-layer network of width 100 with 2-layer networks of width 100, 200, 400, and 800]

Second Example: L1 Ball

Theorem (Safran & S., 2016)

Let f(x) = [‖x‖₁ − 1]₊ on ℝ^d; there exists a distribution s.t. f is

Expressible with a depth-3, width-2d ReLU network

Not ε-approximable by any depth-2, width-min{1/ε, exp(d)} ReLU network

Proof Idea

Upper Bound

[‖x‖₁ − 1]₊ = [ Σ_{i=1}^d ([x_i]₊ + [−x_i]₊) − 1 ]₊

Lower Bound

The function "breaks" along the 2^d facets of the L1 ball

For a good approximation, most facets must have a ReLU neuron breaking close to them

Bound can probably be improved...
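The upper-bound identity is exact and amounts to a depth-3 ReLU network of width 2d; a tiny numpy check (my own, for illustration):

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

def l1_margin_depth3(X):
    """[||x||_1 - 1]_+ as a depth-3 ReLU network:
    layer 1 (width 2d): [x_i]_+ and [-x_i]_+; layer 2 (width 1): ReLU(sum - 1)."""
    layer1 = np.concatenate([relu(X), relu(-X)], axis=1)
    return relu(layer1.sum(axis=1) - 1.0)

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
print(np.allclose(l1_margin_depth3(X), relu(np.abs(X).sum(axis=1) - 1.0)))   # True
```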

Experiment: L1 Ball


Other Directions

Many other works and directions!

Study architectural properties of neural networks (depth and beyond) using arithmetic-style circuits

Depth separations using metrics other than approximation error

Study realistic architectures via upper bounds

... [Delalleau and Bengio 2011], [Pascanu et al. 2013], [Martens et al. 2013], [Montufar et al. 2014], [Cohen et al. 2015], [Cohen et al. 2016], [Raghu et al. 2016], [Poole et al. 2016], [Arora et al. 2016], [Mhaskar and Poggio 2016], [Shaham et al. 2016], [Mossel 2016], [McCane and Szymanski 2016], [Poggio et al. 2017], [Sharir and Shashua 2017], [Rolnick and Tegmark 2017], [Nguyen and Hein 2017], [Petersen and Voigtlaender 2017], [Lu et al. 2017], [Montanelli and Du 2017], [Telgarsky 2017], [Lee et al. 2017], [Khrulkov et al. 2017], [Serra et al. 2017], [Guss and Salakhutdinov 2017], [Mukherjee and Basu 2017] ...

Summary and Discussion

Depth separations for modern neural networks

Take-Home Message

Similar questions as in circuit complexity, but not standard circuits, and a different playing field

Euclidean (not Boolean) input/output
Continuity; large Lipschitz constants; Fourier analysis in ℝ^d...

No clear algebraic structure (as in arithmetic circuits). Use geometric properties instead
Curvature; piecewise linearity; sparsity in the Fourier domain...

AFAIK, little study of the connections between the fields

Open Questions

Separations w.r.t. dimension for depths > 3?
Alternatively, a "natural proof" barrier? How to even define one?

Strong separations w.r.t. dimension for O(1)-Lipschitz functions?

Circuit complexity techniques to analyze neural networks? And vice versa?

Is there any function which is both (1) provably deep and (2) easily learned with neural networks?

Architecture and expressiveness of modern neural networks beyond depth: Convolutions, pooling, recurrences, skip connections...
