Fast Learning Requires Good Memory: A Time-Space Lower Bound for Parity Learning

Ran Raz∗

Abstract

We prove that any algorithm for learning parities requires either a memory of quadratic size or an exponential number of samples. This proves a recent conjecture of Steinhardt, Valiant and Wager [SVW15] and shows that for some learning problems a large storage space is crucial.

More formally, in the problem of parity learning, an unknown string x ∈ {0,1}^n was chosen uniformly at random. A learner tries to learn x from a stream of samples (a_1, b_1), (a_2, b_2), . . ., where each a_t is uniformly distributed over {0,1}^n and b_t is the inner product of a_t and x, modulo 2. We show that any algorithm for parity learning that uses less than n²/25 bits of memory requires an exponential number of samples.

Previously, there was no non-trivial lower bound on the number of samples needed, for any learning problem, even if the allowed memory size is O(n) (where n is the space needed to store one sample).

We also give an application of our result in the field of bounded-storage cryptography. We show an encryption scheme that requires a private key of length n, as well as time complexity of n per encryption/decryption of each bit, and is provably and unconditionally secure as long as the attacker uses less than n²/25 memory bits and the scheme is used at most an exponential number of times. Previous works on bounded-storage cryptography assumed that the memory size used by the attacker is at most linear in the time needed for encryption/decryption.

∗Weizmann Institute of Science, Israel, and the Institute for Advanced Study, Princeton, NJ. Research supported by the Israel Science Foundation grant No. 1402/14, by the I-CORE Program of the Planning and Budgeting Committee and the Israel Science Foundation, by the Simons Collaboration on Algorithms and Geometry, by the Fund for Math at IAS, and by the National Science Foundation grant No. CCF-1412958. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation. Email: [email protected]

1 Introduction

Parity learning can be solved in polynomial time, by Gaussian elimination, using O(n) samples and O(n²) memory bits. On the other hand, parity learning can be solved by trying all the possibilities, using n + o(n) memory bits and an exponential number of samples.
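To make the first of these two extremes concrete, here is a minimal Python sketch (our own illustration; none of the names come from the paper) of the Gaussian-elimination learner: it keeps at most n reduced augmented rows over GF(2), i.e. O(n²) memory bits, and recovers x after O(n) samples in expectation.

```python
import random

def sample(x):
    # One parity-learning sample: a is uniform over {0,1}^n, b = a . x (mod 2).
    a = [random.randrange(2) for _ in range(len(x))]
    return a, sum(ai & xi for ai, xi in zip(a, x)) % 2

def learn_parity(x):
    # Online Gauss-Jordan elimination over GF(2): stores at most n reduced
    # augmented rows [a | b], i.e. O(n^2) memory bits (illustrative sketch only).
    n = len(x)
    rows = {}                                  # pivot column -> reduced row
    while len(rows) < n:
        a, b = sample(x)
        row = a + [b]
        for col, prow in rows.items():         # reduce against stored pivots
            if row[col]:
                row = [u ^ v for u, v in zip(row, prow)]
        lead = next((j for j in range(n) if row[j]), None)
        if lead is None:
            continue                           # linearly dependent, discard
        for col, prow in rows.items():         # clear the new pivot from old rows
            if prow[lead]:
                rows[col] = [u ^ v for u, v in zip(prow, row)]
        rows[lead] = row
    return [rows[j][n] for j in range(n)]      # row j now reads e_j | x_j

if __name__ == "__main__":
    x = [random.randrange(2) for _ in range(32)]
    assert learn_parity(x) == x
```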

We prove that any algorithm for parity learning requires either n²/25 memory bits, or an exponential number of samples. Our result may be of interest from the points of view of learning theory, computational complexity and cryptography.

1.1 Learning Theory

The main message of this paper from the point of view of learning theory is that for some learning problems, access to a relatively large memory is crucial. In other words, in some cases, learning is infeasible due to memory constraints. We show that there exist concept classes that can be efficiently learnt from a polynomial number of samples, if the learner has access to a quadratic-size memory, but require an exponential number of samples if the memory used by the learner is of less than quadratic size. This gives a formally stated and mathematically proved example for the intuitive feeling that a "good" memory may be very helpful in learning processes.

Many works studied the resources needed for learning, under certain information, communication or memory constraints (see in particular [S14, SVW15] and the many references given there). However, there was no previous non-trivial lower bound on the number of samples needed, for any learning problem, even when the allowed memory size is bounded by the length of one sample (where we don't count the space taken by the current sample that is being read).

The starting point of our work is the intriguing recent work of Steinhardt, Valiant and Wager [SVW15]. Steinhardt, Valiant and Wager asked whether there exist concept classes that can be efficiently learnt from a polynomial number of samples, but cannot be learnt from a polynomial number of samples if the allowed memory size is linear in the length of one sample. They conjectured that the problem of parity learning provides such a separation. Our main result proves that conjecture.

Remark 1.1. Conjecture 1.1 of [SVW15] conjectures that any algorithm for parity learning requires either at least n²/4 bits of memory, or at least 2^{n/4} samples. Our main result qualitatively proves this conjecture, but with different constants. The conjecture, as stated (that is, with the ambitious constants 1/4, 1/4), is too strong.¹

¹Roughly speaking, this is the case since an algorithm similar to Gaussian elimination can solve parity learning using n²/4 + O(n) memory bits and a polynomial number of samples (by keeping in step k a matrix with k rows and n columns, where the first k columns form the identity matrix). If 2^{n/4} samples are available, one can essentially solve a parity learning problem of size (3/4)n + o(n), by considering only samples with coefficients 0 on the last (1/4)n − o(n) variables. Hence, if 2^{n/4} samples are available, (9/64)n² + o(n²) memory bits are sufficient.

1.2 Computational Complexity

Time-space tradeoffs have been extensively studied in the field of computational complexity, in many works and various settings. Two brilliant lines of research were particularly successful in establishing time-space lower bounds for computation.

The first line of works [BJS98, A99a, A99b, BSSV00] gives explicit examples of polynomial-time computable Boolean functions f : {0,1}^n → {0,1}, such that any algorithm for computing f requires either at least n^{1−ε} memory bits, where ε > 0 is an arbitrarily small constant, or time complexity of at least Ω(n·√(log n / log log n)). These bounds are proved for any branching program that computes f. Branching programs are the standard and most general computational model for studying time-space tradeoffs in the non-uniform setting (which is the more general setting), and they are also the computational model that we use in the current work.

The second line of works [F97, FLvMV05, W06, W07] (and other works) studies time-space tradeoffs for SAT (and other NP problems), in the uniform setting, and proves that any algorithm for SAT requires either at least n^{1−ε} memory bits, or time complexity of at least n^{1+δ} (where 0 < ε, δ < 1 are constants). For an excellent survey, see [vM07].

Both lines of works obtain less than quadratic lower bounds on the time needed for computation, under memory constraints. Quadratic lower bounds on the time needed for computation are not known, even if the allowed memory size is logarithmic. Comparing these results to our work, one may ask: what makes it possible to prove exponential lower bounds on the time needed for parity learning, under memory constraints, while the known time-space lower bounds for computation are significantly weaker? The main point to keep in mind is that when studying time-space tradeoffs for computing a function, one assumes that the input of the function can always be accessed, and the space needed to store the input doesn't count as memory used by the algorithm. Thus, the input is stored for free. In our learning problem, it is assumed that after the learner saw a sample, the learner cannot access that sample again, unless the sample was stored in the learner's memory. The learner can always get a new sample that is "as good as the old one", but she cannot access the same sample that she saw before (without storing it in memory).

Finally, let us note that by Barrington's celebrated result, any function in NC¹ can be computed by a polynomial-length branching program of width 5 [B86]. Hence, proving super-polynomial lower bounds on the time needed for computing a function by a branching program of width 5 would imply super-polynomial lower bounds for formula size.

1.3 Cryptography

Assume that a group of (two or more) users share a (random) secret key x ∈ {0,1}^n. Assume that user Alice wants to send an encrypted bit M ∈ {0,1} to user Bob. Let a be a string of n bits, uniformly distributed over {0,1}^n, and assume that both Alice and Bob know a (we can think of a as taken from a shared random string, and if a shared random string is not available Alice can just choose a randomly and send it to Bob). Let b be the inner product of a and x, modulo 2. Thus, b is known to both Alice and Bob and can be used as a one-time pad to encrypt/decrypt M, that is, Alice encrypts by computing M ⊕ b and Bob decrypts by computing M = (M ⊕ b) ⊕ b.
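For concreteness, here is a minimal Python sketch of this scheme (the function names are ours and purely illustrative): each bit is padded with b = a · x for a fresh public random a, so encrypting or decrypting a bit costs one length-n inner product, i.e. time O(n) per bit as claimed.

```python
import secrets

def inner_product_mod2(a, x):
    # The pad bit b = a . x (mod 2), computable by both Alice and Bob.
    return sum(ai & xi for ai, xi in zip(a, x)) % 2

def encrypt_bit(x, m):
    # Alice: pick a fresh public random a and send (a, m XOR (a . x)).
    a = [secrets.randbelow(2) for _ in range(len(x))]
    return a, m ^ inner_product_mod2(a, x)

def decrypt_bit(x, a, c):
    # Bob: recompute the pad from the shared key x and strip it off.
    return c ^ inner_product_mod2(a, x)

if __name__ == "__main__":
    x = [secrets.randbelow(2) for _ in range(128)]   # shared secret key
    a, c = encrypt_bit(x, 1)
    assert decrypt_bit(x, a, c) == 1
```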

Assume that this protocol is used m + 1 times, with the same secret key x, where m is less than exponential. Denote by a_t, b_t the string a and bit b used at time t. Suppose that during all that time, an attacker could see (a_1, b_1), . . . , (a_m, b_m), but the attacker has less than n²/25 bits of memory. Our main result shows that the attacker cannot guess the secret key x with better than exponentially small probability. Therefore, using the fact that inner product is a strong extractor (with exponentially small error), even if the attacker sees a_{m+1}, the attacker cannot predict b_{m+1} with better than exponentially small advantage over a random guess. Thus, if the attacker has less than n²/25 bits of memory, the encryption remains secure as long as it is used less than an exponential number of times.

Bounded-storage cryptography, first introduced by Maurer [M92] and extensively studied in many works, studies cryptographic protocols that are secure under the assumption that the memory used by the attacker is limited (see for example [CM97, AR99, ADR02, V03, DM04], and many other works). Previous works on bounded-storage cryptography assumed the existence of a high-rate source of randomness that streams random bits to all parties. The main idea is that the attacker doesn't have a sufficiently large memory to store all the random bits, and hence a shared secret key can be used to randomly select (or extract) bits from the random source that the attacker has very little information about.

In previous works, the number of random bits transmitted during the encryption was assumed to be larger than the memory-size of the attacker. Thus, the time needed for encryption/decryption was at least linear in the memory-size of the attacker. In contrast, the time needed for encryption/decryption in our protocol is n, while the encryption is secure against attackers with memory of size n²/25.

Remark 1.2. If Alice and Bob want to transmit encrypted messages of length m, where m ≥ n (and the attacker has O(n²) bits of memory), our protocol has no advantage over previous ones, as the time needed for encryption/decryption in our protocol is mn. The advantage of our protocol is in situations where the users want to securely transmit many shorter messages.

1.4 Our Result

Parity Learning

In the problem of parity learning, there is an unknown string x ∈ {0,1}^n that was chosen uniformly at random. A learner tries to learn x from samples (a, b), where a ∈_R {0,1}^n and b = a · x (where a · x denotes inner product modulo 2). That is, the learning algorithm is given a stream of samples (a_1, b_1), (a_2, b_2), . . ., where each a_t is uniformly distributed over {0,1}^n and for every t, b_t = a_t · x.

Main Result

Theorem 1. For any c < 1/20, there exists α > 0, such that the following holds: Let x be uniformly distributed over {0,1}^n. Let m ≤ 2^{αn}. Let A be an algorithm that is given as input a stream of samples (a_1, b_1), . . . , (a_m, b_m), where each a_t is uniformly distributed over {0,1}^n and for every t, b_t = a_t · x. Assume that A uses at most cn² memory bits and outputs a string x̃ ∈ {0,1}^n. Then, Pr[x̃ = x] ≤ O(2^{−αn}).

Theorem 1 is restated, in a stronger² and more formal³ form, as Theorem 2 in Section 7, and the proof of Theorem 2 is given there.

²Theorem 2 allows the algorithm to output an affine subspace of dimension ≤ (3/5)n, and bounds by 2^{−αn} the probability that x belongs to that affine subspace.
³Theorem 2 models the algorithm by a branching program, which is more formal and clarifies that the theorem holds also in the (more general) non-uniform setting.

2 Preliminaries

For an integer n, denote [n] = {1, . . . , n}. For a, x ∈ {0,1}^n, denote by a · x their inner product modulo 2.

For a function P : Ω → ℝ, we denote by |P|_1 its ℓ_1 norm. In particular, for two distributions P, Q : Ω → [0,1], we denote by |P − Q|_1 their ℓ_1 distance.

For a random variable X and an event E, we denote by P_X the distribution of the random variable X, and we denote by P_{X|E} the distribution of the random variable X conditioned on the event E.

Denote by U_n the uniform distribution over {0,1}^n. For an affine subspace w ⊆ {0,1}^n, denote by U_w the uniform distribution over w.

For n ∈ ℕ, denote by A(n) the set of all affine subspaces of {0,1}^n.
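As a concrete (and entirely illustrative) encoding of these objects, the following Python sketch represents an affine subspace w ⊆ {0,1}^n by an offset vector together with a linearly independent basis of its direction space, and implements U_w and the ℓ_1 distance between two distributions given as dictionaries; the representation is our own choice, not something fixed by the paper.

```python
import itertools, random

def xor(u, v):
    return tuple(p ^ q for p, q in zip(u, v))

def points_of(offset, basis):
    # All 2^dim(w) points of w = offset + span(basis); assumes the basis is
    # linearly independent (only sensible for small toy examples).
    for coeffs in itertools.product((0, 1), repeat=len(basis)):
        p = offset
        for c, v in zip(coeffs, basis):
            if c:
                p = xor(p, v)
        yield p

def U(offset, basis):
    # The uniform distribution U_w as a dictionary: point -> probability.
    pts = list(points_of(offset, basis))
    return {p: 1.0 / len(pts) for p in pts}

def sample_U(offset, basis):
    # One draw from U_w: a uniformly random GF(2) combination of the basis.
    p = offset
    for v in basis:
        if random.randrange(2):
            p = xor(p, v)
    return p

def l1_distance(P, Q):
    # |P - Q|_1 = sum over z of |P(z) - Q(z)|.
    return sum(abs(P.get(z, 0.0) - Q.get(z, 0.0)) for z in set(P) | set(Q))

if __name__ == "__main__":
    # w is a 2-dimensional affine subspace of {0,1}^3.
    Uw = U((0, 0, 1), [(1, 0, 0), (0, 1, 0)])
    Un = U((0, 0, 0), [(1, 0, 0), (0, 1, 0), (0, 0, 1)])
    print(l1_distance(Uw, Un))   # 1.0 in this toy case
```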

3 Proof Outline

Computational Model

We model the learning algorithm by a branching program. A branching program of length m and width d, for parity learning, is a directed (multi) graph with vertices arranged in m + 1 layers containing at most d vertices each. Intuitively, each layer represents a time step and each vertex represents a memory state of the learner. In the first layer, that we think of as layer 0, there is only one vertex, called the start vertex. A vertex of outdegree 0 is called a leaf. Every non-leaf vertex in the program has 2^{n+1} outgoing edges, labeled by elements (a, b) ∈ {0,1}^n × {0,1}, with exactly one edge labeled by each such (a, b), and all these edges going into vertices in the next layer. Intuitively, these edges represent the action when reading (a_t, b_t). The samples (a_1, b_1), . . . , (a_m, b_m) ∈ {0,1}^n × {0,1} that are given as input define a computation-path in the branching program, by starting from the start vertex and following at Step t the edge labeled by (a_t, b_t), until reaching a leaf.

Each leaf v in the program is labeled by an affine subspace w(v) ∈ A(n), that we think of as the output of the program on that leaf. The program outputs the label w(v) of the leaf v reached by the computation-path. We interpret the output of the program as a guess that x ∈ w(v).
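As an illustration of this model (our own encoding, not the paper's), a branching program can be stored as layers of vertices where each non-leaf vertex carries its 2^{n+1} labeled outgoing edges implicitly, as a function from the sample (a, b) read at that step to a vertex of the next layer; running the program is simply following these transitions. A hedged Python sketch:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Sample = Tuple[Tuple[int, ...], int]      # (a, b) with b = a . x (mod 2)

@dataclass
class Vertex:
    # For a non-leaf vertex, next_vertex encodes the 2^(n+1) outgoing edges:
    # it maps the sample (a, b) read at this step to a vertex in the next layer.
    next_vertex: Optional[Callable[[Sample], "Vertex"]] = None
    label: Optional[object] = None        # w(v), the output label of a leaf

    @property
    def is_leaf(self) -> bool:
        return self.next_vertex is None

def run(start: Vertex, samples: List[Sample]) -> object:
    # Follow the computation-path: at step t, take the edge labeled (a_t, b_t).
    v = start
    for s in samples:
        if v.is_leaf:
            break
        v = v.next_vertex(s)
    return v.label                        # the program's output w(v)
```

Which vertex of the current layer is occupied is all the learner remembers, so a width-d layer corresponds to a memory of about log₂(d) bits.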

We also consider affine branching programs, where every vertex v (not necessarily a leaf) is labeled by an affine subspace w(v) ∈ A(n), such that the start vertex is labeled by the space {0,1}^n ∈ A(n), and for any edge (u, v), labeled by (a, b), we have w(u) ∩ {x′ ∈ {0,1}^n : a · x′ = b} ⊆ w(v). These properties guarantee that if the computation-path reaches a vertex v then x ∈ w(v). Thus, we can interpret w(v) as an affine subspace that is known to contain x.

An affine branching program is called accurate if for (almost) all vertices v, the distribution of x, conditioned on the event that the computation-path reached v, is close to the uniform distribution over w(v).

For exact definitions, see Section 5.

The High-Level Approach

The proof has two parts. We prove lower bounds for affine branching programs, and we reduce general branching programs to affine branching programs. The hard part is the reduction from general branching programs to affine branching programs. We note that this reduction is very wasteful and expands the width of the branching program by a factor of 2^{Θ(n²)}. Nevertheless, since we allow our branching program to be of width up to 2^{O(n²)}, this is still affordable (as long as the exact constant in the exponent is relatively small). We have to make sure, though, that when proving time-space lower bounds for affine branching programs, the upper bounds that we assume on the width of the affine branching programs are larger than the expansion of the width caused by the reduction.

We note that in the introduction to Conjecture 1.1 of [SVW15], Steinhardt, Valiant and Wager mention that they were able to prove the conjecture "for any algorithm whose memory states correspond to subspaces". However, a formal statement (or proof) is not given, so we do not know how similar their result is to our lower bound for affine branching programs. We note that affine branching programs, as we define them here, do not satisfy Conjecture 1.1 of [SVW15] (see Remark 1.1 and Footnote 1).

Lower Bounds for Affine Branching Programs

Assume that we have an affine branching program of length at most 2^{cn} and width at most 2^{cn²}, for a small enough constant c. Fix k = (4/5)n. We prove that the probability that the computation-path reaches some vertex that is labeled with an affine subspace of dimension ≤ k is at most 2^{−Ω(n²)}.

Without loss of generality, we can assume that all vertices in the program are labeled with affine subspaces of dimension ≥ k. Other vertices can just be removed, as the computation-path must reach a vertex labeled with a subspace of dimension k before it reaches a vertex labeled with a subspace of dimension < k (because the dimension can decrease by at most 1 along an edge).

We define the "orthogonal" to an affine subspace as the vector space orthogonal to the vector space that defines that affine subspace (that is, the vector space of which the affine subspace is a translation).
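Concretely, if an affine subspace is written as offset + span(B) for a set B of 0/1 vectors, its "orthogonal" in this sense is the null space {a : a · v = 0 (mod 2) for every v ∈ B}, which can be computed by Gaussian elimination over GF(2). A small illustrative Python sketch (our own representation, not part of the paper's argument):

```python
def null_space_gf2(basis, n):
    # Basis of {a in {0,1}^n : a . v = 0 (mod 2) for every v in basis},
    # i.e. the vector space "orthogonal" to span(basis).
    pivots, reduced = [], []
    for row in (list(v) for v in basis):
        for p, r in zip(pivots, reduced):          # reduce against known pivots
            if row[p]:
                row = [u ^ v for u, v in zip(row, r)]
        lead = next((j for j in range(n) if row[j]), None)
        if lead is None:
            continue                               # dependent row, skip
        for i, r in enumerate(reduced):            # clear the new pivot column
            if r[lead]:
                reduced[i] = [u ^ v for u, v in zip(r, row)]
        pivots.append(lead)
        reduced.append(row)
    out = []
    for f in (j for j in range(n) if j not in pivots):
        a = [0] * n
        a[f] = 1                                   # one free column per output vector
        for p, r in zip(pivots, reduced):
            a[p] = r[f]                            # solve the pivot coordinates
        out.append(tuple(a))
    return out
```

If the affine subspace has dimension k, the returned basis has n − k vectors.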

Let v be a vertex in the program, such that w(v) is of dimension k. It is enough to prove that the probability that the computation-path reaches v is at most 2^{−Ω(n²)}.

To prove this, we consider the vector spaces "orthogonal" to the affine subspaces that label the vertices along the computation-path, and for each of them we consider its intersection with the vector space "orthogonal" to w(v). We note that, in each step, the probability that the dimension of the intersection increases is exponentially small (as it requires that the a_t currently being read is contained in some small vector space). Since the dimension of the intersection must increase a linear number of times in order for the computation-path to reach v, a simple union bound shows that the probability to reach v is at most 2^{−Ω(n²)}.

The full details are given in Lemma 7.1.
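For a sense of the quantities involved, here is the rough arithmetic behind this union bound (a simplified version of the computation carried out precisely in the proof of Theorem 2), assuming length m = 2^{αn} and k = (4/5)n as above:

```latex
\begin{align*}
\Pr[\text{reach } v]
  &\le m^{\,n-k}\cdot 2^{\sum_{j=0}^{n-k-1}(n-2k-j)}
     && \text{(per-vertex bound, Lemma 7.1)}\\
  &=   2^{\alpha n\cdot\frac{n}{5}}\cdot
       2^{\frac{n}{5}\cdot\left(-\frac{3n}{5}\right)-\frac{1}{2}\cdot\frac{n}{5}\left(\frac{n}{5}-1\right)}
     && \left(n-k=\tfrac{n}{5},\ \ n-2k=-\tfrac{3n}{5}\right)\\
  &=   2^{\left(\frac{\alpha}{5}-\frac{7}{50}\right)n^{2}+\frac{n}{10}}
     \;=\; 2^{-\Omega(n^{2})} \quad \text{for any sufficiently small } \alpha .
\end{align*}
```

The full computation in Section 7 must also absorb the 2^{cn²} factor coming from the number of dimension-k vertices produced by the reduction, which is what forces the stronger requirement c < 1/20 there.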

From Branching Programs to Affine Branching Programs

In Section 6, we show how to simulate a branching program by an accurate affine branching program. We do that layer after layer. Assume that we are already done with layer j − 1, so every vertex in layer j − 1 is already labeled by an affine subspace, and the distribution of x, conditioned on the event that the computation-path reached a vertex, is close to the uniform distribution over the affine subspace that labels that vertex.

Now, take a vertex v in layer j, and consider the distribution of x, conditioned on the event that the computation-path reached the vertex v. By the property that we already know on layer j − 1, this distribution is close to a convex combination of uniform distributions over affine subspaces of {0,1}^n.

One could split v into a large number of vertices, one vertex for each affine subspace in the combination. However, this practically means that we would have a vertex for any affine subspace. We would like to keep the number of vertices somewhat smaller. This is done by grouping many affine subspaces into one group. The group will be labeled by an affine subspace that contains all the affine subspaces in the group. Moreover, we will have the property that for each such group, the uniform distribution over the affine subspace that labels the group is close to the relevant weighted average of the uniform distributions over the affine subspaces in the group. Thus, practically, we can replace all the affine subspaces in the group by one affine subspace that represents all of them.

Lemma 4.3 shows that it is possible to group all the affine subspaces into a relatively small number of groups.

We note that the entire inductive argument is delicate, as we cannot afford deteriorating the error multiplicatively in each step and need to make sure that all errors are additive.

4 Distributions over Affine Subspaces

In this section, we study convex combinations of uniform distributions over affine subspaces of {0,1}^n. Lemma 4.3 is the only result proved in this section that is used outside the section.

In the following lemmas, we have a random variable W ∈ A(n) and we consider the distribution E_W[U_W]. This distribution is a convex combination of uniform distributions over affine subspaces of {0,1}^n.

The first lemma identifies a condition that implies that the distribution E_W[U_W] is close to the uniform distribution over {0,1}^n.

Lemma 4.1. Let W ∈ A(n) be a random variable. Let r ≥ n/2. Assume that for every a ∈ {0,1}^n such that a ≠ 0, and every b ∈ {0,1},

Pr_W[∀x ∈ W : a · x = b] ≤ 2^{−r}.

Then

|E_W[U_W] − U_n|_1 < 2^{−(r − n/2)}.

Proof. The proof uses Fourier analysis. For any affine subspace w ⊆ {0,1}^n, the Fourier coefficients of U_w are:

Û_w(a) =  2^{−n}   if ∀x ∈ w : a · x = 0,
         −2^{−n}   if ∀x ∈ w : a · x = 1,
          0        otherwise.

Hence, the Fourier coefficients of E_W[U_W] are:

E_W[Û_W(a)] = 2^{−n} · (Pr_W[∀x ∈ W : a · x = 0] − Pr_W[∀x ∈ W : a · x = 1]),

and note that this also implies

E_W[Û_W(0)] = 2^{−n}.

The Fourier coefficients of U_n are:

Û_n(a) =  2^{−n}   if a = 0,
          0        if a ≠ 0.

Thus,

∑_{a ∈ {0,1}^n} (E_W[Û_W(a)] − Û_n(a))² < 2^n · (2^{−n} · 2^{−r})² = 2^{−n−2r}.

By Cauchy-Schwarz and Parseval,

(E_{x ∈_R {0,1}^n} |E_W[U_W](x) − U_n(x)|)² ≤ E_{x ∈_R {0,1}^n} (E_W[U_W](x) − U_n(x))² = ∑_{a ∈ {0,1}^n} (E_W[Û_W(a)] − Û_n(a))² < 2^{−n−2r}.

Therefore,

|E_W[U_W] − U_n|_1 = 2^n · E_{x ∈_R {0,1}^n} |E_W[U_W](x) − U_n(x)| < 2^n · √(2^{−n−2r}) = 2^{−(r−n/2)}.

The next lemma shows that there always exists an affine subspace s ⊆ {0,1}^n such that the distribution E_{W|(W⊆s)}[U_W] is close to the uniform distribution over s, and the event W ⊆ s occurs with non-negligible probability.

Lemma 4.2. Let W ∈ A(n) be a random variable. Let r ≥ n/2. There exists an affine subspace s ⊆ {0,1}^n, such that:

1. Pr_W[W ⊆ s] ≥ 2^{−∑_{i=0}^{n−dim(s)−1} (r − i/2)}.

2. |E_{W|(W⊆s)}[U_W] − U_s|_1 < 2^{−(r − n/2)}.

Proof. The proof is by induction on n. The base case, n = 0, is trivial, because in this case the only element of A(n) is {0}, so the lemma follows with s = {0}.

Let n ≥ 1. If for every a ∈ {0,1}^n such that a ≠ 0, and every b ∈ {0,1}, we have Pr_W[∀x ∈ W : a · x = b] ≤ 2^{−r}, the proof follows by Lemma 4.1, with s = {0,1}^n. Otherwise, there exist a ≠ 0 and b ∈ {0,1}, such that Pr_W[∀x ∈ W : a · x = b] > 2^{−r}. Denote by u the (n − 1)-dimensional affine subspace

u = {x ∈ {0,1}^n : a · x = b}.

Thus,

Pr_W[W ⊆ u] > 2^{−r}.

Consider the random variable W′ = W | (W ⊆ u). Since u is an (n − 1)-dimensional affine subspace, we can identify u with {0,1}^{n−1} and think of W′ as a random variable over A(n − 1). Hence, by the inductive hypothesis (applied with n − 1 and r − 1/2), there exists an affine subspace s ⊆ u, such that:

1. Pr_{W′}[W′ ⊆ s] ≥ 2^{−∑_{i=1}^{n−dim(s)−1} (r − i/2)}.

2. |E_{W′|(W′⊆s)}[U_{W′}] − U_s|_1 < 2^{−(r − n/2)}.

We will show that s satisfies the two properties claimed in the statement of the lemma. For the first property, note that since s ⊆ u,

Pr[W ⊆ s] = Pr[W ⊆ u] · Pr[W ⊆ s | W ⊆ u] = Pr[W ⊆ u] · Pr[W′ ⊆ s] > 2^{−r} · 2^{−∑_{i=1}^{n−dim(s)−1} (r − i/2)} = 2^{−∑_{i=0}^{n−dim(s)−1} (r − i/2)}.

For the second property, note that since s ⊆ u,

E_{W|(W⊆s)}[U_W] = E_{W′|(W′⊆s)}[U_{W′}].

The next lemma is the main result of this section.

Lemma 4.3. Let W ∈ A(n) be a random variable. Let r ≥ n/2. There exists a partial function σ : A(n) → A(n), such that:

1. Pr_W[W ∉ domain(σ)] ≤ 2^{−2n}.

2. For every w ∈ domain(σ), w ⊆ σ(w).

3. For every s ∈ image(σ), |E_{W|(σ(W)=s)}[U_W] − U_s|_1 < 2^{−(r − n/2)}.

4. For every k ∈ ℕ, there are at most 4n · 2^{∑_{i=0}^{n−k−1} (r − i/2)} elements s ∈ image(σ) with dim(s) ≥ k.

Proof. The proof is by repeatedly applying Lemma 4.2. We start with the random variable W_0 = W, and apply Lemma 4.2 on W_0. We obtain a subspace s_0 (the subspace s whose existence is guaranteed by Lemma 4.2). For every w ⊆ s_0, we define σ(w) = s_0.

We then define the random variable W_1 = W_0 | (W_0 ⊈ s_0), and apply Lemma 4.2 on W_1. We obtain a subspace s_1 (the subspace s whose existence is guaranteed by Lemma 4.2). For every w ⊆ s_1 on which σ was still not defined, we define σ(w) = s_1.

In the same way, in Step i, we define the random variable W_i = W_{i−1} | (W_{i−1} ⊈ s_{i−1}). Note that W_i = W | (W ⊈ s_0) ∧ . . . ∧ (W ⊈ s_{i−1}), that is, W_i is the restriction of W to the part of A(n) where σ was still not defined. We apply Lemma 4.2 on W_i and obtain a subspace s_i (the subspace s whose existence is guaranteed by Lemma 4.2). For every w ⊆ s_i on which σ was still not defined, we define σ(w) = s_i.

We repeat this until Pr_W[W ∉ domain(σ)] ≤ 2^{−2n}.

Note that for i′ < i, s_{i′} ≠ s_i, because the support of W_i doesn't contain any element w ⊆ s_{i′}. Hence, the subspaces s_0, s_1, . . . are all different.

It remains to show that the four properties in the statement of the lemma hold. The first property is obvious because we continue to define σ on more and more elements repeatedly, until the first property holds. The second property is obvious because we mapped w to s_i only if w ⊆ s_i. The third property holds by the second property guaranteed by Lemma 4.2. The fourth property holds because, by the first property guaranteed by Lemma 4.2, in each step where we obtain a subspace s_i of dimension at least k, we define σ on a fraction of at least 2^{−∑_{i=0}^{n−k−1} (r − i/2)} of the space that still remains. Thus, after at most 4n · 2^{∑_{i=0}^{n−k−1} (r − i/2)} such steps we have Pr[W ∉ domain(σ)] ≤ 2^{−2n}, and we stop. Thus, the number of elements s_i of dimension at least k that we obtain in the process is at most 4n · 2^{∑_{i=0}^{n−k−1} (r − i/2)}.

5 Branching Programs for Parity Learning

Recall that in the problem of parity learning, there is a string x ∈ {0,1}^n that was chosen uniformly at random. A learner tries to learn x from a stream of samples (a_1, b_1), (a_2, b_2), . . ., where each a_t is uniformly distributed over {0,1}^n and for every t, b_t = a_t · x.

5.1 General Branching Programs for Parity Learning

In the following definition, we model the learner by a branching program. We allow the branching program to output an affine subspace w ∈ A(n). We interpret the output of the program as a guess that x ∈ w. Obviously, the output w is more meaningful when dim(w) is relatively small.

Definition 5.1. Branching Program for Parity Learning: A branching program of length m and width d, for parity learning, is a directed (multi) graph with vertices arranged in m + 1 layers containing at most d vertices each. In the first layer, that we think of as layer 0, there is only one vertex, called the start vertex. A vertex of outdegree 0 is called a leaf. All vertices in the last layer are leaves (but there may be additional leaves). Every non-leaf vertex in the program has 2^{n+1} outgoing edges, labeled by elements (a, b) ∈ {0,1}^n × {0,1}, with exactly one edge labeled by each such (a, b), and all these edges going into vertices in the next layer. Each leaf v in the program is labeled by an affine subspace w(v) ∈ A(n), that we think of as the output of the program on that leaf.

Computation-Path: The samples (a_1, b_1), . . . , (a_m, b_m) ∈ {0,1}^n × {0,1} that are given as input define a computation-path in the branching program, by starting from the start vertex and following at Step t the edge labeled by (a_t, b_t), until reaching a leaf. The program outputs the label w(v) of the leaf v reached by the computation-path.

Success Probability: The success probability of the program is the probability that x ∈ w, where w is the affine subspace that the program outputs, and the probability is over x, a_1, . . . , a_m (where x, a_1, . . . , a_m are uniformly distributed over {0,1}^n, and for every t, b_t = a_t · x).

5.2 Affine Branching Programs for Parity Learning

Next, we define a special type of a branching program for parity learning, that we call an affine branching program for parity learning. In an affine branching program for parity learning, every vertex v (not necessarily a leaf) is labeled by an affine subspace w(v) ∈ A(n). We will have the property that if the computation-path reaches v then x ∈ w(v). Thus, we can interpret w(v) as an affine subspace that is known to contain x.

Definition 5.2. Affine Branching Program for Parity Learning: A branching program for parity learning is affine if each vertex v in the program is labeled by an affine subspace w(v) ∈ A(n), and the following properties hold:

1. Start vertex: The start vertex is labeled by the space {0,1}^n ∈ A(n).

2. Soundness: For an edge e = (u, v), labeled by (a, b), denote

w(e) = w(u) ∩ {x′ ∈ {0,1}^n : a · x′ = b}.

Then, w(e) ⊆ w(v).

Given an affine branching program for parity learning, and samples (a_1, b_1), . . . , (a_m, b_m), such that for every t, b_t = a_t · x, it follows by induction that for every vertex v in the program, if the computation-path reaches v then x ∈ w(v). In particular, the output w of the program always satisfies x ∈ w, and thus the success probability of an affine program is always 1.
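The soundness property revolves around the update w(e) = w(u) ∩ {x′ : a · x′ = b}. With an affine subspace represented by an offset vector and a basis of its direction space, this intersection is easy to compute explicitly; the following Python sketch is our own illustration, not part of the paper.

```python
def dot(a, x):
    return sum(ai & xi for ai, xi in zip(a, x)) % 2

def xor(u, v):
    return tuple(p ^ q for p, q in zip(u, v))

def intersect_with_hyperplane(offset, basis, a, b):
    # Intersect w = offset + span(basis) with {x' : a . x' = b} over GF(2).
    # Returns (offset', basis') for the intersection, or None if it is empty.
    hot = [v for v in basis if dot(a, v) == 1]     # basis vectors with a . v = 1
    cold = [v for v in basis if dot(a, v) == 0]
    if not hot:
        # a . x' is constant on w: the constraint holds everywhere or nowhere.
        return (offset, basis) if dot(a, offset) == b else None
    pivot = hot[0]
    # Shift the offset onto the hyperplane if needed, and fold the pivot into
    # the other non-orthogonal basis vectors; the dimension drops by exactly 1.
    new_offset = offset if dot(a, offset) == b else xor(offset, pivot)
    new_basis = cold + [xor(v, pivot) for v in hot[1:]]
    return new_offset, new_basis
```

In particular, when the intersection is nonempty its dimension is either dim(w(u)) or dim(w(u)) − 1, which is the fact used in Sections 3 and 7 when arguing that the dimension decreases by at most 1 along an edge.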

5.3 Accurate Affine Branching Programs for Parity Learning

For a vertex v in a branching program for parity learning, we denote by P_{x|v} the distribution of the random variable x, conditioned on the event that the vertex v was reached by the computation-path.

Definition 5.3. ε-Accurate Affine Branching Program for Parity Learning: An affine branching program of length m for parity learning is ε-accurate if all the leaves are in the last layer, and the following additional property holds (where x, a_1, . . . , a_m are uniformly distributed over {0,1}^n, and for every t, b_t = a_t · x):

3. Accuracy: Let 0 ≤ t ≤ m. Let V_t be the vertex in layer t reached by the computation-path. Let y_t be a random variable uniformly distributed over the subspace w(V_t). Then,

|P_{V_t,x} − P_{V_t,y_t}|_1 ≤ ε,

or, equivalently,

E_{V_t} |P_{x|V_t} − U_{w(V_t)}|_1 ≤ ε.

6 From Branching Programs to Affine Branching Programs

In this section, we show that any branching program B for parity learning can be simulated by an affine branching program P for parity learning. Roughly speaking, each vertex of the simulated program B will be represented by a set of vertices of the simulating program P. Note that the width of P will typically be significantly larger than the width of B.

More precisely, a branching program B for parity learning is simulated by a branching program P for parity learning if there exists a mapping Γ from the vertices of P to the vertices of B, and the following properties hold:

1. Preservation of structure: For every i, Γ maps layer i of P to layer i of B. Moreover, Γ maps leaves to leaves and non-leaf vertices to non-leaf vertices. Note that Γ is not necessarily one-to-one.

2. Preservation of functionality: For every edge (u, v), labeled by (a, b), in P, there is an edge (Γ(u), Γ(v)), labeled by (a, b), in B.

Lemma 6.1. Let k′ < n. Assume that there exists a length m and width d branching program B for parity learning (of size n), such that: all leaves of B are in the last layer; the output of B is always an affine subspace of dimension ≤ k′; and the success probability of B is β. Let n/2 ≤ r ≤ n. Let ε = 4m · 2^{−(r − n/2)}. Then, there exists an ε-accurate length m affine branching program P for parity learning (of size n), such that:

1. For every k < n, the number of vertices in P that are labeled with an affine subspace of dimension k is at most

4n · 2^{∑_{i=0}^{n−k−1} (r − i/2)} · dm.

2. For every k such that k′ < k < n, the output of P is an affine subspace of dimension < k, with probability at least

β − ε − 2^{−(k−k′)}.

Proof. For every 0 ≤ j ≤ m, let ε_j = 4j · 2^{−(r − n/2)}. We will use Lemma 4.3 to turn, inductively, the layers of B, one by one, into layers of an ε-accurate affine branching program, P. In Step j of the induction, we will turn layer j of B into layer j of P, and define the label w(v) ∈ A(n) for every vertex v in that layer of P. Formally, we will construct, inductively, a sequence of programs B, P_0, . . . , P_m = P, where each program is of length m, and for every j, the program P_j differs from the previous program only in layer j (and in the edges going into layer j and out of layer j). After Step j of the induction, we will have a branching program P_j, such that layers 0 to j of P_j form an affine branching program for parity learning. In addition, the following inductive hypothesis will hold:

Inductive Hypothesis:

Let L_j be the set of vertices in layer j of P_j. Let V_j be the vertex in L_j reached by the computation-path of P_j. Note that V_j is a random variable that depends on x, a_1, . . . , a_j (and recall that x, a_1, . . . , a_m are uniformly distributed over {0,1}^n, and for every t, b_t = a_t · x). The inductive hypothesis is that there exists a random variable U_j over L_j, such that, if y_j is a random variable uniformly distributed over the subspace w(U_j), then

|P_{V_j,x} − P_{U_j,y_j}|_1 ≤ ε_j/2.   (1)

The inductive hypothesis is equivalent to the accuracy requirement (see Definition 5.3) for layer j of P_j, up to a small multiplicative constant in the accuracy, but we need to assume it in this slightly different form, in order to avoid deteriorating the accuracy by a multiplicative factor in each step of the induction.

Base Case:

In the base case of the induction, j = 0, we define P_0 by just labeling the start vertex of B by {0,1}^n ∈ A(n). Thus, the start vertex property in the definition of an affine branching program is satisfied. The soundness property is trivially satisfied because the restriction of P_0 to layer 0 contains no edges. Since we always start from the start vertex, the distribution of the random variable x, conditioned on the event that we reached the start vertex, is just U_n, and hence the inductive hypothesis (Equation (1)) holds with U_0 = V_0.

Inductive Step:

Assume that we already turned layers 0 to j − 1 of B into layers 0 to j − 1 of P. That is, we already defined the program P_{j−1}, and layers 0 to j − 1 of P_{j−1} satisfy the start vertex property, the soundness property, and the inductive hypothesis (Equation (1)). We will now show how to define P_j from P_{j−1}, that is, how to turn layer j of B into layer j of P.

Let U_{j−1} ∈ L_{j−1} be the random variable that satisfies the inductive hypothesis (Equation (1)) for layer j − 1 of P_{j−1}. Let y_{j−1} be a random variable uniformly distributed over the subspace w(U_{j−1}). Let a ∈_R {0,1}^n. Let b = a · y_{j−1}. Let E = (U_{j−1}, V) be the edge labeled by (a, b) outgoing U_{j−1} in P_{j−1}. Thus, V is a vertex in layer j of P_{j−1}. Let W = w(E), where w(E) is defined as in the soundness property in Definition 5.2. That is,

w(E) = w(U_{j−1}) ∩ {x′ ∈ {0,1}^n : a · x′ = b},

where (a, b) is the label of E, and w(U_{j−1}) is the label of U_{j−1} in P_{j−1}.

Let v be a vertex in layer j of P_{j−1} (and note that v is also a vertex in layer j of B). Let

W_v = W | (V = v).

Let σ_v : A(n) → A(n) be the partial function whose existence is guaranteed by Lemma 4.3, when applied on the random variable W_v. Extend σ_v : A(n) → A(n) so that it outputs the special value ∗ on every element where it was previously undefined.

In the program P_j, we will split the vertex v into |image(σ_v)| vertices (where image(σ_v) already contains the additional special value ∗). For every s ∈ image(σ_v), we will have a vertex (v, s). If s ≠ ∗, we label the vertex (v, s) by the affine subspace s, and we label the additional vertex (v, ∗) by {0,1}^n. For every s ∈ image(σ_v), the edges going out of (v, s) (in P_j) will be the same as the edges going out of v in P_{j−1}. That is, for every edge (v, v′) (from layer j to layer j + 1) in the program P_{j−1}, and every s ∈ image(σ_v), we will have an edge ((v, s), v′) with the same label (from layer j to layer j + 1) in the program P_j.

We will now define the edges going into the vertices (v, s) in the program P_j. For every edge e = (u, v), labeled by (a, b) (from layer j − 1 to layer j), in the program P_{j−1}, consider the affine subspace w = w(e) = w(u) ∩ {x′ ∈ {0,1}^n : a · x′ = b} (as in the soundness property in Definition 5.2), where w(u) is the label of u in P_{j−1}. Let s = σ_v(w).

In P_j, we will have the edge (u, (v, s)) (labeled by (a, b)), from layer j − 1 to layer j, that is, we connect u to (v, s). Note that the edge (u, (v, s)) satisfies the soundness property in the definition of an affine branching program: If s ≠ ∗, the vertex (v, s) is labeled by s = σ_v(w), and by Property 2 of Lemma 4.3, w ⊆ σ_v(w). If s = ∗, the vertex (v, s) is labeled by {0,1}^n and hence the soundness property is trivially satisfied.

Proof of the Inductive Hypothesis:

Next, we will prove the inductive hypothesis (Equation (1)) for P_j. We will define the random variable U_j ∈ L_j as follows:

As before, let U_{j−1} ∈ L_{j−1} be the random variable that satisfies the inductive hypothesis (Equation (1)) for layer j − 1 of P_{j−1}. Let y_{j−1} be a random variable uniformly distributed over the subspace w(U_{j−1}). Let a ∈_R {0,1}^n. Let b = a · y_{j−1}. Let E = (U_{j−1}, V) be the edge labeled by (a, b) outgoing U_{j−1} in P_{j−1}. Thus, V is a vertex in layer j of P_{j−1}. As before, let W = w(E) = w(U_{j−1}) ∩ {x′ ∈ {0,1}^n : a · x′ = b}. As before, for a vertex v in layer j of P_{j−1}, let σ_v : A(n) → A(n) be the partial function whose existence is guaranteed by Lemma 4.3, when applied on the random variable W_v = W | (V = v), and extend σ_v : A(n) → A(n) so that it outputs the special value ∗ on every element where it was previously undefined.

We define U_j = (V, σ_V(W)) ∈ L_j. Let y_j be a random variable uniformly distributed over the subspace w(U_j), and let V_j be the vertex in L_j reached by the computation-path of P_j. We need to prove that

|P_{V_j,x} − P_{U_j,y_j}|_1 ≤ 2j · 2^{−(r − n/2)}.   (2)

Let y′_j be a random variable uniformly distributed over the subspace W. Equation (2) follows by the following two equations and by the triangle inequality:

|P_{U_j,y′_j} − P_{U_j,y_j}|_1 ≤ 2 · 2^{−(r − n/2)}.   (3)

|P_{V_j,x} − P_{U_j,y′_j}|_1 ≤ 2(j − 1) · 2^{−(r − n/2)}.   (4)

Thus, it is sufficient to prove Equation (3) and Equation (4). We will start with Equation (3). By Property 3 of Lemma 4.3, for every v in layer j of P_{j−1}, and every s ∈ image(σ_v) \ {∗},

|E_{W|(V=v),(σ_v(W)=s)}[U_W] − U_s|_1 < 2^{−(r − n/2)}.

By the definitions of y′_j and U_j,

E_{W|(V=v),(σ_v(W)=s)}[U_W] = E_{W|(U_j=(v,s))}[U_W] = P_{y′_j|(U_j=(v,s))}.

By the definition of y_j,

U_s = P_{y_j|(U_j=(v,s))}.

Hence,

|P_{y′_j|(U_j=(v,s))} − P_{y_j|(U_j=(v,s))}|_1 < 2^{−(r − n/2)}.

Taking expectation over U_j, and taking into account that, by Property 1 of Lemma 4.3, for every v, Pr(σ_v(W) = ∗) ≤ 2^{−2n}, we obtain

|P_{U_j,y′_j} − P_{U_j,y_j}|_1 = E_{U_j} |P_{y′_j|U_j} − P_{y_j|U_j}|_1 < 2^{−(r − n/2)} + 2^{−2n},

which proves Equation (3).

We will now prove Equation (4). Let T be the following probabilistic transformation from L_{j−1} × {0,1}^n to L_j × {0,1}^n. Given (u, z) ∈ L_{j−1} × {0,1}^n, the transformation T chooses a ∈_R {0,1}^n and b = a · z, and outputs (V, z), where V ∈ L_j is the vertex obtained by following the edge labeled by (a, b) outgoing u in P_j.

By the definition of the computation-path, T(V_{j−1}, x) has the same distribution as (V_j, x). By the definition of U_j, y_j, y′_j, we have that T(U_{j−1}, y_{j−1}) has the same distribution as (U_j, y′_j). Hence, by the triangle inequality and the inductive hypothesis,

|P_{V_j,x} − P_{U_j,y′_j}|_1 = |P_{T(V_{j−1},x)} − P_{T(U_{j−1},y_{j−1})}|_1 ≤ |P_{V_{j−1},x} − P_{U_{j−1},y_{j−1}}|_1 ≤ 2(j − 1) · 2^{−(r − n/2)},

which gives Equation (4).

Since, by induction, layers 0 to j − 1 of P_{j−1} form an affine branching program for parity learning, and since we already saw that all the edges between layer j − 1 and layer j of P_j satisfy the soundness property in the definition of an affine branching program, we have that layers 0 to j of P_j form an affine branching program for parity learning.

P is ε-Accurate:

We will now prove that the final branching program P = P_m that we obtained satisfies the requirements of the lemma. We already know that P is an affine branching program for parity learning.

We will start by proving that P is ε-accurate. Let 0 ≤ t ≤ m. Let V_t be the vertex in layer t of P reached by the computation-path of P. Let z_t be a random variable uniformly distributed over the subspace w(V_t). We need to prove that

|P_{V_t,x} − P_{V_t,z_t}|_1 ≤ ε.   (5)

Recall that by the inductive hypothesis (Equation (1)), there exists a random variable U_t over layer t of P, such that, if y_t is a random variable uniformly distributed over the subspace w(U_t), then

|P_{V_t,x} − P_{U_t,y_t}|_1 ≤ ε/2,   (6)

and this also implies

|P_{V_t} − P_{U_t}|_1 ≤ ε/2.

By the last inequality, and since for every v in layer t of P it holds that P_{z_t|(V_t=v)} = P_{y_t|(U_t=v)} (since they are both uniformly distributed over w(v)), we have

|P_{V_t,z_t} − P_{U_t,y_t}|_1 = |P_{V_t} − P_{U_t}|_1 ≤ ε/2.   (7)

Equation (5) follows by Equation (6), Equation (7) and the triangle inequality.

P Satisfies the Additional Properties:

We will now prove that P satisfies the two additional properties claimed in the statement of the lemma. The first property holds since Property 4 of Lemma 4.3 ensures that for every vertex in layers 1 to m of the branching program B, we obtain at most 4n · 2^{∑_{i=0}^{n−k−1} (r − i/2)} vertices in the branching program P that are labeled with affine subspaces of dimension k.

It remains to prove the second property. Let V_m = (V, S) be the vertex in layer m of P reached by the computation-path of P. Note that V_m is a random variable that depends on x, a_1, . . . , a_m (and recall that x, a_1, . . . , a_m are uniformly distributed over {0,1}^n, and for every t, b_t = a_t · x).

Note that V is the vertex in layer m of B reached by the computation-path of B (on the same x, a_1, . . . , a_m). This is true since P simulates B. More precisely, by the construction, if on x, a_1, . . . , a_m, the program P reaches (V, S), then, on the same x, a_1, . . . , a_m, the program B reaches V.

Since the success probability of B is β,

Pr[x ∈ w(V)] = β,

where w(V) is the label of V in B. Let y_m be a random variable uniformly distributed over the subspace w(V_m), where w(V_m) is the label of V_m in P. Since P is ε-accurate,

|P_{V,x} − P_{V,y_m}|_1 ≤ |P_{V,S,x} − P_{V,S,y_m}|_1 = |P_{V_m,x} − P_{V_m,y_m}|_1 ≤ ε.

Thus,

Pr[y_m ∈ w(V)] ≥ Pr[x ∈ w(V)] − ε = β − ε.

Let k > k′. Recall that w(V) is of dimension ≤ k′. Thus, if w(V_m) is of dimension ≥ k, the (conditional) probability that y_m ∈ w(V) is at most 2^{k′−k}. Thus,

β − ε ≤ Pr[y_m ∈ w(V)] ≤ Pr[dim(w(V_m)) < k] + 2^{k′−k}.

That is,

Pr[dim(w(V_m)) < k] ≥ β − ε − 2^{−(k−k′)}.

7 Time-Space Lower Bounds for Parity Learning

In this section, we will use Lemma 6.1 to prove Theorem 2, our main result. Recall that Theorem 2 is stronger than Theorem 1, and hence Theorem 1 follows as well. We start with a lemma that will be used, in the proof of Theorem 2, to obtain time-space lower bounds for affine branching programs.

Lemma 7.1. Let k < n. Let P be a length m affine branching program for parity learning (of size n), such that for every vertex u of P, dim(w(u)) ≥ k. Let v be a vertex of P such that dim(w(v)) = k. Then, the probability that the computation-path of P reaches v is at most

m^{n−k} · 2^{∑_{j=0}^{n−k−1} (n−2k−j)}.

Proof. Let s be the vector space "orthogonal" to w(v) in {0,1}^n. That is,

s = {a ∈ {0,1}^n : ∃b ∈ {0,1} ∀x′ ∈ w(v) : a · x′ = b}.

Let V_0, . . . , V_m be the vertices on the computation-path of P. Note that V_0, . . . , V_m are random variables that depend on x, a_1, . . . , a_m. For every 0 ≤ i ≤ m, let S_i be the vector space "orthogonal" to w(V_i) in {0,1}^n. That is,

S_i = {a ∈ {0,1}^n : ∃b ∈ {0,1} ∀x′ ∈ w(V_i) : a · x′ = b}.

By the soundness property in Definition 5.2, for every 1 ≤ i ≤ m,

S_i ⊆ span(S_{i−1} ∪ {a_i}).   (8)

For every 0 ≤ i ≤ m, let Z_i = dim(S_i ∩ s). Note that Z_0 = 0, and by Equation (8), for every 1 ≤ i ≤ m, Z_i ≤ Z_{i−1} + 1. If the computation-path of P reaches v then for some 1 ≤ i ≤ m, Z_i = n − k. Thus, if the computation-path of P reaches v, there exist n − k indices i_1 < . . . < i_{n−k} ∈ [m], such that the following event, denoted by E_{i_1,...,i_{n−k}}, occurs:

E_{i_1,...,i_{n−k}} = ⋀_{j∈[n−k]} (Z_{i_j−1} = j − 1) ∧ (Z_{i_j} = j).

(In particular, E_{i_1,...,i_{n−k}} occurs if for every j, we take i_j to be the first i such that Z_i = j.) We will bound the probability that the computation-path of P reaches v by bounding Pr[E_{i_1,...,i_{n−k}}], and taking the union bound over (less than) m^{n−k} possibilities for i_1, . . . , i_{n−k} ∈ [m].

Fix i_1 < . . . < i_{n−k} ∈ [m]. For r ∈ {0, . . . , n − k}, let

E_{i_1,...,i_r} = ⋀_{j∈[r]} (Z_{i_j−1} = j − 1) ∧ (Z_{i_j} = j).

Thus,

Pr[E_{i_1,...,i_{n−k}}] = ∏_{j∈[n−k]} Pr[E_{i_1,...,i_j} | E_{i_1,...,i_{j−1}}].

We will show how to bound Pr[E_{i_1,...,i_j} | E_{i_1,...,i_{j−1}}]:

Pr[E_{i_1,...,i_j} | E_{i_1,...,i_{j−1}}] = Pr[(Z_{i_j−1} = j − 1) ∧ (Z_{i_j} = j) | E_{i_1,...,i_{j−1}}]
= Pr[(Z_{i_j−1} = j − 1) ∧ (Z_{i_j−1} < Z_{i_j}) | E_{i_1,...,i_{j−1}}]
≤ Pr[(Z_{i_j−1} < Z_{i_j}) | E_{i_1,...,i_{j−1}} ∧ (Z_{i_j−1} = j − 1)].   (9)

Note that the event E_{i_1,...,i_{j−1}} ∧ (Z_{i_j−1} = j − 1) that we condition on, on the right hand side, depends only on x, a_1, . . . , a_{i_j−1}. We will bound the probability of the event (Z_{i_j−1} < Z_{i_j}), conditioned on any event that fixes Z_{i_j−1} and depends only on x, a_1, . . . , a_{i_j−1}.

More generally, fix 1 ≤ i ≤ m, and let E′_i be the event (Z_{i−1} < Z_i). Let E′ be any event that fixes Z_{i−1} and depends only on x, a_1, . . . , a_{i−1}. Without loss of generality, we can assume that the event E′ just fixes the values of x, a_1, . . . , a_{i−1}. We will show how to bound Pr[E′_i | E′].

Thus, we fix x, a_1, . . . , a_{i−1} and we will bound Pr[E′_i] (conditioned on x, a_1, . . . , a_{i−1}). By Equation (8), if E′_i occurs then dim(S_{i−1} ∩ s) < dim(S_i ∩ s) ≤ dim(span(S_{i−1} ∪ {a_i}) ∩ s), and hence S_{i−1} ∩ s ⊊ span(S_{i−1} ∪ {a_i}) ∩ s, which implies that there exists a ∈ S_{i−1} such that a ⊕ a_i ∈ s. For every fixed a ∈ S_{i−1}, the event a ⊕ a_i ∈ s occurs with probability 2^{dim(s)−n} = 2^{(n−k)−n} = 2^{−k} (since a_i is uniformly distributed and independent of x, a_1, . . . , a_{i−1}). We will bound the probability of E′_i by taking a union bound over all possibilities for a, but doing so we take into account that a ∈ S_{i−1} satisfies a ⊕ a_i ∈ s if and only if every a′ ∈ a ⊕ (S_{i−1} ∩ s) satisfies a′ ⊕ a_i ∈ s. Thus, we can take a union bound over 2^{dim(S_{i−1})−Z_{i−1}} ≤ 2^{n−k−Z_{i−1}} possibilities (where we assume that Z_{i−1} is fixed). Hence, by the union bound,

Pr[E′_i | E′] ≤ 2^{n−k−Z_{i−1}} · 2^{−k} = 2^{n−2k−Z_{i−1}}.

Thus, in particular, by Equation (9),

Pr[E_{i_1,...,i_j} | E_{i_1,...,i_{j−1}}] ≤ 2^{n−2k−(j−1)}.

Hence,

Pr[E_{i_1,...,i_{n−k}}] ≤ ∏_{j∈[n−k]} 2^{n−2k−(j−1)} = 2^{∑_{j=0}^{n−k−1} (n−2k−j)}.

By the union bound, the probability that the computation-path of P reaches v is at most

m^{n−k} · 2^{∑_{j=0}^{n−k−1} (n−2k−j)}.

Theorem 2. For any c < 1/20, there exists α > 0, such that the following holds: Let B be a branching program of length at most 2^{αn} and width at most 2^{cn²} for parity learning (of size n), such that the output of B is always an affine subspace of dimension ≤ (3/5)n. Assume for simplicity and without loss of generality that all leaves of B are in the last layer. Then, the success probability of B (that is, the probability that x is contained in the subspace that B outputs) is at most O(2^{−αn}).

Proof. Let 0 < α < 1/5 be a sufficiently small constant (to be determined later on). Let B be a branching program of length m = 2^{αn} and width d = 2^{cn²} for parity learning (of size n), such that the output of B is always an affine subspace of dimension ≤ (3/5)n. Assume for simplicity and without loss of generality that all leaves of B are in the last layer. Denote by β the success probability of B.

Let r = (1/2 + 2α) · n. Let k = (4/5)n. By Lemma 6.1, there exists a length m affine branching program P for parity learning (of size n), such that:

1. The number of vertices in P that are labeled with an affine subspace of dimension k is at most

4n · 2^{∑_{i=0}^{n−k−1} (r − i/2)} · dm.

2. The output of P is an affine subspace of dimension ≤ k, with probability at least

β − 4 · 2^{−αn} − 2^{−(1/5)n} ≥ β − 5 · 2^{−αn}.

Assume without loss of generality that every vertex u of P such that dim(w(u)) = k is a leaf. (Otherwise, we can just redefine u to be a leaf by removing all the edges going out of it.) Assume without loss of generality that for every vertex u of P, dim(w(u)) ≥ k. (Otherwise, we can just remove u, as it is unreachable from the start vertex, since we defined all vertices labeled by subspaces of dimension k to be leaves, and since by the soundness property in Definition 5.2, the dimension along the computation-path can decrease by at most 1 in each step.)

By Lemma 7.1, and by substituting the values of m, d, k, r, the probability that the computation-path of P reaches some vertex that is labeled with an affine subspace of dimension k is at most

(4n · 2^{∑_{i=0}^{n−k−1} (r − i/2)} · dm) · (m^{n−k} · 2^{∑_{i=0}^{n−k−1} (n−2k−i)})

= 4nm · 2^{cn²} · (2^{∑_{i=0}^{n−k−1} ((1/2)n + 2αn − i/2)}) · (2^{αn(n−k)} · 2^{∑_{i=0}^{n−k−1} (−(3/5)n − i)})

= 4nm · 2^{cn²} · 2^{(n−k)(3αn − (1/10)n)} · (2^{∑_{i=0}^{n−k−1} (−(3/2)i)})

= 4nm · 2^{cn²} · 2^{(n−k)(3αn − (1/10)n)} · 2^{−(3/4)(n−k)(n−k−1)}

= 4nm · 2^{cn²} · 2^{(1/5)n·(3αn − (1/10)n − (3/20)n + 3/4)}

= 4nm · 2^{n²(c + (3/5)α − 1/20 + 3/(20n))}.

Thus, if α < (5/3)(1/20 − c), this probability is at most 2^{−Ω(n²)}, and hence

β − 5 · 2^{−αn} ≤ 2^{−Ω(n²)}.

That is,

β ≤ O(2^{−αn}).

References

[A99a] Miklos Ajtai: Determinism versus Non-Determinism for Linear Time RAMs. STOC 1999: 632-641.

[A99b] Miklos Ajtai: A Non-linear Time Lower Bound for Boolean Branching Programs. FOCS 1999: 60-70.

[ADR02] Yonatan Aumann, Yan Zong Ding, Michael O. Rabin: Everlasting security in the bounded storage model. IEEE Transactions on Information Theory 48(6): 1668-1680 (2002).

[AR99] Yonatan Aumann, Michael O. Rabin: Information Theoretically Secure Communication in the Limited Storage Space Model. CRYPTO 1999: 65-79.

[B86] David A. Mix Barrington: Bounded-Width Polynomial-Size Branching Programs Recognize Exactly Those Languages in NC¹. J. Comput. Syst. Sci. 38(1): 150-164 (1989) (also in STOC 1986).

[BJS98] Paul Beame, T. S. Jayram, Michael E. Saks: Time-Space Tradeoffs for Branching Programs. J. Comput. Syst. Sci. (JCSS) 63(4): 542-572 (2001) (also in FOCS 1998).

[BSSV00] Paul Beame, Michael E. Saks, Xiaodong Sun, Erik Vee: Time-space trade-off lower bounds for randomized computation of decision problems. J. ACM (JACM) 50(2): 154-195 (2003) (also in FOCS 2000).

[CM97] Christian Cachin, Ueli M. Maurer: Unconditional Security Against Memory-Bounded Adversaries. CRYPTO 1997: 292-306.

[DM04] Stefan Dziembowski, Ueli M. Maurer: On Generating the Initial Key in the Bounded-Storage Model. EUROCRYPT 2004: 126-137.

[F97] Lance Fortnow: Time-Space Tradeoffs for Satisfiability. J. Comput. Syst. Sci. 60(2): 337-353 (2000) (also in CCC 1997).

[FLvMV05] Lance Fortnow, Richard J. Lipton, Dieter van Melkebeek, Anastasios Viglas: Time-space lower bounds for satisfiability. J. ACM 52(6): 835-865 (2005).

[M92] Ueli M. Maurer: Conditionally-Perfect Secrecy and a Provably-Secure Randomized Cipher. J. Cryptology 5(1): 53-66 (1992).

[vM07] Dieter van Melkebeek: A Survey of Lower Bounds for Satisfiability and Related Problems. Foundations and Trends in Theoretical Computer Science, 2: 197-303, 2007.

[S14] Ohad Shamir: Fundamental Limits of Online and Distributed Algorithms for Statistical Learning and Estimation. NIPS 2014: 163-171.

[SVW15] Jacob Steinhardt, Gregory Valiant, Stefan Wager: Memory, Communication, and Statistical Queries. Electronic Colloquium on Computational Complexity (ECCC) 22: 126 (2015).

[V03] Salil P. Vadhan: Constructing Locally Computable Extractors and Cryptosystems in the Bounded-Storage Model. J. Cryptology 17(1): 43-77 (2004) (also in CRYPTO 2003).

[W06] Ryan Williams: Inductive Time-Space Lower Bounds for Sat and Related Problems. Computational Complexity 15(4): 433-470 (2006).

[W07] Ryan Williams: Time-Space Tradeoffs for Counting NP Solutions Modulo Integers. IEEE Conference on Computational Complexity 2007: 70-82.