Rounding Sum-of-Squares Relaxations
Boaz Barak∗ Jonathan Kelner† David Steurer‡
June 10, 2014
Abstract
We present a general approach to rounding semidefinite programming relaxations obtained by the Sum-of-Squares method (Lasserre hierarchy). Our approach is based on using the connection between these relaxations and the Sum-of-Squares proof system to transform a combining algorithm—an algorithm that maps a distribution over solutions into a (possibly weaker) solution—into a rounding algorithm that maps a solution of the relaxation to a solution of the original problem.
Using this approach, we obtain algorithms that yield improved results for natural variants of three well-known problems:
1. We give a quasipolynomial-time algorithm that approximates max_{‖x‖_2=1} P(x) within an additive factor of ε‖P‖_spectral, where ε > 0 is a constant, P is a degree d = O(1), n-variate polynomial with nonnegative coefficients, and ‖P‖_spectral is the spectral norm of a matrix corresponding to P's coefficients. Beyond being of interest in its own right, obtaining such an approximation for general polynomials (with possibly negative coefficients) is a long-standing open question in quantum information theory, and our techniques have already led to improved results in this area (Brandão and Harrow, STOC '13).
2. We give a polynomial-time algorithm that, given a subspace V ⊆ ℝ^n of dimension d that (almost) contains the characteristic function of a set of size n/k, finds a vector v ∈ V that satisfies 𝔼_i v_i^4 ≥ Ω(d^{−1/3}k(𝔼_i v_i^2)^2). This is a natural analytical relaxation of the problem of finding the sparsest element in a subspace, and is also motivated by a connection to the Small Set Expansion problem shown by Barak et al. (STOC 2012). In particular our results yield an improvement of the previous best known algorithms for small set expansion in a certain range of parameters.
3. We use this notion of L4 vs. L2 sparsity to obtain a polynomial-time algorithm with substantially improved guarantees for recovering a planted sparse vector v in a random d-dimensional subspace of ℝ^n. If v has µn nonzero coordinates, we can recover it with high probability whenever µ ≤ O(min(1, n/d^2)). In particular, when d ≤ √n, this recovers a planted vector with up to Ω(n) nonzero coordinates. When d ≤ n^{2/3}, our algorithm improves upon existing methods based on comparing the L1 and L∞ norms, which intrinsically require µ ≤ O(1/√d).
∗ Microsoft Research.
† Department of Mathematics, Massachusetts Institute of Technology.
‡ Department of Computer Science, Cornell University.
Contents
1 Introduction
  1.1 The Sum of Squares hierarchy
  1.2 Optimizing polynomials with nonnegative coefficients
  1.3 Optimizing hypercontractive norms and finding analytically sparse vectors
  1.4 Related work
  1.5 Organization of this paper
  1.6 Notation
2 Overview of our techniques
  2.1 Finding a planted sparse vector in a random low-dimensional subspace
  2.2 Finding “analytically sparse” vectors in general subspaces
  2.3 Optimizing polynomials with nonnegative coefficients
3 Approximation for nonnegative tensor maximization
  3.1 Direct Rounding
  3.2 Making Progress
4 Finding an “analytically sparse” vector in a subspace
  4.1 Random function rounding
  4.2 Coordinate projection rounding
  4.3 Gaussian Rounding
  4.4 Conditioning
  4.5 Truncating functions
  4.6 Putting things together
5 Finding planted sparse vectors
  5.1 Recovering f_0 approximately (Proof of Theorem 5.2)
  5.2 Recovering f_0 exactly (Proof of Theorem 5.3)
6 Results for Small Set Expansion
  6.1 Small-set expansion of Cayley graphs
  6.2 Approximating small-set expansion using ASVP
7 Discussion and open questions
References
A Pseudoexpectation toolkit
  A.1 Spectral norm and SOS proofs
B Low-Rank Tensor Optimization
C LOCC Polynomial Optimization
D The 2-to-q norm and small-set expansion
  D.1 Norm bound implies small-set expansion
1 Introduction
Convex programming is the algorithmic workhorse behind many applications in computer science and other fields. But its power is far from understood, especially in the case of hierarchies of linear programming (LP) and semidefinite programming (SDP) relaxations. These are systematic approaches to make a convex relaxation tighter by adding to it more constraints. Various such hierarchies have been proposed independently by researchers from several communities [Sho87, SA90, LS91, Nes00, Par00, Las01]. In general, these hierarchies are parameterized by a number ℓ called their level. For problems on n variables, the hierarchy of the ℓth level can be optimized in n^{O(ℓ)} time, where for the typical domains used in CS (such as {0,1}^n or the n-dimensional unit sphere), n levels correspond to the exact (or near exact) solution by brute force exponential-time enumeration.
There are several strong lower bounds (also known as integrality gaps) for these hierarchies, in particular showing that ω(1) levels (and often even n^{Ω(1)} or Ω(n) levels) of many such hierarchies can't improve by much on the known polynomial-time approximation guarantees for many NP hard problems, including SAT, Independent-Set, Max-Cut and more [Gri01b, Gri01a, ABLT06, dlVKM07, Sch08, Tul09, CMM09, BGMT12, BCV+12]. Unfortunately, there are many fewer positive results, and several of them only show that these hierarchies can match the performance of previously known (and often more efficient) algorithms, rather than using hierarchies to get genuinely new algorithmic results.1 For example, Karlin, Mathieu and Nguyen [KMN11] showed that ℓ levels of the Sum of Squares hierarchy can approximate the Knapsack problem up to a factor of 1 + 1/ℓ, thus approaching the performance of the standard dynamic program. Guruswami and Sinop [GS11] and (independently) Barak, Raghavendra, and Steurer [BRS11] showed that some SDP hierarchies can match the performance of the [ABS10] algorithm for Small Set Expansion and Unique Games, and their techniques also gave improved results for some other problems (see also [RT12, AG11, AGS13]). Chlamtac and Singh [CS08] (building on [Chl07]) used hierarchies to obtain some new approximation guarantees for the independent set problem in 3-uniform hypergraphs. Bhaskara, Charikar, Chlamtac, Feige, and Vijayaraghavan [BCC+10] gave an LP-hierarchy based approximation algorithm for the k-densest subgraph problem, although they also showed a purely combinatorial algorithm with the same performance. The famous algorithm of Arora, Rao and Vazirani [ARV04] for Sparsest Cut can be viewed (in retrospect) as using a constant number of rounds of an SDP hierarchy to improve upon the performance of the basic LP for this problem. Perhaps the most impressive use of super-constant levels of a hierarchy to solve a new problem was the work of Brandão, Christandl and Yard [BCY11] who used an SDP hierarchy (first proposed by [DPS04]) to give a quasipolynomial time algorithm for a variant of the quantum separability problem of testing whether a given density matrix corresponds to a separable (i.e., non-entangled) quantum state or is ε-far from all such states (see Section 1.2).
One of the reasons for this paucity of positive results is that we have relatively few tools to round such convex hierarchies. A rounding algorithm maps a solution to the relaxation to a solution to the original program.2 In the case of a hierarchy, the relaxation solution satisfies more constraints, but we do not always know how to take advantage of this when rounding. For example, [ARV04] used a very sophisticated analysis to get better rounding when the solution to a Sparsest Cut relaxation satisfies a constraint known as the triangle inequalities, but we have no general tools to use the additional constraints that come from higher levels of the hierarchies, nor do we know if these can help in rounding or not. This lack of rounding techniques is particularly true for the Sum of Squares (SOS, also known as Lasserre) hierarchy [Par00, Las01].3 This is the
1 The book chapter [CT10] is a good source for several of the known upper and lower bounds, though it does not contain some of the more recent ones.
2 While the name derives from the prototypical case of relaxing an integer program to a linear program by allowing the variables to take non-integer values, we use “rounding algorithm” for any mapping from relaxation solutions to actual solutions, even in cases where the actual solutions are themselves non-integer.
3 While it is common in the TCS community to use Lasserre to describe the primal version of this SDP, and Sum of Squares (SOS) to describe the dual, in this paper we use the more descriptive SOS name for both programs. We note that in all the applications we consider, strong duality holds, and so these programs are equivalent.
strongest variant of the canonical semidefinite programming hierarchies, and has recently shown promise to achieve tasks beyond the reach of weaker hierarchies [BBH+12]. But there are essentially no general rounding tools that take full advantage of its power.4
In this work we propose a general approach to rounding SOS hierarchies, and instantiate this approach in two cases, giving new algorithms making progress on natural variants of two longstanding problems. Our approach is based on the intimate connection between the SOS hierarchy and the “Positivstellensatz”/“Sum of Squares” proof system. This connection was used in previous work for either negative results [Gri01b, Gri01a, Sch08], or positive results for specific instances [BBH+12, OZ13, KOTZ14], translating proofs of a bound on the actual value of these instances into proofs of bounds on the relaxation value. In contrast, we use this connection to give explicit rounding algorithms for general instances of certain computational problems.
1.1 The Sum of Squares hierarchy
Our work uses the Sum of Squares (SOS) semidefinite programming hierarchy and in particular its relationship with the Sum of Squares (or Positivstellensatz) proof system. We now briefly review both the hierarchy and proof system. See the introduction of [OZ13] and the monograph [Lau09] for a more in depth discussion of these concepts and their history. Underlying both the SDP and proof system is the natural approach to prove that a real polynomial P is nonnegative via showing that it equals a sum of squares: P = ∑_{i=1}^k Q_i^2 for some polynomials Q_1, …, Q_k. The question of when a nonnegative polynomial has such a “certificate of non-negativity” was studied by Hilbert who realized this doesn't always hold and asked (as his 17th problem) whether a nonnegative polynomial is always a sum of squares of rational functions. This was proven to be the case by Artin, and also follows from the more general Positivstellensatz (or “Positive Locus Theorem”) [Kri64, Ste74]. The Positivstellensatz/SOS proof system of Grigoriev and Vorobjov [GV01] is based on the Positivstellensatz as a way to refute the assertion that a certain set of polynomial equations

P_1(x_1, …, x_n) = ⋯ = P_k(x_1, …, x_n) = 0 (1.1)

can be satisfied by showing that there exist some polynomials Q_1, …, Q_k and a sum of squares polynomial S such that

∑_i P_i Q_i = 1 + S . (1.2)

([GV01] considered inequalities as well, although in our context one can always restrict to equalities without loss of generality.) One natural measure for the complexity of such a proof is the degree of the polynomials P_1Q_1, …, P_kQ_k and S.
The sum of squares semidefinite program was proposed independently by several authors [Sho87, Par00, Nes00, Las01]. One way to describe it is as follows. If the set of equalities (1.1) is satisfiable then in particular there exists some random variable X over ℝ^n such that

𝔼 P_1(X_1, …, X_n)^2 = ⋯ = 𝔼 P_k(X_1, …, X_n)^2 = 0 . (1.3)

That is, X is some distribution over the non-empty set of solutions to (1.1). For every degree ℓ, we can consider the linear operator L = L_ℓ that maps a polynomial P of degree at most ℓ into the number 𝔼 P(X_1, …, X_n). Note that by choosing the monomial basis, this operator can be described by a vector of length n^ℓ, or equivalently, by an n^{ℓ/2} × n^{ℓ/2} matrix. This operator satisfies the following conditions:
4 The closest general tool we are aware of is the repeated conditioning methods of [BRS11, GS11], though these can be implemented in weaker hierarchies too and so do not seem to use the full power of the SOS hierarchy. However, this technique does play a role in this work as well.
Normalization If P is the constant polynomial 1 then LP = 1.
Linearity L(P + Q) = LP + LQ for every P, Q of degree ≤ ℓ.
Positivity LP^2 ≥ 0 for every P of degree ≤ ℓ/2.
Following [BBH+12], we call a linear operator satisfying the above conditions a level ℓ pseudoexpectation function, or ℓ-p.e.f., and use the suggestive notation Ẽ P(X) to denote LP. Correspondingly we will sometimes talk about a level ℓ pseudodistribution (or ℓ-p.d.) X, by which we mean that there is an associated level ℓ pseudoexpectation operator. (Note that if X is an actual random variable then it is in particular a level ℓ pseudodistribution for every ℓ.) Given the representation of L as an n^ℓ-dimensional vector it is possible to check efficiently that it satisfies the above conditions, and in particular the positivity condition corresponds to the fact that, when viewed as a matrix, L is positive semidefinite. Thus it is also possible to optimize over the set of operators satisfying these conditions in time n^{O(ℓ)}, and this optimization procedure is known as the SOS SDP hierarchy. Clearly, as ℓ grows, the conditions become stricter. In Appendix A we collect some useful properties of these pseudoexpectations. In particular one can show (see Corollary A.3) that if Ẽ P(X)^2 = 0 then Ẽ P(X)Q(X) = 0 for every polynomial Q (as long as Q, P have degrees at most ℓ/2). Thus, if there is a refutation to (1.1) of the form (1.2) where all polynomials involved have degree at most ℓ then there would not exist a level 2ℓ pseudoexpectation operator satisfying (1.3). This connection goes both ways, establishing an equivalence between the degree of Positivstellensatz proofs and the level of the corresponding SOS relaxation.
Until recently, this relation was mostly used for negative results, translating proof complexity lower bounds into integrality gap results for the SOS hierarchy [BBH+12, OZ13, KOTZ14]. However, in 2012 Barak, Brandão, Harrow, Kelner, Steurer and Zhou [BBH+12] used this relation for positive results, showing that the SOS hierarchy can in fact solve some interesting instances of the Unique Games maximization problem that fool weaker hierarchies. Their idea was to use the analysis of the previous works that proved these integrality gaps for weaker hierarchies. Such proofs work by showing that (a) the weaker hierarchy outputs a large value on this particular instance but (b) the true value is actually small. [BBH+12]'s insight was that oftentimes the proof of (b) only uses arguments that can be captured by the SOS/Positivstellensatz proof system, and hence inadvertently shows that the SOS SDP value is actually small as well. Some follow-up works [OZ13, KOTZ14] extended this to other instances, but all these results held for very specific instances which have been proven before to have small objective value.
In this work we use this relation to get some guarantees on the performance of the SOS SDP on general instances. We give a more detailed overview of our approach in Section 2, but the high level idea is as follows. For particular optimization problems, we design a “rounding algorithm” that on input the moment matrix of a distribution on actual solutions achieving a certain value ν, outputs a solution with some value ν̃ which is a function of ν. We call such an algorithm a combining algorithm, since it “combines” a distribution over solutions into a single one. (Note that the solution output by the combining algorithm need not be in the support of the distribution, and generally, when ν̃ ≠ ν, it won't be.) We then “lift” the analysis of this combining algorithm into the SOS framework, by showing that all the arguments can be captured in this proof system. This in turn implies that the algorithm would still achieve the value ν̃ even if it is only given a pseudoexpectation of the distribution of sufficiently high level ℓ, and hence in fact this combining algorithm is a rounding algorithm for the level ℓ SOS hierarchy. We apply this idea to obtain new results for two applications—optimizing polynomials with nonnegative coefficients over the unit sphere, and finding “analytically sparse” vectors inside a subspace.
Remark 1.1 (Relation to the Unique Games Conjecture). While the SOS hierarchy is relevant to many algorithmic applications, some recent work focused on its relation to Khot's Unique Games Conjecture (UGC) [Kho02]. On a high level, the UGC implies that the basic semidefinite program is an optimal efficient
algorithm for many problems, and hence in particular using additional constant or polylogarithmic levels of the SOS hierarchy will not help. More concretely, as discussed in Section 1.3 below, the UGC is closely related to the question of how hard it is to find sparse (or “analytically sparse”) vectors in a given subspace. Our work shows how the SOS hierarchy can be useful in general, and in particular gives strong average-case results and nontrivial worst-case results for finding sparse vectors in subspaces. Therefore, it can be considered as giving some (far from conclusive) evidence that the UGC might be false.
1.2 Optimizing polynomials with nonnegative coefficients
Our first result yields an additive approximation to this optimization problem for polynomials with nonnegative coefficients, when the value is scaled by the spectral norm of an associated matrix. If P is an n-variate degree-t homogeneous polynomial with nonnegative coefficients, then it can be represented by a tensor M ∈ ℝ^{n^t} such that P(x) = M · x^{⊗t} for every x ∈ ℝ^n. It is convenient to state our result in terms of this tensor representation:
Theorem 1.2. There is an algorithm A, based on O(t^3 log n/ε^2) levels of the SOS hierarchy, such that for every even5 t and nonnegative M ∈ ℝ^{n^t},

max_{‖x‖=1} M · x^{⊗t} ≤ A(M) ≤ max_{‖x‖=1} M · x^{⊗t} + ε‖M‖_spectral ,

where · denotes the standard dot product, and ‖M‖_spectral denotes the spectral norm of M, when considered as an n^{t/2} × n^{t/2} matrix.
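For intuition, here is a small numpy sketch (dimensions and the random tensor are our own illustration) of the two quantities in Theorem 1.2 for t = 4: the objective value M · x^{⊗4} at a unit vector, and ‖M‖_spectral obtained by flattening M into an n^{t/2} × n^{t/2} matrix:

    import numpy as np

    n = 5
    rng = np.random.default_rng(0)
    T = rng.random((n, n, n, n))          # nonnegative coefficient tensor M
    A = T.reshape(n * n, n * n)           # M viewed as an n^{t/2} x n^{t/2} matrix
    spectral_norm = np.linalg.norm(A, 2)  # largest singular value

    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)
    value = np.einsum('ijkl,i,j,k,l->', T, x, x, x, x)  # M . x^{tensor 4}
    print(value, spectral_norm)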
Note that the algorithm of Theorem 1.2 only uses a logarithmic number of levels, and thus it shows that this fairly natural polynomial optimization problem can be solved in quasipolynomial time, as opposed to the exponential time needed for optimizing over general polynomials of degree > 2. Indeed, previous work on the convergence of the Lasserre hierarchy for general polynomials [DW12] can be described in our language here as trying to isolate a solution in the support of the distribution, and this generally requires a linear number of levels. Obtaining the logarithmic bound here relies crucially on constructing a “combined” solution that is not necessarily in the support. The algorithm is also relatively simple, and so serves as a good demonstration of our general approach.
Relation to quantum information theory. An equivalent way to state this result is that we get an ε additive approximation in the case that ‖M‖_spectral ≤ 1, in which case the value max_{‖x‖=1} M · x^{⊗t} is in the interval [0, 1]. This phrasing is particularly natural in the context of quantum information theory. A general (potentially mixed) quantum state on 2ℓ qubits is represented by an n^2 × n^2 density matrix ρ for n = 2^ℓ; ρ is a positive semidefinite matrix and has trace 1. If ρ is separable, which means that there is no entanglement between the first ℓ qubits and the second ℓ qubits, then ρ = 𝔼 xx^∗ ⊗ yy^∗ for some distribution over x, y ∈ ℂ^n, where v^∗ denotes the complex adjoint operation. If we further restrict the amplitudes of the states to be real, and enforce symmetry on the two halves, then this would mean that ρ = 𝔼 x^{⊗4}. (All our results should generalize to states without those restrictions to symmetry and real amplitudes, which we make just to simplify the statement of the problem and the algorithm.) A quantum measurement operator over this space is an n^2 × n^2 matrix M of spectral norm ≤ 1. The probability that the measurement accepts a state ρ is Tr(Mρ). Finding an algorithm that, given a measurement M, finds the separable state ρ that maximizes this probability is an important question in quantum information theory which amounts to finding a classical upper bound for the complexity class QMA(2) of Quantum Merlin Arthur proofs with two independent provers [HM13]. Note that if we consider symmetric real states then this is the same as finding argmax_{‖x‖=1} M · x^{⊗4}, and hence dropping the non-negativity constraint in our result would resolve this longstanding open problem. There is a
5 The algorithm easily generalizes to polynomials of odd degree t and to non-homogenous polynomials, see Remark 3.5.
closely related dual form of this question, known as the quantum separability problem, where one is given a quantum state ρ and wants to find the test M that maximizes

Tr(Mρ) − max_{ρ′ separable} Tr(Mρ′) (1.4)

or to simply distinguish between the case that this quantity is at least ε and the case that ρ is separable. The best result known in this area is the paper [BCY11] mentioned above, which solved the distinguishing variant of the quantum separability problem in the case that measurements are restricted to so-called Local Operations and one-way classical communication (one-way LOCC) operators. However, they did not have a rounding algorithm, and in particular did not solve the problem of actually finding a separable state that maximizes the probability of acceptance of a given one-way LOCC measurement. The techniques of this work were used by Brandão and Harrow [BH13] to solve the latter problem, and also greatly simplify the proof of [BCY11]'s result, which originally involved relations between several measures of entanglement proved in several papers.6 For completeness, in Appendix C we give a short proof of this result, specialized to the case of real vectors and polynomials of degree four (corresponding to quantum states of two systems, or two prover QMA proofs). We also show in Appendix B that in the case the measurement satisfies the stronger condition of having its ℓ_2 (i.e., Frobenius) norm be at most 1, there is a simpler and more efficient algorithm for estimating the maximum probability the measurement accepts a separable state, giving an ε additive approximation in poly(n) exp(poly(1/ε)) time. In contrast, [BCY11]'s algorithm took quasipolynomial time even in this case.
Relation to small set expansion. Nonnegative tensors also arise naturally in some applications, and in particular in the setting of small set expansion for Cayley graphs over the cube, which was our original motivation to study them. In particular, one corollary of our result is:
Corollary 1.3 (Informally stated). There is an algorithm A, based on poly(K(G)) log n levels of the SOS hierarchy, that solves the Small Set Expansion problem on Cayley graphs G over 𝔽_2^ℓ (where ℓ = log n), where K(G) is a parameter bounding the spectral norm of an operator related to G's top eigenspace.
We discuss the derivation and the meaning of this corollary in Section 6 but note that the condition of having small value K(G) seems reasonable. Having K(G) = O(1) implies that the graph is a small set expander, and in particular the known natural examples of Cayley graphs that are small set expanders, such as the noisy Boolean hypercube and the “short code” graph of [BGH+12], have K(G) = O(1). Thus a priori one might have thought that a graph that is hard to distinguish from small set expanders would have a small value of K(G).
1.3 Optimizing hypercontractive norms and finding analytically
sparse vectors
Finding a sparse nonzero vector inside a d dimensional linear
subspace V ⊆ n is a natural task arisingin many applications in
machine learning and optimization (e.g., see [DH13] and the
references therein).Related problems are known under many names
including the “sparse null space”, “dictionary learning”,“blind
source separation”, “min unsatisfy”, and “certifying restricted
isometry property” problems. (Theseproblems all have the same
general flavor but differ on various details such as worst-case vs.
average case,affine vs. linear subspaces, finding a single vector
vs. a basis, and more.) Problems of this type are oftenNP-hard,
with some hardness of approximation results known, and conjectured
average-case hardness (e.g.,see [ABSS97, KZ12, GN10] and the
references therein).
We consider a natural relaxation of this problem, which we call the analytically sparse vector problem (ASVP), which assumes the input subspace (almost) contains an actually sparse 0/1 vector, but allows the
6 The paper [BH13] was based on a previous version of this work [BKS12] that contained only the results for nonnegative tensors.
algorithm to find a vector v ∈ V that is only “analytically sparse” in the sense that ‖v‖_4/‖v‖_2 is large. More formally, for q > p and µ > 0, we say that a vector v is µ L_q/L_p-sparse if (𝔼_i v_i^q)^{1/q}/(𝔼_i v_i^p)^{1/p} ≥ µ^{1/q−1/p}. That is, a vector is µ L_q/L_p-sparse if it has the same q-norm vs p-norm ratio as a 0/1 vector of measure at most µ.
This is a natural relaxation, and similar conditions have been considered in the past. For example, Spielman, Wang, and Wright [SWW12] used in their work on dictionary learning a subroutine that finds a vector v in a subspace that maximizes the ratio ‖v‖_∞/‖v‖_1 (which can be done efficiently via n linear programs). However, because any subspace of dimension d contains an O(1/√d) L_∞/L_1-sparse vector, this relaxation can only detect the existence of vectors that are supported on less than O(n/√d) coordinates. Some works have observed that the L_2/L_1 ratio is a much better proxy for sparsity [ZP01, DH13], but computing it is a non-convex optimization problem for which no efficient algorithm is known. Similarly, the L_4/L_2 ratio is a good proxy for sparsity for subspaces of small dimension (say d = O(√n)) but it is non-convex, and it is not known how to efficiently optimize it.7
Nevertheless, because ‖v‖_4^4 is a degree 4 polynomial, the problem of maximizing it for v ∈ V of unit norm amounts to a polynomial maximization problem over the sphere, that has a natural SOS program. Indeed, [BBH+12] showed that this program does in fact yield a good approximation of this ratio for random subspaces. As we show in Section 5, we can use this to improve upon the results of [DH13] and find planted sparse vectors in random subspaces that are of not too large a dimension:
Theorem 1.4. There is a constant c > 0 and an algorithm A, based on O(1) rounds of the SOS program, such that for every vector v_0 ∈ ℝ^n supported on at most cn·min(1, n/d^2) coordinates, if v_1, …, v_d are chosen independently at random from the Gaussian distribution on ℝ^n, then given any basis for V = span{v_0, …, v_d} as input, A outputs an ε-approximation of v_0 in poly(n, log(1/ε)) time.
In particular, we note that this recovers a planted vector with up to Ω(n) nonzero coordinates when d ≤ √n, and it can recover vectors with more than the O(n/√d) nonzero coordinates that are necessary for existing techniques whenever d ≪ n^{2/3}. Perhaps more significantly, we prove the following nontrivial worst-case bound for this problem:
Theorem 1.5. There is a polynomial-time algorithm A, based on O(1) levels of the SOS hierarchy, that on input a d-dimensional subspace V ⊆ ℝ^n such that there is a 0/1-vector v ∈ V with at most µn nonzero coordinates, A(V) outputs an O(µd^{1/3}) L_4/L_2-sparse vector in V.

Moreover, this holds even if v is not completely inside V but only satisfies ‖Π_V v‖_2^2 ≥ (1 − ε)‖v‖_2^2, for some absolute constant ε > 0, where Π_V is the projector to V.
The condition that the vector is 0/1 can be significantly relaxed, see Remark 4.12. Theorem 4.1 is also motivated by the Small Set Expansion problem. The current best known algorithms for Small Set Expansion and Unique Games [ABS10] reduce these problems to the task of finding a sparse vector in a subspace, and then find this vector using brute force enumeration. This enumeration is the main bottleneck in improving the algorithms' performance.8 [BBH+12] showed that, at least for the Small Set Expansion
7 It seems that what makes our relaxation different from the original problem is not so much the qualitative issue of considering analytically sparse vectors as opposed to actually sparse vectors, but the particular choice of the L_4/L_2 ratio, which on one hand seems easier (even if not truly easy) to optimize over than the L_2/L_1 ratio, but provides better guarantees than the L_∞/L_1 ratio. However, this choice does force us to restrict our attention to subspaces of low dimension, while in some applications such as certifying the restricted isometry property, the subspace in question is often the kernel of a “short and fat” matrix, and hence is almost full dimensional. Nonetheless, we believe it should be possible to extend our results to handle subspaces of higher dimension, perhaps at some mild cost in the number of rounds.
8 This is the only step that takes super-polynomial time in [ABS10]'s algorithm for Small Set Expansion. Their algorithm for Unique Games has an additional divide and conquer step that takes subexponential time, but, in our opinion, seems less inherently necessary. Thus we conjecture that if the sparse-vector finding step could be sped up then it would be possible to speed up the algorithm for both problems.
question, finding an L_4/L_2 analytically sparse vector would be good enough. Using their work we obtain the following corollary of Theorem 1.5:
Corollary 1.6 (Informally stated). There is an algorithm that, given an n-vertex graph G that contains a set S of size o(n/d^{1/3}) with expansion at most ε, outputs a set S′ of measure δ = o(1) with expansion bounded away from 1, i.e., Φ(S′) ≤ 1 − Ω(1), where d is the dimension of the eigenspace of G's random walk matrix corresponding to eigenvalues larger than 1 − O(ε).
The derivation and meaning of this result is discussed in Section 6. We note that this is the first result that gives an approximation of this type to the small set expansion in terms of the dimension of the top eigenspace, as opposed to an approximation that is polynomial in the number of vertices.
1.4 Related work
Our paper follows the work of [BBH+12], which used the language of pseudoexpectation to argue that the SOS hierarchy can solve specific interesting instances of Unique Games, and, perhaps more importantly, showed how it is often possible to almost mechanically “lift” arguments about actual distributions to the more general setting of pseudodistributions. In this work we show how the same general approach can be used to obtain positive results for general instances.
The fact that LP/SDP solutions can be viewed as expectations of distributions is well known, and several rounding algorithms can be considered as trying to “reverse engineer” a relaxation solution to get a good distribution over actual solutions.
Techniques such as randomized rounding, the hyperplane rounding of [GW95], and the rounding for TSP [GSS11, AKS12] can all be viewed in this way. One way to summarize the conceptual difference between our techniques and those approaches is that these previous algorithms often considered the relaxation solution as giving moments of an actual distribution on “fake” solutions. For example, in [GW95]'s Max Cut algorithm, where actual solutions are modeled as vectors in {±1}^n, the SDP solution is treated as the moment matrix of a Gaussian distribution over real vectors that are not necessarily ±1-valued. Similarly in the TSP setting one often considers the LP solution to yield moments of a distribution over spanning trees that are not necessarily TSP tours. In contrast, in our setting we view the solution as providing moments of a “fake” distribution on actual solutions.

Treating solutions explicitly as “fake distributions” is prevalent in the literature on negative results (i.e., integrality gaps) for LP/SDP hierarchies. For hierarchies weaker than SOS, the notion of “fake” is different, and means that there is a collection of local distributions, one for every small subset of the variables, that are consistent with one another but do not necessarily correspond to any global distribution. Fake distributions are also used in some positive results for hierarchies, such as [BRS11, GS11], but we make this more explicit, and, crucially, make much heavier use of the tools afforded by the Sum of Squares relaxation.
The notion of a “combining algorithm” is related to the notion of polymorphisms [BJK05] in the study of constraint satisfaction problems. A polymorphism is a way to combine a number of satisfying assignments of a CSP into a different satisfying assignment, and some relations between polymorphisms, their generalization to approximation problems, and rounding SDPs are known (e.g., see the talk [Rag10]). The main difference is that polymorphisms operate on each bit of the assignment independently, while we consider here combining algorithms that can be very global.
In a follow-up (yet unpublished) work, we used the techniques of this paper to obtain improved results for the sparse dictionary learning problem, recovering a set of vectors x_1, …, x_m ∈ ℝ^n from random samples of µ-sparse linear combinations of them for any µ = o(1), improving upon previous results that required µ ≪ 1/√n [SWW12, AGM13, AAJ+13].
1.5 Organization of this paper
In Section 2 we give a high level overview of our general approach, as well as proof sketches for (special cases of) our main results. Section 3 contains the proof of Theorem 1.2—a quasipolynomial time algorithm to optimize polynomials with nonnegative coefficients over the sphere. Section 4 contains the proof of Theorem 1.5—a polynomial time algorithm for an O(d^{1/3})-approximation of the “analytical sparsest vector in a subspace” problem. In Section 5 we show how to use the notion of analytical sparsity to solve the question of finding a “planted” sparse vector in a random subspace. Section 6 contains the proofs of Corollaries 1.3 and 1.6 of our results to the small set expansion problem. Appendix A contains certain technical lemmas showing that pseudoexpectation operators obey certain inequalities that are true for actual expectations. Appendix C contains a short proof (written in classical notation, and specialized to the real symmetric setting) of [BCY11, BH13]'s result that the SOS hierarchy yields a good approximation to the acceptance probability of QMA(2) verifiers / measurement operators that have bounded one-way LOCC norm. Appendix B shows a simpler algorithm for the case that the verifier satisfies the stronger condition of a bounded L_2 (Frobenius) norm. For the sake of completeness, Appendix D reproduces the proof from [BBH+12] of the relation between hypercontractive norms and small set expansion. Our paper raises many more questions than it answers, and some discussion of those appears in Section 7.
1.6 Notation

Norms and inner products. We will use linear subspaces of the form V = ℝ^U where U is a finite set with an associated measure µ : U → [0, ∞]. The p-norm of a vector v ∈ V is defined as ‖v‖_p = (∑_{ω∈U} µ(ω)|v_ω|^p)^{1/p}. Similarly, the inner product of u, v ∈ V is defined as 〈u, v〉 = ∑_{ω∈U} µ(ω)u_ω v_ω. We will only use two measures in this work: the counting measure, where µ(ω) = 1 for every ω ∈ U, and the uniform measure, where µ(ω) = 1/|U| for all ω ∈ U. (The norms corresponding to this measure are often known as the expectation norms.) We will use vector notation (i.e., letters such as u, v, and indexing of the form u_i) for elements of subspaces with the counting measure, and function notation (i.e., letters such as f, g and indexing of the form f(x)) for elements of subspaces with the uniform measure. The dot product notation u · v will be used exclusively for the inner product with the counting measure.
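For concreteness, here is a small sketch (our own illustration) of the two conventions:

    import numpy as np

    v = np.array([3.0, 4.0])
    counting_2norm = np.sum(np.abs(v)**2)**0.5   # ||v||_2 = 5 under the counting measure
    expect_2norm  = np.mean(np.abs(v)**2)**0.5   # ||f||_2 under the uniform measure

    u = np.array([1.0, 2.0])
    dot = np.sum(u * v)            # u . v: inner product with the counting measure
    expect_inner = np.mean(u * v)  # <u, v> with the uniform measure
    print(counting_2norm, expect_2norm, dot, expect_inner)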
Pseudoexpectations. We use the notion of pseudoexpectations from [BBH+12]. A level ℓ pseudoexpectation function (ℓ-p.e.f.) Ẽ_X is an operator mapping a polynomial P of degree at most ℓ into a number denoted by Ẽ_{x∼X} P(x) and satisfying the linearity, normalization, and positivity conditions as stated in Section 1.1. We sometimes refer to X as a level ℓ pseudodistribution (ℓ-p.d.), by which we mean that there exists an associated pseudoexpectation operator.9 We will sometimes use the notation Ẽ P(X) when X is an actual random variable, in which case Ẽ P(X) simply equals 𝔼 P(X). (We do so when we present arguments for actual distributions that we will later want to generalize to pseudodistributions.)
If P, Q are polynomials of degree at most ℓ/2, and Ẽ_X is an ℓ-p.e.f., we say that Ẽ_X is consistent with the constraint P(x) ≡ 0 if it satisfies Ẽ_{x∼X} P(x)^2 = 0. We say that it is consistent with the constraint Q(x) ≥ 0 if it is consistent with the constraint Q(x) − S(x) ≡ 0 for some polynomial S of degree ≤ ℓ/2 which is a sum of squares. (In the context of optimization, to enforce the inequality constraint Q(x) ≥ 0, it is always possible to add an auxiliary variable y and then enforce the equality constraint Q(x) − y^2 ≡ 0.) Appendix A contains several useful facts about pseudoexpectations.
2 Overview of our techniques
Traditionally, to design a mathematical-programming based approximation algorithm for some optimization problem O, one first decides what the relaxation is—i.e., whether it is a linear program, semidefinite program, or some other convex program, and what constraints to put in. Then, to demonstrate that the value of the
9 In the paper [BBH+12] we used the name level ℓ fictitious random variable for X, but we think the name pseudodistribution is better as it is more analogous to the name pseudoexpectation. The name “pseudo random variable” would of course be much too confusing.
program is not too far from the actual value, one designs a rounding algorithm that maps a solution of the convex program into a solution of the original problem of approximately the same value. Our approach is conceptually different—we design the rounding algorithm first, analyze it, and only then come up with the relaxation.
Initially, this does not seem to make much sense—how can you design an algorithm to round solutions of a relaxation when you don't know what the relaxation is? We do this by considering an idealized version of a rounding algorithm which we call a combining algorithm. Below we discuss this in more detail but roughly speaking, a combining algorithm maps a distribution over actual solutions of O into a single solution (that may or may not be part of the support of this distribution). This is a potentially much easier task than rounding relaxation solutions, and every rounding algorithm yields a combining algorithm. In the other direction, every combining algorithm yields a rounding algorithm for some convex programming relaxation, but in general that relaxation could be of exponential size. Nevertheless, we show that in several interesting cases, it is possible to transform a combining algorithm into a rounding algorithm for a not too large relaxation that we can efficiently optimize over, thus obtaining a feasible approximation algorithm. The main tool we use for that is the Sum of Squares proof system, which allows us to lift certain arguments from the realm of combining algorithms to the realm of rounding algorithms.
We now explain the general approach more precisely, and then give an overview of how we use this approach for our two applications—finding “analytically sparse” vectors in subspaces, and optimizing polynomials with nonnegative coefficients over the sphere.
Consider a general optimization problem of maximizing some objective function over some set S, such as the n-dimensional Boolean hypercube or the unit sphere. A convex relaxation for this problem consists of an embedding that maps elements in S into elements in some convex domain, and a suitable way to generalize the objective function to a convex function on this domain. For example, in linear programming relaxations we typically embed {0, 1}^n into the set [0, 1]^n, while in semidefinite programming relaxations we might embed {0, 1}^n into the set of n × n positive semidefinite matrices using the map x ↦ X where X_{i,j} = x_i x_j. Given this embedding, we can use convex programming to find the element in the convex domain that maximizes the objective, and then use a rounding algorithm to map this element back into the domain S in a way that approximately preserves the objective value.
A combining algorithm C takes as input a distribution X over solutions in S and maps it into a single element C(X) of S, such that the objective value of C(X) is approximately close to the expected objective value of a random element in X. Every rounding algorithm R yields a combining algorithm C. The reason is that if there is some embedding f mapping elements in S into some convex domain T, then for every distribution X over S, we can define y_X to be 𝔼_{x∼X} f(x). By convexity, y_X will be in T and its objective value will be at most the average objective value of an element in X. Thus if we define C(X) to output R(y_X) then C will be a combining algorithm with approximation guarantees at least as good as R's.
In the other direction, because the set of distributions over S is convex and can be optimized over by an O(|S|)-sized linear program, every combining algorithm can be viewed as a rounding algorithm for this program. However, |S| is typically exponential in the bit description of the input, and hence this is not a very useful program. In general, we cannot improve upon this, because there is always a trivially lossless combining algorithm that “combines” a distribution X into a single solution x of the same expected value by simply sampling x from X at random. Thus for problems where getting an exact value is exponentially hard, this combining algorithm cannot be turned into a rounding algorithm for a subexponential-sized efficiently-optimizable convex program. However it turns out that at least in some cases, nontrivial combining algorithms can be turned into a rounding algorithm for an efficient convex program. A nontrivial combining algorithm C has the form C(X) = C′(M(X)) where C′ is an efficient (say polynomial or quasipolynomial time) algorithm and M(X) is a short (say polynomial or quasipolynomial size) digest of the distribution X. In all the cases we consider, M(X) will consist of all the moments up to some level ℓ of the random variable X,
or some simple functions of it. That is, typically M(X) is a vector in ℝ^{m^ℓ} such that for every i_1, …, i_ℓ ∈ [m], M_{i_1,…,i_ℓ} = 𝔼_{x∼X} x_{i_1} ⋯ x_{i_ℓ}. We do not have a general theorem showing that any nontrivial combining algorithm can be transformed into a rounding algorithm for an efficient relaxation. However, we do have a fairly general “recipe” to use the analysis of nontrivial combining algorithms to transform them into rounding algorithms. The key insight is that many of the tools used in such analyses, such as the Cauchy–Schwarz and Hölder inequalities, and other properties of distributions, fall under the “Sum of Squares” proof framework, and hence can be shown to hold even when the algorithm is applied not to actual moments but to the so-called “pseudoexpectations” that arise from the SOS semidefinite programming hierarchy.
We now turn to giving a high level overview of our results. For the sake of presentation, we focus on certain special cases of these two applications, and even for these cases omit many of the proof details and only provide rough sketches of the proofs. The full details can be found in Sections 5, 4 and 3.
2.1 Finding a planted sparse vector in a random low-dimensional subspace
We consider the following natural problem, which was also studied by Demanet and Hand [DH13]. Let f_0 ∈ ℝ^U be a sparse function over some universe U of size n. That is, f_0 is supported on at most µn coordinates for some µ = o(1). Let V be the subspace spanned by f_0 and d random (say Gaussian) functions f_1, …, f_d ∈ ℝ^U. Can we recover f_0 from any basis for V?
Demanet and Hand showed that if µ is very small, specifically µ ≪ 1/√d, then f_0 would be the most L_∞/L_1-sparse function in V, and hence (as mentioned above) can be recovered efficiently by running n linear programs. The SOS framework yields a natural and easy to describe algorithm for recovering f_0 as long as µ is a sufficiently small constant and the dimension d is at most O(√n). The algorithm uses the SOS program for finding the most L_4/L_2-sparse function in V, which, as mentioned above, is simply the polynomial optimization problem of maximizing ‖f‖_4^4 over f in the intersection of V and the unit Euclidean sphere.
Since f_0 itself is in particular µ L_4/L_2-sparse, the optimum for the program is at least 1/µ. Thus a combining algorithm would get as input a distribution D over functions f ∈ V satisfying ‖f‖_2 = 1 and ‖f‖_4^4 ≥ 1/µ, and need to output a vector closely correlated with f_0.10 (We use here the expectation norms, namely ‖f‖_p^p = 𝔼_ω |f(ω)|^p.) For simplicity, assume that the f_i's are orthogonal to f_0 (they are nearly orthogonal, and so everything we say below will still hold up to a sufficiently good approximation, see Section 5). In this case, we can write every f in the support of D as f = 〈f_0, f〉f_0 + f′ where f′ ∈ V′ = span{f_1, …, f_d}. It is not hard to show using standard concentration of measure results (see e.g., [BBH+12, Theorem 7.1]) that if d = O(√n) then every f′ ∈ V′ satisfies

‖f′‖_4 ≤ C‖f′‖_2 , (2.1)
for some constant C. Therefore, using the triangle inequality and the fact that ‖f′‖_2 ≤ ‖f‖_2 = 1, it must hold that

µ^{−1/4} ≤ ‖f‖_4 ≤ 〈f, f_0〉µ^{−1/4} + C , (2.2)

or

〈f, f_0〉 ≥ 1 − Cµ^{1/4} = 1 − o(1) (2.3)

for µ = o(1). In particular this implies that if we apply a singular value decomposition (SVD) to the second moment matrix of D (i.e., 𝔼_{f∼D} f^{⊗2}) then the top eigenvector will have 1 − o(1) correlation with f_0, and hence we can simply output it as our solution.
10 Such a closely correlated vector can be corrected to output f_0 exactly, see Section 5.
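The combining step just described is simple enough to state in a few lines of numpy. The sketch below is our own illustration: in the real algorithm the second moment matrix comes from the SOS relaxation, not from samples of an actual distribution.

    import numpy as np

    def combine(second_moment):            # second_moment = E_{f~D} f f^T
        eigvals, eigvecs = np.linalg.eigh(second_moment)
        return eigvecs[:, -1]              # eigenvector of the largest eigenvalue

    # toy illustration with an actual distribution concentrated near f0
    n = 500
    rng = np.random.default_rng(2)
    f0 = np.zeros(n); f0[:25] = np.sqrt(n / 25)   # sparse, unit expectation 2-norm
    samples = [f0 + 0.1 * rng.standard_normal(n) for _ in range(200)]
    M2 = sum(np.outer(f, f) for f in samples) / len(samples)
    g = combine(M2)
    print(abs(g @ f0) / (np.linalg.norm(g) * np.linalg.norm(f0)))  # close to 1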
To make this combining algorithm into a rounding algorithm we use the result of [BBH+12] that showed that (2.1) can actually be proven via a sum of squares argument. Namely, they showed that there is a degree 4 sum of squares polynomial S such that

‖Π′f‖_4^4 + S(f) = C^4‖f‖_2^4 , (2.4)

where Π′ is the projector to V′. (2.4) implies that even if D is merely a pseudodistribution then it must satisfy (2.1) (when the latter is raised to the fourth power to make it a polynomial inequality). We can then essentially follow the argument, proving a version of (2.2) raised to the 4th power by appealing to the fact that pseudodistributions satisfy Hölder's inequality (Corollary A.11), and hence deriving that D will satisfy (2.3), with possibly slightly worse constants, even when it is only a pseudodistribution.
In Section 5, we make this precise and extend the argument to obtain nontrivial (but weaker) guarantees when d > √n. We then show how to use an additional correction step to recover the original function f_0 up to arbitrary accuracy, thus boosting our approximation of f_0 into an essentially exact one.
2.2 Finding “analytically sparse” vectors in general subspaces
We now outline the ideas behind the proof of Theorem 4.1—finding analytically sparse vectors in general (as opposed to random) subspaces. This is a much more challenging setting than random subspaces, and indeed our algorithm and its analysis is more complicated (though still only uses a constant number of SOS levels), and at the moment, the approximation guarantee we can prove is quantitatively weaker. This is the most technically involved result in this paper, and so the reader may want to skip ahead to Section 2.3 where we give an overview of the simpler result of optimizing over polynomials with nonnegative coefficients.
We consider the special case of Theorem 4.1 where we try to distinguish between a YES case where there is a 0/1 valued o(d^{−1/3})-sparse function that is completely contained in the input subspace, and a NO case where every function in the subspace has its four norm bounded by a constant times its two norm. That is, we suppose that we are given some subspace V ⊆ ℝ^U of dimension d and a distribution D over functions f : U → {0, 1} in V such that ℙ_{ω∈U}[f(ω) = 1] = µ for every f in the support of D, and µ = o(d^{−1/3}). The goal of our combining algorithm is to output some function g ∈ V such that ‖g‖_4^4 = 𝔼_ω g(ω)^4 ≫ (𝔼_ω g(ω)^2)^2 = ‖g‖_2^4. (Once again, we use the expectation inner product and norms, with uniform measure over U.)
Since the f's correspond to sets of measure µ, we would expect the inner product 〈f, f′〉 of a typical pair f, f′ (which equals the measure of the intersection of the corresponding sets) to be roughly µ^2. Indeed, one can show that if the average inner product 〈f, f′〉 is ω(µ^2) then it's easy to find such a desired function g. Intuitively, this is because in this case the distribution D of sets does not have an equal chance to contain all the elements in U, but rather there is some set I of o(|U|) coordinates which is favored by D. Roughly speaking, that would mean that a random linear combination g of these functions would have most of its mass concentrated inside this small set I, and hence satisfy ‖g‖_4 ≫ ‖g‖_2. But it turns out that letting g be a random gaussian function matching the first two moments of D is equivalent to taking such a random linear combination, and so our combining algorithm can obtain this g using moment information alone.
Our combining algorithm will also try all n coordinate projection functions. That is, let δ_ω be the function such that δ_ω(ω′) equals n = |U| if ω = ω′ and equals 0 otherwise (and hence under our expectation inner product f(ω) = 〈f, δ_ω〉). The algorithm will try all functions of the form Πδ_ω, where Π is the projector to the subspace V. Fairly straightforward calculations show that the 2-norm squared of such a function is expected to be (d/n)‖δ_ω‖_2^2 = d, and it turns out in our setting we can assume that the norm is well concentrated around this expectation (or else we'd be able to find a good solution in some other way). Thus, if coordinate projection fails then it must hold that

O(d^2) = O(𝔼_ω ‖Πδ_ω‖_2^4) ≥ 𝔼_ω ‖Πδ_ω‖_4^4 = 𝔼_{ω,ω′} 〈Πδ_ω, δ_{ω′}〉^4 . (2.5)
It turns out that (2.5) implies some nontrivial constraints on the distribution D. Specifically, we know that

µ = 𝔼_{f∼D} ‖f‖_4^4 = 𝔼_{f∼D, ω∈U} 〈f, δ_ω〉^4 .

But since f = Πf and Π is symmetric, the RHS is equal to

𝔼_{f∼D, ω∈U} 〈f, Πδ_ω〉^4 = 〈𝔼_{f∼D} f^{⊗4}, 𝔼_{ω∈U} (Πδ_ω)^{⊗4}〉 ≤ ‖𝔼_{f∼D} f^{⊗4}‖_2 ‖𝔼_{ω∈U} (Πδ_ω)^{⊗4}‖_2 ,

where the last inequality uses Cauchy–Schwarz. If we square this inequality we get that

µ^2 ≤ 〈𝔼_{f∼D} f^{⊗4}, 𝔼_{f∼D} f^{⊗4}〉〈𝔼_{ω∈U} (Πδ_ω)^{⊗4}, 𝔼_{ω∈U} (Πδ_ω)^{⊗4}〉 = (𝔼_{f,f′∼D} 〈f, f′〉^4)(𝔼_{ω,ω′} 〈Πδ_ω, Πδ_{ω′}〉^4) .

But since Π is a projector satisfying Π = Π^2, we can use (2.5) and obtain that

Ω(µ^2/d^2) ≤ 𝔼_{f,f′∼D} 〈f, f′〉^4 .

Since d = o(µ^{−3}) this means that

𝔼_{f,f′∼D} 〈f, f′〉^4 ≫ µ^8 . (2.6)
Equation (2.6), contrasted with the fact that 𝔼_{f,f′∼D} 〈f, f′〉 = O(µ^2), means that the inner product of two random functions in D is somewhat “surprisingly unconcentrated”, which seems to be a nontrivial piece of information about D.11 Indeed, because the f's are nonnegative functions, if we pick a random u and consider the distribution D_u where the probability of every function is reweighed proportionally to f(u), then intuitively that should increase the probability of pairs with large inner products. Indeed, as we show in Lemma A.4, one can use Hölder's inequality to prove that there exist ω_1, …, ω_4 such that under the distribution D′ where every element f is reweighed proportionally to f(ω_1) ⋯ f(ω_4), it holds that

𝔼_{f,f′∼D′} 〈f, f′〉 ≥ (𝔼_{f,f′∼D} 〈f, f′〉^4)^{1/4} . (2.7)

(2.7) and (2.6) together imply that 𝔼_{f,f′∼D′} 〈f, f′〉 ≫ µ^2, which, as mentioned above, means that we can find a function g satisfying ‖g‖_4 ≫ ‖g‖_2 by taking a gaussian function matching the first two moments of D′.
Once again, this combining algorithm can be turned into an algorithm that uses O(1) levels of the SOS hierarchy. The main technical obstacle (which is still not very hard) is to prove another appropriate generalization of Hölder's inequality for pseudoexpectations (see Lemma A.4). Generalizing to the setting that in the YES case the function is only approximately in the vector space is a bit more cumbersome. We need to consider, apart from f, the function f̄ that is obtained by first projecting f to the subspace and then “truncating” it by rounding each coordinate where f̄ is too small to zero. Because this truncation operation is not a low degree polynomial, we include the variables corresponding to f̄ as part of the relaxation, and so our pseudoexpectation operator also contains the moments of these functions as well.
11 Interestingly, this part of the argument does not require µ to be o(d^{−1/3}), and some analogous “non-concentration” property of D can be shown to hold for a hard to round D for any µ = o(1). However, we currently know how to take advantage of this property to obtain a combining algorithm only in the case that µ ≪ d^{−1/3}.
2.3 Optimizing polynomials with nonnegative coefficients
We now consider the task of maximizing a polynomial with nonnegative coefficients over the sphere, namely proving Theorem 3.1. We consider the special case of Theorem 3.1 where the polynomial is of degree 4. That is, we are given a parameter ε > 0 and an n^2 × n^2 nonnegative matrix M with spectral norm at most 1 and want to find an ε additive approximation to the maximum of

∑_{i,j,k,l} M_{i,j,k,l} x_i x_j x_k x_l , (2.8)

over all x ∈ ℝ^n with ‖x‖ = 1, where in this section we let ‖x‖ be the standard (counting) Euclidean norm ‖x‖ = √(∑_i x_i^2).
One can get some intuition for this problem by considering the case where M is 0/1 valued and x is 0/k^{−1/2} valued for some k. In this case one can think of M as a 4-uniform hypergraph on n vertices and x as a subset S ⊆ [n] that maximizes the number of edges inside S divided by |S|^2, and so this problem is related to some type of a densest subgraph problem on a hypergraph.12
Let's assume that we are given a distribution X over unit vectors that achieve some value ν in (2.8). This is a non-convex problem, and so generally the average of these vectors would not be a good solution. However, it turns out that the vector x^* defined such that x^*_i = √(𝔼_{x∼X} x_i^2) can sometimes be a good solution for this problem. Specifically, we will show that if it fails to give a solution of value at least ν − ε, then we can find a new distribution X′, obtained by reweighing elements of X, that is in some sense “simpler” than X. More precisely, we will define some nonnegative potential function Ψ such that Ψ(X) ≤ log n for all X and Ψ(X′) ≤ Ψ(X) − Ω(ε^2) under the above conditions. This will show that we will need to use this reweighing step at most logarithmically many times.
Indeed, suppose that

∑_{i,j,k,l} M_{i,j,k,l} x^*_i x^*_j x^*_k x^*_l = (x^{*⊗2})^T M x^{*⊗2} ≤ ν − ε . (2.9)

We claim that in contrast

y^T M y ≥ ν , (2.10)

where y is the n^2-dimensional vector defined by y_{i,j} = √(𝔼_{x∼X} x_i^2 x_j^2). Indeed, (2.10) follows from the non-negativity of M and the Cauchy–Schwarz inequality since

ν = ∑_{i,j,k,l} M_{i,j,k,l} 𝔼_{x∼X} x_i x_j x_k x_l ≤ ∑_{i,j,k,l} M_{i,j,k,l} √(𝔼_{x∼X} x_i^2 x_j^2) √(𝔼_{x∼X} x_k^2 x_l^2) = y^T M y .

Note that since X is a distribution over unit vectors, both x^* and y are unit vectors, and hence (2.9) and (2.10) together with the fact that M has bounded spectral norm imply that

ε ≤ y^T M y − (x^{*⊗2})^T M x^{*⊗2} = (y − x^{*⊗2})^T M (y + x^{*⊗2}) ≤ ‖y − x^{*⊗2}‖ · ‖y + x^{*⊗2}‖ ≤ 2‖y − x^{*⊗2}‖ . (2.11)
However, it turns out that ‖y − x∗^{⊗2}‖ equals √2 times the Hellinger distance of the two distributions D, D∗ over [n] × [n] defined as follows: ℙ[D = (i,j)] = 𝔼 x_i² x_j², while ℙ[D∗ = (i,j)] = (𝔼 x_i²)(𝔼 x_j²) (see Section 3). At this point we can use standard information-theoretic inequalities to derive from (2.11) that
12 The condition of maximizing |E(S)|/|S|² is related to the log-density condition used by [BCC+10] in their work on the densest subgraph problem since, assuming that the set [n] of all vertices is not the best solution, the set S satisfies log_{|S|} |E(S)| ≥ log_n |E|. However, we do not know how to use their algorithm to solve this problem. Beyond the fact that we consider the hypergraph setting, their algorithm manages to find a set of nontrivial density under the assumption that there is a "log dense" subset, but it is not guaranteed to find the "log dense" subset itself.
there is Ω(ε²) mutual information between the two parts of D. Another way to say this is that the entropy of the second part of D drops on average by Ω(ε²) if we condition on the value of the first part. To say the same thing mathematically, if we define D(X) to be the distribution (𝔼_{x∼X} x_1², …, 𝔼_{x∼X} x_n²) over [n] and D(X|i) to be the distribution (1/𝔼_{x∼X} x_i²) · (𝔼_{x∼X} x_i² x_1², …, 𝔼_{x∼X} x_i² x_n²), then

𝔼_{i∼D(X)} H(D(X|i)) ≤ H(D(X)) − Ω(ε²) .
But one can verify that D(X|i) = D(X_i), where X_i is the distribution over x's such that ℙ[X_i = x] = x_i² ℙ[X = x]/𝔼_X x_i², which means that if we define Ψ(X) = H(D(X)) then we get

𝔼_{i∼D(X)} Ψ(X_i) ≤ Ψ(X) − Ω(ε²) ,

and hence Ψ is exactly the potential function we were looking for.

To summarize, our combining algorithm will do the following for t = O(log n/ε²) steps: given the first moments of the distribution X, define the vector x∗ as above and test whether it yields an objective value of at least ν − ε. Otherwise, pick i with probability 𝔼_{x∼X} x_i² and move to the distribution X_i. Note that given the level-d moments of X, we can compute the level-(d−2) moments of X_i, and hence the whole algorithm can be carried out with access only to the level-O(log n/ε²) moments of X. We then see that the only properties of the moments used in this proof are linearity, the fact that ∑ x_i² can always be replaced with 1 in any expression, and the Cauchy–Schwarz inequality used for obtaining (2.10). It turns out that all these properties hold even if we are not given access to the moments of a true distribution X but only to a level-d pseudoexpectation operator 𝔼̃ for d equal to some constant times log n/ε². Such pseudoexpectation operators can be optimized over in d levels of the SOS hierarchy, and hence this combining algorithm is in fact a rounding algorithm.
3 Approximation for nonnegative tensor maximization
In this section we prove Theorem 1.2, giving an approximation algorithm for the maximum over the sphere of a polynomial with nonnegative coefficients. We will work in the space ℝ^n endowed with the counting measure for norms and inner products. We define the spectral norm of a degree-2t homogeneous polynomial M in x = (x_1, …, x_n), denoted by ‖M‖_spectral, to be the minimum of the spectral norm of Q, taken over all quadratic forms Q over (ℝ^n)^{⊗t} such that Q(x^{⊗t}) = M(x) for every x. Note that we can compute the spectral norm of a homogeneous polynomial in polynomial time using semidefinite programming. Thus we can restate the main theorem of this section as:
Theorem 3.1 (Theorem 1.2, restated). Let M be a degree-2t homogeneous polynomial in x = (x_1, …, x_n) with nonnegative coefficients. Then there is an algorithm, based on O(t³ log n/ε²) levels of the SOS hierarchy, that finds a unit vector x∗ ∈ ℝ^n such that

M(x∗) ≥ max_{x∈ℝ^n, ‖x‖=1} M(x) − ε‖M‖_spectral .
To prove Theorem 3.1 we first come up with a combining algorithm, namely an algorithm that takes (the moment matrix of) a distribution X over unit vectors x ∈ ℝ^n such that 𝔼 M(x) ≥ ν and finds a unit vector x∗ such that M(x∗) ≥ ν − ε. We then show that the algorithm succeeds even if X is merely a level-O(t log n/ε²) pseudodistribution; that is, the moment matrix is a pseudoexpectation operator. The combining algorithm is very simple:
Combining algorithm for polynomials with nonnegative
coefficients:
Input: distribution X over unit x ∈ ℝ^n such that 𝔼_{x∼X} M(x) = ν.

Operation: Do the following for t² log n/ε² steps:

Direct rounding: For i ∈ [n], let x∗_i = √(𝔼_{x∼X} x_i²). If M(x∗) ≥ ν − 4ε, output x∗ and quit.

Conditioning: Try to find i_1, …, i_{t−1} ∈ [n] such that the distribution X_{i_1,…,i_{t−1}} satisfies Ψ(X_{i_1,…,i_{t−1}}) ≤ Ψ(X) − ε²/t², and set X = X_{i_1,…,i_{t−1}}, where:

– X_{i_1,…,i_{t−1}} is defined by letting ℙ[X_{i_1,…,i_{t−1}} = x] be proportional to ℙ[X = x] · ∏_{j=1}^{t−1} x_{i_j}² for every x ∈ ℝ^n.

– Ψ(X) is defined to be H(A(X)), where H(·) is the Shannon entropy function and A(X) is the distribution over [n] obtained by letting ℙ[A(X) = i] = 𝔼_{x∼X} x_i² for every i ∈ [n].
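The conditioning step above operates directly on moments: the level-d moments of X determine the level-(d−2) moments of X_i, since reweighing by x_i² raises every monomial's degree by two. A minimal sketch of this slicing for d = 4, under the assumption that the moments are handed to us as an explicit tensor (the function name and representation are ours):

```python
import numpy as np

def condition_on(T4, i):
    # T4 = E_X[x ⊗ x ⊗ x ⊗ x], shape (n, n, n, n); returns E_{X_i}[x ⊗ x].
    num = T4[i, i]                  # E_X[x_i^2 · x x^T], shape (n, n)
    # On the unit sphere sum_j x_j^2 = 1, so
    # tr(num) = sum_j E[x_i^2 x_j^2] = E[x_i^2], the needed normalization.
    return num / np.trace(num)
```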
Clearly Ψ(X) is always in [0, log n], and hence if we can show that we always succeed in at least one of the steps, then eventually the algorithm will output a good x∗. We now show that if the direct rounding step fails, then the conditioning step must succeed. We give the proof under the assumption that X is an actual distribution. Almost all of this analysis holds verbatim when X is a pseudodistribution of level at least 2t² log n/ε², and we note the one step where the extension requires a nontrivial (though easy to prove) property of pseudoexpectations, namely that they satisfy the Cauchy–Schwarz inequality.
Some information theory facts. We recall some standard relations between various entropy and distance measures. Let X and Y be two jointly distributed random variables. We denote the joint distribution of X and Y by {XY}, and their marginal distributions by {X} and {Y}. We let {X}{Y} denote the product of the distributions {X} and {Y} (corresponding to sampling X and Y independently from their marginal distributions). Recall that the Shannon entropy of X, denoted by H(X), is defined to be −∑_{x∈Support(X)} ℙ[X = x] log ℙ[X = x]. The mutual information of X and Y is defined as I(X, Y) := H(X) − H(X | Y), where H(X | Y) is the conditional entropy of X with respect to Y, defined as 𝔼_{y∼{Y}} H(X | Y = y). The Hellinger distance between two distributions p and q is defined by d_H(p, q) := (1 − ∑_i √(p_i q_i))^{1/2}. (In particular, d_H(p, q) equals 1/√2 times the Euclidean distance between the unit vectors √p and √q.) The following inequality (whose proof follows by combining standard relations between the Hellinger distance, Kullback–Leibler divergence, and mutual information) will be useful for us:
Lemma 3.2. For any two jointly distributed random variables X and Y,

2 d_H({XY}, {X}{Y})² ≤ I(X, Y) .
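For concreteness, Lemma 3.2 is easy to verify numerically on random finite joint distributions (with entropies and mutual information measured in nats, for which the constant 2 is valid). All names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    P = rng.random((4, 5)); P /= P.sum()        # random joint distribution {XY}
    Q = np.outer(P.sum(axis=1), P.sum(axis=0))  # product of marginals {X}{Y}
    dH2 = 1.0 - np.sum(np.sqrt(P * Q))          # squared Hellinger distance
    I = np.sum(P * np.log(P / Q))               # mutual information (nats)
    assert 2 * dH2 <= I + 1e-12
```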
3.1 Direct Rounding
Given X, we define the following correlated random variables A_1, …, A_t over [n]: the probability that (A_1, …, A_t) = (i_1, …, i_t) is equal to 𝔼_{x∼X} x_{i_1}² ⋯ x_{i_t}². Note that for every i, the random variable A_i is distributed according to A(X). (Note also that even if X is only a pseudodistribution, A_1, …, A_t are actual random variables.) The following lemma gives a sufficient condition for our direct rounding step to succeed:
Lemma 3.3. Let M, X be as above. If d_H({A_1 ⋯ A_t}, {A_1} ⋯ {A_t}) ≤ ε, then the unit vector x∗ with x∗_i = (𝔼_{x∼X} x_i²)^{1/2} satisfies M(x∗) ≥ ν − 4ε‖M‖_spectral. Moreover, this holds even if X is a level ℓ ≥ 2t pseudodistribution.
Proof. Let Q be a quadratic form with Q(x^{⊗t}) = M(x). Let y ∈ (ℝ^n)^{⊗t} be the vector y_{i_1⋯i_t} = (𝔼̃_{x∼X} x_{i_1}² ⋯ x_{i_t}²)^{1/2}. Then,

𝔼̃ M(x) = ⟨M̂, 𝔼̃ x^{⊗2t}⟩ ≤ ⟨M̂, y ⊗ y⟩ = Q(y) .    (3.1)

Here, the vector M̂ ∈ (ℝ^n)^{⊗2t} contains the coefficients of M. In particular, M̂ ≥ 0 entrywise. The inequality in (3.1) uses Cauchy–Schwarz, namely that 𝔼̃ x^α x^β ≤ (𝔼̃ (x^α)² · 𝔼̃ (x^β)²)^{1/2} = y_α y_β. The final equality in (3.1) uses that y is symmetric.

Next, we bound the difference between Q(y) and M(x∗):

Q(y) − M(x∗) = Q(y) − Q(x∗^{⊗t}) = ⟨y + x∗^{⊗t}, Q(y − x∗^{⊗t})⟩ ≤ ‖Q‖ · ‖y + x∗^{⊗t}‖ · ‖y − x∗^{⊗t}‖ .    (3.2)

(Here, ⟨·, Q·⟩ denotes the symmetric bilinear form corresponding to Q.) Since both x∗^{⊗t} and y are unit vectors, ‖y + x∗^{⊗t}‖ ≤ 2. By construction, the vector y corresponds to the distribution {A_1 ⋯ A_t} and x∗^{⊗t} corresponds to the distribution {A_1} ⋯ {A_t}. In particular, d_H({A_1 ⋯ A_t}, {A_1} ⋯ {A_t}) = (1/√2) ‖y − x∗^{⊗t}‖. Together with the bounds (3.1) and (3.2),

M(x∗) ≥ 𝔼̃ M(x) − 4‖Q‖ · d_H({A_1 ⋯ A_t}, {A_1} ⋯ {A_t}) . □
To verify that this carries over when X is a pseudodistribution, we just need to use the fact that Cauchy–Schwarz holds for pseudoexpectations (Lemma A.2).
3.2 Making Progress
The following lemma shows that if the sufficient condition above is violated, then in expectation we can always make progress. (Because A_1, …, A_t are actual random variables, it automatically holds regardless of whether X is an actual distribution or a pseudodistribution.)
Lemma 3.4. If d_H({A_1 ⋯ A_t}, {A_1} ⋯ {A_t}) ≥ ε, then

H(A_t | A_1 ⋯ A_{t−1}) ≤ H(A) − 2ε²/t² .
Proof. The bound follows by combining a hybrid argument with Lemma 3.2. Let A′_1, …, A′_t be independent copies of A_1, …, A_t, so that

{A_1 ⋯ A_t A′_1 ⋯ A′_t} = {A_1 ⋯ A_t}{A_1} ⋯ {A_t} .

We consider the sequence of distributions D_0, …, D_t with

D_i = {A_1 ⋯ A_i A′_{i+1} ⋯ A′_t} .

By assumption, d_H(D_0, D_t) ≥ ε. Therefore, there exists an index i such that d_H(D_{i−1}, D_i) ≥ ε/t. Let X = A_1 ⋯ A_{i−1} and Y = A_i A′_{i+1} ⋯ A′_t. Then D_i = {XY} and D_{i−1} = {X}{Y}. By Lemma 3.2,

H(Y) − H(Y | X) = I(X, Y) ≥ 2 d_H({XY}, {X}{Y})² ≥ 2ε²/t² .

Since A′_{i+1}, …, A′_t are independent of A_1, …, A_i,

H(Y) − H(Y | X) = H(A_i) − H(A_i | A_1 ⋯ A_{i−1}) .

By symmetry and the monotonicity of entropy under conditioning, we conclude

H(A_t | A_1 ⋯ A_{t−1}) ≤ H(A) − 2ε²/t² . □
Lemma 3.4 implies that if our direct rounding fails, then the expectation of H(A_t) conditioned on A_1, …, A_{t−1} is at most H(A) − 2ε²/t²; in particular this means there exist i_1, …, i_{t−1} so that H(A_t | A_1 = i_1, …, A_{t−1} = i_{t−1}) ≤ H(A) − 2ε²/t². The probability of i under the distribution A_t | A_1 = i_1, …, A_{t−1} = i_{t−1} is proportional to 𝔼_{x∼X} x_i² · ∏_{j=1}^{t−1} x_{i_j}², which means that it exactly equals the distribution A(X_{i_1,…,i_{t−1}}). Thus we see that Ψ(X_{i_1,…,i_{t−1}}) ≤ Ψ(X) − 2ε²/t². This concludes the proof of Theorem 3.1. □
Remark 3.5 (Handling odd degrees and non-homogeneous polynomials). If the polynomial P is not homogeneous but only has monomials of even degree, we can homogenize it by multiplying every monomial by an appropriate power of (∑ x_i²), which is identically equal to 1 on the sphere. To handle odd-degree monomials we can introduce a new variable x_0 and add the constraint that it must be identically equal to 1/√t. (Note that if the pseudoexpectation operator is consistent with this constraint then our rounding algorithm will in fact output a vector that satisfies it.) This way we can represent every odd-degree monomial α_S ∏_{i∈S} x_i by the even-degree monomial √t α_S x_0 ∏_{i∈S} x_i. The maximum of P on the unit sphere is equal to the maximum of the new polynomial P′ on the sphere of radius √(1 + 1/t), which, because P′ is homogeneous, equals (1 + 1/t)^{t/2} times the maximum of P′ on the unit sphere. We simply define the spectral norm of P as the spectral norm of P′.
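The even-degree part of this homogenization is mechanical; the sketch below illustrates it for polynomials represented as exponent-tuple dictionaries (the helper name and encoding are ours, and only the even-degree case is handled):

```python
from itertools import product
from collections import defaultdict

def homogenize(poly, n, t):
    # poly: dict mapping exponent tuples (length n, even total degree <= 2t)
    # to coefficients; each monomial of degree 2k is multiplied by
    # (sum_i x_i^2)^{t-k}, which is identically 1 on the unit sphere.
    out = defaultdict(float)
    for mono, c in poly.items():
        k = (2 * t - sum(mono)) // 2          # missing degree, in pairs
        for js in product(range(n), repeat=k):
            new = list(mono)
            for j in js:
                new[j] += 2                    # multiply by x_j^2
            out[tuple(new)] += c
    return dict(out)

# Example: x1^2 x2^2 + x1^2 becomes the degree-4 homogeneous
# polynomial x1^4 + 2 x1^2 x2^2, which agrees with it on the sphere.
print(homogenize({(2, 2): 1.0, (2, 0): 1.0}, n=2, t=2))
```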
4 Finding an “analytically sparse” vector in a subspace
In this section we prove Theorem 1.5. We let U be a universe of size n and L²(U) be the vector space of real-valued functions f : U → ℝ. The measure on the set U is the uniform probability distribution, and hence we will use the inner product ⟨f, g⟩ = 𝔼_ω f(ω)g(ω) and norm ‖f‖_p = (𝔼_ω f(ω)^p)^{1/p} for f, g : U → ℝ and p ≥ 1.
Theorem 4.1 (Theorem 1.5, restated). There is a constant ε > 0 and a polynomial-time algorithm A, based on O(1) levels of the SOS hierarchy, such that on input a projector operator Π for which there exists a µ-sparse Boolean function f satisfying ‖Πf‖₂² ≥ (1 − ε)‖f‖₂², the algorithm outputs a function g ∈ Image(Π) such that

‖g‖₄⁴ ≥ Ω( ‖g‖₂⁴ / (µ (rank Π)^{1/3}) ) .
We will prove Theorem 4.1 by first showing a combining algorithm and then transforming it into a rounding algorithm. Note that the description of the combining algorithm is independent of the actual relaxation used, since it assumes a true distribution over the solutions, and so we first describe the algorithm before specifying the relaxation. In our actual relaxation we will use some auxiliary variables that will make the analysis of the algorithm simpler.
Combining algorithm for finding an analytically sparse
vector:
Input: Distribution D over Boolean (i.e., 0/1 valued) functions f ∈ L²(U) that satisfy:

– µ(f) = ℙ_ω[f(ω) = 1] = µ.

– ‖Πf‖₂² ≥ (1 − ε)‖f‖₂².

Goal: Output g such that

‖g‖₄⁴ ≥ γ‖g‖₂⁴, where γ = Ω(1/(µ (rank Π)^{1/3})) .    (4.1)

Operation: Do the following:

Coordinate projection rounding: For ω ∈ U, let δ_ω : U → ℝ be the function that satisfies ⟨f, δ_ω⟩ = f(ω) for all f ∈ L²(U). Go over all vectors of the form g_ω = Πδ_ω for ω ∈ U, and if there is one that satisfies (4.1), output it. Note that the output of this procedure is independent of the distribution D.
Random function rounding: Choose a random Gaussian vector t ∈ L²(U) and output g = Πt if it satisfies (4.1). (Note that this step is also independent of the distribution D.)

Conditioning: Go over all choices of ω_1, …, ω_4 ∈ U and modify the distribution D to the distribution D_{ω_1,…,ω_4}, defined so that ℙ_{D_{ω_1,…,ω_4}}[f] is proportional to ℙ_D[f] · ∏_{j=1}^{4} f(ω_j)² for every f.

Gaussian rounding: For every one of these choices, let t be a random Gaussian that matches the first two moments of the distribution D_{ω_1,…,ω_4}, and output g = Πt if it satisfies (4.1).
Because we will make use of this fact later, we will note when certain properties hold not just for expectations of actual probability distributions but for pseudoexpectations as well. The extension to pseudoexpectations is typically not deep, but can be cumbersome, and so the reader may want to initially restrict attention to the case of a combining algorithm, where we only deal with actual expectations. We show the consequences of each of the steps failing, and then combine them to derive a contradiction.
4.1 Random function rounding
We start by analyzing the random function rounding step. Let e_1, …, e_n be an orthonormal basis for the space of functions L²(U). Let t be a standard Gaussian function in L²(U), i.e., t = ξ_1 e_1 + ⋯ + ξ_n e_n for independent standard normal variables ξ_1, …, ξ_n (each with mean 0 and variance 1). The following lemmas, combined, show the consequences of ‖Πt‖₄ not being much bigger than ‖Πt‖₂.
Lemma 4.2. For any f, g : U → ℝ,

𝔼_t ⟨f, t⟩⟨g, t⟩ = ⟨f, g⟩ .

Proof. Write f and g in the basis {e_1, …, e_n}: f = ∑_i a_i e_i and g = ∑_j b_j e_j. Then, because this is an orthonormal basis, ⟨f, g⟩ equals ∑_i a_i b_i, while ⟨f, t⟩⟨g, t⟩ = ∑_{i,j} a_i b_j ξ_i ξ_j, which has expectation ∑_i a_i b_i. Hence the left-hand side is the same as the right-hand side. □
Lemma 4.3. The 4th moment of ‖Πt‖₄ satisfies

𝔼_t ‖Πt‖₄⁴ ≥ 𝔼_ω ‖Πδ_ω‖₂⁴ .
Proof. By the previous lemma, the Gaussian variable Πt(ω) = ⟨Πδ_ω, t⟩ has variance ‖Πδ_ω‖₂². Therefore,

𝔼_t ‖Πt‖₄⁴ = 𝔼_t 𝔼_ω Πt(ω)⁴ = 𝔼_ω 𝔼_t ⟨δ_ω, Πt⟩⁴ = 3 𝔼_ω (𝔼_t ⟨Πδ_ω, t⟩²)² = 3 𝔼_ω ‖Πδ_ω‖₂⁴ ,

since 3 = 𝔼_{X∼N(0,1)} X⁴. □
Lemma 4.4. The 4th moment of ‖Πt‖₂ satisfies

𝔼_t ‖Πt‖₂⁴ ≤ 10 · (rank Π)² .

Proof. The random variable ‖Πt‖₂² has a χ²-distribution with k = rank Π degrees of freedom. The mean of this distribution is k and the variance is 2k. It follows that 𝔼_t ‖Πt‖₂⁴ ≤ 10 (rank Π)². □
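Both lemmas are easy to sanity-check by Monte Carlo for a random low-rank projector. The sketch below is written in counting norms for simplicity (the identity behind Lemma 4.3 and the χ² bound of Lemma 4.4 are cleanest there); all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, trials = 40, 5, 100_000
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
Pi = Q @ Q.T                                   # a random rank-k projector

T = Pi @ rng.standard_normal((n, trials))      # columns are samples of Πt
m4 = np.mean(np.sum(T**4, axis=0))             # E_t ‖Πt‖_4^4 (counting norm)
m2sq = np.mean(np.sum(T**2, axis=0) ** 2)      # E_t ‖Πt‖_2^4

print(m4, "vs", 3 * np.sum(np.sum(Pi**2, axis=0) ** 2))  # Lemma 4.3 identity
print(m2sq, "vs", k**2 + 2 * k, "<=", 10 * k**2)         # Lemma 4.4 bound
```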
4.2 Coordinate projection rounding
We now turn to the implications of the failure of coordinate projection rounding. We start by noting the following technical lemma, which holds for both the expectation and the counting inner products:

Lemma 4.5. Let x and y be two independent, vector-valued random variables, and let x′, y′ denote independent copies of x and y. Then,

𝔼 ⟨x, y⟩⁴ ≤ (𝔼 ⟨x, x′⟩⁴)^{1/2} · (𝔼 ⟨y, y′⟩⁴)^{1/2} .

Moreover, this holds even if x, y come from a level ℓ ≥ 8 pseudodistribution.
Proof. By Cauchy–Schwarz,

𝔼_{x,y} ⟨x, y⟩⁴ = ⟨𝔼_x x^{⊗4}, 𝔼_y y^{⊗4}⟩ ≤ ‖𝔼_x x^{⊗4}‖₂ · ‖𝔼_y y^{⊗4}‖₂ = (𝔼_{x,x′} ⟨x, x′⟩⁴)^{1/2} · (𝔼_{y,y′} ⟨y, y′⟩⁴)^{1/2} .

We now consider the case of pseudodistributions. In this case the pseudoexpectation over two independent copies x and x′ is obtained using Lemma A.5. Let X and Y be the n⁴-dimensional vectors 𝔼̃ x^{⊗4} and 𝔼̃ y^{⊗4}, respectively. We can use the standard Cauchy–Schwarz inequality to argue that X · Y ≤ ‖X‖₂ · ‖Y‖₂, and so what is left is to argue that ‖X‖₂² = 𝔼̃_{x,x′} ⟨x, x′⟩⁴, and similarly for Y. This holds by linearity for the same reason it holds for actual expectations, but for the sake of completeness we do the calculation. We use the counting inner product for convenience; because the lemma's statement is scale-free, this implies it also for the expectation norm. We have

𝔼̃_{x,x′} ⟨x, x′⟩⁴ = 𝔼̃_{x,x′} ∑_{i,j,k,l} x_i x_j x_k x_l x′_i x′_j x′_k x′_l = ∑_{i,j,k,l} (𝔼̃_x x_i x_j x_k x_l)(𝔼̃_{x′} x′_i x′_j x′_k x′_l) ,

where the last equality holds by independence. But this is simply equal to

∑_{i,j,k,l} (𝔼̃_x x_i x_j x_k x_l)² = ‖X‖₂² . □
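For actual (non-pseudo) distributions, Lemma 4.5 is easy to test numerically; the sketch below takes x and y uniform over small random finite supports (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
xs = rng.standard_normal((10, 6))     # support of x (uniform distribution)
ys = rng.standard_normal((12, 6))     # support of y

lhs = np.mean([(x @ y) ** 4 for x in xs for y in ys])
ex = np.mean([(x @ xp) ** 4 for x in xs for xp in xs])
ey = np.mean([(y @ yp) ** 4 for y in ys for yp in ys])
assert lhs <= np.sqrt(ex * ey) + 1e-9
```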
The following lemma shows a nontrivial consequence of 𝔼_ω ‖Πδ_ω‖₄⁴ being small:
Lemma 4.6 (Coordinate projection rounding). For any distribution D over L²(U),

𝔼_{f∼D} ‖Πf‖₄⁴ ≤ (𝔼_{f,f′∼D} ⟨f, Πf′⟩⁴)^{1/2} · (𝔼_ω ‖Πδ_ω‖₄⁴)^{1/2} .

Moreover, this holds even if D is a level ℓ ≥ 8 pseudodistribution. (Note that ω is simply drawn from the uniform distribution over U, and hence the last term on the right-hand side always denotes an actual expectation.)
Proof. By the previous lemma,

𝔼̃_{f∼D} ‖Πf‖₄⁴ = 𝔼̃_{f∼D} 𝔼_ω ⟨δ_ω, Πf⟩⁴ ≤ (𝔼̃_{f,f′∼D} ⟨f, Πf′⟩⁴)^{1/2} · (𝔼_{ω,ω′} ⟨δ_ω, Πδ_{ω′}⟩⁴)^{1/2} = (𝔼̃_{f,f′∼D} ⟨f, Πf′⟩⁴)^{1/2} · (𝔼_ω ‖Πδ_ω‖₄⁴)^{1/2} . □
4.3 Gaussian Rounding
In this subsection we analyze the Gaussian rounding step. Let t be a random function with the Gaussian distribution that matches the first two moments of a distribution D over L²(U).
Lemma 4.7. The 4th moment of ‖Πt‖₄ satisfies

𝔼_t ‖Πt‖₄⁴ = 3 𝔼_{f,f′∼D} ⟨(Πf)², (Πf′)²⟩ .

Moreover, this holds even if D is a level ℓ ≥ 100 pseudodistribution. (Note that even in this case, t is still an actual distribution.)
Proof. We have

𝔼_t ‖Πt‖₄⁴ = 𝔼_t 𝔼_ω Πt(ω)⁴ = 3 𝔼_ω (𝔼_t Πt(ω)²)² = 3 𝔼_ω (𝔼̃_{f∼D} Πf(ω)²)² = 3 𝔼̃_{f,f′∼D} ⟨(Πf)², (Πf′)²⟩ . □
Fact 4.8. If A, B, C, D are jointly Gaussian with mean zero, then

𝔼 ABCD = 𝔼 AB · 𝔼 CD + 𝔼 AC · 𝔼 BD + 𝔼 AD · 𝔼 BC .
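A quick numeric check of Fact 4.8 with a random covariance (all names are illustrative; the match is up to Monte Carlo error):

```python
import numpy as np

rng = np.random.default_rng(3)
G = rng.standard_normal((4, 4))
S = G @ G.T                                   # covariance of (A, B, C, D)
A, B, C, D = rng.multivariate_normal(np.zeros(4), S, size=1_000_000).T

lhs = np.mean(A * B * C * D)
rhs = S[0, 1] * S[2, 3] + S[0, 2] * S[1, 3] + S[0, 3] * S[1, 2]
print(lhs, "vs", rhs)
```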
Lemma 4.9. The fourth moment of ‖Πt‖₂ satisfies

𝔼_t ‖Πt‖₂⁴ ≤ 3 (𝔼_{f∼D} ‖Πf‖₂²)² .

Moreover, this holds even if D is a level ℓ ≥ 4 pseudodistribution.
Proof. By the previous fact,

𝔼_t ‖Πt‖₂⁴ = 𝔼_{ω,ω′} 𝔼_t Πt(ω)² · Πt(ω′)² = 𝔼_{ω,ω′} [ 𝔼̃_f Πf(ω)² · 𝔼̃_f Πf(ω′)² + 2 (𝔼̃_f Πf(ω)Πf(ω′))² ] ≤ 3 (𝔼̃_f ‖Πf‖₂²)² . □
4.4 Conditioning
We now show the sense in which conditioning can make progress. Let D be a distribution over L²(U). For ω ∈ U, let D_ω be the distribution D reweighed by f(ω)² for f ∼ D. That is, D_ω{f} ∝ f(ω)² · D{f}, or in other words, for every function P(·), 𝔼_{f∼D_ω} P(f) = (𝔼_{f∼D} f(ω)² P(f))/(𝔼_{f∼D} f(ω)²). Similarly, we write D_{ω_1,…,ω_r} for the distribution D reweighed by f(ω_1)² ⋯ f(ω_r)².
Lemma 4.10 (Conditioning). For every even r ∈ ℕ, there are points ω_1, …, ω_r ∈ U such that the reweighed distribution D′ = D_{ω_1,…,ω_r} satisfies

𝔼_{f,g∼D′} ⟨f², g²⟩ ≥ (𝔼_{f,g∼D} ⟨f², g²⟩^r)^{1/r} .

Moreover, this holds even if D is a level ℓ ≥ 10r pseudodistribution.
Proof. We have that

max_{ω_1,…,ω_r} 𝔼̃_{f,g∼D_{ω_1,…,ω_r}} ⟨f², g²⟩ = max_{ω_1,…,ω_r} ( 𝔼̃ f(ω_1)² ⋯ f(ω_r)² · g(ω_1)² ⋯ g(ω_r)² ⟨f², g²⟩ ) / ( (𝔼̃_f f(ω_1)² ⋯ f(ω_r)²)(𝔼̃_g g(ω_1)² ⋯ g(ω_r)²) ) ,

but using (𝔼 X)/(𝔼 Y) ≤ max(X/Y) and 𝔼_{ω_1,…,ω_r} f(ω_1)² ⋯ f(ω_r)² g(ω_1)² ⋯ g(ω_r)² = ⟨f², g²⟩^r, the right-hand side is lower bounded by

( 𝔼_{ω_1,…,ω_r} 𝔼̃ f(ω_1)² ⋯ f(ω_r)² · g(ω_1)² ⋯ g(ω_r)² ⟨f², g²⟩ ) / ( 𝔼_{ω_1,…,ω_r} (𝔼̃_f f(ω_1)² ⋯ f(ω_r)²)(𝔼̃_g g(ω_1)² ⋯ g(ω_r)²) ) = ( 𝔼̃_{f,g∼D} ⟨f², g²⟩^{r+1} ) / ( 𝔼̃_{f,g∼D} ⟨f², g²⟩^r ) .

Now, if D were an actual distribution, we could use Hölder's inequality to lower bound the numerator of the right-hand side by (𝔼_{f,g∼D} ⟨f², g²⟩^r)^{(r+1)/r}, which would lower bound the right-hand side by (𝔼_{f,g∼D} ⟨f², g²⟩^r)^{1/r}. For pseudoexpectations this follows by appealing to Lemma A.4. □
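For actual distributions, Lemma 4.10 can likewise be checked by brute force over all r-tuples (ω_1, …, ω_r); the sketch below does this for a small explicit D given by functions with uniform weights (all names are ours):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(5)
fs = rng.standard_normal((8, 5))        # support of D: 8 functions on 5 points
w0 = np.full(8, 1 / 8)                  # uniform weights
r = 4

def reweigh(w, omegas):                 # D_{ω_1..ω_r}: weight ∝ w·Π f(ω_j)^2
    w = w * np.prod(fs[:, list(omegas)] ** 2, axis=1)
    return w / w.sum()

def corr(w):                            # E_{f,g~D'} <f^2, g^2>, expectation i.p.
    m = w @ fs**2                       # ω ↦ E_{f~D'} f(ω)^2
    return np.mean(m**2)

C = (fs**2) @ (fs**2).T / fs.shape[1]   # C[a,b] = <f_a^2, f_b^2>
rhs = (w0 @ C**r @ w0) ** (1 / r)       # (E_{f,g~D} <f^2,g^2>^r)^{1/r}
best = max(corr(reweigh(w0, om)) for om in product(range(5), repeat=r))
assert best >= rhs - 1e-9
```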
4.5 Truncating functions
The following observation will be useful for analyzing the case where the distribution is over functions that are not completely inside the subspace. Note that if the function f is inside the subspace, we can just take f̄ = f in Lemma 4.11, and so the reader may want to skip this section on a first reading and simply pretend that f̄ = f below.
Lemma 4.11. Let ε < 1/400, let Π be a projector on U, and suppose that f : U → {0, 1} satisfies ℙ[f(ω) = 1] = µ and ‖Πf‖₂² ≥ (1 − ε)µ. Then there exists a function f̄ : U → ℝ such that:

1. ‖Πf̄‖₄⁴ ≥ Ω(µ).

2. For every ω ∈ U, (Πf)(ω)² ≥ Ω(|f̄(ω)|).
Proof. Fix τ > 0 to be some sufficiently small constant (e.g., τ = 1/2 will do). Let f′ = Πf. We define f̄ = f′ · 1_{|f′| ≥ τ} (i.e., f̄(ω) = f′(ω) if |f′(ω)| ≥ τ and f̄(ω) = 0 otherwise) and define f̃ = f′ · 1_{|f′| < τ}. Note that, by construction, f′(ω)² ≥ τ|f̄(ω)| for every ω ∈ U, which gives the second property.

Since f̃(x) ≠ 0 if and only if |f′(x)| ∈ (0, τ), clearly |f̃(x)| ≤ |f(x) − f′(x)|, and hence ‖f̃‖₂² ≤ εµ. Using f′ = f̄ + f̃, we see that Πf̄ = f + (f′ − f) − f̃ + (Πf̄ − f̄). Now, since f′ is in the subspace, ‖Πf̄ − f̄‖₂ ≤ ‖f′ − f̄‖₂ = ‖f̃‖₂, and hence for g = (f′ − f) − f̃ + (Πf̄ − f̄) we have ‖g‖₂ ≤ 3√(εµ). Therefore, the probability that |g(ω)| ≥ 10√ε is at most µ/2. This means that with probability at least µ/2 it holds that f(ω) = 1 and |g(ω)| ≤ 10√ε, in which case Πf̄(ω) ≥ 1 − 10√ε ≥ 1/2. In particular, we get that 𝔼_ω Πf̄(ω)⁴ ≥ Ω(µ), which gives the first property. □
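The truncation in the proof above is a simple pointwise operation; here is a minimal numerical sketch (with τ = 1/2 as suggested above) computing f̄ and f̃ from f and an explicit projector matrix Π. The function name and representation are ours:

```python
import numpy as np

def truncate(Pi, f, tau=0.5):
    fprime = Pi @ f                                      # f' = Πf
    fbar = np.where(np.abs(fprime) >= tau, fprime, 0.0)  # f̄: keep large values
    ftilde = fprime - fbar                               # f̃: the small remainder
    return fbar, ftilde
```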
Remark 4.12 (Non-Boolean functions). The proof of Lemma 4.11 establishes much more than its statement. In particular, note that we did not make use of the fact that f is nonnegative, and a function f into {0, ±1} with ℙ[f(ω) ≠ 0] = µ would work just the same. We also did not need the nonzero values to have magnitude exactly one, since the proof easily extends to the case where they are in [1/c, c] for some constant c. One can also allow some nonzero values of the function to be outside that range, as long as their total contribution to the squared 2-norm is much smaller than µ.
4.6 Putting things together
We now show how the above analysis yields a combining algorithm, and we then discuss the changes needed to extend this argument to pseudodistributions, and hence obtain a rounding algorithm.

Let D be a distribution over Boolean functions f : U → {0, 1} with ‖f‖₂² = µ and ‖Πf‖₂² ≥ 0.99‖f‖₂². The goal is to compute a function t : U → ℝ with ‖Πt‖₄⁴ ≥ γ‖Πt‖₂⁴, given the low-degree moments of D.
Suppose that random-function rounding and coordinate-projection rounding fail to produce a function t with ‖Πt‖₄⁴ ≥ γ‖Πt‖₂⁴. Then

𝔼_ω ‖Πδ_ω‖₂⁴ ≤ O(γ) · (rank Π)²

(from the failure of random-function rounding and Lemmas 4.3 and 4.4). By the failure of coordinate-projection rounding (and using Lemma 4.6 applied to the distribution over f̄), we get that

(𝔼_{f∼D} ‖Πf̄‖₄⁴)² ≤ O(γ) · 𝔼_{f,f′∼D} ⟨f̄, f̄′⟩⁴ · 𝔼_ω ‖Πδ_ω‖₂⁴ .

Combining the two bounds, we get

𝔼_{f,f′∼D} ⟨f̄, f̄′⟩⁴ ≥ Ω(1/(γ rank Π)²) · (𝔼_{f∼D} ‖Πf̄‖₄⁴)² .

Since (by Lemma 4.11) (Πf)(ω)² ≥ Ω(|f̄(ω)|) for every ω ∈ U and every f in the support of D, we have ⟨(Πf)², (Πf′)²⟩ ≥ Ω(⟨f̄, f̄′⟩) for all f, f′ in the support. Thus,

𝔼_{f,f′∼D} ⟨(Πf)², (Πf′)²⟩⁴ ≥ Ω(1/(γ rank Π)²) · (𝔼_{f∼D} ‖Πf̄‖₄⁴)² .

By the reweighing lemma (Lemma 4.10), there exist ω_1, …, ω_4 ∈ U such that the reweighed distribution D′ = D_{ω_1,…,ω_4} satisfies

𝔼_{f,f′∼D′} ⟨(Πf)², (Πf′)²⟩ ≥ (𝔼_{f,f′∼D} ⟨(Πf)², (Πf′)²⟩⁴)^{1/4} ≥ Ω(1/(γ rank Π))^{1/2} · (𝔼_{f∼D} ‖Πf̄‖₄⁴)^{1/2} .

The failure of Gaussian rounding (applied to D′) implies

𝔼_{f,f′∼D′} ⟨(Πf)², (Πf′)²⟩ ≤ O(γ) · (𝔼_{f∼D′} ‖Πf‖₂²)² .

Combining these two bounds, we get

𝔼_{f∼D} ‖Πf̄‖₄⁴ ≤ O(γ³ rank Π) · (𝔼_{f∼D′} ‖Πf‖₂²)⁴ .

By the properties of D and Lemma 4.11, the left-hand side is Ω(µ) and the right-hand side is O(γ³ rank Π · µ⁴). Therefore, we get

γ ≥ Ω( 1/((rank Π)^{1/3} µ) ) .
Extending to pseudodistributions. We now consider the case where D is a pseudodistribution of some large constant level ℓ. (We have not tried to optimize this level at all, though ℓ = 100 should follow easily from the proofs above.) Most of the statements above go through as is, given that the analysis of all the individual steps extends (as noted) to pseudoexpectations. One issue is that the truncation operation used to obtain f̄ is not a low-degree polynomial. While it may be possible to approximate it with such a polynomial, we sidestep
the issue by simply adding f̄ as additional auxiliary variables to our program and enforcing the conclusions of Lemma 4.11 as constraints with which the pseudoexpectation operator must be consistent. This is another example of how we design our relaxation to fit the rounding/combining algorithm, rather than the other way around. With this step, we can replace statements such as "(*) holds for all functions in the support of D" (where (*) is some equality or inequality constraint in the variables f, f̄) with the statement "D is consistent with (*)", and thus complete the proof. □
5 Finding planted sparse vectors
As an application of our work, we show how one can find sparse (or analytically sparse) vectors inside a sufficiently generic subspace. In particular, this improves upon a recent result of Demanet and Hand [DH13], who used the L∞/L1 optimization procedure of Spielman et al. [SWW12] to show that one can recover a µ-sparse vector planted in a random d-dimensional subspace V′ ⊆ ℝ^n when µ ≪ 1/√d. Our result, combined with the bound on the SDP value of the 2 → 4 norm of a random subspace from [BBH+12], implies that if d = O(√n) then we can in fact recover such a vector as long as µ ≪ 1.
Problem: PlantedRecovery(µ, d, |U|, ε)

Input: An arbitrary basis for a linear subspace V = span(V′ ∪ {f₀}), where:

– V′ ⊆ ℝ^U is a random d-dimensional subspace, chosen as the span of d vectors drawn independently from the standard Gaussian distribution on ℝ^U, and

– f₀ is an arbitrary µ-sparse vector, i.e., S = supp(f₀) has |S| ≤ µ|U|.

Goal: Find a vector f ∈ V with ⟨f, f₀⟩² ≥ (1 − ε) ‖f‖² ‖f₀‖².
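A sketch of an instance generator for PlantedRecovery, under the simplifying assumption that f₀ has random Gaussian values on a random support (the problem itself allows f₀ to be adversarial); all names are ours:

```python
import numpy as np

def planted_instance(n, d, mu, rng):
    f0 = np.zeros(n)
    S = rng.choice(n, size=max(1, int(mu * n)), replace=False)
    f0[S] = rng.standard_normal(len(S))                    # a µ-sparse vector
    V = np.column_stack([rng.standard_normal((n, d)), f0])  # span(V' ∪ {f0})
    B, _ = np.linalg.qr(V)               # an (arbitrary) orthonormal basis
    return B, f0
```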
The goal here should be thought of as recovering f₀ to arbitrarily high precision ("exactly"), and thus the running time of an algorithm should be logarithmic in 1/ε. We note that f₀ is not required to be random, and it may be chosen adversarially based on the choice of V′. We will prove the following theorem, which is this section's main result:

Theorem 5.1 (Theorem 1.4, restated). For some absolute constant K > 0, there is an algorithm that solves PlantedRecovery(µ, d, |U|, ε) with high probability in time poly(|U|, log(1/ε)) for any µ < Kµ₀(d), where
µ0(