Set Cover in Sub-linear Time∗
Piotr Indyk† Sepideh Mahabadi† Ronitt Rubinfeld‡ Ali Vakilian† Anak Yodpinyanee†
Abstract
We study the classic set cover problem from the perspective of
sub-linear algorithms. Given access to a collection of m sets over
n elements in the query model, we show that sub-linear algorithms
derived from existing techniques have almost tight query
complexities.
On one hand, we first show an adaptation of the streaming
algorithm presented in [17] to the sub-linear query model, which
returns an α-approximate cover using $\tilde{O}(m(n/k)^{1/(\alpha-1)} + nk)$
queries to the input, where k denotes the value of a minimum set cover.
We then complement this upper bound by proving that for lower values
of k, the required number of queries is $\tilde{\Omega}(m(n/k)^{1/(2\alpha)})$,
even for estimating the optimal cover size. Moreover, we prove that even
checking whether a given collection of sets covers all the elements
would require Ω(nk) queries. These two lower bounds provide strong
evidence that the upper bound is almost tight for certain values of
the parameter k.
On the other hand, we show that this bound is not optimal for
larger values of the parameter k, as there exists a
(1 + ε)-approximation algorithm with $\tilde{O}(mn/(k\varepsilon^2))$
queries. We show that this bound is essentially tight for sufficiently
small constant ε, by establishing a lower bound of $\tilde{\Omega}(mn/k)$
query complexity.
Our lower-bound results follow by carefully designing two
distributions of instances that are hard to distinguish. In
particular, our first lower bound involves a probabilistic
construction of a certain set system with a minimum set cover of
size αk, with the key property that a small number of “almost
uniformly distributed” modifications can reduce the minimum set
cover size down to k. Thus, these modifications are not detectable
unless a large number of queries are asked. We believe that our
probabilistic construction technique might find applications to
lower bounds for other combinatorial optimization problems.
∗This work was supported by NSF grants, including
No. CCF-1650733, CCF-1733808, CCF-1420692, IIS-1741137, and the
Simons Investigator award.
†CSAIL, MIT, {indyk, mahabadi, vakilian, anak}@mit.edu
‡CSAIL, MIT and TAU, [email protected]
1 Introduction
Set Cover is a classic combinatorial optimization problem, in
which we are given a set (universe) of n elements U = {e1, ..., en}
and a collection of m sets F = {S1, ..., Sm}. The goal is to
find a set cover of U, i.e., a collection of sets in F whose union
is U, of minimum size. Set Cover is a well-studied problem with
applications in operations research [16], information retrieval
and data mining [32], learning theory [19], web host
analysis [9], and many others. Recently, this problem and other
related coverage problems have gained a lot of attention in the
context of massive data sets, e.g., the streaming model
[32, 12, 10, 17, 7, 3, 24, 2, 5, 18] or the map reduce model [22, 25, 4].
Although the problem of finding an optimal solution is
NP-complete, a natural greedy algorithm which iteratively picks the
“best” remaining set (the set that covers the largest number of
uncovered elements) is widely used. The algorithm finds a solution
of size at most k ln n, where k is the optimum cover size, and can
be implemented to run in time linear in the input size. However, the
input size itself could be as large as Θ(mn), so for large data sets
even reading the input might be infeasible.
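
A minimal Python sketch of the greedy rule described above, assuming the instance is given as explicit Python sets. The function name `greedy_set_cover` and the toy instance are illustrative and do not come from the paper. The sketch rescans every set in each round, so it favors clarity over the linear-time implementation mentioned above.

```python
def greedy_set_cover(universe, sets):
    """Return indices of the sets picked greedily, or None if no cover exists."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set that covers the largest number of still-uncovered elements.
        best = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        gain = sets[best] & uncovered
        if not gain:
            return None  # some element belongs to no set at all
        chosen.append(best)
        uncovered -= gain
    return chosen


# Toy instance (hypothetical): the optimum cover consists of the first two sets.
universe = range(1, 7)
sets = [{1, 2, 3}, {4, 5, 6}, {1, 4}, {2, 5}, {3, 6}]
cover = greedy_set_cover(universe, sets)
```

Note that every round of this sketch reads all sets in full, which is exactly the cost that the sub-linear algorithms studied in this paper try to avoid.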
This raises a natural question: is it possible to solve minimum
set cover in sub-linear time? This question was previously addressed
in [28, 33], which showed that one can design constant running-time
algorithms by simulating the greedy algorithm, under the assumption
that the sets are of constant size and each element occurs
in a constant number of sets. However, those constant-time
algorithms have a few drawbacks: they only provide a mixed
multiplicative/additive guarantee (the output cover size is
guaranteed to be at most k · ln n + εn), the dependence of their
running times on the maximum set size is exponential, and they only
output the (approximate) minimum set cover size, not the cover
itself. From a different perspective, [20] (building on [15]) showed
that an O(1)-approximate solution to the fractional version of the
problem can be found in $\tilde{O}(mk^2 + nk^2)$ time¹. Combining this
algorithm with randomized rounding yields an O(log n)-approximate
solution to Set Cover with the same complexity.
In this paper we initiate a systematic study of the complexity of
sub-linear time algorithms for set cover with multiplicative
approximation guarantees. Our upper bounds complement the
aforementioned result of [20] by presenting algorithms which are
fast when k is large, as well as algorithms that provide more accurate
solutions (even with a constant-factor approximation guarantee)
that use a sub-linear number of queries². Equally importantly, we
establish nearly matching lower bounds, some of which even hold for
estimating the optimal cover size. Our algorithmic results and lower
bounds are presented in Table 1.
Data access model. As in the prior work [28, 33] on Set Cover,
our algorithms and lower bounds assume that the input can be
accessed via the adjacency-list oracle.³ More precisely, the
algorithm has access to the following two oracles:
1. EltOf: Given a set Si and an index j, the oracle returns the
jth element of Si. If j > |Si|, ⊥ is returned.
2. SetOf: Given an element ei and an index j, the oracle returns
the jth set containing ei. If ei appears in fewer than j sets, ⊥ is
returned.
This is a natural model, providing a “two-way” connection between
the sets and the elements. Furthermore, for some graph problems
modeled by Set Cover (such as Dominating Set or Vertex Cover), such
oracles are essentially equivalent to the aforementioned
incidence-list model studied in sub-linear graph algorithms. We also
note that the other popular access model employing the membership
oracle, where we can query whether an element e is contained in a
set S, is not suitable for Set Cover, as it can be easily seen that
even checking whether a feasible cover exists requires Ω(mn) time.
1.1 Overview of our results. In this paper we present algorithms
and lower bounds for the Set Cover problem. The results are
summarized in Table 1. The NP-hardness of this problem (or even its
o(log n)-approximate version [13, 31, 1, 26, 11]) precludes the
existence of highly accurate algorithms with fast running times,
while (as we show) it is still possible to design algorithms with
sub-linear query complexities and low approximation factors. The
lower bound proofs hold for the running time of any algorithm
approximating set cover under the defined data access model.
¹The method can be further improved to $\tilde{O}(m + nk)$ (N. Young,
personal communication).
²Note that polynomial-time algorithms with sub-logarithmic
approximation factors are unlikely to exist.
³In the context of graph problems, this model is also known
as the incidence-list model, and has been studied extensively;
see, e.g., [8, 14, 6].
Copyright © 2018 by SIAM. Unauthorized reproduction of this
article is prohibited.
We present two algorithms with a sub-linear number of queries.
First, we show that the streaming algorithm presented in [17] can be
adapted so that it returns an O(α)-approximate cover using
$\tilde{O}(m(n/k)^{1/(\alpha-1)} + nk)$ queries, which could be
quadratically smaller than mn. Second, we present a simple algorithm
which is tailored to the case when the value of k is large. This
algorithm computes an O(log n)-approximate cover in $\tilde{O}(mn/k)$
time (not just query complexity). Hence, by combining it with the
algorithm of [20], we get an O(log n)-approximation algorithm that
runs in time $\tilde{O}(m + n\sqrt{m})$.
We complement the first result by proving that for low values of
k, the required number of queries is $\tilde{\Omega}(m(n/k)^{1/(2\alpha)})$
even for estimating the size of the optimal cover. This shows that the
first algorithm is essentially optimal for the values of k where the
first term in the runtime bound dominates. Moreover, we prove that
even the Cover Verification problem, which is checking whether a
given collection of k sets covers all the elements, would require
Ω(nk) queries. This provides strong evidence that the term nk in the
first algorithm is unavoidable. Lastly, we complement the second
algorithm by showing a lower bound of $\tilde{\Omega}(mn/k)$
if the approximation ratio is a small constant.
1.2 Related work. Sublinear algorithms for Set Cover under the
oracle model have been previously studied as an estimation problem;
the goal is only to approximate the size of the minimum set cover
rather than constructing one. Nguyen and Onak [28] consider Set Cover
under the oracle model we employ in this paper, in a specific
setting where both the maximum cardinality of sets in F, and the
maximum number of occurrences of an element over all sets, are
bounded by some constants s and t; this allows algorithms whose time
and query complexities are constant, $(2^{(st)^4}/\varepsilon)^{O(2^s)}$,
containing no dependency on n or m. They provide an algorithm for
estimating the size of the minimum set cover when, unlike our work,
allowing both ln s multiplicative and εn additive errors. Their
result has been subsequently improved to $(st)^{O(s)}/\varepsilon^2$ by
Yoshida et al. [33]. Additionally, the results of Kuhn et al. [21] on
general packing/covering LPs in the distributed LOCAL model, together
with the reduction method of Parnas and Ron [30], imply that
estimating set cover size to within an O(ln s)-multiplicative factor
(with εn additive error) can be performed in
$(st)^{O(\log s \log t)}/\varepsilon^4$ time/query complexities.
Set Cover can also be considered as a generalization of the
Vertex Cover problem. The estimation variant of Vertex Cover under
the adjacency-list oracle model has been studied in [30, 23, 29, 33].
Set Cover has also been studied in the sublinear space context, most
notably for the streaming model of computation
[32, 12, 7, 3, 2, 5, 18, 10, 17]. In this model, there are
algorithms that compute approximate set covers with only
multiplicative errors. Our algorithms use some of the ideas
introduced in the last two papers [10, 17].
Problem            | Approximation | Constraints                       | Query Complexity                                               | Section
Set Cover          | αρ + ε        | α ≥ 2                             | $\tilde{O}(\frac{1}{\varepsilon}(m(n/k)^{1/(\alpha-1)} + nk))$ | 4.2
                   | ρ + ε         | -                                 | $\tilde{O}(mn/(k\varepsilon^2))$                               | 4.3
                   | α             | $k < (n/\log m)^{1/(4\alpha+1)}$  | $\tilde{\Omega}(m(n/k)^{1/(2\alpha)})$                         | A
                   | α             | α ≤ 1.01, $k = O(n/\log m)$       | $\tilde{\Omega}(mn/k)$                                         | 3.2
Cover Verification | -             | k ≤ n/2                           | Ω(nk)                                                          | 5
Table 1: A summary of our algorithms and lower bounds. We use
the following notation: k ≥ 1 denotes the size of the optimum cover;
α ≥ 1 denotes a parameter that determines the trade-off between the
approximation quality and query/time complexities; ρ ≥ 1 denotes the
approximation factor of a “black box” algorithm for set cover used
as a subroutine. We assume that α ≤ log n and m ≥ n.
1.3 Overview of the Algorithms. The algorithmic results presented
in Section 4 use the techniques introduced for the streaming Set
Cover problem by [10, 17] to get new results in the context of
sub-linear time algorithms for this problem. Two components previously
used for the set cover problem in the context of streaming are
Set Sampling and Element Sampling. Assuming the size of the minimum
set cover is k, Set Sampling randomly samples $\tilde{O}(k)$ sets and
adds them to the maintained solution. This ensures that all the
elements that are well represented in the input (i.e., appearing in
at least m/k sets) are covered by the sampled sets. On the other
hand, the Element Sampling technique samples roughly
$\tilde{O}(k/\delta)$ elements, and finds a set cover for the sampled
elements. It can be shown that the cover for the sampled elements
covers a (1 − δ) fraction of the original elements.
Specifically, the first algorithm performs a constant number of
iterations. Each iteration uses element sampling to compute a
“partial” cover, removes the elements covered by the sets selected
so far and recurses on the remaining elements. However, making this
process work in sub-linear time (as opposed to sub-linear
space) requires new technical development. For example, the
algorithm of [17] relies on the ability to test membership for a
set-element pair, which generally cannot be efficiently performed in
our model.
The second algorithm performs only one round of set sampling, and
then identifies the elements that are not covered by the sampled
sets, without performing a full scan of those sets. This is possible
because with high probability only those elements that belong to few
input sets are not covered by the sampled sets. Therefore, we
can efficiently enumerate all pairs (ei, Sj), ei ∈ Sj, for
those elements ei that were not covered by the sampled sets. We then
run a black box algorithm only on the set system induced by those
pairs. This approach lets us avoid the nk term present in the query
and runtime bounds for the first algorithm, which makes the second
algorithm highly efficient for large values of k.
1.4 Overview of the Lower Bounds.
The Set Cover lower bound for smaller optimal value k. We establish
our lower bound for the problem of estimating the size of the minimum
set cover, by constructing two distributions of set systems. All
systems in the same distribution share the same optimal set cover
size, but these sizes differ by a factor α between the two
distributions; thus, the algorithm is required to determine from
which distribution its input set system is drawn, in order to
correctly estimate the optimal cover size. Our distributions are
constructed by a novel use of the probabilistic method. Specifically,
we first probabilistically construct a set system called a median
instance (see Lemma 3.1): this set system has the property that (a)
its minimum set cover size is αk and (b) a small number of changes
to the instance reduces the minimum set cover size to k. We set the
first distribution to be always this median instance. Then, we
construct the second distribution by a random process that performs
the changes (depicted in Figure 1) resulting in a modified
instance. This process distributes the changes almost uniformly
throughout the instance, which implies that the changes are unlikely
to be detected unless the algorithm performs a large number of
queries. We believe that this construction might find applications
to lower bounds for other combinatorial optimization problems.
The Set Cover lower bound for larger optimal value k. Our lower
bound for the problem of computing an approximate set cover
leverages the construction above. We create a combined set system
consisting of multiple modified instances all chosen independently at
random, allowing instances with much larger k. By the properties
of the random process generating modified instances, we observe that
most of these modified instances have different optimal set cover
solutions, and that distinguishing these instances from one another
requires many queries. Thus, it is unlikely for the algorithm to
be able to compute an optimal solution to a large fraction of these
modified instances, and therefore it fails to achieve the desired
approximation factor for the overall combined instance.
The Cover Verification lower bound for a cover of size k. For
Cover Verification, however, we instead give an explicit
construction of the distributions. We first create an underlying set
structure such that initially, the candidate sets contain all but k
elements. Then we may swap in each uncovered element from a
non-candidate set. Our set structure is systematically designed so
that each swap only modifies a small fraction of the answers from
all possible queries; hence, each swap is hard to detect without
Ω(n) queries. The distribution of valid set covers is composed of
instances obtained by swapping in every uncovered element, and that
of non-covers is similarly obtained but leaving one element
uncovered.
2 Preliminaries for the Lower Bounds
First, we formally specify the representation of the set
structures of input instances, which applies to both
Set Cover and Cover Verification.
Our lower bound proofs rely mainly on the construction of
instances that are hard to distinguish by the algorithm. To this
end, we define the swap operation that exchanges a pair of elements
between two sets, and how this is implemented in the actual
representation.
Definition 2.1. (swap operation) Consider two sets S and S′. A
swap on S and S′ is defined over two elements e, e′ such that
e ∈ S \ S′ and e′ ∈ S′ \ S, where S and S′ exchange e and e′.
Formally, after performing swap(e, e′), S = (S ∪ {e′}) \ {e} and
S′ = (S′ ∪ {e}) \ {e′}. As for the representation via
EltOf and SetOf, each application of swap only modifies 2 entries
for each oracle. That is, if previously e = EltOf(S, i),
S = SetOf(e, j), e′ = EltOf(S′, i′), and S′ = SetOf(e′, j′), then
their new values change as follows: e′ = EltOf(S, i),
S′ = SetOf(e, j), e = EltOf(S′, i′), and S = SetOf(e′, j′).
In particular, we extensively use the property that the amount of
changes to the oracle’s answers incurred by each swap is minimal. We
remark that when we perform multiple swaps on multiple disjoint
set-element pairs, every swap modifies distinct entries, and the
swaps do not interfere with one another.
Lastly, we define the notion of query-answer history, which is a
common tool for establishing lower bounds for sub-linear algorithms
under query models.
Definition 2.2. By query-answer history, we denote the sequence
of query-answer pairs 〈(q1, a1), (q2, a2), . . . , (qr, ar)〉
recording the communication between the algorithm and the oracles,
where each new query qi+1 may only depend on the query-answer
pairs (q1, a1), . . . , (qi, ai). In our case, each qi represents
either a SetOf query or an EltOf query made by the algorithm, and
each ai is the oracle’s answer to that respective query according to
the set structure instance.
3 Lower Bounds for the Set Cover Problem
In this section, we present lower bounds for Set Cover both for
small values of the optimal cover size k (in Section 3.1), and for
large values of k (in Section 3.2). For low values of k, we prove
the following theorem, whose proof is postponed to Appendix A.
Theorem 3.1. For $2 \le k \le \left(\frac{n}{16\alpha \log m}\right)^{\frac{1}{4\alpha+1}}$
and $1 < \alpha \le \log n$, any randomized algorithm that solves the
Set Cover problem with approximation factor α and success
probability at least 2/3 requires $\tilde{\Omega}(m(n/k)^{\frac{1}{2\alpha}})$
queries.
Instead, in Section 3.1 we focus on the simple setting of this
theorem which applies to approximation protocols for
distinguishing between instances with minimum set cover sizes 2
and 3, and show a lower bound of $\tilde{\Omega}(mn)$ (which is tight
up to a polylogarithmic factor) for approximation factor 3/2. We make
this simplification both for clarity and because the result for this
case is used in Section 3.2 to establish our lower bound for large
values of k.
High level idea. Our approach for establishing the lower bound is
as follows. First, we construct a median instance I∗ for Set Cover,
whose minimum set cover size is 3. We then apply a randomized
procedure genModifiedInst, which slightly modifies the median
instance into a new instance containing a set cover of size 2.
Applying Yao’s principle, the distribution of the input to the
deterministic algorithm is either I∗ with probability 1/2, or a
modified instance generated through genModifiedInst(I∗), whose
distribution is denoted by D(I∗), again with probability 1/2. Next,
we consider the execution of the deterministic algorithm. We show
that unless the algorithm asks at least $\tilde{\Omega}(mn)$ queries,
the resulting query-answer history generated over I∗ would be the
same as those generated over instances constituting a constant
fraction of D(I∗), reducing the algorithm’s success probability to
below 2/3. More specifically, we will establish the following theorem.
Theorem 3.2. Any algorithm that can distinguish whether the input
instance is I∗ or belongs to D(I∗) with probability of success
greater than 2/3 requires Ω(mn/ log m) queries.
Corollary 3.1. For 1 < α < 3/2 and k ≤ 3, any randomized
algorithm that approximates, within a factor of α, the size of the
optimal cover for the Set Cover problem with success probability at
least 2/3 requires $\tilde{\Omega}(mn)$ queries.
For simplicity, we assume that the algorithm has the knowledge of
our construction (which may only strengthen our lower bounds); this
includes I∗ and D(I∗), along with their representation via EltOf
and SetOf. The objective of the algorithm is simply to distinguish
them. Since we are distinguishing a distribution of instances D(I∗)
against a single instance I∗, we may individually upper bound the
probability that each query-answer pair reveals the modified part of
the instance, then apply the union bound directly. However,
establishing such a bound requires a certain set of properties that
we obtain through a careful design of I∗ and genModifiedInst. We
remark that our approach shows the hardness of distinguishing
instances with different cover sizes. That is, our lower bound
on the query complexity also holds for the problem of
approximating the size of the minimum set cover (without explicitly
finding one).
Lastly, in Section 3.2 we provide a construction utilizing
Theorem 3.2 to extend Corollary 3.1, establishing the following
theorem on lower bounds for larger minimum set cover sizes.
Theorem 3.3. For any sufficiently small approximation factor
α ≤ 1.01 and k = O(m/ log n), any randomized algorithm that computes
an α-approximation to the Set Cover problem with success probability
at least 0.99 requires $\tilde{\Omega}(mn/k)$ queries.
3.1 The Set Cover Lower Bound for Small Optimal Value k
3.1.1 Construction of the Median Instance I∗. Let F be a
collection of m sets such that (independently for each set-element
pair (S, e)) S contains e with probability 1 − p0, where
$p_0 = \sqrt{\frac{9 \log m}{n}}$ (note that since we assume
log m ≤ n/c for large enough c, we can assume that p0 ≤ 1/2).
Equivalently, we may consider the incidence matrix of this instance:
each entry is either 0 (indicating e ∉ S) with probability
p0, or 1 (indicating e ∈ S) otherwise. We write F ∼ I(U, p0)
denoting the collection of sets obtained from this construction.
Definition 3.1. (Median instance) An instance of Set Cover, I, is
a median instance if it satisfies all the following properties.
(a) No two sets cover all the elements. (The size of its minimum
set cover is at least 3.)
(b) For any two sets, the number of elements not covered by the
union of these sets is at most 18 log m.
(c) The intersection of any two sets has size at least n/8.
(d) For any pair of elements e, e′, the number of sets S such that
e ∈ S but e′ ∉ S is at least $\frac{m\sqrt{9 \log m}}{4\sqrt{n}}$.
(e) For any triple of sets S, S1 and S2,
$|(S_1 \cap S_2) \setminus S| \le 6\sqrt{n \log m}$.
(f) For each element, the number of sets that do not contain that
element is at most $6m\sqrt{\frac{\log m}{n}}$.
Lemma 3.1. There exists a median instance I∗ satisfying all
properties from Definition 3.1. In fact, with high probability, an
instance drawn from the distribution in which Pr[e ∈ S] = 1 − p0
independently at random satisfies the median properties.
The proof of the lemma follows from standard applications of
concentration bounds. See the full version of this paper for
detailed proofs.
3.1.2 Distribution D(I∗) of Modified Instances I′ Derived from
I∗. Fix a median instance I∗. We now show that we may perform
O(log m) swap operations on I∗ so that the size of the minimum set
cover in the modified instance becomes 2. Moreover, its incidence
matrix differs from that of I∗ in O(log m) entries. Consequently,
the number of queries to EltOf and SetOf that induce different
answers from those of I∗ is also at most O(log m).
We define D(I∗) as the distribution of instances I′ generated
from a median instance I∗ by genModifiedInst(I∗), given below in
Figure 1, as follows. Assume that I∗ = (U, F). We select two
different sets S1, S2 from F uniformly at random; we aim to turn
these two sets into a set cover. To do so, we swap out some of
the elements in S2 and bring in the uncovered elements. For each
uncovered element e, we pick an element e′ ∈ S2 that is also covered
by S1. Next, consider the candidate sets whose element e we may
exchange with e′ ∈ S2:
Definition 3.2. (Candidate set) For any pair of elements e, e′,
the candidate sets of (e, e′) are all sets that contain e but not e′.
The collection of candidate sets of (e, e′) is denoted by
Candidate(e, e′). Note that Candidate(e, e′) ≠ Candidate(e′, e)
(in fact, these two collections are disjoint).
genModifiedInst(I∗ = (U, F)):
    M ← ∅
    pick two different sets S1, S2 from F uniformly at random
    for each e ∈ U \ (S1 ∪ S2) do
        pick e′ ∈ (S1 ∩ S2) \ M uniformly at random
        M ← M ∪ {e′}
        pick a random set S in Candidate(e, e′)
        swap(e, e′) between S, S2
Figure 1: The procedure of constructing a modified instance of I∗.
We choose a random set S from Candidate(e, e′), and swap e ∈ S
with e′ ∈ S2 so that S2 now contains e. We repeatedly apply this
process for all initially uncovered e so that eventually S1 and S2
form a set cover. We show that the proposed algorithm,
genModifiedInst, can indeed be executed without getting stuck.
Lemma 3.2. The procedure genModifiedInst is well-defined under
the precondition that the input instance I∗ is a median instance.
Proof. To carry out the algorithm, we must ensure that the number
of the initially uncovered elements is at most that of the elements
covered by both S1 and S2. This follows from the properties of
median instances (Definition 3.1): |U \ (S1 ∪ S2)| ≤ 18 log m by
property (b), and the size of the intersection of S1 and S2 is
at least n/8 by property (c). That is, in our construction there
are sufficiently many possible choices for e′ to be matched and
swapped with each uncovered element e. Moreover, by property (d)
there are plenty of candidate sets S for performing swap(e, e′) with
S2.
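
A Python sketch of the procedure in Figure 1, operating directly on Python sets rather than on the EltOf/SetOf tables (a simplification made for this illustration). The names `candidate` and `gen_modified_inst` are hypothetical, and the five-element toy instance is hand-made so that every random choice succeeds; it is not a median instance. On inputs that violate the precondition of Lemma 3.2, the sketch simply raises an `IndexError` when no partner e′ or no candidate set exists.

```python
import random


def candidate(sets, e, e2):
    """Indices of the sets that contain e but not e2."""
    return [i for i, s in enumerate(sets) if e in s and e2 not in s]


def gen_modified_inst(universe, sets, rng):
    """Return (modified sets, (i1, i2)) such that sets i1 and i2 cover the universe."""
    sets = [set(s) for s in sets]                 # work on a copy of the instance
    i1, i2 = rng.sample(range(len(sets)), 2)      # two different sets, uniformly at random
    s1, s2 = sets[i1], sets[i2]
    matched = set()                               # the set M of Figure 1
    for e in sorted(set(universe) - (s1 | s2)):   # initially uncovered elements
        # rng.choice raises IndexError if the instance offers no valid choice.
        e2 = rng.choice(sorted((s1 & s2) - matched))
        matched.add(e2)
        s = sets[rng.choice(candidate(sets, e, e2))]
        # swap(e, e2) between S and S2: S gives up e, S2 gives up e2.
        s.remove(e)
        s.add(e2)
        s2.remove(e2)
        s2.add(e)
    return sets, (i1, i2)


# Toy instance (hypothetical): only the pair of the first two sets leaves element 0 uncovered.
universe = range(5)
toy = [{1, 2, 3}, {2, 3, 4}, {0, 1, 3, 4}, {0, 1, 2, 4}]
modified, (i1, i2) = gen_modified_inst(universe, toy, random.Random(0))
```

Each swap preserves the sizes of both sets involved, so the modified instance has the same set sizes and element degrees as the input, which is consistent with each swap overwriting only existing oracle entries.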
3.1.3 Bounding the Probability of Modification. Let D(I∗)
denote the distribution of instances generated by
genModifiedInst(I∗). If an algorithm were to distinguish between I∗
and I′ ∼ D(I∗), it must find some cell in the EltOf or SetOf tables
that would have been modified by genModifiedInst, to confirm
that genModifiedInst is indeed executed; otherwise it would make
wrong decisions half of the time. We will show an additional
property of this distribution: none of the entries of EltOf and
SetOf are significantly more likely to be modified during the
execution of genModifiedInst. Consequently, no algorithm may
strategically detect the difference between I∗ and I′ with
the desired probability, unless the number of queries is
asymptotically the reciprocal of the maximum probability
of modification among all cells.
Define $P_{\text{Elt-Set}} : \mathcal{U} \times \mathcal{F} \to [0, 1]$
as the probability that an element is swapped by a set. More
precisely, for an element e ∈ U and a set S ∈ F, if e ∉ S in the
median instance I∗, then $P_{\text{Elt-Set}}(e, S) = 0$; otherwise,
it is equal to the probability that S swaps e. We note that these
probabilities are taken over I′ ∼ D(I∗) where I∗ is a fixed median
instance. That is, as per Figure 1, they correspond to the random
choices of S1, S2, the random matching M between U \ (S1 ∪ S2) and
S1 ∩ S2, and their random choices of choosing each candidate
set S. We bound the values of $P_{\text{Elt-Set}}$ via the following
lemma.
Lemma 3.3. For any e ∈ U and S ∈ F,
$P_{\text{Elt-Set}}(e, S) \le \frac{4800 \log m}{mn}$, where the
probability is taken over I′ ∼ D(I∗).
Proof. Let S1, S2 denote the first two sets picked (uniformly
at random) from F to construct a modified instance of I∗. For each
element e and a set S such that e ∈ S in the basic instance I∗,
$$P_{\text{Elt-Set}}(e, S) = \Pr[S = S_2] \cdot \Pr[e \in S_1 \cap S_2] \cdot \Pr[e \text{ matches to } \mathcal{U} \setminus (S_1 \cup S_2) \mid e \in S_1 \cap S_2]$$
$$\qquad + \Pr[S \notin \{S_1, S_2\}] \cdot \Pr[e \in S \setminus (S_1 \cup S_2) \mid e \in S] \cdot \Pr[S \text{ swaps } e \text{ with } S_2 \mid e \in S \setminus (S_1 \cup S_2)],$$
where all probabilities are taken over I′ ∼ D(I∗). Next
we bound each of the above six terms. Since we choose the sets
S1, S2 randomly, Pr[S = S2] = 1/m. We bound the second term by 1.
For the third term, since we pick a matching uniformly at random
among all possible (maximum) matchings between U \ (S1 ∪ S2) and
S1 ∩ S2, by symmetry, the probability that a certain element
e ∈ S1 ∩ S2 is in the matching is (by properties (b)
and (c) of median instances)
$$\frac{|\mathcal{U} \setminus (S_1 \cup S_2)|}{|S_1 \cap S_2|} \le \frac{18 \log m}{n/8} = \frac{144 \log m}{n}.$$
We bound the fourth term by 1. To compute the fifth term, let $d_e$
denote the number of sets in F that do not contain e. By property
(f) of median instances, the probability that e ∈ S is in
S \ (S1 ∪ S2) given that S ∉ {S1, S2} is at most
$$\frac{d_e(d_e - 1)}{(m-1)(m-2)} \le \frac{36 m^2 \cdot \frac{\log m}{n}}{m^2/2} = \frac{72 \log m}{n}.$$
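Finally, for the last term, note that by symmetry, each pair of
matched elements ee′ is picked by genModifiedInst equiprobably.
Thus, for any e ∈ S \ (S1 ∪ S2), the probability that each element
e′ ∈ S1 ∩ S2 is matched to e is $\frac{1}{|S_1 \cap S_2|}$. By
properties (c)–(e) of median instances, the last term is at most
$$\sum_{e' \in (S_1 \cap S_2) \setminus S} \Pr[ee' \in M] \cdot \frac{1}{|\mathrm{Candidate}(e, e')|}
= |(S_1 \cap S_2) \setminus S| \cdot \frac{1}{|S_1 \cap S_2|} \cdot \frac{1}{|\mathrm{Candidate}(e, e')|}$$
$$\le 6\sqrt{n \log m} \cdot \frac{1}{n/8} \cdot \frac{1}{\frac{m\sqrt{9 \log m}}{4\sqrt{n}}} = \frac{64}{m}.$$
Therefore,
$$P_{\text{Elt-Set}}(e, S) \le \frac{1}{m} \cdot 1 \cdot \frac{144 \log m}{n} + 1 \cdot \frac{72 \log m}{n} \cdot \frac{64}{m} \le \frac{4800 \log m}{mn}.$$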
3.1.4 Proof of Theorem 3.2. Now we consider a median instance I∗,
and its corresponding family of modified sets D(I∗). To prove the
promised lower bound for randomized protocols distinguishing I∗ and
I′ ∼ D(I∗), we apply Yao’s principle and instead show that no
deterministic algorithm A may determine whether the input is I∗ or
I′ ∼ D(I∗) with success probability at least 2/3 using
$r = o\left(\frac{mn}{\log m}\right)$ queries. Recall that
if A’s query-answer history 〈(q1, a1), . . . , (qr, ar)〉 when
executed on I′ is the same as that of I∗, then A must
unavoidably return a wrong decision for the probability mass
corresponding to I′. We bound the probability of this event as
follows.
Lemma 3.4. Let Q be the set of queries made by A on I∗. Let
I′ ∼ D(I∗) where I∗ is a given median instance. Then the probability
that A returns different outputs on I∗ and I′ is at most
$\frac{4800 \log m}{mn}|Q|$.
Proof of Theorem 3.2. If A does not output correctly on I∗, the
probability of success of A is less than 1/2; thus, we can assume
that A returns the correct answer on I∗. This implies that A returns
an incorrect solution on the fraction of I′ ∼ D(I∗) for which
A(I∗) = A(I′). Now recall that the distribution in which we apply
Yao’s principle consists of I∗ with probability 1/2, and an instance
drawn at random from D(I∗) also with probability
1/2. Then over this distribution, by Lemma 3.4,
$$\Pr[A \text{ succeeds}] \le 1 - \frac{1}{2}\Pr_{I' \sim D(I^*)}[A(I^*) = A(I')]
\le 1 - \frac{1}{2}\left(1 - \frac{4800 \log m}{mn}|Q|\right)
= \frac{1}{2} + \frac{2400 \log m}{mn}|Q|.$$
Thus, if the number of queries made by A is less than
$\frac{mn}{14400 \log m}$, then the probability that A returns the
correct answer over the input distribution is less than
2/3 and the proof is complete.
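
The constants in Lemma 3.3 and in the proof above can be checked with exact arithmetic. The short Python sketch below works only with the numeric coefficients of (log m)/(mn); the variable names are illustrative.

```python
from fractions import Fraction

# Coefficients of (log m)/(mn) in the two branches of the bound on P_Elt-Set(e, S).
branch_s_is_s2 = Fraction(1) * 1 * 144         # (1/m) * 1 * (144 log m / n)
branch_s_is_candidate = Fraction(1) * 72 * 64  # 1 * (72 log m / n) * (64 / m)
per_cell = branch_s_is_s2 + branch_s_is_candidate

# With |Q| = mn / (14400 log m), the bound 1/2 + (2400 log m / (mn)) |Q| evaluates to:
success_bound = Fraction(1, 2) + Fraction(2400, 14400)
```

The per-cell coefficient 144 + 72 · 64 = 4752 is rounded up to 4800 in Lemma 3.3, and any query count strictly below mn/(14400 log m) keeps the success probability strictly below 2/3.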
3.2 The Set Cover Lower Bound for Large Optimal Value k. Our
construction of the median instance I∗ and its associated
distribution D(I∗) of modified instances also leads to the lower
bound of $\tilde{\Omega}(mn/k)$ for the problem of computing an
approximate solution to Set Cover. This lower bound matches the
performance of our algorithm for large optimal value k and
shows that it is tight for some range of values of k, although it
only applies to sufficiently small approximation factor
α ≤ 1.01.
Proof overview. We construct a distribution over compounds: a
compound is a Set Cover instance that consists of t = Θ(k) smaller
instances I1, . . . , It, where each of these t instances is either
the median instance I∗ or a random modified instance drawn from
D(I∗). By our construction, a large majority of our distribution
is composed of compounds that contain at least 0.2t modified
instances Ii such that any deterministic algorithm A must fail to
distinguish Ii from I∗ when it is only allowed to make a small
number of queries. A deterministic A can safely cover these modified
instances with three sets, incurring a cost (sub-optimality) of 0.2t.
Still, A may choose to cover such an Ii with two sets to
reduce its cost, but it then must err on a different compound
where Ii is replaced with I∗. We track down the trade-off between
the amount of cost that A saves on these compounds by covering these
Ii’s with two sets, and the amount of error on other compounds its
scheme incurs. A is allowed a small probability δ to make errors,
which we then use to upper-bound the expected cost that A may save,
and conclude that A still incurs an expected cost of 0.1t overall.
We apply Yao’s principle (for algorithms with errors) to obtain that
randomized algorithms also incur an expected cost of 0.05t, on
compounds with optimal solution size k ∈ [2t, 3t], yielding the
impossibility result for computing solutions with approximation
factor $\alpha = \frac{k + 0.1t}{k} > 1.01$ when given insufficient
queries.
Copyright © 2018 by SIAM. Unauthorized reproduction of this
article is prohibited.
3.2.1 Overall Lower Bound Argument
Compounds. Consider the median instance I∗ and its associated
distribution D(I∗) of modified instances for Set Cover with n
elements and m sets, and let t = Θ(k) be a positive integer
parameter. We define a compound I = I(I_1, I_2, . . . , I_t) as a set
structure instance consisting of t median or modified instances
I_1, I_2, . . . , I_t, forming a set structure (U^t, F^t) of n′ := nt
elements and m′ := mt sets, in such a way that each instance I_i
occupies separate elements and sets. Since the optimal solution size
of each instance I_i is 3 if I_i = I∗, and 2 if I_i is any modified
instance, the optimal solution size for the compound is 2t plus the
number of occurrences of the median instance; this optimal objective
value is always Θ(k).
Random distribution over compounds. Employing Yao’s principle, we
construct a distribution D of compounds I(I_1, I_2, . . . , I_t): it
will be applied against any deterministic algorithm A for computing
an approximate minimum set cover, which is allowed to err on at most
a δ-fraction of the compounds from the distribution (for some small
constant δ > 0). For each i ∈ [t], we pick I_i = I∗ with probability
c/(m choose 2), where c > 2 is a sufficiently large constant.
Otherwise, we draw a random modified instance I_i ∼ D(I∗). We aim to
show that, in expectation over D, A must output a solution of size
Θ(t) more than the optimal set cover size of the given instance
I ∼ D.
A frequently leaves many modified instances undetected. Consider an
instance I containing at least 0.95t modified instances. These
instances constitute at least a 0.99-fraction of D: the expected
number of occurrences of the median instance in each compound is
only c/(m choose 2) · t = O(t/m²), so by Markov’s inequality, the
probability that there are more than 0.05t median instances is at
most O(1/m²) < 0.01 for large m. We make use of the following lemma,
whose proof is deferred to Section 3.2.2. In what follows, we say
that the algorithm “distinguishes” or “detects the difference”
between I_i and I∗ if it makes a query that induces different
answers, and thus may deduce that one of I_i or I∗ cannot be the
input instance. In particular, if I_i = I∗, then detecting the
difference between them is impossible.
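The Markov step above can be made concrete with a few lines of arithmetic. The sketch below is only an illustration: the function name and the sample values of m and c are assumptions, and t cancels out of the bound.

```python
from math import comb

def median_heavy_bound(m: int, c: float, threshold: float = 0.05) -> float:
    """Markov bound on Pr[#median instances >= threshold * t].

    Each of the t components is the median instance with probability
    c / C(m, 2), so the expected count is c * t / C(m, 2); dividing by
    threshold * t gives a bound that does not depend on t.
    """
    p_median = c / comb(m, 2)
    return p_median / threshold

bound = median_heavy_bound(m=10_000, c=3000.0)
print(bound)  # well below 0.01 for this choice of m and c
```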
Lemma 3.5. Fix M ⊆ [t] and consider the distribution over
compounds I(I_1, . . . , I_t) with I_i ∼ D(I∗) for i ∈ M and
I_i = I∗ for i ∉ M. If A makes at most o(mnt/log m) queries to I,
then it may detect the differences between I∗ and at least 0.75t of
the modified instances {I_i}_{i∈M} with probability at most 0.01.
We apply this lemma for any |M| ≥ 0.95t (although the statement
holds for any M, even vacuously for |M| < 0.75t). Thus, for a
0.99 · 0.99 > 0.98 fraction of D, A fails to identify, for at least
0.95t − 0.75t = 0.2t modified instances I_i in I, whether I_i is a
median instance or a modified instance. Observe that the
query-answer history of A on such I would not change if we were to
replace any combination of these 0.2t modified instances by copies
of I∗. Consequently, if the algorithm were to correctly cover I by
using two sets for some of these I_i, it must unavoidably err
(return a non-cover) on the compound where these I_i’s are replaced
by copies of the median instance.
Charging argument. We call a compound I tough if A does not err
on I, and A fails to detect at least 0.2t modified instances; denote
by D_tough the conditional distribution of D restricted to tough
instances. For tough I, let cost(I) denote the number of modified
instances I_i that the algorithm decides to cover with three sets.
That is, for each tough compound I, cost(I) measures how far the
solution returned by A is from the optimal set cover size. Then,
there are at least 0.2t − cost(I) modified instances I_i that A
chooses to cover with only two sets despite not being able to verify
whether I_i = I∗ or not. Let R_I denote the set of the indices of
these modified instances, so |R_I| ≥ 0.2t − cost(I). By doing so, A
then errs on the replaced compound r(I, R_I), denoting the compound
identical to I except that each modified instance I_i for i ∈ R_I is
replaced by I∗. In this event, we say that the tough compound I
charges the replaced compound r(I, R_I) via R_I. Recall that the
total error of A is δ: this quantity upper-bounds the total
probability mass of charged instances, which we will then manipulate
to obtain a lower bound on E_{I∼D}[cost(I)].
Instances must share optimal solutions for R to charge the same
replaced instance. Observe that many tough instances may charge the
same replaced instance: we must handle these duplicates. First,
consider two tough instances I^1 ≠ I^2 charging the same
I^r = r(I^1, R) = r(I^2, R) via the same R = R_{I^1} = R_{I^2}. As
I^1 ≠ I^2 but r(I^1, R) = r(I^2, R), these tough instances differ on
some modified instances with indices in R. Nonetheless, the
query-answer histories of A operating on I^1 and I^2 must be the
same, as their instances in R are both indistinguishable from I∗ by
the deterministic A. Since A does not err on tough instances (by
definition), both tough I^1 and I^2 must share the same optimal set
cover on every instance in R. Consequently, for each fixed R, only
tough instances that have the same optimal solution for modified
instances in R may charge the same replaced instance via R.
Charged instance is much heavier than charging instances
combined. By our construction of I(I_1, . . . , I_t) drawn from D,
Pr[I_i = I∗] = c/(m choose 2) for the median instance. On the other
hand,
\[
\sum_{j=1}^{\ell} \Pr[I_i = I^j]
\le \Bigl(1 - \frac{c}{\binom{m}{2}}\Bigr) \cdot \frac{1}{\binom{m}{2}}
< \frac{1}{\binom{m}{2}}
\]
for modified instances I^1, . . . , I^ℓ sharing the same optimal set
cover, because they are all modified instances constructed to have
the two sets chosen by genModifiedInst as their optimal set cover:
each pair of sets is chosen uniformly with probability
1/(m choose 2). Thus, the probability that I∗ is chosen is more than
c times the total probability that any I^j is chosen. Generalizing
this observation, we consider tough instances I^1, I^2, . . . , I^ℓ
charging the same I^r via R, and bound the difference in
probabilities that I^r and any I^j are drawn. For each index in R,
it is more than c times more likely for D to draw the median
instance than any modified instance with a fixed optimal solution.
Then, for the replaced compound I^r on which A errs,
\[
p(I^r) \ge c^{|R|} \cdot \sum_{j=1}^{\ell} p(I^j)
\]
(where p denotes the probability mass in D, not in D_tough). In
other words, the probability mass of the replaced instance charged
via R is always at least c^{|R|} times the total probability mass of
the charging tough instances.
Bounding the expected cost using δ. In our charging argument by
tough instances above, we only bound the amount of charges on the
replaced instances via a fixed R. As there are up to 2^t choices for
R, we scale down the total amount charged to a replaced instance by
a factor of 2^t, so that Σ_{tough I} c^{|R_I|} p(I)/2^t lower-bounds
the total probability mass of the replaced instances on which A
errs.
Let us first focus on the conditional distribution D_tough
restricted to tough instances. Recall that at least a
(0.98 − δ)-fraction of the compounds in D are tough: A fails to
detect differences between 0.2t modified instances and the median
instance with probability 0.98, and among these compounds, A may err
on at most a δ-fraction. So in the conditional distribution D_tough
over tough instances, the individual probability mass is scaled up
to p_tough(I) ≤ p(I)/(0.98 − δ). Thus,
\[
\sum_{\text{tough } I} \frac{c^{|R_I|}\, p(I)}{2^t}
\ge \sum_{\text{tough } I} \frac{c^{|R_I|}\,(0.98 - \delta)\, p_{\text{tough}}(I)}{2^t}
= \frac{(0.98 - \delta)\, E_{I \sim D_{\text{tough}}}\bigl[c^{|R_I|}\bigr]}{2^t}.
\]
As the probability mass above cannot exceed the total allowed
error δ, we have
\[
\frac{\delta}{0.98 - \delta} \cdot 2^t
\ge E_{I \sim D_{\text{tough}}}\bigl[c^{|R_I|}\bigr]
\ge E_{I \sim D_{\text{tough}}}\bigl[c^{0.2t - \mathrm{cost}(I)}\bigr]
\ge c^{\,0.2t - E_{I \sim D_{\text{tough}}}[\mathrm{cost}(I)]},
\]
where Jensen’s inequality is applied in the last step above. So,
\[
E_{I \sim D_{\text{tough}}}[\mathrm{cost}(I)]
\ge 0.2t - \frac{t + \log\frac{\delta}{0.98 - \delta}}{\log c}
= \Bigl(0.2 - \frac{1}{\log c}\Bigr) t - \frac{\log\frac{\delta}{0.98 - \delta}}{\log c}
\ge 0.11t,
\]
for sufficiently large c (and m) when choosing δ = 0.02.
We now return to the expected cost over the entire distribution D.
For simplicity, define cost(I) = 0 for any non-tough I. This yields
E_{I∼D}[cost(I)] ≥ (0.98 − δ) E_{I∼D_tough}[cost(I)]
≥ (0.98 − δ) · 0.11t ≥ 0.1t, establishing the expected cost of any
deterministic A with probability of error at most 0.02 over D.
Establishing the lower bound for randomized algorithms. Lastly,
we apply Yao’s principle⁴ to obtain that, for any randomized
algorithm with error probability δ/2 = 0.01, its expected cost under
the worst input is at least (1/2) · 0.1t = 0.05t. Recall now that
our cost here lower-bounds the sub-optimality of the computed set
cover (that is, the algorithm uses at least cost more sets to cover
the elements than the optimal solution does). Since our input
instances have optimal solution size k ∈ [2t, 3t] and the randomized
algorithm returns a solution with cost at least 0.05t in
expectation, it achieves an approximation factor of no better than
α = (k + 0.05t)/k > 1.01 with o(mnt/log m) queries. Theorem 3.3 then
follows, noting the substitution of our problem size:
\[
\frac{mnt}{\log m} = \frac{(m'/t)(n'/t)\,t}{\log(m'/t)}
= \Theta\Bigl(\frac{m' n'}{k' \log m'}\Bigr).
\]
3.2.2 Proof of Lemma 3.5. First, we recall the following result
from Lemma 3.4 for distinguishing between I∗ and a random
I′ ∼ D(I∗).
Corollary 3.2. Let q be the number of queries made by A on
I_i ∼ D(I∗) over n elements and m sets, where I∗ is a median
instance. Then the probability that A detects a difference between
I_i and I∗ in one of its queries is at most 4800 q log m/(mn).
Marbles and urns. Fix a compound I(I_1, . . . , I_t). Let
s := mn/(4800 log m), and then consider the following, entirely
different, scenario. Suppose that we have t urns, where each urn
contains s marbles. In the ith urn, in case I_i is a modified
instance, we put in this urn one red marble and s − 1 white marbles;
otherwise, if I_i = I∗, we put in s white marbles. Observe that the
probability of obtaining a red marble by drawing q marbles without
replacement from a single urn that contains one is exactly q/s (for
q ≤ s). Now, we will relate the probability of drawing red marbles
to the probability of successfully distinguishing instances. We
emphasize that we are only comparing the probabilities of events for
the sake of analysis, and we do not imply or suggest any direct
analogy between the events themselves.
⁴ Here we use the Monte Carlo version where the algorithm may err,
and use cost instead of the time complexity as our measure of
performance. See, e.g., Proposition 2.6 in [27] and the description
therein.
Corollary 3.2 above bounds the probability that the algorithm
successfully distinguishes a modified instance I_i from I∗ by
4800 q log m/(mn) = q/s. Then, the probability of distinguishing
between I_i and I∗ using q queries is bounded from above by the
probability of obtaining a red marble after drawing q marbles from
an urn. Consequently, the probability that the algorithm
distinguishes 3t/4 instances is bounded from above by the
probability of drawing the red marbles from at least 3t/4 urns.
Hence, to prove that the event of Lemma 3.5 occurs with probability
at most 0.01, it is sufficient to upper-bound the probability that
an algorithm obtains 3t/4 red marbles by 0.01.
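The claim that q draws without replacement find the single red marble with probability exactly q/s can be verified with exact rational arithmetic. The sketch below is a small self-check; the function name is illustrative.

```python
from fractions import Fraction
from math import comb

def prob_red_drawn(s: int, q: int) -> Fraction:
    """Exact probability that q draws without replacement, from an urn with
    one red and s - 1 white marbles, include the red marble."""
    if not 0 <= q <= s:
        raise ValueError("q must satisfy 0 <= q <= s")
    # Complement: all q drawn marbles come from the s - 1 white ones.
    return 1 - Fraction(comb(s - 1, q), comb(s, q))

s = 50
probabilities = [prob_red_drawn(s, q) for q in range(s + 1)]
print(probabilities[:4])
```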
Consider an instance of t urns; for each urn i ∈ [t]
corresponding to a modified instance I_i, exactly one of its s
marbles is red. An algorithm may draw marbles from each urn, one by
one without replacement, for potentially up to s times. By the
principle of deferred decisions, the red marble is equally likely to
appear in any of these s draws, independent of the events for other
urns. Thus, we can create a tuple of t random variables
T = (T_1, . . . , T_t) such that for each i ∈ [t], T_i is chosen
uniformly at random from {1, . . . , s}. The variable T_i represents
the number of draws required to obtain the red marble in the ith
urn; that is, only the T_i-th draw from the ith urn finds the red
marble from that urn. In case I_i is a median instance, we simply
set T_i = s + 1, indicating that the algorithm never detects any
difference, as I_i and I∗ are the same instance.
We now show the following two lemmas in order to bound the number
of red marbles the algorithm may encounter throughout its
execution.
Lemma 3.6. Let b > 3 be a fixed constant and define
T_high = {i | T_i ≥ s/b}. If t ≥ 14b, then |T_high| ≥ (1 − 2/b)t
with probability at least 0.99.
Proof. Let T_low = {1, . . . , t} \ T_high. Notice that for the ith
urn, Pr[i ∈ T_low] < 1/b independently of other urns, and thus
|T_low| is stochastically dominated by B(t, 1/b), the binomial
distribution with t trials and success probability 1/b. Applying the
Chernoff bound, we obtain
\[
\Pr\Bigl[|T_{\text{low}}| \ge \frac{2t}{b}\Bigr] \le e^{-\frac{t}{3b}} < 0.01.
\]
Hence, |T_high| ≥ t − 2t/b = (1 − 2/b)t with probability at
least 0.99, as desired.
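For a concrete feel of the Chernoff step, the sketch below compares the exact upper tail of B(t, 1/b) at 2t/b with the bound e^{-t/(3b)} at the boundary case t = 14b. The helper name and the choice b = 4 are illustrative assumptions.

```python
from math import ceil, comb, exp

def binomial_upper_tail(t: int, p: float, threshold: float) -> float:
    """Exact Pr[B(t, p) >= threshold] by summing the binomial mass function."""
    start = ceil(threshold)
    return sum(comb(t, j) * p ** j * (1.0 - p) ** (t - j)
               for j in range(start, t + 1))

b = 4
t = 14 * b                              # smallest t allowed by Lemma 3.6
exact_tail = binomial_upper_tail(t, 1.0 / b, 2.0 * t / b)
chernoff = exp(-t / (3.0 * b))          # e^{-14/3}
print(exact_tail, chernoff)
```

The exact tail is far smaller than the Chernoff estimate here, which is expected since the bound is not tight.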
Lemma 3.7. If the total number of draws made by the algorithm is
less than (1 − 3/b) · st/b, then with probability at least 0.99, the
algorithm will not obtain red marbles from at least t/b urns.
Proof. If the total number of such draws is less than
(1 − 3/b) · st/b, then the number of draws from at least 3t/b urns
is less than s/b each. Assume the condition of Lemma 3.6: for at
least (1 − 2/b)t urns, T_i ≥ s/b. That is, the algorithm will not
encounter a red marble if it makes fewer than s/b draws from such an
urn. Then, there are at least 3t/b − 2t/b = t/b urns with
T_i ≥ s/b from which the algorithm makes fewer than s/b draws, and
thus does not obtain a red marble. Overall, this event holds with
probability at least 0.99 due to Lemma 3.6.
We substitute b = 4 and assume sufficiently large t. If the
deterministic algorithm makes fewer than (1 − 3/4) · st/4 = st/16
queries, then for a 0.99 fraction of all possible tuples T, there
are t/4 instances I_i whose differences from I∗ the algorithm fails
to detect: the probability of this event is lower-bounded by that of
the event where the red marbles from the corresponding urns i are
not drawn. Therefore, the probability that the algorithm makes
queries that detect differences between I∗ and more than 3t/4
instances I_i is bounded by 0.01, concluding our proof of Lemma 3.5.
4 Sub-Linear Algorithms for the Set Cover Problem
In this paper, we present two different approximation algorithms
for Set Cover with sub-linear query complexity in the oracle model:
smallSetCover and largeSetCover. Both of our algorithms rely on
techniques from recent developments on Set Cover in the streaming
model. However, adopting those techniques in the oracle model
requires novel insights and technical development.
Throughout the description of our algorithms, we assume that we
have access to a black box subroutine that, given the full Set Cover
instance (where all members of all sets are revealed), returns a
ρ-approximate solution⁵.
⁵ The approximation factor ρ may take on any value between 1 and
Θ(log n) depending on the computational model one assumes.
The first algorithm (smallSetCover) returns an (αρ + ε)-approximate
solution of the Set Cover instance using
Õ((1/ε)(m(n/k)^{1/(α−1)} + nk)) queries, while the second algorithm
(largeSetCover) achieves an approximation factor of (ρ + ε) using
Õ(mn/(kε²)) queries, where k is the size of the minimum set cover.
These algorithms can be combined so that the number of queries of
the algorithm becomes asymptotically the minimum of the two:
Theorem 4.1. There exists a randomized algorithm for Set Cover in
the oracle model that w.h.p.⁶ computes an O(ρ log n)-approximate
solution and uses
Õ(min{m(n/k)^{1/log n} + nk, mn/k}) = Õ(m + n√m) queries.
4.1 Preliminaries. Our algorithms use the following two sampling
techniques developed for Set Cover in the streaming model [10]:
Element Sampling and Set Sampling. The first technique, Element
Sampling, states that in order to find a (1 − δ)-cover of U w.h.p.,
it suffices to solve Set Cover on a subset of elements of size
Õ(ρk log m/δ) picked uniformly at random. It shows that we may
restrict our attention to a subproblem with a much smaller number of
elements, and our solution to the reduced instance will still cover
a good fraction of the elements in the original instance. The next
technique, Set Sampling, shows that if we pick ℓ sets uniformly at
random from F and add them to the solution, then w.h.p. each element
that is not covered by any of the picked sets occurs in only Õ(m/ℓ)
sets in F; that is, we are left with a much sparser subproblem to
solve. The formal statements of these sampling techniques are as
follows. See [10] for the proofs.
Lemma 4.1. (Element Sampling) Consider an instance of Set Cover
on (U, F) whose optimal cover has size at most k. Let U_smp be a
subset of U of size Θ(ρk log m/δ) chosen uniformly at random, and
let C_smp ⊆ F be a ρ-approximate cover for U_smp. Then, w.h.p.
C_smp covers at least (1 − δ)|U| elements.
Lemma 4.2. (Set Sampling) Consider an instance (U, F) of Set
Cover. Let F_rnd be a collection of ℓ sets picked uniformly at
random. Then, w.h.p. F_rnd covers all elements that appear in
Ω(m log n/ℓ) sets of F.
4.2 The smallSetCover Algorithm. The algorithm of this section
is a modified variant of the streaming algorithm for Set Cover in
[17] that works in the sub-linear query model. Similarly to the
algorithm of [17], our algorithm smallSetCover considers different
guesses of the value of an optimal solution (ε^{−1} log n guesses)
and performs the core iterative algorithm iterSetCover for all of
them in parallel. For each guess ℓ of the size of an optimal
solution, iterSetCover goes through α iterations and, by applying
Element Sampling, guarantees that w.h.p. at the end of each
iteration, the number of uncovered elements shrinks by a factor of
n^{1/α}. Hence, after α iterations all elements will be covered.
Furthermore, since the number of sets picked in each iteration is at
most ρℓ, the final solution has at most αρℓ sets, where ρ is the
performance of the offline block algOfflineSC that iterSetCover uses
to solve the reduced instances constructed by Element Sampling.
⁶ An algorithm succeeds with high probability (w.h.p.) if its
failure probability can be decreased to n^{−c} for any constant
c > 0 without affecting its asymptotic performance, where n denotes
the input size.
Although our general approach in iterSetCover is similar to the
iterative core of the streaming algorithm of Set Cover, there are
challenges that we need to overcome so that it works efficiently in
the query model. Firstly, the approach of [17] relies on the ability
to test membership for a set-element pair when executing its set
filtering subroutine: given a subset S, the algorithm of [17] needs
to compute the intersection size |S ∩ S′| for each set S′ ∈ F, which
cannot be implemented efficiently in the query model (in the worst
case, it requires m|S| queries). Instead, here we employ Set
Sampling, which w.h.p. guarantees that the number of sets that
contain a (yet uncovered) element is small.
The next challenge is achieving the m(n/k)^{1/(α−1)} + nk query
bound for computing an α-approximate solution. As mentioned earlier,
both our approach and the algorithm of [17] need to run the
algorithm in parallel for different guesses ℓ of the size of an
optimal solution. However, since iterSetCover performs
m(n/ℓ)^{1/(α−1)} + nℓ queries, if smallSetCover invokes iterSetCover
with guesses in increasing order then the query complexity becomes
mn^{1/(α−1)} + nk; on the other hand, if it invokes iterSetCover
with guesses in decreasing order then the query complexity becomes
m(n/k)^{1/(α−1)} + mn. To solve this issue, smallSetCover proceeds
in two stages: in the first stage, it finds a (log n)-estimate of k
by invoking iterSetCover using m + nk queries (assuming guesses are
evaluated in increasing order), and then in the second stage it only
invokes iterSetCover with approximation factor α in the smaller
O(log n)-approximate region around the (log n)-estimate of k
computed in the first stage. Thus, in our implementation, besides
the desired approximation factor, iterSetCover receives an upper
bound and a lower bound on the size of an optimal solution.
Now, we provide a detailed description of iterSetCover. It
receives α, ε, l and u as its arguments, and it is guaranteed that
the size of an optimal cover of the input instance, k, is in [l, u].
Note that the algorithm does not know the value of k, and the
sampling techniques described in Section 4.1 rely on k. Therefore,
the algorithm needs to find a (1 + ε) estimate⁷ of k, denoted ℓ.
This can be done by trying all powers of (1 + ε) in [l, u]. The
parameter α denotes the trade-off between the query complexity and
the approximation guarantee that the algorithm achieves. Moreover,
we assume that the algorithm has access to a ρ-approximate black box
solver of Set Cover.
⁷ The exact estimate that the algorithm works with is a
(1 + ε/(2ρα)) estimate.
iterSetCover first performs Set Sampling to cover all elements
that occur in Ω̃(m/ℓ) sets. Then it goes through α − 2 iterations,
and in each iteration it performs Element Sampling with parameter
δ = Õ((ℓ/n)^{1/(α−1)}). By Lemma 4.1, after (α − 2) iterations,
w.h.p. only ℓ(n/ℓ)^{1/(α−1)} elements remain uncovered, for which
the algorithm finds a cover by invoking the offline set cover
solver. The parameters are set so that all (α − 1) instances that
are required to be solved by the offline set cover solver (the
(α − 2) instances constructed by Element Sampling and the final
instance) are of size Õ(m(n/ℓ)^{1/(α−1)}).
In the rest of this section, we show that smallSetCover w.h.p.
returns an almost (ρα)-approximate solution of Set Cover(U, F) with
query complexity Õ(m(n/k)^{1/(α−1)} + nk), where k is the size of a
minimum set cover.
Theorem 4.2. The smallSetCover algorithm outputs an
(αρ + ε)-approximate solution of Set Cover(U, F) using
Õ((1/ε)(m(n/k)^{1/(α−1)} + nk)) queries w.h.p., where k is the size
of an optimal solution of (U, F).
To analyze the performance of smallSetCover, first we need to
analyze the procedures invoked by smallSetCover: iterSetCover and
algOfflineSC. The procedure algOfflineSC(S, ℓ) receives as input a
subset of elements S and an estimate of the size of an optimal cover
of S using sets in F. The algOfflineSC algorithm first determines
all occurrences of S in F. Then it invokes a black box subroutine
that returns a cover of size at most ρℓ (if there exists a cover of
size ℓ for S) for the reduced Set Cover instance over S.
Moreover, we assume that all subroutines have access to the EltOf
and SetOf oracles, |U| and |F|.
Lemma 4.3. Suppose that each e ∈ S appears in Õ(m/ℓ) sets of F,
and assume that there exists a collection of ℓ sets in F that covers
S. Then algOfflineSC(S, ℓ) returns a cover of S of size at most ρℓ
using Õ(m|S|/ℓ) queries.
Proof. Since each element of S is contained in Õ(m/ℓ) sets in F,
the information required to solve the reduced instance on S can be
obtained with Õ(m|S|/ℓ) queries (i.e., Õ(m/ℓ) SetOf queries per
element in S).
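The query accounting in this proof can be mirrored in a short sketch. Everything below is an illustrative assumption rather than the paper's implementation: the `SetOfOracle` class models SetOf(e, j) as returning the index of the jth set containing e (or `None` past the end of the list), and a greedy heuristic stands in for the black box ρ-approximate solver.

```python
class SetOfOracle:
    """Hypothetical SetOf oracle that counts the queries made to it."""

    def __init__(self, universe, sets):
        self.lists = {e: [i for i, s in enumerate(sets) if e in s]
                      for e in universe}
        self.queries = 0

    def set_of(self, e, j):
        self.queries += 1
        containing = self.lists[e]
        return containing[j] if j < len(containing) else None

def greedy_cover(elements, family):
    """Greedy stand-in for the black box solver; family maps index -> set."""
    uncovered, chosen = set(elements), []
    while uncovered:
        best = max(family, key=lambda i: len(family[i] & uncovered), default=None)
        if best is None or not family[best] & uncovered:
            return None                      # some element cannot be covered
        chosen.append(best)
        uncovered -= family[best]
    return chosen

def alg_offline_sc(S, ell, rho, oracle):
    """Learn the projection of F onto S via SetOf queries, then solve offline."""
    projected = {}
    for e in S:
        j = 0
        while (idx := oracle.set_of(e, j)) is not None:
            projected.setdefault(idx, set()).add(e)
            j += 1
    D = greedy_cover(S, projected)
    return D if D is not None and len(D) <= rho * ell else None

universe = set(range(6))
sets = [{0, 1, 2}, {3, 4}, {5}, {0, 3, 5}, {1, 4}]
oracle = SetOfOracle(universe, sets)
S = {0, 1, 2, 3, 4}
D = alg_offline_sc(S, ell=2, rho=2, oracle=oracle)
print(D, oracle.queries)
```

Each element costs one query per set containing it plus one terminating query, so the total is the sum of the element frequencies plus |S|, in line with the Õ(m|S|/ℓ) count when every frequency is Õ(m/ℓ).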
Lemma 4.4. The cover sol_ℓ constructed by the outer loop of
iterSetCover(α, ε, l, u) with the parameter ℓ ≥ k w.h.p. covers U.
iterSetCover(α, ε, l, u):
  ▷ Try all (1 + ε/(2αρ))-approximate guesses of k
  for ℓ ∈ {(1 + ε/(2αρ))^i | log_{1+ε/(2αρ)} l ≤ i ≤ log_{1+ε/(2αρ)} u} do in order:
    sol_ℓ ← collection of ℓ sets picked uniformly at random    ▷ Set Sampling
    U_rem ← U \ ⋃_{r ∈ sol_ℓ} r    ▷ nℓ EltOf
    repeat (α − 2) times
      S ← sample of U_rem of size Õ(ρℓ(n/ℓ)^{1/(α−1)})
      D ← algOfflineSC(S, ℓ)
      if D = null then
        break    ▷ Try the next value of ℓ
      sol_ℓ ← sol_ℓ ∪ D
      U_rem ← U_rem \ ⋃_{r ∈ D} r    ▷ ρnℓ EltOf
    if |U_rem| ≤ ℓ(n/ℓ)^{1/(α−1)}    ▷ Feasibility Test
      D ← algOfflineSC(U_rem, ℓ)
      if D ≠ null then
        sol_ℓ ← sol_ℓ ∪ D
        return sol_ℓ
Figure 2: iterSetCover is the main procedure of the smallSetCover
algorithm for the Set Cover problem.
Proof. After picking ℓ sets uniformly at random, by Set Sampling
(Lemma 4.2), w.h.p. each element that is not covered by the sampled
sets appears in Õ(m/ℓ) sets of F. Next, by Element Sampling
(Lemma 4.1 with δ = (ℓ/n)^{1/(α−1)}), at the end of each inner
iteration, w.h.p. the number of uncovered elements decreases by a
factor of (ℓ/n)^{1/(α−1)}. Thus after at most (α − 2) iterations,
w.h.p. fewer than ℓ(n/ℓ)^{1/(α−1)} elements remain uncovered.
Finally, algOfflineSC is invoked on the remaining elements; hence,
sol_ℓ w.h.p. covers U.
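The control flow of iterSetCover can be sketched on an in-memory instance as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: it works on explicit Python sets instead of EltOf/SetOf oracles (so it does not count queries), uses a greedy heuristic as the black box solver with ρ taken as ln n + 1, rounds guesses up to integers, and drops the polylogarithmic factor from the sample size. All function and parameter names are hypothetical.

```python
import math
import random

def greedy_cover(elements, sets):
    """Greedy stand-in for the black box solver: cover of `elements`, or None."""
    uncovered, chosen = set(elements), []
    while uncovered:
        best = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[best] & uncovered:
            return None                      # some element lies in no set
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

def alg_offline_sc(S, ell, rho, sets):
    D = greedy_cover(S, sets)
    return D if D is not None and len(D) <= rho * ell else None

def iter_set_cover(universe, sets, alpha, eps, lo, hi, rho, rng):
    n, m = len(universe), len(sets)
    step = 1.0 + eps / (2.0 * alpha * rho)
    guess = float(lo)
    while guess <= hi:                                    # guesses in increasing order
        ell = math.ceil(guess)
        target = ell * (n / ell) ** (1.0 / (alpha - 1))
        sol = set(rng.sample(range(m), min(ell, m)))      # Set Sampling
        rem = set(universe).difference(*(sets[i] for i in sol))
        feasible = True
        for _ in range(alpha - 2):
            size = min(len(rem), math.ceil(rho * target)) # Element Sampling
            S = set(rng.sample(sorted(rem), size))
            D = alg_offline_sc(S, ell, rho, sets)
            if D is None:
                feasible = False                          # try the next guess
                break
            sol.update(D)
            rem.difference_update(*(sets[i] for i in D))
        if feasible and len(rem) <= target:               # Feasibility Test
            D = alg_offline_sc(rem, ell, rho, sets)
            if D is not None:
                return sorted(sol | set(D))
        guess *= step
    return None

rng = random.Random(0)
universe = set(range(60))
sets = [set(range(0, 20)), set(range(20, 40)), set(range(40, 60))]
sets += [set(rng.sample(range(60), 5)) for _ in range(20)]
rho = math.log(len(universe)) + 1.0       # assumed guarantee of the greedy stand-in
cover = iter_set_cover(universe, sets, alpha=3, eps=0.5, lo=1, hi=60, rho=rho, rng=rng)
print(cover)
```

Whatever the random choices, a returned solution is always a valid cover, because the last call to the offline solver covers exactly the elements that the earlier picks left uncovered.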
Next we analyze the query complexity and the approximation
guarantee of iterSetCover. As we only apply Element Sampling and Set
Sampling polynomially many times, all invocations of the
corresponding lemmas during an execution of the algorithm must
succeed w.h.p., so we assume their high probability guarantees for
the proofs in the rest of this section.
Lemma 4.5. Given that l ≤ k ≤ u/(1 + ε/(2αρ)), w.h.p.
iterSetCover(α, ε, l, u) finds a (ρα + ε)-approximate solution of
the input instance using Õ((1/ε)(m(n/l)^{1/(α−1)} + nk)) queries.
algOfflineSC(S, ℓ):
  F_S ← ∅
  for each element e ∈ S do
    F_e ← the collection of sets containing e
    F_S ← F_S ∪ F_e
  D ← solution of size at most ρℓ for Set Cover on (S, F_S)
      constructed by the black box solver
  ▷ If there exists no such cover, then D = null
  return D
Figure 3: algOfflineSC(S, ℓ) invokes a black box that returns a
cover of size at most ρℓ (if there exists a cover of size ℓ for S)
for the Set Cover instance that is the projection of F over S.
Proof. Let ℓ_k = (1 + ε/(2αρ))^{⌈log_{1+ε/(2αρ)} k⌉} be the
smallest power of 1 + ε/(2αρ) greater than or equal to k. Note that
it is guaranteed that ℓ_k ∈ [l, u]. By Lemma 4.4, iterSetCover
terminates with a guess value ℓ ≤ ℓ_k. In the following we compute
the query complexity of the run of iterSetCover with a parameter
ℓ ≤ ℓ_k.
The Set Sampling component picks ℓ sets and then updates the set
of elements that are not covered by those sets, U_rem, using O(nℓ)
EltOf queries. Next, in each iteration of the inner loop, the
algorithm samples a subset S of size Õ(ℓ(n/ℓ)^{1/(α−1)}) from U_rem.
Recall that, by Set Sampling (Lemma 4.2), each e ∈ S ⊂ U_rem appears
in at most Õ(m/ℓ) sets. Hence, algOfflineSC returns a cover D of
size at most ρℓ using Õ(m(n/ℓ)^{1/(α−1)}) SetOf queries (Lemma 4.3).
By the guarantee of Element Sampling (Lemma 4.1), the number of
elements in U_rem that are not covered by D is at most
(ℓ/n)^{1/(α−1)} |U_rem|. Finally, at the end of each inner loop, the
algorithm updates the set of uncovered elements U_rem by using Õ(nℓ)
EltOf queries. The Feasibility Test, which is passed w.h.p. for
ℓ ≤ ℓ_k, ensures that the final run of algOfflineSC performs
Õ(m(n/ℓ)^{1/(α−1)}) SetOf queries. Hence, the total number of
queries performed in each iteration of the outer loop of
iterSetCover with parameter ℓ ≤ ℓ_k is Õ(m(n/ℓ)^{1/(α−1)} + nℓ).
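By Lemma 4.4, if ℓ_k ≤ u, then the outer loop of iterSetCover is
executed for l ≤ ℓ ≤ ℓ_k before it terminates. Thus, the total
number of queries made by iterSetCover is:
\[
\sum_{i=\lceil \log_{1+\frac{\varepsilon}{2\alpha\rho}} l \rceil}^{\log_{1+\frac{\varepsilon}{2\alpha\rho}} \ell_k}
\tilde{O}\Bigl(m\Bigl(\frac{n}{(1+\frac{\varepsilon}{2\alpha\rho})^i}\Bigr)^{\frac{1}{\alpha-1}}
+ n\Bigl(1+\frac{\varepsilon}{2\alpha\rho}\Bigr)^i\Bigr)
= \tilde{O}\Bigl(m\Bigl(\frac{n}{l}\Bigr)^{\frac{1}{\alpha-1}}
\Bigl(\log_{1+\frac{\varepsilon}{2\alpha\rho}} \frac{\ell_k}{l}\Bigr)
+ \frac{n\ell_k}{\varepsilon/(\rho\alpha)}\Bigr)
= \tilde{O}\Bigl(\frac{1}{\varepsilon}\Bigl(m\Bigl(\frac{n}{l}\Bigr)^{\frac{1}{\alpha-1}} + nk\Bigr)\Bigr).
\]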
Now, we show that the number of sets returned by iterSetCover is
not more than (αρ + ε)k. Set Sampling picks ℓ sets and each run of
algOfflineSC returns at most ρℓ sets. Since ρ ≥ 1 and
ℓ_k < (1 + ε/(2αρ))k, the size of the solution returned by
iterSetCover is at most (1 + (α − 1)ρ)ℓ_k < (αρ + ε)k.
Next, we prove the main theorem of the section.
smallSetCover(α, ε):
  sol ← iterSetCover(log n, 1, 1, n)
  k′ ← |sol|    ▷ Find a ρ log n estimate of k.
  return iterSetCover(α, ε, ⌊k′/(ρ log n)⌋, ⌈k′(1 + ε/(2αρ))⌉)
Figure 4: The description of the smallSetCover algorithm.
Proof of Theorem 4.2. The algorithm smallSetCover first finds a
(ρ log n)-approximate solution of Set Cover(U, F), sol, with
Õ(m + nk) queries by calling iterSetCover(log n, 1, 1, n). Having
that k ≤ k′ = |sol| ≤ (ρ log n)k, the algorithm calls iterSetCover
with α as the approximation factor and
[⌊k′/(ρ log n)⌋, ⌈k′(1 + ε/(2αρ))⌉] as the range containing k. By
Lemma 4.5, the second call to iterSetCover in smallSetCover returns
an (αρ + ε)-approximate solution of Set Cover(U, F) using the
following number of queries:
\[
\tilde{O}\Bigl(\frac{1}{\varepsilon}\Bigl(m\Bigl(\frac{n}{k/(\rho \log n)}\Bigr)^{\frac{1}{\alpha-1}} + nk\Bigr)\Bigr)
= \tilde{O}\Bigl(\frac{1}{\varepsilon}\Bigl(m\Bigl(\frac{n}{k}\Bigr)^{\frac{1}{\alpha-1}} + nk\Bigr)\Bigr).
\]
4.3 The largeSetCover Algorithm. The second algorithm,
largeSetCover, works strictly better than smallSetCover for large
values of k (k ≥ √m). The advantage of largeSetCover is that it does
not need to update the set of uncovered elements at any point and
thus avoids the additive nk term in the query complexity bound; the
result of Section 5 suggests that the nk term may be unavoidable if
one wishes to maintain the uncovered elements. Note that the
guarantee of largeSetCover is that at the end of the algorithm,
w.h.p. the ground set U is covered.
The algorithm largeSetCover, given in Figure 5, first randomly
picks εℓ/3 sets. By Set Sampling (Lemma 4.2), w.h.p. every element
that occurs in Ω̃(m/(εℓ)) sets of F will be covered by the picked
sets. It then solves the Set Cover instance over the elements that
occur in Õ(m/(εℓ)) sets of F by an offline solver of Set Cover using
Õ(m/(εℓ)) queries; note that this set of elements may include some
already covered elements. In order to get the promised query
complexity, largeSetCover enumerates the guesses ℓ of the size of an
optimal set cover in decreasing order. The algorithm returns
feasible solutions for ℓ ≥ k, and once it cannot find a feasible
solution for ℓ, it returns the solution constructed for the previous
guess of k, i.e., ℓ(1 + ε/(3ρ)). Since largeSetCover performs Set
Sampling for Õ(ε^{−1}) iterations, w.h.p. the total query complexity
of largeSetCover is Õ(mn/(kε²)). Note that testing whether the
number of occurrences of an element is Õ(m/(εℓ)) only requires a
single query, namely SetOf(e, cm log n/(εℓ)).
largeSetCover(ε):
  ▷ Try all (1 + ε/(3ρ))-approximate guesses of k
  for ℓ ∈ {(1 + ε/(3ρ))^i | 0 ≤ i ≤ log_{1+ε/(3ρ)} n} do in the decreasing order:
    rnd_ℓ ← collection of εℓ/3 sets picked uniformly at random    ▷ Set Sampling
    F_rare ← ∅    ▷ intersection with rare elements
    for e ∈ U do
      if e appears in
-
fractions, and fail to reach the desired probability
ofsuccess.
5.1 Underlying Set Structure. Our instance contains n sets and
n elements (so m = n), where the first k sets form F_k, the
candidate for the set cover we wish to verify. We first consider the
incidence matrix representation, such that the rows represent the
sets and the columns represent the elements. We focus on the first
n/k elements, and consider a slab, composed of n/k columns of the
incidence matrix. We define a basic slab as the structure
illustrated in Figure 6 (for n = 12 and k = 3), where the cell
(i, j) is white if e_j ∈ S_i, and is gray otherwise. The rows are
divided into blocks of size k, where the first block, the query
block, contains the rows whose sets we wish to check for coverage;
notice that only the last element is not covered. More specifically,
in a basic slab, the query block contains sets S_1, . . . , S_k,
each of which is equal to {e_1, . . . , e_{n/k−1}}. The subsequent
rows form the swapper blocks, each consisting of k sets. The rth
swapper block consists of sets S_{rk+1}, . . . , S_{(r+1)k}, each of
which is equal to {e_1, . . . , e_{n/k}} \ {e_r}.
[Figure 6 image: two copies of a basic slab with rows S1 to S12,
grouped into the query block and swapper blocks 1 to 3, and columns
e1 to e4, shown before and after a swap.]
Figure 6: A basic slab and an example of a swapping operation.
We perform one swap in this slab. Consider a parameter $(x, y)$ representing the index of a white cell within the query block. We exchange the color of this white cell with the gray cell on the same row, and similarly exchange the same pair of cells on swapper block $y$. An example is given in Figure 6; the dashed blue rectangle corresponds to the indices parameterizing possible swaps, and the red squares mark the modified cells. This modification corresponds to a single swap operation; in this example, choosing the index $(3, 2)$ swaps $(e_2, e_4)$ between $S_3$ and $S_9$. Observe that there are $k \times (n/k - 1) = n - k$ possible swaps on a single slab, and any single swap allows the query sets to cover all $n/k$ elements.

Figure 7: An example structure of a Yes instance; all elements are covered by the first 3 sets. (Three slabs over columns e1 to e12 and rows S1 to S12.)
Lastly, we may create the full instance by placing all $k$ slabs together, as shown in Figure 7, shifting the elements' indices as necessary. The structure of our sets may be specified solely by the swaps made on these slabs. We define the structure of our instances as follows.
• For a Yes instance, we make one random swap on each slab. This allows the first $k$ sets to cover all elements.
• For a No instance, we make one random swap on each slab except for exactly one of them. In that slab, the last element is not covered by any of the first $k$ sets.
Now, to properly define an instance, we must describe our structure via EltOf and SetOf. We first create a temporary instance consisting of $k$ basic slabs, where none of the cells are swapped. Create EltOf and SetOf lists by sorting each list in increasing order of indices. Each instance from the above construction can then be obtained by applying up to $k$ swaps on this temporary instance.
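Since an instance is specified solely by its swaps, the Yes and No instances can be sketched by sampling one swap per slab. The following is a minimal sketch under assumptions: an instance is represented as a list of $k$ entries, each either a swap $(x, y)$ or `None` for an unswapped slab, only the first $k$ sets are materialized, and $k \ge 2$ so that a swapped column stays covered by another query row. The function names are hypothetical.

```python
import random


def sample_swaps(n, k, yes, rng):
    """One uniformly random swap (x, y) per slab; for a No instance, one
    uniformly random slab is left unswapped (None)."""
    swaps = [(rng.randrange(1, k + 1), rng.randrange(1, n // k))
             for _ in range(k)]
    if not yes:
        swaps[rng.randrange(k)] = None
    return swaps


def query_sets(n, k, swaps):
    """The first k sets of the instance described by `swaps`, over 1..n."""
    w = n // k
    sets = [set() for _ in range(k)]
    for slab, swap in enumerate(swaps):
        offset = slab * w                    # shift element indices per slab
        for x in range(1, k + 1):
            cols = set(range(1, w))          # basic slab: last column missing
            if swap is not None and swap[0] == x:
                cols.remove(swap[1])         # the swapped row gives up e_y ...
                cols.add(w)                  # ... and gains the last element
            sets[x - 1] |= {offset + c for c in cols}
    return sets


def covers_all(n, k, swaps):
    """True iff the first k sets cover every element 1..n."""
    return set().union(*query_sets(n, k, swaps)) == set(range(1, n + 1))
```

Under these assumptions `covers_all` returns true exactly when every slab carries a swap, which is the Yes/No distinction used in the lower bound.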
5.2 Proof of Theorem 5.1. Observe that according to our instance construction, the algorithm may verify, with a single query, whether a certain swap occurs in a certain slab. Namely, it is sufficient to query an entry of EltOf or SetOf that would have been modified by that swap, and check whether it is actually modified or not. For simplicity, we assume that the algorithm has the knowledge of our construction. Further, without loss of generality, the algorithm does not make multiple queries about the same swap, or make a query that does not correspond to any swap.
We employ Yao's principle as follows: to prove a lower bound for randomized algorithms, we show a lower bound for any deterministic algorithm on a fixed distribution of input instances. Let $s = n - k$ be the number of possible swaps in each slab; assume $s = \Theta(n)$. We define our distribution of instances as follows: each of the $s^k$ possible Yes instances occurs with probability $1/(2s^k)$, and each of the $ks^{k-1}$ possible No instances occurs with probability $1/(2ks^{k-1})$. Equivalently, we create a random Yes instance by making one swap on each basic slab. Then we make a coin flip: with probability 1/2 we pick a random slab and undo the swap on that slab to obtain a No instance; otherwise we leave it as a Yes instance. For the sake of contradiction, assume there exists a deterministic algorithm that solves the Cover Verification problem over this distribution of instances with $r = o(sk)$ queries.
Consider the Yes instances portion of the distribution, and observe that we may alternatively interpret the random process generating them as follows. For each slab, one of its $s$ possible swaps is chosen uniformly at random. This condition again follows the scenario considered in Section 3.2: we are given $k$ urns (slabs), each consisting of $s$ marbles (possible swap locations), and aim to draw the red marble (swapped entry) from a large fraction of these urns. Following the proof of Lemmas 3.6-3.7, we obtain that if the total number of queries made by the algorithm is less than $(1 - \frac{3}{b})\frac{sk}{b}$, then with probability at least 0.99, the algorithm will not see any swaps from at least $\frac{k}{b}$ slabs.
Then, consider the corresponding No instances obtained by undoing the swap in one of the slabs of the Yes instance. Suppose that the deterministic algorithm makes fewer than $(1 - \frac{3}{b})\frac{sk}{b}$ queries. Then for a 0.99 fraction of all possible tuples $T$, the output on the Yes instance is the same as the output on a $\frac{1}{b}$ fraction of the No instances, namely when the slab containing no swap is one of the $\frac{k}{b}$ slabs in which the algorithm has not detected a swap in the corresponding Yes instance; the algorithm must answer incorrectly on half of the corresponding weight in our distribution of input instances. Thus the probability of success for any algorithm with fewer than $(1 - \frac{3}{b})\frac{sk}{b}$ queries is at most
$$1 - \Pr\left[|T_{\mathrm{high}}| \ge \left(1 - \tfrac{2}{b}\right)k\right] \cdot \frac{1}{b} \cdot \frac{1}{2} \le 1 - \frac{0.495}{b} < 0.9,$$
for a sufficiently small constant $b > 3$ (e.g. $b = 4$). As $s = \Theta(n)$ and by Yao's principle, this implies the lower bound of $\Omega(nk)$ for the Cover Verification problem.
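As a worked check of the constants, take $b = 4$ as suggested above. The factor 0.99 is the probability bound quoted from Lemmas 3.6-3.7; the arithmetic below is only an illustration of how the two thresholds evaluate.

```latex
\begin{align*}
\left(1 - \tfrac{3}{b}\right)\frac{sk}{b}\,\Big|_{b=4}
  &= \frac{1}{4}\cdot\frac{sk}{4} = \frac{sk}{16}, \\
1 - 0.99\cdot\frac{1}{b}\cdot\frac{1}{2}\,\Big|_{b=4}
  &= 1 - \frac{0.495}{4} = 0.87625 < 0.9 .
\end{align*}
```

So an algorithm making fewer than $sk/16 = \Theta(nk)$ queries succeeds with probability below 0.9 on this distribution. Note that the final inequality $1 - 0.495/b < 0.9$ needs $b < 4.95$, which is why the constant must be small.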
Acknowledgement
We would like to thank Jonathan Ullman for helpful discussions.
References
[1] N. Alon, D. Moshkovitz, and S. Safra. Algorithmic construction of sets for k-restrictions. ACM Trans. Algo., 2(2):153–177, 2006.
[2] S. Assadi. Tight space-approximation tradeoff for the multi-pass streaming set cover problem. In Proc. 36th ACM Sympos. on Principles of Database Systems (PODS), pages 321–335, 2017.
[3] S. Assadi, S. Khanna, and Y. Li. Tight bounds for single-pass streaming complexity of the set cover problem. In Proc. 48th Annu. ACM Sympos. Theory Comput. (STOC), pages 698–711, 2016.
[4] M. Bateni, H. Esfandiari, and V. S. Mirrokni. Distributed coverage maximization via sketching. CoRR, abs/1612.02327, 2016.
[5] M. Bateni, H. Esfandiari, and V. S. Mirrokni. Almost optimal streaming algorithms for coverage problems. Proc. 29th ACM Sympos. Parallel Alg. Arch. (SPAA), 2017.
[6] S. Bhattacharya, M. Henzinger, D. Nanongkai, and C. Tsourakakis. Space- and time-efficient algorithm for maintaining dense subgraphs on one-pass dynamic streams. In Proc. 47th Annu. ACM Sympos. Theory Comput. (STOC), pages 173–182, 2015.
[7] A. Chakrabarti and A. Wirth. Incidence geometries and the pass complexity of semi-streaming set cover. In Proc. 27th ACM-SIAM Sympos. Discrete Algs. (SODA), pages 1365–1373, 2016.
[8] B. Chazelle, R. Rubinfeld, and L. Trevisan. Approximating the minimum spanning tree weight in sublinear time. SIAM Journal on Computing, 34(6):1370–1379, 2005.
[9] F. Chierichetti, R. Kumar, and A. Tomkins. Max-cover in map-reduce. In Proc. 19th Int. Conf. World Wide Web (WWW), pages 231–240, 2010.
[10] E. D. Demaine, P. Indyk, S. Mahabadi, and A. Vakilian. On streaming and communication complexity of the set cover problem. In Proc. 28th Int. Symp. Dist. Comp. (DISC), volume 8784 of Lect. Notes in Comp. Sci., pages 484–498, 2014.
[11] I. Dinur and D. Steurer. Analytical approach to parallel repetition. In Proc. 46th Annu. ACM Sympos. Theory Comput. (STOC), pages 624–633, 2014.
[12] Y. Emek and A. Rosén. Semi-streaming set cover. In Proc. 41st Int. Colloq. Automata Lang. Prog. (ICALP), volume 8572 of Lect. Notes in Comp. Sci., pages 453–464, 2014.
[13] U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM (JACM), 45(4):634–652, 1998.
[14] A. Goel, M. Kapralov, and S. Khanna. Perfect matchings in O(n log n) time in regular bipartite graphs. SIAM Journal on Computing, 42(3):1392–1404, 2013.
[15] M. D. Grigoriadis and L. G. Khachiyan. A sublinear-time randomized approximation algorithm for matrix games. Operations Research Letters, 18(2):53–58, 1995.
[16] T. Grossman and A. Wool. Computational experience with approximation algorithms for the set covering problem. Euro. J. Oper. Res., 101(1):81–92, 1997.
[17] S. Har-Peled, P. Indyk, S. Mahabadi, and A. Vakilian. Towards tight bounds for the streaming set cover problem. In Proc. 35th ACM Sympos. on Principles of Database Systems (PODS), 2016.
[18] P. Indyk, S. Mahabadi, R. Rubinfeld, J. Ullman, A. Vakilian, and A. Yodpinyanee. Fractional set cover in the streaming model. Approximation, Randomization, and Combinatorial Optimization (APPROX/RANDOM), pages 198–217, 2017.
[19] M. J. Kearns and U. V. Vazirani. An introduction to computational learning theory. MIT press, 1994.
[20] C. Koufogiannakis and N. E. Young. A nearly linear-time PTAS for explicit fractional packing and covering linear programs. Algorithmica, 70(4):648–674, 2014.
[21] F. Kuhn, T. Moscibroda, and R. Wattenhofer. The price of being near-sighted. In Proc. 17th ACM-SIAM Sympos. Discrete Algs. (SODA), 2006.
[22] R. Kumar, B. Moseley, S. Vassilvitskii, and A. Vattani. Fast greedy algorithms in MapReduce and streaming. In Proc. 25th ACM Sympos. Parallel Alg. Arch. (SPAA), pages 1–10, 2013.
[23] S. Marko and D. Ron. Distance approximation in bounded-degree and general sparse graphs. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 475–486. Springer, 2006.
[24] A. McGregor and H. T. Vu. Better streaming algorithms for the maximum coverage problem. In 20th International Conference on Database Theory, ICDT 2017, March 21-24, 2017, Venice, Italy, pages 22:1–22:18, 2017.
[25] V. S. Mirrokni and M. Zadimoghaddam. Randomized composable core-sets for distributed submodular maximization. In Proc. 47th Annu. ACM Sympos. Theory Comput. (STOC), pages 153–162, 2015.
[26] D. Moshkovitz. The projection games conjecture and the NP-hardness of ln n-approximating set-cover. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 276–287. Springer, 2012.
[27] R. Motwani and P. Raghavan. Randomized algorithms. Chapman & Hall/CRC, 2010.
[28] H. N. Nguyen and K. Onak. Constant-time approximation algorithms via local improvements. In Proc. 49th Annu. IEEE Sympos. Found. Comput. Sci. (FOCS), pages 327–336. IEEE, 2008.
[29] K. Onak, D. Ron, M. Rosen, and R. Rubinfeld. A near-optimal sublinear-time algorithm for approximating the minimum vertex cover size. In Proc. 23rd ACM-SIAM Sympos. Discrete Algs. (SODA), pages 1123–1131, 2012.
[30] M. Parnas and D. Ron. Approximating the minimum vertex cover in sublinear time and a connection to distributed algorithms. Theoretical Computer Science, 381(1):183–196, 2007.
[31] R. Raz and S. Safra. A sub-constant error-probability low-degree test, and a sub-constant error-probability PCP characterization of NP. In Proc. 29th Annu. ACM Sympos. Theory Comput. (STOC), 1997.
[32] B. Saha and L. Getoor. On maximum coverage in the streaming model & application to multi-topic blog-watch. In Proc. SIAM Int. Conf. Data Mining (SDM), pages 697–708, 2009.
[33] Y. Yoshida, M. Yamamoto, and H. Ito. Improved constant-time approximation algorithms for maximum matchings and other optimization problems. SIAM Journal on Computing, 41(4):1074–1093, 2012.
A Generalized Lower Bounds for the Set Cover Problem
In this section we generalize the approach of Section 3 and prove our main lower bound result (Theorem 3.1) for the number of queries required for approximating with factor $\alpha$ the size of an optimal solution to the Set Cover problem, where the input instance contains $m$ sets, $n$ elements, and a minimum set cover of size $k$. The structure of our proof is largely the same as the simplified case, but the definitions and the details of our analysis will be more complicated. The size of the minimum set cover of the median instance will instead be at least $\alpha k + 1$, and genModifiedInst reduces this down to $k$. We now aim to prove the following statement which implies the lower bound in Theorem 3.1.
Theorem A.1. Let $k$ be the size of an optimal solution of $I^*$ such that $1 < \alpha \le \log n$ and $2 \le k \le \left(\frac{n}{16\alpha \log m}\right)^{\frac{1}{4\alpha+1}}$. Any algorithm that distinguishes whether the input instance is $I^*$ or belongs to $\mathcal{D}(I^*)$ with probability of success at least 2/3 requires $\tilde{\Omega}\left(m\left(\frac{n}{k}\right)^{1/(2\alpha)}\right)$ queries.
A.1 Construction of the Median Instance $I^*$. Let $\mathcal{F}$ be a collection of $m$ sets such that independently for each set-element pair $(S, e)$, $S$ contains $e$ with probability $1 - p_0$, where we modify the probability to $p_0 = \left(\frac{8(\alpha k+2)\log m}{n}\right)^{1/(\alpha k)}$. We start by proving some inequalities involving $p_0$ that will be useful later on, which hold for any $k$ in the assumed range.
Lemma A.1. For $2 \le k \le \left(\frac{n}{16\alpha \log m}\right)^{\frac{1}{4\alpha+1}}$, we have that
(a) $1 - p_0 \ge p_0^{k/4}$,
(b) $p_0^{k/4} \le 1/2$,
(c) $\frac{p_0^k}{(1-p_0)^2} \le \left(\frac{8(\alpha k+2)\log m}{n}\right)^{\frac{1}{2\alpha}}$.
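Proof. Recall as well that $\alpha > 1$. In the given range of $k$, we have $k^{4\alpha} \le \frac{n}{16\alpha k \log m} \le \frac{n}{8(\alpha k+2)\log m}$ because $k\alpha \ge 2$. Thus
$$p_0 = \left(\frac{8(\alpha k+2)\log m}{n}\right)^{\frac{1}{\alpha k}} \le \left(\frac{1}{k^{4\alpha}}\right)^{\frac{1}{\alpha k}} = k^{-4/k}.$$
Next, rewrite $k^{-4/k} = e^{-\frac{4\ln k}{k}}$ and observe that $\frac{4\ln k}{k} \le \frac{4}{e} < 1.5$. Since $e^{-x} \le 1 - \frac{x}{2}$ for any $0 \le x < 1.5$, we have $p_0 \le e^{-\frac{4\ln k}{k}} < 1 - \frac{2\ln k}{k}$. Further, $p_0^{k/4} \le e^{-\ln k} = 1/k$. Hence $p_0 + p_0^{k/4} \le 1 - \frac{2\ln k}{k} + \frac{1}{k} \le 1$, implying the first statement. The second statement easily follows as $p_0^{k/4} \le 1/k \le 1/2$ since $k \ge 2$. For the last statement, we make use of the first statement:
$$\frac{p_0^k}{(1-p_0)^2} \le \frac{p_0^k}{(p_0^{k/4})^2} = p_0^{k/2} = \left(\frac{8(\alpha k+2)\log m}{n}\right)^{\frac{1}{2\alpha}},$$
which completes the proof of the lemma.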
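The three inequalities of Lemma A.1 can be sanity-checked numerically for concrete parameters. The sketch below is only an illustration: the function name `lemma_a1_holds` is hypothetical, and the base of the logarithm is an assumption (base 2 by default, passed in as a parameter), since the text does not fix it here.

```python
import math


def lemma_a1_holds(n, m, k, alpha, log=math.log2):
    """Return p0 and the truth values of statements (a), (b), (c) of Lemma A.1.

    Raises ValueError if k is outside the range assumed by the lemma.
    The logarithm base is an assumption of this sketch."""
    upper = (n / (16 * alpha * log(m))) ** (1 / (4 * alpha + 1))
    if not 2 <= k <= upper:
        raise ValueError("k is outside the range assumed by Lemma A.1")
    base = 8 * (alpha * k + 2) * log(m) / n
    p0 = base ** (1 / (alpha * k))
    stmt_a = 1 - p0 >= p0 ** (k / 4)
    stmt_b = p0 ** (k / 4) <= 0.5
    stmt_c = p0 ** k / (1 - p0) ** 2 <= base ** (1 / (2 * alpha))
    return p0, (stmt_a, stmt_b, stmt_c)
```

For example, `lemma_a1_holds(2**20, 2**20, 2, 2)` evaluates the lemma at $n = m = 2^{20}$, $k = 2$, $\alpha = 2$, a point that lies inside the assumed range of $k$.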
Next, we give the new, generalized definition of median instances.
Definition A.1. (Median instance) An instance of Set Cover, $I = (\mathcal{U}, \mathcal{F})$, is a median instance if it satisfies all the following properties.
(a) No $\alpha k$ sets cover all the elements. (The size of its minimum set cover is greater than $\alpha k$.)
(b) The number of uncovered elements of the union of any $k$ sets is at most $2np_0^k$.
(c) For any pair of elements $e, e'$, the number of sets $S \in \mathcal{F}$ s.t. $e \in S$ but $e' \notin S$ is at least $(1-p_0)p_0 m/2$.
(d) For any collection of $k$ sets $S_1, \cdots, S_k$, $|S_k \cap (S_1 \cup \cdots \cup S_{k-1})| \ge (1-p_0)(1-p_0^{k-1})n/2$.
(e) For any collection of $k+1$ sets $S, S_1, \cdots, S_k$, $|(S_k \cap (S_1 \cup \cdots \cup S_{k-1})) \setminus S| \le 2p_0(1-p_0)(1-p_0^{k-1})n$.
(f) For each element, the number of sets that do not contain the element is at most $(1 + \frac{1}{k})p_0 m$.
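Properties (c) and (f) only involve pairs of elements and single elements, so they can be checked directly on a small explicit instance. The following is a minimal sketch with hypothetical helper names; the properties quantifying over collections of $k$ sets are omitted because a direct check would enumerate all such collections.

```python
from itertools import permutations


def min_candidates(universe, family):
    """Smallest number of sets containing e but not e', over ordered pairs
    of distinct elements (the quantity bounded in property (c))."""
    return min(sum(1 for s in family if e in s and e2 not in s)
               for e, e2 in permutations(universe, 2))


def max_missing(universe, family):
    """Largest number of sets missing a single element (property (f))."""
    return max(sum(1 for s in family if e not in s) for e in universe)


def satisfies_c_and_f(universe, family, p0, k):
    """Check properties (c) and (f) of Definition A.1 for given p0 and k."""
    m = len(family)
    return (min_candidates(universe, family) >= (1 - p0) * p0 * m / 2
            and max_missing(universe, family) <= (1 + 1 / k) * p0 * m)
```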
Lemma A.2. For $k \le \min\left\{\sqrt{\frac{m}{27 \ln m}}, \left(\frac{n}{16\alpha \log m}\right)^{\frac{1}{4\alpha+1}}\right\}$, there exists a median instance $I^*$ satisfying all the median properties from Definition A.1. In fact, most of the instances constructed by the described randomized procedure satisfy the median properties.
Proof. The lemma follows from applying the union bound on the results of Lemmas A.3–A.8.
The proofs of Lemmas A.3–A.8 follow from standard applications of concentration bounds. See the full version of this paper for detailed proofs.
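Lemma A.3. With probability at least $1 - m^{-2}$ over $\mathcal{F} \sim \mathcal{I}(\mathcal{U}, p_0)$, the size of the minimum set cover of the instance $(\mathcal{F}, \mathcal{U})$ is at least $\alpha k + 1$.
Lemma A.4. With probability at least $1 - m^{-2}$ over $\mathcal{F} \sim \mathcal{I}(\mathcal{U}, p_0)$, any collection of $k$ sets has at most $2np_0^k$ uncovered elements.
Lemma A.5. Suppose that $\mathcal{F} \sim \mathcal{I}(\mathcal{U}, p_0)$ and let $e, e'$ be two elements in $\mathcal{U}$. Given $k \le \left(\frac{n}{16\alpha \log m}\right)^{\frac{1}{4\alpha+1}}$, with probability at least $1 - m^{-2}$, the number of sets $S \in \mathcal{F}$ such that $e \in S$ but $e' \notin S$ is at least $mp_0(1-p_0)/2$.
Lemma A.6. Suppose that $\mathcal{F} \sim \mathcal{I}(\mathcal{U}, p_0)$ and let $S_1, \cdots, S_k$ be $k$ different sets in $\mathcal{F}$. Given $k \le \left(\frac{n}{16\alpha \log m}\right)^{\frac{1}{4\alpha+1}}$, with probability at least $1 - m^{-2}$, $|S_k \cap (S_1 \cup \cdots \cup S_{k-1})| \ge (1-p_0)(1-p_0^{k-1})n/2$.
Lemma A.7. Suppose that $\mathcal{F} \sim \mathcal{I}(\mathcal{U}, p_0)$ and let $S_1, \cdots, S_k$ and $S$ be $k + 1$ different sets in $\mathcal{F}$. Given $k \le \left(\frac{n}{16\alpha \log m}\right)^{\frac{1}{4\alpha+1}}$, with probability at least $1 - m^{-2}$, $|(S_k \cap (S_1 \cup \cdots \cup S_{k-1})) \setminus S| \le 2p_0(1-p_0)(1-p_0^{k-1})n$.
Lemma A.8. Given that $k \le \left(\frac{n}{16\alpha \log m}\right)^{\frac{1}{4\alpha+1}}$, for each element, the number of sets that do not contain the element is at most $(1 + \frac{1}{k})p_0 m$.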
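The random family $\mathcal{F} \sim \mathcal{I}(\mathcal{U}, p_0)$ appearing in Lemmas A.3–A.8 can be sampled as in the following minimal sketch. The function name `sample_instance` and the 0-indexed element labels are assumptions made for illustration.

```python
import random


def sample_instance(m, n, p0, rng):
    """Sample m sets over elements 0..n-1; each set-element pair (S, e)
    independently has e in S with probability 1 - p0."""
    return [{e for e in range(n) if rng.random() >= p0} for _ in range(m)]
```

Because `rng.random()` is uniform on $[0, 1)$, each element is included with probability exactly $1 - p_0$; in particular $p_0 = 0$ yields full sets and $p_0 = 1$ yields empty sets.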
A.2 Distribution $\mathcal{D}(I^*)$ of the Modified Instances Derived from $I^*$. Fix a median instance $I^*$. We now show that we may perform $\tilde{O}(n^{1-1/\alpha}k^{1/\alpha})$ swap operations on $I^*$ so that the size of the minimum set cover in the modified instance becomes $k$. So, the number of queries to EltOf and SetOf that induce different answers from those of $I^*$ is at most $\tilde{O}(n^{1-1/\alpha}k^{1/\alpha})$. We define $\mathcal{D}(I^*)$ as the distribution of instances $I'$ that is generated from a median instance $I^*$ by genModifiedInst($I^*$) given below in Figure 8. The main difference from the simplified version is that we now select $k$ different sets to turn them into a set cover, and the swaps may only occur between $S_k$ and the candidates.
genModifiedInst(I∗ = (U, F)):
    M ← ∅
    pick k different sets S1, ..., Sk from F uniformly at random
    for each e ∈ U \ (S1 ∪ ··· ∪ Sk) do
        pick e′ ∈ (Sk ∩ (S1 ∪ ··· ∪ Sk−1)) \ M uniformly at random
        M ← M ∪ {ee′}
        pick a random set S in Candidate(e, e′)
        swap(e, e′) between S, Sk
Figure 8: The procedure of constructing a modified instance of $I^*$.

Lemma A.9. The procedure genModifiedInst is well-defined under the precondition that the input instance $I^*$ is a median instance.
Proof. To carry out the algorithm, we must ensure that the number of the initially uncovered elements is at most that of the elements covered by both $S_k$ and some other set from $S_1, \ldots, S_{k-1}$. Since $I^*$ is a median instance, by properties (b) and (d) from Definition A.1, these values satisfy $|\mathcal{U} \setminus (S_1 \cup \cdots \cup S_k)| \le 2p_0^k n$ and $|S_k \cap (S_1 \cup \cdots \cup S_{k-1})| \ge (1-p_0)(1-p_0^{k-1})n/2$, respectively. By Lemma A.1, $p_0^{k/4} \le 1/2$. Using this and Lemma A.1 again,
$$(1-p_0)(1-p_0^{k-1})n/2 \ge p_0^{k/4} \cdot p_0^{k/4} \cdot n/2 \ge p_0^{k/2} n/2 \ge 2p_0^k n.$$
That is, in our construction there are sufficiently many possible choices for $e'$ to be matched and swapped with each uncovered element $e$. Moreover, since $I^*$ is a median instance, $|\mathrm{Candidate}(e, e')| \ge (1-p_0)p_0 m/2$ (by property (c)), and there are plenty of candidates for each swap.
A.2.1 Bounding the Probability of Modification. Similarly to the simplified case, define $P_{\mathrm{Elt-Set}} : \mathcal{U} \times \mathcal{F} \to [0, 1]$ as the probability that an element is swapped by a set, and upper bound it via the following lemma.
Lemma A.10. For any $e \in \mathcal{U}$ and $S \in \mathcal{F}$, $P_{\mathrm{Elt-Set}}(e, S) \le \frac{64p_0^k}{(1-p_0)^2 m}$, where the probability is taken over the random choices of $I' \sim \mathcal{D}(I^*)$.
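Proof. Let $S_1, \ldots, S_k$ denote the first $k$ sets picked (uniformly at random) from $\mathcal{F}$ to construct a modified instance of $I^*$. For each element $e$ and a set $S$ such that $e \in S$ in the basic instance $I^*$,
$$\begin{aligned}
P_{\mathrm{Elt-Set}}(e, S) ={}& \Pr[S = S_k] \cdot \Pr\left[e \in \cup_{i\in[k-1]} S_i \mid e \in S_k\right] \cdot \Pr\left[e \text{ matches to } \mathcal{U} \setminus (\cup_{i\in[k]} S_i) \mid e \in S_k \cap (\cup_{i\in[k-1]} S_i)\right] \\
&+ \Pr[S \notin \{S_1, \ldots, S_k\}] \cdot \Pr\left[e \in S \setminus (\cup_{i\in[k]} S_i) \mid e \in S\right] \cdot \Pr\left[S \text{ swaps } e \text{ with } S_k \mid e \in S \setminus (S_1 \cup \cdots \cup S_k)\right],
\end{aligned}$$
where all probabilities are taken over $I' \sim \mathcal{D}(I^*)$. Next we bound each of the above six terms. Clearly, since we choose the sets $S_1, \cdots, S_k$ randomly, $\Pr[S = S_k] = 1/m$. We bound the second term by 1. Next, by properties (b) and (d) of median instances, the third term is at most
$$\frac{|\mathcal{U} \setminus (\cup_{i\in[k]} S_i)|}{|S_k \cap (\cup_{i\in[k-1]} S_i)|} \le \frac{2p_0^k n}{(1-p_0)(1-p_0^{k-1})\frac{n}{2}} \le \frac{4p_0^k}{(1-p_0)^2}.$$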
We bound the fourth term by 1. Let $d_e$ denote the number of sets in $\mathcal{F}$ that do not contain $e$. Using property (f) of median instances, the fifth term is at most
$$\frac{d_e(d_e-1)\cdots(d_e-k+1)}{(m-1)(m-2)\cdots(m-k)} \le \left(\frac{d_e}{m-1}\right)^k \le \left(\frac{(1+1/k)p_0 m}{m\left(1-\frac{1}{k+1}\right)}\right)^k \le e^2 p_0^k.$$
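The last two inequalities in the bound on the fifth term can be unpacked as follows. This is an explanatory expansion rather than part of the original proof; it assumes $m \ge k + 1$, which is needed for $m - 1 \ge m(1 - \frac{1}{k+1})$.

```latex
\begin{align*}
\frac{d_e}{m-1}
  &\le \frac{(1+\frac{1}{k})\,p_0 m}{m\left(1-\frac{1}{k+1}\right)}
  && \text{by property (f) and } m - 1 \ge m\left(1 - \tfrac{1}{k+1}\right), \\
\frac{1+\frac{1}{k}}{1-\frac{1}{k+1}}
  &= \frac{k+1}{k}\cdot\frac{k+1}{k}
   = \left(1+\frac{1}{k}\right)^{2}, \\
\left(\frac{(1+\frac{1}{k})\,p_0 m}{m\left(1-\frac{1}{k+1}\right)}\right)^{k}
  &= \left(1+\frac{1}{k}\right)^{2k} p_0^{k}
   \le e^{2}\, p_0^{k}
  && \text{since } \left(1+\tfrac{1}{k}\right)^{k} \le e .
\end{align*}
```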
Finally for the last term, note that by symmetry, each pair of matched elements $ee'$ is picked by genModifiedInst equiprobably. Thus, for any $e \in S \setminus (S_1 \cup \cdots \cup S_k)$, the probability that each element $e' \in S_k \cap (S_1 \cup \cdots \cup S_{k-1})$ is matched to $e$ is $\frac{1}{|S_k \cap (S_1 \cup \cdots \cup S_{k-1})|}$. By properties (c)-(e) of median instances, the last term is at most
$\sum_{e' \in (S_k \cap (\cup_{i\in[k-1]} S_i)) \setminus S} \Pr[ee' \in \mathcal{M}] \cdot \Pr[(S, S_k) \text{ swap } (e, e')]$
≤ |(Sk ∩ (∪i∈[k−1]Si)) \ S| ·1