Jan 14, 2016
Sketching and Streaming Entropy via Approximation Theory
Nick Harvey (MSR/Waterloo)
Jelani Nelson (MIT)
Krzysztof Onak (MIT)
Streaming Model
• Algorithm maintains a vector x ∈ ℤ^n, initially x = (0, 0, 0, 0, …, 0)
• Stream of m updates, e.g. "increment x_1", "increment x_4", …
  – after these two updates, x = (1, 0, 0, 1, …, 0); at the end of the stream, say, x = (9, 2, 0, 5, …, 12)
• Goal: Compute statistics, e.g. ||x||_1, ||x||_2, …
• Trivial solution: Store x (or store all updates): O(n·log(m)) space
• Goal: Compute using O(polylog(nm)) space
Streaming Algorithms (a very brief introduction)
• Fact: [Alon-Matias-Szegedy '99], [Bar-Yossef et al. '02], [Indyk-Woodruff '05], [Bhuvanagiri et al. '06], [Indyk '06], [Li '08], [Li '09]
  Can compute (1±ε)||x||_p^p = (1±ε)F_p, where F_p = Σ_i |x_i|^p, using
  – O(ε^{-2} log^c n) bits of space (if 0 ≤ p ≤ 2)
  – O(ε^{-O(1)} n^{1-2/p} · log^{O(1)}(n)) bits (if 2 < p)
• Another Fact: Mostly optimal: [Alon-Matias-Szegedy '99], [Bar-Yossef et al. '02], [Saks-Sun '02], [Chakrabarti-Khot-Sun '03], [Indyk-Woodruff '03], [Woodruff '04]
  – Proofs using communication complexity and information theory
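To make the flavor of these F_p estimators concrete, here is a minimal Python sketch of the classic [Alon-Matias-Szegedy '99] estimator for F_2 (the p = 2 case, not this talk's algorithm). It assumes fully random ±1 signs per coordinate for readability; the stated space bounds actually require 4-wise independent hash families.

```python
import random
from collections import defaultdict

# Minimal AMS-style sketch for F_2 = sum_i x_i^2 [Alon-Matias-Szegedy '99].
# Simplification (assumption): fully random +/-1 signs per coordinate,
# instead of the 4-wise independent hash families the space bounds need.
class AMSF2:
    def __init__(self, reps=200):
        # One signed counter per repetition; signs[r][i] is a random sign for x_i.
        self.signs = [defaultdict(lambda: random.choice((-1, 1))) for _ in range(reps)]
        self.counters = [0] * reps

    def update(self, i, delta=1):
        # Stream update "x_i += delta" (deletions are negative deltas).
        for r, signs in enumerate(self.signs):
            self.counters[r] += signs[i] * delta

    def estimate(self):
        # Each counter Z = sum_i s(i)*x_i satisfies E[Z^2] = F_2; average to concentrate.
        return sum(c * c for c in self.counters) / len(self.counters)

# Demo: 1000 random increments over 50 coordinates.
sketch, x = AMSF2(), defaultdict(int)
for _ in range(1000):
    i = random.randrange(50)
    sketch.update(i)
    x[i] += 1
print(sum(v * v for v in x.values()), round(sketch.estimate()))
```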
Practical Motivation
• General goal: Dealing with massive data sets
  – Internet traffic, large databases, …
• Network monitoring & anomaly detection
  – Stream consists of internet packets
  – x_i = # packets sent to port i
  – Under typical conditions, x is very concentrated
  – Under a "port scan attack", x is less concentrated
  – Can detect by estimating empirical entropy
[Lakhina et al. ’05], [Xu et al. ‘05], [Zhao et al. ‘07]
Entropy
• Probability distribution a = (a_1, a_2, …, a_n)
• Entropy H(a) = -Σ_i a_i·lg(a_i)
• Examples:
  – a = (1/n, 1/n, …, 1/n): H(a) = lg(n)
  – a = (0, …, 0, 1, 0, …, 0): H(a) = 0
• Entropy is small when a is concentrated, LARGE when it is not
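As a quick check, the two examples above compute as claimed (a few lines of Python, using the convention 0·lg(0) = 0):

```python
import math

def H(a):
    # Shannon entropy in bits, with the convention 0*lg(0) = 0.
    return -sum(p * math.log2(p) for p in a if p > 0)

n = 8
print(H([1.0 / n] * n))   # uniform: lg(8) = 3.0
print(H([0, 0, 1, 0]))    # point mass: 0.0
```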
Streaming Algorithms for Entropy
• How much space to estimate H(x)?
  – [Guha-McGregor-Venkatasubramanian '06], [Chakrabarti-Do Ba-Muthu '06], [Bhuvanagiri-Ganguly '06]
  – [Chakrabarti-Cormode-McGregor '07]:
    multiplicative (1±ε) approx: O(ε^{-2} log^2 m) bits
    additive approx: O(ε^{-2} log^4 m) bits
    Ω(ε^{-2}) lower bound for both
• Our contributions:
  – Additive or multiplicative (1±ε) approximation
  – Õ(ε^{-2} log^3 m) bits, and can handle deletions
  – Can sketch entropy in the same space
First Idea
• If you can estimate F_p for p ≈ 1, then you can estimate H(x)
• Why? Rényi entropy
Review of Rényi
• Definition: H_p(x) = log(||x||_p^p / ||x||_1^p) / (1 - p)
• Convergence to Shannon: lim_{p→1} H_p(x) = H(x)
[Plot: H_p(x) as a function of p, for p = 0, 1, 2, …; pictured: Alfréd Rényi and Claude Shannon]
Overview of Algorithm
• Set p = 1.01 and let x̃ = x/||x||_1
• Compute ỹ = (1±ε)||x̃||_p^p (using Li's "compressed counting")
• Set H̃ = log(ỹ)/(1 - p)
• So H̃ = log((1±ε)||x̃||_p^p)/(1 - p) = H_p(x̃) + log(1±ε)/(1 - p) = H_{1.01}(x̃) ± 100·log(1±ε) ≈ H_{1.01}(x̃) ± 100ε
• As p → 1, H_p(x̃) → H(x̃), so this gets better; but the 1/(1 - p) amplification of the error gets worse!
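A small numeric illustration of this overview, with exact moments standing in for Li's sketched (1±ε) estimates, and an arbitrary toy vector x (both are assumptions for the demo, not the streaming algorithm itself):

```python
import math

def shannon(x):
    s = sum(x)
    return -sum((v / s) * math.log2(v / s) for v in x if v > 0)

def renyi(x, p):
    # H_p(x~) = log(||x~||_p^p) / (1 - p) for the normalized x~ = x/||x||_1.
    s = sum(x)
    return math.log2(sum((v / s) ** p for v in x if v > 0)) / (1 - p)

x = [9, 2, 5, 12, 1, 1, 30, 4]   # toy frequency vector (assumed data)
print(shannon(x))                 # Shannon entropy of x~
print(renyi(x, 1.01))             # Renyi at p = 1.01: already close
```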
Analysis
Making the tradeoff
• How quickly does H_p(x) converge to H(x)?
• Theorem: Let x be a distribution with min_i x_i ≥ 1/m.
  – Multiplicative Approximation: let p = 1 + ε/O(log m); then 1 ≤ H(x)/H_p(x) ≤ 1 + ε
  – Additive Approximation: let p = 1 + ε/O(log^2 m); then 0 ≤ H(x) - H_p(x) ≤ ε
• Plugging in: O(ε^{-3} log^4 m) bits of space suffice for additive approximation
Proof: A trick worth remembering
• Let f : ℝ → ℝ and g : ℝ → ℝ be such that
  lim_{p→1} f(p) = 0, lim_{p→1} g(p) = 0, and lim_{p→1} f'(p)/g'(p) = L
• l'Hôpital's rule says that lim_{p→1} f(p)/g(p) = L
• It actually says more! It says f(p)/g(p) converges to L at least as fast as f'(p)/g'(p) does
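For the entropy application, the trick instantiates as follows (a worked step not spelled out on the slide; x̃ is the normalized vector and logs are natural here):

```latex
% H_p(\tilde{x}) = f(p)/g(p) with f(p) = \ln\sum_i \tilde{x}_i^p and g(p) = 1-p;
% both vanish at p = 1 since \sum_i \tilde{x}_i = 1. Then
\[
  \frac{f'(p)}{g'(p)}
    = \frac{\sum_i \tilde{x}_i^{\,p}\ln \tilde{x}_i}{\sum_i \tilde{x}_i^{\,p}}
      \cdot \frac{1}{-1}
    \;\longrightarrow\;
    -\sum_i \tilde{x}_i \ln \tilde{x}_i = H(\tilde{x})
    \qquad (p \to 1),
\]
% so H_p(\tilde{x}) -> H(\tilde{x}), at least as fast as f'/g' converges.
```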
Improvements
• Status: additive approx using O(ε^{-3} log^4 m) bits
• How to reduce space further?
  – Interpolate with multiple points: H_{p_1}(x), H_{p_2}(x), ...
[Plot: H_p(x) vs. p. LEGEND: Shannon (at p = 1); Single Rényi; Multiple Rényis used for interpolation]
Analyzing Interpolation
• Let f(z) be a C^{k+1} function
• Interpolate f with polynomial q with q(z_i) = f(z_i), 0 ≤ i ≤ k
• Fact: |f(y) - q(y)| ≤ (b - a)^{k+1} · sup_{z∈[a,b]} |f^{(k+1)}(z)|, where y, z_i ∈ [a,b]
• Our case: Set f(z) = H_{1+z}(x)
• Goal: Analyze f^{(k+1)}(z)
Bounding Derivatives
• Rényi derivatives are messy to analyze
• Switch to Tsallis entropy f(z) = S_{1+z}(x̃), where S_p(x̃) = (1 - ||x̃||_p^p)/(p - 1)
• Can prove Tsallis also converges to Shannon
• Define: G_k(z) = Σ_{i=1}^n x̃_i^{1+z}·log^k(x̃_i). Then
  f^{(k)}(z) = (-1)^k·k!·(1 - G_0(z))/z^{k+1} + Σ_{j=1}^k (-1)^{k-j+1}·(k!/j!)·G_j(z)/z^{k-j+1}
• Fact: sup_{z∈[a,b]} |f^{(k+1)}(z)| = O(H(x)·log^{k+1} m) (when a = -O(1/(k·log m)), b = 0)
  ⇒ can set k = log(1/ε) + loglog m
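A hypothetical numeric check of the closed form for f^{(k)}(z) above against a finite difference, on an arbitrary small distribution (the data and the check are assumptions for illustration):

```python
import math
import numpy as np

# Check the closed form for f^(k)(z), where f(z) = S_{1+z}(x~) = (1 - G_0(z))/z.
x = np.array([0.5, 0.2, 0.2, 0.1])   # assumed toy distribution x~

def G(j, z):
    # G_j(z) = sum_i x~_i^{1+z} * log^j(x~_i)  (natural logs)
    return np.sum(x ** (1 + z) * np.log(x) ** j)

def f(z):
    return (1 - G(0, z)) / z

def f_deriv(k, z):
    # The closed form from the slide above.
    total = (-1) ** k * math.factorial(k) * (1 - G(0, z)) / z ** (k + 1)
    for j in range(1, k + 1):
        total += ((-1) ** (k - j + 1) * math.factorial(k) / math.factorial(j)
                  * G(j, z) / z ** (k - j + 1))
    return total

z0, h = -0.05, 1e-5
print(f_deriv(1, z0), (f(z0 + h) - f(z0 - h)) / (2 * h))  # should agree closely
```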
Key Ingredient: Noisy Interpolation
• We don't have f(z_i); we have f(z_i) ± ε
• How to interpolate in the presence of noise?
• Idea: we pick our z_i very carefully
Chebyshev Polynomials
• T_k(x) = cos(k·arccos(x))
• Rogosinski's Theorem: if q(x) has degree k and |q(β_j)| ≤ 1 at the extrema β_j = cos(jπ/k) of T_k (0 ≤ j ≤ k), then |q(x)| ≤ |T_k(x)| for |x| > 1
• Map [-1,1] onto interpolation interval [z_0, z_k]
• Choose z_j to be the image of β_j, j = 0, …, k
• Let q̃(z) interpolate f(z_j) ± ε and q(z) interpolate f(z_j)
• r(z) = (q̃(z) - q(z))/ε satisfies Rogosinski's conditions!
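A small Python illustration of why these nodes tame noise (interval, function, and noise level are assumed toy values): interpolate through values perturbed by ±ε at the images of the β_j, and the error at 0, which lies outside [z_0, z_k], stays near ε·|T_k(preimage(0))|, as Rogosinski's theorem promises.

```python
import numpy as np

rng = np.random.default_rng(0)
k, eps = 8, 1e-3
f = np.exp                                    # smooth stand-in for z -> H_{1+z}(x)
z0, zk = -0.5, -0.05                          # interpolation interval; 0 lies outside
beta = np.cos(np.arange(k + 1) * np.pi / k)   # Chebyshev extrema beta_j in [-1, 1]
z = z0 + (beta + 1) * (zk - z0) / 2           # their images z_j in [z0, zk]
noisy = f(z) + eps * rng.choice((-1.0, 1.0), size=k + 1)
q = np.polynomial.Polynomial.fit(z, noisy, k)           # noisy interpolant q~
t = 2 * (0 - z0) / (zk - z0) - 1                        # preimage of 0; t > 1
bound = eps * np.cosh(k * np.arccosh(t))                # eps * T_k(t), via cosh form
print(abs(q(0) - f(0)), bound)   # observed error vs. the Rogosinski-style bound
```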
Tradeoff in Choosing z_k
• z_k close to 0 ⇒ |T_k(preimage(0))| still small
• …but z_k close to 0 ⇒ high space complexity
• Just how close do we need 0 and z_k to be?
[Plot: T_k grows quickly once leaving [z_0, z_k]; the evaluation point 0 lies just outside the interval]
The Magic of Chebyshev
• [Paturi '92]: T_k(1 + 1/k^c) ≤ e^{4k^{1-c/2}}. Set c = 2.
• Suffices to set z_k = -O(1/(k^3·log m))
• Translates to Õ(ε^{-2} log^3 m) space
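Paturi's bound is easy to check numerically for c = 2, where e^{4k^{1-c/2}} = e^4 is a constant (a quick sanity check, not part of the talk):

```python
import numpy as np

# T_k(x) = cosh(k * arccosh(x)) for x >= 1; compare against the constant e^4.
for k in (4, 8, 16, 32, 64):
    tk = np.cosh(k * np.arccosh(1 + 1.0 / k ** 2))
    print(k, round(float(tk), 3), "<=", round(float(np.exp(4)), 1))
```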
The Final Algorithm (additive approximation)
• Set k = lg(1/ε) + lglg(m), and z_j = (k^2·cos(jπ/k) - (k^2 + 1))/(9k^3·lg(m)) for 0 ≤ j ≤ k
• Estimate S̃_{1+z_j} = (1 - F̃_{1+z_j}/(F̃_1)^{1+z_j})/z_j for 0 ≤ j ≤ k
• Interpolate the degree-k polynomial q̃ with q̃(z_j) = S̃_{1+z_j}
• Output q̃(0)
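Putting the pieces together, a non-streaming Python sketch of this algorithm: exact moments F_p stand in for the sketched (1±ε) estimates F̃_p, and x is assumed toy data, so this shows only the approximation-theory half of the method.

```python
import math
import numpy as np

x = np.array([9.0, 2.0, 5.0, 12.0, 1.0, 1.0, 30.0, 4.0])  # assumed toy data
m = x.sum()                                   # F_1 = ||x||_1
eps = 0.05
k = math.ceil(math.log2(1 / eps) + math.log2(math.log2(m)))

j = np.arange(k + 1)
z = (k**2 * np.cos(j * np.pi / k) - (k**2 + 1)) / (9 * k**3 * math.log2(m))

def F(p):
    # Exact moments F_p = sum_i x_i^p (the streaming sketch would estimate these).
    return np.sum(x[:, None] ** p, axis=0)

S = (1 - F(1 + z) / m ** (1 + z)) / z         # Tsallis estimates S_{1+z_j} (nats)
q = np.polynomial.Polynomial.fit(z, S, k)     # degree-k interpolant through (z_j, S_j)
H = -np.sum((x / m) * np.log(x / m))          # exact Shannon entropy in nats
print(q(0), H)                                 # q(0) approximates H
```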
Multiplicative Approximation
• How to get multiplicative approximation?
  – Additive approximation is multiplicative, unless H(x) is small
  – H(x) small ⇒ max_i x_i large [CCM '07]
• Suppose x_{i*} ≥ x_i for all i, and define the residual moment RF_p = Σ_{i≠i*} x_i^p
• We combine (1±ε)RF_1 and (1±ε)RF_{1+z_j} to get (1±ε)f(z_j)
• Question: How do we get (1±ε)RF_p?
• Two different approaches:
  – A general approach (for any p, and negative frequencies)
  – An approach exploiting p ≈ 1, only for nonnegative freqs (better by log(m))
Questions / Thoughts
• For what other problems can we use this "generalize-then-interpolate" strategy?
  – Some non-streaming problems too?
• The power of moments?
• The power of residual moments? CountMin (CM '05) + CountSketch (CCF '02), HSS (Ganguly et al.)
• WANTED: Faster moment estimation (some progress in [Cormode-Ganguly '07])