MBG 1 May 2004

AHBHA: Managing Congestion
through Adaptive Hop-By-Hop Aggregation

Michael Greenwald, University of Pennsylvania
What is congestion?
• Congestion: applications/clients present a larger aggregate load than intermediate nodes in the network can handle.
• Congestion control: a mechanism that ensures the network remains manageable under overload.
Congestion control is much more difficult than the related problem of flow control, because the participants are unaware that they share a resource.
What causes congestion? (Isn't bandwidth cheap?)
• Persistent congestion: solved by adequate provisioning (true, bandwidth is cheap).
• Cause: intermittent high load
  – Intermittent emergency (earthquake, 9/11)
  – Extreme loads (expected: Mother's Day; unexpected: Pathfinder pictures)
  – Periods of growth
  – DOS or DDOS attack
• Effect: congestion
  – Bursty traffic (statistical multiplexing)
  – Many sources converge on a single link
  – Low-capacity link becomes a bottleneck
  – Subset of multicast destinations
Congestion still occurs, and will continue to, even though bandwidth is getting cheaper.
Why must congestion be controlled?
• Congestion collapse
  – Links clogged with useless packets: packets that will be dropped anyway, retransmissions, or out-of-date data
• Long delay
  – Relevant mostly for short transactions over long distances
• Variability in delay (jitter)
• Drop rate
  – Not a problem in itself, since packets are only dropped if they can't make it through the bottleneck anyway, but…
  – Dropped packets use up bandwidth on other links before being dropped
  – Control over which packets get dropped?
• Low utilization (inefficiency)
• Fairness
How is congestion controlled?
• Slow-start/congestion avoidance
• Losses-per-epoch/Fast Retransmit
• Full buffers, tail-drop => RED
• Non-compliant flows => FRED, penalty box, etc.
• Pkt drop/buffer size is a noisy signal => Vegas
• Adjust parameters => BLUE, ARED
• Avoid packet loss => ECN
• Explicit feedback => XCP
• Fat pipes => TCP FAST
• Fairness, lossy wireless => TCP Westwood
• Mice, fairness, QOS, bad RTE
Response
• Concern with robustness, efficiency, and fairness
• Control-theoretic approach:
  – Stability, convergence
• Make the world safe for control theory:
  – Controller reacts as quickly as the signal changes
    » Know RTT, react quickly, change slowly
  – Response to feedback must be predictable
    » Behavior of aggregate independent of # of flows
  – Behavior of client/application/transport predictable
Significant contributions, but…
A Different View
• Complexity: epicycles on epicycles
• Fragility
• The end-to-end argument, misinterpreted
  – Trapped by success, religious dogma, & need for field testing
• Congestion control common to all clients
• Don't optimize for a particular application, even TCP
A Different View
• View from routers: predictable response to feedback
• View from hosts: a delivery fabric with predictable congestion feedback
• Stark contrast with the current system:
  – Extreme example: aggregated small TCP flows do not exponentially decrease or linearly increase.
  – 1,000,000 flows, so the window for each flow is small (approx 1)
    » Congestion notifies 10% of the flows, which decrease by at most 10% of packets.
    » Regardless, each of the 1,000,000 flows increases cwnd by 1 each RTT, effectively doubling the rate (alternatively, a larger fraction is in slow start).
AHBHA: Adaptive Hop-by-Hop Aggregation
A simple idea
• Hop-by-hop: feedback and controller at each node.
  – Why any different than CreditNet (Kung) or HBH (Kanakia)?
• Aggregate flows based on purely local characteristics: (next-hop X TOS X QOS) X input
  – What about head-of-line blocking? Local vs. global behavior? Isolation of congestion?
• Transitive renaming of congested links
  – Based on the observation that most of the net is adequately provisioned.
[Figure: router architecture — inputs feeding an interconnect feeding outputs]
Controlling Utilization of a Resource
• Consider a flow queue with a current queue length, a known rate capacity, and a set of input flows.
  – Capacity may be a physical limit for a physical link, or a rate limit imposed by a neighbor for a finer-grained flow.
  – Input flows may be local flows sharing a single physical output link, or flows coming in from neighbors.
• Queue length > threshold triggers congestion control (at most once per RTT).
  – Must determine whether queue growth is due to burstiness or to input rate exceeding output rate.
  – If the former, smooth inputs; if the latter, throttle neighbors.
[Figure: several input flows feeding one output queue]
Controlling Utilization of a Resource
• MAIR (Mean Aggregate Input Rate) = Sum over inputs of [(smoothed) number of packets / interval]
• If MAIR < output capacity, then input flows need only be smoothed.
  – Pacing: 1/pkts-per-RTT
• If MAIR > output capacity, then input rates must be reduced. Acceptable rates are computed in 2 passes:
  – BaseAllocation = Capacity/Nflows;  // can use weights per flow instead
  – ExcessAllocation = 0; UncontrolledFlows = 0;
    for flow in inputs:
      if 2*flowRate < BaseAllocation then
        ExcessAllocation += BaseAllocation - 2*flowRate;
        UncontrolledFlows++;
  – FairAllocation = BaseAllocation + ExcessAllocation/(Nflows - UncontrolledFlows);
• Send FairAllocation to a flow if its input rate >= FairAllocation.
[Figure: input flows feeding one output queue; input from fairness controller; drain queue in 2 RTT]
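The two-pass allocation above can be sketched in Python. This is a minimal sketch: the function name, the list-of-rates input, and the dict return type are illustrative assumptions, not part of AHBHA.

```python
def control_step(capacity, rates):
    """Sketch of the per-resource control decision (names are assumptions).

    capacity: output rate capacity of the flow queue
    rates:    per-input-flow (smoothed) arrival rates (the MAIR components)
    Returns {flow_index: allowed_rate} for flows that must be throttled;
    an empty dict means smoothing alone suffices.
    """
    mair = sum(rates)                     # Mean Aggregate Input Rate
    if mair < capacity:
        return {}                         # smooth inputs only (pace at 1/pkts-per-RTT)

    # Pass 1: flows using well under their base share donate their excess.
    nflows = len(rates)
    base = capacity / nflows              # could use per-flow weights instead
    excess = 0.0
    uncontrolled = 0
    for r in rates:
        if 2 * r < base:
            excess += base - 2 * r
            uncontrolled += 1

    # Pass 2: split the donated excess among the remaining (controlled) flows.
    # (If every flow were uncontrolled, MAIR < capacity/2 and we returned above,
    # so the divisor is never zero.)
    fair = base + excess / (nflows - uncontrolled)
    return {i: fair for i, r in enumerate(rates) if r >= fair}
```

For example, with capacity 90 and input rates [5, 40, 45], the base share is 30, the lightly loaded flow donates 20, and the two heavy flows are each throttled to the fair allocation of 40.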
Renaming to Isolate Congestion
• R1->R2 becomes congested.
• QuarantineSet = {N | NextHop(N)@R1 = R2}
[Figure: routers R1 and R2, with the R1->R2 link congested]
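The QuarantineSet definition translates directly into code. A sketch under the assumption that R1's routing state is available as a destination-to-next-hop map; the names are hypothetical.

```python
def quarantine_set(next_hop, congested_neighbor):
    """QuarantineSet = {N | NextHop(N)@R1 = R2}: the destinations at R1
    whose next hop is the congested neighbor. `next_hop` maps
    destination -> next-hop router at R1 (an illustrative representation)."""
    return {dst for dst, nh in next_hop.items() if nh == congested_neighbor}
```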
Renaming to Isolate Congestion
• Create artificial node R1'.
• QuarantineSet = {N | NextHop(N)@R1 = R2}
• Routing update to all neighbors of R1 advertising R1' as the best path to {N}.
[Figure: R1, R2, and artificial node R1']
Renaming to Isolate Congestion
• If the queue to R1' is congested, recurse: split the artificial node and advertise to input queues.
[Figure: R1, R2, and artificial node R1']
Releasing control
• Record the time of the transition from uncontrolled to controlled.
• Record the time of the most recent ("last") congestion event.
  NewCongestionInterval = LastCongestionEvent - FirstCongestionEvent;
  CongestionInterval = max(NewCongestionInterval, OldCongestionInterval);
  If ((now - LastCongestionEvent) > CongestionInterval) {
    Release control;
    OldCongestionInterval = max(OldCongestionInterval/2, NewCongestionInterval);
  }
  // Release state after 10*CongestionInterval with no congestion
  // OldCongestionInterval > 4*max(IRTT, ORTT)
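The release rule above is stateful, so a small class makes the bookkeeping explicit. This sketch implements the literal pseudocode (release after one quiet CongestionInterval, not the 10x full-state release mentioned in the comment); all names are assumptions.

```python
class CongestionState:
    """Sketch of the release-control timer for one controlled aggregate."""

    def __init__(self):
        self.first = None          # time of transition uncontrolled -> controlled
        self.last = None           # time of most recent congestion event
        self.old_interval = 0.0    # OldCongestionInterval
        self.controlled = False

    def congestion_event(self, now):
        if not self.controlled:
            self.controlled = True
            self.first = now
        self.last = now

    def maybe_release(self, now):
        """Release control after a quiet period longer than CongestionInterval."""
        if not self.controlled:
            return False
        new_interval = self.last - self.first
        interval = max(new_interval, self.old_interval)
        if now - self.last > interval:
            self.controlled = False
            self.old_interval = max(self.old_interval / 2, new_interval)
            return True
        return False
```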
Miscellaneous Details
• Uncontrolled flows:
  – XMIT: if X packets were sent in RTT interval t0, then at most 2X packets in interval t1
• Periodic CC packets between immediate neighbors
  – List controlled flows and rates
  – Once per max(RTT/2, 20 packet times)
  – If no CC pkt from N in an RTT interval, then all flows are controlled at X/2
• CC packets are high priority
• Compute RTT
• Assume the max rate to neighbors is known
  – Good assumption for dedicated lines
  – May need to be estimated for Ether/shared channel or multi-hop neighbors
Advantages
• Works for: mice, elephants, non-TCP flows
• Long-delay flows: ramp-up in log(n) round-trips; high utilization
• Doesn't treat loss as a congestion signal
• Not sensitive to parameters
• Fairness decoupled from the CC mechanism: agnostic on policy, or on the policy delivery mechanism. Can work with either packet marking (diffserv) or flow-weighting (periodic packets from src to dst, providing per-flow weights).
• Aggregation + per-hop control makes flows smoother and less self-similar
• Response time to the source is comparable to e2e packet loss.
But you can say almost the same things about XCP, or FAST, or…
Serendipity:
• Buffer sizes:
  – Per-link rather than cross-network
  – 1 buffer per neighbor, rather than 1 per flow
• Broken routers, misbehaving hosts, DOS attacks
• Multicast
• Simplifies TCP
Preliminary Observations
• Significantly simpler than the current world (let alone a more complex world)
• AHBHA comparable in all cases; never significantly better. Simulations using ns2:
  – RED (varying capacities & loads), Floyd "TCP Friendly", FAST (*), TCP Westwood (*), XCP (**), AHBHA regions + legacy, defective routers, DOS, short flows (1 pkt), Mbottle, "SimpleTCP" w/ECN to source
  – (*) Compared to results in paper; (**) needed to compile separate versions of ns2
• Works with non-cooperating clients and routers.
Preliminary Non-Results
• Stability not proven
• Convergence not proven
• Convergence time not established
• (On the other hand, intuitive reasons to believe it is stable: e.g. bounded input increase, superposition of stable systems, RTTs are equal (not just by assumption))
• If there are many congestion points, then we lose congestion isolation
Unresolved Issues with Naming Controlled Aggregates
• 1 bit for renaming the next hop? B bits? An exhaustive list?
• Aggregate by next hop? Or by 2nd hop (horizon effect)?
• More hysteresis in determining CongestionInterval, rather than releasing control after a quiet period?
• The right choices depend on patterns of congestion in the real network.
• Measurement required.
cing+: Measuring Network-Internal Delays using only Existing Infrastructure
joint work with Kostas Anagnostakis, University of Pennsylvania,
and Raphael Ryger, Yale University
Remote measurement of per-link delays
• Network measurement techniques support:
  – Understanding of control mechanisms (such as TCP congestion control) --- both results and workload
  – Insights into network performance
  – Fault isolation, error reporting
  – Curiosity; switch providers?
• Network parameters such as delay, loss, and throughput are easy to measure end-to-end.
• The same parameters are difficult to measure on individual links inside the network.
Understand your tools: Know yourself
• How accurate are the results?
• Why do we believe it is accurate?
• What are its limitations?
• Answering these questions is difficult, sometimes surprising, and results in a much better tool.
Network Delay Tomography: A Brief History
From a remote source, S, estimate the distribution of link delays.
[Figure: source S probing a tree of routers A, B, C over links X1, X2, X3]
• Direct measurement, using existing tools (e.g. pathchar)
  – <RTT to tail> - <RTT to head> yields the RTT on the link (TTL-expired responses)
  – Uses only existing infrastructure: measure anywhere without cooperation.
  – But…
    » Are ICMP responses representative?
    » Asymmetric paths: return paths vary, so (tail - head) may not be meaningful.
    » Round-trip vs. one-way delay?
Network Delay Tomography: A Brief History
From a remote source, S, estimate the distribution of link delays.
• Indirect inference methods (e.g. the minc project)
  – One packet to multiple receivers; correlate behavior on links in the resulting tree
  – But:
    » Deployability (works best with multicast; needs cooperating rcvrs)
    » Accuracy (assumes independence of delay; quality of estimates degrades over longer paths)
    » Robustness (high variance in error)
    » Computational complexity
    » Need for many samples, therefore much time
Network Delay Tomography: A Brief History
From a remote source, S, estimate the distribution of link delays.
• Direct methods (e.g. the cing project)
  – f(<Timestamp to tail>, <Timestamp to head>) yields the delay on the link
  – No infrastructure required, highly accurate, strong experimental validation
  – But:
    » The packet pair may not encounter equal queues
    » ICMP processing may not be representative
    » Clocks are unsynchronized
    » Routing irregularity, so not always applicable
Network tomography: a direct method
• Use router ICMP Timestamp messages and packet-pair probes to directly estimate queuing delay.
[Figure: a packet pair sent at t0; the first packet is timestamped t1 at the head router after delay d1prop + d1q, the second is timestamped t2 at the tail router after the additional delay d2prop + d2q. Propagation delay is fixed; queueing delay is variable.]
Account for the fixed part by subtracting the minimum time over a set of observations.
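The "subtract the minimum" step can be shown directly. A sketch, not the full cing+ pipeline: the per-probe timestamp differences are assumed inputs, and the clock-offset and skew issues discussed later are ignored here.

```python
def queueing_delay_estimates(deltas):
    """Given per-probe timestamp differences (tail timestamp - head timestamp),
    the minimum over the observation set approximates the fixed part
    (propagation delay plus any constant clock offset); what remains is an
    estimate of the variable queueing delay on the link."""
    fixed = min(deltas)
    return [d - fixed for d in deltas]
```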
Question your assumptions
• Feasibility: is the basic mechanism supported? Accuracy? Stability of routing? etc.
• Do back-to-back packets really experience the same delay on their shared path?
• Are ICMP processing times indicative of processing times for "normal" packets?
• How do we account for differing offset and skew on clocks?
• Are the paths to adjacent nodes coincident?
Back-to-back packets
• Do packets arrive back-to-back?
• Do back-to-back packets experience identical queuing delays and processing time? (And do they stay back-to-back?)
• The distinctions are irrelevant to the algorithm; the issue is simply the difference in timestamped value.
• Experiment: probe routers with varied load and varied path length from the source (7300 routers).
  – 80% of routers: worst-case inter-pkt delay <= 7 ms.
  – 95+% of routers have 98% of inter-pkt delays <= 6 ms.
Caveats: our sources had good connectivity to the Internet, so this does not provide insight into slow links or high congestion near the source. This issue is common to all algorithms.
ICMP Processing Time
• Send the direct probe first, so queuing delays err conservatively and overestimate ICMP processing time.
• Median processing time always negligible
• 95th percentile usually negligible
• This issue is common to all direct measurements that use ICMP
• Variation in processing time between head & tail
• Comparison w/ non-ICMP traffic
[Figure: requests with spoofed src ("A2", req) to a site that allows spoofed src, responses to a cooperating rcvr. Dot = median, box = interquartile range, bars 5-95%, dots are outliers.]
ICMP Processing Time
• Spoofing and cooperation limit the scope of the experiment (7 targets, 20 routers).
• A broader study? If processing delays were significant on the head of a link, then the estimated queuing delay for the link should sometimes be negative.
• Occasionally present in 9.9% of the sample (1,368).
[Figure: spoofed-source requests with a cooperating rcvr. Dot = median, box = interquartile range, bars 5-95%, dots are outliers.]
Unsynchronized Clocks
• Clock offsets may vary because of clock skew or "jumps" to adjust for skew.
• Both src & dst may jump, and may be skewed in opposite directions.
• This may distort individual observations, as well as provide an erroneous minimum for d2prop.
• It is impossible to tell, for an individual observation, whether a time difference is due to queuing or a clock artifact.
[Figure: the timing diagram, with the timestamps perturbed by clock offsets OA,1 and OA,2, so the measured interval includes the term OA,2 - OA,1.]
Unsynchronized Clocks
• Post-processing looks at multiple observations.
• RTT provides valuable clues: queuing and max.
• We can recover skew; we only care whether a jump occurred between request & response.
• Look for colinear regions; label the others "can't tell".
[Figure: offset vs. local clock]
Routing Issues
• Reachability: it is easy to see that most links are not measurable by the direct method from a single source: many links connect to a node, but only one is on the path from S.
• (In some sense, S is mainly interested in links reachable from S.)
• Regularity: if the path to the head of a link is not a prefix of the path to the tail of the link, then we cannot meaningfully subtract the timestamp responses.
A routing map, R, is regular over a graph G if, for every node m on the path from s to d, Rs(m) is a prefix of Rs(d).
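The regularity condition can be checked mechanically for one destination. A sketch: the `route_to` map (node -> full path from s) is an assumed representation, not part of cing+.

```python
def is_regular(route_to, path_s_to_d):
    """Regularity for one destination d: the path from s to every intermediate
    node m on the s->d path must be a prefix of the s->d path itself.
    `route_to` maps node -> list of nodes from s (illustrative)."""
    for i, m in enumerate(path_s_to_d[1:-1], start=1):
        if route_to.get(m) != path_s_to_d[:i + 1]:
            return False
    return True
```

When this check fails for a link, the timestamp subtraction for that link is not meaningful and the link must be covered by another method.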
Irregular routing
Internet routing is irregular. Nevertheless:
  – Coverage for single links ranges from 20% (SRI) to 53% (LIACS)
  – Multiple sources increase the likelihood measurably
  – Multi-hop segments increase the likelihood of coverage
Simpler approach? TTL vs. Timestamp
• Why aren't TTL-limited RTT measurements (a la pathchar) sufficient?
  – TTL-limiting removes the routing problem
  – RTT measurements remove the clock problem
• Accuracy:
  – Asymmetry
  – One-way vs. round-trip
  – Not back-to-back on the return path
[Figure: comparison over 10,931 links]
A hybrid solution
• Indirect inference is accurate for small trees: look only at small, isolated trees.
• Timestamps and TTL-limited probes make every router a "cooperating receiver".
• Indirect inference can isolate return-path delays from forward-path delays.
• Indirect inference can determine the delay distribution in the shared portion of overlapping segments.
• Deconvolution.
A hybrid technique can be almost as accurate as the direct approach, and almost universally applicable.
Putting it all together
• By a combination/choice of timestamps, RTT, and TTL-expiration, using either the indirect methods of MINC or deconvolution, we can cover just about every link in the Internet, often by many methods.
• But which methods should we use?
Putting it all together
• A shared link is most accurate for MINC.
• Deconvolution is only as accurate as the least accurate segment.
• But which methods should we use?
Relative Accuracy
• Estimated vs. actual mean delay
• The 2nd row shows the effect of divergent paths: 200ms extra delay on the path of the 2nd pair
Increased Coverage
[Figures: multiple sources; granularity vs. accuracy]
Collecting vast quantities of data
• Individual delay measurements over thousands of paths, tens of thousands of nodes, millions of samples
• Long running simulation of AHBHA can generate petabytes of data for moderate size networks.
• How can we accurately collect these measurements without sinking under their weight?
Space-Efficient Online Computation of Quantile Summaries
joint work with Sanjeev Khanna, University of Pennsylvania
Summarizing extremely large data sets
• The problem:
  – Vast quantities of data, perhaps ephemeral
  – Memory is limited and observations are lost once observed
  – Therefore construct a proxy data structure of manageable size, able to return the needed information
• What kind of information do we need? The distribution of values.
  – Quantile queries: given a quantile φ, return the value whose rank is φN
  – e.g. min, max, median, 90th percentile, 99th percentile…
• Munro & Paterson [1980] (Pohl [1969]): a p-pass algorithm to compute an exact quantile requires Ω(N^(1/p)) space. Approximation is required to reduce space.
Trading off accuracy for space
• Explicit a priori guarantee on the precision of the approximation, but try to use the smallest memory footprint possible (an ε-approximate quantile summary).
• Explicit and tunable a priori guarantee on the maximum memory footprint, and make the approximation as accurate as possible (histograms).
• An ε-approximate quantile summary can answer any quantile query to within a precision of εN:
  – Given a quantile φ, return a value whose rank is guaranteed to be within the interval [(φ - ε)N, (φ + ε)N]
Goal: construct an ε-approximate quantile summary using minimal space.
Requirements
• Explicit & tunable a priori guarantees on the precision of the approximation
• As small a memory footprint as possible
• Online: a single pass over the data
• Data-independent performance: guarantees should be unaffected by arrival order, distribution of values, or cardinality of observations
• Data-independent setup: no a priori knowledge required about the data set (size, range, distribution, order)
Related Work
• Manku, Rajagopalan, and Lindsay generalize a class of 1-pass algorithms (e.g. Agrawal & Swami [COMAD95], Alsabti, Ranka & Singh [VLDB97]):
  – [SIGMOD98]
    » a priori knowledge of the size of the data set
    » O((1/ε) log²(εN)) worst-case space
    » does not exploit any structure in observations
  – [SIGMOD99]
    » Gives up the deterministic guarantee in exchange for dropping the requirement of a priori knowledge of the size of the data set
• Gibbons, Matias, & Poosala [VLDB97] & Chaudhuri, Motwani, & Narasayya [SIGMOD98]
  – Multiple passes (& CMN gives only a probabilistic guarantee)
Our ε-approximate quantile summary
Overview of Summary Data Structure
• Keep a data structure that stores vi, rmin(vi), and rmax(vi) for each stored observation.
  – vi = value of the ith observation stored in the summary: <v0, v1, …, vi, …, vS-1>; |S| can be << N
  – rmin(vi) = minimum possible rank of vi
  – rmax(vi) = maximum possible rank of vi
[Example, ε = .01, N = 1750: entries 192 [501,503], 201 [529,536], 204 [539,540]]
Overview of Summary Data Structure
• Keep a data structure that stores vi, rmin(vi), and rmax(vi) for each observation. Tuple = {vi, gi, Δi}; gi = rmin(vi) - rmin(vi-1), Δi = rmax(vi) - rmin(vi)
• Quantile φ = .3? Compute r = φN and choose the best vi.
[Example, ε = .01, N = 1750: entries 192 [501,503] {15,2}, 201 [529,536] {28,7}, 204 [539,540] {10,1}; φ = .3, r = φN = 525]
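Choosing "the best vi" is a single scan over the tuples. A sketch with hypothetical names, assuming the summary satisfies the ε-approximation property described on the next slide.

```python
def gk_query(tuples, phi, n, eps):
    """Return a value whose rank is within eps*n of the target rank phi*n.
    `tuples` is a list of (v, g, delta), sorted by v, where
    rmin(v_i) = g_1 + ... + g_i and rmax(v_i) = rmin(v_i) + delta_i."""
    r = phi * n
    err = eps * n
    rmin = 0
    for v, g, delta in tuples:
        rmin += g
        # both rank bounds of v lie within eps*n of the target rank r
        if r - rmin <= err and (rmin + delta) - r <= err:
            return v
    return tuples[-1][0]
```

With an exact summary (every observation kept, g = 1, Δ = 0), the scan returns a value whose rank is within εN of φN.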
Overview of Summary Data Structure
• Keep a data structure that stores vi, rmin(vi), and rmax(vi) for each observation. Tuple = {vi, gi, Δi}; gi = rmin(vi) - rmin(vi-1), Δi = rmax(vi) - rmin(vi)
• If (rmax(vi+1) - rmin(vi) - 1) = (gi+1 + Δi+1 - 1) < 2εN for all i, then this is an ε-approximate summary.
• Our goal: always maintain this property.
• Insert new observations into the summary: a new value (e.g. 197) enters with g = 1 and a wide rank interval ([502,536], i.e. {1,34}); N increments (1750 -> 1751), shifting the ranks of the later entries.
• Delete all "superfluous" entries.
[Example, ε = .01, 2εN = 35: entries 192 [501,503] {15,2}, new 197 [502,536] {1,34}, 201 [530,537] {28,7}, 204 [540,541] {10,1}]
Reducing the space requirement of the summary
• "Delete all superfluous entries": what do we mean by "superfluous" entries?
• Goal: minimize workspace --- not the size of the final summary.
  – We can always reduce the final summary to size O(1/ε).
• The deletion rule ("compress") will reduce the summary size, but will take care to keep the workspace small regardless of incoming observations.
To explain the COMPRESS operation, we need to develop some more terminology.
Terminology
• Full tuple: a tuple is full if gi + Δi = floor(2εN)
• Full tuple pair: a pair of tuples is full if deleting the left-hand tuple would overfill the right one
• Capacity: the number of observations that can be counted by gi before the tuple becomes full (= floor(2εN) - Δi)
• We say that ti and tj have similar capacities if log capacity(ti) ≈ log capacity(tj) (intuition, not def'n)
• Similarity partitions the possible values of Δ into bands. Our general strategy will be to delete tuples with small capacity and preserve tuples with large capacity.
More Terminology: Tree Representation
• The bands can be used to impose a tree structure over the tuples.
  – Group tuples with similar capacities into bands.
  – The first (least index) node to the right with a higher capacity band becomes the parent.
[Example, ε = .001, N = 7,000, 2εN = 14:]

  Δ-range | Capacity | Band
  0-7     | 8-15     | 3
  8-11    | 4-7      | 2
  12-13   | 2-3      | 1
  14      | 1        | 0

[The tuples (vi, gi, Δi) 83,1,14  84,1,14  85,1,14  89,10,0  90,2,12  93,6,5  94,1,13 fall in bands 0 0 0 3 1 2 1, and are linked into a tree with root R.]
COMPRESS operation
General strategy: delete tuples with small capacity and preserve tuples with large capacity.
1) Deletion cannot leave descendants unmerged --- it must delete entire subtrees.
2) Deletion can only merge a tuple with small capacity into a tuple with similar or larger capacity.
3) Deletion cannot create an over-full tuple (i.e. one with g + Δ > floor(2εN)).
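The insert/compress cycle can be sketched in a few dozen lines. This is a deliberately simplified sketch: it enforces the over-full rule (3) but omits the band/tree machinery of rules (1)-(2), so it preserves the ε-approximation invariant without achieving the 11/(2ε) log(2εn) space bound; all names are illustrative.

```python
import math

class GKSummary:
    """Simplified epsilon-approximate quantile summary of tuples (v, g, delta)."""

    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.tuples = []                   # [v, g, delta], sorted by v

    def insert(self, v):
        threshold = math.floor(2 * self.eps * self.n)
        i = 0
        while i < len(self.tuples) and self.tuples[i][0] < v:
            i += 1
        # new min/max are known exactly; interior inserts get a wide interval
        delta = 0 if (i == 0 or i == len(self.tuples)) else max(threshold - 1, 0)
        self.tuples.insert(i, [v, 1, delta])
        self.n += 1
        if self.n % max(int(1 / (2 * self.eps)), 1) == 0:
            self._compress()

    def _compress(self):
        threshold = math.floor(2 * self.eps * self.n)
        i = 1                              # never merge away the minimum value
        while i + 1 < len(self.tuples):
            _, g, _ = self.tuples[i]
            _, g2, d2 = self.tuples[i + 1]
            if g + g2 + d2 < threshold:    # rule 3: merging must not over-fill
                self.tuples[i + 1][1] = g + g2
                del self.tuples[i]
            else:
                i += 1

    def query(self, phi):
        """Return a value whose rank is within eps*n of phi*n."""
        r = phi * self.n
        err = self.eps * self.n
        rmin = 0
        for v, g, delta in self.tuples:
            rmin += g
            if r - rmin <= err and (rmin + delta) - r <= err:
                return v
        return self.tuples[-1][0]
```

For example, summarizing 100 ordered observations with ε = 0.1 keeps far fewer than 100 tuples, while every quantile query stays within 10 ranks of the true answer.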
Analysis
Theorem: at any time n, the total number of tuples stored in S(n) is at most (11/(2ε)) log(2εn).
Sketch of proof:
• Each tuple requires the support of many observations in order to survive a COMPRESS.
• There are only n observations.
• Therefore only a relatively small number of tuples can survive.
Useful Lemmas
• A tuple that survives insertion at time m must have Δ = floor(2εm) (else it would be immediately deleted: it has no descendants, and if Δ were smaller then the parent would have capacity to absorb it).
• If Δi and Δj are ever in the same band, they will always be in the same band. (Technical details in the def'n of band; band boundaries are only deleted, never created.)
• The number of observations covered cumulatively by tuples in bands [0..α] is bounded by 2^α/ε.
Limited number of full tuple pairs in each band
For any given band α, at most 4/ε nodes from band α are right partners in a full tuple pair.
• Defn: if neighbors are a full tuple pair, then g*j-1 + gj + Δj > 2εn.
• Assume p pairs exist. Sum over all such pairs:
    Σ (g*j-1 + gj + Δj) > 2pεn
    Σ (2g*j + Δj) > 2pεn
• Σ g*j is bounded by the # of observations in bands [0..α] = 2^α/ε.
• Δj is bounded by the max Δ in band α, = 2εn - 2^(α-1).
• So 2^(α+1)/ε + p(2εn - 2^(α-1)) > 2pεn, hence 4/ε > p.
What about non-full tuple pairs? At most 1 per parent.
Each parent requires many descendants to survive COMPRESS
At time n, for any band α, at most 3/(2ε) nodes have a child in band α.
• Choose a parent Vi with a child in band α. Choose the rightmost child, Vj. Let mj (< n - 2^(α-1)/(2ε)) be the time Vj was inserted.
• The descendants of Vj, and anything merged into Vj, must have arrived after n - 2^(α+1)/(2ε).
• (g in picture) + gi + Δi > 2εn, and gi(mj) + Δi < 2εmj, so (g in picture) + gi > 2ε(n - mj).
• At most 2^(α+1)/(2ε) observations are available, and each Vi needs more than 2ε(2^(α-1)/(2ε)) = 2^(α-1) of them.
• Therefore at most 2/ε parents of nodes in band α (more complexity is needed to get to 3/(2ε)).
[Figure: parent Vi with rightmost child Vj in band α and its merged descendants]
Analysis
Theorem: at any time n, the total number of tuples stored in S(n) is at most (11/(2ε)) log(2εn).
Combining lemmas:
• 4/ε pairs per band
• At most 3/(2ε) parents of children in band α
• At most 1 singleton per parent
• 11/(2ε) tuples per band
• At most log(2εn) bands
Experimental Results
• Measurement:
  – |S|
  – Observed ε (vs. desired ε): max, avg, and for 16 representative quantiles
  – Optimal max observed
• Compared 3 algorithms:
  – MRL
  – Preallocated (1/3 the number of stored observations as MRL)
  – Adaptive: allocate a new quantile only when the observed error is about to exceed the desired ε
• Optimization in the algorithm:
  – Keep entries up to a high-water mark (can only help)
"Random" Input
[Figure: space (number of tuples, 0-27,500) vs. error (5.00E+00 to 5.00E+06), comparing MRL tuple space against ours]
Handling Deletions
• Artificial data set
• AT&T CDR: median length of an active phone call?
Summarizing Quantile Summaries
• Empirically, behaves very well indeed:
  – On average, for "random" input, seems to use constant space
• Best-known worst-case guarantees
• GK used as a black box to improve other algorithms:
  – Munro & Paterson's classic p-pass algorithm for computing the median exactly: GK reduces space/number of passes by a factor of Omega(log n)
  – Probabilistic quantile summaries
• The basic data structure has applications to other problems:
  – Order statistics in sensor networks
Concluding remarks
• AHBHA:
  – Seems very promising, but a lot of work is needed to evaluate it properly.
• Cing+:
  – Identified problems with existing techniques
  – The hybrid approach was an obvious idea, but required a lot of work and care to succeed
  – As accurate as cing, almost universally applicable
• Quantile Summaries:
  – Exploit as much information as possible.
  – The proof is unsatisfying & inelegant because of its complexity; the notion of bands and the COMPRESS operation are non-intuitive
  – The result is a significant improvement with several unexpected applications.
Simple ideas, complex implementation and analysis, large payoff.
General remarks
• A small shift in view can sometimes yield large reductions in complexity
• Even simple solutions to large scale problems are extremely difficult to evaluate --- many details, many cases, unexpected interactions, many metrics. As a discipline we do not have a good methodology for evaluation.
• Experimental results are surprisingly difficult to obtain, confirm, and evaluate. It is worth persevering.
• Formal analysis of sub-problems can give us solid ground to stand on even when large problem is analytically intractable. It can also yield significant practical improvements.
Successful systems research needs vision, experimental technique, and formal analytic skills.
Ongoing projects
• AHBHA: congestion control, network architecture
• Cing: network delay tomography, large scale measurement studies
• Streaming data: summaries
• Sensor networks: balanced power, order statistics, communication optimizations
• Coverage: cooperative virus defense w/ untrusted peers
• EXCHANGE: peer2peer incentives
• Harmony: generic, safe, reconciliation of OTS apps
• Canon: consistent security for heterogeneous systems
• NBS: practical non-blocking algs, contention in distributed algorithms
The End
Network Tomography: feasibility
• TIMESTAMP support?: 96% response to TIMESTAMPs
• TIMESTAMP indicative of normal packets?: within ms resolution
• Clock synchronization?: robust post-facto algorithm
• Irregular routing limits the choice of nodes
[Figures: path structure, Penn to Sprintlabs; corresponding feasible measurement partitions, Penn to Sprintlabs]
Network tomography: feasibility (2)
• Data: ~10k paths from 5 different sources
• Metric: fraction of nodes usable for tomography
• Results: ~50% of nodes are usable; more difficult as distance from the source increases; better when probing from multiple sources
Why is this not ideal? Accepted quibbles
• Non-TCP: assumes everything is TCP-friendly
• Packet loss due to errors (e.g. wireless) considered a congestion signal
• Bad RTE can also cause false signals
• "Mice" (congestion control only kicks in after 6 packets or so)
• Large bandwidth-delay pipes
• Self-similarity of traffic (bursty)
• Buffer occupancy (high)
• RED hard to configure to perform well (different parameters for different scenarios)
• Fairness
• QOS
Conjecture: the Internet is at a local maximum with very steep slopes
Some locally bad ideas:
• Hop-by-hop feedback
  – Head-of-line blocking
  – Local, so can't achieve global fairness
• Aggregation
  – Fractal nature of traffic
• Rate-based congestion control
  – Unbounded input, oscillatory
• Explicit out-of-band congestion notification packets
  – Adds to load under congestion, wastes bandwidth, and is unstable
Most new ideas, taken by themselves, make matters worse than standard TCP.
But put them all together…