MBG 1 May 2004

AHBHA: Managing Congestion
through Adaptive Hop-By-Hop Aggregation

Michael Greenwald, University of Pennsylvania
What is congestion?
• Congestion: applications/clients present a larger aggregate load than intermediate nodes in the network can handle.
• Congestion control: a mechanism that ensures the network remains manageable under overload.
Congestion control is much more difficult than the related problem of flow control, because the participants are unaware that they share a resource.
What causes congestion? (Isn't bandwidth cheap?)
• Persistent congestion: solved by adequate provisioning (true, bandwidth is cheap).
• Cause: intermittent high load
  – Intermittent emergency (earthquake, 9/11)
  – Extreme loads (expected: Mother's Day; unexpected: Pathfinder pictures)
  – Periods of growth
  – DOS or DDOS attack
• Effect: congestion
  – Bursty traffic (statistical multiplexing)
  – Many sources converge on a single link
  – Low-capacity link becomes a bottleneck
  – Subset of multicast destinations
Congestion still occurs, and will continue to, even though bandwidth is getting cheaper.
Why must congestion be controlled?
• Congestion collapse
  – Links clogged with useless packets: packets that will be dropped anyway, retransmissions, or out-of-date data
• Long delay
  – Relevant mostly for short transactions over long distances
• Variability in delay (jitter)
• Drop rate
  – Not a problem in itself, since packets are only dropped if they can't make it through the bottleneck anyway, but…
  – Dropped packets use up bandwidth on other links before being dropped
  – Control over which packets get dropped?
• Low utilization (inefficiency)
• Fairness
How is congestion controlled?
• Slow-start/congestion avoidance
• Losses-per-epoch/Fast Retransmit
• Full buffers, tail-drop => RED
• Non-compliant flows => FRED, penalty box, etc.
• Pkt drop/buffer size is a noisy signal => Vegas
• Adjust parameters => BLUE, ARED
• Avoid packet loss => ECN
• Explicit feedback => XCP
• Fat pipes => TCP FAST
• Fairness, lossy wireless => TCP Westwood
• Mice, fairness, QOS, bad RTE
Response
• Concern with robustness, efficiency, and fairness
• Control-theoretic approach:
  – Stability, convergence
• Make the world safe for control theory:
  – Controller reacts as quickly as the signal changes
    » Know RTT, react quickly, change slowly
  – Response to feedback must be predictable
    » Behavior of aggregate independent of # of flows
  – Behavior of client/application/transport predictable
Significant contributions, but…
A Different View
• Complexity: epicycles on epicycles
• Fragility
• The end-to-end argument, misinterpreted
  – Trapped by success, religious dogma, & need for field testing
• Congestion control common to all clients
• Don't optimize for a particular application, even TCP
A Different View
• View from routers: predictable response to feedback
• View from hosts: a delivery fabric with predictable congestion feedback
• Stark contrast with the current system:
  – Extreme example: aggregated small TCP flows do not exponentially decrease or linearly increase.
  – 1,000,000 flows, so the window for each flow is small (approx 1)
    » Congestion notifies 10% of the flows, which decrease by at most 10% of packets.
    » Regardless, each of the 1,000,000 flows increases cwnd by 1 each RTT, effectively doubling the rate (alternatively, a larger fraction is in slow start).
AHBHA: Adaptive Hop-by-Hop Aggregation
A simple idea
• Hop-by-hop: feedback and controller at each node.
  – Why any different than CreditNet (Kung) or HBH (Kanakia)?
• Aggregate flows based on purely local characteristics: (next-hop X TOS X QOS) X input
  – What about head-of-line blocking? Local vs. global behavior? Isolation of congestion?
• Transitive renaming of congested links
  – Based on the observation that most of the net is adequately provisioned.
[Figure: router architecture — inputs feeding an interconnect feeding outputs]
Controlling Utilization of a Resource
• Consider a flow queue with a current queue length, a known rate capacity, and a set of input flows.
  – Capacity may be a physical limit for a physical link, or a rate limit imposed by a neighbor for a finer-grained flow.
  – Input flows may be local flows sharing a single physical output link, or flows coming in from neighbors.
• Queue length > threshold triggers congestion control (at most once per RTT).
  – Must determine whether queue growth is due to burstiness or to input rate exceeding output rate.
  – If the former, smooth inputs; if the latter, throttle neighbors.
[Figure: several input flows feeding one output queue]
Controlling Utilization of a Resource
• MAIR (Mean Aggregate Input Rate) = Sum over inputs of [(smoothed) number of packets / interval]
• If MAIR < output capacity, then input flows need only be smoothed.
  – Pacing: 1/pkts-per-RTT
• If MAIR > output capacity, then input rates must be reduced. Acceptable rates are computed in 2 passes:
  – BaseAllocation = Capacity/Nflows;  // can use weights per flow instead
  – ExcessAllocation = 0; UncontrolledFlows = 0;
    for flow in inputs:
      if 2*flowRate < BaseAllocation then
        ExcessAllocation += BaseAllocation - 2*flowRate;
        UncontrolledFlows++;
  – FairAllocation = BaseAllocation + ExcessAllocation/(Nflows - UncontrolledFlows);
• Send FairAllocation to a flow if its input rate >= FairAllocation.
[Figure: input flows feeding one output queue; input from fairness controller; drain queue in 2 RTT]
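The two-pass allocation above can be sketched in Python. This is a minimal sketch: the function name, the list-of-rates input, and the dict return type are illustrative assumptions, not part of AHBHA.

```python
def control_step(capacity, rates):
    """Sketch of the per-resource control decision (names are assumptions).

    capacity: output rate capacity of the flow queue
    rates:    per-input-flow (smoothed) arrival rates (the MAIR components)
    Returns {flow_index: allowed_rate} for flows that must be throttled;
    an empty dict means smoothing alone suffices.
    """
    mair = sum(rates)                     # Mean Aggregate Input Rate
    if mair < capacity:
        return {}                         # smooth inputs only (pace at 1/pkts-per-RTT)

    # Pass 1: flows using well under their base share donate their excess.
    nflows = len(rates)
    base = capacity / nflows              # could use per-flow weights instead
    excess = 0.0
    uncontrolled = 0
    for r in rates:
        if 2 * r < base:
            excess += base - 2 * r
            uncontrolled += 1

    # Pass 2: split the donated excess among the remaining (controlled) flows.
    # (If every flow were uncontrolled, MAIR < capacity/2 and we returned above,
    # so the divisor is never zero.)
    fair = base + excess / (nflows - uncontrolled)
    return {i: fair for i, r in enumerate(rates) if r >= fair}
```

For example, with capacity 90 and input rates [5, 40, 45], the base share is 30, the lightly loaded flow donates 20, and the two heavy flows are each throttled to the fair allocation of 40.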
Renaming to Isolate Congestion
• R1->R2 becomes congested.
• QuarantineSet = {N | NextHop(N)@R1 = R2}
[Figure: routers R1 and R2, with the R1->R2 link congested]
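The QuarantineSet definition translates directly into code. A sketch under the assumption that R1's routing state is available as a destination-to-next-hop map; the names are hypothetical.

```python
def quarantine_set(next_hop, congested_neighbor):
    """QuarantineSet = {N | NextHop(N)@R1 = R2}: the destinations at R1
    whose next hop is the congested neighbor. `next_hop` maps
    destination -> next-hop router at R1 (an illustrative representation)."""
    return {dst for dst, nh in next_hop.items() if nh == congested_neighbor}
```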
Renaming to Isolate Congestion
• Create artificial node R1'.
• QuarantineSet = {N | NextHop(N)@R1 = R2}
• Routing update to all neighbors of R1 advertising R1' as the best path to {N}.
[Figure: R1, R2, and artificial node R1']
Renaming to Isolate Congestion
• If the queue to R1' is congested, recurse: split the artificial node and advertise to input queues.
[Figure: R1, R2, and artificial node R1']
Releasing control
• Record the time of the transition from uncontrolled to controlled.
• Record the time of the most recent ("last") congestion event.
  NewCongestionInterval = LastCongestionEvent - FirstCongestionEvent;
  CongestionInterval = max(NewCongestionInterval, OldCongestionInterval);
  If ((now - LastCongestionEvent) > CongestionInterval) {
    Release control;
    OldCongestionInterval = max(OldCongestionInterval/2, NewCongestionInterval);
  }
  // Release state after 10*CongestionInterval with no congestion
  // OldCongestionInterval > 4*max(IRTT, ORTT)
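The release rule above is stateful, so a small class makes the bookkeeping explicit. This sketch implements the literal pseudocode (release after one quiet CongestionInterval, not the 10x full-state release mentioned in the comment); all names are assumptions.

```python
class CongestionState:
    """Sketch of the release-control timer for one controlled aggregate."""

    def __init__(self):
        self.first = None          # time of transition uncontrolled -> controlled
        self.last = None           # time of most recent congestion event
        self.old_interval = 0.0    # OldCongestionInterval
        self.controlled = False

    def congestion_event(self, now):
        if not self.controlled:
            self.controlled = True
            self.first = now
        self.last = now

    def maybe_release(self, now):
        """Release control after a quiet period longer than CongestionInterval."""
        if not self.controlled:
            return False
        new_interval = self.last - self.first
        interval = max(new_interval, self.old_interval)
        if now - self.last > interval:
            self.controlled = False
            self.old_interval = max(self.old_interval / 2, new_interval)
            return True
        return False
```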
Miscellaneous Details
• Uncontrolled flows:
  – XMIT: if X packets were sent in RTT interval t0, then at most 2X packets in interval t1
• Periodic CC packets between immediate neighbors
  – List controlled flows and rates
  – Once per max(RTT/2, 20 packet times)
  – If no CC pkt from N in an RTT interval, then all flows are controlled at X/2
• CC packets are high priority
• Compute RTT
• Assume the max rate to neighbors is known
  – Good assumption for dedicated lines
  – May need to be estimated for Ether/shared channel or multi-hop neighbors
Advantages
• Works for: mice, elephants, non-TCP flows
• Long-delay flows: ramp-up in log(n) round-trips; high utilization
• Doesn't treat loss as a congestion signal
• Not sensitive to parameters
• Fairness decoupled from the CC mechanism: agnostic on policy, or on the policy delivery mechanism. Can work with either packet marking (diffserv) or flow-weighting (periodic packets from src to dst, providing per-flow weights).
• Aggregation + per-hop control makes flows smoother and less self-similar
• Response time to the source is comparable to e2e packet loss.
But you can say almost the same things about XCP, or FAST, or…
Serendipity:
• Buffer sizes:
  – Per-link rather than cross-network
  – 1 buffer per neighbor, rather than 1 per flow
• Broken routers, misbehaving hosts, DOS attacks
• Multicast
• Simplifies TCP
Preliminary Observations
• Significantly simpler than the current world (let alone a more complex world)
• AHBHA comparable in all cases; never significantly better. Simulations using ns2:
  – RED (varying capacities & loads), Floyd "TCP Friendly", FAST (*), TCP Westwood (*), XCP (**), AHBHA regions + legacy, defective routers, DOS, short flows (1 pkt), Mbottle, "SimpleTCP" w/ECN to source
  – (*) Compared to results in paper; (**) needed to compile separate versions of ns2
• Works with non-cooperating clients and routers.
Preliminary Non-Results
• Stability not proven
• Convergence not proven
• Convergence time not established
• (On the other hand, intuitive reasons to believe it is stable: e.g. bounded input increase, superposition of stable systems, RTTs are equal (not just by assumption))
• If there are many congestion points, then we lose congestion isolation
Unresolved Issues with Naming Controlled Aggregates
• 1 bit for renaming the next hop? B bits? An exhaustive list?
• Aggregate by next hop? Or by 2nd hop (horizon effect)?
• More hysteresis in determining CongestionInterval, rather than releasing control after a quiet period?
• The right choices depend on patterns of congestion in the real network.
• Measurement required.
cing+: Measuring Network-Internal Delays using only Existing Infrastructure
joint work with Kostas Anagnostakis, University of Pennsylvania,
and Raphael Ryger, Yale University
Remote measurement of per-link delays
• Network measurement techniques support:
  – Understanding of control mechanisms (such as TCP congestion control) --- both results and workload
  – Insights into network performance
  – Fault isolation, error reporting
  – Curiosity; switch providers?
• Network parameters such as delay, loss, and throughput are easy to measure end-to-end.
• The same parameters are difficult to measure on individual links inside the network.
Understand your tools: Know yourself
• How accurate are the results?
• Why do we believe it is accurate?
• What are its limitations?
• Answering these questions is difficult, sometimes surprising, and results in a much better tool.
Network Delay Tomography: A Brief History
From a remote source, S, estimate the distribution of link delays.
[Figure: source S probing a tree of routers A, B, C over links X1, X2, X3]
• Direct measurement, using existing tools (e.g. pathchar)
  – <RTT to tail> - <RTT to head> yields the RTT on the link (TTL-expired responses)
  – Uses only existing infrastructure: measure anywhere without cooperation.
  – But…
    » Are ICMP responses representative?
    » Asymmetric paths: return paths vary, so (tail - head) may not be meaningful.
    » Round-trip vs. one-way delay?
Network Delay Tomography: A Brief History
From a remote source, S, estimate the distribution of link delays.
• Indirect inference methods (e.g. the minc project)
  – One packet to multiple receivers; correlate behavior on links in the resulting tree
  – But:
    » Deployability (works best with multicast; needs cooperating rcvrs)
    » Accuracy (assumes independence of delay; quality of estimates degrades over longer paths)
    » Robustness (high variance in error)
    » Computational complexity
    » Need for many samples, therefore much time
Network Delay Tomography: A Brief History
From a remote source, S, estimate the distribution of link delays.
• Direct methods (e.g. the cing project)
  – f(<Timestamp to tail>, <Timestamp to head>) yields the delay on the link
  – No infrastructure required, highly accurate, strong experimental validation
  – But:
    » The packet pair may not encounter equal queues
    » ICMP processing may not be representative
    » Clocks are unsynchronized
    » Routing irregularity, so not always applicable
Network tomography: a direct method
• Use router ICMP Timestamp messages and packet-pair probes to directly estimate queuing delay.
[Figure: a packet pair sent at t0; the first packet is timestamped t1 at the head router after delay d1prop + d1q, the second is timestamped t2 at the tail router after the additional delay d2prop + d2q. Propagation delay is fixed; queueing delay is variable.]
Account for the fixed part by subtracting the minimum time over a set of observations.
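The "subtract the minimum" step can be shown directly. A sketch, not the full cing+ pipeline: the per-probe timestamp differences are assumed inputs, and the clock-offset and skew issues discussed later are ignored here.

```python
def queueing_delay_estimates(deltas):
    """Given per-probe timestamp differences (tail timestamp - head timestamp),
    the minimum over the observation set approximates the fixed part
    (propagation delay plus any constant clock offset); what remains is an
    estimate of the variable queueing delay on the link."""
    fixed = min(deltas)
    return [d - fixed for d in deltas]
```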
Question your assumptions
• Feasibility: is the basic mechanism supported? Accuracy? Stability of routing? etc.
• Do back-to-back packets really experience the same delay on their shared path?
• Are ICMP processing times indicative of processing times for "normal" packets?
• How do we account for differing offset and skew on clocks?
• Are the paths to adjacent nodes coincident?
Back-to-back packets
• Do packets arrive back-to-back?
• Do back-to-back packets experience identical queuing delays and processing time? (And do they stay back-to-back?)
• The distinctions are irrelevant to the algorithm; the issue is simply the difference in timestamped value.
• Experiment: probe routers with varied load and varied path length from the source (7300 routers).
  – 80% of routers: worst-case inter-pkt delay <= 7 ms.
  – 95+% of routers have 98% of inter-pkt delays <= 6 ms.
Caveats: our sources had good connectivity to the Internet, so this does not provide insight into slow links or high congestion near the source. This issue is common to all algorithms.
ICMP Processing Time
• Send the direct probe first, so queuing delays err conservatively and overestimate ICMP processing time.
• Median processing time always negligible
• 95th percentile usually negligible
• This issue is common to all direct measurements that use ICMP
• Variation in processing time between head & tail
• Comparison w/ non-ICMP traffic
[Figure: requests with spoofed src ("A2", req) to a site that allows spoofed src, responses to a cooperating rcvr. Dot = median, box = interquartile range, bars 5-95%, dots are outliers.]
ICMP Processing Time
• Spoofing and cooperation limit the scope of the experiment (7 targets, 20 routers).
• A broader study? If processing delays were significant on the head of a link, then the estimated queuing delay for the link should sometimes be negative.
• Occasionally present in 9.9% of the sample (1,368).
[Figure: spoofed-source requests with a cooperating rcvr. Dot = median, box = interquartile range, bars 5-95%, dots are outliers.]
Unsynchronized Clocks
• Clock offsets may vary because of clock skew or "jumps" to adjust for skew.
• Both src & dst may jump, and may be skewed in opposite directions.
• This may distort individual observations, as well as provide an erroneous minimum for d2prop.
• It is impossible to tell, for an individual observation, whether a time difference is due to queuing or a clock artifact.
[Figure: the timing diagram, with the timestamps perturbed by clock offsets OA,1 and OA,2, so the measured interval includes the term OA,2 - OA,1.]
Unsynchronized Clocks
• Post-processing looks at multiple observations.
• RTT provides valuable clues: queuing and max.
• We can recover skew; we only care whether a jump occurred between request & response.
• Look for colinear regions; label the others "can't tell".
[Figure: offset vs. local clock]
Routing Issues
• Reachability: it is easy to see that most links are not measurable by the direct method from a single source: many links connect to a node, but only one is on the path from S.
• (In some sense, S is mainly interested in links reachable from S.)
• Regularity: if the path to the head of a link is not a prefix of the path to the tail of the link, then we cannot meaningfully subtract the timestamp responses.
A routing map, R, is regular over a graph G if, for every node m on the path from s to d, Rs(m) is a prefix of Rs(d).
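The regularity condition can be checked mechanically for one destination. A sketch: the `route_to` map (node -> full path from s) is an assumed representation, not part of cing+.

```python
def is_regular(route_to, path_s_to_d):
    """Regularity for one destination d: the path from s to every intermediate
    node m on the s->d path must be a prefix of the s->d path itself.
    `route_to` maps node -> list of nodes from s (illustrative)."""
    for i, m in enumerate(path_s_to_d[1:-1], start=1):
        if route_to.get(m) != path_s_to_d[:i + 1]:
            return False
    return True
```

When this check fails for a link, the timestamp subtraction for that link is not meaningful and the link must be covered by another method.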
Irregular routing
Internet routing is irregular. Nevertheless:
  – Coverage for single links ranges from 20% (SRI) to 53% (LIACS)
  – Multiple sources increase the likelihood measurably
  – Multi-hop segments increase the likelihood of coverage
Simpler approach? TTL vs. Timestamp
• Why aren't TTL-limited RTT measurements (a la pathchar) sufficient?
  – TTL-limiting removes the routing problem
  – RTT measurements remove the clock problem
• Accuracy:
  – Asymmetry
  – One-way vs. round-trip
  – Not back-to-back on the return path
[Figure: comparison over 10,931 links]
A hybrid solution
• Indirect inference is accurate for small trees: look only at small, isolated trees.
• Timestamps and TTL-limited probes make every router a "cooperating receiver".
• Indirect inference can isolate return-path delays from forward-path delays.
• Indirect inference can determine the delay distribution in the shared portion of overlapping segments.
• Deconvolution.
A hybrid technique can be almost as accurate as the direct approach, and almost universally applicable.
Putting it all together
• By a combination/choice of timestamps, RTT, and TTL-expiration, using either the indirect methods of MINC or deconvolution, we can cover just about every link in the Internet, often by many methods.
• But which methods should we use?
Putting it all together
• A shared link is most accurate for MINC.
• Deconvolution is only as accurate as the least accurate segment.
• But which methods should we use?
Relative Accuracy
• Estimated vs. actual mean delay
• The 2nd row shows the effect of divergent paths: 200ms extra delay on the path of the 2nd pair
Increased Coverage
[Figures: multiple sources; granularity vs. accuracy]
Collecting vast quantities of data
• Individual delay measurements over thousands of paths, tens of thousands of nodes, millions of samples
• Long running simulation of AHBHA can generate petabytes of data for moderate size networks.
• How can we accurately collect these measurements without sinking under their weight?
Space-Efficient Online Computation of Quantile Summaries
joint work with Sanjeev Khanna, University of Pennsylvania
Summarizing extremely large data sets
• The problem:
  – Vast quantities of data, perhaps ephemeral
  – Memory is limited and observations are lost once observed
  – Therefore construct a proxy data structure of manageable size, able to return the needed information
• What kind of information do we need? The distribution of values.
  – Quantile queries: given a quantile φ, return the value whose rank is φN
  – e.g. min, max, median, 90th percentile, 99th percentile…
• Munro & Paterson [1980] (Pohl [1969]): a p-pass algorithm to compute an exact quantile requires Ω(N^(1/p)) space. Approximation is required to reduce space.
Trading off accuracy for space
• Explicit a priori guarantee on the precision of the approximation, but try to use the smallest memory footprint possible (an ε-approximate quantile summary).
• Explicit and tunable a priori guarantee on the maximum memory footprint, and make the approximation as accurate as possible (histograms).
• An ε-approximate quantile summary can answer any quantile query to within a precision of εN:
  – Given a quantile φ, return a value whose rank is guaranteed to be within the interval [(φ - ε)N, (φ + ε)N]
Goal: construct an ε-approximate quantile summary using minimal space.
Requirements
• Explicit & tunable a priori guarantees on the precision of the approximation
• As small a memory footprint as possible
• Online: a single pass over the data
• Data-independent performance: guarantees should be unaffected by arrival order, distribution of values, or cardinality of observations
• Data-independent setup: no a priori knowledge required about the data set (size, range, distribution, order)
Related Work
• Manku, Rajagopalan, and Lindsay generalize a class of 1-pass algorithms (e.g. Agrawal & Swami [COMAD95], Alsabti, Ranka & Singh [VLDB97]):
  – [SIGMOD98]
    » a priori knowledge of the size of the data set
    » O((1/ε) log²(εN)) worst-case space
    » does not exploit any structure in observations
  – [SIGMOD99]
    » Gives up the deterministic guarantee in exchange for dropping the requirement of a priori knowledge of the size of the data set
• Gibbons, Matias, & Poosala [VLDB97] & Chaudhuri, Motwani, & Narasayya [SIGMOD98]
  – Multiple passes (& CMN gives only a probabilistic guarantee)
Our ε-approximate quantile summary
Overview of Summary Data Structure
• Keep a data structure that stores vi, rmin(vi), and rmax(vi) for each stored observation.
  – vi = value of the ith observation stored in the summary: <v0, v1, …, vi, …, vS-1>; |S| can be << N
  – rmin(vi) = minimum possible rank of vi
  – rmax(vi) = maximum possible rank of vi
[Example, ε = .01, N = 1750: entries 192 [501,503], 201 [529,536], 204 [539,540]]
Overview of Summary Data Structure
• Keep a data structure that stores vi, rmin(vi), and rmax(vi) for each observation. Tuple = {vi, gi, Δi}; gi = rmin(vi) - rmin(vi-1), Δi = rmax(vi) - rmin(vi)
• Quantile φ = .3? Compute r = φN and choose the best vi.
[Example, ε = .01, N = 1750: entries 192 [501,503] {15,2}, 201 [529,536] {28,7}, 204 [539,540] {10,1}; φ = .3, r = φN = 525]
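Choosing "the best vi" is a single scan over the tuples. A sketch with hypothetical names, assuming the summary satisfies the ε-approximation property described on the next slide.

```python
def gk_query(tuples, phi, n, eps):
    """Return a value whose rank is within eps*n of the target rank phi*n.
    `tuples` is a list of (v, g, delta), sorted by v, where
    rmin(v_i) = g_1 + ... + g_i and rmax(v_i) = rmin(v_i) + delta_i."""
    r = phi * n
    err = eps * n
    rmin = 0
    for v, g, delta in tuples:
        rmin += g
        # both rank bounds of v lie within eps*n of the target rank r
        if r - rmin <= err and (rmin + delta) - r <= err:
            return v
    return tuples[-1][0]
```

With an exact summary (every observation kept, g = 1, Δ = 0), the scan returns a value whose rank is within εN of φN.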
Overview of Summary Data Structure
• Keep a data structure that stores vi, rmin(vi), and rmax(vi) for each observation. Tuple = {vi, gi, Δi}; gi = rmin(vi) - rmin(vi-1), Δi = rmax(vi) - rmin(vi)
• If (rmax(vi+1) - rmin(vi) - 1) = (gi+1 + Δi+1 - 1) < 2εN for all i, then this is an ε-approximate summary.
• Our goal: always maintain this property.
• Insert new observations into the summary: a new value (e.g. 197) enters with g = 1 and a wide rank interval ([502,536], i.e. {1,34}); N increments (1750 -> 1751), shifting the ranks of the later entries.
• Delete all "superfluous" entries.
[Example, ε = .01, 2εN = 35: entries 192 [501,503] {15,2}, new 197 [502,536] {1,34}, 201 [530,537] {28,7}, 204 [540,541] {10,1}]
Reducing the space requirement of the summary
• "Delete all superfluous entries": what do we mean by "superfluous" entries?
• Goal: minimize workspace --- not the size of the final summary.
  – We can always reduce the final summary to size O(1/ε).
• The deletion rule ("compress") will reduce the summary size, but will take care to keep the workspace small regardless of incoming observations.
To explain the COMPRESS operation, we need to develop some more terminology.
Terminology
• Full tuple: a tuple is full if gi + Δi = floor(2εN)
• Full tuple pair: a pair of tuples is full if deleting the left-hand tuple would overfill the right one
• Capacity: the number of observations that can be counted by gi before the tuple becomes full (= floor(2εN) - Δi)
• We say that ti and tj have similar capacities if log capacity(ti) ≈ log capacity(tj) (intuition, not def'n)
• Similarity partitions the possible values of Δ into bands. Our general strategy will be to delete tuples with small capacity and preserve tuples with large capacity.
More Terminology: Tree Representation
• The bands can be used to impose a tree structure over the tuples.
  – Group tuples with similar capacities into bands.
  – The first (least index) node to the right with a higher capacity band becomes the parent.
[Example, ε = .001, N = 7,000, 2εN = 14:]

  Δ-range | Capacity | Band
  0-7     | 8-15     | 3
  8-11    | 4-7      | 2
  12-13   | 2-3      | 1
  14      | 1        | 0

[The tuples (vi, gi, Δi) 83,1,14  84,1,14  85,1,14  89,10,0  90,2,12  93,6,5  94,1,13 fall in bands 0 0 0 3 1 2 1, and are linked into a tree with root R.]
COMPRESS operation
General strategy: delete tuples with small capacity and preserve tuples with large capacity.
1) Deletion cannot leave descendants unmerged --- it must delete entire subtrees.
2) Deletion can only merge a tuple with small capacity into a tuple with similar or larger capacity.
3) Deletion cannot create an over-full tuple (i.e. one with g + Δ > floor(2εN)).
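The insert/compress cycle can be sketched in a few dozen lines. This is a deliberately simplified sketch: it enforces the over-full rule (3) but omits the band/tree machinery of rules (1)-(2), so it preserves the ε-approximation invariant without achieving the 11/(2ε) log(2εn) space bound; all names are illustrative.

```python
import math

class GKSummary:
    """Simplified epsilon-approximate quantile summary of tuples (v, g, delta)."""

    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.tuples = []                   # [v, g, delta], sorted by v

    def insert(self, v):
        threshold = math.floor(2 * self.eps * self.n)
        i = 0
        while i < len(self.tuples) and self.tuples[i][0] < v:
            i += 1
        # new min/max are known exactly; interior inserts get a wide interval
        delta = 0 if (i == 0 or i == len(self.tuples)) else max(threshold - 1, 0)
        self.tuples.insert(i, [v, 1, delta])
        self.n += 1
        if self.n % max(int(1 / (2 * self.eps)), 1) == 0:
            self._compress()

    def _compress(self):
        threshold = math.floor(2 * self.eps * self.n)
        i = 1                              # never merge away the minimum value
        while i + 1 < len(self.tuples):
            _, g, _ = self.tuples[i]
            _, g2, d2 = self.tuples[i + 1]
            if g + g2 + d2 < threshold:    # rule 3: merging must not over-fill
                self.tuples[i + 1][1] = g + g2
                del self.tuples[i]
            else:
                i += 1

    def query(self, phi):
        """Return a value whose rank is within eps*n of phi*n."""
        r = phi * self.n
        err = self.eps * self.n
        rmin = 0
        for v, g, delta in self.tuples:
            rmin += g
            if r - rmin <= err and (rmin + delta) - r <= err:
                return v
        return self.tuples[-1][0]
```

For example, summarizing 100 ordered observations with ε = 0.1 keeps far fewer than 100 tuples, while every quantile query stays within 10 ranks of the true answer.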
Analysis
Theorem: at any time n, the total number of tuples stored in S(n) is at most (11/(2ε)) log(2εn).
Sketch of proof:
• Each tuple requires the support of many observations in order to survive a COMPRESS.
• There are only n observations.
• Therefore only a relatively small number of tuples can survive.
Useful Lemmas
• A tuple that survives insertion at time m must have Δ = floor(2εm) (else it would be immediately deleted: it has no descendants, and if Δ were smaller then the parent would have capacity to absorb it).
• If Δi and Δj are ever in the same band, they will always be in the same band. (Technical details in the def'n of band; band boundaries are only deleted, never created.)
• The number of observations covered cumulatively by tuples in bands [0..α] is bounded by 2^α/ε.
Limited number of full tuple pairs in each band
For any given band α, at most 4/ε nodes from band α are right partners in a full tuple pair.
• Defn: if neighbors are a full tuple pair, then g*j-1 + gj + Δj > 2εn.
• Assume p pairs exist. Sum over all such pairs:
    Σ (g*j-1 + gj + Δj) > 2pεn
    Σ (2g*j + Δj) > 2pεn
• Σ g*j is bounded by the # of observations in bands [0..α] = 2^α/ε.
• Δj is bounded by the max Δ in band α, = 2εn - 2^(α-1).
• So 2^(α+1)/ε + p(2εn - 2^(α-1)) > 2pεn, hence 4/ε > p.
What about non-full tuple pairs? At most 1 per parent.
Each parent requires many descendants to survive COMPRESS
At time n, for any band α, at most 3/(2ε) nodes have a child in band α.
• Choose a parent Vi with a child in band α. Choose the rightmost child, Vj. Let mj (< n - 2^(α-1)/(2ε)) be the time Vj was inserted.
• The descendants of Vj, and anything merged into Vj, must have arrived after n - 2^(α+1)/(2ε).
• (g in picture) + gi + Δi > 2εn, and gi(mj) + Δi < 2εmj, so (g in picture) + gi > 2ε(n - mj).
• At most 2^(α+1)/(2ε) observations are available, and each Vi needs more than 2ε(2^(α-1)/(2ε)) = 2^(α-1) of them.
• Therefore at most 2/ε parents of nodes in band α (more complexity is needed to get to 3/(2ε)).
[Figure: parent Vi with rightmost child Vj in band α and its merged descendants]
Analysis
Theorem: at any time n, the total number of tuples stored in S(n) is at most (11/(2ε)) log(2εn).
Combining lemmas:
• 4/ε pairs per band
• At most 3/(2ε) parents of children in band α
• At most 1 singleton per parent
• 11/(2ε) tuples per band
• At most log(2εn) bands
Experimental Results
• Measurement:
  – |S|
  – Observed ε (vs. desired ε): max, avg, and for 16 representative quantiles
  – Optimal max observed
• Compared 3 algorithms:
  – MRL
  – Preallocated (1/3 the number of stored observations as MRL)
  – Adaptive: allocate a new quantile only when the observed error is about to exceed the desired ε
• Optimization in the algorithm:
  – Keep entries up to a high-water mark (can only help)
"Random" Input
[Figure: space (number of tuples, 0-27,500) vs. error (5.00E+00 to 5.00E+06), comparing MRL tuple space against ours]
Handling Deletions
• Artificial data set
• AT&T CDR: median length of an active phone call?
Summarizing Quantile Summaries
• Empirically, behaves very well indeed:
  – On average, for "random" input, seems to use constant space
• Best-known worst-case guarantees
• GK used as a black box to improve other algorithms:
  – Munro & Paterson's classic p-pass algorithm for computing the median exactly: GK reduces space/number of passes by a factor of Omega(log n)
  – Probabilistic quantile summaries
• The basic data structure has applications to other problems:
  – Order statistics in sensor networks
Concluding remarks
• AHBHA:
  – Seems very promising, but a lot of work is needed to evaluate it properly.
• Cing+:
  – Identified problems with existing techniques
  – The hybrid approach was an obvious idea, but required a lot of work and care to succeed
  – As accurate as cing, almost universally applicable
• Quantile Summaries:
  – Exploit as much information as possible.
  – The proof is unsatisfying & inelegant because of its complexity; the notion of bands and the COMPRESS operation are non-intuitive
  – The result is a significant improvement with several unexpected applications.
Simple ideas, complex implementation and analysis, large payoff.
General remarks
• A small shift in view can sometimes yield large reductions in complexity
• Even simple solutions to large scale problems are extremely difficult to evaluate --- many details, many cases, unexpected interactions, many metrics. As a discipline we do not have a good methodology for evaluation.
• Experimental results are surprisingly difficult to obtain, confirm, and evaluate. It is worth persevering.
• Formal analysis of sub-problems can give us solid ground to stand on even when large problem is analytically intractable. It can also yield significant practical improvements.
Successful systems research needs vision, experimental technique, and formal analytic skills.
Ongoing projects
• AHBHA: congestion control, network architecture
• Cing: network delay tomography, large scale measurement studies
• Streaming data: summaries
• Sensor networks: balanced power, order statistics, communication optimizations
• Coverage: cooperative virus defense w/ untrusted peers
• EXCHANGE: peer2peer incentives
• Harmony: generic, safe, reconciliation of OTS apps
• Canon: consistent security for heterogeneous systems
• NBS: practical non-blocking algs, contention in distributed algorithms
The End
Network Tomography: feasibility
• TIMESTAMP support?: 96% response to TIMESTAMPs
• TIMESTAMP indicative of normal packets?: within ms resolution
• Clock synchronization?: robust post-facto algorithm
• Irregular routing limits the choice of nodes
[Figures: path structure, Penn to Sprintlabs; corresponding feasible measurement partitions, Penn to Sprintlabs]
Network tomography: feasibility (2)
• Data: ~10k paths from 5 different sources
• Metric: fraction of nodes usable for tomography
• Results: ~50% of nodes are usable; more difficult as distance from the source increases; better when probing from multiple sources
Why is this not ideal? Accepted quibbles
• Non-TCP: assumes everything is TCP-friendly
• Packet loss due to errors (e.g. wireless) considered a congestion signal
• Bad RTE can also cause false signals
• "Mice" (congestion control only kicks in after 6 packets or so)
• Large bandwidth-delay pipes
• Self-similarity of traffic (bursty)
• Buffer occupancy (high)
• RED hard to configure to perform well (different parameters for different scenarios)
• Fairness
• QOS
Conjecture: the Internet is at a local maximum with very steep slopes
Some locally bad ideas:
• Hop-by-hop feedback
  – Head-of-line blocking
  – Local, so can't achieve global fairness
• Aggregation
  – Fractal nature of traffic
• Rate-based congestion control
  – Unbounded input, oscillatory
• Explicit out-of-band congestion notification packets
  – Adds to load under congestion, wastes bandwidth, and is unstable
Most new ideas, taken by themselves, make matters worse than standard TCP.
But put them all together…