
TECHNICAL REPORT TR10-02, COMNET, TECHNION, ISRAEL

Providing Performance Guarantees in Multipass Network Processors

Isaac Keslassy, Kirill Kogan, Gabriel Scalosub, Michael Segal

Abstract—Current network processors (NPs) increasingly deal with packets with heterogeneous processing times. In such an environment, packets that require many processing cycles delay low-latency traffic, because the common approach in today's NPs is to employ run-to-completion processing. These difficulties have led to the emergence of the Multipass NP architecture, where after a processing cycle ends, all processed packets are recycled into the buffer and re-compete for processing resources.

In this work we provide a model that captures many of the characteristics of this architecture, and consider several scheduling and buffer management algorithms that are specially designed to optimize the performance of multipass network processors. In particular, we provide analytical guarantees for the throughput performance of our algorithms. We further conduct a comprehensive simulation study which validates our results.

I. INTRODUCTION

A. Background

Multi-core Network Processors (NPs) are widely used to perform complex packet processing tasks in modern high-speed routers. NPs are able to address such diverse functions as forwarding, classification, protocol conversion, DPI, intrusion detection, SSL, NAT, firewalling, and traffic engineering. They are often implemented using many processing cores. These cores are either arranged as a pool of identical cores (e.g., the Cavium CN68XX [28] or the AMCC nP7310 [29]), as a long pipeline of cores (e.g., the Xelerated X11 [30]), or as a combination of both (e.g., the EZChip NP-4 [31] or the Netronome NFP-32xx [32]).

These architectures are very efficient for simple traffic mixes. However, following operator demands, packet processing needs are becoming more heterogeneous and rely on a growing number of more complex features, such as advanced VPN encryption (like IPsec-VPN and SSL-VPN), LZS decompression, VoIP SBC, video CAC, per-subscriber queueing, and hierarchical classification for QoS [20], [28], [33], [34].

The authors would like to thank Isask'har (Zigi) Walter for his comments. This work was partly supported by the US Air Force European Office of Aerospace Research and Development, grant FA8655-09-1-3016, Deutsche Telecom, the European project FLAVIA, the Israeli Ministry of Industry, Trade and Labor (consortium CORNET), as well as by the European Research Council Starting Grant no. 210389.

Please note that this technical report updates several results from an earlier manuscript, especially regarding Theorems 1, 2 and 7. In particular, Theorem 7 now relies on a more intuitive algorithm based on the current number of residual passes.

I. Keslassy is with the Dept. of Electrical Engineering, Technion, Haifa, Israel. Email: [email protected].

K. Kogan is with Cisco Systems and the Dept. of Communication Systems Engineering, Beer-Sheva, Israel. Email: [email protected].

G. Scalosub is with the Dept. of Communication Systems Engineering, Beer-Sheva, Israel. Email: [email protected].

M. Segal is with the Dept. of Communication Systems Engineering, Beer-Sheva, Israel. Email: [email protected].

These features are increasingly challenging for traditional architectures, posing implementation, fairness, and benchmarking issues. First, longer and more complex features require either deeper pipeline lengths (e.g., 512 PISC processor cores in the Xelerated HX3XX series [30]) or longer processing times in run-to-completion cores. Second, a few packets with many features can delay, and even temporarily starve, the later packets. In fact, given limited high-speed buffering, this might lead to large drop rates upon congestion. This was illustrated in the Christmas tree packet DoS (Denial-of-Service) attack, in which each packet "lights up" several IP options processing bits [20]. Finally, and maybe more significantly, typical benchmarking tests used to rely on a simple stream of minimum-sized packets with only a basic IP forwarding service to measure the "worst-case throughput" of an NP [20], [23]. As benchmarking tests start to measure throughput given more advanced processing features, the impact of these features will be even more highlighted.

In view of the increasing impact of the packets with heavy features, another NP architecture has emerged as a leading alternative in the industry: the Multipass NP architecture. In this architecture, the processing time of a packet is divided into several time intervals, called passes or cycles. Intuitively, when a packet arrives at the NP, it is sent to a processing core. Then, after the core completes its processing pass, the packet is recycled into the set of packets awaiting processing. And so on, until all the packet passes are completed.

In practice, another appeal of the multipass architecture is that it does not require the NP designer to define a large pipeline length in advance. This is especially useful for NPs with different possible markets. In addition, note that in multipass NPs, actually recycling packets would involve complex interconnections and large buffers. Therefore, to decrease the cost of recycling, packets practically stay buffered and small control messages go through recycling instead.

This NP architecture with recycling has for instance been implemented in the recent Cisco QuantumFlow NP [33]. Forming the heart of Cisco's most recent ASR 1000 edge routers, this 40-core NP might become the most widespread among high-speed routers. Also, although not strictly multipass NP architectures, several NP architectures in the literature already allow for recycling of complex packets, such as [24] for IP control packets.

Given a heterogeneous set of packet processing times, the scheduler plays a significant role in the multipass NP architecture. This is because it should make sure that heavy packets with many passes do not monopolize the cores and starve packets with fewer passes.

To the best of our knowledge, despite the emergence of the multipass NP architecture, there has not yet been any analysis of its scheduler performance in the literature. In particular, NP schedulers are typically designed for the worst-case throughput to support a guaranteed wire rate (see Section 2.2 in [23]). But little is known regarding the worst-case throughput of the various possible multipass NP schedulers.

The goal of this paper is to offer designs with proven performance guarantees for the multipass-NP scheduler. Our solutions enable dealing with the various requirements posed to the scheduler (such as delay, throughput, and implementation complexity), and illustrate tradeoffs as to the scheduler's ability to fulfill these requirements. Our analysis also makes it possible for the designer of future multipass NPs to have analytical worst-case guarantees on the NP performance, including for traffic with complex processing needs.

B. Our Contributions

In this paper, we analyze the performance of scheduling and buffer management policies in multipass NPs, and provide guarantees as to their worst-case throughput.

We consider settings where each arriving packet requires some number of processing passes, and study the interplay of three factors:

• The scheduling policy: we study both FIFO buffers and Priority Queues (where priority is determined by the number of remaining passes required).

• The buffer management policy: we design and evaluate both preemptive policies (where packets residing in the buffer can be discarded) and non-preemptive policies.

• The implementation cost: our model allows for a copying cost of packets into the buffer, which reflects the impact that multiple accesses to the buffer have on the system's throughput.

We design and analyze algorithms which aim at maximizing the overall value obtained by the system, which is affected by both the packet-level throughput (considered as benefit) and the copying cost (considered as penalty). We note that our model can also be used to model cold-cache penalties. A detailed description of our model is given in Section II.

For our analytical results, we use competitive analysis to evaluate the performance of our proposed policies. For the case where no copying cost is incurred, we design and analyze buffer management algorithms for both FIFO- and PQ-based environments. We show that non-preemptive architectures may suffer from extremely large performance degradation compared to the optimal performance possible. On the other hand, we prove that natural buffer management policies for PQ-based environments are optimal when preemption is allowed, and further show that FIFO-based environments endowed with preemption, although they are not optimal, can obtain a reasonable guaranteed throughput compared to the optimal performance possible, which depends only on the maximum number of passes a packet requires. These results are presented in Section III. For the case where the system incurs a strictly positive copying cost, we devise competitive buffer management algorithms for PQ-based environments, and provide an elaborate analysis of their performance guarantees. These results are presented in Section IV.

To complete our study, we present a simulation study that further validates our results and provides additional insights as to the performance of multicore NPs. Specifically, our results show that the design criteria governing our algorithms, which are intended to optimize towards the worst-case scenario, exhibit very good performance also for simulated average-case traffic. In addition, our simulation study shows that the number of available cores has a striking non-trivial effect on the performance of the various policies we propose. These results are presented in Section V.

Our work gives rise to a multitude of questions and possible extensions. We discuss these further in Section VI.

C. Related Work

As mentioned above, recycling is not new in NPs and has previously appeared in the literature, especially for particularly complex packets that cannot be processed using a typical pipelining scheme. For instance, in the Open Network Lab's NP [24], the XScale recycles IP control packets and other exceptional packets, and also enables the use of plugins for special processing. Multipass NP power consumption has also been a previous topic of study. However, to our knowledge, there is no previous work in the literature that discusses the scheduling and buffer management policies in multipass NPs. Several other related topics have been studied in the context of NPs, namely, task mapping and load-balancing [13], [25]. In addition, [18] introduces a queueing network to model NP resources and application work flows. Finally, [8] further studies the impact of caches on performance. However, none of these papers considers the impact of recycling in their models. Moreover, no paper analyzes the impact of the packet admission control policy on the worst-case NP performance.

There is also a long history of OS scheduling for multithreaded processors. A comprehensive overview of competitive online scheduling for server systems is provided in [22]. For instance, the SRPT (Shortest Remaining Processing Time) algorithm always runs the job with the least amount of remaining processing time, and it is well known that SRPT is optimal for average response time [21]. Additional objectives, models, and algorithms have been studied extensively in this context (e.g., [10], [17], [19], [21], to name but a few). Another related line of research, which takes into account some notion of "throughput" in the context of OS scheduling, assigns jobs to processors with fluctuating speeds that are not known precisely to the scheduler [11]. When comparing this body of research with the framework of NPs, one should note that OS scheduling is mostly concerned with average response time, average slowdown, etc., while NP scheduling is targeted at providing (worst-case) guarantees on the throughput. In addition, NP scheduling is unique in that it inherently has a limited-size buffer.

Another large body of research related to our work focuses on competitive packet scheduling and buffer management, mostly for various switching architectures, such as Output-Queued (OQ) switches (e.g., [1], [16]), shared memory switches with OQs (e.g., [12], [15]), and merging buffers (e.g., [14]). This body of work has also studied more general multi-queue switches (e.g., [2], [4]–[6]). Some works also provide experimental studies of these algorithms and further validate their performance [3].

Fig. 1. An outline of the architecture model, as an abstraction of a standard Multipass NP Architecture (see, e.g., [33]): the SM manages the buffer/queues of the IB (memory) and assigns packets to the PPEs (PPE1, PPE2, . . . , PPEC); arrivals enter the IB, completed packets depart, and recycling control messages return packets for further passes.

II. MODEL DESCRIPTION

A. Multipass NP Architecture

Figure 1 illustrates the multipass NP architectural model used in this paper. It is a simplified model of the Cisco QuantumFlow NP architecture [33]. The three major modules in our model are: (a) the Input Buffer (IB), (b) the Scheduler Module (SM), and (c) a set of C cores or Packet Processing Elements (PPEs).

First, the IB module is used to buffer incoming packets. The IB holds a buffer that can contain at most B packets. It obeys a given Buffering Model (BM), as defined later. Second, the SM module has two main functionalities in our model: the buffer management, as later described, and the assignment of packets to PPEs, by binding each PPE with its corresponding IB packet. Each PPE element is a processing core that works on a specific packet stored in the IB for one cycle (a predefined period of time), also referred to as a time slot. For simplicity we assume that each PPE is single-threaded.

We divide time into discrete time slots, where each step consists of four phases: (i) transmission, in which completed packets leave the NP and incomplete control packets for those with remaining passes are recycled, (ii) arrival, in which the SM performs its buffer management task considering newly arrived packets and recycled control packets (observe that recycled control packets are admitted to the IB before new arrivals), (iii) scheduling, in which C head-of-queue packets are designated for processing, and (iv) processing, in which the SM assigns a designated packet to each PPE, and packet processing takes place.

We assume arbitrary packet arrivals (i.e., the arrival sequence is not governed by any specific stochastic process, and may even be adversarial).

We also assume that all packets have unit size. Each arriving packet p is further stamped with the number of passes it requires from the NP, denoted r(p). This number is essentially the number of times the packet should be assigned to a PPE if it is to be successfully delivered. The availability of this information relies on [26], which shows that "processing on an NP is highly regular and predictable. Therefore it is possible to use processing time predictions in admission control and scheduling decisions."
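To make the time-slot structure above concrete, here is a minimal Python sketch of one time slot under the model's assumptions (unit-size packets, greedy admission, FIFO queue order); the class and function names are ours, for illustration only, and are not part of the paper's algorithms:

    from dataclasses import dataclass, field

    @dataclass
    class Packet:
        passes: int  # r(p): residual processing passes still required

    @dataclass
    class MultipassNP:
        B: int                          # IB buffer capacity
        C: int                          # number of PPEs (cores)
        buffer: list = field(default_factory=list)
        delivered: int = 0

        def time_slot(self, arrivals):
            # (i) transmission: completed packets leave the NP; packets with
            # remaining passes stay buffered (their recycled control
            # messages are implicit here).
            self.delivered += sum(1 for p in self.buffer if p.passes == 0)
            self.buffer = [p for p in self.buffer if p.passes > 0]
            # (ii) arrival: greedy admission while there is room (recycled
            # packets are already in the buffer, i.e., admitted first).
            for p in arrivals:
                if len(self.buffer) < self.B:
                    self.buffer.append(p)
            # (iii) scheduling: designate C head-of-queue packets ...
            # (iv) processing: ... and give each of them one processing pass.
            for p in self.buffer[:self.C]:
                p.passes -= 1

    npu = MultipassNP(B=4, C=2)
    for t in range(12):
        npu.time_slot([Packet(passes=3)] if t < 5 else [])
    print(npu.delivered)  # all 5 admitted packets eventually depart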

B. Problem Statement and Objectives

In the NP multipass architecture, new packets incur higher costs than recycled packets. New packets admitted to the buffer monopolize part of the memory link capacity to enter the memory, and therefore require more capacity in the memory access implementation of an NP. Each new packet also needs to update many pointers and associated structures at link speeds. These costs are substantially higher than the costs associated with recycled control packets corresponding to packets already stored in the buffer.

To reflect the value of throughput, we assume that each departed packet has unit value. However, to reflect the cost of admitting new packets, each newly admitted packet is also assumed to incur a fixed copying cost of α ∈ [0, 1) for copying it to the IB. Finally, we measure the final value as the total throughput value minus the total copying cost.

Any specific architecture corresponding to our model can be summarized by a 4-tuple (B, BM, α, C), where B denotes the buffer size available for the IB, BM is the buffering model (in this paper it will usually be PQ or FIFO), α is the copying cost, and C is the number of available PPEs.

Our objective is the following: given a (B, BM, α, C)-architecture, and given some finite arrival sequence, maximize the value of successfully delivered packets.

For the case where α = 0, the overall value of successfully delivered packets is equal to the system's packet-level throughput. For the case where α > 0, the overall value of successfully delivered packets equals the throughput minus the overall copying cost incurred by admitting packets to the IB.
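Stated as code, the objective over a run reduces to the following one-line helper (the names `delivered` and `admitted` are ours, chosen for illustration):

    def overall_value(delivered: int, admitted: int, alpha: float) -> float:
        # unit value per successfully delivered packet, minus the copying
        # cost alpha paid once per distinct packet admitted to the IB
        return delivered - alpha * admitted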

Our goal is to provide performance guarantees for various scheduling and buffer management algorithms. We use competitive analysis [9], [27] to evaluate the performance guarantees provided by online algorithms. An algorithm ALG is said to be c-competitive (for some c ≥ 1) if for any arrival sequence σ, the overall value of packets successfully delivered by ALG is at least 1/c times the overall value of packets successfully delivered by an optimal solution (denoted OPT), obtained by a possibly offline clairvoyant algorithm.
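Formally, writing V_ALG(σ) for the overall value obtained by ALG on arrival sequence σ (a notation we introduce here only for compactness), the definition reads:

    \text{ALG is } c\text{-competitive } (c \ge 1)
    \iff
    \forall \sigma:\quad
    V_{\mathrm{ALG}}(\sigma) \;\ge\; \tfrac{1}{c}\, V_{\mathrm{OPT}}(\sigma).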

C. Further Notation and Algorithmic Framework

We will define a greedy buffer management policy as a policy that accepts all arrivals whenever there is available buffer space (in the IB). Throughout this paper we only look at work-conserving schedulers, i.e., schedulers that never leave a processor idle unnecessarily.

We will say that an arriving packet p preempts a packet q that has already been accepted into the IB module iff q is dropped and p is admitted to the buffer instead. A buffer management policy is called preemptive whenever it allows for preemptions.

For any algorithm ALG and any time slot t, we define IB_t^ALG as the set of packets stored in the IB of algorithm ALG at time t.

We assume that the original number of passes required by any packet is in a finite range {1, . . . , k}. The value of k will play a fundamental role in our analysis. We note, however, that none of our algorithms needs to know k in advance.

The number of residual passes of a packet is key to several of our algorithms. Formally, for every time t, and every packet p currently stored in the IB, its number of residual passes, denoted r_t(p), is defined to be the number of processing passes it requires before it can be successfully delivered.

Most of our algorithms will take the general form depicted in Algorithm 1, where the specific subroutine determining whether or not preemption takes place will depend on the algorithm. We note that we will distinguish between the various embodiments of our algorithms also depending on the BM they employ in the IB. More specifically, we will focus our attention on two natural BMs:

1) FIFO: In this policy packets are serviced in FIFO order, i.e., the C head-of-line packets are chosen for assignment to the PPEs. Upon completion of a processing round by the PPEs, all the packets that have been processed in this round and still require further processing passes are queued at the tail of the IB queue.

2) Priority Queueing (PQ): In this policy packets are serviced in non-decreasing order of residual passes, i.e., the C packets with the minimum number of residual passes are chosen for assignment to the PPEs in every time slot.

We assume that the queue order is also maintained according to the BM preference order.

The generic algorithmic setting for the buffer management policy of the SM is defined in Algorithm 1. The specific algorithms discussed in the following sections will differ according to the decision made by the DECIDEIFPREEMPT procedure, which decides which packet to discard in case of overflow.

Algorithm 1 ALG: Buffer Management Policy
1: upon the arrival of packet p:
2:   if the buffer is not full then
3:     accept packet
4:   else
5:     DECIDEIFPREEMPT(ALG, p)
6:   end if
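In Python, the generic policy of Algorithm 1 might be transcribed as follows (a sketch; the IB is a list ordered according to the BM, and `decide_if_preempt` is the algorithm-specific subroutine supplied by each concrete policy):

    def on_arrival(buffer, B, packet, decide_if_preempt):
        # Generic buffer management policy (Algorithm 1): greedily accept
        # while there is room; otherwise defer to the algorithm-specific
        # preemption subroutine.
        if len(buffer) < B:
            buffer.append(packet)
        else:
            decide_if_preempt(buffer, packet)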

III. BUFFER MANAGEMENT WITH NO COPYING COST (α = 0)

A. Non-preemptive Policies

In this section we consider non-preemptive greedy buffer management policies. Essentially, the subroutine DECIDEIFPREEMPT for such policies simply rejects the pending packet.

The following theorem provides a lower bound on the performance of such non-preemptive policies for FIFO schedulers.

Theorem 1. The competitive ratio of any non-preemptive greedy buffer management policy in a (B, FIFO, 0, C)-system is at least k/C, where k is the maximal number of passes required by any packet.

Proof: Assume for simplicity that B/C is an integer. Consider the following set of arrivals. During the first time slot, B packets arrive, each requiring the maximal number of passes k. Since the online algorithm A is greedy, it accepts all of them; OPT does not accept these packets. During the next kB/C time slots, the buffer of A remains full, since A is FIFO and non-preemptive. During this time interval, kB/C packets arrive, each with a single required pass, and all of them are transmitted by OPT. Hence A delivers only the initial B packets, while OPT delivers kB/C packets, yielding a ratio of at least k/C.
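As a sanity check of the construction above, the following back-of-the-envelope helper (toy parameters of our choosing, with B/C integral) computes the two throughputs: the greedy FIFO algorithm delivers only the initial burst of B k-pass packets, while OPT rejects the burst and delivers the kB/C single-pass packets that arrive while the greedy buffer is full:

    def construction_ratio(B: int, C: int, k: int) -> float:
        assert B % C == 0, "B/C is assumed integral, as in the proof"
        greedy_throughput = B           # only the initial burst is delivered
        opt_throughput = k * B // C     # the later single-pass packets
        return opt_throughput / greedy_throughput

    for B, C, k in [(8, 2, 4), (96, 4, 16)]:
        print(construction_ratio(B, C, k))  # prints k/C: 2.0, then 4.0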

The following theorem provides a similar lower bound for PQ schedulers.

Theorem 2. The competitive ratio of any non-preemptive greedy buffer management policy in a (B, PQ, 0, C)-system is at least k − 1, where k is the maximal number of passes required by any packet.

Proof: We will show more specifically that the competitive ratio is at least (k − 1)(1 − ε), for any ε > 0. Assume for simplicity that k/C is an integer. At time 0 we have the arrival of B packets, each requiring k passes, and at time 1 we have the arrival of (k − 1)C packets, each requiring a single pass. Consider the sequence of time slots a_i = ik − 1, for i = 1, . . . , ℓ. At any time a_i we have the arrival of C packets, each requiring k passes. At any time a_i + 1 we have the arrival of (k − 1)C packets, each requiring a single pass.

We now turn to analyze the performance of any greedy PQ policy given the above arrival sequence. At time 0 all B packets are accepted (due to greediness), and none of the arrivals at time 1 can be accommodated into the buffer. It is easy to show that by the arrival of packets at time a_1, C of the packets accepted at time 0 have been delivered, so the buffer has room to accommodate the C packets newly arriving at time a_1, while none of the packets arriving at time a_1 + 1 can be accommodated into the buffer. It therefore follows by induction that for any i = 1, . . . , ℓ − 1, the number of packets delivered by the algorithm between a_i and a_{i+1} is exactly C, the algorithm accepts all packets arriving at a_{i+1}, and cannot accept any of the packets arriving at time a_{i+1} + 1. This gives an overall throughput of at most B + ℓC packets.

Let us now turn to describe a feasible solution whose throughput serves as a lower bound on OPT. This solution accepts B − (k − 1)C of the packets that arrive at time 0, the (k − 1)C packets that arrive at time 1, and, for every i = 1, . . . , ℓ, the packets that arrive at time a_i + 1. We first show that the number of packets residing in the buffer of this solution never exceeds the buffer capacity B. By definition, the overall number of packets accepted by time 1 is B. Furthermore, the solution delivers all packets that arrived at time 1 by time 1 + (k − 1) = k = a_1 + 1 (since each of them requires a single pass), and can therefore accommodate the newly arriving packets at time a_1 + 1. It is easy to show by induction that for any i = 1, . . . , ℓ − 1, the number of packets delivered by this solution between a_i + 1 and a_{i+1} + 1 is exactly (k − 1)C, and the solution can accept all packets arriving at a_{i+1} + 1. It follows that this solution has an overall throughput of B − (k − 1)C + ℓ(k − 1)C.

Combining the above, we obtain that the competitive ratio of any greedy buffer management policy in a (B, PQ, 0, C)-system is at least

  (B − (k − 1)C + ℓ(k − 1)C) / (B + ℓC) = (B + (ℓ − 1)(k − 1)C) / (B + ℓC),

which tends to k − 1 as ℓ grows to infinity, thus completing the proof.
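A quick numeric check (toy values of our choosing) confirms that the ratio above indeed approaches k − 1 as ℓ grows:

    def ratio(B: int, C: int, k: int, rounds: int) -> float:
        # the two throughputs derived in the proof, as a function of ℓ
        opt = B - (k - 1) * C + rounds * (k - 1) * C
        alg = B + rounds * C
        return opt / alg

    for rounds in (10, 1_000, 100_000):
        # prints 1.883, 6.708, 6.997: tends to k - 1 = 7
        print(round(ratio(B=100, C=2, k=8, rounds=rounds), 3))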

As demonstrated by the above results, the simplicity of non-preemptive greedy policies does have its price. In the following sections we explore the benefits of introducing preemptive policies, and provide an analysis of their guaranteed performance.

B. Preemptive Policies

For the case where α = 0, we focus our attention on the intuitive rule for preemption which states that a newly arrived packet p should preempt a buffered packet q at time t if r_t(p) < r_t(q). This rule is formalized in Algorithm 2, which gives a formal definition of the DECIDEIFPREEMPT procedure of Algorithm 1.

Algorithm 2 DECIDEIFPREEMPT(ALG, p)
1: i ← first packet in IB_t^ALG s.t. r_t(p_i) = max_{i′} r_t(p_{i′})
2:    ▷ first in the order implied by the BM
3: if r(p) < r_t(p_i) then
4:   drop p_i and accept p
5: else
6:   reject p
7: end if
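A direct Python rendering of this preemption rule might look as follows, continuing the Packet convention from the sketch in Section II (under PQ the preempting packet would be re-inserted in sorted position rather than appended; this detail is elided here). It can be passed as the `decide_if_preempt` hook of the Algorithm 1 sketch above:

    def decide_if_preempt(buffer, packet):
        # Algorithm 2: find the first packet, in the BM order, with the
        # maximal number of residual passes (ties broken toward the head).
        i = max(range(len(buffer)), key=lambda j: (buffer[j].passes, -j))
        if packet.passes < buffer[i].passes:
            del buffer[i]            # drop p_i ...
            buffer.append(packet)    # ... and accept p
        # else: reject p (do nothing)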

In what follows we consider the performance of the above preemption rule for two specific BMs, namely ALG ∈ {PQ, FIFO}.

1) Preemptive Priority Queueing: In this section we study the performance of a BM implementing PQ, where priorities are set in accordance with the non-decreasing order of residual passes.¹ We refer to this algorithm as PQ1.² The following theorem provides some guarantee as to its performance.

¹Packet p has a higher priority than packet q at time t if r_t(p) < r_t(q).
²The reason for choosing the subscript 1 will become clear in Section IV.

Theorem 3. PQ1 is optimal.

Proof: Let O be the set of packets successfully delivered by some optimal solution OPT. We will sometimes abuse notation and use O (the set of packets) to refer to OPT (the solution). For every time t, every algorithm A, and every integer ℓ ∈ {1, . . . , |IB_t^O|}, let P_t^A(ℓ) denote the set of ℓ head-of-queue packets in IB_t^A (recall we can assume without loss of generality that packets in the IB are ordered according to A's buffering model). Consider algorithm PQ1. Recall that for any such ℓ, the ℓ head-of-line packets in IB_t^PQ1 have the minimal number of residual passes. Consider the following volume function

  Φ_t^A(ℓ) = Σ_{p ∈ P_t^A(ℓ)} r_t^A(p),

where r_t^A(p) denotes the residual number of passes of packet p in IB_t^A. Φ_t^A(ℓ) measures the amount of remaining work required for processing the ℓ head-of-queue packets in the buffer of A at time t.

We will prove that for any time t and any integer ℓ ∈ {1, . . . , |IB_t^O|},

  Φ_t^PQ1(ℓ) ≤ Φ_t^O(ℓ),   (1)

i.e., the amount of residual work needed by PQ1 for processing the ℓ head-of-queue packets is at most that required by OPT. Note that it is sufficient to consider Φ as defined at the end of each time step, although the inequality holds also after each of the phases within a time step. By proving that the inequality in Equation (1) holds for any ℓ, we obtain in particular that

  Φ_t^PQ1(1) ≤ Φ_t^O(1),

which implies that whenever OPT processes a packet with 0 residual passes, so does PQ1. Therefore, if Equation (1) holds, the throughput of PQ1 is at least that of OPT, completing the proof.

We now turn to prove Equation (1). First note that without loss of generality we can assume that OPT is both work-conserving, i.e., never idles when its buffer is non-empty, and also non-preemptive. Since PQ1 does not perform any admission control, and always accepts packets when it has room, at any time t we have |IB_t^PQ1| ≥ |IB_t^O|.

The proof follows by induction on t. For t = 0, the claim clearly holds, since by definition PQ1 accepts the maximal-size set of packets of minimal passes (up to the buffer capacity limit B) among all arrivals at t = 0, and stores them in non-decreasing order of residual passes. Assume the claim holds for all t′ < t, and consider time t. We will show the inequality holds after each of the phases within time step t. For the transmission phase, note that by the induction hypothesis, for every ℓ, Φ_{t−1}^PQ1(ℓ) ≤ Φ_{t−1}^O(ℓ), and since both PQ1 and OPT are work-conserving, both sides of the inequality are reduced by the amount of processing done in the transmission phase (the same for both PQ1 and OPT, unless the buffer of OPT becomes empty, in which case the inequality is trivially true, since we are only concerned with ℓ ≤ |IB_t^O|, and |IB_t^O| = 0 in this case). This implies that the inequality holds after the transmission phase.

Consider now the arrival phase within time step t. We will consider the buffer management decisions made by PQ1 as if they were done in a series of phases. In the first phase, assume PQ1 retains in its buffer P_t^PQ1(|IB_t^O|), and let R_t^PQ1 denote the remaining packets in its buffer, i.e., packets in positions greater than |IB_t^O| that are currently in IB_t^PQ1 (which are set aside at this point). In the second phase, assume PQ1 accepts the set of packets A_t^O, the set of packets accepted by OPT at time t (recall that by our assumption OPT is non-preemptive, hence by the fact that OPT does not overflow, PQ1 also has room to accept these packets in this phase). Denote IB̄_t^PQ1 = P_t^PQ1(|IB_t^O|) ∪ A_t^O. For this buffer occupancy (when sorted in non-decreasing order of residual passes), by the induction hypothesis on P_t^PQ1(|IB_t^O|), the inequality holds, since the added packets are exactly those accepted by OPT at time t. Note that |IB̄_t^PQ1| = |IB_t^O|, where IB_t^O is considered at the end of time step t. Now let PQ1 consider the remaining pending packets, by first considering the packets in R_t^PQ1, and afterwards considering any additional packets which arrived at time t (and were not accepted by OPT). Let IB_t^PQ1 denote the buffer of PQ1 after this sequence of events. First note that this sequence mimics exactly the behavior of PQ1 (up to accepting packets which are equivalent to those accepted by OPT). Let p_ℓ (p̄_ℓ) denote the packet in position ℓ in IB_t^PQ1 (IB̄_t^PQ1). By the priority-based preemption rule of PQ1, for every position ℓ ∈ {1, . . . , |IB_t^O|}, r_t(p_ℓ) ≤ r_t(p̄_ℓ). This follows from the fact that the queues are ordered in non-decreasing order of residual passes, and the candidate packets considered by PQ1 which resulted in the buffer configuration IB_t^PQ1 are a superset of IB̄_t^PQ1. It therefore follows that for any ℓ ∈ {1, . . . , |IB_t^O|}, Φ_t^PQ1(ℓ) ≤ Φ_t^O(ℓ), thus completing the proof.

The above theorem provides concrete motivation for using a priority queueing buffering model. It also enables using PQ1 as a benchmark for optimality.

On the other hand, priority queueing has many drawbacks in terms of the difficulty in providing delay guarantees and in terms of implementation. For instance, low-priority packets may be delayed for an arbitrarily long amount of time due to the steady arrival of higher-priority packets. Therefore it is of interest to study BMs that ensure such scenarios do not occur. One such predominant BM is FIFO queueing, which is discussed in the following section.

2) Preemptive FIFO: In this section we analyze the preemptive policy depicted in Algorithm 2, where the BM implements FIFO queueing. We refer to this algorithm as FIFO1. FIFO has many attractive features, including bounded delay, and it is easy to implement. We begin by providing the counterpart of Theorem 3, which shows that, as opposed to priority queueing, the performance of FIFO1 can be rather far from optimal.

Theorem 4. FIFO1 has competitive ratio Ω((log k)/C) in a (B, FIFO, 0, C)-system.

Proof: Assume for simplicity that B/C is an integer, and further assume that k + 1 = B/C. Consider the following arrival sequence: for i = 0, . . . , k, we have B packets with k − i required passes arriving at time t_i = iB/C.

Let us first consider the performance of FIFO1 given the above arrival sequence. At time t_0, FIFO1 accepts B packets, each with k required passes. Call this set A. It is easy to see that for every i = 1, . . . , k, at time t_i all the packets in A are still in FIFO1's buffer, and each has k − i residual passes. Hence, FIFO1 never has a reason to preempt any of the packets in A. It follows that at time t_k the buffer holds B packets with 1 residual pass each, and FIFO1 can eventually deliver only B packets.

We now turn to consider the performance of an optimal policy for the above arrival sequence. We first show a policy that delivers (1 + 1/C)B − 1 packets out of the above arrival sequence (implying a lower bound of 1 + 1/C − 1/B on the competitive ratio). We then refine our analysis to prove the required result. We start by considering the conservative policy which, for any i = 0, . . . , k − 1, accepts a single packet at time t_i, and further accepts all B packets arriving at time t_k. First note that the above policy is feasible: since for any i, t_{i+1} − t_i = B/C ≥ k + 1, if a policy accepts a single packet at time t_i, then we are guaranteed to have this packet delivered by time t_{i+1}. This implies that the buffer is empty at time t_{i+1}, implying in turn the feasibility of the above policy. Since the policy accepts k = B/C − 1 packets by time t_k, and B packets at time t_k, we have a total throughput of (1 + 1/C)B − 1, as required.

Let us now turn to refine our analysis, and present a better policy which implies the required result, and is based upon the simple policy just described. Recall that at time t_i the buffer is empty, and we have B packets with k − i required passes arriving. Our new policy accepts ⌊B/(C(k − i + 1))⌋ of these packets. We will show that this new policy ensures that the buffer is empty just before the arrival phase at any time t_{i+1}. The overall number of time steps required to deliver a set of ⌊B/(C(k − i + 1))⌋ packets, each requiring k − i passes, is ⌊B/(C(k − i + 1))⌋ · (k − i + 1) ≤ B/C = t_{i+1} − t_i, which implies that by time t_{i+1} the buffer is indeed empty. Note that since k = B/C − 1, we have ⌊B/(C(k − i + 1))⌋ ≥ 1 for every i = 0, . . . , k. We can now evaluate the performance of this new policy. The overall number of packets accepted (and delivered) by the policy is

  Σ_{i=0}^{k} ⌊B/(C(k − i + 1))⌋ ≥ Σ_{i=0}^{k} (B/(C(k − i + 1)) − 1)
    = (B/C) · Σ_{j=1}^{k+1} (1/j) − (k + 1)
    = (B/C) · H_{k+1} − (k + 1),

where H_n is the n-th harmonic number, which satisfies H_n = Θ(log n). Since B = Θ(k), the result follows.

We now turn to provide an upper bound on the performance of FIFO1, as given by the following theorem.

Theorem 5. FIFO1 is 2k-competitive in a (B, FIFO, 0, C)-system.

Proof: Consider intervals of time during which OPT transmits kB packets. In each such interval, FIFO1 transmits at least B packets, and in the worst case, at the end of the interval its buffer is full (because of greediness) and each buffered packet has k residual passes. Since FIFO1 thus needs to account for at most B packets carried over from the previous interval, in the worst case FIFO1 is 2k-competitive.

IV. BUFFER MANAGEMENT WITH COPYING COST (α > 0)

In this section we consider the more involved case where each packet admitted to the buffer incurs a copying cost α. For this model, it is preferable to perform as few preemptions as possible, since preemptions increase the costs, but do not contribute to the overall throughput. We recall that the overall performance of an algorithm in this model is defined as the algorithm's throughput, from which we subtract the overall copying cost incurred due to admitting distinct packets to the buffer.

A. Characterization of the Optimal Algorithm

We first note that if we consider algorithm PQ1 described in the previous section, which is optimal for the case where α = 0, we are guaranteed to have it produce the maximum throughput possible given the arrival sequence. If we further consider a slightly distorted model where PQ1 is allowed to "pay" its copying cost only upon the successful delivery of a packet, we essentially obtain an optimal solution also for cases where α > 0, because in that case PQ1 never pays a useless cost of α for a packet that it ends up dropping. This is formalized in the following theorem:

Theorem 6. PQ1 that pays the copying cost only for transmitted packets is optimal for the (B, PQ, α, C)-architecture, for any α ∈ [0, 1).

The theorem can also be seen from a different perspective. Intuitively, a PQ1 scheduler that would know in advance which packets are winners, and would only accept those, would be optimal. More formally, combining PQ1 with a buffer admission control policy that only accepts the packets that ultimately depart under a given optimal scheduling policy reaches optimality.

B. Optimizing Priority Queueing

Given a copying cost α < 1, we will define a value β = β(α) ≥ 1 (the precise value of β will be derived from our analysis below), which will be used in defining the preemption rule DECIDEIFPREEMPT(PQβ, p), as specified in Algorithm 3. The algorithm essentially preempts a packet q in favor of a newly arrived packet p only if p has significantly fewer residual passes than q. Note that the special case where β = 1 coincides with algorithm PQ1 described in Section III-B (hence the subscript 1).

Algorithm 3 DECIDEIFPREEMPT(PQβ, p)
1: p_B ← last packet in IB_t^PQβ
2:    ▷ note that r_t(p_B) = max_{p′ ∈ IB_t^PQβ} r_t(p′)
3: if r_t(p) < r_t(p_B)/β then
4:   drop p_B and accept p
5: else
6:   reject p
7: end if
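A sketch of this rule in Python (again with illustrative names; under PQ the buffer is kept sorted in non-decreasing order of residual passes, so its last packet p_B carries the maximal residual count):

    import bisect

    def decide_if_preempt_pq_beta(buffer, packet, beta):
        # Algorithm 3: preempt only if the arrival needs beta times fewer
        # residual passes than the worst buffered packet.
        p_B = buffer[-1]
        if packet.passes < p_B.passes / beta:
            buffer.pop()                      # drop p_B ...
            keys = [q.passes for q in buffer]
            # ... and accept p, keeping the buffer sorted
            buffer.insert(bisect.bisect(keys, packet.passes), packet)
        # else: reject p

To plug this into the Algorithm 1 sketch of Section II, one can bind β in advance, e.g. functools.partial(decide_if_preempt_pq_beta, beta=2.0).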

We now turn to analyze the performance of the algorithm which uses PQβ-preemption. We first prove an upper bound on the performance of the algorithm, for any value of β. We can then optimize the value of β = β(α) so as to yield the best possible upper bound. We focus our attention in this analytical section on the case where C = 1. We note that for any two values β, β′ ≥ k, the algorithm behaves the same, i.e., it is non-preemptive. It therefore follows that although our algorithm need not know the value of k in advance, and one may very well design an algorithm which uses a value of β > k (being unaware of the real value of k), in our analysis we may assume that β ≤ k, since any algorithm which uses a value of β > k is equivalent to an algorithm that uses the value k for β. This having been said, the optimal value of β = β(α, k), which minimizes the competitive ratio, does depend on the value of k, and in order to find this optimal value, k has to be given in advance.

Fig. 2. Outline of mapping χ. Packet p_1 is admitted to the buffer upon arrival without preempting any packet, and henceforth packet p_{i+1} preempts packet p_i. The mapping ψ along the preemption sequence (p_1, p_2, . . . , p_{m−1}, p_m in G) is depicted by dashed arrows. Such a sequence ends at a packet p_m which is successfully delivered by PQβ. Mapping φ, depicted by solid arrows, maps at most 1 packet (q^(1), q^(2), . . . , q^(m−1)) to any packet that is preempted in the sequence, and at most L packets (q_1^(m), q_2^(m), . . . , q_ℓ^(m)) to the last packet of the sequence, which is successfully delivered by PQβ. This gives a total of 2(m − 1) + L packets mapped to any single packet successfully delivered by PQβ.

Theorem 7. For C = 1, algorithm PQβ has a competitive ratio of

  ( (2 + log_{β/(β−1)}(k/2) − 1 + 2 log_β k) (1 − α) ) / ( 1 − α/log_β k ),

for B ≥ 2, β > 1, and α < min(1, log_β k).
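Since β = β(α, k) has no simple closed form, a designer can pick it numerically. The sketch below grid-searches the bound of Theorem 7, in the form reconstructed above (so the exact constants should be treated as indicative rather than definitive), over β ∈ (1, k]:

    import math

    def bound(beta: float, k: int, alpha: float) -> float:
        # competitive-ratio bound of Theorem 7 (reconstructed form)
        log_beta_k = math.log(k) / math.log(beta)
        numer = (2 + math.log(k / 2) / math.log(beta / (beta - 1)) - 1
                 + 2 * log_beta_k) * (1 - alpha)
        denom = 1 - alpha / log_beta_k
        return numer / denom if denom > 0 else math.inf

    def best_beta(k: int, alpha: float, grid: int = 10_000) -> float:
        betas = (1 + (k - 1) * i / grid for i in range(1, grid + 1))
        return min(betas, key=lambda b: bound(b, k, alpha))

    print(best_beta(k=64, alpha=0.1))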

C. Proof Intuition for Theorem 7

In the remainder of this section, we will focus on the proof of Theorem 7. We will denote by G the set of packets successfully delivered by PQβ, and by O the set of packets successfully delivered by some optimal solution OPT. Consider a partition of the set of packets O \ G = A1 ∪ A2, such that A1 is the set of packets dropped by PQβ upon arrival, and A2 is the remaining set of packets, consisting of packets that were originally accepted, but at some point were preempted by more favorable packets. It follows that O = A1 ∪ A2 ∪ (G ∩ O).

Our analysis will be based on describing a mapping of packets in O to packets in G, such that every packet in G piggybacks a bounded number of packets of O. Our mapping will be devised in several steps.

First, we define a mapping φ : A1 → A2 ∪ G such that for every p ∈ A2, |φ⁻¹(p)| ≤ 1, and for every p ∈ G, |φ⁻¹(p)| ≤ L, for some value of L to be determined later (see Lemma 12). We then define a mapping ψ : A2 ∪ G → G such that for every p ∈ G, |ψ⁻¹(p)| ≤ M, for some value of M to be determined later (see Lemma 14). By composing these two mappings we obtain a mapping χ : O \ G → G such that for every p ∈ G, |χ⁻¹(p)| ≤ 2(M − 1) + L, i.e., there are at most 2(M − 1) + L packets from O \ G mapped to any single packet p ∈ G by χ. Figure 2 gives an outline of the resulting mapping χ.

It is important to note that this mapping is done in hindsight, as part of the analysis, and is not part of the algorithm's definition. We can therefore assume that for our analysis, we know for every packet arrival which algorithm(s) would eventually successfully deliver this packet.

Fig. 4. Outline of a mapping-collapse (φ shown before and after the collapse, with p_1 delivered). Upon the delivery of the HOL packet in G, p_1, the largest set of live A1 packets closest to the head of the queue in G (but no more than L) are mapped to the new HOL packet, p_2, and the remaining packets of A1 in O's buffer are shifted downwards appropriately. In this example we take L = 3.

D. The Basic Mapping φ

Our goal in this section is to define a mapping φ : A1 → A2 ∪ G such that for every p ∈ A2, |φ⁻¹(p)| ≤ 1, and for every p ∈ G, |φ⁻¹(p)| ≤ L, for some value of L to be determined later (see Lemma 12). For every time t, we will denote the ordered set of packets residing in the buffer of PQβ at t by p_1^t, p_2^t, and so on. Recall that since the buffer size is at most B, such a sequence is of length at most B. For clarity, we will sometimes abuse notation and omit the superscript t when it is clear from the context. We will further define the load of p_i at t by n_t(p_i) = |φ⁻¹(p_i)|, i.e., the number of packets currently mapped to packet p_i. In order to avoid ambiguity as to the reference time, t should be interpreted as the arrival time of a single packet. If more than one packet arrives in a time slot, these notations should be considered for every packet independently, in the sequence in which they arrive (although they might share the same actual time slot).

The mapping will be dynamically updated at each event of packet arrival, or of packet transmission from the buffer of G, as follows. Assume packet p arrives at time t. We distinguish between 3 cases:

1) If p is in neither O nor G (i.e., neither PQβ nor OPT delivers it successfully), then the mapping remains unchanged.

2) If p ∈ A1, and it is assigned to buffer slot j in the buffer of O upon arrival, perform an (O, j)-mapping-shift (see detailed description below).

3) If p ∈ A2 ∪ G, and it is assigned to buffer slot i in the buffer of G upon arrival (i.e., after its acceptance to the buffer we have p_i = p), perform a (G, i)-mapping-shift (see detailed description below).

The last case to consider is that of a packet being successfully delivered by G. In this case we perform a mapping-collapse onto the head-of-line (HOL) packet in G (see detailed description below).

At any given time, we will consider the set of live packets in A1 that are currently in the buffer of O, where this set is updated dynamically as follows: Every packet q ∈ A1 is alive upon arrival. A live packet q ceases to be alive the moment φ(q) is completed (either by being preempted, or by being delivered). All remappings described henceforth only apply to live packets. Specifically, for every event causing a change or update in the mapping occurring at any time t, packets in A1 which are no longer alive at t are essentially considered by the following procedures as packets which are in A2 ∪ (G ∩ O) (i.e., their mappings do not change, and they are sidestepped when shifting mappings).

We first give some definitions. We say a mapping is sequential if for every i < i′, the set of packets mapped to the packet in slot i would leave OPT before the set of packets mapped to the packet in slot i′ (assuming both of these slots are non-empty). We further say a mapping is i-prefix-full if every packet in slot i′ ≤ i has packets mapped to it and every packet in slot i′ > i has no packets mapped to it, and furthermore, if i > 1, then the HOL packet in G has L packets mapped to it. See Figure 5 for an example of a mapping satisfying these two properties.

In order to finalize the description of φ, it remains to explain the notions of mapping-shifts and mapping-collapse. An (O, j)-mapping-shift works as follows: If the HOL packet in G has less than L packets currently mapped to it, we map the arriving packet p ∈ A1 to the HOL packet in G. Otherwise, we find the minimal index i of a packet in the buffer of G to which no packet is mapped, and map packet p to this packet. If there is no such packet in the buffer of G (i.e., the HOL packet has load L, and every other packet in the buffer of G has load exactly 1), then we map p to the last packet in G. Clearly this mapping is feasible, i.e., whenever a packet p ∈ A1 arrives, there is a packet in G to which we can map p. In order to complete this mapping-shift, we swap mappings (without changing the number of packets mapped to any packet in G) such that the resulting mapping is sequential. See Figure 3(a) for an example of an (O, j)-mapping-shift.

A (G, i)-mapping-shift is simpler, and works as follows: for any non-empty buffer slot j > i, remap any packets mapped to p_j to p_{j−1}, in sequence, starting from j = i + 1. Figure 3(b) gives an example of a (G, i)-mapping-shift.

We now turn to describe the effect of a mapping-collapse. Upon the successful delivery of the HOL packet in G, the packet which was just in the second position in G's buffer becomes the HOL. This packet may have at most 1 packet mapped to it (according to the definition of the mapping). Upon its becoming the HOL packet, we remap the largest set of live packets in A1 currently in the buffer of O to the new HOL packet, such that there are at most L packets mapped to it. If we have remapped r such packets, and there remain additional packets in A1 currently in the buffer of O, then we remap each of these packets r positions downward, such that the resulting mapping is i-prefix-full for some buffer position i ∈ {1, . . . , B}. Figure 4 gives an example of a mapping-collapse.

We say a mapping satisfies the head-of-line-before-OPT (HOBO) property w.r.t. L if, at any time t, whenever the HOL packet in G has L packets mapped to it, the last of these packets would leave O no earlier than this HOL packet would leave G.

Fig. 3. Outline of mapping-shifts: (a) an (O, j)-mapping-shift (q ∈ A1 is admitted by O to slot j = 5); (b) a (G, i)-mapping-shift (p is admitted by G to slot i = 3). The new packet is inserted into the corresponding buffer slot, and the mapping is shifted accordingly. Cyan packets are packets that are either in A2 ∪ (O ∩ G), or packets in A1 that are no longer alive. White packets are live A1 packets, which might be affected by changes in the mappings. In both examples L = 3.

Fig. 5. An example of a mapping which is 6-prefix-full and sequential: L packets are mapped to p_1, one packet is mapped to each of p_2, . . . , p_6, no packets are mapped to any later packet, and the mapped packets form sequential blocks. Cyan packets are packets that are either in A2 ∪ (O ∩ G) or A1 packets that are no longer alive. Mapped white packets are in A1. In this example, L = 5.

The following lemma shows that if L satisfies the HOBO property, then except for maybe the HOL packet in the buffer of G, any other packet in the buffer has load at most 1. This follows by definition for all such non-HOL packets, save possibly for the last packet in the buffer, which is the focus of the lemma.

property, then except for maybe the HOL packet in the bufferof G, any other packet in the buffer has load at most 1.This follows by definition for all such non-HOL packets, savepossibly for the last packet in the buffer, which is the focusof the lemma.

Lemma 8. If L satisfies the HOBO property, then at most one O packet is mapped to the last packet in G.

Proof: Assume q ∈ A1 arrives at time t. It follows that upon its arrival, the buffer of G is full (since otherwise it would have accepted q). Assume by contradiction that the HOL packet in G's buffer has load L, and that every other packet in G's buffer has load exactly 1. Since L satisfies the HOBO property, we are guaranteed that the last packet mapped to the HOL packet in G is delivered by O no earlier than the HOL packet in G is delivered by G. In particular, at time t the last packet mapped to the HOL packet in G has not yet been delivered by O, and it resides in the buffer of O. Since the mapping is maintained sequential, we are guaranteed that all packets in the buffer of O mapped to packets other than the HOL packet in G are also residing in the buffer of O. It follows that the buffer occupancy of O is at least that of G, which, by the fact that the buffer of G is full, implies that the buffer of O is full upon the arrival of q, contradicting the fact that O accepts q upon its arrival (recall we assumed without loss of generality that O never preempts an accepted packet).

The above lemma essentially guarantees that if we choose L such that the HOBO property is maintained, then each packet in G's buffer except the first has at most one packet mapped to it.

The following lemma ensures that upon every event which affects the mapping, every live packet has sufficiently many residual passes, compared to the packet to which it is mapped.

Lemma 9. For every i ∈ {1, . . . , B}, let p_i denote the packet residing in slot i in the buffer of G. For every such i, if q is (re)mapped to p_i at time t, then r^O_t(q) ≥ (1/β) · r^G_t(p_i).

Proof: We prove by induction on the sequence of events (essentially, induction on t) that in every event which causes a (re)mapping of q to some packet p_i at time t, the property r^O_t(q) ≥ (1/β) · r^G_t(p_i) holds. First note that every packet q ∈ A1 arriving at time t′ causes an O-shift, which implies that at time t′, r^O_{t′}(q) ≥ (1/β) · r^G_{t′}(p_j) for all j ∈ {1, . . . , B}, and in particular this is true for the packet p_i to which q is mapped upon its arrival at time t′. We now turn to the induction step, and prove the property holds for every event causing a remapping of q. For an O-shift affecting the mapping of q, this happens upon the arrival of a packet q′ at some time t such that r^O_t(q′) ≤ r^O_t(q) (since by definition, an (O, j)-shift might cause remapping of packets only in positions j′ ≥ j). By combining this observation with the fact that an O-shift occurring upon the arrival of q′ implies r^O_t(q′) ≥ (1/β) · r^G_t(p_j) for all j ∈ {1, . . . , B}, we are guaranteed to have the property hold in the case of an O-shift. For a G-shift affecting the mapping of q, note that this can only occur by changing the target of the mapping from packet p_{i+1} to packet p_i (by the definition of G-shifts). In this case we have r^O_t(q) ≥ (1/β) · r^G_t(p_{i+1}) ≥ (1/β) · r^G_t(p_i), since G maintains its buffer in nondecreasing order of residual passes. The last case to consider is that where a mapping-collapse affects the mapping of q. In this case, q is originally mapped to some packet p_{i+m} (for some m ≥ 0), and after the collapse is mapped to packet p_i.


[Figure 6 here. Timeline: at time s_p, p becomes HOL for the last time; at time s^*_p, p is delivered; packets q_1, q_2, . . . , q_ℓ are mapped to p at times t_1, t_2, . . . , t_ℓ.]

Fig. 6. An outline of the mappings to the HOL packet p in G during the interval H_p = [s_p, s^*_p). For each i, packet q_i is mapped to p at time t_i.

In this case as well, similarly to the case of the G-shift, since we remap to packets which are closer to G's head-of-queue, the induction hypothesis implies the property is maintained.

In what follows, we assume that B ≥ 2. For every packet p ∈ G ∪ A2, consider the set of packets mapped to p when p is completed. If p ∈ A2, i.e., it is preempted by G at some time t, then since preemption always takes place from the last slot in the buffer of G, by the definition of the mapping there can be at most one packet mapped to p when it is completed. We thus have the following lemma.

Lemma 10. For every p ∈ A2, |φ^{-1}(p)| ≤ 1 when p is preempted.

If p ∈ G, let s_p be the latest time at which p becomes the HOL packet in G, and let s^*_p denote the time at which it is delivered by G. We further consider the set of packets mapped to p upon its delivery, and denote this set by q_1, . . . , q_ℓ, where the order is implied by the order in which these packets are mapped to p. This set of packets may be split into two sets: q_1, . . . , q_i, which are the packets mapped to p due to the mapping-collapse at time s_p, and the set q_{i+1}, . . . , q_ℓ, which are additional packets mapped to p (which can only be due to O-shifts occurring at times t ∈ H_p = [s_p, s^*_p)). For every such packet q_i, let t_i denote the time at which it is mapped to p during the interval H_p. We hereby introduce the following notation: let x^i_t = r^O_t(q_i), and let y_t = r^G_t(p). Using this notation, Lemma 9 implies that at any time t_i, x^i_{t_i} ≥ (1/β) · y_{t_i}. Figure 6 provides a graphical description of the packets mapped to p along time.

In what follows we provide an analysis which will eventually enable us to determine the value of L used in the definition of the mapping. Consider any value L which satisfies the following property (using the notation introduced above):

For every p ∈ G, if it has load L upon delivery, there exists some t ∈ H_p such that

y_t ≤ Σ_{1≤j≤L | t_j≥t} x^j_{t_j}.   (2)

Any value L satisfying this property is said to be HOBO-compliant.

We first note that there exists an HOBO-compliant value of L. Assume we take L = k in our mapping. This implies that during H_p, if p has load L, then there are L = k distinct packets mapped to p during H_p, where each of these packets has some strictly positive number of residual passes. This implies that Equation (2) holds for t = s_p, since y_{s_p} ≤ k ≤ Σ_{j=1}^{k} x^j_{t_j}. It follows that L = k is HOBO-compliant. It is worthwhile to note that, by definition, if L is HOBO-compliant then every L′ ≥ L is also HOBO-compliant (since the right-hand side of Equation (2) can only increase, while the left-hand side remains unchanged).

Next, we prove that any value L which is HOBO-compliant satisfies HOBO.

Lemma 11. If L is HOBO-compliant, then L satisfies HOBO.

Proof: Consider the time t ∈ H_p for which Equation (2) holds. For such a t, the sum of residual passes of packets mapped to p as of time t (none of which has already been delivered by O) is at least the number of residual passes remaining for p at time t. Since by definition p is the HOL packet throughout H_p, and is delivered by the end of this interval, this implies that the last of the packets mapped to p cannot be delivered by O before p is delivered by G.

We henceforth abuse notation, and let L denote the minimal integer which is HOBO-compliant. The following lemma provides an upper bound on L.

Lemma 12. L satisfies

L ≤ 2 + log_{β/(β−1)}(k/2).   (3)

Proof: By definition, L is the minimal value for which, for any packet p ∈ G that has load L upon its delivery, there exists some time t ∈ [s_p, s^*_p] for which y_t ≤ Σ_{j≤L | t_j≥t} x^j_{t_j}. Therefore, if we consider L − 1, then there exists some packet p ∈ G that has load L − 1, yet for every time t ∈ [s_p, s^*_p], y_t > Σ_{j≤L−1 | t_j≥t} x^j_{t_j}. In particular this holds for every time t = t_i, for i = 1, . . . , L − 1. We now abuse notation and let y_i = y_{t_i}, and also let x_i = x^i_{t_i}. We further let z_i = Σ_{j=i}^{L−1} x_j.

Using the above notation, we have for every i ∈ {1, . . . , L − 1},

y_i > z_i, i.e., y_i ≥ z_i + 1,

since all these quantities are integral. In addition, by Lemma 9 we have for every i = 1, . . . , L − 2,

x_i = z_i − z_{i+1} ≥ (1/β) · y_i.

Therefore, combining both inequalities, we obtain:

z_{i+1} ≤ z_i − (1/β) · y_i ≤ z_i − (1/β) · (z_i + 1),

i.e.,

z_{i+1} + 1 ≤ (1 − 1/β) · (z_i + 1).

By iteration, we get:

z_{L−1} + 1 ≤ (1 − 1/β)^{L−2} · (z_1 + 1).


Finally, we use z_{L−1} + 1 = x_{L−1} + 1 ≥ 2 and z_1 + 1 ≤ y_1 ≤ k. Therefore we obtain:

2 ≤ (1 − 1/β)^{L−2} · k.

Equivalently, (β/(β−1))^{L−2} ≤ k/2, and taking logarithms to base β/(β−1) yields L − 2 ≤ log_{β/(β−1)}(k/2), which is the required result. We thus have the following corollary:

Corollary 13. For L = 2 + log_{β/(β−1)}(k/2), for every p ∈ G, |φ^{-1}(p)| ≤ L when p is delivered.

Proof: By our choice of L, Lemma 12 implies that L is HOBO-compliant. For such an L, Lemmas 11 and 8 ensure that the mapping φ is feasible. Since every p ∈ G is the HOL packet of G upon delivery, it follows by the definition of the mapping that |φ^{-1}(p)| ≤ L, as required.

E. The Mapping ψ

In this section we define a mapping ψ : A2 ∪ G → G such that for every p ∈ G, |ψ^{-1}(p)| ≤ log_β k, i.e., there are at most log_β k packets from A2 ∪ G mapped to any single packet p ∈ G by ψ.

The mapping essentially follows a preemption sequence of packets, up to a packet that is successfully delivered by G. Formally, it is defined by backward recursion as follows: if p ∈ G, then ψ(p) = p. Otherwise, p ∈ A2 is preempted in favor of some packet q ∈ A2 ∪ G, such that r(p) > βr(q), in which case we define ψ(p) = ψ(q).
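Operationally, ψ simply follows the preemption chain until it reaches a packet that G eventually delivers. A minimal Python sketch (assuming a hypothetical dictionary preempted_by that records, for each preempted packet, the packet it was preempted in favor of, and a set delivered of packets delivered by G):

    def psi(p, preempted_by, delivered):
        """Follow the preemption chain from p until reaching a packet
        delivered by G; this mirrors the backward recursion defining psi."""
        while p not in delivered:       # p is in A2: it was preempted...
            p = preempted_by[p]         # ...in favor of some q with r(p) > beta * r(q)
        return p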

Lemma 14. For every p ∈ G, |ψ^{-1}(p)| ≤ log_β k.

Proof: The proof follows immediately from the fact that for every p preempted by q we have r(p) > βr(q), i.e., a reduction of the number of required passes by a factor of more than β. Since r(p) ≥ 0 is integral, it follows that any such sequence of preemptions can be of length at most log_β k. Since preemption is one-to-one, it follows that the maximum number of packets mapped to any single p ∈ G is bounded by the length of any such preemption sequence, which completes the proof.
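To see the chain-length bound explicitly, note that along a preemption sequence p_0, p_1, p_2, . . . (where each p_j is preempted in favor of p_{j+1}) the residual passes shrink geometrically:

    \[
      r(p_j) \;<\; \frac{r(p_0)}{\beta^{j}} \;\le\; \frac{k}{\beta^{j}},
    \]

and since a live packet satisfies r(p_j) ≥ 1, we must have β^j < k, i.e., j < log_β k, so the chain indeed contains at most log_β k preemptions.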

F. Putting it All Together

We are now in a position to prove our main theorem.

Proof of Theorem 7: Our proof essentially relies on determining the value of L in the description of the mapping φ. We set L = 2 + log_{β/(β−1)}(k/2), as suggested by Lemma 12 and Corollary 13.

By composing the mappings φ and ψ we obtain a mapping χ : A1 ∪ A2 → G such that for every p ∈ G, |χ^{-1}(p)| ≤ 2(log_β k − 1) + L = L − 2 + 2 log_β k. This follows from the fact that every packet along the preemption sequence save the last piggybacks at most one packet by φ (Lemma 10), and the last packet in the preemption sequence piggybacks at most L packets by φ (Corollary 13). One should also take into account all the packets in the preemption sequence itself, which are accounted for by ψ (save the last one, which is successfully delivered by G). Again, see Figure 2 for an illustration of the mapping χ.

All that remains is to bound the value obtained by the optimal solution, compared to the value obtained by PQβ. Assuming α < 1/log_β k, one can see that the overall payments made by the algorithm in any preemption sequence sum to at most α log_β k < 1 (since payment is made only for packets in A2 ∪ G), and hence they do not exceed the unit profit obtained by delivering the last packet in the sequence. It follows that any packet delivered by our algorithm contributes a value of at least 1 − α log_β k. For every such packet, the optimal solution may obtain a value of at most (L − 2 + 2 log_β k + 1)(1 − α) = (2 + log_{β/(β−1)}(k/2) − 1 + 2 log_β k)(1 − α). Note the additional value of 1 in the first factor, which accounts for packets in O ∩ G. The result follows.

Before we turn to describe our simulation setting and results, it is instructive to discuss some of the consequences of Theorem 7. By optimizing the value of β one can obtain the minimum value of the competitive ratio (depending on the value of α). Table I gives an illustration of the optimal values of β and the competitive ratio they imply for k = 10.

α    0.05   0.1    0.15   0.2    0.25   0.3    0.35   0.4
β    3.03   3.24   3.49   3.75   4.05   4.38   4.75   5.16
CR   8.37   9.23   10.21  11.32  12.57  14.00  15.63  17.48

TABLE I
OPTIMAL VALUES OF β, AND THE IMPLIED COMPETITIVE RATIO (CR), AS GIVEN BY THEOREM 7, FOR k = 10.
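As a numerical sanity check on such tables, the following Python sketch scans β for the minimizer of the competitive-ratio bound. It assumes the bound has the form (L − 1 + 2 log_β k)(1 − α)/(1 − α log_β k) with L = 2 + log_{β/(β−1)}(k/2), as read off from the proof of Theorem 7; the exact constants in Table I may reflect slightly different rounding conventions for L.

    import math

    def cr_bound(beta, alpha, k):
        """Competitive-ratio bound of Theorem 7, as reconstructed from the
        proof; returns infinity where the assumption alpha < 1/log_beta(k)
        fails."""
        L = 2 + math.log(k / 2, beta / (beta - 1))  # Lemma 12 bound on L
        logk = math.log(k, beta)
        if alpha * logk >= 1:
            return math.inf
        return (L - 1 + 2 * logk) * (1 - alpha) / (1 - alpha * logk)

    def best_beta(alpha, k, lo=1.01, hi=20.0, steps=20000):
        """Grid search for the beta minimizing the bound."""
        grid = (lo + i * (hi - lo) / steps for i in range(steps + 1))
        return min(grid, key=lambda b: cr_bound(b, alpha, k))

    # example: the optimal beta and implied ratio for alpha = 0.1, k = 10
    beta = best_beta(0.1, 10)
    print(beta, cr_bound(beta, 0.1, 10))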

V. SIMULATION STUDY

In this section we compare the performance of the family of algorithms PQβ for various values of β (defined in Section IV), as well as algorithms PQ1 and FIFO1 (defined in Section III-B), and the non-preemptive algorithm that uses PQ (defined in Section III-A), which we dub PQ∞ (this notation is used to maintain consistency with our notation of PQβ).

When considering the family of algorithms PQβ, we consider several values of β, and do not restrict ourselves to the optimal values implied by our analysis. The reason for this is that our analysis is targeted at bounding the worst-case performance, and it is instructive to evaluate the performance of the algorithms using different values of β for simulated traffic that is not necessarily worst-case.
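For reference, a minimal Python sketch of the admission test underlying this family is given below. It assumes (consistent with the preemption condition used in the definition of ψ in Section IV-E) that PQβ keeps its buffer sorted in nondecreasing order of residual passes, and that a full buffer preempts its last packet p in favor of an arrival q only when r(p) > β · r(q); the variable names are illustrative.

    import bisect

    def pq_beta_admit(buf, r_new, B, beta):
        """Admission test sketch for PQ_beta: `buf` holds residual-pass
        counts in nondecreasing order, and B is the buffer capacity."""
        if len(buf) < B:
            bisect.insort(buf, r_new)    # free slot: accept greedily
            return True
        if buf[-1] > beta * r_new:       # preemption condition r(p) > beta * r(q)
            buf.pop()                    # preempt the packet with the most residual passes
            bisect.insort(buf, r_new)
            return True
        return False                     # otherwise reject the arrival

Under this reading, β = 1 recovers the eager preemption of PQ1 (preempt whenever some gain can be obtained), while β = ∞ never preempts, matching the PQ∞ notation.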

Our traffic is generated using an ON-OFF Markov modulated Poisson process (MMPP), which is targeted at producing bursty traffic. The choice of parameters is governed by the average arrival load, which is determined by the product of the average packet arrival rate and the average number of passes required by packets. For a choice of parameters yielding an average packet arrival rate of λ, where every packet has its required number of passes chosen uniformly at random within the range [1, k], we obtain an average arrival load (in terms of required passes) of λ · (k + 1)/2.
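A minimal Python sketch of such a traffic generator is given below, under an assumed parameterization: a two-state modulating chain with per-slot switching probabilities p_on and p_off and per-slot Poisson arrival rates lam_on and lam_off (the text fixes only the resulting average rate λ, not the state rates or switching probabilities, so these remain free parameters), with passes drawn uniformly from {1, . . . , k}.

    import math
    import random

    def poisson(rng, lam):
        """Sample Poisson(lam) by Knuth's product-of-uniforms method."""
        threshold, n, prod = math.exp(-lam), 0, 1.0
        while True:
            prod *= rng.random()
            if prod <= threshold:
                return n
            n += 1

    def mmpp_arrivals(slots, lam_on, lam_off, p_on, p_off, k, seed=0):
        """Yield, per time slot, the list of required-pass counts of the
        packets arriving in that slot under an ON-OFF MMPP."""
        rng = random.Random(seed)
        state_on = False
        for _ in range(slots):
            # flip the modulating state with the appropriate probability
            if rng.random() < (p_off if state_on else p_on):
                state_on = not state_on
            lam = lam_on if state_on else lam_off
            yield [rng.randint(1, k) for _ in range(poisson(rng, lam))]

    # example: ten slots of bursty arrivals with k = 16
    for passes in mmpp_arrivals(10, lam_on=2.0, lam_off=0.05, p_on=0.1, p_off=0.3, k=16):
        print(passes)

For instance, parameters yielding an average packet rate of λ = 0.3 with k = 4 give an average load of 0.3 · (4 + 1)/2 = 0.75 passes per slot, the underload end of the range explored in Section V-A.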

Figures 7 and 8 provide the results of our simulations. The Y-axis in all figures represents the ratio between the algorithms' performance and the optimal performance possible given the arrival sequence. For the case where α = 0, the optimal performance is obtained by PQ1 (as proved in Theorem 3), whereas for α > 0 the optimal performance is obtained by the algorithm that incurs the copying cost only upon transmission (as proved in Theorem 6).


We conduct two sets of simulations: one targeted at a better understanding of the dependence on the number of recycles, and the other targeted at evaluating the power of having multiple cores. We note that the standard deviation throughout our simulation study never exceeds 0.05 (deviation bars are omitted from the figures for readability).

A. Variable Maximum Number of Required Passes

In the first set of simulations we set the average arrival rate to λ = 0.3. By performing simulations for variable values of the maximum number of required passes k in the range [4, 24], we essentially evaluate the performance of our algorithms in settings ranging from underload (average arrival load of 0.3 · (4 + 1)/2 = 0.75) to extreme overload (average arrival load of 0.3 · (24 + 1)/2 = 3.75), which enables validating the performance of our algorithms in various traffic scenarios. For every choice of parameters, we conducted 20 rounds of simulation, where each round consisted of simulating the arrival of 1000 packets. Throughout our simulations we used a buffer of size B = 20, and restricted our attention to the single-core case, i.e., C = 1.

For α = 0, Figure 7(a) shows that the performance of PQβ degrades as β increases. This behavior is expected, since the optimal performance is known to be obtained by algorithm PQ1, which preempts whenever some gain can be obtained. The non-preemptive algorithm (PQ∞) has poor performance, and the performance of FIFO1 lies between that of the PQβ algorithms and that of the non-preemptive algorithm. When further considering the performance of the algorithms for increasing values of α, in Figures 7(a)-7(c), and most notably in Figure 7(c), an interesting phenomenon is exhibited: the performance of all algorithms (especially FIFO1) degrades substantially, save the performance of the non-preemptive algorithm, which remains essentially unaltered.

One of the most interesting aspects arising from our simulation results is that they seem to imply that our worst-case analysis has been beneficial in designing algorithms that also work well on average. This can be seen especially by comparing Figures 7(b) and 7(c): the results show that when α changes, the value of β for which PQβ performs best also changes (specifically, compare PQ1.5 and PQ2). This change is in accordance with the value of β that optimizes the competitive ratio, which is a worst-case bound derived from our analysis (see, e.g., the optimal values of β appearing in Table I for k = 10).

B. Variable Number of Cores

In this set of simulations we evaluated the performance of our algorithms for variable values of C in the range [1, 25]. For each choice of parameters, we conducted 20 rounds of simulation, where each round consisted of simulating the arrival of 1000 packets. Throughout our simulations we used a buffer of size B = 20, and used k = 16 as the maximum number of passes required by any packet.

Figure 8(a) presents the results for a constant traffic arrival rate of λ = 0.3. Not surprisingly, the performance of all algorithms improves drastically as the number of cores increases.

The increase in the number of cores essentially provides the network processor with a speedup proportional to the number of cores (assuming the average arrival rate remains constant).

We further evaluate the performance of our algorithms for an increasing number of cores, while simultaneously increasing the average arrival rate (set to λ = 0.3 · C, for each value of C), such that the ratio between the speedup and the arrival rate remains constant. The results of this set of simulations are presented in Figures 8(b) and 8(c), for α = 0 and α = 0.4, respectively. Contrary to what might have been expected, the performance of some of the algorithms is not monotonically non-decreasing as the number of cores increases. Furthermore, the performance of some of the algorithms, and especially the non-preemptive algorithm PQ∞, decreases drastically as the number of cores increases (up to a certain point), when compared to the optimal performance possible. Only once the number of cores is sufficiently large (which occurs when C ≥ 14) do all algorithms exhibit a steady improvement in performance as the number of cores further increases. This is due to the fact that for such a large number of cores, almost all packets in the buffer are scheduled in every time slot (recall that the buffer used in our simulations has a size of B = 20). It is interesting to note that this trend is independent of the value of α for both FIFO1 and PQ∞. These results provide further motivation, beyond the worst-case lower bounds presented in Section III-A, for adopting preemptive buffer management policies in multi-core, multipass NPs, and show the vulnerability of architectures based on FIFO buffers.

VI. DISCUSSION

The increasingly heterogeneous packet-processing needs of NP traffic are posing design challenges to NP architects. In this paper we provide performance guarantees for various algorithms within the multipass NP architecture, and further validate these results by simulations.

Our results can be extended in several directions to reflect current NP constraints. Our work, which focuses on unit-sized packets and homogeneous PPEs, can be considered a first step towards solutions that deal more generally with variable packet sizes and heterogeneous PPEs. In addition, it would be interesting to study non-greedy algorithms equipped with an admission control mechanism that aims at maximizing the guaranteed NP throughput. Last, it would be interesting to see the impact of moving the computation of the number of passes needed by each packet from the entrance of the NP to the PPEs during the first pass. This is especially interesting because the first pass often corresponds to processing features that lead to the early dropping of packets, such as ACL.

REFERENCES

[1] W. Aiello, R. Ostrovsky, E. Kushilevitz and A. Rosen. Dynamic routing on networks with fixed-size buffers. SODA, pp. 771–780, 2003.

[2] S. Albers and M. Schmidt. On the performance of greedy algorithms in packet buffering. SIAM Journal on Computing, 35(2), pp. 278–304, 2005.

[3] S. Albers and T. Jacobs. An experimental study of new and known online packet buffering algorithms. ESA, pp. 754–765, 2007.

[4] Y. Azar and M. Litichevskey. Maximizing throughput in multi-queue switches. Algorithmica, 45(1), pp. 69–90, 2006.


[Figure 7 here. Panels: (a) α = 0; (b) α = 0.1; (c) α = 0.4. Y-axis: transmitted value versus optimal; X-axis: maximal number of recycles.]

Fig. 7. Performance ratio of online algorithms versus optimal for different values of α, as a function of the maximum number of passes k required by a packet. The results presented are for a single core (i.e., C = 1). The average arrival rate of the simulated traffic for each value of k is fixed at 0.3 (packets per time slot).

[Figure 8 here. Panels: (a) constant rate λ = 0.3 (α = 0.4, k = 16); (b) α = 0 (λ = 0.3 · C, k = 16); (c) α = 0.4 (λ = 0.3 · C, k = 16). Y-axis: transmitted value versus optimal; X-axis: number of PPEs.]

Fig. 8. Performance ratio of online algorithms versus optimal for different values of α, as a function of the number of cores C. In Figure 8(a) the arrival rate is kept constant at λ = 0.3, regardless of the number of cores C. In all other figures the average arrival rate of the simulated traffic for each value of C is proportional to the number of cores (set to λ = 0.3 · C).

[5] Y. Azar and Y. Richter. Management of multi-queue switches in QoS networks. Algorithmica, 43(1-2), pp. 81–96, 2005.

[6] Y. Azar and Y. Richter. An improved algorithm for CIOQ switches. ACM Transactions on Algorithms, 2(2), pp. 282–295, 2006.

[7] N. Bansal, H. Chan, and K. Pruhs. Speed scaling with an arbitrary power function. SODA, pp. 693–701, 2009.

[8] M. Becchi, M. Franklin, and P. Crowley. Performance/area efficiency in chip multiprocessors with micro-caches. 4th ACM International Conference on Computing Frontiers (CF), Ischia, Italy, May 2007.

[9] A. Borodin and R. El-Yaniv. Online Computation and Competitive Analysis. Cambridge University Press, 1998.

[10] L. Becchetti, S. Leonardi, A. Marchetti-Spaccamela, and K. Pruhs. On-line weighted flow time and deadline scheduling. Journal of Discrete Algorithms, 4(3):339–352, 2006.

[11] C. Chan and N. Bambos. Throughput loss in task scheduling due to server state uncertainty. International Conference on Performance Evaluation Methodologies and Tools, 2009.

[12] E. L. Hahne, A. Kesselman, and Y. Mansour. Competitive buffer management for shared-memory switches. SPAA, pp. 53–58, 2001.

[13] X. Huang and T. Wolf. Evaluating dynamic task mapping in network processor runtime systems. TPDS, 19(8):1086–1098, 2008.

[14] A. Kesselman, Z. Lotker, Y. Mansour, and B. Patt-Shamir. Buffer overflows of merging streams. ESA, pp. 349–360, 2003.

[15] A. Kesselman and Y. Mansour. Harmonic buffer management policy for shared memory switches. TCS, 324(2-3):161–182, 2004.

[16] A. Kesselman, Z. Lotker, B. Patt-Shamir, Y. Mansour, B. Schieber, and M. Sviridenko. Buffer overflow management in QoS switches. SIAM Journal on Computing, 33(3):563–583, 2004.

[17] S. Leonardi and D. Raz. Approximating total flow time on parallel machines. STOC, pp. 110–119, 1997.

[18] J. Lu and J. J. Wang. Analytical performance analysis of network-processor-based application designs. Proc. 15th Int'l Conf. Computer Comm. and Networks (ICCCN '06), pp. 33–39, Oct. 2006.

[19] R. Motwani, S. Phillips, and E. Torng. Nonclairvoyant scheduling. TCS, 130(1):17–47, 1994.

[20] J. Mudigonda, H. M. Vin, and R. Yavatkar. A case for data caching in network processors. Unpublished manuscript.

[21] S. Muthukrishnan, R. Rajaraman, A. Shaheen, and J. Gehrke. Online scheduling to minimize average stretch. FOCS, pp. 433–442, 1999.

[22] K. Pruhs. Competitive online scheduling for server systems. SIGMETRICS, 34(4):52–58, 2007.

[23] T. Sherwood, G. Varghese, and B. Calder. A pipelined memory architecture for high throughput network processors. ISCA, pp. 288–299, 2003.

[24] C. Wiseman, et al. Remotely accessible network processor-based router for network experimentation. ANCS, pp. 20–29, 2008.

[25] N. Weng and T. Wolf. Analytic modeling of network processors for parallel workload mapping. TECS, 8(3):1–29, 2009.

[26] T. Wolf, P. Pappu, and M. A. Franklin. Predictive scheduling of network processors. Computer Networks, 41(5):601–621, 2003.

[27] D. Sleator and R. Tarjan. Amortized efficiency of list update and paging rules. Commun. ACM, 28(2):202–208, 1985.

[28] Cavium, OCTEON II CN68XX Multi-Core MIPS64 Processors, Product Brief, 2010. [Online] http://www.caviumnetworks.com/OCTEON-II CN68XX.html

[29] AMCC, nP7310 10 Gbps Network Processor, Product Brief, 2010. [Online] http://www.appliedmicro.com/MyAMCC/jsp/public/productDetail/product detail.jsp?productID=nP7310

[30] Xelerated, X11 Family of Network Processors, Product Brief, 2010. [Online] http://www.xelerated.com/Uploads/Files/67.pdf

[31] EZChip, NP-4 Network Processor, Product Brief, 2010. [Online] http://www.ezchip.com/p np4.htm

[32] Netronome, NFP-32xx Network Flow Processor, Product Brief, 2010. [Online] http://www.netronome.com/pages/network-flow-processors

[33] Cisco, The Cisco QuantumFlow Processor, Product Brief, 2010. [Online] http://www.cisco.com/en/US/prod/collateral/routers/ps9343/solution overview c22-448936.html

[34] Juniper, Junos Trio, White Paper, 2009. [Online] http://www.juniper.net/us/en/local/pdf/whitepapers/2000331-en.pdf