7/31/2019 Packet Forwarding Capabilities
1/14
Understanding the Packet Forwarding Capability
of General-Purpose Processors
Katerina Argyraki, Kevin Fall, Gianluca Iannaccone, Allan Knies, Maziar Manesh, Sylvia Ratnasamy
EPFL, Intel Research
Abstract: Compared to traditional high-end network equipment built on specialized hardware, software routers running on commodity servers offer significant advantages: lower costs due to large-volume manufacturing, a widespread supply/support chain, and, most importantly, programmability and extensibility. The challenge is scaling software-router performance to carrier-level speeds. As a first step, in this paper, we study the packet-processing capability of modern commodity servers; we identify the packet-processing bottlenecks, examine to what extent these can be alleviated through upcoming technology advances, and discuss what further changes are needed to take software routers beyond the small enterprise.
1 Introduction
To what extent are general-purpose processors capable of
high-speed packet processing? The answer to this ques-
tion could have significant implications for how future
network infrastructure is built. To date, the development
of network equipment (switches, routers, various middleboxes) has focused primarily on achieving high performance for relatively limited forms of packet processing.
However, as networks take on increasingly sophisticated
functionality (e.g., data loss protection, application ac-
celeration, intrusion detection), and as major ISPs com-
pete in offering new services (e.g., video, mobility sup-
port services), there is an increasing need for network
equipment that is programmable and extensible. And in-
deed, both industry and research have already taken ini-
tial steps to tackle the issue [4, 6, 7, 9, 21].
In current networking equipment, high performance and programmability are competing goals, if not mutually exclusive ones. On the one hand, we have high-end switches and routers that rely on specialized hardware and software and offer high performance, but are notoriously difficult to extend, program, or otherwise experiment with. On the other hand, we have software routers, where all significant packet-processing steps are performed in software running on commodity PC/server platforms; these are, of course, easily programmable, but only suitable for low-packet-rate environments such as small enterprises [6].
The challenge of building network infrastructure that
is both programmable and capable of high performance
can be approached from one of two extreme starting
points. One approach would be to start with existing
high-end, specialized devices and retro-fit programma-
bility into them. For example, some router vendors have
announced plans to support limited APIs that will allow
third-party developers to change/extend the software part
of their products (which does not typically involve core
packet processing) [7,9]. A larger degree of programmability is possible with network-processor chips, which offer a semi-specialized option, i.e., implement only the most expensive packet-processing operations in specialized hardware and run the rest on programmable processors. While certainly an improvement, we note that, in practice, network processors have proven hard to program: in the best case, the programmer needs to learn a new language; in the worst, she must be aware of (and program to avoid) low-level issues like resource contention during parallel execution or expensive memory accesses [14, 16].
From the opposite end of the spectrum, a different ap-
proach would be to start with software routers and op-
timize their packet-processing performance. The allure
of this approach is that it would allow network infras-
tructure to tap into the many desirable properties of the
PC-based ecosystem, including lower costs due to large-
volume manufacturing, rapid advances in power manage-
ment, familiar programming environment and operating
systems, and a widespread supply/support chain. In other
words, if feasible, this approach could enable a network
infrastructure that is programmable in much the same
way as end-systems are today. The challenge is taking
this approach beyond the small enterprise, i.e., scaling
PC/server packet-processing performance to carrier-level
speeds.
It is perhaps too early to tell which approach dominates; in fact, it's more likely that each approach results in different tradeoffs between programmability and performance, and these tradeoffs will cause each to be adopted where appropriate. As yet, however, there has been little research exposing what tradeoffs are achievable. As a first step in this direction, in this paper, we explore the performance limitations for packet processing on commodity servers.
A legitimate question at this point is whether the performance requirements for network equipment are just too high and our exploration is a fool's errand. The bar is indeed high: in terms of individual link/port speeds, 10Gbps is already widespread and 40Gbps is being deployed at major ISPs; in terms of aggregate switching speeds, carrier-grade routers range from 40Gbps to a high of 92Tbps! Two developments, however, lend us
hope. The first is a recent research proposal [11] that
presents a solution whereby a cluster of N servers can
be interconnected to achieve aggregate switching speeds
of NR bps, provided each server can process packets at
a rate on the order of R bps. This result implies that, in
order to scale software routers, it is sufficient to scale a
single server to individual line speeds (10-40Gbps) rather
than aggregate speeds (40Gbps-92Tbps). This reduction
makes for a much more plausible target.
Secondly, we expect that the current trajectory in
server technology trends will work in favor of packet-
processing workloads. For example, packet processing appears naturally suited to exploiting the tremendous
computational power that multicore processors offer par-
allel applications. Similarly, I/O bandwidth has gained
tremendously by the transition from PCI-X to PCIe al-
lowing 10Gbps Ethernet NICs to enter the PC market [1].
And finally, as we discuss in Section 4, the impending ar-
rival of multiprocessor architectures with multiple inde-
pendent memory controllers should offer a similar boost
in available memory bandwidth.
While there is widespread awareness of these ad-
vances in server technology, we find little comprehensive
evaluation of how these advances can/do translate into
performance improvements for packet-processing work-
loads. Hence, in this paper, we undertake a measurement
study aimed at exploring these issues. Specifically, we
focus on the following questions:
- what are the packet-processing bottlenecks in modern general-purpose platforms;
- what (hardware or software) architectural changes can help remove these bottlenecks;
- do the current technology trends for general-purpose platforms favor packet processing?
As we shall see, answering these seemingly straightforward questions requires a surprising amount of sleuthing. Modern processors and operating systems are both beasts of great complexity. And while current hardware/software offer extensive hooks for measurement and system profiling, these can be equally overwhelming. For example, current x86 processors have over 400 performance counters that can be programmed for detailed tracing of everything from branch mispredictions to I/O data transactions. It's thus easy (as we discovered) to sink in a morass of performance-monitoring data. Part of our contribution is thus a methodology by which to go about such an evaluation. Our study adopts a top-down approach in which we start with black-box testing and then recursively identify and drill down into only those aspects of the overall system that merit further scrutiny.
Finally, it is important to note that even though our
study stemmed from an interest in programmable net-
work infrastructure, our findings are relevant to more
than just the network context. Packet processing is just one instance of a more general class of stream-based applications (such as real-time video delivery, stock trading, etc.). Our findings apply equally to these.
The remainder of this paper is organized as follows.
We start the paper in Section 2 with some high-level anal-
ysis estimating upper bounds on the packet processing
performance for different server architectures. Section 3
follows this with a measurement study aimed at iden-
tifying the bottlenecks and overheads on these servers. We present the inferences from our measurement study
in Section 4 and discuss potential improvements in Sec-
tion 5. We discuss related work in Section 6 and finally
conclude.
2 Optimistic Back-of-the-Envelope Analysis
Before delving into experimentation, we would like to
calibrate our expectations. We thus start with a simple
thought experiment aimed at estimating absolute upper
bounds on the packet forwarding performance of both existing and next-generation servers. Since our goal is
quick calibration, our reasoning here is deliberately both
coarse-grained and optimistic; the experimental results
that follow will show where reality lies.
Figures 1 and 2 present a high-level view of two server
architectures: Fig.1 depicts a traditional shared-bus ar-
chitecture used in current x86 servers [3], while Fig.2
represents a point-to-point architecture as will be sup-
ported by the next-generation of x86 servers [8].
In the shared-bus architecture, communication be-
tween the CPUs, memory, and I/O is routed through the
chipset that includes the memory and I/O bus con-
trollers. There are three main system buses in this architecture. The front-side bus (FSB) is used for communication both between different CPUs and between a CPU¹ and the chipset. The PCIe bus connects I/O devices, including network interfaces, to the chipset via one or more high-speed serial channels known as "lanes" and, finally, the memory bus connects the memory and chipset.
¹ In this paper we will use the terms CPU, socket and processor interchangeably to refer to a multi-core processor.
Figure 1: Traditional shared-bus architecture
Figure 2: Point-to-point architecture
The point-to-point server (Fig.2) represents two sig-
nificant architectural changes relative to the above: first,
the FSB is replaced by a mesh of dedicated point-to-point
links thus removing a potential bottleneck for inter-CPU
communication. Second, the point-to-point architecture
replaces the single external memory controller shared
across CPUs with a memory controller integrated within
each CPU; this leads to a dramatic increase in aggregate memory bandwidth, since each CPU now has a dedicated link to a portion of the overall memory space. Servers based on such point-to-point architectures and with up to 32 cores (4 sockets and 8 cores/socket) are due to emerge
in the near future [10].
To estimate a server's packet-forwarding capability, we consider the following minimal set of operations typically required to forward an incoming packet and the corresponding load they impose on each of the primary system components:
1. The incoming packet is DMA-ed from the network card (NIC) to main memory (incurring one transaction on the PCIe and memory bus).
2. The CPU reads the packet header (one transaction on the FSB and memory bus).
3. The CPU performs any necessary packet processing (CPU-only, assuming no bus transactions).
4. The CPU writes the modified packet header to memory (one transaction on the memory bus and FSB).
5. The packet is DMA-ed from memory to NIC (one transaction on the memory and PCIe bus).
Figures 1 and 2 also show the manner in which each of
these operations maps onto the various system buses for
the architecture in question. As we see, for the shared-
bus architecture, a single packet results in 4 transactions
on the memory bus and 2 on each of the FSB and PCIe
buses; thus, a line rate of R bps leads to (roughly) a load
of 4R, 2R, and 2R on each of the memory, FSB, and
PCIe buses.² Currently available technology advertises
memory, FSB, and PCIe bandwidths of approximately
100Gbps, 85Gbps, and 64Gbps respectively (assuming
DDR2 SDRAM at 800MHz, a 64-bit wide 1.33GHz
FSB, and 32-lane PCIe1.1); these numbers suggest that
a current shared-bus architecture could sustain line rates
up to R = 25 Gb/s.
For the point-to-point architecture, each packet con-
tributes 4 memory-bus transactions, 4 transactions on the
inter-socket point-to-point links, and 2 PCIe transactions;
since we have 4 memory buses, 6 inter-socket links and
4 PCIe links, assuming uniform load distribution across
the system, a line rate of R bps yields loads of R, 2R/3, and R/2 on each of the memory, inter-socket, and PCIe buses respectively. If we (conservatively) assume simi-
lar technology constants as before (memory, inter-socket,
and PCIe bandwidths at 100Gbps, 85Gbps, and 64Gbps
respectively) this suggests a point-to-point architecture
could scale to line rates of 40Gb/s and even higher.
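The per-bus feasibility reasoning above reduces to a few lines of arithmetic. The sketch below uses the transaction counts and advertised bandwidths from the text; the helper function is illustrative only.

```python
# Back-of-the-envelope: the sustainable line rate R is bounded, on each
# bus, by the advertised bandwidth divided by the load factor that one
# packet imposes on that bus (e.g., 4 memory-bus transactions -> 4R).

def max_line_rate(load_factors, bandwidths_gbps):
    """Largest R (Gbps) such that factor * R <= bandwidth on every bus."""
    return min(bw / f for f, bw in zip(load_factors, bandwidths_gbps))

# Shared-bus architecture: loads of 4R, 2R, 2R on the memory bus, FSB,
# and PCIe, against advertised bandwidths of 100, 85, and 64 Gbps.
shared_bus = max_line_rate([4, 2, 2], [100, 85, 64])              # -> 25.0

# Point-to-point architecture: spreading the per-packet transactions over
# 4 memory buses, 6 inter-socket links, and 4 PCIe links gives per-bus
# loads of R, 2R/3, and R/2 (uniform load distribution assumed).
point_to_point = max_line_rate([1, 2 / 3, 1 / 2], [100, 85, 64])  # -> 100.0

print(shared_bus, point_to_point)
```

Under these optimistic constants the shared-bus design tops out at R = 25Gb/s (memory-bus bound), while the point-to-point design clears 40Gb/s with room to spare, matching the conclusions above.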
In terms of CPU resources: If we assume min-sized packets of 40 bytes, then the packet interarrival time is 32ns for speeds of R = 10Gb/s and 8ns for R = 40Gb/s. For the shared-bus server with 8 CPUs, each with a speed of 3 GHz (available today), this implies a budget of 3072 and 768 cycles/pkt for line rates of 10Gbps and 40Gbps respectively. Assuming a cycles-per-instruction (CPI) ratio of 1, this suggests a budget of 3072 (768) instructions per packet for line rates of 10Gb/s (40Gb/s).
With 32 cores at similar speeds, the point-to-point server
would see a budget of 12288 and 3072 instructions/pkt
for 10Gb/s and 40Gb/s respectively.
In summary, based on the above, one might conclude
that current shared-bus architectures may scale to 25Gb/s but not 40Gb/s, while emerging servers may scale even
to 40Gb/s.
² This estimate assumes that the entire packet (rather than just the header) is read to/from the memory and CPU for packet processing. A more accurate estimate would account for packet header sizes (or cache line sizes, if smaller than header lengths). We ignore this here since our tests in the following section consider only min-sized packets of 64 bytes, equal to a cache line length, which makes the inaccuracy of little relevance.
3 Measurement-based Analysis
We now turn to experimentation. We first de-
scribe our experimental setup and then present the
packet-forwarding rates achieved by unmodified soft-
ware/hardware.
Experimental Setup For our experiments, we use a mid-level server machine running SMP Click [18]. Our server is a dual-socket machine with 1.6GHz quad-core CPUs, an L2 cache of 4MB, two 1.066GHz FSBs (one to each socket) and 8 GBytes of DDR2-667 SDRAM. With the exception of the CPU speeds, these ratings are similar to those of the shared-bus architecture from Figure 1 and, hence, our results should be comparable. The machine has a total of
16 1GigE NICs. To source/sink traffic, we use two addi-
tional servers each of which is connected to 8 of the 16
NICs on our test machine. We generate (and terminate)
traffic using similar servers with 8 GigE NICs. We instru-
ment our servers with Intel EMON, a performance mon-
itoring tool similar to Intel VTune, as well as a chipset-
specific tool that allows us to monitor memory-bus usage.³
The forwarding rates achieved will depend on the na-
ture of the traffic workload. To a first approximation,
this workload can be characterized by: (1) the incom-
ing packet arrival rate r, measured in packets/sec, (2) the
size of packets, P, measured in bytes (hence the incom-
ing rate R = rP) and (3) the processing per packet. We focus on evaluating the fundamental capability of the
system to move packets through and, hence, start by con-
sidering only the first two factors (packet rate and size)
without considering any sophisticated packet processing.
Hence, we remove the IP routing components from our
Click configuration and only implement simple forward-
ing that enforces a route between source and destination
NICs; i.e., packets arriving on NIC #0 are sent to NIC
#1, NIC #2 to NIC #3 and so on. We have 16 NICs and,
hence, use 8 kernel threads, each pinned to one core and
each in charge of one input/output NIC pair. In the re-
sults that follow, where the input rate to the system is un-
der 8Gbps, we use one of our traffic generation servers
as the source and the other as sink; for tests that require
higher traffic rates each server acts as both source and
sink allowing us to generate input traffic up to 16Gbps.
Measured performance We start by looking at the
loss-free forwarding rate the server can sustain (i.e.,
without dropping packets) under increasing input packet
³ Although our tools are proprietary, many of the measures they re-
port are derived from public performance counters and, in these cases,
our tests are reproducible. In an extended technical report, we will
present in detail how our measures, when possible, can be derived from
the public performance counters available on x86 processors.
Figure 3: Forwarding rate under increasing load for different packet sizes. [Plot: sustained load (Gbps) vs. offered load (Gbps), with curves for 64-, 128-, 256-, 512-, and 1024-byte packets.]
rates and for various packet sizes. We plot this sustained
rate in terms of both bits-per-second (bps) and packets-
per-second (pps) in Figures 3 and 4 respectively. We see
that, in the case of larger packet sizes (1024 bytes and
higher), the server scales to 14.9 Gbps and can keep up
with the offered load up to the maximum traffic we can
generate given the number of slots on the server; i.e.,
packet forwarding isn't limited by any bottleneck inside the server. However, in the case of 64-byte packets, we see that performance saturates at around 3.4 Gbps, or 6.4 million pps. As Figure 4 suggests, the server is troubled by the high input packet rate (pps) rather than the bit rate (bps). Note that the case of 64-byte packets is the worst-case traffic scenario. Though unlikely in reality, it plays an important role, as it is considered the reference benchmark by network equipment manufacturers.
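Since the results are read in both bps and pps, the conversion between the two is worth making explicit. A quick sketch, ignoring Ethernet framing overhead (packet sizes and rates taken from the text):

```python
def to_mpps(rate_gbps, pkt_bytes):
    """Packets per second (in millions) carried by a given bit rate."""
    return rate_gbps * 1e9 / (pkt_bytes * 8) / 1e6

# The 3.4 Gbps saturation point for 64-byte packets corresponds to
# roughly 6.6 Mpps (close to the reported 6.4 Mpps, which reflects
# rounding of the measured rates), while 14.9 Gbps of 1024-byte
# packets is under 2 Mpps: small packets stress packet rate, not bit rate.
print(to_mpps(3.4, 64))
print(to_mpps(14.9, 1024))
```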
Relative to the back-of-the-envelope estimates we ar-
rived at in the previous section, we can conclude that,
while our server approaches the estimated rates for larger
packet sizes, for small packets, the achievable rates are
well below our estimates. At a high level, our reasoning could have been wildly off-target for two reasons: (1) in assuming that the nominal/advertised rates for each system component (PCIe, memory, FSB) are attainable in practice and/or (2) in our estimates of the overhead per packet (4x, 2x, etc.). In what follows, we look into
each of these possibilities. In Section 3.1 we attempt to
track down the bottleneck(s) that limit(s) the forwarding
rate for small packets and, in so doing, estimate attain-
able performance limits for the different system compo-
nents. In Section 3.2 we take a closer look at the per-
packet overheads and attempt to deconstruct these into
their component causes.
Figure 4: Forwarding rate under increasing load for different packet sizes, in pps. [Plot: sustained load (Mpps) vs. offered load (Mpps), with curves for 64-, 128-, 256-, 512-, and 1024-byte packets.]
3.1 Bottleneck Analysis
We look for the bottleneck through a process of elimination, starting with the four major system components discussed earlier (the CPUs and the three system buses), and drilling deeper as and when it appears warranted.
CPU The CPUs are plausible candidates, since CPU
processing depends on the incoming packet rate, and
performance saturates as soon as we reach a specific
packet rate (the same for 64-byte and 128-byte packets,
as shown in Figure 4). Note that the traditional metric
of CPU utilization reveals little here, because Click op-
erates in a pure polling mode, where the CPUs are al-
ways 100% utilized. Instead, we look at the number of
empty polls, i.e., the number of times the CPU polls
for packets to process but none are available in memory.
Our measurements reveal that, even at the saturation rate
(3.4Gbps for 64-byte packets), we still see a non-trivial
number of empty polls: approximately 62,000 per second for each core. Hence, we eliminate CPU processing
as a candidate bottleneck.
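To make the empty-polls metric concrete, here is a minimal simulation of a polling loop in the style of Click's packet polling (all names here are hypothetical; the actual measurement comes from Click's own counters, not from code like this). An empty poll is simply an iteration that finds no packet waiting:

```python
import random

def run_polling_core(iterations, pkt_probability, seed=0):
    """Simulate a 100%-utilized polling core; return (served, empty_polls).

    The CPU is always busy polling, so CPU utilization is a useless
    signal; the count of empty polls is what reveals remaining headroom.
    """
    rng = random.Random(seed)
    served = empty = 0
    for _ in range(iterations):
        if rng.random() < pkt_probability:  # a packet was waiting in memory
            served += 1
        else:                               # polled, found nothing: empty poll
            empty += 1
    return served, empty

served, empty = run_polling_core(1_000_000, 0.98)
print(served, empty)  # even near saturation, a non-trivial number of empty polls
```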
System buses Our tools allow us to directly measure
the load in bits/sec on the FSB and memory bus; the
load difference between these two buses gives us an es-
timate for the PCIe load. Note that this is not always a good estimate, as FSB bandwidth can be consumed by inter-socket communication, which does not figure on the memory bus; however, it does make sense in our par-
ticular setup (with each input/output port pair consis-
tently served by the same socket) which yields little inter-
socket communication. Figures 5 and 6 plot the load on
each of the FSB, memory, and PCIe buses for 64-byte
and 1024-byte packets under increasing input rates. We
see that, for any particular line rate, the load on all three
Figure 5: Bus bandwidths for 64-byte packets. [Plot: observed load (Gbps) on the memory, I/O, and FSB buses vs. offered load (Gbps).]
Figure 6: Bus bandwidths for 1024-byte packets. [Plot: observed load (Gbps) on the memory, I/O, and FSB buses vs. offered load (Gbps).]
buses is always higher with 64-byte packets than with
1024-byte ones. Hence, any of the buses could be the
bottleneck, and we proceed to examine each one more
closely.
FSB Under the covers, the FSB consists of separate
data and address buses, and our tools allow us to sep-
arately measure the utilization of each. The results are
shown in Figures 7 and 8: while it is clear that the
data bus is under-utilized, it is not immediately obvious
whether this is the case for the address bus as well. To gauge the maximum attainable utilization on each bus, we wrote a simple benchmark program (we will refer to it as the "stream" benchmark from now on) that creates and writes to a very large array. This benchmark consumes 50 Gbps of FSB bandwidth, which translates into 37% data-bus utilization and 74% address-bus utilization. These
numbers are well above the utilization levels from our
packet-forwarding workload, which means that the latter
does not saturate the FSB. Hence, we conclude that the
Figure 7: FSB data and address bus utilization for 64-byte packets. [Plot: bus utilization (%) vs. offered load (Gbps).]
Figure 8: FSB data and address bus utilization for 1024-byte packets. [Plot: bus utilization (%) vs. offered load (Gbps).]
FSB is not the bottleneck.
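The stream benchmark itself is specific to our setup, but its structure is simple. A minimal Python analogue (illustrative only, and far slower than a C implementation would be) just streams writes through a large buffer:

```python
import time

def stream_write_gbps(size_mb=256, reps=4):
    """Sequentially overwrite a large buffer and report apparent Gbit/s.

    A real stream benchmark would be written in C to avoid interpreter
    overhead; the point here is only the access pattern: large, strictly
    in-order writes that stress the data bus and, at one address per
    cache line, the address bus.
    """
    buf = bytearray(size_mb * 1024 * 1024)
    chunk = b"\xff" * (1 << 20)                # 1MB of data to copy in
    start = time.perf_counter()
    for _ in range(reps):
        for off in range(0, len(buf), len(chunk)):
            buf[off:off + len(chunk)] = chunk  # sequential writes
    elapsed = time.perf_counter() - start
    return size_mb * reps * 8 / 1024 / elapsed  # Gbit moved per second

print(round(stream_write_gbps(), 2))
```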
PCIe Unlike the FSB, where all operations are in fixed-size units (64 bytes, a cache line), the PCIe bus supports variable-length transfers; hence, if the PCIe bus is the bottleneck, this could be due either to the incoming bit rate or to the requested operation rate (which depends on the incoming packet rate). To test the former, we simply look at the maximum bit rate that the PCIe bus has sustained; from Figures 5 and 6, we see that the maximum PCIe load for 1024-byte packets exceeds the PCIe load recorded at saturation for 64-byte packets. Hence, the PCIe bit rate is not the problem.
To test whether the PCIe operation rate is the bottle-
neck, we measure the maximum packet rate that can be
sustained by each individual PCIe lane. Our rationale
is the following: given that PCIe lanes are independent
from each other, if we can successfully drive the packet
Figure 9: Forwarding rate (top) and PCIe load (bottom) for 64-byte packets as a function of the number of cores. [Plots: Gbps vs. number of cores, 1 to 8.]
rate on a single lane beyond the per-lane rate recorded
at saturation, then we will have shown that our packet-forwarding workload does not saturate the PCIe lanes
and, hence, packet rate is not the problem. To this end, we
start with a single pair of input/output ports pinned to a
single core and gradually add ports and cores. The results
are shown in Figure 9, where we plot both the sustained
forwarding rate and the PCIe load: we already know that
for 64-byte packets, at saturation, each input/output port
pair sustains approximately 0.4Gbps (Figure 3); from
Figure 9, we see that each individual port pair (and,
hence, the corresponding PCIe lanes) can go well beyond
that rate (approx. 0.75Gbps). Hence, we conclude that
the PCIe bus is not the bottleneck either.
Memory This leaves us with the memory bus as the
only potential culprit. To estimate the maximum attain-
able memory bandwidth, we use the stream benchmark
described above, which consumes 51Gbps of memory-
bus bandwidth. This is about 54% higher than the 33Gbps maximum consumed by our 64-byte packet-forwarding workload, surprisingly suggesting that aggregate memory bandwidth is not the bottleneck either.⁴
This would seem to return us to square one. However,
memory-system performance is notoriously sensitive to
details like access patterns and load balancing; hence, we
look further into these details. We consider two potential reasons why our packet-
forwarding workload might reduce memory-system ef-
⁴ Note that even 51Gbps is fairly low relative to the nominal rating
of 100Gbps we used in estimating upper bounds. It turns out this limit
is due to saturation of the address bus; recall that the address bus uti-
lization is 74% for the stream test; prior work [24] and discussions with
architects reveal that an address bus is regarded as saturated at approx-
imately 75% utilization. This is in keeping with the general perception
that, in a shared-bus architecture, the vast majority of applications are
bottlenecked on the FSB.
Figure 10: Memory load distribution (Gbps) across banks and ranks (4x4 grid). Left: 64-byte packets. Middle: 1024-byte packets. Right: the stream benchmark.
ficiency relative to the stream benchmark. The first is the
fact that the sequence of memory locations accessed due
to our workload is highly irregular, as opposed to the nicely in-sequence access pattern of the stream bench-
mark. To assess the impact of irregular accesses, we re-
run the stream benchmark but, instead of writing to each
array entry in sequence, we write to random locations.
This modification does cause a drop in memory band-
width, but the drop is modest (from 51Gbps to about
46Gbps), indicating that irregular accesses are not the
problem.
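The random-access variant of the experiment can be sketched the same way; again, this is a Python stand-in for the real benchmark, useful only to show the change in access pattern, not to reproduce the 51-to-46 Gbps numbers:

```python
import random
import time

def write_rate(n_words=2_000_000, sequential=True, seed=0):
    """Time n_words stores into a large array, in order or shuffled."""
    arr = [0] * n_words
    order = list(range(n_words))
    if not sequential:
        random.Random(seed).shuffle(order)  # irregular access sequence
    start = time.perf_counter()
    for i in order:
        arr[i] = 1                          # one store per visited word
    elapsed = time.perf_counter() - start
    return n_words / elapsed                # stores per second

seq = write_rate(sequential=True)
rnd = write_rate(sequential=False)
# On the server under test, randomizing the access sequence caused only
# a modest drop in memory bandwidth (51 Gbps -> ~46 Gbps).
print(round(seq / rnd, 2))
```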
The second reason is sub-optimal use of the physical
memory space: The memory system is internally orga-
nized as multiple memory channels (or branches) each
of which is organized as a grid of ranks and banks. In
particular, the 8GB memory on our machine consists of
two memory channels, each of which comprises a grid of 4 banks x 4 ranks; our tools report the memory traffic to different rank/bank pairs aggregated across memory channels; i.e., the memory traffic we report for (say) a pair (bank 1, rank 3) is the sum of the traffic seen on
(bank 1, rank 3) for each of the two memory channels.
Figure 10 shows the distribution of memory traffic over
the various ranks and banks for three workloads: (1) 64-
byte packets at the saturation rate of 3.4Gbps, (2) 1024-
byte packets at 15.2Gbps, and (3) the stream benchmark.
Notice that, while memory traffic is perfectly balanced
for the stream benchmark (and reasonably balanced for
the 1024-byte packet workload), for the 64-byte packet
workload, it is all concentrated on two rank/bank ele-
ments (in reality, one overloaded rank/bank pair on each channel, since the figure shows the aggregate load over the two channels).
This result suggests that the bottleneck is not the ag-
gregate memory bandwidth, but the bandwidth to the in-
dividual rank/bank elements that, for some reason, end
up carrying most of the 64-byte packet workload. To ver-
ify this, we measure the maximum attainable bandwidth
to a single rank/bank pair; we do this through a sim-
ple test that creates multiple threads, all of which continuously read and write a single location in memory.
The result is 7.2Gbps of memory traffic (all on a sin-
gle rank/bank pair), which is almost equal to the max-
imum per-rank/bank load recorded at saturation for the
64-byte packet workload. We should note that both the
CPUs and the FSB are under-utilized during this mem-
ory test. Hence, we conclude that the bottleneck is the
memory system, not because it lacks the necessary ca-
pacity, but because of the imbalance in accessed memory
locations.
We now look into why this imbalance takes place.
We see from Figure 10 that, for 1024-byte packets, the
memory load is much better distributed than for 64-
byte packets. This leads us to suspect that the imbal-
ance is related to the manner in which packets are laid out onto the rank/bank grid. We test this with an exper-
iment where we maintain a fixed packet rate (400,000
packets/sec) and measure the resulting memory-load dis-
tribution for different packet sizes (64 to 1500 bytes).
Figure 11 shows the outcome: Ignoring for the moment
the load on {bank 2, rank 3}, we observe that, as the
packet size increases, the additional memory load is dis-
tributed over increasing numbers of rank/bank pairs and,
within a single memory channel, this spilling over to
additional ranks and banks happens at the granularity of
64 bytes; for example, for 256-byte packets, we see in-
creased load on 3 rank/bank pairs, for 512-byte packets
on 4 rank/bank pairs, and so forth. Moreover, we observe that this growth starts from the low-ordered banks, i.e., bank 1 is loaded first, then bank 2, and so on.⁵
These observations lead us to the following theory: the
default packet-buffer size in Linux is 2KB; each such
⁵ Regarding the high load on {bank 2, rank 3}: we suspect this is
caused by the large number of empty polls that we see at the low packet
rate for this test, and that the location corresponds to the memory-
mapped I/O register being polled. We find the load on this rank-bank
drops with increasing packet rates further supporting this conjecture.
Figure 11: Memory load distribution (Gbps) across banks and ranks for different packet sizes (64, 128, 256, 512, 1024, and 1500 bytes) at a fixed packet rate.
buffer spans the entire rank/bank grid, which would al-
low high memory throughput if we were using the entire
2KB allocation. However, our 64-byte packet workload
ends up using only one of the rank/bank pairs on each
of the memory channels, leading to the two spikes we
see in Figure 10. To test this theory, we repeat our ear-
lier experiment with 64-byte packets from Figure 10, but
now change the default buffer allocation size to 1KB.
If our theory is right and a 2KB address space spans
the entire grid, then 1KB should span half the grid and,
hence, the two spikes in Figure 10 should split into 4
spikes. Figure 12 shows that this is indeed the case. Unfortunately (for reasons we do not yet fully understand), we have not been able to allocate still smaller buffer sizes (e.g., 128B), due to the need for the device driver to accommodate additional data structures, and hence we do not experiment with even smaller allocations. Nonetheless, our experiment with 1024-byte buffers clearly shows the cause of (and a potential remedy for) the problem of skewed memory load. As we discuss in
Section 5, we believe this issue could be fixed in a gen-
eral manner through the use of a modified memory allo-
cator that allows for variable-size buffer allocations.
Finally, if our conjecture that this imbalance was the performance bottleneck is right, then reducing the imbalance should translate to higher packet-forwarding rates. Happily, using 1024B buffers, we do see a 29.5% increase in forwarding rate, from 3.4Gbps to 4.4Gbps; Figure 13 shows this improvement in terms of packet rate (from 6.4 to 8.2 Mpps).
Summary of bottleneck analysis. The experiments presented above showed what rates are achievable on each system component for hand-crafted workloads like our stream benchmark. We use these rates as re-calibrated
[Figure 12 plots: two panels, original Click w/2048-byte buffers and modified Click w/1024-byte buffers; axes: rank, bank, load (Gbps).]
Figure 12: Memory load distribution across banks and
ranks for 64B packets and two different sizes of packet
buffers.
[Figure 13 plot: sustained load (Mpps) vs. offered load (Mpps), for original Click w/2048-byte buffers and modified Click w/1024-byte buffers.]
Figure 13: Before-and-after forwarding rates for 64B
packets and two different sizes of packet buffers.
system component    attainable limit (Gbps)    load w/ 64B router (Gbps)    room-to-grow (%)
1 rank-bank         7.2 (from volatile-int)    7.168                        0
FSB address bus     74 (from stream)           50                           48
aggregate memory    51 (from stream)           33                           54
PCIe                36 (from 1KB pkt tests)    20                           80
FSB data bus        37 (from stream)           9                            311
Table 1: Room for growth on each system component, computed as the percentage increase in measured usage for the 64B packet-forwarding workload that can be accommodated before we hit the achievable performance limits obtained from specially crafted benchmark tests.
upper bounds on the performance of each component
and compare them to the corresponding rates measured
for the 64-byte packet workload at saturation. To quan-
tify our observations, we define, for each component,
the room for growth as the percentage increase in
usage that could be accommodated on the component before we hit the upper bound. For example, for the stream benchmark, we measured 51Gbps of maximum aggregate memory bandwidth; for our 64-byte packet workload, at saturation, we measured 33Gbps of aggregate memory bandwidth; thus, ignoring other bottlenecks (such as the per rank/bank load), there is room to increase memory-bus usage by about 54% ((51 − 33)/33) before hitting the 51Gbps upper bound. Table 1 summarizes our results. We see that, if we can eliminate the problem of poor memory allocation (we discuss potential solutions in Section 5), there is room for a fairly substantial improvement in the minimum-sized packet forwarding rate: approximately 50%. The next section looks for additional sources of inefficiency, this time due to software overheads.
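The room-to-grow figures in Table 1 follow directly from this definition; as a quick check, a few lines of Python reproduce them from the measured rates (all figures in Gbps, taken from the table):

```python
# (achievable limit, measured usage at saturation), in Gbps, from Table 1
components = {
    "1 rank-bank":      (7.2, 7.168),
    "FSB address bus":  (74, 50),
    "aggregate memory": (51, 33),
    "PCIe":             (36, 20),
    "FSB data bus":     (37, 9),
}

def room_to_grow(limit, measured):
    """Percentage increase in usage accommodated before hitting the limit."""
    return int(100 * (limit - measured) / measured)

for name, (limit, measured) in components.items():
    print(f"{name}: {room_to_grow(limit, measured)}%")
```

For example, the aggregate memory row yields int(100 × (51 − 33)/33) = 54%, matching the table.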
3.2 Overhead Analysis
The previous section treated the system as a black box,
measuring the load on each bus but making no attempt
to justify it. We now try to deconstruct the measured
load into its components, as a way to assess the packet-
forwarding efficiency of our system.
First, we adjust the back-of-the-envelope analysis of
Section 2 to our particular experimental setup and use it to estimate the expected load on each bus. In Section 2, we argued that an incoming traffic rate of R bps should roughly lead to loads of 2R, 2R, and 4R on the FSB, PCIe, and memory bus, respectively. These numbers were based
on two assumptions: first, that bus loads are only due to
moving packets around; second, that the CPU reads and
updates each incoming packet, thus contributing to FSB
and memory-bus load. The second assumption does not
hold in our experiments, because we use static routing,
[Figure 14 plot: per-packet overhead ratio vs. packet rate (Mpps), for the FSB and memory bus.]
Figure 14: Memory and FSB per packet overhead.
where the CPU does not even need to read packet headers
to determine the output port. Hence, with the optimistic
reasoning of Section 2, in our experiments, an incoming
traffic rate of R bps should roughly result in loads of 0,
2R, and 2R on the FSB, PCIe, and memory bus, all of
them due to moving packets from NIC to memory and
back.
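As a sketch of this back-of-the-envelope accounting, the helper below returns the estimated FSB, PCIe, and memory-bus loads for an incoming rate R, with a flag for whether the CPU reads and updates each packet. The function name and structure are ours, for illustration only:

```python
def estimated_bus_loads(R, cpu_touches_packets):
    """Optimistic per-bus load estimates (Section 2 reasoning) for input rate R bps.
    Moving each packet NIC->memory->NIC costs 2R on the PCIe bus and 2R on the
    memory bus; if the CPU also reads and updates each packet, that adds 2R on
    the FSB and another 2R on the memory bus."""
    fsb = 2 * R if cpu_touches_packets else 0
    pcie = 2 * R
    mem = 4 * R if cpu_touches_packets else 2 * R
    return fsb, pcie, mem

print(estimated_bus_loads(1, cpu_touches_packets=False))  # static routing: (0, 2, 2)
print(estimated_bus_loads(1, cpu_touches_packets=True))   # general case: (2, 2, 4)
```

With static routing (our experiments), the CPU never touches packet data, so the estimates collapse to 0, 2R, and 2R.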
Not surprisingly, these estimates are below the loads
that we actually measure, indicating that, beyond moving
packets around, all three buses incur an extra per-packet
overhead. We quantify this overhead as the number of
extra per-packet transactions (i.e., transactions that are
not due to moving packets between NIC and memory)
performed on each bus. We compute it as follows:
(measured load − estimated load) / (packet rate × transaction size)
Figure 14 plots this number for the FSB and memory
bus as a function of the packet rate and size; the PCIe
overhead is simply the difference between the other two.
So, the FSB and PCIe overheads start around 6, while
the memory-bus overhead starts around 12; all overheads
slightly drop as the packet rate increases.
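The overhead extraction above amounts to a one-line computation; with synthetic inputs it recovers the roughly 12 extra memory-bus transactions discussed next. The numbers below are illustrative, not measurements:

```python
def extra_transactions(measured_load, estimated_load, packet_rate, txn_size_bits=512):
    """Extra per-packet bus transactions beyond those moving packet data.
    Loads are in bits/s, packet_rate in packets/s; a 64-byte transaction
    is 512 bits."""
    return (measured_load - estimated_load) / (packet_rate * txn_size_bits)

# Illustrative: 1 Mpps of 64B packets; the memory bus carries 12 transactions
# per packet above the 2R estimate for moving packet data alone.
rate = 1e6
estimated = 2 * (64 * 8) * rate           # 2R on the memory bus (static routing)
measured = estimated + 12 * 512 * rate    # synthetic "measured" load
print(extra_transactions(measured, estimated, rate))  # -> 12.0
```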
It turns out that these overheads make sense once we
consider the transactions for book-keeping socket-buffer
descriptors: For each packet transfer from NIC to mem-
ory, there are three such transactions on each of the
FSB and PCIe bus: the NIC updates the corresponding
socket-buffer descriptor, as well as the descriptor ring
(two PCIe and memory-bus transactions); the CPU reads
the updated descriptor, writes a new (empty) descriptor
to memory, and updates the descriptor ring accordingly (three FSB and memory-bus transactions); finally, the
NIC reads the new descriptor (one PCIe and memory-
bus transaction). Each packet transfer from memory to
NIC involves similar transactions and, hence, descriptor
book-keeping accounts for the 6 extra per-packet transactions we measure on the FSB and PCIe bus and, hence, the 12 extra transactions measured on the memory bus.
The slight overhead drop as the packet rate increases is
due to the cache that optimizes the transfer of multiple (up to four) descriptors with each 64-byte transaction (each descriptor is 16 bytes long); this optimization
kicks in more often at higher packet rates.
We should note that these extra per-packet transactions translate into surprisingly high traffic overheads,
especially for small packets: for 1024-byte packets, 12
per-packet transactions on the memory bus translate into
37.5% traffic overhead; for 64-byte packets, this num-
ber becomes 600%. As we discuss in Section 5, these
overheads can be reduced by amortizing descriptors over
multiple packets whenever possible (similar techniques
are already common in high-speed capture cards).
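The traffic-overhead percentages quoted above follow from counting the 12 extra 64-byte transactions against the cost of moving each packet through memory twice (NIC to memory and back); a quick check:

```python
def descriptor_overhead_pct(pkt_size, extra_txns=12, txn_size=64):
    """Descriptor book-keeping traffic on the memory bus, as a percentage of
    the bytes spent moving the packet itself (2x the packet size: once from
    NIC to memory, once from memory to NIC)."""
    return 100 * (extra_txns * txn_size) / (2 * pkt_size)

print(descriptor_overhead_pct(1024))  # -> 37.5
print(descriptor_overhead_pct(64))    # -> 600.0
```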
4 Inferring Server Potential
We now apply our findings from the last two sections to answer the following questions:
• Given our analysis of current-server packet-forwarding performance, what can we improve and what levels of performance are attainable?
• What packet-forwarding performance should we expect from next-generation servers?
We answer these through extrapolative analysis and leave
validation to future work.
4.1 Shared-bus Architectures
In Section 3.1, we saw that the first packet-forwarding
bottleneck arises from inefficient use of the memory
system, in particular, the imbalanced layout of packets
across memory ranks and banks. The question is, how
much could we improve performance by fixing this im-
balance?
According to our overhead analysis (Section 3.2), per-
packet overhead on the memory bus does not increase
with packet rate; hence, if we eliminated the problem-
atic packet layout, we could increase our forwarding rate
until we hit the next bottleneck. According to our bottle-
neck analysis (Section 3.1), that is the FSB address bus,
and we could increase our forwarding rate by 50% before hitting it. Hence, we argue that eliminating the problematic layout could increase our minimum-size-packet forwarding rate by 50%, i.e., from 3.4Gbps to approximately 5.1Gbps.
A second area for improvement, identified in Sec-
tion 3.2, is the use of socket-buffer descriptors and the
chatty manner in which these are maintained. We now
estimate how much we could improve performance by
simply amortizing descriptor transfer across multiple
packet transfers.
We start by considering the memory bus. From Sec-
tion 3.2, we can approximate the load on the memory bus as 2 × bit rate + 10 × packet rate × transaction size. Were we to transfer, say, 10 descriptors with a single transaction, that would immediately reduce the memory-bus load to 2 × bit rate + packet rate × transaction size; for 64-byte packets and 64-byte transactions, this corresponds to a factor-of-4 reduction. Applying a similar line of reasoning to the FSB and PCIe bus, we can show that, for 64-byte packets, descriptor amortization stands to reduce the load on each bus by factors of 10 and 2.5, respectively. Recall from Table 1 that we had 0%, 50% and 80% room for growth on each of the memory, FSB, and PCIe buses; hence, the load on each of these buses could grow by a factor of 4 (4 × 1.0), 15 (10 × 1.5), and 4.5 (2.5 × 1.8), respectively. Since the maximum improvement that can be
accommodated on all buses is by a factor of 4, we argue
that reducing descriptor-related overheads could improve
our minimum-size-packet forwarding rate from 3.4Gbps
to 13.6Gbps.
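The factor-of-4 reduction can be checked numerically. The sketch below evaluates the memory-bus load model from Section 3.2 (roughly 10 descriptor-related transactions per packet) before and after amortizing 10 descriptors per transaction, for 64-byte packets at an illustrative rate:

```python
def mem_load(bit_rate, pkt_rate, descr_txns, txn_size_bits=512):
    """Approximate memory-bus load (Section 3.2 model): moving each packet
    costs 2x the bit rate; descriptor book-keeping costs descr_txns 64-byte
    transactions per packet."""
    return 2 * bit_rate + descr_txns * pkt_rate * txn_size_bits

pkt_rate = 1e6                   # illustrative rate: 1 Mpps of 64B packets
bit_rate = 64 * 8 * pkt_rate
before = mem_load(bit_rate, pkt_rate, descr_txns=10)  # ~10 txns/packet today
after = mem_load(bit_rate, pkt_rate, descr_txns=1)    # 10 descriptors/transaction
print(before / after)  # -> 4.0
```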
Finally, combining both optimizations should allow us
to climb still higher to approximately 20Gbps, though,
of course, the limited number of network slots on our
machine would limit us to 16Gbps.
While the above is an extrapolation (albeit one de-
rived from empirical observations), it nonetheless points
to the tremendous untapped potential of current servers.
Even if our estimates are off by a factor of two, it still seems possible that current servers can achieve forwarding rates of 10Gbps, a number currently considered the realm of specialized (and expensive) equipment. We
close by noting that the suggested fixes involve only
modest operating-system rearchitecting.
4.2 Point-to-point Architectures
We now apply the results of our measurement study to es-
timate the packet forwarding rates that might be achiev-
able with point-to-point (p2p) server architectures as in-
troduced in Section 2 (Figure 2). At times, we make as-
sumptions where the necessary details of p2p architectures aren't yet known and, in such cases, we explicitly note our assumptions as such.
In a p2p architecture, the role of the FSB (carrying traffic between sockets and between CPUs and their non-local memory) is played by point-to-point links such as Intel's QuickPath [5]. In our analysis, we
assume our findings from the FSB apply to these inter-
socket buses. Specifically, we assume that the 50% room-
to-grow that we measured on the FSB applies to these
inter-socket buses. If anything, this seems like a wildly
conservative assumption for two reasons: (1) the nomi-
nal speed of these inter-socket links is 200Gbps [2], com-
pared to 85Gbps for current FSBs and (2) the operations
seen on the single FSB are now spread across six inter-
socket links.
We compute the expected performance for the p2p
architecture by considering the different factors that
will offer a performance improvement relative to the shared-bus server we've studied so far. These factors are:
(1) reduced per-bus overheads: This improvement
results simply from the transition from a shared-bus to a point-to-point architecture, as discussed in Section 2.
These overheads and the corresponding reduction are
summarized in the first three columns of Table 2.6
(2) room-to-grow: As before, this records the capacity
for growth on each bus. For this we use our findings
from Section 3.1.
(3) technology improvements: This accounts for the
standard technology improvements expected in this next-
generation of servers. We assume a 2x improvement on
the FSB and PCIe buses by observing that (for exam-
ple) the Intel QuickPath inter-socket links for use in the Nehalem server line support speeds that are over 2x faster than current FSBs. Likewise, PCIe-2.0 runs 2x
faster than current PCIe-1.1 and the recently announced
PCIe-3.0 is to run at 2x the speed of PCIe-2.0 [20] (our
test server uses PCIe-1.1). We conservatively assume that
memory technology will not improve.
Table 2 summarizes these performance factors and
computes the combined performance improvement that we can expect on each system component. As we see,
the overall performance improvement is still limited by
memory (both because we're assuming the rank-bank
6Note that, while Section 3.2 revealed that the overheads we see in practice are far higher than those from our analysis, we're assuming that the relative reduction across architectures will still hold. This appears reasonable since this reduction is entirely due to the offered load being split across more system components: 6 vs. 1 inter-socket bus, 4 vs. 1 memory bus, and 4 vs. 1 PCIe bus.
Figure 15: Forwarding rates for shared-bus and p2p
server architectures with and without different optimizations.
imbalance problem remains and that memory technol-
ogy improves more slowly). Despite this, we're left with a 4x improvement, suggesting that a next-generation p2p server running unmodified Linux+Click will scale to approximately 13.6Gbps. The additional use of the optimizations described above could further improve performance to potentially exceed 40Gbps.
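The bottom-line gain in Table 2 is simply the product of the per-bus factors, bounded by the least-improved component; reproducing it:

```python
# Per-bus improvement factors from Table 2:
# (gain from reduced overheads, room-to-grow, gain from technology trends)
factors = {
    "memory":  (4.0, 1.0, 1.0),
    "FSB/CSI": (3.0, 1.5, 2.0),
    "PCIe":    (4.0, 1.8, 2.0),
}

gains = {bus: a * b * c for bus, (a, b, c) in factors.items()}
overall = min(gains.values())       # the least-improved bus caps the system
print(gains)                        # memory 4x, FSB/CSI 9x, PCIe 14.4x
print(overall * 3.4)                # memory caps the gain at 4x -> ~13.6Gbps
```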
Figure 15 summarizes the various forwarding rates for
the different architectures and optimizations considered.
In summary, current shared-bus servers scale to (min-
sized) packet forwarding rates of 3.4Gbps and we esti-
mate future p2p servers will scale to 10Gbps. Moreover, our analysis suggests that modifications to eliminate key
bottlenecks and overheads stand to improve these rates
to over 10Gbps and 40Gbps respectively.
5 Recommendations and Discussion
5.1 Eliminating the Bottlenecks
We believe the bottlenecks and overheads identified in
the previous sections can be addressed through relatively
modest changes to operating systems and NIC firmware. Unfortunately, the need to modify NIC firmware makes it
difficult to experiment with these changes. We describe
these modifications at a high level and note that these
modifications do not impact the programmability of the
system.
Improved memory allocators. Recall that our results
in Section 3.1 suggest that the imbalance in memory ac-
cesses with regard to (skb) packet buffers in the kernel
system bus    shared-bus overheads    p2p overheads    gain from reduced    room-to-grow    gain w/ tech    overall
              (Section 2)             (Section 2)      overheads            (Table 1)       trends          gain
memory        4R                      R                4x                   1.0x            1.0x            4x
FSB/CSI       2R                      2R/3             3x                   1.5x            2x              9x
PCIe          2R                      R/2              4x                   1.8x            2x              14.4x
Table 2: Computing the performance improvement with a p2p server architecture. R denotes the line rate in bits/second.
occurs because the kernel allocates all packet buffers to
be a single size with a default of 2KB. This problem can
be addressed by simply creating packet buffers of various
sizes (e.g., 64B, 256B, 1024B and 2048B) and allocating
a packet to the buffer appropriate for its size. This can
be implemented by simply creating multiple descriptor
rings, one for each buffer size; on receiving an incom-
ing packet, the NIC simply uses the descriptor ring ap-
propriate to the size of the received packet. While more
wasteful of system memory, this isn't an issue since the
memory requirements for a router workload are a small
fraction of the available server memory. This approach is
in fact inspired by similar approaches in hardware routers
that pre-divide memory space into separate regions for
use by packets of different sizes [13].
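A minimal sketch of the multiple-pool idea follows; the pool sizes and the selection rule are our illustrative choices, not a specification of any particular driver:

```python
POOL_SIZES = (64, 256, 1024, 2048)  # one descriptor ring per buffer size

def pick_pool(pkt_len):
    """Return the smallest buffer pool that fits the packet, mimicking a NIC
    that steers each received packet to the descriptor ring matching its
    size class instead of placing everything in 2KB buffers."""
    for size in POOL_SIZES:
        if pkt_len <= size:
            return size
    raise ValueError("packet exceeds maximum buffer size")

print(pick_pool(64), pick_pool(65), pick_pool(1500))  # -> 64 256 2048
```

Under this scheme, a flood of 64-byte packets would occupy 64-byte buffers densely packed across the rank/bank grid, rather than touching only the first line of each 2KB allocation.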
The imbalance due to packet descriptors can be like-
wise tackled by arranging for packet descriptors to con-
sume a greater portion of the memory space by, for ex-
ample, using larger descriptor rings and/or multiple de-
scriptor rings. Conveniently however, the use of amor-
tized packet descriptors as described below would also
have the effect of greatly reducing the descriptor-related traffic to memory; hence, implementing amortized descriptors might suffice to mitigate this problem.
Amortizing packet descriptors. Section 3.2 reveals that handling packet descriptors imposes an inordinate per-packet overhead, particularly for small packet sizes. As alluded to earlier, a simple strategy is to have a single descriptor summarize multiple packets (up to a parameter k). This amortization is similar to what is already implemented on capture cards designed for specialized monitoring equipment. Such amortization is easily accommodated for k smaller than the amount of packet-buffer memory already on the NIC. Since we imagine that k can be a fairly small number (around 10), and since current NICs already have buffer capacity for a fair number of packets (e.g., our cards have room for 64 full-sized packets), such amortization should not increase the storage requirements on NICs. Amortization can, however, impose increased delay. This can be controlled by a timeout that regulates the maximum time period the NIC can wait to transfer packets. Setting this timeout to a small multiple (e.g., 2k times) of the reception time for small packets should be an acceptable delay penalty.
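The amortization-plus-timeout logic can be sketched as follows; the class and parameter names are ours, and a real implementation would live in NIC firmware and the device driver rather than host Python:

```python
class AmortizedDescriptorRing:
    """Batch up to k packets per descriptor transfer; a timeout bounds the
    extra latency a packet can accrue while waiting for the batch to fill."""
    def __init__(self, k=10, timeout=2.0):
        self.k, self.timeout = k, timeout
        self.pending, self.oldest = [], None
        self.flushed = []                  # batches handed to the host

    def receive(self, pkt, now):
        if not self.pending:
            self.oldest = now              # start the clock on a new batch
        self.pending.append(pkt)
        if len(self.pending) >= self.k:
            self._flush()                  # full batch: one descriptor transfer

    def tick(self, now):
        """Called periodically: flush a partial batch once the oldest packet
        has waited longer than the timeout."""
        if self.pending and now - self.oldest >= self.timeout:
            self._flush()

    def _flush(self):
        self.flushed.append(self.pending)
        self.pending = []

ring = AmortizedDescriptorRing(k=3, timeout=2.0)
for t in range(3):
    ring.receive(f"pkt{t}", now=t)      # third packet fills the batch
ring.receive("pkt3", now=3)
ring.tick(now=6)                        # timeout flushes the partial batch
print([len(b) for b in ring.flushed])   # -> [3, 1]
```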
5.2 Discussion
When we set out to study the forwarding performance
of commodity servers, we already expected the memory system to be the bottleneck; the fact, however, that
the bottleneck was due to an unfortunate combination of
packet layout and memory-chip organization came as a
surprise. While trying to figure this out, we looked at how
the kernel allocates memory for its structures; not sur-
prisingly, it favors adjacent memory addresses to lever-
age caching. However, given that the kernel uses physical
addresses, nearby addresses often correspond to physi-
cally nearby locations that fall on the same memory rank
and bank. As a result, workloads that cannot benefit from
caching may end up hitting the same memory rank/bank
pairs and, hence, be unable to benefit from aggregate
memory bandwidth either. In short, when combined with an unfortunate data layout, locality can hurt rather than
help.
Another surprise was the lack of literature on the be-
havior and performance of system components outside
the CPUs. The increasing processor speeds and the rise
of multi-processor systems mean that, from now on,
processing data is less likely to be the bottleneck than
moving it around between CPUs and other I/O devices.
Hence, it is important to be able to measure and under-
stand system performance beyond the CPUs.
Finally, we were surprised by the lack of efficiency
in moving data between system components. In many cases, data is unnecessarily transferred to memory (contributing to memory-bus load) when it could be directly
transferred from the NIC to the appropriate CPU cache.
Packet forwarding and processing workloads would ben-
efit significantly from techniques along the lines of Di-
rect Cache Access (DCA), where the memory controller
directly places incoming packets into the right CPU
cache by snooping the DMA transfer from NIC to mem-
ory [17].
6 Related and Future Work
The idea of a software router based on general-purpose hardware and operating systems is not a new one. In fact, the 13 NSFNET NSS (nodal switching subsystems) included 9 systems running Berkeley UNIX interconnected with a 4Mb/s IBM token ring. Click [18] and Scout [22] explored the question of how to architect router software
for improved programmability and extensibility; SMP
Click [12] extends the early Click architecture to better
exploit the performance potential of multiprocessor PCs.
These efforts focused primarily on designing the soft-
ware architecture for packet processing and, while they
do report on the performance of their systems, this is at
a fairly high level, using purely black-box evaluation. By contrast, our work assumes Click's software architecture
but delves under the covers (of both hardware and soft-
ware) to understand why performance is limited and how
these limitations carry over to future server architectures.
As a slight digression: it is somewhat interesting to note the role of time in the performance of these (fairly similar) software routers. The early NSF nodes achieved forwarding rates of 1K packets/sec (circa 1986); Click (at SOSP'99) reported a maximum forwarding rate of 330Kpps, which SMP Click improved to 494Kpps (2001); we find unmodified Click achieves about 6.5Mpps. This is of course somewhat anecdotal, since we're not necessarily comparing the same Click configurations, but it nonetheless suggests the general trajectory.
There is an extensive body of work on benchmarking
various application workloads on general-purpose pro-
cessors. The vast majority of this work is in the context
of computation-centric workloads and benchmarks such
as TPC-C. Closer to our interest in packet processing are efforts similar to those of Veal et al. [24] that look
for the bottlenecks in server-like workloads that involve
a fair load of TCP termination. Their analysis reveals
that such workloads are bottlenecked on the FSB address
bus. A similar conclusion has been arrived at for several,
more traditional, workloads. (We refer the reader to [24]
for additional references to the literature on such evalu-
ations.) As our results indicate, the bottleneck to packet
processing lies elsewhere.
There is similarly a large body of work on packet
processing using specialized hardware (e.g., see [19] and the references therein). Most recently, Turner et
al. describe a Supercharged Planetlab Platform [23]
for high-performance overlays that combines general-
purpose servers with network processors (for slow and
fast path processing respectively); they achieve forward-
ing rates of up to 5Gbps for 130B packets. We focus in-
stead on general-purpose processors and our results sug-
gest that these offer performance that is competitive.
Closest to our work is a recent independent effort by
Egi et al. [15]. Motivated by the goal of building high-
performance virtualized routers on commodity hardware,
the authors undertake a measurement study to under-
stand the performance limitations of modern PCs. They
observe similar performance and, like us, arrive at the
conclusion that something is amiss in the memory system. Through inference based on black-box testing, the authors suggest that non-contiguous memory writes initiated by the PCIe controller are the likely culprit. Our access to chipset tools allows us to probe the internals of
the memory system and our findings there lead us to a
somewhat different conclusion.
Finally, our work also builds on a recent position paper
making the case for cluster-based software routers [11];
the paper identifies the need to scale servers to line rate
but doesn't explore the issue of bottlenecks and performance in any detail.
In terms of future work, we plan to extend our work
along three main directions. First, we're exploring the possibility of implementing the modified descriptor and buffer-allocator schemes described above. Second, we hope to repeat our analysis on the Nehalem server platforms once available [8]. Finally, we're currently working to build a cluster-based router prototype as described in earlier work [11] and hope to leverage our findings
here to both evaluate and improve our prototype.
7 Conclusion
A long-held and widespread perception has been that
general-purpose processors are incapable of high-speed
packet forwarding, motivating an entire industry around the development of specialized (and often expensive)
network equipment. Likewise, the barrier to scalability
has been variously attributed to limitations in I/O, mem-
ory throughput and various other factors. While these no-
tions might each have been true at various points in time,
modern PC technology evolves rapidly and hence it is
important that we calibrate our perceptions by the current
state of technology. In this paper, we revisit old questions
about the scalability of in-software packet processing in
the context of current and emerging off-the-shelf server
technology. Another, perhaps more important, contribu-
tion of our work is to offer concrete data on questions that
have often been answered through anecdotal or indirect experience.
Our results suggest that, particularly with a little care, modern server platforms do in fact hold the potential to scale to the high rates typically associated with
specialized network equipment and that emerging tech-
nology trends (multicore, NUMA-like memory architec-
tures, etc.) should only further improve this scalability.
We hope that our results, taken together with the grow-
ing need for more flexible network infrastructure, will
spur further exploration into the role of commodity PC
hardware/software in building future networks.
References
[1] Intel 10 Gigabit XF SR Server Adapters. http:
//www.intel.com/network/connectivity/products/10gbexfsrserveradapter.htm .
[2] Intel QuickPath Interconnect. http://en.wikipedia.
org/wiki/Intel_QuickPath_Interconnect .
[3] Intel Xeon Processor 5000 Sequence. http://www.intel.
com/products/processor/xeon5000 .
[4] NetFPGA. http://yuba.stanford.edu/NetFPGA/ .
[5] Next-Generation Intel Microarchitecture. http://www.
intel.com/technology/architecture-silicon/
next-gen.
[6] Vyatta: Open Source Networking. http://www.vyatta.
com/products/.
[7] Cisco Opening Up IOS. http://www.networkworld.
com/news/2007/121207-cisco-ios.html , Dec. 2007.
[8] Intel Demonstrates Industry's First 32nm Chip and Next-
Generation Nehalem Microprocessor Architecture. Intel
News Release., Sept. 2007. http://www.intel.com/
pressroom/archive/releases/20070918corp_a.
htm.
[9] Juniper Open IP Solution Development Program. http://www.juniper.net/company/presscenter/pr/2007/pr-071210.html, 2007.
[10] Intel Corporation's Multicore Architecture Briefing, Mar.
2008. http://www.intel.com/pressroom/archive/
releases/20080317fact.htm .
[11] K. Argyraki et al. Can software routers scale? In ACM Sig-
comm Workshop on Programmable Routers for Extensible Ser-
vices, Aug. 2008.
[12] B. Chen and R. Morris. Flexible control of parallelism in a multiprocessor PC router. In Proc. of the USENIX Technical Conference,
June 2001.
[13] Cisco Systems, Inc. Introduction to Cisco IOS Software. http:
//www.ciscopress.com/articles/ .
[14] D. Comer. Network Processors. http://www.cisco.com/
web/about/ac123/ac147/archived_issues/ipj_
7-4/network_processors.html .
[15] N. Egi, A. Greenhalgh, M. Handley, M. Hoerdt, and L. Mathy.
Towards performant virtual routers on commodity hardware.
Technical Report Research Note RN/08/XX, University College
London, Lancaster University, May 2008.
[16] R. Ennals, R. Sharp, and A. Mycroft. Task partitioning for multi-
core network processors. In Proc. of International Conference on
Compiler Construction, 2005.
[17] R. Huggahalli, R. Iyer, and S. Tetrick. Direct Cache Access for
High Bandwidth Network I/O. In Proc. of ISCA, 2005.
[18] E. Kohler, R. Morris, B. Chen, J. Jannotti, and F. Kaashoek. The
click modular router. ACM Transactions on Computer Systems,
18(3):263-297, Aug. 2000.
[19] J. Mudigonda, H. Vin, and S. W. Keckler. Reconciling perfor-
mance and programmability in networking systems. In Proc. of SIGCOMM, 2007.
[20] PCI-SIG. PCI Express Base 2.0 Specification, 2007.
http://www.pcisig.com/specifications/pciexpress/base2.
[21] ACM Sigcomm Workshop on Programmable Routers for Extensible Services. http://www.sigcomm.org/sigcomm2008/workshops/presto/, 2008.
[22] T. Spalink, S. Karlin, L. Peterson, and Y. Gottlieb. Building a
Robust Software-Based Router Using Network Processors. In
Proc. of the 18th ACM SOSP, 2001.
[23] J. Turner et al. Supercharging PlanetLab: a high performance, multi-application, overlay network platform. In Proc. of SIG-
COMM, 2007.
[24] B. Veal and A. Foong. Performance scalability of a multi-core
web server. In Proc. of ACM ANCS, Dec. 2007.