Open Packet Processor: a programmable architecture for wire speed platform-independent stateful in-network processing

Giuseppe Bianchi, Marco Bonola, Salvatore Pontarelli, Davide Sanvito, Antonio Capone, Carmelo Cascone
CNIT / University of Rome Tor Vergata / Politecnico di Milano
ABSTRACT
This paper aims at contributing to the ongoing debate on how to bring programmability of stateful packet processing tasks inside the network switches, while retaining platform independence. Our proposed approach, named "Open Packet Processor" (OPP), shows the viability (via a hardware prototype relying on commodity HW technologies and operating in a strictly bounded number of clock cycles) of eXtended Finite State Machines (XFSM) as a low-level data plane programming abstraction. With the help of examples, including a token bucket and a C4.5 traffic classifier based on a binary tree, we show the ability of OPP to support stateful operation and flow-level feature tracking. Platform independence is accomplished by decoupling the implementation of hardware primitives (registers, conditions, update instructions, forwarding actions, matching facilities) from their usage by an application formally described via an abstract XFSM. We finally discuss limitations and extensions.
1. INTRODUCTION
To face the emerging needs for service flexibility, network efficiency, traffic diversification, and network security and reliability, today's network nodes are called to perform more flexible and richer packet processing. The original Internet nodes, historically limited to switches and routers providing "just" plain forwarding services, have been massively complemented with a variety of heterogeneous middlebox-type functions [1, 2, 3, 4] such as network address translation, tunneling, load balancing, monitoring, intrusion detection, and so on. The diversification of network equipment and technologies has definitely provided an increased availability of network functionalities, but at the cost of significant extra complexity in the control and management of large-scale multi-vendor networks.
Software-Defined Networking (SDN) emerged as an attempt to address this problem. Coined in 2009 [5] as a direct follow-up of the OpenFlow proposal [6], SDN has broadly evolved since then [7] and does not in principle restrict itself to OpenFlow (a "minor piece in the SDN architecture", according to the OpenFlow inventors themselves [8]) as device-level abstraction. Nevertheless, most of the high-level network programming abstractions proposed in the last half a dozen years [9, 10, 11, 12, 13, 14, 15] still rely on OpenFlow as southbound (using RFC 7426's terminology) programming interface. Indeed, OpenFlow was designed with the desire for rapid adoption, as opposed to first principles [7]; i.e., as a pragmatic attempt to address the dichotomy between i) flexibility and ability to support a broad range of innovation, and ii) compatibility with commodity hardware and vendors' need for closed platforms [6].
The aftermath is that most of the above-mentioned network programming frameworks circumvent OpenFlow's limitations by promoting a "two-tiered" [16] programming model: any stateful processing intelligence of the network applications is delegated to the network controller, whereas OpenFlow switches are limited to installing and enforcing stateless packet forwarding rules delivered by the remote controller. Centralization of the network applications' intelligence may not be a problem (and actually turns out to be an advantage) whenever changes in the forwarding states do not have strict real-time requirements and depend upon global network states. But for applications which rely only on local flow/port states, the latency toll imposed by the reliance on an external controller rules out the possibility of enforcing software-implemented control plane tasks at wire speed, i.e. while remaining on the fast path¹.
One might argue that we do not even need such ultra-fast processing and packet-by-packet manipulation and control capabilities. However, not only the large real-world deployment of proprietary hardware network appliances (e.g. for traffic classification, control/balancing, monitoring, etc.), but also the evolution of the OpenFlow specification itself shows that this may not be the
¹ A 64-byte packet takes about 5 ns at 100 Gbps speed, roughly the time needed for a signal to reach a control entity placed one meter away. And the execution of even a simple software-implemented control task may take way more time than this. Thus, even the physical, capillary distribution of control agents (as proxies of the remote SDN controller for low latency tasks) on each network device would hardly meet fast path requirements.
arXiv:1605.01977v1 [cs.NI] 6 May 2016
case. As a matter of fact, since the creation of the Open Networking Foundation (ONF) in 2011, and up to the latest (version 1.5) specification, we have witnessed a hectic evolution of the OpenFlow standard, with several OpenFlow extensions devised to fix punctual shortcomings and accommodate specific needs, by incorporating extremely specific stateful primitives (such as meters for rate control, group ports for fast failover support or dynamic selection of one among many action buckets at each time, e.g. for load balancing, synchronized tables for supporting learning-type functionalities, etc.).
Indeed, in the last couple of years, a new research trend has started to target improved programmability of the data plane, beyond the elementary "match/action" abstraction provided by OpenFlow, and (even more recently) initial work on higher-level network programming frameworks devised to exploit such newer and more capable lower-level primitives is starting to emerge [17, 16]. Proposals such as POF [18, 19], although not yet targeting stateful flow processing, do significantly improve header matching flexibility and programmability, freeing it from any specific structure of the packet header. Programmability of the packet scheduler inside the switch has been recently addressed in [20]. Works such as OpenState [21, 22] and FAST [23] explicitly add support for per-flow state handling inside OpenFlow switches, although the abstractions therein defined are still simplistic and severely limit the type of applications that can be deployed (for instance, OpenState supports only a special type of Finite State Machines, namely Mealy Machines, which do not provide the programmer with the possibility to declare and use her own memory or registers). The P4 programming language [24, 25] leverages more advanced hardware technology, namely dedicated processing architectures [26] or Reconfigurable Match Tables [27] as an extension of TCAMs (Ternary Content Addressable Memories), to permit significantly improved programmability in the packet processing pipeline. In its latest 1.0.2 language specification [28], P4 has made a further crucial step in improving stateful processing, by introducing registers defined as "stateful memories [which] can be used in a more general way to keep state". However, the P4 language does not specify how registers should be scalably supported and managed by the underlying HW.
Contribution
This work is an attempt to revisit fast-path programmability, by (i) proposing a programming abstraction which retains the platform-independent features of the original "match/action" OpenFlow abstraction, and by (ii) showing how our abstraction can be directly "executed" over a HW architecture (whose feasibility is concretely shown via a HW FPGA prototype). In analogy with OpenFlow's "match/action" abstraction, which exposes a network node's TCAM to third-party programmability, our abstraction also directly refers to the HW interface, and as such it can be directly exposed to the programmer as a machine-level "configuration" interface, hence without any intermediary compilation or adaptation to the target (i.e., unlike the case of P4).

Figure 1: a) A typical OpenFlow pipeline architecture. b) The OPP-enabled pipeline. OPP "stages" can be pipelined with other OPP stages or ordinary OpenFlow Match/Action stages.
In conceiving our abstraction, we have been largely inspired by [21], where eXtended Finite State Machines [29] (therein referred to as "full" XFSM) were conjectured as a possible forward-looking abstraction. Our key difference with respect to [21] is that we do not limit ourselves to postulating that such "full" XFSMs may ultimately be a suitable abstraction, but we concretely show their viability and their "executability" over a HW architecture leveraging commodity HW (standard TCAMs, hash tables, ALUs, and somewhat trivial additional circuitry), and with a strictly bounded number of clock ticks.
A limitation of this paper is our focus on a "single" packet processing stage, as opposed to a more general packet processing pipeline comprising multiple match/action tables. In essence, our work shows the viability of an XFSM-based abstraction as a (significant) generalization of the original single-table OpenFlow match/action. While multiple pipelined instances of our atomic Open Packet Processor stages are clearly possible, exactly as multiple match/action tables can be pipelined since OpenFlow version 1.1 (see Figure 1), our present work does not yet take advantage of HW pipeline optimizations such as Reconfigurable Match Tables [27].
Finally, and similarly to OpenFlow's original design philosophy, even if our proposed architecture is pragmatically limited by the specific set of primitives implemented in the HW (supported packet processing and forwarding actions, matching facilities, arithmetic and logic operations on register values, etc.), it nevertheless remains extensible (by adding new actions or instructions) and largely expressive in terms of how the programmer shall use and combine such primitives within a desired stateful operation. As will hopefully be apparent later on, a "full" XFSM permits to formally describe a wide variety of programmable packet processing and control tasks, which our architecture permits to directly convey and deploy inside the switch.
And even if probably not of much practical interest, the fact that not only the OpenFlow legacy statistics, but also further tailored stateful extensions today integrated in the OpenFlow standard (hence hardcoded in the switch), might be externally programmed using an apparently viable platform-agnostic abstraction merits further consideration (see discussion in section 5).
2. CONCEPT
As anticipated in the previous section, our work focuses on the design of a single Open Packet Processor (OPP) stage, as a significant generalization of the traditional OpenFlow Match/Action abstraction. More specifically, our goal is to provide a packet processing stage which holds the following properties.

Ability to process packets directly on the fast path, i.e., while the packet is traveling in the pipeline (nanoseconds time scale). The requirement of performing packet processing tasks in a deterministic and small (bounded) number of HW clock cycles hardly copes with the possibility of employing a standard CPU (and the relevant programming language), and requires us to implement a domain-specific (traffic/network) computing architecture from scratch.
Efficient storage and management of per-flow stateful information. Other than parsing packet header information and exposing such fields to a match/action stage (or a pipeline of match/action stages [27, 26]), we also aim at permitting the programmer to further use the past flow history for defining a desired per-packet processing behaviour. As shown in section 2.2, this can easily be accomplished by pre-pending a dedicated storage table (concretely, a hash table) that permits to retrieve, in O(1) time, stateful flow information. We name this structure the Flow Context Table, as, in somewhat of an analogy with context switching in ordinary operating systems, it permits to retrieve stateful information associated to the flow to which an arriving packet belongs, and store back an updated context at the end of the packet processing pipeline. Such (flow) context switching will operate at wire speed, on a packet-by-packet basis.
Ability to specify and compute a wide (and programmable) class of stateful information, thus including counters, running averages, and in most generality stateful features useful in traffic control applications. It readily follows that the packet processing pipeline, which in standard OpenFlow is limited to match/action primitives, must be enriched with means to describe and (on the fly) enforce conditions on stateful quantities (e.g. the flow rate is above a threshold, or the time elapsed since the last seen packet is greater than the average inter-arrival time), as well as provide arithmetic/logic operations so as to update such stateful features in a bounded number of clock cycles (ideally one).
Platform independence. A key pragmatic insight in the original OpenFlow abstraction was the decision of restricting the OpenFlow switch programmer's ability to just selecting actions among a finite set of supported ones (as opposed to permitting the programmer to develop her own custom actions), and associating a desired action set (bundle) to a specific packet header match. We conceptually follow a similar approach, but we cast it into a more elaborate eXtended Finite State Machine (XFSM) model. As described in section 2.1, an XFSM abstraction permits us to formalize complex behavioral models, involving custom per-flow states, custom per-flow registers, conditions, state transitions, and arithmetic and logic operations. Still, an XFSM model does not require us to know how such primitives are concretely implemented in the hardware platform, but "just" permits us to combine them together so as to formalize a desired behaviour. Hence, it can be ported across platforms which support a same set of primitives.
2.1 XFSM abstraction
The OpenFlow "Match/action" abstraction has been widely extended throughout the various standardization steps, with the extension of the match fields (including the possibility to perform matches on metadata), with new actions (and instructions), and with the ability to associate a set of actions to a given match. Nevertheless, the basic abstraction conceptually remains the same. It is instructive to formally re-interpret the (basic) OpenFlow match/action abstraction as a "map" T : I → O, where I = {i1, . . . , iM} is a finite set of Input Symbols, namely all the possible matches which are technically supported by an OpenFlow specification (it being irrelevant, at least for this discussion, to know how such Input Symbols' set I is established, and that each input symbol is a Cartesian combination of all possible header field matches), and O = {o1, . . . , oK} is a finite set of Output Symbols, i.e. all the possible actions supported by an OpenFlow switch. The obvious limit of this abstraction is that the match/action mapping is statically configured, and can change only upon the controller's intervention (e.g. via flow-mod OpenFlow commands). Finally, note that the "engine" which performs the actual mapping T : I → O is a standard TCAM.
As observed in [21], an OpenFlow switch can be trivially extended to support a more general abstraction which takes the form of a Mealy Machine, i.e. a Finite State Machine with output, which permits to formally model dynamic forwarding behaviors, i.e. permits to change over time the specific action(s) associated to a same match. It suffices to add a further finite set S = {s1, s2, . . . , sN} of programmer-specific states, and use the TCAM to perform the mapping T : S × I → S × O. While remaining feasible on ordinary OpenFlow hardware, such a Mealy Machine abstraction brings about two
XFSM formal notation and meaning:
I (input symbols): all possible matches on packet header fields
O (output symbols): OpenFlow-type actions
S (custom states): application-specific states, defined by the programmer
D (n-dimensional linear space D1 × · · · × Dn): all possible settings of n memory registers; includes both custom per-flow and global switch registers (see text)
F (set of enabling functions fi : D → {0, 1}): conditions (boolean predicates) on registers
U (set of update functions ui : D → D): applicable operations for updating registers' content
T (transition relation T : S × F × I → S × U × O): target state, actions and register update commands associated to each transition

Table 1: eXtended Finite State Machine model
key differences with respect to the original OpenFlow abstraction. First, the (output) action associated to a very same (input) match may now differ depending on an (input) state si ∈ S, i.e., the state in which the flow is found when a packet is being processed. Second, the Mealy Machine permits to specify which, possibly different, (output) state so ∈ S the flow shall enter once the packet has been processed. While quite interesting, this generalization appears still insufficient to permit the programmer to implement meaningful applications, as it lacks the ability to compute at runtime, and exploit in the forwarding decisions, per-flow features commonly used in traffic control algorithms.
The goal of this paper is to show that a switch architecture can be further easily extended (section 2.2) to support an even more general Finite State Machine model, namely the eXtended Finite State Machine (XFSM) model introduced in [29]. As summarized in Table 1, this model is formally specified by means of a 7-tuple M = (I, O, S, D, F, U, T). Input Symbols I (OpenFlow-type matches) and Output Symbols O (actions) are the same as in OpenFlow. Per-application states S are inherited from the Mealy Machine abstraction [21], and permit the programmer to freely specify the possible states in which a flow can be, in relation to her desired custom application (technically, a state label is handled as a bit string). For instance, in a heavy hitter detection application, a programmer can specify states such as NORMAL, MILD, or HEAVY, whereas in a load balancing application, the state can be the actual switch output port number (or the destination IP address) an already seen flow has been pinned to, or DEFAULT for newly arriving flows or flows that can be rerouted. With respect to a Mealy Machine, the key advantage of the XFSM model resides in the additional programming flexibility in three fundamental aspects.
(1) Custom (per-flow) registers and global (switch-level) parameters. The XFSM model permits the programmer to explicitly define her own registers, by providing an array of per-flow variables whose content (time stamps, counters, average values, last TCP/ACK sequence number seen, etc.) shall be decided by the programmer herself. Additionally, it is useful to expose to the programmer (as further registers) also switch-level states (such as the switch queues' status) or "global" shared variables which all flows can access. Albeit practically very important, a detailed distinction into different register types is not foundational in terms of abstraction, and therefore all registers that the programmer can access (and eventually update) are summarized in the XFSM model presented in Table 1 via the array D of memory registers.
(2) Custom conditions on registers and switch parameters. The sheer majority of traffic control applications rely on comparisons, which permit to determine whether a counter exceeded some threshold, or whether some amount of time has elapsed since the last seen packet of a flow (or the first packet of the flow, i.e., the flow duration). The enabling functions fi : D → {0, 1} serve exactly this purpose, by implementing a set of (programmable) boolean comparators, namely conditions whose input can be decided by the programmer, and whose output is 1 or 0, depending on whether the condition is true or false. In turn, the outcome of such comparisons can be exploited in the transition relation, i.e. a state transition can be triggered only if a programmer-specific condition is satisfied.
(3) Registers' updates. Along with the state transition, the XFSM model also permits the programmer to update the content of the deployed registers. As we will show later on, registers' updates require the HW to implement a set of update functions ui : D → D, namely arithmetic and logic primitives which must be provided in the HW pipeline, and whose input and output data shall be configured by the programmer.
Finally, we stress that the actual computational step in an XFSM is the transition relation T : S × F × I → S × U × O, which is nothing else than a "map" (albeit with more complex inputs and outputs than the basic OpenFlow map), and hence is naturally implemented by the switch TCAM, as shown in the next section 2.2.
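As a toy software rendering of this transition relation, the following sketch (our own illustration; the state names, the single condition and the rule set are made up) first evaluates the enabling functions F, then performs a priority-ordered ternary lookup, which is precisely the role the TCAM plays:

```python
# One XFSM execution step: T : S x F x I -> S x U x O.
# 'rules' are priority-ordered like TCAM rows; '*' is a don't-care.
def xfsm_step(state, regs, match, conditions, rules):
    cond = tuple(int(f(regs)) for f in conditions)    # enabling functions F
    for (s, cv, m), (s_next, update, action) in rules:
        if s == state and m in ("*", match) and all(
                c == "*" or c == b for c, b in zip(cv, cond)):
            return s_next, update(regs), action       # U and O
    return state, regs, "forward"                     # no matching row

# Toy heavy-hitter-style program: after 4 packets the flow goes HEAVY.
conditions = [lambda r: r["R0"] >= 4]                 # f0 : D -> {0,1}
inc = lambda r: {**r, "R0": r["R0"] + 1}              # u0 : D -> D
rules = [
    (("NORMAL", (1,), "*"),   ("HEAVY",  inc, "drop")),
    (("NORMAL", ("*",), "*"), ("NORMAL", inc, "forward")),
    (("HEAVY",  ("*",), "*"), ("HEAVY",  inc, "drop")),
]

state, regs = "NORMAL", {"R0": 0}
for _ in range(6):
    state, regs, action = xfsm_step(state, regs, "pkt", conditions, rules)
```

The don't-care entry in the second rule is what lets a condition bit be ignored in states where it is irrelevant, exactly as a TCAM wildcard does.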
2.2 OPP architecture
In our view, what makes the previously described XFSM abstraction compelling is the fact that it can be directly executed on the switch's fast path using off-the-shelf HW, as we will prove in section 3 with a concrete HW prototype. As discussed in the next section, practical restrictions of course emerge in terms of memory deployed for the registers, as well as the capability of the ALUs used for register updates, but such restrictions are mostly related to an actual implementation, rather than to the design, which remains at least in principle very
Figure 2: OPP architecture
general and flexible. A sketch of the proposed Open Packet Processor architecture is illustrated in Figure 2. The packet processing workflow is best explained by means of the following stages.
Stage 1: flow context lookup. Once a packet enters an OPP processing block, the first task is to extract, from the packet, a Flow Identification Key (FK), which identifies the entity to which a state may be assigned. The flow is identified by a unique key composed of a subset of the information stored in the packet header. The desired FK is configured by the programmer (an IP address, a source/destination MAC pair, a 5-tuple flow identifier, etc.) and depends on the specific application. The FK is used as index to look up a Flow Context Table, which stores the flow context, expressed in terms of (i) the state label si currently associated to the flow, and (ii) an array ~R = {R0, R1, ..., Rk} of (up to) k + 1 registers defined by the programmer. The retrieved flow context is then appended to the packet as metadata and the packet is forwarded to the next stage.
Stage 2: conditions' evaluation. The goal of the Condition Block illustrated in Figure 2 (and implemented using ordinary boolean circuitry, see section 3) is to compute programmer-specific conditions, which can take as input either the per-flow register values (the array ~R), or global registers delivered to this block as an array ~G = {G0, G1, ..., Gh} of (up to) h + 1 global variables and/or global switch states. Formally, this block is therefore in charge of implementing the enabling functions specified by the XFSM abstraction. In practice, it is trivial to extend the assessment of conditions also to packet header fields (for instance, port number greater than a given global variable or custom per-flow register). The output of this block is a boolean vector ~C = {c0, c1, ..., cm} which summarizes whether the i-th condition is true (ci = 1) or false (ci = 0).
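A minimal software model of this stage is easy to write down (the operand wiring and the numeric values below are our own example, not from the prototype): each of the m comparators is configured with an operator and two operand selectors over ~R, ~G and the header fields.

```python
import operator

# Condition Block model: m configurable comparators over operands
# drawn from per-flow registers R, global registers G, header fields H.
OPS = {">": operator.gt, ">=": operator.ge, "==": operator.eq,
       "<=": operator.le, "<": operator.lt}

def condition_block(R, G, H, config):
    """config: list of (op, (bank, index), (bank, index)); returns ~C."""
    bank = {"R": R, "G": G, "H": H}
    return [int(OPS[op](bank[a][i], bank[b][j]))
            for (op, (a, i), (b, j)) in config]

# c0: "flow byte count R0 above threshold G0"; c1: "port H0 <= G1"
C = condition_block(R=[7000], G=[5000, 1024], H=[80],
                    config=[(">", ("R", 0), ("G", 0)),
                            ("<=", ("H", 0), ("G", 1))])
```

In hardware, the two operand selectors correspond to the two multiplexers feeding each comparator (section 3), so the whole vector ~C is produced in parallel.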
Stage 3: XFSM execution step. Since boolean conditions have been transformed into 0/1 bits, they can be provided as input to the TCAM, along with the state label and the necessary packet header fields, to perform a wildcard matching (different conditions may apply in different states, i.e. a bit representing a condition can be set to "don't care" for some specific states). Each TCAM row models one transition in the XFSM, and returns a 3-tuple: (i) the next state in which the flow shall be set (which could coincide with the input state in the case of no state transition, i.e., a self-transition in the XFSM), (ii) the actions associated to the transition (usual OpenFlow-type forwarding actions, such as drop, push_label, set_tos, etc.), and (iii) the information needed to update the registers as described below.
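Behaviorally, the wildcard match over the concatenated key (state label, condition bits, header fields) is a priority-ordered ternary lookup. The sketch below is our own model, with illustrative table entries; a real TCAM resolves all rows in parallel and returns the highest-priority hit.

```python
# TCAM behavioral model: each row is a pattern over the concatenated
# key (state label, condition bit, header field); '*' = don't care;
# the first (highest-priority) matching row returns its transition.
def tcam_lookup(key, rows):
    for pattern, transition in rows:
        if all(p == "*" or p == k for p, k in zip(pattern, key)):
            return transition
    return None

rows = [
    # pattern              -> (next_state, action, register update)
    (("DEFAULT", 1, "*"),   ("LONG",    "set_tos", "ADD(R0,1)->R0")),
    (("DEFAULT", "*", "*"), ("DEFAULT", "forward", "ADD(R0,1)->R0")),
]

hit = tcam_lookup(("DEFAULT", 1, "tcp"), rows)    # condition c0 true
fall = tcam_lookup(("DEFAULT", 0, "tcp"), rows)   # falls to 2nd row
```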
Stage 4: register updates. Most applications require arithmetic processing when updating a stateful variable. Operations can be as simple as integer sums (to update counters or byte statistics) or can require tailored floating point processing (averages, exponential decays, etc.). The role of the Update Logic Block component highlighted in Figure 2 is to implement an array of Arithmetic and Logic Units (ALUs) which support a selected set of computation primitives which permit the programmer to update (re-compute) the value of the registers, using as input the information available at this stage (previous values of the registers, information extracted from the packet, etc.). Section 3.1 will describe the specific instruction set implemented in our HW prototype, where (with no pretence of completeness, nor willingness to impose our own set of operations) we implement a set of operations which appear to be both useful to the specific network programmer's needs and computationally effective in terms of implementation (ideally, executable in a single clock tick). It is worth mentioning that the problem of extending the set of supported ALU instructions is merely a technical one, and does not affect the OPP architecture.
Extension: cross-flow context handling. As noted in [21], there are many useful stateful control tasks in which states for a given flow are updated by events occurring on different flows. A simple but prominent example is MAC learning: packets are forwarded using the destination MAC address, but the forwarding database is updated using the source MAC address. Thus, it may
be useful to further generalize the XFSM abstraction as suggested in [21], i.e. by permitting the programmer to use a Flow Key during lookup (e.g. read information associated to a MAC destination address) and employ a possibly different Flow Key (e.g. associated to the MAC source) for updating a state or a register.
2.3 Programming the OPP
It is useful to conclude this section with at least a sketch of which types of applications (and programs) may be deployed.

A first trivial example of dynamic forwarding actions is that of a simple mechanism which distinguishes short-lived flows from long-lived flows by considering "long" any flow that has transmitted at least N packets, and applies different DSCP tags. The OPP programmer would simply need to define two states (DEFAULT, also associated to every new flow, and LONG), one per-flow register R0 (a packet counter), one global register G0 (storing the constant threshold N), a condition R0 > G0 applicable when in state DEFAULT, and an update function ADD(R0, 1) → R0. Note that we did not assume any pre-implemented counters or meters in the switch; rather, the counter and the relevant threshold check have been programmed using the OPP abstraction.
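Written out in full, the program is just the following (a plain-Python rendering for readability; the DSCP action names are illustrative placeholders, not OPP primitives):

```python
# The two-state long/short flow program from the text: states DEFAULT
# and LONG, per-flow counter R0, global threshold G0 = N, condition
# R0 > G0, update ADD(R0,1) -> R0. Action names are illustrative.
def process_packet(state, R0, G0):
    if state == "DEFAULT" and R0 > G0:        # condition fires once
        return "LONG", R0 + 1, "mark_dscp_long"
    if state == "DEFAULT":
        return "DEFAULT", R0 + 1, "mark_dscp_short"
    return "LONG", R0 + 1, "mark_dscp_long"   # self-transition in LONG

N = 10
state, R0 = "DEFAULT", 0                      # every new flow starts here
for _ in range(12):
    state, R0, action = process_packet(state, R0, N)
```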
The usage of packet inter-arrival times and timers is exemplified by a dynamic intra-flow load balancing application, which can reroute a flow while it is in progress. As suggested in [30], rerouting should not occur during packet bursts, to avoid out-of-ordering and the relevant performance impairments. Support in OPP just requires, for each packet being transmitted, updating a per-flow register R with the quantity t + ∆, t being the actual packet timestamp and ∆ a suitable threshold. When the next packet arrives (time t1), we check the condition t1 > R. If this is false, we route the packet to the assigned path (indicated by the state label); conversely, we route it to an alternative path, and we change the state accordingly. Again, note that we have not assumed any pre-implemented support from the switch (e.g. timeouts or soft states), besides the ability to provide time information (e.g., via a global register, or timestamps as packet metadata).
Finally, with the integration in the ALU design of monitoring-specific update instructions (averages, variances, smoothing filters, see section 3.1), several features frequently used in traffic control and classification applications can be computed on the fly during the pipeline. We defer relevant examples to section 4.
3. HARDWARE FEASIBILITY
Despite the current trend towards softwarization of network functions and the widespread deployment of software switches, we believe that the viability of switch-level programming abstractions which challenge OpenFlow limitations, hence including this work, must still be proven in terms of hardware feasibility and ability to run in a strictly bounded number of clock cycles.

Figure 3: Scheme of an OPP stage
Figure 3 provides a block-level overview of a candidate hardware implementation of a single OPP stage, which we prove feasible with an FPGA prototype. Pipelining of an OPP stage with other OPP stages or ordinary match/action tables does not affect the single-stage design (although it does not permit us to benefit from hardware extensions and TCAM optimizations such as those introduced in [27], which we leave to future work). Figure 3 also illustrates the necessary auxiliary blocks devised to handle packet input capture and output delivery, under the assumption of a 4 × 4 port switch.

Packet reception and header field extraction. Packets received on the input queues are collected and serialized by a mixer block, so that the OPP block receives one packet per clock cycle. Each packet is then processed by a Packet Fields Extractor, configured to provide, together with the header fields (8 in our prototype), the data required in the next processing stages, specifically: i) the Flow Key used to query the Flow Context Table, ii) the header fields used by the Condition Block, iii) the header fields used by the Update Logic Block, and iv) the (possibly different) Flow Key used for updating the Flow Context. The Packet Fields Extractor is easily implemented in HW as a parallel array of elementary Shift and Mask (SaM) blocks, where each SaM block selects the beginning of the targeted header field (the shift function) and performs a bit-wise mask operation. This operation closely resembles that proposed in POF [18]: we also use offsets, but instead of lengths we use bit masks.
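A single SaM element thus reduces to two primitive operations on the raw packet bits; the toy packet value below is our own example.

```python
# Shift-and-Mask (SaM) element: extract a header field from the raw
# packet bits with a bit offset (shift) and a bitmask, as opposed to
# POF's offset + length encoding.
def sam(packet_bits, shift, mask):
    return (packet_bits >> shift) & mask

pkt = 0xAABBCCDD                  # a toy 32-bit "packet"
field = sam(pkt, 16, 0xFF)        # the 8-bit field 16 bits in: 0xBB
```

Because the shift amount and the mask are just configuration values, an array of such elements can extract all configured fields in parallel within a single clock cycle.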
Flow Context Table. This data structure is in charge of storing both the state and the registers associated to Flow Keys. It consists of a hash table (we implemented a d-left hash table with d = 4) to handle exact matches, plus a TCAM to handle wildcard matches. Unlike the hash table, which arguably must be large to store per-flow states, a very small TCAM can be deployed, as it is required to handle the very few special cases where wildcard matches are needed (mainly
Figure 4: Condition logic block array element
default states, where the TCAM priority permits to differentiate default states for different protocols or packet formats). Our implementation uses 128-bit Flow Keys, and returns a 146-bit value which is sufficient to support a 16-bit state label, four 32-bit per-flow registers, and two auxiliary bits per entry used by the microcontroller for housekeeping (see below).
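The lookup path can be summarized as follows (a behavioral sketch only: the d-left organization of the real hash table is elided, and the keys and contexts are illustrative):

```python
# Flow Context Table lookup: O(1) exact match in the hash table, then
# a small priority-ordered TCAM for wildcard/default entries.
def context_lookup(fk, hash_table, tcam):
    if fk in hash_table:                       # exact match: flow context
        return hash_table[fk]
    for pattern, ctx in tcam:                  # wildcard fallback
        if all(p == "*" or p == f for p, f in zip(pattern, fk)):
            return ctx
    return None

# Context = (state label, per-flow registers R0..R3)
hash_table = {("10.0.0.1", 80): ("LONG", [41, 0, 0, 0])}
tcam = [(("*", "*"), ("DEFAULT", [0, 0, 0, 0]))]

ctx_hit = context_lookup(("10.0.0.1", 80), hash_table, tcam)
ctx_new = context_lookup(("10.0.0.9", 22), hash_table, tcam)
```

The wildcard fallback is what assigns the DEFAULT state (and zeroed registers) to flows seen for the first time, without requiring any controller involvement.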
Condition Logic Block. This block permits to configure conditions on input pairs (per-flow registers, global registers, header fields), and evaluates them so as to return as output a boolean 0/1 vector. This block, shown in Figure 4, comprises multiple (8 in our implementation) parallel configurable comparators, each of which takes as input two operands selected among all the flow registers Ri, all the global registers Gi and the header fields Hi coming from the Packet Fields Extractor. The selection operation is provided by two multiplexers (one for each operand). Each comparator supports five arithmetic comparison functions: >, ≥, =, ≤, <.
Type / Instructions / Definition:
Logic: NOP (do nothing); NOT (OUT1 ← NOT(IN1)); XOR, AND, OR (OUT1 ← IN1 op IN2)
Arithmetic: ADD, SUB, MUL, DIV (OUT1 ← IN1 op IN2); ADDI, SUBI, MULI, DIVI (OUT1 ← IN1 op IMM)
Shift: LSL, Logical Shift Left (OUT1 ← IN1 << IMM); ROR, Rotate Right (OUT1 ← IN1 ror IMM)

Table 2: ALU basic instruction set
(INi) can be any among the available per-flow registries Ri, the global variables Gi, or the header fields Hi provided by the Extractor. Output operands (OUTi) indicate where the result of the instruction must be written (e.g. in a given per-flow register, or in a global variable). In some instructions, one or more of the operands (IOi) are both used as input and output. Our implementation supports 4 per-flow registries, 4 global registries and 8 header fields. Therefore, it may in principle support up to 24/log2(16) = 6 operands. In practice, we envision at most 4 operands (e.g., for the variance or for the ewma smoothing instructions) and thus our implementation may readily support up to 64 among registries and header fields. In the case of logic/arithmetic/shift operations, which only require at most two operands plus a third output, we have also considered the case in which one of the operands is an actual value (immediate value), which can hence use 16 bits.
The packet/flow specific instructions supported in our prototype implement, as dedicated HW primitives running at the system clock frequency and with a maximum latency of two clock cycles², domain-specific operations which we deem useful in traffic control applications, and which would normally require multiple clock cycles if implemented using more elementary operations. Such domain specific operations include the online computation of running averages (avg) and variances (var), and the computation of exponentially decaying moving averages (ewma), which can serve the purpose of a moving average but can be incrementally computed and does not require to maintain a window of samples.
Usage and implementation details about packet/flow specific instructions are provided in Table 3. The avg operation stores the number of samples in IO1, and includes a new sample IN1 in the running average IO2. Similarly, the var operation stores the number of samples in IO1, the average of the value IN1 in IO2 and the variance in IO3.
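Since all register updates in a transition occur in parallel (old values on the right-hand side, new values on the left), the avg/var recurrences of Table 3 can be sketched as pure update functions (a software rendering of ours, not the HW datapath):

```python
def avg_update(io1, io2, in1):
    """avg(): parallel-semantics running mean.
    io1 = sample count, io2 = running average; the right-hand sides use
    the *old* register values, so (io1 + 1) is the new sample count."""
    return io1 + 1, io2 + (in1 - io2) / (io1 + 1)

def var_update(io1, io2, io3, in1):
    """var(): running mean (io2) and running variance estimate (io3),
    mirroring Table 3 with parallel-update semantics."""
    new_io2 = io2 + (in1 - io2) / (io1 + 1)
    new_io3 = io3 + ((in1 - io2) ** 2 - io3) / (io1 + 1)
    return io1 + 1, new_io2, new_io3

# running mean of 1, 2, 3, 4 -> 2.5
io1, io2 = 0, 0.0
for s in [1, 2, 3, 4]:
    io1, io2 = avg_update(io1, io2, s)
```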
The ewma operation³ was included to permit smoothing.

² As they involve a division, which we had to limit to 16 bits for dividend and divisor to target a 2 clock cycles latency.
³ Being tk the last sample time, and xk′ a new sample occurring at time tk′, for simplicity of HW implementation we approximate the exponentially weighted moving average as m(tk′) = m(tk)·α^(tk′−tk) + xk′, and we use α = 1/2 to compute powers as shift operations. The intermediate decay quantity

Instruction | Definition
avg()       | IO1 ← IO1 + 1; IO2 ← IO2 + (IN1 − IO2)/(IO1 + 1)
var()       | IO1 ← IO1 + 1; IO2 ← IO2 + (IN1 − IO2)/(IO1 + 1); IO3 ← IO3 + ((IN1 − IO2)² − IO3)/(IO1 + 1)
ewma()      | IO1 ← IN1; decay = 1

Table 3: Packet/flow specific instruction set
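The shift-based approximation of footnote 3 (α = 1/2, so the decay 2^−Δt becomes a right shift by the elapsed ticks) can be sketched in integer arithmetic; the function name and tick-based timing are our illustrative choices:

```python
def ewma_update(m, t_last, x_new, t_new):
    """Approximate EWMA per the footnote: m(tk') = m(tk) * alpha^(tk'-tk) + xk'
    with alpha = 1/2, so the decay is a right shift by the elapsed time
    (here measured in integer clock ticks)."""
    dt = t_new - t_last
    decayed = m >> dt if dt < 64 else 0  # 2^-dt decay via a shift
    return decayed + x_new, t_new

# three samples of value 100 at ticks 1, 2 and 4
m, t = 0, 0
for t_new, x in [(1, 100), (2, 100), (4, 100)]:
    m, t = ewma_update(m, t, x, t_new)
# m == 137: 100 decayed over 3 ticks total, plus the later samples
```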
resource type | Reference switch | OPP switch
# Slice LUTs  | 49436 (11%)      | 71712 (16%)
# Block RAMs  | 194 (13%)        | 393 (26%)

Table 4: Hardware cost of OPP compared with the reference NetFPGA SUME switch.

The size of the SRAM that can be instantiated on a last generation chip is up to 32 MB, corresponding to around 1 million entries in the d-left hash for the flow context table. The size of a TCAM can be up to 40 Mb, corresponding to 256K XFSM table entries.
The system latency, i.e. the time interval from the first table lookup to the last context update, is 6 clock cycles. The FPGA prototype is able to sustain the full throughput of 40 Gbit/s provided by the 4 switch ports. If we suppose a minimum packet size of 40 bytes (320 bits), the system is able to process 1 packet per clock cycle, and thus up to 6 packets could be pipelined. However, the feedback loop (not present in the forward-only OpenFlow pipelines [36]) raises a concern: the state update performed for a packet at the sixth clock cycle would be missed by pipelined packets. This could be an issue for packets belonging to a same flow arriving back-to-back (consecutive clock cycles); in practice, as long as the system is configured to work by aggregating N ≥ 6 different links, the mixer's round robin policy will separate two packets coming from the same link by N clock cycles, thus solving the problem. Note that the 6 clock cycles latency is fixed by the hardware blocks used in the FPGA (the TCAM and the Block RAMs) and basically does not change when scaling up the number of ingress ports or moving to an ASIC.
The whole system has been synthesized using the standard Xilinx design flow. Table 4 reports the logic and memory resources (in terms of absolute numbers and fraction of available FPGA resources) used by the OPP FPGA implementation, and compares these results with those required for the NetFPGA SUME single-stage reference switch. As expected, the logic uses a small fraction of the total area (the increase with respect to the reference switch is 5% of the available FPGA logic resources), which is dominated by memory (that doubles with respect to the reference switch). The synthesis results hence confirm the trend already shown by [27]: the HW area is dominated by memory, while adding intelligence/features to the logic requires a small silicon overhead. The performance in terms of latency of an OPP stage and throughput of the deployed FPGA prototype has been measured by sending several synthetic traces of packets of different sizes. The results are presented in Fig. 5. As expected, the FPGA is able to sustain the expected throughput⁴.

⁴ Due to the limitation of our hardware measurement set-up, we were unable to actually send to the FPGA more than 24 Gbit/s, so the data referring to 64B packet size could not be measured. The expected theoretical value is reported.
Figure 5: Performance of the FPGA prototype (theoretical vs. measured throughput in Mpps, and latency in ns, as a function of the packet size: 64, 128, 256, 512 bytes)
4. PROGRAMMING EXAMPLES

To functionally test the ability of OPP to support stateful applications, we have developed a complete OPP virtual software environment. For both the switch and controller implementation we have extended the CPqD OpenFlow 1.3 software virtual switch [37] and the widely adopted OpenFlow controller Ryu [38]. The software implementation serves just for testing purposes, hence it closely mimics the described OPP hardware operation, including the relevant limitations. To configure an OPP switch via the controller, we developed an OPP-specific extension of the OpenFlow protocol. Due to space limitations (the interested reader can find configuration files in the repository) we just mention that the configuration of the XFSM in the OPP architecture is a straightforward extension of an OpenFlow configuration: it just requires to populate the XFSM table entries and to configure conditions, functions, key extractors and initial global register values. All software components required to test the proposed applications are bundled in a mininet [39] based virtual machine available at a dedicated OPP repository [40], along with our prototype's VHDL HW code.
To understand how an application can be programmed using OPP, let's walk through a simplistic example of a (quite inefficient, but at least trivial to follow) TCP port scan detection and mitigation application. Since the target is to detect IP addresses which behave as scanners, we use the IP address as Flow Key. Figure 6 represents our desired application's behavior, expressed in the form of an XFSM “program”, whereas figure 7 provides the corresponding tabular configuration delivered to the switch. For every IP packet, we check in the Flow Context Table whether the IP source has an associated context; if this is not the case, a DEFAULT state is conventionally returned. The XFSM table now checks whether the packet is a TCP SYN, and only in this case we allocate a Flow Context Table entry for the considered IP source and set it in MONITOR state. In this state, we measure the rate of new TCP SYN arrivals toward hosts behind the switch port 1. Such rate
Figure 6: Port scan detection XFSM. States: DEFAULT, MONITOR, DROP. Transitions: DEFAULT → MONITOR on NEW_TCP_FLOW [OUT]; MONITOR → MONITOR on NEW_TCP_FLOW if R0 < G0 [OUT]; MONITOR → DROP on NEW_TCP_FLOW if R0 ≥ G0 [DROP]; DROP → DROP on ANY_PACKET if R1 > pkt.ts [DROP]; DROP → MONITOR on ANY_PACKET if R1 < pkt.ts [OUT]; MONITOR → DEFAULT on IDLE_TIMEOUT_EXPIRED. Registers: R0 = TCP SYN rate (EWMA); R1 = DROP state expiration timestamp; R2 = last packet timestamp; G0 = rate threshold (global); G1 = DROP duration (global).
C0 | C1 | state | packet fields | next state | packet actions | update functions
*  | *  | 0     | syn=1         | 1          | OUT            | R0=0; R2=pkt.ts
0  | *  | 1     | syn=1         | 1          | OUT            | R0=EWMA(R0, R2, pkt.ts); R2=pkt.ts
1  | *  | 1     | syn=1         | 2          | DROP           | R1=pkt.ts + G1
*  | 1  | 2     | *             | 2          | DROP           |
*  | 0  | 2     | *             | 1          | OUT            | R0=0; R2=pkt.ts

Figure 7: Port scan detection XFSM table
(computed with the EWMA update function) is stored and updated in the flow register R0.

While in MONITOR state, the value of R0 is verified for each new TCP flow. If a given threshold (say 20 SYN/s, a value stored in the global register G0) is exceeded, the state associated to this flow is set to DROP and all packets from this IP address are discarded. Suppose now that the programmer wants to block the scanner for 5 seconds. Lacking explicit timers (a non trivial HW extension), such mechanism is realised by the following procedure: (i) when the flow state transits from MONITOR to DROP, the register R1 is set to the packet timestamp value plus 5 sec. (a value stored in the global register G1); (ii) in DROP state the R1 value is checked for every received packet; (iii) if R1 > pkt.ts (packet arrival time) packets keep being dropped, otherwise the flow state reverts to MONITOR; (iv) the XFSM table is configured as in fig. 7.
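The walk-through above can be condensed into a toy software XFSM (the rate estimator here is a crude placeholder of ours, not the prototype's EWMA instruction; thresholds are the example's 20 SYN/s and 5 s):

```python
G0, G1 = 20.0, 5.0          # SYN-rate threshold (SYN/s), DROP duration (s)
ctx = {}                    # Flow Context Table keyed by source IP

def ewma_rate(r0, t_last, t_now):
    # crude rate estimate: halve the old rate, blend in the new 1/dt sample
    dt = max(t_now - t_last, 1e-3)
    return 0.5 * r0 + 0.5 * (1.0 / dt)

def process(ip, ts, is_syn):
    """Return the packet action ('OUT' or 'DROP') and update the context.
    States: 0 = DEFAULT, 1 = MONITOR, 2 = DROP (as in Figs. 6-7)."""
    state, R0, R1, R2 = ctx.get(ip, (0, 0.0, 0.0, 0.0))
    if state == 0:                        # DEFAULT: only a SYN creates context
        if is_syn:
            ctx[ip] = (1, 0.0, 0.0, ts)   # -> MONITOR
        return "OUT"
    if state == 1:                        # MONITOR: track the SYN rate
        if is_syn:
            R0, R2 = ewma_rate(R0, R2, ts), ts
            if R0 >= G0:
                ctx[ip] = (2, R0, ts + G1, R2)   # -> DROP, R1 = expiry
                return "DROP"
        ctx[ip] = (1, R0, R1, R2)
        return "OUT"
    if ts < R1:                           # DROP: discard until R1 expires
        return "DROP"
    ctx[ip] = (1, 0.0, 0.0, ts)           # expiry passed -> back to MONITOR
    return "OUT"
```

A scanner sending back-to-back SYNs quickly pushes the estimated rate above G0 and is dropped for G1 seconds.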
4.1 Decision tree based traffic classification

Machine Learning (ML) tools are widely adopted by the networking community for detecting anomalies and classifying traffic patterns [41]. We have tested the feasibility of using OPP to support this kind of traffic monitoring schemes by implementing a decision tree supervised classifier based on the C4.5 algorithm [42], which has been exploited by several works on ML based network traffic classification [43, 44, 45, 46, 47, 48].

C0 | C1 | C2 | C3 | state | next state | packet actions          | update functions
*  | *  | *  | *  | 0     | 1          | Fwd()                   | R4=now()+G0; var(R0,R1,R2,pkt.len); R3=R3+pkt.len
0  | *  | *  | *  | 1     | 1          | Fwd()                   | var(R0,R1,R2,pkt.len); R3=R3+pkt.len
1  | 1  | *  | 1  | 1     | 3          | Fwd(), SetField(dscp=0) |
1  | 1  | *  | 0  | 1     | 2          | Fwd(), SetField(dscp=10)|
1  | 0  | 0  | *  | 1     | 2          | Fwd(), SetField(dscp=10)|
1  | 0  | 1  | *  | 1     | 3          | Fwd(), SetField(dscp=0) |
*  | *  | *  | *  | 2     | 2          | Fwd(), SetField(dscp=10)|
*  | *  | *  | *  | 3     | 3          | Fwd(), SetField(dscp=0) |

Figure 8: XFSM table for the traffic classifier
Any ML based classification mechanism has two phases: a training phase and a test phase. The training phase is offline and is used to create the classification model by feeding the algorithm with labeled data that associate a measured traffic feature vector to one of n decision classes. In the case of decision tree based ML algorithms, the output of such phase is the binary classification tree. The training phase must obviously be performed outside the switch. For our use case implementation the decision rules have been created using the Orange data mining framework [49], and a feature set proposed in [50]. We considered a simple scenario where it is necessary to discriminate between WEB and P2P (control) traffic. The selected features for each flow are: packet size average/variance and total number of received bytes. These features are mapped directly to the per-flow memory registers R1, R2, R3. Moreover the application XFSM requires two additional registers: R0 (packet counter) and R4 (measurement window expiration time). The input feature vectors are evaluated over a time window of 10 seconds.

The test phase, which is performed online, consists of two operations: (1) for each flow the feature set described above is computed; (2) after 10 seconds a decision is made according to the decision tree. This testing mechanism is implemented in OPP according to the XFSM shown in Figure 8. The XFSM flow states are encoded as follows: State 0 → default; State 1 → measurement and decision; State 2 → WEB traffic (DSCP class AF11); State 3 → P2P traffic (DSCP class best effort). The condition set is: C0: (now > R4); C1: (R2 > G2); C2: R3 > G3; C3: R1
C0 | C1 | state | packet fields | next state | packet actions | update functions
*  | *  | 0     | eth.type=IP   | 1          | OUT            | R0=now−G0; R1=now+G1
1  | 1  | 1     | eth.type=IP   | 1          | OUT            | R0=R0+G1; R1=R0+G1
1  | 0  | 1     | eth.type=IP   | 1          | OUT            | R0=now−G0; R1=now+G1
0  | 1  | 1     | eth.type=IP   | 1          | DROP           |

Figure 9: Token bucket XFSM. The flow registers R0, R1 are used to store respectively Tmin, Tmax. The global registers G0, G1 are used to store B∗Q and Q. The extractor is ip.src.

Since registry updates are performed after the condition verification, we cannot update the number of tokens in the bucket based on the packet arrival time before evaluating the condition (token availability) for packet forwarding. For this reason we have implemented an alternative and equivalent algorithm based on a time window. For each flow a time window W = (Tmin, Tmax) of length B·Q is maintained to represent the availability times of the tokens in the bucket. At each packet arrival, if the arrival time T0 is within W (Case 1), at least one token is available and the bucket is not full, so we shift W by Q to the right and forward the packet. If the arrival time is after Tmax (Case 2), the bucket is full, so the packet is forwarded and W is moved to the right to reflect that B−1 tokens are now available (Tmin = T0 − (B−1)Q and Tmax = T0 + Q). Finally, if the packet is received before Tmin (Case 3), no token is available, therefore W is left unchanged and the packet is dropped.
In the OPP implementation, upon receipt of the first flow packet, we make a state transition in which we initialize the two registers: Tmin = T0 − (B−1)·Q and Tmax = T0 + Q (initialization with full bucket). At each subsequent packet arrival we verify two conditions: C0: Tnow ≥ Tmin; C1: Tnow ≤ Tmax.
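Cases 1–3 above can be sketched as a small class (the class name and parameterization are ours; the window arithmetic follows the text, with Q the token inter-arrival time and B the bucket depth):

```python
class WindowTokenBucket:
    """Time-window token bucket: W = [Tmin, Tmax] of length B*Q holds the
    availability times of the B tokens."""
    def __init__(self, B, Q, t0):
        self.B, self.Q = B, Q
        # full bucket at start: Tmin = t0 - (B-1)Q, Tmax = t0 + Q
        self.tmin = t0 - (B - 1) * Q
        self.tmax = t0 + Q

    def offer(self, t):
        """Return True to forward the packet, False to drop it."""
        if t >= self.tmax:               # Case 2: bucket full, B-1 tokens left
            self.tmin = t - (self.B - 1) * self.Q
            self.tmax = t + self.Q
            return True
        if t >= self.tmin:               # Case 1: token available, shift W by Q
            self.tmin += self.Q
            self.tmax += self.Q
            return True
        return False                     # Case 3: no token, W unchanged

# B = 2 tokens, rate 1 token/s: two early packets pass, the third is dropped,
# and a late packet finds the bucket refilled.
b = WindowTokenBucket(B=2, Q=1.0, t0=0.0)
decisions = [b.offer(t) for t in [0.0, 0.1, 0.2, 5.0]]
```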
functionalities. Moreover, the ability to store data in persistent registries permits to properly describe, using P4, the “computational loop” characterizing OPP. On the other side, we suffered from the lack, among the P4 constructs, of an explicit state/context table and of a clean way to store and access per-flow data. In our OPP.p4 library, a table functionally equivalent to our Context table was actually constructed by combining arrays of registers with the hash key generators which are provided as P4 language primitives. However, besides the obvious stretch (P4 registers are generic, and not specifically meant to be deployed on a per-flow basis), this construction also suffers from hash collisions, a non trivial problem if constrained to be addressed while the packet is flying through the pipeline. The availability of a tailored context/state table structure in P4 would greatly simplify the support of an OPP target platform.
Structural limitations and possible extensions. While (we believe) very promising, our proposed approach is not free of structural concerns. If, on one side, limitations in the set of supported enabling functions and ALU functions for registry updates may be easily addressed with suitable extensions, and the integration of more flexible packet header parsing (following [24]) is not expected to bring significant changes in the architecture, there are at least three pragmatic compromises which we took in the design, and which suggest future research directions. The first, and major, one resides in the fact that state transitions are “clocked” by packet arrivals: one and only one state transition associated to a flow can be triggered, and only if a packet of that flow arrives; asynchronous events, such as timers' expiration, are not supported. So far we have partially addressed this limitation with, on one side, the decoupling between lookup and update functions (the cross-flow state handling feature), and on the other side with programming tricks such as the handling of time performed while implementing the token bucket example. But further flexibility in this direction is a priority in our future research work. A second shortcoming is the deployment of ALU processing only in the Update Logic Block. This decision was made in favour of a cleaner abstraction and a simpler implementation. However, (programmable) arithmetic and logic operations would be beneficial also while evaluating conditions (e.g., A − B > C) which, in the most general case, may require to be postponed to the next packet (the update function can store A − B in a registry, and the next condition can use such registry). A third, minor, shortcoming relates to the fact that all updates occur in parallel. This prevents the programmer from pipelining operations, i.e. using (in the same transition step) the output of an instruction as input to the next one. While this issue is easily addressed by deploying multiple Update Logic Blocks in series, this would increase the latency of the OPP loop.
6. RELATED WORK

This work focuses on data plane programming architectures and abstractions, a relatively recent trend. In such field, so far the most influential work is arguably P4 [24], a programming language specifically focusing on data path packet processing. In turn, such initial work has stimulated the creation of a consortium (p4.org) which has so far produced release 1.0.2 of the language specification [28]. Our OPP work is at a different (lower) level than P4: it describes a hardware programming interface and a relevant architecture which could in the future be adapted to be used as a compilation target [25] for P4. Furthermore, our work deals with stateful processing across different packets of a flow, and as such it appears perfectly complementary to the original P4 proposal [24], which initially focused mainly on programming flexibility of the packet pipeline (P4 registers have been introduced in [28]).
Concerning stateful processing, the work closest to ours is OpenState [21], and in part FAST [23]. With respect to OpenState, OPP makes a very significant step forward, as we support full eXtended Finite State Machines (XFSM, as defined in [29]) as opposed to the much simpler OpenState's Mealy Machines, and hence we significantly broaden the variety of applications that can be programmed on the switch. Such step requires additional new specialized hardware blocks with respect to OpenState, which instead requires only marginal extensions to an OpenFlow hardware design [22].
Finally, OPP shares some technical similarities with [27] and with the Intel Flexpipe architecture [26], especially for what concerns the handling of ALUs in the packet processing pipeline. However, both OPP's focus (on stateful processing) and its architecture design remain extremely different from both [27] and [26]. Indeed, an advised extension consists in extending OPP to handle multiple pipelined stages and hence exploit the TCAM reconfigurability concepts introduced in [27].
7. CONCLUSIONS

OPP is an attempt to find a pragmatic and viable balance between platform-independent HW configurability and data plane (packet-level) programming flexibility. While permitting programmers to deploy more sophisticated stateful forwarding tasks with respect to the basic OpenFlow's static match/action abstraction, we believe that an asset of our configuration interface resides in the fact that it does not significantly depart from OpenFlow-type configurations - our extended finite state machine model is indeed conveyed to the switch in the usual form of a TCAM's Flow Table. We thus hope that our work might stimulate further debate in the research community on how to incrementally deploy programmable traffic processing inside the network nodes, e.g. via gradual OpenFlow extensions.
8. REFERENCES
[1] Z. Wang, Z. Qian, Q. Xu, Z. Mao, and M. Zhang, “An untold story of middleboxes in cellular networks,” vol. 41, no. 4, pp. 374–385, 2011.
[2] J. Sherry, S. Hasan, C. Scott, A. Krishnamurthy, S. Ratnasamy, and V. Sekar, “Making middleboxes someone else's problem: network processing as a cloud service,” ACM SIGCOMM Computer Communication Review, vol. 42, no. 4, pp. 13–24, 2012.
[3] Z. A. Qazi, C.-C. Tu, L. Chiang, R. Miao, V. Sekar, and M. Yu, “SIMPLE-fying middlebox policy enforcement using SDN,” vol. 43, no. 4, pp. 27–38, 2013.
[4] A. Gember-Jacobson, R. Viswanathan, C. Prakash, R. Grandl, J. Khalid, S. Das, and A. Akella, “OpenNF: Enabling innovation in network function control,” in ACM SIGCOMM Conference, 2014, pp. 163–174.
[5] K. Greene, “TR10: Software-defined networking, 2009,” MIT Technology Review.
[6] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner, “OpenFlow: enabling innovation in campus networks,” ACM SIGCOMM Computer Communication Review, vol. 38, no. 2, pp. 69–74, 2008.
[7] N. Feamster, J. Rexford, and E. Zegura, “The road to SDN: an intellectual history of programmable networks,” ACM SIGCOMM Computer Communication Review, vol. 44, no. 2, pp. 87–98, 2014.
[8] S. Shenker, M. Casado, T. Koponen, N. McKeown et al., “The future of networking, and the past of protocols,” Keynote slides, Open Networking Summit, vol. 20, 2011.
[9] N. Gude, T. Koponen, J. Pettit, B. Pfaff, M. Casado, N. McKeown, and S. Shenker, “Nox: towards an operating system for networks,” ACM SIGCOMM Computer Communication Review, vol. 38, no. 3, pp. 105–110, 2008.
[10] A. K. Nayak, A. Reimers, N. Feamster, and R. Clark, “Resonance: Dynamic access control for enterprise networks,” in 1st ACM Workshop on Research on Enterprise Networking (WREN09), 2009.
[11] N. Foster, R. Harrison, M. J. Freedman, C. Monsanto, J. Rexford, A. Story, and D. Walker, “Frenetic: A network programming language,” in ACM SIGPLAN Notices, vol. 46, no. 9, 2011, pp. 279–291.
[12] A. Voellmy, H. Kim, and N. Feamster, “Procera: a language for high-level reactive network control,” in Proc. 1st workshop on Hot topics in software defined networks, 2012, pp. 43–48.
[13] C. Monsanto, J. Reich, N. Foster, J. Rexford, and D. Walker, “Composing Software-Defined Networks,” in USENIX NSDI, 2013, pp. 1–13.
[14] T. Nelson, A. D. Ferguson, M. J. Scheer, and S. Krishnamurthi, “Tierless programming and reasoning for software-defined networks,” USENIX NSDI, 2014.
[15] H. Kim, J. Reich, A. Gupta, M. Shahbaz, N. Feamster, and R. Clark, “Kinetic: Verifiable dynamic network control,” in USENIX NSDI, May 2015.
[16] M. T. Arashloo, Y. Koral, M. Greenberg, J. Rexford, and D. Walker, “SNAP: Stateful network-wide abstractions for packet processing,” arXiv preprint arXiv:1512.00822, 2015.
[17] M. Shahbaz and N. Feamster, “The case for an intermediate representation for programmable data planes,” in 1st ACM SIGCOMM Symposium on Software Defined Networking Research, 2015.
[18] H. Song, “Protocol-oblivious forwarding: Unleash the power of SDN through a future-proof forwarding plane,” in Proceedings of the Second ACM SIGCOMM Workshop on Hot Topics in Software Defined Networking, HotSDN ’13, 2013, pp. 127–132.
[19] H. Song, J. Gong, H. Chen, and J. Dustzadeh, “Unified POF Programming for Diversified SDN Data Plane Devices,” in ICNS 2015, 2015.
[20] A. Sivaraman, S. Subramanian, A. Agrawal, S. Chole, S.-T. Chuang, T. Edsall, M. Alizadeh, S. Katti, N. McKeown, and H. Balakrishnan, “Towards programmable packet scheduling,” in 14th ACM Workshop on Hot Topics in Networks, 2015, p. 23.
[21] G. Bianchi, M. Bonola, A. Capone, and C. Cascone, “OpenState: programming platform-independent stateful OpenFlow applications inside the switch,” ACM SIGCOMM Computer Communication Review, vol. 44, no. 2, pp. 44–51, 2014.
[22] S. Pontarelli, M. Bonola, G. Bianchi, A. Capone, and C. Cascone, “Stateful OpenFlow: Hardware proof of concept,” in IEEE High Performance Switching and Routing (HPSR), 2015.
[23] M. Moshref, A. Bhargava, A. Gupta, M. Yu, and R. Govindan, “Flow-level state transition as a new switch primitive for SDN,” in 3rd workshop on Hot topics in software defined networking, 2014, pp. 61–66.
[24] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese et al., “P4: Programming protocol-independent packet processors,” ACM SIGCOMM Computer Communication Review, vol. 44, no. 3, pp. 87–95, 2014.
[25] L. Jose, L. Yan, G. Varghese, and N. McKeown, “Compiling packet programs to reconfigurable switches,” in USENIX NSDI, 2015.
[26] “Intel Ethernet Switch FM6000 Series - Software Defined Networking.” [Online]. Available: http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ethernet-switch-fm6000-sdn-paper.pdf
[27] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Izzard, F. Mujica, and M. Horowitz, “Forwarding metamorphosis: Fast programmable match-action processing in hardware for SDN,” in ACM SIGCOMM Conference, 2013, pp. 99–110.
[28] The P4 Language Consortium, “The P4 Language Specification, version 1.0.2,” March 2015.
[29] K. T. Cheng and A. S. Krishnakumar, “Automatic Functional Test Generation Using The Extended Finite State Machine Model,” in ACM Int. Design Automation Conference (DAC), 1993, pp. 86–91.
[30] S. Kandula, D. Katabi, S. Sinha, and A. Berger, “Dynamic load balancing without packet reordering,” ACM SIGCOMM Computer Communication Review, vol. 37, no. 2, pp. 51–62, 2007.
[31] N. Zilberman, Y. Audzevich, G. Covington, and A. Moore, “NetFPGA SUME: Toward 100 Gbps as Research Commodity,” IEEE Micro, vol. 34, no. 5, pp. 32–41, Sept 2014.
[32] “Virtex-7 Family Overview,” http://www.xilinx.com.
[33] B. Jean-Louis, “Using block RAM for high performance read/write TCAMs,” 2012.
[34] Z. Ullah, M. Jaiswal, Y. Chan, and R. Cheung, “FPGA Implementation of SRAM-based Ternary Content Addressable Memory,” in IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012.
[35] W. Jiang, “Scalable ternary content addressable memory implementation using FPGAs,” in Architectures for Networking and Communications Systems (ANCS), 2013 ACM/IEEE Symposium on, 2013, pp. 71–82.
[36] Open Networking Foundation, “OpenFlow Switch Specification ver 1.4,” Oct. 2013.
[37] “OpenFlow 1.3 Software Switch,” http://cpqd.github.io/ofsoftswitch13/.
[38] “RYU software framework,” http://osrg.github.io/ryu/.
[39] “MiniNet,” http://www.mininet.org.
[40] “OPP Source Repository,” for blind review purposes currently available at anonymized link https://drive.google.com/folderview?id=0BzXk0zMykkNoR1RTeXMzNmZJdjQ - to be replaced with public one after the review process.
[41] T. T. Nguyen and G. Armitage, “A survey of techniques for internet traffic classification using machine learning,” IEEE Communications Surveys & Tutorials, vol. 10, no. 4, pp. 56–76, 2008.
[42] J. R. Quinlan, C4.5: programs for machine learning. Elsevier, 2014.
[43] N. Williams, S. Zander, and G. Armitage, “A preliminary performance comparison of five machine learning algorithms for practical IP traffic flow classification,” ACM SIGCOMM Computer Communication Review, vol. 36, no. 5, pp. 5–16, 2006.
[44] Z.-S. Pan, S.-C. Chen, G.-B. Hu, and D.-Q. Zhang, “Hybrid neural network and C4.5 for misuse detection,” in Machine Learning and Cybernetics, 2003 IEEE International Conference on, vol. 4, 2003, pp. 2463–2467.
[45] Y. Ma, Z. Qian, G. Shou, and Y. Hu, “Study of information network traffic identification based on C4.5 algorithm,” in Wireless Communications, Networking and Mobile Computing, 2008. WiCOM ’08. 4th IEEE International Conference on, 2008, pp. 1–5.
[46] Y. Zhang, H. Wang, and S. Cheng, “A method for real-time peer-to-peer traffic classification based on C4.5,” in Communication Technology (ICCT), 12th IEEE International Conference on, 2010, pp. 1192–1195.
[47] W. Li and A. W. Moore, “A machine learning approach for efficient traffic classification,” in Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS ’07. 15th International Symposium on, 2007, pp. 310–317.
[48] R. Alshammari et al., “Machine learning based encrypted traffic classification: identifying SSH and Skype,” in Computational Intelligence for Security and Defense Applications, 2009. CISDA 2009. IEEE Symposium on, 2009, pp. 1–8.
[49] J. Demšar, T. Curk, A. Erjavec, Č. Gorup, T. Hočevar, M. Milutinovič, M. Možina, M. Polajnar, M. Toplak, A. Starič, M. Štajdohar, L. Umek, L. Žagar, J. Žbontar, M. Žitnik, and B. Zupan, “Orange: Data mining toolbox in Python,” Journal of Machine Learning Research, vol. 14, pp. 2349–2353, 2013. [Online]. Available: http://jmlr.org/papers/v14/demsar13a.html
[50] Z. Li, R. Yuan, and X. Guan, “Accurate classification of the internet traffic based on the SVM method,” in Communications, 2007. ICC ’07. IEEE International Conference on, 2007, pp. 1373–1378.