Scalability and Resilience of Software-Defined Networking: An Overview

Benjamin J. van Asten, Niels L. M. van Adrichem and Fernando A. Kuipers
Network Architectures and Services, Delft University of Technology
Mekelweg 4, 2628 CD Delft, The Netherlands
{B.J.vanAsten@student., N.L.M.vanAdrichem@, F.A.Kuipers@}tudelft.nl

Abstract—Software-Defined Networking (SDN) allows an intelligent and centralized authority to control the available network resources in order to optimize traffic flows in a flexible manner. However, centralized control may face scalability issues when the network size or the number of traffic flows increases. Also, a centralized controller may form a single point of failure, thereby affecting the network resilience.

This article provides an overview of SDN that focuses on (1) scalability, concerning the increased control overhead faced by a central controller, and (2) resiliency, in terms of protection against controller failure and network topology failure, as well as security against malicious attacks.

I. INTRODUCTION

Currently, most switching and routing solutions integrate both data and control plane functionality. The data plane performs per-packet forwarding based on look-up tables located in the memory or buffer of the switch or router, whereas the control plane is used to define rules based on networking policies to create the look-up tables. Due to the high demands on network performance and growing configuration complexity, the control plane has become overly complicated, inflexible and difficult to manage. To solve this problem a new networking paradigm was needed, which was compatible with the widely used Ethernet switching and IP routing techniques. The solution was found in virtualization techniques used in server applications, where an abstraction layer is positioned above the server hardware to allow multiple virtual machines to share the available resources of the server. Software-Defined Networking (SDN) adopted this paradigm and introduced an abstraction layer in networking.

By abstracting the network resources, the data and control planes are separated. The data plane is located at the switch hardware, where the optimized forwarding hardware is preserved, and the control of the network is centralized into an intelligent authority with the aim to improve flexibility and manageability. A centralized authority provides the intelligence to network switches to route and control the traffic through the network infrastructure. Optimal paths through the network can be provided by the central authority in advance or on demand. The current implementation of the SDN networking paradigm is found in the OpenFlow protocol, which was developed by Stanford University in 2008 and is currently under development within the Open Networking Foundation. OpenFlow has attracted some big vendors in the networking community and became the most popular realization of the SDN networking paradigm.

Since the introduction of OpenFlow, much research has been performed in two different fields, being i) scalability and performance of the central authority in relation to the growth of network traffic and requests, and ii) the robustness and resiliency of the network against link and switch failures, as well as failures of the central authority. Clearly, the two are related, since scalability issues may cause failures.

In this overview, we specifically focus on scalability and resilience of SDN. In section II, we differentiate our work from existing surveys on SDN and OpenFlow networking. We describe the basics behind the SDN paradigm, the OpenFlow protocol, network controllers and compliant switches in section III. The general framework and standard notation are given in section IV, while sections V to VI discuss related work in relation to our framework. Section VII concludes this overview.

II. RELATED SURVEYS

Nunes et al. [1] presented a standard survey with emphasis on past, present and future implementations of SDN. It gives a proper overview of possible applications that could benefit from SDN. Feamster et al. [2] give a historical insight into the development of SDN networks, with the emphasis on virtualizing the network and separating the data and control planes. A survey on security in SDN is given by Scott-Hayward et al. [3]. The survey provides a useful categorization of security-related research and addresses security analysis, enhancements and solutions, as well as the data, control and application layer of SDN. Yeganeh et al. [4] focus on the scalability concerns in relation to current-state networking, controllers and switching hardware. Sezer et al. [5] discuss the implementation challenges for SDN in relation to carrier-grade networks. Suzuki et al. [6] take a similar approach concerning OpenFlow technologies in carrier-grade and data center networks. In [7], Lara et al. provide an extensive survey on network innovations using OpenFlow, where the OpenFlow specification is discussed in detail and recent experiences with OpenFlow deployments on campus networks and testbeds are shared.

In contrast to the above-mentioned surveys, we propose a graphical reference framework, with which SDN strengths and frailties are identified more easily. Furthermore, we specifically focus on scalability and resilience aspects.



Figure 1. SDN concept of abstracting the network view - On the Data Plane, network elements (switches) provide network connectivity and status to the Control Plane. The network elements are configured by Network Controllers via a Control Interface for globally optimized configuration. An abstract view of the network is given to the Application Plane via a standardized interface. Network services request connectivity from the Network Controllers, after which the Network Elements are configured.

III. INTRODUCTION TO SOFTWARE-DEFINED NETWORKING

In this section we introduce Software-Defined Networking (section III-A), the OpenFlow protocol (section III-B), OpenFlow controllers (section III-C) and Open vSwitch (section III-D).

A. Abstracting the network

In the SDN philosophy, the network topology is configured based on requests from network services and applications. Services request connectivity to a network and, if the request can be fulfilled, paths through the topology are provided to the service for the requested amount of time. In figure 1 the SDN concept is presented.

The SDN concept speaks of three planes, which do not correspond directly with the OSI reference model. A short description of the planes is given below:

• Data Plane - The Data Plane is built up from Network Elements and provides connectivity. Network Elements consist of Ethernet switches, routers and firewalls, with the difference that the control logic does not make forwarding decisions autonomously on a local level. Configuration of the Network Elements is provided via the control interface with the Control Plane. To optimize network configuration, status updates from the elements are sent to a Network Controller;

• Control Plane - Network Controllers configure the Network Elements with forwarding rules based on the requested performance from the applications and the network security policy. The controllers contain the forwarding logic, but can be enhanced with additional routing logic. Combined with actual status information from the Data Plane, the Control Plane can compute optimized forwarding configurations. To the application layer, an abstract view of the network is shared via a general Application Programming Interface (API). This abstract view does not contain details on individual links between elements, but enough information for the applications to request and maintain connectivity;

• Application Plane - Applications request connectivity between two end-nodes, based on delay, throughput and availability descriptors received in the abstract view from the Control Plane. The advantage is the dynamic allocation of requests, as non-existing connectivity does not need processing at local switch level. Applications can also adapt their service quality based on received statistics, for example by throttling the bandwidth of video streaming applications under high network utilization.

By decoupling the control logic, the management of switches is simplified, as decisions to flood or forward data packets are no longer made locally. Header information from data packets at the switches must be transmitted to the central control logic for processing and configuration computations, which introduces an additional delay in packet forwarding. As seen in figure 2, the basic functionality of the SDN switch is similar to that of an Ethernet switch. Header information is matched to the configured SDN Forwarding Rules and packets are subsequently forwarded to the configured outgoing port(s). Unmatched header information is sent to the central control logic via the SDN control interface. Thus, for communication between the SDN switch and the centralized controllers an additional protocol is needed. This protocol must contain the functionality to configure forwarding rules and ports, as well as be able to collect and transmit switch status and statistics to the central control logic. OpenFlow is such a protocol.

Figure 2. SDN-enabled Ethernet switch - Incoming traffic is matched to SDN Forwarding Rules and on a positive match traffic is forwarded as normal. Information on unmatched packets is sent to the central control logic (Network Controllers), where new SDN forwarding rules are computed and configured at the SDN switches involved in transporting the data packets. The local control logic is enhanced with a configuration interface (SDN protocol) to communicate with Network Controllers.

B. OpenFlow protocol

Two examples of SDN protocols are OpenFlow [8] and ForCES [9]. More protocols exist; however, OpenFlow is the most popular. OpenFlow has attracted many researchers, organizations and foundations, so a wide collection of open-source software is available in the form of OpenFlow controllers (section III-C), as well as physical and virtual switch implementations (section III-D). The OpenFlow protocol describes and couples switching hardware to software configurations, such as incoming and outgoing ports, as well as an implementation of the SDN Forwarding Rules. In figure 3 a flow diagram is given of the matching process of an incoming packet in an OpenFlow-compliant switch enabled with protocol version 1.3. A detailed survey on the OpenFlow specification is given in [7].

Figure 3. Flow diagram of the OpenFlow protocol - An incoming packet is assigned Packet Metadata and is matched to Flow Rules in Flow Tables. Each Flow Rule contains instructions, which are added as Actions to the Packet Metadata. Instructions can include forwarding to other Flow Tables. If all Flow Tables are passed, the Actions in the Metadata are executed. Actions define the outcome for the packets, such as Output or Drop traffic. If a group of Flow Rules requires the same processing, Group Tables are applied. Group Tables also contain Actions. When a packet does not match, a Table Miss is initiated and the packet is either forwarded to the OpenFlow controller for further processing or dropped.

The SDN Forwarding Rules, called Flows in OpenFlow, are stored in one or more Flow Tables. For each incoming packet, a metadata set is created, containing an Action List, an Action Set or both. Actions are added to the Action List and Set for each Flow Table the packet traverses, where the Actions define the appropriate operations for the packet. Examples of Actions are: forward the packet to port X, drop the packet, go to Group Table A, or modify the packet header. The main difference between a List and a Set is the time of execution. Actions added to a List are executed directly after leaving the current Flow Table, whereas the Actions defined in the Set are accumulated and executed once all Flow Tables have been processed. Each Flow Table contains Flow Entries with six parameters [8]:

• Match - The criteria to which the packets are matched. Criteria include parameters of the datalink, network and transport layers contained in data packet headers and optionally metadata from previous tables. A selection of criteria is given in table I;

• Instructions - When a packet matches, instructions are added to the metadata set to direct the packet to another Flow Table or to add Actions to the Action List or Set;

• Priority - The packet header can match multiple Flow Entries, but the entry with the highest priority determines the operations;

• Counter - Every time a packet has matched and is processed by a Flow Entry, a counter is updated. Counter statistics can be used by the OpenFlow controller and Application Plane to determine network policies or for network monitoring [10];

• Hard Timeout - A Flow Entry is added by an OpenFlow controller, where the maximum amount of time this entry may exist in the Flow Table before expiring is defined by the Hard Timeout. The Hard Timeout can be used to limit network access for a certain node in the network and for automatic refreshing of the Flow Table to prevent large tables;

• Idle Timeout - The amount of time a Flow Entry is not matched is defined as the idle time. The Idle Timeout defines the maximum idle time and is mainly used for refreshing Flow Tables.

Table I. Selection of fields for Flow Rules to match incoming packets [8].

Match Field        Layer      Description
Ingress Port       Physical   Incoming ports and interfaces
Ethernet Address   Datalink   Source and destination MAC address
VLAN               Datalink   VLAN identity and priority
MPLS               Network    MPLS label and traffic class
IP                 Network    IPv4 / IPv6 addresses
Transport          Transport  TCP/UDP source and destination port

From OpenFlow protocol version 1.1 onwards, Group Tables have been defined. Group Tables allow more advanced configurations and consist of three parameters:

• Action Buckets - Each bucket is coupled to a switch port and contains a set of Actions to execute. The main difference with Instructions from the Flow Table is that Action Buckets can be coupled to counters and interface status flags. Based on the values of these parameters, a bucket is valid or not;

• Type - Defines the behavior and the number of Action Buckets in the Group Table. Multiple Action Buckets can be used for i) multicast and broadcast applications, where the incoming packet is copied over multiple Action Buckets (multiple ports), ii) load-sharing applications, where a selection mechanism selects the Action Bucket to execute, and iii) failover applications, where the first live Action Bucket among those available is selected for execution. Assigning a single Action Bucket to a Group Table is useful for defining Actions for a large number of Flow Entries with the same required forwarding policy;

• Counter - The number of times the Group Table has been addressed.

With the descriptions of the Flow and Group Tables we can follow the incoming packet from figure 3. At entry, a metadata set is assigned to the data packet and the packet header is matched to the Flow Entries in the first Flow Table. On a match, instructions are added to the Action Set and the packet can be processed further. When the instructions include forwarding to other Flow Tables, the packet with metadata is processed in a similar way and instructions are added to the Set. When no forwarding to other Flow Tables is instructed, the Action Set from the metadata is executed. Actions from the Set and / or Group Table determine the processing of the packet. In switching operation, the MAC address is matched in the first Flow Table and the Action Set defines how to forward the packet on a specified outgoing port. When none of the Flow Entries match, a Table Miss is initiated. Depending on the configuration by the OpenFlow controller, the packet is dropped or transmitted to the controller for a Flow Request. At the controller, new Flow Entries are computed and added to the Flow Tables of the involved switches.
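To make the matching pipeline described above more concrete, the following Python sketch models a strongly simplified version of it: Flow Entries with match criteria, priority and instructions, an accumulating Action Set, and a table miss that hands the packet to the controller. It is an illustrative model only; names such as FlowEntry and process_pipeline are ours and do not come from the OpenFlow specification or [8].

```python
# Minimal, illustrative model of the OpenFlow matching pipeline described above.
# All class and function names are our own; this is not the OpenFlow API.

class FlowEntry:
    def __init__(self, match, priority, actions, goto_table=None):
        self.match = match            # dict of header fields, e.g. {"eth_dst": "aa:bb:..."}
        self.priority = priority      # the highest-priority matching entry wins
        self.actions = actions        # actions added to the Action Set on a match
        self.goto_table = goto_table  # optional instruction: continue in another table

    def matches(self, packet):
        # A packet matches if every field of the match criteria equals the packet's value.
        return all(packet.get(field) == value for field, value in self.match.items())


def process_pipeline(packet, flow_tables, send_to_controller):
    """Walk the packet through the Flow Tables, accumulating an Action Set."""
    action_set = []
    table_id = 0
    while table_id is not None:
        table = flow_tables[table_id]
        candidates = [e for e in table if e.matches(packet)]
        if not candidates:
            # Table miss: depending on configuration, drop or ask the controller.
            return send_to_controller(packet)
        entry = max(candidates, key=lambda e: e.priority)
        action_set.extend(entry.actions)   # instructions add actions to the set
        table_id = entry.goto_table        # None ends the pipeline
    return action_set                      # executed once all tables are processed


# Example: a single table forwarding a known MAC address to port 2.
tables = [[FlowEntry({"eth_dst": "00:00:00:00:00:02"}, priority=10,
                     actions=[("output", 2)])]]
packet = {"eth_dst": "00:00:00:00:00:02", "eth_src": "00:00:00:00:00:01"}
print(process_pipeline(packet, tables, send_to_controller=lambda p: [("controller", p)]))
```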

C. OpenFlow Controller

OpenFlow controllers are developed in many variations and all share the same goal of controlling and configuring compliant switches. In [1] and [7], lists of OpenFlow-compliant hardware switches are given. The main differences are found in programming languages and support for OpenFlow specifications. Popular implementations, like NOX [11] and the Open vSwitch (OVS)-controller from Open vSwitch [12], use the C/C++ language, while POX [13] and Ryu [14] are Python-based controllers. Java-based controllers are found in FloodLight [15] and OpenDayLight [16]. Only Ryu, OpenDayLight and unofficial ported versions of NOX support OpenFlow protocol version 1.3 so far. For more advanced configuration purposes and the use of Group Tables, we advise the Ryu and OpenDayLight controllers. Both FloodLight and OpenDayLight offer web-browser-based configuration tools instead of command line interfaces and are therefore more user friendly. NOX, POX and Ryu share a similar structure, which is used to give an example of an OpenFlow controller in figure 4.

Figure 4. Generic description of an OpenFlow Controller - The controllers mentioned are built around a Core application, which acts as a backbone in the controller. To communicate with OpenFlow switches, a translation module translates the OpenFlow protocol parameters to the “language” used inside the controller. Additional modules can advertise themselves to the Core application to receive switch events. Based on the application, flow rules can be calculated or notifications are sent to the application plane.

The example controller is built around a Core application, which acts as a backbone in the controller. To communicate with the OpenFlow switch, a translation module is added to translate OpenFlow protocol messages to controller parameters. Other modules, like a layer-2 switch module (L2-switch) in Ryu, can advertise themselves to the Core application and register for specific switch parameters and events. The mentioned controllers supply the Core application with translation modules for OpenFlow protocol version 1.x. Depending on the requirements on the controllers, one can construct and program modules and advertise these to the controller. In figure 4 examples are given, such as a topology module for maintaining an up-to-date status of the network infrastructure. Also, modules for redundancy purposes can be added, to synchronize information with other controllers at the control plane.
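As a rough illustration of this core-and-modules structure, the sketch below shows a controller core to which modules advertise themselves in order to receive switch events, similar in spirit to how NOX, POX and Ryu register applications. It is a generic sketch of the pattern under our own naming, not the API of any of those controllers.

```python
# Generic sketch of the core/module pattern of figure 4; not the API of NOX, POX or Ryu.

class ControllerCore:
    """Backbone that dispatches translated switch events to registered modules."""

    def __init__(self):
        self.subscribers = {}  # event type -> list of handler callbacks

    def register(self, event_type, handler):
        # Modules "advertise" themselves by registering for specific switch events.
        self.subscribers.setdefault(event_type, []).append(handler)

    def dispatch(self, event_type, event):
        # The OpenFlow translation module feeds decoded protocol messages in here.
        for handler in self.subscribers.get(event_type, []):
            handler(event)


class L2SwitchModule:
    """Toy layer-2 learning module, comparable in role to the L2-switch module in Ryu."""

    def __init__(self, core):
        self.mac_to_port = {}
        core.register("packet_in", self.on_packet_in)

    def on_packet_in(self, event):
        # Learn the source MAC, then decide to forward or flood.
        self.mac_to_port[event["eth_src"]] = event["in_port"]
        out_port = self.mac_to_port.get(event["eth_dst"], "FLOOD")
        print(f"install flow: dst={event['eth_dst']} -> port {out_port}")


core = ControllerCore()
L2SwitchModule(core)
core.dispatch("packet_in", {"eth_src": "aa", "eth_dst": "bb", "in_port": 1})
```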

D. Open vSwitch

Although Open vSwitch (OVS) is not specifically designed to enable the SDN philosophy, it is widely used by researchers and organizations to test OpenFlow implementations and benefit from flexible SDN configurations. OVS can be configured to turn regular servers and computers with multiple physical network interfaces into a virtual OpenFlow switch, as shown in figure 5. Many Linux distributions, such as the Ubuntu OS, support OVS installation from their repositories.

Figure 5. Description of Open vSwitch in relation to OpenFlow - With Open vSwitch, physical network interfaces from servers and computers can be assigned to an OpenFlow switch. To connect the virtual OpenFlow switch to a remote controller via an IP connection, a management interface (OF Man) with an IP address is assigned. The other option to configure the OpenFlow module is via a command line interface. Other physical interfaces can be assigned directly to an OpenFlow port, where it is also possible to bind multiple physical interfaces to a single OpenFlow port. With other Open vSwitch plug-ins it is possible to control the OpenFlow module.

Depending on the required configuration, OVS can be configured as a layer-2 switch (controlled by a local OVS-controller) or as a generic OpenFlow switch. Configuration of the OpenFlow module can be supplied by an external OpenFlow controller, or Flow Rules can be supplied manually via the command line interface. External plug-ins can also configure the OpenFlow module of OVS. An example of such a plug-in is the Quantum plug-in from OpenStack [17].
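As an indication of how such a setup is typically configured, the sketch below drives the standard ovs-vsctl and ovs-ofctl command-line tools from Python to create a bridge, attach physical interfaces and point the bridge at a remote OpenFlow controller. The bridge and interface names, the controller address and the example flow are placeholders and should be adapted to the local installation.

```python
# Sketch of configuring Open vSwitch as an OpenFlow switch via its CLI tools.
# Interface names, bridge name and controller address are placeholders.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Create a bridge and attach two physical interfaces as OpenFlow ports.
run(["ovs-vsctl", "add-br", "br0"])
run(["ovs-vsctl", "add-port", "br0", "eth1"])
run(["ovs-vsctl", "add-port", "br0", "eth2"])

# Point the OpenFlow module of the bridge at a remote controller (OF Man in figure 5).
run(["ovs-vsctl", "set-controller", "br0", "tcp:192.0.2.10:6633"])

# Alternatively, flows can be supplied "by hand" without a controller:
run(["ovs-ofctl", "add-flow", "br0", "in_port=1,actions=output:2"])
```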

IV. GRAPHICAL SDN FRAMEWORK

To differentiate and compare the existing SDN solutions, we have developed a graphical SDN framework. Within the graphical framework multiple layers are defined, which indicate the hierarchy level of components within SDN networks (see figure 6).

For our graphical framework we define that controllers perform the computations and tasks to control traffic, and additional layers can be added for administrative and synchronizing purposes. In the following we explain the layers of figure 6.

• Level-0 - Switch Layer - The lowest layer identified in the OpenFlow structure is the switch layer, with the main purpose to deliver data plane functionality. Data plane functions are performed at the Switch / Open vSwitch sub-layer, while two additional sub-layers, the Additional Switch Layer and the Local OpenFlow Controller, add functionality to perform minor control plane tasks;

• Level-0.5 - Virtualization Layer - On top of the switch layer, the Virtualization Layer can be placed, with the main function to divide and share the switch resources over multiple OpenFlow controllers. It enables multiple virtual network topologies on top of a single physical infrastructure. Resources of physical switches are virtualized by this layer and presented to the Control Layer as multiple virtual switches;

• Level-1 - Control Layer - The functionality of the control layer is to perform the tasks of the SDN control plane in a defined area of the network topology for a number of switches. Decisions made at this layer influence only a part of the network and are locally optimal. In regular OpenFlow configurations, only a single Area OpenFlow Controller is present. Solutions have been proposed to enhance the control layer with additional area OpenFlow layers to extend functionality, such as synchronization of Flow Rules with other controllers;

• Level-2 - Global Layer - The top layer has the functionality to control the network topology at a global level, where forwarding and routing decisions influence the whole topology. A Global OpenFlow Controller can thus compute globally optimal routes through the network, as it controls all switches. The structure of the global layer is similar to the control layer.

Figure 6. Graphical Framework for OpenFlow differentiation and comparison - On the left, the numerical UML relationship between the components and layers of an OpenFlow network topology is visible. On the right, a physical decomposition of the same configuration is given to clarify the UML relationship.

To indicate the numerical relationship between the layers, the UML standard, as used in object-oriented programming, is used as guidance in the framework of figure 6. The relationship states that the components (sub-layers) at each level share a one-to-one relationship (1..1) with each other. From the switch level, a many-to-many or many-to-few relationship exists with the virtualization layer (N..V) or, when no virtualization is applied, with the control layer (N..C). In the case of virtualization, a many-to-few or many-to-many relation (P..C) indicates that P virtual switches are controlled by C OpenFlow controllers. Within a domain, multiple area controllers can be controlled by a single centralized controller with the global view of the network. In case of an inter-domain network infrastructure, global layers can be interconnected.

Table II. Explanation of the developed OpenFlow notation standard.

Symbol  Description                           Relation
N       Number of switches                    ∈ (1, 2, .., N)
V       Number of virtual controllers         ∈ (1, 2, .., V), V ≤ N
P       Number of virtual switches            ∈ (1, 2, .., P)
C       Number of Area OpenFlow Controllers   ∈ (1, 2, .., C), C ≤ N, C ≤ V or C ≤ P
G       Global controller enabled             ∈ (0, 1)
X+s     Layer enhanced for security
X+p     Layer enhanced for scalability
X+r     Layer enhanced for resiliency
X+b     Backup component available

In order to differentiate multiple network topologies using the UML relationships, we use the following notation. For a network topology T the notation is given as T (N/V − P/C/G), where the description of the used symbols is given in table II. The notation with the defined symbols of table II would not cover the entire framework, as additional sub-layers or applications are not indicated. Therefore an extra indicator is added to the OpenFlow notation, to indicate whether an enhancement is added and for which enhancement area (security, scalability and/or resiliency) it is added. Besides additional components to the layers, it is possible from OpenFlow protocol version 1.2 onwards to add redundant controllers to the control plane. Therefore the backup indicator is defined. When an enhancement overlaps multiple areas or when a component is applied redundantly, it is possible to combine indicators. For example, the controller C+sb indicates a security-enhanced controller that is redundantly applied.
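To make the notation concrete, the short helper below (our own construction, not part of any cited work) assembles a T (N / V − P / C / G) string from the framework parameters of table II, including the optional enhancement suffixes.

```python
# Helper (our own, for illustration) that builds the T (N / V - P / C / G) notation
# of table II, with optional enhancement suffixes such as +p, +r, +s and +b.

def sdn_notation(n, c, g, v=None, p=None,
                 switch_enh="", control_enh="", global_enh=""):
    switch_part = f"{n}{switch_enh}"
    virt_part = f"{v} - {p}" if v is not None and p is not None else "-"
    control_part = f"{c}{control_enh}"
    global_part = f"{g}{global_enh}"
    return f"T ({switch_part} / {virt_part} / {control_part} / {global_part})"


# Reference configuration of figure 7: N switches, one plain controller, no global layer.
print(sdn_notation(n="N", c="1", g="0"))                     # T (N / - / 1 / 0)
# HyperFlow (section V-A): scalability-enhanced area controllers plus a global layer.
print(sdn_notation(n="N", c="C", g="1", control_enh="+p"))   # T (N / - / C+p / 1)
# FlowVisor (section V-E): one virtualization instance exposing P virtual switches.
print(sdn_notation(n="N", c="C", g="0", v="1", p="P"))       # T (N / 1 - P / C / 0)
```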

Before multiple OpenFlow enhancement proposals are discussed, a reference OpenFlow network configuration is given in figure 7. The reference configuration is indicated by T (N/ − /C/0), where the infrastructure is built up from N switches controlled by C general OpenFlow controllers.

The flow of actions in figure 7 is as follows. Network traffic, in the form of data packets, arrives at the switch data plane, where the packet headers are matched to a set of flow rules stored in the flow tables. When no match is found, the OpenFlow module of the switch will initiate a flow request to the assigned controller. The controller will process the event and install the needed flow rules into the switches along the designated route through the network, based on the policy and rules defined in the forwarding and routing applications of the controller.

Figure 7. Reference OpenFlow configuration - Arriving data packets are processed and, on a Table Miss, a Flow Request is sent to a “standard” OpenFlow controller, where network routing applications determine a new Flow Rule based on the set Policy and Rules. The new Flow Rule is installed by the controller in the assigned switches and network traffic can traverse the network.

V. SCALABILITY IN SDN

By introducing a centralized authority to control the network traffic over a large network, the performance of the controller may not scale with the growth of network traffic [4]. Multiple proposals have been introduced [18], [19], [20], [21], [22] to create a distributed centralized authority, to address the concerns on scalability and performance. In this section, these proposals will be discussed and projected onto our graphical framework.

A. HyperFlow

The proposed solution by Tootoonchian et al. [19] to solve the scalability problem in SDN and OpenFlow networks is to deploy multiple OpenFlow controllers in the network infrastructure and implement a distributed event-based control plane, resulting in a T (N/ − /C+p/1) structure. The improvement in scalability and performance is found in the placement of multiple controllers close to the switches, reducing flow setup times, while each controller is provided with the global view of the network. Another advantage of this approach is the resiliency against network partitioning (disconnected network) in case of network failures, as the views of the connected switches and controllers are synchronized.

HyperFlow, see figure 8, is built up from three layers, where no virtualization is applied. The architecture at the switch level (data plane) and the OpenFlow specification are unmodified, which makes the HyperFlow concept applicable to current SDN-capable networks. At the control layer the HyperFlow application is connected to a standard NOX controller. To enable communication with the global layer, a publish-subscribe system is placed at the additional control layer to locally store the network status and view. The HyperFlow distributed global layer consists of three parts: the data channel, the control channel and the distribution of functionality among the controllers. On the data channel, network events are distributed by the HyperFlow application. These events are stored in the publish-subscribe system, which synchronizes the global view of the controllers, but can operate independently in case of partitioning. The control channel is used for monitoring the status of the controllers. The distribution of functionality among the controllers is realized as follows. Each controller has a global view of the network and its status, and has the capability to install flows into all switches. All required flows for a particular data stream are published to all distributed controllers to synchronize the network state. The HyperFlow application on each controller filters the requested flows for the switches assigned to it and installs the flows accordingly.

Figure 8. HyperFlow implementation - An additional global layer and extra OpenFlow applications at the control layer provide area controllers with additional information, to increase decision-making capabilities at the global level for optimal data packet forwarding.

In [19] no extensive measurements are performed on different network topologies and traffic loads, so no conclusion can be drawn on how well the HyperFlow approach improves scalability in real-life SDN networks. Besides that, there are some limitations in the HyperFlow design. The first limitation was found by the authors themselves in the performance of the publish-subscribe system (WheelFS). All network events, flow installs and status information need to be synchronized between the multiple controllers, which requires a fast distributed storage system. In HyperFlow the performance of the publish-subscribe system was limited to approximately 1000 events per second. This does not indicate that the controllers are limited in processing, but the global view of the network controllers may not quickly converge. A second limitation is the lack of a management application in the global layer. In [19] no distinction is made between the functionality of the switches and controllers. This assumes that a global policy for Flow Rule installations must be configured in all assigned controllers. The last limitation is found in the performance of HyperFlow, where network traffic may be dropped when the offered load exceeds a controller's capacity. Although the load on controllers can be reduced by assigning fewer switches to a controller, a single switch processing many flows may still overload its controller. We think that forwarding Flow Requests to neighboring controllers for processing, or adding smart applications at the switch layer to off-load controllers, could be alternative solutions.
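The synchronization principle behind HyperFlow can be illustrated with a small publish-subscribe sketch: each controller publishes its local network events to a shared channel and replays events published by the others, so all controllers converge to the same network view. This is a simplified model under our own naming; the actual implementation in [19] uses WheelFS as the publish-subscribe medium.

```python
# Simplified model of HyperFlow-style event distribution; our own naming, not the
# implementation of [19], which uses WheelFS as the publish-subscribe medium.

class EventChannel:
    """Shared data channel: every published event reaches all other subscribed controllers."""

    def __init__(self):
        self.controllers = []

    def subscribe(self, controller):
        self.controllers.append(controller)

    def publish(self, source, event):
        for controller in self.controllers:
            if controller is not source:
                controller.replay(event)


class AreaController:
    def __init__(self, name, channel):
        self.name = name
        self.network_view = []   # locally stored copy of the global view
        self.channel = channel
        channel.subscribe(self)

    def local_event(self, event):
        # A local event (e.g. link up/down, new flow) updates the local view and is
        # propagated so that all controllers keep an identical view of the network.
        self.network_view.append(event)
        self.channel.publish(self, event)

    def replay(self, event):
        self.network_view.append(event)


channel = EventChannel()
c1, c2 = AreaController("c1", channel), AreaController("c2", channel)
c1.local_event(("link_down", "s3", "s4"))
print(c2.network_view)   # c2 learns about the event without contacting c1 directly
```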

Figure 9. ONIX distributed control platform at area level - ONIX adds a distributed control platform between the data and control plane to reduce the workload on flow controllers. Switch management is performed by multiple ONIX instances which are synchronized via two databases. Forwarding decisions are made at area level by a general SDN flow controller.

B. ONIX

ONIX [21] is not built on top of an OpenFlow controller, but can be classified as a “General SDN Controller.” The approach the developers took to abstract the network view and overcome scalability problems was to add an additional distributed control platform. By adding an additional platform, the management and network control functions of the control plane are separated from each other. The management functions, used for link status and topology monitoring, are performed by the ONIX control platform and reduce the workload for a general flow controller. To partition the workload at the control platform, multiple ONIX instances can be installed to manage a subset of switches in the network infrastructure. The available information is synchronized between the ONIX instances, forming a distributed layer as seen with HyperFlow in section V-A, but now at the controller layer. A network controller assigned to a part of the network can extract information from the distributed layer to calculate area flows. In order to calculate global flows, ONIX has the capability to aggregate information from multiple area platforms into a global distributed platform. The aggregation of information and the calculation of globally optimal forwarding rules is similar to the routing of internet traffic over multiple autonomous systems (ASes) [23], where the route between ASes is globally determined, but the optimal route inside the autonomous system is calculated locally. With this general introduction, we can conclude that ONIX can be classified as a T (N/−/C+p/1+p) SDN topology¹. With the use of figure 9 more insight is given in the design and functioning of ONIX at area level, while figure 10 shows how global flows are calculated using the distributed control platform.

¹In this overview the network scope is limited to a single domain, but the capabilities of ONIX can reach beyond that scope as it is designed for large-scale production networks.

Figure 10. ONIX distributed control platform at global level - Aggregated information from multiple area controllers is combined at the ONIX global layer, from where the global controller can determine optimal routing and request the level-1 controllers for paths within the assigned area.

At the switch layer no modifications are required for ONIX; two channels connect the switch with the Import-Export module at the distributed platform. The ONIX platform is an application that runs on dedicated servers and requires no specific hardware to function. Depending on the load and traffic intensity, multiple switches are assigned to an ONIX instance. Multiple ONIX instances combined form the distributed control platform. The first channel connects to the configuration database and is used for managing and accessing general switch configuration and status information. To manage the forwarding, flow tables and switch port status, the second channel is connected to the OpenFlow module of the switch. The information and configuration status collected by the Import-Export module represent the network state of the connected switches and are stored in the Network Information Base (NIB) as a network graph. The NIB uses two data stores: an SQL database for slowly changing topology information and a Distributed Hash Table (DHT) for rapid and frequent status changes (such as link utilization and round trip times). The application of two data stores overcomes the performance issues faced by HyperFlow. From the NIB, a general flow controller or control logic can receive flow requests, switch status and network topology, and compute paths. In comparison to the reference OpenFlow controller (figure 7), the ONIX platform has switch and link status information available. Computation of paths is thus not limited to the information provided by the OpenFlow protocol. The last step in the flow setup process is to install the computed flows in the switch forwarding table via the ONIX distributed platform.
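The split between a durable store for slowly changing topology data and a DHT-like store for volatile statistics can be sketched as follows. The class below is our own simplified model of the NIB idea and not the ONIX API; for the durable part it uses SQLite from Python's standard library as a stand-in for a replicated SQL database.

```python
# Simplified model of the ONIX Network Information Base (NIB) idea: a durable store for
# slowly changing topology state and an in-memory, DHT-like store for volatile statistics.
# Our own construction; not the ONIX API described in [21].
import sqlite3

class NetworkInformationBase:
    def __init__(self):
        # Durable store for slowly changing data (here: SQLite instead of a replicated SQL DB).
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE links (src TEXT, dst TEXT, capacity INTEGER)")
        # Volatile store for rapidly changing data (stand-in for the distributed hash table).
        self.dht = {}

    def add_link(self, src, dst, capacity):
        self.db.execute("INSERT INTO links VALUES (?, ?, ?)", (src, dst, capacity))
        self.db.commit()

    def update_stats(self, link, utilization, rtt_ms):
        self.dht[link] = {"utilization": utilization, "rtt_ms": rtt_ms}

    def topology(self):
        return self.db.execute("SELECT src, dst, capacity FROM links").fetchall()


nib = NetworkInformationBase()
nib.add_link("s1", "s2", capacity=10_000)     # slowly changing topology information
nib.update_stats(("s1", "s2"), 0.63, 1.8)     # rapidly changing status information
print(nib.topology(), nib.dht)
```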

In [21] evaluation results are shown of measurements on the ONIX distributed layer to determine the performance of a single ONIX instance, multiple instances, replication times between the data stores and recovery times on switch failures. Unfortunately, no information is given about the control logic used and the performance gain in comparison with regular OpenFlow controllers. The advantage of the ONIX distributed control platform is the partitioning of the workload over multiple instances. So, if an ONIX instance is limiting traffic throughput (dropping flow requests) due to a high workload, assigned switches can be reassigned to other ONIX instances.

Figure 11. DevoFlow implementation - Scalability enhancements made at the switch layer are shown as an additional layer, but are implemented as modifications to the soft- and firmware in the switch ASIC / CPU. Routing decisions can be made at the switch layer using traffic parameters as input to reduce the workload at the OpenFlow controller. Elephant flows invoke the use of the area controller for optimal routing, while micro flows are routed using cloned flow rules.

C. DevoFlow

The approach of DevoFlow [18] is guided by two observations. First, the amount of traffic between the data and control planes needs to be reduced, because the current hardware OpenFlow switches are not optimized for inter-plane communication. Second, the high number of flow requests must be limited, because the processing power of a single OpenFlow controller may not scale with network traffic. To limit the traffic and flow requests to the control plane, traffic may be categorized into micro and elephant flows. In DevoFlow, micro flows are processed at the switch layer, without the need for the control layer. The DevoFlow philosophy is that only heavy traffic users, elephant flows, need flow management. By limiting the management to elephant flows, only one controller is needed in the network topology, shifting the routing complexity to the switch layer. With this information we classified the DevoFlow solution as a T (N+p/ − /1/0) SDN topology.

In figure 11, the DevoFlow solution is drawn as an additional layer on top of the physical switch to simplify the representation, but the actual implementation is performed in the soft- and firmware of the switch. This indicates that modifications to the standard switch are required. To ease integration of DevoFlow with other SDN concepts, no modifications are made to the OpenFlow protocol and controller. The destination of arriving packets is compared to the flow rules installed in the switch table. When no match is found, a flow request must be initiated by the “Traffic Measurement” module in the switch. The traffic measurement module monitors the packets in the data flows and computes statistics. At the start of a flow, each flow is marked as a micro flow and an existing flow rule from the forwarding table is cloned and modified by the “Local Flow Scheduler.” The modification to the flow rules allows multipath routing and re-routing. In case multiple paths between the switch and destination exist, the local flow scheduler can select one of the possible ports and the micro-flow rule for that port is cloned. Re-routing is applied when one of the available switch ports is down and traffic needs alternative paths through the network.

To detect and route elephant flows through the network, the area scheduler can use four different schemes:

• Wild-card routing - The switch pushes the available traffic statistics to the controller at a specified time interval. The scheduler pro-actively calculates unique spanning trees for all destinations in the network topology using the least-congested routes and installs the trees as flows in the switch flow tables. So for each destination in the network a flow is present in the switches and no flow requests are needed from the switch to the OpenFlow controller;

• Pull-based statistics - The scheduler regularly pulls the traffic statistics from the switch and determines if elephant flows are present in the current data flows. Once an elephant flow is detected, the scheduler determines the least-congested path for this flow and installs the required flow rules at the switches;

• Sampled statistics - This method is very similar to the pull-based scheme, but instead of pulling traffic statistics every time period, the switch samples traffic statistics into a bundle and pushes the bundle to the scheduler. At the scheduler, it is again determined whether any elephant flows are present and on positive identification flows are installed, as described in the pull-based scheme;

• Threshold - For each flow at the switch, the amount of transferred data is monitored. Once a flow exceeds a specified threshold, a trigger is sent to the scheduler and the least-congested path is installed into the switches.

All schemes are based on traffic statistics, where flows are only installed if identified as elephant flows in a reactive manner. In [18] multiple simulations have been performed on a large data network simulator to capture the behavior of flows through the network and measure data throughput, control traffic and the size of the flow tables in the switches. The results show that the pull-based scheme with a short update interval maximizes the data throughput in the network. This performance comes at a price, as much traffic is generated between the switch and controller and the size of the flow table is significantly larger than with the other schemes. The threshold scheme is identified as most optimal, as the data throughput is high, less traffic is required between the switch and the controller and the size of the flow table is minimal. Another advantage is the reduced workload on the scheduler in the controller, as no traffic statistics have to be monitored and processed.
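A minimal sketch of the threshold scheme, under our own naming and with an example threshold value, could look as follows: the switch counts bytes per flow locally and only signals the area scheduler once a flow crosses the configured threshold, so micro flows never reach the controller.

```python
# Sketch of DevoFlow's threshold scheme (our own simplified model): byte counters are
# kept at the switch and the controller is only triggered for elephant flows.

ELEPHANT_THRESHOLD_BYTES = 128 * 1024   # example threshold; not a value from [18]

class SwitchFlowMonitor:
    def __init__(self, notify_scheduler):
        self.byte_counts = {}
        self.reported = set()
        self.notify_scheduler = notify_scheduler   # callback to the area scheduler

    def packet_forwarded(self, flow_id, size):
        # Called locally for every forwarded packet; no controller involvement yet.
        self.byte_counts[flow_id] = self.byte_counts.get(flow_id, 0) + size
        if (self.byte_counts[flow_id] > ELEPHANT_THRESHOLD_BYTES
                and flow_id not in self.reported):
            self.reported.add(flow_id)
            # Trigger: the scheduler computes a least-congested path for this flow only.
            self.notify_scheduler(flow_id, self.byte_counts[flow_id])


monitor = SwitchFlowMonitor(
    notify_scheduler=lambda flow, count: print(f"elephant flow {flow}: {count} bytes"))
for _ in range(200):   # 200 packets of 1000 bytes crosses the example threshold
    monitor.packet_forwarded(("10.0.0.1", "10.0.0.2", 80), 1000)
```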

Figure 12. Kandoo implementation - A standard switch sends flow requests and statistics to the local controllers. The non-global applications process local OpenFlow events and micro flows, and the “App Detect” application detects elephant flows using a threshold. Elephant flows are processed by the root / global controller, after which the flow rules are sent to the local controller to install at the switches.

D. Kandoo

In [20] Kandoo is presented with similar goals and philosophy as DevoFlow, namely to limit the overhead of events between the data and control planes, but solves the problem by applying more controllers in a network topology. Kandoo differentiates two layers of controllers, namely local controllers, which are located close to the switches for local event processing, and a root controller for network-wide routing solutions. In practice this means that local controllers will process the micro flows and a trigger from the local controller must inform the root controller about the presence of an elephant flow. The number of local controllers depends on the amount of traffic (local events) and the workload on the controller, which is somewhat similar to the approach of ONIX and its distributed control platform. In an extreme case, every switch is assigned to one local controller. Translating Kandoo to the graphical framework, the root controller is located at the global layer and the local controllers can be found on the area or switch layer. If the extreme case is valid, the local controller can be seen as an extension of the switch layer and the topology can be classified as T (N+p/−/1/0). In a regular architecture, where more than one switch is assigned to a local controller, a T (N/−/C+p/1) SDN topology is found. Figure 12 projects the regular case of Kandoo.

Kandoo leaves the software in the switch unmodified and shifts the processing of events to local controllers. The local controllers are standard OpenFlow controllers extended with a Kandoo module. This approach keeps the communication between the switch and controller standardized and gives the possibility to utilize standard OpenFlow applications. The Kandoo module intercepts the OpenFlow traffic and monitors it for elephant flows using the “App Detect” application. As long as no elephant flow is detected, the local controller processes the flow requests as micro flows. Elephant flows are detected using the threshold scheme from DevoFlow; on positive detection, an event is relayed to the Kandoo module of the root controller.

To propagate events from the local controllers to the root controller, a messaging channel is used. The root controller must subscribe to this channel in order to receive events. After receiving an elephant flow detection trigger, the “App Re-Route” determines an optimal route and requests the local controllers to install this route. The developers of [20] have not given any information on how the re-routing process is executed and which routing schemes are used.
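The division of work between local and root controllers can be sketched as below: local controllers handle flow requests themselves and only publish an elephant-flow event on a messaging channel to which the root controller subscribes. The threshold value, class names and re-routing stub are ours; [20] does not specify the re-routing logic.

```python
# Sketch of Kandoo's two-level control (our own simplified model): local controllers
# absorb micro-flow events and only relay elephant-flow events to the root controller.

ELEPHANT_THRESHOLD_BYTES = 1_000_000   # example threshold; not specified in [20]

class MessagingChannel:
    def __init__(self):
        self.handlers = []

    def subscribe(self, handler):          # the root controller subscribes here
        self.handlers.append(handler)

    def publish(self, event):
        for handler in self.handlers:
            handler(event)


class LocalController:
    """Processes micro flows locally ("App Detect" plus local flow setup)."""

    def __init__(self, name, channel):
        self.name, self.channel = name, channel

    def flow_request(self, flow_id, byte_count):
        if byte_count > ELEPHANT_THRESHOLD_BYTES:
            self.channel.publish({"type": "elephant", "flow": flow_id, "from": self.name})
        else:
            print(f"{self.name}: install local micro-flow rule for {flow_id}")


class RootController:
    """Handles network-wide decisions ("App Re-Route"); the re-routing logic is a stub."""

    def __init__(self, channel):
        channel.subscribe(self.on_event)

    def on_event(self, event):
        print(f"root: compute global path for elephant flow {event['flow']} "
              f"reported by {event['from']}")


channel = MessagingChannel()
RootController(channel)
local = LocalController("local-1", channel)
local.flow_request(("h1", "h2"), byte_count=10_000)        # handled locally
local.flow_request(("h3", "h4"), byte_count=5_000_000)     # relayed to the root controller
```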

Some measurements have been performed on a (small) tree topology with the Kandoo framework installed. Results show comparisons between the topology in a standard OpenFlow and in a Kandoo configuration. As expected, fewer events are processed by the root controller, but no information is given about the workload and performance of the local controllers. Overall, we can state that the simulations and measurements are too limited to give a good indication of the performance enhancement provided by Kandoo. The limiting factor in the Kandoo configuration is the interface between the switch and the local controller, as DevoFlow has shown that this interface is the limiting factor in current SDN network implementations. If the limit on this interface is not reached, the layered controller solution is an interesting concept which can also be useful to tackle security and resiliency problems.

E. FlowVisor

Virtualization is a widely applied technique allowing multiple instances on the same hardware resources. Hardware resources are abstracted by an abstraction layer and presented to a virtualization layer. On top of the virtualization layer, instances are presented with virtualized hardware resources that they can control as if no virtualization were present. This approach is roughly similar to the SDN philosophy, but in FlowVisor by Sherwood et al. [22] the virtualization approach is reapplied: OpenFlow-compliant switches are offered to the FlowVisor abstraction layer. FlowVisor offers slices of the network topology to multiple OpenFlow guest controllers, where the slices are presented as virtual OpenFlow switches. The guest controllers control the slices, and FlowVisor translates the configurations and network policies from each slice to Flow Rules on the physical OpenFlow switches. If this approach is applied to large networks, scalability problems can be resolved as control of the OpenFlow switches is divided over multiple controllers. To distinguish network slices, four dimensions are defined in FlowVisor (a minimal sketch of the separation check follows the list):

• Slice - Set of Flow Rules on a selection of switches of the network topology to route traffic;

• Separation - FlowVisor must ensure that guest controllers can only control and observe the assigned part of the topology;

• Sharing - Bandwidth available on the topology must be shared over the slices, where minimum data rates can be assigned for each slice;

• Partitioning - FlowVisor must partition the Flow Tables of the hardware OpenFlow switches and keep track of the flows of each guest controller.
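Combining these dimensions, each slice can be described by a small policy object that FlowVisor consults before translating a guest controller's rule. The sketch below illustrates the separation and sharing dimensions with an invented Slice class and field names; it is not FlowVisor's actual data model.

# Sketch of the "Separation" check performed before a guest controller's rule
# is translated to the physical switches. The Slice class and its fields are
# illustrative assumptions, not FlowVisor's API.

from dataclasses import dataclass

@dataclass
class Slice:
    name: str
    switches: set            # datapath IDs the guest may program
    vlans: set               # flowspace: VLAN IDs owned by this slice
    min_rate_mbps: int = 0   # "Sharing": guaranteed bandwidth for the slice

def rule_allowed(slice_: Slice, dpid: int, match: dict) -> bool:
    """Reject rules that touch switches or flowspace outside the slice."""
    if dpid not in slice_.switches:
        return False
    return match.get("vlan_id") in slice_.vlans

research_slice = Slice("research", switches={1, 2}, vlans={100, 101}, min_rate_mbps=50)
print(rule_allowed(research_slice, dpid=1, match={"vlan_id": 100}))  # True
print(rule_allowed(research_slice, dpid=3, match={"vlan_id": 100}))  # False: switch not in slice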


Figure 13. Network virtualization on an OpenFlow network topology - Regular OpenFlow traffic (Flow Requests, status and traffic statistics) from the OpenFlow module is sent to the FlowVisor Forwarding Module. Depending on the slice configuration and slice policy, OpenFlow traffic is forwarded to the Translation Module of the "virtual" switches. Guest controllers (which can be any OpenFlow controller) communicate with the translation modules, where Flow Rules are translated, forwarded and installed if they do not interfere with the network policy for that slice.

With these four dimensions and definitions, OpenFlow resources can be virtualized and shared over multiple instances. This means that a single hardware OpenFlow switch can be controlled by multiple guest controllers. To provide more details on the functioning of network virtualization in OpenFlow, FlowVisor is represented in figure 13.

As illustrated in figure 13, FlowVisor acts as a transparent layer between hardware switches and the controllers. Hardware OpenFlow switches are assigned to the virtualization layer, where FlowVisor advertises itself as a controller. OpenFlow traffic is transmitted to FlowVisor at the virtualization layer, where the network capacity is sliced and divided over multiple users. The FlowVisor Forwarding Module checks for policy violations before traffic is sent to the Translation Module of the "virtual" OpenFlow switch. At the translation module, the traffic is translated and sent to the guest controllers assigned to the corresponding slices. Flows coming from the guest controllers are checked for interference with the slice policy and translated to flows for the hardware switches. Via the forwarding module the guest flows are installed at the switches. The projection shows a straightforward example of a T(N/1−P/C/0) configuration, where one FlowVisor instance performs all virtualization tasks, but more complex configurations are possible.

Adding an additional layer between the switch and control layer creates unwanted overhead. Experiments show that response times for the processing of Flow Requests increased from 12 ms to 16 ms, meaning that FlowVisor accounts for an additional delay of 4 ms in comparison to non-virtualized OpenFlow topologies [22]. Besides the delay measurements, experiments were performed to test bandwidth sharing between slices and CPU utilization on the hardware OpenFlow switches. Network policies must include minimal bandwidth guarantees (QoS parameters) for each user, to prevent unfair use of the network.

Table III
COMPARISON OF SCALABILITY SOLUTIONS AND COMPONENTS.

Solution    Standardized  Complexity  Decision     Classification
HyperFlow   +/−           +           Global       X
ONIX        −             +           Global       X
DevoFlow    +/−           +/−         Semi-Global  V
Kandoo      +             −           Semi-Global  V
FlowVisor   +             +/−         Semi-Global  X

Solution    Availability  Performance  Reliability
WheelFS     +/−           −            +
DHT         +             +            −
SQL         +             −            +

An aspect not covered in FlowVisor is security. Additional mechanisms for the classification of traffic and the checking of Flow Rules may be required to ensure full separation and isolation of network traffic between the slices.

F. Conclusion on scalability

Five different concepts for scalability have been reviewed. All solutions propose to divide the workload over multiple instances. However, it remains difficult to come up with an optimal scalability solution for SDN networks. As can be seen from table III, trade-offs have to be made by network designers and managers. Table III gives an overview of the proposed frameworks and their components. Besides observations from the review, a column is defined for standardization, which indicates the availability of the components used in the framework. Unfortunately, only FlowVisor is available as open-source software, but on a conceptual level standardized components can be used to reproduce the other proposed frameworks. A high standardization score in the table indicates that the solution is built up from standard, available OpenFlow components. A part of the table is dedicated to the data storage solutions used in ONIX and HyperFlow and is useful for future distributed-layer developments and comparisons.

VI. RESILIENCY IN SDN

In regular networks, when the control logic of a switch fails, only network traffic over that particular switch is affected. When failover paths are preprogrammed into neighboring switches, backup paths are available and can be activated upon failure detection. If the control logic in an SDN-enabled network fails, the forwarding and routing capabilities of the network are down, resulting in dropped Flow Requests, undelivered data packets and an unreliable network. This problem was identified at an early stage of the development of the OpenFlow protocol, and from protocol version 1.2 onwards a master-slave configuration at the control layer can be applied to increase the network resiliency against failing OpenFlow controllers.

We define the robustness of a topology as the measure of connectivity in a network after removing links and switches.


Figure 14. Replication component in a standard NOX controller - A standard NOX controller is enhanced to replicate flow installs over multiple slave controllers using OpenFlow protocol 1.2 (and higher) and the CPRecovery module.

Its resiliency indicates its ability to re-allocate redundant paths within a specified time window. On the controller side, the robustness of a single OpenFlow controller or group of controllers is defined as the resistance of the controller(s) before entering failure states, while the resilience is defined as the ability to recover control logic after a failure. As described, the definitions of robustness and resiliency can have different meanings, depending on the viewpoint of the designer. In this overview, examples and proposals of both viewpoints are discussed.

Before reviewing proposed solutions on resiliency and robustness, we briefly look back at section V, as all those solutions offer the ability to increase the robustness of an SDN network. By partitioning the workload of a single controller over multiple instances, the robustness of the network is increased: the failure of a single controller will only result in an uncontrollable part of the network. To recover from a failure and increase the resiliency, additional logic is required. For both viewpoints, timely detection of a failure and a fast failover process are basic requirements. To create a robust and resilient network, the network topology must include redundant paths [24]. For a resilient control layer, the network state must be synchronized and identical between master and slave controllers. Additional modules and synchronization schemes must meet these requirements without compromising performance or adding unwanted latency.

This section reviews resiliency from three different aspects important to the network: (1) section VI-A focuses on resiliency at the control layer; (2) sections VI-B to VI-E give more insight into topology failure recovery and protection schemes, while section VI-F discusses the more special case of in-band networks, where the control and forwarding planes share the same transport layer; (3) finally, section VI-G discusses SDN network security.

A. Replication component for controller resiliency

In [25] the master-slave capabilities of the OpenFlow protocol are utilized to deliver controller robustness. This means that a primary controller (master) has control over all switches and that, on failure of the master, a backup controller (slave) can take over control of the assigned switches. Fonseca et al. [25] introduce a solution, indicated in this review as CPR, which integrates a replication component into a standard OpenFlow controller. As replication component the "Primary-Backup" protocol is applied, to offer resilience against failures and a consistent view of the latest failure-free state of the network. The primary-backup protocol synchronizes the state of the primary controller with the backup controllers. In CPR two phases are distinguished, namely the replication and recovery phases. During the replication phase, flows calculated by the primary controller are synchronized over the backup controllers. After failure detection of the primary controller, the recovery process is initiated to reallocate a primary controller and restore flow calculations. With the replication component integrated in an OpenFlow controller, the solution can be classified as a T(N/−/C+R/0) topology. Hereby we denote that always one primary controller is present, with C − 1 remaining backup controllers. The current implementation of the OpenFlow protocol allows a total of C = 4 controllers. In figure 14 the synchronization process of CPR is shown.

The CPR solution connects to standard OpenFlow-compliant switches and is built upon the NOX OpenFlow controller. Additional components to enable replication are integrated into the NOX controller as modules. The switches are configured in the master-slave setting, allowing multiple controllers in a predefined list. During the replication phase, flow requests are sent from the switch to the primary controller. At the controller, the ordinary processes are executed for routing and forwarding. After the flow is calculated in the area flow scheduler, it is intercepted by the "CPRecovery" module. This module determines whether the controller is assigned as primary and, on positive identification, the flow is added to the source table of the controller. Via the "Messenger" module, the source tables of the backup controllers are synchronized using the primary-backup protocol. After all controllers are updated and synchronized, the flow is installed into the switches. This replication procedure ensures a fully synchronized backup before the flows are installed on the switches, so that when the primary fails, the second assigned controller can seamlessly take over network control. A drawback of this replication scheme is the additional latency introduced into the complete flow install process.

All network switches can be configured to perform activity probing on the primary controller. If the primary controller fails to reply within a configurable time window (τ), the network switch starts the recovery phase and assigns the first backup controller from the list as primary controller. On the controller side, when a join request from the switches is received by a backup controller, this controller will set itself as primary controller and the replication phase is started. Update messages from the new primary controller are also sent to the original primary controller, and on its recovery it is assigned as one of the secondary controllers.
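The switch-side part of the recovery phase can be summarized as the loop sketched below; the function names, the probing mechanism and the default window are assumptions used only to illustrate the promotion of the first backup controller.

# Sketch of the switch-side recovery phase: if the primary does not answer an
# activity probe within the window tau, the switch promotes the first backup
# controller from its preconfigured list. All names are illustrative; the
# probing itself corresponds to an OpenFlow echo request/reply exchange.

import time

def monitor_primary(controller_list, probe, join, tau=2.0):
    """controller_list is ordered; index 0 is the current primary."""
    while True:
        if not probe(controller_list[0], timeout=tau):
            failed = controller_list.pop(0)
            controller_list.append(failed)   # a recovered controller rejoins as backup
            join(controller_list[0])         # join request starts its replication phase
        time.sleep(tau)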

The replication and recovery processes seem to solve the resiliency problem with OpenFlow controllers, but the primary-backup protocol and the recovery phase may fail in case of a temporary network partitioning and geographically separated controllers. To explain the potential flaw, the example topology T(6/−/2+R/0) of figure 15 is used.


Figure 15. Example topology for network partitioning - On a link failure between S3 and S4, the topology is partitioned and the backup controller C2,S will be assigned as primary by switches S4 to S6 using the primary-backup protocol.

In normal operation, controller C1,P is assigned as primary and controller C2,S as secondary (backup) controller. At time t the link S3−S4 becomes unavailable and the network is partitioned into two components, by the following reasoning: switches S1 to S3 remain under control of the original primary controller, whereas the remaining switches (S4 to S6) select the secondary controller as new controller, because the time window on activity probing expires at t + τ. We question the behavior of the replication and recovery phases of the replication component (and the primary-backup protocol) in case link S3−S4 becomes operational again. Switches S4 to S6 will not re-assign themselves to the original primary controller, so the network topology remains partitioned until failure of controller C2. In [25] and other performed research, no specific measurements are performed on the influence of the geographical positioning of OpenFlow controllers and its secondary problems, like flow synchronization and primary controller selection. To solve this problem, a more advanced synchronization scheme is required, with primary controller propagation and election schemes.

To test the functionality and the performance of the replication component, Fonseca et al. [25] performed two simulations. In the first simulation, the packet delay between two hosts in a tree topology, with the primary and backup controllers connected to the top switch, is measured. At specific times the primary controller is forced into a failure state. Where the average packet delay is 20 ms, the delay rises to approximately 900 ms during the recovery phase. After the rise, the packet delay normalizes to the average and the network functions normally. Although the replication phase is successful, the delay during the recovery phase is unacceptable for carrier-grade data networks providing voice services, which require end-to-end delays not to exceed 50 ms.

The second simulation measured the response time of a flow install. Multiple measurements have been performed using the number of secondary controllers as variable. As described earlier, the CPRecovery module first synchronizes the secondary controllers before installing a computed Flow Rule into the switch Flow Table. The measurements show that the response time increases linearly with the number of secondary controllers, with a minimum response time of 8 ms when no backup controller is configured and a maximum of 62 ms with 3 secondary controllers to synchronize. This linear growth of the response times is unacceptable for data networks. We propose to increase the performance of the CPRecovery module and lower the response times by performing the install of the Flow Rule and the synchronization to the secondary controllers in parallel, or by first installing the Flow Rule and performing the synchronization afterwards.

B. Reactive link failure recovery

In [26] three existing switching and routing modules (L2-Learning, PySwitch, Routing) from the NOX controller are compared as recovery mechanisms. Additionally, a pre-determined recovery mechanism is added. In the following, the modules are discussed shortly regarding their ability to recover links.

• L2-learning - The standard switching module of the NOX controller functions similarly to a common layer-2 switch. However, the applied NOX controller lacks the Spanning Tree Protocol (STP);

• L2-learning PySwitch - The functioning of this module is very similar to the standard L2-learning module. It is extended with two mechanisms to improve its performance. The first extension adds aging timers to the installed Flow Rules, so that the switch can remove and update its Flow Table. Every time a switch processes a packet, the time-stamp of the flow rule is updated. To protect the Flow Table, the hard-time must be larger than the idle-time. The second mechanism is the application of STP [27], to remove possible networking loops in the topology;

• Routing - The routing module uses three mechanisms to compute Flow Rules. To enable routing, the controller must maintain the network topology for path computations. To detect connected switches and link failures, the switch regularly sends Link Layer Discovery Protocol (LLDP) packets, containing information about the switch MAC address, port number and VLAN indicator. A receiving OpenFlow switch replies with an LLDP packet containing its own parameters. When the reply packet is received by the corresponding switch, the assigned controller is informed about the detected link and the network topology is updated. The recovery capabilities of the routing module depend on the discovery mechanism and the configured timeout interval: if an LLDP packet is not received within the configured interval, the switch declares the link lost and informs the controller of the status change (a minimal sketch of this timeout-based detection follows the list);

• Pre-determined - The pre-determined module does not rely on learning and discovery mechanisms, but implements path protection at the control layer. In the controller, multiple static paths are provided by the network manager. Based on priority and available paths, the controller chooses a path and installs it on the switches accordingly. On a network failure, the controller can choose a redundant path from the provided paths, reducing the need for link discovery mechanisms and path calculations. As the network manager provides the paths, no spanning tree protocol is needed (assuming that the paths are loop-free).
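As an illustration of the timeout-based detection the routing module relies on, the sketch below keeps a timestamp per discovered link and declares a link lost when its LLDP reply has not been refreshed within the interval. The class, the 5-second timeout and the callback are assumptions, not code from [26].

# Sketch of LLDP-based link detection: every LLDP reply refreshes a timestamp,
# and links whose reply has not been refreshed within the timeout are declared
# lost, which triggers a path recomputation at the controller. Illustrative only.

import time

LLDP_TIMEOUT = 5.0   # assumed discovery timeout in seconds

class TopologyView:
    def __init__(self):
        self.links = {}  # (src_dpid, src_port) -> time of last LLDP reply

    def lldp_reply_received(self, src_dpid, src_port):
        self.links[(src_dpid, src_port)] = time.time()

    def expire_links(self, notify_controller):
        now = time.time()
        for link, last_seen in list(self.links.items()):
            if now - last_seen > LLDP_TIMEOUT:
                del self.links[link]
                notify_controller("link_down", link)  # triggers path recomputation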

The first three mechanisms work dynamically and provide path restoration, whereas the fourth mechanism is especially designed for path protection.


Table IV
COMPARISON OF PROPERTIES AND RESULTS ON LINK RECOVERY MECHANISMS.

Name                   Update forwarding table  Recovery scheme      Recovery time
L2-Learning            Traffic / ARP            − / ARP              Seconds - Minutes
L2-Learning PySwitch   Traffic / ARP            Aging timers / ARP   Seconds - Minutes
Routing                LLDP                     Aging timers / LLDP  Seconds - Minutes
Pre-determined         Manually                 Configured           Milliseconds

Figure 16. Recovery mechanisms - In total, four mechanisms are available to recover from a link failure. All mechanisms have their own link failure detection methods. On link failure, the enabled mechanism will construct a new path and install it into the switch Flow Table.

Applied to a topology, the pre-determined module leads to the classification T(N/−/C+R/0), because additional logic is added to the controller to improve its performance on link resiliency. Figure 16 projects the four recovery mechanisms onto the graphical framework.

From the OpenFlow module at the switch, Flow Requests and link status information are exchanged with the controller. In figure 16, the four recovery modules are drawn, but only one module is active at a time. Furthermore, the routing and PySwitch modules are marked to indicate the availability of the spanning tree protocol. Each of the modules can determine Flow Rules based on the available information. The L2-learning and pre-determined modules install flows without aging timers, while the routing and PySwitch modules set timers to protect the flow tables at the switch. In [26], simulations have been performed on a T(6/−/1+R/0) topology to show the behavior of the different modules, and measurements have been taken to see whether the link-recovery requirements are achievable. The topology contained multiple cycles, and Sharma et al. [26] showed that much traffic traverses between the OpenFlow switches and the controller to maintain link-status information. Only the pre-determined module consumed less traffic, which is expected from its static and fixed design. To simulate network traffic, ping packets were sent with an interval of 10 ms between two end hosts. At a specified time, a link failure was initiated and the recovery time and number of dropped packets were measured. Measurements show that it takes 108 ms before a link failure is indicated at the controller. This value is already above the required 50 ms, so any recovery mechanism discussed in this research will fail to meet that requirement. The pre-determined module acts immediately on a link failure, which results in recovery times of approximately 12 ms and thus a total delay of 120 ms. Results for the routing and PySwitch modules show that the recovery depends on the idle and hard times set in the aging timer. The L2-Learning mechanism fails to recover a path without the application of the Address Resolution Protocol (ARP) [28]. Table IV summarizes the properties and results of the four recovery mechanisms, where a distinction is made on how topology (link) information is maintained, how the mechanism recovers from link failures and on what time scale paths are recovered.

Unfortunately, no experiments have been performed on varying the idle and hard times of the routing and PySwitch modules. Reducing these times, we believe, can have a large influence on the recovery process of links. Also, changing the default timeout timers of ARP and LLDP can improve the performance. The current implementation of the pre-determined module can act fast on a link failure, but lacks the ability to construct paths on demand from dynamically received topology information.

C. Path-based protection

As shown in the previous section, the reactive and pre-determined recovery schemes implemented at the control layer do not meet the 50 ms recovery time requirement for carrier-grade networks. Sharma et al. [29] came with a similar proposal, but now applied at the switch layer. This reduces the recovery time, as no communication with the control layer is required. Multiple schemes can be applied to recover paths in case of link failures, where a distinction can be made between protection and restoration schemes [24]. Protection schemes do not need communication with the controller to restore paths, as actions are pre-configured at the switch layer. Restoration schemes require communication between the switch and controller, and recovery paths are dynamically allocated.

In [29] the 1:1 protection scheme is implemented as a protection mechanism at the switch layer, where 1:1 refers to activating a backup path after failure of the primary path. To enable this mechanism, the Group Table concept of the OpenFlow protocol is utilized. In normal operation, the destination address of a packet is matched in the Flow Table and the packet is forwarded to the correct port or dropped. By applying Group Tables, a flow rule can also contain a link to a unique group. In the Group Table, one or more Action Buckets define actions based on status parameters; on a change of these parameters, the action bucket executes a predefined action.


Figure 17. 1:1 Recovery scheme - Two schemes are visible to recover from a failure. The protection scheme utilizes the BFD path failure detection protocol in cooperation with Group Tables and Action Buckets to enable 1:1 path protection. Restoration of failed links is executed by the controller and a modified routing module, which uses LoS for link failure detection. Newly constructed paths are installed on the switches, without incorporating the failed link.

In the case of the protection scheme, when a failure is detected in the path, the backup path is enabled for the flow. For path failure detection, the Bidirectional Forwarding Detection (BFD) [30] protocol is implemented. To monitor the complete path between multiple OpenFlow switches, a BFD session is configured between the entry and exit switches. If the periodic messaging over the session fails, BFD assumes the path lost, updates the action buckets in the OpenFlow switches and the protection path is activated. The 1:1 protection scheme implemented on a topology leads to a T(N+R/−/1/0) system.
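The 1:1 protection described here maps naturally onto the fast-failover group type of OpenFlow 1.3. The sketch below expresses it with the Ryu controller framework for illustration (the prototype in [29] is built on NOX); the port numbers, priority and group id are assumed values, and in [29] and [31] bucket liveness is driven by BFD status rather than by plain port status.

# Sketch of a 1:1 protected path expressed with an OpenFlow 1.3 fast-failover
# group. `datapath` is the Ryu Datapath object of a connected switch (e.g.
# obtained in a switch-features handler). Ports 1/2 and group id 50 are assumed.

def install_protected_path(datapath, dst_mac, primary_port=1, backup_port=2, group_id=50):
    ofp = datapath.ofproto
    parser = datapath.ofproto_parser

    # One bucket per path; the first live bucket is used, so the backup
    # bucket only takes over when the watched primary port is reported down.
    buckets = [
        parser.OFPBucket(watch_port=primary_port,
                         actions=[parser.OFPActionOutput(primary_port)]),
        parser.OFPBucket(watch_port=backup_port,
                         actions=[parser.OFPActionOutput(backup_port)]),
    ]
    datapath.send_msg(parser.OFPGroupMod(datapath, ofp.OFPGC_ADD,
                                         ofp.OFPGT_FF, group_id, buckets))

    # Point the flow at the group instead of a fixed output port.
    match = parser.OFPMatch(eth_dst=dst_mac)
    inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS,
                                         [parser.OFPActionGroup(group_id)])]
    datapath.send_msg(parser.OFPFlowMod(datapath=datapath, priority=10,
                                        match=match, instructions=inst))

Because the failover is resolved inside the switch's group table, no controller round-trip is needed when the primary path fails, which is what keeps the protection scheme's recovery time independent of the controller.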

The second implementation in [29] is the 1:1 restoration scheme at the switch layer. An extension to the standard routing module of the NOX controller is made to increase the resiliency. The failure detection capabilities of the routing module depend on the OpenFlow aging timers and the implementation of a topology module incorporating LLDP packets. The extended routing module uses the "Loss of Signal" (LoS) failure detection mechanism available in the OpenFlow protocol. LoS detects port changes in the switch from "Up" to "Down" and reports these to the controller. Unlike BFD, LoS does not monitor complete paths, but only local links at the switch. On link failure detection, a notification is sent to the routing module and a new path is constructed, without incorporating the failed link. The new path with its corresponding flow rules is installed in the switches, after which the path is recovered. The proposed solution for path restoration is classified as T(N/−/1+R/0).

The protection scheme most likely restores paths faster, as no communication with the controller is required and backup paths are preconfigured. The restoration scheme is more adaptive and flexible, as paths are calculated with status parameters of the current topology. In a large network, both schemes can be applied, depending on the network services provided. A combined scheme can be classified as a T(N+R/−/1+R/0) topology, as shown in figure 17.

In normal operation, the process of packet forwarding is similar to standard OpenFlow operation. On protected paths, the BFD protocol monitors the status and, on failure, the action buckets in the Group Table are updated. The actions defined in the Action Buckets enable the protection path. In case of restoration, the OpenFlow module detects a link failure, after which the routing module in the controller constructs and installs a new path. An important aspect of the recovery process is the latency between the time of link failure and the recovery of all affected flows. In [29] an analytical model is given for the restoration process, which gives a good indication of where the latency is introduced in the recovery process. We have extended the model with the protection scheme, to indicate the differences between both recovery schemes.

T_R = T_{LoS} + \sum_{i=1}^{F} \left( T_{LU,i} + T_{C,i} + T_{I,i} \right)    (1)

T_P = \max\left( T_{BFD,1}, \ldots, T_{BFD,N} \right) + \sum_{i=1}^{P} \max\left( T_{AB,1,i}, \ldots, T_{AB,N,i} \right)    (2)

The total restoration time (T_R) is determined by the loss-of-signal failure detection time (T_{LoS}), the time spent at the controller to look up the failed link (T_{LU}), the path calculation time (T_C, listed as T_{CALC} in table V), the flow install/modification time (T_I) and the number of flows (F) to restore. The propagation delay, which is assumed to be small (∼1 ms), is integrated into the failure detection and flow installation times. The protection model depends on the BFD failure detection time (T_{BFD}), the time to process the action bucket (T_{AB}) and the number of flows affected by the link failure (P). Because a broken flow is only recovered after the processing of the "slowest" of the N switches in the path, the max operator is applied.
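To make the two models concrete, the following sketch evaluates equations (1) and (2) for one assumed scenario; all per-flow and per-switch delays and the flow counts are invented for illustration and only loosely guided by the ranges in table V.

# Numerical illustration of equations (1) and (2). All values below are
# assumed for the example and only loosely guided by the ranges in Table V.

T_LoS = 150.0                      # ms, loss-of-signal detection (restoration)
T_LU, T_C, T_I = 1.0, 5.0, 1.0     # ms per flow: look-up, path calculation, install
F = 100                            # flows to restore

T_BFD = [40.0, 41.0, 42.0, 44.0]   # ms, BFD detection per switch on the path (assumed)
T_AB = [2.0, 2.0, 3.0, 2.0]        # ms, action-bucket processing per switch (assumed)
P = 100                            # flows switched over by the action buckets

T_R = T_LoS + sum(T_LU + T_C + T_I for _ in range(F))   # equation (1)
T_P = max(T_BFD) + sum(max(T_AB) for _ in range(P))     # equation (2)

print(f"restoration: {T_R:.0f} ms, protection: {T_P:.0f} ms")
# restoration: 850 ms, protection: 344 ms for these assumed values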

To give an indication of the latency differences, multiple simulations and measurements have been performed on different topologies in [29]. Results are shown in table V. Delays to process the action buckets are unknown, but are likely no more than several milliseconds.

Both recovery schemes were able to recover paths on link failure. The main difference in performance is found in the failure detection mechanism: where BFD only needs 40 ms to detect a path failure, the LoS mechanism takes more than 100 ms to report a broken link. A main disadvantage of applying BFD in the protection scheme is the overhead introduced by monitoring all paths. Furthermore, the fixed, pre-planned configuration is inflexible, and the experiments were set up in such a way that link failures did not influence the protected paths. Restoration is more flexible by allocating restoration paths dynamically with up-to-date network topology information. Recovery times for both schemes mainly depend on the number of flows to recover in the network. Restoration times can exceed 1000 ms if a large number of flows needs recovering.

D. Link-based protection

In addition to the path-based protection discussed in the previous section, Van Adrichem et al. [31] propose to deploy link-based monitoring and protection to overcome topology failure.


Table V
COMPARISON OF TIME DELAYS IN LINK RESTORATION AND PROTECTION.

Time                         Symbol  Delay (ms)  Relation  Comment
Failure detection time (P)   TBFD    40 - 44     Fixed
Failure detection time (R)   TLoS    100 - 200   Fixed
Controller look-up time (R)  TLU     1 - 10      Linear    Delay with 250 - 3000 flows
Path calculation time (R)    TCALC   10 - 200    Linear    Delay with 25 - 300 paths
Flow installation time (R)   TI      1 - 5       Linear    Delay with 1000 - 10000 flows

Their contribution to minimizing the recovery time is twofold:

1) They minimize the failure detection time. By using link-based, instead of path-based, BFD monitoring sessions, the per-session RTT and thus the BFD interval window are minimized compared to per-path sessions. Experiments show that configurations with a BFD interval window of 1 ms are feasible.

2) They adapt the Group Table implementation of the OpenFlow-capable software switch Open vSwitch to consider the BFD status in real time, hence eliminating the administrative process of bringing an interface's status down.

Herewith, they enable a controller to employ protection by configuring per-switch backup paths using BFD-aware Group Table rules. Where path-based failure monitoring has a complexity of O(N × N) sessions, link-based failure monitoring decreases this to a complexity of O(L), where N and L respectively represent the number of nodes and links in a network; in a full mesh of 10 nodes, for example, this reduces 90 per-path sessions to 45 per-link sessions. Hence, the number of BFD sessions traversing each link is limited to exactly 1.

The experiments of [31] show a recovery time as low as 3.3 ms, independent of network size. However, due to the software nature of Open vSwitch, the solution scales with the degree of each node, emphasizing the need for hardware implementations of BFD and packet forwarding.

Since the proposed solution deploys per-switch backup paths, in exceptional cases crankback routing may need to be applied. While fast failure recovery is still guaranteed, the resulting path is suboptimal. However, in time the network controller will be notified of the change in topology and can reconfigure the network to an optimal state without service interruption.

E. Segment-based protection

Where the previous two sections discuss path- and link-based protection against network failure, the research in [32] proposes a hybrid approach, in which individual segments of a path are protected. The main idea is to provide a working path, as well as a backup path, for each switch involved in the working path. Both paths are installed in the Flow Tables with different priorities. After failure detection, the flows for the working path are removed from the table by additional mechanisms in OpenFlow, after which the backup flow becomes the working path. In figure 18 the projection onto the graphical framework is given (note the overlap with figure 17).

Figure 18. OpenFlow segment protection scheme - Along with the working path, backup paths are provided to the OpenFlow switch to protect segments. An extended Recovery Module in the OpenFlow module rejects flows from the Flow Table after a failure detection and enables backup paths for the segments. To prevent backup flows from being removed by the idle timers, the recovery module transmits renewal messages over the backup paths.

The failure detection mechanism in [32] is not specified. To trigger the backup paths, two additional mechanisms are added to OpenFlow protocol version 1.0. For removing the flows of the working path from the Flow Table, an "auto-reject" mechanism is developed, which deletes entries when the port status of the OpenFlow switch changes. The second mechanism, "flow renewal", is used to keep the flow entries for the backup paths alive: backup paths are installed with idle timers and, while the working path is active, update messages are transmitted over the backup paths to refresh the idle timers, preventing automatic flow removal by OpenFlow.
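A sketch of the two entries kept per protected segment is given below, again using Ryu and OpenFlow 1.3 for illustration; the priorities, the 10-second idle timeout and the port parameters are assumed, and the "auto-reject" and "flow renewal" behaviour referred to in the comments are the non-standard extensions of [32], not stock OpenFlow.

# Sketch of per-segment working and backup flow entries. `datapath` is the Ryu
# Datapath of a connected switch; priorities, timeout and ports are assumed.

def install_segment(datapath, dst_mac, working_port, backup_port):
    ofp = datapath.ofproto
    parser = datapath.ofproto_parser
    match = parser.OFPMatch(eth_dst=dst_mac)

    def flow(priority, out_port, idle_timeout=0):
        inst = [parser.OFPInstructionActions(
            ofp.OFPIT_APPLY_ACTIONS, [parser.OFPActionOutput(out_port)])]
        return parser.OFPFlowMod(datapath=datapath, priority=priority,
                                 idle_timeout=idle_timeout, match=match,
                                 instructions=inst)

    datapath.send_msg(flow(priority=20, out_port=working_port))   # working path
    datapath.send_msg(flow(priority=10, out_port=backup_port,
                           idle_timeout=10))                      # backup segment
    # The non-standard "auto-reject" extension of [32] removes the priority-20
    # entry when the working port goes down, so the priority-10 backup entry
    # takes over; periodic "flow renewal" traffic over the backup path keeps
    # its idle timer from expiring while the working path is still active.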

Multiple experiments have been performed with the adapted version of OpenFlow, utilizing segment protection with the "auto-reject" and "flow renewal" mechanisms. Results show average recovery times around 30 ms for a variable number of flow entries per switch, with a maximum of 65 ms. The fact that modifications and extensions must be made to the OpenFlow protocol, leading to a non-standard implementation, makes that we do not recommend the solution in [32] for large SDN implementations.

F. In-band OpenFlow networks

In sections VI-B to VI-E, the controller was connected in an "out-of-band" configuration, which indicates that separate connections from the switch to the controller are available. Only control traffic traverses these connections, ensuring no delays or traffic congestion between controller and switches. In an "in-band" configuration, control traffic traverses the same connections as data traffic, and no additional network interfaces are needed.


Figure 19. Example configuration for control traffic restoration - A. Normal situation. B. Link failure and controller notification. C. Update of intermediate switches to restore the control path. D. Communication between S3 and the controller is restored.

Applying an in-band configuration to a network, Sharma et al. [33] discovered a problem. When a link failure occurs and the communication between a switch and the controller is lost, the basic operation of the switch is to restore the connection by requesting a new connection after waiting for an echo request timeout. The minimum value for this timeout is limited to 1 second, which is a factor of 20 too long for proper path recovery in carrier-grade networks. Therefore, in [33] the restoration and protection schemes from [29] are reused to solve this problem.

As seen before, data traffic paths can be restored or protected from a link failure. In the case of restoration, the failed data traffic paths cannot be restored without a communication channel between the switches and the controller. Therefore, first the control path must be restored, after which the data paths can be reinstalled on the switches by the controller. To implement this priority in processing the link failure, the "Barrier Request and Reply Messages" concept from the OpenFlow protocol is utilized. In normal operation, OpenFlow messages can be reordered by the switches for performance gains. To stop reordering, Barrier Messages are sent by the controller: upon receiving a Barrier Request, a switch must process all preceding instructions before processing following requests, and after processing the Barrier Request a reply is sent to the controller. To clarify the restoration process in an in-band configuration, an example topology with a process description is given in figure 19.

In figure 19, four phases are distinguished, from normal operation to link failure and restoration of the control channel for switch S3 (a sketch of the barrier-synchronized update follows the list):

• Phase A - Initial phase where the control traffic for switch S3 is routed over switches S1 and S2 to the controller. In a normal "out-of-band" configuration, S3 would have a separate network connection with the controller;

• Phase B - The link between switches S2 and S3 fails and the communication between S3 and the controller stops. Switch S2 detects the link failure with the LoS mechanism and sends an error report to the controller via S1;

• Phase C - The controller calculates the new control path over S1 and S4 to S3 with highest priority, after which the data traffic paths are recalculated. These paths cannot yet be installed into switch S3, as the broken path is still present in its flow table. Therefore the controller first updates S1 and S4;

• Phase D - The flow modification messages are processed and the barrier reply messages are sent to the controller by S1 and S4. After both barrier reply messages are received by the controller, the new control path to switch S3 is configured at the intermediate switches. The connection to the controller is restored and the recalculated data paths can be installed in all switches.
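The sketch below illustrates the barrier-synchronized order of phases C and D with Ryu message classes; the helper flow_mods_for stands in for the controller's own path computation and is not defined here.

# Sketch of the barrier-synchronized update of phases C and D. Each `dp` is a
# Ryu Datapath; flow_mods_for(dp) is a placeholder for the controller's own
# path computation that returns the new control-path OFPFlowMod messages.

def restore_control_path(intermediate_switches, flow_mods_for):
    """First reroute the control path on the reachable switches (S1, S4)."""
    for dp in intermediate_switches:
        parser = dp.ofproto_parser
        for mod in flow_mods_for(dp):        # new control-path flow rules
            dp.send_msg(mod)
        # The barrier forces the switch to finish the preceding flow mods
        # before anything else; the controller waits for the corresponding
        # EventOFPBarrierReply from every intermediate switch before it
        # touches the isolated switch (S3) and reinstalls the data paths.
        dp.send_msg(parser.OFPBarrierRequest(dp))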

As seen in the description of the phases, the use of barrier requests synchronizes the intermediate steps of the restoration. Besides restoration, the control traffic path can also be recovered with a 1:1 protection scheme. Protection of the control and data traffic paths is provided by BFD link failure detection and the Group Tables with Action Buckets. The main advantage of protection is that no communication with the controller is required and the switches can autonomously update their flow tables.

Using the restoration and protection schemes, a total of four recovery schemes are possible. As with the out-of-band configuration, analytical models can be used to predict the behavior during the recovery process. The restoration model for the control traffic path is a modification of the earlier restoration model. Equations (3) and (4) show the restoration times for an OpenFlow in-band configuration, where T_{RC} is the control traffic restoration time, T_B is the additional time delay introduced by the Barrier Message reply mechanism and T_{IS,i} is the time to install and modify the flow tables of the intermediate switches. The restoration time for data traffic paths (T_{RD}) is a simplified form of equation (1), without the LoS failure detection delay, as the controller is already informed about the network failure.

T_{RC} = T_{LoS} + T_B + \sum_{i=1}^{S} \left( T_{LU,i} + T_{C,i} + T_{IS,i} + T_{I,i} \right)    (3)

T_{RD} = \sum_{i=1}^{F} \left( T_{LU,i} + T_{C,i} + T_{I,i} \right)    (4)

In table VI the four recovery schemes are given, together with the analytical delay models.

Table VI
COMPARISON OF RECOVERY SCHEMES IN IN-BAND CONFIGURATIONS.

Recovery Scheme (Control - Data)  Symbol  Analytical Relationship
Restoration - Restoration         TR,R    TRC + TRD
Restoration - Protection          TR,P    max(TRC, TP)
Protection - Protection           TP,P    TP
Protection - Restoration          TP,R    max(TP, TR)

The analytical relationships in table VI assume that the recovery and protection processes do not influence each other at the switch. In [33] multiple measurements have been performed on all four in-band recovery schemes. Results show that when restoration is applied to recover data paths, delays exceed the 50 ms requirement. Only the full protection scheme meets the requirement, but in practice this scheme will not be applied due to the large flow tables and the large number of configurations the network manager has to make at the switches.


Figure 20. General security configuration - The first step is to classify and mark incoming traffic with statistics from the OpenFlow switch or external measurements (Classification). Marked Flow Requests go to the Flow Scheduler, where the requests are processed and the corresponding actions are assigned to flows (Action). The last step is to check the flows against the network security policy (Control) and install the flows at the switches. On a policy violation, the Flow Scheduler must be informed.


Looking at the differences in the results between TR,R and TP,R, we can conclude that the performance difference is small (results show a TLoS of approximately 50 ms, in comparison with the 100 ms measured in [29]). This is expected, as only a few control paths to the switches have to be recovered. With this conclusion, we can state that, with respect to the recovery requirements, there is no noticeable performance difference between in-band and out-of-band configurations when a full protection scheme is implemented. A comparison between restoration in both configurations is not possible, due to the different delay components measured in [29] and [33]. We think it is possible to predict the behavior of both configurations with the derived analytical models, once reliable measurements for the defined parameters are available.

G. Security in SDN

Network security is applied to control network access, provide separation between users and protect the network against malicious and unwanted intruders. It remains a hot topic among SDN researchers, both because a basic security level is expected from a new network technology and because network security applications can easily be applied to the network control logic. We define two levels of security. The first level involves logical connections between end hosts inside the network. Protocols like Secure Socket Layer (SSL) or packet encryption techniques must ensure connection security. Within SDN, this level of security plays an important role, as the control link between switches and the centralized controller must be secured. The OpenFlow protocol offers a mechanism to secure this connection, but it is not mandatory. It is up to the controller to secure the connection with the switches, and a number of controller implementations have not implemented link security mechanisms. When no link security is applied, a malicious node can impersonate the controller and take over control of the switches.

The second level of security is to protect the switches, servers and end hosts in the network. Numerous examples indicate the threats to the network as a whole: malicious software can intrude the network, infect hosts and gather information, but flooding attacks can also disable network servers or overload OpenFlow switches and controllers. Security mechanisms must be implemented in the network to detect malicious traffic and take the necessary actions to block and reroute this traffic. In current networking, network security is applied at higher networking layers: routers and firewalls perform security tasks at layer 3, whereas end hosts and servers host security applications at layer 7. With SDN, there is a central authority that routes traffic through the network, which enables the possibility to apply security policies to all networking layers. Much research has been performed, and the results of [34], [35], [36], [37] are used to determine security properties within SDN. Most researchers follow roughly the same procedure to apply security to the network. This procedure consists of three steps; a short description of each step is given below, together with a reference to the performed research (a minimal sketch of the resulting pipeline follows the list).

• Classification - Data flows through the network must be classified in order to determine malicious behavior and network attacks. Without classification, it is impossible to protect the network and perform countermeasures. The main source for traffic classification is traffic statistics [35];

• Action - Once a traffic flow is marked as malicious, the control layer must modify flow tables to protect the network and prevent propagation of the malicious traffic through the network. Each threat requires different actions, so the control layer must be flexible enough to quickly adopt new protection schemes [34];

• Check - The last security process is checking the calculated flow rules against the security policy applied by the network manager. Flow rules may (unintentionally) violate the security policy and therefore an extra control process is needed. Preventing network security violations by checking flow rules before they are installed on the switches completes the overall security process [36], [37].
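The sketch below strings the three steps together in the order of figure 20; the threshold, the policy object and all function names are assumptions, since the cited works implement each step with their own modules.

# Sketch of the classification / action / control pipeline of figure 20.
# FLOOD_THRESHOLD_PPS, the flow_stats format and security_policy.allows are
# illustrative assumptions, not APIs from [34]-[37].

FLOOD_THRESHOLD_PPS = 10_000   # assumed packets-per-second classification bound

def classify(flow_stats):
    """Step 1: mark flows whose packet rate suggests a flooding attack."""
    return [{**f, "marked": f["pps"] > FLOOD_THRESHOLD_PPS} for f in flow_stats]

def schedule(flow, drop_rule, forward_rule):
    """Step 2: pick a counter-measure for marked flows, a normal rule otherwise."""
    return drop_rule(flow) if flow["marked"] else forward_rule(flow)

def check_and_install(rule, security_policy, switch):
    """Step 3: install only rules that do not violate the network-wide policy."""
    if security_policy.allows(rule):
        switch.install(rule)
    else:
        raise ValueError("policy violation, report back to the flow scheduler")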

The three processes combined form the protection layer for the network and can all be implemented at the control layer, which results in a T(N/−/1+S/0) configuration. To give the most general view of this configuration, we assume no modifications to the switch layer. Figure 20 gives the general OpenFlow security configuration.

As seen in figure 20, normal Flow Requests and traffic statistics enter the classification module in the OpenFlow controller. Traffic statistics can originate from the OpenFlow module at the switch or result from processed TCP and UDP traffic dumps. The classification module identifies malicious traffic flows and has two actions to perform. First, it must inform the Flow Scheduler of the presence of malicious flows in the network, as existing flows in the switch tables must be modified by the Flow Scheduler.


Second, incoming Flow Requests must be marked, so that the flow scheduler can process them according to the security policy. At the scheduler, multiple security modules are present to install and modify Flow Tables with rules based on the security policy for regular traffic and counter-measures for the different threats to the network. So for each traffic flow there exists a unique set of rules, in order to protect the individual traffic flows within the network, as well as the network itself. The last step before Flow Rules can be installed or modified is confirming their validity against the overall security policy of the network. A Flow Rule computed by a module in the Flow Scheduler can conform to the rules of that module, but may violate the security policies of other modules or the overall network security policy. After a Flow Rule is approved by such a flow policy checker, it can be installed into the switch Flow Tables.

In theory, the classification process looks easy to execute, but [35] and [34] have proven otherwise. In [35] an effective solution is found to identify abnormal traffic and flooding attacks. The most obvious mechanism to classify traffic flows is continuous TCP and UDP traffic dumps: with these dumps, all information is present to identify malicious traffic, but processing all dumps continuously takes too many computing resources. Therefore, an intelligent mechanism is employed to map traffic flows based on traffic statistics. Using these maps, all flows are characterized and abnormal traffic flows can be identified and removed from the network. This method only detects flooding attacks, so to detect other threats, more classification mechanisms are needed. In [34] an application layer is presented to apply classification modules, as well as security modules, at the Flow Scheduler. A general application layer eases the implementation of modules for newly identified threats. Multiple examples in [34] show that with the application of classification and security modules, counter-measures for network threats can be implemented in an OpenFlow environment.

H. Conclusion on resiliency

Where [25] focuses on the resiliency of the control plane and develops a mechanism to utilize the master-slave concept for OpenFlow controllers, [26], [29], [31], [32], [33] research the ability to recover failed links. On the controller side we can state that a single controller is insufficient in terms of robustness, due to a lack of redundancy. With the failure of a controller, the OpenFlow switches lose the control layer functionality, resulting in an uncontrolled network topology. If the 50 ms recovery requirement from carrier-grade networks is applied, the replication scheme used to improve robustness will not suffice. Hence, the replication component needs modifications to lower the repair latency.

On link recovery, five papers are discussed. Although most recovery concepts start with "traditional" failure recovery based on Loss-of-Signal detection, these methods have proven to have a large latency. Instead, actively probing failure detection mechanisms can detect failures more quickly. Group Tables with pre-programmed failover logic and fast restoration of flows by the controller are two proven techniques which have been shown to recover paths within sub-50 ms time. Hence, recovery from topology failure appears sufficiently covered.

However, an integral solution, in which OpenFlow controllers are also redundantly applied, with the ability to recover network control in the order of milliseconds, needs further research. Techniques used in the scalability research field can be combined with the OpenFlow protocol to apply master-slave configurations. Combined with the proposed path recovery schemes, a higher level of resilience can be reached in both the control and forwarding planes.

VII. CONCLUSION

In this overview, we have discussed the basic principles of SDN, where the control layer is decoupled from the data plane and merged into a centralized control logic. The centralized logic, itself controlled by software, has a global view of the network and has the capability to dynamically control hardware devices for optimal traffic flows through the network. For communication between the data plane and the control logic, the OpenFlow protocol is commonly utilized. Two main problem areas are identified from the reviewed research, being limited scalability and decreased resiliency due to the centralized nature of SDN.

To gain detailed insight into the performed research on SDN and OpenFlow networks, we have developed a general framework and notation in which we classify scientific work related to scalability and resiliency in SDN. We have separated the proposed solutions into those addressing scalability and those addressing resiliency issues. Due to the centralization of the control logic, scalability issues exist with respect to the number of hardware devices to be controlled by a single control logic and the number of Flow Requests processed by that logic. Solutions can be found in increasing the performance of the central logic, reducing the number of tasks to be performed by the central control logic, or dividing the hardware resources, either through virtualization or by introducing multiple coexisting controllers.

On resiliency, three problem areas have been distinguished, being the resistance of both the controller and the network topology against failures, as well as the network's resilience against malicious attacks (security). The resiliency of SDN networks can be increased by applying redundant network controllers and replicating internal network state, while recovery schemes can protect against network topology failures.

However, as each implementation seems to make trade-offs, possible solutions are still suboptimal by nature. The topics of scalability and topology failure may be sufficiently solved by combining the reduction of control overhead, the distribution of multiple controllers over a network, and the deployment of the discussed failure protection mechanisms. The topic of recovery from controller failure, however, seems underrepresented and needs to be researched more thoroughly.

REFERENCES

[1] B. A. A. Nunes, M. Mendonca, X.-N. Nguyen, K. Obraczka, and T. Turletti, “A survey of software-defined networking: Past, present, and future of programmable networks,” Communications Surveys Tutorials, IEEE, vol. 16, no. 3, pp. 1617–1634, Third 2014.

[2] N. Feamster, J. Rexford, and E. Zegura, “The road to SDN,” Queue, vol. 11, no. 12, p. 20, 2013.


[3] S. Scott-Hayward, G. O’Callaghan, and S. Sezer, “SDN security: A survey,” in Future Networks and Services (SDN4FNS), 2013 IEEE SDN for. IEEE, 2013, pp. 1–7.

[4] S. H. Yeganeh, A. Tootoonchian, and Y. Ganjali, “On scalability of software-defined networking,” Communications Magazine, IEEE, vol. 51, no. 2, pp. 136–141, 2013.

[5] S. Sezer, S. Scott-Hayward, P. K. Chouhan, B. Fraser, D. Lake, J. Finnegan, N. Viljoen, M. Miller, and N. Rao, “Are we ready for SDN? Implementation challenges for software-defined networks,” Communications Magazine, IEEE, vol. 51, no. 7, 2013.

[6] K. Suzuki, K. Sonoda, N. Tomizawa, Y. Yakuwa, T. Uchida, Y. Higuchi, T. Tonouchi, and H. Shimonishi, “A survey on OpenFlow technologies,” IEICE Transactions on Communications, vol. 97, no. 2, pp. 375–386, 2014.

[7] A. Lara, A. Kolasani, and B. Ramamurthy, “Network innovation using OpenFlow: A survey,” Communications Surveys Tutorials, IEEE, vol. 16, no. 1, pp. 493–512, First 2014.

[8] Open Networking Foundation. (2013, Apr.) OpenFlow switch specification, version 1.3.2 (wire protocol 0x04).

[9] A. Doria, J. H. Salim, R. Haas, H. Khosravi, W. Wang, L. Dong, R. Gopal, and J. Halpern, “Forwarding and Control Element Separation (ForCES) Protocol Specification,” RFC 5810 (Proposed Standard), Internet Engineering Task Force, Mar. 2010. [Online]. Available: http://www.ietf.org/rfc/rfc5810.txt

[10] N. L. M. van Adrichem, C. Doerr, and F. A. Kuipers, “OpenNetMon: Network monitoring in OpenFlow software-defined networks,” in Network Operations and Management Symposium (NOMS), 2014 IEEE, May 2014.

[11] N. Gude, T. Koponen, J. Pettit, B. Pfaff, M. Casado, N. McKeown, and S. Shenker, “NOX: Towards an operating system for networks,” ACM SIGCOMM Computer Communication Review, vol. 38, no. 3, pp. 105–110, 2008.

[12] Open Networking Foundation. (2013) Open vSwitch manual - ovs-vswitchd.conf.db(5) 2.0.90. [Online]. Available: http://openvswitch.org/ovs-vswitchd.conf.db.5.pdf

[13] Stanford University. (2014) POX OpenFlow controller wiki. [Online]. Available: https://openflow.stanford.edu/display/ONL/POX+Wiki

[14] Nippon Telegraph and Telephone Corporation. (2014, Jan.) Ryu SDN controller. [Online]. Available: http://osrg.github.io/ryu/

[15] Project Floodlight. (2014, Jan.) Floodlight open SDN controller. [Online]. Available: http://www.projectfloodlight.org/floodlight/

[16] Linux Foundation. (2014, Jan.) OpenDaylight project. [Online]. Available: http://www.opendaylight.org/

[17] NEC Corporation. (2014) Quantum plug-in for OpenStack cloud computing software. [Online]. Available: https://wiki.openstack.org/wiki/Neutron/NEC_OpenFlow_Plugin

[18] A. R. Curtis, J. C. Mogul, J. Tourrilhes, P. Yalagandula, P. Sharma, and S. Banerjee, “DevoFlow: Scaling flow management for high-performance networks,” in ACM SIGCOMM Computer Communication Review, vol. 41, no. 4. ACM, 2011, pp. 254–265.

[19] A. Tootoonchian and Y. Ganjali, “HyperFlow: A distributed control plane for OpenFlow,” in Proceedings of the 2010 Internet Network Management Conference on Research on Enterprise Networking. USENIX Association, 2010, pp. 3–3.

[20] S. Hassas Yeganeh and Y. Ganjali, “Kandoo: A framework for efficient and scalable offloading of control applications,” in Proceedings of the First Workshop on Hot Topics in Software Defined Networks. ACM, 2012, pp. 19–24.

[21] T. Koponen, M. Casado, N. Gude, J. Stribling, L. Poutievski, M. Zhu, R. Ramanathan, Y. Iwata, H. Inoue, T. Hama et al., “Onix: A distributed control platform for large-scale production networks,” in OSDI, vol. 10, 2010, pp. 1–6.

[22] R. Sherwood, G. Gibb, K.-K. Yap, G. Appenzeller, M. Casado, N. McKeown, and G. Parulkar, “FlowVisor: A network virtualization layer,” OpenFlow Switch Consortium, Tech. Rep., 2009.

[23] J. Hawkinson and T. Bates, “Guidelines for creation, selection, and registration of an Autonomous System (AS),” RFC 1930 (Best Current Practice), Internet Engineering Task Force, Mar. 1996. [Online]. Available: http://www.ietf.org/rfc/rfc1930.txt

[24] F. A. Kuipers, “An overview of algorithms for network survivability,” ISRN Communications and Networking, vol. 2012, p. 24, 2012.

[25] P. Fonseca, R. Bennesby, E. Mota, and A. Passito, “A replication component for resilient OpenFlow-based networking,” in Network Operations and Management Symposium (NOMS), 2012 IEEE. IEEE, 2012, pp. 933–939.

[26] S. Sharma, D. Staessens, D. Colle, M. Pickavet, and P. Demeester, “Enabling fast failure recovery in OpenFlow networks,” in Design of Reliable Communication Networks (DRCN), 2011 8th International Workshop on the. IEEE, 2011, pp. 164–171.

[27] IEEE, “Ieee standard for local and metropolitan area networks: Mediaaccess control (mac) bridges,” Institute of Electrical and ElectronicsEngineers, WG802.1 Bridging and Management Working Group, Tech.Rep. 802.1D-2004, 2011.

[28] D. Plummer, “Ethernet Address Resolution Protocol: Or ConvertingNetwork Protocol Addresses to 48.bit Ethernet Address for Transmissionon Ethernet Hardware,” RFC 826 (INTERNET STANDARD), InternetEngineering Task Force, Nov. 1982, updated by RFCs 5227, 5494.[Online]. Available: http://www.ietf.org/rfc/rfc826.txt

[29] S. Sharma, D. Staessens, D. Colle, M. Pickavet, and P. Demeester,“Openflow: meeting carrier-grade recovery requirements,” ComputerCommunications, 2012.

[30] D. Katz and D. Ward, “Bidirectional Forwarding Detection (BFD),” RFC5880 (Proposed Standard), Internet Engineering Task Force, Jun. 2010.

[31] N. L. M. van Adrichem, B. J. van Asten, and F. A. Kuipers, “Fastrecovery in software-defined networks,” in Software Defined Networks(EWSDN), 2014 Third European Workshop on. IEEE, 2014.

[32] A. Sgambelluri, A. Giorgetti, F. Cugini, F. Paolucci, and P. Castoldi,“Openflow-based segment protection in ethernet networks,” OpticalCommunications and Networking, IEEE/OSA Journal of, vol. 5, no. 9,pp. 1066–1075, 2013.

[33] S. Sharma, D. Staessens, D. Colle, M. Pickavet, and P. Demeester, “Fastfailure recovery for in-band openflow networks,” in Design of ReliableCommunication Networks (DRCN), 2013 9th International Conferenceon the. IEEE, 2013, pp. 52–59.

[34] S. Shin, P. Porras, V. Yegneswaran, M. Fong, G. Gu, and M. Tyson,“Fresco: Modular composable security services for software-defined net-works,” in Proceedings of Network and Distributed Security Symposium,2013.

[35] R. Braga, E. Mota, and A. Passito, “Lightweight ddos flooding attackdetection using nox/openflow,” in Local Computer Networks (LCN),2010 IEEE 35th Conference on. IEEE, 2010, pp. 408–415.

[36] S. Son, S. Shin, V. Yegneswaran, P. Porras, and G. Gu, “Model checkinginvariant security properties in openflow,” in Communications (ICC),2013 IEEE International Conference on, June 2013, pp. 1974–1979.

[37] P. Porras, S. Shin, V. Yegneswaran, M. Fong, M. Tyson, and G. Gu, “Asecurity enforcement kernel for openflow networks,” in Proceedings ofthe first workshop on Hot topics in software defined networks. ACM,2012, pp. 121–126.