Study on Self-Healing and Self-Optimization in Software Defined Networking for Heterogeneous Networks
Kevin Andrea
Image: Birmingham Rail & Locomotive, http://www.bhamrail.com/frogswitch/turnouts.asp
Overview
• Introduction
• Background
• Software Defined Networking
  • Overview
  • Current Applications
  • Issues
• New Networking Landscape
  • IoT, VANET, Smart Grid
  • Issues
• Solutions
  • Self-Healing
  • Self-Optimization
• Conclusion
• References
Image: Institute for Communication Systems, University of Surrey, UK.
http://www.surrey.ac.uk/ics/activity/facilities/futureinternet/
Introduction
• Problem Statement
  • Current networking infrastructure is complex
  • Managing it requires Autonomic Systems (IBM)
    • Self-Configuration
    • Self-Protection
    • Self-Healing
    • Self-Optimization
• Exploration of new networking technologies
  • Software Defined Networks
Image: Ivan Pepelnjak, NIL Data Communications, TechTarget
http://searchtelecom.techtarget.com/feature/BGP-essentials-The-protocol-that-makes-the-Internet-work
The programmability feature of SDN can be used to achieve the self-* attributes of autonomic systems. The combination of an autonomic system and SDN can be used to control and manage the network infrastructure. The purpose of both technologies is to overcome the growing complexity of the network. The application of autonomic properties to SDN can unleash the true potential of future networks. The reliability of SDN can be improved with the application of the self-healing principle.

SDN is described by the ONF [1] as an architecture in which the control and data planes are decoupled, network intelligence and state are logically centralized, and the underlying network infrastructure is abstracted from the applications. Figure 1 illustrates the layered architecture of SDN.
[Figure 1: Architecture of SDN. Application Layer (Network Applications), Control Plane Layer (SDN controller / Network Operating System), and Data Plane Layer (Network Switches), with OpenFlow between the control and data planes.]
SDN is a layered architecture that separates the data plane from the control plane. This separation, together with the centralization of intelligence, simplifies the management of the network, but it also poses reliability issues. A network failure can hamper traffic forwarding, resulting in packet loss and service unavailability. In an SDN-based network, faults can be categorized into three areas: (1) the data plane, where a switch or a link between switches fails; (2) the control plane, where a link connecting the controller and a switch fails; and (3) the controller, where the controller machine itself fails. A lot of research is exploring the services and functionality that SDN can provide to leverage the network, but the area of fault management in SDN is still not much explored. Some solutions have been proposed for handling failures in SDN, but they are not practical in actual networks given the enormous traffic.

In this paper, we present the existing work in the field of fault management for SDN and its limitations in terms of applicability to present networks. We also propose an optimized self-healing (OSH) framework for SDN which ensures an optimal state and continuous availability of the network after recovering from a failure. We then present the functionality of the rapid failure recovery scheme. A performance analysis based on an analytical model of the rapid failure recovery is given, considering factors such as failure recovery time and the memory size required for the backup flow rules.
2. PREVIOUS WORK

There has been some research in the area of fault management for SDN. Most of it relies on the traditional, well-known approaches to failure recovery: restoration and protection. In restoration, alternative paths are established after a failure occurs. In protection, the alternative paths are established even before a failure occurs in the network. Most existing schemes [7] [8] [9] [11] [12] rely on adding flow entries to install a backup path for each of the flows disrupted on the failed link. In Andrea S. et al. [13], for each new flow the controller installs a backup path for each link that is part of the primary path. This solution is not practical for a network with thousands of flows. According to [18] [19], in a modern data center with 100,000+ compute nodes, the number of flows in the network will be in the millions. In such a case, installing a backup flow rule for each flow may overload the centralized controller and create processing bottlenecks. Considering present switch hardware, storing millions of flow rules is impractical. Additional TCAM memory can be used to store the OpenFlow rules, but because of its cost, commercial switches do not support more than 5,000 to 6,000 TCAM-based flow rules [14].

In a path protection scheme, immediate local recovery is not possible, because after a link failure the switch that reroutes traffic from the primary path to a backup path must first receive the failure notification. The link failure notification time adds to the total recovery time, which results in delayed recovery and ultimately higher packet loss. After switching to the backup path due to a failure, the flow entries for the disrupted flows become obsolete and need extra controller involvement to explicitly remove them from the flow table. The authors of [10] [13] rely on a restoration mechanism for failure recovery. Restoration requires more time to recover than protection because the controller has to install flow rules in all the switches that are part of the alternate path. It also puts significant load on the controller, which ultimately delays the recovery process.

CORONET [11] relies on LLDP messages for detecting the changes in topology caused by a failure, but processing the LLDP monitoring messages overloads the controller. Yang Y. et al. [12] calculate an optimal backup path after a link failure is detected and then use a restoration mechanism for the link recovery; the path calculation and installation process delays recovery. Considering all the issues in existing research, we propose an OSH approach for failure recovery in SDN. It addresses the issues present in existing schemes and provides an approach to optimally recover a network from failures. Our proposed RR scheme does not need full-state controller intervention when a failure occurs, which reduces the load on the controller. This makes the failure recovery process faster by eliminating the overhead of communication between switches and the controller. Because of the reduced communication, overall congestion in the network drops significantly.
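The restoration-versus-protection timing trade-off discussed above can be made concrete with a toy model. This is purely illustrative: the function names and all timing values are hypothetical placeholders, not figures from the cited schemes; only the TCAM capacity and flow-count orders of magnitude come from the text ([14], [18], [19]).

```python
# Toy model of the recovery-time comparison: restoration pays for path
# computation and per-hop rule installation after the failure, while
# protection only pays for detection and local failure notification.
# All timing values below are hypothetical, in milliseconds.

def restoration_time(t_detect, t_notify_controller, t_path_compute,
                     t_rule_install, hops_on_backup_path):
    """Controller learns of the failure, computes a new path, and
    installs rules on every switch along the alternate path."""
    return (t_detect + t_notify_controller + t_path_compute
            + t_rule_install * hops_on_backup_path)

def protection_time(t_detect, t_notify_switch):
    """Backup rules are pre-installed; the rerouting switch only needs
    to learn about the failure before it can switch over."""
    return t_detect + t_notify_switch

restore = restoration_time(t_detect=5, t_notify_controller=10,
                           t_path_compute=20, t_rule_install=8,
                           hops_on_backup_path=4)
protect = protection_time(t_detect=5, t_notify_switch=2)
print(restore, protect)  # protection recovers faster...

# ...but protection's memory cost is one backup rule per flow, which
# quickly exceeds typical TCAM capacity (~5,000-6,000 rules [14]) when
# a large data center carries millions of flows [18] [19].
TCAM_CAPACITY = 6000
flows = 1_000_000
print(flows > TCAM_CAPACITY)  # True: per-flow backup rules do not fit
```

The model captures why neither classical approach scales: restoration is slow because of controller involvement, and protection trades that latency for flow-table memory the hardware does not have.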
3. PROPOSED DESIGN ARCHITECTURE AND MECHANISM

3.1 Autonomic Self-healing Architecture for SDN

We propose a self-healing system for SDN that is capable of optimally handling failures. Our proposed architecture for the OSH mechanism is shown in Figure 2. The architecture is divided into a data plane and a control plane.
Image: [2]
Background – Traditional Networks
• Network Flow
  • Sequence of Packets
  • Source to Destination
• Routing Protocols
  • Link State Routing Protocols
    • OSPF (Dijkstra's Algorithm)
    • Connectivity Graph of Routers
  • Distance Vector Routing Protocol
    • RIP (Bellman-Ford Algorithm)
    • Neighbor Graph of Routers
  • Exterior Gateway Protocol
    • BGP: Routing Between Autonomous Systems
[Diagram: two flows, A -> C and C -> A, traversing Switch A, Switch B, and Switch C]
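The link-state bullet above (OSPF running Dijkstra's algorithm over a connectivity graph of routers) can be sketched in a few lines. This is a minimal illustration, not OSPF itself; the topology and link costs are made up.

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path costs from source over a link-state connectivity
    graph, as OSPF computes from its link-state database."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, already found a shorter path
        for neighbor, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

# Hypothetical three-router topology with link costs.
topology = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2},
    "C": {"A": 4, "B": 2},
}
print(dijkstra(topology, "A"))  # {'A': 0, 'B': 1, 'C': 3}
```

Note how the direct A-C link (cost 4) loses to the two-hop path through B (cost 3), which is exactly the computation each OSPF router performs over its copy of the connectivity graph.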
Background – Traditional Networks
• Traditional Network Router
  • Maintains a mapping between Destination Address and Port
  • Encapsulates Two Functions
    • Control
      • Experts configure the router
      • Protocols build the Routing Table
    • Data Transfer
      • Uses the Routing Table
      • Forwards Flows (Packets)
Image: Computer Desktop Encyclopedia, The Computer Language Company, Inc.
http://homepages.uel.ac.uk/u0116401/RouterDefinition.htm
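The two functions the slide separates, control building the routing table and data transfer consulting it, can be sketched as follows. The prefixes and port names are invented for illustration; real routers perform this longest-prefix match in hardware.

```python
import ipaddress

# Control function: routing protocols (OSPF, RIP, BGP) build the
# routing table, mapping destination prefixes to output ports.
routing_table = {
    ipaddress.ip_network("10.0.0.0/8"): "port1",
    ipaddress.ip_network("192.168.1.0/24"): "port2",
}

# Data-transfer function: look up the destination and forward.
def forward(dst_ip):
    dst = ipaddress.ip_address(dst_ip)
    # Longest-prefix match over the table the control function built.
    matches = [net for net in routing_table if dst in net]
    if not matches:
        return None  # no route: drop the packet
    best = max(matches, key=lambda net: net.prefixlen)
    return routing_table[best]

print(forward("192.168.1.7"))  # port2
print(forward("10.1.2.3"))     # port1
```

The point of the slide is that both functions live inside every router; SDN's move, covered next, is to pull the control function out into a centralized controller.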
Background – Traditional Networks
Image: Birmingham Rail & Locomotive, http://www.bhamrail.com/frogswitch/turnouts.asp
[Diagram: railroad turnout annotated with Control, two Data Lines, and Data Transfer]
Software Defined Network (SDN)
• Spearheaded by Sun in 1995
  • Designed to allow software switching for networks
  • Provided as a means for experimenting with new algorithms and protocols
• Decouples Control and Data
  • All control logic is moved into a centralized controller
  • Hardware is replaced with Software
• Operators have a global view of the network state
Image: [2]
Software Defined Network (SDN)
• Ease of Configuration
  • Autonomic Systems are able to change any network node's forwarding rules
  • Without having to configure individual switches
• Fundamentally New Architecture
  • If an entry exists in the Data Plane Flow Table, the packet is forwarded
  • If no entry exists, the packet is sent to the Control Plane, which generates a new rule
Image: [2]
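The flow-table behavior on this slide (forward on a table hit, punt to the Control Plane on a miss) can be sketched in a few lines. The match key, the controller's rule-generation policy, and the port names are all hypothetical; a real OpenFlow switch matches on many header fields and the controller applies real network policy.

```python
# Minimal sketch of the SDN data-plane/control-plane interaction.

flow_table = {}  # match fields -> output port, installed by the controller

def controller_generate_rule(packet):
    """Stand-in for the centralized controller: on a table miss it
    decides a rule (here, just hashing the destination onto a port)
    and installs it in the switch's flow table."""
    port = f"port{hash(packet['dst']) % 4}"
    flow_table[(packet['dst'],)] = port
    return port

def handle_packet(packet):
    key = (packet['dst'],)
    if key in flow_table:                     # entry exists: forward
        return flow_table[key]
    return controller_generate_rule(packet)   # miss: ask the controller

p = {"dst": "10.0.0.5"}
first = handle_packet(p)   # table miss: controller installs a rule
second = handle_packet(p)  # table hit: forwarded without the controller
print(first == second)     # True
```

Only the first packet of a flow involves the controller; every subsequent packet is forwarded directly by the data plane, which is what makes the centralized design viable.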
Software Defined Network (SDN)
• Features
  • Fast convergence times when powered on
  • Centralized controller provides fine-grained control for managing complex infrastructures
  • Simplifies network devices
    • Any device can now be a network device
    • Simple packet forwarders
Image: [2]
bility feature of SDN can be used to achieve self-* attributesof the autonomic systems. The combination of autonomicsystem and SDN can be used to control and manage the net-work infrastructure. The purpose of both the technologiesis to overcome the growing complexity of the network. Theapplication of autonomic properties on SDN can unleash thetrue potential of future networks. The reliability of SDN canbe improved with the application of self-healing principle.SDN is described by ONF [1] as an architecture in which
the control and data planes are decoupled, network intelli-gence and state are logically centralized, and the underlyingnetwork infrastructure is abstracted from the applications.Figure 1 illustrates the layered architecture of SDN.
Data Plane Layer
Control Plane Layer
Application Layer
Network Switches
Network Applications
SDN controller/ Network Operating System
OpenFlow
Figure 1: Architecture of SDN
The SDN is a layered architecture which separates thedata plane from control plane. The separation of data planefrom control plane and centralizing the intelligence simplifiesthe management of the network. But it poses reliabilityissues in the network. A network failure can hamper thetraffic forwarding which results in packet loss and serviceunavailability. In an SDN based network the faults can becategorized into three areas (1) Data plane, where a switchor link between switches fails, (2) control plane, where linkconnecting controller and switch fails, and (3) controller,where the controller machine itself fails. A lot of researchis going on for exploring the services and functionality thatSDN can provide to leverage the network. But the area offault management in SDN is still not much explored. Thereare some solutions proposed for handling the failure in SDNbut they are not practical in the actual network consideringthe enormous traffic.In this paper, we present the existing work in the field
of fault management domain for SDN and present its lim-itations in terms of applicability for the present networks.We also propose an optimized self-healing (OSH) frameworkfor SDN which ensures optimal state and continuous avail-ability of the network after recovering from a failure. Afterthat we present the functionality of the rapid failure recov-ery scheme. Performance analysis based on an analyticalmodel of the rapid failure recovery is given by consideringthe factors like failure recovery time and the memory sizerequired for the backup flow rules.
2. PREVIOUS WORKThere has been some research done in the area of fault
management for SDN. Most of them rely on traditional wellknown approach of failure recovery i.e. restoration and pro-tection. In the case of restoration, alternative paths are
established after a failure occurs. In the case of protection,the alternative paths are established even before the fail-ure occurs in the network. Most of the existing scheme [7][8] [9] [11] [12], relies on adding flow entries for installingthe backup path for each of the disrupted flow on the failedlink. In Andrea S. et al [13], for each new flow, controllerinstalls backup path for each of the link which is a part ofthe primary path. This solution is not practical for the net-work having thousands of flows. According to [18] [19], ina modern data center with 100,000+ compute nodes, thenumber of flows in the network will be in the millions. Insuch a case, installing backup flow rule for each of the flowsmay overload the centralized controller and create process-ing bottlenecks. Considering the present switch hardwares,storing the millions of flow rules is impractical. An addi-tional TCAMmemory can used to store the OpenFlow rules,but because of its cost, commercial switches do not supportmore than 5000 to 6000 TCAM based flow rules [14].
In path protection scheme, the immediate local recoveryis not possible. Because after the link failure, the switchwhich reroutes the traffic from a primary path to a backuppath should receive the failure notification. The link failurenotification time adds up to the total recovery time, whichresults in delayed recovery and ultimately higher packet loss.After switching to the backup path due to a failure, the flowentries for the disrupted flows become obsolete and needsextra controller involvement to explicitly remove them fromthe flow table. The authors of [10] [13] relies on restorationmechanism for failure recovery. Restoration mechanism re-quires more time to recover than the protection mechanismbecause the controller has to install the flow rules in all theswitches which are part of the alternate path. It also putssignificant load on the controller which ultimately delays therecovery process.
CORONET [11] relies on LLDP messages to detect topology changes caused by a failure, but processing these LLDP monitoring messages overloads the controller. Yang Y. et al. [12] calculate an optimal backup path after a link failure is detected and then use a restoration mechanism for link recovery. The path calculation and installation process delays recovery. Considering all the issues in existing research, we propose an OSH approach for failure recovery in SDN. It addresses the issues present in the existing schemes and provides an approach to optimally recover a network from failures. Our proposed RR scheme does not need full-state controller intervention when a failure occurs, which reduces the load on the controller. Thus, it makes the failure recovery process faster by eliminating the overhead of communication between switches and controller. This reduced communication also significantly lowers overall congestion in the network.
3. PROPOSED DESIGN ARCHITECTURE AND MECHANISM
3.1 Autonomic Self-healing Architecture for SDN
We propose a self-healing system for SDN that is capable of optimally handling failures. Our proposed architecture of the OSH mechanism is shown in Figure 2. The architecture is divided into a data plane and a control plane.
Software Defined Network (SDN)
• Current Applications• Google uses SDN as the internal backbone interconnecting its Data Center Networks (DCNs)• 93% of mobile providers expect mobile SDN to be globally implemented within 5 years (by 2019)• Current papers have focused on DCN applications and optimizations.
Image: [2]
Software Defined Network (SDN)
• Motivating Concerns• Existing SDN solutions for DCNs assume TCP, Anycast traffic that is loosely correlated.• Existing SDN work assumes static, homogeneous devices.• SDN also poses reliability concerns.
• Faults may occur at the Controller Machine, the Control Plane, or the Data Plane
• Fault management in SDN has not been well explored. Image: [2]
Software Defined Network (SDN)
• Resilience• SDNs currently use existing solutions: Protection and Restoration• Existing schemes add backup paths for each flow entry.
• For non-DCN operations, this is simply not practical and would overload the controller.
• After a failure, restoration techniques assess the routers and the controller must install new rules on each switch.• Overloads the controller and delays recovery.
Image: [2]
Modern Networking Landscape
• Modern Networking• Wireless• Mobile• Heterogeneous
• WiFi, Bluetooth, NFC, Ethernet, USB, 802.15.4, ZigBee, 5G, mmWave, 802.11p VANET
• Applications• Vehicle Communications
• Vehicle to Vehicle Updates• DoT to Vehicle Updates• Media Streaming to Vehicles• Vehicle to DoT Sensor Data• Cell to Vehicle Communications Image: Institute for Communication Systems, University of Surrey, UK.
http://www.surrey.ac.uk/ics/activity/facilities/futureinternet/
Modern Networking Landscape
• Applications• Internet of Things (IoT)
• Field built on Heterogeneous Wireless Devices
• Very difficult to provision resources in this environment.
• Devices deployed in an uncoordinated manner.
• Multi-Objective Optimization• QoS in a DCN focuses on single optimizations.
• QoS for IoT adds in delay, jitter, packet loss, throughput• User Perceivable
Image: Smart Home Energy http://smarthomeenergy.co.uk/what-smart-home
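One way to picture multi-objective QoS is to fold delay, jitter, loss, and throughput into a single comparable path score. The sketch below is purely illustrative; the weights and normalization budgets are assumptions, not values from the literature:

```python
# Illustrative multi-objective path score for IoT QoS. The weights and the
# nominal budgets used for normalization are assumptions. Lower score = better.

def path_score(delay_ms, jitter_ms, loss_pct, throughput_mbps,
               weights=(0.4, 0.2, 0.3, 0.1)):
    """Combine four QoS objectives into one unitless weighted score.

    Each metric is normalized against a nominal budget; throughput is
    inverted since more throughput is better.
    """
    w_d, w_j, w_l, w_t = weights
    return (w_d * delay_ms / 100.0 +                 # 100 ms delay budget
            w_j * jitter_ms / 20.0 +                 # 20 ms jitter budget
            w_l * loss_pct / 5.0 +                   # 5% loss budget
            w_t * 10.0 / max(throughput_mbps, 0.1))  # reward higher throughput

wired = path_score(delay_ms=10, jitter_ms=1, loss_pct=0.1, throughput_mbps=100)
lossy = path_score(delay_ms=60, jitter_ms=15, loss_pct=3.0, throughput_mbps=5)
print(wired < lossy)  # the low-delay, low-loss path scores better
```

In a DCN a single objective (say, throughput) often suffices; for IoT, the weighting itself becomes a user-perceivable policy decision.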
SDN for Modern Networking
• Benefits• Can differentiate flow scheduling over ad-hoc, heterogeneous paths.• Allows for opportunistic exchanges over the best networking interface.• Vehicle-Vehicle over 802.11p• Vehicle-DoT over Cell LTE
• Can route flows based on priority or other categorizations.
Image: Smart Home Energy http://smarthomeenergy.co.uk/what-smart-home
Problems with SDN for IoT: Self-Healing
• Motivating Application• Smart Grid
• Packet flows control relays at power stations.
• Fast recovery is critical after a communication disturbance.• Short downtimes could lead to overloading nearby power stations, causing cascade failures.
• 2003 Blackout• Alarm did not sound, operators failed to redistribute power, 10 million people affected.
Image: "Map of North America, blackout 2003" by Lokal_Profil. Licensed under CC BY-SA 2.5 via Commons -
https://commons.wikimedia.org/wiki/File:Map_of_North_America,_blackout_2003.svg#/media/File:Map_of_North_America,_blackout_2003.svg
Problems with SDN for IoT: Self-Healing
• Rapid Recovery [2]• Problem
• Recovery takes too long using restoration technique.
• Not feasible to store backup rules for each flow with protection technique.
• Solution• Treat all flows routed over the same link with a single rule.
• Pre-allocate a single backup rule.• On failure notification, immediately switch all flows over that link to the backup.
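The per-link idea above can be sketched in a few lines of plain Python. This is not the paper's implementation; the class and method names are illustrative:

```python
# Minimal sketch of the per-link backup idea: all flows traversing a failed
# link are redirected by ONE pre-installed backup rule, instead of one rule
# per disrupted flow. Names and structures are illustrative, not from [2].

class Switch:
    def __init__(self):
        self.primary = {}       # link -> primary output port
        self.backup = {}        # link -> pre-allocated backup output port
        self.failed_links = set()

    def install_link_rule(self, link, primary_port, backup_port):
        self.primary[link] = primary_port
        self.backup[link] = backup_port   # pre-allocated before any failure

    def notify_failure(self, link):
        self.failed_links.add(link)       # one state flip covers every flow

    def output_port(self, link):
        if link in self.failed_links:
            return self.backup[link]
        return self.primary[link]

sw = Switch()
sw.install_link_rule("s1-s2", primary_port=1, backup_port=2)
print(sw.output_port("s1-s2"))  # 1: primary path while the link is up
sw.notify_failure("s1-s2")
print(sw.output_port("s1-s2"))  # 2: every flow on the link switches at once
```

The key property is that the amount of backup state grows with the number of links, not the number of flows.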
The control interfaces communicate with the data plane based on the demanded service. The network statistics module provides the interface to monitor network flow information and statistics. The topology discovery and management module manages information related to the network topology. The policy module describes how a network element should act based on the defined policies. The aim of the load balancing module is to estimate the load on the network and provide input to the OSH module. The routing module calculates the shortest path for each backup link. The OSH module queries the network management modules to collect network information. Based on the information collected, it calculates the optimal path for achieving improved recovery.
[Figure 2: Proposed architecture of the optimized self-healing mechanism. Control plane: the SDN controller (NOX) hosts the Optimized Self-healing Management module, the Rapid Recovery module (Flow, Action, and Resource Management), and the network management applications (Network Statistics Management, Topology Discovery & Management, Policy Management, Load Balancing, Routing Management). Data plane: switch-level management with a Forwarding Information Base and a Notification Module, connected to the controller via OpenFlow.]
When a new component is added to the network, the switch-level management modules handle the job of installing flow entries and configuring it. The OSH module is invoked by a failure notification from the data plane's notification module. By coordinating with the other network applications, it finds the optimal path for all flows affected by the link failure. The optimal path is constructed to provide a prescribed level of QoS for all existing services after a failure occurs within the network. However, the OSH module requires some time to compute a new optimal path. Therefore, to achieve fast failure recovery, the RR module is utilized. The RR module is capable of autonomously handling a link failure without much controller intervention by using a link protection scheme. Once the network quickly recovers from the failure using the data plane's RR mechanism, the OSH module optimizes the recovery.
3.2 Rapid Failure Recovery in OpenFlow Networks
For RR, we use a link protection scheme, which overcomes the challenges of path protection schemes such as deferred fault recovery, packet loss, and increased controller involvement in handling the failure. In link protection, when a failure occurs, the switch connected to the failed link routes the connection around the failed link to the neighboring node on a pre-computed shortest backup path. The switches directly connected to the failed link therefore perform immediate local recovery, which results in minimal recovery time and less packet loss. To implement the link protection scheme in a centralized OpenFlow [6] enabled network, we applied the group table concept [17]. The ability of a flow entry to point to a group table provides additional ways of forwarding. A group entry in a group table is associated with multiple action buckets, where each action bucket contains a set of actions to execute and its associated parameters.
Flow entries in a flow table point to a group using its unique group identifier. Each group entry consists of a group identifier, a group type, a counter, and a number of action buckets. The counter field counts the packets processed by the group. An action bucket contains a set of actions to execute and associated parameters. The group type determines the way in which the action buckets are executed. For the implementation of link protection, we use the fast failover group type. A group entry of the fast failover group executes a set of actions based on the alive status value of the port [16]. The fast failover group eliminates the need for controller involvement in performing RR. In a fast failover group, if an action bucket's alive status value is 0xffffffffL, it is declared unavailable. In that case, the group table executes the next available action bucket. The status of the action bucket depends on the port status.
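The fast-failover semantics described above can be modeled in plain Python: execute the first action bucket whose watched port is still alive. This is a behavioral sketch, not switch firmware; field and class names are illustrative:

```python
# Plain-Python model of an OpenFlow fast-failover group entry: the first
# action bucket whose watched port is alive is the one executed. Names are
# illustrative; real switches implement this in the datapath.

UNAVAILABLE = 0xFFFFFFFF  # sentinel liveness value marking a bucket as down

class FastFailoverGroup:
    def __init__(self, group_id, buckets):
        self.group_id = group_id
        self.buckets = buckets    # ordered list of (watch_port, actions)
        self.port_status = {}     # port -> liveness value

    def set_port_status(self, port, status):
        self.port_status[port] = status

    def execute(self):
        """Return the actions of the first bucket whose watched port is alive."""
        for watch_port, actions in self.buckets:
            if self.port_status.get(watch_port, 0) != UNAVAILABLE:
                return actions
        return None               # no live bucket: the packet is dropped

group = FastFailoverGroup(1, [
    (1, ["output:1"]),                  # B1: primary link
    (2, ["push_vlan", "output:2"]),     # B2: backup path
])
print(group.execute())                  # B1 selected while port 1 is alive
group.set_port_status(1, UNAVAILABLE)   # primary link fails
print(group.execute())                  # failover to B2, no controller involved
```

Because bucket selection depends only on locally observed port status, the failover decision never leaves the switch.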
Flow Table:
  Rule | Ingress port | IP Destination | Ether type | Instructions
  -----|--------------|----------------|------------|-----------------------------
  1    | 1            | 192.168.1.1    | 0x0800     | Forward packets to Group #1
  2    | 2            | 192.168.1.2    | 0x0800     | Forward packets to Group #1
  ---  | ---          | ---            | ---        | ---

Group Table:
  Group ID | Group Type    | Action Buckets
  ---------|---------------|------------------------------------------------------
  1        | Fast Failover | B1: Output to port 1 (Primary Link)
           |               | B2: Push VLAN tag and output to port 2 (Backup Path)
  2        | ...           | ...

Figure 3: Group table concept
When a link failure occurs, the fast failover group executes the next available action bucket, which outputs the packet to an intermediate switch on the backup path. The switch therefore autonomously performs immediate link recovery without any controller intervention. The functionality of the fast failover group is illustrated in Figure 3. On receipt of a packet, an OpenFlow switch extracts its match fields and starts the flow table lookup. If the packet's IP destination address, ingress port, and EtherType field match flow rule 1, the packet is forwarded to group 1. Similarly, a flow of packets with destination IP address 192.168.1.2, ingress port 2, and EtherType 0x0800 matches flow rule 2 and is forwarded to group 1. The group table executes action bucket B1 and outputs the incoming packets forwarded by flow rules 1 and 2 to output port 1. In case of a link failure, the status of bucket B1 becomes unavailable. The group table detects the changed status of action bucket B1 and executes the next available action bucket, B2. Action bucket B2 executes its associated action, which pushes a VLAN tag onto the packet and forwards it to output port 2.
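The flow-table lookup step of this walkthrough can be sketched as a simple exact-match table whose instruction is "forward to group". The table values mirror Figure 3; the function name is illustrative:

```python
# Sketch of the flow-table lookup from Figure 3: a packet's match fields
# (ingress port, destination IP, EtherType) select a flow rule, and the
# rule's instruction forwards the packet to a group. Values mirror Figure 3.

FLOW_TABLE = [
    # (ingress_port, ip_dst, eth_type) -> group id
    ((1, "192.168.1.1", 0x0800), 1),
    ((2, "192.168.1.2", 0x0800), 1),
]

def lookup(ingress_port, ip_dst, eth_type):
    for match, group_id in FLOW_TABLE:
        if match == (ingress_port, ip_dst, eth_type):
            return group_id
    return None  # table miss: normally the packet goes to the controller

print(lookup(1, "192.168.1.1", 0x0800))  # rule 1 -> group 1
print(lookup(2, "192.168.1.2", 0x0800))  # rule 2 -> the same group
print(lookup(3, "10.0.0.1", 0x0800))     # no matching rule
```

Note that both rules point at the same group, which is what lets one fast-failover group entry protect every flow crossing the link.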
Image: [2]
Problems with SDN for IoT: Self-Healing
• Optimized Self-Healing (OSH) [2]• Topology Discovery manages routing topology in place.• Load Balancing module estimates the network load.• Routing Module calculates the shortest path for each backup link.• On Failure
• Receive Notification from Data Plane• Calculate a new Optimal Path• Validate Path WRT QoS• Send new Flow Routing information.
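The on-failure steps above can be sketched as a small controller-side routine: update the topology view, recompute a path for each affected flow, and validate it against a QoS bound before pushing new routing. Every name, the toy topology, and the hop-count QoS check are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical sketch of the OSH on-failure sequence: the data plane's rapid
# recovery has already happened; the controller then computes an optimal path
# per affected flow and validates it against a QoS bound before installing it.
# All names, the toy topology, and the hop-count QoS check are illustrative.

from itertools import permutations

def shortest_path(links, src, dst):
    """Brute-force fewest-hop path over undirected links (toy scale only)."""
    nodes = {n for link in links for n in link}
    for k in range(2, len(nodes) + 1):
        best = None
        for perm in permutations(nodes - {src, dst}, k - 2):
            path = (src, *perm, dst)
            hops = zip(path, path[1:])
            if all(frozenset(h) in links for h in hops):
                best = path
        if best:
            return best   # shortest k found first
    return None

def osh_on_failure(failed_link, flows, links, max_hops):
    links = links - {frozenset(failed_link)}         # update topology view
    updates = []
    for flow in flows:
        if frozenset(failed_link) not in map(frozenset, flow["hops"]):
            continue                                 # flow unaffected
        path = shortest_path(links, flow["src"], flow["dst"])
        if path and len(path) - 1 <= max_hops:       # validate path w.r.t. QoS
            updates.append((flow["id"], path))       # new routing to push out
    return updates

links = {frozenset(l) for l in [("s1", "s2"), ("s2", "s3"), ("s1", "s4"), ("s4", "s3")]}
flows = [{"id": "f1", "src": "s1", "dst": "s3", "hops": [("s1", "s2"), ("s2", "s3")]}]
print(osh_on_failure(("s1", "s2"), flows, links, max_hops=3))
```

The split of work matters: RR flips traffic to the backup immediately in the data plane, while this slower optimization runs afterwards without being on the recovery-time critical path.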
The control interfaces communicate with data plane basedon the demanded service. Network statistics module pro-vides the interface to monitor the network flow informationand statistics. Topology discovery and management modulemanages the information related to the network topology.The policy module describes how a network element shouldact based on the defined policies. The aim of the load bal-ancing module is to estimate the load on the network andprovide input to the OSH module. The routing module cal-culates the shortest path for each of the backup link. OSHmodule queries the network management modules to col-lect the network information. Based on the informationcollected, it calculates the optimal path for achieving theimproved recovery.
SDN Controller (NOX)
Rapid Recovery Module
Flow Management
Action Management
Resource Management
Control Plane
Data
PlaneOptimized
Self-healing Management
Topology Discovery & Management
Policy Management
Load Balancing
Routing Management
Netw
ork M
anagement
Application
Switch
Level
Mgm
t.
Network Statistics
Management
Forwarding Information
Base
OpenFlow
Notification Module
Figure 2: Proposed architecture of optimized self-healing mechanism
When a new component is added into the network, themodules at the switch level management handles the job ofinstalling flow entries and configuring it. The OSH modulegets invoked by a failure notification from the notificationmodule of data plane. By coordinating with other networkapplications, it finds the optimal path for all the flows whichare affected by the link failure. The optimal path is con-structed to provide a prescribed level of QoS for all the ex-isting services after a failure occur within the network. But,the OSH module requires some time to compute a new op-timal path. Therefore, to achieve a fast failure recovery, RRmodule is utilized. RR module is capable of autonomouslyhandling the link failure without much intervention of thecontroller by using link protection scheme. Once the net-work quickly recovers from the failure using RR mechanismof the data plane, the OSH module tries to optimize therecovery.
3.2 Rapid Failure Recovery in OpenFlow Net-work
For RR, we used a link-protection scheme, which overcomes the challenges of path-protection schemes such as deferred fault recovery, packet loss, and increased controller involvement in handling the failure. In link protection, when a failure occurs, the switch connected to the failed link routes the connection around the failed link to the neighboring node that is part of a pre-computed shortest backup path. The switches directly connected to the failed link therefore perform immediate local recovery, which results in minimum recovery time and less packet loss. To implement the link-protection scheme in a centralized OpenFlow [6] enabled network, we applied the group table concept [17]. The ability of a flow entry to point to a group table provides additional ways of forwarding.
The flow entries in a flow table point to a group with a unique group identifier. Each group entry consists of a group identifier, a group type, a counter, and a number of action buckets. The counter field counts the packets processed by the group; each action bucket contains a set of actions to execute along with its associated parameters. The group type determines the way in which the action buckets are executed. For the implementation of link protection, we use the fast-failover group type. A group entry of the fast-failover group executes a set of actions based on the alive-status value of the port [16], eliminating the need for controller involvement in performing RR. In a fast-failover group, if an action bucket's alive-status value is 0xffffffffL, the bucket is declared unavailable and the group table executes the next available action bucket; the status of an action bucket depends on its port status.
Flow Table:
Rule | Ingress port | IP Destination | Ether type | Instructions
1    | 1            | 192.168.1.1    | 0x0800     | Forward Packets to Group #1
2    | 2            | 192.168.1.2    | 0x0800     | Forward Packets to Group #1

Group Table:
Group ID | Group Type    | Action Buckets
1        | Fast Failover | B1: Output to port 1 (Primary Link); B2: Push VLAN tag and output to port 2 (Backup Path)
2        | ...           | ...
Figure 3: Group table concept
When a link failure occurs, the fast-failover group executes the next available action bucket, which outputs the packet to an intermediate switch of the backup path. The switch therefore performs immediate link recovery autonomously, without any intervention of the controller. The functionality of the fast-failover group is illustrated in Figure 3. On receipt of a packet, an OpenFlow switch extracts its match fields and starts a flow-table lookup. If the packet's IP destination address, ingress port, and EtherType field match flow rule 1, the packet is forwarded to group 1. Similarly, a flow of packets with destination IP address 192.168.1.2, ingress port 2, and EtherType 0x0800 matches flow rule 2 and is also forwarded to group 1. The group table executes action bucket B1 and sends the incoming packets forwarded by flow rules 1 and 2 to output port 1. In case of a link failure, the status of bucket B1 becomes unavailable; the group table detects the changed status of action bucket B1 and executes the next available action bucket B2, which pushes a VLAN tag onto the packet and forwards it to output port 2.
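The failover behavior just described, execute the first action bucket whose watched port is alive and otherwise fall through to the next, can be sketched in a few lines of Python. This is a toy model; the bucket structure, port numbers, and action strings are illustrative placeholders, not an OpenFlow API.

```python
# Toy model of an OpenFlow fast-failover group: the first action bucket
# whose watch port is alive is executed. Bucket layout is illustrative.

def select_bucket(buckets, live_ports):
    """Return the first bucket whose watch port is up, else None."""
    for bucket in buckets:
        if bucket["watch_port"] in live_ports:
            return bucket
    return None

group_1 = [
    {"watch_port": 1, "actions": ["output:1"]},                  # B1: primary link
    {"watch_port": 2, "actions": ["push_vlan:10", "output:2"]},  # B2: backup path
]

# Both ports up: primary bucket B1 is chosen.
assert select_bucket(group_1, {1, 2})["actions"] == ["output:1"]

# Port 1 fails: the switch falls back to B2 with no controller round trip.
assert select_bucket(group_1, {2})["actions"] == ["push_vlan:10", "output:2"]
```

The key property is that the selection happens entirely in the data plane: no message to the controller is needed between the port going down and bucket B2 taking over.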
Image: [2]
Problems with SDN for IoT: Self-Healing
• Evaluation of Self-Healing
• Model Evaluation
• 99% reduction in the backup flows that need to be stored.
• Immediate restoration of service with a possibly sub-optimal backup path.
• The optimal backup path is pushed after the controller calculates it.
[Figure content: six switches A–F; Flows 1–3 traverse the link between B and C (Link ID #1: C→B, Link ID #2: B→C). Flow 1's packet header: IP src 192.168.1.1, IP dst 192.168.1.8; Flow 2's: IP src 192.168.1.2, IP dst 192.168.1.9. Flow Table 1 at B and C forwards matching 0x0800 packets to Group #1. The Fast Failover group tables at B and C hold bucket B1 (output to the primary link) and bucket B2 (push VLAN tag 2 at B / tag 1 at C and output to the backup path). Flow Table 0 at each switch strips the incoming VLAN tag from backup-path packets and forwards them to Flow Table 1, while an intermediate switch's table (Flow Table F) matches solely on VLAN ID with wildcarded IP fields, forwarding tag 2 to port 2 and tag 1 to port 1.]
Figure 4: Link protection mechanism
In reactive link protection, the controller can intervene to remove the outdated flow rules to speed up the recovery. In this case, when a failure is detected (T_FD), a failure notification is sent to the controller. The controller searches for a disrupted flow (T_DFS) and sends a flow-modification message (T_FM) to the switch to modify the outdated flow rules; this process is repeated for each disrupted flow. When the switch receives the flow-modification message, all matching rules in the flow table are modified in time T_UPDATE. The propagation time (T_PROP) of the failure notification message from switch to controller also contributes to the recovery process. The recovery time (T_R) taken by this scheme is expressed by equation 2.
T_R = T_FD + Σ_{f=0}^{N} (T_DFS,f + T_FM,f + T_UPDATE,f) + T_PROP    (2)
Our proposed RR scheme is reactive in nature. After failure detection, the affected switch handles the flow rerouting without any controller intervention. The time complexity of our RR protection scheme therefore depends on the time a switch takes to detect a failure (T_FD) and the time to change the alive status (T_AS) of the group entries corresponding to the failed link. According to Sharma et al. [8], a switch takes approximately 5.8 microseconds to modify the alive status of one group entry. The recovery time (T_R) taken by our RR scheme is given by equation 3.
T_R = T_FD + T_AS    (3)
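Equations 2 and 3 can be compared numerically. In the sketch below, only the 5.8 µs (0.0058 ms) alive-status update time comes from Sharma et al. [8]; every other timing constant is an assumed placeholder, and the per-flow terms of equation 2 are simplified to a common value.

```python
# Recovery-time models from equations 2 and 3. All times in milliseconds.
# The 0.0058 ms alive-status time is from [8]; other values are assumed.

def reactive_recovery(t_fd, n_flows, t_dfs, t_fm, t_update, t_prop):
    """Eq. 2: T_R = T_FD + sum over disrupted flows + T_PROP
    (simplified: identical per-flow times)."""
    return t_fd + n_flows * (t_dfs + t_fm + t_update) + t_prop

def rr_recovery(t_fd, t_as):
    """Eq. 3: T_R = T_FD + T_AS; no controller round trip."""
    return t_fd + t_as

t_fd = 10.0    # assumed failure-detection time
t_as = 0.0058  # ~5.8 us per group entry [8]

# With 100 disrupted flows, the controller-driven scheme pays a per-flow
# cost that the local RR scheme avoids entirely.
assert rr_recovery(t_fd, t_as) < reactive_recovery(
    t_fd, n_flows=100, t_dfs=0.1, t_fm=0.1, t_update=0.1, t_prop=1.0)
```

The gap grows linearly with the number of disrupted flows for the reactive scheme, while the RR recovery time stays constant.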
4.2 Calculating Memory Size Requirement
In traditional protection schemes, the controller pre-installs the flow rules for the backup path. For each disrupted flow of a primary link, a backup flow rule must be present in the switches along the backup path. Therefore, the number of flow rules in a backup-path switch (NBF) equals the number of disrupted flows (NDF), i.e., NBF = NDF. This approach is not suitable for networks with thousands of flows because of the memory constraints of switch hardware. For 100+ disrupted flows, our RR module reduces the backup path's flow entries in a switch by more than 99 percent and saves switch memory, which yields smaller flow tables and faster table lookup. The RR module compresses all flows sharing the output port that corresponds to the failed link into one wildcard flow rule. Therefore, the number of flow rules in an intermediate switch of the backup path is NBF = 1. Figure 5 graphically shows the compression achieved by the RR module (Nbf-RR) over the traditional scheme (Nbf); the graph is plotted using the above observation.
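The aggregation argument reduces to simple arithmetic; a minimal sketch, with the flow counts chosen purely for illustration:

```python
# Backup-path flow entries with and without RR's wildcard aggregation.
# Traditional protection: one backup rule per disrupted flow (N_BF = N_DF).
# RR: all flows sharing the failed link's output port collapse into a
# single wildcard rule (N_BF = 1).

def reduction_percent(n_disrupted_flows):
    """Percent reduction in backup flow entries achieved by RR."""
    n_bf_traditional = n_disrupted_flows  # N_BF = N_DF
    n_bf_rr = 1                           # one wildcard rule
    return 100.0 * (n_bf_traditional - n_bf_rr) / n_bf_traditional

assert reduction_percent(100) == 99.0  # matches the 100+ flow claim above
```

The reduction approaches 100% as the number of disrupted flows grows, which is why the savings are most pronounced in networks carrying thousands of flows.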
5. CONCLUSION AND FUTURE WORK
An OSH model for SDN is proposed in this paper. The proposed model is based on the autonomic principles of autonomous systems. The aim of our model is to achieve optimal failure recovery. We presented the analytical model of our RR scheme and showed how it achieves quick recovery with low disruption time while reducing the backup flow entries per switch. The backup-path flow aggregation enabled the 99 percent reduction in flow entries per switch.
Recovery Time for standard Reactive Link = Fail Detect Time + Propagation Time + the sum, over all disrupted flows, of (the time to detect the disruption + the time to send the modification message + the time to update the flow table).
Recovery Time for Rapid Recovery = Fail Detect Time + Time to change Alive Status
Change of status takes approximately 5.8 microseconds.
Image: [2]
Problems with SDN for IoT: Self-Healing
• Evaluation of Self-Healing
• Smart Grid Approach [5]
• Requires coordination between the Data Plane and Control Plane post-failure for backup route installation.
• Would benefit from [2]
• Novel Additions for Self-Healing
• Multiple QoS Levels
• New flows are added to links based on the ability to ensure QoS for all current flows.
• Lower-priority flows may be moved to different links.
• Dedicated flows get their own links.
[Figure content: data rate [Mbps] over time [s] (0–200 s, 0–100 Mbps) at Switch 2 (top) and Switch 3 (bottom); traces for MMS traffic, background data transfer, background real-time traffic, and total traffic. Annotations mark the re-routing of background data traffic at Switch 2 and the minimum data-rate guarantee for smart grid traffic at Switch 3.]
Fig. 5: Scenario 2a: Load Management at Switches 2 and 3
to enqueue traffic flows using these queues, mapping the QoS requirements of different traffic classes to equivalent priorities. The traffic class of an arriving packet is identified by matching against specified rules such as the packet's application-layer protocol. Table I contains the traffic classes along with assumptions on minimum data-rate and latency requirements considered for this experiment. Both of the following QoS approaches have been evaluated by successively inserting the traffic flows detailed hereafter into the network (cf. Figure 2): background data transfer from Server 2 to Client 1 (TCP/FTP-based), background real-time traffic from Server 1 to Client 1 (UDP-based), and MMS reports from Server 2 to Client 1. In addition, the network is set to a 100 Mbps maximum data rate, of which 5 Mbps are reserved for network control at all times, leaving an effective data rate of 95 Mbps.
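The rule-based classification described above can be sketched as a lookup from application-layer protocol to a priority and minimum-rate requirement. The class names and numbers here are illustrative placeholders, not the paper's Table I:

```python
# Toy classifier: map an arriving packet's application protocol to a
# traffic class carrying a priority and a minimum-rate requirement.
# Class names and values are assumptions for illustration.

TRAFFIC_CLASSES = {
    "mms": {"priority": 3, "min_rate_mbps": 10},  # time-critical smart grid
    "rtp": {"priority": 2, "min_rate_mbps": 20},  # background real-time
    "ftp": {"priority": 1, "min_rate_mbps": 5},   # background data transfer
}

DEFAULT_CLASS = {"priority": 0, "min_rate_mbps": 0}

def classify(packet):
    """Return the traffic class for a packet, defaulting to lowest priority."""
    return TRAFFIC_CLASSES.get(packet.get("app_proto"), DEFAULT_CLASS)

assert classify({"app_proto": "mms"})["priority"] == 3
assert classify({"app_proto": "telnet"})["priority"] == 0
```

In the paper's setup this mapping is what lets the controller compare a new flow's demand against what is already active on each link.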
Scenario 2a: Reserved Data Rate for Time-Critical Services
This QoS approach aims at reserving data rate for time-critical services by guaranteeing minimum data rates for each traffic flow and re-routing lower-priority traffic flows whenever the available data rate of a link might be exceeded. The SDN-Controller therefore keeps track of all active traffic flows within the network along with their routes and priorities. For this approach, the following algorithm has been developed:
1) Shortest Path Calculation: First, the shortest route is calculated for arriving packets.
2) Data Rate Demand: For each link, the SDN-Controller determines the QoS demand of the new flow and compares its associated minimum data rate with the available data rate of the link.
3) Priority Grouping: If the data rate is sufficient on all links, the flow is added; otherwise the priority of the new flow is compared with those already active on the respective link. Subsequently, flows are grouped into flows with higher or same priority and those with lower priority.
4) Re-routing of New Flow: If the link capacity is insufficient for the new and all higher/same-priority flows, the second-best route is calculated and tested, restarting at step (2).
[Figure content: data rate [Mbps] over time [s] (0–200 s, 0–100 Mbps) at Switch 2 (top) and Switch 3 (bottom); traces for MMS traffic, background data transfer, background real-time traffic, and total traffic. Annotations mark the re-routing of background data and real-time traffic at Switch 2 and the link reservation for smart grid traffic at Switch 3.]
Fig. 6: Scenario 2b: Link Reservation at Switch 3
5) Sorting of Lower Priority Flows: Otherwise, lower-priority flows are sorted into a list depending on their overlap with the route of the new flow and their priority, defining an order for re-routing.
6) Determination of Lower Priority Flows for Re-Routing: While the link capacity is insufficient for the higher-, same-, and remaining lower-priority flows, the next lower-priority flow is marked for re-routing and removed from the list.
7) Unmodified Lower Priority Flows: All flows remaining in the list maintain their current route.
8) Calculation of Alternative Routes: For flows that have to be shifted, new routes are calculated on a virtual topology that excludes the links of the new flow's route. Afterwards, for each pair of flow and alternative route, all steps starting from (2) are repeated.
9) Drop of Lower Priority Flows: If no alternative routes are available, flows are dropped.
10) Re-routing of Lower Priority Flows: Otherwise, OFFlowMod messages are prepared to establish new flow entries at the switches for the affected lower-priority traffic flows.
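Condensed to a single link, steps (2)–(6) amount to shifting the lowest-priority flows until the new flow and all higher/same-priority flows fit. A sketch with an assumed flow format (names, rates, and the capacity are illustrative):

```python
# Single-link sketch of the admission logic in steps (2)-(6): admit a
# new flow if capacity suffices; otherwise mark lowest-priority flows
# for re-routing until it does. Flow format is an assumption.

def plan_admission(link_capacity, active_flows, new_flow):
    """Return the lower-priority flows to re-route, or None if the new
    flow cannot fit even after shifting all of them."""
    higher_or_same = [f for f in active_flows
                      if f["priority"] >= new_flow["priority"]]
    lower = sorted((f for f in active_flows
                    if f["priority"] < new_flow["priority"]),
                   key=lambda f: f["priority"])  # shift lowest priority first
    needed = new_flow["rate"] + sum(f["rate"] for f in higher_or_same)
    remaining = sum(f["rate"] for f in lower)
    to_move = []
    while lower and needed + remaining > link_capacity:
        victim = lower.pop(0)
        remaining -= victim["rate"]
        to_move.append(victim)
    if needed + remaining > link_capacity:
        return None  # even shifting every lower-priority flow fails
    return to_move

flows = [{"name": "ftp", "priority": 1, "rate": 60},
         {"name": "rtp", "priority": 2, "rate": 30}]
mms = {"name": "mms", "priority": 3, "rate": 10}
# 95 Mbps effective capacity: the FTP transfer must be shifted.
assert [f["name"] for f in plan_admission(95, flows, mms)] == ["ftp"]
```

In the full algorithm, each flow returned here would then go through steps (8)–(10): route recomputation on a virtual topology and OFFlowMod installation, or a drop if no alternative exists.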
Figure 5 shows the active transmissions and their corresponding data rates at Switch 2 (top) and Switch 3 (bottom) when applying this approach. At the beginning, background data traffic is transmitted via Switch 3, exploiting all available data rate. After 30 s, background real-time traffic is added to the network. Since the link between Switches 3 and 4 is used by this transmission as well, the data rate of the background data traffic is reduced accordingly. Next, the background data traffic must be re-routed via Switches 1 and 2 in order to enable high-priority MMS reports on the same link after 105 s. Otherwise, adding MMS reports would cause the sum of minimum data-rate requirements to exceed the available capacity of the link between Switches 3 and 4, which is prevented by shifting the lowest-priority traffic flow.
Scenario 2b: Dedicated Links for Time-Critical Services
For the second approach, critical services have been specified which are granted dedicated links for data exchange. Accordingly, the previous algorithm has been extended as follows:
2014 IEEE International Conference on Smart Grid Communications
Images: [5]
IV. TEST SETUP AND RESULTS OF SDN-ENABLED SMART GRID COMMUNICATION
Starting with the testbed setup, this section reveals extended possibilities of SDN-based networks by means of two different scenarios. Following the scenario descriptions, algorithms for establishing the required capabilities at the SDN-Controller and analysis results are presented.
A. Testbed Setup
The SDN4SmartGrids testbed comprises four switches, one SDN-controller, two servers for generating traffic, and one client for receiving traffic, as shown in Figure 2. As switches, four identical workstations have been used, running Open vSwitch 2.1.0 [4] as a kernel module on 64-bit Ubuntu 13.04 Server (3.8.0-30-generic kernel). Each switch is equipped with one integrated Intel I217-LM 1000Base-T (IEEE 802.3ab) Ethernet controller for connecting to the control network and one 4-port Intel I350 1000Base-T Ethernet controller for operating within the data network. The SDN-controller has been developed by enhancing the OpenFlow controller Beacon 1.0.4 [12], based on Java JDK 1.7 and running on Windows 7 x64. Client and server workstations connect to the data network with onboard 1000Base-T RTL8167 Realtek adapters, likewise using Windows 7 x64 as the OS. IEC 61850 compliant traffic has been generated using the open-source library libIEC61850 [13], written in standard C, for MMS reports, as well as a C implementation of the SV service for SV messages, developed at the Communication Networks Institute. For testing, both SV and MMS messages are sent in intervals of 1 ms, using packet sizes of 122 Byte and 684 Byte, respectively.
B. Scenario 1: Fast Recovery for Smart Grid Communications
This scenario deals with enabling fast recovery after the disturbance of a communication link. Providing such functionality is of great importance for ensuring the reliable operation of communication networks in critical environments such as substations of power systems. In particular, alternative routes through the network need to be established immediately, guaranteeing the transmission of monitoring and control traffic. Therefore, a proactive algorithm for calculating alternative paths through the network has been developed and integrated into the SDN-Controller. The algorithm's performance has been assessed by measuring the duration of traffic interruption as well as processing times at the controller and switches. First, a brief description of the algorithm, which is applied for dealing with communication-link disconnection, is given:
[Figure content: the SDN Controller connects via the control network to Switches 1–4, which form the data network; Server 1, Server 2, and Client 1 attach to the data network.]
Fig. 2: Setup of the SDN4SmartGrids Testbed with Data and Control Network
[Figure content: three server–client TCP message sequences (Data, ACK, DupACK, RTO timers) under a link disturbance, illustrating MMS recovery Cases 1, 2, and 3.]
Fig. 3: MMS-TCP Flowchart Showing Different Recovery Cases
1) Alternative shortest paths: Alternatives are calculated for each pair of route and possible link failure, using alternative topologies which exclude the respective link.
2) Mapping to switch configurations: The alternative paths are converted into switch configurations, preparing corresponding OFFlowMod messages for adding, reconfiguring, or deleting traffic-flow entries at the switches.
3) Monitoring of active flows: The SDN-Controller keeps track of active traffic flows and their routes, considering OFFlowRemoved messages received from switches.
4) Port status notification: In case of a link failure, an OFPortStatus message is issued by the switches connected to the respective link and sent to the SDN-Controller.
5) Re-routing: Pre-calculated alternative routes are looked up for all affected traffic flows and corresponding OFFlowMod messages are sent out to the switches for re-routing.
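Step (1) amounts to computing, for each link on a primary route, a shortest path on the topology with that link excluded. A minimal BFS sketch; the four-switch topology mirrors Figure 2's two-branch layout, but the names and structure are assumptions:

```python
# Sketch of step (1): precompute, for each link on a flow's primary
# route, a backup shortest path on the topology with that link removed.
# BFS over hop count; topology is illustrative.
from collections import deque

def shortest_path(adj, src, dst, banned_link=None):
    """BFS shortest path src -> dst, optionally excluding one undirected link."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return list(reversed(path))
        for nxt in adj[node]:
            if nxt not in prev and frozenset((node, nxt)) != banned_link:
                prev[nxt] = node
                queue.append(nxt)
    return None  # dst unreachable without the banned link

# Switch 1 reaches Switch 4 via Switch 2 (upper path) or Switch 3 (lower path).
adj = {
    "SW1": ["SW2", "SW3"],
    "SW2": ["SW1", "SW4"],
    "SW3": ["SW1", "SW4"],
    "SW4": ["SW2", "SW3"],
}

primary = shortest_path(adj, "SW1", "SW4")
# One precomputed alternative per link of the primary route.
alternatives = {
    frozenset(link): shortest_path(adj, "SW1", "SW4", banned_link=frozenset(link))
    for link in zip(primary, primary[1:])
}
# Failure of the first hop re-routes traffic over the other branch.
assert alternatives[frozenset((primary[0], primary[1]))] == ["SW1", "SW3", "SW4"]
```

Because the lookup table is built ahead of time, step (5) reduces to a dictionary lookup followed by sending the prepared OFFlowMod messages, which is what keeps the recovery fast.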
Applying this algorithm, multiple measurements have been executed to verify and analyse its effect. The experimental design of this scenario can be conceived by means of Figure 2: MMS reports and SV messages, both containing measurement values, are transmitted from Server 1 to Client 1 using either the upper path (via Switch 2) or the lower path (via Switch 3). During transmission, one of the active communication links is disconnected either a) by physical disconnection of an interface or b) by software command. Figure 4 (top) shows the overall recovery times of SV and MMS traffic, in terms of a cumulative distribution function, considering both cases. In the case of SV messages and link disconnection by command, the mean downtime of transmission amounts to 87.16 ms (median: 85.17 ms), whereas for physical link disconnection the mean delay increases to 360.64 ms (median: 358.80 ms). Hence, port-status detection by the OS induces an additional delay in the range of 210 to 305 ms. A more complex behaviour can be observed for TCP-based MMS reports due to reliability mechanisms, which apply Acknowledgements (ACKs) and retransmissions. Accordingly, recovery time depends on the following TCP-specific parameters, which are set to the Windows 7 default configuration: Retransmission Time-Out (RTO) (300 ms), acknowledgement frequency (2 packets), and delayed acknowledgement timer (50 ms). Therefore, Figure 3 distinguishes three different cases which might occur when a link is disconnected during TCP-based communication, explaining the effects encountered in Figure 4. MMS Case 1: In Case 1, the link is disconnected before or during the transmission window's first packet transfer. The packet will not be received by the client and no ACK is issued, wherefore retransmission begins after the RTO elapses. This case applies to MMS reports with recovery times in the range of 300 ms after the link is disconnected by command.
Problems with SDN for IoT: Self-Optimization
• Objective
• Reduce energy usage for network infrastructure. [1]
• Accounted for 8% of total energy consumption in 2008, projected to reach 14% by 2020.
• Motivating Application
• Campus Network
• Existing infrastructure requires 24/7 uptime on all networking hardware, configured to handle peak traffic.
• Real traffic occurs in patterns.
• Campus traffic is significantly higher during the day.
Image: Institute for Communication Systems, University of Surrey, UK. http://www.surrey.ac.uk/ics/activity/facilities/futureinternet/
Problems with SDN for IoT: Self-Optimization
• Determine Minimum Switches to Power
• NP-Hard, but can be formulated as a MILP
• Multiple Constraints
• Strategic Greedy Heuristic
• Looks for best paths on active nodes; will activate other nodes if needed.
• Shortest Shortest Path First (SPF)
• Longest Shortest Path First (LPF)
• Smallest Demand First (SDF)
• Highest Demand First (HDF)
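The four orderings differ only in the sort key applied to the demand list before the greedy routing pass; a sketch of the ordering step, with an assumed demand format:

```python
# Ordering step of the Strategic Greedy heuristic: sort demands before
# greedily routing them over already-active nodes. Path lengths stand in
# for shortest-path hop counts; the demand list is illustrative.

def order_demands(demands, strategy):
    key = {
        "SPF": lambda d: d["path_len"],   # shortest shortest path first
        "LPF": lambda d: -d["path_len"],  # longest shortest path first
        "SDF": lambda d: d["rate"],       # smallest demand first
        "HDF": lambda d: -d["rate"],      # highest demand first
    }[strategy]
    return sorted(demands, key=key)

demands = [{"id": "a", "path_len": 2, "rate": 40},
           {"id": "b", "path_len": 5, "rate": 10},
           {"id": "c", "path_len": 3, "rate": 25}]

assert [d["id"] for d in order_demands(demands, "LPF")] == ["b", "c", "a"]
assert [d["id"] for d in order_demands(demands, "HDF")] == ["a", "c", "b"]
```

Intuitively, LPF places the hardest-to-route (longest-path) demands first, while the network is least constrained, which matches its strong showing in the paper's results.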
Fig. 2. Campus network to be investigated
The campus network is built of four different switch types: core, distribution, WAN, and access switches, denoted by decreasingly sized red dots in this order. Table I displays the energy consumption of the four different types of switches used in the campus network evaluation. The last column, Bandwidth (Gbit/s), corresponds to the bandwidth of the four link types used within the campus network. We bundle multiple real links into one logical link, so that shutting down one of these logical links is equivalent to saving the power usage of one respective line card inserted in the chassis of the switch. These four types can be distinguished by the thickness of the line representing the logical link in figure 2. Core links interconnect core switches, whereas distribution links connect the core switches to the distribution nodes. WAN links can be found between the distribution layer and the WAN switches in the bottom left and right hand corners.
TABLE I. HARDWARE INFORMATION OF THE CAMPUS NETWORK
Type          Switch e.c. (W)   Line Card e.c. (W)   Bandwidth (Gbit/s)
Core          2000              150                  400
Distribution  1000              80                   150
WAN           700               50                   100
Access        200               -/- (5 W Port)       50
As we can see from Table I, core link bandwidth is not proportional to the line-card energy usage of the distribution and WAN switches. The reason for this is that core line cards usually incorporate optical links, which consume less energy than copper-wired port links.
In contrast to the campus network, our mesh network is built in a homogeneous manner. More specifically, each node consumes the same amount of energy and each link interconnecting two nodes provides 150 Gbit/s of bandwidth.
A. Proof of concept
We randomly generated traffic demands at three levels: low traffic load (nighttime traffic), medium traffic load (average daytime traffic), and high traffic load (yearly peak traffic). Traffic utilization during the day roughly triples compared to the demands at nighttime [16]. Accordingly, we set the medium traffic load to be three times higher than the low traffic load, and peak traffic to be five times higher.
Fig. 3. Optimal network configuration for a low traffic utilization at night
Figure 3 displays an example of the traffic flow and network state in a low-traffic-utilization scenario. This traffic demand distribution corresponds to the low demand occurring at nighttime. Yellow nodes display nodes with demands, whereas blue nodes denote switches turned on to forward flows. Blue links denote turned-on line cards and bundles of links of the respective switches. Everything colored grey is currently shut down. An example of a campus network
Fig. 4. Optimal network configuration for a mid traffic utilization during daytime
configuration at daytime traffic load is displayed in figure 4. In Figures 5 and 6, we can see the potential energy savings in a campus network and a mesh network, respectively. These are the results from the Cplex solver and the heuristic LPF. At nighttime, we are able to save up to ca. 45% of energy compared to an "always online" network. An important aspect to highlight is that in realistic mesh networks (e.g. WANs) the average network utilization is just around 30% [17], and therefore even higher energy savings are possible.
The Strategic Greedy LPF method results in an optimal solution for low traffic load. In the case of medium and high traffic load, it is ca. 7% worse than the MILP in the mesh network and ca. 4% worse in the campus network. Furthermore, we observe that the
2014 IEEE 3rd International Conference on Cloud Networking (CloudNet)
158
Image: [1]
Problems with SDN for IoT: Self-Optimization
• Determine Minimum Switches to Power
  • NP-Hard, but can be formulated as a MILP
  • Multiple Constraints
• Strategic Greedy Heuristic
  • Looks for best paths on active nodes; activates other nodes only if needed.
  • Shortest Shortest Path First (SPF)
  • Longest Shortest Path First (LPF)
  • Smallest Demand First (SDF)
  • Highest Demand First (HDF)
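The strategic greedy idea above can be sketched in a few lines: order the demands (longest shortest path first for LPF), then route each one over a graph where already-active links are nearly free, so later paths tend to reuse powered-on gear. This is an illustrative sketch under assumed names, not the paper's implementation; the topology and demands are made up:

```python
import heapq

def dijkstra(adj, src, dst, active):
    """Shortest path where an already-active link costs a small epsilon
    and an inactive link costs 1 (it would have to be powered on)."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v in adj[u]:
            w = 0.01 if frozenset((u, v)) in active else 1.0
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

def strategic_greedy(adj, demands, longest_first=True):
    """Route demands one by one, reusing powered-on links when possible.
    Ordering by descending shortest-path length approximates LPF;
    ascending order gives SPF."""
    hops = lambda d: len(dijkstra(adj, d[0], d[1], set())) - 1
    active = set()  # powered-on links, stored as frozenset node pairs
    for s, t in sorted(demands, key=hops, reverse=longest_first):
        path = dijkstra(adj, s, t, active)
        active |= {frozenset(e) for e in zip(path, path[1:])}
    return active

# A 2x3 grid mesh (illustrative) with three demands:
adj = {1: [2, 4], 2: [1, 3, 5], 3: [2, 6], 4: [1, 5], 5: [2, 4, 6], 6: [3, 5]}
demands = [(1, 6), (2, 3), (4, 5)]
print(len(strategic_greedy(adj, demands, longest_first=True)))  # 4 links powered on
```

Routing the long (1, 6) demand first lets the short demands reuse its links; with `longest_first=False` the short demands activate links first, and a long demand may not be able to reuse them, which is the effect behind SPF's sub-optimal graphs.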
Fig. 2. Campus network to be investigated
The campus network is built from four different switch types: core, distribution, WAN, and access switches, denoted by decreasingly sized red dots in that order. Table I displays the energy consumption of the four types of switches used in the campus network evaluation. The last column, Bandwidth (Gbit/s), corresponds to the bandwidth of the four link types used within the campus network. We bundle multiple real links into one logical link, so that shutting down one of these logical links is equivalent to saving the power of one line card inserted in the chassis of the respective switch. The four link types can be distinguished by the thickness of the lines representing the logical links in Figure 2. Core links interconnect core switches, whereas distribution links connect the core switches to the distribution nodes. WAN links can be found between the distribution layer and the WAN switches in the bottom left and right corners.
TABLE I. HARDWARE INFORMATION OF THE CAMPUS NETWORK

Type          Switch e.c. (W)   Line Card e.c. (W)   Bandwidth (Gbit/s)
Core          2000              150                  400
Distribution  1000              80                   150
WAN           700               50                   100
Access        200               -/- (5 W per port)   50
As we can see from Table I, core link bandwidth is not proportional to line card energy usage when compared to the distribution and WAN switches. The reason is that core line cards usually incorporate optical links, which consume less energy than copper-wired port links.
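The power model implied by Table I sums chassis power over active switches and line-card power over active logical links; energy saving is then measured against an always-on configuration. A small sketch of that arithmetic (the example configuration is illustrative, not one of the paper's scenarios):

```python
# Per-switch-type power figures from Table I (chassis W, line card W).
POWER = {
    "core":         {"chassis": 2000, "line_card": 150},
    "distribution": {"chassis": 1000, "line_card": 80},
    "wan":          {"chassis": 700,  "line_card": 50},
    "access":       {"chassis": 200,  "line_card": 5},   # access: 5 W per port
}

def config_power(active_switches, active_links):
    """Total power (W): one entry per powered switch chassis in
    active_switches, one entry per powered line card in active_links."""
    watts = sum(POWER[s]["chassis"] for s in active_switches)
    watts += sum(POWER[s]["line_card"] for s in active_links)
    return watts

# Illustrative: two core switches with 3 of 8 core line cards powered on,
# versus the always-on configuration with all 8 line cards active.
night = config_power(["core", "core"], ["core"] * 3)
always_on = config_power(["core", "core"], ["core"] * 8)
print(night, always_on, round(100 * (1 - night / always_on)))  # 4450 5200 14
```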
In contrast to the campus network, our mesh network is built in a homogeneous manner: each node consumes the same amount of energy, and each link interconnecting two nodes provides 150 Gbit/s of bandwidth.
A. Proof of concept
We randomly generated traffic demands at three levels: low traffic load (nighttime traffic), medium traffic load (average daytime traffic), and high traffic load (yearly peak traffic). Traffic utilization during the day is roughly triple the nighttime demand [16]. Accordingly, we set the medium traffic load to three times the low traffic load, and the peak traffic to five times.
Fig. 3. Optimal network configuration for a low traffic utilization at night
Figure 3 displays an example of the traffic flow and network state in a low-traffic-utilization scenario. This traffic demand distribution corresponds to the low demand that occurs during nighttime. Yellow nodes are nodes with demands, whereas blue nodes denote switches turned on to forward flows. Blue links denote powered-on line cards and link bundles of the respective switches. Everything colored grey is currently shut down. An example of a campus network
Fig. 4. Optimal network configuration for a mid traffic utilization during daytime
configuration at daytime traffic load is displayed in Figure 4. In Figures 5 and 6, we can see the potential energy savings in a campus network and a mesh network, respectively. These are the results from the Cplex solver and the LPF heuristic. At night, we are able to save up to ca. 45% of energy compared to an "always online" network. An important aspect to highlight is that in realistic mesh networks (e.g., WANs) the average network utilization is only around 30% [17], and therefore even higher energy savings are possible.
The Strategic Greedy LPF method yields an optimal solution for low traffic load. For medium and high traffic loads, it is ca. 7% worse than the MILP in the mesh network and ca. 4% worse in the campus network. Furthermore, we observe that the
2014 IEEE 3rd International Conference on Cloud Networking (CloudNet)
Image: [1]
Problems with SDN for IoT: Self-Optimization
• Evaluation
  • LPF provides the greatest energy savings.
  • When the longest paths are addressed first, a subset of the spanning tree is produced.
  • In SPF, this creates many disconnected fragments that must later be connected, resulting in a sub-optimal graph.
Image: [1]
Fig. 5. Performance of optimal solution and heuristic in a campus network
Fig. 6. Performance of optimal solution and heuristic in a mesh network
energy saving in the mesh network decreases linearly, while this is not the case for the campus network. This is due to the inhomogeneous energy consumption of the switches in the campus network: when the traffic reaches a certain boundary, new core or distribution switches have to be powered on, resulting in a sharp increase in energy consumption.
B. Strategic Greedy Heuristic performance
The performance of the four heuristic algorithms is shown in Figures 7 and 8 for the mesh and campus networks, respectively.
Fig. 7. Average energy savings for different strategies in a mesh network
For the mesh network, LPF outperforms the other three by up to 5% more total energy saving. This makes sense geometrically: when the demands with the longest paths are accommodated first, a subset of a spanning tree is produced. Afterwards, LPF can efficiently allocate smaller paths that reuse the already powered-on links and nodes. However, if we allocate the demands with the shortest paths first, small distributed path fragments are activated throughout the network. Node pairs with longer paths cannot efficiently route along these randomly positioned short segments, so the overall performance of SPF suffers. When the network size increases further, this effect is expected to become even more pronounced, as the short paths will be more dispersed and it will be harder to reuse already powered-on network gear.
Fig. 8. Average energy savings for different strategies in a campus network
For the campus network, all algorithms except SPF perform nearly the same at low traffic load. When traffic increases, LPF tends to outperform the others by up to 7% total energy saving, for the same reason mentioned above.
C. Robustness of heuristic algorithms
This section investigates the robustness of the four heuristic algorithms. Figure 9 shows that all four strategies were always able to find a valid network configuration while routing up to 600 Gbit/s overall in the campus network (mid traffic load). In high-peak-load scenarios, it makes sense to allocate resources for node pairs with short paths first, as Figure 9 indicates.
Fig. 9. Robustness of the Strategic Greedy for different strategies in a campus network
The situation changes, however, in mesh networks, as this topology is not designed in a hierarchical way. Figure 10 shows that SPF performed by far the worst and found a valid configuration in only 40 out of 100 different traffic scenarios
Problems with SDN for IoT: Self-Optimization
• Objective
  • Enhance User Experience. [3]
  • QoS is critical; however, user-perceivable experience is now important for user-centric, mobile networking.
  • Quality of Experience (QoE)
• Motivating Application
  • Smart Home
  • Multiple QoE levels:
    • Distance Learning
    • Health Provider Link
    • Online Gaming
entries in the flow table, bandwidth) and accepts or rejects the requests according to the availability of resources.
4.2.2 Overview Network Application Follows the up/down state of the network switches and their ports by listening to the asynchronous messages exchanged between the OpenFlow controller and the switches. The data collected by this application helps the management layer keep a global view of the network.
4.2.3 Route Calculation Application Determines the available path(s), calculating the route(s) based on the control rules of the Mapping Rules module. The adequacy of a route for fulfilling a request is determined by the network topology and by a set of performance metrics, such as latency.
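The route calculation described above amounts to a metric-weighted shortest-path computation. A Dijkstra sketch with latency as the metric; the topology and latency values are illustrative assumptions, with the S1..S4 names echoing the testbed in Section 5:

```python
import heapq

def best_route(graph, src, dst):
    """Latency-weighted shortest path (Dijkstra), the kind of computation a
    Route Calculation application performs. `graph` maps each node to a
    dict {neighbor: latency_ms}."""
    dist, prev = {src: 0}, {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, lat in graph[u].items():
            if d + lat < dist.get(v, float("inf")):
                dist[v], prev[v] = d + lat, u
                heapq.heappush(heap, (d + lat, v))
    # Reconstruct the path from dst back to src.
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1], dist[dst]

# The testbed's two routes (S1-S2-S4 and S1-S3-S4) with assumed latencies:
g = {"S1": {"S2": 2, "S3": 5}, "S2": {"S4": 2}, "S3": {"S4": 1}, "S4": {}}
print(best_route(g, "S1", "S4"))  # (['S1', 'S2', 'S4'], 4)
```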
4.2.4 Statistics Counter Uses a mix of passive and active monitoring to measure different network metrics at different aggregation levels. It also measures flow latency, error rate, and jitter by inserting probe packets into the network.
4.2.5 Mapping Rules Used by the management layer to translate the high-level network policies into control rules. The controller and the control applications (for instance, routing) use those rules to calculate the entries in the flow table of each switch.
4.3 Data Layer Composed of network devices with a programmable protocol enabled. OpenFlow [23] was the first open protocol to provide standard APIs for the controller to manage the data layer's switching fabric. The controller uses the OpenFlow API to install forwarding rules in the flow tables of switches, discover the network topology, monitor flow statistics, and track the up/down status of active devices and flows.
OpenFlow provides support for monitoring and controlling QoS. The packets belonging to a flow may be queued in a particular queue of an output interface. Controllers may consult queue configuration parameters and statistics. Switches may rewrite the ToS field in the IP header. Support for explicit congestion notification is also included in OpenFlow.
ONF [24] created OF-Config to support the configuration of several characteristics of an OpenFlow switch. OF-Config [25] may be used to configure the minimum and maximum transmission rates of a queue in an OpenFlow switch, and an OpenFlow controller may also read those rates from a switch. The most recent QoS additions in OpenFlow are meter tables and meter bands, which may be used to limit the transmission rate of an output interface.
Regarding monitoring, an OpenFlow controller may query a switch for statistics at different levels of aggregation: by table, by flow, by interface, and by queue. In summary, OpenFlow provides extensive support for configuring and monitoring QoS, facilitating the control of the complex situations introduced by distributed routing and traffic engineering mechanisms.
5. TESTBED AND EXPERIMENTAL RESULTS This section presents the scenario, the main elements of the proposed architecture (Figure 4) used in the testbed, and the experiments carried out to show how the QoE/QoS management mechanism reacts after detecting nonconformities in the service provided. An overview of the network topology used in the experiments can be seen in Figure 5.
Figure 5. Topology of the Network used in the Experiments (wireless or wired clients, a laptop, a desktop, and a game machine, in a Home Area Network behind a Home Gateway, connected over Ethernet and the access, metropolitan area, and wide area networks to the service provider's remote servers for gaming services, teleconsultation with formal and informal carers, and tele-education)
5.1 Description of the Scenario The scenario consists of three sub-categories of services, provided via an IPTV provider in a HAN (Home Area Network): Teleconsultation, Tele-education, and GoD (Game on Demand).
The Teleconsultation sub-service, offered by a third party, is used by Alice, a remotely assisted patient who often needs to consult health caregivers and send data, vital signs, and monitoring images to a remote service unit. The appointment is held via videoconference between the patient and the caregiver. Vital signs are collected using the eHealth platform [6]. Images of the patient are sent via cameras. The configuration of the sensors and the evaluation of the context parameters in a Ubiquitous Assisted Environment were discussed in our previous research work [22]. The Tele-education VoD (Video on Demand) sub-service is used by Bob, a distance-education undergraduate student. The GoD sub-service is used by another user who is keen on network games.
5.2 Home Area Network Four machines were used in the HAN. eHealth flows from Host 4 (application server) to Host 1 (Alice's machine). Teleconsultation traffic transits between Host 1 (Alice's machine) and Host 5 (application server). Biomedical signals are sent from Host 1 to Host 5. Tele-education flows from Host 5 to Host 2 (Bob's machine). GoD flows from Host 5 to Host 3 (another user's machine). Host 4 was also used to generate background traffic from Host 4 to Host 5.
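The traffic matrix described above is easy to mislay in prose; written out as data (host names exactly as in the text), it is:

```python
# Traffic matrix of the HAN testbed (source host -> destination host),
# as described in Section 5.2.
FLOWS = {
    "eHealth":            ("Host 4", "Host 1"),  # application server -> Alice
    "Teleconsultation":   ("Host 1", "Host 5"),  # videoconference with server
    "Biomedical Signals": ("Host 1", "Host 5"),
    "Tele-education":     ("Host 5", "Host 2"),  # server -> Bob
    "Game on Demand":     ("Host 5", "Host 3"),  # server -> another user
    "Background":         ("Host 4", "Host 5"),
}

# Which services does each host receive?
incoming = {}
for service, (src, dst) in FLOWS.items():
    incoming.setdefault(dst, []).append(service)
print(incoming["Host 5"])  # ['Teleconsultation', 'Biomedical Signals', 'Background']
```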
The users' machines ran an IPTV client application with an interface for accessing the services and a module to capture the QoE impact dimensions. It was prototyped in Java, using the SAWSDL framework for performing semantic annotations. The data in the terminal reports, including the values of the QoE/QoS performance parameters, are transported in XML using a variation of the GSQR protocol [9]. The module was configured so that, periodically, the terminal sends the QoE and QoS parameter information from the reports to the application managing the QoE/QoS at the service provider.
Image: [3]
Problems with SDN for IoT: Self-Optimization
• QoE
  • Subjective evaluation of the user's perceptions and expectations.
  • Needs to meet hedonic and pragmatic needs.
• Autonomically adapt the resources on the network based on the user experience.
  • Measured from the Client Machine
  • Reported to the Controller
• User Experience Information stored in a Knowledge Base
• Compares expected quality to reported quality for QoE feedback.
Image: [3]
Figure 4. Functional Architecture for QoE Management (a Management Layer containing the QoS/QoE Management Application, with the QoE Aware Semantic Engine, QoE Ontology, Knowledge Base, Policy Base, Subscriber Database, and application servers for eHealth, VoIP, VoD, games, and news; a Control Layer with a standard SDN controller, the application modules Admission Control, Overview Network, Route Calculation, and Statistics Counter, and the Mapping Rules base, reached through a Northbound REST API; and a Data Layer of OpenFlow switches reached through the Southbound OpenFlow Protocol API)
4.1 Management Layer This layer contains the organization's business applications, which offer network services such as virtualization, firewalls, load balancers, QoS/QoE managers, and others. Any application at this level communicates with the standard SDN controller, or with an application module of the control layer, using the Northbound API. The main contribution at this layer is the semantic engine, which captures information about QoE dimensions and learns from the user's experience. The engine consists of the elements described below.
4.1.1 QoE Ontology Described in Section 3, it is used to unify concepts, easing the representation of the user's experience when using the service and enabling inferences. The ontology was built in Protégé 4.0.
4.1.2 Module for Capturing QoE Dimensions This module was implemented using the SAWSDL framework to Extract, Transform, and Load (ETL) the information in the reports from the users' terminals. Information in the reports is structured in XML, referencing the concepts of the QoE ontology. Data transport from the terminal uses the GSQR protocol [9]. When the reports arrive at the QoE management system, they pass through ETL processing and the information is persisted in the Knowledge Base (KB) of the server hosting the corresponding service, as instances in the Web Ontology Language (OWL).
4.1.3 Module for Monitoring and Measuring QoE/QoS Parameters Responsible for monitoring and measuring the values of qualitative (user) and quantitative (network performance) metrics at the client side. With the knowledge about each user persisted in the KB, this module applies a mapping function over the QoE/QoS parameters for each service (e.g., VoIP, IPTV). For the VoIP service, the QoE/QoS mapping described in our previous work [4] may be used. For the IPTV service, a QoE/QoS mapping function using linear regression, described in [12], may be used, Equation (1).
MOS = α Thr + β Jt + γ Plr + ε (1)
Thr is the throughput, Jt is the jitter, and Plr is the packet loss rate. The coefficients α, β, γ, and ε are calculated individually for each case.
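Evaluating Equation (1) is then a one-liner once the coefficients are fitted; the coefficient values below are placeholders for illustration, not the fitted values from [12]:

```python
def mos(thr_mbps, jitter_ms, plr, a=0.45, b=-0.08, c=-9.0, e=1.2):
    """Linear QoE/QoS mapping of Equation (1): MOS = a*Thr + b*Jt + c*Plr + e.
    The coefficients a, b, c, e are illustrative placeholders; in practice
    they are fitted per service and context via linear regression.
    The result is clamped to the 1..5 MOS scale."""
    score = a * thr_mbps + b * jitter_ms + c * plr + e
    return max(1.0, min(5.0, score))

print(round(mos(8.0, 0.0, 0.0), 2))    # high throughput, clean network -> 4.8
print(round(mos(3.0, 15.0, 0.02), 2))  # congested link -> degraded score
```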
With all the captured information about the QoE dimensions, plus user and network status information, the semantic engine learns about the user's experience with a service and is able to report QoE degradation to the network controller. Table 2 illustrates examples of the KB records for video streaming.
Table 2. KB instances used for QoE Learning

Name   Equipment  Activity   Physical  Type      Codec  Resolution  Delay  Jitter  PLR  Bitrate  MOS
Alice  notebook   work-shy   Room      Action    MPEG4  1366x768    0      0       0    673202   5.0
Alice  mobile     strolling  Office    Talkshow  MPEG4  1280x720    0      0       0    423606   4.0
Bob    notebook   work-shy   Room      Action    MPEG4  1366x768    0      0       0    423601   4.0
Bob    mobile     strolling  Office    Talkshow  MPEG4  1280x720    20 ms  15 ms   0    123200   3.0
4.1.4 Module for Analyzing and Verifying QoE This module is composed of semantic rules used to analyze and detect QoE degradations. The semantic engine compares the expected value of each metric with the limits (minimum and maximum) of the policies defined by the administrator (policies ontology). When degradations are detected, the semantic engine triggers an event for the QoE/QoS management application, which queries its policy database for the actions to be performed for that metric. Table 3, adapted from [1], illustrates some examples of policy adaptations that may be applied to optimize QoE.
Table 3. Examples of policy adaptations for optimizing QoE

QoS Metric       Policy Adaptation Actions
Packet dropping  Change queue configuration; forward flow through an alternative route
Throughput       Change rate limits of flows saturating the bandwidth; forward flow through an alternative route
Latency          Change the switch queue configuration; plan the transmission of flows through a less congested route with adequate delay
Jitter           Forward flow through a less congested route
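Table 3's metric-to-action mapping behaves like a small policy lookup. A sketch with illustrative threshold values; the paper's policies live in an ontology and policy database, not a Python dict:

```python
# Policy table modeled on Table 3; the threshold values are illustrative
# assumptions, not the administrator-defined limits from the paper.
POLICIES = {
    "packet_loss": {"max": 0.01, "actions": ["change_queue_config", "reroute"]},
    "throughput":  {"min": 2.0,  "actions": ["limit_saturating_flows", "reroute"]},
    "latency_ms":  {"max": 50.0, "actions": ["change_queue_config", "reroute_low_delay"]},
    "jitter_ms":   {"max": 20.0, "actions": ["reroute_less_congested"]},
}

def check_violations(measurements):
    """Return the adaptation actions triggered by the measured metrics."""
    triggered = []
    for metric, value in measurements.items():
        policy = POLICIES.get(metric)
        if policy is None:
            continue
        if ("max" in policy and value > policy["max"]) or \
           ("min" in policy and value < policy["min"]):
            triggered.append((metric, policy["actions"]))
    return triggered

print(check_violations({"throughput": 1.2, "jitter_ms": 5.0}))
# -> [('throughput', ['limit_saturating_flows', 'reroute'])]
```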
To apply the actions, the QoE/QoS management application communicates with the control layer using the Northbound API. At this point the high-level rules are converted and persisted as control rules in the Mapping Rules module of the control layer. After the control layer verifies the available resources, the actions are performed on the switch interfaces (queues, routes, flow limiters) using the Southbound API.
4.2 Control Layer Composed of the SDN controller, which communicates with the switches of the data layer via Southbound APIs, using, for instance, the OpenFlow protocol. To allow automatic configurations for optimizing QoE, this layer needs some control applications in addition to a standard SDN controller, such as those proposed in [1], described hereinafter. Note that the contribution of this paper lies not in implementing those modules, but in the mechanisms described in the management layer, which in turn provide the information for the control layer modules to perform the adaptation of network policies.
4.2.1 Admission Control Application Receives requests to provision resources from the management layer, analyzes the network resources (i.e., queues,
Problems with SDN for IoT: Self-Optimization
• QoE
  • Mean Opinion Score (MOS)
  • Calculated using throughput, jitter, and packet loss rate.
  • KB stores information about each user to build context.
Image: [3]
Problems with SDN for IoT: Self-Optimization
• QoE Evaluation
  • Background traffic was only induced on the final test.
  • Flow rates are limited based on MOS to maintain acceptable levels.
Image: [3]
For the Home Gateway, a TP-Link device with 10 Mbps capacity was used, emulating an ADSL/Cable/PON service.
5.3 Network Provider and IPTV Service In the experiments, the network and service providers are represented by the same entity, which provides both the service and the service access network infrastructure.
To simulate the services provided, an application server was implemented and configured on 64-bit Ubuntu Linux 13.04. On the server, videos and games were made available so that users could watch and play over the network. An eHealth application was used to collect data about patients' vital signs. Alice's biomedical data can be requested at any time by the health caregiver before, during, or after a teleconsultation.
On the same physical machine, the QoE/QoS management application was configured with the semantic engine, a KB (Knowledge Base), and a base of network adaptation policies. The management application uses a REST API to communicate with the controller.
The Floodlight controller was installed and used as the OpenFlow controller. It was adopted because it has a group of modules and applications that, together with the OpenFlow API and the OF-Config protocol, allow visualizing the network topology and the status of devices, changing the forwarding tables, and verifying active flows, among other functionalities. For this scenario, the Floodlight modules Topology Management, Static Flow Entry Pusher, and Counter Store were used for, respectively, verifying the network topology, installing and removing flows on a given switch (flow forwarding), and generating statistics.
To monitor the performance parameters of the transport network, Linux tools and the Counter Store module of the Floodlight controller were used. The monitors' output data was used as input for the controller.
To emulate the transmission network, a mesh of four OVS instances was used, installed on the 64-bit Ubuntu 13.04 operating system. Each OVS runs on a separate physical machine, interconnected using GRE (Generic Routing Encapsulation) tunnels. The Linux tc command was used to set the maximum capacity and the delay of each link. All links were configured with a maximum capacity of 10 Mbps. Two routes were created (Route 1: S1-S2-S4 and Route 2: S1-S3-S4) to send flows through alternative paths. By default, all flows follow Route 1.
5.4 Experimental Results In the experiments, the QoE of flows in the active sessions of three users was analyzed, using different applications, in a HAN consuming three sub-services from the IPTV provider. At opportune moments, background traffic is generated in order to observe the behavior of the proposed mechanism. Four experiments were performed in the laboratory.
In the first experiment, the QoE management control mechanism was not activated, and the applications competed indiscriminately for the total available bandwidth, according to the results measured and plotted in Figure 6.
Figure 6. Throughput results of measurements without the QoE/QoS mechanism (throughput in Mbps over 300 seconds for the Tele-education, Teleconsultation, and Game on Demand flows)
The second experiment evaluated the same flows before and after enabling the QoS management mechanism. The flows with QoS guarantees correspond to 5 Mbps for Teleconsultation, 3 Mbps for Tele-education, and 2 Mbps for GoD. Up to 90 seconds all flows compete for the total available bandwidth; at 90 seconds, after activation of the QoS control mechanism, each flow receives its share, as can be seen in Figure 7.
Figure 7. Throughput results of measurements with the QoS control mechanism (throughput in Mbps over 300 seconds for the Tele-education, Teleconsultation, and Game on Demand flows)
The third experiment tested the throughput adaptation policy with an alternative route. QoS guarantees of 4.5 Mbps, 2.5 Mbps, 2 Mbps, and 1.0 Mbps were established for Teleconsultation, Tele-education, GoD, and Biomedical Signals, respectively. All flows were transported over Route 1. To induce QoE degradation, background flows were programmed to transmit data at a rate of 3 Mbps in two time intervals, from 12 to 30 seconds and from 40 to 60 seconds. In both instances, at such a transmission rate the bandwidth was saturated and the services started to suffer degradation, from 12 to 20 seconds and from 40 to 45 seconds. As the semantic engine was programmed to analyze the level of the service provided at 5-second intervals, detection of the QoE degradation occurred at 15 seconds in the first case and at 40 seconds in the second. In both cases, the time between the detection of the QoE degradation and the restoration of the service level was 5 seconds. The measurements and occurrences of these events are plotted in Figure 8. In both background transmissions, when the semantic engine identified the violation of the metric, an event was triggered for the QoE/QoS application, which queried its policy database, verified the kind of action to be performed, and, using the Northbound API, communicated with the control layer. The high-level rule was mapped and persisted in the Mapping Rules database, and the Admission Control application was invoked. As this application has an overall view of the network, it verified the availability of alternative routes, and the Static Flow Entry Pusher module of the controller was triggered to change the forwarding tables on the switches and to forward the ill-behaved flows through Route 2 (S1-S3-S4), as shown in Figure 8.
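The rerouting step, changing the forwarding tables through the Static Flow Entry Pusher, is driven over Floodlight's REST API. A hedged sketch of building such a request: the endpoint path and match-field names below follow the format Floodlight has documented for its static flow pusher, but they vary between controller versions, so treat the URL, field names, and values here as assumptions to verify against the controller's documentation:

```python
import json

def reroute_rule(controller, dpid, src_ip, out_port):
    """Build a static flow entry that forwards traffic from src_ip out of
    out_port, in the style of Floodlight's Static Flow Pusher REST API.
    Endpoint path and field names are assumptions (they differ across
    Floodlight versions); the rule name is illustrative."""
    url = "http://%s:8080/wm/staticflowpusher/json" % controller
    payload = {
        "switch": dpid,              # datapath ID of the switch to program
        "name": "bypass-" + src_ip,  # illustrative rule name
        "priority": "32768",
        "eth_type": "0x0800",        # match IPv4 traffic
        "ipv4_src": src_ip,
        "active": "true",
        "actions": "output=%d" % out_port,
    }
    return url, json.dumps(payload)

# Redirect an ill-behaved flow at switch S1 toward Route 2 (the port facing S3
# is assumed to be port 3; the DPID and addresses are made up):
url, body = reroute_rule("127.0.0.1", "00:00:00:00:00:00:00:01", "10.0.0.4", 3)
print(url)
print(body)
# The rule would then be installed with an HTTP POST of `body` to `url`,
# e.g. using urllib.request or the requests library.
```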
Figure 8. Throughput adaptation policy after flow bypass to the alternative route (throughput in Mbps over 60 seconds for the Tele-education, Teleconsultation, Game on Demand, Biomedical Signals, and Background Traffic flows)
The fourth experiment tested the throughput adaptation policy with flow rate limiters and without an alternate route. In this experiment, only route 1 was kept, to test the system's ability to accommodate best-effort flows alongside flows with QoS guarantees. QoS guarantees of 4.5 Mbps, 2.5 Mbps, and 2 Mbps were established for Teleconsultation, Tele-education, and GoD, respectively, under bandwidth saturation. Up to 10 seconds, the throughput was maintained in all sessions. New sessions then compromised the bandwidth of the active flows until the problem was detected and fixed, as illustrated in Figure 9.
[Figure 9. Throughput adaptation policy after changing the flow rate limits. Throughput (Mbps) vs. time (seconds), 0 to 60 s, for the Tele-education, Teleconsultation, Game on Demand, heavy game download, and eHealth software update flows.]
From 11 to 50 seconds, Bob's terminal downloaded heavy games from the server. Programmed to verify the service level every 5 seconds, the semantic engine detected throughput degradation at 15 seconds; at 20 seconds the transmission rate limit for game downloads was changed, and the residual bandwidth of 1 Mbps was used for this session until 30 seconds. At that time, Alice's terminal started to update the eHealth software, which continued from 30 to 60 seconds. For a short time (5 seconds) the services were degraded; at 35 seconds the flows normalized, and the QoS-guaranteed flows were maintained until the end of the experiment. From 35 to 60 seconds, the residual bandwidth of 1 Mbps was distributed among the new active flows. At 50 seconds, when the game download completed, the residual 1 Mbps was allocated to the eHealth software update. The semantic engine detected violations of the throughput metric values and generated events for the QoE/QoS managing application. Because the flows were best effort and no alternate route was available in this experiment to divert them, the application searched its policies database and found that the action to be performed was to change the flow rate limits, since those flows were saturating the bandwidth. The control layer was triggered through the Northbound API; the high-level rule was converted and persisted in the Mapping Rules base; the admission control application verified the available resources; and, using the OpenFlow protocol through the Southbound API, the rate limits for the flows (game downloads and the eHealth software update) were changed on the switch interfaces.
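Taken together, the third and fourth experiments describe a single policy decision: divert flows to an alternate route when one exists, otherwise rate-limit the saturating best-effort flows to the residual bandwidth. A minimal sketch of that decision step, with hypothetical event and topology structures (the paper specifies the behavior, not this code):

```python
def handle_qoe_violation(event, topology):
    """Choose a corrective action for a QoE violation event, mirroring the
    behavior in the third and fourth experiments: prefer rerouting over an
    alternate path; fall back to rate-limiting best-effort flows."""
    routes = topology.get("alternate_routes", [])
    if routes:
        # Third experiment: admission control finds route 2 (S1-S3-S4) and
        # the Static Flow Entry Pusher forwards the ill-behaved flows there.
        return {"action": "reroute", "route": routes[0], "flows": event["flows"]}
    # Fourth experiment: only route 1 exists, so the best-effort flows
    # (game downloads, software updates) are rate-limited instead.
    return {"action": "rate_limit", "flows": event["flows"],
            "limit_mbps": event.get("residual_mbps", 1)}

event = {"metric": "throughput", "flows": ["game-download"], "residual_mbps": 1}
print(handle_qoe_violation(event, {"alternate_routes": []})["action"])            # rate_limit
print(handle_qoe_violation(event, {"alternate_routes": ["S1-S3-S4"]})["action"])  # reroute
```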
6. CONSIDERATIONS AND FUTURE WORK Many proposals for the provision and delivery of services address QoE merely as an extension of QoS, where only network parameters are mapped to predict the level of user satisfaction. This mode of QoE management contradicts the opinion poll presented in Section II and the main concepts of QoE in the literature. Concern for quality of service from the user's perspective demands that QoE be conceptualized as a multidimensional construct encompassing the human, content, context, and technology dimensions, and therefore based on an interdisciplinary approach. It is in this perspective that this proposal is situated, and it differs from others in the literature by being founded on several areas of knowledge and designed for the Future Internet. To provide effective and comprehensive QoE management, we propose a taxonomy of the dimensions that impact the quality perceived by the user, along with ways to model and represent that knowledge using ontologies. Data on the QoE impact dimensions are mapped and persisted in a knowledge base. The proposed semantic engine uses this information to learn about the user's experience with a service and, based on policies, can detect QoE degradations. The semantic elements, together with network control applications, were incorporated, prototyped, and tested in an SDN architecture. A usage scenario with IPTV sub-service consumption was presented and exercised. Two of the experiments tested the throughput-metric adaptation policies with and without an alternate route, and with and without saturation of the total available bandwidth. The experiments showed that the components of the SDN network, together with the proposed semantic mechanism, support offering services that are aware of the user experience and are able to detect, report, and correct QoE degradations without impacting the user-perceived quality.
In the third and fourth experiments, the time to restore the QoE after degradation was detected was 5 seconds. However, since the semantic engine performs its inference process only every 5 seconds, the total QoE restoration time can reach up to about 9 seconds. Our research continues with new experiments to evaluate the performance of the proposed mechanism with multiple flows and multiple users; considering that the testbed ran with a fixed number of flows and users, a realistic setting requires these elements to scale. In addition, we intend to evaluate other QoS-metric adaptation policies and to incorporate further dimensions of the proposed taxonomy.
Proposed IoT Architecture
• Proposed architecture to address basic IoT concerns [4]
• Task-Resource Matching Module
  • Maps heterogeneous device resources through semantic modeling.
• Service Solution Specification Module
  • Maps the characteristics of the devices involved in the proposed solutions to specific device requirements.
• Flow Scheduling Module
  • Accesses the network state information and uses a Genetic Algorithm to schedule flows.
  • The GA is natively compatible with networking: nodes are genes, mutation and crossover are performed by replacing sub-paths, and the fitness value is the QoS of the flow.
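The GA encoding sketched in these bullets (a path as a chromosome of node genes, with crossover and mutation replacing sub-paths) might look like the following minimal sketch. The toy topology, helper names, and loop handling are our assumptions, not taken from [4]:

```python
import random

# Toy adjacency list standing in for the network graph (hypothetical).
GRAPH = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def random_path(graph, src, dst, max_tries=50):
    """Grow a loop-free random walk from src to dst; a path is a chromosome
    whose genes are nodes. Returns None if dst is not reached in max_tries."""
    path = [src]
    for _ in range(max_tries):
        if path[-1] == dst:
            return path
        nxt = random.choice(graph[path[-1]])
        if nxt not in path:  # keep the chromosome loop-free
            path.append(nxt)
    return path if path[-1] == dst else None

def crossover(p1, p2):
    """Splice two parent paths at a shared interior node: the child keeps
    p1's prefix and replaces the remainder with p2's sub-path."""
    common = [n for n in p1[1:-1] if n in p2[1:-1]]
    if not common:
        return p1
    node = random.choice(common)
    return p1[:p1.index(node)] + p2[p2.index(node):]

def mutate(graph, path):
    """Mutation replaces the sub-path after a random interior gene with a
    freshly grown random sub-path to the same destination."""
    if len(path) < 3:
        return path
    i = random.randrange(1, len(path) - 1)
    tail = random_path(graph, path[i], path[-1])
    return path[:i] + tail if tail else path

child = crossover(["A", "B", "D", "E"], ["A", "C", "D", "E"])
print(child)  # the only shared interior node is "D", so -> ['A', 'B', 'D', 'E']
```

A fitness function scoring each path's QoS (as in the paper's scheduling module) would then drive parent selection across generations.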
Proposed IoT Architecture
• Proposed IoT Architecture Evaluation
  • On a campus-wide network
  • Data Transfers: 8% throughput increase
  • TeleAudio Flows: 51-71% reduction in end-to-end delay
  • Video Streaming: 32-67% less jitter
[Fig. 8. Performance Comparisons Results: (a) end-to-end throughput (Kbps) for file sharing flows, (b) end-to-end delay (seconds) for tele audio flows, (c) end-to-end jitter (seconds) for video flows, each plotted against Flow Id for the Bin Packing, Load Balance, and Proposed algorithms.]
[0.01, 0.1] seconds. Tele audio and video streaming flows are
from real traffic traces [26], [27].
In our GA-based flow scheduling algorithm, we initially choose two paths for each flow as parents. Under this specific network topology, we choose the path generated by the load balance algorithm as one parent; we then determine the other parent by exchanging the current core router with the alternative one (we have two core routers). We argue that the file sharing service requires large throughput, the tele audio service requires low delay, and the video streaming service requires low jitter. Since the QoS requirements w_d, w_j, w_t mentioned in Section V-B2 depend heavily on the user experience, the audio/video codec, the buffer size in the end devices, etc., we do not set any particular QoS requirement in this simulation-based experiment. Instead, we try to optimize the QoS performance (maximize throughput, minimize delay and jitter) within a predefined number of generations (we set 10 generations here). Hence we slightly change the fitness value in equation (7) to αx_d + βx_j + γ(10000/x_t): for file sharing flows (α, β, γ) = (0, 0, 1), for tele audio flows (α, β, γ) = (1, 0, 0), and for video streaming flows (α, β, γ) = (0, 1, 0).
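The per-service weighting can be made concrete. A small sketch of the modified fitness value, with the three weights denoted (alpha, beta, gamma) and x_d, x_j, x_t a flow's delay, jitter, and throughput as in equation (7); lower fitness is better, so minimizing 10000/x_t maximizes throughput:

```python
def fitness(x_d, x_j, x_t, weights):
    """Modified fitness from equation (7): alpha*x_d + beta*x_j + gamma*(10000/x_t).
    Each service type zeroes out the metrics it does not care about."""
    alpha, beta, gamma = weights
    return alpha * x_d + beta * x_j + gamma * (10000.0 / x_t)

# Per-service weights: each flow type optimizes only its critical metric.
WEIGHTS = {
    "file_sharing": (0, 0, 1),  # maximize throughput
    "tele_audio":   (1, 0, 0),  # minimize delay
    "video":        (0, 1, 0),  # minimize jitter
}

# A file-sharing flow with 250 Kbps throughput scores 10000/250 = 40.
print(fitness(x_d=2.0, x_j=0.05, x_t=250, weights=WEIGHTS["file_sharing"]))
```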
We have a total of 45 flows (each of the 45 end devices has one flow): flows 1-21 are file sharing, flows 22-36 are tele audio, and flows 37-45 are video streaming. Fig. 8(a) shows the flow throughput comparison. For file sharing flows, the load balance algorithm outperforms the bin packing algorithm, while our proposed algorithm achieves an average 8% throughput increase over the load balance algorithm. The reason is that on wireless links, when link utilization exceeds a threshold, the packet drop rate increases dramatically, as indicated in [33]. Fig. 8(b) shows that for tele audio flows, our proposed algorithm improves end-to-end delay by 51% and 71% compared to the load balance and bin packing algorithms, respectively. However, the other two types of flows experience approximately the same delay under all three algorithms. We argue the reason is that tele audio flows have bursty traffic patterns; the data volume may not be large, but if two flows with similar bursty patterns are scheduled on the same link, a large delay occurs. That is why tele audio flows have poor delay performance under the bin packing and load balance algorithms. Fig. 8(c) shows that video streaming flows have an average 32% and 67% less jitter with our proposed algorithm than with the other two algorithms. Two observations can be made here: a) video streaming flows have better overall jitter performance than tele audio ones; b) our proposed algorithm has almost the same throughput and delay performance on video streaming flows as the other two algorithms. The reason is that video streaming flows have variable packet lengths but an almost constant inter-packet interval. Hence, if the interfering flows also have a stable inter-packet interval, the jitter should be low. In fact, our proposed algorithm schedules more video streaming flows together with file sharing flows (more stable inter-packet interval) than with tele audio flows (variable inter-packet interval).
Extra flow-entry message overhead exists at the beginning of the experiments. Since we assume a one-time flow scheduling and that flows are stable once initialized, we do not examine how this extra message overhead affects network performance. However, enabling online scheduling with dynamic flow admission is one of our future work directions.
VII. CONCLUSIONS
In this paper, we have presented an original SDN controller design for IoT Multinetworks whose central, novel feature is a layered architecture that enables flexible, effective, and efficient management of tasks, flows, networks, and resources. We gave a novel vision of tasks and resources in IoT environments and illustrated how we bridge the gap between abstract high-level tasks and specific low-level network and device resources. A variant of the Network Calculus model is developed to accurately estimate end-to-end flow performance in IoT Multinetworks, which in turn serves as the foundation of a novel multi-constraint flow scheduling algorithm for heterogeneous traffic patterns and network links. Simulation-based validations have shown that our proposed flow scheduling algorithm outperforms existing ones. We are currently integrating this layered controller design with our MINA software stack in a large IoT electrical vehicular network testbed [2] and developing more secure, sophisticated tools to assist with on-the-fly resource provisioning and network control.
What we have realized is that the layered controller design is critical to the management of heterogeneous IoT Multinetworks. The techniques applied at each layer can differ: in our design, a semantic modeling approach performs resource matching and a GA-based algorithm schedules flows. These techniques can be viewed as plug-ins and can be adjusted or replaced in different IoT scenarios. We strongly believe that our novel layered controller architecture, which inherently supports heterogeneity and flexibility, is of primary importance for efficiently managing IoT Multinetworks.
Image: [4]
Conclusion
• SDN for heterogeneous, wireless, mobile networks is still very new.
  • The papers cited are from 2014/2015.
  • I just got back from the conference one was presented at.
• They all reference a lack of proper work in autonomic resiliency and address aspects of the same issue.
  • Combining their work would make a much stronger case for earlier adoption of SDN by IoT.
• The papers all lack proper evaluations, with several lacking even rudimentary evaluations.
  • The next step is a common testbed for evaluations and rigorous analysis.
"At the South Pole, December 1911" by Olav Bjaaland (1863-1961). Cropped photograph from Amundsen, Roald: The South Pole, Vol. II, first published by John Murray, London 1913. Photo facing page 134. Licensed under PD-US via Wikipedia: https://en.wikipedia.org/wiki/File:At_the_South_Pole,_December_1911.jpg
Image: John Walker (Founder of AutoDesk, Co-‐Author of AutoCAD), http://www.fourmilab.ch/images/antarctica_2013/S015.html
Primary References
[1] A. Markiewicz, P. N. Tran, and A. Timm-Giel, "Energy consumption optimization for software defined networks considering dynamic traffic," in Cloud Networking (CloudNet), 2014 IEEE 3rd International Conference on, pp. 155-160, Oct 2014.
[2] P. Thorat, S. M. Raza, D. T. Nguyen, G. Im, H. Choo, and D. S. Kim, “Optimized self-healing framework for software defined networks,” in Proceedings of the 9th International Conference on Ubiquitous Information Management and Communication, IMCOM ’15, (New York, NY, USA), pp. 7:1–7:6, ACM, 2015.
[3] M. P. da Silva, M. A. Dantas, A. L. Gonçalves, and A. R. Pinto, "A managing QoE approach for provisioning user experience aware services using SDN," in Proceedings of the 11th ACM Symposium on QoS and Security for Wireless and Mobile Networks, Q2SWinet '15, (New York, NY, USA), pp. 51-58, ACM, 2015.
[4] Z. Qin, G. Denker, C. Giannelli, P. Bellavista, and N. Venkatasubramanian, “A software defined networking architecture for the internet-of-things,” in Network Operations and Management Symposium (NOMS), 2014 IEEE, pp. 1–9, May 2014.
[5] N. Dorsch, F. Kurtz, H. Georg, C. Hagerling, and C. Wietfeld, “Software-defined networking for smart grid communications: Applications, challenges and advantages,” in Smart Grid Communications (SmartGridComm), 2014 IEEE International Conference on, pp. 422–427, Nov 2014.