Dipartimento di Informatica e Sistemistica, University of Napoli Federico II – Comics Group
An introduction to NETWORK RESILIENCY
Giorgio Ventre & Stefano Avallone
COMICS Group, Dipartimento di Informatica e Sistemistica, Università di Napoli Federico II
References
• Jean-Philippe Vasseur, Mario Pickavet, Piet Demeester. “Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS”. Morgan Kaufmann.
• Various authors. “Building Survivable Networks”, feature issue of IEEE Network Magazine, March/April 2004.
Communication Networks Relevance
Communication networks are becoming fundamental infrastructures:
• the amount of data carried by communication networks has grown considerably in recent years;
• many social and economic activities depend on communication networks;
• many safety-critical activities depend on communication networks.
Reliability is an essential feature of today's communication networks!
Network reliability [1]: (a) the ability of a network to maintain or restore an acceptable level of performance during network failures by applying various restoration techniques, and (b) the mitigation or prevention of service outages from network failures by applying preventive techniques.
Synonym: network survivability.
[1] Alliance for Telecommunications Industry Solutions (ATIS), http://www.atis.org/tg2k/_network_reliability.html
Network Reliability: related concepts
Many concepts are related to network reliability, for example:
• network element reliability: the probability that a network element is fully operational throughout a given period of time;
• network element availability: the probability that a network element is in an up state at a given instant of time t;
• network element fault: the inability of a network element to perform a required action;
• ....
Which failures may occur?
The ability of a network to provide the required services may be compromised by different kinds of failure:
• planned or unplanned failures;
• internal or external failures;
• software or hardware failures;
• malicious or accidental failures;
• ....
Accounted Failures
Providing countermeasures for every failure that may occur in a communication network is unfeasible.
Network providers and ISPs normally prepare action plans that address the most frequent failures.
These failures are called accounted failures.
The most common types of accounted failure are:
• single link failure;
• single node failure.
Failures' Impact
In today's communication networks a single failure may produce a major disruption in network availability.
A single cut in an optical cable may drop thousands of logical network connections.
• On July 5, 2002 a submarine cable break affected the Asia Pacific Cable Network 2 (APCN 2), causing a considerable slowdown in all the network connections among Japan, China, South Korea, etc.
Failures' Impact: ATC systems
Press release (http://www.natca.org/mediacenter/press-release-detail.aspx?id=394):
MASSIVE POWER, COMMUNICATIONS FAILURE AT MAJOR AIR TRAFFIC CONTROL CENTER PUTS CONTROLLERS IN DARK, FLIGHTS IN JEOPARDY
07/19/2006 Bob Marks
PALMDALE, Calif. – A massive power and communications failure late Tuesday at the Los Angeles Air Route Traffic Control Center left scrambling air traffic controllers to deal with a nightmare scenario – how to keep dozens of flights away from each other above a large swath of the Southwestern United States despite the inability to see them, talk to them or relay crucial instructions for 15 excruciatingly long minutes.
Every ounce of skill, heart and determination that controllers bring into the control room every day was put to the test during one of the worst outages to ever hit the facility. It was so bad, controllers say, that the only thing they had of use to aid the situation that actually worked was their cell phones – devices which the Federal Aviation Administration, inexplicably, has barred from control rooms, further impeding the safety of the system.
More details in http://themainbang.typepad.com/blog/2006/07/complete_failur.html
Some parameters that may be used to characterize the reliability of a network can be found in ITU Recommendation G.911: “Parameters and Calculation Methodologies for Reliability and Availability of Fibre Optic Systems”.
The following slides introduce some of the parameters defined in ITU G.911.
Failure in Time (FITs) and Maintenance Time
Failure in Time:
• the number of device failures occurring in a specific time interval;
• normally expressed as failures per billion device-hours.
Maintenance Time:
• the time interval during which a maintenance action is performed on an item, either manually or automatically, ...
Mean Time Between Failures (MTBF)
The Mean Time Between Failures (MTBF) is the steady-state expectation of the time between failures.
Mathematically, the MTBF (in years per failure) is related to the failure rate F (in FITs, i.e. failures per $10^{9}$ hours) as follows:
$$\mathrm{MTBF} = \frac{1.14 \times 10^{5}}{F}$$
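The constant in the MTBF relation is a unit conversion: $10^{9}$ hours divided by 8760 hours per year is roughly $1.14 \times 10^{5}$ years. A minimal Python sketch of that conversion follows; the function name and the 1000 FIT example device are illustrative assumptions, not values from the slides.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, leap years ignored


def mtbf_years(fits: float) -> float:
    """Return the MTBF in years for a failure rate given in FITs
    (failures per 10**9 device-hours)."""
    if fits <= 0:
        raise ValueError("the failure rate must be positive")
    mtbf_hours = 1e9 / fits          # hours per failure
    return mtbf_hours / HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical device rated at 1000 FITs.
    print(f"{mtbf_years(1000):.1f} years between failures")
```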
Mean Time To Repair (MTTR)
The Mean Time To Repair (MTTR) is defined as the total corrective maintenance time divided by the total number of corrective maintenance actions during a period of time.
Given the definitions of MTBF and MTTR, the availability A of an item may be derived as:
$$A = 1 - \frac{\mathrm{MTTR}}{\mathrm{MTBF}}$$
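As a rough illustration of the availability relation, the sketch below evaluates A and turns it into expected downtime per year. The function names and the example figures (an MTBF of 10,000 hours and an MTTR of 4 hours) are assumptions chosen only for the example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def availability(mtbf: float, mttr: float) -> float:
    """Availability A = 1 - MTTR / MTBF, with both times in the same unit."""
    if mtbf <= 0 or mttr < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return 1.0 - mttr / mtbf


def downtime_minutes_per_year(a: float) -> float:
    """Expected downtime per year implied by availability a (0..1)."""
    return (1.0 - a) * MINUTES_PER_YEAR


if __name__ == "__main__":
    a = availability(mtbf=10_000, mttr=4)  # hypothetical item, times in hours
    print(f"A = {a:.4%}, about {downtime_minutes_per_year(a):.0f} minutes down per year")
```

Note that the relation is only meaningful when MTTR is much smaller than MTBF, which is the usual case for network equipment.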
Users, services and reliability requirements
Network reliability is a “relative concept”.
The reliability requirements of a communication network depend on:
• the user type;
• the service type.
Different user-service combinations lead to diverse requirements in terms of MTBF and MTTR.
User classification
According to their reliability requirements, network users may be classified in the following categories:
• Safety critical users. Users for whom service interruptions are unacceptable.
• Business critical users. Users for whom any service interruption leads to a high financial loss.
• Low cost users. Users for whom service interruptions cause only discomfort.
• Basic level users. Users for whom service reliability is only a side effect.
Availability: Impact of Outages
Ref: “Service Applications for SONET DCS Distributed Restoration”, IEEE J. Selected Areas in Comm., Jan. 94
[Figure: service outage impact versus restoration time after failure detection.]
Time axis ticks: 0, 50 ms, 200 ms, 2 s, 10 s, 5 min, 15 min, 30 min.
Ranges marked on the figure: Protection Switching Range; 1st, 2nd, 3rd and 4th Restoration Target Range; Undesirable; Unacceptable.
Impacts annotated on the figure:
• Service “hit” (reframes)
• Potential voiceband disconnects (<5%)
• Trigger changeover of CCS7 STP signaling links
• Effect cell rerouting process
• May drop voice band calls depending on channel bank vintage
• Drop all circuit switched connections
• PL disconnects
• Potential packet (X.25) disconnects
• Potential data session time-outs
• Packet (X.25) disconnects
• Data session time-outs
• Network congestion
• Minor social/business impacts
• Potentially FCC reportable
• Major social/business impacts
Market Drivers for Survivability
IP Network Expectations
Service | Delay | Jitter | Loss | Availability
Real Time Interactive (VoIP, Cell Relay ..) | L | L | L | H
Layer 2 & Layer 3 VPNs (FR/Ethernet/AAL5) | M HHLL LL
Internet Service | H | H | M | L
Video Services | L | M | M | H
L: Low, M: Medium, H: High
Measuring Availability: The Port Method
• Based on the port count in the network.
• Does not take into account the bandwidth of ports, e.g. OC-192 and 64k are both ports.
• Good for dedicated access service because ports are tied to customers.
$$\text{Availability (\%)} = \frac{(\text{total number of ports} \times \text{sample period}) - (\text{number of impacted ports} \times \text{outage duration})}{\text{total number of ports} \times \text{sample period}} \times 100$$
The Port Method Example
• Network with 10,000 active access ports.
• An access router with 100 access ports fails for 30 minutes.
– Total available port-hours = 10,000 × 24 = 240,000
– Total down port-hours = 100 × 0.5 = 50
– Availability for a single day = (240,000 - 50) / 240,000 × 100 ≈ 99.979 %
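The port method can be sketched in a few lines of Python. The function name and the representation of outages as (impacted ports, outage hours) pairs are assumptions made for this example; the figures are the ones from the slide.

```python
def port_availability(total_ports, sample_hours, outages):
    """Port-method availability in percent.

    outages is an iterable of (impacted_ports, outage_hours) pairs
    observed during the sample period.
    """
    total_port_hours = total_ports * sample_hours
    down_port_hours = sum(ports * hours for ports, hours in outages)
    return (total_port_hours - down_port_hours) / total_port_hours * 100


if __name__ == "__main__":
    # 10,000 ports over one day; 100 ports down for 30 minutes.
    print(f"{port_availability(10_000, 24, [(100, 0.5)]):.4f} %")
```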
The Bandwidth Method
• Based on the amount of bandwidth available in the network.
• Takes into account the bandwidth of ports.
• Good for core routers.
$$\text{Availability (\%)} = \frac{(\text{total amount of BW} \times \text{sample period}) - (\text{amount of BW impacted} \times \text{outage duration})}{\text{total amount of BW} \times \text{sample period}} \times 100$$
The Bandwidth Method Example
• Total capacity of the network: 100 Gigabits/sec.
• An access router with 1 Gigabit/sec of BW fails for 30 minutes.
– Total BW available in the network for a day = 100 × 24 = 2,400
– Total BW lost in the outage = 1 × 0.5 = 0.5
– Availability for a single day = (2,400 - 0.5) / 2,400 × 100 ≈ 99.979 %
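A comparable sketch for the bandwidth method is given below. Unlike a port count, the result depends on how much capacity the failed element carried, which the second call illustrates with a hypothetical 10 Gbit/s outage. The function name and the (impacted bandwidth, outage hours) input format are assumptions for this example.

```python
def bandwidth_availability(total_gbps, sample_hours, outages):
    """Bandwidth-method availability in percent.

    outages is an iterable of (impacted_gbps, outage_hours) pairs
    observed during the sample period.
    """
    total = total_gbps * sample_hours                    # Gbit/s x hours
    lost = sum(gbps * hours for gbps, hours in outages)  # Gbit/s x hours
    return (total - lost) / total * 100


if __name__ == "__main__":
    # Slide example: 100 Gbit/s network, 1 Gbit/s lost for 30 minutes.
    print(f"{bandwidth_availability(100, 24, [(1, 0.5)]):.4f} %")
    # Hypothetical comparison: a 10 Gbit/s element fails for the same time.
    print(f"{bandwidth_availability(100, 24, [(10, 0.5)]):.4f} %")
```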
Basic Ideas: Working and Protect Fibers
Service classification (1/2)
Communication networks are used to carry many different services.
Different services may have diverse reliability requirements.
The reliability requirements of such services are related to QoS parameters:
• Bit Rate;
• Delay;
• Jitter;
• ...
Service classification (2/2)
Application | Bit Rate | Bit Rate Variation | Delay Sensitivity | Need for Recovery
Plain Old Telephone Service | 31-32 Kbps | Constant | 5 | 5
Voice Over IP | 8-32 Kbps | Constant | 5 | 5
Video-telephony | 256-1920 Kbps | High | 5 | 5
Videoconferencing | at least 256 Kbps | High | 5 | 5
Teleworking | 64 Kbps – 2 Mbps | Very High | 5 | 4
TV broadcast | 2-8 Mbps | High | 4 | 3
Distance Learning | 64 Kbps – 2 Mbps | Very High | 5 | 5
Movies on Demand | 750 Kbps – 4 Mbps | High | 4 | 5
News on Demand | 64 Kbps | Very High | 2 | 2
Internet Access | 64 Kbps – 2 Mbps | Very High | 1 | 2
Teleshopping | 64 Kbps – 2 Mbps | Very High | 2 | 2
[2] A. Lason, et al., “Network Scenarios and Requirements”, European IST project Layers Internetworking in Optical Network (LION), deliverable D6, September 1999.
How to increase network reliability?
Prevent network failure:
Network recovery imposes several requirements. For example:
• there should be backup capacity to create a recovery path;
• the backup capacity must be sufficient to satisfy the QoS constraints;
• single points of failure must be avoided;
• .....
Recovery and reversion cycles
[Figure: Recovery Cycle and Reversion Cycle]
Recovery mechanisms
A wide variety of recovery mechanisms exists.
Every mechanism has advantages and drawbacks.
The following slides report some criteria that may be used to evaluate and classify recovery mechanisms [3, 4].
[3] V. Sharma et al., “Framework for MPLS-based recovery”, RFC 3469, IETF web site, Feb 2003
[4] K. Owens, V. Sharma, M. Oommen, and F. Hellstrand, “Network Survivability Considerations for Traffic Engineered IP Networks”, Internet draft: draft-owens-te-network-survivability-03, May 2002. Available at: www.ietf.org. Accessed July 2005
Backup Capacity
Dedicated
• one-to-one relationship between the backup resources and the working path;
• the simplest solution;
• an inefficient solution.
Shared
• the backup resources are shared among different working paths;
• a more complex solution;
• a more efficient solution.
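The efficiency gap between dedicated and shared backup capacity can be made concrete with a small sketch. It assumes, purely for illustration, that all working paths are protected by one common backup resource and that only single link failures are accounted for, so two working paths need backup capacity at the same time only if they traverse the same failed link. The function names, link labels and demand values are hypothetical.

```python
def dedicated_backup_capacity(working_paths):
    """Each working path gets its own backup resources: sum of all demands."""
    return sum(demand for _links, demand in working_paths)


def shared_backup_capacity(working_paths):
    """Backup resources are shared: size them for the worst single link failure."""
    load_per_failed_link = {}
    for links, demand in working_paths:
        for link in links:
            load_per_failed_link[link] = load_per_failed_link.get(link, 0) + demand
    return max(load_per_failed_link.values(), default=0)


if __name__ == "__main__":
    # (links used by the working path, demand) with hypothetical values.
    paths = [({"A-B", "B-C"}, 10), ({"A-B"}, 5), ({"C-D"}, 10)]
    print("dedicated:", dedicated_backup_capacity(paths))
    print("shared:   ", shared_backup_capacity(paths))
```

The shared figure is lower because the paths over A-B and the path over C-D never fail together under the single-failure assumption; the price is the extra bookkeeping needed to know which working paths share a risk.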
Recovery Path
Preplanned
• recovery paths for all accounted failure scenarios are calculated in advance;
• allows fast recovery from failures;
• lacks flexibility for unaccounted failure scenarios.
Dynamic
• the recovery path is calculated “on the fly” when the failure is detected;
• may also be used to search for recovery paths in unaccounted failure scenarios.
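The difference between the two strategies can be sketched as a lookup table built in advance versus a path search run after the failure. The example below is a simplified illustration: it uses breadth-first search on a hypothetical four-node topology, treats single link failures on the working path as the accounted scenarios, and ignores capacity. All names are assumptions.

```python
from collections import deque


def shortest_path(adj, src, dst, failed=frozenset()):
    """Breadth-first search that avoids the links listed in failed.

    failed is a set of frozenset({u, v}) links. Returns a node list or None.
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in parent and frozenset((node, nxt)) not in failed:
                parent[nxt] = node
                queue.append(nxt)
    return None


def preplan(adj, src, dst):
    """Compute the working path and one recovery path per single link failure."""
    working = shortest_path(adj, src, dst)
    table = {}
    for u, v in zip(working, working[1:]):
        link = frozenset((u, v))
        table[link] = shortest_path(adj, src, dst, failed={link})
    return working, table


if __name__ == "__main__":
    # Hypothetical topology: square A-B-C-D with a B-D diagonal.
    adj = {"A": ["B", "D"], "B": ["A", "C", "D"],
           "C": ["B", "D"], "D": ["A", "B", "C"]}
    working, table = preplan(adj, "A", "C")
    print("working:", working)
    print("preplanned for A-B cut:", table[frozenset(("A", "B"))])
    # Unaccounted double failure: no table entry, so search on the fly.
    double = {frozenset(("A", "B")), frozenset(("D", "C"))}
    print("dynamic for double cut:", shortest_path(adj, "A", "C", failed=double))
```

The preplanned table answers immediately but only for the scenarios it was built for; the dynamic search costs computation time after the failure yet still finds a path for the double cut.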
Recovery Approaches
Protection
• the recovery paths are preplanned and fully signaled before a failure occurs;
• when a failure occurs, no additional signaling is needed to establish the recovery path;
• is the faster solution.
Restoration
• the recovery paths may be preplanned or dynamically allocated, but are not signaled in advance;
• when a failure occurs, additional signaling is needed to establish the recovery path;
• is a more flexible solution.
Mesh Restoration
[Figure: mesh of six DCS nodes showing a Working Path, Line or Link Restoration, and Path Restoration]
• Control: Centralized or Distributed
• Route Calculation: Preplanned or Dynamic
• Type of Alternate Routing: Line or Path
Link vs. Path restoration
Link restoration
• Requires the ability to identify the failed link at both ends.
• Cannot protect against node failure.
• Link based
• Mesh (generalized loopback): insensitive to additions to the network (scalable); backup path can be pre-computed (fast recovery); dynamic rerouting
Path restoration
• More resilient than link restoration.
• Reroutes the traffic from the primary path to a Shared Risk Group (SRG)-disjoint backup path.
• Protects both end-to-end paths and single links.
• Preferred: Path Based
Link vs. Path restoration
[Figure: six-node network (nodes A to F) shown three times: the working flows, Link (Generalized Loopback) Restoration, and Path Restoration]
Flow 1: A-C-D
Flow 2: E-C-D-F
Fault: Link Cut
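The two behaviours can be compared with a short sketch. The topology of the original figure is not recoverable from the transcript, so the adjacency list below is a hypothetical six-node network that merely contains the two flows, and the cut is assumed to hit link C-D. Link restoration splices a detour around the failed link into the existing path, while path restoration recomputes the route end to end (here simply avoiding the failed link, a simplification of an SRG-disjoint backup).

```python
from collections import deque


def shortest_path(adj, src, dst, failed):
    """Breadth-first search that avoids the single failed link (a frozenset)."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in parent and frozenset((node, nxt)) != failed:
                parent[nxt] = node
                queue.append(nxt)
    return None


def link_restoration(adj, path, failed):
    """Keep the path and replace only the failed hop with a local detour."""
    hops = range(len(path) - 1)
    i = next((k for k in hops if frozenset(path[k:k + 2]) == failed), None)
    if i is None:
        return path  # the flow does not use the failed link
    detour = shortest_path(adj, path[i], path[i + 1], failed)
    if detour is None:
        return None
    return path[:i] + detour + path[i + 2:]


def path_restoration(adj, path, failed):
    """Recompute the whole route between the flow's end nodes."""
    return shortest_path(adj, path[0], path[-1], failed)


if __name__ == "__main__":
    # Hypothetical topology containing Flow 1 (A-C-D) and Flow 2 (E-C-D-F).
    adj = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D", "E"],
           "D": ["B", "C", "F"], "E": ["C"], "F": ["D"]}
    cut = frozenset(("C", "D"))
    for flow in (["A", "C", "D"], ["E", "C", "D", "F"]):
        print("-".join(flow),
              "| link:", "-".join(link_restoration(adj, flow, cut)),
              "| path:", "-".join(path_restoration(adj, flow, cut)))
```

In this sketch Flow 1 under link restoration travels A-C-A-B-D, looping back through A, whereas path restoration yields the shorter A-B-D.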
Pre-compute vs. Real-time
Pre-computed
• Calculates restoration paths before a failure happens.
• Allows prior availability of reroute information to the nodes where actions need to be taken after a failure is detected.
• Enables fast restoration.
Real-time
• Calculates restoration paths after a failure happens.
• Restoration is slower.
• Enables more efficient capacity utilization.
• Preferred: Pre-computed
Centralized vs. Distributed
Centralized restoration:
• Computes restoration and primary paths for all demands with up-to-date information.
• Routes may then be downloaded into nodal databases. Effectiveness?
• More capacity efficient.
• Possibly slow (but may be executed in the background).
• Scalability in question.
Distributed restoration:
• Source and destination nodes dynamically search for the protection wavelengths required to re-establish the disrupted lightpath.
• Lacking knowledge of the sharing databases of other OXCs, a node may not be able to determine backup sharability for a given primary path.
Independent of underlying physical networkIndependent of underlying physical network
[Figure: layer stack (Physical, Data Link, Network (IP), Transport, Session, Presentation, Application)]
MPLS layer restoration
MPLS Layer Protection
• Real-time or pre-computed
• Line or path level protection
• Protection path is node and link disjoint from the primary path.
• Protection path may be allocated to low-priority traffic in the absence of network failure.
• Faster than dynamic IP rerouting
• Working LSPs have pre-established node/link disjoint protection paths
[Figure: layer stack (Physical, Data Link, Network, Transport, Session, Presentation, Application) with MPLS marked]
Dipartimento di Informatica e Sistemistica, University of Napoli Federico II – Comics Group 62
using an hold-off time a chronological order among the recovery mechanisms adopted in different layer is imposed;
alternatively a “token” may used to impose a sequential order among the different layers.
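A hold-off timer can be pictured as a simple escalation rule. The sketch below rests on several assumptions that the slides do not state: layers are tried bottom-up, each layer except the last gets one hold-off interval to finish its recovery, and a layer that cannot recover at all is marked with None. Layer names and timings are hypothetical.

```python
def escalate(layers, hold_off_ms):
    """Return (layer_name, total_recovery_ms) for the layer that recovers.

    layers is an ordered list of (name, recovery_ms or None), lowest layer
    first. Each layer starts one hold-off interval after the previous one.
    """
    start = 0.0
    for index, (name, recovery_ms) in enumerate(layers):
        is_last = index == len(layers) - 1
        if recovery_ms is not None and (is_last or recovery_ms <= hold_off_ms):
            return name, start + recovery_ms
        start += hold_off_ms  # give up on this layer and wake the next one
    return None, None


if __name__ == "__main__":
    # Hypothetical case: the optical layer cannot repair this failure.
    layers = [("optical", None), ("MPLS", 40.0), ("IP", 5000.0)]
    print(escalate(layers, hold_off_ms=50.0))
```

The hold-off interval is a trade-off: a longer timer avoids needless recovery actions in the upper layers, but it adds directly to the outage time whenever the lower layer fails to recover.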
Integrated Approach [1]
• there is a recovery scheme that has a full overview of all the layers;
• the recovery scheme may decide when and in which layer (or layers) the recovery actions must be taken.
[1] D. Colle et al., “Data-centric optical networks and their survivability”, IEEE Journal on Selected Areas in Communications, vol. 20, no. 1, Jan. 2002, pp. 6-20