Page 1 Hans Peter Schwefel
PhD Course: Performance & Reliability Analysis, Day 6 Part II, Spring 04
PhD Course: Performance & Reliability Analysis of IP-based Communication Networks
Henrik Schiøler, Hans-Peter Schwefel, Mark Crovella, Søren Asmussen
• Day 1 Basics & Simple Queueing Models (HPS)
• Day 2 Traffic Measurements and Traffic Models (MC)
• Day 3 Advanced Queueing Models & Stochastic Control (HPS)
• Day 4 Network Models and Application (HS)
• Day 5 Simulation Techniques (HPS, SA)
• Day 6 Reliability Aspects (HPS)
Organized by HP Schwefel & H Schiøler
Page 2
[Overview diagram: high-availability topic areas]
• Supervision & Auditing
• Continuous Execution: redundancy and diversity; state replication; message distribution
• Error Handling: detection and recovery; tracing and logging; escalation and alarming; operational modes
• NE-Interface: multi-homing; stacks and protocols; IP fail-over
• Network HA: network design; network redundancy/fail-over
• Data HA: backup and restore; regional redundancy; rolling data upgrade
• Fault Management (OAM): repair and replacement; HA control & verification; system diagnosis
• Rolling Upgrade: rolling upgrade; patch procedure; migration
• Startup & Shutdown: system startup; graceful shutdown
Adapted from S. Uhle, Siemens ICM N
Page 5
OAM Concepts I: Startup/Upgrade/Shutdown
• Startup
– Put components into a stable operational state
– Caution: potential for high load (synchronisation etc.) in the start-up phase
– Concept for start-up of larger sub-systems needed (e.g. mutual dependence)
• Upgrades
– Rolling upgrade
• Allows upgrades of single components while their replicates remain in operation
• Consistency/data-compatibility problems
– Patch concept (SW)
• Incremental changes instead of full SW re-installation
• Goal: zero or minimal outage of components
• Graceful shutdown
– Take components safely out of operation
– Possible steps:
• Stop accepting / redirect new tasks
• Finish existing tasks
• Synchronize data and isolate the component
• HW: hot plug/swap ability
– E.g. interface cards
• Life testing
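The three graceful-shutdown steps listed above can be sketched in a few lines. This is a toy illustration, not any specific product's shutdown procedure; all class and method names are invented for the example.

```python
import queue

class Component:
    """Toy component illustrating graceful shutdown (names are illustrative)."""

    def __init__(self):
        self.accepting = True
        self.isolated = False
        self.tasks = queue.Queue()

    def submit(self, task):
        # Once shutdown has started, new tasks must be redirected elsewhere.
        if not self.accepting:
            raise RuntimeError("shutting down; redirect task to a replicate")
        self.tasks.put(task)

    def graceful_shutdown(self):
        self.accepting = False          # 1. stop accepting / redirect new tasks
        while not self.tasks.empty():   # 2. finish existing tasks
            self.tasks.get()()
        self.sync_and_isolate()         # 3. synchronize data, isolate component

    def sync_and_isolate(self):
        # Placeholder: flush state to a replicate / stable storage here.
        self.isolated = True

comp = Component()
results = []
comp.submit(lambda: results.append("task-1"))
comp.submit(lambda: results.append("task-2"))
comp.graceful_shutdown()
print(results)   # both pending tasks finished before isolation
```

The key design point is the ordering: the accept flag flips before draining, so no task can slip in between step 2 and step 3.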
Page 6
OAM Concepts II: Monitoring/Supervision
• Resource supervision and overload protection
– E.g. CPU load, queue lengths, traffic volumes
– Alarming → operator intervention, e.g. upgrades
– Signalling to reduce overload at the source
– Graceful degradation
• Logging & tracing
– For off-line analysis of incidents
– Correlation of traces for system analysis
– Problem: adequate granularity of logging data
• Service concepts/contracts
– Reaction to alarms
– System recovery modes
– Spare-parts handling
– Qualified technicians, availability (24/7?)
• High availability of the OAM system
– Redundancy (frequently separate OAM network(s))
– Prioritisation of OAM traffic
– Handling/storing of OAM/logging data
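The resource-supervision idea can be sketched as a simple threshold policy. The thresholds and the degradation action below are illustrative assumptions, not values from the course material.

```python
# Minimal sketch of resource supervision with overload protection.
ALARM_THRESHOLD = 0.9      # raise an operator alarm above 90% load (assumed)
DEGRADE_THRESHOLD = 0.75   # shed optional work above 75% load (assumed)

def supervise(cpu_load, log):
    """Return the operating mode for an observed CPU load; log alarm events."""
    if cpu_load > ALARM_THRESHOLD:
        log.append(("ALARM", cpu_load))   # alarming -> operator intervention
        return "reject-new-traffic"       # signal sources to reduce overload
    if cpu_load > DEGRADE_THRESHOLD:
        log.append(("WARN", cpu_load))
        return "degraded"                 # graceful degradation
    return "normal"

log = []
modes = [supervise(x, log) for x in (0.4, 0.8, 0.95)]
print(modes)   # ['normal', 'degraded', 'reject-new-traffic']
```

In a real system the load samples would come from periodic measurements, and the log would feed the tracing/correlation machinery mentioned above.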
Page 7
3. Availability and Network Management
• Rolling upgrade, planned downtime, etc.
4. Methods & Protocols for Fault-Tolerant Systems
a) SW fault-tolerance
b) Network availability
• IP routing, HSRP/VRRP, SCTP
c) Server/service availability
• Server farms, cluster solutions, distributed reliability
Demonstration: Fault-tolerant call control systems
Page 8
General Approaches: Fault-Tolerance
• Basic requirements for fault-tolerance
– Number of fault types and number of faults is bounded
– Existence of redundancy (structural: equal/diverse; functional; information; time redundancy/retries)
• Functional parts of fault-tolerant systems
– Fault detection (& diagnosis)
• Replication and comparison (with identical realisations: not suitable for design errors!)
• Inversion (e.g. mathematical functions)
• Acceptance tests (necessary conditions on the result, e.g. ranges)
• Timing behaviour (time-outs)
– Fault isolation: prevent spreading
• Isolation of functional components, e.g. atomic actions, layering model
– Fault recovery
• Backward: roll back to a consistent earlier state (problematic for cooperating systems!)
• Forward: move to a consistent, acceptable, safe new state; but loss of result
• Compensation, e.g. TMR, FEC
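Two of the mechanisms above, compensation by triple modular redundancy (TMR) and acceptance tests, are easy to show generically. This is a sketch of the general technique, not a particular system's implementation.

```python
from collections import Counter

def tmr(replica_results):
    """TMR compensation: mask one faulty replica by majority vote."""
    value, votes = Counter(replica_results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: more faults than TMR can mask")
    return value

def acceptance(result, low, high):
    """Acceptance test: a necessary condition on the result (range check)."""
    return low <= result <= high

# One replica computes a wrong value; the vote masks it.
voted = tmr([42, 42, 17])
print(voted)                     # 42
print(acceptance(voted, 0, 100)) # True
```

Note the limits flagged on the slide: voting over identical realisations masks random faults but not a common design error, which all three replicas would compute identically wrong.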
Page 9
Software Fault-Tolerance
• Mainly: design errors & user interaction (as opposed to production errors, wear-out, etc.)
• Observations/estimates (experience in computing centres; the numbers are a bit old, however)
– 0.25–10 errors per 1000 lines of code
– Only about 30% of error reports by users accepted as errors by the vendor
– Reaction times (updates/patches): weeks to months
– Reliability has not improved nearly as much as for hardware errors, for various reasons:
• IP-layer network resilience: dynamic routing, e.g. OSPF
– 'Hello' packets used to determine adjacencies and link states
– Missing hello packets (typically 3) indicate outages of links or routers
– Link states propagated through link-state advertisements (LSAs)
– Updated link-state information (adjacencies) leads to modified path selection
[Figure: protocol stack – Application (L5-7), TCP/UDP (L4), IP (L3), Link layer (L2)]
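The hello-based detection described above implies a simple bound on detection time: with the slide's figure of 3 missed hellos, a neighbour is declared down only after three hello intervals of silence. A back-of-the-envelope sketch (the 10 s hello interval is a common OSPF default, assumed here):

```python
HELLO_INTERVAL = 10.0               # seconds; a common OSPF default (assumed)
MISSED_HELLOS = 3                   # per the slide: typically 3 missed hellos
DEAD_INTERVAL = MISSED_HELLOS * HELLO_INTERVAL

def neighbour_down(last_hello_time, now):
    """Neighbour is considered failed once no hello arrived for DEAD_INTERVAL."""
    return (now - last_hello_time) >= DEAD_INTERVAL

print(DEAD_INTERVAL)                 # 30.0 s until the outage is detected
print(neighbour_down(100.0, 125.0))  # False: only 25 s of silence so far
print(neighbour_down(100.0, 131.0))  # True: link/router assumed down -> new LSA
```

This detection delay, plus LSA flooding and routing-table re-convergence, is what produces the long fail-over times criticised on the next slide.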
Page 12
Dynamic Routing: Improvements
• Drawbacks of dynamic routing
– Long duration until newly converged routing tables (30 s up to several minutes)
– Rerouting not possible if the first router (gateway) fails
• Improvements
– Speed-up: pre-determined secondary paths (e.g. via MPLS)
– Router redundancy (HSRP/VRRP):
• Multiple routers on the same LAN
• Master performs packet routing
• Fail-over by migration of the 'virtual' MAC address
[Figure: HSRP/VRRP virtual router – Router 1 and Router 2, attached via HUB 1 and HUB 2, serve NE 1 and NE 2; a single IP address of the virtual router in the network provides client transparency]
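The master-election idea behind HSRP/VRRP can be sketched as "highest-priority live router owns the virtual address". The priorities and router names below are invented for illustration; real VRRP also involves advertisement timers and preemption rules omitted here.

```python
# Toy sketch of VRRP/HSRP-style fail-over: clients always use the virtual
# router's single IP/MAC; whichever live router has the highest priority
# answers for it.
routers = {"Router 1": {"priority": 200, "alive": True},
           "Router 2": {"priority": 100, "alive": True}}

def elect_master(routers):
    """The live router with the highest priority owns the virtual address."""
    live = {name: r for name, r in routers.items() if r["alive"]}
    if not live:
        raise RuntimeError("no live router for the virtual router")
    return max(live, key=lambda name: live[name]["priority"])

print(elect_master(routers))            # 'Router 1' is master
routers["Router 1"]["alive"] = False    # master fails (advertisements stop)
print(elect_master(routers))            # 'Router 2' takes over the virtual MAC
```

Because the virtual MAC/IP migrates with mastership, the network elements never change their configured gateway, which is the client transparency noted in the figure.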
Page 13
Stream Control Transmission Protocol (SCTP)
• Defined in RFC 2960 (see also RFC 3257, 3286)
• Initial purpose: signalling transport
• Features
– Reliable, full-duplex unicast transport (performs retransmissions)
– TCP-friendly flow control (+ many other features of TCP)
– Multi-streaming, in-sequence delivery within streams → avoids head-of-line blocking (performance issue)
– Multi-homing: hosts with multiple IP addresses, path monitoring (heart-beat mechanism), transparent fail-over to secondary paths
• Useful for provisioning of network reliability
[Figure: SCTP association between Host A (addresses IPa1, IPa2) and Host B (IPb1, IPb2) running over separate networks]
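The heart-beat/fail-over behaviour can be sketched as a per-path error counter: each destination address is probed, and after too many consecutive failures the association switches to a secondary address. The threshold name echoes SCTP's Path.Max.Retrans parameter, but the logic below is a deliberate simplification of RFC 2960's actual state machine.

```python
PATH_MAX_RETRANS = 5   # RFC 2960's suggested default; simplified usage here

class Association:
    """Toy model of SCTP path monitoring for a multi-homed peer."""

    def __init__(self, paths):
        self.errors = {p: 0 for p in paths}  # consecutive heart-beat failures
        self.primary = paths[0]

    def heartbeat_result(self, path, acked):
        if acked:
            self.errors[path] = 0            # path confirmed reachable
        else:
            self.errors[path] += 1

    def active_path(self):
        if self.errors[self.primary] > PATH_MAX_RETRANS:
            # transparent fail-over to a still-active secondary path
            for path, errs in self.errors.items():
                if errs <= PATH_MAX_RETRANS:
                    return path
        return self.primary

assoc = Association(["IPb1", "IPb2"])
for _ in range(6):                       # primary path stops answering
    assoc.heartbeat_result("IPb1", acked=False)
print(assoc.active_path())               # 'IPb2': traffic uses secondary path
```

The fail-over is transparent to the application because the association, not an individual address pair, is the endpoint of the transport connection.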
Page 14
Distributed Redundancy: Reliable Server Pooling
• Distributed architecture
– Servers need only an IP address
– Entities offering the same service are grouped into pools, accessible by a pool handle
– Pool Users (PU) contact servers (Pool Elements, PE) after receiving the response to a name-resolution request sent to a Name Server (NS)
– Name Server monitors PEs
– Messages for dynamic registration and de-registration
– Flat name space
• Architecture and pool-access protocols (ASAP, ENRP) defined in the IETF RSerPool WG
• Failure detection and fail-over performed in the ASAP layer of the client
[Figure: Pool User performs name resolution against the Name Server(s); PE (A) and PE (B) form a server pool with (de-)registration and monitoring towards the Name Server(s), [state-sharing] between PEs, and fail-over from one PE to the other]
ASAP = Aggregate Server Access Protocol
ENRP = Endpoint Name Resolution Protocol
Page 20
RSerPool: More Details
• RSerPool scenario
– Each PE contains a full implementation of the functionality, no distributed sub-processes (different to RTP) → reduced granularity for possible load balancing
• RSerPool name space
– Flat name space → easier management, performance (no recursive requests)
– All name servers in an operational scope hold the same info (about all pools in the operational scope)
• Load balancing
– Load factors sent by the PE to the Name Server (initiated by the PE)
– In resolution requests, the Name Server communicates load factors to the PU
– The PU can use load factors in selection policies
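A PU-side selection policy using those load factors can be as simple as "least used": pick the pool element whose reported load factor is smallest. The pool contents below are invented for illustration; real RSerPool defines several policies (round-robin, least-used, weighted-random, etc.).

```python
def least_used(pool):
    """Select the PE with the smallest reported load factor.

    pool: mapping of pool-element name -> load factor (0 = idle).
    """
    return min(pool, key=pool.get)

# Load factors as returned with a name-resolution response (made-up values).
pool = {"PE (A)": 0.7, "PE (B)": 0.2}
print(least_used(pool))   # 'PE (B)' is selected
```

Because the policy runs in the PU's ASAP layer, no extra load-balancer box sits in the data path, which is exactly the "no additional load balancing SW" point made on the server-blade slide below.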
Page 21
RSerPool Protocols: Functionality (current status in IETF)
• ENRP
+ State-sharing between Name Servers
• ASAP
+ (De-)registration of pool elements (PE, Name Server)
+ Supervision of pool elements by a Name Server
+ Name resolution (PU, Name Server)
+ PE selection according to policy (& load factors) (PU)
+ Failure detection based on the transport layer (e.g. SCTP timeout)
+ Support of fail-over to another pool element (PU-PE)
+ Business cards (pool-to-pool communication)
+ Last will
+ Simple cookie mechanism (usable for pool-element state information) (PU-PE)
+ Under discussion: application-layer acknowledgements (PU-PE)
Page 22
RSerPool and Server Blades
• Server blades
– Many processor cards in one 19" rack → space efficiency
– Integrated switch (possibly duplicated) for external communication
– Backplane for internal communication (duplicated)
• Combination: RSerPool on server blades
– No additional load-balancing SW necessary (but less granularity)
– Works on any heterogeneous system without additional implementation effort (e.g. server blades + Sun workstations)
– Standardized protocol stacks, no implementation effort in clients (except for use of the RSerPool API)
Page 23
References/Acknowledgements
• M. Bozinovski, T. Renier, K. Larsen: 'Reliable Call Control', project reports and presentations, CPK/CTIF and Siemens ICM N, 2001-2004.
• Siemens ICM N, S. Uhle and colleagues.
• Fujitsu Siemens Computers (FSC), 'Reliable Telco Platform', slides.
• E. Jessen, 'Dependability and Fault-Tolerance of Computing Systems', lecture notes (in German), TU Munich, 1996.