Connect. Communicate. Collaborate Hades – Going Operational Roland Karch, RRZE FAU Erlangen-Nürnberg JRA1 Montpellier Meeting, October 2006.
Hades Implementation Status List
• IPv6 Measurements (Up and running in more than half of the JRA1 locations)
• Multicast Measurements (Implementation)
• Alerts
– Packet Loss Maps (Implemented, Deployed for X-WiN)
– SNMP Traps (Server needs to be set up)
– Generic Web Interface (Evaluation)
• Maintenance
– To be integrated into one interface with Alerts
IPv6 Measurements
• Running in:
– Amsterdam (SURFnet)
– Athens (GRNET)
– Ljubljana (ARNES)
– Paris (RENATER) (currently offline)
– Prague (CESNET)
– Sofia (ISTF)
– Zagreb (CARNET)
• Do you own a JRA1 Hades measurement box and an IPv6-capable network but aren't on the list? Contact us!
Hades weather map (GEANT/NRENs, Geographically)
Hades weather maps (Abstract, domain specific)
Alerts – Packet Loss Maps
• One map to show observed packet loss on all Hades monitored links
• Colour coding on links to show short and long outages
• Currently still in development, not yet available in the European context
• Maps for other metrics are under consideration, but details about those metrics are yet to be determined (see statistical analysis)
Alerts – SNMP traps
• Problem with data on the measurement archive: age between 0 and 90 minutes
• To ensure up-to-date information for alerts, the options are either:
– Increase the frequency of data polling (causing management network overhead and load on the measurement point and archive)
– Do the analysis on the measurement point in real time (CPU load on the measurement point only, but the problem of how to deliver decentralized alerts)
• Solution: Decentralized analysis, and SNMP traps for alerting
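The decentralized approach can be illustrated with a minimal sketch. It assumes that the measurement point sees (sent, received) packet counts per measurement group and raises a notification only when the state of a link changes. The names `Alert`, `analyse_groups` and `notify`, as well as the 20 % loss threshold, are hypothetical and not part of Hades; in a real deployment the `notify` callback would be the place where an SNMP trap is sent.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Alert:
    link: str          # hypothetical link identifier
    state: str         # "loss" or "ok"
    loss_ratio: float  # loss observed in the group that triggered the alert


def analyse_groups(
    link: str,
    groups: Iterable[Tuple[int, int]],
    threshold: float,
    notify: Callable[[Alert], None],
) -> None:
    """Analyse (sent, received) counts locally and notify on state changes only."""
    state = "ok"
    for sent, received in groups:
        loss = 1.0 - received / sent if sent else 0.0
        new_state = "loss" if loss > threshold else "ok"
        if new_state != state:
            # Traffic is generated only here, i.e. only when something changes.
            notify(Alert(link, new_state, loss))
            state = new_state


# Usage: collect alerts in a list instead of sending traps.
alerts: List[Alert] = []
analyse_groups("A-B", [(5, 5), (5, 3), (5, 2), (5, 5)], 0.2, alerts.append)
```

Notifying only on transitions keeps the management network quiet while a link is stable, which matches the benefit the slides claim for traps over more frequent polling.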
Alerts – SNMP traps
• Multiple potential use cases for traps
– Central visualization to subscribe to all alerts in order to create a powerful map and/or alert list with history
– NOCs might subscribe for their uplinks/sensitive paths to important locations (typically already running SNMP capable monitoring facilities)
Alerts – SNMP traps
• Benefits
– Only causes network traffic when necessary
– Real time data for analysis available on the measurement point
– SNMP MP usable?
• Drawbacks
– SNMP is very often filtered on the way into user networks (web visualisation as an intermediate server might solve that)
– Won't alert when the reporting path is affected by the network problem itself
Alerts – Statistics
• Higher level of statistical analysis for measurement data might help to determine a “connection footprint” and show changes in it due to routing changes.
• Possible numbers to play with:
– Line inherent delay (minimal delay that catches all, or a high percentile of all measurement packets)
– Regular IPDV (blurry zone in a plot, delta between line inherent delay and maximum of 90 percent of the measurements)
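The two numbers above could be computed roughly as follows. This is only a sketch under assumptions: it takes the line inherent delay to be the minimum observed one-way delay and the regular IPDV to be the distance from that minimum to the 90th percentile (nearest-rank method). The function names and the choice of percentile method are illustrative and not taken from Hades.

```python
import math
from typing import Dict, List, Sequence


def percentile(sorted_values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of an ascending sequence (0 < pct <= 100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def delay_footprint(delays_us: List[float], upper_pct: int = 90) -> Dict[str, float]:
    """Return the two candidate footprint values for a list of one-way delays."""
    ordered = sorted(delays_us)
    intrinsic = ordered[0]                   # assumed: minimum delay
    upper = percentile(ordered, upper_pct)   # delay not exceeded by 90 % of packets
    return {"intrinsic_delay": intrinsic, "regular_ipdv": upper - intrinsic}


# Usage with made-up delays in microseconds, including one outlier.
sample = [11400, 11401, 11402, 11403, 11404, 11405, 11406, 11407, 11408, 11500]
footprint = delay_footprint(sample)
```

Comparing such a footprint between measurement intervals is one way a routing change might show up as a shift in either value.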
Alerts – Statistics – Key values
• 11.4 ms minimal delay subtracted: “Network intrinsic delay”
• 1 µs gap: timestamp precision
• Lower boundary: timer precision
[Figure: One-way delay. x-axis: Time / h (00:00 to 04:00); y-axis: Delta Delay / µs (0 to 40)]
Alerts – Statistics – Pathfinders
• First packet in every group of 5: ~7 µs longer delay
• Most probable reason: Receiver process has to be loaded into the CPU cache before processing the first packet
[Figure: Pathfinder packets. Series: Pathfinder, No pathfinder. x-axis: Delta OWD / µs (0 to 24); y-axis: N (0 to 1800)]
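The pathfinder effect can be checked with a short calculation. The sketch below assumes that delays are available as a flat list in send order, with each group of 5 packets stored consecutively; `pathfinder_offset` is a hypothetical helper name and the sample values are made up.

```python
from statistics import mean
from typing import List


def pathfinder_offset(delays_us: List[float], group_size: int = 5) -> float:
    """Mean delay of the first packet per group minus the mean of all others."""
    first = delays_us[0::group_size]
    rest = [d for i, d in enumerate(delays_us) if i % group_size != 0]
    return mean(first) - mean(rest)


# Usage: two groups of 5 where the first packet of each group is slower.
delays = [17, 10, 10, 10, 10, 17, 10, 10, 10, 10]
offset = pathfinder_offset(delays)
```

Separating the two populations before further analysis avoids the pathfinder packets widening the delay distribution artificially.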
Alerts – Statistics – Path fingerprint
• Comparison of paths on different networks (hardware, lines, configuration differs)
• Both: small OWD, narrow distribution of delay
• Path 2: longer distribution tail
• Path 1: reordering!
[Figure: Delay on two network paths. Series: Path 1, Path 2. x-axis: Delta Delay / µs (0 to 80); y-axis: N (0 to 800)]
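Reordering such as that seen on Path 1 can be detected from sequence numbers alone. The following is a simple sketch, assuming the receiver records sequence numbers in arrival order; it counts every packet that arrives after a packet with a higher sequence number. The function name is hypothetical and this is not necessarily how Hades detects reordering.

```python
from typing import Iterable


def count_reordered(seq_in_arrival_order: Iterable[int]) -> int:
    """Count packets whose sequence number is lower than one already received."""
    highest = None
    reordered = 0
    for seq in seq_in_arrival_order:
        if highest is not None and seq < highest:
            reordered += 1   # arrived late, behind a newer packet
        else:
            highest = seq    # in-order arrival, advance the high-water mark
    return reordered


# Usage: packet 3 arrives after packet 4.
late = count_reordered([1, 2, 4, 3, 5])
```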
Maintenance
• Most important part of “going operational”
• Current status:
– Daily check, via the web visualization, of which measurement lines are down (up to 24 hours delay)
– Scripts run to catch most anomalies (clock status, old data)
– perfSONAR MAs are monitored externally (ISTF)
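A check for old data, as mentioned above, might look like the following sketch. It assumes a mapping from measurement line to the timestamp of its newest data; the function name, the line identifiers and the two-hour threshold are hypothetical and would need to match the real archive layout.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def stale_lines(
    last_seen: Dict[str, datetime], now: datetime, max_age: timedelta
) -> List[str]:
    """Return the measurement lines whose newest data is older than max_age."""
    return sorted(line for line, ts in last_seen.items() if now - ts > max_age)


# Usage with made-up timestamps.
now = datetime(2006, 10, 1, 12, 0)
last_seen = {
    "A-B": now - timedelta(minutes=30),
    "A-C": now - timedelta(hours=5),
}
stale = stale_lines(last_seen, now, timedelta(hours=2))
```

Run periodically, such a script would shorten the up-to-24-hour detection delay of the manual daily check.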
Maintenance
• Evaluation of Nagios [1]
• Could serve as a common platform for alert and maintenance visualization
• Provides a front end for both SNMP and scripted surveillance
[1] http://nagios.org/
Maintenance
• Goals
– Highest possible level of automation
– Fixing of simple problems either fully automated (e.g. restarting measurements) or via scripts that can be triggered on the web server
– Transparency for users
Questions / Discussion / Want to contact us?
• Website: http://www.win-labor.dfn.de/
• Email: win-labor@dfn.de