Connect. Communicate. Collaborate

Hades – Going Operational

Roland Karch, RRZE FAU Erlangen-Nürnberg

JRA1 Montpellier Meeting, October 2006

Hades Implementation Status List

• IPv6 Measurements (Up and running in more than half of the JRA1 locations)
• Multicast Measurements (Implementation)
• Alerts
  – Packet Loss Maps (Implemented, Deployed for X-WiN)
  – SNMP Traps (Server needs to be set up)
  – Generic Web Interface (Evaluation)
• Maintenance
  – To be integrated into one interface with Alerts

IPv6 Measurements

• Running in:
  – Amsterdam (SURFnet)
  – Athens (GRNET)
  – Ljubljana (ARNES)
  – Paris (RENATER) (currently offline)
  – Prague (CESNET)
  – Sofia (ISTF)
  – Zagreb (CARNET)
• Do you own a JRA1 Hades measurement box and an IPv6-capable network, but are not on the list? Contact us!

Hades weather map (GEANT/NRENs, Geographically)

Hades weather maps (Abstract, domain specific)

Alerts – Packet Loss Maps

• One map to show observed packet loss on all Hades-monitored links
• Colour coding on links to show short and long outages
• Currently still in development, not yet available in the European context
• Maps for other metrics are under consideration, but details about those metrics are yet to be determined (see statistical analysis)

Alerts – SNMP traps

• Problem with data on the measurement archive: its age is between 0 and 90 minutes
• To ensure up-to-date information for alerts, the options are either:
  – Increase the frequency of data polling (causes management network overhead and load on the measurement point and archive)
  – Do the analysis on the measurement point in real time (CPU load on the measurement point only, but raises the problem of how to deliver decentralized alerts)
• Solution: decentralized analysis, with SNMP traps for alerting

• Multiple potential use cases for traps
  – A central visualization could subscribe to all alerts in order to create a powerful map and/or alert list with history
  – NOCs might subscribe for their uplinks or sensitive paths to important locations (they typically already run SNMP-capable monitoring facilities)

• Benefits
  – Only causes network traffic when necessary
  – Real-time data for analysis is available on the measurement point
  – SNMP MP usable?
• Drawbacks
  – SNMP is very often filtered on the way into user networks (web visualisation as an intermediate server might solve that)
  – Won't alert when the reporting path is affected by the network problem itself
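The decentralized approach above could be sketched roughly as follows: the measurement point evaluates each measurement interval locally and sends a notification only when the alert state changes, so traffic is caused only when necessary. This is a minimal illustration, not the Hades implementation. The class name `LossMonitor`, the 1 % loss threshold and the `send_trap` callback are all assumptions; in a real setup the callback would be replaced by an actual SNMP trap sender.

```python
from typing import Callable, List


class LossMonitor:
    """Hypothetical local analyser running on a measurement point."""

    def __init__(self, threshold: float, send_trap: Callable[[str], None]) -> None:
        self.threshold = threshold  # assumed loss ratio that counts as a problem
        self.send_trap = send_trap  # stand-in for a real SNMP trap sender
        self.alerting = False

    def observe(self, sent: int, received: int) -> None:
        """Evaluate one measurement interval; notify only on state changes."""
        loss = 0.0 if sent == 0 else (sent - received) / sent
        now_alerting = loss > self.threshold
        if now_alerting != self.alerting:
            state = "RAISED" if now_alerting else "CLEARED"
            self.send_trap(f"packet-loss {state} loss={loss:.2%}")
            self.alerting = now_alerting


# Synthetic intervals: (packets sent, packets received).
traps: List[str] = []
monitor = LossMonitor(threshold=0.01, send_trap=traps.append)
for sent, received in [(100, 100), (100, 90), (100, 85), (100, 100)]:
    monitor.observe(sent, received)

for message in traps:
    print(message)
```

Four intervals are analysed but only two notifications are produced (one when the loss starts, one when it clears). The second drawback still applies: if the reporting path itself is down, the notification will not arrive.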

Alerts – Statistics

• Higher level of statistical analysis for measurement data might help to determine a “connection footprint” and show changes in it due to routing changes.
• Possible numbers to play with:
  – Line inherent delay (minimal delay that catches all, or a high percentile of all measurement packets)
  – Regular IPDV (blurry zone in a plot, delta between line inherent delay and maximum of 90 percent of the measurements)
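The two candidate numbers could be computed along these lines. This is a sketch under stated assumptions: the line inherent delay is taken to be the minimum observed one-way delay, the regular IPDV is taken to be the 90th percentile (nearest-rank method) minus that minimum, and the function names and sample values are invented for illustration.

```python
import math
from typing import Sequence, Tuple


def percentile(values: Sequence[float], percent: int) -> float:
    """Nearest-rank percentile: smallest value covering `percent` % of the samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


def footprint(owd_us: Sequence[float], percent: int = 90) -> Tuple[float, float]:
    """Return (line inherent delay, regular IPDV) for one-way delays in µs."""
    line_delay = min(owd_us)  # assumption: minimum delay is the line inherent delay
    regular_ipdv = percentile(owd_us, percent) - line_delay
    return line_delay, regular_ipdv


# Synthetic one-way delays in µs, with one outlier at the end.
samples = [11400, 11402, 11401, 11405, 11403, 11400, 11404, 11402, 11401, 11950]
line_delay, regular_ipdv = footprint(samples)
print(line_delay, regular_ipdv)
```

Because the percentile ignores the slowest 10 % of packets, the single outlier does not widen the "blurry zone"; a routing change, in contrast, would be expected to shift the line inherent delay itself.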

Alerts – Statistics – Key values

• 11.4 ms minimal delay subtracted: “Network intrinsic delay”
• 1 µs gap: timestamp precision
• Lower boundary: timer precision

[Figure: One-way-delay. Axes: Time / h (00:00 to 04:00) and Delta Delay / µs.]
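The quantities on this slide could be derived from recorded delays roughly as shown here. The sketch assumes one-way delays stored as integer microseconds; `delta_delays` and `smallest_gap` are illustrative names, and the sample values are synthetic.

```python
from typing import List, Optional, Sequence


def delta_delays(owd_us: Sequence[int]) -> List[int]:
    """Subtract the minimal delay so that only the variable part remains."""
    base = min(owd_us)
    return [delay - base for delay in owd_us]


def smallest_gap(values: Sequence[int]) -> Optional[int]:
    """Smallest distance between distinct values, a hint at timestamp precision."""
    distinct = sorted(set(values))
    gaps = [b - a for a, b in zip(distinct, distinct[1:])]
    return min(gaps) if gaps else None


# Synthetic one-way delays in µs around a minimal delay of 11.4 ms.
owd_us = [11400, 11403, 11401, 11400, 11407, 11402]
deltas = delta_delays(owd_us)
print(deltas, smallest_gap(deltas))
```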

Alerts – Statistics – Pathfinders

• First packet in every group of 5: ~7 µs longer delay

• Most probable reason: Receiver process has to be loaded into the CPU cache before processing the first packet

[Figure: Pathfinder packets. Series: Pathfinder, No pathfinder. Axes: Delta OWD / µs (0 to 24) and N.]
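The pathfinder effect could be quantified by comparing the first packet of each group with the remaining ones. This is a minimal sketch that assumes the delays are listed in send order, that groups start at index 0 and that no packets are missing; the function name and the sample data are made up for illustration.

```python
from statistics import mean
from typing import Sequence


def pathfinder_offset(owd_us: Sequence[float], group_size: int = 5) -> float:
    """Mean extra delay of the first packet in each group versus the others."""
    first = [d for i, d in enumerate(owd_us) if i % group_size == 0]
    rest = [d for i, d in enumerate(owd_us) if i % group_size != 0]
    return mean(first) - mean(rest)


# Synthetic delta OWD values in µs for two groups of five packets.
samples = [18, 11, 12, 11, 10, 19, 12, 11, 12, 13]
offset = pathfinder_offset(samples)
print(offset)
```

If packet loss occurs, group membership would have to be taken from the packet sequence numbers rather than from the list position.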

Alerts – Statistics – Path fingerprint

• Comparison of paths on different networks (hardware, lines and configuration differ)
• Both: small OWD, narrow distribution of delay
• Path 2: longer distribution tail
• Path 1: reordering!

[Figure: Delay on two network paths. Series: Path 1, Path 2. Axes: Delta Delay / µs (0 to 80) and N.]
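Reordering as seen on Path 1 could be detected from the sequence numbers in arrival order. The sketch below uses one simple definition as an assumption: a packet counts as reordered when its sequence number is lower than the highest one already received. The function name and the two sample sequences are hypothetical.

```python
from typing import Sequence


def count_reordered(arrival_seq: Sequence[int]) -> int:
    """Count packets arriving with a sequence number below the highest seen so far."""
    highest = -1
    reordered = 0
    for seq in arrival_seq:
        if seq < highest:
            reordered += 1
        else:
            highest = seq
    return reordered


# Synthetic arrival orders: path 1 swaps two pairs, path 2 arrives in order.
path1 = [0, 1, 3, 2, 4, 6, 5, 7]
path2 = [0, 1, 2, 3, 4, 5, 6, 7]
print(count_reordered(path1), count_reordered(path2))
```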

Maintenance

• Most important part of “going operational”
• Current status:
  – Daily check via the web visualization of which measurement lines are down (up to 24 hours delay)
  – Scripts run to catch most anomalies (clock status, old data)
  – perfSONAR MAs are monitored externally (ISTF)
• Evaluation of Nagios [1]
  – Could serve as a common platform for alert and maintenance visualization
  – Provides a front end for both SNMP and scripted surveillance

[1] http://nagios.org/
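A scripted check for "old data" might look roughly like the following. It is only a sketch: the 90-minute limit is borrowed from the data age mentioned for the measurement archive and is an assumption here, and the function name, line names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def stale_lines(
    last_seen: Dict[str, datetime],
    now: datetime,
    max_age: timedelta = timedelta(minutes=90),  # assumed limit, not from Hades
) -> List[str]:
    """Return the measurement lines whose newest data is older than max_age."""
    return sorted(name for name, ts in last_seen.items() if now - ts > max_age)


# Hypothetical measurement lines with the time of their newest data.
now = datetime(2006, 10, 1, 12, 0)
last_seen = {
    "site-a -> site-b": datetime(2006, 10, 1, 11, 30),
    "site-a -> site-c": datetime(2006, 10, 1, 9, 0),
}
print(stale_lines(last_seen, now))
```

A check of this kind returns a plain list, so its result could be fed into whichever front end is chosen for alert and maintenance visualization.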

• Goals
  – Highest possible level of automation
  – Fixing of simple problems either fully automated (e.g. restarting measurements) or via scripts that can be triggered on the web server
  – Transparency for users

Questions / Discussion / Want to contact us?

• Website: http://www.win-labor.dfn.de/
• Email: win-labor@dfn.de
