Conclusions
Ramiro Voicu, LHCOPN, London, March 2010
The MonALISA Architecture
Regional or Global High Level Services, Repositories & Clients
Secure and reliable communication
Dynamic load balancing
Scalability & Replication
AAA for Clients
Distributed Dynamic Registration and Discovery, based on a lease mechanism and remote events
Network of JINI-Lookup Services, Secure & Public
MonALISA services
Proxies
HL services
Agents
Distributed System for gathering and analyzing information based on mobile agents: Customized aggregation, Triggers, Actions
Fully Distributed System with no Single Point of Failure
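The lease-based registration mentioned on this slide can be illustrated with a small sketch. This is not the JINI API; `LeaseRegistry` and its methods are hypothetical names, and the clock is passed in explicitly so the behaviour is easy to follow. The idea shown is only that a registration expires unless the service renews it, so a crashed service disappears from discovery without any explicit cleanup.

```python
from typing import Dict, List


class LeaseRegistry:
    """Toy lookup service: entries vanish unless their lease is renewed."""

    def __init__(self, lease_seconds: float) -> None:
        self.lease_seconds = lease_seconds
        self._expiry: Dict[str, float] = {}  # service name -> lease expiry time

    def register(self, service: str, now: float) -> None:
        # Grant a lease that is valid for lease_seconds starting at `now`.
        self._expiry[service] = now + self.lease_seconds

    def renew(self, service: str, now: float) -> None:
        # Renewing simply extends the lease; unknown services are ignored.
        if service in self._expiry:
            self._expiry[service] = now + self.lease_seconds

    def discover(self, now: float) -> List[str]:
        # Drop expired leases, then return the services that are still alive.
        self._expiry = {s: t for s, t in self._expiry.items() if t > now}
        return sorted(self._expiry)
```

A service that stops renewing is dropped after at most one lease period, which is what removes the need for a central component to track failures.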
MonALISA Service & Data Handling
Data Store
Data Cache Service & DB
Configuration Control (SSL)
Predicates & Agents
Data (via ML Proxy)
Applications, Clients or Higher Level Services
WS Clients and services
WebService
WSDL
SOAP
Lookup Service
Registration
Discovery
Postgres
AGENTS
FILTERS / TRIGGERS
Monitoring Modules
Collects any type of information
Dynamic (Re)Loading
Push and Pull
Two levels of decisions:
local (autonomous),
global (correlations).
Actions triggered by:
values above/below given thresholds,
absence/presence of values,
correlations between any values.
Action types:
alerts (emails/instant msg/atom feeds),
running an external command,
automatic chart annotations in the repository,
running custom code, such as securely ordering an ML service to (re)start a site service.
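The trigger conditions and action types listed above can be sketched as follows. This is an illustration only, not the MonALISA actions framework: `Trigger`, `above`, `absent`, `both` and `evaluate` are hypothetical names, and the action here is just a callback standing in for an alert, an external command or custom code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Latest monitored values: parameter name -> value (None or missing = absent).
Values = Dict[str, Optional[float]]
Condition = Callable[[Values], bool]


@dataclass
class Trigger:
    name: str
    condition: Condition
    action: Callable[[str], None]  # e.g. send an alert or run a command


def above(param: str, threshold: float) -> Condition:
    # Fires when the value is present and exceeds the threshold.
    return lambda v: v.get(param) is not None and v[param] > threshold


def absent(param: str) -> Condition:
    # Fires when no value was received for the parameter.
    return lambda v: v.get(param) is None


def both(a: Condition, b: Condition) -> Condition:
    # A simple correlation: fires only when both conditions hold.
    return lambda v: a(v) and b(v)


def evaluate(triggers: List[Trigger], values: Values) -> List[str]:
    # Run the action of every trigger whose condition holds; return their names.
    fired = []
    for t in triggers:
        if t.condition(values):
            t.action(t.name)
            fired.append(t.name)
    return fired
```

For example, `Trigger("no_ping", absent("ping_rtt"), send_alert)` would call a user-supplied `send_alert` whenever the `ping_rtt` value is missing from the latest sample.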
ML Service
Actions based on global information
Actions based on local information
• Traffic
• Jobs
• Hosts
• Apps
• Temperature
• Humidity
• A/C Power
• …
Sensors
Local decisions
Global decisions
Local and Global Decision Framework
Global ML Services
USLHCNet
USLHCNet provides transatlantic connections between the Tier1 computing facilities at Fermilab and Brookhaven and the Tier0 and Tier1 facilities at CERN, as well as Tier1s elsewhere in Europe and Asia.
Together with ESnet, Internet2 and GEANT, USLHCNet supports connections between the Tier2 centers.
The USLHCNet core infrastructure uses Ciena Core Director devices, which provide time-division multiplexing and packet-forwarding protocols that support virtual circuits with bandwidth guarantees. The virtual circuits make it possible to build efficient data transfer services with support for QoS and priorities.
Hybrid network: uses both Ciena CD and Force10 routers
6 transatlantic 10G links at the moment
USLHCNet ML weather map
Monitoring modules
We developed a set of monitoring modules for USLHCNet network devices:
Force10 (SNMP & sFlow)
Traffic per interface
sFlow traffic
Link status monitoring
Ciena Core Director (TL1 – Transaction Language 1)
ETTP (Ethernet Termination Point) traffic
EFLOW (Ethernet Flow) traffic
OSRP (routing protocol) topology
VCG Provisioned / Available Bandwidth
Dynamic circuits inside the optical core of the network
Ping module/MLPing trigger which sends alarms in case of packet loss
Each circuit is monitored at both ends by at least two MonALISA services; the monitored data is aggregated by global filters in the repository.
Local and global filters
Based on the MonALISA actions framework, a set of triggers has been deployed inside the service to notify the USLHCNet network engineers by email, SMS and IM in case of problems.
The filters developed for the USLHCNet repository aggregate the redundant monitoring data (traffic and link status) collected from all the MonALISA services.
The link status is computed as a logical “AND” between both end points of a link. This also cross checks the status reported by the hardware equipment.
We collect data in two repository instances, each with replicated database back-ends. These instances are dynamically balanced in DNS.
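The link-status rule described above (a logical "AND" between the two end points) can be written down in a few lines. This is a minimal sketch, not the repository's filter code: the function names are hypothetical, and treating a missing report as "down" is an assumption made here for illustration.

```python
from typing import Optional


def link_status(end_a_up: Optional[bool], end_b_up: Optional[bool]) -> bool:
    """Link is up only if BOTH end points report it up.

    Assumption: a missing report (None) is treated as down.
    """
    return bool(end_a_up) and bool(end_b_up)


def status_mismatch(end_a_up: Optional[bool], end_b_up: Optional[bool]) -> bool:
    """Cross-check: True when both ends reported and they disagree."""
    return end_a_up is not None and end_b_up is not None and end_a_up != end_b_up
```

A mismatch between the two ends is worth flagging separately, since it suggests that one device is reporting a stale or incorrect state rather than a real outage.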
USLHCNet: Precise measurements for the Operational Status on the WAN Link
Operations & management assisted by agent-based software
Used on the new CIENA equipment used for network management
USLHCNet: ALL EFLOW traffic - last 2 months
USLHCNet: Accounting for Integrated Traffic
Topology monitoring and discovery
NETWORKS
AS
ROUTERS
Real Time Topology Discovery & Display
Storage discovery in ALICE
France
Italy
USA
Russia
Nordic Countries
distance(IP, IP)
Same IP-class network
Common domain name
Same AS
Same country (+ function of RTT between the respective AS-es if known)
If distance between the AS-es is known, use it
Same continent
Far away
distance(IP, Set<IP>): Client's public IP to all known IPs for the storage
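The ordering of the distance rules above can be sketched as a ranking function. Several things here are assumptions and not taken from the slides: the `Host` record and its fields, the integer ranks 0 to 5, reading "same IP-class network" as a shared /24 prefix, reading "common domain name" as matching the last two domain labels, and taking the minimum over the storage's IPs for `distance(IP, Set<IP>)`. The RTT and known AS-to-AS distance refinements are left out.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Host:
    # Hypothetical record; the real system resolves these from the IP.
    ip: str
    domain: str
    asn: int
    country: str
    continent: str


def distance(a: Host, b: Host) -> int:
    """Rank closeness from 0 (same network) to 5 (far away)."""
    if a.ip.split(".")[:3] == b.ip.split(".")[:3]:  # assumed: same /24
        return 0
    if a.domain.split(".")[-2:] == b.domain.split(".")[-2:]:
        return 1
    if a.asn == b.asn:
        return 2
    if a.country == b.country:
        return 3
    if a.continent == b.continent:
        return 4
    return 5


def distance_to_storage(client: Host, storage_ips: Iterable[Host]) -> int:
    """Assumed aggregate: the closest of the storage's known addresses."""
    return min(distance(client, s) for s in storage_ips)
```

Sorting the candidate storage elements by `distance_to_storage` would then give a client the closest replicas first.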
“Fiber cut” simulations
The traffic moves from one transatlantic line to the other one
FDT transfer (CERN – CALTECH) continues uninterrupted
TCP fully recovers in ~ 20s