Fabric Management with ELFms
BARC-CERN collaboration meeting
B.A.R.C. Mumbai
28/10/05
Presented by G. Cancio – CERN/IT
German Cancio – CERN/IT - n° 2
Outline
The ELFms framework
- Quattor
- Lemon
- LEAF
Deployment status
Fabric Management with ELFms (I)
ELFms stands for 'Extremely Large Fabric management system'.
Subsystems:
- Quattor: configuration, installation and management of nodes
- Lemon: system / service monitoring
- LEAF: hardware / state management
ELFms manages and controls most of the nodes in the CERN CC:
- ~2600 nodes out of ~3500
- Multiple functionalities and cluster sizes (batch nodes, disk servers, tape servers, DB, web, ...)
- Heterogeneous hardware (CPU, memory, HD size, ...)
- Supported OS: Linux (RH7, RHES 2/3/4, Scientific Linux 3/4 – 32/64-bit) and Solaris
Fabric Management with ELFms (II)
• ELFms (Quattor/Lemon) was started in the scope of the EU DataGrid project.
• Development is now coordinated by CERN/IT in collaboration with other HEP institutes.
• Quattor/Lemon are used in production in and outside CERN
• LCG T1/T2 sites, ranging from 50 to 800 nodes per site
• Complete configuration of the system and LCG Grid middleware via Quattor
• Integration with Grid services, e.g. monitoring (GridICE, MonALISA)
http://quattor.org
Quattor
Quattor takes care of the configuration, installation and management of fabric nodes.
A Configuration Database holds the 'desired state' of all fabric elements:
• Node setup (CPU, HD, memory, software RPMs/PKGs, network, system services, location, audit info, ...)
• Cluster (name and type, batch system, load-balancing info, ...)
Autonomous management agents run on each node for:
• Base installation
• Service (re-)configuration
• Software installation and management
Architecture
[Architecture diagram: On the managed nodes, the Node Configuration Manager (NCM) runs components (CompA, CompB, CompC) that configure the corresponding services (ServiceA, ServiceB, ServiceC), while the SW Package Manager (SPMA) installs RPMs/PKGs fetched over HTTP from the SW Repository on the SW server(s). An install server delivers the base OS via HTTP/PXE through an Install Manager driving the system installer. A configuration server serves XML configuration profiles over HTTP out of CDB, which offers SQL and XML backends and is accessed via SQL, CLI, GUI, scripts and SOAP.]
Configuration Information
Configuration is expressed using a language called Pan.
Information is arranged into templates; common properties are set only once.
Using templates it is possible to create hierarchies that match service structures, for example:
- CERN CC: name_srv1: 137.138.16.5, time_srv1: ip-time-1
  - lxbatch: cluster_name: lxbatch, master: lxmaster01, pkg_add (lsf5.1)
  - lxplus: cluster_name: lxplus, pkg_add (lsf5.1)
    - lxplus001: eth0/ip: 137.138.4.246, pkg_add (lsf6_beta)
    - lxplus020: eth0/ip: 137.138.4.225
    - lxplus029
  - disk_srv
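The inheritance mechanism behind these templates can be sketched in a few lines of Python (not Pan; the property names are copied from the example above, and the merge logic is only an illustrative approximation of how properties set once at the top are inherited and overridden further down the hierarchy):

```python
# Sketch of Pan-style template inheritance (illustrative only --
# real Pan is its own language with typing and validation).

def merge(parent: dict, child: dict) -> dict:
    """A child template inherits the parent's properties and may override them."""
    result = dict(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)  # deep-merge nested paths
        else:
            result[key] = value                      # override or add
    return result

# Site-wide defaults, set only once (cf. the CERN CC template)
cern_cc = {"name_srv1": "137.138.16.5", "time_srv1": "ip-time-1",
           "packages": ["lsf5.1"]}

# Cluster template inherits the defaults and adds cluster properties
lxplus = merge(cern_cc, {"cluster_name": "lxplus"})

# Node profile overrides the package list for one machine
lxplus001 = merge(lxplus, {"eth0/ip": "137.138.4.246",
                           "packages": ["lsf6_beta"]})

print(lxplus001["name_srv1"])   # inherited from the site template
print(lxplus001["packages"])    # overridden at node level
```

The point of the hierarchy is exactly this: the name server is stated once for the whole CC, while an individual node profile only states what differs.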
Quattor Deployment
- Quattor in complete control of the Linux boxes (~2600 nodes, to grow to ~6000-8000 in 2008)
- CDB holds information on all systems in the CERN CC
- Over 90 NCM configuration components developed, from basic system configuration to Grid services setup (including desktops)
- SPMA used for managing all software
  - ~2 weekly security and functional updates (including kernel upgrades)
  - E.g. KDE security upgrade (~300 MB per node) and LSF client upgrade (v4 to v5) in 15 minutes, without service interruption
  - Handles (occasional) downgrades as well
- Ongoing developments:
  - CDB: fine-grained ACL protection of templates, namespaces, stronger typing, improved SQL/XMLDB backend, ...
  - Security: deployment of HTTPS instead of HTTP (use of host certificates)
  - Re-engineering of the Software Repository (BARC)
  - Proxy architecture for enhanced scalability, ...
Proxy server setup
[Diagram: Installation images, RPMs and configuration profiles are served over DNS-load-balanced HTTP. A backend ("master") server M, with replica M', feeds the frontend L1 proxies, which in turn feed the L2 proxies ("head" nodes) – one per rack (Rack 1 ... Rack N) of the server cluster.]
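The resulting download path can be sketched as an ordered failover list, nearest tier first (all hostnames here are hypothetical; the slides do not give the real server names):

```python
# Sketch of the tiered fetch path in the proxy setup
# (all hostnames are hypothetical placeholders).

def fetch_order(rack: int, path: str) -> list[str]:
    """Return the URLs a node in the given rack would try, nearest tier first."""
    return [
        f"http://head-rack{rack:02d}.example.org/{path}",   # L2 proxy: rack head node
        f"http://swrep-frontend.example.org/{path}",        # L1 proxies (DNS load-balanced alias)
        f"http://swrep-master.example.org/{path}",          # backend ("master") server
    ]

urls = fetch_order(3, "RPMS/openssl-0.9.7a.i386.rpm")
print(urls[0])   # the node hits its own rack's head node first
```

Keeping most fetches inside the rack is what gives the scheme its scalability: the master only ever serves the L1 tier, not thousands of nodes.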
Quattor @ LCG/EGEE
- EGEE and LCG have chosen Quattor for managing their integration testbeds
- Components are available for a fully automated LCG-2 configuration
- Many sites (a dozen, including LAL/IN2P3, NIKHEF, DESY, ...) have adopted Quattor as their fabric management framework...
  - In India: BARC, VECCAL (ALICE experiment)
- ...leading to improved core software robustness and completeness
  - Site dependencies and assumptions identified and removed
  - Documentation, installation guides, bug tracking, release cycles
http://cern.ch/lemon
Lemon – LHC Era Monitoring
[Architecture diagram: On each node, a Monitoring Agent runs a set of sensors and forwards their samples over TCP/UDP to the central Monitoring Repository, which persists them through an SQL repository backend. Correlation engines and the Lemon CLI query the repository via SOAP; users on workstations view status pages built with RRDTool/PHP and served by apache over HTTP.]
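A node-side agent of this shape can be sketched as follows (the metric IDs and the plain-text record format are invented for illustration; the real Lemon agent and its transport protocol differ):

```python
import os
import time

# Sketch of a Lemon-style monitoring agent. Metric IDs and the
# 'node metric-id timestamp value' record format are invented here.
SENSORS = {
    20002: lambda: os.getloadavg()[0],         # 1-minute load average (Unix)
    20003: lambda: len(os.listdir("/tmp")),    # crude /tmp entry count
}

def sample(node, now=None):
    """One sampling round: one text record per configured metric."""
    ts = int(now if now is not None else time.time())
    return [f"{node} {mid} {ts} {fn()}" for mid, fn in sorted(SENSORS.items())]

# A real agent would loop forever, pushing each record to the
# repository over UDP/TCP:  while True: send(sample(...)); sleep(30)
print(sample("lxplus001", now=1130486400))
```

The agent side stays deliberately dumb (sample and forward); correlation and fault recovery live in separate modules, as described below.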
Deployment and Enhancements
- Smooth production running of the Monitoring Agent and the Oracle-based repository at the CERN CC
  - ~200 metrics, sampled at intervals from 30 s to 1 d; ~1 GB of data/day from ~1800 nodes
  - No aging-out of data; archived on MSS (CASTOR) instead
- Usage outside the CERN CC, collaborations
  - GridICE (>100 LCG sites), CMS-Online, IN2P3, others...
- Hardened and enhanced EDG software
  - Rich sensor set (from general to service-specific, e.g. IPMI/SMART for disk/tape)
  - Generic multi-purpose sensor by BARC
- Correlation and fault recovery
  - Lightweight local self-healing module (e.g. /tmp cleanup, restarting daemons)
  - Being re-engineered by BARC
- Security for sample transport (TCP and UDP) (BARC)
- Status and performance visualization pages, ...
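A rough back-of-envelope check of the quoted data volume (the per-sample record size and the averaging are assumptions, not measured values):

```python
# Back-of-envelope check on the quoted monitoring data volume
# (the 30-byte record size is an assumed average, not a measured value).
nodes, metrics = 1800, 200
bytes_per_day = 1e9                      # ~1 GB/day arriving at the repository
series = nodes * metrics                 # independent metric time series
per_series = bytes_per_day / series      # bytes per series per day (~2.8 kB)
record = 30                              # assumed bytes per stored sample
samples_per_day = per_series / record    # implied samples per series per day
print(round(per_series), round(samples_per_day))
```

Under these assumptions each series would carry on the order of 90-100 samples a day, i.e. an effective average interval of roughly 15 minutes, which is consistent with sampling rates spread between 30 s and 1 d.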
Monitoring the Fabric
Using a web-based status display:
- CC overview
- Clusters and nodes
- VOs
- Power
- Error trending
- Batch system
Next Steps…
- Service-based views (user/management perspective)
  - Synoptic view of which services are running, and how – appropriate for end users and managers
  - Needs to be built on top of Quattor and Lemon
  - Would require a separate service-definition DB
- Alarm system for operators
  - Allow 24/7 operators to receive, acknowledge, ignore, hide and process alarms received via Lemon
  - Integrated into the Lemon status pages
http://cern.ch/leaf
LEAF - LHC Era Automated Fabric
LEAF is a collection of workflows for high-level node hardware and state management on top of Quattor and LEMON:
- HMS (Hardware Management System):
  - Tracks systems through all physical steps in their lifecycle, e.g. installation, moves, vendor calls, retirement
  - Automatically issues install, retire, etc. requests to technicians
  - GUI to locate equipment physically
  - The HMS implementation is CERN-specific (based on Remedy workflows), but the concepts and design should be generic
- SMS (State Management System):
  - Automated handling (and tracking) of high-level configuration steps, e.g.:
    - Reconfigure and reboot all cluster nodes for a new kernel and/or a physical move
    - Drain and reconfigure nodes for diagnosis / repair operations
  - Issues all necessary (re)configuration commands via Quattor
  - Extensible framework – plug-ins for site-specific operations are possible
Use Case: Move a rack of machines
[Sequence diagram involving the Service Manager, technicians, HMS, SMS, the Quattor CDB, the network DB and the nodes:]
1. New location given to HMS
2. HMS sets the nodes to standby (SMS)
3. SMS updates the Quattor CDB
4. Nodes refresh their configuration
5. Nodes taken out of production (close queues and drain jobs, disable alarms)
6. Move requested from the technicians
7a/7b. Network DB and CDB updated
9. Technicians carry out the install work order
10. HMS sets the nodes back to production (SMS)
11. SMS updates the CDB
12. Nodes refresh their configuration
13. Nodes put back into production
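The standby/production transitions of steps 2 and 10 can be modeled as a small state machine (the node name, the state names and the CDB hook are illustrative, not SMS's actual interface):

```python
# Sketch of SMS-style state tracking (state and transition names illustrative).
ALLOWED = {
    ("production", "standby"),   # step 2: drain before the move
    ("standby", "production"),   # step 10: back into service after the move
}

class Node:
    def __init__(self, name: str):
        self.name, self.state = name, "production"
        self.log: list[str] = []

    def set_state(self, new: str) -> None:
        """Apply a transition, refusing anything outside the allowed set."""
        if (self.state, new) not in ALLOWED:
            raise ValueError(f"illegal transition {self.state} -> {new}")
        self.log.append(f"{self.name}: {self.state} -> {new}")
        self.state = new          # the real SMS would also update CDB here

n = Node("lxb1234")               # hypothetical node name
n.set_state("standby")            # step 2: close queues, drain jobs, disable alarms
n.set_state("production")         # step 10: after the physical move and refresh
print(n.log)
```

Encoding the allowed transitions explicitly is what lets a state management system refuse nonsensical requests (e.g. retiring a node that is still in production) instead of silently issuing them to Quattor.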
LEAF Deployment
- HMS in full production for all nodes in the CC
  - HMS heavily used during the CC node migration (~1500 nodes)
- SMS in production for all Quattor-managed nodes
- Current work:
  - More automation, and handling of other hardware types, for HMS
  - More service-specific SMS clients (e.g. tape and disk servers)
  - Developing an 'asset management' GUI (CCTracker) -> BARC
    - Multiple-select and drag & drop nodes to automatically initiate HMS moves and SMS operations
    - Interface to the LEMON GUI
Managing the Fabric
Using high-level workflows to visualize, locate and manage CC objects:
- Visualize the physical location of equipment
- Inspect equipment properties
- Initiate and track workflows on hardware and services, e.g. add/remove/retire operations, property updates, kernel and OS upgrades, etc.
Summary
- ELFms is deployed in production at CERN
  - Stabilized results from three years of development within EDG and LCG
  - Established technology – from prototype to production
  - Consistent full-lifecycle management and a high level of automation
  - Provides real added value for day-to-day operations
- Quattor and LEMON are generic software
  - Other projects and sites are getting involved
  - Site-specific workflows and "glue scripts" can be put on top for smooth integration with existing fabric environments
- ELFms = Quattor + Lemon + LEAF (HMS and SMS)

More information: http://cern.ch/elfms