
Fabric Management
Massimo Biasotto, Enrico Ferro – INFN LNL
M.Biasotto, CERN, 5 November 2001


Page 2: Legnaro CMS Farm Layout

[Diagram labels: FastEth switches (1 ... 8), each with computational nodes N1 ... N24; 32 – GigaEth 1000 BT switch; disk servers S1 ... S16; link to WAN]

Nx – Computational Node
Dual PIII – 1 GHz
512 MB
3x75 GB Eide disk + 1x20 GB for O.S.

Sx – Disk Server Node
Dual PIII – 1 GHz
Dual PCI (33/32 – 66/64)
512 MB
4x75 GB Eide Raid disks (exp up to 10)
1x20 GB disk O.S.

Computational nodes, 2001: 40 Nodes, 4000 SI95, 9 TB
Computational nodes, 2001-2-3: up to 190 Nodes
Disk servers, 2001: 11 Servers, 3.3 TB
To WAN: 34 Mbps 2001, 155 Mbps 2002
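As a quick consistency check, the 2001 totals match the per-node disk counts if only the 75 GB data disks are counted and 1 TB is taken as 1000 GB. Both are assumptions, not stated on the slide; the function name is illustrative.

```python
# Per-node data disks from the slide; the 20 GB O.S. disks are left out (assumption).
DISK_GB = 75


def farm_capacity_tb(nodes, disks_per_node, disk_gb=DISK_GB):
    """Total data-disk capacity in TB, taking 1 TB = 1000 GB."""
    return nodes * disks_per_node * disk_gb / 1000


compute_tb = farm_capacity_tb(nodes=40, disks_per_node=3)  # computational nodes
server_tb = farm_capacity_tb(nodes=11, disks_per_node=4)   # disk servers
print(compute_tb, server_tb)
```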

Page 3: Datagrid

Project structured in many “Work Packages”:
– WP1: Workload Management
– WP2: Data Management
– WP3: Monitoring Services
– WP4: Fabric Management
– WP5: Mass Storage Management
– WP6: Testbed
– WP7: Network
– WP8-10: Applications

3 year project (2001-2003).
Milestones: month 9 (Sept 2001), month 21 (Sept 2002), month 33 (Sept 2003)

Page 4: Overview

– Datagrid WP4 (Fabric Management) overview
– WP4 software architecture
– WP4 subsystems and components
– Installation and software management
– Current prototype: LCFG
– LCFG architecture
– LCFG configuration and examples

Page 5: WP4 overview

Partners: CERN, INFN (Italy), KIP (Germany), NIKHEF (Holland), PPARC (UK), ZIB (Germany)

WP4 website: http://hep-proj-grid-fabric.web.cern.ch/hep-proj-grid-fabric/

Aims to deliver a computing fabric comprising all the necessary tools to manage a centre providing Grid services on clusters of thousands of nodes

Page 6: WP4 structure

WP activity divided in 6 main ‘tasks’:
– Configuration management (CERN + PPARC)
– Resource management (ZIB)
– Installation & node management (CERN + INFN + PPARC)
– Monitoring (CERN + INFN)
– Fault tolerance (KIP)
– Gridification (NIKHEF)

Overall WP4 functionality structured into units called ‘subsystems’, corresponding to the above tasks

Page 7: Architecture overview

WP4 architectural design document (draft):
– http://hep-proj-grid-fabric.web.cern.ch/hep-proj-grid-fabric/architecture/eu/default.htm

Still work in progress: open issues that need further investigation

Functionalities classified into two main categories:
– User job control and management: handled by Gridification and Resource Management subsystems
– Automated system administration: handled by Configuration Mgmt, Installation Mgmt, Fabric Monitoring and Fault Tolerance subsystems

Page 8: Architecture overview

[Diagram labels: Grid User; Local User; WP4 subsystems (Fabric Gridification, Resource Management, Configuration Management, Installation & Node Mgmt, Monitoring & Fault Tolerance); other WPs (Resource Broker (WP1), Data Mgmt (WP2), Grid Info Services (WP3), Grid Data Storage (WP5)); Farm A (LSF), Farm B (PBS); (Mass storage, Disk pools)]

Fabric Gridification:
- interface between Grid-wide services and local fabric;
- provides local authentication, authorization and mapping of grid credentials.

Resource Management:
- provides transparent access to different cluster batch systems;
- enhanced capabilities (extended scheduling policies, advanced reservation, local accounting).

Configuration Management:
- provides a central storage and management of all fabric configuration information;
- central DB and set of protocols and APIs to store and retrieve information.

Installation & Node Mgmt:
- provides the tools to install and manage all software running on the fabric nodes;
- bootstrap services; software repositories; Node Management to install, upgrade, remove and configure software packages on the nodes.

Monitoring & Fault Tolerance:
- provides the tools for gathering and storing performance, functional and environmental changes for all fabric elements;
- central measurement repository provides health and status view of services and resources;
- fault tolerance correlation engines detect failures and trigger recovery actions.

Page 9: Resource Management diagram

Stores static and dynamic information describing the states of the RMS and its managed resources

Accepts job requests, verifies credentials and schedules the jobs

Assigns resources to incoming job requests, enhancing fabric batch systems capabilities (better load balancing, adapts to resource failures, considers maintenance tasks)

Proxies provide uniform interface to underlying batch systems (LSF, Condor, PBS)
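The proxy idea can be sketched as one common interface with a small adapter per batch system. This is only an illustration, not the WP4 design: the class and function names are hypothetical, and the commands shown (`bsub`, `qsub`, `condor_submit`) are the native submit tools of those systems reduced to their bare form; real invocations need more options, and `condor_submit` expects a submit description file rather than a job script.

```python
class BatchProxy:
    """Uniform interface; each subclass names the native submit tool."""

    submit_tool = None

    def submit_command(self, job_file):
        # Return the command line as a list; nothing is executed here.
        return [self.submit_tool, job_file]


class LSFProxy(BatchProxy):
    submit_tool = "bsub"


class PBSProxy(BatchProxy):
    submit_tool = "qsub"


class CondorProxy(BatchProxy):
    submit_tool = "condor_submit"


PROXIES = {"LSF": LSFProxy(), "PBS": PBSProxy(), "Condor": CondorProxy()}


def submit_command(batch_system, job_file):
    """Callers pick a farm's batch system by name and never see its tool."""
    return PROXIES[batch_system].submit_command(job_file)


print(submit_command("LSF", "job.sh"))
print(submit_command("PBS", "job.sh"))
```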

Page 10: Monitoring & Fault Tolerance diagram

[Diagram labels: Local node (MS, MSA, MR cache, FTCE, AD, FTA); Central repository (MR server, Data Base); Service master node (FTCE); Human operator host (MUI); arrows for Control flow and Data flow]

Monitoring Sensor Agent (MSA) - collects data from Monitoring Sensors and forwards them to the Measurement Repository

Measurement Repository (MR) - stores timestamped information; it consists of local caches on the nodes and a central repository server

Monitoring User Interface (MUI) - graphical interface to the Measurement Repository

Monitoring Sensor (MS) - performs measurement of one or several metrics

Fault Tolerance Correlation Engine (FTCE) - processes measurements of metrics stored in MR to detect failures and possibly decide recovery actions

Actuator Dispatcher (AD) - used by FTCE to dispatch Fault Tolerance Actuators; it consists of an agent controlling all actuators on a local node

Fault Tolerance Actuator (FTA) - executes automatic recovery actions
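The data and control flow among these components can be sketched in a few lines. Everything here is an assumption made for illustration: the metric name, the threshold rule format and the actuator name are invented, and the real FTCE rule language is not described on the slide.

```python
import time


class MeasurementRepository:
    """Stand-in for the local MR cache: a list of timestamped samples."""

    def __init__(self):
        self.samples = []

    def store(self, metric, value, ts=None):
        self.samples.append((time.time() if ts is None else ts, metric, value))

    def latest(self, metric):
        for _ts, name, value in reversed(self.samples):
            if name == metric:
                return value
        return None


def sensor_agent(sensors, repo):
    """MSA role: run every sensor and forward its reading to the repository."""
    for metric, read in sensors.items():
        repo.store(metric, read())


def correlation_engine(repo, rules, dispatch):
    """FTCE role: compare latest readings with limits, dispatch actuators."""
    fired = []
    for metric, (limit, actuator) in rules.items():
        value = repo.latest(metric)
        if value is not None and value > limit:
            dispatch(actuator)  # AD role: hand the action to an actuator
            fired.append(actuator)
    return fired


repo = MeasurementRepository()
sensor_agent({"tmp_usage_pct": lambda: 97}, repo)       # hypothetical sensor
actions = []
correlation_engine(repo, {"tmp_usage_pct": (90, "clean_tmp")}, actions.append)
print(actions)
```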

Page 11: Configuration Management diagram

[Diagram labels: Configuration Database (High Level Description, Low Level Description); Client Node (Cache Configuration Manager, API, Local Process)]

Configuration Database (CDB): stores configuration information and manages modification and retrieval access

Cache Configuration Manager: downloads node profiles from CDB and stores them locally

Page 12: Configuration DataBase

High Level Description:
All computing nodes of CMS Farm #3 use cmsserver1 as Application Server

Low Level Description:

cmsserver1 /etc/exports
/app cmsnode1, cmsnode2, ..

cmsnode1 /etc/fstab
cmsserver1:/app /app nfs..

cmsnode2 /etc/fstab
cmsserver1:/app /app nfs..

cmsnode3 /etc/fstab
cmsserver1:/app /app nfs..
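The step from the high-level statement to the per-node low-level entries can be sketched as a small expansion function. This mirrors the schematic lines on the slide, not the exact `exports`/`fstab` syntax; the mount options the slide elides are left out, and the function name is hypothetical.

```python
def low_level(app_server, export_dir, nodes):
    """Expand one high-level statement into {host: {file: line}}."""
    files = {
        app_server: {"/etc/exports": "%s %s" % (export_dir, ", ".join(nodes))}
    }
    for node in nodes:
        # Every computing node mounts the same directory from the server.
        files[node] = {
            "/etc/fstab": "%s:%s %s nfs" % (app_server, export_dir, export_dir)
        }
    return files


profiles = low_level("cmsserver1", "/app", ["cmsnode1", "cmsnode2", "cmsnode3"])
for host, entries in profiles.items():
    for path, line in entries.items():
        print(host, path, "->", line)
```

Changing the application server then means editing one high-level value and regenerating, instead of touching a file on every node.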

Page 13: Installation Management diagram

[Diagram labels: Fabric Node (NMA, SP’s (local), Applications); SR (SP’s); BS (System images); Configuration Management Subsystem; Monitoring & Fault Tol. (Actuator Dispatcher); Administrative Scripting Layer]

Data Flow: Configuration, SP’s, system images, monitoring

Control Flow: function calls

Node Management Agent (NMA) - manages installation, upgrade, removal and configuration of software packages

Software Repository (SR) - central fabric store for Software Packages (SP’s)

Bootstrap Service (BS) - service for initial installation of nodes

Page 14: Distributed design

Distributed design in the architecture, in order to ensure scalability:
– individual nodes as autonomous as possible
– local instances of almost every subsystem: operations performed locally where possible
– central steering for control and collective operations

[Diagram labels: Central Config DB; Monitoring Central Repository; on each node, a Config DB Local Cache and a Monitoring Local Repository]

Page 15: Scripting layer

All subsystems are tied together using a high level ‘scripting layer’:
– allows administrators to code and automate complex fabric-wide management operations
– coordination in execution of user jobs and administrative tasks on the nodes
– scripts can be executed by Fault Tolerance subsystem to automate corrective actions

All subsystems provide APIs to control their components

Subsystems keep their independence and internal coherence: the scripting layer only aims at connecting them for building high-level operations

Page 16: Maintenance tasks

Control function calls to NMA are known as ‘maintenance tasks’:
– non intrusive: can be executed without interfering with user jobs (e.g. cleanup of log files)
– intrusive: for example kernel upgrades or node reboots

Two basic node states from the administration point of view:
– production: node is running user jobs or user services (e.g. NFS server). Only non intrusive tasks can be executed
– maintenance: no user jobs or services. Both intrusive and non intrusive tasks can be executed

Usually a node is put into maintenance status only when it is idle, after draining the job queues or switching the services to another node. There can be exceptions, however, where the status change is forced immediately.
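The two states and the rule about intrusive tasks amount to a very small state machine. The sketch below is a toy model under stated assumptions: the class, its attributes and the `force` flag are hypothetical names, not part of the NMA interface.

```python
class Node:
    """Toy model of the two administrative node states."""

    def __init__(self, name):
        self.name = name
        self.state = "production"
        self.running_jobs = 0

    def can_run(self, intrusive):
        # Non intrusive tasks run in any state; intrusive ones need maintenance.
        return (not intrusive) or self.state == "maintenance"

    def enter_maintenance(self, force=False):
        # Normally wait until the node is idle; force=True models the exception.
        if self.running_jobs > 0 and not force:
            return False
        self.state = "maintenance"
        return True

    def run_task(self, task, intrusive):
        if not self.can_run(intrusive):
            raise RuntimeError("%s: %s needs maintenance state" % (self.name, task))
        return "%s: ran %s" % (self.name, task)


node = Node("cmsnode1")
node.running_jobs = 2
print(node.run_task("clean log files", intrusive=False))
print(node.enter_maintenance())   # False: jobs are still running
node.running_jobs = 0             # job queue drained
print(node.enter_maintenance())   # True
print(node.run_task("kernel upgrade", intrusive=True))
```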

Page 17: Installation & Software Mgmt Prototype

The current prototype is based on a software tool originally developed by the Computer Science Department of Edinburgh University: LCFG (Large Scale Linux Configuration)
http://www.dcs.ed.ac.uk/home/paul/publications/ALS2000/

Handles automated installation, configuration and management of machines

Basic features:
– automatic installation of O.S.
– installation/upgrade/removal of all (rpm-based) software packages
– centralized configuration and management of machines
– extensible to configure and manage custom application software

Page 18: LCFG diagram

[Diagram labels: Server (LCFG Config Files, Make XML Profile, XML Profile, Web Server); HTTP; Client nodes (rdxprof Read Profile, Local cache, ldxprof Load Profile, LCFG Objects: Profile Object, Generic Component)]

Abstract configuration parameters for all nodes stored in a central repository

A collection of agents read configuration parameters and either generate traditional config files or directly manipulate various services

Config files:

+inet.services telnet login ftp
+inet.allow telnet login ftp sshd
+inet.allow_telnet ALLOWED_NETWORKS
+inet.allow_login ALLOWED_NETWORKS
+inet.allow_ftp ALLOWED_NETWORKS
+inet.allow_sshd ALL
+inet.daemon_sshd yes
.....
+auth.users mickey
+auth.userhome_mickey /home/mickey
+auth.usershell_mickey /bin/tcsh

XML profiles:

<inet>
<allow cfg:template="allow_$ tag_$ daemon_$">
<allow_RECORD cfg:name="telnet">
<allow>192.168., 192.135.30.</allow>
</allow_RECORD>
.....
</auth>
<user_RECORD cfg:name="mickey">
<userhome>/home/MickeyMouseHome</userhome>
<usershell>/bin/tcsh</usershell>
</user_RECORD>

The Profile Object passes the profile to the inet and auth objects, which write the files:

inet: /etc/services, /etc/inetd.conf, /etc/hosts.allow
auth: /etc/shadow, /etc/group, /etc/passwd

/etc/hosts.allow
in.telnetd : 192.168., 192.135.30.
in.rlogind : 192.168., 192.135.30.
in.ftpd : 192.168., 192.135.30.
sshd : ALL

/etc/passwd
....
mickey:x:999:20::/home/Mickey:/bin/tcsh
....
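The translation from abstract `inet` resources to `/etc/hosts.allow` lines can be sketched as follows. This is not the code of the real inet object (LCFG objects are shell scripts); the service-to-daemon table and the value of `ALLOWED_NETWORKS` are read off the example above, and treating `ALLOWED_NETWORKS` as a macro substituted here is an assumption.

```python
ALLOWED_NETWORKS = "192.168., 192.135.30."
DAEMONS = {  # service name -> daemon name, as seen in the example output
    "telnet": "in.telnetd",
    "login": "in.rlogind",
    "ftp": "in.ftpd",
    "sshd": "sshd",
}

CONFIG = """\
+inet.services telnet login ftp
+inet.allow telnet login ftp sshd
+inet.allow_telnet ALLOWED_NETWORKS
+inet.allow_login ALLOWED_NETWORKS
+inet.allow_ftp ALLOWED_NETWORKS
+inet.allow_sshd ALL
"""


def parse_resources(text):
    """Turn '+object.resource value' lines into a dictionary."""
    resources = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("+"):
            key, _, value = line[1:].partition(" ")
            resources[key] = value.strip()
    return resources


def hosts_allow(resources):
    """Build one hosts.allow line per service listed in inet.allow."""
    lines = []
    for service in resources["inet.allow"].split():
        who = resources["inet.allow_" + service]
        if who == "ALLOWED_NETWORKS":
            who = ALLOWED_NETWORKS
        lines.append("%s : %s" % (DAEMONS[service], who))
    return lines


print("\n".join(hosts_allow(parse_resources(CONFIG))))
```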

Page 19: LCFG: future development

[Diagram, two panels.
Current Prototype: Server (LCFG Config Files, Make XML Profile, XML Profile, Web Server); HTTP; Client nodes (rdxprof Read Profile, Local cache, ldxprof Load Profile, LCFG Objects: Profile Object, Generic Component).
Future Evolution: Configuration Database; Web Server, XML Profile; HTTP; Cache Manager; Cache; API; LCFG Objects: Profile Object, Generic Component]

Page 20: LCFG configuration (I)

Most of the configuration data are common for a category of nodes (e.g. diskservers, computing nodes) and only a few are node-specific (e.g. hostname, IP-address)

Using the cpp preprocessor it is possible to build a hierarchical structure of config files containing directives like #define, #include, #ifdef, comments with /* */, etc...

The configuration of a typical LCFG node looks like this:

#define HOSTNAME pc239 /* Host specific definitions */
#include "site.h" /* Site specific definitions */
#include "linuxdef.h" /* Common linux resources */
#include "client.h" /* LCFG client specific resources */

Page 21: LCFG configuration (II)

From "site.h"
#define LCFGSRV grid01
#define URL_SERVER_CONFIG http://grid01/lcfg
#define LOCALDOMAIN .lnl.infn.it
#define DEFAULT_NAMESERVERS 192.135.30.245
[...]

From "linuxdef.h"
update.interfaces eth0
update.hostname_eth0 HOSTNAME
update.netmask_eth0 NETMASK
[...]

From "client.h"
update.disks hda
update.partitions_hda hda1 hda2
update.pdetails_hda1 free /
update.pdetails_hda2 128 swap
auth.users mickey
auth.usercomment_mickey Mickey Mouse
auth.userhome_mickey /home/Mickey
[...]
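How the host file and the included headers combine into one resource list can be illustrated with a toy preprocessor. It handles only `#define`, `#include "..."` and single-line `/* */` comments, which is far less than the real cpp does; the file contents are abbreviated from the slides and held in a dictionary rather than on disk.

```python
import re


def preprocess(name, files, macros=None):
    """Expand #define / #include in files[name]; a toy subset of cpp."""
    macros = {} if macros is None else macros
    out = []
    for raw in files[name].splitlines():
        line = re.sub(r"/\*.*?\*/", "", raw).strip()  # drop /* ... */ comments
        if not line:
            continue
        m = re.match(r"#define\s+(\w+)\s+(.*)", line)
        if m:
            macros[m.group(1)] = m.group(2)
            continue
        m = re.match(r'#include\s+"([^"]+)"', line)
        if m:
            out.extend(preprocess(m.group(1), files, macros))
            continue
        # Replace whole-word macro names with their values.
        out.append(re.sub(r"\w+", lambda t: macros.get(t.group(0), t.group(0)), line))
    return out


FILES = {
    "pc239": (
        "#define HOSTNAME pc239 /* Host specific definitions */\n"
        '#include "site.h" /* Site specific definitions */\n'
        '#include "linuxdef.h" /* Common linux resources */\n'
    ),
    "site.h": "#define LCFGSRV grid01\n#define LOCALDOMAIN .lnl.infn.it\n",
    "linuxdef.h": "update.interfaces eth0\nupdate.hostname_eth0 HOSTNAME\n",
}

resources = preprocess("pc239", FILES)
print("\n".join(resources))
```

The host-specific file stays one line long; everything shared lives in the headers, and `HOSTNAME` is filled in where the common file refers to it.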

Page 22: LCFG: configuration changes

Server-side: when the config files are modified, a tool (mkxprof) recreates the new xml profile for all the nodes affected by the changes
– this can be done manually or with a daemon periodically checking for config changes and calling mkxprof
– mkxprof can notify via UDP the nodes affected by the changes

Client-side: another tool (rdxprof) downloads the new profile from the server
– usually activated by an LCFG object at boot
– can be configured to work as:
  - daemon periodically polling the server
  - daemon waiting for notifications
  - started by cron at predefined times
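The polling mode on the client side can be sketched as "fetch, compare, act only on change". How rdxprof actually detects a changed profile is not stated on the slide; comparing a SHA-256 digest is an assumption, and the class and callback names are hypothetical. The fetch function is injected so the sketch runs without a server.

```python
import hashlib


class ProfileClient:
    """Polls for a profile and calls on_change only when its content changes."""

    def __init__(self, fetch, on_change):
        self.fetch = fetch          # callable returning the profile as bytes
        self.on_change = on_change  # callable receiving the new profile
        self.digest = None

    def poll(self):
        data = self.fetch()
        digest = hashlib.sha256(data).hexdigest()
        if digest == self.digest:
            return False            # unchanged: nothing to reconfigure
        self.digest = digest
        self.on_change(data)
        return True


# Simulated server answers for three successive polls.
answers = iter([b"<profile v1/>", b"<profile v1/>", b"<profile v2/>"])
applied = []
client = ProfileClient(lambda: next(answers), applied.append)
results = [client.poll() for _ in range(3)]
print(results, applied)
```

The same `poll` method would serve all three modes: called in a loop by a daemon, on receipt of a notification, or once per cron run.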

Page 23: LCFG: what’s an object?

It's a simple shell script (but in future it will probably be a perl script)

Each object provides a number of “methods” (start, stop, reconfig, query, ...) which are invoked at appropriate times

A simple and typical object behaviour:
– Started by profile object when notified of a configuration change
– Loads its configuration from the cache
– Configures the appropriate services, either translating config parameters into a traditional config file or directly controlling the service (e.g. starting a daemon with command-line parameters derived from configuration).

Page 24: LCFG: custom objects

LCFG provides the objects to manage all the standard services of a machine: inet, syslog, auth, nfs, cron, ...

Admins can build new custom objects to configure and manage their own applications:
– define your custom “resources” (configuration parameters) to be added to the node profile
– include in your script the object “generic”, which contains the definition of common functions used by all objects (config loading, log, output, ...)
– overwrite the standard methods (start, stop, reconfig, ...) with your custom code
– for simple objects usually just a few lines of code
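The pattern of including “generic” and overwriting a few methods is sketched below in Python purely to show the shape; real LCFG objects are shell scripts, and every name here (the classes, the `myapp` resources, the generated config text) is hypothetical.

```python
class GenericObject:
    """Stand-in for the shared 'generic' helpers (config loading, logging)."""

    def __init__(self, name, cache):
        self.name = name
        self.cache = cache  # resource name -> value, as read from the local cache
        self.log = []

    def resources(self):
        # Keep only this object's resources and strip the 'name.' prefix.
        prefix = self.name + "."
        return {k[len(prefix):]: v for k, v in self.cache.items() if k.startswith(prefix)}

    def start(self):
        self.log.append("start")

    def stop(self):
        self.log.append("stop")

    def reconfig(self):
        self.stop()
        self.start()


class MyAppObject(GenericObject):
    """Custom object: only start() is overwritten."""

    def start(self):
        res = self.resources()
        # Translate abstract resources into a traditional config file body.
        self.config_text = "port = %s\nworkdir = %s\n" % (res["port"], res["workdir"])
        super().start()


cache = {"myapp.port": "8080", "myapp.workdir": "/data/myapp", "inet.services": "ftp"}
obj = MyAppObject("myapp", cache)
obj.reconfig()
print(obj.log)
print(obj.config_text)
```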

Page 25: LCFG: Software Packages Management

Currently it is RedHat-specific: heavily dependent on the RPM tool

The software to install is defined in a file on the server containing a list of RPM packages (currently not yet merged in the XML profile)

Whenever the list is modified, the required RPM packages are automatically installed/upgraded/removed by a specific LCFG object (updaterpms), which is started at boot or when the node is notified of the change

Page 26: LCFG: node installation procedure

[Diagram labels: Client Node; DHCP Server (IP address, Config URL); NFS Server (Root Image with LCFG environment); LCFG Server (LCFG Config Files); WEB Server (XML Profiles); Software Repository (Software Packages)]

First boot via floppy or via network

Initialization script starts

Root Image complete with LCFG environment mounted via NFS

Load minimal config data via DHCP: IP Address, Gateway, LCFG Config URL

Load complete configuration via HTTP

Start object “install”:
– disk partitioning, network,...
– installation of required packages
– copy of LCFG configuration
– reboot

After reboot LCFG objects complete the node configuration

Page 27: LCFG: summary

Pros:
– In Edinburgh it has been used for years in a complex environment, managing hundreds of nodes
– Supports the complete installation and management of all the software (both O.S. and applications)
– Extremely flexible and easy to customize

Cons:
– Complex: steep learning curve
– Prototype: the evolution of this tool is not clear yet
– Lack of user-friendly tools for the creation and management of configuration files: errors can be very dangerous!

Page 28: Future plans

Future evolution not clearly defined: it will depend also on results of forthcoming tests (1st Datagrid milestone)

Integration of current prototype with Configuration Management components
– Config Cache Manager and API released as prototypes but not yet integrated with LCFG

Configuration DataBase
– complete definition of node profiles
– user-friendly tools to access and modify config information

Development of still missing objects
– system services (AFS, PAM, ...)
– fabric software (grid sw, globus, batch systems, ...)
– application software (CMS, Atlas, ...) in collaboration with people from experiments