Grid Operations – Keeping the Grid Running EB-TB Joint Meeting John Gordon 13 th May 2004

Mar 31, 2015


Grid Operations – Keeping the Grid Running

EB-TB Joint Meeting

John Gordon

13th May 2004


Production Service Grids

• CCLRC is involved in Grid Operations for
– LCG
– GridPP
– NGS
– CCLRC
– EGEE

• This means different things for different grids


UK GOC

• Core of GOC built around experience in deploying and running National Grid Service (NGS)
– Support service
– Help Desk/call centre?

• Important to coordinate and integrate this with deployment and operations work in EGEE, LCG and similar projects.
– EGEE – low level services, CA, GOC, CERT...

• Dedicated deployment and operations management will be a key component

• Develop relationship to ETF(o), ETFp/NGS, HPC, and large campus and project focused grids, which are not under the direct control of the GOC


The LCG GOC Vision

GOC Processes and Activities

– Coordinating Grid Operations

– Defining Service Level Parameters

– Monitoring Service Performance Levels

– First-Level Fault Analysis

– Interacting with Local Support Groups

– Coordinating Security Activities

– Operations Development

– Grid Accounting


LCG Wider Picture

• In LCG, GOC sits alongside

– Deployment Team – who roll out the middleware

– Certification & Testing team

– User Support Centre

– Experiment Support – for the applications


LCG GOC Monitoring

• Within the scope of LCG we are responsible for monitoring how the grid is running – who is up, who is down, and why

• Identifying Problems, Contacting the Right People, Suggesting Actions

• Providing scalable solutions to allow other people to monitor resources

• Managing site Information – definitive source of information

• Accounting – Aggregate Job Throughput (per Site, per VO)

• Established at CCLRC (RAL)

• Status of LCG2 Grid here: http://goc.grid-support.ac.uk/


Overview

GOC Proposal envisaged three Phases

– Phase 1 Jul 03 – Oct 03

– Phase 2 Nov 03 – May 04

– Phase 3 Jun 04 – Jun 05

• GOC Vision

• What was planned in Phase 1 and its current status

• What is planned for Phase 2


The Vision

• GOC Processes and Activities

– Coordinating Grid Operations

– Defining Service Level Parameters

– Monitoring Service Performance Levels

– First-Level Fault Analysis

– Interacting with Local Support Groups

– Coordinating Security Activities

– Operations Development


Phase 1 (Jun 03 – Oct 03)

Taken from Proposal Jun 2003

• a) Set up an initial monitoring centre - Done

– Steering Group established

– LCG-Rollout list installed

– GOC website set up

– Variety of Monitoring

– SLA tests developed and running for CE and RB


Phase 1

• b) Draft Security Policy and Procedures - Done

– Drafted with the LCG Security Group

• Approved by GDB in October

• Will be submitted to SC2 for Adoption

– Three GOC-related supporting Annexes in preparation

• Service Level Agreement Guide - drafted

• Procedures for Resource Admins - partly drafted

• Procedure for site self-audit - in outline


Phase 1

• c) Define Service Level Parameters – Partly Done

– Schedule, Availability, Reliability all clear and defined
• Schedule: the published periods of downtime for upgrading etc
• Availability: the proportion of actual up-time to scheduled up-time
• Reliability: the mean time to failure

– Performance is service-specific; ideas under discussion
• needs experience with real users before deciding what is important

– Service Level Agreement
• The publication by the site of the targeted (designed) service level parameters for an LCG service in a prescribed format will comprise the SLA for that service
• The GOC will monitor and publish alongside the actual achieved values of the same parameters
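
As a rough illustration of the two numeric parameters defined on this slide, the sketch below computes availability (actual up-time over scheduled up-time) and a simple mean time to failure. The record layout, field names and sample figures are assumptions made for the example and are not taken from the GOC tools.

```python
from dataclasses import dataclass


@dataclass
class ServiceRecord:
    """Observed figures for one service over a reporting period (hypothetical layout)."""
    scheduled_uptime_hours: float  # period length minus published downtime
    actual_uptime_hours: float     # time the service was really up
    failures: int                  # number of unscheduled failures observed


def availability(rec: ServiceRecord) -> float:
    """Proportion of actual up-time to scheduled up-time."""
    if rec.scheduled_uptime_hours <= 0:
        raise ValueError("scheduled up-time must be positive")
    return rec.actual_uptime_hours / rec.scheduled_uptime_hours


def mean_time_to_failure(rec: ServiceRecord) -> float:
    """Average up-time per failure; infinite if nothing failed."""
    if rec.failures == 0:
        return float("inf")
    return rec.actual_uptime_hours / rec.failures


# Made-up example: a 720 h month with 12 h of published downtime and 3 failures.
ce = ServiceRecord(scheduled_uptime_hours=708.0, actual_uptime_hours=690.0, failures=3)
print(f"availability = {availability(ce):.3f}")
print(f"MTTF = {mean_time_to_failure(ce):.1f} h")
```

Publishing the targeted values next to the achieved ones, as the slide describes, would then amount to reporting both figures per service in the same format.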


Phase 1

• d) Establish a Monitoring Regime – Done (but further development is ongoing)

– SLA Monitoring

• CE and RB availability and reliability are being crudely monitored now

• Reports of significant failures sent to Rollout List

– Use and Development of MapCenter

– Use and Development of GppMon

– GridICE


Phase 1

• e) Select tools for use and evaluation in Phase 2 - Done

– As Phase 1
• GppMon (extended to add history)
• MapCenter (extended to accommodate SLA tests)
• GridICE (run server for LCG2)

– plus MonALISA
• needs local sensing agents

– plus network monitoring tools from EDG WP7
• needs local agents
• needs R-GMA


Phase 1

• In addition to the work envisaged in the Proposal for Phase 1, RAL is acting as an operational GOC by monitoring LCG sites from the moment they install the LCG software.
– All CEs are tested every 10 mins with an authentication test
– All RBs are tested every 10 mins with a job-list-match test
– Network connectivity is tested every 10 mins from RAL to every host
– Port accessibility is tested to every externally accessible service every 10 mins
– A trivial job is submitted to every CE every hour via Globus and via the CERN RB
– Logs are examined and analysed several times a week
– Significant failings or problems are reported to the LCG-Rollout list
– Several problems have been uncovered in both the monitors and in various sites
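
The port accessibility test listed above can be pictured as a plain TCP connection attempt. The sketch below is only an illustration of that idea, not the GOC's script: the function names and the five-second timeout are assumptions, and the ten-minute scheduling would be left to something like cron.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_services(services):
    """Probe each (host, port) pair once and return the pairs that failed."""
    return [(host, port) for host, port in services if not port_is_open(host, port)]
```

A caller would pass the list of externally accessible services for each registered site and report whatever `check_services` returns.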


Plans for Phase 2 (Nov 03 – May 04)

• a) Set up a second monitoring centre

– Eventually there should be 2 more, one in the East and one in the West to provide 24 hour cover, and to provide regional coordination of operational issues like alerts and SLAs

– Taipei have taken the packaged monitoring and installed it

– Now sharing monitoring duties

– Discussions with TRIUMF as third


Plans for Phase 2

• b) Establish Grid operations and security coordination regime in consultation with
– LCG Security Group
– Local Security Officers
– Local Support Groups
– LCG User Support Centre (GGUS)

• to
– promote the Security Policy and associated documents
– agree and establish common operational practices, principally the way in which SLAs and monitoring will work
– agree a fault analysis and alerting mechanism
– agree an incident response mechanism


Plans for Phase 2

• c) Establish a simple change control regime

– question whether or to what degree 'control' is appropriate

– as a minimum ensure information about recent and prospective changes is published to the community

– establish whatever mechanism is agreed in coordination with local support groups

– the minimum in outline would include:

• the schedule of service down time (part of SLA)

• the schedule and nature of proposed changes

• site would publish information via GOC web site


Plans for Phase 2

• d) Monitoring service levels

– Investigate using EDG WP7 network monitoring tools

• uses R-GMA

– Install tools to monitor and detect deviations from SLA

– Deploy remote agents - include in software distributions?

– Automatic alert mechanisms for operations staff

– Set up mechanisms to notify local support of problems


Monitoring Overview

Why We Monitor
• Keep systems up and running
• Notice failures; grid-wide services MDS, RBs
• Knowing what services a site should be running
– no point raising an alert if the site isn’t meant to run it!
– definition of services and which sites run them (SLA)

What Tools Do We Use
• Job Submission; GridICE; Nagios
• How – Database
• Developments Planned – Nagios

3 Stage Plan over next 12 months
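
The point about not raising an alert for a service a site is not meant to run can be shown with a small filter. This is a sketch under assumptions: the site names, the `EXPECTED` mapping and the function name are hypothetical, and in practice the mapping would come from the site's registered services.

```python
# Services each site has declared it runs (hypothetical sites and data).
EXPECTED = {
    "site-a": {"CE", "SE", "BDII"},
    "site-b": {"CE", "RB"},
}


def alerts_to_raise(failed_checks, expected=EXPECTED):
    """Keep only failures for services the site is meant to run."""
    return [
        (site, service)
        for site, service in failed_checks
        if service in expected.get(site, set())
    ]


failed = [("site-a", "RB"), ("site-a", "SE"), ("site-b", "RB"), ("site-c", "CE")]
print(alerts_to_raise(failed))
```

Here the failed RB check at `site-a` and the check against the unregistered `site-c` are dropped, because neither appears in the declared services.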


Monitoring Services

• There are many frameworks which can be used to monitor distributed environments
– MAPCENTER http://mapcenter.in2p3.fr/
– GPPMON http://goc.grid-support.ac.uk/
– GRIDICE http://edt002.cnaf.infn.it:50080/gridice/
– NAGIOS http://www.nagios.org/
– MONALISA http://monalisa.cacr.caltech.edu/

• Example: MapCenter, 30 sites ~ 500 lines in config file (static version)

• Example: Nagios, 30 sites, 12 individual config files with dependencies

• Developed tools to configure these services (NAGIOS, MAPCENTER and GPPMON) to make the job easier
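
To give a feel for why generated configuration helps at this scale, the sketch below renders Nagios host definitions from a list of node records. It is an illustration only: the record fields, hostnames and addresses are invented, and `generic-host` is assumed to be a host template already defined in the local Nagios configuration.

```python
# Nagios object-definition syntax; "generic-host" is an assumed local template name.
HOST_TEMPLATE = """define host{{
    use         generic-host
    host_name   {hostname}
    alias       {site} {node_type}
    address     {ip}
}}
"""


def render_hosts(nodes):
    """Render one Nagios host definition per node record."""
    return "\n".join(HOST_TEMPLATE.format(**node) for node in nodes)


# Hypothetical node records, as they might be read from a site database.
nodes = [
    {"site": "SITE-A", "node_type": "CE", "hostname": "ce.site-a.example.org", "ip": "192.0.2.10"},
    {"site": "SITE-A", "node_type": "SE", "hostname": "se.site-a.example.org", "ip": "192.0.2.11"},
]
print(render_hosts(nodes))
```

With around 30 sites and several node types each, regenerating these files from one source of site information avoids editing them by hand.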


LCG2 CORE SITES Status: 12th May 2004 10.20

~30 SITES

http://goc.grid-support.ac.uk/


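GOC Job Submission Flow Diagram

[Diagram: data flow for test-job submission through a Resource Broker. Labels recovered from the figure:]
• GOC (UI): build list of CE, RB resources (SQL query to SITE DB)
• Create JOB script RB.CE
• edg-job-submit to RB, selecting the CE with Other.GlueCEUniqueID
• RB, CE, WN
• Sent acknowledge: wget http://goc_ui/ack.cgi?RB.CE
• GOC (UI): received acknowledgement

Dave Kant
Data Flow Diagram: a graphical means of presenting, describing or analyzing a process.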

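GOC Job Submission Flow Diagram

[Diagram: data flow for test-job submission directly through Globus, with steps numbered 1 to 5 in the figure. Labels recovered from the figure:]
• GOC (UI): build list of CE, RB resources (SQL query to SITE DB)
• Create JOB script GLOBUS.CE
• globus-job-run to CE
• Sent acknowledge: wget http://goc_ui/ack.cgi?GLOBUS.CE
• GOC (UI): received acknowledgement

Dave Kant
Data Flow Diagram: a graphical means of presenting, describing or analyzing a process.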
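
The acknowledgement step in the two flow diagrams can be sketched as follows: each test job carries a tag (`RB.CE` or `GLOBUS.CE` in the figures) and calls back to the GOC when it runs, so any tag that never calls back points at a problem on that route. The class and function names, the sample host tags and the wget options chosen here are assumptions for illustration; `goc_ui` is the placeholder host shown in the diagrams.

```python
from itertools import product

ACK_URL = "http://goc_ui/ack.cgi"  # placeholder host, as written in the diagrams


def job_script(route: str, ce: str) -> str:
    """Shell job body that reports back to the GOC when it runs on a worker node."""
    return f"#!/bin/sh\nwget -q -O /dev/null '{ACK_URL}?{route}.{ce}'\n"


class AckTracker:
    """Records which submitted test jobs have called back (hypothetical helper)."""

    def __init__(self):
        self.pending = set()
        self.received = set()

    def sent(self, route, ce):
        self.pending.add(f"{route}.{ce}")

    def acknowledge(self, tag):
        """Called by the ack handler with the query string, e.g. 'GLOBUS.ce01'."""
        if tag in self.pending:
            self.pending.remove(tag)
            self.received.add(tag)
            return True
        return False


tracker = AckTracker()
for route, ce in product(["GLOBUS", "rb01"], ["ce01", "ce02"]):
    tracker.sent(route, ce)
tracker.acknowledge("GLOBUS.ce01")
print(sorted(tracker.pending))  # routes that have not reported back yet
```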

GOC Features – Nagios Monitoring

Nagios is a powerful monitoring service that supports notifications, and the execution of remote agents to correct problems when faults are discovered.

• Advantages => proactively monitor grid (NRPE daemon)

• Automatic Configuration of Nagios based on Database

• Developed a set of plugins which focus on service behaviour and data consistency
– Do RBs find resources?
– Do site GIISs publish the correct hostname?
– Is the site running the latest stable software release?
– Does the Gatekeeper authentication service work?
– Are the host certificates valid, e.g. issued by a trusted CA?
– Are essential services running, e.g. GridFTP?

• Further plugins are being developed (e.g. certification)
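
A plugin in this style reduces to a function that maps an observation to one of the standard Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL) plus a one-line message. The sketch below does this for a certificate lifetime check; the function name and the 30-day and 7-day thresholds are assumptions, and reading the expiry date from the certificate itself is left out.

```python
from datetime import datetime, timedelta, timezone

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_cert_lifetime(not_after, now=None, warn_days=30, crit_days=7):
    """Return (exit_code, message) for a certificate expiring at not_after."""
    now = now or datetime.now(timezone.utc)
    remaining = not_after - now
    days = remaining.days
    if remaining <= timedelta(0):
        return CRITICAL, "CRITICAL - certificate has expired"
    if remaining < timedelta(days=crit_days):
        return CRITICAL, f"CRITICAL - certificate expires in {days} days"
    if remaining < timedelta(days=warn_days):
        return WARNING, f"WARNING - certificate expires in {days} days"
    return OK, f"OK - certificate valid for {days} more days"
```

A command-line wrapper would print the message and exit with the returned code, which is how Nagios reads a plugin's result.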


Nagios Screen Shot

Service Summary for Nodes:

Certificate Lifetime Check , GridFTP , GRAM Authentication

Site Attributes via GIIS (siteName, Tag, …)

HOST PLUGIN STATUS STATUS INFORMATION


http://grid-ice.esc.rl.ac.uk/gridice


Nagios Screen Shots LCG-1

Host and Service Summary tables for BDII nodes


GOC Site Database

• Develop and maintain a database to hold Site Information

• Contact Lists, Nodes, IP, URLs, Scheduled Maintenance

• Each Site has its own Administration Page where Access is Controlled through the use of X509 certificates. (GridSite)

• Monitoring Scripts read information in database and run a set of customised tools to monitor the infrastructure

• To be included in the monitoring a site must register its resources (CE,SE,RB,RC,RLS,MDS,RGMA,BDII,..)
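
A minimal sketch of how monitoring scripts might read registered resources from such a database is shown below. Everything in it is an assumption made for illustration: the table and column names are invented, the data is made up, and an in-memory SQLite database stands in for the MySQL server so the example runs on its own.

```python
import sqlite3

# In-memory SQLite stands in for the real database; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE site (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE node (
    id        INTEGER PRIMARY KEY,
    site_id   INTEGER NOT NULL REFERENCES site(id),
    node_type TEXT NOT NULL,      -- CE, SE, RB, BDII, ...
    hostname  TEXT NOT NULL,
    monitored INTEGER NOT NULL DEFAULT 1
);
""")
conn.execute("INSERT INTO site (id, name) VALUES (1, 'SITE-A')")
conn.executemany(
    "INSERT INTO node (site_id, node_type, hostname, monitored) VALUES (?, ?, ?, ?)",
    [
        (1, "CE", "ce.site-a.example.org", 1),
        (1, "SE", "se.site-a.example.org", 1),
        (1, "RB", "rb.site-a.example.org", 0),
    ],
)


def nodes_to_monitor(conn, node_type):
    """Hostnames of registered, monitored nodes of the given type."""
    rows = conn.execute(
        "SELECT n.hostname FROM node n JOIN site s ON s.id = n.site_id "
        "WHERE n.node_type = ? AND n.monitored = 1 ORDER BY n.hostname",
        (node_type,),
    )
    return [row[0] for row in rows]


print(nodes_to_monitor(conn, "CE"))
```

A node that has not been registered, or is flagged as unmonitored, never reaches the monitoring tools, which matches the registration requirement on this slide.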


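GOC

[Diagram: GOC site database architecture. Labels recovered from the figure:]
• GOC SERVER: GridSite, MySQL
• Resource Centre (RC): Resources & Site Information (EDG, LCG-1, LCG-2, …); nodes ce, se, bdii, rb
• RC to GOC over https: Secure Database Management via HTTPS / X.509 (People, Contact Information, Resources, Scheduled Maintenance)
• Monitoring to GOC via SQL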


People: Who do we notify when there are problems

EXAMPLE: RAL Site


Node Information (Type, Hostname, IP Address, Group)

EXAMPLE: RAL Site


Fault Diagnosis

• Monitoring is currently checked every day
– and a report is sent to the LCG-ROLLOUT mailing list

• Further diagnosis is done by the GOC on problem sites using additional tools
– and possible causes are suggested

• Additional monitoring is developed in response to new problems
– e.g. certificate lifetimes


LCG1 CERT Status: 27 Feb 2004


Distributing GOC Software

Packaging Monitoring Tools

• Provide ROCs with a standard set of tools to proactively monitor resources

• 2nd Prototype GOC established in Taipei (GMT+8 hours)

[Diagram: GOC centres at CLRC and TW. Labels recovered from the figure: GOC GridSite MySQL; TOOLS; SITE CONFIG]
• Remote query to collect a list of resources; local query if service not available
• Monitor resources via job submission


Monitoring Developments

Provide ROCs with a package to monitor the resources in the region
• Tailored Monitoring
• ROCs can upload their own maps
• GUI to automate site locations on the map

Hierarchical view of Resources
• Example: GridPP federated into 4 virtual T2 centres

[Diagram: hierarchy of resources. Labels recovered from the figure: EGEE; France, UK/I, S.E.E; GridPP; LondonT2, ScotGrid; IMPERIAL, QMUL, Edinburgh]
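
One way to picture the hierarchical view is as a nested mapping that can be cut at any level to list the sites underneath. The tree below is a sketch: the nesting is an assumption based on the labels in the figure, only one branch is filled in, and the helper names are hypothetical.

```python
# Illustrative hierarchy; the grouping is assumed from the figure's labels.
TREE = {
    "EGEE": {
        "UK/I": {
            "GridPP": {
                "LondonT2": {"IMPERIAL": {}, "QMUL": {}},
                "ScotGrid": {"Edinburgh": {}},
            }
        }
    }
}


def leaves(tree):
    """Yield the names of all leaf nodes (sites) under a tree."""
    for name, children in tree.items():
        if children:
            yield from leaves(children)
        else:
            yield name


def subtree(tree, target):
    """Return the children of the node named target, or None if it is absent."""
    for name, children in tree.items():
        if name == target:
            return children
        found = subtree(children, target)
        if found is not None:
            return found
    return None


print(sorted(leaves(subtree(TREE, "LondonT2"))))
```

The same two helpers give a regional view (`subtree(TREE, "UK/I")`) or a single federated centre, which is the kind of tailored view a ROC would want.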


LCG Accounting Overview

We have an accounting solution.

The Accounting is provided by RGMA

At each site, log-file data is processed from different sources and published into a local database.

[Diagram: accounting data sources at a site. Labels recovered from the figure:]
• CE: PBS/LSF Jobmanager Log
• GateKeeper: listens on port 2119, GRAM Authentication
• GIIS: LDAP Information Server
• MON: RGMA Database


LCG Accounting – How it Works

GOC provides an interface to produce accounting plots “on-demand”

Total Number of Jobs per VO per Site (ok)

Total Number of Jobs per VO aggregated over all sites (to be done)

Tailor plots according to the requirements of the user community

Taipei Statistics Feb/Mar

~ 1000 Alice Jobs
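
The two aggregations named on this slide (jobs per VO per site, and jobs per VO over all sites) can be sketched with simple counters. The job records and site names below are made up for the example, and this is not the R-GMA based implementation the slides refer to.

```python
from collections import Counter

# One (site, VO) pair per completed job; the values are invented.
jobs = [
    ("SITE-A", "alice"), ("SITE-A", "alice"), ("SITE-A", "atlas"),
    ("SITE-B", "alice"), ("SITE-B", "cms"),
]

per_site_vo = Counter(jobs)                    # jobs per VO per site
per_vo = Counter(vo for _site, vo in jobs)     # jobs per VO over all sites

print(per_site_vo[("SITE-A", "alice")])
print(per_vo["alice"])
```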


LCG Accounting

CNAF Statistics March

~ 10,000 Alice Jobs

RAL Statistics March

~ 6,300 Alice Jobs


EGEE - Consortia

10 Consortia (incl. GEANT/TERENA/DANTE) 70 Partners

UK e-Science:PPARC + Core Programme

USA

Enabling Grids for E-science for Europe Everyone


EGEE – SA1

9-10 ROCs, 4 CICs

cf 3 worldwide in LCG

RAL proposes to extend LCG GOC monitoring to ROCs


Summary

• A Grid Operations Centre involves many roles
– Security, agreements, monitoring, accounting, support

• RAL has tackled all of these to different degrees
– Still developing

• Share work with other grids
– NGS, EGEE

• Biggest problem is problem and issue tracking