Operating the LCG and EGEE Production Grid for HEP
Ian Bird, IT Department, CERN
LCG Deployment Area Manager & EGEE Operations Manager
CHEP'04, 28 September 2004
EGEE is a project funded by the European Union under contract IST-2003-508833
LCG Operations in 2004
Integrate a set of middleware and coordinate and support its deployment to the regional centres
Provide operational services to enable running as a production-quality service
Provide assistance to the experiments in integrating their software and deploying in LCG; Provide direct user support
Deployment Goals for LCG-2
Production service for the Data Challenges in 2004
Experience in close collaboration between the Regional Centres
Learn how to maintain and operate a global grid
Focus on building a production-quality service
Understand how LCG can be integrated into the sites' physics computing services
Set up the EGEE project and migrate the existing structure towards the EGEE structure
By design, the LCG and EGEE services and operations teams are the same
LCG – from certification to production
Some history:
March 2003: LCG-0 – existing middleware, waiting for the EDG-2 release
September 2003: LCG-1 – 3 months late -> reduced functionality; extensive certification process -> improved stability (RB, Information System); 32 sites integrated, ~300 CPUs; first use for production
December 2003: LCG-2 – full set of functionality for the DCs, first MSS integration; deployed in January to 8 core sites; DCs started in February -> testing in production; large sites integrate resources into LCG (MSS and farms); introduced a pre-production service for the experiments; alternative packaging (tool-based and generic installation guides)
May 2004 -> now: monthly incremental releases – not all releases are distributed to external sites; improved services, functionality, stability and packaging step by step; timely response to experiences from the data challenges
The formal certification process has been invaluable
The process to stabilise existing middleware and put it in production is expensive: testbeds, people, time
Now have monthly incremental releases
One central RMC and LRC for each VO, located at CERN, with an Oracle backend
Several bits from other WPs (Config objects, InfoProviders, Packaging…)
GLUE 1.1 (information schema) + a few essential LCG extensions
MDS-based Information System with significant LCG enhancements (replacements, simplifications (see poster))
Mechanism for application (experiment) software distribution
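To make the information flow concrete: the MDS-based Information System publishes GLUE 1.1 attributes as LDAP entries which consumers then parse. The sketch below is illustrative only – the attribute names (GlueCEUniqueID, GlueCEStateFreeCPUs) are genuine GLUE 1.1 names, but the LDIF fragment, hostname, and values are made up, and the parser is a toy, not the real information providers.

```python
# Minimal sketch of consuming GLUE 1.1 attributes as published by the
# MDS-based Information System.  The LDIF fragment is a made-up example;
# the attribute names are real GLUE 1.1, the values are invented.
ldif = """\
dn: GlueCEUniqueID=ce.example.org:2119/jobmanager-pbs-long,mds-vo-name=local,o=grid
GlueCEUniqueID: ce.example.org:2119/jobmanager-pbs-long
GlueCEStateFreeCPUs: 17
GlueCEStateWaitingJobs: 3
"""

def parse_ldif(text):
    # Collect attribute: value pairs, skipping the distinguished-name line.
    entry = {}
    for line in text.splitlines():
        if ": " in line and not line.startswith("dn:"):
            key, value = line.split(": ", 1)
            entry[key] = value
    return entry

ce = parse_ldif(ldif)
print(ce["GlueCEStateFreeCPUs"])  # -> 17
```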
Almost all components have gone through some re-engineering: robustness, scalability, efficiency, adaptation to local fabrics
The services are now quite stable, and performance and scalability have been significantly improved (within the limits of the current architecture). Data management is still far from perfect.
Experiences in deployment
LCG now covers many sites (>70) – both large and small
Large sites – existing infrastructures – need to add grid interfaces etc.
Small sites want a completely packaged, push-button, out-of-the-box installation (including batch system, etc.)
Satisfying both simultaneously is hard – requires very flexible packaging, installation, and configuration tools and procedures; a lot of effort had to be invested in this area
There are many problems – but in the end we are quite successful:
The system is reasonably stable
The system is used in production
The system is reasonably easy to install – now at ~80 sites
We now have a basis on which to incrementally build essential functionality, and from which to measure improvements
This infrastructure now also forms the EGEE production service
Operations services for LCG – 2004
Deployment and operational support: hierarchical model
CERN acts as 1st-level support for the Tier 1 centres
Tier 1 centres provide 1st-level support for associated Tier 2s
“Tier 1 sites” -> “Primary sites”
Grid Operations Centres (GOC)
Provide operational monitoring, troubleshooting, coordination of incident response, etc.
RAL (UK) led a sub-project to prototype a GOC
Operations support from the CERN team, the GOC, and Taipei, with many individual contributions on the mailing list
User support: central model
FZK provides the user support portal – a problem-tracking system, web-based and available to all LCG participants
Experiments provide triage of problems
The CERN team provides in-depth support and support for the integration of experiment software
Large-scale production effort of the LHC experiments:
test and validate the computing models
produce needed simulated data
test the experiments' production frameworks and software
test the provided grid middleware
test the services provided by LCG-2
All experiments used LCG-2 for all or part of their productions
Phase I: 120k Pb+Pb events produced in 56k jobs; 1.3 million files (26 TB) in Castor@CERN; total CPU: 285 MSI2k hours (a 2.8 GHz PC working 35 years); ~25% produced on LCG-2
Phase II (underway): 1 million jobs, 10 TB produced, 200 TB transferred, 500 MSI2k hours CPU; ~15% on LCG-2
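The "2.8 GHz PC working 35 years" figure can be sanity-checked with back-of-the-envelope arithmetic. The SPECint2000 rating assumed below (~930 SI2k for one 2.8 GHz PC) is our assumption, not a number from the slides; with it, 285 MSI2k hours indeed comes out at roughly 35 years.

```python
# Back-of-the-envelope check of the "2.8 GHz PC working 35 years" figure.
# ASSUMPTION: one 2.8 GHz PC of the era benchmarks at ~930 SI2k; the exact
# rating is not stated in the slides.
total_cpu_si2k_hours = 285e6   # 285 MSI2k hours (Phase I total CPU)
pc_rating_si2k = 930           # assumed rating of a single 2.8 GHz PC

hours = total_cpu_si2k_hours / pc_rating_si2k
years = hours / (24 * 365.25)
print(round(years, 1))  # -> 35.0
```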
Data challenges – summary
Probably the first time such a set of large scale grid productions has been done
Significant efforts invested on all sides – very fruitful collaborations
Unfortunately, the DCs were the first time the LCG-2 system had been used
Adaptations were essential – adapting experiment software to middleware and vice versa – as limitations/capabilities were exposed
Many problems were recognised and addressed during the challenges
Systematic confrontation of the functional problems with experiment requirements has recently been made (GAG)
Middleware is actually quite stable now
But – job efficiency is not high, for many reasons (see below)
Started to see some basic underlying issues:
of implementation (lack of error handling, scalability, etc.)
of underlying models (workload management)
perhaps also of fabric services – batch systems?
But – single largest issue is lack of stable operations
Problems during the data challenges
Common functional issues seen by all experiments:
Sites suffering from configuration and operational problems
inadequate resources at some sites (hardware, human…)
this is now the main source of failures
Load balancing between different sites is problematic
jobs can be "attracted" to sites that do not have adequate resources
modern batch systems are too complex and dynamic to summarise their behaviour in a few values in the IS
Identification of problems in LCG-2 is difficult
distributed environment, access to many log files needed…
status of monitoring tools
Handling thousands of jobs is time-consuming and tedious
Support for bulk operations is not adequate
Performance and scalability of services: storage (access and number of files), job submission, information system, file catalogues
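The load-balancing problem above can be sketched: a broker ranks candidate sites on the handful of values the IS publishes, so a site that briefly advertises many free CPUs "attracts" every broker's jobs at once. The GLUE attribute names are real; the sample numbers and the simple rank expression are hypothetical, for illustration only.

```python
# Sketch of broker-style ranking from a few Information System values.
# GlueCEStateFreeCPUs / GlueCEStateWaitingJobs are real GLUE 1.1 attributes;
# the hostnames, numbers, and rank expression are illustrative only.
sites = {
    "ce.big-t1.example":   {"GlueCEStateFreeCPUs": 4,   "GlueCEStateWaitingJobs": 2},
    "ce.small-t2.example": {"GlueCEStateFreeCPUs": 120, "GlueCEStateWaitingJobs": 0},
}

def rank(attrs):
    # A typical simple rank: prefer free CPUs, penalise waiting jobs.
    return attrs["GlueCEStateFreeCPUs"] - 10 * attrs["GlueCEStateWaitingJobs"]

best = max(sites, key=lambda ce: rank(sites[ce]))
print(best)  # -> ce.small-t2.example
```

Until the IS values refresh, every broker makes the same choice, so a small site can be flooded far beyond its real capacity.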
Configuration and stability problems
This is the largest source of problems
Many are "well-known" fabric problems
Batch systems that cause "black holes"
NFS problems
Clock skew at a site
Software not installed or configured correctly
Lack of configuration management – fixed problems reappear
Firewall issues – often less-than-optimal coordination between grid admins and firewall maintainers
Others are due to lack of experience
Many grid sites have not run services before, do not have procedures, tools, diagnostics
Not limited to small sites
Lack of support
Maintaining stable operation is still labour-intensive – requires adequate operations staff trained in grid management
Slow response – problems reported daily, but may last for weeks
No vacations… experiments expect 24x365 stable operation
The grid successfully integrates these problems from 80 sites
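One of the "well-known" fabric problems above, clock skew, is worth a sketch because it silently breaks grid security: certificate and proxy validation fails when a site's clock drifts too far from real time. The check below is illustrative only – the 5-minute tolerance and the reference clock are assumptions, not LCG configuration values.

```python
import datetime

# Sketch of a site sanity check for clock skew, one of the "well-known"
# fabric problems: proxy/certificate validation fails when a site's clock
# drifts.  The 5-minute tolerance is an assumed value for illustration.
MAX_SKEW = datetime.timedelta(minutes=5)

def check_skew(local_now, reference_now, max_skew=MAX_SKEW):
    # Returns (ok, skew): ok is False when the site clock is too far off.
    skew = abs(local_now - reference_now)
    return skew <= max_skew, skew

ok, skew = check_skew(
    datetime.datetime(2004, 9, 28, 12, 7, 0),   # site clock
    datetime.datetime(2004, 9, 28, 12, 0, 0),   # reference (e.g. NTP) time
)
print(ok)  # -> False: 7 minutes of skew exceeds the 5-minute tolerance
```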
Building a stable operation is the highest priority
EGEE is funded to operate and support a research grid infrastructure in Europe
The core infrastructure of the LCG and EGEE grids is now operated as a single service, growing out of the LCG service
LCG includes US and Asia-Pacific sites; EGEE includes other sciences
A substantial part of the infrastructure is common to both
The LCG Deployment Manager is the EGEE Operations Manager
The CERN team (Operations Management Centre) provides coordination, management, and 2nd-level support
Support activities are expanded with the provision of:
Core Infrastructure Centres (CIC) (4)
Regional Operations Centres (ROC) (9)
The ROCs are coordinated by Italy, outside of CERN (which has no ROC)
The Regional Operations Centres
The ROC organisation is the focus of EGEE operations activities:
Coordinate and support deployment
Coordinate and support operations
Coordinate Resource Centre management
• Negotiate application access to resources within region
• Coordinate planning and reporting within region
• Negotiate and monitor SLAs within the region
Teams:
• Deployment team
• 24-hour support team (answers user and RC problems)
• Operations training at RCs
• Organise tutorials for users
The ROC is the first point of contact for all:
New sites joining the grid, and support for them
New users, and user support
“Grid Operations Centres” – behaving as a single organisation
Operate infrastructure services, e.g.:
VO services: VO servers, VO registration service
RBs, UIs, information services
RLS and other database services
Ensure recovery procedures and fail-over (between CICs)
Act as Grid Operations Centres
Monitoring, proactive troubleshooting
Performance monitoring
Control sites' participation in the production service
Use work done at RAL for the LCG GOC as a starting point
Support to ROCs for operational problems
Operational configuration management and change control
Accounting and resource usage/availability monitoring
Take responsibility for operational “control” (tbd) – rotates through 4 CICs
All experiments expect to have significant ongoing productions for the foreseeable future
Some will also have their next data challenges 1 year from now
LCG will run a series of "service challenges"
Complementary to data challenges/ongoing productions
Demonstrate essential service-level issues (e.g. Tier 0 – Tier 1 reliable data transfer)
Essential that we are able to build a manageable production service
Based on existing infrastructure
Reasonable improvements
In parallel, build a "pre-production" service where:
New middleware (gLite, …) can be demonstrated and validated before being deployed in production
Understand the migration strategy to 2nd-generation middleware
Use the existing production service as the baseline comparison
It takes a long time to make new software production quality
Must be careful not to step backwards – even though what we have is far from perfect
What next? Service challenges
Proposed to be used in addition to ongoing data challenges and production use:
The goal is to ensure baseline services can be demonstrated
Demonstrate the resolution of the problems mentioned earlier
Demonstrate that operational and emergency procedures are in place
4 areas proposed:
Reliable data transfer
• Demonstrate fundamental service for Tier 0 – Tier 1 by end 2004
Job flooding/exerciser
• Understand the limitations and baseline performance of the system – may be achieved by the ongoing real productions
Incident response
• Ensure the procedures are in place and work – before real life tests them
Interoperability
• How can we bring together the different grid infrastructures?
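A reliable-transfer service of the kind the first challenge targets typically wraps each file transfer in retry-with-backoff logic. The sketch below illustrates only the pattern: the transfer callable, file names, and parameters are hypothetical stand-ins, not part of any LCG service.

```python
import time

# Illustrative retry-with-exponential-backoff pattern for a reliable
# Tier 0 -> Tier 1 transfer service.  do_transfer() stands in for a real
# transfer client (e.g. a GridFTP call) and is purely hypothetical.
def reliable_copy(do_transfer, src, dst, max_attempts=5, base_delay=1.0):
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            do_transfer(src, dst)
            return attempt          # succeeded on this attempt
        except OSError:
            if attempt == max_attempts:
                raise               # give up: hand the file to operators
            time.sleep(delay)       # back off before retrying
            delay *= 2              # exponential backoff

# Toy transfer that fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky(src, dst):
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transient network error")

print(reliable_copy(flaky, "castor:/t0/file", "t1:/tape/file", base_delay=0.01))  # -> 3
```

The key design point is that transient failures are absorbed automatically, while persistent ones surface to operators instead of silently stalling the challenge.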