Operating the LCG and EGEE Production Grid for HEP
Ian Bird, IT Department, CERN
LCG Deployment Area Manager & EGEE Operations Manager
CHEP'04, 28 September 2004
EGEE is a project funded by the European Union under contract IST-2003-508833
LCG Operations in 2004
Integrate a set of middleware and coordinate and support its deployment to the regional centres
Provide operational services to enable running as a production-quality service
Provide assistance to the experiments in integrating their software and deploying in LCG; Provide direct user support
Deployment Goals for LCG-2
Production service for the Data Challenges in 2004
Experience in close collaboration between the Regional Centres
Learn how to maintain and operate a global grid
Focus on building a production-quality service
Understand how LCG can be integrated into the sites' physics computing services
Set up the EGEE project and migrate the existing structure towards the EGEE structure
By design, the LCG and EGEE services and operations teams are the same
LCG – from certification to production
Some history:
March 2003: LCG-0 – existing middleware, waiting for the EDG-2 release
September 2003: LCG-1 – 3 months late -> reduced functionality; extensive certification process -> improved stability (RB, Information System); 32 sites integrated, ~300 CPUs; first use for production
December 2003: LCG-2 – full set of functionality for the DCs, first MSS integration; deployed in January to 8 core sites; DCs started in February -> testing in production; large sites integrate resources into LCG (MSS and farms); introduced a pre-production service for the experiments; alternative packaging (tool-based and generic installation guides)
May 2004 -> now: monthly incremental releases – not all releases are distributed to external sites; improved services, functionality, stability and packaging step by step; timely response to experiences from the data challenges
The formal certification process has been invaluable
The process to stabilise existing middleware and put it in production is expensive: testbeds, people, time
Now have monthly incremental releases
One central RMC and LRC for each VO, located at CERN, with an Oracle backend
Several bits from other WPs (Config objects, InfoProviders, Packaging…)
GLUE 1.1 (information schema) + a few essential LCG extensions
MDS-based Information System with significant LCG enhancements (replacements, simplifications (see poster))
Mechanism for application (experiment) software distribution
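To make the information flow concrete: the MDS-based Information System publishes GLUE 1.1 attributes as LDAP entries which consumers then parse. The sketch below is illustrative only – the attribute names (GlueCEUniqueID, GlueCEStateFreeCPUs) are genuine GLUE 1.1 names, but the LDIF fragment, hostname, and values are made up, and the parser is a toy, not the real information providers.

```python
# Minimal sketch of consuming GLUE 1.1 attributes as published by the
# MDS-based Information System.  The LDIF fragment is a made-up example;
# the attribute names are real GLUE 1.1, the values are invented.
ldif = """\
dn: GlueCEUniqueID=ce.example.org:2119/jobmanager-pbs-long,mds-vo-name=local,o=grid
GlueCEUniqueID: ce.example.org:2119/jobmanager-pbs-long
GlueCEStateFreeCPUs: 17
GlueCEStateWaitingJobs: 3
"""

def parse_ldif(text):
    # Collect attribute: value pairs, skipping the distinguished-name line.
    entry = {}
    for line in text.splitlines():
        if ": " in line and not line.startswith("dn:"):
            key, value = line.split(": ", 1)
            entry[key] = value
    return entry

ce = parse_ldif(ldif)
print(ce["GlueCEStateFreeCPUs"])  # -> 17
```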
Almost all components have gone through some re-engineering: robustness, scalability, efficiency, adaptation to local fabrics
The services are now quite stable, and performance and scalability have been significantly improved (within the limits of the current architecture). Data management is still far from perfect.
Experiences in deployment
LCG now covers many sites (>70) – both large and small
Large sites – existing infrastructures – need to add grid interfaces etc.
Small sites want a completely packaged, push-button, out-of-the-box installation (including batch system, etc.)
Satisfying both simultaneously is hard – requires very flexible packaging, installation, and configuration tools and procedures; a lot of effort had to be invested in this area
There are many problems – but in the end we are quite successful:
The system is reasonably stable
The system is used in production
The system is reasonably easy to install – now at ~80 sites
We now have a basis on which to incrementally build essential functionality, and from which to measure improvements
This infrastructure now also forms the EGEE production service
Operations services for LCG – 2004
Deployment and operational support: hierarchical model
CERN acts as 1st-level support for the Tier 1 centres
Tier 1 centres provide 1st-level support for associated Tier 2s
“Tier 1 sites” -> “Primary sites”
Grid Operations Centres (GOC)
Provide operational monitoring, troubleshooting, coordination of incident response, etc.
RAL (UK) led a sub-project to prototype a GOC
Operations support from the CERN team, the GOC, and Taipei, with many individual contributions on the mailing list
User support: central model
FZK provides the user support portal – a problem-tracking system, web-based and available to all LCG participants
Experiments provide triage of problems
The CERN team provides in-depth support and support for the integration of experiment software
Large-scale production effort of the LHC experiments:
test and validate the computing models
produce needed simulated data
test the experiments' production frameworks and software
test the provided grid middleware
test the services provided by LCG-2
All experiments used LCG-2 for all or part of their productions
Phase I: 120k Pb+Pb events produced in 56k jobs; 1.3 million files (26 TB) in Castor@CERN; total CPU: 285 MSI2k hours (a 2.8 GHz PC working 35 years); ~25% produced on LCG-2
Phase II (underway): 1 million jobs, 10 TB produced, 200 TB transferred, 500 MSI2k hours CPU; ~15% on LCG-2
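The "2.8 GHz PC working 35 years" figure can be sanity-checked with back-of-the-envelope arithmetic. The SPECint2000 rating assumed below (~930 SI2k for one 2.8 GHz PC) is our assumption, not a number from the slides; with it, 285 MSI2k hours indeed comes out at roughly 35 years.

```python
# Back-of-the-envelope check of the "2.8 GHz PC working 35 years" figure.
# ASSUMPTION: one 2.8 GHz PC of the era benchmarks at ~930 SI2k; the exact
# rating is not stated in the slides.
total_cpu_si2k_hours = 285e6   # 285 MSI2k hours (Phase I total CPU)
pc_rating_si2k = 930           # assumed rating of a single 2.8 GHz PC

hours = total_cpu_si2k_hours / pc_rating_si2k
years = hours / (24 * 365.25)
print(round(years, 1))  # -> 35.0
```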
Data challenges – summary
Probably the first time such a set of large scale grid productions has been done
Significant efforts invested on all sides – very fruitful collaborations
Unfortunately, the DCs were the first time the LCG-2 system had been used
Adaptations were essential – adapting experiment software to middleware and vice versa – as limitations/capabilities were exposed
Many problems were recognised and addressed during the challenges
Systematic confrontation of the functional problems with experiment requirements has recently been made (GAG)
Middleware is actually quite stable now
But – job efficiency is not high, for many reasons (see below)
Started to see some basic underlying issues:
of implementation (lack of error handling, scalability, etc.)
of underlying models (workload management)
perhaps also of fabric services – batch systems?
But – single largest issue is lack of stable operations
Problems during the data challenges
Common functional issues seen by all experiments:
Sites suffering from configuration and operational problems
inadequate resources at some sites (hardware, human…)
this is now the main source of failures
Load balancing between different sites is problematic
jobs can be "attracted" to sites that do not have adequate resources
modern batch systems are too complex and dynamic to summarise their behaviour in a few values in the IS
Identification of problems in LCG-2 is difficult
distributed environment, access to many log files needed…
status of monitoring tools
Handling thousands of jobs is time-consuming and tedious
Support for bulk operations is not adequate
Performance and scalability of services: storage (access and number of files), job submission, information system, file catalogues
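The load-balancing problem above can be sketched: a broker ranks candidate sites on the handful of values the IS publishes, so a site that briefly advertises many free CPUs "attracts" every broker's jobs at once. The GLUE attribute names are real; the sample numbers and the simple rank expression are hypothetical, for illustration only.

```python
# Sketch of broker-style ranking from a few Information System values.
# GlueCEStateFreeCPUs / GlueCEStateWaitingJobs are real GLUE 1.1 attributes;
# the hostnames, numbers, and rank expression are illustrative only.
sites = {
    "ce.big-t1.example":   {"GlueCEStateFreeCPUs": 4,   "GlueCEStateWaitingJobs": 2},
    "ce.small-t2.example": {"GlueCEStateFreeCPUs": 120, "GlueCEStateWaitingJobs": 0},
}

def rank(attrs):
    # A typical simple rank: prefer free CPUs, penalise waiting jobs.
    return attrs["GlueCEStateFreeCPUs"] - 10 * attrs["GlueCEStateWaitingJobs"]

best = max(sites, key=lambda ce: rank(sites[ce]))
print(best)  # -> ce.small-t2.example
```

Until the IS values refresh, every broker makes the same choice, so a small site can be flooded far beyond its real capacity.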
Configuration and stability problems
This is the largest source of problems
Many are "well-known" fabric problems
Batch systems that cause "black holes"
NFS problems
Clock skew at a site
Software not installed or configured correctly
Lack of configuration management – fixed problems reappear
Firewall issues – often less-than-optimal coordination between grid admins and firewall maintainers
Others are due to lack of experience
Many grid sites have not run services before, do not have procedures, tools, diagnostics
Not limited to small sites
Lack of support
Maintaining stable operation is still labour-intensive – requires adequate operations staff trained in grid management
Slow response – problems reported daily, but may last for weeks
No vacations… experiments expect 24x365 stable operation
The grid successfully integrates these problems from 80 sites
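One of the "well-known" fabric problems above, clock skew, is worth a sketch because it silently breaks grid security: certificate and proxy validation fails when a site's clock drifts too far from real time. The check below is illustrative only – the 5-minute tolerance and the reference clock are assumptions, not LCG configuration values.

```python
import datetime

# Sketch of a site sanity check for clock skew, one of the "well-known"
# fabric problems: proxy/certificate validation fails when a site's clock
# drifts.  The 5-minute tolerance is an assumed value for illustration.
MAX_SKEW = datetime.timedelta(minutes=5)

def check_skew(local_now, reference_now, max_skew=MAX_SKEW):
    # Returns (ok, skew): ok is False when the site clock is too far off.
    skew = abs(local_now - reference_now)
    return skew <= max_skew, skew

ok, skew = check_skew(
    datetime.datetime(2004, 9, 28, 12, 7, 0),   # site clock
    datetime.datetime(2004, 9, 28, 12, 0, 0),   # reference (e.g. NTP) time
)
print(ok)  # -> False: 7 minutes of skew exceeds the 5-minute tolerance
```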
Building a stable operation is the highest priority
EGEE is funded to operate and support a research grid infrastructure in Europe
The core infrastructure of the LCG and EGEE grids is now operated as a single service, growing out of the LCG service
LCG includes US and Asia-Pacific sites; EGEE includes other sciences
A substantial part of the infrastructure is common to both
The LCG Deployment Manager is the EGEE Operations Manager
The CERN team (Operations Management Centre) provides coordination, management, and 2nd-level support
Support activities are expanded with the provision of:
Core Infrastructure Centres (CIC) (4)
Regional Operations Centres (ROC) (9)
The ROCs are coordinated by Italy, outside of CERN (which has no ROC)
The Regional Operations Centres
The ROC organisation is the focus of EGEE operations activities:
Coordinate and support deployment
Coordinate and support operations
Coordinate Resource Centre management
• Negotiate application access to resources within region
• Coordinate planning and reporting within region
• Negotiate and monitor SLAs within the region
Teams:
• Deployment team
• 24-hour support team (answers user and RC problems)
• Operations training at RCs
• Organise tutorials for users
The ROC is the first point of contact for all:
New sites joining the grid, and support for them
New users, and user support
“Grid Operations Centres” – behaving as a single organisation
Operate infrastructure services, e.g.:
VO services: VO servers, VO registration service
RBs, UIs, information services
RLS and other database services
Ensure recovery procedures and fail-over (between CICs)
Act as Grid Operations Centres
Monitoring, proactive troubleshooting
Performance monitoring
Control sites' participation in the production service
Use work done at RAL for the LCG GOC as a starting point
Support to ROCs for operational problems
Operational configuration management and change control
Accounting and resource usage/availability monitoring
Take responsibility for operational “control” (tbd) – rotates through 4 CICs
All experiments expect to have significant ongoing productions for the foreseeable future
Some will also have their next data challenges 1 year from now
LCG will run a series of "service challenges"
Complementary to data challenges/ongoing productions
Demonstrate essential service-level issues (e.g. Tier 0 – Tier 1 reliable data transfer)
Essential that we are able to build a manageable production service
Based on existing infrastructure
Reasonable improvements
In parallel, build a "pre-production" service where:
New middleware (gLite, …) can be demonstrated and validated before being deployed in production
Understand the migration strategy to 2nd-generation middleware
Use the existing production service as the baseline comparison
It takes a long time to make new software production quality
Must be careful not to step backwards – even though what we have is far from perfect
What next? Service challenges
Proposed to be used in addition to ongoing data challenges and production use:
The goal is to ensure baseline services can be demonstrated
Demonstrate the resolution of the problems mentioned earlier
Demonstrate that operational and emergency procedures are in place
4 areas proposed:
Reliable data transfer
• Demonstrate fundamental service for Tier 0 – Tier 1 by end 2004
Job flooding/exerciser
• Understand the limitations and baseline performance of the system – may be achieved by the ongoing real productions
Incident response
• Ensure the procedures are in place and work – before real life tests them
Interoperability
• How can we bring together the different grid infrastructures?
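A reliable-transfer service of the kind the first challenge targets typically wraps each file transfer in retry-with-backoff logic. The sketch below illustrates only the pattern: the transfer callable, file names, and parameters are hypothetical stand-ins, not part of any LCG service.

```python
import time

# Illustrative retry-with-exponential-backoff pattern for a reliable
# Tier 0 -> Tier 1 transfer service.  do_transfer() stands in for a real
# transfer client (e.g. a GridFTP call) and is purely hypothetical.
def reliable_copy(do_transfer, src, dst, max_attempts=5, base_delay=1.0):
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            do_transfer(src, dst)
            return attempt          # succeeded on this attempt
        except OSError:
            if attempt == max_attempts:
                raise               # give up: hand the file to operators
            time.sleep(delay)       # back off before retrying
            delay *= 2              # exponential backoff

# Toy transfer that fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky(src, dst):
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transient network error")

print(reliable_copy(flaky, "castor:/t0/file", "t1:/tape/file", base_delay=0.01))  # -> 3
```

The key design point is that transient failures are absorbed automatically, while persistent ones surface to operators instead of silently stalling the challenge.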