Enabling Grids for E-sciencE (www.eu-egee.org)
Analysis of the ATLAS Rome Production Experience on the LCG Computing Grid
Simone Campana, CERN/INFN
EGEE User Forum, CERN (Switzerland), March 1st – 3rd, 2006
Data Challenge: validation of the Computing and Data Model and test of the complete software suite. Full simulation and reprocessing of data as if coming from the detector. Same software and computing infrastructure to be employed for data taking.
ATLAS ran two major Data Challenges: DC1 in 2002-2003 (with direct access to local resources + NorduGrid, see later) and DC2 in July - December 2004 (completely in a Grid environment).
Large scale production in January – June 2005, informally called the “Rome Production”: it provided data for physics studies for the ATLAS Rome Workshop in June 2005. Can be considered totally equivalent to a real Data Challenge: same methodology, large number of events produced.
Offered a unique opportunity to test improvements in the production framework, the Grid middleware and the reconstruction software.
ATLAS resources span three different grids: the LCG, NorduGrid and OSG.
In this talk I will present the “Rome production” experience on the LHC Computing Grid infrastructure
The Workload Management System is responsible for the management and monitoring of jobs.
The Logging and Bookkeeping service keeps the state information of a job and allows the user to query its status.
Each Computing Element is the front-end to a local batch system and manages a pool of Worker Nodes, where the job is eventually executed.
Limited User Credentials (Proxies) can be automatically renewed through a Proxy Service.
A set of services running on the Resource Broker machine match job requirements to the available resources, schedule the job for execution on an appropriate Computing Element, track the job status and allow users to retrieve their job output.
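The matchmaking step described on this slide can be illustrated with a minimal sketch. It is not the Resource Broker's actual algorithm: the classes, attribute names, host names and the "most free slots" ranking are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ComputingElement:
    """Hypothetical, simplified view of what a CE publishes."""
    name: str
    free_slots: int
    max_wallclock_min: int
    software_tags: set = field(default_factory=set)


@dataclass
class Job:
    """Hypothetical job description with its requirements."""
    job_id: str
    wallclock_min: int
    required_tags: set = field(default_factory=set)


def matches(job, ce):
    """A CE is eligible only if it satisfies every requirement of the job."""
    return (
        ce.free_slots > 0
        and ce.max_wallclock_min >= job.wallclock_min
        and job.required_tags <= ce.software_tags
    )


def select_ce(job, ces):
    """Return the eligible CE with the most free slots, or None.

    Ranking by free slots is an assumption; a real broker may rank differently.
    """
    eligible = [ce for ce in ces if matches(job, ce)]
    if not eligible:
        return None
    return max(eligible, key=lambda ce: ce.free_slots)


if __name__ == "__main__":
    ces = [
        ComputingElement("ce-a.example.org", 10, 1440, {"atlas-release-x"}),
        ComputingElement("ce-b.example.org", 50, 600, {"atlas-release-x"}),
        ComputingElement("ce-c.example.org", 80, 2880, set()),
    ]
    job = Job("job-001", wallclock_min=1200, required_tags={"atlas-release-x"})
    chosen = select_ce(job, ces)
    print(chosen.name if chosen else "no matching CE")
```

In this toy example only ce-a.example.org qualifies: ce-b offers too little wall-clock time and ce-c does not advertise the required software tag.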
The LCG Workload Management System is highly automated: it is designed to reduce human intervention to a minimum and consists of a complex set of services interacting with external components. This complexity caused a certain unreliability of the WMS during DC2.
The system became more robust before the Rome production, thanks to several bug fixes and optimizations in the WMS workflow.
The heterogeneous and dynamic nature of a Grid environment implies a certain level of unreliability. The ATLAS application was improved to cope with such unreliability.
The production team and the LCG operation and support teams gathered a lot of experience during DC2 and benefited from it at the time of the Rome Production.
NO Reliable File Transfer service was in place during the Rome production.
Data movement performed through LCG DM client tools.
LCG DM tools did not provide timeout and retry capabilities. WORKAROUND: a timeout and possible retry was implemented in Lexor at some point.
LCG DM tools did not always ensure consistency between files in the SE and entries in the catalog, for example if the catalog was down or unreachable or the operation was killed prematurely. WORKAROUND: manual cleanup was needed.
Data access on mass storage systems was very problematic.
Data need to be moved (staged) from tape to disk before being accessed. The middleware could not ensure the existence/persistency of data on disk. WORKAROUND: manual pre-staging of files was carried out by the production team.
The ATLAS strategy for file distribution must be (re)thought.
Output files were chaotically spread around 143 different Storage Elements. A replication scheme for frequently accessed files was not in place. This complicates analysis of the reconstructed samples and the production itself.
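The timeout-and-retry workaround mentioned on this slide could look roughly like the sketch below. This is not the Lexor implementation; the function name, default values and linear backoff are assumptions, and the Python interpreter stands in for a data-movement command.

```python
import subprocess
import sys
import time


def run_with_retry(cmd, timeout_s=600, max_attempts=3, backoff_s=30,
                   sleep=time.sleep):
    """Run cmd, kill it after timeout_s, and retry up to max_attempts times.

    Returns True on the first successful attempt, False if all attempts fail.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = subprocess.run(cmd, timeout=timeout_s)
            if result.returncode == 0:
                return True
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising this exception.
            pass
        if attempt < max_attempts:
            sleep(backoff_s * attempt)  # linear backoff between attempts
    return False


if __name__ == "__main__":
    # Stand-in for a transfer command that hangs: sleeps longer than the timeout.
    hanging = [sys.executable, "-c", "import time; time.sleep(5)"]
    ok = run_with_retry(hanging, timeout_s=0.5, max_attempts=2, backoff_s=0)
    print("transfer succeeded" if ok else "transfer failed after retries")
```

A wrapper of this kind only bounds how long a hanging transfer can block a job. It does not by itself clean up partial files or catalog entries left behind by a killed operation.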
Timeout and Retry capabilities have been introduced natively in the LCG DM tools, which were also improved to guarantee atomic operations.
A new catalog, the LCG File Catalog, has been developed: more stable, easier problem tracking, better performance and reliability.
The Storage Resource Manager interface has been introduced as a front-end to every SE. Agreed on between experiments and middleware developers, it standardizes storage access and management and offers more functionality for MSS access.
A reliable File Transfer Service has been developed within the EGEE project. It is a SERVICE: it allows files to be replicated between SEs in a reliable way, is built on top of GridFTP and SRM, and is capable of dealing with data transfers from/to MSS.
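To make "atomic operations" concrete, here is a minimal sketch of a copy-and-register step that rolls back the copy when catalog registration fails. The in-memory storage and catalog classes are toy stand-ins invented for this example; they do not reproduce the LCG DM tools or the LFC API.

```python
class CatalogUnavailable(Exception):
    """Raised by the toy catalog when it cannot be reached."""


class ToyStorage:
    """In-memory stand-in for a Storage Element."""

    def __init__(self):
        self.files = {}

    def put(self, surl, data):
        self.files[surl] = data

    def delete(self, surl):
        self.files.pop(surl, None)


class ToyCatalog:
    """In-memory stand-in for a file catalog (logical name -> replicas)."""

    def __init__(self, available=True):
        self.available = available
        self.entries = {}

    def register(self, lfn, surl):
        if not self.available:
            raise CatalogUnavailable(lfn)
        self.entries.setdefault(lfn, []).append(surl)


def copy_and_register(storage, catalog, lfn, surl, data):
    """Store the file, then register it; undo the copy if registration fails."""
    storage.put(surl, data)
    try:
        catalog.register(lfn, surl)
    except CatalogUnavailable:
        storage.delete(surl)  # roll back so no orphan file is left on the SE
        return False
    return True


if __name__ == "__main__":
    storage, catalog = ToyStorage(), ToyCatalog(available=False)
    ok = copy_and_register(storage, catalog, "lfn:/demo/file", "srm://se/demo", b"x")
    print(ok, storage.files, catalog.entries)
```

The rollback covers the "catalog down or unreachable" case. A process killed between the two steps would still leave an orphan file, which is one reason a dedicated service is better suited to this than client-side tools.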
The performance of the WMS for job submission and handling is generally acceptable in normal conditions … but degrades under stress. WORKAROUND: several RBs dedicated to ATLAS and with different hardware solutions have been deployed.
The EGEE project will provide an enhanced WMS: possibility of bulk submission, bulk matching and bulk queries; improved communication with Computing Elements at sites.
Possible improvement of job submission speed and job dispatching.
Some preliminary tests show promising results. Still, several issues must be clarified.
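One way to see why bulk submission may help: grouping jobs into batches reduces the number of client-to-broker submission requests. The helper below is purely illustrative; the function name and the batch size are assumptions, and it says nothing about how the enhanced WMS handles a batch internally.

```python
def make_batches(job_ids, batch_size):
    """Split a list of job identifiers into consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [job_ids[i:i + batch_size]
            for i in range(0, len(job_ids), batch_size)]


if __name__ == "__main__":
    jobs = [f"job-{n:04d}" for n in range(1000)]
    batches = make_batches(jobs, 100)
    print(f"{len(jobs)} single submissions vs {len(batches)} bulk submissions")
```

With these illustrative numbers, 1000 individual submission requests become 10 bulk requests.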
Lack of VO-specific information about jobs at the sites. GridICE sensors were deployed at every site, but not correctly configured everywhere: partial information, difficult to interpret.
Queries to the ATLAS Production Database could cause an excessive load.
The error diagnostics should be improved: currently performed by parsing executor log files and querying the DB, they should be formalized in proper tools.
Real-time job output inspection would have been helpful, especially to investigate causes of hanging jobs.
An ATLAS team is building a global job monitoring system, based on the current tools and possibly integrating new components (R-GMA etc.).
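As an illustration of what "formalizing" log-based error diagnostics might mean, the sketch below classifies error lines into categories and counts them. The log format, the categories and the patterns are all hypothetical; real executor logs would need their own rules.

```python
import re
from collections import Counter

# Hypothetical patterns, checked in order; the first match wins.
ERROR_PATTERNS = [
    ("stage-in", re.compile(r"stage-?in|input file", re.IGNORECASE)),
    ("stage-out", re.compile(r"stage-?out|output file", re.IGNORECASE)),
    ("proxy", re.compile(r"proxy|credential", re.IGNORECASE)),
    ("timeout", re.compile(r"timed? ?out", re.IGNORECASE)),
]


def classify(line):
    """Return the first matching error category, or 'unclassified'."""
    for category, pattern in ERROR_PATTERNS:
        if pattern.search(line):
            return category
    return "unclassified"


def summarize(lines):
    """Count error categories over the lines that contain 'ERROR'."""
    return Counter(classify(line) for line in lines if "ERROR" in line)


if __name__ == "__main__":
    sample_log = [
        "INFO job started",
        "ERROR stage-in failed: cannot copy input file",
        "ERROR proxy expired",
        "ERROR transfer timed out",
        "ERROR segmentation fault",
    ]
    for category, count in sorted(summarize(sample_log).items()):
        print(category, count)
```

Keeping the rules in a single table makes it easy to see which failures remain unclassified and to add patterns as new failure modes are understood.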
The Rome Production on the LCG infrastructure has been an overall successful exercise: it exercised the ATLAS production system, contributed to the testing of the ATLAS Computing and Data model, stress-tested the LCG infrastructure … and produced a lot of simulated data for the physicists!
This must be seen as the consequence of several improvements: in the Grid middleware, in the ATLAS components and in the LCG operations.
Still, several components need improvements, both in terms of reliability and performance. Production still requires a lot of human attention.
Issues have been reported to the relevant parties and a lot of work has been done since the Rome Production. Preliminary tests show promising improvements, which will be evaluated fully in Service Challenge 4 (April 2006).